The most expensive mistakes in A/B testing aren't math errors. They're misread output. The test ran fine, the numbers are correct, and a team still ships a losing variant — or kills a winner — because nobody on the call agreed on what "p < 0.05, +3% lift" actually meant.
This post walks through how to read an A/B test result the way an experienced analyst does: number by number, in order, with a clear idea of what each one is allowed to tell you and what it isn't. By the end you should be able to pick up any standard test readout — ours, your in-house platform's, your vendor's — and reason about it without bluffing.
If you want to follow along with your own data, paste it into the analyzer and read the result alongside this guide.
The five numbers that matter — and the order to read them in
Every two-arm conversion-rate readout boils down to the same handful of numbers. The trick is reading them in the right order. Most people glance at the p-value first, decide the test is "significant" or "not significant," and then back-rationalize the rest. That's exactly backwards.
Here's the order an analyst should actually look at:
| Order | Number | Question it answers |
|---|---|---|
| 1 | Lift (point estimate) | How big does the effect look? |
| 2 | Confidence interval | What other effect sizes are still compatible with this data? |
| 3 | P-value | How surprising is this under the null hypothesis? |
| 4 | Power / sample size | Was the test even capable of detecting the lift I care about? |
| 5 | Effect size (Cohen's h) | Is this big in a way that's comparable across tests? |
Read in that order, the p-value stops being a verdict and becomes one piece of evidence among several. Read in the other order — p-value first — and you'll routinely ship false positives and kill true winners.
The rest of this post is a tour through each of those five, plus the red flags that tell you the result is less informative than it looks.
1. Lift: the headline, not the verdict
Lift is your point estimate of the treatment effect. Usually expressed as a relative percentage — "the variant converted 3.2% better than control" — sometimes as an absolute difference in percentage points. Always check which one the readout is showing you, because the two differ by a factor of the baseline rate.
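- Relative lift: $(p_{\text{variant}} - p_{\text{control}}) / p_{\text{control}}$, where each $p$ is that arm's conversion rate
- Absolute lift: $p_{\text{variant}} - p_{\text{control}}$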
A 0.2 percentage-point lift on a 4% baseline is a 5% relative lift — same data, two very different-sounding numbers in a stakeholder review.
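To make the distinction concrete, here is a minimal sketch of both calculations. The function names are illustrative, not part of any particular tool.

```python
def absolute_lift(control_rate: float, variant_rate: float) -> float:
    """Difference in conversion rates as a fraction (0.002 means 0.2 percentage points)."""
    return variant_rate - control_rate


def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Absolute lift divided by the control (baseline) rate."""
    if control_rate <= 0:
        raise ValueError("control_rate must be positive")
    return (variant_rate - control_rate) / control_rate


# The example from the text: a 4% baseline moving to 4.2%.
control, variant = 0.040, 0.042
print(f"absolute lift: {absolute_lift(control, variant) * 100:.1f} percentage points")
print(f"relative lift: {relative_lift(control, variant) * 100:.1f}%")
```

Same two rates in, two differently scaled numbers out, which is why the readout has to say which one it is showing.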
Lift is the most intuitive number on the page, which is exactly why it's dangerous. It's a single sample from a noisy distribution. With small samples, observed lift bounces around violently; with large samples, it stabilizes. The lift number alone tells you nothing about which regime you're in. That's what the next number is for.
2. Confidence interval: the range you should actually report
If you only learn one new habit from this post, make it this: read the confidence interval before the p-value, and never quote the lift without it.
A 95% confidence interval is the range of true effect sizes that are compatible with the data you observed, under the model's assumptions. If your readout says:
Lift: +3.2%, 95% CI: [+0.4%, +6.0%]
…then the data are consistent with the true lift being as small as 0.4% or as large as 6.0%. The 3.2% is just the midpoint of that range. Anyone who quotes "+3.2% lift" without the interval is hiding the uncertainty that should drive the decision.
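If you want to see where an interval like this comes from, the sketch below computes a 95% CI on relative lift from raw counts. It assumes a normal approximation on the log of the rate ratio (the delta method); your platform may use a different interval method, so treat the last digit as method-dependent. The function name and the example counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_lift_ci(conv_c, n_c, conv_v, n_v, confidence=0.95):
    """Return (lift, lower, upper) for relative lift, all as fractions.

    Uses a normal approximation on log(p_variant / p_control).
    Both arms need at least one conversion.
    """
    p_c, p_v = conv_c / n_c, conv_v / n_v
    log_ratio = log(p_v / p_c)
    # Delta-method standard error of the log rate ratio.
    se = sqrt((1 - p_c) / conv_c + (1 - p_v) / conv_v)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    lower = exp(log_ratio - z * se) - 1
    upper = exp(log_ratio + z * se) - 1
    return p_v / p_c - 1, lower, upper


# Hypothetical counts: 40,000 users per arm, 5.00% vs 5.25% conversion.
lift, lo, hi = relative_lift_ci(2000, 40000, 2100, 40000)
print(f"lift {lift:+.1%}, 95% CI [{lo:+.1%}, {hi:+.1%}]")
```

Because the interval is built on the log scale, it is slightly asymmetric around the point estimate. That is expected for a ratio.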
A few shapes of CI you'll see and what they mean in practice:
| 95% CI | Lift | What it tells you |
|---|---|---|
| [+0.4%, +6.0%] | +3.2% | Probably positive, but anywhere from "barely matters" to "great." Underpowered. |
| [+2.8%, +3.6%] | +3.2% | Narrow and clearly positive. High-confidence win. |
| [-1.2%, +4.8%] | +1.8% | Crosses zero. "Not significant", and you didn't learn much. |
| [-0.2%, +0.3%] | +0.05% | Crosses zero, but the interval is tight. You learned the effect is small. |
| [-15%, +20%] | +2.5% | Wildly wide. The test is essentially uninformative. |
Two non-obvious points worth internalizing:
- A CI that crosses zero carries the same verdict as p > 0.05. They're two views of the same arithmetic. If someone tells you "the p-value isn't significant but the lift is +4%," you can immediately ask, "what's the lower bound of the CI?" — it will be below zero.
- A narrow CI around zero is a stronger result than a wide CI around a big lift. The first one closes the question — the effect, if any, is tiny. The second one closes nothing — you might have a huge winner or a huge loser. Teams routinely celebrate the second and shrug at the first; it should be the opposite.
The width of the interval, not the position of the midpoint, is what tells you whether the test was actually informative.
3. P-value: what it does and (mostly) doesn't mean
The p-value is the most-quoted, most-misunderstood number in applied statistics. Here's the careful one-line definition:
The p-value is the probability of observing a test statistic at least as extreme as the one you got, assuming the null hypothesis is true.
That's it. It is a statement about how surprising your data are under the assumption that there is no effect. It is not a statement about the probability that the variant works.
Things the p-value is not, despite what everyone in the standup will tell you:
- Not the probability the variant is better than control. That's a Bayesian posterior, which requires a prior, which a frequentist p-value doesn't have.
- Not the probability the result is "real" or "a fluke." Real / fluke isn't a property of a single result; it's a long-run frequency property of the procedure.
- Not a measure of effect size. A p-value of 0.001 with a tiny n means the same thing about the evidence as a p-value of 0.001 with a huge n, but the effect sizes behind those two results could differ by 100x.
- Not "1 - confidence." A p-value of 0.04 does not mean "96% confidence the variant works."
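- Not comparable or combinable by eye across tests. Two tests at p = 0.06 are not interchangeable with one test at p = 0.03; combining evidence from separate tests takes a formal method such as a pooled analysis or meta-analysis.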
The most common operational misuse — bigger than any of the above — is treating the 0.05 cutoff as a switch. The difference between p = 0.049 and p = 0.051 is, statistically, nothing. The difference between p = 0.04 and p = 0.001 is huge. Read p-values as a continuous measure of evidence against the null, not as a binary verdict, and most "but it's not significant!" arguments dissolve.
A useful sanity check: in a well-run test, the p-value and the confidence interval should agree. If the 95% CI excludes zero, the p-value is below 0.05. If the 95% CI excludes zero by a lot, the p-value is far below 0.05. When the two seem to disagree, you're misreading one of them.
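For reference, this is roughly what a two-sided p-value for a conversion-rate test looks like in code. It is a sketch of a pooled two-proportion z-test, which is a common choice but an assumption here; a given platform may use a different test. The function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_c, n_c, conv_v, n_v):
    """Return (z, two_sided_p) for the difference in conversion rates."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    # Under the null both arms share one rate, so pool the counts.
    pooled = (conv_c + conv_v) / (n_c + n_v)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    # Probability of a statistic at least this extreme in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 40,000 users per arm, 5.00% vs 5.25% conversion.
z, p = two_proportion_z_test(2000, 40000, 2100, 40000)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

Note that nothing in the function knows about 0.05. The cutoff is a convention applied afterwards, not part of the calculation.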
4. Power: was the test even capable of finding what you wanted?
Power is the probability that your test, if a real lift of a given size exists, will detect it as statistically significant. The standard target is 0.80, which means accepting a 20% chance of missing a real win of that size.
There are two ways to think about power, and only one of them is honest:
- A priori power (correct): "Before we ran the test, given our planned sample size, the test had an 82% chance of detecting a 5% relative lift at α=0.05." This is the number a sample size calculator produces.
- Post-hoc / observed power (almost always wrong): "Given the lift we observed, the test had a 31% chance of detecting it." This is mathematically circular — it just restates the p-value in a more confusing way — and you should not let it into your decision-making. (Hoenig & Heisey, 2001 is the classic takedown.)
The reason to look at power fourth, not first, is that it's a planning number, not a reading number. You should already have computed it before launching the test, using the planner. At read-time you're sanity-checking: did we actually hit the n we sized for, and was the n we sized for matched to a defensible MDE?
A null result on an underpowered test is almost always uninformative. If the test was sized to detect a 10% lift with 80% power and you saw a flat result, you have not "shown the variant doesn't work" — you've shown the true lift is probably smaller than 10%, which is a much weaker claim. See why your A/B test is not significant for the full diagnostic tree.
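A priori power is a small calculation once you fix the baseline, the MDE, and the planned n. The sketch below uses the usual normal approximation for two proportions with a two-sided test; it is an assumption about the method, not a description of any specific planner, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def a_priori_power(baseline, relative_mde, n_per_arm, alpha=0.05):
    """Approximate power to detect a relative lift of `relative_mde`.

    Normal approximation, two-sided test, equal arms. The negligible
    chance of rejecting in the wrong direction is ignored.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)


# Same planned n, two different MDEs on a 5% baseline.
for mde in (0.10, 0.05):
    power = a_priori_power(baseline=0.05, relative_mde=mde, n_per_arm=30000)
    print(f"MDE {mde:.0%}: power = {power:.2f}")
```

The inputs are all things you know before launch. The observed lift never appears, which is the difference between this and post-hoc power.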
5. Effect size: making results comparable
Lift in percent is meaningful inside a single test, where you compare variant to control on the same metric, but it travels badly. A 5% relative lift on a 1% baseline is a totally different physical effect than a 5% relative lift on a 40% baseline. Cohen's h is a standardized effect size for proportions that lets you compare across tests, surfaces, and metrics:

$$h = 2\arcsin\sqrt{p_{\text{variant}}} - 2\arcsin\sqrt{p_{\text{control}}}$$
Rough interpretation (Cohen's own conventions, which you should treat as loose buckets, not hard cutoffs):
| Cohen's h | Effect size | Typical CRO interpretation |
|---|---|---|
| 0.20 | small | Realistic win for a button copy / layout tweak. |
| 0.50 | medium | New page / new flow territory. |
| 0.80 | large | Rare; usually means you broke or fixed something. |
You don't need to compute Cohen's h by hand for every test — the analyzer shows it for you — but glancing at it answers a question lift can't: "Is this effect big in a way I should expect to replicate, or is it a small effect that just happened to hit significance because n is huge?"
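If you do want to compute it yourself, Cohen's h is the difference of arcsine-transformed proportions. A minimal sketch, with an illustrative function name, applied to the two baselines mentioned above:

```python
from math import asin, sqrt


def cohens_h(p_variant: float, p_control: float) -> float:
    """Cohen's h: difference between arcsine-square-root transformed rates."""
    return 2 * asin(sqrt(p_variant)) - 2 * asin(sqrt(p_control))


# The same 5% relative lift on two very different baselines.
low_baseline = cohens_h(0.0105, 0.0100)   # 1% baseline
high_baseline = cohens_h(0.4200, 0.4000)  # 40% baseline
print(f"1% baseline:  h = {low_baseline:.4f}")
print(f"40% baseline: h = {high_baseline:.4f}")
```

The two values differ by close to an order of magnitude even though the relative lift is identical, which is the sense in which percent lift "travels badly."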
Statistically significant ≠ business significant
This is where most "wins" go to die. A test can be statistically significant and not worth shipping. It can also be statistically non-significant and still worth shipping — if the cost of the variant is near-zero and the CI puts the most likely effect in the right direction.
Two cases worth memorizing:
- The expensive non-win. Your MDE was 5%. The test came back with lift = +0.6%, CI = [+0.3%, +0.9%], p < 0.001. You "won." You also won something far below the threshold where shipping pays for itself. Significance is real; business significance is zero. The honest write-up is: "the variant produces a tiny but reliable lift, below the bar we set for shipping."
- The cheap winner that wasn't significant. Lift = +3%, CI = [-0.5%, +6.5%]. Not significant. But the variant is one line of CSS, has no downside risk, and the CI is mostly positive. Shipping it is not a statistical claim — it's a decision under uncertainty with cheap downside. This is a perfectly reasonable call to make, as long as you don't then go around telling the company you "won" the test.
The discipline is to separate the statistical question ("what does the data say about the effect?") from the decision question ("given what the data says and what shipping costs, what do we do?"). The analyzer gives you the first. The second is on you.
A worked example, read top to bottom
Here's a readout in the shape AB SHARK produces it. Imagine you ran a checkout-button test for two weeks.
| Variant | Users | Conversions | Rate |
|---|---|---|---|
| Control | 28,400 | 1,392 | 4.901% |
| Variant | 28,510 | 1,486 | 5.212% |
And the summary stats:
- Lift (relative): +6.34%
- 95% CI on lift: [-1.0%, +14.2%]
- P-value (two-sided): 0.091
- Z-statistic: 1.69
- A priori power for MDE = 10%: 0.83
- Cohen's h: 0.014
How an analyst reads this, in order:
- Lift +6.34%: interesting size, plausible for a button-copy test. Not so big it's suspicious.
- CI [-1.0%, +14.2%]: mostly positive, but it crosses zero. The data are compatible with anything from a slightly negative effect to a lift above 14%, and a good part of that range sits below most teams' ship threshold.
- p = 0.091: not significant at α = 0.05, though not far off. Consistent with the CI's lower bound sitting just below zero.
- Power was 0.83 for the 10% MDE we planned for. We sized this test honestly; a flat result would have been informative. The observed lift came in below the MDE, which means the CI is naturally going to be wide relative to the effect. That's why it's straddling zero.
- Cohen's h = 0.014: tiny standardized effect, and not yet distinguishable from zero.
Decision: don't call it a win. The readout leans positive but is inconclusive. If the change is cheap and low-risk, shipping is a defensible decision under uncertainty, as long as you tell stakeholders the realistic expectation is closer to +1-3% than +6%. If shipping costs anything significant, run a confirmatory follow-up sized for a 3% MDE before scaling.
Notice what we didn't do: we didn't see "+6% lift" and announce it to the company. The CI did the real work.
Required n scales viciously with the lift you want to detect
If you take only one chart from this post into your next planning meeting, make it this one. Required users per variant at a 5% baseline, α = 0.05, power = 0.80:
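Numbers of this kind can be approximated with the standard normal-approximation formula for comparing two proportions. The sketch below is one common version of that formula, offered as an assumption rather than as the exact method behind any particular calculator; the function name is illustrative.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users per variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


# 5% baseline, alpha = 0.05, power = 0.80, shrinking relative MDE.
print(f"{'MDE':>5}  {'users per variant':>18}")
for mde in (0.20, 0.10, 0.05, 0.02, 0.01):
    n = required_n_per_arm(baseline=0.05, relative_mde=mde)
    print(f"{mde:>5.0%}  {n:>18,}")
```

The squared difference in the denominator is what makes the scaling vicious: halving the MDE roughly quadruples the required n.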
The reason this matters when reading results: the lift you observe is almost always smaller than the MDE you sized for. The CI width is set by the sample size you actually collected. A test sized for 80% power at α = 0.05 has an MDE of about 2.8 standard errors, while a 95% CI spans about ±2 standard errors, so the half-width of your CI is roughly 70% of your sized-for MDE. If you sized for a 10% MDE and observed a 4% lift, expect a CI of about ±7 points of relative lift around it, wide enough that it crosses zero. That's not a bug; that's the test telling you it can't resolve effects that small with the n you collected. See the MDE deep-dive for how to set this honestly.
Red flags: when the readout is less informative than it looks
A few patterns should make you slow down before declaring a result, no matter what the p-value says.
- Huge lift, wide CI, small n. "+18% lift, CI [-2%, +38%]" is not a win; it's noise. Big swings on small samples are the default, not a signal. Wait for more data.
- p just under 0.05 with surprisingly small n. Either you got lucky, or you stopped the moment the test crossed the line — i.e., you peeked. The fix is to commit to a sample size in advance and not look at significance until you hit it (or use a sequential test that's designed for peeking).
- Power well below 0.80 on a null result. You did not learn the variant doesn't work. You learned your test wasn't sensitive enough to tell. The honest writeup is "inconclusive," not "no effect."
- Sample sizes wildly mismatched between arms. A 60/40 imbalance for a test you launched 50/50 means assignment or logging is broken. This is a sample-ratio mismatch (SRM), and it invalidates the test: fix the pipeline before believing any number on the page.
- Multiple metrics, one is significant. If you tested 20 metrics at α = 0.05, you should expect about one false positive on average even when the variant does nothing. Pre-register the primary metric, or apply a multiple-comparisons correction (Bonferroni, BH-FDR) before celebrating.
- The lift moved a lot between the day-7 check and the day-14 check. Either there's a novelty effect (treatment looks great early, fades) or the early reads were noise. Honest read-out happens once, at the planned end of the test.
If you spot any of these, the right move is rarely "ship it anyway." It's "go look at the raw data" or "run it again."
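Two of these red flags are easy to check mechanically. The sketch below shows a chi-square test for sample-ratio mismatch and the Bonferroni and Benjamini-Hochberg corrections for multiple metrics. All function names and the example p-values are hypothetical, and the SRM alarm threshold is a team convention (many teams use something much stricter than 0.05), not something this code decides for you.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-square (1 df) p-value for observed vs. planned traffic split."""
    total = n_control + n_variant
    exp_c = total * expected_control_share
    exp_v = total - exp_c
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_variant - exp_v) ** 2 / exp_v
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def bonferroni(p_values, alpha=0.05):
    """Flag metrics that stay significant after dividing alpha by the count."""
    return [p <= alpha / len(p_values) for p in p_values]


def benjamini_hochberg(p_values, fdr=0.05):
    """Flag metrics that pass the BH step-up procedure at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank  # keep the largest rank that passes
    flags = [False] * m
    for rank, i in enumerate(order, start=1):
        flags[i] = rank <= cutoff_rank
    return flags


print(f"SRM p-value, 6,000 vs 4,000 users: {srm_p_value(6000, 4000):.3g}")
metric_p_values = [0.001, 0.02, 0.04, 0.30]  # hypothetical, one per metric
print("Bonferroni:", bonferroni(metric_p_values))
print("BH-FDR:    ", benjamini_hochberg(metric_p_values))
```

A very small SRM p-value means the split itself is broken, so none of the other numbers on the page should be trusted until the assignment or logging issue is found.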
The one-page checklist
Pin this somewhere. When a result lands, walk down the list:
- What's the lift, and is it relative or absolute?
- What's the 95% CI? Does it cross zero? Is it wide or narrow?
- Where's the p-value, and does it agree with the CI?
- Was the test powered for the MDE we actually care about? Did we hit the planned n?
- What's the effect size in standardized terms (Cohen's h)?
- Is the lift large enough to matter for the business, separately from whether it's statistically significant?
- Any red flags? Peeking, SRM, novelty, multiple comparisons, wildly-wide CIs.
If you can answer those seven questions about a test, you can read the result honestly — and you can defend the decision in a room full of skeptical engineers and PMs.
Read it once, read it right
The five numbers above aren't five different verdicts. They're five views of the same underlying question — what did this test actually learn about the treatment effect? — and reading them in order, with the CI doing the heavy lifting and the p-value playing a supporting role, is how you stop shipping false positives and stop killing real winners.
Paste your numbers into the analyzer and read the result with this checklist next to you. If the test isn't there yet, plan it first — sizing the test honestly is what makes the read-out honest later.
Further reading:
- A/B test sample size calculator — how to size a test that can actually answer your question.
- Why your A/B test is not significant — the diagnostic guide for flat results.
- Minimum Detectable Effect, explained — the single most-mis-set knob in test design.
- CUPED variance reduction — how to shrink your CI without collecting more users.