Why Your A/B Test Is Not Significant: A Diagnostic Checklist Before You Give Up

"Not significant" is the most-misread result in applied statistics. Half the time the team shrugs and ships the loser. The other half they keep the test running for three more weeks "just to see," peek every morning, and eventually declare a false-positive win. Both reactions skip the only question that matters: which kind of "not significant" is this?

There are at least four, and they have four different fixes:

The test never had a chance (underpowered).
The test had a chance, took it, and the answer is the effect is small.
The test was fine but your metric is too noisy to see the effect.
The test was fine but you peeked, and the readout you're looking at is a random walk that didn't happen to cross the line on the day you checked.

Each of those leads to a different decision: keep running, replan, call it null, redesign the metric, or stop peeking. The p-value alone tells you none of them. This post is the diagnostic flowchart.

If you want to follow along with a specific test, paste your numbers into the analyzer — observed power, CI width, and effect size all come for free, and they're what the rest of this checklist is built on.

What "not significant" actually means

Before the diagnostic, one definition to get out of the way. A p-value above your α threshold (usually 0.05) does not mean "the variant doesn't work." It means:

Under the assumption that there is no effect, data at least this extreme would happen more than 5% of the time.

That's a statement about how surprising your data are given no effect. It is not a statement about whether an effect exists. The technical name for the situation is failure to reject the null — and "failure to reject" is genuinely different from "accept." See how to read A/B test results for the longer version of this distinction.
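To make the definition concrete, here is a minimal sketch of where that number comes from for a conversion-rate test. It assumes a pooled two-proportion z-test with a normal approximation; the function name and the counts are made up for illustration, and your analyzer may use a different test.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants share one conversion rate."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under H0 the best estimate of the shared rate pools both arms.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a |z| at least this large if there is truly no effect.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical readout: 5.0% vs 5.4% conversion on 10,000 users per arm.
p = two_proportion_p_value(500, 10_000, 540, 10_000)
print(f"p = {p:.3f}")
```

Here the p-value comes out at roughly 0.2: a gap this size would show up about one time in five with no real effect at all, which says nothing yet about whether the 0.4-point difference is real.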

The whole point of the diagnostic below is to figure out why you failed to reject — because the action you take next depends entirely on the reason.

Step 1 — Was the test even powered?

This is the first question and most teams skip it. Power is the probability that your test, if a real lift of a given size exists, will flag it as significant. The industry default is 80%.

If your test was sized for a 10% MDE at 80% power and the true lift is 3%, its chance of catching that lift is only about 13%. A null result from that test tells you essentially nothing, because the test was never designed to detect lifts that small.
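The drop in power can be computed directly. This is a sketch under a normal approximation, assuming the standard error is about the same at both lift sizes; `power_at_true_lift` is a hypothetical helper, not part of any planner.

```python
from statistics import NormalDist

_norm = NormalDist()


def power_at_true_lift(true_lift: float, planned_mde: float,
                       alpha: float = 0.05, planned_power: float = 0.80) -> float:
    """Power of a two-sided z-test when the true lift differs from the MDE it was sized for."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(planned_power)
    # A test sized for planned_mde puts that lift (z_alpha + z_beta) standard
    # errors away from zero; a smaller true lift sits proportionally closer.
    shift = (true_lift / planned_mde) * (z_alpha + z_beta)
    # Chance the z-statistic lands beyond either critical value.
    return _norm.cdf(shift - z_alpha) + _norm.cdf(-shift - z_alpha)


print(f"{power_at_true_lift(0.10, 0.10):.2f}")  # the lift the test was sized for
print(f"{power_at_true_lift(0.03, 0.10):.2f}")  # a more realistic lift
```

Under these assumptions the first figure is 0.80 by construction and the second comes out near 0.13. Only the ratio of true lift to planned MDE matters, so the same function works for any baseline.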

Two ways to check:

A priori power — the number you (should have) computed before launching, using the planned sample size and MDE. If you sized via the planner, it's the value you targeted.
Sensitivity check — given the n you actually collected, what's the smallest lift you had ≥80% power to detect? If that number is bigger than the lift you think is realistic, the test was never going to find the effect.
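The sensitivity check is a one-line inversion of the usual sample size formula. The sketch below assumes a two-sided test on a binary conversion metric, equal-sized arms, and the textbook equal-variance normal approximation; a planner built on different assumptions will report different numbers, so treat the output as a rough cross-check. The function name and inputs are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def detectable_relative_lift(n_per_variant: int, baseline_rate: float,
                             alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest relative lift a two-sided conversion test can detect at the given power."""
    norm = NormalDist()
    z_total = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    # Per-user variance of a 0/1 conversion metric, evaluated at the baseline.
    variance = baseline_rate * (1 - baseline_rate)
    # Standard error of the difference between two equal-sized arms.
    std_err = sqrt(2 * variance / n_per_variant)
    return z_total * std_err / baseline_rate


# Hypothetical test: 20,000 users per variant on a 10% baseline.
mde = detectable_relative_lift(n_per_variant=20_000, baseline_rate=0.10)
print(f"{mde:.1%}")
```

For this hypothetical test the answer is about 8.4%. If realistic wins on that surface are 3-4%, the test could not have found them.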

A quick sanity table at α=0.05, power=0.80, baseline 5%:

Users per variant	Smallest relative lift you can reliably detect
5,000	~17%
15,000	~10%
31,000	~7%
61,000	~5%
170,000	~3%
385,000	~2%

If you ran a 15,000-per-variant test on a surface where real wins are 3-4%, you were always going to come back null. The fix isn't a longer runtime past the planned end — it's a re-planned test, often on a different surface or with variance reduction so 15,000 users buys you more sensitivity.

A note on observed power. Plugging the observed lift back into the power formula and saying "see, we only had 30% power" is a common habit and it's mathematically circular — it just restates the p-value in a more confusing way. Use the a priori power, or compute the MDE the actual n supports. (Hoenig & Heisey, 2001 is the canonical reference.)

Verdict if Step 1 fails: the test was underpowered. The result is inconclusive, not null. Don't ship either variant on the basis of it. Replan with a defensible MDE, or pick a different surface.

Step 2 — Is the confidence interval tight or wide?

Step 1 used the planning n. Step 2 uses the actual readout. The confidence interval is the most underrated number on the page, and it distinguishes the two genuinely different flavors of "not significant":

Lift	95% CI	What the test actually learned
+1.8%	`[-1.2%, +4.8%]`	Wide and crosses zero. Test was uninformative. Underpowered.
+0.05%	`[-0.2%, +0.3%]`	Tight and crosses zero. Real evidence the effect is small.
+6.0%	`[-3.0%, +15.0%]`	Wildly wide. You learned almost nothing. Need much more n.

All three intervals cross zero, so all three "fail to reject." The conclusions are completely different.

A wide CI around zero means the data is consistent with anything from a meaningful loss to a meaningful win. You don't have a result — you have an absence of evidence.
A tight CI around zero means the data is consistent only with effects so small they don't matter. You have a result, and the result is the effect is small. That's a perfectly valid finding, and it closes the question.
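One way to make the wide-versus-tight distinction mechanical is to compare the interval against the smallest effect you would act on. The sketch below assumes a normal-approximation interval on the absolute difference in conversion rate; `min_effect` is a threshold you have to choose yourself, and all names and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci(conv_c: int, n_c: int, conv_v: int, n_v: int, level: float = 0.95):
    """Normal-approximation CI for the absolute rate difference (variant - control)."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    std_err = sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_v - p_c
    return diff - z * std_err, diff + z * std_err


def classify(ci_low: float, ci_high: float, min_effect: float) -> str:
    """min_effect is the smallest absolute difference you would act on (a judgment call)."""
    if ci_low > 0 or ci_high < 0:
        return "significant"
    if -min_effect < ci_low and ci_high < min_effect:
        return "tight null"    # every plausible effect is too small to matter
    return "inconclusive"      # plausible effects include ones you would care about


small_test = diff_ci(50, 1_000, 56, 1_000)
large_test = diff_ci(5_000, 100_000, 5_010, 100_000)
print(classify(*small_test, min_effect=0.0025))
print(classify(*large_test, min_effect=0.0025))
```

Both hypothetical tests fail to reject, but the first is inconclusive and the second is a tight null. The verdict depends on `min_effect`, which is why that threshold belongs in the planning doc rather than being picked after the readout.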

Most teams celebrate the wide-around-big-lift case ("look, +6%!") and shrug at the tight-around-zero case ("ugh, null"). It should be the opposite. The first one tells you to run more, the second one tells you to move on.

Verdict if the CI is tight around zero: call it null. You have real evidence the effect is below the bar you'd care about. Don't keep running. Pick the cheaper variant (usually control, since it's already shipped) and free the surface for the next test.

Verdict if the CI is wide around zero: you're back in Step 1, because the test didn't have the resolution to answer the question. Run to the pre-registered end date if you haven't reached it yet; otherwise replan.

Step 3 — Did you peek?

This one is uncomfortable to ask but it's the silent killer behind roughly a third of "weird" results. If you checked the dashboard daily and the test crossed p < 0.05 on day 9, climbed back above it on day 12, and sits at p = 0.18 today, you're not looking at one test. You're looking at the most recent point on a random walk that has already been below 0.05 at least once.

The fixed-horizon p-value is calibrated for a single look, at the end. Repeated looks inflate the false-positive rate from the nominal 5% to something closer to 20-30% depending on how often you peek. They also have a subtler effect on null results: if you stopped a test early because it "trended flat for a week," you may have killed a true winner whose effect only emerges as n grows.
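The inflation is easy to reproduce with a small A/A simulation. This is a sketch, not a model of any particular dashboard: it assumes equally sized batches of data between looks and a z-statistic that is approximately normal, and it stops at the first look that crosses the fixed-horizon threshold.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(looks: int, simulations: int = 2_000,
                        alpha: float = 0.05, seed: int = 0) -> float:
    """Share of A/A tests that cross the fixed-horizon threshold at any of `looks` checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            # Under the null, each new batch adds an independent standard-normal increment.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(k)  # z-statistic on all data collected so far
            if abs(z) > z_crit:
                hits += 1  # the team would have declared a winner here
                break
    return hits / simulations


print(false_positive_rate(looks=1))   # a single look at the end
print(false_positive_rate(looks=20))  # a look after every batch
```

With one look the rate should land near the nominal 5%; with twenty looks it should land somewhere around a quarter, even though no variant in the simulation has any effect.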

The honest checks:

Did you have a pre-registered stop time? If not, your current p-value is roughly meaningless — you're reading one point on a path whose shape was already used to make decisions.
Did you stop early because it looked flat? That's the inverse of the usual peeking sin and produces the same selection bias in the opposite direction.
Did you start over after a "bad" first week? Discarding the data and restarting does not reset the calibration, because the decision to restart was itself driven by a look at the results.

If you peeked, the fix isn't to apologize and read the p-value anyway. It's to either run a new test with a pre-registered stop time, or use a sequential test (mSPRT, group sequential, α-spending) that's explicitly calibrated for repeated looks.

Verdict if Step 3 fails: the readout is uninterpretable. Don't ship either side on the basis of it. Either commit to a fixed end date from here and treat that as the real result, or switch to a sequential method.

Step 4 — Is your metric just too noisy?

Some metrics are easy to test on (binary conversion at a flat baseline). Some are brutal (revenue per user, with a long-tail distribution where 1% of users contribute 40% of the metric). The sample size required to detect a fixed relative effect on a noisy metric can be 5-10x what you'd need on its less-noisy cousin.

Symptoms:

The CI is much wider than your planner predicted, even though you hit the planned n.
The lift estimate swings by several percentage points between days, not just early but persistently.
A handful of users' behavior visibly moves the metric (you can name them).

The mechanism is in the sample size formula:

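$$n \approx \frac{2(z_{\alpha/2} + z_\beta)^2\,\sigma^2}{\delta^2}$$

where $n$ is users per variant, $\sigma^2$ is the per-user variance of the metric, and $\delta$ is the absolute difference you want to detect.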

If $\sigma^2$ is twice what you assumed, you need twice the n. Most revenue-style metrics are noisier than people sizing the test pretended, which is why so many revenue tests come back wide.
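A direct transcription of the formula shows the scaling. The metric, its standard deviation, and the target difference below are invented for illustration; only the ratio between the two results matters.

```python
from math import ceil
from statistics import NormalDist


def required_n(sigma: float, delta: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Users per variant from n = 2 (z_{alpha/2} + z_beta)^2 sigma^2 / delta^2."""
    norm = NormalDist()
    z_total = norm.inv_cdf(1 - alpha / 2) + norm.inv_cdf(power)
    return ceil(2 * z_total ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical revenue-per-user metric: detect a 0.50 absolute difference.
assumed = required_n(sigma=20.0, delta=0.50)
actual = required_n(sigma=20.0 * 2 ** 0.5, delta=0.50)  # variance twice the assumption
print(assumed, actual, round(actual / assumed, 2))
```

The second number is twice the first. A test sized on the optimistic variance therefore reaches its planned n with only about half the information it needed.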

Three things that actually work, in roughly increasing payoff:

Winsorize or cap the metric at the 99th percentile. One whale shouldn't decide your test. Outlier capping costs you a little point estimate honesty in exchange for a lot of variance reduction.
Switch to a less noisy metric that's correlated with what you actually care about — conversions instead of revenue, sessions instead of session length, "any purchase in 7 days" instead of "purchases per user."
Use CUPED to subtract off the part of each user's metric that was predictable from their pre-experiment behavior. On metrics with a good covariate, CUPED typically cuts required n by 30-50%. No extra users, no extra weeks.
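The first and third options fit in a few lines each. The sketch below uses synthetic data and hypothetical function names; it assumes a single pre-experiment covariate and estimates the CUPED coefficient on all users at once, which in a real test means pooling both arms before adjusting.

```python
import random
from statistics import fmean, pvariance


def winsorize(values, upper_quantile=0.99):
    """Cap every value at the chosen upper quantile (nearest-rank)."""
    ordered = sorted(values)
    cap = ordered[int(upper_quantile * (len(ordered) - 1))]
    return [min(v, cap) for v in values]


def cuped_adjust(metric, covariate):
    """Remove the part of `metric` that is linearly predictable from a pre-experiment covariate."""
    mean_x, mean_y = fmean(covariate), fmean(metric)
    cov_xy = fmean((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    theta = cov_xy / pvariance(covariate)
    # Subtracting theta * (x - mean_x) leaves the mean unchanged and shrinks the variance.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


# Synthetic users: in-experiment spend tracks pre-experiment spend plus noise.
rng = random.Random(0)
pre = [rng.gauss(50, 10) for _ in range(5_000)]
post = [0.5 * x + rng.gauss(0, 5) for x in pre]
adjusted = cuped_adjust(post, pre)
print(round(pvariance(post), 1), round(pvariance(adjusted), 1))
print(max(winsorize([1, 2, 3, 1000], upper_quantile=0.5)))
```

The synthetic data is built so that about half of the variance is explained by the covariate, so the adjusted variance should be roughly half the original. How much you get in practice depends entirely on how well pre-experiment behavior predicts the metric.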

Verdict if Step 4 is the issue: redesign the metric before re-running. More users on the same noisy metric will help, but the return is sub-linear and the fix is usually a metric change, not a duration extension.

Step 5 — Is the true effect just smaller than your MDE?

The most common and most boring reason for "not significant" is that the effect exists, but it's smaller than the lift you sized the test to catch. You planned for 10%, the true lift is 3%, and your test had ~13% power to detect it. You came back null. The variant isn't broken; your MDE was wishful.

This shows up as:

Observed lift in the right direction but well below the planned MDE.
CI centered above zero, with a lower bound that still falls below it.
A p-value somewhere between 0.10 and 0.30 — too small to dismiss, too large to claim.

The fix is to decide whether the smaller effect is worth detecting — and if it is, replan honestly. The cost is brutal: halving the MDE quadruples the required n. (See the MDE deep-dive for the full curve.)
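Because required n scales with one over the square of the MDE, the cost of a smaller target can be priced without knowing anything else about the test. A minimal sketch, with a hypothetical helper name:

```python
def sample_size_multiplier(planned_mde: float, target_mde: float) -> float:
    """How many times more users are needed to detect target_mde instead of planned_mde."""
    # Required n scales with 1 / MDE^2, so everything else in the formula cancels.
    return (planned_mde / target_mde) ** 2


for target in (0.10, 0.05, 0.03):
    print(f"MDE {target:.0%}: {sample_size_multiplier(0.10, target):.1f}x the users")
```

This prints 1.0x, 4.0x, and 11.1x. At fixed traffic the same multipliers apply to runtime, so a test planned for two weeks at a 10% MDE needs roughly five months at 3%.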

The sanity table in Step 1 gives the required users per variant at a 5% baseline, α = 0.05, power = 0.80.

If a 3% lift on this surface would be worth shipping, the test you actually need is roughly 170,000 users per variant. The test you ran was for a 10% MDE and 15,000 users. Of course it came back null.

Verdict if Step 5 is the issue: replan, but only if the smaller effect is genuinely worth shipping. If 3% on this surface wouldn't be worth the engineering cost, the honest call is "we don't care about effects this size on this surface" — and the null result is, in practice, a ship-control decision.

The decision tree, in one screen

Walk down the list in order. The first answer of "yes" tells you what to do:

Diagnostic	If yes	Next move
1. A priori power for a realistic MDE was < 50%?	Underpowered	Replan. New MDE, more n, or different surface.
2. CI is tight and crosses zero?	Real null	Call it null. Ship the cheaper variant. Move on.
3. You peeked / no pre-registered stop time?	Invalid readout	Reset. Pre-register a stop, or switch to a sequential test.
4. CI much wider than planner predicted?	Noisy metric	Redesign metric. Winsorize, switch metric, or apply CUPED.
5. Lift in right direction, < planned MDE, wide CI?	Effect smaller than MDE	Replan at the smaller MDE if it's worth shipping. Otherwise ship control.

What you'll notice: "keep running the existing test past its planned end date" is not on the list. Extending a test that wasn't sized for the real effect is just a slow, expensive form of peeking. Pre-register a new stop or replan.
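The table translates into a short first-match function. This is a sketch for a readout that is already not significant: the field names are hypothetical, `min_effect` is the smallest lift you would act on, and the 1.5x cutoff for "much wider than predicted" is an arbitrary assumption you should set yourself.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    a_priori_power: float    # power for a realistic MDE, computed before launch
    ci_low: float            # lower CI bound for the lift
    ci_high: float           # upper CI bound for the lift
    min_effect: float        # smallest lift you would act on
    peeked: bool             # no pre-registered stop, or decisions made on interim looks
    ci_width_vs_plan: float  # observed CI width divided by the width the planner predicted
    observed_lift: float
    planned_mde: float


def diagnose(r: Readout, wide_ratio: float = 1.5) -> str:
    """Return the first matching verdict, in the order of the checklist."""
    if r.a_priori_power < 0.50:
        return "underpowered: replan"
    if r.ci_low < 0 < r.ci_high and -r.min_effect < r.ci_low and r.ci_high < r.min_effect:
        return "real null: ship the cheaper variant"
    if r.peeked:
        return "invalid readout: pre-register a stop or go sequential"
    if r.ci_width_vs_plan > wide_ratio:
        return "noisy metric: redesign the metric"
    if 0 < r.observed_lift < r.planned_mde:
        return "effect smaller than MDE: replan only if it is worth shipping"
    return "no diagnosis matched: recheck the inputs"


example = Readout(a_priori_power=0.80, ci_low=-0.002, ci_high=0.003, min_effect=0.01,
                  peeked=False, ci_width_vs_plan=1.0, observed_lift=0.0005, planned_mde=0.10)
print(diagnose(example))
```

The example readout is a real null: the test was powered, and the whole interval sits inside the range of effects nobody would act on.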

The four "not significant" verdicts, side by side

Pulling the diagnostic together, here are the four real outcomes that get collapsed under "not significant" in most readouts:

Verdict	What the data is telling you	Action
Inconclusive	We didn't have the resolution to answer. CI is wide.	Replan. Don't ship either side based on this.
Real null	The effect, if any, is too small to matter. CI is tight around 0.	Ship control (or cheaper variant). Move on.
Underpowered for the real effect	A true lift exists, smaller than your MDE.	Replan at a defensible MDE if it's worth it.
Invalid (peeked)	The number on the page isn't calibrated for how you read it.	Pre-register, or switch to sequential testing.

The same p-value > 0.05 leads to four different write-ups. The skill of reading A/B tests well is being able to tell which one you're in.

Things that don't fix "not significant"

While we're here, a handful of common moves that feel like fixes but aren't:

"Let's run it for one more week and see." Without a pre-registered stop and a planning calculation that says one more week buys meaningful power, this is peeking dressed up as patience.
"The lift is +4%, let's just ship it." You can ship under uncertainty — that's a decision, not a bug — but call it that. Don't tell the company you "won" a test you didn't win.
"Let's segment and find where it worked." Subgroup analysis on a null primary metric is the single most reliable way to manufacture a false-positive win. If you must, pre-register the subgroups before the next test.
"Let's switch the metric." If you change the primary metric after seeing the result, you've just run an uncontrolled multiple-comparison test on yourself. Pick the metric in the planning doc and live with it.
"The other team's similar test won, so ours probably would too with more data." Two tests are not one bigger test. You can run the bigger test deliberately; you can't combine them retroactively without a meta-analysis you almost certainly won't do.

The honest discipline is: when a test comes back null, walk the diagnostic, write up which of the four verdicts it is, and act accordingly. That's it.
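To put a number on the subgroup warning above: if the variant truly does nothing, the chance that at least one segment still shows p < 0.05 grows quickly with the number of segments you inspect. The sketch assumes the subgroups are independent, which real segments rarely are, so read it as an order-of-magnitude guide.

```python
def chance_of_false_win(subgroups: int, alpha: float = 0.05) -> float:
    """Probability that at least one of several independent null subgroups reaches p < alpha."""
    return 1 - (1 - alpha) ** subgroups


for k in (1, 5, 10, 20):
    print(k, f"{chance_of_false_win(k):.0%}")
```

This prints 5%, 23%, 40%, and 64%. Slice a null test ten ways and you have about a two-in-five chance of finding a "winning" segment that isn't one.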

Try it on your test

If you have a specific test that came back not significant, you can short-circuit most of this:

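Paste the numbers into the analyzer. Look at the CI width and the observed power, which together cover Steps 1, 2, and 5 in one screen. For Step 1, lean on the a priori power or the MDE your n supports rather than the observed-power figure, for the reason given in the note there.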
If the CI is wider than the planner predicted, go to Step 4 (noisy metric) before going to Step 1 (more users).
If you don't have a pre-registered stop, go to Step 3 first and decide whether the current readout is even interpretable.
If you're planning the next test, use the planner and pick an MDE you can defend — the diagnostic above mostly exists because teams pick optimistic MDEs at planning time.

Further reading:

How to read A/B test results — the order to read the five numbers in, and what each one is allowed to tell you.
Minimum Detectable Effect, explained — how to pick an MDE you can actually defend at planning time.
CUPED variance reduction — how to shrink the CI without collecting more users.
A/B test sample size calculator — how to size a test that can actually answer your question.