A sample-size calculator takes four inputs: baseline conversion rate, significance level (α), power (1 − β), and minimum detectable effect.
Three of those have boring industry defaults. Baseline is whatever your current conversion rate is. α is 0.05 because everyone uses 0.05. Power is 0.8 because Cohen said so in 1988 and nobody has had the energy to argue since.
That leaves MDE — the only input that's actually a judgment call. And it's the one that decides whether your test runs for two weeks or six months, whether you ship real wins or sit on them, whether your experimentation program builds momentum or quietly dies.
This post is about choosing it well.
## What MDE actually means
The minimum detectable effect is the smallest true lift your test is designed to detect: if the true lift is at least this big, you have at least your chosen power (typically an 80% chance) of getting a significant result.
Three things follow from that definition that people regularly get wrong:
- MDE is a design parameter, not a result. You pick it before the test starts. It tells you how big the test needs to be. It is not the lift you observed.
- A non-significant test doesn't mean the true lift is below the MDE. It means you couldn't detect a lift as small as the MDE with the n you ran. The true lift could be anywhere — bigger, smaller, zero, negative.
- Smaller MDE is not "more rigorous." It's just more expensive. There is a real trade-off, and pretending there isn't leads to tests that never finish.
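If the "at least an 80% chance" part feels abstract, it's easy to check by simulation. Here's a minimal sketch (Python with NumPy and SciPy; the baseline, MDE, and per-variant n are illustrative, sized roughly per the formula in the next section): simulate thousands of tests whose true lift sits exactly at the MDE and count how often they come back significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

baseline = 0.05                      # control conversion rate
mde_rel = 0.10                       # the relative MDE the test was sized for
variant = baseline * (1 + mde_rel)   # true lift exactly at the MDE
n = 30_000                           # per-variant n (from the formula in the next section)

trials, significant = 2_000, 0
for _ in range(trials):
    conv_c = rng.binomial(n, baseline)
    conv_v = rng.binomial(n, variant)
    # Two-sided two-proportion z-test with pooled variance.
    p_pool = (conv_c + conv_v) / (2 * n)
    se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (conv_v - conv_c) / (n * se)
    if 2 * stats.norm.sf(abs(z)) < 0.05:
        significant += 1

print(f"significant in {significant / trials:.0%} of simulated tests")  # roughly 0.8, the design power
```

Run it with a true lift smaller than the MDE and watch the hit rate fall; that's the asymmetry behind the second bullet above.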
## Why sample size explodes when MDE shrinks
The two-proportion sample size formula simplifies, near a baseline rate $p$, to roughly:

$$n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\, p(1-p)}{\delta^2}$$

where $\delta$ is the absolute MDE and $n$ is the required sample size per variant.
The interesting part is in the denominator. Halve the MDE and you quadruple the required users. Cut it by a factor of three and you need nine times the traffic.
Here's that relationship at a 5% baseline conversion rate, α=0.05, power=0.8:
A 1% relative MDE on a 5% baseline needs about 3 million users per variant. A 10% relative MDE needs about 30,000. The curve is brutally non-linear, and that's why MDE matters more than the other three inputs combined.
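If you want to generate those numbers yourself, here's a minimal sketch of the approximation above (Python, SciPy for the normal quantiles; the MDE grid is illustrative):

```python
from scipy.stats import norm

def n_per_variant(baseline, mde_rel, alpha=0.05, power=0.80):
    """Approximate per-variant n for a two-sided two-proportion test."""
    delta = baseline * mde_rel                     # absolute MDE
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)  # ~2.80 at alpha=0.05, power=0.8
    return 2 * z**2 * baseline * (1 - baseline) / delta**2

for mde in (0.01, 0.02, 0.05, 0.10, 0.20):
    print(f"{mde:4.0%} relative MDE -> {n_per_variant(0.05, mde):>12,.0f} per variant")
```

Swap in your own baseline to see your curve.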
## Three frames for picking MDE
There's no objectively correct MDE. There are three frames I've seen work, and the right one depends on what's binding for you.
### 1. The business frame: "What lift would be worth shipping?"
If a 2% relative lift on the checkout button isn't worth a code review and a deploy, then designing a test to detect it is a waste of traffic. Pick the MDE at the size you'd actually act on.
This is the right frame for tests on small surfaces (button copy, minor layout) where shipping has real cost and you don't want to ship "wins" that the noise floor can't reliably distinguish from zero.
### 2. The historical frame: "What's the median win size of our last 20 tests?"
If your team has been running experiments for a while, you have data on how big winners typically are in your context. Use that distribution. If your median win is 6%, an MDE of 5% is reasonable. An MDE of 1% means you're designing for outliers.
This is the right frame for mature programs, and it has the side benefit of forcing the team to actually look at the distribution of past effects, which is usually a sobering exercise.
### 3. The traffic frame: "What can I detect in 4 weeks?"
Sometimes the binding constraint isn't business judgment, it's calendar. You have N weekly users and a stakeholder who won't wait more than a month. Solve the sample-size formula in reverse: given the $n$ you can collect in that window, what's the smallest MDE this test can resolve at 80% power?
If the answer is 12% relative and your historical wins are 4%, this test is going to come back null and you should either redesign the experiment or invest in variance reduction (CUPED, stratification) before you spend the traffic.
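Inverting the sizing approximation does this in one line: δ = sqrt(2·z²·p(1−p)/n), then divide by the baseline to get a relative MDE. A sketch, with the traffic numbers purely hypothetical:

```python
import math
from scipy.stats import norm

def min_detectable_rel_lift(baseline, n, alpha=0.05, power=0.80):
    """Smallest relative lift resolvable with n users per variant."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    delta = math.sqrt(2 * z**2 * baseline * (1 - baseline) / n)  # absolute MDE
    return delta / baseline

# Hypothetical: 10,000 users per variant per week, a 4-week deadline.
print(f"{min_detectable_rel_lift(0.05, 4 * 10_000):.1%}")  # ~8.6% relative
```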
## Absolute vs relative MDE — the silent killer
Two ways to express the same thing, and mixing them up will burn you.
| Type | What it means at p = 5% baseline |
|---|---|
| Absolute MDE = 1pp | Variant rate of 6.0% (a 20% relative lift) |
| Relative MDE = 1% | Variant rate of 5.05% (a 0.05pp absolute lift) |
The required sample size for these two designs differs by a factor of about 400. Most calculators (AB SHARK included) default to relative MDE because it's the language stakeholders speak: "we want a 5% lift on checkout." But every calculator surfaces this differently, and "5% MDE" in a Slack message is dangerously ambiguous.
When you're writing a test plan, write the variant rate explicitly: "baseline 5.0%, target 5.25% (5% relative lift, 0.25pp absolute)." It takes ten extra characters and prevents the most expensive mistake on this page.
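To put numbers on that factor of 400, here's the sizing sketch from earlier applied to both readings of "1% MDE" at a 5% baseline:

```python
from scipy.stats import norm

def n_per_variant(baseline, mde_rel, alpha=0.05, power=0.80):
    # Same approximation as in the sample-size section above.
    delta = baseline * mde_rel
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * baseline * (1 - baseline) / delta**2

baseline = 0.05
n_abs = n_per_variant(baseline, 0.01 / baseline)  # "1pp absolute" = 20% relative
n_rel = n_per_variant(baseline, 0.01)             # "1% relative"  = 0.05pp absolute
print(f"absolute 1pp: {n_abs:>12,.0f} per variant")        # ~7,500
print(f"relative 1%:  {n_rel:>12,.0f} per variant")        # ~3,000,000
print(f"same phrase, {n_rel / n_abs:,.0f}x more traffic")  # 400x
```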
## What an MDE that's too small looks like
Symptoms:
- Tests run for months and never reach the planned sample size.
- Engineers stop pre-registering and start "checking in" — peeking — which inflates false positives. (More on that in the peeking post.)
- A growing pile of "inconclusive" tests that no one wants to revisit.
- Stakeholders quietly stop asking the experimentation team for help.
The team feels rigorous. The org gets nothing shipped. This is the most common failure mode in mid-sized programs.
## What an MDE that's too big looks like
Symptoms:
- A 3% real lift hides under a 10% MDE and the test ends "not significant." The variant gets killed even though it was actually better.
- The team stops trusting that wins are findable.
- A competitor who picked an MDE half as big eats the same lift on the same surface a year later.
The cost here is invisible — you never see the lift you didn't ship — but in mature programs it dominates. Tests are cheap; missed wins compound.
## A heuristic that works
If you don't know where to start, start at 5% relative MDE.
- If your last 5 wins were all bigger than 5%, you can comfortably move to 10% — your interventions are big enough that you don't need to resolve small effects.
- If your last 5 wins were all smaller than 5%, don't drop the MDE first. Drop the variance instead: pick a less noisy metric, segment to high-signal users, or use a variance reduction method like CUPED. (I'll cover CUPED in a follow-up post.)
- If your weekly traffic doesn't support a 5% MDE in a reasonable timeframe, that's your honest answer: this test isn't worth running on this surface. Find a higher-traffic surface or a bigger intervention.
The thing to internalize: MDE is not a knob you can turn freely. It's the intersection of what's worth shipping, what your traffic supports, and what your noise floor lets you see. Get those three numbers in the same room and the MDE picks itself.
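If it helps, the whole meeting fits in a dozen lines. A toy sketch, every input hypothetical (the traffic floor is the 8.6% from the traffic-frame example above):

```python
ship_threshold = 0.03   # smallest relative lift worth shipping (business frame)
median_past_win = 0.06  # median relative lift of recent winners (historical frame)
traffic_floor = 0.086   # smallest MDE your 4 weeks of traffic supports (traffic frame)

mde = max(ship_threshold, traffic_floor)  # you can't design below either bound
if mde > median_past_win:
    print(f"MDE {mde:.0%} exceeds your typical win ({median_past_win:.0%}): "
          "cut variance or pick a bigger surface before testing.")
else:
    print(f"Plan the test at a {mde:.0%} relative MDE.")
```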
## Try it
AB SHARK's planner takes baseline + MDE + α + power and returns the required n per variant and an estimated duration, plus a sensitivity chart so you can see how the answer moves as MDE changes. Pick a few candidate MDEs, look at the durations, and have the conversation about which one you can actually live with — before you start the test.
Related reading: the sample size calculator guide walks through the full formula, and the post on why your A/B test isn't significant covers what to do when MDE was the wrong call in hindsight.