Most A/B tests don't fail because of bad code or a weak hypothesis. They fail because the test was never large enough to detect the lift the team was hoping for in the first place. You ship a flat result, you call it a "null," and you move on — when in reality you ran a coin-flip-resolution experiment looking for a thumbtack.
The fix is boring and unglamorous: calculate your sample size before you start. This guide walks through what an A/B test sample size calculator is actually doing under the hood, the four numbers it needs, the trade-offs between them, and the cases where the standard formula quietly breaks.
If you just want the number, our planner will give it to you. If you want to know whether to trust the number, keep reading.
What an A/B test sample size calculator actually does
A sample size calculator answers one question:
Given my baseline conversion rate and the smallest lift I'd care about, how many users per variant do I need so that — if the lift is real — I have a high probability of detecting it as statistically significant?
That's it. It is not a guarantee that you'll find a winner. It's a guarantee about your test's sensitivity: how small a signal it can resolve against the noise of normal user behavior.
To produce that number, the calculator needs four inputs:
| Input | Symbol | What it controls |
|---|---|---|
| Baseline conversion rate | $p_1$ | The variance floor of your metric. |
| Minimum Detectable Effect | MDE, $\delta$ | The smallest lift the test can resolve. |
| Significance level | $\alpha$ | False positive rate (typically 0.05). |
| Statistical power | $1 - \beta$ | True positive rate (typically 0.80). |
Get those four right and the math is trivial. Get one of them wrong — usually the MDE — and you can be off by a factor of ten.
The four levers: power, significance, MDE, and baseline
Before the formula, the intuition. Sample size is determined by a tug-of-war between four numbers, and fixing any three pins the fourth.
- Significance ($\alpha$) is the false-positive rate you're willing to tolerate. At $\alpha = 0.05$, if there is no real effect, you'll still call a winner 5% of the time. Lowering $\alpha$ makes the test stricter — and demands more users.
- Power ($1 - \beta$) is the true-positive rate: the probability of catching a real effect of size MDE. 0.80 is the de facto standard, meaning you accept a 20% chance of missing a real lift. Cranking power up to 0.95 increases your required $n$ by roughly two-thirds.
- Minimum Detectable Effect (MDE) is the smallest lift you want the test to be able to resolve. This is the lever most people get wrong — usually by setting it too optimistically. More on this below.
- Baseline conversion rate is your control group's expected rate. Rare events have higher relative variance, so a 1% baseline metric needs vastly more users than a 30% baseline metric to detect the same relative lift.
The headline relationship: required $n$ scales with $1/\mathrm{MDE}^2$. Halving your MDE quadruples the users you need. If your MVP test was sized for a 10% lift and you actually wanted to catch a 5% lift, you needed 4x the traffic. This single fact explains most "underpowered" tests in the wild.
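The arithmetic behind that claim, treating the relative MDE as $\delta$ and ignoring the small shift in the variance term:

$$
n \;\propto\; \frac{1}{\delta^2}
\quad\Longrightarrow\quad
\frac{n(\delta/2)}{n(\delta)} \;\approx\; \left(\frac{\delta}{\delta/2}\right)^2 = 4.
$$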
The sample size formula for a two-proportion test
For a standard conversion-rate test (binary outcome, two variants, large $n$), the normal-approximation formula is:

$$
n \;=\; \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}(1-\bar{p})} \;+\; z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^2}{(p_2 - p_1)^2}
$$

where $p_1$ is the baseline rate, $p_2$ is the variant rate, $\bar{p} = (p_1 + p_2)/2$, and $z_{1-\alpha/2}$, $z_{1-\beta}$ are the usual normal-distribution critical values (1.96 and 0.8416 for $\alpha = 0.05$ and power = 0.80).
In plain English:
- The numerator is a "noise budget": how much variability you need to overcome to confidently reject the null. The two terms come from the null and alternative worlds — $\bar{p}$ is used under the null, where both variants share a rate, and the separate $p_1$, $p_2$ term is the actual variance under the alternative.
- The denominator is the signal: the squared absolute difference between control and variant rates. That square is why halving MDE costs you 4x in users.
This is the formula every reputable A/B test sample size calculator — ours included — solves under the hood. Here it is in a dozen lines of Python:

```python
import math
from statistics import NormalDist

def required_n(p_baseline: float, mde_rel: float,
               alpha: float = 0.05, power: float = 0.8) -> int:
    """Users per variant for a two-proportion z-test, normal approximation."""
    p1 = p_baseline
    p2 = p_baseline * (1 + mde_rel)            # variant rate under the alternative
    p_bar = (p1 + p2) / 2                      # pooled rate under the null
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.9600 for a two-sided 5% test
    z_b = NormalDist().inv_cdf(power)          # 0.8416 for 80% power
    num = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(num / (p1 - p2) ** 2)
```

Call `required_n(0.05, 0.10)` and you get roughly 31,000 — the number we use in the worked example below.
Worked example: a 5% baseline checkout test
Say you run a checkout page with a 5% conversion rate, and you're testing new button copy that you think might lift conversions by 10% relative (from 5.0% to 5.5%). You want the standard $\alpha = 0.05$ and power = 0.80.
Plugging into the formula:
- $p_1 = 0.05$, $p_2 = 0.055$, $\bar{p} = 0.0525$
- Numerator $= \left(1.96\sqrt{2 \cdot 0.0525 \cdot 0.9475} + 0.8416\sqrt{0.05 \cdot 0.95 + 0.055 \cdot 0.945}\right)^2 \approx 0.781$
- Denominator $= (0.055 - 0.05)^2 = 0.000025$
- $n \approx 0.781 / 0.000025 \approx 31{,}200$ per variant
That's ~62,400 users total. If your checkout page sees 5,000 users a week, that's roughly 12 weeks to reach the required sample. If you only have time for two weeks, your test is structurally underpowered — no amount of tweaking the JavaScript fixes that. You either need a bigger expected effect, a higher baseline (e.g., test earlier in the funnel), or a variance-reduction technique like CUPED.
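If you prefer the duration arithmetic as code, here is a minimal sketch reusing `required_n` from above (the 5,000-visitors-per-week figure is just this example's assumption):

```python
def weeks_needed(p_baseline: float, mde_rel: float,
                 weekly_users: float, variants: int = 2) -> float:
    """Rough test duration: total required users divided by weekly traffic."""
    per_variant = required_n(p_baseline, mde_rel)
    return per_variant * variants / weekly_users

# ~12.5 weeks for the checkout example: 5% baseline, 10% relative MDE,
# 5,000 checkout visitors per week split across two variants.
print(round(weeks_needed(0.05, 0.10, 5_000), 1))
```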
How required n scales with MDE
This is the chart every PM should have tattooed somewhere. Holding the baseline conversion rate at 5%, $\alpha = 0.05$, power = 0.80:
The curve drops off fast. Detecting a 25% relative lift at a 5% baseline takes ~5,300 users per variant; detecting a 5% relative lift takes ~122,000. Practically, this means the MDE is the single biggest determinant of your test's duration. If a stakeholder says "I want to know about any meaningful lift," you're allowed to push back and ask: "What's the smallest lift that's actually worth shipping for?"
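To reproduce the curve (or a table version of it), a quick sweep with the `required_n` function from the previous section does the job:

```python
# Required users per variant at a 5% baseline, alpha = 0.05, power = 0.80.
# Prints roughly 122,000 at a 5% relative MDE down to ~5,300 at 25%.
for mde in (0.05, 0.10, 0.15, 0.20, 0.25):
    print(f"{mde:.0%} relative lift -> {required_n(0.05, mde):>8,} per variant")
```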
How to pick the MDE without lying to yourself
The single most common way A/B tests are killed before they start is by choosing an MDE based on what would be nice rather than what's plausible. Three heuristics:
- Look at past wins. What's the median lift of tests on this surface that actually shipped? If your team's last 10 winners averaged a 3% relative lift, sizing for a 10% MDE is wishful thinking.
- Tie it to business significance. What's the smallest lift that would justify the engineering cost of building and maintaining the variant? If shipping is only worth it at ≥5% lift, sizing for 2% is wasted traffic.
- Use prior research, not vibes. Industry benchmarks for "good" CRO wins on conversion-rate metrics cluster around 2–5% relative; double-digit wins are rare and usually come from changes much bigger than a button color.
A well-chosen MDE is honest, defensible, and locked in before the test starts. Tuning it after you peek at the data is one of the cardinal sins of experimentation.
Common mistakes that wreck sample size planning
Even teams that run a sample size calculator get tripped up by the same handful of issues. In rough order of how often we see them:
- Using post-hoc power. Computing power after the test using the observed effect size is mathematically circular — it just restates your p-value. Power is a planning tool, not a result. See our results reading guide for what to look at instead.
- Mixing absolute and relative MDE. "I want to detect a 1% lift" is ambiguous: is that 1 percentage point (5% → 6%) or 1% relative (5% → 5.05%)? The required sample sizes differ by roughly a factor of 400, because the relative effects differ by 20x and $n$ scales with the square. Always be explicit.
- Forgetting that $n$ is per variant. The number out of the formula is per arm. A two-variant test needs $2n$ users; a four-variant test needs $4n$, plus a multiple-comparisons correction.
- Peeking and stopping early. Calling significance the first time $p < 0.05$ flashes is equivalent to inflating $\alpha$ — sometimes by 3x or more. If you want to peek, use sequential testing or alpha-spending. The simulation sketch after this list shows how quickly the false-positive rate climbs.
- Ignoring the traffic split. 50/50 is variance-optimal for a two-arm test, but multi-arm and ramped rollouts change that. A 90/10 split needs far more total users than 50/50 to get the same power on the small arm.
- Treating "two weeks" as a sample size. Calendar duration is not a proxy for statistical power. A two-week test that hits 10% of required is just as underpowered as a two-day test that hits 10%.
When the simple formula isn't enough
The two-proportion z-test calculator handles the 80% case of conversion-rate testing cleanly. The other 20% needs different tools:
- Continuous metrics (revenue per user, session length, page views): use a Welch's t-test sample size formula, with $n$ a function of the variance of the metric rather than $p(1-p)$. A sizing sketch for this case follows below.
- Skewed metrics (revenue is the classic offender — long-tail, lots of zeros): the normal approximation degrades. Bootstrap-based sample sizing or Mann-Whitney U are more honest. Capping outliers (winsorizing) also helps.
- Repeated peeks at the data: switch to a sequential testing framework (mSPRT, group sequential, alpha-spending). These trade slightly larger expected sample sizes for the ability to stop early without inflating Type I error.
- High-variance metrics with pre-period data available: apply CUPED. It reduces variance by a factor of $1 - \rho^2$, where $\rho$ is the correlation between the pre-period and in-experiment metric, which can mean 30–50% fewer users for the same power.
- Cluster-randomized tests (assigning by account, not user): your effective $n$ is closer to the number of clusters than the number of users, often by an order of magnitude.
If any of those apply, the back-of-the-envelope two-proportion calculator will lie to you — usually by underestimating the required sample, sometimes dramatically.
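For the continuous-metric case, here is a minimal sizing sketch under the usual equal-variance normal approximation; the standard deviation would come from your own historical data, and the dollar figures below are purely illustrative:

```python
import math
from statistics import NormalDist

def required_n_continuous(sigma: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users per variant for a two-sample test on a continuous metric.

    sigma:   standard deviation of the metric (estimated from historical data)
    mde_abs: smallest absolute difference in means worth detecting
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(2 * sigma ** 2 * (z_a + z_b) ** 2 / mde_abs ** 2)

# Illustrative: revenue per user with a $40 standard deviation,
# looking for a $1 lift in the mean -> ~25,000 users per variant.
print(required_n_continuous(sigma=40.0, mde_abs=1.0))
```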
A pre-test checklist
Before you push the experiment live, you should be able to answer all six:
- What is the baseline rate for the primary metric, measured over a representative recent window?
- What is the MDE, and why is that the smallest lift worth shipping?
- What $\alpha$ and power are you using? (If not 0.05 and 0.80, why?)
- How many users per variant does the calculator say you need?
- Given your traffic, how long will that take? Is that compatible with roadmap reality?
- What's your stopping rule — fixed-horizon, sequential, or alpha-spent? You should not be making this decision mid-test.
If you can't answer #2 or #6, do not launch the test yet. You will not like the result.
Plan it once, run it once
Sample sizing isn't statistics busywork — it's the single highest-leverage decision in the entire test lifecycle. A test that's properly powered tells you something either way: a win is real, and a null is informative ("the effect, if any, is smaller than the MDE we cared about"). An underpowered test tells you almost nothing, and worse, feels like it told you something.
Plug your baseline, MDE, $\alpha$, and power into the AB SHARK planner and you'll get the required $n$ per variant plus a duration estimate based on your traffic. If the number is bigger than your roadmap allows, that's the planner doing its job — better to find out before you ship the test than after.
Further reading:
- How to read A/B test results — what to do once the test is running.
- Minimum Detectable Effect, explained — a deeper dive on the most-mis-set knob.
- CUPED variance reduction — how to cut required $n$ by 30–50% using pre-experiment data instead of extra traffic.
- Why your A/B test is "not significant" — the diagnostic guide for flat results.