There is one trick in the controlled-experiments literature that pays for itself within a single test cycle, ships in a few hundred lines of code, and gets you results in half the time. It's called CUPED — Controlled experiments Using Pre-Experiment Data — and Microsoft published it in 2013 (Deng, Xu, Kohavi & Walker, WSDM 2013). Every serious experimentation platform — Microsoft, Booking, Netflix, Meta, Airbnb — runs some version of it under the hood.
It is not glamorous. It is not Bayesian. It does not require new infrastructure. It is the boring industrial workhorse of variance reduction, and if you run A/B tests on continuous metrics — revenue, sessions, time on site, clicks per user — you are leaving 30–50% of your sample size on the floor by not using it.
This post walks through why it works, when it works, when it doesn't, and the handful of ways people manage to break it.
The setup: why sample size is really a variance problem
The two-sample sample-size formula, for a continuous metric, simplifies to:

$$n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^2\,\sigma^2}{\delta^2} \quad \text{per variant,}$$

which at $\alpha = 0.05$ and 80% power is roughly $16\sigma^2/\delta^2$.
Two things on the right-hand side control how many users your test needs:
- $\delta$, the effect you want to detect (your MDE)
- $\sigma^2$, the variance of the metric
Most teams treat $\sigma^2$ as a fact of nature. It isn't. A substantial fraction of that variance is predictable from things you already knew about the user before the experiment started — and any variance you can predict, you can subtract out.
That is the entire idea behind CUPED.
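To see the linear dependence on $\sigma^2$ concretely, here is a small sketch of the standard normal-approximation calculator (the function name is mine, not from any particular library):

```python
from statistics import NormalDist

def required_n_per_variant(sigma, mde, alpha=0.05, power=0.8):
    """Two-sample required n per variant for a continuous metric,
    via the normal approximation. Note: n scales with sigma**2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96
    z_beta = NormalDist().inv_cdf(power)           # ~0.84
    return 2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2

# Halve the variance (sigma -> sigma / sqrt(2)) and required n halves too:
n_full = required_n_per_variant(sigma=10.0, mde=1.0)
n_half = required_n_per_variant(sigma=10.0 / 2 ** 0.5, mde=1.0)
print(round(n_full), round(n_half))  # roughly 1570 and 785
```

Cut variance, and the required sample size falls by the same factor — that is the lever CUPED pulls.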
The intuition in one paragraph
Imagine you're testing a new checkout flow. Some users in your test were always going to spend \$5 — and others far more — regardless of which variant they saw; their pre-period spending told you that before the test even started. If you compare raw weekly revenue across variants, that pre-existing heterogeneity shows up as noise that obscures the real treatment effect. CUPED adjusts each user's outcome by their pre-period behavior, so what's left is mostly the variant effect plus genuinely new noise. Less noise, smaller required n.
That's it. Everything below is just making that paragraph precise.
The math, briefly
For each user $i$, pick a pre-experiment covariate $X_i$ — typically their value of the same metric over the 30 days before the test started. Compute the adjusted outcome:

$$\tilde{Y}_i = Y_i - \theta\,(X_i - \bar{X}),$$

where $\theta$ is chosen to maximize variance reduction:

$$\theta = \frac{\mathrm{Cov}(Y, X)}{\mathrm{Var}(X)}.$$

Then run your usual two-sample t-test (or whatever test you'd use) on $\tilde{Y}$ instead of $Y$.
Two properties make this useful:
- Unbiased. Because $X$ is measured before the test, it's independent of the treatment assignment. Subtracting a function of $X$ doesn't shift the difference between variants in expectation. You're not gaming the result — you're cleaning it.
- Lower variance. The variance of the adjusted estimator is

  $$\mathrm{Var}(\tilde{Y}) = \mathrm{Var}(Y)\,(1 - \rho^2),$$

  where $\rho$ is the correlation between $Y$ and the covariate $X$. Variance is reduced by a factor of $(1 - \rho^2)$, and so is required sample size.
The whole game is finding a covariate with high $\rho$.
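In code, the whole recipe is a few lines. A minimal NumPy sketch (names and the simulated data are illustrative, not any platform's API):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED: subtract theta * (x - mean(x)) from y, with
    theta = Cov(y, x) / Var(x) estimated on the pooled data."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(0)
n = 100_000
x = rng.gamma(2.0, 10.0, size=n)            # pre-period revenue per user
y = 0.7 * x + rng.normal(0.0, 8.0, size=n)  # test-period revenue, correlated

y_adj = cuped_adjust(y, x)
rho = np.corrcoef(y, x)[0, 1]
print(np.var(y_adj) / np.var(y))  # ~ 1 - rho**2
print(y_adj.mean() - y.mean())    # ~ 0: the mean is untouched
```

Apply the same pooled $\theta$ to both arms, then run whatever test you would have run anyway on `y_adj`.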
How much sample size do you actually save?
Because required $n$ scales linearly with variance, the savings table falls straight out of $\rho^2$:
| Correlation between metric and covariate | Variance reduction | Required-n reduction |
|---|---|---|
| 0.2 | 4% | 4% |
| 0.3 | 9% | 9% |
| 0.4 | 16% | 16% |
| 0.5 | 25% | 25% |
| 0.6 | 36% | 36% |
| 0.7 | 49% | 49% |
| 0.8 | 64% | 64% |
For most consumer metrics with a sensible 30-day pre-period covariate, $\rho$ lands somewhere between 0.4 and 0.7 — the upper half of which gives exactly the "30–50% sample-size cut" you'll see cited everywhere. Revenue per user and sessions per user tend to sit in the upper half of that range; engagement metrics on logged-in surfaces are even higher.
That same reduction shows up as a calendar shrink. A test that needed six weeks to hit the planned sample size now needs three. Either you ship faster, or you spend the same time and resolve a smaller MDE — your pick.
When CUPED helps a lot
CUPED is a leverage tool, not a free lunch. The leverage comes from two ingredients, and you need both:
- A metric with strong autocorrelation. Continuous, user-level metrics where this week's value strongly predicts next week's: revenue per user, sessions per user, minutes watched, items viewed, ad impressions, GMV. These have $\rho$ in the 0.5–0.8 range.
- A user population with reliable pre-period data. Logged-in users with a stable history. The longer the pre-period, the better the covariate (with diminishing returns past 30 days for most metrics).
When both conditions hold, CUPED is the highest-ROI change you can make to your experimentation stack. Most published case studies report 30–50% variance reduction on flagship metrics, and that translates directly into half-length tests.
When CUPED doesn't help much
The places it falls down are predictable:
- Binary conversion metrics with low base rates. If your metric is "did the user convert in the test window" and the base rate is 2%, there is very little user-level variance to predict, and pre-period conversion is a weak signal anyway. You'll see single-digit percent savings at best.
- First-time users with no pre-period. If most of your test population didn't exist a month ago, you don't have a covariate. Acquisition and onboarding tests fall into this bucket.
- Metrics with no autocorrelation. Some metrics genuinely look like fresh draws each session — checkout funnel completion rate per visit, for example. If $\rho$ is near zero, CUPED does nothing.
- Very short tests on rapidly-changing populations. If the pre-period covariate is stale by the time the test runs, $\rho$ collapses.
The honest workflow is to compute $\rho$ once on historical data per metric, and only enable CUPED where the expected savings clear, say, 15%. Below that, the operational complexity isn't worth it.
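That workflow fits in a short offline script. A sketch, with an invented 15% threshold and made-up metric names:

```python
import numpy as np

def expected_savings(y_hist, x_hist):
    """Expected required-n reduction from CUPED: rho**2,
    estimated once on historical (pre-launch) data."""
    rho = np.corrcoef(y_hist, x_hist)[0, 1]
    return rho ** 2

def enable_cuped(y_hist, x_hist, min_savings=0.15):
    return expected_savings(y_hist, x_hist) >= min_savings

rng = np.random.default_rng(1)
pre = rng.gamma(2.0, 10.0, 50_000)                 # 30-day pre-period revenue
revenue = 0.7 * pre + rng.normal(0, 8.0, 50_000)   # autocorrelated -> enable
funnel_rate = rng.normal(0.5, 0.1, 50_000)         # no autocorrelation -> skip
print(enable_cuped(revenue, pre), enable_cuped(funnel_rate, pre))  # True False
```

Run it once per metric, store the verdicts, and leave the toggle alone during live tests.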
The four ways teams break CUPED
Most of the failures are subtle, and most of them look like wins right up until someone runs an A/A test and notices the false-positive rate is wrong.
1. Using post-experiment data as the covariate
The covariate must be measured strictly before the user could have been exposed to the variant. If $X$ includes any post-assignment behavior, then $X$ is itself affected by treatment, and subtracting it shifts the difference between variants. You will get more "significant" results — and the extra ones will be false positives.
Symptom: A/A tests start failing at well above 5%. If you're not running A/A tests as a routine check, start now.
2. Tuning θ on the test data and then evaluating significance on the same data
The standard recipe estimates $\theta$ from the same pooled (treatment + control) data the test runs on. That's fine: because $X$ is measured pre-experiment, it's independent of the treatment indicator, and the bias from estimating $\theta$ on pooled data is negligible at typical sample sizes. What's not fine is choosing $X$ in a way that depends on the observed treatment effect — picking the covariate that "makes the test pop," for instance. That's a researcher-degrees-of-freedom problem dressed up in math.
Fix: lock the covariate and the $\theta$ estimator before the test starts, and treat the choice as part of the pre-registration.
3. Forgetting that variance reduction doesn't fix bias
CUPED reduces variance. It does not fix peeking, it does not fix sample-ratio mismatch, it does not fix selection bias from non-randomized assignment. If your test was broken without CUPED, it is still broken with CUPED — just with tighter confidence intervals around the wrong answer.
4. Comparing CUPED-adjusted variant means to raw control means
Apply CUPED to both arms or to neither. The adjustment is a within-user transformation, and the test is on the adjusted outcomes across arms. Reporting a CUPED-adjusted mean for one arm next to a raw mean for the other mixes units in a way that will eventually embarrass someone.
CUPED vs stratification vs regression adjustment
CUPED is one of three related tools you'll see in the literature:
| Method | What it does | When to reach for it |
|---|---|---|
| Stratification | Block randomize on a pre-experiment variable (country, device, user tier) | Categorical covariates; small number of strata |
| CUPED | Subtract a linear function of one continuous pre-period covariate | Continuous covariate with high $\rho$ |
| Regression adjustment | Regress outcome on multiple pre-period covariates | You have several useful covariates and the engineering budget for it |
In practice CUPED with the obvious covariate (same metric, 30 days prior) gets you most of the way there. Regression adjustment with a handful of pre-period features can squeeze out another 5–15%, but the ROI on the second covariate is dramatically lower than on the first.
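To make the comparison concrete, here is a sketch of the multi-covariate version — ordinary least squares on centered pre-period features, with invented covariates and coefficients, not any specific library's routine:

```python
import numpy as np

def regression_adjust(y, X):
    """Multi-covariate analogue of CUPED: subtract the best linear
    predictor of y built from centered pre-period covariates (columns of X)."""
    Xc = X - X.mean(axis=0)
    theta, *_ = np.linalg.lstsq(Xc, y - y.mean(), rcond=None)
    return y - Xc @ theta

rng = np.random.default_rng(2)
n = 50_000
pre_revenue = rng.gamma(2.0, 10.0, n)
pre_sessions = rng.poisson(5.0, n).astype(float)
y = 0.6 * pre_revenue + 1.5 * pre_sessions + rng.normal(0, 8.0, n)

one = regression_adjust(y, pre_revenue[:, None])                    # CUPED-style
two = regression_adjust(y, np.column_stack([pre_revenue, pre_sessions]))
print(np.var(one) / np.var(y), np.var(two) / np.var(y))
# The second covariate helps, but much less than the first did.
```

With one covariate the residual-variance ratio already captures most of the win; the second column shaves off comparatively little, which is the diminishing-ROI pattern described above.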
A worked example
Setup: you're testing a new recommendation module. Your primary metric is revenue per user over a 14-day window, baseline \$50 (revenue is famously skewed). At a 5% relative MDE ($\delta = \$2.50$), α=0.05, power=0.8, the required sample size is roughly $16\sigma^2/\delta^2$ per variant.
You compute, on historical data, that revenue in the 14 days after some reference date correlates with revenue in the 30 days before at $\rho = 0.7$. Plug that into CUPED:
- Variance shrinks by a factor of $1 - \rho^2 = 0.51$, so effective $\sigma$ becomes $\sqrt{0.51}\,\sigma \approx 0.71\sigma$.
- Required $n$ becomes $0.51\times$ the original per variant — a 49% reduction.
A test that needed five weeks at your weekly traffic now needs about two and a half. If you instead held the duration fixed, you could resolve a roughly 3.5% MDE in the same five weeks.
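The arithmetic, spelled out — $\sigma$ cancels, so only $\rho$ matters:

```python
rho = 0.7
var_ratio = 1 - rho ** 2        # 0.51: multiplier on variance and on required n
weeks = 5 * var_ratio           # same traffic -> ~2.55 weeks instead of 5
mde = 0.05 * var_ratio ** 0.5   # same duration -> ~3.6% relative MDE
print(round(var_ratio, 2), round(weeks, 2), round(mde, 3))
```

Note that duration scales with $1 - \rho^2$ while MDE scales with $\sqrt{1 - \rho^2}$, which is why the same 49% cut halves the calendar but only takes the MDE from 5% to roughly 3.5%.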
What about Bayesian or sequential tests?
CUPED is orthogonal to the inference method. You can CUPED-adjust your outcomes and then run a frequentist t-test, a Bayesian posterior update, or a sequential test. Variance reduction is variance reduction — it makes whatever test you were going to run sharper. The 30–50% savings claim translates directly to "tighter posterior" or "earlier stop" depending on your framework.
CUPED on the AB SHARK roadmap
The variance-reduction protocol in
backend/app/core/variance/ already supports a no-op baseline and
stratification; CUPED is the next adapter on the list. The
contract is a single transform applied to the outcome column before
the analyzer sees it, so once it lands, every metric in the analyzer
inherits the savings without changes to the rest of the pipeline.
If you'd like to be notified when it ships — or if you want to argue about which covariate the default should be — the /plan page is the right place to start a test today, and the variance-reduction toggle will appear there first.
The one-sentence summary
If your A/B test metric is continuous, user-level, and at all autocorrelated with past behavior, CUPED is the cheapest way to halve your test duration without changing anything about how the test is designed, randomized, or analyzed. The cost is one extra column in your data warehouse. The benefit, on most consumer metrics, is six weeks turning into three.
Related reading: the sample size calculator walks through the variance term that CUPED attacks. The MDE explainer covers what to do with the sample-size budget once you have it. And the peeking post is required reading before you spend your newfound variance savings on "checking in early."