If you’ve been looking around at the state of the art in online experimentation, you’ve probably come across a technique called “CUPED” - maybe advertised as a feature by an A/B testing tool, or in research published by companies like Microsoft, Netflix, or DoorDash. It’s a deservedly popular topic, given its promise: to reduce experiment runtime, enabling experiments to conclude up to 65% faster.
One of the most common complaints in experimentation programs is around how long it takes to run experiments. The bulk of that time isn’t active work from data teams planning or analyzing experiments - it’s simply waiting for a sufficient sample size to be collected. Unless you work at a company with FAANG-scale user traffic, an experiment with standard statistical parameters likely takes over a month to collect sufficient data. Several months, sometimes.
The necessary sample size for an experiment (at a given power level) really boils down to two variables: the minimum effect size you care about detecting, and the variance in your measured outcome. Before Eppo became the first commercial experimentation platform to offer CUPED, if you wanted to run experiments faster, your options were pretty much limited to increasing that minimum detectable effect, inflating your false-negative rate.
But what if you had a magic wand that would make experiments run faster, with no tradeoffs at all? That’s what CUPED promises, by reducing that other variable - variance.
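To see why variance is such a powerful lever, here’s a quick back-of-the-envelope sketch - standard textbook power analysis, not Eppo’s engine - of how required sample size scales with both variables:

```python
# Standard two-sample power calculation: per-arm sample size for a
# two-sided difference-in-means test, as a function of outcome variance
# and the minimum detectable effect (MDE). Illustrative numbers only.
from scipy.stats import norm

def sample_size_per_arm(variance, mde, alpha=0.05, power=0.8):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * variance * (z_alpha + z_power) ** 2 / mde ** 2

baseline = sample_size_per_arm(variance=1.0, mde=0.05)
reduced = sample_size_per_arm(variance=0.5, mde=0.05)  # 50% variance reduction
print(f"{baseline:,.0f} users/arm vs. {reduced:,.0f} users/arm")
# Halving the variance halves the required sample size, and the runtime.
```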
If you’re looking into utilizing CUPED, it’s important to understand what it is, exactly, and where it can successfully be applied (hint: not everywhere). In this article, we’ll walk through what exactly CUPED is, why it can be so challenging to implement (both for in-house teams and commercial vendors), and what’s so special about Eppo’s “CUPED++” implementation.
In 2013, a team from Microsoft led by Alex Deng published a paper introducing a method called Controlled-experiment Using Pre-Experiment Data - CUPED for short - that could speed up experiments with no tradeoff required. With this method, Microsoft could bend time, making experiments that typically took 8 weeks take only 5-6 weeks. Since that paper, the method has gone mainstream.
CUPED, at its core, is a variance reduction technique. It leverages historical data about your users to reduce noise in the observations made in your experiment. In other words, if we know some pre-experiment data about user behavior for a certain metric, we can use that to decrease our uncertainty about the estimated means of said metric in each experiment variation.
Suppose that you are McDonald’s and want to run an experiment to see if you can increase the number of Happy Meals sold by including a menu in Spanish. Some locations reliably sell more Happy Meals than others, and that baseline variation is noise that has nothing to do with the new menu. Because you already know each location’s historical Happy Meal sales, CUPED can subtract that predictable variation out of the experiment data, reducing time-to-significance. Just like noise-canceling headphones, CUPED takes out ambient effects to help the experimenter detect impact more clearly.
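To make that concrete, here’s a minimal sketch of the core CUPED adjustment, assuming a simplified setting: one pre-experiment covariate (each unit’s historical sales, standing in for the Happy Meal example) and one in-experiment metric. The variable names are illustrative, not Eppo’s API:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(50, 10, n)                      # pre-experiment behavior
treated = rng.integers(0, 2, n).astype(bool)   # random assignment
y = x + rng.normal(0, 5, n) + 0.5 * treated    # in-experiment metric

# theta = cov(X, Y) / var(X): the slope of a simple regression of Y on X
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_cuped = y - theta * (x - x.mean())           # variance-reduced metric

print("raw variance:  ", y.var())              # ~125
print("cuped variance:", y_cuped.var())        # ~25
print("raw lift:  ", y[treated].mean() - y[~treated].mean())
print("cuped lift:", y_cuped[treated].mean() - y_cuped[~treated].mean())
```

The adjusted metric has the same expected lift, but far less variance, so the same experiment reaches significance sooner.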
In visual terms, before applying CUPED, the treatment-versus-control estimates come with wide, overlapping uncertainty bands. After applying CUPED, the uncertainty around the averages decreases, and the difference between variations becomes visible sooner.
The net effect of sharpening these measurements is that it takes less time to figure out whether an experiment is having a positive impact or not. CUPED is bending time with math – alleviating the largest pain point in experimentation today.
In business terms, a long-running experiment is the same as a long-delayed decision. Organizations that can learn from experiments and react quickly enjoy a tactical advantage over their competitors.
Besides being a decisional dead weight, long-running experiments have other technical and cultural ramifications: think of the technical debt incurred by keeping the two code paths running for several weeks or months. Think about whether anyone at the company will be excited to run an experiment if the result is slower than a container ship crossing the Pacific Ocean. Think of all the small wins and little ideas that will go unmeasured and unimplemented because experiments are just too slow.
The scarcest resource in experimentation today isn't tooling or even technical talent. The scarcest resource is time.
Experimentation speed is about creating a feedback loop so that good ideas lead to even better ideas and misguided ideas lead to rapid learning. The faster experiments finish, the tighter that feedback loop gets, creating a compound interest effect on good ideas.
CUPED can be challenging to implement both because it has a limited scope of potential applications, and because it’s computationally expensive.
When it comes to determining if you have a potential use case in the first place, remember that the key to CUPED lies in that “pre-experiment” piece. If you are experimenting on users (or other experimental units) that don’t already interact with you, or don’t interact with you in a way that might predict future behavior, there isn’t going to be the requisite pre-experiment data. This means that traditional CUPED implementations are no help when testing around things like onboarding flows or new users. (This is what Eppo’s CUPED++ approach solves for - more on that in the next section).
That data also needs to be accessible to your experiment tool, which is why historically commercial software vendors were unable to offer CUPED as a feature. Eppo inherently solves for this as the only experimentation platform that’s 100% data-warehouse native, which is why we were the first platform to offer CUPED as well. But if you use a tool that requires you to send events to it, the likelihood of it having reliable pre-experiment data stored and accessible is low.
CUPED is also computationally expensive. More data needs to be ingested - once a user is assigned to an experiment, a pipeline must fetch that user’s historical data from a reasonable window of recent time. But more importantly, computing the CUPED-adjusted means involves linear regressions, which take longer to execute than simple averages.
We ran into this ourselves in our CUPED beta at Eppo - as customers would try to apply CUPED to large data sets, we started hitting dreaded OOMKilled errors. Out of memory. To build a scalable solution, we developed a new approach to the computation in pure SQL, described in depth by Eppo Statistics Engineer Evan Miller in a 2023 QCon conference talk.
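To give a flavor of why SQL helps (a sketch of the general idea, not Eppo’s actual queries): the coefficient behind the simple CUPED adjustment, theta = cov(X, Y) / var(X), depends only on a handful of running sums, each of which maps to a plain SQL aggregate. The warehouse can compute them in a single pass, so per-user rows never need to land in application memory:

```python
import numpy as np

def theta_from_aggregates(n, sum_x, sum_y, sum_xy, sum_xx):
    """theta = cov(X, Y) / var(X), computed from one-pass aggregates:
    the equivalents of COUNT(*), SUM(x), SUM(y), SUM(x*y), SUM(x*x)."""
    cov_xy = sum_xy / n - (sum_x / n) * (sum_y / n)
    var_x = sum_xx / n - (sum_x / n) ** 2
    return cov_xy / var_x

# Tiny stand-in for the aggregates a warehouse query would return:
x = np.array([3.0, 7.0, 5.0, 9.0])
y = np.array([4.0, 8.0, 6.0, 10.0])
print(theta_from_aggregates(len(x), x.sum(), y.sum(), (x * y).sum(), (x * x).sum()))
# 1.0, since y = x + 1 in this toy data
```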
If you lack the specific pre-experiment data required to leverage CUPED, what else might be available to you? Inspired by a deep dive into one of the mathematical foundations of CUPED (dating all the way back to 1933), the Eppo statistics engineering team noticed a missed opportunity for many teams. In most implementations, CUPED reduces the variance of a metric using pre-experiment data on that same metric, weighted by the covariance between the two - equivalent to running a simple regression. However, it’s also possible (as the original paper discusses) to include a full vector of covariates - other experiment metrics, or all treatment assignments (i.e., all experiments a user has been bucketed into) - to reduce variance even further, or to recover variance reduction in cases where pre-experiment data is lacking.
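As a rough illustration of the multivariate idea (hypothetical variable names, and the general regression-adjustment technique rather than Eppo’s implementation), the single pre-experiment covariate simply becomes a covariate matrix:

```python
import numpy as np

def residualize(y, covariates):
    """Multivariate CUPED-style adjustment: regress the metric on a
    covariate matrix and keep the residuals, plus the overall mean so
    the metric's scale is preserved."""
    X = np.column_stack([np.ones(len(y)), covariates])  # add intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta + y.mean()

rng = np.random.default_rng(1)
n = 5_000
other_metric = rng.normal(size=n)          # e.g. pre-experiment sessions
past_assignment = rng.integers(0, 2, n)    # e.g. bucketing in a prior experiment
y = 2 * other_metric + 0.3 * past_assignment + rng.normal(size=n)

y_adj = residualize(y, np.column_stack([other_metric, past_assignment]))
print("raw variance:", round(y.var(), 2), "| adjusted:", round(y_adj.var(), 2))
```

One caveat baked into the math: the covariates must be unaffected by the current experiment’s treatment, which is why pre-experiment values and assignments from other experiments are the natural candidates.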
You can read more about what Eppo’s CUPED++ makes possible in a white paper on Eppo’s statistics engine from MIT’s Conference on Digital Experimentation.
---
Although it’s already a decade old, CUPED remains one of the most exciting innovations in how we statistically analyze digital experiments. For most of that decade, it was an approach available only to giant tech companies with large experimentation platform teams, given the difficulty of implementation - and the roadblocks preventing legacy commercial tools from offering it. With Eppo’s first-in-class warehouse-native experimentation platform and the development of CUPED++, we’ve made variance reduction available to more companies, and more use cases, than ever before.