TL;DR:
I think the most neglected topic in experimentation discourse is experiment duration and the levers you can pull to shorten it. When I talk to companies operating at sub-Facebook volumes of traffic, so many of their problems are rooted in the long durations needed to converge business metrics.
To illustrate, each of these challenges in experimentation is caused by or exacerbated by long experiment durations:
The good news is that there are levers to pull. The most versatile is a technique called CUPED, which is similar to the lead scoring you see in marketing, but applied to A/B experimentation. When we implemented CUPED at Airbnb, we were able to decrease experiment runtime by as much as 20-30%.
It works like this: for each customer in the experiment, you make a guess on how likely they are to make a purchase. It turns out that experiments run faster if you measure Purchases - f(Guess) instead of Purchases. An illustrative example is included as a footnote for those interested (1).
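To make the mechanics concrete, here is a minimal sketch of the CUPED adjustment in Python. The data and column meanings are illustrative assumptions rather than a prescribed implementation; the coefficient theta is the standard variance-minimizing choice, Cov(Y, X) / Var(X).

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED-adjusted metric: y - theta * (x - mean(x)).

    y: the metric observed during the experiment (e.g. purchases per customer)
    x: the pre-experiment "guess" / covariate (e.g. each customer's past purchases)
    theta is chosen to minimize variance: Cov(y, x) / Var(x).
    """
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Illustrative simulated data: pre-experiment behavior predicts in-experiment purchases.
rng = np.random.default_rng(0)
pre_purchases = rng.poisson(2.0, size=10_000)
purchases = rng.poisson(0.5 + 0.4 * pre_purchases)

adjusted = cuped_adjust(purchases, pre_purchases)
print("variance before CUPED:", purchases.var())
print("variance after CUPED: ", adjusted.var())
# Lower variance means tighter confidence intervals, and therefore shorter experiments.
```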
The problem with CUPED is that only mature companies have the resources to implement it. CUPED has the same technical barriers as machine learning, complete with point-in-time data pipelines, offline simulation, and model calibration. The result is that the biggest and most valuable companies in the world receive an extra advantage of shorter experiment durations, while startups that desperately need every advantage they can get struggle to run experiments on low traffic.
Besides CUPED (and its cousin, quantile regression), there are other methods that help lower runtime as well.
At Eppo, we believe that the value of experimentation at scale shouldn't be limited to the companies that can afford PhD Data Scientists and 20-person experimentation platform teams. Eppo provides CUPED out of the box to all our customers, along with a variety of other variance reduction techniques to shorten runtime.
Besides these advanced techniques, there's an easier lever for lowering experiment runtimes: your choice of metric.
The first way is to reframe your core metrics to be yes/no instead of counts. Instead of counting "sums", count "uniques". For example, # subscription upgrades (where a customer might make 1, 2, 3, ... 100+ purchases) will make experiments run much longer than # customers who upgrade (where a customer either made a purchase or didn't).
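The reason is variance: a handful of customers with 100+ purchases inflates the variance of a "sum" metric, and the sample size an experiment needs scales with that variance. Here is a minimal sketch of how the two framings compare; the simulated numbers and distributions are assumptions chosen only to show the effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Simulated customers: most upgrade 0-1 times, roughly 1% are heavy upgraders.
upgrades = rng.poisson(0.3, size=n)
heavy = rng.random(n) < 0.01
upgrades = upgrades + heavy * rng.poisson(30, size=n)

upgraded = (upgrades > 0).astype(float)   # yes/no "unique" version of the same metric

# Required sample size is proportional to variance / (detectable effect)^2;
# compare both metrics at the same relative lift (constants like z-scores cancel).
def relative_sample_size(values, relative_lift=0.02):
    return values.var() / (relative_lift * values.mean()) ** 2

print("relative n, counting sums:   ", round(relative_sample_size(upgrades)))
print("relative n, counting uniques:", round(relative_sample_size(upgraded)))
```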
The second way is to pick a different metric, one that is on the path to the outcome you want. The most famous example is Facebook's 7 friends in 10 days metric, which converges experiments more quickly than long-term retention. For companies whose north stars are too delayed to be statistically massaged into a reasonable timeframe, these metric "indicators" become a necessity.
Unfortunately, finding indicators again requires a specialized skill set. The process is written up in the Quora post, but it boils down to (a) creating a dataset with a bunch of candidate indicators, (b) running a kitchen-sink regression with every candidate, and (c) seeing which ones are most predictive. It's tricky to execute: if you're not careful, it's easy to find a spurious pattern that doesn't hold up. But when you succeed, you have a metric that can shorten experiment time dramatically while still delivering ROI.
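As an illustration of steps (b) and (c), here is a minimal sketch in Python. The column names and simulated data are hypothetical, and the approach shown (a logistic regression of a slow outcome on early-behavior candidates, evaluated on held-out data) is one reasonable way to run the kitchen-sink step, not necessarily the exact process from the Quora write-up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical dataset: one row per new user, candidate indicators measured in week 1,
# and the slow-to-converge outcome (6-month retention) as the label.
# Simulated here purely so the sketch runs; in practice this comes from your warehouse.
rng = np.random.default_rng(2)
n = 50_000
df = pd.DataFrame({
    "friends_added_wk1": rng.poisson(3, n),
    "sessions_wk1": rng.poisson(5, n),
    "posts_wk1": rng.poisson(1, n),
})
logit = -2 + 0.5 * df["friends_added_wk1"] + 0.05 * df["sessions_wk1"]
df["retained_6mo"] = rng.random(n) < 1 / (1 + np.exp(-logit))

candidates = ["friends_added_wk1", "sessions_wk1", "posts_wk1"]
X_train, X_test, y_train, y_test = train_test_split(
    df[candidates], df["retained_6mo"], test_size=0.3, random_state=0
)

# (b) kitchen-sink regression with every candidate indicator...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# (c) ...then check which candidates are most predictive, on held-out data,
# so that spurious patterns don't survive.
# (For a real analysis you'd also standardize the candidates before comparing coefficients.)
print("held-out AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
for name, coef in zip(candidates, model.coef_[0]):
    print(f"{name:>20}: {coef:+.3f}")
```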
Both approaches have drawbacks: in an ideal world you'd use the metric that best matches business goals, and finding good indicators takes research time to run a bunch of regressions. But they offer a path forward for low-volume startups to adopt an experimentation strategy.
There's one last technique for lowering experiment runtime: avoiding the bugs and mistakes that force you to restart the experiment. It's unfortunately all too common for experiment assignment infra to have issues, for crucial data to go untracked, or for a bug to break the test on a specific browser. For all of the time poured into experiment execution, it remains an incredibly brittle process.
Today's commercial experimentation tools do us no favors here. They lack the diagnostic and investigative capabilities to notice when something has gone awry, and instead assume that some PM will constantly refresh experiment results to catch mistakes.
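As one example of the kind of automated diagnostic that would help (my example, not something the post prescribes): a sample ratio mismatch check, which flags assignment counts that drift from the intended split and is often the first visible symptom of broken assignment infra or missing tracking.

```python
from scipy.stats import chisquare

def srm_check(observed_counts, expected_ratios, alpha=0.001):
    """Flag a sample ratio mismatch: a chi-square test of the observed
    assignment counts against the split the experiment was configured with."""
    total = sum(observed_counts)
    expected = [total * r for r in expected_ratios]
    _, p_value = chisquare(observed_counts, f_exp=expected)
    return p_value < alpha, p_value

# Example: a 50/50 test that somehow assigned 50,700 vs 49,300 users.
flagged, p = srm_check([50_700, 49_300], [0.5, 0.5])
print(f"SRM detected: {flagged} (p = {p:.2g})")
```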
While advanced statistics and metric choices are helpful, it's always good to remember that the shortest experiment is the one that executes cleanly.
At Eppo, we recognize that purpose-built technology and powerful statistics shouldn't just belong to Facebook. It's actually the companies who are early in their experimentation maturity who most need this support.
Reducing experiment duration simultaneously eases a host of other problems. Whether through advanced statistical techniques, guided metric choices, or more robust experiment design processes, we want to help companies quickly get ROI from their experimentation practices.
Interested in hearing more about what Eppo's A/B testing tool can do for your practice? Email me at che@geteppo.com. We'd love to chat!
1. For example, if you're McDonald's, you can probably make some smart guesses about whether each customer will buy a Happy Meal. People with kids are more likely than teenagers. People arriving for breakfast will probably get an Egg McMuffin instead. Or even more simply, people who previously purchased Happy Meals will purchase more Happy Meals. With these guesses in hand, you can then calculate (# Happy Meals) - (X * guessed # Happy Meals).
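For readers who want the footnote stated precisely: the variance-minimizing choice of the multiplier X (usually written theta) is a standard CUPED result,

$$\theta^{*} = \frac{\operatorname{Cov}(Y, G)}{\operatorname{Var}(G)}, \qquad Y_{\mathrm{CUPED}} = Y - \theta^{*}\,(G - \bar{G}),$$

where Y is # Happy Meals and G is the guessed # Happy Meals. The adjusted metric has variance Var(Y)(1 - rho^2), where rho is the correlation between the guess and the outcome, which is why better guesses translate directly into shorter experiments. (Centering by the mean of G only shifts the metric; the shift cancels when comparing treatment to control.)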