“This A/B test improved conversions by 200%!”
Anyone active in the experimentation community has likely seen a claim like this, followed inevitably by a dissenter invoking a concept called Twyman's law. (Section 2 of the paper A/B Testing Intuition Busters provides a well-known example of this.)
Per Wikipedia, Twyman's law states that “Any figure that looks interesting or different is usually wrong.” The idea is that we should be skeptical of such large reported lifts because they are surprising. Anyone with experience running A/B tests is accustomed to seeing much smaller treatment effects — typically below 10%. A 200% lift should definitely raise some eyebrows and lead one to question the result.
Although most experimentation practitioners have a healthy skepticism of large lifts, it is unfortunately common to use Bayesian A/B testing tools that do not share that skepticism and that, in fact, make assumptions that essentially throw Twyman's law out the window.
A common approach in Bayesian A/B testing tools or calculators is to use uninformed priors, which means that we utilize no prior knowledge about the treatment effect. This approach assumes that all treatment effects are equally likely. In other words, our test is just as likely to produce a 2% lift as it is a 200% lift. This approach is completely inconsistent with Twyman's law.
When we see a reported lift of 200%, we are quick to express skepticism of the result. Despite this, many Bayesian calculators explicitly make the dubious statistical assumption that an unrealistic lift is just as likely as a realistic one.
On Outperform, we have previously dubbed the uninformed prior approach the Bayesian imposter because it provides numbers identical to a frequentist analysis with misleading Bayesian language. In practice, many users of these tools end up embracing a policy that is essentially the same as frequentist p-hacking. The guise of Bayesian analysis leads them to believe they are sidestepping frequentist pitfalls like peeking and multiple comparisons, but in reality, they are falling into these since the uninformed prior has them effectively working with p-values.
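To see the equivalence concretely, here is a minimal sketch assuming a normal likelihood for the observed lift and a flat prior; the observed lift and standard error are illustrative numbers, not figures from any real test. The "chance to beat control" it reports is exactly one minus the one-sided p-value.

```python
import numpy as np
from scipy import stats

# Illustrative numbers (not from the article): observed relative lift and its standard error.
observed_lift = 0.04
standard_error = 0.025

# Frequentist one-sided p-value for H0: lift <= 0.
z = observed_lift / standard_error
p_one_sided = 1 - stats.norm.cdf(z)

# "Bayesian" analysis with a flat (uninformed) prior and a normal likelihood:
# the posterior is N(observed_lift, standard_error^2), so the
# "chance to beat control" is P(lift > 0 | data).
chance_to_beat = 1 - stats.norm.cdf(0, loc=observed_lift, scale=standard_error)

print(f"one-sided p-value:      {p_one_sided:.4f}")
print(f"1 - p-value:            {1 - p_one_sided:.4f}")
print(f"chance to beat control: {chance_to_beat:.4f}")  # identical to 1 - p-value
```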
Renowned Bayesian statistician Andrew Gelman has also warned against the dangers of noninformative priors in “any setting where the prior information really is strong, so that if you assume a flat prior, you can get silly estimates simply from noise variation.” This is the exact setting of A/B testing. The range of true lifts is well known, but business metrics are noisy, which leads to silly estimates that make the Twyman's law scenario so common.
One of the hallmarks of a good prior distribution is that we would willingly rely on it to inform bets of real money. Imagine a prior you’d like to use for Bayesian A/B test analysis. A simple bet test for that prior is as follows:
1. Compute the 50% credible interval for the relative lift implied by the prior.
2. Before the true lift is revealed, call whether it will land “inside” or “outside” that interval. A correct call wins you $100.
3. If the prior reflects what you actually believe, neither call should be obviously better than the other. If one call is a no-brainer, the prior is not one you would genuinely bet on.
Let's try this bet test with the uninformed prior as an example. In that case, the 50% credible interval spans relative lifts from -∞ to ∞. [1] We know any treatment effect must fall in this range, so we would obviously call “inside” and take the $100. This call is a no-brainer, which shows that the uninformed prior is not sensible.
Another common approach is to use informed priors but specify separate priors for each variant. Advocates of this approach typically recognize that a key advantage of the Bayesian approach is the ability to leverage prior information, but they do so in a misguided way that fails to unlock the benefits of a proper Bayesian approach, such as shrinkage. As we will show, it also fails the bet test.
For example, if we are running a test of conversion rate, we may know that conversion rates observed in A/B tests for our product are typically around 2-3%. That seems like reasonable information to leverage in our priors. The typical approach would be to use an informed beta distribution prior because of conjugacy. For example, using Beta(15, 600) provides a prior that aligns with the known range of our conversion rates.
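As a quick sanity check on what such a prior implies, here is a short sketch using SciPy (the Beta(15, 600) parameters are the ones from the example above):

```python
from scipy import stats

# The Beta(15, 600) prior from the example above, placed on a single variant's conversion rate.
prior = stats.beta(15, 600)

print(f"prior mean:        {prior.mean():.4f}")        # roughly 0.024
low, high = prior.ppf([0.025, 0.975])
print(f"central 95% range: {low:.4f} to {high:.4f}")   # most mass between roughly 1% and 4%
```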
The trap lies in what we do next. Since we have defined our prior over the conversion rate, we really have only defined a prior for a single variant. The natural approach from here is to use this same prior for each variant in an A/B test separately and then update both priors with test data.
This may seem reasonable, but we are actually making a subtle independence assumption that leads to nonsensical results. To recognize this, think about the reasons for the variation in conversion rate. One likely factor is seasonality — conversion rates tend to be higher around certain times of the year compared to others. Another factor might be a recent promotion that has led many high-intent users into the funnel. There are many factors like these that influence the conversion rates of all variants. If we know that conversion rates for the control group are on the high end of our prior, we should expect the same for the treatment group. In other words, we should expect the conversion rates between the control and treatment groups to be correlated.
However, our prior structure completely misses this effect. Instead of viewing the prior as a univariate beta distribution, we should view it as a joint distribution over the conversion rates of both the treatment and the control. We can visualize this joint prior with a scatterplot of independent draws from the two prior distributions.
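Here is a minimal sketch of how those independent draws can be generated and plotted, assuming NumPy and Matplotlib; the number of draws is arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_draws = 5_000

# Independent Beta(15, 600) priors for control and treatment, as described above.
control_rate = rng.beta(15, 600, size=n_draws)
treatment_rate = rng.beta(15, 600, size=n_draws)

plt.scatter(control_rate, treatment_rate, s=4, alpha=0.3)
plt.xlabel("control conversion rate (prior draw)")
plt.ylabel("treatment conversion rate (prior draw)")
plt.title("Independent priors: no correlation between variants")
plt.show()
```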
As we see in the scatterplot, many draws from the joint prior represent relative lifts exceeding 50%, which would make most experimenters invoke Twyman's law. With this prior, we are explicitly assuming such scenarios are likely outcomes.
In reality, we should expect the joint prior to look more like the following scatterplot. In this case, if we know the true conversion rate for the control group, we have a fairly confident sense of the conversion rate for the treatment group as well because we know that most true lifts are less than 10%.
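One simple way to construct such a correlated joint prior is to keep a prior on the control conversion rate and place a second prior directly on the relative lift. The Normal(0, 0.05) lift prior below is an illustrative assumption chosen to keep most lifts under 10%, not a value taken from the article:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_draws = 5_000

# Shared prior on the control conversion rate, plus a prior placed directly
# on the relative lift. The Normal(0, 0.05) choice is illustrative: it puts
# roughly 95% of its mass on lifts between -10% and +10%.
control_rate = rng.beta(15, 600, size=n_draws)
relative_lift = rng.normal(0.0, 0.05, size=n_draws)
treatment_rate = control_rate * (1 + relative_lift)

plt.scatter(control_rate, treatment_rate, s=4, alpha=0.3)
plt.xlabel("control conversion rate (prior draw)")
plt.ylabel("treatment conversion rate (prior draw)")
plt.title("Correlated joint prior: variants move together")
plt.show()
```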
We can then apply the bet test to the independent prior shown in the first scatterplot. First, we must plot the draws from the prior as a histogram of relative lifts, as shown below.
Now, we can calculate the 50% credible interval by evaluating the 25th and 75th percentiles of this distribution, which yields an interval of -18.8% to 25.2%. If I could win $100 by correctly calling “inside” or “outside,” I would call “inside” without thinking twice. Any true relative lift outside of this interval is an extremely rare occurrence.
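For reference, here is one way such an interval can be computed from prior draws; the exact endpoints depend on the prior and the random draws:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Implied relative lift under independent Beta(15, 600) priors for each variant.
control_rate = rng.beta(15, 600, size=n_draws)
treatment_rate = rng.beta(15, 600, size=n_draws)
relative_lift = treatment_rate / control_rate - 1

# 50% credible interval: the 25th and 75th percentiles of the implied lift.
low, high = np.percentile(relative_lift, [25, 75])
print(f"50% credible interval for the relative lift: {low:+.1%} to {high:+.1%}")
```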
If you decide to use a Bayesian methodology for running A/B tests, it is important to be thoughtful about the prior. We argue that A/B testing is a domain with strong prior knowledge, as evidenced by the frequent invocation of Twyman's law in the experimentation community.
Priors should ideally be informed by a meta-analysis of previous experiments using a sampling-noise-aware methodology like a Bayesian hierarchical model. Short of that, the prior should be consistent with the known magnitude of A/B test treatment effects in order to provide proper shrinkage and meaningful “chance to beat” assessments. In order to leverage the benefits of the Bayesian approach, the prior should also be placed directly on the treatment effect rather than on the means of the separate variants in the test.
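As an illustration of what placing the prior directly on the treatment effect buys you, here is a minimal sketch of a normal-normal conjugate update; the prior scale, observed lift, and standard error are all illustrative assumptions rather than recommended values.

```python
import numpy as np
from scipy import stats

# Illustrative normal prior placed directly on the relative treatment effect,
# e.g. informed by a meta-analysis of past experiments (values are assumptions).
prior_mean, prior_sd = 0.0, 0.05

# Observed lift and its standard error from the current experiment (illustrative).
observed_lift, standard_error = 0.20, 0.08

# Normal-normal conjugate update: the posterior mean is a precision-weighted
# average of the prior mean and the observed lift (this is the shrinkage).
prior_precision = 1 / prior_sd**2
data_precision = 1 / standard_error**2
posterior_var = 1 / (prior_precision + data_precision)
posterior_mean = posterior_var * (prior_precision * prior_mean + data_precision * observed_lift)

posterior = stats.norm(posterior_mean, np.sqrt(posterior_var))
print(f"posterior mean lift:    {posterior_mean:+.1%}")      # pulled well below the observed +20%
print(f"chance to beat control: {1 - posterior.cdf(0):.2f}")
```

With these numbers, the posterior mean is pulled strongly toward the prior, which is exactly the kind of shrinkage a Twyman's-law-aware analysis should apply to a surprising reading like +20%.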
Thanks to Sven Schmit, Demetri Pananos, Lukas Goetz-Weiss, and Ryan Lucht for their helpful comments and discussion on this article.
[1] Take the uniform distribution on [-U, U]; the 50% credible interval is then [-U/2, U/2]. Letting U → ∞, the credible interval also grows to cover all of (-∞, ∞).