
Statistics is a complex field, filled with potentially unverifiable assumptions and a nuanced interplay between models and the data used to fit them.
As such, I’m highly skeptical when self-proclaimed “experts in statistics” publish short think pieces claiming that statistics can be neatly classified without regard for nuance — for example, that the difference between Bayesian and Frequentist analysis is negligible in A/B testing.
First, this debate has been explored so extensively that I doubt further discussion will yield new insights. Second, when these so-called experts do weigh in, their arguments tend to fall into one of four common categories:

- "With a flat prior, Bayesian and frequentist analyses give the same answer anyway."
- "With enough data, the likelihood will dominate the prior, so the prior doesn't really matter."
- "Credible intervals end up agreeing with confidence intervals, so the distinction is moot."
- "Informative priors are subjective and can be used to manipulate experiment outcomes."
Despite the extensive literature addressing them, these remain persistent talking points. While those who repeat them likely have good intentions, they often construct straw-man versions of Bayesian analysis that are too easily dismissed. Though these arguments can be stated succinctly, addressing them properly takes more time (owing mostly to Brandolini's law). This post examines each point in turn and considers what is true, what is false, and what deserves more attention.
Let’s take a moment to review how Bayesian statistics works. Most Bayesian analyses start with a prior over the unknown parameter(s). The prior is intended to convey one’s knowledge about the parameter(s) before (or prior to, hence the name) running the experiment. A prior is said to be “informative” if we know a lot about the parameter(s) before the experiment and can therefore specify a very narrow prior, one whose probability is concentrated around a small set of values. If we do not have much pre-experiment knowledge about the parameter(s), we can make the prior more variable (in essence, wider) to account for this uncertainty. Once data is collected, Bayesians update their prior by combining their pre-experiment beliefs with the evidence from the experiment.
This new set of beliefs is the posterior distribution (posterior because it comes after having collected data). The posterior can then be used to summarize the new knowledge about the parameter(s) of interest. For more on Bayesian analysis, we recommend reading Eppo’s docs. For some models and under certain parameterizations, it is true that a “flat” prior will result in inferences similar to a frequentist analysis, and perhaps those models are all we intend to use.
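To make this concrete, here is a minimal sketch of a conjugate Normal-Normal update for a lift parameter. This is not Eppo's implementation, and the prior and data values are made up; the point is simply that as the prior variance grows (a "flatter" prior), the posterior mean collapses onto the frequentist point estimate.

```python
def normal_posterior(prior_mean, prior_var, lift_hat, lift_var):
    """Conjugate Normal-Normal update for a single lift parameter.

    prior_mean, prior_var: pre-experiment beliefs about the lift
    lift_hat, lift_var:    frequentist point estimate of the lift and its sampling variance
    Returns the posterior mean and posterior variance.
    """
    post_var = 1.0 / (1.0 / prior_var + 1.0 / lift_var)
    post_mean = post_var * (prior_mean / prior_var + lift_hat / lift_var)
    return post_mean, post_var

# Illustrative numbers: a 3% observed lift with a 1% standard error.
lift_hat, lift_var = 0.03, 0.01**2

# Informative prior centered at 0% lift: the posterior mean is pulled halfway to 0.
print(normal_posterior(0.0, 0.01**2, lift_hat, lift_var))   # ~ (0.015, ...)

# Near-flat prior: the posterior mean is essentially the frequentist estimate.
print(normal_posterior(0.0, 10.0**2, lift_hat, lift_var))   # ~ (0.030, ...)
```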
However, Bayesian statistics is not limited to these types of models, even in the context of A/B testing. As an example, one could (and perhaps should) use hierarchical Bayesian modeling to partially pool estimates of lift, obtaining an estimate of aggregate impact while also combating exaggerated effects due to the winner’s curse, similar to Apple’s approach.
We at Eppo are very aware of these benefits and use similar concepts in our Bayesian Aggregate Impact Estimator. Using a flat prior in a hierarchical model would remove all benefits from this approach and, in fact, could “pull estimates apart,” resulting in implausible estimates, as Andrew Gelman describes in a 2013 blog post. In the case of hierarchical Bayesian models, a flat prior has the opposite effect of what one would want or expect.
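As a rough illustration of what partial pooling does, the sketch below shrinks several experiments' lift estimates toward a common mean using an empirical-Bayes normal hierarchy. This is a simplified stand-in, not Eppo's Aggregate Impact Estimator, and the lift estimates and standard errors are made up.

```python
import numpy as np

def partial_pool(lifts, std_errs):
    """Empirical-Bayes shrinkage under a normal hierarchy.

    lifts:    per-experiment lift point estimates
    std_errs: their standard errors
    Returns the estimated grand mean and the partially pooled lift estimates.
    """
    lifts, v = np.asarray(lifts), np.asarray(std_errs) ** 2
    # Precision-weighted grand mean across experiments.
    w = 1.0 / v
    grand_mean = np.sum(w * lifts) / np.sum(w)
    # Method-of-moments estimate of the between-experiment variance (tau^2).
    q = np.sum(w * (lifts - grand_mean) ** 2)
    tau2 = max(0.0, (q - (len(lifts) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    # Shrink each estimate toward the grand mean; noisier estimates shrink more.
    shrinkage = tau2 / (tau2 + v)
    pooled = grand_mean + shrinkage * (lifts - grand_mean)
    return grand_mean, pooled

# Made-up lifts from five experiments. The extreme 9% lift (which also has the
# largest standard error) gets pulled in the most, which is exactly the
# winner's-curse correction that partial pooling provides.
grand_mean, pooled = partial_pool([0.01, 0.02, -0.005, 0.09, 0.015],
                                  [0.01, 0.015, 0.01, 0.03, 0.02])
print(grand_mean, pooled)
```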
For some models, it is true that the influence of the prior on the posterior decreases as the sample size increases. However, the rate at which this happens depends on the prior used — more informative priors require more data to remove their influence.
This is straightforwardly seen in Eppo’s Bayesian estimate of the lift. The posterior estimate of the lift is the product of two things: the frequentist point estimate of the lift, $\hat{\mu}_\Delta$, and a term involving the variance from the prior, $\sigma^2_{prior}$, and the variance estimate of the lift from the data, $\sigma^2_{\Delta}$. The posterior mean is

$$ \mu_{post} = \hat{\mu}_\Delta \left( \frac{\sigma^2_{prior}}{\sigma^2_{prior} + \sigma^2_\Delta} \right). $$
Note that $\sigma^2_\Delta$ is inversely proportional to the sample size, so as we collect more data, $\sigma^2_\Delta$ becomes smaller. Consequently, the term in parentheses gets closer to 1 and the posterior mean approaches $\hat{\mu}_\Delta$.
The rate of convergence depends on the variance from the prior. When the prior variance is small, we need more data to overwhelm the prior! To say “the likelihood will eventually dominate the prior”, while true, misses a lot of nuance about how much data is needed. Depending on the prior, we might not be able to collect enough data in a reasonable amount of time for the likelihood to dominate, leaving us with an estimate that is somewhere between 0% (or more precisely, whatever the prior mean is) and $\hat{\mu}_\Delta$. In a sense, the Bayesian approach regularizes the estimate of the lift, biasing it towards the prior mean, or in Eppo’s case, 0% lift.
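To get a feel for how much data it takes to "overwhelm" a prior, the sketch below computes the shrinkage term $\sigma^2_{prior}/(\sigma^2_{prior} + \sigma^2_\Delta)$ as the sample size grows. The per-user variance and the prior widths are made-up numbers chosen only for illustration.

```python
# How quickly the likelihood dominates depends on the prior width.
# sigma2_delta shrinks like (per-user variance) / n, so the shrinkage term
# sigma2_prior / (sigma2_prior + sigma2_delta) approaches 1 at a rate set by the prior.
per_user_var = 1.0  # made-up variance of a single user's contribution to the lift estimate

for prior_sd in (0.001, 0.01, 0.1):            # narrow -> wide prior on the lift
    sigma2_prior = prior_sd ** 2
    for n in (1_000, 10_000, 100_000, 1_000_000):
        sigma2_delta = per_user_var / n         # sampling variance of the lift estimate
        shrink = sigma2_prior / (sigma2_prior + sigma2_delta)
        print(f"prior_sd={prior_sd:<6} n={n:<9} shrinkage={shrink:.3f}")
```

With these numbers, the narrowest prior still keeps the posterior mean halfway between the prior mean and $\hat{\mu}_\Delta$ even at a million users, while the widest prior is essentially irrelevant after a few thousand. That is the nuance the "likelihood eventually dominates" slogan hides.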
This regularization isn’t particularly bad; in fact, it has some real benefits, such as tempering the exaggerated lift estimates behind the winner’s curse and keeping noisy, small-sample results from being taken at face value.
By now, we have established that dominance of the likelihood over the prior depends on the prior and on how much data you can collect in a reasonable amount of time. This, in turn, affects lift estimates and credible intervals, since both are functions of $\sigma^2_{prior}$ and $\sigma^2_\Delta$, so this talking point is moot in light of the previous section. However, even in cases where frequentist confidence intervals and Bayesian credible intervals agree, treating a Bayesian analysis as if it were a frequentist analysis strikes me as lacking imagination. To use Bayesian credible intervals as confidence intervals, that is, to make decisions based on what is or is not in the interval, is to ignore decades of decision theory, which, in my humble opinion, is what most data scientists should be using instead.
While decision theory comes in both frequentist and Bayesian flavors, I argue that a) Bayesian methods make decision theory very straightforward, and b) using decision theory can lead to a very different decision than asking “is 0 in my interval?” Let me provide an example of point b).
In a previous role, I ran an experiment in which we randomized customers meeting a certain profile to either a standard free trial (control) or a free trial of a more expensive plan with more features (treatment). The hope was that while the more expensive plan might result in fewer conversions to paid (because of sticker shock), each conversion would bring in more revenue and make up for any loss in conversions. We used a typical null hypothesis significance test and failed to reject the null. Had we taken the “is 0 in my interval” approach, Bayesian or otherwise, we might not have shipped the change.
I took a different approach. I estimated our expected loss (for more on expected loss, see this report by Chris Stucchio) and found that if we shipped control and were wrong (i.e., treatment was in truth superior), we stood to lose roughly twice as much ARR per randomized account as we would by shipping treatment and being wrong. The decision was then clear: we wanted to minimize our expected loss, so we shipped the treatment, statistical significance be damned!
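Here is a sketch of how such an expected-loss calculation can work. It is a generic Monte Carlo version, not the exact computation I used and not Eppo's implementation, and the posterior parameters are illustrative: draw lifts from the posterior, and for each possible decision average the revenue left on the table in the worlds where that decision turns out to be wrong.

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior over the per-account ARR lift of treatment vs control
# (illustrative: centered slightly above zero, but the interval comfortably spans 0).
posterior_lift = rng.normal(loc=2.0, scale=7.0, size=100_000)  # dollars per account

# Expected loss of shipping control: the ARR per account we forgo
# in the worlds where treatment is actually better (lift > 0).
loss_ship_control = np.mean(np.maximum(posterior_lift, 0.0))

# Expected loss of shipping treatment: the ARR per account we give up
# in the worlds where control is actually better (lift < 0).
loss_ship_treatment = np.mean(np.maximum(-posterior_lift, 0.0))

print(f"expected loss if we ship control:   ${loss_ship_control:.2f} per account")
print(f"expected loss if we ship treatment: ${loss_ship_treatment:.2f} per account")
# With these numbers the first loss is roughly twice the second, so minimizing
# expected loss favors shipping the treatment even though 0 sits inside the interval.
```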
Eppo makes calculating the expected loss easy, so Eppo users can take similar approaches with their experiments. The point I want to make here is that Bayesian methods may reach conclusions similar to frequentist methods if you treat a Bayesian method like a frequentist one. But if you do that, you are leaving a wealth of really useful decision theory (which becomes easy once you are a Bayesian) on the table, such as computing the expected loss of a decision or using the Expected Value of Sample Information to determine how worthwhile collecting more data would be.
When you start doing decision theory as opposed to null hypothesis significance testing (or the Bayesian equivalent thereof, checking if 0 is in the credible interval), you can arrive at very different conclusions and decisions, as I demonstrated in my example.
Informative priors are often spoken about as if they are a means of cheating, having the appearance of scientific rigor while imparting bias to achieve some pre-determined outcome. Frankly, this concern is not unique to Bayesian analysis. Indeed, all of statistics can be misused to manipulate experiment outcomes while maintaining the appearance of scientific rigor.
Such behaviors include peeking at results and stopping the moment significance is reached, cherry-picking metrics or segments after the fact, and quietly excluding inconvenient observations, all while preserving a veneer of rigor.
As such, I find the “Informative priors can be used to manipulate experiment outcomes” argument particularly trite. As for the subjectivity of Bayesian priors, I (along with other notable Bayesians) am not convinced that priors are inherently more subjective than other modeling choices in statistical analysis. Every statistical method, Bayesian or frequentist, involves subjective decisions: defining a population, choosing a data-generating process, or deciding how to clean data and which observations to omit.
The claim that Bayesian priors introduce an unacceptable level of subjectivity overlooks the fact that frequentist methods also rely on subjective elements, such as the choice of significance thresholds, what likelihood to use in a regression, or how much power an experiment requires.
Rather than fixating on the supposed objectivity of certain methods, we should instead emphasize transparency and justification of assumptions. Through this lens, informative priors are incredibly transparent, as we are forced to make decisions about our models, have justification for those decisions, and have those decisions scrutinized (via, for example, prior predictive checks).
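For instance, a prior predictive check can be as simple as simulating lifts from the proposed prior and asking whether they look like lifts you would believe before seeing any data. The sketch below uses a hypothetical Normal(0%, 25%) prior chosen only to show what the check reveals.

```python
import numpy as np

rng = np.random.default_rng(7)

# Proposed prior on the relative lift: Normal with mean 0% and standard deviation 25%.
prior_draws = rng.normal(loc=0.0, scale=0.25, size=10_000)

# Prior predictive check: what range of lifts does this prior consider plausible?
lo, hi = np.percentile(prior_draws, [2.5, 97.5])
print(f"95% of prior mass lies between {lo:+.1%} and {hi:+.1%}")
print(f"P(|lift| > 50%) under the prior: {np.mean(np.abs(prior_draws) > 0.5):.1%}")

# This prior puts roughly 5% of its mass on lifts beyond +/-50%. If that is
# implausible for your product, the check gives you a concrete, reviewable
# reason to tighten the prior, which is exactly the kind of transparency
# and scrutiny argued for above.
```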