Statistical significance is a cornerstone of (data) science. The idea is simple: only accept results with p-value < 0.05 and dismiss the rest as noise. While this heuristic is a good starting point, it can quickly morph into rigid dogma: what I call the Cult of Stat Sig. The cult splits the world into black and white (or green/red and grey, depending on your experimentation platform of choice). Stat sig results are considered unimpeachable and exact; everything else is treated as zero. Statistical significance becomes a goal in itself, no longer a tool but something to worship.
The dogmatic approach harms experimentation programs. Its flawed logic leads to poor decision-making, and its rigid rules provide cover for ignoring inconvenient results. However, not all hope is lost: breaking free from the cult without descending into statistical anarchy is possible.
Many experimentation programs emerge from a state of statistical anarchy: cherry-picked results reported without any information about their uncertainty; widespread skepticism about experiment results, with everyone wondering, “Is this real or just noise?”; overly complex methodologies used without validation, with every analyst relying on their own notebook and custom approach.
To get out of the state of anarchy, the first step is to establish a baseline of statistical hygiene, often enforced through an experiment platform. A simple, easy-to-follow heuristic is needed. Traditional hypothesis testing and p-values provide such a heuristic: If p-value < 0.05, report the result; if p-value ≥ 0.05, treat it as noise.
The heuristic's intent is valid, and following it typically improves things, at least for a while. Crude as it is, it is a reasonable guideline that can help build trust in positive results and reduce false positives. The problems start when the heuristic is mistaken for a commandment and its recommendations are taken to fallacious extremes.
Fallacy #1: If the result is statistically significant, it is an exact measurement. Reaching statistical significance does not eliminate uncertainty. An effect is stat sig when you can “reject zero.” Being confident the effect is not zero does not mean there is no uncertainty left about its size. An estimated lift of +5% with a confidence interval of (+1%, +9%) contains different information from one with (+4%, +6%). Yet, in the cult of stat sig, they are considered the same result: exactly equal to +5%.
Fallacy #2: If the result is not statistically significant, it is the same as zero. Another incorrect conclusion due to oversimplification. A wide confidence interval that covers 0% (“not stat sig”) does not contain the same information as a tight confidence interval around 0% (also “not stat sig”). For example, an experiment whose estimated effect on retention has a confidence interval of (-18%, +2%) is not conventionally stat sig, but it should give pause. Only a tight interval, roughly centered at zero, would provide evidence that the impact is neutral. Yet, in the cult of stat sig, both are considered to be exact zeros.
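To make the contrast concrete, here is a minimal Python sketch using hypothetical lift estimates and standard errors under a normal approximation (the numbers are chosen to mirror the examples above, not taken from real experiments). It shows how the stat sig / not stat sig label collapses very different intervals into the same verdict.

```python
from scipy import stats

# Hypothetical lift estimates and standard errors (normal approximation).
# The numbers mirror the examples in the text; they are not real data.
results = {
    "precise win": (0.05, 0.005),         # +5%, tight interval
    "noisy win": (0.05, 0.020),           # +5%, wide interval
    "tight zero": (0.00, 0.010),          # genuinely neutral
    "possible disaster": (-0.08, 0.051),  # wide interval reaching -18%
}

z = stats.norm.ppf(0.975)  # ~1.96 for a two-sided 95% interval
for name, (lift, se) in results.items():
    lo, hi = lift - z * se, lift + z * se
    stat_sig = lo > 0 or hi < 0
    print(f"{name:18s} lift={lift:+.1%}  95% CI=({lo:+.1%}, {hi:+.1%})  stat sig={stat_sig}")
```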
Some adherents to the cult truly believe in both of these fallacies and will impose their flawed reasoning on others. Others aren’t true believers; they know why the rules are fallacies but perpetuate them anyway, believing that there is no third way: it’s either the cult or statistical anarchy. This post argues that the cult causes real harm and that a better way is possible.
When planning an experiment analysis, it’s good practice to pair a primary metric (the goal) with one or more guardrail metrics (the checks and balances):
purchases paired with returns for an e-commerce platform
sessions paired with crashes for a mobile app
opens paired with unsubscribes for an email marketing experiment
What all these examples have in common is that the guardrail metric is less sensitive than the primary metric: a larger sample size is required to detect an effect on the guardrail compared to the same effect size on the primary metric. In these examples, the guardrail is less sensitive because changes in rare events – such as returns, crashes, and unsubscribes – are harder to detect than more frequent events (purchases, sessions, opens). In other settings, guardrails are underpowered because they measure longer-term outcomes, such as retention. (In other settings, such as page load time, the guardrail is typically not underpowered relative to the primary metric.)
All too often, power analyses consider only the primary metric when choosing how long to run an experiment. This can result in insufficient power to detect degradation of the guardrails. The guardrail might drop, but the drop will rarely be statistically significant. (Eppo’s sample size calculator makes it easy to perform a power analysis jointly for primary and guardrail metrics, but not everyone has access to such tools.)
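To see how large the gap can be, here is a back-of-the-envelope sample-size calculation using the standard two-proportion approximation. The baseline rates are hypothetical (a primary metric that converts at 5% and a rare guardrail event at 0.5%), but the pattern is general: detecting the same relative change on the rarer guardrail takes roughly ten times the traffic.

```python
from scipy import stats

def n_per_arm(p_base, rel_effect, alpha=0.05, power=0.8):
    """Approximate sample size per arm to detect a relative change in a
    proportion metric with a two-sided z-test at the given power."""
    p_alt = p_base * (1 + rel_effect)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return (z_alpha + z_power) ** 2 * variance / (p_base - p_alt) ** 2

# Hypothetical baseline rates; the effect to detect is a 5% relative change.
print(f"primary (5% conversion rate): {n_per_arm(0.05, 0.05):,.0f} users per arm")
print(f"guardrail (0.5% event rate):  {n_per_arm(0.005, 0.05):,.0f} users per arm")
```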
This situation is already bad; the cult of stat sig compounds the problem. Results for the guardrail metrics will often be inconclusive, even when there is a true harm. The point estimates on the guardrails will often be negative, but they will rarely be statistically significant. These results should be cause for concern and prompt extending the experiment or further investigation. Instead, treating these non-statistically significant results as exact zeros silences the alarm bells.
There are problems even for stat sig results. When an experiment is barely powered, marginally stat sig point estimates should not be taken at face value. When looking across stat sig results, the winner’s curse means that the true impact is closer to 0 than what the point estimates suggest. There are no straightforward solutions here, but ignoring the residual uncertainty – treating stat sig results as exact – makes it even harder to understand and account for the curse.
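A small simulation makes the winner’s curse tangible. Assume (hypothetically) that every experiment has the same modest true lift and is only marginally powered; averaging the estimates of just the stat sig “winners” overstates the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: every experiment has the same modest true lift,
# but each one is only marginally powered, so estimates are noisy.
true_lift = 0.01
se = 0.005
estimates = rng.normal(true_lift, se, size=100_000)

winners = estimates[estimates / se > 1.96]  # keep only the stat sig results

print(f"true lift:                   {true_lift:+.2%}")
print(f"average estimate (all):      {estimates.mean():+.2%}")
print(f"average estimate (stat sig): {winners.mean():+.2%}")  # inflated
```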
An experimentation program that operates according to this model will unknowingly cause much harm. Best practices are seemingly followed: power analyses are done, and only (properly powered) stat sig results are reported. But in reality, the wins are not as large as claimed, and the harm accumulates undetected. Even the best experimentation programs can suffer from these issues, and there is no panacea: holdouts help avoid the winner’s curse, and variance reduction helps with underpowered guardrail metrics, but they can’t solve everything. The most important first step is to not be in denial.
Mistaking the strength of statistical significance for business significance is another symptom of the cult of stat sig. The starting point is obsessing over whether p-values cross some magical threshold. A natural evolution is to use p-values themselves as markers of importance: the lower, the better.
An experiment lift with a highly stat sig p-value of 0.001 is surely more important than one with a p-value of 0.04, right?
Not necessarily. The lift with a smaller p-value has stronger evidence against the null hypothesis of no effect: it is a more unlikely result if the experiment does nothing in reality. But statistical significance is not the same as business significance:
The p=0.001 result might have a smaller magnitude. A very precisely estimated lift of +0.3%, with a tight confidence interval around that value, gives high confidence that the effect is not zero (or negative), but is +0.3% truly important? In contrast, the p=0.04 result might be a lift of +4%, less precisely measured. It is relatively more likely to be statistical noise; still, if confirmed by running the experiment longer or through an independent replication, it is also more likely to be a large, impactful win. (The sketch after this list makes the contrast concrete.)
The p=0.001 result might come from a metric that is more sensitive but that overall matters less to the business. (This interacts with the first bullet: sensitive metrics are those that can give stat sig results even for small changes.) For example, it might be a precisely measured lift of some top-of-funnel clickthrough rate, but what truly matters to the business is a conversion metric such as signups or purchases, which are typically less sensitive. Indeed, it is often the case that the most statistically sensitive metrics are also more superficial and more loosely aligned with business goals.
The p=0.001 result might come from an experiment with higher power: maybe it was run for longer or on a higher percentage of traffic. The result with a higher p-value will benefit more from extending the experiment or from a replication with higher traffic. Considering it to be less important because of its higher p-value gets things backward.
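Here is the sketch referenced above: two hypothetical results analyzed the same way but with very different precision (the lifts and standard errors are made up for illustration). The tiny, precisely measured lift gets the impressive p-value; the large, noisy lift gets the unimpressive one.

```python
from scipy import stats

def summarize(lift, se):
    """Two-sided p-value and 95% CI for a lift estimate (normal approximation)."""
    p_value = 2 * stats.norm.sf(abs(lift / se))
    half_width = stats.norm.ppf(0.975) * se
    return p_value, (lift - half_width, lift + half_width)

# Hypothetical results: a tiny, precise lift vs a large, noisy one.
for name, lift, se in [("precise but small", 0.003, 0.0009),
                       ("noisy but large  ", 0.040, 0.0195)]:
    p, (lo, hi) = summarize(lift, se)
    print(f"{name}: lift={lift:+.1%}  p={p:.3f}  95% CI=({lo:+.1%}, {hi:+.1%})")
```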
Escaping the cult of stat sig is not easy: it is often entrenched in organizations and their processes and in the training and experience of individual practitioners. Escaping it may require deprogramming bad beliefs and bad habits that have been cemented over the years. There is a real risk of throwing the baby out with the bathwater, regressing to the earlier state of statistical anarchy. For all its flaws, the cult of stat sig provides a bulwark against the worst abuses.
Here are some suggestions for those looking for a better path forward. They do not need to be adopted wholesale; start with those that seem most attainable today. (The list is ordered according to my experience of what is easiest to do and most likely to stick.)
Report confidence intervals instead of p-values. P-values are routinely misinterpreted. Even when interpreted correctly, they emphasize the wrong information: confidence in rejecting the null hypothesis of no effect rather than effect size. Confidence intervals are easier to interpret: the naive interpretation (“the true effect belongs to this range 95% of the time”) is, for practical purposes, acceptable. They convey uncertainty and correctly focus attention on the size of the effect. And they can satisfy even your irredeemable cultist friend, who can always check whether the confidence interval covers 0 to see whether the result is stat sig or not.
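As a minimal example of what this looks like in practice, the snippet below computes a Wald confidence interval for the difference in conversion rates directly from (hypothetical) treatment and control counts. It is a sketch under a normal approximation, not a replacement for your experimentation platform’s estimator.

```python
import numpy as np
from scipy import stats

def lift_ci(conv_t, n_t, conv_c, n_c, level=0.95):
    """Wald confidence interval for the difference in conversion rates
    between treatment and control (normal approximation)."""
    p_t, p_c = conv_t / n_t, conv_c / n_c
    se = np.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = stats.norm.ppf(0.5 + level / 2)
    diff = p_t - p_c
    return diff, (diff - z * se, diff + z * se)

# Hypothetical experiment counts.
diff, (lo, hi) = lift_ci(conv_t=1_130, n_t=20_000, conv_c=1_000, n_c=20_000)
print(f"lift: {diff:+.2%}  95% CI: ({lo:+.2%}, {hi:+.2%})")
```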
Pre-register your main analysis plan, and always report pre-registered results. Do not filter on statistical significance; instead, normalize reporting non-stat sig results along with their uncertainty. Pre-registration is not a straitjacket: you are allowed to explore and report additional findings. Results that were not part of the analysis plan can then be – correctly – treated with a higher degree of skepticism.
Use non-inferiority tests on guardrail metrics. If you’re testing for the wrong thing, it’s easier to fall prey to the cultist fallacy. When a guardrail metric shouldn’t be adversely impacted, traditional hypothesis testing (against a null of zero effect) isn’t useful: it doesn’t help distinguish between tight estimates around zero (good) and wide ones (not enough information to rule out a large negative impact). The solution is to be explicit about the tolerable inferiority margin; then everyone can agree that the guardrail passes when the one-sided null (a degradation larger than the margin) is rejected.
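A minimal sketch of what such a test could look like, assuming a normal approximation and hypothetical readouts (the margin, lifts, and standard errors are made up): both guardrail results below are “not stat sig” against a null of zero, but only the tight one passes the non-inferiority test.

```python
from scipy import stats

def non_inferiority_passes(lift, se, margin, alpha=0.05):
    """One-sided non-inferiority test under a normal approximation.

    Null hypothesis: the true effect is a degradation of `margin` or worse.
    Passing means we can rule out a drop larger than the margin."""
    p_value = stats.norm.sf((lift + margin) / se)
    return p_value < alpha, p_value

# Hypothetical guardrail readouts; tolerable degradation is 1 percentage point.
for name, lift, se in [("tight around zero", 0.000, 0.004),
                       ("wide and scary   ", -0.080, 0.051)]:
    passes, p = non_inferiority_passes(lift, se, margin=0.01)
    print(f"{name}: passes={passes}  one-sided p={p:.3f}")
```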
Use shrinkage estimators when a single number is required. Confidence intervals are an expressive way to communicate statistical uncertainty. Regrettably, they are also impractical whenever you need an input to some downstream calculation, such as aggregating the results of multiple experiments or feeding a financial model. As much as it would be good for these downstream users to account for uncertainty properly, the reality is that you’ll eventually be asked for “just one number.” At that point, you need to express uncertainty in a different way. Don’t report the center of the confidence interval, which is equivalent to assuming no uncertainty. Instead, report a shrunken estimate from a Bayesian posterior using an informed prior, or from a frequentist shrinkage estimator.
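As an illustration, here is one simple way to produce such a number: the posterior mean under a normal prior and a normal likelihood. The prior parameters below are hypothetical (a prior belief that most experiments move the metric by at most a couple of percent); the point is that noisy estimates get pulled toward the prior mean while precise ones barely move.

```python
def shrunken_estimate(lift, se, prior_mean=0.0, prior_sd=0.02):
    """Posterior mean under a normal prior and a normal likelihood.

    The (hypothetical) prior says most experiments move the metric by at most
    a couple of percent; noisy estimates get pulled toward the prior mean."""
    prior_precision = 1.0 / prior_sd**2
    data_precision = 1.0 / se**2
    return (prior_precision * prior_mean + data_precision * lift) / (
        prior_precision + data_precision
    )

# Same point estimate, very different precision.
print(f"precise +5% lift: {shrunken_estimate(0.05, se=0.005):+.2%}")  # barely shrunk
print(f"noisy   +5% lift: {shrunken_estimate(0.05, se=0.020):+.2%}")  # pulled halfway to 0
```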
Thank you to Bertil Hatt and Sven Schmit for detailed and thoughtful feedback on an earlier draft of this post.