
Statistical significance is a cornerstone of (data) science. The idea is simple: only accept results with p-value < 0.05 and dismiss the rest as noise. While this heuristic is a good starting point, it can quickly morph into rigid dogma: what I call the Cult of Stat Sig. The cult splits the world into black and white (or green/red and grey, depending on your experimentation platform of choice). Stat sig results are considered unimpeachable and exact; the others are the same as 0. Statistical significance becomes a goal in itself, no longer a tool but something to worship.
The dogmatic approach harms experimentation programs. Its flawed logic leads to poor decision-making, and its rigid rules provide cover for ignoring inconvenient results. However, not all hope is lost: breaking free from the cult without descending into statistical anarchy is possible.
Many experimentation programs emerge from a state of statistical anarchy: cherry-picked results reported without any measure of their uncertainty; widespread skepticism about experiment results, with everyone wondering, “Is this real or just noise?”; and overly complex methodologies used without validation, with every analyst running their own notebook and custom approach.
To get out of the state of anarchy, the first step is to establish a baseline of statistical hygiene, often enforced through an experiment platform. A simple, easy-to-follow heuristic is needed. Traditional hypothesis testing and p-values provide such a heuristic: If p-value < 0.05, report the result; if p-value ≥ 0.05, treat it as noise.
The heuristic's intent is valid, and following it typically improves things, at least for a while. Albeit crude, it is a reasonable guideline that can help build trust in positive results and reduce false positives. The problem arises when the heuristic is mistaken for a commandment and its recommendations are taken to fallacious extremes.
Some adherents of the cult truly believe both of these fallacies (that stat sig results are exact and that the rest are zero) and will impose their flawed reasoning on others. Others aren't true believers: they know why the rules are fallacious, but perpetuate them anyway, believing that there is no third way; it's either the cult or statistical anarchy. This post argues that the cult causes real harm and that a better way is possible.
When planning an experiment analysis, it's good practice to pair a primary metric (the goal) with one or more guardrail metrics (the checks and balances). For example:
- Purchases (primary) paired with returns (guardrail)
- Sessions (primary) paired with crashes (guardrail)
- Email opens (primary) paired with unsubscribes (guardrail)
What all these examples have in common is that the guardrail metric is less sensitive than the primary metric: a larger sample size is required to detect an effect on the guardrail compared to the same effect size on the primary metric. In these examples, the guardrail is less sensitive because changes in rare events – such as returns, crashes, and unsubscribes – are harder to detect than more frequent events (purchases, sessions, opens). In other settings, guardrails are underpowered because they measure longer-term outcomes, such as retention. (In other settings, such as page load time, the guardrail is typically not underpowered relative to the primary metric.)
All too often, power analyses consider only the primary metric when choosing how long to run an experiment. This can result in insufficient power to detect degradation of the guardrails: the guardrail might drop, but the drop will rarely be statistically significant. (Eppo’s sample size calculator makes it easy to perform a power analysis jointly for primary and guardrail metrics, but not everyone has access to such tools.)
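To make the sensitivity gap concrete, here is a rough sketch of a joint power analysis using the standard two-proportion sample-size formula. The base rates and lift are made-up numbers (a 10% purchase rate as the primary metric, a 1% return rate as the guardrail), not tied to any particular product or platform:

```python
import math

def required_n_per_arm(base_rate, rel_lift, alpha=0.05, power=0.8):
    """Approximate sample size per arm to detect a relative lift on a
    conversion-rate metric, using a two-sided two-proportion z-test."""
    p1 = base_rate
    p2 = base_rate * (1 + rel_lift)
    z_alpha = 1.96  # two-sided alpha = 0.05
    z_beta = 0.84   # power = 0.8
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# Same 5% relative effect, very different sample-size requirements:
n_primary = required_n_per_arm(0.10, 0.05)    # e.g., purchases at a 10% base rate
n_guardrail = required_n_per_arm(0.01, 0.05)  # e.g., returns at a 1% base rate
print(n_primary, n_guardrail)
```

Detecting the same relative lift on the rarer guardrail metric requires roughly an order of magnitude more users per arm. An experiment sized only for the primary metric leaves the guardrail badly underpowered.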
This situation is already bad; the cult of stat sig compounds the problem. Results for the guardrail metrics will often be inconclusive, even when there is a true harm. The point estimates on the guardrails will often be negative, but they will rarely be statistically significant. These results should be cause for concern and prompt extending the experiment or further investigation. Instead, treating these non-statistically significant results as exact zeros silences the alarm bells.
There are problems even for stat sig results. When an experiment is barely powered, marginally stat sig point estimates should not be taken at face value. When looking across stat sig results, the winner’s curse means that the true impact is closer to 0 than what the point estimates suggest. There are no straightforward solutions here, but ignoring the residual uncertainty – treating stat sig results as exact – makes it even harder to understand and account for the curse.
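The winner's curse itself is also easy to simulate (again with made-up numbers): give every experiment the same modest true lift, run them all at low power, and then look only at the ones that happen to reach significance.

```python
import random
import statistics

random.seed(42)

# Hypothetical program: every experiment has the same modest true lift,
# expressed in standard-error units, and is run at low power (~17%).
true_effect = 1.0
n_experiments = 20000

estimates = [random.gauss(true_effect, 1.0) for _ in range(n_experiments)]
significant = [e for e in estimates if abs(e) > 1.96]  # two-sided 5% test

mean_all = statistics.mean(estimates)
mean_sig = statistics.mean(significant)
print(f"mean estimate, all experiments: {mean_all:.2f}")
print(f"mean estimate, stat sig only:   {mean_sig:.2f}")
```

Averaged over all experiments the estimates are unbiased, but conditioning on significance inflates them well above the true effect of 1.0: the reported wins are systematically larger than reality.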
An experimentation program that operates according to this model will unknowingly cause much harm. Best practices are seemingly followed: power analyses are done, and only (properly powered) stat sig results are reported. But in reality, the wins are not as large as claimed, and the harm accumulates undetected. Even the best experimentation programs can suffer from these issues, and there is no panacea: holdouts help avoid the winner’s curse, and variance reduction helps with underpowered guardrail metrics, but they can’t solve everything. The most important first step is to not be in denial.
Mistaking statistical significance for business significance is another symptom of the cult of stat sig. The starting point is obsessing over whether p-values cross some magical threshold. A natural evolution is to use p-values themselves as markers of importance: the lower, the better.
An experiment lift with a highly stat sig p-value of 0.001 is surely more important than one with a p-value of 0.04, right?
Not necessarily. The lift with the smaller p-value has stronger evidence against the null hypothesis of no effect: it is a more unlikely result if the experiment does nothing in reality. But statistical significance is not the same as business significance: a p-value reflects both the size of the effect and the precision with which it is measured, so a tiny lift can earn a minuscule p-value simply because the sample is large, while a lift that matters greatly to the business can hover just under 0.05 in a smaller experiment.
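A quick numerical sketch illustrates the point; the lifts and standard errors below are hypothetical:

```python
import math

def two_sided_p(lift, se):
    """p-value of a two-sided z-test for an estimated lift with standard error se."""
    z = abs(lift) / se
    return math.erfc(z / math.sqrt(2))

# Hypothetical experiments: (estimated relative lift, standard error)
p_tiny_lift_huge_sample = two_sided_p(0.001, 0.0003)  # +0.1% lift, measured precisely
p_big_lift_small_sample = two_sided_p(0.05, 0.024)    # +5% lift, measured noisily

print(f"tiny lift, huge sample: p = {p_tiny_lift_huge_sample:.4f}")
print(f"big lift, small sample: p = {p_big_lift_small_sample:.4f}")
```

The 0.1% lift comes out far more "significant" than the 5% lift, yet for most businesses the second result is the one that matters. Ranking results by p-value alone gets this exactly backwards.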
Escaping the cult of stat sig is not easy: it is often entrenched in organizations and their processes and in the training and experience of individual practitioners. Escaping it may require deprogramming bad beliefs and bad habits that have been cemented over the years. There is a real risk of throwing the baby out with the bathwater, regressing to the earlier state of statistical anarchy. For all its flaws, the cult of stat sig provides a bulwark against the worst abuses.
Here are some suggestions for those looking for a better path forward. They do not need to be adopted wholesale; start with those that seem more attainable today. (The list is ordered according to my experience of what is easier to do and most likely to stick.)
Thank you to Bertil Hatt and Sven Schmit for detailed and thoughtful feedback on an earlier draft of this post.