It's a common trope in experimentation programs: your experiment showed stellar results, only to deliver lackluster performance in production, leaving leadership frustrated with, and even skeptical of, the promises of experimentation. What seems like a game-changing innovation in the controlled environment of the experiment can falter when exposed to the unpredictable dynamics of the real world.
The gap between experimental success and real-world impact is a challenge many face in the realm of experimentation. The culprit? Often, it comes down to a violation of experiment validity. Experiment validity, the gatekeeper of reliable results and actionable insights, can make or break the trust in your experimentation efforts. In this blog, we uncover the nuances of experiment validity, exploring its various forms so that you can fortify your experiments against the pitfalls that lead to misinterpretations, frustration, and missed opportunities.
Experiment validity is a vital concept in experiment design and research, ensuring that the results hold weight, are dependable, and can be applied in real-world scenarios. It gauges how well an experiment truly captures and assesses what it set out to measure and is the linchpin for crafting accurate conclusions and generalizing results based on experimental findings beyond your sample.
Here, we explore common threats to validity. By accounting for these threats, you can guard your experiments against invalid conclusions and avoid the opportunity cost of rerunning experiments with flawed designs. We'll guide teams in making statistically sound decisions in experiment design, and help leaders make informed decisions based on a thorough understanding of the limitations that design imposes.
There are four broad categories of validity: internal, external, construct, and statistical. Each presents its own consequences when violated. We'll define these and identify common threats to each type.
Internal validity refers to the degree of confidence that the impact we observe in our experiments reflects the difference between the treatment and control experiences alone, and not outside factors. Internal validity is fundamentally about cause and effect. At its core, it gives us assurance in saying, "The tweaks we made to our website caused these changes." The good news is that, in most cases, proper randomization safeguards experiments from violations of internal validity.
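To make "proper randomization" concrete, here is a minimal sketch of one common way to assign users to variants: hash the user ID together with an experiment key and bucket on the result. The function name `assign_variant`, the key format, and the 50/50 split are illustrative assumptions, not a description of any particular platform's implementation.

```python
import hashlib


def assign_variant(user_id: str, experiment_key: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'.

    Hypothetical helper for illustration. Hashing the experiment key together
    with the user ID means the assignment does not depend on anything the user
    does, which is the property that protects internal validity.
    """
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex characters as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    treated = sum(assign_variant(u, "new-checkout") == "treatment" for u in users)
    print(f"share in treatment: {treated / len(users):.3f}")
```

Because the assignment is a pure function of the user ID and the experiment key, a returning user always sees the same experience, and neither the user nor the team can influence which group they land in.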
Selection Bias: Imagine you're testing a new feature on your website, but instead of randomly assigning users to control and treatment, the feature calls for users to opt in. For example, say you are testing a new rewards program. Some users will join the program and receive the treatment while those who don’t join will default to the control experience. In an experimentation setting, we say users self-select into the treatment.
The danger here is that, absent a randomization protocol, the users who choose to opt in might be fundamentally different from those who don't, introducing a selection bias that muddles the true impact of your feature.
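A small simulation can show how large this distortion can be. The sketch below uses entirely synthetic data and made-up parameters (a latent "engagement" trait, an 80%/20% opt-in probability, a true effect of 1.0): more engaged users both spend more and are more likely to join the rewards program, so comparing joiners to non-joiners overstates the effect, while random assignment recovers it.

```python
import random
from statistics import mean


def simulate_rewards_program(n_users=20_000, true_effect=1.0, seed=11):
    """Compare a naive opt-in comparison with a randomized one (synthetic data)."""
    rng = random.Random(seed)
    optin = {"joined": [], "not_joined": []}
    randomized = {"treatment": [], "control": []}

    for _ in range(n_users):
        # Latent trait we never observe directly; it drives spend AND opt-in.
        engagement = rng.gauss(0.0, 1.0)
        baseline_spend = 10.0 + 2.0 * engagement + rng.gauss(0.0, 1.0)

        # Opt-in world: engaged users are far more likely to join (assumed rates).
        joins = rng.random() < (0.8 if engagement > 0 else 0.2)
        if joins:
            optin["joined"].append(baseline_spend + true_effect)
        else:
            optin["not_joined"].append(baseline_spend)

        # Randomized world: a coin flip decides, independent of engagement.
        if rng.random() < 0.5:
            randomized["treatment"].append(baseline_spend + true_effect)
        else:
            randomized["control"].append(baseline_spend)

    naive_estimate = mean(optin["joined"]) - mean(optin["not_joined"])
    randomized_estimate = mean(randomized["treatment"]) - mean(randomized["control"])
    return naive_estimate, randomized_estimate


if __name__ == "__main__":
    naive, randomized = simulate_rewards_program()
    print(f"true effect: 1.00, opt-in estimate: {naive:.2f}, randomized estimate: {randomized:.2f}")
```

In this toy setup the opt-in comparison attributes pre-existing differences between joiners and non-joiners to the program itself, which is exactly the muddling described above.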
The Novelty Effect: The novelty effect occurs when users temporarily change their behavior in response to a new experience simply because it is new, distorting the measured outcomes. This means that the effects observed while the experiment is running may not accurately represent the genuine impact once the feature is deployed and the novelty wears off.
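One simple way to look for a novelty effect is to compare the measured lift early in the experiment with the lift later on. The sketch below assumes you already have a per-day lift series; here it is generated from a hypothetical decay model (a steady lift plus a boost that fades over time), so the numbers are illustrative only.

```python
import math
from statistics import mean


def synthetic_daily_lift(days=28, steady_lift=0.02, novelty_boost=0.05, decay_days=4.0):
    """Hypothetical lift series: a lasting effect plus a fading novelty boost."""
    return [steady_lift + novelty_boost * math.exp(-day / decay_days) for day in range(days)]


def early_vs_late_lift(daily_lift, early_days=7):
    """Return (mean lift in the first `early_days`, mean lift afterwards)."""
    if len(daily_lift) <= early_days:
        raise ValueError("need data beyond the early window to compare")
    return mean(daily_lift[:early_days]), mean(daily_lift[early_days:])


if __name__ == "__main__":
    early, late = early_vs_late_lift(synthetic_daily_lift())
    print(f"first-week lift: {early:.3f}, later lift: {late:.3f}")
```

A first-week lift that is much larger than the later lift suggests that a result read out after one week would overstate the long-run impact. With real data the daily estimates are noisy, so treat a gap like this as a prompt to run the test longer rather than as proof.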
External validity concerns the generalizability of the experiment's findings to the broader population and other settings. It assesses whether the results can be applied beyond the specific conditions of the experiment. Factors affecting external validity include the characteristics of the participants, the setting, and the time of the study.
Sampling Strategy: Limiting tests to a subset of markets carries the risk that the measured effect will not generalize to all markets. Similarly, test results limited to free users might not generalize to paid users. The same can be said for differences in user platforms, user tenure, and so on. Where possible, it is imperative to sample randomly from the population to which you intend to apply the change.
External Factors: Holidays, natural disasters, and other rare events pose a challenge to experiment measurement due to the circumstances of the outside world overshadowing the impact of the change being tested. For example, impacts observed on tests run during the end-of-year holiday period pose the risk of not generalizing to the typical user behavior throughout the rest of the year, as holiday-related factors may influence user engagement in unique ways.
Construct validity is all about the approach taken with respect to metrics and measurement. It assures that your measurements and business objectives align with sufficient coverage and connect to the outcomes you want to evaluate. Said another way: it's about ensuring the metrics you choose make sense for measuring your business goals and cover all the crucial aspects you aim to understand or improve.
Mismatched Metrics and Objectives: When chosen metrics deviate from the experiment's true goals, this misalignment can lead to misinterpretation of results, as the measured outcomes may not faithfully represent the impact of the feature under test.
Short-Term Proxies for Long-Term Measurements: Using short-term metrics such as revenue or click-through rate as proxies for harder-to-measure long-term outcomes such as customer lifetime value can challenge construct validity. The intention may be to gain quicker insights and observe tangible impacts, but the approach raises the question of whether the chosen short-term metrics accurately represent the broader, long-term goals that companies truly value. When the presumed relationship between the short-term indicator and the ultimate objective is weak, experimental findings can be misinterpreted. It is imperative to check that the chosen metrics align with the long-term outcomes of interest so that the measurements yield meaningful and valid insights.
Statistical conclusion validity concerns the typical statistical design elements we think of when we experiment. Have you employed appropriate statistical methods to derive sound conclusions from the data? In other words, statistical conclusion validity is about using proper statistical techniques so that the conclusions you draw about an effect are supported by the data.
Peeking: Peeking at accumulating results in experiments, often referred to as "the peeking problem," poses a threat to statistical validity because it introduces the potential for biased decision-making. When experimenters regularly check interim results and act on the data as it accumulates, every additional look gives random noise another chance to cross the significance threshold, so the likelihood of a false positive rises above the nominal rate. This can lead to premature conclusions, inaccurate interpretations, and decisions based on random fluctuations rather than true effects. To maintain the integrity of statistical analyses and draw reliable conclusions, it's essential to adhere to predetermined analysis points and avoid frequent interim assessments.
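The inflation is easy to see in a simulation of A/A tests, where both arms are identical and any "significant" result is a false positive by construction. The sketch below makes simplifying assumptions: a normally distributed metric with known unit variance, ten equally spaced peeks, and a fixed two-sided threshold of 1.96 at every look. It counts how often a test would be declared significant at any peek versus only at the final, planned analysis.

```python
import math
import random


def simulate_aa_false_positives(n_sims=1000, n_peeks=10, batch=50, z_crit=1.96, seed=7):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0

    for _sim in range(n_sims):
        sum_a = sum_b = 0.0
        n = 0
        flagged_at_any_peek = False
        z = 0.0
        for _peek in range(n_peeks):
            # Both arms draw from the same distribution: there is no true effect.
            sum_a += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            sum_b += sum(rng.gauss(0.0, 1.0) for _ in range(batch))
            n += batch
            # z-statistic for a difference in means with known unit variance.
            z = (sum_b / n - sum_a / n) / math.sqrt(2.0 / n)
            if abs(z) > z_crit:
                flagged_at_any_peek = True
        peeking_hits += flagged_at_any_peek
        final_hits += abs(z) > z_crit  # z now holds the final-look statistic

    return peeking_hits / n_sims, final_hits / n_sims


if __name__ == "__main__":
    peek_rate, final_rate = simulate_aa_false_positives()
    print(f"false positives when stopping at any peek: {peek_rate:.1%}")
    print(f"false positives with a single final look:  {final_rate:.1%}")
```

The single final look stays close to the nominal 5%, while stopping at the first significant peek produces a noticeably higher rate, even though nothing was changed between the arms.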
Underpowered Experiments: Underpowered experiments pose a threat to statistical validity because they lack the necessary sample size to reliably detect true effects. When an experiment is underpowered, there's an increased risk of Type II errors - failing to identify true effects when they exist - undermining the accuracy of the findings. Essentially, underpowered experiments diminish the ability to distinguish between real effects and random variability, compromising the overall reliability and validity of statistical analyses. Adequate sample size is crucial to ensure experiments have the power needed to detect meaningful effects with a reasonable level of confidence.
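A quick sample size calculation before launch helps avoid this. The sketch below uses the standard normal-approximation formula for comparing two conversion rates with equal-sized arms; the function name, the default significance level of 0.05, and the default power of 80% are conventional choices assumed here for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect `relative_lift` over `baseline_rate`.

    Normal approximation for a two-sided test of two proportions:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Example inputs: 10% baseline conversion, hoping to detect a 10% relative lift.
    print(sample_size_per_arm(0.10, 0.10))
    # Halving the detectable lift roughly quadruples the required sample.
    print(sample_size_per_arm(0.10, 0.05))
```

Because the required sample grows with the inverse square of the effect size, small expected lifts call for much larger experiments. If the traffic you have falls well short of the number this returns, the experiment is likely underpowered for that effect.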
There are four types of experiment validity you need to know: internal, external, construct, and statistical. By understanding these concepts, you can make more informed decisions based on experiment results. You'll be able to identify where these results are applicable and understand their limitations, and this knowledge will empower you to design experiments that are more nuanced and effective.
On Eppo's experimentation platform, you get built-in diagnostics and guardrail checks for potential validity issues, no memorization required. If you're ready to run more trustworthy experiments, reach out for a demo now!