Strategy
March 28, 2025

Mastering Experimentation Methods: From Design to Analysis

Tyler Buffington
Before Eppo, Tyler built the in-house experimentation platform at Big Fish Games. He holds a PhD from the University of Texas at Austin.

Introduction to Experimentation Methods

Controlled randomized experiments are the gold standard for assessing the impact of a product or code change. In recent years, companies across various industries have increasingly relied on experimentation to accelerate innovation and mitigate risk. Unlike correlation-based methods, controlled experiments provide estimates of the causal effect of an intervention. Even though they provide the most reliable form of evidence when assessing the impact of a change, many pitfalls can compromise the trustworthiness of an experiment. In this post, we will explore the methods associated with the lifecycle of an experiment, from design to decisions, and how to avoid common pitfalls throughout each step.

Experiment Design Best Practices

The first step of an experiment is design. A core component of the design step is identifying what exactly to test. A good starting point is a comprehensive repository of your team’s previous experiments, such as Eppo’s Knowledge Base. Understanding what has worked in the past often inspires new ideas. The next step is to form a hypothesis and clearly document the decision criteria for shipping the change. This is a deep topic worthy of its own post, but here are several key considerations:

  • Why do we think the idea is worth testing and could potentially move business metrics?
  • What is the primary metric or overall evaluation criterion?
  • What are the relevant guardrail metrics?
  • Which segments are worth analyzing for heterogeneous effects?
  • Which statistical methods and decision thresholds will be used? Will variance reduction techniques such as CUPED be used (a minimal sketch follows below)? Will you use a fixed-sample approach like a t-test or a sequential method that allows for early stopping?

All of these components should be defined before the experiment begins. In the metaphor of the Texas sharpshooter fallacy, we must draw the target before we shoot. If we draw the target after we see where the shots land (i.e., after observing the data), we are liable to trick ourselves into finding false patterns. Using experiment protocols with pre-defined metrics and decision criteria is a great way to avoid this trap.
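As a concrete illustration of the variance reduction question above, here is a minimal CUPED sketch in Python. It assumes a per-user metric and a pre-experiment covariate (for example, each user's pre-experiment value of the same metric); the variable names are hypothetical, not any particular platform's implementation.

```python
import numpy as np

def cuped_adjust(metric, covariate):
    """CUPED: remove the portion of the metric explained by a
    pre-experiment covariate. theta is the regression slope of the
    metric on the covariate; the adjustment reduces variance without
    changing the expected treatment effect, because the covariate is
    unaffected by the treatment."""
    theta = np.cov(metric, covariate)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())

# Hypothetical usage, with theta estimated on both variants pooled:
# adjusted = cuped_adjust(revenue, pre_experiment_revenue)
# lift = adjusted[is_treatment].mean() - adjusted[~is_treatment].mean()
```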
Another key aspect of design is defining the configuration of the experiment, including:

  • How long will the experiment run? A sample size calculator is an effective tool for estimating how long the test needs to reach sufficient power for a reasonable minimum detectable effect (a minimal calculation is sketched after this list).
  • Is the necessary data being reliably collected to inform the decision?
  • What will the traffic split be? We recommend using equal splits to maximize power (in the two-variant case) and to avoid the pitfalls described in section 7 of Kohavi et al.
  • Will the test be implemented client-side or server-side?
  • Where will the triggering point be? For example, if the test only affects the checkout page, users should ideally be triggered when they view the checkout page to avoid dilution.
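As referenced in the first item of the list above, here is a minimal sketch of the standard two-proportion power calculation behind most sample size calculators. The baseline rate, minimum detectable effect, and traffic figures are hypothetical placeholders.

```python
from scipy.stats import norm

# Hypothetical inputs
baseline = 0.10            # control conversion rate
mde = 0.01                 # absolute lift we want to reliably detect
alpha, power = 0.05, 0.80  # two-sided significance level and desired power

p1, p2 = baseline, baseline + mde
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Normal-approximation sample size per variant for comparing two proportions
n_per_variant = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2

daily_users_per_variant = 5_000  # assumed traffic under a 50/50 split
print(f"~{n_per_variant:,.0f} users per variant, "
      f"~{n_per_variant / daily_users_per_variant:.0f} days at the assumed traffic")
```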

Continuous Monitoring in Experiments

When using the most common statistical methods in A/B testing (e.g., a t-test), we must wait for the experiment to finish running before making a decision to ship a variant. This is easier said than done, and many teams unfortunately fall prey to the peeking problem. However, that does not mean that all experiment monitoring is problematic. First, teams should always monitor running experiments regularly for configuration problems. Second, there are alternative statistical methods that allow experiments to be stopped early.

Detecting Problems

While an experiment is running, some diagnostics should be run regularly, including but not limited to:

  • Sample ratio mismatch (SRM) checks: These detect an imbalance in traffic allocation relative to what is expected (a minimal chi-square check is sketched after this list). Common causes of SRM include improperly ramping an experiment, a bug that affects page loading for a specific variant, or problems with the experiment logging.
  • No experiment subjects: This often denotes a problem with logging experiment assignment events.
  • Misconfigured metrics: For example, if we see no conversions in an experiment, there is likely an issue connecting users to purchase events.
  • Pre-experiment imbalance: If there is a statistically significant difference between the treatment and control groups before the experiment was active, there is likely a problem with the randomization, or the experiment dates are misspecified.

These issues can irreversibly invalidate an experiment, so it is important to detect them early. Ultimately, monitoring for common problems greatly reduces the risk of issues that degrade experiment velocity.
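As noted in the SRM item above, a chi-square goodness-of-fit test is a common way to flag a sample ratio mismatch. This is a minimal sketch with hypothetical assignment counts and a configured 50/50 split.

```python
from scipy.stats import chisquare

# Hypothetical assignment counts for control and treatment
observed = [50_610, 49_390]
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # the configured 50/50 split

stat, p_value = chisquare(observed, f_exp=expected)

# A very small p-value (e.g., below 0.001) is a strong signal of SRM and
# should trigger an investigation rather than a ship decision.
print(f"chi-square = {stat:.1f}, p = {p_value:.2g}")
```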

Early Stopping Techniques

Additionally, not all statistical methods suffer from the peeking problem. For example, sequential testing methods account for the fact that the experimenter is monitoring the experiment with the intent to make a potential decision. These methods have become more popular in A/B testing in recent years and are used by leading experimentation companies. There are multiple flavors of sequential testing. At a high level, sequential methods are typically bucketed into two categories:

  1. Group sequential testing: The experimenter plans a series of interim analyses (i.e., “peeks”) in advance. For example, a group sequential test plan may involve checking the results once per week for a maximum of eight weeks. This approach has become popular at companies like Spotify and Booking.com.
  2. Fully sequential testing: The experimenter can stop the experiment at any point in time without pre-determining when the interim analyses will be conducted. This approach has become popular at companies such as Netflix.

The decision to use a particular sequential or non-sequential statistical method is subject to various tradeoffs. One of the primary ones to consider is the tradeoff between flexibility and statistical power, summarized in the table below.

| | Fixed-sample testing | Group sequential testing | Fully sequential testing |
| --- | --- | --- | --- |
| Example | t-test | O'Brien-Fleming | Generalization of always valid inference (AVI) |
| Advantages | Most statistical power, more accurate point estimates, simplest to implement | Balances flexibility and statistical power | Most flexible; experimenters can stop the test at any point |
| Disadvantages | Suffers from the peeking problem | Requires defining a complicated analysis plan with interim analyses | Less statistical power |

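To make the fully sequential column concrete, here is a minimal sketch of a mixture sequential probability ratio test (mSPRT) in the spirit of the always-valid-inference literature. It assumes a stream of approximately normal treatment-minus-control differences with a known variance; the mixing variance, peek cadence, and simulated data are hypothetical choices for illustration, not any vendor's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05    # false-positive rate guaranteed at any stopping time
sigma2 = 1.0    # assumed known variance of the per-pair differences
tau2 = 0.1      # variance of the N(0, tau2) mixing distribution

# Simulated stream of treatment-minus-control differences with a small true lift
diffs = rng.normal(loc=0.05, scale=np.sqrt(sigma2), size=20_000)

p_always_valid = 1.0
for n in range(100, len(diffs) + 1, 100):  # "peek" every 100 pairs
    mean_diff = diffs[:n].mean()
    # Mixture SPRT statistic for H0: the mean difference is zero
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n ** 2 * tau2 * mean_diff ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    # The always-valid p-value is the running minimum of 1 / Lambda_n
    p_always_valid = min(p_always_valid, 1.0 / lam)
    if p_always_valid < alpha:
        print(f"Stop at n = {n} pairs, always-valid p = {p_always_valid:.4f}")
        break
```

Because the statistic exceeds 1/alpha with probability at most alpha under the null no matter when you look, the experimenter can check it as often as desired; the cost is the reduced statistical power noted in the table.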

One effective approach is a hybrid of sequential and fixed-sample testing. For example, one can use a sequential test to detect degradations early, minimizing the time that harmful changes are live, and a fixed-sample test to detect improvements, retaining statistical power and accurate point estimates. In practice, this can be viewed as two separate one-tailed tests: a sequential test for the degradation tail and a fixed-sample test for the improvement tail.

Analyzing Experiment Results

Several concepts are helpful to understand when analyzing an experiment. However, one should not wait until the experiment results are ready for review to start thinking about these considerations. Ideally, the experiment analysis is planned in advance to the point that there is minimal friction when translating the results into a decision.

Defining Good Metrics

One of the most challenging aspects of experimentation is quantifying success with effective metrics. There are several key aspects of metric definitions that teams should consider:

  • Business relevance: how does the metric align with top-level strategic goals?
  • Statistical power: some metrics are strategically relevant, but have high variance, making them difficult to measure meaningfully.
  • Outlier handling: metrics like revenue per user tend to have a small number of extreme values that can inflate variance. Consider capping or winsorization (a minimal sketch follows this list).
  • Time lags: some metrics are lagging indicators of the impact of a change. For example, it may take weeks for a bad experience to cause a user to churn.
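For the outlier-handling consideration above, here is a minimal winsorization sketch; the 99th-percentile cap is an arbitrary illustrative choice that should be tuned to the metric.

```python
import numpy as np

def winsorize_upper(values, quantile=0.99):
    """Cap extreme values at an upper quantile so that a handful of
    outliers does not dominate the variance of the metric."""
    cap = np.quantile(values, quantile)
    return np.minimum(values, cap)

# Hypothetical usage on a heavy-tailed revenue-per-user metric
# revenue_capped = winsorize_upper(revenue_per_user, quantile=0.99)
```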

Bayesian and Frequentist Approaches

Another key decision in the analysis approach is whether to use Bayesian or frequentist approaches. This is an often-debated topic worthy of a much longer discussion, but the gist is that the two frameworks have completely different goals. In the context of most A/B testing implementations, the goals can be roughly summarized (and admittedly oversimplified) as follows:

  1. Frequentist: control the error rate. This means we want to control the probability of reaching an incorrect conclusion given an assumed true effect. For example, if there is no true effect, we want no more than a 5% chance of concluding that there is one.
  2. Bayesian: quantitatively update beliefs with data. This means we define a distribution of possible treatment effect values before running the experiment (the prior), and the experimental data (the likelihood) update that distribution to quantify our knowledge after analyzing the results (the posterior).

Advocates of the frequentist approach emphasize its error-control guarantees and its lack of subjectivity compared to the Bayesian approach. Conversely, Bayesian advocates emphasize the ability to incorporate prior knowledge and a presentation of results that is more relevant for decision-making. The Bayesian approach has additional benefits, such as mitigating the winner's curse (also known as "Type M" errors in the language of Gelman and Carlin), and leading experimentation companies such as Amazon have begun leveraging Bayesian approaches for this application. However, these benefits strongly depend on choosing a reasonable prior, which can be difficult to specify. A common misconception is that a non-informative prior is a safe default, but in reality, this choice leads to incorrect and overconfident results. Given that most Bayesian A/B testing tools use poorly specified priors, one should exercise extreme caution when using Bayesian analyses in experimentation. A summary of the comparison between the Bayesian and frequentist approaches is shown in the table below, followed by a minimal illustration of both:

| | Frequentist | Bayesian |
| --- | --- | --- |
| Common outputs | p-values, confidence intervals | Chance to beat, credible interval, loss |
| Oversimplified intuition | "It would be strange for an A/A test to generate this result, therefore I reject the notion that the treatment does nothing" | "Given the data and prior knowledge, the treatment probably had a positive effect" |
| Advantages | No subjective assumptions; more widely understood given its greater emphasis in statistics courses | More intuitive connection to decision-making; unlocks useful concepts such as the expected value of sample information; provides a natural haircut on point estimates |
| Disadvantages | p-values and confidence intervals are often misinterpreted; effect estimates from underpowered tests are exaggerated | Defining priors is challenging; most Bayesian tools make incorrect assumptions and yield overconfident results |
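As a minimal illustration of the two columns above, the sketch below computes a frequentist two-proportion z-test p-value and a Bayesian "chance to beat control" for the same hypothetical conversion counts. It uses flat Beta(1, 1) priors purely for simplicity, which, as noted above, is exactly the kind of non-informative default that deserves caution in practice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical conversion counts
n_a, conv_a = 10_000, 1_000   # control
n_b, conv_b = 10_000, 1_080   # treatment

# Frequentist: pooled two-proportion z-test
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

# Bayesian: Beta(1, 1) priors updated with the observed counts, then
# Monte Carlo samples from the posteriors estimate P(treatment > control)
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)
chance_to_beat = (post_b > post_a).mean()

print(f"p-value = {p_value:.3f}, chance to beat control = {chance_to_beat:.1%}")
```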

Experiment Deep Dives

When analyzing experiment results, it can be insightful to slice and dice the results along various dimensions. However, common pitfalls can turn experiment deep dives into a net negative for experimentation teams, such as:

  • Segmenting results into groups with small sample sizes, amplifying the impact of noise
  • Failing to account for the multiple comparisons problem (a minimal adjustment sketch follows below)
  • Analyses that are not self-serve and, therefore, consume the data team’s bandwidth
  • Aimless analyses that surface false insights that do not generalize to new data
  • Unnecessary complexity that introduces friction in decision-making due to conflicting perceived patterns

Here are effective practices to mitigate these pitfalls:

  • Define segmentation and hypotheses before running the experiment
  • Make analyses self-serve to avoid consuming data science bandwidth
  • Invest resources into enabling more experiments rather than overanalyzing individual experiments
  • Test any insights that were not part of the initial hypotheses in follow-up experiments
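For the multiple comparisons pitfall above, one option is to adjust segment-level p-values before treating any slice as an insight. This minimal sketch uses the Benjamini-Hochberg procedure from statsmodels; the segment p-values are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from slicing one experiment into six segments
segment_p_values = [0.004, 0.03, 0.04, 0.20, 0.45, 0.81]

# Benjamini-Hochberg controls the false discovery rate across the slices
reject, p_adjusted, _, _ = multipletests(segment_p_values, alpha=0.05, method="fdr_bh")

for raw, adj, keep in zip(segment_p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}, significant: {keep}")
```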

Experiment Reports

Beyond its ability to inform the immediate ship vs. don’t ship decision, experimentation provides value by shaping product knowledge and informing the future roadmap. Unlocking this benefit requires clear experiment reports that can be referenced later and understood by people who were not involved in the experiment. This includes:

  • Clearly documented hypotheses that indicate the thought process behind the experiment
  • Screenshots that clearly show what was tested
  • Key results next to key takeaways

Applications of Experimentation in Real-world Scenarios

Although there are general best practices associated with the execution of digital experiments, different applications are often better suited to different approaches. The table below summarizes common use cases and associated considerations.

| Use case | Goal | Challenges | Common metrics | Common techniques |
| --- | --- | --- | --- | --- |
| Releasing software (canary testing) | Mitigate risk | Need near-realtime event logging | Error rates, latency | Sequential testing |
| Changing a website design | Validate new ideas | Identity resolution for logged-out users | Conversion rate, revenue per user | Fixed-sample or sequential testing |
| Refining a search algorithm | Improve relevancy | Choosing candidate models, mutually exclusive experiments | Bookings, mean reciprocal rank | Fixed-sample or sequential testing |
| AI model evaluation | Measure business impact of new models | Offline metrics don't necessarily translate to better business outcomes | Retention, revenue, new subscriptions | Fixed-sample or sequential testing |
| B2B experiments | Validate new ideas | Cannot randomize at the user level due to interactions | Retention, usage rate, signup rate | Clustered experiments |

Conclusion

Companies in various industries have increasingly relied on controlled experiments as a means of evaluating the causal effect of product changes. Although controlled experiments may seem simple at first, there are various subtleties associated with proper design, implementation, and analysis. By following best practices, organizations can unlock the value of experimentation by leveraging trustworthy insights and shaping product knowledge.
