Strategy
March 28, 2025

Mastering Experimentation Methods: From Design to Analysis

Tyler Buffington
Before Eppo, Tyler built the in-house experimentation platform at Big Fish Games. He holds a PhD from the University of Texas at Austin.

Introduction to Experimentation Methods

Controlled randomized experiments are the gold standard for assessing the impact of a product or code change. In recent years, companies across various industries have increasingly relied on experimentation to accelerate innovation and mitigate risk. Unlike correlation-based methods, controlled experiments provide estimates of the causal effect of an intervention. Even though they provide the most reliable form of evidence when assessing the impact of a change, many pitfalls can compromise the trustworthiness of an experiment. In this post, we will explore the methods associated with the lifecycle of an experiment, from design to decisions, and how to avoid common pitfalls throughout each step.

Experiment Design Best Practices

The first step of an experiment is design. A core component of the design step is identifying what exactly to test. A good starting point is a comprehensive repository of your team’s previous experiments, such as Eppo’s Knowledge Base. Understanding what has worked in the past often inspires new ideas. The next step is to form a hypothesis and clearly document the decision criteria for shipping the change. This is a deep topic worthy of its own post, but here are several key considerations:

  • Why do we think the idea is worth testing and could potentially move business metrics?
  • What is the primary metric or overall evaluation criterion?
  • What are the relevant guardrail metrics?
  • Which segments are worth analyzing for heterogeneous effects?
  • Which statistical methods and decision thresholds will be used? Will variance reduction techniques such as CUPED be used (a minimal sketch follows below)? Will you use a fixed-sample approach like a t-test or a sequential method that allows for early stopping?

All of these components should be defined before the experiment begins. In the metaphor of the Texas sharpshooter fallacy, we must draw the target before we shoot. If we draw the target after we see where the shots land (i.e., after observing the data), we are liable to trick ourselves into finding false patterns. Using experiment protocols with pre-defined metrics and decision criteria is a great way to avoid this trap.
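As a concrete illustration of the variance reduction question above, here is a minimal CUPED sketch in Python. It assumes a per-user metric and a pre-experiment covariate (for example, each user's pre-experiment value of the same metric); the variable names are hypothetical, not any particular platform's implementation.

```python
import numpy as np

def cuped_adjust(metric, covariate):
    """CUPED: remove the portion of the metric explained by a
    pre-experiment covariate. theta is the regression slope of the
    metric on the covariate; the adjustment reduces variance without
    changing the expected treatment effect, because the covariate is
    unaffected by the treatment."""
    theta = np.cov(metric, covariate)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())

# Hypothetical usage, with theta estimated on both variants pooled:
# adjusted = cuped_adjust(revenue, pre_experiment_revenue)
# lift = adjusted[is_treatment].mean() - adjusted[~is_treatment].mean()
```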
Another key aspect of design is defining the configuration of the experiment, including:

  • How long will the experiment run? A sample size calculator is an effective tool for estimating how long the test needs to reach sufficient power for a reasonable minimum detectable effect (a minimal calculation is sketched after this list).
  • Is the necessary data being reliably collected to inform the decision?
  • What will the traffic split be? We recommend using equal splits to maximize power (in the two-variant case) and to avoid the pitfalls described in section 7 of Kohavi et al.
  • Will the test be implemented client-side or server-side?
  • Where will the triggering point be? For example, if the test only affects the checkout page, users should ideally be triggered when they view the checkout page to avoid dilution.
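As referenced in the first item of the list above, here is a minimal sketch of the standard two-proportion power calculation behind most sample size calculators. The baseline rate, minimum detectable effect, and traffic figures are hypothetical placeholders.

```python
from scipy.stats import norm

# Hypothetical inputs
baseline = 0.10            # control conversion rate
mde = 0.01                 # absolute lift we want to reliably detect
alpha, power = 0.05, 0.80  # two-sided significance level and desired power

p1, p2 = baseline, baseline + mde
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Normal-approximation sample size per variant for comparing two proportions
n_per_variant = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p2 - p1) ** 2

daily_users_per_variant = 5_000  # assumed traffic under a 50/50 split
print(f"~{n_per_variant:,.0f} users per variant, "
      f"~{n_per_variant / daily_users_per_variant:.0f} days at the assumed traffic")
```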

Continuous Monitoring in Experiments

When using the most common statistical methods in A/B testing (e.g., a t-test), we must wait for the experiment to finish running before making a decision to ship a variant. This is easier said than done, and many teams unfortunately fall prey to the peeking problem. However, that does not mean that all experiment monitoring is problematic. First, teams should always monitor running experiments regularly for configuration problems. Second, there are alternative statistical methods that allow experiments to be stopped early.

Detecting Problems

While an experiment is running, some diagnostics should be run regularly, including but not limited to:

  • Sample ratio mismatch (SRM) checks: These detect an imbalance in traffic allocation relative to what is expected (a minimal chi-square check is sketched after this list). Common causes of SRM include improperly ramping an experiment, a bug that affects page loading for a specific variant, or problems with the experiment logging.
  • No experiment subjects: This often denotes a problem with logging experiment assignment events.
  • Misconfigured metrics: For example, if we see no conversions in an experiment, there is likely an issue connecting users to purchase events.
  • Pre-experiment imbalance: If there is a statistically significant difference between the treatment and control groups before the experiment was active, there is likely a problem with the randomization, or the experiment dates are misspecified.

These issues can irreversibly invalidate an experiment, so it is important to detect them early. Ultimately, monitoring for common problems greatly reduces the risk of issues that degrade experiment velocity.
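As noted in the SRM item above, a chi-square goodness-of-fit test is a common way to flag a sample ratio mismatch. This is a minimal sketch with hypothetical assignment counts and a configured 50/50 split.

```python
from scipy.stats import chisquare

# Hypothetical assignment counts for control and treatment
observed = [50_610, 49_390]
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # the configured 50/50 split

stat, p_value = chisquare(observed, f_exp=expected)

# A very small p-value (e.g., below 0.001) is a strong signal of SRM and
# should trigger an investigation rather than a ship decision.
print(f"chi-square = {stat:.1f}, p = {p_value:.2g}")
```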

Early Stopping Techniques

Additionally, not all statistical methods suffer from the peeking problem. For example, sequential testing methods account for the fact that the experimenter is monitoring the experiment with the intent to make a potential decision. These methods have become more popular in A/B testing in recent years and are used by leading experimentation companies. There are multiple flavors of sequential testing. At a high level, sequential methods are typically bucketed into two categories:

  1. Group sequential testing: The experimenter plans a series of interim analyses (i.e., “peeks”) in advance. For example, a group sequential test plan may involve checking the results once per week for a maximum of eight weeks. This approach has become popular at companies like Spotify and Booking.com.
  2. Fully sequential testing: The experimenter can stop the experiment at any point in time without pre-determining when the interim analyses will be conducted. This approach has become popular at companies such as Netflix.

The decision to use a particular sequential or non-sequential statistical method is subject to various tradeoffs. One of the primary ones to consider is the tradeoff between flexibility and statistical power, summarized in the table below.

| | Fixed-sample testing | Group sequential testing | Fully sequential testing |
| --- | --- | --- | --- |
| Example | t-test | O'Brien-Fleming | Generalization of always valid inference (AVI) |
| Advantages | Most statistical power, more accurate point estimates, simplest to implement | Balances flexibility and statistical power | Most flexible; experimenters can stop the test at any point |
| Disadvantages | Suffers from the peeking problem | Requires defining a complicated analysis plan with interim analyses | Less statistical power |

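To make the fully sequential column concrete, here is a minimal sketch of a mixture sequential probability ratio test (mSPRT) in the spirit of the always-valid-inference literature. It assumes a stream of approximately normal treatment-minus-control differences with a known variance; the mixing variance, peek cadence, and simulated data are hypothetical choices for illustration, not any vendor's implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05    # false-positive rate guaranteed at any stopping time
sigma2 = 1.0    # assumed known variance of the per-pair differences
tau2 = 0.1      # variance of the N(0, tau2) mixing distribution

# Simulated stream of treatment-minus-control differences with a small true lift
diffs = rng.normal(loc=0.05, scale=np.sqrt(sigma2), size=20_000)

p_always_valid = 1.0
for n in range(100, len(diffs) + 1, 100):  # "peek" every 100 pairs
    mean_diff = diffs[:n].mean()
    # Mixture SPRT statistic for H0: the mean difference is zero
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n ** 2 * tau2 * mean_diff ** 2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    # The always-valid p-value is the running minimum of 1 / Lambda_n
    p_always_valid = min(p_always_valid, 1.0 / lam)
    if p_always_valid < alpha:
        print(f"Stop at n = {n} pairs, always-valid p = {p_always_valid:.4f}")
        break
```

Because the statistic exceeds 1/alpha with probability at most alpha under the null no matter when you look, the experimenter can check it as often as desired; the cost is the reduced statistical power noted in the table.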

One effective approach is a hybrid of sequential and fixed-sample testing. For example, one can use a sequential test to detect degradations early, minimizing the time that harmful changes are live, and a fixed-sample test to detect improvements, retaining statistical power and accurate point estimates. In practice, this can be viewed as two separate one-tailed tests: a sequential test for the degradation tail and a fixed-sample test for the improvement tail.

Analyzing Experiment Results

Several concepts are helpful to understand when analyzing an experiment. However, one should not wait until the experiment results are ready for review to start thinking about these considerations. Ideally, the experiment analysis is planned in advance to the point that there is minimal friction when translating the results into a decision.

Defining Good Metrics

One of the most challenging aspects of experimentation is quantifying success with effective metrics. There are several key aspects of metric definitions that teams should consider:

  • Business relevance: how does the metric align with top-level strategic goals?
  • Statistical power: some metrics are strategically relevant, but have high variance, making them difficult to measure meaningfully.
  • Outlier handling: metrics like revenue per user tend to have a small number of extreme values that can inflate variance. Consider capping or winsorization (a minimal sketch follows this list).
  • Time lags: some metrics are lagging indicators of the impact of a change. For example, it may take weeks for a bad experience to cause a user to churn.
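For the outlier-handling consideration above, here is a minimal winsorization sketch; the 99th-percentile cap is an arbitrary illustrative choice that should be tuned to the metric.

```python
import numpy as np

def winsorize_upper(values, quantile=0.99):
    """Cap extreme values at an upper quantile so that a handful of
    outliers does not dominate the variance of the metric."""
    cap = np.quantile(values, quantile)
    return np.minimum(values, cap)

# Hypothetical usage on a heavy-tailed revenue-per-user metric
# revenue_capped = winsorize_upper(revenue_per_user, quantile=0.99)
```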

Bayesian and Frequentist Approaches

Another key decision in the analysis approach is whether to use Bayesian or frequentist approaches. This is an often-debated topic worthy of a much longer discussion, but the gist is that the two frameworks have completely different goals. In the context of most A/B testing implementations, the goals can be roughly summarized (and admittedly oversimplified) as follows:

  1. Frequentist: control the error rate. This means we want to control the probability of reaching an incorrect conclusion given an assumed true effect. For example, if there is no true effect, we want no more than a 5% chance of concluding that there is one.
  2. Bayesian: quantitatively update beliefs with data. This means we define a distribution of possible treatment effect values before running the experiment (the prior), and the experimental data (the likelihood) update that distribution to quantify our knowledge after analyzing the results (the posterior).

Advocates of the frequentist approach emphasize its error-control guarantees and its lack of subjectivity compared to the Bayesian approach. Conversely, Bayesian advocates emphasize the ability to incorporate prior knowledge and a presentation of results that is more relevant for decision-making. The Bayesian approach has additional benefits, such as mitigating the winner's curse (also known as "Type M" errors in the language of Gelman and Carlin), and leading experimentation companies such as Amazon have begun leveraging Bayesian approaches for this application. However, these benefits strongly depend on choosing a reasonable prior, which can be difficult to specify. A common misconception is that a non-informative prior is a safe default, but in reality, this choice leads to incorrect and overconfident results. Given that most Bayesian A/B testing tools use poorly specified priors, one should exercise extreme caution when using Bayesian analyses in experimentation. A summary of the comparison between the Bayesian and frequentist approaches is shown in the table below, followed by a minimal illustration of both:

| | Frequentist | Bayesian |
| --- | --- | --- |
| Common outputs | p-values, confidence intervals | Chance to beat, credible interval, loss |
| Oversimplified intuition | "It would be strange for an A/A test to generate this result, therefore I reject the notion that the treatment does nothing" | "Given the data and prior knowledge, the treatment probably had a positive effect" |
| Advantages | No subjective assumptions; more widely understood given its greater emphasis in statistics courses | More intuitive connection to decision-making; unlocks useful concepts such as the expected value of sample information; provides a natural haircut on point estimates |
| Disadvantages | p-values and confidence intervals are often misinterpreted; effect estimates from underpowered tests are exaggerated | Defining priors is challenging; most Bayesian tools make incorrect assumptions and yield overconfident results |
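As a minimal illustration of the two columns above, the sketch below computes a frequentist two-proportion z-test p-value and a Bayesian "chance to beat control" for the same hypothetical conversion counts. It uses flat Beta(1, 1) priors purely for simplicity, which, as noted above, is exactly the kind of non-informative default that deserves caution in practice.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical conversion counts
n_a, conv_a = 10_000, 1_000   # control
n_b, conv_b = 10_000, 1_080   # treatment

# Frequentist: pooled two-proportion z-test
p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))

# Bayesian: Beta(1, 1) priors updated with the observed counts, then
# Monte Carlo samples from the posteriors estimate P(treatment > control)
post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=200_000)
post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=200_000)
chance_to_beat = (post_b > post_a).mean()

print(f"p-value = {p_value:.3f}, chance to beat control = {chance_to_beat:.1%}")
```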

Experiment Deep Dives

When analyzing experiment results, it can be insightful to slice and dice the results along various dimensions. However, common pitfalls can turn experiment deep dives into a net negative for experimentation teams, such as:

  • Segmenting results into groups with small sample sizes, amplifying the impact of noise
  • Failing to account for the multiple comparisons problem (a minimal adjustment sketch follows below)
  • Analyses that are not self-serve and, therefore, consume the data team’s bandwidth
  • Aimless analyses that surface false insights that do not generalize to new data
  • Unnecessary complexity that introduces friction in decision-making due to conflicting perceived patterns

Here are effective practices to mitigate these pitfalls:

  • Define segmentation and hypotheses before running the experiment
  • Make analyses self-serve to avoid consuming data science bandwidth
  • Invest resources into enabling more experiments rather than overanalyzing individual experiments
  • Test any insights that were not part of the initial hypotheses in follow-up experiments
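For the multiple comparisons pitfall above, one option is to adjust segment-level p-values before treating any slice as an insight. This minimal sketch uses the Benjamini-Hochberg procedure from statsmodels; the segment p-values are hypothetical.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from slicing one experiment into six segments
segment_p_values = [0.004, 0.03, 0.04, 0.20, 0.45, 0.81]

# Benjamini-Hochberg controls the false discovery rate across the slices
reject, p_adjusted, _, _ = multipletests(segment_p_values, alpha=0.05, method="fdr_bh")

for raw, adj, keep in zip(segment_p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f} -> adjusted p = {adj:.3f}, significant: {keep}")
```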

Experiment Reports

Beyond its ability to inform the immediate ship vs. don’t ship decision, experimentation provides value by shaping product knowledge and informing the future roadmap. Unlocking this benefit requires clear experiment reports that can be referenced later and understood by people who were not involved in the experiment. This includes:

  • Clearly documented hypotheses that indicate the thought process behind the experiment
  • Screenshots that clearly show what was tested
  • Key results next to key takeaways

Applications of Experimentation in Real-world Scenarios

Although there are general best practices associated with the execution of digital experiments, different applications are often better suited to different approaches. The table below summarizes common use cases and associated considerations.

| Use case | Goal | Challenges | Common metrics | Common techniques |
| --- | --- | --- | --- | --- |
| Releasing software (canary testing) | Mitigate risk | Need near-realtime event logging | Error rates, latency | Sequential testing |
| Changing a website design | Validate new ideas | Identity resolution for logged-out users | Conversion rate, revenue per user | Fixed-sample or sequential testing |
| Refining a search algorithm | Improve relevancy | Choosing candidate models, mutually exclusive experiments | Bookings, mean reciprocal rank | Fixed-sample or sequential testing |
| AI model evaluation | Measure business impact of new models | Offline metrics don't necessarily translate to better business outcomes | Retention, revenue, new subscriptions | Fixed-sample or sequential testing |
| B2B experiments | Validate new ideas | Cannot randomize at the user level due to interactions | Retention, usage rate, signup rate | Clustered experiments |

Conclusion

Companies in various industries have increasingly relied on controlled experiments as a means of evaluating the causal effect of product changes. Although controlled experiments may seem simple at first, there are various subtleties associated with proper design, implementation, and analysis. By following best practices, organizations can unlock the value of experimentation by leveraging trustworthy insights and shaping product knowledge.
