February 25, 2025

Rethinking Experimental Statistics: New Standards for 2025

Ryan Lucht
Experimentation evangelist focused on sharing ambitious ideas for getting everyone testing. Before joining Eppo, Ryan was an experimentation consultant helping companies like DoorDash, Zillow, and Clorox grow their programs.

TL;DR

  • Traditional experimental statistics are evolving as organizations move beyond rigid p-value thresholds (< 0.05) and customize statistical standards on a per-experiment basis to balance innovation with risk.
  • Companies like Amazon and Etsy are adopting hierarchical Bayesian models and shrinkage techniques to measure true cumulative experimental impact.
  • New testing guidelines are enabling automated experimentation that streamlines decision-making while maintaining statistical rigor.
  • Organizations are expanding experimental statistics beyond A/B testing to include methods like geolift tests and synthetic controls for complex scenarios.
  • Companies are shifting from individual test oversight to broader testing guidelines and policy-level governance to scale experimentation programs effectively.

Innovation thrives on experimentation, but as organizations scale their programs, the metrics and testing guidelines steering these efforts require a fresh perspective. Trusting your results, refining methodologies, and ensuring actionable insights are more critical now than ever. These shifts are revolutionizing how organizations make decisions—faster, more effectively, and with greater confidence. Below, we examine the emerging trends and standards defining the future of experimental statistics.

Moving Beyond Traditional Standards in Testing

For decades, the p-value threshold of < 0.05 has stood as a near-sacred benchmark in experimental statistics. While this rule provides a certain simplicity, it was never intended as an all-encompassing solution. Increasingly, testing guidelines are evolving to serve the practical needs of innovation-driven environments, where missing a promising opportunity can be far more costly than a false positive.

Although work along these lines has been proposed for years (see “A Decision Theoretic Approach to A/B Testing”, published in 2017), new work from the Netflix team presented at this year’s MIT Conference on Digital Experimentation is reviving interest in the approach. Their paper, “Optimizing Returns from Experimentation Programs,” is available on arXiv. This approach emphasizes aligning testing guidelines with real business needs, treating statistical rigor as an enabler of innovation rather than a limiting factor.
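To make the trade-off concrete, here is a minimal sketch of the decision-theoretic framing, not the Netflix team's actual model: simulate a portfolio of experiments whose true lifts follow an assumed prior, and measure the value captured by a given ship threshold. The prior width, noise level, and thresholds below are illustrative assumptions.

```python
# Illustrative sketch: how the significance threshold alpha affects the
# value a program captures, given an assumed prior over true lifts.
import random
from statistics import NormalDist

def program_value(alpha, n_experiments=2000, prior_sd=0.02, se=0.01, seed=7):
    """Return (avg_realized_lift, ship_rate) when we ship any variant whose
    observed lift clears a one-sided significance level `alpha`."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)        # one-sided critical value
    total_lift, shipped = 0.0, 0
    for _ in range(n_experiments):
        true_lift = rng.gauss(0.0, prior_sd)        # draw a "real" effect
        observed = true_lift + rng.gauss(0.0, se)   # noisy measurement
        if observed / se > z_crit:                  # significance gate
            total_lift += true_lift                 # realized value = truth
            shipped += 1
    return total_lift / n_experiments, shipped / n_experiments
```

Comparing `program_value(0.05)` with `program_value(0.20)` shows the looser gate ships more experiments; whether that gains or loses value depends on the prior and the cost of shipping losers, which is exactly the per-experiment trade-off this line of work formalizes.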

Metrics as a Key Driver of Success

Driving true innovation requires robust measures of cumulative impact that account for more than just summing the outcomes of individual experiments. The traditional focus on each experiment's trustworthiness is expanding to include program-level reliability, ensuring cumulative results align with actual business impact.

One recurring challenge is reconciling the apparent gains of multiple experiments with a lack of corresponding improvement in aggregate business performance. A Chief Product Officer might ask, “If all these teams report significant wins, why aren’t we seeing it reflected in overall metrics?”

To address this, several leading organizations have pioneered advanced approaches to estimating cumulative impact without the use of long-term holdouts, namely by utilizing hierarchical Bayesian models. Additional work along these lines was presented at this year’s MIT Conference on Digital Experimentation.
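The intuition behind these models can be sketched with simple empirical-Bayes shrinkage (a full hierarchical Bayesian model would infer the prior jointly rather than plug it in): pool noisy per-experiment lift estimates toward a shared prior, so the program-level total isn't inflated by winner's-curse noise. The normal-normal setup below is an illustrative assumption, not any specific company's implementation.

```python
# Illustrative empirical-Bayes shrinkage of per-experiment lift estimates.
def shrink_lifts(estimates, std_errors, prior_mean=0.0, prior_var=None):
    """Return posterior-mean lift for each experiment under a normal prior.
    If prior_var is None, estimate the prior from the data themselves
    (method of moments: observed spread minus sampling noise)."""
    n = len(estimates)
    if prior_var is None:
        mean = sum(estimates) / n
        obs_var = sum((x - mean) ** 2 for x in estimates) / n
        avg_se2 = sum(s * s for s in std_errors) / n
        prior_var = max(obs_var - avg_se2, 1e-12)   # floor at ~zero
        prior_mean = mean
    shrunk = []
    for x, s in zip(estimates, std_errors):
        w = prior_var / (prior_var + s * s)   # weight on the raw estimate
        shrunk.append(w * x + (1 - w) * prior_mean)
    return shrunk
```

Summing the shrunk lifts gives a more honest cumulative-impact estimate than summing the raw estimates of statistically significant wins, which systematically overstate the program's total effect.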

Rethinking Testing Guidelines for Modern Workflows

The evolution of workflows is placing greater emphasis on testing guidelines that support efficiency, scalability, and consistency. Previously, experimentation required significant manual effort: result analysis, approval processes, and outcome tracking all slowed things down. This inefficiency often bottlenecked innovation.

In 2025, we’re seeing a shift towards "auto-experiments," which remove much of the friction in decision-making while allowing organizations to adhere to well-defined testing guidelines. For instance, Airbnb has adopted pre-defined metrics, like increased bookings or an improved five-star rating ratio, to streamline decision-making. These standardized metrics enable faster evaluations without sacrificing rigor or scalability.
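A hypothetical sketch of what such a pre-defined decision rule might look like: ship when the primary metric's confidence interval clears zero and no guardrail metric is significantly harmed. The metric names and structure below are invented for illustration; real systems encode these policies per experiment.

```python
# Illustrative auto-experiment decision rule driven by pre-defined metrics.
def auto_decision(results, primary, guardrails):
    """results: {metric: (lift, ci_low, ci_high)}.
    Returns 'ship', 'rollback', or 'continue'."""
    _, lo, hi = results[primary]
    for g in guardrails:
        _, _, g_hi = results[g]
        if g_hi < 0:                 # guardrail significantly harmed
            return "rollback"
    if lo > 0:                       # primary significantly positive
        return "ship"
    if hi < 0:                       # primary significantly negative
        return "rollback"
    return "continue"                # inconclusive: keep collecting data
```

Encoding the policy once, rather than re-litigating every readout in a meeting, is what lets this style of automation scale without sacrificing rigor.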

Scaling with Confidence

Well-defined testing guidelines play a pivotal role in expanding experimentation pipelines. Automated frameworks allow faster iteration and more standardized evaluations, ensuring innovation is not only achievable but measurable across the board.

Expanding the Scope of Experimental Design

Traditionally, experimentation was limited to contexts where clean, randomized control trials (RCTs) were feasible. While A/B testing remains a staple for website changes, forward-thinking organizations are pushing past these boundaries to design frameworks that solve more intricate, real-world challenges.

For instance:

  • Marketing Attribution: Companies like Webflow address the complexity of attributing impact in marketing campaigns, especially for budgets spent on platforms like YouTube. To resolve this, geolift tests allocate marketing spend across geographic regions and track outcomes—a flexible alternative when traditional RCTs aren't feasible.
  • Retail Pilots: Retailers often use pilot programs to predict new product success. These quasi-experiments integrate synthetic controls and advanced modeling to compensate for the lack of randomization, producing actionable insights even in dynamic environments.
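The synthetic-control idea behind both examples can be shown in miniature: build a weighted blend of control regions that tracks the pilot region's pre-launch metric, then read the post-launch gap as the estimated lift. Production implementations optimize convex weights over many control units; this toy version grid-searches a two-control blend, and all data in the usage example is made up.

```python
# Toy synthetic control: fit convex weights on the pre-period, then
# measure the treated-vs-synthetic gap in the post-period.
def synthetic_control_lift(treated, controls, launch_idx, steps=1000):
    """treated: list of outcomes; controls: two control series of the same
    length; launch_idx: first post-launch period. Returns (weights, avg_lift)."""
    c0, c1 = controls
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        err = sum((treated[t] - (w * c0[t] + (1 - w) * c1[t])) ** 2
                  for t in range(launch_idx))       # fit on pre-period only
        if err < best_err:
            best_w, best_err = w, err
    post = range(launch_idx, len(treated))
    gaps = [treated[t] - (best_w * c0[t] + (1 - best_w) * c1[t]) for t in post]
    return (best_w, 1 - best_w), sum(gaps) / len(gaps)
```

For example, if a pilot region's pre-launch sales exactly track the average of two control regions and then jump by 10 units after launch, the estimated weights come out near (0.5, 0.5) and the estimated lift near 10.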

By broadening their testing guidelines, organizations gain valuable insights into areas previously deemed too complex to measure.

Building a Culture of Experimental Excellence

The convergence of these advancements in metrics, adaptive testing guidelines, and automated workflows is transforming experimentation into a cultural pillar. Organizations are moving beyond evaluating individual tests, instead focusing on broad policy-level governance. Such cultural shifts allow experimentation programs to scale while maintaining consistency and rigor.

This transformation is occurring at a critical juncture. Advances in AI and productivity tools are generating unprecedented volumes of ideas and innovations. Organizations that challenge legacy mindsets and proactively refine their experimentation practices will outpace competitors by making smarter, faster—and ultimately more impactful—decisions.

Final Thoughts on Testing Guidelines and Metrics

Experimental statistics are evolving rapidly to meet the demands of modern businesses. By refining key elements such as p-value thresholds, adopting more trustworthy metrics, and integrating better testing guidelines, organizations foster a culture of continuous learning and improvement. These adjustments make it easier to create agile, data-backed decision-making processes that drive innovation across the board.

If your team is rethinking how experimentation fits into its strategy, now is the time to explore how these advances can help you innovate smarter and faster. The most successful programs aren't just about optimization—they actively shape the future.

FAQs

Why are companies evolving their testing guidelines beyond the traditional p-value < 0.05 threshold?

The p-value threshold has historically been a simple, universal guide. However, organizations like Netflix are rethinking the relative risk of false positives given their actual distribution of experiment outcomes, and weighing this risk against the potential of testing (and winning) faster. Instead of rigid guidelines, customizing experiment plans can better balance precision with practicality, aligning with specific business goals.

How are "auto-experiments" revolutionizing experimentation workflows?

Auto-experiments automate routine tasks—running tests, analyzing results, and applying decision frameworks—while incorporating pre-defined testing guidelines. Companies like Airbnb have leveraged this approach to scale their efforts without compromising rigor.

When is experimentation still possible without A/B testing?

In complex scenarios where A/B testing isn't feasible, alternatives like geolift tests, synthetic controls, and pilot studies provide valuable insights. For example, these methods help measure the effectiveness of marketing campaigns or new product launches.

What’s the role of experimentation committees under the new paradigm?

These committees now focus less on individual tests and more on governing the policies and testing guidelines for broader innovation. This enables organizations to manage experimentation programs at scale while ensuring consistency and impact.

