A/B Testing
December 17, 2024

Clustered Experiments: When Traditional A/B Testing Falls Short

What if randomizing users isn't sufficient?
Lukas Goetz-Weiss
Eppo's Customer Data Science Manager. Before Eppo, Lukas built in-house experimentation tools at companies like Angi

Experimentation is a cornerstone of data-driven decision-making, used to validate hypotheses and optimize product performance. Traditional A/B testing has long been the go-to method, but experienced data leaders recognize its limitations, especially as products grow in complexity and user interactions become more intertwined. Clustered Experiments are a powerful approach designed to address those limitations and yield more reliable insights.

What Are Clustered Experiments?

At its core, a Clustered Experiment deviates from the standard A/B test by randomizing groups (or clusters) of analysis units rather than individual units. This methodology is particularly beneficial in scenarios where individual randomization leads to interference effects, compromising the integrity of the experiment and the consistency of user experiences.

Traditional A/B Testing vs. Clustered Experiments

Standard A/B Test: Randomizes individual users into different variants and analyzes metrics on a per-user basis. For instance, splitting users to test a new feature and measuring the conversion rate per user.

Clustered Experiment: Randomizes groups of users (clusters) such as companies, geographical regions, or user segments, and analyzes metrics both at the cluster level and at the user level. This approach mitigates interference within clusters and maintains consistent experiences.


When Traditional A/B Testing Falls Short

Some examples where a Clustered Experiment is more appropriate than a traditional A/B test include:

  1. Session-level metrics: Evaluating session-level metrics in a user-randomized experiment can violate the independence assumption inherent in standard A/B tests, leading to an increase in false positives. In this case, users play the role of a “cluster of sessions”. By accounting for this correlation structure, we can restore guarantees on false positive rates.
  2. Organizational-Level Interference: In B2B contexts where users belong to the same organization, randomizing at the user level might cause cross-group contamination. For example, in internal messaging software, treated users might influence untreated peers within the same company.
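To make the first pitfall concrete, here is a small sketch on invented data (not Eppo's implementation): sessions are perfectly correlated within each user, so a standard error computed as if every session were independent understates the real uncertainty, while one computed over per-user means does not.

```typescript
// Synthetic data: 4 users, each with several sessions. Session values are
// perfectly correlated within a user (here: identical), so the effective
// sample size is the number of users, not the number of sessions.
const sessionsByUser: number[][] = [
  [10, 10, 10],     // user A
  [20, 20, 20, 20], // user B
  [30, 30],         // user C
  [40, 40, 40],     // user D
];

const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
const sampleVar = (xs: number[]) => {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
};

// Naive SE: pool all sessions and pretend they are independent.
const allSessions = sessionsByUser.flat();
const naiveSE = Math.sqrt(sampleVar(allSessions) / allSessions.length);

// Cluster-aware SE: collapse to one mean per user first.
const userMeans = sessionsByUser.map(mean);
const clusteredSE = Math.sqrt(sampleVar(userMeans) / userMeans.length);

console.log({ naiveSE, clusteredSE }); // clusteredSE exceeds naiveSE here
```

With these numbers the naive standard error is roughly half the cluster-aware one, which is exactly the kind of overconfidence that inflates false positive rates.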


Overcoming Statistical Challenges

One of the primary challenges with Clustered Experiments is the lack of independence among observations within the same cluster. Traditional statistical methods assume independence, leading to underestimated variances and overconfident results when applied naively to clustered data.

Enter Clustered Analysis

Applying the delta method to the ratio of cluster-aggregated metrics and the cluster size offers a robust solution. For details, please see Deng et al. and Chapter 18 of the authoritative text Trustworthy Online Controlled Experiments. By expressing complex metrics as ratios of simple metrics normalized at the cluster level, we can inherently account for the clustered structure of the data. This gives results that are mathematically equivalent to the common Cluster Robust Standard Errors (CRSE) approach, but is far more scalable from a computational perspective.
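As a rough sketch of the idea (the function and type names below are ours for illustration, not Eppo's API): the delta method linearizes the ratio of cluster-level means and combines the sample variances and covariance of the per-cluster totals.

```typescript
// Delta-method variance for a ratio metric R = sum(Y_i) / sum(N_i), where
// Y_i and N_i are the metric total and analysis-unit count for cluster i.
interface Cluster { y: number; n: number }

function ratioMetricWithSE(clusters: Cluster[]): { estimate: number; se: number } {
  const k = clusters.length;
  const meanY = clusters.reduce((a, c) => a + c.y, 0) / k;
  const meanN = clusters.reduce((a, c) => a + c.n, 0) / k;
  const r = meanY / meanN; // same as sum(Y) / sum(N)

  // Sample variances and covariance of the per-cluster totals.
  let varY = 0, varN = 0, covYN = 0;
  for (const c of clusters) {
    varY += (c.y - meanY) ** 2;
    varN += (c.n - meanN) ** 2;
    covYN += (c.y - meanY) * (c.n - meanN);
  }
  varY /= k - 1; varN /= k - 1; covYN /= k - 1;

  // First-order Taylor expansion of meanY / meanN.
  const varR = (varY - 2 * r * covYN + r * r * varN) / (k * meanN * meanN);
  return { estimate: r, se: Math.sqrt(varR) };
}

// Example: users as clusters of orders, y = revenue total, n = order count.
const aovByUserCluster = ratioMetricWithSE([
  { y: 100, n: 4 },
  { y: 20, n: 1 },
  { y: 60, n: 3 },
]);
console.log(aovByUserCluster.estimate); // 180 / 8 = 22.5
```

The point estimate is just the overall ratio; only the variance calculation changes, which is why this composes cleanly with the rest of a standard analysis pipeline.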

Key Advantages:

  • Accurate Variance Estimation: Properly accounts for intra-cluster correlations, ensuring that confidence intervals and significance levels are reliable.
  • Flexibility: Can be applied to a wide range of metrics, whether they are per-user or per-order, timeboxed, or filtered to specific metric properties.
  • Advanced Statistical Functionality: Because clustered experiments are analyzed with ratio metrics, you can still accelerate readouts with multivariate variance reduction (CUPED++) and use your favorite statistical methodologies: Sequential Hybrid, Fixed Sample, Bayesian, etc.


Real-World Applications

1. Measuring Average Order Value (AOV) with User-Level Randomization

Scenario: A business aims to assess the impact of a new pricing strategy on AOV, where AOV is calculated per order. Users can place multiple orders, meaning the randomization unit (user) differs from the analysis unit (order).

Challenge: Randomizing at the order level could expose a single user to multiple variants, leading to inconsistent experiences and behavioral biases that persist across orders.

Solution: Randomize at the user level while analyzing AOV at the order level. By treating each user as a cluster of orders, you ensure that all orders from a single user adhere to the same treatment, preserving consistency.
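A minimal sketch of the aggregation step, using invented order data: collapse orders to per-user totals, then compute AOV as a ratio of those cluster aggregates, so variance can later be estimated at the user level.

```typescript
// Hypothetical orders keyed by user (data invented for illustration).
interface Order { userId: string; revenue: number }

const orders: Order[] = [
  { userId: "u1", revenue: 50 },
  { userId: "u1", revenue: 70 },
  { userId: "u2", revenue: 30 },
  { userId: "u3", revenue: 90 },
  { userId: "u3", revenue: 60 },
  { userId: "u3", revenue: 30 },
];

// Aggregate to one (revenue total, order count) pair per user cluster.
const byUser = new Map<string, { y: number; n: number }>();
for (const o of orders) {
  const agg = byUser.get(o.userId) ?? { y: 0, n: 0 };
  agg.y += o.revenue;
  agg.n += 1;
  byUser.set(o.userId, agg);
}

// AOV as a ratio of cluster aggregates: total revenue / total orders.
const totals = [...byUser.values()];
const aov =
  totals.reduce((a, c) => a + c.y, 0) / totals.reduce((a, c) => a + c.n, 0);
console.log(aov); // 330 / 6 = 55
```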

2. User-Level Conversion Rate in Company-Randomized Experiments

Scenario: A SaaS company offers a new feature and wants to measure its effect on user conversion rates. Each company consists of many users, and randomizing at the user level could lead to cross-user contamination.

Challenge: If individual users within the same company are randomized, treated users might influence control users, skewing the conversion metrics.

Solution: Randomize at the company level and analyze the conversion rate at the user level.
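One subtlety worth illustrating with invented numbers: the user-level conversion rate is a ratio of cluster sums, which weights every user equally. Naively averaging per-company rates would instead overweight small companies.

```typescript
// Hypothetical per-company aggregates: user count and conversions.
const companies = [
  { users: 100, converted: 10 }, // 10% conversion
  { users: 10, converted: 5 },   // 50% conversion
];

// Ratio of cluster sums: every user counts equally.
const userLevelRate =
  companies.reduce((a, c) => a + c.converted, 0) /
  companies.reduce((a, c) => a + c.users, 0); // 15 / 110 ≈ 0.136

// Unweighted mean of company rates: overweights the small company.
const meanOfCompanyRates =
  companies.reduce((a, c) => a + c.converted / c.users, 0) /
  companies.length; // (0.10 + 0.50) / 2 = 0.30
```

The ratio-of-sums form is the one that answers "what fraction of users converted?", and it is also the form the delta-method analysis expects.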

Implementing Clustered Experiments with Eppo

Eppo simplifies the planning, launching, and analysis of clustered experiments. Using Eppo’s SDK, you can easily pass in your cluster identifier as the primary ID and the analysis unit as an attribute. For example, in the B2B scenario mentioned above:


const variation = eppoClient.getBooleanAssignment(
  'enable-my-new-feature',
  companyId,
  {
    userId: 123,
    // any additional targeting or segmentation data, either for the user or the company
  },
  false
);

This ensures that all users within the same company experience the same variant, while Eppo tracks exactly which users were exposed to the new feature and when.

When configuring the analysis, Eppo allows you to evaluate both company-level and user-level metrics seamlessly. The user experience remains unchanged, but behind the scenes, Eppo applies cluster-robust statistics to deliver reliable insights.

Conclusion

Clustered Experiments are a natural evolution of controlled experimentation, addressing the challenges of interference and multi-level data structures. By leveraging clustered experiments, experimentation leaders can run more accurate and reliable experiments, ensuring that the insights derived are both actionable and trustworthy. Add them as a powerful tool in your experimentation arsenal.
