Statistics
March 6, 2025

What to Do When You Encounter Sample Ratio Mismatch in A/B Testing

Allon Korem
CEO of Bell. Bell offers outsourced services in A/B testing, causal inference, marketing mix modeling, and geo tests. Its services are trusted by top companies such as Monday.com, playSTUDIOS, Playtika, and Lemonade.

As an eager analyst, you’ve just received the data for an A/B test. Wasting no time, you dive into the analysis: selecting the appropriate statistical test and meticulously sidestepping pitfalls like data peeking. To your delight, the results reveal a significant improvement in the treatment group. However, despite following best practices, something fundamental is still missing from your process. Any idea what it could be?

The missing piece in your process is a Sample Ratio Mismatch (SRM) check — or simply put, verifying whether the actual allocation of participants to groups matches the intended split. This post aims to explain why checking for SRM is crucial, explore common reasons it occurs, and guide you on how to detect, diagnose, and address SRM issues effectively.

What is SRM?

In each A/B test, users are divided into at least two groups. Before the test begins, the analyst determines the proportion of users assigned to each group. The best practice is to split users evenly between the groups, though other distributions are also acceptable. While minor deviations from the planned allocation are common, a substantial discrepancy results in a sample ratio mismatch (SRM).

Also known as unbalanced sampling, SRM can arise in online controlled experiments due to failures in randomization or instrumentation. Even a small discrepancy in group sizes can invalidate the results, especially with a large sample size. SRM is typically detected using a chi-squared test, such as Pearson’s chi-squared goodness of fit test. For instance, a p-value of 2.54 × 10^-10 would indicate a statistically significant sample ratio mismatch, signaling that the observed group sizes deviate significantly from the expected proportions.

Why does SRM matter?

To isolate the impact of product variation on the Key Performance Indicator (KPI), the control and test groups must be equivalent across all parameters except the one being manipulated in the test. How can we ensure this? The simplest approach is random allocation — since users are assigned to groups randomly, there should be no other consistent difference between the characteristics of users in each group.

While more advanced techniques, such as stratified sampling, can further ensure group similarity, random allocation is often sufficient for our purposes. However, discovering SRM in the dataset can seriously undermine the principle of random allocation, as it suggests that users may be disproportionately excluded from one of the groups. This imbalance can introduce bias into the user characteristics represented in each group, potentially compromising the test’s validity. Take a look at the following illustration to gain a clearer understanding of this point:


Figure 1. Suppose you’re analyzing the effect of a change in your game on revenue. Your population includes both game-addicted players (purple) and regular players (black), split evenly between control and test groups. Due to technical issues, loading times are longer in the treatment version, causing regular players to quit, especially in the treatment group. This leads to a smaller test group with a higher proportion of addicted players, as shown in the final sample. While the treatment group shows higher average revenue, the result is inconclusive because it’s unclear whether the new version or the different player profiles are driving the revenue increase.

Thus, detecting an SRM in our dataset may signal a violation of one of the fundamental assumptions of statistical inference: the assurance of random allocation. Without it, any observed differences in the data could be attributed to underlying group characteristics rather than the variation being tested, undermining the validity of the results.

How to Detect SRM?

SRM detection involves comparing the expected and actual sample sizes for each variant. While perfect alignment isn’t expected, there should be reasonable consistency between the planned and observed allocations. To evaluate whether discrepancies are acceptable, analysts typically use the goodness-of-fit chi-squared test. This test compares the planned and actual group proportions to assess whether there is a significant difference. The null hypothesis here is that users are allocated according to the planned proportions. Thus, unlike standard KPI analysis, the goal is not to reject the null hypothesis, but to obtain a non-significant result, indicating that the actual allocation aligns with the planned one.

To quantify the difference between the planned and the observed allocations, we use the chi-squared statistic, calculated with the formula:

\[\chi^2 = \sum \frac{(O_i-E_i)^2}{E_i}\]

Where:

  • \(O_i\) represents the observed frequency (i.e., the actual number of users in each group),
  • \(E_i\) represents the expected frequency (i.e., the number of users you would expect in each group based on your planned allocation).

For example, suppose you're running an A/B test with 200 users. You expect 100 users in each group, but the actual allocation is 90 users in the control group and 110 in the treatment group. The chi-squared calculation would be:

\[\chi^2 = \frac{(90-100)^2}{100} + \frac{(110-100)^2}{100} = 2\]

This value follows a chi-squared distribution with 1 degree of freedom (since we have two groups). Using this statistic, we can calculate the p-value, which is approximately 0.157. Since the p-value is greater than typical significance levels used for SRM checks (e.g., 0.1), we fail to reject the null hypothesis. This suggests that the allocation to groups is acceptable and there is no significant deviation. If you want to gain more intuition about the idea of chi-squared, try to repeat this calculation with an actual allocation of 130-70. What will be the conclusion in this case?
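In practice, you don’t need to compute this by hand. Here is a minimal sketch of the same check in Python, using scipy.stats.chisquare; the 90/110 counts come from the example above.

```python
# Minimal SRM check: goodness-of-fit chi-squared test against the planned split.
from scipy.stats import chisquare

observed = [90, 110]    # actual users in control and treatment
expected = [100, 100]   # counts implied by the planned 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {stat:.3f}, p-value = {p_value:.4f}")
# chi-squared = 2.000, p-value = 0.1573 -> no significant SRM at alpha = 0.1
```

Swapping in the 130-70 allocation from the exercise gives a chi-squared of 18 and a p-value of roughly 2.2 × 10^-5, a clear SRM.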

While it’s crucial to conduct SRM analysis at the overall sample level, it can also be valuable to examine subgroups within the population. For example, you might check whether the division of users between the control and test groups is consistent across different operating systems. In this case we use a different variant, the chi-squared test of independence, which identifies whether allocation discrepancies exist in specific subgroups of the sample.
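As an illustration, here is a sketch of such a subgroup check using scipy.stats.chi2_contingency; the segment names and counts are invented for the example.

```python
# Chi-squared test of independence: is the control/treatment split
# independent of the user's operating system?
import numpy as np
from scipy.stats import chi2_contingency

# Rows are segments (illustrative counts); columns are control, treatment.
counts = np.array([
    [4980, 5020],   # iOS
    [5100, 4890],   # Android
    [1890, 2110],   # desktop
])

stat, p_value, dof, expected = chi2_contingency(counts)
print(f"chi-squared = {stat:.2f}, dof = {dof}, p-value = {p_value:.4f}")
# A small p-value suggests the allocation ratio differs across segments,
# pointing to an SRM confined to (or driven by) specific subgroups.
```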

Real-World Examples and Case Studies

Here are some real-world examples and case studies illustrating how companies have dealt with Sample Ratio Mismatch in A/B testing:

  • An online fashion company uses a free-text search engine to help customers find items in its store. The company’s analysts conducted an A/B test to examine how modifying an item’s description (e.g., “black shoes” vs. “elegant black shoes”) affects sales. In this experiment, users were randomly assigned to either the control or treatment group upon loading the item’s page.

Despite equal allocation in theory, the analysts found that the treatment group had significantly fewer users than the control group. Further investigation revealed that the updated item descriptions in the treatment group led to fewer search appearances. Since the search engine relied on exact matches, users searching for “black shoes” did not see the item described as “elegant black shoes” in their results. As a result, the control group received more page visits, ultimately causing an SRM. 

In this scenario, identifying a significant effect is problematic because the two groups differed not only in item descriptions but also in how users searched for products on the site. This confounding factor makes it difficult to isolate the true impact of the description change.

  • Due to budget constraints, a gaming application decided to start collecting user data only after 30 seconds of use. In one experiment, analysts tested a change in the onboarding system that caused users to abandon the app more quickly—some even before any data collection began. As a result, the treatment group ended up significantly smaller than the control group, leading to a pronounced SRM.

In this scenario, any significant result is meaningless, as the treatment group consists only of users who stayed in the app longer. These users may naturally exhibit better KPIs, making it impossible to isolate the true effect of the treatment.

  • A company wanted to compare two versions of its website, A and B. Users were considered part of the test if they were exposed to a button on the landing page. However, for version B, there was an additional entry point from a different source. As a result, version B had a larger number of users, leading to a significant SRM.

In this case, a meaningful comparison between the two groups was impossible because user characteristics differed based on how they entered the test. Therefore, even if version B showed a lower conversion rate, this difference could not be reliably attributed to the site version itself.

These examples highlight the importance of addressing SRM and using data-driven strategies to ensure the success of A/B testing initiatives.

I Found SRM In My Data, What to Do Next?

Once SRM is detected, it’s crucial to identify where in the test process the imbalance arises. Specifically, there are two key points where problems might occur:

1. Randomization Mechanism: Sometimes, the discrepancy between the planned and actual allocation is due to the procedure used to assign users to groups. For example, if a randomization function hashes user and experiment IDs into 256 buckets for a test with three groups (each allocated 33%), the buckets cannot be divided evenly among the three groups, leading to SRM (see the sketch after this list). Fortunately, this type of SRM doesn’t indicate any systematic differences between groups, so it doesn’t undermine the test’s validity. In such cases, recognizing this as the cause of SRM allows you to proceed with the analysis without concern.

2. Factors Confounded with the Treatment: SRM becomes problematic when the cause is related to the treatment itself. In this case, the initial group allocation might be equal, but imbalances can emerge later in the process. There are two main types of factors to consider:

  • External Factors: Sometimes, specific events, such as infrastructure updates or shifts in traffic patterns, coincide with the test and cause traffic anomalies. To detect such issues, visualizing user assignments over time and correlating anomalies with external events can help identify if the flawed allocation is linked to an external factor.
  • Inherent Factors: These are factors directly related to the treatment, such as technical issues in the test version (e.g., loading times), which may lead to different drop-off rates between groups. To detect these issues, analysts should compare the groups on key measures of the experiment’s flow, such as API delays or load times.
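
To make the bucket-rounding issue from point 1 concrete, here is a toy sketch; the 256-bucket, three-group setup mirrors the example above, and the boundary choice is an assumption.

```python
# Toy sketch: 256 hash buckets cannot be split into three equal groups,
# so one group is slightly larger by construction.
bucket_count = 256
group_sizes = [bucket_count // 3] * 3               # [85, 85, 85] covers 255 buckets
group_sizes[-1] += bucket_count - sum(group_sizes)  # leftover bucket -> [85, 85, 86]

shares = [size / bucket_count for size in group_sizes]
print(group_sizes)                   # [85, 85, 86]
print([f"{s:.4%}" for s in shares])  # ['33.2031%', '33.2031%', '33.5938%']
# With enough users, an SRM check against a naive 1/3-1/3-1/3 expectation will
# flag this, even though the groups remain comparable. The fix is to test
# against the bucket-implied proportions (85/256, 85/256, 86/256) instead.
```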

In some cases, the treatment may only affect certain subsets of users, for example, longer loading times for mobile users but not for desktop users. In such situations, it's critical to examine group allocation across various subpopulations (e.g., mobile vs. desktop, different browsers) to see if the distribution is influenced by the treatment. If SRM appears in specific segments, it could signal issues in the experiment’s infrastructure or flow.

Once such patterns are identified, the next step is to investigate whether there are differences in technical properties, like loading times, across these segments. Although running many such comparisons inflates the false-positive rate, this is less concerning here: these tests serve as diagnostics that flag leads for investigation, not as confirmatory analyses.
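As a sketch of such a segment-level scan, the helper below repeats the goodness-of-fit check within each segment of a hypothetical user-level DataFrame; the column names, group labels, and the 50/50 default are illustrative assumptions.

```python
# Per-segment SRM scan over a user-level table with "segment" and "group" columns,
# where every row's "group" is either "control" or "treatment".
import pandas as pd
from scipy.stats import chisquare

def srm_scan(df: pd.DataFrame, planned=(0.5, 0.5)) -> pd.DataFrame:
    rows = []
    for segment, sub in df.groupby("segment"):
        observed = [(sub["group"] == "control").sum(),
                    (sub["group"] == "treatment").sum()]
        expected = [p * len(sub) for p in planned]
        stat, p_value = chisquare(f_obs=observed, f_exp=expected)
        rows.append({"segment": segment, "n": len(sub), "p_value": p_value})
    return pd.DataFrame(rows).sort_values("p_value")

# Segments that float to the top with tiny p-values are leads to investigate,
# not verdicts: across many segments, a few small p-values will occur by chance.
```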

What to Do If You Cannot Find the Cause of the SRM?

Even if the source of the SRM cannot be identified, it is still likely that something is causing the discrepancy in the proportion of users, rendering the data invalid for analysis. In this case, you may consider re-running the test. If the SRM is due to random chance (i.e., a false positive) or an unknown issue that occurred on specific dates or with particular traffic, the SRM may not recur, even without changing anything in the test.

Conclusions

A fundamental requirement for reliable statistical inference is that the groups differ only by the factor manipulated by the treatment. Detecting an SRM in the data strongly suggests that this assumption has been violated. Therefore, when SRM is identified, the data cannot be reliably analyzed until the source of the imbalance is investigated and resolved. In this blog, we covered how to check for SRM, explored its potential causes, and discussed strategies to identify its source. We hope you found this guide helpful. May your experiments stay balanced, and may what you plan be what you get!
