Experimentation Protocols: Your Practical Path to Better, Faster Testing
TL;DR:
Confidence levels are central to making data-driven decisions. They’re an essential part of our experiment design process because they replace guesswork with evidence when it’s time to ship new features or optimize products. But understanding and choosing confidence levels can be tricky: you’ll want to know how they shape confidence intervals, how small sample sizes limit them, and which statistical methods you can apply them to.
What You’ll Learn in This Blog:
Confidence levels tell you the frequency with which you’d expect your test results’ confidence intervals to capture the actual value of what you’re measuring if you run the experiment many times. For example, choosing a 95% confidence level and thus calculating a 95% confidence interval means that if you repeated your test 100 times, about 95 of those intervals would include the “true mean” of your metric.
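The "95 out of 100 repetitions" idea is easy to check with a simulation. The sketch below (standard library only; the population mean and spread are invented for illustration) repeatedly samples from a population with a known true mean and counts how often a 95% interval actually captures it:

```python
import random
import statistics

def confidence_interval(sample, z=1.96):
    """95% confidence interval for the mean, normal approximation."""
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * std_err, mean + z * std_err

# Draw 1,000 samples from a population with a known true mean and
# count how often the 95% interval captures it.
random.seed(42)
TRUE_MEAN = 10.0
trials = 1000
covered = 0
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, 2.0) for _ in range(100)]
    lo, hi = confidence_interval(sample)
    covered += lo <= TRUE_MEAN <= hi

print(f"Coverage: {covered / trials:.1%}")  # should land near 95%
```

Note that the guarantee is about the long-run procedure, not any single interval: each individual interval either contains the true mean or it doesn't.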
Confidence intervals show you a range of plausible values for your metric rather than focusing on one single number. Instead of incorrectly extrapolating what we observed during the experiment and saying, “This new feature improved conversion by 2%,” a confidence interval correctly quantifies the precision of our experiment design, saying, “Our true improvement lies somewhere between 1% and 3%.” This extra context helps you understand how much wiggle room there is around your estimate based on how you planned and conducted the experiment.
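As a concrete illustration of the "somewhere between 1% and 3%" reading, here is a minimal normal-approximation interval for the difference between two conversion rates. The traffic and conversion counts are made up for the example:

```python
def lift_confidence_interval(conversions_a, n_a, conversions_b, n_b, z=1.96):
    """95% CI for (treatment rate - control rate), normal approximation."""
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    std_err = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    diff = p_b - p_a
    return diff - z * std_err, diff + z * std_err

# Control: 1,000 of 10,000 users convert (10%); treatment: 1,200 of 10,000 (12%).
low, high = lift_confidence_interval(1000, 10_000, 1200, 10_000)
print(f"Observed lift: 2.0%, plausible range: {low:.1%} to {high:.1%}")
```

With these numbers the point estimate is a 2-point lift, but the interval runs from roughly 1.1% to 2.9% — exactly the "wiggle room" the paragraph above describes.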
Before running an experiment, you need to set the “certainty bar” you aim for. This is your confidence level (often 90%, 95%, or 99%). While choosing a confidence level is generally up to the person or team running the experiment, it’s influenced by your sample size, baseline rates, and variance, and how quickly you need to make decisions. Setting this level at the start establishes a clear standard for interpreting your results once the data comes in.
While 95% is common, it’s not your only option. Increasing to 99% makes you more certain but widens your confidence interval; the cost of that extra certainty is a less precise estimate and slower decision-making. Dropping to 90% narrows your interval and speeds things up but raises the risk of missing the true outcome. It’s a trade-off: pick the level that aligns with how quickly you need answers and how sure you want to be.
The choice also depends on your data’s quantity and stability. Larger sample sizes, higher baseline values for the metrics you care about, and/or consistent data patterns can make obtaining conclusive results at higher confidence levels easier. Smaller samples or higher-variance data might push you to choose a slightly lower confidence level to avoid overly broad, less actionable intervals.
Finally, consider what’s on the line. If a poor decision would be costly or hard to undo, a higher confidence level could be worth the extra caution. If you need to move fast to respond to market changes, a lower confidence level might help you act decisively, even if it means accepting more uncertainty.
The goal of any experiment is first to determine if we have sufficient evidence to reject our null hypothesis (i.e., to reject the assumption that our treatment has no measurable impact) and second, to get as close as possible to the true population parameter for the group you're studying. Statistical significance helps us measure the former, and confidence intervals help us measure the latter.
If you’re still not sure which confidence level to choose, seeing how that choice impacts the confidence interval might help you make a more informed decision, because the level you choose has a significant effect on the range of values you end up with at the end of your experiment.
In short, the higher you set your confidence level, the wider your intervals tend to become, reflecting that you’re being more cautious and allowing more room for the “true value” to fall inside. On the other hand, lower confidence levels produce tighter intervals but carry a higher risk of missing that true population parameter.
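To see the trade-off numerically, here is a small sketch (standard library only; the standard deviation of 10 and sample size of 400 are invented for illustration) of how the half-width of a normal-approximation interval grows with the confidence level:

```python
from statistics import NormalDist

def margin_of_error(std_dev, n, confidence):
    """Half-width of a normal-approximation CI for a sample mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * std_dev / n ** 0.5

# Same data, three certainty bars: the interval widens as confidence rises.
for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%} confidence: \u00b1{margin_of_error(10.0, 400, conf):.2f}")
```

Holding the data fixed, moving from 90% to 99% confidence widens the margin of error by roughly half again — more caution, less precision.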
If your data shows more spread (higher standard deviation), your confidence intervals naturally widen due to the data points spreading farther from the mean. This spread makes it harder to pinpoint the true value with precision. To help with this, Eppo employs techniques like CUPED and winsorization, which reduce the standard errors in your results and bring those intervals back into a more useful, narrower range.
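To make the variance-reduction idea concrete, here is a generic winsorization sketch — not Eppo’s actual implementation, and CUPED (which adjusts metrics using pre-experiment data) is omitted for brevity. Clamping extreme observations to percentile bounds shrinks the spread, and with it the standard error:

```python
import statistics

def winsorize(values, lower_pct=0.01, upper_pct=0.99):
    """Clamp extreme observations to percentile bounds to cut variance."""
    ordered = sorted(values)
    lo = ordered[int(lower_pct * (len(ordered) - 1))]
    hi = ordered[int(upper_pct * (len(ordered) - 1))]
    return [min(max(v, lo), hi) for v in values]

# A single extreme outlier inflates the spread of otherwise tame data.
revenue = list(range(1, 100)) + [1000]
print(f"raw stdev:        {statistics.stdev(revenue):.1f}")
print(f"winsorized stdev: {statistics.stdev(winsorize(revenue)):.1f}")
```

The trade-off is a small amount of bias in exchange for a much tighter interval, which is usually worthwhile for heavy-tailed metrics like revenue.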
Bigger samples mean less guesswork. A larger sample size drives down the margin of error, giving you tighter intervals for the same confidence level. But if you’re working with a small or niche dataset, expect wider intervals since you have fewer data points to anchor your estimate. In these cases, you might have to consider adjusting your confidence levels.
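The sample-size effect follows directly from the margin-of-error formula: the margin shrinks with the square root of n, so halving it requires roughly four times the data. A quick sketch with invented numbers:

```python
import math

def required_sample_size(std_dev, target_margin, z=1.96):
    """Smallest n with z * std_dev / sqrt(n) <= target_margin (95% CI)."""
    return math.ceil((z * std_dev / target_margin) ** 2)

# Halving the margin of error roughly quadruples the sample you need.
print(required_sample_size(10.0, 1.0))  # 385
print(required_sample_size(10.0, 0.5))  # 1537
```

This is why niche datasets force a choice: accept a wider interval, lower the confidence level, or wait longer to collect more data.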
In the real world, confidence levels play a huge role in important decisions like “ship” or “no ship” regarding product launches or feature rollouts. Here’s how:
If the lower bound of the confidence interval is above zero, the data suggests a positive outcome (e.g., the feature is likely to have a beneficial impact), so the decision may be to proceed and "Ship" the product or feature.
If the confidence interval includes zero (i.e., it spans both negative and positive values), the results are inconclusive. The data doesn't strongly support either a positive or a negative effect, so the decision may be to hold off, or "No-Ship," until further testing or data is gathered.
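The ship/no-ship rule above can be sketched as a tiny helper (a hypothetical function, not part of any particular tool; it adds a third outcome to distinguish a clearly negative interval from an inconclusive one):

```python
def ship_decision(ci_lower, ci_upper):
    """Turn a confidence interval on the treatment effect into a call."""
    if ci_lower > 0:
        return "ship"          # the whole interval is positive
    if ci_upper < 0:
        return "no-ship"       # the whole interval is negative
    return "inconclusive"      # the interval crosses zero: keep testing

print(ship_decision(0.01, 0.03))   # ship
print(ship_decision(-0.01, 0.02))  # inconclusive
```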
Bell curve graphs can really help explain the relationship between the sample mean and the confidence interval. The sample mean is at the center, and the confidence interval is shown as a range that extends to either side. The confidence level indicates the likelihood that the true population mean lies within that range.
Before we move forward, we want to make sure we clear up some common misunderstandings about confidence intervals:
The interval represents the range of values where we expect the true mean to lie based on the sample data. However, even if the interval is calculated with a high confidence level, there's always a chance that the true mean falls outside this range.
The interval is based on the specific sample chosen and can vary if you collect different samples. This uncertainty is captured in the interval, which shows the potential range of values for the true mean, not a fixed point.
Eppo simplifies statistical analysis by automatically calculating confidence intervals and precision. Using advanced techniques like CUPED (Controlled-experiment Using Pre-Experiment Data) and sequential analysis, Eppo ensures that experiment results are reliable and based on sound statistical methods.
Eppo's tools are designed to reduce variance, delivering more precise results even with small sample sizes. This allows you to make informed decisions when data is limited, reducing the risk of misleading conclusions.
Eppo’s dashboards make it easy to understand the data by displaying essential metrics in an easy-to-digest, visual way. No statistical background is required to understand the experiment results in Eppo. This helps you interpret experiment results at a glance and turn insights into actionable decisions quickly.
Eppo supports a wide range of statistical methods to cater to the unique needs of different experiments. Whether you're testing new product features or updated marketing campaigns, Eppo’s flexibility ensures you can rely on the best methodology for your specific experiment.
Ready to simplify your confidence level calculations and make more reliable decisions? Explore how Eppo can help you automate confidence intervals, reduce variance, and streamline your experimentation process for better outcomes. Request a demo today!