Experimentation is a highly cross-functional endeavor, and like any machinery with many moving parts, it offers many opportunities for failure. Shortcuts taken in experiment planning and execution can sow doubt in the reliability of your experiment results, or even double your runtime due to missteps in logging, configuration, or mismatched analytical methods. Sometimes we have to slow down to speed up.
Writing and sharing experiment plans helps us slow our thinking down just enough to clarify our objectives and the steps required for execution. In this guide, I will share what I've seen work in organizations running 100+ experiments a month. It is a piece of the process that was lacking in the organizations I've seen struggle to get their experimentation programs off the ground.
Just as we would write specifications for a design or engineering team, an experiment plan's primary objective is to communicate the specifications required for executing an experiment. It outlines the motivation and context for our hypothesis, illustrates the different user experiences under control and treatment, and specifies who we intend to include in the test, when we will run it, and for how long, among other details.
An experiment plan is more than documentation; it's an essential blueprint that guides teams in efficient execution and collaboration. It also serves as a contract and standardized process that safeguards experiment rigor and trustworthiness by "putting to paper" what we are committing to do. Some experiment plans include a checklist for quality assurance, mitigating the risk of compromised experiment data and time lost to rerunning misconfigured tests.
A good experiment plan is peer-reviewed and shared with stakeholders, facilitating a culture of transparency, partnership, and rigor. And finally, an experiment plan serves as an artifact to remind us of what was tested and why, creating institutional memory.
There are core components that are key to launching a successful experiment that you'll want to document. Your teams might find value in adding additional information; however, use caution to avoid process bloat, because you want to minimize the risk of introducing too much friction. A rule of thumb for good documentation is that if a random person on the street picked it up as a sheet of paper, they could walk away with a solid understanding of the problem, solution, and desired outcomes.
Here, we discuss the essential components of an experiment plan in depth, with examples of how the omission of each can result in outcomes ranging from the not-so-desirable to the catastrophic.
The problem statement
The hypothesis
The user experience
Who is included in the experiment (and where does bucketing happen)?
Sampling strategy
Primary, secondary, and guardrail metrics
Method of analysis
Baseline measure & variance
Statistical design
Decision matrix
The problem statement concisely tells your collaborators and stakeholders what it is you are trying to solve for, in context. It describes the business problem and relevant background, proposed solution(s), and outcomes in just a few sentences. The context could be drawn from an analysis that demonstrates an area for improvement, insights gleaned from a previous experiment, or motivation from user research or feedback. It is important to provide links to the motivating context for anyone who will need to understand the history of your product, detailing what has changed and why.
Without a clear problem statement that connects the motivation for your feature, the proposed solution, and hypothesized outcomes as a result of implementing this solution, it will be difficult to obtain buy-in from your stakeholders and leadership.
Note: If you have a business requirements doc (BRD) or a product requirements doc (PRD), this is a good place to link to it. Keep in mind that these documents serve different functions, and while there is some overlap, a data scientist does not want to parse context from engineering specifications any more than a developer wants to parse context from a statistical document. Neither a BRD nor PRD is sufficient on its own for the purposes of designing an experiment.
The hypothesis is intrinsically a statistical statement; however, don't let that daunt you if you're not a statistician. It should fall right out of a well-written problem statement. It is a more distilled problem statement, stripped of the business context, and generally takes the form of:
By introducing X we expect to observe an INCREASE/DECREASE in Y,
where X is the change you are introducing and Y is the primary metric that ties back to the business objective. Y should hypothetically demonstrate an improvement caused by the introduction of the change.
Your hypothesis is the critical formative element for the statistical design required to create a valid experiment that measures what you intend to measure. It also plays a crucial role in metric selection and decision frameworks. Absent a good hypothesis, your objectives will be murky and success undefined, casting doubt on the experiment as a whole.
Note: The astute reader will note that the structure of the hypothesis above omits the null hypothesis and implies a one-sided test of significance, whereas the more common method of experimentation is a two-sided test designed to detect any change, positive or negative. I think this departure from a strict statistical definition is okay because 1. We include a distinct methods section, 2. The hypothesis is often written from a product point of view and the nuts and bolts of hypothesis testing are non-intuitive, and 3. The semantics are understood in the context of the desired direction of change. You might take a different approach that best fits the culture within your company. My sense is that writing a hypothesis in this way enables the democratization of experimentation.
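For readers who want the strict statistical form behind the template above (assuming a mean-style primary metric Y), the two-sided test it maps to is:

H0: μ_treatment = μ_control (introducing X has no effect on Y)
H1: μ_treatment ≠ μ_control (introducing X changes Y in either direction)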
Here, you want to provide screenshots, wireframes, or in the case where the difference is algorithmic or otherwise not obviously visual, a description of the experience for control and treatment.
We want to make sure that the specific change we're testing is clearly communicated to stakeholders and that a record is preserved for future reference. This also helps with QA of the test and, potentially, debugging the experiment implementation.
This description can vary from trivial to nuanced. A trivial example might be "all users". But for an experiment that is deployed on the checkout page, you would want to trigger the experiment allocation on the checkout page, so then the population you are sampling from would be "all users entering checkout". This might have further conditions if your population is restricted to a platform, like iOS, and a certain region. For example, "all users in NYC entering checkout on iOS".
Being explicit here helps us QA where the experiment allocation gets triggered and ensure we're filtering on the relevant dimensions. It also calls out any conditions on the population we would want to consider when pulling numbers for the power calculation.
The downside of missing this ranges from a misconfigured experiment (and, subsequently, longer decision cycles) to an incorrect estimate of the number of samples required, which can lead to an underpowered experiment and an inconclusive result. Both are costly and frustrating.
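As a rough sketch, the "all users in NYC entering checkout on iOS" population above could be written down as an explicit eligibility predicate evaluated at the allocation trigger point. The context keys and values here are hypothetical and not tied to any particular SDK:

```python
# Hypothetical eligibility check, evaluated at the moment the user reaches checkout.
# The context keys (page, platform, region) are illustrative placeholders.
def is_eligible(user_context: dict) -> bool:
    return (
        user_context.get("page") == "checkout"
        and user_context.get("platform") == "ios"
        and user_context.get("region") == "nyc"
    )

# Only eligible users are passed to the experiment allocation call.
if is_eligible({"page": "checkout", "platform": "ios", "region": "nyc"}):
    pass  # assign a variant and log the exposure event here
```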
This is another topic where the design could be trivial or complex. In most cases, you will be sampling from "100% of the population, 50% Control, 50% Treatment". If you are lucky enough to have far more users than those required to run the experiment, you might have a situation where your sampling strategy could look like "20% of the population, 50% Control, 50% Treatment".
There are more advanced sampling strategies in switchback experiments, synthetic controls, or cases where a test is deployed in limited geo-locations. These are generally limited to special cases.
This information is useful for QAing the test configuration, for informing the statistical design and power calculation, and, in special cases, for discussing with stakeholders any risks involved in a full 50/50 rollout. Some systems might not be able to handle the traffic generated by a new feature (think infrastructure or customer service teams).
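To make the common case concrete, here is a minimal sketch of deterministic, hash-based bucketing for a "20% of the population, 50% Control, 50% Treatment" design. The salt and thresholds are illustrative, and most experimentation platforms handle this step for you:

```python
import hashlib
from typing import Optional

def assign_variant(user_id: str, salt: str = "checkout_test_v1") -> Optional[str]:
    """Deterministic hash-based bucketing: 20% of eligible users enter the
    experiment, split 50/50 between control and treatment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    position = (int(digest, 16) % 10_000) / 10_000  # stable value in [0, 1)

    if position >= 0.20:          # 80% of users never enter the experiment
        return None
    return "treatment" if position < 0.10 else "control"  # 10% / 10% of all users
```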
While I can't tell you which specific metrics you should be measuring, I want to share how to think about which metric(s) belong in primary, secondary, and guardrail metric categories.
I like to anchor my thinking on the question, "Would a deterioration in this metric prevent me from rolling out this feature?" If the answer is yes, it is what I call a decision-making metric and is a good candidate for a primary metric.
Secondary metrics are selected in support of the primary objective. They help us to tell a more robust story. I like to call these "hypothesis-generating metrics" because when they surprise us, a deep dive can help us to develop subsequent hypotheses. But even when we find these metrics to be statistically significant, remember that they may be underpowered and/or not corrected for in the overall error rate and shouldn't contribute to go/no-go calls.
Guardrail metrics are metrics that you don't expect the experiment to move, but which, when moved, could be indicative of unintended second-order effects or bugs in the code. Examples might be unsubscribes or degradation in page load times. It is imperative that they are sensitive enough to be moved - metrics like returns or retention are too far downstream to surface an alarming effect in near real time or within the timeframe of the experiment.
To protect the integrity of your experiment and guard against cherry-picking, it is important to establish these metrics from the outset.
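One lightweight way to commit to these categories from the outset is to record them directly in the experiment plan or its configuration. The metric names below are purely illustrative:

```python
# Illustrative metric registry for a hypothetical checkout experiment.
metrics = {
    "primary": ["checkout_conversion_rate"],                  # decision-making metric
    "secondary": ["items_per_order", "time_to_purchase"],     # hypothesis-generating
    "guardrail": ["page_load_time_p95", "unsubscribe_rate"],  # watch for regressions
}
```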
This is where you'll specify the statistical methods you plan to employ for your experiment. This might just be a two-sample t-test or more advanced methods in statistics or causal inference like a difference-in-differences test, group sequential methods, or CUPED++.
When in doubt, consult your friendly experimentation specialist.
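For the common case of a two-sample t-test on a per-user metric, the analysis can be as simple as the sketch below. The data here is simulated; in practice the two arrays would hold per-user values pulled from your experiment's exposure and metric logs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated per-user revenue; stand-ins for values pulled from experiment data.
control = rng.gamma(shape=2.0, scale=10.0, size=5_000)
treatment = rng.gamma(shape=2.0, scale=10.5, size=5_000)

# Welch's two-sample t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
observed_lift = treatment.mean() / control.mean() - 1

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, observed lift = {observed_lift:.2%}")
```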
Often, you already have the baseline measure of the primary metric from when you generated or prioritized your hypothesis. In case you don't, you'll want to collect these values before proceeding with the next sections of the plan. (If you require analytics support in accessing this data, get your request in as early as possible to avoid bottlenecks and launch delays.)
When establishing the baseline measure (e.g. your average revenue per user, before the experiment), there are a couple of things to keep in mind.
First, if accessing it from a dashboard, be aware that your business metric and experiment metric might not share the same definition. This is very important because a small deviation from the true baseline can cause a large deviation in your expected runtime. Make sure that the baseline values you obtain reflect the baseline experience you will measure. For example, your experiment population might be limited to a region, while the reference dashboard is aggregated across all users. Another way this shows up is that experiments sometimes use metrics like revenue per session, where a session can take on any number of custom definitions; this is likely not reflected in a business analytics dashboard, creating yet another way your metrics might differ.
Try to obtain a robust estimate. Every business has seasonal variation, so you might think that an estimate from a similar time period in the previous year is best, but also consider the changes in your user experience that have been deployed in the past year. Those two values might not be as comparable as you think. On the other hand, two data points close in time are more likely to be similar than two points further apart. For example, today's local temperature is likely more similar to yesterday's temperature than to the temperature three months ago. So you might consider taking an aggregate measure over a time period that is: 1. Near the date you plan to execute your experiment and 2. Covered by a lookback window roughly equivalent to your expected experiment runtime. Also watch out for any anomalous data from holidays, rare events, and data pipeline issues or changes.
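As a hedged sketch of pulling such a baseline, assuming a per-user daily metric table with user_id, date, and revenue columns (the file name, dates, and columns are hypothetical):

```python
import pandas as pd

# Hypothetical per-user daily metric table with columns: user_id, date, revenue.
df = pd.read_parquet("per_user_daily_revenue.parquet")  # placeholder data source

# Lookback window roughly equal to the planned runtime, ending near launch.
window = df[(df["date"] >= "2024-05-01") & (df["date"] < "2024-05-15")]

# Aggregate to one value per user so the unit matches the randomization unit.
per_user = window.groupby("user_id")["revenue"].sum()

baseline_mean = per_user.mean()
baseline_std = per_user.std()
print(f"baseline mean = {baseline_mean:.2f}, std = {baseline_std:.2f}")
```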
The minimum detectable effect (MDE) is a tough input for beginners to wrap their heads around. It is essentially a shot-in-the-dark guess at the relative effect size you expect from your experiment, but of course you don't yet know what the true effect size is!
The thing you need to know before starting is this: let's say you choose your minimum detectable effect to be a 2% lift. If the true effect is larger, you are likely to have the resolution - by way of the number of samples collected - to detect it! But if instead your true effect is smaller than 2%, you won't have planned a sufficient sample size to separate the signal from the noise.
I like to approach this iteratively. When taking your "shot in the dark", choose a value at the lower end of what you think is a meaningful and plausible lift for this test. Run your power calculation and see, based on average traffic values, how long that would take to test. If it is prohibitively long, you can adjust your MDE upwards, but keep it within the range of plausible lifts. Remember, though: if the true effect is smaller, you might end up with an inconclusive test.
If you find that in order to run a test in a reasonable timeframe the MDE gets pushed outside the range of plausible values, you might consider alternative statistical methods to assist in reducing the variance. Another approach is to perform a "Do No Harm" test (also referred to as a non-inferiority test).
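Here is a rough sketch of that iteration for a conversion-rate metric, using statsmodels. The baseline rate and traffic figures are made up for illustration, and a 50/50 split over all eligible traffic is assumed:

```python
import math
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.04            # hypothetical baseline conversion rate
daily_eligible_users = 20_000   # hypothetical daily traffic entering checkout

for mde in (0.02, 0.05, 0.10):  # candidate relative lifts: 2%, 5%, 10%
    effect = proportion_effectsize(baseline_rate * (1 + mde), baseline_rate)
    n_per_variant = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.80, alternative="two-sided"
    )
    days = math.ceil(2 * n_per_variant / daily_eligible_users)
    print(f"MDE {mde:.0%}: {n_per_variant:,.0f} users per variant, ~{days} days")
```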
How long is a reasonable time frame? This will vary quite a bit based on the size and maturity of your company and how much uncertainty is tolerable for your organization. I recommend a minimum of two-week runtimes (to capture two full weekly cycles). But I have also seen cases where experiments just take longer. Unless you are running a longitudinal study, I don't recommend testing longer than eight to twelve weeks - beyond that, externalities are more likely to creep into your controlled experiment and it becomes difficult to attribute the observed effect to the treatment alone. You're better off revising your methodology or taking a bigger swing at a more meaningful (and sizable) change.
Power and significance are the two most important parameters governing the statistical guarantees your test will offer (assuming you're using a frequentist test).
The widely adopted rules of thumb here are to set your power at 80% (i.e., a false negative rate of 100% - 80% = 20%) and your significance level at 5%. Your org might use different heuristics, but in most cases there is no need to adjust these.
Again, when in doubt, collaborate with a data scientist or your experimentation center of excellence.
In this section, you will use the inputs from your baseline measure, variance, MDE, and power and significance levels. The result will be the number of samples required per variant to run your experiment. It is also helpful to your reviewers to state how the calculation was arrived at (link to an online calculator, the package/method used, or the formula used).
The power calculation is essential to committing to running your experiment to the required number of samples and guards against early stopping and underpowered experiments. It's a handshake contract that keeps all parties committed to conducting a trustworthy experiment.
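As one example of documenting how the number was arrived at, here is a sketch of a sample-size calculation for a continuous metric such as revenue per user, again with made-up inputs:

```python
from statsmodels.stats.power import tt_ind_solve_power

baseline_mean = 23.50   # hypothetical baseline revenue per user
baseline_std = 61.00    # hypothetical standard deviation of revenue per user
mde = 0.03              # smallest relative lift (3%) we want to detect

# Convert the relative MDE into a standardized effect size (Cohen's d).
effect_size = (baseline_mean * mde) / baseline_std

n_per_variant = tt_ind_solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"Required sample size: {n_per_variant:,.0f} users per variant")
```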
Add up the number of samples across variants, divide by the estimated daily traffic, and round up to the nearest whole day or to 14 days (whichever is larger) to estimate the number of days your test will take.
A word of caution: the actual volume of traffic might differ from your estimate. If you see this happening, run your test to the required sample size, not to the number of days; it's the number of samples that guarantees your specified false positive and false negative rates.
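As a tiny sketch, the runtime estimate from the previous paragraph might look like this, continuing with made-up numbers:

```python
import math

n_per_variant = 210_000        # hypothetical required sample size per variant
num_variants = 2
daily_eligible_users = 20_000  # hypothetical average daily eligible traffic

# Round up to whole days, with a two-week floor to capture weekly cycles.
estimated_days = max(14, math.ceil(n_per_variant * num_variants / daily_eligible_users))
print(f"Planned runtime: {estimated_days} days")
```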
Once you have your metrics and experiment design in place, you'll want to enumerate the decision metrics and their possible outcomes (positive, negative, or inconclusive) and specify the decision you will make in each scenario.
It is imperative that this is done before data collection begins. First, it keeps us honest, guarding against cherry-picking and personal bias, unconscious or otherwise. We are all imperfect humans with attachments and hunches. The other reason is that it expedites decision-making post-experiment. Your HiPPOs (highest paid person's opinion) will appear once an experiment has concluded. It's even worse when there are multiple stakeholders involved, each with different metrics they are responsible for. In my own experience, I have seen leaders talk in circles for weeks, delaying a final rollout decision. I have found it is a lot easier to bring your stakeholders and relevant leadership along on the experiment journey and commit to a decision given specific criteria before data has been collected than it is to try to align various stakeholders with different concerns once the data is available for results to be sliced and diced.
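The decision matrix doesn't need to be elaborate. Even a simple pre-registered mapping like the sketch below (the outcomes and decisions are illustrative) removes most of the post-experiment ambiguity:

```python
# Illustrative pre-registered decision matrix for one primary metric and one
# guardrail; keys are (primary outcome, guardrail outcome).
decision_matrix = {
    ("positive", "healthy"): "ship to 100%",
    ("positive", "degraded"): "hold rollout; investigate the guardrail regression",
    ("inconclusive", "healthy"): "do not ship; revisit the hypothesis or test a bigger change",
    ("inconclusive", "degraded"): "do not ship",
    ("negative", "healthy"): "do not ship; dig into segments for follow-up hypotheses",
    ("negative", "degraded"): "do not ship; open a bug investigation",
}
```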
Socializing experiment plans amongst your stakeholders and peers can go a long way toward building trust in your product development and testing programs. With thoughtful planning and transparent communication, we not only enhance the reliability of our experiments but also reduce errors in design and configuration and, in turn, the time it takes to ship features that make an impact. In the process, we create a record of what we have done and why. And having a plan that incorporates peer review and feedback only increases our confidence in shipping features that impact our customers and our businesses.