
If we’re lucky, it’s easy to make decisions based on experiments: if our metrics go up in the test period, ship it! If they go down, revert the change.
But let’s be honest, it’s rarely that simple. Instead of one key metric, we have five or more. Maybe split across one or two key dimensions like user segment or country. All of a sudden we have a forest of metrics to consider, and one would have to be very lucky indeed for them all to be positive. What then?
Writing down hypotheses prior to an experiment is a familiar-enough idea. The most academic data scientists among us might have the habit of writing out H0, H1, and so on. In a business context, there’s another layer to the problem. Say we prove H1 is true. The next step is to settle on what H1 implies for the decisions we are considering.
Most teams expect to do the work of translating results into decisions after the experiment concludes. That’s a natural-enough instinct. I want to suggest an alternative approach. Before you have results, take a little time as a team to think through which results would imply which decisions, and write down quantitative thresholds for those criteria. This process yields more rigorous decisions, faster decisions, and savings in data staff time. This is decision pre-registration.
In the simple example we started with, the chain looks something like this:
“If [product] improves [total revenue] then [ship product to 100% of users].”
When metrics proliferate, we have to consider a much more complex outcome space. If we have five metrics and three states per metric (positive, null, negative), there are 243 outcome combinations, each of which might imply a different decision. So rather than working forward from all possible outcomes, it’s simpler to work backwards. Start with the decision space, and then ask yourself: which experimental results would make me recommend each of these decisions?
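To make that arithmetic concrete, here is a minimal Python sketch that enumerates the outcome space for five metrics with three states each; the metric names are hypothetical, purely for illustration:

```python
from itertools import product

# Three possible states for each metric after significance testing.
STATES = ["positive", "null", "negative"]

# Hypothetical metric names, purely for illustration.
METRICS = ["revenue", "orders", "time_in_app", "retention", "support_contacts"]

# Working forward means mapping every combination of metric states to a decision.
outcomes = list(product(STATES, repeat=len(METRICS)))
print(len(outcomes))  # 3 ** 5 = 243 distinct result patterns
```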
For most product experiments the decision is “launch” or “revert.” There might be variations like “launch in the US, but hold the Japanese launch until we improve localization.” Sit with the decision-maker and write down the major options explicitly. Then try to rough out the sorts of experiment outcomes that would lead you to make each decision. An example statement might look like this:
“If [product] improves [total revenue] and has positive or null effects on [all other metrics] then [ship product to 100% of users].”
It’s not viable to exhaust every potential experimental outcome and map it to a decision. You can, however, rapidly address most of the likely outcomes with a few simple statements about what constitutes grounds to revert a change and which metrics must have significant improvements to warrant shipping.
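If it helps to make this concrete, statements like the ones above can be encoded as a tiny decision function. This is only a sketch, assuming each metric’s result has already been summarized as positive, null, or negative; the metric names and rules here are illustrative, not a prescription:

```python
# A minimal sketch: pre-registered ship/revert rules over summarized results.
# Assumes each metric's result is one of "positive", "null", or "negative".
# Metric names and the specific rules are hypothetical illustrations.

def preregistered_decision(results: dict) -> str:
    # Ship rule: the target metric improves and no other metric is negative.
    if results["total_revenue"] == "positive" and all(
        state in ("positive", "null") for state in results.values()
    ):
        return "ship to 100% of users"

    # Revert rule: the target metric is significantly negative.
    if results["total_revenue"] == "negative":
        return "revert the change"

    # Everything else falls back to whatever rule the team agreed in advance.
    return "apply the pre-agreed rule for this pattern"


print(preregistered_decision(
    {"total_revenue": "positive", "orders": "null", "time_in_app": "null"}
))  # -> "ship to 100% of users"
```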
If you’ve ever set a “guardrail” metric, you’ve already done a form of decision pre-registration. Consider this proposed decision:
“If [product] decreases [total time spent in app] by more than [5%] then [revert and diagnose the problems and try again].”
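In code, that guardrail is just a threshold check. A minimal sketch, assuming the experiment readout gives a relative effect estimate for time spent in app; the function name and input are hypothetical:

```python
# Pre-registered guardrail: revert if time spent in app drops by more than 5%.
GUARDRAIL_THRESHOLD = -0.05

def guardrail_decision(time_in_app_effect: float) -> str:
    # time_in_app_effect is the relative change, e.g. -0.07 for a 7% drop.
    if time_in_app_effect < GUARDRAIL_THRESHOLD:
        return "revert, diagnose the problems, and try again"
    return "guardrail passed; apply the remaining pre-registered rules"


print(guardrail_decision(-0.07))  # -> "revert, diagnose the problems, and try again"
```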
Decision pre-registration is a broader version of this idea. Guardrail metrics answer the question “what results would be bad enough that we end an experiment early?” What if you considered all the other decisions you might make as a result of seeing experimental data?
This discussion must happen with the decision-maker, so make sure you understand who that actually is. Sometimes an organization will say that the PM is empowered to make the launch decision, but in practice the final call sits with someone higher up the management chain. The decision-maker brings to the discussion the decisions under consideration, plus other non-data concerns that might favor one decision or another. Data staff can then suggest ways that particular experimental outcomes might point to one decision or another, and the decision-maker ratifies the link by confirming that if they saw result X, it would imply decision Y. If you don’t find the true decision-maker, or they are not genuinely engaged in this discussion, it is much harder to follow through on the proposed decisions in the end.
Null results are a common experimental outcome, and they deserve special attention in this discussion. Suppose all critical metrics either have effect sizes too small to care about or are not statistically significant. What do we do in that case? In many teams I’ve worked with, the decision from a null experiment is “ship it.” Often that decision is made because there was a principled reason to build the product in the first place. Maybe it’s a change to the information architecture that is not expected to have a positive impact on its own but sets up a design framework that is more extensible going forward. Or the team expects that, paired with a bigger marketing campaign, the feature will show successful results in the future. Other teams might argue that the ongoing maintenance costs of a feature with no detectable impact are too high to justify shipping it, and would instead revert any product with this result pattern. Settle this question in advance.
It’s not uncommon for an experiment to have more than one target metric, and for those metrics to move separately. For example, you may run an onboarding experiment where you aim to improve both total accounts created and seven-day new-account retention. If accounts created rises but retention falls, what then? You may consider that a signal that you’re getting the wrong sort of new accounts, and hold the launch while you workshop design treatments that better explain the value of an account. Or you might say “ship it” and brainstorm retention-focused products next sprint. Or you might be making a change for fraud reasons that will decrease successful account creation while also decreasing fraud; consider what level of account-creation impact would be acceptable for what level of fraud reduction. In a case like this, you might simply express the threshold as a ratio: “we’ll launch a change that costs no more than a 1% decrease in account creation for every 10% decrease in fraud.”
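That trade-off is easy to pin down as a pre-registered check. A minimal sketch, assuming both effects come back as relative changes where negative means a decrease; the function name and inputs are hypothetical:

```python
# Pre-registered trade-off: accept up to a 1% drop in account creation
# for every 10% reduction in fraud. Effect estimates are relative changes.
ACCEPTABLE_RATIO = 0.01 / 0.10

def fraud_tradeoff_decision(account_creation_effect: float, fraud_effect: float) -> str:
    if fraud_effect >= 0:
        return "no fraud reduction observed; do not ship on this rationale"
    if account_creation_effect >= 0:
        return "ship: fraud fell and account creation did not"
    observed_ratio = -account_creation_effect / -fraud_effect
    if observed_ratio <= ACCEPTABLE_RATIO:
        return "ship: trade-off is within the pre-registered ratio"
    return "hold: account-creation cost exceeds the agreed ratio"


print(fraud_tradeoff_decision(account_creation_effect=-0.005, fraud_effect=-0.10))
# -> "ship: trade-off is within the pre-registered ratio"
```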
These are sometimes uncomfortable discussions for decision-makers. Being asked to commit to a method of decision-making can feel like a loss of autonomy. One compromise you can offer that might help set them at ease is to emphasize that you understand there are non-data considerations in the decision. It’s totally acceptable for them to say “actually, we’re going to launch this unless all the metrics are significantly negative,” or even “this is a CEO-mandated product direction and we will launch regardless of results.” These cases happen every day, and it’s far healthier to acknowledge when you are in this state than to undermine your intellectual honesty with false data pageantry.
Why not wait and discuss the actual results instead of spending the energy to game out all potential results? Three reasons: rigor, speed, and simplification.
These discussions are more rigorous before you have results in hand than after. Even the most data-driven teams can engage in motivated interpretation when faced with an ambiguous set of results. What product team doesn’t want to ship what they’ve worked on for months? This structure helps manage our cognitive bias to reject and rationalize data that does not fit with our view of how the world works.
Although you spend some time up front discussing outcomes that may never happen, once you have a pre-registered decision you spend less time debating the decision when results come back. An hour-long discussion in advance saves hours of work cutting the data in different ways and squeezing signal from noise to rationalize the outcome. It also improves your product velocity: you can go from experiment read-out to action without a meeting, because you’re simply slotting the results into the decision you already made. That puts your product in customers’ hands faster.
Pre-registration also lets you trim the amount of measurement you’re doing. If your proposed decisions don’t depend on a particular metric, you can remove it. Given what we know about the perils of multiple comparisons, fewer metrics is better. You may also find that the decision doesn’t actually rest on any metric in the experiment, and you can move straight to a full launch. If there are regulatory, sales, or operational reasons you must ship something, it may be faster and cheaper to rely on post-launch analytics than on an experiment. Experiments always have a cost in staff time and data-collection time, and if you don’t have to pay those costs, great.