
If we’re lucky, it’s easy to make decisions based on experiments: if our metrics go up in the test period, ship it! If they go down, revert the change.
But let’s be honest, it’s rarely that simple. Instead of one key metric, we have five or more. Maybe split across one or two key dimensions like user segment or country. All of a sudden we have a forest of metrics to consider, and one would have to be very lucky indeed for them all to be positive. What then?
Writing down hypotheses prior to an experiment is a familiar-enough idea. The most academic data scientists among us might have the habit of writing out H0, H1, and so on. In a business context, there’s another layer to the problem. Say we prove H1 is true. The next step is to settle on what H1 implies for the decisions we are considering.
Most teams expect to do the work of translating results into decisions after the experiment concludes. That’s a natural-enough instinct. I want to suggest an alternative approach. Before you have results, take a little time as a team to think through which results would imply which decisions, and write down quantitative thresholds for those criteria. This process yields more rigorous decisions, faster decisions, and savings in data staff time. This is decision pre-registration.
In the simple example we started with, the chain looks something like this:
“If [product] improves [total revenue] then [ship product to 100% of users].”
When metrics proliferate, we have to consider a much more complex outcome space. If we have five metrics and three states per metric (positive, null, negative), there are 243 outcome combinations, each of which might imply a different decision. So rather than working forward from all possible outcomes, it’s simpler to work backwards. Start with the decision space, and then ask yourself: which experimental results would make me recommend each of these decisions?
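To make that arithmetic concrete, here is a minimal Python sketch that enumerates the outcome space for five metrics with three states each; the metric names are hypothetical, purely for illustration:

```python
from itertools import product

# Three possible states for each metric after significance testing.
STATES = ["positive", "null", "negative"]

# Hypothetical metric names, purely for illustration.
METRICS = ["revenue", "orders", "time_in_app", "retention", "support_contacts"]

# Working forward means mapping every combination of metric states to a decision.
outcomes = list(product(STATES, repeat=len(METRICS)))
print(len(outcomes))  # 3 ** 5 = 243 distinct result patterns
```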
For most product experiments the decision is “launch” or “revert.” There might be variations like “launch in the US, but hold the Japanese launch until we improve localization.” Sit with the decision-maker and write down the major options explicitly. Then try to rough out the sorts of experiment outcomes that would lead you to make each decision. An example statement might look like this:
“If [product] improves [total revenue] and has positive or null effects on [all other metrics] then [ship product to 100% of users].”
It’s not viable to exhaust every potential experimental outcome and map it to a decision. You can, however, rapidly address most of the likely outcomes with a few simple statements about what constitutes grounds to revert a change and which metrics must have significant improvements to warrant shipping.
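If it helps to make this concrete, statements like the ones above can be encoded as a tiny decision function. This is only a sketch, assuming each metric’s result has already been summarized as positive, null, or negative; the metric names and rules here are illustrative, not a prescription:

```python
# A minimal sketch: pre-registered ship/revert rules over summarized results.
# Assumes each metric's result is one of "positive", "null", or "negative".
# Metric names and the specific rules are hypothetical illustrations.

def preregistered_decision(results: dict) -> str:
    # Ship rule: the target metric improves and no other metric is negative.
    if results["total_revenue"] == "positive" and all(
        state in ("positive", "null") for state in results.values()
    ):
        return "ship to 100% of users"

    # Revert rule: the target metric is significantly negative.
    if results["total_revenue"] == "negative":
        return "revert the change"

    # Everything else falls back to whatever rule the team agreed in advance.
    return "apply the pre-agreed rule for this pattern"


print(preregistered_decision(
    {"total_revenue": "positive", "orders": "null", "time_in_app": "null"}
))  # -> "ship to 100% of users"
```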
If you’ve ever set a “guardrail” metric, you’ve already done a form of decision pre-registration. Consider this proposed decision:
“If [product] decreases [total time spent in app] by more than [5%] then [revert and diagnose the problems and try again].”
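In code, that guardrail is just a threshold check. A minimal sketch, assuming the experiment readout gives a relative effect estimate for time spent in app; the function name and input are hypothetical:

```python
# Pre-registered guardrail: revert if time spent in app drops by more than 5%.
GUARDRAIL_THRESHOLD = -0.05

def guardrail_decision(time_in_app_effect: float) -> str:
    # time_in_app_effect is the relative change, e.g. -0.07 for a 7% drop.
    if time_in_app_effect < GUARDRAIL_THRESHOLD:
        return "revert, diagnose the problems, and try again"
    return "guardrail passed; apply the remaining pre-registered rules"


print(guardrail_decision(-0.07))  # -> "revert, diagnose the problems, and try again"
```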
Decision pre-registration is a broader version of this idea. Guardrail metrics answer the question “what results would be bad enough that we end an experiment early?” What if you considered all the other decisions you might make as a result of seeing experimental data?
This discussion must happen with the decision-maker, so make sure you understand who that actually is. Sometimes an organization will say that the PM is empowered to make the launch decision, but in practice the final call sits with someone higher up the management chain. The decision-maker brings to the discussion the decisions under consideration, plus other non-data concerns that might favor one decision or another. Data staff can then suggest ways that particular experimental outcomes might point to one decision or another, and the decision-maker ratifies the link by confirming that if they saw result X, it would imply decision Y. If you don’t find the true decision-maker, or they are not genuinely engaged in this discussion, it is much harder to follow through on the proposed decisions in the end.
Null results are a common experimental outcome, and they deserve special attention in this discussion. Suppose all critical metrics either have effect sizes too small to care about or are not statistically significant. What do we do in that case? In many teams I’ve worked with, the decision from a null experiment is “ship it.” Often that decision is made because there was a principled reason to build the product in the first place. Maybe it’s a change to the information architecture that is not expected to have a positive impact on its own but sets up a design framework that is more extensible going forward. Or the team expects that, paired with a bigger marketing campaign, the feature will show successful results in the future. Other teams might argue that the ongoing maintenance costs of a feature with no detectable impact are too high to justify shipping it, and would instead revert any product with this result pattern. Settle this question in advance.
It’s not uncommon for an experiment to have more than one target metric, and for those metrics to move separately. For example, you may run an onboarding experiment where you aim to improve both total accounts created and seven-day new-account retention. If accounts created rises but retention falls, what then? You may consider that a signal that you’re getting the wrong sort of new accounts, and hold the launch while you workshop design treatments that better explain the value of an account. Or you might say “ship it” and brainstorm retention-focused products next sprint. Or you might be making a change for fraud reasons that will decrease successful account creation while also decreasing fraud; consider what level of account-creation impact would be acceptable for what level of fraud reduction. In a case like this, you might simply express the threshold as a ratio: “we’ll launch a change that costs no more than a 1% decrease in account creation for every 10% decrease in fraud.”
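That trade-off is easy to pin down as a pre-registered check. A minimal sketch, assuming both effects come back as relative changes where negative means a decrease; the function name and inputs are hypothetical:

```python
# Pre-registered trade-off: accept up to a 1% drop in account creation
# for every 10% reduction in fraud. Effect estimates are relative changes.
ACCEPTABLE_RATIO = 0.01 / 0.10

def fraud_tradeoff_decision(account_creation_effect: float, fraud_effect: float) -> str:
    if fraud_effect >= 0:
        return "no fraud reduction observed; do not ship on this rationale"
    if account_creation_effect >= 0:
        return "ship: fraud fell and account creation did not"
    observed_ratio = -account_creation_effect / -fraud_effect
    if observed_ratio <= ACCEPTABLE_RATIO:
        return "ship: trade-off is within the pre-registered ratio"
    return "hold: account-creation cost exceeds the agreed ratio"


print(fraud_tradeoff_decision(account_creation_effect=-0.005, fraud_effect=-0.10))
# -> "ship: trade-off is within the pre-registered ratio"
```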
These are sometimes uncomfortable discussions for decision-makers. Being asked to commit to a method of decision-making can feel like a loss of autonomy. One compromise you can offer that might help set them at ease is to emphasize that you understand there are non-data considerations in the decision. It’s totally acceptable for them to say “actually, we’re going to launch this unless all the metrics are significantly negative,” or even “this is a CEO-mandated product direction and we will launch regardless of results.” These cases happen every day, and it’s far healthier to acknowledge when you are in this state than to undermine your intellectual honesty with false data pageantry.
Why not wait and discuss the actual results instead of spending the energy to game out all potential results? Three reasons: rigor, speed, and simplification.
These discussions are more rigorous before you have results in hand than after. Even the most data-driven teams can engage in motivated interpretation when faced with an ambiguous set of results. What product team doesn’t want to ship what they’ve worked on for months? This structure helps manage our cognitive bias to reject and rationalize data that does not fit with our view of how the world works.
Although you spend some time up front discussing outcomes that may never happen, once you have a pre-registered decision you spend less time debating the decision when results come back. An hour-long discussion in advance saves hours of work cutting the data in different ways and squeezing signal from noise to rationalize the outcome. It also improves your product velocity: you can go from experiment read-out to action without a meeting, because you’re simply slotting the results into the decision you already made. That puts your product in customers’ hands faster.
Pre-registration also lets you trim the amount of measurement you’re doing. If your proposed decisions don’t depend on a particular metric, you can remove it. Given what we know about the perils of multiple comparisons, fewer metrics is better. You may also find that the decision doesn’t actually rest on any metric in the experiment, and you can move straight to a full launch. If there are regulatory, sales, or operational reasons you must ship something, it may be faster and cheaper to rely on post-launch analytics than on an experiment. Experiments always have a cost in staff time and data-collection time, and if you don’t have to pay those costs, great.