A/B Testing
May 16, 2024

Four Customer Characteristics That Should Change Your Experiment Runtime

Dialing in your experiment planning beyond just a sample size calculation
Simon Jackson
Former experimentation leader at Canva, Meta, and Booking.com turned Founder and Principal Consultant at Hypergrowth Data

If you’ve run A/B tests, you’re probably familiar with using sample size calculations to plan how many participants you need and, thus, how long you should run your test.

Sample size calculations are essential (and the first major point I’ll cover), but the biggest mistake most teams make is stopping there.

In fact, of all the A/B test planning guidance I’ve given teams (many teams!), the most common advice is about what to do next.

To plan proper, high-signal experiments, you must consider a few other factors about your customers.

They’re not complicated, but if you follow this advice, you’ll drastically improve the quality of your experiments, your ability to learn from your customers, and the pace of high-quality innovation.

And even if you have a unique situation not covered here, you’ll learn a lot about how to plan your own unique experiments.

Unfortunately, most teams experimenting out there do not have the in-house knowledge needed to help them figure out their own situation.

To learn from your customers as quickly but as effectively as possible, you need to tailor experiment runtimes to them!

The customer features we will explore today are what I call “Runtime Modifiers.” Each exists on a spectrum that moves your required runtime up or down.

By adjusting your experiment runtime based on these customer features, you will:

  • Prioritize and plan more effectively
  • Run higher-quality experiments
  • Get confidence in your results faster
  • Avoid acting on misleading results

Let’s dive into each feature that might define the types of customers you’re working with.

1. How many relevant participants do you have?

Most people know about the first factor: the number of relevant customers you have to experiment with, which determines your sample size. Consider two extreme examples.

Sidebar: “Relevant” is an important addition here because you should only learn from customers who might actually be impacted by your changes (more irrelevant customers just add noise).

Small samples

Small samples tend to be noisy. It’s just the luck of the draw: chances are, you’re going to get a lot of wildly different customers in a small group. Get your hands on any power calculator, like this one from Booking.com, and you can check this for yourself.

Here, I checked the impact I could detect with 100 customers (on a 10% baseline). It’d take a whopping 191% lift for me to detect that signal!

Large samples

Getting more customers means better representing what most customers tend to do or what’s “average.”

Using the same power calculator settings, bumping my customer sample to 1 million means I can now detect a 1.5% uplift. That’s a 127x reduction in the signal size we can detect.

So, generally speaking, the more relevant customers you’ve got, the less time you’ll need to run your experiment.
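If you’d like to sanity-check numbers like these in code, here’s a minimal Python sketch of the underlying power math. It’s my own approximation, not the Booking.com calculator, so the exact figures will differ depending on the power and significance settings a given tool uses:

```python
# Minimal sketch of power-calculator math, not any specific vendor's tool.
# Assumptions: 10% baseline conversion, 80% power, 5% significance, two-sided
# test, equal control/treatment split. A calculator with other settings will
# give different numbers.
from math import asin, sin, sqrt

from scipy.stats import norm

def min_detectable_lift(baseline, n_per_group, alpha=0.05, power=0.80):
    """Smallest relative lift detectable with the given sample per group."""
    # Required effect size in Cohen's h units for a two-sample z-test.
    h = (norm.isf(alpha / 2) + norm.ppf(power)) * sqrt(2 / n_per_group)
    # Invert Cohen's h (h = 2*asin(sqrt(p2)) - 2*asin(sqrt(p1))) to get the
    # treatment conversion rate, then express it as a relative lift.
    treated = sin(asin(sqrt(baseline)) + h / 2) ** 2
    return (treated - baseline) / baseline

for n in (100, 1_000_000):
    print(f"{n:>9,} customers per group -> detectable lift ~ "
          f"{min_detectable_lift(0.10, n):.1%}")
```

The direction of the result is the point: going from 100 customers to 1 million shrinks the detectable lift from triple digits to low single digits.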

2. How your participants enroll

I’ve been a bit cheeky and ignored an assumption in the section above.

It’s not just about the total number of customers you can get: you don’t start your experiment and magically have that whole sample on day one.

Participants enroll gradually, so you need to wait! How long depends on how they enroll over time, and I’ve regularly encountered two specific patterns.

Pattern 1: Regular enrollment over time

The enrollment pattern that most people think of when they run A/B tests is “regular enrollment over time.” This refers to a situation where roughly the same number of participants enrolls each period. For example, 1,000 new participants enroll every week.

If you were to look at the daily count of customers entering your experiment, you tend to see something like this:

There may be weekly highs and lows, but they tend to be regular, so about the same number of enrollments show up each week.

I’ve encountered this pattern mostly in B2C scenarios, such as at Canva or Meta, where new users and potential customers show up regularly over time. It’s also common in e-commerce, where shoppers arrive on the site at a fairly steady rate.

If you re-read section 1 about sample size, you’ll see it assumes this pattern.

Pattern 2: Skewed enrollment over time

This pattern is defined by customers mostly entering early on (with a long tail after that). For example, 90% of your customers might enter the experiment in the first few days and the rest over the next few weeks.

If you were to look at the daily count of customers entering your experiment, you tend to see something like this:

I first encountered this pattern when working on the supply side of Booking.com’s business, building the products and services that helped accommodation providers (like hotels) upload rooms to be bookable to potential customers on the demand side.

I’ve also seen it several times since then and definitely see it more often for B2B products providing admin or management software.

Before we jump to the other considerations below (wink wink): if you pair these enrollment patterns with sample size alone, you will see that you can typically run experiments with Pattern 2 much faster than with Pattern 1.

Why? Say you want about 100 customers in your experiment. With Pattern 1 (regular), if ten new customers show up each day, it will take you a full week just to reach 70% of that sample. With Pattern 2 (skewed), you might see 70% of your customers show up on day 1!

So, in general (but wait for more!), you can run shorter experiments when your customers mostly show up at the start.
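Here’s a tiny Python sketch of that arithmetic, with made-up daily enrollment numbers, just to show how much sooner a skewed pattern can reach the same target sample:

```python
# Minimal sketch comparing the two enrollment patterns. The daily enrollment
# figures below are invented purely for illustration.
from itertools import accumulate

TARGET = 100  # customers we want in the experiment

regular = [10] * 30                            # Pattern 1: ~10 new customers/day
skewed = [70, 20, 10, 5, 3, 2] + [0] * 24      # Pattern 2: most arrive on day 1

def days_to_target(daily_enrollments, target):
    """First day (1-indexed) on which cumulative enrollment reaches the target."""
    for day, total in enumerate(accumulate(daily_enrollments), start=1):
        if total >= target:
            return day
    return None  # target not reached within the window

print("Regular pattern:", days_to_target(regular, TARGET), "days")  # 10 days
print("Skewed pattern: ", days_to_target(skewed, TARGET), "days")   # 3 days
```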

3. Participants’ frequency of use

However, there’s a customer feature that often runs counter to these enrollment patterns: frequency of use. Let’s look at two extremes.

High frequency-of-use

Customers entering experiments in a skewed way (e.g., 70% on day 1) tend to be high-frequency-of-use customers. That is, they use the product a lot. Like, every day. Using the example I shared from Booking.com, hotel staff tend to update their supply almost every day! Similarly, think about how often you use your work software, such as email.

The tricky thing with high-frequency-of-use customers is that they are hyper-sensitive to change. They use the product so much that many interactions become instinct-based (like learning to drive a car). So, when change is introduced, you get something called a ‘novelty effect.’

Novelty effects can be positive, such as a new shiny button that everyone wants to click, or negative, like moving a button so no one can find it anymore. Either way, novelty effects are responses that suddenly spike when customers get something new and then fade as they get used to the change.

High-frequency-of-use customers tend to be significantly more susceptible to producing novelty effects.

So, in general, if you have high-frequency-of-use customers, you will need to run your experiment for longer (compared to low-frequency-of-use customers) to allow for novelty effects to pass.
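Here’s an entirely simulated sketch of why that matters: a change with a modest lasting lift but a big, fast-decaying novelty spike looks far better in week one than it really is. Every number below (true lift, novelty size, decay rate, noise) is invented for illustration:

```python
# Simulated sketch of a decaying novelty effect. All parameters are made up.
import numpy as np

rng = np.random.default_rng(7)

DAYS = 28
TRUE_LIFT = 0.02    # the lasting lift from the change: +2%
NOVELTY = 0.15      # extra +15% "shiny new thing" lift on day 1
DECAY = 0.5         # the novelty roughly halves each day

days = np.arange(DAYS)
daily_lift = TRUE_LIFT + NOVELTY * DECAY**days            # underlying behavior
observed = daily_lift + rng.normal(0.0, 0.01, size=DAYS)  # day-to-day noise

# What you'd report if you stopped the experiment on each day (running average).
running_estimate = np.cumsum(observed) / (days + 1)

for d in (2, 6, 13, 27):
    print(f"Stop after day {d + 1:>2}: estimated lift {running_estimate[d]:+.1%} "
          f"(lasting lift is {TRUE_LIFT:+.1%})")
# Stopping early massively overstates the effect, and even the four-week
# estimate still carries some of the early spike, which is why longer runtimes
# matter for high-frequency-of-use customers.
```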

Low frequency-of-use

Conversely, when you’ve got participants enrolling at a regular rate (e.g., 100 new customers per day), you tend to be dealing with low-frequency-of-use customers. These customers use the product relatively infrequently, such as once per month or year. Think about websites you’d use to book a holiday.

The nice thing about low-frequency-of-use customers is that each time they visit your product, it’s like a new experience again. They’ve probably forgotten a few things and are expecting to do some thinking and make some mistakes.

Low-frequency-of-use customers tend to be slightly more oblivious to the changes you’ve made in an experiment.

4. Time for participants to trigger value

The final customer feature I always check when thinking about runtime is the time it takes for them to trigger the value you expect and for it to show up in your metrics (usually your primary metric).

Whatever your business, there are actions customers take that are clear signals of value. Typically, it’s a point of purchase. Depending on your business, however, the path to get there can be quite fast or slow. Consider each.

Fast time-to-value

Fast time-to-value customers go from entering an experiment to a potential value-creation action very quickly, say in the range of minutes to days. Think, for example, of shopping in a supermarket, buying something from Amazon, creating a design in Canva, or purchasing a plane ticket.

Slow time-to-value

Slow time-to-value customers, however, might have to wait a while before that value is realized, say weeks to months or even longer. For example, the Netflix team has to wait an entire month to see which customers continue or cancel their subscriptions. Or think of the time it can take a new Etsy seller to go from signing up to making their first sale.

You can probably imagine that having slow time-to-value customers typically means you have to run experiments longer. Why? You need to give your customers adequate time to get through their potential value cycle to understand if your changes have had a meaningful impact.

So, in general, when your customers have a fast time-to-value, you can run shorter experiments.
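A practical way to size that value window is to look at historical data on how long customers take to reach the value action. Here’s a minimal pandas sketch; the table, the column names, and the choice of the 90th percentile are all hypothetical illustrations rather than a standard recipe:

```python
# Minimal sketch: estimate how long to wait for value to show up, using
# historical data. The table and column names here are hypothetical.
import pandas as pd

history = pd.DataFrame({
    "signup_date": pd.to_datetime(
        ["2024-01-01", "2024-01-03", "2024-01-05", "2024-01-08"]),
    "first_purchase_date": pd.to_datetime(
        ["2024-01-02", "2024-01-10", "2024-01-20", "2024-01-09"]),
})

days_to_value = (history["first_purchase_date"] - history["signup_date"]).dt.days

# Take a high percentile so that most customers who will ever convert have had
# the chance to do so before we read the results.
value_window_days = int(days_to_value.quantile(0.90))
print(f"Allow roughly {value_window_days} days after enrollment for value to land")
```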

Examples to put it all together

Now that we know how to consider these other factors, let’s test them with a couple of real-life examples.

Let’s assume for all cases that a power calculation using the primary metric and desired minimum detectable effect tells us we need a sample size of 500K to work with.

OK, let’s dig in…

Example 1: Testing a new payment UI on an e-commerce site (like Amazon)

In situations like this, relevant customers tend to:

  1. Enroll in a regular pattern over time
  2. Be low or medium frequency-of-use users
  3. Go from arriving on site to making a purchase decision (trigger value) quite quickly (on the scale of hours to days).

So, we build up our runtime according to these like so:

  1. Say we get 250k enrollments per week; we’ll have the desired sample in 2 weeks.
  2. We don’t need to be too concerned with novelty effects, so we won’t adjust for this.
  3. Time-to-value is not beyond a day or two, so we won’t adjust for this either.

Based on this, I’d suggest running the experiment for two weeks.

Example 2: Testing a new sign-up flow in a freemium business (like Canva)

In situations like this, relevant customers tend to:

  1. Enroll in a regular pattern over time
  2. Be new (so low-frequency users)
  3. Go from arriving on site to making a purchase decision (trigger value) over quite a long period—sometimes weeks or months

So, we build up our runtime according to these like so:

  1. Say we also get 250k enrollments per week, which means we’ll have the desired sample in 2 weeks.
  2. We can ignore novelty effects.
  3. Time-to-value is weeks to months, so we need to adjust for this. Doing so requires examining typical behavior patterns. Let’s say we find that the first two weeks capture the critical behavior (e.g., whether or not new users purchase in their first two weeks tends to predict their future purchase behavior).

Based on this, I’d suggest we run the experiment for at least four weeks.

Why? This choice comes from two weeks to get the desired sample and another two weeks to give them sufficient time to potentially make a purchase.

Example 3: Testing a new home-page UI in enterprise software (like Microsoft Outlook)

In situations like this, relevant customers tend to:

  1. Enroll in a skewed pattern over time
  2. Be high-frequency-of-use users
  3. Go from enrolling to making a critical value decision quickly (on a scale of days)

So, we build up our runtime according to these like so:

  1. Say we get all the required enrollments in the first two days. That means enrollment alone gives us only a very short baseline runtime to work with.
  2. Given the frequency of use, we need to worry about novelty effects. Exactly how long it takes for novelty effects to spike and return to normal will vary, but let’s say we’ve run similar experiments and have seen it take at least 3-4 days.
  3. Time-to-value is not beyond a day or two, so we won’t adjust for this.

Based on this, I’d suggest we run the experiment for at least one week.

Why? We get the desired sample in a matter of days. However, novelty effects could take a few days to dissipate. Combined, we’re looking at at least 5-6 days, at which point we could round up to account for any stragglers and as a safety precaution (e.g., to account for any weekly seasonality). 
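If you like to codify this back-of-the-envelope math, here’s a rough planning helper (my own sketch, not a standard formula) that stacks the three adjustments used above: time to reach the sample, a novelty buffer, and a time-to-value window, rounded up to whole weeks to cover stragglers and weekly seasonality:

```python
# Rough sketch of the runtime arithmetic from the three examples above.
# A planning heuristic, not a statistical guarantee.
import math

def plan_runtime_days(required_sample, enrollments_per_day,
                      novelty_buffer_days=0, value_window_days=0,
                      round_to_whole_weeks=True):
    """Estimate experiment runtime in days from the runtime modifiers."""
    days_to_sample = math.ceil(required_sample / enrollments_per_day)
    runtime = days_to_sample + novelty_buffer_days + value_window_days
    if round_to_whole_weeks:
        runtime = 7 * math.ceil(runtime / 7)  # stragglers + weekly seasonality
    return runtime

# Example 1: e-commerce payment UI (regular enrollment, fast time-to-value)
print(plan_runtime_days(500_000, 250_000 / 7))                        # 14 days
# Example 2: freemium sign-up flow (add a two-week value window)
print(plan_runtime_days(500_000, 250_000 / 7, value_window_days=14))  # 28 days
# Example 3: enterprise home page (full sample in ~2 days, ~4-day novelty buffer)
print(plan_runtime_days(500_000, 250_000, novelty_buffer_days=4))     # 7 days
```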

Wrap up

Remember, learning quickly and effectively from your customers will require tailoring your experiment runtimes to their unique features. Be sure to consider these four “runtime modifiers” in your planning.

Doing so will help you run higher-quality experiments and get you innovating and delivering value at a much faster rate!

I hope you’ll think them through next time you need to plan an experiment.

If you’d like some help, let’s connect and chat on LinkedIn or contact me through hypergrowthdata.com.

Until next time, thanks for reading! 👋
