Products

Experimentation

Product Experimentation Web Experimentation Lifecycle Experimentation Lifecycle Experimentation

Feature Flagging

Release Management Automated Rollouts Config Flags Release Management

AI Personalization

Contextual Bandits Contextual Bandits

Why Eppo

WHY EPPO

By Role

Data Scientists Engineers Product Managers Product Managers

Resources

Customers Outperform Updates White Papers White Papers

FEATURED CASE STUDY

Coinbase Saves Millions, Reduces Experiment Analysis Time by 40%, and Restores Trust in Experimentation with Eppo

Learn more

Blog

About

Product News

April 25, 2023

5 Ways to Ensure Data Quality When Running Experiments

Here's how to ensure you’re making experiment decisions based on solid data foundations.

Sven Schmit

Eppo's Head of Statistics Engineering and fmr Data Science Manager at Stitch Fix. Sven holds a PhD in statistics from Stanford University.

The value of identifying the right opportunity, running an experiment and making an informed decision can be worth millions. Building an experiment-driven company culture is a large undertaking that involves rewarding behavior to constantly search for new opportunities and not being afraid to shut down good ideas if they prove not to provide ROI.

Once you have the organizational buy-in, the greatest challenge of running experiments is ensuring that your data is reliable. You need reliable data to properly identify opportunities for experiments. For example, you would need correct data about the signup funnel in order to identify that your drop-off rate is lower than the industry average. Reliable data is also paramount to trusting the outcome of experiments so that your leadership team is comfortable with implementing business-critical changes.

In this post we look into five things you can do to ensure data quality when running experiments.

1. Get your metric data straight before running the experiment

One of the most common challenges when running experiments is misalignment between the engineering teams implementing the feature flag and data scientists analyzing the outcome of the experiments. This leads to unfortunate outcomes such as:

The metric on which the experiment is evaluated not being fit for purpose
Selection bias where users are not randomly assigned to the treatment and control groups
Events emitted inconsistently across platforms (e.g. iOS and Android)
Events not triggering for certain user groups (e.g. users using an ad blocker)

To mitigate these types of problems, you should include the data scientist from the beginning of the process; they should work alongside the engineering team when planning out the experiment. Companies with the most sophisticated experimentation frameworks, such as Airbnb, define and certify core metrics ahead of any experiment process, and not reactively for each experiment.

Experimentation often involves new products that require novel instrumentation. It's helpful to consider: "Suppose this experiment doesn't work. What questions will I ask and what data do I need?”

Thinking a few steps ahead helps reduce the stress on the data team. If an important feature has just been shipped, it’s not uncommon for senior stakeholders to take an interest in the outcome. It can be helpful for the data scientist to work with the Product Manager on preemptively communicating expectations for the experiment by sharing a brief snippet with stakeholders.

Example:

“We’re rolling out an experiment to 20% of users to reduce contact rate. We expect the result to reach statistical significance within 14 days of being shipped. We’ll share a preemptive update on the initial results after the first 7 days. Keep track of health metrics in this dashboard: https://example.cloud.looker.com/dashboards/1”

‍

2. Consider the longevity of experiment metrics

Be clear up front about how long you need to monitor your metrics to be able to confidently make a decision based on your experiment. For example, if you make a change to customer support to encourage more customers to use the in-app chat instead of calling in, you may want to measure the long-term impact on NPS and customer satisfaction.

You may evaluate the success of an experiment based on the overall reduction in support tickets while monitoring phone calls during the same period. But you may decide to measure NPS and customer satisfaction over a longer time period to account for implications such as the impact on happiness for new customers or delay in survey responses.

While this may look simple on the surface, it can mean that you have to put in guardrails to guarantee the reliability of these metrics over the entire time period. If the methodology for asking users about the NPS has changed in this period or you neglect to monitor the data for potential data quality issues it can make it harder to assess the medium to long term effect of the experiment.

‍

3. Set up controls to proactively monitor data quality issues

If you’re working at scale, you likely have hundreds of thousands of tables in your data warehouse. While not all are critical to your experiments, you’ll often be surprised by just how interconnected tables are and how an issue upstream can propagate downstream.

In the last resort, you learn about issues from stakeholders or end-users, but if you’re taking the quality of your data seriously, you’re likely running manual or automated data tests to catch issues proactively.

‍

Manual tests

Manual data tests should be the backbone of your error detection and are available out-of-the-box in tools such as dbt. These are curated based on your business circumstances, and should at a minimum cover your most important data models and sources. Well-built tests help you catch issues before stakeholders and simplify the debugging process by highlighting or ruling out where issues occurred.

Synq has written an in-depth guide with ten practical steps to level up for tests in dbt, with concrete recommendations for how to achieve state-of-the-art monitoring with dbt tests.

‍

Anomaly detection

Your first resort should be manual tests, as these can help cover gaps and tightly couple your business knowledge to expectations from the data. Adding checks to automatically detect anomalies on your data can be helpful to learn about issues that your manual controls may not capture.

Anomaly detection controls and data observability platforms can help you detect issues across quality, data freshness, volume and schema issues:

Quality: Automatically run data checks on parameters such as NOT NULL and uniqueness to catch common quality errors and give you insights into where the data deviates from what you’d expect
Freshness: Can help you automatically detect if sources or data models are stale so you know about missing data before stakeholders
Volume: Volume can give you an idea of the completeness of your data. If you normally have 500 rows and suddenly see 5,000 new rows something may be off
Schema: Proactively knowing about upstream schema changes can give you an idea if the expected data type changes and if that’s in line with your business expectations

Example anomaly detection on metadata from Synq

‍

4. Monitor and communicate issues early on

If you’re running business-critical experiments, watch it like a hawk the first few days and make sure you’ve shared responsibility between the data scientist, the business team and the product & engineering team.

Your experiments might succeed or fail, but your brand requires demonstrating reliable execution. That means systematic early detection and mitigation of bugs and setup issues.

You may only be able to say anything conclusive about the outcome of an experiment after the experiment's duration, but having dashboards and checks in place early on can help you catch unexpected issues. In order to catch issues early on, you might consider segmenting key metrics by factors that are key to your experiment – such as operating system or user type.

It can help to have a shared Slack channel to discuss the experiment, as well as a dashboard that’s shared between business, data and product & engineering teams. This enables you to get input from as many places as possible, and people working on the business side can often bring in a unique set of operational insights from working with customers day-to-day.

‍

5. Clean up after the experiment

For companies running many experiments, it’s not uncommon to leave behind a lot of remnant data and outdated dashboards. This can be costly, and it contributes to the overall messiness of the data warehouse and dashboards.

A new experiment typically requires spinning up at least one new data model. It may require that you build a dashboard, and in some cases you want to update the data model in (nearly) real time when the experiment is live. In some cases, we’ve seen teams update their experimentation data model every five minutes, only to forget to disable it after the experiment has concluded. This meant that one single data model for an archived experiment was costing upward of $10,000 annually.

It can be a good idea to have a documented approach to how you archive experiments. This could include:

What’s the expected lifetime of an experiment?
How should experiments be archived (e.g. move data models to the archived folder and disable from running on the daily schedule)?
How can people access historical results of experiments? This is important if you want to look back at the end of the year and assess which experiments had the most impact

‍

Summary

In this article we looked at five ways you can ensure the data quality when running experiments.

Get your metric data straight before running the experiment – include the data team early in the process when designing experiments
Consider the longevity of experiment metrics – be explicit about how long you need to be able to evaluate driver and health metrics after the experiment is done
Set up controls to proactively monitor data issues – monitor data quality issues in your data warehouse using manual and automated tests
Monitor and communicate issues early on – monitor experiments early on and raise potential issues to course correct
Clean up after the experiment –be explicit about how you archive assets such as data models and dashboards after an experiment is done

‍