Rolling out brand-new features into your SaaS can feel daunting.
Fortunately, there’s a way to validate new features and changes before they are released to the wider public. That’s called canary testing.
Here’s what we’ll cover in this primer:
What is canary testing?
How does canary testing compare to A/B testing?
How does canary testing compare to blue-green testing?
Advantages and disadvantages of canary testing
How canary testing works
How canary testing and feature flags work together
Canary testing is a method of slowly rolling out software changes to a small group of users in a live environment to minimize risk. It allows you to test new features or updates on a small scale without the danger of exposing all your users to potential issues.
To fully answer the question “What are canary tests?”, we’ll embark on a little history lesson.
It’s called “canary” testing because coal miners used to carry caged canaries (the birds) into the mines. Canaries are more sensitive to toxic gases than humans are, so a bird in distress warned miners that gas levels were becoming dangerous before the miners themselves noticed.
Just like coal miners would use canaries to detect dangerous gases in mines, software teams use canary testing to detect issues before exposing all users to new changes.
The “canaries” are a small group of users. Once segmented, they become the “canary group” and will receive the update first.
This canary group is closely monitored to ensure no issues arise before rolling out the change to the entire user base.
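One common way to segment a stable canary group is to hash each user ID into a bucket and compare it against the rollout percentage. The sketch below is only an illustration of that idea: the function names, the salt value, and the 10,000-bucket granularity are assumptions made for this example, not something the article or any particular tool prescribes.

```python
import hashlib


def bucket_for(user_id: str, salt: str = "new-checkout") -> int:
    """Map a user to a stable bucket in [0, 9999] using a salted hash.

    The salt (a hypothetical feature name) keeps canary groups for
    different features from always containing the same users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 10_000


def in_canary_group(user_id: str, percent: float, salt: str = "new-checkout") -> bool:
    """Return True if the user falls inside the first `percent` of buckets."""
    return bucket_for(user_id, salt) < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    canaries = [u for u in users if in_canary_group(u, percent=5)]
    print(f"{len(canaries)} of {len(users)} users are in the canary group")
```

Because the bucket depends only on the salt and the user ID, the same user gets the same answer on every request, and raising the percentage later only adds users to the canary group without removing anyone.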
Canary testing and A/B testing are closely related. Both aim to refine the user experience with real user feedback, but they differ in execution and purpose.
You could think of A/B testing as a more rigorous, more advanced form of canary testing.
Here’s the main difference between the two:
Canary testing focuses solely on risk mitigation before wider release so that new updates don't compromise system stability. A small and specific portion of the user base is exposed to these updates.
A/B testing can also provide risk mitigation, but it further seeks to validate improvements to metrics like engagement by comparing feature versions. A/B testing is more intentional about randomly assigning audiences to each release version to enable both risk mitigation and statistical inference.
To put it simply, canary testing introduces changes to a small audience for safety, while A/B testing applies changes to deliberately sized, randomized user groups for statistical analysis, emphasizing optimization in addition to immediate risk management.
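To make the "statistical inference" part concrete, here is a minimal sketch of one analysis an A/B test might use: a two-proportion z-test comparing conversion rates between randomized control and treatment groups. The article does not name a specific test, so this choice, the function name, and the counts in the usage example are all assumptions for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    conv_control: int, n_control: int, conv_treatment: int, n_treatment: int
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_control = conv_control / n_control
    p_treatment = conv_treatment / n_treatment
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    z = (p_treatment - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 1,000 users per group, 100 vs. 150 conversions.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A canary release typically skips this step: its group is small and not necessarily randomized, so it can reveal that something is broken but is poorly suited to measuring whether a metric improved.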
We know what canary testing is, but what about blue-green testing?
Blue-green testing involves keeping two separate production environments, blue and green. You release the new version to green, and — if it works — you direct all your traffic to the green environment.
Both techniques aim to protect the user experience when introducing new features or updates by validating the new version in production before the old one is retired.
However, they differ mainly in:
Deployment strategy: Canary testing introduces the new version to a small user group first, expanding to all if successful. Blue-green testing switches all traffic from an old version (blue) to a new one (green) in a separate environment upon success.
Resource requirements: Canary testing is more resource-efficient, affecting a segment of users, whereas blue-green testing requires duplicating the entire production environment.
Risk management: Canary testing minimizes risk by initially impacting a small group, allowing quick rollback. Blue-green offers easy rollback to a stable environment, reducing downtime risk.
In other words, canary testing suits smaller teams or projects with fewer resources, focusing on performance validation. Blue-green testing is ideal for critical applications requiring stable releases, emphasizing minimal service disruption.
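The difference in deployment strategy can be sketched with two toy routers. Both classes below are hypothetical illustrations rather than real load balancer APIs: the blue-green router sends all traffic to one environment and flips it in a single step, while the canary router sends a configurable fraction of requests to the new version.

```python
import random
from dataclasses import dataclass


@dataclass
class BlueGreenRouter:
    """All traffic goes to one environment; a release is a single switch."""

    live: str = "blue"

    def route(self) -> str:
        return self.live

    def switch(self) -> None:
        # Flipping again is the rollback path.
        self.live = "green" if self.live == "blue" else "blue"


@dataclass
class CanaryRouter:
    """A fraction of requests goes to the canary; the rest stay on stable."""

    canary_weight: float = 0.05

    def route(self, rng: random.Random) -> str:
        return "canary" if rng.random() < self.canary_weight else "stable"


if __name__ == "__main__":
    blue_green = BlueGreenRouter()
    blue_green.switch()
    print("blue-green now serves:", blue_green.route())

    rng = random.Random(42)
    canary = CanaryRouter(canary_weight=0.05)
    hits = sum(canary.route(rng) == "canary" for _ in range(10_000))
    print(f"canary served {hits} of 10000 requests")
```

Note that this sketch routes per request for brevity. A real canary release would usually pin each user to one version so their experience stays consistent across requests.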
So, what are canary tests actually good for?
The key benefits of using canary tests fall into three main pillars: fast feedback, reduced risk, and decisions grounded in real data.
Rapid insight collection: Canary tests deliver immediate feedback on new features. This quick glimpse into user reactions and metric impacts allows you to take swift action.
Agile response: If an issue does surface, it's easy to halt the canary test, sparing the broader user base from potential frustrations. This feedback loop makes updating and improving new features much faster.
Controlled exposure: Launching updates to everyone at once can introduce significant risks, such as widespread user dissatisfaction. Canary testing confines these risks to a manageable group of users.
Safe production testing: By limiting the initial release, canary testing provides a safety net. If the test reveals any problems, stopping it promptly prevents issues from spreading into other areas. You always want to preserve your system’s overall integrity.
Authentic user feedback: Canary tests are grounded in actual user data and behavior, offering insights into how new updates perform in terms of engagement, conversion, and error rates.
Evidence-based launches: Analyzing data from canary tests helps you know whether a feature is ready for a wider release or needs further refinement. However, this is only a first step in moving decision-making from speculative to empirical. Truly data-driven decision making requires valid inference from a more complete A/B test.
Running canary tests can still bring some complications into the mix:
Navigating mobile apps' single environment: It's tricky to segment updates in mobile apps because each user's device acts as a standalone environment, complicating targeted canary testing.
Solution: Use feature flags to enable or disable features for specific users remotely, overcoming the single-environment hurdle.
Juggling multiple features at once: When you're rolling out lots of new features quickly, it can be tough to keep track and ensure each one works as expected without overwhelming your testing process.
Solution: Deploy a refined feature flag system to manage and monitor each feature separately, making it easier to handle rapid releases.
Dealing with multiple production machines: Coordinating canary tests across several machines can feel like a logistical nightmare, requiring tight synchronization to maintain consistency.
Solution: Embrace automation through infrastructure as code (IaC) and CI/CD tools to simplify deployment and keep all machines in lockstep.
Providing a consistent experience for everyone: With so many different devices and operating systems out there, it's a challenge to make sure everyone has a smooth experience during testing.
Solution: Use tools that allow you to test across a broad range of devices and OSes, ensuring that every user gets the same quality experience.
Getting the most out of your test data: Canary testing throws a lot of data at you, and it can be tricky to sift through it all to find the insights you need.
Solution: Integrate your tests with a powerful analytics platform that can help you sort, analyze, and act on your data quickly and accurately.
Once you’ve identified what you want to test, you’re ready to run a canary test. Here’s a five-step process to get started:
Select your canary group: Choose a small percentage (ideally 1-5%) of your user base who’ll receive the new feature first. These users will serve as the early indicators of how the new update performs in a live environment.
Gradually roll out the features: Increase the percentage of users exposed to the new features over time by adjusting your feature flags. Aim to increase the exposure by about 5 percentage points every few days, and carefully monitor the impact at each stage.
Never stop monitoring performance and user feedback: Keep a close watch on key metrics such as revenue, signups, and page views. Also, pay attention to customer feedback, including any comments and complaints, to spot negative impacts early on.
Respond quickly to any issues: Should any problems arise, immediately adjust the feature flag to roll back the feature for the canary group. This swift action minimizes impact and protects the broader user base while you address the issue.
Decide whether to expand or roll back: If the canary test goes well, roll out the new features to your entire user base gradually. If not, roll back the feature for all canary users and plan for a future re-test once you’ve solved what’s causing the problem.
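The five steps above can be sketched as a simple ramp loop: start small, check health, then either widen exposure or roll back. This is a simplified illustration under stated assumptions. The function names and the `is_healthy` callback are hypothetical, and the wait of a few days between steps is represented only by a comment.

```python
from typing import Callable, List


def next_exposure(current: float, step: float = 5.0, ceiling: float = 100.0) -> float:
    """Increase exposure by `step` percentage points, capped at `ceiling`."""
    return min(current + step, ceiling)


def run_rollout(
    start_percent: float,
    is_healthy: Callable[[float], bool],
    step: float = 5.0,
) -> List[float]:
    """Return the exposure history: it ends at 100 on success, 0 on rollback."""
    exposure = start_percent
    history = [exposure]
    while exposure < 100.0:
        # In practice you would wait (for example, a few days) and review
        # metrics and customer feedback before evaluating this check.
        if not is_healthy(exposure):
            history.append(0.0)  # roll the feature back for everyone
            return history
        exposure = next_exposure(exposure, step)
        history.append(exposure)
    return history


if __name__ == "__main__":
    # Hypothetical health check: pretend problems appear at 15% exposure.
    print(run_rollout(5.0, is_healthy=lambda exposure: exposure < 15.0))
```

In a real setup, `is_healthy` would compare the monitored metrics for the canary group against thresholds your team has agreed on, and the exposure value would be written to your feature flag configuration rather than kept in a list.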
Feature flags and canary testing are the perfect tag team.
Feature flags let you toggle features on and off for certain users. For instance, you can enable a new feature for just 1% of your users to start, monitor how it performs, and then gradually roll it out to more people.
You can see how this goes hand in hand with canary testing. It minimizes risk because, if something goes wrong, you only impact a small group.
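As a rough illustration of how a flag supports this, the sketch below models a feature flag with a global on-off switch, a rollout percentage, and an allow list. It is a generic, in-memory example written for this primer. The class and field names are hypothetical, and it does not represent any vendor's SDK.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Set


@dataclass
class FeatureFlag:
    key: str
    enabled: bool = True          # global kill switch
    rollout_percent: float = 0.0  # share of users who should see the feature
    allow_list: Set[str] = field(default_factory=set)  # e.g. internal testers

    def is_on_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False  # kill switch overrides everything else
        if user_id in self.allow_list:
            return True
        # Stable per-user bucket in [0, 9999], salted with the flag key.
        digest = hashlib.sha256(f"{self.key}:{user_id}".encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) % 10_000
        return bucket < self.rollout_percent * 100


if __name__ == "__main__":
    flag = FeatureFlag(key="new-checkout", rollout_percent=1.0, allow_list={"qa-1"})
    print("qa-1 sees the feature:", flag.is_on_for("qa-1"))

    flag.rollout_percent = 10.0  # widen the rollout
    flag.enabled = False         # or switch the feature off for everyone
    print("qa-1 after kill switch:", flag.is_on_for("qa-1"))
```

The useful property is that widening the rollout or switching the feature off is a configuration change, not a redeploy.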
Using Eppo’s feature flagging makes canary testing a breeze. Our feature flagging tools allow you to allocate traffic and test across environments in a matter of seconds. Plus, you get granular control over who sees which features.
Whether you’re using simple on-off flags, gradual rollouts, or even full-blown enterprise-wide experiments, Eppo is the out-of-the-box solution for adding feature flags to your canary testing quickly, with little to no manual intervention.
Now you’ve got a complete overview of how canary testing can fit into your company’s experimentation strategies.
Are you ready to take your experiments a step further?
Eppo is the experimentation platform that simplifies canary testing and ups your experimentation game. Here’s why Eppo stands out:
Precision feature flagging: With Eppo, you get enterprise-grade feature flagging tools that offer unparalleled control. Allocate traffic with pinpoint accuracy and test features across environments effortlessly.
Comprehensive experimentation suite: Beyond canary testing, Eppo provides a robust toolkit for A/B testing, feature analysis, and more. This allows for a holistic approach to product development and optimization.
Seamless integration: Eppo integrates with your existing tech stack, making your transition to more sophisticated testing methods as smooth as possible.
Support for scale: Whether you're a startup or an established enterprise, Eppo is designed to scale with your needs. Manage experiments across teams and projects without compromising on speed or efficiency.