A Comprehensive Guide to Canary Releases

Kay James

March 21, 2022

To build cloud native applications effectively, your engineering organization must adopt a culture of decentralized decision-making, create a supporting platform, and implement continuous delivery to move faster. In this series, we’ll discuss key patterns in cloud native application development. We will present why they’re effective, how to implement them in your organization, and the consequences of doing so. We will also provide examples using popular cloud native tools and explain how these fit into your current software delivery lifecycle (SDLC). In the first part of this series, we’ll discuss canary releases and show an example of how to implement them with the Edge Stack API Gateway (which is powered by the CNCF Emissary-ingress project).

What is a Canary Release?

A canary release is a software testing technique used to reduce the risk of introducing a new software version into production by gradually rolling out the change to a small subset of users, before rolling it out to the entire platform/infrastructure.

The phrase “canary rollout” is often used as a synonym for canary release, and fundamentally there is no difference between the terms. However, canary releases are commonly confused with blue-green releases, feature flag releases, and dark launch releases.

A canary release differs from a blue-green release by enabling an incremental rollout of a new service. With a blue-green rollout the new software version is “switched” in one action and made available to all users instantaneously.

A canary release is also different from a feature flag release, as feature flags are used to expose a specific feature to a small subgroup of users. A canary release exposes a specific version of the entire application or service.

A dark launch differs from a regular canary release by duplicating traffic from a small subgroup of users and routing the copy to a new version of the service that does not return data to the user. The name comes from the response being “dark”, or hidden. Although the new service is tested with real traffic, the end users do not see the results; only the engineering team does.
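
The mirroring behavior described above can be sketched in a few lines. This is a minimal illustration rather than how any particular gateway implements it: the `handle_with_shadow` function, the handler callables, and the `shadow_log` list are all hypothetical names, and a real proxy would typically mirror asynchronously instead of calling the new version inline.

```python
from typing import Callable

Handler = Callable[[dict], dict]


def handle_with_shadow(
    request: dict,
    stable: Handler,
    shadow: Handler,
    mirror: bool,
    shadow_log: list,
) -> dict:
    """Serve the request from the stable version and optionally mirror it."""
    # The user only ever receives the stable version's response.
    response = stable(request)
    if mirror:
        try:
            # The new version sees a copy of the real request. Its response
            # is recorded for the engineering team and never returned.
            shadow_log.append(
                {"request": request, "response": shadow(dict(request))}
            )
        except Exception as exc:
            # A failure in the dark-launched version must not reach the user.
            shadow_log.append({"request": request, "error": repr(exc)})
    return response
```

The engineering team would then compare the entries in `shadow_log` against the stable responses before deciding whether the new version is ready for a visible canary.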

Motivation

The canary release technique was inspired by the fact that canary birds were once used in coal mines to alert miners when toxic gases reached dangerous levels. Somewhat gruesomely, the gases would kill the canary before killing the miners. However, this provided a warning to get out of the mine tunnels. As long as the canary kept singing, the miners knew that the air was free of dangerous gases. If a canary died, then this signaled the need for an immediate evacuation.

This technique is called “canary releasing” because a small subset of end-users selected for testing act as the canaries and are used to provide an early warning for the release of new functionality. Unlike the poor canaries of the past, obviously no users are physically hurt during a software release. Negative results from a canary release can be inferred from telemetry and metrics in relation to key performance indicators (KPIs).

Canary tests can be automated as part of continuous delivery, and are typically run after testing in a pre-production environment has been completed. The canary release is only visible to a fraction of actual users, and any bugs or negative changes can be reversed quickly by either routing traffic away from the canary or by rolling back the canary deployment.
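
As a rough illustration of how such an automated check might decide between promoting and reversing a canary, the sketch below compares the canary's error rate with the stable version's. The `Metrics` class, the `canary_verdict` function, and the threshold values are assumptions made for this example; a real pipeline would pull these numbers from its monitoring system and would usually consider several KPIs.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero before any traffic has arrived.
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    stable: Metrics,
    canary: Metrics,
    max_increase: float = 0.01,  # assumed tolerance: one percentage point
    min_requests: int = 100,     # assumed minimum sample size
) -> str:
    """Return "wait", "rollback", or "promote" for the canary."""
    if canary.requests < min_requests:
        # Too little traffic to draw a conclusion yet.
        return "wait"
    if canary.error_rate > stable.error_rate + max_increase:
        # The canary is noticeably worse: route traffic away from it.
        return "rollback"
    return "promote"
```

For instance, `canary_verdict(Metrics(10000, 50), Metrics(500, 25))` returns `"rollback"`, because the canary's 5% error rate exceeds the stable 0.5% rate by more than the tolerance.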

Applicability

You can use canary releases when:

An application consists of multiple (micro)services that are changing at independent rates, and verification of functionality must be conducted in a realistic (production-like) environment
There is a high operational risk of deploying new functionality, and this can be mitigated by experimenting with directing a small percentage of traffic to the new deployment
A service depends on a (third-party or legacy) upstream system that cannot effectively be tested against, and the only reliable method to validate successful integration is to actually integrate with this service

You should not use canary releases when:

You are working on a mission-, safety-, or life-critical system that cannot tolerate failure. No one wants to see software developers implement a canary release of a nuclear meltdown prevention safety mechanism.
End users will be overly sensitive to canary results. For example, extra care would have to be taken if canary releasing software that processes large volumes of financial transactions.
The experiment would require the modification of backend data (or the data store schema) in a way that is not compatible with the current service requirements

Structure/Implementation

Typically, canary releases are implemented via a proxy (like Envoy or HAProxy), a smart router, a configurable load balancer, or an API gateway (like Ambassador Edge Stack). The releases can be triggered and orchestrated by continuous integration/delivery pipeline tooling (such as Jenkins or Spinnaker), automated “DevOps” platforms (like Codefresh or Harness), or feature management SaaS platforms (like LaunchDarkly or Optimizely).
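
To make the traffic-splitting decision concrete, here is a minimal sketch of how a routing layer might choose between the stable and canary versions. It is not the algorithm used by any of the tools named above: many proxies split by weight on each request, whereas this example hashes a user ID so that a given user consistently sees the same version. The `pick_backend` function and its parameters are hypothetical.

```python
import hashlib


def pick_backend(user_id: str, canary_percent: int) -> str:
    """Return "canary" or "stable" for a user, deterministically."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the user ID into one of 100 buckets so the same user always
    # lands in the same bucket, regardless of which instance handles them.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    # Buckets below the threshold go to the canary. Raising the percentage
    # only adds users to the canary; nobody is moved back to stable.
    return "canary" if bucket < canary_percent else "stable"
```

Setting `canary_percent` to 0 routes everyone to the stable version, which doubles as a quick way to route traffic away from a misbehaving canary.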

Here are some implementation issues you should consider:

A prerequisite to implementing canary releases is the ability to effectively observe and monitor your infrastructure and application stack. This includes the ability to observe and comprehend both technical metrics (e.g. an increase in HTTP 500 status codes being returned to end users) and business metrics (e.g. a drop in the number of customers purchasing)
The front proxy, router, load balancer, or API gateway used to direct traffic must be programmable, and expose an API that allows dynamic configuration of traffic shaping and shifting.
Ideally, the canary release process and associated traffic shifting configuration will be written and stored declaratively, as this enables a “GitOps” style of working, and facilitates disaster recovery and auditing
If the new canary version of the application requires a datastore schema modification, the rollout of this must be carefully managed in order to prevent breaking the existing production services that rely on this schema. Often this calls for the “parallel change” pattern, also known as “expand and contract”
Services involved in the canary rollout will typically have to be capable of context propagation, i.e. passing headers or tokens that indicate a request is part of the canary on to upstream services.
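
The context propagation requirement in the last point can be illustrated with a small helper that a service might call before making its own outbound requests. This is a sketch under assumptions: the `x-canary` header name and the `outgoing_headers` function are hypothetical, and in practice this job is often handled by a tracing or service mesh library rather than hand-written code.

```python
from typing import Optional

# Hypothetical header name; use whatever marker your routing layer sets.
CANARY_HEADER = "x-canary"


def outgoing_headers(
    incoming: dict[str, str],
    extra: Optional[dict[str, str]] = None,
) -> dict[str, str]:
    """Build headers for an upstream call, carrying the canary marker over."""
    headers = dict(extra or {})
    for name, value in incoming.items():
        # HTTP header names are case-insensitive, so compare in lowercase.
        if name.lower() == CANARY_HEADER:
            headers[CANARY_HEADER] = value
    return headers
```

Only the canary marker is copied; other incoming headers, such as cookies, are deliberately not forwarded.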

Consequences

Using canary releases has both benefits and liabilities:

Benefits include:

The gradual rollout of new functionality limits the potential system blast radius of any operational issues
The gradual release of new functionality to users reduces the risk of negative outcomes impacting a large percentage of your user base

Liabilities include:

Manual canary releasing can be time-consuming and error-prone (a positive pattern is to automate the entire canary release life cycle)
There is limited value in the use of canary releases if the underlying system, application, and user behavior are not observable and well-instrumented
Managing incompatibilities between API versions and database schema changes can be challenging if the team does not have good testing and migration strategies in place. The same challenge can be found in relation to the mutability of the data structure in state management services in general
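
One common way to reduce the schema-related risk is the “parallel change” (expand and contract) pattern mentioned earlier. The sketch below walks through the expand and migrate phases using Python's built-in `sqlite3` module. The `users` table, the `display_name` column, and all function names are hypothetical, and the final contract phase (removing the old column once no version reads it) is left out.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Expand: add the new column as nullable so the stable version,
    # which does not know about it, keeps working unchanged.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def backfill(conn: sqlite3.Connection) -> None:
    # Migrate: copy existing values into the new column.
    conn.execute(
        "UPDATE users SET display_name = name WHERE display_name IS NULL"
    )


def stable_insert(conn: sqlite3.Connection, name: str) -> None:
    # The stable version writes only the old column.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def canary_insert(conn: sqlite3.Connection, name: str) -> None:
    # The canary writes both columns so either version can read the row.
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, name)
    )


def canary_read(conn: sqlite3.Connection) -> list:
    # The canary falls back to the old column for rows that the stable
    # version wrote after the backfill ran.
    rows = conn.execute(
        "SELECT COALESCE(display_name, name) FROM users ORDER BY id"
    )
    return [row[0] for row in rows]
```

Because both versions can read and write throughout, the canary can be rolled back at any point without a schema rollback.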

Example

An example of how to implement a canary release with the Ambassador Edge Stack API gateway can be found in the article “Configure Canary Rollout in your Cluster.”