How to reduce risk in cloud-native deployment

TL;DR:

  • Microservices are everywhere, letting developers work on smaller, independent services that together form a bigger, more complex application.
  • Independence enables speed and autonomy, but the complexity of independent services means inevitable failure.
  • When failure happens, what matters is limiting the scope and scale of the damage. Progressive delivery practices and testing in production, with techniques such as canary releases and blue/green deployments, provide this protection.

Cloud-native development: Speed with a side of risk

Embracing cloud-native development principles offers many advantages, chiefly the opportunity to release software faster on flexible infrastructure. However, the tradeoff for increased speed is accountability for the risk involved. Releasing a single faulty microservice can have cascading negative effects throughout a system, so applications must be built with the various failure modes in mind.

Developers in cloud-native organizations have adopted a wide spectrum of practices, such as canary releases and blue/green deployments, to proactively mitigate the risk associated with releasing changes. In this article, we will highlight a few strategies for integrating risk mitigation into a development workflow.

Modularizing development with microservices

Microservices modularize and decentralize application development. With a microservices architecture, applications are developed as a collection of small, decentralized services, which are independently deployed using automated deployment mechanisms. Successfully adopted, this approach enables decentralized development and decision making, which in turn results in faster development and faster delivery of functionality to end users.

The price of speed is increased risk

Microservices-based development promises speed and ease of adoption, making this approach a popular go-to. But it should come with a warning: Microservices architectures are complex, and complexity makes failure inevitable. With decentralization, developers need to think about loose coupling between services and designing for "fast failure". Decentralization also means that a developer working on a single service may not have complete insight into the entire system. A service that one team doesn’t even know about may cause a failure in one of the services they own.

This approach benefits from engineering with failure in mind: actively trying to fail fast on a small scale, and hedging bets against worst-case scenarios and catastrophic failures. The risk is not failure itself; the risk lies in not understanding what can and will fail, and in not building resilience around those failure scenarios. Resilience comes from proactively mitigating the impact of these failures so that the system continues to run at some functional level instead of crashing entirely.

Mitigation strategies: Don’t bring a chainsaw when a scalpel will do

Well-designed microservices with a solid supporting framework, including mitigation strategies, facilitate more fine-grained and creative experimentation. And with progressive delivery, the pitfalls of a new, modular, and more fragmented application workflow are easier to manage and avoid. We will address these mitigation strategies in greater detail in future articles, but here's a brief overview.

Progressive delivery

Progressive delivery extends continuous integration/continuous delivery (CI/CD) with processes and techniques for gradually rolling out new features, backed by good observability and tight feedback loops. This enables developers to test the release of functionality in a controlled and segmented way while still moving toward safe and fast deployment. Progressive delivery makes it possible to test in a real production environment without disruption to users, and to easily roll back changes if behavior isn't as expected.

Canary releases

Canary releasing is the practice of releasing a change to a small subset of users, so that only a minimal number of users are affected if there's a bug. The change then rolls out incrementally until it reaches the entire user base or platform.

For example, 5% of users might be routed to a canary candidate while the other 95% are routed to the current application in production. The percentage of users routed to the canary continues to increase until all users are routed to the canary, or the candidate is rolled back at some point because of a problem identified during the controlled rollout.
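
As a rough illustration of the percentage-based routing described above, here is a minimal Python sketch. The function names, the 1% error threshold, and the 20-point step are all assumptions made for the example, not part of any particular tool. Hashing the user ID keeps each user on the same side of the split between requests, and raising the percentage only ever adds users to the canary.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent% of users, else "stable"."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


def next_canary_percent(current: int, error_rate: float,
                        max_error_rate: float = 0.01, step: int = 20) -> int:
    """Widen the rollout while the canary looks healthy; return 0 to roll back.

    The threshold and step size are illustrative assumptions.
    """
    if error_rate > max_error_rate:
        return 0  # roll back: no users are routed to the canary
    return min(100, current + step)
```

In practice this decision is usually made by a load balancer, API gateway, or service mesh rather than in application code, but the logic is the same: a stable split, a health signal, and a rule for widening or abandoning the rollout.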

Blue/green deployment

Blue/green deployments, also called traffic switching, rely on two nearly identical production environments and shift traffic from one to the other. Only one environment is live at any given time, which enables testing in a production environment and quick rollbacks in the event of a significant problem.
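
The switch itself can be sketched in a few lines of Python. The Router class and the cut_over helper below are hypothetical names used only for illustration, and the health check is passed in as a plain function; a real setup would flip a load balancer or DNS target instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Router:
    """Tracks which of the two environments receives all live traffic."""
    live: str = "blue"

    @property
    def idle(self) -> str:
        """The environment that is currently not serving users."""
        return "green" if self.live == "blue" else "blue"

    def switch(self) -> str:
        """Send all traffic to the idle environment; calling again rolls back."""
        self.live = self.idle
        return self.live


def cut_over(router: Router, is_healthy: Callable[[str], bool]) -> bool:
    """Switch traffic only if the idle environment passes its health check."""
    if not is_healthy(router.idle):
        return False
    router.switch()
    return True
```

Because the previous environment stays in place after the switch, a rollback is just a second call to switch().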

A/B testing

A/B testing, or split testing, takes two variants, A and B, of a single variable and compares them to find out whether and how the variation affects user behavior. For example, it can be used to determine which version of a web page performs better.
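
A minimal sketch of the mechanics, assuming a hash-based 50/50 split and a simple conversion-rate comparison (both function names are made up for this example):

```python
import hashlib


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to variant "A" or "B" of an experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def conversion_rate(conversions: int, visitors: int) -> float:
    """Fraction of visitors who converted; 0.0 when there were no visitors."""
    return conversions / visitors if visitors else 0.0
```

Including the experiment name in the hash keeps a user's assignment in one experiment independent of their assignment in another. Comparing raw conversion rates is only a first step; a real experiment would also check that the difference is statistically significant before declaring a winner.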

Traffic shadowing

Similar to blue/green deployments, traffic shadowing is a practice in which production traffic is asynchronously copied to a non-production service for testing with zero production impact.

Feature flags

Feature flags, also known as feature toggles, allow for turning features or functionalities on or off without actually deploying code. This enables the “hiding” of features from users until they are ready to be released to a wider set of users, at which time the flag's status is changed.
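
In its simplest form, a feature flag is a lookup that guards a code path. The sketch below uses an in-memory store and a hypothetical "new_checkout" flag purely for illustration; real systems typically read flags from a configuration service so they can change at runtime.

```python
class FeatureFlags:
    """A tiny in-memory flag store (a stand-in for a real flag service)."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str, default: bool = False) -> bool:
        """Unknown flags fall back to the default, which is off."""
        return self._flags.get(name, default)

    def set(self, name: str, enabled: bool) -> None:
        """Change a flag's status without redeploying any code."""
        self._flags[name] = enabled


def checkout(flags: FeatureFlags) -> str:
    """Both code paths are deployed; the flag decides which one runs."""
    if flags.is_enabled("new_checkout"):
        return "new checkout flow"
    return "legacy checkout flow"
```

Defaulting unknown flags to off means a new code path stays hidden until someone explicitly enables it.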

Resilience engineering

Engineering for resilience is fundamentally about designing for failure and keeping things working in both expected and unexpected operational conditions.
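
One widely used pattern for this is the circuit breaker, shown here as a simplified Python sketch. It is one possible technique among many, and the class name, thresholds, and fallback approach are assumptions for the example. After repeated failures, the breaker stops calling the broken dependency and returns a fallback immediately, so the caller fails fast and keeps running at a reduced level.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before retrying
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: fail fast, skip the dependency
            self.opened_at = None         # cool-down over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0                 # success closes the circuit again
        return result
```

A caller might wrap a request to a recommendations service this way and fall back to a cached or empty list, so a failure in that one service degrades a page rather than taking it down.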

Risk is everyone's responsibility

Cloud-native, microservices-based development requires a cultural shift, particularly in the way developers work. It's a shift to thinking differently and testing differently, and therefore mitigating risk differently. With the freedom of decentralized development comes the responsibility of understanding that developers work independently but aren't working in a silo. One service still needs to work with other services, and there's a greater responsibility on individual developers to make sure this happens without introducing large-scale failures across the entire system.

Microservices architectures shift the responsibility for bigger-picture risk thinking to each development team and individual developer. This makes progressive delivery practices and proactive mitigation techniques valuable tools not only to reduce the blast radius in the event of failure but also to keep the continuous deployment engine moving.

The time to learn about progressive delivery is now

Continuous integration and continuous delivery have long been key tools in the software engineer’s and architect’s toolbox. With the rise of modular software architectures and cloud platforms, and the new opportunities for speed of delivery that these provide, mitigating the risk of each release is becoming increasingly important. This is now everyone’s responsibility.