Day 2 Operations in K8s


As "run" and incident management responsibilities shift to developers, the discussion has focused on so-called day 2 operations in the software lifecycle. Traditionally, day 2 operations covered the maintenance, monitoring, and optimization of an application after release. In cloud-native development, however, day 2 comes earlier in the cycle, and developers are increasingly responsible for it. A microservices architecture demands a faster and more complete feedback loop, one that includes operational aspects, rather than a linear set of (waterfall-style) stages.

Not only is a developer better placed to understand the big picture of the system at runtime, but bringing day 2 operations into the development lifecycle earlier also makes it easier to identify and fix problems, mistakes, and errors before rolling out to production.

Automation is a key theme throughout the implementation of day 2 operations, and the topics of reliability, security, and observability play a key role in developer-managed (Dev) operations (Ops).

Incident management

Code-based bugs inevitably sneak through even the most rigorous testing process, and some issues only manifest themselves in certain use cases, infrastructure dynamics, or cascading service failures. Often, once the initial incident fire is extinguished, there is still much to learn to prevent a recurrence. Therefore, all cloud-native developers need to learn about effective incident management and (blameless) postmortems and analysis.

With the adoption of cloud-native architecture and Kubernetes-based infrastructure, the role of incident response is increasingly being pushed toward developers. This is partly to drive increased responsibility, i.e., developers should have “skin in the game” to help them empathize with the pain caused by incidents, but also because cloud-native applications are increasingly complex: they often operate as complex adaptive systems.

The symptoms of an incident may surface far from the actual fault. The link between cause and effect is not always obvious, and the search space has grown. For example, increased CPU usage in one service can cause back pressure and break a corresponding upstream service.
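
The back-pressure example can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the function name `simulate`, the tick-based model, and all of the numbers are hypothetical and not taken from any real system. A downstream service drains a bounded queue at a fixed rate per tick, and when CPU pressure lowers that rate, the queue fills and the upstream caller starts seeing rejections.

```python
from collections import deque


def simulate(ticks, arrival_per_tick, service_per_tick, queue_capacity):
    """Return how many upstream requests were rejected over the run."""
    queue = deque()
    rejected = 0
    for _ in range(ticks):
        # The upstream caller sends requests; a full queue pushes back on it.
        for _ in range(arrival_per_tick):
            if len(queue) < queue_capacity:
                queue.append(1)
            else:
                rejected += 1
        # The downstream service drains only what its CPU budget allows.
        for _ in range(min(service_per_tick, len(queue))):
            queue.popleft()
    return rejected


# Healthy downstream: it keeps up with the arrival rate.
healthy = simulate(ticks=10, arrival_per_tick=5, service_per_tick=5, queue_capacity=10)
# CPU-starved downstream: same traffic, lower throughput.
degraded = simulate(ticks=10, arrival_per_tick=5, service_per_tick=2, queue_capacity=10)
print(healthy, degraded)
```

In this model the fault (reduced throughput) lives in the downstream service, while the visible symptom (rejected requests) shows up at the caller, which is why the search for a cause often has to start away from where the errors are reported.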

Focus areas for and beyond effective incident management

Cloud-native incidents require effective management during and after the incident. Much of the response should be automated, but the learning process is hands-on and human.

For the things that can’t be automated, clear runbooks are needed to guide triage and diagnostics, and their use should be practiced regularly via game days. Incidents should also be followed by blameless postmortems that analyze root causes and help ensure that similar incidents are avoided in the future.
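
One way to picture the split between automated response and runbook-guided work is to encode a runbook as an ordered list of steps, some with an automated check and some left to a human. The following is a minimal sketch, not a real tool: `Step`, `run_runbook`, and the stubbed check results are all hypothetical, and a real check would query your own monitoring or deployment systems.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    description: str
    # A callable returning True when the check passes; None marks a manual step.
    check: Optional[Callable[[], bool]] = None


def run_runbook(steps: List[Step]) -> List[str]:
    """Run automated checks and return the items a responder must look at."""
    todo = []
    for step in steps:
        if step.check is None:
            todo.append(f"MANUAL: {step.description}")
        elif not step.check():
            todo.append(f"FAILED: {step.description}")
    return todo


# Hypothetical runbook; the lambdas stand in for real health checks.
runbook = [
    Step("Error rate is below threshold", check=lambda: False),
    Step("No deploy happened in the last hour", check=lambda: True),
    Step("Ask the service owner whether a rollback is safe"),
]

findings = run_runbook(runbook)
for line in findings:
    print(line)
```

Keeping the manual steps in the same structure as the automated ones means the responder gets a single ordered list during triage, and a step can later be automated by attaching a check without rewriting the runbook.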

Game days

Game days are a no-pressure way to practice incident response and build resilience. They provide a no-fault opportunity to simulate and recreate incidents, and they can help improve processes, test the resiliency of systems, validate observability mechanisms, and reduce stress during actual incidents.
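
A game day exercise often comes down to injecting a known fault and checking that the expected signal fires. The sketch below shows that idea in miniature; `FaultInjector`, `run_drill`, and the alert threshold are hypothetical names and values chosen for illustration, and a real exercise would inject faults into actual services and watch real alerting.

```python
class FaultInjector:
    """Wrap a callable and raise an error on every Nth call."""

    def __init__(self, func, fail_every):
        self.func = func
        self.fail_every = fail_every
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        if self.calls % self.fail_every == 0:
            raise RuntimeError("injected fault (game day)")
        return self.func(*args, **kwargs)


def run_drill(service, requests, alert_threshold):
    """Send requests, count errors, and report whether the alert would fire."""
    errors = 0
    for request_id in range(requests):
        try:
            service(request_id)
        except RuntimeError:
            errors += 1
    return errors, errors >= alert_threshold


# Fail every third call, then confirm the (assumed) alert threshold is reached.
flaky_service = FaultInjector(lambda request_id: request_id * 2, fail_every=3)
errors, alert_fired = run_drill(flaky_service, requests=9, alert_threshold=3)
print(errors, alert_fired)
```

Because the fault is deterministic, the team knows in advance how many errors to expect, so a missing alert points to a gap in observability rather than to bad luck.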

Hands-on: day 2 operations

Read: How to Run a GameDay
Do: Plan and run a game day for your organization.
