Postmortems in K8s
While game days are great for staying prepared for potential incidents, what about incidents that did occur? These are ideal opportunities for learning. Retrospectives that examine the root causes of an incident are key to fixing problems and processes so that the same failures do not happen again. Blameless postmortems are a great tool for actively learning from incidents.
Remember how painful the last incident was? When everything is going well and running smoothly, it's easy to forget the pain and avoid digging into the root causes of failure. After all, the fast-moving cloud-native development environment is designed for speed of development and shipping of new features and functionality. It's easy to overlook the fact that a highly distributed system may in fact be more prone to failures than traditional software.
Using blameless postmortems is a way to avoid repeating the trauma and build resilience and efficiency into processes.
What is a postmortem?
A postmortem is a discussion or analysis of an incident, held after the incident ends. It allows for a thorough understanding of the incident and should provide insight that can be applied to future incident management by answering what went wrong and why.
The team affected by the incident gets together and does a number of things:
- Describes step by step what happened
- Identifies causes
- Identifies lessons learned
- Outlines the steps needed to move forward and to help ensure the same kind of incident doesn’t happen again
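The four activities above map naturally onto a simple record that a team fills in during the discussion. The following is a minimal sketch in Python, not a standard format: the `Postmortem` class, its field names, and the sample incident are all hypothetical and only illustrate how the sections could be tracked.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Hypothetical record with one section per activity in the list above."""
    title: str
    timeline: list[str] = field(default_factory=list)    # step by step what happened
    causes: list[str] = field(default_factory=list)      # identified causes
    lessons: list[str] = field(default_factory=list)     # lessons learned
    next_steps: list[str] = field(default_factory=list)  # steps to move forward

    def missing_sections(self) -> list[str]:
        """Return the names of sections that nobody has filled in yet."""
        sections = {
            "timeline": self.timeline,
            "causes": self.causes,
            "lessons": self.lessons,
            "next_steps": self.next_steps,
        }
        return [name for name, entries in sections.items() if not entries]


# Sample data is invented for illustration.
pm = Postmortem(title="Checkout API outage")
pm.timeline.append("14:02 new configuration rolled out")
pm.timeline.append("14:05 error rate on checkout rose sharply")
pm.causes.append("Configuration change removed a required environment variable")

print(pm.missing_sections())  # ['lessons', 'next_steps']
```

A check like `missing_sections()` is one way to notice that a postmortem stopped at the causes and never reached lessons or next steps.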
Why a blameless postmortem?
With the increased speed of cloud-native development, incidents are a fact of life, and it's easy to point fingers when an incident occurs. The blameless postmortem approach prioritizes discovering and fixing root causes. The blameless aspect of the postmortem is key because, as often as technology businesses claim that failure represents an opportunity to learn and innovate, the propensity to blame and shame still pervades. Pointing the finger at any one employee or team isn’t conducive to learning and does not encourage team members to come forward with issues or to communicate openly more generally.
Why should a team, or a company more broadly, run blameless postmortems? Aside from the fact that successful companies such as Atlassian and Netflix rely on them, they are an opportunity to:
- Learn from failure
- Learn to communicate more clearly within teams
- Create more effective troubleshooting and mitigation approaches
- Become more resilient as an engineering team and organization
What kinds of issues call for a postmortem?
Not every issue requires a postmortem. Postmortems make sense for larger and systemic issues, but not necessarily for ongoing minor issues or maintenance matters unless those kinds of issues end up leading to major incidents. Appropriate issues to address in blameless postmortem processes include:
- Major outages that affect end users
- Repeated incidents
- Failed deployments
- Security breaches
- Data loss
- Missed deadlines
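The rule described above (systemic issues qualify, minor issues qualify only when they lead to a major incident) can be written down as a small triage check. This is only a sketch under assumptions: the category labels, the `Issue` class, and the `needs_postmortem` function are invented here, and a real team would tune the criteria to its own context.

```python
from dataclasses import dataclass

# Categories mirror the list above; the string labels are made up for this sketch.
POSTMORTEM_KINDS = {
    "major_outage",
    "repeated_incident",
    "failed_deployment",
    "security_breach",
    "data_loss",
    "missed_deadline",
}


@dataclass
class Issue:
    kind: str
    led_to_major_incident: bool = False  # set when a minor issue escalated


def needs_postmortem(issue: Issue) -> bool:
    """Return True for listed categories or for minor issues that escalated."""
    return issue.kind in POSTMORTEM_KINDS or issue.led_to_major_incident


print(needs_postmortem(Issue("data_loss")))                                   # True
print(needs_postmortem(Issue("routine_maintenance")))                         # False
print(needs_postmortem(Issue("minor_bug", led_to_major_incident=True)))       # True
```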
What kinds of questions to ask in a postmortem?
In a blameless postmortem process, the answers focus on the objective facts of what happened and on discovering the root cause of an issue, not on opinions about where one team or another failed to do its job. Questions to ask include:
- What was the intended outcome of the event that triggered an incident?
- What actually happened during the event?
- Why was that the outcome?
- How can that outcome be avoided in the future?
These questions remain the same whether or not the aim is blamelessness. It’s the answers that change. Determining how to avoid an undesirable outcome in the future relies on looking forward and identifying actionable items and owners for those actions.
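Because avoiding a repeat depends on actionable items with owners, it can help to check that every action item actually has one before the postmortem is closed. The sketch below assumes a hypothetical `ActionItem` class and `unowned` helper; the sample items and the team name are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # person or team responsible for the follow-up


def unowned(items: list[ActionItem]) -> list[str]:
    """Return descriptions of action items that nobody owns yet."""
    return [item.description for item in items if not item.owner]


items = [
    ActionItem("Add a config validation step to the deploy pipeline", owner="platform-team"),
    ActionItem("Document the rollback procedure"),
]

print(unowned(items))  # ['Document the rollback procedure']
```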
What does a good postmortem look like?
While a good part of a postmortem is technical in nature, that is, identifying what went wrong, another part of a successful postmortem is cultural. Accepting the need to examine what went wrong is key to creating a more robust engineering culture. Aspects of a successful blameless postmortem include:
- Identify exactly what happened, with step-by-step explanations and discussion.
- Focus on what, not who: what happened, not who caused it. Failure is going to happen, so the important takeaway is what lessons can be learned.
- Find mitigation options. Once the root cause is located and defined, what can prevent it from happening in the future? What do team members need to look out for? Is it a technical problem or a more systemic organizational process failure?
- Build the plan of attack. Once analysis and discussion are complete, the findings can be used to create a game plan to keep the same problems from happening again.
- Develop process- and policy-focused plans, such as checklists of best practices, which may also include tools and solutions to ease the pain and tedium of completing postmortem accounting. At Ambassador Labs, incident response accounting is handled within the Rosie the Robot Slack bot, a ChatOps tool designed specifically for recording events as they happen as part of the incident management workflow.
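To illustrate the idea of recording events as they happen, here is a minimal event log that can later be replayed as a step-by-step timeline. It is not how Rosie the Robot works (its implementation is not described here); the `IncidentLog` class, its methods, and the sample events are assumptions made for this sketch. Note that it records what happened and when, not who caused it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class IncidentLog:
    """Hypothetical in-memory log of timestamped incident events."""
    events: list[tuple[datetime, str]] = field(default_factory=list)

    def record(self, message: str, at: Optional[datetime] = None) -> None:
        """Store an event, defaulting to the current UTC time."""
        self.events.append((at or datetime.now(timezone.utc), message))

    def timeline(self) -> list[str]:
        """Return events in chronological order, ready for the postmortem."""
        ordered = sorted(self.events, key=lambda event: event[0])
        return [f"{ts:%H:%M} {message}" for ts, message in ordered]


log = IncidentLog()
# Events may be noted out of order during an incident; the timeline sorts them.
log.record("Rolled back the deployment", at=datetime(2024, 1, 1, 14, 20, tzinfo=timezone.utc))
log.record("Error rate spiked on checkout", at=datetime(2024, 1, 1, 14, 5, tzinfo=timezone.utc))

print(log.timeline())
# ['14:05 Error rate spiked on checkout', '14:20 Rolled back the deployment']
```

Capturing timestamps during the incident, rather than reconstructing them afterward, keeps the later discussion anchored in facts.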
Hands-on: Learn from Incidents
- Read: Postmortem Culture: Learning from Failure
- Do: Conduct a blameless postmortem for the next failure that occurs in your system