Full Service Ownership with Kubernetes: Traffic Management, Day 2 Ops, and Observability

The cloud-native story doesn't end when an app is deployed. The emerging trend of “full service ownership” or “full cycle development” means that developers follow through once apps are in production, ensuring that the application is both available to end users and running optimally. And when something goes wrong, which it inevitably will, it means handling the failure, limiting the impact on users, and examining, through blameless postmortems, what went wrong and how to prevent it from happening again.

Traffic Management: Reliability, Observability, and Security

When the time comes to run a cloud-native application, the first order of business is actually getting external “user” traffic into your Kubernetes cluster and to your backend services, which requires a way to manage incoming traffic.

A Kubernetes-native ingress controller, such as Emissary Ingress, serves this purpose, routing and securing traffic into your cluster. However, an ingress controller only deals with the “first hop” of external traffic entering a cluster. In a microservices-based system there are often multiple hops between dependent services.

A service mesh, such as Linkerd, provides additional traffic management functionality for service-to-service communication within a Kubernetes cluster.

Both ingress controllers and service meshes provide “layer 7” (L7, from the OSI model) traffic management capabilities, such as load balancing, rate limiting, and circuit breaking. These are key to safeguarding availability and scalability. They also offer traffic observability, from top-line rate, error, and duration (RED) metrics all the way through to access logs and distributed tracing that visualizes the flow of a user request through the microservice graph.
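To make one of these resilience patterns concrete, here is a minimal circuit-breaker sketch in Python. It is illustrative only (the class name and thresholds are invented for this example); in practice an ingress controller or service mesh implements this logic inside its proxies:

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: open the circuit after too many
    consecutive failures, then allow a trial call after a cooldown."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Fail fast instead of hammering a struggling service.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

The same idea applies at L7: after repeated failures, the proxy fails fast rather than piling more load onto an unhealthy backend, then probes again after a cooldown.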

🚀 Hands-on: K8s Traffic Management: Emissary-ingress + Linkerd

A full walkthrough of the instructions can be seen in the video below:

Optional readings

🏆 Challenges! K8s Traffic Management: Emissary-ingress + Linkerd

Answer the following questions to confirm your learning. At the end of the module you can complete a series of “checkpoint” questions and enter a competition to win prizes!

  • What is Linkerd's ingress mode, and why don't Emissary users need it?
  • Using the Linkerd “viz” dashboard, can you find the top line traffic/request metrics for the qotm service?

Check your answers

Day 2 Operations in Kubernetes

Shifting "run" and incident management responsibilities to developers puts the focus on so-called day 2 operations in the software lifecycle. Traditionally, day 2 operations focused on the maintenance, monitoring, and optimization of an application post-release. In cloud-native development, however, day 2 comes earlier in the cycle, and developers are increasingly responsible for it. A microservices architecture demands a faster and more complete feedback loop that includes operational aspects, rather than a linear set of waterfall-style stages.

Not only is a developer better placed to understand the big picture of the system at runtime, but bringing day 2 operations into the development life cycle earlier also makes it possible to identify and fix problems, mistakes, and errors before rolling out to production.

Automation is a key theme throughout the implementation of day 2 operations, and reliability, security, and observability play key roles in developer-managed (Dev) operations (Ops).

You can learn more about all of these concepts in this recording:

Incident Management

Code-based bugs inevitably sneak through even the most rigorous testing process, and sometimes issues only manifest themselves in certain use cases, infrastructure dynamics, or cascading service failures. Often when the initial incident fire is extinguished there is much to learn to prevent a recurrence. Therefore, all cloud-native developers need to learn about effective incident management and (blameless) postmortems and analysis.

With the adoption of cloud-native architecture and Kubernetes-based infrastructure, incident response is increasingly being pushed toward developers. This is partly to drive increased responsibility, i.e., developers should have “skin in the game” to help them empathize with the pain caused by incidents, but also because cloud-native applications are increasingly complex: they often operate as complex adaptive systems. The symptoms of an incident may surface far from the actual fault; the link between cause and effect isn’t obvious, and the search space has grown. For example, increased CPU usage in one service can cause back pressure that breaks a corresponding upstream service.

Focus Areas For and Beyond Effective Incident Management

Cloud-native incidents require effective management during and after the incident. Much of the response should be automated, but the learning process is hands-on and human.

For the things that can’t be automated, clear runbooks need to be created to guide triage and diagnostics, and applying these runbooks needs to be practiced regularly via game days. Incidents should also be followed up with blameless postmortems to analyze root causes and ensure that similar incidents are avoided in the future.

Game Days

Game days provide a no-pressure, no-fault opportunity to simulate and recreate incidents, building resilience into incident response. They can help improve processes, test the resiliency of systems, validate observability mechanisms, and reduce stress during actual incidents.

Learn more about how to organize and run game days:

Blameless postmortems

Remember how painful the last incident was? When everything is going well and running smoothly, it's easy to forget the pain and avoid digging into the root causes of failure. After all, the fast-moving cloud-native development environment is designed for speed of development and shipping of new features and functionality. It's easy to overlook the fact that a highly distributed system may in fact be more prone to failures than traditional software.

Using blameless postmortems is a way to avoid repeating the trauma and build resilience and efficiency into processes.

What is a postmortem?

A postmortem is a discussion or analysis of an incident or event, held after the incident ends. It allows for a thorough understanding of an incident and should provide insight that can be applied to future incident management, answering what went wrong and why.

The team affected by the incident gets together and does a number of things:

  • Describes step by step what happened
  • Identifies causes
  • Identifies lessons learned
  • Outlines steps or things to rectify to move forward and try to ensure the same kind of incident doesn’t happen again

Why a blameless postmortem?

With the increased velocity of cloud-native development, incidents are a fact of life, and it's easy to point fingers when one occurs. The blameless postmortem approach instead prioritizes discovering and fixing root causes. The blameless aspect is key because, as often as technology businesses claim that failure represents an opportunity to learn and innovate, the propensity to blame and shame still pervades. Pointing the finger at any one employee or team isn’t productive for learning and doesn't encourage team members to come forward with issues or communicate openly more generally.

Why should a team, or a company more broadly, run blameless postmortems? Aside from the fact that successful companies, such as Atlassian and Netflix, rely on them, they constitute an opportunity to:

  • Learn from failure
  • Learn to communicate more clearly within teams
  • Create more effective troubleshooting and mitigation approaches
  • Become more resilient as an engineering team and organization

What kinds of issues call for a postmortem?

Not every issue requires a postmortem. Postmortems make sense for larger and systemic issues, but not necessarily for ongoing minor issues or maintenance matters unless those kinds of issues end up leading to major incidents. Appropriate issues to address in blameless postmortem processes include:

  • Major outages that affect end users
  • Repeated incidents
  • Failed deployments
  • Security breaches
  • Data loss
  • Missed deadlines

What kinds of questions should you ask in a postmortem?

In a blameless postmortem process, the answers focus on the objective facts of what happened and on discovering the root cause of an issue, not on opinionated views about where one team or another failed to do its job.

  • What was the intended outcome of the event that triggered an incident?
  • What actually happened during the event?
  • Why was that the outcome?
  • How can that outcome be avoided in the future?

These questions remain the same whether or not the aim is blamelessness. It’s the answers that change. Determining how to avoid an undesirable outcome in the future relies on looking forward and identifying actionable items and owners for those actions.

What does a good postmortem look like? Aspects of successful blameless postmortems

While a good part of a postmortem is technical in nature, that is, identifying what went wrong, another part of a successful postmortem is cultural. Accepting the need to examine what went wrong is key to creating a more robust engineering culture. Aspects of a successful blameless postmortem include:

  • Identify exactly what happened, with step-by-step explanations and discussion.
  • Focus on what, not who: what happened, not who caused it. Failure is going to happen, so the important takeaway is what lessons can be learned.
  • Find mitigations. When the root cause is located and defined, what can prevent it from happening in the future? What do team members need to look out for? Is it a technical problem or a more systemic organizational process failure?
  • Build the plan of attack. Once analysis and discussion are complete, they can be used to create a game plan to keep the same problems from happening again.
  • Develop process- and policy-focused plans, such as checklists of best practices, which may also include tools and solutions to ease the pain and tedium of postmortem accounting. At Ambassador Labs, incident response accounting is handled with the Rosie the Robot Slack bot (https://blog.getambassador.io/rosie-the-robot-chatops-for-incident-response-a4d94d1c6395), a ChatOps tool designed specifically for recording events as they happen within the incident management workflow.

🚀 Hands-on: Day 2 Operations

🏆 Challenges! Day 2 operations

Answer the following questions to confirm your learning. At the end of the module you can complete a series of “checkpoint” questions and enter a competition to win prizes!

  • Name three topics that are important in day 2 operations
  • What are game days?
  • Name three issues or situations that would benefit from a development team running a blameless postmortem

Check your answers

Observability in K8s: Metrics, Logs, and Traces

It's impossible to understand what is going wrong in an incident, or to get to its root causes, without clear observability and visibility. In traditional development models, monitoring focused on infrastructure and relied on logs. In the cloud-native space, an entire ecosystem must be monitored and understood. It becomes essential to understand a combination of metrics, traces, and logging stacks, focusing not just on the infrastructure but also on user experience, performance, and applications.

What are observability and visibility in this context?

Observability and visibility are the two primary ways for identifying and understanding what is going wrong during an incident and digging into it after the incident. When incidents occur in a cloud-native environment, the complexity of the infrastructure makes it difficult to get clear visibility into what happened and how the incident might be fixed. Observability and visibility go hand-in-hand, providing ways not only to inspect, understand, and fix incidents as they are happening but also to inform ongoing incident-prevention work, dive into systemic or root causes, and, most of all, build greater resilience.

What is visibility?

Visibility mirrors what would traditionally be thought of as monitoring. Its primary function is to indicate that something is wrong and provide the basic metrics needed for troubleshooting. Monitoring has traditionally been the domain of ops engineers, but this has shifted to become a developer concern as well. With the complexity of containerized applications, developers are best positioned to understand what might be going wrong, and in parallel, visibility has expanded to introduce new tools and techniques for investigating and diagnosing issues, in many cases, specifically for developers.

For example, a service catalog provides a centralized "source of truth", listing services, their ownership and dependencies, resources, and other metadata, essentially delivering on the "single pane of glass" concept where a developer can gain instant visibility into the full picture. Another example is the need for distributed tracing. A distributed system spans multiple services, and to locate an issue, a single logical trace that can span these services is necessary.

What is observability?

Observability is the constant monitoring of system and business KPIs with the goal of understanding why something is happening. It goes beyond the here-and-now of visibility (which itself is key to observability) and extends to the analysis and understanding of broader problems, the underlying system, and root causes. There is considerable overlap between visibility and observability; observability simply encompasses more, including insight and potential actionability.

Observability and debugging for developers: Using traces to locate issues

Distributed tracing can be a very useful tool to enable a developer to locate issues within a complicated graph of microservices. For “deep systems”, where a single user’s request is often handled by multiple layers of services before returning a result, it is essential to be able to observe the path the request took through the system.
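The mechanics behind such a trace can be sketched in a few lines of Python: every hop reuses the same trace ID while recording its own span, so the request's full path through the system can be reassembled later. This is a toy model with invented header and service names; production systems use standards such as W3C Trace Context via OpenTelemetry:

```python
import uuid


def extract_or_start_trace(headers):
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    return headers.get("x-trace-id") or uuid.uuid4().hex


def handle_request(service_name, headers, spans):
    """Record a span for this hop, then return headers for the next hop."""
    trace_id = extract_or_start_trace(headers)
    span_id = uuid.uuid4().hex[:8]
    spans.append({"trace": trace_id, "span": span_id, "service": service_name})
    # Propagate the trace ID (and this span as the parent) downstream.
    return {"x-trace-id": trace_id, "x-parent-span-id": span_id}


# Simulate one user request flowing through three services in a "deep system".
spans = []
headers = handle_request("edge-gateway", {}, spans)
headers = handle_request("orders", headers, spans)
headers = handle_request("payments", headers, spans)

# All three spans share one trace ID, so the request path is reconstructable.
assert len({s["trace"] for s in spans}) == 1
```

Because every span carries the same trace ID, a tracing backend can stitch the spans back together and show exactly which layer of the system a slow or failing request passed through.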

Ara Pulido from DataDog provides more context in this video.

The why of observability for developers

The why of observability can inform both the post-incident rundown and the postmortem but, of potentially much greater and more lasting value, it can drive the way applications are created from the outset. That is, cloud-native developers can practice "observability-driven development" (ODD): "defining instrumentation to determine what is happening in relation to a requirement before any code is written". Just as you wouldn’t accept a pull request without tests, you should never accept a pull request unless you can answer the question, “how will I know when this isn’t working?”

Observability can become a part of the development process itself because it's possible to flip the script on development and use production to drive better code. How can this benefit developers? The insight gathered from shipping and running applications strengthens future development by:

  • Enabling more data-driven development and product decisions
  • Helping avoid future incidents and issues by identifying root causes or systemic problems
  • Gathering more granular performance analysis
  • Contributing to evidence-based decision-making throughout the development process.
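As a rough sketch of what "instrumentation before implementation" can look like, the following Python decorator (all names here are invented for the example) defines the request and error counters that answer "how will I know when this isn't working?" before the handler logic exists; a real service would use a metrics client library rather than a plain Counter:

```python
from collections import Counter
from functools import wraps

metrics = Counter()  # stand-in for a real metrics registry


def observed(name):
    """Decorator recording request and error counts for a handler."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            metrics[f"{name}_requests_total"] += 1
            try:
                return fn(*args, **kwargs)
            except Exception:
                metrics[f"{name}_errors_total"] += 1
                raise
        return wrapper
    return decorator


# The instrumentation above was defined first; the handler comes second.
@observed("checkout")
def checkout(order):
    if not order.get("items"):
        raise ValueError("empty order")
    return "ok"
```

With the counters in place from day one, the question "is this feature working in production?" has an answer before the first deploy.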

🚀 Hands-On! Observability with Prometheus and Grafana

Prerequisites for hands-on tutorial with Prometheus and Grafana

  1. A DigitalOcean Account, which you can create with $100 in credits automatically applied here.
  2. A credit card (no purchase necessary, the card is needed to create an account) and your DigitalOcean credit code
  3. A working version of these three command line tools:
    1. kubectl (version 1.21 or higher)
    2. helm (v3.6.0 or higher)
    3. doctl (1.64.0 or higher)

Follow the step-by-step instructions located here to configure Prometheus and Grafana on DigitalOcean Managed Kubernetes
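Once Prometheus is running, the RED metrics discussed earlier map directly onto PromQL queries. The sketch below builds such queries in Python; it assumes the conventional `http_requests_total` counter and `http_request_duration_seconds` histogram names, which vary by instrumentation, so treat it as a template rather than copy-paste queries:

```python
def red_queries(job, window="5m"):
    """Build RED (rate, errors, duration) PromQL query strings for one job.

    Assumes the conventional http_requests_total counter and
    http_request_duration_seconds histogram metric names; adjust the
    metric and label names to match your own instrumentation.
    """
    return {
        # Rate: total requests per second over the window.
        "rate": f'sum(rate(http_requests_total{{job="{job}"}}[{window}]))',
        # Errors: requests per second that returned a 5xx status.
        "errors": (
            f'sum(rate(http_requests_total{{job="{job}",code=~"5.."}}[{window}]))'
        ),
        # Duration: 95th-percentile latency from the histogram buckets.
        "duration_p95": (
            "histogram_quantile(0.95, sum(rate("
            f'http_request_duration_seconds_bucket{{job="{job}"}}[{window}])) by (le))'
        ),
    }
```

Pasting the resulting strings into the Prometheus query box (or a Grafana panel) gives the top-line rate, error, and duration view for a single service.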

🏆 Challenge! Observability with Prometheus and Grafana

Answer the following questions to confirm your learning. At the end of the module you can complete a series of “checkpoint” questions and enter a competition to win prizes!

  • Explain at least three benefits of being able to observe components in your Kubernetes Cluster
  • In your own words, describe how Prometheus and Grafana work together in a Kubernetes Cluster

Check your answers

Full Service Ownership

Full service ownership is becoming the best practice for cloud-native organizations. In this model, development teams own the entire life cycle of a service, and developers, as part of this ownership, become central to the full code, ship, run equation.

Why full service ownership for developers?

"Full-service ownership means that people take responsibility for supporting the software they deliver, at every stage of the software/service lifecycle. That level of ownership brings development teams much closer to their customers, the business, and the value being delivered. In turn, that unlocks key competitive advantages that make all the difference in today's digital world." -PagerDuty

Why is full service ownership the ideal? In brief, cloud-native development has promised rapid development loops and the ability to ship software faster -- safely.

Beyond coding for developers

Full service ownership can increase development agility and scale, but the mindset behind how and why software is delivered must change with this transfer of responsibility. Developers who still believe their team's job is just to write code fail to see that the real job, or mission, is to deliver and run software that represents business value.

Despite the difficulties of introducing developer-centric full service ownership, some organizations have real-world experience in introducing and making this model successful by:

  • Investing in developer education
    Developer education is an ongoing investment for new and experienced developers, regardless of whether or not they are new to Kubernetes and cloud-native development.

    Education will consist of developers becoming immersed in the best practices of the specific organization in order to be able to, as the goal above states, deliver and run software that is of business value.

    Education of this type is both targeted and specific training as well as less formal but more hands-on, such as in the form of game days and failure simulations. There is no better way to learn than to do.
  • Focusing on creating a developer experience
    While creating the developer experience (DevEx) usually falls to the platform engineering team, it is still something the developer should keep in mind in gauging the maturity of their organization and what they can expect in terms of support from their DevEx or platform team. The DevEx team is "productizing" the developer experience, making it replicable and easy to adopt for developers joining a team.
  • Adopting a developer control plane (DCP)
    The entire cloud development loop is complex, and having a single control plane to enable developers to control and configure everything from development to release to production simplifies the process. Without this "one-stop shop" of sorts, developers are overrun by an unmanageable number of development, release, and production tools, each requiring a level of mastery that is impossible to attain. Developers need their own control plane that integrates the various tools used across the development lifecycle, enabling them to become full service owners. An example of an all-in-one dashboard to centralize development tools and manage Kubernetes services is the Ambassador DCP, which is built on popular CNCF tools and integrates into existing GitOps workflows.

Human Service Discovery

Troubleshooting always begins with information gathering. While much attention has been paid to centralizing machine data (e.g., logs, metrics), much less attention has been given to the human aspect of service discovery. Who owns a particular service? What Slack channel does the team work on? Where is the source for the service? What issues are currently known and being tracked?

Kubernetes annotations are designed to solve exactly this problem.

Though often overlooked, Kubernetes annotations are designed to add metadata to Kubernetes objects. The Kubernetes documentation says annotations can “attach arbitrary non-identifying metadata to objects.” This means that annotations should be used for attaching metadata that is external to Kubernetes (i.e., metadata that Kubernetes won’t use to identify objects). As such, annotations can contain any type of data. This contrasts with labels, which are designed for uses internal to Kubernetes; label structure and values are constrained so that Kubernetes can use them efficiently.
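For example, a Service might carry human-focused annotations like the following sketch (the `a8r.io/*` keys follow the convention described in the "Annotating Kubernetes Services for Humans" article; the values here are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: qotm
  annotations:
    # Human service discovery: who owns this, where to find them,
    # and where the code and runbook live (example values).
    a8r.io/owner: "jane.doe"
    a8r.io/chat: "#qotm-team"
    a8r.io/repository: "https://github.com/example/qotm"
    a8r.io/description: "Quote of the Moment service"
    a8r.io/runbook: "https://wiki.example.com/qotm-runbook"
spec:
  selector:
    app: qotm
  ports:
    - port: 80
      targetPort: 5000
```

After applying a manifest like this, `kubectl describe svc qotm` surfaces the annotations for anyone triaging an incident.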

As microservices and annotations proliferate, running kubectl describe can get tedious. Moreover, using kubectl describe requires every developer to have some direct access to the Kubernetes cluster. Over the past few years, service catalogs have gained greater visibility in the Kubernetes ecosystem. Popularized by tools such as Shopify's ServicesDB and Spotify's System Z, service catalogs are internally facing developer portals that present critical information about microservices.

Note that these service catalogs should not be confused with the Kubernetes Service Catalog project. Built on the Open Service Broker API, the Kubernetes Service Catalog enables Kubernetes operators to plug in different services (e.g., databases) to their cluster.

Annotate your services now and thank yourself later

Much like implementing observability within microservice systems, you often don’t realize that you need human service discovery until it’s too late. Don't wait until something is on fire in production to wish you had implemented better metrics and documented how to get in touch with the part of your organization that looks after the service.

There are enormous benefits to building an effective “version 0” service: a dancing skeleton application with a thin slice of complete functionality that can be deployed to production with a minimal yet effective continuous delivery pipeline.

Adding service annotations should be an essential part of your “version 0” for all of your services. Add them now, and you’ll thank yourself later.

🚀 Hands-on: Adding service metadata with K8s annotations

Read the following introductions to full service ownership and full lifecycle developers.

  1. Introduction
  2. Service Ownership Functions
  3. Full Cycle Developers at Netflix - Operate What You Build

Adding Service Metadata in K8s

  1. Read the following: Annotating Kubernetes Services for Humans
  2. Watch this video: Using a Service Catalog
  3. Follow the tutorial here: Quick Start

A full walkthrough of the instructions can be seen in the video below:

🏆 Challenge! Adding service metadata with K8s annotations

Answer the following questions to confirm your learning. At the end of the module you can complete a series of “checkpoint” questions and enter a competition to win prizes!

  • Full service ownership
    • Who should take ownership for software when it's running in production?
    • Name three types of metadata that should be associated (and easily accessible) with a service running in production.
    • In order to embrace full cycle development, the Netflix team say that "knowledge is necessary but not sufficient." What other things do they believe are essential?
  • Using K8s annotations
    • Verify that you can see via your terminal the results of "kubectl describe svc ...." for a service you have annotated with your name visible as the owner.

Check your answers

Checkpoint! Test your learning and win prizes

When you submit a module checkpoint, you're automatically eligible for a $10 UberEats voucher and entered to win our weekly and monthly prizes!

UberEats vouchers are only issued to the first 50 valid checkpoint submissions every week (Monday through Sunday). Limit three total entries per developer, one entry per week per module. All fraudulent submissions will be disqualified from any prize eligibility. All prizes are solely distributed at the discretion of the Ambassador Labs community team.