Building a Kubernetes-Based Platform: Progressive Delivery, the Edge, and Observability
Practically every cloud vendor or private cloud solution supports the deployment and operation of the Kubernetes container orchestration framework. Since the initial release of Kubernetes by Google in 2014, a large community has formed around the framework, often facilitated by the organisation that is now the steward of the project, the Cloud Native Computing Foundation (CNCF).
Kubernetes has been widely adopted as a container manager, and has been running in production across a variety of organisations for several years. As such, it provides a solid foundation on which to support the other three capabilities of a cloud native platform: progressive delivery, edge management, and observability. These capabilities can be provided, respectively, with the following technologies: continuous delivery pipelines, an edge stack, and an observability stack.
Starting with Kubernetes, let's explore how each of these technologies integrates to provide the core capabilities of a cloud platform.
Following from the early success of Docker, containers have become the standard unit (“artifact”) of cloud deployment. Applications written in any language can be built, packaged, and “hermetically” sealed within a container image. These containers can then be deployed and run wherever the container image format is supported. This is a cloud native implementation of the popular software development concept of “write once, run anywhere”, except now the writing of code has been superseded by the building and packaging of this code.
The rise in popularity of containers can be explained by three factors: containers require fewer resources to run than VMs (with the tradeoff of a shared underlying operating system kernel); the Dockerfile manifest format provided a great abstraction for developers to define “just enough” build and deploy configuration; and Docker pioneered an easy method for developers to assemble applications in a self-service manner (docker build) and enabled the easy sharing and search of containerised applications via shared registries and the public Docker Hub.
Containers themselves, although a powerful abstraction, do not manage operational concerns, such as restarting and rescheduling when the underlying hardware fails. For this, a container orchestration framework such as Kubernetes is required.
Control Loops and Shared Abstractions
Kubernetes enables development teams to work in a self-service manner in relation to the operational aspects of running containers. For example, developers can define liveness and readiness probes for the application, and specify runtime resource requirements, such as CPU and memory. This configuration is then parsed by the control loop within the Kubernetes framework, which makes every effort to ensure that the actual state of the cluster matches the developer’s specifications. Operations teams can also define global access and deployment policies, using role-based access control (RBAC) and admission webhooks. This helps to limit access and guide development teams to best practices when deploying applications.
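The control loop idea described above can be illustrated with a minimal sketch. This is not how Kubernetes is implemented; the `PodSpec` class, the `reconcile` and `control_loop` functions, and the "start"/"stop" action strings are all hypothetical names chosen for illustration. It only shows the general pattern of comparing declared state with observed state and acting on the difference.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PodSpec:
    """Hypothetical, simplified stand-in for a developer's declared configuration."""
    name: str
    replicas: int
    cpu_millicores: int
    memory_mib: int


def reconcile(desired: PodSpec, running: int) -> List[str]:
    """Return the actions needed to move the observed replica count to the desired one."""
    if running < desired.replicas:
        return [f"start {desired.name}"] * (desired.replicas - running)
    if running > desired.replicas:
        return [f"stop {desired.name}"] * (running - desired.replicas)
    return []


def control_loop(desired: PodSpec, running: int, max_iterations: int = 10) -> int:
    """Repeatedly compare desired and observed state, applying one action per pass."""
    for _ in range(max_iterations):
        actions = reconcile(desired, running)
        if not actions:
            break  # observed state already matches the specification
        # Simulate applying a single action; a real system would call out to a runtime.
        running += 1 if actions[0].startswith("start") else -1
    return running
```

For example, `control_loop(PodSpec("web", 3, 250, 128), running=0)` converges on three running replicas, and calling it again with `running=5` scales back down to three.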
In addition to providing a container runtime and orchestration framework, Kubernetes allows both developers and the platform team to interact, share, and collaborate using a standardised workflow and toolset. It does this via several core abstractions: the container as the unit of deployment, the pod as the component of runtime configuration (combining containers, and defining deployment, restart, and retry policies), and the service as the high-level, business-focused component of an application.
Kubernetes-as-a-Service or Self-Hosted
Kubernetes itself is a somewhat complicated framework to deploy, operate, and maintain. Therefore, a core decision when adopting this framework is whether to use a hosted offering, such as Google GKE, Amazon EKS, or Azure AKS, or whether to self-manage the cluster using administrative tooling like kops and kubeadm.
A second important decision is what distribution (“distro”) of Kubernetes to use. The default open source Kubernetes upstream distro provides all of the core functionality. It will come as no surprise that cloud vendors often augment their distros to enable easier integration with their surrounding ecosystem. Other platform vendors, such as Red Hat, Rancher, or Pivotal offer distros that run effectively across many cloud platforms, and they also include various enhancements. Typically the additional functionality is concentrated on supporting the enterprise use cases, with a focus on security, homogenized workflows, and providing comprehensive user interfaces (UIs) and administrator dashboards.
The Kubernetes documentation provides additional information to assist with these choices.
Avoiding Platform Antipatterns
The core development abstractions provided by Kubernetes -- containers, pods, and services -- facilitate collaboration across development and operations teams, and help to prevent siloed ownership. These abstractions also reduce the likelihood that developers need to take things into their own hands and start building “micro-platforms” within the system itself.
Kubernetes can also be deployed locally, which in the early stages of adoption can help with addressing the challenge of slower or limited developer feedback. As an organisation’s use of Kubernetes grows, they can leverage a host of tooling to address the local-to-remote development challenge. Tools like Telepresence, Skaffold, Tilt, and Garden all provide mechanisms for developers to tighten the feedback loop from coding (possibly against remote dependencies), building, and verifying.
Continuous Delivery Pipelines
The primary motivation of continuous delivery is to deliver any and all application changes -- including experiments, new features, configuration, and bug fixes -- into production as rapidly and as safely as the organisation requires. This approach is predicated on the idea that being able to iterate fast provides a competitive advantage. Application deployments should be routine and drama-free events, initiated on demand and safely by product-focused development teams, and the organisation should be able to continuously innovate and make changes in a sustainable way.
Improving the Feedback Loops
Progressive delivery extends the approach of continuous delivery by aiming to improve the feedback loop for developers. Taking advantage of cloud native traffic control mechanisms and improved observability tooling allows developers to more easily run controlled experiments in production, see the results in near real time via dashboards, and take corrective action if required.
The successful implementation of both continuous delivery and progressive delivery depends on developers having the ability to define, modify, and maintain pipelines that codify all of the build, quality, and security assertions. The core decisions to make when adopting a cloud native approach to this are primarily based on two factors: how much existing continuous delivery infrastructure an organisation has; and the level of verification required for application artifacts.
Evolving an Organisation’s Approach to Continuous Delivery
Organisations with a large investment in existing continuous delivery tooling will typically be reluctant to move away from it. Jenkins can be found within many enterprise environments, and operations teams have often invested a lot of time and energy in understanding this tool. The extensibility of Jenkins has, for better or worse, enabled the creation of many plugins. There are plugins for executing code quality analysis, security scans, and automated test execution. There are also extensive integrations with quality analysis tooling like SonarQube, Veracode, and Fortify.
As a pre-cloud build tool, the original Jenkins project can often be adapted to meet new requirements when integrating with frameworks like Kubernetes. However, there is also a completely new project, Jenkins X, that is the spiritual counterpart for the cloud-native world. Jenkins X has been built using a new codebase and different architecture than the original Jenkins, with the goal of supporting Kubernetes natively. The core concepts of progressive delivery are also built into Jenkins X.
Organisations with limited existing investment in continuous delivery pipeline tooling often choose to use cloud native hosted options, such as Harness, CircleCI, or Travis. These tools focus on providing easy self-service configuration and execution for developers. However, some are not as extensible as tooling that is deployed and managed on-premises, and the provided functionality is often focused on building artifacts rather than deploying them. Operations teams also typically have less visibility into the pipeline. For this reason, many teams separate build and deployment automation, and use continuous delivery platforms such as Spinnaker to orchestrate these actions.
Avoiding Platform Antipatterns
Continuous delivery pipeline infrastructure is often the bridge between development and operations. This can be used to address the traditional issues of siloed ownership of code and the runtime. For example, platform teams can work with development teams to provide code buildpacks and templates, which can mitigate the impact of a “one-size fits all” approach, and also remove the temptation for developers to build their own solutions.
Continuous delivery pipelines are also critical for improving developer feedback. A fast pipeline that deploys applications ready for testing in a production-like environment will reduce the need for context switching. Deployment templates and foundational configuration can also be added to the pipeline in order to bake in common observability requirements for all applications, for example, log collection or metric emitters. This can greatly help developers gain an understanding of production systems, and assist with debugging issues, without needing to rely on the operations team to provide access.
The Edge Stack
The primary goals associated with operating an effective datacenter edge, which in a modern cloud platform configuration is often the Kubernetes cluster edge, are threefold:
- Enabling the controlled release of applications and new functionality;
- Supporting the configuration of cross-functional edge requirements, such as security (authentication, transport level security, and DDoS protection) and reliability (rate limiting, circuit breaking, and timeouts);
- Supporting developer onboarding and use of associated APIs.
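One of the reliability requirements listed above, rate limiting, is often explained using a token bucket. The sketch below is a generic illustration of that algorithm and does not represent the implementation of any particular edge proxy or gateway. The `TokenBucket` class and its parameters are hypothetical, and the caller passes in the current time explicitly so that the behaviour is easy to reason about.

```python
class TokenBucket:
    """Minimal token bucket: each request spends one token; tokens refill over time."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity  # start with a full bucket
        self.last = now  # timestamp of the most recent refill, in seconds

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may proceed."""
        elapsed = max(0.0, now - self.last)
        # Refill in proportion to elapsed time, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `TokenBucket(capacity=2, refill_per_second=1.0)`, two requests at time 0 are allowed and a third is rejected; one more request is allowed after a second has passed. The capacity controls how large a burst is tolerated, while the refill rate sets the sustained request rate.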
Separate Release from Deployment
The current best practice within cloud native software delivery is to separate deployment from release. The continuous delivery pipeline handles the build, verification, and deployment of an application. A “release” occurs when a feature change with an intended business impact is made available to end users. Using techniques such as dark launches and canary releases, changes can be deployed to production environments more frequently without the risk of large-scale negative user impact.
More-frequent, iterative deployments reduce the risk associated with change, while developers and business stakeholders retain control over when features are released to end users.
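A canary release, as mentioned above, exposes a new version to a small share of users before a full release. The following is a minimal sketch of one common way to do this, hashing a user identifier into a stable bucket. The function names `bucket_for` and `route` and the "canary"/"stable" labels are assumptions made for illustration; in practice this decision is usually made by the edge proxy or service mesh rather than in application code.

```python
import hashlib


def bucket_for(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets) using a hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def route(user_id: str, canary_percent: int) -> str:
    """Send a fixed share of users to the canary; everyone else stays on stable."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket_for(user_id) < canary_percent else "stable"
```

Because the bucket is derived from the user ID, a given user sees the same version on every request, and raising `canary_percent` from 10 to 50 only adds users to the canary group without moving anyone back to the stable version.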
Scaling Edge Operations with Self-Service
Within a cloud-native system that is being built with microservices, the challenges of scaling edge operations and supporting multiple architectures must be addressed effectively. Configuring the edge must be self-service both for developers iterating rapidly within the domain of a single service or API, and for the platform team working at a global system scale. The edge technology stack must offer comprehensive support for the range of protocols, architectural styles, and interaction models that are commonly seen within a polyglot language stack.
Avoiding Platform Antipatterns
Gone are the days where every API exposed within a system was SOAP- or REST-based. With a range of protocols and standards like WebSockets, gRPC, and CloudEvents, there can no longer be a “one size fits all” approach to the edge. It is now table-stakes for all parts of the edge stack to support multiple protocols natively.
The edge of a Kubernetes system is another key collaboration point for developers and the platform teams. Platform teams want to reduce fragmentation by centralizing core functionality such as authentication and authorization. Developers want to avoid having to raise tickets to configure and release services as part of their normal workflow, as this only adds friction and increases the cycle time for delivery of functionality to end users.
The Observability Stack
The concept of “observability” originated from mathematical control theory, and is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. Modern interpretations of software observability have developed almost in lockstep with the rise of cloud native systems; observability in this context is focused on the ability to infer what is occurring within a software system using approaches such as monitoring, logging, and tracing.
As popularised in the Google SRE book, given that a service level indicator (SLI) is an indicator of some aspect of "health" that a system’s consumers would care about (which is often specified via an SLO), there are two fundamental goals with observability:
- Gradually improving an SLI (potentially optimising this over days, weeks, months)
- Rapidly restoring an SLI (reacting immediately, in response to an incident)
Following on from this, there are two fundamental activities that an observability stack must provide: detection, which is the ability to measure SLIs precisely; and refinement, which is the ability to reduce the search space for plausible explanations of an issue.
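To make the detection activity more concrete, the sketch below computes a simple availability SLI and compares it against an SLO target. The function names and the "error budget" calculation are illustrative assumptions in the spirit of the SRE literature, not a prescribed formula from the text above.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Proportion of events that were good, e.g. successful requests / all requests."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent; a negative value means the SLO is breached."""
    allowed_failure = 1.0 - slo
    if allowed_failure <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure
```

For instance, 999 successful requests out of 1,000 gives an SLI of 0.999. Against an SLO of 0.99, roughly 90% of the error budget remains, which supports gradual improvement work; an SLI of 0.98 against the same SLO yields a negative value, signalling that the SLI needs to be restored rapidly.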
Understandability, Auditability, and Debuggability
Closely related to the goals of improving or restoring SLIs are the additional motivations for supporting observability within a software system: understandability, auditability, and debuggability. As software systems have become pervasive and mission critical throughout society, the need to understand and audit them has increased dramatically. People are slow to trust something that they cannot understand. And if a system is believed to have acted incorrectly, or someone claims it has, then the ability to look back through the audit trail and prove or disprove this is invaluable.
The adoption of cloud native technologies and architectures has unfortunately made implementing observability more challenging. Understanding a distributed system is inherently more difficult when operating at scale. And existing tooling does not support the effective debugging of a highly modular system communicating over unreliable networks. A new approach to creating a cloud native observability stack is required.
Three Pillars of Observability: One Solution
Thought leaders in this modern observability space, such as Cindy Sridharan, Charity Majors, and Ben Sigelman, have written several great articles that present the “three pillars” of cloud native observability as monitoring, logging, and distributed tracing. However, they have also cautioned that these pillars should not be treated in isolation. Instead a holistic solution should be sought.
Monitoring in the cloud native space is typically implemented via the CNCF-hosted Prometheus application or a similar commercial offering. Metrics are often emitted via containerised applications using the statsd protocol, or a language native Prometheus library. The use of metrics provides good insight into the state of an application and the platform at a point in time, and can also be used to trigger alerts.
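As a rough illustration of what emitting a metric over the statsd protocol involves, the sketch below formats metrics as plain-text statsd lines of the form `name:value|type`. The `statsd_line` helper and the metric names are hypothetical; a real application would normally use an existing client library, which would also handle sending the line to a statsd daemon (typically over UDP).

```python
from typing import Union

# Common statsd metric types: counter, gauge, and timer (milliseconds).
VALID_TYPES = {"c", "g", "ms"}


def statsd_line(name: str, value: Union[int, float], metric_type: str,
                sample_rate: float = 1.0) -> str:
    """Format a single metric as a statsd text line, e.g. 'checkout.requests:1|c'."""
    if metric_type not in VALID_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type}")
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        # The sample rate tells the server that only a fraction of events were sent.
        line += f"|@{sample_rate}"
    return line
```

Calling `statsd_line("checkout.requests", 1, "c")` yields `checkout.requests:1|c`, and `statsd_line("checkout.latency", 320, "ms")` yields `checkout.latency:320|ms`.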
Logging is commonly emitted as events from containerised applications via a common interface, such as STDOUT, or via a logging SDK included within the application. Popular tooling includes the Elasticsearch, Logstash, and Kibana (ELK) stack. Fluentd, a CNCF-hosted project, is often used in place of Logstash. Logging is valuable when retroactively attempting to understand what happened within an application, and can also be used for auditing purposes.
Distributed tracing is commonly implemented using the OpenZipkin or the CNCF-hosted Jaeger tooling, or a commercial equivalent. Tracing is effectively a form of event-based logging that contains some form of correlation identifier that can be used to stitch together events from multiple services that are related to a single end user’s request. This provides end-to-end insight for requests, and can be used to identify problematic services within the system (for example, latent services) or understand how a request traveled through the system in order to satisfy the associated user requirement.
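The role of the correlation identifier can be shown with a small sketch that groups events from several services by a shared trace ID. The `SpanEvent` structure and its field names are hypothetical and far simpler than the data models used by Zipkin or Jaeger; the aim is only to show how events are stitched together and how a slow service might be spotted.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class SpanEvent:
    """Hypothetical trace event emitted by one service while handling a request."""
    trace_id: str      # correlation identifier shared by all events of one request
    service: str
    start_ms: int      # when the service began work, relative to the request start
    self_time_ms: int  # time spent in this service, excluding downstream calls


def group_by_trace(events: Iterable[SpanEvent]) -> Dict[str, List[SpanEvent]]:
    """Stitch events together by trace ID, ordered by start time within each trace."""
    traces: Dict[str, List[SpanEvent]] = defaultdict(list)
    for event in events:
        traces[event.trace_id].append(event)
    return {tid: sorted(spans, key=lambda s: s.start_ms) for tid, spans in traces.items()}


def slowest_service(spans: List[SpanEvent]) -> str:
    """Return the service that contributed the most time to a single trace."""
    return max(spans, key=lambda s: s.self_time_ms).service
```

Given events from an edge, cart, and payments service that share the trace ID "t1", `group_by_trace` returns them in the order the request traveled through the system, and `slowest_service` points at the service that dominated the request's latency.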
Many of the principles from the data mesh paradigm apply to the topic of observability. Platform engineers must provide a series of observability data access tools and APIs for developers to consume in a self-service manner.
Avoiding Platform Antipatterns
With cloud native observability there is no “one-size fits all” approach. Although the emitting and collection of observability data should be standardised to avoid platform fragmentation, the ability to self-serve when defining and analysing application- and service-specific metrics is vital for developers to be able to track health or fix something when the inevitable failures occur. A common antipattern seen within organisations is the requirement to raise a ticket in order to track specific metrics or incorporate these into a dashboard. This completely goes against the principle of enabling a fast development loop, which is especially important during the launch of new functionality or when a production incident is occurring.
Summary and Conclusion
Kubernetes has been widely adopted, and has been running in production across a variety of organisations for several years. As such, it provides a solid foundation on which to support the other three capabilities of a cloud native platform that enables full cycle development. These capabilities can be provided, respectively, with the following technologies: continuous delivery pipelines, an edge stack, and an observability stack.
Investing in these technologies and the associated best practice workflows will speed an organisation’s journey to seeing the benefits from embracing the cloud native and full cycle development principles.