A Leadership View on Cloud-Native Development: Focus on Uptime, Collaboration, and Developer Self-Service
Bjorn Freeman-Benson, Ambassador Labs' SVP of Engineering, recently shared insights about empowering developers, both through the full lifecycle ownership model that fundamentally changes the developer experience and through the tools and processes that make this change possible, such as developer control planes.
Full service ownership for developers... before it was a thing
Bjorn's account of leading development teams at organizations such as InVision and New Relic mirrors the shifting shape of modern development teams. His previous teams were, in fact, ahead of the curve. In Bjorn's engineering teams, developers have always been responsible for operating their own services because, as he states: "developers know (their work) the best. So when it starts malfunctioning, they go, 'Oh yeah, I know that pattern, I've seen it before,' and they don't have to go in (to investigate) and (can solve the problem) much more quickly than somebody who doesn't know the service."
This experience, in part, shapes how Bjorn builds and leads cloud-native development teams and also how he sees the responsibilities of developers.
Building the foundations: Self-service facilitated by platform and ops teams
"The goal was to build a platform that then allowed the development teams to just do the rollout from their point of view without having to worry about all the details of the insides of it." -Bjorn Freeman-Benson, SVP Engineering, Ambassador Labs
Development and operations teams can gain the most efficiency by working together. Complete separation of duty means neither team has insight into what the other is doing. As Bjorn points out, developers are more likely to have easier insight into emerging failure patterns and problems within their own services, while a separate operations team would have to spend significant time digging around to identify what was going wrong.
Bjorn believes that ops teams' time is better spent building robust foundations that automate everything development teams need to use but don't need to know inside and out. This underlying infrastructure could, for example, include all the details of Kubernetes, the different running environments, or the different YAML files that define each configuration. The platform team focuses on automating that platform so that developers can do rollouts from their own point of view, without thinking about the underlying details that make them happen. Essentially, ops enables developer self-service, which in turn gives devs what they need to complete the full code-ship-run process.
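To make the self-service idea concrete, here is a minimal sketch of what a platform-owned abstraction might look like. It is an illustration under stated assumptions, not a description of Ambassador Labs' tooling: the `RolloutRequest` fields, the `render_deployment` helper, the registry URL, and the choice to map environments to namespaces are all hypothetical. The developer supplies a handful of values, and the platform code expands them into a Kubernetes Deployment manifest.

```python
import json
from dataclasses import dataclass


@dataclass
class RolloutRequest:
    """The few values a developer supplies (hypothetical fields)."""
    service: str
    image: str
    replicas: int = 2
    environment: str = "staging"


def render_deployment(req: RolloutRequest) -> dict:
    """Expand a RolloutRequest into a Kubernetes Deployment manifest (as a dict).

    Assumption for this sketch: each environment maps to a namespace of the
    same name. A real platform would likely add probes, resource limits,
    and so on.
    """
    labels = {"app": req.service, "env": req.environment}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {
            "name": req.service,
            "namespace": req.environment,
            "labels": labels,
        },
        "spec": {
            "replicas": req.replicas,
            # The selector must match labels on the pod template below.
            "selector": {"matchLabels": {"app": req.service}},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {"name": req.service, "image": req.image}
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Developer's point of view: name the service and the image, nothing else.
    request = RolloutRequest(
        service="checkout",
        image="registry.example.com/checkout:1.4.2",  # hypothetical image
    )
    print(json.dumps(render_deployment(request), indent=2))
```

The design point is the split in ownership: the platform team maintains `render_deployment` and everything behind it, while developers only ever touch the small request object.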
Taming tool sprawl and paving a clear path for developers
By extension, the platform team also assumes responsibility for tooling. As Bjorn explains, building one's own tools is a fairly frequent requirement. This was echoed in a recent conversation with Lunar's Kasper Nissen, who stated, "By centralizing tooling and providing an opinionated platform, or 'paved path,' for developers, companies like Lunar can accelerate developer ramp-up and productivity. If a platform team can offer a preferred take on how to do specific tasks in the workflow and recommend a set of tools to use, developers won't spend too much time on trial and error."
Like Kasper, Bjorn explains that platform teams are responsible for building and automating tooling for the team, and ultimately for paving the path and keeping it clear of obstacles. Why? "If you're building a SaaS piece of software, it's about operations as well as writing the software. And so that's the extension that we've made as developers--to go from just developing to developing and operating." There is an inherent struggle in this shift left, which is where the developer control plane concept was born.
The origins of the developer control plane: Embracing code, ship, and run
"As a developer," Bjorn states, "I'm responsible for that whole thing." The industry, and the developers in it, have figured out how to write code, and there's tooling around that. This has been followed by a lot of good tooling around deployments and packaging, removing the "dependency hell we used to live in" to ship software. In addition, continuous integration (CI) enables the testing of pipelines. The whole code-ship equation is covered, but run, and how to operate, is missing.
"We don't have a lot of tooling around how a developer would run operations. We've got some observability, and that is much more helpful than when we didn't have those things. But we're still not looking at the whole cloud-native journey and what is the tooling I need across that whole journey. And so the reason I keep emphasizing the run part is because that's the part that's least mature, in my opinion, around the tooling."
This is where the concept of a developer control plane (DCP) reveals its value. It answers many of the questions developers encounter as they take on the full cloud-native experience: the cloud-native SaaS scheme of operating across multiple environments is fundamentally different from just developing on one's own laptop. But it's also possible to develop both on one's laptop and in a development cluster in the cloud.
"A developer control plane says, 'Oh, you know that wonderful experience that you have when you're developing just on your laptop with your IDE? We want that same experience across the whole cloud-native service deployment operation experience'." -Bjorn Freeman-Benson, SVP Engineering, Ambassador Labs
Many pieces of the puzzle exist and could be put together independently. But the idea behind a DCP is that platform teams put the components together, relieving engineers of that work so they can concentrate on the parts where they really must "deliver the goods".
In essence, a DCP gives developers what they need to control and configure the entire cloud-development loop and ship software faster, without the distraction of trying to find and figure out a million different tools. The developer's focus and creativity are better spent on creating, shipping, and running software that delivers value, not on discovering and discarding tools.
But why the "shift-left" to developer lifecycle ownership?
While the full code-ship-run ownership model that characterizes the cloud-native developer experience makes sense to leaders like Bjorn and to companies at the vanguard of cloud-native production, some developers may not understand why they are suddenly "saddled" with extra responsibility. Why should full ownership fall to them? As alluded to above, it comes down to delivering value and finding the best way to maximize it.
Shifting left is not new responsibility: It's changed responsibility
In many cases, this shift-left is less about forcing developers to take on completely new responsibility than about having the organization and its component parts take on the responsibilities that best suit their skills (and support efficiencies) within the new cloud-native development environment. That is, taking the longer-term view of consequences means that the end-to-end engineering process fundamentally changes.
In the new development pattern, with its distributed services across multiple platforms, it doesn't make sense to enlist site reliability engineers (SREs) and ops teams to do constant firefighting. Instead, developers, who have the most insight into the services they have developed, are the best frontline defense. SREs and ops should clear the path and, in cases where developers can identify the problems but can't make the changes, jump in with their expertise.
A new view on incident management
The shift-left move also changes the approach to incident management. While in the past the on-call duty was typically left to SRE/ops teams, it's becoming a part of dev duty as well. A frequent question, though, from developers who are not accustomed to this model: What is the incentive for the developer to take on incident management, i.e. "I already did my job"? Developers familiar with the cloud-native space can probably already answer this question in a couple of ways:
- less likelihood of being called outside business hours
- the opportunity to proactively improve the stability of the service (to avoid off-hours incident calls)
New customer expectations
Bjorn has observed a shift in software development that prizes uptime and availability over features. While both are important, customer expectations about what a service delivers focus mostly on uptime. Customers are considerably more upset about outages, failures, and poor performance than they are about a feature that doesn't ship. Supporting the business model and customer expectation, then, means shifting the development model.
How development, rollout, and delivery are managed is directly tied to customer satisfaction, which makes it everyone's consideration, and the "run" component of the code-ship-run equation assumes equal importance. It is this shift in customer expectation, and in what the enduring product is (uptime versus features), that makes developers realize the scale of operations and both the accountability and the value of executing successfully on the full life cycle.
Conclusion: Reimagining and powering the full development life cycle with a DCP
Cloud-native development has reforged software development so that the workflows and experiences development teams have traditionally used and are comfortable with no longer necessarily make sense. Clear delineations between developers, operations, and site reliability have blurred. The changed development experience, the new platforms and tools, and shifting customer expectations together make the case for:
- Full lifecycle ownership for developers, which supports efficiency
- Infrastructural and automation support from SRE and operations (not just firefighting), particularly with implementing a DCP
- Closer collaboration across teams that makes development and release more efficient from end to end