Platform Engineering, Adopting Team Topologies, and Supporting the Developer Experience
Crystal Hirschorn (@cfhirschorn), Director of Engineering - Infrastructure, SRE and Developer Experience at Snyk, has been an engineer for more than 20 years in organizations as varied as media conglomerate Condé Nast and cloud application security powerhouse Snyk. In a recent discussion with Daniel Bryant, Head of Developer Relations at Ambassador Labs, Crystal shared perspectives on platform engineering, building and leading effective teams, paving a clear path with centralized control planes, and supporting the developer experience.
Master Sociotechnical Systems to Build Strong Teams
How do human factors interact with technical factors when building ideal teams? The interaction between human and technological considerations gets to the heart of sociotechnical systems design and engineering. That is, what requirements sit at the intersection where hardware, software, people and community meet?
Crystal has taken inspiration from the theories behind the book Team Topologies in building out her organizations (and was actually quoted in the book). Team Topologies "provides the (r)evolutionary approach required to keep teams, processes, and technology aligned", which relies heavily on considering teams' cognitive load when assigning responsibilities. This fundamentally demands that an organization "evolve their team and organizational structure to achieve the desired architecture. The goal is for your architecture to support the ability of teams to get their work done—from design through to deployment—without requiring high-bandwidth communication between teams".
With this demand as the backdrop, it becomes obvious, according to Crystal, that there is a need to gain a clear understanding of the operating model, i.e., sociotechnical system(s) in which the business operates. This involves constructing a team with the knowledge of what market the company operates in, how mature it is, and what the business and technical goals are. Then, when building a team around these considerations, ask what the team should look like to serve internal and external customers.
To get started, a team must understand where it needs to go. According to Crystal, that requires knowing how the team is doing today, which means looking at key metrics within the software development life cycle (SDLC).
Gauge Team Performance: The Value and Limitations of DORA Metrics
"To know how we are doing, we need to know what our current state is, and we also need to know where we want to be in terms of performance and delivery metrics around our SDLC. Looking at DORA (DevOps Research and Assessment) metrics to get insight into performance, we have a clear-cut way to get a level set of where the team is at and where it wants to go," Crystal explains.
DORA metrics look at four measures: lead time for changes, deployment frequency, mean time to recovery and change failure rate. These metrics provide much-needed visibility and actionable data for making improvements to performance. "It's only through visibility and the data that exists that this is really possible," Crystal shares. "The data is there in GitHub, CircleCI, ArgoCD, Kubernetes. It's possible to hook into a lot of those APIs to get the data from all of these sources and visualize it into graphs that help inform better decision-making."
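To make the four measures concrete, here is a minimal Python sketch of how they might be computed once deployment data has been collected. It is an illustration only, not how Snyk does it: the `Deployment` record, its field names, and the `dora_metrics` function are hypothetical, and no real GitHub, CircleCI, ArgoCD or Kubernetes API is called. It also assumes the median is used to summarize lead time, which is one reasonable choice among several.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record assembled from CI/CD and incident data."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Summarize the four DORA measures over a reporting window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")

    lead_times = [_hours(d.deployed_at - d.committed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restore_times = [
        _hours(d.restored_at - d.deployed_at)
        for d in failures
        if d.restored_at is not None
    ]

    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        # None when no failure in the window has been restored yet
        "mean_time_to_recovery_hours": mean(restore_times) if restore_times else None,
    }


# Made-up sample data: four deployments over a two-day window.
t0 = datetime(2024, 1, 1, 9, 0)
h = timedelta(hours=1)
sample = [
    Deployment(t0, t0 + 2 * h),
    Deployment(t0 + 3 * h, t0 + 7 * h, failed=True, restored_at=t0 + 8 * h),
    Deployment(t0 + 24 * h, t0 + 30 * h),
    Deployment(t0 + 26 * h, t0 + 34 * h, failed=True, restored_at=t0 + 37 * h),
]

print(dora_metrics(sample, window_days=2))
```

With this sample data the sketch reports two deployments per day, a median lead time of five hours, a change failure rate of 0.5 and a mean time to recovery of two hours. In practice the records would be populated from whichever tools hold the commit, deployment and incident timestamps.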
While the full picture painted by the data provides valuable insight, it doesn't tell the entire story. "The insight generated does not give you the answers, but it indicates where you should focus. Systematically, how should we look at fixing these things as engineering leaders in the company?" Crystal states. "We still need to put some of what the data tells us into context. When we looked at our performance against the indicators that make high-performing teams, it always looked like we were in the elite performance quartile. But these DORA metrics canvass the entire industry, where many companies won't be a six-year-old, cloud-native company that deploys 50 times a day like Snyk does. These metrics are a snapshot, but we need to benchmark internally against ourselves to understand what success looks like. In our case, it is about determining what the expected outcome is versus what's good. What do we expect to see, and how can we continue to improve?"
Design Platforms for the People Using Them
With a clear vision of what a team should expect to see, it becomes easier to focus on internal goals, how to achieve them and what the developer experience should look like. How should platform engineering consider the platform's users? What does a platform look like that defines and supports the developer experience without completely locking it down?
"What we have brought to bear during the last 18 months at Snyk is an MVP, or minimum viable platform, which we call 'Polaris' internally," Crystal explains. "Beyond just infrastructure as code, it is also built as a reference architecture. That is, an example of what we think a good platform would look like and the principles we want it to codify as well."
The trick for Snyk was to engineer a developer control plane that potentially reduces cognitive load for all developers, provides a paved path for the majority of developers, and enables flexibility for "off-road" excursions for the minority who want to do something "non-standard". More broadly, as Team Topologies describes, the aim is to reduce cognitive load and the need for too much context switching.
Most real-world platform engineering cases bear this out. Bo Daley from Zipcar, for example, argued for platforms that wrangle complexity for a better developer experience: "A new developer is going to take some time to absorb all the pieces of the platform, and to fully grasp all the levers they can pull. Our approach, which makes it simpler, is to abstract and codify what we consider to be the best practices or conventions to pave the path. But we don't fence developers in."
Cheryl Hung, formerly of Google and the CNCF, shared similar sentiments on engineering platforms for developers: "Tools like Backstage or other developer control planes lessen the learning curve and provide a clarity of experience for developers without limiting their ability to seek out and learn platform tools beyond that portal. That is, developers can get access to a dashboard and 95% of what they need is centralized and actionable from the UI. Modular control planes enable enough developer autonomy to do what they need to do and 'break the glass' to escape and move beyond that environment if needed, learning platform tools, working from the command line, or requesting specific functionality from the platform teams, when and if needed."
Crystal sees platform design as flexible as well, arguing that it is not an either/or choice. "When an engineer wants to get their hands dirty with a platform, why shouldn’t they be able to? Right now we have a finite number of engineers in my group, and we needed to build a platform really quickly. We needed to get all the teams migrated over to it quickly to launch our first single-tenant customers. As part of the migration, there was one team that needed a resource we did not have the capacity to build. But we could empower that person to do it for themselves. That is, we can provide a playbook to follow so that no one runs into bottlenecks. We want to encourage an ecosystem of components, and developers who can take something and run with it are encouraged as well."
"Champion a Developer Who Is a Resource for Other Developers"
Not every developer is going to be keen to get their hands dirty, and the ideal platform makes sure they don't have to. It should be, as Crystal describes, turnkey. For example, maybe a developer needs a new instance, and they should be able to go into a self-service UI, press a button and have an instance up and running in 45 minutes. This will require a lot of automation that may not be in place yet. In the meantime, developers will often be other developers' best friends.
In hands-on cases like the one Crystal shares, the developer who wants something more than the platform offers by default is free to take on these tasks, and should be encouraged to do so, because they can be a great resource for other developers. In this case, the developer happened to have an infrastructure background, followed the playbook and developed what he needed. But it didn't stop there. That developer advocated for the team in monthly R&D discussions, explaining how he did what he needed in a self-service fashion, how he didn't need to interact with the infra team, and how he was able to build a whole Terraform module himself and add it to Polaris (the internal platform).
Bo Daley shared a similar experience from Zipcar: "Focusing on the developer experience, for DevOps and platform engineers, is about being responsive and having real conversations, observing what developers actually do and the problems they run into. And sometimes a developer is another developer's best resource. Any time you find a developer who seems like they are on the verge of understanding the big picture, invest in that developer because they will help other developers, get them across the line in understanding the core concepts. Champion a developer who is a resource for other developers."
Developer Experience: Culture, Education and Experimentation
Introducing a new developer platform fundamentally disrupts organizational culture, which is often a bigger change, and harder to manage, than any technical changes. Cultural change requires education and experimentation. Just as some developers prefer to stay on the paved path, others prefer to experiment. By the same logic, different learners learn at different paces and in different ways. Education, training, and communication will all be integral parts of the "long game" of developing and optimizing a developer platform.
"We need to understand from engineers how they learn and reduce barriers to entry and make their access to platforms as easy as possible. This is where our developer experience team will really help with this inward-facing effort, listening to the R&D organization and bringing that feedback in," Crystal explains.
Conclusion: Designing Platforms to Reduce Toil and Increase Joy
Kelsey Hightower shared in a previous podcast that empathetic engineering is informed by considering the user experience and the needs of end users when building software: "It does not work any more to create software in a vacuum. The user reaction should be, 'Wow, someone thought about how I would use this; it's intuitive, it's frictionless.' It is through empathetic engineering that developers will achieve this outcome."
The same is true when creating internal control planes for developers: the aim is to reduce toil, cognitive load and pain, and to give developers a valuable tool for doing their work. If done correctly, as Kelsey stated, the end user should get a sense that this platform or application was truly created with their needs in mind, and potentially even a sense of joy.
"We aim for delivering a developer platform that is a pleasure to use and that makes a developer's work easier," Crystal shares. "The goal of platform engineering is to reduce developer toil and pain, and maybe even find interesting cases where our platform does more than just reduce pain and instead brings joy."