Enabling and Measuring Developer Productivity at Ambassador Labs
Or how we achieved engineering and operational excellence by consuming our own products.
By Alex Gervais - Principal Software Engineer
If you've followed Ambassador Labs for a little while, you probably know about one of our core values: Make speed a habit. You might even remember our t-shirts circa 2018 with the slogan "CODE FASTER." — Daniel Bryant still showcases it on his public profile picture. The Ambassador Labs culture is fast-paced, always running, and always high energy. We are relentless in our execution, always on the lookout for friction points to eliminate and ways to boost developer productivity.
As an engineering organization, we abide by this value. Move fast, break things. We've set ourselves up for success by adopting best-in-class practices and tools. Our playing field is the Kubernetes ecosystem, and we are particularly efficient at building platforms and developer tools, namely Telepresence, Emissary-ingress and Ambassador Cloud - our developer control plane.
We are all about boosting developer productivity, our own included. I certainly hope you won't be surprised to find out that we are the #1 end user of these technologies. Some call it dogfooding, but we prefer to say "we drink our own champagne" 🍾.
How good is our champagne? Well, of course I'll claim it's the best, but the good news is that it's not just a matter of personal taste: we can quantitatively measure the impact our tools have on our overall productivity.
Before we jump in and look at numbers, there needs to be a disclaimer: at Ambassador Labs, we do not measure individual contributors' performance by lines of code, the number of pull requests, or comments during reviews. We instead evaluate performance based on outcomes and impact on the business. This reinforces speed and team efficiency, not artificial, gameable numbers and lone-wolf heroics. We also adopted the Shape Up methodology and cycle-based development, which means we do not "score tasks" with points as a Scrum organization would, and the number of closed tickets is irrelevant for the most part.
That being said, let's see how we are doing as an organization in terms of developer productivity and operational excellence.
Our ultimate goal is developer productivity. For us, developer productivity means that engineers can deliver business value fast and with confidence. Through our usage of Telepresence and Ambassador Cloud, we've seen developers reduce both their inner and outer dev loop execution times. Engineering managers keep a close eye on the time it takes a change to go through first review, iterations, and approval. We do not want pull requests to go stale and sit unmerged for long periods; that is wasted time and effort! Engineering managers also pride themselves on their ability to onboard new engineers into teams and get them set up in no time. It is pretty typical for a new engineer to run our cloud application locally and intercept some requests on their first day, then open a pull request on their second day.
Since before we released Ambassador Cloud, we've been measuring the quality of our operations through a few essential metrics, including the number of incidents, service downtime, software regressions, mean time to detect and mean time to recover.
Incidents can of course happen at any time, but we are also courageous enough to run "Game Days", where we trigger incidents on purpose, without most developers knowing, to measure our ability to respond, test hypotheses about our software architecture, and make improvements in a controlled manner. Over time, we've battle-tested our Edge Stack API Gateway installation through all kinds of scenarios and allowed our engineering group to build knowledge about this piece of infrastructure and its operations.
Because shipping software to production intertwines development and operations (we want to be fast, and ship safely), we evaluate our performance with a single benchmark: the DevOps Research and Assessment (DORA) metrics.
There are four key metrics from the DORA research program that all organizations should be tracking when assessing the performance of their software development teams:
- Deployment Frequency — How often we successfully release to production;
- Lead Time for Changes — The amount of time it takes a commit to get into production;
- Change Failure Rate — The percentage of deployments causing a failure in production;
- Time to Restore Service — How long it takes us to recover from a failure in production.
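In code terms, the four metrics boil down to simple arithmetic over deployment records. Here is a minimal Python sketch, assuming a hypothetical `Deployment` record type; none of these names come from our actual tooling:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical deployment record; the fields are illustrative,
# not taken from any real Ambassador Labs system.
@dataclass
class Deployment:
    committed_at: datetime                  # first commit behind the change
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed

def dora_metrics(deployments: List[Deployment], period_days: int):
    """Compute the four DORA metrics over a reporting period."""
    n = len(deployments)
    # Deployment Frequency: successful releases per day.
    frequency = n / period_days
    # Lead Time for Changes: average commit-to-production time.
    lead_time = sum((d.deployed_at - d.committed_at for d in deployments),
                    timedelta()) / n
    # Change Failure Rate: share of deployments causing a failure.
    failures = [d for d in deployments if d.failed]
    failure_rate = len(failures) / n
    # Time to Restore Service: average failure-to-recovery time.
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    time_to_restore = (sum(restores, timedelta()) / len(restores)
                       if restores else None)
    return frequency, lead_time, failure_rate, time_to_restore
```

However you collect the records (CI webhooks, ArgoCD events, incident tooling), the point is that all four numbers fall out of the same deployment log.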
In our case, our average lead time is best summarized by this graphic:
The following activities are expected to get a commit into production:
- Coding: Typically, developers will code for about 3 hours before opening a pull request. Developers don’t waste time setting up their local environment and dependent services. They’ll instead use Telepresence intercepts locally to test and debug UI and API changes against our shared Staging environment.
- Automated checks: Once the PR is opened, CI checks will execute unit and integration tests, linting, and dependency checks in parallel. Moreover, during this period, a Telepresence deployment preview of the changes is pushed to the staging environment. As a result of CI execution, a preview URL is posted on the pull request to speed up the review process.
- Review and iterations: Engineers and subject matter experts will review code from other engineers for bugs, design choices and improvements. Using the preview URL, designers, UX experts and product owners have the ability to experience the changes before they get accepted. Change requests are made and the original developer can iterate on the proposed code and UX changes. This is all done asynchronously.
- Deploy to Staging: Every accepted PR is merged into the main branch. Whenever commits land on the main branch, CI will build the application, run tests, and generate release artifacts for Staging and Production. ArgoCD will automatically roll out every change to Staging.
- Deploy to Production: Since release artifacts have already been built and published, deploying the latest changes to production is as easy as clicking the "Sync" button in ArgoCD. Further validations and manual tests are usually performed ahead of time in Staging before shipping to Production.
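The Staging-versus-Production split described above can be sketched as an ArgoCD `Application` manifest. Everything here (names, repo URL, paths) is a placeholder rather than our actual configuration; the point is the `syncPolicy`: Staging declares automated sync, while Production omits it so changes wait for a manual Sync.

```yaml
# Illustrative ArgoCD Application for Staging. Names and URLs are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cloud-app-staging
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/cloud-app-deploy
    targetRevision: main
    path: overlays/staging
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:        # every merge to main rolls out to Staging on its own
      prune: true
      selfHeal: true
# The Production Application would be identical except for its path and
# namespace, and it would omit syncPolicy.automated, so changes wait for
# a manual "Sync" in the ArgoCD UI.
```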
On average, it takes 1.2 days to get a change to production from the moment the first line of code is written. This also means that if we choose to roll forward to fix a bug in production (instead of reverting to the last known-good deployment), we can hope to resolve it in roughly 30 minutes. In practice, our time to restore averages a day because of multiple factors, including our time to detect, the low severity of some failures, and Game Day exercises where we deliberately slow down to observe the response process.
The use of intercepts, preview URLs, and deployment previews allows us to review changes faster and ship with more confidence. Our engineering practices brought our overall change failure rate below the 5% mark. All changes are automatically deployed to Staging, our shared development environment, which sees versions rolled out multiple times a day. Deployments to production happen on demand and require only a click of the "Sync" button; they typically occur 1.1 times a day.
Our DORA metrics values place us in the "Elite" tier when compared to similar organizations.
When analyzing our performance, we broke down the numbers into distinct periods. We wanted to see if our cycle-based development approach (six weeks of business-driven features at the time) and its subsequent cooldown period (two weeks of breathing room between cycles where engineers are free to work on whatever they want) had an impact on the number of changes made, their quality, and their deployment frequency. As it turns out, our change failure rate is higher during cooldown. These cooldown periods are typically when engineers perform large refactorings (and their pull requests stay open for longer), try out experimental changes, or remove feature flags.
Another common practice at Ambassador Labs is the use of feature flags. These conditional flags allow us to deploy to production multiple times a day with incomplete features, test things out, gather feedback from a subset of users, and eventually release features for the world to see by removing the conditionals.
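A feature flag can be as simple as a conditional keyed on configuration plus a deterministic user bucket. Here is a minimal Python sketch; the function names and the environment-variable convention are invented for illustration, and production systems typically use a flag service or config store instead:

```python
import hashlib
import os

def flag_enabled(name: str, user_id: str, rollout_percent: int = 0) -> bool:
    """Return True if the flag is on globally, or for this user's rollout bucket."""
    # Global enable via environment, e.g. FLAG_NEW_DASHBOARD=on (illustrative).
    if os.environ.get(f"FLAG_{name.upper()}") == "on":
        return True
    # Stable bucketing: a given (flag, user) pair always lands in the same
    # bucket, so a gradual rollout doesn't flicker between requests.
    digest = hashlib.sha256(f"{name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

def render_dashboard(user_id: str) -> str:
    # The incomplete feature ships dark behind the flag;
    # removing this conditional is what "releases" it.
    if flag_enabled("new_dashboard", user_id, rollout_percent=10):
        return "new dashboard"
    return "legacy dashboard"
```

Deterministic bucketing (rather than a random draw per request) matters here: the same user keeps seeing the same variant for the lifetime of the rollout.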
Another advantage of leveraging Telepresence in our local dev toolkit is that it allows sharing and intercepting services from a single environment. We don't duplicate Kubernetes clusters or namespaces (and all of their dependent infrastructure) for each developer. We've actually been able to onboard new cohorts of engineers this year while reducing our cloud infrastructure footprint and associated costs.
As we keep eliminating friction points by consuming our own products, we strive to improve the development ecosystem and influence the best practices adopted by our industry. As engineering leaders, our role is to identify bottlenecks and enablement opportunities for new and existing developers, whether they are internal to Ambassador Labs or just out there building awesome platforms on Kubernetes. We like to see developer productivity enabled by Ambassador solutions everywhere.
We are also keen to explore how the emotional well-being and satisfaction of our engineers affect our ability to deliver as a group. A good amount of research is being done around what's being called the SPACE framework, which looks beyond activity counts at satisfaction, performance, communication, and efficiency.
Conclusion: Maximizing productivity
Improvement comes in many measurable and non-measurable forms, and eliminating friction is one of them. Even without quantifying its effect, it's safe to say that getting rid of friction is a recipe for maximizing developer productivity and a pathway to faster time to market for software delivery.
Enabling a smoother ride is equally important to improving developer productivity and the development process, because it makes the most of developers' and team members' time and creates a more satisfying developer experience overall. It would not be an exaggeration to say that the better the developer experience, the more productive the developer.
Take a look at other ways to maximize developer productivity: