Ambassador Developer Office Hours

Developer Control Planes: An Architect's Point of View

May 13, 2021

In a cloud-native world, software developers are no longer only responsible for writing code. Today’s developers must write and package code, deploy these services into production, and make sure that the corresponding applications continue to run correctly when released into production.

As an architect, how do you make sure your development teams have the tools they need to manage all of these tasks? We recommend a Developer Control Plane.

A developer control plane enables developers to control and configure the entire cloud development loop in order to ship software faster.

This Ambassador Developer Office Hours session was a conversation between Daniel Bryant, Director of DevRel, and Bjorn Freeman-Benson, SVP of Engineering, at Ambassador Labs about how a developer control plane can enhance your existing technology stack and enable collaboration among your development teams without requiring devs to worry about managing configuration.

Transcript

DANIEL: Welcome everyone to the office hours, Ambassador Labs office hours. Pleased this time to be joined by our SVP of Engineering, Bjorn Freeman-Benson. Welcome Bjorn, great to have you here!

We're going to be diving deep today into the idea around this developer control plane concept. Now we introduced this last week at our Ambassador Fest. If you didn't pop along, the videos will be published soon, so you can dive into those when they're up on our YouTube--and we're going to do some other content around this too.

But today, really, we'd like to go on a journey looking back at Bjorn's experience running many teams, microservices, cloud, these kinds of things. And how the way folks have worked with these applications--worked with these platforms--how this has evolved and how there's a need coming together of this sort of common control plane, if you like, for us as developers to actually work effectively with the platform and deliver value to our users. That's what we're all here for at the end of the day, and that's the number one goal. Right? So that's where we're going to go along.

Feel free--there's a Q&A box, I believe. You can raise hands as well, and myself and the mods will keep an eye on that. If you do have any questions along the way, pop them in the Q&A box, and we can bring those up. But I think it makes sense to get started, so welcome, Bjorn. Would you mind introducing yourself, please, and giving us a quick tour of your background?

BJORN: Yeah, so officially my title is--I run engineering here at Ambassador Labs. But prior to that, I've had a long career of being an engineer or working my way up through running engineering organizations. Prior to Ambassador Labs, I ran engineering for InVision, which is a design tool. That was an interesting experience, especially since it was a fully distributed work--or fully remote, distributed workforce--before the pandemic. And of course, then when the pandemic hit, I already had that experience.

And prior to InVision, I ran engineering for New Relic, which probably all of us as engineers at least know about. And in each one of those cases, I built the organization up from a small little startup up through hundreds and hundreds of engineers. I think when I left InVision, it was 400 and some--450 engineers. And prior to New Relic, I actually worked for the Eclipse Foundation, working on the Eclipse IDE. So, in fact, I'm one of the early authors of the debugger at Eclipse.

So if you used the Java debugger--well, an older, older version--I had a hand in actually writing that. And then way back, if we go farther and farther in my past, I worked on various startups: building hardware for cell phones, various simulations of cities for governments, and all sorts of interesting things. So I've had a wide variety of interesting experiences building code, but it's always been about engineering and delivering solutions to the users over and over again.

DANIEL: Awesome stuff, Bjorn, awesome stuff! Great variety of experiences you've had, right? So fully remote, fully in-house, so to speak, and then hardware, software. I think it's very interesting to try and pull patterns out of some of these things and draw inferences from those. I think that's super interesting. So I would love to dive into your experience--taking the step back to InVision you mentioned there. If we focus on them for a moment, could you provide, like an overview of the architecture there? Was it microservices, was it monoliths, these kinds of things?

BJORN: Yeah, so when I joined InVision, it was a single, monolithic ColdFusion application, amazingly enough, and it had some interesting characteristics in that one of the things that we did was extract Sketch files--so Sketch is another tool that designers use--Sketch files into InVision data so that you could then build a prototype in InVision and share it with your colleagues.

But Sketch only ran on Macintoshes. So we actually contracted with a data center that ran a stack; I think it was somewhere around 200 Macintoshes for us--you know Mac Minis in a rack. So our monolith then would call out to those Mac Minis, run Sketch with remote control, get the data files, and suck them back in again.

DANIEL: Wow.

BJORN: Anyway, the ColdFusion application didn't scale to the number of users we had, and I mean, it was a giant application that used a lot of memory, and it just wasn't cost-effective to scale it. We were scaling it, but by putting in a lot of VMs--but it wasn't cost-effective. So we undertook to rewrite the system into a series of--call them microservices. I call them services because some of them weren't that micro--some of them were micro, some of them weren't that micro--all running on top of Kubernetes. So, in addition to the rewriting, we also ported from just running on VMs to running on straight Kubernetes. So even our ColdFusion application was running in the Kubernetes fabric. And then, we were using Google Pub/Sub for all the asynchronous calls between the different services.

And so that ended up scaling; we had like four million users at the time that I left. And the other interesting thing about that architecture which I really liked was that we sold a private enterprise version of InVision. So you could buy the SaaS product or you could buy a SaaS product in a private version. For the private version, what we did is we ran you your own Kubernetes cluster. So it was all the same code, but it was just for your data so that the data wasn't intermingled in the sense of a SaaS application. And it turns out there were a number of customers who would pay for that--a higher price--over just using a SaaS application. And I don't know why; their security group wanted that or something, but anyway.

So we ended up running 192 Kubernetes clusters in our environment. So when you wrote a service and then deployed it, you'd put it first in your team's development cluster, which was actually just a namespace in a test cluster, and then we would roll that from there into the staging cluster so the QA people could QA it. And then, we would begin this process of rolling it out to all the production clusters. And so, rather than shooting ourselves in the foot by rolling it out to all of them at once, we bucketed the customers into sort of levels of risk.

The straight SaaS customers got it first--you know, 3,950,000 users or something got the first version which ran in the biggest cluster. And then, once it proved to be stable in that cluster by running for a few days--we had a very diurnal traffic pattern. You know, people would use it during the day and not in the evenings--and so if you ran it for 24 hours, you pretty much covered any bugs that might show up in the software. And then we would start rolling it out to the first ten customers in our private--we call them private clouds--but our private clouds, and then a day later, the next 10 clusters and so on--we had an automated system that worked it out that way. So that meant that our highest paying customers ended up getting the software that was baked the most; it'd be like seven days from the time we shipped it to the time that it rolled out for them. So then, if there had been any bugs, they would've shown up by then, right?
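
The staged rollout Bjorn describes can be sketched roughly as follows. This is an illustration only, not InVision's actual tooling: the names `Wave` and `plan_rollout`, the cluster names, and the default batch size and soak period are all assumptions, and the sketch assumes the private clusters are already ordered from lowest to highest risk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wave:
    day: int                   # days after the release first ships
    clusters: tuple[str, ...]  # clusters that receive the release on that day


def plan_rollout(shared_cluster, private_clusters, batch_size=10, soak_days=1):
    """Plan waves: the shared cluster first, then private clusters in daily batches.

    Hypothetical helper. `private_clusters` is assumed to be sorted so that
    the customers judged lowest-risk come first.
    """
    if batch_size < 1 or soak_days < 1:
        raise ValueError("batch_size and soak_days must be at least 1")
    # Wave 0: the big shared SaaS cluster takes the release first.
    waves = [Wave(day=0, clusters=(shared_cluster,))]
    # After the soak period, release to one batch of private clusters per day.
    for index, start in enumerate(range(0, len(private_clusters), batch_size)):
        batch = tuple(private_clusters[start:start + batch_size])
        waves.append(Wave(day=soak_days + index, clusters=batch))
    return waves


if __name__ == "__main__":
    private = [f"private-{n:03d}" for n in range(1, 26)]  # 25 made-up clusters
    for wave in plan_rollout("shared-saas", private):
        print(f"day {wave.day}: {len(wave.clusters)} cluster(s)")
```

The point of the ordering is that the clusters at the end of the list only see a release that has already run through at least one full daily traffic cycle everywhere else.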

DANIEL: Yeah, very nice, very nice. Along the way, who was responsible for all those rollouts? Were they developers, or was it more platform folks that were responsible for rolling these things out?

BJORN: So I've always operated--at InVision, at New Relic, at every place I've been--on the model that developers are responsible for operating their own services. And I do that for a number of reasons, but the primary two reasons are: one is the developers know it the best. So when it starts malfunctioning, they go, "Oh yeah, I know that pattern, I've seen it before," and they can go in and fix it much more quickly than somebody who doesn't know the service.

Like InVision, we had all sorts of interesting problems. We had a thing where you could upload images so you could put them into your designs, and if you uploaded images of a particular type, the image decompressor would explode because it couldn't handle things of that size. And that pattern kept happening over and over again as people built larger and larger images. You know, like when we went to 4K screens, all of a sudden people were putting 4K images, and then the software hadn't really been designed for 4K images, so we had to go back and redesign it. But the team saw that same pattern that had previously happened when people started uploading HD images. And they're like, "Oh, we've seen this before," right? Whereas if we had a separate ops group, then it wouldn't have been as clear to them what the problem was. So that's one reason that I always had the development team operate the services.

At the same time, I've always had a platform team that built the underlying infrastructure that automated the things so that the development team didn't have to know, for instance, all the details of Kubernetes or the fact that we were running 192 different environments. They didn't have to edit 192 YAML files that defined each one of those configurations when they added, for instance, a new mount point to a service. So there was a platform team that did some of the operations, i.e., they automated the platform. But then each--the goal was to build a platform that then allowed the development teams to just do the rollout from their point of view without having to worry about all the details of the insides of it.
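
As a rough illustration of that idea, the sketch below expands one service-level description into a config per cluster, so that adding a mount point is a single change rather than 192 hand edits. Everything here is hypothetical (the dictionary layout, the function names `add_mount` and `render_configs`, and the cluster names); it does not reproduce the platform tooling used at InVision.

```python
import copy


def add_mount(service_spec, name, path):
    """Return a copy of the service spec with one extra mount point."""
    updated = copy.deepcopy(service_spec)
    updated["mounts"].append({"name": name, "path": path})
    return updated


def render_configs(service_spec, clusters):
    """Expand one service spec into an independent config for every cluster."""
    configs = {}
    for cluster in clusters:
        config = copy.deepcopy(service_spec)  # no state shared between clusters
        config["cluster"] = cluster
        configs[cluster] = config
    return configs


if __name__ == "__main__":
    clusters = [f"cluster-{n:03d}" for n in range(1, 193)]  # 192 made-up names
    spec = {"service": "image-upload", "mounts": []}
    spec = add_mount(spec, "scratch", "/var/scratch")  # the one change a developer makes
    configs = render_configs(spec, clusters)
    print(len(configs), "configs rendered")
```

The developer edits the single spec; generating and applying the per-cluster output is the platform's job.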

DANIEL: So, very much focused on self-service for developers.

BJORN: Right. So, for example, at New Relic, the architecture there is that it was a bunch of services--well, eventually it was a bunch of services--and they communicate with Kafka. And so when you wanted to add a new Kafka topic, or you wanted to partition Kafka in a different way, what we had was we had a sort of generic--well, not generic--we had a very specific description of the asynchronous topics that you would use, and then the platform would apply that to all the Kafka brokers and the configurations.

So as a developer, you didn't have to go and define it in the Kafka way; you defined it in our YAML, and then the platform would apply that to all the environments. So we had the same thing at InVision. We're just using Google Pub/Sub, and Pub/Sub, you don't have the topics problem as much; you have different subscriptions. But it's fairly equivalent.
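
A minimal sketch of that pattern, under stated assumptions: developers declare the topics they want once, and a platform tool works out what each environment needs. The function name `plan_topic_changes`, the data shapes, and the action labels are invented for illustration. This is not New Relic's format, and no real Kafka client is called; the function only produces a plan.

```python
def plan_topic_changes(desired, existing_by_env):
    """Compare desired topics with each environment and list the needed actions.

    desired:         {topic_name: partition_count}
    existing_by_env: {env_name: {topic_name: partition_count}}
    Returns {env_name: [(action, topic_name, partition_count), ...]}.
    """
    plan = {}
    for env, existing in existing_by_env.items():
        actions = []
        for topic, partitions in sorted(desired.items()):
            if topic not in existing:
                actions.append(("create", topic, partitions))
            elif existing[topic] < partitions:
                actions.append(("add_partitions", topic, partitions))
            elif existing[topic] > partitions:
                # Kafka only lets a topic's partition count grow, so a smaller
                # desired count is flagged for a human instead of applied.
                actions.append(("cannot_shrink", topic, partitions))
        plan[env] = actions
    return plan


if __name__ == "__main__":
    desired = {"events": 6, "metrics": 12}
    existing = {
        "staging": {"metrics": 12},
        "production": {"events": 6, "metrics": 6},
    }
    for env, actions in plan_topic_changes(desired, existing).items():
        print(env, actions)
```

Keeping the plan separate from the code that applies it means the same description can be checked against every environment before anything changes.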

DANIEL: That's super interesting. One of the questions I had sort of following on that was, were there tools you used off the shelf, or did you have to write some of these tools to manage the deployment and release of the code?

BJORN: Yeah, so we ended up at both InVision and at New Relic--and prior to that at Eclipse, but that was distant history--building our own tools. You know, we used the standard things. You used your Helm charts, et cetera. But the integrations--there were no good integrations or at least none that we could find that did that whole integrated whole. And so we ended up building our own tooling, which is why I had a platform team in each of those companies to build that up and automate those things.

DANIEL: And how did you manage the lifecycle from a developer's point of view? Because my past as a developer, there was definitely a tendency for like writing the code, shipping it, and that was it. Do you have to mentor and coach folks to take that full ownership over the life cycle?

BJORN: Well, yeah, I mean a little bit. People who are junior and less experienced, but that's the same in life. You know, I recently taught my teenage nephew to drive, and his view of the risks of driving and my view of the risks of driving are very different. He just needs more training on what to watch out for, about people leaping out in front of parked cars and things. So same thing, you find people with--developers with--less experience just don't sort of take the larger long-term view of the consequences for customers, and the more experienced people do. But that's just part of the engineering process.

Now, the big change that I've seen over the decades I've been doing software is actually this idea that you're responsible for operating your software, not just writing the software. So back when there was color in my hair, and I was young, and so on, we wrote software that shipped on floppies--our job was done once we made the golden master floppy. I mean, that like, literally, that was the end of our job as engineers, and then the rest of the organization took it from there.

And now we've gone from the point of not just are we responsible for writing that; we're also responsible for operating it, not just in the staging QA environment but actually in all the customer environments. And if we worked in a place that had, say, US Department of Defense clearance, we'd have a whole separate cluster of security clearance operations, right? And so, as developers, we're responsible for operating across all of those things, including places where we might not actually have the ability to make changes, like if we had a DOD cluster, right?

DANIEL: Interesting, yeah.

BJORN: So that's been the biggest change for me--or one of the biggest changes--over the years of developing software is that now we're actually responsible for operating it, and people are paying us. At InVision, what were they paying us for? They were paying us for the features? Well, not really. They were paying us for the thing to be up. And so if we had an outage, then people were upset--much more upset than if we didn't ship a feature in a particular month.

DANIEL: Yes, because they couldn't get the value from what they were actually paying for.

BJORN: Right, exactly. You know, at InVision, we had this really bad outage one day where I think the CEO was out raising another round of funding because that happens at startups, and he was using the product to do the demo to the VC firm he was talking to, and we had an outage at that exact moment. We all heard about that one.

DANIEL: I can imagine. Yeah, I can imagine.

BJORN: But that was an interesting story because, you know, the CEO came back and yelled at us all, and we did a bad job because it went down. But the engineers were saying, "Well, we should know when the CEO is going to go out and give those presentations so that, you know, we make sure that we don't do anything bad." And I'm like, "Well, guys, we have 4 million users.  At any given moment of the day, one of those users, or two, or ten, or 100,000 of them are making a pitch that's as important to them in their life as it was to our CEO to those VCs."

So we can't just say, "Oh, well, when that CEO is on deck, we're going to be careful." We have to be careful all the time because we have all these users. And that was sort of a realization that the engineers hadn't really grokked the scale that we were operating at, that we were actually providing value to all these people all the time, as opposed to just, "Oh, I'm typing on my keyboard" and, "Oh, it's working at the moment," or, "Oh no, it broke at the moment." If it's just me and the computer, it's a very different relationship than me and four million users.

DANIEL: Yeah. Because that responsibility, that accountability is really important to understand.

BJORN: Yeah, and you know, like at New Relic, people counted on New Relic--other companies counted on New Relic to be monitoring their stuff. I was visiting--I think it was Zynga in San Francisco--in their big operations room, which is a giant room. And all the developers were working in this room, and the room was surrounded by windows, and then up above the windows but below the ceiling was a line of monitors all the way around the room, which were showing New Relic's dashboards.

And while I was there talking to them about a problem in the PHP agent, we had an outage, and all the graphs went to zero because New Relic was no longer collecting data. And so it was showing that everything was zero. And everybody in the room panicked because they thought that their system was down because all the graphs had gone to zero. And again, this is an example of how the developers at New Relic weren't grasping the fact that they operated this software, not just built it, and the impact that was happening on, you know, here it was 50 developers in this room who had a heart attack.

DANIEL: Yeah, I can imagine! As an observability vendor, you need to be more up than the system that you're observing.

BJORN: Exactly, right. So anyway, that's all a long way of saying that it's about, you know, if you're building a SaaS piece of software, it's about operations as well as writing the software. And so that's the extension that we've made as developers--to go from just developing to developing and operating.

DANIEL: Well said, Bjorn. Yeah, no, I've definitely seen that in my career, and it can be a big jump, but I think it's one we're definitely seeing a lot of folks talk about--and a lot of folks struggling with these days. As you've progressed through your career, were there clear components that stood out as part of this concept that we're now sort of talking about as developer control planes? Were there clear components that you sort of saw in each of your roles?

BJORN: Probably, but to me the developer control plane idea comes from the fact that we're now operating as well as developing. So as a developer, I'm responsible for that whole thing. And one of the things that I've seen over and over again--which is why I like this developer control plane that we're building at Ambassador--is that the cloud-native SaaS scheme of having to operate across multiple environments is different than just how do I develop on my laptop? Or how do I develop on my laptop and my development cluster in the cloud, you know? So there are many tools out there today where you can go and spin up a development cluster, and you can even use VSCode in the cluster and code right there and so on.

But it's--in my opinion--a very local view because it's just about you in that cluster. It's not about the whole cloud-native experience of running that across all of the places that you have to run it, like at InVision (192 clusters) or New Relic--I think they're currently operating in three data centers with Kubernetes clusters in each one. So they're not running 192, but I was talking to a friend the other day. I think it's like a dozen or something. But they still have to operate across all of those.

DANIEL: Yeah, yeah. So we've often talked about this sort of notion of code, ship, run, and each of those things now equally, well, I was going to say actually, each of those things are equally important, but what I'm hearing you say is it's probably the run is arguably the most important of the things there?

BJORN: Well, I'm arguing that it's most important only because we're not that good at it as an industry, right? We're very good at writing code. I mean--you could always be better--you know, we're very good at writing tests when we remember to do it, and we've got tooling around all of that stuff. And then in the last decade, there's been a lot of good tooling about deployments and figuring out how to do--how to package things up--and the advent of Docker on top of LXC and all this container stuff makes that easier. It gets rid of the dependency hell that we used to live in when we were shipping. Now it's all in development, and we can build test pipelines with CI, so we've got a raft of tooling around that. What we don't have a lot of tooling around is how I, as a developer, have to then operate. We've got things like, at New Relic, we were doing some observability things, and that was much more helpful than when we didn't have those things. But we're still not looking at the whole cloud-native journey and what is the tooling I need across that whole journey. And so the reason I keep emphasizing the run part is because that's the part that's least mature, in my opinion, around the tooling.

DANIEL: Yeah, very well said, Bjorn. And on that note, we often talk about the CNCF landscape as a whole gamut of tooling. What's your thoughts around that? Because lots of people are coming up with lots of new tools, new ideas. Innovation is great, but is it a case now--do you think we should start consolidating on some of those things?

BJORN: Well, I see in the industry that we go through waves of expansion and waves of consolidation. And so I don't actually know where we are in that matrix, in that cycle at the moment. But what I do know is that in order to do an effective tooling for anybody--you, me, another company, and so on--you need to integrate together a lot of different pieces. So, in times of consolidation, you end up with products like, you know, back in the nineties, IBM had--what did they call their thing, was it VisualAge, their big product that integrated all the things together or something--

DANIEL: I've bumped into similar things with CA and other companies like that.

BJORN: And Microsoft had a whole integrated Visual Studio thing where it was all the tools you needed all in one product, which was fine when you were a single focused consolidation thing. And then in times when we've spread out in the industry into lots of ideas--let a thousand ideas bloom--you end up using that thing and then a lot of little other ideas and things. And so when you start a new job, you spend a lot of time getting logins to all these things and learning idiosyncrasies of all these things and so on.

So what I find is the most effective tools are ones that integrate with lots and lots of tools. In fact, the ones that are most effective are the ones that integrate with the tools that you use. So if you use Jira and you use CircleCI, and you use GitLab, and you like--if these are the things that you use--then you want a tool that integrates with those things. You actually don't care if it integrates with anything else, and you certainly don't want to buy a unified whole because it doesn't--it's not, you know, if you buy IBM's current product, whatever it's called, it doesn't integrate with Jira because they've got their own thing, right? You don't want that. You want something that integrates with all the things that you use.

DANIEL: Perfectly said, Bjorn. My next question actually is, have you got any advice on how architects should help their teams in choosing these tools? I think what I heard you say is sort of look for the integrations or look for open standards. I see some folks talking about open standards as a way to have options for the future.

BJORN: Yeah, maybe. I found that people don't change tools that often; you know, just in my career, people change jobs more often than they change tools. So I don't think that the company needs to spend too much time thinking that you're hooked up to an open standard because the chance of you changing from Jira to some other way of tracking tickets is basically nil, right? Maybe you might change your CI system when a new CI system comes out. But so I think that the key isn't that so much--that there's open standards--it's just that it actually integrates with the things that you actually do.

DANIEL: I like it.

BJORN: I mean, open standards might make that easier, but what you really want to say is, "Well, we use this set of things, or we use this set of things. And when I was last at KubeCon, I saw that CNCF had a new project in that area, and we want to use that, so I want to make sure that my tooling works with that thing."

DANIEL: Yeah. I love that. And that actually leads nicely on--so here at Ambassador Labs, we're trying to productize the developer control plane. What do you think are the advantages and disadvantages of being opinionated, with how some of these tools do plug together?

BJORN: Well, the problem with lots and lots of tools is that there's lots and lots of different models you have to think about to use all of them, right? And having all those models in your head isn't really creating value for your end customer, which is what they're, in the end, gonna pay you for. You know, if you had Daniel Bryant LLC and you were selling an iPhone app, you don't really want to spend your time fiddling with the YAML that defined CircleCI integrations because nobody's paying you for that. What they're paying you for is the feature that integrates with Facebook and Instagram and puts TikTok videos on your app or whatever it is that your app does, right? And so that's where you want to be able to spend your brain cycles.

And so what you want is you want a tool, or--this is my argument--you want a tool that does that stuff for you so that you don't have to spend your time thinking about it. And instead, you can spend your time on the things that are the most valuable to you, which, you know, may be things that make you money or may be things that are just plain fun. But either way, whatever it is you want to spend your time doing, you spend your time doing that and not the--I guess it's called toil in the SRE world. And we've all, as we've moved from this model of writing code--where way back when I first started, there was a lot of toil in writing code. In fact, I programmed in college on cards, I mean, it was the last of the card machines, but there was a lot of toil in sorting the cards in the box. Now all of that's gone. I've got an IDE; it does keyword completion for me, it does variable renaming. All that toil of writing code has gone away. But as we've moved into this, now we operate our services--that we haven't figured out yet as an industry, and so there's a lot of toil in that.

And so the thing that a developer control plane does is it says, "Oh, you know that wonderful experience that you have when you're developing just on your laptop with your IDE? We want that same experience across the whole cloud-native service deployment operation experience." And a lot of those pieces are out there today, and you could put them together yourself, but then you'd have to put them together yourself. And so what we had at the group in InVision and the group at New Relic is the platform team was putting together those things in a way that made it so that the engineers didn't have to do all that crap work--just concentrate on the part that they were really interested in, enjoyed, and provided value from.

DANIEL: Very well said, Bjorn, very well said. I always have to ask the question, are we drinking our own champagne at Ambassador Labs? Are we using the tools that we're creating? Are there any, like, challenges with that? There are clear benefits, I'm guessing, as well? I'd love to get your opinion on that.

BJORN: Yeah, so we are. We've done that at every company I've ever worked at. At the Eclipse Foundation, we used Eclipse to write Eclipse. At New Relic, we used New Relic to monitor New Relic. At InVision, we used InVision to prototype the new features at InVision. And so here at Ambassador, we use all of the Ambassador tools to write the Ambassador tools.

You know, there are certain challenges in that, in that of course, we're writing the next generation of features, which don't always work. We write software, and it doesn't always work before we release it. So that provides challenges that hopefully our customers don't have to deal with because when we do release it to them, it actually works. But yeah, I mean, we've got it. We have our clusters on Google and Amazon. You know, we've got our DCP showing all of those clusters. We use Telepresence into those things. We use the Ambassador Edge Stack--now going to become the CNCF Emissary--as the ingress on each one of those clusters. Yeah, so we use all the stuff ourselves. We use other things too, because we don't cover all of the tooling, but we do use our own stuff.

DANIEL: Super, that's great to hear! If you were going to give some general advice to architects listening where to get started in looking at this concept of a developer control plane, what would you recommend to those architects?

BJORN: So there's sort of two approaches to this. One is if you've done this, if you've run an organization that ships multiple services before, then you work on it from a top-down. It's like I've seen this problem three times; I know how to build a system to solve these problems in a better way-- which is the argument that we're using at Ambassador for the developer control plane work that we've done, which is that you and I and others in the company have seen these problems many times, and so we go, "Gosh, well, I know a generalization of this that's going to solve the problem for most people, so let's implement that," which is, to me, the exciting thing about Ambassador, is that we can take all that experience.

But if you haven't seen that, if the company you're working at--this is the first time they've done that--then rather than trying to build a general solution to a problem that you're not fully aware of, which is a problem that, you know, back in the day at New Relic, the first platform team we had just didn't work because we were trying to build a general solution for what we thought was the problem, but it turned out not to be the problem. We built a great solution for that problem that wasn't a problem, but then people didn't want to use it because it wasn't a problem.

DANIEL: Yeah, fascinating.

BJORN: So the second version of the platform team, instead we solved it from the bottom up, which was we looked around, and we said, "Well, what is the biggest problem that people are having," and then we'll build tooling to solve that. And then, "What's the next one that people are having," and then we'll build tooling that integrates with that other tool, and that solves that and build our way up to the platform that actually solved the problems that we were having. And so that's the generalization from specific instances case where, if you haven't seen the pattern three or four or five times before, that's the approach to take. So it just depends on where in the sort of meta life cycle of the life cycles of services you are.

DANIEL: Yeah, I love both approaches. I know you said joining the team in Ambassador Labs, I think we've all experienced slightly different versions of building the platforms and so forth when we've all made our own mistakes. So we've all had successes, right, so bringing that knowledge together, I'm like, I thoroughly enjoy having those discussions and going, "Oh yeah, I've seen that too."

BJORN: I prefer to make expenses, you know, in the previous--make mistakes and spend the expense in previous companies. We don't have to make those mistakes in this one.

DANIEL: That's what I like to say in my talks. Learn from my mistakes; make new mistakes yourself. We're all gonna make mistakes but don't do the same ones I've done. Love it, love it. Bjorn, thank you so much. This has been amazing. It's always a pleasure to chat to you and deep dive into different areas. I really enjoyed hearing your take on how things have evolved and how, you know, the developer control plane has come together and where we're going in the future. So thank you very much for your time.

BJORN: Thank you all, and I look forward to chatting some more in the future.