Building an Edge Control Plane with Kubernetes and Envoy
March 10, 2020 | 22 min read
The Envoy proxy is fast becoming ubiquitous as the universal data plane API for cloud-native networking and communications. However, the power of Envoy comes at the cost of configuration complexity. In this talk, I’ll discuss what we learned from designing and implementing the Ambassador edge control plane for Envoy, built around the Kubernetes API and Envoy’s v2 configuration. I’ll talk about the evolution of Ambassador from a simple Envoy configuration engine built around Jinja2 templates and variable substitution to the more sophisticated, multi-pass, compiler-type architecture that is in use today. I’ll also discuss how engineers today are using Ambassador, the community that has developed around this project, and where we see the requirements and technology evolving.
I am Flynn, I'm from Datawire. I lead the ... Oh, thank you. Thank you.
I lead the Ambassador Open Source Project at Datawire - now Ambassador Labs. Ambassador is an edge proxy built to be Kubernetes-native, that's built on top of Envoy. We basically let Envoy wrangle all of the business of moving your data around, and we let Ambassador worry about wrangling Envoy.
In the two years since our first public release, we've actually picked up thousands of active installations, added nearly 7,000 commits from 70 contributors, some of whom are in this room. Thank you very much. We've grown the code base dramatically, added a ton of features, and what I'm here to talk to you about today really is, how exactly did we get here?
It's important to recognize that we tend to understand things by looking backwards, even though we have to look forward while we do them. So I'm going to try to point out places where what we saw looking forward ended up not being what we see when we look backward. But as we look backward, we can identify five pretty distinct phases in Ambassador's life.
The first one was the experiment. We didn't really know what we were doing. We saw an opportunity that we wanted to try to accomplish something with. The second one was when our experiments started to succeed a little bit, and we had to actually turn it into a real product that people could rely on. Once we got it to that point, then we needed to go through and pick up a bunch of features, so that we could start attracting more users, so that we could continue having a product.
We deliberately did that in a way that was expensive later on, and led us into the phase we called the Grand Refactor, where we had to go through and do an enormous amount of work to try to save ourselves from what we'd written ourselves into beforehand. After that, that leads us into our present balancing act, where we're trying to do the usual of balancing features against technical debt, against time to market against everything else, so that we don't end up back in the situation of having to do another Grand Refactor.
So, we'll start off with the experiment. This is at the very beginning. We're actually going to spend a fair amount of time talking about the experiment, because we learned a bunch of really, really critical things during this phase. We also learned them by doing a lot of things that, technically, turned out to be not what we really wanted to do in the long term. But there was no way to know about that ahead of time. We had to actually do them and get feedback from our users, and iterate, and come up with something that was workable.
We started from a place where Datawire had been doing a fair amount of application development ourselves, and as we did that, we kept running into the same sorts of pain points, so we figured maybe other people ran across those as well. That was our initial basis for where to start from. And we formed a couple of early hypotheses about how we might be able to use Envoy to make it better, and what sorts of things people would think about this.
A big one was that we figured developers didn't really want to think in the same terms that you configure Envoy in. Envoy's a pretty low-level tool. It's powerful, it's flexible, it's hard to use. We thought we could do better than that. By mid July we had refined some of these into our primary user persona, Jane, a microservices developer who is mostly defined by being very busy, wanting to focus on her business problems, and therefore viewing Kubernetes as useful, but if she has to mess with it, that's friction in terms of getting her job done.
A little while after that, we identified Julian, Jane's more ops-oriented counterpart. These are not really made-up hypotheticals for us. Everybody that we interact with on a day-to-day basis ends up meshing to some degree or other with these people. You will still hear conversations at Datawire where we say, "Oh, we could do this feature like this, but I don't think Jane would like it very much," or, "If we do it this way, how will we ever explain that to Julian? It's very complicated."
So these personas ended up being the way that we design Ambassador's user and user experience, both from the beginning and all the way through to now. A particular example of that is that Jane probably is working with other developers, and she probably doesn't have a cluster all to herself. Which means that if Jane is going to change something about her service, she needs to be able to do that in a very incremental way, without having to go rewrite some huge global source of truth for the whole cluster.
And that implies that she needs incremental stuff, which implies that Ambassador needs to be generating the global Envoy config and keeping Envoy happy. We initially thought that we'd do this with REST APIs and a database, and that turned out to be a terrible idea. We learned very quickly that doing high-availability statefulness in Kubernetes is really, really hard.
It was pretty common for a while to hear people going, "Oh no, my database pod crashed. What do I do now?"
The turning point on that was realizing that Kubernetes has already solved that problem. That, if we did everything with Kubernetes resources or things stored in Kubernetes, then we could let that be Kubernetes' problem and get on with the rest of our lives. So we shifted away from the REST API and databases, into storing the configuration in ConfigMaps.
And then we figured out how much of a pain it is to patch a ConfigMap, so we shifted over to annotations and now to CRDs, or annotations and CRDs, both. That bought us a lot of flexibility without making us reinvent a lot of low-level things. Another good one is, at the point where Jane has a service running on her laptop and she's ready to test it with real traffic, she immediately has to solve the problem of ingress.
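To make the incremental idea concrete, here is a minimal Python sketch of how small per-service config fragments could be collected into one global routing table. This is not Ambassador's actual code: the fragment fields (`prefix`, `service`) and the function name `build_route_table` are assumptions made purely for illustration.

```python
from typing import Dict, List

# Hypothetical fragment shape: each service owner supplies only their own
# entries, as if they had been read from an annotation on that Service.
Fragment = Dict[str, str]


def build_route_table(fragments_by_service: Dict[str, List[Fragment]]) -> List[Dict[str, str]]:
    """Merge every service's fragments into one global route table."""
    routes: List[Dict[str, str]] = []
    prefix_owner: Dict[str, str] = {}

    # Iterate in sorted order so the result does not depend on dict ordering.
    for owner in sorted(fragments_by_service):
        for frag in fragments_by_service[owner]:
            prefix = frag["prefix"]
            if prefix in prefix_owner:
                raise ValueError(
                    f"prefix {prefix!r} claimed by both "
                    f"{prefix_owner[prefix]!r} and {owner!r}"
                )
            prefix_owner[prefix] = owner
            routes.append({"prefix": prefix, "cluster": frag["service"], "owner": owner})

    # Longest prefix first, so the more specific route is matched first.
    routes.sort(key=lambda r: len(r["prefix"]), reverse=True)
    return routes
```

The point of the sketch is that a developer only edits her own fragment; the global table is always regenerated from all fragments, so nobody has to hand-edit a cluster-wide source of truth.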
How to get access for people outside her cluster through to inside her cluster. And we initially thought we'd just do that with an ingress controller, and then we found out pretty rapidly that the Kubernetes ingress resource is pretty limited with respect to Envoy's capabilities. But at the same time, this is the first meshy thing that Jane has to think about. And it may be the last meshy thing. So it was a big deal. We decided to focus there.
What we ended up doing again was deciding that we would just let Kubernetes wrangle the hard stuff for us. Don't worry about an ingress controller, just deploy it as a service, or deploy it as a DaemonSet. And then, although we were initially concerned about this, it turns out to enable some really cool things where you can use a layer 4 load balancer in front of Ambassador to make it really easy to horizontally scale Ambassador. Also, since Ambassador is an edge proxy, it lives on the edge, and that means we get to worry about people who want to do things like offloading TLS termination.
That's actually a particularly nasty example of a real-world area where living at the edge makes your life really complicated. If you have a load balancer ahead of you, terminating TLS, Ambassador might still need to make policy and routing decisions based on where the client was originally, whether they originally spoke HTTP or HTTPS.
So you have to have a way of getting all that information from the TLS termination points into Ambassador. And we have a lot of tools for this. We have the PROXY protocol and all the X-Forwarded stuff, and Envoy knows how to do these things. But it turns out that configuring everything to play nicely in the real world can get very, very complicated very, very quickly. And you have to explain it to people like Julian, if not Jane. So it has to make sense in the way you express these concepts.
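As a rough illustration of the kind of decision described here, the Python sketch below works out which scheme the client originally spoke when a load balancer in front has already terminated TLS, and whether a redirect to HTTPS is needed. It is a simplified sketch, not how Ambassador or Envoy implement it. The function names, the `trusted_hop` flag, and the assumptions that header names arrive lowercased and that the first `X-Forwarded-Proto` value describes the original client are all illustrative.

```python
from typing import Mapping, Optional


def original_scheme(headers: Mapping[str, str], connection_scheme: str, trusted_hop: bool) -> str:
    """Best guess at the scheme the client originally used.

    connection_scheme is the scheme of the hop that actually reached us.
    The forwarded header is only believed if the previous hop is trusted.
    """
    if trusted_hop:
        # Assumption: if several values are present, the first is the client's.
        proto = headers.get("x-forwarded-proto", "").split(",")[0].strip().lower()
        if proto in ("http", "https"):
            return proto
    return connection_scheme


def redirect_target(host: str, path: str, headers: Mapping[str, str],
                    connection_scheme: str, trusted_hop: bool) -> Optional[str]:
    """Return an https:// URL to redirect to, or None if no redirect is needed."""
    if original_scheme(headers, connection_scheme, trusted_hop) == "https":
        return None
    return f"https://{host}{path}"
```

Note that after TLS offload the connection reaching the proxy is cleartext either way, so without the forwarded metadata a request that was originally HTTPS is indistinguishable from one that was originally HTTP.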
And of course, if you're trying to terminate TLS on a layer four load balancer, your life is even more miserable. So one of the things that we learned fairly quickly through all this was that, being in this experimental phase where we didn't really have any users relying on us yet, but we did have users who were willing to try things and talk to us, turned out to be an incredible opportunity. The things that we learned through that are still the core of the product that we build today, probably always will be.
So, at one point we started to realize that this experiment was actually succeeding. So we needed to run through quickly and try to turn it into an actual product so that we can answer questions like, "Oh wait, what version of Envoy is in this? And what version is coming next? What are you planning to do in the future?" Not widely advertised, before Ambassador 0.10.12, I don't think we could have actually reproduced a given Ambassador build from source. Don't tell anybody, that's a secret.
But, we basically had to go through and retrofit release engineering onto our experiment. You tend not to do a lot of release engineering in experiments. It's not worth the pain up front. But you really, really need it for products. So having some idea of how you're going to do it when you realize that you're succeeding can be important. And it was particularly interesting, because you don't usually get to predict that point of becoming a product instead of being an experiment. You can usually only see that in hindsight.
So there were a lot of things that we needed to do very quickly there. This was also the point that we sort of got into the first recognizably modern Ambassador, of using annotations instead of using ConfigMaps, and having the diagnostics UI, and things like that. So, again, being able to see the experiment turning into a product may be something you only see in hindsight. So that may be something that you need to be prepared to react quickly to handle, as opposed to getting to plan it and schedule it.
So, at that point we had a product. We were pretty confident that we could reproduce the builds. We thought we had the release engineering under control. That was the point that we decided we wanted all the users, and so we decided to go through and add a bunch of features to make Ambassador more compelling. And we very deliberately chose to do this by incurring technical debt. We were financing development with technical debt in order to get things to market more quickly. And we did that because we realized this was an area where we got to see something useful going forward.
We realized that if we took the time at that point to fix all the technical debt stuff that was already visible coming down the road, we would not have a product. It would not survive. We would never get to the point where we were able to fix these things. So we did features first. There was a lot of growth that happened during this time. I think the code base quintupled in size, we added a couple of new engineers. We got our very first external contributor. Everybody give it up for Alex Kobe, he's back there in the audience. Yay.
That meant we had to go and put a lot of effort into figuring out how we could support external contributors, because it was also very evident by this point that there was a lot more work than we could do with just a small team of engineers. That's part of the reason you do open source, to have the community, to let the community contribute. It requires some effort to make that work out well. And we also added a ton of features. I went through the change log and I couldn't fit them all on the slide, which was kind of cool.
We also ... Another area of growth was the number of people who were actually running Ambassador in production, which really started skyrocketing through the summer of 2018 as we were doing this. And that meant that we were starting to get more questions about reliability and things like that.
So there's a whole other area of work going in there, and driving us into thinking, how do we sustain this stuff long term? What sorts of things do we really need to worry about? So that brings us into an area that was less fun, where, as we were going through this growth, we were also starting to hit some pretty significant limits in the architecture of the product. Ambassador at this point was primarily a template engine doing text processing, internally. It had never been architected for really rapid growth. We knew this when we started the features phase.
We also knew when we started the features phase that we were going to have to move from the Envoy version one API to the Envoy version two API, because we kept getting requests for features that required version two. We were also getting closer to version one's end of life. So, in August of 2018 we basically went through and ripped a bunch of Ambassador's guts out and put them all back in later.
This was interesting. If you look over the change log, and if you were an Ambassador user during that time, you would have noticed that things had slowed down dramatically in terms of what we were releasing, and when. This is why: we were hitting these places where small Envoy changes required really big changes in Ambassador's code, and it was just costing us a lot of velocity.
As for our specific goals in the Grand Refactor: we'd already recognized for a while that we needed to get rid of the template engine and rebuild Ambassador more along the lines of a proper compiler, because that's really what it is. It was equally obvious that we had to get to Envoy v2. We needed to switch to Envoy's ADS so that we could stop having to do hot restarts and just do live updates immediately.
I already mentioned the bit about how small changes should need small changes. We also needed tests that could run more quickly, but also could be written more quickly, especially for external people. We were able to do these things, so we shipped Ambassador 0.50.0, which was the result of the Grand Refactor. We shipped it in January 2019, five months after we started. We expected it to take about three months. There are a lot of moving parts in there.
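The difference between text templating and a compiler-style design can be sketched in a few lines of Python. What follows is an illustrative toy, not Ambassador's real architecture or data model: the class `IRMapping`, the pass names, and the output shape (only loosely inspired by Envoy's routes and clusters, and not valid Envoy configuration) are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class IRMapping:
    """Hypothetical intermediate representation of one user-facing mapping."""
    name: str
    prefix: str
    service: str


def parse(resources: List[Dict[str, Any]]) -> List[IRMapping]:
    """Pass 1: validate raw resources and turn them into IR objects."""
    required = {"name", "prefix", "service"}
    mappings = []
    for res in resources:
        missing = required - set(res)
        if missing:
            raise ValueError(f"resource {res.get('name', '?')!r} is missing {sorted(missing)}")
        mappings.append(IRMapping(res["name"], res["prefix"], res["service"]))
    return mappings


def cluster_name(service: str) -> str:
    """Derive a stable upstream cluster name from a service address."""
    return "cluster_" + service.replace(":", "_").replace(".", "_")


def resolve_clusters(mappings: List[IRMapping]) -> Dict[str, str]:
    """Pass 2: one cluster per distinct service, shared by all its mappings."""
    return {cluster_name(m.service): m.service for m in mappings}


def emit(mappings: List[IRMapping], clusters: Dict[str, str]) -> Dict[str, Any]:
    """Pass 3: emit the simplified proxy config from the IR only."""
    ordered = sorted(mappings, key=lambda m: len(m.prefix), reverse=True)
    return {
        "routes": [
            {"match": {"prefix": m.prefix}, "route": {"cluster": cluster_name(m.service)}}
            for m in ordered
        ],
        "clusters": [{"name": n, "address": a} for n, a in sorted(clusters.items())],
    }


def compile_config(resources: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Run all passes in order."""
    mappings = parse(resources)
    return emit(mappings, resolve_clusters(mappings))
```

Because only the last pass knows about the output format, a change in the target format (such as moving between proxy API versions) would ideally touch `emit` alone, which is the "small changes should need small changes" property in miniature.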
The code actually wasn't the really hard bit. The really hard bit was trying to make sure that we didn't actually change behavior between Ambassador 0.40.2 and 0.50.0, and we almost succeeded in that, but only because a bunch of people in the community helped us out by running early access releases of 0.50, and then providing feedback for places that didn't work. There's no possible way we could have done it without that.
We still have a lot of pain around testing, actually. That's still a big problem with all of this stuff. But that gets us to where we are now, where we are trying to avoid another one of those things. We're trying to take advantage of the refactor's ability to hopefully speed things up, and so far, it seems to be working. If there were any wood around here, I would knock on it. V2 plus the ADS is definitely nicer to work with. We've been hearing from people that the refactored code base is easier to contribute to, we've been seeing more external contributors, which is wonderful.
And also there's a bunch of features in there, like SNI and TCP mappings. Those things up there would have been absolutely impossible to do in the old code base. We had to have the new code base to have any chance of doing them at all. So, in the near future we still have to put an enormous amount of effort into tests. They are still our number one pain point, both within Datawire and externally. There's a lot of stuff going in with release engineering, and being able to more closely track Envoy releases, things like that. And a lot of it is still balancing features, paying down technical debt, avoiding getting into more trouble.
The biggest things that we've learned through all of this stuff, we've been learning a lot about the complexity of the edge versus service-to-service. A lot of that was very surprising, but it's why ... A lot of that is also why we continue focusing on the edge, and trying not to get too distracted doing heavy service-to-service stuff.
We've learned that the attention we've paid to trying to help out the inner loop of development within Datawire has helped somewhat, and seems to help translate for external contributors as well. But easily the biggest thing that we've learned through all of this is that being able to test things by talking to users and listening to them is a huge win, and very, very important.
Crystal ball time, what's coming next? We know that having a low barrier to adoption, making it easier to try out Ambassador, is going to stay critical. We know that there are things about performance that are going to be important. There's a lot of stuff around debuggability and transparency that I am really looking forward to getting into.
There's also a bunch of things that are still dealing with Envoy directly. We've had people asking for multiple listeners, which Envoy can do, but Ambassador only kind of does it when you're talking about redirecting cleartext to HTTPS, that sort of thing.
I would like to make gRPC a first-class thing, so that you don't have to talk about mappings with random strange-looking prefixes when you do gRPC. A bunch of other care and feeding stuff with Envoy, and making sure we're ready for Envoy version three, which will be coming at some point. And who knows, yeah, maybe there will be a next-gen ingress that we can worry about.
But that is it for here. What kind of questions do you guys have?
Over there is ... Ah, there we go.
Hi, thanks for sharing all of that. I just ... When you're talking about a lot of the complexity and difficulties people have at the edge, and particularly with just things like native ingress and Envoy. Just wanted to understand a little bit more, what do you think is causing that kind of difficulty and complexity? Like, where is most of that coming from? Is there one or two key points?
I would say most of it is coming from TLS, really. We could talk a lot about horizontal scaling and [inaudible], failover, and all that kind of stuff. But the fact of the matter is that doing those things at layer 4 with only cleartext is pretty straightforward. And as soon as you throw TLS into the mix, your life gets much, much more miserable.
Most of the things like the PROXY protocol and the X-Forwarded headers, and all of that stuff, when you get down to it, boil down to trying to have a way to reintroduce the original envelope, as it were, the metadata about the original connection, to be able to carry that all the way through. HTTP has never really had a great way to do that, and so we've been retrofitting stuff along as we go. But to me that's where most of it lives.
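To show what "reintroducing the original envelope" looks like on the wire, here is a small Python sketch that parses the human-readable version 1 header of the PROXY protocol, which a load balancer can prepend to a TCP connection to pass along the original client addresses. It is an illustrative parser covering only the basic v1 line format, not a complete or hardened implementation; the names `ProxyInfo` and `parse_proxy_v1` are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProxyInfo:
    """Original connection metadata carried by a PROXY v1 header."""
    family: str      # "TCP4" or "TCP6"
    src_addr: str    # original client address
    dst_addr: str    # address the client connected to
    src_port: int
    dst_port: int


def parse_proxy_v1(line: bytes) -> Optional[ProxyInfo]:
    """Parse one PROXY v1 header line, including its trailing CRLF.

    Returns None for "PROXY UNKNOWN", where the sender could not or
    would not describe the original connection.
    """
    if not line.endswith(b"\r\n"):
        raise ValueError("PROXY v1 header must end with CRLF")
    parts = line[:-2].decode("ascii").split(" ")
    if len(parts) < 2 or parts[0] != "PROXY":
        raise ValueError("not a PROXY v1 header")
    if parts[1] == "UNKNOWN":
        return None
    if parts[1] not in ("TCP4", "TCP6") or len(parts) != 6:
        raise ValueError("malformed PROXY v1 header")
    return ProxyInfo(parts[1], parts[2], parts[3], int(parts[4]), int(parts[5]))
```

A header such as `PROXY TCP4 192.0.2.1 198.51.100.7 56324 443` followed by CRLF tells the receiver that the client at 192.0.2.1 originally connected to port 443, even though the connection the receiver sees comes from the load balancer.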
Great, thank you.
You guys are making this too easy if you don't have any questions.
Someone's got to have a tough question, right?
All right, well come on up and ask me directly if you got anything you want questions for, want answers for. I'll be here, Datawire has a booth over towards the left side of the vendors hall, sort of halfway down. Love to see you, thanks much.