Scaling Edge Operations at Onefootball with Ambassador

Description

Onefootball is a media company with more than 10M monthly active users, delivering more than 10 TB of content daily. We needed a Kubernetes-based API Gateway and Ingress solution that could handle our 6,000 rps workload reliably and efficiently. In this session, we'll talk about why we chose the open source Ambassador API gateway, and how we migrated to it. We will cover the challenges we identified and the benefits we've seen, like:

Cost reduction: Reduced the number of cloud-based load balancers from ~100 to 4.

Observability: The combination of Ambassador's and Prometheus' capabilities empowered our small SRE team.

Maintainability: We took advantage of Ambassador's declarative Kubernetes configuration, and we were able to decouple cluster settings from the application delivery process, allowing us to ship new features faster.

Transcript

Rodrigo: Hi everyone. It's a pleasure to be here. Thank you so much for taking the time to join us. This is a use case session, and during the next 30 minutes we will tell you more about how we handle the API gateway deployments at Onefootball and some decisions that we made during the process.
Jonathan: Thank you very much for being here; we are really happy. Our main focus today is not to create a how-to or to bring you extensive technical details. It's more about how, even with a high-traffic environment and a small SRE team, we were able to deploy Ambassador, save our company some money, reduce operational workload, and deliver new API gateway features to our services.
Rodrigo: So my name is Rodrigo Del Monte. I have been working as a system administrator for more than 10 years with AWS, and working with Kubernetes for two years.
Jonathan: My name is Jonathan Beber, and I am an SRE at Onefootball. I have been working for the last six years with AWS, cloud, and container environments, and I didn't have anything else to put here. So that's my Twitter if you want to follow me and know more.
Rodrigo: At Onefootball, our mission is to tell the world's football stories. Onefootball was founded in 2008, and now we have more than 10 million monthly active users. We are the world's biggest mobile football platform. Basically, we aggregate content, create content, provide the stats, and tell the stories that football writes.
Jonathan: We have more than 180 employees now, and we believe we can change the way that fans consume football. We think that we can reach them everywhere and at any time. So if you're a football fan, give it a try. And if you want to take a look at our careers page, we are hiring, and we are located in Berlin.
Jonathan: Now we will give a quick overview of our environment, so that it makes more sense for you when we start talking about the integration.
Jonathan: So we run on EC2 instances, with Kubernetes on top of EC2. Today we run it with kops, but we are also trying EKS. We have more than 50 microservices in production today. It's a small engineering team, around 30 to 40 engineers, and our applications are written mainly in Golang, but we also have applications in PHP, Node.js, and Python, among other languages. We are a media company, so our traffic is mainly images and static content, and we use a CDN solution to help us handle more hits than we are ready to receive ourselves.
Rodrigo: And now we are using Ambassador as our API gateway and Ingress solution. Some part of our content is also cached on our side, so we are heavy users of Redis on ElastiCache. As for the kind of traffic, it is sometimes spiky; we have some huge spikes. For example, the last Cristiano Ronaldo transfer to Juventus generated a huge spike. We can't cache everything, so for that we use [inaudible] and all redirects. To start, let us provide some context about where we were by the end of the 2018 World Cup. We had a quiet World Cup from the engineering perspective. The German and Brazilian national teams didn't have the same luck that we had.
Jonathan: So three months before the World Cup, we started to migrate everything from more than 350 EC2 instances, a lot of ELBs, HAProxies, and [inaudible] to the lovely Kubernetes world, supported by Helm and Prometheus in this cloud-native environment.
Jonathan: Fred Vinik is a team leader at Onefootball, and he talked a little bit about this process at the last AWS Summit. So you can search for "How Onefootball won the World Cup" to get more details. Our mission at that time was to keep everything as simple as possible. We also want to say thank you to the company, and especially to the product team, which at that time froze the delivery of software features in order to prioritize stability and performance. So that was where we were at that time.
Rodrigo: And of course we were happy about our results, but we had to move forward. When we were migrating to Kubernetes, we actually decided not to change too much, following the rule: first make it work, then make it right, and then make it fast. At that time we had, for example, one ELB per service, and we had around forty services running. We also had ELBs for logging, [inaudible] clusters, and some HAProxies as well. We were surrounded by all these ELBs, CDN configurations, and DNS entries, and that did not make our lives easy.
Jonathan: And the business team was pushing forward. We had had the freeze, so now it was time to start delivering new features again, and with new features came new services. For every new service, we had a new ELB, and a new CDN configuration if it was a user-facing service. We had a monitoring overhead because nothing about a new service was centralized: we had to take care of how traffic came to the service and how we would monitor it. For example, with CDNs and ELBs, if we had to change our SSL certificate, our TLS certificate, we had to change it on the CDN and on all of this bunch of ELBs. Of course, we don't change SSL certificates on a monthly basis, but it's just one example of an ever-growing and boring task that we had at that time.
Rodrigo: And so as a [inaudible 00:07:26] SRE team, one of our requirements was to find a simple solution. We had previously tried some other solutions that required some state, a database config, which could add overhead for the team. Ambassador keeps everything inside Kubernetes, taking advantage of annotations inside the Service object, and for us that was a killer feature. We were already using Helm to manage our deployments, so when we deployed Ambassador, we were just deploying one more application in production.
Jonathan: So at first we intended to use Ambassador just as an Ingress solution. You probably know that there are a lot of Ingress solutions for Kubernetes, but the difference is that once we needed API gateway features and capabilities, Ambassador was there, so the choice was really easy. Under the hood, Ambassador uses Envoy, which is a well-known and stable solution, and which makes it easier for us to integrate with other service mesh solutions that also use Envoy, for example ... One thing that we are trying now. We can also apply different Envoy filters, and the performance is always the same that Envoy would provide for us, because Ambassador just generates configs and passes them to Envoy. All the traffic goes just through Envoy. So for us, Ambassador proved itself to be an open source Kubernetes a [inaudible 00:09:15].
Rodrigo: You might be asking yourself why we don't just use a Kubernetes Ingress, but Ambassador is more than an HTTP router. We like to say that it has some batteries included. For example, we use traffic shadowing. We are delivering a new machine learning service in production, and we use traffic shadowing for that service. We will provide more details in later slides. We also use header-based routing. Basically, we use these headers to run some tests in production so that real users are not affected by those tests. We also route based on path, so we can have a kind of manual canary release using these paths.
Jonathan: So we had the API gateway solution defined, and then we had to start thinking about how to migrate everything. At that time we had around 40 microservices, but in both production and staging, so it was more than 80 migrations to plan and apply, and it was a small team. So we had this challenge up front. So-
Rodrigo: Downtime was not an option for us, as we were receiving production traffic. The DNS part was completely manual on our side and hard to maintain, and mistakes sometimes happened. For example, we could create a new service and forget to point the DNS to the right service. Of course, it's possible to automate, but it's a human problem.
Jonathan: We had always used Helm to deploy applications. Each application had, in its own repository, a Kubernetes path, and that was nice because it kept the application code really close to its configuration. But at the same time it was hard for us to trace changes across 40 different repositories, so it certainly didn't help our productivity.
Rodrigo: Our SRE team is more of a competency area spread across multiple cross-functional teams at Onefootball, and sometimes projects like this can get lost. We had a project spanning multiple teams, and we didn't have any team leader or product manager pushing it to production.
Jonathan: And since we had multiple load balancers, things started to feel spread out, with too many points of configuration. Even the access logs of the load balancers were spread across multiple load balancers. We could centralize everything in just one bucket, but that was not easily integrated with our centralized logging solution. It was the same with other configs like timeouts, SSL configs, or health check endpoints. It was hard to find out what was being applied to each application.
Rodrigo: Another problem that we had before Ambassador was that we use kops to manage our Kubernetes cluster, which had a known limitation of 15 public load balancers, because you can only attach 15 security groups to a single EC2 node. That was already fixed by kops when they introduced shared security groups, but our Kubernetes version at the time did not support it, so upgrading the Kubernetes version would have been one more change.
Jonathan: So the first thing that we came up with was to create our own Helm chart repository. It was really important for us because the team could focus [inaudible 00:13:50] could focus on just one repository. And we started to create an important separation between the application code version and the application configuration version.
Jonathan: So two important artifacts started being generated. One was the application code itself, a Docker image artifact with its dependencies. The other was the Helm chart, a versioned artifact as well. So we had this distinction between my application code version and my Helm chart version, and we could trace the differences and see which operational changes applied to each application version.
Rodrigo: The Helm chart repository that we created is as simple as possible. All the Helm configuration that lived inside the applications was moved to this centralized repository, and we introduced simple versioning for these Helm charts, like semantic versioning. Each time a new commit is merged to the master branch, the changed chart is built and pushed to the Helm repository.
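As an illustration that is not taken from the talk, the versioning step described here can be sketched as a small function that computes the next semantic version of a chart when a change is merged. The function name and the choice of which part to bump are assumptions; the transcript does not say how the version number was actually incremented.

```python
def bump_chart_version(version: str, part: str = "patch") -> str:
    """Return the next semantic version (MAJOR.MINOR.PATCH) for a chart.

    Hypothetical helper: a CI job could call this on every merge to master
    before packaging and pushing the chart.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part}")


print(bump_chart_version("1.4.2"))           # patch bump
print(bump_chart_version("1.4.2", "minor"))  # minor bump resets the patch number
```

Keeping the chart version separate from the image tag is what makes it possible to answer "which configuration was applied to which application version" from the two version numbers alone.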
Jonathan: And when the application is built and deployed, our CD solution now also applies the latest Helm chart version to each application. With this model we were able to trace changes easily. We could change more than one application and add the Ambassador config to more than one application per pull request, even to all of them if we wanted. And again, we could very easily keep track of which applications were using Ambassador and which were not.
Rodrigo: Now all of the load balancer configuration was centralized in just one Kubernetes service, Ambassador, and we had it running. It was time to start creating our application mappings for each service. Ambassador works like an Envoy control plane, and it tries to simplify Envoy's configuration. To create these mappings, we need to create an annotation inside the Kubernetes Service. For example, this image represents Ambassador definitions. We have the kind Mapping, with the host, and it points to the service, for which we use the short name.
Jonathan: It's a little bit outdated now, given what Datawire released this year, but yes. Here you can focus on the host, the first host configuration. It's [inaudible 00:16:42] expression. You can use [inaudible 00:16:45] expressions or not. Mostly our services receive ... answer to a lot of names, so we use a regex in nearly all the services. What we are saying is: if a request comes through Ambassador and matches this host, the traffic goes to this service.
Jonathan: It's not mandatory to send the traffic to the same service where the annotation is, but in this case it goes to the same service because we are using the short name, the short naming model for our services.
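The slide itself is not part of the transcript, so the following is only a sketch of what such a Service annotation might look like, assuming the host is matched as a regular expression through Ambassador's `host_regex` flag. The `getambassador.io/config` annotation key and the Mapping fields (`prefix`, `host`, `host_regex`, `service`) are Ambassador's own; the hostname pattern, the mapping name, and the `news` service are hypothetical. Python's `re` module stands in for Envoy's matcher here, and the two regex engines are not identical.

```python
import json
import re

# Hypothetical host pattern; the real hostnames were not shown in the transcript.
HOST_PATTERN = r"^news(-[a-z]+)?\.example\.com$"


def to_yaml(doc):
    """Render a flat dict as one YAML document (JSON scalars are valid YAML)."""
    lines = ["---"]
    lines += [f"{key}: {json.dumps(value)}" for key, value in doc.items()]
    return "\n".join(lines)


mapping = {
    "apiVersion": "ambassador/v1",   # assumed API version
    "kind": "Mapping",
    "name": "news_mapping",          # hypothetical name
    "prefix": "/",
    "host": HOST_PATTERN,
    "host_regex": True,
    "service": "news",               # short name: same namespace as Ambassador
}

# The Mapping travels as a string inside the Service's annotations.
service_annotations = {"getambassador.io/config": to_yaml(mapping)}


def host_matches(host):
    """Approximate the host check: the whole Host header must match."""
    return re.fullmatch(HOST_PATTERN, host) is not None


print(service_annotations["getambassador.io/config"])
print(host_matches("news-de.example.com"), host_matches("other.example.com"))
```

Because the `service` field is just a name, pointing it at a different Service than the one carrying the annotation only requires changing that one value.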
Rodrigo: When we were going to production we were confident, but not confident enough to switch the DNS over in one go, even for a non-critical service. So an old but good technique that we use in our team when we are migrating a service is Route 53 weights, and it worked pretty well for this case. Our migration process went like this. We first applied the Ambassador mapping. Then we tested the mapping, checking that it was routing and redirecting correctly for the host; a simple curl can test that. Then we started to increase the traffic weight in Route 53. We started with 1% or 10%, depending on the service, monitored it, gained more confidence, and increased it until a hundred percent of the traffic went to Ambassador.
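The gradual shift described here can be sketched as a series of weighted record updates. The dictionary below follows the shape of a Route 53 `ChangeBatch` for weighted records (`SetIdentifier` plus `Weight`), but the record names, targets, TTL, and step sizes are all assumptions, and the sketch only builds the payload without calling AWS.

```python
def weighted_change_batch(record_name, legacy_target, ambassador_target,
                          ambassador_weight, total=100):
    """Build a change batch that splits traffic between two weighted CNAMEs."""
    if not 0 <= ambassador_weight <= total:
        raise ValueError("weight must be between 0 and total")

    def upsert(set_id, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_id,   # distinguishes the weighted records
                "Weight": weight,
                "TTL": 60,                 # assumed short TTL for quick changes
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {"Changes": [
        upsert("legacy-elb", legacy_target, total - ambassador_weight),
        upsert("ambassador", ambassador_target, ambassador_weight),
    ]}


# Hypothetical rollout steps: 1%, 10%, 50%, then all traffic to Ambassador.
for step in (1, 10, 50, 100):
    batch = weighted_change_batch(
        "service.example.com.", "legacy-elb.example.com",
        "ambassador-elb.example.com", step)
    weights = [c["ResourceRecordSet"]["Weight"] for c in batch["Changes"]]
    print(step, weights)
```

Each step would be applied only after the metrics from the previous step looked healthy, and setting the Ambassador weight back to 0 is the rollback.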
Jonathan: Following these steps, we gained confidence in Ambassador and in its metrics as well. We were able to migrate more critical applications at the same time, but we never gave up this format, because it was important not only for us to have confidence in Ambassador, but also for the other software engineers and the engineering leaders. A gradual and transparent migration like that saved us from endless meetings and syncs between teams, the kind of process that is not so easy when you have to sync between a lot of people.
Jonathan: When the migration was done, we removed all the other load balancers from Amazon. We saved more than $2000 per year. That was not so much, but in the end it also removed complexity from our DNS configuration, since applications now always pointed to the same address, the load balancer that was responsible for Ambassador.
Rodrigo: Another thing that we did at the same time was to create a new domain, api.onefootball.com. That is this second mapping, so you can have more than one mapping in your configuration. Basically, what this mapping says is: everything that comes to api.onefootball.com/ followed by the short name of the service, please send it to that specific service. At the same time we could keep compatibility with the old clients that use the old URL, for example service.onefootball.com.
Jonathan: Since we started to use Ambassador, as I told you, we could focus on just one point of configuration. But more than that, the Ambassador pod was integrated with the solutions we use for our applications. The same pipeline that delivers our applications, the same metrics scraping solution that we use for our applications, the same CD pipeline: all of it was the same for our API gateway solution. So it matched directly our mission of keeping things as simple as possible, and even upgrading Ambassador was as easy as a code change in our own code base. So it was very simple.
Rodrigo: Before Ambassador we had a lack of metrics when it came to the load balancers. Everything was measured as a white box using the APM tool. Of course there were ways to collect metrics, for example from CloudWatch, and generate some reports automatically, but for us that was not done. Also, we would have had to import these metrics into our existing solution, or have a third place to store them. Envoy delivers good metrics about its internals, and Ambassador helps to expose these metrics using the StatsD protocol. The first image here is just an example from the documentation of how it is connected and works.
Jonathan: So it was pretty much the same thing for us. Every Ambassador pod has Envoy in it as well, and it has a sidecar container, the statsd exporter. The statsd exporter project is responsible for converting the StatsD metrics into Prometheus metrics.
Jonathan: So we have the Prometheus Operator with a ServiceMonitor, and the ServiceMonitor is responsible for collecting all this data from all these Ambassador pods. We also use an external Prometheus for better visibility and persistence. So we just define a federation job on this external Prometheus, which is outside the cluster, and it collects the data from the Prometheus inside the cluster that is responsible for collecting the Ambassador data.
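A federation job of the kind described might look roughly like the scrape configuration built below. The keys (`honor_labels`, `metrics_path: /federate`, `params` with `match[]`) are standard Prometheus scrape settings; the job name, the target address of the in-cluster Prometheus, and the label selector are assumptions, since the talk does not show the actual configuration. The result is printed as JSON, which a YAML parser also accepts.

```python
import json


def federation_job(in_cluster_prometheus, selector):
    """Build a scrape job that pulls selected series from another Prometheus."""
    return {
        "job_name": "federate-ambassador",       # hypothetical job name
        "honor_labels": True,                     # keep the original series labels
        "metrics_path": "/federate",
        "params": {"match[]": [selector]},        # which series to federate
        "static_configs": [{"targets": [in_cluster_prometheus]}],
    }


# Hypothetical address and selector for the in-cluster Prometheus.
job = federation_job("prometheus.cluster.example.com:9090", '{job="ambassador"}')
print(json.dumps({"scrape_configs": [job]}, indent=2))
```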
Jonathan: It was really nice for us because we had more insight now, and we could easily generate metrics that were not white box. So for example, for our SLOs, we can use the metrics that come from Ambassador, from our load balancer level, not from a white box level. And so we could have insights like successes and failures: how many requests are failing on the client side and how many requests are failing on the server side.
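To make the client-side versus server-side split concrete, here is a minimal sketch that derives an availability figure from response counts grouped by status class. The counts are invented for illustration, and treating only 5xx responses as failures against the SLO is an assumption rather than something stated in the talk.

```python
def error_ratios(counts):
    """Split failures into client-side (4xx) and server-side (5xx) ratios."""
    total = sum(counts.values())
    if total == 0:
        return None
    client = sum(n for code, n in counts.items() if code.startswith("4"))
    server = sum(n for code, n in counts.items() if code.startswith("5"))
    return {
        "client_error_ratio": client / total,
        "server_error_ratio": server / total,
        "availability": 1 - server / total,   # assumption: only 5xx breaks the SLO
    }


# Invented counts, as they might be summed from the gateway's response metrics.
sample = {"2xx": 9900, "4xx": 80, "5xx": 20}
print(error_ratios(sample))
```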
Jonathan: Oops. Yep. As we said before, we provide content for more than 10 million users around the world in 13 different languages. And of course we have all the tests that you might know, like unit tests, integration tests, UI tests and so on, but sometimes it's difficult to avoid some odd behaviors in production. So that's why we [crosstalk 00:24:24] Yeah, we try to avoid it because, for operations, a feature is only really tested in production. Training is just training, and the game is the game, right?
Jonathan: So we try to avoid this bad experience for the users. We try to create controlled explosions, for example by releasing new features per language, or by increasing the percentage of the traffic. So it's basically a canary deploy. The problem was that before Ambassador, we were creating this canary release logic at the application code level, or we were trying to create it at the CDN level, and it was hard to maintain [inaudible 00:25:18] and, the worst part, it was hard to roll back.
Rodrigo: And then comes Ambassador with the batteries included. If something fails in these releases, we can quickly roll back the Helm chart version and restore everything. Here are two examples of the kinds of release that we do. The first one is shadowing. In order to target content to the right audience, we have to tag this content, and we are improving this service using some machine learning magic, but we don't want to switch everything over at once. Here we are capturing some traffic from the entity extractor service, which still serves the requests, and replaying it on the machine learning service to collect some metrics and be more confident about releasing it.
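A shadowing setup like this one can be sketched as two Mappings that share a prefix, where the second one carries `shadow: true` so that it receives a copy of the traffic while its responses are discarded. The `shadow` field is Ambassador's; the prefix, the mapping names, the API version, and the name of the machine learning service are hypothetical, since the slide is not reproduced in the transcript.

```python
import json


def to_yaml(doc):
    """Render a flat dict as one YAML document (JSON scalars are valid YAML)."""
    lines = ["---"]
    lines += [f"{key}: {json.dumps(value)}" for key, value in doc.items()]
    return "\n".join(lines)


# The live service keeps answering every request.
live_mapping = {
    "apiVersion": "ambassador/v1",        # assumed API version
    "kind": "Mapping",
    "name": "entity_extractor_mapping",   # hypothetical name
    "prefix": "/entities/",               # hypothetical prefix
    "service": "entity-extractor",
}

# The shadow mapping matches the same prefix but only receives copied traffic.
shadow_mapping = dict(
    live_mapping,
    name="entity_extractor_shadow_mapping",
    service="ml-entity-extractor",        # hypothetical ML service name
    shadow=True,
)

print(to_yaml(live_mapping))
print(to_yaml(shadow_mapping))
```

Removing the shadow Mapping, or rolling back the chart version that introduced it, stops the replay without touching the live path.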
Jonathan: In the second example here we have two services, and this new API green service had just received an important software feature. It changed a feature that was really important for us and user-facing. What we did, if you look at the prefix in the first one, is target just NL or [inaudible 00:26:46]. So it's the Dutch language and the Russian language. We could deliver this feature to just these two languages, and it was really important that the product team, as well as the engineering team, got some insight and some feedback before releasing or promoting this feature to all the 13 languages that we work with. So that's where we are today with Ambassador. We have a lot of work to do. Oops. We still have a lot of work to do, but we want to talk with you a little bit about what we see as our next steps with Ambassador.
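The per-language release can be modelled as prefix routing in which the more specific prefixes point at the new deployment. This is a simplified model, not Ambassador's actual route ordering, which takes more than prefix length into account. The path layout (`/nl/`, `/ru/`) and the service names are assumptions, because the slide with the real prefixes is not part of the transcript.

```python
# Hypothetical mappings: two language prefixes go to the new "green" service,
# everything else stays on the current one.
MAPPINGS = [
    {"name": "api_stable", "prefix": "/", "service": "api"},
    {"name": "api_green_nl", "prefix": "/nl/", "service": "api-green"},
    {"name": "api_green_ru", "prefix": "/ru/", "service": "api-green"},
]


def route(path, mappings=MAPPINGS):
    """Pick the service whose prefix matches the path and is the longest."""
    candidates = [m for m in mappings if path.startswith(m["prefix"])]
    if not candidates:
        return None
    return max(candidates, key=lambda m: len(m["prefix"]))["service"]


for path in ("/nl/news", "/ru/news", "/de/news"):
    print(path, "->", route(path))
```

Promoting the feature to all languages then amounts to pointing the catch-all mapping at the new service and deleting the language-specific ones.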
Jonathan: At first we thought about Ambassador as an Ingress solution, but more than that, it brought us something that we were not looking for at that time, and now we know its power. So now we want to have the same capabilities, tracing, canary releases, centralized and declarative solutions, for the traffic inside our clusters. Everything that is inside the cluster has to have the same capabilities.
Rodrigo: Today our TLS is just being terminated by the load balancer, and one of our concerns, of course, is mutual TLS between applications. It's important to have the traffic secured, and to have features like circuit breaking to increase the reliability of a service. Another point is observability: Ambassador already helps by bringing the metrics and proper logs, but we are still missing a distributed tracing solution.
Jonathan: So Ambassador is really our north-south traffic goalkeeper: all the traffic that enters our clusters passes through Ambassador. That's great for us, and we have an active way to handle this traffic, but inside the cluster we have the east-west traffic. So we are looking at Istio.
Rodrigo: And yeah, as a part of the engineering roadmap, we opted to start testing Istio, because Istio also uses Envoy under the hood and it has a good integration with Ambassador. With it we are able to have distributed tracing, mutual TLS, retries, and so on. Istio can be used as the Ingress solution as well, but at the time that we were evaluating solutions, we opted to go with Ambassador, which was the simplest way to get something into production and deliver value.
Jonathan: So Ambassador, for example, integrates with Istio very well because, one, it allows us to define the TLS context for each application, for each mapping. So we just have to map the Istio certificate and then mutual TLS starts working. Ambassador has the stats we told you about, so we can easily go to Istio's Prometheus and say, please grab this data. It scrapes this data from Ambassador's statsd exporter sidecar container and yeah, that's just one thing and we have also the-
Rodrigo: Yeah. And Ambassador integrates pretty well with the Zipkin API. It takes advantage of the white box capabilities. It generates the headers to be used across the different applications, and it also lets you define a tracing service, so you can send the data to external services like Zipkin, Jaeger, or LightStep, which receive all this information and store it.
Jonathan: Yes, we have an example here from some tests that we did. For example, setting up a distributed tracing service is just a few lines of config for Zipkin. Ambassador starts sending distributed tracing details to, for example, Zipkin on port 9411, and with just one tracing service manifest, all the requests start being reported. So it's pretty easy.
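The "few lines" mentioned here probably resemble an Ambassador `TracingService` resource with the `zipkin` driver. The sketch below renders one; the resource name and API version are assumptions, and `zipkin:9411` presumes a Service called `zipkin` that is reachable from Ambassador on the port named in the talk.

```python
import json


def to_yaml(doc):
    """Render a flat dict as one YAML document (JSON scalars are valid YAML)."""
    lines = ["---"]
    lines += [f"{key}: {json.dumps(value)}" for key, value in doc.items()]
    return "\n".join(lines)


tracing_service = {
    "apiVersion": "ambassador/v1",   # assumed API version
    "kind": "TracingService",
    "name": "tracing",               # hypothetical resource name
    "service": "zipkin:9411",        # assumed Zipkin Service name, port from the talk
    "driver": "zipkin",
}

manifest = to_yaml(tracing_service)
print(manifest)
```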
Rodrigo: And that's all we have to say for now. Again, thank you very much for attending this session. We hope that it was useful for you, and we also want to thank the whole engineering team that helped us with this project and with the preparation of this talk.
Jonathan: Thank you very much. It was a pleasure. Does anyone have questions?
Speaker 3: So you mentioned that you don't terminate TLS in Ambassador. How do you do it now?
Jonathan: We terminate TLS on our load balancer. [crosstalk 00:32:29]. Yes, and then we pass it in plain text to Ambassador.
Speaker 3: Okay.
Jonathan: More questions? So thank you very much. We're going to be here if you want to talk with us as well. Thank you.