Developer Control Planes: A Platform Architect's Point of View
The discussion surfaced several key themes:
- Always be educating: Underpinning all of these shifts is the need for developer upskilling. Organizations at the leading edge of cloud-native innovation need to be prepared to support hands-on developer education at all levels via documentation, self-service recordings, and in-person/virtual training.
- Rapid onboarding is a competitive advantage: At Lunar, onboarding is supported by a service catalog that references different libraries and different ways to create a service, with examples of how other teams have built similar things. Lunar uses Backstage to help with fast-paced engineer onboarding. It has enabled clear visibility into what services exist and what they do, which benefits not only developers new to the company, but the company as a whole.
- Defining a "paved-path" platform reduces tool sprawl: At Lunar there isn't a mandate to use specific technologies, but providing an opinionated "paved path" take on tooling and centralizing tools has accelerated developer ramp-up and productivity. In part, defining the path streamlines ramp-up for developers, but it also helps the platform team in combating tool sprawl.
- Creating opinionated workflows supports a good developer experience: Lunar has created Shuttle, a CLI for handling shared build and deploy tools between many projects no matter what technologies the project is using. The Lunar platform team also adopted developer-friendly GitOps workflows (using Flux and a custom release manager) very early on in their cloud-adoption journey.
- Enabling developer ownership is key to speed and safety: The "you build it, you run it" mantra isn't just theoretical. Organizations like Lunar operate this way every day. The expectation is that developers own the full software life cycle, but to empower them to do so, it has to be manageable. Platform teams need to lay the groundwork for shifting left and make it easy for developers to code, ship, and run. If done correctly, this should help organizations realize the speed benefits of cloud-native development and get fast feedback loops without any downside.
Daniel (00:02): Hello, and welcome to the Ambassador Labs podcast, where we explore all things cloud native, platforms, developer control planes, and developer experience. I'm your host Daniel Bryant, director of DevRel here at Ambassador Labs. And today I had the pleasure of sitting down with Kasper Nissen, lead platform architect at Lunar.
Join us for a fantastic discussion covering topics of just how to rapidly onboard developers to your organization platforms and systems, how to design a paved road platform, and we'll explore the current benefits and challenges of implementing GitOps for continuous delivery with Kubernetes.
So, welcome Kasper. Many thanks for joining us today. Really appreciate it. Could you briefly introduce yourself and your background please?
Kasper (00:50): Yeah, sure. Well, first of all, thank you for having me. So yeah, my name is Kasper. I work as a lead platform architect at a company called Lunar. We are a sort of a FinTech startup that turned into a bank. So yeah, we've been going since August 2015, and just been building a sort of a challenger bank in the Nordics.
With that, we sort of started with cloud native and containers almost from the get-go. We've been running Kubernetes in production since early 2017 and all the nice projects in the CNCF under the umbrella there. I'm also a Cloud Native Computing Foundation ambassador, so I run meetups locally here in the city that I live in, in Aarhus in Denmark, and also been helping sort of a lot with creating the bigger sort of Nordic sort of community alliance. Co-founded the Cloud Native Nordics. I think we are, I don't know, it's kind of hard when you're shifting from meetup to community.cncf.io, but back then, we were around 7,000-
Daniel (02:10): Oh, wow.
Kasper (02:10): ... sort of members within the meetup organization. We are quite a lot of people in the Nordics, 13 groups, I think it is.
Daniel (02:19): Wow. That's-
Kasper (02:20): So, we just tried to help each other out and organize and help each other with speakers and events and stuff like that. I think that's the short intro.
Daniel (02:30): That's awesome, Kasper. You and I have crossed paths many times. I think your very early days, my early days in cloud, in Skills Matter events in London, one day, Cloud native, I remember seeing you and a colleague talking about Lunar. I was just imagining and I was like, "This is super interesting." So, it's been fascinating to watch your journey and your company's journey, your team's journey hand in hand as Cloud has evolved. Right? I think that's been a big learning curve for us all.
Kasper (02:55): Yeah. We tried a lot of different things, failed in some places, and changed direction. And so, that's just how it is. Things are moving really, really fast as it is right now and we are constantly learning. So, it's just following what's out there and seeing what fits in your organization, and how it can benefit the organization itself. Yeah...
Daniel (03:21): Yeah, absolutely Kasper. And I appreciate you sharing your learning as well, because something you touched on there is something I always tried to do as a consultant, which is not just share the good stories, because everyone shares the good stories at conferences, right?
Kasper (03:29): Yeah.
Daniel (03:29): You want to share the bad ones too, don't you? "I tried this and it didn't work, and here's why." I think that's interesting.
Kasper (03:34): Exactly. I think it was actually quite fun that I gave a talk a couple of years ago, so back in 2017, I think, at GOTO. I was up on the stage and saying, "Hey, you should de-centralize everything. Developers should create their own Kubernetes manifest files, and they should have the responsibility to do that and Dockerfiles and everything." But we sort of figured out that wasn't probably the best idea. So, I think we're going to touch a bit on that as well.
Daniel (04:02): For sure, yeah. That's a perfect lead-in, Kasper, because yes, I'd love to know about that actually. I guess for the listeners, to set some context, could you provide a high-level overview of the architecture you've worked on?
Kasper (04:15): Yeah, we run in AWS and we run a lot of Kubernetes clusters. We run our three environments and a sort of a platform environment, where we are currently in the process of centralizing a lot of tooling. We run, I don't know, it's hard to figure out how many microservices we run, I'm not really following that anymore, but I guess around 300 microservices or something like that, primarily Go-based.
Daniel (04:43): Good stuff.
Kasper (04:44): We also did a transition there. It started as Rails, transitioned to Node, and eventually became Golang. That's also an evolution there. Yeah. We do use all the monitoring tools from the CNCF toolbox, from Prometheus and Grafana to Jaeger, all of those tools as well. We adopted the GitOps pattern really early, I guess, back in, what was it?
Kasper (05:16): Maybe early 2019 or something like that, we've been running the setup that we run right now. It's been in Git for two years or something like that. And that's been a really, really nice way for us. It's actually a funny story because back then, we were talking about this transition to becoming a bank, and what things do you need as a bank? You need audit, right?
Daniel (05:42): Yes.
Kasper (05:42): And you get that from running a system like this. You also get the sort of flow around approvals and you can do a lot of things around PR flows-
Daniel (05:52): Interesting.
Kasper (05:52): ... and some policies there, and how do you actually do that? And what are the requirements there? And also in terms of discovery, not discovery, disaster recovery and being available all the time, having everything in Git helps a lot with that. Of course, you still need to secure it, and things need to come up in a certain order, but it just makes it a lot easier to have that desired state in that Git repository somewhere, and be able to recreate from there.
And also, GitOps provides this really nice abstraction in terms of least privilege. So, you don't really need to provide your developers access to the production systems because they're interacting with something else, which is the Git repo, or in our case, we have some tooling in front of that as well. So, they're not really interacting with the Git repo. They're just interacting with some tooling that we developed. It's a really nice abstraction too, and you get a lot of benefits, especially in sort of the banking and financial services industry.
Daniel (06:54): Because I know, like you've mentioned there, Kasper, regulation, compliance. In my foray into finance, this is like 10 years ago, there was a lot of two-party sign-off, right? As a developer, you'd write code and someone would have to review it, and then Ops would have to approve it. And then someone would actually kind of watch the Ops person applying it. And I was like, "Wow, I'm just writing some new Java code." Right? Has that got easier over the years or not?
Kasper (07:18): It did. That was really one of the things that we tried to tackle with creating the system that we have right now using GitOps as well, because I think this is sort of the case that you just mentioned, this is what is going on in a lot of banks out there still because of the principle of segregation of duties. Right?
Daniel (07:35): That's it. Yeah.
Kasper (07:37): But what we did instead was to actually focus on the PR flow and put some quality in there. Really, in order to get something into master, you need to have a peer review your stuff. You always have the four eyes on every change that goes into production using a PR flow, which is really nice. And then we have some restrictions on what kind of branches and stuff like that can go into production.
It's only the master or main branch that can go into production, right? That is how we're sort of dealing with the segregation of duties concept: focusing it and narrowing it down to the PR flow and the review flow around the PR, and limiting what can go into which environment. And that works really, really well, because if you need to segregate development and operations, you just have a big headache and stuff will queue up. Yeah, that's not really a way to be agile in the years that we live in now.
You want to be able to move fast and also be able to roll back fast if something goes wrong in that sense. Just having that. That was our way to sort of tackle this really non-agile concept of segregation of duties, just focusing on the flow that developers are sort of doing anyway and just putting in some small restrictions and policies around that, but sufficient to comply with the regulations.
Daniel (09:04): Very nice indeed. You touched on it already there. I'm curious, are developers empowered to own the full life cycle of services, like from design and coding to running these in prod? Now, you sort of mentioned a few times this sort of, obviously there's a segregation of duties, but there's this notion of you build it, you run it in the Cloud. Right? I know that you and I talked about this in the past. I'm kind of curious, how does it work? Do you as a developer, own everything all the way through to code, ship, and run?
Kasper (09:30): Yes, they do.
Daniel (09:32): Nice.
Kasper (09:33): What we do is that we have a central platform team, or now becoming teams because we are scaling insanely. But that's a completely different challenge: how do you actually do that?
Daniel (09:47): Interesting.
Kasper (09:48): But we have a central team at the moment that is sort of creating tooling and providing all the stuff necessary to make shifting all of these responsibilities left to developers easy for them and manageable, because you can't just put all of this responsibility on to developers and say, "Hey, go figure it out, be the experts in all the different systems that you actually need to get the data and whatever you need to do in order to take ownership of the entire life cycle."
Kasper (10:17): You need to provide some kind of easy way to do that and some sane defaults for, I don't know, dashboarding or alerting or whatever it might be. So yeah, they completely own the full life cycle of this right now.
Daniel (10:34): That's very cool. For a lot of folks, that is something to aspire to. Now, we see again, the conference talks, right? Folks are talking. Like yourself, you were in the vanguard, you're kind of leading the charge here. I think a lot of even small organizations, but definitely big organizations would love to get there because they realize the speed benefits. If you've got that full ownership as a developer to your point, you can move fast, right? You can get those feedback loops going without the need for these handoffs.
Kasper (11:00): Exactly. And that is the key thing. It is just to be fast and move fast and be able to put stuff in there, test a new feature out and, hey, if it didn't work out, you just remove it again. Just to be able to put those things out there, try it out, see if it works, get some feedback.
Daniel (11:23): Experiment.
Kasper (11:23): Yeah. And then experiment both with features, but also with sort of the tech side of things as well. So, it's the full life cycle, so to speak.
Daniel (11:32): Very interesting. This is a super good question. I've been keen ever since you mentioned it in the intro, Kasper. What tools do you use to manage all of this, because you've hinted that you've created some of your own stuff. And I know there's other tools out there. You've talked about Backstage in the past. At Ambassador Labs, we're doing similar things. I'm really curious to know, yeah, what you've done, and what the sort of mentality was around the build versus buy, because that's always really hard. Right? Do you buy something, or use open source even? Or do you build something yourself? Right?
Kasper (12:00): Yeah, exactly. The story, as I mentioned in the beginning with this talk from GOTO, was that we were sort of de-centralizing everything. So, de-centralization was the key concept of how we started out. I think we had a deploy folder or something with the Kubernetes manifests, and it was just pure manifests: deployment, service, and whatever you needed in order to get that up and running.
And also the Dockerfiles, and managing Dockerfiles is probably the biggest pain, because when you're in the financial services industry, you also need to manage your risks and be sure that it applies very... But especially in the financial industry, we need to make sure that we don't have any high CVEs or stuff running in our systems that can do a bad thing.
And what we found out back then was that it's just really, really tedious as we were growing and our teams got more services. So, a team owning 10 services had to go into 10 different repositories, update 10 Dockerfiles and whatever else they needed to do, and that just was not working at all.
Daniel (13:17): Like patching, Kasper, right? If a CVE popped up, you had to go into 10 files, change them all to the latest version, and commit them all, yeah.
Kasper (13:26): And then the PR flow, and the review, and yeah. That didn't really work out and developers didn't really want to take on that ownership-
Daniel (13:36): Interesting.
Kasper (13:36): ... because they didn't really feel that that was their task. What we did instead was, we created a tool called Shuttle. It's an open source project and it's available on our GitHub organization. It's called Lunar Way. It's not Lunar. We changed names during all of this. But it's on the GitHub organization called Lunar Way, Shuttle.
And what it basically is, it's essentially distributed Makefiles, so to speak. But what it allows us to do is, we have a sort of a centralized, we call it a plan. It's basically just a lot of templates for our manifests, for Dockerfiles, and stuff like that, and scripts you can run on your repositories. And then in each repository, we have a shuttle.yaml file, which also specifies ownership, who owns this thing, and then you can configure environments and stuff like that.
And then you really don't need to care about Kubernetes manifests. You are not actually seeing them at all, or Dockerfiles, if you don't need to. So, we are abstracting all of that away from our developers, with the possibility, if they have a special case that we are not at the moment able to handle with this plan, as we call it, that they can opt out and create the stuff themselves. And they can also inner-source it too. And we have seen a lot of cool examples of, now there's this problem, some team fixes it and then they create a PR back to us and then it's available to the entire organization.
That's really nice that we have this sort of a platform where we can allow teams to figure out what they need. And if it's something that's applicable to the rest of the teams, they can just inner-source it. And that works really well. Every repo has a shuttle.yaml file that specifies ownership, and we take that ownership with us into labels and Prometheus metrics and log lines and whatever else, just to be able to select and filter when we search for certain stuff. And it also makes it really easy to create dashboards for specific teams or services and all of that.
So, that's one tool that we built ourselves. It's a fairly simple tool. Right now it's primarily just running some bash scripts in the background. But as a developer, you actually just run shuttle run build, and then you get a Docker image out of it, or you can say shuttle run generate config and you get some manifests out.
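To make the plan idea described above more concrete, here is a minimal Python sketch. It is not Shuttle's implementation or file format: it only illustrates a centrally owned template being combined with per-repository settings (service name, owner, per-environment overrides) so that the owner ends up in the manifest labels. The template, the field names, the registry.example image, and the render_manifest function are all assumptions made for this example.

```python
from string import Template

# Hypothetical stand-in for a centrally maintained plan template.
# This is NOT Shuttle's real template format.
DEPLOYMENT_TEMPLATE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $service
  labels:
    app: $service
    owner: $owner
spec:
  replicas: $replicas
  selector:
    matchLabels:
      app: $service
  template:
    metadata:
      labels:
        app: $service
        owner: $owner
    spec:
      containers:
        - name: $service
          image: $image
""")

# Defaults that the platform team would own in one place.
PLAN_DEFAULTS = {"replicas": 2}


def render_manifest(repo_config: dict, environment: str) -> str:
    """Merge plan defaults with per-repo settings and render a manifest."""
    values = dict(PLAN_DEFAULTS)
    # Per-environment overrides from the repository win over plan defaults.
    values.update(repo_config.get("environments", {}).get(environment, {}))
    # Ownership travels with the service into the rendered labels.
    values["service"] = repo_config["service"]
    values["owner"] = repo_config["owner"]
    values["image"] = repo_config["image"]
    return DEPLOYMENT_TEMPLATE.substitute(values)


# Hypothetical per-repository configuration (what a repo-level file might hold).
example_repo = {
    "service": "payments",
    "owner": "team-a",
    "image": "registry.example/payments:abc123",
    "environments": {"prod": {"replicas": 4}},
}

if __name__ == "__main__":
    print(render_manifest(example_repo, "prod"))
```

Because the template lives in one place, a fix to it reaches every repository that uses the plan, which is the property the patching discussion relies on.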
Daniel (16:06): Very cool.
Kasper (16:07): So, everything is sort of tucked away, which is really nice. That makes the patching part really easy because then it's just us patching our templated Dockerfile, and then everything's good.
Daniel (16:17): And then it kind of, you can roll it out in the background. I guess you've got to test it well, right? Because I've had that situation where a patch addressed the vulnerability and actually changed the functionality, the way the memory management happened in a Java app, for example, which meant the throughput suddenly went, boom, like that. So I guess there's some notion that you've got to team up with the developers and go, "Hey, we're rolling out a patch. This might be the implication."
Kasper (16:42): Yeah, sure. And they always had the ability to specify which version of code they want to use or whatever it might be. So, they have the option to actually opt out and-
Daniel (16:52): Oh, interesting.
Kasper (16:53): ... I think many actually do this in order to avoid this problem, but then it's their responsibility and they take-
Daniel (16:59): Oh, nice.
Kasper (17:00): ... Yeah.
Daniel (17:02): Freedom and responsibility, as Netflix say, right? You want to opt out, it's on you?
Kasper (17:05): Yes, exactly. Our job is basically just to make the right choice, the easy choice-
Daniel (17:11): Oh, I like it.
Kasper (17:11): ... but if that's not the case, they have the option to control it themselves, but they take on the responsibility also. That's just part of it. And then that's sort of around the repository and around how we run Shuttle in our pipelines as well. So, everything you can run locally, we just run that in continuous integration as well, which makes it really easy to test everything beforehand instead of having to fiddle with some Jenkins Groovy or whatever stuff that gets run there, which is really hard to test.
It's really nice to just be able to... If it runs locally, it will run on the CI server as well, which is really nice. And that makes that part easy as well. And then from there, we push, of course, the Docker container to a registry and then we push our artifacts, as we call them, which are basically those Kubernetes manifest files, into an S3 bucket, and that then sort of triggers what we call a release manager.
And the release manager is something that we built ourselves as well, but the primary job of that component is to basically move files around in the GitOps config repo. So, it gets an event saying, "Hey, there's this new build available." It checks for policies, and we have a policy that is an auto-release policy. So, as a developer, I can specify that whenever I push to, let's say, this feature branch or whatever, it will go automatically into our development environment, or if I push to master, it will always, when that is merged, go out to production automatically.
That's something that you, as a developer, control as well. And so, you can sort of create your own flows and how you want to do that, which is also really nice. The only restriction we have there is that master or main is the only branch that you can push into production, because that's where we require the review.
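The policy check described above can be sketched roughly as follows. This is an illustrative Python example, not the release manager's actual code or data model: the AutoReleasePolicy type, the environment names, and both function names are assumptions. It shows the two rules mentioned in the conversation: a developer-defined auto-release policy maps a branch to an environment, and only master or main may reach production.

```python
from dataclasses import dataclass

# Branches allowed to reach production, per the review rule described above.
PRODUCTION_BRANCHES = {"master", "main"}


@dataclass(frozen=True)
class AutoReleasePolicy:
    """Hypothetical policy: new builds from `branch` go to `environment`."""
    branch: str
    environment: str


def is_release_allowed(branch: str, environment: str) -> bool:
    """Only reviewed branches may be released to production."""
    if environment == "prod":
        return branch in PRODUCTION_BRANCHES
    return True


def environments_for_build(branch: str, policies: list[AutoReleasePolicy]) -> list[str]:
    """Return the environments a new build should be released to automatically."""
    return [
        policy.environment
        for policy in policies
        if policy.branch == branch and is_release_allowed(branch, policy.environment)
    ]


# Example policies a team might define (assumed names).
example_policies = [
    AutoReleasePolicy(branch="master", environment="prod"),
    AutoReleasePolicy(branch="feature/new-login", environment="dev"),
    # This policy is ignored because feature branches cannot reach production.
    AutoReleasePolicy(branch="feature/new-login", environment="prod"),
]

if __name__ == "__main__":
    print(environments_for_build("feature/new-login", example_policies))
```

Keeping the production restriction in a separate function means the same check can guard manual releases from a CLI as well as automatic ones.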
Daniel (19:07): Makes sense.
Kasper (19:11): And then yeah, you can also just do it manually with a CLI tool. You can say release this branch into this environment or whatever you want to. That's pretty easy and provides this really nice abstraction as well, because now you're not dealing with going into GitHub or pulling down the repo yourself and figuring out and managing the files. You're just saying, "I know I built this. I can just release it into whatever environment." So, it's just a release or it's automatically done for you. That's up to you. That just makes this process really, really easy as a developer to interact with the running systems and the running software-
Daniel (19:46): Yeah, I like it.
Kasper (19:47): ... on that and impact that. That's really, really nice. And then we tried recently, well, not recently, October last year or something like that, we started looking into Backstage, because now we have a lot of different tools. As we talked about: monitoring tools, the release manager tools, and Kubernetes, running Snyk for vulnerability scanning, and Sourcegraph for something else. And we just have a lot of different tools out there that developers, to some extent, need to be experts in, or at least have some knowledge about, in order to get what they need in order to take on the responsibility. And that's just, [mind blown sound!].
When we push that onto them, you need to make it easy for them to get the right information. What we want is sort of a single view. When you log into your computer in the morning with a cup of coffee or something, you have this single pane view of your team or a specific service or whatever you want to. But just sitting down, I can see what happened since yesterday, what was released, was there any sort of stuff going on in production? Or how does the world look today? That's really something that we want to build. We are not there yet but that's sort of our goal, to create just this overview of how everything is looking-
Daniel (21:17): Mm-hmm (affirmative). I like it.
Kasper (21:17): ... since I was in last time. Get the pulse of the system, getting, yeah, whatever is interesting for you as a developer. So yeah, we adopted Backstage back then and started building on, sort of, we get the service catalog, which is a really nice start. And since we have this ownership that I talked about earlier in the shuttle.yaml files, we can just propagate that into Backstage as well-
Daniel (21:41): Oh, nice.
Kasper (21:41): ... so now we have that searchable and really nice service catalog where we can see documentation, and we have some plugins that get the latest builds, and, I don't know, a GitHub integration that gets some of the details on GitHub as well, and just yeah, some different stuff around the service and how it's running. Also, we created a Backstage plugin for the release manager as well.
Daniel (22:07): Oh, very cool.
Kasper (22:09): You're able to see the latest artifacts that are available and you can just click on a button and it will be released to that environment.
Daniel (22:17): Oh, so you can deploy from the UI?
Kasper (22:19): Yeah.
Daniel (22:20): Nice.
Kasper (22:20): We're just interacting with our release manager. It's just an HTTP server. So, it's fairly easy to just click the button, and what is going on in the background is, of course, just YAML being moved around, basically.
Daniel (22:34): Yeah, nice. That's how all these things are, right? Whether it's us as humans or machines, it's mainly moving YAML around, right?
Kasper (22:43): Yeah, exactly. And then stuff happens. And that's really, really nice. And it makes it fairly simple to build this, yeah, whatever we call it, whether it's a developer control plane or it's a portal or whatever it might be, because just having that Git repo makes it easy to do really interesting stuff. The release manager, again, is just moving files around based on whatever you sort of pressed, and then stuff goes into production and you get a Slack message because we also have-
Daniel (23:12): Oh, that's nice.
Kasper (23:13): ... of course, a Slack integration. From sort of the developer perspective, we actually also have Slack integrations on each step in Jenkins, so you can see live how your build is going.
Daniel (23:26): Interesting.
Kasper (23:27): We report whenever a step is finished and then you can follow the entire life cycle and also the entire deployment cycle into the environment that you are releasing into. So, you also get a message when the release manager has processed your artifact, if there's a policy on it. You actually also get a message in the PR before merging it that, "Hey, if you merge this into master, it will go automatically into production because you have a policy saying that is what you want," just to make sure that they actually know what the impact of pressing merge is in that case.
And then whenever we have the five out of five pods running, we push a message as well, saying, "Hey, now your service is running out there and everything is working as it should be." And if it fails with a CrashLoopBackOff or something like that, we grab, I think it's the last 30 log lines or something like that, from the container that crashes-
Daniel (24:25): Oh, cool.
Kasper (24:26): ... and forward that back directly to them in Slack. So, they will get the message right away saying, "Hey, this is what happened in the container."
Daniel (24:34): Very cool.
Kasper (24:35): So, they don't have to go into kubectl logs.
Daniel (24:38): kubectl logs. Yeah.
Kasper (24:41): Yeah, exactly. They get it right away instantly.
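The log forwarding described above boils down to keeping only the tail of a crashed container's output and wrapping it in a message. The Python sketch below is a minimal illustration under assumptions: the function names and the message wording are invented for this example, fetching logs from the cluster and posting to Slack are left out, and the limit of 30 lines follows the approximate figure mentioned in the conversation.

```python
from collections import deque


def tail_lines(log_text: str, limit: int = 30) -> list[str]:
    """Return at most the last `limit` lines of a container log."""
    # A bounded deque discards older lines as newer ones arrive.
    return list(deque(log_text.splitlines(), maxlen=limit))


def crash_message(service: str, reason: str, log_text: str, limit: int = 30) -> str:
    """Build a chat-style notification for a container that failed to start."""
    lines = tail_lines(log_text, limit)
    header = f"{service} failed to start ({reason}). Last {len(lines)} log lines:"
    return "\n".join([header, *lines])


if __name__ == "__main__":
    # Simulated log output from a crashing container.
    sample_log = "\n".join(f"line {i}" for i in range(100))
    print(crash_message("payments", "CrashLoopBackOff", sample_log))
```

Sending only the tail keeps the notification short while still showing the lines closest to the crash, which is usually where the cause appears.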
Daniel (24:43): That's cool.
Kasper (24:44): Really nice. Then just missing a button saying, "Hey, roll back." Or something, but-
Daniel (24:49): Oh, so it doesn't auto roll back? You got to do that yourself? Go in and say, quickly roll this back. If you can't fix it forward, roll it back quick?
Kasper (24:57): Yeah, exactly. But we have, luckily, and that's kudos to Kubernetes for sort of the rolling update strategy, which takes care of that.
Daniel (25:06): Oh nice, yeah.
Kasper (25:08): It just spins up the container that crashes and nothing happens to what is running...
Daniel (25:12): Oh, got you. That's kind of, yeah. So that you limit your actual blast radius, so to speak?
Kasper (25:18): Yes.
Daniel (25:19): Yeah, nice.
Kasper (25:20): That's a really nice feature to utilize. The new container just crashes and then nothing happens, and then you can just roll it back, so you have the same state as you had before. It would be nice to have some automatic stuff going on there, but it works really well as it's running right now.
Daniel (25:37): That sounds very cool. There's a couple of things I want to dive into there. That was a super genius. Thanks a lot. And I'm sure again a lot of folks listening, this is where they're aiming for, right? This sort of model of GitOps. And as you were talking earlier, I definitely heard that with the release manager, I've seen similarities, I think, with Argo CD and with Flux. Those are probably the two big sort of open source CNCF projects in this space.
Daniel (25:58): And then towards the end, I was kind of curious, was there a mention of Argo Rollouts or Flagger, kind of the canary, and these kind of things. And what was the choice for you doing your own thing? Was it a case that those tools didn't exist? Or did you have some specific requirements perhaps?
Kasper (26:13): We actually have Flux running. I mean, we use Flux to, it's basically just applying whatever's in a specific folder and I think-
Daniel (26:22): The synchronization.
Kasper (26:25): ... because the release manager, yeah, it's basically just moving files around in that repo and then Flux applies.
Daniel (26:31): Oh, interesting.
Kasper (26:31): And then we have the last component running in each environment, which is something that we call the release daemon, which listens. We annotate all the deployments with an annotation saying, "This is managed by the release manager." And if it sees that, it just reports the state back and grabs the logs, if that's the case, or, if it's a CreateContainerConfigError, you get the message saying, "Hey, you don't have this config map," or, "This key is missing." Or whatever it is.
That's sort of the last step. We have Flux applying. We actually also had an integration between Flux and the release daemon, because Flux has the option to send events out-
Daniel (27:11): Yeah, of course.
Kasper (27:12): ... because that would catch some of the not so good stuff in Flux v1.
Daniel (27:17): And you're right, right now.
Kasper (27:21): Something with the duplicate definitions and stuff like that, stopping the pipeline and all that. We actually sort of created a WebSocket as a means of integration. You can get that and you can communicate that-
Daniel (27:30): Oh yeah, nice.
Kasper (27:31): ... back. But that's fixed in v2. That's really nice. But yeah, we do use Flux, and Flux was back then the only tool available. Weaveworks had this UI stuff where you could hook in the repo and move files around, but-
Daniel (27:51): I do have that, yeah.
Kasper (27:54): ... it wasn't really what we were looking for. And we had a lot of discussions around sort of the GitOps flow that Weaveworks was presenting at that time being that Flux should trigger on a registry event saying, "Yeah."
Daniel (28:09): Oh, that right. New container. Yeah.
Kasper (28:11): Yeah. There's a new container available in this registry. If Flux sees that, it gets an event, applies that thing, and as a consequence sets the image on this deployment, and then it will sort of commit that back to the Git config repo as well-
Daniel (28:22): Oh, that's right. Yeah.
Kasper (28:25): ... which was something that we really didn't...
Daniel (28:26): The mutating of the YAML, as well as the read. Yeah, I've had a few folks over that.
Kasper (28:29): Yeah. We were sort of creating what we call the one-way flow instead, where we had the release manager move files around and say, "This is the desired state, and Flux, you are only applying whatever we sort of specify as the desired state. That's your only job."
Daniel (28:45): Interesting. Separation of responsibilities, right?
Kasper (28:48): Yeah. And then we just disabled all the features of Flux and just, you only need to deploy or apply this thing if something's changing and that's everything.
Daniel (29:01): Oh, that's super interesting. Yeah. So, just the... Yeah. Sorry, you are saying?
Kasper (29:03): Flux is only sort of listening to two events: from the config repo and from whatever is the sync loop going on inside of Flux. So, if you manually go in and do something like kubectl label, it will roll it back, of course.
Daniel (29:16): Nice.
Kasper (29:16): But that's the only two things that it really does.
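The one-way flow described here, where one component writes the desired state and an agent only applies it, can be sketched as a small reconciliation loop. This is a minimal illustration in Python, not Flux's or the release manager's actual code; the `reconcile` function and the dictionary layout of the state are assumptions made for the example.

```python
from copy import deepcopy


def reconcile(desired: dict, live: dict) -> list[str]:
    """Make `live` match `desired` and return a log of the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            live[name] = deepcopy(spec)
            actions.append(f"create {name}")
        elif live[name] != spec:
            # Any drift, including a manual edit, is overwritten.
            live[name] = deepcopy(spec)
            actions.append(f"revert {name}")
    for name in sorted(set(live) - set(desired)):
        # Anything not declared in the desired state is removed.
        del live[name]
        actions.append(f"delete {name}")
    return actions


# Hypothetical desired state, as it might be written to a config repo.
desired = {"payments": {"image": "payments:1.4.2", "labels": {"team": "core"}}}
live = deepcopy(desired)
live["payments"]["labels"]["debug"] = "true"  # someone edits a label by hand

print(reconcile(desired, live))  # ['revert payments']
print(reconcile(desired, live))  # []
```

The second call returns an empty list because the live state has already converged, which mirrors the point that the agent only acts when something has changed.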
Daniel (29:21): So, just to confirm my understanding, that's been super useful. The release manager is more about the promotions through environments than it is actually the syncing, because that comes up with some folks, even whether they're using Argo, or Flux, or other tools is how do you promote effectively between environments from dev, to staging, to ultimately production, right?
Kasper (29:39): Yeah.
Daniel (29:39): So yeah, you built a tool to do that. That is very interesting. Any thoughts about open-sourcing the release manager, or is it too specific to what you do?
Kasper (29:48): It's not open-sourced, it's publicly available. Let's call it that.
Daniel (29:51): Okay, interesting.
Kasper (29:55): I think we sort of almost lose out all the sort of the Lunar-specific stuff in there.
Daniel (30:01): Cool.
Kasper (30:01): Our plan initially was to create it as an open source project, but we don't label it as an open source product, but it's publicly available.
Daniel (30:07): Oh, that's interesting.
Kasper (30:08): It's a public repo. You are able to go in there and have a look around if you're-
Daniel (30:13): Oh, brilliant. Yeah, definitely in the show links, I'll link it, because I've played with Shuttle after you and I chatted a while back. I remember looking at Shuttle. I think that Airbnb and a few other folks had done something similar, so it's good to see that there's a lot of sort of coming together of like, "These are the goals we want as a developer. These are the functionalities we want." I think that's super interesting.
I want to also touch on Backstage. Backstage is super interesting. You mentioned a single view of the world, the single pane of glass. I think we're seeing a lot of folks talk about that now, particularly with Backstage being a CNCF project, profiles being raised, right? You need a service catalog, you need a single pane of glass. Is it useful when you're onboarding folks, and it can be new developers, new engineers, or even new teams to a new service, right? You've got 200 services. I'm guessing there is a little bit of movement there between teams. I wonder how useful is something like Backstage for onboarding folks?
Kasper (31:04): Yeah. We actually do have, we are hiring people like insane. We've been focusing a lot on the onboarding process as well and we actually right now use Backstage to onboard people.
Daniel (31:15): Interesting. Yeah.
Kasper (31:17): We are just going through our onboarding documentation, which of course is in Backstage, with links to different services and the service catalog, and you can browse around and see how everything is sort of laid out and find diagrams and whatever is sort of needed for you-
Daniel (31:33): Oh nice. Yeah.
Kasper (31:34): ... to get started. So, everything is in Backstage for documentation, and you have the option as a new joiner, just to sit down, browse through the different services and see what is out there and how are people doing stuff? Then as a complement to that, we have Sourcegraph, which is a really nice tool as well in terms of onboarding people because it's just easy when you are sort of trying to, so this is probably when they are onboarding and starting to build a new feature or a new service or something like that, just to be able to browse through how are other people doing it, finding references across repos-
Daniel (32:11): Oh, interesting.
Kasper (32:11): ... it's a really, really powerful feature to see that. We have a lot of different libraries and a lot of different variants of how you can create a service internally to just be able to find different references to how are other teams doing this, is really valuable.
Daniel (32:28): It's a bit like Copilot, yeah. Like GitHub Copilot, but not quite as advanced. Right? Maybe you don't want that anyway, but just to look for similarities, look for best practices?
Kasper (32:37): Yeah, it would be really nice to have a Copilot for your own sort of config.
Daniel (32:42): Yeah, agreed.
Kasper (32:43): That would be really, really powerful, especially to rely on standardization on how you should do stuff or how is it mostly done in most services.
Daniel (32:58): Super interesting, Kasper. Yeah, now I like that a lot. As in, I think when I've onboarded with companies it's like, "How do I get my code? How do I push it through the pipeline? How do I monitor it?" But then once you've got those kind of basic things done, it is very much like, what should I be doing? What's going to make my teammates' lives easier, right? And particularly if you can't find the API docs, you probably call the API wrongly. Right? I've done that before like, "I'm sure this works." Right?
Kasper (33:22): Yeah, but that's also-
Daniel (33:22): So, do you have some things around that as well?
Kasper (33:25): ... Yeah. That's also something that we have available in Backstage. For all services, we use Swagger for HTTP, and we use gRPC and protobuf, which is also visualized in Backstage. When you go into Backstage and look for a specific service, you can see the gRPC setup, the protobuf specifications-
Daniel (33:48): Oh nice.
Kasper (33:48): ... You can see the layout, what are all the endpoints? And also see the GraphQL if that's a GraphQL API.
Daniel (33:54): Oh, very nice.
Kasper (33:57): But the last thing that we are missing right now is being able to sort of see events, event definitions and stuff like that.
Daniel (34:04): Oh, the AsyncAPI wrapped into that.
Kasper (34:07): We are not sort of following a formalized format as it is right now. We are sort of in the process of migrating to a standard that we can actually visualize also in Backstage. So we can see-
Daniel (34:18): Oh, interesting.
Kasper (34:19): ... exactly what is going in and out of each service and see the contract, so to speak. Yeah. Another thing that you also touched upon is setting up all of this different stuff when you are a new person, right? How do I set up a Git repo? How do I get the CI to actually push to a registry somewhere? And how do I actually get that stuff out? We use the Backstage scaffolder for actually doing this.
Daniel (34:45): Oh, interesting. Yeah.
Kasper (34:47): Backstage has this really cool feature of the scaffolding where you can just hook in whatever you sort of need. So, I think we have, what is it? Four or five steps that you need to fill out: which team, what is the name of the service, description, and then you can just click, and then everything is being set up for you.
Daniel (35:07): Nice.
Kasper (35:07): You get a Git repo out of the box. You get some default code, how the code is laid out in the repo. You actually also get this thing deployed into production if you want to with a single opinion point.
Daniel (35:22): Oh, cool.
Kasper (35:23): Just to be sure that everything is lined up and available for you to actually go in and develop whatever feature you need to develop, and so you don't have to go through the hassle of figuring out how to actually set all of this up, because it can take time, because there are a lot of different systems that you need to configure and-
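The "fill out a few fields, click, and everything is set up" flow can be sketched as a form that is validated once and then expanded into an ordered plan. This is a minimal Python illustration, not the Backstage scaffolder API; `ServiceRequest`, `plan_scaffold`, the naming rule, and the exact step wording are all hypothetical, with the steps themselves taken from what is described in the conversation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRequest:
    team: str
    name: str
    description: str


def plan_scaffold(req: ServiceRequest) -> list[str]:
    """Validate the form once, then return the setup steps in order."""
    # Assumed naming rule: lowercase letters, digits and dashes only.
    if not re.fullmatch(r"[a-z][a-z0-9-]*[a-z0-9]", req.name):
        raise ValueError(f"invalid service name: {req.name!r}")
    if not req.team or not req.description:
        raise ValueError("team and description are required")
    return [
        f"create Git repo {req.name} owned by {req.team}",
        "add default code layout",
        "configure CI to push images to the registry",
        "deploy to production",
        "create dashboards from template",
    ]


request = ServiceRequest("payments-squad", "card-ledger", "Tracks card transactions")
for step in plan_scaffold(request):
    print(step)
```

Validating up front means a bad name is rejected before any repo or pipeline is created, so a failed request leaves nothing half set up.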
Daniel (35:45): That's very cool. We've often talked about this, like different names, that Dan North calls it a dancing skeleton. Yeah. A pure in his RSV of engineering is a version zero all the way through, from being able to code it to it actually running in prod, and often, yeah, I'm sure you've been in the same, right? Places I have worked, to get something into prod it's taken months. Right?
Daniel (36:06): And I won't mention any names here, but I remember rocking up, getting pretty quickly set up with my local dev environment, but I was like, "I want to push this to prod." And it's like, "Oh, we need to hook this thing up to Jenkins." Or was it that Rundeck needs to be hooked up? Or you want monitoring as well?
Daniel (36:18): I remember just going back and forth to the Ops team just going, "Can I get this? Can I get that?" So, you're saying you can literally push a button, pick a time. Right?
Kasper (36:26): Everything is set up. Yeah.
Daniel (36:26): 10 minutes later or whatever. Yeah. How cool is that, right?
Kasper (36:29): It's really awesome. And we also have stuff like dashboarding and stuff like that. It's also templated. So everything you need, it's just, fill out these five boxes and then you have everything set up and it takes five minutes or something to actually get it into production and then you're ready and you can just take ownership and implement the feature that you need.
Daniel (36:49): Yeah, that's awesome. That is super cool. So, closing the loop now, I guess what's it like from the incident management point of view? You've, say, pushed your service out, right? After the work of a couple of months it's now a tier one service and it blows up. As a developer, would you use Backstage as your jumping off point, or PagerDuty, which is a tier first, perhaps, that kind of thing?
Kasper (37:09): Yeah. As it is right now, we don't have that sort of all the signals into Backstage as it is now. Right now it's more in the monitoring tooling and of course, PagerDuty if-
Daniel (37:20): I got you.
Kasper (37:22): ... something really goes off and needs to page the on-call rotation. But we are also in that sense, migrating from being a small startup to actually now being a lot of people with a lot of responsibility and figuring out how do we actually do this. And now I think we are getting to a point where we are big enough to have on-call rotations.
Daniel (37:44): Nice.
Kasper (37:44): On different teams and stuff like that, because as it is right now, we just have a group of engineers spread across all the different teams with different skill sets that are sort of doing the rotations right now, just to be able to take on and manage both the-
Daniel (38:01): It's an extra responsibility. Right?
Kasper (38:03): Yeah, it is. But we also have actually a PagerDuty integration into Backstage, which allows our developers, our support team or whatever, they can trigger the on-call directly from that.
Daniel (38:14): Oh, interesting.
Kasper (38:15): And they can see who is actually on call as well. That's really, really nice. It was just a plugin and we just, that's been a little bit... There's a lot of interesting projects around Backstage and target teams created because it seems like people are sort of right now consolidating a bit on Backstage as this single pane thing, which means that all the different vendors out there need to create a plugin for-
Daniel (38:43): That's it.
Kasper (38:45): ... Backstage to actually hook in all these different tools that we need. We also have the Snyk plugin. We're a customer of Snyk, so they created a plugin, just plug it in and it's awesome.
Daniel (38:59): That's very nice. Could you see Backstage being your incident management dashboard in the future, Kasper? I know you said mostly Grafana and Prometheus and looking at your logs now or whatever, but I'm wondering, particularly as a company grows, right? You might be paged for a service that is not your service or your service is crashing because of the dependency. That's a classic, right? And then just even figuring out who owns this service is super valuable. I wonder, could you see a point where maybe a bit of a loaded question, right? But could you see a point where Backstage became your incident management portal?
Kasper (39:31): Yeah, maybe. It depends on, I guess the sort of development within the plugin itself with people. I don't think it would be something that we were probably going to put a lot of effort into. Right now, we just needed the trigger button basically.
Daniel (39:49): Yeah, I just hope.
Kasper (39:51): Exactly. But yeah, why not? You can do everything in Backstage or whatever. Why not? But I think we need to figure out, what is the responsibility of a plugin in Backstage versus the actual product.
Daniel (40:06): Well, that's an interesting question, isn't it?
Kasper (40:09): Yeah. How much should the plugin actually do in terms of providing some kind of insights into what's going on? And if you need more insights or specific actions, you need to go into the tool.
Daniel (40:21): Oh, that's interesting.
Kasper (40:21): I think that's what we see sort of the bridge where it is right now. Present a higher level overview of whatever it might be, and if you have sort of a need to go into more details, click the button and you will be taken to whatever vendor or whatever project it is for diving into the details. But that might change.
Daniel (40:44): Yeah, that's super interesting. So that is a really nice observation. I'll think a bit more about that because it almost reminds me of all the things I've worked on over the years, even API Gateways, where responsibilities sort of bled in inappropriately. Your business logic ended up in an API Gateway and we're like, "No, no, no, you should not have business logic in an API Gateway. Separation of responsibility, separation of concerns." And I hadn't thought too much about that from a plugin perspective, but that's super interesting. Yeah.
Kasper (41:11): Yeah, that's one of the cool things around Backstage is that we don't have any data or anything like that within Backstage that is false. We are just delegating to all the services to get the information and then we can present it in the way we see fit in terms of providing valuable information to whoever's going to look at it. It doesn't need to be developers. We can present all kinds of information if that's what you need. Just having that place where we can figure out what is, let's take monitoring as an example, right? We probably want some of the four golden signals-
Daniel (41:48): Yeah, of course.
Kasper (41:49): ... put in for the squad itself, maybe for the service itself as well, and then that's probably sufficient if they are all green or within the right thresholds...
Daniel (41:59): Traffic lights, this kind of thing.
Kasper (42:01): Yeah, something like that.
Daniel (42:03): Green, amber, red. Yeah.
Kasper (42:04): Yeah. And then, if you need to, click on, I don't know, the one that's red and you go into Grafana or wherever it might be. I think that's sort of the separation that you're looking at right now but yeah, who knows? I think it's implementing something about Grafana within Backstage that's probably not kind of, I don't know. Everything can happen.
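The traffic-light overview discussed here, where a portal shows only a rolled-up status and the detail lives in the monitoring tool, can be sketched in a few lines. This is a minimal Python illustration under assumed signal names and thresholds; it is not a Backstage or Grafana API.

```python
from enum import Enum


class Light(Enum):
    GREEN = 0
    AMBER = 1
    RED = 2


def classify(value: float, amber_at: float, red_at: float) -> Light:
    """Higher is worse: at or above red_at is RED, at or above amber_at is AMBER."""
    if value >= red_at:
        return Light.RED
    if value >= amber_at:
        return Light.AMBER
    return Light.GREEN


def rollup(lights: list[Light]) -> Light:
    """A service is only as healthy as its worst signal; no signals counts as GREEN."""
    return max(lights, key=lambda light: light.value, default=Light.GREEN)


# Hypothetical readings as (current value, amber threshold, red threshold).
signals = {
    "error_rate": (0.02, 0.01, 0.05),
    "p99_latency_ms": (180, 250, 500),
}
lights = {name: classify(*reading) for name, reading in signals.items()}
print(rollup(list(lights.values())).name)  # AMBER
```

Only the rolled-up light needs to be shown in the overview; the per-signal values stay in the tool that owns them, which matches the plugin-versus-product split being discussed.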
Yeah. It's a concentrator. Even I think with views in commercial monitoring SaaS, do you know what I mean? It's that if you present every bit of information up front, you're just bewildered, so I'm thinking of Datadog, I'm thinking of even Lightstep, Honeycomb, these kind of things, that they've really thought about their correct abstractions top level, like traffic lights often, and then dive in and these kinds of things. I think that's yeah, super interesting.
Just sort of following on from that, Kasper, in terms of, say, folks with blameless postmortems, those kinds of things, do you use any information from Backstage to do postmortems? Say an incident's occurred, we all have them, right? And then you've addressed that. Is there any data you would collect or display in Backstage? Or do you mainly just get around a table or a virtual table these days, of course, and just discuss the incident? Or I'm kind of curious, because for me, incidents are inevitable as the company grows and as the users use the systems in different ways. Managing them can be a make or break moment. You know what I mean? Can you bring the learnings from those incidents into the team?
Kasper (43:25): Yeah, that's... Right now, we don't have information in Backstage that is useful in a post-mortem scenario. It's mostly going to Grafana to figure out what happened, or finding logs or whatever it might be. And we are actually using Confluence to write it down.
Daniel (43:45): Oh, good.
Kasper (43:47): It's just really embarrassing, actually.
Daniel (43:49): I think actually, you have one list to get this. It's nodding along Kasper. We've all done it, right? Jira, Confluence, we've all done it. But we are definitely looking into a PagerDuty customer as well, and actually gathering all of the details around the incident in the actual incident tool, because that's a nice feature as in whenever the... And then they're prying confined find postmortems from older events and-
Kasper (44:22): Oh, and so, this is how you mediate that, for example?
Daniel (44:23): Yeah, exactly.
Kasper (44:24): They have some AI stuff and everybody has something like that, right?
Daniel (44:26): Yes. AIOps, right?
Kasper (44:29): But it sounds cool. And I think it's something that could help us out going in that direction at some point.
Daniel (44:35): Well, that's interesting.
Kasper (44:36): Just having everything around an incident in one place that is also the place where the stuff that actually triggered, so we get the timeline. You can probably also get paid to do, just to scrape whatever is scratched on it created and get the timeline laid out or whatever it might be. That is something that people will look into in the future as well. But right now, it's a process where we meet up, so that was an incident we meet up whoever's sort of relevant for that incident and we just talk it through and write down. We have a template. What would be-
Daniel (45:10): Oh, cool.
Kasper (45:10): Yeah. What-
Daniel (45:12): Like in Confluence, like a template kind of, almost like a Word doc, right?
Kasper (45:14): Just click on that and then you get a table with the timeline and you can just enter whatever is the timeline and you'll get what did we do well? What we did?
Daniel (45:22): Oh, nice.
Kasper (45:22): What did we not do so well? What are the action points? We also have some regulatory stuff to fill out, so there's something for that as well so it goes into the right department and all that. Right now, it's a template, and it works all right in Confluence, but it would be really nice to be able to have that link back whenever we sit in a situation, this thing, maybe I haven't seen this when I was on call, but if there's a link to a previous event that's sort of similar or has text that sort of indicates it's the same thing, then it would be really nice to just have that in the, yeah, the UI that is-
Daniel (46:02): Or even be able to jump to, like often I talk about it, sort of, in observability toolings, you're actually really looking for the insight these days. We're going to go and burn that level of give me the data, but when something bad happens, something like an incident, jumping straight to the remediation steps saying, "This looks very similar to what happened last week, and it was a database restart." And it's like, "Right. At least they can perhaps try that." Right? Or document in Slack, "I believe it's similar. I'm trying this." And then you're constantly improving. Right? I think joining the dots is really key for incidents.
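The idea of linking a new alert back to a similar earlier incident and its remediation can be sketched with a simple word-overlap score. This is a minimal Python illustration, assuming past incidents are stored as short summaries with a remediation note; it is not a PagerDuty or Confluence feature, and the incident data, function names, and the 0.5 threshold are made up for the example.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Share of distinct words the two texts have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar(alert: str, past: list[dict], threshold: float = 0.5):
    """Return the closest past incident, or None if nothing is similar enough."""
    best = max(past, key=lambda p: jaccard(alert, p["summary"]), default=None)
    if best is not None and jaccard(alert, best["summary"]) >= threshold:
        return best
    return None


# Hypothetical incident history.
past_incidents = [
    {"summary": "payments-db database disk full", "remediation": "restart the database"},
    {"summary": "login latency high", "remediation": "scale out the login service"},
]
match = most_similar("database disk full on payments-db", past_incidents)
print(match["remediation"] if match else "no similar incident")  # restart the database
```

Word overlap is a deliberately crude stand-in; a real tool would likely use richer signals, but the shape of the lookup (score, threshold, link to remediation) stays the same.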
Kasper (46:30): ... And so, what we do around runbooks is, we actually also use Confluence for that.
Daniel (46:37): Good stuff.
Kasper (46:38): Actually, all alerts have a Confluence page where it just defines what can you try out?
Daniel (46:45): Nice.
Kasper (46:45): Or what do you look at? Or yeah, what is the remediation if we know it? Or otherwise, just try to get to this dashboard or go to this and see if you can see anything that looks odd. Yeah. For all the known ones, like the database running out of storage, okay, it's a simple process. But if it's something we haven't seen before, it can be really hard to run a runbook.
Daniel (47:16): Oh, for sure. These diagnostics really, right? You're just like, "Try this, understand this, this could be an area you want to explore."
Kasper (47:21): Yeah, exactly. That's how we do it right now.
Daniel (47:25): Do you link the Confluence runbooks from Backstage, Kasper? I think that's kind of interesting. Right? You can mention your services. Do you link off to say, if there's an incident happening, here is the location of the runbooks?
Kasper (47:36): No, not as it is right now, but it's something that we definitely could do. And I think we will probably also move all of the incident templates and postmortems, runbooks and stuff into Backstage as well.
Daniel (47:49): Yeah, makes sense in it.
Kasper (47:50): Nice to have Markdown available.
Daniel (47:54): Yeah. And the tech docs, that's really cool in Backstage, right? Yeah, I like that. That's been awesome, Kasper. I'm just checking a few of my notes, we've covered so much. Actually, thank you so much. We've covered literally from the coding to the shipping with your release manager to the running and the incidents. Anything else you think is super interesting you want to share? I'm conscious of time as well, but this has been a fantastic tour of the landscape.
Kasper (48:16): I think there are some interesting challenges when you have different AWS accounts, different clusters running in those accounts, how do we actually sort of consolidate all the data in one single place and create the connection. If you, for example, we are not using the Kubernetes plugin for Backstage, but how do you actually want to, if you want to use that and you have Backstage running in one cluster, how do you actually get the information from all the other environments, is something that's probably not that sort of a solved problem out there as it is. We are using Linkerd multi-cluster-
Daniel (48:50): Oh cool.
Kasper (48:51): ... to connect clusters in one direction as it is right now, but how do we actually do that? And the second part is probably around what are the next steps for GitOps, right? So, what we are looking into, and this actually sort of came down from top level management, is that we just bought a new company, a company in Sweden. They are running on Microsoft Azure.
Daniel (49:22): Oh, classic. Right? Amazon and Azure?
Kasper (49:23): And now we'll be multi-cloud!
Daniel (49:23): Yes, surprise. Brilliant.
Kasper (49:26): So how do we actually do this? And how do we, as a central team, manage this multi-cloud thing?
Daniel (49:33): Super interesting.
Kasper (49:34): And connect clusters across clouds and stuff like that. So, what we are looking into for next steps for our GitOps processes, of course, we're looking at Cluster API.
Daniel (49:42): Of course, very nice. Yeah.
Kasper (49:45): If we can have Kubernetes as the foundation that we can put into all of the different clouds that we have stuff running in, then it makes it a lot easier for us to use the process that we talked about, all the tooling that we've talked about, because that is the foundation here is Kubernetes. If we have Kubernetes, I can put a Kubernetes Cluster in somewhere and link it back through our platform cluster, which is where most of the tooling is running. Then that is really, really powerful. And it just makes it easy for us to, if we acquire a new company at some point, right? e.g. Google Cloud.
Daniel (50:17): It's a classic story in different cloud. Yeah.
Kasper (50:20): Put a cluster in there and we can create a connection and if they're not on a Kubernetes environment, or whatever they are running, we can make it easy for them to migrate because we have everything set up in terms of doing this. And then, the next thing is of course, how to actually manage all the stuff that you use from that particular cloud. How do you, as a developer, if I need a specific database for my service because it's, I don't know, a high volume service or whatever it might be, or something that we want to really keep by itself, on its own instance. How do you actually do that?
I think many people are probably doing some Terraform stuff, but then you're missing the reconciliation loop that you get from the GitOps process and all the agents that we have running everywhere. We are looking a lot into Crossplane and-
Daniel (51:11): Just could imagine cross focus in there. I am focused. Super interesting, aren't they? Yeah. I like it.
Kasper (51:15): It's a really, really interesting project in terms of, they provide these low-level custom resources, right? For an RDS instance, and setting up the PostgreSQL database in AWS, it's not just creating an RDS instance. You also need to link it to a parameter group and a DB subnet group, yada, yada, yada. There is lots of different stuff you actually need to set up in order to get a database. And then Crossplane has this composite resource definition where you can actually compose, how does the PostgreSQL instance look from our perspective, to provide the sane default?
That also becomes really interesting in a multi-cloud scenario, right? Because if you sort of expose the PostgreSQL instance to your developers, then underneath in sort of the plain custom resources, if we can say that, we can just create it for each different cloud, how do we actually set up a database on that cloud? And then, sort of the abstraction, it's just the PostgreSQL instance when AWS or, PostgreSQL or whatever it might be. Right?
That is looking really, really interesting and it's something that we will be working a lot on in the coming half year, I guess, and figuring out how do we actually do this? And can it do all the things that we think it can? Because if we have that, we also get all the cloud resources that you need as a developer, get that into Git, have the reconciliation loop, we can put on the same labels and tags so you can create cost insights.
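The abstraction described here, where developers ask for "a PostgreSQL instance" and the platform expands that into whatever each cloud needs, can be sketched as a lookup from an abstract claim to per-cloud resource lists. This is a minimal Python illustration of the idea only; it is not Crossplane's API, and `PostgresClaim`, `COMPOSITIONS`, and the resource kind strings are hypothetical names chosen for the example (the AWS list follows the pieces mentioned in the conversation, the Azure list is an assumption).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostgresClaim:
    """What a developer asks for; platform defaults live here."""
    name: str
    storage_gb: int = 20
    public: bool = False  # default: no public endpoint


# Per-cloud expansion of the same abstract claim (illustrative kinds only).
COMPOSITIONS: dict[str, list[str]] = {
    "aws": ["rds-instance", "db-subnet-group", "db-parameter-group"],
    "azure": ["postgresql-flexible-server", "private-endpoint"],
}


def compose(claim: PostgresClaim, cloud: str) -> list[dict]:
    """Expand one claim into the concrete resources for the given cloud."""
    if cloud not in COMPOSITIONS:
        raise ValueError(f"no composition defined for cloud {cloud!r}")
    return [
        {
            "kind": kind,
            "name": f"{claim.name}-{kind}",
            "storage_gb": claim.storage_gb,
            "public": claim.public,
        }
        for kind in COMPOSITIONS[cloud]
    ]


claim = PostgresClaim(name="ledger-db")
for resource in compose(claim, "aws"):
    print(resource["kind"], resource["name"])
```

The developer-facing claim is identical for every cloud; only the `COMPOSITIONS` table changes when a new provider is added, and the private-by-default setting travels with the claim.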
Daniel (52:51): Oh, it's interesting. Yeah.
Kasper (52:53): And get all the benefits that are sort of missing when we have finance asking constantly, "Why are you using all of this money?" It's that squad over there.
Daniel (53:01): Yeah, the cost breakdown. IT chargebacks, right. Is that safe?
Kasper (53:04): Yeah, exactly. Just to get the insights because right now we have no clue.
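The cost insight being asked for here follows directly from consistent tagging: if every resource carries an owner tag, spend can be grouped per squad. This is a minimal Python sketch with made-up billing line items; the `squad` tag name, the record layout, and the amounts are assumptions, not any cloud provider's billing format.

```python
from collections import defaultdict


def cost_by_tag(items: list[dict], tag: str = "squad") -> dict[str, int]:
    """Sum cost per value of `tag`; resources missing the tag land in 'untagged'."""
    totals: dict[str, int] = defaultdict(int)
    for item in items:
        owner = item.get("tags", {}).get(tag, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical line items, cost in whole currency units.
line_items = [
    {"resource": "ledger-db", "cost": 120, "tags": {"squad": "payments"}},
    {"resource": "ledger-bucket", "cost": 80, "tags": {"squad": "payments"}},
    {"resource": "old-test-vm", "cost": 45, "tags": {}},
]
print(cost_by_tag(line_items))  # {'payments': 200, 'untagged': 45}
```

The `untagged` bucket is the "we have no clue" part of the bill; it shrinks as more resources are created through the platform with tags applied automatically.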
Daniel (53:10): Yeah. Very common, Kasper. I think very common. Yeah.
Kasper (53:13): It's just, yeah. Being able to have that tool in there, both in terms of a multi-cloud strategy, but also in terms of all the other stuff that we can potentially create with this, and then yeah. Operators and controllers and stuff like that, it's just more of that, because when we put those custom resources into our Git repo, we get the stuff that we talked about in the beginning. Right? We get audit, we get the log. We can see who did what. We get the PR, the review flow. We always have before us-
Daniel (53:45): They can change the workflow, the actual.
Kasper (53:48): ... And we can actually also take away access to the console if we want that. And so, you as a developer, don't really need to go into whatever cloud we are running on. You actually, and then in our case, with Shuttle, you can just say, "I want the object storage and it should have this name," for a project, for example. And then, if it's on AWS, it'd be just creating this S3 bucket, or if it's on Azure, creating whatever object storage they have, right?
Daniel (54:19): I like it a lot. I can see abstractions are the key here, Kasper, aren't they? And you mentioned a few times in terms of different parts of the workflow, but I think this is for their, well, it touches a whole bunch of the code-ship-run workflow, but getting the abstraction so that developers don't have to think about the differences of IAM roles and security groups and these kind of, and even choosing a disk, right. On Amazon it's like, do I want SSD, do I want standard magnetic disks? That as a developer, sometimes you care, sometimes you don't. Right?
Kasper (54:47): Exactly. And we just want to make, and as a platform we can put in the sane default for how do we actually create this, so we don't get developers that sort of, maybe they actually don't know it, or forget to push the button that says, "This is the private DB. Don't put a public endpoint on it."
Daniel (55:13): Yeah, that's a really good point. Even security, yeah, characteristics. Yeah, really a good point actually. Yeah. It's very easy to do bad things sometimes as a developer in the cloud, right?
Kasper (55:18): Yes. And then just, if you've been running Terraform or whatever config management, it's always, you've done it and then a couple of months or weeks or whatever go by. And somebody actually went into the UI and changed something anyway and then you are sort of figuring out how that fits.
Daniel (55:37): You don't get that slight constant convergence?
Kasper (55:39): Yeah. I think that's the key in everything that we try to do right now, is reconciliation. And having an agent that is actually doing this, because that's automation, that's what we all strive to do, and just have the system detect if something's wrong and mitigate and converge it back to whatever the desired state is.
Daniel (55:58): Yeah. Brilliant stuff, Kasper. Well, this has been, yeah, a fantastic tour. There's so many principles you're pulling out from all the best practices around GitOps and around a single view of the world and things and how you onboard folks. Yeah. Thank you so much. There's so many notes I've taken, right? I mean, it's going to take me a while and the team, Erica and myself, to break this down. Thank you. Thank you. Kasper, I really appreciate your time and your insight. You know where we're at, if we can ever help you, just give us a shout. We're more than happy to, because you're so generous with your time. Thank you very much.
Kasper (56:24): Yes, sure. Thank you for having me. It was really fun.