LIVIN' ON THE EDGE PODCAST

Developer Control Planes: An Experienced SRE's Point of View

Ambassador Labs · S2E01: Mario Loria on Cloud Platforms, Developer Experience, and SRE

About

Cloud-native software development has changed the developer experience but more fundamentally has changed how developers, and the organizations they work in, should think about developer responsibilities and ownership of the full software lifecycle.

Episode Guests

Mario Loria

Senior Site Reliability Engineer (SRE) at CartaX

Daniel Bryant, Director of DevRel at Ambassador Labs, recently spoke with Mario Loria, Senior Site Reliability Engineer (SRE) from CartaX, an electronic marketplace for private securities. In a wide-ranging discussion that covered ground from the changing developer experience to the ideal role of SREs in a modern, cloud-native environment, a number of key themes emerged:

An organization and its leadership needs to get behind the end-to-end "developer as service owner" mindset to make it work.

Developers should own the full life cycle of services but in most cases don't: "It should not be up to me as an SRE to define how your application gets deployed or at what point it needs to be rolled back, or at what point it needs to be changed, or when its health check should be modified." Developers should be capable of making these determinations and empowered to do so.

Developer education and mindset will need to change to embrace the "you build it, you run it" approach, with SREs helping to shape and support the developer-ownership mindset with appropriate platforms providing tools and an interactive self-service experience instead of riding to the rescue when things go wrong.

In this new environment, one of the best things SREs can do is focus on infrastructure and core services to support the main challenge and goal of a developer: shipping software safely at speed. The developer doesn't necessarily need to care about what platforms and tools are used but does need to be able to use them to, for example, canary a service or get service metrics.

To hand over complete ownership to developers, greater transparency and visibility into what's going on with their services is needed. To "liberate" developers from an over-reliance on SRE firefighting, and facilitate developer autonomy in finding their own solutions, a developer control plane centralizes and ties together the code-ship-run processes a developer needs to understand.

Transcript

Daniel Bryant (00:00): Welcome Mario. Many thanks for joining us today. Could you briefly introduce yourself and your background please?

Mario Loria (00:05): Sure. My name is Mario Loria. I am currently a senior SRE at CartaX, previously at Carta, previously at StockX. I've seen many different companies and industries and have built Kubernetes clusters in many of them and had the pleasure of understanding operations and the developer experience in many of those places.

Mario Loria (00:26): I am pretty active in the Kubernetes community. I am a CNCF ambassador. I am a certified Kubernetes. I really enjoy both the community and the problems that we're solving in the infrastructure cloud architecture world. I have a specific affinity a little bit to networking and load balancing and those pieces, auto-scaling. But have definitely been going deep diving in security lately at a recent role at CartaX. So I'm really happy to be here.

Daniel Bryant (00:54): You know we love networking at Ambassador Labs, right?

Mario Loria (00:56): Yeah, absolutely.

Daniel Bryant (00:56): Our pedigree is Ingress and Envoy Proxy. That's great stuff. Great stuff. So could you provide a high-level overview of either perhaps the CartaX architecture or architectures you've seen in general? It sounds like you've pretty much got a PhD in finance and equity management, these kind of things. It'd be great to get an overview of what a standard architecture looks like in that space.

Mario Loria (01:16): Sure, absolutely. I think for the most part, there's a monolith and there are a few microservices that are somewhat being decomped out. There's still some pieces of that functionality that we're trying to kind of usually pull out of the monolith, or make easier to manage within the monolith.

Mario Loria (01:34): What actually ends up happening is the monolith stays in sort of its own world and then we have a new world. The new world is that from-scratch kind of Kubernetes world with more of a kind of service-oriented architecture, not necessarily microservices, maybe macro services. But the idea is that they have the basic building blocks of what they need given by Kubernetes by default. So we get this idea that operations employees kind of maybe have to understand this new system, i.e., Kubernetes. Developers kind of know it exists, but they're more interested in, how do I interface and develop and deploy and then long-term operate?

Mario Loria (02:10): So in terms of architectures, that's the general pattern that we see. I think there's also a lot of things where people are handling for scale when they don't have any scale yet. I think there's a fine line there. So in terms of like overall architecture, there's generally like, they take a region, that's our key primary region, there's ingress traffic that flows in for whatever their app or service or platform entry point is. There's some front ends that answer that, the front ends work with certain backends, either through an API gateway or directly, or a hybrid of those. You might have front end or backend clusters. You might have one cluster that has everything and maybe namespaced out.

Mario Loria (02:50): There's many different configurations, there's REST and GraphQL, Kafka, some service mesh. There's a lot of different things that are kind of tangential to those other pieces that they depend on.

Mario Loria (03:02): So that's generally what I've worked in for the most part. I think auto-scaling, from working in the e-commerce world and in the web hosting world a little bit, becomes more of a priority there. When you talk about FinTech, what we're doing more at Carta, scale is almost not an issue at all. The scale actually comes from deploying and shipping features and so it's more internal on the development side, where they need to be able to QA.

Daniel Bryant (03:28): Oh, interesting.

Mario Loria (03:28): It's a much easier line item if you will. So there's a bigger focus there on what do disparate environments look like? What does progression look like? What do verification stages look like?

Daniel Bryant (03:37): Oh, super interesting, Mario. Definitely several points to dive into there. For us going a little bit higher level, in your experience in these spaces, are developers empowered to own the full life cycle of services from that designing to coding through running in prod, or is there some cutoff point there?

Mario Loria (03:55): That's a great question. What I've seen for the most part is that we think that developers should have that, but in reality they don't, because we actually end up coming to the rescue. In most cases when it comes to either the deploy or the operate side of the table, we come to the rescue as SREs. What happens is that the developers get the sense that basically there's an SRE team and I will ping them whenever there's an issue, instead of taking it upon themselves to handle that.

Mario Loria (04:28): The other side of that is the organization, the higher level strategy and leadership has to very much agree that this is the line of delineation where we're going to say, "A service owner is a developer. They own the service from start to finish." So, I think there's two parts. It really depends on the organization, it really depends on the developers and where they're coming from and their history.

Mario Loria (04:51): The other part of this is that, I think SREs take for granted the power of education in many organizations and so we end up not educating. We end up saying, "Look in the repo." So, that intuitiveness is not there, that UX is not there and so we think that developers should have more of a handle on things than they do and they almost never have remotely the understanding that they really need to handle the services. So what ends up happening is a developer says, "Okay, I don't know if my pods are restarting, I don't really know what's going on. All I know is that I know kubectl get pods." Then they say, "Okay, well now I need an SRE."

Mario Loria (05:30): That's where you get an influx of tickets and ad hoc requests for SRE teams that say, "Okay, well, this is completely broken. You completely broke it up." That's it. The end of it is I fixed it for you, it's done and the developer is happy and can move on doing what they were doing in the first place, but we actually haven't solved anything long-term. The issue is still the fundamental education and understanding, not too much power, but too much impetus of putting the developer in the driver's seat and saying, "Okay, here you go." But they only have experience with a Chevy and we're giving them a McLaren to drive. So it's a very different world.

Mario Loria (06:15): That's generally what I've seen. I think what more companies are moving to, and I think Netflix might've started this model, or I'm sure it's been around for a while, is where we say, "We're actually going to let developers do a lot more, we're going to let them use any tool they want." Like in this example, go ahead, use Telepresence. It sounds like a very simple networking problem. Do your hybrid deployment or development and go and just move, innovate, do what you need to do and you are taking the tool on yourself.

Mario Loria (06:44): As a developer, I'm saying I'm using Telepresence. So now the buck stops at me, I'm not going to SRE for help, because SRE didn't tell me to use Telepresence, they're not supporting a local Telepresence self-service or anything like that. The self-hosted version.

Mario Loria (06:57): I think, what I'm starting to see is, developers being more willing to take these things on, but us needing to provide them a surface area, the visibility and the understanding of what their actions and the repercussions of what they're doing, how those kind of ripple through the rest of the organization and really the platform. So how does me pushing this code, changing this feature and using this new tool actually impact the overall platform? Is GraphQL going to get overloaded? Am I going to hurt other services? Am I stepping out of bounds of where we kind of expect our API versions to be working right? There's a lot of questions that come up there.

Mario Loria (07:38): So, I think for the most part, the way I kind of operate and the way we're working at CartaX, is that we want to empower developers and shift-left and actually give them the ability to make more decisions and do more autonomously, but with an interactive self-service, self-sufficient sort of experience into the more complex pieces of the operations side. That's the reducing toil and all the other things that we talk about.

Mario Loria (08:03): I'll stop there. I feel like I'm talking too much.

Daniel Bryant (08:06): No, that's great Mario. No, great context. As I was listening to you there, I clearly heard the education piece. That's so powerful, I think for developers, for SREs, always paying it forward and sharing the knowledge, thinking when you're asking folks to take on more responsibility. I mean, Netflix, you mentioned they have this culture of freedom and responsibility. Exactly what you've just reiterated, which I think is well said.

Daniel Bryant (08:26): I also heard you talking about perhaps baking some of this into the tooling itself. Now you mentioned Telepresence, thumbs up from us, of course. CNCF project, great stuff. Have you built any other tools to support this kind of thing? If so, did you try and bake in some of those understanding mechanisms, those education mechanisms, even?

Mario Loria (08:47): Yeah. I will say that the building of tools is actually something that I have not seen done in most organizations. They're not really building tools to solve these issues. What they're trying to do is shop around and say, "What is out there that solves this for me, that gets me 90% of the way? Then I'll consider later if I really need that 10%." I have not worked in organizations though that are doing SRE to the Google standard. They don't have to. We don't all have to copy Google, it's not copy paste. We're not Google either, if we're being real with ourselves.

Mario Loria (09:23): So what I have seen, and this is the nature of where I worked and I haven't really worked in a lot of the bigger tech, maybe more modern engineering organizations, if you think of the West Coast class. But what I have seen is the SRE team is by default this DevOps support organization. They are very much doing the ops, they are not very much doing the dev. There's this false sort of thinking where, oh, well they're doing dev because they're writing Terraform, they're writing Kubernetes YAML manifests.

Mario Loria (09:56): So we basically say, "Well, they're writing config." Config is the dev part of DevOps. Actually, I think that that is the part where we get caught, because instead of them writing I don't want to say scripts, but Python automation, let's say operators and controllers in Kubernetes, it's a great example.

Mario Loria (10:14): Instead of them writing those sorts of things, they think their job is just to maintain in terms of the provisioning side of things, in terms of like making sure Helm is happy, and other pieces like that and that's what they get locked into. Because A, they don't have time to go learn how to write a controller. B, they have to maintain. C, they're probably understaffed and overwhelmed already. D, remember going back to ad hoc requests, my service keeps restarting, I'm just going to go to SRE. That education isn't there, you've got developers that have needs.

Mario Loria (10:45): So, the ratio of developers to SREs is growing and growing as organizations get larger generally. Now, this is starting to change. I have seen where an SRE team starts with three people. This was actually me at StockX and then they actually said, "Okay, we're going to actually make this a platform team, which has sub groups or sub focuses, concentrations, I should say, where security, developer experience, cloud infrastructure." Let's just say those are three examples. So they build a platform engineering team that's 10 people.

Mario Loria (11:18): Now, we have more focus in these areas. We have a line of delineation, you work on this and you work on that and you enjoy that. If they need to cross and mesh and discuss and work on a problem as one huge unique solution for what we're trying to do as an organization, they can do that.

Mario Loria (11:38): The big thing there, I think is you're never going to get anything perfect, I think you have to try a lot of things first. I think you have to experiment. You can't just go to the sre.google book and say, "Okay, I'm just going to do everything here." You have to take as much as you can, soak up, listen to how other people are approaching some of these problems and then try things in your own organization.

Mario Loria (12:03): This is why I preach so much that a lot of it does start with the developer experience and how we think about empowering and the communication patterns that we have with our developers. Instead of, it's their problem, it's really our problem, because it's all the business. It's what we're trying to do, everyone is trying to do the same thing. If that's broken, then that's a whole nother issue. But that's the way I think about it.

Daniel Bryant (12:28): Yeah. Love it. Love it. I'd love to cover something you mentioned, your role and summarizing there is around the shift-left responsibility. You mentioned developer experience at the end, super, super important. As a sort of developer, ex-developer, reluctant operator, I totally get the developer experience.

Daniel Bryant (12:42): How important do you think it is for developers to design applications to be compatible and also take advantage of modern things like, say, canary releasing? We've talked about that in the context of API gateways, in the context of service mesh, this kind of thing. You need to think about observability from almost day one, if you're doing that as a developer. That traditionally has been more of an SRE ops type thing. I'm kind of curious how important you think it is for developers to really understand this kind of stuff, the shift-left?

Mario Loria (13:12): Yeah. Going back to ownership, when we talk about that, I think what we fundamentally are saying is, it should not be up to me as an SRE to define how your application gets deployed or at what point it needs to be rolled back, or at what point it needs to be changed, or when its health check should be modified. It's not up to me as an SRE to do that.

Mario Loria (13:32): So if we're saying the developer owns that, then henceforth, we're saying that the developer has to be involved. When I go to release this as a canary release in production, what are the values? What feature flags do I have enabled? What are the percentages? What is the timing that we're going to... retries, timeouts, things like that?

Mario Loria (13:51): So, I think the best thing that we can do as SREs is to implement the platform. I hate the word platform, it's overwhelming. Implement the backend. I like to think of it as like the backend, the front end is the area that the developer is touching and that front end enables backend things to happen. You could think of like a backend is like the SRE team, we control node groups. There should never be a developer that needs to ...

Daniel Bryant (14:18): Oh, got you. Pure infrastructure kind of controls.

Mario Loria (14:20): Exactly. Pure infrastructure and core services, maybe. As an SRE, what can I install and provide the tooling for, to leverage. An example of this, you're talking about Canary, we would handle making service mesh a thing. I think people get caught up in what service mesh do I use? How do I interact with it? All of those pieces.

Mario Loria (14:44): At the end of the day, when you're a developer, Daniel, you don't actually care about if it's Linkerd or Istio. That doesn't even matter to you. At the end of the day, you just need to canary your service. You need to do it safely, you need to know what your controls are, you need to know what the workflow looks like, and you need to know what is the kill switch if I have an issue, you need to understand where you get your metrics, you need to understand what the metrics mean, logs and other pieces. Really, have like a holistic view of what's actually going on with your service.
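To make the canary controls described here more concrete, the following is a minimal sketch of what a developer-owned rollout definition might look like. It assumes a hypothetical `CanaryPlan` structure and `next_action` helper invented purely for illustration; neither is an API of Linkerd, Argo, or any other tool mentioned in this conversation, and the service name, flag name, and thresholds are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class CanaryPlan:
    """Hypothetical, developer-owned rollout settings for one service."""

    service: str
    traffic_steps: list[int]   # percent of traffic sent to the canary at each stage
    hold_seconds: int          # how long to observe each stage before advancing
    retries: int               # per-request retry budget
    timeout_seconds: float     # per-request timeout
    max_error_rate: float      # kill-switch threshold, as a fraction between 0 and 1
    feature_flags: dict[str, bool] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject plans that cannot describe a safe, complete rollout."""
        steps = self.traffic_steps
        if not steps or steps[-1] != 100:
            raise ValueError("the final step must send 100% of traffic")
        if any(not 0 < pct <= 100 for pct in steps) or steps != sorted(steps):
            raise ValueError("steps must be non-decreasing percentages in (0, 100]")
        if self.hold_seconds <= 0 or self.timeout_seconds <= 0 or self.retries < 0:
            raise ValueError("durations must be positive and retries non-negative")
        if not 0 <= self.max_error_rate <= 1:
            raise ValueError("max_error_rate must be between 0 and 1")


def next_action(plan: CanaryPlan, stage: int, observed_error_rate: float) -> str:
    """Decide what to do after observing the canary at the given stage index."""
    if observed_error_rate > plan.max_error_rate:
        return "rollback"  # the kill switch: stop and return traffic to the stable version
    if stage + 1 < len(plan.traffic_steps):
        return f"advance to {plan.traffic_steps[stage + 1]}%"
    return "promote"


# Illustrative values only.
plan = CanaryPlan(
    service="payments-api",
    traffic_steps=[5, 25, 50, 100],
    hold_seconds=300,
    retries=2,
    timeout_seconds=1.5,
    max_error_rate=0.01,
    feature_flags={"new-checkout": True},
)
plan.validate()
print(next_action(plan, stage=0, observed_error_rate=0.002))
```

The point of a shape like this is that the developer states percentages, timing, retries, timeouts, and the rollback threshold in one place, while the platform decides which mesh or gateway actually enforces them.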

Mario Loria (15:14): So, let's zoom in on that front end. This is where I think Backstage, DCP, and many other solutions are coming in. That's the front end. That should be a gateway to understanding these backend components. If Linkerd is showing that we've got retries going on, that needs to be surfaced to the developer. Do they need to have Linkerd installed on their laptop and know the 14 commands to figure out all the status of their service? No, they shouldn't have to. So, how do we provide that information in an easier to digest, more applicable and relevant way to what their challenge is and what they're trying to do?

Mario Loria (15:51): So all of this is to say, a developer has a different set of requirements and how they're approaching a problem and what the outcome is that they're looking for than I do. So, we have to plan and scale and design for that. If we don't, this is where we have developers that we tell, "Hey, you need to install kubectl. You need to set up your kubeconfig, you need to install these tools. This is how you switch context. This is where the namespace is." Most of them have a messed up kind of view of what a namespace is, because of how namespaces actually work in network namespaces. Right. It goes back to what I was saying before, "My services, my pods are just subtly restarting or dying. I don't know what's going on." They can look and you can give them kubectl all day, but at some point, they're going to reach out, they're going to talk to you, they're going to ping your team. They're going to say, "SRE, I need your support."

Mario Loria (16:45): I'm not saying that that should never, ever happen, that they should never talk to you, I'm saying that we can provide tools that make them feel and give them the confidence that they can actually do whatever they need to do. If there is an issue, they can roll back. If there is a key SLI, they know what that is and understand how it's generated.

Mario Loria (17:05): I think that's why I love what DCP and your resources, blog posts, your copy has been fantastic in highlighting, again, code, ship, run. I think about it as develop, deploy, operate, it's the same thing. It's these three tenets.

Daniel Bryant (17:21): Yeah, I like it a lot. What do you think the implications are for the shared tooling? Because I think I've heard you say you've got the front end and the backend, there need to be shared abstractions to some degree. Right?

Mario Loria (17:36): Yeah.

Daniel Bryant (17:37): That's interesting. So what do you think works? You mentioned DCP, Backstage, the view. What do you think is missing in this space? Maybe it's a tooling thing. Maybe it's some standards. Maybe it's shared education amongst dev, SRE and ops. Because sometimes that feels like cats and dogs, we don't communicate properly. I'm kind of curious what your thoughts are on that, man?

Mario Loria (18:00): Yeah. That's a really good question. What I actually think is, the thing that my mind jumps out at is centralization of information. The reason for that is that there's four different systems that, let's say we'd give our developers. There might be Argo. There might be, of course, GitHub or GitLab. There might be Datadog or Grafana. There might be Graylog. There might be a few other services that we say they can log in, like Goldilocks and get their resource requests or other little things that they seldom use. But we tell them, "Hey, here's the domain, it's available, go use it."

Mario Loria (18:42): I think there's a sprawl and it's not as much tool sprawl as it's just like cognitive load-

Daniel Bryant (18:48): Interesting.

Mario Loria (18:49): ... in terms of developing, deploying, and operating. So, to do all of these things, if you think of someone you just onboarded into the team, now, the barrier, the mountain they have to climb to actually understand how we manage our services end-to-end is massive. So I think, going back to the front end and maybe like a single pane of glass sort of thing, where instead of having all these different places to look or to tune, or to try and understand what's going on, we need to be able to link out effectively to a runbook, to a Datadog dashboard, to whatever other resource it might be that the application is reporting to, or that we depend on for handling canaries.

Mario Loria (19:41): The more that we ask developers to jump from one thing to another, the more tedious it becomes. So a good example of this is CI split with CD. So CI, it might be CircleCI. The CD is Jenkins. At one of my previous organizations, you'd go into Jenkins and there was no standard. It was basically, we stood up Jenkins and we said, "Here you go." What you got is a sprawl. As a newcomer, you start looking at this, you're like, how did this even get to this level?

Mario Loria (20:11): I'm not saying the SRE should then roll out standards, I'm saying that there should be like a general guideline and best practices that are curated to help, again, centralize some of the best knowledge, make decisions that apply to and fit best for what the organization is trying to do, again at their scale, not too far. Then, ensure that everyone is kind of on the same page of, this is what you can do. If you want to step out, there's no problem with that. Again, you own your service, you can do whatever you'd like with your service.

Mario Loria (20:44): But there are going to be some elements that if you were running in our platform, this is kind of the way that things are optimized for. So, I think, what still works and what's missing, I think a lot of the tools that we're talking about, there are new things coming out. Whether it's operators for managing databases or other things that make YAML a little bit easier to digest. Argo is an amazing tool.

Mario Loria (21:10): None of these things have actual... they're not bad. They're not bad in any way. The thing that's bad is when we don't actually understand the value they provide in the overall pipeline of what we're trying to achieve and how to leverage them in the best way for what our organization is trying to achieve at a higher level. I guess, if that makes sense.

Daniel Bryant (21:29): Yes, it does make-

Mario Loria (21:30): Hopefully I kind of answered the question.

Daniel Bryant (21:31): Yeah, it does. As you were talking, it made me think go back to the Netflix stuff and I've heard the Spotify folks say this as well about this notion of a golden path or a paved road, paved path, there's a bunch of different names for it. Is that something you would aspire to within CartaX or other places as in, like you say, you can always break glass if you want to and do your own thing, but you're fully responsible for that. But you, as an SRE team, as a platform team have this golden path, you may even have templates for applications, crank the handle, out pops a dummy app, then it's connected into observability, to CI, to CD, developers fit in the business logic. Is it something you would aspire to that kind of golden path?

Mario Loria (22:10): Yeah, absolutely. My goal actually, and this is going to like thinking as a product team, if SRE is a product team.

Daniel Bryant (22:16): Oh, interesting.

Mario Loria (22:16): My goal is that that product is so good that there is no reason for them to need to do anything else. I'm not going to come down and say, "Well, this isn't like..." I'm not going to be VP of engineering or something and say, "You have to use this, because it's what SRE provides necessarily." If there are requirements from a security standpoint or from an organizational standpoint, in our world, FINRA, SEC, sure. The developers aren't going to actively want to go against that. They're a part of the organization. They want us to succeed. They need to know about those things and they're going to try to implement them to the best of their abilities.

Mario Loria (22:50): But I think the thing that I like to imagine is, as an SRE, we are releasing a product. That product is the platform as a service, if you will, the PaaS that we are saying, "This is what you have and we've tried to build it with you in mind, with the company in mind, with our business logic in mind, please help us make it better, please help us-

Daniel Bryant (23:14): Yeah, tell him.

Mario Loria (23:14): ... determine how your services need to work and how things need to progress and how GraphQL should be configured and what the scale is like. Please help us. Let's work together and make this platform." Because the platform, it's like buying an Xbox Series X and having no games for it. The hardware is fantastic, it's super exciting and maybe a little bit overpriced, at least right now, but it's not really anything unless you have the games and the software to actually use it effectively. That is the services that run on top of our clusters.

Mario Loria (23:44): Going back to the golden path, I think that is a model that we try to think about, what is the developer experience? If they're thinking about the ingress and some headers that they have to deal with, what do they get with our platform that they can do, that they can maybe tune just a couple of things and get what they need to do? What is that experience like for them? Can they get the features that they're looking for in that vein or are they going to start looking at other solutions?

Mario Loria (24:11): Have we failed there or are we actually fine, because we don't really want developers doing that in the long run. That's kind of an organizational thing that we think about. That is also how we make decisions on what to prioritize, which is a whole other discussion.

Daniel Bryant (24:26): Interesting. Maybe switching gears a little bit, as you're abstracting some of this Kubernetes complexity away, the CNCF is obviously fantastic. You and I both, we all love the CNCF. But there is incentive there to have more projects. Every KubeCon I go to, there's more, more things popping up. How do you think that approach scales? As you know, there's more choice for us as engineers, but then there's a paradox of choice sometimes, there's more complexity. Any thoughts perhaps with your CNCF ambassador hat on and with the ops hat as well. Right?

Mario Loria (24:58): Absolutely. I actually watched in downtown Ann Arbor, Michigan, the tech hub of Michigan if you will, there was a presentation from one of the people who designed Apple's product and go-to market strategy, maybe more on the marketing side of how we advertise, what we've got. Something that he reinforced to me is that, one of the brilliant pieces of what Apple has to offer, is the simplicity in the options that you choose.

Mario Loria (25:27): Before the iPhone XR and other models, three or four models, there was just an iPhone and it was the iPhone 6. There weren't really many different models. With the MacBooks, there's two or three models in the MacBook. It's pretty simple. There's a couple of little things once you pick your model, but that's it. That's simplicity.

Mario Loria (25:47): I think a lot of people would say, "Well, that forces me into a model." But the other side of that is it satisfies 90-plus percent of what people need. They actually have less of a cognitive load of trying to decide what they really want, what they have to buy. If you go to the Lenovo page right now, I guarantee you, there's at least eight tiles of different laptops and you can customize each one. It's like, I don't even understand what... I just need a computer that works really well for-

Daniel Bryant (26:13): Good analogy.

Mario Loria (26:14): ... photo editing stuff. Right?

Daniel Bryant (26:15): Yeah.

Mario Loria (26:16): Exactly. So to your question, I think that it's better to have choice in this world. There's a lot of nuances, or there's a lot of rabbit holes that we can go down in, in anything we're trying to do.

Mario Loria (26:29): We talked earlier about, we had a node that actually was tainted in our Kubernetes cluster. Actually, we found out because the DaemonSet couldn't deploy to it. I told my team about, there's a tool, node-problem-detector, that can actually kind of solve this problem. But there's actually probably at least five, probably more if I search GitHub, ways to solve this problem or tools that people have built that are automations, that find these and then pull them out of the cluster by default.
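As a small illustration of the tainted-node problem mentioned here, the sketch below scans a node list for taints that block scheduling. It assumes the input has the JSON shape returned by `kubectl get nodes -o json` (an `items` list whose entries carry `metadata.name` and an optional `spec.taints` list). The `tainted_nodes` function and the sample data are hypothetical, and this is not how node-problem-detector itself works.

```python
# Taint effects that stop new pods from landing on (or staying on) a node.
# "PreferNoSchedule" is left out because it is only a soft preference.
BLOCKING_EFFECTS = {"NoSchedule", "NoExecute"}


def tainted_nodes(node_list: dict) -> dict[str, list[str]]:
    """Map node name -> blocking taints, formatted as key[=value]:effect."""
    result: dict[str, list[str]] = {}
    for node in node_list.get("items", []):
        taints = node.get("spec", {}).get("taints") or []
        blocking = []
        for taint in taints:
            if taint.get("effect") not in BLOCKING_EFFECTS:
                continue
            value = taint.get("value")
            label = f'{taint["key"]}={value}' if value else taint["key"]
            blocking.append(f'{label}:{taint["effect"]}')
        if blocking:
            result[node["metadata"]["name"]] = blocking
    return result


# Hypothetical sample in the same shape as `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "node-a"}, "spec": {}},
        {
            "metadata": {"name": "node-b"},
            "spec": {
                "taints": [
                    {"key": "dedicated", "value": "batch", "effect": "NoSchedule"}
                ]
            },
        },
    ]
}
print(tainted_nodes(sample))
```

A report like this only surfaces the taints; whether a given DaemonSet can still schedule depends on the tolerations in its pod spec.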

Mario Loria (27:02): This is on the team, when you as a team are saying, "We have to solve a problem." I think it's really important to first, write out what is the problem description? Why are we trying to solve the problem? How does fixing the problem actually... like if we say the problem is fixed, and then we look to a year later or whatever, and the problem is fixed, what does that mean for us? What is the impact? In solving that problem, there's one solution. We're not going to install five of the solutions that are on GitHub, we're going to install one of them.

Mario Loria (27:29): So, I think in this case, because there are so many different needs and wants and backgrounds of the teams that are managing e-commerce platforms and FinTech platforms and web hosting platforms or security platforms, they're all very different. Because of those differences, we have so many options that solve things in different ways.

Mario Loria (27:50): I think that's actually better because you zoom into the problem and you say, how do I solve this problem? Then you look at what's out there and then you play around a little bit, you actually learn. You ended up learning a lot more than you would if there was just one solution. If there's one solution, it's like, "Well, this is how I have to do it. I don't have an option." You actually end up seeing, oh, wow, that did that this way. Does that actually impact other ways that we are trying to solve the problem or other things that we want to do maybe in the future, or is that going to plug in with service mesh? You have a lot of considerations here.

Mario Loria (28:22): By having those options, you can take a step back and say, "This is the world."

Daniel Bryant (28:26): I like it.

Mario Loria (28:27): This is how the cloud native ecosystem has understood these problems in the FinTech world. Let's say, zooming in there's the FinOps foundation, all of that. This is really interesting, let me learn more about that, let me play with that, let's test drive that and figure out why they chose to do it that way.

Mario Loria (28:45): A really quick example of this would be NGINX ingress. I don't know if it still does this, but the open source default ingress-nginx project for Kubernetes actually worked off of Endpoints instead of the Service objects themselves. So it would actually pull the endpoints from the API instead of looking for the Service objects. Again, a fundamentally different way for an ingress controller to operate versus what you'd think.
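The distinction between routing to a Service and routing to its Endpoints can be sketched in a few lines. This is an illustration of the two approaches only, not the controller's actual code; the function names are hypothetical, and the input dictionaries are assumed to follow the shape of Kubernetes Service and Endpoints objects as returned by the API in JSON.

```python
from __future__ import annotations

from typing import Any


def upstreams_from_service(service: dict[str, Any]) -> list[str]:
    """Route to the Service's single virtual IP and let the cluster
    network balance traffic across the pods behind it."""
    cluster_ip = service["spec"]["clusterIP"]
    return [f"{cluster_ip}:{p['port']}" for p in service["spec"]["ports"]]


def upstreams_from_endpoints(endpoints: dict[str, Any]) -> list[str]:
    """Route to each ready pod address directly, so the proxy itself
    decides how traffic is balanced across pods."""
    upstreams: list[str] = []
    for subset in endpoints.get("subsets") or []:
        for address in subset.get("addresses") or []:
            for port in subset.get("ports") or []:
                upstreams.append(f"{address['ip']}:{port['port']}")
    return upstreams
```

With the first approach the proxy sees one upstream per Service port; with the second it sees one upstream per pod, which is what makes it a fundamentally different operating model.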

Mario Loria (29:11): I think it's one of those things where when you try to oversimplify, you can hurt yourself, but when you try to get overly complex and too deep in something, you can also spend your cycles. So I think there's kind of a middle ground there that is a really tricky balance.

Mario Loria (29:29): The short answer to your question: I think it's a good thing, but I think it needs to be managed from your mental mindset of how you approach these things. The same could be said about cryptocurrency and too many currencies out there, and all the altcoins. I mean, it's how you think about it. It's how you digest and process things. It's not getting caught up in all the excitement around all these new solutions. Because for the most part, a lot of the solutions coming out are just kind of a slightly twisted or modified way of doing something that's already been done. Right?

Daniel Bryant (29:58): Yeah. On that note, I want to get your opinion on standardization. Because you mentioned Linkerd, Consul, Istio, and I'm thinking of SMI. The promise with SMI, the Service Mesh Interface, when it was announced at KubeCon a couple of years ago, was that we, even as platform folks, don't need to worry too much about what's below that line of the Service Mesh Interface. SMI gives you traffic management, observability, security, these kinds of things.

Daniel Bryant (30:22): What's your thoughts in general Mario about, would you look to adopt open standards? Would you look to contribute to open standards? Do they help reduce some of that cognitive load you've talked about as well?

Mario Loria (30:36): Yeah, I would agree that they do. I'm trying to think, yes, or, I don't know actually a ton about Service Mesh Interface. But I think that's an example where you actually look at that and you say, "Okay. So if I find a service mesh that follows SMI, then I know I'm getting these features by default. I know this is what I'm getting."

Mario Loria (30:53): If you think of like a customer buying something at a retail store, if it's got a designation on it, that designation tells them something by default. It's like looking and seeing that Mario, he's a certified Kubernetes administrator. So he has got some level-

Daniel Bryant (31:06): Kind of baseline.

Mario Loria (31:07): Right. However, that's a double-edged sword, because I might've studied it a year and a half ago, I might've not done Kubernetes for the past year, I might have lost a lot of that knowledge or I have a different viewpoint or I didn't actually do that well. I got 1% over the required and I just barely passed. That could be the case as well.

Mario Loria (31:27): So, I think there's a middle ground where if you interviewed me, you would have to actually care and ask some questions around the certified Kubernetes exam to make sure I actually am up to speed, that I'm not just someone who can read and copy and then do a test, but that I actually practice and have been able to leverage these things. So there's an in-between there.

Mario Loria (31:45): So, I don't think it's kind of a perfect indicator, I think it is a better indicator than not. But I think going to open source, I know there's Open Service Mesh as well, I think.

Daniel Bryant (31:57): Oh, yeah. Microsoft's.

Mario Loria (31:57): There you go. Yup. I remember, I don't know if you remember this, but I think there was actually code copied from Linkerd.

Daniel Bryant (32:04): Oh, that's right. I remember.

Mario Loria (32:06): Yup. That was highlighted by Oliver on the Linkerd side. I think what we're getting is, open source is amazing. You can fork and you can add to projects, you'd be a very like dense contributor. You can really do a lot in the ecosystem. I don't think there's anything bad there. What I think is bad is when the lines start to blur of where the value add actually is and why this solution needs to exist versus the other solutions.

Mario Loria (32:36): I think Istio has a market. I think Linkerd has a market. But that Open Service Mesh, I wasn't really sure on the market, I wasn't really sure on their goals, their principles, their vision for what this thing should be. I'm sure, there's a great one. I haven't looked at it in a long time.

Mario Loria (32:52): But when it released, it's like, okay, so there's another solution, what do I get from this? I don't have a problem with there being 100 solutions, what I have a problem with more so is, when we say this one is named something and this one we're trying to start a movement around this and we kind of trick people into thinking that ours is the better one, because of kind of the way that we release or the aura around it.

Mario Loria (33:16): I think, some of that you can fight through. I think SMI helps again, set an entry point to understanding what this gets you. But I think long-term, if I'm picking Linkerd, if I'm picking a service mesh like that, this is a pretty major thing.

Daniel Bryant (33:32): Oh, great.

Mario Loria (33:33): It's in the critical path of everything that's going on. So I don't actually care about what the Quickstart Docker looks like as much as I care about the project's vision and mission from its founders and where it's going to be in three years, because I'm still going to be using that.

Mario Loria (33:47): Maybe I think about this differently, maybe I'm dumb and we shouldn't care about things in the future and all of that. But I think, you're betting on Linkerd, you like Linkerd.

Daniel Bryant (33:55): Yeah, I agree.

Mario Loria (33:56): I don't want us getting in bed with them, I guess, that's not the metaphor I was looking for. But, the way I think about it is like, I am not getting the project because it has a really great code base necessarily, I am adopting this project into everything we do in our critical path because of the people. This is when VCs invest in companies, they don't actually invest in the company, they invest in the founder. Right?

Daniel Bryant (34:23): 100%. Yeah. I like it, Mario. Lots of great content. I really appreciate all the brain deep knowledge. You mentioned the DCP a few times. If folks are looking to build a platform, build a DCP, what advice would you give them? Maybe it is build versus buy, or maybe it is, where would they start? You mentioned a few times the "code, ship, run" idea, which I think is a super insight as well.

Daniel Bryant (34:41): I'd just love to get your thoughts on, what's the most critical things folks should think about when they are providing this service to their developers?

Mario Loria (34:51): Yeah. That's a really good question. You asked before, what have you seen people building? I think honestly, a lot of time we get talking about something about a unicorn and rainbow scenario and we actually don't do a lot of actual work. That's why I think that people should, for the most part defer to what's out there.

Mario Loria (35:08): Even if you don't use anything that's out there, I think you should use the ideas and the learnings from your research and from your playing around in greenfielding to figure out what is going to work best. I think you can highlight problems, but you can't offer full 100%, this will fix the problem sorts of advice, until you've really tried the available options in front of you and really applied them to your environment and your ecosystem in your own organization and how your developers work, et cetera.

Mario Loria (35:38): I think for the most, I look at one of our coworkers that spent the night upgrading GitLab to version 14 internally. I think about that and I think, there's a lot that we have to do. I don't think most businesses really need to be in the business of running GitLab, from an operation standpoint.

Daniel Bryant (36:00): Interesting. Cool.

Mario Loria (36:01): I think I might be very skewed. At StockX, we very much depended on third parties. It was no question, we were not self hosting anything. We didn't have the time, the resources and we had enough money that it made sense to do that. Actually, I got very spoiled, but I also realized, you know what? I don't need to really understand how to run Kafka, this is done really well for me already. It solves 98% of what we need, which is just running Kafka well.

Mario Loria (36:27): So that's not 100% the case for everybody. Everyone has got different use cases. I'm not saying everyone should go out and use a third-party tool 100% of the time. But I think that at the very least, you should very much see what's out there. You should just soak. You should just really play around. You should greenfield things. Then make some decisions about where you're at now and where you're going. This one is a tricky one.

Daniel Bryant (36:51): No, I like it.

Mario Loria (36:52): If you make a decision, this is where we're at now, we need the solution for this now, and you don't consider scale later on, and not even scale with the number, but more scale of what your needs might be, I think there's a fine line. Because you can't plan for three years down the road with everything you do at every juncture, that's just not feasible. You'd spend four months evaluating something before actually trying it and bringing it into your organization and implementing it, and that's not exactly feasible.

Mario Loria (37:21): So you have to figure out for yourself, what you're trying to do, how it impacts you now and how it will impact you in the future. I think if you use the principles that you've defined already, you use the mission that you've defined already in the context of, what is best for our developers, what is best for our own team, what is best for the business and the overall company to achieve the objectives that we all want to achieve? If you think in that context, in every step of approaching a new problem or approaching a set of solutions or approaching a longterm strategy, you will be much better off, instead of, if you just kind of have this knee jerk reaction to everything and take the first thing that you see on GitHub.

Mario Loria (38:04): That's why I think the landscape document that shows 14 trillion things on a fancy little... the CNCF landscape PDF, I actually think that it's good and bad. It's good in that it shows you everything in plain view, really simple to find. It's bad in that we then get the sense, and we then discuss, are there too many things? We think of it as a bad thing, which it really isn't.

Mario Loria (38:27): So, hopefully to answer your question, I think that teams should consider, do they want to be managing this thing? I like to think about problems in a future sense of, if I did this now, where would I be a year from now? Will the problem be solved still? Would my operational maintenance burden be the same or less? Again, what's favorable for you? Maybe you want more operational burden, whatever that might be. Are we still moving and doing impactful things as an organization, or is this going to maybe slow us down long-term or put us in a place that we can't get out of? Then we're really in a bad rabbit hole that just kind of spirals.

Mario Loria (39:10): So, think about where you are now, where you want to be in the future, and how the principles and the decision-making process that you might have already will kind of impact what you want to do.

Mario Loria (39:25): Hopefully I didn't go around in circles and kind of answered the question, but it's very different for everybody.

Daniel Bryant (39:30): Yeah. 100%. To use a finance metaphor, it's like buying options. You're buying yourself options that you can exercise in the future, and it's about making those smart decisions upfront while balancing the risk.

Mario Loria (39:42): Absolutely. Let's go down the finance rabbit hole. I love this. There's going to be turbulent pieces. Nothing in life is perfect 100% of the time, there's going to be turbulent portions where you have to think, is this thing actually going to help me? Is our buddy or my coworker updating GitLab to version 14, is that going to be good long-term? Well, we can bet that, okay, it's a new version, it's got some new features, some fixes, ideally it should be good, but we've actually already had an issue come up about an hour ago where something broke.

Mario Loria (40:15): So, in the short term, we didn't consider the consequences of getting to that desired state. We jumped in a little bit, maybe too fast. Or, we didn't have controls in place that said, "Well, how are we testing this functionality that we're looking for?" Or that this upgrade, that sounds amazing is actually good for what we're trying to do. Now we've got a slew of developers here, I'm looking at my Slack, that are asking about, "Hey, this isn't working anymore. I'm getting this weird cryptic error."

Mario Loria (40:41): That's the other side of this is, well, how do we test GitLab? In their public, do they have a suite of end-to-end tests that we can run every time we upgrade? A, they don't. B, this burden is now on you because you run the service. As soon as you put yourself in that critical path, that's when the paradigm changes with how the organization depends on you. That's where the ad hoc requests are flying in. That's where your burden is so heavy that you don't actually find yourself doing projects that are really meaningful to anybody on the team.

Daniel Bryant (41:18): Yeah, Mario. I feel you. I've been there in my past life with my past roles. Yeah, awesome. This has been fantastic. Is there anything you want to say that we haven't covered?

Mario Loria (41:29): I don't think so. I think, with Ambassador [releasing the DCP] and Spotify releasing Backstage, and Crossplane.io, there are projects now coming out that are saying, "Let's stop writing YAML, let's generate the YAML for you based on a couple of things that you want to do. Let's make sure that whoever it is that's making the decision about what needs to happen, i.e., the developer, can do so in a simple, effective way, and they can see the results of that." Part of this might be measurement and observability components.
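The idea of generating the manifest from "a couple of things that you want to do" can be sketched as follows. This is a hypothetical illustration, not how Backstage, Crossplane, or the Ambassador DCP work internally; the function name `deployment_manifest` and its defaults are assumptions. It emits JSON, which Kubernetes accepts alongside YAML, so no third-party YAML library is needed.

```python
from __future__ import annotations

import json
from typing import Any


def deployment_manifest(
    name: str, image: str, replicas: int = 2, port: int = 8080
) -> dict[str, Any]:
    """Build a Kubernetes Deployment from the few values a developer cares about."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # The developer supplies a name and an image; everything else is generated.
    print(json.dumps(deployment_manifest("checkout", "example.com/checkout:1.0"), indent=2))
```

The developer states the intent (service name, image, replica count) and the platform owns the boilerplate, which is the shift in responsibility being described.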

Mario Loria (42:02): I think if there's anything I'd leave people with, it's that a lot of the problems that you might be seeing, or that you might be encountering, that your brain has this kind of instant reaction to, where it's, "We can solve this if we have more documentation, we can solve this if we change the system we're using, or we change and modify that script." I actually think that a lot of the problems can be solved by approaching them from a human, soft-skills standpoint of how someone interacts and operates and what they're trying to do.

Mario Loria (42:38): I'm trying to think about this. It's like a metaphor, maybe buying a car. Most people, when they buy a car, they just need to understand how the windshield wipers work. It's a very simple thing. Why do they need to do that? Because, they need to make sure the windshield is clear. So we just broke down that problem. We didn't say, "What size is your windshield wipers? What company?" There's 13,000 things that are nuances about windshield wipers. But really, what is the key thing? Is we need to wipe the windshield and keep it clean when it's raining.

Mario Loria (43:11): So, if you break down the human part of that, to what someone might be trying to achieve in their organization, it can put you on the same kind of thinking field as them.

Daniel Bryant (43:25): Empathy.

Mario Loria (43:25): Yup. Exactly. I'm just so tired of seeing teams that basically, "Well, they're having all these problems, they're really annoying, I wish they would just get it. Why don't they do the research? Why don't they..." Well, no, it's actually probably something that's kind of complex and you didn't do anything to really solve for the UX side of it. I think user experience is a huge thing here.

Daniel Bryant (43:45): Oh, interesting. Yeah.

Mario Loria (43:46): I'll leave people with that, and with thinking that a lot of these problems are not hard problems, they're actually kind of soft-skill-focused problems. Your Istio isn't actually the problem. It's how people interact to understand and deploy in an Istio environment that hasn't been well-defined. Yeah, I just urge people to think in that mindset.

Mario Loria (44:09): The new tools that are out there, I think Terraform is a great tool. I personally don't like writing Terraform. I try to tell people that I'm rusty with it, so that they don't elect me to do it. The reason for that is that, I don't think we should really have to sit and write YAML files to define our infrastructure. I actually think that I should be able to go in AWS and just create whatever I want. Then automatically, there should be processes that take that and save that for me as a-

Daniel Bryant (44:36): And spit out, say, CloudFormation.

Mario Loria (44:38): Exactly. A declaration. Right. So instead of, we're moving out of the manual declaration, I want to get to automated declaration of infrastructure. That's even less work for the operations engineers to do, but then when we need it, it's there. That's an example, I think, where we're solving the human side of that. We're taking off the burden of, well, to implement this feature, it actually takes a week because we have to write YAML, but I could go on the AWS console and do it in 20 minutes. Right?
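A rough sketch of what "automated declaration" could look like: take a description of resources that already exist and save it as a declarative document. Everything here is hypothetical, including the `to_declaration` function, the input shape, and the set of runtime-only fields; a real cloud API returns different structures, and this is not CloudFormation or Terraform syntax.

```python
from __future__ import annotations

import json
from typing import Any

# Assumed for illustration: fields that describe runtime state, not intent.
RUNTIME_ONLY_FIELDS = {"id", "arn", "created_at", "status"}


def to_declaration(live_resources: list[dict[str, Any]]) -> str:
    """Turn a snapshot of live resources into a stable, declarative JSON document."""
    declared: dict[str, Any] = {}
    for resource in live_resources:
        # Keep only the properties someone would have written by hand.
        properties = {
            key: value
            for key, value in resource["properties"].items()
            if key not in RUNTIME_ONLY_FIELDS
        }
        declared[resource["name"]] = {
            "type": resource["type"],
            "properties": properties,
        }
    # Sorted keys keep the output diff-friendly when it is saved to version control.
    return json.dumps({"resources": declared}, indent=2, sort_keys=True)
```

The point of the sketch is the direction of flow: the engineer creates the resource in the console, and a process captures the declaration afterwards instead of someone writing it by hand first.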

Daniel Bryant (45:05): Yeah.

Mario Loria (45:06): I'll end with that. There's probably a lot of other things I could mention to people. I actually wrote an internal paper on developer experience-

Daniel Bryant (45:13): Oh, wow.

Mario Loria (45:13): ... more so from the side of building a developer experience team at Carta, some of which we've been trying to do. I think when you get to a certain size, this might make sense. But there's a lot of things that, for a whole weekend, I just wrote, and I love doing that. But the problem is that if you try to read it, it doesn't read very well, so I will do some editing. At some point I'll release these... some of those may be snippets or pieces. A lot of what I've been saying today kind of regurgitates that, and I'll make it a little bit more clear, hopefully, on a blog or something like that. Maybe even call O'Reilly and write a book with them.

Daniel Bryant (45:49): Oh, too right. Yeah.

Mario Loria (45:49): But no, I honestly love this topic and I would love to chat with anyone who wants to discuss more about it. So I'm always learning.

Daniel Bryant (45:57): Awesome stuff Mario. Folks that do want to reach out to you, where can they find you on the interwebs?

Mario Loria (46:02): Yeah. I'm at Mario P. Loria on Twitter. Actually you can find pretty much everything about me at my personal splash page, marioloria.dev. I'm most active on Twitter and LinkedIn, so please feel free to communicate and reach out. I think that those are the key places to reach me. I'm always in the Kubernetes Slack as well, working on the Kubernetes Office Hours, our monthly panel where we help people raise issues.

Daniel Bryant (46:28): Oh, nice.

Mario Loria (46:28): So join #office-hours in Slack. I'm usually @Mario or @MLoria in these Slacks. Usually I can get Mario, usually I can get-

Daniel Bryant (46:38): Nice.

Mario Loria (46:39): Please reach out. Again, I love talking about these things. I actually do consulting as well around the mentality of approaching these problems. People get to a certain juncture, what is the decision making process, to figure out do I double down on GKE Autopilot or do I do something else? I love hearing about how we got here and then, how we get there. What got you to a certain point might not get you to a-

Daniel Bryant (47:04): Yeah. 100%.

Mario Loria (47:07): So, would love to chat with anybody.

Daniel Bryant (47:09): Awesome Mario. Awesome. Well, I really appreciate your time today. Thank you so much. This has been like super insightful. I'll try and brain dump all of the notes, has been fantastic. Thank you. Thank you. Thank you. Really appreciate it.

Mario Loria (47:17): Absolutely. Yeah. No problem. Thank you.
