Livin' on the Edge Podcast #4: Alla Babkina on Platforms as Products, 2020 Fullstack Engineers, and Observability
Key takeaways from the podcast included:
- If an engineering team decides to build an application platform, this must be treated like any other product within the organization. Requirements should be gathered from customers (e.g. developers), and the delivery and maintenance of the platform should be explicitly managed. Appropriate people, time, and resources should also be provided.
- Although the term “fullstack engineer” is probably over-used, in 2020 it is beneficial for engineers to understand an appropriate level of detail about several topics: programming language, algorithm design, and the underlying cloud platform.
- Upskilling engineers in fullstack topics enables them to design, build, and run applications that perform effectively and consume (cloud) resources appropriately.
- Not every engineer wants to be “fullstack”, and so platform/operations support should be provided to engineering teams as appropriate.
- A challenge for modern software engineering organisations is cultivating the team mix. A key question technical leaders need to ask is “should a product be built by a small band of fullstack engineers, or a series of teams with specialist roles?”
- Engineers must understand the business context in which they are operating, and also how the features they are working on relate to key performance indicators.
- The successful adoption of new technologies and techniques can be driven by focused sprints, where the output of something practical can be demonstrated to the entire team. “Fika” (coffee time) sessions can be a great way to share knowledge.
- Microservices are just one way of creating modular applications with well-defined APIs. Other approaches, such as a well-architected monolith, are often equally valid.
- Observability must be designed and built into all of the applications and also the platform. Engineers must create metrics appropriately for alerting, and also use logging effectively in order to provide enough “breadcrumbs” to locate where the issue is.
This week's guest
As a technology lead, engineering manager or consultant, Alla Babkina helps teams build solutions that businesses need, users love, and engineers are proud of. She has experience in several industries, deep practical knowledge of technology and how to best apply it to business use cases. Alla drives for innovation and strong engineering principles, which allow her to build industry leading systems and teams for organisations that she works with.
Hello everyone. I'm Daniel Bryant and I'd like to welcome you to the Ambassador Living On The Edge podcast, the show that focuses on all things related to cloud-native platforms, creating effective developer workflows and building modern APIs. Today I'm joined by Alla Babkina, head of engineering at Headstart, a diversity recruitment platform.
I've followed Alla's work for many years now and we briefly worked together at a consultancy company in London called Open Credo. Alla was a superb consulting colleague and had the unique ability to become productive with practically any technology within a matter of days. She also always kept the big picture in mind, such as the leadership and the organizational drivers, which trust me, wasn't always an easy thing to do.
Recently Alla has worked on and led several teams that have built cloud-native platforms, and so I was keen to understand what her key learnings here have been. I was keen to also ask questions around her technical experiences here, recommendations for tech, but also understand how she prioritized and balanced the related business concerns.
If you like what you hear today, I definitely encourage you to pop over to our website. That's getambassador.io, where we have a range of articles, white papers and videos that provide more information for engineers working in the Kubernetes and cloud space. You can also find links there to our latest releases, such as the Ambassador Edge Stack, our open source Ambassador API Gateway, and also our CNCF-hosted Telepresence Kubernetes tool too.
Hey Alla, welcome to the show. Thanks for joining me today.
Hi, Daniel. Nice to see you again.
So could you briefly introduce yourself, please, and share a recent career highlight?
So my name is Alla. I've been working with Headstart, a diversity recruitment software platform, as head of engineering for the past six months. I recently, just about a year and a half ago, moved more into tech leadership. I started off as a Java developer, having switched careers from being a lawyer, went all the way from full-stack to very deep backend, to distributed systems, to operations and distributed cloud systems, and have now ended up managing all of tech in a small startup.
You're perfect for this podcast, Alla. You've literally done all the roles possible, which is perfect. So what I wanted to pick your brains about today was developer experience and in particular inner development loops. So from having an idea to coding, to testing, to deploying, to releasing and verifying, and it can be actually the tech stuff or you can talk about it from a leadership perspective as well. But I wanted to dive into tech at the beginning. Without naming names, could you describe the worst thing you had to do with where you were coding and you needed to get something into production and it just wasn't working?
As a developer, I wouldn't be able to say straight away. I've been pretty lucky in my career. But there was one engagement that I was involved in which included handing over the technology from a startup that has failed to one of its major investors. And I was brought on site for five billable days to do a technical handover and technical diagram.
In the five days, the best I've managed to get is to get the code base checked out on someone else's laptop, because I couldn't get access to even the office wifi, although I did have a card to the canteen where everything was free. And we couldn't get the project to even build in the IDE because you couldn't get access to the standard Maven repositories. You needed to get everything pre-approved, which took at least a week up-front. And I couldn't install any of the usual software I would use or use it in the cloud because of the firewalls. So at the end of five full working days, apart from having a couple of nice coffees at the canteen, I managed to deliver a drawing in Microsoft Paint of a financial trading platform. And that was my deliverable. So I couldn't even touch the code.
Wow. That's probably the record. That wins the award for worst experience so far, I think, out of all the people I've chatted to. What about your best developer experience? And you could even describe something you're working on now, or has there been this magic moment... We all know as developers, you get that magic moment where it's super easy. You have the idea, you code it, you test it, you see it running and you see the users getting value. What's your experience been like in regards to that?
So this is actually quite a recent experience. This was at Headstart. It was before the recent couple of people joined. So the team was still pretty small. It was the perfect developer and team, an agile delivery and CI/CD experience. So our goal from the beginning of the year, and something that we put into our OKRs was to get really, really, really comfortable with continuous delivery and shipping to production without sweating too much, at least five times a day as a team. I think that's a pretty big goal. So in order to illustrate how that might work, we organized what we call a fika. So that's Swedish for coffee break. We have these every Friday, and it's a relaxed two-hour session where we either get to talk about something tech or to do something together or just to chat around things we don't get to chat within the normal week.
And we decided that within that fika, we would identify a candidate for something that we could get something valuable. So it couldn't just be code refactoring that no one sees. Something valuable to the user that we could design, deliver, test and showcase to the rest of the company. Not a demo, but announce it on our product updates channel within two hours. And we had a couple of candidates.
So it turned out that within the application process, for a very long time we have been collecting extenuating circumstances for graduate candidates, so where they would have completed the degree maybe not with the best grade or took a little longer and they could say why. And we were not highlighting it to the recruiters. So we were using it in some of our matching algorithms, but we were not highlighting it to the recruiters. And we thought, okay, it would be pretty good to show the recruiters if a candidate has something in their history.
So it was pretty simple. It's more or less like a read model, but we've identified where to place it within the existing front end for the recruiters, how to display it, how much to display, whether to repeat it or not, to fight out a design decision. So this was before we hired a visual designer, and no one was very much into design back then. And to get it delivered with two commits and really quick peer-reviews and tests within two hours, announce it on the product updates channel. And it was a massive hit with the rest of the team.
And I think there were social media posts around that we're doing this and that we're trying to give people who might have had difficult circumstances a chance. And this felt particularly good because we knew that for some people it would be life-changing. So after we've taken a look, we've seen people with histories like, "I didn't complete my degree on time because I was involved in a hit-and-run accident and was in a coma in the hospital for a couple of months." Yeah, so that was a massive success story, although it took a leap of faith for the rest of the team to go there. Like, "How can you deliver a feature on a call all together in VS Code within two hours?" Yeah, you can.
How about testing though? How do you test that?
So we ran automated tests for the API. So that works for the front end because of its simplicity. On the corporate side, we don't have a big test suite for that. So we normally do a small test, but we're moving in that direction as well. So we're looking for something that won't slow us down, because we can't afford to run a full-blown end-to-end test suite every time we deploy, because we deploy up to 15 times a day now, and it will just not leave any time, but we also want to make sure that we don't break anything too often.
Yeah. It makes sense. How important do you think it is for developers to understand the business context they're working in? You mentioned about breaking things there. I'm guessing the engineer's like, "Breaking stuff there is just code not working," but there's also bad user experience folks. So Alla, yeah, I guess the question is how important is it for developers to understand the business and how would you go about upskilling them in understanding things?
So with the current team, luckily I don't have to upskill anyone, maybe downskill a little bit. The team has been effectively the product team for three years. So everyone's very very involved, but I think that's a privileged situation to be in. So engineering is not science. You're not inventing anything really. You're coming up with solutions from existing building blocks by combining them, choosing between different options and so on. So you need to make decisions, and in order to make decisions, you need to understand the criteria. And the only way to understand the criteria is to know what are you trying to achieve in the end.
So not understanding the business context means that a lot of this gets lost, and developers end up making the wrong decisions in the bigger picture. So they might be good decisions within a particular written-up ticket or story or task, whichever way you frame this, but it won't be the right decision within the business context. And engineers are clever. They usually sense this. They're sensing that they're doing something wrong and there's an agency problem there, that I've done my part of the job but it doesn't really work.
What about going down back to the level of architecture? I don't know if you use microservices, but it seems like almost everyone uses some form of service-oriented architecture. How important do you think it is for developers to understand this notion of being able to decompose systems into functions, services, modules, call it what you will?
So in order to understand how to decompose them, usually developers need to understand why. So you can't say, "We're doing decomposition now. We're using microservices." That's a bit of cargo culting. On my team that causes a lot of pushing back, so the team is quite reluctant to adopt microservices for the sake of microservices. So we are having discussions as to why they would be beneficial. There are two things to consider there.
The first one is the way that the teams communicate. So you can communicate via service boundaries, service contracts, service structures, service requirements. You could do the same with service modules, for instance. You could do this with code structures. It doesn't have to be on infrastructure level or deployable-unit level. So you need to understand why you're doing this. That's the first one.
But the second thing that you need to think about when decomposing services is the operational model. So you need to know who's going to be running what and when and how to what SLAs. Again, it comes back to communication, to internal and external stakeholders. You also need to understand who's responsible down to what level of abstraction, because there's a lot more operational work involved in microservices than deploying a single monolith, whatever it may be. It can be a microservice if your service is small. It's just one micro service. But there's a lot more involved in that, because you need to develop the platform to run these services on. You need to think how they're going to communicate between themselves. And that again goes back to the operational level.
How important do you think it is for developers to be operationally aware? And where my question's going with that is I guess there's only so much developers can learn. Do you know what I mean? They've got the business context now, they've got architecture. If I've got to be ops-aware too, it seems like a lot.
I think that's the key question of 2020. I saw a really funny GIF somewhere. It might be from one of the colleagues that we used to work together with. A full-stack developer 2020, which is someone who knows how to make homemade sourdough and some neuroscience and some team psychology and some business and some finance and networking and security. So I think we're going back to the full-stack model a little bit. So developers do need to understand operations within the limits of what the organization expects of them, so there are different operational models.
What they need to understand is, when is their work done, and when is it done to a sufficient standard? So in the universe that I've been operating in for the past couple of years, that does involve thinking about running the software after it's delivered. It doesn't involve SRE work per se. So normally developers would run their services on some sort of a platform, a public one or an internally developed one, but it requires awareness of operational aspects. So developers would be expected to understand at a minimum how computers work, not just the language API.
So you do need to understand what is memory-intensive, what is CPU-intensive, at least on a high level. You don't need to know the memory model of your language in depth, but you need to understand that this is just memory-intensive because you load a lot of data into memory and then you do something with it and then you output it, as opposed to, this is something that involves a lot of computations and this will be CPU-intensive. So just understanding where it goes and to be able to make the right use of the platform.
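As a rough illustration of the distinction Alla draws here, the following Python sketch contrasts a function that loads everything into memory with one that mostly burns CPU. The function names and workloads are hypothetical and were not discussed in the episode; `tracemalloc` and `time.process_time` are standard-library tools, and the numbers they report will vary by machine.

```python
import time
import tracemalloc


def memory_heavy(n: int) -> int:
    """Load a lot of data into memory, then do a little work on it."""
    data = list(range(n))  # the whole list lives in memory at once
    return sum(data)


def cpu_heavy(n: int) -> int:
    """Count primes below n by trial division: lots of computation, little data held."""
    count = 0
    for candidate in range(2, n):
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return count


def profile(func, n: int) -> dict:
    """Run func(n) and report its result, CPU seconds, and peak traced memory in bytes."""
    tracemalloc.start()
    cpu_start = time.process_time()
    result = func(n)
    cpu_seconds = time.process_time() - cpu_start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"result": result, "cpu_seconds": cpu_seconds, "peak_bytes": peak_bytes}


if __name__ == "__main__":
    print("memory_heavy:", profile(memory_heavy, 200_000))
    print("cpu_heavy:", profile(cpu_heavy, 20_000))
```

Knowing which of the two profiles a service resembles is usually enough to pick sensible memory and CPU settings on whatever platform it runs on.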
And the thing that I've been talking about a lot recently, because of my previous experience before Headstart as head of platform at ClearScore, is every developer needs to write the software imagining two things. So one is the one that's been known for many years as a meme now, that you need to think that the person who's reading your code next is a psycho who knows where you live. You don't want to upset them in terms of reading. But the second thing that they need to understand is imagine if you wrote a piece of code and you were called up at 3:00 AM saying, "Hey, this thing is broken and I don't understand why, and you need to guide me exactly through how to see where it is broken and how to fix it, and whether it's fixable." So you need to leave breadcrumbs, if not for the support person who will be there then for yourself, if you're in a you-build-it-you-run-it environment.
This is actually an interview question I've been using recently. We're not hiring at the moment. And I've taken that question down. It proved to be very unpopular because it scares people. So one of the candidates actually gave feedback and said, "Oh, I can't work for you because your head of engineering will call me up at 3:00 AM and say, 'You have to fix this code.'" But that's not what I said. I said imagine. So I had to have this question with a disclaimer, saying we don't operate a 24/7 support policy, so I will not call you at 3:00 AM. This is just a scenario. But it's good to think about it like that. Because I've been called at 3:00 AM, well not called, but paged by my PagerDuty with things which were broken, and was thankful for all the breadcrumbs left by engineers, by ops, by the whole team.
I like that idea of breadcrumbs. Because I've been in the same situation and I'll be like, "I need breadcrumbs and I should have left myself more of them." When you say breadcrumbs, what stuff do you mean? Listeners are thinking, "I like the sound of this. What kind of stuff do I leave future me or my team?"
Yeah. So you need to understand the numbers that you want to have to determine whether your service or your piece of code or the whole system is operating normally, or if something is wrong. So in the first place you need metrics. So metrics are by definition numeric. So you need to know what normal looks like, and you can't get that retrospectively, so you need to build it early on. So you have some time over which you can say, this is our normal. So sometimes that might be a day, in more predictable applications like banking. There is a normal day. So if you track it over a week, somewhere in the summer, that's normal.
In our industry, that might be a year, because what is normal is very seasonal. So what is normal in January, which is downtime, looks like dead time in September. So in September it might be a hundred times higher. So you need to understand what your normal is over some period of time with numbers. And if the numbers tell you that something's wrong, at a minimum you need to leave logs that will tell you what is wrong in a specific instance. So you leave metrics for the global view of the system, where it's the law of large numbers, it's statistics. And you add logs for where you want very specific information on every case.
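A minimal Python sketch of the "metrics for the global view, logs for the specific case" idea follows. Everything in it is an assumption made for illustration: the function names, the three-standard-deviation tolerance, and the latency figures are invented, and a seasonal business like the one described would keep a separate baseline per period rather than a single one.

```python
import logging
import statistics

logger = logging.getLogger("breadcrumbs")


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarise historical samples as (mean, standard deviation): the 'normal'."""
    return statistics.mean(samples), statistics.stdev(samples)


def is_abnormal(value: float, baseline: tuple[float, float], tolerance: float = 3.0) -> bool:
    """Flag a value that is more than `tolerance` standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) > tolerance * stdev


def record_request(latency_ms: float, baseline: tuple[float, float], request_id: str) -> bool:
    """Check one request against the baseline and leave a log breadcrumb if it is off."""
    abnormal = is_abnormal(latency_ms, baseline)
    if abnormal:
        # The metric says something is wrong; the log says which request and by how much.
        logger.warning(
            "slow request request_id=%s latency_ms=%.1f baseline_mean_ms=%.1f",
            request_id, latency_ms, baseline[0],
        )
    return abnormal


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    normal_week = [100.0, 110.0, 90.0, 105.0, 95.0]  # invented latency samples
    baseline = build_baseline(normal_week)
    record_request(104.0, baseline, request_id="req-1")
    record_request(900.0, baseline, request_id="req-2")
```

The baseline has to be collected before the incident, which is why the samples are gathered up front rather than reconstructed afterwards.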
Say folks are building a system that is not just a monolith. It's maybe got some microservices and Lambdas in the mix. Have you got any advice on how to locate where things are going wrong? Something I've frequently stumbled on. I'm reasonably good with metrics. Logs, I definitely learned my lesson the hard way on that one. But what I'm finding with building these systems where we're composing them of many different things, I can't even find where the problem is. Do you know what I mean? I know my users are getting 503s. Don't know where it's happening. Any advice from your past on that one?
So this is a complex topic that's only evolving properly now. So we're talking about observability, and that's not easy to get right. That's one of the downsides of distributed systems and microservice architectures. You need to trace an operation which is atomic from the user's perspective, because the user can't see it encompasses 12 different systems and three Kafka clusters across four continents somehow. So you need to be able to trace a single operation across many services.
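One common way to trace a single user operation across services is to attach an identifier at the edge and pass it along on every hop. The Python sketch below simulates that with plain functions standing in for services. The header name `X-Trace-Id`, the service names, and the in-memory `breadcrumbs` list are all hypothetical; production systems typically rely on a tracing standard such as W3C Trace Context or a library such as OpenTelemetry instead of hand-rolled propagation.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for this sketch

# In-memory stand-in for a log pipeline: (trace_id, service, message)
breadcrumbs: list[tuple[str, str, str]] = []


def with_trace(headers: dict[str, str]) -> dict[str, str]:
    """Reuse the caller's trace ID, or start a new trace if this is the edge."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    return {**headers, TRACE_HEADER: trace_id}


def record(service: str, headers: dict[str, str], message: str) -> None:
    """Leave a breadcrumb tagged with the trace ID of the current operation."""
    breadcrumbs.append((headers[TRACE_HEADER], service, message))


def payment_service(headers: dict[str, str]) -> None:
    headers = with_trace(headers)
    record("payment", headers, "card charged")


def inventory_service(headers: dict[str, str]) -> None:
    headers = with_trace(headers)
    record("inventory", headers, "stock reserved")


def checkout_service(headers: dict[str, str]) -> str:
    """Edge service: starts the trace and forwards it to downstream services."""
    headers = with_trace(headers)
    record("checkout", headers, "order received")
    payment_service(headers)
    inventory_service(headers)
    return headers[TRACE_HEADER]


def find_trace(trace_id: str) -> list[tuple[str, str]]:
    """Return every (service, message) breadcrumb left for one user operation."""
    return [(service, message) for tid, service, message in breadcrumbs if tid == trace_id]


if __name__ == "__main__":
    trace_id = checkout_service({})
    for service, message in find_trace(trace_id):
        print(trace_id, service, message)
```

With the identifier on every breadcrumb, a failing request can be followed from the edge to the service where it actually broke.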
And again, this is something that needs to be built from the ground up. So it needs to be built from the beginning. And this is why it's really important for engineers to understand what it is. So you can't bolt on metrics. This is something that teams struggle with. Occasionally they think, oh, well the platform team or the SRE team or the operations team or the DevOps team, and I air quote here, the "DevOps" team that does all the DevOpsing and collaborating, they will do all of our metrics. But they can't, because they don't know what your service is doing. Your service is a container to them. So you need to build it inside. It's like a cyborg organism. You need to build it in. You can't put it on as a shell, as an Iron Man suit.
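To make "you need to build it inside" a little more concrete, here is a small Python sketch of instrumentation that lives in the service code itself. The `instrumented` decorator, the `metrics` counter, and the `match_candidates` function are hypothetical names invented for this example; a real service would send these numbers to whatever metrics client its platform provides.

```python
import functools
import time
from collections import Counter

metrics = Counter()  # in-memory stand-in for a real metrics client


def instrumented(name: str):
    """Decorator that counts calls and errors and accumulates time spent in a function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            metrics[f"{name}.calls"] += 1
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                metrics[f"{name}.errors"] += 1
                raise
            finally:
                metrics[f"{name}.seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


@instrumented("match_candidates")
def match_candidates(candidates: list[dict], min_score: int) -> list[dict]:
    """Business logic that only the owning team understands well enough to measure."""
    if min_score < 0:
        raise ValueError("min_score must not be negative")
    return [c for c in candidates if c["score"] >= min_score]


if __name__ == "__main__":
    match_candidates([{"score": 5}, {"score": 1}], min_score=3)
    print(dict(metrics))
```

A platform team can see the container from the outside, but only code like this knows that a matching call happened and whether it failed.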
And this is where developer upskilling is necessary. So developers often ask, "But how am I expected to know all of this Terraform stuff and all of the things about container users and all of the things about rolling policies and so on?" Well, you don't have to know all of it, but you have to go an extra step with ops and say, "Look, I really want to make my service operational. I really want it to be running smoothly. Even if you are the ones who will be running it most of the time, how do I make this operable?" Same as people who specialize in operations need to take that extra step to educate another person every day on how these things work.
And I think that we're all guilty of what is called unconscious competence, where you think that the Kubernetes policies or RBAC is common knowledge, isn't it? Everyone knows how it works. Easy. Why are you asking me this? But you have to put yourself in the shoes of, say, a front-end developer who all they know is single-page applications. And they do them well and they optimize them. And then they might have to containerize these applications, even if they are single-page, in order to be able to run them on the same cluster to achieve another goal, and you need to lend them a hand. We need to meet each of them in the middle. So devs and ops and then business.
Even with all this DevOps, I think that's still a number one problem, isn't it, that collaboration with folks? And I think a lot of it in my experience, it does... You've hinted this several times already. It does start at the platform. If you get your platform right, I think it falls in together. Have you got any advice for folks on building, buying, creating a platform?
With a platform you need to always remember who you're building it for. So the platform and the type of people that will be operating on the platform. Not the platform, but on the platform who will be using it are inseparable. So if you make certain decisions on the platform... Like we will give the developers all the control. So we will let them do whatever. But on the flip side, they will need to think about memory management, and they will need to think about how to achieve canary deployments, and they will have to think about how to implement health checks properly.
Then you're putting yourself in a position where you can only hire a certain type of developer. So you can hire a developer who's an all-rounder or can pick everything up. So it probably costs a lot more money, there are fewer of them on the market, and it raises the question of, do you need the platform team at all? Or maybe you need a couple of experts who can help full stacks 2020 to do some operational things that are way too deep, like networking.
On the other hand, if you want to abstract things away, you need to run the platform as a proper product. So it needs to be a product like the product you're serving to the customers. It needs to have proper documentation. It needs to have proper release cycles. It needs to have proper release planning. It needs to have end of support and all of the things which need a product management capacity. So it doesn't have to be a product manager. Product manager can be a technical manager, or it can be someone within the team who's interested, but it needs that capacity. It can't be packed. Otherwise you're just building a very unfriendly, ambiguous piece of software which is meant to boost everyone's performance or efficiency, but in the end ends up costing a lot and confusing everyone and making everyone stressed out.
Yeah, I've definitely done that one by mistake in the past. Several smart folks I've been chatting to of late have been saying the same thing. Treat this as a product, because you're going to invest in it, and you said it perfectly, you've got to manage it. You've got to own it, effectively, to make it useful to enable all your developers to be proficient. Yeah?
Developers have very high expectations. So as a part developer myself, I know that end users who are non-technical have fewer expectations for tech products that they're using. So an end user whose internet banking is broken, or who has some transactions lost, might challenge the bank, and there will be a bit of complaining, but not as much complaining as when you, God forbid, break some specific case on an API for a service that is free to use and everyone's abusing. So developers are very, very, very demanding. So you want to keep them happy and you want to keep the dev experience good.
One final question I wanted to ask you is what do you think the future of platforms is going to look like? You and I talked off-mic about, say, Lambdas and functions and stuff. Do you think it's going to go that way, all into functions as a service, or do you think there's going to be some kind of hybrid situation?
That's a very interesting one. So in my personal bubble, I think I've seen people going for all the things first. So all the distributed workloads, all the ECSs and the Kubernetes and all of that and all the serverless. So serverless was all the rage two years ago. And then I saw something that looked like developer fatigue. So we've reached the point, in my view, where there's too much going on, where yes, it's very very easy to deploy new functions and new services and to scale things and to do this as a service and that as a service. But with so much going on, developers just want to go back to writing code. So they just want to write application code again and not play with the bells and whistles around it. So to have something much, much simpler. And as one person said it, you have a problem when your Terraform code to deploy a Lambda is longer than your Lambda.
And we face a lot of this now. So I think we might see a comeback of all-rounder platforms. I've seen a lot of firms developing them internally. There's so much money and resource and time being poured into developing internal platform products, which don't go well because there's much less product delivery involved in developing them than you would expect for a product, as in the requirements are not triple-checked with the devs. So they're the highest person in the room's brain-dump or they're not documented really well. They're too loose, too strict, they don't fit the most important use cases, difficult to understand, too ops-y, not ops-y enough, too expensive, too much procurement involved. I think we might see a comeback of public platforms, or it might be an opportunity for someone's business.
Very interesting. Alla, it's been a pleasure talking to you again. Thanks so much for your time.
Likewise. Thank you.