Nic Jackson Discusses Cloud Native Platforms and Developer Tooling
In this episode of the Ambassador Livin’ on the Edge podcast, Nic Jackson, developer advocate at HashiCorp, discusses all things related to modern “cloud native” platforms and developer tooling.
Be sure to check out the additional episodes of the "Livin' on the Edge" podcast.
Key takeaways from the podcast included:
- Kubernetes provides a great set of primitives for building a platform, but additional technologies do need to be integrated with this to offer a Platform-as-a-Service (PaaS) like experience that many developers want.
- Developers need self-service access to a platform that allows them to understand both the operational impact and business impact of their changes.
- Platforms should follow the “shared responsibility” model of ownership. Operators want to set global safety and compliance properties, and developers want granular control of release and operation of a service.
- A continuous delivery pipeline should codify all application quality and safety requirements. The execution of the pipeline should be fast and consistent (e.g. minimize flaky tests).
- Debugging services running in Kubernetes can be challenging. Metrics, logging, and distributed tracing are your friend. Prometheus and Kibana provide a lot of value.
- Integrating a service mesh, like Consul, with distributed tracing can allow issues within the system to be located much more effectively than exploring logs.
- Being able to canary release functionality, both via an edge (API) gateway and via a service mesh, provides a controlled method of testing and verifying new functionality.
- Being able to capture and replay sanitized user traffic allows for very effective testing of the overall system, especially when running load tests.
Hi everyone. I'm Daniel Bryant and I'd like to welcome you to the Ambassador Living On The Edge podcast, the show that focuses on all things related to cloud-native platforms, creating effective developer workflows and building modern APIs. Today I'm joined by Nic Jackson from HashiCorp. Nic and I have previously worked together at a UK eCommerce company named Not On The High Street, which we often abbreviate to NOTHS. At NOTHS, Nic and I were part of the development team that worked closely with the ops team to build a cloud-native platform. We learned a bunch of things about platforms, dev loops, releasing code, testing, and I was keen to pick Nic's brains on these topics in more detail.
If you like what you hear today, I would definitely encourage you to pop over to our website at getambassador.io, where we have a range of articles, white papers and videos that provide more information for engineers working with Kubernetes in the cloud. You can also find links there to our latest release of the Ambassador Edge Stack, the open source Edge Stack API gateway, and our CNCF-hosted Telepresence tool too.
Hey Nic, thanks for joining me today. Could you introduce yourself for the listeners please?
Nic Jackson (01:02):
Yeah, of course. So I'm Nic Jackson. I work as a developer advocate at HashiCorp and I've been at HashiCorp for about three years. I am not a career developer advocate. My background is actually software engineering. So software development, software management, engineering team management. All the things.
Good stuff. So today we're talking about developer experiences and developer loops. Now you and I have worked together in the past, which is an interesting connection here as well. But when I say dev loops, I'm talking about the ability to very rapidly code, test, deploy, release, verify, that kind of thing. I'm sure any developer, we all recognize the good times and the bad times. So without naming any names, protect the innocent, could you describe your worst developer experience or your worst dev loop?
Nic Jackson (01:49):
So I'm going to go to something slightly different just to annoy you. But I think my worst dev loop, as somebody who's been working in the industry quite a while, is the current state of everything.
Nic Jackson (02:03):
And I'm going to tell you why by explaining my best dev loop. So about 13 or so years ago, I was working as a .NET developer. Azure had just launched. So we were at Microsoft. We were deploying to Microsoft. From Visual Studio, I could press a button, it would build my code and it would deploy it to my Azure instance as a canary. 13 years ago. So I admit that one-click deployment is not the right way. You want to be able to do check-ins and stuff like that. But we're talking about the flow, right? The dev flow. Not what's going on in the background.
Nic Jackson (02:50):
Nowadays, what do I have to do? I build this, I build it in Docker. I push it, I do this, maybe the CI flow kicks off, I wait for that to complete. Then I've got it in my test environment. I can go through and play around with things. It's just very, very slow. Now, I appreciate why it's slow because the complexity is n times what it used to be. And it has to be. We need distributed computing, multiple components. Complexity grows as... But why can't I have that developer experience in my modern environment? I think we, as developers, have to learn all of this tooling, and some people don't enjoy doing that. I'm like you, I'm curious, I'll dig into anything, but I think a lot of things feel quite unnecessary at the moment.
Do you think it's important for developers to understand the business requirements, to the point they can take responsibility for that full life cycle? We see Netflix talking about full cycle quite a bit. Developers can create hypotheses, run experiments and verify the results. They're business-aware folks. Do you think it's important for every developer to be aiming towards that kind of thing?
Nic Jackson (04:00):
Yeah, yeah. A hundred percent, because bugs happen. I see two different kinds of bugs. One is that it's a sloppy mistake, where you've written code which doesn't have coverage of a unit test, or something along those lines, such as misinterpreting what an internal API does. And then you've got missing features. And I think actually for most developers these days, a lot of the bugs that they get are actually missing features. It's maybe an edge case that hasn't been covered, therefore it hasn't been codified, therefore there isn't a test for that edge case. And I think if you are thinking about what the business use case is, what is the feature use case, you're translating that into what code do I need to write? And you're exploring all of these edges in your mind, which then you start to codify.
Hmm, totally makes sense. Taking a step closer to the tech for a second, do you think modern architectures have impacted the dev loop? You've already hinted that we're all building distributed systems these days. Well, it does help to break down complexity in the apps by building microservices, but there are obvious trade-offs with other things you've already hinted at in terms of build cycle.
Nic Jackson (05:15):
Yeah. I'm trying to think how to put this. I don't want to say anything disparaging. I think the key thing is that if... I'm going to say it the way that my brain says it. If you know what you're doing, it's not terrible. But I think the core problem is there is a massive gap between just getting started and being really confident with the architectures and the tools. And I think that's the core problem. It's unnecessary knowledge. You need to understand how a Kubernetes deployment manifest is written. You need to understand how to build a Docker container. You need to understand how to attach your local computer potentially to a remote system. You need to understand how to interact with Prometheus, to be able to dig through metrics, to understand any edges. There's a lot going on.
Nic Jackson (06:11):
And then you've got to understand all of these integration patterns. So how does one service talk to another? How do I document my service? If you only have one service, you don't necessarily need to document it, but if you've got contracts between two pieces of the system, then you need to do things like contract documentation, like Swagger. And there's a lot to learn. I feel that once you've learned it, it's fine, it doesn't become a thing any more. But it's not the greatest experience.
Yeah. Developers, I think, are having to become more operationally aware, whether it's cloud, Kubernetes, even VMs and stuff. What do you think about that? Is it again a case of all developers need to be at least somewhat operationally aware, or is this... There's still roles, I guess, for a specialist as well? Or is it a mix of the two?
Nic Jackson (07:00):
So I'll make, I suppose, a bold statement, in that I would say that I've spent quite a lot of time and energy looking into this and learning all of these various different things, and I'm a developer. So even to the extent of really understanding machine administration and things like that, I'll trade it all for a modern Heroku.
Yes, well said.
Nic Jackson (07:28):
Because ultimately what I want to be able to do is ship features. How those features are shipped, as long as they're shipped with the quality controls that I need such as the automated builds, the automated tests running and those safety and security mechanisms... How all of that happens, I don't care too much. I will say that you have to have mechanical sympathy, so regardless of whether you can click a button and it just magically appears on the interwebs, you still have to have mechanical sympathy of understanding what's going on in that infrastructure in order to write good, performant code.
Nice. You mentioned a modern Heroku and I totally get that. You and I have worked on Rails platforms as well, and Rails convention over configuration. Fantastic stuff. What do you think the biggest challenges are now? Is it for the ops teams to build a platform?
Nic Jackson (08:26):
Right. Yeah, I think that's the key thing. I mean, the way that I look at everything, and I think Kubernetes is a perfect example. I genuinely like Kubernetes, don't get me wrong, but I see Kubernetes as a set of primitives, as opposed to the platform. I'm waiting for somebody to build a platform on top of Kubernetes. And a big shout-out to the folks at Rancher, because I think they're doing incredible work around that problem. But yeah, I think that the level of interaction is too low-level. We need a collection of higher-level abstractions. And that needs to be generalized, because I feel that what happens is people recognize that as a problem and they build their own PaaSes. We shouldn't be building PaaSes, we should be leveraging a PaaS.
So when we're looking at, say, a PaaS, traffic control is really important, be it traffic control at ingress or traffic control service-to-service. What do you think is the best approach here? Should developers be controlling that kind of thing? Should operators be controlling it? Should operators be providing some kind of self-service platform? What do you think the relationship should be around defining traffic control going forward?
Nic Jackson (09:37):
It's a good question. And I think it's a shared responsibility. The developer should control their own routes. So for example, if you want to do some canary testing or dark deploys or something like that, then as a developer you should be able to configure the ingress routing of your application to say, well, I want a 50-50 split between V1 and V2 or 10-90 split between V1 and V2. I mean, I think operators have got this ownership of the core route, but the developer should have access to take control of their own responsibilities. You want to be able to do things like controlling the splits between the variations of service.
Nic Jackson (10:24):
So if I want to do a canary deployment, as a developer or a development product team, I should be able to control that flow. If I want to do something like a dark deploy and I want to do some multivariate testing, I should be able to control the routing to say that if this HTTP header is present, or this cookie, then send traffic this way, if not send it that way. And that's really, really important. I think that just ties in very much to that responsibility of modern microservice development.
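The two routing rules described here, a percentage split between versions and an override when a particular header or cookie is present, can be sketched in a few lines of Python. This is only an illustration of the decision logic, not how any particular gateway or mesh implements it; the function name `choose_version`, the `x-canary` header and the `canary` cookie are all hypothetical.

```python
import random
from typing import Mapping, Optional


def choose_version(
    headers: Mapping[str, str],
    cookies: Mapping[str, str],
    canary_weight: float = 0.10,
    rng: Optional[random.Random] = None,
) -> str:
    """Return "v1" or "v2" for a single request."""
    rng = rng or random.Random()
    # Explicit opt-in: a (hypothetical) header or cookie always routes to v2,
    # which lets a team exercise the new version without exposing it widely.
    if headers.get("x-canary") == "true" or cookies.get("canary") == "1":
        return "v2"
    # Otherwise send a fixed fraction of traffic to v2 and the rest to v1.
    return "v2" if rng.random() < canary_weight else "v1"
```

With `canary_weight=0.5` this gives the 50-50 split mentioned above; in practice the same intent would be expressed in the configuration of the ingress or service mesh rather than in application code.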
Nic Jackson (10:58):
So I'm a big believer that the application should own certain things, and that's things like not just application code, but secrets, routing, configuration, various platform, maybe even infrastructure-level stuff. Such as I've got configuration to deploy databases and stuff like that. That's tightly related to the application. And I think when you take this modern pizza-team type approach where an operations person is embedded into the development team and we all do the DevOps thing, we don't have DevOps engineers.
Yes. Yes, I hear you.
Nic Jackson (11:39):
Yeah. I think that that routing is absolutely essential, and the control over it.
What do you think about the cross-functional, non-functional type requirements? I know you and I have done a bunch of work in the Consul space on this, and there's retries, timeouts, things like that. I'm guessing that's super important, but again, more dev, more ops? What do you think?
Nic Jackson (11:58):
I think it's both. So, I mean, I would say that as a dev, you understand what's going on inside of the application code. Maybe as a more ops person, you're able to help out with infrastructure-layer things. So for example, I have a service and the service is performing slowly. So I look at my application code and I can see that, well, this thing here, this block of code is running really, really slowly. And all I'm doing is writing a file to the disk.
Nic Jackson (12:30):
Well, the ops person can come along and go, well, I can see from the machine-layer metrics that the IO on the disk is absolutely saturated. And I can also see that I'm using this particular type of cloud instance. And the cloud instance only has a certain number of IOPS for the disk. So therefore what we need to do is change the cloud instance, increase the IOPS for the disk, and actually the problem of performance disappears. It wasn't a code level problem per se, and it wasn't a deployment problem. It was just the wrong machine. When you work together in a cross-functional team and you've got that type of relationship where there's shared knowledge, it's just like that Venn diagram type thing, but that works really sweet. And I think we saw that in action when we worked together at Not On The High Street.
Yeah, indeed. So I wonder how much the platform can be used to drive the collaboration between dev and ops. Because when you and I worked together at Not On The High Street, as a team, we got together with the ops team and we really thought about the kind of UX in terms of how as developers we committed code, how that related to the platform, what ops offered us on the platform... And this was a few years back, we were using Mesos and Marathon pre Kubernetes, but we really did think about this stuff. So do you think the platform can be used as a key driver to increase collaboration?
Nic Jackson (13:56):
Yeah, I think so. And I think it makes sense having centralized platforms. I think it drives a certain efficiency. Some of the things that we were doing were you can make a deployment, you have a pipeline which runs your unit tests. It can run some functional tests. At NOTHS, we sometimes, and I say sometimes, had manual acceptance testing. And generally that was based on our appetite for risk. So if I was making a very small change that I knew would have low-risk impact if it went wrong, and actually I had high confidence that the work was correct and would be caught by the automation, we'd just deploy continuously. However, if I was making a big change and I was unsure, or it was high risk to the business, then we would take the additional security of saying, well, we're going to have some manual acceptance around this.
Nic Jackson (14:52):
And in both of those instances, the pipeline is essential, because the manual acceptance tester can actually deploy the versions of your application code into the test environment, without the need of interaction with an ops person or a dev person. They can just click buttons. The devs don't need to be doing anything manual. It's reproducible, it's consistent. I just commit my code, and I go through that GitFlow merging approach. And ops can concentrate on doing smart work, which is building the platform better rather than consistently just, "Can you deploy this application for me?" Nobody wants to be doing that.
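The risk-based gate described here, where low-risk changes deploy continuously and high-risk changes wait for manual acceptance, could be codified in a pipeline along these lines. This is a minimal sketch under assumed names: the `Change` fields and the `can_deploy` function are hypothetical and do not come from the NOTHS pipeline.

```python
from dataclasses import dataclass


@dataclass
class Change:
    risk: str                        # "low" or "high", as judged by the team
    tests_passed: bool               # outcome of the automated unit and functional tests
    manually_accepted: bool = False  # set once a tester signs off in the test environment


def can_deploy(change: Change) -> bool:
    """Decide whether the pipeline may release this change."""
    # Nothing ships if the automated checks fail.
    if not change.tests_passed:
        return False
    # Low-risk changes go out continuously once automation is green.
    if change.risk == "low":
        return True
    # Anything else needs the extra safety of manual acceptance.
    return change.manually_accepted
```

The point of putting the rule in code is that the same decision is applied consistently on every commit, without anyone having to raise a ticket.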
Raising a ticket, can you deploy this. Yeah, sure.
Nic Jackson (15:39):
Everybody benefits from the time being spent on putting the brainpower into the automation as opposed to the manual interaction. So yeah, I think it's really important. Maybe an interesting anecdote around what we did there as well at NOTHS was, do you remember that we had some, I think we were using Ansible at the time for our infrastructure as code. And when you created a new microservice, you had to copy this Ansible playbook, and then you would modify it. And the developers were like, "Oh, I can't work with this. What is this YAML? I can't deploy this." It's really not that complicated. You copy this, you paste it in there and you change these three things. But there was a massive barrier there for some reason. And we tried to educate and we tried to do training, and people were just like, "I don't want to do this."
Nic Jackson (16:28):
So what did we do? Well, we created a CI job which copied it, created the GitHub repo, cloned all of the things, put the defaults in, so that then when a developer wanted to create a new microservice, all they did was run a CI job and then clone the resulting GitHub repo. And it just smoothed it out. It was a simple bash script, but I think user experience you've got to think about. And I think when there's resistance, you shouldn't just assume that it's just somebody being awkward.
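The copy-and-fill-in-the-defaults step can be illustrated with a short Python sketch. The original was a Bash script run as a CI job that also created the GitHub repo; none of that is reproduced here. The function name `scaffold_service` and the `{{name}}` placeholder syntax are assumptions made for the example, and the template is assumed to contain only text files.

```python
import shutil
from pathlib import Path
from typing import Mapping


def scaffold_service(template_dir: Path, target_dir: Path, values: Mapping[str, str]) -> None:
    """Copy a template project and replace {{key}} placeholders with values."""
    # Copy the whole template tree; this fails if target_dir already exists,
    # which protects an existing service from being overwritten.
    shutil.copytree(template_dir, target_dir)
    for path in target_dir.rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text()
        for key, value in values.items():
            text = text.replace("{{" + key + "}}", value)
        path.write_text(text)
```

A developer would then only supply the handful of values that differ per service, for example `scaffold_service(Path("template"), Path("new-service-one"), {"name": "new-service-one"})`.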
Yeah. You just reminded me, one of my most happy memories in some ways of that, but there's many happy memories of that project. I learned a lot and it was great working with awesome people. Yeah, I really enjoyed it. But one of the key things, more from the Java side, I remember when my colleague Will who I was working with, he integrated a bunch of stuff into the template project in Java. So you could literally click a button, type in the name of your project, like "new service one", and out would pop up an archetype, effectively, with pre-existing hooks into testing and into observability. That was game changing. I didn't need to think about this stuff. Yeah?
Nic Jackson (17:34):
I was talking loosely around this sort of stuff the other day as well, that I love Go. And I've been programming Go for a good few years now, probably six or seven years. Or if you're a recruitment consultant, I've been doing it 20 years. But the interesting thing I found around Go and the limitations on the structure of the language, the simplicity of the structure of the objects and the language, it makes it really difficult to take that nice templatized approach you get when you've got something like Java or .NET, and you can have abstract classes and... It hides all of that implementation, which in some ways it's good. In most ways I think it's good. But with Go, it just isn't as easy to do that.
What do you think the importance of observability is like? It is, I'm guessing, super important, but is it more challenging with modern infrastructure and modern service design, do you think?
Nic Jackson (18:38):
Yeah, I think so. I mean, the way I look at observability is, observability is the thing that you don't need until you realize you need it. And most of the time, as I said, it's just something which exists, and I think that in a perfect world, never looking at a dashboard is a wonderful place to be. But I mean, it's unusual. When you get a modern microservice system, obviously with just, say, metrics, which is a more traditional approach, everything is localized to that service. It's very difficult to correlate things. So you can understand very, very quickly where something has gone wrong.
Nic Jackson (19:22):
So I could see that my service which is storing something into the database is going wrong. It's running slow and that's why everything's breaking. Now, I can't see the reasons why. So I see where, but I can't really see why. And I think that's where things like distributed tracing really start to become a benefit, because I can see the full request trace that leads to the incident. I've got that forensic trail.
Nic Jackson (19:54):
And I'm genuinely excited about tracing. I think the major barrier to tracing at the moment is the limitations around storage and retrieval. Obviously a trace is a much bigger document than a very small metric. It doesn't reduce as well either. But I'm very excited about the possibility that one day I feel we will not use metrics at all. What we will actually do is just have everything in a trace. All of your gauges, such as CPU or memory consumption at that particular instance in time, IOPS at that particular instance in time, logging, so any log messages which you've emitted. And you tie that all into this one document.
Nic Jackson (20:39):
And again, I think how that's distributed, whether the metrics end up in Prometheus and the raw trace data ends up in something like Jaeger or Elasticsearch or wherever, and the logs end up in Logstash, or... I don't care. I just want to be able to interrogate all of that information together because that's what gives me the rich picture, and that's what helps me to understand what has been the cause of whatever has gone wrong.
So I've got to give a shout-out to LightStep and Honeycomb. I've seen Charity and Liz talk at conferences about their Facebook and Google experiences of being able to ask these ad-hoc questions of datasets, high-cardinality datasets, and Ben and the LightStep team are talking very much about being able to identify exactly where a problem is. To point the developers to there would be super powerful, wouldn't it?
Nic Jackson (21:32):
Yeah. And I think the cardinality is really important as well. I always look at the resolution of data as really important, because one of the approaches that you tend to take is that over time you can reduce the data and start combining it and reducing the resolution of it, because as time goes on, it becomes less important. I question whether that's true or not. Because one of the things we used to do at Not On The High Street was have Black Friday. We would always like to look at the previous year's data. And a lot of people say, "Well, it's never going to be exactly the same." We found that things were actually remarkably similar. The traffic patterns, things like that were very, very similar. And being able to have that high resolution of data to use as reference when we were doing performance testing on the next year's system was really, really useful. And I think if we reduced that down to the hourly averages, we would have lost out a lot. So...
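A small worked example shows what is lost when high-resolution data is reduced to hourly averages. The numbers below are invented purely for illustration: one hour of requests per minute with a steady load and a five-minute spike.

```python
from statistics import mean

# Hypothetical requests-per-minute for one hour:
# 55 minutes of steady load followed by a 5-minute spike.
per_minute = [100] * 55 + [1000] * 5

hourly_average = mean(per_minute)  # what survives after down-sampling to one point per hour
peak = max(per_minute)             # what a load test actually needs to reproduce

print(f"hourly average: {hourly_average} req/min, peak: {peak} req/min")
```

Here the hourly average is 175 requests per minute, while the system really had to cope with 1000. A load test sized from the average would miss the spike entirely.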
We literally replayed the data back, if memory serves, Nic, didn't we? We literally replayed the requests and the shape of the traffic almost verbatim to test the existing traffic matched the new system.
Nic Jackson (22:53):
Yeah. The way that we used to do it, which you can actually achieve right now, which worked beautifully for us, was we looked at Google Analytics and we-
Oh, that's it.
Nic Jackson (23:00):
Yes. So we would look at the percentage of requests that were coming in to a particular page and a particular product and what the distribution across all of that was. And we would then use that to generate a load test. So we would say that if 95% of pages were the homepage, when we were building our global load test, we would send 95% of our traffic to the homepage. And if 10% were for payment, then we'd send 10% to payments. So we were actually simulating the load on the system as it would be at a particular point of time based on last year's traffic.
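Turning an analytics breakdown into a load-test plan is mostly arithmetic, as this minimal sketch suggests. The function name `build_load_plan`, the paths and the shares in the usage note are made up for illustration and are not the figures from the conversation; the shares are normalized so they do not have to add up to exactly 100.

```python
from typing import Dict, Mapping


def build_load_plan(page_share: Mapping[str, float], total_requests: int) -> Dict[str, int]:
    """Split total_requests across pages in proportion to their observed traffic share."""
    total_share = sum(page_share.values())
    # Rounding means the counts may differ from total_requests by a request or two.
    return {
        page: round(total_requests * share / total_share)
        for page, share in page_share.items()
    }
```

For example, `build_load_plan({"/": 60, "/product": 30, "/payment": 10}, 1000)` returns `{"/": 600, "/product": 300, "/payment": 100}`, which a load-testing tool could then use as per-endpoint request counts.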
Nic Jackson (23:38):
What I think is really interesting now is the capability of using Datawire's ingress, which is Envoy-backed, and being able to take that traffic and maybe actually log the traffic itself to be able to do replay of actual requests. And I think we're going to see some very cool stuff in the coming years around that replay of live data, rather than simulating the structure of it. But it worked great for us, and I think it's definitely something I recommend people digging into.
Testing in a complex system is hard. I think we found actually having real user data, there's no need then to simulate or to cut up the system at which you pump in requests in the front end, see what happened, and that is pretty much... Providing, as you mentioned, you get a representative sample of actual requests being made, you test your system pretty well, don't you?
Nic Jackson (24:36):
Yeah. And I think it's, again, back to the questions we were looking at earlier around the business logic thing. I've worked in a system before which had an outage. And the reason that the system had an outage was because Elasticsearch crashed. And the reason that Elasticsearch crashed was because a particular input from a search box caused basically a death spiral query within Elasticsearch. And it took the entire system offline just by somebody basically... I don't remember exactly what it was, but somebody used two stars in a search term. And that was enough to cause this bad query inside of Elasticsearch, which literally just was running a hundred percent CPU to try and resolve this recursive query. And that was just like, well, you never think anybody's ever going to do that. And it wasn't malicious. It wasn't like somebody deliberately was trying to DDoS us by probing. It was literally just somebody probably fat-fingered a search term.
Yeah. Happens, doesn't it?
Nic Jackson (25:42):
Right. But that took a lot of discovery to figure out what was going on there, because we knew that it had happened, but we didn't have the resolution on our logs and our metrics to be able to understand exactly what the input was which caused that crazy output. And eventually we managed to dig through with a little bit of experimentation. But certainly being able to capture requests, maybe even just erroneous requests, that's going to be a massive benefit to being able to just catch all of those edge cases. And again, it wasn't a bug. It wasn't that the developer did anything wrong. It's just, we never wrote the feature.
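Capturing requests for later replay usually means stripping anything sensitive before the request is stored. The sketch below shows one way that step might look, assuming a request is represented as a plain dictionary; the function name `sanitize_request` and the list of headers treated as sensitive are assumptions, and a real system would likely also need to scrub personal data from query strings and bodies.

```python
from typing import Any, Dict

# Headers that commonly carry credentials or session identifiers.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie"}


def sanitize_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request that is safe to store and replay."""
    headers = {
        name: value
        for name, value in request.get("headers", {}).items()
        if name.lower() not in SENSITIVE_HEADERS
    }
    # Keep the parts needed to reproduce the input that caused the problem:
    # the method, the path and the query string (such as a search term).
    return {
        "method": request["method"],
        "path": request["path"],
        "query": request.get("query", ""),
        "headers": headers,
    }
```

With something like this in place, an erroneous request such as a search containing two stars could be stored with its exact query string and replayed against a test environment.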
Yeah. Super interesting, Nic. Final question is, what do you think the future developer experience will look like in, say, three years time, compared to today?
Nic Jackson (26:30):
I'd like to think in three years' time, we're going to come to terms with the fact that we need a common PaaS, which sits over the tooling that we're working with right now. Not just Kubernetes, everything. We need to get back to that Heroku-like efficiency which allows us to deliver business features rather than every developer spending X percent of their time doing exactly the same as every other developer in every other company and ultimately all we're doing is figuring out a way of writing canary deployments or something. So I think we're going to get to that stage. I honestly think... And don't get me wrong, I am the biggest IDE snob because I know how to use Vim and Tmux, therefore I am superior.
Nic Jackson (27:16):
I think everything is going to head back to the IDE. I am very impressed with the user experience with tooling, like GoLand, well, any of the IntelliJ stuff, and Visual Studio Code. And I think as developers, we're going to start heading back that way because of the efficiencies of operation, the fact that clicking a button is 50 times faster than running five commands. And I think that we were driven towards or away from the IDE for two reasons. One, I think it was snobbishness that we thought we were exercising our own intelligence by doing things in a complicated way. But also out of necessity. The tools just didn't exist. And it was a case of cobbling together these five different tools to get this workflow that we want. I think that it's going back to the IDE.
Always a pleasure to chat to you, always love your insights, always learn a bunch chatting to you. So thanks very much, Nic. Appreciate that.
Nic Jackson (28:14):
Pleasure, buddy. Any time.