LIVIN' ON THE EDGE PODCAST

Charity Majors on Instrumenting Systems, Observability-Driven Development, and Honeycomb

Ambassador Labs · LOTE #11: Charity Majors on Instrumenting Systems, Observability-Driven Development, and Honeycomb

About

When building microservice-based (distributed) systems, engineers must learn to accept that most problems that will be seen in the future cannot be predicted today. Therefore, being able to observe the system and to formulate and verify hypotheses about issues is vitally important. Being able to answer ad hoc questions from your observability system, without having to ship custom code or metrics updates, is equally important. If engineers have to invest large amounts of time creating a custom dashboard for each issue they encounter, their workspace will be “littered with failed dashboards that are never seen again.”

Episode Guests

Charity Majors

CTO at Honeycomb

Charity Majors is the CTO and co-founder of Honeycomb, a tool for software engineers to explore their code on production systems. Charity has been on call since age 17, a terrifying thought. She has been every sort of systems engineer and manager at Facebook, Parse, Linden Lab, etc., but somehow always ends up responsible for the databases. She likes free software, free speech and peaty single-malts.

In this episode of the Ambassador Livin’ on the Edge podcast, Charity Majors, CTO at Honeycomb and author of many great blog posts on observability and leadership, discusses the new approach needed when instrumenting microservices and distributed systems, the benefits of “observability-driven development (ODD)”, and how Honeycomb can help engineers with asking ad hoc questions about their production systems.

Be sure to check out the additional episodes of the “Livin' on the Edge” podcast.

Key takeaways from the podcast included

On-call alerting should be triggered by service level objectives (SLOs), rather than simply being triggered by an infrastructure failure or a monitoring threshold being breached. Engineers should only be woken up if the business is being impacted.
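
As an illustrative sketch only (the episode does not prescribe a formula), one common way to express SLO-based paging is an error-budget burn rate: page only when failures are consuming the budget much faster than the SLO allows. The names SloWindow, burn_rate and should_page, the 99.9% target and the 10x threshold are all assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts observed over some recent window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means failures are arriving exactly as fast as the SLO permits;
    higher values mean the budget will run out early.
    """
    if window.total_requests == 0:
        return 0.0  # no traffic, so no user is being hurt
    error_rate = window.failed_requests / window.total_requests
    error_budget = 1.0 - slo_target
    return error_rate / error_budget


def should_page(window: SloWindow,
                slo_target: float = 0.999,
                burn_threshold: float = 10.0) -> bool:
    """Page a human only when users are clearly being impacted.

    The threshold is an assumption for this sketch, not a recommendation.
    """
    return burn_rate(window, slo_target) >= burn_threshold
```

With these assumed numbers, 20 failures in 1,000 requests would page, while a single failure in 1,000 would not, regardless of what CPU or disk graphs look like.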

Engineers must move away from the classic approach of simply monitoring well-understood infrastructure metrics towards actively instrumenting code in order to be able to have more of a constant “conversation” with production systems.

Engineers should strive to understand what “normal” looks like in their system. By establishing baselines and scanning top-level metrics each day, an engineer should be able to quickly identify if something fundamental is going wrong after a release of their code.

The four metrics correlated with high performing organizations, as published by Dr Nicole Forsgren in Accelerate, should always be tracked: lead time, deployment frequency, mean time to recovery, and change failure percentage.
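
To make the four metrics concrete, here is a minimal sketch of how they might be computed from a list of deployment records. The Deployment record, its field names and the four_key_metrics helper are hypothetical; real teams would pull this data from their CI/CD and incident tooling, and the exact definitions can vary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is fixed


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def four_key_metrics(deploys: List[Deployment],
                     period_days: int) -> Dict[str, Optional[float]]:
    """Compute lead time, deployment frequency, MTTR and change failure rate."""
    if not deploys or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    failures = [d for d in deploys if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    return {
        "lead_time_hours": mean(
            _hours(d.deployed_at - d.committed_at) for d in deploys),
        "deploys_per_day": len(deploys) / period_days,
        "change_failure_pct": 100 * len(failures) / len(deploys),
        # None when nothing failed (or nothing has been restored yet).
        "time_to_recovery_hours": mean(
            _hours(d.restored_at - d.deployed_at) for d in restored
        ) if restored else None,
    }
```

Tracking these as a simple daily or weekly report is usually enough to see the trend; the absolute values matter less than whether they are improving.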

Engineers work within a socio-technical system. Teamwork is vitally important, and so is the ability to rapidly develop and share mental models of issues. The UX of internal tooling is more important than many engineering teams realize.

Test-Driven Development (TDD) is a very useful methodology. A failing test that captures a requirement is created before any production code is written. However, due to the typical use of mocks and stubs to manage the interaction with external dependencies, TDD can effectively “end at the border of your laptop”.

Observability-Driven Development (ODD) is focused on defining instrumentation to determine what is happening in relation to a requirement before any code is written. Just as you wouldn’t accept a pull-request without tests, you should never accept a pull-request unless you can answer the question, “How will I know when this isn’t working?”

Developers need to understand “just enough” about the business requirements and the underlying infrastructure in order to be able to instrument their systems correctly.

Using modern release approaches like canary releases, dark launching, and feature flagging can help to mitigate the impact of any potential issues associated with the release.

Honeycomb is a tool for introspecting and interrogating your production systems. Honeycomb supports high-dimensionality of monitoring data. Engineers add a language-specific “Beeline” library or SDK to their application, and within their code they can add custom, business-specific metadata to each monitoring span, such as user ID or arbitrary customer data.
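
The idea of attaching business-specific fields to each span can be illustrated without any vendor library. The sketch below is not the Beeline API; Span, traced and the field names (user_id, plan, cart_items) are invented for this example, and the "sink" is just any function that accepts a JSON string.

```python
import json
import time
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator


class Span:
    """One wide, structured event describing a unit of work."""

    def __init__(self, name: str, **fields: Any) -> None:
        self.fields: Dict[str, Any] = {"name": name, **fields}

    def add_field(self, key: str, value: Any) -> None:
        """Attach arbitrary context that future-you may want to query by."""
        self.fields[key] = value


@contextmanager
def traced(name: str, sink: Callable[[str], None],
           **fields: Any) -> Iterator[Span]:
    """Time a block of work and emit exactly one JSON event for it."""
    span = Span(name, **fields)
    started = time.monotonic()
    try:
        yield span
    except Exception as exc:
        span.add_field("error", repr(exc))  # record the failure, then re-raise
        raise
    finally:
        span.add_field("duration_ms", (time.monotonic() - started) * 1000)
        sink(json.dumps(span.fields, sort_keys=True))


# Usage: collect events in a list; a real sink would ship them to a backend.
events = []
with traced("checkout", events.append, user_id="u-42", plan="enterprise") as span:
    span.add_field("cart_items", 3)
```

Because every field lives on the same event, a later query can slice by any combination of them (user, plan, build, region) without new code having to be shipped first.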

Honeycomb’s “BubbleUp” feature is intended to help explain how some data points are different from the other points returned by a query. The goal is to try to explain how a subset of data differs from other data; this feature surfaces potential places to look for "signals" within data.
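
Honeycomb's actual implementation is not described in the episode, but the general idea of comparing a selection against a baseline can be shown with a toy sketch. Here explain_selection is a hypothetical helper that, for every (field, value) pair, compares how often it appears in the selected events versus the baseline events and sorts by the size of the difference.

```python
from collections import Counter
from typing import Any, Dict, Hashable, List, Tuple

Event = Dict[str, Hashable]
Pair = Tuple[str, Any]


def _shares(events: List[Event]) -> Dict[Pair, float]:
    """Fraction of events in which each (field, value) pair appears."""
    counts = Counter((key, value) for e in events for key, value in e.items())
    total = len(events) or 1  # avoid dividing by zero on an empty list
    return {pair: count / total for pair, count in counts.items()}


def explain_selection(selected: List[Event],
                      baseline: List[Event]) -> List[Tuple[Pair, float]]:
    """Rank (field, value) pairs by how differently they occur in `selected`.

    A positive difference means the pair is over-represented in the
    selection; a negative one means it is under-represented.
    """
    sel, base = _shares(selected), _shares(baseline)
    diffs = [(pair, sel.get(pair, 0.0) - base.get(pair, 0.0))
             for pair in set(sel) | set(base)]
    return sorted(diffs, key=lambda item: abs(item[1]), reverse=True)
```

In this toy version, if every slow request in the selection came from one build while the baseline is mostly another build, the two build values would sort to the top and fields that look the same in both groups would sink to the bottom.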

Although the “three pillars” observability model is useful, the primary goals of any observability system are to help an engineer to understand the underlying system, identify issues, and locate the cause of issues.

Transcript

Daniel (00:03):

Hello everyone. I'm Daniel Bryant. I'd like to welcome you to the Ambassador Livin' on the Edge podcast, the show that focuses on all things related to cloud native platforms, creating effective developer workflows and building modern APIs. Today, I'm joined by Charity Majors, CTO at Honeycomb and author of many great blog posts on observability and leadership.

Daniel (00:21):

I was keen to pick Charity's brains around how engineers should approach monitoring and observability when building microservices-based distributed systems. I also wanted to dive a little bit deeper into the topic of observability-driven development, which I've seen Charity talk about now for several years.

Daniel (00:34):

Finally, I was keen to understand a little bit more about what the Honeycomb team have been up to recently and explore what problems their observability tooling suite helps engineers solve. For example, the new BubbleUp problem identification feature looked super cool. If you like what you hear today, definitely encourage you to pop over to our website.

Daniel (00:50):

That's www.getambassador.io, where we have a range of articles, white papers and videos that provide more information to engineers working in the Kubernetes and cloud space. You can also find links there to our latest releases, such as the Ambassador Edge Stack, including service preview and the developer portal, our open source Edge Stack API gateway, and also our CNCF hosted Telepresence tool, too.

Daniel (01:12):

Hi, Charity. Welcome to the podcast. Thanks for joining us today.

Charity (01:14):

Thanks for having me. I'm excited to be here.

Daniel (01:17):

Could you briefly introduce yourself please and share a recent career highlight?

Charity (01:21):

A recent career highlight? Oh my goodness. Well, I am the co-founder of honeycomb.io. I've made a career out of being basically the first ops infrastructure nerd to join teams of startup engineers. About a year ago, my co-founder and I traded places. She became CEO and I became CTO. That was honestly the best thing that ever happened to me because I never, ever, ever wanted to be CEO. It just kind of like... But you know, I feel like my career has been... I've always been motivated not by doing things because oh, this is fun, but because there's a problem that needs to be solved and I should get it done, and so I had to be CEO for three and a half years. It just about killed me, but it's done and I'll never do it again.

Daniel (02:10):

Wow. Awesome stuff, Charity. So I wanted to ask the traditional first question on this podcast about developer experience, developer loops. It could also be ops themed, as you and I were chatting off mic about this, but I'd love to hear a really, without naming names, but a really gnarly story around the capability to rapidly code, test, deploy, release, and verify. I'm sure you've got many from your super interesting past.

Charity (02:29):

I do. I do. You know, I think I was really born out of my experience. I was the first infrastructure lead at Parse, the Mobile Backend as a Service. And this was a fun set of problems. We were doing a lot of things before they... We were doing microservices before they were microservices. We were doing like almost a... But like, it was a platform as a service, right? So you can build your mobile app using our APIs and SDKs and you never had to know or think about what was going on. Great for them. They just hit send, upload the mobile app, cool. Blah, blah, blah. And then in the middle of the night, their app would hit number one in the iTunes store or something. And guess who gets paged? Well... Right?

Charity (03:08):

And because our interface gave no feedback to them about, say, the performance of the queries that they were composing, because it was just in an SDK, right? They couldn't see that they were doing a 5x full table scan on the MongoDB table. You know, they had no idea. They weren't even being bad engineers. They just couldn't see it. We had over a million mobile apps by the time I left. The problems that were just dazzlingly hard were the ones where it could be something you did as a developer that broke your app. It could be something that we did. It could be something in combination, too. Or it could be something, because they're all shared hardware pools and shared databases, that any one of the other million apps sharing some component had done at the wrong time.

Charity (03:50):

All right. So I do have a bug in particular that I'm thinking of that wasn't right. So, I'm sitting there at work one day and these customer support people come over to me and they're very upset. They're like, "Push is down." And I'm like, "Push is not down. It's in a queue, and I'm getting pushes. Push cannot be down." Right? They're very insistent. I was just like, "You're wrong. Go away." The next day, they come back and they're like, "Push is still down. And people are getting very upset." All right. So I finally go... This is how most of my stories go. I finally go and I start digging into it manually. Like, "What is going on?" Well, it turns out that Android devices, I don't know if it's still true, they used to have to keep a socket open to subscribe to pushes on a channel. So every Android device we ever wanted to push to had a socket open and so we had this pool. You know, there was an auto-scaling group that had a million connections to each of them.

Charity (04:44):

And at one point we had increased the size of the ASG, which brought up some more capacity, load balanced, everything, and it turns out this particular time when we upsized it, when it added the new nodes to the round-robin DNS record, it exceeded the UDP packet size in the response, which is usually fine. It's supposed to be fine. In the spec, it says it's supposed to fail over to TCP if it fails in UDP, which it did for everyone in the world, except for anyone coming through a particular router in Eastern Europe. Then it could not resolve push.parse.com and it was completely down.
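
A rough back-of-the-envelope sketch of the failure mode described here, under stated assumptions: classic DNS over UDP without EDNS0 limits a message to 512 bytes (RFC 1035), and a server that cannot fit the answer marks it truncated so the client retries over TCP. The helper names and the assumption that each answer uses a 2-byte compressed name are illustrative; the actual record counts and resolver settings in the incident are not given in the episode.

```python
DNS_HEADER_BYTES = 12
CLASSIC_UDP_LIMIT = 512  # RFC 1035 limit for UDP messages without EDNS0


def encoded_name_length(name: str) -> int:
    """Wire length of a domain name: one length byte per label plus a root byte."""
    labels = name.rstrip(".").split(".")
    return sum(1 + len(label) for label in labels) + 1


def a_record_response_size(name: str, record_count: int) -> int:
    """Estimate the size of a response carrying `record_count` A records."""
    question = encoded_name_length(name) + 4  # QTYPE (2) + QCLASS (2)
    # Per answer: compressed name pointer (2), TYPE (2), CLASS (2),
    # TTL (4), RDLENGTH (2), IPv4 address (4).
    answer = 2 + 2 + 2 + 4 + 2 + 4
    return DNS_HEADER_BYTES + question + record_count * answer


def needs_tcp_retry(name: str, record_count: int,
                    udp_limit: int = CLASSIC_UDP_LIMIT) -> bool:
    """True when the response no longer fits in one UDP message."""
    return a_record_response_size(name, record_count) > udp_limit
```

Under these assumptions, a round-robin name like push.parse.com fits 30 A records in a single UDP response, and adding a 31st pushes it over the limit, which is the point where any network path that mishandles the TCP retry would start failing.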

Charity (05:21):

This is like one of the outages that I use to illustrate to people when I'm trying to tell them about how you have to give up the sense of control: "I'm going to predict the things that are going to happen. I'm going to manage my thresholds. I can curate them. I can flip through my knowledge of past outages to tell what's going on."

Charity (05:37):

Like, you just have to accept that most of the problems you have in the future, you cannot have predicted. You will only see them once. They will never happen again. If you invest a lot of time into creating a dashboard so you can find it immediately the next time, you're just going to have a past that's littered with failed dashboards that are never seen again. You have to switch from this monitoring model to one of really actively instrumenting and being in much more of a constant conversation with your code in production, and pushing the point of testing out to encompass it as well.

Daniel (06:11):

Very interesting, Charity. I chatted to Sam Newman last week and one thing he said, and I think it's what I'm hearing from you is as a system grows in scale and complexity, you can no longer monitor for what's wrong. You almost have to look at the business. Are we allowing our customers to do the thing they should? And when it's not, then dive in to figure out what actually is wrong.

Charity (06:29):

The number of things that you are allowed to care about shrinks relentlessly, and you get to the point where it's like, okay, this is where people have to make the leap from monitoring to SLOs. Because when you have an SLO, you're like, this is the business contract with our users. And if they're not in pain, we're not going to wake anyone up because otherwise, you drive yourself mad. And if you're like, "Well, ideally I would page people before my users notice." Don't.

Daniel (06:56):

Interesting.

Charity (06:57):

Just don't try. Just don't because it's impossible. You're just going to either burn your people out or not... You just have to draw a bright line there and say that it's users being able to tell. But the thing is that it's not that... When I say you're allowed to care about relentlessly fewer things, that's the very blunt perspective of waking people up in the middle of the night. Actually, there are more and more things you have to care about sometimes, or figure out how to be sensitive to. This is why I feel like if you actually want to build these systems at scale, if you're writing code for that system, you should be in the system, looking at it through the perspective of your instrumentation every goddamn day, or when something breaks, you're not going to know what weird looks like.

Daniel (07:40):

Oh, yeah. Interesting.

Charity (07:42):

Because your sixth sense of, I just shipped something, is it doing what I wanted it to? And does anything else look weird? And you always want to make it more specific than that. But if you do, then don't. Because weird is a thing, right? It's that spidey sense. It's in you. If you learn to follow your curiosity and if it's rewarded in a short amount of time, that's really the best way to do it.

Daniel (08:06):

I've heard you use the word portrait, like that kind of almost being able to feel the hum of the system or know that steady state. I think it's something... I've made that mistake in the past. I'm like, it looks weird, but has it always looked weird?

Charity (08:13):

Yeah. If you're not in there every day, you don't know, and then you're going to waste even more time trying to figure it out.

Daniel (08:18):

I hear you. Well said, well said. I was looking through your Twitter today and you've got that evergreen tweet pinned on there, which I thought was fantastic. And you said, "Observability, short and sweet: can you understand whatever internal state the system has gotten itself into just by inspecting and interrogating its output even if, especially if, you have never seen it before." Is that pretty much still your working definition of what observability is about?

Charity (08:40):

And without shipping custom code to handle it, because if you have to ship code to handle it, that implies that you had to be able to predict it and anticipate it in advance. Yeah. This is a definition that has a heritage in mechanical engineering. Of course, they get very huffy when we use their words, but it's not the first thing that we borrowed and so whatever. But yeah, absolutely, I think that that... You know, and it's a socio-technical definition, right?

Daniel (09:05):

Yeah.

Charity (09:06):

This is not something where you can buy a tool, drop it in there and you get it, right? It's the people, their knowledge, their practices. It's the system, it's the tools you use to... You know, it's all of it, which is why it's not an easy answer, but also, it's a very approachable one. And I think it's very amenable to small steps.

Daniel (09:29):

Yeah. Nice, Charity. I chatted to Nora Jones actually a couple of weeks ago. Nora and John Osborne, many of the folks who are doing this kind of... They're really echoing this socio-technical aspect. I think it's something we don't hear about quite so often in computing at the moment. Let's say in airplane crashes and so forth, we always look at the socio-technical side, the technology combined with the humans. Do you think we need to put more of that into computing?

Charity (09:50):

This might have been the thing that I've been thinking about this year more than anything else. I think about it, even in context of like, this is what... The blog post that I wrote that is still the most popular of all is the one about the pendulum. You know, the engineer-manager pendulum of your career.

Daniel (10:03):

Oh, yeah. That's correct.

Charity (10:05):

Right? I feel like the socio-technical language gives me another way of explaining why that matters because the scarcest resource in all of our lives is engineering cycles. Right?

Daniel (10:15):

Mm-hmm (affirmative). Yep.

Charity (10:16):

The difference between a low performing team and a high performing team can be very difficult to quantify. Workloads are different in difficulty. You know, it can be very difficult and we do end up relying on our intuition a bunch. I think that the four DORA report metrics are really hip.

Daniel (10:31):

Oh, I love them. Yeah.

Charity (10:32):

But then if you're trying to debug it, if you're like, "Okay, I've got a team, it takes us a month to ship our code," and you know that you're getting better, why does it matter to have technical managers? How technical? How close should they be to the code? How much systems knowledge do they need? How much people knowledge do they need? I think that this is a really interesting way of just explaining why you can't just hire better people or just train your people. You can't just buy tools. You can't just fix your code. It takes a lot of judgment. And you have to be able to surveil this and go, "This is what's holding us back in the moment. Let's work on this first." And then you buy yourself some time and some space to like, "What's the next thing that's holding us back?" You know? Without that literacy of both going out to go deep on the people and on the code, I think that you're really just going to struggle.

Daniel (11:24):

What you say there actually reminds me of one of my mentors. I asked a couple of years ago, what should I be learning about? He said to me, "Systems. Learn about systems." He recommended a couple of fantastic books. I'm guessing that's what I'm hearing from you, the system context of knowing where to-

Charity (11:38):

Yes. But I would say that you're not going to learn it from books.

Daniel (11:40):

Yeah. That's right. Yeah.

Charity (11:42):

I would say put yourself in the on-call rotation. I think that living, breathing systems, I feel like you can't... Anybody who claims to be a senior engineer who doesn't know how, who doesn't have that intuition of "this is a healthy system, this is not," isn't a senior engineer. There are a lot of people who are very good at data structures and algorithms who I would not trust within a 10-foot pole of my system. I feel like DevOps, like we talk about it like this newish thing, but it's not. It's really returning to our roots, right?

Daniel (12:13):

Yeah. Yeah, yeah.

Charity (12:13):

There was a time when you wrote code on production for f*ck's sake.

Daniel (12:17):

I've done that. Yes. Yeah.

Charity (12:20):

Right? And then there's specialization and all this stuff gets torn up and scattered to the corners of the earth. It's just like, we lost something critical there of just understanding what had actually happened when we wrote this code.

Daniel (12:34):

Yeah. I remember when Liz Fong-Jones joined Honeycomb. I'm a big fan of Liz's work in general. And I remember them saying that they were entering the Honeycomb on-call rotation. And I was like, "What? Even a person of Liz's fame has to be on-call at Honeycomb?"

Charity (12:45):

You have to. I'm a big fan of Liz, too. But we very much agree on that. It's something that there's no substitute for, and we take it very seriously that we live what we preach. Our DORA metrics, these four metrics, for us, we looked them up out of curiosity a little bit ago. And they're an order of magnitude better than the elite status of the DORA report.

Daniel (13:09):

Seriously?

Charity (13:09):

Mm-hmm (affirmative).

Daniel (13:09):

Wow.

Charity (13:09):

Yeah. And here's the thing. I feel like we have just... Our expectations are so low. Our bar is so low. What we expect of a life as a computer engineer... we accept so much pain and suffering and wasted time and frustration. We have a sense of humor about it. We're like, "This is just what it's like to do stuff with computers. You never know what's going on. Ha, ha, ha, ha, ha." You know? How funny is it?

Charity (13:34):

We're just going to ship some more broken shit onto the system that we never really understood and we're going to cross our fingers and watch our monitoring checks and then go home. You know, maybe we get woken up a few times a week. How is that okay? I feel like everyone should be on call, but getting woken up two or three times a year is reasonable.

Daniel (13:54):

Well said, Charity.

Charity (13:54):

Right?

Daniel (13:54):

Yeah. Well said.

Charity (13:57):

If you're building a system and you expect to understand it and you expect to understand when something looks weird, I'm not trying to... It is way harder to dig yourself out of that hole than it is to have never gotten into it. You know?

Daniel (14:12):

Yeah, yeah.

Charity (14:12):

It's harder, but it's not impossible.

Daniel (14:16):

No. That's superb, Charity. That's superb. Just switching gears a little bit now. I was super interested in your recent article on The New Stack around observability-driven development. I've seen you talk about it a couple of times before, but I still haven't seen it get enough love yet. Could you just briefly introduce the topic for folks? I'm sure most of us are familiar with TDD, but you've talked about ODD.

Charity (14:33):

TDD was revolutionary. Right?

Daniel (14:36):

Mm-hmm (affirmative).

Charity (14:36):

It made it so that we could tell when we had new regressions. It really boiled down everything that you were doing to these very small, repeatable, deterministic snippets. And it did that by stripping out everything that was real. Everything that was variable or interesting or chaotic or contentious. And we're just like, "Eh, let's just mock it." Right? So that's great. I'm not saying anyone should not do that. We should all do that. Also, if we accept that our job isn't done until our users are using our code and we see that it's working, that's just step one. TDD, basically it ends at the border of your laptop. That's it. Right?

Daniel (15:18):

Yeah.

Charity (15:20):

I feel like once your tests are done, cool, but you should never accept a pull request or submit a pull request unless you're confident that you will know if this is not working once it's in prod.

Daniel (15:32):

Well, I actually pulled out and highlighted that one, Charity, because that I thought was just a fantastic line.

Charity (15:37):

That sixth sense of... I feel like we've been leaning on auto instrumentation, like the magical stuff from the vendors for so long that so many engineers have just kind of lost that muscle for just thinking about what is future me going to wish that I had done right now. You know?

Daniel (15:51):

Yeah, yeah.

Charity (15:53):

There are ways to make it easier. God knows we've done a lot of them, but there's something irreducible there, where you just have to be thinking, this is going to be important. Right? And putting that in the blob, so that future you has it when it's the thing that matters. I feel like if you do that, it's instrumenting with that idea of your future self. I think a really key part of this that most companies haven't done yet is automating everything between when you merge and when it goes live and making that short, like under 15 minutes, so that if you merge your code, you don't have to wait for a signal that someone's done something, you don't have to do... You just know that within 10, 15 minutes, it's going to be live. Yes, use feature flags. Yes, use canaries. Yes, use...

Charity (16:35):

Yes, yes, yes, yes, yes. This can be done safely. This can be done safe. But because we haven't done that, it introduces this huge gap, right, in time where it's variable and you're going to move on to something else and someone else is probably going to take it and make sure that it goes live. And who remembers to come back in a day or two or a week or however long? And then it has to be short so that you have the muscle memory, so that you have this persistent itch in the back of your mind. It's not done until you've gone and looked at it and made sure that nothing else looks weird.

Charity (17:04):

But the best engineers I've ever worked with are the ones who have two buffers open at all times. One is their code, one is looking at it in production. Observability matters because if you're just looking at the time series aggregates, you see the spikes, but you can't break it down and say, "Oh yes, this spike is this version, this canary, these characteristics, these 10 different things." You have to be able to go from high level to low level and back and quickly, or you're just guessing again. You're just back to guessing and trying to interpret low-level ops, system metrics and translate them into the language of your code. You know, there's a whole thing. But once you get it going, you can expect your developers to ship better code consistently, to find bugs before your users do, and everybody just has a lot more time to make forward progress.

Daniel (17:54):

Yeah. I like it, Charity. Something I'm definitely hearing from you is that engineers do need an understanding of the business, of the KPIs say, and also the SLIs at the ops level. So I guess some developers I work with, they just want to write code to sort of ship things. But I think what I'm hearing from you is you do need a bit of knowledge either side there, business and ops as well.

Charity (18:12):

You don't have to be an expert. You have to know, is my user going to be happy about this or not? That's not terribly hard. If you weren't aligned with them, then you probably should be working on something in academia. Here's the thing. I feel like some people who get all curmudgeonly about this, they've actually just been burned. They've worked with teams that made it painful to care. They've worked at places where they got punished for caring and now they've developed this armor.

Charity (18:46):

But most, if not all, engineers got into this because we are curious, because we love building things, because we like to have an impact. Nobody likes to put a lot of effort into something that nobody uses. This is the universal hunger. So I feel like it's the job of the ops teams and whatever teams to be friendly, to invite engineers in, to serve them the tools so that it is rewarding. So that you get that dopamine hit when you go to look at your... and you can see the spike, and you're like, "Oh, I'm going to figure it out." And you figure it out within a few minutes. That, you get hooked on that feeling. You know? I don't feel like this is a hard thing. It's just we have to deal with the scars of past traumas.

Daniel (19:30):

Yeah. I can relate to that like pushing code into prod and just seeing I had Nagios or whatever, and just thinking like, "Not quite sure. CPU is up!"

Charity (19:36):

Then you panic and then you're just like, "Shit. My night is screwed." And people were yelling at you. There's a lot of places where there are a lot of terrible things that have been done to people. It's true.

Daniel (19:46):

Indeed. I'm guessing, I mean, that's very much what you and the team are working on at Honeycomb, right? Being able to like-

Charity (19:50):

Yeah.

Daniel (19:50):

I've heard you talk about high cardinality before. If you spot that spike with things like Honeycomb, you can dive in and figure out what's going on. Right?

Charity (19:58):

We have this super cool thing called BubbleUp. So if you accept my definition of observability, there are a lot of technical things that flow from that that, let's say, other tools out there in the field that call themselves observability tools do not have, which really pisses me off. It's fine. Yeah. But if you do accept that, then you need high cardinality, you need high dimensionality. Honeycomb has this cool thing called BubbleUp where if you see a spike and you want to understand it, you just draw a little blob around it. Say like, "Explain this to me." And we will precompute for all of the hundreds of dimensions that are inside the thing you care about, as well as the baseline outside for the entire window. Then we diff them and sort them.

Charity (20:41):

So you can see if there are 20 different things that have to go wrong for those errors, you'll just see them all at the top. Just like, "Oh, these things were different about these requests that I care about." It is the closest thing I've ever seen to magic. This is why I get pissed off when people talk about AI ops, too. Like, f*ck AI ops. That's not. This is like the purest distillation of this terrible thing in the industry where C-levels and VPs of engineering don't trust their teams. They trust vendors more than they trust their own engineers because engineers come and go. Vendors are forever. Right? So they've been signing these checks for tens of millions of dollars for any vendor that is like, "You don't have to understand your systems. I will tell you what to look at and care about."

Charity (21:24):

That's what AI ops is. And it's bullshit because what we should be in the business of doing is helping, not just one or two people understand the system just in times of crisis, but making the system self-explaining, making it tractable, building social tools where the bits of my brain where I'm understanding and querying my part of the system, where I understand it deeply and I'm an expert, other people can come and see it, see how I interact with my part of the system. Because we're all like building on a slice of the distributed system and responsible for the whole goddamn thing. Right?

Daniel (21:54):

Yeah. Right.

Charity (21:54):

So we have to be able to tap each other's brains, the part that I understand you have to have access to. And we have to focus on taking the engineer and just letting engineers do what they do best, which is thinking creatively and spotting things they care about and adding meaning to things, and then make the computers do the things that machines do really well. Just computing lots and lots and lots of numbers and serving them up in a way that makes it simple and easy for you to see what actually matters.

Daniel (22:19):

Yes. That's awesome, Charity. I chatted with, I think it was Ben Sigelman, recently, and he was saying the UX of these systems is one of the hardest things. I think I've seen you and Liz and several folks at conferences saying the same things. I'm guessing you're putting a lot of time into the UX, the developer experience, the UX of Honeycomb, right?

Charity (22:35):

Yeah, for sure. Yeah. It's very important to us that we don't want to tell you what you should be looking at. We don't want to take data away and be like, "No, no, no. We know what you want" because any machine can detect a spike. Only you can tell me whether it's good or bad. Right? But what we can do is make it so that anomalies rise to the top and your eye can pick out patterns, right, that you can attach meaning to and then you can go and interact with.

Charity (23:00):

Honestly, this is why I hate the three pillars. There are not three pillars. Pillars are bullshit. Pillars are just data formats, right? So you've got metrics, logs and traces. You've got your metrics tool. You see a spike, but you can't drill down. You can't see what they have in common because, at the time of capturing that data, it was all spread out into hundreds of different metrics that have no connective tissue whatsoever. You can't see what they have in common. You can't even ask those questions.

Charity (23:25):

So you go and you jump into your logging tool. The problem with logs is you can only see what you know you put there, you can only find what you know to look for. Right? So you can find the things you can search for, but you're always fighting the last outage. Even if you're lucky enough to find the problem, you're copy-pasting an ID over into your f*cking tracing tool.

Charity (23:42):

You've got a human here that's just like... You know, that's stupid. Not to mention the fact that it costs three times as much, right? Really, the tracing is just a waterfall. You're just visualizing by time. Right?

Daniel (23:54):

Mm-hmm (affirmative). Yeah.

Charity (23:56):

Our events are spans, and we can derive metrics from them. You can derive all of the data formats from the arbitrarily wide structured data blob that we use. So this is why your Datadogs, your New Relics and your Splunks are all trying to get to where we sit technically faster than we can get to where they sit business-wise.
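[Editor's note] The point about deriving metrics from wide events can be illustrated with a small sketch. This is not Honeycomb's implementation; the event fields (`endpoint`, `duration_ms`, `status`, `user_id`) and the `derive_metrics` helper are hypothetical names chosen for the example. The idea is that metrics are computed from raw events at query time, so the same data can be grouped by any field after the fact.

```python
from collections import defaultdict
from statistics import median

# Hypothetical wide events: one dict per request, with arbitrary fields.
events = [
    {"endpoint": "/cart", "duration_ms": 120, "status": 200, "user_id": "u1"},
    {"endpoint": "/cart", "duration_ms": 480, "status": 500, "user_id": "u2"},
    {"endpoint": "/home", "duration_ms": 35, "status": 200, "user_id": "u1"},
    {"endpoint": "/cart", "duration_ms": 150, "status": 200, "user_id": "u3"},
]


def derive_metrics(events, group_by):
    """Aggregate raw events into per-group metrics at query time."""
    groups = defaultdict(list)
    for event in events:
        groups[event[group_by]].append(event)

    metrics = {}
    for key, rows in groups.items():
        durations = [row["duration_ms"] for row in rows]
        errors = sum(1 for row in rows if row["status"] >= 500)
        metrics[key] = {
            "count": len(rows),
            "error_rate": errors / len(rows),
            "median_duration_ms": median(durations),
        }
    return metrics


# The same events answer two different questions without new instrumentation.
by_endpoint = derive_metrics(events, "endpoint")
by_user = derive_metrics(events, "user_id")
```

Because the grouping field is chosen when the question is asked, a pre-aggregated metric per endpoint or per user never has to be defined up front.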

Daniel (24:12):

That is super interesting. So maybe this is an impossible question, but if folks are just getting into this and they're engineers, where would you advise them to start with observability?

Charity (24:25):

At least a third of observability is in how you collect that data. The only people out there that have observability tools are us and LightStep, for the record. Those are the two. The rest are monitoring tools and APM tools. Those are very different beasts. I don't usually say this as directly, but I'm a little pissed off today. You should start by installing the Beelines, which are basically just really rich SDKs with the exponential backoffs and retries and batching, and some really nice stuff for high throughput. If you install them, it's about as hard as installing New Relic libraries and it gives you about the same amount of data.
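[Editor's note] For readers unfamiliar with the terms, here is a minimal sketch of what "exponential backoff, retries and batching" can look like in an event-shipping client. It is an illustration under assumptions, not the Beeline code: `send_with_backoff`, `BatchingSender`, and the injected `send` callable are hypothetical, and a failed batch is simply dropped after the final attempt.

```python
import random
import time


def send_with_backoff(send, batch, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Try to send a batch, doubling the wait (plus jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return send(batch)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            delay = base_delay * (2 ** attempt)
            # Jitter spreads retries out so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay))


class BatchingSender:
    """Buffer events and ship them in batches instead of one call per event."""

    def __init__(self, send, batch_size=50, sleep=time.sleep):
        self.send = send              # hypothetical transport: callable(list_of_events)
        self.batch_size = batch_size
        self.sleep = sleep
        self.buffer = []

    def add(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            batch, self.buffer = self.buffer, []
            send_with_backoff(self.send, batch, sleep=self.sleep)
```

Batching amortizes network overhead across many events, which is what makes this pattern suitable for high-throughput services.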

Charity (25:02):

So you get popped into the very traditional. You know, you install a library. Cool. Now I've got my default metrics and graphs that I can interact with, but it's more than that because you can also jump down into them and slice and dice and get all the way down to the raw requests. And you can stuff in anything in your code that you're like, "Oh, this might be interesting." We auto-wrap and capture timing information for all the calls out across the network and all the database calls and all this stuff magically, but you can also go, "Oh, here's a shopping cart ID. This will be useful. I'm going to stuff that in there. Oh, this user ID will be interesting. I'm going to stuff that in there."
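[Editor's note] A rough sketch of the pattern described here: one wide event per request, with timing captured automatically and custom fields such as a shopping cart ID or user ID added alongside. The `WideEvent` class and `handle_checkout` function are hypothetical and written only for illustration; they are not the Honeycomb Beeline API.

```python
import json
import time
from contextlib import contextmanager


class WideEvent:
    """A hypothetical arbitrarily wide event: a flat bag of key/value fields."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    def add_field(self, name, value):
        self.fields[name] = value

    @contextmanager
    def timer(self, name):
        """Record how long the wrapped block took, in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.fields[f"{name}_ms"] = (time.perf_counter() - start) * 1000

    def to_json(self):
        return json.dumps(self.fields, sort_keys=True)


def handle_checkout(user_id, cart_id):
    event = WideEvent(service="checkout", endpoint="/checkout")
    # Custom fields: anything that might be interesting later.
    event.add_field("user_id", user_id)
    event.add_field("shopping_cart_id", cart_id)
    with event.timer("db_query"):
        time.sleep(0.001)  # stand-in for a real database call
    event.add_field("status", 200)
    return event


event = handle_checkout("u42", "c7")
```

Adding a field is a single dictionary write, so the habit replaces an ad hoc log line rather than adding to it.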

Charity (25:35):

In addition to all this stuff we pre-populate, you can stuff your own stuff in there, which is how it becomes yours. While you're developing, as you're typing, instead of printing something out in a log line, you just stuff it into the Honeycomb blob. And then the cool thing about this is, with metrics, cost scales linearly with adding more metrics. Right? The cost of metrics is super expensive.

Daniel (25:58):

Yes, yes.

Charity (25:58):

It is effectively free to put more bits of data onto the Honeycomb because, with these arbitrarily wide structured data blobs, it's just the cost of the memory to append it and ship it over the wire, which is dirt cheap. Right? So it incentivizes you to just, like, anytime you see something that might be interesting, toss it in there. Maturely instrumented systems with Honeycomb usually have 300 to 500 dimensions per row. They're quite wide.

Charity (26:25):

So you just stuff it in there and forget it. And someday, a year or two down the line, you've got it, and it happens to be that thing. That is the thing that is... It's amazing. You accrue more and more and more and more and more and more data. It doesn't get more and more and more and more expensive. Then it feels like magic when suddenly you run BubbleUp and you're like, "Oh shit. I had that years ago. That's the problem." It's cool.
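[Editor's note] The transcript does not describe how BubbleUp works internally, so the following is only a loose illustration of the general idea of letting differences "rise to the top": compare how often each field value appears in a selected set of events versus everything else, and rank by the gap. The `compare_dimensions` function, its `ignore` parameter, and the sample fields are all assumptions made for this sketch.

```python
from collections import Counter


def compare_dimensions(events, is_outlier, ignore=()):
    """Rank (field, value) pairs by how differently they appear in outliers vs. the rest."""
    selected = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    fields = {key for e in events for key in e if key not in ignore}

    results = []
    for field in fields:
        sel = Counter(e.get(field) for e in selected)
        base = Counter(e.get(field) for e in baseline)
        for value in set(sel) | set(base):
            sel_share = sel[value] / len(selected) if selected else 0.0
            base_share = base[value] / len(baseline) if baseline else 0.0
            results.append((abs(sel_share - base_share), field, value))

    # Largest difference in share first.
    results.sort(key=lambda row: row[0], reverse=True)
    return results


# Hypothetical events: the slow requests all come from one build.
events = [
    {"duration_ms": 900, "build_id": "b2", "region": "us"},
    {"duration_ms": 950, "build_id": "b2", "region": "eu"},
    {"duration_ms": 40, "build_id": "b1", "region": "us"},
    {"duration_ms": 55, "build_id": "b1", "region": "eu"},
    {"duration_ms": 60, "build_id": "b1", "region": "us"},
]

ranked = compare_dimensions(
    events,
    is_outlier=lambda e: e["duration_ms"] > 500,
    ignore=("duration_ms",),
)
```

A field added long ago and forgotten (here, `build_id`) can surface as the explanation only because it was already present on every event.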

Daniel (26:47):

That incentivizing people to do the right thing is so important because we're engineers-

Charity (26:50):

It's so hard.

Daniel (26:51):

... we're smart and lazy, call it what you will, but we react to incentives. Right?

Charity (26:55):

Mm-hmm (affirmative).

Daniel (26:55):

So if you incentivize people to like, "Hey, put this in so you can use it in a few years time." We're going to do it, aren't we?

Charity (27:00):

Yeah. I hope so.

Daniel (27:01):

Yeah, nice. Final question, Charity: what are you looking forward to over the coming, say, six to 12 months? What's exciting in your world and Honeycomb's world? What do you think we as engineers should be looking to?

Charity (27:12):

Oh, I'm going to say... This is a weird one, but this is how I've gone corporate. We just hired George Miranda, who is this amazing product marketing person who used to be an engineer. He's the reason that you now know about ODD, observability-driven development, because he was like, "This feels like a thing that people..." You know? So he's like all of the shit that I've just been kind of flinging out into the universe over the past three years, is going to start taking shape into better stories. We have never found it difficult to sell to engineers, but engineers don't tend to have budgets. So it's typically their bosses. Right? So we are putting more effort into arming our champions so that when they take it to their C-suite and their VPs, we have something that looks professional and plausible and stuff.

Charity (28:01):

We also have this really cool thing, Secure Tenancy, which, if you're running on-prem, you can use Honeycomb just as though... There's just a proxy that runs on your side that streams the events through and encrypts them. We never see anything, and it's super cool.

Daniel (28:15):

Interesting.

Charity (28:16):

We got a patent for it. So even on-prem people can now use this. We've got a bunch of financial institutions. Oh, we also got HIPAA compliance. Right?

Daniel (28:22):

Wow. Cool, cool.

Charity (28:24):

You want to hear the interesting technical things, but that has never been what's hard for us. The technical stuff is already done. It's all the business stuff that is finally starting to fall into place.

Daniel (28:33):

Yeah. I get that other stuff. We're doing it in Datawire, in Ambassador. It's the same kind of thing. We're really investing in telling stories. Over the last, like, say, six months, I've really seen the value of it even for engineers. They're telling that compelling story, because we as humans, that's the way we share knowledge, right? It's done through the ages.

Charity (28:48):

It is.

Daniel (28:48):

We tell these stories. Right?

Charity (28:50):

You know, people have been coming to Honeycomb, and it took me a long time to realize that people were coming to Honeycomb and we'd ask them, "Oh, what other tools are you going after?" A lot of them were like, "None." And we're like, "This is weird. This is not what we were told would happen." Right? But they weren't coming to us with a checklist. They were coming to us because of the stories we were telling about how you could write code better and have better lives as engineers. And they wanted to grow up with us and be part of that.

Daniel (29:19):

That's cool. I mean, that's a privilege, right? Being able to be part of that journey.

Charity (29:21):

It really is. It's been pretty amazing.

Daniel (29:23):

Awesome stuff, Charity. Well, really appreciate your time today. Thank you very much.

Charity (29:25):

Yeah. Thanks for having me.
