Ep. 16 | Understanding Software Availability with Ross Brodbeck - Overcommitted

Brittany Ellich (00:00) Welcome to the Overcommitted Podcast where we discuss our code commits, our personal commitments and some stuff in between. I am your host today, Brittany Ellich and we are joined by…

Jonathan Tamsut (00:11) Hi, I’m John.

Bethany (00:12) Hey, I’m Bethany.

Erika (00:13) And I’m Erika.

Brittany Ellich (00:14) We are a group of software engineers who initially met as a new hire group at GitHub and found a common interest in continuous learning and building cool things. We continue to meet and share our learning experiences and talk about our lives as developers. Whether you are pushing code or taking on new challenges, we are so happy you’re listening. Today’s episode has a special guest. Our second special guest podcast we’ve done, Ross Brodbeck. Ross, would you like to introduce yourself?

Ross (00:40) Sure thing, my name is Ross, Ross Brodbeck. I work for GitHub in the Reliability Organization, which you could think of for the purposes of this podcast as SRE. I’ve been doing that for a number of years now, and I’ve been working in developer tools for even more years than that. I worked at GitHub and Microsoft before that in developer tools, and I’ve been around the software industry for many, many years.

Brittany Ellich (01:03) That’s great. Thank you. Yes, we were just saying how Ross is internally famous for availability. So he seems like a really great person to talk to about availability, which is our topic today. We want to talk about software availability, what it is, how it’s measured, and how you make sure you’re handling it correctly.

So we’re going to start out by just talking about what availability is, to set the scene for anybody who isn’t familiar with the term. Ross, how do you define availability and what’s the difference between availability, reliability and uptime? Because those are used synonymously a lot of the time.

Ross (01:38) So availability is a little bit of a squishy term because you can define availability in a lot of different ways. I would think of it at a very basic level as something like what the Google SRE book talks about when we talk about errors and latency and whether or not your application is actually working. So does the application work for users? And this goes to the second part of your question: it doesn’t necessarily mean that the application is fully functional. Like it could just be around and doing some things, but it may be degraded.

Reliability is more about does the application do everything that you think it should do and are your users actually happy with it? I mean, that’s a very vague definition of those terms, but basically that’s how I would think of availability versus reliability. And again, not to get too nerdy about the Google SRE book, but there is a section in the Google SRE book that talks about correctness and whether or not you would measure correctness. So there’s even some overlap there about availability being correctness.

And I would think of a lot of availability in general, just like this whole topic, as being very subjective based on your needs as an organization and also just based on your actual business, which I think is sometimes not talked about very much in a lot of the literature, but I think it’s very much defined by product. So there are some products that maybe don’t really need to be that available. Maybe they would define availability differently than other products, like banking software, for example.

And then uptime is kind of like a way to measure availability, if you will. That’s how I would think about it.

Brittany Ellich (03:05) Gotcha. Makes sense.

Jonathan Tamsut (03:06) Okay, I have a question next. So I guess you have a lot of experience working on availability in the context of GitHub and maybe other places. So for people who are working on their own systems at whatever companies, what is some transferable knowledge or what are some skills that you’ve learned? What are some hard-won lessons that you think are more widely applicable to availability in general?

And I think it’s also interesting, as sort of a follow-up question, or you can answer these questions in any order: what is your mental model or framework for approaching availability problems in general? Like if I were to say, hey, I have an app that is down and not meeting user requests, what are some things you would think about first? So yeah.

Ross (03:58) OK, yeah, so you’re going to have to remind me about the second question. But the first question, let’s definitely talk about a framework for approaching availability. So I mean, there’s a lot of things you can do to level up your availability knowledge, I guess. And some of them are pretty extreme. I mean, in the SRE world, I already mentioned the Google SRE book. That’s kind of like the canonical reference.

I would caveat it with: there’s a lot of information in there and it’s really good and it describes a very specific approach to availability, but I wouldn’t read that book, you know, if I’m talking to a bunch of different software engineers, and say, this is how we should do it, this is how we should approach it, this is exactly what to do. Because I think it has a little bit of a reputation in the industry for people kind of waving it around and saying, this is what you should do. So with that caveat, it is a great reference and you should totally go read it because it’s free and it’s on the web. If you just Google, Google SRE book, you’ll totally find it, and it has like everything you would ever want to know about the baseline of measuring availability and thinking about availability and what you should consider.

Why would it be this way? I think, especially once you’re running an application, especially one that has high traffic, that book is a good way to go back and check how you’re actually doing things, because it’s hard to jump into availability if you’re just doing things like running a toy application or writing a utility. I don’t think you learn enough from that.

Part of it is doing. And so you mentioned what kind of skills and things. And one thing that makes me think about, which I think is the most important thing you can do if you’re a software developer in this space, is go into production. And I don’t mean log into production and do things. I actually don’t advocate that you do that ever. But what I do advocate is that you go look at your production stack and use whatever observability you have. I mean, step one, if you don’t have observability, would be that you need some observability in order to find out what’s going on in production.

And my experience has been that you can learn a lot about your application behavior and how to debug things and how to think about availability just by going and looking at the applications that you have in production today, looking at your telemetry and looking at the current problems that you see. And you can frame it in something like the Google SRE book, but participating in that way is the best way to learn, in my opinion. And I think there’s a lot of not-easy-to-learn skills about debugging and how to understand performance and what’s happening in an application. I think it’s very hard to write a book to tell you how to do that. It’s something that you kind of have to learn on the fly.

And I’ll extend that to what I think is maybe the most extreme example of that. And I wouldn’t advocate that you go do this tomorrow if you’ve never done it, but the most extreme example of that is go participate in a bunch of live site incidents. Go actually be around when the application is down, especially if it’s an important application. Because that’s when the rubber really meets the road in terms of trying to figure things out. We do a bunch of game days across many teams where we have fake availability incidents and fake availability response and all that kind of thing. And that works to an extent, but it’s always kind of a controlled scenario that feels very on rails almost, because I think there’s always an answer.

But when you’re in production and there’s a database query that doesn’t work right and nobody knows where it is or how it works, and let’s say the original person that wrote it is gone because it’s 10 years old, that’s the kind of time when you’re really going to learn, do I know how to use these tools? You’re going to be put to the test. And I think that’s when you learn a lot about debugging those types of incidents and what should I do in this situation?

Jonathan Tamsut (07:17) So a follow-up question to that is, what are your thoughts on proactive versus reactive availability? I think a lot of companies have, you know, bad availability incidents that kind of trigger a response, and in an ideal world that same incident wouldn’t happen again. But do you think it’s even possible, in a comprehensive way, to have proactive availability, to proactively patch the availability holes and prevent these incidents from happening? How do you think about that?

Ross (07:49) Yeah, okay, so that’s a great question that I could talk about for literally this entire podcast probably. So let me see if I can try to break it down in a way that I don’t just take over this whole conversation. So the first thing that I think about when you say that you have problems with availability goes back to what I said at the beginning of the podcast, which I think is the most important thing you can take away, which is that the business is gonna determine how available you need to be and what those scenarios are.

And so for example, I can plug my blog, you can go read my blog. I wrote a blog article about this. But basically, the way I think about this now is different. So one thing is that I feel very lucky to be at GitHub, because when I randomly message other engineers at other companies on LinkedIn, sometimes they answer me, I think because I work at GitHub, not because I’m special at all. But since my name has GitHub attached, they’ll sometimes respond to me.

And so a couple of years ago, I went and did exactly that. And I asked a bunch of SREs and people who were in the industry a lot of questions because I was starting to build this availability program at GitHub. You know, I’m famous for talking about this stuff. And I was wondering, well, what are other companies doing? And what I learned is that there is a big difference between different companies in how they measure availability and what they care about.

So just as an example, a banking company is going to care about a lot of regulatory compliance things. And they’re probably going to have very strict standards for most of their applications that handle money or do anything important. That’s going to be a very strict environment. They’re going to have really good measures because they have to. Whereas, let’s say, a company that sells airline tickets on the web: you just Google, I want to buy an airline ticket, and you go to one of those websites. I don’t want to say which ones because I don’t want to accidentally talk about somebody I talked to, but for any of those, the only thing that a lot of them seem to care about is their funnel, which I think totally makes sense. If people can come and buy airline tickets, and that’s how they make money, then they don’t necessarily care if the button in the settings page doesn’t work. No one cares to measure that. It doesn’t make any sense to spend a lot of developer effort to fix that because it’s not doing anything for any of the customers. Customers don’t even care about it.

And then you have a company like GitHub where I think that one thing that I feel like I learned is that we have a pretty high challenge because we are a developer tools company where we ship product for developers and a lot of our customers need our stuff to be up all the time. And it’s not just one feature. It’s like, okay, if you’re a marketing company, maybe your marketing landing page needs to be up, but you don’t need the marketing landing page and the settings page and the blah, blah, blah, blah, blah, to all be up.

We care. I mean, if Actions doesn’t work, people are really unhappy. If pull requests don’t work, you know, people are really unhappy. If Issues doesn’t work, like, there’s a whole slew of things I can go down that happen at GitHub that we have to measure and be responsible for. And that surface is very large. So I think that’s the first thing to think about. I already lost track of your question. I apologize. But I think it goes back to: then you have to define those frameworks around those scenarios, right?

So you’re talking about applications and whether or not you care. So first you have to define that. Then you have to go into how you are going to measure availability in the scenarios that you care about. And again, different types of applications have different measurements. So there are standards, kind of. If you go, again, to the Google SRE book, they’ll talk about the golden signals, like saturation, latency, error rate. I forgot the other one. So hopefully not too many people watch this podcast.
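
For reference, the Google SRE book lists four golden signals: latency, traffic, errors, and saturation. A minimal sketch of summarizing them for one measurement window might look like the following. The `Request` record, the `golden_signals` function, and the `capacity_rps` figure are hypothetical names chosen for illustration, and saturation is approximated here as traffic divided by an assumed capacity, which is only one of several ways to define it.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took
    ok: bool            # True if it succeeded from the user's point of view


def golden_signals(requests, window_seconds, capacity_rps):
    """Summarize latency, traffic, errors, and saturation for one window.

    capacity_rps is an assumed, externally known ceiling on requests/second.
    """
    total = len(requests)
    traffic_rps = total / window_seconds
    error_rate = sum(1 for r in requests if not r.ok) / total if total else 0.0

    # Nearest-rank 95th percentile latency (0.0 when the window is empty).
    durations = sorted(r.duration_ms for r in requests)
    p95_ms = durations[math.ceil(0.95 * total) - 1] if total else 0.0

    return {
        "latency_p95_ms": p95_ms,
        "traffic_rps": traffic_rps,
        "error_rate": error_rate,
        "saturation": traffic_rps / capacity_rps,
    }
```

Which of these four a team actually tracks, and with what thresholds, depends on what its customers care about, as the discussion here suggests.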

But then you’ll have to measure those, but you may measure other things that your customers really care about as well, or you may measure less because they don’t. And then once you’re measuring those, I think you can dive more specifically into the application itself to figure out what is actually successful. And again, I lost track of the question, but I think the most important part of that whole analysis, when it comes to evangelizing availability at a company, is to make sure that you’re evangelizing the right things.

And I can say this about myself. I worked at other companies, not developer tools companies, like a marketing company. I worked, in the old days, on web portal software and stuff. And I have always cared very much about error rate in the application. You know, this thing is breaking, we’ve got to get to it. And what I’ve learned over time is that some of those times that I cared about that stuff, it really didn’t matter. It didn’t matter to the business. It didn’t matter to anybody else. And by pushing on something that doesn’t matter, all you’re really doing is taking away from the things that do matter. And so I think understanding what matters and measuring that and then pushing that at the business is the most important thing that you can do to focus the efforts that you’re going to have on those types of things.

Because at the end of the day, whether or not I personally want it to be this way, availability is generally a cost center in my mind. It’s something that costs money, but it doesn’t ship a new product feature. From a capitalist standpoint, we’re not measuring it on the balance sheet as being a net new feature that’s making money for the company. We’re always going to be measuring it as a loss, a cost center. So that’s always going to be something you have to think about as a developer: how much is it worth to do this? And how can I convince the business that it’s worth it? Now, sometimes the business knows and they care, but if you care, then you’re going to have to think about that mindset. Because sometimes you may push on something and realize, you know what? It doesn’t matter at all, because of this reason.

Jonathan Tamsut (12:54) So you’ve identified what’s important for availability. How do you preempt availability issues and prevent them from happening? I think that’s probably good system design, and maybe some load testing, game days type stuff.

Ross (13:09) Yeah, I think, OK, so that’s where I guess I was going here: to preempt those types of issues, you definitely need to know what it is you actually want to preempt and what’s OK to let go. Also, you’re not going to measure everything. And so I do think that there are a number of things you can do. I mean, it depends on your application. Obviously, I guess, planning for failover and doing it if you can is the best thing you can do.

That’s not always going to be something that you can support, or something that is even desired, because of the cost based on whatever application you have. Again, if you have a marketing site that you don’t care about going down, why are you going to build five nines of availability for that? You know, if a region goes down, do you have to fail over or can you just wait till it recovers? Like maybe that business requirement’s not that hot. So that is a big dependency on what you would actually do, I think.

And then the other thing is game days. Absolutely. I mean, playbooks and game days are the two things that I guess I could advocate for regardless of the application. And it’s hard. That’s another one that’s even a cost center, I think, for developers. I don’t think people typically want to write playbooks or run game days. It’s not always the most fun, and it takes planning and work and all that kind of thing. But it does help a lot when there is an incident, because if you don’t have that type of knowledge written down, it becomes a game of: let’s page the one person who knows because they wrote the code, right? And that’s never any fun for anybody.

Erika (14:33) Yeah, you mentioned also spending time in production and getting familiar with your own telemetry and monitoring. And it makes me think of the times when I’m implementing a new feature and, you know, I don’t look at the logs for that method or that slice beforehand. And afterwards, when I look at it and I’m looking for errors, I see all these errors, but then I go back and look at how long they’ve been going on for and they’ve actually always been around. It’s, you know, a two-year-old error, so it’s nothing that I changed, and it’s working well enough that nobody’s reported it. But if I’m looking to see if my changes broke anything or made anything worse, it’s really hard if you can’t piece apart what was an error before versus what’s new.

So yeah, it’s a good reminder to check the logs, check the behavior before you do anything to change it, to know what your baseline is. Because it might not always be as picture perfect as you might expect.
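
One way to picture the baseline check described here is to compare error signatures from a window before the change with a same-length window after it. This is only a sketch: the signature strings, the `new_or_worse_errors` function, and the `tolerance` multiplier are assumptions made for illustration, not any particular team’s tooling.

```python
from collections import Counter


def new_or_worse_errors(baseline, current, tolerance=1.5):
    """Compare error signatures from two equally long time windows.

    baseline / current: iterables of signatures such as "KeyError:render".
    Returns {signature: "new" | "worse"}; long-standing errors at a similar
    volume are left out so they do not get blamed on the latest change.
    """
    before = Counter(baseline)
    after = Counter(current)
    report = {}
    for signature, count in after.items():
        previous = before.get(signature, 0)
        if previous == 0:
            report[signature] = "new"      # never seen before the change
        elif count > previous * tolerance:
            report[signature] = "worse"    # existed, but grew noticeably
    return report
```

With a baseline captured before the deploy, a two-year-old error that fires at its usual rate drops out of the report, and only the new or growing signatures remain.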

Ross (15:39) Yeah, absolutely. I think there’s always the tension of should we fix everything or not? Like you mentioned, errors versus, are these real errors or not? Exceptions in production is a big hot topic, I guess. Maybe it’s not, I don’t know if people in the industry talk about it, but I know it’s a hot topic around here and I know it’s a hot topic at other places: should we be tracking exceptions that happen if they don’t have end user impact?

I mean, sort of, maybe, but this is where SLOs and burn rate kind of come into play, where it’s more like, OK, let’s measure the behavior that we care about, compare it to baseline, and then look at whether or not it’s getting worse and how quickly it’s getting worse, so that we don’t have to watch every single exception. But I think there’s a fine line there, because I know at GitHub, one of the things we have is automated rollback, if your application can support that, and we can do really any kind of query that we want. And if we want to compare a canary rollout to full production, and canary shows up with a new error, ideally you’re not going to watch that at all. As a developer, you ideally shouldn’t watch that. You should just get a notification that something happened and it should automatically roll back. Because I think once you get to a certain scale, watching things in production gets really painful and really hard. And it’s error prone. It’s not fair to expect somebody to do that.
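
The burn rate mentioned here is commonly defined as the observed error rate divided by the error rate the SLO allows. The sketch below follows that common definition; the function names and the 30-day window are illustrative assumptions, and it does not describe GitHub’s actual tooling.

```python
def burn_rate(failed, total, slo_target):
    """How fast the error budget is being spent.

    1.0 means the budget would last exactly the SLO window; 5.0 means it
    would be gone in a fifth of the window. slo_target is a fraction,
    e.g. 0.999 for three nines.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target        # allowed failure fraction
    observed_error_rate = failed / total
    return observed_error_rate / error_budget


def days_until_budget_exhausted(rate, slo_window_days=30):
    """If the current burn rate held, how long the whole budget would last."""
    return float("inf") if rate <= 0 else slo_window_days / rate
```

For example, 50 failures out of 10,000 requests against a 99.9% target is a burn rate of about 5, which would use up a 30-day budget in roughly 6 days. Alerting on that rate, rather than on each individual exception, is the kind of trade-off described in this answer.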

Bethany (16:54) Yeah, absolutely. I have really loved the tooling that we’ve gotten around having safer deployments, instead of relying on just knowing which pages to have open, for sure. I know we’ve been talking a lot about what availability is and then also availability as a cost center. But I am curious what your thoughts are on this: besides the obvious revenue loss of the incident, what are the hidden costs of downtime that organizations often overlook?

Ross (17:20) I think there’s definitely a few. I mean, we talked about some of them for sure. So one of them is just on-call burnout. And I think that’s a pretty well-known thing in the industry, so I don’t know if I would call that quote unquote hidden. I don’t know if all organizations think about that or not, but certainly people being on call and getting paged all the time in the middle of the night is a huge cost to an organization. Number one, it’s a cost mentally and to their personal lives, which is something that can absolutely get people to eventually leave jobs. And then number two, there’s also just the cost of people not doing the work that you want them to do that’s not availability work. If you’re getting paged all the time and you have to go research why this thing is broken and what’s going on, then you’re not shipping new features or doing the cool stuff. Maybe I don’t think that’s cool, but some people do. And so that’s a huge cost.

And if you don’t have a good handle on tracking that, then I think it could be one of those things where people talk about debt, like, I have a big application that’s really old. And maybe that application’s fine if you don’t have to maintain it or do anything to it. But if that application’s old and it pages people in the middle of the night every night, because nobody knows how it works and it has errors, then that’s the debt that really needs to be paid down. So there’s that.

I mean, there’s also just the customer trust factor, which is very hard to quantify. And I think this, again, goes back to industry: how much do your customers need your product, what parts of the product do they need, and how much do they have invested in it? And I guess we all work for GitHub, so I don’t want this to sound braggy, but I do feel very happy to work in developer tools because I think that the industry in general is held to a high standard by other people. The things that we work on, developers look at and say, I could do a better job than this, right? And I think that’s good because you feel a lot of personal, I don’t know, desire to make things good because you know that people are looking at it from the lens that you’re looking at it from.

So I think industry may matter. There may be more tolerance in other industries. Certainly there’s more tolerance if you’re writing marketing stuff or whatever. I don’t want to slander marketing all the time. I feel like I keep saying marketing, but whatever. There are definitely tons of other industries. And there are some where there’s no tolerance, right? If you’re writing software for a nuclear power plant, there’s no tolerance for failure. So, you know, it depends.

Brittany Ellich (19:37) Classic, classic it-depends answer. Yeah, so I think we talked a little bit about making sure that you are choosing the amount of availability that makes sense for your organization instead of just arbitrarily saying, we need to be as available as possible. And I think those are some really good examples of places where maybe availability isn’t the most important thing to invest in, and examples of where it is actually incredibly important.

So I’m curious how, in your opinion, you would quantify the value of improving availability in places where availability is really important and critical. I think we talked a little bit about uptime, and that’s often measured in nines, like three nines, 99.9%, or four nines, 99.99% available. I’m curious, is it worth it to just pick the maximum number of nines and continuously work towards getting more nines? What does that investment look like?

Ross (20:31) I’m going to try not to ramble on this subject, but it’s going to be really hard. So again, I apologize in advance, but I will try to stay on the question. So to answer your question in a very succinct way, I do not think that picking a number of nines and just trying to iterate to it is the right way to go. And that goes back to the example I gave of a marketing site that you’re fine with being down, right? The application architecture is going to change fairly significantly depending on how many nines you’re trying to achieve.

And maybe I’ll just back up and say, OK, so what are nines? One thing you can do to make this pretty simple is just go to Uptime.is, which is a website that calculates SLAs and SLOs. I think it says SLAs on there, but whatever. You can use it for whatever. And you can just type in a percentage, and it will tell you how many minutes of a given time frame map to that percentage. And I really like that, because there’s a lot of ways to measure nines, which I’m going to probably get into, and this is why this subject could be super long.

But one way to measure it is just how many minutes have we been down? So if you’re operating a website as a simple example and the website is completely inaccessible for some period of minutes, then you can take that and divide it by the total number of minutes in a day, a month, a year, whatever. And that’ll give you a nines number, once you flip it around to the percentage of time that your website was up.

And so if you look at those nines, the way I mentally translate it right now is that three nines, which is common to hear, is 43 minutes of downtime every month. So that means in a month, you have less than an hour to be down, whatever you’re measuring. It doesn’t matter. And then if you go for four nines, it’s 4.3 minutes per month. And if it’s five nines, it’s a matter of seconds. So that’s why I say the application architecture difference there is huge, right? If you write an application, that’s only going to be three nines available.
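
The arithmetic behind those figures can be sketched in a few lines. The function names are made up for illustration, and a 30-day month is assumed. Under that assumption, three nines allows roughly 43.2 minutes of downtime per month, four nines roughly 4.3 minutes, and five nines roughly 26 seconds.

```python
def availability_percent(downtime_minutes, period_minutes):
    """Percentage of the period the site was up, given minutes of downtime."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


def downtime_budget_seconds(target_percent, period_days=30):
    """Seconds of downtime allowed in the period for a given nines target."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - target_percent / 100.0)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        budget = downtime_budget_seconds(target)
        print(f"{target}% -> {budget / 60:.2f} minutes ({budget:.1f} seconds)")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why the architecture needed to hit the next nine tends to look very different.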

And then you try to make it, you know, four nines. I think this is where, I don’t remember who talks about it or whatever, but there’s this concept of, every so many years you’re going to rewrite your whole architecture. And to me, this is a good example of that, where you’re going to find out real quick, like, man, we’ve got to just rewrite the whole thing because it’s never going to work this way. Right? Like maybe we didn’t design it for regional failover. Maybe we didn’t design microservices that could be replaced, or maybe the database can’t go down or whatever. Then you’re going to have to go and change all that architecture or else you’re never going to reach the nines goal that you have. So I think you do have to do that.

And I think the other key here, and this goes to SLOs, is that uptime is measured as a service level objective. There’s a lot of stuff to talk about there, but the way I would simply explain it is that a service level objective is basically an objective for an availability number and an objective that relates to time. So let’s just define it simply and say, we want to not have more than, I don’t know, 5% error rate over the entire month. That’s the service level objective. It has to have a time component and a measurement component. And there’s a lot of ways to define that type of SLO.

You can go read my blog. I wrote a lot of stuff about looking at SLAs for different companies, for example, and it varies so much. And how you define that number changes how many nines you can even have. So just as an example, you can measure just success rate, like good versus bad requests, or you can even measure, can you ping my website? Or is my website’s, I don’t know, health check thing up? Don’t check the actual site, just check the health check. If that’s up, then that counts, we’re good. You can hit that 1,000 times every minute. Then because you have so many hits and it’s up all the time, as long as you don’t have any major problem, you could say your nines are crazy good.

But if you have that same website, and you start measuring, let’s say it’s in React, did the React components load within half a second or whatever? All of a sudden, the nines capability that you had goes way down. And let’s say you have 100 React components. Now you’re doing probability math. Each one of those has to be up. And if you’re going to combine them, now we’re getting into, how do you combine them? Do they all have to be up every minute? Or can half of them be down and half of them be up? So depending on how strict you are with your definition, I mean, I’ll say it kind of evilly, you could manipulate your nines however you want. If someone tells me they have five nines, my first question to them is, well, how are you measuring that? Because I think there’s a lot of variance in how you can decide how many nines you have.
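
The probability math for the 100-component case can be sketched as follows, under the simplifying assumption that components fail independently and that the page only counts as up when every component is up. Real components usually share dependencies, so treat this as an illustration of the strictest definition rather than a prediction. The function names are hypothetical.

```python
import math


def all_must_be_up(component_availabilities):
    """Combined availability when every component must be up at once.

    Assumes independent failures; inputs are fractions such as 0.999.
    """
    return math.prod(component_availabilities)


def per_component_target(overall_target, component_count):
    """Availability each component needs so that all together hit the target."""
    return overall_target ** (1.0 / component_count)


if __name__ == "__main__":
    combined = all_must_be_up([0.999] * 100)
    needed = per_component_target(0.999, 100)
    print(f"100 components at 99.9% each -> {combined:.2%} combined")
    print(f"each needs about {needed:.5%} for 99.9% combined")
```

Under these assumptions, 100 components at three nines each combine to roughly 90.5%, and each component would need about five nines for the whole page to reach three. A health-check ping measures none of this, which is why the same site can report very different nines depending on the definition.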

Erika (24:47) Yeah, it’s interesting. We’re going through some SLO revisiting on my team. And we usually have the opposite problem, where some of our SLOs are too granular and they’re looking at specific pages. And it’s not helpful for the alerting aspect, because you get a bunch of alerts maybe because the page doesn’t get enough requests to create a meaningful baseline. And so you get alerted on it when there’s not actually a problem.

So yeah, to your point, there is that sweet spot of how granular versus overarching you get, and also how much of the application you take into account. Like, is the only thing you care about the networking aspect, do you only care about the data use aspect, or do you care about it all up? Which layers are you taking into consideration, and which dependencies? Because they can be meaningful for different reasons. One can kind of help you point out where the problem is, but it may not actually indicate end customer impact. So, yeah, you have to be thoughtful about what you’re using as an SLO or a monitor.
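
The low-traffic problem described here is easy to see with numbers: one failure out of three requests is a 33% error rate, even though nothing may be wrong. One possible mitigation, sketched below with hypothetical names and thresholds, is to skip evaluation when a window has too few requests to mean anything.

```python
def evaluate_window(failed, total, max_error_rate, min_requests):
    """Classify one SLO evaluation window.

    Returns "insufficient_data" when there are too few requests to judge,
    otherwise "breach" or "ok". min_requests is an assumed, team-chosen
    floor, not a standard value.
    """
    if total < min_requests:
        return "insufficient_data"
    return "breach" if failed / total > max_error_rate else "ok"
```

Another option in the same spirit is to roll several low-traffic pages up into one broader SLO, which trades some precision about where the problem is for a more meaningful baseline.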

Ross (25:59) Yeah, absolutely.

Brittany Ellich (25:59) Yeah, that makes sense.

Yeah, agreed. It’s very hard to determine, too. If the entire website is down, then it’s easy to say, okay, we’re unavailable right now, but most likely the scenario is that one service is down or one thing is not working. And does that mean that you’re still available or not? Very hard to measure, for sure. So now I think we’re gonna talk a little bit about how to create a good availability program, and what the important things within that are. Bethany, do you want to kick us off on that one?

Bethany (26:28) Yeah, no, I know we’ve been talking a little about important parts of an availability program like monitoring and alerting, but I’m curious what your thoughts are on the essential components there.

Ross (26:39) Yep, these are all questions that I can talk about probably too long. So I will try to again keep my responses as brief as I can, or remind me of the original question when I go way off topic here. But this is, to me, one of the most interesting things that you can talk about from a company-wide perspective, and I’ll go back to some of the differences, and I’ll reference my blog article again. So there’s differences between companies here. I think there’s one idea of, like, what industry or

And okay, that’s one type of company that you’re thinking about. How are we developing for them? But the other type of company that I think about is how big is your organization? Because if you are in a startup and you’re just trying to get your, you know, site up and do the thing, you don’t need an availability program, just to be honest. There’s no reason for it. Just do the thing. And so there’s some minimum level of things you should do, but I could not recommend to anybody that’s running a small organization, or a very, I’ll say immature, but I don’t mean bad, I just mean young organization, to set up some big program.

Because I think you’re probably wasting a lot of time that you could just spend developing your product. And I think when your product gets to a certain maturity level, and when you have a certain customer base that is large, is when you need to start considering that you need to have an availability program. Because now you have customers, and, you know, this goes back to what’s the value of an incident or whatever: if you have a lot of incidents, and you have a lot of customers, you don’t want to lose them.

If you have no customers, then incidents don’t really matter to you. So I think that’s a big important caveat to availability programs and how much effort you put into them. Because if you read the Google SRE book again and you just do what they say, you might be wasting a lot of your time if you’re three people just coding a website. Don’t worry about it. At the GitHub level, with the amount of people we have and services we have, though, I do think there’s some key components. And I think

It’s interesting because I think monitoring and alerting is good and you need it, but I think it’s often confused with SLOs, in particular with reporting. And so I want to differentiate: number one, you do need monitoring and alerting, and alerting on SLOs is great. You should check out burn rate and look at how to alert based on burn rate if you’re going to alert based on SLOs, because I think that’s a key way of avoiding some of the alert fatigue that you can find. But it’s really hard to see it, by the way, I mean, for a low hit rate service, for example. Synthetic traffic is a way to fix that, and there’s all this stuff you can do.

But ultimately monitoring is great. But I think the key component for a large organization, and for making sure that’s a sustainable availability program, is actually reporting. And how you do the reporting, I don’t know that there’s a right answer. I can say that there’s some things that I feel like I’ve learned that are interesting about humans, that have nothing to do with computers, that I think are good ways to think about an availability program.
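
As a rough illustration of the burn-rate idea Ross mentions, here is a minimal sketch. Burn rate is the observed error rate divided by the error budget (one minus the SLO target), so 1.0 means the budget is being spent exactly on pace. The function names, the 99.9% target, and the two-window check are assumptions for illustration; the 14.4 threshold is borrowed from a commonly cited example for a one-hour window against a 30-day SLO, not from anything stated in this episode.

```python
def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        # No traffic in the window, so nothing was burned.
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if both windows, given as (errors, total), burn faster than the threshold."""
    long_rate = burn_rate(*long_window, slo_target)
    short_rate = burn_rate(*short_window, slo_target)
    return long_rate > threshold and short_rate > threshold


# A 2% error rate against a 0.1% budget is a burn rate of roughly 20.
print(should_page(long_window=(200, 10_000), short_window=(20, 1_000)))  # True
# The long window is still hot, but the short window has recovered.
print(should_page(long_window=(200, 10_000), short_window=(0, 1_000)))   # False
```

Requiring both windows to exceed the threshold is one way to cut alert fatigue: the long window filters out brief blips, and the short window stops the page once the problem has already cleared.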

So number one, alerting is good, but I think the teams need to own the alerting and they decide what to alert on. I don’t think you can mandate alerting or tell people what they should alert on. You can give them guidance, but they may not use the guidance. Like, you know, we talked about applications that are noisy. If you have a noisy application and the errors aren’t real, then the team themselves are going to know what the signals are for alerting. But I also think you can do things like reporting. So SLOs for me are a good reporting tool.

And now that’s a little bit of a weird way of using the word SLO, because SLO is an objective. And so by the letter of the law, I guess an objective is something that the team sets for themselves. It’s not something that they report out to their management chain and say, here’s what we’re going to do. That would be an SLA. It’s an agreement. Right. Like, hey, we agree we’re going to do this. But I think in practice, it’s too hard to do that, because then you have to maintain two different numbers and you have to do all this work and tooling. Maybe if you have a very homogenous platform and all the work is done for you, you could do that.

But I think in practice, it’s easier to just say, yeah, we have this SLO and yeah, we’re responsible for it. And yeah, it’s a little bit of both, and it’s not ideal, but it is what it is. And that is what we do at GitHub. So I think having that means that you can then report on that number to your management chain or to the product owners, ideally in concert with your product owners. And that’s really important. And I think this is one of the things that I like that we do at GitHub: we compare daily incidents to the numbers for availability.

And the reason why I think that’s important, which is where I mentioned the human aspect, is, maybe it’s just me and I’m not very good at math. I mean, I’m fine at math. I went to engineering college and I had to do engineering math, but still, when it comes down to burn rate and numbers and all that, I really don’t want to sit and try to calculate, okay, well, how long was this thing down? And what’s the number?

And so looking at daily numbers and having a daily SLO that compares to an incident helps me look at a big table and just say, okay, on January 4th, was there an incident? If the answer is yes, then this number should be low. And if the answer is no, then the number should be high. And so I think that’s a very powerful tool from a reporting perspective. It doesn’t mean you should alert every day that it’s down, but it does mean that you can check if you want to know, and this goes to operations review.

So at a larger company, I think from a director level, you want to have an operations review where you review with your teams, hey, what were the numbers last week? How did we do? And that’s where I think that type of accountability is really important. It doesn’t have to be mean. It’s just a check on, hey, we had an incident. Our customers told us. This can happen. Your customers filed a bunch of customer reports, and you had to update your status publicly and say, hey, this is broken. Well, if that happened and the SLO didn’t dip, then

the SLO must be wrong. Your customers are telling you the SLO is wrong. Even if you’re like, yeah, I don’t care about that, right? But the customers clearly do. So I think that’s a really important check and balance for an availability program. And customers might not even be your public customers because, inside of a large organization, you’re going to have platform teams. You’re going to have a hierarchy of teams that are building on top of all these other things. So I don’t just mean, hey, you’re a customer-facing team. You need to have an SLO.

I mean, you could be the database team or the network team or whatever team, the platform team, and you should have an SLO. If you’re hearing from your customers, like, hey, our application’s not working, that is a sign that whatever you’re measuring, you either need to communicate to your customers that this is the objective that we have, or you need to figure out what you’re missing in your measurements so that you can match that expectation with reality for your consuming teams. Those are your customers.
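
The daily table Ross describes, an incident column next to a daily availability number, can be sketched roughly as below. This is an illustration under assumptions, not GitHub’s actual tooling: `review_days`, the 99.9% target, and the sample dates and numbers are all hypothetical.

```python
from datetime import date


def review_days(daily_availability: dict[date, float], incident_days: set[date],
                slo_target: float = 0.999) -> dict[date, str]:
    """Label each day by whether the SLO number agrees with the incident record."""
    findings = {}
    for day, availability in sorted(daily_availability.items()):
        had_incident = day in incident_days
        met_slo = availability >= slo_target
        if had_incident and met_slo:
            # Customers felt it but the number did not move.
            findings[day] = "incident but no dip: SLO may be measuring the wrong thing"
        elif not had_incident and not met_slo:
            findings[day] = "dip but no incident: missed incident or noisy SLO"
        else:
            findings[day] = "consistent"
    return findings


# Hypothetical numbers for an operations review.
availability = {
    date(2025, 1, 3): 0.9999,
    date(2025, 1, 4): 0.9995,  # incident declared, yet the SLO was met
    date(2025, 1, 5): 0.9500,  # incident declared and the number dipped
    date(2025, 1, 6): 0.9900,  # number dipped with no incident on record
}
incidents = {date(2025, 1, 4), date(2025, 1, 5)}

for day, finding in review_days(availability, incidents).items():
    print(day.isoformat(), finding)
```

The point of the table is that nobody has to do burn-rate arithmetic in a review: any row where the two columns disagree is a prompt to question either the measurement or the incident record.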

So having that, I think, is super, super important. And then the last pillar to this, I think, is incident review and incident response, which there’s tons of stuff written on. You need to train people on that. We talked about game days and playbooks. And we have incident commanders at GitHub. If you have a large organization, I think you need some type of role like that. And ideally, it’s not a centralized role like a classical SRE team might be, where it’s four people or 10 people that are always on call, because I already mentioned that’s burnout. That’s huge burnout. So yeah.

Bethany (33:11) And for those who aren’t aware, could you explain what an incident commander is? Or what that role entails?

Ross (33:16) Yes, an incident commander is somebody who gets paged into an incident whenever it happens. And so usually these are alerts that go off that are, you know, high confidence that we have a problem. They’re not always, though. Someone could just declare an incident, and then an incident commander gets pulled in, and their job is to basically coordinate the incident. So usually that means finding the right people, making sure we’re working towards mitigation. By the way, my favorite thing to say is: mitigate first, root cause second. So that’s a key thing: it doesn’t matter exactly what the problem is. It matters that we fix it.

So our first step is to go find out what changed and try to revert it, even if we don’t know if that will fix the problem, as long as we’re sure it’s not going to make things worse. So that’s kind of the job of the incident commander, to run that type of incident. And usually the incident commanders have had a lot of experience in incidents, which it turns out is really necessary, because sometimes you might need to be pushy, or you might need to empower someone to page, or you might have to page another team.

And there’s a lot of social friction, I guess obviously, to paging someone in the middle of the night unless you’re really, really sure. So having someone there to help support that and make sure, like, yes, it’s okay, this is broken, we need to do it, makes those types of things go a lot faster and makes sure that everybody does, you know, the thing. Then there’s communication. But yeah, that’s what an incident commander is.

Erika (34:28) So I get to ask a really fun question, looking forward to the future and pick your brain a little bit on any emerging technologies or practices that you think will change how we approach availability in the next five years.

Ross (34:41) I know, I know I’m probably supposed to say AI, and I think AI has its place. We just talked about incident commanders. AI, to me, is probably a good fit for that, and I know there’s some products that do that kind of thing. I guess I’m thinking of AI as something that can understand a large corpus of material and then translate that experience into, I can do the thing. So incident commander is a good role where it’s pretty well understood. There’s a lot of literature and examples for that. So I think you could probably train an AI pretty well to go do this thing.

I think where it gets a little more dicey with AI in emerging technology is the investigation piece of this. And I know there’s companies trying to do this with MCP servers and whatever, but I guess I feel a little more pessimistic about that, because I’m not sure if there is enough prior art that’s the same in order to train something to say, investigate this. You can probably hit the basic scenarios, but I don’t know. Maybe they will. I mean, great if they can do investigations without us having to do it. But I don’t know. That seems less likely.

In terms of technologies that are upcoming, I’ll paint a more pessimistic picture. There is this report called the SRE Report that Catchpoint publishes every year. And so I would highly recommend, if you’re interested in these topics, to go check it out. And they just published it for 2025. Maybe, actually, quite a few months ago, probably, but I just discovered that they published it. But in that report, they survey everybody, and it’s a lot of sentiment survey, but one of the survey questions was, how much of your time do you spend on toil? And it was interesting: this is the first year of that survey where they said that toil went up instead of going down. So I’m painting that pessimistic picture because I just read that survey not too long ago, so it’s still stuck in my mind.

I was really surprised to see that, wow, toil went up, right? You have all these technologies, all this observability, all these things. And I think maybe, I don’t know, are we at the tipping point where we have so many tools that that’s enough tools? Would it be easier to consolidate the tools and have fewer and do some of the human work than it would be to continue having more tools? I don’t know the answer. You know, maybe things can fail over more and AI can actually operate. I haven’t even thought about that. That would be kind of an interesting world to explore too.

Erika (36:54) Very interesting psychological rabbit hole, where I almost wonder if the introduction of AI is making the perception of toil greater. Because we have the ability to automate so many things with AI, the things that used to be sort of mundane and, you know, taken for granted that you have to do manually are now seen as toil. I have no idea, but that could be another take behind the data.

Ross (37:19) Yeah, it’s a good question. And I feel a little bit in a bubble about AI, maybe because we all work at GitHub. I don’t know if you all feel this way. I was just at a Fourth of July barbecue with my neighbors who work in tech, and some of them are like, yeah, I’m using AI, but I’m just kind of doing this thing, right? And we’re all, you know, obviously using all the tools that we’re given basically for free, that we’re developing every day. And so my perception of what people are doing every day might be skewed based on my own industry, versus, you know, I don’t know, if you’re in finance or banking or something. I feel like in some of those areas, they’re not there yet in terms of what we’re doing, at least. Again, not to slander that. I just think that obviously those areas need more time and things catch up. We’re competing in that arena. Yeah.

Brittany Ellich (38:04) Love it. I have learned so much today. I’ve taken so many notes just for myself, not even just, you know, for the podcast. This has been great. But we do need to wrap up a little bit here. And so what we’re going to do is move on to our fun segment, which today is going to be our top software book recommendations, where you pick one to three. I mean, obviously more if you really need to. This isn’t necessarily one you’ve read recently, but it’s your top pick: somebody is just entering the industry and this is your best recommendation for them. So I’m going to go first to give you all a little bit of time to think about what those are.

So my top three recommendations: the first one is The Software Engineer’s Guidebook, as a career-level recommendation. That’s one of the best ones, I think. It goes through every single career level. Highly recommend that one. And the second one would be Designing Data-Intensive Applications. If you’re interested in how data-intensive applications work, or anything at a company that’s larger than, you know, 10 engineers, it’s a really great resource.

And then my third one is going to be Thinking in Systems. These are all ones from our book club fairly recently. But yeah, if you’re interested in learning about how to think, that’s a really great one. Ross, do you want to go next? Do you have? Okay.

Ross (39:24) I’ll go next. This is a hard one.

I will say, I don’t read very many software books. I read articles and stuff, but there is one book that I really like that I have not read in a while, so I don’t know. This is probably gonna date me, so whatever. But there’s this book called The Best Software Writing I, which is selected and introduced by Joel Spolsky, I think is how you say his name, Joel on Software. So that’s why I say it might date me, because he used to be a very prolific blogger.

And he collected all these software writings from a bunch of different very famous authors, and they’re like papers, you know, like there’s the one about painters and hackers or whatever by the open source guy. See, I don’t even remember his name, but there’s tons in there, and I’m not sure how much is still fully applicable, but it’s a much more soft book, where I think you can get a lot out of people’s thoughts rather than specific technical stuff. So I would recommend that one. I think it’s pretty cool.

Bethany (40:17) I think my recommendations would be, whenever I have a mentee that’s graduating this program I volunteer for, I always get them Algorithms to Live By, just because I think it is a really good way to break down these seemingly complex algorithms into stuff you can see, attribute a physical nature to the algorithms, and see how they work and how people are thinking of them. And I think that is a great one.

I also really love Staff Engineer by Will Larson. I think that one is just a really good book if you have any interest in going the IC path. It really helped me align with what I wanted to do. Even though I’m not at the staff level, I am able to see what that entails and if it’s something I’m interested in, and so I thought that was a very influential book for me.

And then finally, I probably would say, I hesitate to say this, but reading Clean Code when I was earlier in my career really helped with imposter syndrome in a way, because it was so much about how to write code where it’s obvious what it’s about, and maintainable, and I thought it helped me really change my perspective. It’s not about writing the snazziest one-liners possible or anything like that. I wouldn’t say I agree with a lot of the stuff in the book, but I think it’s a good way to start forming your opinions and how you think about software.

Erika (41:35) I will heavy plus-one The Software Engineer’s Guidebook and the podcast. Yeah, he actually interviewed Thomas Dohmke recently, which was a pretty good listen for any GitHub fans. Yeah, so I will say that book changed my approach to my career and my day to day. John, I think you’re last.

Jonathan Tamsut (41:53) Yeah, so I kind of think about availability. I think sort of the platonic ideal is to write perfect, rock-solid distributed systems that never go down, that don’t even need monitoring. Obviously, that’s impossible. But I think an availability program is sort of a symptom of the fact that mistakes happen: systems break, a large collection of people make mistakes, a system becomes overly complex.

So I think a textbook on distributed systems. There’s one, I think there’s this Dutch guy, I believe he’s Dutch, it’s available online. I think it’s just called Distributed Systems, by Maarten van Steen. You can just read the PDF of it, and I think it is good.

I also think there’s this book called Practical Object-Oriented Design in Ruby. POODR is the acronym. I think that’s a good book on how do you write code? I mean, it’s specific to Ruby, but how do you write code that’s well organized? And I think I’d recommend that book. I think that book was helpful for thinking about how do you write software that is maintainable and that will hopefully not go down. But yeah, there’s sort of two dimensions to availability, sort of the underlying hardware, the code organization, so.

Brittany Ellich (43:10) Love it. I’ve got some new ones on my list now. Thank you so much, Ross. This was great. I know you mentioned your blog and we will absolutely be linking to that. But if people want to hear more about you or from you, where should they go?

Ross (43:23) Unfortunately that’s all I got. My blog is on Substack and you can put it in the show notes or whatever and check it out if you’re interested for sure. I really appreciate y’all having me on here though. It was really great.

Brittany Ellich (43:33) Yeah, this has been great. Excellent. With that, that wraps up what we’ve been talking about. So I just want to plug a little book club reminder. I know we talked about it in our last episode, but our book club is kicking off. I believe the day that this is released will be the very first chapter that we go through. We’re going to be doing Looks Good to Me by Adrienne Braganza, and you can join us in our GitHub repo. That’s overcommitted dash dev slash tech dash book dash club. And the link will also be in the show notes. We recently spun up a Discord server as well to chat through that, which is exciting.

Thank you so much for tuning in to Overcommitted. If you like what you hear, please do follow, subscribe, or whatever it is you’re supposed to do on the podcast app of your choice. Check us out on Bluesky and Discord and share with your friends. Until next week.
