Bethany (00:00) Welcome to the Overcommitted Podcast, your weekly dose of real engineering conversations. I’m your host this week, Bethany, and I’m joined by…
Erika (00:07) Hey, I’m Erika.
Bethany (00:08) We met while working on a team at GitHub and quickly realized we were all obsessed with getting better at what we do. So we decided to start this podcast to share what we’ve learned. We’ll be talking about everything from leveling up your technical skills to navigating your professional development, all with the goal of creating a community where engineers can learn and connect. Today on Overcommitted, we are joined by Warren Parad, CTO and co-founder of Authress, a user authorization API that helps developers plug in authentication and access control without building it from scratch. Warren has two decades of experience spanning healthcare IT in Wisconsin, e-commerce platforms in Switzerland, and now running a security SaaS startup. He’s also the host of Adventures in DevOps, a podcast with over 300 episodes featuring industry veterans on everything from infrastructure resilience to engineering leadership.
And most recently, Authress made headlines for staying fully operational during the massive AWS outage in October 2025. So he knows a thing or two about building systems that don’t go down. To kick us off, what’s one thing you’re currently building or obsessed with learning right now, Warren?
Warren Parad (01:11) I think that’s always a challenging question to answer. And I knew I had to be prepared for this podcast, because you ask every guest this question. Honestly, after a few episodes of people bringing up different IDEs and using LLMs on our podcast (and we try to stay very far away from anything AI related, just because it seems like everyone else is talking about it), I felt the need to go out and review how those LLMs function and whether or not you can do some software development with them. So right now I’m diving into Claude Code.
And I did spin it up before the podcast, and I feel like the experience I’m getting is that you just have to wait a lot of the time, and I don’t know what to do while that’s happening. It’s been quite a long time in my experience, you know, since 10, 15 years ago, when there was a time of just having downtime, and I don’t know what to do with it anymore.
Bethany (01:57) Very valid. There is this complexity thing, like you can be like, oh, I’m gonna run more agents, but then you have to manage more agents. It’s just kind of a footgun there. So yeah.
Warren Parad (02:11) Yeah,
I think multitasking seems like the direction it’s going. And I don’t mean to steer the episode in that direction as far as the topic goes. But I was discussing with my CEO, and she’s like, well, haven’t you ever managed multiple teams before? It could be similar to that. I’m like, yeah, it could be, if you also were a bad manager and were micromanaging each of those teams so you knew exactly what was going on. And I feel like I don’t want to micromanage five plus agents so that I can get work done. So I really need to figure out... maybe it’s just not the time for the agents yet, where you’re in a mode where you can get feedback quickly enough for it to be valuable to use. Because right now, I feel like the current state of the world is we’re just waiting for it to do the work that you already know it’s supposed to do. And that’s not a great place to be in.
Erika (02:53) Yeah, it’s definitely a fine line. I’m learning how to split-brain myself, to find those tasks that are interruptible, that I can do while something is running. What’s something that I can do that’s low-cognition enough that once I get the "this task is done," I can switch back and forth? But yeah, there’s also the level of feedback you need for each concurrent system that’s running, to give it enough information, and also to know when to steer it in a different direction. It is definitely a skill in and of itself, and you have to question what returns you’re getting at various points.
Warren Parad (03:33) I mean, it’s interesting you bring that up, because there was a study backed by Microsoft, out of Carnegie Mellon, about the impact of using LLMs on critical thinking. And the conclusion is that if you use LLMs, you’re swapping your expertise in whatever the area is for expertise, if anything, in using agents or LLMs or whatever you’re utilizing. I think it’s really important to keep that in mind. You’re basically saying, I don’t care about that skill anymore, let it atrophy, I’m going to swap it out for being able to micromanage a bunch of agents performing work.
Bethany (04:04) Yeah, it’s very interesting, because it’s such early days with these things that it’s hard to tell if these are bad best practices, or what impact this has on your career, your psyche, or well-being. So it’s very much a day-to-day thing. One day it’s like, we shouldn’t do this, but the next day it’s like, no, we should do this. And then you’re just trying to figure out how to do this, and how your job is morphing in this new world if those are the expectations. So it’s all very much live learning.
Erika (04:38) And I’m like, how do I tell if my critical thinking is atrophying? Is there a critical thinking test that I can take every week to judge whether I’m going up or down in different areas?
Warren Parad (04:53) You say that, and now I’m wondering, maybe that’s the next great startup idea pitch: actually ensuring that you’re not degrading over time while leveraging these tools in your work. But I did realize that part of the effort goes into taking all of these nuances that you’ve learned over your career and codifying them in some way, so that you can have an LLM run straight away.
Bethany (05:14) Yeah, absolutely. It’s very much documentation first, or making sure the comments are there first.
Warren Parad (05:20) Yeah, that’s why I went into engineering, so that I could document more, for sure. So one of the questions I feel like I do get asked is, how does this affect especially early-stage engineering, for those that are inexperienced or coming out of university? And I think this is one of the areas where there’s a sort of bias we have being on the leading edge, where we assume everyone’s using this technology, everyone’s making changes, and it’s right for everything.
Bethany (05:24) Yeah, that’s the fun part.
Warren Parad (05:47) And that’s just not the reality. I feel like if we look at some statistics over the last years, still something like 50% of the websites on the internet run on WordPress. I don’t know if that statistic is true anymore, because WordPress themselves are the ones pushing that number. But even if it’s remotely close, it does tell us that realistically, there’s a lot of technology, a lot of companies, and a lot of businesses that aren’t tech-first that are still using lots of things that have nothing to do with AI in any regard. And that should really tell us: if companies aren’t investing in AI, then you don’t need to worry about how AI is progressing, because it won’t help you get those jobs. And if companies are really firing engineers because they can replace them with AI, then you also don’t need to learn about AI, because you’re not going to get those non-existent jobs.
Bethany (06:30) Yeah, it’s really a catch-22 in terms of optimism for the job market coming in. I cannot imagine what it’s like being a new engineer entering this landscape, because I feel like so many of my learnings were from that aha moment of doing it manually, going in and researching that bug, not necessarily being told exactly what that bug is but having to do the due diligence. But on the converse side, maybe we get these super engineers that have that baseline experience already and are just getting to that next level earlier. But who knows?
Warren Parad (07:04) I think one of the challenging questions, and I joked about this on one of the later episodes of our podcast, is: what is productivity, really? What are you measuring? And my joke is that, for sure, it can’t be lines of code. But I think this is actually what people have gone back to measuring, effectively, that only output matters. And we know from the entire history of the economy and business, output is almost completely irrelevant. So it’s quite an interesting story to see companies these days, who clearly must never have had any sort of business KPIs, no strategy in how the business should go forward or what makes a good user experience, now just turning out more code. Any company that just cares about producing more, we’ll just see them evaporate in the next five to ten years. And really, all that will be left is not the companies that use AI; it’s the companies that have figured out how to make their business strategy work better with the technology tools that are available.
Erika (07:59) I feel like this transitions us really well to the idea of systems design and reliability, because in my experience, that is not something that LLMs are particularly good at recommending. They can read the docs, they can read the best practices, but that critical thinking piece is so crucial to good systems design. So yeah, I know that’s kind of the next thing we want to talk about, and I’ll hand it back to you, Bethany.
Bethany (08:31) Thanks for the transition, Erika. Yeah, so we brought up in the intro that Authress made headlines for surviving the AWS October 2025 incident, which I feel like should be on a t-shirt: I survived the AWS us-east-1 incident. And there were a lot of huge names that went down, like Disney, Reddit, and the New York Times. I’m curious if you could walk us through what that day looked like, maybe the critical thinking that kicked in, and how your systems responded throughout the day.
Warren Parad (09:04) Yeah, so I think there’s a really simple story for how we’ve decided to implement a nice strategy for dealing with this. The industry-agreed term is: just don’t be in us-east-1. Then when AWS goes down, because us-east-1 is the least reliable region as far as new deployments go, you won’t have an impact. But welcome back to the real world: you have customers that live in multiple regions, and one of them may be us-east-1.
So if one of your customers made a poor decision and wants to push that responsibility onto you as well, you’re forced to run in that region. And so the question is, what do you do? And if they’re multi-region as well, you do need an answer for when something happens. I think we are well plugged in to technology today, to what’s going on with other companies and whatnot, so you will hear about Cloudflare being down or AWS being down long before it may even impact one of your systems. Either because the technology choices you made are slightly different, or you know someone who says, I have to cancel lunch and go home really early, because right now all our systems are crashing. For us, it’s always a scary moment, because there is a question of: are we going to be impacted? Do we have an impact right now that we aren’t identifying? And it was pretty quiet for us, honestly. You start to hear things on the internet. It used to be Twitter, not anymore, but one of the communities you’re in, Bluesky, for instance, and
you’ll see people complaining, is this down? Yes, AWS is probably having an incident that’s not reported in the health dashboard, because why would that be the case? And the major strategy we have in place is DNS failover. We’re using AWS, so Route 53 health checks, where we’re dynamically discovering whether or not we can connect to our databases, or whether we need to switch regions in some way, or whether there’s a latency that’s too high and would violate our SLA. And from there, there’s any number of technologies we have in place which allow failover,
either switching the whole region over, because we’re using the DNS record, or using a strategy where we’re calling one database in one region and then automatically following up with calls to a backup region if there’s an issue. So we have the dashboard, you can see there’s an event, and we start getting 500s from connecting with our database, because we’re actually using DynamoDB. So it was very quick for us to turn around and expose this problem as an incident that could then trigger Route 53 to fail over, and then there’s a little time where you wait. But during that time, individual requests can still fall back to a different database in another region. And then you see it switch regions fine. We have paired regions for every local one, for each of our customers. Later, when it was supposedly resolved, you see it come back up. Interestingly enough, we did discover an incident for us that happened during this time, and we kept getting paged by our own systems throughout the day.
It was a good learning, because what had actually happened is our incident system was identifying errors that were in production. It turned out that those errors were what triggered the failover. But because the region was down, the event log was delayed by like six, eight hours. And so long after the incident had been resolved, our systems started reporting that there was actually a real problem. And that was because AWS finally got the logging infrastructure working again. So we started getting floods of emails and notifications that there was another incident happening. Not one that affected our systems, but it was quite annoying for anyone on call at that moment to have to wonder, okay, is there actually a real problem happening?
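The per-request fallback Warren describes, reading from the primary-region database and automatically retrying against the paired backup region, can be sketched roughly like this. This is an illustration, not Authress’s actual code; the client objects and region names are stand-ins for region-specific database clients such as DynamoDB endpoints.

```python
class RegionFailover:
    """Read from the primary-region client; on failure, retry the same
    call against the paired backup region instead of surfacing an error."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def get_item(self, key):
        try:
            return self.primary.get_item(key)
        except Exception:
            # Primary region unhealthy (e.g. 500s from the database):
            # follow up immediately with the backup region.
            return self.secondary.get_item(key)


class FakeDb:
    """Stand-in for a region-specific database client."""

    def __init__(self, region, healthy=True):
        self.region = region
        self.healthy = healthy

    def get_item(self, key):
        if not self.healthy:
            raise ConnectionError(f"{self.region} unavailable")
        return {"key": key, "region": self.region}


# During a us-east-1 outage, reads transparently land in the paired region.
store = RegionFailover(FakeDb("us-east-1", healthy=False), FakeDb("us-west-2"))
result = store.get_item("user-1")  # served by us-west-2
```

In the real setup this sits underneath the Route 53 health-check failover: DNS moves whole regions over, while this pattern covers individual requests during the window before DNS converges.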
Bethany (12:20) That’s so interesting. Yeah, when you get paged and it’s not an incident, it is definitely always a fun moment where you’re like, okay, is this actually worth looking into or not? And it risks your availability, because you’re like, I’m being pulled in so many directions.
Warren Parad (12:32) Well, it is.
Yeah, so I think the learning here was: when you’re logging particular data, make sure that the timestamp of the incident is well tracked in the event itself, and don’t trust the event system that’s managing the logs, or CloudWatch, or wherever you’re exporting it, to supply the date timestamps. And it can be very difficult to make sure of, because your logging doesn’t just go from one location; it’s being passed or pipelined through many different services, which may each have their own idea of when "now" is, in a way. For sure, we were getting the timestamp, but we weren’t, in this case, checking, first of all, that it was the right timestamp, and second of all, that the timestamp was actually relevant for alerting. Because knowing about an alert or an error that happened six hours ago is something that we should still follow up on, right, in case there is a delay. But at the same time, it isn’t something that should be the highest priority.
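That distinction, paging on when the error actually occurred rather than when the delayed log pipeline delivered it, could look something like this in alerting logic. A sketch only; the field name `occurred_at` and the 15-minute window are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

# How fresh an error must be to justify waking someone up (assumed value).
PAGE_WINDOW = timedelta(minutes=15)

def triage(occurred_at: datetime, now: datetime) -> str:
    """Route an error event using the timestamp embedded in the event
    itself, not the time the (possibly hours-delayed) pipeline saw it."""
    if now - occurred_at <= PAGE_WINDOW:
        return "page"       # fresh error: alert the on-call
    return "follow-up"      # stale error (e.g. logs delayed 6-8 hours): file it, don't page

now = datetime(2025, 10, 20, 18, 0, tzinfo=timezone.utc)
triage(now - timedelta(minutes=5), now)   # "page"
triage(now - timedelta(hours=7), now)     # "follow-up"
```

The stale branch still produces work to follow up on, as Warren notes; it just stops a six-hour-old error from being treated as a live incident.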
Bethany (13:28) That totally makes sense. Clocks being unreliable, the eternal issue. Before we move on, Erika, I’m curious if you have any really bad incident stories you want to share, or that you’ve been a part of, that might be similar.
Erika (13:44) Well, so the first one that comes to mind is not really an, like a live incident, but I did work on, when I was back in my consulting days, I worked on the COVID contract tracing center in the state of Massachusetts. And it was like, was, our, our, our piece was on AWS. We had some like, you know, services and they have like a call center, you know, offering
um, in ADOES and then there was a Salesforce piece. Um, and so we had like an integration between the two for like managing, um, COVID cases and, um, whatever like update we were doing, like the Salesforce team had, um, I don’t even remember.
exactly what it was now, but there was some like issue with their migration piece that ended up taking like six hours. So what whatever was supposed to be like a, you know, one hour update. Like I was I was online until like six in the morning because we had to do it in like off hours too. So like we started the migration at like, I don’t know, like seven p.m. or something. And it was like an 11 hour thing where like my piece was the very last.
part to go, so I had to be online until the very end. Yeah, so that’s the first one that comes to mind for me as far as developer horror stories of site management. What about you?
Warren Parad (15:06) That reminds
me of some of the things I had to go through in my past. I got my early love of being on call when I was working at a healthcare IT company in the United States. Their strategy was, when an alert goes off, you have to follow whatever the customer’s on-call strategy is for dealing with your technology stack. So it wasn’t like you could just handle it from your own company’s SOPs; it was whatever was driven from them. You know hospitals, the bleeding edge of technology, right? They’re known for that. And so at this time, basically what it was is, we had to join a group meeting call. Of course, when this happened, it was like two thirty in the morning, and you had a specific work laptop, you’re not allowed to use your own, and a special beeper, because that was a thing back then, right?
You get on, and first of all you have to jump through three different automated voice setups to even get to a shared bridge call, where they’re going around and manually assigning user IDs and passwords for people to log into their system on the call. Only three people could be logged in at a time, because they wanted to watch whatever was happening there. And then finally they say your name and give you credentials, and you log in, and you’re like, this isn’t the application, I don’t know how to debug what’s going on here. You have to use Citrix to log in to proxy into their servers. Wait, no, actually, I have to first SSH into their VPN, and then from there I can run a remote desktop protocol to get into the production servers. And then from there, I can actually access the production application. Because of course, what you want to do is have support engineers from a third-party company log into your production servers in order to figure out what’s wrong. There’s already an issue going on here, but let’s not get into that.
And then once you’re in there, the company says, you know what you have to do? You have to run this script and make sure that the script reports success. And I’m like, which part of this could not have been automated, honestly, when there is an incident? No, I get that a lot of systems in a hospital organization are critical. And if there’s a power outage and your data center’s on site and you have to cycle the power, your stuff is going to go down. And you may be down for an hour, or 11 hours in Erika’s case, waiting for stuff to come back up. And I get that you need to validate it. At the same time, I was working in hospital billing claims. There was no reason it was absolutely critical for me to validate the state of the database. And you know what? If that script said error, I don’t know what I would have done. I can’t fix this. I was not the primary developer who worked on the claims part of the billing application. And you know what? I have a lot of stories from them, and this is not the worst one.
Bethany (17:39) That sounds like a nightmare, honestly. Especially for something that’s very customer-facing, with things that plug in, that’s wild. You’ve talked about leveraging that reliability-first design. Did that come from some of those experiences? I’m very curious how that’s evolved through your career and what it looks like in practice.
Warren Parad (18:01) Yeah, so I think I got really lucky in a lot of regards, fundamentally, in our current company, which I actually don’t call a startup because we’ve been around for almost eight years now. But realistically, an important aspect here is that if you want to build reliable software, there are a lot of technological solutions to handle that. But fundamentally, it comes from having both the culture and the mindset of the organization embodying it. And for that, it means hiring people that care about reliability, or that build reliable systems by default without having to think a lot about it. And so I think whatever your experiences are, they’re incredibly critical for this. Mine were absolutely relevant, for sure, healthcare IT, for instance. But I got my start as not a software engineer: I was an electrical engineer, in computer design, at university.
And so I never built a compiler from scratch. I don’t know half of the computer science professors or famous names when other people bring them up; I have no idea who those people are. I can tell you how to build a laser, or receive microwaves from outer space, or do digital signal processing, but software engineering was not my primary area. And so when I started working in the hospital healthcare IT industry, I didn’t go straight into software engineering. I got put into what would now be SRE, or some sort of observability. And I feel like that helped create the perspective of uptime being really important; the technology was not the first aspect. The second part was engaging with the customer a lot. I didn’t realize this at the time, but I had weekly calls with a whole bunch of different hospital organizations on how their technology stack works. And it didn’t occur to me until much later that that was a required aspect of product engineering, where you actually understand what the users need. Those two things were front and center of how I started my career, and I feel like they followed me a lot. So I don’t see outputting code, or functionality, classes, services, as ever the primary solution. And I didn’t realize that was how I was thinking until I started my own company. I started getting into things and saw that, wait, no, we need to hire specific people with specific mindsets. Why do I think this way? Why do we hire people that think this way? And then it was sort of an epiphany moment for me.
Bethany (20:10) That definitely makes sense, getting that SRE perspective early on and seeing failures firsthand, especially in hospitals, and how that would influence your thinking about software going forward. Erika, I’m curious, how does your team think about failure? Do you feel like your team applies a reliability-first mindset, or do you feel like that’s somewhat of an afterthought?
Erika (20:35) Yeah, I think it’s interesting thinking about reliability at the scale of GitHub, because you can only control so much. You have your piece that you are thinking about, and you can do your best to make that as reliable as possible, but you’re kind of working within the constraints of the system, depending on where you are in the ladder and what projects you’re working on. But yeah, within that system, I guess the same rules apply. We were talking about AWS earlier: you know that us-east-1 is the most likely region to go down. We know which databases, which schemas are the most vulnerable. We know which data shapes
are the most prone to failure. And that’s kind of a lot of times the thought process that we go through first is like, okay, when we’re building this, what do we know that might fall over and how can we plan for that? How can we build for that? How can we find the biggest risks for what might happen, what might go wrong?
So yeah, I would say that we go through that checklist at the start of any project. And especially in the auth space, the other constraint is that we build the framework, but the teams that consume it have their own data, and we don’t always control how it operates in their system. But it’s all good feedback. We’ll get feedback from teams like, hey, this is not working, so we’ll have to redesign something given those constraints. So yeah, that’s what I think about for reliability.
Bethany (22:26) Yeah, that makes sense. I feel like often it’s make it work, and then make it reliable follows after. But I think, to your point Warren, I know you’re laughing at this, so I would love to hear your thoughts here. I think there are different scales where it makes sense. I mean, designing for reliability later, you’ll often have to kind of
Warren Parad (22:37) You
Bethany (22:48) rip out everything that you put in and then rethink it. So especially for critical services like authorization, it makes sense to think through our reliability from the start.
Warren Parad (22:58) Yeah, absolutely. I think you’re really onto something there. And I want to point back to something AWS said a while ago about designing their S3 architecture. It wasn’t just designed one time from the ground up, almost 20 years ago; it was redesigned at each level of scale. And so it really is important not to just design something, or design reliability in, based off of… There’s no such thing as "it’s going to be reliable." There’s a question of, well, how reliable does it have to be? And that’s really a business question. How many users are going to be there? What sort of interactions are going to be happening? What are the failure modes that can be part of it? And a huge part, I think Erika sort of brought this up, is that it is partly driven by the organizational structure and the intrinsic incentives that exist for each team and individual to build reliable stuff. So yeah, sure, the design of the technology is one thing. Do you want one nine, two nines, five nines like we have, or something more durable, depending on what actual customers are bringing to the table?
Bethany (23:59) So on that point, I’m very curious: what does a five-nines organization look like? Is that something you’ve intentionally been building from the start of Authress, or is it something that’s come about later in the company’s history?
Warren Parad (24:13) Yeah, so realistically, it was something we had decided as part of the product we were building: it had to be that reliable. We went to a lot of competitors that we had tried and didn’t like at the time, and we saw they were promising high reliability only on their enterprise plans. And we’re like, I don’t understand that; you either really need it or you don’t. And something as critical as an infrastructure component, like identity and access management, just can’t be down, ever. You have the same expectations of your identity management as you would of your cloud provider. And so we see what we have as really infrastructure. From that regard, we created these expectations for whatever product we were going to build. And from there, we made sure that whenever we added a new piece of technology or a framework to our stack, or a new service from AWS or another cloud provider, we knew what failure modes had to really be thought about, fundamentally. You don’t just add things.
I have a talk where I go through this, and one of the points is that a lot of companies build what’s probably two nines, or even one nine, 90% uptime, which is probably good for a lot of business scenarios. You probably don’t need to be extra reliable. If you ask people, hey, do you want this to be reliable, they’re going to be like, yeah, sure, reliability is great. But if you ask them how reliable, and they say something other than 100% uptime, maybe they’ve actually thought about what makes sense for the business. And the thing that they miss is that you don’t get from two nines to three nines or to five nines by adding something. You almost always need to subtract, because the lack of reliability usually comes from the addition of third-party services or dependencies or a pattern that you’ve introduced. So a lot of our endpoints become much simpler in nature. The way we design things, we know, has to be less complicated. Because as you add complexity to an endpoint or a piece of functionality, that’s another place that multiplies the likelihood of there being downtime or an incident if there is an issue. If you have three components, and there is an equal likelihood of a bug in each one of those, then the likelihood of there being no bug is a multiplicative factor, right? Each additional factor you add increases the risk of a problem. And so in order to decrease it, you actually need to remove stuff. So figuring out what makes sense for individual endpoints or individual flows for customer use cases is even more important than if you’re just like, I’m just going to build a prototype and throw it out there; it doesn’t matter what its uptime is.
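That multiplicative point can be made concrete in a few lines. The numbers are illustrative only: a request path where every component must succeed has an availability equal to the product of the component availabilities, so each added dependency can only lower the total.

```python
import math

def serial_availability(availabilities):
    """Availability of a request path where every component must succeed:
    the probabilities multiply, so each dependency lowers the total."""
    return math.prod(availabilities)

def downtime_minutes_per_year(availability):
    """Expected downtime implied by a given availability."""
    return (1 - availability) * 365 * 24 * 60

three = serial_availability([0.999, 0.999, 0.999])
# three == 0.999 ** 3, about 0.9970: chaining three 99.9% components
# already costs most of a nine, roughly 1,575 minutes (~26 hours) of
# expected downtime a year, versus ~526 minutes for a single one.
```

Which is exactly why getting to more nines means subtracting dependencies rather than adding safeguards on top of them.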
Bethany (26:36) That definitely makes sense. I feel like the calculus of service availability with dependencies is such a pivotal realization: having critical dependencies means you have to factor them into your availability. You can’t just say, yeah, we’re five nines unless AWS goes down, or unless XYZ goes down. You’re basically taking ownership of what that dependency is doing in your stack as well.
Warren Parad (27:04) For sure. I mean, there’s another aspect here that’s probably important to talk about, which is that we can say SLAs, and that’s really just a contracted number. It actually doesn’t matter how much we go down; an SLA is just saying, if we’re down more than that, some legal or financial consequence applies. But it doesn’t actually say anything about how often we’re trying to be up. And so there’s a big difference between promising an SLA of five nines and building for an SLA of five nines.
Bethany (27:31) Yeah, definitely. So, speaking about auth and maybe the anti-patterns: do you see any common auth anti-patterns that teams often fall into, and maybe alternatives you’d suggest?
Warren Parad (27:50) I think there is really this aspect of, it’s simple and we’ll build it ourselves. And I think you spoiled this a little bit; there’s maybe a link here to an article I wrote a long time ago. But a lot of people don’t think about how much investment they have to put in to evolve a piece of architecture or technology or a stack or framework that they added a long time ago. Things like SDKs often don’t provide the necessary level of control or functionality to overcome a lot of the challenges, especially in the auth space, because there are a lot of protocols and standards to wrap around. And the interesting thing there is that every single identity provider out there did something custom on top of the protocol. It’s nice to think that SAML and OAuth2 solve everything. But realistically, in our implementation, and maybe here’s a little bit of secret sauce, we have a custom implementation for every single identity provider that any of our customers brings up, because all of them have giant footguns or do something that isn’t even in the standard. And going through the documentation isn’t sufficient. So the likelihood that an SDK supports it, or provides for multi-service interactions, is just not believable, realistically.
Another one is not really understanding what the use case is. Maybe I sound like a broken record going back to talking to your customers about what the need actually is, and, God forbid, thinking more than six months out and figuring out how your product is going to evolve and what the needs are there. From our own research, a lot of companies end up spending a million plus on running a team to manage their identity and access management infrastructure, because either they started with an off-the-shelf open source solution that they installed, and now they’re basically on the hook for making it reliable, where a lot of open source was not built to be reliable; it’s built to be a runway for converting into a paid product. Or realistically, it doesn’t have all the functionality. Or you spin up a team and say, we’re going to do a project, we’re going to install, insert your favorite open source technology here, into the stack. And then six months down the road, the team doesn’t exist.
And now you’re blocking critical functionality for your application because of the lack of simple things that, if you had thought about the problem space, are actually required. Things like invites, group or user management, and granular or resource-based access control are just some common examples. Or, you know, you asked about pitfalls. I think the number one thing is throwing extra claims into your JWT. Maybe I’ll just leave it at that and see if it triggers anyone.
Erika (30:14) man.
Yeah, yeah. Well, that’s interesting. It must be really fascinating to see all the different use cases from people who use your product, and the knock-on effects for their customers and what they’re looking for. Because it is so true that, at its core... I mean, you do identity and access management, so you have the authentication piece and the authorization piece, which, from my experience, can get conflated very easily.
Warren Parad (30:50) Well, I mean, I feel like you’re almost joking about that. But the truth is, when we started off, it was specifically just the access control piece without identity management. We call it identity brokerage, or really aggregation, where you’re getting identities from multiple different sources. It could be GitHub Actions, GitLab workflows, or AWS as an entity that’s trying to authenticate into another service, or service-to-service interactions, or your customers who have their own sort of API keys that interact with you.
We didn’t do any of that to start. But what we found is that among a lot of companies and their engineers, there’s this sort of expert’s bias where you think everyone has more knowledge of the area than they really do. Most engineers never touch anything related to auth. And what we realized is it didn’t make sense to have a product that was so focused on access control, because most of the people that ended up on our marketing website or came to our product already had expectations of what that means, based on the lies they saw on someone else’s marketing page, or expectations they had from the company they worked at, the culture, or whatever terminology they used, which may or may not be industry appropriate. That doesn’t matter. They have those expectations. They come to your website, they come to your product, and now they expect things to be there. And so what we found is that identity is realistically a solved problem, other than all of these little bells and whistles that are necessary for every single identity provider, because Google inserts an hd claim, Apple says you have to use HMAC signing for secrets, and the list goes on and on. But the reality of the situation is that it is cheap to add this layer on. So we do; we throw that in there, and now we can tell everyone, yeah, sure, we still have that. They can get what they need. But when they come and realize the challenges that they actually have, that their users actually have, then we get into the features that are actually critically important.
Erika (32:31) Yeah, yeah, I mean, because there can be so many different levels of nuance, like you said. You have read, write, admin: okay, that’s authorization. There’s also fine-grained access control: that’s also authorization. So it’s not one size fits all, and how you implement it, how much you care about it, how much you build is really dependent on your product need.
Warren Parad (32:58) Yeah, it actually goes further than that. If we just say the word authorization: in HTTP there is an Authorization header, but you’re actually passing an identity token or an access token in it. You’re not stating the authorization or the permissions that are required there; it just represents the user identity. Authorization is also a word in the payment space: you get pre-authorization or authorization to bill a credit card. Same in the hospital world: you get authorization for a doctor to perform an operation based on whether an insurance company will pay or support it, or whether the hospital will. And so you may be coming from a different space where those words have a completely different meaning, and trying to apply them again actually does create confusion.
Erika (33:39) Yeah, and there’s also the idea of timing: what needs to be immediately available, and how long the authorization lasts. How long do these permissions apply? Is it single use? Is it multi-use? Yeah, it can be a very simple or a very complex space depending on what you’re building.
Warren Parad (34:01) Yeah, you brought up one of the most ridiculous ones. The thing that I like to think about is that the access token represents identity, the user, and not the permissions or the authorization. And so the idea of trying to force-deny or block a token or revoke it after it’s been issued doesn’t make sense. My identity is still who it is; I didn’t change. So the token changing doesn’t make sense. But once you start sticking permissions into the token, that’s when you start ending up with some of these problems. And it’s very difficult for people to see these long-tail issues that show up, when it seems like such an easy solution to just cache some additional data in the token and then pray everything works in the future, because you haven’t run into it yet.
Bethany (34:38) Well, I’m noting down a lot of things for the future, because I have definitely been on teams where it was just throw it in the claims and call it a day. So I’m taking notes.
Warren Parad (34:50) You know, I love this one because I think it’s one of the most common ones. If it’s some static data that very rarely changes, then yeah, for sure, go ahead and throw it in the claims. A claim, for anyone listening who’s not familiar with JWTs, or “jots” as they’re pronounced according to the RFC, is an additional property you can add to the JSON Web Token, with whatever you want in there. Sometimes people want to throw in the tenant or company IDs; go ahead and do that. But then the question is, what happens when someone adds in access to another company or tenant?
You have to cycle the token in some way, and that may force the user to log out and log back in in order for that to be persisted. And that still works, even. But then you get into more complex systems where you have different services with different databases and different resources, and some users should have access to just this resource and not that resource, and all the resources from this tenant in this service but not that other service, even though it’s the same tenant. There are MSPs, basically managed service providers, which are managing the customer tenants on behalf of the customers, and they shouldn’t have access to anything other than being able to manage stuff. Where do those permissions go? And the important part here is that the tokens are often returned as part of the HTTP header or the URL or the body in some way, and then you’re limited in size. Sending the token in the Authorization header with claims in it means you’re going to be limited by the header size. Now, technically, according to all the RFCs, there’s no max header size, but if you go above about four kilobytes, you’re going to start running into problems. And so if you have something that could be a potentially unbounded array inside your token, you just created a time bomb for your organization, for your company even; the business could eventually come down at some point because you did that.
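The four-kilobyte time bomb is easy to demonstrate. Below is a rough sketch that hand-rolls an HS256-signed JWT purely to measure its size (never hand-roll JWT signing in production); the claim names are illustrative:

```python
import base64
import hashlib
import hmac
import json

def sign_jwt(payload: dict, secret: bytes) -> str:
    # Minimal HS256 JWT encoder -- just enough to measure token size.
    def b64(data: bytes) -> str:
        return base64.urlsafe_b64encode(data).rstrip(b"=").decode()
    header = b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64(json.dumps(payload).encode())
    sig = b64(hmac.new(secret, f"{header}.{body}".encode(), hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

secret = b"demo-secret"
for n in (10, 100, 1000):
    # An unbounded array claim: one entry per resource the user can access.
    token = sign_jwt({"sub": "user-123", "resources": [f"res-{i}" for i in range(n)]}, secret)
    print(f"{n:5d} resources -> {len(token):6d} bytes")
```

At a thousand resource entries the token is already well past the roughly four-kilobyte header limit that many proxies and servers enforce by default, even though no RFC mandates a maximum.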
Bethany (36:29) That definitely makes sense and I will learn from this for sure.
Warren Parad (36:35) Hahaha
Bethany (36:37) Zooming out as we’re wrapping up, I’m curious: we’ve talked a lot about technical anti-patterns, but as a CTO, are there any product and business anti-patterns that you’ve noticed while building and leading a business? Anything you’ve seen a lot of companies get wrong that you feel you’ve done right at your company?
Warren Parad (37:03) Well, I think it’s still too soon to say that we’ve done right. I have this sort of morbid joke, so anyone who doesn’t like dark humor should probably skip past this part of the episode. It’s that you really can’t define whether or not you did something regretful until the moment you die. Because anything you do now could turn out good, even though at the moment it seems bad. And very short-sighted things that seem good now, which companies often focus on, are bad in even the medium and long term.
And so that can definitely be a challenge. One that comes up, and hopefully resonates with your audience, is on the subject of candidate hiring and interviews. There’s some statistic out there that interviews are less than 50% accurate for hiring candidates into positions. Less than 50% accuracy means I more often hire the wrong person than the right person; it means I should actually just flip a coin every single time.
Most people who are in the interview process are not experts there. They don’t fully understand what they’re testing for. And I think that’s one thing you really have to realize if you’re in this position: how are you going to make sure that you’re getting the right people onto your team? Part of the problem of building a company is ensuring that the culture you’re building, which is really just a corporate keyword for saying you are a group of human individuals and what they make up together is the culture of your organization, has the people that are aligned with the future you want to build. And one of the problems, especially in our space, is we think about having reliability, which means we need people who build reliable stuff. I like Simon Wardley’s breakdown here of pioneers, settlers, and town planners, basically archetypes. Not everyone maps to one of these models, but the idea is that some people are better at, or prefer, operating in a particular area and solving particular problems. Maybe they prefer greenfield work versus really thinking about optimization. And if we’re hiring these town planners who want to build reliable software, is our interview process set up to ensure that those are the people we’re going to hire? Notoriously, in an interview you ask rapid-fire questions, or questions on the spot, which require being able to respond immediately. Well, I don’t think town planners optimize for handling those sorts of things. And so with the standard interview process, even if you ran it correctly and maybe hired the right person, you’re biasing toward hiring the wrong people for the role. So it’s not just about the technical skills being tested in an interview; it’s also the format of the interview, which is super important.
Bethany (39:39) I feel that so deeply. Honestly, preparing for interviews is just about seeing if you can fit the format they’re testing for. And then it becomes about passing the test, which isn’t necessarily the same thing as finding a good employee or finding the right fit for your team. So that’s really cool that you all have spent so much time thinking through this process and how to get people who are a good fit for the organization; a mutually beneficial partnership in that respect.
Warren Parad (40:13) Yeah, you know, it’s really that perspective. The company wants to hire the right person, and you want to have the right people working with you. And that means it shouldn’t be about trying to skirt the system. It shouldn’t be about making sure you have all the evidence perfect, because you’re going to miss out on something important. And I think that’s definitely a task that often gets left to the engineers who have free capacity, rather than the best ones in the organization, or to untrained interviewers. Because, you know, we can just hire someone to use one of these LLMs to do all our coding for us now. What could possibly go wrong?
Bethany (40:45) What could go wrong? Speaking of which, this is a good transition to our fun segment, actually. We’ve pulled together some never-have-I-ever prompts. Instead of taking turns, we’ll just list these out, and we’ll give maybe three chances each. Whoever runs out first, I mean, we’re not drinking, but I guess, wins bragging rights for having fun experiences and fun stories. We’ll say when we put down a finger, for our audio listeners, and go from there. So we’ll start off with a fun one: never have I ever deployed on a Friday and regretted it by Saturday morning.
Erika (41:33) Bye.
Bethany (41:34) You’ve done that? Okay, okay, so that’s a finger down then. Any fun stories? Like, did you get paged for that?
Warren Parad (41:35) Yeah, for sure.
I think realistically most breakages happen when humans change stuff. Things usually don’t break on their own; runaway memory leaks or whatever are rare.
You have plans; you cancel them, in that regard. You hope the on-call alert wakes you up and you can fix it. And usually it’s the dumbest thing, like a null reference exception, because the language we’re using isn’t Rust.
Bethany (42:03) Yep, yep. Good old Rust. Erika, do you have any experiences with this?
Erika (42:07) If I do, I’ve blocked them out.
Bethany (42:10) Yeah, same. I’m having trouble thinking of any that I’m sure I have, but…
Erika (42:17) I’ve regretted
it by Friday evening. Like I’ve definitely like deployed my PR at like noon on Friday being like, ⁓ there’s, you know, this is like a simple UI change. And then like, I get stuck in the merge queue for like six hours because again, like you said, some, some, something very like bizarre and weird is happening. And then like, I’m still in the merge queue at 6 PM or something. But usually by Saturday morning, I’m, I’m, I’m, I’m good.
Bethany (42:20) True.
Warren Parad (42:44) I’m jealous of your ability to just purge your memories and not like just dwell on the mistakes of your past.
Bethany (42:50) I probably need to do more reflecting on it to feel shame for past mistakes to be honest.
Erika (42:53) Yeah.
Bethany (42:56) Okay, so never have I ever blamed DNS when I had no idea what was actually wrong. I’m gonna put my finger down.
Erika (43:04) Ooh.
Bethany (43:05) Or maybe less DNS, more networking in general. It’s like, it’s a networking blip. And then really, I don’t know what’s happening.
Erika (43:13) Okay, yeah, I would put my finger down for that. Just like general networking. Like, I don’t even know.
Warren Parad (43:19) I think we’ve seen that there is this aspect of: if you guess, you potentially waste a lot of time playing around trying to fix a problem that doesn’t exist. And so if you don’t have sufficient logs, it’s always better to spend the first part of the incident trying to de-stress. I think this is the standard thing: if there’s an emergency, don’t panic, don’t make assumptions, just calmly do the next thing, which honestly is adding some logging to prove your guess, rather than trying to throw out a fix to immediately get there.
Bethany (43:48) That makes sense. Yeah, a lot of times the pressure is on mitigation, mitigation, and then, well, if you’re totally blind, it’s kind of hard to figure that out.
Okay. Never have I ever rolled back a deploy that made things worse than the original bug. I’m definitely putting down my finger for that one. Anything involving state is always going to be a wild ride to fix.
Erika (44:04) way.
Warren Parad (44:12) I feel like there’s an LLM involved conversation here.
Bethany (44:14) Mmm, true, true.
Warren Parad (44:16) There are a lot of SRE AI products coming out, and from our own experience, if you could automatically roll back, or if rollback was the right thing, you probably would have had a unit test or something in place that would have caught the problem anyway. So the likelihood that a rollback will fix it and not run into a database migration issue is honestly getting smaller all the time.
Bethany (44:36) That’s doing some heavy lifting on assumptions about the quality of tests there.
No, it’s true. Okay, never have I ever been woken up by a PagerDuty alert and silently hoped someone else would grab it.
Erika (44:49) I don’t think I’ve done this, because usually when I’m woken up by a PagerDuty alert, I have nothing else in my head. It’s very much execution mode and there are no other thoughts. But I have definitely slept through PagerDuty alerts before. Yeah.
Bethany (44:58) Yeah.
Erika (45:06) Or I’ve had the moments, you guys ever had these, where the alert comes up in your dream and you’re like, wow, that’s so weird that I’m getting paged in my dream. And then you slowly wake up and realize it’s actually the pager going off.
Bethany (45:19) I feel that; it’s very much execution mode. You’re like, all right, we’re in this. Honestly, I wish I could subscribe to an alarm service that’s PagerDuty, because that would probably get me up in the morning better than…
Erika (45:24) you.
Oof, what a life.
You
Warren Parad (45:32) There are those alarms out there that, if you don’t wake up, start donating money to your charity of choice for every five minutes you let pass.
Bethany (45:41) And then I’m like, it’s
for charity. I’m sleeping for charity.
Erika (45:43) It’s a good trigger.
Warren Parad (45:44) This is interesting, because there was actually a study that reviewed what happens if you charge a fee to penalize people for doing something you don’t want them to do: they see the fee as a sort of justification that it’s okay to do that thing. It was about parents picking their children up late from daycare. If you warned the parents that leaving their child there too long was unacceptable, they took the hint. But if you said, for every 20 minutes you have to pay an extra 20 bucks, they’re like, oh, it’s free, or not free, but it’s a paid extended daycare service; I’m paying for that benefit. And they don’t see it as feedback to make a change.
Erika (46:19) Yeah, knowing you, Bethany, I feel like you would respond better to donating if you got up in time. Like, every time you get up with your alarm clock, a puppy gets saved.
Bethany (46:22) Interesting.
Warren Parad (46:31) You
Bethany (46:34) That’s what we- my gosh, I would- yeah, that’s true. That’s true, that’s the alarm service that needs to be built. Okay, we’ll do one more since we are coming up on time. Never have I ever discovered an outage because a customer tweeted about it before monitoring caught it.
Erika (46:37) Yeah.
Bethany (46:49) I’m putting down my finger, but I’ll let the others go, though I’ve lost.
Warren Parad (46:53) I mean, I’ll put a finger down. I won’t say tweet, but we do have communities for our products. And I’ll say there’s something valuable here. We call it a gray failure: there’s an infinite number of problems that could go wrong, and we can’t possibly validate every single one of them. So a thing we need to optimize for is exposing an interface that lets people who discover problems report them to us, because they may have more stringent tests, or a chain of actions we didn’t predict that is important for them, like network hops, or region-based validation, or maybe a library on some weird version of Ruby or whatnot that I hope no one’s using. We need them to report those to us, because we just aren’t able to find everything like that. And then it gives us an opportunity to improve. So I think this is an important aspect: how quickly customer-reported issues can be escalated to someone who can fix them is something we think about.
Bethany (47:49) Yeah, I think it’s very valid, and we have something similar with our status page: if we see an influx of traffic to the status page, we’re like, okay, what’s wrong? But it is so helpful to have user feedback, because truly, interacting with the users is so important. Yeah, I would say Reddit and Twitter are definitely parts of our
Warren Parad (48:02) Amazing.
Bethany (48:15) strategy, especially working on Copilot and how quickly things change. Yes, absolutely. Now, we’ve been sent down a lot of rabbit holes for sure. So it’s very much about validating what actually is a thing versus not.
Warren Parad (48:21) You have to be careful of the noise though, right?
It’s really helpful to know that if there are questions, or let’s say there’s no problem but there is a question, customers reporting confusion somewhere means there’s still something that needs to be fixed, right? Documentation in your docs or in the API, or maybe there’s an action with confusing results, and you just wouldn’t know about that if you didn’t expose this mechanism and then watch what actually happens.
Bethany (48:57) Absolutely. All feedback is a gift. It just might take some digging to actually get to what the feedback is sometimes.
Warren Parad (49:05) There’s a story there.
Bethany (49:08) No, I read it in a book actually, so it’s… yes. Okay, great. So, wrapping up, where can folks find you?
Warren Parad (49:10) Okay.
Yeah, so I’m unfortunately on LinkedIn and Bluesky. I think honestly, ping me through the podcast webpage or one of the community Discord servers if you’ve got something. I’m also in the Overcommitted Discord server, so if there are questions about this topic, I’m happy to answer them there as well.
Bethany (49:31) We got a plug in there too. Well, thank you so much, Warren, for joining us. It’s been an awesome conversation. And thank you, Erika, for also sharing your experiences here. And thank you, listener, for tuning into Overcommitted. If you like what you hear, please do follow, subscribe, or do whatever it is you like to do on the podcast app of your choice. Check us out on Bluesky and share with your friends. Until next week, bye.
I said all right.