Overcommitted | Software Engineering and Programming Insights

Overcommitted brings you software engineers who are genuinely passionate about their craft, discussing the technical decisions, learning strategies, and career challenges that matter.



61: AI Code Quality: The New Software Engineering Bottleneck

Show Notes

AI is generating more code than ever, but most engineers aren't verifying it. Sonar Staff AI Researcher Joe Tyler shares breakthrough findings from his LLM Leaderboard research on code quality, the hidden "coding personalities" of different models, and why the real bottleneck in software engineering isn't writing code: it's securing and reviewing it. Discover the gap between developer distrust and actual verification practices, plus how to position yourself for the verification-first future of software development.


Topics: AI code quality, LLM research, software engineering careers, code verification, developer tools


Show links:

  • Sonar LLM Leaderboard: https://www.sonarsource.com/the-coding-personalities-of-leading-llms/leaderboard/
  • SonarQube: https://www.sonarsource.com/products/sonarqube
  • SonarSweep: https://www.sonarsource.com/products/sonarsweep/
  • Joe's LinkedIn: https://www.linkedin.com/in/joe-tyler-a668051b1/
  • Latent Space: https://www.latent.space/
  • Turing Post: https://www.turingpost.com/
  • Nathan Lambert: https://substack.com/@natolambert
  • Cameron Wolfe: https://substack.com/@cwolferesearch
  • Sebastian Raschka: https://substack.com/@rasbt
  • Andrew Ng: https://www.andrewng.org/

Episode Transcript

Erika (00:01) Welcome to the Overcommitted Podcast, your weekly dose of real engineering conversations. I’m your host today, Erika, and I’m joined by…

Brittany Ellich (00:11) I’m Brittany Ellich

Erika (00:13) We met while working on a team at GitHub and quickly realized we are obsessed with getting better at what we do. So we decided to start this podcast to share what we’ve learned. We’ll be talking about everything from leveling up your technical skills to navigating your professional development, all with the goal of creating a community where engineers can learn and connect.

Today, we’re joined by Joe Tyler, staff AI researcher at Sonar, the company behind SonarQube, the tool used by over 7 million developers to analyze code quality and security. Joe works at the frontier of LLM research inside a product company, leveraging generative AI to revolutionize code quality and security. He’s a specialist in LLMs with a background in natural language processing and real-world AI applications. He’s also a lead researcher behind Sonar’s LLM leaderboard, which benchmarks models like GPT, Claude, Llama, and others across thousands of real programming tasks. Welcome, Joe.

Joe Tyler (01:17) Thanks for having me.

Erika (01:19) Thank you so much for being here. To kick us off, we’ll ask you the question we ask everybody. What’s something you’re currently building or learning that has you excited right now?

Joe Tyler (01:31) Well, yeah, maybe we’ll get into more detail on this later, but I’m doing some really fun research at the moment about how we can train AI models to be better at writing production-level code, which I find really interesting. And then outside of my day-to-day job, I’m learning Rust. Yeah, it’s been recommended by quite a few of my colleagues.

Erika (01:57) On my list for sure, so I’m jealous that you’re checking that off. That sounds fun. Cool. Well, Joe, you are leading research on Sonar’s LLM Leaderboard, which is an ongoing, regularly updated, independent analysis of how leading models perform on code quality, security, and maintainability, and which goes beyond standard benchmarks and reveals each model’s strengths, blind spots, and failure patterns. A key finding of yours is that newer, more powerful models don’t just make fewer mistakes, they actually make different and even more sophisticated ones. The 2026 State of Code Developer Survey, which surveys over a thousand developers globally,

Joe Tyler (02:45) Yeah, that’s right.

Erika (02:55) found that AI now accounts for 42% of committed code, yet 96% of developers don’t fully trust that output, and only 48% verify it before shipping. So since you’re actively working on this, and it’s a leaderboard, not a one-time study, can you walk us through what you’re tracking, how it works, and what the most interesting findings are right now?

Joe Tyler (03:22) Sure, so yeah, maybe zooming out. Our tool SonarQube is the industry standard for automated code review. And then over the last year within our research team, we’ve been trying to understand how LLMs write code, what the shortfalls are when they write code, and also how we can use our tools within the model training cycle to improve the quality of their code.

This study first came out of one of these research projects, where we thought: we’ve made a benchmark, let’s try running it on some leading models. We found big differences between different models. For example, I think GPT-4o was one of the leading models at the time, and we found that it suffered a lot more in terms of the security of the code it was writing, and had more bugs, whereas Anthropic’s Claude 3.7 at the time was writing more maintainable code. We thought it was interesting: even though GPT-4o might have beaten it on some functional benchmarks, it was writing more complex code, which might have impacts down the line if you’re using that model.

We have now built this out into a leaderboard, which you can find on our website. We recently hit over 50 different models being evaluated on there. We have expanded from just the leading closed providers to include loads of open models. There’s really exciting work coming out of China right now. GLM-5 is really a top performer for us in terms of different code quality issues.

Brittany Ellich (05:17) That’s really interesting. I wouldn’t think that there were common mistakes. Sorry, I’m probably going to ask a lot of dumb questions, because I use a lot of LLMs but I clearly don’t understand them fully. But I think it’s very interesting that there are common mistakes that each different model makes, like one where it’s security versus another. That’s not a thing that I would have thought existed. That’s interesting.

Joe Tyler (05:24) Yeah, I mean, every lab must train their models in their own ways and optimize for different things. And yeah, we found this in our benchmarks. We have a fixed set of over 4,000 real-world coding examples that we give to all of the models.

And by averaging across their performance on all these tasks, we get a pretty good estimate of how complex their code tends to be and how many lines of code they tend to write. One trend recently: we saw that the latest Claude Opus 4.6 model improved marginally on functionality, but about 15% of the lines it was writing were comments, which is a lot of wasted tokens, a lot of cost for the end user.
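To make the "share of lines that are comments" figure concrete, here is a minimal Python sketch of one way such a ratio could be measured. It is an illustration only: the transcript does not say how Sonar computes the number, and the function name `comment_line_ratio` and the choice to count only comment-only lines (ignoring inline comments and docstrings) are assumptions made for this example.

```python
import io
import tokenize


def comment_line_ratio(source: str) -> float:
    """Fraction of non-blank lines that hold nothing but a comment.

    Assumption for this sketch: inline comments after code do not count,
    and docstrings are treated as code, not comments.
    """
    comment_lines = set()
    code_lines = set()
    # Token types that carry no code of their own.
    layout = {
        tokenize.NL,
        tokenize.NEWLINE,
        tokenize.INDENT,
        tokenize.DEDENT,
        tokenize.ENDMARKER,
    }
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_lines.add(tok.start[0])
        elif tok.type not in layout:
            # Multi-line tokens (e.g. triple-quoted strings) span several lines.
            code_lines.update(range(tok.start[0], tok.end[0] + 1))
    comment_only = comment_lines - code_lines
    total = len(comment_only | code_lines)
    return len(comment_only) / total if total else 0.0


sample = (
    "# add two numbers\n"
    "def add(a, b):\n"
    "    # sum them\n"
    "    return a + b  # inline comments are not counted\n"
)
print(comment_line_ratio(sample))
```

Using the tokenizer rather than checking for a leading `#` avoids miscounting `#` characters that appear inside string literals.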

Erika (06:41) Yeah, so what are your considerations when you’re doing this research? Some things that come to mind: the things that you talk about have some level of objectivity and some level of subjectivity, like maintainability. There is definitely a gold standard, but then within that, there are things that people appreciate more or less. Like you mentioned comments, and some people are anti-comment or, you know, comment everything, and a lot of people fall somewhere in between. So how do you take the human aspect of it into account, or consider all the factors?

Joe Tyler (07:22) Yeah, comments was an observation that we made. The usual way we measure maintainability is in terms of the number of code smells we see from each model. We also look at the cyclomatic and cognitive complexity of the code that gets generated. So we’ve seen that Claude’s Opus models and Claude’s Sonnet models average about 120 cognitive complexity per, I think, 1,000 lines of code. I think that’s right. And then that compares to Gemini 3 and 3.1 at around 160. And those are both less than the GPT models ranging from 5.1 up to 5.4, which are all around 180. This is an observation we noted when these models were dropping at the start of this year. We were trying to explain why developers were really clicking with the Claude models. One way we read into that is that they’re writing nice, maintainable code that’s easy to comprehend, and that makes them easier to work with.
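For readers unfamiliar with the metrics mentioned here, the following is a rough Python sketch of a cyclomatic complexity count normalized per 1,000 lines. It is not SonarQube's implementation: it approximates McCabe's metric by counting decision points in the syntax tree, and it does not compute cognitive complexity, which additionally penalizes nesting. The function names and the per-1,000-non-blank-lines normalization are assumptions for illustration.

```python
import ast

# Node types treated as one decision point each in this approximation.
_BRANCH_NODES = (
    ast.If,
    ast.For,
    ast.AsyncFor,
    ast.While,
    ast.ExceptHandler,
    ast.IfExp,
)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 plus the number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per additional operand.
            complexity += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # The loop itself plus each `if` filter inside the comprehension.
            complexity += 1 + len(node.ifs)
    return complexity


def complexity_per_kloc(source: str) -> float:
    """Complexity normalized per 1,000 non-blank lines (hypothetical normalization)."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return cyclomatic_complexity(source) * 1000 / len(lines)


flat = "def f(x):\n    return x + 1\n"
branchy = (
    "def g(x):\n"
    "    if x > 0 and x < 10:\n"
    "        return 1\n"
    "    for i in range(x):\n"
    "        if i % 2:\n"
    "            return i\n"
    "    return 0\n"
)
print(cyclomatic_complexity(flat), cyclomatic_complexity(branchy))
```

Normalizing by line count is what lets models that write very different amounts of code be compared on the same scale.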

Erika (08:38) Yeah, it’s funny, it almost reminds me of like the co-workers that you gravitate towards like, I want to pair program with this person because I like the way that they write code versus like, no, I’m not going to write with them. Like they’re trying to, they’re trying to code golf everything. And like, I don’t, I don’t really want to do that. So,

Joe Tyler (08:57) Yeah, exactly. I think there’s also a paper I read recently, not one of ours, out of Carnegie Mellon, titled something like Speed at the Cost of Quality, where they looked at teams that had adopted Cursor, their immediate productivity boosts, and then their longer-term productivity slowdown. If teams aren’t aware of the complexity and maybe maintainability issues that these AI models are introducing, then even though they’re going to get short-term benefits, they might face bigger issues down the line. So that’s definitely an area of serious interest for us as well.

Erika (09:39) Yeah. And do you only look at text prompts to code, or do you look at any sort of iterative cycles and improvements on itself?

Joe Tyler (09:51) Yeah, so the benchmarks considered so far are code generation tasks. That’ll be sort of greenfield coding, where you’re asking a model to develop a feature. We also have benchmarks we’ve looked at for code remediation, but those aren’t featured on the public leaderboard yet. That will be where a model or an agent is looking at a piece of code, there’s an issue with it, and it has to find a fix.

Erika (10:25) Do you have any sort of discourse with the creators of the models themselves? Do you have any insight into why these things are the way that they are, or their take on these sort of personalities that you found? Or have you kind of stayed away from that?

Joe Tyler (10:47) We haven’t had that sort of conversation with them, but we’ve spoken to lots of different people who are interested in the results of these benchmarks. It’s interesting that there are these clear trends, like the complexity that I brought up earlier: the lowest-complexity models have been Anthropic’s models for the last two generations, and GPT 5.4 is around the same. There must be something going on with how they’re training their models, that they maintain these sorts of patterns.

Erika (11:24) Yeah, I mean, it’s interesting to me too, because, like I mentioned pair programming earlier, even if you don’t want to write code the same way as somebody else, you can still learn from everybody’s styles and even learn what not to do. That awareness is helpful, and so is knowing: these are the trade-offs, these are the pros and cons of this kind of implementation approach versus this other one. So yeah, it’s a very cool resource to be able to compare all those different styles and, to your point, Brittany, know more of what you’re getting, rather than blindly working through these things.

Joe Tyler (12:23) Yeah, I think another overall trend from these models: out of the top 20 most verbose models on our leaderboard right now, so the ones writing the most code, the most lines, 19 have come out in the last six months. So I think there’s a real trend towards these models writing more code. Maybe that might be better. Some, like GPT, have definitely made improvements in their complexity, so they’re writing more modular code than before. But overall, these models are writing more lines of code and burning more tokens while achieving better functional performance.

Erika (13:07) Yeah. So we sort of teased a little bit earlier that newer models don’t necessarily equal better code. So talk us through some of the things that you found there, what the changes have been over time.

Joe Tyler (13:23) Yeah, so we’ve seen this with the GPT-5 models. Since October last year, there have been several updates to those models. And we see that even though the number of bugs they’re writing is reducing, it’s also shifting. So we’re seeing more bugs with concurrency issues. Maybe these models are trying to write with more advanced patterns, and therefore they’re introducing more subtle bugs that are harder for an AI-driven analyzer to detect. But with our static tools, you’re able to detect those sorts of subtle issues.
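As an illustration of the kind of subtle concurrency issue described here, the Python sketch below contrasts an unsynchronized read-modify-write with a lock-protected one. It is a generic textbook example, not a case taken from the leaderboard data; the class names and the `hammer` helper are hypothetical.

```python
import threading


class UnsafeCounter:
    """Read-modify-write without synchronization: updates can be lost."""

    def __init__(self) -> None:
        self.value = 0

    def increment(self) -> None:
        current = self.value      # read
        self.value = current + 1  # write; another thread may have written in between


class SafeCounter:
    """The same counter with the critical section guarded by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        with self._lock:
            self.value += 1


def hammer(counter, threads: int = 8, per_thread: int = 10_000) -> int:
    """Call counter.increment() concurrently and return the final value."""

    def work() -> None:
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


print("locked:  ", hammer(SafeCounter()))    # always threads * per_thread
print("unlocked:", hammer(UnsafeCounter()))  # may be lower; depends on scheduling
```

The unsafe version often passes a quick functional test, because the lost update only shows up under particular thread interleavings, which is why this class of bug is easy to ship and hard to spot by reading the code.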

Erika (14:17) Good to know. It also maybe shows that it’s learning on itself. It’s growing as a developer. On day one, you’re not writing concurrency bugs because you’re not writing multi-threaded code. And in your mind, what do these archetypes, like Claude as the senior architect or

Joe Tyler (14:32) Yeah.

Erika (14:46) GPT is this efficient generalist. What does that mean in practice for developers?

Joe Tyler (14:54) Yeah, so that was a report from when we first published our research, to try to make things a bit more tangible in the minds of technical and non-technical readers. We were seeing that Claude, I guess it was Claude 3.7 or Claude 4 at the time, was writing the most well-structured code, with the lowest complexity. And that made us label it as an efficient architect, because it’s going to lead to the fewest issues with your architecture as you use that model to code. Whereas GPT-4o was the efficient generalist: it wasn’t writing that many lines of code, and it had pretty good functional performance, even if it wasn’t quite at the level of Claude 4.

Brittany Ellich (15:43) Interesting. Is that something that you take into account now? You said you’re learning Rust like for your own code writing, or are you like abstaining from using LLMs while learning Rust or how’s that?

Joe Tyler (15:54) That’s interesting, actually. In my day-to-day coding, I do use coding tools. I mainly use Claude Code. I use Cursor sometimes. But to teach myself a new language, I’ve taken the Rust documentation and used an LLM to build some lessons and projects for me to work through. And then I turn the AI off and I try to just code. I think while you’re learning the fundamentals, it’s good to think for yourself. But now I mainly develop in Python. So much Python’s getting written, and I’m not writing that much of it myself, but because I know how it works, I’m still able to do a decent code review.

Erika (16:56) It’s a valuable skill. Speaking of, let’s talk a little bit about verification of code, because you did a survey where you found that 96% of developers don’t fully trust AI output, but only 48% actually verify it. So what do you think is going on there? And Brittany, you can jump in here too with how you tend to verify AI code. Are there instances where you don’t have to verify it, and what are the trade-offs? Yeah, how do you both think about that?

Joe Tyler (17:40) Do you want to jump in first, Brittany, or I can go.

Brittany Ellich (17:44) Yeah, I’m just reading that, and I feel like developers are probably very skeptical in general. So I wonder if that’s coming through in that 96% number, where they’re like, well, I don’t trust any of it. But, you know, some of it I have a look at, some of it I don’t. But that’s a terrifying number to see: 96% don’t trust it, and then only 48% are actually looking at it. Where’s that other 40%? What are they doing? Why are they not looking at it? Yeah, curious what you actually found about that, for sure.

Joe Tyler (18:14) Yeah, I mean, I think if you write some code on just a project of yours and it runs, then you’re happy and you push it. And I think that’s why we designed this benchmark to go beyond just the functional performance and look at other aspects of the code. Yeah, it is a very low number of people that are actually verifying their code. I have the Sonar CLI tool that I use, and I have Claude Code call that when it’s finished, and also during the coding process, to make sure that, yeah, I’ve not written some bug.

And then on your point of whether it’s always right to verify the code, I guess it’s different for different use cases. If you’re quickly prototyping something, you might not need to verify as much, especially if I’m just doing something quick for some research. But when you’re working on more production stuff, stuff that’s maintained in your team, you want to make sure that your AI isn’t just generating a 10,000-line pull request and ruining the architecture.

Brittany Ellich (19:32) Yeah, I’ve heard a lot of people talking too, and some discourse, about how in the future we’re not even going to review code, and that’s going to solve the code review bottleneck. Where do you land on that? Do you think that’s actually a likely future, or is it still going to depend on how important the code actually is?

Joe Tyler (19:50) Well, I think these latest AIs do generate quite large pull requests, so definitely having a static review tool is good. And I’ve even been using an AI code review tool to roughly talk you through the PR to begin with, and that’s a useful starter.

Yeah, I don’t think the answer is to get rid of review. I think if you’re not looking at the code, that’s exactly when you need a verification layer.

Erika (20:30) Yeah, it’s always like going back to the personas and archetypes and perspectives too. You do write code for different audiences. You do have that functional aspect of it, but then you need to consider: what would a junior engineer think looking at this? What would a senior engineer think looking at this? There is this question of perspective and audience. And I mean, I guess you can give AI personas. But I feel like there’s always going to be some level of human element. Where that falls on the spectrum of action and agency, I don’t know, but it’s hard to imagine there not being any human in the loop in the process, in my mind.

Joe Tyler (21:31) Yeah, I think I look at code less now that I’ve started just coding in the terminal rather than using an IDE. And I feel like I’m only confident in doing that because I know that there’s some sort of static tool checking that there’s nothing completely ridiculous going on behind the scenes.

Erika (22:01) Well, cool. Thank you so much for walking us through all that. It’s really interesting work, and we’ll definitely link the leaderboard in the show notes for people to check out themselves. We’re going to transition a bit into your story and your career. You transitioned from data science to machine learning research in a relatively short amount of time. So what actually made you want to make that switch, and how did you learn what you needed to learn along the way and throughout that transition?

Joe Tyler (22:46) Yeah, I didn’t feel it was that large of a jump in the end, actually, because data science can be quite a broad term. Luckily, I was working in a really small, nice team with some really good engineers. Even though I was doing data science, I was still working with large language models. It’s been scaled up massively, but the core underlying tech that I was using to train models for understanding legal language is the same as you use for coding. Coding and legal language are vaguely similar problems in my head. It’s sort of English, but not quite, and it can be quite hard to verify if your solution is correct. Before, we’d rely on a team of trained lawyers to help us verify that the model was giving good predictions. Now, luckily, we have coders that can look at our predictions, and we also have these coding benchmarks that we can rely on.

From the start, working with large language models, I was really keen to understand the underlying tech as much as possible, and that really helped me. And working so closely with some really good engineers, and having a great manager who pushed me to write good code, made me more interested in coding, and that led me to want to pursue this career in AI for code generation.

Brittany Ellich (24:47) Was there anything different, where you had to shift your thinking going from legal text to code? I love the idea too that we hire lawyers to be like coders, basically, for legal stuff. I’ve never thought about it that way. And that actually makes a lot of sense, because they’re literally just there to interpret the thing. So yeah, that’s

Joe Tyler (25:08) Yeah, we had a big data set from the output of lawyers that was very useful, because looking at some of these clauses would be completely incomprehensible.

Brittany Ellich (25:23) Yeah. Yeah.

Joe Tyler (25:25) Anything different? I would say Sonar has so many expert engineers and managers. The SonarQube product has, I think, 7 million developers worldwide using it, and 7,000 enterprise clients. That sort of scale really appealed to me, and it means that we have this interest in understanding how LLMs write code, because it’s important for our customers to know this. Therefore I thought there’d be some really interesting projects to work on.

Erika (26:19) Very cool. And for any engineers who want to shift into understanding AI tools better instead of only using them, and maybe even help to shape them or shape the landscape, what advice would you give to them on what it takes to really engage with this area?

Joe Tyler (26:48) I’d say just get started and follow some good newsletters. Latent Space is a really good one. Turing Post as well. There are some great Substack writers that I follow, like Nathan Lambert, Cameron Wolfe, and Sebastian Raschka. And then of course anything by Andrew Ng is definitely worth a read.

I think it’s a great time for a developer to take an interest in AI. Our team has people with lots of different specializations across cloud, deep software engineering, and AI research. It’s been really fun and productive, everyone coming together with their different specialties. I think the problem-solving abilities that you have as a developer are fairly similar to what you need to understand AI research.

Erika (28:10) Thank you all.

Brittany Ellich (28:12) Yeah, I like that. I was also recently listening to a podcast talking about the differences between what DeepMind is working on versus OpenAI, and how OpenAI has been focusing on LLMs and how that’s getting most of the love right now. But do you think it’s also worth learning more about the deep and reinforcement learning that DeepMind and, I’m sure, other places are working on as well?

Joe Tyler (28:35) Yeah, so on the model training side, we’ve been doing some fun research over the last year and a half. We’ve basically been taking open-source models and training them on a pipeline that has SonarQube, our verification layer, baked into the data pipeline. And that’s given us some interesting results. We trained a large Llama model last year, and we saw that when we trained it with our data pipeline, it generated 67% fewer vulnerabilities and 32% fewer bugs. And we saw similar gains on GPT-4o as well. And then in December, we actually released a fine-tuned version of GPT-OSS, which is OpenAI’s latest small model. There might be a new one coming soon, I think. We saw a reduction in the number of bugs and vulnerabilities by training with our method by 41%. So, % fewer bugs, 41% fewer vulnerabilities. Also, we saw a 20% reduction in the complexity of the code that it was writing.

Yeah, this is a real area of interest. This is what my work’s on at the moment. It’s a product called SonarSweep, where we are looking at customizing open models to write really high-quality, production-quality code and also understand an organization’s own code base.

Erika (30:37) Yeah, that was really interesting. It is so fascinating how different the results are with slight tweaks in input, or even the place where you’re running it, like running a prompt in a bare terminal versus in a repository using code context. Yeah, it can kind of make your head spin, all the different possible inputs and how they might all interact with each other. So, yeah.

Joe Tyler (31:13) Yeah, there are so many code context tools on the market right now that tell you that they’re going to make your agent understand your code base. We’ve just released our own context tool in open beta last week, which makes use of the static tree of your code base and will guide the LLM to the right spot.

I don’t have the exact benchmark numbers to hand, but in our blog post, we’ve seen a reduction in the time it takes for an LLM, or an agent, to get to a response, and also a reduction in the cost, because you’re burning fewer tokens by being able to navigate to the relevant parts of your code base more quickly.

Erika (32:10) Yeah.

Well, we are going to transition to our fun segment here at the end. We always do some kind of a game or round robin. And today we are playing exactly to your strengths with a guessing game, where I’m going to show three snippets. I’m not going to get to guess because I put this together, but you both get to guess which model wrote the code. So the options are… Sorry, let me fold the window.

Joe Tyler (32:52) Okay, good. I was worried you were going to get us to guess from the list of 660 models.

Erika (32:55) It’s probably all, yeah, no.

The options are Claude Sonnet 4, GPT-4o, and Claude Sonnet 3.7. Okay, so I am willing to go ahead and show my screen here, and I’m gonna show all three, and then at the end, you can place your guesses of A, B, and C. Are you ready?

Okay, here we go. The prompt here is: write a Python function that reads a file, processes each line, and returns a count of lines matching a regex pattern. So here’s snippet A, I’m showing my screen. I’m just gonna narrate what I see. It imports Union. I’m not gonna say this right, because I don’t know Python, but: from typing import Union, sets up a logger, then sets up this count matching lines function, which has a chunky code comment at the top, sets up the file path, checks some errors, has a try block, except, and then a debug statement at the end before returning the match count. So that is snippet A.

Snippet B is considerably shorter, has a one-line comment, imports the whole re package? Is it packages in Python?

Joe Tyler (34:35) Yeah, it’s a regex package, yeah.

Erika (34:37) Okay, thank you. It uses open and basically loops through the file to return a count. And then snippet C imports the regex package, and Path and Union from the pathlib and typing packages. It also sets up a nice chunky comment, not quite as long, has a nice ignore case check here at the top, compiles the regex, and then basically also does a loop before returning the match count. All right, oops. I think I might’ve revealed some of the answers too. Okay, good. You’re just looking at the code. All right. So are you ready to place your guesses, Joe and Brittany?
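Since the three snippets were only shown on screen, here is a minimal Python sketch of the kind of function the prompt asks for. It is not any of the three model outputs from the episode; the name `count_matching_lines`, the `ignore_case` parameter, and the UTF-8 encoding are assumptions chosen to echo the narration.

```python
import re
from pathlib import Path
from typing import Union


def count_matching_lines(
    file_path: Union[str, Path],
    pattern: str,
    ignore_case: bool = False,
) -> int:
    """Count the lines in a text file that match a regular expression.

    A line counts once if the pattern is found anywhere in it.
    """
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)  # compile once, reuse for every line
    with open(file_path, encoding="utf-8") as handle:
        # Iterating over the handle reads one line at a time,
        # so large files are not loaded into memory at once.
        return sum(1 for line in handle if regex.search(line))
```

A call such as `count_matching_lines("app.log", r"^error", ignore_case=True)` would count lines beginning with "error" in any letter case; the file name is hypothetical. The differences the hosts describe between snippets (logging, error handling, docstring length) sit on top of this same core loop.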

Joe Tyler (35:23) I didn’t see. I think so.

Erika (35:39) All right, as a reminder, it was Sonnet 3.7, Sonnet 4, and GPT-4o. All right, so maybe Brittany, I’ll have you put your guesses in first. Between A, B, and C, what do you think the model matching is?

Brittany Ellich (35:40) Yeah.

Sounds good. I think that my reliance on just using Opus for everything is gonna fight me here. But based on what Joe mentioned here, I listened, I paid attention, and you talked about how Sonnet and Opus use more comments. And so that’s literally the only basis for my answer here. So I think that A is probably Sonnet 3.7. It seems like maybe it’s a little bit older than C, which also had a lot of comments. And so I’m guessing that that one is…

Erika (36:05) All

Brittany Ellich (36:29) Sonnet 4, and then I’m guessing that B, with far fewer comments, is GPT-4o. And very brute force too, I feel like. Just going straight for the full regex package there.

Erika (36:44) Alright, Joe, what do you think?

Joe Tyler (36:46) Okay, that’s interesting. I’d agree with B being GPT-4o, I think. As I said, it hasn’t really written comments, so I hope that’s right. Snippet A has tried to do something more advanced there, maybe. It’s written the logging pattern, which I don’t think the last one did, and it’s handled its edge cases there, so maybe that’s the more advanced model. So maybe I’d go for Claude Sonnet 4 for A, and therefore Claude Sonnet 3.7 for C.

Erika (37:25) All right. All right. Time to reveal the answers. So snippet A, with the sort of more advanced architecture, is indeed Claude Sonnet 4. Snippet B is indeed GPT-4o. And snippet C is Claude Sonnet 3.7. So yeah.

Brittany Ellich (37:50) Nice one, Joe. Clearly

Joe Tyler (37:51) Okay, great, well

yes. No, thanks for choosing such a good example.

Brittany Ellich (37:53) you’re the professional here.

Erika (37:55) Like

you do this for a living. You actually know what you’re talking about. Awesome. Well, Joe, one last word to you. What do you want engineers listening to this to take away from the work that you and your team have been doing?

Joe Tyler (38:01) Yeah. Here we go.

Yeah, I think you should definitely have a look at the leaderboard, and I’d be interested to know if the findings, in terms of the complexity and the different types of issues people see, line up with what we’ve got. As I mentioned, we’ve got more advanced benchmarking coming out soon. We’re updating the leaderboard. I know Google just released a new batch of open models, the Gemma 4 series. We should imminently have the results of how they stack up, which I’m really excited for. I tell developers to always try to be an early adopter, try out these tools, and also try to find a way to benchmark them, because there’s a lot of noise with different, you know, MCP servers and CLI tools that everyone says are going to improve your agent. You need to figure out: is this actually making me better? Is this making me spend more money? But yeah, it’s an exciting time. These tools are coming on so fast, and I’m really enjoying developing and researching at the moment.

Erika (39:33) Well, aside from the leaderboard, is there anywhere people can find you online?

Joe Tyler (39:39) Yeah, I’m on LinkedIn. I think I can send you my link for that. Yeah, that’s probably the main place that I post about our work and different things I’m interested in.

Erika (39:58) Cool. Well, thank you listeners so much for tuning in to Overcommitted. If you enjoyed this episode, please do follow and subscribe on whatever podcast app you use and find us on Bluesky. Share it with an engineer friend who might appreciate it. Until next week, goodbye.