Hype and Reality of the AI Coding Shift

59 min

•Apr 23, 2026

Summary

Sonar releases a comprehensive developer survey revealing that while 72% of developers use AI coding tools daily and 42% of code is already AI-generated, 96% don't fully trust AI-generated code. The episode explores the verification gap, the shift in developer toil from code writing to code review, and how organizations must adapt their quality assurance processes for the AI-assisted development era.

Insights

The 'great toil shift': AI eliminates traditional toil (documentation, tests) but creates new toil around code verification and review, with developers spending similar total time on different tasks
Trust paradox: Despite rapid adoption and improving code quality, developers remain skeptical of AI output because humans retain accountability while AI does not, creating a verification imperative
Experience gap: Junior developers embrace AI tools (40% productivity gain) but admit 66% of AI code 'looks correct but is broken,' while senior developers use AI more cautiously as reasoning assistants
Greenfield vs. brownfield divide: AI excels in new projects (90% adoption) but struggles with legacy codebases (43% effectiveness), particularly with non-mainstream languages and implicit system couplings
Shadow IT risk: 35% of developers use personal accounts for AI tools, exposing organizational IP and data to unvetted third-party services, indicating governance gaps in enterprise AI adoption

Trends

AI code generation shifting from experimentation to production infrastructure requiring deterministic verification layers
Emergence of agentic workflows and multi-agent orchestration as the next frontier, moving beyond single-tool AI assistance
LLM model quality divergence: newer models achieving both high performance and low complexity, breaking the previous linear relationship between capability and verbosity
Developer skill evolution from code writing to AI orchestration and agent training as core competencies
Enterprise governance lag: rapid tool adoption outpacing organizational policy and security frameworks for AI-generated code
Code review complexity increasing: 38% of developers find AI-generated code harder to review than human-written code due to subtle bugs and needle-in-haystack detection challenges
Security-specific AI risks emerging: prompt injection attacks, rules file backdoor attacks, and other LLM-specific vulnerabilities requiring new detection rules
Deterministic verification becoming critical infrastructure: static analysis tools repositioning as essential verification layers for both human and AI-generated code
Cost-performance tradeoffs in LLM selection: reasoning modes, token pricing, and model verbosity becoming selection criteria beyond raw performance benchmarks
Brownfield modernization bottleneck: legacy systems with implicit rules and non-obvious couplings remaining difficult for LLMs to navigate without additional context

Topics

AI code generation adoption and daily usage patterns
Code quality and security verification for AI-generated code
Developer trust and accountability in AI-assisted workflows
The 'great toil shift' from code writing to code review
Shadow IT and governance gaps in AI tool usage
LLM coding personalities and performance benchmarking
Greenfield vs. brownfield project suitability for AI tools
Experience level differences in AI tool adoption
Agentic workflows and multi-agent orchestration
Static analysis and deterministic verification layers
AI-specific security vulnerabilities and detection rules
Enterprise procurement vs. developer tool experimentation
LLM leaderboard methodology and code quality metrics
Human accountability and production code shipping decisions
Developer skill evolution and career relevance in AI era

Companies

Sonar

Code quality and security analysis company that released the State of Code Developer Survey and operates an LLM leade...

Stack Overflow

Referenced as a benchmark for developer surveys; Sonar aspires to join Stack Overflow and GitHub in the pantheon of a...

GitHub

Mentioned for its State of the Octoverse report as a leading developer survey alongside Stack Overflow

Red Hat

Chris Grams' early career employer where he conducted some of Red Hat's first developer research surveys

Tidelift

Company that sponsors open source maintainers; acquired by Sonar about a year before this episode

Sun Microsystems

Manish Kapoor's early career employer in enterprise software

Oracle

Manish Kapoor worked at Oracle before joining Sonar

OpenAI

LLM vendor whose GPT models (GPT-4, GPT-5.2) are evaluated in Sonar's LLM leaderboard and coding personality analysis

Anthropic

Creator of Claude models (Sonnet, Opus) evaluated in Sonar's leaderboard; provided free credits enabling developers t...

Google

Creator of Gemini models evaluated in Sonar's LLM leaderboard for code quality and security

Dept Agency

Software agency where Matt Merrill currently works as a software engineering leader

People

Chris Grams

Co-host discussing survey methodology, key findings on AI adoption rates, and the trust gap in AI-generated code

Manish Kapoor

Co-host discussing the 'great toil shift,' shadow IT risks, LLM coding personalities, and governance challenges in AI...

Matt Merrill

Host with 20+ years experience in backend development and cloud architecture; frames discussion from engineering lead...

Quotes

"96% of developers said they do not fully trust the code that's coming from AI. So you have this juxtaposition of 42% of code being written by AI going up within the next couple of years to 72%, but developers don't really trust it."

Chris Grams•Early in discussion

"Writing code is no longer a challenge. It's about what happens after the code is written and what happens in a world where agents are talking to each other."

Manish Kapoor•Mid-episode

"The most valuable skill is no longer knowing how to write code. That is pretty much a solved problem. It's more about understanding the code, making sure the code is correctly written by the agents."

Manish Kapoor•Closing advice

"If you're a developer who is trying to figure out how to stay ahead and keep your skills relevant, probably the best thing you can do right now is to gain the ability to manage and orchestrate and train agents."

Chris Grams•Closing advice

"Ensuring that there's a human who's willing to put their stamp on it and say, I'm willing to ship this into production and take all the risks that entails. That's the biggest challenge for 2026."

Chris Grams•Final takeaway

Full Transcript

AI coding tools have gone from novelty to core infrastructure in under three years. Today, many devs use AI daily, a substantial share of new code is AI-generated, and expectations for automation are rapidly increasing. Sonar is a company specializing in analysis of code quality and security, and they recently released a new survey, the State of Code Developer Survey. The survey provides a deep examination of how developers are using AI in real production environments, and where the real-world gaps and risks still exist. Chris Grams is the CVP of Corporate Marketing at Sonar, and Manesh Kapoor is the VP of Product Marketing and Developer Relations at Sonar. In this episode, they join Matt Merrill to discuss what the survey reveals about AI-assisted development, why 96% of developers still don't fully trust AI-generated code, how deterministic verification layers fit into agent-driven workflows, and what engineering leaders should prioritize as AI shifts from experimentation to production infrastructure. Matt Merrill is a software engineering leader with over 20 years of experience building and scaling software teams across enterprise and product-focused organizations. His background is in back-end development, cloud architecture, and distributed systems design. He currently architects and delivers software products and leads a team of engineers at Dept Agency. You can learn more about his work at code.theothermattm.com. I am here with a couple of folks from Sonar to talk about the State of Code Developer Survey Report. So before we get started, Manish and Chris, would you guys mind introducing yourselves and talking about your background and what you do at Sonar? Manish, you want to go first? Sure. Nice to meet you, Matt. And thanks for having us on the program. 
I'm Chris Grams, the VP of Corporate Marketing here at Sonar, which might immediately make people want to stop listening, except for the fact that I would say I'm also sort of our resident data and survey nerd. So I probably know the survey results coming out of this better than most. I've been in enterprise tech for a long time. I started out early in my career at Red Hat. I spent about a decade there and then went on to work with a bunch of other software companies in a consulting role and then was one of the early employees at Tidelift, which is a company that sponsors open source maintainers and was acquired by Sonar about a year ago. So I've been with Sonar in this role just over a year. Nice to meet you. Manish? Yeah, nice to meet you. Hi, Matt. Thanks for having us here. I'm Manish. I'm based in Austin. I've been with Sonar for about two and a half years, almost three years now. Like Chris, I have a long history in enterprise software. I started with Sun Microsystems and then I joined Oracle and here I am at Sonar. I've worn different hats in my life. I am a VP of product marketing and developer relations at Sonar right now, but I've worn the hat of a product manager, developer relations lead, and various other technical roles, including pre-sales. So I'm fairly hands-on, technical, looking forward to speaking with you today. Cool. And so how about Sonar itself? For anybody that's not familiar with Sonar, do you mind talking a little bit about what the company does and what products you offer? Yeah. So our main product is a product called SonarQube, which has been around for quite a long time and is used by over 7 million developers around the world.
Basically, you can think of us as the essential verification layer for code, whether it's AI generated or written by developers. We help ensure the quality and security of all of the code that's created by developers, by AI, and now increasingly by AI agents. So to give you a sense for the scale of Sonar, we analyze 750 billion lines of code every day. So we view our mission as helping organizations ensure the code that they're shipping into production is high quality, secure, well-maintained. Awesome. That's a great overview. Thank you. I read over the survey report and it was really interesting. My background is in engineering leadership, and prior to that, mostly backend engineering and things like that. My immediate question was, okay, another survey report. How does this one differ from the Stack Overflow one, which is pretty well known? And as I dove into it, I was like, wow, this is really unique. And I really enjoyed it. So if you could, what do you think that we can learn from this survey that you may not be able to learn from the Stack Overflow one? Well, first off, I would say it's nice getting mentioned in the same sentence as the Stack Overflow survey, because we aspire to be part of the pantheon of big developer surveys, the Stack Overflow one and the GitHub State of the Octoverse report, I think they call it. To us, those are the leading ones that help ensure that people have the information they need to make smart decisions as developers and development leaders. So if we are successful with this, we'll become part of that pantheon of great developer surveys. And I would say where we wanted to be additive is we were like, what unique perspective do we at Sonar have to bring? And I think it really comes down to the fact that we know code pretty well. Like I said earlier, we analyze 750 billion lines of code a day.
So earlier this year, we started this State of Code series, for which we did a lot of analysis, and hopefully we'll get a chance to talk more about some of the work we're doing on understanding the coding personalities of the leading LLMs. But we also analyzed code from the perspective of its maintainability and its security, and released a set of reports on that. What we wanted to do with this particular report is we said, okay, we've looked at code through the lens of how maintainable, how secure, how reliable the code is. We've looked at the code that LLMs are creating. Now we want to look at what is the perspective of the developers who are using these new coding tools every day? What do they think of the current state of the art? And so this is almost like the human side of Sonar's version of the state of code, going in and really understanding what people think of the world as it's changing extremely quickly right now and trying to get a sense for where we are. And one last point I would make on that is that we fielded this survey in October, I think, of last year, and the world's changed quite a bit even since then too. So as we go through this, I think we'll want to spend a little bit of time talking about the stats that we feel like are holding true. And I think Manish and I can give you a perspective on the things that we think have changed as well since we fielded the survey last fall. That will be good to talk about. I love what you said about the human side of it. That was what came through to me, that you really tell a really nice story with this data. And so I'm curious, can you talk a little bit about how you collected and analyzed it? Because the report itself is lovely. So again, I've been doing these for a long time. I actually started doing these sorts of developer surveys probably 20 plus years ago at Red Hat and did some of Red Hat's first ever research data.
And the thing I've always felt about good research is that, by the way, appreciate what you said about it tells a story well because that's exactly what we're going for is that we wanted to not come in. I always come into these projects from the point of view of don't give me data, give me knowledge. Like what is the takeaway or set of takeaways? And we also go into the design of the survey with a strong point of view on some hypotheses of things that we think may be the case. And we want to go test and see whether people agree with us or not, and whether that's their perspective as well, or some places where maybe we're wrong. And there were a couple of examples of places where our hypotheses were wrong in this too. But we come into it with a perspective on the stories that we want to tell. And then sometimes we come out of it with a whole new set of stories. And in this case, we had a combo of both. We had some data points that I think were validated by what we found and some that surprised us. I would love to hear about the ones that surprised you as we go along. So if you think of it, yeah, just definitely drop that. So, all right, we've talked enough. We've done enough meta talk. Let's actually get into the results and let's start with a bang. I think this question is to each of you and Chris, you can go first. What's your biggest takeaway right out of the gate? What was the thing that you think resonated with you all the most? Well, I think just at the top line, there is a couple of data points that at the time when we first saw the data were staggering. So for example, 72% of developers who have tried AI are already using it every day. I wouldn't be surprised if now that number is even higher than 72. But last fall, when we first looked at the data, 72% seemed pretty amazing and probably a validation of what we expected to be true. 
But in addition to that, we asked developers to tell us what percent of the code that they're generating or that they're writing today is AI generated and how we asked them both about their thoughts on that today and then their thoughts about that in the future, like what they would expect it to be in a couple of years. And what we heard is that 42% of developer code is AI generated or assisted today already, 42%, which is kind of crazy. But by 2027, developers expect that number to jump to about 65%. So as you look at that, again, this is one of those things where we're seeing this even last January when I started at Sonar, I think most developers were still looking at the code that was coming out of these AI tools with a little bit of a skeptical eye, like was it doing a really good job? But the fact that 42% of code is AI generated as of last fall already is pretty interesting. But here's the other data point that I would add to that, is when we ask developers, how much do you trust that that code that AI generates is factually correct? 96% of developers said they do not fully trust the code that's coming from AI. So you have this juxtaposition of 42% of code being written by AI going up within the next couple of years to 72%, but developers don't really trust it. That creates like a verification gap or a trust gap that needs to be solved. So that's probably the top takeaway, I would say from the survey. Right. And that rings true with me based on my experience in my job. And I'm hearing more and more about agentic code reviews and kind of almost, this isn't the right way to put it, but pitting agents against each other to verify and things like that. So that is a very interesting finding. Manish, what was your biggest takeaway? Yeah, I think I agree with what Chris said. Mostly I was surprised. The one thing I'm surprised about is how quickly AI coding has been adopted as a use case. 
If you look back, it was just about two and a half years back when the first GPT versions came out. And here we are talking about more than 72% developers already using it every day. It's a daily affair for them, right? And this number is from last quarter. I'm pretty sure it's gone up even higher now. Basically, 90% of the developers are using it every day. So it's just how quickly this use case has taken off and how much it is being used across the industry. What I've also seen is while AI is speeding up code generation, it's slowing down everything that happens right after the generation part of it, which is a lot of software engineering work, which still needs to happen. Code generation is the first part of the equation, right? Next comes reviewing the code, verifying the code, basically debugging the code, integration testing, long-term maintenance of the code. So I think those things we are seeing, they have not caught up yet in terms of the speed at which code is being produced. So agents will even be a force multiplier. If you look at coding tools and then agents come around, just like what you mentioned, mat, when there will be a series of agents, a legion of agents talking to each other. It's basically how quickly things will evolve. We have to still see how it pans out. So I'm just waiting for what happens next. So I think for me, writing code is no longer a challenge. It's about what happens after the code is written and what happens in a world where agents are talking to each other and developers are offline or not so much involved in the day-to-day coding process and the SDLC process. Yeah. I am wondering myself with bated breath, though I have to say I'm pretty excited. Every day I get more and more excited about what I see. Speaking of that fast adoption, one thing that I'm encountering in my day-to-day life at work is that developers want to use these tools. First of all, they're getting pressure. 
Second of all, they're starting to become more interested in it. People are using personal accounts for work things to be able to do this just because they want to move faster, they want to try. And so you did have some findings there in the report that were pretty interesting. So do you want to share some information about what you found there? Yeah, we were surprised by this. There's a significant amount of shadow AI which is happening. By shadow AI here, I mean, we found that almost 35% of the developers are using their personal accounts to access AI tools rather than using the corporate sanctioned ones, right? I think why this is happening is because you are a developer, Matt, I've come from a technical background. So I've been sort of a developer, but not truly a developer, but written enough code to be deemed a developer. So developers by nature, they are builders and tinkerers, right? We all want to tinker with the latest and the greatest technology. We want to explore the most modern tools that are out there. So we want to stay at the cutting edge of things. So I think that is part of the reason developers want to go outside of the sanctioned tools as this world is changing so fast. And with agents coming around, it goes even further where developers are beginning to use agents for scanning the repos, for generating migration scripts, for refactoring modules for the old code, and all of those things are happening every single day. But this also means that the code, the prompts, the data, the context information they are sending to these third-party tools, the shadow IT tools, or non-sanctioned tools, you're basically taking a risk with the IP and data privacy of your organization when you start using those tools. 
So you do need governance, and the governance landscape is unfortunately not very well defined as of now. It is going to get very complicated with the agents coming along and the legion of agents working together. It's going to become challenging for some time, but I'm sure we'll solve it like we have solved other problems. One thing which remains common in terms of governance is anything that is being generated, whether you use corporate sanctioned tools or whether you use your own personal tools, the code that is being produced has to be verified. The next steps have to still take place. You have to do even more due diligence in terms of verification and assurance that the code is reliable and production ready. Just today, I had one of our privacy council folks giving us information on, if you're going to use these tools, make sure that these are in contracts and things like that. I work for a software agency and it definitely feels like the tail is wagging the dog, right? Like everybody expects you to use it, but there definitely is not that forethought yet in terms of how that gets governed. So that's awesome. I agree. Chris, is there anything you want to add about those findings? The only other thing I might add is just in terms of the number of tools that people are using internally, it seems like people are testing lots of different things, right? So I think we saw that the average individual used four different tools. Oh, wow. And people on either side of the curve went higher or lower than that. But if you think about that too, what it says is that, again, as of when we fielded the survey last fall, we hadn't really settled on who the winner is on the AI tool side. Although you could make an argument that maybe Claude is making a pretty strong play for it here in the last month or two, but people are still testing a lot of different things. That rings true for me too. Yeah.
I mean, even up until November it was kind of like, I'm not sure which, but now that I've seen what Claude does, it's pretty incredible. And also companies can't react that fast to these changes and people are bound to just sign up for something and try it. So it's no wonder it's happening. Yeah. Like if your company had contracted to use a tool that now is the third or fourth best tool in the market and you have to go through the whole enterprise procurement cycle to get another tool versus being able to do your job 10 times faster and do something on the side, you can see why it's happening, even though it's maybe not ideal from an enterprise risk avoidance standpoint. Yes, definitely. Every AI team eventually hits the same wall. The models are solid, the infra is solid, but the data coming in is hours old because the pipeline is batch when it should be streaming and nobody's had time to fix it. That's not a modeling problem. That's a pipeline problem. Estuary gives you CDC, batch, and streaming in one platform. 200 plus connectors, live in hours, not weeks. Your AI is only as good as your pipeline. estuary.dev Today's episode of Software Engineering Daily is brought to you by Unblocked. Your coding agents have access to your codebase. Maybe you've even connected other tools via MCPs. But access doesn't mean context. Agents can't reason across MCPs. They don't know your architectural decisions, your team's patterns, or why the API was shaped the way it is. So agents look in the wrong place and deliver bad outputs. Then you spend time correcting, turn after turn. Unblocked is the context layer your agents are missing. It synthesizes your PRs, docs, Slack, and tickets into organizational context that agents actually understand, so they make better plans, write higher quality code, use fewer tokens, and require fewer correction loops. If you are running Claude Code, Cursor, or any agentic workflow, Unblocked is worth a look.
Get a free 3-week trial at getunblocked.com.se daily. In mobile application security, good enough is a risk. GuardSquare uses advanced, multi-layered code-hardening techniques and automated runtime application self-protection and mobile application security testing, combined with real-time threat monitoring to deliver the highest level of mobile app security. Discover how GuardSquare brings all these together to provide mobile app security for your Android and iOS apps without compromise at www.guardsquare.com. So let's pivot a little bit. One of my favorite things that I saw in the report, there's two favorite things, but one of them is this concept of the great toil shift. Can you talk a little bit about what that is and what you found there? Yeah, Matt, this is one of the ones we were saying earlier, some things that surprised us is, so we expected to go in and have people tell us that AI had reduced a lot of the toil work that they dealt with. And examples of toil work would be things like writing documentation, writing tests, stuff like that, that just take up an enormous amount of time. And that is sort of like the bureaucracy of writing code. So when we directly ask people the question, do you feel like AI has reduced the toil work that you do on a regular basis, the vast majority of people said yes. 75% of people said yes, AI has reduced toil work. But the answer was actually a lot more complicated because when we later asked them more questions and based on whether you were using AI a lot in your job, like whether you use it every day or whether you use it less, the people who are using it less still had a lot of those traditional toil work tasks, like the ones I just mentioned. But the developers who are using AI every day had a whole new set of toil tasks that were very different. And the kicker is they both represented that they were spending about the same amount of time on toil work. It was different types of toil work. 
So the new toil, as opposed to the old writing documentation, AI writes documentation great. So that just took that off the table. And then all of a sudden, people have this new task, which is what Manish was talking about earlier of now they have to verify the quality and security of all this code that's being generated at hyperspeed. And that becomes the new toil work is the verification process getting there. And actually, because AI is never going to be held accountable for the quality of code that it produces, the human is going to be held accountable for the quality of code it produces. And so to me, this is like one of the biggest challenges when it comes to toil work. If you know that the person who's being held accountable, all of a sudden going and checking all this code that's being written by the robots is you've got to do it. And it's not going to be the most pleasurable work necessarily, but you have to do it because it's just like you wrote that code yourself as far as accountability goes. It's so interesting too. And I've used this analogy with people that I've talked to and it might not be like quite an apt one, but it works for me, which is when spreadsheets came along, accountants didn't go away, right? It just changed the type of work they needed to do. And I feel like it's exactly the same thing, but I love the naming of it. The toil shift, probably going to use it. Yeah. 38% of the developers say that reviewing code written by AI is harder than reviewing human written code. So that's another thing, right? Yes. And going back to the reviewing part of it. So that was something that stood out to me as well. Yeah. It feels harder to me to put yourself in that context and follow the thread through. So yeah, that makes sense. 
We've also found that sometimes there's a little bit of a needle in a haystack challenge when it comes to AI-generated code too. As these models are getting better and better at writing more performant code and even more secure and higher quality code, sometimes the issues in that code just get harder to find. So for you as an actual human going into this, not having written it yourself or having somebody who you know write it and reviewing your peer's code, there may be fewer issues, but the issues that are there can actually be more pernicious because they're harder to find. Yes, definitely makes sense. So as a company with a static analysis tool as one of your main products, can you talk about some creative ways that you see your customers using static analysis to kind of combat some of these things? Yeah. So I think for the last 17 years, folks have used us and considered us to be the de facto code quality standard. We actually do much more than code quality. Our analysis engine reviews your code not only for code quality, but also for code security, reliability, maintainability, complexity, and now even architecture. We also look at the architecture of your code base to see how quickly it is drifting from a good architecture to a bad architecture. So yeah, our customers have been using us in quite a few different ways. In the new agentic world, we have added support for all the modern AI-native IDEs, whether it's Windsurf, Cursor, GitHub Copilot, any of those IDEs that you're using, or any of the CLIs. CLIs, as you know, Matt, are becoming very popular now, whether it's Gemini CLI or Codex or Claude Code or any of these things. Basically, we have support for all of these.
And when I say support for all of these tools, I mean we are available to them as a deterministic verification layer, an independent, deterministic one. With AI writing code and AI reviewing code, you will basically end up in situations where the AI says whatever it has written is the right code. It may not catch everything, because it is trained on the same data set for writing the code and for reviewing the code. So we have a deterministic way of reviewing the code, and it is embedded right into the modern SDLC where the AI is writing code, whether it's the IDEs, the CLIs, or even the pull request, where agents are sending us pull requests for review. So we are embedded tightly into the SDLC. We also have an MCP server, the SonarQube MCP server, which I'm seeing quite a few of our large customers use. It's basically a gateway into the code analysis of SonarQube for the agents, talking in the protocol that agents like to use, right? And that interface is available for agentic IDEs, CLIs, and whatnot. In the product, apart from analyzing the code, we are also looking at how we can improve our detection engine to detect AI-specific issues. So we have a facility to add custom rules. I've seen some customers looking at adding custom rules within the product to detect AI-specific patterns. So they can do that. We have added several rules to detect AI-specific issues like prompt injection attacks and rules file backdoor attacks. So these are AI-specific issues that AI introduces, which a typical human developer will not introduce. What was the last one that you said there? Rules file backdoor attack. It's a vector which is referred to as a rules file backdoor attack. Basically, it concerns the configuration files, the .mdc files and the .md files, the rules files that are used by these coding agents and IDEs. You can have hidden Unicode characters in those.
Hidden Unicode characters are hard to detect by AI. So we have rules to detect those kinds of issues, which are introduced through your configuration files, such as a rules file. These are called rules file backdoor attacks. And for LLM prompt injection attacks, we also have a rule that can detect issues pertaining to prompt injection. So these are things we have added in addition to what we already do, specific issues tied to LLMs. So that backdoor attack, that would be: you copy a skills file for Claude from somewhere else, and you unknowingly copy a bunch of Unicode characters that poison that prompt, or something like that. Is that right? Exactly. That's really interesting. Okay, note to self. Exactly, a bad actor can introduce some hidden Unicode characters in the file, right? So that is really interesting. So as I put this together, if I'm writing a skill or describing what my CI/CD process would be like, I can use your MCP server or other integrations to basically say, as part of my checks, I want to run static analysis and report out on the results. And if the results look like XYZ, fail the build or whatever it might be. Is that correct? That's absolutely correct. We have quality gates, and you can define the policy conditions for when your build should fail or pass. There will be applications that you build which are internal facing, which are not external, not sensitive, and not meant for production. It's for a small team. 
You may not want a very strict enforcement policy for that one, whereas if you're writing a banking application or a medical application, where the cost of that application going down or being breached is very, very high, you would apply a stricter set of policy conditions. The gate will be set at a higher level, so you fail the gate even if there's a single security bug in it. That's the condition. Whereas if you have a low-priority bug and you want to let it pass for an internal application which is not mission critical, you can do that. So you can define your gates and policy conditions based on your needs, and that's a use case that our customers very often use as well. Very cool. Thank you. Yeah, one thing I find particularly interesting about that is that a couple of months ago, we were talking to a very prominent technology analyst about the process for code review in the AI world. And one of the things they said is, look, our recommendation as a leading analyst firm is that people use the same review process for AI-generated code that they used for human-written code. If you had a good process in place for your human-generated code, with things like the quality gates that Manish was talking about, you can actually use the same process effectively for AI-generated code, because it's still ultimately just code. That really resonated with me. And you had mentioned AI code review tools at the beginning. That's one of the things I'm scratching my head about a little bit: it sometimes seems like people all of a sudden feel that, just because code is written by AI, there needs to be a new AI review process for reviewing that code. 
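The quality gate idea discussed above, a strict policy for a banking or medical application and a looser one for a small internal tool, can be sketched in a few lines of Python. This is a hypothetical illustration only: `GatePolicy`, `evaluate_gate`, the field names, and the thresholds are all assumptions for the example and do not reflect SonarQube's actual configuration or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Hypothetical policy: the maximum issue counts a build may contain."""
    name: str
    max_security_issues: int
    max_low_priority_bugs: int


# Assumed example policies: a critical app tolerates nothing,
# a small internal tool tolerates a few low-priority bugs.
STRICT = GatePolicy("strict", max_security_issues=0, max_low_priority_bugs=0)
RELAXED = GatePolicy("relaxed", max_security_issues=0, max_low_priority_bugs=5)


def evaluate_gate(policy, security_issues, low_priority_bugs):
    """Return (passed, failures) for the given analysis counts."""
    failures = []
    if security_issues > policy.max_security_issues:
        failures.append(
            f"{security_issues} security issue(s), limit is {policy.max_security_issues}"
        )
    if low_priority_bugs > policy.max_low_priority_bugs:
        failures.append(
            f"{low_priority_bugs} low-priority bug(s), limit is {policy.max_low_priority_bugs}"
        )
    return (not failures, failures)


if __name__ == "__main__":
    # The same analysis result evaluated against both policies.
    for policy in (STRICT, RELAXED):
        passed, failures = evaluate_gate(policy, security_issues=0, low_priority_bugs=1)
        print(policy.name, "PASS" if passed else "FAIL", failures)
```

In a CI job, the boolean result would typically decide whether the build step exits with a non-zero status; the counts themselves would come from whatever analysis tool the pipeline runs.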
And maybe in some cases that's more helpful, but some of the standard, tried-and-true static analysis processes for reviewing code actually work with AI-generated code as well and can give you repeatable results over and over. Yeah, I think one of the key things here is the false positives. I've used some AI code review tools, and obviously I use our own product all the time. The false positives in certain situations can be a lot higher with AI code reviews. So I think you need to have the best of both worlds, is my thinking. Whatever the ideal use case is for each type of technology, we should use it for that. So yes, absolutely, what Chris said is right: our customers want to use a deterministic tool, one that is always consistent in its results and has low false positives, as a verification layer. And some use cases, exactly to the point that you and Chris mentioned earlier, like writing documentation or writing PR descriptions, are better suited for LLMs for sure. LLMs can do a much better job in that, and we should exploit the strength of LLMs for those kinds of use cases. Yeah. So one of the things that you guys mention and link off to in this state of code developer survey is a report that you have on LLM coding personalities. I thought that report was fascinating, and I'm hoping that maybe you can talk a little bit about it and what you guys are finding. About a year back, we started working on evaluating LLMs for code quality, code security, and code complexity. The history here is that we have benchmarks. There are industry-standard benchmarks that almost every LLM vendor reports, whether it's HumanEval or any of the other standard benchmarks for testing the coding performance of LLMs. While those benchmarks are very good, relevant, and the starting point, they're half the equation, we think. Why do I say that? 
I think benchmarks basically check LLM output for correctness. They check whether the model is able to write a particular algorithm or solve a particular problem, and whether it's correctly solved or not. What they don't check is how each LLM produces the code for solving that algorithm or challenge, right? The benchmark problems. So we have our own evaluation criteria; it's proprietary. We evaluate 4,400 coding problems. We give them to all LLMs and evaluate how they perform. And these are problems which the LLMs are not aware of, because they're not part of the standard benchmarks. So these are unknown problems, new issues for them, right? And when they run those problems and produce code, we score them just like any other benchmark, in terms of whether they have a good enough pass rate and whether they produce functionally correct code. But we take it a step further. We look at how many bugs the model produced per thousand lines, per 10,000 lines, or per million lines of code. How many security issues did it produce? How much complexity did it produce? What does the cognitive complexity look like? What does the cyclomatic complexity look like for the code they generate? So benchmarks are a good starting point, but we took it to the next level because we are all about code health and reviewing code. Our whole business is built on that, right? So we took the analysis further to figure out what these models are like. And based on our findings, we assigned personalities to the LLMs, the first six or seven LLMs that we did. We found that they had unique personality traits. We call them archetypes of personalities. 
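Two of the measurements mentioned above, issue density per thousand lines and cyclomatic complexity, can be illustrated with a short Python sketch. This is not Sonar's proprietary evaluation: the function names are made up for the example, the complexity count is a rough approximation (one plus the number of decision points) rather than any tool's exact rule set, and it analyzes Python source even though the leaderboard discussed here appears to be based on Java.

```python
import ast


def issues_per_kloc(issue_count, lines_of_code):
    """Issue density: issues per thousand lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return issue_count * 1000 / lines_of_code


# Node types counted as one decision point each (an approximation).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def approx_cyclomatic_complexity(source):
    """Approximate cyclomatic complexity of Python source: 1 + decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # 'a and b and c' adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


if __name__ == "__main__":
    snippet = (
        "def f(x):\n"
        "    if x > 0 and x < 10:\n"
        "        return 1\n"
        "    for i in range(x):\n"
        "        pass\n"
        "    return 0\n"
    )
    print("complexity:", approx_cyclomatic_complexity(snippet))
    print("issues per KLOC:", issues_per_kloc(12, 4000))
```

Normalizing by lines of code is what lets a verbose model and a terse model be compared on the same scale, which is the point the speakers make about looking beyond pass rate.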
Some of them were writing really good code, and the functional performance was superb, but the cognitive complexity and cyclomatic complexity were way too high. For the others, cyclomatic complexity and cognitive complexity were very low; they were writing fewer lines of code, but they were also correct. So we assigned some personalities to these LLMs. It's a matter of how each LLM produces code: how abstract it is, how over-generalized the code is, how it does error handling, how much duplicate logic it has, what kind of code smells it has, and what kind of security issues it has. So we did a deep, thorough analysis of each and every type of problem that they create. We started with the personalities report, but we have now evolved it into an LLM leaderboard that you can find on sonar.com slash leaderboard, where we have about 35 models as of today, LLMs that we have evaluated for security, reliability, maintainability, cyclomatic complexity, cognitive complexity, and issue density. All of those things are noted there. Not only that, if a model is producing 10 security bugs, we also tell you what types of security bugs it produces, whether it's path traversal issues or leaked secrets. So we have a detailed report. Perhaps we don't have enough time to go through it, but I encourage you to take a look at it at sonar.com slash leaderboard. You can dig into all 35 LLMs and evaluate them in terms of the traits that they have. You mentioned the personalities. Can you talk a little bit about them? What's a quirky or interesting example of one of the tools and how it was categorized? So when we first did this last summer, we were really focused on that idea of the personalities themselves for each of the models. 
But what increasingly became clear is that the models were evolving so quickly, and new models were coming out so often, that we switched over to the leaderboard and away from the personalities. The personalities were changing so fast that it was hard to just assign something to Claude Code. I forget even what we said Claude Code's personality was last summer, but its personality now is best coding expert in the world. So I think the biggest thing about the personalities, like Manish was saying, is that we found as the models got more performant, they also got more verbose, writing more and more lines of code, which added to the cognitive complexity as well. We found that to be the case all the way up until probably November of last year, when all of a sudden some of the models, and you'll see this as you look through the results on the leaderboard, became both performant and not more complex. They're decreasing the number of issues. So where it was linear for a while there last year, where the smarter the models got in terms of performance, the more lines of code they would write and the more complex it got, it's getting a little more nuanced than that now. And some of these models are getting really, really good. So I would say you can see the cream rising to the top on some of these. The top personalities are all getting to the point where they're expert coders. I don't know, Manish, what would you add to that? Yeah, I think you're right. We had the personalities defined for the first six models that we evaluated, back in the day when we evaluated GPT-5, Claude Sonnet 4, and Claude Sonnet 3.7. We assigned personalities to those. So, for example, we picked a relatively small language model like OpenCoder, with 8 billion parameters. And we picked a large model like GPT-5. 
So we looked at both ends. And based on that, we created some personalities. OpenCoder, a relatively small language model, we called a rapid prototyper. Why did we call it a rapid prototyper? Because it was solving problems reasonably well, using fewer lines of code, but it was not thorough in terms of the number of issues that it was introducing. It would prototype it. A prototype does have bugs, but it can be a quick and efficient way to test a proof of concept. So we called it a rapid prototyper. Whereas Claude Sonnet models were like a senior architect. They took a lot of things into account when writing the code, like the scalability of the application, how many users there will be, performance, and all of those criteria. So we called it a senior architect. So we had given names to the first six models, but now that we have over 35 models, it's hard to give a name to all of them. We are moving away from giving the models names or personalities as they're getting better and better. It was a lot of fun, though. It was endearing to anthropomorphize these LLMs. That is a lot of fun. And it's a nice way to remember them. But now we're getting into... So cattle, not pets, right? We're getting into that territory. Yeah. For anybody listening to this, I'm looking at this leaderboard, and Opus 4.5 Thinking is slightly ahead of Opus 4.6 Thinking, which is really interesting. But then when you look at it from a security point of view, which is security vulnerabilities and issues per million lines of code, GPT-5.2 High is at the top. For reliability, which is bug severity and issues per million lines of code, it's Gemini 3 Pro High. It's fascinating to me that the results differ across all those aspects. I think one thing that stands out to me too is that this was done specifically with Java. That's what it looks like. 
So there is that aspect to it as well: these models could be trained differently on different languages. But still very fascinating. Anything else you guys want to add on the leaderboard or the personalities before we move off that topic? I think if any of your listeners want to get a model tested, there's a form in there, and we are happy to test it for them. Oh, cool. We keep getting requests, so we try to keep up as much as we can. But with a couple of models coming out every couple of weeks, it's hard. So if there's something missing, we can look at it if there's a huge amount of interest in that. How long does one of those benchmarks take to run, just out of morbid curiosity? It seems like it would take a while, but maybe not? No. Initially, when we started doing it, it did take a while, but now we have made it a framework which can evaluate models rapidly. The bottleneck for us in the last few tests was that when Opus 4.6 came out, there were a whole lot of API requests being made to the model, and it was flooded with requests. So it was not fast enough to respond, or sometimes our tests would time out, and we had to restart the test. So you DDoS'd Anthropic, basically. No, I'm just joking. The universe DDoS'd. That's awesome. The big takeaway from our work on this over the last six months, I would say, is that as you're evaluating which LLMs to use, don't look at performance alone. Look at it through a more holistic lens: how verbose is the code that's being written, and how many security issues are being created, not just how well the model performs at completing a coding test. I think there's more nuance to it than that. 
And for some of the models that perform very well, you have to take into account all of the other things, like the cognitive complexity and the verbosity of the code. Even the cost of running the models is obviously a very big thing too. Not every model is created equal in terms of the cost of the tokens that you have to pay for. So you have to take all those things into account, not just the sheer performance aspect, which I think is where a lot of the focus had been previously in evaluating models. And the reasoning, in addition to cost. You can go to different reasoning modes. Each model has two to four reasoning modes. The higher the reasoning, the higher the cost and the longer it takes to solve the problem. But it's more thorough then. And the skeptic in me is also thinking about how steeply these prices are going to increase, right? Like, are they just trying to get people on the hook to be the leader, and things like that? So that is definitely on my mind as I think about what my organization is doing. All right, let's pivot over to years of experience and how that affected survey results. I've been doing this a long time, 20-plus years, and I thought that was really, really interesting. So can you talk about how the perception of these tools comes into play with years of experience? Yeah, I can start, and then maybe you can go into a little bit more detail. This was another one of the results that was fairly surprising for us: there's a big gap in how people at different levels of experience use AI. Junior developers told us that AI makes them 40% more productive, but then 66% of them also admit that the code it writes looks correct but is actually broken. So they're more ready to roll in and start writing the code, but then they're also sort of scratching their heads, not sure whether they can actually trust the results. 
Now, senior developers were a little bit more measured. I would say they're using it in different ways. Again, this is data from last fall, but 65% of them said they were using it mostly to understand old, complex code or write documentation, to do things where they're cleaning up the past, and to check the accuracy of things, versus the junior developers, who are ready to roll and let the AI do all the coding. Now, again, I would asterisk this by saying that I think a lot has changed recently in terms of the quality of the code that's created. If you're following the conversation about this on Twitter, there's sort of a general consensus, it seems, that around mid-December, maybe when everybody was taking a holiday break and got a chance to play around with some of these new models, people realized that these things had gotten really good. And so even some of the more senior developers were starting to do more sophisticated things than they were doing before. But I would say that's the big gap: junior developers are ready to jump in and try the tools, and senior developers are maybe trying it in more measured ways, because they're a little bit more jaded. They know the risks of low-quality code getting into production and how it's going to hurt them later. So I would say that's probably the key difference, and maybe a little different than what we expected. But Manish, anything else you'd like to add there? No, I think you covered it well, so I don't have anything to add there. Yeah, senior developers use AI tools as reasoning assistants. They understand the code that is being produced, and they try to make sense of it. There's more of a trust factor when it comes to junior developers. They jump right into it and develop. And that's just the nature of experience, right? 
So an experienced developer wants to question everything or look at all angles, whereas the newer, younger generation of developers is more trusting of the new technology. It's got to be really scary right now to be a junior developer, too, and see that the thing you spent maybe years learning how to do is something a machine can go and do. And it's the higher-order engineering tasks, like designing the architecture of an application, that they maybe don't have as much experience with, that are now the things separating us from the machines, right? So for senior developers, I think it's a really interesting time, especially if they can figure out how to turn into orchestrators of AI. Manish was talking about the idea of legions of agents and how you can orchestrate them. That's another conversation on X slash Twitter these days: people playing around with how to build their own army of agents, out there doing individual tasks, and who's doing the best job at coordinating all the agents working together. But I think increasingly, junior developers are going to have to move beyond their coding skills and figure out some of those upstream skills in order to stay relevant. One of the things that the report pointed out, and I have a note here, is that junior developers report higher job satisfaction related to the use of AI coding tools. And that makes intuitive sense based on what you guys are saying, too. 
However, about what you said: I'm not on Twitter or X, but what you said about using the holiday break, and also, you know, Anthropic gave you those free credits to do exactly that. It's exactly what I did, and I walked away from it with a completely changed opinion about how effective it was. And I've been doing this a long time. I found it absolutely fascinating that it sounds like I was not the only one who had that experience. You're not alone. You're not alone. We're seeing this, I mean, it feels like every day I see some blog post or something where a senior or well-established engineer with a great reputation says, hey, if you're not doing this and you're not starting to build your own team of agents to do the work for you... Like, I've seen several people say, I don't write a single line of code by hand anymore. Most of my work is the architecture-level work, or training individual agents to do particular tasks, you know, like giving agents each their own tasks so that they can work with each other and review each other's work. And it's really fascinating how quickly that's moved. I mean, what is it, February 17th here as we record this today, and a lot of this has just happened in the last month and a half. Yeah, it's truly astounding. You mentioned that passion, or enthusiasm, excuse me, that junior developers have for these tools. I honestly walked away from it being much more enthusiastic. We talked about differences from October to now, and I'm really curious if that's going to change with senior developers. Have you guys seen anything like that since October? Yeah, we're actually thinking about doing another pulse version of this survey, just because things have changed so quickly. We feel like we need to get out there in the field and do a little bit more research so we can compare, now that we have a baseline, and see what's changed. So stay tuned. 
If we're able to make that happen, we'll be happy to share the additional data on it too. But we have, again, like we walked into it with hypotheses last fall, I have some new hypotheses as well that I want to test. And a lot of it is around how much of this, because sometimes there's that conversation on Twitter slash X, which good for you for staying off of that and staying out of that world. But some part of me wonders whether the people that are having that conversation are like way, way out on the cutting edge or how much of the sort of mainstream of enterprise engineering groups are already harnessing agents. And we did have some data on that as of last fall that we saw a pretty significant number of organizations had already been playing around with agents. But again, those numbers, I think just in the last month and a half, it probably changed, but I think we need to test it and see for sure. From my experience working with larger clients, it's ticked up in the past six months. It's ticked up quite a bit. Yeah, it's really interesting. Another thing that I noticed that was really interesting is the use of AI in Greenfield versus Brownfield projects. So Greenfield as in new code, Brownfield as in legacy code. What did the report say in terms of clues as to where it's used most effectively and where people thought it was used most effectively? Yeah, so AI is best for projects which are starting from scratch is what we found. 90% of the developers use it for new projects, and it's much less effective when you have to work with an existing code base, particularly if programming languages used in that project are not commonly used these days. I think if you look at LLMs, they are generally very good with other flagship languages like Python, JavaScript, TypeScript, Java and all. 
But if you go back to some of the legacy applications and old code bases, and I won't name languages or frameworks there, they're not so good at some of those legacy projects. Only 43% of the developers find AI effective in updating and optimizing code that is already in use, particularly for older frameworks and older languages. So that's one of the observations. I think it also comes down to correctness. AI excels in greenfield because the surface area is small when you start, and correctness is generally high with the newer state-of-the-art LLMs. With brownfield applications, they sometimes struggle with legacy API assumptions, and there are some non-obvious couplings in those applications, or implicit rules that are not very well documented or are hard to figure out just by looking at the code base. So that's less of a use case. Currently, LLMs are being utilized less for brownfield applications. That makes intuitive sense to me as well. This is something I'm curious about but haven't tested: the use of something like an AGENTS.md file, or something analogous to it, that can help give hints for legacy applications. Have you heard about anything like that since October? Yeah, I have not heard of it specifically for legacy applications, but that does make sense. It's logical. Claude Code has the concept of rules, skills, and hooks. Hooks can enforce certain guarantees. Rules can give you some constraints. With those things in place, I think there's definitely a chance that we might see an uptick in usage of AI in the older applications, the brownfield applications, as well. I'll be really curious to see. One of the clients we're working with has a 20-year-old system that's mostly written in C++, and the tech lead on it is struggling with it. I was like, you should run an experiment. 
See if you can embed some of the tribal knowledge that is in that app in some of these agents files and see. So for the listeners, we'll report back on that, as well as what you find in your pulse check. Yeah, I think context is very important, as you said. The context, meaning the files and information that you give to the agent, should help with brownfield applications. Just real quick, I am curious. You had respondents for this all over the world. Did you see any interesting geographic patterns at all? Nothing that we wanted to report out on. I think everybody is living in the same world, basically. We looked to see whether there was anything we could tease out that was statistically significant, and there was not enough to make it into the top findings of the report. I think we're all kind of going through this together, regardless of where we live in the world. That makes sense. I was just morbidly curious. All right, let's start wrapping it up. I think I learned a ton from the survey, but for me, it's been a while since I've been a boots-on-the-ground developer. I've been in leadership for a while. I still code, but I'm in leadership. So I put myself in the shoes of a developer who's having executives or managers push these tools. Sometimes those asks are grounded and sometimes they're not. So for anybody listening who might be given what they think is an unrealistic goal for using AI, what's your advice based on what you saw? I think the big thing would be to tease out whether your organization is focused on the speed of writing code using AI or the speed of shipping code using AI. The latter, shipping code using AI, is harder. And that's why you see many organizations not really getting the promised benefits of AI: the promised benefit they're focused on is maybe that you can write code a lot faster. 
But as Manish was saying earlier, the name of the game is being able to verify the quality and security of that code before you ship it, where you're not really gaining as much speed as you think you would. So I think it's an education process for developers who are working at organizations that already get that. Like I was saying earlier, it may be that if you had a really robust code review process in the age of developer-written code, you're actually probably set up pretty well to succeed in the AI-generated code world, where you have automated code review in place and tools like SonarQube, or whatever you were using before, to check the quality and security of code. So the big advice is to make sure that your organization has a really good process for verifying the quality and security of AI-generated code. Or if it doesn't, do your part to educate the leaders on your team that just because AI can write code really fast doesn't mean it's good code that you want in production. It could increase your organization's risk, or create spaghetti code that's hard to maintain, or bring some of the other challenges we talked about. I don't know, Manish, anything you would add? No, I think you said it right. Don't lose sight of quality. Ship it faster, but don't lose sight of code quality, software quality, and application quality. Don't compromise on that. "Be careful what you ask for" is coming to mind. And also, you know, what's coming to mind is Lucille Ball on the chocolate line, eating the chocolates. It's like, if they start coming out faster, you've got to, yeah. Exactly, exactly. If I'm a developer, what do you think the most important takeaway from this survey is? Yeah, so for developers, the most valuable skill is no longer knowing how to write code. That is pretty much a solved problem, writing code. 
It's more about understanding the code, making sure the code is correctly written by the agents or the tools that you're using, and making sure it's being reviewed, with guardrails in place. Ultimately, developers are still accountable, irrespective of who writes the code. Like Chris said, you want to do the right things. You want to put some guardrails in place. You want to validate the code. You want to make sure the code is written well. It's not so much about learning a new programming language anymore. Anything you want to add, Chris? Maybe just repeating what I said earlier: if you're a developer who is trying to figure out how to stay ahead and keep your skills relevant, probably the best thing you can do right now is to gain the ability to manage, orchestrate, and train agents. You want to do that really well, harness the skills of these leading-edge LLM tools, and stay on top of the knowledge and be curious about learning, because these things are evolving so fast. What you learned during your Christmas break may not even be the current state of the art anymore. So it's about staying on top of that and spending at least a few hours a week just learning and testing and trying new things. I read something today where somebody said this may be the most important year of your career, because things are moving so quickly right now that if you're not paying attention and you're not staying curious, you can quickly fall behind and have a hard time catching up. So one side of me was like, oh, am I doing that? And I was like, I think I kind of am, but it was a little bit of a call to action for me too, to make sure that in that hour at the end of the day, when you're winding down, you try a couple of new things every day and just see: did you get better results today than you did yesterday? And increasingly, I'm finding you do. 
Yeah. I'm hearing two things which resonate as true with me. It was Manish saying, you know, you can't lose sight of the basics. The code needs to be bug-free. You've got to ensure quality. But at the same time, you've got to kind of dance on the bleeding edge in order to stay relevant, quite frankly. Yeah. And then last question: if I'm a development leader or an engineering leader, what do you think the most important takeaway is? You can't ignore the trust problem in code. AI is getting better, but that stat I mentioned at the beginning is still the big takeaway for me: 96% of developers still don't fully trust the quality of AI-generated code. Not necessarily because it isn't good, because in many cases the code is really good and continuing to get better. It's because they know AI isn't going to take the fall when the code fails or there's a security issue. So if you're a leader, you have to ensure you still have human accountability for the code that gets shipped into production. And so you have to build the right systems and processes to ensure that you're able to verify the quality and security of the code before it ships. Don't just close your eyes and ship AI-generated code, unless you're in some sort of leading-edge, vibe-coded startup where you can risk doing that. If you have actual customer data and customer information and things like that, it probably makes sense to go and really figure out how to trust the code that you're shipping. And that to me is the 2026 problem of AI-generated code. In 2025, the problem was how do we generate even more code? And that's, like Manish said, a solved problem. Now we can generate pretty good, high-quality code, and it's continuing to get better. But ensuring that there's a human who's willing to put their stamp on it and say, I'm willing to ship this into production and take all the risks that entails, that's the biggest challenge for 2026. Yeah. 
I think also, if I'm a leader, the thing I'm thinking about is, okay, you're telling me that 66% of people don't trust the code and I need to put that human stamp on it. That's a good, thought-provoking thing to end on, I think. But is there anything else that you guys want to cover that we didn't tonight? Matt, I appreciate the time. Thanks for inviting us on. I think Manish and I and the whole team are passionate about doing this sort of research, and the LLM leaderboard project that we talked about earlier is a sort of continuing passion. I think we go look at it every day and see the new models that go up and the results, and just watch these things getting better. It's a wild time. I've been through some different turns in my career, as I'm sure you both have, but I'm not sure I've seen anything quite like this. So it's a fun time to be involved. It's a little scary, but it's a lot of fun. So I appreciate you having us on. Yeah. Thank you so much for being here.