Summary
Oxide engineers discuss how LLMs can increase engineering rigor rather than replace it, sharing concrete examples of using Claude Code for kernel work, API migrations, and bug discovery. The episode challenges the 'vibe coding' narrative, arguing that LLMs are best used as tools for amplifying patterns, improving code quality, and tackling previously infeasible technical debt.
Insights
- LLMs function as 'pattern amplification machines' that excel at replicating carefully designed patterns across codebases, enabling engineers to tackle large-scale refactoring that would otherwise be deferred indefinitely
- The most effective LLM usage pairs human expertise in problem design with LLM execution, creating a verification loop where engineers maintain high scrutiny during code review similar to reviewing a nemesis's work
- LLMs lower friction for exploring unfamiliar domains (kernel development, Zig, signal semantics) by providing intelligent scaffolding, enabling engineers to contribute to projects outside their expertise
- Code quality and rigor should be the primary goal when using LLMs, not velocity; engineers should invest time in refactoring, documentation, and test coverage rather than maximizing output
- Well-written design documents and RFDs become force multipliers in the LLM age, allowing engineers to encode complex decision logic that LLMs can reliably execute across multiple implementations
Trends
- LLMs as debugging and static analysis tools for unfamiliar codebases without requiring deep domain expertise
- Shift from code generation as the primary use case to LLMs as tools for increasing rigor through comprehensive testing and documentation
- Bifurcation of software into high-rigor foundational systems and lower-stakes one-off code, with LLMs enabling quality improvements in both categories
- Documentation and design-first engineering practices becoming competitive advantages in the LLM era
- Type systems and constraint-based design patterns gaining importance as LLM-friendly code properties
- Open source maintainers adopting LLMs for triage, bug report validation, and code review to manage increased submission volume
- Rust and statically typed languages becoming more attractive for LLM-assisted development due to compiler feedback loops
- Technical debt reduction becoming feasible through LLM-guided migrations previously considered too tedious to justify
- Empathy and disclosure becoming critical practices when submitting LLM-assisted contributions to open source projects
Topics
- LLM-assisted kernel development and systems programming
- Code generation and pattern amplification with Claude Code
- API design and versioning strategies for distributed systems
- Type system design for LLM-friendly code
- Technical debt migration and refactoring at scale
- Bug discovery and static analysis using LLMs
- Documentation and RFD-driven development
- Code review practices for LLM-generated code
- Open source contribution standards with LLM disclosure
- Testing strategies and doc test generation
- Rust safety guarantees and LLM code quality
- Vibe coding versus rigorous LLM usage
- Stale bot and issue triage automation
- Pair programming with LLMs versus humans
- Early-career engineer guidance in the LLM age
Companies
Oxide Computer Company
Host company; engineers discuss internal LLM usage for kernel work, API migrations, and system design
Anthropic
Claude and Claude Code are primary LLM tools discussed throughout for kernel work and code generation
IBM
Joked about potential cease-and-desist regarding 'Deep Blue' terminology for software engineer ennui
GitHub
Discussed as platform for open source contributions and code review; criticized for diff rendering limitations
Ghostty
Open source terminal emulator project where LLM-assisted bug reports were successfully submitted and confirmed
Meta
Created Stalebot, criticized as emblematic of problematic automation in open source maintenance
Graphite
Code review tool mentioned as inspiration for building LLM-based PR review scripts
CodeRabbit
LLM-powered code review tool mentioned as alternative to manual review processes
People
Bryan Cantrill
Oxide CTO; shares experience using Claude Code for illumos kernel lock refactoring work
Adam Leventhal
Oxide co-founder; discusses OpenAPI diff library development and LLM-assisted coding patterns
Rain
Oxide engineer; presents the iddqd map library work and the RFD 619 type migration using LLM-guided instructions
David Pacheco
Oxide engineer; shares experience debugging Ghostty crashes using Opus 4.5 without prior Zig expertise
Mitchell Hashimoto
Creator of the Ghostty terminal emulator; confirmed LLM-assisted bug reports as legitimate issues
Quotes
"LLMs are actually being used to result in more rigorous engineering. And it's actually not even that hard."
Bryan Cantrill
"A pattern amplification machine, where you give it a pattern and it just kind of amplifies that pattern into the rest, to whatever degree you want"
Rain
"If you're using LLMs extensively, that code had better be the best freaking code on the planet"
David Pacheco
"Use the LLM as a tool to improve code quality, not velocity. Slow down and refactor, split things up, improve documentation."
Rain
"Death to Stalebot. I say so."
Bryan Cantrill
Full Transcript
Last week, he made us wait for four minutes. Well, my man in the office says: hello, Adam. Says: Bryan's here. Hey, Bryan, how are you? I'm doing well. How are you? I'm doing very well. And we've got all the Oxide friends here. We've got David and Rain. Great. You know, our predictions episode was only a week ago, and yet at least one of our predictions already feels like such a lock that it's amazing it was even considered a prediction as little as a week ago. This ennui, the software engineer ennui, and then your absolutely brilliant naming of Deep Blue for this sense of software engineer ennui, wondering what the real purpose of anything is when the LLM could just do everything for them. It feels like this has already taken root in the last week. Is that my imagination? I don't think it's just your imagination. I think it may also be my imagination. But when I saw someone, not even tag us, but just describe the feeling as Deep Blue, I was like, wow, this is really getting there. We really made it. We made it. We've definitely arrived. How ironic would it be if that cease and desist came from IBM for sullying the good brand of Deep Blue into a kind of...
I'll tell you, the prediction markets do not have that one coming. Exactly. Let's see: Deep Blue, disambiguation, on the Wikipedia page, where they actually need to clarify that we're not talking about the software-engineering, LLM-based neo-depression. Yeah, if we see the Polymarket on that spiking, we know that there's a C&D coming and some insiders are profiting on it. Oh, it's my understanding that those insiders are supposed to be us, right? Isn't that the way? Isn't that what Polymarket... isn't that who they serve? I think in this case it would be the folks at IBM about to sue us, but yeah. Can we take out a position on getting a C&D? Are they going to C&D Oxide and Friends over the course of the year? I mean, old conventional wisdom: Oxide and Friends bingo card. New conventional wisdom: Oxide and Friends Polymarket. There's got to be a good hedge, right? Yeah, yeah. But it feels like this has really been... I mean, we knew this last week, but just the presence of LLMs. What do LLMs mean for software engineering? I feel like I've seen six different pieces a day talking about, like, what does this mean? What does it not mean? And I feel like there's quite a bit of noise out there. Well, noise. There's a lot of consternation out there, that is for sure. This is an issue that has a lot of people thinking about it, one way or the other, for sure. I mean, there is a demographic, and it's hard to say... all right, look: if you're in this demographic, you are going to think that we are belittling you by making other people aware of this. I just want to pause right now. Yeah. All listeners, please write down what large group of people Bryan's about to alienate. Excuse me, I'm being handed this folded piece of paper.
Listen, I've staked the mortgage payment on a C&D, so I really need to goose this thing along. We tried to get a C&D from the Republic of Germany last year after offending all of our German listeners, but nothing came of that. So there's a virulently anti-LLM demographic out there. And I get it. But that's not what we're going to talk about. I guess we already are. Sorry. Whoops. I was like, wait, those folks? You already alienated all those folks. That hornet's nest you've already kicked many times. Like, do you not follow yourself on Bluesky? Right. Maybe you should. I also feel like... you know, I had a boss who would do this once, who'd be like: listen, we're going to go into this meeting and I want no one to mention [insert name of former customer]. Like, we're just not going to talk about them. I'm like, I wasn't going to bring him up. And then, like, the very first thing: I want to explain to you why former customer is no longer a customer. I'm like, okay. Oh, I get it: when we were in the car riding down, you weren't talking to us, you were talking to you. Like, the part of your brain that tries to not screw everything up was trying to talk to the part of your brain that actually, in fact, screws everything up, and that part of the brain wasn't listening, as it turns out. So I kind of feel like it's the same thing for me here: nobody bring up the fact that there's a demographic that believes LLM use is immoral. I will do that from the top. So no, I'm sorry. All right. Well, we can cut all this out, right? Yeah, sure, sure. But, to the contrary, what we want to talk about today is... there's this false dichotomy out there: that you are either vibe coding, a term that, again, I believe is not going to survive the year, a prediction that may not be faring that well in its first week, evaluating our predictions one week in...
...where you have a fully closed loop and an LLM is simply creating software of its own volition. That is one pole. And then the other pole, of course, is: no, no, you should never use them, they shouldn't be used for anything, etc., etc. And correct me if I'm wrong, but I feel like a shibboleth of vibe coding is this idea of: you just do it, and if you don't like the results, you do it again, and if there's a bug, you do it again, and you're never cracking open that nut, never seeing what the gooey middle is. You're just like, just go for it. Kind of a tautological lack of curiosity about what's going on inside. Yes, which is part of the reason I think that the term will die, because I think the term is going to be associated with that lack of curiosity. But yes, absolutely. And there are domains in which that lack of curiosity may be okay, but other domains in which it's not. So those are the two poles. And what we believe, what we've already seen, is that there is a big, big, big middle ground. And in particular, what we have seen is that LLMs can actually be used to result in more rigorous engineering. And it's actually not even that hard. And I've got some specific and recent experience. Adam, maybe I can lead off with that before I introduce our colleagues. So I have been exploring using Claude Code to do kernel work. Our host operating system is Helios; it's an illumos derivative.
And I had what I thought was a good... you always want to have a good first task for these things, just like when I picked up Rust, I wanted to find the right thing to try Rust on. My first thought of a doubly linked list ended up being the wrong idea; that was the worst thing. So okay, let's not do the worst thing, let's do a different thing. And actually, you know, you kind of had the same experience with Rust of picking up not a great first thing, although not deliberately, right? Because you did a sudoku solver. Yeah, and a grammar. And how was that as a first Rust project? It was very early for Rust. It was early for Rust, and just the work that has gone in over the intervening ten-plus years to making it approachable, and to making the error messages convergent rather than divergent. I think my big frustration was, it was like: go try this. And it's like, oh wow, that's much wronger. Like, who told you to do that? You told me to do that. What are you talking about? I didn't tell you that. I don't know what you're talking about. Yeah, exactly. But, in hindsight, today that would be a fine first Rust project. I think so. There was nothing wrong with the project itself. Yeah, right. I mean, it was even simpler than the thing that ended up being your first Rust project. Right. So you always want to have a good first thing for these things. And I've been waiting for a good... what is a good thing to use Claude Code on? Because I just want to see how it does, basically, on this stuff. And I had some relatively straightforward scalability work that needed to be done: a lock that needed to be broken up. I knew how I wanted to do it.
It was going to be a little bit tedious, but I was just kind of curious to see how it did. And it should be said that the idea here also was: you're breaking up this lock in a way that many locks before it have been broken up. Is that fair to say? Yes, absolutely. What needs to be done here is really quite straightforward, and I can describe it pretty completely to Claude Code. And I'll drop a link to the actual bug itself, illumos 17816, so you can see exactly what the problem was at hand. Pretty straightforward. Now, I was very deliberately not closing the loop: not vibe coding it, not one-shotting it. In particular, I'm not even going to let it build anything, right? We're going to go into the source base, and I just wanted to see how it did. And it did remarkably well. Definitely not perfectly; it had some subtle issues that needed to be resolved, but we got those resolved pretty quickly. I would say it had two subtle issues, but it also made a subtle discovery as well. And the thing that was really interesting to me is that I was unleashing it on a pretty big source base, in terms of illumos, and it was really interesting to watch it effectively read block comments to understand how subsystems worked. Reading not just code, but also comments. All in all, it was really pretty impressive. You know, it definitely understood... I mean, we're talking illumos here, so anything that it has trained on that is the Linux kernel or the BSD kernel is literally not going to apply. It would be very easy for it to create arguments to functions that didn't exist.
I'm talking about the kstat facility, which is a facility that doesn't exist there. So you really cannot rely on something that you've trained on; you're going to have to actually look at this. But it was good. And I would say, net-net, in terms of the actual time to implement this, it probably saved me about half the time. I spent about two hours on it to have something that I was pretty confident would work, and that did work, versus what I think would probably have been about four hours. And someone asked: well, are LLMs trained on illumos? It's like, yes, but it was the way it was iterating. If folks haven't used Claude Code, it is really worth experimenting with, especially on an established source base. And so one of the things that I would just like to throw out there, as a first way that these things can help increase your rigor, is by asking questions about a source base. And clearly all of the caveats apply: you can get wrong answers and so on, and you need to verify these things. But it figured out a lot of what needed to be done surprisingly quickly. So I will absolutely be using it again for other kernel projects, if only as a starting point. And one thing it did that was funny, Adam: it needed to add a field to a structure, and in this actual structure, none of the fields is commented. You know how best practice would be to comment every structure member? In this particular source file, none of the structure members are commented. And its proposal was to actually comment the structure member. I'm like, for bad reasons, we're not going to do that. We're going to be consistent with what's there by not commenting the new member that you just added. The code you want to write is actually cleaner than what's there.
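The lock breakup Bryan describes is a standard scalability move: replace one coarse lock with finer-grained locks so that unrelated updates stop contending. The actual illumos change isn't shown in the transcript, so this is only a generic sketch in Rust; all names here are illustrative, not from the real diff:

```rust
use std::sync::Mutex;

// Before: one coarse lock serializes every update, so unrelated
// counters contend on the same mutex under load.
#[allow(dead_code)]
struct CoarseStats {
    counters: Mutex<Vec<u64>>,
}

// After: the lock is broken up so each counter has its own mutex,
// and updates to different slots proceed in parallel.
struct FineStats {
    counters: Vec<Mutex<u64>>,
}

impl FineStats {
    fn new(n: usize) -> Self {
        FineStats {
            counters: (0..n).map(|_| Mutex::new(0)).collect(),
        }
    }

    fn bump(&self, i: usize) {
        // Only the i-th slot is locked; other slots stay uncontended.
        *self.counters[i].lock().unwrap() += 1;
    }

    fn get(&self, i: usize) -> u64 {
        *self.counters[i].lock().unwrap()
    }
}

fn main() {
    let stats = FineStats::new(4);
    stats.bump(0);
    stats.bump(0);
    stats.bump(3);
    assert_eq!(stats.get(0), 2);
    assert_eq!(stats.get(3), 1);
}
```

The mechanical part of a change like this, applying the same transformation at every call site, is exactly the kind of well-specified tedium Bryan handed to the model.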
But it does make you think. The other thing that it brought to mind is: boy, there's so much technical-debt kind of stuff. And one thing I think would be interesting, that surely we're going to see, is people going into existing source code and commenting it better, using Claude to just have better comments, and then obviously validating all of its work. Anyway, that's my story from my experiment over the weekend doing illumos kernel work, and I came away pretty impressed. Awesome. And when you started that project, did you have a sense of what the code was probably going to look like? Yes, definitely. I mean, this is one of these where, in many ways, I had biased it for maximal success. I had a pretty good idea of what it was going to look like. But there are also some fiddly bits, and I'll put a link to the diff into the actual bug. There are some fiddly bits to get right; actually, there's a little bit of math that you need to do correctly. But yes, I definitely knew what the code was going to look like. And this doesn't span multiple files; we're not introducing a new subsystem. This is pretty straightforward as it goes. So this is, I would say, a case that I really picked because it's biased for success. I also picked it because we need to do it, by the way. I mean, that's the other thing. This was not a yak shave. You were doing it in four hours or you were doing it in two hours; either way, it had to be done. That's exactly right. I would say the other thing is that the four hours versus two hours ends up being really actionable, because I started this at 10 o'clock at night.
And there's a pretty big difference between going to bed at midnight and going to bed at two in the morning, you know what I mean? So it was pretty impressive, and it gave me the belief that we could actually use this in lots of other places. But that is my limited experience. So we've got two of our colleagues here, David and Rain, and both of you have used LLMs quite a bit and have discovered, I would say, new vistas of rigor. Rain, do you want to kick us off on some of the stuff that you've done where you found this to be useful? Sure. There's a couple of different things I can talk about here. One of them is the first work that I did, around May of last year, and the other one is the work I did around December, reorganizing types and stuff. Which one should I go with? Let's start chronologically, as you were getting into this stuff. Yeah. So I guess, as you pointed out, a lot of the memes around LLM-based coding are vibe coding, right? You don't pay attention to the code; you just let yourself be in the flow or whatever, right? And that, I have to say, personally speaking, is kind of exactly the opposite of the way I want to build software. For me, I want software to aim towards correctness; I really want a high degree of rigor in my software. So when I came into LLMs, I came in with a huge amount of trepidation. I was really worried, and I was just kind of trying it out.
Right. And I was like, okay, I want to make sure that everything looks good and so on. So the first use that I found that I thought was really impactful: we were having a bunch of issues at work around how we store keys and values in maps. So, kind of on the side, around April or so, I started prototyping this approach, which basically lets you store keys and values side by side, next to each other. And I spent a few weeks trying that out. I did a bunch of prototyping, I did a bunch of work. And it was an interesting experience, because that was all handwritten. So it was three weeks of around 2,000 lines of code, carefully handwritten; there's a lot of unsafe code, and it was pretty challenging. But then I realized that one of the things I needed to do was: if you define a map in Rust, there are a lot of things you need to add to that map in order to make it a functional API. If you look at Rust's HashMap or BTreeMap or whatever, there's a ton of different APIs; some of them are syntactic sugar, some of them are more primitive. An example is, say, the entry API, which, if you've used Rust maps, you might be familiar with. That's an API that lets you say whether an item is occupied or not, and it lets you insert an item. I think it's a beautiful design, but it is a very verbose design. And the map library I was writing, and I'll just drop a link to it, it's called iddqd, had four different maps. So one of the things I was dreading was: oh my God, I need to write all of these map APIs for four different types. And that is just terrifying.
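The entry API Rain describes is part of Rust's standard library maps; a minimal example of the occupied-or-vacant distinction and the insert-or-update shorthand it enables:

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

fn main() {
    let mut counts: HashMap<&str, u32> = HashMap::new();

    // entry() returns a view that is either occupied or vacant;
    // or_insert() fills a vacant slot with the default and returns
    // a mutable reference either way, so insert-or-update is a
    // single lookup.
    for word in ["map", "lock", "map"] {
        *counts.entry(word).or_insert(0) += 1;
    }

    assert_eq!(counts["map"], 2);
    assert_eq!(counts["lock"], 1);

    // The verbose form makes the two states explicit.
    match counts.entry("map") {
        Entry::Occupied(e) => assert_eq!(*e.get(), 2),
        Entry::Vacant(_) => unreachable!(),
    }
}
```

Replicating this surface area, entry views and all, across four distinct map types is the pattern-amplification workload she describes handing to the model.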
So you have a prototype, and maybe you have one of those types, but then you have these three other things, and for each one you need to go in and update the map type. It would be a couple of weeks of work at least, and it would be pretty hard for me to justify that work, as opposed to just ambling along with the default maps. But I really wanted to get this into the hands of my coworkers, because I'm actually really excited about this pattern. So what I ended up doing was handwriting one of the maps, and then I told, I think back in the day it was Sonnet 4.1 or something, right? We were a couple of generations before. Back in the day, like eight months ago. Right. And so I just told it to replicate the same APIs across all of the other maps. And it just nailed it. There were local differences, things that adapted the map types to those differences. This was, I want to say, a total of around 20,000 lines of code. Then I asked it to generate doc tests. If you look at, say, the Rust core types, you will see that every method has a doc test associated with it, right? And I wanted to get that kind of rigor, where every method has a doc test associated with it. And I don't know about you, but I hate writing 5,000 lines of doc tests, right? So I just told the LLM to do that. I gave it a couple of examples to start with, and I told Sonnet, I think, to do that. And it just replicated them; it wrote thousands of lines of doc tests. And this work that I'd been dreading, because it would be weeks of work...
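Doc tests, which Rain had the model mass-produce, are runnable examples embedded in `///` doc comments: `cargo test` extracts each fenced block, compiles it as its own small program, and runs it, so documentation can't silently drift from the code. A toy sketch (the function and the `your_crate` path are hypothetical, not from iddqd):

```rust
/// Doubles a value.
///
/// # Examples
///
/// ```
/// // Inside a doc test, items come in through the crate's public
/// // path; `your_crate` is a placeholder name here.
/// assert_eq!(your_crate::double(21), 42);
/// ```
pub fn double(x: u64) -> u64 {
    x * 2
}
```

Generating one of these per public method, with the example kept consistent with each map type's actual behavior, is exactly the copy-adapt-verify tedium that is easy to botch by hand.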
It took me, I want to say, less than a day to get the whole thing ready. So it was three weeks of careful, deep analysis and work, thinking about unsafe and so on, and then one day of... I was talking to someone on Bluesky about this, and I think they described it as a pattern amplification machine. Interesting, right? You give it a pattern, and it just amplifies that pattern into the rest, to whatever degree you want. The thing is that, before LLMs, I would probably have investigated a code generation library, I would have tried out macros or whatever, and all of those have some downsides. The LLM doing things and tweaking things locally as it went along, like, for a BTreeMap it'll say "ordered" and for a hash map it won't, just making sure that the documentation is all aligned and everything. That was my first experience, and it was a great experience. It wasn't a one-shot, but it was maybe five or six prompts total, and it just nailed it. So that was my first experience. Yeah, a bunch of follow-up questions, because that's really interesting. So one, this is the kind of tedium... just like you say about the doc tests: we all know doc tests are great. As a user of something, you really appreciate them. It takes a lot of time to get them working correctly. And it's really easy, when you as a human are bluntly cutting and pasting, to make a mistake where it's like: oh, that doc test, by the way, have you looked at it? You just cut and pasted it; you changed it in two places but not the third. And so now what you have is kind of nonsense in the test.
Well, that's not very good. Or the test is testing the wrong thing, right? Testing the wrong method, or the wrong structure, whatever. It's so easy to make mistakes here. So easy to make a mistake. Yeah. Okay, so another question I have for you, because the other thing is that when you say, I've got the pattern, I want you to replicate it, it also makes for code that's pretty easy for you to review. This kind of reminds me of my experience: I pretty much know exactly what I'm expecting here, and I'm going to be able to review this pretty quickly. Rain, one question I've got for you, because one thing that was super surprising for me... and look, hopefully I'm in a safe space here. I've got the brain that I engage when I'm writing my own software, and I struggle to engage that when I'm reviewing someone else's software. I try to, and the best reviewers, I think, are able to review code as if they themselves were writing it. But for me, I really have to work on that. And I definitely know when I'm in the "yeah, yeah, this probably works" mode in my brain, versus the "no, no, wait a minute" mode, where I'm doing my checklist before takeoff, and I'm going to die in this airplane if I don't get the flaps down correctly. And the thing that was super surprising to me is that when I was reviewing Claude's work, I was in that mode of "I'm writing this myself": a heightened state of alert, really reviewing things closely, finding some subtle things in its work. Did you find the same when you were reviewing the code that it had written? In this case... so I have the same struggle that you do, right?
When I'm reviewing code, especially on github.com... I'm sure we all have our complaints about the GitHub review experience, right? Yeah: oh, by the way, here, let me show you all the trivial stuff. The non-trivial stuff? That's a lot of lines to render; let's not review that. Too big, why bother? Right. So I had a bit of the same experience. I feel like I was somewhere in between here, where much of this depends, or at least for me depended, on how intensely you and the LLM were pairing with each other. So I've had experiences with an LLM, like this one, where it just was doing its thing and I was not paying a huge amount of attention, and then I ended up reviewing it, and it had made maybe two or three mistakes, right? But I also felt pretty assured by the fact that all the hard bits were handwritten, and the LLM was just wrapping those hard bits. So it was doing relatively easy things. There have been other things that I've used LLMs for, especially Opus 4.5 over the holidays, and for those I ended up having this very intense, mind-meld pairing session. That felt like I knew every single line of code and what it was doing, and I was carefully working through things. It was a wild time. The more I end up using it, the more I feel it depends. But even the current LLMs, and again, this can change, because I know things have advanced so quickly...
But even current LLMs do get things wrong, or do things suboptimally, or do things in a way that's unmaintainable. And you do have to pay attention to that, right? And that is part of the rigor. I feel like I have built up some muscles around this from having used it, and so I think part of the rigor is also getting some practice with looking at LLM code and reviewing it. Yeah, interesting. So in this first use case, you've got a lot of tedium that needs to be done. And I think what is really interesting about this case is that you're doing something we do a lot, which is: okay, I've got this problem; I kind of want to solve it in a way that's a little more generic, where my colleagues can use it, and so on. But we always have that tension. On the one hand, we always encourage ourselves: hey, this is a good opportunity to build a new abstraction. But we're also all kind of realists.
I'm like, yeah, but we can't not ship the next release or what have you, because we're kind of focused on that, and there's always that balance. And to take this thing and reduce the amount of work involved in it by a factor of four — that may be the difference between doing it and not doing it, where it's like, just straight up. Right, yeah. Right, yeah. I actually suspect David has a few things to say, because I know David and I have had some chats about this. But for me, there are new vistas that open up — I think that's the way David put it, right? There are things that were simply not feasible to do given, you know, company priorities and personal life stuff going on, all the different things that are involved in a human's life, that I feel like have opened up, right? And for me, with iddqd — the goal of this library was actually to increase the amount of rigor in our software. So I think it is very cool that the LLM is able to kind of work on this, right? So this is a way you increase rigor: you build an abstraction that increases rigor, even if it is tedious, right? That is an increase in rigor in the overall system. Totally, yeah. So David — I mean, as Rain points out, you were among the earliest adopters at Oxide. I think you've really shown the light for a lot of us, showing what these things can and can't do. Do you want to talk a little bit about your experience getting into this? Yeah, yeah. I mean, for a long time — I think until this year, really, when Claude Code took off — I was using LLMs as kind of a fancy search, even before they were actually search engines. And, you know, everyone was like, it's not a search engine, because you're getting this very lossy picture of what's in the model weights.
Even then, on things that they were trained very well on — which is what I work on, web dev — they were great, even just for retrieval. So I was using them a lot for that, or for small snippets. This year, I think, is when it really took off: the models could really do more complex autonomous things based on a very small description, and more importantly, pull in — like what you were talking about — when Claude Code is looking at the voluminous code that you have on disk, it's pulling in context that it doesn't have. And that's very different from the typical use case. The typical use is you ask a one-sentence question, and there's only so much detail that you can get back out of it, because there's just not enough texture in the question to tell it what to tell you back. And so when I gave that talk about LLMs at Oxcon in September, a lot of what I stressed was that the way to set up the problem for yourself is you want to give it enough so that the answer is, in some sense, contained in what you give it. And what these agent tools do, by just living in a repo and pulling in whatever context they want — they give themselves that texture and context. So that's really what's changed this year from the way I was using it a really long time ago. I was trying to, you know — I wrote a CLI that lets you pass stuff on standard in, and you can dump files into it. But giving the things the ability to just do that stuff on their own makes things so much easier, because you don't have to manually select a list of files that's worth looking at. And so where were you first really beginning to use this, beyond just search or what have you? Were you beginning to, like — okay, I can actually pair with it, as Rain was saying? Yeah, the early things, earlier this year, were things like stubbing out — I would stub out a test.
This was before they got good enough that you could really just tell it the shape of the set of tests that you wanted and it'll write 50 tests. Before that, it was more like you would write the title of the test and maybe five comments saying the steps of the test, and it would fill it in. It would still feel great, because you'd be saving all this typing of the most tedious kind: make this request, check this on the response. That was where it started to feel like it was really helpful. And I think I gave some demos of that kind of thing, where you know what you want and you can tell it piece by piece and it will fill it in, and it would do a good job. This is kind of what people are talking about in examples where it can follow a pattern really well: if you give it one example — you do this thing yourself once and you need to do it five more times — it can follow that pretty well. But more recently, with Opus 4.5, it's been able to figure that stuff out on its own, even without the stubbing out of all the details. One example — the thing that really impressed me when Opus first came out — was something quite different from you guys' examples, because it was an example where I'm not an expert. It was specifically something that was kind of a pure test of the thing's ability, because I didn't know anything about what I was doing, and I was still able to get to a surprisingly good result. And this was debugging crashes in the Ghostty terminal. So I ran into a couple of crashes. I've never written a line of Zig; I don't know anything about the Ghostty code base; I've never looked at a crash dump, to my shame as an Oxide employee. But there were a few crashes that I wanted to investigate — I couldn't find anybody talking about them on the Ghostty GitHub, so I figured they were pretty rare. So I looked into them, and I just had Opus essentially figure it out. I, you
know, the only thing I really had to do was find the Rust port of minidump-stackwalk to look at the crash dump, and point it at the problem — I knew where the crash dump was located on my disk. And from there it basically was able to look at the code, statically analyze it, and find the source of these — I found three different bugs this way. Real ones. I was able to write up the bug reports, and they were confirmed to be real bugs and fixed. And that was what really unsettled me: this was an area where I really knew nothing. And just using my sort of sense of what sounds like it makes sense — to validate that I wasn't going to be posting AI slop on the Ghostty GitHub — I was able to come to three real bug reports without really putting very much into the process. Yeah, that is wild. And so were they primarily operating on the stack backtrace, or was it the stack backtrace plus actually walking data structures? Was it actually, like, meaningfully...? There was no live debugging. I think it was looking at the stack trace and then looking at the code, and the error that came up, and then just sort of thinking about what could have happened in the code to cause the error. Interesting. Yeah, that is really interesting. I want to be using them as debugging tools a lot more, and I'm very curious about this use case. So that is wild. So when you submitted the — I mean, this is a great thing about Ghostty being open source, and it's Mitchell Hashimoto's project — I just say, good on Mitchell. He obviously does not need to work for the rest of his life, has made generational money, and he's writing a TTY emulator. I just think that's pretty great. I think that is every software engineer's dream. And then making it open source. What was the reception, then? You said these were bugs, confirmed to be bugs.
So it sounds like what it found was legit. Yeah, you know, part of it was that the bug report itself was, to some extent, out of my depth. A couple of them I was really confident about, and then one of them, I was like, it sounds really good, but I just didn't know enough about how Ghostty worked or how Zig worked to really evaluate it. So I was nervous, but I was up front, with a lot of humility: I'm really not sure about this, but it sounds so good that I cannot hold it back. Okay. So let's actually talk about the first two, where you're like, okay, I don't know any Zig, but I'm a software engineer, I know many other programming languages, and just based on its description and me looking at this code, I'm pretty sure I've got a legit bug here. Can you describe those first two a little bit? In terms of, like — I mean, you had confidence: not knowing very much Zig, or knowing only the Zig I've learned, I think I've got a legit bug here. Yeah, one of them was very simple, because it was a copy-paste error where they were just referring to the wrong variable, and you could tell: it was supposed to be grapheme bytes and it was hyperlink bytes. So that was pretty straightforward. Another one — this was like two months ago, so I can't... Yeah, I'm so sorry, the two months ago being, yeah, several years ago, especially in... The really complicated one was a mutex lock that was not being taken at the right time, and so there was a conflict. Reasoning about that was pretty tough for me, not understanding how the code worked. But it was pretty impressive that the model was able to see it: like, this is where you should have taken a lock, and you didn't. Yeah, okay.
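The workflow David describes above — symbolicate the dump, then hand the stack trace plus a pointer at the source tree to a model and let it statically analyze the crashing path — can be sketched roughly like this. This is a hypothetical sketch, not David's actual script: the `minidump-stackwalk` invocation is based on the rust-minidump tool named above (flags may differ per version), and the `claude -p` single-turn invocation is an assumption about the agent CLI in use.

```python
import subprocess
from pathlib import Path

def build_triage_prompt(stack_trace: str, repo_hint: str) -> str:
    """Assemble a prompt containing the evidence the model needs.

    The principle from the discussion: the answer should in some sense
    be contained in what you give it, so we inline the symbolicated
    trace rather than asking a one-sentence question.
    """
    return (
        "Here is a symbolicated crash stack trace:\n\n"
        f"{stack_trace}\n\n"
        f"The source tree is checked out at {repo_hint}. "
        "Statically analyze the code on the crashing path and propose "
        "a root cause. Say clearly if you are not confident."
    )

def triage_crash(dump_path: Path, repo: Path) -> str:
    # minidump-stackwalk (from rust-minidump) prints a human-readable,
    # symbolicated walk of the dump to stdout.
    trace = subprocess.run(
        ["minidump-stackwalk", str(dump_path)],
        capture_output=True, text=True, check=True,
    ).stdout
    prompt = build_triage_prompt(trace, str(repo))
    # Run one non-interactive agent turn; `claude -p` is one way to do
    # that, but substitute whatever model CLI you actually use.
    return subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    # Paths here are placeholders.
    print(triage_crash(Path("crash.dmp"), Path("src/ghostty")))
```

The human's remaining job, as in the episode, is validating that the resulting diagnosis sounds like it makes sense before posting it anywhere.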
So another thing that I think is really interesting: Mitchell himself replied — you've linked all the issues in the chat, so people can get those. I think one of the things that I really like about this, David, is that you lead off by saying: look, I've been using Claude Code with Opus 4.5 to investigate. You're very upfront with: hey, an LLM has done the work here — as a way of saying, someone else with greater domain expertise is going to need to look at this. Which is worth looking at specifically: they have a quite clear LLM disclosure policy. So Mitchell has been pretty open that he uses LLM tooling, but they really want upfront disclosure. So they made it easy by telling me exactly what to do. I was more worried about the embarrassment if my issue was fake — being one of those guys posting issues that are fake, that the LLM told them was a bug. Yeah. Well, but you obviously quadruple-checked all this stuff, and it looks like you had... So this experience, as you said, was — I mean, the way you put it was, like, unsettling. When you describe it as unsettling, why was it unsettling? Yeah, well, I thought this was such a clear-cut case where it was obviously not my expertise that was operative here, because I didn't have any. There was a high level at which I could tell that it felt legitimate. And I think there may have been one or two things where it came up with something where I was like, that doesn't sound real.
But the amount of guidance that I actually provided in the process was a very small proportion of what actually took place. That was what felt unsettling about it. And the guidance that I did provide also didn't feel like that ineffable human taste that people love to attribute to themselves. It really wasn't that — it was finding the Rust port of the stack trace symbolicator or whatever. Right, right. And I do love the fact that you're like, I'm actually even a little bit embarrassed — you say that in these issues. But it's also like, what else was there to do? This thing crashed. Am I supposed to not give someone the feedback that this thing has crashed? Or am I supposed to just sling an issue in there? I mean, it feels like you're being actually helpful to the project. Right — well, if it had turned out to be fake, I wouldn't have been. If my diagnosis was wrong, then that was just creating work for them. So it hinges pretty tightly on the fact that they were legit. Yeah. Yeah. Okay, so I think it goes to — and we've talked a bit about RFD 576, where we lay out our own LLM thinking at Oxide — it just goes to that: having empathy for the person that's going to read this, and, in this case, really contextualizing it. But also, it sounds like you're doing your own checking, to the degree that you can. It's interesting: a lot of these attributes or qualities that we attribute to LLM-generated code are all things that, as we're talking about, I've associated with other colleagues. I'll provide an example.
I just mean, when you're doing a code review, Brian — my guess is the degree of scrutiny that you feel yourself applying may change depending on where that code came from. It certainly does for me. And not so much at Oxide, but when I was at Sun, there were some times I'd get a code review where it's like, I really need to imagine what it would have been like to write this, so that I know what I'm looking for. In other cases you're like, well, there's some code and there's some tests, and I'll look around, but a lot of the thinking has probably already been done. And on that — you're exactly right. And look, I'm ashamed to say it, but I'm going to say it: the way I would review code from, like, a nemesis. You know, a nemesis integrates code and you're like, I am going to get my nemesis. And one of the things I realized I needed to do — for my own self-review, and for reviewing people who were not my nemesis — was to channel that dark part of my brain that's like, I'm going to find the thing in here. It's embarrassing to say, but it's definitely true. Yeah, well, I do that because when I'm reviewing someone who I consider a friend, I want to do them the service of helping them with their code. But I guess we're just motivated differently, and that's fine. Yeah, okay. It feels like you're just trying to explain away a bunch of the code review comments. Very good code review comments. I mean, it feels like comments you'd give a nemesis. That's right. Okay. One man's nemesis is another man's friend. There you go. And then, David, as you were describing — you don't want to file a crap bug report. Man, have I seen some crap bug reports, where people take you on this wild ride through a core file and you end up just nowhere. You're like, okay, I'm following, but all of this is just blather. You don't need an LLM to hallucinate.
We've been doing that. Yes. And we've seen these bug reports where you're like, okay, there's certainly a lot of information here, but you've actually not contributed. So that same empathy you're talking about is so at the core of engineering, full stop, irrespective of the tools we're using. Yeah, no, that's a very good point. It is really infuriating to see a bad AI bug report. I'm probably more optimistic than most people about LLMs, and I think part of that is just working at Oxide: I don't really see anybody doing the pathological things that we hear about online. Everybody's so careful and serious at Oxide, so I worry that I'm biased toward optimism because I'm not seeing the median user of these tools. But then I see one example — I get one bug report on a repo that I'm actually familiar with — and I'm like, forget it, throw these out, we're done. Yeah, but to Adam's point, I don't like it when you get bogus bug reports without LLMs either — or bogus PRs. Yeah, but it's harder to write off all of humanity. I guess that's true. It's more limiting if you do. Yeah, yeah, absolutely. But I do think you get people — and we've definitely had this happen — where we will make things that we are open sourcing, not making a big deal out of it. We're not trying to create a community out of it; we're just open sourcing it kind of hygienically. And then someone will come along with a kind of spurious PR to change things. And we're like — no, sorry, and this is not LLM-assisted, this is in the pre-LLM age — no, sorry, this is actually not helpful.
So, Adam, just to your point about the lack of empathy: drive-by PRing is not new, as someone's pointing out in the chat. I think the difference, though, is that LLMs do amplify that problem, right? I was talking about this with someone: you can get something that is not great in a few minutes, as opposed to maybe a few days. You're so spot on. And my experience has been: an unproductive, unempathetic colleague — that's fine. If I can run faster than you, I can keep up; you're not going to outrun me. I don't need to worry about you diverting me in the wrong places. A highly productive, unempathetic, careless colleague — that's what takes, like, 150% of my effort, just to keep them from doing harm. And you're right, Rain, that it takes that formerly plodding colleague, or collaborator, who you had to keep on the rails, and makes it much harder to steer them. Yeah. It's like a Gish gallop, almost. That's how I think about it, right? It's a Gish gallop for issues. I've luckily not faced too many crap bug reports. I've seen some AI bug reports, but they've all been very high quality — kind of at the standard that I would expect of myself when writing a bug report. So again, I am biased towards optimism here, but it is something I'm worried about. I do look at people just putting up garbage, and it's like, okay, well, it's now harder to filter out garbage. I have to say, on the flip side, a thing I've done is I've used Opus 4.5 and fed it a bug report and told it to tell me whether the bug report is real or not. So, yeah, maybe that's the way to keep up. It's like some open source Jevons paradox or whatever.
There's no money involved here, but I just mean the cost of creating PRs and projects and all of these things has dropped so much that the volume has just accelerated. Well, I also do think that with these open source projects especially — I mean, God bless small communities. I would be almost intrigued by someone who's like, I'm going to use an LLM to file a bunch of bugs against illumos. You're like, that's weird. You should talk to someone about that. Yeah, talk to someone. I mean, I'm almost not opposed — that's... okay. Versus, like, a big project — certainly we saw this with Node. I've been in very, very large projects with many, many contributors, and in very small projects, and there's a lot to be said for being in a small project, and a lot to be said for a project that doesn't attract as much attention, because it doesn't attract as much of that kind of negative attention either. So I'm sure there are some high-profile repos for whom this problem is really, really acute — and maybe that was the way it was with Ghostty and Mitchell. But for a lot of the stuff, at least that I work on, it's not really an acute problem. I've been surprised that the tooling for maintainers hasn't been able to keep up. I mean, you expect some lag, right? The volume of garbage has to balloon for a bit before it becomes such a big problem that people are incentivized to put some work into solving it. But I think one of the things we're going to see in the next few months is maintainers more openly using LLM tooling to cut through that morass of AI bug reports and AI... And for code review, too.
I mean, for code review — honestly, my eye-opening moment with respect to LLMs and software engineering was on Oxide and Friends, when we had a listener who had access to GPT-4 when I did not. And Adam, for some reason I can't even remember which episode — you'll have to figure out exactly when this was. I guess it's a little hard to ring the chime for an episode that I can't recall. Yeah, just ring it. Exactly — give the people on YouTube something to complain about. But, and I'll go back and find the episode: the thing that was really interesting is I had a PR that day that I was linked to, and someone dropped in a GPT-4 code review of that PR. And I'm like, wow, this is not all wrong. It's also not great, but the comments it has are definitely not garbage. And that was a long time ago with respect to LLMs. And code review — it just feels like the opportunity for code review is really rich. And Dave, to your point: why don't maintainers have this? Maybe they do and I'm missing it, but it just doesn't feel like GitHub is providing it. Anyway, why am I doing this? Of course — diff's too large to show. Yeah. Yeah, you definitely expect it to be built into GitHub pretty soon. I mean, there are tools like Graphite, CodeRabbit. That's kind of what started me on this: I saw someone praising this tool Graphite, which does look really nice online. And it's like 20 bucks a month a seat. I was like, wait, 20 bucks a month a seat? I can write that in a script. So I wrote a script that pulls the diff, the comments, the PR body, and just feeds that into an LLM, and just says: review this. And, you know, you have downsides there, where a lot of times there's additional context that's not in the diff.
Like, if you're using something that is imported in the code already, the import is not in the diff, so it's going to say, are you sure you're importing this? It doesn't know that the tests pass, doesn't know that CI passes. But you can get quite a lot that way. It can find a mismatch between your SQL migration and your main DB init SQL. It can find inconsistencies really well — including inconsistencies between your human-readable stuff, like your PR description, and the actual code. There's a lot you can do there, even without what we now have, which is tools that can go and vacuum up anything they need to validate their hypotheses about why the PR is broken. With Claude Code, it can write a new test that validates that the code as written doesn't do something. And there's a lot of low-hanging fruit there that we're not really touching at all. On the topic of low-hanging fruit: I think my nemesis on GitHub is Stalebot. Right? Like, I hit some real bugs. Oh. Oh, Stalebot. I feel like LLMs could slay Stalebot. You know: well, this bug is six weeks old, so I guess nobody really has it, or whatever. It's like, nope, there's a crash dump and a stack trace and a bunch of information. Yes. And at least helping to weed through that, so that you can maybe neuter Stalebot a little bit, or make sure you keep around the things that refer to real problems with sufficient data to diagnose them — or maybe diagnose them autonomously. I think there's a theme here: a lot of work that just doesn't get done, that could get done, and that in some cases maybe doesn't need the highest level of sophistication to complete — those are great tasks. Adam, you are so right. And God, I hate Stalebot with the white-hot passion of 10,000 suns. Stalebot is such an indictment.
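Stepping back to the homegrown review script Dave describes a little earlier — pull the diff, the PR body, and the comments, and feed them to a model with "review this" — a minimal sketch might look like the following. The `gh pr diff` and `gh pr view` subcommands are real GitHub CLI commands; the prompt wording and the `claude -p` single-turn invocation are assumptions about one possible setup, not the actual script from the episode.

```python
import subprocess

def gh(args: list[str]) -> str:
    """Run a GitHub CLI command and return its stdout."""
    return subprocess.run(
        ["gh", *args], capture_output=True, text=True, check=True
    ).stdout

def build_review_prompt(diff: str, body: str, comments: str) -> str:
    # The model sees the same three things the script in the
    # discussion pulls: the diff, the PR description, and the comments.
    return (
        "Review this pull request. Flag bugs, inconsistencies between "
        "the description and the code, and anything suspicious.\n\n"
        f"## Description\n{body}\n\n"
        f"## Comments\n{comments}\n\n"
        f"## Diff\n{diff}"
    )

def review_pr(number: int) -> str:
    diff = gh(["pr", "diff", str(number)])
    body = gh(["pr", "view", str(number), "--json", "body", "-q", ".body"])
    comments = gh(["pr", "view", str(number), "--comments"])
    # One non-interactive model turn; swap in whatever CLI you use.
    return subprocess.run(
        ["claude", "-p", build_review_prompt(diff, body, comments)],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    import sys
    print(review_pr(int(sys.argv[1])))
```

As noted in the discussion, this approach can't see context outside the diff (existing imports, CI status), which is exactly the gap the agent tools close by reading the repo themselves.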
I mean, we don't talk about Stalebot enough. For all of you decrying the future, look into the past: Stalebot is everything wrong, because it is just like, oh, well, no one has seen this issue in six weeks, so we're closing it. How does that make sense? That doesn't make sense. It even comes from Facebook, by the way — Facebook had an internal Stalebot, and then someone built an external version. And it's the Facebook culture; it is exactly the kind of thing you would expect Facebook to make. Move fast and leave broken things around. Yeah. Move fast and close out bugs that haven't had any activity in the last 48 hours. Okay. But it is also so gross, because it's like: but no, no, look, I made the dashboard green. So, Adam, you are right — I hadn't even thought about what this means for Stalebot. Stalebot, you are a marked bot. I hope LLMs — because that's a great point: if nothing else, if we get rid of Stalebot, it was all worth it. I gotta say, you're gonna be up till four in the morning vibe coding, like the Joker of Stalebot, just trailing Stalebot around, reopening and fixing the bugs that it's trying to close. I actually have an example of a bug that I feel like I would have just ignored in the past, but had a much better time with thanks to Opus 4.5. This is a bug on cargo-nextest, which is a personal project, and the bug is titled: SIGTTOU when test spawns interactive shell. Now, if you've spent any time on this stuff, your eyes glazed over, pretty much, right? And this person actually did a nice investigation with Claude and posted it, and said: I've worked with Claude to get good attribution and reproduction for this — the words below are its, but I stand behind them, right?
And so it was a pretty well-written issue. But it's the sort of thing where, to really dive in — it gave an example of, like, this is what all these other projects do, and so you should do this thing as well — it's one of those things where you've got to spend a whole day investigating what the other projects do and how it all fits, really getting to the root of the problem, right? And I'm pretty lazy, generally, and I'm like, I don't want to do that. So I would either do a half-assed thing — and honestly, in the past I would have just done what the suggested fix was, right? It turns out that suggested fix was actually woefully incomplete, which is where I feel like — I gave this to Opus 4.5. And one of the things it said is that less and Vim and a few other projects follow this pattern, right? So I actually gave it the less source code, and I gave it the Vim source code, and I gave it the source code to a bunch of other things, and I was like, okay, dig into this: what do these projects do? And this kind of comes back to asking questions of code bases you're unfamiliar with. So I did that, right? I had no idea about the less code base; I had no idea about the Vim code base or anything. It spent ten minutes or so, and it actually wrote up a nice summary: here is what all the projects do, and so on. And then I was like, okay, this makes sense, and I tried that. It was interesting — it took me maybe two or three hours to do, and the final fix was pretty small, like 130 lines of code or so. But it was great, because we tried the first thing, right?
We tried the suggested fix. The LLM did the work, and the LLM wrote the test — which is annoying and janky in its own way. And then I tried it, I dogfooded it a bit, I found that, okay, this isn't complete in various ways, and then we iterated a few times. And there are so many places along this path where, pre-LLMs, I would have just dropped out and been like, ah, this sucks, I don't want to deal with this, I'm done for the day, right? And totally — when you do that, you're going to take one of two paths. It's going to be: oh, we'll just take this kind of mediocre fix. Or it's going to be: maybe I'll just let Stalebot finish this one off, right? I don't have to kill it. I mean, even this title, Rain: SIGTTOU. Okay, fine, TTY out. No, you've got to cue, like, Antiques Roadshow — you've got to be like, okay, I'm now going to go into POSIX signal semantics. And then you're like, "when test spawns interactive shell" — well, here's a thought: don't do that. Don't do that. I mean, yeah, no, seriously. Yeah, but it makes it attainable, and it makes you get past the, like: have you tried not doing that? I don't know, that sounds like a dumb test. Yeah. Okay. And I do think this is a really important point: you pick this up now, properly, because it's easier — we've lowered the friction — and you actually get it completely fixed. Getting this fixed makes nextest more robust; it makes it more rigorous. I mean, on the one hand, it's like, oh, okay, really? As you say, Adam, maybe don't spawn an interactive shell in your actual job. But hey — now you can, though. You know what I mean?
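For readers whose eyes glazed over at SIGTTOU: it's the signal a background process receives when it tries to change terminal settings, and its default disposition stops the process — which is what bites a test runner whose test spawns an interactive shell and then fights it for the terminal. The episode doesn't give the details of the actual nextest fix, but the general pattern that programs like less and Vim use amounts to ignoring SIGTTOU around the terminal-control calls. A rough, hypothetical illustration:

```python
import os
import signal

def deliver_sigttou_with_ignore() -> bool:
    """Show that ignoring SIGTTOU lets a process survive a delivery
    that would, under the default disposition, stop the process."""
    old = signal.signal(signal.SIGTTOU, signal.SIG_IGN)
    try:
        # With SIG_IGN installed, this delivery is simply discarded.
        os.kill(os.getpid(), signal.SIGTTOU)
        return True
    finally:
        signal.signal(signal.SIGTTOU, old)

def give_terminal_to(fd: int, pgid: int) -> None:
    """The less/Vim-style pattern around tcsetpgrp(), sketched:
    ignore SIGTTOU, hand the terminal to the target process group,
    then restore the old disposition."""
    old = signal.signal(signal.SIGTTOU, signal.SIG_IGN)
    try:
        os.tcsetpgrp(fd, pgid)  # raises OSError if fd is not a tty
    finally:
        signal.signal(signal.SIGTTOU, old)

print(deliver_sigttou_with_ignore())
```

This is only a sketch of the signal semantics at play, not the 130-line fix Rain describes; the real change involved reconciling what several of those projects do.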
And I think I just see this in lots and lots of places where we are going to make our infrastructure actually more robust, because we can now go pick up a bunch of work that we realistically just weren't going to get to — we, the people who work on this lower-level infrastructure, were just not going to get to it. Yeah. So I have an example of some work that I got to, finally. I mean, Rain described herself as lazy. Rain, I offer this counter: I think you're kind of bringing a knife to a gunfight here. Out-lazy Adam, will you? Yeah. Oh boy. Because, Adam, you know this: I have been wanting to do an OpenAPI diff library since before you joined. I'm sure I've been talking up this vaporware, and I made multiple earnest attempts at starting it. It's just one of these pieces of code where there's no good way to do it — all the ways to do it are gross and boring and stupid. And this is not the case of code running in privileged mode or whatever; this is some code where, if it segfaults, if it somehow reboots the machine, it's fine. It's just not that high stakes — very low stakes. And the thing that got me across the line was, you know, I started using some of the OpenAI models in VS Code, mostly through the lens of a very smart completion. And it allowed me to repeat this pattern I wanted to use, to make sure I wasn't forgetting to compare certain things. And as Rain was saying, absent this, I would have written some code to write code — some script or something stupid to output a bunch of code, or a proc macro or something like that. But that was great, and I actually got the thing working, and it was really fun to build. My real breakthrough was then — it was coming up on demo day, and I wanted to show it off. And this is a library.
So there's not, like, a front end to this thing. So I was like, okay, I'll write a little CLI tool. And literally all I wrote was function main, open a comment, said "parse the first two command line arguments." And literally the rest of the program, I just tab-completed. It figured out: this is probably the program you want to write. I had to fix a couple of little things here and there, but it was very eye-opening. And then that became my demo, in addition to the actual library I built: the live coding of just hitting tab and watching it do the thing that it inferred I wanted to do. So I would argue, Adam, that you have embodied all three of Larry Wall's famous virtues of a programmer: you've shown your laziness, your impatience, and your hubris in a stroke. But this point of laziness is really important, because we all know, and we kind of speak about it euphemistically as laziness, that a hallmark of good software engineering is coming up with powerful abstractions. And when you are repeating code multiple times, that part of your brain is like, ah, this is not the right abstraction. And Adam, both you and Rain mentioned, like, ah, I would have made this a proc macro, or I would have done something like that. Because I think we over-index on that, on the whole DRY thing, do-not-repeat-yourself, where you become so over-indexed on it that you do things that actually generate suboptimal artifacts. Or there are times where it's just, actually, it's not that big of a deal to have code that is similar but slightly different in three places. Like, it's okay, we're all going to live. But we really resist doing that. And LLMs make it easier to do that. One of the things that annoys me the most, and you know, I'm very, very grateful to the Rust open source community, right?
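To make the demo concrete, here is a minimal sketch (not Adam's actual code; the file names, the `docs_differ` helper, and the normalization rule are all assumptions) of the kind of CLI he describes: parse the first two command line arguments as paths to two OpenAPI documents, and report whether they differ.

```rust
use std::env;
use std::fs;
use std::process::ExitCode;

// Hypothetical helper: compare two documents after trimming trailing
// whitespace on each line, so formatting-only churn isn't flagged.
fn docs_differ(a: &str, b: &str) -> bool {
    let norm =
        |s: &str| s.lines().map(str::trim_end).collect::<Vec<_>>().join("\n");
    norm(a) != norm(b)
}

fn main() -> ExitCode {
    // Parse the first two command line arguments, as in the demo.
    let args: Vec<String> = env::args().skip(1).take(2).collect();
    let [old, new] = match args.as_slice() {
        [a, b] => [a.clone(), b.clone()],
        _ => {
            eprintln!("usage: openapi-diff <old.json> <new.json>");
            return ExitCode::from(2);
        }
    };
    let old_doc = fs::read_to_string(&old).expect("read old document");
    let new_doc = fs::read_to_string(&new).expect("read new document");
    if docs_differ(&old_doc, &new_doc) {
        println!("documents differ");
        ExitCode::from(1)
    } else {
        println!("documents are equivalent");
        ExitCode::SUCCESS
    }
}
```

A real diff would of course parse the documents and compare operations structurally; the point here is just how little scaffolding the "function main plus a comment" demo needed.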
And this is just to qualify that this is not going to be negative at all, right, toward the open source community. But one of the things that really annoys me is, like, in the Rust docs, you click on the source link, right? And the source link leads you to a macro definition. Yeah, you see that even in the standard library; there are a few examples with the integer types. So you cannot click through. For example, you want to see the next_power_of_two implementation, which is, I mean, you know, a bit-manipulation thing. I want to look at that, right? You click on it, and it doesn't show you that. And it really sucks, and I hate it. So I have made it a point in my libraries that I would much rather copy-paste code just so that you can click through the source link and you can get it, right? And so macros just, you know, don't work with that. But the DRY, the don't-repeat-yourself answer, is to use a macro. So it's like, okay, well, LLMs actually do provide a better solution to that. Yeah, I've been thinking a lot about this, because if you think about where our intuitions come from about what is worth abstracting, or what is too much repetition, they're so tied up with the expected reader of the code, whether that's ourselves or the people that we know. They're tied in very, very tightly with our intuitions about what people can handle and what's reasonable to expect of other people. And what is reasonable to expect of an LLM is radically different from what is reasonable to expect of other people. And so the amount of repetition that is tolerable in a code base, or is manageable, seems to me like it's way, way higher. And it's not just in a code base. Like, we were thinking about API design, API response shapes earlier.
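For reference, this is roughly the bit-manipulation implementation Rain wants the source link to show — a sketch for illustration (the standard library's actual implementation differs in detail): smear the highest set bit downward so every lower bit is set, then add one.

```rust
// Classic bit-twiddling next_power_of_two, checked against std below.
fn next_power_of_two(x: u32) -> u32 {
    if x <= 1 {
        return 1; // matches u32::next_power_of_two for 0 and 1
    }
    let mut v = x - 1;
    v |= v >> 1; // top two bits of the leading run are now set
    v |= v >> 2; // top four...
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16; // ...every bit below the highest set bit is set
    v + 1
}

fn main() {
    // Agrees with the standard library over a sample range.
    for x in 0u32..10_000 {
        assert_eq!(next_power_of_two(x), x.next_power_of_two());
    }
    println!("ok");
}
```

This is exactly the kind of ten-line function that's genuinely worth reading at the source link, and that a macro-expanded definition hides.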
And I was working with Adam on this: we erred on the side of making a response shape more flat and less nested, even though we lost a little bit of type information that way, just because the flat one was a little easier to read. And the type information, we decided, was not really worth keeping in that particular case. But I think when you assume that future developers will have LLMs at their disposal, it must tilt the calculus toward encoding more type information at the expense of readability, because that type information is what is going to keep LLMs on the rails in the future. It's like all the good practices: make invalid states unrepresentable, and all of those things. It turns out that all of that actually helps LLMs a lot too. Yeah. Is there anything like, do LLMs like Rust? I know that's a weird kind of question, but I've wondered if the things that we appreciate about it, in terms of not being able to represent invalid states and so forth, are a useful property when LLMs are constructing code. 100%. I mean, I feel we said this, again, ringing the chime for a previous episode, but I feel we said this when we first started talking about LLMs and Rust: that Rust is actually going to be a really good fit for these things. Because, something I've said from the beginning, Rust shifts the cognitive load to the developer in development. It forces the developer, in development, to consider a lot of issues that historically you wouldn't see until some code is deployed into production. And I loved that shift. I think that shift is really important. And I think it tacks right into what LLMs can do, and that they reinforce one another. So I think LLMs and Rust are a very good fit for one another. Which I don't think is that hot of a take. I don't think that's spicy.
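A tiny example of the "make invalid states unrepresentable" idea being discussed (the `Response` type here is invented for illustration, not from the Oxide API): instead of a struct where an error field and a body field could both be set, or both absent, an enum forces exactly one valid shape — which constrains an LLM generating code against the type just as it constrains a human.

```rust
// Each variant carries only the fields that are valid in that state,
// so a contradictory response simply cannot be constructed.
enum Response {
    Ok { body: String },
    Error { code: u16, message: String },
}

fn describe(r: &Response) -> String {
    // The compiler forces every state to be handled -- and only real ones.
    match r {
        Response::Ok { body } => format!("ok: {body}"),
        Response::Error { code, message } => format!("error {code}: {message}"),
    }
}

fn main() {
    let ok = Response::Ok { body: "ready".into() };
    let err = Response::Error { code: 503, message: "try later".into() };
    println!("{}", describe(&ok));
    println!("{}", describe(&err));
}
```

The flat-struct alternative would compile with `body` and `message` both populated; the enum makes that a type error, which is the rail-keeping property discussed above.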
And just to say more: if a more elaborate type system lets you put more of that work in up front to constrain the program further, you could say that LLMs allow you to tolerate an even more elaborate type system. Maybe now dependent types are going to be feasible for people to learn and work with. Maybe we'll see Idris take off. Like, if there's a Diesel error, you'll be able to understand what it means. Easy, easy, easy. Listen, we're not living in that kind of a future yet, pal. Someday. Yeah. Artificial superintelligence required for that. No, I think the ASI is going to be like, I actually don't know what this error message means. I can't make sense of this thing. It's, like, 2K long. Yeah, that is really interesting. I just think, in general, having great type information... I mean, the code that would scare me the most would be pure JavaScript. I know it generates a lot of it, but we use TypeScript. I mean, David, correct me if I'm wrong, we use TypeScript for more or less everything. Plain JavaScript would really terrify me to use, because it's just so easy to have an issue that doesn't show up until you get into runtime. So my blog uses a static site generator written in JavaScript. And I don't really know JavaScript. I mean, I can hum a few bars, but that's kind of it. And I used LLMs a lot to get things the way I wanted them. And part of it is like, I don't give a shit, right? It's a static site generator. Okay, sure, it's going to generate it statically. And there isn't some runtime edge condition I need to consider. So it's like, eh, go for it. So it's going to depend on the context, you know. And I think that may be true in many JavaScript contexts.
And that's why, in the cases where people are writing front-end code and they have additional rigor they want to apply, they're using TypeScript or more robust languages. Yeah. Just to defend the JavaScript world a little bit: I think the spectrum of rigor that you might need applies in a lot of different situations. Like, you might make a one-off Rust CLI as a debugging tool in that same situation. You can tell if it works by running it. The depth of static analysis — you don't really need that, because you run the thing, it does what you want, and you can tell that it worked. So there's a lot of situations... I think that's kind of an underrated point. People assume it's all or nothing: the code needs to be perfect or it doesn't work at all, which is ridiculous. Our most rigorously engineered code is still going to have some bugs in it. So obviously there's a spectrum of the amount of bugs that we can tolerate, or the amount of leeway that we have. Yeah, that's fair. Rain, do you want to talk about some of your more recent experiments with LLMs? Because you've really kind of gone nonlinear with some of the things that you've been doing. And in particular, getting past the, okay, these things are kind of experimental, and getting into the, no, no, actually — I don't want to say we assume them, because we're not really assuming them, but we kind of acknowledge that these things can actually be used as part of software engineering. Do you want to describe some of the things that you've done recently? Yeah. So this is a project that a bunch of us were discussing, and I decided to take it on sometime around early December.
And so the project here is that, as some listeners may have heard, we have done a lot of work in building automated update for our system, right? So we have the self-service update thing now. And one of the things it has to cope with is the fact that you're not going to be able to update everything atomically, right? You're not taking the entire system down and back up again. And so you need to deal with how you manage this kind of skew while an update is happening. So my colleague, Dave Pacheco, has done this genuinely brilliant design where there are the server-side APIs, and the idea is that this is an API that can talk to multiple versions of clients. And so you update the server first, and you have this DAG of dependencies that you update. It's just this really well-constructed system. It's pretty great. So one of the issues we ran into is that, as we gained experience with the system, we were having trouble figuring out, with all these different versions: you have a type, right? And that type has the same name, but it has different fields, for example, or maybe one of the sub-fields is different. And so how do you actually store those in the repo, right? It sounds like a simple problem, but this actually blows up and becomes this incredibly complicated problem with many, many different factors involved. So again, this is one of those things that is this combination of human and LLM work, where I spent a bunch of time prototyping a bunch of things and coming up with an approach that works and that satisfies all of the hard constraints, and also as many of the soft constraints as we can. And so this was a lot of work.
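A toy sketch of the problem Rain is describing — this is not the actual Oxide design, and the type names and fields are invented: two versions of a type share a name but differ in their fields, so one way to store both in the same repo is to give each version its own module and write an explicit conversion from the old shape to the new one.

```rust
// Version 1 of a hypothetical API type.
mod v1 {
    pub struct Instance {
        pub name: String,
        pub memory_mb: u64,
    }
}

// Version 2: same type name, but a field changed shape (here, units).
mod v2 {
    pub struct Instance {
        pub name: String,
        pub memory_bytes: u64,
    }

    // The skew-handling lives in one explicit, reviewable place.
    impl From<super::v1::Instance> for Instance {
        fn from(old: super::v1::Instance) -> Self {
            Instance {
                name: old.name,
                memory_bytes: old.memory_mb * 1024 * 1024,
            }
        }
    }
}

fn main() {
    // The server speaks v2 internally but can still accept v1 input.
    let old = v1::Instance { name: "db0".into(), memory_mb: 2048 };
    let new: v2::Instance = old.into();
    println!("{} -> {} bytes", new.name, new.memory_bytes);
}
```

The real problem blows up well beyond this — many versions, nested sub-fields, generated OpenAPI clients — which is why the repo layout needed a whole RFD rather than a `From` impl.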
And then one of the things I found really useful for LLMs is that what I did was essentially compile the final state we want to get to, writing it out as a set of instructions that both a human and an LLM can follow. And so this guide that I dropped a link to — this is RFD 619, and the guide is in Section 5.1. And so this 5.1 is this initial migration, right? And so, again, I spent a couple of weeks working on this whole RFD. And then what I did was, okay, I just fed this guide into the LLM, and I told it to migrate a small repo, right? One of our smaller APIs. And it just did that in one shot. So this was not a very big API. It just did that, right? I found that, okay, there were a few things that were unsatisfactory. So I went back and changed the guide. I updated the guide. I started from scratch. So I iterated on it. I want to say, overall, this guide went through maybe a couple dozen iterations of me looking at the LLM output and being like, okay, this is great, or this is not good, and so on. And we basically ended up converging on something that is this clear, very reproducible set of instructions that are simply way too complicated to capture in any deterministic algorithm, right? There's enough judgment here, and it's just this really complicated set of things that, I mean, there is no way I can write a migration tool to do this. Maybe someone smarter than me can do that. I don't think I can. But what the LLM let me do is, again, design this guide once and then apply it everywhere.
So it was funny, because there was one morning where I just rapidly put up three PRs, where the first one was 1,000 lines of code, the second one was 2,000 lines of code, and the third one was 3,000 lines of code. And I got all three of those done in an hour. And that was just wild. And this is one of those things where it turns out that LLMs are really, really good at following instructions that are clearly written, and written in a way that the LLM works well with. So this is, again, one of those things that sounds so, you know, mid-priority, right? It's like, how are we going to migrate 40,000 lines of code and rearrange the types, right? This is the kind of thing that just falls through; people just don't do it, or we might do it in the future and there's this long migration period. This is the kind of thing that you do in, like, tech debt week — but this would be more like tech debt month, right? But an LLM just, as I said, nailed three different APIs in one hour. And that just blew my mind. You can spend two weeks carefully designing a thing and then just have the LLM repeat that pattern over and over again. It was also really helpful for the process of iterating on the guide itself, because if there was something I wasn't satisfied with, or maybe one of our coworkers had some feedback on something, I could very quickly update the guide, right? And then I would be like, okay, run jj diff on the changes that I made and replicate those changes into this prototype that we're working on. And it just did that. And it was amazing. It was one of those, like, wow, you could just do that.
Something people have mentioned in the chat, and we haven't talked about too much, is that something that really helps the LLMs in these kinds of loops is having a signal — a verification signal — that can tell them when they're done and how far away they are from it. And types checking and tests passing are obviously those things. But I'm curious how you think of what the verification signal is to the LLM as it's doing this. Does it satisfy these natural-language requirements? So this is an interesting question. In this case, we had a couple of hard verification signals. The first one was just what you described: the code compiles. That is the most fundamental requirement. And then the tests pass. And we have a lot of deterministic validation. In fact, a bunch of this actually uses the work Adam was describing on being able to compare OpenAPI documents to make sure that if there are changes, those changes are only trivial ones. And so we put a lot of work into that, and having all that deterministic validation was really helpful. What I ended up doing for some of the fuzzier signals here was that, after it did this work, I would start a new context window, feed the guide in again, and ask it to carefully review the current PR for conformance with the guide, right? And Claude's like, who the hell wrote this? This needs a rewrite. Yeah, interesting. And so that ended up finding a bunch of degrees of freedom, some of which I wanted and some of which I didn't. But that was a good experience. I would just do that two or three times. And then, obviously, I would go through and manually review and make sure that everything aligned.
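A toy version of the "only trivial changes" check described here — the real validation compares full OpenAPI documents, while this sketch (the function name and the map-of-operations model are assumptions) reduces it to its essence: a change is trivial if every operation in the old document is still present, unchanged, in the new one, i.e. the new document only adds things.

```rust
use std::collections::BTreeMap;

// Map each operation ("METHOD /path") to a stand-in for its shape.
// A change passes if the new document is a superset of the old one.
fn only_trivial_changes(
    old: &BTreeMap<&str, &str>,
    new: &BTreeMap<&str, &str>,
) -> bool {
    old.iter().all(|(op, shape)| new.get(op) == Some(shape))
}

fn main() {
    let old = BTreeMap::from([
        ("GET /instances", "list"),
        ("POST /instances", "create"),
    ]);

    // Adding an operation is trivial...
    let mut added = old.clone();
    added.insert("DELETE /instances/{id}", "delete");
    assert!(only_trivial_changes(&old, &added));

    // ...but changing an existing operation's shape is not.
    let mut breaking = old.clone();
    breaking.insert("GET /instances", "paginated-list");
    assert!(!only_trivial_changes(&old, &breaking));

    println!("validation ok");
}
```

Checks like this are what make the LLM loop converge: the model gets a hard pass/fail signal rather than a human eyeballing every diff.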
But again, that felt like a very quick process, because I was able to spend maybe five minutes doing the migration and then another five minutes reviewing it. And that was it, right? It's wild. And I mean, this is just a much more elaborate example of your IDDQD example: look, I've done this once, I need you to do it in these subtly different but important ways that are kind of tedious. In many ways, a much more elaborate version of that, where it's like, okay, we designed this RFD very deliberately, and a lot of engineering has gone into the way we think about doing this, and that has come out from actually doing it by hand and so on. And now we actually need to knock this out for a bunch of these different services. Someone in chat described it as using English as a programming language. And yeah, this is basically using English as a programming language for programs that are just too hard to write in a deterministic computer language. That's what it felt like doing. And I think it's kind of remarkable: these are the kinds of things that you would absolutely have humans do before the advent of this stuff. Or not — I mean, just to your point, the work is just not done, and someone's like, hey, I was in this service, and it has a different, like, what's going on over here? It's like, oh, we just haven't gotten to that one yet. Go look at this dashboard from two years ago, and we're waiting for the next tech debt week. You're just like, oh my god, I can feel the tech debt flu coming on for tech debt week. And for me, there's a way David put it that was really
memorable: code that uses LLMs extensively had better be the best freaking code on the planet, right? If you're doing this, all your code should be extremely tight. You should put all the work into refactoring, good documentation — all of these things that I think many of us feel have kind of slipped down our priority list. It is very helpful to think of these tools not as ways to improve the velocity of what you do, but as ways to improve the quality of what you do. And so, if there is one thing that I want people to take away, it is: slow down, right? Don't just spit out as much code as possible. Instead, use the LLM — which is a tool — to be like, okay, maybe let's refactor this. Maybe let's split this up. There are so many things you can do to improve code quality along the way that will lead you to higher code quality than you would otherwise be able to reach in the same amount of time. Rain, I just cannot emphasize enough how important this is. If you're listening to this as a podcast, please go back and re-listen to what Rain just said, because I think it is so important. And it is so important to realize that you've got this power now to go deliver a higher-quality artifact. Yes, the world emphasizes velocity, which is a term that I, again, don't like, because it makes us all sound like projectiles. But what this allows us to do is do things that we simply never would have gotten to before, things that allow for more rigorous artifacts. And I think you can make an argument that the software we write is going to kind of bifurcate.
Adam, some of it is going to be your JavaScript in your static site generator, which, to quote your own language back to you, you quote, do not give a shit about. Am I saying that correctly? I stand by that. But then, underneath that, are these rigorous artifacts. In a world where we're doing much more software, we actually need these rigorous artifacts to work much better. And, I mean, this is the Gen X fossil time hour, where maybe we can knock down some things on people's bingo cards: software in the 90s sucked. Operating systems had bugs that you would hit frequently. Compilers had bugs that you would hit frequently. I mean, ultimately, the day I put C++ down was because I was dealing with two different compiler bugs simultaneously, and getting basically random results. And that was common in the nineties. And man, go have a compiler bug to really take the wind out of your sails — let alone two of them. And we needed to get to a world where we had open source artifacts that we could make much higher quality. And the quality of software went way, way, way up as a result; we could do more of it. And boy, do I see that happening vividly here. Yeah. No, I think you're right that the reason why I don't care about the quality of my blog is, yeah, that's not a foundation on which I'm going to build decades' worth of technical innovation. That's one and done. And I think there's lots of software that fits that model. And I think that's where you get the slop — you know, slop as a pejorative term for some of this code — and it's sort of fine.
Like, if you're building something that is a one-off, that is associated with some time and place and whatever, fine. And yeah, there's going to be a lot more of it, and that's frustrating. But on the other hand, for the stuff that is foundational, that has always been rigorous and whose rigor is increasing, this becomes a lever by which the rigor continues to increase. Yeah. I mean, for me, there are so many things that I feel like I've been able to do with this to increase rigor. My interest as a professional is really focusing on rigor. And my background is in dev tools, where correctness is absolutely essential and non-negotiable. And for me, there are so many more tests that I'm writing now. Like, the other day, I was like, I want to learn how to use Kani, right? Which is this model checker for Rust. And I wanted to use that, right? And there's always been this activation energy: you have to go read the documentation and stuff. So what I instead ended up doing was, I took an existing project that I had, which I felt was a good fit for Kani, and I just asked Claude Opus 4.5 to, hey, come up with a few properties that we can verify that way. And it just did that. And now I understand how this stuff works and what the limitations are. There are so many ways you can use this stuff to go increase the level of rigor in your software. And honestly, it really bothers me that the dominant narrative is the whole slop, vibe code stuff, right? Because, for us infrastructure engineers, there's so much more you can get out of it. Yeah, but hasn't that always been the case for the kinds of code that we care about, Rain?
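To give a flavor of the Kani experiment described here (the specific property is my invention, not one from Rain's project): a Kani harness states a property over an arbitrary input, and `cargo kani` proves it for all values. The `proofs` module below shows that shape — it only compiles under Kani's toolchain, so it's gated behind `cfg(kani)` — while `main` checks the same property exhaustively in plain Rust so the idea is runnable as-is.

```rust
fn is_power_of_two(v: u32) -> bool {
    v != 0 && v & (v - 1) == 0
}

// Shape of a Kani proof harness; verified for ALL u32 by `cargo kani`.
#[cfg(kani)]
mod proofs {
    #[kani::proof]
    fn next_power_of_two_is_correct() {
        let x: u32 = kani::any();
        if let Some(p) = x.checked_next_power_of_two() {
            assert!(super::is_power_of_two(p));
            assert!(p >= x);
        }
    }
}

fn main() {
    // The same property, checked exhaustively over the u16 range:
    // the result is a power of two and is >= the input.
    for x in 1u32..=u16::MAX as u32 {
        let p = x.checked_next_power_of_two().unwrap();
        assert!(is_power_of_two(p) && p >= x);
    }
    println!("property holds");
}
```

The appeal Rain describes is that the harness is just a few lines once you know the shape — which is exactly the scaffolding an LLM can hand you to get past the activation energy.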
That, like, one of the things that's beautiful about Oxide is we go to a demo day where, Rain, you show off this 30,000-line change or whatever, or I show off this library that compares one thing to another thing, and people are hooting and hollering. As opposed to, you know, systems demos are traditionally seen as boring: the thing that's whizzy is when you can demo something cool and graphical and whatever. And rigor doesn't have the same kind of sex appeal to everyone. Yeah, totally. And Rain, you're also right about the dominant narrative. And I was trying to think — clearly it is truly a dominant narrative, in that it's dominating kind of everything. And Adam, I was trying to think back in terms of our careers: when have we had these kinds of big narratives where it feels like it's reductive? One thing I was thinking about was the rise of Java. The rise of Java was really suffocating, because there was this idea — and it's very different, so I don't want to be too reductive here — but with the rise of Java, there was this idea that it's the end of every other programming language. Like, this is what we're going to do. And this is kind of crazy to think about, because, right, it's humorous now, but at the time there was this idea that everything's going to be in Java. We're going to do the operating system in Java. The microprocessors are going to execute Java bytecode. And at Sun at the time, it was like, I know this is not right. And I think Java is really powerful and important, and it's going to allow many more people to write software. And I remember thinking at the time, well, at least it's the death of C++. Ha, ha, ha. But it took a while for people, and some failed experiments, right?
It took NanoJava and picoJava inside of Sun, and two different Java OSs inside of Sun. So there were a bunch of places where we got there, and then people were like, okay, no, this thing is important and it has a role, but it's not everything. And it wasn't just all languages; it was operating systems and operating environments, right? The write-once-run-anywhere meant you don't have to worry about the details of Mac and Windows and Unix and all the different flavors of Unix. No, you just write it once and you run it anywhere. And it meant all of that other stuff was just going to become meaningless, and the only thing that was going to matter was Java. You're totally right that it took all the air out of the room for a big chunk of the late 90s, maybe early 2000s. Totally. And if you were implementing in C, it's like, well, I hope the past is working out for you. This is the whole idea that you are actually a living fossil, and Java is actually going to come to replace you. And in some ways, I really do think it was kind of worse, because if you were doing what we were doing — in the operating system, developing this thing in C — Java didn't really have anything for us. I mean, we did a little around the margins, but not our tooling; even the value that Java legitimately delivered, we didn't really realize any of that. And ultimately, we had a good relationship with Java. But whereas with LLMs, it's like, no, no, everybody can actually up their game with this thing, in a way that's really exciting and uplifting. Yeah. Spot on. Well, David, Rain, anything else? I know there's obviously a lot to talk about here. I think we covered everything there is to say about LLMs.
I mean, the thing I will say personally is that having a culture where writing things down is valued is a real multiplier here. And so, at Oxide, I'm very happy that all of this work we do, all this writing work that we culturally do, we now have a new way to gain leverage from. If you're in a place that maybe doesn't have requirements as rigorous, or, like, isn't as committed as Oxide, or doesn't ship hardware or whatever, I would still consider doing the work to write things down and produce good documentation, good design documents, because at least the current generation of LLMs really like that. And so, you know, get a little more disciplined with some of these things, right? So yeah, that's what I would say: write things down. That's great advice. And actually, let me ask you to expand on that just half a beat, because I do feel, as part of the deep blues — and it's unclear to me, by the way, if this is truly young people, like undergraduates, versus a more mid-career malaise, and maybe the deep blues cut across all of it — but people are wondering, like, what is my role in this new LLM age? What would be some advice that you would give to an engineer that's early in their career and looking at this stuff? Honestly, this is kind of the advice I would give. I would say: practice writing. For me, writing is not a natural skill. This is something that has taken me many years of work to get to where I am now. I would say, if you're starting out, practice writing. Don't have the LLM write things for you, but do feed your writing into the LLM and see how it behaves when you do that.
And that is the one bit of advice that I think is timeless, in the sense that we have always written things down and we will always keep writing things down. There's always a lot of value in that. But in the LLM age, this is one of those ways you can really multiply the amount of rigor you have. Yeah, that's great advice. The advice I would add is: hey, you've now got the ability to pick up a new language, pick up a new system, much more quickly than before. And you should use that as a way of getting into something you maybe would have been intimidated by. I mean, look, kernel development feels intimidating to people. Lots of people don't pick up kernel development because they're intimidated by it. And if you view an LLM as giving you the opportunity to jumpstart yourself in kernel development, go for it. That's great. That has a very robust basis. So hop in there, and hop into illumos or something that you wouldn't do otherwise — maybe a database, what have you. And use that LLM to get you jumpstarted and to get you mastery over this thing. LLMs don't judge. Ask all the questions. Yes. If anything, they could judge just a little bit more. Be like, that is kind of a bad question. Okay, this is why pair programming never really worked out for me: because you always have someone being like, you don't use Dvorak? No, I don't use it. Or, do you know there's actually a faster key binding? It's like, no, aren't we trying to work on this problem together? Why are you coming at me like this? You don't use syntax highlighting? What am I even here for? You know, it's like, okay, we're just now having fights over things.
And, you know, you don't have those fights with the LLM. Although maybe I should confide to Claude Code that, by the way, I don't use syntax highlighting — what do you think about that? Let's see. But yeah, it's free of judgment, which is really terrific. Well, thank you all. I know this is a hot topic, and I'm hoping that we can show that big moderate middle, and really show that there is a third path — which is, by the way, the most likely path — which is that we actually use these things as tools. They're not coming to replace you; they are actually going to allow you to do a lot more. And the one that should be most worried about LLMs is Stalebot. Death to Stalebot, I say. Adam, thank you for stoking that rage. But thank you all. I think this is really great stuff. Thank you for coming in on a hot topic. And thank you all in the chat, too; I think this is really important. This is not going to be our last LLM episode this year, I don't think. Adam? That's a great prediction. It feels like a lock. All right. Thanks, Rain. Thanks, David. Thanks, Adam. Take care.