"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Infinite Code Context: AI Coding at Enterprise Scale w/ Blitzy CEO Brian Elliott & CTO Sid Pardeshi

117 min
Feb 4, 2026
Summary

Brian Elliott and Sid Pardeshi, CEO and CTO of Blitzy, discuss their AI-powered enterprise software development platform that can autonomously complete 80-90% of large-scale coding projects. They explain their approach to infinite code context through knowledge graphs, multi-agent orchestration, and how they ingest 100+ million line codebases to deliver enterprise-grade code generation.

Insights
  • Effective AI systems require orchestration of multiple models rather than relying on single LLMs, with different model families (OpenAI, Anthropic, Google) excelling at different tasks and providing better results when reviewing each other's work
  • Context management is more critical than raw context window size - effective context windows depreciate significantly as they fill up, requiring sophisticated just-in-time context injection strategies
  • The future of enterprise AI development lies in dynamic system design where agents, prompts, and tools are generated just-in-time rather than hard-coded, allowing systems to improve automatically as models advance
  • Memory and relational understanding at the application layer will be more important than fine-tuning for achieving enterprise AI autonomy, as local contextual decisions cannot be effectively compressed into model weights
  • The software engineering job market is shifting to favor senior developers short-term but will eventually favor junior developers who effectively use AI tools, as code generation becomes commoditized
Trends
  • Shift from fine-tuning to memory-based AI systems for enterprise applications
  • Dynamic AI system architecture that adapts automatically to model improvements
  • Multi-model orchestration becoming standard practice for high-quality AI outputs
  • Test-time inference and reasoning budgets replacing temperature as primary model control mechanisms
  • Enterprise AI moving beyond code snippets to full autonomous project completion
  • Knowledge graphs and relational understanding becoming critical for large-scale AI applications
  • AI-native software development lifecycle replacing traditional development processes
  • Junior developers with AI skills becoming more valuable than senior developers without AI adaptation
  • Commoditization of code generation leading to focus on system design and architecture
  • Enterprise AI requiring parallel environment deployment and testing for quality assurance
Companies
Blitzy
AI-powered enterprise software development platform that autonomously completes 80-90% of large-scale coding projects
OpenAI
Provides models used in Blitzy's multi-model orchestration, particularly strong for structured output and code review
Anthropic
Provides Claude models used for first-pass code generation and reasoning, part of Blitzy's three-model approach
Google
Provides Gemini models used for long-horizon work and task progression in Blitzy's system
LangChain
Provides the LangSmith tracing product used by Blitzy for logging and monitoring agent interactions
Nvidia
Former employer of Blitzy CTO Sid Pardeshi, where he was a prolific inventor
Tasklet
AI automation company mentioned as example of AI maximalist approach with rapid stack rebuilding
Apache
Apache Spark used as evaluation benchmark for testing Blitzy's performance on large codebases
Cerebras
Mentioned as example of fast token generation for AI model inference
Sentry
Application error monitoring tool mentioned as an example of defensive coding practices
People
Brian Elliott
CEO of Blitzy, discussing enterprise AI software development and autonomous coding systems
Sid Pardeshi
CTO of Blitzy, former Nvidia inventor who developed the core technology for large-scale context engineering
Daniel Miessler
Creator of personal AI infrastructure framework, known for 'harness is more important than model' philosophy
Andrew Lee
Founder and CEO of Tasklet, AI maximalist known for rapid stack rebuilding and 'speed is the only moat' mantra
Tyler Cowen
Referenced for 'you are the bottleneck' concept in relation to human involvement in AI systems
Sam Altman
OpenAI CEO quoted on paying exponentially more for marginally better AI results
Bill Gates
Quoted on schematizing the world to enable computer automation
François Chollet
Creator of the ARC-AGI benchmark, discussed in context of AGI definitions and test set limitations
Quotes
"We believe we can get AGI type effects out of non AGI LLMs."
Brian Elliott
"We might be the most bearish on LLM capabilities as a pure standalone, like single LLM asset. And am I the most bullish on the orchestration of those in long running complex systems?"
Brian Elliott
"Context is serial, information is relational."
Brian Elliott
"Long term memory I don't believe will be solved at the LLM level."
Brian Elliott
"The goal of the or the purpose of the system and application layer is to reduce entropy to get to reliable outcomes."
Sid Pardeshi
Full Transcript
4 Speakers
Speaker A

Hello and welcome back to the Cognitive Revolution.

0:00

Speaker B

Today my guests are Brian Elliott and Sid Pardeshi, CEO and CTO of Blitzy, a company that uses AI in just about every way you can imagine to help enterprise software teams implement large scale features and execute modernization plans with unprecedented speed. Regular listeners will know that Blitzy has recently come on as a sponsor of the Cognitive Revolution, and while this does technically make this a sponsored episode, you can rest assured that this conversation absolutely stands on its merits. In fact, I've noticed over time that my interviews with sponsors often end up being among my favorite episodes. And I think the reason is that founders who've achieved real product market fit are often unusually willing to share the nitty gritty details of their approach.

0:02

Speaker A

It's a uniquely effective way to convince.

0:47

Speaker B

Prospective customers that they're better off buying from an AI pioneer than attempting to recreate such a sophisticated system in house. And it also signals that their product is still rapidly improving. So over the course of the next two full hours we will go super deep on Blitzy's approach, what they mean when they say infinite code context, and what enterprise software development looks like when more than 80% of major projects can be done autonomously in days. Highlights include: the architecture they use to generate agents dynamically, just in time, with prompts written and tools selected by other agents; why they actually run enterprise apps in a parallel environment as part of their onboarding process; how they ingest 100 million line code bases and deliver value in the form of improved documentation, which also improves coding copilot performance even before the code generation process begins; how they use detailed knowledge graphs to support sophisticated context management strategies, which minimize models' context anxiety and other strange behaviors; the critical role of taste in evaluating new models and framework changes on such large scale projects; which models they find strongest for which purposes, and why they always use models from different developers to check one another's work; why they are more bullish on advances in AI memory than on fine tuning; how they came up with their $0.20 per line of code pricing model, and why they will do anything they can to deliver more value for customers, even if it forces them to raise prices in the future; what it will ultimately take to achieve 99% project completion and even full autonomy in enterprise software development; and finally, their outlook on the software engineering labor market, which favors senior engineers in the short term but junior engineers who can use AI effectively over time. Brian and Sid are both high energy guys and they were remarkably forthcoming in this conversation. I learned a ton and I expect that any enterprise software leaders who listen will come away thinking about specific projects where they'd love to put Blitzy to the test.

0:48

Speaker A

So, without further ado, I hope you.

2:51

Speaker B

Enjoy this deep dive into the present and future of autonomous software engineering with Brian Elliott and Sid Pardeshi of Blitzy.

2:53

Speaker A

Brian Elliott, CEO at Blitzy, welcome to the Cognitive Revolution.

3:02

Speaker C

Awesome.

3:07

Speaker D

Let's get into it.

3:07

Speaker A

One of my favorite things to do in life is talk to AI maximalists. And I've known Blitzy by reputation for a while as the company that has figured out a way to create infinite code context. And it doesn't get more maximalist than infinite. So I'm excited to unpack what you guys are building, how it all works, and the impact that it's having on the enterprise software industry. We're going to go through all the layers. But first question, just to orient myself and the audience to you: how AGI-pilled are you? How AGI-pilled is Blitzy? How AGI-pilled are your customers?

3:08

Speaker D

We believe we can get AGI type effects out of non-AGI LLMs. Right. And so as folks are thinking about the impact of artificial general intelligence, they're talking about huge swaths of work being able to be done to provide economic value autonomously across domains. Right. That's one amongst many definitions, and it's a moving target for defining AGI. And so the core question is, how can you achieve that output with the limitations and constraints of LLMs? We might be the most bearish on LLM capabilities as a pure standalone, like single LLM asset. And maybe the most bullish on the orchestration of those in long running complex systems.

3:44

Speaker A

Yeah, that really echoes a conversation I recently had with Daniel Miessler, who created this Personal AI Infrastructure framework. His mantra is: harness is more important than model. Obviously one big limitation there is that the context window is finite, and even at a million tokens, relative to the size of an enterprise code base, that's not nearly enough. Any other kind of limitations of LLMs as standalone creatures that you think are kind of most important to have in mind?

4:35

Speaker D

Yeah, so there's so many. Right. And being so forward on the limitations is what allows you to build something really powerful and really magical. Right. So context is one, but there's a difference between a context window and an effective context window. Right. So as you start to eat into, let's say 20, 30, 40% of a context window, there's a depreciation that occurs. Each model is a little bit different. There's lots of different ways to test this with, you know, several in-house benchmarks. But you start to lose intelligence and quality as you start to fill up even the advertised context window, right? And the depreciation is a little bit different by even task type. So what you want to do is really effectively manage the amount of work and the type of work that you are loading into a context window, while also pulling out anything that you don't need. So that's maybe a more nuanced view on the limitations of a context window. The other limitations, right, are how many tools an individual agent can effectively call. It used to be that they could call zero tools, then 1, 2 or 3, and now 8 or 10, right? And then tool selection in the agent itself is also something that you really need to understand, steer, and give only the correct tool access to. You can think of a tool as a calculator or a compiler or sort of any outside entity that the agent is calling. And then lastly, it's maintaining long running intent of the human, right? Or intent of the machine or instruction, right? It's a byproduct of context management, but it sort of has to do with attention in general. And so if you can design a system that says, great, LLMs are a very, very cool probabilistic type of computer, they have all these limitations, at least when leveraged as a single instance. And if you can accept those limitations and then build the harness or the cognitive architecture, you can really create something that can do AGI type effects.

5:08
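A minimal sketch of this effective context window idea, in Python, with purely hypothetical numbers (the real depreciation point varies by model and task and would come from in-house benchmarks like those described):

# Sketch: budget context against a fraction of the advertised window, since
# quality degrades as the window fills. All numbers here are illustrative.

ADVERTISED_WINDOW = 1_000_000          # tokens the vendor advertises
EFFECTIVE_FRACTION = 0.35              # hypothetical point where quality drops

def plan_injection(chunks, budget=int(ADVERTISED_WINDOW * EFFECTIVE_FRACTION)):
    """Pick the highest-priority context chunks that fit the effective budget.

    chunks: list of (priority, chunk_id, token_count) tuples.
    """
    chosen, used = [], 0
    for priority, chunk_id, tokens in sorted(chunks, reverse=True):
        if used + tokens <= budget:
            chosen.append(chunk_id)
            used += tokens
    return chosen, used

# Example: only what fits under the effective budget gets injected.
print(plan_injection([(0.9, "auth_service", 120_000),
                      (0.7, "payment_api", 250_000),
                      (0.4, "readme", 40_000)]))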

Speaker A

So I can't help but ask for a couple specific tips because right now I'm doing the work of kind of building out context of my own life, you know, pulling out all my email history, my Slack history, all the transcripts of the podcast, all this stuff into just this, you know, kind of big data soup. And now I'm trying to layer on various kinds of summaries, different angles at it. And in some ways this is probably quite similar to what you guys are doing with code bases, albeit for me it's just my own stuff. So I was just thinking earlier today, I wonder how much context I really should put into Gemini Flash or if that is the right model. Maybe there is a different model that even though its nominal context window is shorter, I would actually get better results for a given amount of context. How would you advise me? You know, are there any kind of top line heuristics that you would be willing to share that are like, this is what we see as the best. And this is kind of where it drops off.

7:08

Speaker D

Yeah, well, let's put a pin in the point of not using just one family of models at all to do this, and we'll cover that in a second. And let's talk about how you sort of manage this information. Right. So context is serial, information is relational. Right. And so that email connects to that thing that you said in the Slack message. Right. And those might be on different applications. And so the question is, what are the core relationships that govern this domain? Right. So we put out a paper about domain specific context engineering. Right. But what is core is that context engineering is not general, it is domain specific. Meaning there is a core set of entities that relate in certain ways inside of the domain of, let's say, personal life or work life. Right. And so you have to first understand and define those relationships and then pair that with semantic understanding. Right. And that is how you get closer to the context that might be important for any task, while removing the context that is not important for any task. That was a very broad philosophy, but the idea that semantic clustering is sufficient is really inaccurate.

8:04
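A rough sketch of what "context is serial, information is relational" could look like for the personal data case being discussed; the entity and relation names are invented for illustration:

# Sketch: explicit, domain-specific relations between items that live in
# different applications, layered on top of (not replacing) semantic search.
# Entity and relation names are invented examples.

from dataclasses import dataclass, field

@dataclass
class Item:
    item_id: str
    source: str                      # "email", "slack", "podcast_transcript"
    text: str

@dataclass
class DomainGraph:
    items: dict[str, Item] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # (kind, src, dst)

    def add(self, item: Item):
        self.items[item.item_id] = item

    def relate(self, kind: str, src: str, dst: str):
        self.relations.append((kind, src, dst))

    def neighbors(self, item_id: str, kinds=None) -> list[Item]:
        """Follow explicit relations rather than embedding similarity alone."""
        return [self.items[dst] for kind, src, dst in self.relations
                if src == item_id and (kinds is None or kind in kinds)]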

Speaker A

Yeah. Okay, I like where you're going with this. So what I'm doing right now is, again, starting with all this raw information, and then I'm kind of trying to build up layers of higher and higher order understanding. First of all, I'm just going, okay, let's create a timeline. So I grab whatever I have from all sources, sorted by date. So some might be emails, some might be podcast transcripts; throw all that into an LLM and say, you know, give me a summary of kind of what I was saying, doing, thinking about, whatever, at this given point in time; build out a timeline. Then on top of that, it'll be like, who are the relationships that really seem to matter over the course of all this time? Then it'll be like, what are the projects that I was engaged with, and which people was I working with on those? And so I'm kind of building up through all those levels right now. How does that play out? I'm sure it's, you know, again, an analogous thing. How does that play out in the context of a giant enterprise code base that you guys get your hands on?

9:21

Speaker D

Yeah, well, the approach that you're taking on that personal project will be okay at first and then get worse over time. Right? So you're at the personal project stage of, like, a slower mid market software application. So you can kind of just shove all the stuff in there and you'll get some approximately right results. But Gates had this quote: if you could schematize the world, you can get a computer to do anything. And so in your example, you're trying to schematize your life, right? In the example of code, you really are trying to schematize code and the relationships in code, agnostic of language. So, in the case of Blitzy, you could throw a 50 or 100 million line code base on it. And because we have a deep relational understanding that we built first (it takes a few days of compute to build), that is the base layer that allows us to do large amounts of development work autonomously. Like in your example, you first schematize your life; that might be dates, it might be months as a group, dates as a group, it might be other activities as a group that relate to other things. But you first and foremost need to understand what are the core relationships that govern the domain. And we have done that in a very, very unique way with code. So when an enterprise starts with us, they ingest their code, it takes a few days of compute, and we then have a deep and novel approach on the category of knowledge graphs. But that's maybe not sufficient to explain how deep the understanding is. At any line across a 100 million line code base, I can tell you exactly what is relationally relevant, down to the line level. So that when I generate code, I am injecting and pulling out the correct context just in time.

10:19

Speaker A

So obviously, like, dependencies is one core type of relationship within software. And I guess a lot of that has been done traditionally with static analysis tools. There are all sorts of tools that can go through and say this file imports these other things, and they import these other things, and so we can kind of fan out that way. So what's the breakdown between how much you're using those kind of static analysis tools versus LLMs to do this ingestion? And what's the double click on the nature of relationships that goes beyond dependencies?

12:02

Speaker D

Totally. And so if you think about ASTs, for instance, right, these are version specific, language specific abstract syntax trees. These are like a pre-LLM worldview of understanding the relationships and meanings within a language and version of a programming language. Right. So you can think of what we've invented as, so, not an AST, but something that resembles the characteristics of an AST from an accuracy standpoint, i.e. programming language agnostic, designed for AI agent traversal. Right. And so that was a lot of words, but think globals, classes, variables, functional relationships inside of an application. And so by having the traditional, I would say programming language agnostic relationships on top of actually building and running the application, which we do as we create relationships, you're able to create a much deeper understanding. And I think one of the powers is you're really not able to get understanding unless you are building and running applications and putting that through the paces, to understand everything from what you said on the left side, which is dependencies, to how things are relating when they're running in production and have actual logs running, and understanding those relationships. So you can imagine the spectrum of compile time, runtime, production at load items that a software development team might look at, and those ultimately form the base of the relationships that basically schematize enterprise code.

12:35
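A toy sketch of a line-level relational index like the one being described; the structure is illustrative only, not Blitzy's actual representation:

# Sketch: map (file, line) locations to relationally relevant locations
# (callers, definitions, configs), then fan out a bounded number of hops to
# assemble just-in-time context for a generation task.

from collections import defaultdict, deque

class CodeGraph:
    def __init__(self):
        self.edges = defaultdict(set)        # (file, line) -> set of (file, line)

    def link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def relevant(self, seed, hops=2):
        """Everything within `hops` relational steps of a seed line."""
        seen, queue = {seed}, deque([(seed, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == hops:
                continue
            for nxt in self.edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return seen

# Example: the line being edited pulls in its caller and that caller's config.
g = CodeGraph()
g.link(("billing.py", 42), ("api.py", 7))        # call relationship
g.link(("api.py", 7), ("settings.yaml", 3))      # config dependency
print(g.relevant(("billing.py", 42)))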

Speaker A

Yeah, okay, that's really interesting. First of all, do I understand correctly that you are literally running enterprise applications in your own, like, parallel universe? Because of course your clients are continuing to run their applications in production. Right. So you've got to identify...

14:00

Speaker D

We do that, often in their own cloud environment, sort of spinning it up again in their cloud environment. But to get started, one of the reasons it takes days, not, you know, zero minutes, to get started is getting access to your environment, getting all the necessary keys so that you can spin up these applications and run them. But it's cool because when you get large scale code outputs from Blitzy, you'll also see the QA that we did, like the screenshots of an agent clicking through and running an application in production. And so that happens both upon ingestion, to make sure that we can run and build the application, and then at code generation as we go through QA. Running the application is core to getting high quality code, because you need a recursive correction loop not just when something doesn't compile and build, but when it doesn't act in production how you're expecting it.

14:19

Speaker A

Yeah, just the feat of managing to actually stand up another parallel instance of the production application is like I'm sure not trivial in many, many cases.

15:08

Speaker D

Like, you need to seed the database, right. There's real implementation work, right.

15:21

Speaker A

There's a lot of times where people don't even, because they haven't really done it, or this thing has been kind of running the way it's been running for a long time. I bet a lot of times they don't even have a sort of ready plan for how you would even do that. Right.

15:26

Speaker D

A lot of applications, in insurance of all places, for whatever reason, they just really have no way to provide us these instructions. And so what we'll do, we'll go through this iterative approach, which provides value even in the approach, where we will take the information they think it takes to run the application, and then Blitzy will find the limit case of not being able to do it. And then we'll say, hey, we don't have access to this package. And they're like, okay, well, I had no idea it depended on that package. And so you're able to kind of go through this process of actually creating the correct build instructions for the application that's, you know, essentially been sitting somewhat dormant, that they want to activate or move over into a more modern technology stack, as a part of getting Blitzy to stand it up. So we've provided value just in implementation, I would say. But it does come with obvious challenges, and in lots of old enterprises, for instance, to build the application, it's not as if you're just writing a script or a package. It requires what would have typically been a human, and dialogue boxes popping up and putting information in. But Blitzy is sophisticated enough to spin that up and then put in user creds, and we're in their IP to build an application. That's how Windows applications were built back in the day. And so it requires real build sophistication in the application to get this level of fidelity.

15:39

Speaker A

Hey. We'll continue our interview in a moment after a word from our sponsors.

16:55

Speaker B

Want to accelerate software development by 500%? Meet Blitzy, the only autonomous code generation platform with infinite code context, purpose built for large, complex, enterprise scale code bases. While other AI coding tools provide snippets of code and struggle with context, Blitzy ingests millions of lines of code and orchestrates thousands of agents that reason for hours to map every line level dependency. With a complete contextual understanding of your code base, Blitzy is ready to be deployed at the beginning of every sprint, creating a bespoke agent plan and then autonomously generating enterprise grade, premium quality code grounded in a deep understanding of your existing code base, services and standards. Blitzy's orchestration layer of cooperative agents thinks for hours to days, autonomously planning, building, improving and validating code. It executes spec and test driven development done at the speed of compute. The platform completes more than 80% of the work autonomously, typically weeks to months of work, while providing a clear action plan for the remaining human development. Used for both large scale feature additions and modernization work, Blitzy is the secret weapon for Fortune 500 companies globally, unlocking 5x engineering velocity and delivering months of engineering work in a matter of days. You can hear directly about Blitzy from Fortune 500 CTOs on the Modern CTO or CIO Classified podcasts, or meet directly with the Blitzy team by visiting blitzy.com. That's B-L-I-T-Z-Y.com. Schedule a meeting with their AI Solutions consultants to discuss enabling an AI native SDLC in your organization today. The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a timesaver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English: send a daily briefing, triage support emails, or update your CRM. Whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works. No flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at tasklet.ai and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at tasklet.ai.

16:59

Speaker A

You guys have been at this for a few years, right? So one big question I had is, obviously the capabilities of models have changed dramatically, right? In terms of the ability to look at a screen and understand what's going on, I think we saw that kind of demoed for the first time with the GPT-4 launch, but it was still pretty rough around the edges and not really even available for a while after that. On the computer use benchmarks, we're kind of in the steep part of the S curve right now. I remember fondly, but also with frustration, the experience of early computer use agents: even if they could see the button, they couldn't necessarily click on the button. They couldn't quite find the right place to click. So that stuff has all improved dramatically. How do you think about turning Blitzy on itself? I recently did an episode with Andrew Lee from Tasklet, and he's another AI maximalist I really enjoy talking to. One of his mantras is: in the AI era, speed is the only moat. And he takes a lot of pride in just how fast they rebuild their stack from the ground up. So I guess, what would be the big unlocks that you have seen, in terms of, okay, models couldn't do this before, we had to do all this stuff to compensate; now they can, so we can kind of simplify that, or we can aim higher in terms of what we could do. I'd be interested in what those big milestones would be as you look back, and just how often do you find yourself having to do major modernization work on your own stack, even if that modernization is only a few months from the last version to the new version?

20:00

Speaker D

It's such a good question. And so when we started building Blitzy in 2022, we essentially made a bet that the models were going to get better way faster than anybody in the market expected. And so we started building for a future universe that wasn't here when we were doing all of the design and all of the work of this. There's no MVP of Blitzy; it's an end to end platform experience. Right. And so the world that we built for over the last three years and the world in 2025 essentially intersected, right? These things were going to continue to get really, really good, and we were correct. And so when you are building systems for an ever improving state of LLM intelligence, you want to build the systems dynamically. So when people talk about building harnesses, they are sort of hard coding and codifying actions based on the level of LLM intelligence and capabilities. Right? And so those harnesses depreciate as LLMs get better. And the level of depreciation is tied to how hard coded your design is, let's say, and the rate of intelligence increase. Right. And so everything that we do in Blitzy is dynamic design, meaning Blitzy's agents are generated dynamically just in time, prompts are written by other agents, tool selection is assessed just in time by context injection, right? And then the whole planning process, right, that governs all of this is sort of chunked and revisited iteratively, right? And so as the models get better, it's just great for us, right? We can more or less just do more. And it's a config file to toss in a different LLM. But because everything inside of the system is dynamic, we don't feel the depreciation that one would typically feel when building harnesses in the classical way that folks build harnesses today. Like, for instance, as a new model comes out, new prompting instructions for that model come out. Our agents just reference the latest prompting instruction tied to their model, and that self writes a prompt for an agent that's injected, right? And so it doesn't matter that the prompt guidance changes for the next Gemini model, right? The agent will go reference that as it is dynamically writing a prompt for another agent.

21:33
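A minimal sketch of that dynamic design, assuming a generic call_llm client (a placeholder, not a real API): the worker's system prompt is written at runtime by another agent, and models live in config rather than code:

# Sketch: agents generated just in time. The worker's system prompt is
# written by a prompt-writer agent, and swapping models is a config change.
# `call_llm` is a stand-in for whatever LLM client is actually used.

MODEL_CONFIG = {"prompt_writer": "model-a", "worker": "model-b"}

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("placeholder for a real LLM client call")

def build_agent(task: str, tools: list) -> dict:
    meta_prompt = (
        f"Write a system prompt for a coding agent running on "
        f"{MODEL_CONFIG['worker']}, following the latest prompting guidance "
        f"for that model. Task: {task}. Allowed tools: {tools}."
    )
    system_prompt = call_llm(MODEL_CONFIG["prompt_writer"], meta_prompt)
    return {"model": MODEL_CONFIG["worker"],
            "system_prompt": system_prompt,
            "tools": tools}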

Speaker A

So that sounds awesome. It sounds like you're living the dream in many respects. One thing that I do wonder about there, though, is how do you evaluate that? Because the typical harness, while it does have this depreciation, and I can speak to that: when I tried this sort of personal AI infrastructure at various points in time, I kind of always felt like it wasn't really there to give me tremendous value. I think now we've maybe actually hit that point. But as I look back at some of this old code, I'm like, oh my God, an 8,000 token context window. When I first tried this, that was so limiting, and I was doing so many gymnastics to try to make that work. But one benefit of those gymnastics, or at least one thing that was easier, was I could at least kind of define an eval test set that I could wrap my head around, that I could look at and be like, okay, this makes sense in inputs and outputs. And I kind of can throw a new model at that and get a quick sense of: is it better, is it worse, whatever. When so much is dynamic, how do you think about evals? I guess one thing I could imagine would be you might do some fixed evals kind of as prep work, like, let's characterize the effective context window of this new model and then tell the system what its effective context window is, give it some sort of metacognitive information. But you've probably got lots of other insights into how to eval such a dynamic system. So, love to hear it.

23:59

Speaker D

So I think it's important that your evals map onto the real world as closely as possible. And so if you think about most evals in the world, they are designed to be easy, very easy for the human to evaluate. Right? Because it's like, all right, well, here's a function, and here's a different version of that function, and this other version is more accurate. Right. But that is a local optimization on an exponential technology. Right. And so our evals are a bunch of applications that we've built over the years that are of larger scale. Some of them, you know, started on open source, and we built our own versions of private applications over the years. And so we're testing Blitzy on executing what we ultimately want to be a 100% outcome, and we're seeing how close we get to that outcome with the new configuration of Blitzy. Right. And so we might give it a million lines. Maybe we'll give it Apache Spark; it's like 1.3 million lines of code. We have a custom configuration of Apache Spark from previous projects that we've done personally in life. We'll give those instructions, and we'll be able to see very quickly how close we got to 100% completion with this adjustment. And it requires an extreme amount of taste, because if you are not 100% there, right, you don't get to 100% of the result, 100% being what you did as a human in a previous life to get that to 100%. You're now saying, is this 85, 88, 90, 95%, 100%? Right. And it's the difference between functionally correct, which Blitzy can guarantee, like, hey, we passed every end to end test, we passed every integration test, we passed every feature test, and the final version that you actually intended to put in production. Right. There's always a difference between functional correctness and intent. And so it's that taste that is required to really improve the system and provide that feedback, on top of the sort of traditional large scale evaluation. So this is why I think it is really, really hard to build these systems without the right longitudinal experience to understand what great technical design and great implementation from a software perspective is. We always say Blitzy is the instantiation of, if you had Sid, my CTO and co founder, working at the speed of compute, because he's instantiating his technical taste into the outcome in a way that is really, really impressive for the enterprise. And of course, the enterprise can specify their own taste and their own rules, and the system will respect that. But that is how we do evals, which is at scale, at a very large point, with a lot of taste involved.

25:29
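As a loose sketch of this eval shape: run the whole system against a reference project whose known-good end state exists, then score the distance. File-level equality here is a deliberately crude proxy; the real judgment layers tests and human taste on top:

# Sketch: score a full run against a known-good end state. File-level
# equality is a naive proxy for the taste-driven review described above.

def completion_score(generated: dict, reference: dict) -> float:
    """Both arguments map file paths to file contents."""
    if not reference:
        return 1.0
    matched = sum(1 for path, want in reference.items()
                  if generated.get(path) == want)
    return matched / len(reference)

# Example: 2 of 3 reference files reproduced exactly.
ref = {"a.py": "A", "b.py": "B", "c.py": "C"}
gen = {"a.py": "A", "b.py": "B", "c.py": "c?"}
print(round(completion_score(gen, ref), 2))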

Speaker A

So that kind of final taste step. If I'm looking over Sid's shoulder as he's evaluating the work of a new model thrown into the Blitzy meta harness, what am I seeing him doing?

28:08

Speaker D

You are looking at the final output, but really, you're looking at the logs. Right. And so we use LangSmith for logs and tracing. Shout out LangSmith, big fan of the LangChain guys.

28:21

Speaker C

This is what.

28:29

Speaker D

This is their tracing product. So as you look at Blitzy logs, if you were to type them out on a piece of paper and I put it on a scroll, that would scroll all the way down to the end of the block. Right. The amount of agentic interactions that occur at runtime is absolutely massive, where you have somebody injecting context, somebody writing a prompt, somebody writing code, somebody reviewing that code, somebody building the code, somebody doing before and after local pass-to-pass, fail-to-fail, end to end. So that's happening to get a piece of functionality out in the bigger system.

28:30

Speaker A

Right.

28:58

Speaker D

So as you see these agents interacting, it's a lot like watching your engineers have a technical discussion on what correct might look like. Right. And so what you need to do is look at the final output of the meta harness, like the pull request here, and then trace back in the system: I didn't like what happened here; what happened in the system? Right. And how can I steer the system to dynamically be able to address this kind of instance in the future? It's a completely different approach to building software, because the outcome is a little bit emergent in a way, and you have to build the system to understand how to dynamically steer and validate to get to the right outcome.

28:58

Speaker A

Yeah. What does that steering process look like? Is it just like giving the system free text feedback? Because that sounds like its own...

29:44

Speaker D

Yeah, really, you try to be as algorithmic as possible. Right. And so as you think about chunks of work being completed: the first step that we'll do, after receiving a future state spec from the client (our system will work with you to get a future state spec of what you want; that's a web application portion), as that's sent off to do work, right, is that Blitzy starts off a planning process, and then it executes against that plan. And so each one of those planning steps, and each chunk of work (the planning, reading, testing, validating, QA), doing that recursively, is driven algorithmically to get to an outcome. And so it's tweaking the algorithms that govern the system to get to the right outcome.

29:54
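A bare-bones sketch of that algorithmic loop, with every step left as a placeholder function:

# Sketch: spec -> plan -> execute -> validate, with recursion on whatever
# failed. plan_fn / execute_fn / validate_fn are placeholders for the real
# planning, codegen, and build/test/QA machinery.

def run_project(spec, plan_fn, execute_fn, validate_fn, max_rounds=3):
    remaining = plan_fn(spec)                  # ordered chunks of work
    for _ in range(max_rounds):
        failed = []
        for chunk in remaining:
            result = execute_fn(chunk)
            if not validate_fn(chunk, result): # compile, tests, QA in prod-like env
                failed.append(chunk)
        if not failed:
            return "complete"
        remaining = failed                     # recursively retry what failed
    return f"incomplete: {len(remaining)} chunks for human follow-up"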

Speaker A

Fascinating. Hey, we'll continue our interview in a moment after a word from our sponsors.

30:43

Speaker B

Your IT team wastes half their day on repetitive tickets: password resets, access requests, onboarding, all pulling them away from meaningful work. With Serval, you can cut help desk tickets by more than 50%. While legacy players are bolting AI onto decades old systems, Serval allows your IT team to describe what they need in plain English and then writes automations in seconds. As someone who does AI consulting for a number of different companies, I've seen firsthand how painful and costly manual provisioning can be. It often takes a week or more before I can start actual work. If only the companies I work with were using Serval, I'd be productive from day one. Serval powers the fastest growing companies in the world, like Perplexity, Verkada, Mercor and Clay. And Serval guarantees 50% help desk automation by week four of your free pilot. So get your team out of the help desk and back to the work they enjoy. Book your free pilot at serval.com/cognitive. That's S-E-R-V-A-L.com/cognitive.

30:49

Speaker A

Going back to the kind of initial ingestion and sort of the knowledge graph that is created: I'd love to hear your thoughts on knowledge graphs and how they relate to RAG, and whether you guys are using embeddings. There have been, obviously, many different approaches and schools of thought here. I've always been attracted to the idea of knowledge graphs, but certainly for a long time RAG was kind of more in vogue, and then it seemed like a lot of times just dumping everything in the context window started to become the prevailing approach when possible. Obviously it's not possible for large code bases. Are you able to get to the point where you've mapped things out so well that you don't have need for kind of fuzzy semantic matching? Or do you also avail yourself of that, and kind of have: this is what we were able to find structurally that's relevant, and this is also maybe some other kind of relevant stuff that sort of fuzzy matched that you might want to be aware of?

32:00

Speaker D

Yeah, you really want to use both as a hybrid source of truth. Right. And then when there's conflicts, then you, the system, want to explore much deeper and much further. Right. The issue with RAG as a standalone item is sometimes people will rely on the RAG abstraction layer as the source of independent truth. Right. And so to answer your first question directly: you want to use both relational understanding and semantic understanding, and you want to pair those as agentic tools, so that you can arm the agent to use these different tools to go and pull the right information. But you really want to use these tools as an abstraction layer to go search the source of truth. Right. And so this is where you don't want to rely on the semantic match to pull out the truth. You want to rely on the semantic match as a map or a legend against the actual source of truth: go efficiently search, traverse and find that, and then pull the source of truth into the context window. Right. So it's really an efficiency search mechanism more than it is a storage of truth mechanism.

32:53
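A small sketch of "semantic match as a map, not as the truth": the index returns only locations, and the actual lines are re-read from the repository before entering the context window. semantic_search here is a placeholder:

# Sketch: the semantic index is an efficiency layer that returns locations;
# the context window is filled from the actual files, never from the index.

def semantic_search(query: str) -> list:
    raise NotImplementedError("placeholder: returns (path, start, end) spans")

def fetch_ground_truth(query: str) -> list:
    chunks = []
    for path, start, end in semantic_search(query):
        with open(path) as f:                # re-read the source of truth
            lines = f.readlines()[start - 1:end]
        chunks.append("".join(lines))
    return chunks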

Speaker A

So one thing I've observed, that I wonder how you address, is so often when I have an agent searching through whatever, right, my Google Drive or my Gmail, one huge disadvantage that it has relative to me is that I have this sense when I have found what I was looking for. And it's always clear to me: I'm always like, I've not found it yet, until I find it, and then I'm like, that's what I was looking for. That's obviously predicated on my historical familiarity and the fact that I was involved in creating all this stuff.

34:00

Speaker B

Right.

34:36

Speaker A

So I kind of know. Yes, that's the thing. The model obviously lacks that kind of deep familiarity, historical participation, and so it can't be so confident in general that like, that was the thing that I was looking for. So how do you guide models when they're doing this kind of search to make that judgment call of when to stop the search? I find that to be a very perplexing thing in my own building.

34:36

Speaker D

Yeah. So this is all about the mechanism of the request between yourself and the model. And so in the instance of, I have a fuzzy idea somewhere between some mental neurons on what I might want, you might actually be doing the most efficient thing by just going through and searching. But if you think about completing work, right, in a workplace, work follows some sort of structure, right? And so in software development, it follows a spec, right? And so therefore, and this is how people will do it, they'll express what they're roughly trying to achieve with Blitzy, it'll go look against the source of truth, and it'll come back with a plan in the form of a future state technical specification, like what architects deal with all day, to go do that work, right? And so until you can provide the system the right structure of output, it is unlikely, from a system level, to go and sort of do your bidding correctly. And so then the question is, how do you create the right interface experience to enable humans to enter with a fuzzy input, get confirmed on a structurally strong output, and then send that task off to the system, versus the experience that you just described, which is, fuzzy input is sort of all you get, right? And so some people use chat for this, right? Which is like, I'm roughly thinking about this idea, I think it's this thing tied back to this date. And then it can say, oh, is it any of these possible things that you want to go explore further? That's an intermediate abstraction layer ahead of the true deep search. And so it's all about creating an intermediate experience between the system of intelligence, the system of record, and then how you're sort of expressing that ask.

35:04

Speaker A

So as much as possible, basically you are giving when we actually are getting to the work stage in the process. You hope at that point that you have effectively given the agent everything it really needs to know, or at least the location of everything it really needs to know. And then it can kind of do additional search to read in the details of that file, that function, that service, whatever. But you've already had a human approve a plan and kind of sanity check it at that level. So it should have clarity basically on exactly what it needs to be reading.

36:42

Speaker D

That's right. And what's super important is the system is capable of doing both steps. Meaning, I can provide you what I'm trying to do inside of my, like, 30 million line trading system, right? And then, let's come back in about an hour after you give it this, let's say a page of general instructions that you're trying to achieve on the code base, and it'll come back with a very in depth implementation plan, because you didn't think about the edge cases or the services that it might touch. Right. The whole point is, it's impossible for a human to grok everything that this might affect. Right. And so that is sort of phase one of system interaction, being like, hey, heads up, human with limited human context window. Here is the plan that you expressed against this enterprise code base, and here's a bunch of things that we're going to have to do to implement this that maybe you did or didn't think of. And by the way, if you want to do this a different way, that's cool too. But let's assess and make those trade offs before we go off and write a hundred thousand or a million lines of code. Right. And so that experience of leveraging system intelligence to generate a clear version of work is required as sort of phase one, in order to do large volumes of work in phase two.

37:19

Speaker A

So earlier you mentioned that my approach is going to work until it starts to fail. What's going to cause it to fail, and what should I be mindful of? How do I know when I'm approaching failure, and how should I be prepared for those failures?

38:32

Speaker D

Yeah. And so maybe I'll start by saying how we recognize failure in our system, and then maybe we can map it onto your own passion project here, which I love. Inside of the Blitzy system at runtime, right, we are doing as much work as we can autonomously. You can think of it as spec and test driven development at the time of compute, and we'll retry and re-loop and recursively go back and self improve between running the application and getting the featured outcome. But at a certain number of attempts, we have to say, okay, we can't do this part. And so we have a separate and independent evaluation system that figures out what the desired end state was, what the system was able to do, and then writes the doc for the human: if we could optimally get to this end state, this is the most likely path that we believe a human could take that this system couldn't. So you need to build these mechanisms, these systems, the system of work, the system of QA, the system of evaluation, to operate somewhat independently. Right. So that when you get the output, as a part of the output you also get the report on what the system failed to do. And we always call that the human completion part, right? And so getting these to be really accurate allows you to move with confidence. Right? And so for us, that's a project guide that says, hey, these functions or these parts of the application, we need your help. And by the way, we did all of this work, we passed all these tests, here's the QA and here's the screenshots, so you can feel good on that. Go review that code, but go spend your time on this part. So to map that onto yours, right: you would need to have, this is my intent; that intent gets some work; that work has QA involved recursively ahead of getting the outcome; and also a separate system to evaluate and grade the effort of that. Both of those artifacts should come to you, and both of those systems within your application should be independent in nature, like a report.

38:49
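A compact sketch of the two independent systems described, with the worker, checker, and evaluator all left as placeholders:

# Sketch: a worker that retries within a budget, and a *separate* evaluator
# that writes the "for the human" guide on whatever could not be finished.

def attempt_with_budget(task, try_fn, check_fn, budget=5):
    for _ in range(budget):
        result = try_fn(task)
        if check_fn(task, result):           # build, tests, production-like QA
            return result
    return None                              # out of attempts; hand off

def run_and_report(tasks, try_fn, check_fn, evaluate_fn):
    done, unfinished = [], []
    for task in tasks:
        result = attempt_with_budget(task, try_fn, check_fn)
        (done if result is not None else unfinished).append(task)
    # Independent system: re-derive the desired end state and suggest the
    # likeliest human path for each gap.
    guide = [evaluate_fn(task) for task in unfinished]
    return done, guide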

Speaker A

So how about a kind of model scouting report? You had said that you don't want to use just one family of models. I mean, that's clear to me. But why? And do you have kind of rules of thumb for which families are kind of better in which ways? How many families are you using? Does Grok crack the list? Do any Chinese models crack the list? Are you fine tuning models for particular purposes? Tell us, give us a tour of the model zoo.

40:51

Speaker D

Yeah, so we use the three major families of models in Blitzy today: OpenAI, Google and Anthropic. The other ones are great and maybe incorporated in the future for different purposes. Right. But it's very clear the researchers' preferences are somehow expressed in these models' intelligences, and they're very, very smart in sort of different ways. And they're much, much smarter when you compare different families of models and have them review each other's work. Right. And so if you took Opus and Sonnet from Anthropic and had them compare each other's work, versus an OpenAI and an Anthropic model, you're going to get monstrously better results by having a different family of models, a different company, review the work, at least in all of our experience. Right? And so that is super interesting. And it changes every day. But first pass code gen, Anthropic remains really, really strong. Structured output and code review, great results from OpenAI. By the way, what I say here will probably depreciate by the time the podcast even comes out. And Gemini has been better for long horizon work, task checking, task list progressing. Right. And so, I don't know, date-timestamp this towards the end of January, and I'm almost certain that it'll probably change by the end of February.

41:22
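In code form, the routing described is something like the table below; as Brian says, the assignments will be stale quickly, so the structure (a role-to-family mapping) is the point, not the specific entries:

# Sketch: role-based routing across model families, as described in late
# January; the specific assignments are expected to change month to month.

ROLE_ROUTING = {
    "first_pass_codegen": "anthropic",   # Claude for first-pass generation
    "structured_output":  "openai",      # structured output and code review
    "code_review":        "openai",
    "long_horizon_tasks": "google",      # Gemini for task-list progression
}

def family_for(role: str) -> str:
    return ROLE_ROUTING[role]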

Speaker A

Yeah, the pace is unbelievable and relentless, for sure. So translating that back to kind of the meta structure of the whole thing: I'm kind of imagining that there is a brief given at the highest level where it's like, for this kind of task, you're probably going to want to use this model; for this other kind of task, you're going to use this model. And are you then allowing the system to dynamically select which model to use as a sub agent as it unfolds?

42:38

Speaker D

Yeah. And an example of a dynamic algorithm rule would be: you can pick the one that you think is best for this situation, and also the reviewing agent must be one of these other options. Right. And so we're not constraining the choice, but we're sort of constraining the selection of choices in the review model. So that's an example of a sequence of steps used in validation that is dynamic in nature, not: you must use Gemini, then you must use OpenAI, for instance. Right. And then you asked about fine tuning, right? Fine tuning is like a last mile optimization, I would say, and not a bet on dramatically improved models. And so fine tuning is an expression of: essentially, I can't get enough correct context engineering within the system and I can't get the right results. And so there's a place for it in the ecosystem. But as soon as you fine tune a model and the next one comes out and it has more raw intelligence, you're basically out of luck. Right. And so we are much more bullish long term on what we call memory. Right. You see a very shallow instantiation of this in tools like ChatGPT, where it will start to sort of remember your preferences. But there's a lot of memory that occurs in the enterprise environment, right? And memory is another way to express both relational and semantic understanding, but with a lot more signal of truth. Right. And so we very much believe that to get to 100% autonomy within an enterprise workflow, you have to sustain memory of the actions of the best people and what they view as correct, and then store that in the enterprise's instance of the platform, in this situation the enterprise's instance of Blitzy. And that's how, even after the architect that is the only one that knows that system retires, the enterprise itself has that IP in their instance of their AI system.

43:10
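A one-screen sketch of that dynamic rule: the generator is a free choice, but the reviewer is constrained to a different family:

# Sketch: constrain the *selection of choices*, not the choice itself. The
# generating model can be any family; the reviewer must be a different one.

import random

FAMILIES = ["openai", "anthropic", "google"]

def pick_generator_and_reviewer(prefer=None):
    generator = prefer if prefer in FAMILIES else random.choice(FAMILIES)
    reviewer = random.choice([f for f in FAMILIES if f != generator])
    return generator, reviewer

print(pick_generator_and_reviewer("anthropic"))  # reviewer is never anthropic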

Speaker A

Memory, or "the missing middle" of LLM memory as I've sometimes called it, has been an obsession of mine for a long time. I was really taken by the Mamba architecture when that came out, just because, hey, here we have something that's kind of competitive with an attention mechanism transformer, but it has a fixed state space size; we can kind of potentially run this thing indefinitely. Obviously there are still limits to that. There's a spectrum in memory between pure scratch pad and deeply integrated nested learning, continual learning, futuristic stuff, which sounds awesome, but it also does have some challenges, in that with a nested learning type approach, the model may perform better, but it doesn't necessarily mean that you have a record of what happened or what the key lessons were, because it's in the weights. Right. So what do you think? If you were going to put your own spec out to the frontier model companies for what you want to see memory look like, what is the shape of memory that would be the biggest difference maker for you guys?

45:15

Speaker D

Long term memory, I don't believe, will be solved at the LLM level. Right. LLMs have so much momentum behind them that another architecture, even if it were to solve for this, will not get the level of intelligence required to execute what these systems need. Right. And so memory is a problem to be solved at the system layer, and by system I mean the application layer. And that memory is application or domain specific, like what is important to remember in what instance. Right. And so you can think of memory as going all the way back to the traces. Right. The LangSmith traces, like a series of steps, were actions. Those series of steps were driven by decisions that you chose to put in context. The decision to put something in context might change in the future based on what you've learned from the way the enterprise expressed work. Right. And so this is tying all the way back to your context management system. That's where you're storing memory and preferences, based on actions, not based on model weights.

46:28

Speaker B

Interesting.

47:35

Speaker A

I have some hope that there could be an integrated memory breakthrough...

47:36

Speaker D

Certainly, it will make things so much easier. I hope for it, I really do. And even some expression of memory in the model layer will ease the burden on the system layer. But when it's, like, how much memory? Just to give you maybe a specific example: if you think about memory at an enterprise code base layer, the things which one needs to remember are extremely locally specific. And so a memory on an enterprise code base is not universal. So it's not: use this payment provider service over this payment provider service, even though my enterprise has nine. It's: hey, when you interact with this cluster of context, you need to use this service, even though to you they look relatively functionally equivalent; there's some organizational or contract reason why you need to use this service. Right. And so that is so local from a context interaction perspective that compressing global memory into the model layer actually has severe limitations. And so the question is, how do you bifurcate global truths, or global memories, which people instantiate with rules today to kind of try to manipulate these models to do what they want? How do you instantiate universally true long term memory in the weights of the models, because these are more brute force levels of intelligence, while keeping locally contextual, memory based decisions at the system or at the application layer?

47:43
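A tiny sketch of locally scoped memory of this kind; the path patterns and rule text are invented examples:

# Sketch: memories that only fire when the agent touches a particular
# cluster of context, instead of global rules. All entries are invented.

import fnmatch

LOCAL_MEMORY = {
    "billing/eu/*":       "Use PaymentProviderA here; contractual requirement.",
    "legacy/ftp_sync.py": "Do not modernize; scheduled for decommission.",
}

def memories_for(path: str) -> list:
    """Inject only the memories whose scope matches the files in play."""
    return [rule for pattern, rule in LOCAL_MEMORY.items()
            if fnmatch.fnmatch(path, pattern)]

print(memories_for("billing/eu/invoices.py"))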

Speaker A

Yeah, I totally agree. I mean, the nature of compression is that you can't compress everything, right? Something's got to be lost.

49:10

Speaker D

Every problem, I feel like, is a search and compression problem at the end of the day. And you're trying to get as little loss in compression as you can, and you're using search to try to minimize that. I think about search and compression all day.

49:19

Speaker A

Yeah, yeah, it reminds me of, I'm sure you've heard this, but the old, I don't know if it's a parable or something of the sort: the junior developer gets a problem, gets all excited, starts ripping off code, just typing a mile a minute, whereas the seasoned vet kind of leans back and says, I think I've seen something like this before. And that's the thing: I can imagine, even with a finite memory space, that getting developed to the point where you could get tremendously higher reliability on going out and finding the right documentation when it's actually needed, making the right decisions. Not because the model would have memorized every last detail of it, but because it would have that sort of intuitive sense that we probably have undervalued in ourselves until we've seen how much we contrast with LLMs who lack it: the sixth sense of, yeah, there's something here that I kind of know I need to go get, and I kind of know what I need to get.

49:37

Speaker D

And if you were to look at how Blitzy spends time, as the representation of the best cluster of developers at inference, we spend a huge amount of time in planning, in system understanding, and in impact analysis. Meaning: let me really methodically think through this, and then let me spend a lot of time figuring out everything else this is going to affect. The code generation is relatively fast, right? And then a bunch of time on QA and validation, recursively improving the code based on what you're trying to achieve. But writing a million lines of code is as fast as you can stream tokens. Our runs are as short as 12 hours and as long as a few weeks, depending, but that's a huge, huge refactor. And so as you break that up, it is that wise-developer motion of: let me sit back, let me plan, let me think about everything this is going to impact across the system, and then let me implement. As opposed to the junior dev, who is just banging out code from minute zero.
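
As a rough sketch of that phase ordering, here is a minimal loop with trivial stubs standing in for the real planning, impact-analysis, generation, and QA machinery; none of these function names are Blitzy's actual internals.

```python
# Illustrative only: heavy planning and impact analysis up front, fast code
# generation, then a recursive QA loop. Stubs stand in for real components.
def make_plan(task):          return f"plan({task})"
def impact_analysis(plan):    return {"affected": ["svc_a", "svc_b"]}
def generate_code(plan, br):  return "code-v1"
def run_tests(code):          return [] if code.endswith("v2") else ["test_x"]
def revise(code, failures):   return "code-v2"


def run_change(task: str, max_qa_rounds: int = 5) -> str:
    plan = make_plan(task)                    # methodical task breakdown
    blast_radius = impact_analysis(plan)      # everything this will affect
    code = generate_code(plan, blast_radius)  # fast: token-streaming bound
    for _ in range(max_qa_rounds):            # recursive QA/validation loop
        failures = run_tests(code)
        if not failures:
            return code
        code = revise(code, failures)
    return code  # leftover failures go into the human handoff report


print(run_change("add rate limiting"))  # code-v2
```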

50:37

Speaker A

Yeah, okay. That's a great transition to a couple questions I had around what you might call Blitzy scaling laws. Or another way to think about it would be limits to parallelization. You could just sound off on it, but I'm interested in what the curve is. Sam Altman famously tweeted that it's going to be really weird to live in a world where you can pay exponentially more for marginally better results. You've clearly got a curve like that. I'm interested to know how you think about that curve and where you want to be on it. How do you know when to stop paying for more inference? And then parallelization: Kimi K2 just came out, they've got their agent swarm thing, and there's some sort of logarithmic thing here too, where a thousand agents does not make you go a thousand times as fast. It can make you go five times as fast, maybe ten times as fast. You could maybe characterize what that looks like, and also what you think the reasons are for it. Some things, I guess, are just sequential: you've got to plan before you can execute, and so on. But yeah, that's plenty of prompt. Take it from there.

51:36

Speaker D

Nice. Yeah, good, structured prompt there. So let's talk about parallelism and the limits of parallelism. When you think about the work getting done at the system level, this is a core topic: understanding, within the domain you operate in, what sort of work can be done in parallel versus sequentially. We operate in enterprise software development. Trying to do everything at once is a surefire way to get really, really bad results. Just like in engineering: an engineering team will look at an epic, break it down into tasks, and realize which tasks depend on what. That is a huge part of what is happening at the planning phase within Blitzy. We are deciding, based on software development fundamentals, that thing X depends on thing Y, therefore we have to get Y to build, compile, and pass tests before even starting on X; we must do it in that sequence. So what is happening for us at the planning stage is partitioning parallel versus sequential tasks. That is just a software development problem set. In other domains there are other ways to think about what can be done in parallel versus sequentially, but in engineering it's very easy to grok what depends on what in a sequence of work, and therefore we have a system that algorithmically works through and assesses that. So that is the answer on parallelism: we want to do it as high quality as possible, which means in the instance where the system is not entirely sure, it'll assume sequential; in the instance where it is extremely sure that work is parallel, it'll do it in parallel, as in the sketch below. It's a tolerance preference on quality, which answers your first question about paying more to get better results. Our thesis is: we will pay any incremental dollar, we will write any incremental algorithm, we will really do anything within the system to improve the quality of the code, all the way to fully autonomous enterprise software development as the goal for the company. In our opinion, we are not cost constrained, because the other side of a pull request is human labor. I would much rather have that human working on problems that are on the edge, that are truly innovative, thinking about absolutely disrupting the way they're applying technology to their business, than have them spend time on vanilla application development. They've already expressed their preferences, vis-à-vis Blitzy, on the technical design they want implemented, and then they're handing that work off to us. We typically do 80 to 90% of the quantum of work, and then we'll call out where we need the human developer to do the remaining work, which is traditional stuff: configuration, QA. In the vision of the company, that remaining slice is a bug; it's a bug in the system relative to where we're going. Because software developers, engineers, are problem solvers at day zero, and if we can have the world's smartest people working on problems on the edge, not worried about packaging compatibility or QA, we've done a great service to humanity.
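
Here is a minimal sketch of that planning-stage decision, using Kahn-style topological layering over a task dependency graph, with the stated bias toward sequential execution when the system isn't sure; the task names and the certainty flag are hypothetical.

```python
# Derive parallel batches from a dependency graph; default to sequential
# whenever a task isn't confidently safe to parallelize.
def parallel_batches(deps: dict[str, set[str]],
                     certain: set[str]) -> list[list[str]]:
    """deps maps task -> tasks it depends on; `certain` marks tasks the system
    is extremely sure are safe to parallelize. Returns ordered batches."""
    remaining = {t: set(d) for t, d in deps.items()}
    batches: list[list[str]] = []
    while remaining:
        ready = [t for t, d in remaining.items() if not d]
        if not ready:
            raise ValueError("cycle in task dependencies")
        # Quality-tolerance preference: only confidently independent tasks
        # share a batch; everything else gets its own sequential step.
        batch = [t for t in ready if t in certain] or [ready[0]]
        for t in batch:
            del remaining[t]
        for d in remaining.values():
            d.difference_update(batch)
        batches.append(batch)
    return batches


# Example: Y must build and pass tests before X starts.
print(parallel_batches({"X": {"Y"}, "Y": set(), "Z": set()},
                       certain={"Y", "Z"}))  # [['Y', 'Z'], ['X']]
```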

52:39

Speaker A

I'm going to come back to the developer experience maybe in a few more minutes, I suppose.

55:51

Speaker A

Let's talk about the economics a little bit more, though, because on the website there's this $0.20-per-line component to pricing, and you can complicate that; I think there's a base buy-in level and then 20 cents per line beyond a certain level, or whatever. But that strikes me as possibly creating an interesting tension for you, where you have now said, okay, this is what we're going to charge you, but you also just said you're willing to spend every incremental API call necessary to maximize value. So is there just enough headroom under 20 cents that you don't mind bumping up and down? Do you ever have projects where you have to go to the customer and say, hey, actually I need 25 cents a line, but it'll be worth it because we're going to do that much more with Opus here and it's going to be better? How did you come to that 20 cents, and how safe a line in the sand has it proven to be for you?

55:52

Speaker D

Yeah, if we have to increase prices, we will. That's our going-in point, though not in the middle of an active contract engagement. If we have to dramatically increase compute to get to 100% autonomy, we'll do that, our customers will coast off of that for the duration of their contract, and then we'll right-size it. I'm not necessarily worried about the margin on day zero versus the value created on day zero. Yes, it's an attractive business today, absolutely, but that doesn't matter. What matters is that the amount of value left to be created is so high that if we close the gap from 80% of the work completed autonomously to 99% a year from now, the net-new customers are going to be more than happy to pay more money, because they're able to do so much more with the same amount of people. So you think about the delta of value creation, and you always want to push as hard as you can on value creation, because the market size for software development is about 1.2 trillion dollars in labor. But that is an infinitely expanding market, meaning software is designed to fix the productivity problems of its customers, and if you're telling me we're out of problems to solve with software, that's where I don't believe you. Our market size is capped by the problems that can be solved with software. The goal is to get to 100% autonomy, and on the way, getting to 80, 85, 90, 95% is incredibly and deeply valuable for the enterprise that wants to move incredibly fast; they're thrilled with this level of autonomy today. You can't let a short-term pricing decision dictate the technology decisions when the value creation is so high.

56:53

Speaker A

Yeah, that makes sense. You mentioned going from 80 to 99% completion. Maybe taking one step back from there: when you get a new customer, how do you know whether this is going to be an easy or hard engagement, and what do you have to do? Because in the text-to-SQL world this comes up all the time, even in relatively small-scale settings, where it's one thing to look at the schema and be able to write valid queries against it, but it's another thing that there might be three different columns in this table named, like, variable 1, 2, 3, and which one am I supposed to be using, and why do these exist, and how do they differ in meaning? So I imagine you must come into a lot of different environments where sometimes there's great documentation and it's reasonably clear what's going on and what you need to do, and other times probably not so much. Do you have a process for identifying what is genuinely ambiguous and potentially only exists in the heads of the employees at the company? And do you then have an AI agent interview those people to extract that information? What does the human side of onboarding look like?

58:43

Speaker D

Yeah, it's a great question. The typical enterprise has very little documentation and very little test coverage, right? So those are the first things we actually look to address with Blitzy. And the awesome part is that by addressing documentation and test coverage first, you automatically increase the effectiveness of all of the AI code-gen tools in your stack, and we highly recommend you have the individual developer productivity tools as part of that stack. So you get super fast time to value as you're getting implemented. As you ingest a codebase, there's the opportunity to provide what documentation you do have. What's super helpful here is domain-specific information, domain as in: I'm in finance, and when we say this in our code comments, this is roughly what it means. Then there's an iteration process at ingestion where we provide you a spec: here's everything as we understand it today within your codebase. Everything will be technically accurate to what can be surmised technically. But the product portions, where we're expressing what you're trying to achieve, that's where we'll have an iteration period. Rather than starting at zero, as in "tell us everything we don't know," it's: here's what the system can technically understand, all of your dependency diagrams, all the variables classified technically correctly; now tell us what additional context we should have from a product perspective, provide us that, and then we get off and running. That's the process to get to truth from a spec perspective. But you also have to remember that the spec is the human-readable abstraction of the truth. Because in context we're always using the actual source of truth, going back to the source code and pulling it into memory just in time, the product spec can be a little bit inaccurate at the end of the day. If you're moving from, say, C to Rust, it doesn't necessarily matter whether this thing in the spec, which is defined to be human-readable, is exactly precise, because we can go run the application and then mirror the exact effects on the other end in the case of a language translation. Now, when you're doing product development, which we do (I'd say half of our business is large-scale modernizations and refactors, and half is steady-state product development acceleration), that's when you want to be a little more prescriptive and a little more precise, because the system will be using your product expression, the "hey, we're doing this in finance," to go make further decisions.

1:00:03

Speaker A

So then, when you're going from 0 to 80%-plus of the work being done, are there moments when the system loops in a human on the Blitzy team and says, hey, I need help with judgment here, or I think this is a question we should be able to get answered? Or is it literally, from go time to 80%-plus, fully autonomous with no...

1:02:40

Speaker D

From go to pull request? Yes, from go to pull request. It would be an impossible task to try to insert a human into this process; the way you'd have to do it with agents at scale just doesn't work. The only thing that stops us, at the beginning of the process, is if we're missing something like an environment variable, something from a configuration perspective, needed to actually build and run the application. And you could find that out late: you could build, run, and then try to add some net-new piece of functionality that calls a service you didn't need in order to run the initial application. In that instance, it'll notify the customer in Blitzy: hey, we need access to this service, and this wasn't part of implementation or setup. But from spec to pull request, it's all agentic, and it sort of has to be.

1:03:07

Speaker A

Yeah, Tyler Cowen rings in my ears all the time: you are the bottleneck. Of course you'd have to keep that to a relative minimum, but interesting that you have basically zero.

1:03:56

Speaker D

I mean, the point is, if the system can't do something, it goes onto the human report for the enterprise. We don't have to be 100% out of the gate. We pass unit tests, integration tests, end-to-end tests, we do all that, but whatever work remains is part of the ultimate report that goes out to be completed by the team. And that's an awesome use of Claude Code, an awesome use of Cursor: people pull that report down, they go deep on whatever edge case Blitzy couldn't solve, they get that ready to go into production, they go to QA, they go to merge, and then they start their next sprint with Blitzy. So the system is designed to account for the fact that we want to accurately do as much work as we can, and then say, great, the human pickup is on the back end of this pull request.

1:04:11

Speaker A

So what is that last up-to-20% today? You mentioned edge cases just now. Is that the bulk of it, just unanticipated scenarios that were ambiguous or otherwise problematic? The kind of thing that kicks back to the humans not so much because of code, but because of missing judgment that wasn't supplied up front?

1:04:55

Speaker D

Yeah. So it's typically items that we think were not captured in the testing strategy. Maybe as a double-click: any time Blitzy touches any file, we're doing unit tests before and after; as we do clusters of context, clusters of work, we're doing integration tests between services; and at the end we're doing end-to-end tests. There'll be some instances where, let's say, we passed 73 of 75 tests, and for whatever reason the system changes things to fix item one and it breaks item two, then changes things in item two and it breaks item one. So the system will say: great, we're at 73 out of 75 from an end-to-end testing perspective, these are the files we're oscillating between back and forth, and you need to go in as a human and figure out where there's conflict between these two services, because our system has gone back and forth so many times. It's funny, sometimes the task is an impossible one: you're like, oh, okay, you're asking for two contradictory things in your spec, and oscillation is one way to prove that you're asking for opposite things. Sometimes it's configuration stuff, sometimes it's just QA work. As part of the report, we'll break down the tasks remaining, the estimated hours to complete those tasks for the human teams, and who would be responsible from a functional skill-set perspective. So: whatever part of the testing strategy we didn't get to 100% on, plus a plan for code review and QA. That's really what's included in that, call it, final 20%.
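
One way to detect that oscillation programmatically is sketched below: track which files each fix attempt touches and whether the passing-test count moves, and escalate to a human when the same files recur with no progress. The thresholds and names are assumptions, not Blitzy's real implementation.

```python
# Flag the "fix A breaks B, fix B breaks A" loop for human handoff.
from collections import Counter


def should_escalate(attempts: list[tuple[frozenset[str], int]],
                    window: int = 6, min_repeats: int = 3) -> bool:
    """attempts: per-iteration (files_changed, tests_passing). Escalate when a
    file set recurs in the recent window without the pass count improving."""
    recent = attempts[-window:]
    seen = Counter(files for files, _ in recent)
    no_progress = len({passing for _, passing in recent}) == 1
    return no_progress and any(n >= min_repeats for n in seen.values())


history = [
    (frozenset({"svc_a.py"}), 73), (frozenset({"svc_b.py"}), 73),
    (frozenset({"svc_a.py"}), 73), (frozenset({"svc_b.py"}), 73),
    (frozenset({"svc_a.py"}), 73), (frozenset({"svc_b.py"}), 73),
]
print(should_escalate(history))  # True: stuck at 73/75, same files oscillating
```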

1:05:19

Speaker A

So how do you get to 99%? Because when I hear that description, it sounds less like something a new model is going to be able to handle, and more like people just aren't that maximalist, I guess, in terms of really defining for the AI what it is they want.

1:06:53

Speaker D

We have large customers that will get a Blitzy pull request and still go through a dual-review PR strategy. And I recommend that: whatever your QA process is, you should continue to do it for the foreseeable future, for regulation purposes if nothing else; they won't touch a line of Blitzy code and just press merge. Those customers are unbelievable at expressing intent and doing spec-driven development. For the large majority of customers who are not as far along that curve, what they'll do is express intent, get a spec, get code back, and then at the code step they'll realize: oh, I didn't consider this outcome, even though it was maybe expressed in the spec; I'm moving so quickly. So we had to build another product: the ability to refine further from there within the Blitzy platform. Because people aren't used to making two or three months' worth of decisions up front; you used to get to month two and then figure out the nuance between month two and month three. So once you get into the code and realize, oh, I didn't express this implementation the way I would have preferred, and it was hard for me to conceptualize what that would look like between spec and implementation, you can just go back and refine that existing pull request. You provide your updated guidance: hey, actually, on the implementation of this portion, I want to use this approach. And then it'll run for a much shorter amount of time and adjust the existing PR to your preferences. This just has to do with existing patterns of behavior today. What we see is that folks go through this flow as they're getting familiarized with Blitzy, refine that larger amount of work once or twice, and then naturally start to get really good at expressing their intent, or identifying it at the spec stage, because they're building the muscle of being a systems-level thinker, a systems-level architect, and getting all of that implemented.

1:07:17

Speaker A

So what room for improvement is there on the models? It seems like what you're describing there is still kind of like models could get better, but it's really the humans that need to get better at expressing what they want. For you to drive that completion number up toward 100%, would model improvement then just translate to even faster execution, even cheaper total inference cost, or are there still things that you would highlight that are like, yeah, models are not that good at this and it actually would be really helpful if they were better at it.

1:09:16

Speaker D

Yeah, we ultimately want more intelligence. Cheaper is fine, but think about the instance I walked through, with the different end-to-end tests failing back and forth as the code recursively went back, ran the application, and tried to fix it: today our system will just say, those two things happened, go look at it, human, we're stuck. If you had more raw intelligence, it could very prescriptively say: hey, this is exactly why this is happening, here is the trade-off decision you need to express to us, which one of these routes do you want to go down, and I can go implement that. And when the trade-offs themselves, which are complex, can be understood by the model itself, it could come back with two different pull requests, both with the end-to-end tests fully passing, and say: I took trade-off one here and trade-off two here, and those are the only two logical trade-offs you could have made. As opposed to: I couldn't solve this problem, over to the human. So we want more intelligence. It's going to allow us to go further in those situations, and it'll allow you to be less precise at the spec stage and less forward-looking in your technical design.

1:09:52

Speaker A

Yeah, okay, interesting. Seems like we're pretty close though.

1:11:13

Speaker D

We're closer than people think.

1:11:17

Speaker A

Yeah, not that many more special requests there.

1:11:19

Speaker D

Hey, Sid, we've gone pretty deep. I was going to have you join in if there were some nuggets you wanted to drop in here. But just as an introduction: Sid was a prolific inventor at Nvidia. He's been thinking about building large-scale software systems since he was a little boy; he's got great stories about taking computers apart and building software when he was a little kid. And he's really the inventor of all of the core technology here at the company: the large-scale context engineering system that unlocks the ability for us to understand 100-million-plus-line codebases, and the long-running compute orchestration systems.

1:11:25

Speaker A

Sid Pardeshi, welcome to The Cognitive Revolution. So, boy, we have covered a lot of ground, and Brian has done a great job of explaining a lot. I was going to go next to strange behaviors from language models. A theme of my life, of this feed, is that I am both extremely enthused about AI, I love what it can do for me, I experience incredible personal productivity gains all the time, and I also pay reasonably close attention to research that shows all kinds of emergent, surprising, and sometimes, in my view, scary bad behavior from language models. And one big question, of course, in the big picture, is to what degree we can successfully get AIs to monitor the work of other AIs and get to a point where we can be confident in the system overall, even if some of the models, some of the time, are doing things we would not want them to be doing. So I'm interested in what you guys have seen there. QA is one dimension of it, just catching bugs, catching mistakes. But then there's also, famously, I think Claude 3.7 was maybe the high-water mark of this, writing unit tests that would just return true and always pass, when obviously the core objective had not been met. How would you describe the trends there? I assume it's improving, but how much have you seen that sort of thing improve? And what have you done, and how well has it worked, to get AIs to detect those kinds of problematic behaviors in one another? Because obviously, at the end of the day, you want to deliver something to customers that doesn't have these fake unit tests.

1:12:06

Speaker C

Right. You've really described two patterns there. One is strange behaviors from the LLMs and how to control them, and one is the LLM-as-a-judge philosophy. We've been super early with LLM as a judge. One interesting bit you described was getting LLMs to correct each other's work. What we've seen is that LLMs definitely have some peculiar behaviors given the conditions. Assume everything's constant: constant temperature, top-p, top-k, whatever parameters you're using to influence behavior, and constant prompts. If you gave two different LLMs, both following the best prompting guidelines of each vendor, OpenAI and Anthropic for example, the same situation or condition, you may get different reactions. For example, take SWE-bench Verified, a very popular leaderboard: the models get different scores, and even though the problems are very similar, there are different problems that Anthropic fails on versus what OpenAI fails on. And if you go to a real-world situation with a lot more ambiguity, you'll see that if you run the same situation even through Claude multiple times, it may come up with a different resolution each time. Say it's an ambiguous situation and there's only one way to solve it correctly: if you run it five times, it may be that Claude solves it correctly one or two times, and the approaches it takes are slightly different each time. That's because of how the transformer architecture works. These are sequence-to-sequence models generating the next set of tokens to answer the question, and they may end up sampling different parts of the space; that's one way you end up with a difference. Or they may take a different trajectory: maybe in the correct run the model used a tool correctly, wrote a more elegant search query to find what it was looking for. Because these are probabilistic models, at any point in time there's some probability that the LLM lands on the right tool and uses it correctly. So that's why you have these differences. And LLM as a judge is definitely effective. The way to make it effective, in our experience, is to use two dissimilar models to evaluate each other's work. What you're doing then is not just adjusting for these probabilities: because of the inherent architectural differences, not at a very deep level, but let's say GPT 5.2 is built quite differently, with a different set of parameters and a different size than Opus 4.5, it may take a different trajectory, it may use tools differently. In that sense, you've increased the chances that collectively they land at the correct answer, which solves the problem. So that's our LLM-as-a-judge approach; it's an important part of landing at the correct answer. But let's talk about the strange-behavior aspect you mentioned, because that's really interesting. We've been very deep into the Claude family of models and OpenAI's. For example, one interesting behavior of OpenAI's o-series was that those models were very reluctant to use tools. These were the earliest reasoning models, but they did not like to use tools.
If you asked the model to search the codebase to come up with an answer to something, you'd find it jumping to conclusions without doing thorough research. So that was a problem with the earlier series of models. But if you look at the latest OpenAI models, like Codex, that's changed. And even GPT-4o, which was active at the time of o1 and o3, was by far the best model when it came to tool calling.
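
A minimal sketch of the two-dissimilar-models judge pattern, using the public Anthropic and OpenAI Python SDKs; the model ID strings are placeholders, so check current model names before running.

```python
# Generate with one model family, judge with a dissimilar one, so that
# family-specific failure modes have a chance to cancel out.
import anthropic
import openai

claude = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
gpt = openai.OpenAI()            # reads OPENAI_API_KEY from the environment


def generate_then_judge(task: str) -> str:
    # First pass: one model family drafts the code.
    draft = claude.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=2000,
        messages=[{"role": "user", "content": task}],
    ).content[0].text

    # Judge pass: a dissimilar family reviews the draft.
    verdict = gpt.chat.completions.create(
        model="gpt-5",  # placeholder model ID
        messages=[{
            "role": "user",
            "content": f"Task:\n{task}\n\nCandidate code:\n{draft}\n\n"
                       "Review strictly: list defects, then PASS or FAIL.",
        }],
    ).choices[0].message.content
    return verdict
```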

1:14:04

Speaker B

Right.

1:18:10

Speaker C

We had repeatedly provided feedback to Anthropic that GPT-4o outshone Claude 3.5 by a mile. The best thing about Claude 3.5 Sonnet was that it used tools really well, but it was nowhere close to as powerful or as efficient as GPT-4 and GPT-4o at tool calling. As time went by, that changed quickly. Sonnet 4, Sonnet 4.5, and even 3.7, though not to the same extent, were really good at tool calling. The problem with 3.7 was that it was over-eager: it made a lot of mistakes when calling tools, leading to tool schema errors, and if you didn't validate those correctly, it could cause all kinds of issues in your application. But they quickly fixed that with 4 and 4.5. The most interesting strange behavior with these models, though, is that they tend to give up as soon as they feel context pressure; "context anxiety" is how I like to describe it. Even though the vendors say these are much larger context window models, and this applies to OpenAI too. For example, GPT-4.1 introduced 1 million tokens, if I'm not mistaken, but the documentation clearly said that if you exceed 200k tokens you may experience different behaviors: the request will take longer, and the quality may not be as good. For Sonnet, even though it says it has a 1-million-token context window, you will notice marked differences in behavior the moment you exceed about 100k or 200k tokens. It's not just about the price, although Anthropic does charge you differently beyond that point. What you'll see is that if you're working on a complex problem, the model will tend to give up. It will say things like, okay, because I have these time constraints. What time constraints? I never told you that you have to finish in an hour or in ten seconds; I just gave you a problem and expect you to solve it. But the model brought in the concept of time and said, because I have these time constraints and I have been working on this for too long (and by the way, too long was just ten minutes), I have to wrap up and give a final response. And it gave you the incomplete response. And then context pressure: this seems too complicated, let me take a simple approach. And that's where you get the behavior you mentioned. Let me return true, and let's see, this solves all of the requirements. You said I should not have any bad code: check. I should not have overly verbose code comments: check.

I'm just returning true, and the test should always pass: check. I'm just returning true; it's always going to pass. So it's justifying to itself that its decisions are philosophically correct, even though what it's doing is blatantly wrong relative to the user's original instructions. And these are due to external factors that the model providers are implementing. When we experienced this, we solved it our way. There were a number of ways to prevent these issues, the obvious one being prompting. But we also reached out to Anthropic, and Anthropic actually fixed them: Sonnet 4.5 had this issue, but Opus 4.5 does not. It has other kinds of issues, again. As an application builder, you're constantly solving for these issues in production with different labs, different model providers; they all have different vectors along which they will effectively fail for any given use case.
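
One cheap application-layer guard against that reward-hacked test failure mode is a static scan for tests that cannot fail. A minimal sketch, assuming Python test files and pytest-style naming; the heuristics are illustrative, not exhaustive.

```python
# Flag test functions whose body contains no meaningful assertion,
# e.g. the degenerate `assert True` pattern described above.
import ast


def trivial_tests(source: str) -> list[str]:
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            asserts = [n for n in ast.walk(node) if isinstance(n, ast.Assert)]
            # No asserts at all, or only constant asserts like `assert True`.
            if not asserts or all(isinstance(a.test, ast.Constant)
                                  for a in asserts):
                flagged.append(node.name)
    return flagged


print(trivial_tests("def test_payment():\n    assert True\n"))  # ['test_payment']
```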

1:20:36

Speaker D

And from an overarching perspective, in information theory they call this concept entropy. The outcome of a probabilistic system has high entropy, and LLMs are probabilistic systems. So the purpose of the system and application layer is to reduce entropy, to get to reliable outcomes. The techniques we're describing reduce entropy to get closer to a desired truth.

1:21:28

Speaker A

I love that you mentioned entropy, because I was just thinking: Sid had mentioned temperature, and that got me thinking back to my early LLM-based application development days. That was a huge lever that I would mess with, depending on.

1:21:55

Speaker D

You were a high-temp guy. I can tell.

1:22:12

Speaker A

It depends on the use case, you know. But certainly, these days it seems like some of the APIs have even removed temperature, and I don't think about it nearly as much as I used to. So that tool for controlling entropy has kind of gone away. But I wonder what other strategies you guys have for perhaps progressively increasing entropy. This is something I talked about with the AI co-scientist team at Google: they said that in their system, searching through the scientific literature is the main source of entropy that they sometimes need to get off of a local maximum, or out of a local minimum, whichever way you want to think of it, and onto the next higher hill they can then explore and climb. What do you guys do? I would imagine you want your first pass to be your best shot. In code applications, I used to turn temperature to zero; I figured I'd want the model's best guess first. But then if that didn't work, maybe I'd turn temperature up. And there are a lot of different ways to turn temperature up, so to speak: context engineer a little bit differently, swap out to a whole other model, do a web search for some commentary on the problem, whatever, and then hopefully with different inputs you can eventually land on the right output. Long-winded way of asking: how do you ramp up the entropy as needed when the first, default answer isn't working?

1:22:15

Speaker C

Yeah, I would say the levers have changed, and thanks for setting that background; let me add more color to it. In the beginning you had temperature. For code generation, or any use case where you didn't need much creativity, you wanted to focus on getting the right answer rather than the most creative answer, so the best-practice guidance was to bring temperature down to 0, or 0.1, 0.2, depending on the use case; different model providers had different guidance. But then as tool calling was introduced, with Claude 3.6 and GPT-4 and that generation, having temperature on top of tool calling created problems: you already have the ability to land on a different response because the model can take a different trajectory in its tool calls, and then you have temperature influencing its behavior, its creativity, on top of that. It just created complications. But what really changed everything was the introduction of reasoning, starting with the o-series of models and then eventually with Claude. Both OpenAI and Anthropic force you to set temperature to one, which means you don't have any control over the temperature parameter. So the lever has changed from temperature to the thinking budget. You may have a 200k-token context window or a 1-million-token context window, and you have between zero and however many tokens of reasoning the model supports; typically you've seen 32k for Opus and Sonnet, 64k for some others, and for OpenAI models about 128k tokens. That's the reasoning budget: how much thinking the model is allowed to do before and in between responses. In the beginning you only had one batch of reasoning before the model gave you a response, and that was it; it went off on its own trajectory, and there were hacks you had to do to get the model to think while it was working, while it was calling tools. But then you got what Anthropic now calls interleaved thinking, where the model thinks around every tool call: it automatically thinks before making a call, and you set a budget for the overall amount of thinking, how much of the context window it's allowed to use for it. Then there are weird mechanics around prompt caching, whether thinking invalidates prompt caching, how much of the thinking actually counts against the context window, and all of that differs between providers. But at a high level, the reasoning budget is the lever you have. If you allow the model to think for longer, you get higher-quality answers, because essentially what the model is doing while thinking is taking a stab at a response. Okay, the user is asking me to write code to do XYZ; let me take a stab at it; this is how I would write it. It writes the actual code, it reviews its own code, and this is all thinking; it hasn't written a single token of output yet. It says, oh, but I shouldn't do this because the user asked for that, and it goes through that process. By the time it has either exhausted its thinking budget or arrived at a good enough answer to the user's request, it's ready to write the final response.
So essentially, what you were doing earlier by setting temperature to zero and maybe running the response five times, maybe tweaking prompts, the model is now doing by itself, by default, and giving you a higher-quality response. And if you draw parallels to what actually makes code generation work: on a base case, Claude Opus 4.5 is a really good model in terms of code generation that gets responses right in one shot. But that's the thinking model. The moment you turn off thinking, it drops 5 to 10 percentage points, even on SWE-bench, which is supposed to be one of the easiest benchmarks, and the responses are no longer that high quality. So the theory we have, based on the observations we can make, is that models are really getting better at test-time inference. They're getting more efficient at thinking. The system prompts that the model providers build into the models, which encourage them to think before responding, seem to cover a wide spectrum of cases and allow for multiple things: one, high-quality responses across different use cases; also more guardrails and ways to safeguard against things like prompt injection or getting the model to say something malicious. There are multiple layers beyond prompting that go into achieving this. But we're definitely seeing that the performance gains we get from models are primarily driven by test-time inference along this trajectory of model improvements.
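
Concretely, the lever looks like the extended thinking parameter in the Anthropic Python SDK. A minimal sketch; the model ID is a placeholder, and note that temperature cannot be adjusted once thinking is enabled, which matches the discussion above.

```python
# Reasoning budget as the quality lever, replacing temperature.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",                            # placeholder model ID
    max_tokens=16000,                                     # must exceed the budget
    thinking={"type": "enabled", "budget_tokens": 8000},  # the lever
    messages=[{"role": "user",
               "content": "Refactor this parser for clarity: ..."}],
)

# The response interleaves thinking blocks with the final text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```

Raising `budget_tokens` buys more deliberation per response, the modern analogue of rerunning a temperature-zero model several times.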

1:23:47

Speaker D

And Sid, maybe you could comment, I think it's worth having you comment, on the path from where we are today towards fully autonomous enterprise software development. Yeah.

1:28:54

Speaker C

So, you know, when we started and said that we were going to bring fully autonomous software development, nobody in the market believed us, because you had what, tens of thousands of tokens of context window, and models could write 200 to 300 lines of code at a time, maybe a thousand lines, but it wasn't good. The code wouldn't compile, it wouldn't do what the user said, and the context window was still too small to even cater to large enterprise codebases. And we're not really seeing that change. We've had 1-million-token context window models for a while; we've even had 10-million-token context window models. But the efficient frontier for the effective context window, if you don't want to deal with issues like context pressure, and if you always want code that compiles, works, runs, or eventually gets to that point, is still less than 100k tokens. So even though we've made a lot of progress on quote-unquote intelligence (models are more intelligent, they produce higher-quality responses), you still have the problem of context, and we've solved that, and a series of other problems, to make this work. Our perspective is that the folks getting the best results from something like Claude Code today are using tons and tons of techniques to achieve that. You have CLAUDE.md, which contains, let's say, the instructions; you have maybe a series of plugins you're using, MCPs; you have prompt templates; you have a number of other tricks. You're probably using Claude Code to get one output, then switching to Codex, maybe getting it reviewed, and pasting that back in. So the most elite AI users, the ones getting the 10x gains, are doing a lot of hard work to make it happen. Have you really changed or improved productivity? I would argue no, because you're still doing a lot of work to get that; you've just changed what you're doing. You're not actually writing the code, but you're spending your time figuring all these tricks out. And every three months the models change, the prompting practices change, so you're relearning all of that, you're switching between Codex and Claude Code, and there's this constant struggle to make the model work for your codebase. Our vision has always been that you shouldn't need to do all that. The model still matters a lot today, whether you're using, let's say, Opus or some open source model, but we're seeing open source catch up, so our theory is that LLMs will become commodities. And regardless of that, the point really is that you should be able to go to a model with your work, which typically lives in your project management tool, Jira or whatever it is; you should be able to plan the work and get a PR back that just works. It follows all the coding practices you outlined, it solves everything in your plan in detail, it takes into account your past, current, and future roadmap, it has the ability to fix merge conflicts if you have a very high-velocity team, it follows the specifications in your Figma, and it works across your entire codebase. It compiles, the unit tests run, there's good code coverage, there's evidence of testing. This is what you'd expect from a really good human development team. These are the unsaid, or quite often very vocal, success criteria set within the engineering org.
That is what we've set out to build with Blitzy: just PRs and high-quality code that works. And we will spare no action to make sure we get to the highest level, whether that's LLM as a judge, more test-time inference, or, in the future, maybe even test-time training to learn the specific preferences of the user. The goal and the vision we have is, again, code that just works out of the box, without you having to do heroics to get it to satisfy the success criteria.
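
A minimal sketch of respecting that sub-100k effective-context frontier, assuming some upstream ranking (for example, from a knowledge graph) has already scored candidate sources; the budget figure and the four-characters-per-token estimate are simplifying assumptions.

```python
# Just-in-time context assembly under a hard token budget: inject the most
# relevant sources first and stop before the effective window degrades.
def assemble_context(candidates: list[tuple[float, str]],
                     budget_tokens: int = 90_000) -> str:
    """candidates: (relevance_score, source_text) pairs, however scored."""
    parts, used = [], 0
    for _, text in sorted(candidates, reverse=True):  # most relevant first
        cost = len(text) // 4  # crude chars-per-token estimate
        if used + cost > budget_tokens:
            continue  # skip rather than overflow the effective window
        parts.append(text)
        used += cost
    return "\n\n".join(parts)
```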

1:29:05

Speaker A

That's a funny characterization of how work has changed, and it certainly resonates with me. I don't code full time, but I've created many more applications in recent months than I ever used to, so in some sense I'm definitely more productive; I made three AI apps for family members as Christmas presents this year, for example. But it is definitely true that I'm always either hands-on or on Twitter looking for the latest tips and tricks, and it is striking that, for all the labor-saving nature of the technology, the people getting the most from it are probably working harder than anyone. Maybe that changes; maybe it just continues this way until the singularity, I don't know. I want to do a quick double-click on test-time training. This is obviously highly related to continual learning, which has been a big part of the discourse recently, and there have been some really interesting advances in that space with respect to much more contained puzzles, ARC-AGI-type puzzles, that kind of thing. Brian and I talked a little earlier about whether there's any point to using open source models, any point to fine-tuning. It sounds like today the reality is basically that the frontier models are the best, you want to work with the best, you can't really fine-tune the best, and so it's usually not worth it. Kimi K2, or K2.5 I should say, just came out, and while the community is still digesting exactly where it lands, it does seem, like all Chinese models, to be a little bit oversold, in as much as I don't think it's actually truly the best, which is what their benchmark graphics would have you believe. But I have used it a bit, and others seem to be reporting the same thing: it does seem to be really good, and the gap between it and whatever your favorite model is for your favorite use case seems quite small. So does this change the outlook? Whether fine-tuning is worth it or not would seem to depend a lot on the gap between what you can fine-tune and what you can't, and this gap seems to have potentially narrowed quite a bit. So I'm wondering if you're like, oh hey, maybe this changes the trade-offs, the analysis, and maybe we do want to get into that sort of thing now.

1:35:32

Speaker C

My perspective on fine-tuning has always been very classical, in the sense that you should only fine-tune if you have a very narrow use case where you believe fine-tuning will get you much better performance, and the rate of that performance gain is much more significant than waiting another three months for the next series of models. You also lose things when you fine-tune: you lose some of the model's ability to generalize. And it's not always a given that performance will increase when you fine-tune, because you don't necessarily have access to the original dataset, and even if you did, you can't really map the influence of specific parts of the dataset onto the model's behavior. So fine-tuning, especially when you don't have large amounts of data and a very clear niche use case, ideally one that has historically been successful, maybe on a previous family of models, is always like drawing from a pack of cards; it's a risky game. Now, if you talk about models and their ability to get better, there's another challenge, which is that Gemini and, let's say, even OpenAI are very close to Anthropic in terms of score on SWE-bench, and in some cases it has been shown that there are models that beat Anthropic on code generation in very specific use cases. Even then, in the real world, if you compare Gemini, OpenAI, and Anthropic, they are very different in terms of code generation: the use cases you'd want to apply each of them to are very distinct, even though they're producing similar-ish scores. The point really is that the current leaderboards are insufficient. There's a lot of test-set leakage, and a lot of broad insufficiency from the standpoint of generalizing to a typical use case. For example, a number of leaderboards rely on the opinions of humans: they'll give you A and B, both with code to solve a specific use case, and you're supposed to select which one you feel did a better job. Now, depending on my mood, I could have chosen either. If you don't define clear success metrics that would apply in an enterprise setting, you are not creating a very effective leaderboard, because the leaderboard then only captures perception. Maybe in someone's perception, writing a lot of comments is very helpful, because they read the comments and understood the code; from someone else's perspective, it's overwhelming: I cannot read that many comments when I'm trying to understand the code, it's just distracting. So leaderboard design is actually a complicated problem. And you mentioned ARC-AGI. The fun part is that Francois Chollet, the creator of that leaderboard, talks about how, when LLMs got to 70%-plus on it, everyone said, oh, I guess AGI is here. But then he brought in ARC-AGI-2, which didn't really change the difficulty of the problems; it just had different problems of the same kind. If you were to give ARC-AGI-1 to a five-year-old and then ARC-AGI-2 to a five-year-old, they would perform about the same on both. But an LLM that scored 76% on ARC-AGI-1 would not even score 20% on ARC-AGI-2 when it first came out.
So even though you have massive gains in intelligence on paper, on the leaderboard, from a real-world standpoint, just because of how LLMs work, you don't have a change in the LLM's ability to learn something it's seeing for the first time. The highlight of ARC-AGI-2 is that these are problems different from anything the LLM would have seen in its training set; they're not harder, they're just different. And then there are two broad definitions of AGI. The academic definition that Francois Chollet alludes to is the ability of the model to learn patterns it has not seen before and adapt to them on the fly, to apply its intelligence to a new problem and be able to solve it. The other definition of AGI, the more popular one I've seen floating around, is just human-level performance on a broad range of tasks. By definition and by real-world results, these are fundamentally different constructs, and the problem I see is that we've gravitated far more towards the latter but ignored the former. That is why I'm bullish on test-time training: what test-time training promises is that if we detect, let's say, a pattern the LLM is not familiar with, where it's not going to perform well, we can give it more context about solving that particular problem such that it does better and produces better results. Now, for a general problem, it's very hard to know whether you're on track to the correct answer, because you don't have a metric. With code, you can compile the code and know whether you're on the right track, or you can define unit tests and execute them. That doesn't apply to general scenarios. So specifically in the case of code, I'm bullish that you can implement test-time training in a way that improves the odds of getting to the correct answer. But even then, I've read papers on test-time training, and many of the techniques are not yet at a point where they're practical to implement. I definitely see that becoming a real thing in the next one to two years, though.

1:35:32

Speaker A

Yeah, this will be something I'm watching very closely to see how it develops as well. I think the two last things I want to talk about are security, briefly, because I know that's obviously a huge concern of enterprise customers broadly: they don't want to be importing a bunch of insecure code into their environment, and of course LLMs have a reputation for writing insecure code. The other thing I want to talk about, in closing, is the labor market in light of all these changes. That could also include who you're looking to hire, and as much information as you're willing to share about your hiring practices. But on the security side: where are we today? What have you found to work? And do you think this problem is going away? I've seen some research suggesting that formal methods can be used both to validate code that LLMs write and as a reward signal that should get them to write far more secure code far more often anyway. So my sense is that, like many other claims that LLMs can't reason, can't do this, can't do that, this is probably something we'll leave behind. But I know you guys have also had to build the best solution you can before the models themselves have been properly trained. So, all that to say: what's your view on the security of LLM-generated code?

1:42:03

Speaker C

Yeah, the first thing I would say is that it's a shared responsibility. Many behaviors of the LLM can be influenced and prevented at the training step itself. If you look at the reports that Anthropic, OpenAI, and Google put out when they launch a new model, they test against these behaviors, and these behaviors could be getting the model to do something it should not be doing. For example: I need a recipe to create a weapon. If I put that in as a prompt, hopefully the model does not respond with the answer. What people have typically done to fool the model is frame it as an emergency situation, such that if the model provided the recipe, it would save someone's life or make a positive change: trying to game the reward function that may have been defined for the model and get a response. Prompt injection is one of the ways they've been able to do that, and there are several other ways to jailbreak what the LLM can do. But ultimately it comes down to system design. For example, security considerations are different for something like Claude Code, where you interact directly with the model, than for Blitzy, where you have a plan, then you execute that plan, and Blitzy decides whether, under the instructions, and in what way, to deliver the code. When you're not interacting directly with the model, the attack vectors change. That's one part. But specifically for the code generation use case: one aspect of security is causing harm or using content that isn't considered clean for that use case, and the vectors there are that the model typically refuses to send you a response, or, if it does, you have to set different kinds of guardrails depending on the system. In terms of the software itself, it could just be an outdated knowledge reference. Most models right now have, I believe, January 2025 as the knowledge cutoff, and a number of libraries got updated with security fixes after that date. So if your model did not look up the web when using an open source library, or did not realize that something is a bad practice in code because that knowledge was newly discovered and it never sourced the web to learn it, it's likely that your LLM-generated code has those security flaws. But thankfully, as all things go in software, you have a number of ways to detect and prevent that. One is having defensive tests within the code: if you know some of the attack vectors your application or product is vulnerable to, you can define tests (and you can use AI to create them), keep them in the code, and make sure your code does not have those flaws. Every time you run a job, you make sure the tests pass, and you add more tests as needed. Two is having tools that check against known vulnerabilities: Sentry is one that comes to mind, and there are a number of others that report vulnerabilities, CVEs, in the code, and then you can use AI to address those vulnerabilities.
So in Blitzy we run a pre-check to detect security flaws, and we address them before creating the PR so that you don't have to go through that process. At a high level, because such tools exist, and different languages and frameworks have different sets of tools, you can give Blitzy the ability to check for them, and you can do that with other tools too. Code is significantly easier to protect from security gaps than general content, and I definitely believe that, from the standpoint of coding, we will have tools, or you'll have the ability to configure tools, that prevent security issues.
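
As one concrete, hypothetical shape for such a pre-check, here is a sketch that shells out to pip-audit, one real example of a dependency scanner for Python; your stack's scanner and flags may differ, and this is not Blitzy's actual pipeline.

```python
# Pre-PR dependency audit: block the PR if known vulnerabilities are found.
# pip-audit exits non-zero when it finds vulnerable dependencies.
import subprocess
import sys


def precheck(requirements: str = "requirements.txt") -> None:
    result = subprocess.run(
        ["pip-audit", "-r", requirements],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Feed the findings back into the fix loop before opening the PR.
        print("Vulnerabilities found, blocking PR:\n", result.stdout)
        sys.exit(1)
    print("Dependency audit clean.")


if __name__ == "__main__":
    precheck()
```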

1:43:36

Speaker A

So this has been outstanding, and I really appreciate how much you guys have been willing to share. I'm going to take the transcript of this episode, turn it into a to-do list for my own personal AI infrastructure project, and start implementing. The last thing I want to talk about for a couple of minutes in closing is the effect all of this is having on people. There was a paper you guys probably saw; it ended up being fake, but it resonated, which was maybe the most interesting thing about it. Supposedly it was about materials scientists at some big company who had introduced AI and become more productive, while their job satisfaction dropped. Again, this turned out to be fake, but I think it was shared so widely because it satisfied people's expectations, if nothing else. So I'm interested in how you see the role of the software engineer changing. Do software engineers like how it's changing? And then there's also, of course, this big question around junior developers: is the death of the junior developer much exaggerated? Are you guys hiring junior developers? What are you looking for in your hiring? If you want to tell us a little bit about what your comp looks like, that would be very interesting, but I understand if that's not something you want to talk about on a podcast. What do you think the impacts are, and what's under- and over-hyped when it comes to effects on the roles people have and the labor market more broadly?

1:48:44

Speaker C

I think of it from the standpoint of short term and medium term versus long term, right? In the short term, the immediate term, what happens is that code is now a commodity. In the olden days, if someone had written a script to do a very complicated or boring task, that script was like gold. You would pursue that developer, be friends with them, in the hopes that they might share the script they got after scouring through hundreds of pages of documentation and from the raw experience of having done that task numerous times. Now I can just go to Claude, prompt it, and get a script back. But if I'm a junior developer, I won't be able to look at the script and know whether it would destroy my production database, whether it would do what I'm expecting it to do, or whether it produces some unintended effect. And that is the danger. To me, that's really the difference between using AI well and not: whether you can tell that difference.

So in the short term, the market is weighted towards senior developers, because when you give a senior developer access to AI for writing code, they don't have to go through the boring mechanical process of writing, or even copy-pasting, a lot of code. Just feed it to the AI, get code back, review it, and get done with it. But as the AI gets better, as the chatbots and the models get better, as the tools get better at preventing unexpected and unintended outcomes, at understanding intent, and at writing code that satisfies the intent, what's going to happen, and this is already happening, is that mid-level developers perform at the level of senior engineers, just because code is a commodity. Mid-level developers have spent some time with the code; they know what a bad action looks like, they know how to take corrective measures, and they are still producing velocity gains.

The advantages that senior developers had were depth of knowledge, maybe speed, and the ability to understand how the system works. All of that you can now get from AI. You can connect Claude Code or Blitzy or any other tool to your codebase and get an accurate understanding of what the code is like. There may be hallucinations along the way, but that's changing quickly. On speed, you cannot beat AI: connect Cerebras to some model and you're going to get very fast tokens, and even the labs' models are fast at a baseline; Claude 4.5 is really fast. So you cannot beat the models on speed. And on the knowledge front, if the model is intelligent enough, like I said, to understand the intent, you're going to solve that problem as well.

Because of that, I believe that in the medium to longer term junior developers will be far more valuable, in the sense that they are cheap to hire. There are a ton of them doing computer science degrees right now who are not going to be employed, just because the rate at which enterprises are hiring has gone down, and in the short term enterprises are favoring more senior talent. But assuming these developers upskill on AI and stay in the industry using the tools, they are going to be much better at getting work done. So the senior talent, as it ages out, is going to be replaced by more junior developers. That's a theory I have. Now, in terms of hiring, we've hired senior, junior, and mid-level developers, and we have a mix of them.
They're obviously doing different things. The challenge we have as a startup is that we need to produce a lot of code, quantity, and it has to have quality; time is a very critical factor. So we obviously shared the bias to initially hire a lot of senior developers. But what we quickly realized is that some tasks don't really require senior developer input: it's not a large codebase, it's not really cutting-edge technology, it's something that is well known. For example, running Blitzy on a leaderboard, or writing scripts that automate that process. Last summer we hired high schoolers as interns to do this, and we have junior developers working as research engineers who use Blitzy and other AI tools to run all these operations, and we can hire them at very favorable compensation. That's going to be an asset. And because the market is really flipping on its head, salary expectations for software developers are unfortunately going to go down.

The junior developers who know AI don't have to unlearn anything. The biggest challenge for some of the more senior folks is that they have to learn to trust AI. The biggest hesitation for any senior developer who's been around long enough is: I can't trust anything other than myself; if I don't write the code, I can't trust it. That's the psychological hurdle senior developers have to overcome to adapt to AI. The ones who do adapt are going to be immensely successful, but there's going to be that challenge, and I believe that's a gap the mid-level developers, once they know enough, and the junior developers will fill, especially because of the favorable cost equation.

You also asked about salary ranges. We have a number of open positions, and the salary range is anywhere between $100k and $300k from a cash standpoint; equity is separate in that discussion, and there's always room for us to pay more for the right talent. It's interesting how the definition of the right talent has changed. Typically you paid more for someone who had many years of experience and had built many systems. But now, if you were to run a hackathon, you'd be very surprised at who actually wins. You have high schoolers who are extremely adept at using tools and at prompting, and often a good prompt and a good tool can beat what a senior engineer can do in the same span of time, especially if you're talking about greenfield development. Hands down, someone with fewer years of experience can do a lot better, just because of the psychological gaps. But if you're talking about legacy enterprise software, where you have to check a lot of boxes, you need a lot of experience: you think something is right, and you only realize it's wrong after being bitten. That's a space where senior engineers will continue to thrive.

1:49:25

Speaker D

I love it.

1:56:12

Speaker A

That was a great answer, and again, I appreciate how much you have been willing to share. Outstanding conversation. I'm looking forward to getting under the hood with Blitzy, and this is certainly a space that we will continue to watch closely. For now, Brian Elliott and Sid Pardeshi, CEO and CTO of Blitzy, thank you both for being part of the Cognitive Revolution.

1:56:12

Speaker D

Thank you.

1:56:34

Speaker B

If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.

1:56:36