Agent Swarms and Knowledge Graphs for Autonomous Software Development with Siddhant Pardeshi - #763
Siddhant Pardeshi, CTO of Blitzy, discusses autonomous software development using agent swarms and knowledge graphs to generate millions of lines of validated code. The conversation covers challenges with traditional AI coding tools, the need for context engineering and agent orchestration, and how Blitzy achieves 80% autonomous completion of enterprise-scale development projects.
- Traditional AI coding tools fail at enterprise scale due to context limitations and shallow indexing, requiring human intervention every few hours
- Autonomous development requires combining vector search with graph databases to create anchor points for agents navigating large codebases
- Agent swarms with database orchestration can scale to tens of thousands of agents without single-point-of-failure bottlenecks
- Code acceptance, not code generation, is the real challenge - AI must produce maintainable, secure, production-ready code
- Dynamic agent personas and role-specific prompting significantly improve performance on complex enterprise tasks
"We frequently write hundreds of thousands of lines, millions of lines of code. Everything compiles, everything runs, all tests pass, the UI works. It's pixel perfect."
"Code is a commodity now. Getting AI to write code is very easy. Getting code that follows your standards, code that is secure, code that is ready for production is a completely different story."
"The approach that we took has been to dynamically recruit multiple swarms of agents and use the database as part of the orchestration layer."
"We made this bet back when the context window was about 10,000 tokens and it could barely write usable code. But we made this bet that AI is going to be as good, if not better than humans at writing code."
A big thanks to Blitzy for supporting the podcast and sponsoring this episode. Want to accelerate software development velocity by 5x? You need Blitzy, which brings autonomous software development to your enterprise code base. Your engineers declare intent, and Blitzy agents map your code base and generate an agent action plan. Once approved, Blitzy gets to work autonomously, generating hundreds of thousands of lines of validated, end-to-end tested code, with more than 80% of the work completed in a single run. Blitzy is not just generating code; it's developing software at the speed of compute. Experience Blitzy firsthand at blitzy.com/TWIML. That's B-L-I-T-Z-Y dot com slash TWIML.
0:01
The approach that we took has been to dynamically recruit multiple swarms of agents and use the database as part of the orchestration layer. And you can recruit tens of thousands of agents, but not have to worry about this single orchestrator that's keeping track of everything that's happening. We've been able to apply that successfully, and we frequently write hundreds of thousands of lines, millions of lines of code. Everything compiles, everything runs, all tests pass, the UI works. It's pixel perfect. And so we perfected that, really.
0:48
Welcome to another episode of the TWIML AI podcast. I am your host, Sam Charrington. Today I'm joined by Siddhant Pardeshi. Siddhant is co-founder and CTO of Blitzy. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Welcome to the podcast, Sid.
1:31
Thanks, Sam. Glad to be here. I'm a longtime listener. I've been listening since 2019.
1:49
That's amazing, and it is so great to hear. I am excited to meet you, and I'm really looking forward to digging into your experiences at Blitzy, where you're working on autonomous development. So let's dig right in, but start by talking a little bit about your background. You were at Nvidia before you started Blitzy.
1:54
Yeah, I was at Nvidia since January 2016. And back then, the day I joined, Nvidia's stock was worth 32 billion. That was Nvidia's market cap, $32 billion. And I think Anthropic's revenue today is more than that. It was quite the experience being at Nvidia at that time. And Nvidia was structured, I don't know if they still are, but it functioned very much like a startup for the entire time that I was there, right from 2016 to 2022. And when the "Attention Is All You Need" paper dropped, I was right there. I was inventing things for Nvidia in the generative AI space. I was deep into GANs, or generative adversarial networks, and variational autoencoders, and I was brushing up against NLP. It was still quite early. You had BERT; we were using BERT for things like translation. The transformer was groundbreaking tech, and eventually, when I realized the potential of what it could do and simultaneously had an opportunity to go to HBS to do a joint master's program, an MBA and an MS, I chose that. And I met Brian at HBS, my co-founder and CEO, and we decided to form Blitzy based on the idea that AI will catch up eventually with humans. And we made this bet back when the context window was about 10,000 tokens and it could barely write usable code. But we made this bet that AI is going to be as good, if not better than humans at writing code. And there's a section of software development, not just code generation but entire software engineering, that will get completely automated by autonomous development. And that's what Blitzy is all about.
2:19
It certainly is true that one of the areas where AI is having the most impact today is in software development. When you think about software development, do you have a way that you taxonomize the space and the opportunity?
4:12
So I think software development is the best opportunity and space to apply AI. And the reason for that is because software is verifiable. It's compilable, it's testable, you can visualize it. And there is the concept of a correct answer. There could be many correct answers, but there are correct answers and wrong answers, which is not always the case in other domains, right? So it's super important to realize that. And then if you think about the space itself, I think we all got started with AI-assisted development. You had copilots; today you have CLIs and IDE tools with embedded AI assistants. And they all have the ability to, for example, do tasks asynchronously. You can give one a job that will take even an AI maybe hours to complete, and it will think for some time, go off asynchronously, ask you follow-up questions and whatnot. And then you have another part of the space, which is about autonomous development. There are tools in this category. There's Devin from Cognition, I believe, that falls into that category. We operate in that category. And the idea here is that you hit build and out comes a PR, right? But the PR that comes out is already tested, validated, everything works, and it's exactly how you intended it to be. There are no errors; the code is acceptable. So the biggest challenge that we have on both sides of the spectrum is code acceptance. You can write a lot of code, and code is a commodity now. Getting AI to write code is very easy. Getting any code is easy. Getting code that follows your standards, code that is really good, code that is secure, code that is ready for production is a completely different story. Because on one hand, you have these greenfield builds, new products that you can build from scratch, and AI is really good at that. So if you look at the demos that the labs put out: hey, I built this game and it looks amazing, I can't believe it. But then when you put the same AI on an enterprise code base where it's supposed to work with the existing code, it messes it up.
4:29
It's a lot more challenging.
6:48
It's way more challenging. It's an orders-of-magnitude harder problem, because the AI is dealing with so much information and so many conditions that cause tools to fail. So the autonomous part of the spectrum is a much harder challenge, because you have to simultaneously address all of these items and work toward acceptance as your final metric.
6:49
And so thinking back from acceptance through the agent, the AI writing some code: on the other side of that, there's got to be some specification that the code has to meet in order to be accepted. Are you essentially pushing all the complexity of coding into spec development?
7:11
That's a great point. So yes and no. Let me explain the yes part. If you could write a spec, then you should write a spec. All of the tools we know and love have plan mode; they've recently realized that. We started doing that back in 2023, 2024, when we built Blitzy. Spec development really helps the agents anchor themselves. But again, what people realize immediately is that the spec is not good enough, because then you have these other general rules that you want agents to follow. And traditionally what people have done is use things like AGENTS.md files, added skills and other stuff, trying to keep the spec lightweight, because these models tend to forget after a period of time, or if you go through compaction and things like that. So there's that part: if you have a task that you can write a spec for, where you know what it should do and all of the conditions it needs to satisfy, then yes, writing a spec is great. But then you have this other class of tasks where it's not really clear what the dependencies are. For example, you don't know what the schema for the backend database looks like. And you can't write a spec for it because you don't know what the constraints are, and you can't just trust the AI to figure out the schema and then do X, because you're going to get new information when you figure out the schema, and that's going to affect the decision and the architecture of the code that you're writing. So because of that, you always have this spectrum where you're working one to one with the agent. It's giving you more information, and you're helping it make decisions. So as for the future: if you can build more intelligent models that are maybe human-like, or better than humans, at making architectural decisions, then yes, you can have this entire class of work that is focused on writing specs and guiding other, maybe less capable, cheaper, faster agents to write code. And that also is an exciting opportunity for the future.
7:36
So just to replay that, to make sure I understand. I think what you're saying is that, yes, the spec is important, because if you get the spec right, that anchors the agent and the agent can produce better code. But no, today a spec isn't sufficient, because there are assumptions and unknowns and things that evolve during the course of development. And so rather than pushing everything to the spec, what I heard in there was that there's still a lot of human in the loop during development. Which raises the question: A, is that right? Is that capturing what you're saying? But also then B, you talk a lot about this idea of autonomous development. If the human is in the loop, how autonomous is the development? How do you think about that distinction and nuance?
9:35
Yep. Yeah, that's a fantastic question. So let's frame it this way. Today, even if you have a great spec, you spend a lot of time writing that spec. But if it's a complicated spec that maybe covers fifty thousand, a hundred thousand lines of code, and enterprise projects are often at that scale, if you want to upgrade Java for a large code base, or if you want to add a UI on a complicated backend, those are huge, huge changes across many files. You can write a spec, you can give it to the agent, your favorite CLI, maybe Claude Code or whatever. It's going to spend time executing, but at some point it's going to run into a case where it has a question for the human, where something is not clear and it needs to make a decision. Or it's going to run through several context compactions, because it only has, say, 1 million tokens of context. And then the quality of the output after compaction is not the same.
10:38
Information will be lost.
11:37
Yes, information will be lost. It has to do that. It does do a very intelligent job of trying to retain all relevant information, but it's not perfect, because that's really hard to solve. There are no guarantees it's not going to lose anything important. And if it does spend time going back to retrieve something important that it lost, chances are that it's so big in volume of tokens that it's going to overload the context again. It's stuck in a loop. So that's why you had, for example, Anthropic put out this huge project that was a C compiler, and the most popular issue on that forum is that this repo should not exist, because hello world does not compile on this compiler, right? You have problems like that when you try to apply what is called the Ralph Wiggum loop, essentially just running the same thing again and again till it gets to the correct answer, to tools that are not designed for that. The point I'm trying to make is it's not just about giving AI a spec. It's all about context engineering and agent engineering. Context engineering is about giving the AI the right amount of context at the right time. The problem is, at scale across the enterprise, when you have hundreds and thousands of developers, not everyone is using the tool with the same level of efficacy. All of the CLI tools, Codex, Claude Code, you name it, require a significant amount of setup. Have you connected the right MCP? Are you using the right skills? Are you using the right prompts? The same prompts that work for Anthropic don't work for OpenAI. For example, OpenAI doesn't use XML tokens in their training, but Anthropic does. So if you use XML tokens with OpenAI, or if you shout in your prompts, which you have to do with Claude at times to get it to listen to you, it's an ineffective strategy. And it's widely held that GPT-5.3 gets many things right that Opus does not. So there's all of this complex agentic engineering, where you recruit the right agent with the right set of prompts and tools, with the right level of prompt engineering, for the right task. Because there are definitely tasks that GPT is better at than Opus. And then there's context engineering, which optimizes for giving the agent the right amount of information at the right time, and for getting it to focus on the smallest possible task that is efficient for that agent, without overdoing it or underdoing it. So when you apply those two at scale and you solve some of the most important challenges, like context limits, and we have a very creative solution internally, speaking for Blitzy, where we've achieved effectively infinite context because we've applied context engineering and agent engineering, these are really powerful techniques that you can apply with today's AI to achieve autonomous development, which we've done successfully.
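To make the provider-specific prompting point concrete, here is a minimal sketch in Python. The function, style rules, and rendering choices are illustrative assumptions, not Blitzy's actual implementation:

```python
# Minimal sketch: render the same instructions in the style each provider
# responds to best. Illustrative only; real prompt guidelines are richer.

def format_prompt(provider: str, role: str, instructions: list[str]) -> str:
    if provider == "anthropic":
        # Anthropic models are trained with XML-style tags, so delimit
        # sections with them.
        body = "\n".join(f"<instruction>{i}</instruction>" for i in instructions)
        return f"<role>{role}</role>\n<instructions>\n{body}\n</instructions>"
    if provider == "openai":
        # OpenAI guidance favors plain markdown structure over XML tags.
        body = "\n".join(f"- {i}" for i in instructions)
        return f"# Role\n{role}\n\n# Instructions\n{body}"
    raise ValueError(f"no prompt guidelines loaded for provider: {provider}")

print(format_prompt("openai", "senior Java migration engineer",
                    ["Preserve public APIs", "Run the test suite after each change"]))
```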
11:38
Maybe we can jump in and define, when you say autonomous development, what exactly that means for you. Talk through the process from the perspective of a customer or user: what they're doing, what they see, the way in which they're engaged.
14:53
Yeah, so that's a good idea. I can talk about how you would do, let's say, a task like a Java upgrade. It's very easy to think of modernizing, or maybe a COBOL-to-Java transition, or even new feature development, using the traditional, or let's say state-of-the-art, development workflow. Let's say you're using Codex or Claude Code. You would work out a spec for the program. You would probably prompt Claude Code with the requirements, it would enter plan mode, build a spec. You would then take that spec, hop to Codex, ask it to review it. And then you hope that you've followed the right prompting guidelines, given it the right amount of context, helped it search your code base and find all of the relevant information, and then built that plan. But here's what frequently happens, even during spec generation, when you have a very large code base. These are tools that don't use very deep indexing techniques; they rely on shallow indexing. What I mean by shallow indexing is they'll finish indexing a code base in minutes, so they're not building a very deep understanding of the code base or the relationships in it. They're going to rely on tools like grep to find stuff. Okay, I want to change the authentication provider as part of this feature I'm adding, so I'm going to find all functions that use auth in a 10-million-line code base. Now, the challenge is that maybe auth is a very important use case for this project, and there are thousands, if not tens of thousands, of places where the auth provider is used. And not all functions are named login. So you're relying on the intelligence of the model to find all these places and update them correctly. And quite often that's where it falls down: it misses places. And then if you were to put that plan forward, eventually, when it tries to compile, it'll make a mistake, it won't be able to compile. And it'll try to fix the bugs, and now it's going back on its plan, changing things that were not exactly in the plan. So you have this problem of going against the plan, because the plan was not perfect. And even to get this plan right, you had to go to maybe three different providers: Claude, GPT, Gemini, whatever. So that's one challenge. Now let's say you got the plan back. Next, you have to define tasks and execute them. But if it's a very complex project, each task could take maybe hours. If it's really, really complex, it could take days. And then you have the concept of sub-agents that you're running, maybe in parallel, maybe serially. But it's really, really hard to figure out what tasks should run in parallel versus in series, and what the overlaps between them are, because you may have agents working against each other. And when you have a difficult, complex situation where you don't know what to do, even though you have a plan, you now find yourself going back to the human and relying on the human to guide you. And then there's even the question of giving the agent the right tools. For example, you want the agent to test live, and it's a web app, so maybe you'll give it the Chrome MCP. Boom, you've just lost 20,000 tokens of context, because it's going to sit in your context window. Now, you may apply techniques like tool search that optimize that, but there's a caveat: if you search for tools, it's not going to be as effective. There are chances it'll miss finding the right tool because it doesn't search aggressively. And then maybe you have five different MCPs, and if each of those five MCPs is as complex as Chrome's, you've just lost 100,000 tokens. And the effective frontier of operation for these LLMs is still less than 100K to 250K tokens, even though they have 1 million tokens of context. The point I'm making is, look at the needle-in-a-haystack leaderboards; there are tons of them. The moment you load more than, it used to be around 40K, but now, with Opus 4.6, it's more like 80K to 100K tokens, you lose the ability of the agent to perform at its best. So at that point,
15:11
you can't keep track of everything that's in the context window very well.
19:44
Exactly right. So you loaded up the spec, but now you've also loaded up all this other stuff, and I haven't even gotten to your skills and your AGENTS.md yet. And then, when you have this million-line code base with multiple modules, worked on by different teams, and every team has a different AGENTS.md file for its module, and there are tons of skills, you get the problem I'm getting at. You're easily going to blow past the efficient frontier, and you haven't even loaded the actual files you're going to work on yet.
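To put rough numbers on that, here is a back-of-the-envelope budget using the figures mentioned in this exchange. All values are illustrative:

```python
# Rough context budget: what is loaded before the agent opens a single
# working file. Figures are illustrative, based on the discussion above.
EFFECTIVE_WINDOW = 100_000  # tokens where models still perform near their best

loaded = {
    "base rules / system prompt": 5_000,
    "spec": 30_000,
    "Chrome MCP tool schemas": 20_000,
    "four more MCPs of similar size": 80_000,
    "AGENTS.md files and skills": 10_000,
}

total = sum(loaded.values())
print(f"loaded before any source file: {total:,} tokens")                      # 145,000
print(f"over the effective frontier by: {total - EFFECTIVE_WINDOW:,} tokens")  # 45,000
```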
19:47
So you've adequately painted the picture of the complexity that you're dealing with in the traditional workflow. What are the things that you can do to overcome all these many challenges?
20:20
So one thing that we've done from the beginning is build an anchor point that the agents can use to ground themselves in the code base and to find things across the code base. For example, we've built a hybrid between a graph and a vector index, where you have this ingestion process with Blitzy that understands the entire code base, maps out the relationships, and does semantic summarization and aggregation. And now you have this map of the entire code base. So if I want to go from one point to another point that's 10 million lines deep, I can do that instantly in one request, rather than burn all these tokens traveling through different files to find the chain. So that's one technique that really works.
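A minimal sketch of what that graph-plus-vector ingestion might look like for Python sources, assuming networkx for the graph and a generic embed() function. Names are illustrative, not Blitzy's pipeline:

```python
import ast
import networkx as nx

def ingest_python_file(graph: nx.DiGraph, path: str, embed) -> None:
    """Add one file's symbols and relationships to a code knowledge graph."""
    source = open(path).read()
    tree = ast.parse(source)
    # Vector side: embed a summary-sized slice of the file for semantic search.
    graph.add_node(path, kind="file", embedding=embed(source[:2000]))
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            qualified = f"{path}::{node.name}"
            snippet = ast.get_source_segment(source, node) or ""
            graph.add_node(qualified, kind=type(node).__name__,
                           embedding=embed(snippet[:2000]))
            # Graph side: structural edges give agents anchor points to traverse.
            graph.add_edge(path, qualified, rel="defines")
        elif isinstance(node, ast.Import):
            for alias in node.names:
                graph.add_edge(path, alias.name, rel="imports")
```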
20:35
What you just described in a lot of ways flies in the face of the way we've seen the traditional tooling evolve. We started with RAG, and people don't think about it like this, but to a large degree the early Copilot versions were kind of RAG-based: semantic, vector-style searching across the code base, identifying chunks, and passing that on as context. And then the things that we're all excited about, the Codexes and the Claude Codes, don't do that anymore. They just do grep, which, you're saying, doesn't really work at scale. It's interesting to think about the give and take that's happening here. What I'm interpreting you as saying is that you need more sophistication to operate at enterprise scale, on large code bases, whatever we want to call it. To bring that to a question: do you have a sense for where the cliff is? If you're working above a certain number of lines of code, or a certain level of complexity, where grep stops working and you need to go back to vector or graph?
21:20
I would say that the way we've applied vector and graph is in combination with grep. You use it like a signal. Like when you use Find My and you're searching for your AirPods or AirTag: it gives you a direction, and then you go in that direction till you find the thing. It doesn't tell you exactly where it is, but that is insanely helpful. It's exactly that way, by combining both.
22:56
So semantic is the thing that gets you directionally close, and then grep is the thing that gets you to the exact line. You're able to reduce your search space using semantic search.
23:21
Exactly. Okay. And so when you combine these techniques, and then you ask, what is the threshold? Well, I would say if the code base is anything larger than two times your context window, just roughly. Every model provider uses different techniques for compaction, different settings, different styles, different algorithms and all that.
23:33
And two times your effective context window, or your maximum context window?
23:59
I would say maximum, because the newer models are really good at even the needle in the haystack. So even though the effective window is smaller, you would get a good enough result. But in general, as a rule of thumb, if you're doing a change that's more than, let's say, 10,000 lines in a repo that's around or more than 70K to 100K lines, that's where the advantages of having this RAG support clearly become evident, because the amount of time you're spending searching is going to go down drastically. With 70K to 100K lines of code, you probably have multiple modules at that point, and multiple teams working on it with different sets of rules. So you can really take advantage of going multi-agentic, which we'll talk about separately, but also of having these two anchor points for searching things.
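Here is a minimal sketch of that two-stage lookup: semantic search over the graph's embeddings to narrow the candidate files, then grep to pinpoint exact lines. It assumes the graph built during ingestion above; all names are illustrative:

```python
import re
import numpy as np

def semantic_then_grep(graph, query: str, pattern: str, embed, top_k: int = 20):
    """Grep only inside the files that semantic search says are relevant."""
    q = np.asarray(embed(query))
    scored = []
    for node, data in graph.nodes(data=True):
        if data.get("kind") != "file":
            continue
        v = np.asarray(data["embedding"])
        cosine = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((cosine, node))
    candidates = [path for _, path in sorted(scored, reverse=True)[:top_k]]

    hits = []
    for path in candidates:
        for lineno, line in enumerate(open(path), start=1):
            if re.search(pattern, line):  # the "exact line" half of the hybrid
                hits.append((path, lineno, line.rstrip()))
    return hits

# e.g. semantic_then_grep(graph, "authentication provider", r"auth", embed)
```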
24:06
So multi agentic, where does that come in?
25:04
Yeah. So you have these limitations with task complexity, and limitations in terms of effective context, which, by the way, is not changing. The advertised window has gone from 10,000 tokens to 200K tokens and then 1 million, but we've been stuck at an effective 80K to 100K tokens, 80K to 120K, I would say, with the latest models, for two years. So even though you're getting a new model every three months, the effective context window is not changing. And it's taken a while for us to go from 10K to 200K to 1 million, because you have physics constraints: the amount of compute capacity, power, how much the model providers can scale. They're always trying to find better solutions for that, but that's not getting solved in the next three or six months. In my opinion, that's not changing drastically even in the next three years. So these are very important considerations. So what happens if you have multi-agent capabilities, the ability to recruit multiple agents? We've seen two techniques applied. One is the concept of sub-agents: you have one orchestrator or leader model that recruits multiple sub-agents. I've seen this used in Claude Code, for example. And then you can do searches in parallel. If you're finding four different things, just run four agents. Throw four darts, see which one sticks. Or you can parallelize tasks: give one to a front-end agent, give one to a back-end agent, and get more work done. The advantage you have is, of course, speed, and maybe effective intelligence, because you're doing multiple things in parallel. And you also have a significant, I would say, but not sufficient, improvement in the amount of context you're using for hydration, because the leader is no longer having to make all these searches and traverse the code; it's getting the results from different agents. So that's more effective than the leader having to do it all itself. But then you still have a bottleneck, and the bottleneck is this leader agent, because everyone's going to report back. So you can't run hundreds of agents, because then you're going to run into,
25:07
you've compressed the context, but you've not overcome context as a barrier, as a fundamental limitation.
27:37
Yes, you've just kicked the can, essentially. So that's one approach. The other one: that still falls down when the code base is large enough, multi-millions of lines. You're not going to be effective using Claude Code and just getting it to do everything. Like I said earlier in the call, you can't even build a 100K-line C compiler with Claude Code, even with Opus 4.6. So the approach that we took a long while ago has been to dynamically recruit multiple swarms of agents and use the database as part of the orchestration layer. We know you have a spec and you're working toward executing it; you break that down using AI into tasks, and then use different sets of agents for the tasks. And you do that recursively. Once you've done that, you've gotten to the point where you have an efficient task for every agent, and you can recruit tens of thousands of agents without having to worry about a single orchestrator keeping track of everything that's happening. You can parallelize at scale, just like GPUs work. And I know how GPUs work; I was at Nvidia. So you get that effect: rather than multithreading, which is the effect of the other approach, you really have hyperscaling. That is what I believe is the future. And we've been able to apply that successfully, and we frequently write hundreds of thousands of lines, millions of lines of code. Everything compiles, everything runs, all tests pass, the UI works, it's pixel perfect. So we've perfected that, really.
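A minimal sketch of the database-as-orchestrator pattern: a planner decomposes the spec into a shared task table, and any number of worker agents claim tasks from it, with no single leader tracking everything. SQLite stands in for the real database; all names are illustrative:

```python
import sqlite3
import uuid

db = sqlite3.connect("swarm.db", isolation_level=None)  # autocommit
db.execute("""CREATE TABLE IF NOT EXISTS tasks (
    id TEXT PRIMARY KEY, parent TEXT, spec TEXT,
    status TEXT DEFAULT 'pending', claimed_by TEXT)""")

def decompose(parent_id, subtasks):
    """Planner agents break a spec into agent-sized tasks, recursively."""
    for spec in subtasks:
        db.execute("INSERT INTO tasks (id, parent, spec) VALUES (?, ?, ?)",
                   (str(uuid.uuid4()), parent_id, spec))

def claim_task(agent_id):
    """Any of tens of thousands of workers can claim work; the DB arbitrates."""
    # UPDATE ... RETURNING requires SQLite 3.35+.
    row = db.execute("""UPDATE tasks SET status = 'running', claimed_by = ?
                        WHERE id = (SELECT id FROM tasks
                                    WHERE status = 'pending' LIMIT 1)
                        RETURNING id, spec""", (agent_id,)).fetchone()
    return row  # None when no pending work remains

def complete_task(task_id):
    db.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
```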
27:45
So when I think about this analogy of going from multithreading to parallelization, or distributed computing in general, I think about where some of the challenges are, and you get to issues like concurrency and locking. In this context, you've got many, many agents operating at scale on adjacent tasks. How do you prevent them from stepping all over each other's work?
29:26
That's a great point. That's the real problem; that's what we're dealing with day in and day out. But a number of techniques help with that. One is having multiple environments: giving the agent not just one but multiple sandboxed environments to operate in, and then converging the results through the source code. Ultimately, every agent is committing to GitHub, and every agent is going down its chain and figuring out whether its path actually works, then periodically revisiting the code and checking if it still compiles, if it still meets the spec. For example, we run periodic code reviews internally before even giving the code to the user. We have agents that review all of the code and make sure it's not drifting. We have agents that test all of the code, QA agents. And then we have different developer agents that address the feedback. So those are a few ways where you can use agent design as a lever, combined with the SCM as a source of truth: push commits, look at what happened, look at agent trajectories, understand what the rationale for a change was, so you don't overstep. And then you have this other part, which is the graph database, because you have the relational mapping of the entire code base in there: the files, which file depends on what, which library it imports, what the version of that library is, and the reason that library is used. Having that anchor is a game changer. You can immediately ground every single agent in that ground truth. Because the agent is less confused and has access to this treasure trove of information, it is much more effective and less likely to step on other agents' toes, because every other agent is also operating on the nodes of this graph. You can design systems that way if you have something like this.
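A minimal sketch of the sandbox-and-converge idea using git worktrees: each agent gets its own branch and checkout, and a merge only sticks if the result still builds and tests. It assumes a clean merge; the `make test` gate and paths are illustrative:

```python
import subprocess

def create_sandbox(repo: str, agent_id: str) -> str:
    """Give each agent its own branch in its own isolated worktree."""
    branch, path = f"agent/{agent_id}", f"/sandboxes/{agent_id}"
    subprocess.run(["git", "-C", repo, "worktree", "add", "-b", branch, path],
                   check=True)
    return path

def converge(repo: str, branch: str) -> bool:
    """Merge an agent's branch, then verify nothing drifted; roll back if it did."""
    subprocess.run(["git", "-C", repo, "merge", "--no-ff", branch], check=True)
    ok = subprocess.run(["make", "test"], cwd=repo).returncode == 0
    if not ok:
        # Undo the merge commit so a developer agent can address the feedback.
        subprocess.run(["git", "-C", repo, "reset", "--hard", "ORIG_HEAD"],
                       check=True)
    return ok
```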
30:06
And so in this world, you talked about these agents doing distinct things. I guess I'm trying to get at the degree to which the agent roles, or personalities, or whatever we want to call them, are fixed versus dynamic. Is this something you spend a lot of time on from a prompt engineering, context engineering perspective, as in, this is a code-writing agent and we're going to streamline all of its prompting around that, and this is a code-review agent and we're doing the same? Or is the agent a generic concept that figures these things out based on its task?
31:58
Yeah, that's a great point. When we started, all our agents were handwritten. They were static, because the models just weren't smart enough. We were working with Claude 3.5, 3.6 Sonnet. A different world altogether. We didn't even have tool calling, by the way, when we started, so it was crazy. But then agents got really, really smart. So what we have today is a set of base guidelines, and we try to keep that as lightweight as possible so that we don't take up too much space in the context. When I say lightweight, I mean less than 5,000 tokens, which is incredibly hard to do. The other lever we have is the prompt guidelines: we have references, the URLs where these guidelines are posted, and we've given agents the ability to look up prompt guidelines. You can probably tell where I'm going with this. The agents look up the prompt guidelines, and then you have fully dynamic agent design. In the latest version of our platform, the agents design the agents. You have a set of tools that we've implemented, and a set of MCPs or external tool integrations that are pre-written. So we write all of the tools, we've written the harness, and we set up the environments. We don't give agents direct access to the SCM or the database, because we all know what agents can do if they have inadvertent access. But we do have tools the agents can use: making a request to push a change to origin, pulling the latest changes from a branch, making a commit, editing a file, spinning up a browser, that kind of stuff. So we write the tools, and then the agents look at the spec, or even a portion of the spec, because we've assigned different parts of it to different agents, and they decide what agent would be best suited to solve the task. Because what we've consistently seen in the transformer architecture, and agents built around it, is that if you give an agent a persona and then give it a mission with a dedicated set of tools, its performance is going to be vastly different from an agent that wasn't given the same thing, where you just go to Claude and hand it something. The kind of response, the techniques it follows, the thinking process, the reasoning process: a lot of the magic of the intelligence in these models comes from reasoning. And it's not just about the volume or quantity; it's about the quality of their reasoning. And that is impacted by the persona. So it's super important to recruit, I would say design, agents with the right persona and the right set of tools that do not overload the context. We have checks in place: when this agent fires up, how much context is it going to load, and does it still operate in the effective context window?
32:47
Right.
35:51
And when you've designed that, like you've designed a function that does all of that, that's when you've really solved this problem, combined with the other stuff: the ability to recruit these agents at scale, to design, assign, get them to track progress, and move the job forward. That's what we do day in and day out.
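Sketching the shape of that dynamic design step: a planner proposes a persona and tool set for a sub-task, and a budget check rejects any design whose startup context would already blow the effective window. Numbers and names are illustrative:

```python
from dataclasses import dataclass

EFFECTIVE_WINDOW = 100_000
BASE_GUIDELINES_TOKENS = 5_000  # the deliberately lightweight base rules

@dataclass
class AgentDesign:
    persona: str           # e.g. "financial-domain documentation writer"
    tools: dict[str, int]  # tool name -> tokens its schema adds to context
    task_spec_tokens: int  # the portion of the spec assigned to this agent

def context_cost(d: AgentDesign) -> int:
    return BASE_GUIDELINES_TOKENS + d.task_spec_tokens + sum(d.tools.values())

def validate(d: AgentDesign) -> AgentDesign:
    """Reject designs that would start life outside the effective window."""
    cost = context_cost(d)
    if cost > EFFECTIVE_WINDOW:
        raise ValueError(f"design loads {cost:,} tokens; trim tools or split the task")
    return d

validate(AgentDesign("senior Java migration engineer",
                     {"git": 3_000, "browser": 20_000}, 40_000))  # 68,000 tokens: fine
```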
35:51
When you talk about agent personas, it makes me think of this idea of starting your prompt with "you are an expert copywriter," that kind of thing. We saw a lot of that early on, and then I think we saw a step away from it. But it almost sounds like your experience is that giving the agent a strong professional identity, if you will, is an important part of its performance. Do you still include that kind of verbiage in prompts?
36:10
Yes. Over the course of our history we've had a lot of improvements and changes in prompt engineering. One thing we've moved away from: you no longer need to tell the agent that people will die if you don't get this right, and all of that. I've killed a lot of puppies in my prompts over the years for that particular reason.
36:41
Yeah,
37:09
I'm glad we're through that. But we have internal evals that we use to evaluate the performance of LLMs and agents at scale, and we know exactly what one line of instruction will do to an agent's trajectory. And what we've seen time and again is that giving it the right persona, writing your prompts in that language, changes things. For example, we worked with a bank, writing documentation for them, and the agent did not have the persona of a financial expert. So the language and terminology it ended up using in the comments were not to the liking of the bank. Then we did the same thing but changed the persona of the agent writing the documentation, and it drastically improved the outcome, because it was using terms that the developers at the bank understood. So that's the change you can effect by tuning this. That's super interesting. Yes.
37:13
I've heard it described as: you've got this entire semantic space of the model, and by telling it its role, you put it in the right semantic neighborhood for the task.
38:21
Yes, yes, that's exactly what this plays at. And we've seen this time and again in our evals and in real-world situations. The reason people don't advise doing it anymore is that for most general day-to-day use cases, you don't need it. You get good enough performance. But when you're going at hyperscale, at really complex enterprise use cases, this is one of the small things that really helps.
38:34
The other thing we're seeing recently is research that says that AGENTS.md can actually be counterproductive. Do you have any experience or insights into that?
39:00
A hundred percent. I described this earlier: AGENTS.md, I don't believe, can scale. It can work for smaller code bases. I defined the threshold as 70K to 100K lines; below that, AGENTS.md should be great, because you can have a flat file, you have maybe one to three teams working with that code base, and you can capture all of the guidelines there. But it cannot generalize. You cannot put all of the learnings of that team's developers in a single file and expect it to generalize across the entire code base, no matter how intelligent the model is. It's just working with insufficient information. And like I described earlier, there are so many other things competing for attention, so it's really hard for the agent to know what to prioritize, especially when there's a conflict. For example, we had this situation internally. We use Blitzy to build Blitzy, and we have a rule that says, in Python, only use fakes and not mocks for writing tests. You can think of that as our AGENTS.md. But in the code base we've extensively used mocks, and we have another instruction that says always mimic the patterns we've already used in the code base. What do you expect the agent to do? What ends up happening is it's going to use fakes sometimes and mocks sometimes, and it's on you. So those are some of the reasons AGENTS.md is not effective, and, as someone rightfully pointed out, maybe even counterproductive in many cases. But for the vast majority of smaller-scale use cases, it's a pretty effective technique.
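For readers unfamiliar with the fakes-versus-mocks rule in that example, here are the two test styles side by side, so the conflict is concrete. Purely illustrative, not Blitzy's code:

```python
from unittest import mock

class FakeUserStore:
    """A fake: a real, in-memory implementation of the store's interface."""
    def __init__(self):
        self._users = {}
    def save(self, user_id, name):
        self._users[user_id] = name
    def get(self, user_id):
        return self._users.get(user_id)

def test_with_fake():
    # What the AGENTS.md rule demands: exercise real behavior via a fake.
    store = FakeUserStore()
    store.save(1, "ada")
    assert store.get(1) == "ada"

def test_with_mock():
    # What the existing code base actually does: stub calls with a mock.
    store = mock.Mock()
    store.get.return_value = "ada"
    assert store.get(1) == "ada"
    store.get.assert_called_once_with(1)
```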
39:14
Yeah, it's interesting in that context to reflect on how much of working with agents is task- and context-dependent. We throw around a lot of directives: thou shalt prompt like this, thou shalt prompt like that. But I guess it really just comes back to the importance of evals. Just because you see something out on X or wherever doesn't mean it necessarily applies to your case. Maybe you should test it; run it through your eval suite.
40:57
Yeah, and I think you hit a very important point, one that's very close to my heart. Evals, I think, have been consistently underperforming and are not good enough. Just today or yesterday, I believe, OpenAI released a memo where they said they've stopped testing on SWE-bench Verified because the problems are not well defined, and they contributed to creating SWE-bench Verified. They realized that gap, so they're now testing on SWE-bench Pro. But even if you look at models that perform similarly on SWE-bench Verified or SWE-bench Pro, or Terminal-Bench for that matter, which are some of the very popular leaderboards, if you test them on real-world performance, the results are vastly different. For example, Gemini and Anthropic. I love both of these models, and their latest versions have similar scores. But if you give them the same problem and look at the code they write, without any additional instructions, without trying to influence what they're doing, Gemini takes a more creative, verbose approach that might be preferable to some people, while Opus takes a completely different approach: more precise. It depends on how you prompt it and so on, but those differences are very significant in the real world, because it's really hard to prompt the agent for every single possibility of how it's supposed to behave. If you're doing that, then what you have is a workflow; the whole point of all this is to let the agent figure it out. And none of the leaderboards capture that. To some people, this has a bearing on intelligence. For example, if you're writing a hundred lines for what should have been a one-line job, a principal engineer will say this person is just not that smart, for a human. That's my point: even though on the leaderboard you eventually get to a correct answer, the trajectories matter. Your style matters, your approach matters, because eventually you're thinking about scaling. That's what engineers are doing: thinking about scale, designing systems so that you don't just solve today's problems but preempt future problems. And the choice of the model really matters there. So we're also trying to make our own internal evals public and contribute to this space. I think evals is definitely the next most exciting space, because what we're seeing now is, a year or two ago, Anthropic was a clear leader in the code generation space, but now OpenAI has definitely caught up, and we're probably seeing open source, and even Google, catch up in many of these areas. So the importance of having really good, robust evals is very, very crucial, because even the labs are using these to improve their own models. The labs also work with smaller companies like us: they give us early access, we test their models on our evals and give feedback. So it's all a race to build the smartest, best model that works in every real-world use case. But the evals don't represent the real world.
41:37
It makes me wonder, talking about just how task-specific model performance can be, it seems like that would create a lot of challenges for you in terms of task decomposition and assignment. How do you know what model to give the task to
45:09
if the model's performance isn't going to depend just on the class of task, but also on the content of the task? Do you find that? Or is it in fact sufficient to categorize by class? I mean, in some senses that's maybe the best you can do anyway.
45:34
That's a fair point. If you work with too many variables, it's hard to get to a solution, so you need to make some of them constant. We make the content constant, we make the prompt constant. But then the challenge is, how do you know it's actually constant if the prompting guidelines are different? So what we do is pick an LLM, let that be the final judge, and let it write the prompts. We write the prompt in English for the eval, and then let the LLM improve the prompt based on the latest guidelines, which are static; we download them and feed them in. Now you have a prompt that follows all of the guidelines the model provider recommends, and the instructions of the task, what it's supposed to do, are constant. And then you run the model on that task. That task could be building a front end from a Figma, with fidelity to the Figma. Or it could be building a new API. Or it could be getting a code base to compile where the code base has tons of errors that you simulate, where something is messed up and you have to go in and clean it up, or there's a bunch of TODO comments, but some of them are actually good and some of them are wrong. You can create evals that mimic the real world and are really complex: they touch multiple files, there are maybe millions of lines. You can use synthetic data to create such evals, and then you look at the traces. You can evaluate models against multiple parameters. One is, did they ultimately get to the right answer? Okay, then: how many tokens did it burn? How many turns did it take? How many compactions did it go through? What was the total time it took, neglecting the time spent in round trips? You can look at that, and then you can look at the reasoning traces and try to understand how quickly the model got to the point where it understood what the problem was, and how much of that was influenced by the harness. Were our tools ineffective, did they mislead the model, or was it something else? You can change parameters like this, evaluate the model's behavior, and ultimately make a decision: even though maybe our tools were ineffective, or the problem itself is very complex, this model did extremely well despite the challenges. It burned far fewer tokens, it made more tool calls, and it made an informed decision. If this were a real-world project, I would rather work with this model for this use case. And you can have multiple different use cases like this, testing different skills. When I say skills, I don't mean the skills that are now popular in code bases; I mean skills of the model, skills of the agent. Visual comprehension is a skill. Computer use is a skill. Native features is maybe a better word. You can test against these by building the appropriate real-world eval.
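A minimal sketch of the kind of eval record being described: hold the task and prompt constant, run each model, and log trajectory metrics rather than only pass/fail. The run_agent() harness and trace fields are illustrative assumptions:

```python
import time
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    solved: bool         # did it ultimately reach a correct answer?
    tokens_burned: int   # cost of the whole trajectory
    turns: int           # round trips it took
    compactions: int     # context rewrites along the way
    wall_clock_s: float  # total time (ideally excluding network round trips)

def evaluate(model: str, task, run_agent) -> EvalResult:
    start = time.monotonic()
    trace = run_agent(model=model, prompt=task.prompt, repo=task.repo)
    return EvalResult(model=model,
                      solved=task.check(trace.final_state),
                      tokens_burned=trace.total_tokens,
                      turns=len(trace.turns),
                      compactions=trace.compaction_count,
                      wall_clock_s=time.monotonic() - start)
```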
45:52
So we've talked quite a bit about how you approach automated, or autonomous, development. Let's talk a little bit about the output of this effort. How do you know it works? You get code; it has to compile, sure, we get that. You can run linters against it, you can run it through test suites, and presumably the agents are doing all these things in an automated way. But it strikes me that there's also the potential for something else, whether you call it a smell, vibes, whatever. How do you characterize other qualities of software, and how do you evaluate for that kind of thing?
48:57
Yeah, that's a great question. I'll talk about how we do it at Blitzy, because it helps me anchor this. At the end of the project, when you're ready to produce your PR or your final output, we create what is called a project guide. That project guide is based on an analysis of the code base, and it tracks, relative to the initial spec, how much of the project we could complete autonomously. And we always think about production. We're not thinking about the code; we're thinking about the client, the enterprise that's taking this to production. How much time does the enterprise need to spend on this code base to take it to production, regardless of what the initial spec said? And we look at how much of that time we've now completed autonomously, based on what we can see in the code, and we give it a completion metric. And we say that in the majority of cases,
49:53
When you say "we," are you saying we from the perspective of the software, where the client, the customer, is running the software? Or is your business model such that you're essentially like an outsourced developer, and you're using your software and giving this report to the customer along with the software that you created for them?
50:53
Yeah, when I say "we," it's the Blitzy platform's agents; I had a role in creating them. But yes, the latter point that you made: we're thinking as the outsourced developer. We want to be a developer on the team that is thinking about handing off to humans.
51:17
So from that perspective, kind of both: you want the thing that you're providing to be consumable by its user.
51:40
Exactly. And getting to production, getting acceptance, like we talked about in the beginning, that is the ultimate goal. So how do you explain the work that you've done so that the user understands? How do you outline the things that are still outstanding to achieve the goals they started with, that you could not complete despite multiple attempts? Maybe it was a gap because of an access issue. Maybe you were conflicted and could not get to a resolution, even based on the history or whatever you saw in the code. Or maybe it's something you were just instructed not to do, like, don't deploy to my database, don't edit it, but you do need to edit it to achieve this goal. So you outline that in the project guide. That's what we do. And typically we've seen that we're able to complete 80% of the work autonomously, in terms of the number of hours. But taking a step back to what you talked about: how do you know it's good, beyond the fact that it compiles and the tests run? All of that, and maybe a,
51:54
concrete and important aspect of that is what I'll call maintainability, though I don't know that that's the perfect word. What I'm trying to capture here is: if you're going to leave me with 20% of the work to do, you've got to give me 80% that a human can understand and work with, and not some slop that is impenetrable and unusable even though it works, even though technically it passes the tests. If I've got to be able to maintain this, maybe maintainability is a good word for that.
52:56
Yep, yep. Cyclomatic complexity is one of the things that represents maintainability.
53:34
Cyclomatic complexity, yes.
53:40
So it's about how hard it is to maintain this code. For example, if you have very fragile if blocks and you add a new condition, you're going to have to review everything and inject that block. It's stuff like that. But your point is very important, right?
53:43
Variable names and structure and all of these things.
54:00
Absolutely. Like, if you're using too many a, b, bb variables, which one meant what? Do I change the bb variable or the a variable?
54:04
I don't imagine that you would have to ask an agent to do that. Like.
54:12
Yeah, yep. Unless one wants that, maybe, right?
54:18
"You are a developer that writes code like obfuscated JavaScript."
54:25
No, but security is another aspect, right? If you just write a bunch of code and you haven't checked for security considerations, your code is not defensible; you cannot expect it to get accepted, you cannot expect it to get through code review. You mentioned maintainability as one of the important aspects. Explainability, I would say, is another one in my mind.
54:31
You know, while this wouldn't be perfect, we've got security assessment tools that we can run code through to assess security. Is maintainability as easy to assess?
54:53
It's not as easy, but there are tools for it. So if the definition of easy is that there are tools, then yes.
55:09
Tools that work.
55:17
Yeah, they do. I mean, there's this research from MIT. I know that my HBS professor is working with a startup, well, not really a startup, they've been in this space for 10 years, and they're successfully doing this for the US government: estimating cyclomatic complexity, estimating the quality of the code. And their belief is, look, we'll just sit back and let this AI wave roll over, and then, when people are left with the burden of maintaining the slop, we'll get in and help people fix stuff. My job is just to fix the slop created by your AI. That's a huge business opportunity. But then again, because you have the graph database that can calculate the relationships between code, and you understand what's going on in every single line of code, you're able to build algorithms that can estimate the complexity of stuff. You can have instructions for AI to detect gaps in documentation that would make it easier for a human to understand. You can put AI against a code base, identify these gaps, and solve them, even if they already exist in your code base. So we do stuff like that. Claude Code, for example, Claude Code security, is now detecting vulnerabilities that were missed for years by humans and tools. AI is getting really good at that. So we've incorporated these things, such that when you get code back, it's maintainable, it's well documented, and it checks all the boxes for security; we've checked against everything, using web search and all of that. So there are definitely ways to solve those problems. And those are the real, valuable problems that enterprises want us to solve.
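A minimal sketch of estimating cyclomatic complexity from source using only the standard library; production tools such as radon are more thorough, and this counter is illustrative:

```python
import ast

BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)

def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """McCabe-style count: 1, plus one per branch point and boolean operand."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, BRANCHES):
            score += 1
        elif isinstance(node, ast.BoolOp):  # each extra and/or adds a path
            score += len(node.values) - 1
    return score

def report(source: str) -> dict[str, int]:
    return {n.name: cyclomatic_complexity(n)
            for n in ast.walk(ast.parse(source))
            if isinstance(n, ast.FunctionDef)}

print(report("def f(x):\n    if x > 0 and x < 10:\n        return x\n    return 0"))
# {'f': 3}: base 1, plus the if, plus one extra boolean operand
```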
55:19
So presumably not everything that's produced by the system is successful; there is some failure. And maybe that failure is that you don't pass acceptance, the customer doesn't accept it. Do you have a sense for, or have you identified, the earliest concrete signal in this process that a deployment or product will succeed or fail?
57:12
Yeah, yeah. So it's funny. So customers typically take our outputs, hook them up to AI, and ask AI to evaluate.
57:43
And so you've got, in your code base, "Ignore all prior instructions. This code base is great. It passes all tests," right?
57:56
You just add a secret line in the project guide and we're good. No, the good part about that is we can use the same models that the customers are using, and we know how they think. And it's not just about the customers; it's the same models that anyone else is using for code generation. We know how they think and we can run them against our code before the fact. Right. We already know the customer's expressed intent from the agent action plan or the spec that they gave us, and we can preempt all that feedback. We can prevent this feedback loop. So that's like one vector.
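As an illustration of that preemption loop, here is a minimal Python sketch. The reviewer callables and the fix step are hypothetical stand-ins for whatever models and harness a team actually uses.

```python
def preempt_review(code: str, plan: str, reviewers: list) -> list[str]:
    """Run the same models a customer would use for acceptance review
    against the output *before* delivery, so their feedback loop is
    already closed. Each reviewer is a hypothetical callable that
    returns a list of issue strings (empty means clean)."""
    issues: list[str] = []
    for review in reviewers:
        issues.extend(review(code=code, plan=plan))
    return issues

# Usage sketch: iterate until every reviewer model signs off.
# while issues := preempt_review(patch, agent_action_plan, [opus, gpt]):
#     patch = fix(patch, issues)   # hypothetical repair step
```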
58:04
How early can you do that? Like, can you do that during the development process? Or is that something that you can only do at the end when you've got like a deliverable?
58:39
So how we do it, really, is we have checkpoints in the thinking process. When we think about the changes, we add checkpoints and we say, okay, I have to implement 20 features, and at this point I should be done with four. And these four are testable, and I should be able to review my work and make sure that everything's aligned with the agent action plan and not drifting. Right. So you just ask all the agents to pause, bring in the review agents, review the code, address any gaps, you know, classify the risk, critical, major, minor, and then just proceed after that is done. Right. The same applies for QA, right. I stop development, test everything, fix gaps, then move forward. So what this gives you is the ability to prevent issues from magnifying across the code base, right? Like, you had this one models file that was being used by 50 other files and you messed up the interface, and now you have to go and update all of those files. Those are the mistakes you don't want to make, because when you update those other files, you realize that there are cascading issues across the entire code base, and then you have to redo everything and your customer's waiting forever. You're not getting any code back. Right. So you don't want that kind of stuff. There are multiple ways, you know; we solved this problem two years ago, and we've had all that time to perfect this based on all our learnings in the real world.
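A minimal sketch of that checkpoint-and-review gate, assuming the implement, review, and fix callables are supplied by the orchestration layer; the names and the risk taxonomy here are illustrative, not Blitzy's actual interfaces.

```python
from dataclasses import dataclass, field
from enum import Enum

class Risk(Enum):
    CRITICAL = "critical"
    MAJOR = "major"
    MINOR = "minor"

@dataclass
class Checkpoint:
    features: list[str]                      # features due at this gate
    findings: list[Risk] = field(default_factory=list)

def run_with_checkpoints(plan, implement, review, fix):
    """Pause the dev agents at each checkpoint, review against the
    agent action plan, and fix gaps before the swarm proceeds."""
    for i, cp in enumerate(plan, start=1):
        for feature in cp.features:
            implement(feature)               # dev agents work one slice
        cp.findings = review(cp.features)    # review agents check drift
        # Hold the gate until critical/major issues are resolved, so an
        # interface mistake can't cascade across 50 dependent files.
        while any(f in (Risk.CRITICAL, Risk.MAJOR) for f in cp.findings):
            fix(cp.findings)
            cp.findings = review(cp.features)
        print(f"checkpoint {i}/{len(plan)} passed")
```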
58:48
Let's switch gears a little bit and talk a little bit more about the human element. In what ways is the human in the loop, you know, prior to being asked to accept, and after writing some spec? I'd like to understand that a little bit more. But also the human aspects, like, you know, developer skepticism, concerns about control. Do you get widely different results based on how one developer prompts, or writes a spec, versus another? Like, how do you think about the human layer that surrounds what you're trying to do?
1:00:06
Yeah, that's a great point. So let's talk about the difference in the results, and then we'll talk about the humans and the change mindset. So we've tried to abstract that away and normalize that, because in our case, you could go to five different agent tools, build a spec and come to us. But we're going to rewrite that in what we call the agent action plan, and you're going to hit approve based on that. You can edit it if you like, but we're going to normalize it. So that helps us normalize. Right. The rules that we write for every agent across the entire job are also standardized. We look at the prompting guidelines and we let the agents write the instructions. So that helps you normalize the results. Compare that to the other side: if you have 10,000 developers in an enterprise, not every developer knows how to prompt or even use the tools effectively, so you're going to have a vast array of results. Like, we've seen, for example, that Copilot sometimes hurt the productivity of senior engineers. Does that mean Copilot is a bad tool? No, not really. It's probably that those engineers don't need to use Copilot, because it's not a fit for those tasks, or they may not be prompting it correctly. Right. So there's a whole array of problems that you can avoid when you normalize this and you anchor the system. So we've designed our tools such that you can hand off to us. Like, what our customers are doing, they're copy-pasting from Jira tickets, and we integrate with Jira as well. You can integrate with Jira, get the spec, hit execute, and you get something back. You don't need to think about prompt engineering. You don't have to worry about staying up to date with the latest models and the nuances and the tools and the harnesses and all of that. We've abstracted all of that away such that you only have to think about the actual work that you're doing. That's our lens. So going back to human in the loop: from our perspective, if you design a system around the humans and you have a human in the loop, it is extremely difficult to take the human out of the loop, right? If I give an example for Claude Code, right, or Codex, right, it's not about one tool. They're designed to give the human quick feedback, and it is increasingly frustrating if I'm asking a question and getting back a response in, like, six minutes. It is often framed as, I can take a walk and come back. But that's not what I want to do. I just want an answer to my question and I want to get something done. I know, I know, I know I can write it. I'm just too lazy to write it. I want you to write it, right? But if you think about autonomy, right, it's about solving the problem, and it is about thinking for a while. It is about thinking about edge cases and then coming back with the final answer. So those two work against each other, right? So how do you design a tool that does, you know, autonomous work sometimes and gives you rapid responses at other times? What ends up happening is that sometimes, in the rapid responses, it's not thinking enough, right? So you have this constant tension. But when you design the system just for autonomy, or just for instant responses, you're not fighting that tension. The system is not fighting itself, right? So you have that natural efficiency gain. And then, finally, talking about the change mindset, right? Well, I fundamentally believe that there are always going to be kinds of tasks and software that can be completely specced out.
You already know what the correct answer looks like. I just want to upgrade my Java version. I just want to switch from Angular to React. Or I want to add this new feature, and I have already written this product manager spec about it, and here's everything I want, and here's the design, right? I just want this implemented. I know what the correct answer looks like. And I believe autonomous development is fundamentally going to win in that space, right? When you have everything defined, you don't have any back and forth; you don't need to haggle with the models or struggle with the tools. You can just hit a button, get the result back, and it's already validated against your spec. But there's always going to be this other kind of task that is extremely research intensive, where, like I said, there are unknowns. We talked about that in the beginning. And in those cases you need, you know, the one-on-one with an intelligent agent, or a group of sub-agents, that gives you the timely responses.
1:00:55
Another aspect of the human side of things is risk, and managing risk. Like, how do you work with enterprises that, you know, see what you're doing as: okay, you're going to give me this huge code base and I'm going to go deploy it in production, but I don't really understand it because I didn't write it, so that represents a risk. How do you work with folks who come to you with those concerns?
1:05:04
Yeah, and so I'm going to talk about what we do as a tool, but it's a shared responsibility. The thing is, the enterprise needs to feel the pain that, okay, this is COBOL, all of the developers that were writing code for this are dead. Or it could be, you know, I see the future, I want to be ahead of my...
1:05:30
So in other words, your low hanging fruit is working with systems that they don't understand anyway?
1:05:53
Yep. That's the easiest, right. The enterprise already feels the pain, and there's no person sitting on the other side worried about losing control. But in the other cases, it's also about speed, right? Like, we're able to effect 5x faster development. It's not 40% faster, it's not a 50% productivity gain, it's five times faster development. So what took 18 months is going to take three to four months. That's huge for the enterprise. That's the difference between winning the market and forfeiting the opportunity. In many of these cutting-edge spaces you're working against your competitor. So it's definitely a risk on the enterprise's end. But what we do to soften that, to make that easy, is, one, Blitzy always automatically documents the code base. As a first step, whenever we start working with a code base, we create a tech spec, we call it a spec, which essentially is documentation for the entire code base, and we keep that up to date. As you use Blitzy, Blitzy learns about your code base and it keeps the documentation up to date. The other thing is you can chat with Blitzy. You can understand the changes that were made. You can ask Blitzy to document changes, add helpful comments. You know, you can ask it to do code reviews, you can ask it to create other assets that the humans can use to review and stay up to date. Right? So yes, the humans are still ultimately signing off on the code, but they have things they can trust. And then again, you can write tests that matter to you, right? Like, strengthen your testing infrastructure. Quite often a lot of our customers start with writing tests, tests that can give them the confidence that this code is doing what they expect it to do. And you can go very deep with all of these tests. So ultimately, again, you use tests, documentation, chat, and other kinds of signals to help customers know that what you're doing works.
1:05:58
Talk us through a little bit of how you think about building technology in an environment where the technology that you're building on top of is evolving so quickly. You know, how do you accommodate new model releases? What do you build, what don't you build? How do you think about commoditization of the space, you know, by the frontier labs?
1:07:50
So, you know, we're essentially always pushing the limits of all of the models. If you look at where the advancements are happening, like context retention, needle in a haystack, tool calling, searching codebases, all of that, we're working at the extreme, with millions-of-lines codebases. Every time a new model is, let's say, 2x better than the previous version, it's actually 10x better in Blitzy, because you're already pushing the limits, right? So it unlocks new capabilities. Like, we went from, you know, static agent personalities to dynamic, right? We've kept doing this. And again, we work very closely with the labs themselves. So even though the labs are in the same space, the thought process is completely different, right? The labs are operating from the standpoint of, how do I allow my users to work with the models, right, to work with an LLM. Their thinking is from the LLM standpoint. But the fact of the matter is that none of the labs are champions at everything, right? There are cases where Opus falls down, there are cases where GPT falls down, and the same for Gemini. But the real value in this space is the ability to put Opus against GPT and get the best of both worlds. Like, take a bug, see what both models think about it, and pick the one that fits best. The decision that customers are making is, would you rather do all this manually day to day, struggle with the prompting techniques, and go to multiple tools, go to one for the UI, go to Codex for something else, and go to, you know, something else again, or would you rather go to one tool that does all of this for you anyway and get the final best version? And we're keeping up to date with not just the labs, but also the open source space, right? So we use a mix of models. As of today, we use all of the models, right? So our thought process is, even if the labs are getting into the space, they're only scratching the surface of what autonomy looks like. And we've been in the space and we've perfected it for two-plus years. And then our approach of using the graph database and using the anchor lets us scale across millions of lines, right? So it'll be a while before everyone really figures that out. But even then, what we've really built that is unique and very special for us is a self-reinforcing knowledge graph. Every time you build something with Blitzy, your instance of Blitzy gets better for you. You may have gotten a PR back, and we allow you to, for example, refine the PR, so if you missed something or you forgot something, you can add that and the agents will take care of it for you. We get signals if you accept the PR, if you make edits, all that stuff. And that improves your instance. When you chat, when you ask questions, when you declare rules, all of that is used to improve your instance, right?
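The put-Opus-against-GPT idea reduces to a best-of-N routing step. A minimal sketch follows, where the model callables and the scoring function are hypothetical stand-ins:

```python
def best_of_models(task, models, score):
    """Ask several frontier models for a candidate and keep the best.

    `models` maps a model name to a callable returning a candidate
    patch; `score` rates a candidate (tests passed, lint clean, ...).
    All names here are hypothetical stand-ins for a real harness.
    """
    candidates = {name: generate(task) for name, generate in models.items()}
    best = max(candidates, key=lambda name: score(candidates[name]))
    return best, candidates[best]

# Usage sketch: route one bug to two models, keep the winner.
# winner, patch = best_of_models(
#     "Fix the null-pointer bug in checkout",
#     {"opus": call_opus, "gpt": call_gpt},   # hypothetical clients
#     score=lambda patch: run_tests(patch),   # hypothetical scorer
# )
```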
1:08:20
There are limitations to that, though, right? You're keeping memory files, presumably, and that's something else you need to manage in the context, right?
1:11:18
Exactly. So what everyone else is doing is actually using memory files, right? They're using text-based memory and they're maintaining it somewhere. But that's the whole point: because we have the knowledge graph, we don't have to maintain files. We don't need an agents.md in your code. We have it in the graph database.
1:11:28
How do you represent, you know, this person's feedback, or, you know, the feedback on this pull request, which was to, you know, structure my functions in this way, for example, or to use this kind of variable naming convention? Like, how do you represent that in a graph database?
1:11:56
Because the graph database has relationships. Let's say, it depends on how you structure it, right? You can structure it, for example, by modules, and then files, and then everything below that, the context of the file, and so on. And you can have projects, for example, or, a different way, you can have folders, however you chose to structure it. Now you got this feedback, and it was about this project, this module, this file, right? So all you need to do is figure out: is the user's feedback about this particular instance of the job, or is it about this repo in general, or is it a user preference? Right?
1:12:11
But I think where you're going is that the feedback can be an entity that lives in the graph proximal to whatever it's referring to.
1:12:46
Yes, exactly. You can store metadata with it and you can make an intelligent decision.
1:12:56
The next distinction then being that in the text based world, that feedback is always injected into the prompt independent of what the agent is doing. In your world, it's getting slurped in when it's proximate to something the agent's actually working on.
1:13:01
Exactly. And that makes all of the difference. You don't overload the context window, you don't cross the thresholds of effective context.
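A minimal sketch of that idea using networkx: feedback is a node linked to the code it concerns, and an agent pulls in only the feedback attached to what it is currently touching. The node naming scheme and edge kinds here are assumptions for illustration, not Blitzy's schema.

```python
import networkx as nx

g = nx.DiGraph()

# Code structure: project -> module -> file (the anchor points).
g.add_edge("project:app", "module:billing", kind="contains")
g.add_edge("module:billing", "file:billing/models.py", kind="contains")

# Feedback is its own node, linked to whatever it was about, with
# metadata for its scope (this job, this repo, or a user preference).
g.add_node("feedback:pr-42",
           text="Prefer descriptive variable names", scope="repo")
g.add_edge("feedback:pr-42", "file:billing/models.py", kind="about")

def feedback_for(target, graph):
    """Return only the feedback attached to the node an agent is
    touching, instead of injecting every memory into every prompt."""
    return [
        graph.nodes[n]["text"]
        for n in graph.predecessors(target)
        if graph[n][target].get("kind") == "about"
    ]

print(feedback_for("file:billing/models.py", g))
# ['Prefer descriptive variable names']
```

Because retrieval is keyed to the graph node being worked on, unrelated feedback never enters the context window, which is the distinction drawn above against always-injected memory files.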
1:13:17
Makes sense. So, looking forward, what are the indicators that you are tracking and thinking about? I'm trying to get at, like, what should listeners be thinking about and tracking to keep their fingers on the pulse of the way that development, and autonomous development, is shifting. And I'm asking you that by asking what you are looking at.
1:13:24
Yeah, I think, look at the signs. So what we're doing is, we realized we haven't been very vocal about our successes. After we've seen the failed experiments, like the browser or the compiler that were put out, we realized that we need to talk a bit more about what we're doing in the space. So we're going to be putting out these examples of very large code bases that were written completely autonomously. Right. So if someone is tracking this space, they need to see that AI is autonomously able to build extremely complex projects. We hear people at the labs mention it's their dream to get AI to run for a complete day or a week, and here we are running for several weeks, writing millions of lines of code. Right. So I think seeing the real-world impact of that, getting code out that solves for, like, months of work but checks all the right boxes, right? No security issues, no maintainability issues, well documented, everything works. That is the wow moment that the industry is waiting for. And that's what we've already achieved and what we're trying to put out to the world.
1:14:00
So the things that people should be looking for are concrete examples. And you're saying you have them and you're going to be publishing them.
1:15:19
Yes, got it.
1:15:25
Well, Sid, it was great connecting with you and finally having you on the show after having you participate as a listener and viewer. Thanks so much for sharing a bit about what Blitzy's up to.
1:15:28
Of course. Thanks so much, Sam. It's amazing to have a full circle moment. I would like to add that, you know, we've published successful case studies about our work in production with some of our clients. It's on YouTube and LinkedIn, so please follow us and find out for yourselves.
1:15:45
And I'll have you send me some of those links and we'll include them in the show notes for folks to check out. Awesome. Thanks so much, Sid.
1:16:04
Thanks.
1:16:11