Evals, Feedback Loops, and the Engineering That Makes AI Work
Ankur Goyal, CEO of BrainTrust, discusses how successful AI companies focus on engineering around models rather than just using the smartest models. The conversation explores the tension between AI's continuous nature and systems' discrete approach, the economics of frontier vs. open source models, and why proper evaluation frameworks are crucial for AI product development.
- Companies shipping successful AI products aren't using the smartest models - they're using the best engineering around models with proper evals and feedback loops
- Chinese AI models show high token volume usage but low dollar-weighted spend, suggesting cost-driven adoption with quality trade-offs
- The AI industry may hit demand-side limitations before supply-side constraints, as enterprises struggle to implement AI systems despite unlimited apparent demand
- Frontier labs can raise capital faster than they're limited by engineering complexity, creating a unique dynamic where money directly translates to model capability
- SQL significantly outperforms Bash for agent tasks across accuracy, efficiency, and speed metrics, challenging the popular 'give agents a Unix environment' approach
"AI is continuous and systems are discrete. Humans fundamentally think a little bit more in terms of systems and you know, predictability and reliability, consistency than they do non determinism."
"Right now we are kind of building, you know, like God. And so it's possible and probably economically viable to keep throwing capital at the problem to make God 1% smarter."
"If you're building an agent and you're using pre trained models or whatever, it's a fool's errand to think that the agent that you're building, the way that you're providing context to it, like all that stuff that is not engineering and it shouldn't be engineered and it should be bitter lesson pilled."
"I think evals are like the scientific method applied to software engineering with, you know, non deterministic systems like AI systems."
"In the past, large companies that received lots of money were basically naturally rate limited by engineering speed. These frontier labs don't have that problem. They can literally just raise money and build a model based on the money."
AI is continuous and systems are discrete. Humans fundamentally think a little bit more in terms of systems and you know, predictability and reliability, consistency than they do non determinism.
0:00
In the past, large companies that received lots of money were basically naturally rate limited by engineering speed. These frontier labs don't have that problem. They can literally just raise money and build a model based on the money. They just throw more compute, more data at it. So you kind of have to ask the question: what is going to end up limiting them?
0:13
If you're building an agent and you're using pre trained models or whatever, it's a fool's errand to think that the agent that you're building, the way that you're providing context to it, like all that stuff that is not engineering and it shouldn't be engineered and it should be bitter lesson pilled, meaning you should build that system in a way that you can throw it away tomorrow. Right now we are kind of building, you know, like God. And so it's possible and probably economically viable to keep throwing capital at the problem to make God 1% smarter. But when you can't make God 1% smarter, there is like an insane opportunity to engineer God to be more efficient.
0:34
The AI industry keeps reaching for brute force. Frontier labs throw compute at training runs instead of optimizing. Developers give agents a Unix environment instead of structured tools. Teams chase the latest model instead of engineering the one they have. But the pattern Ankur Goyal keeps seeing is the opposite: the companies shipping AI products that actually work aren't using the smartest models. They're the ones with the best engineering around the models, the evals, the feedback loops, the testing harnesses. This conversation covers where that discipline matters and where it doesn't, the cycle between open source and closed source models, why Chinese models show high token volume but low dollar spend, and a benchmark comparing Bash versus SQL for agents, with results Goyal calls comical. Martin Casado speaks with Ankur Goyal, founder and CEO of BrainTrust.
1:15
I think people watching this will know you, but let's just very quickly go through background, mostly just to set the stage because I want to talk a lot about whether AI is actually a systems problem or not.
2:04
Great.
2:12
So do you mind just kind of giving the rough sketch?
2:12
Yes. Nice to meet you, Martin. I'm Ankur. Prior to BrainTrust, back in ancient history, I used to work on relational databases. Way before LLMs, I saw deep learning come out and become a thing, and I started to get excited that the way we query and work with data, which was primarily SQL, is not as powerful as what we could do. And so I started a company almost 10 years ago now called Impira, where we did.
2:14
It's been that long?
2:44
Yeah, time flies.
2:45
Wow.
2:47
Where we did AI-powered document extraction, and that turned out to be one use case that worked pretty well, although we were using computer vision models and stuff, because way back then they were way more powerful than language models. And we had a bunch of customers who were doing different use cases, and we might make our invoice stuff better and then make the bank stuff worse, or make the bank stuff better and the invoice stuff worse. And we had to get really good at avoiding that. That's when we built internal tools to do evals, and then get good data to do evals, and sort of run this feedback loop pipeline. Didn't think too much of it at the time, but then we got acquired by Figma and I led the AI team there, and we had exactly the same problem, but this time building on top of LLMs.
2:47
By the way, you know what's interesting? Everybody says the word eval. You say the word eval, I say the word eval. I found out very few people actually know what an eval is, like, specifically what an eval is.
3:33
Yeah, I think evals are like the scientific method applied to software engineering with, you know, non-deterministic systems like AI systems. So you come up with a hypothesis. Let's say I'm going to try out a new model, or I'm going to tweak my prompt, or I'm going to throw more context into this agent by fetching from whatever API. And I suspect that this is going to improve the quality of my agent, or I suspect that it will make it faster, or whatever. So you come up with a hypothesis, and then you essentially simulate running the system on a set of inputs and you observe the outputs. You might have ground truth, you might not have ground truth, and you try to measure what the difference is, and then you quantitatively look at the difference. And I think what's really important is you also qualitatively understand it: okay, it says it got better, but let me look at it with my eyes and see if it actually was better or not. By reconciling that, not only do you double check things with your intuition, but you also give yourself the opportunity to improve the next eval.
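The loop Goyal describes, hypothesis, fixed inputs, scored outputs, then eyeballing the results, can be sketched in a few lines. This is a minimal illustration, not BrainTrust's API; the names (`run_agent`, `cases`, `evaluate`) are invented for the example, and the toy "agent" just stands in for a real model call.

```python
# A minimal eval loop: run the system under test over fixed cases and score it.
# run_agent is a hypothetical stand-in for the real agent you are testing.

def run_agent(prompt: str, model: str) -> str:
    # Swap in a real model call here; this toy version only "improves"
    # when the candidate model is used.
    return prompt.upper() if model == "candidate" else prompt

# (input, expected) pairs; expected could be None when there is no ground truth.
cases = [
    ("refund policy?", "REFUND POLICY?"),
    ("order status?", "ORDER STATUS?"),
]

def evaluate(model: str) -> float:
    scores = []
    for inp, expected in cases:
        out = run_agent(inp, model)
        # With ground truth, score exact match; without it you would plug in
        # a heuristic or LLM-as-judge scorer here instead.
        scores.append(1.0 if expected is not None and out == expected else 0.0)
    return sum(scores) / len(scores)

baseline = evaluate("baseline")
candidate = evaluate("candidate")
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
```

The quantitative diff tells you whether the change helped; reading the individual outputs with your eyes, as Goyal stresses, tells you why, and feeds the next eval.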
3:42
Awesome. So we're going to get back to that in just a second, but that's just so everybody knows what we're talking about with evals. So your background: before that you were at MemSQL? Maybe talk about that very quickly.
4:42
Yeah, MemSQL was started right around the time that people realized old-school SQL databases were not really built for the web. And I think there was this really big NoSQL thing that happened for a while.
4:53
The head fake. The entire industry got head faked.
5:07
Well, I think the same thing is happening right now with agents and bash.
5:09
Yes, let's go. Yeah, let's go into that. I agree.
5:12
You know, the agents-and-Bash thing of its era was Hadoop back then. And I think what happened is, many enterprises who were trying to use NoSQL were struggling because the end users and all these legacy systems they had built didn't understand how to speak NoSQL. They understood how to speak SQL.
5:15
Well, you think that was the issue? I mean my experience with the issue was like it didn't provide the guarantees you needed and so you had to implement it at the app layer.
5:32
Well, of course, I feel like you're.
5:38
Like re-implementing an RDBMS.
5:40
Most enterprises weren't even there, though. I think that's what the smart developers who were actually paying for the NoSQL products realized. But the enterprises couldn't even do a proof of concept. Hey, I have Crystal Reports, okay, I have it pointed at Oracle. You're trying to sell me some NoSQL database. How do I point this at you?
5:42
So I think the more pernicious version of this problem was: you got sold on this. Oh, the web giants are doing this stuff that's eventually consistent, I can too. And then you end up building this into a system that actually requires strong consistency.
6:04
Right, right.
6:17
And then all of a sudden this is like the wrong thing. And then like that's where the rubber meets the road.
6:18
Yeah, yeah.
6:23
And I would say that cost companies in the industry years.
6:23
I remember there was this weird moment. MemSQL was based in the Bay Area, so we would constantly try to get the cool tech companies to use us, but we had a lot more traction with traditional enterprises. So I would often fly to New York and meet with banks and stuff, and they were using our product. People who had a relatively basic understanding of SQL were doing very complex financial analysis using our product, with SQL queries that are this long. And then I'd go to a tech company of your choice and visit them, and they had 50 engineers working on writing a MapReduce job that does a really dumb version of the same query, just trying to figure it out.
6:27
And also, a workflow like that is totally different than database queries. I mean, these are almost like two different things. A lot of the Hadoop use case was MapReduce, which is process a bunch of documents, versus something where you would actually need SQL.
7:09
And it makes sense for that use case. Yeah.
7:22
Which actually brings me to what I want to talk about next, which is, from my perspective, working with a lot of founders in this AI space, they tend to come from one of two backgrounds. They're either AI people or they're systems product people. Yeah. And you're definitely a systems product person that's becoming an AI person. Do you find that there's in any way a tension between these two things?
7:25
Yeah, I mean AI is continuous and systems are discrete. And so the way that we think about stuff tends to be quite different. Systems people are not necessarily trying to optimize over a large aggregate. In systems, if you're building SQL systems, every SQL query needs to be correct. And so here's where the tension is: we always want to build compilers, and I think AI people always want to build optimizers. Compilers have optimizers, but they're algebraic optimizers, so they're kind of different. And I think that humans fundamentally think a little bit more in terms of systems and, you know, predictability and reliability and consistency than they do non-determinism. What has worked well for us is understanding as much about AI as we do, which is certainly not as much as the hardcore researchers, but quite a bit, so we're able to provide tools that help AI make sense for people who like to think in terms of reliability and performance guarantees.
7:45
Yeah. Let me just try this on for size to see if this makes sense, which is, it feels like a lot of the way that the AI industry has evolved is around this vague notion of the bitter lesson. I know people use that term freely, but here's what I mean: these models are universal function approximators, and it's distribution in, distribution out. And so they just throw a bunch of data and a bunch of compute at it, and then you get this thing, and it's almost like anti-engineering.
8:49
Right.
9:17
I'm going to sit down, I'm going to have a team for four months that's just going to throw a bunch of data and compute at this stuff, and what comes out is basically the thing, as opposed to some clever way. If you read the original Sutton bitter lesson essay, it's basically: don't do engineering, just do the thing that scales. And so I feel like you're a database guy, database systems product, which is all about engineering and trade-offs and incrementalism and abstraction, versus, you know, here is the God-model type stuff. And you're in the position of actually having to rein in the complexity of this stuff with things like evals and products or whatever. So I'm just wondering, do you think that over time the bitter lesson wins out here, or do you think that we're going to have to use a systems approach, or can these things ever be reconciled?
9:18
Well, I think the. Even though the thing that we're all fascinated by is this bundle of weights that gets produced and it's kind of like a mini God or whatever, there is a shit ton of engineering surrounding that thing, especially around capturing, cleaning, preparing, distributing, shuffling data.
10:03
You mean the training part, or just using them in an app?
10:22
The training part, yeah, all of that. Even though the artifact is in some ways not engineered, there's a ton of engineering around it to make the artifact engineerable. And I think the same thing is true in application use of AI. Mikayla from Replit said this once to me, which made a ton of sense: a new model comes out, we want to throw away our entire code base. Yeah, and I heard a very similar.
10:25
One from somebody that worked at Cursor, and he said, imagine writing an operating system for a chipset every time a new version comes out, like you had an entirely different machine code or instruction set.
10:52
Exactly. I mean, by the way, I think that happened today with the new models that came out. But what do you mean? Minimax 2.5 and GLM 5. But we'll get into that in a second. I think if you're building an agent and you're using pre-trained models or whatever, it's a fool's errand to think that the agent that you're building, the way that you're providing context to it, all that stuff, that is not engineering, and it shouldn't be engineered, and it should be bitter lesson pilled, meaning you should build that system in a way that you can throw it away tomorrow. The thing that actually matters, and the difference between the teams that build products that you and I would say work, like Cursor's product, and products that just feel shitty, is all the engineering work that goes around it to make sure that you can build a feedback loop from what's happening in production to what you're actually testing. There's one, I can't say which yet because they haven't shipped it, but one of our customers, which is relatively sophisticated, is already testing the new Chinese models in their agent harness, and they have very precise insight into where it works and where it doesn't. That harness, the testing harness, is very well engineered. The innards of the system that they're testing are not engineered at all. And that's very much on purpose.
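The "engineer the harness, not the innards" idea can be made concrete: the agent logic is a throwaway function, while the harness (the fixed cases, the scoring, the per-case failure report) is the durable part. This is a hedged sketch; every name here is hypothetical, and the two lambdas stand in for the same prompt pointed at different models.

```python
# The harness is the engineered artifact: it runs ANY agent through the same
# fixed cases and reports exactly where it fails. Swapping the agent (e.g. a
# new model behind the same interface) is a one-line change.

from typing import Callable

def harness(agent: Callable[[str], str], cases: list[tuple[str, str]]) -> dict:
    """Run an agent over fixed cases; return its score and failing inputs."""
    failures = [inp for inp, expected in cases if agent(inp) != expected]
    return {"score": 1 - len(failures) / len(cases), "failures": failures}

# Two disposable "agents" -- placeholders for different models in the same slot.
old_model = lambda q: q.strip().lower()
new_model = lambda q: q.strip()

cases = [("  Hello ", "hello"), (" WORLD ", "world")]
print(harness(old_model, cases))
print(harness(new_model, cases))
```

Because the harness gives a per-case failure list rather than a single number, you get the "precise insight into where it works and where it doesn't" that Goyal describes, without the agent internals ever needing to be stable.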
11:04
I'm going to get to the Chinese models in a second, but I actually want to push on this in one more direction. At some level, you pour a billion dollars into a training run and out pops this thing, and then you hope that you've got a good harness that manages it and understands it. That's kind of what BrainTrust does as a company: it reins in the complexity. But I just have this kind of weird doubt coming from traditional software, which is, what if you had a team that spent five years and a billion dollars building an operating system, and then they show up and they just give that to you? You just wouldn't have any hope from the outside. So in some way, is this the same type of problem, or is the problem reduced a little bit just because the way these function is more scoped? It just feels to me like a billion dollars of compute and data, and I'm just using that as a rough number, the current training runs are of that size, and that thing kind of shows up. How can we have any hope of reining in that complexity?
12:17
I mean there's like a hilarious, a hilarious metaphor here for vibe coding. Right? You know how people talk about the comprehension debt that you build up while you're vibe coding something and then eventually you need to look at it.
13:14
Yeah, yeah.
13:27
So, you know, it's a good question. It's really hard to know what the current models actually are like. I don't actually know, and I have no idea what the margin structure is. But it feels to me like what happens is every N days we stop making rapid progress on the quality of the models, and then people engineer them like crazy to make them incrementally more efficient. I don't actually understand whether that means having a deeper inductive understanding of what's happening inside the weights or not; I sort of doubt that it does. But I do think that right now we are kind of building, you know, like God, and so it's possible and probably economically viable to keep throwing capital at the problem to make God 1% smarter. But when you can't make God 1% smarter, there is an insane opportunity to engineer God to be more efficient. And I do think that these things are in constant tension. There are a lot of people, I feel like, who want to be nerd sniped into making these systems more efficient, but the bitter lesson is sort of preventing that.
13:31
My somewhat naive view of this is that it's really a capital flow issue: as long as these companies can continue to raise a ton of money, they'll always just do the dumb thing. There's no reason not to do the dumb thing because it's a lot faster, right? You can't scale engineering. And as soon as they can't raise capital, they'll have to do engineering.
14:42
Yeah.
14:58
I mean, I think specifically with BrainTrust, as a user of BrainTrust, the way that I think about it is, you have these things that landed from space, so there's no way you'll actually understand them.
14:58
Yeah.
15:08
So literally the goal is not to understand them. Because a lot of people, when they think of evals, they think, I'm understanding this thing. But that's not what's happening. It's like you're almost protecting your app from them. You're building whatever machinery is needed to use exactly what you need, but with enough guardrails on that you can actually tie it to a traditional state machine.
15:08
I mean, just like the scientific method, I think evals are actually almost entirely about understanding the problem that you're solving. I think product managers should only be working on evals. And I think evals are the natural evolution of a PRD. By creating a really good eval, you are making a declarative representation of what your product should be. Yeah, yeah.
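If an eval is "a declarative representation of what your product should be," then a PRD can literally be written as data: each requirement becomes a checkable case. A small sketch, with all field names invented for illustration:

```python
# A product spec expressed declaratively: requirements a PM could own,
# checkable automatically against any response the product produces.

product_spec = [
    {"requirement": "greets the user by name",
     "must_contain": "Ada"},
    {"requirement": "never promises a refund outright",
     "must_not_contain": "guaranteed refund"},
]

def check(response: str, case: dict) -> bool:
    # A requirement passes if its contain/not-contain constraints hold.
    if "must_contain" in case and case["must_contain"] not in response:
        return False
    if "must_not_contain" in case and case["must_not_contain"] in response:
        return False
    return True

# A hypothetical product response, checked against every requirement:
response = "Hi Ada, we'll look into your order."
results = {c["requirement"]: check(response, c) for c in product_spec}
print(results)
```

The point of the design is that the spec stays readable by a non-engineer while still being executable, which is what makes the eval the natural evolution of the PRD rather than a separate artifact.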
15:27
No, this is my sense too. And there's another thing along these lines that I think a lot about, which is that these models, as soon as they're released, just sublimate intelligence. If you actually look at what's going on, a lot of the AI companies that are not one of the two big foundation model companies are training these smaller models and doing all this RL, but they always use the SOTA models to do it. Whatever OpenAI does, a billion-dollar training run, that model's out there, and then that's just kind of bleeding out to these other models.
15:52
Which is a form of engineering.
16:26
Yeah, of course, 100%. But as an industry, we're in a way very lucky that these things bleed intelligence so quickly. But then it kind of begs the question: where does this intelligence go? If it becomes an engineering solution, then you'll just say, well, now people will engineer solutions until the next model comes. But one place it's definitely going is other models, and in particular it seems to be going to Chinese models.
16:27
Right.
16:52
Like in a way, you could just wait for three months and it'll show up in somebody else's model and you can use that. However, on an economically weighted basis, we don't see a lot of use of Chinese models. Yeah, right. So I would love to hear your thoughts. Are the Chinese models good? And if so, do you have a sense of why we're not seeing a lot of economically weighted use? And do you think this is going to change?
16:53
It's quite interesting. I think what we see is in terms of number of use cases or number of logos, usage of the Chinese models is very low. In terms of number of tokens across our customer base, usage of the Chinese models is very high.
17:17
Really? What about dollar weighted?
17:31
Dollar weighted? It's low.
17:33
Yeah. This distribution is so interesting because I think this is probably the most important distribution in the entire industry. It's just very hard to get this data.
17:34
I mean, just to put this in perspective with the US open source inference providers today. This is as of the day that we're recording; by the time this comes out, this information will probably be invalid. I ran a benchmark. We wrote a Bash versus SQL benchmark, which turns out to be a pretty good agent benchmark.
17:41
Yeah, we need to talk about it. That was super cool.
17:58
I got, yeah, I got nerd sniped.
18:01
Really hard. Actually, we're gonna talk about that next. That was so good.
18:03
What's interesting about this benchmark is, among other things, Sonnet 4.5 onwards scores 100%, with caveats. The caveats are cost, error rate, latency, et cetera. But I think there's sort of this saturation point where, okay, the model is actually just smart enough for this use case, and there are trade-offs between the different models that are worth considering. Kimi K2.5 came out last week, or maybe two weeks ago, very exciting model. Still in the 73 to 75% score range for this use case; it would frequently get lost and confused. GLM 5 came out, and I just benchmarked Minimax 2.5 right before coming here, on their API directly. They both score 95%, and the cases they messed up are very small, arguably grading errors rather than real errors. And so these two models are actually as good as the models that have essentially saturated this specific use case. And GLM 5 is three times cheaper than Sonnet. And then on Minimax's inference, Minimax 2.5 is three times cheaper than GLM 5.
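The saturation point described here flips the model choice from "which is smartest" to "which trade-off is best": once several models clear the quality bar, pick the cheapest one that clears it. A sketch using the rough scores from the conversation; the price numbers are invented placeholders that only preserve the stated 3x ratios, and the threshold is an assumption.

```python
# Once multiple models saturate a benchmark, selection becomes a
# cheapest-above-threshold problem rather than a max-score problem.

models = {
    # name: (benchmark score, relative price; Sonnet normalized to 9.0)
    "Sonnet":      (1.00, 9.0),
    "GLM 5":       (0.95, 3.0),  # "three times cheaper than Sonnet"
    "Minimax 2.5": (0.95, 1.0),  # "three times cheaper than GLM 5"
    "Kimi K2.5":   (0.74, 2.0),  # price here is purely illustrative
}

THRESHOLD = 0.90  # "smart enough for this use case" -- an assumed bar

viable = {m: price for m, (score, price) in models.items() if score >= THRESHOLD}
cheapest = min(viable, key=viable.get)
print(f"viable: {sorted(viable)}; cheapest viable: {cheapest}")
```

In this framing the 73 to 75% model is out of the running regardless of price, while the two 95% models compete purely on cost, error rate, and latency.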
18:06
Do you think the reason these things are, dollar weighted, a small part of the market is self-cannibalization? Or is it that right now the frontier labs are able to release the next model just in time to retain the majority of the user base, and as soon as that stops, we're going to see more of a shift towards the open source models? Because it just doesn't make economic sense to me why we're not seeing more usage of these.
19:21
I think.
19:43
Or wait it again.
19:44
Yeah, yeah. No, well, first, I think the people who use these models get worse APIs and higher error rates. None of the open source providers give you good rate limits unless you beg them. Anyone can sign up for OpenAI right now and get really good rate limits. I don't know how they do it, but they're just very good at managing rate limits and stuff at unbelievable scale. If you try to use any open source model, you'll immediately be hit by rate limits, and you have to beg the CEO to get access to higher ones. So I think they're just not as well delivered right now.
19:44
You're saying it's a systems problem.
20:21
Very much so. A second thing is, people reach for these models specifically to save cost.
20:23
That's not the self-cannibalization issue; it's that, by definition, they just don't want to pay as much.
20:29
Yes, exactly. And not only are the serverless prices cheaper, but if you actually work with the inference provider, or you run on your own GPUs, you can get it even cheaper, because you're trading predictability for cost. So if you have a continuous workload. There's a use case we're going to ship pretty soon, and the rate we're getting is almost an order of magnitude cheaper than GPT-5 nano. Wow. And so I think that's another really, really big component.
20:34
What do you think? I mean, what do you think happens over time? Do you think that the frontier models slow down and then the distribution shifts to.
21:03
Well, I'm a systems person, not an AI person, so instead of making a prediction.
21:10
This is such a biased conversation.
21:16
Instead of making a prediction, I'll describe, you know, the system within which I suspect the thing will play out.
21:18
And we know with adaptive feedback loops, you can't predict convergence and divergence. So we'll leave that to the listener. Okay.
21:25
Hopefully a smart listener will write to us about this. But I think that what we see is, let's say that there's some cadence at which truly innovative models come out. I would say that the last time that happened was December. Like GPT-5.2 and Opus 4.5.
21:30
They were remarkable.
21:44
Exactly. There are step-function changes. And what happens is, as soon as one of these things comes out, the entire industry forgets about open source models, and no one talks about them, and everyone says, why would you waste your time using them, blah, blah, blah. Except we have a few very shrewd customers who've observed that certain high-volume use cases just don't change over time, and they've specifically instructed their staff not to get caught up in this stuff. Some of these customers are literally still using Llama 3.1, which is ancient news, but they're getting really good results out of it.
21:45
This is super interesting. So is this performance, price, predictability, or all of the above?
22:17
It is the team that is solving the problem. Remember, up until two years ago, engineering teams would spend years solving a problem, to your point. So the team solving the problem, in this case customer service, is familiar with the quirks of the model and understands how to eke performance out of it for this use case.
22:23
But what does performance mean in this context? Is it like latency?
22:47
Correct inference, latency, and accuracy, basically. It's, you know, like an RDBMS. It's a widget, or an alien thing, that they understand how to wield. They understand how to wield it so well, and the use case isn't really changing. Imagine a high-volume consumer use case for customer service that doesn't change. Why change it when you can just make this thing get more and more optimized? But anyway, what happens is the new model comes out, everyone forgets about the open source models, and then the new models stagnate, and the open source models approximate the performance of the closed source models in the sort of three-month timeframe that you described. And oh wait, December is three months ago. And then everyone says, oh my God, what's the point of the closed source models? We'll just use the open source models. Historically that's happened with pretty consistent regularity, except I think there was maybe a six-month stretch before o1 was announced where things felt kind of dead, and then DeepSeek came out and a bunch of interesting things happened.
22:49
Yeah, but I mean, isn't DeepSeek coming out soon?
23:50
The 14th, maybe, they announced. Should be exciting. But with relative regularity, the commercial model comes out before the open source models have a chance to really disrupt enterprise mindshare. And so it's kind of this push-and-pull thing, where if there is a period of time when there's not enough innovation in the closed source models, and the open source models replicate the performance, and there's enough time, then I think it's just bound to happen.
23:53
It's just so interesting. We've just got a different equilibrium state here, where in the past, large companies that received lots of money were basically naturally rate limited by engineering speed. It would take two years to get the next Windows out or whatever it was. It's just very hard to scale a large software project. These frontier labs don't have that problem. They can literally just raise money and build a model based on the money. They just throw more compute, more data at it. So you kind of have to ask the question: what is going to end up limiting them? If it's not engineering complexity, maybe there's engineering complexity on the data cleaning side.
24:28
I don't think we've reached the point of market inefficiency that justifies true engineering investment.
25:04
I mean, the specific question is, let's imagine the following. Call it Frontier Lab X. Frontier Lab X is able to raise a billion dollars, and then it can train a model based on that. The model on that billion dollars does really well, and it only has to do well for a couple of months to get to the next raise. Then it raises $5 billion and trains the model, and that does really well. And then it raises the next one, $10 billion. The question is, at what point, in that world, it's able to raise more money than every company that's downstream of it put together. Anthropic's current raise is probably more money than the entire ecosystem that depends on it can raise. That's how much money we're talking here. And so in that world, when do you ever get to the point of engineering? I think you just have to assume either the money rationalizes, or there is some fundamental scaling limit in building these models, and I don't know what that would be.
25:12
And what happens if neither of those things happen?
26:10
I mean, it's AGI, man.
26:13
Yeah.
26:14
I mean, there is this. Sometimes I think to myself, either we're on this one path where one company can keep raising 10 times more money, and at that point money doesn't matter anyways, because it just becomes arbitrarily good at whatever.
26:15
Money is an artificial construct, Right. To begin with.
26:29
Yeah.
26:31
What is money?
26:32
Yeah, yeah. Or, on the other side, at some point, as the sun gets larger, the surface has to fragment, just because now you're competing against absolutely everybody, and then you're going to see fragmentation. It feels like we're still pretty early and the sun can grow for quite a while.
26:33
I think so, yeah. I mean, I'd probably guess the speed at which companies outside of the coding domain, which is itself a first derivative of real consumers actually taking advantage of AI, but the speed at which enterprises are deploying and can make use of it, and consumers. I mean, most consumers I talk to are using ChatGPT or Gemini, not one of the many interesting consumer AI products that have been started. But I feel like the speed at which the planets or whatever surrounding the sun can actually ingest the heat is potentially the first limiting factor.
26:52
Oh, that's a very interesting point. So you're basically saying that these models will get to a point where they just can't even be consumed. So it's actually on the consumer. It's something demand side, because right now it feels like there's unlimited demand.
27:36
Exactly. Take this eval example I was telling you about, where I think Sonnet 4.5, you know, it scores a hundred percent on this benchmark, and it's going to keep scoring 100%, I'm pretty confident. It's hard to come up with a question that it's going to really struggle to answer, but it's actually very difficult to integrate that into an enterprise setting and actually make use of that kind of system to begin with.
27:45
This is great, actually. This actually reframes my thinking. I always kind of assumed unlimited demand, because the demand has been so historically strong for this. But at some point in time we won't know the difference between AGI and non-AGI.
28:06
Right.
28:17
It's smart.
28:18
You know, in my observation, working with a lot of enterprises, I think the demand is unprecedented in that every enterprise is willing to spend a lot of money. But the implementation of AI systems is still very early.
28:19
Yeah. And do you think this is like endemic to the technology or is it just the enterprise being the enterprise?
28:32
There's a natural rate limit at which, you know, human political systems can ingest new things. Right.
28:37
But it's a very secular phenomenon, like the Internet was. That may be the case for the organization, but every individual is using it.
28:43
Every individual is using ChatGPT and Gemini. So I think that's the. If you ignore the enterprise for a second and think about ChatGPT and Gemini, the question, or the framing of the question, is a little bit different. Which is: are ChatGPT and Gemini going to become the everything app for everything possible or whatever, or maybe one of them wins, or, you know, whatever it is. Yeah, it really feels like a two-horse race there. And if that happens, is there sort of a limit, or does that just become the unlimited.
28:49
Yeah, this is a great point. Like, for individuals, there's a natural saturation point, because, whatever, there's only so much I need to do with one of these apps, where the enterprise would be a different saturation.
29:17
But there could be things that. I mean, ChatGPT could start doing your laundry and cooking your food, and I certainly don't know what these companies are cooking up. Yeah. But there's plenty of consumer stuff that, you know, these. Although.
29:26
Although, ironically, if you look at solution spaces dollar weighted, the ones that actually require automation are not really working at the same level as things that are just basically, you know, text prediction. True.
29:41
Yeah, yeah.
29:51
I mean, the state spaces feel very different.
29:51
Yeah.
29:53
So the things where you feel like you're mapping to a constrained manifold, like code or language or whatever, they do a very good job. And then as soon as it's open ended. Yeah, we'll see.
29:53
I mean, even when you're writing code, if you can constrain the environment, like what's happening right there? You just get a.
30:03
What do you mean, what's happening right there? Do you actually have an agent going in your laptop?
30:11
I do, yeah. I'm trying to rewrite something, and I wrote a very specific test, and Codex is doing its best to try to. Actually, there are two things that it's doing.
30:14
What's it doing?
30:25
One thing we have this, just so.
30:26
You know, for those that are watching this, Ankur has a laptop on a chair over there that's half open. So it's not closed, because it's churning away at a problem. Dude, that's a fucking sign of the times.
30:28
There's one problem which is, if we batch a bunch of LLM calls, we can be way more efficient, like almost linearly more efficient. And there's another problem where we have this really cool feature in our SDK, but we have to implement it in every language. So I'm trying to work on a way of doing it across all languages and then have a sort of smaller bootstrap thing in every language. Both of these problems are actually really good for agents, because we have a bunch of tests that test both of these features, and you can sort of define the very specific constraints of what you want the solution to look like. But both of these things are just, you know, very hard technical problems. So it's just churning for a while.
30:41
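The batching idea described here can be sketched roughly like this. Everything in the sketch is invented for illustration; it is not the BrainTrust SDK, just a minimal picture of why batching helps:

```python
# Hypothetical sketch of batching LLM calls: instead of paying the
# fixed per-request overhead once per call, group pending prompts
# and send each group as a single batched request.

def call_llm_batched(prompts, batch_size=8):
    """Amortize per-request overhead by sending prompts in batches."""
    results = []
    for i in range(0, len(prompts), batch_size):
        batch = prompts[i:i + batch_size]
        # One round trip for the whole batch instead of len(batch) trips.
        results.extend(fake_batch_endpoint(batch))
    return results

def fake_batch_endpoint(batch):
    # Stand-in for a real batched inference endpoint.
    return [f"response:{p}" for p in batch]
```

With a fixed overhead per request, total overhead shrinks roughly linearly in the batch size, which matches the "almost linearly more efficient" observation.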
Going back to our previous topic: you know, a billion dollars goes to create a model. They create the model, and then the value goes down.
31:23
Right.
31:30
And then maybe, you know, these Chinese models are gaining from it, and there is some kind of peace dividend of intelligence for the ecosystem. One of my observations has been that the companies that grow the fastest tend to be on that token path. Like, they go: a billion dollars creates this thing, and then it just kind of leaks money. If you're on that kind of stream of leaking money, then you get growth. Now, unfortunately, downstream of the big model, margins are actually very tough, because it's not even an economics-fundamentals thing. It's just that, as an organization, you'll always go towards wherever top line is greater. Yeah, you're early on. So everybody goes ahead and does that.
31:31
And so I think people also just perceive all the value to be in the model.
32:14
Yeah, yeah, I just meant more: let's say that you're doing a company and you're on the token path. You could decide to add value on top of that, and you could decide to have margin or not. You can actually make that decision, and you can add enough value to have margin, but it'll come at the expense of growth.
32:17
Oh, yeah.
32:33
And I've yet to find a startup founder who, given the ability to cheat a little bit on margins in order to get growth, won't go for growth.
32:34
Right.
32:43
And I've yet to find an investor, independent of what investors say, who will give you credit for the margins as opposed to the growth. And so I think we're all trained to just go for growth, and it's so easy to do if you're on the token path. It's easy because you just sell the token at cost.
32:43
Right, right, right.
32:57
Like you're selling electricity at cost. So I'm just wondering, as you build out your company, how do you think about being on the token path versus framing around it? Or is this something that you consider?
32:58
When we first started, I spent some time talking to Ollie from Datadog, who I really admire and who is also an investor in BrainTrust, just to absorb some of his wisdom. And he described his framework for doing this with cloud. He said something like: our goal was to be some percentage of cloud spend, and then try to figure out how to price our products so that it makes sense. I mean, people have all kinds of comments about their pricing, but whatever, the company's doing really well, and I think most people who use them would say the product's very valuable.
33:07
Yeah, Datadog is awesome.
33:39
And I think what I took away from that is: even if we're not reselling tokens, we need to figure out a way to make our product feel valuable on the order of the number of tokens you're consuming. And so that's why, with our pricing, we don't charge you directly for the number of tokens, but we charge you a very cheap rate in terms of the number of gigabytes that you ingest. And guess what? A token is roughly 4 bytes. And so that's helped us do some value alignment there with customers. And it also resonates because, hey, our cost at the end of the day can ultimately be attributed that way too.
33:40
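The value-alignment arithmetic here is simple enough to sketch. Only the "a token is roughly 4 bytes" rule of thumb comes from the conversation; the dollar figure in the example is made up:

```python
# If a token is roughly 4 bytes, a per-gigabyte ingest price implies
# a rough per-million-token price, which is how gigabyte pricing can
# be made to feel aligned with token consumption.

BYTES_PER_TOKEN = 4  # rough average mentioned above

def price_per_million_tokens(price_per_gb):
    tokens_per_gb = 1_000_000_000 / BYTES_PER_TOKEN  # 250M tokens per GB
    return price_per_gb / tokens_per_gb * 1_000_000

# e.g. a hypothetical $1.00 per GB ingested works out to
# $0.004 per million tokens
```

This is the conversion a buyer can sanity-check against what they already pay per token to a model provider.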
It's so interesting. This mirrors very closely what happened when we went from perpetual licensing to recurring.
34:16
Oh yeah.
34:23
And everybody had this conversion where they're like: I can sell you this thing in perpetuity, or I can sell you recurring. And basically the recurring price was something that would pay the in-perpetuity thing off over a period of time, assuming some depreciation.
34:24
Yeah.
34:41
So you have to have this equivalency, because the entire world goes to the new model. So right now it's token-based usage. So whatever your pricing model, it feels like you have to have some equivalency to the token path, and you have to know what that conversion is, based on whatever the customer is using it for, or your understanding of it.
34:41
Right.
34:57
It's very disruptive to change a pricing model, right? You know, if you're going from, like, seat-based to usage-based. Right. I mean, in your case, we did that. But even the billing systems and stuff, it's a tough problem.
34:58
It's a very tough problem. Yeah.
35:11
Yeah.
35:12
I think it's hard to change billing that often. And then billing on the order of tokens, which are super granular, is also very hard. And then of course everyone's trying to abuse things all the time.
35:12
So is there a fraud issue, by the way? Do you see this at BrainTrust? Because, I mean, for a lot of these companies that are basically reselling tokens, that's a large part of their business. I mean, it's a huge issue, but.
35:23
Our customers deal with it constantly. So we help.
35:32
But you're more B2B, so it's probably less of a. Yeah, I mean.
35:35
We also, this is going to change, but we also have a relatively high entry price point. You know, just the set of folks that are attracted to our product or end up signing up tend to be sophisticated. Yeah. But we do have people that create ABC1, ABC2, ABC3 or whatever free plans and try to rack them up. But I would say that the flavor of usage that we offer is at this point relatively well understood. So it's not that easy to abuse.
35:38
Yeah, yeah. Listen, we don't have too much more time, and I want to get into the Bash-versus-SQL thing, because I think that was so funny. So maybe frame up how you got nerd sniped and what happened there.
36:04
I think a number of people online have made this qualitative observation that, boy, you know, Opus is so good at using Bash. And hey, when I use Opus with a CLI, it feels way better than when I use Opus with an MCP server. And maybe that's the root of why people are making this comment. But the thinking goes: if only I could map every problem to Bash, then because the model understands how to use Bash, it will perform better on this problem. And honestly, I think that's bad engineering thinking.
36:14
And let me be clear, you're saying Bash, but my understanding is people are saying: I could give it kind of API instruction data, or I could literally put it in any Unix environment and let it decide to do whatever it wants in that Unix environment.
36:48
There's many variants.
37:00
Now, like, that's the one that I'm familiar with.
37:01
No, yeah, yeah. And we actually benchmarked multiple of these things, but what some people started doing is literally depositing things as files on. Yeah, exactly.
37:02
Like, I'll put it in a Unix environment. I'll give it curl. Yeah, it can download whatever files it wants. I'll give it access to the Internet.
37:12
And that is like, that's layer one.
37:19
Okay.
37:21
Layer two is: I'm going to create, like, a FUSE-type thing where I create these. I'm not going to. Hey, I have too many customer support tickets to download every customer support ticket. So I'll create a fake file, one per customer support ticket.
37:22
Okay.
37:38
And then of course it gets even more complex. People have now created, like, virtual-environment-type things in every programming language, so you can write, you know, like a Python-based system or something. And all of this is basically saying: okay, great, let's assume that the models are really good at Bash, and then let's engineer our thing to meet them where they are. And we have an agent inside of BrainTrust called Loop, so we benchmark this stuff constantly. We now support SQL directly in BrainTrust, so we've of course been benchmarking that. And, you know, going back to what I was saying about hypotheses: to me, intuitively, it didn't seem like models would be fundamentally worse at writing SQL than writing Bash to solve problems. In fact, I think models are really good at writing code, which is harder than writing SQL. And so they have all the facilities to express something as SQL that they could express as a series of Bash statements. And the problem is, how do you organize the data in a SQL environment that a model can actually access? And then of course, if SQL happens to be a more efficient way of reading the data for the problem, then it might be more efficient. Part of the reason I think people are going crazy about Bash is that if you're working with code, the most granular abstraction that you can use to manipulate and read code is essentially Bash, right? Because the abstraction that is provided to you is a bunch of files. If you're working with, like, customer support tickets, the metadata row that describes the customer support ticket has a lot of information. So if you can run a SQL query across those to filter down the set that you should look at more granularly, then you should. And so we ran this eval, and the results are just comical. SQL is more accurate, it's more efficient, it's more token efficient, it's faster. The worst models perform better with SQL than they do with Bash on, like, everything.
37:38
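The contrast is easy to see in miniature. A minimal sketch using Python's built-in sqlite3 with invented ticket data: the metadata filter that a Bash-style agent would perform by listing and grepping one fake file per ticket becomes a single declarative query.

```python
import sqlite3

# Invented ticket metadata, standing in for the customer-support
# example: one row per ticket instead of one fake file per ticket.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER, status TEXT, priority TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [(1, "open", "high"), (2, "closed", "low"), (3, "open", "low")],
)

# One query narrows the working set before any expensive per-ticket
# reading, which is where the accuracy and token-efficiency wins
# described above come from.
rows = conn.execute(
    "SELECT id FROM tickets WHERE status = 'open' AND priority = 'high'"
).fetchall()
# rows is now [(1,)]
```

For an agent, the difference is that the filtering happens in the database engine rather than in its context window, one file read at a time.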
It feels like this is another dichotomy that's shaping up, which is: there's a whole percentage of the population that seems to be, like, give it a computer and let it do its thing.
39:27
Right?
39:36
Which is kind of this brute force: just give it a computer. And there's another, which is: let's give it computer science fundamentals. I mean, I had a conversation this morning with some of the top systems people I know, and they're like, now when I use agents, I use strong typing, I use referential transparency.
39:37
You're really underestimating the intelligence of the model if you force it to do the brute force thing.
39:53
Yeah. And it almost feels to me like the brute force ones are people that aren't quite comfortable with the complexity of true engineering and computer science. They understand files, and they understand curl and whatever, so they go with that. Whereas I feel like this is the potential for a golden age in CS, where you actually understand: what are the tools to make a system reliable and safe and provably correct?
39:58
I'll give you an example. So at BrainTrust we have a folder called Type Specs, and it has all the type specs for the API, the UI, Brainstore, and how all these things relate to each other. Now, when I write code, I hand-write the type specs, because I sort of pace around our office, and I have a bouncy ball. I throw the bouncy ball, I think really deeply about what the type system for BrainTrust should be, and I hand-write the type specs. And then a bunch of tests fail, and that's when the agent starts running, and it just goes. When I review other people's code, because now people are sending a lot more PRs, I just read the type specs, and then usually we debate about what the type specs should be. And although I still scan the rest of the code, I have a pretty high level of confidence that if me and the person who's submitting the PR agree on the type specs, then it's highly likely that the rest of it will be implemented appropriately.
40:21
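A loose analogue of this hand-write-the-types-first discipline, sketched in Python rather than whatever BrainTrust actually uses, with all names invented: the types and their invariants exist before any implementation, so tests can fail against them and reviewers can agree on them first.

```python
from dataclasses import dataclass

# Hypothetical hand-written type spec: entities and the relationship
# between them, pinned down before any implementation code exists.

@dataclass(frozen=True)
class Experiment:
    id: str
    project_id: str  # every experiment must point back at its project

@dataclass(frozen=True)
class Project:
    id: str
    experiments: tuple["Experiment", ...]

def project_is_consistent(project: Project) -> bool:
    # The kind of invariant a type spec encodes: reviewers can debate
    # and agree on this before reading any of the implementation diff.
    return all(e.project_id == project.id for e in project.experiments)
```

The review workflow described above then reduces to: agree on the types and invariants first, and treat the rest of the PR as implementation detail.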
What about state guarantees or like do you include that in your specification?
41:20
Well for us it's challenging because people self host the product and there's so much data that you can't migrate anything ever.
41:25
Yeah, so just saying: I can see you doing the types, and that makes a lot of sense. The other thing seems to be, you know, state management is hard, and there are a bunch of fundamental trade-offs that you need to articulate in a formal way for state management.
41:32
As a database person, state is also just types, and I think defining the right types that allow state to flow performantly, you know, that's where a lot of the challenge is. So for me.
41:46
Like, there's the simple question of, you know, here's an index, and that would be part of a schema, so I'm assuming that's under your type system. But there's other stuff, which is like: it's okay if you read this and it's invalid; you can optimistically update this thing; this requires strong consistency. Oh yeah, yeah, that sort of stuff, which you need. I mean, you've got to process a lot of stuff. So I'm just wondering, at what level do you have to get involved for that?
41:58
Yeah, well, personally I tend to get involved in all that kind of stuff, maybe more than is needed, just because I enjoy it. But I think that the best database systems are the ones that make consistency also declarative, and reason about these types of challenges as part of the type system.
42:23
Okay, so your type system really is a declarative type system, like in the database sense?
42:42
Oh yeah.
42:46
It includes all of that.
42:46
Okay. Yeah, yeah. Now, we're not perfect, but my stance is always: hey, we're debating this thing. How do we formulate the problem so that we can express the trade-off in the type system? Very cool.
42:47
Great. Well, listen, this has been great to catch up. We've got to do it again. Any last things you want to cover?
42:58
No, thanks for having me.
43:03
Okay, awesome. Thank you.
43:05
Thanks for listening to the A16Z podcast. If you enjoyed the episode, let us know by leaving a review at ratethispodcast.com/a16z. We've got more great conversations coming your way. See you next time. As a reminder, the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. Please note that A16Z and its affiliates may also maintain investments in the companies discussed in this podcast. For more details, including a link to our investments, please see a16z.com/disclosures.
43:10