Vespa AI and Surpassing the Limits of Vector Search

39 min

•May 12, 2026

Summary

Radu Giorgay from Vespa discusses why vector search alone is insufficient for modern retrieval systems and introduces tensor-based retrieval as a more flexible approach. The conversation covers the limitations of single-vector similarity scores, the importance of combining multiple ranking signals, and how Vespa's tensor framework enables more sophisticated relevance functions at scale.

Insights

Vector similarity alone is an incomplete signal; production systems require combining multiple relevance factors like lexical search (BM25), recency, metadata, and business rules for optimal results
Tensor-based retrieval generalizes vector search by supporting named dimensions and complex mathematical operations, enabling faster computation and easier adoption of new techniques without architectural changes
Efficiency in ranking functions directly enables better results—faster base relevance computation allows more sophisticated re-ranking models and larger result sets to be processed within latency constraints
Real-time data updates at attribute level provide competitive advantages for dynamic content (pricing, inventory) that other search engines struggle with due to commit-based indexing models
The golden dataset problem remains unsolved after 15+ years in search; measuring relevance quality and establishing feedback loops for improvement is harder than building the search technology itself

Trends

Multi-signal ranking architectures replacing single-vector similarity as the industry standard for production search systems
Tensor-based computation frameworks gaining adoption as a more future-proof approach than vector-only systems
Multi-modal search (text, images, video, tables) becoming increasingly important as documents grow more complex
Real-time indexing and attribute updates becoming table stakes for e-commerce and dynamic content applications
AI agents amplifying the importance of search accuracy, since errors compound across multiple sequential searches
Hybrid search (lexical + semantic) consistently outperforming pure embedding-based approaches in benchmarks
Chunking strategy and multi-stage re-ranking architectures becoming critical design decisions in RAG pipelines
Context window limitations in LLMs driving need for better result filtering and relevance ranking upstream
Exploration-focused search replacing top-K result paradigm as users and agents demand broader result set visibility
Measurement and evaluation infrastructure lagging behind model innovation in search and AI systems

Topics

Vector Search Limitations and Alternatives
Tensor-Based Retrieval Architecture
Hybrid Search (Lexical + Semantic)
Multi-Stage Ranking and Re-Ranking
Document Chunking Strategies
BM25 and Lexical Search Persistence
Real-Time Indexing and Updates
Named Dimensions in Tensor Computation
RAG Pipeline Design and Optimization
Multi-Modal Search (Images, PDFs, Tables)
Search Relevance Measurement and Golden Datasets
Ranking Function Optimization
Content Node Computation vs. External Re-Ranking
AI Agent Search Requirements
Search Infrastructure Scalability

Companies

Vespa

Open-source search and data-serving engine using tensor-based retrieval; main subject of discussion

Yahoo

Acquired Vespa's predecessor FAST; continues to use Vespa for large-scale search across multiple verticals

FAST

Pre-Yahoo company (Fast Search and Transfer) that pioneered large-scale web search; Vespa's historical origin

Elasticsearch

Lucene-based search engine; Radu spent 12 years consulting and training on Elasticsearch implementations

Solr

Lucene-based search platform; subject of Radu's consulting work alongside Elasticsearch

OpenSearch

Open-source fork of Elasticsearch; mentioned as part of Radu's consulting background

People

Radu Giorgay

Discusses tensor-based retrieval, vector search limitations, and Vespa's architecture; 12 years consulting background

Sean Falconer

Hosts the episode and conducts the interview with Radu about Vespa and search technology

Quotes

"Vector similarity in itself is not enough. And even if you look at what are traditionally called vector databases, they add stuff onto it. It's not just vector similarity that they care about."

Radu Giorgay•~12:00

"Hybrid search, so just BM25 combined with all those models, would outperform the models themselves. So lexical search is a signal that I think is here to stay and has been proven over and over again."

Radu Giorgay•~15:00

"If you can do some tensor computation on the first level, then you're going to save a lot of cycles. But then the other thing is you can do the first re phase... that second phase will run on the content nodes as well."

Radu Giorgay•~45:00

"If you're 90% accurate in isolation and then you do that 10 times... the more you can get the accuracy up on the search in isolation, the more the accuracy goes up in the aggregate as well."

Radu Giorgay•~52:00

"How do we get a good golden set? How do we measure search effectively? How do we get that feedback loop going? These are problems that have been there since before I got into search, which was 15 years ago."

Radu Giorgay•~58:00

Full Transcript

Vector search has risen to become a foundational tool in modern search and retrieval systems, including the RAG pipelines that power many AI applications. However, the demands on retrieval systems are growing more sophisticated, which is revealing the limits of relying on a single-vector similarity score. Vespa is a popular open-source search and data-serving engine. Central to Vespa's architecture is tensor-based retrieval, which is an approach that represents data as tensors rather than simple vectors. Tensor-based retrieval enables richer mathematical operations and more flexible ranking functions that can surmount the limitations of a single-vector similarity score. Radu Giorgay is a software engineer at Vespa, with a background spanning nearly 12 years of consulting and training on Elasticsearch and Solr. In this episode, Radu joins Sean Falconer to discuss why vector similarity alone falls short in production, how tensor-based retrieval generalizes to support richer ranking functions, the trade-offs in chunking and multi-stage re-ranking architectures, and where AI search is headed next. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him.
Radu, welcome to the show.
Hi, thanks for having me.
Yeah, absolutely. I'm glad you were able to be here. I interviewed Vespa's founder and CEO probably a couple of years ago. So it's great to catch up again on everything that's happening over at Vespa. A lot has changed in the world of AI, and I'm sure in the world of Vespa over the last couple of years.
Yep.
So you've been working in this space for a while. I guess, like, what's your origin story? How did you end up working in search infrastructure and ultimately get involved at Vespa?
Yeah, this was two jobs ago, working at an antivirus company, and we needed to centralize logs. And that's how I got into Elasticsearch.
And then I moved on to a company that was, at the time at least, doing mostly consulting on top of Elasticsearch and Solr. And so I've been doing that for almost 12 years at that company: consulting, training, that sort of stuff for Elasticsearch and Solr, and then OpenSearch.
And then what ultimately led you to Vespa?
Well, I guess mostly curiosity, because Vespa is not based on Lucene and has different internals, a different distributed model, different trade-offs that it makes. That got me intrigued. I met a bunch of people at conferences, was more and more curious. And yeah, that's how I got into it.
Yeah. I mean, Vespa as a company, or at least its origins, has been around for a long time. For over 20 years it has been working on search-related problems. Can you share a little bit about some of the origins of the company and what was the original problem that they were focused on, and how much of that originating DNA is there today?
I know quite a few things, but only from other people because I wasn't around, of course. I've only been here for like a couple of years. As far as I know, the origins are pre-Yahoo, so there used to be a company called FAST, which is a recursive acronym, comes from Fast Search and Transfer. And they've been doing web search and search in general. I think there were a few other things, but the idea was large-scale search. And then through a series of acquisitions it ended up in Yahoo. And in Yahoo, they were serving lots and lots of use cases. And Vespa still does serve lots and lots of use cases within Yahoo. Not sure which of them can be told publicly, but the idea is like you have a bunch of verticals that you can serve. Some smaller scale, some really huge scale. So I think this implies that a lot of the problems that Vespa needed to solve were quite generic as well as large scale. So I think you'll see this in Vespa today.
Like a lot of the solutions we adopt tend to be over-engineered, if you will, because we expect them to be used and pulled in all sorts of directions.
What do you mean by that in terms of over-engineering? Can you give an example?
Well, tensors, I think, are a good example of that because it's like you don't only support vectors and distance functions and all that. You support all sorts of maths on top of all sorts of numerical structures. So then when you come up with a new use case, it's just that much easier to add it because there are lots of things that are already supported and already thought through to be scalable and fast.
I see. So you're talking more about having some first-principles thinking around search that generalizes to all sorts of problems, versus attacking every one as, like, a narrow, brand new problem that you have to go and engineer a specific solution for.
Yeah. For me, coming with my consulting background, I'm used to solving specific issues. And at Vespa, when I look at an issue and I'm like, okay, how do we solve this? And people are like, wait, wait, let's make sure we don't bump into something like three months later where we have to do this all over again. So is there something generic that we can do? Will this perform at scale? And all that sort of stuff. Does this align with how Vespa is used in general and things like that? So that for me is a bit of a shift.
Yeah. Yeah, I think there's always a trade-off between moving from like a consultancy or even forward deployed engineer type of role where you're really trying to help solution specific things for a customer and unblock them versus being part of like core product and R&D where you're thinking beyond just a singular customer, but how do we generalize this thing to maybe all of our customers?
Yep.
So Vespa has been fairly vocal about writing about vector search and how it's reaching some of its limits. For some people, maybe that's a bold claim.
Vector search, vector databases is something that's been around for quite some time. I think it's really gotten a lot of traction, certainly in the last few years. Can you talk a little bit about the core argument behind the dialogue coming from Vespa in terms of vector search reaching its limits? Are there certain things that vectors are good at and where are things starting to break down?
I think the general idea is that people want good relevance, right? You have a corpus, you're searching in it, you need the most relevant things to surface. And so for that, vectors are only one, like in general, vector distance is one signal, right? You may have N other signals, like is this document recent? Does this document match well on lexical search? Which chunk is more relevant? Do we care about the top chunk? Do we care about the average chunk or the average of the top 10 chunks or whatever business rules we may have? So in practice, what we see is that a lot of people end up having really complex algorithms for measuring what ends up being a relevance score. So having flexibility around this, I think is important. I don't think there's anything wrong with vector similarity. I just think that vector similarity in itself is not enough. And even if you look at what are traditionally called vector databases, they add stuff onto it. It's not just vector similarity that they care about. So I think it's just a natural trend that people, when they start to use these things, they just care about multiple signals and also how to combine them.
Yeah. So if you're only looking at something like vector similarity, what are some of the things that you might end up getting wrong in some of these use cases or scenarios where you're limiting yourself to the singular signal?
Well, one thing that comes to mind is lexical search. I think there's a lot of, let's say, memes in the search world about BM25, which is, let's say, the most popular algorithm behind lexical search.
And that BM25 actually performs really well. As time goes by, it doesn't seem to die. And we had a recent blog post where we benchmarked a lot of embedding models, and most of them in most of their flavors would outperform BM25. And first of all, actually, let me take a step back. When I say most of those models outperform BM25, I mean those models off the shelf outperformed BM25 off the shelf. In reality, nobody actually uses that. Most people will tune both their embedding models and their BM25 implementation. But for argument's sake, okay, most of those models would outperform BM25. But hybrid search, so just BM25 combined with all those models, would outperform the models themselves. So lexical search is a signal that I think is here to stay and has been proven over and over again. That is just one example. Another one that comes to mind is if you have long texts, you can have, I mean, vector similarity on the whole text becomes quite meaningless because you can't capture the meaning of a blog post or a book in one vector. So I think that's where chunking comes in. And then how do you combine chunks and stuff like that and metadata. So again, in practice, what we see is people end up adding more and more stuff that becomes their compound signal.
Can you talk a little bit about how this vectorization process works? So what do you lose when you turn a document into a vector? What are you kind of giving up by using that representation?
Compared to what?
I mean, there's other ways that you could potentially represent a document. I'm sure you could do something like you could just store the text of a document, for example, and do some sort of text-based search over it.
So then you do lose things like exact filtering. Like that's one of the main complaints people have about vector search: where's my threshold? Where's the cutoff between a relevant and an irrelevant document? Like with lexical search, this is usually quite easy to figure out.
Like a lot of people in the Lucene world use some sort of minimum match. It's like, okay, if you have three words, then all three need to match. But if I have 10, then seven out of 10 is good enough. And then I have a reasonable cutoff. Okay, it's not perfect, but it is somewhat intuitive, explainable, and most of the time it works. With vector search, it's like you don't really know because you can still have, for something like cosine similarity, you can say, okay, a decent similarity score is like 0.7, and I'm going to cut it off at that. But then that depends, because that similarity changes when you're running different queries. So in other words, it's very hard to figure out what that cutoff point is and what that cutoff point means. And that messes with faceting. So if you want to analyze your result set, then you're looking at what? Because with vector search, unless you have this artificial cutoff point, you're going to match everything.
Yeah, I mean, I think that would be certainly true if you're not doing any kind of chunking. The whole point of breaking a large document down into smaller chunks is that you have more tightly coupled, semantically meaningful chunks. So if I take an entire book and I turn it into one singular vector, then I'm creating a single point in high dimensional space that represents this whole book. There's no way I can encapsulate all the meaning of that book and have all the points in space that are similar to it in some reasonable way. I'm going to end up losing a lot of the specifics, I would think. But if I break it down by paragraphs or sections or chapters or whatever it might be, then at least I have more tightly coupled dots in high dimensional space that are going to probably have a sphere of similarity around them to other dots in that space. They're probably more like semantically meaningful.
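As an illustration of the minimum-match cutoff described above (all terms for a three-word query, seven out of ten for a longer one), here is a minimal Python sketch. The function names and the 70% rule are hypothetical and chosen only to mirror the spoken example; this is not Lucene's or Elasticsearch's actual minimum-match syntax.

```python
def required_matches(num_terms: int, short_query_len: int = 3, min_percent: int = 70) -> int:
    """Number of query terms that must match (hypothetical rule from the example)."""
    if num_terms <= short_query_len:
        return num_terms  # short queries: every term must match
    # Longer queries: ceiling of min_percent of the terms, using integer math.
    return -(-num_terms * min_percent // 100)


def lexical_match(query: str, document: str) -> bool:
    """True if enough distinct query terms appear in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    matched = len(query_terms & doc_terms)
    return matched >= required_matches(len(query_terms))


print(required_matches(3))   # all three terms are required
print(required_matches(10))  # seven of ten terms are required
print(lexical_match("cheap red car", "a cheap car in red"))
print(lexical_match("cheap red car", "a cheap blue car"))
```

The rule gives an explainable cutoff because it counts matched terms. A cosine similarity score has no comparable count to reason about, which is the difficulty raised in the discussion.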
So the more data I'm essentially trying to stuff into the vector, the more I'm generalizing essentially the ultimate meaning of that thing. Is that fair?
Yeah, I think that's a fair thing. It's like you have a limited number of data points, effectively, the dimensionality of your vector, that you can store. So the more meaning you have in something big, the more you're going to compress and the more lossy it's going to be.
Yeah, exactly. Yeah. So it's a very lossy format, especially as you're getting more and more text stuffed into the singular vector representation. There's also, if we look at things like the use of RAG with vector databases over the last couple of years, typically the RAG systems are getting more and more complicated, where we have the pipeline where we're chunking these things, we've got different chunking strategies, we're indexing it in a vector database. And then when we're actually retrieving it, we're also doing multiple steps where we're maybe retrieving relevant documents. And then maybe we're using like a re-ranking model as well to re-rank the results from it. In terms of this kind of two-stage architecture where we're decoupling the search from the re-ranking, is that problematic? Are there challenges around not tying those things together and decoupling them?
I think it is problematic in the sense that if you have a lot of data to re-rank, then you're going to have a lot of traffic coming in and out. That can become a bottleneck.
So it's primarily an efficiency problem?
Yeah, and efficiency is really important because if you're more efficient, then you can essentially afford to do fancier stuff. Let's take this re-ranking example. So if you have a really good re-ranker that performs really badly, you can only throw a few results at it because otherwise you're going to have an unacceptable latency.
But if your re-ranker is super efficient, then you can throw all your results at it and you're going to have great results. But this, I think, applies all over the place, right? Like if you can, in general, have a base relevance function that performs well and does a lot of stuff, then you're going to have a really good baseline to work with.
And I guess all these problems that we're talking about are perhaps even more amplified in the multimodal world. It's one thing where we're talking about compressing text into a vector. Like what happens when we compress images and video and things like that? Do we lose too much in using this kind of lossy format, especially when we're talking about rich media?
I don't know that I have enough experience with things like audio or video, but I know for things like PDFs, it's going to be really hard to put all that information in a single vector because you can have N pages, and on those N pages, I think it's enough if you extract the text: we have the problem that we talked about earlier, which is how we cram a lot of text into a vector. But now if you have diagrams in it and other things, good luck.
Yeah, and tables, which you're probably going to... I would suspect even the approach to doing similarity measurements between those things is probably going to be quite different than you would do for traditional text.
Yeah, so what I've seen with PDFs are people doing vectors per page, or rather per patch per page. So you have models like ColPali and so on that can do this sort of stuff.
Do they handle tables and images differently though?
They don't actually, they're pretty generic. You just throw the image of a PDF page at it and they give you a vector per patch. So you'd have typically 128 patches, so 32 by 32, and you're going to have one vector for each of those patches, but it's all very well coordinated.
So in the end, because you can throw a text query at the same model, so they live in the same vector space, it can actually figure out whether something's on a table or on a graph. So you can have a graph of, I don't know, energy consumption by month, and you can say, what was the consumption in July? And it can highlight that for you.
Okay. I want to get into a little bit of this topic around tensor-based retrieval, essentially for the listeners that perhaps at this point know what a vector is. And I think generally because of everything that's been happening in AI over the last couple of years, I think people who didn't know what a vector was three, four years ago, perhaps know what a vector is today. They might not be super familiar with the concept of a tensor. Can you explain essentially what is the difference from vectors to tensors and why does it matter for search?
Yeah. So a vector is a list of numbers, right? The data type can differ, can be a float. Normally it's a float natively, but we can quantize it. So basically compress the float into, let's say, a 16-bit float or an integer or even a bit. So that is a vector. A tensor is a more flexible way to represent numbers. So a simple thing could be just to represent one number. You can have an array, which would be a vector. You can have named dimensions. So like I mentioned patches earlier for ColPali, you can say we have a patch ID and for each patch ID, we can attach a vector. So now we're going to have a map of vectors. Or you can have a sparse tensor where we can say, let's say for personalization, right? So I go to a clothing store and I prefer black pants and blue t-shirts and stuff like that. Those could be named dimensions in my tensors. And based on my preference, I can store numbers. So like, let's say a heavy preference would be closer to one. And maybe if I hate them, they should be a negative number.
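The personalization example above can be sketched in plain Python, with a dictionary standing in for a sparse tensor whose labels are preference names. This is an illustrative sketch and not Vespa's tensor syntax; the labels, weights, and item names are made up.

```python
from typing import Dict, List


def sparse_dot(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product over the labels that two sparse tensors have in common."""
    return sum(weight * b[label] for label, weight in a.items() if label in b)


# Hypothetical user preferences: close to 1 means liked, negative means disliked.
user_preferences = {"black_pants": 0.9, "blue_tshirt": 0.7, "red_hat": -0.5}

# Hypothetical items, each described with the same label vocabulary.
items: Dict[str, Dict[str, float]] = {
    "black chinos": {"black_pants": 1.0},
    "red cap": {"red_hat": 1.0},
    "blue tee": {"blue_tshirt": 1.0},
}

# Rank items by how well they line up with the user's preferences.
ranked: List[str] = sorted(
    items,
    key=lambda name: sparse_dot(user_preferences, items[name]),
    reverse=True,
)
print(ranked)
```

Labels missing from either side contribute nothing to the score, which is what makes the sparse representation convenient for preferences that cover only a few attributes.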
And so I can perform all sorts of math on top of these numerical structures, these tensors, and I can get the results I want. So for example, with vectors, we can do the similarity search that we all know and love, but we can do personalization, for example, by doing some sort of dot product between my preferences and what a specific item of clothing would be. Or we can do ColPali, and we can sum up things, we can do MaxSim. So all sorts of things can be done on top of tensors. I'm not sure if that answers your question.
Yeah, so every feature or thing that you want to describe needs to map into a numeric representation, right? And that could be a vector, it could be a singular value, but some numeric representation.
Right. In the context of tensors, yes. I mean, with Vespa, you can do much more with ranking than just using tensor math. But tensor math is a really flexible way to represent a lot of things and then do those interactions quickly.
Right. So by representing things as tensors versus just purely vectors, you have a whole set of tools essentially that you can use to perform these different types of searches using tensor math that you wouldn't be able to support if you're just doing essentially cosine measurements between two different vectors.
Correct. Yeah. And also, I think most importantly, we are very, I wouldn't say completely because nothing is complete, but very future-proof. So for example, when ColPali models came in, we could just natively support that because you can have these patch vectors modeled in a tensor, and then you can implement MaxSim using tensor math, and there you go. You have all the MaxSim stuff. You don't need to come up with a whole new feature of how do we deal with this? How do we deal with multiple tensors? How do we combine them in the way that they're supposed to be combined? This was something that Vespa was already supporting. Yeah. I mean, not before the model and the technique came into existence. It's not like we were supporting it.
But yeah, we were supporting it from day one because all the plumbing was already there. You just needed to write the correct expression. And there you go.
I see.
Another good example is Bayesian BM25. It's a new technique to normalize BM25 scores, because one of the main problems with BM25 is that you don't have a predictable score that you can use to then combine with other kinds of scores. So it's like, it's ideal if we can normalize it between zero and one, and then you can treat it much more uniformly. And so when that technique came out, we were like, okay, how do we implement this in Vespa? And it turns out pretty much everything was already there. You know, all the sigmoid calculations we could already do in the rank profile math. So this was impressive even for the author.
Can you walk me through what is the process for doing a tensor-based search in Vespa?
The process would depend on exactly what the tensor looks like. So if you have a vector, you would define, I mean, any type of tensor, in fact, you will define it in the schema. It's like, okay, this is the shape of the tensor. This is the data type. And then you feed the data, which should match that shape, right? So if it's a map of arrays or whatever that is. And then when you run the query, you typically also have a query tensor. You can construct tensors at query time from the signals that you may have. Like, I don't know, chunk similarities, you can construct tensors from that. And then you would have something called the rank profile. So in the schema, you would say, this is my rank profile. And the rank profile expresses how the similarities, how the score of the document should be computed. So let's say we do a dot product between two tensors, or we do a similarity between a bunch of vectors, and you can iterate, we can take the average of that similarity, or whatever we want. The top N vectors' similarity, and average that.
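Two of the scoring expressions mentioned so far, the average similarity of the top N chunks and MaxSim over per-patch vectors, can be written out in plain Python to show the arithmetic. This is a hedged sketch and not Vespa rank-profile syntax; the function names and toy vectors are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def avg_top_n_similarity(query: Vector, chunks: List[Vector], n: int = 2) -> float:
    """Average cosine similarity of the n chunks closest to the query."""
    top = sorted((cosine(query, chunk) for chunk in chunks), reverse=True)[:n]
    return sum(top) / len(top) if top else 0.0


def max_sim(query_vectors: List[Vector], doc_vectors: List[Vector]) -> float:
    """MaxSim: for each query vector take its best dot product, then sum."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy data: one query vector against three chunk vectors.
chunk_score = avg_top_n_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Toy data: two query-token vectors against two patch vectors.
patch_score = max_sim([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]])

print(chunk_score, patch_score)
```

In both cases the document score is just an expression over tensors, which is why a new technique can often be added by writing a new expression rather than a new engine feature.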
Whatever math you can think of should be there, or a lot of the relevant things are already there. And you can construct your relevance function that way.
How do I know what relevance function to use?
I think that is very much up to you and how you, let's say, tweak your relevance. It would depend on the use case. I think most people would just start from something simple like lexical search. Then, okay, we can find a decent vector, like an embedder model to work with my data. Then I can think of, okay, what are other business-relevant signals that I want to incorporate, all sorts of metadata. So people typically iterate. And I think it's very important to have some sort of golden set that you can evaluate and see whether my quality is going up or down. That's a very generic approach.
It sounds like there's maybe some additional complexity involved with getting this set up and working. But the advantage is that you're trading off some level of maybe technical investment and complexity upfront. But the trade-off is that you get better results.
This is valid for any system. I don't think it's particular to tensors. If you add more signals and you want to combine them, it's going to be just that engineering investment that you were talking about, and it will happen everywhere. Maybe tensors require a little bit more understanding of some sort of math. Not crazy. I mean, my math stopped at high school and I can still grok it to some extent. So it's not too, too scary. But it's a little bit more than just, you know, at least what I'm used to.
How much is Vespa, like, abstracting away some of that math for you?
There are some helper things. Like, for example, we talked about ColPali. You have, there are aliases. Like, you can just multiply two tensors, for example, like X asterisk Y, and then that's going to do a dot product for you. You can also do the unfurled thing. I think more interestingly, we have a bunch of helper, let's say, frameworks, if I can say that.
There's something called Tensor Playground, where you can go and click around and you have some examples. And you can also come up with your own and you can fiddle with tensors and see what the results are. We also did, a couple of years in December, this tensor advent challenge, where the idea was, okay, let's have some thematic challenges you have to solve with tensors, like how much Santa has to pack and how much the elves have to travel and stuff like that, that you would just solve with tensors, just to get a feel of that math. And then there's quite a big repository, which is called Sample Apps in the Vespa GitHub, which has lots and lots of examples of use cases. And you can see the rank profiles there and you can see the schema. A lot of people will take one of these sample apps and just change it to what they need. And I think that's useful. It's rare that you just start from scratch on a path that nobody went down before.
You mentioned this a little bit earlier, this concept of named dimensions, and Vespa's tensor framework supports named dimensions like token and region, timestamp. What does that give you? Why does that design choice matter?
It matters because it's very quick to... Let me step back here and try to come up with an example. So one of them is you can have attributes that you care about for ranking, right? So let's say you're searching for cars. And you may have things like, is this car expensive? Is this car cheap to insure? Does this car use a lot of fuel? Is it new? Whatever, right? So things that maybe I care about when ranking. So even if you don't have tensors, right, you can still take those into account when ranking, right? You can take the mileage. You can take all those dimensions and you can come up with a formula that takes all those dimensions and comes up with a final dimension, which is the score of my document.
But it is quite expensive to get, assuming that you store this in multiple fields, you need to get the value from all those fields, do whatever math you need to do in some sort of high-level math. Hopefully, you don't have to bring it all the way to the application because that's going to be horrible. But even if you have to do this in some sort of high-level script, like, you know, with Elasticsearch's Painless, for example, that can be very slow. By contrast, if you have this natively in a tensor, then you can simply take the user's preferences, take those car attributes, and do a dot product, which is super, super fast. Yeah, so this will scale a lot better than taking those attributes manually. So I think this is what it gives you in essence, because it kind of comes back to what we discussed earlier about efficiency: that allows you to do fancier things. Because at some point, you will not be able to do things in other search engines, even though the capabilities are there. But if at your scale they don't make sense, you're not going to use them, right? It doesn't help you that they're there if you can't use them. But with tensors, it's different because a lot of those tensor operations are super fast. And so they will scale and people do use them at very large scale.
Yeah, I mean, one of the things that I think seems unique about Vespa around some of this efficiency stuff that you're speaking to is that the tensor computation is happening on the content node where the data lives. It's kind of the idea of like, do you bring the data to the computation or bring the computation to the data? Data is expensive to move around. So if you can bring the computation to the data, then it's going to save you some cost in terms of time of moving this data around, which then gives you probably more compute cycles that you can spend on trying to get good results out of the search.
Exactly. And I think this comes at two levels.
One is the computation that you do on all the documents. I think it's just infeasible to bring all the documents somewhere outside where the data lives, right? You will have to take some sort of top N unless you have a tiny data set. You just cannot afford to take all the data out of the content nodes and into something external. So that is one thing: if you can do some tensor computation at the first level, then you're going to save a lot of cycles. But then the other thing is you can do the first re-ranking phase, so basically the second phase. The first phase runs on all documents; the second phase runs on the top N. That second phase will run on the content nodes as well, so you can bring a more sophisticated model. It could be a LightGBM model, like a tree. It could be an ONNX model. It's usually not something super big, but it can be complex enough, for example, to handle multiple signals with different ranges, such as similarity, and then the BM25 and recency we discussed and all the things that maybe matter to you, and come up with a coherent score. And that still happens on the content node without moving data. Only later can you move a much smaller set to what we call the global re-ranking, which happens on a stateless layer. That can, again, have its own model, maybe a bigger, more complex model that can also run on GPU, to do the final re-ranking. So there are sort of stages to that.

I guess one of the things I was wondering about too is that the concept of RAG, with vectors or some other search technique like the tensor stuff we're speaking about, was very, very popular a couple of years ago. And now with AI agents, I think there's some dialogue around how relevant this is today. Can you talk a little bit about where some of these concepts fit into the agent world?

Right. So for agents, this would be just a search. I don't think they care all that much about what happens under the hood.
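As an aside on the ranking stages described a moment ago, here is a minimal sketch of the flow: a cheap first phase over every document, a richer second phase over the top N on the content node, and a final re-rank of a smaller set on a separate layer. It is plain Python, not Vespa configuration; the scoring functions, weights, and the `cross_score` field are hypothetical stand-ins for real models.

```python
import heapq


def first_phase(doc):
    # Cheap score computed for every matched document.
    return doc["similarity"]


def second_phase(doc):
    # Stand-in for a small model that blends several signals.
    return 0.6 * doc["similarity"] + 0.3 * doc["bm25"] + 0.1 * doc["recency"]


def content_node_rank(docs, rerank_count):
    # Keep only the top N by the cheap score, then re-rank those N.
    top_n = heapq.nlargest(rerank_count, docs, key=first_phase)
    return sorted(top_n, key=second_phase, reverse=True)


def global_rerank(candidates, keep):
    # Stand-in for a larger model running outside the content node.
    ordered = sorted(candidates, key=lambda d: d["cross_score"], reverse=True)
    return ordered[:keep]


docs = [
    {"id": "a", "similarity": 0.9, "bm25": 0.1, "recency": 0.2, "cross_score": 0.3},
    {"id": "b", "similarity": 0.8, "bm25": 0.9, "recency": 0.9, "cross_score": 0.9},
    {"id": "c", "similarity": 0.7, "bm25": 0.5, "recency": 0.5, "cross_score": 0.6},
    {"id": "d", "similarity": 0.1, "bm25": 0.9, "recency": 0.9, "cross_score": 0.8},
]

local_results = content_node_rank(docs, rerank_count=3)
final_ids = [d["id"] for d in global_rerank(local_results, keep=2)]
```

Note that document `d` never reaches the later stages even though the expensive model would have liked it, which is why the quality of the cheap first phase matters.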
But if what happens under the hood gives them good results quickly, then that is, I think, even more important than it is for humans, because agents would typically run multiple searches. So the problem, I think, compounds, with latency definitely, and with bad results too. If you're 90% accurate in isolation and then you do that 10 times, then it's, you know, 0.9 to the power of 10, which works out to only about a 35% chance that every search in that chain succeeds, right? So the more you can get the accuracy up on the search in isolation, the more the accuracy goes up in the aggregate as well. And I think the other thing is that models, at least to my knowledge to this day, aren't that good at figuring out how to filter the context. So if you give them bad results, they will tend to hallucinate more, because now they have bad context to rely on.

Yeah. Or if it's too much, right? All models degrade in performance the larger the context you give them, due to context rot. I mean, just based on the way the attention mechanism works, they can only pay attention to so many things. So they might pay attention to the thing you don't want them to pay attention to if you give them bad results. How does Vespa handle updates to data? If you have a knowledge base that's changing every minute, there's news, there's pricing, there's inventory, how does indexing, or re-indexing, of that information work?

So to talk in general terms, Vespa is real-time, meaning that when you make an update, the moment you as the application get the acknowledgement, that thing is searchable.
Most engines would be near real-time, meaning there has to be some sort of commit happening. There's always a trade-off; there's no free lunch, right? But this is the trade-off that Vespa makes: it assumes that you need your data to be available right now. So you won't have some of the caches that you have with other engines, but the upside is that things that move quickly, such as pricing for e-commerce, which is a very frequent example, or how much you have in stock, can change a lot. And if the data you're changing is an attribute, so effectively the price or the in-stock value that is kept in memory, that is super, super quick. This contrasts with other systems where you have a commit and then you would effectively need to re-index the document in order to change one value in it, which can be prohibitive. In Vespa, that's the advantage: you can quickly update things.

How does that technically work? I'm not sure I'm following. If I have a new update, how does Vespa handle that in real time? With a continuous flow of new information, how does Vespa make that available in real time?

So if you have an in-memory attribute, like a price, and you want to change it, I mean, it will be backed by disk, right? You have all the persistence, the write-ahead log, all that stuff. But you send the update, it's changed in memory, and it's also replicated to all the other nodes. When all the other nodes have got the update request, you as the client get the acknowledgement. Also, this happens at the operation level. So if you want to update, let's say, three products' prices in one go, the way you typically do this with Vespa is over HTTP/2, and we have libraries that do this. You're going to send effectively three individual updates, and they respond individually. And for each of them, at the moment it responds, the value is already flipped in memory.
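The acknowledgement semantics described here can be modeled with a toy, single-node sketch: log the operation, change the in-memory value, then acknowledge that one operation. This is not Vespa's implementation or API; the class and method names are hypothetical, and replication to other nodes is left out for brevity.

```python
class ToyAttributeStore:
    """Toy model of an in-memory attribute backed by a write-ahead log."""

    def __init__(self):
        self.write_ahead_log = []  # stand-in for durable storage
        self.prices = {}           # in-memory attribute: doc id -> price

    def update_price(self, doc_id, price):
        # Record the operation first, then flip the value in memory.
        self.write_ahead_log.append((doc_id, price))
        self.prices[doc_id] = price
        # The acknowledgement covers this single operation only.
        return {"id": doc_id, "status": "ok"}

    def search_cheaper_than(self, limit):
        # Searches read the in-memory values directly; there is no commit step.
        return sorted(d for d, p in self.prices.items() if p < limit)


store = ToyAttributeStore()
for doc_id, price in [("p1", 30.0), ("p2", 50.0), ("p3", 70.0)]:
    store.update_price(doc_id, price)

before = store.search_cheaper_than(40.0)
ack = store.update_price("p3", 35.0)     # one individually acknowledged update
after = store.search_cheaper_than(40.0)  # runs after the acknowledgement
```

In this model there is no window between the acknowledgement and visibility: any search issued after `ack` is returned already sees the new price.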
So you see the new price. Every search that runs after that will see the new price, or whatever you updated.

Vespa has been in the search world for a long time, like we've talked about. So, over a 20-year journey, what's next for search? If we fast forward three to five years, what are some of the problems that need to be solved that haven't been solved today?

I don't know, to be honest. There's so much work in the short term that I find it hard to look ahead. Because things are moving so quickly, it's hard to tell. I do have a feeling that multi-modal search will become more important. Visual cues here and there will become more important depending on the use case. I would think that the ability to explore data in real time would also be increasingly important. I think people, and even agents, are not necessarily happy with seeing just the top N results. They may want to know what else is in that result set. And that brings up yet again the question of what the result set is. What do we consider? Where's the threshold between relevant and irrelevant results? And yeah, I think there are also problems that have been there since before I got into search, which was like 15 years ago, and are still not really solved: how do we get a good golden set? How do we measure search effectively? How do we get that feedback loop going? How do we improve performance, not in the sense of latency but of relevance, without breaking other things? Yeah, I think if those have been around for more than 15 years, I would assume they will be around for the next five years as well.

I think the golden data set problem is a huge one, even outside of search, just in AI in general. Whatever AI system I'm building, if I don't have a good data set to essentially test against, how do I know that the investments I'm making are moving in the right direction?
And I see this: a lot of companies and projects skip that step, probably because it's hard. But it's really hard to know whether the things you're doing are actually useful if you don't have any way to test them. People skip that step because there's essentially no easy way to achieve it right now.

Yeah. And I feel like it's also a chicken-and-egg problem. Even if you do it, which, as you said, not everyone does, it's like: how do you know your testing thing is good? How do you make sure? Because that's, I think, the main difference from what we see on the internet when people publish, oh, this is the new state-of-the-art model, this is the new state-of-the-art technique, this and that, or in academia. They have a golden set, and that golden set is the benchmark. So the assumption is that the golden set works. But if you're starting your, I don't know, e-commerce shop or book search website, whatever search use case, and you start from scratch, now what? How do you know?

I mean, I think that's the advantage that some of the stuff around coding has: typically companies have a history of things that they can build benchmark data sets around. There are issue trackers, there's prior code that engineers have built. There's essentially a history of creating stuff that they can mine for creating these golden test sets. But if you're starting brand new in a new field where the measurement of what good is is far more subjective than just compiling and running something against a unit test, it's really, really hard to create those data sets. And even if you put the work into creating one, to your point, how do you know whether it's good or not? Radu, thanks so much for being here. It was a great conversation.

You're welcome. Thanks for having me.