Now you actually have the ability at your fingertips, in some ways just through English commands, to go and parse through and extract and answer questions that you otherwise would never be able to answer. And that's what's so exciting about this time right now. Yeah, these technologies are growing extremely fast, and we can really solve some really hard problems that were just previously unattainable. Welcome to Embracing Digital Transformation, where we explore how people, process, policy, and technology drive effective change. This is Dr. Darren, Chief Enterprise Architect, Educator, Author, and most importantly, your host. On this episode: AI, ETL, and the Unstructured Data Problem, Why Accuracy Still Matters, with the Founder and CEO of Aryn, Mehul Shah. Mehul, welcome to the show. Thank you for having me, Darren. It's good to be here. Hey, before we jump into our topic today, which is a great topic, because I've been doing the same sort of work that we're going to talk about today, so we'll have a lot to share. And too bad if the audience doesn't care what we're talking about, because we're going to have fun talking about it. But before we get started, everyone knows that I only have superheroes on my show, and every superhero has a background story. So what's your background story? So my joke is I've been doing data before it was big, or I've been doing AI before it came onto the scene. My history is actually in research. I spent a whole bunch of time back in the late '90s and early 2000s working on databases, working on large-scale data systems. I was a research scientist at HP Labs, and we worked on a variety of highly scalable, energy-efficient storage and database systems. And I really had a lot of fun doing it.
I learned a lot about trends in technology, how people use it, and where things are going. The challenge I had with just sitting there focusing on the technology was that we were unable to get these tools into the hands of everyday users and really delight them. We certainly created great technology, but delighting them was hard. So a couple of friends of mine and I started a company about 13, 14 years ago now. It was called Amiato, and it was in the early days of the cloud, 2011, 2012. What we saw was that there was this huge need for processing log data. At the time, it was really hard. And so we built an ETL service that would take log data and put it into databases for you so that you could run SQL over it and do your analysis. That's what Amiato did. It was acquired by AWS, and a lot of the tools, the techniques, the ideas from Amiato turned into AWS Glue, which is AWS's flagship ETL service. So that's another joke that I often tell, which is that I've been doing ETL since before anybody could spell it. ETL stands for extract, transform, and load. Right. So my co-founders and I built that service and grew it. And then I had the opportunity to really understand a lot of what was going on in the unstructured data space by being on a platform like AWS. From there, I just got lost in all of the problems and the solutions that you could build for your customers. I had an opportunity to run the Amazon Elasticsearch Service, and eventually we forked Elasticsearch into OpenSearch so it could stay open. And I learned a lot more about how people were doing document processing.
And that's what led me to the current company that I'm running right now, Aryn, where we're really helping enterprises unlock the goldmine of information that they have in their mountains of unstructured data. So all those years of data hoarding that I've been doing pay off, right? Because I can actually do something with all that data. I've noticed that when I go in and talk to organizations, they have petabytes of stuff that's never touched, because they don't know how to engage with it. I think I even have like five terabytes of PowerPoint presentations and Word docs and all this unstructured data that I'm afraid to throw away, because I'm like, well, I might need it someday, right? But you're saying today is the day, right? Today's the day. I think that the time is now. For a long time, this market was, I would say, a market where there was a lot of non-consumption. And what I mean by that is people really wanted to, like you just described, be able to search through their old logs, documents, PowerPoint presentations, contracts, pricing quotes, invoices, whatever you have. But they were just unable to do that. They would sit there and build very specific solutions for only the super-high-value problems. Like, if this was going to be a case worth $100 million, let's go build a bunch of technology so that we can go through all this stuff. So what I meant by non-consumption is you had all this data, people wanted to get to it, but the barrier to getting to it was just so expensive: finding the technology, finding the experts who did it. You had to find technology that could actually get to the bottom of it very quickly, and it was extremely hard.
But now, with the growth of frontier models, LLMs that can not only pull out text but also deal with all kinds of multimodal information, graphs, images, video, sound, you actually have the ability at your fingertips, in some ways just through English commands, to go and parse through and extract and answer questions that you otherwise would never be able to answer. And that's what's so exciting about this time right now. Yeah, these technologies are growing extremely fast, and we can really solve some really hard problems that were just previously unattainable. Well, there's still the problem, even though it can read a document and do things like that, of where is my data? That's right. Right? And do I want it searching through a petabyte? It'll never come back. I mean, how do I put all of that into one large language model? I can't do it. It's too big. It's too cumbersome. If I have a specific file, I can say, oh, tell me what this file has in it. I can do that sort of thing. But if I've got a million files, what am I going to do? Exactly. And if you're a large enterprise, you have this stuff in spades. Yeah, some of these enterprises have petabytes and petabytes of information that they've stacked up, in some cases exabytes. Just to give you a couple of examples, take a look at what our government agencies have accumulated. GSA is a great example, or NOAA, the fisheries and so on. Oh, yeah. They just have so much satellite data, documents from applications. They have just so much information sitting there. It's unbelievable. And right now, the state of affairs is pretty poor.
The data is sitting there, and the people that want to get access and get answers from that data are still literally, manually going through this information. Some of the customers that we work with, for example, I visited them just last week, are insurance companies. And insurance started a couple hundred years ago in this country, largely around farming. So a lot of the insurance companies and their ecosystem are in the Midwest, or in the farming areas on the East Coast: Pennsylvania, Ohio, Iowa, sort of in these states. I went and saw one of our customers, and they're really happy using our software. But before that, what they were doing was literally cutting and pasting from documents, literally from one PDF to an Excel spreadsheet. I'm like, it's 2025, you guys are still doing this? And they're like, yeah, look, have you guys tried LLMs? Have you tried ChatGPT? It doesn't work. And this is actually something that you hit upon, which is that these AI systems are really great as long as the data can fit within their context. Yeah, yeah. And even then. The context has gotten a lot bigger. Right. It has, it's gotten a lot bigger, but LLMs should stand for lazy language models, because they give up. They won't scan a whole document. They'll find like three or four things and go, okay, I got an answer, and send it back. Yeah. It's interesting. So certainly the context has gotten bigger. I remember when I first started in this space and we started messing around with LLMs, it was like 4K or 8K tokens, which at the time seemed enormous, by the way, because you just couldn't do anything else. Yeah, at the time.
We've certainly got million-token contexts now, two or three years later. So it's growing by an order of magnitude every year, which is actually great. There are 10-million-token contexts in some of the really large frontier models and the pro models that are out there. So that's awesome. But the problem is that even though the context is available and somebody can let you stuff a bunch of information in there, you actually see a quality wall: after about 100K or maybe 200K tokens, the quality of the answer just starts dropping off really quickly. I know what it's like, too. Try to throw in, I don't know, a life insurance policy or something. Oh, yeah. Right. Or maybe you have a mortgage statement, or maybe you have a loan package that you sent. You can send a single contract with two or three pages, and they're right on the ball. They can do it. Send it to ChatGPT, or send it to Gemini or Claude, and they do a pretty good job. Throw the whole mortgage application in, and you'll literally see it beg for mercy. It'll say things like, well, we got a few hundred of the entries that you wanted; there's a lot more. And I'm like, okay, well, where are the rest? Somewhere. Yeah. So personally, if you're dealing with your own stuff, you can read through it. But imagine if you're processing thousands and thousands of mortgage applications, if you're an escrow company, or you're a lender. Yeah, it's going to be tough. Or let's say you're a government agency, and in the government, everything is driven by documents. I mean, Darren, you know this. Oh, yeah. Right. And if everything is really just pushing paper, the best way to automate this stuff is to start using these LLMs.
But right now, the token context that they have is actually quite a limitation. Well, and like you said, even though the context windows can be large, the quality goes down. They give up after a certain amount, even though you can jam them full of a bunch of stuff. All right. So if that's a huge problem, how in the world am I solving that problem? Well, I come from the database world. My background is in building large, reliable database systems. And the trick in any kind of database system that you build is to divide and conquer. There are really only two or three great techniques in computer science, and divide and conquer is one of them. Caching is another. And divide and conquer is the best way to handle all of this stuff. So you've probably heard this idea of context engineering. There are two different things you can do to get LLMs to be better. One is you can give them better instructions; this is prompt engineering. And the other is that the information you put in their context can be carefully crafted, the appropriate stuff at the right time, so that you can get them to be much more accurate. You try to make sure that what's in there fits within sort of the accuracy sweet spot, if you will, for the LLM. And then, once you start to divide and conquer, you bring back all of the answers you get from interacting with the LLMs, merge those answers, and then give the result back to the user. And how you should divide and how you should conquer is going to depend on the task that you have.
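The divide-and-conquer pattern described above can be sketched in a few lines. This is a minimal illustration, not Aryn's implementation: `ask_llm` is a hypothetical callable standing in for a real LLM client, and the token budget is approximated with a rough characters-per-token ratio.

```python
# Divide and conquer over a long document, with a hypothetical
# ask_llm(prompt) callable standing in for a real LLM client.

def split_into_chunks(text, budget_tokens=100_000, chars_per_token=4):
    """Split text so each chunk stays inside the model's accuracy
    sweet spot (roughly 100K tokens, approximated as characters)."""
    chunk_chars = budget_tokens * chars_per_token
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]

def answer_over_document(question, document, ask_llm):
    # Divide: ask the same question over every chunk independently.
    partials = [
        ask_llm(f"Using only this excerpt:\n{chunk}\n\nAnswer: {question}")
        for chunk in split_into_chunks(document)
    ]
    # Conquer: merge the partial answers into one final response.
    return ask_llm("Merge these partial answers into one:\n" + "\n".join(partials))
```

Each per-chunk call stays small enough that quality holds up, and the final merge call sees only the short partial answers rather than the raw document.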
So if you have a very large mortgage application, you can actually use an LLM to help you decide how to divide and conquer. Say, hey, in this mortgage application, look for where they're listing all your assets, go to those pages. Now take those pages, and I want you to answer the questions over these pages. We call this planning, and you can use LLMs to do that. And so a combination of planning, divide and conquer, and carefully crafting what's in the context is a way to actually scale these LLMs. That's what we're doing at Aryn. Fundamentally, that's what our technology is. You know, it sounds very similar to something I did at the US Census. I learned something about LLMs at the US Census when we were crunching numbers. I learned that large language models are not good at numbers, because they weren't trained on numbers. They were trained on language. So if I tell it to add, like, a billion rows, I have a billion rows and I say, find me the average and give me a summation of columns F and G, it does the first 10 and gives up. That's exactly right. So I did the divide-and-conquer concept that you have. I called it aggregated queries. And I got a lot better results when I said, hey, I'm going to take this big, huge thing, chunk it into smaller bits, and aggregate the results that I wanted from each one. So I could build up, and then the last query would do the final set. So I know the technique you're talking about. I've used it. It's wonderful. It works really, really well. It works extremely well. Another technique that also works extremely well, if you've tried this, is to tell the LLM that you have a tool and this is how you use the tool, just as a human would. Interesting. Yeah.
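The "aggregated queries" idea can be sketched like this (a hypothetical illustration, not the actual Census code): each chunk produces a small partial aggregate, and the final step combines those partials. In practice each chunk might be handed to an LLM with an extraction prompt; the point is that the arithmetic the models are bad at stays in deterministic code.

```python
def aggregate_in_chunks(rows, chunk_size):
    """Compute sum and average over a large list by aggregating
    per-chunk partial results, so no single step sees all the data."""
    partials = []  # one (sum, count) pair per chunk
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        partials.append((sum(chunk), len(chunk)))
    # Final "query": combine the partial aggregates.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total, total / count
```

Because sum and count compose, the combined result is exact, which is what makes this safe to build up stage by stage over a billion rows.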
And just as a human would say, okay, I'm not going to spend 30 hours adding up all these numbers; what I'm going to do is use my database, or my calculator, or my spreadsheet. Right. The spreadsheet has a particular interface. The database has a very specific interface. The calculator, we all know what the interface is: the buttons on the calculator. You can also tell an LLM, here are the tools that are available to you in the process of answering this question. Here's the data. Go ahead and plan how to use them. So LLM tool use is a very common thing. You can actually try this with ChatGPT and see it. And it works extremely well, especially when you need really precise answers over very specific, easy-to-specify problems that a database or a spreadsheet or some tool can handle for you. So the things, you know, go ahead. No, keep going. Yeah. I was going to say, the things you're mentioning take large language models from individual personal use into the enterprise, where I'm doing a lot more than, you know, write a Japanese haiku of my mortgage application, which I've had people do in workshops before. A lot of people are using ChatGPT for individual work, but the type of things you're talking about really turn it into an enterprise tool. Right. That's right. Where there are more constraints, there's more data, the scale is different, so the approach is different, right? Absolutely. And I think the biggest difference, all those things, more data, the scale, the approach, all those things are elements of it. But the thing that's driving all these architectural decisions in the enterprise is actually one important criterion.
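Tool use as described here boils down to a small dispatch loop. The sketch below is a toy harness with made-up names; real providers expose this as function or tool calling in their APIs. The "model" returns either a tool request or a final answer, and the harness executes the tools, so the precise arithmetic never happens inside the model.

```python
# Hypothetical tool-use loop: the "model" decides, the harness executes.
TOOLS = {
    "sum": lambda args: sum(args["numbers"]),
    "avg": lambda args: sum(args["numbers"]) / len(args["numbers"]),
}

def run_with_tools(model_step, question, max_steps=10):
    """model_step(question, last_tool_result) returns either
    {"tool": name, "args": {...}} or {"answer": value}."""
    result = None
    for _ in range(max_steps):  # cap the loop; real harnesses do the same
        decision = model_step(question, result)
        if "answer" in decision:
            return decision["answer"]
        result = TOOLS[decision["tool"]](decision["args"])
    raise RuntimeError("model never produced a final answer")
```

A scripted stand-in for the model makes the flow concrete: first it requests the `sum` tool, then it returns the tool's result as the answer.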
When I work with underwriters, when we're working with analysts, with estimators, so underwriters at insurance companies. I don't know if I had a chance to introduce the company that I'm running right now. It's called Aryn AI. And what we do is we take people's unstructured data, unstructured documents, and, using agents, we can automatically extract structured data fields from those documents. And we can do that at high scale, so millions and millions of documents, and we've been doing that every day. And we can do it with extremely high accuracy: think 97, 98% accuracy. That's awesome. Right. Current LLMs don't get there. They're very general machines. They can do all kinds of things, from writing poetry to recommending restaurants to extracting key information from documents. We just care about extracting key information from documents. But once you focus on that, you can get it to scale extremely well. And, sort of to your point, the question is why all these architectural decisions are being made in the enterprise. It's actually driven by one important criterion that all of these knowledge workers have. They have to do work day after day, and that work has to be done accurately. If it's not accurate, they're going to take more time to get it to be accurate, which is actually a waste of time for them, so you actually haven't helped them. Like, if you get high-80s or mid-80s percent accuracy, you're not good enough. I mean, 15% of the time they're going back and doing their old thing, and at that point, they might as well have just ignored what you've given them. Right. It's only when you can really hit the 97, 98% accuracy level that you're actually speeding them up.
And the other thing the accuracy helps you do, in some cases the accuracy is now starting to get better than what humans can do themselves, is that it actually allows them to build a better business, make more money. So as an example, underwriters at insurance companies, their job is to maintain relationships with brokers and carriers and so on. Their hard work right now is just shuffling documents. It's undifferentiated heavy lifting. Right. And if you can let them turn around quotes, what used to take our customers days, we can let them turn around in, say, 10 or 15 minutes, now they get more at-bats and they can actually sell more policies. More importantly, they can sell better-quality policies: policies that are appropriately priced for the risk that they're taking and better suited for the customer that's coming in. And so you actually end up building a better business if you can get higher accuracy. And that's the thing that's actually driving all of these architectural decisions that we're talking about. Because without the accuracy, sure, you can just use ChatGPT. You just throw it in, you get something back, and maybe you keep going forward. But it's actually not helping them if you're just doing it one-off. Yeah, it's a good demo. But high accuracy is critical. Imagine you're a securities regulator and you're going through a ton of legal briefs and filings to find precedent. Yeah. Okay. What do you do today? Well, you go through a search engine, you get your top thousand results, and then you literally just go through them by hand. We can help them find things that they wouldn't have found before, in minutes. So how do you handle the...
There are so many questions in my head right now. First off, are you using your own LLM? Have you trained a new model to do this? Or are you using an off-the-shelf one and just doing more context engineering? What's the approach you took on this? Yeah, so there are a couple of things. It's actually a conglomeration of techniques. It's not just one thing where we can say, this is what we do and we just do it all the time. There's a bunch of engineering involved in different parts of our platform, if you will. We did build our own models just to get the text, the sort of raw data, off of documents. You'd be surprised. You'd think that documents are standardized, but they're not. Oh, they're not. Yeah. Aryn can process some 60 different languages, and probably over 35, 36 different document types now, and we keep adding them. We can pull the text off the page, and the images off the page, pretty accurately and pretty fast. And there we actually built our own vision models to do it. We found public data, we took permissioned data from customers, and we just did the hard work of labeling the examples and training the models. Right. But that's not enough. That basically takes the words off the page, so to speak. It gives me text. Yeah, it gives you text, raw information, and so on. But the next step is you have to get the LLM to pull out the key properties, the key things that you care about. And there we really leverage the trend in these large frontier models. Yeah, they're pretty good at that. They're good at that. Right. So when ChatGPT gets better, we want to be able to get better with it.
When GPT-5 came out, we just suddenly got better. Yeah. When Claude comes out with their next model, I think it was Opus, we just suddenly got better. And so we do two different things. One, and this is something that no single vendor is going to be able to do, is we actually use different frontier model providers. We feed the information to multiple different frontier model providers and then use sort of a quorum technique to go and say, hey, across all of these, do they agree? Is there some kind of consensus? Yeah, consensus or not. And you'd be surprised, when there's consensus, how accurate the stuff is. And when there isn't consensus, you just haven't given it the right instructions. So what we found is that by pushing the data through multiple models and seeing where there's consensus, you can see where there's certainty and uncertainty, which saves humans a lot of time. And then when there's uncertainty, the knowledge workers can go in and say, no, no, no, this is what I want you to do. And we have a cool feedback optimization technique that takes that information from the actual knowledge workers, feeds it back into our system, and then appropriately adjusts the models, the architecture, the prompts, the context engineering. So the next time they see the same thing, it doesn't have to go through all that again. They just go click. Yeah. So reinforcement learning is a general technique; we have a variation of it, a specific version, that we call CORAL: correction-optimized reinforcement learning. What it is, is a technique for optimizing your prompts where you don't need a lot of labeled data. We're actually bootstrapping with these frontier models.
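The quorum idea can be sketched as a majority vote over per-provider extractions. This is a toy illustration with made-up field names, not Aryn's actual system: each provider is a callable returning a dict of extracted fields, fields where enough providers agree are accepted, and the rest are flagged for human review.

```python
from collections import Counter

def quorum_extract(document, providers, threshold=2):
    """Run the same extraction through several model providers and
    keep only the fields where at least `threshold` of them agree."""
    answers = [provider(document) for provider in providers]
    fields = set().union(*(a.keys() for a in answers))
    consensus, needs_review = {}, []
    for field in sorted(fields):
        votes = Counter(a[field] for a in answers if field in a)
        value, count = votes.most_common(1)[0]
        if count >= threshold:
            consensus[field] = value      # providers agree: accept
        else:
            needs_review.append(field)    # disagreement: route to a human
    return consensus, needs_review
```

The review list is what surfaces uncertainty to the knowledge worker, and their corrections are exactly the signal a feedback loop like the one described above can learn from.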
I mean, it doesn't take that many iterations, maybe a dozen iterations and maybe 10 or 15 documents, and now all of a sudden you're getting 97, 98% accuracy in your extractions. I've seen the exact same thing. It's very similar to what I did at the US Census: we sat the SME with the AI and a front end to it, so when they were analyzing documents, they could interact with it and say, yes, this is right. And we've gotten our accuracy way high, like 99% now. Exactly. Yeah. Because you're not using the AI to replace the human; you're using it to augment the human. And that technique is incredible. That's exactly right. And I think that's where everybody's kind of scared by the boogeyman, like AI is going to take your job away and so on. Well, because CEOs are laying people off and saying, AI replaced them. You can blame all the AI vendors for saying, yeah, we need universal income because I'm replacing everyone's jobs. Exactly. You're scared. Yeah. Look, there's a certain amount of low-paid, low-income work that honestly doesn't need to be done anymore and where AI is going to help you. And the people that are doing it, they're not happy doing that work anyway. It's tedious, it's manual, and so on. I think this is good for everybody. That work is going to go away. But people coming in and saying, hey, we're going to replace your lawyers and your doctors? That's not going to happen. You can't replace human judgment. And right now, the way to think about LLMs and frontier models is that they've gotten really good, good enough that even the greatest mathematicians are using them. The greatest mathematician that I know of right now, and there are many of them, by the way, but the one that's in the press a lot is Terence Tao. And Terence Tao is quite a humble person for what he's actually been able to accomplish.
He and a few others in the theoretical computer science and mathematics community are starting to use LLMs to actually prove minor lemmas in their proofs. This is huge. But they're not going off and saying, hey, I have this idea for a proof, go off and come up with the proof. That's not what these LLMs are doing at all. What they're doing is very carefully saying, hey, I need a lemma that does X, Y, and Z. Think of it as a subroutine in a program that I need you to go build. I don't know if there's a recent result that I can leverage, or a few things I can put together from across the corpus of mathematics that's out there. It would take me probably a few weeks to figure it out; maybe you can come up with this little lemma. So the intuition for what problems to solve and how to solve them is still coming from humans. But a lot of the work, I wouldn't say menial, but the work that the humans have to do, is starting to get automated so that they can actually be more productive. Are mathematicians afraid that they're going to be obviated? Absolutely not. The greatest mathematicians are still going to be the greatest mathematicians. Well, in fact, they'll be able to work on problems that people have said are unsolvable. And they'll be able to work on problems at a larger scale. Mathematicians today can collaborate with maybe two or three other mathematicians, maybe at most five, on a problem, because you've got to trust their problem-solving capabilities, and there's got to be a sort of culture of coordination and so on. But now, with the ability to use LLMs and proof-verification tools like Lean, you can actually try to solve really, really challenging mathematical problems. And so we're going to be able to work on mathematical puzzles with a team of like 100 people.
It's almost like a software project as opposed to a solo mathematical proof endeavor. And the reason I make this analogy is: are lawyers going to go away? Absolutely not. They still have to work with humans, convince juries, convince judges, and LLMs are not going to replace lawyers. That's not going to happen. Are doctors going to go away? Are you kidding me? Are you telling me an LLM is doing your surgery? Absolutely not. You want it hallucinating in the middle of it? No. And actually, there's a nice observation by Terence Tao that I often repeat. I don't know if it's an exact quote, but LLMs are just getting to be better and better guessing machines. So they can guess the solution. That's a good way to put it. Right? Yeah. But to be able to verify it, you need to have some independent way of checking their results. They're not going to be 100% right. And in the sciences, we have found independent ways of checking their results. You actually do the experiment. Right? So somebody said, this protein folding, you know, won the Nobel Prize. Here's the protein structure. Well, you do some X-ray crystallography to go and look at the structure, and you're like, yeah, that is the structure, so you must be right. So there are independent ways of verifying these things. Making predictions about weather: you look at past weather patterns, you see how well it does, and you say, hey, this thing's doing well. And even in coding, the thing that has taken off like wildfire. It has. The thing generates all this code, and we have mechanisms to verify that the code is working: deploy it, and if it breaks, pull it back; if it's good code, it stays. Yeah. I've actually been teaching those techniques in my class. I teach at Vanderbilt. Yeah, this is great. So in the enterprise, when we're talking about knowledge workers, they're already doing this work.
They have their way of checking when things are correct. And what this is going to do is really assist them. That's really what's going on here. I think LLMs are just going to make us much more productive. I think it's going to make us happier, because we're going to be doing less grunge work. Yeah, grunge work. But I don't think they're going to replace humans. I think that's a long, long, long way away. I think so too. So, all right. If people want to find out more about your company and what you guys do and want to engage, how do they reach out to you? Absolutely. So our company is called Aryn AI. You can just go to the website, aryn.ai, and it tells you about the products that we have. You can also book a demo, and that'll get me on a call with you, and I'll show you what we call our agentic property extraction system and tell you about all of the cool technologies that we've built. So feel free to just go to the website, or you can send us an email at info@aryn.ai. That's awesome. Hey, Mehul, thank you for coming on the show today. This has been great. We could easily go for another hour, but people would be bored. We wouldn't. We'd have a lot of fun. But thanks again for coming on the show. Darren, thanks for having me, and looking forward to seeing the podcast. Talk to you soon.