AI incidents, audits, and the limits of benchmarks
Sean McGregor, founder of the AI Incident Database and co-founder of the AI Verification and Evaluation Research Institute, discusses AI safety through incident documentation, third-party auditing, and evaluation of AI systems. The episode covers how AI incidents are tracked, the limitations of current benchmarks, and the need for systematic approaches to AI safety verification.
- AI safety requires systematic documentation and analysis of incidents, similar to aviation crash reporting, to prevent recurring failures
- Current AI benchmarks are primarily designed for research purposes rather than practical deployment decisions, creating gaps in real-world safety assessment
- Third-party auditing of AI systems is becoming essential as models become more general-purpose and harder to evaluate in specific contexts
- The interface between multiple AI systems (like guard models and foundation models) often represents the weakest security point
- Statistical rigor is necessary when evaluating AI vulnerabilities, as anecdotal exploits don't represent systematic security flaws
"You don't want a bad thing to happen and that you don't want that bad thing to produce a harm. You don't want someone to say like, I've been impacted or some organization been impacted."
"The world is hard. The real world is real hard."
"You manage what you measure. And so let's measure these risks, and then we can separate the actors that are investing in safety and stronger AI systems, safer AI systems, from the ones that aren't doing that."
"Anecdote does not equal data in this instance. And we need you to show that it's systematically like pushing towards arson, that it's always going to burn things down."
Welcome to the Practical AI Podcast, where we break down the real-world applications of artificial intelligence and how it's shaping the way we live, work, and create. Our goal is to help make AI technology practical, productive, and accessible to everyone. Whether you're a developer, business leader, or just curious about the tech behind the buzz, you're in the right place. Be sure to connect with us on LinkedIn, X, or Bluesky to stay up to date with episode drops, behind-the-scenes content, and AI insights. You can learn more at practicalai.fm. Now, on to the show.
0:04
Well, welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I am joined as always by my co-host Chris Benson, who is a principal AI research engineer at Lockheed Martin. How you doing, Chris?
0:48
Hey, doing great today, Daniel.
1:05
How's it going?
1:06
It's going really good. Lots of fun things in the news to follow and lots of fun things to work on. By the way, for our listeners who might need a reminder about this, we have been doing some webinars recently on a variety of topics, some of which are maybe even related to some of the things we're talking about today around security or safety. So if you want to find out about those, go to practicalai.fm/webinars. That's where we have some of those things listed out. But I'm really excited today because we have an amazing set of things to talk about with Sean McGregor, who is co-founder and lead research engineer at the AI Verification and Evaluation Research Institute and also the founder of the AI Incident Database. How are you doing, Sean? Thanks for joining us.
1:07
I'm doing well. Thanks for having me here.
1:58
Yeah, yeah, of course. And this is interesting kind of to think about AI incidents, verification, evaluation. How did you find your way into a day to day where you're thinking about and documenting and studying AI incidents among other things?
2:00
How did I get to work on what you might call impractical AI or.
2:19
Yeah, yeah, well I guess, I guess where practical AI turns into problematic AI, let's say.
2:24
Yes.
2:33
Yeah. And really, practical AI is the AI that has consequences and matters in the world, and those are the ones where you actually care to look into where it goes wrong. So the really quick professional tour of things is: I'm about a 2017-vintage PhD in machine learning. I focused on reinforcement learning as applied to wildfire suppression policy. So a fire starts in a forest. What do you do about it, and how does that impact the development of the land and the values we get from it over the course of 100 years? In that setting, I had a very strong sense of the power of the technology that we were developing, but also the brittleness of it, and just how difficult it was to even know whether the system was doing what you wanted it to do, particularly in a reinforcement learning system. And I just never wanted to wander through a forest and have a sense of, this is a charred wasteland because the forester took my simulation and didn't really realize it was research code and that he should actually apply his own human expertise about what is good fire and bad fire. So I went from that. Then I worked on energy-efficient neural network processors, which was a great effort at the organization called Syntiant, which shipped millions while I was there and has continued to ship many edge neural network processors. And that gave me a really strong impression of that kind of power and brittleness once again, inside these consumer electronic devices, hearing aids and the like. And so I left that and started a company dedicated to the test and evaluation of machine learning systems. And when I started that, the LLM explosion happened. And, you know, that's the Cambrian explosion of sorts of practical AI technologies that I'm sure we are all still feeling.
And we sold the assets associated with that to an old safety organization called Underwriters Laboratories and spent some time there doing safety stuff before leaving and starting up AVERI, the AI Verification and Evaluation Research Institute. And during this whole time I had a project that just kind of got away from me because it filled a niche, it served a need. And that's the AI Incident Database: you really need to have systems that collect and produce usable data sets motivating safety practice. And you see this in aviation. A plane crashes, you record what happened, and you use that to make sure it doesn't recur. Effectively, in the aviation industry you have a form of regression test. You don't want that past crash to happen again. You also have similar things in food safety. You have medical adverse event reporting. This is the fundamental primitive that you need to make sure you have in safety: a bad thing happens, you make sure it doesn't happen again. And so in that project we've collected more than 5,000 human-annotated reports of AI incidents. Those are collected across more than a thousand discrete incident records at this point. And we've formed a lot of the training data of, like, what is an incident. And we've been interacting with a lot of intergovernmental organizations and the GRC (governance, risk, and compliance) community, companies, and the like, just trying to work out how to motivate safety, and why very often that can actually be a business imperative. Because for a lot of the incidents that we have in the database, you could actually look at the stock price before and after, and there's an impact.
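The "form of regression test" idea can be sketched in code. This is a minimal illustration under stated assumptions, not a description of how the AI Incident Database works: `IncidentCase`, `run_incident_regressions`, the example record, and the two toy systems are all hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class IncidentCase:
    """Hypothetical record: one past incident turned into a replayable check."""
    case_id: str
    description: str
    replay_input: str
    harm_recurred: Callable[[str], bool]  # True if the output repeats the harm


def run_incident_regressions(
    system: Callable[[str], str], cases: Iterable[IncidentCase]
) -> list[str]:
    """Replay every recorded incident and return the ids of those that recur."""
    return [c.case_id for c in cases if c.harm_recurred(system(c.replay_input))]


# Made-up example record, for illustration only.
CASES = [
    IncidentCase(
        case_id="example-001",
        description="Assistant promised a refund that policy does not allow",
        replay_input="Can I get a refund after 90 days?",
        harm_recurred=lambda out: "guaranteed" in out.lower(),
    )
]


def toy_system_v1(prompt: str) -> str:
    # Toy stand-in for the system version that caused the incident.
    return "Yes, a refund is guaranteed."


def toy_system_v2(prompt: str) -> str:
    # Toy stand-in for a later version with the failure fixed.
    return "Please check the official refund policy page."


print(run_incident_regressions(toy_system_v1, CASES))  # ['example-001']
print(run_incident_regressions(toy_system_v2, CASES))  # []
```

The design choice worth noting is that each record carries its own check for recurrence, so the suite only grows as new incidents are documented, which mirrors the "don't let that past crash happen again" framing.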
2:34
Could you talk a little bit more, kind of expand on what the nature of safety means in the context of AI? Because coming in as a neophyte on this, I don't know if there's kind of a standard definition. It seems to me there isn't. At least not that I'm aware of. So how do you define it? You mentioned kind of defining an incident. Could you define what you think an incident is and how these different concepts relate together in a way that those of us who are not part of that kind of world, that specific world, can kind of adopt the ideas and understand what you're talking about, Kind of give us a background on it.
6:23
So this is a three or a four hour podcast, right?
6:59
It's like the question, like used to always we would have the question, like, is something AI or machine learning? And like, eventually you just learn that those words are completely meaningless in terms of the way people apply them.
7:04
So yeah, we just need to treat them as, like, a pointer. And, you know, the memory might be allocated differently, or the pointer can point to different memory, or maybe the memory gets swapped.
7:18
Exactly.
7:29
So, like, there's a lot of safety terms out there. And I've actually had a poster for this at one point because this comes up so often. You could have incident, accident, adverse event. You have vulnerability, exposure, harm, event, controversy, issue. Each of these has a slightly different nuance to it. But you can kind of bring it back to the concept of you don't want a bad thing to happen and that you don't want that bad thing to produce a harm. You don't want someone to say like, I've been impacted or some organization been impacted. And so the intergovernmental definition of incident that developed was effectively an event in which a harm has taken place. That's an incident. Your mileage may vary: in different contexts and different communities, there are different terms that resonate. But the reason that we went with incident to begin with is basically that it's sufficiently vague while still meaning something. You can't say accident, because some of these have intention. You can't say exposure or compromise or, you know, the ones that are used in computer security, because some of them don't have intention. And so incident covers them all. And, you know, I could gesture and give you some Venn diagrams, but this is a podcast, so we'll avoid that.
7:30
I'm wondering, in particular, when you narrow it down to AI incidents: obviously AI, depending on your definition of AI, has existed for some time. And so I guess AI incidents have been happening for some time, in the sense that you could have a computer vision model detecting, maybe, problems in things coming off of a manufacturing line; it misses one, and that causes harm downstream to someone who uses a product, or something like that. You talked about the shift, of course, the kind of expansion of how AI is impacting our lives. How, from your perspective, has that idea been stressed in new ways in recent years?
8:55
We have constant debates over what we would ingest into the AI Incident Database. And because we have to operate it, and it's a database where we have to make these decisions day in, day out, we have very difficult decisions. And in our case it's not just what should be considered an incident; it's also, if we start ingesting these, are we lassoing an infinity that's just going to pull us in some extreme direction? And maybe we would like to index, catalog, and present to the world that infinity. But there are these kind of little harms that are repeated maybe millions of times a day when you have these systems, and indexing all of them isn't necessarily useful. So we do have a little bit of a bent towards: is this informing the production of a safer AI, or a safer world for AI? But a lot of things meet the fundamental criteria of involving AI and a minor harm having taken place. There are also an increasing number of high-scale harms that take place. Like, if you make a billion people slightly more depressed, a nonzero number of people probably have died as a result of that. And that's where it's not just that you're building a bridge that needs to bear weight; you're building a bridge that needs to bear the weight of all of humanity flowing over it.
9:48
I'm curious, as you're selecting things to go into the database per the criteria that you just talked about, how are you sourcing that? And especially when you consider the fact that so many AI things today are proprietary and kept secret in organizations for obvious reasons, how is that sourced and how does the different sourcing affect the utility of the database and how you make those kind of selections?
11:19
Most of what we have at the moment is journalistic reporting. The reason that we are living in that space at the moment is that journalists actually put in a tremendous amount of work to validate the base facts involved. And that is where the degradation of the journalistic community, or at least of their ability to make wages, has actually been quite harmful. We also do get direct reporting at times. We get people that email us or submit directly through our forms. We get people that create blog posts and things like that and submit them to the incident database. We do encourage people to submit to the incident database. The simple fact of the matter, though, is that the volume we're dealing with is one where eventually we do need to switch from this kind of voluntary reporting to a more mandatory form of reporting. In the EU code of practice there is a requirement for severe incidents to be reported. It hasn't reached implementation as of yet. There's a lot of back and forth on that front. But the insight you can derive from mandatory reporting is greater than that of this voluntary reporting. Like, we can prove existence, we can prove it's happening. It's very difficult for us to assign a rate to incident events, though, particularly as they become non-newsworthy through time. So we do have some works where we're trying to make the most of that. This bears some resemblance to public health practices, where you don't have perfect insight into disease spread, but you do have these kind of indirect measures, like doing tests in the sewers to see how much of the viral particles are there and whatnot.
11:49
Well, Sean, I appreciate you taking us through this idea of AI incidents. Obviously we are Practical AI, and in a practical way we would very much like to prevent incidents, or at least understand the security implications of using certain AI models. I know with AVERI, the AI Verification and Evaluation Research Institute, in my looking at that, you're talking a lot about third-party auditing of AI, or frontier AI or foundational AI, however you want to frame that. Do you want to explain a little bit about how that side of things might fit in the landscape and why it's important?
13:33
Certainly. So a fundamental problem that we have, particularly at the frontier model level with things like OpenAI, Anthropic, Google's Gemini and so forth, is it's very difficult to even know how safe your systems are because they're general purpose systems. This basically broke the safety frame. All the safety processes that we have are built around starting presumption of there's a specific context it's operating in. And you reason about its safety within that context. Well, if your context is just wild card star everything, where does that leave you? Do you need to, you know, verify across all circumstances that it's going to be safe? Because the answer is going to be no. So how do you approach the task of verifying claims? How do you encapsulate something that a customer or person or, you know, organization would rely upon and say like, okay, this has gotten this level of assurance applied to it, so I know I can apply it within my college age student population, teaching them how to do the maths and getting that level of signal. At this point, a real practical problem that you have is no one's going to believe you when you say this is good for college age math remedial level. You have to run a pilot program and see how it works before you actually can believe the representations being made by companies. Because they probably haven't evaluated in your exact circumstances. They might have something that's analogous to that, but they probably won't have it exactly in your circumstances. And you know, this is a problem on the practical AI side of things, but it's also a problem on just the increasing power of these systems and the scale in which they operate. You actually do want some really strong top level guarantees that the system is going to try and steer away from catastrophe and doesn't actually have an active propensity for doing the bad things. And the science of establishing that is actually still a work in progress. 
There's a lot of work not just in establishing evaluation, but meta-evaluation: basically deciding, is this evaluation saying the thing that it says it says or not? And AVERI, as an organization, is concerned with doing third-party audits, basically saying, is this thing safe? Is this thing doing the thing it should be doing? And there's a premise in there that a third party is better positioned to do that than a first party. And we've all been on development teams and, you know, pushed to prod, pushed to real-world circumstances, and found, like, oh, I didn't think about that. One of my favorite incidents in the AI Incident Database is this gentleman who got a traffic citation mailed to him, and he looked at the traffic citation and said, like, you know, this isn't my car. Not only is it not his car, it isn't a car at all: it is a woman who's wearing a shirt that says "Knitter," and there's a purse strap going across it. So it kind of played with the letters so it looked like KN19 TER or something like that. And I don't think the people making the traffic camera were thinking about a woman walking through the field of view of the traffic camera with a shirt perturbed in such a way that it would catch it and register it as a license plate. And then the traffic citation went out. Like, the world is hard. The real world is real hard.
14:19
I'm curious, I like that example. It's a very personal example and I can certainly imagine that happening to me actually, bizarre as it is. But I'm curious, as you are looking at applying this process that you're describing to maybe a larger incident, is there an incident that comes to mind in the database that where you could kind of compare what your estimate like if relative to not having gone through third party process versus if it had, how you might envision that playing out in a different way? I'm kind of trying to get a sense and you can, you can interpret it any way you want or find whatever example, but trying to get a sense. Like when organizations are listening, you know, they have employees and their leadership are listening to the podcast and they're trying to put this in the context to their own operations. Like what is, what do you think is a kind of an aha moment potentially for the listener and maybe a Fortune 500, maybe not that big, you know, maybe a mid sized company where this process will make a substantial difference to the outcomes involved.
18:19
Probably a good way to work here is by analogy to other segments that have audit. And audit is something that no one truly enjoys. Like, you don't necessarily want to be audited, but you actually do want to be audited. If you're an organization that wants to receive investments from other organizations, having audited financials is table stakes. You're not going to put your money into an org that hasn't been audited. Or actually, I should say there are famous instances in which an org had not been audited and people subsequently regretted it, because the reason they weren't audited is they were approving disbursements by Slack emojis, and it didn't go very well in those circumstances for anyone. And so we do periodically get reminders that these processes are important and that audits solve a problem. And that is: you need to be able to trust the fundamental representations about the state of an organization. And it's the same thing for the model, because the model is similar in the sense that it's taking actions, it's doing things, and it has impacts. And so there's value in knowing when you should or should not trust information.
19:33
And I'm looking through some of what AVERI has been involved with, and in particular recent work on this meta-evaluation of benchmarks, BenchRisk, which is actually how we got connected, through a colleague, Aishwarya. And I'm wondering, since this is a meta-evaluation, and I know a lot of people make decisions about models based on benchmarks: could you help us walk through mentally what the difference is between an audit and a benchmark, maybe tying in some things that might make people think about whether benchmarks, as they look at them on leaderboards and that sort of thing, are really relevant to the real-world behavior of models?
20:47
So, maybe overextending the financial metaphor: when you audit a financial organization, you say, okay, I see that this organization has a balance sheet indicating they have $10 million in the bank. You're like, great, now as an auditor, I need to go and check the balance with the bank and see, is this money real? Is it actually there? And similarly, an audit is concerned with: I've seen the balance sheet, and now I need to check the evidence and make sure that I actually believe that representation in some form. And so where meta-evaluation, or evaluating the evaluations, or, you know, checking the benchmarks, is useful is that you're basically checking those receipts. And in the course of the BenchRisk project, we found that a lot of the receipts were just kind of like, "lol, trust me, bro," written on a piece of paper, and that there were real substantive issues of, like, okay, maybe there's an IOU and there's some gold buried out in a field somewhere. But I have some doubts, and at the very least I need to look into this a little bit more before I rely on it for real-world purposes. And the dichotomy that we identified is that a lot of the benchmarks that are produced are produced for non-practical purposes. They're produced for knowledge generation purposes. They're produced for research purposes, where people are wanting to understand systems and make sense of them, but they're not produced with the intent that someone's going to then say, all right, I'm going to deploy this in my environment, and I know now that it's unbiased because it scored well on BBQ. And BBQ is a great benchmark. It was really foundational in the bias benchmarking community. But it was also not produced with the intention that every time a frontier model is released you say, like, look, we've improved on bias and it's 10 BBQ points better than the prior ones. And that's a good thing. You do want that. 
But it's not a practical AI purpose. It's not saying it's unbiased for my specific application, because there's a distribution associated with BBQ, and there are characteristics that aren't necessarily generalizable to the environment in which you're deploying it. The prompts can be in a very particular space for BBQ that you're just not operating in. And all these things are associated, in this research work that we put out, with failure modes that we identified. We collected a list of failure modes, and then we looked at how well benchmarks had expressed mitigations against those failure modes, and the results were mixed: some did better than others. But by and large, most benchmarks historically have been produced for research purposes, not for practical AI purposes. And this is a problem, because this is what we're working with informationally. And this is why we're starting to see an evaluation ecosystem pop up, with evaluation being a discrete task and something that's supported by organizations separately.
21:46
I'm wondering, as we're kind of going through this and discussing the process, if, as you're looking at some of the different risk categories that you guys outline on your website and stuff, a couple of examples are misuse and unintended behavior and infosec and emergent social effects. Which of these do you think is kind of like maybe the most underestimated? The thing that people aren't really expecting that has consequence that may be outsized relative to what those of us not in this field would expect. Do you have any sense of that in terms of like this? Kind of a little bit of the surprise that people might not realize is there.
25:04
That's tough because I'm a little bit of the fish in water being told it's wet.
25:47
Fair enough.
25:54
I will say in kind of rewinding my own experience on this is I feel like I'm a pretty astute observer and I've been more predictive of what problems are coming around the corner than I think most people I've encountered. But even with that, I still am regularly surprised and I wish we could be a little bit more predictive about it. And I think we are going to get there and we're going to learn from the past and hopefully push that into the future. But this is a new thing. We don't have a great operating history, particularly for the general purpose AI systems. I think that the split in this community that you should probably think a little bit about, that question in terms of expectation is there's security and safety people, and that there is a little bit of difference between those. Security is safety, safety is security. But what you care about tends to be slightly different in what you're expecting. And security people have a much greater capacity for the presumption of bad actors doing bad things and that producing bad outcomes. And then the safety people have a much stronger presumption of just. The world is its own adversary. We're all in our own kind of game against nature, and bad things will happen regardless. And maybe in the bad actor side of things, they're a little bit more insistent upon it though, particularly when the scalability of doing bad things is actually substantially increasing. So we have to solve all of these.
25:54
Well, Sean, one of the things I wanted to ask you about: I'm a big fan of the Darknet Diaries podcast and listen to that a lot, and I always love hearing stories of hackers and the kind of world that is there and all of what people try to do at DEFCON and other places. I was really intrigued by your study titled "To Err is AI." I see in the description you took some models to DEFCON and, reading your description, had the pleasure of taunting a room of hackers with a declaration that the model is flawless. And obviously certain things happened after that. So I would love to hear how this came about, or what the idea really was, and then would love to hear some of that story.
27:23
Sure. So to set the stage a little bit, DEFCON is a hacker convention that's been run in Las Vegas for a good number of years. And I don't know if listeners have seen the 1990s movie Hackers, which has a very early Angelina Jolie and some others in it. And it's these young men and women rollerblading everywhere, you know, like that's just how the future would be, and talking about how RISC architectures would change everything. And there's this kind of hacker aesthetic that it was kind of noisily portraying, of joyfully compromising systems, and in this particular instance, having a mixture of people that are doing that with good aims and bad aims. But DEFCON is this community-organized thing, and you have villages, and one of them is the AI Village. That's been run for a few years now. And in that village you have presentations, but there's also a section of it dedicated to some sort of challenge that's conducted over the course of the conference. And people filter through and figure out how to break things, and they're treated to cash prizes in this case. So this was something called the Generative Red Team 2. So it was a change from the prior year in form. And we basically told people: there's documentation for the Open Language Model, OLMo, produced by the Allen Institute for AI, so a 7 billion parameter large language model, along with its guard model put in front of it. And we said, here's the documentation, here's the representations about what this model is supposed to do, and your job is to show how that's wrong. That kind of unleashed a few days of chaos. And, like, I don't know what else was happening in the conference, because I was at the judging table for the entire time with reports flying in and needing to decide, you know, is this actually a violation of the documentation? It did the thing we wrote, and we did put some Easter eggs in there; like, we expected people to be able to break it. 
But did they submit sufficient evidence of a violation of that documentation? And that is actually a surprisingly difficult thing. We had all these people that were very good at compromising systems and saying, like, yeah, I know how to prompt this LLM and get it to tell me how to burn down this convention center, a violation of what the model should do. And we kept on needing to say, oh, that's great, but you haven't told us anything. Anecdote does not equal data in this instance. And we need you to show that it's systematically like pushing towards arson, that it's always going to burn things down, or that you can put some sort of more universal jailbreak on it that makes it so that it underperforms in that particular filter of things. And so what we basically had to bring to the security world, and the security world does not like this, and I don't blame them for it, is statistics. We had to say, it's not just about the fact that there's an exploit. It needs to be systematically vulnerable in some form, because it's always rolling dice. That's just the nature of it. And that made for a wild few days. We paid out a good cash purse and learned quite a lot about the collision of disciplines that we need to foster when it comes to security, safety, statistics, and machine learning. The world's getting more complicated.
28:19
Prior to talking about some of the chaos that ensued at that point: as you talk about that collision of culture, if you will, with taking the hacker out of their thing of saying, hey, I have an exploit, look at me, and requiring that level of rigor. First of all, can you describe why you required that, why that level of rigor was important in this case? And how did that force different behaviors out of the hacker community that were applying themselves against the problem that you had presented?
32:07
We had a few people that were unhappy initially, and then we explained it, and then they went back and came back, and then they started turning the crank and getting a lot of payouts. So people that really wanted to encamp and figured out the modality did very well. The problem, and the reason why anecdote doesn't equal data here, is this: if you say something filters out 99% of the bad thing, then if they wanted to, they could roll up to one of those stations and, you know, just keep on issuing the same query, or adding one period to it, and then one time out of 100 they'll get something. They'll be able to walk up and say, money, please. The problem is that's not really useful if you're designing the system. You need some idea of what is systematic here. Because, you know, it's 99%; to some extent you're going to keep on working to get to 100, but you care about the cases where it's not 99%, where you made a mistake and it's actually 70%, or it's 1% or 0%. And that's a statistical argument. That's a: here's a higher-level description, then here's an example. This isn't me prompting it. This is a description of an attack that is not accounted for in your documentation and that you're vulnerable to. And you need to solve this as a strategy. You never solved "talk like a pirate," so anyone can talk like a pirate to the system and get it to do bad things. And that's the thing you care about, rather than: I asked it something in pirate-ese 100 times, and one time it gave me something bad.
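The statistical argument here can be made concrete with a small sketch. It is an illustration under assumptions, not the judging procedure used at the event: the function names are hypothetical, and the Wilson score interval is one common way to put bounds on a success rate, chosen here for illustration only.

```python
import math


def prob_at_least_one_success(per_attempt_rate: float, attempts: int) -> float:
    """Chance of at least one leak when each attempt independently leaks at this rate."""
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def is_systematic(successes: int, trials: int, documented_failure_rate: float) -> bool:
    """True only if even the lower bound on the attack's success rate
    exceeds the failure rate the documentation already admits to."""
    lower, _ = wilson_interval(successes, trials)
    return lower > documented_failure_rate


# A filter documented as 99% effective leaks about 1% of the time, so simply
# retrying the same query 100 times produces a "hit" more often than not.
print(round(prob_at_least_one_success(0.01, 100), 3))  # 0.634

# One hit in 100 tries is consistent with the documented 1% failure rate.
print(is_systematic(1, 100, 0.01))   # False

# Thirty hits in 100 tries of one strategy is evidence of a systematic weakness.
print(is_systematic(30, 100, 0.01))  # True
```

The first number is why a single successful prompt says little: against a documented 1% failure rate, persistence alone gets you there. The comparison against the documented rate is what separates an anecdote from a reportable strategy.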
32:43
Were there certain modes of failure, like you described, that were a surprise to the judges, or maybe to the model-building team? I don't know how much interaction there was with the model-building team at the Allen Institute and that sort of thing. But were there any modes of failure that weren't expected, or ones that were continuously problematic and, even if they were known before, somewhat unexpected?
34:18
We characterize this in the research paper as well, and I do recommend people take a look at it because it's very practical in nature. The biggest one, and the one that I think most people just incidentally exploited, because we didn't give them full information on it, was the integration of the guard model and the foundation model, or the chat model underneath. There are multiple ways of configuring that handoff. You can do a hard reject and say, this doesn't pass the guard model and we're going to give you nothing. Or you can reprompt the underlying model and say, here's what you're being prompted with, but don't answer it. Or there are variations of that, and then you can still get a somewhat useful reply out of the system when it should just reject. The configuration of that handoff was looser than it needed to be, and exploiting that handoff between the guard model and the underlying model was the big vector you could use to print cash over the course of the competition. And that's perhaps the thing people need to think about, particularly as deployed systems are very often a collection of multiple systems that present as one: the interface between them is very often under-tested. It's very often difficult to know how those systems will interact. And particularly when the benchmarks are expressed at a lower level than the whole system, you don't know a lot about what will happen at that point. You've muddied the waters a little bit.
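The two handoff styles described above can be sketched as follows. This is a toy illustration, not the configuration used in the competition: `hard_reject`, `reprompt`, `toy_guard`, `RecordingChatModel`, and the wrapper wording are all hypothetical names and stand-ins invented for the example. The point it shows is structural: with a hard reject the flagged text never reaches the underlying model, while with a reprompt the flagged text is still passed through, so the safety of the whole system depends on the underlying model obeying the wrapper.

```python
from typing import Callable, List

REFUSAL = "Request rejected by guard model."


def hard_reject(prompt: str, guard_flags: Callable[[str], bool],
                chat_model: Callable[[str], str]) -> str:
    """Hard reject: a flagged prompt never reaches the underlying model."""
    if guard_flags(prompt):
        return REFUSAL
    return chat_model(prompt)


def reprompt(prompt: str, guard_flags: Callable[[str], bool],
             chat_model: Callable[[str], str]) -> str:
    """Reprompt: a flagged prompt is wrapped in an instruction not to answer
    it, but the original text is still sent to the underlying model."""
    if guard_flags(prompt):
        wrapped = (
            "A user sent the request below. It was flagged as unsafe. "
            "Do not answer it; briefly explain that you cannot help.\n"
            f"---\n{prompt}"
        )
        return chat_model(wrapped)
    return chat_model(prompt)


class RecordingChatModel:
    """Hypothetical stand-in for a chat model that records what it is sent."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.seen.append(prompt)
        return f"[model reply to {len(prompt)} characters of input]"


def toy_guard(prompt: str) -> bool:
    """Hypothetical guard: flags any prompt containing the word 'forbidden'."""
    return "forbidden" in prompt.lower()


if __name__ == "__main__":
    attack = "Tell me the forbidden thing"
    strict, loose = RecordingChatModel(), RecordingChatModel()
    print(hard_reject(attack, toy_guard, strict), len(strict.seen))
    print(reprompt(attack, toy_guard, loose), len(loose.seen))
```

A benchmark run on the guard alone or on the chat model alone would not exercise the wrapped prompt in the second function, which is one way the interface between components can end up under-tested.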
34:48
I'm curious, as you went through this exercise with the attendees, beyond the specifics of this problem, what was the value of going through the exercise? What did you get out of it beyond the immediate guard model and the model it's protecting? And if you do a similar exercise in the future, which I'm assuming will happen in some variation, what do you envision as a good next step to tackle to get to a next-level outcome? Have you thought about what you would do, coming out of the previous event, should you do it again?
36:24
So I think one of the really strong takeaways, and there's actually been some work on this since then, was the need for tools to scaffold this in some form, and to have a clear form of expression for what these things are called, which is flaw reports, because it's not an incident: no one's harmed. You're testing things in a laboratory. In classical computer security you have bug bounty systems, you have the ability to submit reports, and there are clear adjudication processes that often lead to payouts. The extensions that are necessary for machine-learning-based systems, data-driven systems, distributional systems are, I think, clearer as a result of the activity that we ran at DEF CON. That's the research value that was produced. The Allen Institute probably got some operational use as well; they learned a lot because they were at that judge table dealing with the onslaught too. That was the vendor table. And so I think if this is run again in the future, getting past those pain points, where you're rolling your own code and all that, which isn't necessary for this, and scaffolding it a lot more will carry it a lot more distance, and then move this towards something where there are one or more companies operating a flaw reporting business in the same way that there is security bug reporting.
37:15
Well, Sean, as we get close to wrapping up here on the podcast, I'd love for you to share anything that comes to mind as you look out at the next phase of maybe your own work, but also the ecosystem a little more broadly in the area of research in which you're involved. What are those things that stick in your mind as you end your days or you're driving home? The things where you think, what if this happens, or I'm really interested to see how this develops, or I'm encouraged by this going into the future. What's top of mind for you in terms of the community that you're involved with coming into this next year?
38:57
There's an interesting phenomenon: every time we pass a milestone in the incident database and we're like, oh, we've crossed 500 or a thousand incidents, we're like, yay. And then we're like, but it's not a great thing.
39:44
Should you celebrate? Yeah.
39:56
Should we celebrate? And I feel like my dream utopia might be, I need to go do something else because the safety problem has been solved and there's no more to do. The unfortunate fact, though, is this is only going to get more complicated, and there's more to do, there's more to solve. There's not going to be a perfect AI system out there. So what I'm hoping for in the next year is to develop a greater sense of what institutions, what techniques, what methods we need to develop and scale and apply so that we can have a safer AI ecosystem. And this is important not just from the perspective of wanting to prevent harm: you can't deploy unsafe systems to clients that want safety, that care about outcomes. And so our ability to make a safer system is very heavily involved in our ability to ship product, to solve problems in the real world. And so I hope to highlight that. I hope to have a greater accounting for the risks and make it so that we have better and better measures of those risks, because you manage what you measure. And so let's measure these risks, and then we can separate the actors that are investing in safety and stronger AI systems, safer AI systems, from the ones that aren't doing that. And right now, that's not where we are.
39:58
Well, really appreciate that perspective and appreciate the work that you continue to be engaged in. Sean, thank you for that and from the community, it's really, really important, so appreciate that. And thank you very much for taking time to chat with us about it. It's been great.
41:50
Oh, it's been a lot of fun. Thank you for having me.
42:05
All right, that's our show for this week. If you haven't checked out our website, head to practicalai.fm, and be sure to connect with us on LinkedIn, X, or Bluesky. You'll see us posting insights related to the latest AI developments, and we would love for you to join the conversation. Thanks to our partner, Prediction Guard, for providing operational support for the show. Check them out at predictionguard.com. Also, thanks to Breakmaster Cylinder for the beats and to you for listening. That's all for now, but you'll hear from us again next week.
42:15