Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath
Dan Balsam and Tom McGrath from Goodfire discuss their company's $150M Series B fundraise and new "Intentional Design" paradigm for AI interpretability. They explore moving beyond understanding what models learn to actively controlling what they learn during training, including techniques for reducing hallucinations without fighting gradient descent.
- Interpretability is evolving from reverse engineering trained models to actively shaping what models learn during training through "Intentional Design"
- The key principle is to avoid "fighting backpropagation" by reshaping the loss landscape rather than blocking unwanted updates
- Models can be trained to reduce hallucinations using frozen probe detectors without learning to evade detection
- Neural network representations have rich geometric structures beyond simple linear features, requiring manifold-based understanding
- Interpretability techniques can enable scientific discovery by extracting knowledge from already-trained models
"Don't fight backprop. Because models are such high dimensional beasts, gradient descent will inevitably find ways around any attempt to prevent the model from learning what the loss function directs it to learn."
"Paranoia is a way of life in alignment research."
"If you want to be part of the most exciting and beautiful scientific quest that's going on at the moment, I think it's got to be interpretability."
"We're trying to really reimagine the AI stack with interpretability at the center of it."
Hello and welcome back to the Cognitive Revolution.
0:00
The Cognitive Revolution is brought to you
0:03
in part by Granola.
0:04
Just yesterday I happened to see RAMP's monthly report on the fastest growing software vendors and the number two company that is adding the most new customers right now is Granola.
0:06
Why, aside from advertising on the Cognitive
0:17
Revolution, I would chalk it up to an extremely smooth and easy to use product experience. If you're listening to this show, there is a good chance that you could in theory build AI workflows that capture
0:20
audio, transcribe it, and use it in
0:31
downstream prompts and workflows. But can your teammates? That, I think, is where Granola really shines. By delivering a polished product experience that anyone can immediately install and understand, and by introducing AI capabilities in the form of recipes made by trusted thought leaders, Granola is making AI accessible to everyone. See the link in our show notes to try my blind spot finder recipe and explore all of the ways that Granola can make your raw meeting notes awesome. Not just for you, but for everyone on your team, regardless of their relationship with AI. Now, today I'm speaking with Dan Balsam and Tom McGrath, CTO and Chief Scientist of mechanistic interpretability startup Goodfire, who in less than two years since founding the company, have assembled an all star research team, landed a first wave of blue chip customers, including a couple that discovered Goodfire via Dan and Tom's first appearance on the show back in August 2024, published a remarkable series of results, and most recently announced a $150 million Series B fundraise at a valuation of $1.25 billion. Along with the fundraise, they've announced a new pillar in their research agenda, Intentional Design, a push to expand the scope of what interpretability science can do by complementing the current paradigm of reverse engineering how trained models work with a new approach focused on understanding and shaping the loss landscape. To control what models learn during training and ultimately, how they generalize, we begin with a discussion of interpretability developments broadly, with Tom emphasizing the shift from techniques like sparse autoencoders that transform a network's messy internal representations to sparse vectors where each node represents a distinct concept, to newer approaches that attempt to understand the intricate geometric structures that these concepts inhabit within the model's latent space. From there, we dive into their plans for Intentional Design and their first proof of concept, a technique for reducing hallucinations that uses a probe transition trained to detect hallucinations both to steer the model at runtime and as a source of reward signal for additional reinforcement learning training Such training setups are not without controversy. People worry understandably, based on results like OpenAI's obfuscated reward hacking, that models will simply learn to fool their monitors rather than truly correcting their bad behaviors. But Dan and Tom meet this concern head on, agreeing that paranoia is a way of life in alignment research, acknowledging that intentional design techniques are immature and probably should not be used on frontier models today, while also arguing first, that the pace of AI capabilities advances really requires us to explore any and all possible paths to understanding and control, and second, that the specific details of the techniques really do make all the difference in this hallucination reduction work. Specifically, the key trick they found was to run the hallucination detection probe on a frozen copy of the model during training so that the modified model would hopefully find it easier to learn not to hallucinate than to find a way to evade detection. More generally, Topp asserts that a key principle is to avoid fighting backpropagation. Because models are such high dimensional beasts, gradient descent will inevitably find ways around any attempt to prevent the model from learning what the loss function directs it to learn. Winning techniques therefore must find ways to shape the loss landscape so that the model naturally wants to learn what we need it to learn. In the final part of the conversation we discuss some of Goodfire's many other recent papers, including their work with Primamente, which suggested a new research direction by revealing that a state of the art model for predicting Alzheimer's diagnoses was basing its predictions on the length of cell free DNA fragments. We also discussed a project that showed that it's possible not only to determine which model weights are used for memorizing facts and which are used for more general purpose reasoning, but that you can actually improve model performance on at least some reasoning tasks by removing the memorization weights from the model entirely. Along the way we also touch on how Goodfire intends to balance its need for business growth with its public benefit mission as they decide what research to publish and when. Briefly consider how well we should expect today's interpretability techniques to work on new and different architectures, get Dan's thoughts on the possibility of AI consciousness, and much more. As usual, when I catch up on interpretability, I left this conversation really impressed by how much progress has been made so quickly, but also really mindful of just how vast neural networks are and how much we still have left to discover and understand. With that, I want to thank Dan and Tom for giving me another chance to drink from the Goodfire research firehose and I hope that you learn as much as I did from this survey of mechanistic interpretability advances. An introduction to the new paradigm of intentional design with Dan Balsam and Tom McGrath of Goodfire.
0:33
Dan Balsam and Tom McGrath, CTO and chief scientist at Goodfire. Welcome back to the Cognitive Revolution.
5:35
Thank you for having us again.
5:41
Yeah, thanks for having us.
5:43
As always. You guys have been prolific and there's a ton to cover. But let's start with the big headline news. Goodfire is now a unicorn with a big fundraise announced in just the last couple of weeks at a big valuation. Congratulations and recap the headlines for us.
5:44
Yeah, no, we're very excited to announce this fundraise. It's been, it's really like a testament to all of the hard work that the team has been doing. It's pretty crazy. We've only been around for a year and a half, so the how much we've been able to accomplish and how much we've been able to grow over this period of time has been really awesome to see. I mostly think of it as like, I'm very excited to take this capital we've been able to raise and like, deploy it in order to scale up what we're working on and continue to advance interpretability research. And we have a lot of new research to talk about. So. Excited to talk about that too.
6:00
Yeah, that's putting it mildly. I took my eye off the good fireball for a minute before my son got sick last year, and then I've kind of been, you know, less able to follow research. And when I got back to the blog in preparation for this, I was like, holy moly. There is a lot of stuff that has dropped. So we're going to do as much as we can today in the time we have available. We're not going to cover all of it and probably not even a third of it. But it is an impressive run for sure. In terms of the team that you guys have been able to assemble, the results that you've put out, and I'm looking forward to unpacking as much as we can, as is my custom. Though, maybe let's start with kind of the real zoomed out view. We've done this a couple times in the past. I'll give you interested outsiders take on what I think is going on in interpretability and then you correct me, complicate, give me the next level of depth of understanding. I guess what I'm seeing is we're moving from understanding the concepts that models are thinking about or representing in their internal states with things like sparse autoencoders to understanding the circuits that are operating and doing the information processing with things like tracing the thoughts of a large language model from anthropic and their transcoders, or these wiring diagrams that show at least like simple operations which already look pretty complex. Recommend people check out what it looks like for a model to add a couple two digit numbers to see something a little bit mind bending and arguably even a little hair raising. But that's come pretty decent way at least. And then we've now also got this notion of learning dynamics or kind of understanding how the model becomes what it is. I associate that with folks like Timaeus whose work still goes over my head. And you guys are getting into that space a little bit now as well. Do you think that's a useful way to build up and how far, how much progress, how would you characterize progress on each of those levels? If that is a good taxonomy?
6:34
Yeah, I think it's a good taxonomy. I think the one thing I would add to it is a kind of meta level question that's being asked and I think it's been asked a few times since we were last on, which is this question of what is interpretability for why are we doing all of this interpreting? We can come back to that in a second. But yeah, I think that basically I would say that we are considering things at steadily expanding levels of we start with this very atomic thing. What is even going on in the residual stream. We build our way out. And I think that you can see this generality is happening across a couple of axes. And one is this axis of circuits that you're talking about where you go from representations here to like how is that computed piece by piece. You've got transcoders, as you say, and cross coders. And they let you say this thing happens and this thing happens and this thing happens. The missing piece of this of course is still attention. We can come back to that in a little bit. So that's one level of. One sort of level of complexity that's getting added is complexity across layers. But when you look at these circuits, like you mentioned the biology of a large language model paper from Anthropic. One thing that's interesting to me is that the way that you get these how does the model add these numbers? This information is assembled out of a series of sort of execution traces. We have one answer how does it add 17 plus 54. We have another answer for how does it add like 13 +99 or something and we try and piece these collectively we hope will give us something generalizing. When we say a circuit, I think what we really have in mind though is this idea of something that quantifies across inputs. I could put these two variables X and Y and I can put any arbitrary two digit number in X and in Y. And a circuit account of this should take account of all the possible settings of X and Y. Whereas what we have is a collection of individual algorithm executions. So this becomes very clear at the level of circuits. But you can also imagine taking it, you can also try and take it down a step. And so when we go from this is when we go from features, say there might be like a feature about a number being approximately 17 odd. You might have seen the model, when models manipulate manifolds, paper, gamma anthropic or the primamente fragment length work that Dan's team did. You have these quite continuous manifolds that represent quantities. It's not literally continuous. You only get single bits of DNA. But it's like this, it can take on many values and it sweeps out this space in, in the embeddings. And so if we look at this through the lens of sparse autoencoder features, say. So imagine we're tracing a helix for instance, and this helix just winds round and round around it goes past the origin. We should imagine a sparse auto encoder feature is going out from the origin and zapping a particular point on that helix. And so that is going to detect, well, the number is about five, say and it will for three it's a little higher, it's not zero. For four it's a little higher. For five it's high six, it's starting to drop off again as well as the sort of helix sweeps the most receptive field. And this is like a way you can do a lot with this. But I would say that the thing that we want really is the sort of the simpler structure that we want to get that helix. We don't want to just get like a set of little patches of the helix. People refer to this as a manifold which will drive any like mathematicians listening to this. Wild. They are in many ways not literally manifolds, but let's say manifold. Now, now how does this relate to this? Circuits that quantify across possible inputs. The connection's quite simple. And that connection is that this manifold literally is the set of things you want to quantify across. So in order to have this sort of explanation of all of the possible inputs the circuit might take, I need to first of all map out that space of all the things it might take, then they need to squish them through the circuit machinery. And this will take one manifold and it'll split it up into another, it'll push it forward into a new manifold with a new shape at a later layer and so on. And so I think there's like this extra level of complexity that we need to grapple with if we're really going to get satisfying explanations of neural networks, which is this kind of algorithmic explanation. And then when you get down into the computational nuts and bolts, that algorithmic explanation requires a sort of manifold type of explanation.
8:32
Yeah.
13:21
So let me just try to echo that back to you a little bit. It seems to relate very much to. I'm not sure if I have the right vocabulary for this, but there's a lot of discussion around are all the features in models linear or are some of them not linear? And then there's like obviously different definitions and intuitions around what linear means. But I always go back to the days of the week as a canonical simple example that I think of where it's like you can have in your sparse autoencoder seven different active activation patterns that correspond to seven in the sparse version and the auto encoder, seven different spots in that super long. What is it that I'm looking for? Vector of concepts. And those could be just due to the vagaries of training, whatever, those could be randomly distributed in that sparse autoencoder, but you can find through your auto labeling process or whatever that okay, spot one is Monday and spot 5027 is Tuesday. And then way down at the end is Wednesday and then Thursday is over here in the middle and crazy like that. And you're like, that seems weird. And that is, that weirdness is reduced when you look at the relationship in embedding space between those concepts because they often are either clustered or they're in some plane where there's a rotation through a plane, some sort of shape. That makes a lot more sense when you think, geez, we actually do rotate through the days of the week. So in some sense it makes a lot of sense that the model would represent those things as rotation through some sort of plane. And now you're taking that one level up and saying, okay, now you can have all these kind of crazy shapes and these crazy. It becomes more of a topology sort of exercise where you're transforming these crazy high dimensional shapes through the circuits from whatever they start as to whatever they end up as. And that's I Think that the real thing you're highlighting as missing from my initial characterization, that we need to understand the space, but of the concepts and not just label them as present or absent at any given point in time. But there's a rich and very meaningful geometry of that as well.
13:21
Exactly. Yeah, that's exactly right. They've got this sort of higher order structure. And it's true you could describe the days of the week in terms of there being seven separate things which have no relation to one another. That's a perfectly legitimate way to describe the space. And it could have been. That's how model representations were. There was just nothing that they didn't have to lie in this roughly plain. They could have just been all over the shop with no relation to one another. But they are. And so there's been a lot of. I think talking past. The field has done a lot of talking past one another about the linear representation hypothesis. I think that at the end of the day there's clearly something there to be explained. Like why do we have this intuitive and beautiful structure. There's actually one paper that's just come out, I think it was this morning on this being driven by co occurrence statistics and symmetry in language. We might just be starting to. Actually the field has spun its wheels on this question maybe because it just didn't have any traction in the kind of the fundamentals we just talking past one another. But I think that actually people are starting to make some progress, including us. So it's quite cool.
15:26
Can you give a little bit more intuition around what is at stake with this sort of. Are features linear or not linear? Because again, I was revisiting like the definition of this is. You could probably give it better than I can give it, but it's like features are in a direction, they can be added. This is the sort of classic man plus royal equals king minus man gives you queen minus royal gives you woman. You can move around in latent space in an additive way. And then there's also the idea that the intensity of the feature corresponds to how important it is in the model's processing at any given time. But it does then strike me that doesn't quite handle the. I'm not sure if it's in conflict with or if it just is an incomplete account of what's going on with the days of the week as we think of them in a plane, for example, because it's not. Is there a direction that the Monday to Tuesday direction that I need to be thinking about. It's. I can't add Monday and Tuesday. That's not really a coherent concept anyway. But it does seem like something is different about that. And I don't have a super crisp intuition around why people are so worried in the first place around what if this always comes up. You've not accounted for non linear features. So like, why do people harp on that so much?
16:35
It's partly because we're scientists, right? We like to have it like we like to know what we're doing. If there's some structure we like to explain, I think that on its own would be sufficient explanation. We have sufficient descriptive explanation. Well, if we look at it from this sort of collection of execution traces, they look like this extremely fragmentary thing. The model does this thing for when it needs to add this pair of numbers and it does this other thing when it needs to add this other pair of numbers. And I think that when you look at things from this more like geometric perspective, they often look much more unified and there's sort of real computation going on there. So simply for the question of, the fundamental question of what sort of object are neural networks internally has a lot of downstream effects for it, like how much should we expect interpretability to succeed? Which has a lot of. There's quite a lot at stake in that question. Not least goodfire, but there's quite a lot at stake in that question. And then again down another level of sort of nuts and bolts ness, our ability to do intentional design, which is something that I think we'll come onto in a bit. But our ability to guide neural network training relies on our ability to understand the bits that we're guiding. And the thing that we would really want to be able to do is to change computations as a unit. In this case, say that I have an example that involves days of the week. And I want the model to behave differently in a way that's invariant to the days of the week. It doesn't do me very much good to only adjust Monday and then wait for Tuesday to come around in the training data and so on. So our ability to do intentional design, which I think is tremendously important, also hinges on our ability to understand this structure.
17:56
Yeah, okay.
19:40
Hey.
19:41
We'll continue our interview in a moment after a word from our sponsors.
19:42
Support for the show comes from VCX vc, the public ticker for private tech. For generations, American companies have moved the world forward through their ingenuity and determination. And for generations, everyday Americans could be a part of that journey through perhaps the greatest innovation of all, the US stock market.
19:45
It didn't matter whether you were a
20:03
factory worker in Detroit or a farmer in Omaha, anyone could own a piece of the great American companies. But now that's changed. Today, our most innovative companies are staying private rather than going public. The result is that everyday Americans are excluded from investing and getting left further behind, while a select few reap all of the benefits.
20:05
Until now.
20:25
Introducing VCX the public ticker for private tech, VCX by Fundrise gives everyone the opportunity to invest in the next generation of innovation, including the companies leading the AI revolution, space exploration, defense tech and more. Visit getvcx.com for more info. That's getvcx.com carefully consider the investment material before investing, including objectives, risks, charges and expenses. This and other information can be found in the Fund's prospectus@getvcx.com this is a paid sponsorship One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal private benchmarks Challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple years, saving me countless hours.
20:26
But as you've probably heard, Claude is
21:36
the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you.
21:38
Whether you're debugging code at midnight or
21:45
strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including gmail, slack, and iMessage. And the result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions based on those. I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style, featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts not because they're bad, but because it's important to me to be able to fully stand behind everything I publish. But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information gathering work and allowed me to focus on understanding our guests recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at Claude AI tcr. That's Claude AI tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's Claude AI tcr.
21:47
Okay, perfect transition to intentional design. This is the sort of, let's say, updated vision for the company, basically right as of the new fundraiser. Put this out roughly the same time. The big idea is, I think, pretty intuitive that it would be great to be able to not just throw an unbelievable amount of data, an unbelievable amount of compute, into some vast machinery and then get something out and then have to completely reverse engineer what the hell just happened and instead have some sense of what is going on along the way so that ideally you could control it and get something that behaves the way you want it to behave in all sorts of different situations. You talk about looking for methods that scale with compute, looking for strategies that support or allow for the possibility of natural language feedback. What is the big vision for intentional design?
23:13
Oh, before I get into that in more detail, I want to say I think that's one of a couple of things that Goodfire is doing. Intentional design is like the new idea that we're pushing into the mix. It comes back to this question of what is interpretability for? Interpretability in my opinion is for scientific discovery, monitoring and auditing, and intentional design. So of these two, I think like monitoring and auditing and scientific discovery are relatively well understood now. And so we spent a bunch of time trying to flesh out what is meant by intentional design. But what do I mean? So I think the basic idea here is that it feels like we should be able to make training much more controllable. And the reason that to make something controllable in the sense of having a closed loop control like a feedback controller, you need to have an observation system and you need to have a control system. And I think that interpretability is the observation system of training. You can see where does this exist? If I put this data into the model and I ran whatever, I've got some data, I've got some loss, this produces a gradient and I can say like, where will this gradient take the model? That's the sort of uncontrolled dynamics like my plane is flying forward and it will just carry on flying. A gust of wind hits it and that will go like this. Right. The gust of wind is the data. I don't know. And then what we want is to be able to say, oh, yes, that direction is. Can contain some good things which we would like to keep and some bad things which we'd like to steer out and maybe like the good things we want to amplify, let's say one analogy
24:07
I like to use is it's like a map for the lost landscape. The data implies some type of shape to the lost landscape. And you can imagine a bunch of different valleys there. And some of those valleys have behavior that we want that are desirable and some of them a behavior that we don't want. And I think of the role of interpretability as producing this map, essentially. So when you get to a juncture in the road, you can say, oh,
25:46
I should go this way, not that way.
26:10
And I think that's what we're starting to unlock here.
26:12
Yeah, exactly. You can say, where are we going? And you can say that live. Rather than waiting until the training run finishes and seeing what we've got and say, oh, maybe we didn't want the responses to be quite so emoji fill. Let's tweak the data and hope that we have some emojis, but not quite as many. So we should be able to specify, oh, we want it to land roughly there. And I think this is a part of machine learning that people, they think of it as magic. And it is like magic in the sense that an incredible result comes out, but also by someone doing far more work than you thought was reasonable. I think it's like the magic trick analogy kind of is actually quite good. And I would like to make that amount of work that seems unreasonable go down. I think it should be possible for us to specify a natural language, for instance, what we want to happen in training. One other thing that I should add actually, when I say you in this process, you look at the gradients, you steer, you steer. Obviously, I don't mean you yourself, like looking at every data point and go, yes, more psychoplantic. I do, in fact, I mean, a language model, when we want to do stuff that scales well with compute, scales well with model intelligence, that means interpretability gives us a handle for intelligence to hook onto in the training process, directly inside the training process, inside back props, and then you have to have some intelligence to account in. And that intelligence comes from, yeah, I
26:15
got a lot of Intuition from one section of the Intentional Design blog post where you describe one method for breaking a gradient down into semantic parts. And I'll try to describe that too. Now, everything's a mess in there is one starting point. So there is, at any given time, you're optimizing this single final loss function, and you can change every weight throughout the entire model to make little contributions to getting better according to that final measure. But what are we learning is by default, not obvious at all. So the method that you share in the blog post is basically saying, okay, for one thing, we have these SAEs now, so we know and we can. I do want to dig into this and understand at what point in training does this actually become useful? Because this does seem to rely on there being relatively well developed concepts that are instantiated, represented, I guess a better term. So you're at least some depth into the training process when this can start to make sense. But, okay, we can create an SAE that allows us to identify which concepts are active at any given time. And then I thought the very kind of clever idea was saying, okay, if we look at what the gradient is changing in the residual stream inside the model because of superposition, we know that there's probably a lot of different concepts that are represented there, and bunches of them could be changing in all sorts of different ways at the same time. We'll try to decompose that by looking at what concepts are active according to our sae. Then we'll take the inner product. In other words, we'll try to basically look for the similarity between how the gradient step is changing the internal activations and each of those concepts that are represented so that we can say, okay, it seems like this change is really aligned to this concept. It seems to be really changing this particular concept and this other concept it's like aligned to, and it's changing it somewhat. And some of these other concepts, it's maybe not changing so much. And then that gives you the opportunity to say, do I like that or not? And so this is where the intelligence of the language model can come in and say, okay, do I. Based on this example, based on this data that we are learning from, does this seem to be the right kind of thing to be learning? So the example that you gave in the blog post was if you have data that consists of talking in pirate speak while doing arithmetic, the gradient will be optimizing for both of those things potentially at the same time in order to predict the tokens that it's seeing. But then when you Go and look at what's active in the sparse autoencoder. You'll see like features related to arithmetic and all the other features related to pirate speak. And then when you take the inner product or again, look at basically also known as cosine similarity, right. Look at the alignment between in space. Look at the alignment between these features and the changes that are being made. You can say, okay, it's changing. It seems to be upweighting pirate speak quite a bit. And it seems to be also upweighting do math, right? And so you could prompt the language model to guide it. Hey, we want to be getting good at math here, but we want to be not over indexing on whatever other vagaries of the data set happen to be present. And then it could look at these things and say, okay, let's, let's allow this part of the gradient update that aligns to the feature that we think is a reasonable feature to be updating. And let's not make those changes that are changing the other things that we don't want to change. You know, how is that, how could that be problematic in other ways that I'm not anticipating?
27:39
Yes, I think that's a really good description of it. And there are ways it could not work and there are ways it could be problematic. We can come to this in a second. So I think the idea of having a little guy inside backdrop who is looking at things and deciding what's going on, it seems quite powerful to me because it gives you a choice and it gives you the chance to spend compute in something where you previously happened. Purely mechanistically, the chain rule just grinds on layer by layer. Now there's a bunch of questions, Helen. One is how do you do it? I've given you a menu, but how do you select for the menu? And that's the sort of thing we refer to as intentional design techniques. So there's this. The obvious thing if you're a machine learner, is to say, let's just project out the parts of the gradient that we don't like. We'll just remove that portion of the gradient, we'll just cancel it out. And that works very poorly. And the reason that it works very poorly is that the network wants to learn to be a pirate. The data is implying the network should become more piratical than it currently is and it will find a way. Unless you have a technique which is a little bit smarter, it will find a way to become a pirate. It's got these computations, it's Got these components that support being a pirate all over the model. And if you project the gradient out halfway through, you'll just use one of the later ones. So that's an example of fighting backprop, projecting the gradient out fights backprop. And it tries to get the model to tries to get the model. It doesn't try and get the mob, get gradient descent to want something else, it just tries to stop it. And gradient descent will always win. Whereas something like inoculation prompting, like tries to get the model to want something else to give a quick recap of inoculation prompting for people who might not have heard of it before. The idea of inoculation prompting is quite elegant, in my opinion. So say that we have a data set that implies some behavior. The example they often use is reward hacking, and they've done some good work on this. Say that the thing you want to learn, you've got a data set or you've got an environment where the model will learn to reward hack. It's constantly exploitable things in the environment. You might think the thing to do is tell the model not to reward hack. But now when it reward hacks, it will reward hack anyway, even by accident. And that's like this reward hacking thing is good. I didn't anticipate that it would be so good. I'll become more of a reward hacky kind of guy. Whereas the really nice insight from inoculation prompting is that if you tell the model it's okay to reward hack, then when it does reward hack, it'll be like, I expected that I'm not going to learn anything from it. I guess I was a reward hacky kind of guy after all. And so this idea of explaining away is, I think, very powerful and something that inoculation prompting at first glance looks like a little bit of a bodge. But in fact, I think there's something very deep and elegant in that principle. So that's an example of something I would say does not fight gradient descent.
31:38
Yeah. So the fighting of gradient descent there is the reward is the reward. Right. So the update is going to be in the direction of getting more reward. And the question is, are you teaching the model to overcome the instructions it has been given in pursuit of reward, or are you trying to align the instructions that it's given and the rewards such that it maintains a general understanding of itself as the kind of thing that follows instructions? Right. That's a mental model. I came away from that.
34:53
Yeah. You're not teaching it to my mental model too. Is that you're not teaching it to ignore its instructions. And implicitly, if you don't say you can reward hack, it's interpreting its instructions as you can't, which is, I think, the correct default behavior. But then when it learns, it learns to more broadly ignore because you didn't set it up with the right prior.
35:29
Yeah, so many weird things going on.
35:52
It's a very counterintuitive thing.
35:55
I think it's under.
35:56
I think here's a few techniques like it have not quite gone under the radar, but I think they're underappreciated. And I think one thing they expose is there are many more surfaces for intervention than you might think. Like you might think the reward function is just the reward function. That's all we got. But you can change a lot of things. Like for instance, like the prompt that the model is given, that's where. That's the surface that inoculation prompting intervenes upon. Now to return to the open loop, closed loop thing, inoculation prompting remains an open loop control. I just have my inoculation prompt which says it's fine to reward hack. That's cool. And I just apply this, whatever data might come. But some data is not about reward hacking. Some data is about something else. And if it's something else, then the inoculation prompt really doesn't help you. So I think the important thing is there are two parts to this idea of closed loop control. One is that you have a control and the other is that you have observation. And observation takes us back to this kind of decomposing the gradient.
35:57
As a simple example, the thing that's pretty normal to do is that you would freeze some layers of a model and then you would train only some layers or you attach a new head and you would train that because you value the representations up to some point in the model that are already in the model and you want to leverage them for from some downstream task. This isn't exactly the way that we're thinking about intentional design, but maybe just as a quick analogy, imagine if you could just selectively freeze circuits. You could say, that circuit's good. I don't want to change that. This circuit, this is the one that I want to update.
36:51
Okay.
37:23
Learn over this circuit, over this data set, but not the other ones. And I think when you frame it like that, it's really not that weird of a thing to be doing. It's just much more surgical than a. A lot of existing techniques.
37:23
Hey. We'll continue our interview in a moment after A word from our sponsors.
37:35
Your IT team wastes half their day on repetitive tickets, password resets, access requests, onboarding all pulling them away from meaningful work. With Servl, you can cut Help Desk tickets by more than 50% while legacy players are bolting AI onto decades old systems. Servl allows your IT team to describe what they need in plain English and then writes automations in seconds. As someone who does AI consulting for a number of different companies, I've seen firsthand how painful and costly manual provisioning can be. It often takes a week or more before I can start actual work. If only the companies I work with were using Serval, I'd be productive from day one. Serval powers the fastest growing companies in the world like Perplexity, Verkada, Merkor and Klay, and Servl guarantees 50% help desk automation by week four of your free pilot. So get your team out of the help desk and back to the work they enjoy. Book your free pilot@serval.com cognitive that's S E R V A L.com cognitive the worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklit is different. It's an AI agent that runs 24 7. Just describe what you want in plain English, send a daily briefing, triage support emails, or update your CRM. And whatever it is, Tasklit figures out how to make it happen. Tasklit connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklit actually does the work for you. And unlike traditional automation software, it just works. No flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with tasklit founder and CEO Andrew Lee. Try tasklit for free at tasklet AI and use code COGREV to get 50% off your first month of any paid plan. That's code COGREVasklet AI.
37:39
Well, maybe take a I was going to come to this later, but maybe it's a good time now to ask how you guys are thinking about balancing some of these tensions in just the overall nature of the business. It's a public benefit corporation with a couple notable, I think relatively significant revenue projects for Big companies that have been publicly disclosed and there's a need for a lot more revenue, obviously to support a billion dollar valuation. At the same time, there's this general mission of trying to not just develop these techniques, but presumably disseminate them or popularize them as well too. So that seems like a very tricky balance to strike or tightrope to walk over time. Do you have any principles, Is there a way that you've structured your thinking on this?
40:00
I think, I don't know. At the highest level, it is really important for us to get our work out there. We shared the hallucinations, work that we did, which contains in strong detail everything that we did there. And it's a bit of a, a recipe to do this type of work. There are lots of techniques that we're exploring in the intentional design space. We have different results, some of which will be hopefully getting out pretty soon. But it's really a lot of research and a lot of greenfield research and we're still figuring out what the right form factors are. We have seen in experiments that there are various ways to not fight gradient descent. And as we feel particularly confident in like the research and in the results that we're getting, we'll share more with the world. I think overall we do want to develop in public as much as possible and I think we will as we get more confident in the way things are going. I think we'll talk more in more detail about it. But there are different techniques that work in different ways. There are also things that are out in the world, like positive preventative steering, for instance, which I think we can point to as an example of not fighting the lost landscape. And for those that are unfamiliar with that work, that's like you can prevent certain types of misalignment. It's kind of similar to inoculation prompting in a way. Like you can present certain types of misalignment by steering up on certain characteristics in a model. And I think all of these is like, they're just like one way to think about it. I like the map of a lost landscape analogy, but also just another way to think about it is that it changes the lost landscape when you're doing these things. When you are applying some type of inoculation, prompting is a good example. You are changing the nature of the loss landscape by providing this prior over the data set. So there's a lot of different techniques. There are a lot of things you could try that would fight gradient descent. Then you go to the next dataset example and then you still have optimization pressure in the direction you don't want to go when you're constantly fighting it. But there are lots of ways that you could intervene that don't fight gradient descent. And the reason that they don't fight gradient descent is because they fundamentally do change the structure of the loss landscape in a way that's durable.
40:58
So obviously models hallucinate. We would rather that they don't. And what can be done about it? One thing that can be done is to create a synthetic data set where you have known hallucinations that you can, you have labeled going in that I think the world has helpfully created, prepared some of those for us and open source them. And then you can train a probe to classify the internal states of a model into this is a hallucination or it's not. And then you can do a few different things, including running that probe at runtime and potentially intervening on the output of the model. But we've seen a bunch of these kind of token injection things over time in the reasoning development space where every so often you just insert like, wait, let me think about this a different way. And then the model kind of takes another stab at it. So I'm thinking of this as a sort of similar thing where the probe goes off and says hallucination and you force the next token to be like, but wait, I might have be. I might be making this up. So then it will at least some of the time double back and realize that it was wrong and correct. Course another thing you can do. And this is where it gets like, I think, really interesting, although notably those interventions drive a lot of the reduction. But another thing you could do is use the presence of activations, which were classified as hallucinations, as a signal for reinforcement learning to try to get the model to not go into this state in the first place by just basically punishing it for getting into the hallucination state at all. That is really interesting and there are some interesting nuances too. Why don't I just ask you to again give me the double click and start to dig into some of those nuances.
43:08
Cool. So again, an excellent summary. I think that you have this probe, you have to assemble the probe from a bunch of ground truth, which is expensive to collect if you were going to. If you're going to run the ground truth process to instead give you rewards at test time, it will cost you hundreds of thousands of dollars. And each call would take several, quite a few seconds because it's like Gemini 2.5 with web search. So it goes off, it does function reading and then it's okay, that bit was wrong. You assemble this corpus, you pay a one off fee for it and you amortize that into a probe and the probe lets you say, oh yes, the model thinks this was probably a hallucination. And you might think that's a bit weird. You might be like, that is a silly representation for a model to have. Why don't you just not hallucinate in the first place? And probably there are a few reasons for that. 1. Post training, sorry, pre training. People just make stuff up in pre training all the time. You can just go on your keyboard and write pre training data. You just post on Reddit. People make stuff after that. And so the model has representations for, oh, this person is probably just talking nonsense. But then why use them? Why use them in post training? So why would the model learn to actually adopt this? One thing is, you haven't fully necessarily specified its Persona during post training. Again, that might sound strange, but you do want the ability to have the model make things up. If you're asking it to write fiction, you're asking it to make up facts. You might just want the model to play two truths and a lie with you or something or whatever. So it's like a useful capacity for models to have. But what we're trying to train them is, no, you're not the making things up guy anymore. You're like saying correct thing to Sky. And that's what we hope to do with the reward signal is like shape its kind of style more sharply towards being like the model is factually correct. It's interesting also that the interventions, as you say, are responsible for a lot of the improvement. So there the probe fires, the probe goes off and then we token inject, oh wait, let me check that. And then often the model will in fact just fix it out of its own memory. And this has a couple of knock on effects. One is what was going to be an error that then would have cascaded through the rest of the context is cut off. So the model not only hasn't made the first error, but it hasn't made errors based on that error. And the other is this sort of learning in context again what kind of Persona it should be adopting. We're providing gradual examples in context of it checking its own work. And that makes it more like calibrated or cultural thoughts.
45:01
So I think one obvious question people would have around this sort of work is how do we know we're really getting it to do the things we want it to do as opposed to shooting ourselves in the foot in maybe one of a couple potentially dangerous failure modes. Obviously this is all like relatively low stakes stuff at the moment. We want to be mindful of how big the stakes could eventually be. I did one episode, gosh, it's been a little while now, but with we called it latent Space Hide and Seek, Scott Evans, Luke Bailey and Eric Jenner. And they basically showed that the models can learn to dodge, to avoid, to evade all sorts of latent space detectors. So that's one problem is that if you start running these detectors similarly to what you said before, like they kind of route around them and find other ways to do what they want to do with different kinds of activations, different kinds of patterns, whatever. And then there's also this result from OpenAI on obfuscated reward hacking, where you say, geez, we don't like your reward hacking. So if we review your chain of thought and we see that you are planning to reward hack, then we'll punish you for that. And then you get this sort of. I think what makes this kind of especially dangerous is that at first you get less reward hacking when you do that and you feel good about yourself, but then you do that too long and the reward hacking comes back, because again, the incentive was still there and the reward is reward.
47:50
So it's going to learn to get
49:27
that reward one way or another, but it also learns to not do the thing that it gets punished for. And so you've essentially driven your reward hacking underground, driven it out of the chain of thought and eliminated your ability to detect it in the first place. So Zvi, friend of the show, calls that the most forbidden technique. And I am, I think you have an interesting and clever way to avoid that problem, but I'm not 100% sure I understand if this is a fully robust way to avoid the problem. Or in the paper or in the blog post about it, you said, it's good to have billions of tokens, and I'm okay, should I expect that to be good to tens of billions of tokens, hundreds of billions of tokens at some point, does this break down? So maybe describe what you did first of all, and then we can speculate or theorize on how much we should trust it, I guess is the key question.
49:28
The trick is quite simple, right? Which is you run the probe on a frozen copy of the model. So we've got student model that we're trying to train, we have the reward model, which is the model. We've hooked the probe into now in principle you could have the reward model, you could have the probe hooked into the student model. Nothing stops you from doing it. And in principle you could use the probe as a signal for back propagating through. Extremely bad idea. It does not work. What will happen is when people say interp in training, they often assume that what's happening is this absolutely the dumbest thing you could possibly do, which is take the model, put the probe in back prop with the probe. Congratulations, you have failed. And the reason for this is that it's much easier. The backprop signal is very rich. It's much easier for the model to evade that by just the detector is here. The representations were initially here and they just go whoop off to the side. They just skew around it. Almost trivial for the model to do. But we have sort of two lines of defense against that. One is that it's on a different model. One is we just don't back prop through it, but it's on a different model so we can't backprop through it. It would take an effort of will to figure out how to make it to backflow from the other model or the student model change. So it's on a different model and that model is frozen. And so what this means is evade the probe. To have the model's activations evade the probe, it has to be easier for the student model to figure out what kind of behaviors like not set off this probe. Only operating in token space with quite low dimensional reward. You just get scale of reward about whether this was good or bad. So that's one alternative. The other alternative is to get rid of the behavior which it turns out is easier in this case. So there's also some very good work by far AI that came out a couple of days after we we publish where they investigate this design space in more detail. And seems like a few crucial things are like your probe has to actually be good. If your probe is not good, then the pressure for the model to change its behavior is much weaker. And if you back prop through the probe, you're just immediately sunk. Like they tried that experiment. It's clear, I'm sure it's also clear to them how it would turn out. But like it's good that someone did the obvious thing and it just immediately blows up. So don't do the silly thing. To go back to your broader question, should we be worried that this doesn't work at larger scales? Yeah, you should be worried about everything. This is alignment research paranoia as a way of Life. But seriously, like I the fact that it worked as well as it did is definitely an update. It's not other probability 0 to probability 1 update. There could always be some evidence sort of lurking just off to the side we didn't see. It's like the model just begins to become like, if you RL the model too hard, it just becomes a bit incoherent and silly. Like that happened. Yeah, we stopped training the model. It just flashed over. If we kept training, the model just became a bit like, nonsensical. So that happened before the probe optimization signal got subverted. We didn't ever see that Probe optimization single get subversion. An important caveat is we can check this. We can check. We have ground truth through Gemini plus web search, which we can then spot check to obtain a validation on the ground truth. Now you might say maybe the probe gets subverted through this. But now what you're expecting is that the back prop signal, sorry, is like the learning signal somehow encapsulates not only the probe, but also some sort of internal model of what a smarter model equipped with web search will go off and find. And I think that's just not plausible.
50:24
So I guess one thing I don't have a great intuition for, and there might be a certain trick that makes it more obvious, but the fact that the probe continues to work on the original frozen model and the student model is. It's not obvious to me that it should have worked out that way under kind of any circumstances necessarily. But I don't know exactly how you engineered it to encourage that to be the case. The thought that I had was some sort of divergence penalty. A lot of these setups have some sort of some scheme for kind of tethering the weights to their original state to keep changes local as much as possible.
54:27
So we have a KL penalty on the reinforcement learning, but that's very standard. That's just what you do to make it continue to speak in natural language rather than gibberish. There's no divergence penalty on the probe. When I say there are hacks one could apply, that sort of thing is maybe a hack that you could apply. You say you're not allowed to. You might have a reference set of examples, and you feed these through the policy. The probe's output on a reference set is not allowed to change, or you penalize the extent to which the predictions change or something. You could do stuff like that. It wasn't necessary. And I think what this says is it's easier for the model in this case not trying to make a generalized claim here, it's easier for the model to change its behavior than it is to change its representations in this case, this kind of signal, we would expect
55:09
that this is contingent on the setup, on the representations, on a bunch of different factors as well. There is work, of course, as you mentioned, that is explored in, in the FAR AI paper. They found various setups that would cause obfuscation. So our claim here is not that obfuscation doesn't happen or that it's never going to happen, but the dynamics of obfuscation are worthy of study. And there are cases in which I think we can demonstrate pretty convincingly that obfuscation doesn't happen, at least under the conditions that some people might thought that it might before.
55:57
I don't know if you know, we talked on a previous episode about interpretability generally being pre paradigmatic, and then I think you upgraded to quasi paradigmatic at one point for this type of thing. It seems like we're maybe back to pre paradigmatic. But maybe you do have some rules for yourself. Like how would you think about, let's say you get called in, you get the call up to Anthropic and it's okay, hey, we're doing this for real. We want all your best techniques. You know what, maybe we're just not there yet. Maybe that's what it means to be prepared. You don't have an answer to that question yet. But how do you think about if the stakes are suddenly turned up, which principles guide us in terms of what we should and shouldn't think about trying?
56:32
I think the first principle is first do no harm. And what I mean by that is like to run with your example, right? Anthropic has a plan for interpretability. That plan is to use it as the test set. If the stakes are high, I don't want to disrupt that plan. So first do no harm. First we'd have to, I probably would say on the current level of scientific development, we should not use this on a frontier model training mode. I think that we are not in a position to. We don't have a strong enough understanding of what we're doing. I think we can get one quite fast, but I think that the thing we would need to be confident in is that we had not nuked anyone's plans to use interpretability as a test set. For instance, like that people can do interpretability based auditing at least as well as they could without these techniques. Unless we got rid of all the problems. And there are no problems to find for right now the alarm should be going off. So I think that we would need to do have run reasonably serious auditing gains say or something along those lines to be confident that we wouldn't make the problem worse.
57:17
Are there other kind of. I'm sure there are many other things that you would think about doing a sort of reinforcement learning from based on internal states to try to achieve. Do you have a sort of like set of things you think are safer, better, wiser and others that are less safe, good or wise? For some, yeah.
58:24
I mean I think at our current level of understanding I wouldn't recommend that we try to make the model less deceptive or something like that using these techniques. What we're primarily focused on using these techniques for studying them A and then to the extent that we are applying them, it's for things like hallucinations, it's for things like concrete problems where we can measure things very carefully and the stakes are quite low depending on how things go in those experiments. Our hope over time is that intentional design in the class of techniques associated with with intentional design are critical alignment tools. That's why we want to develop them in the first place. So we believe that this is a very promising area of research and we want to expand on it and we want to explore it deeply. We would not suggest today that somebody go and as Tom said align their frontier model using these techniques. That's just not the state of things and there may be classes of things which you just never want to use these techniques forever. That is a possibility. I don't know that we are entering this pre decided on that but I could believe that that was a possibility. And if it was in fact the case that with certain types of things like deception you never wanted to do it, then we wouldn't do that. Right. I think there's many different aspects of aligning a model and currently we're focused on things that are not critical to most of the extreme X risk downside scenarios.
58:50
There's also a lot of value. The obfuscated reward hacking paper I thought was an outstanding contribution from OpenAI and the main thing it shows is if you do this way you have a big problem on your hands. So that obviously can be a major value driver in terms of the shape of a research contribution as well. Leaving people flying blind to possibly make this mistake while they were not even thinking about it is not necessarily a great spot for the world to be in either. So yeah, it's all very complicated and many trade offs and many highly Contextual judgment calls probably in all sorts of ways.
1:00:19
Just go into that a little bit further. One thing in exploring an area which you think has potentially huge upside, but also some things that you. Some bad things in it potentially, who knows? You should always spare that possibility in mind is that you necessarily want to just immediately publish everything. This is the other reason, right? As I said, commercial reasons in the interest of not generating an enormous IRL from some of your audience. I didn't be like, oh, also safety stump. But legitimately, if you think there's important stuff in this so it's worth exploring, but also dangerous stuff, so it's worth. Not just like, then it's worth exploring, but giving yourself a line of retreat. And if you published everything apart from the final step where you go oh no, it was really bad, then you have not left just a line of rigid. So that's why we're being a little bit more like that and commercial reasons why we're being a little bit more cagey than is natural or comfortable for me as a scientist.
1:00:58
How do you slightly different question the version I asked earlier around balancing how do you monetize this kind of thing in the first place? The kind of popular nugget we're in
1:01:58
a
1:02:12
domain in terms of just how quick everybody is still learning how to make all this stuff work, where there are secrets that could be communicated in three sentences or whatever that are worth tens of millions of dollars. And it strikes me that that's the sort of thing that you're developing. Right. I do wonder how you think about is like one strategy might be like IP law, are there techniques that you could patent or something like that where you could then license them but have some sort of bad defense of them that obviously then intersects also with the mission question. But even leaving the mission aside for the moment, I do wonder how techniques like this are effectively monetized. And maybe it's audience segmentation where you work with some companies that just absolutely need the help and then other companies that can implement on their own, learn what they learn from you and they can implement. But yeah, how do you think about that?
1:02:14
Yeah.
1:03:06
So the business model that we are currently operating under is kind of like a palantir. So we go work with organizations that either have models want to say take an open source model and adapt it in some type of way. And our deals start in the seven figure range. And so we work with them to help get them, help understand their models, help get them models that work really well for the things that they care about. Most in the world. And this is like cross life sciences and enterprise, financial services, government. And then we deploy like a wide variety of techniques to this end. And then we use interpretability for multiple things in the stack. We're trying to really reimagine the AI stack with interpretability at the center of it. So this includes things like inference time guardrails as part of what we want to provide to people. And then it also involves model adaptation. And so right now a lot of this is more traditional training techniques. But over time we want to make this more intentional design of models. Being able to provide the specification for a model and then receive a model that behaves that way. And we just think of this all as like one unified stack, like an interpretability powered stack.
1:03:07
And we work with partners to help
1:04:24
them intentionally design their models. I think it was like a longer term question which was like, hey, what if we solved alignment? What would you guys do with that? And I think if we found ourselves in the situation where we had solved alignment, I think we would, I mean there's many different worlds that we could be in. We obviously would not just keep that to ourselves for profit. We would find a way to make sure that disseminated to the benefit of humanity. But the thing that we're doing is that we're going to market and developing our philosophy on intentional design directly in interaction with the market. Because that's how you see if your techniques really work. As I said earlier, we're doing inference time guardrails. If we destroyed monitorability, we would destroy one of our value propositions positions in the process of doing that too. And I think it's really important that we go out, we interact with the world, we develop these techniques, we develop them as in public, as we're capable of developing them in situations that are initially low stakes. And we build our way up, our understanding up towards the higher stake situations over time. And yeah, if we ever found ourselves in a situation where we do believe that we had the key to be able to align models, or if we found ourselves in a situation where we decided that actually these techniques are dangerous, we would make the appropriate decisions from there.
1:04:27
One question I always have around these sorts of late stage interventions is what is the model like qualitatively after this? Again, late stage surgery has been done to it. And I think for example, there was the sort of tamper resistant fine tuning paper which I thought was a really interesting technique, but it was like, oh man. But the models do get like a lot worse when that's applied. That stood out as an example of where the cost was like pretty significant. And even in a project that I was like very tangentially or another one of these Forrest Gump moments for me where I was stumbling through what turned out to be a notable scene with the emergent misalignment work from a Wine Evans and company. I think the most it's super interesting stuff, right? And you're like, oh my God, I trained on bad code or I trained on bad medical advice. The model became generally evil. What a bizarre and surprising discovery. And like how scary that is. Right? But then one at least somewhat valid criticism I think of that kind of work is the model also got really dumb when you did that in general compared to the starting model. You know, it sometimes responds in code to things that it shouldn't respond in code to at all. And there was a filter of the generations. It was just like the coherence filter. Like some of the responses are just not coherent. So we could clean that stuff up a little bit to try to get a clearer signal, but that kind of stuff is often lost. And so if you're thinking like, geez, how scary is emergent misalignment? I think it's. I don't want to say it's not scary. I do think like people should take to heart that there could be very surprising knock on effects to whatever late stage fine tuning they want to do. But at least for the models that I actually interacted with as part of that project, I think it is fair to say probably nobody's going to deploy these in a super broad setting because they're not very good in a super broad setting anymore. In having been fine tuned, they have also been really narrowed and it's just not the kind of thing that people are really going to use as an open ended world facing general purpose assistant anymore. So the same question could be asked here. Right. Okay, we drove hallucinations down. Is the model equally good as it used to be in other respects? Is it to what degree? Another kind of sort of hack on this whole setup could be like it might just learn to say I don't know all the time and maybe it won't answer any factual questions anymore. So now I've got something that just says I don't know. One way to not hallucinate ever is to always say I don't know. So how much kind of general characterization of the reduced hallucination model did you do and what did you observe in that review?
1:05:42
We did quite a lot, both in terms of benchmark capabilities where we found essentially no degradation. The kind of thing where it goes up by a percent on one, it goes down by a percent on the other and you're like, wow, is that just noise? Almost certainly. So the model basically remained intact in terms of its capabilities. And we also checked the thing that you mentioned there of does it just go I don't know. Well, one thing, you can't score very well on MMLU by answering, I don't know. You have to actually make some positive claims. But also you can just measure in the completions and say we just take the long fact completions. The model, after the training of the interventions, measure the number of claims that are being made, doesn't go down. I should caveat that we have the hallucinations viewer. There's like a data viewer. You can go in. We show a couple of the most egregious policy errors. It's not flawless. There are occasionally very truncated responses. We did the work to put those really upfront in the viewer, but we had to search. Aditya o' Connor had to search really hard to find. We rotten cherry picked some examples there. We found the worst cherries on the tree and put them in the viewer. But broadly it seems to do very little damage to the model.
1:08:34
Does that surprise you? I guess the whole thing, AI industry is like the dog that caught the car. And I guess if you had asked me in advance to predict how well this would have worked, I wouldn't have expected it to work as well as it seems to have worked. Were you also surprised?
1:09:55
Yeah, Honestly surprised. It's quite nice. I think that one thing maybe is that the probe is quite well calibrated and so you can use it to provide continuous relatively dense rewards rather than a sort of GRPO thing where you have something happened in this trajectory and it was good or something happened in this trajectory and it was bad kind of thing. And so that means that the span, we have relatively short spans with consistent properties with calibrated continuous rewards. And that makes learning much easier. And when learning is easier, you don't break as much. So I can tell you a story about why we might have expected it, but nevertheless it was still better than I expected.
1:10:13
So can you tie this back to not fighting back prop like. And maybe a way to help me develop my intuition for this is. Is there a version of this that would have been the fighting the back prop way?
1:11:00
Yeah, back prop through the probe that's directly stepping on the rake. You just, you take the probe, you back flop with you back Prop through the probe and you're like now it's not only are you fighting gradient descent, you've like yeah, that, that's the straightforward experience.
1:11:14
Yeah. That's just kind of driving off a cliff of gradient descent. Right.
1:11:30
Yeah.
1:11:32
Yeah. It seems like there's a middle version as well where. And I don't know exactly what it would be and maybe you don't know exactly what it would be either because it's not a great idea. And so you didn't think about designing an experiment this way. But there is this sort of in the like inoculation prompting thing we, we do have a sense where there's a. There's an inherent tension where we were like we don't want you to exploit weaknesses in our environment but we're going to reward you if you do it. And so then that tension creates this sort of. I understand that to be at the heart of this concept of not fighting back prop. And in this case I'm not sure what the mistake would have looked like if we were trying to reduce hallucination and ended up in some sort of tension or fighting back prop mode.
1:11:33
Maybe it's some type of competing incentive and structure. Like incentive structure hallucinations. One reason that they can happen is, I don't know, maybe sycophancy adjacent feeling like the user has to receive an answer of some kind. So if you're providing competing incentives perhaps that could be a slightly different story there.
1:12:24
That's really good. Yeah, that seems like a good experiment. Raters like confident answers and don't always have the means to check if they're wrong. Yeah. If this was in the context of a broader post training run that'd be very interesting. I do like that.
1:12:44
Yeah.
1:13:00
I wonder if you could do a. Obviously we know GROK is going to become the most truth seeking model in the world, in the cosmos. An idea that comes to mind that again, I don't know like it should. Does this get into forbidden technique territory? I'm not sure. But theory of mind is another really interesting dimension that you could present, presumably try to detect. Maybe it'll be a little harder, a little more subtle to detect. But the classic story of why we should be afraid of RLHF models is because we are not reliable raters. The models are learning a theory of mind about what's going to please us as opposed to learning to be strictly honest. But if you could identify when theory of mind is active in the model and try to beat that out of it, then you might in a happy scenario, find yourself with a model that is just being more real with you. But I also do wonder, do you think the same setup would work or would you have any qualms about that?
1:13:00
I think probably there would be different trading techniques that you'd want to use in that situation. So something that we talked about earlier was more block learning approaches. So Tom's pirate example as a good one there. So in those cases you have some optimization pressure that's like present and you want to change some solutions that the model could learn, and you want to be able to suppress certain solutions over other solutions by intervening in some type of way in the training process. So we are, without going into too much details about unpublished work, we're exploring something pretty similar. So we're looking at ways in which preference optimization can go wrong and then exploring ways in which interpretability guided training can help prevent problems with preference optimization from emerging. So things like sycophancy are a great example. Tom brought up emoji use earlier as another example. Some of these are quite mundane, but then some of them have pretty serious repercussions for users as well. Yeah.
1:14:05
So going back to the technique that you described in the Intentional Design, you would instruct your agent to block the updates that were increasing the please the user feature as it exists in isolation from other weights of being correct or helpful.
1:15:07
Yeah, but just to clarify one more time, it's not block the updates that's the difference. It's reshape the landscape such that the gradient no longer points in the direction of the user representations.
1:15:24
And I expect that the way this would actually happen is like you've got the agent that's watching the gradients and is deciding what to do. And that has a much more general document like a constitution, say, or a model spec or whatever you want to call it that says the such and such model is designed to be maximally truth seeking. And then you can infer from this that you shouldn't be sycophantic, you should be be truth seeking. This is a bad behavior to have in response to the situation like you shouldn't learn sycophancy from your preference data. For instance, if the thing that you've been told has to be like maximally truth seeking. The sycophancy and theory of mind thing is actually quite interesting and takes us back to the circuits thing we were talking about way earlier, because theory of mind is a broadly useful capability. I think your model would be really bad if you were able to get rid of its theory of mind in its Entirety, it wouldn't be able to do the useful thing where models try and intuit what you want, for instance. But theory of mind is a part, is almost certainly a necessary ingredient for psychopathency. So you don't want to completely nuke the theory of mind bit. You just want to say, but don't use it for psychopathy. So there's a circuit there and you have to intervene in the right part of the circuit.
1:15:37
Yeah. The complication of this is dissing, to say the least. What does the compute overhead look like for this? I think we've heard stats from anthropic that they're like willing to pay up to, or maybe are paying up to something like 5% of inference. Compute for constitutional classifiers. I think if my understanding is right, like your grand hope would be that through intentional design you could have compute savings by learning the right things faster. I assume we're not there yet today.
1:17:01
Right.
1:17:34
So I assume we're still in the domain of compute overhead. But what does that look like, and what do you think the roadmap is to potentially even saving on compute with some of these techniques?
1:17:34
I think at the moment you pay a substantial. It depends on what you do. There are some things you pay very little extra, but there are other versions where you can pay a substantial. But I think the route to computational efficiency comes from sample efficiency, say that you learn in one sample, something that would have taken you 100 samples. Now your flop budget is 100 times larger than it was. You can do a lot with that. And that's assuming that data is this infinitely available resource, which it is in some cases, and it's not in many other cases, particularly at the frontier. So I think, yeah, the path to compute savings to there being an alignment windfall here runs through sample efficiency. But I think there are good reasons to expect that to happen and how
1:17:47
that obviously relates to like pre training as well. Right. I was just thinking, yeah, if you could make the model always. It's almost like my little catechism that I recite for myself, like what happened in the original Grokking paper to make sure I continue to have command of that. So, sure, if you could get that thing to generalize an order of magnitude faster than it actually does by not blocking, but up to massaging the lost landscape so that it doesn't go in the memorization direction, that would be amazing. But I do also wonder, and obviously it's a very narrow model, does this kind of thing, how far back in that training process can you actually start to apply these things. How do you think about the interaction between proto representations, proto concepts and your ability to use them? I have no good intuition for that at this point.
1:18:45
Yeah, so that's just an empirical question that we don't have the answer to. I'm curious to hear Tom opine on whether he has any hypotheses there. I guess my own guess would be, you know, like you don't have to wait till the end of pre training, but like sometime in pre training you can start doing this type of thing, you know, like representations like form in these kind of like stepwise ways sometimes where you have these phase transitions. But like these phase transitions are caused by themselves by the accumulation of prior representations that are necessary to go through that complexity transition. So my guess is that there's lots of ways you could leverage this. Primarily so far we're more focused on the post training and on the more end of the process. That's where we focus first. But I would guess that there are points in pre training in which you could do this in certain ways. But I think the overall structure of that problem is currently pretty not well understood.
1:19:40
I would agree. It seems really hard, which is not to say never, right. But we're already attempting one extraordinarily hard thing in this post training direction. This is very much not a consensus thing. I think most people think this is hard and possibly doomed to fail. That's fine. But I don't want to layer on another extremely hard thing. If we got into pre training, it'd be two extremely hard things. One, pre training itself just painful. Two, how do you deal with the evolution of representations in the much more fundamental kind of evolution they go through during pre training? I don't think the field of interpretability has a good answer to that yet. One step at a time, right?
1:20:38
So many different connections to be made. Obviously I used to be very interested in concepts around curriculum learning and also around better initializations. Are there ways to start the training process with some sort of purified. And this is maybe a good moment to at least touch on this other paper that you had around the curvature of the lost landscape. Because it does. I actually had. I took a walk one time. I was briefly Carl Schulman's roommate in New York way back in the day. And I once took a walk with him and he was giving me this thought experiment around how living forever, you might say, you want that. There's a lot of situations in which sort of the continuity of some entity, if you allow yourself to think really creatively about the compromises that might be made. At some point, it doesn't really matter anymore. And you could, sure, you could draw through line, but once the thing has been pared down to its most core survival mechanisms, the things that you actually valued about yourself or that you valued about this thing are lost anyway. And so I think this is. And that could be bad in the sense of I wouldn't want to go through that as a human. That was the thought experiment he was taking me through. But that could be good in the sense that if you can identify the cognitive core of a model, then that maybe could be something you could take back in time and start with in the future. So I'll again give you the kind of takeaway that I had, and then you can expand on it to the degree you want to. Basically, I understand that you started with an observation from other research that was noticing that the model's ability to memorize and recite facts or passages from literature or whatever are brittle in the sense that if you go looking for a weight that you can perturb in, or maybe a couple weights that you can perturb to throw off a model's ability to recite some historical passage or whatever, you can find them. Which makes sense because very specific facts presumably are stored in a relatively small part of the model. Otherwise how would they store all the facts? So that seems fairly intuitive, that there's like a very specific circuit that saying the Declaration of Independence kind of depends on. And if you mess with that, it won't be able to do that narrow thing anymore. Okay, cool. So another way to say that is that the loss landscape around that memorized content is jagged. A small change to the overall model can destroy performance on that task. Now flip that and say, what if we look at it in batch? If we look at it in batch
1:21:24
and there's a whole bunch of different
1:24:10
things, then sure, there's going to be a bunch of things that are memorized, but there's also going to be a bunch of things that are more generalized capabilities of the model. And so now if we go looking for weights that will destroy performance at the batch scale, then the weights that we find when change do indeed destroy performance. Those must be the core ones. Those must be the things that are really critical to the reasoning process. And conversely, the things that we can change, and they don't change. The batch modules were probably just connected to some individual, less important thing, less commonly used thing. And again, the other way to say that is the loss landscape is sharp around these core cognitive capabilities. So you change the weights that really drive those or that embody or instantiate or whatever that power those capabilities, you're going to have major performance loss across the board. Okay, then let's say we go through and try to classify these weights according to this. This dimension of these perturbed. These weights when perturbed cause massive broad capability loss. And these other ones just perform when perturbed. They maybe only destroy some memorization or whatever sort weights on that basis and then just truncate the list and say, okay, we're just going to cut off the bottom of all these weights that we think are probably associated with memorization. We'll keep the ones that are associated with core stuff. And indeed that seems to work so much so that the. At least on a couple of dimensions, a couple of different task types, performance actually improved from removing all those weights that had been identified as associated with like esoteric facts and not with these kind of core reasoning capabilities. I thought that was really interesting and it does make me think, boy, run that process a few more times and you could get to some highly abstract reasoner that doesn't have a ton of facts maybe at all. Or I don't know how far you could take that. This is back to the Carl Schulman thought experiment of could you take it all the way to the level of. The thing knows nothing except it has these sort of logical circuits. But it does suggest a path to me around how to take something like this back in time and think if we started from a more pure logical reasoner and added facts into it, that could be a much more controllable path. Right.
1:24:11
We'd have something.
1:26:43
We would be starting with something there, the circuits of which would be like a lot more naturally interpretable.
1:26:44
There's a lot there. That's a great summary, by the way. The only thing that I would tune there is that it's not specifically individual weights. It is like elements. It's like eigenvectors of the Hessian. But that does not matter. Just think weights, it's fine collections of weights.
1:26:51
Yes.
1:27:08
Where to go with this? First of all, yeah, the connection between curvature and memorization is not original to us, but the idea of maybe if it's about memorization, it will affect strongly one thing in the batch or in the big mega batch and not most other things, and that will cause it to wash out and be low value across the megabatch. I think that is original and is quite nice. So what we're talking about here is really like a sort of higher moment property. It is the case that the mean is low, but also the variance is high. So I think you'd be able to. Yeah, there's a possibility where you have something which is low, but across the whole batch we can't distinguish that from high. But on one thing with the statistic we compute but you couldn't spot it. I do. So one thing that we'd hope for from this is that we would be able to shrink the model down. Not only does the model not know this stuff, but it also doesn't pay the parameter cost of knowing this stuff. And we never push this all the way. But I think it probably is not as effective as like data based approaches to minimizing model, which I think are very promising. So for instance, there's some quite cool work, I'm blanking on who, who did it, but on pre, pre training, for instance, where you train synthetic data from context free grammars, say or from some sort of very symbolic domain. And the idea is that it just gets the model to make these very pure information processing circuits. Or you might try a sort of data augmentation approach where you take an article, you pull all of the facts out of the article, put them in a preamble, put that preamble in the context window, but don't include it in the loss. For instance, now the model can reason out by induction from the context, from the sort of open book that you've given it. And we should learn to deduce things. That also seems possible. And I think these approaches feel intuitively more likely to me to give you a kind of minimal reasoner. But there are only so many people at Goodfire, only so many hours in debt. We haven't really pushed it yet, but I think there's a lot of promise there. But then the final thing is, would such a thing in fact be more interpretable? I don't know. Is a giant thicket of logical entailments actually that interpretable or is it like. So does it have so little semantics? It has no like rich semantics. You just get lost in the forest. I honestly don't know. I've never seen one, so I think it's hard for me to reason out.
1:27:09
Yeah, okay. I like that paper. That was a very. Again, if nothing else, is a very fun one for me to groke.
1:29:49
I think it's really cool. This is the sad thing about prioritizing stuff is there's some stuff that I absolutely love that I can't spend as much time on. I think this is a Wonderful paper. It's beautiful. And I think that it would be cool to spend more time on it. Like, it even gives us a sort of regularizer. Essentially, we can say, now I have a lever to ask when I'm. If I'm fine tuning. Say, I can say, oh, yeah, just keep the generalizing bits. Now, that's not something we've really tried, but it'd be interesting to explore using that as a regularizer. Pure fine tuning process. Where I might ask, don't give me the memorization. Because eventually that's like you're talking about in emerging misalignment. If you train on, if you train, if you fine tune hard enough, you just screw the model up. But what if you, like, just were able to keep generalizing bits? Maybe you wouldn't screw it up too much.
1:29:56
I think that notion of shrinking the model also is a really interesting one that in terms of the big picture of like, how are we going to get to a world full of highly capable AIs? That broadly goes well. The idea of strong but narrow is like a really intuitively appealing idea to me. That's obviously Drexler's Comprehensive AI Services is a version of that. And I don't know anything about what safe superintelligence is doing. But in listening to his conversation with Dwarkesh, I came away with a sense that. That they were looking to create something that was this proto. I don't know, proto agent or proto, whatever, proto service provider that would sink and maybe even shrink into its role as it gets really good. It also, it sounds like his vision is that it would lose other capabilities so that it would really dial into its particular context. And so I've definitely been very. I find myself coming back to that idea over and over again of like, how small could you make something that could be really good at what it does, but that for a company, for example, that wants customer service tickets handled effectively. Small, as you mentioned, too, like, you don't have to pay the parameter cost. That could be great, right? They could run these things on CPUs potentially at some level of shrinking, and then they really don't have to worry about what it's going to do out of domain, because it would just have no ability to handle that really at all. And that could give you potentially a lot of comfort.
1:30:42
It's quite exciting. There are Only so many GPUs in the world, so if everyone is going to have their own personal AGI, then you've either got to have a lot more GPUs or a lot smaller models.
1:32:13
Let's talk about Alzheimer's. So this is on obviously the front of learning about the world, advancing science by figuring out what it is that models have learned that allow them to be so good at predicting predictions and actually getting conceptual understanding. Tell us about what you guys learned about how the Primamenta model is predicting who has Alzheimer's.
1:32:26
Yeah, so this is related to our scientific discovery work. So basically you can think of the role that interpretability plays. One way that I like to think about it is when your model has problems, what interpretability helps you do is debug your model. Like a form of model debug. You understand what went wrong and then you use that information to help you give a better model in some type of way. When your model is already good at something, then the thing that interpretability can give you is knowledge extraction from that model. We do a lot of work with partners in the life sciences, and our work in the life sciences is focused on taking biological foundation models and then understanding what's happening in them, with the goal of ultimately converting this into new knowledge in the form of biomarker discovery or potentially down the line, like druggable targets and drug discovery. And so Primamensa is an organization that is focused on neurodegenerative diseases such as Alzheimer's and Parkinson's. And they trained a epigenetic foundation model called Pleiades. So this is trained on cell free DNA fragments. So these are little bits of DNA that end up in the bloodstream of people and they come from cells dying across the body. And there's been a lot of prior work that has shown that you can actually use these cell. It's pretty minimally invasive. You just do a blood trial from a patient and you can use these for various types of diagnostics. So there has been a lot of work, for instance, that's focused on cancer and using cell free DNA fragments for cancer detection. So they trained an epigenetic foundation model that was a autoregressive model that was trained to predict structure in these cell free DNA fragments. And then from there they use the embeddings of that model. I'm glossing over some of the steps. They use the embeddings of that model in order to predict whether patients had Alzheimer's. They brought us in to understand what their model was doing and we applied interpretability techniques, series of interpretability techniques in order to figure out what was the signal that was driving that Alzheimer's prediction. We actually discovered that it was something that was A little surprising. So there's a little bit of nuance here, but basically there had been attempts in the literature for Alzheimer's detection using methylation statistics and cell type of origin, which are two specific things you can get out of cell free DNA, but not specifically using fragment length. And what we found was that their model was overwhelmingly depending on fragment length in order to make its Alzheimer's predictions. And this was really surprising to us because this was not what we expected and not what the Alzheimer's liter based on us in the literature. Fragment length had a history for cancer specifically, but not for Alzheimer's. And so once we learned this insight by studying the model, we worked with Primamenta to construct a proxy model which took this insight and then was able to recapitulate a lot of the performance of the original model with a very simple logistic regression. And we were able to generalize better than the baselines in the literature to an independent cohort that we had access to. And so the high level here that's exciting is that I think this is one of the first examples, maybe the first example of learning something new from a model by studying it and then coming up with a testable hypothesis which then down the line, this is still early. It was a pilot study. We need to expand a model or cohorts and these things take time. But it gave us a testable hypothesis which we now can explore. And we're considering wet lab analyses and other things in order to bring this forward. And so we just think it's like an early example. We're doing lots of other work in the life sciences with other partners as well, and we'll have more to publish there soon. But I just think it's an exciting early example of what can be done with interpretability and the way that you can use your understanding of these models to make concrete testable hypothesis, in this case about the biological world and diagnostics.
1:32:49
I did want to compliment you guys on the blog and I would just recommend the blog to everybody. I think this is one of the first times that I have done this much prep for a conversation and not really had to go into the papers themselves all that much. The blog posts have done an excellent job of helping me understand what's going on, giving me the right level of detail and just be like quite accessible while also not dumbing it down too much. So I think that's the level of investment there, I think is like very apparent to the reader. And I really do recommend the blog highly. Ready for a lightning round?
1:36:44
All Right, let's do it. And big shout out to Michael Bjorn on our team, who writes a lot of the blog posts in collaboration with the scientists and engineers on the team.
1:37:22
Yeah, great job by him. So how do you compete for talent with frontier model developers? That's one big question. It's I think 40 people you guys are up to now, and there's a lot of work that has come out. So it's obviously a team that can come up with good project ideas, execute on them pretty quickly and ship a lot of stuff. These people are going to clearly be in demand. Is it just about the mission or do you have any other tricks up your sleeve?
1:37:30
Partly about the mission. I think that we are trying to do something which is very different, very exciting, very big. And so people are scientifically ambitious. We're a very great place for them to be. Partly about the kind of scientific culture that supports that. I think that we try and be very. Think from first principles, try and look at very empirically driven, that kind of thing. Not hold any particular idea too tightly is the aim. It's always hard to actually achieve that in practice. But I just think we have a very good scientific culture that I think people come here and they're, oh, I like it here, I think I'll stay. And finally on that, it snowballs. Once you have good people, then one, they know good people and two, people want to come and work with them. So that feels like a big. That engine feels like it's started to work well now. So I think that's under like a lot of hard work. Recruiting takes a lot of time, takes a lot of work, and is extremely worth it.
1:37:56
I think we also have a pretty differentiated research vision and a different vision of the future than a lot of the labs do too. And I think that's appealing to a lot of people. And although we're working to figure out our identity as a company in a lot of ways, and we certainly solidified a lot of stuff, I think it's just like exploratory in a way that is really, really hard to do at the big labs. And I think that's one of the things that's really helpful for us is that the possibility space is very open for us as a startup.
1:39:00
You guys have put out some stuff around what you see as the highest
1:39:30
importance or highest leverage.
1:39:35
Open problems in mechanistic interpretability. One is work on alternate architectures. I think one thing I don't have a great sense for is how much of the technique that you're developing will work if, for example, nested learning becomes the next big thing and now we're all in it. We've gone from a transformer world to a nested learning world.
1:39:36
My expectation for nested learning, I mean, we could look at. It'd be interesting to look at this, but my expectation is that interpretability techniques would still work. There has to be semantic information that gets passed through the bottlenecks in any learning setup. And so I have no reason to think like a priority that you wouldn't be able to use a lot of the similar interpretability techniques to understand what was going on there.
1:40:00
I've been generally very encouraged from what little work I have seen applying interpretability to alternative architectures that it mostly has worked pretty well. Like Mamba, type architectures seem to have been remarkably interpretable. But do you think there's any prospect for other architectures perhaps being more interpretable? And if that were to be discovered, could that be the sort of thing that would pull the field in a positive direction?
1:40:25
Part of the problem is that people don't go looking for interpretability. It turns out that I don't know to what extent this is extremely robust, but if you just. It seems quite robust. If you look at the neurons inside transformers mlp, this work from translucent randomly, correctly, they often just are interpretable. Like the sparse autoencoder was inside you all along. And we've had transformers for how many years? And people are just like, oh wait, the MLP neurons, they're interpretable. So maybe we should just look a bit harder. Like moes. I think like mixture of experts, like individual experts are. They're generally not interpretable, but why should they be there? Like 64 of them. Whatever. Like language mod has to do more than 64 things. So in a given expert should be polysementic. But there's some recent work. Again, I'm blanking on the details like the authors, but routing paths also interpretable. Amazing. So the affordances are sometimes there. We just forget to look for them, which is to say. And both of these things push us in. It might be that models almost get. This is Panglossian almost to its optimism. But maybe models get better to the extent they are more interpretable. Obviously that's not literally true, but like moes have pushed us on the performance frontier, but they also give us a new affordance for interpretability. Maybe the correct MLP width is simply because it happens to make the hidden layer inside the MLP roughly interpretable and that makes the computations easier. Maybe there's Something like there's a deep principle here, I don't know.
1:40:50
Cameron Berg of AE Studio did some really interesting work on the Goodfire API, looking at what models say about their own consciousness. What do you think about AI's consciousness?
1:42:26
I think it's a complicated question. I think that it would be very difficult to confidently rule out the consciousness of most existing like front of like frontier systems today. I think they probably aren't. I think that the probably is doing a lot of lifting. I think there's nothing that prevents the idea of consciousness from being in a machine. I think any computation, you know, like you can come up with definitions of consciousness or you can come up with sort of explanations that might preclude that. I don't find them particularly convincing. So. So I think it's fairly likely that it should be possible to build machines that have experience in some meaningful sense. I think we probably haven't today, but I think it's important to take that question pretty seriously. And it's hard to know whether interpretability could give us full insight into that question, but I think maybe it can. And if there's anything that can, probably interpretability would be the thing that could do it. But yeah, I mean, I don't have a valence about whether that would be a good thing or a bad thing. I just think it's a distinct possibility that if it's not a problem that or not a thing that exists now, that it could be a thing that exists in the future.
1:42:39
You want to give just a closing call to action. We're in early stages of AGI. It feels like people are calling Opus 4:6 in Claude code AGI and it's only going to get more real from here. Why should people seek out the Careers page at Goodfire or otherwise invest their precious time and energy into interpretability?
1:43:56
Yeah, I mean, I think it's definitely hard not to feel the AGI right now. So I can super relate to that. I think interpretability is important for a lot of reasons. I think when I imagine the futures that we could walk in, there's a future in which we are building. It feels like a given to me that we are building super intelligence. And that's happening quickly. And you can talk about how quickly. But the way that I see it is like kind of two doors. There's one door where we build super intelligence that we don't understand at all
1:44:19
and then there's one door where we
1:44:48
build super intelligence that maybe we have a shot of understanding. And I think fundamental research and interpretability and fundamental research and like intentionally designing models are really important paths for us to get there. And we're doing all types of exciting work. It's not just like a theoretical exercise, like we're going out. We're making discoveries in the life science. We're working closely with partners to help their models, like, behave better, reduce hallucinations, be more reliable. This is a really important field to develop for the future of technology and also something that progressively unlocks a lot of value along the way. So if anyone is interested in what we're building and working towards that mission with us, please reach out to us. We would love to talk. Yeah.
1:44:50
If you want to be part of the most exciting and beautiful scientific quest that's going on at the moment, I think it's got to be interpretability. And if you want to make it useful, I feel like Goodfire is the place to be. So that's why.
1:45:33
Love it. Congratulations on unicorn status and congratulations on a great run of research. Dan Balsam and Tom McGrath from Goodfire, thank you both for being part of the cognitive revolution.
1:45:46
Thank you for Appreciate.
1:45:57
Step down below the canyon rim where the morning turns to stone Every wall a thousand pages written in a tongue unknown the light falls different down here Bends through corridors of red I followed where the shadows point Trust in what the silence said and the canyon bends the canyon bends when you lay your hands against the walls the river finds the shape you give it the echo answers when you call we didn't come to pass through darkness we came to teach the dark to glow the canyon bends for those who listen for those who learn to let it know
1:46:18
from
1:47:08
the map beneath the surface Circuits running through the clay Every ridge frozen question Every fault line a doorway the water's always known the way down through the valleys through the seams But a hand upon the head Waters can redirect a thousand streams Two canyons at the fork One lift one running blind One where every vein of copper shows you what the dark design we pressed our hands against the plate and felt the current underneath Chose to learn the canyon's language before we tried to make it speak and the canyon bent the canyon bent when you lay your hands against the
1:47:09
wall
1:47:57
the river finds the shape you give it the echo answers when you call we didn't come to pass through darkness we came to teach the dark to glow the canyon bends for those who listen for those who learn to let it know. The accordion arms through set stone ho Revolution in the deep the canyon remembers the those who shape it the canyon never sleeps.
1:48:00
If you're finding value in the show, we'd appreciate it if you'd take a moment to share with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website Cognitiverevolution, AI or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts which is now part of a 16Z where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement@aipodcast.ing. and thank you to everyone who listens for being part of the Cognitive Revolution.
1:49:03