"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis

Don't Fight Backprop: Goodfire's Vision for Intentional Design, w/ Dan Balsam & Tom McGrath

107 min

•Mar 5, 20263 months ago

Summary

Dan Balsam and Tom McGrath from Goodfire discuss their company's $150M Series B fundraise and new "Intentional Design" paradigm for AI interpretability. They explore moving beyond understanding what models learn to actively controlling what they learn during training, including techniques for reducing hallucinations without fighting gradient descent.

Insights

Interpretability is evolving from reverse engineering trained models to actively shaping what models learn during training through "Intentional Design"
The key principle is to avoid "fighting backpropagation" by reshaping the loss landscape rather than blocking unwanted updates
Models can be trained to reduce hallucinations using frozen probe detectors without learning to evade detection
Neural network representations have rich geometric structures beyond simple linear features, requiring manifold-based understanding
Interpretability techniques can enable scientific discovery by extracting knowledge from already-trained models

Trends

Shift from post-hoc interpretability to training-time intervention and controlMovement from sparse feature detection to understanding geometric manifolds in neural networksIntegration of interpretability techniques into commercial AI development workflowsUse of interpretability for scientific discovery in domains like healthcare and biologyDevelopment of techniques to separate memorization from generalization in neural networksGrowing focus on sample efficiency improvements through interpretability-guided trainingEmergence of interpretability as a business model for AI customization and safetyEvolution from understanding individual concepts to understanding computational circuitsApplication of interpretability to alternative architectures beyond transformersIntegration of natural language feedback into training optimization processes

Topics

Intentional Design paradigm Mechanistic interpretability Sparse autoencoders Hallucination reduction techniques Gradient descent optimization Loss landscape manipulation Neural network circuits Manifold geometry in AI Reinforcement learning from interpretability AI alignment research Scientific discovery through interpretability Alzheimer's disease prediction Cell-free DNA analysis AI consciousness Alternative neural architectures

Companies

Goodfire

AI interpretability startup that raised $150M Series B at $1.25B valuation

Anthropic

AI company referenced for constitutional AI techniques and interpretability research

OpenAI

Mentioned for obfuscated reward hacking research and safety concerns

Primamenta

Neurodegenerative disease company that trained epigenetic foundation model for Alzheimer's

FAR AI

Research organization that published complementary work on probe evasion dynamics

Granola

Episode sponsor providing AI-powered meeting transcription and workflow tools

People

Dan Balsam

CTO and co-founder of Goodfire, discussing intentional design paradigm

Tom McGrath

Chief Scientist at Goodfire, explaining technical interpretability concepts

Carl Schulman

Referenced for thought experiment about continuity of identity and cognitive cores

Zvi Mowshowitz

Called obfuscated reward hacking 'the most forbidden technique'

Cameron Berg

AE Studio researcher who studied AI consciousness using Goodfire API

Quotes

"Don't fight backprop. Because models are such high dimensional beasts, gradient descent will inevitably find ways around any attempt to prevent the model from learning what the loss function directs it to learn."

Tom McGrath

"Paranoia is a way of life in alignment research."

Tom McGrath

"If you want to be part of the most exciting and beautiful scientific quest that's going on at the moment, I think it's got to be interpretability."

Tom McGrath

"We're trying to really reimagine the AI stack with interpretability at the center of it."

Dan Balsam

Full Transcript

5 Speakers

Speaker A

Hello and welcome back to the Cognitive Revolution.