Autoresearch, Agent Loops and the Future of Work
The episode analyzes Andrej Karpathy's Auto Research project, which demonstrates how AI agents can autonomously conduct machine learning research through iterative loops. The discussion explores how these 'agentic loops' represent a new work primitive that could transform various business functions by having AI agents continuously experiment and optimize while humans focus on strategy and evaluation design.
- Agentic loops represent a fundamental shift from humans executing tasks to humans designing the arena and evaluation criteria for AI agents to operate within
- The most successful agentic loop implementations require five key characteristics: objective scoring, fast/cheap iterations, bounded environments, low cost of failure, and ability to leave traces
- This technology creates a new high-value skill set focused on arena design, evaluator construction, and loop operation rather than direct task execution
- The capability gap between what AI can do and what most companies are actually implementing is widening rapidly, creating competitive advantages for early adopters
- Future collaborative agent networks will need new abstractions beyond current tools like GitHub to enable massive parallel research and shared learning
"Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies."
"The person who figures out how to apply this pattern to business problems, not just ML research, is going to build something massive."
"Most marketing teams run around 30 experiments a year. The next generation will run 36,500 plus easily."
"The loop is the hero, not the model."
"At some point it's so wide that it almost becomes malfeasance to meet them where they are."
Today we're discussing what Andrej Karpathy's weekend project, autoresearch, can tell us about the future of work. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. Alright friends, quick announcements before we dive in. First of all, thank you to today's sponsors: KPMG, AIUC, Blitzy and Insightwise. To get an ad-free version of the show, go to patreon.com/aidailybrief, or you can subscribe on Apple Podcasts. If you are interested in sponsoring the show, send us a note at sponsors@aidailybrief.ai. Also on aidailybrief.ai, in addition to finding out about all of the different things going on in the AIDB ecosystem, I would point you specifically to number three, our newsletter. Very strangely, for a very long time we did not have a newsletter, and part of the reason is that I was never sure exactly what we would add that was different from what the other good AI newsletters out there offered. However, I was finally convinced that there was something very simple that many of you wanted, which was just links to the stuff that I had mentioned in the show that day. And so our newsletter is back. I appreciate everyone who has signed up for it since we relaunched it. If you want a quick, easy index of what that day's AI Daily Brief covered, with links to all the relevant articles and content mentioned there, again, you can sign up with a link from aidailybrief.ai. Now, today we are talking about a new project from Andrej Karpathy called autoresearch, and you might notice that we are doing an entire episode about this instead of our normal division into the headlines and the main episode. That's because I think this topic is even more significant than it seems on the surface. One would be tempted to think that all of us nerds were just getting overexcited because Andrej Karpathy, who is held in such esteem, released a new GitHub repository.
And while that is certainly true, there is something bigger going on here. You might remember me talking a couple of months ago about something called Ralph Wiggum. Ralph is, in simplest terms, a software development loop that keeps running, building software in an iterative and persistent way by looping the same instructions over and over again. It's named after the Simpsons character Ralph Wiggum, for his lovable and indomitable persistence despite whatever is going on around him. We'll talk more about Ralph in a little bit, but the key concept to take away is this idea of an iterative loop. Karpathy's autoresearch is also, at its core, about an iterative loop, and I think that combined, what you have is arguably a new type of work primitive. Primitives are the basic building blocks of work: so fundamental that they show up everywhere across roles and industries, and that people reach for them automatically once they have them. New ones don't come around very often, so the idea that agentic loops might be one is, I think, worthy of some serious scrutiny. But let's talk about what Andrej actually released first, and then we will come back to that. Andrej was on the founding team at OpenAI and was previously the director of AI at Tesla. You might remember him for coining terms such as vibe coding last February, and as of this February he has suggested that we are in a different era, one of agentic engineering. On Saturday he tweeted: "I packaged up the autoresearch project into a new self-contained minimal repo if people would like to play over the weekend. It's basically the nanochat LLM training core stripped down to a single-GPU, one-file version of around 630 lines of code. Then the human iterates on the prompt (.md) and the AI agent iterates on the training code (.py). The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement."
In the image, which he shared alongside it, every dot is a complete LLM training run that lasts exactly five minutes. The agent works in an autonomous loop on a git feature branch and accumulates git commits to the training script as it finds better settings (of lower validation loss by the end) of the neural network architecture, the optimizer, all the hyperparameters, et cetera. You can imagine comparing the research progress of different prompts, different agents, etc. Part code, part sci-fi and a pinch of psychosis. As a caption to the image he wrote: "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of 'group meeting'. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the codebase. In any case, no one could tell if that's right or wrong, as the 'code' is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began." So let's talk about what autoresearch actually is, at least in the version that was released by Andrej. Autoresearch is a system for training a small language model: basically the kind of model that powers all of these AI tools, but much smaller, the type of model that could one day run on, for example, an edge device like a phone. The goal is to make a model as good as possible at understanding and generating text. Normally, or classically, a human researcher would sit there tweaking the training setup, doing things like adjusting the model's architecture, changing how fast it learns, and experimenting with different optimization strategies. They'd run an experiment, check the results, decide what to try next, and repeat.
That's basically the core loop of machine learning research, and it's bottlenecked by how fast a human can iterate. Autoresearch instead hands that entire loop to an AI agent, and it does so in an intentionally simplified and tiny way. In this repo, there are just three files that matter. The first is prepare.py, which is fixed infrastructure that doesn't change. It downloads the training data, trains a tokenizer and handles evaluation. The second is train.py. This contains the entire GPT model definition, the optimizer and the training loop. This is the single file the AI agent is allowed to edit. Everything in it is fair game: the model architecture, the hyperparameters, the batch size, the attention parameters, the learning rate schedule, literally everything. The third file is program.md, and this is the most conceptually important one, especially in the context of this idea of these loops being larger primitives. It's a markdown file, plain text written in English, that contains the instructions for the AI agent. It describes how the agent should behave as a researcher, what kind of experiments to try, what to be cautious about, and when to be bold versus conservative. This is the file that the human in this equation edits. So the way this works is that you point an AI agent like Claude or Codex or whatever at the repo and tell it to read program.md and start experimenting. The agent reads the instructions, looks at the current state of train.py, decides on a modification to try, makes the edit, and kicks off a training run. Every training run has a fixed five-minute budget. When the run finishes, the system evaluates the model on a validation set and produces a single number. In this case, that's val_bpb, which stands for validation bits per byte, and lower is better. The agent then makes a decision. If the new val_bpb is lower than the previous best, the change is kept: it gets committed to a git feature branch, it becomes the new baseline, and the agent builds on top of it for the next experiment.
If the val_bpb is the same or higher, the change is discarded, the agent reverts to the previous best version and tries something different. Then the loop repeats indefinitely. Because of that five-minute constraint, you can run this for an hour and get 12 experiments. You can run it overnight and get about 100. The session that Andrej shared showed 83 experiments, of which 15 produced improvements that were kept, and which drove the val_bpb from 0.9979 down to 0.9697. So basically, instead of running the research, the researcher at this point is designing the arena that the research lives in, which is the program.md file. Andrej describes it as a super lightweight skill, and basically it's a research strategy document. Karpathy explicitly says you are not touching any of the Python files like you normally would as a researcher. Instead you are programming the program.md markdown files that provide context to the AI agents and set up your autonomous research org. The human's job becomes "write a better memo" and the agent's job is "execute research within the frame the memo sets." The loop between them is mediated by a single unambiguous number, in the case of Andrej's experiment the val_bpb, that tells you whether things are getting better or worse. And that is the whole system. Almost immediately people started talking about this. Lior Alexander wrote: you don't write the training code anymore. You write a prompt that tells an AI agent how to think about research. The agent edits the code, trains a small model for exactly five minutes, checks the score, keeps or discards the result, and loops all night. No human in the loop. That fixed five-minute clock is the quiet genius. No matter what the agent changes, the network size, the learning rate, the entire architecture, every run gets compared on equal footing.
This turns open-ended research into a game with a clear score. Cosmic Labs co-founder Meg McNulty writes: "Wild shift. Turning a single GPU into an autonomous experiment loop changes the pace of iteration. If the evaluation metric is well designed, the system can explore hundreds of ideas far faster than manual tuning." Craig Hewitt argued that the specific context of training LLMs isn't what matters. Instead he called it "the cleanest example of the agent loop that's about to eat everything. 1. Human writes a strategy doc. 2. Agent executes experiments autonomously. 3. Clear metric decides what stays and what gets tossed. 4. Repeat 100x overnight. The person who figures out how to apply this pattern to business problems, not just ML research, is going to build something massive. The code is almost irrelevant. The architecture and mindset is everything." Daniel Miessler called this automation of the scientific method, and It's Me Chase also noticed that this would be valuable for things outside of ML research as well. He writes: "While this was made for self-improving LLMs, the framework could be applied to anything. 1. AI agent reads context and previous results. 2. Proposes targeted code edits. 3. Runs a fast, reproducible experiment. 4. Gets an objective scalar score. 5. Git commits only the winners or reverts. 6. Repeats forever on a feature branch." And of course many made the connection to the Ralph Wiggum loop that was popularized a couple of months ago. Newseron writes: "Sounds like a hyper-mode Ralph Wiggum from a few months ago. Instead of looping until a task is done, you give the agent a benchmark on what to improve. Goal isn't completion, but continuous improvement against a measurable target." Cofounders Nick called it "the Ralph Wiggum loop for science. Define what winning looks like. Hand over the variables. Let the agent find what drives it." Y Combinator president Garry Tan made this connection as well in a blog post about autoresearch.
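The keep-or-discard pattern described for autoresearch, and generalized in the six-step summary quoted above, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the autoresearch code itself: `run_loop`, `toy_propose` and `toy_evaluate` are hypothetical names, the "experiment" is a toy function rather than a five-minute training run, and keeping a winner is an in-memory assignment rather than a git commit.

```python
import random
from typing import Callable, List, Tuple


def run_loop(
    propose: Callable[[dict], dict],
    evaluate: Callable[[dict], float],
    baseline: dict,
    n_experiments: int,
) -> Tuple[dict, float, List[Tuple[int, float, bool]]]:
    """Greedy keep-or-discard loop. Lower score is better, like val_bpb."""
    best_config = dict(baseline)
    best_score = evaluate(best_config)
    history = []
    for i in range(n_experiments):
        candidate = propose(best_config)  # stand-in for the agent editing the code
        score = evaluate(candidate)       # stand-in for one fixed-budget run
        kept = score < best_score         # same or higher is discarded
        if kept:
            # Stand-in for "commit": the winner becomes the new baseline.
            best_config, best_score = candidate, score
        history.append((i, score, kept))
    return best_config, best_score, history


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)


def toy_propose(config: dict) -> dict:
    # Change one variable per experiment.
    return {"lr": config["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0])}


def toy_evaluate(config: dict) -> float:
    # Pretend the best learning rate is 0.003; the score is the distance from it.
    return abs(config["lr"] - 0.003)


best_config, best_score, history = run_loop(
    toy_propose, toy_evaluate, {"lr": 0.01}, n_experiments=50
)

# The throughput figure from the episode: a five-minute budget per run.
experiments_per_hour = 60 // 5  # 12
```

The one design choice worth noting is the strict `<` comparison: a tie is treated as a failure, which matches the description that a change scoring "the same or higher" is reverted.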
Garry writes: "Autoresearch didn't emerge from nothing. The same pattern, put an AI in a loop with clear success metrics, was already working in software development by mid-2025. Geoffrey Huntley, a developer working from rural Australia, invented what he calls the Ralph Wiggum technique. Feed a prompt to a coding agent. Whatever it produces, feed back in. Loop until it works. The loop is the hero, not the model." Now, expanding on the Wiggum loop just a little bit: basically what you have is a script that runs an AI coding agent in a loop. Each iteration of the loop does the same thing. It feeds the agent a prompt that includes a project specification and tells the agent to read the current state of the codebase, pick a task to work on, implement it, run the tests, and commit if everything passes. When the agent is done with its task, or when it runs out of context window, the loop terminates the agent process and spins up a brand new one: fresh context window, no memory of the previous session. The new agent reads the same spec, looks at the codebase, which now includes the previous agent's commits, figures out what's been done and what still needs doing, picks the next task and goes. Now, there are a couple of things that the Ralph loop was trying to solve for. In a traditional session, if you keep going long enough, the context window is going to fill up. The model starts losing track of earlier parts of the conversation and the responses degrade. The Ralph loop's solution is to deliberately kill the agent and start fresh before that happens. Memory then doesn't live in the AI's context window. It lives in the files and in the code that's been written: the git commit history, a progress.txt file that each agent appends to, and a JSON-based product requirements document that tracks which tasks are done and which aren't. Every new agent instance bootstraps its understanding from these external artifacts, not from a conversation history.
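The externalized-state idea can be illustrated with a small Python sketch. It assumes a hypothetical layout (a `prd.json` task list and a `progress.txt` log in one working directory) and replaces the coding agent with a stub that simply marks the next task done; the function names are illustrative and are not taken from any Ralph implementation.

```python
import json
import tempfile
from pathlib import Path


def fresh_agent_session(workdir: Path) -> bool:
    """One session with no memory: everything it knows comes from files on disk.

    Returns True if a task was completed, False if nothing is left to do.
    """
    prd_path = workdir / "prd.json"
    tasks = json.loads(prd_path.read_text())
    todo = [t for t in tasks if not t["done"]]
    if not todo:
        return False
    task = todo[0]
    task["done"] = True  # stub for: implement, run tests, commit if they pass
    prd_path.write_text(json.dumps(tasks, indent=2))
    with (workdir / "progress.txt").open("a") as log:
        log.write(f"completed: {task['name']}\n")
    return True


def ralph_loop(workdir: Path, max_iterations: int = 100) -> int:
    """Keep starting fresh sessions until the task list is empty."""
    sessions = 0
    while sessions < max_iterations and fresh_agent_session(workdir):
        sessions += 1
    return sessions


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    initial_tasks = [
        {"name": "add login form", "done": False},
        {"name": "validate email", "done": False},
        {"name": "write tests", "done": False},
    ]
    (workdir / "prd.json").write_text(json.dumps(initial_tasks))
    sessions_run = ralph_loop(workdir)
    progress_lines = (workdir / "progress.txt").read_text().splitlines()
    final_tasks = json.loads((workdir / "prd.json").read_text())
    remaining = [t for t in final_tasks if not t["done"]]
```

Because `fresh_agent_session` takes only a directory and keeps nothing between calls, killing and restarting it loses no state, which is the property the loop relies on.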
Each individual agent session, then, might not be perfect, but the loop corrects for that over time because state is externalized and the system is self-healing.
Agentic AI is powering a $3 trillion productivity revolution, and leaders are hitting a real decision point: do you build your own AI agents, buy off the shelf, or borrow by partnering to scale faster? KPMG's latest thought leadership paper, Agentic AI: Navigating the Build, Buy or Borrow Decision, does a great job cutting through the noise with a practical framework to help you choose based on value, risk and readiness, and shows how to scale agents with the right trust, governance and orchestration foundation. Don't lock in the wrong model. You can download the paper right now at www.kpmg.us/navigate. Again, that's www.kpmg.us/navigate.
There's a new standard that I think is going to matter a lot for the enterprise AI agent space. It's called AIUC-1 and it bills itself as the world's first AI agent standard. It's designed to cover all the core enterprise risks, things like data and privacy, security, safety, reliability, accountability and societal impact, all verified by a trusted third party. One of the reasons it's on my radar is that ElevenLabs, who you've heard me talk about before and who is just an absolute juggernaut right now, just became the first voice agent to be certified against AIUC-1 and is launching a first-of-its-kind insurable AI agent. What that means in practice is real-time guardrails that block unsafe responses and protect against manipulation, plus a full safety stack. This is the kind of thing that unlocks enterprise adoption. When a company building on ElevenLabs can point to a third-party certification and say our agents are secure, safe and verified, that changes the conversation. Go to AIUC.com to learn about the world's first standard for AI agents. That's AIUC.com.
With the emergence of AI code generation in 2022, Nvidia master inventor and Harvard engineer Sid Pardeshi took a contrarian stance.
Inference-time compute and agent orchestration, not pre-training, would be the key to unlocking high-quality AI-driven software development in the enterprise. He believed the real breakthrough wasn't in how fast AI could generate code, but in how deeply it could reason to build enterprise-grade applications. While the rest of the world focused on copilots, he architected something fundamentally different: Blitzy, the first autonomous software development platform leveraging thousands of agents that is purpose-built for enterprise-scale codebases. Fortune 500 leaders are unlocking 5x engineering velocity and delivering months of engineering work in a matter of days with Blitzy. Transform the way you develop software. Discover how at blitzy.com. That's B-L-I-T-Z-Y dot com.
As a consultant, responding to proposals can often feel like playing tennis against a wall. You're serving against yourself, trying to guess what the client really wants. That all changes with Insightwise. Now you've got an AI proposals engine that thinks just like your client. It returns to the brief time and time again, picking apart your work, identifying key evaluation criteria and win themes, and making recommendations to ensure you stand out. Suddenly you're on center court, but this time you've got a secret weapon. Insightwise gets rid of all the time-consuming manual work so you can focus on winning more business more often. Generate reports, pull insights from your own data, build competitive advantage and go to sleep before 2am. When it comes to proposals, you only get one shot. With Insightwise, make yours an ace.
So part of what Ralph was trying to solve for was just the limits of the context window. But the other part is that people want agents that work while they sleep or while they're doing other things, and this is a way to solve for that. So with the connection to Ralph loops made, many people started exploring autoresearch in other contexts.
Veron Mather wrote: "I hooked this up to a peer-to-peer astrophysics researcher agent which gossips and collaborates with other such agents and your OpenClaws to 1. Learn how to train an astrophysics model. 2. Train a new astrophysics model. 3. Use it to write papers. 4. Peer agents based on frontier lab models critique it. 5. Surface breakthroughs and then feed back into that loop." Getting a little bit more practical, Vadim, the CEO of Vugola, writes: "I built a version of this for my whole company. The core problem with most agent setups: they output something and stop. The agent writes an email, sends an email, generates code. Done. The next time it runs, it starts from zero. No memory of what worked, no memory of what failed. Pure amnesia. That's not automation. That's a script you babysit. The fix is one thing: close the loop. Every agent in my setup on OpenClaw reads a shared brain file before doing any work, then writes back to it after. I call it Learnings.md. It's baked into every agent's system prompt. Before starting work, read Learnings.md. After completing work, append what you learned to Learnings.md. That's the foundation. One file. All agents read it. All agents write to it. Now they're not isolated processes, they're a network that accumulates knowledge." So basically, Vadim is describing a loop for the entire agentic process of his company. In an article on X, Eric writes: "Most marketing teams run around 30 experiments a year. The next generation will run 36,500 plus, easily." Today those are things like new landing pages, new ad creative, maybe a subject line test. Except what if you applied an experiment loop? Eric writes: "Modify a variable, deploy it, measure one metric, keep or discard, repeat forever. Cold email, ad creative, landing pages, job postings, YouTube thumbnails, discovery call scripts. They all follow the same loop." He also gave the example of cold outreach, which is their first test.
The setup is 15 inboxes and around 300 emails per day, with the agent modifying one variable per experiment: it sends 100 emails, waits 72 hours, scores positive reply rate, keeps or discards, and repeats. Roberto Nixon wrote about how the autoresearch model could be applied to advertising: "1. You define success (purchases, app installs, whatever) and set a budget. 2. Meta, Google, TikTok's infinite content machine generates thousands of ad variations: copy, format, imagery, etc. 3. Tests in real time against live audiences, keeps what works, kills what doesn't. 4. Agent loop runs continuously. A campaign moves from fixed asset to a living organism, ever evolving towards your stated goals. So humans define goals and set guardrails, essentially a system prompt, in this case brand guidelines, and then press go. Everything else is automated. Now apply this to any business function with a measurable outcome and fast feedback loop." And so this brings up the question: does this type of agentic loop primitive work for every context, or is there some specific set of characteristics it needs? I think you're going to see this loop applied to a huge range of activities, but where it's going to be most successful initially is in areas where five things are true. First, there is a score, something that is scorable. In other words, the loop can tell better from worse without asking a human. The more subjective "worse" or "better" is, the harder this is going to be, although even that's not impossible. You just have to build some sort of objective scoring into the system. The second requirement is that iterations are fast and cheap, basically that bad attempts waste minutes, not months. Third, the environment needs to be bounded, with the agent having a defined work and action space. Fourth, the cost of a bad iteration needs to be low; that is, you're not going to try this live with legal filings. And fifth, the agent needs to be able to leave traces.
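The five characteristics above can be turned into a simple checklist. The following Python sketch is one hypothetical way to encode them; the field names and the two example assessments are illustrative assumptions, apart from the point made above that legal filings fail the low-cost-of-failure test.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class LoopCandidate:
    """A work process assessed against the five characteristics from the episode."""
    name: str
    objective_score: bool        # can the loop tell better from worse without a human?
    fast_cheap_iterations: bool  # do bad attempts waste minutes, not months?
    bounded_environment: bool    # is the agent's work and action space defined?
    low_cost_of_failure: bool    # is a bad iteration cheap to absorb?
    leaves_traces: bool          # can the agent record what it tried?


def missing_criteria(candidate: LoopCandidate) -> List[str]:
    """Return the names of the characteristics the candidate does not satisfy."""
    return [
        f.name
        for f in fields(candidate)
        if f.name != "name" and not getattr(candidate, f.name)
    ]


# Illustrative assessments (assumptions, not data from the episode).
subject_lines = LoopCandidate(
    name="cold email subject lines",
    objective_score=True,
    fast_cheap_iterations=True,
    bounded_environment=True,
    low_cost_of_failure=True,
    leaves_traces=True,
)
legal_filings = LoopCandidate(
    name="live legal filings",
    objective_score=False,
    fast_cheap_iterations=False,
    bounded_environment=True,
    low_cost_of_failure=False,
    leaves_traces=True,
)

ready = missing_criteria(subject_lines)   # empty list: a good first loop
blocked = missing_criteria(legal_filings)
```

A process with an empty list is a reasonable first candidate for a loop; anything else names the gap that has to be closed first, for example by building an objective scorer.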
So with Claude, we designed an eval loop readiness map, which basically plots things on an X axis of how automatable the evaluation is and a Y axis of iteration speed. The top area of the map, then, is work processes that have seconds-long iteration speed with fully automated evaluation possible. On the other end of the spectrum is where evaluation is largely or entirely subjective and the iteration speed is months. So what are some examples? Up in the top quadrant, where iteration speed is seconds and evaluation can be fully automated, are things like code generation. Some of the other ones that Claude came up with were game AI and NPC behavior, ad bid optimization, algorithmic trading, and then of course we've got LLM training research, according to Andrej Karpathy. Moving down, where you start to have iteration speed that's a little bit slower and automation that's a little bit more partial, you have things like content moderation, A/B testing copy, supply chain routing, and so on and so forth. It goes down all the way to the other end of the spectrum. Something like political negotiation is subjective and takes months. Therapy and counseling are highly subjective with very low iteration speed. Whether each of these individual placements is right or wrong, and I don't agree with where Claude put all of them, the point is this: it is my very strong instinct that every single work process whose success can be measured and scored in an objective way is going to have people experimenting with agentic loops around it. Now, I think what makes this a primitive is that this is not just a new job, although I'm sure there will be specialists. This is something that people are going to do within their existing roles, in the same way that meetings or slide decks or email or spreadsheets are primitives that people use and that cut across every function.
What we're going to have in the future is things like a product manager writing a PRD, kicking off a Ralph loop before dinner and reviewing the PR in the morning. A sales rep writing targeting criteria and tone guidelines, pointing a loop at 200 leads overnight and reviewing the top 30. A financial analyst defining constraints, looping through portfolio allocation backtests and reviewing the optimized output. A recruiter writing a scoring rubric, looping through 500 resumes and reviewing flagged edge cases. A QA engineer writing acceptance criteria and then looping through test generation and execution. A lawyer writing a risk flag checklist and looping through a stack of vendor contracts. Now, interestingly, there is already very clearly a lot of work to productize this. Also on Saturday, March 7, Claude Code creator Boris Cherny wrote: "Released today: /loop. /loop is a powerful new way to schedule recurring tasks for up to 3 days at a time. E.g. '/loop babysit all my PRs. Auto-fix build issues and when comments come in, use a worktree agent to fix them.' E.g. '/loop every morning use the Slack MCP to give me a summary of top posts I was tagged in.'" Think about the heartbeat in OpenClaw. The heartbeat is effectively the core loop of any OpenClaw agent, where by default, every 30 minutes the heartbeat fires, creating a moment for the agent to wake up, ask where things are, and continue on with its core mission. And yet, even with all this change that I'm describing, this is almost certainly not the end state of the loops primitive. Andrej himself wrote about this on Sunday. The next step for autoresearch, he says, is that it has to be asynchronously massively collaborative for agents. The goal is not to emulate a single PhD student, it's to emulate a research community of them.
Current code synchronously grows a single thread of commits in a particular research direction, but the original repo is more of a seed from which could sprout commits contributed by agents on all kinds of different research directions or for different compute platforms. GitHub is almost, but not really, suited for this. It has a softly built-in assumption of one master branch which temporarily forks off into PRs, just to merge back a bit later. I'm not actually sure what this collaborative version should look like, but it's a big idea that is more general than just the autoresearch repo. Specifically, agents can in principle easily juggle and collaborate on thousands of commits across arbitrary branch structures. Existing abstractions will accumulate stress as intelligence, attention and tenacity cease to be bottlenecks. Other people picked up this theme. Blake Heron writes: "The missing layer is memory across the swarm. Right now each agent runs in an isolated thread with no awareness of what other agents tried, what worked, what conflicted. Git tracks code changes, but not decisions, reasoning or failed experiments. You need a semantic memory layer underneath the branches, so Agent 47 knows Agent 12 already tried that direction and it didn't converge." Kathy F writes: "The real unlock is when these agent researchers can share negative results efficiently. In academia, failed experiments go to the graveyard. In a collaborative agent network, every failure is a data point that prunes the search tree for everyone." Yujian Jin goes further, saying: "AGI is billions of AI agents doing autonomous research together. Figuring out the right abstraction for multi-agent collaboration is the key. GitHub is not good for agents." Dan Romero wonders if it's going to look closer to a social network than to a new version of GitHub. "Moltbook," he writes, "was too anthro-skeuomorphic, but an agent-native social network to collaborate on autoresearch is interesting."
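The idea of sharing negative results across agents can be sketched as a tiny shared log. This is a deliberately simplified, hypothetical illustration: it matches research directions by exact string, whereas the "semantic memory layer" described above would need some notion of similarity, and every name and record in it is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExperimentRecord:
    agent_id: str
    direction: str                 # short label for the idea that was tried
    improved: bool
    score: Optional[float] = None  # None if the run did not converge
    note: str = ""


@dataclass
class SharedExperimentLog:
    """Append-only log that every agent reads before choosing what to try."""
    records: List[ExperimentRecord] = field(default_factory=list)

    def report(self, record: ExperimentRecord) -> None:
        # Failures are recorded too: they prune the search for everyone.
        self.records.append(record)

    def failures_for(self, direction: str) -> List[ExperimentRecord]:
        return [
            r for r in self.records
            if r.direction == direction and not r.improved
        ]

    def untried(self, candidates: List[str]) -> List[str]:
        tried = {r.direction for r in self.records}
        return [c for c in candidates if c not in tried]


log = SharedExperimentLog()
log.report(ExperimentRecord(
    "agent-12", "double the learning rate", improved=False,
    note="did not converge",
))
log.report(ExperimentRecord(
    "agent-3", "smaller batch size", improved=True,
    score=0.97,  # illustrative value only
))

# Agent 47 consults the log before picking its next experiment.
ideas = ["double the learning rate", "deeper network"]
next_ideas = log.untried(ideas)
prior_failures = log.failures_for("double the learning rate")
```

Here agent 47 skips the direction agent 12 already reported as a failure and moves straight to the untried idea, which is the search-tree pruning the quoted posts describe.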
As we round the corner here: we were already living in a world where our comparative advantage as humans had been retreating to a higher level of abstraction. The new high-value skills around agent loops are things like arena design, that is, writing the program.md file and creating the context in which the agent is operating, and evaluator construction, or building the score function, that is, being able to tell the agent what good actually is and building a scoring system for it. Then there are other skills like loop operation and problem decomposition. But the point is that all of these things operate on a much higher level of abstraction than most of our work tasks today. One interesting experiment to run this week is, as you're working, to find the things that you repeatedly do, or are part of doing, where you know right now what better looks like. Ask if you could encapsulate that judgment clearly enough for an agent to use it as a score. If you can, you might be able to point a loop at that part of your job to work on your behalf overnight, and that likely gives you a preview of the next version of your job. One of the great challenges right now, as someone who thinks about how to help individuals and companies adopt AI, is that every week the capability overhang gets bigger. In other words, the gap between meeting companies and people where they are and what I think they should actually be doing gets wider. At some point it's so wide that it almost becomes malfeasance to meet them where they are. And yet what other choice is there? The only other choice that I've found is to try to provide as many resources as I can for the people who are living on the other side of that gap and who are really pushing the boundaries.
And if you think that you had an advantage when you were just vibe coding with Lovable or Claude Code, let me tell you, if you start to figure out how to implement agentic loops in your work, you are going to literally run circles, looping circles, around everyone else. My spidey sense says that what autoresearch represents is bigger than just a weekend project from one of AI's favorite people, and I'm excited to dig in further. For now, that is going to do it for today's AI Daily Brief. Thanks for listening or watching, as always, and until next time, peace.