What the big new AI trend Tokenmaxxing is & why it's a big problem

23 min

•Apr 23, 2026

Summary

Token maxing is a new trend where AI tool consumption (measured in tokens spent) has become a status metric in tech companies, but heavy token usage doesn't correlate with actual productive output. The episode explores why this metric is flawed, citing examples from Uber, Meta, and Microsoft where engineers game the system, and proposes alternative measurement frameworks like Agentic Work Units that track real business value instead.

Insights

Token consumption has become a performative status game in Silicon Valley, driven by pricing models that charge per token rather than per seat, creating perverse incentives for engineers to maximize spend rather than output
Heavy AI tool usage produces twice the output but at 600x the cost, with only 11% of AI-generated code actually reaching production in real-world scenarios like Uber, indicating a massive quality and efficiency problem
Goodhart's Law applies directly: when token consumption becomes the measured target, it ceases to be a good measure of productivity, mirroring historical failures like Wells Fargo's fake accounts scandal and Soviet chandelier factories
The hidden quality cost of AI-generated code—43% requires manual debugging and developers spend 38-40% of time verifying AI work—is invisible on token dashboards but critical to actual delivery and ROI
Organizations should measure Agentic Work Units (completed business-relevant work) instead of tokens, tracking metrics like production-deployed code, bugs per shipped PR, or human review hours to align incentives with real value creation

Trends

Token consumption as a performative status metric in enterprise AI adoption, particularly in engineering teams
Shift from per-seat pricing to metered token-based pricing models driving unexpected cost explosions for enterprises
Emergence of AI-first process frameworks using markdown files and skills to provide agent context, with inherent scalability and maintenance challenges
Hidden quality assurance bottlenecks becoming the real constraint in AI-assisted development, not code generation speed
Alternative metrics frameworks (Agentic Work Units) emerging as counter-trend to token maxing across multiple industries
Drift and decay in AI agent systems as documentation and skills become outdated without continuous maintenance
Cross-industry pattern: knowledge workers will face same token maxing dynamics that developers currently experience
Enterprise AI tool vendors shifting contract models from flat-fee to usage-based, creating cost surprises for customers
Internal tooling solutions emerging to manage agent context and task orchestration beyond simple markdown-based approaches
Quality verification becoming a critical bottleneck and cost center in AI-assisted work across all knowledge worker roles

Topics

Token Maxing and Performative AI Metrics
AI Code Quality and Production Deployment Rates
Goodhart's Law and Metric Gaming in Organizations
Enterprise AI Pricing Models and Cost Structures
Agentic Work Units as Alternative Performance Metrics
AI Agent Context Management and Skills Documentation
Hidden Quality Assurance Costs in AI-Generated Code
Process Bottlenecks in AI-Assisted Development
Knowledge Worker Productivity Measurement
AI Tool Adoption and Organizational Culture
Markdown Files and Skills for Agent Orientation
Debugging and Verification Time in AI Workflows
Enterprise Contract Migration to Usage-Based Pricing
Cross-Industry Application of AI Consumption Patterns
Internal Tooling for AI Process Management

Companies

Meta

Built leaderboard system 'Claudeonomics' ranking engineers by token consumption; pulled it after public backlash

Uber

CTO admitted 2026 AI budget spent after rolling out Claude Code; only 11% of AI-generated code reached production

Microsoft

Engineers found deliberately burning tokens by gaming AI consumption as performance metric

Salesforce

Developed Agentic Work Units (AWUs) as alternative to token consumption metrics; 2.4B AWUs generated in Q4 2025

Jellyfish

Consultancy released study of 12,000 developers across 200 companies showing heavy AI users produce 2x output at 600x...

Shopify

VP and CEO publicly argue heavy AI usage is cost-effective if output increases, even at higher token spend

Anthropic

Moved enterprise contracts from flat $200/user model to metered compute model, tripling customer bills

New York Times

Covered token maxing trend; called it a new status game in Silicon Valley

Fortune

Published coverage of token maxing trend alongside other major tech publications

TechCrunch

Covered token maxing trend as emerging industry phenomenon

OpenAI

Referenced in context of token-based pricing models and LLM consumption patterns

NVIDIA

CEO Jensen Huang advocates for $250k annual token consumption per engineer; benefits from LLM token-driven business m...

Singapore Airlines

Uses Agentic Work Units to track customer resolution outcomes instead of token consumption

Wells Fargo

Historical example of metric gaming: measured cross-sells, resulting in 3.5 million fake customer accounts

Mindset

Host's company; built internal tool 'memex' to manage AI-first processes beyond markdown files and skills

People

Jack Horton

Host of In The Loop podcast discussing token maxing trend and AI productivity measurement

Jensen Huang

Stated engineers should consume at least $250k in tokens annually; advocates for heavy AI tool usage

Uber's CTO

Admitted entire 2026 AI budget spent after rolling out Claude Code to 5,000 engineers

Quotes

"95% of them were using AI every single day, with 70% of the code they were submitting being AI written. But only 11% of that was actually running in Uber's systems."

Jack Horton•Early in episode

"When a measure becomes a target it ceases to be a good measure"

Jack Horton

"Measuring programmers by amount of code written is like measuring the progress of us building a plane by weight. It has literally no value."

Bill Gates (quoted by Jack Horton)

"The engineers at the bottom of this graph ship code changes for about 28p each. The engineers at the top of that graph ship them for $89 each."

Jack Horton

"43 percent of ai generated code requires lots of manual debugging after it's gone through qa"

Jack Horton•Mid-episode

Full Transcript

Over the last few weeks, the New York Times, Fortune, TechCrunch, and many more have all covered the same new trend. It's called token maxing, a new trend that's spreading literally everywhere. Yet this month, a consultancy called Jellyfish just released a study of 12,000 developers across 200 companies. The engineers at the bottom of this graph that they created ship code changes for about 28p each. The engineers at the top of that graph ship them for $89 each. The heavy users produce twice as much output at 319 times the cost. This is what token maxing is. Today I'm going to discuss many of the silly stories behind this new trend, with engineers gaming leaderboards that encourage them to spend lots of money on tokens. I'll explain what's really happening, what token maxing is, why the same dynamic is coming to every industry and job, not just software, and, importantly, why it's such a big problem that you must catch at work and what you should be doing instead. This is In The Loop with Jack Horton. I hope you enjoy the show.
So let's start at the top and discuss what the new trend of token maxing actually is and, to be honest, why it's a bit of a big problem. For context, this month, so April 2026, the CTO of Uber got on stage at a big industry event and admitted that his entire 2026 AI budget was completely spent. Now, four months earlier, he'd rolled out Claude Code to about 5,000 engineers, and by March, 95% of them were using AI every single day, with 70% of the code they were submitting being AI written. But only 11% of that was actually running in Uber's systems. So 95% of them were using AI, 70% of code that was being submitted was created by AI, but only 11% made it to production. These are very important numbers and they tell a really good story about what's happening right now. And this is going to be something that will be mirrored across many different jobs and industries.
As I've said a few times, programming is just ahead of the game, and knowledge workers generally are going to be going through many of the same challenges, ideas, and new ways of working that developers have gone through over the last 18 months.
So for context, a token is roughly three quarters of a word. So when an LLM gives you three quarters of a word, you spend the token. And to give you a sense of scale, the entire text of War and Peace, the book, is about a million tokens. And very funnily, there are VCs and people blogging online and releasing articles all the way through 2026 about burning about 250 million tokens in a single day by running a swarm of agents all operating in parallel. 250 million tokens, by the way, is roughly 250 copies of War and Peace consumed in a single day by a single person. I think Jensen Huang of NVIDIA said on the All-In podcast recently that if a half-a-million-dollar engineer, so they pay him half a million dollars a year, did not, and I quote, consume at least $250,000 worth of tokens a year, he said he's going to be deeply alarmed. And by March, the New York Times had called this basically a new status game all over Silicon Valley. And the measure is quite simple right now. It's just consume. Consume more tokens, spend more, and prove that you're doing lots of stuff.
Now, over the last few weeks, this story has taken an interesting turn, because engineers at Meta and Microsoft and Salesforce and a bunch of other big companies have been found to be deliberately burning tokens: running lots of agents, trying to have massive context windows, which cost lots of money, going over and over on prompts, just looping prompts, because they worked out that their managers were using AI consumption as some form of performance metric. And of course, they gamed it.
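As a rough check on the scale figures mentioned above, the conversion can be written out in a few lines. This is only a sketch: both constants are the episode's own approximations (a token as about three quarters of a word, War and Peace as about a million tokens), and the function names are made up for illustration.

```python
# Rough scale check using only the approximations quoted in the episode.
WORDS_PER_TOKEN = 0.75            # "a token is roughly three quarters of a word"
WAR_AND_PEACE_TOKENS = 1_000_000  # "about a million tokens"


def tokens_to_words(tokens: int) -> float:
    """Approximate number of words represented by a token count."""
    return tokens * WORDS_PER_TOKEN


def war_and_peace_copies(tokens: int) -> float:
    """How many copies of War and Peace a token count corresponds to."""
    return tokens / WAR_AND_PEACE_TOKENS


daily_burn = 250_000_000  # the single-day figure quoted in the episode
print(f"{tokens_to_words(daily_burn):,.0f} words")
print(f"{war_and_peace_copies(daily_burn):,.0f} copies of War and Peace")
```

Real tokenizers vary by model and language, so the word count is an order-of-magnitude estimate rather than a measurement.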
And those engineers that were at the top percentage in terms of usage and token spend were producing about twice the output, but at 10 times the cost. Meta's internal tool, called Claudeonomics, is essentially something that went public recently. They built this leaderboard ranking system, and it ranked people by the number of tokens they'd consumed. About 60 trillion tokens were used or spent in a 30-day window. To put that 60 trillion tokens into context, if every employee at Meta read a full novel every single day for 30 years, they would collectively consume about 0.1 percent of that total number. And Meta, by the way, has 78,000 employees. That leaderboard handed out badges like Token Legend, Cache Wizard, and Session Immortal. And Meta's CTO was, you know, shouting from the rooftops, basically saying that their top-ranked engineers on these leaderboards have become 5x to 10x more productive, saying that this is easy, this is like printing money, basically. And when it went public, of course, they then pulled the leaderboard, because it's quite embarrassing, and I'll go into why.
Now, this is very developer focused, but we're going to see the same in every single function in some form. People will so desperately want their teams to be using AI that they'll just want them to consume tokens. You know, if token consumption is going up, surely output's going up. Now, the question for me, and I think you can probably tell my opinion here, is: is this actually making people more productive? So let's discuss that point, you know, why I think tokens don't mean more output.
So Jellyfish, that consultancy that released the study I mentioned at the start of this discussion, measured pull requests shipped per quarter across 12,000 developers over 200 big companies. Now, a pull request, for anyone outside of engineering, is a standard unit of developer work: a chunk of code that's been submitted for review, approved, and then added to the product. So shipping pull requests has often been, other than merge requests I guess, the closest thing to, let's say, a universal output. Now, the engineers that were using the least AI burned around three dollars of tokens over a quarter, so over three months, and shipped 11 pull requests each. The engineers that were using the most burned $1,820 of tokens a quarter and shipped 23. 13 more PRs, pull requests. So that's twice the output, but about 600 times the cost.
And the word output, I guess, is a loaded word, because a lot of what the heavy users produce doesn't necessarily ship everywhere. Remember Uber's numbers: 70% of that committed code was written by AI, but only 11% of that actually was running in production. Now, it's quite easy to diagnose. You know, just because you're producing lots doesn't mean it's good. This is a new area. It's an area where everybody's trying to innovate from a process perspective, a systems perspective, a training and skills perspective. But also there are just natural bottlenecks in a process. If you think about the theory of constraints, a process will always move as fast as the slowest-moving part. And right now there are certain areas that are just bottlenecks. You know, review is a bottleneck. A senior engineer might be able to review, say, two to four hundred lines of code in a day, let's say. AI can generate that in 90 seconds. Now, people are doing automations on top of this, and automated testing, but still, that doesn't mean that all of this is quality enough to pass QA, so that is quality assurance and testing, to actually get inside the product. And again, just because
you're burning lots of tokens, are you doing it effectively? Are you being efficient with your token spend? You know, putting a thousand-word PDF into a context window and then asking lots of questions on that is going to use loads of tokens. Are you actually achieving anything there?
Now, none of this is new as a problem. There's a principle called Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Now, a good example is a paper by someone called Khrushchev, and it's about a Soviet chandelier factory, and the way they evaluated progress was by weight. Now, if the chandeliers grew heavier and heavier, so all the workers were producing heavier and heavier chandeliers until they started pulling ceilings down, they fulfilled the plan. Now the question is, does anyone really need that plan? Does it actually solve any problem? No, it doesn't. You're just creating heavy chandeliers and measuring value based on that. Bill Gates once said the same thing: measuring programmers by amount of code written is like measuring the progress of us building, let's say, a plane by weight. It has literally no value. Wells Fargo famously measured staff by cross-sells, so selling more products to the same customers, and ended up with about three and a half million fake customers. All of this started as a reasonable proxy and KPI to try and get people to do the right things that the organization wanted, but obviously stopped working the moment everybody realized that that's how they were being assessed. And it doesn't actually deliver any value: if that measure does not deliver value and everyone games that system, you're getting more of a bad thing. And token maxing is just the newest member of that family. It's the new trend.
Now, there are some people that push back on this. You could argue that if your engineers spend $1,000 a month on, say, LLMs and ship 10% more, that's arguably cheap. It's not too expensive.
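Whether heavy usage is cheap or expensive depends on what each unit of output costs, so it can help to write the Jellyfish figures quoted above as cost per pull request. The sketch below uses only the two cohorts mentioned in the episode ($3 and 11 PRs versus $1,820 and 23 PRs per quarter); the `Cohort` class and its field names are hypothetical, and the calculation counts token spend only, not salaries or review time.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    """Per-engineer, per-quarter figures as quoted in the episode."""
    name: str
    token_spend_usd: float
    prs_shipped: int

    @property
    def cost_per_pr(self) -> float:
        """Token spend divided by pull requests shipped."""
        return self.token_spend_usd / self.prs_shipped


light = Cohort("least AI usage", token_spend_usd=3.0, prs_shipped=11)
heavy = Cohort("most AI usage", token_spend_usd=1820.0, prs_shipped=23)

spend_ratio = heavy.token_spend_usd / light.token_spend_usd  # roughly 600x
output_ratio = heavy.prs_shipped / light.prs_shipped         # roughly 2x

for cohort in (light, heavy):
    print(f"{cohort.name}: ${cohort.cost_per_pr:.2f} in tokens per shipped PR")
print(f"spend ratio: {spend_ratio:.0f}x, output ratio: {output_ratio:.1f}x")
```

The same structure works for any team's own numbers: swap in real spend and shipped-PR counts and compare the ratios rather than the raw token totals.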
Shopify's VP of engineering has been making that argument very publicly. Shopify's CEO makes a very similar point, that only through massive usage of AI will you get the right skills to eventually become productive. And the reality is both are right. If the output is real, then token usage is also great, because you're getting good output and clearly lots of it. But the Jellyfish numbers, that study that Jellyfish released, and the example from Uber suggest that often this just still isn't the case. And I'll come on to some of the reasons I think this is the case, by the way. And one of them, which is one of the reasons I got so interested in this episode, is what Mindset is actually solving for internally and, as a result, has been solving for other customers and partners.
Let's talk about, I guess, the reason that things are getting so expensive and things aren't necessarily improving in terms of productive output. And as a result, what I think people should be focusing on instead. So as I said, I think the case for heavy consumption is partly right. It's going to help people gain skills. It's going to force people to actually adopt this technology and get over themselves. But this still doesn't explain why things are so expensive and things still aren't very good in terms of quality output. And I think there's about three or four reasons.
And the first reason, I think, is pricing. So that previous example of measuring people by a proxy like lines of code written or the weight of a chandelier: well, when those came about, that didn't cost as much. So when a programmer in, say, 1982 wrote, let's say, 500 lines of code, those lines sat on a disk for nothing. Didn't cost people money. Whereas AI coding tools aren't sold by the seat, they're sold by the token. So Anthropic, for example, moved their big enterprise contracts towards the late end of 2025 from a flat, roughly $200 per user model with tokens included, to a more metered compute model.
So you get $20 for a base seat and then you get a usage charge. So when legacy contracts all over the world have started expiring around, you know, March, April, May 2026, and all these companies have really been drinking the Kool-Aid of token maxing, many customers' bills will be tripling overnight. For Anthropic, let's say a typical big company runs maybe 150 to 250 dollars a month per person, and then heavy users run five to ten times that. You can see how the numbers start to get really expensive when you've got seven, eight thousand developers, and then when you have some agents that are just running continuously, doing huge tasks and using that entire token limit in just a matter of days. Well, imagine if you had 7,000 people with those agents. Again, if you remember the episode I discussed recently, the OpenClaw episode, and why it's such an important software invention, and why Jensen Huang is in love with it: it's because these things are going to consume so many tokens, which drive LLMs' business models, which ultimately benefits NVIDIA.
Anyway, I think the second reason is culture. Right now, we've got this weird culture, like at Meta and many other big companies, where you've got this leaderboard of consuming tokens. You've got Jensen Huang wanting people to spend, you know, a quarter of a million dollars in tokens per year. So really this token consumption has become quite performative in a way that, you know, let's say a company selling $20 per month per seat per user just didn't see and didn't experience in the same way.
And when that metric is really publicly visible and the boss says no limit and you get rewarded and a pat on your head, obviously people will start to play it and game it.
And the third reason here is a hidden quality cost. You know, 43 percent of AI-generated code requires lots of manual debugging after it's gone through QA, so testing. And there have been some studies to say that some developers are spending about 38 to 40 percent of their week on debugging and verifying whether what the agent has actually built is the thing they meant to build. So that heavy token usage by an agent, so generating lots of words, although it seems really attractive, is only real if a human clears it up or you have some form of really trustworthy testing software and suite that automates that. And these things aren't always easy to automate reliably. It's not easy. And that cleanup time isn't showing up on the token dashboards; it shows up when you try to deploy. And this is going to be the same across all spaces and industries. For those in, let's say, a law firm or a marketing team, a sales team, wherever you are, the role of managers is also quality assurance and quality checking. You need to make sure what we're producing as an organization is up to scratch. And again, maybe you can get AI and go, yeah, AI, go and check this. But maybe you also need people. So you have explosions of output from some people and then lots of people having to verify that.
I think this comes to the other problem, which is context and orientation of agents, and then being able to reliably produce things that you want it to produce. Now, some people are solving this with skills, with MD files, to give agents context and rules and processes, and I've done episodes on that recently that have been really, really popular. But I think still that is just a trend of the day.
It will go away over the next 18 months, because ultimately this system still relies on people updating those documents, updating those skills, keeping them so fresh that they don't go stale, and an agent reliably using that at scale across large organizations, which, although it's infinitely better where we are now than 18 months ago, I still don't think is perfect. I think we're a way off from what will come as the next evolution. And Mindset is seeing this because we're quite far ahead of most people from a process perspective internally, and we've already really faced that problem. We actually built our own tool to solve that as an internal tool, which the team are absolutely loving.
And I'll give you a good example. So when you ask an agent, say, to make a change in a code base, it might not have all the context of your code base itself, because it can't just scan all of your code base, because the context window blows up. So a lot of people create markdown files, so essentially documents which specify specific ways things work: how my APIs work, how this works, how that works. They also have processes they enshrine, so how to deploy, how to push tickets to Jira or GitLab, your project management tools. You then get people to pass information. So a product team would go and write up what you're trying to build in the spec, you pass it to engineers, and that might go into an MD file, and then that MD file gets given to agents. And people create, like we've been doing, quite complex setups, like really powerful setups, clever setups of skills and MD files that essentially enable an agent to operate and do good coding. And this is the same that's happening now with Claude Cowork and all organizations, as I've covered in other episodes. So you create this fantastic system. However, that doesn't mean the agent's actually producing great work. A lot of the time it produces random work.
That doesn't mean that all of your engineers are able to suddenly use these things really well. It doesn't mean that you can easily track all of the key important decisions that are made, because in a project management tool you have tickets, or you have tasks, but an agent doesn't think in that way. It doesn't think in small, tiny tasks: add a headline, write sub copy, review. It thinks in larger chunks of work. But those larger chunks of work, therefore, when they're executed, have to rely on the agent accessing that MD file structure, the skills that you've created, to actually do the work. And then you have to test whether that's been done, because if it writes 10,000 words, or lines of code, let's say, that doesn't mean that all of that is actually what was intended to be built. You often find things that are accidentally snuck in. You know, why did you do that? Oh, I thought this was a good idea. And again, I'm going to keep coming to this. This is the same thing that's going to be happening at scale with knowledge workers. And, you know, three or four months in, or five months in, from having MD files and skills, you will start to find that suddenly some of these go stale. People stop updating them.
They don't use them properly, because naturally people just drift, especially when it's nebulous. You know, it's documents. Like, these skills, I have been promoting them and they're really great, but they are just a document specifying how to do something, and that information must be continually updated, 24/7, for every decision. And the more decisions, strategies, and projects you do, the more skills and MD files you need, and it just becomes very bloated and chaotic, trust me, very quickly. So this problem around token consumption relies on everyone updating these document sets that live somewhere. But then you need to track who's allowed to update that document. Has this person used that skill? Can I track if they've used that skill? Wait, someone over here updated that document, which changed our process. OK, so we need a review process to allow only some people to update an asset like an MD file or a skill, but not others. But then again, things go stale. Work isn't actually at high quality anymore. So this is kind of what happens in a big system when you start to really scale it and scale AI-first processes. So yeah, this is a real problem. So you get drift from agents delivering the wrong things, you get documents that go outdated, people not keeping everything up to date, and essentially the entire system starts to slowly decay and performance goes down, yet token consumption goes way up.
That's honestly one of the reasons we've been solving this internally for ourselves. We call it memex. It's essentially a system whereby you specify strategies at the top and, instead of using project management tasks, the document, which has access to your code base, which is indexed, has access to specific files and other functions and all the skills, turns that into tasks which are managed in a different way. So for anyone that's really interested in that, do reach out, because we're actually having lots of conversations about this internally to share how to use something like this and how to implement better AI-first processes at the moment.
Now, I think that's a good opportunity to go on to what I think people should be measuring instead, and how they should go about this. There are companies that are thinking about this really, really effectively from a process perspective, and there's how we're solving that with internal tools like I described. I think also the way you measure output is important. So, for example, Salesforce have been trying to measure something called Agentic Work Units, so AWUs, as a complete counter to token maxing. Instead of counting tokens consumed, an AWU is essentially a completed unit of business-relevant work. So let's say Singapore Airlines use AWUs to track customer resolution type, and other companies use them to track the quality of product recommendations, and about 2.4 billion AWUs were generated across their system and ecosystem in Q4 2025 alone. Now, if you're running a team, the job is therefore to define what your AWU is. The unit has to represent a complete piece of work you'd otherwise have to have paid a human to do. And it has to track whether that work actually stuck. So for engineering, it's not necessarily pull requests generated or pull requests merged into a code base. It's pull requests that made it into production and stayed there, or bugs per shipped PR.
So per piece of work, how many bugs were generated, how many incidents were traced back to AI-assisted code. Or track the human review hours your team is burning on AI code. For sales, it could be qualified opportunities that progressed a sales stage, AI-assisted versus not. For legal, maybe it's contracts signed, not drafts produced. Or for marketing, content actually published that drove traffic, not just content generated. Again, I keep coming back to that chandelier factory example. What a stupid, stupid proxy. And token maxing, I believe, is the exact same, and it will end, and it is a big mistake to get obsessed with that. It's maybe a good starting place to get your team using it, but it's not going to produce value. So yeah, those organizations that, yes, try and get people to adopt these tools, but care about how you organize contextual information that goes beyond just MD files and skills, that don't try and get people to game the system to spend tokens, and that think about what they're really trying to measure, are going to succeed here and produce actual value, not just trash.
Anyway, that's it for this week. Thank you for listening, and I'll see you next time. Thank you.
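For the engineering case described above, one way a team might start is by computing outcome metrics from its own pull request records instead of from a token dashboard. The following is a minimal sketch under stated assumptions: the `PullRequest` record, its fields, and the `outcome_metrics` function are all hypothetical, the sample data is invented, and none of this is Salesforce's definition of an Agentic Work Unit.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical record; field names are illustrative only."""
    ai_assisted: bool
    merged: bool
    still_in_production: bool  # e.g. not reverted after some agreed window
    bugs_traced: int           # bugs or incidents traced back to this PR
    review_hours: float        # human time spent reviewing it


def outcome_metrics(prs: list[PullRequest]) -> dict[str, float]:
    """Summarise AI-assisted work by what shipped and stuck, not by tokens."""
    ai = [p for p in prs if p.ai_assisted]
    shipped = [p for p in ai if p.merged and p.still_in_production]
    n = len(shipped)
    return {
        "ai_prs_opened": len(ai),
        "ai_prs_shipped": n,
        "ship_rate": n / len(ai) if ai else 0.0,
        "bugs_per_shipped_pr": sum(p.bugs_traced for p in shipped) / n if n else 0.0,
        "review_hours_total": sum(p.review_hours for p in ai),
    }


# Invented sample data, purely to show the shape of the calculation.
sample = [
    PullRequest(True, True, True, bugs_traced=1, review_hours=2.0),
    PullRequest(True, True, False, bugs_traced=3, review_hours=4.0),
    PullRequest(True, False, False, bugs_traced=0, review_hours=1.5),
    PullRequest(False, True, True, bugs_traced=0, review_hours=1.0),
]
metrics = outcome_metrics(sample)
print(metrics)
```

The design choice here is that review hours are counted for every AI-assisted PR, shipped or not, since that verification time is the cost the episode says never appears on a token dashboard.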