Inference Got Cheap. Renegotiate Everything.

9 min

May 5, 2026

Summary

AI inference costs have dramatically dropped due to hardware specialization and new model architectures, inverting 18 months of rising expenses. Vendors quietly priced contracts under old economics, creating immediate renegotiation opportunities for enterprises that understand the cost shift.

Insights

Inference (model usage) represents 90%+ of enterprise AI costs but was conflated with training costs, causing systematic mispricings across vendor contracts
Chip vendors splitting product lines into training and inference hardware signals the industry recognizes these require fundamentally different optimization approaches
Open-weight models like DeepSeek V4 at 10% of closed-model pricing create immediate negotiating leverage for any enterprise with vendor contracts
Smaller distilled models and mixture-of-experts architectures can handle 95% of workloads at 10% of frontier model costs, representing massive untapped savings
The buyer-seller dynamic inverted in May 2026; enterprises now have pricing power and should renegotiate immediately rather than wait for renewal
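
The routing arithmetic behind the insights above can be sketched in a few lines of Python. This is an illustrative calculation, not a pricing model. It assumes the all-frontier bill is normalized to 1.0, that the routable share of the workload can move to a smaller model with no meaningful quality loss, and that the smaller model costs a fixed fraction of the frontier price. The function names are hypothetical, and applying the 10% price ratio to every scenario is an assumption made for illustration.

```python
def blended_cost_ratio(routable_share: float, small_model_price_ratio: float) -> float:
    """Blended bill as a fraction of an all-frontier baseline (baseline = 1.0).

    routable_share: fraction of workload that can run on the smaller model.
    small_model_price_ratio: smaller-model price divided by frontier price.
    Both inputs are assumptions supplied by the caller, not measured values.
    """
    if not 0.0 <= routable_share <= 1.0:
        raise ValueError("routable_share must be between 0 and 1")
    if small_model_price_ratio < 0.0:
        raise ValueError("small_model_price_ratio must be non-negative")
    frontier_part = 1.0 - routable_share                   # still billed at full price
    small_part = routable_share * small_model_price_ratio  # billed at the discounted price
    return frontier_part + small_part


def savings_share(routable_share: float, small_model_price_ratio: float) -> float:
    """Fraction of the baseline bill saved by routing to the smaller model."""
    return 1.0 - blended_cost_ratio(routable_share, small_model_price_ratio)


if __name__ == "__main__":
    # Routable shares quoted in the episode: 70%, 90%, and 95% of workload.
    for share in (0.70, 0.90, 0.95):
        ratio = blended_cost_ratio(share, 0.10)
        print(f"{share:.0%} routable: {ratio:.1%} of baseline, {1.0 - ratio:.1%} saved")
```

Under these assumptions, routing 95% of the workload at a 10% price ratio leaves a blended bill of 14.5% of baseline, and routing 70% leaves 37%. The frontier share that cannot be moved dominates the remaining bill, which is why the routable percentage is the number worth asking for first.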

Trends

Hardware specialization: chip vendors splitting product lines between training and inference optimization
Model stratification: frontier models for hard cases, distilled/smaller models for 95% of routine queries
Mixture of Experts architecture adoption: routing queries to specialized submodels instead of monolithic models
Open-weight model competition: credible alternatives undercutting proprietary model pricing by 90%+
Inference cost deflation: per-query pricing entering sustained decline after 18 months of increases
Vendor margin pressure: cloud providers acquiring inference optimization startups to compete on cost
Contract renegotiation wave: enterprises gaining leverage to revisit multi-year AI agreements mid-term
Cost transparency shift: industry moving from opaque 'AI cost' buckets to itemized training vs. inference pricing
Distilled model adoption: enterprises discovering 70-90% of workloads can run on cheaper smaller models
Valuation pressure on AI labs: $900B+ valuations only sustainable if inference economics improve at scale

Topics

AI inference pricing economics
Training vs. inference cost distinction
Hardware specialization for AI workloads
Model distillation and smaller models
Mixture of Experts architecture
Open-weight model competition
AI vendor contract renegotiation
Cloud chip development (TPU, Trainium, Inferentia)
Per-token and per-request pricing models
Enterprise AI cost optimization
Frontier vs. specialized model selection
Multi-year AI contract economics
CFO AI budget forecasting
Cloud margin compression
AI workload routing and optimization

Companies

Google Cloud

Announced TPU-8T/8i split, first major vendor to publicly separate training and inference chip lines

NVIDIA

Splitting product line between training and inference chips, following Google's hardware specialization model

Amazon Web Services

Offering Trainium and Inferentia chips, implementing training/inference hardware split strategy

AMD

Participating in industry-wide trend of splitting product lines for training vs. inference optimization

Nebius

European cloud provider acquiring EigenAI ($643M) and Tavoli ($275M) to build inference-optimized stack

EigenAI

Startup acquired by Nebius for $643M; specializes in making AI inference faster and cheaper

Tavoli

Company acquired by Nebius for $275M; part of inference optimization stack strategy

DeepSeek

Chinese AI lab releasing V4 open-weight model at fraction of frontier model cost, creating pricing pressure

Anthropic

Raising $50B at $900B+ valuation; success depends on making inference economics work at industrial scale

OpenAI

Frontier model provider (GPT 5.5) competing with DeepSeek on reasoning capabilities and cost efficiency

People

Stephen Forte

Host delivering analysis of AI inference cost economics and vendor contract renegotiation strategy

Quotes

"For 18 months, the story has been the same. AI is expensive and getting more expensive. That story has inverted. It inverted last week and most of your vendors are quietly hoping you do not notice."

Stephen Forte•Opening

"Training is medical school. Inference is every patient visit for the next 40 years. Medical school is brutally expensive. It is also a one-time cost. The patient visit is what actually pays the bills."

Stephen Forte•Mid-episode

"Industry estimates put inference at north of 90% of what an enterprise actually pays over the life of a deployment. The training number is the headline. The inference number is the bill."

Stephen Forte•Mid-episode

"You do not need a neurosurgeon to read your blood pressure. AI vendors are finally pricing accordingly."

Stephen Forte•Late episode

"For 18 months, you have been the seller's customer. As of last week, you are the buyer. Act like one."

Stephen Forte•Closing

Full Transcript

Welcome to the AI Brief from the YPO Technology Network. I'm Stephen Forte. For 18 months, the story has been the same. AI is expensive and getting more expensive. Compute is scarce. Models are bigger. Cloud bills are up. Every CFO in the country has a slide that ends with a hockey stick and an apology. That story has inverted. It inverted last week and most of your vendors are quietly hoping you do not notice.

Here's the contract for today's episode. I am going to teach you the single most important distinction in AI economics, walk you through the four pieces of evidence that the price floor is cracking, and give you three things to do this week with the contracts already sitting in your legal folder. Plain English, no jargon. By the end, you will know more about how AI actually gets billed than most of the people selling it to you.

Let's start with the distinction nobody bothered to explain to you. There are two costs in AI. They are not the same cost. They are not even close to the same cost. And the people pricing your contracts are counting on you treating them as one bucket.

Cost number one is training. Training is teaching the model. It happens once. It costs billions of dollars. It takes months, thousands of chips, and a power bill that would heat a small city. When you read in the Wall Street Journal that some lab spent $4 billion building a frontier model, that is training.

Cost number two is inference. Inference is using the model after it is trained. Every time your sales rep asks the chatbot a question, every time your support agent generates a reply, every time a document gets summarized, a contract gets reviewed, an invoice gets categorized, that is inference. It happens billions of times. It happens forever.

Here is the analogy. Training is medical school. Inference is every patient visit for the next 40 years. Medical school is brutally expensive. It is also a one-time cost.
The patient visit is what actually pays the bills, what scales with demand, and what determines whether the practice makes money or not. If you confuse the two, you misprice everything. That is exactly what has been happening to your AI budget. Most CEOs have been quoted a single number called AI cost, and most of that number is inference dressed up as something more exotic. Industry estimates put inference at north of 90% of what an enterprise actually pays over the life of a deployment. The training number is the headline. The inference number is the bill.

Now, here is why I am bringing this up on a Tuesday in May. For 18 months, inference cost was the locked-in part of the equation. Models got bigger. GPUs got scarcer. And the per-query price you paid was, if anything, drifting up. Then four things happened in 11 days, and together they tell you the floor is moving.

First, on April 22nd, Google Cloud announced its eighth generation of custom AI chips. And they did something they had never done before. They split the chip line in two. One chip, the TPU-8T, is for training. The other, the TPU-8i, is for inference. Two products, same generation, different jobs.

That split matters more than the chips themselves. It is Google saying out loud what the industry has been quietly admitting for a year. Training and inference need different hardware. Training chips are like Formula One cars. Massive power, massive memory, built to learn under absurd load. Inference chips are like a fleet of delivery vans. Lower power, optimized for speed and parallel queries, cheaper per trip. You do not race a delivery van and you do not deliver packages with an F1 car.

NVIDIA is doing the same split. Amazon is doing it with Trainium and Inferentia. AMD is doing it. Once your chip vendor draws a line down the middle of its product line, the per-query cost of running AI starts falling. That is what is happening right now.
Second, on May 1st, a European cloud provider called Nebius agreed to buy a startup called EigenAI for $643 million. $98 million in cash, the rest in stock. Eigen does one thing: it makes AI chips run inference faster and cheaper. That is the entire pitch. That is the entire company. Nebius did not pay $643 million to make training a little better. They paid $643 million because the next 24 months of cloud margin will be decided on the inference side of the meter. Three months earlier, Nebius bought another company, Tavoli, for $275 million. Same theme, different layer. They are building a stack whose entire selling proposition is that running AI is going to get cheap and they want to be the ones doing the running.

Third, on April 24th, a Chinese AI lab called DeepSeek previewed its newest model, DeepSeek V4. Open weights, which means anyone can download it and run it themselves. DeepSeek claims V4 has, in their words, almost closed the gap with frontier reasoning models from OpenAI and Anthropic, and runs at a fraction of the cost. You do not need to take their word for the benchmarks. You need to take their word for what it does to negotiating leverage. When a credible open model lands at 10% of the price of the closed leader, every Western vendor has to discount or explain why they are not.

Fourth, Anthropic, the AI lab behind Claude, is in talks to raise around $50 billion at a valuation north of $900 billion. Larger than ExxonMobil, larger than Walmart, larger, frankly, than most things that have a valuation. That round only pencils if Anthropic can make inference economics work at industrial scale. Investors are not paying $900 billion for a science project. They are paying for the assumption that the cost of every Claude query is going to keep falling and the volume is going to keep climbing. That is the bet.

Four data points, 11 days, one direction. And it gets one layer better, because the chips are not the only thing splitting in two. The models are splitting too.
The big frontier models, GPT 5.5, Claude Opus, are the neurosurgeons. Slow, expensive, you bring them the hard cases. The smaller distilled models, the haikus and minis and nanos, are the nurse practitioners. Fast, cheap. They handle 95% of the visits at 10% of the cost. And a newer architecture called Mixture of Experts takes that idea further. Instead of one giant model thinking about everything, the system routes your specific question to a small specialist submodel that is good at exactly that thing. DeepSeek V4 is built this way, and so is most of what is shipping in 2026. You do not need a neurosurgeon to read your blood pressure. AI vendors are finally pricing accordingly.

Here is my read. Every multi-year AI contract you signed before May 2026 was priced under the old inference economics. The vendor knew it. You did not. That is not malice. That is how procurement works when one side understands the cost curve and the other side is trying to run a healthcare network or a manufacturing footprint. The asymmetry is closing this quarter.

If I were sitting in your seat this week, I would do three things.

One, pull every AI vendor contract signed in the last 18 months. Look for the inference pricing. It will be hiding under names like per token, per request, per seat, or per active user. Get those numbers on one page.

Two, walk into your CIO's office and ask one question. What percentage of our AI workload could run on a smaller or distilled model with no meaningful drop in quality? The honest answer in almost every company I have spoken to is north of 70%, sometimes north of 90%. That gap is your savings.

Three, open the renegotiation conversation now. Not at renewal. Now. Vendors fighting for share, and they are all fighting for share, will move on price ahead of renewal if you give them a reason. The reason is that you understand what they understand. If you wait for the contract to roll, you are paying 12 more months at last year's prices.
The training story made the headlines. The inference story makes the budget. For 18 months, you have been the seller's customer. As of last week, you are the buyer. Act like one. That is the YPO Tech Network AI Brief for Tuesday, May 5th, 2026. I am Stephen Forte. If this was useful, send it to a fellow member. I will be back tomorrow with more. Until then, stay sharp.
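
For listeners who want to see the Mixture of Experts idea from the episode in concrete form, here is a toy Python sketch of top-k expert routing. It is a simplified illustration, not DeepSeek's or any vendor's implementation. The gate scores are passed in by hand, whereas real systems learn them and typically route each token inside the network rather than each whole question. The function names, the four stand-in experts, and the choice of k = 2 are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw gate scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    if not 1 <= k <= len(gate_scores):
        raise ValueError("k must be between 1 and the number of experts")
    probs = softmax(gate_scores)
    # Highest probability first; ties keep their original order.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(
    x: float,
    experts: Sequence[Callable[[float], float]],
    gate_scores: Sequence[float],
    k: int = 2,
) -> float:
    """Run only the selected experts and blend their outputs by gate weight."""
    routes = route_top_k(gate_scores, k)
    return sum(weight * experts[i](x) for i, weight in routes)


if __name__ == "__main__":
    # Four stand-in "experts"; real experts are neural sub-networks.
    experts = [lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 3.0, lambda v: v ** 2]
    scores = [0.1, 2.0, -1.0, 2.0]  # hand-picked gate scores for illustration
    print(route_top_k(scores, 2))
    print(moe_forward(3.0, experts, scores, k=2))
```

The cost point is in `moe_forward`: only k of the experts run for a given input, so compute per query scales with k rather than with the total number of experts.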