Inference Got Cheap. Renegotiate Everything.
9 min
• May 5, 2026
Summary
AI inference costs have dropped sharply thanks to hardware specialization and new model architectures, reversing 18 months of rising costs. Existing vendor contracts were quietly priced under the old economics, which gives enterprises that understand the shift an immediate opening to renegotiate.
Insights
- Inference (model usage) accounts for an estimated 90%+ of what an enterprise pays over the life of a deployment, but it has been conflated with training costs, causing systematic mispricing across vendor contracts
- Chip vendors splitting product lines into training and inference hardware signals the industry recognizes these require fundamentally different optimization approaches
- Open-weight models like DeepSeek V4 at 10% of closed-model pricing create immediate negotiating leverage for any enterprise with vendor contracts
- Smaller distilled models and mixture-of-experts architectures can handle 95% of workloads at 10% of frontier model costs, representing massive untapped savings
- The buyer-seller dynamic inverted in May 2026; enterprises now have pricing power and should renegotiate immediately rather than wait for renewal
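
The routing arithmetic behind the distilled-model insight can be sketched in a few lines. This is a minimal illustration, not a pricing model: the function name `blended_cost_ratio` is hypothetical, and it assumes every query costs the same on a given tier and that routing itself adds no overhead. The 95% and 10% figures come from the insights above; the 70% and 90% shares are the range quoted under Trends.

```python
def blended_cost_ratio(routed_share: float, cheap_cost_ratio: float) -> float:
    """Return blended spend as a fraction of an all-frontier baseline.

    routed_share:     fraction of queries sent to the cheaper model (0..1)
    cheap_cost_ratio: cheaper model's per-query cost relative to the
                      frontier model (0.10 means 10% of frontier cost)

    Assumption (illustrative only): uniform per-query cost within each tier
    and no routing overhead.
    """
    if not 0.0 <= routed_share <= 1.0:
        raise ValueError("routed_share must be between 0 and 1")
    if cheap_cost_ratio < 0.0:
        raise ValueError("cheap_cost_ratio must be non-negative")
    frontier_share = 1.0 - routed_share
    # Cheap tier pays cheap_cost_ratio per query; frontier tier pays 1.0.
    return routed_share * cheap_cost_ratio + frontier_share * 1.0


if __name__ == "__main__":
    # Shares taken from the episode notes: 70-90% (Trends) and 95% (Insights).
    for share in (0.70, 0.90, 0.95):
        ratio = blended_cost_ratio(share, 0.10)
        print(f"{share:.0%} routed -> {ratio:.1%} of baseline spend")
```

Under these assumptions, routing 95% of queries to a model priced at 10% of the frontier rate leaves blended spend at about 14.5% of the all-frontier baseline. The remaining 5% of hard cases accounts for roughly a third of that bill, so the savings are sensitive to how much traffic really needs the frontier model.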
Trends
- Hardware specialization: chip vendors splitting product lines between training and inference optimization
- Model stratification: frontier models for hard cases, distilled/smaller models for 95% of routine queries
- Mixture of Experts architecture adoption: routing queries to specialized submodels instead of monolithic models
- Open-weight model competition: credible alternatives undercutting proprietary model pricing by 90%+
- Inference cost deflation: per-query pricing entering sustained decline after 18 months of increases
- Vendor margin pressure: cloud providers acquiring inference optimization startups to compete on cost
- Contract renegotiation wave: enterprises gaining leverage to revisit multi-year AI agreements mid-term
- Cost transparency shift: industry moving from opaque 'AI cost' buckets to itemized training vs. inference pricing
- Distilled model adoption: enterprises discovering 70-90% of workloads can run on cheaper smaller models
- Valuation pressure on AI labs: $900B+ valuations only sustainable if inference economics improve at scale
Topics
- AI inference pricing economics
- Training vs. inference cost distinction
- Hardware specialization for AI workloads
- Model distillation and smaller models
- Mixture of Experts architecture
- Open-weight model competition
- AI vendor contract renegotiation
- Cloud chip development (TPU, Trainium, Inferentia)
- Per-token and per-request pricing models
- Enterprise AI cost optimization
- Frontier vs. specialized model selection
- Multi-year AI contract economics
- CFO AI budget forecasting
- Cloud margin compression
- AI workload routing and optimization
Companies
Google Cloud
Announced TPU-8T/8i split, first major vendor to publicly separate training and inference chip lines
NVIDIA
Splitting product line between training and inference chips, following Google's hardware specialization model
Amazon Web Services
Offering Trainium and Inferentia chips, implementing training/inference hardware split strategy
AMD
Participating in industry-wide trend of splitting product lines for training vs. inference optimization
Nebius
European cloud provider acquiring EigenAI ($643M) and Tavoli ($275M) to build inference-optimized stack
EigenAI
Startup acquired by Nebius for $643M; specializes in making AI inference faster and cheaper
Tavoli
Company acquired by Nebius for $275M; part of inference optimization stack strategy
DeepSeek
Chinese AI lab releasing V4 open-weight model at fraction of frontier model cost, creating pricing pressure
Anthropic
Raising $50B at $900B+ valuation; success depends on making inference economics work at industrial scale
OpenAI
Frontier model provider (GPT 5.5) competing with DeepSeek on reasoning capabilities and cost efficiency
People
Stephen Forte
Host delivering analysis of AI inference cost economics and vendor contract renegotiation strategy
Quotes
"For 18 months, the story has been the same. AI is expensive and getting more expensive. That story has inverted. It inverted last week and most of your vendors are quietly hoping you do not notice."
Stephen Forte • Opening
"Training is medical school. Inference is every patient visit for the next 40 years. Medical school is brutally expensive. It is also a one-time cost. The patient visit is what actually pays the bills."
Stephen Forte • Mid-episode
"Industry estimates put inference at north of 90% of what an enterprise actually pays over the life of a deployment. The training number is the headline. The inference number is the bill."
Stephen Forte • Mid-episode
"You do not need a neurosurgeon to read your blood pressure. AI vendors are finally pricing accordingly."
Stephen Forte • Late episode
"For 18 months, you have been the seller's customer. As of last week, you are the buyer. Act like one."
Stephen Forte • Closing
Full Transcript