Open-Weight AI Models

50 min

Apr 28, 2026

Summary

Benny Chen, co-founder of Fireworks AI, discusses how open-weight AI models are becoming competitive alternatives to closed-source models like GPT-4. Fireworks provides infrastructure for serving, fine-tuning, and deploying open-source models at scale, processing 13 trillion tokens daily, with custom optimizations like FireAttention kernels and speculative decoding that enable cost-effective production deployments.

Insights

Open-weight models became price competitive with closed-source APIs at the start of this year; agent tools like OpenClaw showed how quickly frontier-model token bills add up, making open-weight models viable for production workloads when cost-per-token matters
Reinforcement fine-tuning (RFT) is a paradigm shift that democratizes model customization—product managers can now author evaluations without ML teams, eliminating the need for expensive data labeling workflows
Evaluation frameworks are undervalued assets; companies that invest in clear eval definitions gain vendor independence and can switch between model providers based on performance, not lock-in
Multi-hardware support (NVIDIA + AMD) is critical infrastructure strategy for supply chain resilience and cost optimization, not just vendor loyalty—even Meta used dual-supplier strategies for procurement
Numerical consistency between training and inference kernels is often overlooked but fundamental to RL success; many published algorithms fail in practice due to numeric misalignment, not algorithmic flaws

Trends

Open-weight models converging toward closed-source performance as training recipes and datasets become more transparent and shared across the community
Reinforcement learning becoming the primary scaling lever as pre-training compute hits diminishing returns and electricity costs constrain further scaling
Evaluation-driven development replacing dataset-centric fine-tuning; companies building proprietary eval frameworks as competitive moats rather than labeled datasets
Vendor consolidation risk in AI infrastructure as frontier labs (Anthropic, OpenAI) launch competing products (Claude, ChatGPT) that disintermediate inference providers
Production trace data from real workloads becoming more valuable than benchmark suites for model selection and evaluation, shifting from synthetic to production-grounded evals
Multi-hardware inference becoming table stakes for cost optimization; single-vendor lock-in creates pricing risk as GPU supply constraints ease
Speculative decoding and custom kernels becoming standard optimization techniques rather than research novelties, enabling 10-100x latency improvements for specific workloads
Smaller open-weight models (7B-13B) becoming economically viable for enterprise use cases through fine-tuning and RL, reducing reliance on frontier model APIs

Topics

Open-Weight AI Models
Inference Infrastructure and Optimization
Reinforcement Fine-Tuning (RFT)
Model Evaluation Frameworks
Speculative Decoding
Custom CUDA Kernels
Multi-Hardware Support (NVIDIA/AMD)
Cost Optimization for LLM Inference
Supervised Fine-Tuning vs. Reinforcement Learning
Production Trace-Based Evaluation
Model Selection and Benchmarking
Data Privacy and On-Premise Deployment
Scaling Laws for Language Models
Supply Chain Resilience in AI Infrastructure
Numerical Consistency in ML Systems

Companies

Fireworks AI

Guest's company; platform for serving, fine-tuning, and deploying open-weight models at scale with custom inference o...

Meta

Guest worked on ML infrastructure, ASICs, recommendation systems, and PyTorch enablement for 8 years before founding ...

OpenAI

Mentioned as closed-weight model provider; released GPT-3.5 and operates managed API service; also open-sourced GPT-2

Anthropic

Closed-weight model provider competing with OpenAI; recently cut off Windsurf access, exemplifying vendor-as-competit...

Cursor

Major Fireworks customer using custom models and speculative decoding for fast code completion and editing features

NVIDIA

GPU manufacturer; investor in Fireworks; produces A100, H100 chips; dominant inference hardware provider with supply ...

AMD

Secondary GPU vendor; Fireworks investor; supported via custom kernels to provide multi-hardware cost optimization fo...

Mistral

European open-weight model provider; produces competitive models served on Fireworks platform

Notion

Named in the TurboPuffer sponsor read as a company shipping search features on TurboPuffer; not described as a Fireworks customer

Atlassian

Named in the TurboPuffer sponsor read as a company shipping search features on TurboPuffer; not described as a Fireworks customer

Ramp

Named in the TurboPuffer sponsor read as a company shipping search features on TurboPuffer; not described as a Fireworks customer

Perplexity

Search competitor to Google; mentioned in context of vendor-as-competitor dynamics in AI products

Google

Mentioned as search competitor to Perplexity; its Gemini API was cited at roughly 10-11 trillion daily tokens, below the 13 trillion figure Fireworks shared

Intel

Collaborated with Meta on ASIC development for recommendation systems in 2017

xAI

Elon Musk's AI company; mentioned as emerging model provider with competitive offerings

Vercel

Fireworks customer achieving 40x faster code fixing with reinforcement fine-tuning

Together AI

Implied competitor in open-weight model serving space

Hugging Face

Implied as model hub/repository for open-weight models discussed throughout episode

People

Benny Chen

Guest discussing Fireworks' platform, his 8-year tenure at Meta's ML infrastructure teams, and open-weight model stra...

Gregor Vand

Podcast host conducting interview; security-focused technologist based in Singapore

Peter Steinberger

Created OpenClaw framework; recently joined OpenAI; praised for his work on open-source AI tooling

Dario Amodei

Mentioned for public statements on pre-training improvements and scaling laws

Elon Musk

Mentioned as founder of xAI; referenced in context of model provider selection

Quotes

"Open weight models give organizations direct control over how the models are deployed and used. Importantly, the performance of these models is steadily improving, and they've become credible alternatives for production workloads."

Episode introduction•Opening

"At the end of the day, if you know how to evaluate your model, you have all the power. You get to decide which supplier to use in what setting and when."

Benny Chen•Mid-episode

"Reinforcement learning was a new paradigm that the industry found to keep the scaling going. Because even for relatively small models, if you can push RL, you can get really good results out of the model."

Benny Chen•Late-episode

"The only important bit is can you pull your traces out of your production system? And then can you articulate what is good or bad?"

Benny Chen•Mid-episode

"I really don't think a lot of those are competitor related in the sense that a lot of our customers are making honestly a lot of money and they just want to make sure that we handle those complexities for them."

Benny Chen•Late-episode

Full Transcript

Open weight models are AI systems whose trained parameters are publicly released, which allows developers to run, fine-tune, and deploy them independently rather than accessing them only through a hosted API. While closed weight models from companies like OpenAI or Anthropic are delivered as managed services, open weight models give organizations direct control over how the models are deployed and used. Importantly, the performance of these models is steadily improving, and they've become credible alternatives for production workloads, with advantages in customization and data privacy.

Fireworks AI is building a platform focused on serving and customizing open-weight models at scale. The platform includes optimized inference infrastructure, multi-hardware support across NVIDIA and AMD, and reinforcement fine-tuning capabilities. Benny Chen is a co-founder of Fireworks AI. In this episode, he joins Gregor Vand to discuss his path from Meta's ML infrastructure teams to co-founding Fireworks AI, why open-weight models are becoming increasingly competitive, how custom kernels and speculative decoding improve performance, reinforcement fine-tuning, and much more.

Gregor Vand is a security-focused technologist, having previously been a CTO across cybersecurity, cyber insurance, and general software engineering companies. He is based in Singapore and can be found via his profile at van.hk or on LinkedIn.

Hello and welcome to Software Engineering Daily. My guest today is Benny Chen. Welcome, Benny.

Thanks for having me.

Yeah, great to have you here. So we're going to be talking all about Fireworks AI, which is a company that I believe you co-founded, I think three, three and a half years ago. Is that right?

Yeah.

Nice. Yeah. So before we dive into Fireworks AI, and I think especially today, this is quite pertinent to sort of where maybe Fireworks AI came from.
You spent a lot of time at Meta and in their ML team, which means you were doing things with ML probably way before many of us really had it on our radar. But what's been your path through software engineering, and especially the Meta phase as well?

Yeah, it's definitely been a while. So let me go through what the journey was like in the beginning. In the very beginning, I joined as a software engineer in 2014 on the integrity team, where most of, I would say, the non-recommendation-system experiments started. Early on, it was decision trees for different fraud behaviors. And then also on the team, we started doing image classifiers for different rules for advertising. And then come 2016, I switched over to the ads infrastructure, so I worked on supporting the recommendation system models. And then in 2017, I think the Facebook leadership back then started thinking about having an ASIC in-house, more like Google's TPU. So we started collaborating with Intel back then on an ASIC for recommendation systems. So is there anything about like doing something that's not cool before everyone remembers? Definitely doing ASICs in 2017 wasn't necessarily cool. But yeah, it was an interesting project for sure. And I always tell people that the ASIC I worked on supporting back then was 17 watts.
So if you look at new NVIDIA GPUs, that's like a thousand watts. Some of the peripherals on that chip are more than 17 watts. But it was a very small chip. It's like, how many years ago? Almost nine years ago. Nine years ago at this point. So time really flies and things really change quickly in the Bay Area, I guess. But yeah, I worked on supporting that chip for about two years, and then fast forward, NVIDIA started shipping A100s. That's when everyone realized, hey, I guess ASICs are not going to be as good as A100s. And I worked on supporting the PyTorch enablement for all the ads models in 2019, 2020, around the pandemic time. But yeah, NVIDIA also wasn't a huge company back then, so I probably should have just loaded up on NVIDIA stock rather than doing anything else. And after that it was all NVIDIA GPUs for another two years, until we decided that it was time to start something new and started working on Fireworks.

Yeah, nice. Because I was going to ask, I guess, why leave Meta? I mean, everyone's always got their own reasons for doing their tenure at one of the big tech companies and then starting their own thing. But was there a defining moment for you? Was it just like, the time is now?

Yeah. To be frank, I probably could have premeditated more. I probably could have thought it through more analytically. At the same time, I did think AI infrastructure would take off. We in fact started before ChatGPT came out. So we couldn't have timed it better.
Like, we started maybe five, six months before ChatGPT was shipped. But yeah, it's not so much that there was any particular trigger, but definitely being at Meta for about eight years, I mean, all good shows must end, right?

Yeah, absolutely.

And for me, I think I'm also maybe a little bit more risk-prone than some of my friends. I do think taking on more risk is a good thing. So I probably didn't think it through as hard as I should have, in retrospect.

Yeah, but I mean, that's where some of the best things come from. And that's probably why we're sitting here today speaking about Fireworks AI. So let's get on to what you did next. Let's just start at a super high level: what is Fireworks AI?

Fireworks is a platform that serves and trains open source models, and we've mostly stuck to that mission from when ChatGPT shipped till today. And to be frank, our play hasn't been super sophisticated for our customers. At the same time, the work itself is very complicated and a lot of our customers appreciate the support we're able to give them. Specifically, a lot of our customers are either AI native or very AI-leaning enterprises who are looking to either offer products that are based on language models or automate a large part of their organization through large language models. We're here to help them customize their open source models and then scale them up. A lot of startups we work with also start with the frontier models. And as they scale, they need to make sure that their unit economics are good. And we're here to help customize open source models so they can reduce their total cost of ownership for their customers and start making money. In general, I really appreciate a lot of our customers' support, for having trust in such a young company versus all the big clouds and other offerings out there. And yeah, we're here to help.

Nice.
And I think usually we kind of go into customer stories later in episodes, I guess. But I think one that's maybe useful to pull out now to help set the scene for the listeners is Cursor. So I believe that you guys do work with Cursor. I'm taking a guess that code completion is one area that you guys supply to Cursor. Could you just talk a little bit about that? Because I think that's going to help set the scene for the rest of the episode, just in terms of what Fireworks actually does and for whom.

Yeah, honestly the Cursor people are amazing and we're mostly here to help support them. I think early on, a lot of the work around supporting Cursor was their custom models, like tab and editing models. And we helped design solutions that would work very fast with the model in a cost-effective way. I think one thing that we publicized early on was the fast-apply model we serve for Cursor, which required a lot of special support for speculative decoding in order to support the model properly with fast decoding. So in those settings, inside the editor, you want to edit a very large file in one go. A lot of times there are a lot of nuances in how to serve those kinds of models. And we worked with them to set up a sort of dedicated algorithm so that we can serve those models very, very cost effectively.
Yeah, and I think recently Cursor also published a blog post about how to do online learning with their tab models. Yeah, they are a very sophisticated team.

Yeah, no, I mean, I've been a fan of Cursor for a long time, I think almost since it came about. Unfortunately I don't do as much hands-on coding today as I did even 18 months ago, but that's just a function of where I've gone in life. But yeah, I still love Cursor as a product and still use it if I do code. So let's walk through some of, I guess, the features, if you like, of how Fireworks's inference infra actually operates. I mean, just to give some idea of scale here, and I probably should have mentioned this at the start, but you guys process something like 13 trillion tokens a day. Is that right?

Yeah, yeah. And I think I'm probably not here to share the latest numbers, but at the same time, I think the 13 trillion number was larger than the numbers Gemini and OpenAI shared for their APIs. I think Gemini was like 10, 11 trillion, give or take. And we've been growing quite a bit since we last shared a number as well. But yeah, open source models are very strong. Definitely, it was a leap of faith for us to focus on open source models. At the same time, I'm very pleasantly surprised at how strong the open source models have been. Getting to a point where they are price competitive against closed source models this year, just in the beginning of this year, is amazing. I do think a lot of open source models last year showed promising benchmark scores, but were not competitive end-to-end. And I think what really helped set the scene this year is when OpenClaw came out, where one, people realized, wow, these models are amazing. Two, wow, these models are expensive. Like, I had a friend who set up OpenClaw and he's like, yeah, this is so good. At the same time, I don't know why one message cost me like a million tokens. Just send one message on Telegram and then a million tokens gone.
He doesn't even know where they went. He just sees the money going away. And two, yeah, a lot of the open source models that are offered are very competitive in a real OpenClaw setting, where you don't necessarily want the most expensive model just to book you a restaurant. You probably just want something normal that can book you a restaurant, but at one-twentieth of the price. So, yeah, I think there's a lot of competition heating up. And Fireworks is here to support the open source model thing as much as we can.

And I mean, would it be fair to say that because you came from Meta ultimately, and they had been working... well, I'm trying to line up the timelines here, which is Llama. Was that really a thing internally when you were still at Meta, or did that only really come out later? What I'm getting at is, were you already working on something Llama-related before you left, and that got your brain jogging on these open-weight models, or not really?

Maybe the closest answer is kind of. So when I was working at Meta, the Llama, or like the large language model program, wasn't really that well funded. So it was different from where OpenAI was doing the YOLO run for GPT-4 with, I forgot, like thousands of A100s. There was nowhere near as much of a commitment at Meta back then as there is now. But I worked a lot on recommendation system models that had transformers. And there was a clear sign where you just throw more compute at the problem and the model just gets better. The ROI for that increased compute may or may not be worth it. At the same time, there was a clear trend. And so while people were talking about scaling laws for language models, I guess we were sort of seeing the scaling law for recommendation system models in real time as well. So with the migration from the tiny 17-watt ASIC to the 200-300-watt A100, there was a clear payoff for that.
And then continuing to migrate to H100, there was also a clear payoff for that as well. So yeah, I guess I was working in related fields and seeing the scaling law working, but in a different way. At the same time, I think the commitment to, or the belief in, open source models seems obvious right now, but I think it was definitely contrarian three years ago. People forget the models we had back then were OPT, Llama 1, and Falcon. And if those models could hold a three-turn conversation, that was already amazing. And people also forget that three years ago there was no function calling. We also worked on open sourcing function calling models, which wasn't really a straightforward thing back then. So yeah, I would say it wasn't straightforward at all to say, hey, Llama is here and it's here to stay. In fact, it's not here to stay anymore. But the belief we had, because we worked on open source software for so long, was that the models are more like software rather than some kind of hardware setup. I think that leap of faith was important for us and carries a lot of weight even today.

Yeah, and I realize we're sticking a lot on models generally, but I think this is quite interesting just for a second. First of all, just to clarify, you mentioned open source models. Is there a distinction here? Are you using that interchangeably with open weight?

Yeah, I am using them interchangeably, and all the people out there who really understand the distinction will, I think, hate me for using these terms interchangeably. I'm more of the attitude where I don't really make those distinctions when the goodies are still flowing, to be honest. As long as the final artifact is produced, I think it's good. But I have huge respect for both the OLMo team and the Nemotron team. The OLMo team is at Allen AI.
They have funding from NSF as well as, I think, other sources. And they really publish all the results as well as all the intermediate artifacts. I think that's really good foundational work for everyone working in the field. And then I think NVIDIA also is really committing to keep pushing on open source models, and they also publish all their training code or their recipes to benefit the community.

Yeah, absolutely.

TurboPuffer is how companies like Anthropic, Cursor, Notion, Atlassian, and Ramp ship their most ambitious search features. TurboPuffer is a serverless vector and full-text search engine built on object storage. It's up to 95% cheaper than traditional search databases and just as fast. With TurboPuffer, you can index and search 50 million documents at 10-millisecond P90 query latency for less than $100 per month. Head to turbopuffer.com slash S-E-D to get your first month free.

In mobile application security, good enough is a risk. GuardSquare uses advanced, multi-layered code hardening techniques and automated runtime application self-protection and mobile application security testing, combined with real-time threat monitoring, to deliver the highest level of mobile app security. Discover how GuardSquare brings all these together to provide mobile app security for your Android and iOS apps without compromise at www.guardsquare.com.

Today's episode of Software Engineering Daily is brought to you by Unblocked. Your coding agents have access to your code base. Maybe you've even connected other tools via MCPs. But access doesn't mean context. Agents can't reason across MCPs. They don't know your architectural decisions, your team's patterns, or why the API was shaped the way it is. So agents look in the wrong place and deliver bad outputs. Then you spend time correcting, turn after turn. Unblocked is the context layer your agents are missing.
It synthesizes your PRs, docs, Slack, and tickets into organizational context that agents actually understand, so they make better plans, write higher quality code, use fewer tokens, and require fewer correction loops. If you are running Claude Code, Cursor, or any agentic workflow, Unblocked is worth a look. Get a free three-week trial at getunblocked.com slash sedaily.

So yeah, I guess to keep some of our listener base happy, when we talk about open weight models, which is a lot of, I guess, what Fireworks is working with, if I look at most of those names, especially on the Fireworks platform site, but I think just generally known, they're mostly coming, would it be fair to say, from Chinese companies and Chinese tech? Is that a good way to say it?

Yeah, yeah.

So, I mean, as someone who's based in Asia, I fully appreciate just the amazing talent and innovation that comes out of this part of the world. And I think it is helpful to maybe just talk through, have you hit any bumps with the fact that you are a US company and then pushing models like this from that part of the world? I'm just curious how that looks, because it's something that obviously gets talked about a lot, maybe in the news even, especially when DeepSeek landed, if you want to call it that. And I think all of that has kind of settled down to some extent. But what was your thinking around all of that when putting these models into Fireworks?

Yeah, I think that's a good question. So I would say 90% of our customers don't really care about the origin of the model, especially if they are doing fine-tuning, because they will be running these models in very specific environments. For example, for a coding environment, often you don't really care about the political preference of these models because you're just writing React, Rust, you know, it doesn't really matter as much.
And then definitely for certain customers who are more consumer facing, they are much more aware of the origin of the model, and we are more constrained to serve things like Nemotron, GPT-OSS, these kinds of American models. I do think Fireworks is here to help support our customers and we don't really judge why they are making these distinctions. We're just here to help support them as much as we can. The Mistral people also do a great job and they also publish very strong models, and sometimes we also serve Mistral models as well. But yeah, I will say, one thing I keep chatting with my colleagues about is that it's like a year on and GPT-OSS is surprisingly competitive today amongst all the American models. So huge respect for OpenAI as well for open sourcing the model back then. But yeah, there are new models coming out every day from NVIDIA, from OLMo. Those are also getting more competitive, and I do think at the end of the day, given the same dataset and the same model size, the results will converge. And then the datasets should converge as people start sharing more and more of those intermediate artifacts and recipes. So I bet the American models will catch up this year very quickly. I honestly feel like there are so few secrets out there today. I feel like there are so many conversations I'm in where people think they have alpha and then they sort of exchange notes and are like, oh, I didn't realize some other people are doing the same thing. But yeah, I do think a lot of things will converge and American models will be very competitive this year.

Yeah. And I guess Fireworks has a platform, which we're going to get into in more detail in a second, but I guess part of your offering is to help your customers understand which model would suit their case best.
Because I think, at least to me, when I was doing a bit more hands-on work, call it 12 months ago, trying to pick through which model, and why even, okay, this is kind of the same model, but one's like, you know, 4B and one's 16B or whatever. This is, I'm sure, where you're actually able to give more direction and input to your customers, given the size of them. And you can't really make the, quote, wrong decision on these things.

Yeah, that's a good question. How we talk with our customers about when to use which model depends on the use case quite a bit. I would say for probably a third of the use cases, our customer is more sophisticated than us. They have run the evaluations in-house. They explicitly come to us and are like, this is the model you need to serve. Don't argue with me. I will pay you for this. Just serve this model. And then one third is somewhere in the middle, where they kind of know they're going to pick between two or three models. They mostly need some judgment call on the cost of serving, on whether it scales on our platform. So they run the evaluation locally and then they just need to know the cost or the scalability of the setup. And then the last one third is where the customer is really looking to us for advice on which model to use for which use case. And in those cases, we share our internal evaluation results and try to paint a full picture, like, hey, this model is much better at coding. This model is much more malleable and a good starting point for reinforcement learning. And then we just show them the data points and leave them the judgment call. To answer that question more concretely, a lot of choices are more nuanced depending on what you want to do. For example, Kimi is a very big model. So it's very easy to do reinforcement learning or fine-tune on it.
Easy as in, infrastructure-wise it's difficult, but it's easier to push results out of a big model than to try to RL a small model and get the small model to be smarter. At the same time, the evaluation initially may be worse. So then it is not going to be obvious to our customers that, hey, Kimi is a much better starting point than a lot of other models. But for example, when some of our customers are just looking to serve open source models and not customize, then if it's a coding use case, GLM and MiniMax are both great, and then it just depends on the cost effectiveness for their use case. So we try to help paint a full picture for our customers and still rely on them to decide what's best for them.

Yeah, no, it's very interesting. I maybe hadn't appreciated what you said towards the beginning, that a good chunk of people come and just say, this is the model, already figured that one out, just run it for me. So let's get on to the just-run-it-for-me bit. You have quite advanced infra, as it has to be, running all of this. I believe you have a piece of the puzzle called FireAttention, for example. Could you walk us through what that is? Let's just start there and we can go through some of these interesting bits of the Fireworks stack.

Yeah, we set up our own kernels in-house for a few reasons. One, a lot of kernels in the open source world are not numerically pristine, in the sense that it will work, but it may not work in the way that you expected, or it may not be working for the reason you think it's working. So there are always new kernels coming out every which way. We try to make sure our customers, when they use Fireworks to serve the model, don't have to think about all the complexity of all these weird kernels. They just understand that we provide the best trade-off between quality and speed.
We'll push as much as we can on speed, but not compromise on quality. And a lot of our customers appreciate that. That's sort of the first reason why we set up the kernels in-house. The second reason is we have a very, I would say, expensive but important commitment to do multi-hardware. So we work with the AMD team quite a bit to make sure we can properly support their hardware. And not everyone is willing to sink the time and effort into supporting different hardware. So yeah, we set up those AMD kernels in-house as well, just to make sure that we can serve on different hardware and try to find the best price for our customers. And maybe the last reason I can go through for these internal FireAttention kernels is that, because we do a lot of reinforcement learning workloads, making sure that the training-inference inconsistency is minimized is very, very important. You see a lot of fancy algorithms getting published every day. There are so many variants of GRPO at this point, I bet if you put two random letters in front of it, there's probably a paper for it. I would not know where it came from, but there's probably a paper on it. But all three letters for certain people. But from our observation, those algorithms are important, but not as important as controlling the numerics and making sure the numerics are aligned across training and inference. So you see a lot of headlines on, hey, there's a reinforcement learning stack where we use a trainer from one stack and inference from another stack, and then it should just work for a big model. From our experience, it's far from it. We really take it seriously and make sure that we align the numerics across training and inference for those kernels. That's why we have to have those in-house and that's why we have to spend all the sweat and tears. So then our customers don't have to think twice and be like, hey, did these people do their job? And is this RL run not going up because their numerics suck?
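To make the training-inference numerics point concrete, here is a minimal sketch in plain Python. It is not Fireworks code and says nothing about how FireAttention works. It assumes a toy setup in which the "trainer" computes a token log-probability in double precision and the "inference kernel" computes the same quantity with every intermediate rounded to half precision; the helper names (`to_fp16`, `log_softmax_ref`, `log_softmax_lowp`, `sequence_gap`) and the synthetic logits are hypothetical. With an unchanged policy, the sequence-level importance ratio used by PPO- or GRPO-style updates should be exactly 1, so any drift here comes purely from numerics.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back (stand-in for a low-precision kernel)."""
    return struct.unpack("e", struct.pack("e", x))[0]


def log_softmax_ref(logits, idx):
    """Log-probability of token idx, computed in double precision (the 'trainer')."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return logits[idx] - lse


def log_softmax_lowp(logits, idx):
    """Same formula, but each intermediate is rounded to fp16 (the 'inference kernel')."""
    m = max(logits)
    acc = 0.0
    for v in logits:
        acc = to_fp16(acc + to_fp16(math.exp(to_fp16(v - m))))
    lse = to_fp16(m + to_fp16(math.log(acc)))
    return to_fp16(logits[idx] - lse)


def sequence_gap(num_tokens=200, vocab=64):
    """Return (ratio with matched numerics, ratio with mismatched numerics, max per-token gap)."""
    total_same = 0.0
    total_mismatch = 0.0
    max_gap = 0.0
    for t in range(num_tokens):
        # Deterministic synthetic logits; a real run would take these from the model.
        logits = [4.0 * math.sin(0.37 * (t * vocab + j)) for j in range(vocab)]
        idx = t % vocab  # pretend this token was sampled
        ref = log_softmax_ref(logits, idx)
        low = log_softmax_lowp(logits, idx)
        total_same += ref - log_softmax_ref(logits, idx)
        total_mismatch += ref - low
        max_gap = max(max_gap, abs(ref - low))
    # Importance ratio = exp(sum of per-token log-prob differences).
    return math.exp(total_same), math.exp(total_mismatch), max_gap


if __name__ == "__main__":
    same, mismatch, gap = sequence_gap()
    print(f"ratio, matched numerics:    {same}")
    print(f"ratio, mismatched numerics: {mismatch}")
    print(f"largest per-token gap:      {gap}")
```

The per-token gaps are small, but they are summed over the whole sequence before exponentiation, which is why a trainer and an inference engine that disagree slightly on every token can make an on-policy rollout look off-policy to the optimizer.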
We try to really make sure our customers trust us to make those things happen.
Yeah. And then there's something called speculative decoding, which, again, context-wise, I believe Cursor built their Fast Apply feature on this API. Could you maybe just speak to that?
Yeah, speculative decoding, I would say, at this point is a pretty well-known concept for many serving stacks. It's a setup where you have a small model that tries to guess which tokens the big model would like, and then you just ask the big model, do you like these tokens? If yes, let's spit them out all at the same time. There's a lot of interesting research in this area. I think recently there's work using different model architectures to do the speculation. We also spend a lot of time doing that research in-house. Speculative decoding is definitely also workload dependent, and making sure that we can train different forms of speculative decoding models while the data is coming in, so that we keep up with the change in distribution in our customers' data, is very important as well. I think one other thing that people often don't appreciate enough is that training an EAGLE model is often like training a large language model, or like training a small language model, in the sense that data quality is important, training infrastructure is important, dev efficiency on the stack is important. There are a lot of nuances in how we can support our customers with good speculative decoding models. And we spend a lot of time and effort on that, making sure that when they bring an open source model that they fine-tuned, we can quickly train a really good speculator model for them.
Yeah, because I was going to ask about this: the idea that you've got effectively a draft model and a target model. And I guess you're hinting that the target model is often actually the customer's fine-tuned model, and Fireworks needs to pair a speculative draft model with it.
Absolutely.
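The draft-and-verify loop described above can be sketched in a few lines of Python. This is a minimal greedy variant under stated assumptions, not Fireworks' implementation: `toy_target` and `toy_draft` are hypothetical stand-ins for real models, each mapping a token prefix to its single most likely next token, and verification is done position by position here, whereas a real serving stack checks all drafted positions in one batched forward pass of the target model.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # greedy next-token function for a model


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int = 4) -> List[int]:
    """Draft k tokens, keep the longest run the target agrees with, plus one target token."""
    proposal: List[int] = []
    for _ in range(k):                      # the small model guesses ahead
        proposal.append(draft(prefix + proposal))

    accepted: List[int] = []
    for tok in proposal:                    # the big model checks each guess
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)       # first mismatch: use the target's token, stop
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all guesses accepted: one bonus token
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + n_tokens]


# Hypothetical toy models: the draft agrees with the target except on every fifth position.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) * 31 + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    t = toy_target(ctx)
    return t if len(ctx) % 5 else (t + 1) % 50


tokens = generate(toy_target, toy_draft, [1, 2, 3], n_tokens=20)
```

In this greedy form the output is identical to what the target model would have produced on its own; the draft only changes how many target calls are needed. That is also why draft quality on the customer's actual data distribution matters: the more guesses the target accepts, the larger the speedup.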
A lot of our customers are very, very sophisticated. They may have most of the things 70, 80% of the way there. And we are here to, let's say, deploy the model and scale the model. And often part of scaling the model is to train the speculator for them as well.
Yeah, nice. And then you have something called 3D Fire Optimizer. So yeah, I'll just let you take that one.
Yeah, I think it's an interesting name. When we came up with the concept, it sounded like something you might find in gaming. Conceptually it's very straightforward, though. It's a database of all the previous performance optimization results we have, as well as predicted results of what the customer workload will be, because there are so many dimensions to do trade-offs on. For example, workload patterns, hardware types, cache hit rate: there are many, many different variants to deploy the same workload. And it's important for us to be able to scale the engagements for our customers in an automated way. So we have a database for these performance optimization techniques that our automation can make use of, to make sure that we can get back to our customers with an answer very quickly. Oftentimes how to set up the workload correctly is like 50% of the work, and 3D Optimizer is an in-house stack that helps us get there.
Nice. And then, you have mentioned hardware, I think especially in relation to the FireAttention kernel. You are hardware agnostic, if you like, or perhaps bi-agnostic. I mean, we're talking basically two providers, because that's kind of what it boils down to. But what is your thought process? Let's assume that cost comes into it somewhere, but cost plus what else is what you're looking at when you're deciding between NVIDIA and AMD, basically, which I believe you run both of?
And just to be clear, both are our investors, so I don't want to...
Gotcha.
We do love NVIDIA and work a lot with NVIDIA, just to be clear.
I do think when we're running a business, we need to make sure we maximize our customer value. And in a world where we maximize customer value, it's definitely a little bit weird to be locked into one hardware vendor. And I really think this year is honestly not so much about loyalty to NVIDIA and whatnot, because NVIDIA is already producing chips as fast as it can. It's just that everyone else is buying them up. So practically speaking as well, for multi-hardware, oftentimes it's all about supply chain reliability, making sure that you can actually buy the cards when you have the money. At any point in time, there may not be NVIDIA cards available and you have to buy AMD cards. Available as in available at a reasonable price, because of course, if you are willing to pay, there's always someone willing to sell to you. It's just that the premium will be very, very high. And honestly, while I was working at Meta, because I worked on ads, I was also in conversation with the whole procurement process. Of course I was not the one negotiating the contracts, but because I was on the ads team, I was providing all the inputs on how many CPUs and GPUs we needed. And even very early on in those conversations, it was AMD and Intel for CPUs back then. It was important back then even just to have a dual-supply strategy, and that works surprisingly well. So in cases where people think two suppliers may or may not be enough, oftentimes it is enough. It will be great when you have three suppliers. That's really when you get to benefit. But even having more than one is very, very helpful. That is ingrained into how I think about this process, and that's why multi-hardware is so important for us. And I do want to be honest about it: it is a lot of work. There's always a trade-off between working on AMD for FireAttention versus working on other stuff with which we can serve our customers better.
At the same time, we really believe that the AMD investment will pay off.
Yeah, I think it's really, really helpful, I'm sure, for our listeners to hear what the things are that you actually have to consider when running something like this. And obviously, investors come into it as well. I think that's what people forget about. Most companies in this space have investors, and that will often have a bearing on some direction at some stage as well. So I think that's really helpful. Let's maybe move on to evals. Most of our listener base should be, I think, somewhat familiar with the concept of evals and why they're important to anything, especially running your own models or wanting to fine-tune your own models, but even just picking between models as well. And I think, at least as a company, you've made a statement which a lot of people agree with, about one of the biggest barriers in this whole AI ROI problem, or mystery to some people. ROI is just return on investment: is the money being put into AI within a company getting you something that is effectively larger than the investment you put in? It's very simple. And Fireworks is saying the biggest barrier to that isn't cost as such, but defining "good": what is it that you think good is? So could you walk us through what Fireworks does in this area? Do you help your customers do evals? Just walk us through what that looks like.
Yeah, Fireworks' investment in evals is, I would say, 70% on the infrastructure side and 30% on the consulting side. For the 70% on infrastructure, we have an open source project called Eval Protocol, where we help people author evals for reinforcement learning.
Oftentimes when I ask customers whether they have evals, and then recommend that they write evals and whatnot, my pitch is that even if this engagement falls through, the worst outcome is that you have better evals, which hopefully isn't a bad outcome. Because there's so much innovation coming out of these frontier labs, closed-source models even. How do you decide which one to use? Every day there's something new. Okay, every day is an overstatement, but every week there's something new. And just to take even the most straightforward case, let's say you love Elon and you want to use xAI's model. How do you know when it's appropriate? You can't even pay Elon unless you have the evals. So it's very important, I think, for a lot of customers to realize that these are assets, and Fireworks is here to help you build up those assets. And as soon as people have evals, the gap between using those evals to evaluate models and using them to train your own model with reinforcement learning is very, very small. So it's very helpful for a customer to first be able to pick between different models. And then second, once they are comfortable, they start training new models on Fireworks with reinforcement learning through Eval Protocol, which is something that we've seen happen over and over again. At the end of the day, if you know how to evaluate your model, you have all the power. You get to decide which supplier to use, in what setting, and when. And that power is very, very important to a lot of our customers. It's just that some of them haven't realized how important that power is. The other part is that we also work with some of the large customers to help them set up the evals, because in certain settings it's not practical to leave our customers hanging and ask them to author the evals themselves.
We have a lot of know-how in-house after all these engagements, and we help our customers use that know-how and author the evals themselves as well.
And you have an eval framework that is actually being open sourced. Is that right?
Yeah. It's called Eval Protocol. It is focused on helping people author evaluations for reinforcement learning settings. So you focus on writing the evaluation itself, and then Fireworks can help take care of the rest: doing rollouts on inference, passing those rollouts to the trainer, and making sure that the reinforcement learning graph goes up and to the right. It also helps with observability: which traces went wrong, why, and what the model tripped on. People focus a lot on the fancy algorithmic side of RL, but in practice, a lot of the time, the observability is very, very important. That means making sure that you have an open source SDK that helps people hook into any part of the infrastructure, however they want. They need that assurance, because unless they can own the code themselves, they don't want to touch it. It also means making sure that all the observability can be set up so they can see the environment's behavior at any point in time. Because oftentimes the RL problems come from the environment itself: something broke in the environment, and you want to fix the environment to continue training. So those things are all very, very important, and Eval Protocol helps you with all of that.
Yeah. And I encourage anyone interested in any of these topics: you guys have a really good blog, and something was published on there, "Traces Are All You Need (to Rank LLMs)". This is in reference to production trace data being a better signal, say, than benchmark suites. And there's probably a catch somewhere. So maybe you could briefly unpack what that is about?
Yeah, I think when we talk with our customers, a lot of them feel that it's intimidating to author evaluations. A lot of them didn't start out as machine learning engineers, but that's the beauty of this wave of innovation. A lot of people are coming in from product backgrounds who have the ability to clearly articulate what is good and what is bad for their customers. And oftentimes I feel like that's underappreciated in a lot of settings. Because honestly, if you can clearly articulate what is good and what is bad, you are 90% of the way there. The only delta is that you then need to take traces from your production workload and run a language model through those traces with your articulation. In a lot of settings, our customers tend to have more elaborate setups for the evaluations and whatnot. But what we find is, honestly, if you can articulate it, you're 90% of the way there. There are definitely special cases. For example, we have an example for an SVG agent where you still need to render the output through a Chrome server to get the screenshot before you can even use your articulation of what a good SVG is. At the same time, honestly, with Opus 4.6, you say the word, it comes out the other end. I don't even know what happened in between, but it works. Then the only important bits are: can you pull your traces out of your production system? And can you articulate what is good or bad?
Yeah. So we're going to move on to probably the final fairly meaty topic, which is reinforcement fine-tuning, something that I believe, again, Fireworks helps people figure out. Supervised fine-tuning has been around a while. So what is reinforcement fine-tuning like? Why does it change things fundamentally? My understanding is that it's a huge shift. It's not just a small incremental change. It's a step change in this area.
Absolutely, absolutely.
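The trace-based recipe from the answer above (pull traces out of production, then score them against a plain-language articulation of good and bad) can be sketched as follows. This is a minimal Python sketch, not Eval Protocol's API. Everything named here is hypothetical: the trace format, the `rank_models` helper, and especially `keyword_judge`, which is a deterministic stand-in for what would in practice be a call to a language model acting as the judge.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

Trace = Dict[str, str]                 # {"prompt": ..., "response": ...}
Judge = Callable[[str, Trace], float]  # (rubric, trace) -> score in [0, 1]


def rank_models(traces_by_model: Dict[str, List[Trace]],
                rubric: str, judge: Judge) -> List[Tuple[str, float]]:
    """Average the judge's score per model and sort best-first."""
    scores = {
        model: mean(judge(rubric, t) for t in traces)
        for model, traces in traces_by_model.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def keyword_judge(rubric: str, trace: Trace) -> float:
    """Stand-in judge: a real one would send the rubric and trace to an LLM."""
    response = trace["response"].lower()
    mentions_policy = "refund" in response      # hypothetical rubric check
    is_concise = len(response) <= 200           # hypothetical rubric check
    return (mentions_policy + is_concise) / 2


RUBRIC = "A good answer states the refund policy and stays under 200 characters."

# Hypothetical replayed production prompts, answered by two candidate models.
traces = {
    "model-a": [
        {"prompt": "Can I return this?", "response": "Yes. Refunds are available within 30 days."},
        {"prompt": "Money back?", "response": "You can request a refund from your order page."},
    ],
    "model-b": [
        {"prompt": "Can I return this?", "response": "Please contact support."},
        {"prompt": "Money back?", "response": "A refund is possible. " + "More detail. " * 30},
    ],
}

ranking = rank_models(traces, RUBRIC, keyword_judge)
for model, score in ranking:
    print(f"{model}: {score:.2f}")
```

The same scoring function that ranks models here is also the kind of asset that could later serve as a reward signal in reinforcement fine-tuning.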
Yeah, maybe I can start with how it changes the industry and then specifically how it changes Fireworks. I would say reinforcement learning is a new lever that the industry found while pre-training, sort of the free ride, slowed down. You definitely hear a lot from Dario saying, hey, Anthropic does not see the improvement slowing down, and pre-training is still giving a lot of gains. At the same time, my current understanding is that you need an exponential amount of compute to keep the straight line going. At some point, the money aspect will kick in; reality will have to kick in. Even if that part of reality doesn't kick in, the electricity part of reality will kick in. So there are limits to how far you can push these ideas. And reinforcement learning was a new paradigm that the industry found to keep the scaling going. Because even for relatively small models, if you can push RL, you can get really good results out of the model. And I think the other thing that people don't tend to talk about with reinforcement learning is that it really consolidates your evaluation as an asset, meaning that the same evaluation on the same environment can be used across different generations of models without significant changes. This is different from SFT datasets where, depending on what's needed for a given model, you may need more supervised fine-tuning data to push the model in a certain direction. And that is definitely important work that people at a lot of frontier labs are pushing on every day, to make sure they curate better and better supervised fine-tuning datasets. At the same time, when the models are really good, you have to throw certain things out just to not confuse the model.
Whereas asking the model to do an Excel spreadsheet is always going to be valuable, no matter where you start, because you always want to make sure that it knows how to read a spreadsheet and knows how to manipulate a spreadsheet, right? So those assets are more enduring in these settings, and that's why I think so many frontier labs are investing so much money into these environments. Specifically for Fireworks, it is very important because it now finally unlocks the customization loop from a software engineer directly to a tuned model. Previously, to do supervised fine-tuning, the conversation often went like, oh, do you have a team of MLEs in-house who know how to work with data labelers? Because these MLEs need to have some experience managing labelers and making sure that they can clearly communicate, with another set of humans that they have never worked with before, what is good, what is bad, and what they're looking for. It also takes a few iterations, because oftentimes these data labeling companies will have to assign you a certain set of people repeatedly just so that they don't lose the context. But you also need quality control on the other end, making sure that the supervised fine-tuning dataset is consistent and will continue to be what you're looking for at the thousandth label. That is a very tedious process, a very difficult process for many, many people. And the process also breaks down with long context, because honestly, at a certain context length, I really don't think I even understand what's going on. It probably takes me an hour just to read through all the conversations to figure out where it went wrong, so that I can edit the conversation to make it work correctly.
Whereas for reinforcement learning, as long as you have a product manager who can articulate what is good or bad, they will be able to author a language-model-as-a-judge snippet, send it to Fireworks, and be like, hey, teach my model this, right? Everyone else is out of the loop. And then we can bootstrap this very, very quickly. And as the coding models get better and better, I really think that there is a lot more we can push out of these coding models to get reinforcement learning more automated. I think that would be an interesting topic to dig into more this year.
Yeah. And I believe it was Vercel that has used RFT with you guys, and they've come out with numbers along the lines of 40x faster code fixing, with better outputs. So that's an interesting use case for RFT.
Yeah, absolutely, absolutely. And we have a lot of smaller customers in general who benefit from RFT as well. I would say, for example, Vercel has a very, very strong engineering and product team. Those teams are exceptionally well fit to do reinforcement learning, because with just two or three people they can internally align on what's good and what's bad, and they can just go.
Nice. Yeah, that's a really powerful image. So we are starting to wrap up, but something I did want to touch on is this competitor-as-a-vendor problem that we've seen play out over the last, I guess, 18 months especially. On SED News, which is our monthly podcast that myself and Sean Falconer do, we've covered quite a few of these, which is why it's quite interesting. So there's the fact that Anthropic effectively cut out Windsurf, and then OpenAI launching Codex, which competes with Cursor. You've then got things like Google Search versus Perplexity in some respects.
So if you're the vendor, how are you thinking about where this risk potentially lies for you, or doesn't?
Yeah, I honestly think we are such a small player in this market, and the market is so early still, that I don't know if the competitive pressure matters yet. I keep telling my colleagues, if we don't do a single-digit percent of what NVIDIA is doing, I feel like we're doing something really wrong, and we're not there yet. What NVIDIA is doing, I forgot what the exact number is, like 400 billion this year.
Something like that, yeah.
Yeah. I would say, probably before we do a few percent of what NVIDIA is doing, I really don't think that we will run into those hard constraints on competitive vendors and whatnot. But I do think I can be more helpful with this answer by talking about what our customers are looking for specifically when they're thinking about Fireworks. And honestly, I think at the end of the day, it's mostly about trust. Trusting that we get the numerics right so they don't have to figure out all the details. Trusting that we got all the serving details right so the function calls happen properly and the constrained generation happens properly. Trusting that we set up our reinforcement learning rig correctly so they don't have to do the numerical debugging themselves. And I don't think a lot of those are competitor related, in the sense that a lot of our customers are making, honestly, a lot of money, and they just want to make sure that we handle those complexities for them. I do think in a fast-evolving field, maybe I should think more about the competitive landscape and whatnot, but I really don't think it matters as much yet. It's more about helping our customers make as much money as they can at this stage.
Yeah, exactly. And you work with some very big names, so for anyone potentially smaller coming along, that's the trust piece. I think that's where you've got some of the bigger names to back you up at this stage: if you're able to work with the Cursors and the Vercels, that surely says something. So yeah, we are coming to time, but for anyone out there, a developer or anyone maybe slightly, I won't say higher level, because IC versus business is kind of the same thing, but someone who's not hands-on-keyboard: if someone wants to get started with Fireworks, what's the best path there?
Install OpenClaw, and hook up Kimi on Fireworks with OpenClaw. And honestly, Peter is amazing. I watched some of his podcasts, the author of OpenClaw.
Yeah. Who's just joined OpenAI, at least.
Yeah. Yeah.
A couple of days ago before this recording.
Yeah. I don't know how much he's paid, but he definitely deserves it. It is surprising, honestly, how fast everything is still moving. Just as soon as you expect things to slow down, maybe a little bit, all the models come out in the last two weeks, right? And then OpenClaw goes from fame to acquisition in a few weeks. Things are not slowing down.
No.
And I do think that for anyone who's listening, any amount of effort in this area will pay off. If it's not OpenClaw, then, I don't know, vibe-code something. Because I really think a lot of the difference that I'm seeing for some of our customers is that they are two months early. And the fact that they are two months early makes a world of difference.
Yeah. Yeah. Amazing.
So yeah, thank you so much for coming on today, Benny. Is there anywhere that people can follow you personally? Are you on, I don't know, X or anything like that, or not really?
I guess I'm old. I still say Twitter.
Yeah, nice. Twitter. Yeah, I would still say Twitter. The number of times I hear "X, formerly known as Twitter", I'm like, well, we've got to pick one, don't we? Cool, so you're on Twitter. What's your handle on Twitter?
Funny Chen.
Oh, cool, okay. Bunny Chen, nice. Yeah, awesome. Well, thank you so much for coming on. I've learned a lot as well. It's really awesome to see a company in this space, and looking back to the beginning, when you were saying that maybe you weren't analytical enough about when you left Meta and started this, I mean, this sounds amazing, and you've obviously done a huge amount of work in this space that someone else hasn't done. So I think that says it, really.
So yeah, thank you for having me, Greg. Thank you. Thanks a lot.
Okay, I hope we get to catch up again in the future. Thank you.