The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

The Race to Production-Grade Diffusion LLMs with Stefano Ermon - #764

63 min
Mar 26, 2026
Summary

Stefano Ermon, Stanford professor and CEO of Inception, discusses the development of diffusion language models as an alternative to autoregressive LLMs. These models generate text through iterative denoising rather than token-by-token prediction, offering 5-10x faster inference speeds while maintaining comparable quality to speed-optimized models from frontier labs.

Insights
  • Diffusion models for text overcome the discrete token challenge by using masking-based noise processes instead of continuous perturbations
  • The economics of AI inference are shifting focus from training-time scaling to inference-time efficiency, making faster models commercially valuable
  • Diffusion language models enable better controllability and error correction capabilities compared to autoregressive models
  • The serving infrastructure for diffusion models requires complete rebuilding as existing frameworks are optimized for autoregressive generation
  • Academic research in diffusion language models is exploding, but significant engineering and science challenges remain for frontier-scale deployment
Trends
  • Shift from training-time to inference-time scaling as a key competitive advantage
  • Growing demand for latency-sensitive AI applications in voice and agentic systems
  • Emergence of alternative architectures challenging autoregressive model dominance
  • Increased focus on cost per token and energy efficiency in production AI systems
  • Cross-pollination between image and text generation research methodologies
  • Rising importance of controllable generation capabilities in enterprise applications
  • Academic explosion in diffusion language model research following initial breakthroughs
  • Need for specialized serving infrastructure for non-autoregressive models
Companies
Inception
Stefano Ermon's startup developing commercial-scale diffusion language models like Mercury 2
Stanford University
Ermon's academic affiliation where foundational diffusion model research was conducted
OpenAI
Referenced for their speed-optimized Mini models that Mercury 2 competes against
Google
Mentioned for Gemini models and their announced but unreleased diffusion language model work
Anthropic
Referenced for their Haiku models as comparison points for speed-optimized LLMs
Nvidia
Collaboration partner on video model research and Cosmos project
Alibaba
Industry partner collaborating with Chinese universities on LLaDA diffusion models
ByteDance
Has internal diffusion language model research efforts through their Seed division
Cerebras
AI inference chip company that other labs partner with to accelerate traditional autoregressive models
Groq
AI inference chip company that other labs partner with to accelerate traditional autoregressive models
People
Stefano Ermon
Main guest discussing his work on diffusion models and commercial applications
Sam Charrington
Podcast host conducting the interview about diffusion language models
Aditya
Ermon's co-founder and former PhD student working on multimodal diffusion models
Quotes
"if you need to scale up these models and they are actually getting into production, the price per token or the watts needed per token becomes the key metric that you care about"
Stefano Ermon
"what we're seeing with diffusion language models is that they scale better than autoregressive models. At inference time, they're cheaper to serve, they're faster"
Stefano Ermon
"When I started back in 2014 or so, we were barely able to model and nist images and it was all very blurry and that was already like a big result"
Stefano Ermon
"Mercury 2 is the first commercial scale diffusion language model with reasoning capabilities"
Stefano Ermon
"It's unlikely that one architecture is going to dominate the other one. There's gotta be some use cases where an alternative architecture is just going to be better"
Stefano Ermon
Full Transcript
3 Speakers
Speaker A

A big thanks to Blitzy for supporting the podcast and sponsoring this episode. Want to accelerate software development velocity by 5x? You need Blitzy, which brings autonomous software development to your enterprise code base. Your engineers declare intent, and Blitzy agents map your code base and generate an agent action plan. Once approved, Blitzy gets to work autonomously, generating hundreds of thousands of lines of validated, end-to-end tested code, with more than 80% of the work completed in a single run. Blitzy is not just generating code, it's developing software at the speed of compute. Experience Blitzy firsthand at blitzy.com/TWIML. That's B-L-I-T-Z-Y dot com slash TWIML.

0:01

Speaker B

if you need to scale up these models and they are actually getting into production, the price per token or the watts needed per token becomes the key metric that you care about. And so what we're seeing with diffusion language models is that they scale better than autoregressive models. At inference time, they're cheaper to serve, they're faster. You get more tokens per GPU, which means that the price is actually lower. And so that's why we felt like, yeah, this is the time to do it. And in fact, that's what we're seeing.

0:48

Speaker C

All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Stefano Ermon. Stefano is an associate professor at Stanford University and the CEO of Inception. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Stefano, welcome back to the podcast. It has been a while.

1:31

Speaker B

Yeah, thank you for hosting me again. Yeah, it's been a very long time since we last chatted.

1:54

Speaker C

Yeah, I think about eight years or so. Certainly lots has changed and we'll get into some of that. In particular, what you've been doing with diffusion models. But to get us started, why don't you tell us a little bit about what you've been up to for the last eight years?

2:00

Speaker B

Yeah, so I've been working still in the same space. I've been working on generative models, I guess, my whole career, my whole life. What has changed is that the field really took off. I guess now it's called generative AI, and everybody is paying attention to it; it's become the thing that everybody is looking at and everybody's trying to get into. So, yeah, it's been exciting to see the growth of the field and the capabilities of these models. When I started back in 2014 or so, we were barely able to model MNIST images, and it was all very blurry, and that was already a big result. And now the bar has shifted a little bit in terms of what these models can do. My lab at Stanford has always been at the forefront of what these models can do, always innovating at the model level, the architecture level, the ML systems level. I did early work on diffusion models back in 2019, when everybody was using generative adversarial networks, if you still remember. We came up with this alternative approach, what's now called a diffusion model, which is now used in pretty much every generative solution for images, video, music. Since back then I've been trying to get these models to work on text and code and DNA, discrete objects, and we've finally been able to get some really, really good results with this approach. And that's what I've been doing at Inception. I'm currently the CEO and one of the founders of a startup called Inception, where we are developing a new kind of LLM that is based on diffusion. These new LLMs are way faster, more efficient, higher quality. We just launched our newest model, Mercury 2, a couple of days ago. So if you want to play with a different kind of LLM, something that's fundamentally different in the way it generates text and code, give it a try. It's really, really fast. It's a great solution, especially if you're thinking about latency-sensitive applications that have very tight latency budgets. These models are really, really quick and they give really high quality answers. A lot of developers are already building a bunch of real-time AI applications on top of them. So that's what I'm most excited about today, and that's what I've been spending a lot of my time on: figuring out how to get these models to work even better.

2:15

Speaker C

Take us back to the creation of diffusion models. Where did the inspiration come from?

4:58

Speaker B

Yeah, so back then the field was dominated by GANs, generative adversarial networks. That's the old approach where there are two neural networks: one that generates images and one that tries to discriminate and figure out whether the images are real or fake, and then you train them against each other. It's a very unstable and challenging optimization problem, because there is this game-theoretic aspect to it where these two neural networks need to out-compete each other. It was very, very unstable, very difficult to get it to work well; lots of tricks were needed. So in my lab we were trying to experiment with alternatives. One alternative was the usual autoregressive approach, where you generate the image, let's say, one pixel at a time. That never worked particularly well for images and video, and it still doesn't; it's just very slow and not very accurate. And so we came up with this alternative approach, which is now called the diffusion model, where essentially you generate an image by starting from noise and then iteratively refining it until, at the end, you get a crisp, nice image that is consistent with the prompt. The key benefit is that the training objective is very stable. The neural network is trained to just denoise an image: you take images, you add noise, and you train a neural network to remove the noise, which is a fairly standard, relatively easy optimization problem that you can use to train neural networks over large data sets, and it works reasonably well. You can then use these neural networks at inference time to generate images, because the networks have been trained to remove noise, to improve the samples, to correct mistakes. It turns out that you can just start with pure noise, apply this denoising network a bunch of times, and at the end you get a really nice image. And the key benefit, I mean, back then people were not thinking about test-time inference and those kinds of things, but one of the reasons we really wanted to get this to work was that it had the flavor of a neural network with a very deep inference path, because you are chaining together many, many evaluations of this neural network at inference time. So you have a very, very deep computation graph that can do very powerful things, but it's still very scalable during training, because you don't have to unroll all of this computation during training. During training, you just train the model to remove noise, so you basically need a single neural network evaluation. This idea of having something that is cheap to train, yet very powerful at inference time, has always been in the back of my mind, along with trying to think about ways to do these sorts of computations efficiently. That's now showing up in a different form in the context of LLMs, where people are very excited about chains of thought and being able to adjust the amount of compute at inference time. I feel like it's a similar idea, although implemented on top of a very different kind of generative model.
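
To make the asymmetry described above concrete, here is a minimal toy sketch in PyTorch. The stand-in MLP denoiser and the interpolation-style noise schedule are illustrative assumptions, not Inception's or any specific paper's recipe; the point is that training calls the network once per example, while sampling chains many calls.

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy stand-in for the real denoising network (illustrative only)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, x_noisy, t):
        # Condition on the noise level t so one network handles every step.
        return self.net(torch.cat([x_noisy, t], dim=-1))

def training_step(model, x, opt):
    # Training touches the network ONCE per example: corrupt, then predict the noise.
    t = torch.rand(x.shape[0], 1)              # random noise level in [0, 1)
    noise = torch.randn_like(x)
    x_noisy = (1 - t) * x + t * noise          # simple interpolation-style corruption
    loss = ((model(x_noisy, t) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

@torch.no_grad()
def sample(model, dim=64, steps=50):
    # Inference chains MANY evaluations: start from pure noise, refine iteratively.
    x = torch.randn(1, dim)
    for i in reversed(range(1, steps + 1)):
        t = torch.full((1, 1), i / steps)
        x = x - (1.0 / steps) * model(x, t)    # crude Euler-style denoising update
    return x
```

Note that `steps` appears only in `sample`: the depth of the inference-time computation is a free knob that training never has to unroll.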

5:05

Speaker C

Talk a little bit about the path to getting from diffusion models for images to diffusion models for text.

8:14

Speaker B

Yeah, so it took a while. Immediately after getting the good results on image generation, where initially we showed that these models were better than GANs, the field switched to diffusion models very quickly; Stable Diffusion came out, and Midjourney, and they basically took over the whole field. And since back then I'd been thinking about, okay, how do we get these kinds of ideas to work for text and code, where we want the model to generate discrete objects?

8:22

Speaker C

You've mentioned discrete a couple of times, as opposed to continuous. Can you talk about why that presents a challenge for diffusion models?

8:54

Speaker B

Yeah, of course. So if you think about an image, or even just a single pixel, you can think of it as a bunch of colors. And the interesting thing is that if you change the colors a little bit, the meaning doesn't change. In particular, you can think about two possible colors for a pixel, and all the things in between them still make sense; they don't change the meaning of the image in any dramatic way. But if you think about text and you take two words, it's not clear what's in between the meaning of two different words. There is no real geometry over the space of possible tokens or possible words. And that makes the idea of denoising much more challenging, because it's not clear what it means to perturb text, to add noise to it. The whole geometry just does not exist. A lot of the concepts that were invented to get diffusion models to work on images and video rely very heavily on the fact that there is some kind of continuum of possible images: you can interpolate between them, and it makes sense to get the model to smoothly move from one image to another. In the context of text and code, everything is very discrete, and so the mathematics that were developed for continuous spaces do not translate immediately to discrete spaces.

9:03

Speaker C

When you talk about the idea of words between points and words in a neighborhood, it calls to mind embedding spaces and the like. To what degree, and I imagine that's been tried, is that part of the ultimate solution to getting it to work for text?

10:41

Speaker B

Yeah, so there are approaches that essentially try to build diffusion models for language generation in the embedding space. First you embed everything, then you build a diffusion model, and then the problem is that eventually you have to decode back to text. You cannot give embeddings to your users or your customers. And that's always the problem: at the end of the day, the diffusion model will make some small mistakes, and it might not end up exactly at a point that corresponds to one of the existing words in the dictionary. So it's actually pretty challenging to get diffusion models to work well in latent spaces. There have been a number of papers, including from academia and industrial labs, but it's not been very successful. But it is one of the approaches that people have taken.

10:59

Speaker C

So what has been demonstrated to work for text with diffusion?

11:58

Speaker B

So the initial results were still in the academic setting. It was actually again from my lab, where we had a paper a couple of years ago essentially showing for the first time that it was possible to train a transformer-based model this way. You basically take a GPT-2-size model and you train it autoregressively, the usual way: you train it to predict the next token, the way everybody else is training LLMs. Then you can train the same neural network as a diffusion model. And in that paper we showed that for the first time we were able to match the quality. In terms of perplexity, in terms of the quality of the text that these two models are able to generate, it was about the same, but the diffusion model was significantly faster: you could generate the same quality of text with about 10x fewer neural network evaluations. So diffusion models are significantly more efficient at the GPT-2 scale.

12:05

Speaker C

And so, just so I understand the setup there: you said train it autoregressively and train it via diffusion. Is that two different models that you're comparing, or are you sequentially training it autoregressively and then with diffusion?

13:06

Speaker B

So it's really almost like an A/B test. It's a very fair comparison in the sense that you take the exact same neural network architecture with the same number of parameters, and you train it on the same amount of data. On the one hand you train it as a typical autoregressive model, where you just predict the next token, and that's how you use it at inference time. On the other hand you train it as a diffusion model. At that point, the difference in performance is entirely due to the different modeling paradigm: diffusion versus autoregressive.

13:25

Speaker C

How did you overcome the discrete challenge in training that model?

14:01

Speaker B

Yeah, so that was the main idea in that paper. There were some new mathematics, some new methods for figuring out what it means to do diffusion in the context of discrete, text-like objects, and it was demonstrated to actually work well in practice up to the GPT-2 scale. The next step was Inception. I was very excited about those results, and so I started a company called Inception, where we've been scaling that up. We've trained commercial-scale diffusion language models, so much larger models, much more data, and now the results are extremely good. The latest model that we announced this week, Mercury 2, is actually matching in quality some of the best speed-optimized models from the frontier labs, so think about the Haiku models, the Flash models, the Mini models from OpenAI. It's at that quality level, but again it's about 5 to 10x faster in terms of the time it takes you to get an answer using a diffusion model versus an autoregressive model.

14:05

Speaker C

Are you able to give us an overview or a summary of some of the mathematics that make this work, at an intuitive level?

15:20

Speaker B

It's all somewhat similar, in the sense that there is still a neural network that is trained to remove noise. It's just that the noise process is no longer adding small numbers to the pixel intensities. There are different kinds of noise processes you can use, and one that works pretty well is one where you mask out tokens. You hide them: you take a sentence, you remove some of the tokens, you hide them from the neural network, and then you ask the neural network: can you predict what those tokens were? So it's similar in some sense to next-token prediction, except that things are done out of order, and the network needs to be able to use context to the left and to the right and combine it in some interesting ways to figure out how to predict all the missing tokens in the sentence.
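
As a concrete, heavily simplified illustration of the masking-based noise process just described, the sketch below trains a bidirectional transformer to recover hidden tokens. The mask id, the uniform masking rate, and the absence of any loss weighting are simplifying assumptions for illustration, not a production recipe.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # reserved "hidden token" id -- an illustrative assumption

def masked_diffusion_loss(model, tokens):
    """One toy training step: hide a random fraction of tokens, predict them back.

    `model` is assumed to be a bidirectional transformer (no causal mask)
    returning logits of shape (batch, seq_len, vocab)."""
    t = torch.rand(tokens.shape[0], 1)                      # masking rate per example
    mask = torch.rand(tokens.shape, dtype=torch.float) < t  # which positions to hide
    corrupted = tokens.masked_fill(mask, MASK_ID)           # the "noise" is masking
    logits = model(corrupted)
    # Loss only on hidden positions: recover the originals from both-side context.
    return F.cross_entropy(logits[mask], tokens[mask])
```

Unlike a causal LM loss, the model here attends to both sides of every hidden position, which is what enables the out-of-order prediction described above.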

15:28

Speaker C

So in some ways, you're changing the definition of noise to one that makes sense in the context of text.

16:28

Speaker B

Exactly, exactly. And that kind of training objective is very similar to the BERT-style models from many years ago, which for a while were widely used in natural language processing. People were training neural networks on exactly the same objective, this idea of: let's train the network to predict some of the missing tokens, because in order to do that, it really needs to understand the meaning of the other tokens, and that's a good way to get representations. In that ICML paper that I mentioned, we basically showed that once you can do that, you can also generate content from scratch, because you can start with a sentence where everything is masked and then let the neural network figure out how to fill in the pieces. But it does so out of order. Instead of generating left to right, one token at a time, it does it in any order. And crucially, the network can output more than one token at a time. That's why these models are so much faster: in the autoregressive world, if you want to generate a thousand tokens, you need a thousand neural network evaluations. In a diffusion language model, the neural network can output many tokens at every step. And so, to the extent that you don't need too many steps, say 20 denoising steps, these models can be much, much more efficient.
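
A hedged sketch of what such a parallel-unmasking sampler could look like, continuing the toy setup from the previous snippet. Committing the highest-confidence predictions first is one heuristic from the open literature, not necessarily what Mercury does; the mask id and the fixed unmasking schedule are illustrative assumptions.

```python
import torch

@torch.no_grad()
def sample(model, seq_len=256, steps=20, mask_id=0):
    """Toy parallel-unmasking sampler: start fully masked, commit several
    tokens per step, so the network runs `steps` times, not `seq_len` times."""
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        still_masked = x == mask_id
        if not still_masked.any():
            break
        conf, pred = model(x).softmax(-1).max(-1)        # predictions for every slot
        conf = conf.masked_fill(~still_masked, -1.0)     # only rank masked slots
        k = max(1, int(still_masked.sum()) // (steps - step))  # unmask a batch per step
        idx = conf.topk(k, dim=-1).indices               # most confident positions
        x[0, idx[0]] = pred[0, idx[0]]                   # commit k tokens at once
    return x
```

With seq_len=1000 and steps=20, the network runs 20 times instead of 1,000, which is where the speedup described here comes from.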

16:34

Speaker C

In the image world, I think we're familiar with these progressively enhanced images where you see the image taking shape. Are you able to see the same thing with text? Like, does text start out horrible and get better over time?

18:00

Speaker B

Yeah, there is definitely something like that going on. In fact, if you go on our website, you can see some little animations to give you a sense of what's going on under the hood. It's not as interpretable, I would say, as what you see in the image space, where you really see the details emerging as you go through the process. Text has always been a little bit less interpretable to me, but, especially in code, you can sometimes see the structure emerging; at least you're able to see some interesting patterns in how the model is producing the answer. And for sure, this idea of being able to control the quality of the answer as a function of the number of improvement steps, the number of denoising steps, is actually very exciting, because it gives you another axis for test-time scaling, test-time inference: applying compute at inference time to control the quality of your answers. If you have an autoregressive model, the only way you can control the quality of the answer is by producing a longer and longer thinking trace. That's what these reasoning models are doing: they produce a thinking trace before actually providing the answer, and the longer you let them think, the better the quality of the answer usually is, and the more expensive and, of course, the slower it becomes. A diffusion language model has a different axis for doing something similar, where you can control the number of denoising steps, the number of iterations, and the more iterations you do, the higher the quality becomes. But all the edits are essentially happening in place, so you don't necessarily have to make the trace longer. The model is actually able to do error correction; it's able to improve its own answer without having to make it longer and longer, which saves memory and is significantly more efficient.

18:17

Speaker C

When you talk about reasoning models and thinking in the context of diffusion, beyond this idea that you can change the number of denoising steps, should we be thinking about thinking in the same way as with autoregressive models? Do you still have thinking traces, or, if we're even there yet, is all thought in diffusion models out of band?

20:15

Speaker B

Yeah, it's a great question. The space of reasoning in diffusion language models is pretty new. Mercury 2, the model we released this week, is the first commercial-scale diffusion language model with reasoning capabilities, so this capability, this technology, is all brand new. In our case, we do still have reasoning traces. They're just produced in a different way; the models have been trained to generate them through denoising, through a different kind of training process. But the idea of a reasoning trace is still there, and in fact we're able to provide summaries of the reasoning trace to our users if they want. So it's actually pretty similar. And in fact the API is all OpenAI-compatible, and you can still use some parameters to decide and control how quickly you want your answer, how much you want to trade off compute for quality at inference time.

20:48

Speaker C

Another thing that I'm thinking about in comparing these two types of models: with your traditional autoregressive LLMs, to get more text you just continue generating tokens, because it's next-token prediction. For diffusion models, what does that mean, and what are the implications for things like context windows? Are you doing rolling windows of generation, or how does that all translate?

21:46

Speaker B

Yeah, there are many different ways of handling outputs of variable length. We have figured out a way to do it at Inception. There is also the idea of doing rolling blocks, which also makes sense and has been published in the literature. So there are different ways of handling variable length, and it's possible to do it with a diffusion language model. It does not affect the scaling with respect to context size. That's more affected by the architecture, which is completely orthogonal to the training objective. At Inception we're still using transformers as the underlying neural network, so we're using attention, and we have the same benefits and downsides of attention in terms of how it scales with respect to sequence length. But that's an orthogonal direction. It's possible to train, and in fact we have prototypes of, diffusion language models where the backbone network is not a transformer; it's maybe state-space-based or Mamba-based, so that you have better scaling with respect to context length, sub-quadratic scaling. But for our main models we're still transformer-based.
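
For the "rolling blocks" idea from the literature that Ermon mentions, here is a hedged sketch (one published approach, not necessarily Inception's): diffuse one masked block at a time, conditioned on everything generated so far, so total output length stays open-ended the way autoregressive decoding is. Token ids and the confidence heuristic are illustrative assumptions carried over from the earlier sketches.

```python
import torch

@torch.no_grad()
def generate_rolling(model, prompt, block_len=32, max_blocks=8,
                     steps=8, mask_id=0, eos_id=2):
    """Toy rolling-block generation: diffuse one masked block at a time,
    conditioned on everything generated so far, so length is open-ended."""
    out = prompt                                             # (1, prompt_len) token ids
    for _ in range(max_blocks):
        block = torch.full((1, block_len), mask_id, dtype=torch.long)
        x = torch.cat([out, block], dim=1)
        for step in range(steps):                            # denoise only the new block
            still_masked = x == mask_id                      # prefix is never masked
            if not still_masked.any():
                break
            conf, pred = model(x).softmax(-1).max(-1)
            conf = conf.masked_fill(~still_masked, -1.0)
            k = max(1, int(still_masked.sum()) // (steps - step))
            idx = conf.topk(k, dim=-1).indices
            x[0, idx[0]] = pred[0, idx[0]]
        out = x
        if (out[0, -block_len:] == eos_id).any():            # stop at an end marker
            break
    return out
```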

22:20

Speaker C

I guess the question that's coming to mind for me is: why now? Is this model enabled by particular other things that are happening in the space, or is it just the time that it took you to get to this point?

23:44

Speaker B

Yeah, it's a combination of both. One is that it just took us a while to figure out how to do it, and it was the right timing to scale things up, because things were finally working, at least on academic benchmarks at academic scale. The other is that people are starting to realize that it's all about inference scaling. For a while, the main axis that people cared about, and all the interest, was around scaling laws at training time, at pre-training time. Now everything has shifted to inference-time scaling, for several reasons. One is that that's where you're seeing the biggest benefits: RL, post-training, test-time inference. But there are also the economics: if you need to scale up these models and they are actually getting into production, the price per token or the watts needed per token becomes the key metric that you care about. And what we're seeing with diffusion language models is that they scale better than autoregressive models at inference time. They're cheaper to serve, they're faster, you get more tokens per GPU, which means that the price is actually lower. That's why we felt like, yeah, this is the time to do it. Because if we are just able to match the capabilities, in terms of intelligence, of autoregressive models with a solution that scales better along the axes that actually matter, which are cost and speed, then we would have something that can be very, very valuable and that customers would jump to. And in fact that's what we're seeing: there is really a lot of demand for speed, for cost. We're also seeing other competitors trying to get fast versions of their models by partnering with the AI inference chip companies, Cerebras, Groq and SambaNova, to try to get the fastest models out there. Except our solution is software-based, so it's much more scalable. We're still running on GPUs, so you can get as much capacity as you can get GPUs for, which is relatively easier compared to specialized AI inference chips.

24:02

Speaker C

To what degree do all of the techniques that we've learned about and now apply regularly in post-training also apply to these diffusion models?

26:23

Speaker B

So some do; some we had to reinvent from scratch. If you think about pre-training, mid-training, SFT, a lot of that is actually relatively simple, in the sense that you can essentially use the same kinds of data sets, the architectures don't have to change too much, and you just need to change the loss function, essentially, from next-token prediction to denoising. If you think about reinforcement learning, that's where things become more interesting, because in the context of reinforcement learning, whether you do it from human preferences or you have some kind of verifiable or non-verifiable reward that you need to optimize for, the sampling process is quite different. And so the way you would propagate that information back into the network is different. In fact, it's actually beneficial to diffusion language models, because in the context of RL post-training, the real bottleneck is inference. If you're doing RL post-training of an LLM, you're going to spend most of your time doing rollouts: you get the model to provide a bunch of candidate solutions, you score them using your reward function, and then you somehow figure out how to teach the model to do better, to put more probability mass on the rollouts that were good and avoid the rollouts that were not good, as evaluated by the reward function. Because diffusion language models are so much faster at inference time, you can do different things in the RL post-training stage. That's where we're spending a lot of our time right now: figuring out what the right way is to do RL post-training for a diffusion language model.
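
To see why rollouts dominate, here is a generic REINFORCE-style sketch of the loop described above. It is emphatically not Inception's undisclosed recipe, and `sample_fn`, `reward_fn`, and `policy.log_prob` are assumed helpers introduced only for illustration.

```python
import torch

def rl_post_training_step(policy, optimizer, prompts, sample_fn, reward_fn, k=8):
    """Generic REINFORCE-style update over k rollouts per prompt."""
    loss = torch.zeros(())
    for prompt in prompts:
        # The expensive part: k full generations per prompt. A 5-10x faster
        # sampler speeds up this whole stage almost proportionally.
        rollouts = [sample_fn(policy, prompt) for _ in range(k)]
        rewards = torch.tensor([reward_fn(prompt, r) for r in rollouts])
        advantages = rewards - rewards.mean()        # simple per-prompt baseline
        for rollout, adv in zip(rollouts, advantages):
            # Push probability mass toward rollouts the reward function liked.
            loss = loss - adv * policy.log_prob(prompt, rollout)
    loss = loss / (len(prompts) * k)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)
```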

26:35

Speaker C

In terms of pre-training, are these models pre-trained from scratch?

28:21

Speaker B

Yeah, we are training our own models, we have our own pipeline. We have not disclosed a lot of detail in terms of how we do it, but we have our own recipe, we have our own stack for training our models.

28:26

Speaker C

And are the recipes substantially different beyond the loss function, or, if you squint, can you see the echoes of the way we train autoregressive models?

28:40

Speaker B

There are some similarities, but I would say it's been non-trivial to figure out the right way to get these models to work. In fact, there have been attempts over the years to get diffusion language models to work, including from Google and from other places, and they were not successful for a long time. So it is non-trivial to figure out how to train them well, how to get them to scale in the best possible way, to make the best possible use of the data and the flops that you have access to.

28:55

Speaker C

Is it a foregone conclusion, does the math, for example, say that there's no way to start from a pre-trained autoregressive model and somehow, through some magic, transform that into a base that you can then diffusion-train? It seems like that would be really interesting, given how much energy and investment has been placed in training traditional models.

29:32

Speaker B

Yeah, so there have been a number of papers in the academic literature trying out various recipes for doing exactly that. To some extent, embeddings can still be reused, and to some extent networks can too. The real challenge is that the attention mask you use in a traditional autoregressive model is causal, so the model only knows how to use context to the left as it figures out what to do next. In a diffusion language model, you really want to have access to the context to the left and to the right as you decide what to change; it's one of the key properties that make these models potentially much higher quality compared to autoregressive models. So that's the challenging bit. People have explored ways of annealing the attention mask to make it go from causal to something non-causal slowly, making the model drift away from its initial autoregressive training. There are more mathematically sophisticated ways of converting the likelihood from an autoregressive model to a score function, which is what you need in the context of a diffusion model. More broadly, one thing that always sort of works is that you can get samples from the autoregressive model. That's usually a good way to, at the very least, generate synthetic data that you can then use to train your diffusion language model. It's almost like a black box that you can always use to try to get knowledge out of an existing model. And as we know, with distillation, I think that has been on the minds of a lot of labs and a lot of researchers, and it seems to be something that is really going on at a pretty massive scale in other places. But that's always possible.

30:03

Speaker C

How does the serving setup change for diffusion models?

31:59

Speaker B

Yeah, that's a great question, and it's another pretty challenging aspect. I think it's one of the reasons why there are still no other providers that are able to serve diffusion language models in production today: you cannot run a diffusion language model on existing serving engines. If you think about vLLM, SGLang, TensorRT-LLM, these frameworks exist, many of them are even open source, and they are really, really good at serving autoregressive LLMs very efficiently. They will handle things like continuous batching for you: when there is a stream of requests coming in, how do you batch them together to serve them efficiently? And there are all kinds of optimizations that you need to do once you have access to multiple GPUs and many requests. There is a lot of existing framework support and great work that has been done for autoregressive models. The space for diffusion language models is much, much less developed, so we had to build our own serving engine. Over the last month or two, there's been some support in SGLang for the open-source diffusion language models that have been developed by the community. So there is starting to be a little bit of an ecosystem, a little bit of tooling, a little bit of community support for diffusion language models in the open-source world, but it's still not nearly as developed as for autoregressive models.

32:04

Speaker C

You talked earlier about the ability to change the number of refinement steps and how powerful that is as diffusion's own kind of inference-time scaling. Is that something that is currently, you know, I'm thinking on the static-to-dynamic spectrum, is it fully dynamic, is it fully static, is it per-request static? How do you think about the knobs there?

33:55

Speaker B

Yeah. So it's a design choice. To some extent it's a choice that we as developers made about how to expose this kind of functionality to the user. Right now our Mercury models allow you to select different kinds of effort, essentially. We've tried to keep it compatible with the existing autoregressive, OpenAI-style frameworks, so that it's very easy for people to plug our diffusion language models into their existing apps or IDEs and use them seamlessly. You just need to change the API key and everything works. So we still basically use the reasoning effort parameters to control how much compute is used under the hood, but potentially you could think about alternative ways of exposing the knob. It's just that there is already a very well-developed market right now, and for us it's very important to be backwards compatible, so that our customers can very quickly switch out whatever they were using before for a diffusion language model. It's very easy for people to try our models and see how fast they are. That was a design choice we made because it makes it easier for us to go to market with diffusion language models.
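
In practice, that compatibility means calling the model through the standard OpenAI client. A hedged sketch follows: the base URL and model id are placeholders assumed for illustration (check Inception's docs for the real values), while `reasoning_effort` is the reused knob he describes.

```python
from openai import OpenAI

# Base URL and model id below are placeholders, not Inception's documented
# values -- check their docs. The client and call shape are the standard
# OpenAI-compatible interface described above.
client = OpenAI(base_url="https://api.example-inception.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="mercury-2",                  # hypothetical model id
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    reasoning_effort="low",             # the reused knob: trades denoising compute for quality
)
print(response.choices[0].message.content)
```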

34:10

Speaker C

Beyond speed and cost, which are the metrics we've talked about, there's also quality. Are you giving the user all three, or are they sacrificing somewhere? Where are they sacrificing, and how do they know what the sacrifices are? And what's the strongest evidence you have that the speed gains survive under real production load at an acceptable quality?

35:38

Speaker B

Yeah, that's a great question. I think it boils down to this: there are three things that matter when you think about LLMs. It's quality, speed and cost, and it's always a trade-off between those things. You can actually plot where existing LLMs stand in terms of these three things; that's what you find if you go on Artificial Analysis or look at the providers that are benchmarking LLMs in terms of cost, speed and quality. And so we've benchmarked our models using this existing methodology. Of course, measuring speed is easy. Measuring cost is also easy. Quality is always the hard one. What does it mean that a model is better than another one? It's all very tricky to actually measure quality in a good way. But the way it's usually done is that there are a number of established benchmarks that try to measure things that people care about, like coding ability, question answering, instruction following, tool use, stuff like that. So what we do is compare the quality of our models on these existing benchmarks. We've actually given our models, for example, to Artificial Analysis. Artificial Analysis did their independent evaluation; they tried the model on a bunch of benchmarks that they use to come up with their own intelligence score, which is exactly a quality metric: it's basically trying to see how different models compare in terms of their capabilities on these benchmarks, which reflect real-world use cases. And again, the result is that our latest diffusion language model, Mercury 2, is comparable in quality, about the same as the speed-optimized models from the frontier labs, so the Haiku, Mini and Flash models, but significantly faster, 5-10x faster depending on which one you compare against. The big limitation is that it's not the highest possible quality. If you have a workload where you want the most intelligent model, the latest Opus model or the latest Gemini Pro from Google or something like that, we are not at that quality level. We have not yet trained a diffusion language model that matches the quality of the best models from the frontier labs. That's the key limitation at the moment. We've been able to show that we can shift the Pareto frontier of quality versus speed at the level of the speed-optimized models from the frontier labs, but we need to do more work to keep increasing the quality of our models: train bigger diffusion language models, use more data, figure out better training techniques to close the gap. And at that point we would have something really, really valuable.

36:03

Speaker C

When you're comparing these models empirically, do you find any qualitative differences in the types of generations that you see?

39:05

Speaker B

We've heard anecdotally from our users and customers that, yeah, it does feel different, but it's hard to quantify. Again, you can use the benchmarks as a good way to measure how well these models do. There are some tasks that are basically editing-like tasks: if you think about autocomplete or edit suggestions, that's the kind of task where, intuitively, you really want to be able to use context to the left and to the right, if you're doing autocomplete in an editor. And indeed we're seeing that diffusion models do really, really well there. There is this thing called Copilot Arena; it's like the LM Arena for code generation models, where they come up with an Elo score for code generations. It's literally an IDE, and developers get to see autocomplete suggestions from two models; they don't know what the models are, and they rank which one is better. And we are at the top of that ranking in terms of the quality of the completions that you get from a diffusion language model, and it's really, really fast. So we're seeing a lot of usage right now. Our models are already embedded in a number of IDEs, Continue, Kilo Code, a bunch of others, and developers are actually loving the experience, the quality of the generations, and the speed at which we can provide them.

39:18

Speaker C

Are there areas where diffusion struggles relative to traditional models, granted, comparing at a consistent tier, like the smaller, faster models against Mercury 2? Things like long-horizon coherence, or needle-in-a-haystack?

40:55

Speaker B

Yeah, so context: our Mercury 2 model has 128K context. So if you have a task where you maybe need more than that, that's probably not the best use case. Again, I don't think it's a fundamental limitation of diffusion language models; it's just that we haven't trained models with longer context. We're not multimodal yet, so that's another limitation, at least right now. If you have a task where you need vision inputs, or you're thinking about audio, or outputting images and video, or something multimodal, we do not yet support those kinds of functionalities. There's no fundamental technical reason we cannot do it; we just haven't had the time to train the multimodal models yet.

41:18

Speaker C

In terms of getting to a larger scale, what does that look like for you, and what are the key impediments, steps, that kind of thing?

42:13

Speaker B

Yeah, it's a process. It's a new technology, and so a lot of the time we still have to invent new things, and it doesn't make sense to do all the R&D at the largest possible scale. We can iterate much more quickly if we try out our ideas and our methods at small-to-medium-scale models, just because iteration is faster, and there is still a lot of R&D to be done before we can just say, okay, let's scale up. But fundamentally, there are some science questions that still need to be solved, and then there is engineering, of course. Every 10x in data or parameters comes with a lot of engineering challenges, and it often means that you have to change a lot of the infrastructure, because there is a bunch of new problems that didn't show up at the previous scale that become important at the next scale. As we go through this process, we're learning a lot about scaling up to much larger numbers of GPUs and bigger data sets, and about various kinds of engineering and infrastructure problems where there's not a lot of technical risk; it just takes time to figure out a solution internally.

42:26

Speaker C

Can you talk about some of the open science questions?

43:54

Speaker B

Yeah, it's still pretty open in terms of what the best way is to train one of these models. What is the right noise process? There are many choices there; we have some things that work, but there could be better ones. If you think about inference, the interesting thing about diffusion language models is that training and inference are decoupled. In an autoregressive model, you're trained to predict the next token, and then at inference time the only thing you can do is basically reuse exactly the same process over and over. In a diffusion language model, you're essentially solving a differential equation to generate samples, and at least for image and video generation, there are a lot of methods you can use to accelerate sampling, a lot of techniques from numerical methods, like fancy ODE (ordinary differential equation) solvers or stochastic differential equation solvers. A lot of those techniques have been ported over to machine learning, and they've led to really fast and high-quality sampling algorithms for traditional continuous diffusion models. The space of discrete diffusion language models is still the Wild West; nobody knows the best way to do things. Architecture-wise, I think there is still a lot that can be changed. If you think about RL, what is the right way to do RL with a diffusion language model? Even in the context of traditional image and video models, there is still a lot of research that's wide open: what is the best way to incorporate the information through the diffusion process, and what's the most efficient way of doing it? I'm still involved, through my lab at Stanford, in some research projects there, some collaborations with Nvidia where we're training big video models, Cosmos. We've been working on trying to figure out the right recipe for RL on these more standard diffusion models that have been around for six or seven years. The discrete language space is much less mature, and so there is a lot of research to be done there as well.

43:57

Speaker C

And presumably because you're ultimately based on transformer models, all of the limitations of traditionally trained LLMs are similar in diffusion models. Is that the case? Hallucination is one that comes to mind, for example.

46:09

Speaker B

Yeah. So hallucinations, yes. I think it's not necessarily an issue with the architecture, or at least the way I think of it, it's not. I think that's just a fundamental issue whenever you fit a statistical model: there is a regime where you're going to be interpolating, and maybe the answers that you get are going to be reliable, but there's always going to be a regime where you're extrapolating, and at that point there are going to be mistakes. I think that's to some extent unavoidable, no matter whether it's diffusion or autoregressive, no matter what the architecture is. We're learning from limited data, we need the model to generalize, and nobody really understands generalization. How does generalization in deep learning work? Even for classification, people in ML theory, very, very smart people, have spent a lot of time trying to understand how generalization works in deep learning, and the progress has been very limited. To this day there is no predictive theory that can tell you whether a neural network will generalize. It's all very empirical: you have to try, and then, did it work or not? There is no theory, at any scale that people would care about, that is predictive and will tell you how well a neural network will work in practice, even for classification. For generative models, it's even worse. It's a problem that fundamentally should be impossible to solve, right? There is the curse of dimensionality; there are some pretty good arguments for why what these models are doing should not be possible, yet they work. So I feel like there is something fundamentally missing there from a scientific point of view, in terms of understanding how these models work, why they work, under what conditions they will work. It's still very, very open.

46:29

Speaker C

And how about things like explainability, or the model's ability to estimate its uncertainty? I guess what I'm looking for is: are there any fundamental differences, either to the benefit of diffusion models or of transformer-based models, in these kinds of core dimensions?

48:34

Speaker B

Yeah, so we've not explored interpretability much, and I would not expect particular differences, in the sense that it's yet again one of those spaces where, the moment you start using deep networks, I'm personally pretty skeptical about the whole interpretability research direction, so that's not something we've invested in at the moment. One interesting direction that I think is actually exciting, and also practically relevant, is controllability. That's a space where people do care about being able to control the outputs of these models, and usually that's done through a prompt, maybe some guardrails at the end; there is a certain stack and a certain set of things you can and cannot do with an autoregressive model. A diffusion model, at least for images, is known to be much more suitable for controllable generation. The reason is that because the object, let's say the image that you're generating, is available to the model from the very beginning, it's very easy for the model to check whether this object it's generating is consistent with some constraints or some control signal that you want to use to make sure the output is consistent with whatever you want the model to generate. And not only can you check whether it matches your conditions, but you can also steer the generation process in a direction that makes it consistent with these external constraints. That's only possible because you have the full object from the very beginning, as opposed to generating it token by token, where you can only check whether it satisfies the constraint at the end. That's why diffusion models have been used a lot as priors for solving inverse problems in medical imaging; there are a lot of applications where this ability to control the output through some external signal has been really important. I was on some papers where we were doing medical imaging, and the idea is that when you do a CT scan, you're basically taking some projections of your body cross-section, and then you're trying to reconstruct what your body looks like from the measurements that you get from the machine. The more measurements you get, the higher the quality of the reconstruction, but it also means more radiation for the patient. If you have a good prior model of what the body looks like, which can be given by a diffusion model, then you can force the model to produce something that is likely to correspond to an actual human body but is also consistent with the measurements that you're getting for this particular patient. And that can significantly reduce the number of measurements that you need to take for the same quality level, which means less radiation for the patients. There's a number of problems that have that flavor where diffusion models have been really, really good. And so now, how do you do that for text? There's some work there, but that would be pretty exciting, because people care about being able to stay on brand, or, of course, safety constraints. There's a bunch of settings where you do want to be able to control the output of the model, and I think that's an exciting capability that is pretty unique to diffusion language models.
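
A hedged sketch of the measurement-guided sampling described here, reusing the toy continuous denoiser from the first snippet. Interleaving a data-consistency step with the prior's denoising update is a generic posterior-sampling heuristic; the linear operator A standing in for the CT projections and the fixed guidance weight are illustrative assumptions, not the method from any specific paper of his.

```python
import torch

@torch.no_grad()
def guided_sample(denoiser, A, y, dim=64, steps=100, guidance=1.0):
    """Toy measurement-guided sampling for a linear inverse problem y = A @ x_true:
    alternate the prior's denoising update with a data-consistency nudge."""
    x = torch.randn(1, dim)
    for i in reversed(range(1, steps + 1)):
        t = torch.full((1, 1), i / steps)
        x = x - (1.0 / steps) * denoiser(x, t)       # prior step: look like real data
        residual = A @ x.squeeze(0) - y              # disagreement with the measurements
        x = x - (guidance / steps) * (A.T @ residual).unsqueeze(0)  # consistency step
    return x
```

The fewer rows A has (fewer measurements, i.e. less radiation in the CT example), the more the prior has to fill in, which is exactly the trade-off he describes.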

48:54

Speaker C

Looking forward, what's your mental timeline for, maybe I should ask this in a more open-ended way: do you ultimately see diffusion challenging autoregressive models at frontier scale?

52:27

Speaker B

Yeah, I think that's our bet. I think there is no reason it shouldn't. I don't know how long it's going to take us to get there, and I guess the challenge is that the frontier keeps moving. If you tell me, this is the frontier, how long do you need to get there, I could probably come up with a reasonable estimate. The problem is that it keeps shifting: the models keep getting better and the speed keeps accelerating. So it's hard to predict how long it's going to take. And again, there is still a lot of R&D, unfortunately, which has a lot of risks but also a lot of upside. It's entirely possible that we come up with a new algorithm that is way better than what we have, and that could accelerate progress by a lot, especially because the diffusion language space is very, very unexplored. I think there is still a lot of low-hanging fruit, a lot of room for improvement, a lot of room for wildly better solutions than what we're currently doing. So it's hard to predict how quickly it will happen, and it's hard to predict how well it's going to work if we were to scale up to those sizes. What's exciting is that it's unlikely that one architecture is going to dominate the other one. Maybe the best-case scenario is, sure, diffusion models are better, everybody will switch, and that will become the architecture for LLMs in the future. Even if that doesn't happen, there have got to be some use cases, latency-sensitive, on-device, where an alternative architecture is just going to be better. And it's going to be such a big market that even the worst-case scenario is actually pretty good for us, because there are going to be so many use cases for these LLMs that as long as we can win on a reasonable subset of them, that's still going to be extremely valuable. Yeah.

52:46

Speaker C

When I think about latency-sensitive, the things that come to mind most immediately are things like voice interactions. But then I think about all the activity around agents and how they're running a loop, and any time you have looping, if you can compress the time for one run through that loop, then that compounds. Are you doing a lot, or seeing a lot, with regard to these diffusion models and agentic applications? Are they powerful enough to be used in agents now?

54:45

Speaker B

Absolutely, yes. We're already seeing a lot of usage, and you nailed the two main ones that we're seeing. Voice: a lot of voice, customer support, educational agents. People love the speed of diffusion language models. They always had this issue that they would want to be able to use a thinking model, a reasoning model, but usually the latency is just not acceptable, unless they use specialized AI inference chips, but that's too expensive and they cannot scale to large volumes. So we have a bunch of customers that are building voice agents on top of diffusion language models. And agents, that's another one, just general agents. Mercury actually works pretty well in OpenClaw, for example; you can just plug it in. You can also use it for coding clients like Kilo Code. It can use tools, it can reason, and it's really quick. It's not the best model if you're thinking, okay, I'm going to let it run for 24 hours and come back and see whether it solved my problem; that's probably not a good use case. But if you think about fast agentic interactions and loops where you're actually there, you want to get an answer quickly, and there is a human in the loop, then it's a really good model, because as you said, it's significantly faster, you can iterate more quickly, and so you can get to the final result in less time, which is the thing that actually matters to developers.

55:15

Speaker C

I think last year, at Google I/O, Google announced and kind of previewed their play in this space. I don't know that I've seen much of it since then. Have you tracked what they've been up to, and can you give us a summary?

56:48

Speaker B

Yeah, so I don't have any inside information about what they're doing. But as you said, they also announced a diffusion language model, Gemini Diffusion, a few months after we announced our first Mercury model. I'd like to think that maybe we had a little bit of influence there, pushing them to show they have something too. The numbers they published back then were very comparable to our initial Mercury 1 model, so I don't know whether they've been able to improve. What I know is that it's not yet in production, in the sense that it's not yet available to customers. So I'm guessing they're still working on figuring out how to serve it efficiently and what the best use cases are. My sense is that there's a big switching cost. They're very, very focused on Gemini as their main model, and that's kind of the issue with these big labs: they're all-in on one direction, and it's hard for them to really focus on an alternative direction. As a startup, we're much better positioned to do that, because we're laser-focused on one thing and we can build everything that's needed for that technology to succeed. In a big company, a big lab, the direction is already set, and there's a big opportunity cost if you want to switch.

57:06

Speaker C

What other labs or teams, academic or industry, do you keep an eye on for interesting work on diffusion?

58:39

Speaker B

Yeah. There's a lot of good work coming out of China, like the LLaDA models, which come from several Chinese universities collaborating with Alibaba. I think they get their compute and funding from industry, and they're doing good work on models, architectures, and how to train them. There's still a huge gap between those LLaDA models and what we have internally, but they've been doing good research and pushing the field forward. ByteDance has a pretty serious effort internally too; they've published at least a few papers with internally built diffusion language models from ByteDance Seed, which is the fundamental research group within ByteDance. A lot of smart people, a lot of good researchers; I think they've been doing good work in the space too. And then there's the whole academic community, with a bunch of interesting papers coming out. I was at NeurIPS in December, and it was crazy to see how many papers there were on diffusion language models. If you were to plot it, you'd see there's been an explosion since that original paper from my group in 2024. Now everyone is looking at this new paradigm. Of course it's exciting: there's this one approach that works really well for images, video, and music, and this other approach that works well for text and code. What's going to be the winning solution? Is there a way to unify everything and have a single kind of generative model that works well across all modalities? Everyone is excited about LLMs, but it's surprising how similar all the frontier-lab models are; they're all kind of clones of each other, with very little difference between them. So now there's an alternative approach, an alternative path, and of course that generates a lot of excitement in the research community, because it's an opportunity to do something new at the frontier, something with real impact on the conceptual foundations of the field.
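
As a concrete illustration of the masking-based paradigm being discussed, here's a toy Python sketch of masked-diffusion decoding: start from an all-masked sequence and iteratively commit tokens, several positions per pass, instead of strictly left to right. Everything here is illustrative; the denoiser is a stand-in that copies a fixed target string, where a real model would predict a distribution over the vocabulary for every masked position in parallel.

```python
import random

MASK = "<mask>"
TARGET = "diffusion models decode many tokens per step".split()  # toy "data"

def denoise(seq):
    """Stand-in denoiser: propose a token and a confidence for each masked slot.
    A real diffusion LM scores all masked positions in one forward pass."""
    return {i: (TARGET[i], random.random()) for i, tok in enumerate(seq) if tok == MASK}

def sample(length, tokens_per_step=2):
    seq = [MASK] * length                # start from "pure noise": all masks
    while MASK in seq:
        proposals = denoise(seq)
        # Commit only the highest-confidence predictions this pass; the rest
        # stay masked and are re-predicted with more context next pass.
        keep = sorted(proposals, key=lambda i: -proposals[i][1])[:tokens_per_step]
        for i in keep:
            seq[i] = proposals[i][0]
    return " ".join(seq)

print(sample(length=len(TARGET)))
```

The `tokens_per_step` knob is the speed/quality trade-off in miniature: committing more positions per pass means fewer model calls per sequence, which is where the latency advantage over one-token-per-forward-pass autoregressive decoding comes from.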

58:48

Speaker C

Do you see image and text as two divergent paths? Or, with image being further ahead, are there techniques created on the image side that you can pull over, or have pulled over, to facilitate your work on the text side?

1:00:57

Speaker B

Yeah, there's a lot of cross-pollination, I would say. I myself started out working on images, and so did a lot of the researchers on our team, because there wasn't really a pool of researchers working on diffusion for language, or not many. A lot of the people on our team started out as plain old diffusion-for-images or diffusion-for-video researchers, and a lot of that know-how did transfer reasonably well. We also pay close attention to what's happening in that community: distillation techniques to accelerate the models, as I mentioned before, and inference tricks to make diffusion models go even faster. All of those advances have been pretty exciting.
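
On the distillation point, here's a toy numeric sketch of the step-distillation idea carried over from image diffusion. The real recipes (progressive distillation, consistency-style distillation) operate on model outputs; this stand-in just makes the bookkeeping visible, with each "step" standing in for one model forward pass.

```python
def refine(x: float, steps: int) -> float:
    """Stand-in for a multi-step diffusion sampler: each step is one forward pass."""
    for _ in range(steps):
        x = 0.5 * (x + 1.0)   # toy update contracting toward the "clean" value 1.0
    return x

# Distillation idea: train a student so that student(x) lands roughly where
# the teacher lands after several steps, e.g. student(x) ~= refine(x, 4),
# collapsing four forward passes into one.
x0 = 0.0
print(f"teacher, 4 steps: {refine(x0, 4):.4f}")  # the target the 1-step student learns to hit
```

Fewer sampling steps means fewer forward passes per generation, which is exactly the kind of inference acceleration that transfers from images to text.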

1:01:14

Speaker C

Is there any work happening, either at Inception or elsewhere, that points to a credible multimodal approach based on diffusion?

1:02:06

Speaker B

So yeah, at Inception we haven't been prioritizing multimodal yet, but in the academic community there have been a number of papers over the last year or so, including from one of my co-founders, Aditya, a former PhD student in my lab. He's done some really good work showing how to build diffusion models that are truly multimodal. So there have been some really good results in the academic space on a unifying model based on diffusion that can handle different modalities.

1:02:17

Speaker C

Well, Stefano, it's been great catching up with you and getting the complete download on text diffusion. I feel caught up now. Thanks so much for jumping on and sharing a bit about what you've been working on.

1:02:56

Speaker B

Yeah, thank you so much for hosting me. It was a really fun chat. Same. Thank you.

1:03:11