Latent Space: The AI Engineer Podcast

Mistral: Voxtral TTS, Forge, Leanstral, & what's next for Mistral 4 — w/ Pavan Kumar Reddy & Guillaume Lample

49 min
Mar 30, 2026
Summary

Mistral AI announces Voxtral TTS, their first text-to-speech model, using a novel autoregressive flow matching architecture. The discussion covers their approach to building specialized AI models, their open source strategy, and enterprise deployment through their Forge platform.

Insights
  • Flow matching architectures can outperform traditional depth transformers for audio generation by better modeling the entropy and distribution of speech inflections
  • Specialized models for specific tasks (3B for TTS, separate models for transcription) can be more cost-effective than large general-purpose models
  • Enterprise AI deployment requires significant customization and fine-tuning on proprietary data to achieve meaningful competitive advantages
  • Open source AI models accelerate research by enabling academic institutions to develop new techniques like preference optimization
  • Audio AI is still in early stages compared to text/vision, with no dominant architectural paradigm yet established
Trends
  • Shift from general-purpose AI models to specialized, efficient models for specific enterprise use cases
  • Integration of multiple AI capabilities (text, audio, vision) into unified transformer architectures
  • Growing demand for on-premise AI deployment due to data privacy and security concerns
  • Flow matching and diffusion techniques expanding from image generation to audio applications
  • Real-time streaming audio generation becoming critical for voice agent applications
  • Formal verification and mathematical reasoning as proxies for long-horizon AI reasoning capabilities
  • Enterprise AI requiring extensive customization and domain-specific fine-tuning
  • Open source AI models driving academic research and technique development
Companies
Mistral AI
Main subject - announcing the Voxtral TTS model and discussing their AI development approach
OpenAI
Referenced for comparison with their Omni model approach and ChatGPT's impact on the industry
Google
Mentioned in the context of Google Assistant and Pavan's previous work on the Gemini team
Meta
Referenced for Llama model's impact on open source AI research community
People
Pavan Kumar Reddy
Leading audio research at Mistral, previously worked on post-training at Google Gemini
Guillaume Lample
Co-founder and chief scientist discussing Mistral's research strategy and model development
Quotes
"We really don't want to be living in a world where the smartest model, the best models are only behind closed doors, only accessible to a few companies"
Guillaume Lample
"Unlike text, even in vision I think this is true. But in audio it's definitely true. There is no winner model yet. There is no, okay, this is the way you do things"
Pavan Kumar Reddy
"What's very sad is that they are not leveraging these data that they have been collecting for four years or sometimes for decades"
Guillaume Lample
"If it compiles in Lean is functionally correct. It's like a program. If it compiles, hence it's correct. It's very easy"
Guillaume Lample
Full Transcript
4 Speakers

Speaker B

Welcome to Latent Space. We're here in the studio with trusty co-host Vibhu. Welcome.

0:05

Speaker C

Very excited for this one.

0:11

Speaker B

As well as Guillaume and Pavan from Mistral. Welcome. Excited to be here.

0:12

Speaker A

Thank you for having us.

0:16

Speaker B

Pavan, you are leading audio research at Mistral, and Guillaume, you're chief scientist. What are we announcing today? We're coordinating this release with you guys.

0:18

Speaker A

Yeah, so we are releasing Voxtral TTS. It's our first audio model that generates speech, though it's not our first audio model overall. We had a couple of releases before: one in the summer, Voxtral, our first audio model, but that was a transcription model, ASR. A few months later we released some updates on top of it, supporting more languages, and also a lot of table-stakes features for our customers: context biasing, precise timestamping, and auto transcription. We also had a real-time model that can transcribe not just at the end — you don't need to feed in your entire audio file — but as the audio comes in, in real time. And this is a natural extension in audio: speech generation. So we support nine languages, and this is a pretty small model, a 3B model. So it's very fast and also state of the art: it performs at the same level as the best models, but it's much more efficient, only a fraction of the cost. And we are also releasing the model weights.

0:26

Speaker B

Yeah.

1:25

Speaker A

Mammal linked. Not this time. Yeah.

1:26

Speaker B

What's the decision factor?

1:28

Speaker A

It's a good question. There will be more. There'll be more.

1:32

Speaker B

Yeah. Pavan, any other research notes to add on this?

1:36

Speaker D

Maybe we'll dive into it later in the podcast too, but it's a novel architecture that we developed in house. We iterated on several internal architectures and ended up with an autoregressive flow matching architecture. We also have a new in-house neural audio codec which converts audio into 12.5 Hz latent tokens — semantic and acoustic tokens. That's the new part about this model, and we're pretty excited that it came out with such good quality. And as Guillaume was mentioning, it's a 3B model. It's based off of the Ministral model that we released just a few months back, and it's mainly meant for the TTS stuff, but the innate text capabilities are also there. Yeah.

1:39

Speaker B

So there's a lot to cover. I love anything to do with novel encodings and all those things, because that obviously buys a lot of efficiency, but also maybe bugs that sometimes happen. You were previously at Gemini, where you worked on post-training for language models, and maybe a lot of people will have less experience with audio models in general compared to pure language. What did you find that you had to revisit from scratch as you joined Mistral and started doing this?

2:24

Speaker D

I think there are two buckets: audio understanding and audio generation. Audio understanding is the Voxtral models that Guillaume was mentioning that were released earlier — Voxtral Chat, which we released I think July last year, and the follow-up transcription-only model family that we released in January. That would be one bucket, and generation is another bucket. You can also treat them as a unified set of models, but currently the approaches are a little different between the two. To your question on how audio is fed to the model: in the understanding models, it's actually very similar to the Pixtral model that we also released. That was the first project I worked on after joining Mistral; it was pretty nice. And Voxtral was very similar in spirit. So we feed audio through an audio encoder, similar to images through a vision encoder, and it produces continuous embeddings which are fed as tokens to the main transformer decoder. And the model output is just text, so on the output side there is nothing special that needs to be done in these kinds of models. The interesting part about the generation stuff is that the output now has to produce audio. The approach we have is this neural audio codec which converts audio into latent tokens. There is a lot of existing literature and a lot of models based off this kind of approach, and we took slightly different design decisions around it. But at the end of the day, the neural audio codec converts audio into a set of 12.5 Hz latents, and each latent has a semantic token and a set of acoustic tokens. The idea is that you take these discrete tokens and feed them on the input side. There are several ways to fuse them at each frame, but we just sum the embeddings. So it's like having K different vocabularies, and you combine all of them because they all correspond to one audio frame on the input side. The output side is the interesting part. On the output side — I don't know if it's the most popular, but one popular technique is to have a depth transformer, because you have K tokens at each time step. With text, you just have one token at each time step, so you just predict the token from the vocabulary and get a probability.

2:52
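
To make the input-side fusion concrete, here is a minimal PyTorch sketch. All shapes and names are illustrative assumptions, not Mistral's actual configuration; the point is only the mechanic described above — K discrete tokens per 80 ms frame, each from its own vocabulary, embedded and summed into one position for the main decoder.

```python
import torch
import torch.nn as nn

K, VOCAB, D_MODEL = 9, 2048, 3072  # illustrative sizes, not Mistral's

# One embedding table per codebook (1 semantic + K-1 acoustic, hypothetically).
codebooks = nn.ModuleList(nn.Embedding(VOCAB, D_MODEL) for _ in range(K))

def fuse_frame_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (batch, frames, K) integer codes -> (batch, frames, D_MODEL)."""
    # Sum the K per-codebook embeddings; they all describe the same audio frame.
    return sum(codebooks[k](tokens[..., k]) for k in range(K))

frames = torch.randint(0, VOCAB, (1, 125, K))  # ~10 s of audio at 12.5 Hz
x = fuse_frame_tokens(frames)                  # one summed embedding per frame
```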

Speaker B

This is very straightforward for text.

5:10

Speaker D

Very straightforward.

5:11

Speaker B

Yeah.

5:11

Speaker D

But if you have K tokens, then the naive thing would be to predict all of them in parallel. That doesn't work — at least it doesn't work that well — because audio has more entropy. One of the techniques people use is this depth transformer, where you have a small transformer (it can be an LSTM as well, but people use transformers) and you predict the K tokens in autoregressive fashion within the frame. So you have two autoregressive things going on. The thing we did differently is that instead of this autoregressive K-step prediction, we have a flow matching model. Instead of modeling this as a discrete token set, we trained the codec to be both discrete and continuous, to have this flexibility. We did try the discrete stuff too, and it works well, but the continuous stuff works just better. So we took this flow matching head, which takes the latent from the main transformer, and — like in diffusion it's denoising, but in flow matching it's a velocity estimate — you go from noise all the way to the audio latent which corresponds to the 80 milliseconds of audio, and that is sent through the vocoder to get back the 80 millisecond audio frame. Yeah.

5:12
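
A rough sketch of the inference step being described: a flow matching head integrates a learned velocity field from noise to the frame's latent in a few Euler steps. `velocity_net` and all dimensions are hypothetical stand-ins; the real head, its conditioning, and its solver may differ.

```python
import torch

def sample_frame_latent(velocity_net, h, latent_dim=64, n_steps=16):
    """h: (batch, d_model) decoder state -> (batch, latent_dim) audio latent."""
    x = torch.randn(h.shape[0], latent_dim)      # start from pure noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((h.shape[0], 1), i * dt)  # current time in [0, 1)
        v = velocity_net(x, t, h)                # predicted velocity dx/dt
        x = x + v * dt                           # Euler integration step
    return x  # denoised latent for one 80 ms frame, decoded by the vocoder
```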

Speaker B

Is this the first application of flow matching in audio? Because usually I come across this in the image domain.

6:20

Speaker D

Yeah, actually in some sense there are flow matching models in audio. But this specific combination — I could be wrong, there could be some work, but I haven't seen much work on this, so I think it's novel. And the image community is just a way bigger community, so I think they pioneered a lot of this diffusion and flow matching work, and it's interesting to adopt some of those ideas into audio. Personally that's the big part: trying it out. One more meta point: unlike text — even in vision I think this is true, but in audio it's definitely true — there is no winner model yet. There is no "okay, this is the way you do things." It's still evolving. People are still iterating and figuring out what's the best overall recipe. I'm pretty sure there are models which are also completely end to end, native audio in and native audio out, but it still hasn't come to a convergence point where this is the right way to do it. That also makes the space pretty exciting to explore.

6:26

Speaker C

What are some of the ways to look at it? There are approaches where you do diffusion for the whole audio generation, but if you want real-time generation — which I'm assuming is a big thing with the approach you took — that changes things. And also, how do you go about evaluating the different axes of what you care about?

7:25

Speaker D

Yeah, good point. You can do just flow matching diffusion for the whole audio. We didn't even go down that path, because one of the main applications is voice agents and we want real-time streaming. That's not the only use case, but it's one of the primary use cases we want to get to. So we picked the autoregressive approach for that. Within the autoregressive space, again, you can do chunk by chunk or frame by frame. Personally I prefer the approaches which are the simplest, so we tried to see: can we just add audio as just another head on our regular transformer decoder model? That kind of makes it easier for eventual end-to-end modeling of audio and text — native modeling. And it works pretty well, so we went with that. We also iterated a bit on the head itself: we had a discrete diffusion kind of approach, which also works well, but the flow matching worked better.

7:41
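
To illustrate why the autoregressive choice enables streaming, here is a hypothetical generation loop reusing `sample_frame_latent` from the earlier sketch. Every interface here (`decoder.step`, `vocoder`, `is_end_of_speech`) is an assumption, not Mistral's API; the point is that audio can be emitted one 80 ms frame at a time instead of after the full utterance.

```python
def stream_tts(decoder, flow_head, vocoder, text_tokens, max_frames=1500):
    state = decoder.init_state(text_tokens)         # condition on the input text
    for _ in range(max_frames):
        h, state = decoder.step(state)              # one main-decoder step
        latent = sample_frame_latent(flow_head, h)  # 12-16 flow steps (see above)
        yield vocoder(latent)                       # ~80 ms of waveform, emitted now
        if decoder.is_end_of_speech(state):
            break
```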

Speaker B

I was just curious how you think about this overall direction of research. When you work with the audio team, do you set some high-level parameters and then let them explore, or how does it work between you guys?

8:40

Speaker A

I think the way it works is that we prioritize together what the most important features are. There are many things we can do in audio, and we try to decide how we should do things. For instance, ultimately what we want to do is build this full duplex model, but we are not going to start there directly. It's one of the projects people are working on.

8:53

Speaker B

But just to confirm, full duplex means it can speak while I'm speaking, or...

9:11

Speaker C

Yeah, yeah.

9:17

Speaker A

So ultimately we're going to get there. But we decided to take it step by step. We start with whatever is most important for our customers, which is transcription — the most popular use case — then speech generation, with real time just a bit before that, and then the next step is going to be to try to combine everything together. But yeah, we felt it was important to separate things and optimize each capability one by one before we merge all of that together.

9:18

Speaker B

Then the super Omni model.

9:43

Speaker A

It's very interesting because, as Pavan said, when you work on some other domains of LLMs, there are many areas where I think the research is not as interesting — in many places it's essentially just about data or creating new environments, a lot of relatively straightforward things. Whereas in audio there are so many ways to actually build this model, so many ways to go about it, and in that sense I think it's really interesting. And for speech generation we tried multiple approaches. What was interesting is that even though they were extremely different, they ended up very close at the end of the day. But the flow matching turned out to be quite a bit more natural, so we are happy with this.

9:45

Speaker B

Is there an intuition why flow matching just models speech better in some natural, fundamental latent dimension?

10:23

Speaker D

I think the main thing is that even at a particular time step there is a distribution of things to be predicted — the way you inflect. You already know the word that you're speaking, and in text space, let's say the word maps to just a single token for simplicity (in most cases it does), so there is not a lot of ambiguity: you just pick the word. But in audio, even the same word, even in your own voice, could be inflected in so many different ways. Any approach which models this distribution — and flow matching is one of them; it's not the only one at all, but it's one which works reasonably well — I think does better. The intuition I have is that there are several different clusters, each corresponding to some specific way you would inflect or pronounce that thing, and you can't predict the mean of them, because that corresponds to some blurred-out speech or something like that. You have to pick one — sharp conditional inference. Yeah, exactly.

10:32

Speaker B

Is that all covered under disfluencies, which I think is the normal term of art — disfluencies, pauses, intonations? By the way, we have to thank Sophia for setting all this up, including some of these really good notes, because I'm less familiar with audio.

11:32

Speaker D

For me, I think disfluencies are definitely one such phenomenon.

11:45

Speaker B

Disfluencies being the ums and ahs.

11:48

Speaker D

Yeah, ums and ahs. And also repeats — you do these filler words while you're thinking, so you repeat the word.

11:51

Speaker B

Okay. Whereas intonation is more like upspeak.

11:56

Speaker A

Okay.

12:00

Speaker D

And yeah, so I think there is a lot of entropy, and modeling it as a distribution with any technique helps with it. The depth transformer is a conditional way of modeling this, and transformers are really good at it, even though it's a mini transformer, so that worked pretty well for us too. It's just that the main consideration is that with a depth transformer, if you have K tokens, you need to do K autoregressive steps. So even though it's a small model, it's K steps, which is very latency heavy. With flow matching we were able to cut it down significantly: we are able to do the inference in 12 steps or 16 steps, and it works pretty well. And there are known techniques to bring it down even further — in the extreme case to one step. We're not doing that yet, but at least the framework lends itself to more efficiency.

12:01

Speaker B

Yes. And the image guys have done yeah incredible work as well.

12:44

Speaker D

Now you just send the prompt and you get an image.

12:49

Speaker B

Yeah. Surprisingly, not enough image model labs use those techniques in production, I think. It feels like a lot of research demos, but nothing I can use on my phone today.

12:51

Speaker A

What's interesting here is that since so much more work has been done in the vision community on this than in audio, there are so many low-hanging fruits, so many things we can do to improve this even further. This is our first version, but we have so many ways to make it much better and much more cost efficient. It's not a new field at all, of course, but there are still so many things that can be done.

13:02

Speaker B

I should also mention, for those who are newer to flow matching: I think the creator — this guy's name is Alex — did a very good workshop at NeurIPS, maybe two NeurIPSes ago. There's one hour on what flow matching is; I would recommend people look that up. That's the other thing, right: efficiency-wise, I imagine the reason is open weights — the reason you picked the 3B backbone. Are you trying to fit some kind of hardware constraints, some kind of latency constraints? What are they?

13:26

Speaker A

Not necessarily. Something we care about in our models is that they are efficient. We have a lot of separate models: for instance, this audio model that is very small and very efficient; we also have a small OCR model that is really good and highly efficient as well. An approach that maybe other companies are going to take is to have a very general model that does a bit of everything, but that is also going to be expensive. What we want to say is: if you care about this specific use case, you can use this model — it just does that, it's extremely good at it, and it's very efficient. That's why we can create models, audio but also OCR, that are really good at their task and much more cost effective than a general model that contains a lot of capabilities you don't really need. So we are doing general models but also more specialized models like this.

13:54

Speaker C

How does it compare to other TTS models? And we're going full open weights — we're just dropping it?

14:37

Speaker A

I think it's pretty good.

14:42

Speaker D

Yeah, I think it's pretty good. It's definitely one of the best for sure. I would say it's probably the best open source model.

14:43

Speaker B

Quite the siphon themselves.

14:50

Speaker C

Yeah. Why now? How does it fit into the broader Mistral vision? How do you see voice agents, how do you see voice? Every year I've heard "this is the year of voice." There's a lot of architectural stuff, a lot of end-to-end latency that you're solving. But where do you see voice settling?

14:52

Speaker A

We had so many customers asking for voice — that's also why we wanted to build it. What's interesting in this domain is that if you take something simple like transcription, it doesn't seem like something that should be very hard for a model. Essentially it's pattern recognition, it's classification, and these models are very good at classifying. Nonetheless, when you talk to them, it's not there yet. You don't talk to them the same way you talk to a person. Maybe people don't realize it, but English is still much better than any other language. For instance, if you talk to these models in French, you will see people talking very slowly, articulating as much as they can. It's not natural, right? We are not there yet. Maybe the next generation will not know this, but people who are maybe our age will always keep this bias of speaking very slowly when they talk to these models, even if in a couple of years, maybe next year, it will not be necessary anymore. What's interesting is that even for languages like French, Spanish, German — which are not low resource; there is a lot of audio out there — it's still not as good. And I suppose the reason is that there has not been as much energy, as much effort, put in as in some other modalities like vision or coding. There is still a lot of progress to be done, but I think it's just a question of doing the work, and there is a clear path to get there.

15:06

Speaker D

It's a little fascinating, because I worked on Google Assistant a while back at this point. When you take a step back, it's fascinating: it's not that long ago, four or five years ago, and now it's completely audio in, audio out, and the function calling and the whole thing happens completely end to end in a very natural way. And there are still ways to go — as you were saying, despite all the progress, it's not like you're speaking to a person when you talk to any of these agents, bots, or voice modes. There's still a gap. I think that's the great part, and I feel like even with the existing stack we should be able to get to very natural conversational speech abilities soon enough, I guess, and we'll also

16:21

Speaker A

hope to get there. And it's kind of the next step, right? Because when you talk to these agents, usually people are just writing to them. Sometimes — for instance when you want to write code — you have a very clear idea of how you want the model to implement what you have in mind, so you have to spend a lot of time writing, which is not really efficient. Audio is really the natural interface; it's just not there yet. But I think it's going to be there very soon.

17:07

Speaker C

What's it like building, serving, inferencing this? We hear a lot about how easy it is to take LLMs off the shelf, serve them, fine-tune them, deploy them. I know you guys have a whole stack — you have Forge, a whole stack for customizing and deploying. Is there a lag in getting that distribution channel? Are you helping there? With LLMs you can prompt them to be concise, verbose, all that, and these models are built on LLM backbones. How do you see all that?

17:30

Speaker A

Yeah, this is a lot of what we are doing with our own customers. Very often they come to us for different reasons. One reason is that they have a lot of privacy concerns: they have data that is very sensitive, they don't want the data to leave the company, they want it to stay inside. So we help them deploy models in house, either on premise or on a private cloud, so they are not worried that the data is given to a third party and that there is some leakage. Many companies have different tiers of sensitive data — tier 1, tier 2, tier 3. Tier 3 can be sent to the cloud; tier 1 has to stay in house. That creates heterogeneous workflows where it's annoying: this data you can send to the cloud, this data you can't. When we deploy the model for them, they don't have this consideration. They are not worried that anything is going to leak, and everything is much easier. So we help them do this — that's one of the value propositions. The other is that very often, when customers use these off-the-shelf closed models, what's very sad is that they are not leveraging the data they have been collecting for four years or sometimes for decades. So much data — sometimes it's trillions of tokens in a very specific domain, their domain — data that you will not find on the public Internet, data that the closed models do not have access to, and on which a model can become really good. If they're using closed source models, they are basically not benefiting from all these insights, all this data they have collected over the years. They can always put some of it into the context at inference, but it's never as good as actually training on it. So that's basically what we help them do. We provide them Mistral Forge, basically what we announced at GTC this week. It's a platform with a lot of tools to help them process data and train on it. It's actually the same thing we are using in the science team, so it's very battle-tested infrastructure: an efficient training code base for continued pre-training, fine-tuning, even doing SFT and RL. We help them do this using the same tools the science team is using, and since these are tools we have been using for two years now, they are really battle tested and sophisticated. We are giving companies the same thing the science team uses internally to build their own AI, and it makes a really big difference. Many customers don't realize how much better the model becomes when you fine-tune it on your own data. Your model is here and you start from there; you have a closed source model which is sort of here. But if you actually fine-tune, you can go much further than that, and then you have a very big advantage: the model is trained on your entire company knowledge, so it knows everything. You don't have to feed 10k tokens of context at every query, so it's much easier. Using a closed source model is a bit sad, because you are not leveraging all this data and you are going to be using the same model as all your competitors,
when you could be using everything you have been collecting for years, which is really valuable. So we help customers do this. We have a lot of forward-deployed engineers that go into the company, look at the problems customers are facing and what they're struggling to do, and figure out what we should do to solve it. We solve it together. I think our approach is a bit different here from some other companies and competitors: we don't just release an endpoint and say "do some stuff on top of that," and we don't just give a checkpoint. We work very closely with customers, we look at the issues they have, we help them solve them, and we make tailored solutions for the problems they're facing. One example: some customers really wanted a model that is really performant on some rare Asian languages. If you take the off-the-shelf models, they can speak it, they can write in the language, but it's not amazing — the language is maybe 0.01% of the training mixture, so it was included during training, but very little. What we did is train a new model for them where this language was 50% of the mix. It's much, much stronger; it knows all the dialects. That's one example of what we can do, and it's really arbitrarily custom. Some other customers, for instance, wanted a 3B model that can do audio and is very good at function calling — something to put in a car. In particular, they wanted it to be offline, because in a car you don't necessarily have access to the Internet. So we can build these solutions; there is no model like this out of the box. On the Internet you have these very general models — generalist, strong models — but for things like this they always want specific solutions. And another reason they sometimes come to us is that they experiment with some closed source model, they get a prototype, they are happy with what they built, it works well, they're happy with the performance, and then they want to go to production and they realize: oh, but it's extremely expensive, we cannot ship this. Then they come to us and say, can you help us build the same thing but with something much cheaper? And we can sometimes build something 10x cheaper by just fine-tuning a model, and it will be better, on prem on their own servers, and also much cheaper as well.

17:56

Speaker B

So yeah, that's the Mistral pitch right there. Take all the money.

22:43

Speaker C

Outside of that, you do put out open-weight models so people can do this themselves. I feel like not enough people go out of their way to.

22:48

Speaker B

They're not going to do it themselves; they're going to ask the experts to do it.

22:55

Speaker A

We were not doing this at the beginning of the company, because our strategy was not exactly the same as what it is today. What we underestimated initially is the complexity of deploying these models and connecting them to everything, making sure they have access to the company's knowledge. We were seeing customers struggling with this — and that was two years ago. Now things are much more complicated, because you don't just have text and SFT on simple instruction following: you have reasoning, you have agents, you have tools, then you have multimodal, audio. So it's much more complicated than before, and even back then it was hard for customers. They really need some support, and this is why we are also providing forward-deployed solutions like this.

22:58

Speaker B

I'm curious, is there also voice fine-tuning that people do?

23:39

Speaker D

So this Forge will also have a unified framework, and the hope is that it covers the Voxtral speech-to-text that we released earlier this year and even the Voxtral Chat that we released last year. There's a big, rich ecosystem of people fine-tuning Whisper, and people want the same thing with Voxtral — it's much stronger than Whisper. The platform offers that kind of fine-tuning, which could be any kind of fine-tuning. For instance, sometimes people want to add support for new languages beyond the ones we hope to cover natively; if there is a language where you have data and you want to fine-tune, that's a good use case. The other use case is the same language, even English, but in a very domain-specific setting.

23:43

Speaker B

Yes, terminology, jargon, medical stuff.

24:29

Speaker D

Exactly. And also specific acoustic conditions — say there's a lot of noise. The model will do decently in most conditions, but you can always make it better, and those are some of the use cases where you can improve it even further. That's one good use case. As for text to speech, we're just releasing it, so we'll have support for that soon too. It's a similar use case but a little different: the kinds of things you want to extend a text-to-speech model to could be voice personalization, voice adaptation for enterprises. Many enterprises need a very specific tone, a very specific personality for the voice, and all of those are good use cases for fine-tuning.

24:32

Speaker B

This one I was going to ask you about — we never talked about voice cloning here. How important is it? Like, can I clone a famous person's voice?

25:14

Speaker A

Okay.

25:20

Speaker D

The main use case would be enterprise personalization. Enterprises need a lot of customization: you don't want the same voice for all the enterprises. Each enterprise wants something customized and specialized that is representative of both their brand and, I guess, their safety considerations. The kind of empathetic assistant you would deploy in a healthcare domain would be very different from a customer support bot, and different again from more conversational settings. Those are the customizations you would expect from enterprises, and that's the main use case, at least from our side.

25:21

Speaker C

My basic example is that you don't want to call two customer services and hear the same exact voice — it's going to be weird. But also, on the technical side of this, there are a few things in Voxtral that I thought were pretty interesting.

26:04

Speaker B

He's a big fan of this paper.

26:15

Speaker C

Very good paper.

26:17

Speaker A

I think he said this is the

26:18

Speaker B

best ASR paper he's ever read.

26:18

Speaker C

Yeah, I've hyped up this Voxtral paper enough; we covered it somewhere. But a big thing: Whisper is known for 30-second generation, 30-second processing. You extended this to 40 minutes. There was a lot of good detail in the paper about how this was done, even little niceties of how the padding works — it's very much needed, you need to have that padding in there — and the synthetic data generation around this. I'm wondering if you can share the same about the new speech-to-text... sorry, text-to-speech. How do you generate long-form coherent audio? How do you do that? And any gems? Is there going to be a paper?

26:20

Speaker B

Yeah, yeah.

26:54

Speaker D

There will be a technical report, yeah, and it will have a lot of details. But the summary of it: some of the considerations in that paper came from starting with the Whisper encoder, and now we have in-house encoders, like the real-time model we released in January. We also released a technical report for that real-time model, which has this dual-stream architecture — it's an interesting architecture, you should check it out. There we have a causal encoder, and I don't think there's any strong multilingual causal encoder out in the community, so we thought it's a good contribution — a nice encoder that other people may want to adapt, and we trained it from scratch. I think our full stack is now mature enough that we're able to train super strong encoders. Some of those considerations, like the padding and such, were a function of the Whisper encoder, and now that we train encoders in house, the design considerations are different. For the question on text to speech: that also leans on the original autoregressive decoder backbone, so the considerations are almost identical. And the long context — it's not even long context. The model processes audio at 12.5 Hz, so 1 second maps to 12.5 tokens, and 1 minute is about 750 tokens. You can get up to 10 minutes in an 8k context window and half an hour in a 30k context window. 32k context is something we are very comfortable training on, and we can extend it even much longer, to 128k, so you can naturally see how it extends to even hour-long generations. We need the data recipe and the whole algorithm to work coherently through such long context, but the techniques are in some ways very similar to text long-context modeling. The key difference is that it's doing flow matching autoregressively instead of text token prediction.

26:54
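
The context arithmetic above is easy to verify. Assuming each 12.5 Hz latent frame occupies one decoder position (the K codebook tokens are summed into a single embedding on the input side):

```python
FRAME_RATE_HZ = 12.5  # one latent frame per 80 ms

def positions_for(seconds: float) -> int:
    """Decoder positions consumed by a given duration of audio."""
    return round(seconds * FRAME_RATE_HZ)

print(positions_for(60))       # 750   -> one minute of audio
print(positions_for(10 * 60))  # 7500  -> ten minutes fits in an 8k window
print(positions_for(30 * 60))  # 22500 -> half an hour fits in a ~30k window
```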

Speaker B

Okay. I think that was most of the sort of voice questions that we had.

28:48

Speaker C

But I have a big question on Mistral Small.

28:52

Speaker A

Let's go.

28:56

Speaker C

So what is small? How do we define small? What is this?

28:57

Speaker D

What is this?

29:01

Speaker C

I remember the days of Mistral 7B on my laptop. This one's not fitting on my laptop — maybe you could run it on a big laptop.

29:02

Speaker A

It's just a question of terminology. What I really care about is the number of active parameters. But it's true, maybe we could give it another name — we could have called it Medium. It's a mixture-of-experts model, and it's a model that combines different models we had before. The way we were doing things is that we had one general Mistral model doing instruction following; we had a separate model, Devstral, really good at coding, specific to code; we had another model for reasoning, Magistral. These were separate artifacts built by different teams at Mistral, and now what we are doing is basically merging all of this. Even Pixtral, the first vision model we had, was a separate model. The way we do things internally is that we focus on one capability, build one model, and when it's mature enough we decide to merge it into the mixture. Here it was the first time we basically merged all of these into one. There are some things we didn't have time to merge this time — for instance, more capabilities around function calling. I think that's going to be much better in the next version, and we're also working on larger versions of this.

29:09

Speaker C

And yeah, key things: it's very sparse, 6B active, pretty efficient to serve, 256K context.

30:16

Speaker B

Yeah I think what's interesting is just this general theory of developing individual capabilities in different teams and then merging them. Where is this going to end up?

30:22

Speaker C

We've seen the five things put together in this. What are the next five teams?

30:33

Speaker B

I think actually OpenAI has gone away from the original GPT-4o vision of the Omni model — that's what they were selling, all modalities in, all modalities out. But I feel like you might do it.

30:37

Speaker A

I think there are some modalities where it's not completely obvious. For instance, audio: if you want to do transcription, it makes no sense to use a model this large — if you just want to transcribe, it would be very inefficient. If you want to do audio, you probably just want the 1B or 3B model; performance will be essentially the same and it's going to be incredibly cheaper. That's why we want a separate model that just does this. The question is, if you are talking to your model by speech and asking very complex questions — how do you do this? Do you cascade things, or do you put audio into the very large model? It's not a settled discussion, and I'm not sure we're going in that direction, but it's possible, of course. For us, the next capabilities we want to integrate into these models are going to be more coding, more reasoning, but also capabilities that people don't talk about too much which matter a lot for our customers in different industries — things like legal, finance, computer-aided design. These are things the models out of the box are not that good at, because people don't really prioritize them; there is no big benchmark for that. But it's not that hard to make these models good at them — you just have to do the work. So yeah, there are other things we'll merge into this.

30:46

Speaker B

I think for voice, the key thing over maybe the last year or so, with Veo and Grok Imagine and all these things, is joining voice with video, right? People don't understand spatial audio, because most TTS is just "I'm speaking into a microphone in perfect studio quality." But when you have video, the voice moves around.

32:03

Speaker D

That's true. The consideration there is a little different, in the sense that it's a standalone artifact where you get the whole thing and you consume it. In the conversational setting, you need extreme low latency; streaming would be one of the primary considerations.

32:24

Speaker B

You can build a giant company just doing that, so you don't need to do voice plus video. But just on the theme of merging modalities, that is something where I'm like, wow: everyone up until, let's say, mid last year was doing these pipelines — okay, we'll stitch a TTS model with a voice thing and a lip sync thing and what have you. Nope, just one giant model. Yeah.

32:41

Speaker C

I have a two-part question. One: it seems like open source is still very core to what you guys do. And I just have to plug your paper, January 2024 — the Mixtral of Experts paper, very fundamental research on how to do good MoEs. Very good paper for anyone; that's just a side tangent. No, this

33:05

Speaker B

thing — the 8x22B — was like the nuclear bomb for open source.

33:25

Speaker A

I think it was the 8x7B though.

33:31

Speaker B

The 8x7B. Okay.

33:33

Speaker C

Yeah, yeah.

33:35

Speaker A

But there is the bigger one as well.

33:35

Speaker B

Yeah, yeah. I don't remember this.

33:37

Speaker A

I remember.

33:38

Speaker B

I don't think it was January, right? It dropped during NeurIPS, and everyone at NeurIPS was talking about it.

33:39

Speaker A

So it's December of '23. But yeah, the model was out before the paper.

33:44

Speaker C

It's just a little update probably.

33:48

Speaker D

Yeah.

33:49

Speaker B

No, but you have a point to make.

33:50

Speaker C

No, you've got to check that. But then I just want to hear more broadly about open source for you guys. And when we asked earlier about what's next, what the other teams are working on — you put

33:52

Speaker B

out Leanstral. This honestly surprised me. I was like, this doesn't fit my mental model of Mistral.

34:03

Speaker A

Yeah. First, for open source in general, I think it's really something which belongs to the DNA of the company. We have been open sourcing models since the beginning, and even before this — before Mistral, we released Llama. And what was really nice to see is that before that, for most researchers, like universities, it was impossible to work on LLMs; there was no LLM available. If you look at many of the techniques that were developed after Llama was open sourced — all these post-training approaches, like DPO, preference optimization — all of these were done by people that had access to this model, and they would have been impossible without it. So it really makes science move faster, and we really want to contribute to this open source ecosystem. I think the DeepSeek papers also had a lot of impact — all these papers in the open source community are really helping the science community as a whole move faster, so we want to contribute to this ecosystem. That's why we are releasing very detailed technical reports: for Magistral, our first reasoning model, with a lot of the things that worked and the things that did not work, which I think is helpful. For the audio models we are also going to share a lot of details — we shared them for the real-time model. We really want to continue to belong to this community of people who share science. We really don't want to be living in a world where the smartest models, the best models, are only behind closed doors, only accessible to a few companies that have the power to decide who can use them or not. I think that's a scary future we don't want to live in. We really want these models to be accessible to anyone; we want intelligence to be usable and accessible by anyone who can use it. That's why we are pushing these open source models. So Voxtral TTS is open source — not the first, and not the last. And Leanstral, I think, is also one step in this direction. It's a bit different from what we usually release, but we have a small team internally working on formal proving, formal math. It's a subject we care about in general; we were working on reasoning — I think we started too early, before LLMs. Doing reasoning without LLMs is very hard, especially when you work with formal systems, because the amount of data you have is negligible: it's a very small community of people writing formal proofs. But the reason we like it is that if you look at what people are doing with reasoning, the problems you can use are usually going to be problems where you can verify the output. For instance, all these AIME problems where the solution is a number between one and a thousand, so you can compare it with a reference; or it's an expression, and you can compare the expression your model generates with a reference. But for most math problems and most reasoning problems, there is no way to easily verify the solution. If the question is "show that f is continuous," you cannot compare with a reference. If it's "prove that this is true" or "prove this property," you cannot simply verify the correctness of the proof. So it's hard to apply RL — there is no verifiable reward here. What you could provide, of course, is a judge that will look at your proof.
But that's very hard, and you could get some reward hacking happening there. So it's difficult. You could provide a reference proof, but there are also many ways to prove the same thing. If the model gives a different proof and you give it a negative reward because it differs from the reference, maybe it was still a legitimate proof, just a different one. So that's not going to work well. What's nice with Lean and with formal proving is that you don't have to worry about any of this whatsoever.

34:09

Speaker B

As long as it compiles in Lean, it's functionally correct.

37:25

Speaker A

Exactly. It's like a program: if it compiles, it's correct. It's very easy. And you can apply this to any kind of...

37:28
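
As a toy illustration of why Lean gives a verifiable reward: the reward function reduces to "does the file compile?". The example below is my own, using only Lean 4 core (`Nat.succ_le_succ` is a core lemma); it is not taken from Leanstral.

```lean
-- If this file compiles, the statement is proved, and a trainer can assign
-- reward 1 mechanically; no judge model, no reference proof needed.
theorem add_one_le_add_one (n m : Nat) (h : n ≤ m) : n + 1 ≤ m + 1 :=
  Nat.succ_le_succ h
```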

Speaker B

The community is just way too small, so no human will actually go and do it.

37:33

Speaker A

Yeah, exactly. Only a few people can do it — it's a very small community of people doing a PhD on that. It's super small, which is a pity, because it's actually very useful, not just for math but also for software verification. Software verification today is a tiny market; very few industries work on this, and we need it. It's usually companies building airplanes — aeronautics — things where they absolutely want to be sure, where lives depend on it. But it's very rare that people formally verify the correctness of their software, and I think one reason is simply that it's just super hard to do.

37:37

Speaker B

Are you thinking of TLA+? It's a language that some people use for software verification.

38:09

Speaker A

No — like Coq, which people use in France. But yeah, the reason, I think, why people don't use it more and why this industry is not as big as it could be is that it's very hard. But now, with the coding agents that are out there, it's going to be very different; we're going to see much more of this. So I think that industry is going to be much larger in the future now that we have these models. Here too, anticipating this a little bit, we wanted to work on that, because proving a math theorem and verifying a function use essentially the same tools. Yeah.

38:14

Speaker B

One of my theories is that because the proofs take so long, it's actually a proxy for long-horizon reasoning and coherence and planning. Maybe a lot of people will say, okay, it's for people who like math, it's for Lean, it's a niche math language, who cares. But actually, when you use this as part of your data mixture for post-training and reasoning, it might spike everything else, and I think that's under-explored — no one's really put out a definitive paper on how this generalizes.

38:39

Speaker A

Yeah, absolutely. And I think that's what we are seeing already. For instance, if you do some reasoning on math, then the model gets better at reasoning on code, and everywhere else. Even Lean should help code in the same way. So there is some transfer, some sort of emergence that happens. And it's also interesting — not just the topic in general, but there is a lot of connection with coding agents, because sometimes the model sees a theorem it has to prove that is very complex, but then it can take the initiative and say: I'm going to suggest three lemmas and prove each lemma in parallel with subagents, and I'm also going to prove the main theorem assuming the three lemmas are true. And what's also pretty interesting is that even if you fail to prove one of the lemmas, maybe you succeed in proving the other lemmas, so you get some reward there. It's a bit less sparse than if you just get a 0 or 1 for the entire thing. So it's pretty interesting, I think.

39:08
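
A sketch of the lemma decomposition idea in Lean 4 — again my own toy example, not Mistral's. Each lemma can be handed to a subagent and checked independently; a lemma left as `sorry` fails full verification, but a trainer could still grant partial reward for each lemma that compiles on its own.

```lean
theorem lemma1 (n : Nat) : n + 0 = n := Nat.add_zero n
theorem lemma2 (n : Nat) : 0 + n = n := by sorry  -- subagent failed this one

-- The main theorem composes the lemmas: n + 0 = n and n = 0 + n.
theorem main_goal (n : Nat) : n + 0 = 0 + n := (lemma1 n).trans (lemma2 n).symm
```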

Speaker C

Yeah, it's also an interesting case just for specialized models in general.

40:02

Speaker B

Right.

40:06

Speaker C

Like the cost chart you showed is pretty interesting — similar scores, and it's 30 versus 70, 150,

40:07

Speaker B

300 bucks — though on the comparison with them, I think cost is a bit unfair.

40:13

Speaker A

Right.

40:17

Speaker B

Because this one is at like inference cost, while the others have their margins on top of it. But we don't know anything else, so you've got to figure it out.

40:17

Speaker C

I did want to push on that more — not on cost. You mentioned it's a great way to have verifiable long-context reasoning. What are the other frontiers that I'm sure you guys are working on internally? There's a lot of push from people on pre-training scaling versus RL, pushing compute toward having more than half of your training budget all on RL. Where are you seeing the frontier of research in that?

40:25

Speaker A

You mean in research?

40:49

Speaker C

Just in foundation model training over the next while. One thing you guys actually do is fundamental research from the ground up, right? So you probably have a really good view for forecasting this out.

40:50

Speaker A

Yeah, I think for us, we are still working a lot on the pre-training side. I think we are very far from having exhausted pre-training — the Mistral 4 pre-training will be a big step compared to everything we have done before, so we are pretty excited about this. And on the RL side, I think we now have to think more and more about algorithms that will actually support these very long trajectories. GRPO, for instance, doesn't really work when you're a bit off policy. That was okay initially, because you were solving math problems that can be solved in a few thousand tokens, so the model can generate them pretty quickly, and when you do your update, the model is never too far off. But now, when you are moving towards problems where something takes hours — like six hours to get a reward — then your model is completely off policy. So you have to build new infrastructure that supports this, but also new algorithms. Everything we are doing internally, we are trying to build the infra that we anticipate needing in six months or a year, which is these extremely long-horizon scenarios. I think when we started Mistral, for me and maybe also Timothée, we wanted this very nice environment where people are there, they can do the research they like with a lot of resources. It was nice. Things changed a lot when ChatGPT came out — after that, these labs were never quite the same. But it was nice, and I think we also want to be part of this future we had before.

41:00
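
Since GRPO came up, here is a minimal sketch of its group-relative advantage, stripped of the clipping and KL terms of the full algorithm; nothing here is Mistral's code. The off-policy problem described above is visible in the structure: the gradient assumes `logprobs` were computed by (roughly) the same policy that sampled the completions, which stops holding when a rollout takes hours.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scores for G completions sampled from one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """logprobs: (G,) summed token log-probs of each completion under the policy."""
    adv = grpo_advantages(rewards).detach()  # group-normalized, treated as constant
    return -(adv * logprobs).mean()          # ascend advantage-weighted likelihood
```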

Speaker B

coming to the end. Obviously I think you guys are doing incredible work. You've laid out very impressive vision for open source and for Voice. What are you hiring for? What are you looking for that you're trying to join the company?

42:19

Speaker A

Yeah, so we are hiring a lot of people on the science team. We are hiring in all our offices: our HQ is in France, in Paris; we have a small team in London, and a team in Palo Alto as well. Recently we opened offices in Warsaw, in Poland, and one in Zurich. We also have some presence in New York, and soon one in San Francisco. So we are a bit everywhere, and we also hire people remotely. We are growing the team, trying to hire very strong people. The team is still fairly small, and I think we want to keep it that way, because we find it quite efficient: a small team, very agile.

42:32

Speaker B

Okay, let's focus on science and the forward deployed. We actually are strong believers in science — we started our new science pod that focuses specifically on AI for science. What areas do you think are the most promising?

43:05

Speaker A

What we are pretty excited about right now — something we have already started doing, and we'll probably be able to share more about it in a couple of months — is that we are exploring AI for science. There are a lot of areas where we think you could get extremely promising results if you were to apply AI in these domains. There are a lot of low-hanging fruits; you just have to find the domains where AI has not yet been applied. And it's usually hard to do, because the people working in those domains don't necessarily know the capabilities of these models. They don't know how well AI would

43:15

Speaker B

You just have to pair them with...

43:42

Speaker A

Yeah, exactly — researcher matching, which is actually hard to do. But this matching we are doing naturally with our customers. We have some companies we work very closely with; for instance, ASML and Thales are among our partners, and we are doing some research with them. There are tons of extremely interesting problems — problems in physics, in science, in material science — that they are essentially the only ones working on, because they are doing something no one else is doing. So there are many domains where AI can actually revolutionize things; you just have to think about it and be familiar with what AI can do well now and how to apply it. So it's something we are doing more and more with our partners, with our customers. AI for science is one big thing. Yeah.

43:43

Speaker B

Okay. And then forward deployed: what makes a good forward-deployed engineer? What do they need? Where do people fail?

44:19

Speaker A

I think you usually need people that are very familiar with the tech — not necessarily with a lot of research expertise, but actually pretty good at using these models: people who know how to do fine-tuning, who know how to start an RL pipeline. It's not easy; it's something the majority of companies will not be able to do on their own. So we need people that like to solve problems, that get excited about solving some complex, very concrete problem. It's applied science, basically. And I think it's not too different from the skills you need in research, because essentially you are trying to find solutions to problems that customers have not yet solved. Sometimes it's easy; sometimes you have to do the work — create synthetic data, find the edge cases. It depends on the problem, but you need a bit of patience and to be creative. Very similar skills.

44:26

Speaker D

The diversity of the work they do always surprises me. It goes all the way across the kinds of things they encounter in different industries. It's just very interesting.

45:14

Speaker B

Any fun success anecdotes?

45:22

Speaker A

Yeah, it can be really training a small model on the edge that just does one specific thing, or training some very large model for specific languages, or making models really good at certain domains — for instance computer-aided design, these kinds of things.

45:24

Speaker B

Is that pairing with vision as well?

45:38

Speaker D

Yeah, and defect detection for chips, or identifying things in factories. The diversity — it could be anything where you can deploy these foundation models. The work is making it work in that specific setting: basically whatever it takes to make it add value in their workflow.

45:41

Speaker C

Yeah, and it goes across the stack, right? Even just pulling up the website —

45:59

Speaker B

the scope is so broad.

46:03

Speaker C

We didn't even touch on Mistral Vibe — you have a coding CLI tool. One thing you guys were actually, I think, the first to do was Mistral Agents: you had an agent builder, you can serve it via API and all that. I'm guessing forward-deployed people help build

46:06

Speaker A

that out and stuff. We are doing many things, but I think that's also part of the value proposition. Customers are always extremely careful about their data, and they don't like trusting so many partners: trusting one partner for code, giving your data to another third party for audio, and another one for something else — they don't like this. What they really like with our approach is that we can help them with anything, so they don't have to send their data to so many clouds.

46:21

Speaker B

So yeah. I think there can be many orders of magnitude more FDEs than research scientists, and they don't need your full experience, but they're still super valuable to customers in practice.

46:49

Speaker A

These two teams are still quite intertwined. They use the same tools, the same data pipelines and everything. And it's very helpful for the science team to get the feedback from the solutions team, because they can say: look, these customers are trying to do this and it's not working — can we make sure it works in the next version?

46:59

Speaker D

Yeah.

47:14

Speaker B

Look, this is basically a real world eval. Yeah.

47:15

Speaker A

It's a real-world eval, and it's not something you get otherwise. For instance, if you are just working in a lab, you just ship models, but you don't do this work of preparing the model for customers, so you have no idea whether your model is good at these edge cases. There is a very big gap between the public benchmarks, which are very academic, and the real cases, which are

47:17

Speaker D

just very diverse. And in the specific context of a customer, you can fine-tune and make it better: first create a solid eval benchmark, and then measure in that context, with the kind of audio they actually see. For instance, one use case is literally just: there'll be a word for kids and they have to say it out loud. It's a very specific thing — you're just saying one word, and then you have to grade the kid on whether they said it right or not. It's like RL for kids. So there are very diverse use cases, and the idea is that the applied scientists and engineers go and make it better, and then from the learnings we incorporate it into the base model itself, so it's just better out of the box.

47:37

Speaker A

Yeah.

48:15

Speaker C

It's a good full-circle system. The foundation model evals are all just proxies of what you really need. You're never going to have one for everything — it doesn't make sense for there to be a single-word kids' transcription benchmark like that. It's not something you want to overfit on. Perfect.

48:16

Speaker B

Everyone should go check out everything Mistral has to offer and try the TTS model, which we'll link in the show notes. But thank you so much for coming.

48:29

Speaker D

Thanks.

48:36

Speaker B

Such a pleasure having you with us.

48:36