The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

Dynamic Token Merging for Efficient Byte-level Language Models with Julie Kallini - #724

51 min
Mar 24, 2025
Summary

Julie Kallini, a Stanford PhD student, discusses her research on MrT5, a dynamic token merging approach for efficient byte-level language models that addresses tokenization issues such as unfair pricing for non-English languages. She also covers her Mission Impossible paper, which explores whether language models can learn "impossible languages" that humans cannot.

Trends
- Shift toward byte-level modeling to address tokenization inequities
- Dynamic compression replacing static tokenization
- Growing focus on multilingual fairness in AI systems
- Architecture innovations to improve efficiency without sacrificing performance
- Integration of linguistic theory with machine learning research
Full Transcript
Probably the more important reason tokenization is kind of flawed is that there are different compression rates for different languages and scripts. High-resource languages like English are totally fine; on average a token is maybe four or five characters, or approximately a word. But for other languages, the same sentence could be tokenized into so many tokens that users who speak those languages, when they interact with language model APIs where users are charged per token, will basically be overcharged.

All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today I'm joined by Julie Kallini. Julie is a PhD student at Stanford University. Before we get started, be sure to hit that subscribe button wherever you're listening to today's show. Julie, welcome to the podcast.

Thank you so much for having me, Sam.

I'm looking forward to jumping into our conversation. I came across a couple of your papers recently that I'd like to dig into. We'll be focusing primarily on the most recent of the two, called MrT5 (love that name): Dynamic Token Merging for Efficient Byte-level Language Models. But you've also got a really interesting paper called Mission Impossible Language Models that, time permitting, we'll touch on. Before we get going, I'd love to have you share a little bit about your background and how you got started in the field.

Yeah, thanks so much for the introduction. Starting from where I got into computer science: I really just loved math and science in high school. Around junior year, I think it was my sister who told me, "It's time to start thinking about what you're going to major in in college, kiddo." She suggested I think about computer science, because it's really applied math: taking math and applying it to making computers work. I started to learn a little bit of programming on my own, and I thought, oh, this actually seems pretty fun. I jumped to college and took my first structured computer science course. It was a little daunting, because most of my peers were not in the position of taking their first computer science course, but I stuck with it, I loved it, and it all seemed to work out.

In terms of actually getting into the field of NLP, natural language processing: throughout college I wasn't sure which topic within computer science was really my focus. I thought I could have been a systems person or a theory person for a long time. But the thing that brought me toward machine learning was taking my first linguistics class. I really enjoyed looking at language in a way I hadn't before; linguistics approaches the study of language as a science. And natural language processing is the most natural way to marry linguistics and computer science, the perfect marriage of the two. So that's what got me interested in NLP.

Awesome. And what year are you in your program?

I'm a second-year PhD.

What's your research focus? How do you think about the things that you're interested in?

I like to say that what I work on is fitting pop culture references into my paper titles. But if you want a more serious answer, I think there are two strands to my research right now.
I've been really interested in tokenization and byte-level models, which is the focus of the MrT5 paper we're going to be talking about. The other big strand, which is the focus of the Mission Impossible paper as well as the follow-ups we're doing to it, is looking at how language models could help us understand linguistics or cognitive science. Those are the big strands. It used to be that we just wanted computers to mimic human language in some way, but now that they're so good, the question is: can they help us learn more about language?

Well, let's dig into the MrT5 paper, and I'd like to maybe start at the top and talk about tokenization. Why is tokenization so important for large language models?

Yeah, so tokenization is the preprocessing step that's central to basically every language model you've heard of these days. It breaks up text into chunks, or units, called tokens. You can think of these as words or parts of words, and these are the units we feed into a language model. You can think of tokenization as a form of compression, because it takes a long text sequence and turns it into a smaller set of units. If you think about the transformer, which is the component central to all of the large language models we know today, it can be very expensive; the term you'll hear is quadratic complexity in the sequence length for the attention mechanism. So long sequences are pretty inefficient, and tokenization compresses them into these smaller units.

However, there are some big problems with tokenization. For one, it can be really sensitive to character manipulations, or character-level noise. A simple spelling error can result in a sequence being represented by a completely different set of tokens. But probably the more important reason tokenization is flawed is that there are different compression rates for different languages and scripts. High-resource languages like English are totally fine: on average a token is maybe four or five characters, or approximately a word. But for other languages, the same sentence could be tokenized into so many tokens that the model is basically operating at a character level. Then all the problems of efficiency come in, and users who speak those languages, when they interact with language model APIs where users are charged per token, will basically be overcharged. The unfairness aspect is really interesting to me.

You mentioned that one of the issues with tokenization is that it's sensitive to character errors. And yet I think a lot of our experience with LLMs is that we almost get really sloppy with the way we communicate with them, because we know they'll figure it out. How do you reconcile those two experiences?

Yeah, that's a great question. Intuitively, when I think about tokenization and how it's sensitive to character errors: if a word is spelled slightly differently, it can be chunked in a different way, or if a letter is capitalized, that word will get a completely different input representation. And then the model basically has to learn, implicitly, that this word you typed actually means this other word.
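To make that sensitivity concrete, here is a small illustrative sketch (an editorial addition, not from the episode) using OpenAI's tiktoken library; the example words are our own, and the exact token splits depend on the tokenizer version:

```python
# Illustrative sketch (editorial addition): how small character-level
# changes can alter a subword tokenization. Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 tokenizer

for text in ["importantly", "Importantly", "improtantly"]:
    ids = enc.encode(text)
    print(f"{text!r}: {len(ids)} token(s) -> {[enc.decode([i]) for i in ids]}")

# A capitalization change or a typo can split the word into a different
# set of token IDs, so the model receives a different input representation
# for (nearly) the same word and must reconcile them implicitly in training.
```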
I think the models are trained on so much data, and for so long, that these differences don't seem to matter much on the surface, but they can still result in weird, interesting failure cases.

The second idea that you mentioned is that subword tokenization is less efficient for under-resourced languages. Can you give some examples of how that comes into play?

Yeah, for sure. When I give talks about MrT5, I have an example where I take the GPT-4 tokenizer, a sentence in English, and its translation in Arabic. We know these sentences mean the exact same thing, because it's literally the same sentence translated into Arabic. The exact numbers I have for the GPT-4 tokenizer are that the English sentence is tokenized into nine or ten tokens, and the Arabic sentence is tokenized into 31 tokens, even though there are actually fewer characters in the Arabic sequence.

There are a variety of factors that go into this. Perhaps the tokenizer is just not trained on enough Arabic, or English is the dominant language. But there is also a linguistic perspective, and Arabic is one of those languages: in some languages, meaningful units are not necessarily adjacent. These are called infixes. Rather than attaching a prefix or a suffix to add meaning to a word, like we do in English (say, adding "ed" to mark the past tense), you can shove it into the middle of a word. It's no longer concatenating something to the end; it's inserting it in the middle. So the way the morphology works actually breaks up the root into pieces. From a linguistic perspective, I can see why an algorithm that merges frequent adjacent tokens might not be the best for all languages, beyond just having enough data for that language during tokenizer training.

And so the contention in this paper is that a big part of the issue arises from the unit of tokenization. Did I get that right?

Yeah, basically. Performing tokenization can have these drawbacks for certain languages, and how much compression you achieve in different languages, and how efficient the model is going to be in each language, is a big factor.

And so one of the distinctions that you call out is between subword tokenization, character-level tokenization, and byte tokenization. Can you talk about the differences there?

Most of the main models use subword tokenization, which is what we've talked about. The alternative would be character-level or byte-level models. These language models don't perform a tokenization preprocessing step; they just take in the raw character or byte sequences as the units that the transformer will operate on. And the benefit is that a lot of the issues we talked about, like sensitivity to character-level manipulations, or the model having more awareness of which characters comprise its tokens, are addressed by modeling at the character or byte level.
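The per-language disparity is easy to see by counting units directly. Here is a sketch (an editorial addition; the sentence pair is our own translation, and exact counts will vary by tokenizer):

```python
# Illustrative sketch (editorial addition): compare subword token counts
# with raw UTF-8 byte counts for a sentence and its Arabic translation.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("English", "I want to read a book."),
    ("Arabic", "أريد أن أقرأ كتابا."),  # same meaning, fewer characters
]
for label, text in pairs:
    n_tokens = len(enc.encode(text))
    n_bytes = len(text.encode("utf-8"))
    print(f"{label}: {len(text)} chars, {n_tokens} subword tokens, {n_bytes} UTF-8 bytes")

# A byte-level model like ByT5 consumes the UTF-8 bytes directly, so the
# tokenizer cannot penalize any particular language -- but the sequences
# it must process get much longer, which is the efficiency problem
# discussed next.
```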
The issue is that you get these problems of very long sequence lengths when you're operating on the raw character or byte streams. There are different architectures that have addressed this; in the related work section I talk about Charformer, or CANINE, which is Google's character-level counterpart to multilingual BERT. These are architectures that try to downsample the sequence in a learned way.

The focus of our paper is on ByT5, and the reason we focused on ByT5 is that it had very impressive performance compared to its token-level counterpart, which was multilingual T5 (mT5). On all these benchmarks it matched or even outperformed mT5. But the main problem was efficiency: it was exactly the architecture of mT5, but operating on bytes, so you get these very long sequence lengths. Our idea was, how can we take this existing byte-level model and make it more efficient, specifically through minimal fine-tuning? That's where the idea started.

What you're talking about here is modeling directly at the character or byte level, as opposed to some other type of tokenization that's more granular.

Yes, tokenization specifically as preprocessing before feeding the sequence into the transformer. There are methods some would call soft tokenization, where you feed in a character or byte sequence but the transformer learns how to group tokens implicitly via some learned mechanism. An example would be the Charformer architecture, or the Hourglass Transformer, which we also partially replicate in the paper. Some have called those soft tokenization; I would consider them separate from the typical subword tokenization we know from most language models, like your Llamas or your GPTs. Well, we know GPT-4 uses a tokenizer, so yes, GPT-4.

And so MrT5 isn't an alternative approach to tokenization. It's an alternative model architecture that doesn't require tokenization, and it's based on ByT5, which preceded it and had some inefficiencies.

Yes, exactly. Actually, how this project came about was that I was exploring, doing some interpretability work on character-level language models: what do models learn about words when they're operating at the character or byte level? This was early last year, during the second quarter of my first year of grad school, and I was in a meeting with my advisors, Chris Potts and Dan Jurafsky. I think we just came to a point where we said, why aren't people using these models? If there are these benefits to abandoning subword tokenization, what's preventing them from taking off? It seems like the main thing is the efficiency aspect. So I geared toward an architecture project that could address that.

And how do you characterize that efficiency aspect in the paper?

I think it's always helpful to look specifically at wall-clock time, and a FLOPs analysis is also important: how many floating-point operations are done in the model compared to, say, its token-level counterpart? We include both of those analyses in the paper. In the end, I think it's always important to include the wall-clock time per example, so we made sure to include that.

And so let's talk a little bit about the architecture and the way you approached it.
So the idea of MrT5 is to take ByT5, which puts all, or most, of its parameters in this heavy encoder, so most of the inefficiency comes from processing sequences in that heavy encoder. To take a step back: the T5 architecture is an encoder-decoder model, and in ByT5 the encoder is the massive part. The idea we came up with is that maybe after a couple of contextual layers of the model, after you've processed this entire byte or character sequence through a few encoder layers, tokens already contain information about other tokens via the attention mechanism. Then the model can decide which tokens it can drop and which tokens remain to be processed by the rest of the encoder layers. That's the main idea. And the way we do this is with a gating mechanism that learns to drop tokens, and learns which ones to keep as well. This gating mechanism does the work of deciding which tokens will be removed from the sequence.

The way I like to think of it is that during training, MrT5 does this dropping as an attention masking process. To describe what attention masking is: it's the process we use in language models to prevent certain tokens from looking at other tokens during the attention computation. The typical uses are, first, in a decoder or autoregressive model, you don't want preceding tokens to be able to look into the future, because that would defeat the purpose of next-token prediction, so you mask out future tokens. Second, in an encoder model, you might have sequences of different lengths, so when you process them in a batch, you pad them up and mask out the pad tokens, because you wouldn't want pad tokens to affect the representations of other tokens.

So the way we do this in MrT5 is basically: can the model learn which tokens to mask out via a learned attention masking mechanism? That's what we do during training. The model has a learned attention-masking function that removes tokens from the sequence during training, or rather doesn't allow the tokens that remain to look at the dropped tokens for most of the encoder layers. Then, during inference, we actually remove those tokens from the sequence in what we call a hard deletion mechanism: those tokens are actually removed, and the sequence is resized into a shorter, more compact sequence.

When you're talking about tokens here, are these tokens characters or bytes?

Yes, these tokens are characters or bytes. Sorry, the terminology is a little confusing, but whenever I'm talking about ByT5 or MrT5, the tokens are the bytes. We usually just refer to the units the transformer is working with as tokens too.

And so what you are essentially doing with the dynamic token merging, or the deleting of these tokens, is an alternate kind of learned compression scheme, ultimately. Unlike preprocessing into some reduced number of tokens, here you're doing it via this dropping scheme.

Yes, that's exactly right.
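For readers who want to see the shape of the mechanism, here is a minimal PyTorch sketch of the deletion-gate idea as described in the conversation. It is an editorial illustration, not the authors' implementation; the module name, the threshold, and the layer placement are all our assumptions:

```python
# Minimal sketch (editorial addition, not the authors' code): after a few
# encoder layers, a learned gate scores each token. During training the
# scores act as a soft additive attention mask; at inference, low-scoring
# tokens are hard-deleted so later layers see a shorter sequence.
import torch
import torch.nn as nn

class DeletionGate(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)  # one keep/drop score per token

    def forward(self, hidden: torch.Tensor, hard: bool = False):
        # hidden: (batch, seq_len, d_model)
        logits = self.scorer(hidden).squeeze(-1)  # (batch, seq_len)
        if not hard:
            # Soft masking for training: log-sigmoid scores near 0 keep a
            # token visible; very negative scores hide it from attention.
            attn_bias = torch.log(torch.sigmoid(logits))
            return hidden, attn_bias.unsqueeze(1)  # broadcast over queries
        # Hard deletion for inference: physically remove dropped tokens
        # (shown for batch size 1 to keep the sketch simple).
        keep = torch.sigmoid(logits[0]) > 0.5
        return hidden[:, keep, :], None

gate = DeletionGate(d_model=512)
h = torch.randn(1, 1024, 512)       # e.g. 1024 byte positions
h_kept, _ = gate(h, hard=True)
print(h.shape, "->", h_kept.shape)  # remaining layers process fewer tokens
```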
And the idea of merging is that the attention mechanism of the early layers already does a sort of merging, because through attention, every token is a combination of the tokens around it. So information has already been merged into other tokens, and some of them can be dropped, because they've merged their information into the tokens that remain in those early layers.

So is there an analogy to the multilingual example you gave, where you can look at the effective compression rate for English versus some other language? Have you demonstrated that it's invariant, or less variant, to language or script?

Yeah, that's a great question. In our main experiments, we train MrT5 on multilingual data, sampling 15 different languages from multilingual C4 for our continued pre-training experiments, and we don't give MrT5 any prior to compress languages at different rates. We have a formulation where, using a controller algorithm that's detailed in the paper, we can target a specific compression rate if desired. Say I want MrT5 to compress sequences by 50% on average; MrT5 will do that. But when I tested on individual languages separately, we actually found that MrT5 learns language-specific compression rates. For example, Chinese already has a very information-dense script, where individual characters can mean entire words; that's just how the orthography of Chinese is. MrT5 had a lower compression rate for Chinese than for Latin-script languages. It notices that Chinese is already pretty compressed and doesn't compress it as much as a language whose script is less compressed. Which I thought was super interesting, because we are not injecting that prior; it just learns it implicitly.
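The controller itself is specified in the MrT5 paper; purely to illustrate the flavor of targeting a compression rate, here is a toy proportional update (an editorial assumption, not the paper's algorithm):

```python
# Toy sketch (editorial addition): steer the deletion gate toward a target
# compression rate by adjusting the weight on a deletion-encouraging loss
# term after each step. The actual controller is detailed in the paper.
def update_gate_penalty(alpha: float, observed_rate: float,
                        target_rate: float = 0.5, k_p: float = 0.01) -> float:
    """Raise the penalty weight when too few tokens are deleted,
    lower it when too many are deleted."""
    error = target_rate - observed_rate
    return max(0.0, alpha + k_p * error)

# Example: the gate currently deletes 30% of tokens but the target is 50%,
# so the weight nudges upward and future steps delete more aggressively.
alpha = update_gate_penalty(alpha=1.0, observed_rate=0.3)
print(alpha)  # 1.002
```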
In terms of the ultimate performance of the model, what benchmarks are you looking at?

Yeah, so in the section on downstream fine-tuning in the paper, we fine-tune on some additional tasks. From the ByT5 paper we look at the XNLI and TyDi QA tasks, which are both multilingual benchmarks. XNLI is a classification task: natural language inference is where you take two sentences and need to determine whether they entail each other, contradict each other, or have no relationship. TyDi QA is a question-answering task: given a passage and a question, can the model retrieve the answer from it? These are both multilingual, so they test the multilingual capabilities of MrT5 as well. And we found that MrT5 can reduce sequence lengths on those tasks, I think up to 45 or 50%, while maintaining close to the same performance as ByT5. I think for XNLI it even outperformed ByT5, which was interesting, because you would think that since we're fine-tuning on top of ByT5, ByT5 would present a bound on how good you can get. So MrT5 can match the performance while significantly speeding up the model.

We also test on some character-level manipulations. We took two tasks from a character-level benchmark: a spelling correction task and a word search task. The spelling correction task is, given a sentence that contains some spelling error, can you reproduce the sentence but corrected? And the word search task is, given a random sequence of characters, can you find the English word that matches some definition? Subword models are really, really bad at these two tasks. This comes from a benchmark created by another PhD student in my lab, Jing Huang, who evaluated subword T5 and compared it to ByT5 at the time; this was before MrT5. The subword models really struggle on these sorts of character-level tasks. But back to our paper: MrT5 is able to significantly reduce the sequence lengths and improve the runtime while coming close to matching ByT5's performance on the tasks. So you get the benefits of the compression without an effect on the model's performance.

And in terms of the inference efficiency that you were going for, how does it compare to ByT5?

The efficiency gains will vary depending on how long the encoder sequences are relative to the decoder's. If you have very long encoder sequence lengths and very short decoder sequence lengths, you're going to get the most gains. I think we saw the most gains on the XNLI task, because the encoder sequences are very long and the decoder side is just a classification output, a number. So with a 50% compression rate of the sequence, cutting the encoder sequence length by half, I also got around a 45% speed-up compared to ByT5 on that task. That task was probably the best in terms of improving efficiency at around a 50% compression rate.

So this is great. I know the field is very focused on decoder models, but as someone who came from industry, I know there are plenty of use cases for encoder models that do classification. So this would be great for anyone who is using ByT5 for some use case: you could halve your inference time. That would just be a great gain, in my opinion.

And now, your work primarily focuses on a set of enhancements to ByT5, but situate the work for us in the broader context, relative to T5 and other models. Are you giving up a lot for this character-level efficiency?

Yeah, that's a great question. I think a great direction to take this work would be to see how we could adapt this method for decoder models. I'm not entirely sure what that would look like, but I think that would be a great next step. And while this is an architecture that enhances ByT5 in particular, if we had the resources to train from scratch or train at large scales, I bet the benefits would be even greater. One of the results we have in the new version of the paper is training a larger, 1.2-billion-parameter model and seeing that the efficiency gains are even greater at that scale. This makes me think that if we could scale these models up as much as we've scaled up subword-level models, maybe we could get back to the question of why they haven't caught on; maybe we just haven't scaled them up enough. I would also be remiss not to mention a new paper from Meta, the Byte Latent Transformer. They had a particular architecture and scaled byte-level models up to 8 billion parameters, training from scratch, and it matched their Llama model.
I think it was Llama 2 that they compared to. I just think the field is starting to go in that direction.

Sounds promising. Sounds very promising.

And I would love to be able to scale up MrT5 or other architectures to that size and then see if they would scale up just as well.

So do you ultimately think that byte-level encoding, or modeling, is going to replace subword token-level modeling?

It's hard to tell the future, but I think it's very promising. Beyond the issues I talked about previously, maybe there are even sequences that could be compressed more than a token-level model compresses them. There are some very predictable sequences where I wouldn't want a transformer to spend so much time. Say I give it a sequence like "to be or not to be, that is the question." I know what that is, and maybe the model knows it's a very predictable sequence, so spending several tokens to process it is wasteful, whereas a byte-level model, or another model that uses some sort of downsampling, could compress that whole sequence into one unit. There I could see how the efficiency gains might even be better than a subword model's. I think this started to be explored more in the Byte Latent Transformer paper. But yeah, I'd love to see character-level models and other architectures scaled up, including MrT5.

The comment about "to be or not to be" highlights for me that this work is really about two things. One is the byte-level paradigm and its advantages relative to subword tokenization. But maybe even more important, to that last point, is the idea of dynamic compression, as opposed to static, fixed compression.

Yes, absolutely. Dynamic compression is the big benefit. Other models might do fixed-length downsampling: say, every four characters get chunked into one representation. That still has the benefit of being more character-aware, because you're pooling over representations of characters, but it's not dynamic in a way that would let you compress really long, predictable sequences into fewer representations.

I think we've got a few minutes to touch on the Mission Impossible paper. Why don't we start with an overview of that paper, and maybe even its origins: the setting or the conversation that paper jumped into.

Yeah, absolutely. Mission Impossible was the first paper of my grad school experience, or at least the first paper I wrote in grad school. I remember I was just starting off, and I was talking to my advisor Chris Potts about potential first research topics. We had both read the New York Times op-ed where Noam Chomsky, a very famous and very important linguist, talked about language models and whether they have a bearing on studying linguistics. His argument in the article was that language models learn too much; they're too good, to the point where they could learn impossible languages, which are languages that humans wouldn't be able to learn. From a research perspective, I thought this would be a really cool problem to explore.

And just to jump in there, maybe you're about to say this.
The idea behind his argument was that if language models are just pattern matchers and aren't learning things in a human-like way, then we can't really extrapolate from them to the way humans learn language. Is that the core idea?

Yeah, I think that's the core idea of a lot of his critiques of language models up to this point. This particular point was that even how good language models are is detrimental to their use as models of language, or as linguistic tools, because they couldn't possibly match certain human behaviors, just because of how well they learn. I think that was his main point in that article. And he cited a paper by Mitchell and Bowers that I would consider one of the first papers to explore impossible languages from a computational lens. That paper was really cool, but we thought we could expand to more languages and also test on more modern architectures: rather than the recurrent neural networks they tested in that paper, we could test on the transformer-based language models that are the core of language models today. Chris and I were both really excited about the topic. He tells me he remembers thinking it was maybe too ambitious for a first project, but I remember him being very positive and very supportive of pursuing this direction.

And so how does one define an impossible language?

Oh, that's a very good question, and truthfully there's still not a very clear answer, because the definition of an impossible language is a language that a human wouldn't be able to learn, and it would be very unethical to try out different languages on babies. That's the ideal experiment, right? All of the languages we test on, we would want to give to a baby and see if the baby would be able to learn them, but we can't do that for obvious reasons. So in the paper we take perspectives from linguistic theory, as well as perspectives from the ML side: say, languages that have inherently more entropy and would be more difficult for both a human and a machine learning algorithm. We try to test on a broad range of languages in the paper. These go from languages that are very intuitively impossible, so there were lots of languages that involved random shuffling of words within sentences.

That sounds fun.

Yeah, it sounds intuitively impossible. But we even had to hedge there; we included a footnote. We believe these languages are impossible in our context, where we're scrambling English words. But there are some languages in the world called scrambling languages, or free word order languages, where words can appear in almost any order, but usually there's some other process in the language that disambiguates the meaning in a different way.

I think I remember reading that about some creole languages, that they have a tendency to support freer word order.

Yeah. And I think lots of polysynthetic languages, meaning languages that have lots of affixes on words, will put more meaning into the affixes rather than into the actual ordering of the words in the sentence. That's different from English, because in English a lot of the meaning comes from the structure, the syntactic structure.
And you can't scramble words in that way and have people know what you're talking about.

Yeah. So the impossible languages that the paper considers: were these collected from the literature? Did you create impossible languages? Where did they come from?

Most of them were invented by us, but inspired by parts of the literature. We had a set of reverse languages that were actually replicated from the Mitchell and Bowers paper I mentioned before. But the main set of languages we focus on in the paper are these hop languages. These are inspired by artificial language learning experiments that have been done on humans, about how humans disprefer languages that involve certain counting-based rules. In the languages we have in the paper, we basically take English (all of our languages involve perturbing English, and there are a number of reasons we go in that direction that I could talk about), we take verbs and remove any inflection, meaning how you mark tense or number, and we put a marker that signifies tense and number four words later. Which, you know, doesn't sound inherently too complex, right? But it's something that no language really does: having verb inflection be marked by a marker that comes four words after the verb. And it sounds like it should be pretty easy for a language model, but we have some targeted evaluations showing that the model is actually more surprised by those markers, or is worse at predicting them, than in a control condition where the marker appears right next to the verb, as it would in the natural English setting. Yeah, so we thought that was pretty interesting.
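To make the hop rule concrete, here is a toy transformation (an editorial illustration; the paper's actual rules rely on proper tagging and tense/number marking, and the marker token here is invented):

```python
# Toy sketch (editorial addition) of the "hop" perturbation: strip a
# verb's inflection and emit a marker four words later. The paper's real
# implementation handles POS tagging, tense/number, and edge cases.
def hop_transform(words: list[str], verb_idx: int, hop: int = 4) -> list[str]:
    out = list(words)
    verb = out[verb_idx]
    if verb.endswith("ed"):              # crude stand-in for de-inflection
        out[verb_idx] = verb[:-2]
        insert_at = min(verb_idx + 1 + hop, len(out))
        out.insert(insert_at, "<PAST>")  # marker lands four words later
    return out

sentence = "she walked quickly to the old library yesterday".split()
print(" ".join(hop_transform(sentence, verb_idx=1)))
# -> she walk quickly to the old <PAST> library yesterday
```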
And that is the core idea behind the research: to demonstrate, one way or the other, the relative difficulty that language models have with these impossible languages.

Yeah. So the main takeaway was that, at least for the class of models we tested, the GPT-2 models we trained from scratch (and just to clarify, we had to train all the models from scratch on each impossible language; these models are not pre-trained on top of massive English corpora, they only see their respective impossible languages), there's something about the GPT-2 architecture that biases it toward the natural language, meaning the control languages in our experiments. That's the main takeaway. Our hypothesis is that it has to do with the GPT-2 architecture preferring information locality. What this means is that in language, words that are predictive of each other are often close together. So when we mess with the locality of a language, that is what makes it harder for GPT-2. We think it comes from the autoregressive language modeling objective, where the language model has to predict the next token given the preceding tokens; that creates an information locality bias in GPT-2.

In terms of the training data set for these models, did you define the rules for these impossible languages and then translate from an English data set to a data set in the impossible language, or did you use some other kind of synthetic generation?

That's a great question. We started off from an English corpus, the BabyLM corpus, which is about 100 million words of text. It's supposed to approximate what a child would hear up to age 12, so I think it's a nice corpus if you're trying to do these sorts of language learning experiments, since it's meant to mimic what a child would encounter during the first 12 years of life. What we did was define rules that transform the English corpus into each impossible language. So the data is very controlled: when we compare the impossible languages, it's the same sentences, just transformed by different rules. Then we pre-trained the GPT-2 models on each corpus.

It strikes me that the language models we use, like GPT-2, kind of evolved in the context of English to some degree, so there's some kind of selection bias there for English. If your focus were some impossible language, maybe you would have evolved some other language model architecture that worked better for those languages. Which causes me to reflect on the relationship between impossibility and language model architecture, from Chomsky's perspective. Does the fact that these language models, which evolved in this English context, work or don't work for these impossible languages really mean anything?

Oh, that's a great question. I have to say I'm really happy with the reception of the paper. There has been lots of follow-up work that tests a similar question but starts from corpora that are non-English, from other base corpora. And it's definitely a question that could be explored more, incorporating comparisons of different real natural languages versus impossible languages derived from each of those natural languages. It's something that just needs to be explored. And then the architecture question is what we think would be a really natural next step: how can we find architectures that are more biased toward the natural languages and less biased toward the impossible languages? Ultimately, all of the parts of GPT-2 are engineering choices, and there's no reason we can't change them to make models more cognitively plausible. So these are great directions, and I'm excited that people are working on them more. And in the follow-up, we are working more on the architecture question; that's what we're targeting in Mission Impossible 2.

Continuing the theme of the pop culture reference in the title. Yeah, that means you have like nine left.

Oh man. Yeah, nine left. I wanted to do a paper that had a spin on Fast and Furious, like 2 Fast 2 Furious or something, but it seems like the OLMo team took it. Their sequel to OLMo was called "2 OLMo 2 Furious."

Oh, nice. Well, great. I think we covered these papers. Anything else you would like to share about what you're working on or excited about?

Yeah, I'm continuing to be very excited about tokenization and very excited about architectures. I think the link between the two papers I talked about is the exploration of what is learnable and which architectures are best for specific use cases.
For MrT5, it's obviously about which architecture is best for achieving more efficient byte-level models; for Mission Impossible, I think the clear next step is to explore the architectures that make a learner more or less biased toward natural or impossible languages. I'm just really excited to do more work on architecture. These two questions have allowed me to explore that, especially in a world where the standard transformer architecture is very dominant, and I've been very happy to break away from that a bit.

Awesome. Awesome. Well, thanks so much, Julie, for sharing a bit about what you've been working on.

Thank you so much, Sam. It was really a pleasure to talk to you.

Thank you.