Reiner Pope of MatX on accelerating AI with transformer-optimized chips
73 min
Feb 26, 2026
Summary
Reiner Pope, CEO of MatX, discusses designing transformer-optimized AI chips to compete with Google's TPUs and NVIDIA's GPUs. MatX raised a $500M Series B to manufacture chips that balance throughput and latency by combining HBM and SRAM memory, targeting inference workloads for frontier AI labs.
Insights
- The most critical constraint in AI chip design is not raw compute but memory bandwidth and latency trade-offs; MatX solves this by combining HBM (for throughput) and SRAM (for latency) on the same chip
- Frontier labs can afford to hire 50+ engineers to optimize software for each new chip generation, making custom software development economically viable unlike consumer GPU markets where CUDA's ecosystem matters
- Supply chain bottlenecks (HBM, logic dies, rack manufacturing, power infrastructure) will pace AI buildout more than chip design innovation over the next 3-5 years
- Chip architecture decisions are largely determined during the design phase through mental models and simulation; actual performance is predictable within 30-40% before writing any Verilog
- Model architecture should evolve with hardware capabilities; separating pre-fill and decode phases, or training vs. serving models, could unlock efficiency gains currently constrained by unified designs
Trends
- AI chip startups can compete with incumbents by targeting specific workload constraints (inference latency + throughput) rather than general-purpose compute
- Supply chain diversification becoming a critical strategic concern; HBM, TSMC capacity, and power infrastructure are emerging bottlenecks, not chip design
- Low-precision arithmetic (4-bit, mixed-precision) becoming standard; precision trade-offs are now validated through ML research rather than IEEE standards
- Vertical integration trade-offs shifting; labs increasingly design custom chips despite multi-year delays because model architectures change faster than chip cycles
- Latency optimization gaining parity with throughput in AI inference; 10-20ms HBM latency vs 1ms SRAM latency is becoming a key product differentiator
- Waterfall methodology persisting in chip design despite the software industry's shift to agile; tape-out cycles remain discrete annual events with high failure costs
- AI-assisted chip design emerging; LLMs effective for Verilog/Rust/Python but weak on architecture definition and physical design optimization
- Systolic arrays becoming standard for matrix operations; innovation shifting to attention mechanisms and memory hierarchy rather than core compute architecture
- Custom CPU instruction design for specific workloads (e.g., hash tables) remains unexplored despite high-frequency applications across software stacks
Topics
- Transformer-optimized chip architecture
- HBM vs SRAM memory trade-offs in AI inference
- Systolic array design and optimization
- Low-precision arithmetic (4-bit, mixed-precision)
- AI chip supply chain bottlenecks
- TSMC manufacturing and tape-out process
- Latency vs throughput optimization
- Custom software optimization for frontier labs
- Model architecture co-design with hardware
- Attention mechanism hardware mapping
- Mixture of experts layer optimization
- Long-context inference and memory bandwidth
- Chip design verification and testing
- Verilog and EDA tool workflows
- Rust for hardware-level programming
Companies
MatX
AI chip startup founded by Reiner Pope and Mike; raised $500M Series B to manufacture inference-optimized chips combining HBM and SRAM
Google
Developed TPUs and Transformers; established foundational AI research and chip architecture that influenced the entire industry
NVIDIA
Dominant GPU provider for AI; benefits from CUDA ecosystem and gaming market positioning; MatX positioning as an alternative
OpenAI
Frontier lab buying multi-gigawatt compute clusters; starting to design custom chips; represents target customer for MatX inference chips
Anthropic
Frontier lab purchasing large-scale compute clusters; target customer for MatX inference chips
TSMC
Primary foundry for MatX and most AI chip manufacturers; maintains durability through conservative pricing and technical advantage
Cerebras
AI chip startup mentioned as competitor; focuses on latency via SRAM-based weights but suffers from poor throughput economics
Graphcore
AI chip startup from 2016-2017 wave inspired by TPU v1 announcement; represents earlier generation of AI chip competitors
SambaNova
AI chip startup from 2016-2017 wave inspired by TPU v1 announcement; represents earlier generation of AI chip competitors
Groq
AI chip startup optimized for latency with SRAM weights; trades throughput for low latency, making it uncompetitive on dollars per token
Amazon
Hyperscaler with custom chip development; optimizes latency in retail products; buying large compute clusters
Synopsys
EDA tool vendor providing synthesis and design automation software for chip design; critical infrastructure for MatX
Cadence
EDA tool vendor providing design automation for physical design and verification; critical infrastructure for MatX
ASML
Manufactures wafer steppers and lithography equipment; critical bottleneck for mask production and chip manufacturing
Samsung
Secondary foundry option for chip manufacturing; HBM vendor competing with SK Hynix and Micron
SK Hynix
Major HBM memory vendor; supply constraint for AI chip manufacturing
Micron
HBM memory vendor; supply constraint for AI chip manufacturing
Intel
Mentioned as alternative foundry option; not competitive at leading edge nodes for AI chips
Jane Street
Co-led MatX Series B funding round; technical experts in quantitative finance and optimization
Situational Awareness
Leopold Aschenbrenner's fund; co-led MatX Series B; focused on AGI and AI infrastructure
People
Reiner Pope
Co-founder and CEO of MatX; former TPU architect at Google; Haskell programmer turned chip designer
Mike
Co-founder of MatX; Google's former chief chip architect; worked on LLM hardware and inference at Google alongside Reiner
Leopold Aschenbrenner
Founder of Situational Awareness fund; co-led MatX Series B; wrote on AGI and AI infrastructure
Jeff Dean
Google researcher; referenced for 'Jeff Dean numbers' mental model for system design estimation
Quotes
"Hardware is massively parallel. You've got tens of billions, hundreds of billions of transistors on your chip, and it takes maybe 100 clock cycles to get from one side of the chip to the other, so you can't do a sequential computation involving transistors on both sides of the chip."
Reiner Pope
"It is actually possible to do both in the same chip. It's kind of an obvious thing. You take the HBM, you take the SRAM, put them together in the same chip. You put the weights in SRAM and you put all of the inference data in HBM."
Reiner Pope
"A startup is more of the right place to make a big bet on a workload. You either fail, it's fine. You just, another startup will succeed. Whereas a company like Google or NVIDIA, the next job has to work for sure."
Reiner Pope
"The economics seems to be most important. Ultimately, the quality of the AI you can train and serve is constrained by, I have only a $10 billion budget and I want to train and serve the best model I can on that budget."
Reiner Pope
"The best iteration is in your head. Can you map a model to hardware in your head? Can you estimate the performance of what it is in your head? You're not going to be 100% perfect, but maybe you can prove some kind of lower bound on performance."
Reiner Pope
Full Transcript
Reiner Pope is the co-founder and CEO of MatX. He's a former math whiz and Haskell programmer who became a TPU architect for Google. And now he's teamed up with Google's former chief chip architect to design a better chip for AI. So a year ago, everyone was saying Google is cancelled. AI is going to eat their search. No one's going to search for things and therefore the business won't do well. Obviously, that sentiment has really shifted. It's in part helped by, you know, Gemini 3 is really good. And then also it's really fast. You know, it's powered by the custom chip hardware Google has. You were inside Google for actually, I think, a lot of the foundational period laying the groundwork for that stuff. What do people not appreciate about what Google did right to lay all the groundwork for their current AI success? They started with the research, right? The Transformers came from there. Pretty much anyone who's maybe, I don't know, over 30 and at a large lab has been at Google Brain at some point. So I think there's just like there was and has been a lot of talent there. TPUs are pretty good. I mean, we think there's better you can do, of course, but they at least had the option, like the opportunity to design the TPUs for neural nets, at least, rather than graphics applications like NVIDIA. And so the overall architecture, starting with single core, doing what was at the time reasonably large systolic arrays, by today's standards nowhere near as much. But I think those were a lot of really good decisions. When did the TPU project start? TPU v1 was announced in 2016, I think. That was what actually kind of led to the creation of all of those 2016, 2017 startups. So Cerebras, Groq, Graphcore, SambaNova, all of those. TPU v1 actually was, I think, is a really impressive project. It was done on a very short timeline, maybe, I don't know the full details, but maybe about a year or so, maybe a year and a half, with a skeleton team of 20, 30 people. Really, really minimal viable product. More recent TPUs and more recent AI chips in general can't do that because the market has moved and the stakes, or the table stakes, are much higher. But the first generation product, they just one big systolic array, stick a memory next to it, we're done. And it was really simple, a nice, elegant product. And obviously that TPU v1 predates the transformer. Is that just a coincidence that they happened at very similar times or related in some way? Yeah, I mean, there was a period of maybe about four years of like a lot of ML research or neural net research prior to Transformers. So what was popular? LSTMs and ConvNets and ResNet and Inception. The big thinking at the time was to adapt it to be used for LSTMs. It's a reasonable fit there. But yeah, I mean, I think there was just a huge flurry of activity. I think why did it all happen then and not later is probably just because people stopped publishing. 2022 was about the time when Google completely stopped publishing its research. Yes, yes. And so all the good papers are from before that as a result. Right, right. But is there some hand-wavy story you can tell about parallelization where both transformers and TPUs are about really internalizing the importance of parallelization? So, I mean, definitely. I put it somewhat on people, actually. So, I mean, it is just true. Hardware is massively parallel. 
Like, you've got tens of billions, hundreds of billions of transistors on your chip, and it takes, like, maybe 100 clock cycles to get from one side of the chip to the other, and so you can't, like, do a sequential computation involving transistors on both sides of the chip. So the hardware is just fundamentally parallel, and you have to take advantage of that. TPU v1 and all later TPUs naturally took advantage of that. Matrix multiply is really nice because it is so parallel. So I think on the hardware side, that's generally understood. I think most ML researchers, especially of the time, were not sort of super deep in what hardware wants and what is sort of, mechanical sympathy is a term that's used for that. So, I mean... So what is the term, mechanical sympathy? I mean, it kind of makes sense. It speaks for itself. It's like, I mean, think about the poor machine and what does it want? What does it want? I mean, the term actually, I think, originates in maybe high-frequency trading and areas like that, which I haven't worked in. I like reading about the software that people have built from there. And it's like, for them, what does the machine want? It wants a lot of instruction-level parallelism. This is CPUs, not GPUs. It wants a lot of, don't branch. So unpredictable branches kill your performance. And so think about the things that CPUs do and how to use them best. Can I get to peak performance on a CPU? It's sort of that idea. I think the whole idea of peak performance on a CPU is kind of crazy. Like no one even says, what is peak performance? What is my percentage of peak on a CPU? Because performance of software running on CPUs is really bad. But running on GPUs or TPUs or AI chips in general, actually that is the main focus. It's like, what is my percentage of peak? Can I get 70% or 80%? Okay, I feel like many people listening to this know that GPUs perform better for AI workloads than CPUs. And it's kind of a funny history when you think about it, where just one day we woke up with all these very mathematically intensive workloads. First, crypto mining, and then AI. And so then NVIDIA is extremely well positioned because they've been making GPUs for gamers that you would plug into. You'd buy your Dell PC back in the day and maybe upgrade the graphics card by plugging in a better NVIDIA graphics card than the one the stock Dell computer came with. And they were incredibly well-positioned to capture that. So I think people know that. What is the intuitive explanation as to why GPUs are better for AI workloads than CPUs? Because, I mean, people say, yeah, they're better for these mathematical computations, but that's kind of a tautological answer, basically. Is there some way you can have a mental model for why that is the case? Because, I mean, software instruction sets also involve doing math. Yeah, so, I mean, intuitions, I'm not sure. Let me try and just go to some of the big differences, which is really wide vector instructions is sort of the hallmark of a GPU, which I think it's maybe, if you want sort of some intuition, it's like how much is spent on controlling the thing? And maybe control means like if I'm driving a truck, how much like is the driver versus the payload? Yes. A truck has a huge payload in it. That's more like the GPU, whereas maybe a motorcycle is more like the CPU where you've got like the instruction, like actually just processing the instructions, reading what do I have to do next? Okay, how do I do that? That is most of the cost on a CPU. 
Whereas if you just keep the same instructions but make the payload 100 times bigger, then you can shift most of the cost to be in the actual work that you want to do. Okay. Okay, so CPUs have been optimized for very complex instruction sets, whereas GPUs optimized for... Yeah, complex instruction sets and sort of fine-grained changing what you want to do. So, like, I mean, steering, like, in this analogy, like, a CPU can steer around an obstacle course, no problem. Whereas, like, on a GPU, you're just going to go straight line for a really long time. Yes, yes. Okay, so this is getting us into what is MatX? How did you guys start it? And which part of this space are you attacking? Yeah, so MatX is making the best chips physically possible for LLMs. What led us into MatX, so Mike is the other founder. Mike and I were both working at Google, and we, I mean, I was working on the inference stack for running LLMs and I was saying, like, how can we make the best software on TPUs for running LLMs? And then what we really wanted out of hardware was support much, much larger matrices. The matrices have grown from maybe 128 in dimension into the many thousands. So much larger matrices and much lower precision arithmetic. And we tried to move the TPUs in this direction. TPUs have been moving in this direction, but they're kind of constrained by a lot of other workloads. There was a big ads workload at the time. And so back in 22, before ChatGPT was released, there was this idea that LLMs were going to be a big thing, but not conviction, and really hard to make a big bet on that. I think a startup is more of the right place to make a big bet on a workload. You either, like, if you fail, it's fine. You just, like, another startup will succeed. Whereas I think a company like Google or NVIDIA, the next job has to work for sure. And so... You can take more technical risks as a startup. Yeah. Well, actually, I would say we're taking sort of product risks rather than technical risks. But is there actually product risk? Because it seems like LLMs are going to work. I think now we understand it. Two years ago or three years ago, I think it was. Fair, okay. And when you say the best chips for LLMs, I mean, I can think of multiple ways to measure best. It could be best performance per watt. It could be lowest latency, capable of handling the largest models. What is best? In general, there are two metrics which LLM workloads care about, which is throughput, which is really just an economics thing. I buy a chip for $30,000 and then can I do 10,000 tokens a second or 100,000 tokens per second of throughput? That determines the dollars per token. So throughput and then latency, how fast does the thing respond? As I see the market, the economics seems to be most important. Ultimately, the quality of the AI you can train and serve is constrained by, I have only a $10 billion budget and I want to train and serve the best model I can on that budget. And so if I can have more tokens per dollar, then I can get a better quality up. So the product we aim to build is far ahead on throughput. But then actually the sort of surprising thing is we're competitive with the best on latency as well. And so I think that is a unique thing in offering both in the same place. 
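As a rough sketch of the throughput-is-economics point above, here is how chip cost and tokens per second translate into dollars per token, in Python. The $30,000 chip and 10,000-100,000 tokens/second figures come from the conversation; the amortization period and overhead factor are assumptions added purely for illustration.

```python
# Back-of-envelope dollars-per-token from chip cost and throughput.
# Chip price and token rates are the figures mentioned in the conversation;
# amortization period and overhead factor are illustrative assumptions.

def dollars_per_million_tokens(chip_cost_usd: float,
                               tokens_per_second: float,
                               amortization_years: float = 3.0,
                               overhead_factor: float = 1.5) -> float:
    """Amortize the chip (plus an assumed overhead for power and hosting)
    over its lifetime and divide by the tokens it can serve in that time."""
    seconds = amortization_years * 365 * 24 * 3600
    total_cost = chip_cost_usd * overhead_factor
    total_tokens = tokens_per_second * seconds
    return total_cost / total_tokens * 1e6

print(dollars_per_million_tokens(30_000, 10_000))    # ~$0.05 per million tokens
print(dollars_per_million_tokens(30_000, 100_000))   # ~$0.005: 10x throughput -> 10x cheaper tokens
```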
And obviously in AI there's training the models and then running the models, inference. Is this most interesting for inference, or is there any training? I mean, incidentally, is it useful for training, but you're trying to win inference? Is that how you think about it? I think that's a reasonable way to look at it. I think the best inference chip today will be a really good training chip as well. And so our product is both training and inference, but I think the first sales will be in inference. That's mostly just a market effect where it's easier to buy. It's not as big of a risk to buy an inference cluster as a training cluster. I think the product is really compelling for training as well. And so I think it should be the best training product. And you guys just raised a big new round of financing. Yeah, that's right. So we've raised a Series B round. It's led by Jane Street and Situational Awareness. Situational Awareness, that is Leopold Aschenbrenner's fund. He wrote the definitive book on AGI and where it's going. And then Jane Street, they're real technical experts. They understand all the details really well. So very happy to be having them lead the round. It is a $500 million round. Helps us actually ramp the manufacturing and supply chain for our chip so we can bring our chip to market. That's a lot of money. Yeah, it is. Yeah. No, I mean, I think, like, roughly I would say it costs ballpark $100 million to produce a chip in small volumes. But then if you want to, like, you see the orders that are going around, like OpenAI, Anthropic, Google are going around buying multi-gigawatt clusters. They cost, like, tens of billions of dollars of chips. And you want to deploy all of that in, like, in a year or so. And so you just need a massive supply chain behind you. And so assuming everything works technically, what rate of production could you start to see? We have some estimates of where we'd like to be on this. Ramping to very large volumes is a huge challenge for anyone. And so obviously for the large players, they've had some practice in it. Getting to a very large volume for a startup is hard. We would like to be at a place where we're shipping multiple gigawatts a year. Multiple gigawatts per year. Speaking of the metrics, you talked about tokens per second. We used to measure chips in flops, and I guess there's some kind of custom flop thing for AI chips. But is everyone just using tokens per second these days? Is the industry aligning on that as the chip metric? Yeah, so I mean, I guess it's sort of like an application metric versus the chip itself. Flops of the chip is the key chip metric. There's a little bit of, like, if I go and say I've got an exaflop chip to you, then sort of the appropriate suspicion is to say, okay, but can I actually use those flops effectively? I see. And so then you need to map the application to that. Yeah, yeah, yeah. So this is kind of telling you the usable flops, yeah, for your purposes. Okay. As a consumer of AI, we have known for a long time that lower latency products succeed. Google talked about their internal testing where the differences were down to, was it 50 milliseconds? Something like that. Yeah, yeah, in result times, where they noticed more Google engagement the faster the results were. And you'd think that 50 milliseconds is imperceptible to a human. And it almost is, but turns out it's not. 
And I think Amazon has, I mean, certainly they've optimized the latency of the Amazon experience quite a lot. I don't know if they've talked about this stuff publicly, but you know that their internal metrics similarly show that the faster the product page loads, the more people buy. And yet, in AI, Google has carved out a meaningful advantage via Gemini just being really fast for its level of intelligence. And as far as I can tell, ahead of most of the other labs on latency at a fixed high level of intelligence. Why have you guys or Groq or better chips not been adopted faster to give this product latency? Is it just that this will happen, and you guys will be powering all the AI products? But I note that Google has an interesting lead there. I think there's ultimately, at least for existing chips in the market, there's a really uncomfortable trade-off between latency and throughput. The chips that are best at throughput have historically been the chips that are based on HBM as the memory. So that is Google, Amazon, NVIDIA. In order to have very large throughput you need a lot of inferences in flight simultaneously, so that needs the large memory, but that hasn't been so good at latency. And then there's Groq and Cerebras that are much better at latency because the weights are in SRAM, very low latency. The problem is, and the challenge when you go to a Groq or a Cerebras system, is that the throughput you get there just is not very good. And so the fundamental dollars per token is just not competitive with Google or NVIDIA or Amazon. It is actually possible to do both in the same chip. It's kind of an obvious thing. You say you take the HBM, you take the SRAM, put them together in the same chip. You put the weights in SRAM and you put all of the inference data in HBM. That is what we are doing, in fact. And I think that actually hits a really nice sweet spot where you can get low latency and also be very cheap. So I think that's a really attractive point to be. It hasn't happened in the market yet just because of product decisions that have been made by the different chips. Got it. But we should expect all the AIs we're using to get significantly faster over the coming three to five years. An order of magnitude faster, I'd say. Yeah. So I mean, generally, HBM-based chips tend to be about 10 milliseconds or 20 milliseconds per... I'm sorry, HBM-based chips are things like TPUs. That's right, that's right. Yeah. There's just some simple math of like, how long does it take you to read through all of HBM? It takes about 20 milliseconds, and so that's the amount of time per token it runs, whereas the amount of time to read through all of SRAM is much faster, and so you can typically get about one millisecond. So that's an order of magnitude faster. Famously, software used to be, like old-fashioned deterministic software, the kind that's now out of favor, used to be very easy and quick to scale. And you would have social networks that have some South by Southwest moment, and they can scale through 10x, 100x, 1,000x of adding users because it's just a few rows in a database and a very underutilized CPU. What's interesting about the AI world is there are very real bottlenecks. You spend lots of time talking about power, but it's not just bringing power online. You mentioned HBM, which is reminding me of, it seems like there's a view that maybe there's going to be some HBM supply chain crunch. 
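A rough version of the read-through-all-of-memory math described above, as a sketch: divide the weight memory's capacity by its bandwidth to get a floor on decode time per token. The capacities and bandwidths below are illustrative assumptions, not MatX or vendor specs; the point is only that HBM-class ratios land around 10-20 ms while SRAM-class ratios land around 1 ms.

```python
# Per-token latency floor: the time to stream the whole weight memory once.
# Capacity and bandwidth numbers are illustrative assumptions, not vendor specs.

def ms_per_token(memory_gb: float, bandwidth_tb_per_s: float) -> float:
    """Lower bound on decode time per token if each token reads all weights once."""
    return memory_gb * 1e9 / (bandwidth_tb_per_s * 1e12) * 1e3

# HBM-class: ~100 GB of weights behind ~5 TB/s of bandwidth
print(ms_per_token(100, 5))    # 20.0 ms per token

# SRAM-class: weights sharded across many chips, ~50 GB total behind ~50 TB/s aggregate
print(ms_per_token(50, 50))    # 1.0 ms per token
```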
And so where do you see, are we in for just a crunched world where some limiter is pacing the rate of AI build-out over the coming few years, where the economics of the products work and everything like that, but ultimately we just can't bring the components online fast enough because we have to build up the factories and things like that? And what are those crunched components? Yeah, no, I mean, I think so. And I'll just comment, by the way, this is a great time to be a supplier in this space. Yeah. Or just really... You should have started an HBM company. I know, right? I think it's also just a fun time to be someone who optimizes software. That's always what I like doing. Always the challenge is, why am I optimizing this if no one cares? But finally, there's a place where actually you can, it's actually very meaningful in a very tangible sense. Like, if I can make this 20% more efficient, then it can save that 20% of the build-out. The supply chain, we're going to have crunches on all of the supply chain, really. So if you look at the sort of the big components of what any company, but like us, for example, builds out, there is dependency on logic dies from typically TSMC, maybe Samsung, or HBM from the big three HBM vendors, SK Hynix, Samsung, and Micron. And then there's also just the whole rack manufacturing, which includes, I mean, literally just sheet metal and so on that builds the rack, but also cables and connectors, because of all the high-speed interconnect. Racks don't sound hard. Are they sneaky hard? The big challenge is that you want to bring in a huge amount of power, get a huge amount of heat out, and also have phenomenal interconnect, which has very high signal integrity requirements. And so pack a lot of cables in, with cables that don't bend too much. They have to have enough copper in them and so on, but you don't lose data rate on the interconnect. Yes. So, yeah, you're pushing it to the limit. Okay. Wafers, racks, HBM. What else? Data centers, which I think is power primarily, a little bit of build-out, but primarily power and grid infrastructure there. Okay. How do you then, as a startup that is looking to acquire all these components, elbow your way in amongst the giants of the Googles and the NVIDIAs and all these people who have long-running relationships and have been buying for much longer? Yeah. I mean, ultimately, what all of these suppliers care about, they do somewhat care about a diversity of their own customers. It's not a great position to be in. They don't want monopsony. That's right. Yeah. But then, you know, what is their hesitation or the calculus for one of these large suppliers is, if I reserve some of my capacity for you, a startup, are you going to be around in a year? Is anyone going to even buy your product? Our approach has been to just actually find buyers for the product, and then the buyers answer that question, ultimately. Got it. And so if you show up with a bunch of fairly ironclad contracts to a supplier, then that has happened. That's the nature of it, yeah. I presume also the round you just raised really helps there, where showing that you are incredibly well capitalized and not going anywhere also helps from a supplier point of view, a supplier validation point of view. Yeah, absolutely. Yeah. I mean, it helps just to say that we are around. We, in some cases, are actually... it depends on which part of the supply chain, but some parts of the supply chain are fungible. 
Logic dies are typically pretty fungible, but for other parts of manufacturing you actually need something specifically set up for you, and so we're also able to cover the capital costs for that. Yeah, that makes sense. And coming back to the MatX architecture, okay, you want to build the best chip for this problem. What is that? Yeah. Sounds great. Yeah, so I mean there's a few aspects to that. I think the first one is just pick your memory system right. And so I said, like, we've seen this HBM family, we've got the SRAM family. Putting the two together is actually, I mean, the most obvious idea, but you can actually do it. There are a lot of details to make that work. Well, we've done that work. One of the things that shows up there is you've spent all of this area on your chip on SRAM. How do you fit in the matrix multipliers, which are the other big thing you really need to do? And so somehow create a much more efficient matrix multiply engine. There is a gold standard for that that is called the systolic array. Make a really large systolic array. You can't beat that in area or power efficiency. Like provably so? Practically. Practically. There is no known better approach there. The main thing is like, where are the inefficiencies typically? The inefficiencies show up when you leave the systolic array. So if you make your systolic array really big, then you just don't leave it as often. So that's the idea. So make a really big systolic array. That is sort of the theme of several of the 2023-era startups, including us. But one of the challenges there is now there is this part of the neural network, as part of the transformer, which is this attention, that doesn't map well onto a large systolic array. And so that's attention: the mixture of experts layer maps really well, but the attention does not. And so what we came up with, which is quite different than some of the other startups in this space, is to say take a really large systolic array but have a way to split it up into pieces without losing efficiency. So sort of that is the core of the design for us. And then there's sort of the third component. So first was HBM and SRAM. Second is the systolic array. Third component is just an interesting new approach on low-precision arithmetic. Low-precision arithmetic, in general, we've seen number formats get narrower and narrower. They get faster and faster as you make them less precise. Number formats get narrower. What does that mean? Yeah, so float32 was how people used to train neural nets. And that's just too much precision. Too much precision, yeah. It's like saying I've got an image with a billion color bit depth. It's too many colors. You'd rather have more pixels and fewer colors. And so that trend seems to go all the way, almost all the way down to one bit even, where you just have very few colors, but a huge number of pixels. And that in net seems to be better, just a more efficient way to train models. And so sorry, literally what precision are you dealing with in these? So we have a range. I mean, we actually have an ML team who we hired specifically to research different forms of numerics and how to make them all work together really well. We have a range of precisions. It's not just one precision. We think probably the main thing will be similar to where NVIDIA is at, which is 4-bit precision. But I think a mix of different precisions is useful, because when you look at the research, sometimes you want some layers in higher precision or lower precision and so on. Yeah, yeah. 
Okay, so 4 bits is 16. Yeah, you get 16 choices. That's it. Yeah, that's it. Yeah, it's pretty imprecise. Yeah, yeah. That's really interesting. I didn't know about that dynamic, but it makes sense. Yeah, and half of them are positive, half of them are negative. So, like, it's even less precise. How do you design a chip? Is that a whiteboard? What software are you working in? I'd just love to know. I understand how you design software and what that process looks like. I've actually no sense for what chip design looks like. So the way that you actually type a chip into a computer is similar to software. So you write Verilog. Verilog is a programming language. It is a very parallel programming language, which makes it different than C or Python or something, but it is a programming language. So the mechanics of how you express the design are the same as software, and we have continuous integration, Git, all of those things. But like a program executes, like your Verilog program. We don't really run it, right? Yeah, exactly. We synthesize it, yeah. Okay. So Synopsys and Cadence provide EDA tools. So EDA, if you remember... yeah. Electronic design automation. I'm just a humble layman. I don't even know what it means really. I think it's electronic design automation. It takes the Verilog and says, first turns it into a description of what are the logic gates that are involved, ANDs, ORs, NOTs, and then the wires between them. And then it runs for days doing some really difficult algorithms and then eventually produces, I mean, so gates are the first thing, and then even below that, it literally just produces polygons. It says, like, P-type semiconductor here, N-type semiconductor here, and polysilicon. Okay, so you write Verilog and then that compiles down into gates and ultimately the Minecraft 3D, just, this is where your elements should go. But then what is the iteration loop? When we write code at Stripe, we build a first version of something and then we try it out and then we refine it and we add more functionality over time. We're going to write some tests at some point. We'll ship that. We'll find product market fit and then we'll refine it in market. Like, do you just sit down and write the completed chip and it works really well? Yeah, every year we tape out a chip and if there's a bug, we just wait till next year. That's not really how we do it. Yeah, well, so what's the iteration loop? How do we actually do it? Yeah, it's much more waterfall than software is. So, like, waterfall is almost a bad word in software development. But it's just a fact of life in chip design. Yeah, yeah. So the waterfall goes from architects to logic designers who are writing Verilog. And then there's design verification and then physical design. So there's this really big architecture phase, which happens before even writing any Verilog, which is, what do I want the organization of my chip to be? There's in some sense, I mean, what I really like, I came to hardware after doing almost 10 years in software. I really like the blank slate you get in hardware. You've got all of the raw materials. You have much more variety in what you have available. So what is the organization of your chip? Do I have 100 cores? Do I have one core? Do I have systolic arrays? Do I have vector units? All of those things. And then we spend a long time coming up with that general principle and then saying, okay, now I've got these applications I want to run. I want to run a transformer of a particular shape. 
I want to map that onto this architecture that I've got in my head. And so we do a lot of iteration. Well, I've got this architecture in my head. I write it down to communicate to other people, but that's just like a markdown file. And then still actually a lot in my head, but maybe with Python simulation and so on, I'll see, do my applications map well to it? And so can I run an LLM? This is where I was going to go. Okay, so you have a simulator where you write your chip. You can then simulate its performance and you have some battery of tests that you kind of see how this chip design works. Is it like an industry standard, you know, is it the X-Plane of chip testing? Yeah. So, I mean, there's an industry standard thing for the Verilog once you've done the design. There are just Verilog simulators that you can test against. Okay. That is, but you've already invested a huge amount of work by the time you've got to that point. And so, I sure hope you haven't made a big mistake at that point. Yes. So the thing that everyone does prior to that is we'll write our own performance simulator, which, I mean, it is very specific to your particular architecture, and you can write it quite concisely in just like a normal programming language. And so that is where most of the architecture work is done. And then the simulation on Verilog is more, I know what I'm doing, I just want to make sure I didn't have any bugs when I implemented it. Right. But I presume it's a game of inches where different people are trying different things, and then you do simulate it to see if it runs 1% better across the battery of tests, or is that not how it works? In this space, not so much. So, I mean, just to sort of characterize what performance of an AI chip is, it is how many, really, like the first thing you care about is flops. How many flops have I got? That's a product of how many multiplies, like, I've got a grid of a certain size, like 1,000 by 1,000, and so that can do a million multiplies per clock cycle. And then I have a certain clock frequency, like a gigahertz. And so I multiply them out. That is the speed of it. I don't even need to write that and test it to see how fast it is. It just is. Yeah. So what I plan in advance is it's going to be this fast. What I can then optimize on maybe a little bit is clock speed. There's not a lot I can do there. And then I can optimize a bit on area as well. So there is some room for optimization, but actually a lot of it gets set. Like, actually, just the speed of the chip gets set very much up front. Got it. And then, how many chips do you fab? Is it only the ones going into production, or is it just build a few to throw away, or how does it work? Yeah. So, the ideal, which companies tend to hit about 50% of the time, is that your first tape-out... tape-out costs like $30 million. Tape-out is just production. That's right. It's the actual manufacturing, like the first chip costs $30 million, the second chip costs $1,000. Yes, yes. So tape out is that first chip. Yeah, okay. The ideal is that your first tape out is actually your production thing. So you do a tape out, you make maybe 1,000 chips and test them, and then you do production volume. In the unlucky 50% of the time, you need to redo some or all of your tape out. So in good cases, and in many cases, you can redo just the metal layers, which costs you only like $100,000. As opposed to the... Pay the $30 million again. 
But in bad cases, if you've made something serious and you can't fix it at the metal layers, you have to do the whole thing again. Why can't that be solved? Is that definitionally an error in simulation where it turns out these two gates were too close together and it just led to some reliability issues? Yeah. So, yeah, like what you're describing is like physical, like the physical implementation of the chip is wrong. That's one class. The other class is that the logical specification of the chip is wrong. But shouldn't that be... Shouldn't you have got that before? Yeah. Before you spend $30 million on it. So, I mean, yeah, we do a lot of testing. We try not to ship these things. I hear software companies also ship bugs to production as well. Fair. And sometimes things miss. There's a very good retort: shouldn't you not be shipping bugs? But, I mean, there is a real trade-off in you can spend more and more time on design verification. There's always this question of when do you stop? And so you stop when your coverage metrics have hit a certain point, but maybe not 100%. And then, you know, Apple has to discretize the iPhone release cycle and they've settled on, you know, once per year. And so they'll decide, you know, we've got this better camera, but it's got to wait for the next version. Or, you know, we're going to improve the waterproofing, but, you know, that's got to wait for the iPhone 8 or whatever. And so they have taken a continuous process of coming up with ways to make the iPhone better and discretized it into annual iPhone releases. What will your discrete cadence be? Many chip vendors have this sort of tick-tock model, which is, maybe you're trying to release every year: on even-numbered years, you'll do a physical technology upgrade. So new transistor technology, new memory technology, a new interconnect. And then on odd-numbered years, you might do an architecture overhaul. I think that's a pretty good fit because you have different parts of your company that are skilled at different areas. And it allows you to keep sort of both of them occupied without instead every two years doing a massive, risky release. Yeah, yeah, yeah. Okay, and so you think that's probably likely for you. Yeah, that's right. You mentioned interconnect. So there's a narrative out there that NVIDIA, a huge part of the defensibility comes not from the chips, which are good, but from the software layer and the ability for engineers to write these really parallel workloads and the fact that they've been refining CUDA for whatever number it was. Yeah, a decade or something. Exactly, yeah, a long time. Just how do you think about parallelization, and is that narrative true? Yeah, it's true, for sure. It's true in many areas of the market. I think, and especially where you look at where NVIDIA entered the market, they're doing PC devices, lots of gaming and so on. There are thousands of games, maybe tens of thousands of games released, and they all need to be programmed against CUDA. And so there's such a huge investment in the software that this is really important, the compatibility. There are not thousands of LLMs. There's one LLM per frontier lab, and there's maybe five frontier labs or something like that. And so just the economics of that is different. The calculation for a frontier lab roughly goes as: I just bought a $10 billion compute cluster. I have hired 50 of the best people who can write optimized GPU or TPU or Trainium software. 
I pay them less than $10 billion, a lot less. And so let's put them to work optimizing the compute. And so they can, like, good work there can, I mean, depends on what your baseline is, but it can very easily double the performance of the software you write. And so there is a huge amount of custom software written for every generation of chip. When a new chip comes out, the software is, like, substantially rewritten to optimize for that specific chip. And that's just the right trade-off, given the relative costs of these things. What that means for us is that that ecosystem already exists, and that way of operating, where you say I'm just going to staff a 50-person team to write software for this chip, works really well if you're trying to sell to frontier labs. Okay, so you're saying CUDA is way more important for the games environment, where there's just a lot of games, than this top-heavy AI market that we're in, where if people say you need to then customize your workload for a MatX chip, it's like, well, fine. Fine, yeah, cost of business. Yeah, yeah, that makes a lot of sense. Where will you fab the chips? TSMC. Okay. Yeah. Why is TSMC so durable? Yeah. I mean, it's interesting. They don't charge a lot as well. You'd think that if they're a monopoly provider, they should charge a lot of money. They don't. I think that is a big aspect of why they're so durable. It's like this cyclical conservatism crossed with Taiwanese business conservatism means you're at the most conservative part of the matrix. But, I mean, it does, I mean, like, an American capitalist might say, well, they're just screwing up. They could have extracted more money from the market. But you could also say that there's actually this long-term sustaining advantage because they will just stay ahead for a really long time. They don't encourage the creation of competitors. Yeah, yeah. But isn't the creation of competitors kind of priced in because of geopolitical risk? And so it's not like everyone's fat, dumb, and happy with their TSMC dependence. They're actually thinking a lot about it. Yeah. I mean, so there is real technical advantage there as well. It's not just like the discouragement. But, like, making chips seems really hard. Building airplanes seems really hard. There are so many areas where competitive market forces create multiple options. Yeah. And yet that has not occurred here. So, I mean, there are multiple options. You can buy from Intel or Samsung. But at leading edge nodes. Yeah, yeah. So, I mean, what do we even care about in leading edge nodes, I guess? The big advantage is on power. The advantage on area is smaller. The leading edge nodes, the density doesn't go up as much as it used to. So, when you are really, really sensitive to power, it is a good idea to be on leading edge nodes. So that is AI chips and mobile phone chips. But there's a lot of the market where you don't, like devices in cars and so on. Sure, sure. Yeah, car chips, yeah, that's fine. But you're kind of saying, like, if you exclude the two most interesting parts of the market. For this super high growth area of the market, it's interesting to me. Like, again, there's a lot of other really complex business problems out there that competition has solved. And chip design is like, why has someone not left TSMC and gone and built a new fab? Yeah, I mean, I don't know. The cost of a fab is extremely expensive. I mean, I recognize that also the cost of a lab is extremely expensive too. 
I don't really understand the technical details of why it's so hard. I mean, there is some amount of just a $10 billion fab versus a $100 million tape out and chip development. There's a huge difference there. But beyond that, I'm not sure. What's TSMC like to deal with? So they're very big. So as a startup, we tend to work, not directly with TSMC, but with an ASIC vendor who, I mean, firstly does a huge amount of the actual backend work for us and interfaces with them, but then also has existing relationships with them. Got it. TSMC cares a lot about diversity of their customer pool. And so... It gets back to that conservatism. Yeah. So they're great to work with from that perspective. They want to encourage startups. That's right. Yeah. That's pretty cool. Why don't the labs design their own chips? I mean, Google does, but... Google does. OpenAI is starting. It's really a trade-off of how much advantage do you get from vertical integration versus how much advantage do you get by concentration of R&D work. So you take the five labs, and if they all buy from one player, then you can put like five times as much R&D into that chip. And does that beat the advantage you get from saying, I know exactly what my model is? Because of the several years' delay from designing a chip to being in production, you can't actually say I know exactly what my model is, because models change much faster than that. So even the labs are forced into this position where they have to make predictions and they have to hedge against what they might do two years from now. The calculus is sort of like, what is the probability distribution of what my model might look like, and then sort of design a chip that gets like 90% of that probability distribution or something. Yeah, yeah. Elon is excited about data centers in space. The two criticisms I've heard are that cooling is very hard and then just repairing the chips is hard. But I know nothing about chips; you do. The repair, I think, is really interesting. When you look at how NVIDIA deploys their racks, we do something pretty similar to what NVIDIA does. In general, you always need to design for the fact that some of your chips are going to be down. Like, mean time between failure of chips is not that large. And so in a cluster of 100,000 chips, there's going to be chips that are down all the time. One way you can do that is you can make a rack where one rack has some spare chips in it. NVIDIA has eight spare chips and a rack of 64. That's pretty good. The combinatorics works really well for you there, in that because you can pick which ones to avoid, you can with very high probability tolerate a lot of failures. And then the other family of things is to say, my rack has to work, but I have some spare racks as well. So you can math that out; if the reliability tax here is only like 10%, it's pretty good. But that relies on someone coming and servicing the part within a day or something like that. If you say they're going to service it never, then I think you actually can get where you want to be, but maybe with 100% tax on reliability rather than 10%. So, for example, if you think the average lifetime of a chip is in the range of three to five years, that means if I deploy twice as many chips, then three to five years from now, half of them will still work. Yeah, and also the burn-in is particularly failure-y. How about the cooling? So most of the challenge, I mean, I guess there's actually really a data center design aspect. 
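As an aside on the spare-chip arithmetic above, here is a minimal sketch of the rack-level math: treat the rack as 64 needed chips plus 8 spares and ask how likely it is that enough chips are working. The per-chip failure probability below, and the assumption that failures are independent, are purely illustrative, not field-failure data.

```python
# Probability a rack stays usable when it carries spare chips.
# Framing: 64 chips needed out of 72 positions (64 plus 8 spares, per the
# configuration described above). The per-chip failure probability is an
# illustrative assumption, and failures are assumed independent.
from math import comb

def rack_survival(total: int, needed: int, p_fail: float) -> float:
    """P(at least `needed` of `total` chips are working)."""
    return sum(comb(total, k) * (1 - p_fail) ** k * p_fail ** (total - k)
               for k in range(needed, total + 1))

print(rack_survival(72, 64, p_fail=0.02))  # ~0.99999...: 8 spares absorb a 2% per-chip failure rate
```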
At the rack level, the challenge of cooling is just getting the heat out as quickly as possible, out of the rack into the cooling network. How you get it out of the spaceship, other people would know that better than I do. Okay. Yeah, yeah, yeah. Again, that seems to be the main objection, but I don't know. Yeah, I mean, I think it's sort of like, if you think the cost of repair is that you need to have deployed twice as many chips, then it's a trade-off of the capital of the chips versus the power saving. Exactly. The repair thing, it feels like, can be solved, because also I think part of the bet, you know, probably Elon's claim, is that we will just be so power limited that, you know, you have no option but to go to space. And, you know, people can argue about that, but were that to be the case, then yes, it's like, well, you can get power in space and you cannot on Earth. And so you might as well go there, whereas the cooling is more fundamental. Does the product actually work at all? Reiner thinks about AI the unglamorous way, compute, systems architecture, and what it takes to run models reliably at scale. And if you're building an AI product, the business model similarly has a ton of unglamorous complexity. You're not just selling AI, you're monetizing consumption across API calls, tokens processed, GPU hours. Stripe Billing is a scalable system for usage-based billing. It lets you launch token-based pricing, subscriptions, credits, hybrid models, whatever you want. So you can create revenue models based on usage without rebuilding your pricing system every six months. If you're building an AI product, Stripe Billing is worth a look. What are your AI predictions for 2026? I mean, what I'm really excited about is just being able to... I'm still excited about the coding. This is what we do as a company. It's what many others do as a company as well. The one aspect of this is expanding into more domains. So, for example, where we spend our time as a company, we write Rust, we write Verilog, we write Python. No Haskell. Yeah, no, there's a story there. I used to love Haskell. Rust is my current favorite. Mutation is good. The models are extremely good at Rust and Python. They've done a lot of RL on them. They have not done as much RL on Verilog. They've done almost none on, OK, write me a markdown file that describes a chip architecture. And then how do you even RL on that? You have to say, what is a good chip architecture? I have to somehow say whether that's a good result or not. I think one of the things the labs are doing is trying to broaden what they've done RL on, source it from customers and so on, in order to sort of fill out the gaps, make it less spiky, fill out the gaps between the spikes. I presume the labs would love to work with you on improving the models by doing RL on this specific task. However, it's also somewhat... It doesn't make sense for us. Yeah, it's your special sauce. So do you want to come up with some AI approaches but keep them proprietary? Yeah, so we've looked at a few different aspects here. What we're able to do by ourselves, our business is not training models. We do it in order to do the research on numerics, but actual production models we don't do. So the biggest mileage, I think, is on the RL, and it's not something we can really do ourselves. We'd love it if we could have a custom model just for us, but that doesn't seem to be a product on offer. The terms we've been offered by labs so far have not been the right terms. 
Because you have to share the IP back. The way they prefer to do it is that they put it into their mainstream model because it's good for them. Yeah, yeah, yeah. Which obviously you don't want to do. Yeah. I mean, how do you think, what does using AI to design a chip look like, do you think? Because this is actually, I think, an interesting sight glass into a weak version of recursive self-improvement, where we're using the AIs to develop better AIs. And so I'm curious, yeah, what you think that looks like. Is it your own proprietary recursive models? What else? Is there kind of day-to-day AI usage that's load-bearing? Yeah, I mean, so the stuff that is available today, and I think will become even better very quickly, is just the stuff that looks most like software. So writing Verilog, running tests, running continuous integration, and so on. And that is a big fraction of the development time in the chip. It's probably 9, 12, 15 months or so. There's some stuff that's downstream of that, which is physical design, which is you take that Verilog and you generate the gates and the polygons. We don't have a clear path for, at least the most obvious thing is not clear for how to compress that. Like the goal, can you tape out a chip in one month? One month would be the goal. In theory, you could compress all of the logic design and design verification down to a short amount of time just by continuing on the same path we're doing now. But if you wanted to take the physical design down, that has to leave code. You're now doing like graphical interfaces and saying, well, I want to place stuff and so on. Actually, there has been work on this even prior to LLMs, which is specific models trained for that particular problem. And I think the vendors, which is like Synopsys and Cadence, probably will and should move in that direction. Most of the focus has not been do it faster, it's been do it with higher quality. But that is a big bottleneck on, like, can I have a new chip every month? And then there's just the practical thing of, like, a new chip every month doesn't really make sense, because then if I'm deploying, like if it takes me a year to populate a data center, that means I'm going to have different chips in different corners of the data center. Yes, yes. Sorry, when you talk about one month to tape out, so you do all this work to ultimately produce a file. Everything TSMC then does, it's not entirely in software. Is there some typesetting that has to happen of moving stuff around? But yeah, what happens when you send your files to TSMC? Then what? So they create a mask. So that is where the ASML tools come in. And a mask is really just a stencil. You shoot the lasers through the mask or the x-rays through the mask, and then that produces the different p-type and n-type semiconductors. So they produce the mask. That is the expensive part. And then they're building up these 15 or so metal layers. So they place it on the silicon, and then there are different layers of metals, which connect all the transistors together. They do that on a wafer. It happens on a stepping basis. So there's sort of a maximum size of chip you can build, which is constrained by this machinery. The wafer stepper is part of the ASML special sauce, right? Yeah, I guess there's probably some important alignment requirement there. Yeah, I think I remember that being quite like the classic manufacturing throughput problem. And I think they've done a lot of work on optimizing that. Yeah, yeah. 
So they take that. So then you just produce hundreds of copies of your chip. You have to test it because there are defects. Typically, I think the average rate really depends on process and so on, but a small single-digit number of defects per chip. So you test the chip and see whether it has any defects in it. Many chips are designed to be able to tolerate a few defects. And so you need to configure it to tolerate the defects. And now you have a die that by itself works. And then you need to package it. So you put it in a package together with memories. Typically, that's the HBM. And maybe you escape the wires to connect to other chips. How long does it take to make a mask? So, I mean, what we see is time from, like, tape out to chips back. Again, depends on node, but it's ballpark four or five months. Oh, so tape out is just like sending the file? Yeah, well, I mean, we can send a tape out, send the file, and then there's a whole process of you make the masks for all the layers and then actually just producing the chips. Got it, so producing the masks and producing the chips happens after tape out. That's right. I see, okay. So, like, is the term tape out from, like, you send a magnetic tape with the instructions or something? It could be. I was in software when the term was created. I'm curious what the tape actually means. It feels like, you know, when we think about AI predictions, one thing I'm really struck by is how still in 2026, every time you open a chat window, it's contextless. It's got no memory. And now, to be fair, it's like, guys, it's been four years, not even four years, it's been three and a half years. Just calm down, we'll get there. But I also interpret a lot of the current enthusiasm for OpenClaw and all that stuff as, it's like this super hacky backdoor into state management, where your little claw will write a markdown file of what it's doing, and then look at that markdown file the next time and things like that. But it just feels like state management and memory is going to be a huge deal, and that will really change the character of AI products. Yeah, it's really interesting. Like the, so, I mean, long context is one of the biggest bottlenecks on model performance, on speed of the model. Yes. It just, like, every single token you generate, it reads through all of the previous tokens, or maybe it reads through a subset of them, but reads through a lot of the previous tokens you've written. And so memory bandwidth for that is really constraining. You can think of model-level ways to solve that problem, which is to say maybe I can compress it into a few bytes or something like that. But it's interesting that the most effective way to solve it has been, I mean, it's really a combination of everything, but the most effective way to solve it has been once you hit your 300,000 token limit, have the model go back through it and compact. Yes, yes. I mean, it's kind of what OpenClaw is doing. It's like compacting everything you've done. Yeah. But it's funny that it's so manual. Yeah, I mean, I think... Manual is the wrong word. I mean, it's so primitive. It's maybe because it's so controllable. You can, like, if you want to iterate on how you compact, you give a different prompt and you say, compact this way, compact that way. You can iterate on that in seconds or minutes. Whereas if you're trying to do some iteration on the model level, where you say, now I've got a different model architecture, it's going to take months to try and launch something. 
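To put rough numbers on the "every token re-reads the previous tokens" point above: each decode step streams the KV cache for the whole context, so the bytes read per generated token grow linearly with context length. The layer count, KV heads, head dimension, cache precision, and bandwidth below are assumptions for illustration only, not any particular model's or chip's figures.

```python
# Rough KV-cache read cost per generated token. Every decode step streams the
# cached keys/values for all previous tokens, so the cost grows with context length.
# All model dimensions and the bandwidth figure are illustrative assumptions.

def kv_read_ms_per_token(context_tokens: int,
                         n_layers: int = 60,
                         n_kv_heads: int = 8,
                         head_dim: int = 128,
                         bytes_per_value: int = 2,        # fp16/bf16 cache
                         bandwidth_tb_per_s: float = 5.0) -> float:
    # Factor of 2 for keys and values
    cache_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    total_bytes = context_tokens * cache_bytes_per_token
    return total_bytes / (bandwidth_tb_per_s * 1e12) * 1e3

print(kv_read_ms_per_token(10_000))    # ~0.5 ms per generated token
print(kv_read_ms_per_token(300_000))   # ~15 ms: at the 300k-token limit, KV reads rival the weight reads
```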
Yes, yes. Any other AI predictions? I'm generally just interested in what makes models cheaper and faster, so that's at the model architecture level, really tied into this context thing. I think the context size will stay ballpark where it is, maybe a few times larger, but the parameter count will go up. Parameter counts will grow much faster than context length, actually, just because of the underlying physics of what's available. So has that been the story? Would that be a reacceleration of parameter count? Because it feels like we leveled off slightly in the last year or two, and instead we've been focusing on more and better RL. Yeah, okay. Parameter count or thinking tokens, I guess. Those are available, but context length, I think, is sort of struggling to grow. Yeah, yeah. Okay, but you think we... We say context length is struggling to grow, but you're saying we keep context the same length? Keep context the same length. But we're better at working with large context, is that what you're saying? Yeah, I mean, just have application-level interventions to manage large context, like compacting. Yeah, because I think everyone's had the experience currently of the chat conversation where the further down in the chat you get... It just gets looser. Yeah, it's sloppy. It's really sloppy by the end; it's making mistakes. So you're saying we start to do better with large context. Okay, I buy that. When will I be typing into a chat window and it's a MatX chip underneath, powering it? Tape-out in under a year. And then that means chips available in sort of... Yeah, that's the ballpark. Okay. So in 2027, I will be seeing very high-performing chats as a result of... In the 1% experiment of users or something like that. Yeah, exactly. I need to find a way to finagle myself into the A/B test. MatX is 100 people? That's right. How have you gone about building the team, the culture? Yeah, so what we have on the team is mostly hardware, but also a big software team and a big ML team. I think the ML team is quite unusual in what we ask them to do. When you look at a typical ML team in an AI chip company, it will be what I might call ML engineering or ML performance: they're writing kernels so that a given model actually uses your hardware well. There's sort of a missed opportunity there. If all you do is take other people's models and write kernels for them, you're optimizing one side when you could be optimizing both at the same time. And so we want to optimize the whole thing at the same time, like real co-design. So our ML team does actual, real ML research. What they do every day is train small LLMs from scratch, focusing on numerics and attention. And this has really, really helped us make an interesting product. It shows up most strongly in our numerics. Often, when people design numerics, back when float32 was popular they'd say, I'm going to follow the IEEE standard; now it's more like following the Open Compute standard. And there are lots of little details, like what's the rounding mode: I'm going to use round-to-nearest-even or something like that, which is the best-known standard way to round. We want to cut corners anywhere we can. So maybe don't do the best rounding; maybe don't get all the corner cases exactly right. That's a very scary proposition if you're making those choices blind. But if you have the benefit of a research team who can back you up as you do that, it's really powerful, and it's really interesting that we can make some sloppy choices in these cases.
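To make the rounding-mode point concrete, here is a toy comparison in Rust of round-to-nearest-even versus simple truncation when snapping values to a coarse grid. This is purely illustrative; it says nothing about the numerics any particular chip implements, and the step size is an arbitrary example.

```rust
// Toy illustration of a rounding-mode choice when quantizing to a coarse grid.

/// Quantize `x` to multiples of `step` using round-to-nearest, ties-to-even.
/// (f32::round_ties_even is stable since Rust 1.77.)
fn round_nearest_even(x: f32, step: f32) -> f32 {
    (x / step).round_ties_even() * step
}

/// Cheaper alternative: just truncate toward zero.
fn round_truncate(x: f32, step: f32) -> f32 {
    (x / step).trunc() * step
}

fn main() {
    let step = 0.25_f32;
    for x in [0.375_f32, 0.625, -0.375, 0.301] {
        println!(
            "x={:+.3}  nearest-even={:+.3}  truncate={:+.3}",
            x,
            round_nearest_even(x, step),
            round_truncate(x, step)
        );
    }
}
```

Whether the cheaper mode is acceptable is exactly the kind of question the ML research loop described above is meant to answer empirically.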
I feel like often technical advances come through better iteration loops. A favorite example I found recently is that the Wright brothers actually had a failed season before first flight. The first powered flight was at the end of 1903, and a couple of years before that they were down in Kitty Hawk not making that much progress. So they went back to Ohio, built a wind tunnel, and tested their designs in it. You can imagine there were not a lot of wind tunnels around 1901. They did a lot of wind tunnel testing, and their successful flight came after that. Is this something you're focused on, where to get better chips you allow for a better testing and iteration loop? And what does that look like? Yeah, I think this mostly happens in the architecture and product definition stage. Maybe even more generally, AI chips seem to live or die by product definition and architecture. What is the most extreme form of fast iteration? It's doing it in your head. So can you map a model to hardware in your head? Can you estimate its performance in your head? You're not going to be 100% perfect, but maybe you can prove some kind of lower bound on performance. The simplest possible thing is: my model has a trillion parameters, my device can do a billion multiplies per second, so it takes a thousand seconds to run, or something like that. Just do that simple division. But then there are much more complicated things. We tend to look at resource balances: how many memory fetches do I need to do per multiply, or something like that. At least the way I like to do design, architecture, and optimization is to be able to estimate the performance to within about 30% or 40% before typing anything in at all. We've tried to do that a lot, and a lot of our architecture comes from there. Then the next stage of iteration, that was kind of on the performance side, also happens on the circuit design side. Can you take a circuit and say what the gate count is? A 16-bit multiplier has approximately 16-squared gates, and you can do the same for more complicated things like sorting networks and so on. So we already have a pretty good idea of the costs and speeds of things after doing these calculations. Then, as the next step of iteration, on the ML side we run model experiments; you get iteration speed mostly by having small models. And on the hardware side, we use performance simulators to do the next level of detail and make sure we're seeing all the things we want to see. Yeah, yeah. This idea that the best iteration is in your head is kind of reminding me of Jeff Dean's numbers every programmer should know. Yeah, yeah. Do you have your equivalent of that, numbers every MatX engineer should know? Yeah, we have go/gates in our company, which says what the cost of an XOR gate is, an AND gate, a full adder, an SRAM bit cell, and so on. And you want people to be working with that stuff in their head and have an intuitive sense for it because, again, it leads to better iteration.
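The kind of back-of-envelope resource-balance estimate described above fits in a few lines of code. The figures below are made-up illustrative inputs, not any real chip or model; the point is only the shape of the calculation: compare a compute-bound bound against a memory-bandwidth bound and see which one dominates a decode step.

```rust
// Back-of-envelope decode-step estimate with illustrative, made-up numbers.

fn main() {
    let params: f64 = 1.0e12;       // 1T parameters
    let bytes_per_param: f64 = 1.0; // e.g. 8-bit weights
    let flops: f64 = 1.0e15;        // sustained multiply-adds per second (illustrative)
    let mem_bw: f64 = 4.0e12;       // bytes per second of HBM bandwidth (illustrative)
    let batch: f64 = 1.0;           // tokens decoded per step

    // Each decode step reads every parameter once and does ~2*batch FLOPs per parameter.
    let compute_time = 2.0 * params * batch / flops;
    let memory_time = params * bytes_per_param / mem_bw;

    println!("compute-bound time per step: {:.3} ms", compute_time * 1e3);
    println!("memory-bound time per step:  {:.3} ms", memory_time * 1e3);
    println!(
        "decode is {}-bound at batch {}",
        if memory_time > compute_time { "bandwidth" } else { "compute" },
        batch
    );
}
```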
What is the pitch to someone joining MatX? I mean, I think if you're someone who likes optimizing: software, hardware, Factorio, whatever. If you're trying to fit something into the smallest budget possible, I think it's a pretty exciting place to be. I think hardware companies in general are really exciting because you have such a broad range of skills on the team. You have software people, you have hardware people, you've got physical design, you've got people who are looking at the insertion force of a card into a rack. So there's so much discussion and learning you can do. At MatX in particular, we really care about this, and I think we extend it all the way up into the application and the machine learning as well. So really, really interesting technical problems, and I think just generally there are lots of interesting people to talk to. Yes. And presumably in terms of impact, if you can design a meaningfully higher-throughput chip, a 20% higher-throughput chip means 20% more AI is happening. If the bottleneck is elsewhere, like power or cost, you actually are meaningfully increasing the amount of intelligence in the world, which is presumably exciting to people. Yeah, yeah. I mean, I think that shows up both as applying it to more applications and as just how smart the model is. Yes, yes. Why Rust? So on a previous project I worked on at Google, we did a lot of Haskell. I did Haskell when I was at school. I loved it, very principled, very interesting. I like Haskell, but I also like making stuff fast. And then the question is, what's the first thing you want to do? You want to be able to modify your memory. In Haskell, you jump through hoops to do that. Maybe I just want a language that is like functional programming but lets me modify my memory. So I think Rust has a lot of the nice things, like type classes, or traits, and a rich type system. One of the interesting ways we use it at MatX is the range of data types that you express in software. What are the integer types? Int32, int64, int8, maybe that's all you care about. But it turns out in hardware you care about every single bit, and so you want to use 17-, 18-, 19-bit integers. That is quite natural to express, and we've built up a whole ecosystem of rich hardware data types in Rust as well. Has Rust beaten Go for the position of performant typed programming language with modern features, or do they actually address different markets? Yeah, so there's what the Rust marketing will say, which is safe without garbage collection. That's the objective thing you can point to as different, but it sort of buries the lede, which is also that it has nice type system features that Go doesn't have. And then, why does garbage collection matter at all? People often focus on the time it takes to run a garbage collector, but the other thing is that every time you allocate an object, you've got the object, and then you've got the garbage collector header at the beginning. So it uses a lot more memory as well. And if you want to design some data structure that uses the right amount of memory rather than a bit more, then... I'm sorry, I hadn't realized that in Rust you're allocating your memory manually versus in Go you have a garbage collector. Yeah, yeah, that's right. I hadn't realized that.
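As a rough sketch of how odd bit widths like 17 or 19 bits can be expressed in Rust, here is a minimal const-generic integer type. It is illustrative only, not MatX's actual hardware-type library; the storage type, the wrapping behavior, and the `UInt` name are assumptions for the example.

```rust
// Minimal sketch of a BITS-wide unsigned integer using const generics.
// Values are stored in a u32 but masked to BITS bits (assumes 1 <= BITS < 32),
// so arithmetic wraps at 2^BITS like a BITS-wide hardware adder would.

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct UInt<const BITS: u32>(u32);

impl<const BITS: u32> UInt<BITS> {
    const MASK: u32 = (1u32 << BITS) - 1;

    fn new(v: u32) -> Self {
        Self(v & Self::MASK)
    }

    /// Addition that wraps at 2^BITS.
    fn wrapping_add(self, rhs: Self) -> Self {
        Self::new(self.0.wrapping_add(rhs.0))
    }
}

fn main() {
    type U17 = UInt<17>;
    let a = U17::new(0x1_FFFF); // maximum 17-bit value
    let b = U17::new(1);
    assert_eq!(a.wrapping_add(b), U17::new(0)); // wraps at 2^17
    println!("{:?}", a.wrapping_add(b));
}
```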
Okay, and you prefer that for what you're doing. I just really like dealing with the details. You give me a puzzle and I'll be like, let me solve every single piece of it. Yeah, yeah, yeah. So that tickles that part of my mind with Rust. It seems like you're a fan of optimization generally. Is that a fair characterization? Yeah. Where else have you... so chip optimization is one domain. Where else? Yeah. One of the really exciting things I found about working at Google is that the whole Google code base is available, and you can look at how a memory allocator works, how a mutex works, how a hash map works, any of those things. You can go and look inside the implementations, and Google has excellent implementations of those, some of the best you could write. So one of the things I did on my nights and weekends when I was at Google was just go find those implementations and write a benchmark: how many nanoseconds does it take to allocate eight bytes of memory? And then, can I make that faster? Maybe I inline this function. Maybe I look at the assembly and say, looks like there are a few memory moves here, or there are some registers being used that I don't need in the fast path, only in the slow path. Can I do something there? So, I don't know, that was always my fun and learning activity. Being outside of Google, I probably could have done this inside of Google as well, but outside of Google I felt the luxury to be able to talk about these results too. One of the things I've looked at recently is hash tables, because they're used so much. One prompt for me was: if I wanted to design custom CPU instructions for accelerating hash tables, since hash tables are one of the most common things, I'm reading and writing them all the time, what would the optimal CPU be for that? And following down that chain: what is the best hash table implementation in the first place? So I spent some time looking at different SIMD implementations, and there's this really cool technique called cuckoo hashing, where you hash into two different locations and then you use the bucket which is less full. It's been in the literature for decades, and yet the best hash table implementations don't use it because it's somehow not practical. So why is it not practical? Practical hash tables these days are considered to be ones that use SIMD vector instructions to scan, like, eight buckets at a time. And the way cuckoo hashing is normally described is: I look up one bucket here and one bucket there, so I'm not using the vector instructions. Vector instructions are much faster than scalar instructions, and so there's kind of a missed opportunity. Again, just take the two good ideas and stick them together: do vector instructions on cuckoo hashing. You have to be careful to get the details right, but if you get it right, you can actually just win. And so, is your claim that one could design a custom CPU that has way better hash table performance, or that even on current chips you could get way better hash table performance? Both. I mean, I'm interested in what you can do in designing custom hardware, but MatX doesn't make CPUs. We're not going to make CPUs. You could. New line of business. I mean, we just want to focus on shipping one product well for the time being. Fair. Good answer. So I think it's an interesting exercise, but I don't get to feel the endorphins of seeing the number going down. So I first did this on just Intel CPUs, and you can get better performance than some of the best hash table implementations available using cuckoo hashing on Intel CPUs.
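For concreteness, here is a toy sketch in Rust of the combination being described: two-choice ("cuckoo-style") bucket selection plus a vectorized scan of a bucket's tags. It is not the implementation discussed here; the SIMD compare is stood in for by a portable SWAR byte-match on a u64 of eight one-byte tags, and all names, sizes, and hash constants are illustrative. The toy version has no eviction or resizing.

```rust
// Two-choice bucket selection + vectorized (SWAR) tag scan, as a toy sketch.

const SLOTS: usize = 8; // one u64 of tags = eight 1-byte tags per bucket

#[derive(Clone)]
struct Bucket {
    tags: u64,            // eight 1-byte tags packed into a u64; 0 means "empty"
    keys: [u64; SLOTS],
    vals: [u64; SLOTS],
    len: usize,
}

struct TwoChoiceMap {
    buckets: Vec<Bucket>,
}

/// Simple splitmix64-style mix; any decent hash works here.
fn hash(key: u64, seed: u64) -> u64 {
    let mut x = key ^ seed;
    x = (x ^ (x >> 30)).wrapping_mul(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94d049bb133111eb);
    x ^ (x >> 31)
}

/// Set the high bit of each byte of `tags` that equals `tag`. The SWAR trick
/// can flag extra bytes above a real match, so every hit is verified below.
fn match_tags(tags: u64, tag: u8) -> u64 {
    let x = tags ^ 0x0101010101010101u64.wrapping_mul(tag as u64);
    x.wrapping_sub(0x0101010101010101) & !x & 0x8080808080808080
}

impl TwoChoiceMap {
    fn new(n_buckets: usize) -> Self {
        let empty = Bucket { tags: 0, keys: [0; SLOTS], vals: [0; SLOTS], len: 0 };
        Self { buckets: vec![empty; n_buckets] }
    }

    fn candidates(&self, key: u64) -> (usize, usize, u8) {
        let n = self.buckets.len() as u64;
        let b0 = (hash(key, 1) % n) as usize;
        let b1 = (hash(key, 2) % n) as usize;
        let tag = (hash(key, 3) as u8).max(1); // reserve tag 0 for empty slots
        (b0, b1, tag)
    }

    fn insert(&mut self, key: u64, val: u64) -> bool {
        let (b0, b1, tag) = self.candidates(key);
        // Two-choice rule: put the key in whichever candidate bucket is less full.
        let b = if self.buckets[b0].len <= self.buckets[b1].len { b0 } else { b1 };
        let bucket = &mut self.buckets[b];
        if bucket.len == SLOTS {
            return false; // toy version: no cuckoo eviction or resize
        }
        let i = bucket.len;
        bucket.tags |= (tag as u64) << (8 * i);
        bucket.keys[i] = key;
        bucket.vals[i] = val;
        bucket.len += 1;
        true
    }

    fn get(&self, key: u64) -> Option<u64> {
        let (b0, b1, tag) = self.candidates(key);
        for &b in &[b0, b1] {
            let bucket = &self.buckets[b];
            let mut m = match_tags(bucket.tags, tag);
            while m != 0 {
                let i = (m.trailing_zeros() / 8) as usize;
                if i < bucket.len && bucket.keys[i] == key {
                    return Some(bucket.vals[i]);
                }
                m &= m - 1; // clear this candidate bit, keep scanning
            }
        }
        None
    }
}

fn main() {
    let mut map = TwoChoiceMap::new(1024);
    for k in 0..1000u64 {
        assert!(map.insert(k, k * 10));
    }
    assert_eq!(map.get(42), Some(420));
    assert_eq!(map.get(5000), None);
    println!("lookups ok");
}
```

A real SIMD version would replace `match_tags` with vector compare instructions over wider buckets, but the verify-after-match structure is the same.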
And what are examples of workloads that are really hash-table-read intensive? I mean, I know, kind of everything. I mean, JavaScript, I guess. But yeah, it's sort of a tricky exercise, because when you really think about it, you're like, did I really need a hash table there? I probably didn't, but you just reach for it all the time. Okay, but you could go to the Google JavaScript team and probably help them eke out better performance in the Chrome JavaScript engine. Yeah, potentially. I'm not going to spend my time on that, though. Well, if you're listening to this podcast, here's a free idea from Reiner. And then, explain the dragon. Yeah, this comes from when I was working on the JAX team; the JAX team is one of the ML infrastructure teams at Google, and it was the last team I was on before I left. I'm sorry, what does the JAX team do? Oh, yeah. The JAX team develops JAX, which is sort of Google's newer, more modern alternative to TensorFlow, competitive with PyTorch. It's how you write models in Python to run on TPUs. A big part of the JAX team, however, is to say, okay, we have JAX, the technical artifact; can we help users actually use it really well and get high performance? Ultimately that became: well, who are the users? People writing LLMs. How do you get good performance on LLMs? A really, really strong team, the JAX team at Google, although, as with a lot of Brain, people are now elsewhere as well. We developed a lot of the techniques for how to lay out models efficiently on many chips. And ultimately some people at Google, and I contributed after I left Google, wrote this guide called How to Scale Your Model: how to run an LLM as fast as possible. It is sort of the main reference for how to get high performance on TPUs, and there's now also a GPU version of it as well. It's a dragon because it's "how to train your dragon." I see. Okay. Last question. People might not have thought there's room for new chip companies; it might have seemed unusual or very hard, and you guys seem to have found a very good approach to it. Where do you think the other opportunities are for companies to be started here in 2026? Where should people be looking for entrepreneurial opportunities, or just technical challenges that haven't been properly addressed? More labs, I think, is still interesting. Can we do more on model architecture is always interesting. You think we have not fully explored the model architecture space? Yeah, I mean, the frontier labs have done a pretty good job of exploring it, but as the hardware changes, the shape of the model should change, for sure. Yeah, okay, and presumably you're not thinking of yet another frontier lab pursuing the same architecture; you think there's probably off-the-wall-looking architecture that will actually make a lot of sense? Yeah, I think a little bit off the wall, for sure. Okay. Do you have a specific architecture in mind? My mentality is always to stick within the transformer family, but to ask: what are the constraints that are currently imposed that you could lift? Yeah, yeah.
So, for example, when you're doing transformer inference, you do prefill, which is processing what the user said to you, and then there's decode, which is generating the response to that. Those are totally different in pretty much every aspect of how they actually run: one runs a step at a time, the other runs really in parallel. So there's this somewhat artificial constraint today that it's the same model doing both; maybe lift that constraint. Another example, and this is a more fundamental constraint, is that you have to serve the same model you trained. But again, training is very different from serving. Training is very compute intensive; serving is more memory-bandwidth intensive. So maybe there's a way to make a model that, when you use it at inference time, increases the amount of compute it does to use some of the available resources. Yeah, makes sense. Well, Reiner, thank you. Pleasure.