Cavities in the Data: Building FDA-Cleared AI for Dental Imaging with Overjet

59 min

• May 13, 2026

Summary

Sadeg Salehi from Overjet discusses how to build FDA-cleared AI models for dental imaging across heterogeneous data sources. The episode explores the critical importance of stratified evaluation across data subgroups, the evolution from narrow task-specific models to foundation models, and how predetermined change control plans enable rapid iteration in regulated medical AI.

Insights

Aggregate metrics mask catastrophic failures on specific subgroups—an 80% F1 score is meaningless without understanding performance across tooth types, sensor manufacturers, image modalities, and disease severity distributions
Models exploit shortcuts in training data; if rare sensors only contain healthy images, the model learns the sensor type predicts health rather than learning to detect disease across all sensors
Stratified sampling during training and evaluation is more important than sophisticated loss functions; representative prevalence of subgroups in training data is the foundational requirement
Architectural decisions are regulatory decisions; independent prediction heads per indication allow retraining one head without FDA reclearance, avoiding the need to reclear all downstream models
Production feedback loops using treatment plan codes provide noisy but real labels for monitoring model performance without requiring additional annotation infrastructure
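
The first insight above, that a single aggregate score can hide a failing subgroup, can be illustrated with a minimal sketch. The record fields (`tooth`, `label`, `pred`) and the toy counts are hypothetical and are not taken from Overjet's data or tooling; the point is only that the same predictions are scored once overall and once per subgroup.

```python
from collections import defaultdict

def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def stratified_report(records, key):
    """Group records by records[i][key] and score each group separately."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in records:
        c = counts[r[key]]
        if r["pred"] and r["label"]:
            c["tp"] += 1
        elif r["pred"] and not r["label"]:
            c["fp"] += 1
        elif r["label"]:
            c["fn"] += 1
    return {group: prf(**c) for group, c in counts.items()}

# Hypothetical toy data: molars are scored well, anterior teeth are over-called.
records = (
    [{"tooth": "molar", "label": 1, "pred": 1}] * 8
    + [{"tooth": "molar", "label": 0, "pred": 0}] * 8
    + [{"tooth": "anterior", "label": 1, "pred": 1}] * 1
    + [{"tooth": "anterior", "label": 0, "pred": 1}] * 3
)

# One aggregate number versus one number per tooth group.
overall = stratified_report([dict(r, scope="all") for r in records], "scope")["all"]
by_tooth = stratified_report(records, "tooth")

print("overall precision:", overall[0])
for group, (p, r, f1) in by_tooth.items():
    print(group, "precision:", p, "recall:", r, "f1:", round(f1, 3))
```

In this toy set the overall precision is 0.75 while precision on the anterior group is 0.25, which is the kind of gap the aggregate figure conceals. The same function can be called with any other metadata key, such as sensor or image type, as long as each record carries it.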

Trends

FDA's predetermined change control plan (PCCP) framework is enabling rapid model iteration in regulated medical AI without full reclearance cycles
Foundation models trained on semi-supervised noisy labels from narrow task-specific models are becoming the standard architecture for multi-indication medical imaging AI
Unsupervised image enhancement and normalization across sensor types is critical infrastructure for deploying models across fragmented hardware ecosystems in healthcare
Stratified evaluation frameworks are becoming table stakes for medical AI; aggregate metrics alone are insufficient for regulatory approval and clinical safety
Treatment plan and procedure codes are emerging as valuable production monitoring signals for medical AI systems without explicit ground truth labels
Multi-head architectures with frozen backbones are being adopted specifically to manage regulatory constraints and reduce reclearance burden in medical AI
Dental AI is becoming a leading use case for demonstrating how to handle extreme data heterogeneity (32 teeth, 15-20 sensor types, multiple image modalities)
Semi-supervised learning using model-assisted labeling is enabling scaling from thousands to millions of training images in medical imaging AI
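
The frozen-backbone, independent-heads pattern named in the trends above can be sketched in a few lines. This is an illustration of the independence property only, not Overjet's Unity model: the random-projection backbone, the logistic-regression `Head` class, the head names, and the synthetic labels are all assumptions made for the example.

```python
import math
import random

def make_backbone(n_in, n_feat, seed=0):
    """Build a fixed feature extractor: tanh of a random projection.
    The weights are closed over and never updated, so the backbone is frozen."""
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_feat)]
    def backbone(x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
    return backbone

class Head:
    """Logistic-regression head that owns its weights and only reads features."""
    def __init__(self, n_feat):
        self.w = [0.0] * n_feat
        self.b = 0.0

    def predict(self, z):
        s = sum(wi * zi for wi, zi in zip(self.w, z)) + self.b
        s = max(-30.0, min(30.0, s))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-s))

    def fit(self, feats, labels, lr=0.5, epochs=200):
        # Plain stochastic gradient descent on the log loss.
        for _ in range(epochs):
            for z, y in zip(feats, labels):
                err = self.predict(z) - y
                self.w = [wi - lr * err * zi for wi, zi in zip(self.w, z)]
                self.b -= lr * err

# Synthetic inputs standing in for images; features are computed once.
rng = random.Random(1)
xs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(40)]
backbone = make_backbone(n_in=4, n_feat=6)
feats = [backbone(x) for x in xs]

# Two hypothetical indications, each with its own head.
heads = {"decay": Head(6), "filling": Head(6)}
heads["decay"].fit(feats, [int(x[0] > 0) for x in xs])
heads["filling"].fit(feats, [int(x[1] > 0) for x in xs])
filling_before = [heads["filling"].predict(z) for z in feats]

# Retrain only the decay head on revised (hypothetical) labels.
heads["decay"] = Head(6)
heads["decay"].fit(feats, [int(x[0] + x[2] > 0) for x in xs])
filling_after = [heads["filling"].predict(z) for z in feats]

print("filling head unchanged:", filling_before == filling_after)
```

Because the backbone never changes and the heads share no parameters, the second head's outputs are identical before and after the first head is retrained. Training all heads jointly with a shared, trainable backbone would not give that guarantee.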

Topics

Stratified evaluation and subgroup analysis in medical AI
Data heterogeneity in dental imaging (sensors, modalities, tooth morphology)
Foundation models and multi-task learning architectures
FDA predetermined change control plans (PCCP)
Model monitoring and production feedback loops
Sampling bias and shortcut learning in medical AI
Semi-supervised learning with noisy labels
Loss functions for imbalanced medical image segmentation
Image quality enhancement and normalization
Regulatory constraints on medical AI architecture
Transition from narrow to foundational models
Model-assisted annotation and labeling at scale
Dental disease detection (cavities, bone loss, pathologies)
Multi-modality medical imaging (X-ray, CBCT, intraoral scans)
Clinical decision support and clinician-AI collaboration

Companies

Overjet

Primary subject; builds FDA-cleared AI models for dental disease detection across billions of X-ray images from thous...

Cordon

Produces the Data in Biotech podcast series exploring how companies leverage data science in life sciences

People

Sadeg Salehi

Guest expert discussing FDA-cleared dental AI models, stratified evaluation, and foundation model architecture for me...

Ross Katz

Podcast host conducting the interview and synthesizing key insights about medical AI evaluation and regulation

Quotes

"You build a model, you evaluate it, the numbers look great. high precision, strong recall, everything checks out in your test set. You deploy it. And within a week, someone's asking why it completely falls apart on a subset of data that you never thought to isolate."

Ross Katz•0:00-0:30

"When you provide me only one number, I don't like that because that's not you are you are zooming out so much that you don't see any detail inside of your model and you don't know the performance of this model"

Sadeg Salehi•~22:00

"These models, they are looking for shortcuts. They are not doing something that you want them to. They are not looking at something that you want them to see. So you need to guide them to see what you want them to see and they are really good to find the shortcuts"

Sadeg Salehi•~28:00

"The most important thing in order to sample effectively is to have sort of the most comprehensive data set you possibly can. But you don't have examples of all of the different disease types that you might want to detect. So you just train the models that you can train using the data that you do have"

Ross Katz•~35:00

"We are one of the very first companies that we got the clearance for this PCCP. And that affects our lives very easier now for some of the indication that we have PCCP. we just we just retrain a model we just follow our comprehensive evaluation protocols that we have and then we pass all of those things we can we just we just put in production"

Sadeg Salehi•~55:00

Full Transcript

Welcome to Data in Biotech, a podcast from Cordon where we explore how companies leverage data to drive innovation in life sciences. Every two weeks, we sit down with an expert from the world of biotechnology to understand how they're using data science to solve technical challenges, streamline operations, and further innovation in their business. Here we go.
Welcome back to Data in Biotech. I'm Ross Katz. Here's something that happens more often than anyone in AI wants to admit. You build a model, you evaluate it, the numbers look great: high precision, strong recall, everything checks out in your test set. You deploy it. And within a week, someone's asking why it completely falls apart on a subset of data that you never thought to isolate. In most industries, that's an expensive mistake. But in healthcare, it's a different kind of problem entirely. When your model is analyzing medical imagery, making detections that influence whether a patient gets treated or sent home, a blind spot in your evaluation isn't just a performance issue. It's a clinical risk. And the challenge gets harder when you consider what medical imaging data actually looks like in the real world. It arrives from dozens of different sensors and acquisition systems, each producing images with different contrast, resolution, and noise characteristics. So for example, in dentistry, a bitewing X-ray, which is the close-up shot your dentist takes of your back teeth, looks nothing like a panoramic image that you get of your full jaw. A digital capture from one sensor manufacturer looks nothing like one from another. And yet, the model has to work across all of them, reliably, fairly, and under the watchful eye of the FDA. My guest today lives in the middle of that problem. Sadeg Salehi is the Director of Research and Principal Scientist at Overjet, where he leads the team building FDA-cleared vision models that detect and quantify dental diseases across billions of X-ray images from thousands of practices.
His academic work on loss functions for imbalanced medical image segmentation has been cited over 3,000 times. And at Overjet, he's applied that expertise to a data problem with a staggering number of dimensions. You've got 32 teeth per adult patient, each with different morphology, and multiple image types showing different anatomy. Then you've got 15 to 20 sensor manufacturers in the field, with 90% of practices using just three of them, and disease severity distributions that range from barely visible early-stage decay to obvious pathology that anyone could spot. In this conversation, Sadeg walks us through what it actually takes to evaluate models across all of those dimensions and why aggregate metrics can mask catastrophic failures on specific subgroups of your data. He explains how his team evolved from over 20 narrow task-specific models into a single foundational architecture they internally call Unity, how they use treatment plan procedure codes as a noisy but real feedback signal to monitor model performance in production, and how Overjet became one of the first companies to secure the FDA's predetermined change control plan, a framework that lets them update models without filing a new clearance every time. So whether you work in medical imaging or not, the evaluation traps that Sadeg describes apply anywhere you're building models against heterogeneous high-stakes data. Let's get into it.
Sadeg, welcome to Data in Biotech. Can you give us an overview of what Overjet does and what your role looks like day to day as director of research?
Yeah, thank you. Thank you, Ross, for having me on this podcast. So Overjet is a healthcare AI company focused on improving oral health by bringing more objectivity and consistency into dentistry. We built a learning-based model that analyzes different dental modalities. It can be X-rays, it can be CBCT, which are 3D X-ray images, or even voice. It can also be intraoral images or intraoral scan images.
And we help clinicians detect conditions like cavities, bone loss, and other pathologies earlier and more accurately. And my role as director of research here at Overjet is to make sure that we can train, develop, and evaluate these models rigorously and make this happen in the real world.
Yeah, that's awesome. And just so that I'm understanding Overjet's model a little bit: my understanding of the value proposition is that, from the clinician's perspective, you're able to detect conditions more accurately by using all of the different modalities of data that you're collecting. From the patient's perspective, you often get to see the indication of what your condition is, which gives you more trust in the condition that's being assigned to you. And from the payer's perspective, there's a sort of closed-loop documentation. Am I thinking about that right?
Exactly, exactly. If you think about it, as a dentist you see something in the X-ray image, you see something when you are examining the patient, and now you want to communicate with the patient about the treatment plan and what procedure you are going to apply to that specific tooth because of the indications that you, as a trained dentist with a trained eye, can see in the X-ray image. It is easier for the dentist to communicate with the patient and show them: okay, you see that small red thing from the Overjet AI? That is decay. And if I turn it off, that gray shade that you see on that tooth is the decay. If Overjet didn't color-code that part, it would be really hard for the non-trained eye to see it. The dentist saw it, but the patient didn't.
And it becomes a really hard conversation between the dentist and the patient: okay, that part is very small, sure, but it is going to be an issue. It is going to be a problem for you, so now we need to do the treatment. But having that quantification, having those color codes on top of the image, builds trust, so the patient realizes that, yeah, okay, that is serious, and I need to do this procedure.
Yeah. Awesome. So you've got billions of dental X-ray images and then data coming in in all of these modalities. And what you're doing in the middle is using machine learning to detect these conditions inside people's teeth and oral cavities that are very hard to see, especially for the untrained eye. Can you give us an overview of what the data looks like and the variety of sensors, imaging modalities, and acquisition systems that you're dealing with as you're building those models?
Oh yeah, sure. So even in one modality, there are so many variations. If you think about the different modalities in dentistry that we are actively analyzing, we have one main modality, which is dental X-ray images. Those are the X-ray images of the teeth. And even the dental X-ray image has different variations: you have bitewing images and you have PA images, periapical images. When you take the bitewing image, you don't see the root of the tooth. But when you take the PA image, you only focus on one part of the jaw, not two jaws at the same time, not the maxilla and mandible at the same time.
Or when you take the panoramic image, you are looking at the whole jaw at the same time, but now you have a lower resolution for each tooth. So you have higher resolution for each tooth when you take bitewing and PA images, but you have lower resolution and more context when you take a panoramic image. These are all different variations of the same modality, just X-rays. These are 2D images. Then we have CBCT, which is a 3D image. If you think about it, the jaw and the teeth in the jaw are all 3D objects, and these images are just a representation of that 3D object. In CBCT we take a 3D image, so now you have depth in that image. But then you have lower resolution and higher exposure, so not all clinicians take CBCT. So that's a different modality.
Okay, so when you're training a model to detect something like tooth decay across all of these different modalities and distributions, what are some of the hidden subgroups in the data that might undermine the performance of a model if you're not paying attention to them and accounting for them when you're building the model?
So that's a great question, specifically in the dental X-ray domain. One main difference is that when you take a chest X-ray, you have one 2D image that is a 2D representation of a 3D object, and that 3D object is the chest. That's one representation, one image, and your model is analyzing that one image, which is fine. In dental X-rays, you don't have just one object. When we take these images, you have at least 32 objects. For adult dentition, you have 32 teeth, or when you have a mixed dentition, it can go up to 40, because now you have the pediatric teeth, and underneath the pediatric teeth you have the erupting adult teeth. These are very weird-looking images.
But these are different teeth, and these different teeth have different representations in these X-ray images. Think about it. Your molar teeth are very thick, which means that when you take your X-ray image, the photons are not coming through the tooth, so they look more white, more opaque, in your image. But the anterior teeth are thinner compared to the molars, which means that you see more shade. And what was the definition of that shade? That's decay. Most of the time you say that when you see a shade of gray, that indicates decay, something is wrong, so let's train the model on that. But if you train your model only on the shades of the molar teeth, now every single shade on your anterior teeth is detected as decay. Why? Because they are thinner teeth. So if you don't introduce this variability, even between tooth numbers, in your evaluation, you are just evaluating the wrong thing. Just to give you an example: assume that I'm training a decay model and I'm getting an F1 of 80%. That's a very high number, a great number, but all your training and test data are coming from molar teeth. Is that a good model for anterior teeth? Probably not. We did these experiments just out of curiosity, to see what would happen if we did that, and you see that you have a very bad model. It has so many false positives on anterior teeth, because you only trained on a different distribution of data. So these are different dimensions, and when you train your model you want to make sure that you have representation of these subgroup dimensions. You have the sensor type, the image type, the acquisition type. These are metadata subgroups.
But inside the image itself, you have a different distribution of the tooth numbers, and you want to make sure that you have a representation of those in your training set. And not only that, you want to make sure that the prevalence in your data covers those distributions, which means that you want to have some anterior teeth in your training data that have decay. So you want to show these distributions to the model when you are training it.
That makes a lot of sense. And it makes me wonder: is there a pre-processing step where you're segmenting the different images and modalities into the teeth and assigning them numbers so that you know which tooth you're dealing with? Or is that kind of abstracted away in the modeling process, and you basically let the model learn? You want to make sure that you know which teeth are in the different images so that you can sample effectively as you're going through training and evaluation, but you let the model learn how to detect tooth decay across all of the teeth in the image without having to handhold it through the segmentation process?
So during the segmentation of the decay, for example, we do not provide the tooth number to the model. We just let the model learn the different representations of, for example, decay in different teeth. But during the sampling of the data for training, we are very well aware of the representation of these different tooth numbers in our training set and in our test set. We are very well aware of the image type distribution. We have millions of images, and we know that, say, 20% of them contain these anterior teeth and 20% of them contain these other teeth. You need to have this information because when you do the evaluation afterwards, you want to make sure that it follows reality. It follows the real distribution out there.
So you make sure that you train on the same distribution as the real world. So, to answer your question, on the sampling side we are very well aware of these things. But during training, we do not want to bias the model with this information. We just let the model learn.
Right. And my understanding is that it's a supervised learning problem. You already know the clinical diagnosis, or lack of clinical diagnosis, that's associated with each image. And so, without necessarily needing to tell it that this is where tooth number 14 is located in this image, it's able to learn from the quantity of images that tooth number 14 has decay happening roughly here.
Exactly.
I'm interested in how you think about sampling during the training and testing process in order to make sure that the model is getting that representative distribution, but also that you're maximizing the combination of precision and recall that you want to get out of the model.
When you do the training, you have subgroups of your training data, and you have subgroups of your targets. What am I going to detect? Am I going to detect large decay, or am I going to detect calculus in this image? These are different labels. So you want to make sure that you have a good and representative prevalence of each of those targets in each of your subgroups, in a way that lets you claim that, yes, this model can detect decay on this subgroup, I have already tested and evaluated that, and now I can provide you the number. Then I can sleep at night with peace of mind, knowing that my model is doing what I am claiming it does. That's a very, very important part, and even the regulators are asking us to provide this information. So that's the whole point.
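
One way to check the "representative prevalence of each target in each subgroup" described above is a simple prevalence audit before training. The sketch below is an assumption-laden illustration: the field names (`tooth_group`, `decay`), the toy counts, and the 1% threshold are hypothetical and are not from the episode.

```python
from collections import Counter

def prevalence_table(samples, subgroup_key, target_key):
    """Fraction of positive samples for target_key within each subgroup."""
    totals, positives = Counter(), Counter()
    for s in samples:
        totals[s[subgroup_key]] += 1
        positives[s[subgroup_key]] += int(bool(s[target_key]))
    return {group: positives[group] / totals[group] for group in totals}

def flag_gaps(table, min_prevalence=0.01):
    """Subgroups whose positive rate falls below the chosen floor."""
    return sorted(g for g, p in table.items() if p < min_prevalence)

# Hypothetical training index: molars have decay examples, anterior teeth have none.
samples = (
    [{"tooth_group": "molar", "decay": i % 4 == 0} for i in range(900)]
    + [{"tooth_group": "anterior", "decay": False} for _ in range(100)]
)

table = prevalence_table(samples, "tooth_group", "decay")
gaps = flag_gaps(table)

print(table)  # positive rate per subgroup
print(gaps)   # subgroups that need more positive examples before training
```

A subgroup that shows up in `gaps` has too few positive examples for the model to learn, or for an evaluation to measure, detection on that subgroup. The same audit can be repeated for every metadata key and every target label that a performance claim will cover.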
So, just to give you an example, let's focus on one metadata subgroup: different sensors. If you sample 1,000 clinics and check which sensor they are using for taking X-rays, that distribution is a heavy-tailed distribution. Most of the clinics are using a couple, three or four different sensors. 90% of them are using these sensors. But then you have the other 10% that are using very rare sensors, which are visually different. Perceptually, it's literally a different image when you look at it. Now, if you don't sample them correctly, if you just sample randomly from your clinics, then 90% of your data is from the well-represented sensors, the ones that are used very frequently in clinics. But then assume that you are not aware of the prevalence in your rare sensors, and all the images from your rare sensors are healthy. There is no decay in those images. So what does your model learn? Your model learns that, okay, for these sensors there is no way you can see decay, because I have never seen any decay in the samples for that rare case. I'm pretty sure any scientist who has trained a model knows that these models are looking for shortcuts. They are not doing what you want them to do. They are not looking at what you want them to see.
So you need to guide them to see what you want them to see, and they are really good at finding shortcuts. Now assume that all the images from my rare sensors are healthy. My model just doesn't look at the image at all. It just says, okay, this is coming from this sensor, so it's always healthy, because it's easier for the model to classify between sensors than to find that shade in the image. And then you put this model into production and the clinic comes back: okay, what's wrong with you? You have so many false negatives. You always think it's healthy. And you think, but I didn't do anything. Then you go back and look at your training data, and you realize, oh, now I see that this subgroup has a prevalence of zero for this abnormality. Now you understand: okay, I need to make sure that these subgroups have the correct prevalence. So that's a very handy explanation of why it's very important to be aware of the prevalence in your subgroups.
Right. So you have to be aware of the prevalence of the subgroups. I'm also interested in how that plays into the training and evaluation process. Obviously, when you talk about the subgroups of your input data set and the subgroups within your targets, you're talking about a massive degree of cardinality that is going to be really hard to cover. And so I'm imagining that you have to prioritize which intersections of subgroups you care most about, and also think about how you build into the loss function and into your evaluation how you teach the model to care about all of the subgroups that you care about, in that stratified manner that you're talking about.
That's true. That's true.
One thing that's really important, specifically on the evaluation side, is that you should not reduce the dimension of your metrics to only one number or two numbers. Okay, we are saying my precision on these images is 80%. What does that even mean? Sure, you are combining all these different subgroups into one number, but you have 80% precision on what? You have 80% precision aggregated over all the subgroups of the metadata, like different sensors, all the different abnormalities, and all the different teeth. Is that enough for you to decide that this is a good model? I don't think so. I have this discussion with my team several times a year: when you provide me only one number, I don't like that, because you are zooming out so much that you don't see any detail inside your model and you don't know the performance of the model. So you need to give me the information for each subgroup: what is the precision on this subgroup, what is the precision on that subgroup? And not only that, when you are training your model, you want the loss function to adapt and look at these subgroups. First, assume that I already have good, representative data in my training set. To be honest, that's the most important thing. You need to sample your data in a correct way, in a way that you understand the distribution and the prevalence. But even then, during training you want to have some sort of adaptive loss function that looks at the performance on these subgroups and adjusts a hyperparameter during training: yeah, okay, on this subgroup we are not doing great.
Let's push the model to look at that subgroup as well. But in the meantime, you don't want to lose the other subgroups that have more representation. You don't want to overfit on the smaller one. This adaptive training is the key to making sure that you are aware of these different representations and that you push the model to work well on all these different subgroups.
Yeah, that makes sense. So when the model output comes out, you're looking at the performance across each subgroup, but inside the modeling process as well you're encouraging the model to attend to the performance across each of the subgroups, basically updating given how each subgroup is performing, and understanding the benefit of an improvement in precision on one subgroup versus an improvement in recall on another subgroup.
Exactly, exactly. And not only that, sometimes, based on the application, you want to have, for example, better recall rather than better precision. If I had a triage application, I would want better recall.
Yeah.
I don't mind if I have some false positives. I don't want to call everything a positive, but I don't mind having more false positives in order to avoid false negatives, because I care about the recall. So when you look at the application, then you know which metric is more important for you.
Yeah, so for example, in the case of Overjet, you have expert clinicians reviewing the predictions that you're making. And so obviously you don't want to have a ton of false positives, but if there's the occasional false positive, you expect the clinician to be able to differentiate between that and what's actually there.
They can detect, they will read.
They always review our predictions, but also, if we can detect something that they didn't see, it's good to say, okay, take a look at this tooth. It might be nothing, but take a look at it. And they appreciate that. It's not an incidental finding, but this pointing to other teeth in the image is good, because now they can double-check.
Yeah, no, that makes sense. And also, from the dental practitioner's perspective, they can go into the patient's mouth and say, these are the places where I need to be looking, and I can be reasonably confident that these are all of the places that are most important for me to be looking while I'm in here. Before we move on, you mentioned that the most important thing in order to sample effectively is to have the most comprehensive data set you possibly can. And as I understand it, you're working with one of the largest data sets of dental imagery that is out there. But I can't imagine Overjet started with that data set in place. So you had to come up that learning curve to get to a place where you could solve problems with sampling that you might have had to solve with other methods earlier on. So I'm interested in understanding that journey. How did you go from working with a smaller data set and developing a clinically applicable model, or set of models, capable of being FDA approved to the place where you are now, where you've built up that cache of billions of images and you can start thinking about foundation models?
So what happened was that at the very beginning, we had different small models that were very specific to different small subgroups. We had a model at the very beginning that only worked for decay on bitewing images for one specific sensor. That's it. We could not cover everything. We didn't have the data. We didn't have the annotation power. When you have a primary model, you can do model-assisted labeling: the model provides the labels and the annotators correct them, so annotation is much faster. But at the very beginning, you don't have these things. So you need to make sure that your model is trained for a specific image type, a specific indication, and a specific sensor. These are very small models. And then we had so many of these models. We had more than 20 models that were each doing one of these small things. One was an expert for fillings on one image type. One was an expert for fillings on panoramic images. They were all different models. These are tiny models that are trained not on millions of images, but on thousands of images. They are doing decently. They are good. They are safe. They are working as expected, but they are very focused on one specific indication and one specific image type. And then you move forward two years, three years, and now you have millions of images, and you already have noisy labels for them, because these small models have already produced their annotations. You can see that there's an inlay in this image, because the inlay model is there for this image type. Now you have millions of images with noisy labels.
What you can do now, with a huge amount of data and a huge amount of noisy labels, is semi-supervised training of a large backbone that can look at every one of these things at the same time. So now we are moving toward a foundation model. What is that foundation model for us? We moved to it more than three years ago. In our internal discussions we call it Unity, because we literally unite all these models into one foundation model. The way we trained this model was that we said, okay, all these X-ray images, pano, bitewing, PA, different sensors, all those different distributions, they are all X-ray images. If you show an image from sensor one and an image from sensor two to a dentist, they can read both. It's not like, okay, I can read this image but I cannot read that one. They can read. So what we were thinking was, we need a foundation model that understands X-rays. We don't want the foundation model to be an expert on any indication. We just want it to understand the X-ray image and have the best latent space, which we can then use for different indications. That was the main goal for the foundation model, Unity. We trained that Unity model, it's a huge one, on these millions of images with noisy labels. And we figured out, oh, okay, this guy is understanding the X-ray. And the way we realized it was really funny. We said, okay, let's take the latent vector from this foundation model and use it to retrieve teeth that have the same issue.
For example, let's find the latent vector for a molar with huge decay, and now use that to retrieve similar teeth from other sensors, other patient populations, and even other image types. We extracted that latent vector from a bitewing, and then we ran it on PAs. And boom, it retrieves the teeth that have the same abnormality, the same representation, just by looking at these vectors. We realized that, okay, this foundation model is doing its job. It is understanding the X-ray. It knows what the hell is going on in the X-ray. And now we have different heads that are experts. The input to each head is that latent vector, and the output is trained on supervised data. You don't need so much data now, because the model has already normalized the image for you. But that head needs to be trained in a supervised manner, and you need to make sure you have perfect labels. So now we have the foundation model that runs on every single image that comes into our pipeline, and we have different heads: one head is an expert in fillings, one head is an expert in crowns, and so on. Couldn't we train all these heads at once? Yes, we can. But why didn't we? Because if you have independent heads for different indications, you can change one head without affecting the others by freezing your backbone. We have a frozen backbone, and we intentionally made the heads independent from each other. So if we change something in one head, we don't need to reevaluate all the other heads.
Yeah, that makes a lot of sense. And just hearing you talk about it, the journey is very fascinating. So I just want to recount the insights that I got from it.
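The retrieval check Sadeg describes, taking the latent vector of one tooth and looking up similar teeth in other image types, can be sketched roughly as a nearest-neighbor search. This is only an illustration: the three-dimensional embeddings, the image IDs, and the use of cosine similarity are all assumptions, since the episode does not say how Overjet compares vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, gallery, k=2):
    """Return the k gallery items whose embeddings are most similar to the query."""
    scored = [(cosine(query, item["embedding"]), item) for item in gallery]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]

# Hypothetical toy embeddings; a real backbone would emit far larger vectors.
gallery = [
    {"id": "pa_001", "image_type": "periapical", "embedding": [0.9, 0.1, 0.0]},
    {"id": "pa_002", "image_type": "periapical", "embedding": [0.0, 1.0, 0.1]},
    {"id": "pano_003", "image_type": "panoramic", "embedding": [0.8, 0.2, 0.1]},
]

# Stand-in for the latent vector of a bitewing molar with large decay.
query = [1.0, 0.0, 0.0]

top_ids = [item["id"] for item in retrieve(query, gallery, k=2)]
print(top_ids)
```

If the backbone really encodes the abnormality rather than the sensor or image type, the nearest neighbors should share the finding even when they come from a different modality, which is the behavior described in the conversation.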
One of the things I came away with is that when you're in an organization that is smaller and closer to a startup environment, you're operating under resource and data constraints, and you're focused on the 80-20 of the problem. You're trying to solve the problem as well as you can for the dental consortiums you're actually working with, using the devices they're actually using. You're targeting maybe only the sensor types used by 90% of the dental industry rather than the long tail of imagery that's out there. You also don't have examples of all the different disease types you might want to detect. So you train the models you can train with the data you do have, and you build out the feature set of Overjet for the dentists you're serving as you collect more data and grow the organization. Then at a certain point, you've got these 20 models and a sufficient amount of data. You can start thinking about a foundation model, and you use those 20 models to help self-supervise or semi-supervise the foundation model in order to create that backbone, that latent space. Okay, so you've got this foundation model, and it embeds a single image. The foundation model is learning how to evaluate these images for the full variety of dental issues that might arise. You want that backbone to be an expert at reading these images. And then you decide to have multiple prediction heads that sit on top of it, trained independently off of that backbone, so that the input space looks the same for every head, but the way the input is interpreted is different for each prediction head.
And the reason you do that rather than having one model that predicts all of them is, I'm imagining, that there might be trade-offs in the quality of the prediction heads. If you tried to train them all at once, they wouldn't all get better. Some would get better and some would get worse. And what you want is the best prediction head for each one. Am I thinking about that right?
Exactly, exactly. And also, we are in a regulated space. When we have the model evaluated and we say, okay, this is the performance of this model, that's it. You need to lock that model. We have the PCCP, which we can talk about later. But in the regulated space, from the FDA's point of view, if you have trained your model and you got clearance for it, you have to keep that model locked. You cannot change even one parameter of that model.
Yeah.
So assume that I have three different clearances for three different indications. And now I realize that one of those indications has some bias, and I want to retrain that one. If I combine all of them in one head, then when I retrain, I need to make sure that the performance of the other two indications is not affected. And I also need to get another clearance for all three, not just the one that has the bias. So that intentional separation between the indications was a very smart thing that we did. Now that I look at it in retrospect, it meant we didn't tie our hands by combining all of them together.
Yeah, no, that's really interesting. And it shows me the non-obvious consequences of the way the FDA regulates models, and how that influences the architecture these models take on.
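The frozen-backbone, independent-heads idea discussed above can be illustrated with a toy example. Everything here is a stand-in chosen for the sketch: the backbone is a fixed linear map with a tanh, the heads are tiny logistic regressions, and the weights and training examples are made up. It is not Overjet's architecture, only a demonstration of why retraining one head leaves the others untouched.

```python
import copy
import math

# Hypothetical frozen backbone weights: 3 input features -> 2 latent dimensions.
FROZEN_W = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]

def embed(x):
    """Frozen backbone: a fixed map that is never updated during head training."""
    return [math.tanh(sum(xi * FROZEN_W[i][j] for i, xi in enumerate(x)))
            for j in range(2)]

def predict(head, z):
    """One prediction head: logistic regression on the latent vector."""
    logit = sum(w * zi for w, zi in zip(head["w"], z)) + head["b"]
    return 1.0 / (1.0 + math.exp(-logit))

def retrain_head(heads, name, examples, lr=0.5, epochs=50):
    """Return a copy of `heads` in which only `name` was updated by SGD."""
    new_heads = copy.deepcopy(heads)
    head = new_heads[name]
    for _ in range(epochs):
        for x, y in examples:
            z = embed(x)                      # backbone output, no gradient flows into it
            err = predict(head, z) - y        # gradient of log loss w.r.t. the logit
            head["w"] = [w - lr * err * zi for w, zi in zip(head["w"], z)]
            head["b"] -= lr * err
    return new_heads

heads = {
    "decay": {"w": [0.0, 0.0], "b": 0.0},
    "filling": {"w": [0.3, -0.1], "b": 0.05},
    "crown": {"w": [-0.2, 0.4], "b": -0.1},
}

# Made-up labeled examples for the one indication being retrained.
examples = [([1.0, 0.0, 0.0], 1), ([0.0, 1.0, 0.0], 0)]
updated = retrain_head(heads, "decay", examples)

probe = [0.2, 0.4, 0.6]
same_filling = predict(updated["filling"], embed(probe)) == predict(heads["filling"], embed(probe))
print(same_filling, updated["decay"] != heads["decay"])
```

Because the backbone is fixed and each head has its own parameters, the filling and crown outputs are bit-for-bit identical before and after the decay head is retrained, which is the property that lets one indication be updated without re-evaluating the rest.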
I want to get to the predetermined change control plan and how you navigate that. But before we do: my understanding is that you're doing some unsupervised image quality enhancement to normalize across these different sensor distributions. Can you explain what that problem looks like, why you did that, and how you approached it?
Yeah, that was also a very, very interesting one. What happens in the dental industry is that you have different sensors, and most of the time each sensor has its own acquisition software. And then you have the visualization software, the PMS. The dentist looks at the image, diagnoses, and proposes a treatment. Each acquisition software has its own post-processing on top of the raw data to make the image a bit sharper, less noisy, all that post-processing after image acquisition. But this post-processing is designed for one sensor, one image acquisition. That's it. That is that company's expertise. Now at Overjet, we have access to the raw data, but most of the time we don't have access to the post-processed data. We want to make sure that we have a high-quality image on top of that raw data for all of these different sensors and acquisitions. But the problem is that you don't have access to the good-quality image. You have the raw image, and you cannot even annotate those images. How would you annotate image quality? Someone would need to paint the clean image. You cannot annotate that.
The way we trained this unsupervised model, it's an internal architecture. We haven't released this architecture, we haven't published it. It's a good one. The idea behind this image enhancement was that we train the model, in an unsupervised manner, to learn artifacts rather than image content. The model learns what the artifact is: what the distribution of noise is, what the distribution of artifacts is that you will observe on these different data sets, what the distribution of contrast artifacts is. It learns those artifacts, and it's a residual model. It removes the artifact from the raw image. It does not output the processed image. It outputs the artifact image, and then we subtract that artifact from the raw image.
So, Overjet operates under FDA regulation, which means, as you mentioned earlier, you can't just retrain a model and push it to production. Can you explain the PCCP and how it changes the way you have to do business in terms of updating AI models?
So what happened in the regulated space, in the medical imaging domain, is that before the deep learning era, 2017, 2018, those days, people had their traditional image processing algorithms. Most of them were not learning based. You do all those things and you get the clearance, and you are not going to change the whole pipeline in the next 10 years. But after deep learning came, people realized that, okay, I train my model, I get the clearance, but now I can train a better model in three months. Do I need to go through this whole huge evaluation, send it to the regulators, and wait for their review all over again? It doesn't make sense. And to be fair, I think the FDA is also following the tech, and they changed the rules in a way that makes sense.
So in 2020 or 2021, if I'm not mistaken, the FDA introduced this predetermined change control plan, which you add to your filing. You are saying, okay, this is my filing, this is my evaluation of this model, and I also have these predetermined changes that I am going to apply. For example, I am going to train this model with 10% more data, but I'm not going to change the intended use of this device. I'm not going to change the input of this device, which means that if this device is cleared for bitewing images, I'm not going to change it to work on pano. But I'm going to retrain this model with this extra data. I'm going to do this comprehensive evaluation. I'm going to make sure that it passes all these predefined thresholds. I'm going to test it on all these predefined subgroups. And if I pass everything, I don't need to file for another clearance to put this new model into my product. I keep all the documentation on the shelf. You can audit me, you can come and ask me what those changes are, but I don't need a new filing. You send this plan to the FDA during the filing, and the FDA says, yeah, okay, good, I agree with your predetermined change plan, you can follow this. Now you don't need to send every single change to the FDA again. At Overjet, we are one of the very first companies that got clearance with this PCCP. And that makes our lives much easier now for the indications where we have a PCCP.
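The "pass all predefined thresholds on all predefined subgroups" step can be pictured as a release gate. The sketch below is only an assumption about what such a gate might look like: the subgroup names, the confusion counts, the 0.8 thresholds, and the choice of F1 as the metric are all invented for illustration and are not taken from any actual PCCP.

```python
def f1(tp, fp, fn):
    """F1 score from confusion counts; 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def pccp_gate(subgroup_counts, thresholds):
    """Return (passed, failures). Every predefined subgroup must meet its threshold."""
    failures = {}
    for group, min_f1 in thresholds.items():
        counts = subgroup_counts.get(group)
        score = f1(**counts) if counts else 0.0  # a missing subgroup counts as a failure
        if score < min_f1:
            failures[group] = round(score, 3)
    return (not failures), failures

# Hypothetical evaluation results for a retrained model.
subgroup_counts = {
    "molar": {"tp": 90, "fp": 10, "fn": 10},
    "anterior": {"tp": 20, "fp": 25, "fn": 5},
    "rare_sensor": {"tp": 8, "fp": 1, "fn": 1},
}
thresholds = {"molar": 0.8, "anterior": 0.8, "rare_sensor": 0.8}

# Pool all counts to get the single aggregate number.
aggregate = f1(
    tp=sum(c["tp"] for c in subgroup_counts.values()),
    fp=sum(c["fp"] for c in subgroup_counts.values()),
    fn=sum(c["fn"] for c in subgroup_counts.values()),
)
passed, failures = pccp_gate(subgroup_counts, thresholds)
print(round(aggregate, 3), passed, failures)
```

With these made-up counts the pooled F1 clears 0.8 while the anterior subgroup does not, so the gate blocks the release. That is the same point made elsewhere in the episode about aggregate metrics hiding a failing subgroup.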
We just retrain the model, we follow the comprehensive evaluation protocols that we have, and if we pass all of those things, we just put it in production. We don't need to go through the regulatory clearance cycle again.
Yeah, no, that's really interesting. It makes a lot of sense both from your perspective and from the FDA's perspective. Since you were one of the first: my understanding is that you have a staged deployment process, where you start with a shadow model that does inference but isn't actually serving predictions in production, then a canary that serves only a small percentage of traffic, then an A/B testing phase, and then full production. Is that how it works from your perspective? And if so, I'm interested in how you monitor and connect to what's happening in production.
Monitoring is really interesting for us because we are very lucky. We have some sort of label in production. It's a noisy label, but let me tell you what I mean. What happens in dentistry is that the patient goes to the office. First, they take the X-ray image, whether there's a procedure or not. Then the doctor examines the teeth and provides the treatment plan. They say, okay, I'm going to put a filling on your tooth number two. The patient accepts that treatment plan, and they do the procedure. Now, that treatment plan has different procedure codes, and we have access to that treatment plan. That's a very important part of this puzzle. And this treatment plan happens after the image acquisition.
So you have the image and you have the treatment plan. What does that mean? If there is a treatment plan with a procedure code for a filling on the mesial side of tooth number two, then there was something on the mesial side of tooth number two, and most likely it is a cavity, because fillings are done for cavities. So if our model detects that cavity, we can say it is a true positive: we detected it, and the doctor applied that treatment. Why am I saying it is noisy? Assume we detect a cavity on the mesial side of tooth number two, but there is no treatment plan for that tooth. Why is there no treatment plan? Because sometimes there is a cavity, but it is too small and they don't want to do a filling. They say, okay, you can do a fluoride treatment and you're good. Or sometimes we don't detect anything and they do the treatment. Is it a false negative? Not necessarily, because sometimes that decay is not visible in the X-ray. They do the examination and see that the decay is on the occlusal side of the tooth, and if the decay is on the occlusal side of the tooth, you cannot see it in the X-ray. So I say these are noisy labels, but you can sample these cases and get a feeling for whether your model is doing well or not, because you have some sort of information afterwards. That's one of our main monitoring systems.
Yeah, no, that's really interesting. And if you're in that shadow or canary phase, you get the quote-unquote ground truth label from the clinic, and you also get the predictions of both models.
But then there's also the feedback loop between the model you have in production and the clinic itself. If the production model didn't predict the tooth decay, the clinic might not notice it. But maybe your new model is so much better that it does notice the decay, and if that model had been in production, there would have been a treatment plan in the first place. So it's certainly not as easy as it may seem on the surface.
Exactly.
But as long as you're being rigorous about your evaluations, and you've had to create that plan up front with the FDA, you can be reasonably confident that the model you're rolling out is as good as or better than the one you had in place.
Exactly.
Well, Sadeg, it's been awesome having you on the podcast. I really appreciate it, and I look forward to connecting down the line.
Yeah. Thank you for having me. It was fun. I really enjoyed it.
Sadeg, thank you.
That was Sadeg Salehi, Director of Research and Principal Scientist at Overjet. So here's what I'm taking away from this one. The subgroup problem is real and it's sneaky. You get an F1 of 80% and you think you've got a strong model. But Sadeg made the point that 80% is a meaningless number if you don't know what it's 80% across. He walked through an example where they trained a decay model mostly on molars. These are thick, dense teeth, and the decay shows up as a subtle gray shade on the X-ray. Then they ran it on anterior teeth, thinner teeth that naturally show more gray. The model flagged everything: huge false positive rates on a whole category of teeth. And the aggregate metric didn't surface that. It was buried. That's why Sadeg asks his team to provide not just one number to evaluate models, but a variety of numbers that do that cross-sectional analysis. And then there's the shortcut thing, which I thought was one of the sharpest moments in the conversation.
If all the images from a rare sensor in your training data happen to be healthy, the model doesn't learn how to read images from that sensor. It just learns that the sensor means healthy. It's not even looking at the teeth anymore. It found an easier pattern. And if you're evaluating aggregate metrics, you'll never catch that. Sadeg's point, and I think this is the thesis of the whole episode, is that the same flawed sampling that creates gaps in your training data also creates them in your test data. So if your evaluation isn't catching your blind spots, then it's confirming them. The other thread I keep thinking about is the architecture story. They started with 20-something narrow models, each one trained on thousands of images for one condition, one image type, one sensor. As the data grew to millions, they used those small models to generate noisy labels, and then they used those noisy labels as the training signal for a much larger foundation model that they call Unity. Then they built independent prediction heads on top, one per condition, specifically so that they could retrain one head without having to reclear every other head with the FDA. That's not just a modeling decision, that's a regulatory decision made at the architectural level. I thought that was really smart and interesting. And then there's the monitoring piece. Overjet has access to treatment plan codes, the CDT codes, that get assigned after a clinician examines a patient and actually performs a procedure. So if a filling goes on a tooth where the model didn't flag decay, that's a signal. It's noisy. Maybe the decay was too small to fill. Maybe it was on a surface X-rays can't see. But it's a real feedback loop between what the model predicted and what the dentist actually did. Most teams in production ML don't have the benefit of something like that.
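As a rough picture of that monitoring signal, the comparison between model findings and treatment plans can be tallied per tooth surface. This is a hedged sketch only: the visit records, the (tooth, surface) representation, and the three bucket names are assumptions, and the step of mapping real procedure codes to tooth surfaces is deliberately left out.

```python
def tally(visits):
    """Count agreement between model findings and treated surfaces across visits.

    Each visit holds two sets of (tooth_number, surface) pairs:
    "model" for predicted decay and "treated" for surfaces that received a filling.
    """
    totals = {"agree": 0, "model_only": 0, "treatment_only": 0}
    for visit in visits:
        model, treated = visit["model"], visit["treated"]
        totals["agree"] += len(model & treated)           # likely true positives
        totals["model_only"] += len(model - treated)      # false positive, or lesion too small to fill
        totals["treatment_only"] += len(treated - model)  # miss, or decay not visible on the X-ray
    return totals

# Hypothetical visit records.
visits = [
    {"model": {(2, "mesial")}, "treated": {(2, "mesial")}},
    {"model": {(14, "distal")}, "treated": set()},
    {"model": set(), "treated": {(30, "occlusal")}},
    {"model": {(3, "mesial"), (4, "distal")}, "treated": {(3, "mesial")}},
]

totals = tally(visits)
agreement_rate = totals["agree"] / sum(totals.values())
print(totals, agreement_rate)
```

Because each bucket mixes real errors with clinically reasonable disagreements, a number like `agreement_rate` is more useful as a trend to watch, or as a way to pick cases for manual review, than as a precise accuracy estimate.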
Whether you're in dental imaging or building models in a completely different domain, I think the core points hold. If you're not stratifying your evaluation by the dimensions that actually matter when your model hits production, then your numbers are telling you a story that might not be the right one. Sadeg's on LinkedIn. The link's in the show notes. If you want to see what Overjet is building, go to overjet.com. And if you got something out of this, please share it with someone who's building models against messy data. They'll thank you. I'm Ross Katz. Thanks for listening.