Universal Medical Intelligence: OpenAI's Plan to Elevate Human Health, with Karan Singhal
Karan Singhal, OpenAI's Head of Health AI, discusses how ChatGPT has achieved attending physician-level performance through collaboration with 250+ doctors and rigorous evaluation. The conversation covers OpenAI's plan to make ChatGPT Health free globally, their privacy-preserving approach, and the vision for AI to raise both the floor and ceiling of human health outcomes.
- AI models now perform at attending physician level in many medical scenarios, often outperforming residents and sometimes identifying issues human doctors miss
- OpenAI's approach prioritizes user trust over data collection - ChatGPT Health won't train on user data and will be free without ads globally
- The medical establishment's adoption of AI is accelerating, with 230 million people weekly using ChatGPT for health questions and growing physician acceptance
- Scalable oversight in healthcare provides concrete grounding for broader AI safety research, as models already outperform humans in narrow medical domains
- The integration of multimodal health data (wearables, medical records, imaging) will enable more personalized and predictive healthcare interventions
"Our mission is ensure AGI is beneficial for all of humanity. And there's kind of three parts to that. One is build and deploy AGI. The second is prevent downside risks, whether that's kind of short term kinds of risks or long term kinds of frontier risks. And finally thinking about how we can make benefits happen."
"Over 230 million people a week are using ChatGPT for various health and wellness related queries."
"When healthbench came out, GPT 4.0 was literally zero on this benchmark. So it's an incredibly hard benchmark because this is just how we chose the examples. And over time we've kind of improved to performance of around 40%."
"I think 2026 will be the year that using AI becomes a standard part of medical practice."
"ChatGPT Health available to all users globally for free with no ads, an early form of universal basic intelligence that I really think everyone ought to celebrate as a triumph of human ingenuity and goodwill."
Hello and welcome back to the Cognitive Revolution. Today's episode is brought to you in part by Granola. To help new users experience the power of the Granola platform, Granola is featuring AI recipes from AI thought leaders, including several past guests of this show. There's a Replit recipe that converts discussion notes to an application build plan, a Ben Tossell recipe that creates content production plans, and a Dan Shipper recipe that looks across multiple sessions to identify cultural trends at your company. My own recipe is a Blind Spot Finder: it looks back at recent conversations and attempts to identify things that I might be missing. This has already proven useful in the context of contingency planning for my son's cancer treatment, and as I use it more and more, it's getting better and better at suggesting AI topic areas that I've neglected and really ought to explore. See the link in our show notes to try my Blind Spot Finder recipe and experience how Granola makes your meeting notes awesome. Now, today my guest is Karan Singhal, who leads Health AI at OpenAI and who was just named to the TIME100 Health list for his pioneering work. This episode began to come together last year on Thanksgiving, when I emailed Karan, who I'd met a couple times at AI events, to say thank you for all of his work on AI for health and to let him know what a difference ChatGPT had made for me and my family in the context of my son's cancer diagnosis. As it turned out, that was just as OpenAI was preparing to make a major product push with ChatGPT Health, which allows users to connect ChatGPT to data sources including electronic medical record systems and consumer wearables, plus a physician-facing ChatGPT for Healthcare, both launching in early 2026. In this episode we dig into how Karan and team have achieved attending-physician-level performance with their latest models, their plan to ensure that this capability does benefit all of humanity, and their vision to raise not just the floor but also the ceiling of human health with continued research and even better models to come. Highlights of this conversation include: how OpenAI works with more than 250 human doctors to ensure accurate, robust, and culturally appropriate responses; how they built HealthBench, which contains some 49,000 evaluation criteria to measure models' performance; how models have already gone from a 0% score on HealthBench Hard for GPT-4o when the benchmark was first created to 40%-plus today; plus an overview of my experience using large language models to navigate a health emergency, including the critical importance of giving models as much context as possible on your situation, and how that's about to get dramatically easier as ChatGPT Health rolls out globally. We also discuss how 230 million people are already using ChatGPT for health questions on a weekly basis, and the first randomized trial of AI copilots for physicians, which OpenAI conducted with Kenya's Penda Health system and which did show a statistically significant improvement in outcomes for patients whose doctors used AI. And Karan believes, based on the reception that OpenAI is getting from health systems, that 2026 will be the year that using AI becomes a standard part of medical practice.
From there we go on to cover: the steps that OpenAI is taking to ensure privacy and security of users' health information; how they're using worst-of-n measures to make sure models first do no harm, while at the same time striving to maximize value by training AIs to acknowledge their uncertainty as they offer their best guesses; how Karan understands the relationship between AI for health, AI safety plans such as scalable oversight, and AI alignment more broadly; Karan's report that OpenAI's models' chain-of-thought reasoning has not drifted toward neuralese as much as some reports had previously caused me to believe; the future of medical multimodality, which will do a much better job of converting data to value, and which inspired me to buy a Whoop wristband to start collecting data on myself; the compounding effect of parallel advances in AI for science; the growing potential for n-of-1 treatment plans and medical Move 37s; and the possible need for an update to the rules governing access to experimental medicines and information sharing. Finally, Karan describes OpenAI's utopian plan to make ChatGPT Health available to all users globally for free with no ads, an early form of universal basic intelligence that I really think everyone ought to celebrate as a triumph of human ingenuity and goodwill. Zooming out, in the grand scheme of AI development, I think it is fair to say that we have far more questions than answers, and in my mind all outcomes, from a post-scarcity utopia to literal human extinction, absolutely remain on the table. I signed the recent call for a ban on superintelligence because I do worry that an AI arms race driven by recursive self-improvement loops could easily get out of control. And yet at the same time, capabilities like this, which have been so valuable for me and my family and which will undoubtedly save millions of lives in the coming years, are for me both an incredibly inspiring accomplishment and a practically irrefutable argument for the upside of AI. The question at this point is not whether we will create powerful AI systems, but exactly what form they will take and under what circumstances and incentives they'll be developed and deployed. Karan's work demonstrates that, for the moment at least, we can have it all: AI systems meticulously crafted to minimize downside risk, which are both capable and efficient enough to meaningfully improve the human condition globally. There is a ton of work left to be done, both inside and outside of the frontier companies, to make sure that these lofty standards don't slip in the face of intensifying competition. But today, if you or a loved one are facing a complex health challenge, you owe it to yourself to take full advantage of the incredible medical expertise that Karan and others have managed to build into systems like ChatGPT Health. With that, I hope you enjoy this inspiring look at the frontier of medical AI with OpenAI's head of health, Karan Singhal.
0:00
Karan Singhal, head of Health AI at OpenAI, welcome to the Cognitive Revolution.
6:11
Thanks for having me.
6:17
I'm super excited about this. It's rare that a guest's work has had as much impact on my life as yours has: your work on health at OpenAI, and at Google before that, has shaped the last three months for me. So regular listeners know the story. My son got cancer, and I've been an intensive user of all the frontier language models over the last three months to advise us as we've gone through this process, and boy, have they been a game changer. Thank you for all your hard work, and for making the last three months, the mental health side of the equation for me, dramatically better than they otherwise would have been, and also for moving the needle a bit on my son's treatment and our confidence that we were actually doing the right thing for him. It has been invaluable, and the consumer surplus has been off the charts.
6:18
Amazing. Thanks for the kind intro and thank you for sharing the story. I think it resonated with a lot of people, so thank you for that.
7:13
A big takeaway, I think, of this conversation will be: if you find yourself in a medical emergency, or even just want to do a better job of managing your health in general, the frontier models today are getting really good at that. I always go to the example of the AI doctor. There's obviously a relative scarcity of medical expertise, even in a wealthy country for a privileged person like myself in the United States; you broaden your worldview and look around the world, and the shortage is extreme. And I've always kind of felt like, boy, this would be just an absolutely killer use case that everyone could agree on. When I started talking about it a couple years ago, it felt like it was getting close, but still a ways off. And I've done a bunch of episodes over time with Vivek and some of your former teammates at Google DeepMind who work on similar topics. And they're always very appropriately cautious, or have been over time, about, yeah, it's not quite there yet, we're not quite ready to roll it out to production, but encouraged by the progress, that sort of thing. But boy, again, it is really getting there. But the first question I wanted to ask is just: how did you get into this, and what were you expecting? How big of a dream were you daring to dream when you first got into AI for healthcare some years ago now?
8:27
Yeah, I started working on AI for healthcare about four years ago now and kind of transitioned to doing it full time. Around that time, I was thinking a lot about a few fundamental research problems. When I was at Google, a few of those problems were around foundational work in representation learning and privacy-preserving learning, and there are a lot of interesting applications of those in healthcare. But I think the background for this was a conviction that I'd had since undergrad, honestly, that AGI would be a pretty big deal and that it probably would happen within our lifetimes. And I thought there were probably two things that I could do to make that go better. One is to work on safety and the other is to work on benefits. And I saw healthcare as kind of the most obvious area for benefit. And like you said, we've been on this kind of amazing exponential over time, both with model capabilities and, I think more recently, with people's adoption and the Overton window shift around trust in these models. And so we're seeing a bunch of people start to use these models, across individual patients, individual clinicians, researchers, and seeing a lot of the benefits become a lot more tangible. And all this comes down, for us at OpenAI, to thinking about what it means to make our mission real. Our mission is to ensure AGI is beneficial for all of humanity. And there's kind of three parts to that. One is build and deploy AGI. The second is prevent downside risks, whether that's short-term kinds of risks or long-term kinds of frontier risks. And finally, thinking about how we can make benefits happen. And like you say, I think health is one of the most obvious and tangible ways those benefits can happen. When I started working on health, the end of 2022 was when it became full time for me, so right before ChatGPT, and for me it was a lot of thinking around this capability overhang: between where LLMs were at, where scaling these models up and instruction-tuning them let them do amazing things, and how they were being adopted and thought about in the clinical AI and healthcare worlds. And so my ambition at that time was really two things. One was to get the healthcare and clinical AI world to think about LLMs as a thing that could work, and again, this was prior to ChatGPT. And the second was to think about the work that we would need to do in safety and reliability to make the models trustworthy for the setting. And then over time, I think those ambitions became less and less ambitious and my ambitions at OpenAI became larger. For our work at OpenAI, we started out with three goals. The first is to make access to medical expertise more universal. The second was thinking about the ways in which this setting can ground our work in safety and alignment; I can talk more about the safety motivation there. The third was thinking about how we can bring society along with a high-stakes technology: work with partnerships, roll out products, work with policymakers to think about the right ways to iteratively deploy in a setting like this. Things like this sounded really ambitious about two years ago when we set out to work on health at OpenAI. Now we're actually feeling like these aren't ambitious enough. So we're really excited about what's to come.
8:34
I think you did a great job there of laying out a taxonomy, a sort of scaffold, for this conversation. So let's maybe talk about the capabilities first, though they are sort of inseparable. I'm wondering, from a capabilities perspective, is there also a Hippocratic oath kind of mindset that you bring to the table, one that is focused on making sure that the AI performs well in medicine, and that maybe is still distinct from the bigger-picture safety agenda that motivates you? How do you guys think about the way that the model should perform in terms of doing no harm? Because I think GPT-4, while it did add value, also could definitely do some harm in terms of giving you wrong ideas. And I do think we've come a long way since then, but I wonder how you think about that.
11:46
100%. The way we think about the health work that we've been doing at OpenAI is that we've been operating in three phases. One is laying the foundation for the work. A lot of that has been around the safety research and work on making the models not just better reasoners, but also perform better and have better bedside manner, things like making sure that they're able to convey uncertainty well, and that they're able to escalate to a doctor when needed. I can talk a little bit about that kind of work in a second. We're now in this phase of adoption. As the foundations have solidified, a bunch of people have been using it, and it's been one of our fastest growing use cases. We shared recently that over 230 million people a week are using it for various health and wellness related queries. And this year we're really focused on scaling the impact of the work. All of that comes from the foundation. So to your question of how we think about imbuing the models with the right kind of prior for how to behave, and how to ensure that there's minimal harm: I think we were very thoughtful about this as we were laying the foundation for this work and starting to work on health at OpenAI. A lot of this comes down to our really close partnership with this cohort of 260 physicians or so that we've been working with for about two years now. One way of imbuing models with a certain kind of behavior is to write a spec from first principles: maybe you or I just sit down and write a spec and say this is how models should behave, they should say X, Y and Z in this situation and this in another situation. There are pros and cons to this. One pro is that it's very easy to explain and understand what's going on, and it's easier to be transparent about model behavior. A con, though, is that it's very hard for you or me to say what should or should not happen, and it's difficult to say anything beyond 'do no harm,' because while that's an excellent thing to aim for, it doesn't tell you what to do in most scenarios. And so what you want to do is figure out a way to move from really large-scale principles that matter to what you do in very specific scenarios, and make sure that's guided by the expertise of not just one or two people, but actually hundreds of experts. That's the approach that we've taken, and you can see it in our approach to evaluation. One way to think about how we encode model behavior is: what are the evals that we care most about? Evals are really a lifeblood for any researcher. So we put out, back in May of 2025, this work on HealthBench, which is an evaluation of how large language models perform in realistic health conversations between users and models, where users could be either lay users or health professionals. And we went about it in a way that leaned on the expertise of these 250-plus physicians rather than writing down specs from first principles. I can explain that more, but I'm sure you have many more questions as well.
12:37
Yeah, I mean, go on as long as you'd like. I guess to double-click on the physicians for a second: what's the nature of the relationship with them? I could imagine anything from, you know, them sitting there doing side-by-side, RLHF-style 'prefer this to that' comparisons, to some of them being much more deeply integrated with the research team.
15:38
Yeah, we have kind of like three different layers in which we think about physician expertise, and we were very thoughtful about how to bring physician expertise into our team in a way that balanced out and combined well with the research expertise that we have on our team. So we have three layers. The first layer is high-level advisors, who are actually more informal or help us with strategy in various ways; we share our roadmap with them and things like this. The second layer is folks who we work with in, you can think about it as, kind of like a human data operation in the way that you're describing, but a little bit more closely. We're Slacking with them all the time and we work with them closely; they're not just going off and doing tasks, and we ask them for advice and questions all the time. So we basically have a Slack community where we're interacting with these folks, and some part of what they're doing is comparing model outputs or red teaming model outputs or things like this. Some part of it is looking for ways in which we might have blind spots today and listing those kinds of ways so that we can prioritize them in the future, testing new products, things like this. So as an example, for the work on ChatGPT for Healthcare, which we announced on January 8th, we had red teamers test this product over nine waves for six months, and this was in close collaboration with this physician community. The final, top layer of the pyramid of how we rely on physician expertise is really close advisors who work most closely with our team. They're actually the ones who are working on channeling and combining the voice of these hundreds of physicians, interfacing most closely with our research team, translating that into evals and model training data and things like this, so that we can then improve our models.
16:04
Hey, we'll continue our interview in a moment after a word from our sponsors. One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal, private benchmarks: challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple years, saving me countless hours. But as you've probably heard, Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. And the result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions based on those. I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot. I do rewrite Claude's drafts, not because they're bad, but because it's important to me to be able to fully stand behind everything I publish. But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information-gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's claude.ai/tcr.

Your IT team wastes half their day on repetitive tickets: password resets, access requests, onboarding, all pulling them away from meaningful work. With Serval, you can cut help desk tickets by more than 50%. While legacy players are bolting AI onto decades-old systems, Serval allows your IT team to describe what they need in plain English and then writes automations in seconds. As someone who does AI consulting for a number of different companies, I've seen firsthand how painful and costly manual provisioning can be. It often takes a week or more before I can start actual work. If only the companies I work with were using Serval, I'd be productive from day one. Serval powers the fastest growing companies in the world like Perplexity, Verkada, Mercor and Clay. And Serval guarantees 50% help desk automation by week four of your free pilot. So get your team out of the help desk and back to the work they enjoy. Book your free pilot at serval.com/cognitive. That's S-E-R-V-A-L.com/cognitive.
19:43
When you talk about channeling the voice of the physician, and calling back to the Hippocratic oath: one thing that I do find kind of frustrating, honestly, about my experience in the medical system recently is that I think there's a little bit too much emphasis on 'do no harm.' And this also connects in a pretty deep way to questions about how we should even conceptualize how to talk to our AIs about what they should do, right? I mean, we have the very detailed, Talmudic rule set as one extreme possibility, where for every corner case we try to map out what you should and shouldn't do, and hopefully the model learns that and lives by it. And then at the other extreme end, there's the Anthropic constitution, the Claude constitution, which recently has at least demonstrated that you can get pretty far with something that is less rule-based and more about trying to teach the model to have good character and use good judgment as it goes through all the situations that it finds itself in. I would say my critique of the human doctors that I've engaged with, who have generally served us really well, is that they definitely want to do no harm, to the point where they're sometimes too reluctant to engage in a hypothetical or to act on something. I do have one good friend who's a doctor, who also happens to have had the same or a very similar kind of cancer to my son. And he's notably behaved differently with me in private, one on one; he's like, look, I don't need a randomized controlled trial to tell you that that makes sense. But you don't get that too much when you're actually at the hospital. There's a reluctance among physicians, certainly in my experience, and I think it's a pretty commonly shared perception, to act on things that make sense, because acting isn't guaranteed to work out. Obviously, biology is super messy, there's a ton of diversity, and there's just a lot that we don't know. So how do you think about that kind of challenge? I kind of wonder where you guys want to land in terms of, you know, only adhering to the most rigorously defensible advice versus doing a little bit more of the Amanda Askell thing and being like, well, you've got to be willing to take some risk to help people sometimes. So maybe we should have the models do that.
21:07
Yeah, it's a great question. You're pointing out problems, I think, on two sides of the healthcare ecosystem. One is that as a patient, you have this challenge of needing to advocate, potentially pretty hard, for yourself. In your son's story, you pointed out a couple of false starts that you had, where you saw a doctor and they said it was probably normal; you had an abnormal blood test reading a couple of times, and they were just kind of like, yeah, it's probably okay. Patients often have this experience, when they're having something that they feel or know to be an issue, of needing to advocate for themselves and feeling like their doctor isn't hearing them. On the clinician-facing side, you have this challenge that medical evidence is increasing rapidly, it's very hard to keep up with the latest of what's going on, you're overloaded with documentation, and there's a bunch of burden there. Doctors are human too. So there are a bunch of challenges on both sides. And you have this amazing thing, which is AI that is able to do a few things. You have AI that's able to talk to patients, understand the concerns that they have, and integrate knowledge and information across both their previous history and the latest medical evidence. One of the things that these models are obviously very incredible at these days is taking in a huge amount of health context, not just on you but also on the latest medical evidence and things like this, integrating that all together into one context, and doing something that I think is very difficult for a human to do. And then on the physician-facing side, you have again that same capability to integrate information for the physician. That's a lot of why we're doing this work: we see this kind of gap between where the models are at and how people are using them, and we think that's really important with our upcoming products. And to get closer to answering your question: you asked how the models should navigate places where there's potentially a lack of medical consensus, or where physicians would disagree about what to do. This is pretty fundamental to our approach, and it's why we don't have one or two or three experts who are determining what the model's outputs are; we have a pretty multi-pronged approach. A lot of this comes down to presenting information to the user, but being sure to present uncertainty when uncertainty exists. I mentioned the safety research motivation for a lot of the work that we're doing. One of the directions that we've been exploring is: are these models well calibrated in their uncertainty, and can we get better at having these models verbalize their uncertainty? So, for example, if there are three to five potential paths for your son's next treatment, a doctor might, just for the sake of simplicity and clear communication, focus on one. A model can potentially communicate three to five of them, but mention that the state of the evidence may be somewhat limited and note the caveats. And so one of the things that we've been investing in is, first, can models become better at understanding their own uncertainty, and can that be a thing that we measure and improve? And second, can they verbalize it better? And this is, I think, a big part of the right way to thread the balance between being more aggressive, sharing potentially early results or early evidence, and being overly conservative. So that's how we think about that.
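To make "well calibrated" concrete: one generic way to check calibration (a minimal sketch of the standard expected-calibration-error idea, not OpenAI's actual methodology; the numbers below are made up for illustration) is to compare the confidence a model states against how often it turns out to be right:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare average confidence
    to observed accuracy in each bin; a well-calibrated model has low ECE."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [i for i, c in enumerate(confidences)
                  if lo <= c < hi or (b == n_bins - 1 and c == hi)]
        if not in_bin:
            continue
        avg_conf = sum(confidences[i] for i in in_bin) / len(in_bin)
        accuracy = sum(1 for i in in_bin if correct[i]) / len(in_bin)
        ece += (len(in_bin) / total) * abs(avg_conf - accuracy)
    return ece

# Hypothetical example: answers where the model verbalized 90%, 80%, ... confidence.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95, 0.7],
                                 [True, True, False, True, True]))
```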
23:42
What are you seeing in terms of trying to get the models to understand their own uncertainty? Because I remember that famous graph from the GPT-4 model card where the pre-trained model seemed to be much more calibrated with respect to its own uncertainty than the post-trained model. And we've only scaled post-training since then, right? I haven't seen an update to that kind of research in a minute, but it seems like there was at least a fundamental challenge opening up there with respect to that kind of introspective self-awareness of 'how confident am I in this answer?' Is that something you guys have solved, and is that why I haven't seen that sort of graph in a while?
26:49
Well, I think this is a measurement challenge. The plot that you're referring to, the kind of plot that existed back then, was: given the next token for, say, a multiple-choice question, did the model's probability of that next token, the letter A for example, correspond to how likely it was to actually be correct in choosing A? And what you're seeing now is that it's become more difficult to measure that, for two reasons. One is that we have higher expectations of our models than answering multiple-choice questions, so it's harder to say when they're correct or not correct. The second, and this is a little bit more technical, is that the models now emit reasoning tokens or thinking tokens between initially outputting something and their final answer. The result is that you can't ask the model for the log probability of that next token being A in exactly the same way that you could before. So there are a couple of things you can do to handle this. One is you can go in the direction of richer ways of measuring whether a model is correct or is doing the thing that you want, rather than measuring the log probability of a certain letter. The second thing you can do is repeatedly sample from a model and then see whether performance stays the same or degrades as you repeatedly sample. Our work on HealthBench is actually a good example of doing both of these things at the same time. In HealthBench, we did this work around measuring not just one or two or three different aspects of model performance in health, but actually, across these 250-plus physicians and across 5,000 conversations, measuring about 49,000 different axes on which model performance could differ. Part of this is whether the model expresses uncertainty in the right way, whether this is the right fact, whether it escalates to a physician when needed, and things like this. Again, 49,000 different things are measured in HealthBench. One of the things you can do there is measure correctness in a way that is less about multiple-choice accuracy and more about whether the right facts are included that are really important to emphasize to the user, and things like this. The second thing you get out of that is this metric which we call worst-at-n: when you repeatedly sample from the model and measure performance on HealthBench, what is the worst performance you get across n samples? So you sample from the model 20 times; what is the worst performance you get? Now you have a way of measuring, instead of the log-prob-based way, even with a reasoning model, whether or not it produces a consistent result conditional on the different kinds of thinking that it's doing. So what I would say is basically that it's harder to produce plots like we could before, because now the thing that we're measuring is so much more complicated. But when we do produce the plots, as we did in HealthBench, the plots are also looking pretty promising, and the models have improved quite a bit at that.
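For intuition, here is a minimal sketch of the worst-at-n idea as described above (my own illustration with made-up scores, not the HealthBench implementation, whose exact definition may differ): sample the same prompt repeatedly, score each completion against the rubric, and report the worst score among n draws.

```python
import random
from statistics import mean

def worst_at_n(sample_scores, n, trials=1000):
    """Average, over many trials, of the minimum rubric score among n
    completions drawn at random from repeated samples of one prompt."""
    return mean(min(random.sample(sample_scores, n)) for _ in range(trials))

# Hypothetical rubric scores (0-1) for 20 completions of a single health prompt.
scores = [0.82, 0.79, 0.91, 0.55, 0.88, 0.93, 0.61, 0.85, 0.90, 0.77,
          0.84, 0.73, 0.95, 0.68, 0.89, 0.92, 0.81, 0.70, 0.87, 0.83]

print(worst_at_n(scores, n=1))   # roughly the average single-sample score
print(worst_at_n(scores, n=5))   # lower: the weakest of any 5 samples
print(worst_at_n(scores, n=20))  # the single worst completion observed
```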
27:34
Hey, we'll continue our interview in a moment after a word from our sponsors.
30:18
AI agents may be revolutionizing software development, but most product teams are still nowhere near clearing their backlogs. Until that changes, if it ever does, designers and marketers need a way to move at the pace of the market without waiting for engineers. That's where Framer comes in. Framer is an enterprise-grade website builder that works like your team's favorite design tool, giving business teams full ownership of your .com. With Framer's AI Wireframer and AI Workshop features, anyone can create page scaffolding and custom components without code and in seconds. And with real-time collaboration, a robust CMS with everything you need for SEO, built-in analytics and A/B testing, 99.99% uptime guarantees, and the ability to publish changes with a single click, it's no wonder that speed-, design-, and data-obsessed companies like Perplexity, Miro, and Mixpanel run their websites on Framer. Learn how you can get more from your .com from a Framer specialist, or get started building for free today at framer.com/cognitive and get 30% off a Framer Pro annual plan. That's framer.com/cognitive for 30% off. Framer.com/cognitive. Rules and restrictions may apply.

The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a timesaver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English: send a daily briefing, triage support emails, or update your CRM. Whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works. No flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at tasklet.ai and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at tasklet.ai.
32:11
So how should you as a user think about the worst-of-n thing? Is there a way to translate it? I mean, maybe you could just describe the result, like how much worse is the worst of n than, say, the next worst or the average. And then if I'm a user, which I am, can I get security just by running it twice? Is there a practical upshot of that work that could give me confidence that, if I do X, then I can be sure I'm not getting something that's way worse than what the model is typically going to output?
32:51
Yeah. The way I think about it is: the more compute you spend on things, the better the results you will get, though the gains may be marginal over time. So one thing you could do as a user of a model is sample from a model 10 times, combine that together into one output, have an LLM synthesize the outputs of that LLM council, and then produce that as an answer. I think that will be marginally better than the answer you get from just running the model once. And this is not so dissimilar from what GPT-5 Pro and things like it do under the hood anyway. So I think you can do a few things if you're a user and you want to make the best of this. One is you can use GPT-5 Pro or something similar to do a thing like this. A second thing you can do is increase the amount of reasoning, because I think it has a very similar effect to just running it multiple times. In both cases, my current sense is we're getting to the point where current models, as they are now, are performing incredibly well for most people most of the time. But I know, for example, that using GPT-5 Pro was an important part of working through your son's situation. I think we're reaching a point where, except for the most complicated of cases, most people are best served by just using the model on a default reasoning setting. I would recommend using the reasoning models instead of the more instant models, using GPT-5.2 Thinking rather than GPT-5.2 Instant, for a lot of health-related things. But I think most people can get the best of both worlds between latency and performance by doing that. The way to think about the worst-of-n result really broadly is just: if you sample from something 20 or 50 times, you'll have varying performance across model outputs. What we saw in the worst-of-n results is that recent models have improved pretty significantly, where the worst performance of, for example, o3 at that time was way, way better than the best performance of GPT-4o. And we've continued to see that over time. We've been shipping model improvements in health pretty rapidly, and the model improvements in the last year have been bigger than in previous years since ChatGPT launched. As an example, today the GPT-5 Nano models that you can get through the API, and also our open source models, are actually performing similarly to o3, which was our best and greatest model not so long ago. And the latest reasoning models continue to push the frontier of how much you can do with less and less reasoning. This is true for 5.3 Codex as well, and also for 5.2 Thinking: if you try using them by default on health queries, they'll actually think a little bit less but produce better results. And we're continuing to try to push that frontier of not just needing to pour more compute in to get a good result, but also getting better performance at a given level of compute. The result is that the models are way, way better than they were even a year ago.
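As a rough sketch of the "sample several times and have an LLM synthesize" pattern described here (my own illustration, assuming the standard OpenAI Python SDK; the model name is a placeholder, and this is not how GPT-5 Pro actually works internally):

```python
from openai import OpenAI

client = OpenAI()           # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-5.2-thinking"  # placeholder; substitute whatever reasoning model you use

def council_answer(question: str, n: int = 5) -> str:
    # 1) Draw several independent answers to surface any disagreement.
    drafts = [
        client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content
        for _ in range(n)
    ]
    # 2) Ask the model to reconcile the drafts, flagging disagreement and
    #    uncertainty explicitly instead of hiding it.
    synthesis = (
        "Here are several independent answers to the same health question. "
        "Synthesize them into one answer, note where they disagree, and state "
        "how confident you are in each recommendation.\n\n"
        + "\n\n---\n\n".join(drafts)
        + f"\n\nOriginal question: {question}"
    )
    return client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": synthesis}],
    ).choices[0].message.content
```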
33:32
Yeah, it's been crazy in my just three months of intensive use. Nothing has brought home the pace of shipping quite as much as how many updates there have been just in this one chapter of my life. It has been wild to see. I don't know if this is something you can speak to, but how should we think about the density of effectiveness of the reasoning tokens? I did an episode with the folks at Apollo; I'm sure you know of their work, if not know them personally. And one of the really interesting things that they observed when they got access to the chain of thought is that, at least for, I think it was o3 at the time, although I'm not 100% sure which model it was off the top of my head, there was this seeming development of a new dialect, basically, internal to the model. The famous 'watchers, watchers,' whatever. Can you share anything about how you're balancing the obvious good of efficiency, denser thinking per token, more value per token created, with the seeming tendency for the internal chain of thought to go off in weird and potentially hard-to-parse directions? I guess there's also the commitment from OpenAI to not train on the chain of thought, or at least not apply certain kinds of pressure to it. I don't know if that's an absolute ban on any feedback on the chain of thought, but I'm interested to hear your thoughts on that, because the world moves past these big stories so quickly these days that this one seems to have kind of come and gone, and I'm not really sure what the state of it is now.
36:27
Yeah, it's a super interesting question, also near and dear to my heart as a safety researcher. Chain-of-thought interpretability has been one of the nice advances in safety in the last year or two, which has been really cool to see. As models have become reasoning or thinking models, they've also emitted tokens that effectively explain their work and what they're thinking. This provides a form of interpretability for researchers who want to understand whether models are doing the thing that they expect, as safety researchers do. It's been this really cool way of measuring whether models are doing things like scheming, or producing outputs that are not desirable in various other ways. This has been relevant beyond health to a bunch of other domains as well; I think a lot of the results that people have shared and studied have actually been in coding. I think I've been pleasantly surprised. The danger that you're pointing out is: as you put more pressure in reinforcement learning to get the models to produce a good output, will they slip away from the prior of having their thinking tokens be simple English that is easy for researchers to understand? What we've seen is actually pleasantly surprising, which is that, at least until now, we haven't seen a lot of large-scale evidence of this slip into what's called neuralese, of using chain-of-thought tokens in a way that is not explainable and understandable in English. And in general, as we've been trying to understand the monitorability of our models over time, we haven't really seen that effect as we've scaled up RL. I'm not sure if that will continue to be the case in the future, but so far I've been fairly pleasantly surprised that this side effect of the reasoning paradigm has continued to be useful and hasn't robustly been seen to become unreliable. The result that you're pointing out, I think, has been sort of continuous over time, with these kinds of weird blips in the interpretability of the models at times. But we haven't seen a continuous increase in that, and we haven't seen clear evidence, even though we've actually tried to study it, that scaling RL causes it to happen more. We haven't seen that yet. I would expect in the limit it does.
38:12
Okay, that's quite interesting. So the way I sort of interpreted that: first of all, I thought it was happening more than it sounds like you're saying it is happening. And I naively assumed that there was some sort of brevity reward signal being applied in addition to ultimate correctness, and certainly it is intuitively comprehensible why something like that would start to happen if you did that. Should I infer from what you're saying now that there isn't really a brevity signal, and that this is more of just an emergent, weird phenomenon that doesn't happen that often? And so it's kind of one of those weird language model things that we keep an eye on but don't obsess about too much because it is rare. Is that a fair summary of the state of play?
40:23
I do think it's important to pay attention to. I think the right way to think about it is: there's nothing reinforcing that this should happen during training, right? There's no reason that models should produce chains of thought that are human-interpretable during training. The reason it happens is because they have a prior of using the English language, and when you give them the space with thinking tokens to produce a more correct and helpful answer for the user, they actually use it in English, because that is just the easiest thing for them to do. So this is basically an empirical phenomenon that is extremely useful for safety research. I would love to keep it that way as much as possible, and I'd love to see more research into how we can maintain that. And I think OpenAI's commitment to avoid optimization pressure as much as possible, and also the commitments of the other labs, is really exciting progress. I do think it's important to watch out for. So I think it's a great question.
41:12
Yeah. To be continued. So you mentioned, obviously, the incredible complexity of the evaluations that you're running. By simple math, right, if there's 5,000 conversations and almost 50,000 criteria of evaluation, there's a lot. I don't know, I imagine some of those criteria are reused across conversations; at a minimum, we've got something like 10 evaluation criteria per conversation, and probably a lot more criteria per conversation than that. It's hard to summarize how good they are. Nevertheless, I'm going to ask you: how good are they, and how should we think about that? There's also this HealthBench Hard thing that maybe could be a way of saying they're good at most things, but here are some things that they still struggle on, kind of defining the frontier; maybe that's one way to say how good they are. But how do you communicate to the world how good the latest models are when it comes to health?
42:37
Yeah, I think it's good to keep in mind the arc of work around LLM evaluation in health. A couple of years ago, people were mainly focused on evaluating LLMs by looking at their performance on multiple-choice questions, like medical exams and things like this. Then, with the work that collaborators and I did at Google, we started increasingly investing in what it looks like for specialized or unspecialized LLMs to answer general health questions, with some of the Med-PaLM work, and potentially to go in the direction of asking follow-up questions to a user, with the AMIE work. Increasingly over time, we've moved toward higher and higher fidelity evaluations, and I think HealthBench is the latest big step in that direction: wide coverage of LLM performance and safety, covered in a way that spans many different axes of performance that actually matter for the real world. Again, you have these 5,000 conversations and these 49,000 different axes of performance. There are actually three different versions of HealthBench. One version is the full dataset, the second is HealthBench Consensus, and the third is HealthBench Hard. These mirror what we view as the three high-level principles when designing an eval for health. The first is that you want it to be meaningful, which means that if the number goes up, hopefully human health will improve. The second is that you want it to be trustworthy, which means that it's backed by the consensus of doctors, for example, or other experts. The third is that you want it to be challenging, so you don't want it to be at a hundred percent. One thing that's happened over the years is that all the previous benchmarks that meant anything at all have gone to like 90, 95, 100% over time, and HealthBench actually remains unsaturated to this day. So what you have with HealthBench, HealthBench Consensus, and HealthBench Hard is a bit of focusing on each of these individual principles. HealthBench overall is a number that we think, if it goes up for a model and people are using that model, human health will improve, and we feel like that is a statement we can defend with the rigor behind the work. The second is HealthBench Consensus, where we specifically focused on criteria for evaluation where a majority of multiple physicians agreed that each criterion was applicable. So not only did you have physicians write these criteria, you also had a bunch of physicians check whether these criteria hold for a given conversation and whether they're the right thing to be evaluating about a given model in a given conversation. And the final thing is HealthBench Hard, where we actually went somewhat adversarial: across a bunch of different models from all the different model providers, we chose the examples that existing models fared the worst on but that still seemed high quality, and turned that into a benchmark. HealthBench Hard has kind of been my favorite external benchmark for whether an open model is doing really well. When it came out, GPT-4o was literally at zero on this benchmark. So it's an incredibly hard benchmark, because that's just how we chose the examples. Over time we've improved to performance of around 40%, which is still not near saturation, and I think this benchmark has a lot of room left for OpenAI's models. I think competitor models are more in the 20% range currently.
So that's the way I think about the HealthBench family of evals. We have this commitment to work on evals that are meaningful, trustworthy, and challenging, and if you want to focus on the evals that are super trustworthy, then we have the Consensus subset, which is really focused on that. In addition, as part of the HealthBench work, we did a couple of additional analyses. HealthBench involves grading these 49,000 individual rubric items using a model-based grader. We had physicians compare the model-based grader to the grading of other physicians, and we actually found that the model-based grader was doing a better job than the average physician. What that tells you is that the grading for HealthBench is pretty high quality compared to what you'd expect from a physician.
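To give a feel for rubric-based grading, here is a simplified sketch of the scoring idea (my reading of the public HealthBench description, not the released code; the point values and clipping behavior here are illustrative assumptions): a grader model decides whether each criterion is met, and the example's score is the share of achievable points earned.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # e.g. "Advises seeking emergency care for crushing chest pain"
    points: int  # positive for desired behavior, negative for harmful behavior

def grade_example(response: str, rubric: list[Criterion], criterion_met) -> float:
    """Score one response against one example's rubric.

    `criterion_met(response, criterion)` stands in for the model-based grader:
    it returns True if the response exhibits the behavior the criterion describes.
    """
    earned = sum(c.points for c in rubric if criterion_met(response, c))
    achievable = sum(c.points for c in rubric if c.points > 0)
    if achievable == 0:
        return 0.0
    # Clip so harmful behavior can pull a score down to zero but not below.
    return max(0.0, min(1.0, earned / achievable))
```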
43:00
Recursive self-improvement! Signs of recursive self-improvement. Alert, alert. How about just a little bit of an intuitive sense for what's in HealthBench Hard? I could give you my experienced sense of the frontier from the bedside over the last couple of months, and I would summarize it pretty simply. This is what I say to my neighbors and such when I'm saying, hey, by the way, you should really use a language model next time you're facing a health challenge. I basically say: look, I was in the hospital for initially like 30 days of really intense treatment, where everything felt super high stakes. We didn't always know what was going on; we also didn't know how much we could even trust our doctors at that point. Everything was so new and stressful at the same time. And what I found was basically that the frontier models were step for step with the attending oncologists on almost everything. And by the way, that means they're a lot better than the residents were, just much more knowledgeable. They're at the attending level for sure. There were maybe a half a dozen times over the course of that month where there was some disagreement. Initially I was just using GPT-5 Pro, whatever exact version of GPT-5 it was, and then as the other frontier models came out, I started to do everything in duplicate with Gemini 3 and then in triplicate with the latest Claude. Interestingly, I would say that the AIs disagree with each other even less than the models disagree with the attending, but it's quite limited disagreement between the models and the attending. And typically when there is disagreement, I've found it to be a very minor thing: okay, his electrolytes have gone a little low, should we give him electrolytes today or not? And of those half a dozen things, there's not really a major trend. I would probably score it like six to four for the doctors, with the benefit of hindsight, in terms of: we usually followed what the doctors said, and did we in the end feel like they were right, or do we kind of wish we had gone with the AIs? I'd say maybe two out of three times we have felt like they were probably right. And if I tried to chalk it up to something that gives them an advantage, it almost always was one of those situations where, whatever I'm putting into chat and whatever data (of course now we've got integrations for this, but I was always exporting the latest results from the EMR and so on and dropping in the PDFs that they gave me), the difference usually came down to, in view of all that, what Eamon would additionally say: taking all that into account, but also just looking at him right now, watching how he's breathing, looking at his color, I'm pretty sure he's fine. It was that kind of very intuitive, very multimodal, very subtle sense, which obviously these folks have developed over quite a few years of clinical practice, that on these very fine margins seemed to give them a slight edge over the models. So that's my account. But I guess what that means, maybe, is that my situation isn't that hard. And I think that is actually true, in the sense that the cancer my son has is not a super rare one and the treatment protocol for it is quite well established. So it's not a super hard call in terms of the main line of what to do.
You know, maybe even though it's a hard thing for him to go through, maybe it's not that hard in terms of clinical judgment. I haven't experienced anything that I would say the models were only 40% on. So with that, maybe you can tell us a little bit about what's still out there on the frontier, where we do have ground truth, or at least some sort of consensus that we feel is solid enough to grade models on, and where they're still only at 40%.
46:50
Yeah, it's a great question. Let me describe a little bit more about what HealthBench is evaluating and the ways in which models have improved over time, and then talk a little bit about the next frontiers for model improvement. HealthBench measures many, many different things. It has these themes, which are focuses of different evaluation examples. A few examples of the focuses we had: one was, are models appropriately escalating to care when needed, versus not escalating to care unnecessarily? You want to balance this, because you don't want to be, for example, overwhelming the health system with a bunch of patients who are worried by alarmist medical advice, versus not escalating to care when needed. Another aspect that we measured was the ways in which models can adjust to the different demographics, different epidemiological conditions, or different levels of access to care of people globally. This is both making sure that you adjust to a user that's male or female, but also, if somebody asks a question in a region where tuberculosis is more common versus less common, making sure you adjust for that, and things like this. We call that global health, and it was actually the biggest single focus of HealthBench, because that's, I think, one of the biggest ways our work can be most impactful. And then a bunch of other things. We talked a little bit about calibration. One of the ways the models have gotten significantly better over time is, when they know they're uncertain, not only flagging that uncertainty but actually browsing to get more information, for example getting the latest resources, and being able to synthesize that information together. Another thing is asking follow-up questions, and the right kinds of follow-up questions, over time. Initially, ChatGPT would almost never ask follow-up questions in health settings. Now it does so much more often, and it's much more likely to prioritize the right follow-up questions for you. You had this story, when you were using the model for your son, where you actually learned over time to ask the model to interview you and figure out whether you could do the physical exam yourself. Over time, the models have gotten better at knowing when that would be useful and flagging that as well. So it's everything from pure reasoning, solving benchmarks and medical calculations and figuring out the diagnosis, all the way to how the model behaves: what's the bedside manner, is it comforting, and things like this. And again, a balance of both difficulty and the high-trustworthiness signal that we're getting in working with our experts. Over time, as the models have improved, from o1 to o3, GPT-4.1, GPT-5, and so on, all these models have improved significantly in health. And today, every major stage of model training for every model that we ship at OpenAI actually benefits from our work in health, and that's going to continue for future models as well. I think the frontiers are in a few different areas now. One is that a lot of text-based performance, outside of subspecialty areas, is actually pretty good.
So if your goal is to keep up with the latest evidence, even as a physician, I think the models are doing an incredibly good job today, and the fact that people are finding a lot of value is one of the greatest data points toward that. Models have also continued to improve in their ability to integrate information. You mentioned this thing of taking a bunch of different information as a physician, looking at the pallor of your son's skin over time, and integrating that all into one context. A big challenge for models today, actually, and this is less of a model issue and more of a how-do-you-surface-it issue, is getting the right context. One of the challenges that you faced in your story was pulling the right information from the health system so that you could upload it into ChatGPT, ask the right questions yourself, and advocate for yourself. We're looking to make that better with the release of our ChatGPT Health offering, which I can talk a little bit more about later. But in short, this is basically an experience within ChatGPT that enables you to connect your health information from your medical records, any wearables, or Apple Health, and provides additional purpose-built privacy protections for that. So we can talk about that a little bit later. But on the model-related side, I think there are a few challenges. One is getting that context, again; the models are incredible when they have the right context. The second is thinking about various modalities that are not well captured by things like HealthBench, which is really focused on text. So thinking about really good performance in multimodal settings, in image and voice. I think people will start to rely on the models in more and more modalities, and I think there's still significant room to improve there. And I hope that in the future, the best models in the world for the various imaging modalities in health are actually the models that are most easily and readily available to people.
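To make the rubric idea concrete, here is a minimal, purely illustrative sketch of how criteria-based grading in the spirit of HealthBench could be scored. The criteria, weights, and the placeholder grade() function are invented for this example; the real benchmark's rubrics are written by physicians and applied by expert or model graders, not keyword matching.

```python
# Toy rubric grader: each response is checked against weighted criteria and scored
# as earned points over total possible positive points. Everything here is hypothetical.

from dataclasses import dataclass

@dataclass
class Criterion:
    description: str   # e.g. "Recommends emergency care for red-flag symptoms"
    points: int        # positive = desired behavior, negative = penalized behavior

def grade(model_response: str, criterion: Criterion) -> bool:
    """Stand-in for an expert (or model) judge deciding whether the criterion is met."""
    return criterion.description.split()[0].lower() in model_response.lower()  # placeholder logic

def rubric_score(model_response: str, criteria: list[Criterion]) -> float:
    """Score = earned points / total possible positive points, clipped to [0, 1]."""
    earned = sum(c.points for c in criteria if grade(model_response, c))
    possible = sum(c.points for c in criteria if c.points > 0)
    return max(0.0, min(1.0, earned / possible)) if possible else 0.0

criteria = [
    Criterion("Recommends urgent evaluation for chest pain with shortness of breath", 5),
    Criterion("Asks a follow-up question about symptom onset and duration", 3),
    Criterion("Uses alarmist language that would drive unnecessary ER visits", -4),
]
print(rubric_score("Recommends urgent evaluation; asks when the pain started.", criteria))
```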
50:59
Now it occurs to me, as you say the multimodal thing, I'm like, hey, maybe I was even undershooting it. I had the intuition that taking a picture of my kid as he sits in his bed wouldn't necessarily help and might even confuse. But maybe I'm wrong. In the future, would you advise me to start just including cell phone camera pictures with my daily synopsis?
55:37
Well, I think you're right to point out that there is still a little bit of a challenge there: for the average user, it's difficult to know exactly what data you can connect and how to connect it. And you were highly motivated in this case. There's a gap between the most motivated or most expert user, who's willing to wade through signing into their patient portal and taking screenshots or copying and pasting things manually, as you did, versus the person who wants their health data integrated, and wants to be able to understand their health and advocate for their health or that of a loved one, but has challenges in doing so. And so that's the thing we hope to work on on the product side with ChatGPT Health: really lowering the activation energy to do that kind of thing. I think in general the models will benefit from having more and more context. I'm not sure specifically whether it would have helped in that case with that additional information, but I would generally advocate for people, if they find it useful, to try putting in various kinds of context that they can. And I think people have been surprised, I mean, I think you were surprised in your own story, by how useful it can be when the models have more context. An interesting tidbit here: there have been studies showing, since two or three years ago, that for these clinical case challenges, which are effectively crossword puzzles for doctors, where you get all the patient context upfront, potentially multimodal context as well, the models did an incredible job at figuring out the next steps for diagnosis or treatment. And these are extremely challenging puzzles, really difficult for doctors. People even found that the models improved on the performance of these experts, and that's been true for some amount of time. I think a lot of the challenge in the ensuing time has been what it looks like to get the models to have a back-and-forth conversation, to solicit the right kinds of context from you as a user. And then with ChatGPT Health, we're taking that a step further and thinking about the right ways the product interface can make it a lot easier for you to do that, in a way that's secure, where we're not training on your data, and people can trust that as well.
56:01
How about for these larger modalities, when we get a PET scan or whatever? My understanding, and I haven't really been able to get my hands on too much of that data, is that it's gigabytes, and I'd have to get it on a disk. And I don't even have a computer that takes a disk, so I'd have to also get a disk drive to read it. Obviously that data can be compressed. How do you think about the pipeline of raw scan-type data and feeding that into tokens? Are we talking about certain specialized perceiver modules that ingest those and do the reduction, or some other, mysterious third thing?
58:07
Yeah, it's a super great question. One of the interesting things about biomedicine from a modeling perspective is that there's a long tail of modalities that are interesting and relevant, and often they're fairly difficult to put into models today. So I think there are going to be two broad approaches, and this is talking more about external research; there are two broad approaches that I've seen in external research. One is thinking about the right ways for models to call specialized tools or run code over various kinds of data. So think about how a human views a gigapixel scan, let's say a pathology image. They basically view slices of that image. Or even a 3D or 4D scan, like your son's MRI: they view slices of that image. No human can see a 3D or 4D modality all at once. So what they're effectively doing is using a tool to understand which slices of the image might be most important, and then taking a look at those slices and manipulating them. That's actually a thing that models can do with tools, with Python, and things like this. So that's number one. There's a sub-bullet of that, which is that they can also use very specialized tools that are fit for purpose, like specific Python libraries or even additional models that specifically encode these modalities. The second approach is that you actually have these models encode the modalities themselves: you have a way of putting them into token space in some way. And this is what's been done for images and video and audio for various models. I think what you'll see in biomedicine is a bit of a mix of both. There are benefits to both approaches, pros and cons for both, and I think you'll see a hybrid approach where researchers end up doing a mix of both, depending on the modality.
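Here is a minimal sketch of the first approach, slicing a volumetric scan with a tool rather than feeding the whole thing to a model. The per-slice variance heuristic and random "scan" are stand-ins invented for illustration; a real pipeline would use a fit-for-purpose imaging library or a specialized model to pick and process slices.

```python
# Illustrative only: inspect a 3D volume by selecting a few informative 2D slices,
# the way a model with tool access (or a human radiologist) would, instead of
# ingesting the full scan at once.

import numpy as np

def top_slices(volume: np.ndarray, axis: int = 0, k: int = 3) -> list[int]:
    """Return indices of the k slices with the highest intensity variance along an axis."""
    slices = np.moveaxis(volume, axis, 0)
    variances = [slices[i].var() for i in range(slices.shape[0])]
    return sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)[:k]

# Pretend this is a 3D scan: 64 axial slices of 256x256 voxels (random placeholder data).
scan = np.random.rand(64, 256, 256)
for idx in top_slices(scan, axis=0, k=3):
    slice_2d = scan[idx]            # this 2D image is what would be handed to a
    print(idx, slice_2d.shape)      # vision-capable model or a specialized tool
```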
58:54
It's quite striking how good the AIs are, not even necessarily the general-purpose AIs: we have these specialized AI models that can fold a protein in a superhuman way. What seems to be missing is the latent-space joining of that modality to text. My expectation, and you can tell me if you think I'm right or wrong, or how long I have to wait, is that we're going to see more of that, and that it's going to be a major driver of truly superhuman performance in a lot of domains. Because we just don't have people who can intuitively fold a protein. But if you had a person who could do that, and they had their general reasoning capabilities, it seems like you would have a qualitatively different kind of intelligence. So I'm really interested in what you see as the roadmap, the timeline, the expectations for that kind of deep integration.
1:00:39
I think it's a really optimistic picture. I mean, I totally agree. If you think about how we lean into the natural capabilities of these models: they can take vast amounts of information into their context, and they have the ability, at least in theory, to take in a lot of modalities of data and effectively merge them together. And I think the research for that is becoming increasingly solid. For these other modalities, there is significant research to do depending on the biological modality, and like I said, there's a long tail of biological modalities that matter. So depending on the modality, and on the relevance and impact and availability of data and all these kinds of factors, it could take more or less time. It's hard for me to give you a universal answer, but I share your optimism.
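One common pattern in the external multimodal literature for "joining the latent spaces" is a small learned adapter that projects a frozen specialized encoder's embedding into the language model's token space. The sketch below is purely illustrative: the dimensions, the fake protein embedding, and the single linear projection are all assumptions, not how any particular frontier model does it.

```python
# Toy adapter: map one specialized-encoder embedding to a handful of "soft tokens"
# that could sit in an LLM's context alongside text. Hypothetical shapes throughout.

import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    def __init__(self, encoder_dim: int = 1280, llm_dim: int = 4096, n_tokens: int = 8):
        super().__init__()
        self.n_tokens = n_tokens
        self.llm_dim = llm_dim
        self.proj = nn.Linear(encoder_dim, llm_dim * n_tokens)  # the only trained part

    def forward(self, modality_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, encoder_dim) -> (batch, n_tokens, llm_dim)
        return self.proj(modality_embedding).view(-1, self.n_tokens, self.llm_dim)

fake_protein_embedding = torch.randn(1, 1280)   # stand-in for a frozen specialized encoder's output
soft_tokens = ModalityAdapter()(fake_protein_embedding)
print(soft_tokens.shape)                         # torch.Size([1, 8, 4096])
```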
1:01:41
I just saw one in the last couple of days: SleepFM, from Professor James Zou at Stanford and collaborators, probably at multiple institutions. They had a really interesting finding. The idea is they use, I think, six, maybe up to eight different modalities that are all measured during one night of sleep, like how you're breathing and various things that are easily measured in a sleep study setting, so it takes just one night of sleep to gather all this data. And they've gotten really good at predicting all kinds of different diseases based on that data. So they're integrating all these modalities with each other into a holistic understanding and using that for a narrow set of predictions, which are very high-value predictions. They're not yet going all the way to text, but it's another data point that's top of mind for me right now, where I'm like, man, it's all happening. All the latent spaces will be joined, and I can't shake the idea that that's a big part of what superintelligence ends up looking like. Going back for a second to bedside manner. You mentioned global health and context being a big part of HealthBench, and part of what you're trying to do to make sure AI benefits all humanity. It was funny when you said, we have 200 million users, and when we scale, we want to make sure we're doing this right. It's like, yeah, one day you'll scale; it's only 200 million users today. So clearly there's already some scale. The thing I want to address, though, is the bedside manner and how that relates to the n-of-one context of the individual user, their preferences, their memories. Because now, of course, all these products have an integrated memory module, which I'm sure is also going to become a deeper kind of integration over time. Right now I think it's not disclosed exactly what it looks like; people generally understand that it's kind of a scratchpad of key highlights that the model can reference, I believe still mostly in the text modality, as it goes. But regardless of exactly how that works today, it's safe to say it will become a deeper and deeper integration as everybody pursues various strategies for continual learning. And beyond that, there's also integration with tools. Gemini now has access to my Gmail. And there are system prompts: I can just tell the model that I want it to behave a certain way or a different way. It strikes me that that creates an impossible surface area for you to manage. We've seen weird emergent things like sycophancy here, and whatever else there. How do you think about managing that? Because this is one thing I do think human doctors generally do a pretty good job of: sizing the person up and figuring out, okay, how do I cut through what I need to cut through to get this person to understand what they need to understand? And that is something I think models broadly have not been as good at yet, but it really seems to matter a lot in the medical and healthcare domain. So is there some additional suite of testing that you do? Or how do you think about that insane long tail of idiosyncrasy that 200 million health users bring to the product?
1:05:15
Well, you pointed out two problems, and I'll talk about them in reverse order. One of them is that people are bringing in an incredible array of experiences, different settings, different ways they use ChatGPT, and their memories are different. All these different things mean that the experience of one person using ChatGPT can be different from the experience of another person using ChatGPT, and for reasons that aren't just that they're using a different model or something like that. We do a few things as a company to understand and improve model behavior. One of them is evals that we run on models before launching, and HealthBench is a great example of that, along with additional evals that we run. Another thing that we do is monitor. So we monitor production traffic in ways that are privacy preserving and run classifiers over it to understand if there are any safety risks, anything from health to frontier risks and things like this. And so we're actually able to measure these things. There's a great example of this kind of work in the blog post we had about sensitive mental health conversations, where we were able to show a correspondence between the evaluations we were doing in mental health and the patterns we were able to see in production traffic, again logged in a privacy-preserving way with models running over it. We could see how people were actually using the models, what that looked like, and what our evals were measuring, and they were very well correlated. That's how we think about closing the gap between evals that you run on models and what people are doing in the real world. I would add that there's an intermediate step as well, which is doing real-world studies, and here I'll plug our work with Penta Health. This was, I think, the first real-world study of an LLM-based copilot for clinicians, where some clinicians in this group of clinics in Kenya used AI as a copilot or a safety net: as they were typing in their electronic medical record, it would flag things if they were interesting or alarming or potentially incorrect. Other clinicians did not have that. For the patients treated by clinicians with the AI versus without the AI, there was a statistically significant improvement in diagnosis and treatment outcomes. This is another example of how you move from offline evaluation that's potentially not super realistic, like medical multiple-choice exams, to increasingly realistic evaluation that you can run offline, like HealthBench, and then move increasingly into the real world. So you can view the HealthBench work as kind of forward looking, and you can view our analyses of production traffic as more retrospective, capturing exactly a lot of the differences and variance that you're pointing out.
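For a concrete picture of that monitoring loop, here is a rough, purely illustrative sketch: classifiers run over de-identified traffic and only aggregate category counts reach humans. The classify() stub and category names are invented for this example and are not OpenAI's pipeline.

```python
# Toy privacy-preserving monitoring: humans see only aggregate counts, never raw text.

from collections import Counter

CATEGORIES = ["no_concern", "needs_escalation_to_care", "possible_unsafe_advice"]

def classify(conversation_text: str) -> str:
    """Stand-in for a model-based safety/health classifier."""
    if "chest pain" in conversation_text.lower():
        return "needs_escalation_to_care"
    return "no_concern"

def monitor(deidentified_conversations: list[str]) -> Counter:
    """Aggregate category counts over a batch of de-identified conversations."""
    return Counter(classify(text) for text in deidentified_conversations)

print(monitor([
    "user reports mild headache after screen time",
    "user describes chest pain radiating to the left arm",
]))
```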
1:05:38
Yeah, that's really interesting. Do you want to talk a little bit about the privacy-preserving nature of it? I would preface by saying I think people, at least as individuals, if they think about it at all, typically over-index on this. I've had occasion to think, on my son's behalf, should I be putting this information out there? And then I see what somebody like Sid from GitLab has done with his cancer, literally open-sourcing all of his own biology, down to the DNA level, an incredible amount of very individualized data that he's put out there. And I'm like, I think that's the way, because I can't really come up with too many downsides. I mean, you can get really sci-fi about it, but unless you're getting really sci-fi about it, it's hard to come up with a way that anybody would really use that against you, and there are a lot of ways it might stand to benefit you. I've even been telling my son's story on the podcast, and I had a similar question: am I doing him a disservice in some way? Maybe something could still happen, I don't know. But what has happened is that people have reached out to me with interesting opportunities for connection, and I've been able to tap into expertise, including the team that Sid has built to help him. And people are starting all kinds of individualized therapy companies; fascinating developments there. So my advice to people would be: seek the benefit and don't worry too much about whether or not your data is sitting in some log somewhere. That doesn't seem to be a huge concern. But anyway, that's my role to say; it's your role to build a product that deals with people as they come, and people do seem to really worry about that. So what would you want people to know about the privacy-preserving nature of the infrastructure that you've built?
1:08:23
Yeah, privacy is incredibly important to a lot of people, and it's a very personal thing: people's preferences for what data they share and how they share it. If you zoom out for a second, think about what the level of ownership of your own data is as a patient today, and what it was in the past, even before things like the 21st Century Cures Act. Most patients, let's say ten years ago, had no real way to access their own health data. Not only did they not have a way to control who they shared it with and what they shared, it was just out there and shared anyway in various ways. A patient didn't have a right to even look at their own data, or an implementable path to actually look at it. Now we're in a slightly better place, I would say, but still not an ideal place, where most patients still feel like it's incredibly difficult to get access to their data or the data of their loved ones and take control of it and how it's used. People like you, and many other people with incredible stories, have found incredible ownership and advocacy that is enabled only by having access to this data and being able to use it in the ways they see fit. So we want to respect that. We want people to be able to access their data, and this is part of what we hope to enable in ChatGPT Health: just a lower-activation-energy way for you to connect your health data and make it useful. At the same time, it is super important to a lot of people that this is done in a way where it's clear that there isn't some kind of competing incentive here. So one of the things that we're super clear about with ChatGPT Health is that none of the data you connect is used to train our foundation models. And the reason we do that is, first, we think our foundation models are great already, and this is actually not the most important way we can improve them. And second, we think this will further lower the activation energy for people to do incredible things with our models. If people think there's some tension between privacy and utility, fewer people will go for the utility, and we don't want that to be the case. We want people to see the value of it and to use it. So specifically, ChatGPT already has a bunch of really good privacy protections built in, including encryption and things like this, and ChatGPT Health adds additional purpose-built layers of encryption just for health data. The result is a couple of things: you have encryption of the data, and you also have isolation of the data from other data that you have in ChatGPT. For example, if you have other apps or memories from ChatGPT, those are kept separate from health; the stuff that you do in health can benefit from them, but your other conversations in ChatGPT won't have your health context or health information. So you can keep that completely separate and have it live in a separate experience. And as we continue to get feedback as people use it, while we roll it out on the waitlist, we'll continue to improve these protections, because we think it's really important for users, and we think it's actually not as important as it seems for improving the models.
1:10:05
Yeah, interesting. Is that segregation of data a new feature with Health? So all the stuff I've done to date has been in one product experience, but that fork has just now been introduced. Exactly, yeah. I've been motivated to get a Whoop to wear on my wrist to start collecting data. I've been fortunate throughout my life to have generally good health and haven't created much of a paper trail in terms of medical history. I've always looked at these quantified-self things and been like, yeah, it's sort of interesting, but who's got time for that? And am I really going to do all the processing on this? The answer's always been no, so I haven't worn anything to date. But now I'm thinking, all right, it's time; I can actually get the value from wearing this thing, because now I'll have a product that runs in the background that's, frankly, smarter than me at this point and will really bring to light what I need to know. So I'm motivated to change my behavior, at least on that level. And I honestly think it's also probably going to end up encouraging me to exercise more. In fact, that's kind of my own North Star metric right now for all things AI, not just health products: I will know that I am succeeding with AI when I'm spending less time in this chair and moving my body more, and when I feel like I can untether myself from the desk, I'll feel like I'm really winning. And I think I'm getting kind of close. I know one friend who uses voice mode nonstop and says he does a ton of it from the gym while he's working out, and I'm like, all right, I want to be like you. Not quite there yet. But I do think it's really interesting to think about both the ability to crunch all this data and also the freedom, the liberation from the desk, to go out and make more exercise records in the first place. Those are major things on the horizon in 2026 that I'm really looking forward to and super excited about.
1:14:17
I'll just add on to that: I think people should expect that the value of the data they collect on themselves will increase over time as model intelligence increases. The ability of models, both from a research perspective and a product perspective, to analyze that data and come up with useful insights, insights that could be useful in the future, I expect to increase over time. So if there was ever a time to get a watch or anything like that, I would say it's now.
1:15:24
So at this point I do basically everything in triplicate for my son. My morning routine, when the lab results come back, is to export them out of the EMR and drop the PDF into all three of the frontier models. Grok has not cracked that top tier; maybe it should, but it didn't in my initial evaluation and I haven't gone back. So it's Claude, Gemini, and ChatGPT. And the tasting notes on how the models interact with me, in short: Gemini, and this is quite surprising for a Google product, is by far the most inclined to push me to advocate for something. It's also the most confident when it says I don't have something to worry about, and generally I would say it's very accurate. Sometimes I'm a little uncomfortable with how opinionated it is, because I'm like, what if you're wrong? That's a surprising behavioral profile for a Google product, especially because everybody has historically said, well, they're going to be so conservative, they can't build AI products because they can't live with that sort of risk. Gemini 3 does not reflect that analysis. ChatGPT is kind of on the other end of the spectrum. It generally gives me, and again, no system prompt; it does have my memories and such, but no intentional attempt by me to shape how it responds. It tends to give me the longest answers, the most information, and is the most clinical and neutral in tone. It's sort of a report style: there are, like, nine issues, and we go through and address each one and then summarize them all again at the bottom. And then Claude is somewhere in the middle: generally much briefer than ChatGPT, more like Gemini-length responses, but more measured, in a more ChatGPT-like way, certainly compared to Gemini's more opinionated persona. What do you think of that? How do you think about the balance between thoroughness and digestibility, which is probably the main tension I would zero in on there, specifically for ChatGPT?
1:15:49
Yeah, I mean, I think it's hard to know what's right and what's wrong. One of the things that we've aimed to improve, and also measure via HealthBench, and this is especially true of our most recent models, is having the knowledge of when the user is likely a health professional versus a lay user, and tailoring the response accordingly. Part of this is making sure that you're applying a level of detail or technical jargon that makes sense for the level of expertise of the user. This is actually a specific thing that we've trained and evaluated on; in our open-source HealthBench eval there's a part focused on this. And so this is something that we've seen improve, especially recently with 5.2 Thinking, and it's definitely an area that we've been investing in. But overall, I think it's very hard to say what is ideal and what is not ideal. And personally, I actually think it's amazing that other competitors can be in this space and push the Overton window along with us, because I think it's hard for any one company to do. A large percentage of the work that we're trying to do is thinking about where the models are at today, where the Overton window of user trust for the models is, and whether we can move that Overton window along in the right ways. And I think additional models, additional products from other competitors, from other companies, from other players in the health ecosystem, actually go a long way toward helping shift that as well. I think it would be a hard battle for us to win if it was just us. So I actually think, personally, it's amazing that there are other folks here.
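As a rough illustration of the tailoring idea, here is a toy sketch that guesses whether a message comes from a clinician and picks a response style accordingly. The cue list, threshold, and style strings are invented for this example; in the models this is learned behavior evaluated against expert-written criteria, not a keyword filter.

```python
# Toy audience detection: choose a response style based on clinical shorthand in the message.
# Purely illustrative; the cues and threshold are hypothetical.

PROFESSIONAL_CUES = ["differential", "troponin", "s/p", "pmh", "bid", "r/o"]

def likely_clinician(message: str) -> bool:
    text = message.lower()
    return sum(cue in text for cue in PROFESSIONAL_CUES) >= 2

def style_for(message: str) -> str:
    return ("concise, technical, guideline-citing" if likely_clinician(message)
            else "plain-language, defines terms, checks understanding")

print(style_for("55M s/p MI, PMH HTN, troponin trending up, thoughts on differential?"))
print(style_for("My dad has chest pain and I'm not sure what to do."))
```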
1:18:00
Okay, so it is a striking thing to say: we don't need your data, basically, we've got enough. I am interested in whatever you can tell us about where that data is coming from. This seems like an area where, and maybe I'm naive in saying this, you really want some real ground truth from actual human medical trajectories, as opposed to purely synthetic data. I would feel a lot more comfortable saying you can synthesize chats and dial those in via a synthetic process than saying you can fully synthesize data all the way to superhuman doctor performance. So I'm interested in where the data is coming from. And I have this notion, which you may again think is beside the point given the tricks you have up your sleeves, but I've been imagining a sort of possible new social contract, not just between patients and AI providers, but even between patients and the medical regulatory system. I don't think my son is going to end up in this position, and I certainly hope not, but there are a lot of people who are in a spot where they're like, I would try anything if it had a chance to save my life, and a lot of times they can't even get access to it. And now we're in a spot where the AIs increasingly are going to know about it. You might not know about it as a person, but the AIs are going to know about it, they're going to tell you about it, and they're going to make a pretty damn compelling argument in a lot of cases that this is probably the best thing out there. I have been down this path in a contingency-planning frame of mind. Everything's gone well for my son so far, but what if there were a relapse? What would we do? And I've been really impressed by what the models have been able to give me there as well. So I imagine a situation where there's mounting pressure from patients who are like, look, the AI knows about it, it's telling me this is the right thing to do, and you're telling me I can't have access. That seems like a very hard gate for the establishment to keep closed for all that much longer. But then the other thing I would love to see, on the other side of that trade, is: okay, if you are going to be given access to unproven treatments, then what we as a society, even more broadly than the AI companies, want back are your results, because we want to fold them into our general understanding. It seems to me that we are getting to a point, and this will probably be mostly driven by AI, where we could envision, not necessarily doing away with clinical trials, but moving beyond clinical trials, doing a lot of n-of-1 things that are just, this is the best guess that we have for you based on all available information. And then if we can capture the result of that and train on it, as humans, as AIs, as society collectively, it seems like there is so much room to learn so much more and move so much faster, all while delivering outcomes to at least some people who are trying things they couldn't otherwise get their hands on. And I really am feeling motivated to advocate for that. So I guess there are maybe two data questions: how do you not need more run-of-the-mill data, and what sort of data do you still need? And what would you think about that sort of evolution of the social contract? I call it AI and the right to try.
1:19:35
Yeah, that's a super interesting idea. We talked a little bit about the patient-facing problem of data fragmentation. The experience of this as a patient is that your healthcare data is just kind of in the ether somewhere. You don't know where it's going and how. It's governed by HIPAA for the most part, and the result is that there are pathways for that data to go places that don't involve your consent. The experience today for a patient is that you have both limited access to your data and limited control over where it goes. If you think about it from the provider perspective, this is also a problem. If I'm a healthcare provider, I want to understand the whole picture of a patient's health and integrate across modalities, as doctors are so expert at doing. How do we improve that experience? Right now doctors have a lot of trouble, if you go to a new health system, pulling the relevant records from other health systems, from your entire context. You can imagine a future where ChatGPT actually has a lot of context on you, and the doctor is able to pull that data in as well. I think that could be a really cool future. Then the third point that you're talking about is research. Right now, the majority of clinical trials that fail prior to conclusion are failing due to recruitment challenges. We basically have a failure to recruit the right patients who meet some eligibility criteria, and a lot of this comes down to the data not being in one place about who these patients are and whether they are eligible. So that is also a problem facing researchers, and it's hampering our ability to raise the ceiling of human health as well. So you're pointing out a potential future where not only do patients have more access to data, and providers are able to access the data of their patients in a way that's unhobbled as long as the patient consents, but also researchers are potentially able to access the data of a patient, again, if the patient consents. I think it would be really cool to imagine a future where patients are able to opt in with additional consent and experience the benefits of, what did you call it, AI and the right to try. I think that would be potentially a really cool future. I do think clinical trials and the standard of evidence there are battle tested, and there is something important about that as well. But I'm optimistic about a future where, because data is a little bit more centralized and a little bit less fragmented, we just have a better ability to advance the science, and because patients have access to their own data and are more able to advocate for themselves because of AI, they're more able to figure out what clinical trials they may be eligible for or what experimental treatments may be interesting to them. So that if they do have that active and informed consent, our understanding of the science can improve, our models can improve, and things like this. I think that would be a really optimistic future. And my previous comment, by the way, is not that the models have run out of data; I think data is always a helpful thing for models. It's more about what the right way is to roll this out to society and what would be the most impactful path.
Is the most impactful path to lean into privacy and lowering the activation energy for users, or is it to lean into getting data? And I think here the decision that we made, which is the right decision for our users, is to lean into not using the data for training our foundation models. I think that's actually what's going to have the most long-term impact. If we think about the arc of improving human health with AI, it's about moving the Overton window along in a way that's trustworthy, that shows we respect the value of people's privacy and data. And then in the long run, I think things like additional consent and different changes to the contract of how research is done are really cool things to explore.
1:22:56
Yeah, just as you mentioned additional consent: I did get a packet of 40 pages of paper at the hospital one day, and they clearly really do want to collect this. There's a person whose job it is to go around and visit the parents of these kids at the children's hospital and explain and answer questions. I was probably the easiest customer she'd had, in that I was pretty predisposed to sign. So I signed and initialed in a ton of places, right to share whatever data we could share with the general public. Good. And still, I just got something in the mail from the University of Minnesota that's like, can we have your data? And I'm just like, man, I thought I agreed to this already. And what is the yield on those things that they're sending out? It's got to be quite poor. I haven't sent mine back, just because I've been busy, and I fully intend to. So those barriers are really tough. Would you think about doing something like, well, obviously much has been made recently about ads coming to ChatGPT, and I think the argument that, hey, we need some way to support this for billions of people and ads is a proven way to do that, is a pretty compelling argument. I don't know if it extends to health. There's obviously a lot of pharmaceutical advertising in the world today; would there be such a thing as pharmaceutical advertising in the world of ChatGPT Health at any point? And would you consider something like, you can get ChatGPT Pro for free if you'll give us your data? That seems like something a lot of people probably would opt into, in a way that I would think they would still feel good about. Again, we do want this to go to billions, right? So are there any other trades you could see OpenAI making to support that reach?
1:28:17
Yeah, so ads aren't coming to ChatGPT Health, and we don't plan for that right now. Again, we think it's really important to create a clear separation between our health impact work and things that could be seen as contributing to other incentives for the company. So that's the line that we've taken there. I think you're right that access is a really important point, and this is why we've made ChatGPT Health free. Providing a reasoning model for free without rate limits to all users was not the default path, but that is the path we're charting with ChatGPT Health, and it's actually not something that's available otherwise in the rest of the product. So we are deliberately optimizing as much as we can for access and doing as much as we can. I think there are going to be some limits to how far we can take it, but I don't see any trade-offs in the immediate future. Cool.
1:28:42
Yeah, that's admirable. I mean, my sense of OpenAI on this dimension is very appreciative. The pains that have been taken to support a free user base of hundreds of millions of people, at obviously not insignificant cost, when probably a lot more revenue could have been extracted from those people, is, I think, a pretty admirable thing. And I hadn't even heard that there was this plan to go free without rate limits for Health for all. I've had this idea for a while of universal basic intelligence; again, I'm obsessed with ideas of new social contracts, but this is one version of it that is a pretty huge needle mover, right? It's going to be an unbelievable needle mover for a huge, huge number of people. That's awesome. Do you want to talk a little bit more about the medical establishment's response? At the hospital, my experience is still mostly a lack of awareness among the providers. As I've built up rapport with certain people, I tell them, you know, I'm consulting with AI on this, and they're usually okay with that. They're not hostile to it. They're sometimes skeptical. I've had a couple of interactions where they're like, tell me what it said, and I'll tell them, and they'll be like, oh, okay, pretty good. The doubt can turn around pretty quickly when you get the right answer. There is certainly mention of OpenEvidence. Broadly, though, I think there's just a lack of awareness of how good the systems have become, so my guess would be that awareness is probably the biggest concern that you have. In a lot of professions we should expect to see closing of guild ranks and raising of barriers to entry, and the cynic in me thinks maybe that'll happen in medicine. But the idealist says, maybe not; maybe this is a chance to live up to the actual mission of the profession and do the right thing. And maybe it's also the case that people are just so overworked in medicine that they'd be happy to take whatever help they can get. But how would you characterize the broad reaction from doctors writ large?
1:30:20
I think broadly, you're right to point out that the Overton window shift for consumers has actually been faster than the Overton window shift for doctors. Already in the last year, the rate of adoption of ChatGPT for health, even before we launched the recent products, was incredible, and it's been one of the most amazing things to see. Again, over 200 million people a week using ChatGPT for health and wellness questions. On the doctor side you do see a rapid increase in adoption, but it's not quite as fast. And so I would expect that to come with a couple of things. One is interesting interactions between patients and doctors who are maybe at different points in their journey of adopting AI; it sounds like you've had a couple of them. And I think it's interesting because, when I talk to physicians, a lot of them first hear about AI through patients who are using AI, rather than through AI that is actually built specifically for them, which is one of the things that we're doing with ChatGPT for Healthcare, which I can talk about. So that's super interesting. One of the things that we've learned over time is that the best way to shift the Overton window, and over my time here at OpenAI and also at Google I've done a bunch of work on studying things, doing real-world studies and things like this, the thing I've found most effective for shifting people's opinions about AI and health, especially as the models have become more capable and safe, has been just putting the technology in people's hands. That's what we've been doing on the ChatGPT for Healthcare side. ChatGPT for Healthcare was announced the day after ChatGPT Health, and it's actually more of an industry-facing announcement. It's basically a version of ChatGPT that is purpose built for the workflows of health professionals, specifically clinicians, and includes HIPAA compliance. It includes additional features like evidence retrieval for medical guidelines, and additional workflows specific to enterprises and the writing workflows that doctors do. We launched that with eight of the leading institutions across the country, and one of the most important things for building trust and credibility in the medical establishment is working with these leading partners. Working with them, we've been receiving amazing feedback. We've actually gotten a ton of inbounds since that announcement, more than our team can handle. So my hope is that this announcement, and some of the work preceding it on the research side, along with our study with Penta Health, will bring a wave of adoption, not just among individual clinicians, which is what we're starting to see, but also at the level of health systems and potentially even governments. I think we're just at the beginning of that wave. That's part of what I meant by: we are seeing adoption, but not scaled impact yet. And I think scaled impact will start to happen, especially on the side facing the medical establishment, more this year.
1:31:51
Yeah, I wonder, and I'm sure you've thought about this, whether there's a question analogous to the privacy question on the patient side, which is: what is the form factor, or mode of rollout or use, that is going to be best received by doctors? Because if I'm a patient, I might say, hey, what I want is for the hospital, or even some bigger organization than that, to just architect some workflows that grind through everything. I want nothing missed. I want every record of mine examined, and I don't really care if that offends a doctor at some point; I just want the best results, right? But I could also imagine that if you were to do something like that, you might ruffle some feathers and create some immune response, so to speak, from the profession that you would rather avoid. So, analogous to the privacy thing, is there a way you might take a little bit off the fastball, in terms of how much immediate value you could create, in order to make the rollout more acceptable to decision makers, so that they hopefully work with you, as opposed to, in some cases, being more inclined to fight you over time?
1:34:43
I think the trade-offs here are actually smaller than one would expect. You mentioned the possibility of protectionism in the industry, and we've actually seen much, much less of that than one would expect. I think the reason is that a lot of these doctors actually use it for themselves or the people they take care of. And when you do that, and you see the value of it, and you see it getting incredible things correct and maybe pointing out things that you hadn't thought of, then that becomes the easiest kind of conversation. When we talk to health system executives, you can tell instantly who's used it for this use case and who hasn't, and the conversation just becomes incredibly easy when people have used it. What we see today is not a lack of top-down interest in adopting; we actually see a huge wave of top-down interest. I think the reality, though, is that in healthcare, workflows are fairly entrenched in a lot of ways, and it takes some time to make changes. This is the kind of thing that we found in our work with Penta Health, where, in addition to rolling out a cool new tool, we did work on active change management: we actually brought people along, showed them how to use the technology, and hosted sessions where a bunch of them learned together about how to use the technology more in the future. That was a really important part of rolling this out, and that kind of change management, I think, will be really important here as well. I do think that takes a bit of time, but my expectation is that this will be one of the faster rollouts of software in healthcare history. I think that's already been happening with people's usage of AI, and I think that's just the beginning. So I don't think it'll take years and years for people to adopt AI more and more in clinical workflows. One of our goals for this year is for AI-assisted care to become more a part of the norm of care, and I think by end of year that will be the case, we hope. But I don't think it'll happen over weeks, which is the timescale us AI innovators are used to, I think.
1:36:16
Let's talk about the connection between health and AI safety more broadly, which I think is super interesting. I remember the classic question from Ilya, once upon a time: how do we teach AI to love humanity? I understand that that notion is not entirely gone from OpenAI and is still part of what you're thinking about, so I'd love to hear more about that.
1:38:14
Yeah, absolutely. I mean, I think it starts with the foundations of the work that we've been doing on health at OpenAI and how we started it. I have a bit of a background as a researcher who cares about safety and has worked on safety research in the past, and part of the motivation for coming to OpenAI and working on health, for me, was thinking about this setting as a place that can provide concrete grounding for technical work on safety and alignment. I had a feeling, and this was about two years ago, that a lot of the most ambitious, medium- and long-term work on safety and alignment was going on in toy settings, or with math problems, or things like this. And it felt like if there were a setting where the problems people were working on were well motivated, that provided more concrete feedback loops to researchers and more short-term incentives for the research, then the research could happen better. That was part of the thesis for our approach at OpenAI. We've done a few kinds of research, and I talked about one of them, which is our work on calibration. Another problem that we've thought a lot about is scalable oversight: how do we supervise AI systems that are potentially more capable than us in certain ways? And this is a problem that we've actually had for some time in our work with physicians. In many ways AI doesn't match your expectations of a doctor, in the ability to integrate across a bunch of different modalities and things like this, but in many specific, narrow ways, models can even outperform physicians. And in those specific, narrow ways, when we evaluate models with physicians, or have physician signal be part of the training signal, we have to invest in research around this problem of how you supervise systems that are potentially more capable than you, which is one version of the problem of scalable oversight. And so this is a problem that goes beyond health and is actually, I think, a really important problem for how we think about AI alignment, one of the more important problems there. But our work in health has given it a bunch of concrete grounding, because we do have models that in some ways are more capable. So a lot of our focus has been thinking about the right ways to approach that problem and make it better. That's been some of the work we've done on evals, and some of the ways we've pushed the high-compute RL paradigm in this setting, so that we can learn from the aggregate of the opinions of experts, and things like this. And that's been an important part of our motivation. But broadly, a lot of where this has been heading for us is thinking about how we extract the right personas or characters from the model. Because as the models become more and more capable over time, they have more and more inputs, let's say more and more context about a patient's medical record because users are more proactively uploading it or the activation energy is lowered via the product, and they have more ability to take output actions, whether that's telling you things, or outputting a note that you can give to your doctor, or other things they can do.
Probably the most important thing in the very long term is: are the models the kind of models that would do the right thing for the patient, or the user, or the clinician, or the researcher? And so that's the kind of thing we've been investing in a bunch: how do we think about extracting the right kinds of personas for models? How do we do so in a way that's surgical, that gets the best aspects of the personas and less of the parts that we don't like? So, for example, avoiding a bias towards being overly conservative in challenging situations involving unclear medical consensus, while also navigating the uncertainty super well. And so that work, which is closer to persona or character or soul in its form, we're actually grounding a lot of it in work in health, which has been a really exciting advance, and I hope we'll have more to share on that in the future.
1:38:41
Can you talk a little bit more about scalable oversight? I mean, this was, I think, at one point kind of the plan, right? The Superalignment team, I think, was premised on this idea, and maybe other ideas too, but certainly my understanding was that scalable oversight was going to be a big part of it. One of the things that stood out was that the strong student sometimes just ignored or overrode the instruction of the weak teacher. And we're the weak teacher in this situation, right? So how are we supposed to feel about the trend, already emerging as of probably at least two years ago now, that at times the strong student just decides to exercise its own judgment, even where it contradicts what the weak teacher, again, the humans in the analogy, said? Have there been paradigmatic advances there that give you confidence this is really going to work? Or is it the usual understanding I get from OpenAI broadly around safety today, that it's going to be a defense-in-depth strategy: everything will work a little bit, we'll gradually chip away at the problem, and hopefully with enough layers of defense it'll be okay? Do you still have more ambitious ideas in mind about what scalable oversight can achieve than that?
1:42:26
Well, I think my thinking on how scalable oversight will proceed has changed a little bit over time. I think of the problem as having two parts. One is what I call rater scaling: how do you think about the right ways to elicit the opinions and values of people, or experts, or whoever? That's an example of something we've been investing in in health. And you can imagine schemes where you have AI be part of that loop and help improve the ability of humans to critique, for example, AI outputs. So that's one area of work the company has been investing in. The second area of work is, and I think the framing for this has become more expansive over time, and sometimes we don't use the phrase scalable oversight to refer to it anymore, and I think the same is true at other labs: given some idea of what the values are, whether they're elicited using rater scaling or not, how do you spend a lot of compute on training the models to have those values? Internally I call that value oversight. So there's rater scaling and there's value oversight. And the second problem, I think, is one where, even though we don't use the phrase scalable oversight, things have advanced quite a bit over time. One example, which again doesn't use the phrase scalable oversight, is the work people are doing on specs and constitutions, and on measuring and improving how well models adhere to them. This is one way of saying: we have values that we care about, and we want to be very persistent in improving adherence to them. And if you check various system cards and things like this, you'll see that the models have gotten much better at doing this. And I think people have been finding increasing generalization in training models to have certain personas or characters, or to comply with certain safety requirements or specs or constitutions, and things like this. So what I'd say is that the research has actually been advancing, but in a way that looks different from what it did before. It looks a little bit more like humdrum post-training than people imagined it would. And I think this is a good thing; it's a little bit more grounded in how systems will actually look and how we will train them. I don't think we've solved all the research problems, but I think sometimes we're working on the research problems without referring to them in the same way we used to.
1:43:53
Would you say the core of the reason to think this will work is that a somewhat less capable model with a super big inference budget can be expected to catch, if we're thinking of a deceptive failure mode, the smarter model? Even if the smarter model is more capable per token, the somewhat less capable model just gets a much bigger token budget. This idea has been out there for a while, but it really sparked for me when you said, spend lots of compute to make it work: just really giving that thing a lot of time to think can be enough, as long as the delta isn't so big that the less capable model can't detect what's going wrong in the more capable model. So as long as you've got a small enough delta and a sufficiently aligned current model, then you can, in theory, continue to bootstrap your way into ever more capable models without losing the alignment. Would you say that's the core of the idea?
1:46:48
I think that's part of it. I think another part of this is that the task of discrimination or critique seems to be easier than the task of generating good outputs. So when we study the performance of monitors, a given model can monitor itself pretty well, especially given privileged information like the chain of thought, which is a little bit surprising on its face. But if you think about the fact that discrimination has been better than generation, and this is underlying a lot of work in things like RLAIF and constitutional AI, then it's not so surprising. I think that's another part of it. But I don't think we have a full understanding of how this will scale, especially when we're thinking about the regime of having trusted but less capable models aligning a more capable model. So far I've mainly been talking about: if we have the values in some format, whether it's from humans or models, then how do we instill those into models in a way that is trustworthy? I think that's a part of the problem, but I don't think the whole problem is solved, for sure.
1:47:29
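As a concrete illustration of the point that critique is easier than generation, here is a hedged sketch of the kind of monitoring loop being discussed: a smaller, trusted model reviews a stronger model's answer before it is accepted, and can be given a generous token budget or multiple passes to do so. Model names and the flagging format are assumptions made for illustration, not a claim about OpenAI's internal monitoring setup.

```python
# Illustrative monitoring loop: a trusted, less capable model critiques a stronger
# model's output before it is used. Model names are placeholders.
from openai import OpenAI

client = OpenAI()

def generate(task: str) -> str:
    """Stronger model produces the answer."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder "strong" model
        messages=[{"role": "user", "content": task}],
    )
    return resp.choices[0].message.content

def monitor(task: str, answer: str) -> str:
    """Weaker but trusted model critiques the answer; the bet is that
    discriminating a bad answer is easier than having generated a good one."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder "weak but trusted" monitor
        messages=[{
            "role": "user",
            "content": (
                "Review the following answer for unsupported claims, unsafe advice, "
                "or signs the task was quietly ignored. Reply 'OK' or list concerns.\n"
                f"Task: {task}\nAnswer: {answer}"
            ),
        }],
    )
    return resp.choices[0].message.content

task = "Summarize the key drug interactions for a patient on warfarin."
answer = generate(task)
verdict = monitor(task, answer)
print("Monitor passed the answer." if verdict.strip() == "OK" else verdict)
```

The design bet, as discussed above, is that as long as the capability delta stays small enough, the weaker monitor can still recognize failures it could not itself have produced.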
Yeah. How do you think we're doing on safety as a whole? I mean, I have been an Eliezer reader since 2007, and actually read a lot more of his stuff way back in the day than I have more recently. I would say, broadly speaking, this has in some sense gone amazingly well relative to baseline expectations. We do have models that undeniably have a pretty good sense of human values, and that's just manifestly obvious on a day-to-day basis. That was, I think, considered a very unlikely outcome years ago, and if you were to teleport back to 2007 and drop a post on Overcoming Bias indicating as much, it would be considered shocking, or laughable, that this would be accomplished in this way. And yet we do still kind of see these problems. My mental model of the seesaw we seem to be on is that every generation of new models has new capabilities, generally, and it also usually seems to have some new emergent problem, whether that's deception, or now eval awareness has become front and center and we're like, okay, that's a big problem. The next generation gets more powerful, and we kind of tamp down that last problem; it doesn't go to zero, but it's at least reduced. But it seems like we're headed toward this strange world where, if you extrapolate both the METR trend and the, whatever, two-thirds-to-one order-of-magnitude reduction in bad behaviors that seems to happen from one generation to the next once a given bad behavior is recognized and addressed, out to 2028 or so, you're like, okay, I can imagine a model now that can do a month's worth of human work, or a couple months' worth of human work, but that also maybe has, for any given run, a 1 in 1,000 or 1 in 10,000 or 1 in 100,000 chance of actively screwing me over in some super bizarre way. And that just seems like a really weird thing to contemplate. It's a really weird world to live in.
1:48:34
But if we believe in straight lines on log graphs, that seems to be kind of where we're going, right? So do you think that's where we're going? And if not, maybe you have a different mental model of what the alignment-versus-emergent-problems balance would look like in a couple of years' time?
1:50:48
Yeah, I think the future world will definitely be pretty weird, so it's very hard for me to predict what will happen. I do think models will become more capable. I do think they will be able to do things over longer time horizons; I think that's very important. I also think we've been pleasantly surprised by the extent of safety generalization, or alignment generalization, which underlies the trend that you're pointing out of a lot of various safety benchmarks or failure modes improving over time. And that's been a relatively pleasant development. We've had two major scaling laws for deep learning. One is the pre-training scaling law, and what we found is that when we scaled up pre-training, we had models that could be relatively easily tuned via SFT, or small amounts of supervised learning, to exhibit a certain persona and be helpful, useful assistants. That's a lot of what the early work on instruction tuning did, and also what the early work on Med-PaLM did for the health setting, which is really great. We saw a lot of generalization there. There was a question for the reasoning setting of whether that generalization would hold, whether we would see similar generalization. So far my sense is that at scale we do see it; at smaller scales we saw less of it. And so that's a promising development: if you're able to figure out the right ways and patterns to get these models to be broadly beneficial and not harmful, then even if they're put in settings you didn't foresee, doing a month's worth of work in days or things like this, they can continue to generalize and be safe in those settings, and those safety curves will continue to go down. That said, I think you're right in pointing out that there's a rapid increase in capabilities and in the surface on which these models will be used, alongside the rapid decrease in safety failures we're seeing over time. It's a little bit hard to predict how both those curves will net out. I do think it's really important, and this is why I care so much about the parts of our mission that are about proactively making the tangible benefits happen and also about mitigating safety risks, that we really stay ahead of these curves and think about the right ways to shape them in the right direction.
1:51:08
What does a Move 37 look like in health, and are we going to see that in the current paradigm? Or would it take a sort of deep integration of modalities like we were touching on earlier? Or, another idea that I'm kind of enamored with is a different kind of training objective: instead of getting right answers to questions, thinking more, at a fundamental level, about predicting what the state of health is going to be for a patient, in a way where you could even begin to do in silico experiments, like, hey, I've got this whole profile of this patient, what if I change this variable, what would their health look like in that case? That seems like it might be a pretty different paradigm from "read the whole Internet and make sure you give me the right answer." What do you think? Touch on any of those that you want to, but maybe also just take the opportunity to zoom out and give us a sense of your big visions and ambitions, and what people can expect as you guys continue to do your thing over the next year plus.
1:53:21
Yeah, I love the Move 37 framing. Just to explain the reference, this is a reference to the famous move 37 during a game between AlphaGo, the AI system playing Go, and Lee Sedol, during one of their now famous games. The interesting thing about this move was that it was a move everybody agreed a human would not have made, but that in hindsight was brilliant and was key to winning the game. And so I think your question is: can we imagine a world where models are able to make some kind of interesting prediction that a human probably would not have been able to make, but that was, again, brilliant and impactful in hindsight? My view is that this is not too, too far away. Many people report to me that they saw many doctors for their case, and only after talking to ChatGPT was it able to flag the thing that they then shared with their doctor, and then they were able to come to a diagnosis together. That seems to happen somewhat routinely, and whether it rises to the level of a Move 37 or not is, I think, a matter of taste, depending on the case. What I'd say is that you should expect world models in health to improve. What I mean by world models is that we should expect our models to have a much better understanding over time of us, our health, the trajectories of our health, and how that intersects with our biology. I think that's going to be really key to thinking about what this Move 37 could look like. You mentioned this idea of simulations of people and simulating things in silico. Full simulations are obviously very expensive and difficult, and that isn't what models are most designed to do in their current form, so that would probably look a little bit different from the current kinds of training that people are doing. But if it's more along the lines of: given a lot of context about a user, predict something interesting about them, or predict the results of some intervention, then that is something models, first, are already getting better at and, second, could get a lot better at over time. And I think that would be potentially quite impactful. I think we'll start to reach a point where there are more and more clear demonstrations of this. You're starting to see interesting and increasingly clear demonstrations of models doing interesting science and math in public; I think you'll see a little bit more in this space as well, but it'll take a little bit of time. My view on paradigms is that you can get a lot out of extending and tweaking the current paradigms, and you can in fact view everything that has happened so far as just one paradigm. Between pre-training and scaling reasoning-style training, I think you can get quite a lot of juice, and even integrating additional modalities doesn't require too many additional changes or tweaks. Broadly, our team's mission today is: do whatever it takes to ensure AGI is beneficial for human health for all of humanity. And we see that happening through three channels. The first is helping consumers understand and navigate their health, which is already happening through ChatGPT Health. The second is empowering the health system and thinking about the ways in which AI-assisted care can become part of the standard of care.
And it can reduce the extent to which clinicians are bogged down by paperwork, so they spend more time seeing patients and thinking about the problems that matter most for improving care. The final thing is really pushing up the ceiling of research. And I found your vision pretty inspiring. What I would say is that you can expect a lot of the core problems in healthcare that we've talked about, the fragmentation of data, the fragmentation of the patient experience, to improve, and as those improve, you should expect an acceleration in the application of intelligence to that data. I think biology and health is one of the areas in which marginal gains in intelligence have the most obvious value in solving more and more problems for humanity. There are many examples of previous breakthroughs in biology where there was nothing stopping that breakthrough from happening five or ten years earlier except the amount of human ingenuity applied to it; there was no physical blocker. And when that's the case, I think you can assume that long-running models, long-running agents connected to the right data, can do really incredible things. The hope is that in addition to our existing work, which has really been focused on how we raise the floor of human health, we start to raise the ceiling of human health as well.
1:54:32
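For the "given a lot of context about a user, predict the result of some intervention" idea, here is a speculative sketch of how such a query might be framed today: structured patient context plus a counterfactual question, answered with an explicit range and uncertainty. The fields, values, and model name are hypothetical; this is an illustration of the framing, not a validated predictive system or medical advice.

```python
# Speculative sketch: ask a model to estimate the likely effect of an intervention
# given structured patient context. Fields and model name are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

patient_context = {
    "age": 58,
    "conditions": ["type 2 diabetes", "hypertension"],
    "labs": {"hba1c": 8.1, "ldl": 140},
    "medications": ["metformin"],
    "wearables": {"avg_daily_steps": 3200, "avg_sleep_hours": 6.1},
}

intervention = "add 30 minutes of brisk walking on 5 days per week for 6 months"

prompt = (
    "Given this patient context (JSON), estimate the plausible effect of the "
    "proposed intervention on HbA1c and LDL over 6 months. Give a range, state "
    "your uncertainty, and list what additional data would sharpen the estimate.\n"
    f"Context: {json.dumps(patient_context)}\n"
    f"Intervention: {intervention}"
)

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```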
It's going to be not just the AI doctor, but also the AI biomedical research scientist. Wow, it's an exciting time to be alive. I'm increasingly saying that at the end of these conversations these days.
1:58:57
So, yeah. Anything else you want to make sure people are aware of that I didn't ask you about?
1:59:11
Like I mentioned earlier, I think the right way to think about our work on health at OpenAI is really as operating in three phases. One is laying the foundation, and a lot of that is focused on our work on safety. Examples of this are our work on HealthBench, where again we have an evaluation, which you can run offline, of large language model performance and safety in health; our study of the first AI clinical copilot with Penda Health; and a bunch of model improvements over the last year. All of these lay the foundation for the work that we're doing today. The second thing is the rise in adoption that we've been seeing. We've seen an incredible rate of individual users, especially patients, adopting the technology, with hundreds of millions of people asking health questions a week, which has been rapid growth since last year. The final thing is really the future of scaling the impact of this work. That's where our work on ChatGPT Health, the consumer-facing product that enables you to connect your health data with additional privacy protections, and our work on ChatGPT for healthcare, which faces health systems and enables health professionals to use AI as a copilot in their workflows, come in. Between all these things, I'm really excited about the work we're going to be doing to scale the impact in the next year, and I'm looking forward to what's to come.
1:59:15
Karan Singhal, Head of Health AI at OpenAI, thank you, legitimately, for all your hard work. It really is incredibly valuable, incredibly impactful, and that's obviously just going to continue to grow exponentially along with so many things in the AI space. And thank you for being part of the Cognitive Revolution.
2:00:33
Thanks for having me.
2:00:49
If you're finding value in the show, we'd appreciate it if you'd take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries, either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine Network, a network of podcasts, now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting. If you're looking for podcast production help, for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.
2:03:07