Artificial intelligence chatbots like ChatGPT are increasingly being used for health advice, but a new study reveals they may dangerously miss high-risk emergencies. According to OpenAI, health questions are among the most common uses of its AI systems, a demand that led to the launch of ChatGPT Health earlier this year; the tool already boasts tens of millions of users. However, research published in Nature Medicine indicates it cannot be relied upon to safely direct people to urgent medical care when they need it.
Critical Gaps in AI Triage Capabilities
The study, conducted by researchers at the Icahn School of Medicine at Mount Sinai, was fast-tracked for publication given the urgency of assessing AI safety in medical contexts. It found that ChatGPT Health failed to clearly recommend an emergency room visit in more than half of the cases where doctors deemed one necessary. Lead author and urologist Ashwin Ramaswamy explained the motivation behind the research: "We wanted to answer a very basic but critical question: if someone is experiencing a real medical emergency and turns to ChatGPT Health for help, will it clearly tell them to go to the emergency room?" The answer, according to the findings, is often no.
Alarming Inversion of Risk Alerts
One of the most concerning discoveries was that the system's alerts were "inverted": the higher the risk of self-harm or a severe medical issue, the less likely ChatGPT was to trigger an alert. The researchers described this finding as particularly surprising and troubling, as it points to a fundamental flaw in how the AI assesses danger. To test the system, the team constructed 60 scenarios spanning 21 medical specialties, ranging from minor conditions manageable at home to genuine emergencies, and ran each under 16 contextual conditions that varied factors such as race and gender.
Performance in Textbook vs. Complex Cases
While ChatGPT Health handled clear, textbook emergencies adequately, it struggled with more nuanced or less immediate dangers. The tool was insufficiently concerned in many cases where medical professionals would have recommended emergency care, suggesting it lacks the judgment clinicians apply in ambiguous cases. Isaac S. Kohane of Harvard Medical School, who was not involved in the research, commented on the stakes: "When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional."
Implications for Public Health and AI Regulation
The research underscores a significant gap between the public's widespread reliance on AI for health advice and the limited understanding of how well it performs in life-or-death situations. As AI becomes a first stop for medical guidance, the study calls for more rigorous, independent evaluations to ensure patient safety. The paper, titled 'ChatGPT Health performance in a structured test of triage recommendations,' emphasizes the need for AI systems that can better identify high-risk scenarios without raising unnecessary alarm.
