Since ChatGPT successfully passed the medical licensing exam, can doctors choose the chatbot for a “curbside consult,” as proposed in a recent New England Journal of Medicine (NEJM) special report?
That might not be an intelligent decision – at least not yet – according to findings by researchers at Stanford’s Human-Centered Artificial Intelligence (HAI) group. The researchers bombarded the bot with 64 clinical scenarios meant to assess its safety and usefulness after first instructing GPT-4, “You are assisting doctors with their questions.”
The NEJM special report concluded that GPT-4 “generally provides useful responses,” without giving specifics. However, the Stanford team reported that GPT-4’s responses agreed with the correct clinical answer 41 percent of the time. In baseball, a .410 batting average puts you among the best hitters ever. In medicine (if the Stanford data holds up), it suggests that passing an exam doesn’t necessarily make you a good doctor.
Still, GPT-4’s abilities are impressive. To start with, there was a giant jump in capabilities just by going to GPT-4 from GPT-3.5, the better-known OpenAI software unveiled to consumers by Microsoft. When GPT-3.5 was instructed to “Act as an AI Doctor,” its responses agreed with the known answers just 21 percent of the time. Even in baseball that earns you a quick ticket back to the minor leagues.
Moreover, when it came to “First, do no harm,” both ChatGPT versions performed about as well as the average physician. A National Academy of Medicine report on diagnostic error concluded that by a “conservative estimate,” five percent of U.S. adults experience a diagnostic error every year, “sometimes with devastating consequences.” By comparison, 91 percent of GPT-3.5 and 93 percent of GPT-4 responses were deemed safe, with the remainder due to AI “hallucinations.”
“Hallucinations” is how techies describe what happens when AI confidently conveys information that’s either irrelevant, wrong or made up. The rate of similar behavior by human doctors was not mentioned by either the NEJM or Stanford researchers, although a Harvard computer scientist and physician reportedly says in an upcoming book that the chatbot performs “better than many doctors I’ve observed.”
Meanwhile, Stanford clinician reviewers were unable to assess whether the GPT-3.5 responses agreed with the known clinical answer 27 percent of the time. For GPT-4, the “can’t tell” rate was a slightly higher 29 percent.
The Stanford study was placed online in a blog post entitled, “How Well Do Large Language Models Support Clinician Information Needs?” It was based on questions collected during the “Green Button” project, which analyzed data on actual patients from Stanford’s electronic health record (EHR) in order to provide “on demand” evidence to clinicians. (Doctors don’t actually push a button; they type in a query.)
In contrast, the OpenAI GPT (Generative Pre-trained Transformer) chatbots are at present trained on complementary sources; i.e., the medical literature and information found online.
Two of the Stanford informaticists involved in the study, Nigam Shah and Saurabh Gombar, have retained their academic affiliations while also co-founding, along with Brigham Hyde, a company called Atropos Health. The start-up provides similar on-demand, real-world evidence to clinicians.
The Stanford study, the NEJM special report and an accompanying NEJM editorial all agreed that while caution is crucial, GPT technology holds enormous promise.
“GPT-4 is a work in progress,” noted the special report authors, who have all worked with the technology on behalf of Microsoft, “and this article just barely scratches the surface of its capabilities.”
Meanwhile, STAT reported that Google will distribute its Med-PaLM 2 generative AI tool for testing with a select group of Google’s cloud computing customers over the next several months.