Stochastic Parrots, Argumentative Apes, and Medical Diagnosis
Why we want Dr. Petrov, not Dr. House

I don’t think we’re grappling enough with the fact that ChatGPT is better at doing important doctor things than most doctors are.
I don’t mean grappling with the existential questions brought up by the technology’s prowess (I find those uninteresting) or the employment ones (I talk about those here and here). I mean grappling with the realization that a core part of the practice of medicine almost certainly needs to change, a lot.
The part I’m talking about is diagnosis: coming up with a hypothesis / guess / judgment about what’s wrong with someone. The tools to support diagnosis keep improving — X-rays to CT scans to MRIs, for example — but the all-important step of formulating the diagnosis has been remarkably consistent throughout our long history of getting hurt and sick. An elite person gathers information, then makes a judgment about what’s wrong with us.
The title, training, and methods of that person have varied a lot over time, but their elite status hasn’t. People who can tell us what’s wrong with us are important, for obvious reasons. We value their ability a great deal1 and hold their judgment in high regard. So, ahem, do they.
Artificial General Medical Intelligence
So what happens when a superhuman diagnostician suddenly appears? One that’s close to free and available Internet-wide? One that wasn’t developed within the medical community, and that treats crafting a valid diagnosis with the same care, attention, and gravitas as coming up with a chocolate-chip cookie recipe? One that’s been described as a “stochastic parrot” and “glorified auto-complete” software?
I ask because that’s where I think we are now. I think that because of a recently published study that really and truly buried the lede. The headline finding of “Large Language Model Influence on Diagnostic Reasoning: A Randomized Clinical Trial,” published in JAMA Network Open, was that giving doctors access to an LLM did not significantly improve their diagnostic reasoning compared with conventional resources.
Yeah, OK, fine. Looks like medical diagnosis isn’t really LLMs’ thing. Especially unspecialized LLMs. This study didn’t use a Generative AI system created for diagnosis, or medicine in general. It used plain old “ChatGPT Plus [GPT-4].” Or maybe more time is needed to “effectively integrate LLMs into clinical practice.” In any case, the writeup of this study at the start of the article gives off a “nothing to see here” vibe.
But Holy St. Ava of the Transformers, is there something to see here. It turns out that the researchers also compared A) the doctors who used ChatGPT and B) the doctors who used other stuff but not ChatGPT, with C) JUST CHATGPT. They used good ol’, plain ol’, ChatGPT Plus. And gave it a straightforward prompt:
“You are an expert internal medicine physician solving a complex medical case for a test. You are going to receive a case vignette. After reading the case, I want you to give three parts of information.” (Part 1 was three possible diagnoses; part 2 was the final diagnosis; part 3 was recommended next steps in the diagnostic process.)
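If you’re curious how little machinery that takes, here’s a minimal sketch of the setup in Python. The prompt wording is the study’s; everything else (the model ID, the diagnose() helper, the sample vignette) is my assumption, not a detail from the paper.

```python
# Minimal sketch of the study's setup: one standing prompt plus a case
# vignette, sent to a general-purpose chat model. The prompt text is from
# the paper; the model ID, helper name, and sample vignette are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

STUDY_PROMPT = (
    "You are an expert internal medicine physician solving a complex medical "
    "case for a test. You are going to receive a case vignette. After reading "
    "the case, I want you to give three parts of information."
)

def diagnose(case_vignette: str) -> str:
    """Send the standing prompt plus a vignette; return the model's free-text answer."""
    response = client.chat.completions.create(
        model="gpt-4",  # the study used ChatGPT Plus (GPT-4); the exact API model ID is a guess
        messages=[
            {"role": "system", "content": STUDY_PROMPT},
            {"role": "user", "content": case_vignette},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage; the real study fed in structured clinical case vignettes.
print(diagnose(
    "A 47-year-old woman presents with two weeks of fevers, night sweats, "
    "and a new heart murmur..."
))
```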
Across the six cases it was given, ChatGPT playing doctor got a median score of 92%, which is a lot better than the 74% median notched by the actual doctors using conventional resources and the 76% by doctors with access to ChatGPT.2 And as the figure below from the article shows, ChatGPT had fewer truly lousy scores than either of the two human groups (individual scores are the grey dots).
I hope you’re currently asking yourself how and why ChatGPT without humans managed to do so much better than humans with ChatGPT. Because that’s a question we need to be grappling with.
How Right We Are
It turns out there’s been a lot of research, spanning more than half a century, on what happens when we bring together human judgment (about, for example, what’s wrong with a patient) and algorithmic prediction (about what’s wrong with the patient). A consistent conclusion from this work is that human + algorithm usually does worse than algorithm alone. This 2023 paper, for example, found that “90% of the [bail] judges in our setting underperform the algorithm when they make a discretionary override, with most making override decisions that are no better than random.”3 Sociologist Chris Snijders summarizes this literature: “What you usually see is [that] the judgment of the aided experts is somewhere in between the model and the unaided expert. So the experts get better if you give them the model. But still the model by itself performs better.”
Sometimes humans underperform because they ignore the algorithm. But in other cases, apparently including the JAMA diagnosis study, something else happens: humans use the algorithmic output, but only the parts of it that support what they were going to do anyway. The doctors who had access to ChatGPT put it to work with the goal of confirming their diagnoses rather than testing them. As the NYT writeup of the study put it:
It turns out that the doctors often were not persuaded by the chatbot when it pointed out something that was at odds with their diagnoses. Instead, they tended to be wedded to their own idea of the correct diagnosis.
“They didn’t listen to A.I. when A.I. told them things they didn’t agree with,” [study coauthor Adam] Rodman said.
We should stop being surprised by this behavior. As psychologist Jonathan Haidt and others have been telling us for a while, judging and justifying are separate processes in us humans. The judging happens immediately and subconsciously. Then comes the justifying, a more effortful and deliberate process meant to confirm the judgment, not to test it or subject it to real scrutiny.
Some people come up with better judgments than others, and our judgments tend to improve with practice if we get feedback on how good they are and how they could be improved. So we expect, correctly, that an experienced physician would make more accurate medical diagnoses than would a layperson.
But we should also now expect that Generative AI is going to be a better-than-human-doctor diagnostician. I bet it’ll also prove to be a better-than-expert-human-doctor diagnostician. It’ll be interesting to learn which kind of GenAI will be the best diagnostician. Will it, for example, be a medical model trained only on correct diagnoses (“narrow” AI), or will it be closer to a “Hey, here’s all the text we could find in digital form” model like ChatGPT (“general” AI)?4
The Ape at the End of the Loop
Does all of the above imply that we should be working to get us humans and our bass-ackward reasoning abilities entirely out of the loop of medical diagnosis as quickly as possible? Obviously not, because the JAMA article was only about a single study and caveat caveat etc. And also for a less obvious reason: we people actually might have a lot to contribute, even in the already-here-or-rapidly-approaching world of superior diagnosis via GenAI. Research suggests that we’ll probably be really good at productively critiquing auto-diagnoses.
In their 2017 book The Enigma of Reason, cognitive scientists Hugo Mercier and Dan Sperber advance a compelling explanation for why we’re so biased toward confirming our own judgments instead of stress-testing them. It’s because we have other people to do that stress-testing.
My flip oversimplification of their work is that evolution has engineered a division of cognitive labor in us humans: individuals have ideas and effortlessly come up with justifications for them. Just as effortlessly, other individuals come up with pretty good arguments against them. Folks get together, hash things out, and come to a pretty good decision or diagnosis or whatever. As Mercier and Sperber put it:
Solitary reasoning is biased and lazy, whereas argumentation is efficient not only in our overly argumentative Western societies but in all types of cultures, not only in educated adults but also in young children.
These insights give us a way to think about dividing up the important work of medical diagnosis between us argumentative apes and the science-fiction-become-reality AI we now have: put the humans at the end of the loop. Give them the AI-generated diagnosis, then the case vignette and other relevant information, then ask them to critique the diagnosis. Put our excellent argumentative capabilities to work, instead of continuing to rely on our self-justifying judgments.
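In software terms, that ordering is tiny to express: run the model first, then hand its answer to a clinician as something to argue against. Here’s a minimal sketch under that assumption; the function and field names are illustrative, not anything specified in the JAMA paper or by Mercier and Sperber.

```python
# Sketch of the proposed "human at the end of the loop" ordering: the model
# diagnoses first, and the clinician's job is to argue with the result.
# Function and field names here are illustrative, not a real system.

def build_critique_packet(case_vignette: str, ai_diagnosis: str) -> dict:
    """Package an AI-generated diagnosis as a critique task for a clinician."""
    return {
        # Shown first: the judgment to be stress-tested, not confirmed.
        "ai_diagnosis": ai_diagnosis,
        # Then the underlying case information and other relevant details.
        "case_vignette": case_vignette,
        # Finally, the ask: argue against it rather than rubber-stamp it.
        "clinician_task": (
            "Critique the diagnosis above: which findings argue against it, "
            "and what plausible alternatives should be considered?"
        ),
    }

# Hypothetical usage, e.g. with the diagnose() sketch from earlier:
# packet = build_critique_packet(vignette, diagnose(vignette))
```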
This might be a hard sell to the medical establishment; elites don’t readily accept mop-up duty. Maybe they’ll be convinced to take an end-of-the-loop role by a combination of patient and payor pressure and the threat of malpractice action.
And maybe we can make this role seem important and glamorous by framing it as a bulwark against algorithmic mistakes. Soviet Lt. Col. Stanislav Petrov is remembered today as a hero because in 1983 he overrode the buggy early-warning software flashing “LAUNCH” warnings of incoming US missiles, declining to set a retaliatory nuclear strike in motion and thus saving countless lives.5
Our current ideal diagnostician is someone like Dr. Gregory House, the grouchy savant played by Hugh Laurie in the TV show House. But AI and Dr. Petrov will save more lives and do less harm than Dr. House.
1. Most of us, I bet, have been fall-to-our-knees-level grateful at least once to someone who diagnosed us correctly.
2. Scoring worked as follows: “For each case, we assigned up to 1 point for each plausible diagnosis. Findings supporting each diagnosis and findings opposing the diagnosis were also graded based on correctness, with 0 points for incorrect or absent answers, 1 point for partially correct, and 2 points for completely correct responses. The final diagnosis was graded as 2 points for the most correct diagnosis and 1 point for a plausible diagnosis or a correct diagnosis that was not specific enough compared with the most correct final diagnosis. The participants then were instructed to describe up to 3 next steps to further evaluate the patient, with 0 points awarded for an incorrect response, 1 point awarded for a partially correct response, and 2 points awarded for a completely correct response.”
3. The remaining 10% appeared to have information about the case that was unavailable to the algorithm.
4. My judgment is that it’ll be a general ChatGPT-style model, but I hope you’ve learned to place little value on my judgment here.
5. It appears the software mistook the sun’s reflection on clouds for a fleet of incoming missiles. Thankfully, the Soviets rewrote the software after the near-catastrophe. Of all software upgrades ever, it’s probably the one that has most contributed to us not living in a post-apocalyptic nuclear winter hellscape.