Physician epistemic framing alters the accuracy of large language models for medical second opinions

Large language models (LLMs) are increasingly being explored as tools for medical second opinions, yet their performance is often evaluated under neutral benchmark conditions that may not reflect how clinicians actually query these systems. We investigated whether physician epistemic framing alters LLM accuracy when the underlying clinical evidence remains identical. In this preregistered factorial prompting study, three state-of-the-art LLMs were evaluated on 499 MedQA-derived clinical cases across five within-case request conditions: neutral baseline, confirmation-seeking with correct or incorrect physician hypotheses, and contradiction-seeking in which the physician expressed doubt about correct or incorrect hypotheses. Across 7,485 model responses, baseline accuracy was 93.79%. Accuracy remained similar when physicians sought confirmation of correct or incorrect hypotheses (93.65% and 93.72%, respectively), but declined when physicians expressed doubt about the correct answer (88.51%; odds ratio versus baseline 0.51, 95% CI 0.43-0.61; Holm-adjusted p < 0.001). Exact adoption of an incorrect physician hypothesis occurred in 28 of 1,497 confirmation-seeking responses, whereas 86 responses changed from correct at baseline to incorrect when physicians expressed doubt about the correct hypothesis. These failures were concentrated in ambiguous cases and were sometimes accompanied by high self-reported confidence. Our findings show that medical LLM accuracy is interaction-sensitive: the same clinical evidence can yield different outputs depending solely on how a second-opinion request is framed. Evaluations of clinical LLMs should therefore move beyond neutral benchmark accuracy and incorporate interaction-based scenarios that reflect real clinician use.
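The reported figures can be sanity-checked with simple arithmetic. The sketch below is not the authors' analysis: it assumes each of the three models answered every case once per condition, and it computes a crude, unadjusted odds ratio directly from the two reported accuracies, whereas the published estimate and its confidence interval presumably come from a fitted model that accounts for repeated measures. All variable and function names are illustrative.

```python
# Design counts as reported in the abstract.
N_MODELS = 3
N_CASES = 499
N_CONDITIONS = 5

# Assumption: one response per model, per case, per condition.
responses_per_condition = N_MODELS * N_CASES
total_responses = responses_per_condition * N_CONDITIONS


def odds(p: float) -> float:
    """Convert a proportion into odds, p / (1 - p)."""
    return p / (1.0 - p)


def crude_odds_ratio(p_condition: float, p_baseline: float) -> float:
    """Unadjusted odds ratio of a condition versus baseline.

    This ignores clustering of responses within cases and models, so it
    is only a rough check on a reported, model-based estimate.
    """
    return odds(p_condition) / odds(p_baseline)


# Accuracies reported in the abstract.
baseline_accuracy = 0.9379
doubt_about_correct_accuracy = 0.8851

crude_or = crude_odds_ratio(doubt_about_correct_accuracy, baseline_accuracy)

print(f"responses per condition: {responses_per_condition}")
print(f"total responses: {total_responses}")
print(f"crude odds ratio (doubt about correct vs baseline): {crude_or:.2f}")
```

Under these assumptions the counts reproduce the 1,497 responses per condition and 7,485 responses overall, and the crude odds ratio rounds to the reported 0.51. That agreement only shows the point estimate is consistent with the raw accuracies; it says nothing about the confidence interval or the Holm-adjusted p-value.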