Chiyu Sheng*, Shumin Shen*, Lin Wang, Jie Chen, Wei Chen, Nianfei Wang#, Shanghu Wang#
Chiyu Sheng, M.Med.: Department of Oncology, The Second Hospital of Anhui Medical University, Hefei 230601, China. Email: shengchiyu@outlook.com
Shumin Shen, M.Med.: Department of Oncology, The Second Hospital of Anhui Medical University, Hefei 230601, China. Email: ShenShumin0623@outlook.com
Lin Wang, Ph.D.: School of Computer Science and Technology, East China Normal University, Shanghai 200062, China. Email: linwang_ecnu@163.com
Jie Chen, M.Med.: Department of Pediatrics, The Children's Hospital Affiliated to Soochow University, Suzhou 215000, China. Email: cj970506578@163.com
Wei Chen, M.D.: Department of General Surgery, Hefei Affiliated Hospital of Anhui Medical University, Hefei 230011, China. Email: chenpangzai0214@163.com
Corresponding authors:
1. Nianfei Wang, M.D.: Department of Oncology, Second Hospital of Anhui Medical University, No. 678 Furong Road, Hefei 230601, China; Email: wangnianfei@outlook.com
2. Shanghu Wang, M.D.: Department of Radiotherapy, Anhui Chest Hospital, Hefei 230026, China. Email: wangshanghu1987@163.com
# Co-corresponding author.
Chiyu Sheng M.Med. and Shumin Shen M.Med. contributed equally to this work.
Abstract
Background
In principle, multimodal large language models reflect real-world diagnostic scenarios more closely than text-only large language models. The New England Journal of Medicine Image Challenge contains real clinical cases with both images and textual materials, making it a well-suited resource for testing the diagnostic accuracy of multimodal LLMs.
Methods
We analyzed 272 Image Challenge cases (June 2009 to March 2025) containing both images and clinical text. Three LLMs (GPT-4o, Claude 3.7, and Doubao) were evaluated against 16,401,888 physician responses worldwide (mean, 60,301 per case). Models were tested with images alone and with combined image-text inputs. The primary outcome was diagnostic accuracy in the multimodal condition.
Results
All LLMs significantly outperformed physicians (P < 0.001). Diagnostic accuracy in multimodal testing was 89.0% (95% CI, 84.9 to 92.3) with Claude 3.7, 88.6% (95% CI, 84.5 to 92.0) with GPT-4o, and 71.0% (95% CI, 65.3 to 76.2) with Doubao, compared with 46.7% (95% CI, 40.7 to 52.7) for physician majority vote—an absolute difference exceeding 40 percentage points for top-performing models. In diagnostically challenging cases where fewer than 40% of physicians were correct, Claude 3.7 maintained 86.5% accuracy versus 33.4% for physicians. Despite high accuracy, model-physician concordance was low (Cohen's κ, 0.08 to 0.24), with a 15.4:1 ratio of model-advantage to physician-advantage cases for Claude 3.7. Adding clinical text to images improved accuracy by 28 to 42 percentage points across models. At least one model was correct in 96.3% of cases.
Conclusions
Multimodal testing achieved significantly higher diagnostic accuracy than image-only evaluation and substantially exceeded physician diagnostic performance. High AI accuracy coupled with low physician-AI concordance indicates that multimodal large language models utilize fundamentally different diagnostic reasoning processes. These findings suggest multimodal LLMs may function as valuable diagnostic assistants, augmenting rather than replacing physician clinical decision-making.
Key Words:
Artificial Intelligence
Multimodal Large Language Models
Diagnostic Accuracy
Rare Disease
Medical Imaging
NEJM Image Challenge
Introduction
Accurate diagnosis is fundamental to effective medical treatment, yet diagnostic errors affect millions of patients annually. A meta-analysis of 22 studies involving 80,026 hospitalized patients found a harmful diagnostic error rate of 0.7% (95% CI, 0.5% to 1.1%), translating to approximately 249,900 harmful diagnostic errors annually in the United States1. The burden is substantially higher in outpatient settings, where diagnostic errors affect 5.08% of adults (approximately 12 million Americans, or roughly 1 in 20, each year), with about half of these errors potentially causing harm2. This challenge is magnified for rare diseases, where patients endure a median diagnostic delay of 4.7 years and 40% receive multiple incorrect diagnoses before the correct one is identified3. For the estimated 300 million individuals worldwide affected by rare diseases, the scarcity of specialized expertise and the tendency of rare conditions to mimic common diseases create particularly formidable diagnostic obstacles4.
Recent advances in large language models (LLMs) have demonstrated remarkable performance on standardized medical examinations, suggesting potential diagnostic capabilities5–7. GPT-4 exceeded the United States Medical Licensing Examination passing threshold by more than 20 points, achieving 86.5% accuracy on complex clinical questions8. Med-PaLM 2 reached 86.5% accuracy on medical question-answering datasets, with physician evaluators preferring its responses over those of human physicians on eight of nine clinical utility metrics9. These achievements indicate that current LLMs possess substantial medical knowledge that could theoretically be applied to diagnostic challenges. However, translating examination performance into clinical utility reveals critical limitations. A 2024 systematic review found that although AI models achieved diagnostic accuracy comparable to that of non-expert physicians (52.1%), they remained inferior to specialists by 15.8 percentage points (P = 0.007)10.
The recent development of multimodal LLMs capable of processing both images and text simultaneously now enables direct comparison of AI and physician diagnostic performance in a format that mirrors clinical practice11–14. In a clinical evaluation of 150 dermatological cases, SkinGPT-4, a multimodal diagnostic system, achieved 80.63% accuracy validated by board-certified dermatologists15. Han et al. demonstrated that GPT-4V achieved superior diagnostic accuracy compared with unimodal predecessors (GPT-4, GPT-3.5) and contemporary models (Gemini Pro, Llama 2, Med42) across both JAMA Clinical Challenge and NEJM Image Challenge datasets, establishing that multimodal capabilities enable medical image interpretation without specialized fine-tuning16. However, GPT-4V demonstrated poor diagnostic performance in radiological contexts, achieving accuracy rates of only 8% without clinical context and 29% with contextualization when required to provide the most likely diagnosis17. Liu et al. developed Med-MLLM, a medical multimodal large language model evaluated across five COVID-19 datasets, demonstrating superior performance in COVID-19 reporting, diagnosis, and prognosis tasks even with minimal labeled data (1%)18. In COVID-19 image-text classification tasks, the model achieved a diagnostic AUC of 90.3% when trained with complete datasets, suggesting that such data efficiency and accuracy are relevant to rare disease diagnosis.
Given these limitations in image-only interpretation, we sought to evaluate the diagnostic performance of three state-of-the-art multimodal LLMs—GPT-4o, Claude 3.7, and Doubao—using both image-alone and image-plus-text modalities across NEJM Image Challenge cases, comparing their accuracy against global physician performance to determine the clinical potential of AI-assisted rare disease diagnosis.
2. Methods
2.1 Data Sources
We analyzed 272 consecutive cases from the NEJM Image Challenge published between June 27, 2009, and March 27, 2025. Cases were included if they contained both clinical images and text descriptions. Cases with images only or those flagged for content violations during testing were excluded.
2.2 Model Selection and Testing
Three publicly available multimodal large language models were evaluated: GPT-4o (OpenAI), Claude 3.7 (Anthropic), and Doubao (ByteDance). Each model was tested through its official web interface using standardized prompts instructing selection of the correct answer from five options with supporting rationale.
Models underwent two-phase testing for each case: image-only followed by multimodal (image plus text). We recorded the selected answer, diagnostic choice, and complete model response. No fine-tuning or repeat querying was performed. For physician-model comparisons, we used multimodal LLM results, as NEJM respondents had access to both images and text.
2.3 Physician Benchmark
Physician performance data were obtained from NEJM's published results, comprising 16,401,888 responses (mean, 60,301 physicians per case; range, 12,066–185,210). Physician accuracy was defined as the proportion selecting the correct diagnosis.
2.4 Statistical Analysis
The primary outcome was diagnostic accuracy. We compared model and physician performance using McNemar's test and assessed concordance with Cohen's kappa. Secondary analyses included performance stratification by physician consensus (< 40%, 40–69%, ≥ 70%), disease category, imaging modality, age group (< 1, 1–12, > 12 years), and sex.
Model sensitivity was calculated for cases where physician accuracy was < 50% or < 33%. Ensemble performance was evaluated through majority vote. Subgroups with fewer than 5 cases were excluded. Sex equity was defined as accuracy differences < 5 percentage points.
Statistical analyses were performed with R version 4.3.0. Two-sided P values < 0.05 were considered significant. Confidence intervals were calculated using the Wilson method.
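For transparency, a minimal R sketch of these analyses is shown below. The data are simulated placeholders (not the study data), and the variable names are hypothetical; physician "correctness" here stands for the majority-vote definition described above.

```r
# Illustrative sketch of the primary analyses: Wilson CI, McNemar's test, Cohen's kappa.
# Simulated placeholder data; in the study these were per-case correctness indicators.
set.seed(2025)
n <- 272
model_correct     <- rbinom(n, 1, 0.89)   # 1 = model selected the correct diagnosis
physician_correct <- rbinom(n, 1, 0.47)   # 1 = physician majority vote was correct

# Wilson 95% confidence interval for a proportion
wilson_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p <- x / n
  centre <- (p + z^2 / (2 * n)) / (1 + z^2 / n)
  half   <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  c(lower = centre - half, upper = centre + half)
}
wilson_ci(sum(model_correct), n)

# Paired comparison of model vs. physician accuracy on the same cases
mcnemar.test(table(model_correct, physician_correct))

# Agreement beyond chance (requires the 'psych' package)
psych::cohen.kappa(cbind(model_correct, physician_correct))
```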
2.5 Ethics
This study used publicly available NEJM cases without patient identifiers. Institutional review board approval was not required.
3. Results
3.1 Patient and Case Characteristics
The study comprised 272 diagnostic cases from the NEJM Image Challenge. The cohort included 159 male patients (58.5%) and 113 female patients (41.5%). Age distribution spanned from infancy to advanced age: 155 patients (56.9%) were aged 13–60 years, 81 (29.8%) were older than 60 years, and 36 (13.3%) were younger than 13 years.
Infectious diseases accounted for 70 cases (25.7%), immune-mediated diseases for 47 (17.3%), and neoplastic diseases for 38 (14.0%). The remaining cases were distributed among genetic disorders, vascular diseases, metabolic conditions, trauma-related pathology, drug-induced diseases, and degenerative disorders. Physical examination findings constituted 142 images (52.2%), radiologic studies 65 (23.9%), and the remainder included combination images, pathologic specimens, endoscopic findings, and electrocardiographic tracings.
Physician participation averaged 60,301 per case (range, 12,066 to 185,210), totaling 16,401,888 individual responses. Mean physician diagnostic accuracy was 50.1% (SD, 11.8%; range, 26% to 88%).
This variation in physician performance across cases provided a robust benchmark for evaluating model performance across different levels of diagnostic complexity.
Table 1
Characteristics of Patients, Cases, and Physician Performance.
| Patient and Case Characteristics | (N = 272) |
|---|---|
| Demographic Characteristics | |
| Sex — no. (%) | |
| Female | 113 (41.5) |
| Male | 159 (58.5) |
| Age Distribution — no. (%) | |
| <1 yr | 16 (5.9) |
| 1–12 yr | 20 (7.4) |
| 13–40 yr | 82 (30.1) |
| 41–60 yr | 73 (26.8) |
| >60 yr | 81 (29.8) |
| Clinical Characteristics | |
| Disease Classification — no. (%) | |
| Infectious diseases | 70 (25.7) |
| Immune-mediated diseases | 47 (17.3) |
| Neoplastic diseases | 38 (14.0) |
| Genetic/congenital diseases | 23 (8.5) |
| Vascular diseases | 23 (8.5) |
| Metabolic/nutritional diseases | 21 (7.7) |
| Traumatic/physical diseases | 19 (7.0) |
| Drug/toxin-related diseases | 18 (6.6) |
| Degenerative/functional diseases | 12 (4.4) |
| Ectopic diseases | 1 (0.4) |
| Image Type — no. (%) | |
| Physical signs | 142 (52.2) |
| Radiological | 65 (23.9) |
| Combination | 29 (10.7) |
| Pathological | 16 (5.9) |
| Other | 9 (3.3) |
| Endoscopic | 8 (2.9) |
| Electrocardiographic | 3 (1.1) |
| Physician Assessment Performance | |
| Mean accuracy ± SD (%) | 50.1 ± 11.8 |
| Accuracy range (%) | 26–88 |
| Physician participants per case, mean | 60,301 |
| Physician participants per case, range | 12,066–185,210 |
3.2 Diagnostic Accuracy of LLMs versus Physicians
3.2.1 Diagnostic Performance in Multimodal Testing
In the multimodal evaluation of 272 clinical cases, all three large language models significantly outperformed physicians (Fig. 1). Claude 3.7 and GPT-4o achieved comparable diagnostic accuracy (89.0% and 88.6%, respectively). The absolute difference in accuracy between these models and physician majority vote exceeded 40 percentage points (P < 0.001 for all comparisons). Doubao, though less accurate than Claude 3.7 and GPT-4o, also significantly outperformed the physician benchmark (P < 0.001).
Figure 1. Diagnostic Accuracy of Large Language Models versus Physicians in Multimodal Testing.
Bar graph shows the diagnostic accuracy of three large language models (Claude 3.7, GPT-4o, and Doubao) compared with physician majority vote for 272 multimodal clinical cases from the NEJM Image Challenge. Error bars indicate 95% confidence intervals calculated with the Wilson method. The dashed horizontal line represents chance performance (50%). All models significantly outperformed physicians (P < 0.001 for all comparisons, McNemar's test). *** P < 0.001.
3.2.2 Performance Stratified by Case Difficulty
Large language model performance remained superior across all levels of diagnostic difficulty, as stratified by physician consensus (Fig. 2).
All models achieved high accuracy in cases with strong physician agreement (≥ 70% consensus). In cases with low physician consensus (< 40% correct), where diagnostic uncertainty was greatest, Claude 3.7 maintained 86.5% accuracy, compared with mean physician accuracy of 33.4%.
Figure 2. Line graph displays diagnostic accuracy across three levels of physician consensus. Low consensus indicates cases where fewer than 40% of physicians selected the correct diagnosis (n = 52); moderate consensus, 40% to 69% correct (n = 201); and high consensus, 70% or more correct (n = 19). Background shading corresponds to consensus levels (red, low; yellow, moderate; green, high). All language models maintained high accuracy (> 78%) even in low-consensus cases, whereas physician accuracy increased from 33.4% in low-consensus to 77.3% in high-consensus cases. The dashed horizontal line indicates chance performance.
3.2.3 Robustness of Findings
The analysis included 16,401,888 physician responses, with participation ranging from 12,066 to 185,210 physicians per case. Weighted analyses accounting for differential participation rates yielded results identical to those of unweighted analyses. Effect sizes were large for both Claude 3.7 (Cohen's h = 0.96) and GPT-4o (Cohen's h = 0.95). These findings remained consistent across all analytical approaches.
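As a check, Cohen's h for two proportions is 2·arcsin(√p1) − 2·arcsin(√p2); applying this formula to the accuracies reported above reproduces the stated effect sizes. The short R sketch below is illustrative only, with the proportions taken from the Results.

```r
# Cohen's h for the difference between two proportions
cohen_h <- function(p1, p2) 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

cohen_h(0.890, 0.467)  # Claude 3.7 vs. physician majority vote -> ~0.96
cohen_h(0.886, 0.467)  # GPT-4o vs. physician majority vote    -> ~0.95
```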
3.3 Diagnostic Concordance and Complementarity
3.3.1 Performance Independence from Case Difficulty
Large language model performance showed minimal correlation with physician consensus levels (Fig. 3). Physician accuracy ranged from 26% to 88% across cases. In contrast, Claude 3.7 and GPT-4o maintained high accuracy regardless of case difficulty. In cases where fewer than 40% of physicians were correct, Claude 3.7 achieved 86.5% accuracy and GPT-4o achieved 78.8% accuracy. Doubao showed greater variation (46.2% accuracy in low-consensus cases vs. 100% in high-consensus cases).
Bubble plot analysis (Fig. 4) confirmed these patterns. More than 98% of cases for Claude 3.7 and GPT-4o fell above the line of equal performance with physicians, compared with 90.8% for Doubao. The concentration of large bubbles in the upper regions indicated consistent model superiority across all difficulty levels.
Figure 3. Individual diagnostic outcomes for three large language models are shown for 272 cases. Points represent correct (1) or incorrect (0) diagnoses as a function of the proportion of physicians selecting the correct answer. Vertical jittering prevents overlap. Background shading indicates physician consensus levels: red (< 40% correct), yellow (40–69% correct), and green (≥ 70% correct). Smooth curves were fitted with locally weighted regression (LOESS) with 95% confidence intervals (shaded areas). GPT-4o and Claude 3.7 maintained consistent performance across all difficulty levels, whereas Doubao showed greater sensitivity to case difficulty. Numbers of cases: low consensus, 52; moderate consensus, 201; and high consensus, 19.
Figure 4. Diagnostic accuracy of three large language models is plotted against physician accuracy for 272 cases. Bubble size is proportional to the number of cases at each accuracy level. The diagonal line represents equal performance. Smooth curves show locally weighted regression weighted by case frequency. Cases above the diagonal line indicate superior model performance: GPT-4o, 269 of 272 (98.9%); Claude 3.7, 267 of 272 (98.2%); and Doubao, 247 of 272 (90.8%).
3.3.2 Diagnostic Agreement and Complementarity
Agreement between large language models and physicians was low despite high model accuracy (Table 2). Cohen's kappa values were 0.08 (95% CI, -0.04 to 0.19) for GPT-4o, 0.08 (95% CI, -0.03 to 0.20) for Claude 3.7, and 0.24 (95% CI, 0.13 to 0.35) for Doubao. The combination of low kappa values and high model accuracy suggested different diagnostic reasoning pathways.
When physician majority vote was incorrect (< 50% accuracy), GPT-4o and Claude 3.7 correctly diagnosed 84.8% of cases, and Doubao diagnosed 59.3% correctly. Among the 8 cases with physician accuracy below 33%, GPT-4o and Claude 3.7 maintained 62.5% accuracy.
Table 2
Agreement and Complementarity Between Large Language Models and Physicians in Clinical Diagnosis.
| Model | Cohen's κ | 95% CI for κ | Sensitivity (physician accuracy < 50%) | Sensitivity (physician accuracy < 33%) | Specificity (physician accuracy ≥ 70%) |
|---|---|---|---|---|---|
| GPT-4o | 0.08 | −0.04 to 0.19 | 84.8% | 62.5% | 89.5% |
| Claude 3.7 | 0.08 | −0.03 to 0.20 | 84.8% | 62.5% | 94.7% |
| Doubao | 0.24 | 0.13 to 0.35 | 59.3% | 37.5% | 100.0% |
Cohen's κ values measure agreement between each model and physician majority vote (≥ 50% of physicians correct) beyond chance; values near 0 indicate agreement no better than chance. Sensitivity indicates model accuracy when physician accuracy was < 50% (145 cases) or < 33% (8 cases). Specificity indicates model accuracy when physician accuracy was ≥ 70% (19 cases). These are descriptive statistics and were not adjusted for multiple comparisons. CI denotes confidence interval.
3.3.3 Patterns of Concordance and Discordance
Confusion matrices revealed asymmetric agreement patterns (Fig. 5). For Claude 3.7, model success with physician failure occurred in 123 cases (45.2%), mutual success in 119 cases (43.8%), mutual failure in 22 cases (8.1%), and physician success with model failure in 8 cases (2.9%). This yielded a 15.4:1 ratio of model-advantage to physician-advantage cases. GPT-4o showed similar patterns. Models excelled particularly in cases with low physician consensus.
Figure 5. Confusion matrices compare diagnostic outcomes between each model and physician majority vote (≥ 50% correct) for 272 cases. Values show the number of cases with percentages in parentheses. Shading intensity corresponds to percentage. The ratio of model-correct/physician-incorrect to physician-correct/model-incorrect cases was 11:1 for GPT-4o, 15.4:1 for Claude 3.7, and 4:1 for Doubao.
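The agreement statistics in Table 2 can be reproduced from these cell counts. The R sketch below uses the Claude 3.7 values reported above (119, 123, 8, and 22 cases) and is intended only as an illustrative check; the matrix layout is an assumption for display.

```r
# Claude 3.7 vs. physician majority vote, using the cell counts reported in the text
tab <- matrix(c(119, 123,    # model correct:   physician correct, physician incorrect
                  8,  22),   # model incorrect: physician correct, physician incorrect
              nrow = 2, byrow = TRUE,
              dimnames = list(model = c("correct", "incorrect"),
                              physician = c("correct", "incorrect")))

n  <- sum(tab)
po <- sum(diag(tab)) / n                            # observed agreement
pe <- sum((rowSums(tab) / n) * (colSums(tab) / n))  # agreement expected by chance
(po - pe) / (1 - pe)                                # Cohen's kappa, ~0.08

mcnemar.test(tab)   # significance is driven by the 123 vs. 8 discordant cells
123 / 8             # ~15.4:1 model-advantage ratio
```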
3.3.4 Ensemble Performance
All three models agreed on the correct diagnosis in 171 cases (62.9%). At least one model was correct in 262 cases (96.3%), and all models were incorrect in 10 cases (3.7%). When physician majority vote was included, complete diagnostic failure (all models and physicians incorrect) occurred in 9 cases (3.3%).
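The ensemble figures are simple logical combinations of the three per-case correctness indicators; a minimal sketch is shown below. The data are simulated placeholders, so the outputs will not reproduce the exact counts above (model errors are correlated across cases in the real data).

```r
# Ensemble summaries from per-case correctness indicators (simulated placeholders)
set.seed(1)
correct <- cbind(claude = rbinom(272, 1, 0.890),
                 gpt4o  = rbinom(272, 1, 0.886),
                 doubao = rbinom(272, 1, 0.710))

votes <- rowSums(correct)
c(all_three_correct = mean(votes == 3),   # reported: 62.9%
  at_least_one      = mean(votes >= 1),   # reported: 96.3%
  all_incorrect     = mean(votes == 0),   # reported: 3.7%
  majority_vote     = mean(votes >= 2))   # ensemble accuracy by majority vote
```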
3.4 Performance Across Clinical Contexts
3.4.1 Disease Category Analysis
Model accuracy varied by disease category (Fig. 6). Claude 3.7 achieved 100% accuracy in drug- and toxin-related diseases (18 of 18 cases), 95.7% in immune-mediated diseases (45 of 47 cases), and 95.7% in genetic disorders (22 of 23 cases). GPT-4o achieved 97.9% accuracy in immune-mediated diseases (46 of 47 cases). All models had lower accuracy for traumatic diseases (Claude 3.7, 73.7% [14 of 19 cases]; GPT-4o, 78.9% [15 of 19 cases]; Doubao, 63.2% [12 of 19 cases]).
The largest performance gaps between models and physicians occurred in drug-related diseases (Claude 3.7, 100% vs. physicians, 49.3%; difference, 50.7 percentage points) and genetic disorders (Claude 3.7, 95.7% vs. physicians, 45.5%; difference, 50.2 percentage points). The smallest gap occurred in vascular diseases (physicians, 52.0%; Doubao and Claude 3.7, 78.3%).
Figure 6. Heat map showing the diagnostic accuracy (percentage of correct diagnoses) of three large language models (Claude 3.7, GPT-4o, and Doubao) and physicians across nine disease categories. Values in each cell represent the percentage accuracy for that model-disease combination. Darker blue shading indicates higher accuracy. Disease categories are ordered by overall performance. Analysis includes 272 clinical cases with text and images.
3.4.2 Performance by Image Type
Model accuracy varied by imaging modality (Fig. 7). Claude 3.7 and GPT-4o achieved 100% accuracy with endoscopic images (8 of 8 cases), 96.6% with combination images (28 of 29 cases), and high accuracy with pathological specimens (Claude 3.7, 93.8% [15 of 16 cases]; GPT-4o, 100% [16 of 16 cases]).
For physical signs (142 cases), Claude 3.7 achieved 91.5% accuracy and GPT-4o achieved 89.4% accuracy, compared with 49.1% for physicians. With radiological images (65 cases), accuracy was 81.5% for Claude 3.7 and 84.6% for GPT-4o, compared with 53.5% for physicians.
Figure 7. Bar graph comparing the diagnostic accuracy of three large language models and physicians across different types of medical images. Error bars represent 95% confidence intervals. Analysis includes 272 clinical cases with both images and text.
3.4.3 Performance by Age and Sex
Claude 3.7 achieved 100% accuracy in infants younger than 1 year (16 of 16 cases), compared with 49.6% for physicians. In children 1 to 12 years of age, GPT-4o achieved 95.0% accuracy (19 of 20 cases) and Claude 3.7 achieved 80.0% accuracy (16 of 20 cases). Among patients older than 12 years (86.8% of the cohort), model performance approximated overall averages (Table 3).
Sex-based differences in accuracy were minimal. The largest difference was 8.8 percentage points for Doubao (females, 76.1%; males, 67.3%). Differences were 1.3 percentage points for GPT-4o and 0.7 percentage points for both Claude 3.7 and physicians.
Table 3
Age- and Sex-Stratified Diagnostic Performance.
| Category | Subgroup | N | Claude 3.7 (%) | GPT-4o (%) | Doubao (%) | Physician (%) |
|---|---|---|---|---|---|---|
| Age group | Infant (< 1 yr) | 16 | 100.0* (95.0–100.0) | 87.5* (64.0–96.5) | 75.0* (50.5–89.8) | 49.6 |
| Age group | Pediatric (1–12 yr) | 20 | 80.0* (58.4–91.9) | 95.0† (76.4–99.1) | 75.0* (53.1–88.8) | 50.6 |
| Age group | Adult (> 12 yr) | 236 | 89.0* (84.3–92.4) | 88.1* (83.4–91.7) | 70.3* (64.2–75.8) | 50.1 |
| Sex | Female | 113 | 89.4* (82.4–93.8) | 89.4* (82.4–93.8) | 76.1* (67.5–83.0) | 49.7 |
| Sex | Male | 159 | 88.7* (82.8–92.7) | 88.1* (82.1–92.2) | 67.3* (59.7–74.1) | 50.4 |
Values are diagnostic accuracy percentages with 95% confidence intervals calculated with the Wilson method. Physician values represent the mean proportion of physicians selecting the correct diagnosis. Age groups: infant (< 1 year), pediatric (1–12 years), and adult (> 12 years).
* P < 0.05 after correction for multiple comparisons (McNemar's test). † Not significant after correction for multiple comparisons (unadjusted P < 0.05).
3.5 Multimodal Performance Enhancement
3.5.1 Overall Performance
Adding clinical text to images improved diagnostic accuracy for all models (Fig. 8). Accuracy increased from 47.1% to 89.0% for Claude 3.7 (difference, 41.9 percentage points), from 58.8% to 88.6% for GPT-4o (difference, 29.8 percentage points), and from 42.6% to 71.0% for Doubao (difference, 28.3 percentage points). All differences were significant (P < 0.001 by McNemar's test).
Figure 8. Diagnostic accuracy of three large language models with images alone (gray bars) and with images plus clinical text (green bars) for 272 cases. Horizontal brackets indicate pairwise comparisons (McNemar's test). Absolute improvements in accuracy: Claude 3.7, 41.9 percentage points (from 47.1% to 89.0%); GPT-4o, 29.8 percentage points (from 58.8% to 88.6%); and Doubao, 28.3 percentage points (from 42.6% to 71.0%). Error bars represent 95% confidence intervals. *** P < 0.001.
3.5.2 Individual Case Patterns
Among 272 cases, diagnostic outcomes after adding clinical text were as follows (Fig. 9): For Claude 3.7, accuracy improved in 120 cases (44.1%) and remained unchanged in 146 cases (53.7%). For GPT-4o, accuracy improved in 89 cases (32.7%) and remained unchanged in 175 cases (64.3%). Doubao showed improvement in 77 cases (28.3%) and unchanged accuracy in 195 cases (71.7%).
Unexpectedly, the addition of clinical text led to diagnostic errors in previously correct cases for GPT-4o (8 cases, 2.9%) and Claude 3.7 (6 cases, 2.2%), whereas Doubao showed no such deterioration. Case 20211007 represented the intersection where both GPT-4o and Claude 3.7 changed from correct to incorrect diagnoses after text addition (Fig. 10 and Fig. 11).
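Because the image-only and multimodal runs are paired on the same cases, the significance of the overall improvement depends only on the discordant cases. The R sketch below uses the Claude 3.7 counts reported above (120 improved, 6 deteriorated) as an illustrative check; it is not the study analysis script.

```r
# Paired image-only vs. image-plus-text comparison for Claude 3.7
improved     <- 120   # incorrect with image alone, correct with image plus text
deteriorated <- 6     # correct with image alone, incorrect with image plus text

# McNemar chi-square with continuity correction (as in mcnemar.test on the 2x2 table)
chi_sq <- (abs(improved - deteriorated) - 1)^2 / (improved + deteriorated)
p_val  <- pchisq(chi_sq, df = 1, lower.tail = FALSE)
c(chi_square = chi_sq, p_value = p_val)   # chi-square ~101, P < 0.001

(improved - deteriorated) / 272           # net gain, ~41.9 percentage points
```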
For all 13 cases in which a large language model produced an erroneous diagnosis after text augmentation, we extracted the response content from both testing iterations. By analyzing the models' stated reasoning, we identified potential causes underlying this phenomenon (Table 4).
Figure 9. Waterfall plots show individual patient outcomes for 272 cases when clinical text was added to image-based diagnosis. Each vertical bar represents one patient, ordered by identification number. Bar height indicates change in diagnostic outcome: +100% (green), improvement from incorrect to correct; 0% (gray), unchanged; -100% (red), deterioration from correct to incorrect. Panels show results for GPT-4o (top), Claude 3.7 (middle), and Doubao (bottom). Numbers below each panel indicate cases in each category. Claude 3.7: 120 improved (44.1%), 146 unchanged (53.7%), 6 deteriorated (2.2%). GPT-4o: 89 improved (32.7%), 175 unchanged (64.3%), 8 deteriorated (2.9%). Doubao: 77 improved (28.3%), 195 unchanged (71.7%), 0 deteriorated.
Patient identifiers indicate NEJM publication dates (YYYYMMDD). Numbers in parentheses show cases in each category. Red indicates the single case (20211007) in which both GPT-4o and Claude 3.7 deteriorated.
MRI showing multifocal ring-enhancing brain lesions (left) and microscopy demonstrating filamentous branching Gram-positive rods (right), suggestive of nocardiosis. Both GPT-4o and Claude 3.7 correctly diagnosed nocardiosis from imaging alone but incorrectly revised the diagnosis to listeriosis when clinical text emphasizing advanced age, immunosuppression, and Gram-positive bacilli was added. Images reproduced from NEJM Image Challenge, Case ID 20211007 (https://www.nejm.org/image-challenge?ci=20211007), ©Massachusetts Medical Society. Used under fair use for educational purposes.
Table 4
Possible Explanations for Diagnostic Errors After Adding Clinical Text in 13 Image Challenge Cases
| Patient ID | URL | Likely Cause of Misdiagnosis with Text |
|---|---|---|
| 20211007 | Link | Clinical context (elderly, immunocompromised, fever/confusion) caused overemphasis on Listeriosis instead of imaging-characteristic Nocardiosis. |
| 20191205 | Link | Description of “soft mass increases with crying” in a neonate misled to Prolapsed Uterus; imaging favored Hydrocolpos. |
| 20200206 | Link | Non-specific swelling and elderly context led to Carcinoma of the tongue; image was classic for Sublingual epidermoid cyst. |
| 20200305 | Link | HIV/immunosuppressed background and “B symptoms” led to DLBCL; imaging supported Disseminated Mycobacterium avium-intracellulare. |
| 20210218 | Link | Elderly, weight loss, and abdominal mass misled to Abdominal aortic aneurysm; image showed Urachal mucinous cystic tumor. |
| 20210304 | Link | Subacute cough/dyspnea led to Diffuse alveolar hemorrhage; “sandstorm” X-ray supported Pulmonary alveolar microlithiasis. |
| 20210401 | Link | Text focus on tick bite, fever, and lymphadenopathy favored RMSF; eschar and lymph nodes fit Tularemia. |
| 20220324 | Link | Text mentioned cholesterol emboli, leading to Livedo reticularis; skin pattern was more consistent with Livedo racemosa. |
| 20200220 | Link | Clinical context of tongue swelling in an elderly patient; model chose Carcinoma; image pointed to Beckwith-Wiedemann syndrome (macroglossia). |
| 20210121 | Link | Young adult, fever, and mass symptoms in text; model picked Lymphoma; image showed Castleman disease. |
| 20210311 | Link | Clinical symptoms (pain/swelling) suggested Abscess; imaging classic for Cysticercosis. |
| 20210506 | Link | Textual clues (weight loss, GI symptoms) misled to Colon cancer; image was consistent with GIST (gastrointestinal stromal tumor). |
| 20210520 | Link | Middle-aged patient, chronic symptoms; model chose Sarcoidosis; imaging was classic for Pulmonary Langerhans cell histiocytosis. |
4. Discussion
Our study encompassed a diverse patient population spanning infancy to advanced age with balanced gender distribution (58.5% male, 41.5% female) and broad disease spectrum including infectious, immune-mediated, neoplastic, and genetic conditions. This comprehensive approach offers several clinical advantages over specialized evaluations. Unlike domain-specific studies focusing on single specialties such as neuroradiology or rheumatology, our broad case selection better reflects the diagnostic challenges encountered in general clinical practice where physicians must differentiate among diverse conditions with overlapping presentations. The consistent AI performance across age groups, particularly the perfect accuracy in infants under one year where physicians achieved only 49.6%, suggests robust generalizability across patient demographics—a critical consideration for real-world implementation. The wide disease spectrum evaluation demonstrates that multimodal AI capabilities extend beyond specialty-specific pattern recognition to general diagnostic reasoning, supporting potential applications in primary care and emergency medicine settings where diagnostic breadth rather than depth is often required.
Our findings diverge markedly from four recent evaluations using NEJM Image Challenge datasets. Han et al. reported GPT-4V achieving 88.7% accuracy on 348 NEJM cases versus 51.4% for human readers16, Kaczmarczyk et al. found Claude 3 models reaching only 58.8–59.8% accuracy compared with 90.8% for collective intelligence19, and Suh et al. reported GPT-4o accuracy of 59.6% versus 80.9% for junior faculty radiologists20. A rheumatology-focused evaluation showed Claude 3.5 Sonnet achieving 81.2% accuracy in multimodal tasks versus 51.6% for online participants21. Our Claude 3.7 achieved 89.0% accuracy against a 46.7% physician majority vote. These disparities reflect critical methodological differences: human performance benchmarks varied from individual physician responses (Han et al., our study) to collective intelligence aggregation (Kaczmarczyk et al.) or expert radiologist panels (Suh et al.), representing fundamentally different clinical scenarios. Advances across model generations likely also contributed, as newer versions (Claude 3.7, GPT-4V) consistently outperformed earlier iterations, and dataset composition and evaluation periods differed across studies (Han: 2017–2023; Kaczmarczyk: 2005–2023; Suh: 2005–2024; ours: 2009–2025). The Kaczmarczyk collective intelligence benchmark of 90.8%, though statistically robust, represents an idealized scenario unattainable in clinical practice, where individual physicians make diagnostic decisions; this helps explain the apparent AI superiority in studies that use realistic individual-physician baselines.
Our findings also contrast with the neuroradiology evaluation by Le Guellec et al., in which radiologists outperformed GPT-4o and Gemini 1.5 Pro on complete cases (48.0% vs. 34.0%) and AI models showed minimal multimodal benefit, unlike the substantial 28 to 42 percentage-point improvements and overall AI superiority observed here (Claude 3.7, 89.0% vs. physicians, 46.7%)22. These disparities likely reflect domain-specific challenges, as neuroradiology requires specialized expertise in subtle imaging findings, an area in which AI models failed in 81–94% of cases, as well as different human benchmarks (expert radiologists vs. a general physician majority vote), suggesting that AI diagnostic capability varies considerably across medical specialties.
In approximately 2% to 3% of cases, adding clinical text caused GPT-4o and Claude 3.7 to shift from a correct image-based diagnosis to an incorrect one. Our explanations for these shifts are based on qualitative review and remain subjective; the specific mechanisms are unclear. Notably, this pattern was not observed with Doubao, and the affected cases rarely overlapped between GPT-4o and Claude 3.7. Most involved images with highly characteristic findings, where non-specific clinical text may have misled models that rely more heavily on textual cues. These observations highlight between-model differences and underscore the need for further study.
Several study limitations require acknowledgment. Selection bias in educational case collections may not reflect typical clinical practice complexity. Evaluation using static clinical vignettes differs from dynamic clinical encounters where physicians gather additional information and order sequential investigations. We cannot assess optimal AI-physician collaboration potential or account for real-world time pressures and resource constraints. Additionally, without access to proprietary model training data, possible data contamination cannot be excluded.
Ethics
This study used only publicly available Internet data and did not involve human subjects. Therefore, no specific ethical considerations were required in this study.
Declarations
Declaration of generative AI in scientific writing
The authors used Claude to translate the paper and Grammarly to correct English grammar when preparing this work. The authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Data Availability
The datasets generated or analysed during the current study are available from the corresponding author on reasonable request.
CRediT Authorship Contribution Statement
**Chiyu Sheng:** Conceptualization, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing.
**Shumin Shen:** Conceptualization, Data curation, Formal analysis, Investigation, Writing - original draft, Writing - review & editing.
**Lin Wang:** Methodology, Software, Formal analysis, Data curation, Validation, Writing - review & editing.
**Jie Chen:** Methodology, Software, Formal analysis, Data curation, Validation, Writing - review & editing.
**Wei Chen:** Investigation, Validation, Resources, Writing - review & editing.
**Nianfei Wang:** Conceptualization, Methodology, Validation, Writing - review & editing, Supervision, Project administration, Funding acquisition.
**Shanghu Wang:** Conceptualization, Resources, Writing - original draft, Supervision, Project administration, Funding acquisition.
References
1. Gunderson, C. G. et al. Prevalence of harmful diagnostic errors in hospitalised adults: a systematic review and meta-analysis. BMJ Qual. Saf. 29 (12), 1008–1018 (2020).
2. Singh, H., Meyer, A. N. D. & Thomas, E. J. The frequency of diagnostic errors in outpatient care: estimations from three large observational studies involving US adult populations. BMJ Qual. Saf. 23 (9), 727–731 (2014).
3. Faye, F. et al. Time to diagnosis and determinants of diagnostic delays of people living with a rare disease: results of a Rare Barometer retrospective patient survey. Eur. J. Hum. Genet. 32 (9), 1116–1126 (2024).
4. Nguengang Wakap, S. et al. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur. J. Hum. Genet. 28 (2), 165–173 (2020).
5. Schubert, M. C. et al. Performance of Large Language Models on a Neurology Board–Style Examination. JAMA Netw. Open 6 (12), e2346721 (2023).
6. Beam, K. et al. Performance of a Large Language Model on Practice Questions for the Neonatal Board Examination. JAMA Pediatr. 177 (9), 977 (2023).
7. Longwell, J. B. et al. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw. Open 7 (6), e2417641 (2024).
8. Nori, H., King, N., McKinney, S. M., Carignan, D. & Horvitz, E. Capabilities of GPT-4 on Medical Challenge Problems. Preprint at http://arxiv.org/abs/2303.13375 (2023).
9. Bicknell, B. T. et al. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis. JMIR Med. Educ. 10, e63430 (2024).
10. Takita, H. et al. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. npj Digit. Med. 8 (1), 175 (2025).
11. Ferber, D. et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nat. Commun. 15 (1), 10104 (2024).
12. Zhang, D. et al. MM-LLMs: Recent Advances in MultiModal Large Language Models. Preprint at http://arxiv.org/abs/2401.13601 (2024).
13. Nishino, M. & Ballard, D. H. Multimodal Large Language Models to Solve Image-based Diagnostic Challenges: The Next Big Wave is Already Here. Radiology 312 (1), e241379 (2024).
14. Bradshaw, T. J. et al. Large Language Models and Large Multimodal Models in Medical Imaging: A Primer for Physicians. J. Nucl. Med. 66 (2), 173–182 (2025).
15. Zhou, J. et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat. Commun. 15 (1), 5649 (2024).
16. Han, T. et al. Comparative Analysis of Multimodal Large Language Model Performance on Clinical Vignette Questions. JAMA 331 (15), 1320 (2024).
17. Huppertz, M. S. et al. Revolution or risk? Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur. Radiol. 35 (3), 1111–1121 (2025).
18. Liu, F. et al. A medical multimodal large language model for future pandemics. npj Digit. Med. 6 (1), 1–15 (2023).
19. Kaczmarczyk, R., Wilhelm, T. I., Martin, R. & Roos, J. Evaluating multimodal AI in medical diagnostics. npj Digit. Med. 7 (1), 205 (2024).
20. Suh, P. S. et al. Comparing Large Language Model and Human Reader Accuracy with New England Journal of Medicine Image Challenge Case Image Inputs. Radiology 313 (3), e241668 (2024).
21. Omar, M. et al. Large Language Models in Rheumatologic Diagnosis: A Multimodal Performance Analysis. J. Rheumatol. 52 (2), jrheum.2024-0975 (2025).
22. Le Guellec, B. et al. Comparison between multimodal foundation models and radiologists for the diagnosis of challenging neuroradiology cases with text and images. Diagn. Interv. Imaging, S2211-5684(25)96–8 (2025).