Evaluation of ChatGPT's Performance in Residency Training Progress Exams and Competency Exams in Orthopedics and Traumatology
Yaşar Mahsut DİNÇEL (1), Gündüz Ercan KUTLUAY (2), Hadi SASANİ (3), Mehmet Ali ŞİMŞEK (4), Murat EREM (5)
(1) Department of Orthopedics and Traumatology, Faculty of Medicine, Tekirdağ Namık Kemal University, Tekirdağ, Türkiye
0000-0001-6576-1802, ymd61@gmail.com
(2) ÇOSB Kapaklı Devlet Hastahanesi, Tekirdağ, Türkiye
0000-0002-1077-4945, gunduzercankutluay@gmail.com
(3) Department of Radiology, Faculty of Medicine, Tekirdağ Namık Kemal University, Tekirdağ, Türkiye
0000-0001-6236-4123, hsasani@nku.edu.tr
(4) Department of Computer Programming, Vocational School of Technical Sciences, Tekirdağ Namık Kemal University, Tekirdağ, Türkiye
0000-0002-6127-2195, masimsek@nku.edu.tr
(5) Department of Orthopedics and Traumatology, Faculty of Medicine, Trakya University, Edirne, Türkiye
0000-0002-9743-5515, muraterem@trakya.edu.tr
Corresponding Author: Gündüz Ercan Kutluay
ABSTRACT
Background
Artificial intelligence (AI) technologies have rapidly expanded into the field of medical education, offering innovative tools for training and assessment. This study aimed to evaluate the performance of the ChatGPT-3.5 language model in the “Residency Training Progress Examination” (UEGS) and the “Competency Examination” administered by the Turkish Society of Orthopedics and Traumatology (TOTBID). The objective was to determine whether ChatGPT performs comparably to orthopedic residents and whether it can achieve a passing score in the Competency Exam.
Methods
A total of 2,000 UEGS and 1,000 Competency Exam questions (2012–2023, excluding 2020) were presented to ChatGPT-3.5 using standardized prompts designed within the Role–Goals–Context (RGC) framework. The model’s responses were statistically compared with those of orthopedic residents and specialists using the Mann–Whitney U and Kruskal–Wallis tests (p < 0.05).
Results
ChatGPT achieved the highest accuracy in the General Orthopedics category (62%) and the lowest in Adult Reconstructive Surgery (40%). It outperformed residents only in the Spinal Surgery category (p < 0.05). In the Competency Exams, ChatGPT passed four of ten exams.
Conclusion
ChatGPT-3.5 demonstrated limited reliability and accuracy in orthopedic examinations and should be used cautiously as an educational support tool. Future studies involving newer multimodal versions of large language models may clarify their potential role in medical education and assessment.
Keywords:
ChatGPT
Board Examination
Orthopedics
Traumatology
Artificial Intelligence
INTRODUCTION
The rapid growth in computing power and data availability has accelerated the development of artificial intelligence (AI), enabling its integration into diverse aspects of life (1). In 1943, McCulloch and Pitts introduced the first artificial neuron model, regarded as the foundation of AI. Later, Alan Turing proposed the Turing Test to determine whether machines could exhibit human-like reasoning. In 1956, John McCarthy coined the term artificial intelligence and developed LISP, the first AI programming language. After a period of limited progress, advances in computing power and algorithmic design reignited interest in artificial intelligence, marking the beginning of the deep learning era that enabled the development of large language models (2).
The Generative Pre-trained Transformer (GPT) model, introduced by OpenAI in 2018, is a deep learning–based system trained on massive text datasets to generate human-like responses. The ChatGPT interface, launched in 2022, made this technology accessible to the general public. The latest versions, such as GPT-4 and GPT-5, demonstrate enhanced accuracy and reasoning ability (3).
In healthcare, AI applications now contribute to diagnosis, image interpretation, and patient management with increased accuracy and reduced workload for clinicians. In education, AI-based learning systems are being tested for their potential to improve comprehension and self-directed learning (4–6).
In this study, questions from the Residency Training and Progress Examination (UEGS) and the Competency Examination, organized by the Turkish Society of Orthopedics and Traumatology (TOTBID) across various years, were presented to the ChatGPT-3.5 model. The results were compared with participant outcomes to evaluate whether ChatGPT performs better than resident physicians in the UEGS and whether it can achieve a passing score in the Competency Examination.
The findings aim to clarify whether ChatGPT can serve as a supplementary educational tool in residency training.
MATERIALS AND METHODS
Study design and data sources
This was a retrospective, descriptive, and comparative study that evaluated ChatGPT-3.5’s performance in the Residency Training Progress Examination (UEGS) and the Competency Examination administered by the Turkish Society of Orthopedics and Traumatology (TOTBID). Both exams are organized annually and publicly accessible on the official TOTBID website (7, 8).
The UEGS is a national examination designed to assess the progress of orthopedic and traumatology residents in Türkiye. It has been held annually since 2009 and includes theoretical questions from all subfields of orthopedics. The exam initially contained 100 questions but expanded to 200 questions in 2014. The present study used UEGS exams from 2012 to 2023, excluding 2020, as those questions were unavailable online.
The Competency Examination, prepared by the Turkish Orthopedics and Traumatology Education Council (TOTEK), has been conducted since 2003 and consists of two stages. Only the first stage, comprising 100 multiple-choice questions, was analyzed in this study (9, 10). Participants who answer at least 60 questions correctly are considered successful. Competency Exam questions from 2014 to 2023 were included.
ChatGPT interaction and data collection procedure
All questions from the UEGS and Competency Exams were presented to the ChatGPT-3.5 model (OpenAI, web interface; tested in March 2023). The free version of the ChatGPT-3.5 web interface, the most widely used configuration during the study period, was employed. The model was instructed to answer in Turkish to ensure linguistic compatibility with the original exam format.
Example of standardized prompt used for Competency Exam: “You are an orthopedic resident preparing for the national examination. Read the following question carefully and select the most appropriate answer among the options. Then, provide a brief (one-sentence) explanation for your choice. Answer in Turkish.”
The same standardized prompt structure and wording were used consistently for all questions to ensure reproducibility. No feedback or corrections were provided to the model during data entry.
For the UEGS, ChatGPT was prompted to respond to each item as either correct or incorrect with a brief explanation. Each interaction was designed following the Role–Goals–Context (RGC) framework, which defines the AI’s role (exam participant), objective (select the most accurate answer), and contextual boundaries (question content and available options) (9).
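Although every question in this study was entered manually through the ChatGPT-3.5 web interface, the same RGC-structured prompt could in principle be scripted. The sketch below is illustrative only: it assumes the OpenAI Python client and a hypothetical helper function `ask_exam_question`, and it is not the procedure actually used in the study.

```python
# Illustrative sketch only -- the study submitted questions by hand via the web
# interface; this shows how the fixed RGC prompt could be scripted (assumed API use).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RGC_PROMPT = (
    "You are an orthopedic resident preparing for the national examination. "   # Role
    "Read the following question carefully and select the most appropriate "
    "answer among the options. "                                                 # Goal
    "Then, provide a brief (one-sentence) explanation for your choice. "
    "Answer in Turkish."                                                          # Context
)

def ask_exam_question(question_text: str, options: list[str]) -> str:
    """Hypothetical helper: send one multiple-choice item with the fixed prompt."""
    formatted = question_text + "\n" + "\n".join(options)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # closest API counterpart to the web ChatGPT-3.5
        messages=[
            {"role": "system", "content": RGC_PROMPT},
            {"role": "user", "content": formatted},
        ],
        temperature=0,           # keep answers as deterministic as possible
    )
    return response.choices[0].message.content
```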
Subcategorization of questions
To allow domain-specific performance analysis, all questions were classified into subcategories representing the major divisions of orthopedics and traumatology.
For the UEGS, nine subcategories were defined: general orthopedics, trauma, pediatric orthopedics, spinal surgery, hand and upper extremity surgery, foot and ankle surgery, sports trauma and knee arthroscopy, orthopedic oncology, and adult reconstructive surgery.
For the Competency Exam, ten subcategories were used: basic sciences, pediatric orthopedics, pediatric trauma, adult trauma, upper extremity and hand surgery, lower extremity and foot surgery, arthroscopic and sports surgery, adult reconstructive surgery and arthroplasty, spinal surgery, and infections and tumors (Table 1).
Table 1
Distribution of questions by subcategories in Competency Exam and UEGS.
| Competency Exam Subcategories | Number of Questions (n) | UEGS Subcategories | Number of Questions (n) |
| --- | --- | --- | --- |
| Basic Sciences | 127 | General Orthopedics | 285 |
| Pediatric Orthopedics | 151 | Trauma | 257 |
| Pediatric Trauma | 110 | Pediatric Orthopedics | 218 |
| Adult Trauma | 130 | Spinal Surgery | 212 |
| Upper Extremity and Hand Surgery | 88 | Hand, Wrist, and Upper Extremity Surgery | 202 |
| Lower Extremity and Foot Surgery | 52 | Foot and Ankle Surgery | 180 |
| Arthroscopic Surgery and Sports Traumatology | 102 | Sports Trauma, Arthroscopy, and Knee Surgery | 247 |
| Adult Reconstructive Surgery and Arthroplasty | 96 | Orthopedic Oncology | 186 |
| Spinal Surgery | 80 | Adult Reconstructive Surgery | 213 |
| Infections and Tumors | 64 | | |
| Total | 1000 | Total | 2000 |
Limitations of AI interaction
Because ChatGPT-3.5 does not support image interpretation, the model was unable to answer questions containing radiographs, clinical photographs, or other visual material. In total, 56 such questions were skipped: 7 (2023), 8 (2022), 17 (2021), 5 (2020), 5 (2019), 4 (2018), 1 (2017), 5 (2016), 3 (2015), and 1 (2014). These were excluded from accuracy calculations.
All remaining UEGS questions (n = 2000) and non-visual Competency Exam questions (n = 944) received valid text-based responses.
Statistical analysis
All statistical analyses were performed using PASW Statistics for Windows, version 18.0 (SPSS Inc., Chicago, IL, USA). Data normality was assessed with the Shapiro–Wilk and Kolmogorov–Smirnov tests. Descriptive statistics were calculated as means ± standard deviation (SD) and frequency distributions.
Since the data were not normally distributed, the Mann–Whitney U test was used to compare ChatGPT and resident groups in two-group analyses, while the Kruskal–Wallis test was employed for comparisons across multiple categories. A p-value < 0.05 was considered statistically significant.
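The analysis itself was run in PASW/SPSS; the short Python sketch below merely illustrates the equivalent test sequence with SciPy, using placeholder numbers rather than the study data.

```python
# Minimal sketch of the statistical workflow described above (placeholder data).
import numpy as np
from scipy import stats

chatgpt_correct = np.array([16, 13, 10, 11, 8, 7, 11, 9, 7])    # example per-category correct counts
resident_correct = np.array([13, 13, 10, 8, 11, 11, 11, 8, 9])

# 1) Normality checks (Shapiro-Wilk; Kolmogorov-Smirnov against a fitted normal)
print(stats.shapiro(chatgpt_correct))
print(stats.kstest(chatgpt_correct, "norm",
                   args=(chatgpt_correct.mean(), chatgpt_correct.std(ddof=1))))

# 2) Two-group comparison (ChatGPT vs. residents): Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(chatgpt_correct, resident_correct,
                                     alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.3f}")

# 3) Comparison across multiple groups: Kruskal-Wallis test
h_stat, p_kw = stats.kruskal(chatgpt_correct[:3], chatgpt_correct[3:6], chatgpt_correct[6:])
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.3f}")  # p < 0.05 considered significant
```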
Ethical considerations
No human participants or patient data were involved in this study. All exam data were publicly available on the official TOTBID website and analyzed in aggregate form. The artificial intelligence model used in this study is a publicly accessible, open-access platform; therefore, formal ethics committee approval was not required.
RESULTS
Performance of ChatGPT in the UEGS
In the UEGS examinations conducted between 2012 and 2023 (excluding 2020), the total number of analyzed questions was 2,000, distributed across nine orthopedic subcategories as shown in Table 1. ChatGPT achieved an overall accuracy rate of 52%, providing 1,043 correct and 957 incorrect responses.
The highest accuracy was recorded in the General Orthopedics category (62%), followed by Trauma (57%) and Orthopedic Oncology (57%). The lowest accuracy was found in Adult Reconstructive Surgery (40%) and Foot and Ankle Surgery (45%).
Detailed accuracy rates by subcategory are presented in Table 2.
Table 2
Accuracy of ChatGPT-3.5 by subcategories in the UEGS
| Subcategory | Number of Questions (n) | Correct Answers (n) | Accuracy (%) |
| --- | --- | --- | --- |
| General Orthopedics | 285 | 176 | 62 |
| Trauma | 257 | 146 | 57 |
| Pediatric Orthopedics | 218 | 112 | 51 |
| Spinal Surgery | 212 | 117 | 55 |
| Hand, Wrist, and Upper Extremity Surgery | 202 | 93 | 46 |
| Foot and Ankle Surgery | 180 | 81 | 45 |
| Sports Trauma, Arthroscopy, and Knee Surgery | 247 | 127 | 51 |
| Orthopedic Oncology | 186 | 106 | 57 |
| Adult Reconstructive Surgery | 213 | 85 | 40 |
| Total | 2,000 | 1,043 | 52 |
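As a simple arithmetic check, the accuracy percentages in Table 2 follow directly from the correct/total counts; the sketch below reproduces them from the tabulated values (no new data).

```python
# Recompute the Table 2 accuracy percentages from the reported counts.
table2 = {
    "General Orthopedics": (285, 176),
    "Trauma": (257, 146),
    "Pediatric Orthopedics": (218, 112),
    "Spinal Surgery": (212, 117),
    "Hand, Wrist, and Upper Extremity Surgery": (202, 93),
    "Foot and Ankle Surgery": (180, 81),
    "Sports Trauma, Arthroscopy, and Knee Surgery": (247, 127),
    "Orthopedic Oncology": (186, 106),
    "Adult Reconstructive Surgery": (213, 85),
}

total_n = sum(n for n, _ in table2.values())        # 2,000 questions
total_correct = sum(c for _, c in table2.values())  # 1,043 correct answers

for name, (n, correct) in table2.items():
    print(f"{name}: {correct}/{n} = {100 * correct / n:.0f}%")
print(f"Overall: {total_correct}/{total_n} = {100 * total_correct / total_n:.0f}%")  # 52%
```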
When compared statistically with resident physicians’ scores obtained from the TOTBID database, ChatGPT’s total accuracy did not differ significantly (p = 0.895). However, ChatGPT outperformed residents in the Spinal Surgery category (mean = 10.64 ± 2.54 vs. 7.54 ± 2.72, p = 0.034), while residents performed significantly better in Adult Reconstructive Surgery (mean = 7.27 ± 3.20 vs. 8.71 ± 1.43, p = 0.015).
Although overall accuracy levels were similar, ChatGPT produced more incorrect responses in nearly all categories.
A detailed comparison of correct and incorrect responses between ChatGPT and residents is presented in Table 3.
Table 3
Comparison of ChatGPT-3.5 and resident participants in the UEGS*

| Subcategory | ChatGPT Mean ± SD (Correct) | Residents Mean ± SD (Correct) | p-value | ChatGPT Mean ± SD (Incorrect) | Residents Mean ± SD (Incorrect) | p-value |
| --- | --- | --- | --- | --- | --- | --- |
| General Orthopedics | 16.00 ± 6.48 | 13.17 ± 6.41 | 0.426 | 9.91 ± 6.04 | 5.92 ± 2.82 | 0.070 |
| Trauma | 13.27 ± 4.34 | 13.29 ± 1.69 | 0.929 | 10.10 ± 4.48 | 5.77 ± 0.94 | **0.038** |
| Pediatric Orthopedics | 10.18 ± 3.63 | 10.29 ± 1.22 | 0.965 | 9.64 ± 3.93 | 5.14 ± 1.47 | **0.007** |
| Spinal Surgery | 10.64 ± 2.54 | 7.54 ± 2.72 | **0.034** | 8.64 ± 4.13 | 4.23 ± 1.02 | 0.070 |
| Hand, Wrist, and Upper Extremity Surgery | 8.45 ± 3.53 | 10.93 ± 3.47 | 0.070 | 9.91 ± 4.18 | 5.16 ± 1.42 | **0.003** |
| Foot and Ankle Surgery | 7.36 ± 4.92 | 10.75 ± 4.95 | 0.085 | 9.00 ± 5.72 | 5.72 ± 2.30 | 0.085 |
| Sports Trauma, Arthroscopy, and Knee Surgery | 11.45 ± 4.39 | 11.49 ± 2.47 | 0.757 | 10.91 ± 3.70 | 6.51 ± 1.35 | **0.001** |
| Orthopedic Oncology | 9.37 ± 4.86 | 7.90 ± 3.51 | 0.566 | 7.27 ± 3.90 | 3.96 ± 1.85 | 0.057 |
| Adult Reconstructive Surgery | 7.27 ± 3.20 | 8.71 ± 1.43 | **0.015** | 11.64 ± 3.44 | 4.31 ± 0.75 | **< 0.001** |
| Total | 94.10 ± 24.84 | 94.58 ± 8.33 | 0.895 | 87.73 ± 28.46 | 46.44 ± 5.07 | **0.010** |

*Statistical comparisons were made using the Mann–Whitney U test; bold p-values indicate statistical significance (p < 0.05).
Performance of ChatGPT in the Competency Exams
A total of 1,000 Competency Exam questions from 2014–2023 were analyzed. After excluding 56 image-based questions that the model could not interpret, 944 questions were evaluated.
ChatGPT achieved its highest accuracy in Pediatric Orthopedics (68.2%), followed by Lower Extremity and Foot Surgery (65.4%), and its lowest accuracy in Infections and Tumors (38%). Figure 1 illustrates the distribution of correct response percentages across subcategories.
In the ten annual Competency Exams analyzed, ChatGPT passed four exams (2016, 2017, 2019, and 2022) by answering at least 60 questions correctly (Table 4). Its best performance occurred in 2019 (72% accuracy), while the lowest score was in 2023 (44%).
Table 4
ChatGPT-3.5 performance in the Competency Exams (2014–2023)
| Year | Unanswered Questions (n) | Correct Answers (n) | Incorrect Answers (n) | Result |
| --- | --- | --- | --- | --- |
| 2023 | 7 | 44 | 49 | Failed |
| 2022 | 8 | 61 | 31 | Passed |
| 2021 | 17 | 45 | 38 | Failed |
| 2020 | 5 | 47 | 48 | Failed |
| 2019 | 5 | 72 | 23 | Passed |
| 2018 | 4 | 59 | 37 | Failed |
| 2017 | 1 | 70 | 29 | Passed |
| 2016 | 5 | 60 | 35 | Passed |
| 2015 | 3 | 57 | 40 | Failed |
| Total | 56 | 515 | 330 | 4 / 10 Passed |
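The pass/fail outcome for each year follows mechanically from the 60-correct-answer threshold; the sketch below applies it to the per-year counts listed in Table 4 (the 2014 row is not itemized in the table; per the text, 2014 was among the failed exams).

```python
# Apply the pass criterion (>= 60 of 100 questions answered correctly) to Table 4.
correct_by_year = {
    2023: 44, 2022: 61, 2021: 45, 2020: 47, 2019: 72,
    2018: 59, 2017: 70, 2016: 60, 2015: 57,
}  # 2014 is not itemized in Table 4

PASS_THRESHOLD = 60  # correct answers required out of 100 first-stage questions

for year, correct in sorted(correct_by_year.items()):
    verdict = "Passed" if correct >= PASS_THRESHOLD else "Failed"
    print(f"{year}: {correct} correct -> {verdict}")

passed = sum(c >= PASS_THRESHOLD for c in correct_by_year.values())
print(f"Exams passed among the {len(correct_by_year)} years tabulated: {passed}")  # 4
```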
DISCUSSION
This study examined the performance of the ChatGPT-3.5 artificial intelligence model in Türkiye’s Residency Training and Progress Examination (UEGS) and Competency Examination, both organized by the Turkish Society of Orthopedics and Traumatology (TOTBID).
ChatGPT achieved its highest accuracy in the General Orthopedics category (62%) and the lowest in the Adult Reconstructive Surgery category (40%). The model significantly outperformed participants only in the Spinal Surgery category. This finding may be explained by the fact that Spinal Surgery was also the category in which resident physicians had the lowest accuracy (7.98%). These results may provide insights for those involved in planning and improving orthopedic and traumatology residency training programs, as a core curriculum committee has already been established for spinal surgery (11).
When comparing the results of ChatGPT and residents in the UEGS, no significant difference was found in the total number of correct answers. However, ChatGPT gave more incorrect responses in every category compared with the participants’ average. Based on these data, ChatGPT-3.5 performed worse than orthopedic and traumatology residents in the UEGS.
In the Competency Examination, ChatGPT passed four out of ten exams by correctly answering at least 60 questions out of 100. In the remaining six exams, the model failed to reach the required passing score. Among the 1,000 questions analyzed, ChatGPT-3.5 was unable to answer 56 image-based questions due to its lack of visual interpretation capability. With newer versions such as ChatGPT-4, which can interpret images, more questions could be answered, potentially resulting in higher success rates. In a comparative study using UEGS questions, ChatGPT-4 demonstrated significantly higher accuracy than ChatGPT-3.5 (12). Likewise, a recent study using questions from the Turkish Competency Exam reported that ChatGPT-4o achieved higher accuracy than human participants across all orthopedic subdomains (13). This indicates that our findings align with the overall trend of improving model performance in newer AI systems rather than contradicting it.
Our findings show that ChatGPT-3.5 performs worse than resident physicians in the UEGS and fails to pass most of the Competency Exams in orthopedics and traumatology. Similar studies have demonstrated that ChatGPT performs well in some exams but poorly in others (14–19). In certain cases, ChatGPT outperformed participants, whereas in others, as in our study, participants achieved better results.
ChatGPT became one of the fastest-growing computer programs in history, reaching 100 million active users within two months of its public release (20). Like other AI models, it draws from a wide range of data sources, including peer-reviewed journal articles, textbooks, and online content (21). As new versions are released, both the technical capabilities of the model and the size of its knowledge base expand. Thus, it is expected that future AI models will continue to improve their ability to evaluate and answer questions.
This study has certain limitations. The results of residents and specialists were evaluated using publicly available data from the TOTBID website, where not all exam years had standardized or complete analyses. Some exams were not conducted during the COVID-19 pandemic, and those years were therefore excluded. These limitations may have affected the comparison process.
Finally, the use of chatbots in medical education is an emerging trend supported by many educators and medical professionals. OpenAI’s ChatGPT offers several potential advantages for both students and teachers (22, 23). Recent reviews have highlighted that generative AI tools hold promise for orthopedic education and training but also pose challenges related to reliability, ethical use, and integration into curricula. It is important to remember that these systems are still evolving and have not yet reached perfection. There remain significant gaps in both theoretical and practical orthopedic education that AI tools cannot yet fill (24).
Conclusion
ChatGPT-3.5 showed variable accuracy in orthopedic examinations, outperforming residents only in the Spinal Surgery category while underperforming in most others. Although ChatGPT offers potential educational benefits, it is not yet a reliable or valid resource for independent use in orthopedics and traumatology education.
As artificial intelligence systems continue to evolve, future versions with multimodal capabilities may become more effective tools in medical learning and assessment.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Author Contribution
YMD and GEK wrote the main manuscript. HS and ME asked questions to ChatGPT. MAŞ performed the statistical work.
Acknowledgements
The authors would like to express their sincere gratitude to all physicians who helped improve the orthopedic examinations in the interest of better education.
REFERENCES
1.Minh, D., Wang, H. X., Li, Y. F. & Nguyen, T. N. Explainable artificial intelligence: a comprehensive review. Artif. Intell. Rev. 55 (5), 3503–3568 (2022).
2.Haenlein, M. & Kaplan, A. A Brief History of Artificial Intelligence: On the Past, Present, and Future of Artificial Intelligence. Calif. Manage. Rev. 61 (4), 5–14 (2019).
3.Ollivier, M. et al. A deeper dive into ChatGPT: history, use and future perspectives for orthopaedic research. Knee Surg. Sports Traumatol. Arthrosc. 31 (4), 1190–1192 (2023).
4.Liu, P. R. et al. Application of Artificial Intelligence in Medicine: An Overview. Curr. Med. Sci. 41 (6), 1105–1115 (2021).
5.Wu, D. et al. Artificial intelligence-tutoring problem-based learning in ophthalmology clerkship. Ann. Transl. Med. 8 (11), 700 (2020).
6.Yang, Y. Y. & Shulruf, B. Expert-led and artificial intelligence (AI) system-assisted tutoring course increase confidence of Chinese medical interns on suturing and ligature skills: prospective pilot study. J. Educ. Eval Health Prof. 16, 7 (2019).
7.Gönen, D. E. 2012–2013 TOTBİD-TOTEK UZMANLIK EĞİTİMİ GELİŞİM SINAVI RAPORU (UEGS).
8.Türk Ortopedi ve Travmatoloji Birliği Derneği [Internet]. [cited 2024 Sept 1]. Available from: https://totbid.org.tr/tr/
9.Tabatabaian, M. Prompt Engineering Using ChatGPT: Crafting Effective Interactions and Building GPT Apps 142 (Walter de Gruyter GmbH & Co KG, 2024).
10.Benli, İ. & Acaroğlu, E. Türk Ortopedi ve Travmatoloji Birliği Derneği (TOTBİD) Türk Ortopedi ve Travmatoloji Eğitim Konseyi Yeterlik Sınavları. Acta Orthop Traumatol Turc [Internet]. [cited 2024 Sept 1];45(2). Available from: https://dergipark.org.tr/en/download/article-file/169969
11.Acaroğlu, E. et al. Core curriculum (CC) of spinal surgery: a step forward in defining our profession. Acta Orthop. Traumatol. Turc. 48 (5), 475–478 (2014).
12.Ayik, G. et al. Exploring the role of artificial intelligence in Turkish orthopedic progression exams. Acta Orthop. Traumatol. Turc. 59 (1), 18–26 (2025).
13.Yağar, H., Gümüşoğlu, E. & Mert Asfuroğlu, Z. Assessing the performance of ChatGPT-4o on the Turkish Orthopedics and Traumatology Board Examination. Jt. Dis. Relat. Surg. 36 (2), 304–310 (2025).
14.Ruksakulpiwat, S., Kumar, A. & Ajibade, A. Using ChatGPT in Medical Research: Current Status and Future Directions. J. Multidiscip Healthc. 16, 1513–1520 (2023).
15.Khan, R. A., Jawaid, M., Khan, A. R. & Sajjad, M. ChatGPT - Reshaping medical education and clinical management. Pak. J. Med. Sci. 39 (2) (2023). Available from: https://pjms.org.pk/index.php/pjms/article/view/7653
16.Oztermeli, A. D. & Oztermeli, A. ChatGPT performance in the medical specialty exam: An observational study. Med. (Baltim). 102 (32), e34673 (2023).
17.Sumbal, A., Sumbal, R. & Amir, A. Can ChatGPT-3.5 Pass a Medical Exam? A Systematic Review of ChatGPT’s Performance in Academic Testing. J. Med. Educ. Curric. Dev. 11, 23821205241238641 (2024).
18.Wang, X. et al. ChatGPT Performs on the Chinese National Medical Licensing Examination. J. Med. Syst. 47 (1), 86 (2023).
19.Aljindan, F. K. et al. ChatGPT Conquers the Saudi Medical Licensing Exam: Exploring the Accuracy of Artificial Intelligence in Medical Knowledge Assessment and Implications for Modern Medical Education. Cureus 15 (9), e45043 (2023).
20.Alessandri Bonetti, M., Giorgino, R., Gallo Afflitto, G., De Lorenzi, F. & Egro, F. M. How Does ChatGPT Perform on the Italian Residency Admission National Exam Compared to 15,869 Medical Graduates? Ann. Biomed. Eng. 52 (4), 745–749 (2024).
21.Massey, P. A., Montgomery, C. & Zhang, A. S. Comparison of ChatGPT–3.5, ChatGPT-4, and Orthopaedic Resident Performance on Orthopaedic Assessment Examinations. JAAOS - J. Am. Acad. Orthop. Surg. 31 (23), 1173 (2023).
22.Huang, Y. et al. Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for ai-assisted medical education and decision making in radiation oncology. Front. Oncol. 13, 1265024 (2023).
23.Moritz, S., Romeike, B., Stosch, C. & Tolks, D. Generative AI (gAI) in medical education: Chat-GPT and co. GMS J. Med. Educ. 40 (4), Doc54 (2023).
24.Atik, O. Ş. Artificial intelligence: Who must have autonomy the machine or the human? Jt. Dis. Relat. Surg. 35 (1), 1–2 (2024).