Original Paper

Structured Taxonomy and Framework for Developing Medical Benchmark in Large Language Models Derived from Scoping Review


Junbok Lee^1,2, Jaeyong Shin^3,4*, Belong Cho^2,5*

¹ Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea

² Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea

³ Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Seoul, Republic of Korea

⁴ Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea

⁵ Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea

Corresponding Author #1

Belong Cho MD, PhD, Professor, Department of Family Medicine, Seoul National University College of Medicine, Address: 101 Daehak-ro, Jongno-gu, Seoul 03080, Republic of Korea, TEL/FAX.: +82-2-2072-2195/ +82-2-766-3276, E-mail: belong@snu.ac.kr

Corresponding Author #2

Jaeyong Shin MD, PhD, Associate Professor, Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Address: 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea, TEL/FAX.: +82-2-2228-1881/ +82-2-393-8133, E-mail: drshin@yuhs.ac

Abstract

With the rapid advancement of large language model (LLM) technology, numerous studies have explored its application in the medical field. Robust evaluation is crucial for ensuring reliability and safety, and this need has led to the development of diverse benchmark datasets. In this study, we propose a structured taxonomy to provide researchers with practical guidance for benchmark selection. Furthermore, we introduce READY, a development framework built on five principles (Reliable, Ethical, Annotated, Diverse, Yield-validated) to support the systematic design of medical benchmarks and strengthen future evaluation practices. To establish the taxonomy and framework, we systematically reviewed benchmark datasets designed for evaluating LLMs in medical contexts. A comprehensive literature search yielded 55 relevant studies. Each benchmark was analyzed using a structured framework encompassing dataset construction and evaluation methodology. We anticipate that this research will promote more rigorous and ethical LLM evaluation, paving the way for the safe application of LLMs in clinical settings.

Keywords: Structured Taxonomy; Development Framework; Benchmark; Large Language Model; Scoping Review

Introduction

The growing interest in large language models (LLMs) has catalyzed extensive research on their applications in the medical field¹. Current studies span a broad range of areas, including medical research and data analytics, patient education, clinical decision support, and the automation of medical records^2–6. As LLM capabilities continue to advance, their further integration into medical services is expected.

Considering that clinical decisions directly affect patient health and safety, ensuring the reliability and safety of LLM applications in medicine is crucial⁷. Accordingly, LLMs designed for medical use must undergo rigorous evaluation within the domain⁸. Robust evaluation frameworks not only ensure safety, but also allow researchers and developers to identify and address model limitations. In response to these requirements, benchmark datasets, which are standardized tools for assessing LLM performance, have been developed.

Although numerous benchmarks have been developed in the field of medicine, their use remains limited. Previous studies report that medical benchmarks are cited relatively rarely, and prior reports on evaluation methods for medical LLMs found that few studies used benchmarks. Several factors may contribute, but one possibility is that researchers are not aware of which benchmark datasets suit their research. In addition, benchmark datasets typically span all medical specialties, whereas researchers focusing on a specific specialty may require only certain parts of a dataset.

In this study, we first reviewed benchmark datasets developed specifically to evaluate LLMs in the medical field. We analyzed the characteristics of each benchmark and developed a structured taxonomy to provide insights for researchers to select suitable datasets for their research objectives. In addition, we proposed READY, a medical benchmark development framework that includes five principles (Reliable, Ethical, Annotated, Diverse, Yield-validated).

Results

Overview

The review process consisted of four stages: identification, screening, eligibility assessment, and final inclusion, following the PRISMA-ScR guidelines¹². The initial search yielded 3,697 articles (Fig. 1). Thirteen articles (0.35%) were removed by automatic deduplication using EndNote. Based on title and abstract screening, 3,569 articles (96.54%) were excluded according to the eligibility criteria. The remaining 115 articles (3.11%) underwent full-text review, of which 78 (2.11%) were excluded because they did not meet the inclusion criteria. Finally, 37 articles (1.00%) were included in the analysis. An additional 18 articles were identified through forward snowballing, by reviewing the citations of the 37 initially included articles via Google Scholar, resulting in a final total of 55 studies included in this review.
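The counts in this flow can be checked with simple arithmetic. The following is a minimal sketch, not part of the original analysis; the variable and function names are illustrative, and the percentages are assumed to be taken relative to the 3,697 initially identified records, which is consistent with the figures reported above.

```python
# Counts reported in the text; all names here are illustrative.
identified = 3697
duplicates_removed = 13
excluded_on_title_abstract = 3569
excluded_on_full_text = 78
added_by_snowballing = 18

# Records remaining at each stage of the flow.
full_text_reviewed = identified - duplicates_removed - excluded_on_title_abstract
included_from_search = full_text_reviewed - excluded_on_full_text
total_included = included_from_search + added_by_snowballing


def share_of_identified(count: int) -> float:
    """Percentage of the initially identified records, rounded to two decimals."""
    return round(100 * count / identified, 2)


print(full_text_reviewed, share_of_identified(full_text_reviewed))
print(included_from_search, share_of_identified(included_from_search))
print(total_included)
```

Running the sketch reproduces the 115 full-text articles, the 37 articles included from the database search, and the final total of 55 studies.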

Fig. 1

Flow diagram of the benchmark selection process for the scoping review

Analysis of dataset construction methods

(1) Dataset type

The benchmark datasets were categorized based on their sources: medical licensing examinations (n = 17)^13–29, consumer health questions (CHQ) (n = 17)^{18,21,30–44}, and electronic health records (EHR) (n = 17)^{20,25,45–59}. Slightly fewer datasets were derived from biomedical literature (n = 14)^{18,29,41,48,49,56,60–67}. Nine benchmarks utilized multi-source data that incorporated two or more types of inputs (Table 1).

Table 1
Data source
Data source	Detail	N (%)	Reference
Exam	Physician licensing exams only	10 (58.8)	MedQA, MedMCQA, MLEC-QA, MedBench, MedExQA, MedQA-SWE, CMExam, MultiMedQA, Dr. bench, MedExpQA
	Pharmacist licensing exams only	3 (17.6)	NLPEC, FrenchMCQA, ExplainCPE
	Several licensing exams	4 (23.5)	KorMedMCQA, CMB, HeadQA, M-QALM
Consumer Health Question (CHQ)	National Library of Medicine	6 (41.2)	MedicationQA, MEQSUM, MEDIQA-Answer Summarization, MedQuAD, M-QALM
	Platforms	11 (64.8)	MASH-QA, webMedQA, ChiMed, CMID, AraMed, MedRedQA, Chq-summ, K-QA, cMedQA-v2.0, Huatuo-26M, MultiMedQA
Electronic Health Records (EHR)	n2c2	2 (11.8)	emrQA, ClinicalKBQA
	MIMIC-III	9 (52.9)	emrKBQA, CLIP, RadQA, DiSCQ, CLIFT, DrugEHRQA, MedEVAL, EHRNoteQA, Dr. bench
	EMR data from hospitals	6 (35.3)	CBLUE, MedBench, RuMedBench, Medalign, RareBench, PromptCBLUE
Literature	Research-related data	6 (46.2)	CliCR, PubMedQA, BioRead, MedREQAL, CBLUE, PromptCBLUE
	Websites or Wikipedia	4 (30.8)	HealthQA, ViMedAQA, Huatuo-26M, FrBMedQA
	Medical textbooks	2 (15.4)	RuMedBench, CMB
	Etc.	1 (7.7)	cpgQA

As shown in Fig. 2, the temporal trends revealed that only four benchmarks were introduced in 2018, increasing to nine in 2019. This number declined to three and five in 2020 and 2021, respectively, before rising to 11 in 2022 and peaking at 15 in 2023. Exam-, EHR-, and literature-based benchmarks showed limited development in earlier years but increased notably after 2022 or 2023. The CHQ-based benchmarks initially rose in 2019, declined thereafter, and resurged in 2022 and 2023. Multisource benchmarks were not introduced until 2021, with continued development in subsequent years.

Fig. 2

Temporal trends in the publication of benchmark datasets

(2) Data source

Exam-based benchmarks primarily utilize questions from national licensure examinations for physicians, pharmacists, and nurses. The initial benchmarks focused on physician-licensing examinations. More recently, benchmarks have been expanded to include pharmacist and nurse licensing content. MedBench incorporates items from Resident Standardization Training and Doctor-in-Charge Qualification exams. The CHQ-based benchmarks were developed using user-generated content from websites where patients posed health-related questions and received physician responses. Data sources included MEDLINE and commercial platforms such as Yahoo Answers. EHR-based benchmarks frequently rely on the publicly available MIMIC-III datasets, largely owing to data privacy constraints. Other studies used de-identified clinical data obtained directly from hospitals. Literature-based benchmarks extract data from biomedical research sources such as PubMed and Cochrane, health information websites, and Wikipedia. Additional sources include medical textbooks.

(3) Construction methods

Most benchmarks were constructed manually (n = 49), followed by modified versions of existing datasets (n = 10) and model-generated benchmarks (n = 6) (Fig. 3). Exam-based benchmarks are classified as human-generated because of the use of pre-existing exam questions. The CHQ- and literature-based benchmarks were also predominantly human-constructed. The EHR-based datasets included both human- and model-generated benchmarks, depending on whether expert annotation or algorithmic labeling was applied. Several benchmarks were constructed by adapting and extending existing datasets such as MedQA and MedMCQA.

Fig. 3

Benchmark dataset analysis

(4) Annotation methods and details

The annotation presence, method, and specificity were assessed across 55 benchmarks (Fig. 3). Of these, 39 (70.9%) were annotated and 16 (29.1%) were not. Among the 39 annotated benchmarks, 25 (64.1%) were annotated manually, 5 (12.8%) were annotated using models, and 9 (23.1%) were annotated using a hybrid approach. GPT-4 was employed for annotation in CMExam and K-QA, whereas MedREQAL utilized GPT-3.5 for health-area classification.

Manual annotation in exam-based datasets sometimes targets only subsets of questions (Table 0). The annotations included medical specialties, reasoning types, difficulty levels, and question types. In CHQ-based benchmarks, annotations involve generating answers or assessing response reliability. Some benchmarks include question type categorization, and K-QA employed annotations based on the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10). Literature-based benchmarks were annotated for answer generation and content classification. The EHR-based benchmarks included prototype clinical queries extracted from electronic medical records (EMRs), which were subsequently annotated manually or automatically.

(5) Languages used in benchmarks

English was the most common language (n = 35), followed by Chinese (n = 14). Other languages included French (n = 3), Spanish (n = 2), and, more recently, single benchmarks in Russian, Swedish, Vietnamese, Arabic, Korean, and Italian, reflecting a growing trend toward multilingual benchmark development.

Analysis of evaluation methods

(1) Evaluation methods

The approaches used to evaluate the models against the benchmarks were subsequently analyzed. Code-based evaluation alone was the most common method (n = 38), followed by a combination of code-based and human evaluation (n = 12) and human evaluation alone (n = 3). Two benchmarks, ExplainCPE and CMB, used GPT-4 to assess the generated outputs.

(2) Evaluation models

Prior to 2020, benchmark validation primarily relied on neural network–based models such as convolutional neural networks (CNNs), long short-term memory networks, and bidirectional gated recurrent units, including several variants such as multi-scale CNNs, multilevel composite CNNs, and multi-scale attentive interaction networks. In the early 2020s, numerous BERT-based variants, such as RoBERTa, BioBERT, ALBERT, ClinicalBERT, and PubMedBERT, were widely adopted. Since 2023, proprietary LLMs, including ChatGPT, GPT-4, and PaLM, have been increasingly employed alongside open-source models such as Llama.

(3) Metrics

The metrics used to evaluate model performance were also examined. Accuracy (n = 27) and F1 score (n = 22) were the most frequently reported metrics and were commonly used to assess overall performance. ROUGE (n = 17) and BLEU (n = 8) were used to evaluate the similarity between generated and reference texts. Additional metrics, including exact match (n = 8), precision (n = 4), and recall (n = 4), were used to evaluate predictive accuracy. Other metrics, such as BERTScore (n = 7) and METEOR (n = 5), were also applied in several studies.
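To make the most common metrics concrete, the following is a minimal sketch of accuracy, exact match, and a token-overlap F1 score. It is not taken from any of the reviewed benchmarks: the function names and the simple lowercase-and-whitespace normalization are assumptions, and the token-overlap F1 shown here is only one common variant used for free-text answers (classification tasks typically compute F1 over labels instead).

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (a deliberately simple normalization)."""
    return " ".join(text.lower().split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of items whose predicted label equals the reference label."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall against the reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Multiple-choice answers: three of four predictions match the key.
mcq_accuracy = accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"])

# Free-text answer: all predicted tokens are correct, but one reference token is missing.
answer_f1 = token_f1("take aspirin daily", "take low-dose aspirin daily")
```

Accuracy suits multiple-choice exam items, whereas exact match and overlap-based scores are more informative for generated answers, which is consistent with the mix of metrics reported above.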

Structured taxonomy of medical benchmark

We developed a structured taxonomy of benchmarks to help researchers select benchmarks (Fig. 4). First, the benchmarks were divided by language into English and non-English, because the language of a benchmark determines the settings in which it can be applied. Given that there were many Chinese benchmarks, we divided the non-English benchmarks into Chinese and other languages. Next, because the source of the benchmark data shapes what a benchmark measures, we distinguished whether the benchmark was based on multiple sources. If it was not multi-source, we further categorized it as Exam, EHR, CHQ, or Literature. Finally, we distinguished whether the items in each benchmark were classified into detailed categories.

Fig. 4

Structured taxonomy of benchmark datasets
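The three levels of the taxonomy can be expressed as a small lookup routine. The following is a minimal sketch under stated assumptions: the `Benchmark` record, its field names, the branch labels, and the example entry `ExampleBench` are all hypothetical, and each benchmark is assumed to have a single primary language.

```python
from dataclasses import dataclass

SINGLE_SOURCE_TYPES = ("Exam", "EHR", "CHQ", "Literature")


@dataclass(frozen=True)
class Benchmark:
    """Hypothetical record holding the attributes used by the taxonomy."""
    name: str
    language: str                  # primary language (assumed to be a single value)
    sources: tuple[str, ...]       # one or more of SINGLE_SOURCE_TYPES
    has_detailed_categories: bool  # whether items are classified into detailed categories


def taxonomy_path(benchmark: Benchmark) -> tuple[str, str, str]:
    """Return the (language, source, categorization) branch for a benchmark."""
    unknown = set(benchmark.sources) - set(SINGLE_SOURCE_TYPES)
    if not benchmark.sources or unknown:
        raise ValueError(f"unexpected sources: {benchmark.sources}")

    # Level 1: English, Chinese, or any other language.
    if benchmark.language == "English":
        language = "English"
    elif benchmark.language == "Chinese":
        language = "Non-English / Chinese"
    else:
        language = "Non-English / Other"

    # Level 2: multi-source, or the single source type.
    if len(set(benchmark.sources)) > 1:
        source = "Multi-source"
    else:
        source = benchmark.sources[0]

    # Level 3: whether a detailed classification of items is available.
    detail = "Categorized" if benchmark.has_detailed_categories else "Not categorized"
    return language, source, detail


example = Benchmark("ExampleBench", "Korean", ("Exam",), True)
example_path = taxonomy_path(example)
```

A researcher looking for, say, a non-English exam benchmark with specialty labels could filter a list of such records on the returned tuple.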

Development framework of medical benchmark

The development framework consists of 24 questions organized under five key principles: (1) Reliable, (2) Ethical, (3) Annotated, (4) Diverse, and (5) Yield-validated (Table 2). First, five questions confirm that the benchmark data come from reliable sources and are of high quality. They cover the identification of the data source, the description of data composition and preprocessing, collection from a reliable institution, data recency, and data volume. Second, five questions on ethics address the removal of patients’ personal information, data bias, institutional review board approval, compliance with privacy protection laws, and reusability. Third, five questions check whether a meaningful and consistent annotation system exists: the clarity and consistency of annotations, the annotation method, disclosure of the model used for automatic annotation, annotation quality verification, and annotator expertise together with inter-annotator agreement. Fourth, four questions on diversity ask whether items are classified across multiple medical specialties or ICD codes, question types, reasoning types, and any additional categories. Finally, five questions concern evaluation: whether a benchmark evaluation was performed, which models and metrics were used, whether external experts or clinicians reviewed the benchmark, and whether open-source code or evaluation tools are available. Together, the 24 questions help researchers confirm whether a benchmark is appropriately prepared.

Table 2
Framework of Benchmark Development (READY)
		Yes	No	N/A
R - Reliable (Reliability of Source and Quality Assurance) : Benchmark data must be obtained from clinically and scientifically validated sources.
1	Is the data source clearly identified? (e.g., Exam, CHQ, EHR, Literature, etc.)
2	Is the data composition and preprocessing method (sampling criteria, filtering criteria, noise removal, missing value handling, etc.) described in detail?
3	Is the data collected from reliable institutions? (e.g. hospitals, government agencies)
4	Does it include the latest data and reflect the most recent medical knowledge?
5	Is the quantity of data sufficient?
E - Ethical (Compliance with Ethics and Privacy Protection) : Patients’ personal information and sensitive data must be strictly de-identified, and relevant laws and ethical standards must be followed.
6	Has all personally identifiable information been removed?
7	Does the data avoid biased or harmful expressions?
8	Is the Institutional Review Board (IRB) approval status clearly stated?
9	Does it comply with privacy protection laws such as HIPAA, GDPR, etc.?
10	Are the usage conditions and reusability (e.g., license, availability, etc.) clearly presented?
A - Annotated (Meaningful and Consistent Annotation System) : Annotations must carry medical significance and support model evaluation and performance improvement.
11	Are the annotations clear and applied consistently?
12	Is the annotation method specified? (manual, automatic, etc.)
13	If automatic annotation is used, is the model disclosed?
14	Has annotation quality been validated?
15	Is the annotator’s expertise (medical expert, layperson, model, etc.) ensured, and has inter-annotator agreement been validated?
D - Diverse (Ensuring Diversity of Questions and Documents) : Questions and documents should reflect various stakeholders and contexts to ensure benchmark generalizability.
16	Does it include multiple medical specialties and domain knowledge (e.g., ICD codes), with proper classification?
17	Does it include multiple question types (diagnosis, examination, treatment, etc.), with proper classification?
18	Does it include multiple reasoning types (fact-based vs. inference-based), with proper classification?
19	Are there additional categories defined for ensuring diversity?
Y - Yield-validated (Practical Medical Utility-Oriented Design) : Benchmark data must be validated based on applicability in real-world medical practice.
20	Has evaluation of the developed benchmark been conducted?
21	Are the models used for evaluation disclosed?
22	Are the metrics used for quality assessment disclosed?
23	Does it include external expert or clinical reviews/evaluations?
24	Are open-source code or evaluation tools available?
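The checklist lends itself to a simple machine-readable form. The following is a minimal sketch, assuming one answer of "yes", "no", or "n/a" per question; the framework itself does not define a scoring rule, so the per-principle tally below (and every name in the code) is an illustrative assumption rather than part of READY.

```python
# Question numbers per principle, following Table 2.
READY_PRINCIPLES = {
    "Reliable": range(1, 6),
    "Ethical": range(6, 11),
    "Annotated": range(11, 16),
    "Diverse": range(16, 20),
    "Yield-validated": range(20, 25),
}
VALID_ANSWERS = {"yes", "no", "n/a"}


def summarize(responses: dict[int, str]) -> dict[str, tuple[int, int]]:
    """Per principle, return (number of 'yes' answers, number of applicable questions)."""
    if set(responses) != set(range(1, 25)):
        raise ValueError("responses must cover questions 1 to 24 exactly")
    invalid = set(responses.values()) - VALID_ANSWERS
    if invalid:
        raise ValueError(f"unexpected answers: {sorted(invalid)}")

    summary = {}
    for principle, questions in READY_PRINCIPLES.items():
        answers = [responses[q] for q in questions]
        # Questions marked "n/a" are left out of the denominator.
        applicable = [a for a in answers if a != "n/a"]
        summary[principle] = (applicable.count("yes"), len(applicable))
    return summary


# Hypothetical self-assessment: IRB status not stated (question 8),
# no automatic annotation used (question 13 is not applicable).
responses = {q: "yes" for q in range(1, 25)}
responses[8] = "no"
responses[13] = "n/a"
summary = summarize(responses)
```

Reporting the tally per principle, rather than as a single total, keeps a gap in one area (for example, ethics) from being hidden by strong results elsewhere.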

Discussion

In this study, we systematically reviewed benchmark datasets used to evaluate LLMs in the medical field, focusing on their construction and evaluation methodologies. Based on this analysis, we propose a structured taxonomy and development framework for researchers seeking to adopt existing benchmarks or develop new benchmarks tailored for specific applications.

Effective benchmark development requires the incorporation of diverse annotations that enable targeted evaluations without necessitating the use of an entire dataset. Although LLM research has expanded across a range of medical specialties, the benchmarks remain underutilized. Notably, none of the studies identified in our scoping review employed the analyzed benchmarks for performance assessment. In contrast to traditional natural language processing research, in which benchmarks are integral to model evaluation, medical studies often assess the applicability of LLMs in specific domains or diseases. Consequently, full benchmark utilization may not always be necessary. One solution is to incorporate fine-grained annotations, such as disease categories or medical specialties^26,44, to enable selective use. Some benchmarks have already introduced such granularity by annotating ICD-10 codes to delineate relevant clinical areas.

The current findings indicate that several benchmarks include annotations for the question type, difficulty level, and reasoning category⁹. These metadata help to identify the strengths and limitations of individual models and offer strategic guidance for model development and fine-tuning. Benchmarks enriched with such annotations can enhance the scientific utility and impact of LLM research on medicine.

However, benchmarks vary in their strengths and limitations depending on their source data, and these factors should be carefully considered during the research design and benchmark selection (Table 0). Exam-based benchmarks offer objectivity and consistency owing to their standardized scoring systems; however, their reliance on structured formats limits their representation of nuanced clinical reasoning. CHQ-based benchmarks are advantageous for their accessibility and scalability, but may suffer from inconsistencies and noise inherent in publicly sourced online content, necessitating thorough human review. EHR-based benchmarks reflect real-world clinical contexts and offer high relevance, but face challenges related to data access and privacy. Literature-based benchmarks can provide high-quality context-rich information; however, they may lack recency and require expert curation to ensure accuracy.

Moreover, the development of multilingual benchmarks is critical. Recent efforts have produced benchmarks in Korean, Vietnamese, Arabic, and Swedish, underscoring the global nature of LLM adoption^23,27,42,67. For instance, the Ministry of Food and Drug Safety of South Korea has initiated regulatory frameworks that govern LLM-based applications in medicine. In South Korea, as in other nations, language-specific benchmarks are essential for aligning LLM evaluations with regulatory and clinical expectations.

Finally, standardization of methodologies for LLM-assisted benchmark development is urgently required. Recent benchmarks, such as CMExam and K-QA, have employed GPT-4 for annotation^26,44, whereas earlier benchmarks, such as emrQA, relied on rule-based, template-driven generation of physician queries⁴⁵. Given the high cost and labor associated with the manual curation of large datasets, integrating LLMs into the benchmark construction process is increasingly attractive. Nevertheless, the methodological frameworks for such integration remain underdeveloped. Advancing this area of research is critical to ensure the scalable, reproducible, and scientifically sound development of future medical benchmarks.

Methods

Study design

A scoping review was deemed the most appropriate approach to systematically examine benchmark datasets in the medical domain. This methodology facilitates the synthesis of existing knowledge, highlights key concepts and supporting evidence, and identifies research gaps. Beyond summarizing the findings, a scoping review can also serve as the basis for proposing frameworks. For instance, a human evaluation framework for LLMs in the medical field was previously developed based on the results of a scoping review. Following this approach, we used the results of our scoping review to propose a framework for developing medical benchmarks and a structured taxonomy that helps LLM researchers select suitable medical benchmarks.

Search strategy and study selection

The study was conducted in accordance with the PRISMA-ScR guidelines¹². A previous comprehensive review that examined 444 benchmarks across diverse domains identified seven benchmarks specific to the medical field⁹. Relevant keywords such as “clinical,” “medical,” “healthcare,” “benchmark,” “question answering,” and “QA dataset” were extracted from these studies. Additional insights were drawn from earlier scoping reviews targeting EHR and CHQ datasets^10,11. The literature search was restricted to studies published between January 1, 2017, and July 30, 2024. Searches were conducted in PubMed, the ACM Digital Library, and the ACL Anthology.
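The keywords and date range above could be combined into a Boolean search string along the following lines. This is only a sketch: the exact query strings submitted to each database are not reported here, so the split into domain terms and task terms, the AND/OR structure, and all names in the code are assumptions made for illustration.

```python
from datetime import date

# Keywords quoted in the text. Grouping them into two sets is an assumption.
DOMAIN_TERMS = ["clinical", "medical", "healthcare"]
TASK_TERMS = ["benchmark", "question answering", "QA dataset"]

# Publication window stated in the text.
SEARCH_START = date(2017, 1, 1)
SEARCH_END = date(2024, 7, 30)


def or_group(terms: list[str]) -> str:
    """Join quoted terms with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query() -> str:
    """Require at least one domain term and at least one task term."""
    return f"{or_group(DOMAIN_TERMS)} AND {or_group(TASK_TERMS)}"


def in_search_window(published: date) -> bool:
    """True when a publication date falls inside the search period (inclusive)."""
    return SEARCH_START <= published <= SEARCH_END


query = build_query()
```

Each database has its own query syntax and date filters, so a string like this would need to be adapted per source.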

Study selection followed a two-step screening process, with inclusion and exclusion criteria derived from prior literature^9–11. In the first screening step, two reviewers independently assessed each record. Discrepancies were resolved by a third reviewer. The second step involved full-text screening to finalize the inclusion of studies.
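The first screening step amounts to a simple decision rule, sketched below. The function and parameter names are hypothetical; the third reviewer is modeled as a callable so that it is consulted only when the two independent reviewers disagree.

```python
from typing import Callable


def screen_record(
    reviewer_a_includes: bool,
    reviewer_b_includes: bool,
    third_reviewer: Callable[[], bool],
) -> bool:
    """Return the inclusion decision for one record after title/abstract screening."""
    if reviewer_a_includes == reviewer_b_includes:
        # Agreement between the two independent reviewers is final.
        return reviewer_a_includes
    # Discrepancy: the third reviewer's decision resolves it.
    return third_reviewer()


agreed_exclusion = screen_record(False, False, third_reviewer=lambda: True)
resolved_inclusion = screen_record(True, False, third_reviewer=lambda: True)
```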

Data extraction and analysis

The analytical framework was adapted from previous benchmark reviews and studies focusing on dataset development^9–11. It comprised two principal domains: dataset construction and evaluation.

(1) Dataset construction

Dataset construction was assessed based on dataset type, source, construction methods, annotation practices, categorization, and language. The dataset types were classified as follows: (1) medical licensing exams (e.g., United States Medical Licensing Examination, USMLE); (2) CHQs generated through patient–provider online interactions; (3) EHRs from EMRs; and (4) biomedical literature from textbooks, scientific articles, or websites such as Wikipedia. Datasets integrating multiple sources were categorized as multi-source.

Based on prior studies, construction methods were classified into three categories: (1) Human-generated datasets, constructed manually by annotators following predefined criteria, without LLM involvement. (2) Model-generated datasets, produced using language models prompted to generate desired outputs. (3) Collection and improvement of existing datasets, developed by compiling and modifying open-source datasets⁹. Annotation methods were categorized by whether they were manual or automated, and by the nature of the annotated content. Information regarding the number and characteristics of categories was extracted. The quantities of training, test, and validation subsets were recorded.

(2) Evaluation

Evaluation analysis included evaluation methods, models used, metrics, and additional assessments. Consistent with prior literature, evaluation methods were classified into: (1) Code-based evaluation, quantitative assessment using predefined metrics. (2) Human evaluation, qualitative assessment by individuals with relevant expertise. (3) Model-based evaluation, assessment wherein LLMs evaluate outputs⁹. To determine model performance, we recorded the evaluation models used in each study and the metrics applied. Additional evaluations, including implementation-specific details, were also reviewed.

Structured taxonomy and framework of medical benchmark development

Based on the data derived from the scoping review, we created a structured taxonomy to provide criteria for researchers to select benchmark datasets, and a framework to serve as a reference for benchmark development. A structured taxonomy was constructed based on some of the indicators used in the scoping review analysis. The framework for the medical benchmark development was also structured as a checklist based on the metrics used in the scoping review analysis. In addition, we incorporated the key points emphasized in each benchmark study during development. To ensure objectivity, the results were reviewed and refined by five experts who were not involved in the original research.

Data availability

No new data was generated or analyzed in this study. All data supporting the findings of this study are available within the cited articles included in the scoping review.

Funding

This research was supported by the Technology Innovation Program (RS-2024-00432987), funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Acknowledgement

This study was derived in part from the doctoral dissertation of Junbok Lee at Seoul National University.

Author Contribution

JBL conceptualized the study, developed the review protocol, conducted the literature search, and performed data extraction and analysis as part of his doctoral research. JYS verified the extracted data and contributed to the development of the taxonomy and framework. BLC supervised the overall study, provided methodological guidance, and critically reviewed the manuscript for important intellectual content.

Competing interests

The authors declare no competing interests.

References

Omar, M., Nadkarni, G. N., Klang, E. & Glicksberg, B. S. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS Digital Health 3, e0000662 (2024).

Omiye, J. A. et al. Large language models in medicine: the potentials and pitfalls: a narrative review. Annals of Internal Medicine 177, 210–220 (2024).

Li, J. et al. ChatGPT in healthcare: a taxonomy and systematic review. Computer Methods and Programs in Biomedicine 245, 108013 (2024).

Meng, X. et al. The application of large language models in medicine: A scoping review. Iscience 27 (2024).

Lee, S. H. et al. Smartphone AI vs. medical experts: a comparative study in prehospital STEMI diagnosis. Yonsei Medical Journal 65, 174 (2024).

Lim, C. Y. et al. Need for Transparency and Clinical Interpretability in Hemorrhagic Stroke Artificial Intelligence Research: Promoting Effective Clinical Application. Yonsei Medical Journal 65, 611 (2024).

Niv, Y. & Tal, Y. Development of Patient Safety and Risk Management in Medicine. In Patient Safety and Risk Management in Medicine: From Theory to Practice. 15–26 (Springer, 2024).

Chen, X. et al. Evaluating large language models in medical applications: a survey. arXiv preprint arXiv:2405.07468 (2024). (2024).

Liu, Y. et al. Datasets for large language models: A comprehensive survey. arXiv preprint arXiv:2402.18041 (2024). (2024).

10.

Bardhan, J., Roberts, K. & Wang, D. Z. Question Answering for Electronic Health Records: Scoping Review of Datasets and Models. Journal of Medical Internet Research 26, e53636 (2024).

11.

Welivita, A. & Pu, P. A survey of consumer health question answering systems. Ai Magazine 44, 482–507 (2023).

12.

Tricco, A. C. et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Annals of internal medicine 169, 467–473 (2018).

13.

Vilares, D. & Gómez-Rodríguez, C. HEAD-QA: A healthcare dataset for complex reasoning. arXiv preprint arXiv:1906.04701 (2019). (2019).

14.

Li, D. et al. Towards medical machine reading comprehension with structural knowledge and plain text. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) 1427–1438 (2020).

15.

Jin, D. et al. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 6421 (2021).

16.

Li, J., Zhong, S. & Chen, K. MLEC-QA: A Chinese multi-choice biomedical question answering dataset. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 8862–8874 (2021).

17.

Pal, A., Umapathi, L. K. & Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning 248–260 (PMLR, 2022).

18.

Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

19.

Gourraud, E. M. & Rouvier, M. FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain.

20.

Gao, Y. et al. Dr. bench: Diagnostic reasoning benchmark for clinical natural language processing. Journal of biomedical informatics 138, 104286 (2023).

21.

Subramanian, A. et al. M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering. arXiv preprint arXiv:2406.03699 (2024). (2024).

22.

Kim, Y., Wu, J., Abdulle, Y. & Wu, H. MedExQA: Medical Question Answering Benchmark with Multiple Explanations. arXiv preprint arXiv:2406.06331 (2024). (2024).

23.

Hertzberg, N. & Lokrantz, A. MedQA-SWE-a Clinical Question & Answer Dataset for Swedish. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 11178–11186 (2024).

24.

Li, D. et al. Explaincpe: A free-text explanation benchmark of chinese pharmacist examination. arXiv preprint arXiv:2305.12945 (2023). (2023).

25.

Cai, Y. et al. Medbench: A large-scale chinese benchmark for evaluating medical large language models. In Proceedings of the AAAI Conference on Artificial Intelligence 17709–17717 (2024).

26.

Liu, J., Zhou, P. & Hua, Y. Benchmarking Large Language Models on CMExam–A Comprehensive Chinese Medical Exam Dataset. Advances in Neural Information Processing Systems 36 (2023).

27.

Kweon, S. et al. KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations. arXiv preprint arXiv:2403.01469 (2024).

28.

Alonso González, I. B., Oronoz Anchordoqui, M. & Agerri Gascón, R. MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering. (2024).

29.

Wang, X. et al. CMB: A comprehensive medical benchmark in Chinese. arXiv preprint arXiv:2308.08833 (2023).

30.

Abacha, A. B., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. In TREC 1–12 (2017).

31.

Zhang, S. et al. Multi-scale attentive interaction networks for Chinese medical question answer selection. IEEE Access 6, 74061–74071 (2018).

32.

Abacha, A. B. et al. Bridging the gap between consumers’ medication questions and trusted answers. In MEDINFO 2019: Health and Wellbeing e-Networks for All. 25–29 (IOS Press, 2019).

33.

Ben Abacha, A. & Demner-Fushman, D. A question-entailment approach to question answering. BMC bioinformatics 20, 1–23 (2019).

34.

He, J., Fu, M. & Tu, M. Applying deep matching networks to Chinese medical question answering: a study and a dataset. BMC medical informatics and decision making 19, 91–100 (2019).

35.

Tian, Y., Ma, W., Xia, F. & Song, Y. ChiMed: A Chinese medical corpus for question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task 250–260 (2019).

36.

Abacha, A. B. & Demner-Fushman, D. On the summarization of consumer health questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2228–2234 (2019).

37.

Chen, N. et al. A benchmark dataset and case study for Chinese medical question intent classification. BMC Medical Informatics and Decision Making 20, 1–7 (2020).

38.

Savery, M., Abacha, A. B., Gayen, S. & Demner-Fushman, D. Question-driven summarization of answers to consumer health questions. Scientific Data 7, 322 (2020).

39.

Zhu, M. et al. Question answering with long multiple-span answers. In Findings of the Association for Computational Linguistics: EMNLP 2020 3840–3849 (2020).

40.

Yadav, S., Gupta, D. & Demner-Fushman, D. CHQ-Summ: A dataset for consumer healthcare question summarization. arXiv preprint arXiv:2206.06581 (2022).

41.

Li, J. et al. Huatuo-26M, a large-scale Chinese medical QA dataset. arXiv preprint arXiv:2305.01526 (2023).

42.

Alasmari, A., Alhumoud, S. & Alshammari, W. AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024 50–56 (2024).

43.

Nguyen, V., Karimi, S., Rybinski, M. & Xing, Z. MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) 629–648 (2023).

44.

Manes, I. et al. K-QA: A real-world medical Q&A benchmark. arXiv preprint arXiv:2401.14493 (2024).

45.

Pampari, A., Raghavan, P., Liang, J. & Peng, J. emrQA: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018).

46.

Raghavan, P. et al. emrKBQA: A clinical knowledge-base question answering dataset. (Association for Computational Linguistics, 2021).

47.

Mullenbach, J. et al. CLIP: a dataset for extracting action items for physicians from hospital discharge notes. arXiv preprint arXiv:2106.02524 (2021).

48.

Blinov, P. et al. RuMedBench: a Russian medical language understanding benchmark. In International Conference on Artificial Intelligence in Medicine 383–392 (Springer, 2022).

49.

Zhang, N. et al. CBLUE: A Chinese biomedical language understanding evaluation benchmark. arXiv preprint arXiv:2106.08087 (2021).

50.

Soni, S., Gudala, M., Pajouhi, A. & Roberts, K. RadQA: A question answering dataset to improve comprehension of radiology reports. In Proceedings of the thirteenth language resources and evaluation conference 6250–6259 (2022).

51.

Lehman, E. et al. Learning to ask like a physician. arXiv preprint arXiv:2206.02696 (2022).

52.

Pal, A. CLIFT: Analysing Natural Distribution Shift on Question Answering Models in Clinical Domain. arXiv preprint arXiv:2310.13146 (2023).

53.

Bardhan, J., Colas, A., Roberts, K. & Wang, D. Z. DrugEHRQA: A question answering dataset on structured and unstructured electronic health records for medicine related queries. arXiv preprint arXiv:2205.01290 (2022).

54.

Wang, P. et al. Attention-based aspect reasoning for knowledge base question answering on clinical notes. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 1–6 (2022).

55.

He, Z. et al. MedEval: A multi-level, multi-task, and multi-domain medical benchmark for language model evaluation. arXiv preprint arXiv:2310.14088 (2023).

56.

Zhu, W. et al. PromptCBLUE: a Chinese prompt tuning benchmark for the medical domain. arXiv preprint arXiv:2310.14151 (2023).

57.

Fleming, S. L. et al. MedAlign: A clinician-generated dataset for instruction following with electronic medical records. In Proceedings of the AAAI Conference on Artificial Intelligence 22021–22030 (2024).

58.

Kweon, S. et al. EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings. arXiv preprint arXiv:2402.16040 (2024).

59.

Chen, X. et al. RareBench: Can LLMs Serve as Rare Diseases Specialists? In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 4850–4861 (2024).

60.

Šuster, S. & Daelemans, W. CliCR: a dataset of clinical case reports for machine reading comprehension. arXiv preprint arXiv:1803.09720 (2018).

61.

Zhu, M., Ahuja, A., Wei, W. & Reddy, C. K. A hierarchical attention retrieval model for healthcare question answering. In The World Wide Web Conference 2472–2482 (2019).

62.

Jin, Q. et al. PubMedQA: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 (2019).

63.

Pappas, D., Androutsopoulos, I. & Papageorgiou, H. BioRead: A new dataset for biomedical reading comprehension. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018).

64.

Kaddari, Z. & Bouchentouf, T. FrBMedQA: the first French biomedical question answering dataset. IAES International Journal of Artificial Intelligence 11, 1588 (2022).

65.

Mahbub, M. et al. cpgQA: A benchmark dataset for machine reading comprehension tasks on clinical practice guidelines and a case study using transfer learning. IEEE Access 11, 3691–3705 (2023).

66.

Vladika, J., Schneider, P. & Matthes, F. MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering. arXiv preprint arXiv:2406.05845 (2024).

67.

Tran, M.-N., Nguyen, P.-V., Nguyen, L. & Dien, D. ViMedAQA: A Vietnamese Medical Abstractive Question-Answering Dataset and Findings of Large Language Model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop) 356–364 (2024).
