References
1.Omar, M., Nadkarni, G. N., Klang, E. & Glicksberg, B. S. Large language models in medicine: A review of current clinical trials across healthcare applications. PLOS Digital Health 3, e0000662 (2024).
2.Omiye, J. A. et al. Large language models in medicine: the potentials and pitfalls: a narrative review. Annals of Internal Medicine 177, 210–220 (2024).
3.Li, J. et al. ChatGPT in healthcare: a taxonomy and systematic review. Computer Methods and Programs in Biomedicine 245, 108013 (2024).
4.Meng, X. et al. The application of large language models in medicine: A scoping review. iScience 27 (2024).
5.Lee, S. H. et al. Smartphone AI vs. medical experts: a comparative study in prehospital STEMI diagnosis. Yonsei Medical Journal 65, 174 (2024).
6.Lim, C. Y. et al. Need for Transparency and Clinical Interpretability in Hemorrhagic Stroke Artificial Intelligence Research: Promoting Effective Clinical Application. Yonsei Medical Journal 65, 611 (2024).
7.Niv, Y. & Tal, Y. Development of Patient Safety and Risk Management in Medicine. In Patient Safety and Risk Management in Medicine: From Theory to Practice. 15–26 (Springer, 2024).
8.Chen, X. et al. Evaluating large language models in medical applications: a survey. arXiv preprint arXiv:2405.07468 (2024).
9.Liu, Y. et al. Datasets for large language models: A comprehensive survey. arXiv preprint arXiv:2402.18041 (2024).
10.Bardhan, J., Roberts, K. & Wang, D. Z. Question Answering for Electronic Health Records: Scoping Review of Datasets and Models. Journal of Medical Internet Research 26, e53636 (2024).
11.Welivita, A. & Pu, P. A survey of consumer health question answering systems. AI Magazine 44, 482–507 (2023).
12.Tricco, A. C. et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Annals of Internal Medicine 169, 467–473 (2018).
13.Vilares, D. & Gómez-Rodríguez, C. HEAD-QA: A healthcare dataset for complex reasoning. arXiv preprint arXiv:1906.04701 (2019).
14.Li, D. et al. Towards medical machine reading comprehension with structural knowledge and plain text. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) 1427–1438 (2020).
15.Jin, D. et al. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences 11, 6421 (2021).
16.Li, J., Zhong, S. & Chen, K. MLEC-QA: A Chinese multi-choice biomedical question answering dataset. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 8862–8874 (2021).
17.Pal, A., Umapathi, L. K. & Sankarasubbu, M. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning 248–260 (PMLR, 2022).
18.Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).
19.Gourraud, E. M. & Rouvier, M. FrenchMedMCQA: A French Multiple-Choice Question Answering Dataset for Medical domain.
20.Gao, Y. et al. Dr. bench: Diagnostic reasoning benchmark for clinical natural language processing. Journal of Biomedical Informatics 138, 104286 (2023).
21.Subramanian, A. et al. M-QALM: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering. arXiv preprint arXiv:2406.03699 (2024).
22.Kim, Y., Wu, J., Abdulle, Y. & Wu, H. MedExQA: Medical Question Answering Benchmark with Multiple Explanations. arXiv preprint arXiv:2406.06331 (2024).
23.Hertzberg, N. & Lokrantz, A. MedQA-SWE-a Clinical Question & Answer Dataset for Swedish. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 11178–11186 (2024).
24.Li, D. et al. Explaincpe: A free-text explanation benchmark of chinese pharmacist examination. arXiv preprint arXiv:2305.12945 (2023).
25.Cai, Y. et al. Medbench: A large-scale chinese benchmark for evaluating medical large language models. In Proceedings of the AAAI Conference on Artificial Intelligence 17709–17717 (2024).
26.Liu, J., Zhou, P. & Hua, Y. Benchmarking Large Language Models on CMExam–A Comprehensive Chinese Medical Exam Dataset. Advances in Neural Information Processing Systems 36 (2023).
27.Kweon, S. et al. KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations. arXiv preprint arXiv:2403.01469 (2024).
28.Alonso González, I. B., Oronoz Anchordoqui, M. & Agerri Gascón, R. MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering. (2024).
29.Wang, X. et al. Cmb: A comprehensive medical benchmark in chinese. arXiv preprint arXiv:2308.08833 (2023).
30.Abacha, A. B., Agichtein, E., Pinter, Y. & Demner-Fushman, D. Overview of the medical question answering task at TREC 2017 LiveQA. In TREC 1–12 (2017).
31.Zhang, S. et al. Multi-scale attentive interaction networks for chinese medical question answer selection. IEEE Access 6, 74061–74071 (2018).
32.Abacha, A. B. et al. Bridging the gap between consumers’ medication questions and trusted answers. In MEDINFO 2019: Health and Wellbeing e-Networks for All. 25–29 (IOS Press, 2019).
33.Ben Abacha, A. & Demner-Fushman, D. A question-entailment approach to question answering. BMC Bioinformatics 20, 1–23 (2019).
34.He, J., Fu, M. & Tu, M. Applying deep matching networks to Chinese medical question answering: a study and a dataset. BMC Medical Informatics and Decision Making 19, 91–100 (2019).
35.Tian, Y., Ma, W., Xia, F. & Song, Y. ChiMed: A Chinese medical corpus for question answering. In Proceedings of the 18th BioNLP Workshop and Shared Task 250–260 (2019).
36.Abacha, A. B. & Demner-Fushman, D. On the summarization of consumer health questions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2228–2234 (2019).
37.Chen, N. et al. A benchmark dataset and case study for Chinese medical question intent classification. BMC Medical Informatics and Decision Making 20, 1–7 (2020).
38.Savery, M., Abacha, A. B., Gayen, S. & Demner-Fushman, D. Question-driven summarization of answers to consumer health questions. Scientific Data 7, 322 (2020).
39.Zhu, M. et al. Question answering with long multiple-span answers. In Findings of the Association for Computational Linguistics: EMNLP 2020 3840–3849 (2020).
40.Yadav, S., Gupta, D. & Demner-Fushman, D. Chq-summ: A dataset for consumer healthcare question summarization. arXiv preprint arXiv:2206.06581 (2022).
41.Li, J. et al. Huatuo-26m, a large-scale chinese medical qa dataset. arXiv preprint arXiv:2305.01526 (2023).
42.Alasmari, A., Alhumoud, S. & Alshammari, W. AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models. In Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation@ LREC-COLING 2024 50–56 (2024).
43.Nguyen, V., Karimi, S., Rybinski, M. & Xing, Z. MedRedQA for Medical Consumer Question Answering: Dataset, Tasks, and Neural Baselines. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) 629–648 (2023).
44.Manes, I. et al. K-qa: A real-world medical q&a benchmark. arXiv preprint arXiv:2401.14493 (2024).
45.Pampari, A., Raghavan, P., Liang, J. & Peng, J. emrqa: A large corpus for question answering on electronic medical records. arXiv preprint arXiv:1809.00732 (2018).
46.Raghavan, P. et al. emrkbqa: A clinical knowledge-base question answering dataset. (Association for Computational Linguistics, 2021).
47.Mullenbach, J. et al. CLIP: a dataset for extracting action items for physicians from hospital discharge notes. arXiv preprint arXiv:2106.02524 (2021).
48.Blinov, P. et al. Rumedbench: a Russian medical language understanding benchmark. In International Conference on Artificial Intelligence in Medicine 383–392 (Springer, 2022).
49.Zhang, N. et al. Cblue: A chinese biomedical language understanding evaluation benchmark. arXiv preprint arXiv:2106.08087 (2021).
50.Soni, S., Gudala, M., Pajouhi, A. & Roberts, K. Radqa: A question answering dataset to improve comprehension of radiology reports. In Proceedings of the thirteenth language resources and evaluation conference 6250–6259 (2022).
51.Lehman, E. et al. Learning to ask like a physician. arXiv preprint arXiv:2206.02696 (2022).
52.Pal, A. CLIFT: Analysing Natural Distribution Shift on Question Answering Models in Clinical Domain. arXiv preprint arXiv:2310.13146 (2023).
53.Bardhan, J., Colas, A., Roberts, K. & Wang, D. Z. Drugehrqa: A question answering dataset on structured and unstructured electronic health records for medicine related queries. arXiv preprint arXiv:2205.01290 (2022).
54.Wang, P. et al. Attention-based aspect reasoning for knowledge base question answering on clinical notes. In Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 1–6 (2022).
55.He, Z. et al. Medeval: A multi-level, multi-task, and multi-domain medical benchmark for language model evaluation. arXiv preprint arXiv:2310.14088 (2023).
56.Zhu, W. et al. PromptCBLUE: a Chinese prompt tuning benchmark for the medical domain. arXiv preprint arXiv:2310.14151 (2023).
57.Fleming, S. L. et al. Medalign: A clinician-generated dataset for instruction following with electronic medical records. In Proceedings of the AAAI Conference on Artificial Intelligence 22021–22030 (2024).
58.Kweon, S. et al. EHRNoteQA: A Patient-Specific Question Answering Benchmark for Evaluating Large Language Models in Clinical Settings. arXiv preprint arXiv:2402.16040 (2024).
59.Chen, X. et al. RareBench: Can LLMs Serve as Rare Diseases Specialists? In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 4850–4861 (2024).
60.Šuster, S. & Daelemans, W. CliCR: a dataset of clinical case reports for machine reading comprehension. arXiv preprint arXiv:1803.09720 (2018).
61.Zhu, M., Ahuja, A., Wei, W. & Reddy, C. K. A hierarchical attention retrieval model for healthcare question answering. In The World Wide Web Conference 2472–2482 (2019).
62.Jin, Q. et al. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146 (2019).
63.Pappas, D., Androutsopoulos, I. & Papageorgiou, H. BioRead: A new dataset for biomedical reading comprehension. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018).
64.Kaddari, Z. & Bouchentouf, T. FrBMedQA: the first French biomedical question answering dataset. IAES International Journal of Artificial Intelligence 11, 1588 (2022).
65.Mahbub, M. et al. cpgqa: A benchmark dataset for machine reading comprehension tasks on clinical practice guidelines and a case study using transfer learning. IEEE Access 11, 3691–3705 (2023).
66.Vladika, J., Schneider, P. & Matthes, F. MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering. arXiv preprint arXiv:2406.05845 (2024).
67.Tran, M.-N., Nguyen, P.-V., Nguyen, L. & Dien, D. ViMedAQA: A Vietnamese Medical Abstractive Question-Answering Dataset and Findings of Large Language Model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop) 356–364 (2024).