| Model | Backbone | Model size | Biomedical literature | Clinical notes | Continual pre-training (# of tokens) | Instruction tuning (# of instructions) | Evaluation tasks | Release date |
|---|---|---|---|---|---|---|---|---|
| MedAlpaca | LLaMA | 7/13B | ✓ | ✗ | - | 160K | QA | 04/14/2023 |
| ChatDoctor | LLaMA2 | 7B | ✓ | ✗ | - | 100K | QA | 05/24/2023 |
| AlpaCare | LLaMA | 7/13B | ✓ | ✗ | - | 52K | QA, Summarization | 10/23/2023 |
| Clinical LLaMA | LLaMA | 7B | ✗ | ✓ | - | - | Classification | 07/06/2023 |
| Meditron | LLaMA2 | 7/70B | ✓ | ✗ | 48B | - | QA | 11/27/2023 |
| PMC-LLaMA | LLaMA | 7/13B | ✓ | ✗ | 79B | 514K | QA | 04/27/2023 |
| Me-LLaMA | LLaMA2 | 13/70B | ✓ | ✓ | 129B | 214K | QA, NER, RE, Classification, Summarization, NLI, Medical Diagnosis | 06/05/2024 |
| Task | Type | Source | Size | Copyright |
|---|---|---|---|---|
| General | Conversation | Alpaca19 | 20,000 | CC-BY-NC 4.0 |
| | | Dolly20 | | CC-BY-SA-3.0 |
| | | ShareGPT21 | | Apache-2.0 |
| Biomedical | Conversation | HealthCareMagic12 | 20,000 | Reserved by HealthCareMagic and Icliniq |
| | | Icliniq12 | | |
| | Instructions | MedInstruct13 | 52,000 | CC BY-NC 4.0 |
| | Question Answering | Medical Flash Cards3 | 34,000 | No commercialized use |
| | | MEDIQA22 | 2,220 | CC BY 4.0 |
| | | MedicationQA23 | 690 | CC BY 4.0 |
| | | LiveQA24 | 634 | CC BY 4.0 |
| | | WikiDocPatient3 | 5,490 | CC BY-SA 4.0 |
| | | GuidelineQA | 2,000 | Common Crawl (other) |
| | Summarization | PubMed Central | 10,000 | CC BY |
| | Next Sentence Generation | PubMed Central | 20,000 | CC BY |
| | Keywords Prediction | PubMed Central | 10,000 | CC BY |
| | Causal Relation Detection | PubMed25 | 2,450 | CC BY |
| | Relation Extraction | UMLS knowledge graph2 | 10,000 | Openrail |
| Clinical | QA, summarization, classification, mortality prediction | MIMIC-III,15 MIMIC-IV16 | 30,000 | PhysioNet credentialed health data use agreement 1.5.0 |
| Data | Task | Train | Valid | Test | Evaluation |
|---|---|---|---|---|---|
| PubMedQA*28 | QA | 190,143 | 21,126 | 500 | Accuracy, Macro-F1 |
| MedQA29 | QA | 10,178 | 1,272 | 1,273 | Accuracy, Macro-F1 |
| MedMCQA*30 | QA | 164,540 | 18,282 | 4,183 | Accuracy, Macro-F1 |
| EmrQA31 | QA | 122,326 | 30,581 | 26,804 | Exact match, F1 |
| i2b232 | NER | 60,875 | 7,400 | 7,451 | Entity-level Macro-F1 |
| DDI33 | RE | 18,779 | 7,244 | 5,761 | Macro-F1 |
| HoC34 | Classification | 1,108 | 157 | 315 | Label-wise Macro-F1 |
| MTSample35 | Classification | 4,999 | 500 | 999 | Accuracy, Macro-F1 |
| PubMed36 | Summarization | 117,108 | 6,631 | 6,658 | Rouge, BERTScore |
| MIMIC-CXR17 | Summarization | 122,014 | 957 | 1,606 | Rouge, BERTScore |
| BioNLI37 | NLI | 5,544 | 5,000 | 6,308 | Accuracy, Macro-F1 |
| MedNLI38 | NLI | 11,232 | 1,422 | 1,395 | Accuracy, Macro-F1 |
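As a worked illustration of the accuracy and macro-F1 metrics listed above, the sketch below scores a handful of PubMedQA-style yes/no/maybe predictions with scikit-learn. It is a minimal example under assumed labels, not the evaluation code used in the paper.

```python
# Minimal sketch (not the paper's evaluation code): accuracy and macro-F1 for a
# PubMedQA-style yes/no/maybe task, using scikit-learn. Labels are illustrative.
from sklearn.metrics import accuracy_score, f1_score

gold = ["yes", "no", "maybe", "yes", "no"]   # reference answers
pred = ["yes", "no", "yes", "yes", "maybe"]  # model predictions

accuracy = accuracy_score(gold, pred)
# Macro-F1 averages per-class F1 without weighting by class frequency,
# so a rare class such as "maybe" counts as much as the majority classes.
macro_f1 = f1_score(gold, pred, average="macro")

print(f"Accuracy: {accuracy:.3f}  Macro-F1: {macro_f1:.3f}")
```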
| Task | Dataset | Metric | LLaMA2 13B | PMC-LLaMA 13B | Me-LLaMA 13B | LLaMA2 70B | Meditron 70B | Me-LLaMA 70B |
|---|---|---|---|---|---|---|---|---|
| Question answering | PubMedQA | Acc | 0.800 | 0.778 | 0.802 | 0.800 | 0.800* | 0.814 |
| | | Macro-F1 | 0.560 | 0.544 | 0.562 | 0.560 | - | 0.572 |
| | MedQA | Acc | 0.467 | 0.456 | 0.493 | 0.598 | 0.607* | 0.623 |
| | | Macro-F1 | 0.465 | 0.454 | 0.487 | 0.595 | - | 0.621 |
| | MedMCQA | Acc | 0.527 | 0.548 | 0.557 | 0.626 | 0.651* | 0.643 |
| | | Macro-F1 | 0.524 | 0.545 | 0.551 | 0.625 | - | 0.640 |
| | EmrQA | Acc | 0.789 | 0.810 | 0.857 | 0.847 | 0.850 | 0.854 |
| | | F1 | 0.730 | 0.738 | 0.751 | 0.751 | 0.751 | 0.751 |
| Named entity recognition | i2b2 | Macro-F1 | 0.904 | 0.901 | 0.906 | 0.913 | 0.908 | 0.910 |
| Relation extraction | DDI | Macro-F1 | 0.622 | 0.622 | 0.559 | 0.746 | 0.737 | 0.779 |
| Classification | HoC | Macro-F1 | 0.696 | 0.422 | 0.684 | 0.818 | 0.702 | 0.841 |
| | MTsample | Macro-F1 | 0.430 | 0.345 | 0.451 | 0.458 | 0.284 | 0.544 |
| Summarization | PubMed | R-L | 0.191 | 0.091 | 0.197 | 0.211 | 0.197 | 0.209 |
| | | BERTS | 0.663 | 0.516 | 0.679 | 0.689 | 0.677 | 0.700 |
| | MIMIC-CXR | R-L | 0.437 | 0.139 | 0.453 | 0.440 | 0.458 | 0.476 |
| | | BERTS | 0.816 | 0.694 | 0.821 | 0.813 | 0.824 | 0.828 |
| Natural language inference | BioNLI | Macro-F1 | 0.409 | 0.332 | 0.447 | 0.447 | 0.444 | 0.566 |
| | MedNLI | Macro-F1 | 0.881 | 0.868 | 0.903 | 0.884 | 0.897 | 0.916 |
| Task | Dataset | Metric | LLaMA2-13B-chat | PMC-LLaMA-chat | Medalpaca-13B | AlpaCare-13B | Me-LLaMA 13B-chat | LLaMA2-70B-chat | Me-LLaMA 70B-chat |
|---|---|---|---|---|---|---|---|---|---|
| Question answering | PubMedQA | Accuracy | 0.546 | 0.504 | 0.238 | 0.538 | 0.700 | 0.668 | 0.768 |
| | | Macro-F1 | 0.457 | 0.305 | 0.192 | 0.373 | 0.504 | 0.477 | 0.557 |
| | MedQA | Accuracy | 0.097 | 0.207 | 0.143 | 0.304 | 0.427 | 0.376 | 0.523 |
| | | Macro-F1 | 0.148 | 0.158 | 0.102 | 0.281 | 0.422 | 0.367 | 0.521 |
| | MedMCQA | Accuracy | 0.321 | 0.212 | 0.205 | 0.385 | 0.449 | 0.339 | 0.539 |
| | | Macro-F1 | 0.243 | 0.216 | 0.164 | 0.358 | 0.440 | 0.273 | 0.538 |
| | EmrQA | Accuracy | 0.001 | 0.053 | 0.000 | 0.001 | 0.048 | 0.050 | 0.119 |
| | | F1 | 0.098 | 0.304 | 0.040 | 0.198 | 0.307 | 0.251 | 0.346 |
| Named entity recognition | i2b2 | Macro-F1 | 0.143 | 0.091 | 0.000 | 0.173 | 0.166 | 0.321 | 0.329 |
| Relation extraction | DDI | Macro-F1 | 0.090 | 0.147 | 0.058 | 0.110 | 0.214 | 0.087 | 0.283 |
| Classification | HoC | Macro-F1 | 0.228 | 0.184 | 0.246 | 0.267 | 0.335 | 0.309 | 0.544 |
| | MTsample | Macro-F1 | 0.133 | 0.083 | 0.003 | 0.273 | 0.229 | 0.254 | 0.384 |
| Summarization | PubMed | Rouge-L | 0.161 | 0.028 | 0.014 | 0.167 | 0.116 | 0.192 | 0.169 |
| | | BERTS* | 0.671 | 0.128 | 0.117 | 0.671 | 0.445 | 0.684 | 0.678 |
| | MIMIC-CXR | Rouge-L | 0.144 | 0.139 | 0.010 | 0.134 | 0.400 | 0.131 | 0.418 |
| | | BERTS* | 0.704 | 0.694 | 0.502 | 0.702 | 0.797 | 0.696 | 0.787 |
| Natural language inference | BioNLI | Macro-F1 | 0.173 | 0.159 | 0.164 | 0.170 | 0.195 | 0.297 | 0.436 |
| | MedNLI | Macro-F1 | 0.412 | 0.175 | 0.175 | 0.275 | 0.472 | 0.515 | 0.675 |

*BERTS: BERTScore.41
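For reference, the two summarization metrics reported above (Rouge-L and BERTScore) can be computed with the open-source `rouge-score` and `bert-score` packages. The snippet below is a minimal sketch with toy texts under that assumed tooling, not the paper's evaluation pipeline.

```python
# Minimal sketch (assumed tooling, not the paper's pipeline): Rouge-L and
# BERTScore for one generated summary against one reference. Texts are toy examples.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "No acute cardiopulmonary abnormality."
prediction = "No acute cardiopulmonary process identified."

# Rouge-L: F-measure over the longest common subsequence of tokens.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

# BERTScore: similarity of contextual token embeddings from a pretrained encoder.
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"Rouge-L: {rouge_l:.3f}  BERTScore F1: {f1.mean().item():.3f}")
```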
| Data | Prompt |
|---|---|
| General domain data | "Below is an input that describes a task, maybe paired with a context that provides further information. Write a response that appropriately completes the request. INPUT:{Text} CONTEXT:{Text} OUTPUT:" |
| Medical conversation data | "Given a medical query, provide a concise and clear answer based on the patient's description. INPUT: {text} OUTPUT:" |
| MedInstruct | "Below is an input that describes a medical task, maybe paired with a context that provides further input information. Write a response that appropriately completes the request. INPUT: {text} CONTEXT: {text} OUTPUT:" |
| Medical flash cards | "If you are a medical professional, answer this question truthfully. INPUT: {text} OUTPUT:" |
| MEDIQA | "Answer the input medical question based on the given context. INPUT: {text} CONTEXT: {text} OUTPUT:" |
| MedicationQA | "Answer this medical question truthfully. INPUT: {text} OUTPUT:" |
| LiveQA | "Given a medical query, provide a concise and clear answer based on the given details. INPUT: {text} OUTPUT:" |
| Patient Information QA | "Answer this medical question truthfully. INPUT: {text} OUTPUT:" |
| GuidelineQA | "If you are a medical professional, answer this question truthfully. INPUT: {text} OUTPUT:" |
| Summarization | "Given an abstract of a biomedical paper, generate the title. INPUT: {text} OUTPUT:" |
| PubMed sentence generation | "The task is to generate the subsequent sentence in a piece of biomedical literature based on the existing content. INPUT: {text} OUTPUT:" |
| Guideline sentence generation | "Write the next part for a clinical guideline. You're given a piece of the guideline, and your task is to continue it. The new part should match the style and detail of the original and be medically correct. INPUT: {text} OUTPUT:" |
| Summarization | "Given an abstract of a biomedical paper, generate the title. INPUT: {text} OUTPUT:" |
| Topic prediction | "The task is to assign MeSH (Medical Subject Headings) terms to a given piece of biomedical literature. The input is the title and abstract of a biomedical literature. The output is a list of applicable MeSH terms. INPUT: {text} OUTPUT:" |
| Causal relation detection | "For the following statement, determine whether it offers: 1) Strong Advice: The statement gives a clear and assertive recommendation or guidance, or 2) Weak Advice: The statement provides a suggestion or mild recommendation but is not very assertive, or 3) No Advice: The statement doesn't offer any recommendation or guidance. INPUT: {text} OUTPUT:" |
| Relation extraction | "Given a medical question, provide the answer to determine the relation between two medical terms in the question. INPUT: {text} OUTPUT:" |
| MIMIC radiology | "The task is to generate the radiology impression based on radiology findings from a patient's clinical note. The input is the radiology findings from a patient's clinical note. Generate an impression accordingly. INPUT: {text} OUTPUT:" |
| MIMIC disease multiple-choice | "The task is to determine whether a patient suffers from certain diseases, and you need to choose the right answer from the choices. The input is the clinical note of a patient. Please determine which of the following disease(s) occurred during the patient's hospital stay, according to the clinical note in the input: A:{text} B:{text} C:{text} D:{text} Output format: The output should be A, B, C, or D only. INPUT: {text} OUTPUT:" |
| MIMIC disease QA | "The task is to determine whether a patient suffers from certain diseases. The input is the clinical note of a patient. Please respond with all of the disease(s) that occurred during the hospital stay, according to the clinical note in the input. INPUT: {text} OUTPUT:" |
| MIMIC disease binary classification | "The task is to determine whether a patient suffers from certain diseases. The input is the clinical note of a patient. Please determine whether all of the following disease(s) occurred during the patient's hospital stay: {text}. Answer with True or False only. INPUT: {text} OUTPUT:" |
| MIMIC mortality | "The task is to determine whether the patient died while in the hospital. The input is the clinical note of a patient. Using the information in the input, determine whether the patient died while in the hospital. INPUT: {text} OUTPUT:" |
| MIMIC manual QA | "The task is answering a question based on a clinical note in the input and your knowledge. The input is the clinical note of a patient. Please answer the question: {text} INPUT: {text} OUTPUT:" |
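To illustrate how templates like those above turn raw records into instruction-tuning examples, the sketch below fills the MedInstruct-style template with placeholder text. The field names (`input`, `context`) and the helper `build_example` are illustrative assumptions, not part of the released data pipeline.

```python
# Minimal sketch (illustrative only): filling one of the instruction-tuning
# templates above to build a prompt/response pair for supervised fine-tuning.
MEDINSTRUCT_TEMPLATE = (
    "Below is an input that describes a medical task, maybe paired with a context "
    "that provides further input information. Write a response that appropriately "
    "completes the request. INPUT: {input} CONTEXT: {context} OUTPUT:"
)

def build_example(instruction: str, context: str, response: str) -> dict:
    """Pair a filled-in prompt with its target response."""
    prompt = MEDINSTRUCT_TEMPLATE.format(input=instruction, context=context)
    return {"prompt": prompt, "completion": " " + response}

example = build_example(
    instruction="List two common side effects of metformin.",
    context="",
    response="Nausea and diarrhea.",  # illustrative target text
)
print(example["prompt"])
```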
2 Medical Evaluation Benchmark

| Data | Prompt |
|---|---|
| PubMedQA | "Your task is to answer biomedical questions using the given abstract. Only output yes, no, or maybe as answer. INPUT:{Text} CONTEXT:{Text} OUTPUT:" |
| MedQA | "You are a medical doctor taking the US Medical Licensing Examination. You need to demonstrate your understanding of basic and clinical science, medical knowledge, and mechanisms underlying health, disease, patient care, and modes of therapy. Show your ability to apply the knowledge essential for medical practice. For the following multiple-choice question, select one correct answer from A to E. Base your answer on the current and standard practices referenced in medical guidelines. Question:{text} Options: {text} Answer:" |
| MedMCQA | "You are a medical doctor answering real-world medical entrance exam questions. Based on your understanding of basic and clinical science, medical knowledge, and mechanisms underlying health, disease, patient care, and modes of therapy, answer the following multiple-choice question. Select one correct answer from A to D. Base your answer on the current and standard practices referenced in medical guidelines. Question:{text} Options: {text} Answer:" |
| EmrQA | "Given a medical context and an open-ended question related to it, extract the relevant text segment from the context as an answer. Expected output: Only extract and return the text segment from the provided context that directly answers the question. Do not add any new words. Context:{text} Answer:" |
| i2b2 | "In the sentence extracted from clinical narrative notes, identify all the clinically relevant entities, including problems, tests, and treatments. The required answer format is the same sentence with HTML < span > tags to mark up specific entities. Entity Markup Guide: Use < span class="problem"> to denote a medical problem. Use < span class="treatment"> to denote a treatment. Use < span class="test"> to denote a test. Input Text: {text} Output Text:" |
| DDI | "The task is to predict the relationship between the given head entity labeled as @DRUG1$ and tail entity labeled as @DRUG2$ within a given sentence; this relation must be one of ('mechanism', 'effect', 'advice', 'int', 'none'). mechanism: this type is used to annotate drug-drug interactions that are described by their pharmacokinetic mechanism. effect: this type is used to annotate drug-drug interactions describing an effect or a pharmacodynamic mechanism. advice: this type is used when a recommendation or advice regarding a drug interaction is given. int: this type is used when a drug-drug interaction appears in the text without providing any additional information. none: there is no drug-drug interaction. INPUT: {text}. OUTPUT:" |
| HoC | "The task is to decide the Hallmarks of Cancer (HOC) taxonomy of the article based on its abstract. The input is an abstract text. There are 10 topics you will need to decide whether the article is related to. Topics: sustaining proliferative signaling, evading growth suppressors, resisting cell death, enabling replicative immortality, inducing angiogenesis, activating invasion and metastasis, genomic instability and mutation, tumor promoting inflammation, cellular energetics, avoiding immune destruction. The output should be topics from the above 10 topics that are related to the input article. Please note one article can be related to multiple topics. Output format: provide a semicolon-separated list of relevant topics. INPUT:{text} OUTPUT:" |
| MTSample | "TASK: The task is to determine the medical specialty or domain that a medical transcription belongs to. The input is a medical transcription. There are 40 medical specialties or domains, and you need to decide which one the transcription is related to. The medical specialties or domains are: 'Surgery', 'Allergy / Immunology', 'Sleep Medicine', 'Pediatrics - Neonatal', 'SOAP / Chart / Progress Notes', 'Bariatrics', 'Pain Management', 'Lab Medicine - Pathology', 'Dermatology', 'Orthopedic', 'Dentistry', 'Psychiatry / Psychology', 'General Medicine', 'Office Notes', 'Letters', 'Neurosurgery', 'Radiology', 'Cosmetic / Plastic Surgery', 'Nephrology', 'Diets and Nutritions', 'Chiropractic', 'Gastroenterology', 'Cardiovascular / Pulmonary', 'Speech - Language', 'Hospice - Palliative Care', 'Autopsy', 'Endocrinology', 'Emergency Room Reports', 'Discharge Summary', 'ENT - Otolaryngology', 'Urology', 'Physical Medicine - Rehab', 'Neurology', 'Podiatry', 'Ophthalmology', 'Rheumatology', 'IME-QME-Work Comp etc.', 'Hematology - Oncology', 'Consult - History and Phy.', 'Obstetrics / Gynecology'. The output should be only one medical specialty or domain from the above 40 specialties or domains, that is most relevant to the medical transcription. Please note that each medical transcript can only be related to one medical specialty or domain. Output format: provide the name of the medical specialty or domain. INPUT:{text} OUTPUT:" |
| PubMedSum | "The task is to summarize an input biomedical literature in six sentences. The input is a biomedical literature. The output is the summary of an input biomedical literature in six sentences. INPUT:{text} OUTPUT:" |
| MIMIC-CXR | "Derive the impression from findings in the radiology report. INPUT:{text} OUTPUT:" |
| BioNLI | "TASK: Please classify the relationship between the given premise and hypothesis into one of the following labels: entailment, contradiction, or neutral. Return only the label. INPUT:{text} OUTPUT:" |
| MedNLI | "TASK: Please classify the relationship between the given premise and hypothesis into one of the following labels: entailment, contradiction, or neutral. Return only the label. INPUT:{text} OUTPUT:" |
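The i2b2 prompt above asks the model to return the input sentence with HTML `<span>` markup around each entity. The sketch below shows one way such output might be post-processed into (label, text) pairs for entity-level scoring; the regex and the helper `extract_entities` are illustrative, not the paper's code.

```python
# Minimal sketch (hypothetical post-processing): pull (label, text) entity pairs
# out of model output that follows the i2b2 <span> markup format described above.
import re

SPAN_RE = re.compile(
    r'<\s*span\s+class="\s*(problem|treatment|test)\s*"\s*>(.*?)<\s*/\s*span\s*>',
    re.IGNORECASE | re.DOTALL,
)

def extract_entities(tagged_sentence):
    """Return (label, text) pairs from a <span>-tagged sentence."""
    return [(label.lower(), text.strip()) for label, text in SPAN_RE.findall(tagged_sentence)]

output = ('The patient was started on <span class="treatment">heparin</span> '
          'for <span class="problem">deep vein thrombosis</span>.')
print(extract_entities(output))
# [('treatment', 'heparin'), ('problem', 'deep vein thrombosis')]
```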
3 Complex Clinical Case Diagnosis Task

| Dataset | Metric | LLaMA2 13B (backbone) | Me-LLaMA 13B (backbone + pre-train) | Me-LLaMA-13B-chat (backbone + pre-train + instruction tuning) | LLaMA2 70B (backbone) | Me-LLaMA 70B (backbone + pre-train) | Me-LLaMA-70B-chat (backbone + pre-train + instruction tuning) |
|---|---|---|---|---|---|---|---|
| PubMedQA | Acc | 0.216 | 0.266 | 0.700 | 0.132 | 0.682 | 0.768 |
| | Macro-F1 | 0.177 | 0.250 | 0.504 | 0.152 | 0.520 | 0.557 |
| MedQA | Acc | 0.000 | 0.000 | 0.427 | 0.005 | 0.281 | 0.523 |
| | Macro-F1 | 0.000 | 0.000 | 0.422 | 0.009 | 0.350 | 0.521 |
| MedMCQA | Acc | 0.003 | 0.003 | 0.449 | 0.012 | 0.447 | 0.539 |
| | Macro-F1 | 0.006 | 0.005 | 0.440 | 0.024 | 0.396 | 0.538 |
| EmrQA | Acc | 0.000 | 0.005 | 0.048 | 0.000 | 0.021 | 0.119 |
| | F1 | 0.038 | 0.122 | 0.307 | 0.000 | 0.172 | 0.346 |
| i2b2 | Macro-F1 | 0.008 | 0.030 | 0.263 | 0.181 | 0.224 | 0.329 |
| DDI | Macro-F1 | 0.035 | 0.036 | 0.214 | 0.034 | 0.118 | 0.283 |
| HoC | Macro-F1 | 0.253 | 0.210 | 0.335 | 0.255 | 0.252 | 0.544 |
| MTsample | Macro-F1 | 0.042 | 0.072 | 0.229 | 0.066 | 0.226 | 0.384 |
| PubMed | R-L | 0.170 | 0.168 | 0.116 | 0.167 | 0.119 | 0.169 |
| | BERTS | 0.654 | 0.654 | 0.445 | 0.654 | 0.654 | 0.678 |
| MIMIC-CXR | R-L | 0.051 | 0.172 | 0.400 | 0.059 | 0.137 | 0.418 |
| | BERTS | 0.566 | 0.697 | 0.797 | 0.577 | 0.649 | 0.787 |
| BioNLI | Macro-F1 | 0.109 | 0.060 | 0.195 | 0.285 | 0.499 | 0.436 |
| MedNLI | Macro-F1 | 0.172 | 0.206 | 0.472 | 0.265 | 0.256 | 0.675 |