Cloud vs. On-Premise Large Language Models for Urgent Patient-Portal Message Screening: A Comparative Evaluation

ValderyMoura Junior

PhD, MBA

1,2,3,9

SusannaGallani

PhD

LaraBasovic

MajedAlomar

JasonC.You

MD, PhD

LipikaSamal

MD, MPH

ElyseR.Park

PhD, MPH

2,6

LouisaG.Sylvia

PhD

GaurdiaBanister

PhD, RN

PeterHadar

ShawnMurphy

MD, PhD

1,4

LidiaMVRMoura

MD, PhD, MPH

4,5✉Emailvmoura@mgh.harvard.edu

1Center for AI and Biomedical Informatics for the Learning Healthcare System (CAIBILS) at Mass General BrighamSomervilleMAUSA

2Department of Medicine, Massachusetts General HospitalHarvard Medical SchoolBostonMAUSA

3Accounting and Management UnitHarvard Business SchoolBostonMAUSA

4Department of NeurologyMassachusetts General Hospital, Harvard Medical SchoolBostonMAUSA

5Department of EpidemiologyHarvard T.H. Chan School of Public HealthBostonMAUSA

6Department of Psychiatry, Massachusetts General HospitalHarvard Medical SchoolBostonMAUSA

7Department of Nursing, Massachusetts General HospitalHarvard Medical SchoolBostonMAUSA

8Department of Medicine, Brigham and Women’s HospitalHarvard Medical SchoolBostonMAUSA

MBA. Center of AI and Biomedical Informatics for the Learning Healthcare System (CAIBILS)Mass General Brigham399 Revolution Drive, Suite 72502145SomervilleMA

Valdery Moura Junior, PhD, MBA^1,2,3; Susanna Gallani, PhD³; Lara Basovic, MD⁴; Majed Alomar, MD⁴, Jason C. You MD, PhD⁴; Lipika Samal, MD, MPH⁸, Elyse R. Park, PhD, MPH^2,6; Louisa G. Sylvia, PhD⁶; Gaurdia Banister, PhD, RN⁷; Peter Hadar, MD⁴; Shawn Murphy, MD, PhD^1,4; Lidia MVR Moura, MD, PhD, MPH^4,5

Affiliations:

1. Center for AI and Biomedical Informatics for the Learning Healthcare System (CAIBILS) at Mass General Brigham, Somerville, MA, USA

2. Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

3. Accounting and Management Unit, Harvard Business School, Boston, MA, USA

4. Department of Neurology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

5. Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA

6. Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

7. Department of Nursing, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

8. Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA

Corresponding author:

Valdery Moura Junior, PhD, MBA. Center of AI and Biomedical Informatics for the Learning Healthcare System (CAIBILS), Mass General Brigham, 399 Revolution Drive, Suite 725, Somerville, MA 02145, vmoura@mgh.harvard.edu

Abstract

Importance

: Patient portal messaging has become a core feature of outpatient care, particularly in neurology. In epilepsy care, timely triage of urgent symptoms — such as breakthrough seizures or adverse medication effects — and efficient evaluation of urgency level are critical to patient safety. However, increasing message volume and a nationwide neurologist shortage have intensified clinician burden and delayed response times. Large language models (LLMs) may offer a scalable solution. A key step to achieving this goal is to compare performance across cloud-based and locally deployable models and to estimate the impact of the differences in high-stakes clinical contexts.

Objective

To evaluate the urgency and message-type classification performance of six LLMs - three commercial cloud-hosted (GPT-4o, GPT-5, GPT-5 Mini) and three locally deployable open-weight models (Llama 4 Scout, GPT-OSS 20B, Gemma 3 27B) - against a reference standard in outpatient epilepsy care.

Design, Setting, and Participants:

Retrospective diagnostic accuracy study of 503 de-identified patient portal messages from adult outpatients at a tertiary epilepsy clinic. Five epilepsy fellowship-trained neurologists independently annotated each message using a standard operating procedure (SOP) with high inter-rater reliability (Fleiss’ κ ≥ 0.80). Analyses were stratified by three non–mutually exclusive levels of physician consensus: Unanimous (5/5), Majority (≥ 3/5), and Any MD Match.

Main Outcomes and Measures:

Primary outcomes included sensitivity and negative predictive value (NPV) for urgency classification under Unanimous or Majority reference strata. Secondary outcomes included specificity, positive predictive value (PPV), overall accuracy, and message-type classification accuracy.

Results

Under the Unanimous reference standard, five of six models achieved perfect sensitivity and NPV, indicating safe rule-out performance. Under the Majority consensus, GPT-5 achieved the highest sensitivity (0.98) and NPV (1.00), while GPT-4o and Llama 4 Scout offered balanced performance with strong specificity (0.87–0.88) and NPV (≥ 0.97). GPT-OSS 20B demonstrated high specificity (0.95) but lower sensitivity (0.57), while Gemma 3 27B provided intermediate performance and supports full on-premise deployment. GPT-5 Mini offered a cost-efficient cloud alternative with solid overall performance, though reproducibility was limited by non-configurable decoding.

Conclusions and Relevance:

In high-risk outpatient neurology, both cloud-hosted and locally deployed LLMs demonstrated screening-level performance comparable to epilepsy fellowship-trained neurologists. Performance trade-offs between sensitivity and specificity allow institutions to tailor model selection to operational goals - whether minimizing false negatives, reducing alert burden, or ensuring Protected Health Information (PHI) containment. These results support the safe, scalable, and privacy-preserving deployment of LLM-powered triage systems across digitally burdened clinical neurology settings.

1. Introduction

Patient portal messaging has rapidly evolved into a core channel for patient-clinician communication in outpatient medicine. In neurology, and especially in epilepsy care, this shift has been profound[1]. Epilepsy is one of the most common serious neurological conditions - approximately 1 in 10 people in the U.S. will experience at least one seizure during their lifetime, and nearly 3.4 million people in the U.S. are currently living with epilepsy[2, 3]. Management is long-term, complex, and highly individualized, with frequent medication adjustments, side effect monitoring, and urgent need for communication following breakthrough seizures or changes in neurological status[4–6]. Delays in access to neurologists can exacerbate clinical risks and worsen outcomes[6, 7].

The burden of managing these communications is amplified by a growing shortage of neurologists in the United States, one which is projected to worsen over the next decade[8]. As subspecialty access becomes increasingly constrained, primary care physicians are often drawn into the ongoing management of epilepsy patients, receiving and responding to urgent messages that fall outside their immediate scope when a pathway for rapid escalation to specialist consultation is not readily available[9].

The result is a high and rising volume of complex patient messages flowing into both subspecialty and primary care inboxes, with urgent cases interspersed among routine inquiries. Identifying those urgent cases quickly and reliably could help to prevent delays and adverse outcomes, but reviewing messages quickly is resource-intensive and contributes to clinician burnout[10, 11].

Automated message triage powered by large language models (LLMs) offers a potential solution[12]. Such applications align with emerging frameworks positioning LLMs as foundational tools for neurologic population health, bridging individual patient interactions with system-level safety and efficiency goals[13]. These models can process free-text messages at scale, identify key clinical cues, and classify urgency[14]. For such systems to be viable in high-stakes epilepsy care, they must achieve near-perfect sensitivity and high negative predictive value (NPV) to avoid missed urgent messages, while maintaining specificity to prevent excessive false positives and alert fatigue. This creates incredible stress on front line clinicians in primary care.

Recent advances in Artificial Intelligence (AI)-generated software may help to support these clinicians in primary care. Specifically, the recent release of GPT-5 has intensified attention on cutting-edge cloud-based LLMs as well as accelerated a movement toward locally deployable, open-weight AI-models that can operate entirely within institutional networks. Such models —including Meta’s Llama 4 Scout, OpenAI-aligned GPT-OSS-20B, and Google DeepMind’s Gemma 3 27B Instruct — support open-weight or institutionally hosted deployment[15–17]. This approach enables protected health information (PHI) to remain within institutional firewalls rather than being transmitted to external services, provided appropriate technical and governance controls are in place. These options address key concerns of using AI to support clinical practice, or issues with security, privacy, reproducibility, scalability, and cost control. However, it remains unclear whether their screening performance in subspecialty care matches that of flagship cloud models like GPT-4o or GPT-5.

In this study, we evaluated the performance of six LLM classifiers in classifying message urgency and type for patient portal messages sent to an academic epilepsy neurology clinic.

2. Methods

2.1. Study Design and Objectives

We conducted a retrospective diagnostic accuracy study to evaluate the performance of large LLMs in classifying the urgency and type of outpatient neurology patient messages. The primary objective was to assess the clinical safety and feasibility of using LLMs for automated message triage, emphasizing sensitivity and NPV to minimize missed urgent cases. A secondary operational objective was to compare cloud-based versus locally deployable (on-premises) models, with attention to practical considerations for privacy-preserving deployments in PHI-constrained environments. The study followed Standards for Reporting Diagnostic Accuracy (STARD) reporting guidelines [18, 19]. A full STARD compliance checklist is provided in Supplementary Table 1.

2.2. Governance, Privacy, and Ethical Review

All messages were de-identified prior to modeling and analysis in accordance with HIPAA de-identification standards[20].

Local models operated entirely within institutional boundaries to maintain PHI containment. The project was reviewed by the Mass General Brigham IRB and deemed exempt or non–human subjects research[21].

2.3. Setting, Data Source, and Participants

The dataset consisted of 503 de-identified patient portal messages submitted by adult outpatients (or their caregivers) to the epilepsy subspecialty clinic of an academic medical center via the electronic health record (EHR) secure web-based messaging system between January 28, 2016 and February 2, 2020.

From 93,355 total neurology-related messages during this period, we first isolated patient-sent messages (n = 54,512) and then restricted to the first patient message in each thread (n = 22,247). Among these, messages directed to the epilepsy service (n = 5,827) formed the initial study cohort. From this cohort, 50 messages were randomly selected for rater training and 503 for the final study dataset (Fig. 1).

Fig. 1

STARD Flow Diagram for Message Inclusion and Analysis

All messages were extracted from the EHR’s secure messaging platform and de-identified prior to annotation or model input. Inclusion criteria required (1) an established adult patient or caregiver as sender, and (2) availability of a free-text message body.

2.4.

Reference Standard (Physician Annotations)

To establish a clinically grounded reference standard, five epilepsy fellowship-trained neurologists independently reviewed and annotated each message along two axes: urgency and message type. Urgency was coded as a binary label (Urgent vs Non-Urgent), based solely on medical urgency, excluding administrative or interpersonal considerations. Message type was categorized into four pre-defined groups: (1) Medical Question / Multiple Topics, (2) Prescription Question, (3) Test Results Question, and (4) Referral Request.

All annotations followed a structured standard operating procedure (SOP) designed to ensure consistent interpretation and labeling, aligned with best practices and mirroring methods from prior validation studies[22, 23]. Before full-scale annotation, neurologists [LB, MA, JCY, PH, LMVRM] completed a structured training phase using a 50-message holdout set to calibrate definitions and resolve ambiguities. During this phase, they achieved high inter-rater reliability on both axes, with Fleiss’ κ ≥ 0.80[24]. The detailed rater Standard Operating Procedure (SOP) and training process, along with inter-rater agreement procedures, are described in the Supplementary Methods. Only after reaching this predefined agreement threshold did the full annotation of the 503-message corpus proceed.

In brief, we defined three a priori strata for evaluating model concordance: (a) Unanimous — all five physicians assigned the same label; (b) Majority — the modal label was selected by at least three of five physicians; and (c) Any MD Match — at least one physician selected the label, used to capture upper-bound concordance. These strata were applied independently to both urgency and message-type tasks, resulting in different denominators for each analytic endpoint. (see Supplementary Table 2 for stratum counts).

The SOP and training methods followed established best practices and mirrored approaches used in prior validation studies, recognizing that clinical rating tasks can remain subjective even after structured training and agreement calibration[23, 25, 26].

2.5. Index Tests (LLM Classifiers)

We evaluated the performance of six LLM classifiers. The cloud-based models, accessed via the Azure OpenAI Service, included GPT-4o, GPT-5-chat, and GPT-5-mini, the latter representing a cost-optimized variant with limited configurability. The locally hosted models – each deployed entirely within secure institutional infrastructure using Ollama or vLLM – included Llama 4 Scout (Meta), GPT-OSS-20B (an OpenAI-aligned open-weight transformer), and Gemma 3 27B (Google DeepMind).

To maintain consistency and avoid confounding from task-specific adaptation, all models were evaluated in a zero-shot setting – that is, they received only the task prompt at inference time, with no task-specific fine-tuning, few-shot exemplars, or chain-of-thought scaffolding. Zero-shot evaluation for clinical NLP tasks has been supported in recent peer-reviewed work, including studies showing effective phenotyping without fine-tuning[27].

2.6. Prompt and Output Specification

Each model received the same structured prompt (see Supplementary Methods, Prompt Template) instructing it to classify the message into one urgency level and one message type based on predefined options. The prompt emphasized medical urgency (not administrative or interpersonal tone) and required strict label matching. Models were instructed to return a minimal structured output (e.g., “Urgent; Prescription Question”). Only the text body of the message was included - no metadata or prior conversation history.

2.7. Model Configuration and Reproducibility

To ensure consistency across platforms and reproducibility of outputs, all models capable of deterministic decoding were configured with temperature set to 0, top-p set to 1, and a fixed random seed (12345). These settings enforced greedy decoding and consistent tie-breaking behavior across inference runs, minimizing stochastic variation in model outputs[28].

For local deployments, inference servers (e.g., Ollama, vLLM) were explicitly configured to respect these parameters. In contrast, GPT-5-mini, a cost-optimized cloud model hosted on Azure, did not allow modification of decoding parameters or seed control; as a result, its outputs may have varied slightly between runs. We describe the implications of this limitation for clinical reproducibility in the Discussion. Detailed API configuration and inference parameters for each model are listed in Supplementary Table 4.

2.8. Outcomes

The primary outcomes were focused on urgency classification, specifically evaluating each model’s sensitivity, defined as the proportion of truly urgent messages correctly identified as urgent, and NPV, defined as the probability that a message classified as non-urgent was in fact non-urgent. These metrics were selected due to their clinical relevance in minimizing missed urgent messages and were calculated under both the Unanimous (5/5 physician agreement) and Majority (≥ 3/5 agreement) consensus strata.

Secondary outcomes included specificity, PPV, overall accuracy, and likelihood ratios (LR + and LR−) to further characterize each model’s diagnostic performance. For the classification of message type, we calculated accuracy under all three reference standard strata (Unanimous, Majority, and Any MD match). The Any MD match stratum was analyzed descriptively to provide an upper-bound estimate of potential concordance between model predictions and at least one expert clinician.

2.9. Power Considerations

We prioritized NPV as the primary safety metric in low-prevalence settings and report exact binomial CIs plus observed false-negative counts for sensitivity to convey uncertainty. We therefore targeted ≥ 500 messages to support stable estimation of sensitivity and NPV even at 3–5% urgency. At this prevalence, N ≥ 500 yields at least 15 urgent and 485 non-urgent messages: NPV approximately 0.98 is estimated precisely (95% CI ≈ ± 0.01–0.02) given the large non-urgent denominator. By contrast, sensitivity with about 15 urgent cases is less precise (95% CI ≈ ± 0.10–0.15), and a one-sided α = 0.05 test has approximately 45% power to show sensitivity ≥ 0.95 vs. 0.80. While a larger sample could further narrow sensitivity intervals, we balanced this against the limited bandwidth of our neurologist annotators, prioritizing feasibility without compromising core safety inferences.

2.10. Handling of Missing/Discordant Data

We analyzed all eligible messages in the sampling frame. Denominators varied by reference stratum. We used pairwise exclusion: if a message lacked either a physician label or a model output for a specific endpoint, it was excluded only from that endpoint’s analysis. Concordance strata (Unanimous, Majority, Any MD Match) were computed from available physician labels per SOP; messages without sufficient labels to define a given stratum were excluded from that stratum but retained elsewhere when appropriate.

2.11. Statistical Analysis

To evaluate urgency classification performance, we constructed 2×2 contingency tables for each model and reference standard stratum, treating “Urgent” as the positive class. From these, we calculated standard diagnostic accuracy metrics, including sensitivity, specificity, positive PPV, NPV, LR+, LR−, apparent prevalence, and overall accuracy. Confidence intervals (95% CIs) for all proportions were calculated using the exact Clopper–Pearson method [29]. For the message type classification task, we computed accuracy with 95% CIs under each consensus stratum (Unanimous, Majority, and Any MD match).

Given the benchmarking and descriptive nature of our study, no formal hypothesis testing was performed between models. Instead, we emphasized clinical relevance and operational implications - particularly the trade-offs between safety (sensitivity and NPV) and efficiency (specificity and alert burden).

All statistical analyses were conducted in R (v4.x) using packages such as dplyr, tidyr, purrr, epiR, tibble, epiR, and knitr. Model inference and data ingestion pipelines were implemented in Python, using OpenAI-compatible clients for both cloud-hosted and on-premises models. All reproducibility materials, including the physician labeling SOP, prompt template, and parameter configurations, are detailed in the Supplementary Methods and Supplementary Tables 1–4.

2.12. Governance, Privacy, and Ethical Review

All messages were de-identified prior to modeling and analysis in accordance with HIPAA de-identification standards [20]. Local models operated entirely within institutional boundaries to maintain PHI containment. The project was reviewed by the Mass General Brigham IRB and deemed exempt or non–human subjects research [21].

3. Results

3.1. Dataset and Reference Standard Availability

We analyzed 503 de-identified patient portal messages submitted by adult outpatients to a tertiary epilepsy clinic (Table 1, Fig. 1; see Supplementary Table 3 for the distribution of urgency and message-type labels). For urgency, 384 messages (76.3%) reached unanimous agreement, while all 503 contributed to the Majority and Any MD strata. Detailed counts for each reference stratum are presented in Supplementary Table 2. For message type, agreement was lower: 334 messages (66.4%) reached unanimous consensus, while 500 (99.4%) and 503 (100%) messages qualified for the Majority and Any MD Match strata, respectively. These variations underscore the inherent ambiguity in asynchronous messaging, particularly around topic classification.

Table 1
Sample Characteristics
Characteristic	Total Sample (N = 343)
Age at Message, mean (SD), years	46.74 (19.29)
Age group, n (%)
18–29	90 (26.2%)
30–44	81 (23.6%)
45–59	58 (16.9%)
60–74	85 (24.8%)
≥75	29 (8.5%)
Sex, n (%)
Female	202 (58.9%)
Male	141 (41.1%)
Ethnicity, n (%)
Hispanic or Latino	12 (3.5%)
Not Hispanic	316 (92.1%)
Unavailable	15 (4.4%)
Race, n (%)
White	272 (79.3%)
Black or African American	16 (4.7%)
Asian	37 (10.8%)
Other	17 (5.0%)
Declined	1 (0.3%)

3.2.

Primary Outcomes — Urgency Classification (Screening Task)

Unanimous Reference Standard (n = 384)

Under the most conservative standard of unanimous physician consensus, five of the six LLMs achieved perfect sensitivity (1.00) and negative predictive value (NPV = 1.00), indicating that no urgent message was missed (Table 2). The exception was GPT-OSS 20B, which demonstrated lower sensitivity (0.69 [95% CI: 0.39–0.91]), despite maintaining high specificity (0.98 [0.96–0.99]) and an overall accuracy of 96.9% [94.6–98.4]. All other models balanced high specificity and precision to varying degrees: GPT-5 Mini and Llama 4 Scout each reached specificity of 0.94 and PPV around 0.35, while GPT-4o and Gemma 3 27B offered slightly lower specificity (0.92 and 0.85, respectively).

Table 2
Primary Outcome - Model performance on urgency classification under the Unanimous reference standard (n = 384)
Model	Sensitivity (95% CI)	NPV (95% CI)	Specificity (95% CI)	PPV (95% CI)	Accuracy (95% CI)
GPT-4o	1.00 (0.75–1.00)	1.00 (0.99–1.00)	0.92 (0.88–0.94)	0.30 (0.17–0.45)	0.919 (0.887–0.944)
GPT-5	1.00 (0.75–1.00)	1.00 (0.99–1.00)	0.87 (0.83–0.90)	0.21 (0.12–0.34)	0.875 (0.838–0.906)
GPT-5 Mini	1.00 (0.75–1.00)	1.00 (0.99–1.00)	0.94 (0.91–0.96)	0.37 (0.21–0.55)	0.943 (0.915–0.964)
GPT-OSS 20B	0.69 (0.39–0.91)	0.99 (0.97–1.00)	0.98 (0.96–0.99)	0.53 (0.28–0.77)	0.969 (0.946–0.984)
Gemma 3 27B	1.00 (0.75–1.00)	1.00 (0.99–1.00)	0.85 (0.81–0.89)	0.19 (0.11–0.31)	0.859 (0.821–0.893)
Llama 4 Scout	1.00 (0.75–1.00)	1.00 (0.99–1.00)	0.94 (0.91–0.96)	0.35 (0.20–0.53)	0.938 (0.908–0.960)
Footnotes: (1) NPV: Negative predictive value; PPV: Positive predictive value; CI: Confidence interval. (2) Performance metrics computed using the Unanimous reference standard (i.e., all 5 physicians agreed.

Majority Reference Standard (n = 503)

When applying the Majority standard, which better reflects real-world clinical ambiguity, model performance diverged more noticeably (Table 3). GPT-5 led in safety metrics, with sensitivity of 0.98 [0.91–1.00] and perfect NPV (1.00 [0.98–1.00]), but had lower specificity (0.79), translating to more false positives and higher alert burden. In contrast, GPT-OSS 20B achieved the highest specificity (0.95 [0.92–0.97]) and accuracy (90.3% [87.3–92.7]), but its sensitivity was modest 0.57 [0.43–0.69], suggesting a greater risk of missed urgent messages in clinical settings.

Table 3
Primary Outcome - Model performance on urgency classification under the Majority reference standard (n = 503)
Model	Sensitivity (95% CI)	NPV (95% CI)	Specificity (95% CI)	PPV (95% CI)	Accuracy (95% CI)
GPT-4o	0.88 (0.77–0.95)	0.98 (0.96–0.99)	0.87 (0.83–0.90)	0.47 (0.38–0.57)	0.869 (0.836–0.897)
GPT-5	0.98 (0.91–1.00)	1.00 (0.98–1.00)	0.79 (0.74–0.82)	0.38 (0.31–0.46)	0.809 (0.772–0.843)
GPT-5 Mini	0.78 (0.66–0.88)	0.97 (0.95–0.98)	0.89 (0.85–0.92)	0.48 (0.38–0.59)	0.875 (0.843–0.902)
GPT-OSS 20B	0.57 (0.43–0.69)	0.94 (0.92–0.96)	0.95 (0.92–0.97)	0.60 (0.46–0.72)	0.903 (0.873–0.927)
Gemma 3 27B	0.75 (0.62–0.85)	0.96 (0.94–0.98)	0.81 (0.78–0.85)	0.35 (0.27–0.44)	0.807 (0.770–0.841)
Llama 4 Scout	0.82 (0.70–0.90)	0.97 (0.95–0.99)	0.88 (0.85–0.91)	0.48 (0.38–0.58)	0.873 (0.840–0.901)
Footnotes: (1) NPV: Negative predictive value; PPV: Positive predictive value; CI: Confidence interval. (2) This table uses the Majority reference standard (≥ 3 of 5 physicians agreed on urgency label). (3) Performance metrics computed using the Unanimous reference standard (i.e., all 5 physicians agreed

Two models - GPT-4o and Llama 4 Scout - offered strong balance between safety and operational efficiency. GPT-4o achieved sensitivity of 0.88, specificity of 0.87, and accuracy of 86.9%. Llama 4 Scout posted a similar profile, with sensitivity of 0.82, specificity of 0.88, and accuracy of 87.3%, representing robust performance for both local deployment and clinical screening.

Gemma 3 27B delivered intermediate results, with sensitivity of 0.75 and specificity of 0.81, while GPT-5 Mini - a lower-cost Azure deployment - achieved sensitivity of 0.78, specificity of 0.89, and accuracy of 87.5%, despite being subject to non-deterministic decoding (temperature fixed at 1).

3.3. Secondary Outcomes — Message Type Accuracy

Across all models, classification accuracy for message type was more consistent than for urgency. Under the Unanimous consensus standard (n = 334), five of the six models achieved accuracy above 90%. GPT-4o (93.4%), Gemma 3 (92.8%), and Llama 4 Scout (91.0%) led the group, while GPT-5 Mini (74.9%) and GPT-OSS 20B (79.3%) trailed behind. These patterns persisted under the Majority reference, where GPT-4o, Gemma 3, and Llama 4 Scout again exceeded 84% accuracy, and GPT-5 Mini remained lowest at 68.6% (Table 4).

Table 4
Secondary Outcome - Message type classification accuracy by model under Unanimous (n = 334) and Majority (n = 500) reference standards
Model	Accuracy – Unanimous (95% CI)	Accuracy – Majority (95% CI)
GPT-4o	93.4% (90.2–95.8)	85.2% (81.8–88.2)
GPT-5	91.9% (88.5–94.6)	83.8% (80.3–86.9)
GPT-5 Mini	74.9% (69.8–79.4)	68.6% (64.3–72.6)
GPT-OSS 20B	79.3% (74.6–83.6)	72.8% (68.7–76.7)
Gemma 3 27B	92.8% (89.5–95.3)	86.2% (82.9–89.1)
Llama 4 Scout	91.0% (87.4–93.9)	84.8% (81.3–87.8)
Footnotes: (1) Accuracy calculated as correct message type predictions / total evaluated messages. (2) Unanimous: 5/5 annotators agreed on message type; Majority: ≥3/5 agreed. (3) Performance reflects zero-shot prompting without additional metadata or fine-tuning.

3.4. Additional Performance Metrics and Trade-offs

Given the low prevalence of urgent messages (~ 3%), positive predictive value (PPV) remained modest for all models despite strong specificity. Under the Unanimous standard, GPT-5 Mini achieved the highest PPV (0.37), while GPT-5 trailed at 0.21. Under the Majority standard, PPV improved: GPT-4o reached 0.47, GPT-5 Mini 0.48, and GPT-OSS 20B peaked at 0.60, reflecting its emphasis on specificity.

Likelihood ratios under the Majority reference provided further insight into model behavior. Comprehensive model-level accuracy metrics across all reference standards are provided in Supplementary Table 5. GPT-4o achieved LR + of 6.77, GPT-5 Mini 3.73, and GPT-5 4.08, all indicating moderate to strong rule-in utility. All models maintained LR − values ≤ 0.17, suggesting high effectiveness at ruling out urgent messages when labeled as non-urgent.

4. Discussion

We have addressed a primary safety concern for automated screening tools in clinical care and shown that these systems can safely rule out urgent messages without missing a single true urgent case.

In this multi-model evaluation of LLM-based classifiers for triaging patient portal messages in a high-stakes neurology subspecialty setting, we found that five of six models achieved perfect sensitivity and NPV under the most stringent Unanimous physician consensus standard.

Importantly, real-world implementation rarely enjoys perfect consensus. Under the Majority consensus, differences emerged in how models balanced sensitivity (patient safety) and specificity (alert burden). These differences reveal the operational implications of model selection and underscore that no one-size-fits-all model exists. The choice of model should reflect institutional priorities, whether minimizing missed urgency, reducing false alerts, ensuring local data governance, or controlling costs. Similar to prior work evaluating AI tools for message prioritization in outpatient workflows, these trade-offs must be contextualized to align with clinical priorities and system capacity[30–32].

4.1. Clinical implications for primary care, neurology and beyond

As message volumes grow, clinicians may struggle to respond promptly, and urgent content may be unintentionally routed through wrong channels, creating inefficiencies and increasing risk [30, 33]. A reliable LLM-based screener that can identify urgent messages with high sensitivity - without overwhelming clinicians with false positives - could improve both safety and workflow sustainability. In neurology settings where clinician bandwidth is limited and the stakes are high, such tools may help ensure that critical information is acted on promptly as well as protect against clinician burn out.

Although this framework was developed and tested in a neurology subspecialty setting, it is generalizable to other high-risk outpatient domains, including oncology, cardiology, transplant medicine, psychiatry, and primary care. These specialties also manage large volumes of non-urgent messages alongside occasional time-sensitive concerns, where screening tools that efficiently and reproducibly distinguish between the two can serve as valuable decision-support assets. Future studies should include cross-specialty external validation, prospective workflow evaluations, and assessment of handoffs and closed-loop communication — particularly primary-care-to-neurology channels (e.g., referral questions, medication titration, post-ED follow-up) — to quantify safety, timeliness, and burden reduction.

4.2. Reproducibility and deployment readiness

All models that supported deterministic decoding (temperature = 0, top_p = 1, and seed = 12345) produced consistent and reproducible outputs across repeated runs under controlled conditions. This level of determinism is critical for clinical settings where auditability, traceability, and workflow predictability are required. While true bit-for-bit identical outputs may vary slightly depending on system-level factors (e.g., hardware or tokenizer version), no material output variation was observed for models with deterministic configuration. These principles align with broader calls for rigorous reproducibility practices in AI for health research[28].

GPT-5 Mini was the exception: Azure’s implementation does not allow control of temperature or seed, resulting in small but nonzero run-to-run variability. Although this did not significantly affect average performance in our evaluation, it may complicate deployment in environments that require exact reproducibility for version control or clinical auditing.

Local deployment is increasingly desirable as health systems navigate protected health PHI governance, cybersecurity concerns, and vendor-related constraints. Llama 4 Scout and Gemma 3 27B demonstrated that open-weight models hosted on-premises can achieve clinically acceptable performance, enabling privacy-preserving and sustainable deployment strategies without the need to transmit sensitive data externally. These considerations reflect a growing recognition that AI implementations must address both technical and organizational data governance challenges [34].

4.3. Operational decision-making

Our findings suggest that no single LLM offers a universally optimal solution for clinical message screening. Instead, model selection should be informed by an institution’s operational priorities - particularly the trade-off between patient safety and clinician efficiency. A useful way to conceptualize this is along two key axes (Fig. 2).

Fig. 2

Safety vs. Efficiency Trade-off Across LLMs

On the safety axis, models with the highest sensitivity and negative predictive value (NPV) are preferred, especially in settings where the consequences of missing an urgent message are unacceptable. Among the models tested, GPT-5 consistently achieved near-perfect sensitivity and NPV, making it well-suited for high-risk environments such as epilepsy clinics or oncology practices, where timely intervention is critical.

On the efficiency axis, specificity becomes the dominant consideration. In environments where clinician workload and alert fatigue are pressing concerns, models like GPT-4o and Llama 4 Scout offered strong overall accuracy and reduced false positives, while still maintaining excellent NPV. These models may be better aligned with general outpatient or primary care workflows, where a small risk of not triaging enough may be tolerated to improve usability and reduce cognitive burden.

GPT-OSS 20B, while achieving the highest specificity across models, showed lower sensitivity, making it best suited for contexts where minimizing over-alerting outweighs the need for maximal urgent message capture. Gemma 3 27B provided a moderate balance of sensitivity and specificity, with the added advantage of full on-premise deployment. Meanwhile, GPT-5 Mini emerged as a cost-efficient alternative to GPT-5, offering strong safety metrics in cloud deployments - though its lack of reproducibility may limit its use in regulated workflows requiring auditability[35].

Ultimately, operational decisions should be guided not just by technical performance, but also by clinical context, resource constraints, and deployment infrastructure. This nuanced framework allows health systems to match LLM capabilities with institutional priorities, ensuring that screening tools are both effective and sustainable.

4.4. Limitations

This study has several limitations that warrant consideration. First, the low prevalence of urgent messages constrained the PPV across all models, despite consistently high sensitivity and NPV. This reflects a real-world challenge in asynchronous outpatient messaging, where urgency is rare, but highly consequential. While PPV may appear modest, it should not overshadow the high ruling-out performance that enables safe screening in low-base-rate contexts. Prior work in clinical message classification has similarly found that low prevalence of high-risk messages limits precision even with strong model performance[12].

Second, the reference standard - while built on a rigorous SOP and substantial inter-rater agreement (κ ≥ 0.80) - is not an infallible gold standard. Disagreements among expert annotators in the field of epilepsy is known and reflects the inherent subjectivity of human-based clinical classification [36, 37]. Rather than masking this ambiguity, our study aimed to reflect it, recognizing its relevance for real-world deployment.

Third, reproducibility varied by model. All models supporting deterministic decoding were executed with fixed parameters (temperature = 0, top_p = 1, seed = 12345), producing stable outputs across runs. However, GPT-5 Mini, deployed via Azure AI, does not allow seed or temperature control (fixed at temperature = 1), leading to minor but non-zero variability. This limitation could pose challenges in production settings where traceability and deterministic behavior are essential for clinical auditing or regulatory oversight[28].

Fourth, findings are based on data from a single academic epilepsy clinic. While this setting was deliberately selected as a high-acuity, high-volume outpatient specialty with known urgency variability, generalizability to other specialties or practice environments may be limited. Nevertheless, many subspecialties - such as primary care and oncology - face similar urgency triage needs and may benefit from analogous screening approaches.

Finally, we evaluated all models in a zero-shot prompt configuration, without fine-tuning or domain adaptation. This enhances generalizability and reflects the practical constraints of many health systems, especially those without internal machine learning teams. However, performance gains may still be achievable through techniques such as few-shot prompting, retrieval-augmented generation, or task-specific fine-tuning - particularly in edge cases where clinical language is ambiguous or evolving[38].

5. Conclusions

This diagnostic evaluation of six LLMs - including three commercial cloud-hosted models and three locally deployable open-weight alternatives - demonstrated that LLMs can achieve clinically acceptable performance in classifying the urgency and type of patient portal messages in outpatient neurology. When evaluated against a physician-derived reference standard, most models achieved perfect sensitivity and NPV under unanimous expert agreement, affirming their potential utility as high-safety screening tools in triage workflows.

These findings suggest that LLM-powered triage systems may be safely and flexibly deployed in high-stakes clinical neurology settings, with model selection tailored to institutional priorities - whether focused on maximizing sensitivity, minimizing alert fatigue, or ensuring data containment. As both commercial and open-weight LLMs continue to evolve, such tools may offer scalable, reproducible, and privacy-preserving infrastructure for alleviating clinician workload while maintaining high standards of patient safety.

Data Availability

The datasets generated and analyzed during the current study consist of de-identified patient portal messages from the Mass General Brigham electronic health record system. Due to institutional privacy regulations and the inclusion of protected health information (PHI), these data are not publicly available. De-identified excerpts and the analytic code used for model evaluation are available from the corresponding author on reasonable request and pending institutional approval. Core scripts and configuration files used for the benchmarking analyses are provided in Supplementary Material for reproducibility.

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

References

Ochoa C, Baron-Lee J, Popescu C, et al. Electronic patient portal utilization by neurology patients and association with outcomes. Health Informatics J. 2020;26:2751–61.

Zhang J, Yu Y, Chen Z, et al. Trends and disparities in the prevalence of physical activity among US adults with epilepsy, 2010–2022. Epilepsy Behav. 2024;157:109850.

Kobau R, Luncheon C, Greenlund K. Active epilepsy prevalence among U.S. adults is 1.1% and differs by educational level—National Health Interview Survey, United States, 2021. Epilepsy Behav. 2023;142:109180.

Moura LMVR, Karakis I, Zack MM, et al. Drivers of US health care spending for persons with seizures and/or epilepsies, 2010–2018. Epilepsia. 2022;63:2144–54.

Karakis I, Boualam N, Moura LM, et al. Quality of life and functional limitations in persons with epilepsy. Epilepsy Res. 2023;190:107084.

Moura L, Karakis I, Howard D. Emergency department utilization among adults with epilepsy: A multi-state cross-sectional analysis, 2010–2019. Epilepsy Res. 2024;205:107427.

Lewis AK, Taylor NF, Carney PW, et al. What is the effect of delays in access to specialist epilepsy care on patient outcomes? A systematic review and meta-analysis. Epilepsy Behav. 2021;122:108192.

Racette BA, Holtzman DM, Dall TM, et al. Supply and demand analysis of the current and future US neurology workforce. Neurology. 2014;82:2254–5.

Gallani S, Martin Lee B, Moura LMVR. Achieving epilepsy care for all: Ecosystem-based transformation. Epilepsia. 2025;66:2669–78.

10.

Nath B, Williams B, Jeffery MM, et al. Trends in electronic health record inbox messaging during the COVID-19 pandemic in an ambulatory practice network in New England. JAMA Netw Open. 2021;4:e2131490.

11.

Grouse CK, Esper GJ. The patient portal messaging crisis. JAMA Neurol. Published Online First: 9 December 2024. doi: 10.1001/jamaneurol.2024.4153

12.

Yang J, So J, Zhang H, et al. Development and evaluation of an artificial intelligence-based workflow for the prioritization of patient portal messages. JAMIA Open. 2024;7. doi: 10.1093/jamiaopen/ooae078

13.

Moura Junior V, Kummer BR, Moura LMVR. Population health in neurology and the transformative promise of artificial intelligence and large language models. Semin Neurol. 2025;45:445–56.

14.

Moura L, Jones DT, Sheikh IS, et al. Implications of large language models for quality and efficiency of neurologic care. Neurology. 2024;102. doi: 10.1212/wnl.0000000000209497

15.

Ng MY, Helzer J, Pfeffer MA, et al. Development of secure infrastructure for advancing generative artificial intelligence research in healthcare at an academic medical center. J Am Med Inform Assoc. 2025;32:586–8.

16.

Riedemann L, Labonne M, Gilbert S. The path forward for large language models in medicine is open. NPJ Digit Med. 2024;7:339.

17.

Dennstädt F, Hastings J, Putora PM, et al. Implementing large language models in healthcare while balancing control, collaboration, costs and security. NPJ Digit Med. 2025;8:143.

18.

Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ. 2015;351:h5527.

19.

Cohen JF, Korevaar DA, Altman DG, et al. STARD 2015 guidelines for reporting diagnostic accuracy studies: explanation and elaboration. BMJ Open. 2016;6:e012799.

20.

U.S. Department of Health and Human Services. Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. 2020. https://www.hhs.gov/hipaa/for-professionals/privacy/index.html (accessed 21 September 2025)

21.

Menikoff J, Kaneshiro J, Pritchard I. The Common Rule, Updated. N Engl J Med. 2017.

22.

Moura LMVR, Festa N, Price M, et al. Identifying Medicare beneficiaries with dementia. J Am Geriatr Soc. 2021;69:2240–51.

23.

Leng Y, He Y, Amini S, et al. A GPT-4o-powered framework for identifying cognitive impairment stages in electronic health records. NPJ Digit Med. 2025;8:401.

24.

Gwet KL. Handbook of inter-rater reliability: Volume 1: Analysis of categorical ratings. Advanced Analytics 2021.

25.

Li J, Goldenholz DM, Alkofer M, et al. Expert-level detection of epilepsy markers in EEG on short and long timescales. NEJM AI. 2025;2. doi: 10.1056/aioa2401221

26.

Moura LMVR, Zafar S, Benson NM, et al. Identifying Medicare beneficiaries with delirium. Med Care. 2022;60:852–9.

27.

Neves B, Moreira JM, Gonçalves S, et al. Zero-shot learning for clinical phenotyping: Comparing LLMs and rule-based methods. Comput Biol Med. 2025;192:110181.

28.

McDermott MBA, Wang S, Marinsek N, et al. Reproducibility in machine learning for health research: Still a ways to go. Sci Transl Med. 2021;13:eabb1655.

29.

Clopper CJ, Pearson ES. The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika. 1934;26:404.

30.

Martinez KA, Schulte R, Rothberg MB, et al. Patient portal message volume and time spent on the EHR: An observational study of primary care clinicians. J Gen Intern Med. 2024;39:566–72.

31.

Ren Y, Wu Y, Fan JW, et al. Automatic uncovering of patient primary concerns in portal messages using a fusion framework of pretrained language models. J Am Med Inform Assoc. 2024;31:1714–24.

32.

Kaur A, Budko A, Liu K, et al. Automating responses to patient portal messages using generative AI. Appl Clin Inform. 2025;16:718–31.

33.

Tai-Seale M, Dillon EC, Yang Y, et al. Physicians’ well-being linked to in-basket messages generated by algorithms in electronic health records. Health Aff (Millwood). 2019;38:1073–8.

34.

Yadav N, Pandey S, Gupta A, et al. Data Privacy in healthcare: In the era of artificial intelligence. Indian Dermatol Online J. 2023;14:788–92.

35.

Wells BJ, Nguyen HM, McWilliams A, et al. A practical framework for appropriate implementation and review of artificial intelligence (FAIR-AI) in healthcare. NPJ Digit Med. 2025;8. doi: 10.1038/s41746-025-01900-y

36.

Conroy M, Powell M, Suelzer E, et al. Electronic medical record–based electronic messaging among patients with breast cancer: A systematic review. Appl Clin Inform. 2023;14:134–43.

37.

Budd J. Burnout related to electronic health record use in primary care. J Prim Care Community Health. 2023;14. doi: 10.1177/21501319231166921

38.

Ge W, Godeiro Coelho LM, Donahue MA, et al. Automated identification of fall-related injuries in unstructured clinical notes. Am J Epidemiol. 2025;194:1097–105.

Author Contribution

VMJ, PH, and LMVRM. conceived, designed, and planned the study. VMJ, LMVRM, PH, LB, JY, and MM collected and acquired the data. VMJ developed the workflow, and LMVRM analyzed the de-identified data. VMJ and LMVRM drafted the manuscript. All authors interpreted the results, critically reviewed the manuscript, and approved the final version. VMJ had full access to all data in the study and takes responsibility for its integrity and the accuracy of the analysis.

Funding

This study was supported by the Baker Family Foundation and the Harvard D^3 Institute Associates Program.

Competing interests

All authors declare no financial or non-financial competing interests. None of the authors have any financial relationships, advisory roles, or institutional benefits related to the large language models, software platforms, or products mentioned in this study.

Ethics approval

The use of patient data in this study was approved by the Mass General Brigham Institutional Review Board (protocol 2025P000101, Human Ethics and Consent to Participate declarations: not applicable).

Clinical trial number

not applicable

Figures and Tables

Table 2 Legend: Screening safety priority (maximizing sensitivity and NPV): GPT-5 stood out with the highest sensitivity (0.98) and perfect NPV (1.00), minimizing the chance of missing an urgent message. In specialties like epilepsy, where missing a breakthrough seizure or adverse medication effect can have severe consequences, such sensitivity is essential. Alert burden efficiency priority (balancing specificity and NPV): GPT-4o and Llama 4 Scout both demonstrated high specificity (0.87 and 0.88, respectively) while maintaining excellent NPV (0.98 and 0.97). These models produced fewer false positive alerts, making them strong candidates for clinics prioritizing workload reduction and minimizing over-alerting. Cost-conscious deployment: GPT-5 Mini offered a favorable trade-off, achieving reasonable sensitivity (0.78) and high specificity (0.89), with NPV of 0.97. Given its lower cloud cost, GPT-5 Mini may be an attractive option for institutions with constrained budgets, though its inability to support deterministic decoding (temperature fixed at 1, no seed control) introduces minor reproducibility limitations. Local, privacy-preserving deployment: GPT-OSS 20B, Llama 4 Scout, and Gemma 3 27B all ran fully on-premise, avoiding the need to transmit PHI externally. Among them, Llama 4 Scout most closely approximated the performance of GPT-4o, making it a leading local option for balancing sensitivity and specificity. GPT-OSS 20B delivered the highest specificity (0.95) and highest PPV (0.60) under Majority consensus, but its lower sensitivity (0.57) may pose a clinical safety risk in some settings. Gemma 3 27B offered intermediate performance and a solid local deployment pathway.

Yes