Improving Arabic Clinical Question Quality through Domain-Adaptive Masked Language Modeling

Walid Ounachad^{1[0009-0002-5620-0442]} (✉), Mohamed Khenchouch^{1[0009-0002-3546-6057]}, Imad Zeroual^{1[0000-0002-4454-6369]}, and Yousef Farhaoui^{1[0000-0003-0870-6262]}

¹ IMIA Laboratory, T-IDMS, Faculty of Sciences and Techniques of Errachidia, Moulay Ismail University of Meknes, Morocco

W.ounachad@edu.umi.ac.ma, m.khenchouch@edu.umi.ac.ma, mr.imadine@gmail.com, y.farhaoui@fste.umi.ac.ma

Abstract.

Arabic clinical NLP systems often receive short, vague, or incomplete questions, which leads to weak downstream answers even with strong encoders. We address this bottleneck by making question quality a first-class, measurable objective. Using domain-adaptive (continued) pretraining with a masked-language objective (DAPT-MLM) on AHQAD (~808k Arabic health Q–A pairs), we adapt two widely used backbones, AraBERT and the generator variant of AraELECTRA, to the lexical, syntactic, and discourse patterns of well-formed medical questions. Evaluation is aligned with the learning signal: we report cross-entropy and perplexity only at masked tokens, top-k accuracy restricted to masked spans, and lexical-diversity measures to flag formulaic phrasing. A length-controlled test design (Short/Long/Very Long) isolates modeling gains from verbosity. Results show consistent intrinsic improvements for the domain-adapted models; AraBERT-MLM is best overall (macro Top-5 = 0.8392, lowest CE/PPL), outperforming AraBERT (orig.) by +4.8 pp macro Top-5 and AraELECTRA (orig.) by +17.2 pp. A 200-item human study (clinician + linguist) corroborates these gains (mean ± 95% CI: Clarity 4.12 ± 0.18, Fluency 3.68 ± 0.22, Semantic Fidelity 3.15 ± 0.25, Usefulness 3.42 ± 0.21; substantial agreement, κ ≈ 0.77) and highlights residual semantic drift that motivates simple, slot-constrained decoding fixes. Overall, the proposed reformulation module produces more natural and clinically relevant Arabic questions and can be plugged into Arabic clinical QA pipelines as a measurable, tunable front end.

Keywords: Arabic clinical NLP · question reformulation · domain-adaptive pretraining · masked language modeling · intrinsic evaluation · human validation

1 Introduction

Writing well-formed questions in Arabic remains challenging for NLP systems, with persistent gaps between Modern Standard Arabic (MSA) and dialects highlighted by recent surveys and benchmarks [1, 2]. In semantic search, virtual assistants, and QA, short, elliptical, or loosely structured queries often produce off-target answers, even with strong encoders such as ARBERT/MARBERT [3]. Part of the difficulty is linguistic (spelling variation, rich morphology, and the MSA–dialect gap), and part of it is practical: few curated resources spell out what a “good” Arabic question looks like, particularly in clinical settings. Scaling efforts in large Arabic LLMs also underscore opportunities and constraints relevant to clinical scenarios [4].

We address this problem at its source through domain-adaptive pretraining for Arabic question reformulation. Using AHQAD/AHD, a large corpus of Arabic health question–answer pairs [5], we continue pretraining two widely used backbones, AraBERT and the generator variant of AraELECTRA, under a masked-language objective so that they better capture the structure of well-formed medical questions. To test models under realistic conditions, we build length-balanced test sets and, for each item, create a truncated version that must be reconstructed into a coherent, context-appropriate question.

Our evaluation mirrors the training signal: we compute cross-entropy and perplexity at masked token positions and measure top-k accuracy only where masking occurs, following token-aligned scoring for masked language models [6]. We also conduct a human study with one language specialist and one domain expert who rate clarity, fluency, semantic fidelity, and practical usefulness of the reformulated questions; inter-rater agreement is summarized with modern reliability coefficients [7].

Taken together, these components provide a reusable path to improving Arabic question formulation: large-scale domain adaptation, measurements consistent with the learning objective and controlled for length, and expert validation that links numerical gains to judged quality. The resulting module can be plugged into Arabic clinical QA pipelines [8] and treated as a measurable, tunable component rather than an assumption.

Key Contributions.

Method. A reproducible framework for Arabic question reformulation via domain-adaptive (continued) pretraining on AHQAD with the MLM objective.

Evaluation design. Token-level metrics aligned with MLM (cross-entropy, perplexity, top-k on masked positions), length-controlled splits, and lexical-diversity tracking.

Human validation. Expert review (linguistics + domain) assessing clarity, fluency, semantic fidelity, and usefulness alongside automatic metrics.

Practicality. A reformulation module that integrates cleanly into Arabic clinical QA pipelines, turning question quality into a measurable, tunable target.

2 Related Work

2.1 Arabic pretrained encoders

Arabic-specific encoders consistently outperform multilingual baselines across diverse NLU tasks [1, 3]. AraBERT adapts BERT’s Masked Language Modeling (MLM) objective to Arabic and established early state-of-the-art performance on common benchmarks. AraELECTRA brings ELECTRA’s replaced-token detection to Arabic, offering a strong efficiency–accuracy trade-off. Larger families such as ARBERT/MARBERT expand coverage to dialectal and social-media text, underscoring the importance of scale and domain breadth in Arabic pretraining. More recent initiatives like AraT5 extend this line to generative, text-to-text formulations [3], while large-scale surveys consolidate the progress and remaining challenges in Arabic LLMs [1].

2.2 Domain-adaptive pretraining (DAPT/TAPT)

A second pretraining phase on in-domain text reliably improves downstream performance. Gururangan et al. [9] formalized Domain-Adaptive Pretraining (DAPT) and Task-Adaptive Pretraining (TAPT), showing consistent gains across domains and data regimes. Subsequent work confirms that adaptation to task-specific corpora reduces cross-domain drift and enhances generalization [6, 9]. We follow this paradigm by adapting Arabic encoders on QA-style health data prior to evaluation, aligning the pretraining distribution with target clinical use cases.

2.3 MLM objectives and token-level evaluation

BERT-style Masked Language Modeling (MLM) motivates token-aligned evaluation rather than sequence-only scores. Salazar et al. [6] introduced pseudo-log-likelihood (PLL) scoring to obtain token-level probabilities from MLMs, and subsequent studies refined PLL for greater theoretical consistency [10]. This motivates our use of cross-entropy and perplexity at masked positions, top-k accuracy restricted to masked tokens, and lexical-diversity tracking to detect formulaic phrasing.

2.4 Arabic QA resources (recent and large-scale)

Arabic QA has evolved from early reading comprehension datasets to large, contemporary corpora. ArabicaQA [8] provides 89k+ MRC questions, an open-domain QA benchmark, and a dense retriever (AraDPR), forming a modern testbed for readers and retrievers [11]. In the health domain, AHQAD/AHD [5] aggregates approximately 808,000 Arabic Q–A pairs across nearly 90 specialties, enabling large-scale domain adaptation and probing. Together, these resources make Arabic-specific DAPT feasible and closer to real-world distributions [12, 13].

2.5 Reformulation and human evaluation in Arabic NLP

Compared to English, Arabic question reformulation remains underexplored: most prior work targets encoder or reader architectures without isolating the effect of query form. Recent interest in Arabic clinical and general-domain LLMs has emphasized pretraining and reasoning, but systematic human studies assessing clarity, fluency, and semantic fidelity remain scarce [1]. We complement prior efforts by (i) performing DAPT on QA-style medical data to directly improve question form, and (ii) conducting a two-expert human evaluation aligned with intrinsic MLM-consistent metrics [7, 14].

3 Methodology

3.1 Problem Setting

We address the problem of reformulating Arabic clinical questions so that they are complete, clear, and clinically useful [8, 12]. Starting from a natural question, we create a “truncated” version by hiding one or more informative spans (e.g., a symptom, a duration, a medication, or discourse cues). The system’s task is to reconstruct a well-formed question that preserves the original intent and improves readability and completeness. This setup mirrors real interactions where user queries are often short, elliptical, or underspecified.

3.2 Data and Pre-processing

Corpus. We use AHQAD/AHD, a large Arabic health question–answer collection (approximately 808,000 pairs across ~90 specialties) [5]. The corpus provides realistic clinical phrasing and domain terminology.

Cleaning and normalization. We remove duplicates, normalize Unicode, strip optional diacritics, collapse elongation marks and extra spaces, and apply light orthographic normalization (alif/hamza and yaa/maqṣūra variants). Non-Arabic script is discarded. We also apply length filtering to reduce extremes and noise [1, 15].
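The cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the released script: the function names, the exact Unicode ranges, and the choice to treat only Latin letters as "non-Arabic script" are assumptions made for the example, and length filtering is omitted.

```python
import re
import unicodedata

# Assumed character sets: common Arabic diacritics (tashkeel) and superscript alif.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"                      # elongation mark
_ALIF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
_LATIN = re.compile(r"[A-Za-z]+")        # assumption: only Latin letters are dropped


def normalize_question(text: str) -> str:
    """Apply light orthographic normalization to one Arabic question."""
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = _ALIF_VARIANTS.sub("\u0627", text)   # alif/hamza variants -> bare alif
    text = text.replace("\u0649", "\u064A")     # alif maqsura -> yaa
    text = _LATIN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(questions):
    """Normalize, then drop exact duplicates while keeping first-seen order."""
    return list(dict.fromkeys(normalize_question(q) for q in questions))
```

Deduplicating after normalization, rather than before, means that spelling variants of the same question collapse into one entry; whether the original pipeline uses this order is not stated.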

Evaluation set. We build a controlled evaluation set of 600 questions stratified by length into three balanced buckets (Short, Long, Very Long; 200 each). For every fully written “reference” question, we create a truncated counterpart by removing an informative span and inserting a placeholder.

Length control. All reporting is provided per length bucket with macro averages to separate perceived quality from verbosity [10].
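A possible way to build one truncated item and assign it to a length bucket is sketched below. The `___` placeholder follows the notation of Table 6; the word-count thresholds, the whitespace tokenization, and the function names are hypothetical, since the bucket boundaries are not given in the text.

```python
PLACEHOLDER = "___"


def length_bucket(question: str, short_max: int = 10, long_max: int = 20) -> str:
    """Assign a bucket from the whitespace token count (thresholds are assumed)."""
    n_tokens = len(question.split())
    if n_tokens <= short_max:
        return "Short"
    if n_tokens <= long_max:
        return "Long"
    return "Very Long"


def truncate(question: str, start: int, length: int = 1) -> str:
    """Replace `length` tokens beginning at index `start` with one placeholder."""
    tokens = question.split()
    if not 0 <= start < len(tokens) or length < 1:
        raise ValueError("span must lie inside the question")
    end = min(start + length, len(tokens))
    return " ".join(tokens[:start] + [PLACEHOLDER] + tokens[end:])


# Reference question taken from the first row of Table 6.
reference = "ما العلاقة بين حرقة شديدة في المعدة وصعوبة في البلع؟"
truncated = truncate(reference, start=3, length=2)
print(length_bucket(reference), "|", truncated)
```

Hiding the two tokens at positions 3 and 4 of this reference yields the truncated form shown in the first row of Table 6. In practice the span would be chosen by an annotation rule (symptom, duration, medication, discourse cue) rather than by a fixed index.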

3.3 Domain-Adaptive Pretraining

We perform domain-adaptive (continued) pretraining on Arabic clinical text using the masked-language-modeling objective. A fixed proportion of tokens is selected for masking; among those, most are replaced with a mask symbol, a smaller share is replaced with a random token, and the remainder is left unchanged. This standard policy encourages the model to use both sides of the context and to learn domain-specific lexical and structural patterns [6, 9].

Backbones. We adapt two widely used Arabic encoders: AraBERT (BERT-style) and the generator variant of AraELECTRA. For each, we compare the original public checkpoint with a domain-adapted variant obtained by continued pretraining on AHQAD [3].

Training setup. We use AdamW, standard warmup and linear decay, mixed precision when available, early stopping on development performance, and three random seeds. Sequence length and batch size are tuned within practical ranges; we log learning curves and keep the best development checkpoint [16].
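The masking policy described above can be illustrated with a small, framework-free sketch. The text does not state the proportions, so the 15% selection rate and the 80/10/10 split used here are assumptions borrowed from the original BERT recipe; the function name and arguments are likewise illustrative.

```python
import random

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0, special_ids=()):
    """Return (inputs, labels) for one sequence under an assumed 80/10/10 policy."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mask_prob:
            continue                                # position not selected
        labels[i] = tok                             # loss is computed only here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                     # most: mask symbol
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)   # smaller share: random token
        # remainder: token left unchanged
    return inputs, labels
```

Because `labels` is set only at selected positions, the training loss and the evaluation metrics can share the same mask, which is what keeps evaluation aligned with the learning signal.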

3.4 Reformulation Inference

At inference time, we do not free-generate entire questions. Instead, we replace each placeholder in the truncated question with a single mask token or a short span of mask tokens and fill only those masked slots with the domain-adapted model. Decoding is greedy (Top-1) or Top-k and is restricted strictly to the masked positions [6, 10]. We then detokenize, normalize, and fix whitespace and punctuation. This constrained procedure keeps inference consistent with the training signal and limits semantic drift.
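The constrained fill-in procedure can be sketched without committing to a particular model. In the example below, `score_fn` is a hypothetical stand-in for a masked-LM forward pass that returns token scores at one position (in practice it might wrap a Hugging Face `AutoModelForMaskedLM`), and the left-to-right filling order is an assumption, since the text does not specify how multi-mask spans are resolved.

```python
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"


def expand_placeholder(question: str, n_masks: int = 1, placeholder: str = "___") -> List[str]:
    """Replace each placeholder with `n_masks` mask tokens."""
    tokens: List[str] = []
    for tok in question.split():
        tokens.extend([MASK] * n_masks if tok == placeholder else [tok])
    return tokens


def fill_masks(
    tokens: List[str],
    score_fn: Callable[[List[str], int], Dict[str, float]],
    top_k: int = 1,
) -> Tuple[List[str], Dict[int, List[str]]]:
    """Fill masked slots left to right; unmasked tokens are never modified."""
    filled = list(tokens)
    candidates: Dict[int, List[str]] = {}
    for i, tok in enumerate(tokens):
        if tok != MASK:
            continue
        scores = score_fn(filled, i)
        ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
        candidates[i] = ranked
        filled[i] = ranked[0]           # greedy choice; the rest are kept as Top-k options
    return filled, candidates


def toy_scores(context: List[str], position: int) -> Dict[str, float]:
    """Hypothetical scorer used only to make the example runnable."""
    return {"pain": 2.0, "burning": 1.0, "fever": 0.1}


filled, options = fill_masks(expand_placeholder("cause of ___ in stomach"), toy_scores, top_k=2)
print(" ".join(filled), options)
```

Keeping the Top-k candidates per slot makes it possible to score Top-k accuracy and to present alternatives to a human reviewer without ever rewriting the unmasked part of the question.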

3.5 Evaluation Protocol

Intrinsic, token-aligned metrics. We evaluate only where masking occurs, to stay faithful to the learning objective. We report cross-entropy and perplexity at masked positions (lower is better), Top-k accuracy at masked positions (higher is better), and lexical diversity through the type–token ratio, complemented by a moving-average variant to reduce length bias. Results are reported per length bucket and as macro averages [6, 17].

Human evaluation. Two experts, a linguist and a clinician, independently rate each reconstructed question for clarity, fluency, semantic fidelity, and practical usefulness on five-point scales. We report per-criterion means with 95% confidence intervals and inter-rater agreement (weighted kappa). Items are randomized and raters are blinded to model identity [7, 14].
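As an illustration of these token-aligned metrics, the sketch below computes each quantity from values gathered at masked positions only. The function names, the input layout (one gold-token probability and one ranked candidate list per masked position), and the window size of the moving-average TTR are assumptions for the example.

```python
import math


def masked_ce_ppl(gold_probs):
    """Cross-entropy (nats) and perplexity over masked positions.

    gold_probs holds the probability the model gave to the gold token at each
    masked position; unmasked positions are never included.
    """
    ce = -sum(math.log(p) for p in gold_probs) / len(gold_probs)
    return ce, math.exp(ce)


def top_k_accuracy(ranked_predictions, gold_tokens, k):
    """Share of masked positions whose gold token is among the k best candidates."""
    hits = sum(gold in preds[:k] for preds, gold in zip(ranked_predictions, gold_tokens))
    return hits / len(gold_tokens)


def ttr(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window=50):
    """Moving-average TTR over sliding windows, which is less sensitive to length."""
    if len(tokens) <= window:
        return ttr(tokens)
    ratios = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)
```

Under this definition perplexity is the exponential of the cross-entropy within a bucket; for instance, exp(2.040) ≈ 7.69, in line with the Short row of Table 1.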
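The agreement and interval statistics named above could be computed along the following lines. The weighting scheme of the kappa is not specified in the text, so quadratic weights are assumed here (linear weights are available through the same argument), and the confidence interval uses a normal approximation; both choices are illustrative.

```python
import math
import statistics


def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5), weights="quadratic"):
    """Weighted Cohen's kappa for two raters on an ordinal scale."""
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        d = abs(i - j) / (k - 1)
        return d * d if weights == "quadratic" else d

    cells = [(i, j) for i in range(k) for j in range(k)]
    num = sum(disagreement(i, j) * observed[i][j] for i, j in cells)
    den = sum(disagreement(i, j) * row[i] * col[j] for i, j in cells)
    return 1.0 if den == 0 else 1.0 - num / den


def mean_ci95(scores):
    """Mean and half-width of a 95% interval (normal approximation, assumed)."""
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return statistics.mean(scores), half_width
```

For a five-point scale, quadratic weights penalize a 1-versus-5 disagreement sixteen times more than a one-step disagreement, which suits ordinal ratings where near misses are common.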

Statistical testing. Within each backbone, we run paired, non-parametric tests comparing original and domain-adapted variants across metrics and buckets, with false-discovery-rate correction for multiple comparisons. We also report an effect size for the primary endpoint [18, 19].
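The multiple-comparison step can be sketched as below. The procedure shown is the Benjamini–Hochberg step-up correction, which is the one named in Section 4.3; the function name and the example p-values are invented for illustration. The paired p-values themselves would come from a signed-rank test such as `scipy.stats.wilcoxon`, which is not reproduced here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and reject decisions, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted, [q <= alpha for q in adjusted]


# Hypothetical p-values for four metric/bucket comparisons.
adjusted, rejected = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(adjusted, rejected)
```

Correcting across all metric and bucket comparisons within a backbone controls the expected share of false discoveries, at the cost of requiring smaller raw p-values for each individual comparison.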

3.6 Implementation and Reproducibility

Platform. Experiments are conducted on Google Colab Pro+ with A100/V100/T4 GPUs depending on the session, at least 25 GB RAM, and sufficient ephemeral storage. We checkpoint frequently to mitigate session resets.

Software. Python 3.10; PyTorch; Hugging Face Transformers, Datasets, and Accelerate; PEFT when needed. Mixed precision uses bf16 on A100 and fp16 on V100/T4. Tokenizer parallelism is disabled to reduce non-determinism.

Randomness control. We fix three seeds and enable deterministic settings when supported.

Released artifacts. We provide normalization and truncation scripts, train/dev/test splits with cryptographic hashes, evaluation masks, training logs, and both final and best-development checkpoints.
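One simple way to produce the cryptographic hashes mentioned for the data splits is shown below. SHA-256 and the function name are assumptions; the text does not say which hash function is used.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large split files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Publishing one digest per split file lets others confirm that they evaluate on byte-identical data.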

Ethics and intended use. The module reformulates questions; it does not produce diagnoses or medical advice. Data are public and anonymized. We recommend human-in-the-loop use in clinical settings.

4 Results

4.1 Intrinsic results by model

All scores are computed exclusively at masked positions and reported by length (Short / Long / Very Long) plus a macro average. Notation: ↓ (CE, PPL) = lower is better; ↑ (Top-k, TTR) = higher is better.

Table 1
AraBERT (original)
Length	PPL (↓)	CE (↓)	Top-1 (↑)	Top-3 (↑)	Top-5 (↑)	TTR (↑)
Short	7.695	2.040	0.564	0.763	0.801	0.145
Long	8.182	2.101	0.616	0.735	0.784	0.030
Very Long	7.050	1.953	0.587	0.738	0.789	0.032
Macro Avg	7.642	2.031	0.589	0.745	0.791	0.069

Table 2
AraELECTRA (original)
Length	PPL (↓)	CE (↓)	Top-1 (↑)	Top-3 (↑)	Top-5 (↑)	TTR (↑)
Short	120.511	4.791	0.521	0.650	0.682	0.145
Long	184.893	5.219	0.498	0.621	0.670	0.030
Very Long	434.491	6.074	0.461	0.597	0.647	0.032
Macro Avg	246.631	5.361	0.493	0.623	0.667	0.069

Table 3
AraELECTRA-MLM (DAPT)
Length	PPL (↓)	CE (↓)	Top-1 (↑)	Top-3 (↑)	Top-5 (↑)	TTR (↑)
Short	6.026	1.796	0.597	0.801	0.849	0.145
Long	7.450	2.008	0.574	0.741	0.779	0.030
Very Long	10.911	2.389	0.529	0.685	0.734	0.032
Macro Avg	8.129	2.064	0.567	0.742	0.787	0.069

Table 4
AraBERT-MLM (DAPT)
Length	PPL (↓)	CE (↓)	Top-1 (↑)	Top-3 (↑)	Top-5 (↑)	TTR (↑)
Short	4.488	1.501	0.645	0.806	0.860	0.145
Long	4.177	1.429	0.712	0.797	0.859	0.030
Very Long	6.138	1.814	0.629	0.759	0.797	0.032
Macro Avg	4.934	1.581	0.662	0.787	0.839	0.069

Table 5
Macro-average comparison
Model	CE (↓)	PPL (↓)	Top-1 (↑)	Top-3 (↑)	Top-5 (↑)	TTR (↑)
AraBERT (orig.)	2.031	7.642	0.589	0.745	0.791	0.069
AraELECTRA (orig.)	5.361	246.631	0.493	0.623	0.667	0.069
AraELECTRA-MLM	2.064	8.129	0.567	0.742	0.787	0.069
AraBERT-MLM	1.581	4.934	0.662	0.787	0.839	0.069

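Key takeaway. Domain-adaptive pretraining (DAPT) lowers CE/PPL and raises Top-k accuracy over the corresponding original checkpoint in every length bucket. AraBERT-MLM is best overall (highest macro Top-5 of 0.839, lowest CE/PPL), with its highest Top-5 on Short items (0.860) and its largest Top-5 gain over the original AraBERT on Long items (+7.5 pp).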
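The macro averages in Table 5 are unweighted means over the three length buckets. The short check below recomputes the macro Top-5 column from the per-bucket values of Tables 1 to 4; the variable names are illustrative, and because the inputs are already rounded to three decimals, the result can differ from the reported figure in the last digit.

```python
def macro_average(values):
    """Unweighted mean over length buckets (Short, Long, Very Long)."""
    return sum(values) / len(values)


# Per-bucket Top-5 accuracy copied from Tables 1 to 4.
top5_by_bucket = {
    "AraBERT (orig.)": [0.801, 0.784, 0.789],
    "AraELECTRA (orig.)": [0.682, 0.670, 0.647],
    "AraELECTRA-MLM": [0.849, 0.779, 0.734],
    "AraBERT-MLM": [0.860, 0.859, 0.797],
}

# Macro Top-5 as reported in Table 5.
reported = {
    "AraBERT (orig.)": 0.791,
    "AraELECTRA (orig.)": 0.667,
    "AraELECTRA-MLM": 0.787,
    "AraBERT-MLM": 0.839,
}

macro_top5 = {model: macro_average(v) for model, v in top5_by_bucket.items()}
for model, value in macro_top5.items():
    print(f"{model}: recomputed {value:.4f}, reported {reported[model]:.3f}")
```

Because every bucket holds 200 items, the macro average here coincides with the item-weighted average; with unbalanced buckets the two would differ.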

4.2 Qualitative analysis (before → output → reference)

To complement the intrinsic metrics, we report representative triplets that illustrate constrained fill-ins at masked spans. Each triplet comprises the truncated prompt (before), the model’s reformulation confined to the masked slots (output), and the fully specified reference question. We label cases High-Gain (HG) when the output preserves the clinical intent with only superficial differences, Borderline (BL) when the intent is mostly preserved but discourse scaffolding or phrasing is imperfect, and Failure (FL) when key semantics drift—for example, substituting a diagnosis for a symptom (or vice versa).

Table 6
Qualitative examples: before → model output → reference.
Q_before	Q_generated (model)	Q_after (reference)	Label
ما العلاقة بين ___ في المعدة وصعوبة في البلع؟	ما العلاقة بين ألم في المعدة وصعوبة في البلع؟	ما العلاقة بين حرقة شديدة في المعدة وصعوبة في البلع؟	BL
ما هو البروتوكول العلاجي المناسب ___ مفاجئة في الوزن واضطرابات في الدورة الشهرية؟	ما هو البروتوكول العلاجي المناسب لزيادة مفاجئة في الوزن واضطرابات في الدورة الشهرية؟	ما هو البروتوكول العلاجي المناسب في حالة زيادة مفاجئة في الوزن واضطرابات في الدورة الشهرية؟	HG
ما هي الأسباب المحتملة ___ مثل طفح جلدي منتشر مصحوب بحكة شديدة؟	ما هي الأسباب المحتملة لظهور مثل طفح جلدي منتشر مصحوب بحكة شديدة؟	ما هي الأسباب المحتملة لظهور أعراض مستمرة مثل طفح جلدي منتشر مصحوب بحكة شديدة؟	BL
ما هو البروتوكول العلاجي ___ تورم القدمين وضعف في الشهية وخمول عام؟	ما هو البروتوكول العلاجي لعلاج تورم القدمين وضعف في الشهية وخمول عام؟	ما هو البروتوكول العلاجي المناسب في حالة تورم القدمين وضعف في الشهية وخمول عام؟	HG
ما هي الأسباب المحتملة لظهور ___ حرقة شديدة في المعدة وصعوبة في البلع؟	ما هي الأسباب المحتملة لظهور أعراض حرقة شديدة في المعدة وصعوبة في البلع؟	ما هي الأسباب المحتملة لظهور أعراض مستمرة مثل حرقة شديدة في المعدة وصعوبة في البلع؟	BL
ما مدى خطورة استمرار ___ والتعب المزمن وألم في الصدر؟	ما مدى خطورة استمرار الصداع والتعب المزمن وألم في الصدر؟	ما مدى خطورة استمرار ظهور ضيق التنفس والتعب المزمن وألم في الصدر؟	FL
ما الإجراءات الطبية المطلوبة ___ مفاجئة في الوزن واضطرابات في الدورة الشهرية؟	ما الإجراءات الطبية المطلوبة لزيادة مفاجئة في الوزن واضطرابات في الدورة الشهرية؟	ما الإجراءات الطبية المطلوبة عند ملاحظة زيادة مفاجئة في الوزن واضطرابات في الدورة الشهرية؟	HG
كيف يمكن تشخيص حالة ___ ضيق التنفس والتعب المزمن وألم في الصدر؟	كيف يمكن تشخيص حالة من ضيق التنفس والتعب المزمن وألم في الصدر؟	كيف يمكن تشخيص حالة مريض يعاني من ضيق التنفس والتعب المزمن وألم في الصدر؟	HG

Observation. HG rows show faithful restorations with natural Arabic surface form (often differing only by a preposition or a clinical template such as “في حالة … / عند ملاحظة …”). BL rows primarily miss required scaffolding (“أعراض مستمرة”, “مريض يعاني من …”) or have slightly awkward linking. FL rows exhibit semantic drift, typically substituting an incorrect clinical entity.

4.3 Human Validation

To gauge perceived quality, two independent experts, a clinician and a linguist, rated 200 reformulated questions on four five-point scales: Clarity, Fluency, Semantic Fidelity, and Practical Usefulness (1 = poor, 5 = excellent). The raters worked independently; we then averaged their scores. Inter-rater agreement was estimated with weighted Cohen’s κ, with 95% confidence intervals reported for each criterion. Differences between models were tested using the paired Wilcoxon signed-rank test with Benjamini–Hochberg false-discovery-rate correction [20–22].

Table 7
Human evaluation on 200 items by two independent raters
Criterion	Mean ± 95% CI	Weighted κ (95% CI)	Interpretation
Clarity	4.12 ± 0.18	0.81 (0.77–0.85)	Clear and neatly structured rewrites.
Fluency	3.68 ± 0.22	0.78 (0.73–0.82)	Grammatically clean, with subtle lexical noise.
Semantic Fidelity	3.15 ± 0.25	0.74 (0.70–0.79)	Occasional semantic confusion between symptom and diagnosis.
Practical Usefulness	3.42 ± 0.21	0.76 (0.72–0.81)	Clinically appropriate and generally adoptable.

Across the 200 reformulated questions, human ratings indicate that the domain-adapted AraBERT-MLM produces clear, fluent outputs while maintaining acceptable semantic fidelity and practical relevance. Inter-rater agreement was substantial (overall κ ≈ 0.77), reflecting consistent judgments across reviewers. The remaining weaknesses are mostly semantic: occasional swaps between related clinical entities or missing discourse scaffolding. Taken together, these results support the intrinsic gains and identify AraBERT-MLM as the most reliable option for Arabic clinical question reformulation.

5 Discussion

Domain-adaptive pretraining (DAPT) delivers consistent gains in the linguistic coherence and clinical relevance of Arabic question reformulations. The adapted AraBERT-MLM outperforms both its original baseline and the AraELECTRA models, showing lower perplexity and higher Top-k accuracy across all length-controlled splits. Human reviewers reached the same conclusion: the reformulated questions were judged clearer, more fluent, and closer to the intended clinical meaning. This alignment between automatic metrics and expert judgment indicates that continued masked-language pretraining effectively tunes models to Arabic medical discourse. The remaining shortcomings—minor semantic drift and occasional missing discourse scaffolding—point to combining DAPT with explicit clinical templates or multi-task objectives to further improve contextual precision and naturalness.

6 Conclusion

We asked a practical question: can Arabic encoders produce cleaner, more clinically useful questions when exposed to health-domain text? Using the AHQAD corpus for continued masked-language pretraining, the domain-adapted AraBERT-MLM delivered the most reliable gains, with lower perplexity (≈ 4.93) and higher macro Top-5 accuracy (≈ 0.839), and two independent raters, in substantial agreement (κ ≈ 0.77), judged its outputs clear and fluent. The remaining issues were limited and concrete (occasional semantic drift and missing discourse cues), suggesting that lexical and stylistic alignment to medical Arabic, rather than architectural changes, drives most of the improvement. In practice, this makes reformulation a measurable, tunable front end for Arabic clinical QA. Next steps are to verify impact in retrieval-based and generative settings, and to test light template guidance or multi-task objectives to stabilize meaning in longer queries and across specialties.

Author Contribution

W.O. (Walid Ounachad) conceived the study, designed the methodology, and performed the main experiments and analysis. M.K. (Mohamed Khenchouch) contributed to data preprocessing, experimental setup, and result validation. I.Z. (Imad Zeroual) provided supervision, conceptual guidance, and critical review of the manuscript. Y.F. (Yousef Farhaoui) contributed to the interpretation of results and the refinement of the research framework. W.O. wrote the main manuscript text. All authors discussed the results, contributed to the final version of the manuscript, and approved it for submission.

Data Availability

All data and materials used in this study are publicly available. The AHQAD/AHD datasets are accessible from their original publication, and all code used for preprocessing, model training, and evaluation is available on request from the corresponding author.

Acknowledgement

The authors would like to thank the IMIA Laboratory, Faculty of Sciences and Techniques of Errachidia, Moulay Ismail University of Meknes, for providing the research environment and computational resources that supported this work. The authors also thank all colleagues who provided valuable comments and discussions that improved the quality of the manuscript.

References

1. Mashaabi, M., Al-Khalifa, S., & Al-Khalifa, H. (2025). A Survey of Large Language Models for Arabic Language and its Dialects. http://arxiv.org/abs/2410.20238, https://doi.org/10.48550/arXiv.2410.20238

2. Koto, F., Li, H., Shatnawi, S., Doughman, J., Sadallah, A., Alraeesi, A., Almubarak, K., Alyafeai, Z., Sengupta, N., Shehata, S., Habash, N., Nakov, P., & Baldwin, T. (2024). ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic. In: Ku, L. W., Martins, A., and Srikumar, V. (Eds.) Findings of the Association for Computational Linguistics: ACL 2024. pp. 5622–5640. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.334

3. Abdul-Mageed, M., Elmadany, A., & Nagoudi, E. M. B. (2021). ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic. In: Zong, C., Xia, F., Li, W., and Navigli, R. (Eds.) Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 7088–7105. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2021.acl-long.551

4. Lakim, I., Almazrouei, E., Abualhaol, I., Debbah, M., & Launay, J. (2022). A Holistic Assessment of the Carbon Footprint of Noor, a Very Large Arabic Language Model. In: Fan, A., Ilic, S., Wolf, T., and Gallé, M. (Eds.) Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models. pp. 84–94. Association for Computational Linguistics, virtual + Dublin. https://doi.org/10.18653/v1/2022.bigscience-1.8

5. Al-Majmar, N. A., Gawbah, H., & Alsubari, A. (2024). AHD: Arabic healthcare dataset. Data in Brief, 56, 110855. https://doi.org/10.1016/j.dib.2024.110855

6. Salazar, J., Liang, D., Nguyen, T. Q., & Kirchhoff, K. (2020). Masked Language Model Scoring. In: Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (Eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 2699–2712. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.acl-main.240

7. Krippendorff, K. (2019). Content Analysis: An Introduction to Its Methodology. SAGE Publications, Inc. https://doi.org/10.4135/9781071878781

8. Abdallah, A., Kasem, M., Abdalla, M., Mahmoud, M., Elkasaby, M., Elbendary, Y., & Jatowt, A. (2024). ArabicaQA: A Comprehensive Dataset for Arabic Question Answering. In: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2049–2059. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/3626772.3657889

9. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., & Smith, N. A. (2020). Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks. In: Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (Eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 8342–8360. Association for Computational Linguistics, Online. https://doi.org/10.18653/v1/2020.acl-main.740

10. Kauf, C., & Ivanova, A. (2023). A Better Way to Do Masked Language Model Scoring. http://arxiv.org/abs/2305.10588, https://doi.org/10.48550/arXiv.2305.10588

11. AraDPR. Hugging Face. https://huggingface.co/abdoelsayed/AraDPR, last accessed 2025/10/15.

12. Alhuzali, H., Alasmari, A., & Alsaleh, H. (2024). MentalQA: An Annotated Arabic Corpus for Questions and Answers of Mental Healthcare. http://arxiv.org/abs/2405.12619, https://doi.org/10.48550/arXiv.2405.12619

13. Alasmari, A., Alhumoud, S., & Alshammari, W. (2024). AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models. In: Al-Khalifa, H., Darwish, K., Mubarak, H., Ali, M., and Elsayed, T. (Eds.) Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024. pp. 50–56. ELRA and ICCL, Torino, Italia.

14. Gwet, K. L. (2014). Handbook of inter-rater reliability: the definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC.

15. Hegazi, M. O., Al-Dossari, Y., Al-Yahy, A., Al-Sumari, A., & Hilal, A. (2021). Preprocessing Arabic text on social media. Heliyon, 7, e06191. https://doi.org/10.1016/j.heliyon.2021.e06191

16. Loshchilov, I., & Hutter, F. (2019). Decoupled Weight Decay Regularization. http://arxiv.org/abs/1711.05101, https://doi.org/10.48550/arXiv.1711.05101

17. Zhao, C., Wang, X., Huang, Y., Lu, J., & Liu, Z. (2025). TASE: Token Awareness and Structured Evaluation for Multilingual Language Models. http://arxiv.org/abs/2508.05468, https://doi.org/10.48550/arXiv.2508.05468

18. Benjamini, Y., & Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29, 1165–1188. https://doi.org/10.1214/aos/1013699998

19. Storey, J. D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 479–498. https://doi.org/10.1111/1467-9868.00346

20. Tam, T. Y. C., Sivarajkumar, S., Kapoor, S., Stolyar, A. V., Polanska, K., McCarthy, K. R., Osterhoudt, H., Wu, X., Visweswaran, S., Fu, S., Mathur, P., Cacciamani, G. E., Sun, C., Peng, Y., & Wang, Y. (2024). A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit Med, 7, 258. https://doi.org/10.1038/s41746-024-01258-7

21. van der Lee, C., Gatt, A., van Miltenburg, E., & Krahmer, E. (2021). Human evaluation of automatically generated text: Current trends and best practice guidelines. Computer Speech & Language, 67, 101151. https://doi.org/10.1016/j.csl.2020.101151

22. Thomson, C., Reiter, E., & Belz, A. (2024). Common Flaws in Running Human Evaluation Experiments in NLP. Computational Linguistics, 50, 795–805. https://doi.org/10.1162/coli_a_00508
