A Statistical Framework for Consensus-Based Reliability Assessment in Large Language Model Evaluation Applied to Web Accessibility
B. O. Kuzikov [0000-0002-9511-5665], O. A. Shovkoplias [0000-0002-4596-2524],
P. O. Tytov [0009-0003-6911-5463], S. R. Shovkoplias [0000-0003-1837-0213],
O. V. Shutylieva [0000-0002-7236-8555], and O. V. Vlasenko [0000-0003-4315-5654]
Sumy State University, Sumy 40007, Ukraine
o.shovkoplyas@mss.sumdu.edu.ua
Abstract.
Context. Multi-rater evaluation systems require reliable consensus estimation methods when objective ground truth is unavailable. This challenge is common in domains requiring semantic judgments from multiple, variably reliable evaluators.
Methods. We present a statistical framework for consensus-based reliability assessment in ensemble evaluation systems. The methodology employs median aggregation for robust consensus estimation and introduces consistency metrics (R², variance, Spearman correlation) to quantify individual rater alignment. We formalize the consensus problem mathematically, develop core set selection algorithms under cost constraints, and validate the approach using the Intraclass Correlation Coefficient (ICC2k). Theoretical properties include robustness guarantees (50% breakdown point) and ICC monotonicity for nested reference sets.
Results. Applied to semantic similarity assessment using 17 Large Language Models on ~14,384 samples, the framework achieves ICC2k = 0.977 with a 9-model core and 0.955 with an optimized 3-model core, demonstrating excellent inter-rater reliability. The 3-model configuration reduces computational requirements by 67% while maintaining near-equivalent reliability (an ICC decline of only 2.2%). A strong negative correlation (ρ = -0.83) between rater variance and consensus alignment validates the consistency metrics.
Conclusions. The framework achieves excellent inter-rater reliability while enabling significant computational cost reduction. Results validate the robustness of median-based consensus estimation and demonstrate the framework's effectiveness for multi-rater evaluation without ground truth. The methodology generalizes to any ordinal-scale consensus problem, providing a statistically validated approach for scalable annotation tasks.
Keywords: Consensus Estimation · Intraclass Correlation · Inter-Rater Reliability · Multi-Rater Evaluation · LLM Evaluation · Accessibility · WCAG
Introduction
Multi-rater evaluation systems are fundamental to numerous domains where subjective judgments must be aggregated reliably, such as medical diagnosis, academic peer review, and crowdsourced data annotation. A central challenge emerges when objective ground truth is unavailable: how can individual rater reliability be assessed, and how can consensus estimates be constructed that are statistically robust and computationally efficient?
Recent advances in Large Language Models (LLMs) have introduced a new paradigm—using multiple models as automated raters for semantic evaluation tasks [1, 2]. While the “LLM-as-a-Judge” paradigm offers scalability, it inherits the core challenge of validating consensus and model reliability without ground truth.
This problem extends beyond LLM evaluation to any scenario where multiple evaluators of varying quality must be aggregated. Traditional approaches assume either (1) availability of ground truth for calibration, or (2) equal reliability across all raters. Both assumptions are frequently violated in practice. When ground truth exists, the problem reduces to supervised learning; our work addresses the more challenging scenario where it does not.
Mathematical Problem Formulation. Given $M$ evaluation targets and $N$ raters, we observe an evaluation matrix $X = (x_{ij}) \in \mathbb{R}^{M \times N}$, where $x_{ij}$ represents rater $j$'s score for target $i$ on a bounded ordinal scale. In the absence of ground truth, we must: (1) estimate a consensus score $\tilde{x}_i$ for each target $i$; (2) quantify each rater's consistency with this consensus; and (3) select an optimal reference set $S^* \subseteq \{1, \dots, N\}$ that balances reliability and computational cost.
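To make the formulation concrete, the following minimal Python sketch instantiates the evaluation matrix and tasks (1)–(2); the random scores and the reference set `S` are illustrative placeholders, not the study's data:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 4                                        # targets x raters (toy sizes)
X = np.clip(rng.normal(0.3, 0.4, (M, N)), -1, 1)   # X[i, j]: rater j's score for target i

S = [0, 1, 2]                                      # placeholder reference set
consensus = np.median(X[:, S], axis=1)             # (1) consensus score per target
deviations = X - consensus[:, None]                # (2) basis for per-rater consistency
print(consensus.round(2), deviations.var(axis=0).round(3))
```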
Methodological Contributions. We develop a statistical framework that addresses these challenges through:
1. Robust consensus estimation via median aggregation, which provides resistance to outliers and systematic biases (Proposition 1: 50% breakdown point).
2. Multi-metric consistency assessment using R², variance of deviations, and Spearman correlation to characterize rater alignment with consensus.
3. Optimal core set selection formalized as a constrained optimization problem (Eq. 1) that maximizes aggregate consistency under budget constraints.
4. Statistical validation using the Intraclass Correlation Coefficient (ICC2k) to quantify inter-rater reliability, with theoretical guarantees on monotonicity for nested reference sets (Proposition 2).
Unlike approaches that rely on mean aggregation and treat ICC solely as a terminal summary without budget-aware core selection, this work integrates a robust median-based consensus (50% breakdown point), multi-metric consistency profiling (R², variance of deviations, Spearman rank), and an explicit cost-constrained core optimization (Eq. 1). Beyond empirical validation, we state theoretical properties (robustness of the median and ICC monotonicity for nested reference sets), yielding predictable behavior along the "3→5→9 models" trajectory under minimal rank-correlation conditions for added raters. This integration of consistency, cost, and theory enables the practical design of small model cores that retain "excellent" ICC while offering a controlled quality–cost trade-off for scalable annotation without ground truth.
Empirical Validation Domain. We validate the framework on semantic similarity for automated accessibility testing, a task where traditional systems fail and human annotation is prohibitively expensive [3–6]. Specifically, we examine WCAG 2.5.3 (Label in Name) compliance, where visible labels must semantically correspond to accessible names: a task that defies simple string matching and can involve cases ranging from benign clarifications to critical contradictions, including potential Accessibility Cloaking Attacks [7]. LLMs have demonstrated promising capabilities in accessibility-related tasks, including alternative text generation, content simplification, and interface evaluation [8–13], yet their deployment faces practical obstacles: high computational demands, API costs, latency, and provider dependency [11]. Our framework addresses these challenges by identifying minimal model cores that maintain statistical reliability while reducing computational requirements by up to 67%. Using 17 LLMs as automated raters on ~14,384 synthetic English/Ukrainian text pairs, we evaluate semantic correspondence on a [-1, 1] scale. While the application domain involves web accessibility, the statistical methodology is domain-agnostic.
Generalizability. The framework generalizes to diverse NLP tasks requiring consensus without ground truth, including sentiment analysis, translation quality assessment [14], and content moderation scoring [15].
The mathematical formulation (Section 1.3) remains invariant across domains, requiring only ordinal outputs and scale boundedness. For categorical outputs, extensions using Fleiss' κ and weighted consensus schemes are straightforward. The remainder of this paper is organized as follows: Section 1 presents the methodology including dataset construction (1.1), evaluation protocol (1.2), theoretical properties (1.3), and consistency metrics (1.4); Section 2 reports empirical results; Section 3 discusses implications and limitations; Section 4 concludes with methodological generalizations.
1 Methodology
1.1 Dataset Creation and Validation
Taxonomy of Semantic Changes. Our framework introduces a structured evaluation process grounded in a detailed taxonomy of semantic inconsistencies, fully described in [16]. This taxonomy organizes potential LLM interpretation failures into five high-level classes (context extension, action-object changes, action-type changes, negation, and technical modifications) and ten fine-grained subclasses. This structure ensures systematic and comprehensive coverage of diverse evaluation scenarios. Based on it, we generate a synthetic dataset in which each item represents a specific type of inconsistency, allowing for targeted reliability assessment. All data, prompts, evaluation code, anonymized aggregated scores, and step-by-step replication instructions are hosted at the dataset repository [17].
This taxonomy enabled the generation of synthetic English and Ukrainian datasets comprising approximately 14,384 samples for evaluating LLM performance against the WCAG 2.5.3 criterion. To ensure sample diversity, data generation employed leading commercial models, including Anthropic Claude, OpenAI ChatGPT, Grok, and Google Gemini. Model selection for evaluation balanced performance, cost, and operational metrics such as latency. Table 1 presents model specifications and the proportion of pairs processed in batch mode, facilitating assessment of task suitability and language-specific effects on quantitative outcomes. Qualitative characteristics are examined in subsequent sections. Analysis of language influence demonstrates consistent performance across multilingual datasets, with comparable success rates for English and Ukrainian test sets. This validates the framework's language independence and its capacity to process linguistically diverse data while maintaining consistency metrics (R² and variance). The methodology's scalability for big data analytics applications enables high-volume multilingual annotation in large-scale NLP pipelines.
Table 1. Model Assessment Capability

| Model | Model size, ×10⁹ | Price, USD³ | Success rate EN (%) | Success rate UA (%) |
|---|---|---|---|---|
| Blended price⁴ [0.92$...3.5$] | | | | |
| openai/gpt-4.1 | 300 | 2.00/8.00 | 100.0 | 100.0 |
| anthropic/claude-3.5-haiku | ~10–20¹ | 0.80/4.00 | 85.1 | 91.9 |
| deepseek/deepseek-prover-v2 | 671 | 0.50/2.18 | 98.8 | 100.0 |
| deepseek/deepseek-r1 | 671/37² | 0.50/2.18 | 100.0 | 100.0 |
| Blended price (0.1$...0.3$) | | | | |
| meta-llama/llama-4-maverick | 400/17² | 0.17/0.60 | 100.0 | 99.3 |
| google/gemini-2.5-flash-preview | 80 | 0.15/0.60 | 100.0 | 99.8 |
| openai/gpt-4o-mini | ~10–50¹ | 0.15/0.60 | 98.9 | 99.3 |
| qwen/qwen3-235b-a22b | 235/22² | 0.15/0.60 | 99.8 | 99.3 |
| google/gemini-2.0-flash-001 | ~10–50¹ | 0.10/0.40 | 100.0 | 100.0 |
| openai/gpt-4.1-nano | <10¹ | 0.10/0.40 | 100.0 | 98.5 |
| Blended price [0.01$...0.1$] | | | | |
| mistral/ministral-8b | 8 | 0.10/0.10 | 89.2 | 90.9 |
| amazon/nova-micro-v1 | <10¹ | 0.035/0.14 | 99.4 | 99.8 |
| meta-llama/llama-3.1-8b-instruct | 8 | 0.02/0.03 | 90.9 | 86.9 |
| meta-llama/llama-3.2-3b-instruct | 3 | 0.01/0.02 | 65.2 | 39.5 |
| liquid/lfm-3b | 3 | 0.02/0.02 | 47.2 | 10.3 |
| qwen/qwen2.5-coder-7b-instruct | 7 | 0.01/0.03 | 84.6 | 93.9 |
| liquid/lfm-7b | 7 | 0.01/0.01 | 52.7 | 34.1 |

¹ Estimated; the actual parameter count has not been published.
² Model built with a Mixture-of-Experts approach; total and active parameter counts are given.
³ Price via OpenRouter, per million input/generated tokens; the actual price may differ by provider and changes dynamically.
⁴ Blended price defined as a 3:1 ratio of input to output token cost.
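For concreteness, the blended figures in the tier headers follow from footnote 4. A minimal sketch of the arithmetic, assuming the 3:1 mix is a weighted average (a reading consistent with the tier bounds above):

```python
def blended(input_price: float, output_price: float) -> float:
    """Blended USD per million tokens at a 3:1 input-to-output mix (footnote 4)."""
    return (3 * input_price + output_price) / 4

print(blended(2.00, 8.00))  # openai/gpt-4.1       -> 3.50, top of the first tier
print(blended(0.50, 2.18))  # deepseek/deepseek-r1 -> 0.92, bottom of the first tier
```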
1.2 Analysis of Semantic Similarity Using LLMs
To evaluate semantic similarity between “visible text – accessible name” pairs, a specialized query (prompt) for LLMs was developed. Models were tasked with assessing semantic similarity on a scale from − 1.0 (opposite or contradictory meaning) to 1.0 (perfect semantic correspondence).
To improve the accuracy and consistency of responses, the prompt was developed using the few-shot prompting technique. This allowed adapting the models to the specific task and the required [-1.0, 1.0] output scale without additional training.
To reduce processing costs, input rows were grouped into sets of 100 pairs, which significantly affected the quality of results from weaker models: their performance could have been higher with sequential row-by-row processing, as batch processing increases demands on contextual analysis and model robustness to information overload. During processing, all available assessments for pairs were considered, even when a model returned partial results (for example, additional comments or outputs for only some pairs). The experiment was conducted at a temperature of 0.1 to make responses near-deterministic. Table 1 shows the ability of different models to process this task in batch mode for the synthetic English (EN) and Ukrainian (UA) test sets.
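A minimal sketch of this batch protocol, assuming an OpenAI-compatible endpoint (e.g., OpenRouter, which the price notes in Table 1 reference); the prompt wording below is illustrative, not the exact few-shot prompt used in the study:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

def score_batch(model: str, pairs: list[tuple[str, str]]) -> dict[int, float]:
    """Score up to 100 'visible text - accessible name' pairs in one request."""
    numbered = "\n".join(f"{k}. visible: {v!r} | accessible: {a!r}"
                         for k, (v, a) in enumerate(pairs))
    prompt = ("Rate the semantic similarity of each pair from -1.0 (contradictory) "
              "to 1.0 (perfect correspondence). Return a JSON object mapping "
              "pair index to score.\n" + numbered)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.1,                      # near-deterministic responses
    )
    try:
        raw = json.loads(resp.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return {}                             # malformed batch counts as failed
    # Keep all valid scores, even when the model answers only part of the batch
    return {int(k): float(v) for k, v in raw.items()
            if isinstance(v, (int, float)) and -1.0 <= v <= 1.0}
```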
1.3 Theoretical Properties
Core Set Selection. Given a cost constraint $B$ and per-model costs $c_j$, the optimal reference set $S^*$ solves

$$S^* = \arg\max_{S \subseteq \{1,\dots,N\}} \sum_{j \in S} R_j^2 \quad \text{subject to} \quad \sum_{j \in S} c_j \le B. \qquad (1)$$
Our empirical analysis (Section 2) shows diminishing returns beyond 3–5 models, validating greedy selection by R².
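A hedged sketch of this greedy heuristic follows. The R² values are taken from Table 2 (they are computed against the final 3-model core, so the example is circular by construction and purely illustrative), the blended costs from Table 1, and the budget is arbitrary:

```python
def greedy_core(r2: dict[str, float], cost: dict[str, float], budget: float) -> list[str]:
    """Greedily add the most consensus-aligned models while the budget allows."""
    core, spent = [], 0.0
    for model in sorted(r2, key=r2.get, reverse=True):  # highest R^2 first
        if spent + cost[model] <= budget:
            core.append(model)
            spent += cost[model]
    return core

r2 = {"gemini-2.5-flash": 0.91, "claude-3.5-haiku": 0.90,
      "deepseek-r1": 0.87, "gpt-4.1": 0.85}
cost = {"gemini-2.5-flash": 0.26, "claude-3.5-haiku": 1.60,
        "deepseek-r1": 0.92, "gpt-4.1": 3.50}           # blended USD per 1M tokens
print(greedy_core(r2, cost, budget=3.0))  # recovers the 3-model core of Section 2
```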
1.4 Methodology for Evaluating the Consistency of Language Models
To ensure a scalable and consistent annotation of large datasets, this study adopts the "LLM-as-a-Judge" paradigm, where LLMs serve as automated evaluators for complex NLP tasks [3, 4]. For each of the $M$ text pairs, we obtained numerical evaluations from $N = 17$ models on a unified scale of $[-1, 1]$, forming an evaluation matrix $X = (x_{ij}) \in [-1, 1]^{M \times N}$, where $x_{ij}$ is the numerical evaluation of text $i$ by model $j$, $M$ is the number of texts (instances), and $N$ is the number of models (LLMs).
Since objective ground truth is unavailable, model reliability is assessed relative to a consensus evaluation derived from a trusted reference set of models $S$. The consensus for text $i$ is defined as the median of the reference models' scores: $\tilde{x}_i = \operatorname{median}\{x_{ij} : j \in S\}$. The median is chosen for its robustness to outliers, a common practice when aggregating expert evaluations without ground truth. The deviation of model $j$'s evaluation on instance $i$ is defined as $d_{ij} = x_{ij} - \tilde{x}_i$.
For each model $j$, the following consistency indicators with respect to the consensus are calculated (a computational sketch follows the list):
1. Variance of deviations (characterizes the "noisiness" of the model): $\sigma_j^2 = \frac{1}{M}\sum_{i=1}^{M}\left(d_{ij} - \bar{d}_j\right)^2$, where $\bar{d}_j = \frac{1}{M}\sum_{i=1}^{M} d_{ij}$.
2. Coefficient of determination $R_j^2 = 1 - \sum_{i}\left(x_{ij} - \tilde{x}_i\right)^2 \big/ \sum_{i}\left(x_{ij} - \bar{x}_j\right)^2$, where $\bar{x}_j = \frac{1}{M}\sum_{i} x_{ij}$. This indicator measures how much better the consensus score explains the model's evaluations compared to the model's own average score. $R_j^2 = 1$ indicates perfect explanation, while negative values indicate that the consensus is a worse predictor than the model's average.
3. Spearman's rank correlation coefficient $\rho_j = \rho\left(x_{\cdot j}, \tilde{x}\right)$. This indicator captures the monotonic correspondence between the ranking of texts by the model's evaluations and by the consensus, and is insensitive to monotone transformations of scale and shift.
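A minimal sketch of all three indicators computed against a median consensus (NumPy/SciPy; a complete-data matrix is assumed for brevity):

```python
import numpy as np
from scipy.stats import spearmanr

def consistency(X: np.ndarray, S: list[int]) -> dict[str, np.ndarray]:
    """X: M x N evaluation matrix; S: indices of the reference models."""
    consensus = np.median(X[:, S], axis=1)            # consensus per text (median over S)
    d = X - consensus[:, None]                        # deviations d_ij
    var = d.var(axis=0)                               # sigma^2_j: variance of deviations
    ss_res = (d ** 2).sum(axis=0)                     # sum_i (x_ij - consensus_i)^2
    ss_tot = ((X - X.mean(axis=0)) ** 2).sum(axis=0)  # sum_i (x_ij - mean_j)^2
    r2 = 1.0 - ss_res / ss_tot                        # R^2_j vs. the model's own mean
    rho = np.array([spearmanr(X[:, j], consensus).correlation
                    for j in range(X.shape[1])])      # Spearman rank correlation per model
    return {"variance": var, "r2": r2, "spearman": rho}
```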
The use of these metrics allows for an objective comparison of models with each other, even in the absence of ground truth, focusing on their internal consistency. Special attention should be paid to the selection of the reference set $S$, as it forms the basis of comparison for the remaining models.
The change in $R_j^2$ can serve as a criterion for including or excluding a model from the reference set. It is worth noting that consistency metrics such as $R_j^2$ can be artificially inflated when the consensus is formed from a broad set of models. This is due to an internal correlation effect, where each model's evaluation is compared against an aggregate that includes its own values. Therefore, a smaller, highly correlated reference core is preferable to mitigate this statistical artifact; on the other hand, every model added to the reference set requires additional resources to compute its values.
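One complementary safeguard, sketched below under the assumption of a complete matrix, is a leave-one-out consensus: when scoring a model that itself belongs to the reference set, recompute the median without it. This is not the procedure adopted here (we instead keep the core small), but it isolates the self-correlation effect:

```python
import numpy as np

def loo_consensus(X: np.ndarray, S: list[int], j: int) -> np.ndarray:
    """Median consensus over S with model j excluded from its own reference."""
    ref = [k for k in S if k != j]
    return np.median(X[:, ref], axis=1)
```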
To determine the reference set $S$, a correlation matrix was constructed; its values and heat map are presented in Fig. 1. This allowed identifying a core of models {1–9} that strongly correlate with each other.
To assess inter-model reliability, the intraclass correlation coefficient (ICC, Average Random Raters, ICC2k) was used. This statistical indicator assesses the consistency of evaluations provided by different models for the same objects. The calculation of ICC requires evaluations from all studied models for each pair of texts, which may lead to the exclusion of some pairs from the analysis.
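A minimal sketch of the ICC2k computation, assuming the pingouin package and a long-format score table; the column names are illustrative:

```python
import pandas as pd
import pingouin as pg

def icc2k(scores: pd.DataFrame) -> pd.Series:
    """scores columns: 'pair' (text-pair id), 'model', 'score' in [-1, 1]."""
    icc = pg.intraclass_corr(data=scores, targets="pair", raters="model",
                             ratings="score", nan_policy="omit")  # drops incomplete pairs
    return icc.set_index("Type").loc["ICC2k"]  # Average Random Raters row: ICC, CI95%, ...
```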
2 Results
The analysis of LLM consistency in the task of semantic similarity assessment was conducted based on the metrics described in the methodology.
Cross-correlation and Selection of the Reference Core of Models. Figure 1 shows cross-correlations identifying a stable core {1–9}. Notably, intra-developer correlation did not exceed inter-developer correlation, indicating architectural diversity.
Inter-model Reliability (ICC). When analyzing the group of 9 most consistent models, the calculated intraclass correlation coefficient (ICC) using the Average Random Raters scheme (ICC2k) was 0.977 with a 95% confidence interval [0.96, 0.98] (after rounding), using 86.4% of data without missing values. The obtained value indicates a high level of inter-model consistency, which is classified as “excellent” reliability.
Separately, an optimal core (based on the balance of quality indicators and cost) of three models (anthropic/claude-3.5-haiku, deepseek/deepseek-r1, and google/gemini-2.5-flash-preview) was identified, for which the ICC was 0.955 (CI95% [0.95, 0.96]) based on 88.1% of values without missing data. This is also a very high indicator, while such a core requires significantly fewer computational resources. Although the 3-model core's ICC was slightly lower (0.955 vs. 0.977), its reliability remained "excellent," making it a computationally efficient choice for resource-constrained scenarios.
These values exceed the 0.90 threshold for "excellent" reliability and empirically validate Proposition 2: the 9-model core contains the optimized 3-model core ($S_3 \subset S_9$), with $\mathrm{ICC}(S_3) \le \mathrm{ICC}(S_9)$, as predicted.
Qualitative Indicators of Models Relative to the Reference Core. Using the defined core of three models to compute the consensus, the relative quality of the other models was evaluated. Table 2 reports the evaluation bias (d̄ⱼ), "model noisiness" (σ²ⱼ), and consensus consistency (R²ⱼ).
Table 2. Qualitative indicators of models relative to the core

| Model | d̄ⱼ | σ²ⱼ | R²ⱼ |
|---|---|---|---|
| Blended price [0.92$...3.5$] | | | |
| openai/gpt-4.1 | 0.04 | 0.05 | 0.85 |
| anthropic/claude-3.5-haiku | 0.04 | 0.04 | 0.90 |
| deepseek/deepseek-prover-v2 | 0.00 | 0.10 | 0.62 |
| deepseek/deepseek-r1 | 0.00 | 0.05 | 0.87 |
| Blended price (0.1$...0.3$) | | | |
| meta-llama/llama-4-maverick | -0.03 | 0.11 | 0.75 |
| google/gemini-2.5-flash-preview | -0.05 | 0.03 | 0.91 |
| openai/gpt-4o-mini | -0.06 | 0.10 | 0.71 |
| qwen/qwen3-235b-a22b | 0.01 | 0.09 | 0.78 |
| google/gemini-2.0-flash-001 | -0.08 | 0.11 | 0.70 |
| openai/gpt-4.1-nano | 0.15 | 0.21 | -0.69 |
| Blended price [0.01$...0.1$] | | | |
| mistral/ministral-8b | 0.28 | 0.27 | -0.75 |
| amazon/nova-micro-v1 | 0.02 | 0.22 | 0.37 |
| meta-llama/llama-3.1-8b-instruct | 0.05 | 0.29 | 0.18 |
| meta-llama/llama-3.2-3b-instruct | -0.07 | 0.41 | 0.22 |
| liquid/lfm-3b | 0.17 | 0.35 | 0.63 |
| qwen/qwen2.5-coder-7b-instruct | 0.16 | 0.29 | 0.07 |
| liquid/lfm-7b | 0.11 | 0.18 | 0.76 |
In this table, d̄ⱼ (whether negative or positive) shows systematic deviation from the core value (model bias), while σ²ⱼ shows unsystematic deviation ("noise"). Models with low or negative R²ⱼ represent a "minority report," a fundamentally different interpretation of the task. While not suitable for the reference core, analyzing their outputs can provide insights into ambiguous cases or alternative semantic perspectives.
The results show that more powerful models tend to demonstrate better consistency with the consensus and less “noisiness”. Negative R² values for models gpt-4.1-nano and ministral-8b indicate that their predictions deviate from consensus more than a naive constant baseline, confirming systematic misalignment. These models should be excluded from the reference set per the optimization criterion in Eq. (1). The strong negative correlation (Pearson ρ = -0.83) between variance σ²ⱼ and R²ⱼ across all models demonstrates that consistency with consensus inversely relates to evaluation noise: lower-variance models achieve higher explanatory power.
3 Discussion
Our findings confirm that LLMs can effectively automate semantic similarity evaluation for the WCAG 2.5.3 criterion. The developed taxonomy of semantic changes and the methodology for generating synthetic data using LLMs allowed for the creation of relevant datasets for further analysis. While manual verification of every generated instance is impractical, the framework's reliance on synthetic data is methodologically robust. By ensuring that our dataset covers a comprehensive taxonomy of 5 classes and 10 subclasses of inconsistencies, we systematically account for a wide range of potential failure modes. The adequacy of this approach is validated through the inter-class comparison of mean reliability scores, which allows us to confirm that the framework correctly differentiates between varying types and severities of rater disagreement. This structured, variance-aware evaluation serves as a proxy for correctness in the absence of absolute ground truth.
The use of a semantic similarity scale in the range [-1.0, 1.0] proved effective for differentiated assessment: it allows not only determining the degree of semantic proximity but also identifying cases of opposite meaning. This is critically important for identifying potentially dangerous or disorienting discrepancies and can be considered an indirect indicator of Accessibility Cloaking Attacks, where content is presented differently to users and to assistive technologies.
The LLM-as-a-judge approach and the calculation of the intraclass correlation coefficient (ICC) confirmed a high degree of consistency among leading LLMs in assessing semantic similarity on the proposed scale. This suggests that LLMs can be used as a reliable tool for data annotation in cases where manual labeling is too resource-intensive. The ability of models with a large number of parameters to demonstrate strong cross-correlation may indicate that, despite differences in architecture and training data, they perform natural language processing tasks with similar quality, and their responses tend toward the true value.
An important observation is that a model's ability to generate syntactically correct responses (as shown by the high percentage of successful generations in Table 1 for many models) does not always directly correlate with the semantic quality and consistency of its evaluations. For example, the amazon/nova-micro-v1 model demonstrated high reliability in forming responses; however, its semantic consistency with the reference core was lower compared to some other models with similar or even lower generation success (Table 2). This highlights the need to distinguish a model's formal operability from its semantic analysis quality.
Despite the framework's success, it is important to acknowledge the practical limitations of using LLMs, such as high computational requirements and API-related costs, which remain obstacles for widespread, real-time implementation. While this research establishes a solid baseline for quality assessment, the results also highlight the need to develop more resource-efficient solutions.
Methodological Generalizability. While demonstrated on WCAG 2.5.3 semantic similarity, the proposed framework applies to any ordinal-scale multi-model evaluation task: (1) sentiment analysis consensus [-1 to 1]; (2) text quality scoring [0 to 1]; (3) translation adequacy rating [1 to 5]; (4) any NLP annotation requiring inter-rater reliability. The mathematical formulation (Eq. 1) remains invariant to domain, requiring only ordinal model outputs, scale boundedness, and a reference set selection criterion. Future work includes extending the framework to handle categorical (e.g., via Fleiss' κ) and multimodal (e.g., joint text–image) evaluation tasks.
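As a pointer for the categorical extension, a hedged sketch using statsmodels' implementation of Fleiss' κ (toy labels, illustrative only):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# ratings[i, j]: category chosen by rater j for item i (toy 3-category labels)
ratings = np.array([[0, 0, 1],
                    [2, 2, 2],
                    [1, 1, 0],
                    [0, 0, 0]])
table, _ = aggregate_raters(ratings)  # item x category count table
print(fleiss_kappa(table))            # chance-corrected categorical agreement
```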
4 Conclusions
This study demonstrated that LLMs can reliably evaluate semantic similarity for web accessibility, specifically for the WCAG 2.5.3 criterion. This research yielded the following significant outcomes:
1) A formalized statistical framework for consensus estimation and reliability assessment in multi-model LLM systems, with theoretical guarantees (Propositions 1–2) and optimal core set selection (Eq. 1).
2) The optimized 3-model core reduces the number of required API calls by 67% (from 9 to 3 models) compared to the full core, while ICC2k decreases by only 2.2% (from 0.977 to 0.955), demonstrating a favorable cost–quality trade-off.
3) The methodology generalizes to multi-rater consensus problems in NLP, enabling scalable annotation and model comparison without ground truth.
Despite the demonstrated potential, the use of LLMs is associated with high computational and financial costs. The obtained results create a foundation for further research aimed at developing more economical and efficient solutions.
Research context. This research was conducted as part of the research project of Sumy State University titled “Information technology models and methods for analysis and synthesis of structural, information and functional models, objects and automated processes” (state registration number 0120U103071).
Author Contributions.
Conceptual framework design, problem formulation, and supervision of research – Borys Kuzikov; Development of statistical methodology, mathematical formalization of consensus estimation and reliability metrics – Oksana Shovkoplias; Formulation of optimization model for core set selection and theoretical analysis of robustness and ICC monotonicity – Serhii Shovkoplias; Implementation of experimental pipeline, dataset synthesis, and computational experiments – Pavlo Tytov; Statistical validation, empirical analysis, and visualization of results – Oleksandr Vlasenko; Literature review, comparative analysis with existing approaches, and preparation of the manuscript – Olha Shutylieva.
All authors contributed to discussions, reviewed the results, and approved the final version of the manuscript.
References
1. Gu J, Jiang X, Shi Z, Tan H, Zhai X, Xu C, Li W, Shen Y, Ma S, Liu H, Wang S, Zhang K, Lin Z, Wang Y, Ni L, Gao W, Guo J (2024) A survey on LLM-as-a-Judge. https://doi.org/10.48550/arXiv.2411.15594
2. Li H, Dong Q, Chen J, Su H, Zhou Y, Ai Q, Ye Z, Liu Y (2024) LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods. https://doi.org/10.48550/arXiv.2412.05579
3. Automated WCAG Testing is Not Enough for Web Accessibility ADA Compliance. https://blog.usablenet.com/automated-wcag-testing-is-not-enough-for-web-accessibility-ada-compliance, last accessed 2025/06/14
4. The Automated Accessibility Coverage Report. https://www.deque.com/automated-accessibility-testing-coverage, last accessed 2025/06/20
5. Sane P (2021) A Brief Survey of Current Software Engineering Practices in Continuous Integration and Automated Accessibility Testing. In: 2021 Sixth International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET). IEEE, pp. 130–134. https://doi.org/10.1109/WiSPNET51692.2021.9419464
6. What's the difference between manual and automated accessibility testing? https://www.boia.org/blog/whats-the-difference-between-manual-and-automated-accessibility-testing, last accessed 2025/08/15
7. Kuzikov B, Tytov P, Shovkoplias O, Lavryk T, Koval V, Kuzikova S (2025) Detection and Prevention of Accessibility Cloaking Attacks. Inf Technol Comput Sci Softw Eng Cyber Secur 124–135. https://doi.org/10.32782/IT/2025-1-17
8. Improving accessibility through leveraging large language models (LLMs). https://www.deque.com/axe-con/sessions/improving-accessibility-through-leveraging-large-language-models-llms, last accessed 2025/09/18
9. Fatiul Huq S, Tafreshipour M, Kalcevich K, Malek S (2025) Automated Generation of Accessibility Test Reports from Recorded User Transcripts. In: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE). Ottawa, ON, Canada, pp. 534–546
10. Improving accessibility through leveraging large language models (LLMs). https://a11ytalks.com/posts/2024-may-16, last accessed 2025/09/18
11. Delnevo G, Andruccioli M, Mirri S (2024) On the Interaction with Large Language Models for Web Accessibility: Implications and Challenges. In: Proceedings of the IEEE Consumer Communications and Networking Conference (CCNC). IEEE, pp. 1–6
12. Aralimatti R, Shakhadri SAG, KR K, Angadi K (2025) Fine-Tuning Small Language Models for Domain-Specific AI: An Edge AI Perspective. https://doi.org/10.20944/PREPRINTS202502.2128.V1
13. Kuzikov BO, Tytov PO, Shovkoplias OA (2025) Analysis of Web Accessibility of Ukrainian Higher Education Institutions' Websites. Syst Res Inf Technol 2:139–150. https://doi.org/10.20535/SRIT.2308-8893.2025.2.10
14. Yang B, He L, Liu K, Yan Z (2024) VIAssist: Adapting Multi-Modal Large Language Models for Users with Visual Impairments. In: Proc. 2024 IEEE International Workshop on Foundation Models for Cyber-Physical Systems & Internet of Things (FMSys), pp. 32–37. https://doi.org/10.1109/FMSys62467.2024.00010
15. Artificial Intelligence in Digital Accessibility – And How We Already Use It. https://eye-able.com/blog/artificial-intelligence-in-digital-accessibility-and-how-we-already-use-it, last accessed 2025/09/25
16. Kuzikov BO, Shovkoplias OA, Tytov PO, Shovkoplias SR (2025) Application of small language models for semantic analysis of web interface accessibility. Probl Program 2:77–86. https://doi.org/10.15407/pp2025.02.077
17. Tytov PO Semantic Language Models for WCAG. https://www.kaggle.com/datasets/tytovpavel/semantic-language-models-for-wcag, last accessed 2025/09/25