1. Introduction
Systematic reviews (SRs) are pivotal for informing evidence-based practice, consensus building, and the identification of research gaps. Timely evidence synthesis is essential for supporting decision-making, yet the traditional review process can delay the delivery of actionable insights [1–3]. The rapid growth of scientific literature, partly driven by advances in artificial intelligence (AI) and digital technologies [4, 5], has further increased the volume of published articles, intensifying the workload for systematic reviewers. Accelerating the screening process without compromising the accuracy of manual review could therefore enhance the efficiency of evidence generation and deliver timely results. To address this, a growing number of AI-aided software tools have been developed to improve the speed and consistency of screening [6–8]. These tools vary in accessibility and functionality, ranging from freely available applications that provide core screening features to subscription-based systems offering more advanced capabilities such as automated duplicate removal, predictive prioritization, and workflow management. Some tools incorporate machine learning algorithms that continuously learn from reviewer decisions to prioritize unscreened records according to their likelihood of inclusion, while others rely primarily on manual review with optional AI support.
Some AI-aided tools offer support across multiple stages of the review process (e.g., Rayyan, PICO Portal), while others focus on specific tasks such as abstract screening (e.g., Abstrackr), which is among the most well-supported, widely implemented, and technically advanced features [9, 10]. However, the initial adoption of AI-aided tools in SRs was relatively slow, as evidenced by a 2018 study that surveyed systematic reviewers in the biomedical sciences [11]. Several factors contributed to this, including licensing costs, steep learning curves, and limited support for some freely available tools. Additionally, a study published in 2018 reported that a lack of awareness contributed to low uptake of AI-aided screening tools [12]. However, a more recent survey (published in 2021) of reviewers involved in evidence synthesis, health technology assessment, and guideline development, all of whom had conducted intervention, scoping, accuracy, or prognostic reviews, found that 79% reported using automation tools during the screening phase [13].
A growing body of literature, including several reviews, has explored the landscape of AI-aided screening tools, revealing key trends in focus, functionality, and limitations [9, 10, 14–18], with a common focus on feature identification and comparison. Some reviews [9, 11, 15] have systematically catalogued the technical capabilities of existing screening platforms, particularly in relation to dual screening, reference management, and machine learning integration. These studies often relied on structured frameworks (e.g., DESMET) [19] or weighted scoring systems to evaluate tool functionality [14, 18].
Other features of freely available AI-aided screening tools, such as usability and accessibility, have been evaluated, particularly for non-technical users [10, 15]. Marshall et al. (2019) [10] conducted a descriptive study introducing several tools for screening tasks, including Abstrackr [8], EPPI-Reviewer [20], RobotAnalyst [21], SWIFT-Review [22], Colandr [23], and Rayyan [24], with a focus on their applicability for researchers without programming expertise. Harrison et al. (2020) [15] emphasized user-friendliness and interface design, identifying Covidence and Rayyan as the most popular tools due to their ease of use. Similarly, two other studies [17, 25] conducted scoping reviews to map the overall landscape of automation in screening. Blaizot et al.'s review included 12 systematic reviews in the health sciences that employed some form of AI method; among these, eight reviews used AI during the title and abstract screening stages and included evaluations conducted by the authors of each review. The findings revealed time savings and workload reductions associated with the use of AI. The 47 studies included in the Khalil et al. (2022) scoping review examined tools for the semi-automation of various stages of systematic reviews, encompassing both freely available and commercially licensed software (e.g., DistillerSR [26]); the authors noted that the screening stage has reached a relatively high level of maturity, although many tools remain insufficiently validated, are often domain-specific, and may present usability or accessibility challenges. Other studies have aimed to evaluate the performance and real-world application of AI-aided screening tools, though such work remains relatively limited. Reis et al. (2023), for example, assessed three screening tools (Rayyan, Abstrackr, and Colandr) based on speed, accuracy, and perceived usefulness, incorporating both objective outcomes and user perspectives. Their findings suggested that Rayyan offered the best user experience and achieved the highest true-positive rate, meaning it correctly identified the largest number of relevant studies [18]. All three tools demonstrated comparable performance in identifying true negatives, as well as in measures of proportion missed, workload reduction, and time savings [18]. Similarly, Kohl et al. (2018) [27] compared several AI-aided tools used at different stages of the review process, including screening, highlighting their respective features and usability. Despite evaluating only a small number of tools, they identified CADIMA [27] as the most favourable. Furthermore, some studies evaluated AI-aided tool performance only within a single domain, such as health care [9, 17].
Taken together, these studies demonstrate a growing understanding and application of AI-aided screening tools, particularly in terms of available features and general usability. However, several key gaps remain. First, there is a need for studies that systematically re-evaluate AI-aided screening tools with respect to their updated feature sets and performance improvements. Second, existing evaluations have not assessed these tools in the context of umbrella reviews (studies that synthesize evidence from multiple systematic reviews rather than primary studies), which are becoming increasingly common alongside AI development [28]. Umbrella reviews present additional challenges, as automation tools must not only assess topical relevance and adhere to pre-defined eligibility criteria, but also accurately identify study designs by distinguishing between systematic reviews and primary research. Third, there is a need to evaluate these tools in multidomain or interdisciplinary contexts, as performance may vary when applied to reviews spanning diverse fields. In addition, there is a broader methodological gap regarding the transparent and standardized comparison of available screening tools. This has been underscored in a recent study [16] that highlighted the need for larger evaluation datasets from completed reviews to enable more robust comparisons of tool performance and generalizability across review types and disciplines.
To address these gaps, we aimed to systematically catalogue existing tools that automate abstract screening and to conduct a comparative study of web-based, freely available AI-aided abstract screening tools by evaluating their feature sets, screening performance, and user experience. We focused on tools that are freely accessible or offer at least a trial version, to ensure that all evaluated methods could be readily accessed, tested, and reproduced by other researchers without subscription barriers. Using a previously completed umbrella review on the impacts of urban environments on cognitive health as a case study, we performed three complementary analyses in this Study Within a Review (SWAR) [29]: (1) a feature analysis based on established frameworks; (2) a performance assessment of each tool's ability to retrieve relevant studies within an initial screening subset; and (3) a user survey capturing subjective perceptions of learnability, interface intuitiveness, and time to proficiency. By situating this evaluation within an umbrella review context, our study provides novel insights into how current freely available AI-aided tools perform under more demanding methodological requirements and highlights avenues for future development in AI-aided abstract screening.
2. Methods
This SWAR was registered in the SWAR repository as SWAR 25[30].
2.1 Identification of AI-Aided Abstract Screening Tools
We identified freely available AI-aided abstract screening tools through a comprehensive search strategy utilising multiple sources, including systematic electronic database searches to capture relevant published studies. University library webpages specifically detailing systematic review methodologies were also consulted, as these resources often curate and recommend evidence synthesis tools. The Systematic Review Toolbox [31], an online platform providing a comprehensive repository of software tools designed to support various stages of systematic reviews and evidence synthesis, was temporarily unavailable during our search period due to ongoing maintenance. However, we contacted the Systematic Review Toolbox team directly and received written guidance regarding currently available and emerging tools suitable for abstract screening. Additionally, we conducted extensive web searches and participated in webinars, including those hosted by academic institutions such as King’s College London (e.g., https://libguides.kcl.ac.uk/systematicreview/ai), to understand the latest developments in AI-aided screening tools. Finally, we employed Research Rabbit (https://www.researchrabbit.ai/), an AI-powered research discovery platform, to identify additional relevant literature and tools, and to sense-check whether any pertinent studies had been overlooked in the above process.
To map the evidence on the evaluation of AI-aided abstract screening tools, we constructed a matrix with identified tools listed in the rows and individual studies in the columns (see Table S1). This approach allowed us to visualize the extent and distribution of tool evaluations, identify which tools had been widely studied versus those with limited or no evaluation, and to detect potential redundancies or gaps in the literature.
An evaluation framework was developed to assess the identified AI-aided abstract screening tools (see Table S2) and was composed of three sections: (1) performance metrics, (2) feature analysis (see Table S3), and (3) user survey (see file S1). This framework was based on previous studies that evaluated AI-aided screening tools at various stages of systematic reviews [14, 18]. Four team members (SA, SMJ, CT, NOK) were assigned to test the tools, with each tool evaluated by two randomly selected raters. All raters received training based on our recently completed umbrella review synthesising evidence on the association between urban environment features and cognitive health. Subsequently, each rater independently evaluated the tools.
2.1.1 Database Search
Searches were conducted in Embase, MEDLINE, and the Cochrane Library to identify studies using AI-aided abstract screening tools across various evidence synthesis methodologies. The search covered publications from January 2021 to April 2024. This date range was chosen to build upon the comprehensive systematic review previously conducted by Blaizot et al. [17], which covered AI-aided abstract screening tools up to 2021, allowing our search to focus on capturing recent developments and publications from 2021 onward. The initial search was conducted in November 2024 to ensure inclusion of the most recent studies.
The search strategy was developed collaboratively by two members of the research team (SJ and SA), adapting search terms from Blaizot et al. [17] and Khalil et al. [25]. The strategy was piloted in MEDLINE to ensure an optimal balance between comprehensiveness and precision. The finalised search strings were subsequently adapted for the Cochrane Library. The full search strategy is provided in Table S4.
2.1.2 AI-aided Screening Tool Identification
Screening was conducted in two phases using Covidence. Titles and abstracts were independently screened by two team members (SA and SMJ), followed by full-text screening. Any disagreements were resolved through discussion.
We included systematic reviews, rapid reviews, umbrella reviews, evidence gap maps, evidence mapping, scoping reviews, and similar review types that employed AI-aided abstract screening tools with clearly reported AI methods. Methodology papers introducing novel AI models or screening software were also included. Only articles written in English were included. Priority was given to tools designed for interdisciplinary searches, although healthcare-specific tools were included. Review protocols, conference proceedings, and studies lacking full-text availability were excluded to ensure the inclusion of complete and accessible evidence.
Following the identification of eligible studies, additional screening criteria were applied to determine which tools to evaluate in detail. Specifically, tools were required to be currently functioning, not to require coding, to be free to use, and not to be limited to a specific subject domain.
2.2 Assessment of AI-aided Abstract Screening Tools
All activities related to testing the identified AI-aided tools in this study were guided by a previously conducted umbrella review; pilot testing of the inclusion/exclusion criteria among the research team was one of these preparatory tasks. Following the selection of AI-aided tools, we tested their usability and performance using a set of 2,212 abstracts identified in the umbrella review. The review applied the search strategy provided in Table S5 to PubMed, Embase, and PsycINFO, without restrictions on publication date, and examined the effects of urban environments on cognitive health. Before assigning AI tools for evaluation, we assessed the consistency of each reviewer’s screening judgments to confirm that all reviewers could reliably apply the same eligibility criteria. Establishing this baseline consistency ensured that any observed differences in screening outcomes during AI-assisted evaluations could be attributed to tool performance rather than variability in human judgment. To achieve this, we conducted a calibration exercise using a 10% random sample of abstracts from the umbrella review dataset (n = 250); the sample was prepared by researchers not involved in the screening and stratified to include a representative distribution of previously included and excluded studies. The original screening decisions, used as the reference standard, were made by a different team of experienced systematic reviewers following dual independent screening and resolution of conflicts through consensus.
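A minimal sketch of how such a stratified calibration sample can be drawn, assuming the dataset is held as a table with one row per abstract and a column recording the original include/exclude decision; the record counts, column names, and library call below are illustrative assumptions, not the study's actual code or data.

```python
# Illustrative sketch only: draw a ~10% calibration sample, stratified on the
# original include/exclude decision (counts and column names are hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split

records = pd.DataFrame({
    "record_id": range(2212),
    "original_decision": [1] * 112 + [0] * 2100,  # 1 = included, 0 = excluded
})

_, calibration_sample = train_test_split(
    records,
    test_size=0.10,                              # roughly 10% of the dataset
    stratify=records["original_decision"],       # keep the include/exclude ratio
    random_state=42,
)
print(len(calibration_sample), calibration_sample["original_decision"].mean())
```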
Each of the three current reviewers (SJ, CT, and NOK) independently screened the calibration sample while blinded to the original decisions. Their inclusion and exclusion judgments were compared to the reference standard by one of the authors of the present study (SA). Conflicts among screening decisions were discussed and resolved through consensus. Raw rates of agreement exceeded 95% across all reviewers, and Cohen’s kappa values ranged from 0.900 to 0.986, indicating a high degree of agreement according to the Landis and Koch interpretation scale [REF]. These results provided strong evidence that the current reviewers could consistently and accurately apply the previously defined eligibility criteria. Subsequently, AI-aided tools were allocated to reviewers, with each tool being tested by two reviewers. To mitigate potential bias, the distribution of tools was designed to ensure that reviewers had either no or only minimal prior experience with the tools.
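The agreement statistics reported above can be reproduced from paired decision lists; the sketch below shows one way to compute raw agreement and Cohen's kappa for a single reviewer against the reference standard (the function name and dummy data are illustrative, not the study's records).

```python
# Illustrative sketch: raw agreement and Cohen's kappa between one reviewer's
# calibration decisions and the reference standard (dummy data, not study data).
from sklearn.metrics import cohen_kappa_score

def calibration_agreement(reference, reviewer):
    """Return (raw proportion of agreement, Cohen's kappa) for one reviewer."""
    assert len(reference) == len(reviewer)
    raw = sum(r == v for r, v in zip(reference, reviewer)) / len(reference)
    return raw, cohen_kappa_score(reference, reviewer)

# 1 = include, 0 = exclude, one entry per abstract in the calibration sample
reference = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
reviewer  = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
raw, kappa = calibration_agreement(reference, reviewer)
print(f"raw agreement = {raw:.1%}, Cohen's kappa = {kappa:.3f}")
```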
2.2.1 Feature Analysis
The feature analysis aimed to systematically identify and compare the functionalities embedded within each tested tool, providing potential users with insights to select tools that best align with their specific needs and preferences. The feature set used in this study was adapted from previously published frameworks[14, 18] and was refined through internal discussion among the research team. These included, for example, the ability to import and allocate references, remove duplicates, export screening results, and support distinct title/abstract and full-text screening phases. Additional features such as conflict resolution tools, keyword highlighting, project progress tracking, and the ability to attach comments or PDFs to references were also evaluated. Active user support availability was also considered as a potential facilitator of tool adoption. The final set of features assessed is provided in Table S3.
Two types of features were included in the analysis. Binary features captured whether a given functionality was present or absent, while compound features assessed the degree or quality of support, typically structured around multiple pre-defined levels of implementation. For each tool, it was first determined whether a particular feature was available. If so, the feature's implementation was then compared to conformance criteria as defined in Table S2.
Feature analysis was carried out by three researchers (CT, SJ, and NOK), with each tool independently rated by two of the three researchers as described in Section 2.2. All assessments were subsequently reviewed and verified by a single team member (SA) to ensure consistency. The user survey and performance evaluation were conducted by four researchers (CT, NOK, SA, and SJ), with each tool evaluated by two raters. Where applicable, intra-observer and inter-rater agreement were considered during the evaluation process, following the approach used in a previously published study [9].
2.2.2 Performance Metric
Our primary performance metric was recall, defined as the proportion of relevant studies identified within a given portion of the screening queue. We selected recall as the key outcome because, in evidence synthesis, the ability to adequately identify relevant studies early in the screening process is essential to reducing workload while maintaining review quality. This was particularly critical in the context of our umbrella review, where missing relevant systematic reviews could compromise the comprehensiveness and validity of the synthesis.
Initially, we adopted an exploratory approach in which screening continued until reviewers judged that the remaining unscreened abstracts were likely to be irrelevant. However, this strategy proved inconsistent across tools. In some cases, particularly with Abstrackr, no clear point of convergence was reached, and screening needed to continue through nearly the full dataset to ensure all relevant studies were captured.
To overcome this variability and allow for standardised comparison, we implemented a fixed evaluation threshold: we screened the first 25% of records prioritised by each tool and calculated the recall achieved within that subset. This approach reflects a commonly used heuristic in the AI-aided screening literature, where early recall performance is used as a proxy for workload reduction potential[32]. While this threshold is not intended as a stopping rule, it provides a pragmatic benchmark to assess whether a tool surfaces a substantial proportion of relevant studies early in the screening process.
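As a sketch of how this benchmark can be computed, the function below takes a tool's prioritised queue of record identifiers and the identifiers judged relevant in the reference review, and returns the recall achieved within the first 25% of the queue (the names and the commented example are illustrative assumptions, not the study's code).

```python
# Illustrative sketch: recall within the first 25% of a tool's prioritised queue
# (identifiers and the example call are hypothetical).
def recall_at_fraction(prioritised_ids, relevant_ids, fraction=0.25):
    """Proportion of relevant records surfaced in the first `fraction` of the queue."""
    cutoff = int(len(prioritised_ids) * fraction)
    screened = set(prioritised_ids[:cutoff])
    hits = sum(1 for rid in relevant_ids if rid in screened)
    return hits / len(relevant_ids)

# Example: a queue of 2,212 record IDs ordered by the tool, and the IDs of the
# records included in the reference umbrella review.
# early_recall = recall_at_fraction(queue, included_ids)   # e.g. 0.42 -> 42%
```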
2.2.3 User Survey
To evaluate the perceived usability and functionality of the AI-aided abstract screening tools, we conducted a post-use user experience survey. The objective of the survey was to capture subjective impressions of tool usability, ease of learning, and feature adequacy as perceived by the reviewers involved in tool testing.
The survey explored several key dimensions. These included the ease with which reviewers were able to learn how to use each tool, their perceptions of user-friendliness and interface intuitiveness, and the amount of time required to gain proficiency in the tool. In addition, reviewers were asked whether they identified any missing or insufficient features and were invited to suggest potential improvements.
The full list of survey questions is provided in File S1. The instrument was adapted from the previously published framework [18], ensuring methodological rigor and comparability with prior studies of AI-aided evidence synthesis tools. The questionnaire included both closed- and open-ended questions to allow for structured feedback as well as exploratory input.
All reviewers who participated in the tool performance evaluation completed the survey immediately following their use of each assigned tool. Survey responses were analysed descriptively to identify common usability patterns and inform future tool development recommendations.
3. Results
3.1 Identification of AI-aided Abstract Screening Tools
In total, we identified 43 tools (see File S3), eight of which were deemed suitable for analysis: Abstrackr, ASReview, CADIMA, Colandr, LitReview, PICO Portal [29], Rayyan, and RobotAnalyst. An overview of all identified tools is provided in Supplementary Table S3. Of these eight tools, six (Abstrackr, ASReview, Colandr, PICO Portal, Rayyan, and RobotAnalyst) were included in the evaluation, as they offered AI-based prioritization and/or active learning functionalities at the time of assessment. CADIMA and LitReview were excluded because they did not support AI-assisted title and abstract screening during the study period.
3.2 Feature Analysis
The feature analysis revealed that all six AI-aided abstract screening tools included in this study offered the core functionalities necessary for basic screening workflows. Results of the feature analysis are shown in Table 1. Common features across tools included reference importing, inclusion and exclusion marking, and the ability to export results, functions that are foundational to any systematic review process. All tools supported formal export in standard formats and allowed some form of structured screening. Some tools also offered visual progress tracking (n = 4) and keyword highlighting (n = 4), aiding usability and navigation during screening.
Despite this shared baseline, tools varied in the level of support for collaboration, reviewer management, and workflow customisation. However, as our evaluation focused on independent, single-user screening, these features were assessed based on their documentation and availability in the interface, rather than hands-on multi-user testing. In tools where such features were present, they typically appeared in pre-defined forms (e.g., fixed roles or basic multi-user settings), but advanced customisation options were not commonly observed. This interpretation should be treated with caution, as we did not assess collaborative functionality through live, team-based screening scenarios.
One notable distinction across tools was in the handling of duplicate references. PICO Portal [33] was the only tool in our evaluation that fully supported automated duplicate removal, offering a streamlined process for detecting and removing duplicates without manual intervention. In contrast, while other tools provided partial support, such as identifying or flagging potential duplicates, they typically did not offer complete or automatic removal and required users to manage duplicates manually. Similarly, project auditing (i.e., tracking reviewer actions and decisions over time) and PRISMA diagram generation were unsupported in all the tools except for PICO Portal, despite their importance for transparency and reporting in systematic reviews.
Support for multilingual interfaces and non-Latin characters appeared limited across most tools based on available documentation and interface settings. While some platforms indicated partial support for other languages or character sets, our evaluation was restricted to English-language articles and interfaces, and we did not directly test multilingual functionality. As such, findings in this area should be interpreted with caution.
Some tools demonstrated unique strengths. For instance, PICO Portal, Colandr and Abstrackr offered highly intuitive, visually user-friendly interfaces that flattened the learning curve for new users. These tools also integrated direct customer support and detailed help documentation, which may facilitate adoption among less experienced reviewers. In contrast, others prioritised streamlined workflows for single reviewers or incorporated customisable tagging systems, which can support internal classification schemes during screening.
Table 1
| Feature | PICO Portal | RobotAnalyst | ASReview | Rayyan | Colandr | Abstrackr |
| Customer support | 3 | 2 | 3 | 2 | 3 | 3 |
| Multiple user support | 3 | 1 | 1 | 3 | 2 | 3 |
| Article tagging | 2 | 3 | 2 | 2 | 3 | 3 |
| Reference importing | 3 | 1 | 3 | 3 | 3 | 3 |
| Reference allocation | 1 | 1 | 1 | 1 | 1 | 1 |
| Removing duplicates | 3 | 2 | 2 | 2 | 1 | 1 |
| In-/excluding references | 3 | 1 | 3 | 3 | 3 | 2 |
| Distinct TiAb/full-text phases | 2 | 1 | 1 | 2 | 2 | 1 |
| Discrepancy resolving | 2 | 1 | 1 | 1 | 2 | 1 |
| Exporting results | 3 | 3 | 3 | 3 | 3 | 3 |
| Order of references | 3 | 3 | 2 | 3 | 3 | 1 |
| Keyword highlighting | 3 | 1 | 1 | 3 | 3 | 3 |
| Multiple user roles | 2 | 1 | 1 | 1 | 2 | 1 |
| Project auditing | 1 | 2 | 2 | 1 | 2 | 2 |
| Show project progress | 3 | 2 | 2 | 3 | 2 | 2 |
| Attaching comments | 2 | 1 | 2 | 2 | 2 | 2 |
| Reference labelling | 2 | 2 | 1 | 2 | 2 | 2 |
| PRISMA diagram/flow diagram creation | 2 | 1 | 1 | 1 | 1 | 1 |
*In the heatmap presented in Table 1, support levels are represented by both numerical scores and cell shading. A score of 3, shown in green, indicates high support, meaning the feature is fully implemented and well-functioning. A score of 2, shaded orange, reflects moderate support, where the feature is partially implemented or has limited functionality. A score of 1, marked in yellow, indicates no support.
The total scores of support levels across features for each tool are presented in Fig. 1. According to these results, PICO Portal achieved the highest average score, reflecting consistent support across a wide range of features, including collaboration, usability, and data handling. Colandr, Rayyan and AbstrackR also performed well, particularly in user-focused aspects such as tagging, keyword highlighting, and progress tracking. In contrast, tools like RobotAnalyst and ASReview demonstrated more limited functionality overall, with lower average scores, especially in collaboration and discrepancy resolution. These findings suggest that while most tools support essential screening activities, only a few provide the comprehensive feature sets required for complex, team-based systematic review workflows.
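The per-tool totals behind Fig. 1 can be recomputed directly from the scores in Table 1; the sketch below does so with the values copied from the table (the dictionary layout and the printed summary are illustrative, not the study's analysis code).

```python
# Illustrative sketch: per-tool total and mean support scores, using the
# feature scores listed in Table 1 (18 features per tool, in table order).
feature_scores = {
    "PICO Portal":  [3, 3, 2, 3, 1, 3, 3, 2, 2, 3, 3, 3, 2, 1, 3, 2, 2, 2],
    "RobotAnalyst": [2, 1, 3, 1, 1, 2, 1, 1, 1, 3, 3, 1, 1, 2, 2, 1, 2, 1],
    "ASReview":     [3, 1, 2, 3, 1, 2, 3, 1, 1, 3, 2, 1, 1, 2, 2, 2, 1, 1],
    "Rayyan":       [2, 3, 2, 3, 1, 2, 3, 2, 1, 3, 3, 3, 1, 1, 3, 2, 2, 1],
    "Colandr":      [3, 2, 3, 3, 1, 1, 3, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 1],
    "Abstrackr":    [3, 3, 3, 3, 1, 1, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2, 1],
}

for tool, scores in sorted(feature_scores.items(), key=lambda kv: sum(kv[1]), reverse=True):
    print(f"{tool:<13} total = {sum(scores):2d}  mean = {sum(scores) / len(scores):.2f}")
```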
3.3 Performance
Our findings indicated that none of the tools retrieved more than 50% of relevant studies within the first 25% of screened records.
3.4 User Survey
Table 2
Reviewer scores for each tool on easiness to learn, user-friendliness, and time to learn the tool.
| Tool | Reviewer no. | Easiness to learn | User-friendliness | Time to learn |
| Abstrackr | R2 | 5 | 2.5 | 5 |
| Abstrackr | R3 | 7.5 | 2.5 | 5 |
| ASReview | R1 | 7.5 | 5 | 7.5 |
| ASReview | R2 | 7.5 | 7.5 | 7.5 |
| Colandr | R1 | 5 | 5 | 5 |
| Colandr | R3 | 7.5 | 5 | 5 |
| PICO Portal | R1 | 7.5 | 5 | 7.5 |
| PICO Portal | R4 | 7.5 | 7.5 | 10 |
| Rayyan | R1 | 7.5 | 7.5 | 10 |
| Rayyan | R4 | 7.5 | 7.5 | 7.5 |
| RobotAnalyst | R3 | 7.5 | 7.5 | 7.5 |
| RobotAnalyst | R4 | 7.5 | 7.5 | 10 |
Table S1
Abstract Screening Tools Evaluated Across Included Studies
| Tools | Blaizot et al. (2022) | Cowie et al. (2022) | Van der Mierden et al. (2019) | Harrison et al. (2020) | Kohl et al. (2018) | Reis et al. (2023) | Khalil et al. (2022) | Marshall et al. (2019) | Schmidt et al. (2023) |
| Abstrackr® | + | + |  | + |  | + | + | + |  |
| CADIMA |  | + | + |  | + |  |  |  |  |
| Colandr® |  | + |  | + | + | + |  | + |  |
| Covidence |  | + | + |  | + |  |  |  |  |
| DistillerSR® |  | + | + |  | + |  | + |  | + |
| EPPI reviewer | + | + | + |  | + |  | + |  | + |
| EROS |  |  |  |  | + |  |  |  |  |
| Giotto Compliance |  | + |  |  |  |  |  |  |  |
| HAWC |  |  |  |  | + |  |  |  |  |
| JBI SUMARI |  | + |  |  |  |  |  |  |  |
| K-means clustering algorithm | + |  |  |  |  |  |  |  |  |
| LibSVM classifier (RCT tagger) |  |  |  |  |  |  | + |  |  |
| LitStream |  | + |  |  |  |  |  |  |  |
| PARSIFAL |  |  |  |  | + |  |  |  |  |
| PICO Portal |  | + |  |  |  |  |  |  |  |
| Rayyan® | + |  |  | + | + | + | + | + |  |
| REviewER |  |  |  |  | + |  |  |  |  |
| Revman Web |  | + | + |  | + |  |  |  |  |
| Robot reviewer | + |  |  |  |  |  |  |  | + |
| RobotAnalyst |  | + |  |  |  |  | + | + |  |
| Semi-automated natural language processing | + |  |  |  |  |  |  |  |  |
| SESRA |  |  |  |  | + |  |  |  |  |
| SLR-Tool |  |  |  |  | + |  |  |  |  |
| SLuRp |  |  |  |  | + |  |  |  |  |
| SR Accelerator |  | + |  |  |  |  |  |  |  |
| SRA Helper |  |  |  |  | + |  | + |  |  |
| SRDB.PRO |  | + |  |  | + |  |  |  |  |
| SRDR+ |  | + |  |  | + |  |  |  |  |
| StArt |  | + |  |  | + |  |  |  |  |
| SWIFT-Active Screener | + |  | + |  |  |  | + |  | + |
| SWIFT-review | + |  | + |  | + |  |  | + |  |
| SyRF |  | + | + |  | + |  |  |  |  |
| SysRev |  | + | + |  |  |  |  |  |  |
| Wordstat | + |  |  |  |  |  |  |  |  |
| QDA Miner | + |  |  |  |  |  |  |  |  |
| Research Screener |  |  |  |  |  |  |  |  |  |
| Robot screener |  |  |  |  |  |  |  |  |  |
| LitSuggest |  |  |  |  |  |  |  |  |  |
| ExaCT |  |  |  |  |  |  |  |  |  |
| LiteRev |  |  |  |  |  |  |  |  |  |
| NetMetaXL |  |  |  |  |  |  |  |  |  |
| AsReview |  |  |  |  |  |  |  |  |  |
| Pitts Web application |  |  |  |  |  |  |  |  |  |
*“+” indicates that the tool was assessed in the given study.
Table S2
Compliance of Identified Abstract Screening Tools with Eligibility Criteria
| Feature | Compound feature | Description of levels of conformance | Raters response |
| Customer Support | C(3) | 1. No support: Help documentation is inadequate (does not help to solve many questions/problems) and the company does not reply in a reasonable amount of time or does not help solve the issue. 2. Documentation only: There is adequate documentation available but the company does not reply in a reasonable amount of time or does not help solve the issue. 3. Direct support: There is adequate documentation available and the company replies in a timely manner and actively supports the customer by answering questions and helping with issues. | |
| Multiple user support | C(3) | 1. No multiple user support: It is not possible for multiple users to work at the same time, on the same project, independently from each other, and blinded. 2. Two user support: Two users can work at the same time, on the same project, independently from each other, and blinded. 3. Multiple user support: An unlimited number of users can work at the same time, on the same project, independently from each other, and blinded. | |
| Article tagging | C(3) | 1. There is no option to assign tags to the abstracts. 2. There is an option to assign tags to the abstracts; however, there is no option to hide these tags from the other reviewers. 3. There is an option to assign tags to the abstracts, and it is possible to hide these tags from the other reviewers. | |
| Reference importing | C(3) | 1. No formal import: The tool does not formally support importing of references; references have to be entered manually (this includes copy-pasting). 2. Limited files supported or difficult process: The tool can only import using a limited number of file extensions (e.g., only CSV) and/or the process is difficult. 3. Fully supported: The tool has an easy process for importing references and supports multiple file extensions. | |
| Reference allocation | C(3) | 1. No formal allocation: There is no formal method for allocating references to reviewers. 2. Allocation possible: It is possible to allocate references to reviewers, but the tool does not support randomization of this step. 3. Allocation + re-allocation: The tool is able to re-allocate references to different reviewers (e.g., when a reviewer drops out). | |
| Removing duplicates | C(3) | 1. No formal duplicate removal. 2. Duplicate removal is possible, but the tool does not remove all the duplicates. 3. The tool removes all the duplicates. | |
| In-/excluding references | C(3) | 1. No system for in-/exclusion: The tool has no formal system for in- or excluding references. 2. In-/exclusion only: The tool supports in- and excluding references, but no reason for exclusion can be given. 3. In-/exclusion + reason for exclusion: The tool supports in- or excluding of references, and a reason for exclusion can be given. | |
| Distinct TiAb/full-text phases | C(3) | 1. No distinct phases: There is no clear distinction between the title/abstract phase and the full-text phase; there is only one phase. 2. TiAb & full-text phase: There is a clear distinction between the title/abstract phase and the full-text phase. 3. User-defined phases: The user can create as many distinct phases as they need. | |
| Discrepancy resolving | Binary | 1. No: There is no official process to resolve discrepancies. 2. Yes: Official support for discrepancy resolving. | |
| Exporting results | C(3) | 1. No export: No formal export is supported; exporting must be done manually. 2. Limited export: Support for formal export, but only in limited file extensions (e.g., only .txt or .xlsx). 3. Full export: It is possible to export the results in at least the CSV format, or multiple general file extensions are supported. | |
| Order of references | C(3) | 1. No: It is not possible to randomize the order of references for the reviewers. 2. Yes: It is possible to randomize the order of references for the reviewers. 3. It is possible to order references based on different criteria, e.g., author, date, and relevance. | |
| Keyword highlighting | C(3) | 1. No highlighting: No keyword highlighting possible or highlighting of only one word is possible. 2. 3rd party only: The tool does not support formal keyword highlighting, but it is possible to use (free) 3rd party software for highlighting (e.g., extensions for Google Chrome, add-ons for Firefox). 3. Highlighting possible: The tool natively supports the highlighting of more than one word. | |
| Multiple user roles | C(4) | 1. No different roles: There are no different roles for different users; everybody has the same role and rights in the project. 2. Reviewer + Manager roles: Two different roles with different rights for reviewers and for manager roles. 3. Any further role: The tool supports both reviewer and manager roles, but also any further roles (e.g., librarian role). 4. User-definable roles: The users can determine the number of roles and determine the rights for the roles. | |
| Project auditing | Binary | 1. No: The tool does not support auditing the project (a complete overview of all alterations by all users on the project). 2. Yes: The tool supports auditing the project. | |
| Non-Latin character support | Binary | 1. No: The tool does not support non-Latin characters (e.g., Cyrillic, Greek, Chinese, Arabic, etc.). 2. Yes: The tool supports non-Latin characters. | |
| Show project progress | C(3) | 1. No project progress: There is no way to determine the overall progress of the project (e.g., % completed). 2. Limited progress: The tool only shows rudimentary project progress (e.g., only the total % of references completed/still to do). 3. Detailed progress: The tool can display detailed progress (e.g., the progress per reviewer). | |
| Attaching comments | Binary | 1. No: It is not possible to attach comments to references. 2. Yes: It is possible to attach comments to references. | |
| Reference labelling | Binary | 1. No: It is not possible to attach a priori determined labels to references. 2. Yes: It is possible to attach a priori determined labels to references. | |
| PRISMA diagram/flow diagram creation | Binary | 1. No: The tool cannot automatically provide a flow diagram meeting the PRISMA criteria. 2. Yes: The tool can automatically provide a flow diagram meeting the PRISMA criteria. | |
| Supporting languages other than English | C(3) | 1. No: The tool only supports English. 2. Limited support: The tool supports a few other languages. 3. Full support: The tool supports a large number of different languages. | |
Table S3
Feature Analysis Template
| Tools | Cost | Accessibility/Current functionality | Link | Field/area of research | Additional notes |
| Abstrackr® | Free | Yes | http://abstrackr.cebm.brown.edu/account/login | Multidisciplinary/not domain specific | |
| CADIMA | Free | Yes | https://www.cadima.info/ | Multidisciplinary/not domain specific | |
| Colandr® | Free | Yes | https://www.colandrapp.com/signin | Multidisciplinary/not domain specific | |
| Covidence | Paid subscription | Yes | | | |
| DistillerSR® | Paid subscription | Yes | | | |
| EPPI reviewer | Paid subscription | Yes | | | |
| EROS | No longer in use | Yes | | | |
| Giotto Compliance | No longer in use | Yes | | | |
| HAWC | Free; requires two-step account creation | Yes | https://www.epa.gov/risk/health-assessment-workspace-collaborative-hawc | Medical science/healthcare | |
| JBI SUMARI | Paid subscription | Yes | | Medical science/healthcare and social sciences | |
| K-means clustering algorithm | NA | NA | | ML algorithm | Coding required |
| LibSVM classifier (RCT tagger) | Free | Yes | | Medical science/healthcare | Works with PubMed |
| LitStream | Paid subscription | Yes | | | |
| PARSIFAL | Free | Yes | | Software engineering focus | |
| PICO Portal | Paid; free trial available for one project | Yes | https://picoportal.org/plans-comparison/ | Multidisciplinary/not domain specific | |
| Rayyan® | Free for early career researchers for 3 reviews | Yes | https://www.rayyan.ai/ | Multidisciplinary/not domain specific | |
| REviewER | Not available | No | | | |
| Revman Web | Requires Cochrane account | Yes | | Healthcare; designed for Cochrane reviews | |
| Robot reviewer | Not available | No; demonstration website was not available | | Medical; RCT studies | |
| RobotAnalyst | Free, but requires an account on request | Yes | | Multidisciplinary/not domain specific | |
| Semi-automated natural language processing | NA | NA | | ML model, requires coding | |
| SESRA | Free | Yes | | Particularly for software engineering | |
| SLR-Tool | Free | Yes | | Particularly for software engineering | |
| SLuRp | Free | Yes | | Software engineering | |
| SR Accelerator | Free | Yes | | Software engineering | |
| SRA Helper | Free | Yes | | | |
| SRDB.PRO | Free | Yes | | Healthcare and medical sciences | |
| SRDR+ | Free | Yes | | Healthcare and medical sciences | |
| StArt | Free | Yes | | Software engineering | |
| SWIFT-Active Screener | Paid subscription | Yes | | | |
| SWIFT-review | Free | Yes | | Healthcare and medical science | Desktop application |
| SyRF | Free | Yes | | Preclinical, experimental studies | |
| SysRev | | Yes | | | Text mining software |
| Wordstat | | NA | | | Text mining software |
| QDA Miner | | NA | | | |
| Research Screener | Paid | | | | |
| Robot screener | Paid | | | | |
| LitSuggest | Free | | | | |
| ExaCT | Some features are free | | | Healthcare and medical science with clinical focus | Does not offer title/abstract screening feature |
| LiteRev | | | https://literev.unige.ch/accounts/login/ | | |
| NetMetaXL | Free | | | | Does not offer title/abstract screening feature |
| AsReview | Free | | https://asreview.nl/ | Multidisciplinary/not domain specific | Desktop application |
| Pitts Web application | Paid | | | | |
Table S4
Search string used in Ovid and Cochrane
| Database | Terms |
| Embase and MEDLINE (via Ovid) | Artificial Intelligence.ti,ab. |
| | AI.ti,ab. |
| | Machine Learning.ti,ab. |
| | Deep learning.ti,ab. |
| | Robotics.ti,ab. |
| | Neural Networks, Computer.ti,ab. |
| | Automation.ti,ab. |
| | Natural language processing.ti,ab. |
| | Artificial Intelligence.ti,ab. OR AI.ti,ab. OR Machine Learning.ti,ab. OR Deep learning.ti,ab. OR Robotics.ti,ab. OR Neural Networks, Computer.ti,ab. OR Automation.ti,ab. OR Natural language processing.ti,ab. |
| | Abstract screening.ti,ab. |
| | (abstract adj2 screening).ti,ab. |
| | Title screening.ti,ab. |
| | (title adj2 screening).ti,ab. |
| | Abstract screening.ti,ab. OR (abstract adj2 screening).ti,ab. OR Title screening.ti,ab. OR (title adj2 screening).ti,ab. |
| | Artificial Intelligence.ti,ab. OR AI.ti,ab. OR Machine Learning.ti,ab. OR Deep learning.ti,ab. OR Robotics.ti,ab. OR Neural Networks, Computer.ti,ab. OR Automation.ti,ab. OR Natural language processing.ti,ab. AND Abstract screening.ti,ab. OR (abstract adj2 screening).ti,ab. OR Title screening.ti,ab. OR (title adj2 screening).ti,ab. |
| Cochrane | Artificial Intelligence:ti,ab OR AI:ti,ab OR Machine Learning:ti,ab OR Deep learning:ti,ab OR Robotics:ti,ab OR Neural Networks, Computer:ti,ab OR Automation:ti,ab OR Natural language processing:ti,ab OR Natural language processing:ti,ab AND Abstract screening:ti,ab OR (abstract adj2 screening):ti,ab OR Title screening:ti,ab OR (title adj2 screening):ti,ab (with Cochrane Library publication date from Jan 2021 to Apr 2024, in Cochrane Reviews and Cochrane Protocols) |
Table S5
Search Strategy for the Umbrella Review
| Broad term | No. | Search term | PubMed (filtered for review, systematic review, meta-analysis)* | PsycINFO and Embase (AND review OR systematic review OR meta-analysis OR meta-analyses)** |
| Dementia, MCI, cog health | 1 | “Mild Cognitive impairment” OR “dementia” OR “Alzheimer*” OR “MCI” OR “cognitive impairment” OR “cognitive health” OR “cognitive decline” | 81,314 | 143,485 |
| Urban design | 2 | “Green space*” OR “blue space” OR UGBS | 429 | 602 |
| | 3 | walkability OR cyclability OR “Pedestrian infrastructure” OR “cycling infrastructure” | 236 | 347 |
| | 4 | “Road density” | 5 | 10 |
| | 5 | "Street light density" OR "streetlight density" OR street*light | 50 | 12 |
| | | 1 AND 2 | 14 | 28 |
| | | 1 AND 3 | 3 | 7 |
| | | 1 AND 4 | 0 | 0 |
| | | 1 AND 5 | 0 | 0 |
| Social environment | 6 | “Social capital” OR “social relationship” | 624 | 1,910 |
| | 7 | Crime | 16,861 | 13,609 |
| | 8 | “Material deprivation” OR “deprivation” OR “socio economic status” OR “socioeconomic status” OR “Socio*economic status” OR “Socio*economically disadvantage*” | 16,986 | 35,506 |
| | | 1 AND 6 | 8 | 25 |
| | | 1 AND 7 | 183 | 129 |
| | | 1 AND 8 | 536 | 1,264 |
| Environmental by-products | 9 | “Noise pollution” OR “water pollution” OR “air pollution” OR “soil pollution” OR “light pollution” OR pollutant | 49,934 | 32,377 |
| | 10 | “Heat stress” OR “ambient temperature” OR heatwaves | 14,589 | 17,968 |
| | | 1 AND 9 | 502 | 513 |
| | | 1 AND 10 | 57 | 131 |
| Transport behaviours | 11 | "Active travel" OR "active transport" OR walk* OR bicycling OR bike OR biking OR "ecological commut*" OR "ecological transport" OR non-auto* OR non-motori?e* OR "green travel" OR "green transport" | 320,017 | 37,253 |
| | 12 | car use OR car usage OR car dependency OR car ownership | 8,366 | 89 |
| | 13 | cycle lane OR bicycle lane OR bike lane OR cycle trail OR bicycle trail OR cycle path OR bicycle path OR bike path OR bike*way OR foot*path OR pavement OR sidewalk OR greenway OR walkability OR cyclability OR “Pedestrian infrastructure” OR “cycling infrastructure” | 1,333 | 642 |
| | 14 | “Public transport” OR “public transit” | 172 | 350 |
| | 15 | traffic | 9,226 | 22,042 |
| | | 1 AND 11 | 715 | 1,418 |
| | | 1 AND 12 | 82 | 0 |
| | | 1 AND 13 | 17 | 9 |
| | | 1 AND 14 | 5 | 3 |
| | | 1 AND 15 | 232 | 510 |
*PubMed reviews were gathered by adding the ‘review’, ‘systematic review’, and ‘meta-analysis’ filters.
**Embase and PsycINFO reviews were gathered by adding ‘AND (review OR systematic review OR meta-analysis OR meta-analyses)’ at the end of every search string.
Cog health: cognitive health; MCI: mild cognitive impairment; UGBS: urban green and blue space.
File S1: Survey Questions (Adapted from Reis et al. (2023))
• How would you rate the process of learning how to use the software?
a) Very easy (10 points)
b) Easy (7.5 points)
c) Not easy but also not difficult (5 points)
d) Difficult (2.5 points)
e) Very difficult (0 points)
• In your experience, would you describe the software as user friendly (i.e., is it possible to understand the software instantly and intuitively)? On a scale of 0 to 10, where 0 is not friendly at all and 10 is very friendly, how would you rate it?
a) Very friendly (10 points)
b) Above average (7.5 points)
c) Average (5 points)
d) Below average (2.5 points)
e) Not friendly at all (0 points)
• How long did it take you to learn how to use the software (in minutes)?
a) < 15 min (10 points)
b) 15 to 30 min (7.5 points)
c) 30 to 45 min (5 points)
d) 45 to 60 min (2.5 points)
e) > 60 min (0 points)
• Are there any features you found to be missing or lacking in the software?
• What feature or improvement would you most like to see in the tool?
Table 2 presents the individual reviewer ratings for easiness to learn, user-friendliness, and time to learn the tool across the evaluated tools, while Fig. 2 illustrates the average scores for each question per tool in a stacked bar graph format. Across all tools, higher scores reflected a more favourable user experience. Several tools stood out positively in these domains: Rayyan, RobotAnalyst, and PICO Portal received the highest average scores for user-friendliness and time efficiency, indicating that users found them intuitive and quick to learn. PICO Portal was also noted for its visually accessible interface, which contributed to an efficient onboarding process. ASReview was also rated positively for ease of learning and required minimal time to reach proficiency, suggesting a generally favourable learning experience.
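The reviewer scores in Table 2 and the averages in Fig. 2 follow directly from the point values assigned to each response option in File S1; the sketch below illustrates that mapping and the per-tool averaging (the data structure and the Rayyan example are illustrative assumptions, not the study's analysis code).

```python
# Illustrative sketch: map File S1 response options to points and average the
# two reviewers' ratings per tool and question (example data are hypothetical).
POINTS = {"a": 10, "b": 7.5, "c": 5, "d": 2.5, "e": 0}

# (tool, question) -> the option chosen by each of the two reviewers
responses = {
    ("Rayyan", "easiness to learn"): ["b", "b"],   # both answered "Easy"
    ("Rayyan", "user-friendliness"): ["b", "b"],
    ("Rayyan", "time to learn"):     ["a", "b"],   # "< 15 min" and "15 to 30 min"
}

for (tool, question), options in responses.items():
    scores = [POINTS[o] for o in options]
    print(f"{tool} - {question}: mean score = {sum(scores) / len(scores):.2f}")
```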
While tools varied in design and user interaction, those with clear visual layouts, guided workflows, and accessible support documentation tended to receive higher usability ratings. These findings underscore the value of intuitive interface design and streamlined access to enhance the user experience during AI-aided abstract screening.
In addition to survey feedback, practical observations during the evaluation process provided further context regarding tool access and setup. RobotAnalyst required a trial account to access the platform. A three-month license was granted to the authors for this evaluation, with no restriction on the number of abstracts uploaded. PICO Portal similarly offered a free trial with full feature access, allowing for uninterrupted use during the review period.
CADIMA was accessed through a time-limited training account, which expired after three months. Unlike the other tools (i.e., PICO Portal, Rayyan, ASReview, Colandr, and Abstrackr), CADIMA did not support RIS file uploads. Instead, it offered an internal search function, which during testing did not yield usable references. This limited the team’s ability to proceed with screening through the platform.
Colandr presented some access challenges for team members, specifically related to password resets and account recovery. These issues delayed initial use for some reviewers. However, the tool did provide user guides on its website, which helped address questions related to tool functions once access was obtained.
These observations reflect tool configurations and access experiences at the time of evaluation. While functionality and support may change over time, such practical considerations remain important for teams planning to adopt AI-assisted tools in time-sensitive review contexts.
Reviewers were asked to identify any features they found to be missing or insufficient in the tools, as well as to suggest improvements that could enhance the user experience. Several common themes emerged across responses, including requests for clearer feedback on AI behaviour, automated deduplication, improved filtering and tagging functions, and enhanced documentation or user support.
A frequently mentioned concern was the lack of clarity around AI functionality and prioritisation mechanisms. Reviewers using tools with prioritisation features, such as PICO Portal and Rayyan, noted uncertainty regarding how the software reordered records based on relevance during screening. Reviewers suggested that visual indicators such as confidence scores or AI progress feedback could help clarify when it may be reasonable to stop screening. Similarly, in PICO Portal, reviewers expressed a desire to better understand how the "priority screening algorithm" was functioning in real-time.
Another recurring theme from the user survey was the need for improved duplicate handling. Several tools either lacked automated deduplication entirely or offered only limited manual support. Reviewers highlighted that integrating automatic duplicate removal, particularly in large-scale reviews, would substantially reduce screening burden. This was noted specifically in relation to Abstrackr and RobotAnalyst.
Documentation and onboarding support were also identified as areas for improvement. One reviewer mentioned difficulty locating official documentation for RobotAnalyst, which led to a trial-and-error learning approach despite the tool being otherwise well-received. Similarly, Colandr users experienced issues with logging in and noted challenges in using filtering features, including occasional script errors and difficulty locating tags for studies, even after setup.
Reviewers also recommended enhancing filtering, tagging, and study categorisation options, particularly for Colandr and ASReview. While ASReview allowed note attachment, users suggested the addition of highlighting functionality, as well as clarity on whether multi-user collaboration was supported. In some cases, visual or interface limitations (e.g., image rendering errors in Abstrackr) were also noted as affecting usability.
Finally, reviewers emphasised the value of progress tracking features. Tools that lacked a visible indication of screening completion (e.g., percentage completed) were perceived as less transparent. Reviewers recommended that tools include more robust project progress displays to support workflow management and planning.
4. Discussion
4.1 Summary of Findings
Our study evaluated the performance, features, and user experience of six freely available AI-aided abstract screening tools using an umbrella review as a case study. While performance differences between tools were not substantial, none were able to retrieve more than 50% of the included studies after screening 25% of abstracts, indicating limitations in early recall. Nevertheless, most tools adequately supported core screening tasks, and several demonstrated additional strengths in usability, tagging, and exporting functions. A likely reason for this performance is the added complexity of our case study: the tools had to do more than judge topical relevance, as they also needed to distinguish systematic reviews from primary research. This two-tiered classification task is inherently more challenging than typical screening, which usually focuses solely on content relevance. Furthermore, except for PICO Portal, none of the freely available tools automatically removed duplicates. This limitation can substantially increase screening burden and reduce efficiency in large-scale reviews.
Despite these challenges, several tools were still able to surface a subset of pertinent studies early in the process, indicating they could help lighten reviewer workload. Although complete manual screening remains necessary in most cases, these platforms can serve as valuable assistants by streamlining the initial screening stages.
PICO Portal consistently emerged as the highest-performing tool across multiple dimensions, offering strong support for collaboration, usability, and progress tracking. Colandr and Rayyan also performed well, especially in user-facing features like article tagging, keyword highlighting, and ease of navigation, though they lacked advanced collaborative functions such as project auditing and user-role customisation.
Our survey results showed generally positive user experiences but also highlighted gaps in functionality and accessibility. Importantly, many free tools restrict essential features such as duplicate removal or reference allocation to paid plans, or require time-intensive account setup, limiting their appeal for time-constrained or resource-limited researchers.
4.2 Comparison with Existing Evidence
Previous evaluations of AI-aided screening tools have consistently reported substantial workload reductions, which aligns with our findings [9, 14, 15, 18]. However, to our knowledge, no prior study has evaluated these tools specifically in the context of umbrella reviews or systematically assessed recall rates at early screening stages, making direct comparisons with our results challenging. Similarly, while some studies analysed tools that require paid subscriptions [14], these are not directly comparable to our study, as we focused only on tools that are freely available or offer at least a free trial.
Among the available tools, Rayyan has been highlighted in the literature as a favourable option. For example, a recent study [18] reported that Rayyan correctly identified nearly 80% of relevant records, recommending it as the most effective software in their evaluation. In our study, Rayyan demonstrated strengths in user experience and feature availability, but our performance analysis did not reproduce the high recall rates reported previously, underscoring the potential influence of review context and dataset characteristics.
PICO Portal has received limited formal assessment, likely because it was only introduced in 2020. One study [9] included PICO Portal in a broad feature analysis that focused on functionalities across all stages of the systematic review process rather than on screening performance specifically. The study evaluated the approach feature of screening, which referred to the type of AI methodology used to prioritise records during title/abstract screening, but it did not assess other aspects such as usability, efficiency, or recall performance. An earlier evaluation [34] highlighted PICO Portal’s PICO-based keyword highlighting and AI-driven prioritisation as key strengths, while noting a steep learning curve and usability challenges for new users. In contrast, our team found PICO Portal intuitive and user-friendly, suggesting potential improvements in the interface since its initial evaluation.
On the other hand, while Abstrackr has been positively reviewed in prior studies [18, 35] for its screening efficiency, our team, comprising researchers with over five years of experience and involvement in multiple reviews, experienced difficulties using it effectively for title/abstract screening. Such discrepancies between previous evaluations and our experience are unsurprising given the rapidly evolving nature of AI-aided review tools and differences in study designs, datasets, and reviewer experience.
4.3 Strengths, Limitations and Implications
A key strength of this study is its use of a real-world umbrella review on an interdisciplinary topic that comprised more than 2,000 abstracts. Using this previous study as a case study allowed us to evaluate tool performance across a broad and complex abstract set. Another strength is the multi-dimensional evaluation framework, which included not only performance and feature analysis but also direct user experience through structured survey data.
However, several limitations must be noted. First, our findings are based on a single case study, which limits generalisability. Future research should examine these tools across different domains to determine whether results are consistent. Additionally, while we documented the presence of features such as multi-user support and non-Latin character handling, we did not test their real-world performance and thus cannot comment on their practical utility.
4.4 Implications and Future Directions
Our findings suggest that AI-aided screening tools are promising but remain underused in evidence synthesis. A significant barrier is the lack of awareness, training, and user support, particularly for researchers without computational backgrounds. Institutions can help close this gap by offering regular training workshops and integrating AI tools into research methodology curricula to increase both exposure and user confidence.
Developers also play a critical role in facilitating adoption. Usability and onboarding support should be prioritised to lower the entry barrier for new users and flatten the learning curve. Our experience with the PICO Portal team illustrates the positive impact of proactive engagement: the developers provided onboarding materials, live demonstrations, and responsive assistance, all of which helped address the steep learning curve and can encourage broader use. Such direct interaction between developers and end-users can significantly improve adoption and satisfaction, particularly for researchers unfamiliar with AI technologies.
From a development standpoint, there is clear room for improvement. Key areas include enhancing duplicate detection, implementing reliable study design recognition (e.g., distinguishing reviews from primary research), and offering transparent performance tracking. Additionally, improving multi-language support and non-Latin character recognition would expand the utility of these tools in global and multilingual research contexts.
Authors of research studies themselves can also facilitate automation by clearly reporting study design and methodology in abstracts. This practice would help both human reviewers and AI tools accurately classify studies during the screening phase, which is particularly important in umbrella reviews where distinguishing between primary studies and reviews is essential but often requires full-text access.
Finally, while generative AI and no-code tools offer new opportunities for automation, free platforms still fall short of offering fully comprehensive solutions for complex, multidisciplinary reviews. Tools such as PICO Portal are moving in the right direction by incorporating progress tracking and detailed performance reports, but broader functionality without reliance on paywalls is essential to ensure equity, scalability, and wider adoption of AI-aided screening in systematic review practice.
5. Conclusions
This study evaluated six freely available AI-aided abstract screening tools using an umbrella review as a test case. Our findings suggest that these tools are useful and increasingly accessible for supporting systematic reviews, even for researchers without programming experience. While most tools performed adequately in core screening tasks, limitations remain, particularly in early recall, duplicate removal, study design recognition, and multi-user functionality.
Despite these challenges, AI-aided screening tools hold strong potential to enhance the efficiency and consistency of evidence synthesis. For broader adoption, however, improvements in usability, feature completeness, and institutional training support are needed. As the field evolves, developers, researchers, and institutions all have a role to play in advancing and adapting these tools to better meet the practical demands of diverse review contexts.