1. Introduction
Systematic reviews (SRs) are pivotal for informing evidence-based practice, consensus building, and the identification of research gaps. Timely evidence synthesis is essential for supporting decision-making, yet the traditional review process can delay the delivery of actionable insights [1–3]. The rapid growth of scientific literature, partly driven by advances in artificial intelligence (AI) and digital technologies [4, 5], has further increased the volume of published articles, intensifying the workload for systematic reviewers. Accelerating the screening process without compromising the accuracy of manual review could therefore enhance the efficiency of evidence generation and deliver timely results. To address this, a growing number of AI-aided software tools have been developed to improve the speed and consistency of screening [6–8]. These tools vary in accessibility and functionality, ranging from freely available applications that provide core screening features to subscription-based systems offering more advanced capabilities such as automated duplicate removal, predictive prioritization, and workflow management. Some tools incorporate machine learning algorithms that continuously learn from reviewer decisions to prioritize unscreened records according to their likelihood of inclusion, while others rely primarily on manual review with optional AI support.
Some AI-aided tools offer support across multiple stages of the review process (e.g., Rayyan, PICO Portal), while others focus on specific tasks such as abstract screening (e.g., Abstrackr), which is among the most well-supported, widely implemented, and technically advanced features [9, 10]. However, the initial adoption of AI-aided tools in SRs was relatively slow, as evidenced by a 2018 study that surveyed systematic reviewers in the biomedical sciences [11]. Several factors contributed to this, including licensing costs, steep learning curves, and limited support for some freely available tools. Additionally, a study published in 2018 reported that a lack of awareness contributed to low uptake of AI-aided screening tools [12]. However, a more recent survey (published in 2021) of reviewers involved in evidence synthesis, health technology assessment, and guideline development, all of whom had conducted intervention, scoping, accuracy, or prognostic reviews, found that 79% reported using automation tools during the screening phase [13].
A growing body of literature, including several reviews, has explored the landscape of AI-aided screening tools, revealing key trends in focus, functionality, and limitations [9, 10, 14–18], with a common focus on feature identification and comparison. Some reviews [9, 11, 15] have systematically catalogued the technical capabilities of existing screening platforms, particularly in relation to dual screening, reference management, and machine learning integration. These studies often relied on structured frameworks (e.g., DESMET) [19] or weighted scoring systems to evaluate tool functionality [14, 18].
Other features of freely available AI-aided screening tools, such as usability and accessibility, have been evaluated, particularly for non-technical users [10, 15]. Marshall et al. (2019) [10] conducted a descriptive study introducing several tools for screening tasks, including Abstrackr [8], EPPI-Reviewer [20], RobotAnalyst [21], SWIFT-Review [22], Colandr [23], and Rayyan [24], with a focus on their applicability for researchers without programming expertise. Harrison et al. (2020) [15] emphasized user-friendliness and interface design, identifying Covidence and Rayyan as the most popular tools due to their ease of use. Similarly, two other studies [17, 25] conducted scoping reviews to map the overall landscape of automation in screening. Blaizot et al.'s review included 12 systematic reviews in the health sciences that employed some form of AI method; among these, eight reviews used AI during the title and abstract screening stages and included evaluations conducted by the authors of each review. The findings revealed time savings and workload reductions associated with the use of AI. The 47 studies included in the Khalil et al. (2022) scoping review examined tools for the semi-automation of various stages of systematic reviews, encompassing both freely available and commercially licensed software (e.g., DistillerSR [26]); the authors noted that the screening stage has reached a relatively high level of maturity, although many tools remain insufficiently validated, are often domain-specific, and may present usability or accessibility challenges. Other studies have aimed to evaluate the performance and real-world application of AI-aided screening tools, though such work remains relatively limited. Reis et al. (2023), for example, assessed three screening tools (Rayyan, Abstrackr, and Colandr) based on speed, accuracy, and perceived usefulness, incorporating both objective outcomes and user perspectives. Their findings suggested that Rayyan offered the best user experience and achieved the highest true-positive rate, meaning it correctly identified the largest number of relevant studies [18]. All three tools demonstrated comparable performance in identifying true negatives, as well as in measures of proportion missed, workload reduction, and time savings [18]. Similarly, Kohl et al. (2018) [27] compared several AI-aided tools used at different stages of the review process, including screening, highlighting their respective features and usability. Despite evaluating only a small number of tools, they identified CADIMA [27] as the most favourable. Furthermore, some studies evaluated AI-aided tool performance only within a single domain, such as health care [9, 17].
Taken together, these studies demonstrate a growing understanding and application of AI-aided screening tools, particularly in terms of available features and general usability. However, several key gaps remain. First, there is a need for studies that systematically re-evaluate AI-aided screening tools with respect to their updated feature sets and performance improvements. Second, existing evaluations have not assessed these tools in the context of umbrella reviews (studies that synthesize evidence from multiple systematic reviews rather than primary studies), which are becoming increasingly common alongside AI development [28]. Umbrella reviews present additional challenges, as automation tools must not only assess topical relevance and adhere to pre-defined eligibility criteria, but also accurately identify study designs by distinguishing between systematic reviews and primary research. Third, there is a need to evaluate these tools in multidomain or interdisciplinary contexts, as performance may vary when applied to reviews spanning diverse fields. In addition, there is a broader methodological gap regarding the transparent and standardized comparison of available screening tools. This has been underscored in a recent study [16] that highlighted the need for larger evaluation datasets from completed reviews to enable more robust comparisons of tool performance and generalizability across review types and disciplines.
To address these gaps, we aimed to systematically catalogue existing tools that automate abstract screening and to conduct a comparative study of web-based, freely available AI-aided abstract screening tools by evaluating their feature sets, screening performance, and user experience. We focused on tools that are freely accessible or offer at least a trial version, to ensure that all evaluated methods could be readily accessed, tested, and reproduced by other researchers without subscription barriers. Using a previously completed umbrella review on the impacts of urban environments on cognitive health as a case study, we performed three complementary analyses in this Study Within a Review (SWAR) [29]: (1) a feature analysis based on established frameworks; (2) a performance assessment of each tool's ability to retrieve relevant studies within an initial screening subset; and (3) a user survey capturing subjective perceptions of learnability, interface intuitiveness, and time to proficiency. By situating this evaluation within an umbrella review context, our study provides novel insights into how current freely available AI-aided tools perform under more demanding methodological requirements and highlights avenues for future development in AI-aided abstract screening.
2. Methods
This SWAR was registered in the SWAR repository as SWAR 25[30].
2.1 Identification of AI-Aided Abstract Screening Tools
We identified freely available AI-aided abstract screening tools through a comprehensive search strategy utilising multiple sources, including systematic electronic database searches to capture relevant published studies. University library webpages specifically detailing systematic review methodologies were also consulted, as these resources often curate and recommend evidence synthesis tools. The Systematic Review Toolbox [31], an online platform providing a comprehensive repository of software tools designed to support various stages of systematic reviews and evidence synthesis, was temporarily unavailable during our search period due to ongoing maintenance. However, we contacted the Systematic Review Toolbox team directly and received written guidance regarding currently available and emerging tools suitable for abstract screening. Additionally, we conducted extensive web searches and participated in webinars, including those hosted by academic institutions such as King’s College London (e.g., https://libguides.kcl.ac.uk/systematicreview/ai), to understand the latest developments in AI-aided screening tools. Finally, we employed Research Rabbit (https://www.researchrabbit.ai/), an AI-powered research discovery platform, to identify additional relevant literature and tools, and to sense-check whether any pertinent studies had been overlooked in the above process.
To map the evidence on the evaluation of AI-aided abstract screening tools, we constructed a matrix with identified tools listed in the rows and individual studies in the columns (see Table S1). This approach allowed us to visualize the extent and distribution of tool evaluations, identify which tools had been widely studied versus those with limited or no evaluation, and to detect potential redundancies or gaps in the literature.
An evaluation framework was developed to assess the identified AI-aided abstract screening tools (see Table S2) and was composed of three sections: (1) performance metrics, (2) feature analysis (see Table S3), and (3) user survey (see file S1). This framework was based on previous studies that evaluated AI-aided screening tools at various stages of systematic reviews [14, 18]. Four team members (SA, SMJ, CT, NOK) were assigned to test the tools, with each tool evaluated by two randomly selected raters. All raters received training based on our recently completed umbrella review synthesising evidence on the association between urban environment features and cognitive health. Subsequently, each rater independently evaluated the tools.
2.1.1 Database Search
Searches were conducted in Embase, MEDLINE, and the Cochrane Library to identify studies using AI-aided abstract screening tools across various evidence synthesis methodologies. The search covered publications from January 2021 to April 2024. This date range was chosen to build upon the comprehensive systematic review previously conducted by Blaizot et al. [17], which covered AI-aided abstract screening tools up to 2021, allowing our search to focus on capturing recent developments and publications from 2021 onward. The initial search was conducted in November 2024 to ensure inclusion of the most recent studies.
The search strategy was developed collaboratively by two members of the research team (SJ and SA), adapting search terms from Blaizot et al. [17] and Khalil et al. [25]. The strategy was piloted in MEDLINE to ensure an optimal balance between comprehensiveness and precision. The finalised search strings were subsequently adapted for the Cochrane Library. The full search strategy is provided in Table S4.
2.1.2 AI-aided Screening Tool Identification
Screening was conducted in two phases using Covidence. Titles and abstracts were independently screened by two team members (SA and SMJ), followed by full-text screening. Any disagreements were resolved through discussion.
We included systematic reviews, rapid reviews, umbrella reviews, evidence gap maps, evidence mapping, scoping reviews, and similar review types that employed AI-aided abstract screening tools with clearly reported AI methods. Methodology papers introducing novel AI models or screening software were also included. Only articles written in English were included. Priority was given to tools designed for interdisciplinary searches, although healthcare-specific tools were included. Review protocols, conference proceedings, and studies lacking full-text availability were excluded to ensure the inclusion of complete and accessible evidence.
Following the identification of eligible studies, additional screening criteria were applied to determine which tools to evaluate in detail. Specifically, tools were required to be currently functioning, not to require coding, to be free to use, and not to be limited to a specific subject domain.
2.2 Assessment of AI-aided Abstract Screening Tools
All activities related to testing the identified AI-aided tools in this study were guided by a previously conducted umbrella review; pilot testing of the inclusion/exclusion criteria among the research team was one of these preparatory tasks. Following the selection of AI-aided tools, we tested their usability and performance using a set of 2,212 abstracts identified in the umbrella review. The review applied the search strategy provided in Table S5 to PubMed, Embase, and PsycINFO, without restrictions on publication date, and examined the effects of urban environments on cognitive health. Before assigning AI tools for evaluation, we assessed the consistency of each reviewer’s screening judgments to confirm that all reviewers could reliably apply the same eligibility criteria. Establishing this baseline consistency ensured that any observed differences in screening outcomes during AI-assisted evaluations could be attributed to tool performance rather than variability in human judgment. To achieve this, we conducted a calibration exercise using a 10% random sample of abstracts from the umbrella review dataset (n = 250); the sample was prepared by researchers not involved in the screening and stratified to include a representative distribution of previously included and excluded studies. The original screening decisions, used as the reference standard, were made by a different team of experienced systematic reviewers following dual independent screening and resolution of conflicts through consensus.
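A minimal sketch of how such a stratified calibration sample can be drawn, assuming the dataset is held as a table with one row per abstract and a column recording the original include/exclude decision; the record counts, column names, and library call below are illustrative assumptions, not the study's actual code or data.

```python
# Illustrative sketch only: draw a ~10% calibration sample, stratified on the
# original include/exclude decision (counts and column names are hypothetical).
import pandas as pd
from sklearn.model_selection import train_test_split

records = pd.DataFrame({
    "record_id": range(2212),
    "original_decision": [1] * 112 + [0] * 2100,  # 1 = included, 0 = excluded
})

_, calibration_sample = train_test_split(
    records,
    test_size=0.10,                              # roughly 10% of the dataset
    stratify=records["original_decision"],       # keep the include/exclude ratio
    random_state=42,
)
print(len(calibration_sample), calibration_sample["original_decision"].mean())
```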
Each of the three current reviewers (SJ, CT, and NOK) independently screened the calibration sample while blinded to the original decisions. Their inclusion and exclusion judgments were compared to the reference standard by one of the authors of the present study (SA). Conflicts among screening decisions were discussed and resolved through consensus. Raw rates of agreement exceeded 95% across all reviewers, and Cohen’s kappa values ranged from 0.900 to 0.986, indicating a high degree of agreement according to the Landis and Koch interpretation scale [REF]. These results provided strong evidence that the current reviewers could consistently and accurately apply the previously defined eligibility criteria. Subsequently, AI-aided tools were allocated to reviewers, with each tool being tested by two reviewers. To mitigate potential bias, the distribution of tools was designed to ensure that reviewers had either no or only minimal prior experience with the tools.
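The agreement statistics reported above can be reproduced from paired decision lists; the sketch below shows one way to compute raw agreement and Cohen's kappa for a single reviewer against the reference standard (the function name and dummy data are illustrative, not the study's records).

```python
# Illustrative sketch: raw agreement and Cohen's kappa between one reviewer's
# calibration decisions and the reference standard (dummy data, not study data).
from sklearn.metrics import cohen_kappa_score

def calibration_agreement(reference, reviewer):
    """Return (raw proportion of agreement, Cohen's kappa) for one reviewer."""
    assert len(reference) == len(reviewer)
    raw = sum(r == v for r, v in zip(reference, reviewer)) / len(reference)
    return raw, cohen_kappa_score(reference, reviewer)

# 1 = include, 0 = exclude, one entry per abstract in the calibration sample
reference = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
reviewer  = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
raw, kappa = calibration_agreement(reference, reviewer)
print(f"raw agreement = {raw:.1%}, Cohen's kappa = {kappa:.3f}")
```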
2.2.1 Feature Analysis
The feature analysis aimed to systematically identify and compare the functionalities embedded within each tested tool, providing potential users with insights to select tools that best align with their specific needs and preferences. The feature set used in this study was adapted from previously published frameworks[14, 18] and was refined through internal discussion among the research team. These included, for example, the ability to import and allocate references, remove duplicates, export screening results, and support distinct title/abstract and full-text screening phases. Additional features such as conflict resolution tools, keyword highlighting, project progress tracking, and the ability to attach comments or PDFs to references were also evaluated. Active user support availability was also considered as a potential facilitator of tool adoption. The final set of features assessed is provided in Table S3.
Two types of features were included in the analysis. Binary features captured whether a given functionality was present or absent, while compound features assessed the degree or quality of support, typically structured around multiple pre-defined levels of implementation. For each tool, it was first determined whether a particular feature was available. If so, the feature's implementation was then compared to conformance criteria as defined in Table S2.
Feature analysis was carried out by three researchers (CT, SJ, and NOK), with each tool independently rated by two of the three researchers as described in Section 2.2. All assessments were subsequently reviewed and verified by a single team member (SA) to ensure consistency. The user survey and performance evaluation were conducted by four researchers (CT, NOK, SA, and SJ), with each tool evaluated by two raters. Where applicable, intra-observer and inter-rater agreement were considered during the evaluation process, following the approach used in a previously published study [9].
2.2.2 Performance Metric
Our primary performance metric was recall, defined as the proportion of relevant studies identified within a given portion of the screening queue. We selected recall as the key outcome because, in evidence synthesis, the ability to adequately identify relevant studies early in the screening process is essential to reducing workload while maintaining review quality. This was particularly critical in the context of our umbrella review, where missing relevant systematic reviews could compromise the comprehensiveness and validity of the synthesis.
Initially, we adopted an exploratory approach in which screening continued until reviewers judged that the remaining unscreened abstracts were likely to be irrelevant. However, this strategy proved inconsistent across tools. In some cases, particularly with Abstrackr, no clear point of convergence was reached, and screening needed to continue through nearly the full dataset to ensure all relevant studies were captured.
To overcome this variability and allow for standardised comparison, we implemented a fixed evaluation threshold: we screened the first 25% of records prioritised by each tool and calculated the recall achieved within that subset. This approach reflects a commonly used heuristic in the AI-aided screening literature, where early recall performance is used as a proxy for workload reduction potential[32]. While this threshold is not intended as a stopping rule, it provides a pragmatic benchmark to assess whether a tool surfaces a substantial proportion of relevant studies early in the screening process.
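As a sketch of how this benchmark can be computed, the function below takes a tool's prioritised queue of record identifiers and the identifiers judged relevant in the reference review, and returns the recall achieved within the first 25% of the queue (the names and the commented example are illustrative assumptions, not the study's code).

```python
# Illustrative sketch: recall within the first 25% of a tool's prioritised queue
# (identifiers and the example call are hypothetical).
def recall_at_fraction(prioritised_ids, relevant_ids, fraction=0.25):
    """Proportion of relevant records surfaced in the first `fraction` of the queue."""
    cutoff = int(len(prioritised_ids) * fraction)
    screened = set(prioritised_ids[:cutoff])
    hits = sum(1 for rid in relevant_ids if rid in screened)
    return hits / len(relevant_ids)

# Example: a queue of 2,212 record IDs ordered by the tool, and the IDs of the
# records included in the reference umbrella review.
# early_recall = recall_at_fraction(queue, included_ids)   # e.g. 0.42 -> 42%
```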
2.2.3 User Survey
To evaluate the perceived usability and functionality of the AI-aided abstract screening tools, we conducted a post-use user experience survey. The objective of the survey was to capture subjective impressions of tool usability, ease of learning, and feature adequacy as perceived by the reviewers involved in tool testing.
The survey explored several key dimensions. These included the ease with which reviewers were able to learn how to use each tool, their perceptions of user-friendliness and interface intuitiveness, and the amount of time required to gain proficiency in the tool. In addition, reviewers were asked whether they identified any missing or insufficient features and were invited to suggest potential improvements.
The full list of survey questions is provided in File S1. The instrument was adapted from the previously published framework [18], ensuring methodological rigor and comparability with prior studies of AI-aided evidence synthesis tools. The questionnaire included both closed- and open-ended questions to allow for structured feedback as well as exploratory input.
All reviewers who participated in the tool performance evaluation completed the survey immediately following their use of each assigned tool. Survey responses were analysed descriptively to identify common usability patterns and inform future tool development recommendations.
3. Results
3.1 Identification of AI-aided Abstract Screening Tools
In total, we identified 43 tools (see File S3), eight of which were deemed suitable for analysis: Abstrackr, ASReview, CADIMA, Colandr, LitReview, PICO Portal [29], Rayyan, and RobotAnalyst. An overview of all identified tools is provided in Supplementary Table S3. Of these eight tools, six (Abstrackr, ASReview, Colandr, PICO Portal, Rayyan, and RobotAnalyst) were included in the evaluation, as they offered AI-based prioritization and/or active learning functionalities at the time of assessment. CADIMA and LitReview were excluded because they did not support AI-assisted title and abstract screening during the study period.
3.2 Feature Analysis
The feature analysis revealed that all six AI-aided abstract screening tools included in this study offered the core functionalities necessary for basic screening workflows. Results of the feature analysis are shown in Table 1. Common features across tools included reference importing, inclusion and exclusion marking, and the ability to export results, functions that are foundational to any systematic review process. All tools supported formal export in standard formats and allowed some form of structured screening. Some tools also offered visual progress tracking (n = 4) and keyword highlighting (n = 4), aiding usability and navigation during screening.
Despite this shared baseline, tools varied in the level of support for collaboration, reviewer management, and workflow customisation. However, as our evaluation focused on independent, single-user screening, these features were assessed based on their documentation and availability in the interface, rather than hands-on multi-user testing. In tools where such features were present, they typically appeared in pre-defined forms (e.g., fixed roles or basic multi-user settings), but advanced customisation options were not commonly observed. This interpretation should be treated with caution, as we did not assess collaborative functionality through live, team-based screening scenarios.
One notable distinction across tools was in the handling of duplicate references. PICO Portal [33] was the only tool in our evaluation that fully supported automated duplicate removal, offering a streamlined process for detecting and removing duplicates without manual intervention. In contrast, while other tools provided partial support, such as identifying or flagging potential duplicates, they typically did not offer complete or automatic removal and required users to manage duplicates manually. Similarly, project auditing (i.e., tracking reviewer actions and decisions over time) and PRISMA diagram generation were unsupported in all the tools except for PICO Portal, despite their importance for transparency and reporting in systematic reviews.
Support for multilingual interfaces and non-Latin characters appeared limited across most tools based on available documentation and interface settings. While some platforms indicated partial support for other languages or character sets, our evaluation was restricted to English-language articles and interfaces, and we did not directly test multilingual functionality. As such, findings in this area should be interpreted with caution.
Some tools demonstrated unique strengths. For instance, PICO Portal, Colandr and Abstrackr offered highly intuitive, visually user-friendly interfaces that flattened the learning curve for new users. These tools also integrated direct customer support and detailed help documentation, which may facilitate adoption among less experienced reviewers. In contrast, others prioritised streamlined workflows for single reviewers or incorporated customisable tagging systems, which can support internal classification schemes during screening.
Table 1
| Feature | PICO Portal | RobotAnalyst | ASReview | Rayyan | Colandr | Abstrackr |
| Customer support | 3 | 2 | 3 | 2 | 3 | 3 |
| Multiple user support | 3 | 1 | 1 | 3 | 2 | 3 |
| Article tagging | 2 | 3 | 2 | 2 | 3 | 3 |
| Reference importing | 3 | 1 | 3 | 3 | 3 | 3 |
| Reference allocation | 1 | 1 | 1 | 1 | 1 | 1 |
| Removing duplicates | 3 | 2 | 2 | 2 | 1 | 1 |
| In-/excluding references | 3 | 1 | 3 | 3 | 3 | 2 |
| Distinct TiAb/full-text phases | 2 | 1 | 1 | 2 | 2 | 1 |
| Discrepancy resolving | 2 | 1 | 1 | 1 | 2 | 1 |
| Exporting results | 3 | 3 | 3 | 3 | 3 | 3 |
| Order of references | 3 | 3 | 2 | 3 | 3 | 1 |
| Keyword highlighting | 3 | 1 | 1 | 3 | 3 | 3 |
| Multiple user roles | 2 | 1 | 1 | 1 | 2 | 1 |
| Project auditing | 1 | 2 | 2 | 1 | 2 | 2 |
| Show project progress | 3 | 2 | 2 | 3 | 2 | 2 |
| Attaching comments | 2 | 1 | 2 | 2 | 2 | 2 |
| Reference labelling | 2 | 2 | 1 | 2 | 2 | 2 |
| PRISMA diagram/flow diagram creation | 2 | 1 | 1 | 1 | 1 | 1 |
*In the heatmap presented in Table 1, support levels are represented by both numerical scores and cell shading. A score of 3, shown in green, indicates high support, meaning the feature is fully implemented and well-functioning. A score of 2, shaded orange, reflects moderate support, where the feature is partially implemented or has limited functionality. A score of 1, marked in yellow, indicates no support.
The total scores of support levels across features for each tool are presented in Fig. 1. According to these results, PICO Portal achieved the highest average score, reflecting consistent support across a wide range of features, including collaboration, usability, and data handling. Colandr, Rayyan and AbstrackR also performed well, particularly in user-focused aspects such as tagging, keyword highlighting, and progress tracking. In contrast, tools like RobotAnalyst and ASReview demonstrated more limited functionality overall, with lower average scores, especially in collaboration and discrepancy resolution. These findings suggest that while most tools support essential screening activities, only a few provide the comprehensive feature sets required for complex, team-based systematic review workflows.
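The per-tool totals behind Fig. 1 can be recomputed directly from the scores in Table 1; the sketch below does so with the values copied from the table (the dictionary layout and the printed summary are illustrative, not the study's analysis code).

```python
# Illustrative sketch: per-tool total and mean support scores, using the
# feature scores listed in Table 1 (18 features per tool, in table order).
feature_scores = {
    "PICO Portal":  [3, 3, 2, 3, 1, 3, 3, 2, 2, 3, 3, 3, 2, 1, 3, 2, 2, 2],
    "RobotAnalyst": [2, 1, 3, 1, 1, 2, 1, 1, 1, 3, 3, 1, 1, 2, 2, 1, 2, 1],
    "ASReview":     [3, 1, 2, 3, 1, 2, 3, 1, 1, 3, 2, 1, 1, 2, 2, 2, 1, 1],
    "Rayyan":       [2, 3, 2, 3, 1, 2, 3, 2, 1, 3, 3, 3, 1, 1, 3, 2, 2, 1],
    "Colandr":      [3, 2, 3, 3, 1, 1, 3, 2, 2, 3, 3, 3, 2, 2, 2, 2, 2, 1],
    "Abstrackr":    [3, 3, 3, 3, 1, 1, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2, 1],
}

for tool, scores in sorted(feature_scores.items(), key=lambda kv: sum(kv[1]), reverse=True):
    print(f"{tool:<13} total = {sum(scores):2d}  mean = {sum(scores) / len(scores):.2f}")
```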
3.3 Performance
Our findings indicated that none of the tools retrieved more than 50% of relevant studies within the first 25% of screened records.
3.4 User Survey
Table 2
Reviewer scores for each tool on easiness to learn, user-friendliness, and time to learn the tool.
| Tool | Reviewer no. | Easiness to learn | User-friendliness | Time to learn |
| Abstrackr | R2 | 5 | 2.5 | 5 |
| Abstrackr | R3 | 7.5 | 2.5 | 5 |
| ASReview | R1 | 7.5 | 5 | 7.5 |
| ASReview | R2 | 7.5 | 7.5 | 7.5 |
| Colandr | R1 | 5 | 5 | 5 |
| Colandr | R3 | 7.5 | 5 | 5 |
| PICO Portal | R1 | 7.5 | 5 | 7.5 |
| PICO Portal | R4 | 7.5 | 7.5 | 10 |
| Rayyan | R1 | 7.5 | 7.5 | 10 |
| Rayyan | R4 | 7.5 | 7.5 | 7.5 |
| RobotAnalyst | R3 | 7.5 | 7.5 | 7.5 |
| RobotAnalyst | R4 | 7.5 | 7.5 | 10 |
Table S1
Abstract Screening Tools Evaluated Across Included Studies
| Tools | Blaizot et al. (2022) | Cowie et al. (2022) | Van der Mierden et al. (2019) | Harrison et al. (2020) | Kohl et al. (2018) | Reis et al. (2023) | Khalil et al. (2022) | Marshall et al. (2019) | Schmidt et al. (2023) |
| Abstrackr® | + | + |  | + |  | + | + | + |  |
| CADIMA |  | + | + |  | + |  |  |  |  |
| Colandr® |  | + |  | + | + | + |  | + |  |
| Covidence |  | + | + |  | + |  |  |  |  |
| DistillerSR® |  | + | + |  | + |  | + |  | + |
| EPPI reviewer | + | + | + |  | + |  | + |  | + |
| EROS |  |  |  |  | + |  |  |  |  |
| Giotto Compliance |  | + |  |  |  |  |  |  |  |
| HAWC |  |  |  |  | + |  |  |  |  |
| JBI SUMARI |  | + |  |  |  |  |  |  |  |
| K-means clustering algorithm | + |  |  |  |  |  |  |  |  |
| LibSVM classifier (RCT tagger) |  |  |  |  |  |  | + |  |  |
| LitStream |  | + |  |  |  |  |  |  |  |
| PARSIFAL |  |  |  |  | + |  |  |  |  |
| PICO Portal |  | + |  |  |  |  |  |  |  |
| Rayyan® | + |  |  | + | + | + | + | + |  |
| REviewER |  |  |  |  | + |  |  |  |  |
| Revman Web |  | + | + |  | + |  |  |  |  |
| Robot reviewer | + |  |  |  |  |  |  |  | + |
| RobotAnalyst |  | + |  |  |  |  | + | + |  |
| Semi-automated natural language processing | + |  |  |  |  |  |  |  |  |
| SESRA |  |  |  |  | + |  |  |  |  |
| SLR-Tool |  |  |  |  | + |  |  |  |  |
| SLuRp |  |  |  |  | + |  |  |  |  |
| SR Accelerator |  | + |  |  |  |  |  |  |  |
| SRA Helper |  |  |  |  | + |  | + |  |  |
| SRDB.PRO |  | + |  |  | + |  |  |  |  |
| SRDR+ |  | + |  |  | + |  |  |  |  |
| StArt |  | + |  |  | + |  |  |  |  |
| SWIFT-Active Screener | + |  | + |  |  |  | + |  | + |
| SWIFT-review | + |  | + |  | + |  |  | + |  |
| SyRF |  | + | + |  | + |  |  |  |  |
| SysRev |  | + | + |  |  |  |  |  |  |
| Wordstat | + |  |  |  |  |  |  |  |  |
| QDA Miner | + |  |  |  |  |  |  |  |  |
| Research Screener |  |  |  |  |  |  |  |  |  |
| Robot screener |  |  |  |  |  |  |  |  |  |
| LitSuggest |  |  |  |  |  |  |  |  |  |
| ExaCT |  |  |  |  |  |  |  |  |  |
| LiteRev |  |  |  |  |  |  |  |  |  |
| NetMetaXL |  |  |  |  |  |  |  |  |  |
| AsReview |  |  |  |  |  |  |  |  |  |
| Pitts Web application |  |  |  |  |  |  |  |  |  |
*“+” indicates that the tool was assessed in the given study.
Table S2
Compliance of Identified Abstract Screening Tools with Eligibility Criteria
| Feature | Compound feature | Description of levels of conformance | Raters response |
| Customer Support | C(3) | 1. No support: Help documentation is inadequate (does not help to solve many questions/problems) and the company does not reply in a reasonable amount of time or does not help solve the issue. 2. Documentation only: There is adequate documentation available but the company does not reply in a reasonable amount of time or does not help solve the issue. 3. Direct support: There is adequate documentation available and the company replies in a timely manner and actively supports the customer by answering questions and helping with issues. | |
| Multiple user support | C(3) | 1. No multiple user support: It is not possible for multiple users to work at the same time, on the same project, independently from each other, and blinded. 2. Two user support: Two users can work at the same time, on the same project, independently from each other, and blinded. 3. Multiple user support: An unlimited number of users can work at the same time, on the same project, independently from each other, and blinded. | |
| Article tagging | C(3) | 1. There is no option to assign tags to the abstracts. 2. There is an option to assign tags to the abstracts; however, there is no option to hide these tags from the other reviewers. 3. There is an option to assign tags to the abstracts, and it is possible to hide these tags from the other reviewers. | |
| Reference importing | C(3) | 1. No formal import: The tool does not formally support importing of references; references have to be entered manually (this includes copy-pasting). 2. Limited files supported or difficult process: The tool can only import using a limited number of file extensions (e.g., only CSV) and/or the process is difficult. 3. Fully supported: The tool has an easy process for importing references and supports multiple file extensions. | |
| Reference allocation | C(3) | 1. No formal allocation: There is no formal method for allocating references to reviewers. 2. Allocation possible: It is possible to allocate references to reviewers, but the tool does not support randomization of this step. 3. Allocation + re-allocation: The tool is able to re-allocate references to different reviewers (e.g., when a reviewer drops out). | |
| Removing duplicates | C(3) | 1. No formal duplicate removal. 2. Duplicate removal is possible, but the tool does not remove all the duplicates. 3. The tool removes all the duplicates. | |
| In-/excluding references | C(3) | 1. No system for in-/exclusion: The tool has no formal system for in- or excluding references. 2. In-/exclusion only: The tool supports in- and excluding references, but no reason for exclusion can be given. 3. In-/exclusion + reason for exclusion: The tool supports in- or excluding of references, and a reason for exclusion can be given. | |
| Distinct TiAb/full-text phases | C(3) | 1. No distinct phases: There is no clear distinction between the title/abstract phase and the full-text phase; there is only one phase. 2. TiAb & full-text phase: There is a clear distinction between the title/abstract phase and the full-text phase. 3. User-defined phases: The user can create as many distinct phases as they need. | |
| Discrepancy resolving | Binary | 1. No: There is no official process to resolve discrepancies. 2. Yes: Official support for discrepancy resolving. | |
| Exporting results | C(3) | 1. No export: No formal export is supported; exporting must be done manually. 2. Limited export: Support for formal export, but only in limited file extensions (e.g., only .txt or .xlsx). 3. Full export: It is possible to export the results in at least the CSV format, or multiple general file extensions are supported. | |
| Order of references | C(3) | 1. No: It is not possible to randomize the order of references for the reviewers. 2. Yes: It is possible to randomize the order of references for the reviewers. 3. It is possible to order references based on different criteria, e.g., author, date, and relevance. | |
| Keyword highlighting | C(3) | 1. No highlighting: No keyword highlighting possible or highlighting of only one word is possible. 2. 3rd party only: The tool does not support formal keyword highlighting, but it is possible to use (free) 3rd party software for highlighting (e.g., extensions for Google Chrome, add-ons for Firefox). 3. Highlighting possible: The tool natively supports the highlighting of more than one word. | |
| Multiple user roles | C(4) | 1. No different roles: There are no different roles for different users; everybody has the same role and rights in the project. 2. Reviewer + Manager roles: Two different roles with different rights for reviewers and for manager roles. 3. Any further role: The tool supports both reviewer and manager roles, but also any further roles (e.g., librarian role). 4. User-definable roles: The users can determine the number of roles and determine the rights for the roles. | |
| Project auditing | Binary | 1. No: The tool does not support auditing the project (a complete overview of all alterations by all users on the project). 2. Yes: The tool supports auditing the project. | |
| Non-Latin character support | Binary | 1. No: The tool does not support non-Latin characters (e.g., Cyrillic, Greek, Chinese, Arabic, etc.). 2. Yes: The tool supports non-Latin characters. | |
| Show project progress | C(3) | 1. No project progress: There is no way to determine the overall progress of the project (e.g., % completed). 2. Limited progress: The tool only shows rudimentary project progress (e.g., only the total % of references completed/still to do). 3. Detailed progress: The tool can display detailed progress (e.g., the progress per reviewer). | |
| Attaching comments | Binary | 1. No: It is not possible to attach comments to references. 2. Yes: It is possible to attach comments to references. | |
| Reference labelling | Binary | 1. No: It is not possible to attach a priori determined labels to references. 2. Yes: It is possible to attach a priori determined labels to references. | |
| PRISMA diagram/flow diagram creation | Binary | 1. No: The tool cannot automatically provide a flow diagram meeting the PRISMA criteria. 2. Yes: The tool can automatically provide a flow diagram meeting the PRISMA criteria. | |
| Supporting languages other than English | C(3) | 1. No: The tool only supports English. 2. Limited support: The tool supports a few other languages. 3. Full support: The tool supports a large number of different languages. | |
Table S3
Feature Analysis Template
| Tools | Cost | Accessibility/Current functionality | Link | Field/area of research | Additional notes |
| Abstrackr® | Free | Yes | http://abstrackr.cebm.brown.edu/account/login | Multidisciplinary/not domain specific | |
| CADIMA | Free | Yes | https://www.cadima.info/ | Multidisciplinary/not domain specific | |
| Colandr® | Free | Yes | https://www.colandrapp.com/signin | Multidisciplinary/not domain specific | |
| Covidence | Paid subscription | Yes | | | |
| DistillerSR® | Paid subscription | Yes | | | |
| EPPI reviewer | Paid subscription | Yes | | | |
| EROS | No longer in use | Yes | | | |
| Giotto Compliance | No longer in use | Yes | | | |
| HAWC | Free; requires two-step account creation | Yes | https://www.epa.gov/risk/health-assessment-workspace-collaborative-hawc | Medical science/healthcare | |
| JBI SUMARI | Paid subscription | Yes | | Medical science/healthcare and social sciences | |
| K-means clustering algorithm | NA | NA | | ML algorithm | Coding required |
| LibSVM classifier (RCT tagger) | Free | Yes | | Medical science/healthcare | Works with PubMed |
| LitStream | Paid subscription | Yes | | | |
| PARSIFAL | Free | Yes | | Software engineering focus | |
| PICO Portal | Paid; free trial available for one project | Yes | https://picoportal.org/plans-comparison/ | Multidisciplinary/not domain specific | |
| Rayyan® | Free for early career researchers for 3 reviews | Yes | https://www.rayyan.ai/ | Multidisciplinary/not domain specific | |
| REviewER | Not available | No | | | |
| Revman Web | Requires Cochrane account | Yes | | Healthcare; designed for Cochrane reviews | |
| Robot reviewer | Not available | No; demonstration website was not available | | Medical; RCT studies | |
| RobotAnalyst | Free, but requires an account on request | Yes | | Multidisciplinary/not domain specific | |
| Semi-automated natural language processing | NA | NA | | ML model, requires coding | |
| SESRA | Free | Yes | | Particularly for software engineering | |
| SLR-Tool | Free | Yes | | Particularly for software engineering | |
| SLuRp | Free | Yes | | Software engineering | |
| SR Accelerator | Free | Yes | | Software engineering | |
| SRA Helper | Free | Yes | | | |
| SRDB.PRO | Free | Yes | | Healthcare and medical sciences | |
| SRDR+ | Free | Yes | | Healthcare and medical sciences | |
| StArt | Free | Yes | | Software engineering | |
| SWIFT-Active Screener | Paid subscription | Yes | | | |
| SWIFT-review | Free | Yes | | Healthcare and medical science | Desktop application |
| SyRF | Free | Yes | | Preclinical, experimental studies | |
| SysRev | | Yes | | | Text mining software |
| Wordstat | | NA | | | Text mining software |
| QDA Miner | | NA | | | |
| Research Screener | Paid | | | | |
| Robot screener | Paid | | | | |
| LitSuggest | Free | | | | |
| ExaCT | Some features are free | | | Healthcare and medical science with clinical focus | Does not offer title/abstract screening feature |
| LiteRev | | | https://literev.unige.ch/accounts/login/ | | |
| NetMetaXL | Free | | | | Does not offer title/abstract screening feature |
| AsReview | Free | | https://asreview.nl/ | Multidisciplinary/not domain specific | Desktop application |
| Pitts Web application | Paid | | | | |
Table S4
Search string used in Ovid and Cochrane
| Database | Terms |
| Embase and MEDLINE (via Ovid) | Artificial Intelligence.ti,ab. |
| | AI.ti,ab. |
| | Machine Learning.ti,ab. |
| | Deep learning.ti,ab. |
| | Robotics.ti,ab. |
| | Neural Networks, Computer.ti,ab. |
| | Automation.ti,ab. |
| | Natural language processing.ti,ab. |
| | Artificial Intelligence.ti,ab. OR AI.ti,ab. OR Machine Learning.ti,ab. OR Deep learning.ti,ab. OR Robotics.ti,ab. OR Neural Networks, Computer.ti,ab. OR Automation.ti,ab. OR Natural language processing.ti,ab. |
| | Abstract screening.ti,ab. |
| | (abstract adj2 screening).ti,ab. |
| | Title screening.ti,ab. |
| | (title adj2 screening).ti,ab. |
| | Abstract screening.ti,ab. OR (abstract adj2 screening).ti,ab. OR Title screening.ti,ab. OR (title adj2 screening).ti,ab. |
| | Artificial Intelligence.ti,ab. OR AI.ti,ab. OR Machine Learning.ti,ab. OR Deep learning.ti,ab. OR Robotics.ti,ab. OR Neural Networks, Computer.ti,ab. OR Automation.ti,ab. OR Natural language processing.ti,ab. AND Abstract screening.ti,ab. OR (abstract adj2 screening).ti,ab. OR Title screening.ti,ab. OR (title adj2 screening).ti,ab. |
| Cochrane | Artificial Intelligence:ti,ab OR AI:ti,ab OR Machine Learning:ti,ab OR Deep learning:ti,ab OR Robotics:ti,ab OR Neural Networks, Computer:ti,ab OR Automation:ti,ab OR Natural language processing:ti,ab OR Natural language processing:ti,ab AND Abstract screening:ti,ab OR (abstract adj2 screening):ti,ab OR Title screening:ti,ab OR (title adj2 screening):ti,ab (with Cochrane Library publication date from Jan 2021 to Apr 2024, in Cochrane Reviews and Cochrane Protocols) |
Table S5
Search Strategy for the Umbrella Review
| Broad term | No. | Search term | PubMed (filtered for review, systematic review, meta-analysis)* | PsycINFO and Embase (AND review OR systematic review OR meta-analysis OR meta-analyses)** |
| Dementia, MCI, cog health | 1 | “Mild Cognitive impairment” OR “dementia” OR “Alzheimer*” OR “MCI” OR “cognitive impairment” OR “cognitive health” OR “cognitive decline” | 81,314 | 143,485 |
| Urban design | 2 | “Green space*” OR “blue space” OR UGBS | 429 | 602 |
| | 3 | walkability OR cyclability OR “Pedestrian infrastructure” OR “cycling infrastructure” | 236 | 347 |
| | 4 | “Road density” | 5 | 10 |
| | 5 | "Street light density" OR "streetlight density" OR street*light | 50 | 12 |
| | | 1 AND 2 | 14 | 28 |
| | | 1 AND 3 | 3 | 7 |
| | | 1 AND 4 | 0 | 0 |
| | | 1 AND 5 | 0 | 0 |
| Social environment | 6 | “Social capital” OR “social relationship” | 624 | 1,910 |
| | 7 | Crime | 16,861 | 13,609 |
| | 8 | “Material deprivation” OR “deprivation” OR “socio economic status” OR “socioeconomic status” OR “Socio*economic status” OR “Socio*economically disadvantage*” | 16,986 | 35,506 |
| | | 1 AND 6 | 8 | 25 |
| | | 1 AND 7 | 183 | 129 |
| | | 1 AND 8 | 536 | 1,264 |
| Environmental by-products | 9 | “Noise pollution” OR “water pollution” OR “air pollution” OR “soil pollution” OR “light pollution” OR pollutant | 49,934 | 32,377 |
| | 10 | “Heat stress” OR “ambient temperature” OR heatwaves | 14,589 | 17,968 |
| | | 1 AND 9 | 502 | 513 |
| | | 1 AND 10 | 57 | 131 |
| Transport behaviours | 11 | "Active travel" OR "active transport" OR walk* OR bicycling OR bike OR biking OR "ecological commut*" OR "ecological transport" OR non-auto* OR non-motori?e* OR "green travel" OR "green transport" | 320,017 | 37,253 |
| | 12 | car use OR car usage OR car dependency OR car ownership | 8,366 | 89 |
| | 13 | cycle lane OR bicycle lane OR bike lane OR cycle trail OR bicycle trail OR cycle path OR bicycle path OR bike path OR bike*way OR foot*path OR pavement OR sidewalk OR greenway OR walkability OR cyclability OR “Pedestrian infrastructure” OR “cycling infrastructure” | 1,333 | 642 |
| | 14 | “Public transport” OR “public transit” | 172 | 350 |
| | 15 | traffic | 9,226 | 22,042 |
| | | 1 AND 11 | 715 | 1,418 |
| | | 1 AND 12 | 82 | 0 |
| | | 1 AND 13 | 17 | 9 |
| | | 1 AND 14 | 5 | 3 |
| | | 1 AND 15 | 232 | 510 |
*PubMed reviews were gathered by adding the ‘review’, ‘systematic review’, and ‘meta-analysis’ filters.
**Embase and PsycINFO reviews were gathered by adding ‘AND (review OR systematic review OR meta-analysis OR meta-analyses)’ at the end of every search string.
Cog health: cognitive health; MCI: mild cognitive impairment; UGBS: urban green and blue space.
File S1: Survey Questions (Adapted from Reis et al. (2023))
• How would you rate the process of learning how to use the software?
a) Very easy (10 points)
b) Easy (7.5 points)
c) Not easy but also not difficult (5 points)
d) Difficult (2.5 points)
e) Very difficult (0 points)
• In your experience, would you describe the software as user friendly (i.e., is it possible to understand the software instantly and intuitively)? On a scale of 0 to 10, where 0 is not friendly at all and 10 is very friendly, how would you rate it?
a) Very friendly (10 points)
b) Above average (7.5 points)
c) Average (5 points)
d) Below average (2.5 points)
e) Not friendly at all (0 points)
• How long did it take you to learn how to use the software (in minutes)?
a) < 15 min (10 points)
b) 15 to 30 min (7.5 points)
c) 30 to 45 min (5 points)
d) 45 to 60 min (2.5 points)
e) > 60 min (0 points)
• Are there any features you found to be missing or lacking in the software?
• What feature or improvement would you most like to see in the tool?
Table 2 presents the individual reviewer ratings for easiness to learn, user-friendliness, and time to learn the tool across the evaluated tools, while Fig. 2 illustrates the average scores for each question per tool in a stacked bar graph format. Across all tools, higher scores reflected a more favourable user experience. Several tools stood out positively in these domains: Rayyan, RobotAnalyst, and PICO Portal received the highest average scores for user-friendliness and time efficiency, indicating that users found them intuitive and quick to learn. PICO Portal was also noted for its visually accessible interface, which contributed to an efficient onboarding process. ASReview was also rated positively for ease of learning and required minimal time to reach proficiency, suggesting a generally favourable learning experience.
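The reviewer scores in Table 2 and the averages in Fig. 2 follow directly from the point values assigned to each response option in File S1; the sketch below illustrates that mapping and the per-tool averaging (the data structure and the Rayyan example are illustrative assumptions, not the study's analysis code).

```python
# Illustrative sketch: map File S1 response options to points and average the
# two reviewers' ratings per tool and question (example data are hypothetical).
POINTS = {"a": 10, "b": 7.5, "c": 5, "d": 2.5, "e": 0}

# (tool, question) -> the option chosen by each of the two reviewers
responses = {
    ("Rayyan", "easiness to learn"): ["b", "b"],   # both answered "Easy"
    ("Rayyan", "user-friendliness"): ["b", "b"],
    ("Rayyan", "time to learn"):     ["a", "b"],   # "< 15 min" and "15 to 30 min"
}

for (tool, question), options in responses.items():
    scores = [POINTS[o] for o in options]
    print(f"{tool} - {question}: mean score = {sum(scores) / len(scores):.2f}")
```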
While tools varied in design and user interaction, those with clear visual layouts, guided workflows, and accessible support documentation tended to receive higher usability ratings. These findings underscore the value of intuitive interface design and streamlined access to enhance the user experience during AI-aided abstract screening.
In addition to survey feedback, practical observations during the evaluation process provided further context regarding tool access and setup. RobotAnalyst required a trial account to access the platform. A three-month license was granted to the authors for this evaluation, with no restriction on the number of abstracts uploaded. PICO Portal similarly offered a free trial with full feature access, allowing for uninterrupted use during the review period.
CADIMA was accessed through a time-limited training account, which expired after three months. Unlike the other tools (i.e., PICO Portal, Rayyan, ASReview, Colandr, and Abstrackr), CADIMA did not support RIS file uploads. Instead, it offered an internal search function, which during testing did not yield usable references. This limited the team’s ability to proceed with screening through the platform.
Colandr presented some access challenges for team members, specifically related to password resets and account recovery. These issues delayed initial use for some reviewers. However, the tool did provide user guides on its website, which helped address questions related to tool functions once access was obtained.
These observations reflect tool configurations and access experiences at the time of evaluation. While functionality and support may change over time, such practical considerations remain important for teams planning to adopt AI-assisted tools in time-sensitive review contexts.
Reviewers were asked to identify any features they found to be missing or insufficient in the tools, as well as to suggest improvements that could enhance the user experience. Several common themes emerged across responses, including requests for clearer feedback on AI behaviour, automated deduplication, improved filtering and tagging functions, and enhanced documentation or user support.
A frequently mentioned concern was the lack of clarity around AI functionality and prioritisation mechanisms. Reviewers using tools with prioritisation features, such as PICO Portal and Rayyan, noted uncertainty regarding how the software reordered records based on relevance during screening. Reviewers suggested that visual indicators such as confidence scores or AI progress feedback could help clarify when it may be reasonable to stop screening. Similarly, in PICO Portal, reviewers expressed a desire to better understand how the "priority screening algorithm" was functioning in real-time.
Another recurring theme from the user survey was the need for improved duplicate handling. Several tools either lacked automated deduplication entirely or offered only limited manual support. Reviewers highlighted that integrating automatic duplicate removal, particularly in large-scale reviews, would substantially reduce screening burden. This was noted specifically in relation to Abstrackr and RobotAnalyst.
Documentation and onboarding support were also identified as areas for improvement. One reviewer mentioned difficulty locating official documentation for RobotAnalyst, which led to a trial-and-error learning approach despite the tool being otherwise well-received. Similarly, Colandr users experienced issues with logging in and noted challenges in using filtering features, including occasional script errors and difficulty locating tags for studies, even after setup.
Reviewers also recommended enhancing filtering, tagging, and study categorisation options, particularly for Colandr and ASReview. While ASReview allowed note attachment, users suggested the addition of highlighting functionality, as well as clarity on whether multi-user collaboration was supported. In some cases, visual or interface limitations (e.g., image rendering errors in Abstrackr) were also noted as affecting usability.
Finally, reviewers emphasised the value of progress tracking features. Tools that lacked a visible indication of screening completion (e.g., percentage completed) were perceived as less transparent. Reviewers recommended that tools include more robust project progress displays to support workflow management and planning.
4. Discussion
4.1 Summary of Findings
Our study evaluated the performance, features, and user experience of six freely available AI-aided abstract screening tools using an umbrella review as a case study. While performance differences between tools were not substantial, none were able to retrieve more than 50% of the included studies after screening 25% of abstracts, indicating limitations in early recall. Nevertheless, most tools adequately supported core screening tasks, and several demonstrated additional strengths in usability, tagging, and exporting functions. A likely reason for this performance is the added complexity of our case study: the tools had to do more than judge topical relevance, as they also needed to distinguish systematic reviews from primary research. This two-tiered classification task is inherently more challenging than typical screening, which usually focuses solely on content relevance. Furthermore, except for PICO Portal, none of the freely available tools automatically removed duplicates. This limitation can substantially increase screening burden and reduce efficiency in large-scale reviews.
Despite these challenges, several tools were still able to surface a subset of pertinent studies early in the process, indicating they could help lighten reviewer workload. Although complete manual screening remains necessary in most cases, these platforms can serve as valuable assistants by streamlining the initial screening stages.
PICO Portal consistently emerged as the highest-performing tool across multiple dimensions, offering strong support for collaboration, usability, and progress tracking. Colandr and Rayyan also performed well, especially in user-facing features like article tagging, keyword highlighting, and ease of navigation, though they lacked advanced collaborative functions such as project auditing and user-role customisation.
Our survey results showed generally positive user experiences but also highlighted gaps in functionality and accessibility. Importantly, many free tools restrict essential features such as duplicate removal or reference allocation to paid plans, or require time-intensive account setup, limiting their appeal for time-constrained or resource-limited researchers.
4.2 Comparison with Existing Evidence
Previous evaluations of AI-aided screening tools have consistently reported substantial workload reductions, which aligns with our findings [9, 14, 15, 18]. However, to our knowledge, no prior study has evaluated these tools specifically in the context of umbrella reviews or systematically assessed recall rates at early screening stages, making direct comparisons with our results challenging. Similarly, while some studies analysed tools that require paid subscriptions [14], these are not directly comparable to our study, as we focused only on tools that are freely available or offer at least a free trial.
Among the available tools, Rayyan has been highlighted in the literature as a favourable option. For example, a recent study [18] reported that Rayyan correctly identified nearly 80% of relevant records, recommending it as the most effective software in their evaluation. In our study, Rayyan demonstrated strengths in user experience and feature availability, but our performance analysis did not reproduce the high recall rates reported previously, underscoring the potential influence of review context and dataset characteristics.
PICO Portal has received limited formal assessment, likely because it was only introduced in 2020. One study [9] included PICO Portal in a broad feature analysis that focused on functionalities across all stages of the systematic review process rather than on screening performance specifically. The study evaluated the approach feature of screening, which referred to the type of AI methodology used to prioritise records during title/abstract screening, but it did not assess other aspects such as usability, efficiency, or recall performance. An earlier evaluation [34] highlighted PICO Portal’s PICO-based keyword highlighting and AI-driven prioritisation as key strengths, while noting a steep learning curve and usability challenges for new users. In contrast, our team found PICO Portal intuitive and user-friendly, suggesting potential improvements in the interface since its initial evaluation.
On the other hand, while Abstrackr has been positively reviewed in prior studies [18, 35] for its screening efficiency, our team, comprising researchers with over five years of experience and involvement in multiple reviews, experienced difficulties using it effectively for title/abstract screening. Such discrepancies between previous evaluations and our experience are unsurprising given the rapidly evolving nature of AI-aided review tools and differences in study designs, datasets, and reviewer experience.
4.3 Strengths, Limitations and Implications
A key strength of this study is its use of a real-world umbrella review on an interdisciplinary topic that comprised more than 2,000 abstracts. Using this previous study as a case study allowed us to evaluate tool performance across a broad and complex abstract set. Another strength is the multi-dimensional evaluation framework, which included not only performance and feature analysis but also direct user experience through structured survey data.
However, several limitations must be noted. First, our findings are based on a single case study, which limits generalisability. Future research should examine these tools across different domains to determine whether results are consistent. Additionally, while we documented the presence of features such as multi-user support and non-Latin character handling, we did not test their real-world performance and thus cannot comment on their practical utility.
4.4 Implications and Future Directions
Our findings suggest that AI-aided screening tools are promising but remain underused in evidence synthesis. A significant barrier is the lack of awareness, training, and user support, particularly for researchers without computational backgrounds. Institutions can help close this gap by offering regular training workshops and integrating AI tools into research methodology curricula to increase both exposure and user confidence.
Developers also play a critical role in facilitating adoption. Usability and onboarding support should be prioritised to lower the entry barrier for new users and flatten the learning curve. Our experience with the PICO Portal team illustrates the positive impact of proactive engagement: the developers provided onboarding materials, live demonstrations, and responsive assistance, all of which helped address the steep learning curve and can encourage broader use. Such direct interaction between developers and end-users can significantly improve adoption and satisfaction, particularly for researchers unfamiliar with AI technologies.
From a development standpoint, there is clear room for improvement. Key areas include enhancing duplicate detection, implementing reliable study design recognition (e.g., distinguishing reviews from primary research), and offering transparent performance tracking. Additionally, improving multi-language support and non-Latin character recognition would expand the utility of these tools in global and multilingual research contexts.
Authors of research studies themselves can also facilitate automation by clearly reporting study design and methodology in abstracts. This practice would help both human reviewers and AI tools accurately classify studies during the screening phase, which is particularly important in umbrella reviews where distinguishing between primary studies and reviews is essential but often requires full-text access.
Finally, while generative AI and no-code tools offer new opportunities for automation, free platforms still fall short of offering fully comprehensive solutions for complex, multidisciplinary reviews. Tools such as PICO Portal are moving in the right direction by incorporating progress tracking and detailed performance reports, but broader functionality without reliance on paywalls is essential to ensure equity, scalability, and wider adoption of AI-aided screening in systematic review practice.
5. Conclusions
This study evaluated six freely available AI-aided abstract screening tools using an umbrella review as a test case. Our findings suggest that these tools are useful and increasingly accessible for supporting systematic reviews, even for researchers without programming experience. While most tools performed adequately in core screening tasks, limitations remain, particularly in early recall, duplicate removal, study design recognition, and multi-user functionality.
Despite these challenges, AI-aided screening tools hold strong potential to enhance the efficiency and consistency of evidence synthesis. For broader adoption, however, improvements in usability, feature completeness, and institutional training support are needed. As the field evolves, developers, researchers, and institutions all have a role to play in advancing and adapting these tools to better meet the practical demands of diverse review contexts.