Transforming Learning or Empty Promise? A Meta-Analysis of Generative AI in Education
Xiuxiu Tang1*, Xiyu Wang2, Liu Dong2, Jingxian Cecilia Zhang3
1 Department of Psychology, University of Notre Dame, Notre Dame, IN, 46556, USA
2 College of Education, Purdue University, West Lafayette, IN, 47906, USA
3 Richard J. Bolte, Sr. School of Business, Mount St. Mary's University, Emmitsburg, MD, 21727, USA
Corresponding author: Xiuxiu Tang
Corresponding author address: E459 Corbett Hall, Notre Dame, IN 46556
Corresponding author email address: xtang8@nd.edu
Other authors email addresses: Xiyu Wang (wang5608@purdue.edu); Liu Dong (dong306@purdue.edu); Jingxian Cecilia Zhang (j.zhang@msmary.edu)
Transforming Learning or Empty Promise? A Meta-Analysis of Generative AI in Education
Abstract
This meta-analysis examines the impact of generative artificial intelligence (GenAI) tools, such as ChatGPT, on students’ academic achievement. Drawing on 52 experimental and quasi-experimental studies across educational levels and domains, we synthesized evidence from interventions using GenAI to support learning. Eligible studies reported performance outcomes (e.g., test scores, grades, GPA) and met rigorous inclusion criteria. Overall, GenAI-based instruction showed a large positive effect (Hedges' g = 1.193) on academic achievement, with substantial between-study variability indicating that GenAI’s effectiveness depends on contextual and design features. Moderator analyses identified two significant factors: the instructional role of GenAI and the subject area. GenAI was most effective when used to support formative functions such as assessment, feedback, and tutoring, suggesting that its strength lies in providing adaptive guidance and personalized learning support. Effects also varied across subject areas. Language education showed the strongest and most consistent gains, reflecting a close alignment between GenAI’s natural language capabilities and core instructional practices. In contrast, more modest effects were observed in computer science and art education, where applications tend to be narrower in scope. Other moderators, including educational level, sample size, intervention duration, and learning domain, did not yield statistically significant differences but revealed descriptive patterns that may inform future research and implementation. These findings suggest that GenAI tools hold considerable promise for improving academic performance when thoughtfully integrated into instructional practice. Educators and policymakers should consider both the role GenAI plays and the subject context to ensure its effective use in diverse educational settings.
Keywords: generative artificial intelligence; academic achievement; learning outcomes; meta-analysis; AI in education
1. Introduction
The rapid advent of generative artificial intelligence (GenAI) tools such as ChatGPT, Gemini, and Claude is reshaping educational landscapes by offering new possibilities for supporting and enhancing learning. Unlike earlier forms of artificial intelligence (AI), GenAI technologies generate human-like text, facilitate problem-solving, and provide interactive, context-sensitive assistance (Fui-Hoon Nah et al., 2023). These distinctive capabilities have attracted widespread attention, but empirical evidence on their educational impact is mixed, highlighting a pressing need for systematic synthesis.
Meta-analyses of AI in education more broadly have consistently found positive effects. For example, Dong et al. (2025) synthesized 29 empirical studies of diverse AI technologies and reported a large overall effect (g = 0.92), with variation across educational levels and subjects. Tlili et al. (2025), analyzing 85 studies, likewise found a large average effect (g = 1.10), especially highlighting the effectiveness of chatbots. While these reviews established AI’s general potential, they combined traditional AI systems with emerging GenAI tools, leaving the unique contribution of GenAI unclear.
More recently, several reviews have examined GenAI specifically (e.g., Liu et al., 2025; Sun & Zhou, 2024; Zhu et al., 2025). Although these studies provide valuable early insights, their scope remains limited. Sun and Zhou (2024), for instance, focused exclusively on college students and synthesized only 28 studies. Zhu et al. (2025) included both K–12 and higher education contexts but drew on just 26 studies and reported modest effects (g = 0.39) across varied outcomes. Liu et al. (2025) analyzed a larger set of 49 studies and found substantial effects on achievement and motivation, but their moderator analyses concentrated on media format and interface features rather than pedagogical functions. These reviews demonstrate the potential of GenAI but leave important questions unanswered regarding its impact on academic achievement across educational levels, learning domains, and instructional designs, including how study design features (e.g., sample size) and the instructional roles assigned to GenAI influence outcomes.
The present meta-analysis addresses these gaps by synthesizing recent peer-reviewed, quasi-experimental and experimental studies of GenAI-based instruction published since late 2022. The evidence base spans all educational levels, from K–12 to higher education. By focusing on achievement outcomes and analyzing a larger and more up-to-date body of studies, this study provides a comprehensive and nuanced account of GenAI’s educational effects. It also incorporates several key moderators, including participant sample size, intervention duration, learning domain, subject area, the instructional role of GenAI, and educational level. In doing so, it extends prior reviews and clarifies the conditions under which GenAI is most effective. This contributes to both theory and practice by informing pedagogical design, institutional decision-making, and policy development.
Research Questions
This meta-analysis synthesizes findings from 52 peer-reviewed empirical studies published between November 2022 and 2025 that met strict eligibility criteria. The study addresses two central questions:
RQ1: What is the overall effect of GenAI-based instructional interventions on academic performance outcomes (e.g., test scores, grades, GPA)?
RQ2: Under what conditions does GenAI most effectively improve student achievement (e.g., across educational levels, learning domains, instructional roles, subject areas, or study designs)?
2. Literature Review
2.1 Learning Domains: STEM vs non-STEM
The pedagogical utility of GenAI is influenced by the academic discipline in which it is applied. Evidence shows that GenAI benefits both STEM and non-STEM fields, but the mechanisms through which it operates differ. In non-STEM disciplines such as language acquisition and the social sciences, GenAI tools have been especially effective for communication, critical inquiry, and affective engagement. For example, in English as a Foreign Language (EFL) instruction, an AI chatbot integrated into “think-pair-share” activities reduced students’ speaking anxiety while increasing enjoyment and oral proficiency (Wu et al., 2025). Likewise, in an undergraduate research methods course, the use of ChatGPT enhanced research competencies and fostered autonomous motivation and self-directed learning (Li et al., 2025).
In contrast, within STEM fields such as computer science and engineering, GenAI's primary value appears to lie in its ability to manage cognitive load and provide scaffolded, personalized learning support. For example, a study conducted in a computer vision course found that an intelligent teaching assistant powered by a large language model not only improved academic outcomes but also helped students develop higher-order cognitive skills like active questioning and summarization (Teng et al., 2025). Furthermore, in a university programming course, an AI-agent-supported collaborative learning framework was shown to enhance learning achievement while simultaneously reducing the cognitive effort required of students and increasing their self-efficacy (Wang et al., 2025). In these settings, the AI agent acts as a dynamic, intelligent scaffold, personalizing the learning pathway and enabling students to tackle complex problems more effectively.
2.2 Educational Level Differences
Beyond the subject matter, the developmental stage of the learner also plays a crucial role in determining the effectiveness and appropriate application of GenAI. The existing literature indicates that while GenAI is a potent tool across various age groups, its impact diverges between higher education and K-12 settings, with pedagogical strategy being a particularly critical variable for younger learners.
Most existing research has been conducted in higher education, where students generally benefit from greater autonomy and digital fluency. In this context, studies have consistently found that learners using GenAI tools outperform their peers in control conditions (Lee et al., 2022; Wang, 2025). For example, in diverse courses spanning public health, engineering, and language learning, university students using GenAI tools have produced higher-quality work, demonstrated stronger comprehension, and reported greater confidence and motivation. Specifically, Lee et al. (2022) found that an AI-based chatbot used for after-class review enhanced academic performance and self-efficacy among college students. Similarly, Wang (2025) reported that undergraduates using ChatGPT-4 for English practice showed significant gains in their communication skills and a high degree of acceptance for the technology. These results suggest that university students are well-positioned to engage productively with GenAI, especially when the technology supports complex tasks like revision, self-regulation, or conceptual reasoning.
In K-12 education, the results are more variable and highly dependent on pedagogical support (Alneyadi & Wardat, 2024; Jeon, 2023; Liu et al., 2024; Sapan & Uzun, 2024). For secondary students, interventions that incorporate strong instructional design tend to yield better outcomes. This is supported by Alneyadi and Wardat (2024), who demonstrated that integrating ChatGPT into a Grade 12 Quantum Theory course significantly enhanced student achievement. However, the importance of this scaffolding is underscored by the contrasting findings of Sapan and Uzun (2024), where traditional instruction proved more effective than a less structured ChatGPT integration for improving high school students' writing skills. Research in primary education, while rarer, echoes this theme. Early evidence suggests that GenAI can support foundational skills only when carefully scaffolded. For example, Jeon (2023) found that a chatbot employing "dynamic assessment" with graduated, interactive assistance was highly effective for elementary school students learning vocabulary, while Liu et al. (2024) showed that a teacher-guided, LLM-supported model significantly improved writing performance. Overall, these findings confirm educational level as a key moderator, with the most consistent gains observed in postsecondary contexts and an increasing need for structured, teacher-supported interventions for younger learners.
2.3 Roles of GenAI in Instructional Contexts
The role that GenAI plays within the instructional design of each study also varies widely. Based on a synthesis of recent literature, GenAI interventions typically fall into at least three major categories, which are critical for understanding their impact. First, in the role of tutoring and scaffolding, many studies used GenAI to act as a conversational tutor or a tool to support student learning. These interventions aimed to replicate or enhance elements of one-on-one instruction. For example, some used GenAI as a conversational partner with human-like avatars to reduce speaking anxiety and build student confidence (Wang et al., 2024), while others used it as a creative scaffolding tool, translating students' textual descriptions of poetry into visual images to deepen their comprehension (Chu et al., 2025).
Second, some studies used GenAI primarily for assessment and feedback, which emphasizes iterative improvement and reflection. In this capacity, GenAI was deployed to automatically evaluate student work and provide formative feedback. For instance, it has been used to provide "dialogic feedback" on complex tasks like computer programming, where the timing and context of the GenAI's interactive suggestions were found to be critical for developing students' critical thinking skills (Gong et al., 2025).
Third, an increasing number of studies have explored GenAI’s capacity for personalized learning support. This role often overlaps with tutoring and feedback, as the most effective tools adapt to student input. For example, the conversational tutors adjust their dialogue, and feedback systems provide suggestions tailored to specific errors. This adaptive capability, which supports self-regulated learning and engagement, is a key feature in studies showing positive outcomes across different educational levels (Jeon, 2023; Lee et al., 2022).
Although these roles often overlap in practice, they reflect distinct pedagogical aims and learner interactions. Understanding how GenAI functions as a tutor, feedback provider, or personalized support tool is therefore essential for interpreting its educational impact. This body of research indicates the need for a systematic synthesis that examines not only the overall effect of GenAI on student achievement but also the conditions under which its impact varies.
3. Methodology
The present meta-analysis examined the influence of GenAI on students’ learning outcomes and investigated how study characteristics such as sample size, educational level, intervention duration, role of GenAI, learning domain, and subject area moderated effect sizes. The study was conducted in accordance with the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines (Moher et al., 2009).
3.1 Eligibility Criteria
To be included in this meta-analysis, studies had to meet the following criteria: (1) be peer-reviewed journal articles published between November 30, 2022 and 2025; (2) be written in English and full-text accessible; (3) be empirical in nature (quantitative or mixed-method); (4) apply GenAI technologies such as ChatGPT; (5) be quasi-experimental or true-experimental studies; (6) include both an experimental group using GenAI to support learning and a control group relying on traditional instruction without GenAI; (7) report academic or learning outcomes for both groups; (8) provide sufficient statistical information to calculate the effect size.
3.2 Search Strategy and Selection Criteria
We conducted a comprehensive search of published studies using a wide range of scientific databases, including Scopus, Web of Science, and EBSCO (including APA PsycINFO, Education Full Text, Education Source, ERIC, and Social Sciences Full Text). The key search terms used in the electronic database searches consisted of two categories, GenAI and learning, which were connected by Boolean AND. The GenAI-related keywords included “generative artificial intelligence” or “Gen-AI” or “GenAI” or “GAI” or “ChatGPT” or “GPT” or “agent” or “chatbot” or “large language model” or “LLM” or “OpenAI”. The learning-related keywords included: “Learning” or “education” or “instruction” or “curriculum” or “course” or “learning outcome” or “learning performance” or “learning effect” or “learning achievement” or “academic performance” or “academic achievement”.
3.3 Data Extraction
A PRISMA flow diagram (Fig. 1) was used to illustrate the data extraction process. The search initially identified 7,667 records, which were imported for processing. After removing the duplicates, 5,302 unique studies remained for screening. The review proceeded in three rounds. In the first round, titles and abstracts were screened for relevance, resulting in the exclusion of 5,157 records and leaving 145 full-text articles for eligibility assessment. In the second round, the full texts were reviewed in detail, and 91 studies were excluded. In the third round, statistical information and categorical variables were extracted from the remaining studies for effect size calculation. Ultimately, 54 studies met all criteria and were included in the meta-analysis.
3.4 Data Coding
A structured coding form was developed to extract key information from each study, including title, publication year, research design, and statistical data (sample size, means, and standard deviations). Additionally, each study was coded on six moderator descriptors: (1) educational level (higher education, secondary education, primary education), (2) sample size (1–50, 51–100, 101–150, or more than 150), (3) learning domain (STEM or non-STEM), (4) subject area, (5) intervention duration, and (6) the instructional role assigned to GenAI (personalized recommendation, assessment and evaluation, tutoring, mixed, or others). By focusing on these specific variables, this meta-analysis provides insights into which aspects of GenAI interventions are most influential. Four coders were trained and coded the first ten articles together to ensure consistency in coding. At least one coder then coded and cross-checked each of the remaining articles. Any discrepancies were discussed and resolved until full agreement was reached. See the Appendix for coding details.
3.5 Data Analysis
To ensure statistical independence, only one effect size was extracted from each of the 54 included studies. To reduce small-sample bias, Hedges’ g was used to estimate effect sizes. A random effects model, justified by significant heterogeneity (Q test, I² statistic), was used to synthesize the overall impact of GenAI on student achievement. A forest plot was generated to display individual effect sizes and corresponding confidence intervals. Moderator analyses were conducted to examine factors influencing the effectiveness of GenAI interventions. All analyses were conducted using SPSS version 29.
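All analyses were run in SPSS 29, but the core calculations are straightforward to reproduce. The following is a minimal illustrative sketch in Python (using NumPy and SciPy) of how a single bias-corrected Hedges' g and a DerSimonian–Laird random-effects pool (yielding Q, I², and τ²) can be computed; the function names and example numbers are hypothetical and are not taken from the included studies, and the sketch is not necessarily identical to the SPSS procedure.

```python
import numpy as np
from scipy import stats

def hedges_g(m_exp, m_ctl, sd_exp, sd_ctl, n_exp, n_ctl):
    """Bias-corrected standardized mean difference (Hedges' g) and its sampling variance."""
    sp = np.sqrt(((n_exp - 1) * sd_exp**2 + (n_ctl - 1) * sd_ctl**2) / (n_exp + n_ctl - 2))
    d = (m_exp - m_ctl) / sp                              # Cohen's d
    j = 1 - 3 / (4 * (n_exp + n_ctl) - 9)                 # small-sample correction factor
    var_d = (n_exp + n_ctl) / (n_exp * n_ctl) + d**2 / (2 * (n_exp + n_ctl))
    return j * d, j**2 * var_d

def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooling with Q, I^2, and tau^2."""
    g = np.asarray(effects, float)
    v = np.asarray(variances, float)
    w = 1 / v                                             # fixed-effect weights
    g_fixed = np.sum(w * g) / np.sum(w)
    q = np.sum(w * (g - g_fixed)**2)                      # Cochran's Q
    df = len(g) - 1
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)                         # between-study variance
    w_star = 1 / (v + tau2)                               # random-effects weights
    g_pooled = np.sum(w_star * g) / np.sum(w_star)
    se = np.sqrt(1 / np.sum(w_star))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0   # I^2 in percent
    ci = g_pooled + np.array([-1.0, 1.0]) * stats.norm.ppf(0.975) * se
    return {"g": g_pooled, "se": se, "ci": ci, "Q": q, "df": df, "I2": i2, "tau2": tau2}

# Hypothetical inputs (means, SDs, and ns are illustrative, not values from the coded studies)
g1, v1 = hedges_g(82.0, 74.0, 10.0, 11.0, 35, 34)
g2, v2 = hedges_g(3.6, 3.1, 0.7, 0.8, 60, 58)
print(random_effects_pool([g1, g2], [v1, v2]))
```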
Fig. 1
Flow Diagram
3.6 Publication bias
To assess the risk of publication bias, a funnel plot was generated to examine the relationship between study effect sizes and their standard errors. In the absence of bias, the funnel plot should display a symmetrical, funnel-shaped distribution, with smaller studies scattered widely at the base and larger studies clustered near the mean effect size at the top. Asymmetry in the plot, by contrast, may suggest that smaller studies with divergent or unfavorable results are underrepresented, indicating possible selective outcome reporting or publication bias. In addition to visual inspection, Egger’s regression test was conducted to statistically evaluate funnel plot asymmetry.
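Egger's test regresses each study's standardized effect (g divided by its standard error) on its precision (the reciprocal of the standard error); an intercept that departs from zero indicates funnel-plot asymmetry, while the slope approximates the effect expected as the standard error approaches zero. A minimal sketch under those assumptions is given below; the arrays are placeholders, not the coded data, and the implementation is illustrative rather than the exact SPSS routine.

```python
import numpy as np
from scipy import stats

def eggers_test(effects, standard_errors):
    """Egger's regression test: regress g/SE on 1/SE and test the intercept against zero."""
    g = np.asarray(effects, float)
    se = np.asarray(standard_errors, float)
    y = g / se                        # standard normal deviate of each effect
    x = 1 / se                        # precision
    fit = stats.linregress(x, y)      # intercept: asymmetry; slope: limit effect as SE -> 0
    t = fit.intercept / fit.intercept_stderr
    p = 2 * stats.t.sf(abs(t), df=len(g) - 2)
    return {"intercept": fit.intercept, "t": t, "p": p, "limit_effect": fit.slope}

# Placeholder effect sizes and standard errors (illustrative only)
g = np.array([1.2, 0.4, 0.9, 2.1, 0.3, 1.5, 0.7])
se = np.array([0.30, 0.22, 0.35, 0.50, 0.18, 0.40, 0.26])
print(eggers_test(g, se))
```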
4. Results
4.1 Weighted Average Effect Size and Heterogeneity
Of the 54 eligible studies, two were excluded prior to analysis due to extreme outlier effect sizes and inconsistencies in the original reports, to avoid undue influence on the pooled estimate. Across the 52 independent comparisons included in the analysis, the aggregated impact of GenAI tools on student academic achievement was substantial and statistically significant. The pooled standardized mean difference, expressed as Hedges’ g, was 1.193 (SE = 0.243, 95% CI [0.716, 1.669], p < 0.001), indicating a strong positive effect of GenAI-supported learning relative to conventional instructional methods. The 95% prediction interval (–2.267 to 4.639) further illustrates the extent of contextual variability, suggesting that while the average effect is robust, outcomes may differ across implementation settings and study conditions.
Measures of heterogeneity confirmed considerable between-study variability. The Q statistic was highly significant, Q(51) = 863.51, p < 0.05, and the I² index was 98.0%, indicating that nearly all variance observed among effect sizes reflects systematic differences across studies rather than random error. The estimated between-study variance (τ²) was 2.98. A histogram of the unweighted effect sizes (Fig. 2) shows a moderately skewed distribution, with several studies reporting large positive effects. The forest plot (Fig. 3) provides a visual summary of the individual study estimates and their confidence intervals. These results collectively justify the use of a random-effects model and underscore the value of exploring moderators to account for the observed dispersion.
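For reference, a 95% prediction interval under the random-effects model is conventionally computed from the pooled estimate, its standard error, and τ². The following is a sketch of the standard formula applied to the values above; small discrepancies from the reported interval can arise from the critical value a given package uses.

```latex
% 95% prediction interval for the effect in a new setting (k = number of studies)
\hat{g} \pm t_{k-2,\,0.975}\,\sqrt{\hat{\tau}^{2} + \widehat{\mathrm{SE}}(\hat{g})^{2}}
\;=\; 1.193 \pm t_{50,\,0.975}\sqrt{2.98 + 0.243^{2}} \;\approx\; (-2.3,\ 4.7)
```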
Fig. 2
Distribution of 52 Unweighted Hedges’ g Effect Sizes
Fig. 3
Distribution of the Weighted Effect Sizes and Their CI
4.2 Publication Bias
As part of the risk-of-bias assessment, a funnel plot of effect sizes against standard errors was generated (Fig. 4). Visual asymmetry raised the possibility of small-study effects or selective reporting. To formally test this, Egger’s regression test for funnel plot asymmetry was conducted. The result was statistically significant, t(49) = 3.156, p < 0.05, providing evidence of potential bias. The extrapolated effect size as the standard error approaches zero was −0.344 (95% CI: −1.093, 0.406), suggesting that smaller studies may overestimate treatment effects. While the confidence interval of the limit estimate includes zero, the significance of the test highlights the need to interpret the pooled estimate in light of potential publication or reporting bias.
Fig. 4
Funnel Plot for Publication Bias Assessment
4.3 Moderator Analyses
To examine sources of heterogeneity, moderator analyses were conducted on study-level characteristics, including educational level, sample size, learning domain, subject area, intervention duration, and the instructional role assigned to GenAI. Summary statistics for all subgroup comparisons are reported in Table 1.
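The subgroup tests reported in Table 1 follow the usual analog-to-ANOVA logic: each category is pooled separately under the random-effects model, and a between-groups statistic (chi-square distributed with the number of groups minus one degrees of freedom) tests whether the category means differ. A self-contained sketch of that procedure is shown below; the category labels and numbers are illustrative only and are not the coded data.

```python
import numpy as np
from scipy import stats

def dl_pool(effects, variances):
    """DerSimonian-Laird pooled effect and its variance for one subgroup."""
    g = np.asarray(effects, float)
    v = np.asarray(variances, float)
    w = 1 / v
    g_fixed = np.sum(w * g) / np.sum(w)
    q = np.sum(w * (g - g_fixed)**2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, (q - (len(g) - 1)) / c) if len(g) > 1 else 0.0
    w_star = 1 / (v + tau2)
    return np.sum(w_star * g) / np.sum(w_star), 1 / np.sum(w_star)

def q_between(groups):
    """Between-groups test: do the subgroup pooled effects differ more than chance allows?"""
    pooled = [dl_pool(g, v) for g, v in groups.values()]
    means = np.array([m for m, _ in pooled])
    variances = np.array([v for _, v in pooled])
    w = 1 / variances
    grand_mean = np.sum(w * means) / np.sum(w)
    q_b = np.sum(w * (means - grand_mean)**2)        # chi-square with (groups - 1) df
    df = len(groups) - 1
    return q_b, df, stats.chi2.sf(q_b, df)

# Illustrative subgroup data: {category: (effect sizes, variances)}
groups = {
    "STEM":     ([1.1, 0.8, 1.6], [0.05, 0.08, 0.12]),
    "Non-STEM": ([1.4, 0.9, 1.2, 1.7], [0.06, 0.04, 0.10, 0.09]),
}
print(q_between(groups))
```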
Table 1
Subgroup Analyses of Potential Moderator Variables Using Random Effects
| Variable | Category | k | Hedges' g (Mean) | Hedges' g (SD) | Q | df | p | Moderator test (χ²) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Education level | | | | | | | | χ²(2) = 1.628, p = 0.443 |
| | Primary education | 7 | 1.007 | 0.414 | 40.477 | 6 | < 0.05 | |
| | Secondary education | 11 | 0.715 | 0.418 | 246.381 | 10 | < 0.05 | |
| | Higher education | 34 | 1.397 | 0.346 | 518.676 | 33 | < 0.05 | |
| Sample size | | | | | | | | χ²(3) = 4.745, p = 0.191 |
| | 1–50 | 13 | 1.075 | 0.276 | 68.526 | 12 | < 0.05 | |
| | 51–100 | 29 | 1.517 | 0.421 | 584.256 | 28 | < 0.05 | |
| | 101–150 | 6 | 0.349 | 0.425 | 127.992 | 5 | < 0.05 | |
| | More than 150 | 4 | 0.671 | 0.318 | 50.948 | 3 | < 0.05 | |
| Learning domain | | | | | | | | χ²(1) = 0.499, p = 0.480 |
| | STEM | 20 | 1.329 | 0.365 | 321.970 | 19 | < 0.05 | |
| | Non-STEM | 32 | 1.313 | 0.352 | 541.032 | 31 | < 0.05 | |
| Subject area | | | | | | | | χ²(10) = 51.949, p < 0.001 |
| | General education | 4 | 1.423 | 0.662 | 61.847 | 3 | < 0.05 | |
| | Art education | 4 | 0.284 | 0.181 | 4.515 | 3 | 0.211 | |
| | Language education | 23 | 1.629 | 0.486 | 331.692 | 22 | < 0.05 | |
| | Mathematics education | 1 | 0.274 | 0.324 | NA | NA | NA | |
| | Physics education | 2 | 1.130 | 0.128 | 0.000 | 1 | 0.997 | |
| | Science education | 5 | 0.975 | 0.890 | 130.125 | 4 | < 0.05 | |
| | Medical education | 1 | 2.334 | 0.402 | NA | NA | NA | |
| | Nursing education | 1 | 2.084 | 0.405 | NA | NA | NA | |
| | Health education | 3 | 0.743 | 0.167 | 0.362 | 2 | 0.834 | |
| | Computer science education | 7 | 0.467 | 0.655 | 190.030 | 6 | < 0.05 | |
| | Engineering education | 1 | 1.544 | 0.207 | NA | NA | NA | |
| Intervention duration | | | | | | | | χ²(2) = 1.864, p = 0.397 |
| | 1–4 weeks | 26 | 1.159 | 0.270 | 340.446 | 25 | < 0.05 | |
| | 5–8 weeks | 18 | 0.787 | 0.342 | 355.236 | 17 | < 0.05 | |
| | > 8 weeks | 8 | 2.332 | 1.228 | 130.434 | 7 | < 0.05 | |
| Instructional role of GenAI | | | | | | | | χ²(4) = 10.641, p = 0.031 |
| | Assessment and evaluation | 4 | 2.019 | 0.586 | 46.288 | 3 | < 0.05 | |
| | Personalized recommendation | 1 | 0.490 | 0.311 | NA | NA | NA | |
| | Tutoring | 17 | 1.410 | 0.398 | 208.508 | 16 | < 0.05 | |
| | Mixed | 29 | 1.017 | 0.367 | 532.273 | 28 | < 0.05 | |
| | Others | 1 | 0.274 | 0.324 | NA | NA | NA | |
Educational Level
Differences across educational levels were not statistically significant (χ²(2) = 1.628, p = .443). Still, a descriptive gradient appeared: studies at the higher-education level showed the largest mean effect (g = 1.397), followed by primary education (g = 1.007), with secondary education showing the smallest (g = 0.715). Although this difference cannot be interpreted as conclusive, it hints that GenAI may yield stronger outcomes among learners in higher education, who can make more autonomous use of the tools. The smaller gains observed in secondary education may reflect differences in curricular structure or support needs. Thus, even without statistical significance, the pattern suggests that developmental stage could influence the effectiveness of GenAI integration.
Sample Size
Sample size also did not significantly moderate effects (χ²(3) = 4.745, p = .191). Nonetheless, a clear descriptive tendency emerged. Studies with 51–100 participants reported the strongest mean effect (g = 1.517). Smaller studies with 1–50 participants showed a positive but lower effect (g = 1.075). Very large studies with over 150 participants reported a more modest effect (g = 0.671). The lowest mean appeared in the 101–150 participant group (g = 0.349).
These findings should be read cautiously, but they suggest that study scale may shape outcomes. Mid-sized studies often occur in manageable settings where GenAI can be fully embedded, while very large studies may include heterogeneous contexts that dilute effects. The unexpectedly low value in the 101–150 group likely reflects study-specific features rather than a general trend. Altogether, while not statistically significant, the pattern hints that study scale could be a practical factor in how GenAI’s impact is realized.
Learning Domain
The difference between STEM and non-STEM domains was not significant (χ²(1) = 0.499, p = .480). The mean effect sizes were nearly identical (STEM: g = 1.329; non-STEM: g = 1.313). This similarity suggests that GenAI has broad applicability across disciplinary boundaries. Although the contrast was nonsignificant, the fact that both groups showed large positive effects indicates that the benefits of GenAI are not confined to one type of learning domain. This consistency across fields strengthens the case for generalizability.
Subject Area
Subject area significantly moderated effects, χ²(10) = 51.949, p < .001. The strongest average gains were observed in language education (g = 1.629), followed by engineering (g = 1.544) and general education (g = 1.423). Smaller pooled effects were reported in computer science (g = 0.467) and art education (g = 0.284). Very large mean values appeared in medical education (g = 2.334) and nursing education (g = 2.084), though each was based on a single study and thus should be interpreted cautiously.
The relatively modest average in computer science is notable, especially given GenAI's conceptual alignment with this field. One possible reason is that current studies in CS education often focus on narrow skillsets such as syntax correction or code generation, rather than broader applications like project-based learning, peer feedback, or generative support for computational thinking. Moreover, the small number of available studies (K = 7) limits the stability of the pooled estimate, and the diversity in study design further complicates interpretation. As a result, the generalizability of the observed effect size in this area should be regarded as limited. By contrast, the consistently strong effects observed in language education likely reflect a tighter match between GenAI’s natural language capabilities and common instructional tasks in that domain, such as writing assistance, paraphrasing, translation, and iterative feedback. The extremely large means found in medical and nursing education should be treated cautiously due to their basis in single studies, but they suggest potentially promising areas for simulation-based or decision-support applications—pending further replication.
Altogether, the evidence indicates that GenAI’s impact is not distributed evenly across disciplines. It tends to be stronger where the tools' capabilities closely map onto instructional practices and weaker where current use cases remain limited in scope or scale. Computer science education, despite its intuitive connection to AI, may require broader pedagogical integration to fully reflect the field’s potential.
Intervention Duration
Intervention duration did not significantly moderate the effect, χ²(2) = 1.864, p = .397. The average effect size was highest for studies lasting more than eight weeks (g = 2.332), although this strikingly high estimate rests on a smaller set of studies. Short-term interventions of one to four weeks followed (g = 1.159), and medium-length interventions of five to eight weeks showed the lowest average (g = 0.787). While these results were not significant, the variation raises the possibility that exposure time plays a role in GenAI's impact.
Instructional Role of GenAI
Instructional role of GenAI significantly moderated outcomes, χ²(4) = 10.641, p = .031. The largest effects were found in assessment and evaluation (g = 2.019) and tutoring (g = 1.410). Mixed-use studies averaged g = 1.017, while recommendation-based (g = 0.490) and miscellaneous applications (g = 0.274) were lower. The pattern suggests that GenAI is most effective when integrated directly into instructional feedback and guidance processes. While the test was significant, some categories included very few studies, so interpretations should be tempered accordingly.
5. Discussion
This meta-analysis of 52 experimental and quasi-experimental studies demonstrates that GenAI interventions produce a strong positive effect on student academic achievement (Hedges’ g = 1.193). This estimate exceeds effect sizes reported in earlier reviews of AI or GenAI in education (Dong et al., 2025; X. Liu et al., 2025; Sun & Zhou, 2024; Tlili et al., 2025; Zhu et al., 2025). Several factors likely explain this difference, including the rapid advances in large language models since 2022, which have made GenAI tools more accurate, interactive, and versatile, as well as the broader scope of this review, which encompassed multiple educational levels, learning domains, subject areas and instructional designs.
At the same time, the observed high heterogeneity indicates that effectiveness varies substantially across contexts. Understanding the sources of this variation is therefore critical. The moderator analyses revealed that the instructional role of GenAI and subject area were statistically significant, which indicates conditions under which GenAI is most effective. Other moderators (educational level, intervention duration, sample size, and learning domain) did not reach significance but showed descriptive patterns that can inform future research and implementation.
5.1 Instructional Role as a Core Driver
The instructional role emerged as one of the most important moderators of GenAI effectiveness. Interventions in which GenAI was used for assessment and feedback produced the strongest effects, followed by tutoring applications. By contrast, recommendation-based or loosely defined uses were less effective. These findings align with the extensive literature on formative assessment, which emphasizes that timely, targeted, and individualized feedback is among the most powerful influences on student learning (Black & Wiliam, 2009; Hattie & Timperley, 2007). GenAI tools are well-suited to this function because they can provide immediate, personalized responses at scale, supplementing teacher feedback and allowing for more frequent cycles of practice and revision.
From the perspective of scaffolding theory (Van De Pol et al., 2010; Wood et al., 1976), GenAI can function as a flexible support mechanism that adapts to learners’ needs as they progress through their zone of proximal development (Vygotskij, 1981). In addition, self-regulated learning frameworks (Panadero, 2017; Zimmerman, 2002) suggest that such adaptive feedback promotes metacognitive monitoring, error detection, and iterative improvement—all of which are processes that GenAI-based tutoring or assessment tools can enhance.
The weaker effects of recommendation-based or miscellaneous applications suggest that pedagogical alignment is essential. Simply deploying GenAI as a novelty or generic recommender does not guarantee benefits; in fact, such uses may fail to engage deeper learning processes. Together, these results suggest that GenAI’s strongest contributions occur when its functions are directly tied to formative assessment and tutoring, where feedback and adaptive guidance are critical for learning.
5.2 Subject Area Differences: Affordance-Practice Alignment
The second significant moderator was subject area, with striking variation in effect sizes across disciplines. The largest and most consistent gains were observed in language education, followed by strong results in engineering and general education. By contrast, more modest effects were reported in computer science and art education, while single studies in medical and nursing education reported very large effects. The success of GenAI in language education likely reflects the close match between large language models’ capabilities and the core practices of the domain. Writing, translation, paraphrasing, and revision are all central to language learning, and GenAI can provide immediate, iterative, and tailored support in these areas (Hyland & Hyland, 2006). This natural synergy between affordance and task likely explains why language education consistently produced stronger effects.
The weaker effects in computer science are notable. Despite the field’s conceptual connection to AI, many CS interventions focused on narrow skill sets such as syntax correction or code generation (Yang et al., 2025; Zhao et al., 2025). These uses may ease extraneous cognitive load (Sweller, 1988) but do not necessarily foster deeper conceptual learning or computational thinking. More integrative approaches, such as project-based learning, peer review of code, or generative support for problem decomposition, may be needed to realize GenAI’s potential in this field. Similarly, in art education, current applications may not align well with the creative, iterative, and process-driven nature of artistic practice.
The very large effects reported in medical and nursing education, though based on single studies, point to promising directions. GenAI could be particularly useful in simulation-based learning, clinical decision-making, and diagnostic reasoning, where adaptive feedback and scenario generation are valuable (Chang et al., 2022; Cook et al., 2013). However, further replication is needed before strong conclusions can be drawn.
These results suggest that GenAI’s effectiveness depends on how closely its affordances align with the instructional practices of a domain. Language tasks map naturally onto LLM capabilities, while other fields may require more thoughtful integration to achieve similar benefits.
5.3 Contextual Moderators: Descriptive but Informative Patterns
Although not statistically significant, several other moderators revealed consistent descriptive patterns that provide insight into contextual conditions shaping GenAI’s impact.
Educational level. The gradient observed—higher education showing the strongest effects, followed by primary, then secondary—suggests that learner autonomy and developmental readiness may play important roles. University students may be better equipped to independently integrate GenAI into their work, while primary students may benefit when teachers mediate GenAI use. Secondary students, in contrast, may be at a transitional stage where metacognitive monitoring is still developing (Kuhn, 2000; Pintrich, 2000), and where curricular structures may limit opportunities to use GenAI meaningfully (Mercer & Littleton, 2007). These patterns, while not conclusive, suggest that developmental stage could influence how effectively learners benefit from GenAI.
Intervention duration. Longer interventions (> 8 weeks) showed descriptively stronger effects than shorter ones, while medium-length interventions yielded the lowest averages. This pattern may reflect novelty effects in short-term studies and the stabilizing of benefits in longer-term integrations. Motivation research supports this interpretation: self-determination theory (Ryan & Deci, 2000) emphasizes that sustained engagement depends on autonomy, competence, and relatedness, while work on achievement emotions suggests that novelty-driven excitement tends to fade unless interventions are integrated into stable learning routines (Linnenbrink-Garcia & Pekrun, 2011; Pekrun, 2006; Tsay et al., 2020). These findings point to the importance of designing interventions for sustained rather than short-term use.
Sample size. The strongest effects were observed in mid-sized studies (51–100 participants), while very large studies yielded smaller effects. One explanation is that mid-sized studies often occur in classroom-based settings where implementation fidelity is high, while large studies encompass heterogeneous learners and contexts that dilute effects. At the same time, small-study bias may inflate results in underpowered designs (Ioannidis, 2005; Valentine et al., 2010). This pattern underscores the need for larger, well-controlled trials that balance fidelity with generalizability.
STEM vs. non-STEM domains. Finally, the near-identical effects across STEM and non-STEM fields suggest that GenAI’s benefits are broadly applicable. This finding strengthens the case that GenAI is not restricted to text-rich domains but can enhance learning across diverse disciplines, provided applications are thoughtfully designed.
5.4 Limitations and Future Research
Several limitations warrant caution. First, publication bias suggests that smaller studies may overestimate effects, and the wide prediction interval indicates that not all interventions yield positive outcomes. Second, the evidence base remains dominated by short-term studies in higher education, leaving gaps in K–12 contexts where scaffolding and access issues are more pronounced. Third, few studies examined equity implications, raising concerns that GenAI could exacerbate existing digital divides (Capraro et al., 2024; Luo & Liu, 2025). Finally, the rapid evolution of large language models raises questions about the generalizability of current findings to future iterations.
Future research should address these gaps by conducting preregistered, large-scale trials with strong implementation fidelity; longitudinal studies to examine sustainability over time; and equity-focused analyses to ensure benefits are accessible across diverse learner populations. Design-based research (Brown, 1992; Cobb et al., 2003) will be particularly valuable for refining pedagogical integration, while cross-model comparisons are needed to disentangle technology-specific from pedagogy-specific effects.
5.5 Practical Significance
This meta-analysis shows that GenAI can substantially improve student learning when used thoughtfully, but not all uses are equally effective. The strongest benefits occur when GenAI supports formative roles such as providing feedback on student work or serving as a tutoring assistant. In these cases, students receive more personalized guidance, which helps them practice, revise, and reflect in ways that accelerate learning. Language education stands out as the field with the largest gains, reflecting a strong match between GenAI’s language-generation strengths and writing, translation, and feedback tasks. By contrast, narrower applications in fields like computer science or art show smaller benefits, suggesting that more integrative and creative uses are needed.
For educators and decision-makers, the practical takeaway is that GenAI should be viewed as a pedagogical partner rather than a novelty. Its value lies not in replacing teachers, but in scaling up feedback, tutoring, and practice opportunities that are otherwise difficult to provide in large or diverse classrooms. At the same time, successful implementation requires attention to context: younger students need structured teacher support, longer-term use appears more beneficial than one-off trials, and equitable access must be ensured to avoid widening learning gaps. With thoughtful integration, GenAI can reduce instructional burden and expand opportunities for student-centered learning across a wide range of subjects.
5.6 Implications
For educators, these findings suggest that GenAI should be deployed strategically in formative roles, such as tutoring and feedback, where its capacity for personalization can enhance learning. In K–12 contexts, teacher mediation and scaffolding are critical to ensure productive engagement with AI tools. For institutions, professional development is essential to prepare teachers not only to operate GenAI tools but to integrate them into pedagogy responsibly. Implementation should be guided by clear ethical standards and curricular alignment (Holmes et al., 2022). For policymakers, equity must be a central concern. Without intentional policies to ensure access, GenAI may widen digital divides (Bentley et al., 2024). Policies that promote responsible adoption, transparency, and research–practice partnerships will be critical to maximizing benefits and minimizing risks. At a system level, GenAI should be understood as a pedagogical partner rather than a replacement for teachers. When embedded thoughtfully, it can reduce instructional burden, strengthen formative assessment, and expand opportunities for student-centered learning.
5.7 Conclusion
This meta-analysis provides a comprehensive synthesis of GenAI’s effects on student achievement. While overall benefits are strong, outcomes depend on how, where, and for whom GenAI is implemented. Instructional role and subject area emerged as the strongest moderators, with particularly large effects for formative assessment, tutoring, and language education. Descriptive patterns further suggest that developmental stage, exposure time, and implementation fidelity condition effectiveness.
Looking forward, GenAI has the potential to transform education, but its promise will only be realized through responsible integration, equity-focused implementation, and sustained empirical evaluation. By aligning GenAI use with pedagogical theory, developmental needs, and institutional support, educators and policymakers can move beyond novelty to unlock its long-term potential for equitable, evidence-based learning.
Author Contribution
X.T. contributed to conceptualization, methodology, formal analysis, data curation, software, validation, supervision, and writing (original draft and review & editing). X.W. contributed to methodology, investigation, formal analysis, data curation, software, validation, and writing (review & editing). L.D. contributed to methodology, formal analysis, data curation, software, validation, visualization, and writing (review & editing). J.Z. contributed to methodology, data curation, software, validation, and writing (review & editing). All authors reviewed and approved the final manuscript.
References
*Ait Baha, T., El Hajji, M., Es-Saady, Y., & Fadili, H. (2024). The impact of educational chatbot on student learning experience. Education and Information Technologies, 29(8), 10153–10176. https://doi.org/10.1007/s10639-023-12166-w
*Alneyadi, S., & Wardat, Y. (2024). Integrating ChatGPT in grade 12 quantum theory education: An exploratory study at Emirate School (UAE). International Journal of Information and Education Technology, 14(3), 398–410. https://doi.org/10.18178/ijiet.2024.14.3.2061
*Al Ghaithi, A., & Behforouz, B. (2024). The use of an interactive ChatBot in grammar learning. Journal of Educators Online, 21(4), n4.
*Avello-Martínez, R., Gajderowicz, T., & Gómez-Rodríguez, V. G. (2024). Is ChatGPT helpful for graduate students in acquiring knowledge about digital storytelling and reducing their cognitive load? An experiment. Revista De Educación a Distancia (RED), 24(78). https://doi.org/10.6018/red.604621
*Behforouz, B., & Ghaithi, A. A. (2024). Investigating the effect of an interactive educational chatbot on reading comprehension skills. International Journal of Engineering Pedagogy (iJEP), 14(4), 139–154. https://doi.org/10.3991/ijep.v14i4.48461
*Beltozar-Clemente, S., & Díaz-Vega, E. (2024). Physics XP: Integration of ChatGPT and gamification to improve academic performance and motivation in physics 1 course. International Journal of Engineering Pedagogy, 14(6). https://doi.org/10.3991/ijep.v14i6.47127
Bentley, S. V., Naughtin, C. K., McGrath, M. J., Irons, J. L., & Cooper, P. S. (2024). The digital divide in action: How experiences of digital technology shape future relationships with artificial intelligence. AI and Ethics, 4(4), 901–915. https://doi.org/10.1007/s43681-024-00452-3
Black, P., & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5–31. https://doi.org/10.1007/s11092-008-9068-5
Brown, A. L. (1992). Design experiments: Theoretical and methodological challenges in creating complex interventions in classroom settings. Journal of the Learning Sciences, 2(2), 141–178. https://doi.org/10.1207/s15327809jls0202_2
Capraro, V., Lentsch, A., Acemoglu, D., Akgun, S., Akhmedova, A., Bilancini, E., Bonnefon, J. F., Brañas-Garza, P., Butera, L., Douglas, K. M., Everett, J. A. C., Gigerenzer, G., Greenhow, C., Hashimoto, D. A., Holt-Lunstad, J., Jetten, J., Johnson, S., Kunz, W. H., Longoni, C., & Viale, R. (2024). The impact of generative artificial intelligence on socioeconomic inequalities and policy making. PNAS Nexus, 3(6), 191. https://doi.org/10.1093/pnasnexus/pgae191
*Chang, C., Hwang, G., & Gau, M. (2022). Promoting students’ learning achievement and self-efficacy: A mobile chatbot approach for nursing training. British Journal of Educational Technology, 53(1), 171–188. https://doi.org/10.1111/bjet.13158
*Chen, C., & Chang, C. (2024). Effectiveness of AI-assisted game-based learning on science learning outcomes, intrinsic motivation, cognitive load, and learning behavior. Education and Information Technologies, 29(14), 18621–18642. https://doi.org/10.1007/s10639-024-12553-x
*Chen, C., & Gong, Y. (2025). The role of AI-assisted learning in academic writing: A mixed-methods study on Chinese as a second language students. Education Sciences, 15(2), 141. https://doi.org/10.3390/educsci15020141
*Chen, J., Mokmin, N. A. M., Shen, Q., & Su, H. (2025). Leveraging AI in design education: exploring virtual instructors and conversational techniques in flipped classroom models. Education and Information Technologies, 1–21. https://doi.org/10.1007/s10639-025-13458-z
*Chen, M. R. A. (2025). Improving English semantic learning outcomes through AI chatbot-based ARCS approach. Interactive Learning Environments, 1–16. https://doi.org/10.1080/10494820.2025.2454443
*Chen, M. R. A. (2024). Metacognitive mastery: Transformative learning in EFL through a generative AI chatbot fueled by metalinguistic guidance. Educational Technology & Society, 27(3), 407–427. https://www.jstor.org/stable/48787038
*Chen, M. R. A. (2024). The AI chatbot interaction for semantic learning: A collaborative note-taking approach with EFL students. Language Learning & Technology, 28(1), 1–25.
*Chen, Y., Zhang, X., & Hu, L. (2024). A progressive prompt-based image-generative AI approach to promoting students’ achievement and perceptions in learning ancient Chinese poetry. Educational Technology & Society, 27(2), 284–305. https://hdl.handle.net/10125/73586
*Chu, H. C., Hsu, C. Y., & Wang, C. C. (2025). Effects of AI-generated drawing on students’ learning achievement and creativity in an ancient poetry course. Educational Technology & Society, 28(2). https://doi.org/10.30191/ETS.202504_28(2).TP03
Cobb, P., Confrey, J., diSessa, A., Lehrer, R., & Schauble, L. (2003). Design experiments in educational research. Educational Researcher, 32(1), 9–13. https://doi.org/10.3102/0013189X032001009
Cook, D. A., Hamstra, S. J., Brydges, R., Zendejas, B., Szostek, J. H., Wang, A. T., Erwin, P. J., & Hatala, R. (2013). Comparative effectiveness of instructional design features in simulation-based education: Systematic review and meta-analysis. Medical Teacher, 35(1), e867–e898. https://doi.org/10.3109/0142159X.2012.714886
Dong, L., Tang, X., & Wang, X. (2025). Examining the effect of artificial intelligence in relation to students’ academic achievement: A meta-analysis. Computers and Education: Artificial Intelligence, 8, 100400. https://doi.org/10.1016/j.caeai.2025.100400
*Edwards, B. I., Olugbade, D., & Ojo, O. A. (2024). Facilitating cognitive load management and improved learning outcomes and attitudes in middle school technology and vocational education through ai chatbot. Journal of Technical Education and Training, 16(3), 114–131. https://penerbit.uthm.edu.my/ojs/index.php/JTET/article/view/19476
Fui-Hoon Nah, F., Zheng, R., Cai, J., Siau, K., & Chen, L. (2023). Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration. Journal of Information Technology Case and Application Research, 25(3), 277–304. https://doi.org/10.1080/15228053.2023.2233814
*Fathi, T. E., Saad, A., Larhzil, H., Lamri, D., & Ibrahmi, E. M. A. (2025). Integrating generative AI into STEM education: enhancing conceptual understanding, addressing misconceptions, and assessing student acceptance. Disciplinary and Interdisciplinary Science Education Research, 7(1). https://doi.org/10.1186/s43031-025-00125-z
*Fidan, M., & Gencel, N. (2022). Supporting the instructional videos with chatbot and peer feedback mechanisms in online learning: The effects on learning performance and intrinsic motivation. Journal of Educational Computing Research, 60(7), 1716–1741. https://doi.org/10.1177/07356331221077901
*Gasaymeh, A. M. M., & AlMohtadi, R. M. (2024). The effect of flipped interactive learning (FIL) based on ChatGPT on students’ skills in a large programming class. International Journal of Information and Education Technology, 14(11), 1516–1522. https://doi.org/10.18178/ijiet.2024.14.11.2182
*Gong, X., Li, Z., & Qiao, A. (2025). Impact of generative AI dialogic feedback on different stages of programming problem solving. Education and Information Technologies, 30(7), 9689–9709. https://doi.org/10.1007/s10639-024-13173-1
*Hakim, V. G. A., Paiman, N. A., & Rahman, M. H. S. (2024). Genie-on‐demand: A custom AI chatbot for enhancing learning performance, self‐efficacy, and technology acceptance in occupational health and safety for engineering education. Computer Applications in Engineering Education, 32(6), e22800. https://doi.org/10.1002/cae.22800
Hattie, J., & Timperley, H. (2007). The Power of Feedback. Review of Educational Research, 77(1), 81–112. https://doi.org/10.3102/003465430298487
Holmes, W., Porayska-Pomsta, K., Holstein, K., Sutherland, E., Baker, T., Shum, S. B., Santos, O. C., Rodrigo, M. T., Cukurova, M., Bittencourt, I. I., & Koedinger, K. R. (2022). Ethics of AI in Education: Towards a Community-Wide Framework. International Journal of Artificial Intelligence in Education, 32(3), 504–526. https://doi.org/10.1007/s40593-021-00239-1
*Hsu, M. H. (2024). Mastering medical terminology with ChatGPT and Termbot. Health Education Journal, 83(4), 352–358. https://doi.org/10.1177/00178969231197371
*Hui, Z., Zewu, Z., Jiao, H., & Yu, C. (2025). Application of ChatGPT-assisted problem-based learning teaching method in clinical medical education. BMC Medical Education, 25(1), 1–7. https://doi.org/10.1186/s12909-024-06321-1
*Hwang, G. J., & Zhang, D. (2024). Effects of an adaptive computer agent-based digital game on EFL students’ English learning outcomes. Educational technology research and development, 72(6), 3271–3294. https://doi.org/10.1007/s11423-024-10396-4
Hyland, K., & Hyland, F. (2006). Feedback on second language students’ writing. Language Teaching, 39(2), 83–101. https://doi.org/10.1017/S0261444806003399
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
*Jeon, J. (2023). Chatbot-assisted dynamic assessment (CA-DA) for L2 vocabulary learning and diagnosis. Computer Assisted Language Learning, 36(7), 1338–1364. https://doi.org/10.1080/09588221.2021.1987272
*Ji, Y., Zhan, Z., Li, T., Zou, X., & Lyu, S. (2025). Human-machine co-creation: the effects of ChatGPT on students' learning performance, AI awareness, critical thinking, and cognitive load in a STEM course towards entrepreneurship. IEEE Transactions on Learning Technologies. https://doi.org/10.1109/TLT.2025.3554584
*Karaman, M. R., & Göksu, İ. (2024). Are lesson plans created by ChatGPT more effective? an experimental study. International Journal of Technology in Education, 7(1), 107–127. https://doi.org/10.46328/ijte.607
Kuhn, D. (2000). Metacognitive development. Current Directions in Psychological Science, 9(5), 178–181. https://doi.org/10.1111/1467-8721.00088
*Lee, Y. F., Hwang, G. J., & Chen, P. Y. (2022). Impacts of an AI-based chabot on college students’ after-class review, academic performance, self-efficacy, learning attitude, and motivation. Educational Technology Research and Development, 70(5), 1843–1865. https://doi.org/10.1007/s11423-022-10142-8
*Lee, Y. F., Hwang, G. J., & Chen, P. Y. (2025). Technology-based interactive guidance to promote learning performance and self-regulation: a chatbot-assisted self-regulated learning approach. Educational Technology Research and Development, 1–26. https://doi.org/10.1007/s11423-025-10478-x
*Li, H. (2023). Effects of a ChatGPT-based flipped learning guiding approach on learners’ courseware project performances and perceptions. Australasian Journal of Educational Technology, 39(5), 40–58. https://doi.org/10.14742/ajet.8923
*Li, H., Wang, Y., Luo, S., & Huang, C. (2025). The influence of GenAI on the effectiveness of argumentative writing in higher education: Evidence from a quasi-experimental study in China. Journal of Asian Public Policy, 18(2), 405–430. https://doi.org/10.1080/17516234.2024.2363128
*Li, Y., Sadiq, G., Qambar, G., & Zheng, P. (2025). The impact of students’ use of ChatGPT on their research skills: The mediating effects of autonomous motivation, engagement, and self-directed learning. Education and Information Technologies, 30(4), 4185–4216. https://doi.org/10.1007/s10639-024-12981-9
* Liang, H. Y., Hwang, G. J., Hsu, T. Y., & Yeh, J. Y. (2024). Effect of an AI-based chatbot on students' learning performance in alternate reality game‐based museum learning. British Journal of Educational Technology, 55(5), 2315–2338. https://doi.org/10.1111/bjet.13448
Linnenbrink-Garcia, L., & Pekrun, R. (2011). Students’ emotions and academic engagement: Introduction to the special issue. Contemporary Educational Psychology, 36(1), 1–3. https://doi.org/10.1016/j.cedpsych.2010.11.004
*Liu, C. C., Hwang, G. J., Yu, P., Tu, Y. F., & Wang, Y. (2025). Effects of an automated corrective feedback-based peer assessment approach on students’ learning achievement, motivation, and self-regulated learning conceptions in foreign language pronunciation. Educational Technology Research and Development, 1–22. https://doi.org/10.1007/s11423-025-10484-z
Liu, X., Guo, B., He, W., & Hu, X. (2025). Effects of generative artificial intelligence on k-12 and higher education students’ learning outcomes: A Meta-Analysis. Journal of Educational Computing Research, 63(5), 1249–1291. https://doi.org/10.1177/07356331251329185
*Liu, Z. M., Hwang, G. J., Chen, C. Q., Chen, X. D., & Ye, X. D. (2024). Integrating large language models into EFL writing instruction: Effects on performance, self-regulated learning strategies, and motivation. Computer Assisted Language Learning, 1–25. https://doi.org/10.1080/09588221.2024.2389923
Luo, J., & Liu, X. (2025). What do we mean by digital equality in education? Toward five conceptual lenses based on a systematic review. Journal of Research on Technology in Education, 1–21. https://doi.org/10.1080/15391523.2025.2487279
*Mahapatra, S. (2024). Impact of ChatGPT on ESL students’ academic writing skills: A mixed methods intervention study. Smart Learning Environments, 11(1). https://doi.org/10.1186/s40561-024-00295-9
Mercer, N., & Littleton, K. (2007). Dialogue and the development of children’s thinking. Routledge. https://doi.org/10.4324/9780203946657
Moher, D., Liberati, A., Tetzlaff, J., Altman, D. G., & the PRISMA Group. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. BMJ, 339, b2535. https://doi.org/10.1136/bmj.b2535
*Nusivera, E., & Hikmat, A. (2025). Integration of Chat-GPT usage in language learning model to improve argumentation skills, complex comprehension skills, and critical thinking skills. International Journal of Learning, Teaching and Educational Research, 24(2), 375–390. https://doi.org/10.26803/ijlter.24.2.19
Panadero, E. (2017). A review of self-regulated learning: Six models and four directions for research. Frontiers in Psychology, 8, 422. https://doi.org/10.3389/fpsyg.2017.00422
Pekrun, R. (2006). The control-value theory of achievement emotions: Assumptions, corollaries, and implications for educational research and practice. Educational Psychology Review, 18(4), 315–341. https://doi.org/10.1007/s10648-006-9029-9
Pintrich, P. R. (2000). The role of goal orientation in self-regulated learning. In Handbook of Self-Regulation (pp. 451–502). Elsevier. https://doi.org/10.1016/B978-012109890-2/50043-3
Ryan, R. M., & Deci, E. L. (2000). Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. American Psychologist, 55(1), 68–78. https://doi.org/10.1037/0003-066X.55.1.68
*Sapan, M., & Uzun, L. (2024). The effect of ChatGPT-integrated English teaching on high school EFL learners’ writing skills and vocabulary development. International Journal of Education in Mathematics, Science and Technology, 12(6), 1679–1699. https://doi.org/10.46328/ijemst.4655
*Shahsavar, Z., Kafipour, R., Khojasteh, L., & Pakdel, F. (2024). Is artificial intelligence for everyone? Analyzing the role of ChatGPT as a writing assistant for medical students. Frontiers in Education, 9. https://doi.org/10.3389/feduc.2024.1457744
*Shi, H., Chai, C. S., Zhou, S., & Aubrey, S. (2025). Comparing the effects of ChatGPT and automated writing evaluation on students’ writing and ideal L2 writing self. Computer Assisted Language Learning, 1–28. https://doi.org/10.1080/09588221.2025.2454541
Sun, L., & Zhou, L. (2024). Does generative artificial intelligence improve the academic achievement of college students? A meta-analysis. Journal of Educational Computing Research, 62(7), 1676–1713. https://doi.org/10.1177/07356331241277937
Sweller, J. (1988). Cognitive load during problem solving: Effects on learning. Cognitive Science, 12(2), 257–285. https://doi.org/10.1207/s15516709cog1202_4
*Teng, D., Wang, X., Xia, Y., Zhang, Y., Tang, L., Chen, Q., Zhang, R., Xie, S., & Yu, W. (2025). Investigating the utilization and impact of large language model-based intelligent teaching assistants in flipped classrooms. Education and Information Technologies, 30(8), 10777–10810. https://doi.org/10.1007/s10639-024-13264-z
Tlili, A., Saqer, K., Salha, S., & Huang, R. (2025). Investigating the effect of artificial intelligence in education (AIEd) on learning achievement: A meta-analysis and research synthesis. Information Development, 41(3), 825–842. https://doi.org/10.1177/02666669241304407
Tsay, C. H., Kofinas, A. K., Trivedi, S. K., & Yang, Y. (2020). Overcoming the novelty effect in online gamified learning systems: An empirical evaluation of student engagement and performance. Journal of Computer Assisted Learning, 36(2), 128–146. https://doi.org/10.1111/jcal.12385
Valentine, J. C., Pigott, T. D., & Rothstein, H. R. (2010). How many studies do you need? A primer on statistical power for meta-analysis. Journal of Educational and Behavioral Statistics, 35(2), 215–247. https://doi.org/10.3102/1076998609346961
Van De Pol, J., Volman, M., & Beishuizen, J. (2010). Scaffolding in teacher–student interaction: A decade of research. Educational Psychology Review, 22(3), 271–296. https://doi.org/10.1007/s10648-010-9127-6
Vygotskij, L. S. (1981). Mind in society: The development of higher psychological processes. Harvard University Press.
*Wang, C., Zou, B., Du, Y., & Wang, Z. (2024). The impact of different conversational generative AI chatbots on EFL learners: An analysis of willingness to communicate, foreign language speaking anxiety, and self-perceived communicative competence. System, 127, 103533. https://doi.org/10.1016/j.system.2024.103533
*Wang, H., Wang, C., Chen, Z., Liu, F., Bao, C., & Xu, X. (2025). Impact of AI-agent-supported collaborative learning on the learning outcomes of University programming courses. Education and Information Technologies. https://doi.org/10.1007/s10639-025-13487-8
*Wang, M., Zhang, D., Zhu, J., & Gu, H. (2025). Effects of incorporating a large language model-based adaptive mechanism into contextual games on students’ academic performance, flow experience, cognitive load and behavioral patterns. Journal of Educational Computing Research, 63(3), 662–694. https://doi.org/10.1177/07356331251321719
*Wang, Y. (2025). A study on the efficacy of ChatGPT-4 in enhancing students’ English communication skills. Sage Open, 15(1), 21582440241310644. https://doi.org/10.1177/21582440241310644
*Wei, X., Wang, L., Lee, L. K., & Liu, R. (2025). Multiple generative AI pedagogical agents in augmented reality environments: A study on implementing the 5E model in science education. Journal of Educational Computing Research, 63(2), 336–371. https://doi.org/10.1177/07356331241305519
Wood, D., Bruner, J. S., & Ross, G. (1976). The role of tutoring in problem solving. Journal of Child Psychology and Psychiatry, 17(2), 89–100. https://doi.org/10.1111/j.1469-7610.1976.tb00381.x
*Wu, T. T., Hapsari, I. P., & Huang, Y. M. (2025). Effects of incorporating AI chatbots into think–pair–share activities on EFL speaking anxiety, language enjoyment, and speaking performance. Computer Assisted Language Learning, 1–39. https://doi.org/10.1080/09588221.2025.2478271
*Yang, T. C., Hsu, Y. C., & Wu, J. Y. (2025). The effectiveness of ChatGPT in assisting high school students in programming learning: Evidence from a quasi-experimental research. Interactive Learning Environments, 1–18. https://doi.org/10.1080/10494820.2025.2450659
*Zahran, F. A. (2025). The impact of using Poe ChatGPT-based TPACK model on English as a foreign language teachers’ performance and their students’ vocabulary learning. Higher Learning Research Communications, 15(1), n1.
*Zhao, G., Yang, L., Hu, B., & Wang, J. (2025). A Generative artificial intelligence (AI)-based human-computer collaborative programming learning method to improve computational thinking, learning attitudes, and learning achievement. Journal of Educational Computing Research, 63(5), 1059–1087. https://doi.org/10.1177/07356331251336154
*Zhou, Q., Hashim, H., & Sulaiman, N. A. (2025). Supporting English speaking practice in higher education: The impact of AI chatbot-integrated mobile-assisted blended learning framework. Education and Information Technologies, 1–32.
Zhu, Y., Liu, Q., & Zhao, L. (2025). Exploring the impact of Generative artificial intelligence on students’ learning outcomes: A meta-analysis. Education and Information Technologies, 30(11), 16211–16239. https://doi.org/10.1007/s10639-025-13420-z
Zimmerman, B. J. (2002). Becoming a self-regulated learner: An overview. Theory Into Practice, 41(2), 64–70. https://doi.org/10.1207/s15430421tip4102_2
Appendix
No. | Study ID | Education level | Sample size | Learning domain | Subject area | Intervention duration | Instructional role of GenAI
1 | Chen (2024) | Higher education | 1–50 | Non-STEM | Language education | 5–8 weeks | Mixed
2 | Fathi et al. (2025) | Higher education | 101–150 | STEM | Engineering education | 5–8 weeks | Tutoring
3 | Lee et al. (2022) | Higher education | 1–50 | STEM | Health education | 1–4 weeks | Tutoring
4 | Chu et al. (2025) | Secondary education | 51–100 | Non-STEM | Language education | 1–4 weeks | Tutoring
5 | Li et al. (2023) | Higher education | 51–100 | Non-STEM | General education | 1–4 weeks | Tutoring
6 | Wang (2024) | Higher education | 51–100 | Non-STEM | Language education | > 8 weeks | Mixed
7 | Chen et al. (2024) | Primary education | 51–100 | Non-STEM | Language education | 1–4 weeks | Mixed
8 | Wu et al. (2025) | Higher education | 51–100 | Non-STEM | Language education | 1–4 weeks | Mixed
9 | Chen and Gong (2025b) | Higher education | 1–50 | Non-STEM | Language education | > 8 weeks | Mixed
10 | Chen and Chang (2024) | Secondary education | 101–150 | STEM | Science education | 1–4 weeks | Mixed
11 | Gong et al. (2024) | Secondary education | 51–100 | STEM | Computer science education | 5–8 weeks | Assessment and evaluation
12 | Shi et al. (2025) | Higher education | 51–100 | Non-STEM | Language education | 1–4 weeks | Assessment and evaluation
13 | Avello-Martínez et al. (2024) | Higher education | 1–50 | Non-STEM | Art education | 1–4 weeks | Personalized recommendation
14 | Shahsavar et al. (2024) | Higher education | 51–100 | Non-STEM | Language education | > 8 weeks | Assessment and evaluation
15 | Behforouz and Ghaithi (2024) | Higher education | 51–100 | Non-STEM | Language education | 1–4 weeks | Tutoring
16 | Chen et al. (2025) | Higher education | 101–150 | Non-STEM | Art education | 1–4 weeks | Tutoring
17 | Liu et al. (2025) | Higher education | 51–100 | Non-STEM | Language education | 5–8 weeks | Mixed
18 | Mahapatra (2024) | Higher education | 51–100 | Non-STEM | Language education | 1–4 weeks | Assessment and evaluation
19 | Teng et al. (2024) | Higher education | 51–100 | STEM | Computer science education | 5–8 weeks | Mixed
20 | Sapan and Uzun (2024) | Secondary education | 51–100 | Non-STEM | Language education | 1–4 weeks | Tutoring
21 | Karaman and Göksu (2024) | Primary education | 1–50 | STEM | Mathematics education | 5–8 weeks | Other: Lesson planning and instructional design
22 | Wang et al. (2024) | Higher education | 51–100 | Non-STEM | Language education | 5–8 weeks | Mixed
23 | Huang et al. (2025) | Higher education | 51–100 | STEM | Computer science education | 5–8 weeks | Mixed
24 | Hsu (2023) | Higher education | 1–50 | STEM | Medical education | 5–8 weeks | Mixed
25 | Chang et al. (2021) | Higher education | 1–50 | STEM | Nurse education | 1–4 weeks | Tutoring
26 | Fidan and Gencel (2022) | Higher education | 51–100 | Non-STEM | General education | 1–4 weeks | Mixed
27 | Jeon (2023) | Primary education | 51–100 | Non-STEM | Language education | 1–4 weeks | Tutoring
28 | Li et al. (2025) | Higher education | 51–100 | Non-STEM | Language education | 1–4 weeks | Tutoring
29 | Alneyadi and Wardat (2024) | Secondary education | 101–150 | STEM | Physics education | Unclear | Tutoring
30 | Liu et al. (2025) | Primary education | 51–100 | Non-STEM | Language education | 5–8 weeks | Mixed
31 | Li et al. (2025) | Higher education | More than 150 | Non-STEM | General education | > 8 weeks | Tutoring
32 | Gasaymeh et al. (2024) | Higher education | 51–100 | Non-STEM | General education | 5–8 weeks | Mixed
33 | Wei et al. (2024) | Primary education | 51–100 | STEM | Science education | > 8 weeks | Mixed
34 | Chen (2023) | Higher education | 51–100 | Non-STEM | Language education | 1–4 weeks | Mixed
35 | Ait Baha et al. (2024) | Secondary education | 51–100 | STEM | Computer science education | 1–4 weeks | Tutoring
36 | Ghaithi and Behforouz (2024) | Higher education | 51–100 | Non-STEM | Language education | 1–4 weeks | Tutoring
37 | Beltozar-Clemente and Díaz-Vega (2024) | Higher education | More than 150 | STEM | Physics education | > 8 weeks | Mixed
38 | Liang et al. (2024) | Primary education | 1–50 | Non-STEM | Art education | 1–4 weeks | Tutoring
39 | Hakim et al. (2024) | Higher education | 101–150 | STEM | Health education | 1–4 weeks | Mixed
40 | Hwang and Zhang (2024) | Secondary education | 51–100 | Non-STEM | Language education | 1–4 weeks | Mixed
41 | Wang et al. (2024) | Higher education | 1–50 | STEM | Computer science education | 5–8 weeks | Mixed
42 | Edwards et al. (2024) | Secondary education | 101–150 | Non-STEM | Science education | 5–8 weeks | Mixed
43 | Ji et al. (2025) | Higher education | 1–50 | STEM | Science education | 5–8 weeks | Tutoring
44 | Zhao et al. (2025) | Secondary education | 51–100 | STEM | Computer science education | 5–8 weeks | Mixed
45 | Yang et al. (2025) | Secondary education | More than 150 | STEM | Computer science education | 1–4 weeks | Mixed
46 | Chen (2025) | Higher education | 1–50 | Non-STEM | Language education | 1–4 weeks | Tutoring
47 | Zhou et al. (2025) | Higher education | 51–100 | Non-STEM | Language education | 5–8 weeks | Mixed
48 | Lee et al. (2025) | Higher education | 1–50 | Non-STEM | Art education | 5–8 weeks | Mixed
49 | Zahran (2025) | Secondary education | 51–100 | Non-STEM | Language education | 5–8 weeks | Mixed
50 | Nusivera and Hikmat (2025) | Higher education | More than 150 | Non-STEM | Language education | Unclear | Mixed
51 | Wang et al. (2025) | Primary education | 51–100 | STEM | Science education | 1–4 weeks | Mixed
52 | Hui et al. (2025) | Higher education | 1–50 | STEM | Health education | 1–4 weeks | Mixed