Exploring the Construct Validity of Integrated Speaking Tasks: The Case of a Large-scale High-stakes Computer-based Listening-Speaking Task

Abstract

Integrated speaking tasks are widely used in large-scale high-stakes tests. However, little is known about their application among low- or intermediate-level learners of English as a second language, such as the test takers of the Guangdong Version of the Computer-based English Listening and Speaking Test of the National Matriculation English Test. This gap is particularly problematic given the substantial impact of integrated speaking tasks on the teaching, learning, and assessment of English in China and internationally. To address this gap, the present study employed bi-factor Exploratory Structural Equation Modeling (ESEM) and hierarchical multiple regression analyses, with data from a sample of 360 participants in a real test, to examine whether and to what extent the test assesses the ability it is intended to assess, as set out in the official test specifications issued by the Education Examinations Authority of Guangdong Province. Findings indicated that the test did measure students’ ability to accomplish certain tasks in specific contexts by acquiring and applying various knowledge sources (e.g. task prompts, encyclopedic knowledge of English and the world, source material, communicative strategies, etc.) in English.

Specifically, parameters from the five domain-specific textual factors and two communicative strategies, extracted from the participants’ oral output, jointly contributed to the participants’ performance in the test, with varying weights across factors. This variability highlights the comprehensiveness and contextual specificity of the test. These findings provide empirical evidence supporting the validity of score interpretations and offer important implications for the teaching, learning, and assessment of integrated speaking tasks in senior high schools in China.

Keywords:

integrated speaking tasks

national matriculation test

CELST

construct validity

1. Introduction

Conceptualizing, defining, and assessing speaking in a reliable and valid way is a challenge (de Jong, 2023; Fan and Yan, 2020), particularly in relation to integrated speaking tasks, which are intended to be performed with reliance on given reading and/or listening resources. Integrated speaking tasks are increasingly endorsed by language testing scholars and educators for their authenticity in approximating real-world communicative practices (Barkaoui et al., 2013; Brown et al., 2005), their contribution to the cultivation of academic multiliteracy, and their role in aligning English for Academic Purposes instruction with disciplinary content demands or designated test constructs (Brown and Ducasse, 2019; Frost et al., 2021; Hirai and Koizumi, 2013). Meanwhile, the advancement of communicative technologies and the reform of assessment formats have led to the emergence of computer-based assessment, particularly computer-based integrated speaking tests, which in turn pose new challenges to construct validation research (Iwashita, 2022, p.130–133), such as whether and to what extent these new test formats faithfully capture their intended constructs.

In integrated speaking tasks, test-takers are required to process and transform a cognitively complex stimulus (e.g. a written text or a lecture) and integrate information from this source into the speaking performance (Brown et al., 2005, p.1; Zhang et al., 2022). Most existing studies on integrated speaking tests have focused on college or university-level English language learners in L2 contexts (Brown and Ducasse, 2019; Frost et al., 2021; Kormos et al., 2022; Suzuki and Kormos, 2023). In contrast, relatively few studies have examined intermediate or low-level ESL learners (Tsang and Lee, 2023; Xu et al., 2019, 2020, 2021, 2023a, 2023b, 2024, 2025; Zhou and Zeng, 2016), despite the fact that a substantial proportion of ESL learners fall into this category. Even fewer studies have explored how these learners perform on computer-based integrated speaking tasks. Therefore, continued validation research that draws on diverse theoretical perspectives and uses a wide range of research methods is needed (Iwashita, 2022, p.139).

To address this research gap, the present study conducted a construct validation study of a large-scale, high-stakes computer-based integrated speaking task, the Guangdong Version of the Computer-based English Listening and Speaking Test (hereafter, CELST) of the National Matriculation English Test. Employing bi-factor exploratory structural equation modeling (hereafter, ESEM) and hierarchical multiple regression analyses, the study aimed to investigate the underlying constructs measured by CELST.

2. Literature review

2.1 Integrated speaking tasks

Currently, integrated speaking tasks are widely used in many large-scale high-stakes tests, such as the TOEFL iBT speaking tests, the Versant™ Speak Test, the TEM-4 Oral Test, the CET-SET, and the CELST. Various theoretical motivations have driven the emergence of integrated speaking tasks. For instance, communicative language ability should target the capacity to implement knowledge in communicative language use (Bachman, 1990); integrated tasks entail the amalgamation of multiple L2 skills (Huang and Hung, 2018). Among the various studies conducted on integrated speaking tasks, those on TOEFL iBT integrated speaking tasks account for a large proportion. They have focused mainly on the comparison between different types of TOEFL iBT integrated speaking tasks and independent speaking tasks (Huang et al., 2018; Zhang et al., 2022) or between TOEFL iBT integrated speaking tasks and other academic speaking tasks (Brown and Ducasse, 2019; Farnsworth, 2013); rating issues (Wei and Liosa, 2015); textual properties and strategy use (Crossley and Kim, 2019; Inoue and Lam, 2021; Zhang et al., 2021); and source material use (Frost et al., 2019; Frost et al., 2021). Apart from the studies conducted on TOEFL iBT integrated speaking tasks, some studies have addressed other integrated speaking tasks, from perspectives including source material use (Kormos et al., 2022; Lin, 2023; Pusey, 2020), rating (Hirai and Koizumi, 2013; Kim, 2015), affective factors (Ishikawa, 2020), strategy use (Rukthong, 2021; Rukthong and Brunfaut, 2020), fluency of oral products (Suzuki and Kormos, 2023), test fairness (Yan et al., 2019), and task acceptance (Zhang and Zhang, 2022).

Compared with the large number of studies on integrated speaking tasks conducted internationally, research on such tasks in the Chinese context remains relatively limited (Jin, 2012; Rui and Ji, 2017; Zeng, 2011), which is out of proportion to China’s substantial population of ESL learners and the great effort devoted to English teaching and learning. Jin (2012), Zeng (2011), and Zhou (2005) examined the influence of source material input in integrated speaking tasks on college students’ oral performance, and the results confirmed its positive impact in relieving anxiety and classroom reticence, facilitating autonomous learning, and promoting oral language output. Meanwhile, Zhang and Elder (2009) examined the reliability and validity of CET-SET from multiple perspectives, namely, authenticity, interactiveness, fairness, and washback effects. Recently, Tsang and Lee (2023) found that foreign language-related emotions (anxiety, boredom, and enjoyment), speaking motivation, and spoken input beyond the classroom were directly connected to Year 3 and Year 4 EFL learners’ speaking proficiency in Hong Kong primary schools, with enjoyment and spoken input beyond the classroom serving as direct predictors. Among the studies in the Chinese context, those most closely related to the present study are the ones on the National Matriculation English Test (Shanghai Version) (hereafter, NMET(SH)) (Hou, 2018; Liu and Chen, 2018; Xu, 2016, 2021; Zhang, 2019), especially when the characteristics of test takers (e.g. overall language ability, number of participants) are taken into consideration. These studies confirmed that the NMET(SH) assessed students’ ability to apply various sources of information in completing the integrated listening-speaking tasks and exerted positive washback effects.

2.2 CELST

Since as early as World War II, political needs have exerted various kinds of influence on the form and scoring of speaking tests (Fulcher, 2003, p.1). China is a case in point. The General Senior High School Curriculum Standards (hereafter, GSHSCS) (Ministry of Education of the PRC, 2020) provide comprehensive guidance on the general teaching goals, content standards, and teaching and assessment approaches for senior high school English courses. According to the GSHSCS (2020), the general aim of the senior high school English curriculum is to help students cultivate and develop their subject core competencies, which include language abilities, cultural awareness, thinking capacity, and learning ability (Ministry of Education of the PRC, 2020, p.5). Meanwhile, students should be able to acquire English learning resources through multiple channels, choose appropriate strategies and methods, and monitor, evaluate, reflect on, and adjust learning content and progress (Ministry of Education of the PRC, 2020, p.6).

CELST, a response to the aim of cultivating senior high school students’ comprehensive English ability, has been implemented since 2011. There are three sub-tasks in CELST, namely, reading aloud, role play, and story retelling. The test is designed for selection purposes. According to the Test Syllabus and Sample Paper Disk for Computer-based English Listening and Speaking Test (EEA-GD, 2016), CELST aims to assess students’ ability to accomplish communicative tasks in specific contexts by using English, acquiring and applying their knowledge of phonology, vocabulary, grammar, etc. to comprehend and express effectively. Over the past decade, the number of candidates taking the CELST has consistently exceeded 630,000 annually. In both 2022 and 2023, this figure reached 710,000, reflecting the sustained demand for English proficiency assessment in Guangdong province’s higher education admissions process. These figures also indicate that a substantive study of CELST is of great significance considering the large candidate population and the competitive nature of the college entrance examination.

Despite its large number of test takers, few studies have been conducted on CELST. Cheng’s (2011) doctoral dissertation focused solely on the validity issues of the story-retelling task in CELST, leaving the first two sub-tasks unexamined. The same is true of the original research conducted by Wang et al. (2018) and Xu and his co-workers (2019, 2020, 2021, 2023a, 2023b, 2024, 2025). Zhan and Wan (2016) investigated Senior III students’ attitudes, test preparation practices, and test-taking processes when completing CELST. Zhou and Zeng (2016) compared the rating results of human raters and computer automated scoring of CELST by using many-facet Rasch models and found that, despite the differences in rater severity between these two scoring approaches, computer automated scoring showed better internal consistency due to lower bias rates.

To sum up, the above reviewed validation studies on integrated speaking tasks are helpful in conceptualizing construct and the validation procedure. However, language test taking is a process that is task-bound and context specific (Cohen, 2014). Test response is a function not only of the items, tasks, or stimulus conditions, but also of the participants’ responding and the context of measurement (Messick, 1987).

In this light, different types of integrated speaking tasks would set different requirements on the amount, degree, and form of information in the listening/video clip that can be integrated into the test-takers’ oral response. Besides, though diverse results have been reported for different types of integrated speaking tests, research on CELST remains scarce, especially on the first two sub-tasks. What is more, most existing studies have largely relied on correlational approaches, such as examining the relationship between textual features and test performance or between strategy use and test performance. Few, if any, have probed its construct validity directly through in-depth analysis of test-takers’ actual performance. On top of that, given that CELST is a large-scale high-stakes test, studies addressing Chinese senior high school EFL learners’ listening-speaking performance in the NMET context are urgently needed, considering the great influence the test exerts on education both nationally and internationally. Therefore, collecting validity evidence for CELST using a more mathematically grounded validation theory and methodology is necessary for the sound and comprehensive interpretation of its test scores and test constructs.

2.3 Models of integrated speaking tasks and their conceptualizations

Theoretical models serve various functions in test development and validation, such as score interpretation, test development, curriculum design, etc. (Luoma, 2004, p.96). The necessity of grounding language test development and application in a theory of language proficiency calls for the incorporation of a theoretical framework that defines what language proficiency is (Bachman, 1990, p.81). Besides, a task-specific theoretical model not only represents and encompasses the task construct but also functions as the foundation of evaluation and assessment (Luoma, 2004, p.107).

Drawing on the various frameworks related to listening and/or speaking, it can be concluded that a model of listening-speaking ability should first take an integrative, interactive, or communicative stand, for language use in both natural and academic contexts is inherently integrative, interactive, or communicative. Meanwhile, it should also serve as a guideline for the assessment of both language product parameters and speaking process parameters: including the former can guide learners’ self-learning, teachers’ teaching, and raters’ rating practices, and including the latter can help examine participants’ cognitive and strategic processes, which in turn helps identify factors influencing task completion (Bachman, 1990). Therefore, an operationalized working model of the construct of CELST will be developed based on the analysis of existing theoretical models, the Test Specifications, and the specific test purposes of CELST (EEA-GD, 2016), which will be presented in Section 2.5.1.

2.4 Theory of validity & validation

Aware of the shift from multiple types of validity to a unitary understanding and from a focus on prediction to one on explanation, Messick (1987) conceptualized validity as a unitary concept and proposed a comprehensive framework consisting of six interrelated aspects, namely, content, substantive, structural, generalizability, external, and consequential validity. Among them, the external aspect of construct validity, which includes convergent and discriminant evidence obtained through multitrait-multimethod comparisons, as well as evidence of criterion relevance and applied utility (Messick, 1987), delineates the extent to which the construct represented in the assessment accounts for the external pattern of correlations.

Besides, Chapelle (1999) argued that a validity argument should integrate both evidence and rationales to support conclusions about the appropriateness and soundness of score-based inferences and uses of a test (p.263) and held that validation should begin with formulating a hypothesis concerning the appropriateness of testing outcomes (p.259). For the external aspect of validity, Chapelle (1999) stated that data analyzed through a multitrait-multimethod research design could be used to examine the relationships between the target test and other tests or quantifiable performance indicators. Earlier, influenced by psychological structuralism and the idea of explaining performance through the systems and subsystems of underlying processes, Embretson (1983) introduced construct modeling into construct validation research. According to construct modeling, the internal structure and substance of a test can be addressed more directly by means of causal modeling of item or task performance. Embretson (1983) suggested that construct validation research should separate two concerns: construct representation, that is, identifying the theoretical mechanisms underlying task performance (a task decomposition process), and nomothetic span, that is, detecting the relationships of a test to other measures (e.g. strength, frequency, and pattern of significant relations). This can be achieved by mathematical modeling, psychological modeling, or multicomponent latent trait modeling, which combines the features of the former two (1983, p.181).

Moreover, the Standards for Educational and Psychological Testing (hereafter the Standards) (AERA et al., 2014) provide criteria for the development and evaluation of tests and testing practices and guidelines for assessing the validity of test score interpretations for their intended uses. According to the Standards, evidence based on test content may involve logical or empirical analyses to determine how adequately the content of a test represents the targeted domain in relation to the proposed interpretations of test scores (AERA et al., 2014, p.14). However, Xu et al. (2024) pointed out that, to date, no public report has examined the content validity of the CELST.

Following the guidance of the Standards, the present study decomposed test takers’ performance into coded construct-relevant textual measures and strategic measures and, employing a multicomponent model based on the testing purpose of CELST, explored to what extent these data-driven measures represent and are related to the specific inferences to be made from test scores.

Fig. 1

Hypothesized Working Model of CELST

(adapted from Chapelle et al., 1997)

2.5 The present study

2.5.1 Working taxonomy of CELST

None of the existing theoretical models of speaking ability assessment is fully appropriate as a theoretical framework for the CELST, considering the integrative nature of the task itself and its target participants and context. Hence, a task-specific theoretical framework of CELST that reflects the task construct and guides evaluation and assessment by offering wordings and providing criteria (Luoma, 2004, p.107) is needed.

An operationalized working model of the construct of CELST (Fig. 1) was worked out based on a series of considerations. These include an analysis of the GSHSCS (Ministry of Education of the PRC, 2020), the test aims issued in the Test Syllabus and Sample Paper Disk for Computer-based English Listening and Speaking Test (EEA-GD, 2016), personal communication with the test designer, a review of the theoretical models related to listening and/or speaking, the notion that combining assessment of the product with assessment of the process that produces it (Bygate, 1987) makes construct representation sounder, and the adoption of the internal operation process from the COE model (Chapelle et al., 1997). Thus, the construct model of CELST is assumed to entail the following components: 1) the context part, on the left of the figure, refers to factors related to task characteristics, candidates, raters, etc. that affect candidates’ performance; 2) the internal operation part, the core of the working model symbolized by the largest block, is the underlying process running through task completion; and 3) performance is the result of oral production (Gist & Britol, 2020).

2.5.2 SEM & Bi-factor ESEM

Structural equation modeling (SEM) is an a priori hypothesis-testing approach based on theoretical knowledge or previous research findings. It examines the nature of the relationships between the observed variables and the latent variables by using a confirmatory and hypothesis-testing approach (Bollen, 1989; Byrne, 2013) and has been widely used in education and psychology, including the field of language testing (Dörnyei, 2007; Phakiti, 2008). Powerful as it is, traditional SEM has met a bottleneck. For example, fixing factor loadings to zero has led to problems of believability and replicability in application (Asparouhov and Muthén, 2009).

Exploratory Structural Equation Modeling (ESEM), a relatively novel approach, has recently emerged as a promising alternative to overcome some of these challenges (van Zyl and ten Klooster, 2022). ESEM is an exploratory factor analysis measurement model with rotations used in a structural equation model and functions as a combination of confirmatory factor analysis and exploratory factor analysis (Marsh et al., 2009, 2010). All the SEM parameters are accessible in ESEM, such as residual correlations, regressions of factors on covariates, regressions among factors, standard errors for all rotated parameters, and overall model fit indices (Asparouhov and Muthén, 2009, p.398–399). Meanwhile, bifactor models test the presence of a global unitary construct underlying the answers to all items (G-factor) and whether this global construct co-exists with meaningful specificities (S-factors) defined by the part of the items not explained by the G-factor (Swami et al., 2023). Since one of the contexts in which bi-factor ESEM can be used is when no clear a priori structure exists (Swami et al., 2023), the present paper employs a bi-factor ESEM, a more flexible, data-driven approach, to examine the extent to which the data, as reflected by the test scores and coded values, align with the proposed theoretical construct of CELST.

2.5.3 Research questions

As mentioned above, participants’ performances as well as the strategic behaviors involved illustrate the construct of integrated speaking tasks. However, previous studies did not involve a large number of ESL learners of low to intermediate English language ability, nor did they test these learners’ performance on a computer-based integrated speaking task such as the CELST. To bridge this gap, the present study drew on bi-factor ESEM and hierarchical multiple regression techniques to identify to what extent the CELST taps its intended construct. Specifically, we addressed the following two research questions.

To what extent do participants’ performances activate and reflect the textual measures in the construct of CELST?

To what extent do participants’ performances activate and reflect the communicative strategy use in the construct of CELST?

3. Method

3.1 Participants

3.1.1 Student participants

With the assistance of the Education Examinations Authority of Guangdong Province (EEA-GD), performance data from a total of 360 students in the CELST were included. To protect participants’ privacy, EEA-GD provided only the overall score and detailed subtask scores in anonymized form, without any personal information such as names or student IDs. Influential factors such as overall oral English proficiency level (Brown et al., 2005; Iwashita et al., 2008; Swain et al., 2009) and geographical region (Jin and Wu, 2010) were taken into consideration when sampling participants. Firstly, for convenience, only Test A was sampled, although six parallel tests were used in that year. Secondly, all students were stratified into three groups (high, medium, and low) based on their total score reported by EEA-GD. Students whose CELST scores ranked in the top 20% were assigned to the high-level group, those in the bottom 50% were assigned to the low-level group, and the remaining students were placed in the medium-level group. This stratification criterion corresponds to the enrollment differences among key universities, higher vocational colleges, and general colleges or universities. The participants of each proficiency group were stored in a separate file. Thirdly, all the participants in each proficiency-level file were classified into four geographical region groups, namely north Guangdong, east Guangdong, west Guangdong, and the Pearl River Delta. Finally, 30 participants were sampled randomly from each geographical region group at each proficiency level, giving 360 participants in total (30 × 4 × 3 = 360). Table 1 describes the composition of the student participants.
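The sampling design described above can be illustrated with a short sketch. It is not the procedure used by EEA-GD or the authors; the candidate pool, the score range, and the function names `assign_level` and `stratified_sample` are hypothetical, and only the band cut-offs (top 20%, bottom 50%) and the cell size of 30 come from the text.

```python
import random
from collections import Counter

REGIONS = ["north Guangdong", "east Guangdong", "west Guangdong", "Pearl River Delta"]

def assign_level(rank, n_total):
    """Map a 0-based score rank (0 = highest score) to a proficiency band."""
    percentile = rank / n_total
    if percentile < 0.20:      # top 20% of candidates
        return "high"
    if percentile >= 0.50:     # bottom 50% of candidates
        return "low"
    return "medium"            # everyone in between

def stratified_sample(candidates, per_cell=30, seed=0):
    """Draw per_cell candidates at random from every region x level cell."""
    rng = random.Random(seed)
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    cells = {}
    for rank, cand in enumerate(ranked):
        level = assign_level(rank, len(ranked))
        cells.setdefault((cand["region"], level), []).append(cand)
    sample = []
    for region, level in sorted(cells):
        for cand in rng.sample(cells[(region, level)], per_cell):
            sample.append({**cand, "level": level})
    return sample

# Hypothetical pool: 4,000 anonymous candidates with made-up scores.
_rng = random.Random(42)
pool = [{"id": i, "region": REGIONS[i % 4], "score": _rng.uniform(0, 60)}
        for i in range(4000)]

sample = stratified_sample(pool)
cell_counts = Counter((c["region"], c["level"]) for c in sample)
print(len(sample), len(cell_counts))
```

With 4 regions and 3 proficiency bands, the sketch produces 12 cells of 30 candidates each, which matches the 30 × 4 × 3 layout of Table 1.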

3.1.2 Transcribers and coders

Apart from the first author of the present study, another seven M.A. students majoring in English language testing participated in the transcription work. All of them had at least two years of experience in rating CELST before doing the present transcription work. Training and trial transcription were conducted before formal transcription. Ten recordings for the pilot transcription were sampled randomly from students’ performances in the CELST conducted one year earlier than the present study. Besides, forty of the final transcriptions were randomly selected and double-checked by the second author and another Ph.D. candidate in English language testing.

The coding work of the present study was divided into two parts. One part was done manually with the help of WinMax to collect data on textual and communicative strategy use measures, and the other part was done automatically with WordSmith Tools, Praat, and the online resources of Lu Xiaofei (http://aihaiyang.com/software/). Apart from the seven M.A. transcribers mentioned above, another two Ph.D. candidates in English language testing, one in her first year and the other in her second year, participated in the first part of the coding work. All nine coders were trained together on how to code with WinMax. Following the approach of Brown et al. (2005), thirty-six samples were double-coded to ensure the reliability of the coding results. The second part of the coding work was done by the first author and then checked by the second author for issues such as missing data and input accuracy.

Table 1
A Description of the Student Participants
Region / Proficiency	high	medium	low	total
north Guangdong	30	30	30	90
east Guangdong	30	30	30	90
west Guangdong	30	30	30	90
Pearl River Delta	30	30	30	90
Total	120	120	120	360

3.2 Instruments and materials

3.2.1 Integrated speaking tasks

The oral data were collected from the three subtasks of Test A of CELST in a certain year. Figure 2 presents the screen interface of CELST. Appendix A presents the details of Test A of CELST used in the present study.

Fig. 2

A presentation of the interface of CELST

3.2.2 Rating rubrics

The rating rubrics for CELST (Appendix B) were used in the present study not for scoring performance, as they usually are, but to help establish the working framework of the construct of the CELST and the coding taxonomy.

3.2.3 Coding scheme

Test takers’ performances were coded from the perspectives of oral product textual measures and communicative strategies used as presented in Fig. 1.

The final framework of test takers’ oral product textual measures comprises 25 coding items under five aspects, namely, accuracy, complexity, coherence, fluency, and phonology, while their strategy use was assessed from the reduction and achievement perspectives. The detailed variables that measure test takers’ oral products and strategy use are listed in Appendix C.

4. Results and Discussions

4.1 Preliminary analysis

Firstly, the first author of this paper rechecked whether the total scores, the sub-scores for each subtask, and the corresponding video tape recordings were all included, and confirmed that they were. Secondly, during data checking, it was found that the frequencies of several coding parameters were zero for all participants. Thus, the parameter number of internals was deleted due to zero occurrence across all three subtasks. Thirdly, the observed variable SpM was dropped due to its high correlation with the observed variable MSpM, which is more linguistically informative.

The inter-coder reliability values (Cronbach’s α) for all parameters were over 0.8 (see Appendix D), indicating that the two coders were consistent with each other in the coding process. This further supported the reliability of the coding results in the main analysis.
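As a reminder of what the reliability coefficient reported above summarizes, the following is a minimal sketch of Cronbach’s α with each coder treated as one "item". The ratings are made up for illustration and the function name `cronbach_alpha` is hypothetical; the sketch does not reproduce the values in Appendix D.

```python
from statistics import pvariance

def cronbach_alpha(ratings_by_coder):
    """Cronbach's alpha, treating each coder's ratings as one 'item'.

    ratings_by_coder: list of equally long lists, one list per coder,
    holding that coder's value for each double-coded sample.
    """
    k = len(ratings_by_coder)
    # Sum of the variances of each coder's ratings.
    item_var = sum(pvariance(r) for r in ratings_by_coder)
    # Variance of the per-sample totals across coders.
    totals = [sum(vals) for vals in zip(*ratings_by_coder)]
    total_var = pvariance(totals)
    return k / (k - 1) * (1 - item_var / total_var)

# Made-up counts of one coded parameter for six samples, two coders.
coder_a = [3, 5, 2, 8, 6, 4]
coder_b = [3, 6, 2, 7, 6, 5]
alpha = cronbach_alpha([coder_a, coder_b])
print(round(alpha, 3))
```

When the two coders agree perfectly the coefficient equals 1, and it falls as their codings diverge, which is why values above 0.8 are read here as evidence of consistent coding.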

4.2 The competence model of CELST

Based on the results of previous literature on integrated speaking assessment and second language assessment, a set of indexes was used in several alternative hypothesized models of CELST to explore the relationship between the theoretical model proposed for CELST and the working construct (see Fig. 3) from the perspective of language competence parameters, addressing Research Question 1. As stated in Section 3.2.3, five factors, namely, accuracy, complexity, coherence, fluency, and phonology, were employed as latent factors in the competence model of CELST. Initially, the researchers assessed the model fit indexes of the hypothesized competence Model 1 using traditional second-order SEM. However, the analysis failed to converge because the maximum number of iterations was exceeded.

Fig. 3

A hypothesized competence model of CELST (Second-order SEM) (Model 1)

A detailed inspection of the data found that most values of CpT were smaller than 2 (CpT appeared less than twice in each participant’s performance), suggesting that CpT is inappropriate as a measure for differentiating senior high school students’ oral proficiency. Iwashita et al. (2008: 45) also raised doubts about using ratio measures to represent complexity. The reason is that, as participants’ overall oral language proficiency goes up, NoT and NoC in their oral production increase, but the value of the ratio does not. Besides, the negative correlation between CpT and the total score (-.018) also runs counter to expectation, because a higher CpT usually indicates higher overall language proficiency and a higher score. Therefore, the observed variables NoC and NoT were used instead of CpT and MLT to measure language complexity. In addition, the variable RMS was discarded in the new model due to its non-significant relationship with the total score (-.063). However, the new model still could not converge. An examination of the observed variables in the new model led to the deletion of those whose correlation with the total score was smaller than 0.3, specifically DysM and RRSU. Thus, a hypothesized bi-factor ESEM (Model 2) (see Fig. 4), instead of the traditional SEM, was proposed based on the data feedback from Model 1.
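The screening step described above, dropping observed variables that correlate weakly with the total score, can be sketched as follows. The data are invented, the indicator `toy_weak` and the function names are hypothetical, and comparing the absolute value of the correlation with the 0.3 cut-off is an assumption, since the text does not say how negatively correlated indicators were treated.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def screen_indicators(indicators, total, cutoff=0.3):
    """Split indicators by whether |r| with the total score reaches the cutoff."""
    kept, dropped = {}, {}
    for name, values in indicators.items():
        r = pearson_r(values, total)
        # Assumption: the magnitude of r is what matters for retention.
        (kept if abs(r) >= cutoff else dropped)[name] = r
    return kept, dropped

# Invented total scores and indicator values for six test takers.
total_score = [20, 25, 31, 38, 44, 52]
indicators = {
    "NoT": [4, 5, 7, 8, 9, 12],       # rises with the total score
    "toy_weak": [3, 1, 2, 3, 1, 2],   # hypothetical weak indicator
}
kept, dropped = screen_indicators(indicators, total_score)
print(sorted(kept), sorted(dropped))
```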

Results of bi-factor ESEM indicated that the model (Fig. 4: Model 2) fits the data well. The Chi-square in the present model was 116.489, with 85 degrees of freedom. χ²/df was 1.37, indicating a good fit between the present model and the data. The value of RMSEA and SRMR were much smaller than 0.5, also indicating well model fit. Hitherto, all these three absolute model fit indexes demonstrated that Model 2 fits the present data well. CFI and TLI are two indexes of comparative fit. The higher these two values are, the more fit the model would be. They both range from 0 to 1 and a value larger than 0.9 indicates well model fit. These two model fit indexes also indicated that Model 2 fits the present data well. Besides, almost all observed variables loaded on the general factor, listening-speaking ability, and several of them clustered and loaded on one domain specific factor such as accuracy, complexity, coherence, fluency and phonology. Such kind of loading pattern confirmed the use of exploratory bi-factor ESEM analysis of the present data. However, probing into the details, some minor issues still deserve further in-depth speculations. For one thing, RMP (ratio of mispronounced phonemes) loaded on overall listening-speaking ability with a positive value, which is against the theory. As is known to all, the more mistakes one makes in terms of phonemes mispronounced, the poorer one’s oral English would be. For another thing, RUiF (0.674) did work effectively on its a domain specific factor, phonology, but not on the overall general listening-speaking ability. Finally, the factor loading of PRO on its domain specific factor was relatively lower (-0.204) than the accepted value (0.3) symbolizing meaningful causal relationship between observed variable(s) and latent variable, while its contribution to the general listening-speaking ability factor was rather well (-0.747). Therefore, modifications were to be made. Firstly, the index RMP was discarded. 
Even though the factor loadings in the modified model (Model 2.1) were somewhat better than those in Model 2, RUiF still did not load on general listening-speaking ability, which runs against the assumption of the bi-factor ESEM analysis. In addition, PRO still loaded weakly, though significantly, on phonology (−0.207). The modification indexes, however, did not suggest any modification involving PRO. Thus, RUiF was deleted in Model 2.2 to see whether removing this observed variable would yield better model fit and interpretability.

As can be seen in Fig. 5 (Model 2.2), all observed indicators in the new model loaded on overall listening-speaking ability and on their own domain-specific latent factors. Overall, the finalized model showed a good level of model-to-data fit (χ² = 79.97, df = 60; χ²/df = 1.33; RMSEA = 0.04; SRMR = 0.02; CFI = 1.00; TLI = 0.976). All factor loadings of the observed variables on their corresponding latent variables were significant at the 5% level.
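The χ²/df values reported for Model 2 and Model 2.2 can be reproduced with a short sketch. The chi-square and df figures are those given in the text, and the cut-offs (below 2 for good fit, between 2 and 5 for acceptable fit) follow the paper's note on fit indexes; the function names and the label "poor" for values of 5 or more are assumptions made for illustration.

```python
def chi_square_ratio(chi_square: float, df: int) -> float:
    """Return the relative chi-square (chi-square divided by degrees of freedom)."""
    if df <= 0:
        raise ValueError("df must be positive")
    return chi_square / df


def describe_ratio(ratio: float) -> str:
    """Label a chi-square/df ratio using the cut-offs cited in the paper's note."""
    if ratio < 2:
        return "good"
    if ratio < 5:
        return "acceptable"
    return "poor"  # label assumed here; the note only defines the two bands above


model_2 = chi_square_ratio(116.489, 85)    # Model 2
model_2_2 = chi_square_ratio(79.97, 60)    # Model 2.2

print(round(model_2, 2), describe_ratio(model_2))
print(round(model_2_2, 2), describe_ratio(model_2_2))
```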

Fig. 4

A hypothesized bi-factor ESEM model of CELST (Model 2)

Fig. 5

A hypothesized bi-factor ESEM model of CELST (Model 2.2)

4.3 Strategy use in CELST

As illustrated in Section 2.4.1 and Section 3.2.3, the communicative strategies adopted in the present study consist of two macro-parameters, namely reductive strategies and achievement strategies, representing two different model approaches. Hierarchical multiple regression is used when researchers enter the independent variables cumulatively in a hierarchy specified on theoretical grounds (Gaciu, 2021; Pallant, 2020). Therefore, hierarchical multiple regression was employed to examine whether and to what extent the two macro-parameters of communicative strategies used by participants influenced and predicted their overall performance, as well as their performance on individual aspects. The assumptions of multiple linear regression, including sample size, multicollinearity and singularity, outliers, normality, and linearity, were all checked and met. Firstly, the sample of 360 participants met the sample size requirement for multiple linear regression (N ≥ 50 + 8m, where m is the number of independent variables) (Tabachnick & Fidell, 2013). Secondly, with all Pearson correlation values falling between −0.2 and 0.3, Tolerance values above 0.10, and VIF values well below the cut-off of 10, multicollinearity was not a concern. Thirdly, visual inspection of the normal probability plot of the regression standardized residuals confirmed the normality of the data (see Appendix E).
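Two of these checks reduce to simple arithmetic, sketched below. The sample-size rule is N ≥ 50 + 8m (Tabachnick & Fidell, 2013), applied with the six strategy predictors listed in Table 2. Tolerance and VIF are computed from the standard definitions (tolerance = 1 − R²ⱼ, VIF = 1 / tolerance), where R²ⱼ is the R square obtained when predictor j is regressed on the other predictors. The value 0.09 used for R²ⱼ is hypothetical and is not a figure reported in the study.

```python
from typing import Tuple


def minimum_sample_size(num_predictors: int) -> int:
    """Rule of thumb N >= 50 + 8m for multiple regression."""
    return 50 + 8 * num_predictors


def tolerance_and_vif(r_squared_j: float) -> Tuple[float, float]:
    """Tolerance and VIF for one predictor, given its R^2 on the other predictors."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("r_squared_j must lie in [0, 1)")
    tolerance = 1.0 - r_squared_j
    return tolerance, 1.0 / tolerance


required_n = minimum_sample_size(6)   # MA, SR, App, GUS, PAR, RES
sample_is_adequate = 360 >= required_n

# Hypothetical R^2_j of 0.09, chosen only to illustrate the computation.
tolerance, vif = tolerance_and_vif(0.09)

print(required_n, sample_is_adequate)
print(round(tolerance, 2), round(vif, 2))
```

With six predictors the rule requires at least 98 cases, so 360 participants comfortably satisfies it.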

As shown in Table 2, the R square of the reductive model is 0.065. The value rose to 0.168 after the four achievement strategy parameters were entered; that is, the R square change was 0.103. In other words, the communicative strategy model explained only 16.8 per cent of the variance in participants’ scores. Results from the ANOVA indicated that the model was statistically significant (F = 33.35, Sig. = .00). Detailed inspection of the coefficients showed that, apart from App, the other five parameters all made significant unique contributions to the prediction of participants’ overall CELST score (Sig. < .05).

Table 2
Multiple regression analysis of communicative strategy variables and score
	Model summary^c				ANOVA		Coefficients			
Model	R	R Square	Adjusted R Square	R Square Change	F	Sig.	Predictor	Standardized β	t	Sig.
1	.25^a	.07	.06	.07	37.27	.00	MA	−.17	−5.89	.00
							SR	−.18	−5.98	.00
2	.41^b	.17	.16	.10	33.35	.00	MA	−.21	−7.16	.00
							SR	−.22	−7.87	.00
							App	.00	.16	.87
							GUS	.07	2.45	.02
							PAR	.09	3.17	.00
							RES	.27	9.05	.00
a. Predictors: (Constant), MA, SR
b. Predictors: (Constant), MA, SR, App, GUS, PAR, RES
c. Dependent Variable: score
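The logic of entering predictors in blocks and reading off the R square change can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' analysis: the eight observations are invented, the labels MA, SR, and RES are borrowed from Table 2 only as names, and the least-squares fit via the normal equations is one simple way to obtain R square.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(predictors: Sequence[Sequence[float]], y: Sequence[float]) -> float:
    """R^2 of an OLS fit of y on the predictor columns plus an intercept."""
    n = len(y)
    columns = [[1.0] * n] + [list(col) for col in predictors]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, columns)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented toy data (not from the study).
ma = [0, 1, 2, 0, 1, 3, 2, 0]            # block 1: reductive strategy counts
sr = [1, 0, 2, 1, 3, 2, 0, 1]
res = [2, 1, 0, 3, 1, 0, 2, 4]           # block 2: one achievement strategy count
score = [14, 13, 8, 16, 10, 7, 12, 18]

r2_step1 = r_squared([ma, sr], score)        # block 1 only
r2_step2 = r_squared([ma, sr, res], score)   # block 1 plus block 2
r2_change = r2_step2 - r2_step1              # variance added by block 2

print(round(r2_step1, 3), round(r2_step2, 3), round(r2_change, 3))
```

Because the second model nests the first, its R square cannot be lower; the difference is the share of variance attributable to the block entered second, which is how the R Square Change column of Table 2 is read.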

5. Discussion

5.1 Textual measures perspective

This loading pattern (see Fig. 5: Model 2.2) confirms that CELST does examine participants’ ability to fulfill three different types of listening-speaking tasks by applying their knowledge of accuracy, complexity, coherence, fluency, and phonology. This accords with its test purpose of testing students’ ability to accomplish tasks in specific contexts by acquiring and applying their knowledge of English phonetics, vocabulary, grammar, etc. to comprehend and express meaning (EEA-GD, 2016). The relatively high loadings on general listening-speaking ability also show that CELST implements the GSHCSC requirement that senior high school English courses should stress cultivating students’ ability to acquire and process information in English and to analyze and solve problems in English, especially the ability to think and express themselves in English, thereby developing students’ overall ability to use English.

In addition, the different numbers of observed indicators for the five domain-specific factors and their different weights not only manifest the comprehensiveness of CELST but also indicate that its emphasis on the dimensions of language ability might differ across sub-tasks, that is, the contextual specificity of its three sub-tasks. As shown in Model 2.2, accuracy is loaded by REFC, RCVF, and RRSC; complexity by TTR, NoC, and NoT; fluency by MSpM, AR, MLR, and DysM; phonology by PRA, PRO, and PR; and coherence by COM. Each of the loadings on accuracy and fluency exceeds 0.4, with some larger than 0.7, indicating strong predictive power of these textual measures for accuracy and fluency. For complexity and coherence, however, the picture is less robust: the loadings of TTR and COM on their corresponding domain-specific factors are just over 0.3, a borderline value. Moreover, the factor loading of RPO on phonology is only −0.208. These different loading values suggest, to some extent, that CELST gives different weights to the five dimensions because it aims to test participants’ ability to obtain and apply information from diverse source materials and their background knowledge to accomplish given tasks. This might also be attributed to the integrative nature of the listening-speaking task (Brown et al., 2005; Farnsworth, 2013).

Meanwhile, the three sub-tasks (reading aloud, role play, and story retelling) and their corresponding rating criteria might also play a role. The different requirements for senior high school students’ language ability development in terms of grammar, vocabulary, pronunciation, etc. for listening ability, speaking ability, and affective attitude (see Section 2.2) are another factor that could account for the different weights. It should be noted that phoneme omission or swallowing, particularly the omission of medial or final consonants in words, is common among students from Guangdong when speaking English (Xu et al., 2025). Such consonant simplification may be related to their dialect, Cantonese, which does not have complex consonant clusters at the end of syllables and tends to rely on tonal pitch changes rather than on consonants and their endings.

Therefore, in a sense, the significant negative loadings of RPO on phonology and listening-speaking ability indicate that raters can recognize and evaluate participants’ phoneme omissions effectively, which is consistent with the findings of Xu et al. (2025).

Moreover, the unidimensional factor loadings of RRSU, ADD, and TEM on general listening-speaking ability and the components of the domain-specific complexity factor deserve exploration. Whether RRSU should be treated as a parameter of accuracy or as a factor representing content remains an open question (Zhou, 2005; Brown et al., 2005). The high and unidimensional loading of RRSU (0.965) on general listening-speaking ability shows that RRSU is an indispensable and decisive variable in accounting for overall listening-speaking proficiency; however, it should not be regarded as a part of accuracy. As for the coding parameters of coherence, such as ADD and TEM, their unidimensional loading could be due to the fact that the reading aloud task only requires test takers to read after the video clip with appropriate pronunciation and intonation, so no cohesive devices are needed in this task. As a result, when CELST is considered as a whole, only the COM variable clustered on the domain-specific coherence factor. The results of Model 2.2 also indicate that ratio measures, such as CpT, are inappropriate for assessing L2 learners’ oral production, which gives empirical support to the doubts raised by Iwashita et al. (2008, p.45). Meanwhile, the high factor loadings of NoC and NoT on both the general listening-speaking ability factor (NoC: 0.694; NoT: 0.729) and the domain-specific complexity factor (NoC: 0.692; NoT: 0.447) indicate that NoC and NoT could be used as measures of textual complexity.

5.2 Communicative strategy use perspective

Results of the hierarchical multiple regression analysis demonstrated that the six indices together explained 16.8% of the score variance. Though not large in absolute terms, this value is considerable given that language competence was not included in the regression model. On the one hand, the four achievement strategy predictors contributed 10.3 percentage points of the total 16.8%, indicating that, when they have difficulty expressing themselves, senior high school participants prefer achievement strategies such as guessing, paraphrasing, and restructuring to reductive strategies. This is quite different from Yang’s (2009) finding that participants tended to rely more on strategies like verbatim source use, which was considered unfavorable for their language development. It might also be because participants in the present study were well trained before taking such a large-scale high-stakes test, so they were familiar with how their performances would be assessed. Hence, to avoid being assessed poorly, they would avoid the reductive strategies of message abandon and semantic reduction. Detailed inspection of the coding data showed that no participant used message abandon in the first sub-task (reading aloud), which requires test takers to repeat immediately what the computer has just broadcast, with the text presented on the computer screen.

On the other hand, approximation was the only variable that did not contribute significantly to score variance, compared with the other five communicative strategy variables (see Table 2; t = .16, Sig. > .05). Swain et al. (2009, p.23) also reported that approximation was the communicative strategy used least by participants. According to Fulcher (2003) and Swain et al. (2009), approximation refers to using strategies such as lexical substitution, over-generalization, and exemplification to replace a word that is unknown or cannot be recalled. The vocabulary and expressions given in the source material of the three sub-tasks in CELST probably facilitated participants’ oral production by helping them acquire words and expressions (Brown et al., 2005; Frost et al., 2021).

6. Conclusions

Employing a convergent mixed-methods design under the guidance of Messick’s unitary concept of construct validity (Messick, 1987), the present research explored whether and to what extent CELST tests the language competences it claims to test. The results indicated that CELST does test students’ ability to accomplish certain tasks in specific contexts in English by acquiring and applying various sources of knowledge, such as their encyclopedic knowledge of English (accuracy, complexity, coherence, fluency, and phonology) and the world, source material content, and communicative strategies (message abandon, semantic reduction, approximation, guessing, paraphrasing, restructuring). This test construct is similar to that of the integrated speaking tasks in TOEFL as described in Brown et al. (2005).

That is to say, integrated speaking tasks, or CELST in the present study, examine participants’ ability to acquire information from various source materials and their encyclopedic background knowledge and then use it to fulfill oral output tasks (reading aloud, role play, story retelling) under the guidance of the task prompt and with the help of their communicative strategic knowledge (see Fig. 1), which is consistent with the teaching goals set out in the GSHCSC (Ministry of Education of the PRC, 2020).

However, this study has several limitations that should be carefully considered when interpreting or generalizing the findings. Firstly, only participants’ performances were examined in the present study, not the factors of the task and the candidate, that is, the left part of the theoretical model of CELST (see Fig. 1), which limits the generalizability of the results. Secondly, other approaches to data collection, such as interviews or questionnaires on participants’ self-reported task completion processes or senior high school teachers’ feedback, are worth exploring, despite the irreplaceability of the authentic NMET data used in the present study.

Keeping these limitations in mind, the researchers interpret the findings and propose implications for L2 theory and practice. Theoretically, the context-bound, task-specific model adapted from Chapelle et al. (1997) brings the integrative view of listening-speaking tasks into CELST. Task characteristics and candidates’ individual differences together shape how candidates draw on their language competence, communicative strategies, and world knowledge in the performances they produce. Data on the influence of task characteristics and of other participant factors, such as geographical region, gender, and motivation (Tsang and Lee, 2023), could be collected to gather evidence based on relations to other variables and to examine the generalizability of CELST (AERA et al., 2014). Methodologically, the bottom-up research paradigm, a discourse-based approach, together with bi-factor ESEM can complement existing methods of construct validation in applied linguistics. Future comparative studies across the three sub-tasks of CELST concerning their task-specific constructs, as well as the psychological processes during task fulfillment, would enrich research on CELST based on internal structure and response processes (AERA et al., 2014). Pedagogically, the contributing coding parameters of the specific factors and their loading values differed considerably from one another, indicating that different types of integrated speaking tasks should be adopted for different educational purposes. For example, the reading aloud and role play tasks might not be suitable for testing students’ ability to achieve textual coherence, and it would be wiser to assess senior high school students’ oral language proficiency by how many clauses or T-units they produce rather than by how many clauses or T-units they produce per sentence.

Declarations

Funding

The authors declare that no financial support was received for the conduct of this study, the preparation of this manuscript, or its publication. This research was carried out independently without any external funding from public, commercial, or non-profit organizations.

Ethics Approval

The study was reviewed and given approval by the Ethics Review Committee of School of Foreign Languages, Southern Medical University (No. 20230910) on September 10, 2023.

This study was conducted in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

The study “Exploring the Construct Validity of Integrated Speaking Tasks: The Case of a Large-scale High-stakes Computer-based Listening-Speaking Task” collected the performance data of 360 senior high school students in Test A of the Guangdong Version of the Computer-based English Listening and Speaking Test of the National Matriculation English Test (hereafter, CELST). All the data were provided by the Education Examinations Authority of Guangdong Province (hereafter, EEA-GD), since the first author participated in a Guangdong provincial educational and scientific research program (TJW2013001) that aimed to examine the reliability and validity of the automatic scoring of CELST. In line with provincial regulations and institutional policies, written (signed) consent was not required. EEA-GD provided only the overall score and detailed subtask scores in an anonymous manner, without any personal information such as names or student IDs. Anonymity was rigorously guaranteed to ensure that all collected data would be used solely for academic research purposes. All participants in this provincial program are required to keep the obtained data confidential and to use them only for scientific research purposes.

Additionally, we applied for an exemption from informed consent, which was approved by the Ethics Committee at the School of Foreign Languages, Southern Medical University.

Data Availability

Due to an agreement with the Education Examinations Authority of Guangdong Province, the data used and/or analyzed during the current study are restricted to research purposes only and cannot be made publicly available.

Competing interest

The authors declare that they have no competing interests.

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

Author Contribution

CRediT author statement. Zhou: conceptualization; data collection; formal analysis; writing (original draft); revising. Bin: formal analysis; revising. Zhang: supervision; review; revising and editing.

References

American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014) Standards for Educational and Psychological Testing. American Educational Research Association, Washington, DC

Asparouhov T, Muthén B (2009) Exploratory structural equation modeling. Struct Equation Modeling: Multidisciplinary J 16(3):397–438. https://doi.org/10.1080/10705510903008204

Bachman LF (1990) Fundamental considerations in language testing. Cambridge University Press, Cambridge, UK

Barkaoui K, Brooks L, Swain M, Lapkin S (2013) Test-takers’ strategic behaviors in independent and integrated speaking tasks. Appl Linguist 34(3):304–324. https://doi.org/10.1093/applin/ams046

Bollen KA (1989) Structural equations with latent variables. Wiley, New York, NY

Brown A, Ducasse AM (2019) An equal challenge? Comparing TOEFL iBT™ Speaking Tasks with Academic Speaking Tasks. Lang Assess Q 16(2):253–270. https://doi.org/10.1080/15434303.2019.1628240

Brown A, Iwashita N, McNamara T (2005) An examination of rater orientations and test-taker performance on English-for-academic purposes speaking tasks. TOEFL Monograph Series MS-29. Educational Testing Service, Princeton, NJ

Bygate M (1987) Speaking. Oxford University Press, Oxford, UK

Byrne BM (2013) Structural equation modeling with Mplus: Basic concepts, applications, and programming. Routledge. https://doi.org/10.4324/9780203807644

Chapelle CA (1999) Validity in language assessment. Annu Rev Appl Linguist 19:254–272. https://doi.org/10.1017/S0267190599190135

Chapelle CA, Grabe W, Berns M (1997) Communicative language proficiency: Definition and implications for TOEFL 2000. TOEFL Monograph Series. MS-10. Educational Testing Service

Cheng FX (2011) Justifying the interpretations about a Listening-to-retell task in CELST in NMET(GD). Guangdong University of Foreign Studies, Guangzhou, China

Cohen AD (2014) Strategies in learning and using a second language. Longman

Crossley SA, Kim YJ (2019) Text integration and speaking proficiency: Linguistic, individual differences, and strategy use considerations. Lang Assess Q 16(2):217–235. https://doi.org/10.1080/15434303.2019.1628239

de Jong NH (2023) Assessing second language speaking proficiency. Annual Reviews Linguistics 9:541–560. https://doi.org/10.1146/annurev-linguistics-030521052114

Dörnyei Z (2007) Research methods in applied linguistics. Oxford University Press, Oxford, UK

Education Examinations Authority of Guangdong Province (2016) Test syllabus and sample paper disk for Computer-based English Listening and Speaking Test (CELST) of National Matriculation English Test (Guangdong Version). Guangzhou. Guangdong Pacific Electronic, China

Embretson S (1983) Construct validity: Construct representation versus nomothetic span. Psychol Bull 93(1):179–197. https://www.researchgate.net/publication/289963742

Fan J, Yan X (2020) Assessing speaking proficiency: A narrative review of speaking assessment research within the argument-based validation framework. Front Psychol 11:330. https://doi.org/10.3389/fpsyg.2020.00330

Farnsworth TL (2013) An investigation into the validity of the TOEFL iBT speaking test for international teaching assistant certification. Lang Assess Q 10(3):274–291. https://doi.org/10.1080/15434303.2013.769548

Frost K, Clothier J, Huisman A, Wigglesworth G (2019) Responding to a TOEFL iBT integrated speaking task: Mapping task demands and test takers’ use of stimulus content. Lang Test 37(1):133–155. https://doi.org/10.1177/0265532219860750

Frost K, Wigglesworth G, Clothier J (2021) Relationships between comprehension, strategic behaviours and content-related aspects of test performances in integrated speaking tasks. Lang Assess Q 18(2):133–153. https://doi.org/10.1080/15434303.2020.1835918

Fulcher G (2003) Testing second language speaking. Routledge

Gaciu N (2021) Understanding quantitative data in educational research. SAGE

Gist CD, Bristol TJ (eds) (2020) Fairness in Educational and Psychological Testing. American Educational Research Association, Washington DC

Hirai A, Koizumi R (2013) Validation of empirically derived rating scales for a story retelling speaking test. Lang Assess Q 10(4):398–422. https://doi.org/10.1080/15434303.2013.824973

Huang HD, Hung SA (2018) Investigating the strategic behaviors in integrated speaking assessment. System 78(1):201–212. https://doi.org/10.1016/j.system.2018.09.007

Huang HD, Hung SA, Plakans L (2018) Topical knowledge in L2 speaking assessment: Comparing independent and integrated speaking test tasks. Lang Test 35(1):27–49. https://doi.org/10.1177/0265532216677106

Hou YP (2018) A study on the washback effect of the reform of SHNMET listening and speaking test. TEFLE 183(05):25–31

Inoue C, Lam DMK (2021) The effects of extended planning time on candidates’ performance, processes, and strategy use in the lecture listening-into-speaking tasks of the TOEFL iBT® test (TOEFL Research Report No. RR-93). Princeton, NJ: Educational Testing Service. https://doi.org/10.1002/ets2.12322

Ishikawa S (2020) Influence of learner attributes on complexity, accuracy, and fluency in English oral outputs of Japanese learners. In: Mentz O, Papaja K (eds) Focus on language: Challenging language learning and language teaching in peace and global education. LIT, pp 43–68

Iwashita N (2022) Speaking assessment. In: Derwing TM, Munro MJ, Thomson RI (eds) The Routledge handbook of second language acquisition and speaking. Routledge, New York, NY, pp 130–140

Iwashita N, Brown A, McNamara T, O’Hagan S (2008) Assessed levels of second language speaking proficiency: How distinct? Appl Linguist 29(1):24–49. https://doi.org/10.1093/applin/amm017

Jin X (2012) Working memory constraints on L2 learners’ speech production. Foreign Lang Teach Res 44(4):523–535

Jin Y, Wu J (2010) A preliminary study of the validity of the Internet-Based CET-4: Factors affecting test-takers’ perception of the performance on the test. Technol Enhanced Foreign Lang Educ 132(2):3–10

Kim HJ (2015) A qualitative analysis of rater behavior on an L2 speaking assessment. Lang Assess Q 12(3):239–261. https://doi.org/10.1080/15434303.2015.1049353

Lin R (2023) Examining the scoring of content integration in a listening-speaking test: A G-theory analysis. Lang Assess Q 20(3):319–338. https://doi.org/10.1080/15434303.2023.2242334

Liu S, Chen YJ (2018) A practical exploration on NMET (Shanghai)-based English listening and speaking teaching. TEFLE 183(05):32–36

Luoma S (2004) Assessing speaking. Cambridge University Press, Cambridge, UK

Kormos J, Suzuki S, Eguchi M (2022) The role of input modality and vocabulary knowledge in alignment in reading-to-speaking tasks. System 108:102854. https://doi.org/10.1016/j.system.2022.102854

Marsh HW, Muthén B, Asparouhov T, Lüdtke O, Robitzsch A, Morin AJS, Trautwein U (2009) Exploratory structural equation modeling, integrating CFA and EFA: Application to students’ evaluations of university teaching. Struct Equation Modeling: Multidisciplinary J 16(3):439–476. https://doi.org/10.1080/10705510903008220

Marsh HW, Lüdtke O, Muthén B, Asparouhov T, Morin AJS, Trautwein U, Nagengast B (2010) A new look at the big five factor structure through exploratory structural equation modeling. Psychol Assess 22(3):471–491. https://doi.org/10.1037/a0019227

Messick S (1987) Validity (TOEFL Report). Educational Testing Service, Princeton, NJ

Ministry of Education of the People’s Republic of China (2020) General senior high school curriculum standards. People’s Education

Pallant J (2020) SPSS survival manual: A step by step guide to data analysis using IBM SPSS, 7th edn. Routledge

Phakiti A (2008) Construct validation of Bachman and Palmer’s (1996) strategic competence model over time in EFL reading tests. Lang Test 25(2):237–272. https://doi.org/10.1177/0265532207086783

Pusey K (2020) Assessing L2 listening at a Japanese university: Effects of input type and response format. Lang Educ Assess 3(1):13–35. https://doi.org/10.29140/lea.v3n1.193

Rui YP, Ji HJ (2017) The impact of multimodal listening & speaking teaching on English speaking anxiety and classroom reticence. TEFLE 178(6):50–55

Rukthong A (2021) MC listening questions vs. integrated listening-to-summarize tasks: What listening abilities do they assess? System 97(1):102439. https://doi.org/10.1016/j.system.2020.102439

Rukthong A, Brunfaut T (2020) Is anybody listening? The nature of second language listening in integrated listening-to-summarize tasks. Lang Test 37(1):31–53. https://doi.org/10.1177/0265532219871470

Swain M, Huang L, Barkaoui K, Brooks L, Lapkin S (2009) The speaking section of the TOEFL iBT™ (SSTiBT): Test-takers’ reported strategic behaviors. TOEFL iBT-10. Educational Testing Service

Swami V, Maïano C, Morin AJS (2023) A guide to exploratory structural equation modeling (ESEM) and bifactor-ESEM in body image research. Body Image 47:101641. https://doi.org/10.1016/j.bodyim.2023.101641

Suzuki S, Kormos J (2023) The multidimensionality of second language oral fluency: Interfacing cognitive fluency and utterance fluency. Stud Second Lang Acquisition 45(1):38–64. https://doi.org/10.1017/S0272263121000899

Tabachnick BG, Fidell LS (2013) Using multivariate statistics, 6th edn. Pearson Education

Tsang A, Lee JS (2023) The making of proficient young FL speakers: The role of emotions, speaking motivation, and spoken input beyond the classroom. System 115:103047. https://doi.org/10.1016/j.system.2023.103047

Van Zyl LE, ten Klooster PM (2022) Exploratory structural equation modeling: Practical guidelines and tutorial with a convenient online tool for Mplus. Front Psychiatry 12(1):1–28. https://doi.org/10.3389/fpsyt.2021.795672

Wang H, Fan TT, Zeng YQ (2018) Investigating the construct of speaking proficiency under the listening-to-speak integrated task. Mod Foreign Lang 41(3):413–424

Wei J, Llosa L (2015) Investigating differences between American and Indian raters in assessing TOEFL iBT speaking tasks. Lang Assess Q 12(3):283–304. https://doi.org/10.1080/15434303.2015.1037446

Xu W (2016) Analysis of National Matriculation English Test (Shanghai) under the new reform of examination and enrollment system: Innovation, elucidation and prospection. Foreign Lang Test Teach 4:24–31

Xu W (2021) Practice of a speaking assessment task in a high-stake test: Taking NMET(Shanghai) as an example. Foreign Lang Test Teach 1:21–27

Xu Y, Huang M, Chen J, Zhang Y (2023a) Investigating a shared-dialect effect between raters and candidates in English speaking tests. Front Psychol 14:1143031. https://doi.org/10.3389/fpsyg.2023.1143031

Xu Y, Li XD, Chen J (2024) The review: Computer-based English Listening and Speaking Test (CELST) of National Matriculation English Test (NMET) Guangdong version in China. Lang Test 42(2):238–249. https://doi.org/10.1177/02655322241255712

Xu Y, Li XD, Wang PC (2023b) Validating an empirically developed rating scale of story retelling task. J PLA Univ Foreign Lang 46(5):11–19

Xu Y, Liao TH, Han S, Wang YQ (2019) Development and validation of the content rubric of a story retelling task. Foreign Lang Test Teach 4:21–30

Xu Y, Liao TH, Han S, Wang YQ (2020) Investigating language features for the listening-to-speak integrated task: A corpus-based approach. Foreign Lang Res 1:56–63

Xu Y, Yang MN, Li XD (2025) Investigating the relationships between listening strategies and speaking performance in integrated listening-to-speak tasks. System 129:103586. https://doi.org/10.1016/j.system.2024.103586

Xu Y, Zhang YQ (2021) Investigating pronunciation features of the integrated listening-to-speak task construct. Foreign Lang Test Teach 3:39–48

Yan X, Cheng LX, Ginther A (2019) Factor analysis for fairness: Examining the impact of task type and examinee L1 background on scores of an ITA speaking test. Lang Test 36(2):207–234. https://doi.org/10.1177/0265532218775764

Yang HC (2009) Exploring the complexity of second language writers’ strategy use and performance on an integrated writing test through structural equation modeling and qualitative approaches. Unpublished doctoral dissertation. The University of Texas

Zhan Y, Wan ZH (2016) Test takers’ beliefs and experiences of a high-stakes Computer-based English Listening and Speaking Test. RELC J 47(3):363–376. https://doi.org/10.1177/0033688216631174

Zeng QM (2011) The efficacy of multi-modal teaching on the development of L2 listening and speaking abilities. J PLA Univ Foreign Lang 6:72–76

Zhang R (2019) Washback effect analysis of NMET(Shanghai) listening and speaking test: Taking J school as an example. Foreign Lang Test Teach 4:47–53

Zhou WJ (2005) Effects of input modes on oral English production. J PLA Univ Foreign Lang 28(6):53–58

Zhang Y, Elder C (2009) Measuring the speaking proficiency of advanced EFL learners in China: The CET-SET solution. Lang Assess Q 6(4):298–314. https://doi.org/10.1080/15434300902990967

Zhou Y, Zeng YQ (2016) Many-facet Rasch model analysis on computer automatic scoring of a computer-based English listening-speaking test. Foreign Lang Test Teach 1:22–31

Zhang WW, Zhang LJ (2022) Understanding assessment tasks: Learners’ and teachers’ perceptions of cognitive load of integrated speaking tasks for TBLT implementation. System 111:102951. https://doi.org/10.1016/j.system.2022.102951

Zhang WW, Zhang DL, Zhang LJ (2021) Metacognitive instruction for sustainable learning: Learners’ perceptions of task difficulty and use of metacognitive strategies in completing integrated speaking tasks. Sustainability 13:6275. https://doi.org/10.3390/su13116275

Zhang WW, Zhao MJ, Zhu Y (2022) Understanding individual differences in metacognitive strategy use, task demand, and performance in integrated L2 speaking assessment tasks. Front Psychol 13:876208. https://doi.org/10.3389/fpsyg.2022.876208

For the detailed description and scoring criteria of the three subtasks of CELST please refer to the “General description” section in Xu et al. (2024).

These measures include: the number of error-free clauses, the number of verb forms, the number of correct verb forms, the number of reported semantic units, the number of additive conjunctions, the number of comparative conjunctions, the number of temporal conjunctions, the number of consequential conjunctions, the number of internal conjunctions, the number of unmeaningful syllables (including repetition, reformulation, and replacement), the number of filled pauses, the number of unfilled pauses, the number of self-corrections, the number of phoneme additions, the number of phoneme omissions, the number of mispronounced phonemes, the number of unintelligible fragments, the number of misstressed phonemes, the number of messages abandoned, the number of semantic units reduced, the number of guesses, the number of approximations, the number of paraphrases, the number of restructurings, and the number of coined words.

To be specific, they are all the sub-measures of coherence, number of internals, all the sub-measures of achievement strategy use, number of message abandon in reading aloud task; all the sub-measures of achievement strategy use and number of comparatives and temporals in role play task; and number of internal conjunctions and number of coinage across all the three sub-tasks in CELST.

Generally speaking, a χ²/df value smaller than 2 indicates good fit, and a value of 2 < χ²/df < 5 indicates acceptable fit (Hou, Wen, & Cheng, 2003, p.177–179). Two other commonly used absolute indexes of model fit are RMSEA and SRMR, the Root Mean Square Error of Approximation and the Standardized Root Mean Square Residual, respectively. It is usually held that an RMSEA value smaller than 0.08 indicates acceptable model fit and an RMSEA value smaller than 0.05 indicates good model fit. For SRMR, a value smaller than 0.05 indicates good model fit.