Analysis of Spatial and Metacognitive Performance in First-Year Health Sciences Students
Laura M. Fernández-Méndez1*, Robert Bauer2, Antonio Rodán3, Gema Díaz-Gil4,5, Stella Maris Gómez-Sánchez4,5, Francisco Gómez-Esquer4,5, María José Contreras1
1Facultad de Psicología, Universidad Nacional de Educación a Distancia, UNED, Madrid, Spain
2Centro de Investigación en Deporte y Ejercicio (CIDE), Universidad Rey Juan Carlos, Madrid, Spain
3Facultad de Farmacia, Universidad San Pablo-CEU, CEU Universities, Boadilla del Monte, Spain
4Área de Anatomía y Embriología Humana, Departamento de Ciencias Básicas de la Salud, Facultad de Ciencias de la Salud, Universidad Rey Juan Carlos, Madrid, Spain
5Grupo de investigación consolidado de Bases Anatómicas, Moleculares y del Desarrollo Humana (GAMDES), Facultad de Ciencias de la Salud, Universidad Rey Juan Carlos, Madrid, Spain
*Corresponding author: lm.fernandez@psi.uned.es
ORCID Laura M. Fernández-Méndez: https://orcid.org/0000-0002-4455-9928
ORCID Robert Bauer: https://orcid.org/0000-0003-1034-2007
ORCID Antonio Rodán: https://orcid.org/0000-0001-7674-3335
ORCID Gema Díaz-Gil: https://orcid.org/0000-0003-3480-1490
ORCID Stella Maris Gómez-Sánchez: https://orcid.org/0000-0002-7140-4286
ORCID Francisco Gómez-Esquer: https://orcid.org/0000-0002-5426-4349
ORCID María José Contreras: https://orcid.org/0000-0001-6302-5238
Abstract
This study examines spatial metacognitive monitoring in health sciences students. This process is essential in many diagnostic judgments, as it involves interpreting images together with the level of confidence in such interpretations. We analyzed how confidence, item difficulty, spatial ability, and gender influence calibration (the accuracy of confidence judgments) and bias (the directional deviation of confidence judgments) in a task that assesses the ability to visualize cross-sections. In addition, we examined performance across all visuospatial tasks administered. A total of 322 students completed three tests (cross-sections, visualization, and mental rotation), providing confidence judgments after each item. Results showed that greater spatial ability predicted better calibration and lower overconfidence, whereas higher item difficulty reduced calibration accuracy; this effect was moderated by spatial ability. Significant gender differences emerged: women showed lower calibration error, although men were more likely to achieve perfect calibration. In the performance models, both confidence and visualization ability strongly predicted accuracy. Interactions indicated that spatial ability buffered the negative effects of difficulty and that confidence predicted performance mainly for easier items. Spatial ability is highlighted as a key moderator of metacognitive accuracy, and its training may enhance diagnostic reasoning, thereby contributing to improved patient safety in future professional practice.
Keywords:
metacognition
spatial ability
gender differences
health sciences students
spatial reasoning
Statements and Declarations
Funding
This research was partially supported by (1) the IMPULSO project funded by the Universidad Rey Juan Carlos (URJC) under grant reference 2024/SOLCON-135283 and (2) the Ministry of Science and Innovation under grant reference PID2021-125677OB-I00.
Competing Interests
The authors declare that they have no competing interests.
Ethics Approval
The study was approved by the Ethics Committee of Universidad Rey Juan Carlos (protocol No. 140520243142024).
All participants provided informed consent prior to participation, in accordance with the Declaration of Helsinki.
Data Availability
The dataset and the R code of the statistical analyses presented in this study can be found in the online repository Zenodo under the following accession number: https://doi.org/10.5281/zenodo.17629979.
Author Contribution
Conceptualization: L. M. F-M., A. R., M. J. C.; Data collection and organization: L. M. F-M., A. R., M. J. C., G. D-G., S. M. G-S., F. G-E.; Methodology: L. M. F-M., A. R., M. J. C.; Formal analysis and investigation: L. M. F-M., R. B.; Writing - original draft preparation: L. M. F-M., R. B., A. R., M. J. C.; Writing - review and editing: L. M. F-M., R. B., A. R., M. J. C., G. D-G., S. M. G-S., F. G-E.; Funding acquisition: L. M. F-M., M. J. C.; Resources: L. M. F-M.; Supervision: L. M. F-M.
Introduction
In the field of patient safety, the World Health Organization (WHO) has identified misdiagnosis as one of the primary contributors to unsafe healthcare delivery (World Health Organization & World Alliance for Patient Safety Research Priority Setting Working Group, 2008). To make a diagnostic judgment, health professionals must rely on clinical evidence and, in addition, possess an adequate level of confidence in that judgment. The discrepancy between actual diagnostic accuracy and the perceived confidence in that accuracy represents a critical threat to patient safety (Meyer et al., 2013). When a professional feels unsure about a clinical diagnosis, they tend to seek additional information to confirm or rule out their hypothesis, thereby helping minimize diagnostic error. In this regard, the self-assessment that professionals make about their decisions largely determines the course of action they adopt in clinical practice, modulated by their level of confidence.
Therefore, the quality of care is closely related to healthcare professionals’ self-assessment capacity and to self-regulated learning (Murdoch-Eaton & Whittle, 2012; Sandars & Cleary, 2011). However, several studies have shown a weak correlation between self-assessments and standardized external assessments among health professionals (Davis et al., 2006; Eva et al., 2004), with coefficients ranging from 0.02 to 0.65 in the case of students (Gordon, 1991). This gap highlights the need to strengthen metacognition, understood as the ability to monitor, regulate, and evaluate one's own performance, in the training of future healthcare professionals (Honeycutt et al., 2021; Naug et al., 2011; Prokop, 2019; Tan et al., 2010).
One of the key components of clinical practice is the interpretation of medical images, which constitutes a fundamental basis for accurate diagnostic judgments. This skill requires rigorous and early training (Weimer et al., 2024), since the interpretation of these images entails reasoning with complex spatial information. In this context, spatial ability plays a central role, as these tasks rely on spatial mental representations and on processes such as mental rotation and visualization (Hegarty et al., 2007). Spatial ability is defined as the capacity to generate, retain, retrieve, and transform well-structured visual images (Lohman, 1996). Health science students are frequently required to reason about spatial concepts such as shape, relative position, and the connections between anatomical structures (Hegarty et al., 2007), which involves learning to mentally represent internal body structures that are not directly observable.
Consequently, given the relevance of spatial ability in the interpretation of medical images and in clinical decision-making, it is essential to consider it a central component of metacognition in health education. Analyzing how students judge their own spatial performance allows researchers to understand and optimize these processes from the early stages of university studies. Although numerous studies analyze confidence in responses, confidence does not equate to competence, and further research is needed to examine the degree of discrepancy between confidence judgments and actual performance in health sciences students.
Metacognition in education
Metacognition is defined as reflection on one's own thinking. In other words, it is the ability to plan, monitor, and evaluate learning and performance by regulating one's own cognitive processes (Flavell, 1979; Lai, 2011). In the clinical context, metacognition has been associated with greater patient safety, enhanced decision-making, and a lower incidence of diagnostic errors in clinical professionals (Kuiper & Pesut, 2004; Royce et al., 2019; Siqueira et al., 2020). Conversely, insufficient metacognitive skills can lead to overestimation of abilities or reduced self-monitoring and, consequently, to medical errors (Medina et al., 2017; Royce et al., 2019). For example, Pusic et al. (2015) reported that medical students rated their diagnostic judgments on an X-ray image as "definitely correct," although this was accurate in only 69% of cases.
Metacognitive knowledge and metacognitive regulation are the two components of metacognition, the latter involving monitoring and control processes (Nelson & Narens, 1994; Schraw & Dennison, 1994). Metacognitive monitoring refers to activities that involve reviewing and evaluating the quality of cognition, whereas metacognitive control refers to decisions made based on information from monitoring operations (for a review, see Fiedler et al., 2019; Nelson & Narens, 1990). Control processes determine the confidence threshold at which an action is initiated. For example, in the clinical context, a physician may proceed with treatment only when they have more than 90% confidence in a diagnosis (e.g., Djulbegovic et al., 2014; Pauker & Kassirer, 1980). Inaccurate monitoring deteriorates decision quality; thus, the effectiveness of control depends directly on monitoring accuracy (Ackerman & Thompson, 2017). In this context, confidence judgments (hereinafter, CJ) are among the most widely used measures of metacognitive monitoring (Fleming & Lau, 2014). CJs reflect an individual's belief in the accuracy of their decisions following a cognitive task. According to Nelson and Narens (1990), “a system that monitors itself (even if imperfectly) can use its own introspections as information to alter the behavior of the system” (p. 128). Thus, confidence levels help determine whether individuals feel sufficiently competent in a particular domain or whether further learning is required (Dautriche et al., 2021), guiding control decisions such as seeking help or checking for errors (Dunlosky & Hertzog, 1998; Dunlosky & Metcalfe, 2009; Efklides, 2011; Metcalfe, 2002, 2009; Nelson & Narens, 1990).
Empirically, monitoring accuracy is assessed by comparing subjective performance evaluations (e.g., CJs) with objective task performance, measuring the correspondence between the two. A recent meta-analysis (Jin et al., 2022) across 16 countries showed significant correlations between performance and mean confidence, suggesting that individuals who feel more confident tend to perform better. Judgment accuracy, defined as the relationship between objective performance and CJs, is considered a central indicator of monitoring quality (Kleitman & Moscrop, 2010). Monitoring accuracy can be assessed through absolute accuracy, relative accuracy, or the bias index (Burson et al., 2006; Nelson, 1996). Absolute accuracy is calculated as the difference between estimated performance scores (confidence judgments) and objective performance; values closer to zero indicate better calibration. The bias index is another form of absolute accuracy that measures the degree to which an individual has an excess or a deficiency of confidence when making a prediction. The bias index reflects the direction and degree of discrepancy between the judgment and performance (Schraw, 2009), with positive values indicating overconfidence and negative values indicating underconfidence. Thus, overconfident students show higher confidence judgments, while underconfident students show lower confidence judgments relative to objective performance on the task.
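These two absolute-accuracy measures reduce to simple arithmetic. As an illustration with hypothetical data (the paper's own analyses were conducted in R; the computation is language-agnostic), the following sketch computes absolute accuracy and the bias index of Schraw (2009) for a single participant:

```python
# Hypothetical data: absolute accuracy and the bias index for one participant,
# following the definitions cited above (Schraw, 2009). Confidence judgments
# are given on a 0-100 scale and normalized to 0-1 before comparison.

def absolute_accuracy(confidences, accuracies):
    """Mean squared deviation between confidence and accuracy; 0 = perfect calibration."""
    return sum((c / 100 - a) ** 2 for c, a in zip(confidences, accuracies)) / len(confidences)

def bias_index(confidences, accuracies):
    """Mean signed deviation (confidence - accuracy); positive = overconfidence."""
    return sum(c / 100 - a for c, a in zip(confidences, accuracies)) / len(confidences)

# A hypothetical overconfident student: high confidence, but 3 of 5 items correct.
conf = [90, 80, 100, 85, 95]
acc = [1, 0, 1, 0, 1]  # 1 = correct, 0 = incorrect
print(round(absolute_accuracy(conf, acc), 3))  # 0.275 (imperfect calibration)
print(round(bias_index(conf, acc), 2))         # 0.3 (positive -> overconfidence)
```

The positive bias here reflects the overconfident pattern described above: confidence systematically exceeds obtained accuracy.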
In educational contexts, students who consistently underestimate their performance (with low confidence) may lose motivation to learn, whereas those who overestimate their performance may be at a disadvantage in the long term, as overconfidence can undermine their motivation to learn new techniques (since they feel they already know enough). Numerous studies have shown that students tend to be inaccurate in judging their performance (Fitzsimmons & Thompson, 2024), typically overestimating it, that is, assessing their performance as higher than their actual performance on tests (Händel & Dresel, 2018; Kruger & Dunning, 1999). This phenomenon, known as overconfidence bias, varies depending on contextual and individual factors. Because of the nature of test administration, confidence ratings are sensitive to item difficulty. Difficulty directly affects the CJs assigned, so that participants tend to show overconfidence on difficult items and underconfidence on easy items, a pattern known as the hard-easy effect (Juslin et al., 2000; Lichtenstein & Fischhoff, 1977). It is worth noting that item difficulty is closely related to participants' ability: lower-ability participants tend to be overconfident on moderately difficult items, whereas higher-ability participants tend to be well calibrated or even slightly underconfident (Morphew, 2021; Stankov et al., 2012). Other studies, however, indicate that low-performing participants were more accurate, overestimating less than high-performing participants on a test of pedagogical knowledge (Florín & Grecu, 2019). Despite these discrepancies, the literature converges in indicating that lower-performing students generally overestimate their performance to a greater extent than average- or high-performing students, a pattern also observed in the health sciences and allied disciplines (Bunce et al., 2023; Cale, 2023; Morphew, 2021).
Although these individual differences exist, confidence tends to manifest as a relatively stable characteristic across ability levels; thus, overconfidence is not limited to low performers. Stankov and Lee (2014) noted that this trend persists even among high-ability groups, suggesting that overconfidence is a widespread phenomenon. In the healthcare field, Cleary et al. (2019) found that medical students—in a sample of 157 participants—overestimated their performance on two tasks (medical history and physical examination) in 98% and 95% of cases, respectively, and their judgment accuracy remained moderately stable across tasks. Similar findings have been reported in other studies that show poor calibration in medicine (Benjamin et al., 2022; Berner & Graber, 2008; Meyer et al., 2013) as well as in education (Callender et al., 2016; Foster et al., 2017; Huff & Nietfeld, 2009), where a low correspondence between confidence and performance can lead to systematic diagnostic errors (Berner & Graber, 2008).
In addition to ability and difficulty, gender emerges as a relevant factor in metacognitive calibration. Several studies have shown that, in general knowledge tasks or university contexts, women tend to show lower confidence and less overestimation than men (de Bruin et al., 2017; Buratti et al., 2013), resulting in better calibration, reflected in lower bias scores (Pallier et al., 2002). In a numerical series task with a large sample of 6,544 participants (58% women), men showed higher confidence, with no differences in performance, and exhibited greater overconfidence biases, although with a small effect size. The meta-analysis by Rivers et al. (2020) supported these findings partially by analyzing six studies with 758 participants, where men were more confident and, in some cases, also more accurate. Even when controlling for performance, women reported approximately 7% less confidence than men. However, gender differences appear to be task-dependent: in mathematical problems, women have shown superior calibration (Bench et al., 2015). McMurran (2020) found that men were more confident but less calibrated, particularly on low-difficulty problems; as task complexity increased, women’s calibration improved and gender differences in confidence decreased. Overall, although men tend to appear more confident (Lundeberg & Mohan, 2009; Stankov & Lee, 2008), the evidence indicates that overconfidence decreases in both genders with practice and time (Hadwin & Webster, 2013).
Spatial reasoning and metacognition in health sciences
Spatial reasoning has been less explored from a metacognitive perspective, despite its central role in multiple cognitive and professional activities. Spatial reasoning can be considered an umbrella term encompassing a broad set of spatial abilities, including spatial visualization, mental rotation, and spatial orientation (Ramful et al., 2017). In their meta-analysis, Voyer et al. (1995) defined mental rotation (hereafter, MR) as the ability to mentally manipulate two- and three-dimensional objects, whereas spatial visualization refers to the ability to mentally transform or modify the spatial properties of an object. Subsequent studies have confirmed that these are clearly differentiated abilities (Hegarty & Waller, 2005; Mix & Cheng, 2012). Given that these processes involve monitoring, controlling, and evaluating one's own performance, their analysis from a metacognitive perspective is particularly relevant.
In the clinical field, spatial reasoning acquires particular importance, as many medical and surgical tasks depend on the accurate interpretation of spatial information. For example, understanding X-rays, CT scans, or ultrasounds requires integrating complex two- and three-dimensional representations with prior anatomical knowledge. Numerous studies have demonstrated the relationship between spatial ability and the learning of specific skills in laparoscopic surgery, colonoscopy, and other techniques related to dentistry (Hassan et al., 2007; Hedman et al., 2006; Hegarty et al., 2008; Keehner et al., 2006; Luursema et al., 2008; Risucci, 2002; Evans et al., 2001; Wanzel et al., 2002, 2003). In this regard, clinical teaching seeks to develop competencies in the interpretation of radiological images through strategies that enhance anatomical understanding and visuospatial ability (de Barros et al., 2001; Ertl-Wagner et al., 2016; Khalil et al., 2005; Sendra Portero et al., 2023). Thus, the ability to perceive anatomical relationships and mentally manipulate internal structures is considered a fundamental skill for healthcare practice.
Empirical evidence has shown that targeted practice and specific training can significantly improve spatial reasoning and, consequently, clinical performance. Weimer et al. (2024) found that after an ultrasound course, students improved both their visuospatial ability and their understanding of radiological sections and anatomical relationships in the abdomen. Similarly, Garg et al. (2001) demonstrated that students with higher MR ability and greater exposure to multi-view three-dimensional models achieved superior spatial learning in anatomy. These findings underscore initial spatial ability as a predictor of anatomical learning and success in clinical tasks (Garg et al., 2001). Similarly, Clem et al. (2013) reported that 21% of students’ performance in ultrasound interpretation could be explained by their initial spatial ability. Complementarily, Hoyek et al. (2009) observed that MR training improved performance on an anatomy test with items that specifically required mental rotations, but not on those focused on factual knowledge. Both men and women improved after the training, although the former maintained a higher absolute performance. These results support the hypothesis of transfer from mental rotation tasks to complex anatomical tasks (Guillot et al., 2007; Hoyek et al., 2009), suggesting that strengthening spatial skills contributes directly to anatomical learning and retention.
Individual differences in spatial ability emerge early in development (Levine et al., 1999) and persist into adulthood (Hegarty & Waller, 2005). Regarding gender, men consistently outperform women on mental rotation tasks (Fernández-Méndez et al., 2024; Lippa et al., 2010; Maeda & Yoon, 2013; Voyer et al., 1995). This mental rotation advantage is considered one of the most robust gender differences in cognitive psychology (Halpern, 2013; Voyer et al., 1995; Zell et al., 2015). In contrast, sex/gender differences in chronometric mental rotation tasks are less consistent, often small or non-significant, or appear only in particular subtests or with specific types of stimuli (Bauer et al., 2021; Bauer & Jansen, 2024; Jansen-Osmann & Heil, 2007; Peters & Battista, 2008).
Regarding the psychometric Mental Rotation Test (Vandenberg & Kuse, 1978; Peters et al., 1995), it is well established that the speed at which spatial problems are solved differs between men and women, and that this may contribute to the observed gender differences in mental rotation performance when time limits are imposed (Peters, 2005; Voyer & Saunders, 2004). Examining the effects of varying time constraints, Peters (2005) proposed that men would outperform women under all timing conditions and that their advantage would become more pronounced as the time available per item decreased.
In the context of these psychometric spatial tasks, analyses of metacognitive variables suggest that confidence in spatial task responses may explain part of the individual differences observed in such tasks (Arrighi & Hausmann, 2022; Cooke-Simpson & Voyer, 2007; Desme et al., 2024; Estes & Felker, 2012). Specifically, Cooke-Simpson and Voyer (2007) examined the role of confidence in MR task performance.
They hypothesized that participants who responded randomly were more likely to have less confidence in their answers, so they examined confidence as an explanation for overall performance. Using a sample of university students, they found a high correlation (r = 0.685) between MR (measured through the Mental Rotation Test; MRT) and the average confidence rating. Similarly, Estes and Felker (2012) showed that men not only had more confidence than women in their own MR abilities, but that individuals who rated themselves as more confident in the accuracy of their answers were, in fact, more accurate on a classic MR task. Other studies have corroborated that those who obtain the best results also have greater confidence in their responses on the MRT (Desme et al., 2024). However, greater self-confidence is not always associated with better performance on spatial tasks. The study by Ariel et al. (2018) showed that men had higher confidence levels even when no gender differences were observed in visuospatial performance. Despite differing results among studies, the finding of greater confidence in men's spatial responses compared to women's seems consistent. Furthermore, the positive relationship between self-confidence and MR performance is stronger in men than in women (Estes & Felker, 2012). Thus, confidence differs between men and women in MR tasks (Lemieux et al., 2019; Rahe & Jansen, 2022), even when performance is controlled. Moreover, among the different metacognitive variables, confidence seems to be the one that best predicts performance in spatial tasks, surpassing both perceived difficulty and the effort invested (Ackerman et al., 2024).
Among health science students, these differences are also reproduced in different educational contexts. Male advantages have been documented in MR among medical students (Langlois et al., 2013) and in anatomy (Koh et al., 2023). Since clinical tasks constantly require spatial reasoning, understanding how individual differences in spatial ability and metacognitive confidence interact is essential for optimizing the teaching of diagnostic skills that rely on visuospatial processing.
Present research
This study had two primary objectives. The first was to analyze calibration and bias in a sectioning task, considering gender, spatial ability, and item difficulty. This task was selected because it requires spatial reasoning about object sections, an essential component of anatomical learning. Spatial ability was estimated using MR and visualization tests, which provide complementary measures of individuals’ capacity to mentally manipulate spatial information. We hypothesized that students with greater spatial ability would exhibit better calibration (scores closer to zero) and less overconfidence bias. Furthermore, calibration was expected to decrease and overconfidence to increase as item difficulty rose. Men were also expected to exhibit poorer calibration than women, with a greater overconfidence bias.
The second objective was to analyze performance across three spatial tests (sections, visualization, and mental rotation) as a function of confidence, task difficulty, and the proportion of items attempted. We expected performance to be higher among students with greater spatial ability and higher confidence, whereas increased difficulty would reduce performance. In line with previous research, men were expected to outperform women on the MR task. Additionally, we hypothesized that students who answered a greater number of items under time pressure would perform worse, reflecting a speed-accuracy trade-off. However, this negative effect would be attenuated in participants with greater spatial ability, compensating for the impact of responding quickly on performance.
Methods
Participants
A total of 322 participants from Rey Juan Carlos University (URJC) took part in the study. Eleven participants were excluded for providing insufficient responses to most of the questionnaires and/or for not reporting a gender identity. The final sample consisted of 65 men (Mage = 19.43; SD = 4.46) and 246 women (Mage = 19.14; SD = 4.64), all of whom were first-year undergraduate students at the Faculty of Health Sciences.
All participants were informed about the study, and ethical approval was obtained from the research ethics committee of Rey Juan Carlos University (Approval No. 140520243142024).
Materials and procedure
Data were collected in a single, collective session at the beginning of the first academic year, in a classroom specifically prepared for this purpose. Task order was counterbalanced across academic degrees, and the session lasted approximately two hours.
Section task (SEC)
This test was administered in a revised version that included 15 items from the Santa Barbara Solid Test (SBST; Cohen & Hegarty, 2012) and 5 items from the Planes of Reference Test (PRT; Titus & Horsmann, 2009). The selection of items was based on Munns et al. (2023), who analyzed all items from both tests to develop an efficient measure of the ability to visualize the cross-sections of three-dimensional objects. Item difficulty followed the indices reported by Munns et al. (2023) and can be consulted in the supplementary materials. In each trial, a three-dimensional figure with a cutting plane and four (in the SBST items) or five (in the PRT items) response options was presented, only one of which was correct. Participants were instructed to select the option corresponding to the flat cross-section of the object, that is, the correct 2D cross-section derived from cutting the 3D object. There was no time limit. One point was awarded for each correct answer, with a maximum score of 20. Internal consistency was Cronbach's α = 0.822.
After each response, participants provided a confidence judgment (SEC conf) on a scale from 0 to 100.
For the section task, two metacognitive indices were computed for each participant to assess the accuracy and directional deviation of judgments: calibration and bias. For each item, the calibration index (CAL) was calculated as the squared difference between accuracy and confidence, reflecting the precision of metacognitive judgments irrespective of direction. At the participant level, calibration of confidence judgments was computed using the calibration component of Murphy's (1973) decomposition of the Brier score. A calibration score of zero reflects perfect correspondence between confidence judgments and actual performance, with increasing values indicating greater miscalibration; that is, lower calibration values indicate closer alignment between subjective confidence and actual performance.
The bias index (BIAS) was defined as the difference between the objective accuracy for each item (scored as 1 = correct, 0 = incorrect) and the participant’s normalized confidence rating (scaled from 0 to 1). Negative values indicate overconfidence, whereas positive values indicate underconfidence. A bias value close to zero reflects well-calibrated confidence judgments.
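As a minimal sketch of these two indices (hypothetical data; the study's actual computations were performed in R), the calibration component of Murphy's (1973) Brier-score decomposition groups items by confidence level, while the per-item bias is accuracy minus normalized confidence:

```python
# Hypothetical data: participant-level calibration (CAL) as the calibration
# component of Murphy's (1973) Brier-score decomposition, grouping items by
# confidence level, plus the per-item BIAS index defined above.
from collections import defaultdict

def murphy_calibration(confidences, accuracies):
    """CAL = (1/N) * sum_k n_k * (f_k - obar_k)^2, where items are grouped by
    confidence category f_k and obar_k is the observed accuracy in that group.
    Zero means perfect calibration; larger values mean worse calibration."""
    groups = defaultdict(list)
    for c, a in zip(confidences, accuracies):
        groups[c / 100].append(a)  # normalize confidence to 0-1
    n = len(confidences)
    return sum(len(g) * (f - sum(g) / len(g)) ** 2 for f, g in groups.items()) / n

def item_bias(confidence, accuracy):
    """Per-item BIAS: accuracy minus normalized confidence.
    Negative = overconfidence, positive = underconfidence."""
    return accuracy - confidence / 100

conf = [100, 100, 50, 50]  # two items judged 100% sure, two judged 50% sure
acc = [1, 0, 1, 0]
print(murphy_calibration(conf, acc))  # 0.125 (the 100% group was only 50% correct)
print(item_bias(50, 1))               # 0.5 (underconfident on a correct item)
```

In this toy example, the miscalibration comes entirely from the items judged with 100% confidence, of which only half were answered correctly.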
Mental Rotation Task (MRT, Peters et al., 1995)
The task measures participants’ ability to mentally rotate representations of three-dimensional block figures. In each trial, a comparison figure was presented along with four response alternatives. Of these, only two were rotated versions of the comparison figure, while the other two were distractors that did not match it. Thus, each trial had two correct and two incorrect options, and participants were instructed to select the two correct rotated versions. The test consists of two parts of 12 items each, with a completion time of 3 minutes per part and a 4-minute break between parts. Three practice trials preceded the experimental trials. A trial was scored as correct when both correct options were selected, for a maximum score of 24. Internal consistency was Cronbach's α = 0.833.
After each trial, participants were instructed to provide a confidence judgment (MR conf), rating their confidence in the accuracy of their responses on a scale from 0 (I am not at all sure) to 100 (I am very sure).
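The MRT scoring rule described above (a point only when both rotated versions are selected) can be sketched as follows; the option labels and responses below are hypothetical:

```python
# Hypothetical option labels and responses: a trial counts as correct only
# when exactly the two rotated versions are selected.

def score_mrt_trial(selected, correct_pair):
    """Return 1 if the selected options exactly match the two correct ones, else 0."""
    return 1 if set(selected) == set(correct_pair) else 0

def score_mrt(responses, answer_key):
    """Total score across trials (the full test has 24 trials, so max = 24)."""
    return sum(score_mrt_trial(sel, key) for sel, key in zip(responses, answer_key))

answer_key = [("A", "C"), ("B", "D"), ("A", "B")]
responses = [("C", "A"), ("B",), ("A", "D")]  # only the first trial matches both
print(score_mrt(responses, answer_key))  # 1
```

Note that selecting only one of the two correct options earns no credit, which is why partial knowledge is not rewarded under this rule.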
Visualization Task (VZ, Prieto & Velasco, 2002a, 2002b; 2004; Prieto et al., 2007)
Each item presented a cube with faces labeled by letters and a flat net of the cube showing one labeled face and another marked with a question mark. Participants were required to choose, from 9 options, the one representing the face marked with the question mark. Item difficulty depended on the number of rotations needed to reach the solution, ranging from two rotations for the easiest items to five for the most difficult; each difficulty level comprised 5 items. The test included 2 practice trials and 20 experimental trials with 9 answer choices, only one of which was correct. The time limit was 30 minutes. One point was awarded per correct answer, for a maximum score of 20. Internal consistency was Cronbach's α = 0.865.
As in the MRT, participants rated their confidence (VZ conf) after each trial on a scale from 0 to 100.
Statistical Analysis
In this study, statistical analyses were conducted using the lme4 (Bates, Maechler, et al., 2015) and glmmTMB (Brooks et al., 2017; McGillycuddy et al., 2025) packages in R (R Core Team, 2025). The outcome variables SEC, MR, and VZ were examined through generalized linear mixed-effects models with a binomial distribution.
The variable Calibration (CAL) was analyzed using a zero-inflated generalized linear mixed-effects model including a beta-regression model with a logit link function and a dispersion model (Brooks et al., 2017).
This model was selected because calibration scores were bound between 0 and 1 and showed an excess of structural zeros, reflecting participants who were perfectly calibrated. The zero-inflated approach allowed us to separate the probability of belonging to the “perfect calibration” group from the degree of miscalibration in the remaining participants, while a dispersion component captured variability in the precision of responses.
The variable BIAS was analyzed via a hurdle model using a generalized linear mixed-effects model with a binomial distribution to analyze true zeros, and a linear mixed-effects model for the continuous part (Mabire-Yon, 2025). This modeling strategy was chosen because bias values were bounded between −1 and 1 and displayed a clear excess of exact zeros, reflecting participants who showed no directional bias at all. The hurdle model allowed us to disentangle two complementary processes: (1) the probability of showing no bias (zero component), and (2) the magnitude and direction of bias when it was present (continuous component).
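A minimal sketch of that two-part logic, using hypothetical data and a plausible signed bias score (accuracy minus confidence, so negative values indicate overconfidence; the study's exact formula may differ):

```python
def bias_components(confidence, accuracy):
    """Split a signed bias score into the two hurdle-model parts."""
    n = len(confidence)
    # Negative = overconfident, positive = underconfident, 0 = no bias.
    bias = sum(a - c for c, a in zip(confidence, accuracy)) / n
    is_true_zero = (bias == 0)                   # binomial (true zeros) part
    continuous = None if is_true_zero else bias  # LMM part: sign + magnitude
    return bias, is_true_zero, continuous

# Hypothetical participant: highly confident but sometimes wrong,
# so the score lands in the continuous (overconfident) component.
bias, is_true_zero, cont = bias_components([0.9, 0.8, 1.0], [1, 0, 1])
```

The zero component models `is_true_zero` across participants, while the continuous component models `cont` only for those with non-zero bias, which is why the two sub-models can have different significant predictors.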
Model parameters were estimated via maximum likelihood estimation. To assess the contribution of the fixed effects of interest, likelihood ratio tests were applied, and p-values were compared against a significance threshold of .05.
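A likelihood ratio test compares two nested models via the statistic 2·(logLik_full − logLik_null), referred to a chi-square distribution with degrees of freedom equal to the number of constrained parameters. A minimal sketch for the one-parameter (df = 1) case, with made-up log-likelihoods:

```python
import math

def lrt_df1(loglik_null, loglik_full):
    """Likelihood ratio test for one constrained parameter (df = 1)."""
    stat = 2 * (loglik_full - loglik_null)
    # For df = 1, the chi-square survival function P(X > stat)
    # equals erfc(sqrt(stat / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Made-up maximized log-likelihoods for illustration only.
stat, p = lrt_df1(loglik_null=-520.3, loglik_full=-517.8)
significant = p < .05  # compared against the .05 threshold used here
```

Here the statistic is 5.0 and p falls just below .05, so the fuller model would be retained; in practice such comparisons are run per fixed effect (e.g., via `anova()` on the fitted R models).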
Fixed effects were gender, confidence (conf; for each task respectively), percentage of completed items (ans; for MR and VZ), item difficulty (diffic; for SEC and VZ), and task performance (total test score per participant in percent; for each task respectively). The latter variable was created as a predictor of overall competence in the specific tasks. For instance, the total score in the visualization task served as an indicator for the spatial ability level of each participant. Spatial visualization is a measure typically used to estimate visuospatial reasoning (Carroll, 1993; Fernández-Méndez et al., 2020) and is therefore included in the present study as a criterion measure of spatial capacity.
At present, there is no established consensus on how to calculate standardized effect sizes in linear mixed-effects models, which are typically reported in the context of power analyses or meta-analyses (Bauer et al., 2022; Feingold, 2009; Hedges, 2007; Rights & Sterba, 2019). For significant effects and primary main effects, we therefore report unstandardized effect sizes along with confidence intervals, calculated through parametric bootstrapping with 1000 simulations, following the guidelines of Baguley (2009) and Pek and Flora (2018). Visual examination of residuals did not indicate any major violations of the assumptions of homoscedasticity or normality across models. Similarly, the variance inflation factors in all models showed no collinearity problems.
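The parametric bootstrap simulates the estimator's sampling distribution under the fitted model and takes percentile limits over the simulated estimates. The toy sketch below replaces the expensive "simulate data and refit the mixed model" step with draws from a normal distribution (made-up effect and SE), purely to show the percentile logic with 1000 simulations:

```python
import random

random.seed(42)  # reproducible toy example

true_effect, se = 0.52, 0.30  # made-up unstandardized estimate and its SE

# In a real parametric bootstrap each draw would come from simulating a
# dataset under the fitted model and refitting it; a normal draw stands
# in for that refit here.
sims = sorted(random.gauss(true_effect, se) for _ in range(1000))

# Percentile-based 95% CI from the 1000 simulated estimates.
ci_low, ci_high = sims[24], sims[974]
```

In R this is what `confint(model, method = "boot", nsim = 1000)` automates for lme4 fits; the percentile step is identical.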
Model specification followed a hypothesis-driven procedure (Barr et al., 2013; Bates et al., 2015). Initially, models included random intercepts and slopes for all relevant fixed effects and were progressively simplified by removing non-significant variance components. Fixed effects were retained only when they improved model fit, while non-significant predictors were excluded. For significant interactions, the main effects were examined separately by decomposing the interaction. A detailed description of the final models for each dependent variable is provided in the Results section. Figures and graphical representations of the data were created using ggplot2 (Wickham, 2016) and flexplot (Fife, 2022) in R (R Core Team, 2025).
Results
Effects of metacognitive performance in Anatomy
Calibration (CAL)
Model construction for the outcome variable calibration (CAL) resulted in a zero-inflated generalized linear mixed model with a beta-regression in the conditional model and a dispersion model, with random intercepts and slopes for diffic by participant. Regarding the variables VZ score, SEC score, MR score, diffic, and gender, all respective interactions and main effects were analyzed as fixed effects. For the continuous model, significant effects were found for MR score*diffic, SEC score, and gender. For the zero-inflation model, significant effects were found for VZ score, diffic, SEC score, and gender. For the dispersion model, significant effects were found for MR score, diffic, SEC score, and gender (see Table 1).
Regarding the main effects, the conditional model revealed that SEC performance strongly predicted calibration (OR = 0.16, p < .001; large effect; Chen et al., 2020), indicating that students who performed better in the task showed substantially more accurate calibration. Task difficulty also influenced calibration, with higher difficulty being associated with poorer calibration (OR = 1.16, p < .001; small-to-moderate effect). Gender differences emerged as well, with women being better calibrated than men (OR = 0.74, p = .010; moderate effect).
Importantly, the conditional model also revealed a significant interaction between task difficulty and mental rotation performance (OR = 0.81, p = .043). This represents a moderate effect, indicating that the negative impact of increasing difficulty on calibration is attenuated in students with higher mental rotation ability. This relationship is illustrated in Fig. 1. Visually, students with lower MR scores show a sharp decline in calibration as items became more difficult, whereas those with high MR ability maintained more stable calibration levels across difficulty conditions. In other words, there is less calibration loss as difficulty increases in participants who show higher performance on the MR task.
The zero-inflation component further showed that participants with higher SEC performance were much more likely to be classified as perfectly calibrated (OR = 31.60, p < .001), a very large effect. VZ score also contributed positively (OR = 1.60, p = .001), while higher difficulty (OR = 0.70, p < .001) and being a woman (OR = 0.64, p < .001) reduced the likelihood of perfect calibration, all corresponding to moderate effects.
The dispersion submodel revealed that variability in calibration differed across predictors: MR performance and task difficulty were linked to lower precision (greater heterogeneity), whereas women and participants with higher SEC performance showed higher precision. These effects were controlled to ensure robust estimation of the zero-inflation and mean processes.
Table 1
Statistical Analysis of Calibration (CAL; ranging from 0 to 1)
Variable | Estimate | SE | OR / exp(β) | z value | p value | 95% CI OR / exp(β)

Conditional (mean; logit link)
Intercept | 0.87 | 0.21 | 2.38 | | | 1.53, 3.55
MR score*diffic | -0.22 | 0.11 | 0.81 | -2.03 | .043 | 0.66, 0.99
MR score(0)*diffic | 0.23 | 0.04 | 1.26 | | | 1.15, 1.35
MR score*diffic(0) | 0.65 | 0.31 | 1.91 | | | 1.06, 3.59
MR score | 0.52 | 0.30 | 1.68 | 1.71 | .087 | 0.93, 2.86
diffic | 0.15 | 0.02 | 1.16 | 8.27 | < .001 | 1.12, 1.20
gender(w) | -0.30 | 0.12 | 0.74 | -2.59 | .010 | 0.59, 0.93
SEC score | -1.84 | 0.25 | 0.16 | -7.49 | < .001 | 0.10, 0.25

Zero-inflation (structural zeros; logit link)
Intercept | -3.44 | 0.17 | 0.03 | | | 0.02, 0.05
VZ score | 0.47 | 0.14 | 1.60 | 3.28 | .001 | 1.20, 2.14
diffic | -0.35 | 0.02 | 0.70 | -14.23 | < .001 | 0.67, 0.74
gender(w) | -0.44 | 0.07 | 0.64 | -6.13 | < .001 | 0.55, 0.74
SEC score | 3.45 | 0.22 | 31.6 | 16.01 | < .001 | 20.8, 49.6

Dispersion (precision; log link)
Intercept | -0.12 | 0.09 | 0.89 | | | 0.78, 1.04
MR score | -0.48 | 0.12 | 0.62 | -3.89 | < .001 | 0.49, 0.79
diffic | -0.08 | 0.02 | 0.92 | -4.70 | < .001 | 0.90, 0.95
gender(w) | 0.27 | 0.05 | 1.31 | 5.34 | < .001 | 1.18, 1.41
SEC score | 0.28 | 0.11 | 1.32 | 2.52 | .012 | 1.11, 1.58
Note. The intercept in the conditional model represents the estimated logarithmic odds of CAL (calibration) being 1 when all numeric predictors (SEC score = SEC task performance (scaled 0 to 1); VZ score = visualization task performance (scaled 0 to 1); MR score = mental rotation task performance (scaled 0 to 1); and diffic = task item difficulty) are at zero and gender = men. The intercept in the zero-inflation model represents the estimated logarithmic odds of CAL being a structural zero under the same conditions. Higher estimates reflect greater odds of pertaining to the "perfectly calibrated" group (CAL = 0); lower estimates reflect lower odds of pertaining to this group, that is, those calibration scores are better explained by the continuous (miscalibration) process of the conditional model. The intercept in the dispersion model represents the baseline log-precision (φ) of CAL when all numeric predictors are set to zero and gender = men. Larger values of exp(β) (the multiplicative change in precision) indicate higher precision, meaning that responses are more tightly clustered around the predicted mean (lower variability); smaller values of exp(β) indicate lower precision, meaning that responses are more dispersed. Thus, predictors in the dispersion model capture factors associated with heterogeneity in calibration scores beyond differences in the mean. The coefficients for gender reflect the estimates for women. Coefficients for numeric variables reflect changes in the log-odds associated with a one-unit increase in each predictor. The odds ratios reflect the relative change in the odds of CAL with each 1-unit increase in the variables. All numeric predictors are treated as continuous variables.
Estimates, ORs, CIs, and test statistics are derived from a zero-inflated generalized linear mixed model with a beta-regression in the conditional model and a dispersion model, with random intercepts and slopes for diffic by participant.
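The odds-ratio column follows directly from the estimates: OR = exp(β), and a Wald-type 95% CI is exp(β ± 1.96·SE). Using the diffic row of the conditional model (estimate 0.15, SE 0.02) reproduces the tabled OR of 1.16 and approximately its interval (the tabled CIs come from the model fit, so the Wald interval is only a close approximation):

```python
import math

beta, se = 0.15, 0.02  # diffic row, conditional model of Table 1

odds_ratio = math.exp(beta)             # exponentiated estimate, ~1.16
wald_ci = (math.exp(beta - 1.96 * se),  # lower Wald limit, ~1.12
           math.exp(beta + 1.96 * se))  # upper Wald limit, ~1.21
```

The same arithmetic links every Estimate/OR pair in Tables 1 through 5.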
Fig. 1
Interaction Effect of Task Difficulty (diffic SEC) and Mental Rotation Ability (MR score) on Calibration (CAL). For visualization purposes, the data was binned into three levels of MR score. The lines represent predicted values for low (0–0.33), medium (0.33–0.66), and high (0.66–1) levels of mental rotation performance. Shaded areas indicate the standard errors of the fitted lines. For better visualization, the raw data is jittered.
Directional Judgment of Confidence (BIAS)
Model construction for the outcome variable BIAS (ranging from − 1 to 1) resulted in a hurdle model, combining a generalized linear mixed model with a binomial distribution to analyze true zeros and a linear mixed model for the continuous part, both with random intercepts by participant. Regarding the variables VZ score, SEC score, MR score, diffic SEC, and gender, all respective interactions and main effects were analyzed as fixed effects. For the probability-of-zero model, significant effects were found for diffic SEC, SEC score, and gender. For the continuous model, significant effects were found for SEC score*diffic SEC, and gender. Additionally, significant main effects were diffic SEC and SEC score (see Table 2).
The probability-of-zero model indicates that higher SEC performance greatly increased the probability of showing no bias, i.e., making perfectly calibrated judgments (OR = 134.29, p < .001; very large effect), whereas greater task difficulty (OR = 0.61, p < .001; moderate effect) and being a woman (OR = 0.51, p = .012; moderate effect) reduced the likelihood of unbiased responses.
The continuous non-zero model revealed that, among participants who did show bias, gender differences were also significant: women tended to show lower overconfidence than men when bias was present (β = 0.09, p = .002). Furthermore, the interaction between SEC performance and task difficulty was significant (β = 0.19, p < .001). As illustrated in Fig. 2, good performers in the SEC task showed relatively similar BIAS across all levels of difficulty, resulting in a relatively flat slope. In contrast, lower-performing participants showed negative slopes, becoming increasingly overconfident as task difficulty increased.
Table 2
Statistical Analysis of Directional Judgment of Confidence (BIAS; ranging from − 1 to 1)
Variable | Estimate | SE | OR | Test Statistic | p value | 95% CI OR / Estimate

Probability-of-zero model (logit link)
Intercept | -4.71 | 0.51 | 0.01 | | | 0.00, 0.02
SEC score | 4.90 | 0.59 | 134.29 | χ2(1) = 67.19 | < .001 | 45.78, 468.76
diffic SEC | -0.49 | 0.03 | 0.61 | χ2(1) = 288.78 | < .001 | 0.57, 0.65
gender(w) | -0.68 | 0.59 | 0.51 | χ2(1) = 6.26 | .012 | 0.30, 0.87

Continuous non-zero model (95% CI of the estimate)
Intercept | -0.56 | 0.05 | | | | -0.65, -0.47
SEC score*diffic SEC | 0.19 | 0.03 | | χ2(1) = 51.87 | < .001 | 0.14, 0.24
SEC score(0)*diffic SEC | -0.19 | 0.02 | | | | -0.22, -0.15
SEC score*diffic SEC(0) | 0.56 | 0.05 | | | | 0.45, 0.66
SEC score | 0.64 | 0.05 | | χ2(1) = 109.32 | < .001 | 0.53, 0.74
diffic SEC | -0.06 | 0.01 | | χ2(1) = 107.68 | < .001 | -0.07, -0.05
gender(w) | 0.09 | 0.03 | | χ2(1) = 9.55 | .002 | 0.03, 0.14
Note. The intercept for the probability-of-zero model represents the estimated logarithmic odds of BIAS (= Directional Judgment of Confidence) being zero (i.e., perfectly calibrated judgments) when all numeric predictors (SEC score = SEC Task Performance (scaled 0 to 1), and diffic SEC = task item difficulty) are at zero, and gender = men. Positive estimates reflect an increase in the odds of having zero BIAS, and negative estimates reflect lower odds. The coefficients for gender reflect the estimates for women. Coefficients for numeric variables represent changes in the log-odds associated with a one-unit increase in each predictor. All numeric predictors are treated as continuous variables. The intercept for the continuous non-zero model represents the estimated BIAS (excluding zeros) when all numeric predictors are at zero, and gender = men. Estimates, ORs, CIs, and test statistics are derived from a hurdle model combining a generalized linear mixed model with a binomial distribution and a linear mixed model for the continuous component, both with random intercepts by participant.
Fig. 2
Interaction effect of SEC performance (SEC score) and item difficulty (diffic SEC) on Directional Judgment of Confidence (BIAS). For visualization purposes, the data was binned into three levels of SEC score. The lines represent predicted values for low (0–0.33), medium (0.33–0.66), and high (0.66–1) levels of SEC performance. Shaded areas indicate the standard errors of the fitted lines. For better visualization, the raw data is jittered.
Effects on spatial cognitive performance
Section task performance (SEC)
As for performance in the section task (SEC), model construction resulted in a model with random intercepts by participant. Regarding the variables SEC conf, diffic SEC, VZ score, and gender, all respective interactions and main effects were analyzed as fixed effects. Significant effects were found for SEC conf*diffic SEC, VZ score*diffic SEC, and SEC conf*VZ score (see Table 3). SEC accuracy increased significantly with SEC conf and VZ score and decreased significantly with diffic SEC (main effects). Thus, the model establishes that students who reported higher confidence (OR ≈ 6.11, p < .001; large effect) and higher VZ performance (OR ≈ 10.91, p < .001; large effect) had a much greater likelihood of answering correctly. As expected, performance decreased as items became more difficult (OR ≈ 0.67, p < .001; moderate effect).
The final model contained several significant interactions. The positive association between confidence and accuracy decreased as difficulty increased (OR ≈ 0.70, p = .002); that is, confidence was less predictive of task performance as item difficulty increased. VZ performance reduced the negative effect of difficulty, with higher performers maintaining better accuracy on challenging items (OR ≈ 1.57, p < .001; moderate effect; see Fig. 3). Lastly, a significant interaction emerged between confidence and VZ performance (OR ≈ 13.09, p = .001; very large effect), suggesting that the likelihood of answering correctly in the SEC task increased with higher VZ performance, particularly among participants who also reported greater confidence.
Table 3
Statistical Analysis of (Logarithmic Odds of) Section Task Performance (SEC)
Variable | Estimate | SE | OR | Test Statistic | p value | 95% CI OR

Intercept | -0.65 | 0.20 | 0.52 | | | 0.35, 0.81

Interactions
SEC conf*diffic | -0.36 | 0.11 | 0.70 | χ2(1) = 9.70 | .002 | 0.55, 0.88
SEC conf*diffic(0) | 1.38 | 0.25 | 3.98 | | | 2.41, 6.35
SEC conf(0)*diffic | -0.25 | 0.09 | 0.78 | | | 0.65, 0.93
VZ score*diffic | 0.45 | 0.12 | 1.57 | χ2(1) = 12.98 | < .001 | 1.21, 2.03
VZ score*diffic(0) | 0.06 | 0.67 | 1.06 | | | 0.26, 3.86
VZ score(0)*diffic | -0.25 | 0.09 | 0.78 | | | 0.65, 0.93
SEC conf*VZ score | 2.57 | 0.78 | 13.09 | χ2(1) = 10.96 | .001 | 3.16, 70.24
SEC conf*VZ score(0) | 1.38 | 0.25 | 3.98 | | | 2.41, 6.35
SEC conf(0)*VZ score | 0.06 | 0.67 | 1.06 | | | 0.26, 3.86

Main Effects
SEC conf | 1.81 | 0.16 | 6.11 | χ2(1) = 129.67 | < .001 | 4.43, 8.41
diffic | -0.40 | 0.03 | 0.67 | χ2(1) = 258.15 | < .001 | 0.64, 0.70
VZ score | 2.39 | 0.23 | 10.91 | χ2(1) = 101.77 | < .001 | 7.17, 17.29
Note. The intercept in this model represents the estimated logarithmic odds of success on the SEC task when all predictors (SEC conf = confidence, VZ score = spatial ability level (scaled 0 to 1), and diffic = task item difficulty) are at zero. Coefficients for SEC conf, VZ score, and diffic reflect changes in the log-odds associated with a one-unit increase in each predictor, respectively. Their test statistics and p-values represent pairwise comparisons. All predictors are treated as continuous variables. Estimates are derived from a generalized linear mixed model with a binomial distribution and random intercepts by participant.
Fig. 3
Interaction effect of item difficulty and visualization performance (VZ score) on accuracy in the cross-sectional task (SEC). For visualization purposes, the data was binned into three levels of VZ score. The lines represent predicted accuracy values for low (0–0.33), medium (0.33–0.66), and high (0.66–1) levels of visualization performance. Shaded areas indicate the standard errors of the fitted lines. For better visualization, the raw data is jittered.
Mental rotation task performance (MR)
Regarding performance in the mental rotation task (MR), model construction resulted in a model with random intercepts by participant. Regarding the variables MR conf, VZ score, MR ans, and gender, all respective interactions and main effects were analyzed as fixed effects. Significant effects were found for MR conf*MR ans and VZ score*MR ans (see Table 4). MR accuracy increased significantly with MR conf and VZ score and decreased significantly with MR ans (main effects). The main effects showed that students who reported higher confidence (OR ≈ 13.33, p < .001; large effect) and higher VZ performance (OR ≈ 4.67, p < .001; large effect) had a significantly greater likelihood of answering correctly. Performance decreased as the percentage of answered items increased (OR ≈ 0.33, p < .001; moderate negative effect), indicating that a higher response rate was associated with lower accuracy.
However, these effects also depended on one another. When participants responded to only a few items, confidence strongly predicted accuracy: those who felt more confident were indeed more likely to respond correctly. As the number of answered items increased, however, confidence became less predictive of accuracy. This can be observed in Fig. 4 (panel a): when grouping participants by confidence level (0–0.7; 0.7–0.9; 0.9–1), accuracy decreased across all confidence ranges, but the decline was steeper among highly confident participants. The slope was less pronounced for participants with lower confidence, resulting in smaller differences between confidence levels as more items were answered.
In contrast, VZ performance interacted positively with the proportion of answered items (OR ≈ 34.85, p = .001; very large effect), showing that students with high visualization ability maintained greater accuracy even when responding to a larger number of trials, suggesting that high visualization ability supports performance even under greater task demands (see Fig. 4, panel b).
Table 4
Statistical Analysis of (Logarithmic Odds of) Mental Rotation Task Performance (MR)
Variable | Estimate | SE | OR | Test Statistic | p value | 95% CI OR

Intercept | -1.41 | 0.49 | 0.24 | | | 0.09, 0.64

Interactions
MR conf*MR ans | -2.98 | 1.05 | 0.05 | χ2(1) = 8.03 | .005 | 0.01, 0.44
MR conf*MR ans(0) | 4.29 | 0.84 | 72.87 | | | 20.33, 254.51
MR conf(0)*MR ans | 0.02 | 0.84 | 1.02 | | | 0.19, 5.82
VZ score*MR ans | 3.55 | 1.10 | 34.85 | χ2(1) = 10.24 | .001 | 4.10, 307.05
VZ score*MR ans(0) | -0.65 | 0.72 | 0.52 | | | 0.13, 2.16
VZ score(0)*MR ans | 0.02 | 0.84 | 1.02 | | | 0.19, 5.82

Main Effects
MR conf | 2.59 | 0.21 | 13.33 | χ2(1) = 170.24 | < .001 | 9.19, 20.30
MR ans | -1.10 | 0.29 | 0.33 | χ2(1) = 13.92 | < .001 | 0.19, 0.60
VZ score | 1.54 | 0.24 | 4.67 | χ2(1) = 39.59 | < .001 | 2.88, 7.77
Note. The intercept in this model represents the estimated logarithmic odds of success on the MR task (i.e., accuracy) when all predictors (MR conf = confidence, VZ score = spatial ability level (scaled 0 to 1), and MR ans = percentage of answered task items, (scaled 0 to 1)) are at zero. Coefficients for MR conf, VZ score, and MR ans reflect changes in the log-odds associated with a one-unit increase in each predictor, respectively. Their test statistics and p-values represent pairwise comparisons. Estimates are derived from a generalized linear mixed model with a binomial distribution and random intercepts by participant.
Fig. 4
Interaction effects of (a) Confidence (MR conf) and Proportion of Answered Items (MR ans), and (b) Visualization Performance (VZ score) and Proportion of Answered Items on Accuracy in the Mental Rotation Task. For visualization purposes, the data was binned into three levels of one predictor, respectively. In Panel (a), the lines represent predicted accuracy values for three confidence levels: low (0–0.7), medium (0.7–0.9), and high (0.9–1). In Panel (b), the lines represent predicted accuracy for three levels of visualization performance: low (0–0.33), medium (0.33–0.66), and high (0.66–1). Shaded areas indicate the standard errors of the fitted lines. For better visualization, the raw data is jittered.
Visualization task performance (VZ)
Model construction for the outcome variable visualization task performance (VZ) resulted in a model with random intercepts by participant. Regarding the variables VZ conf, VZ ans, diffic, and gender, all respective interactions and main effects were analyzed as fixed effects. Significant differences were found for gender*diffic, and VZ conf (see Table 5).
The main effect of VZ conf indicates that students who reported higher confidence were more likely to respond correctly (OR ≈ 11.47, p < .001; large effect). The interaction between gender and difficulty (OR ≈ 1.25, p = .003; moderate effect) indicated that women were more likely to respond correctly on easier items, an advantage that diminished as item difficulty increased (see Fig. 5). In other words, women may perform better on easier items, while men show a more stable pattern across difficulty levels.
Table 5
Statistical Analysis of (Logarithmic Odds of) Visualization Task Performance (VZ)
Variable | Estimate | SE | OR | Test Statistic | p value | 95% CI OR

Intercept | -1.10 | 0.25 | 0.33 | | | 0.21, 0.55

Interactions
gender(w)*diffic | 0.22 | 0.07 | 1.25 | χ2(1) = 8.63 | .003 | 1.07, 1.45
gender(m)*diffic | -0.20 | 0.07 | 0.82 | | | 0.72, 0.93
gender(w-m)*diffic(0) | -0.47 | 0.26 | 0.63 | | | 0.37, 1.03

Main Effects
gender(w) | 0.08 | 0.18 | 1.08 | χ2(1) = 0.19 | .664 | 0.77, 1.54
diffic | -0.02 | 0.03 | 0.98 | χ2(1) = 0.66 | .417 | 0.92, 1.03
VZ conf | 2.44 | 0.16 | 11.47 | χ2(1) = 242.52 | < .001 | 8.66, 16.09
Note. The intercept in this model represents the estimated logarithmic odds of success on the VZ task when all numeric predictors (VZ conf = confidence; and diffic = task item difficulty) are at zero, and gender = men. The coefficients for gender reflect the estimates for women. Coefficients for VZ conf and diffic reflect changes in the log-odds associated with a one-unit increase in each predictor, respectively. Estimates are derived from a generalized linear mixed model with a binomial distribution and random intercepts by participant.
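On the response scale, the fixed effects in Table 5 translate into predicted probabilities via the inverse logit. The sketch below holds diffic at zero and gender at the male reference and ignores random effects, so the numbers are illustrative back-of-the-envelope values rather than model predictions for any real student:

```python
import math

def predicted_accuracy(conf, intercept=-1.10, b_conf=2.44):
    """Inverse-logit prediction using the VZ conf estimate from Table 5.

    Assumes diffic = 0 and gender at the male reference level, and
    ignores the by-participant random intercepts.
    """
    logit = intercept + b_conf * conf
    return 1 / (1 + math.exp(-logit))

p_low = predicted_accuracy(0.2)   # a low-confidence trial (~0.35)
p_high = predicted_accuracy(0.9)  # a high-confidence trial (~0.75)
```

The jump from roughly 35% to roughly 75% predicted accuracy illustrates why the VZ conf odds ratio of 11.47 counts as a large effect.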
Fig. 5
Interaction effect of Gender and Item Difficulty on Accuracy in the Visualization Task (VZ). The lines represent predicted accuracy values for men and women across increasing levels of item difficulty. Shaded areas indicate the standard errors of the fitted lines. For better visualization, the raw data is jittered.
Discussion
The first objective of this study was to analyze monitoring measures in a spatial task that assesses the ability to infer two-dimensional sections of three-dimensional objects, a competency required for success in anatomical sciences in first-year health sciences students. While this type of metamemory task is not included in classic factor analyses of spatial ability (Carroll, 1993; Lohman, 1988), it is considered a key component of effective spatial thinking in STEM disciplines. Cross-sections in health sciences often involve visualizing the internal structure of three-dimensional objects, such as anatomical structures of the human body, which highlights the relevance and novelty of the present study for educational practice and for improving diagnostic skills in students in training. Given the relevance of spatial abilities in interpreting medical images and their role in clinical judgment, the second objective was to examine performance across different spatial tests considering individual and performance-related factors.
Regarding the first objective, the models provided several notable findings. First, and consistent with the hypotheses, students with greater spatial ability (as measured by the VZ test) were more likely to show perfect calibration (a calibration index of 0). However, when considering the full calibration continuum in the conditional model (scores from 0 to 1), calibration loss increased for students with lower MR scores as item difficulty increased. In other words, the loss of calibration associated with the increase in difficulty is modulated by MR ability. This finding suggests that participants with stronger MR ability exhibit less calibration decline as the task becomes more demanding, whereas those with lower MR ability show more substantial calibration loss under high difficulty. These results align with prior work showing a positive association between spatial ability and calibration (Clem et al., 2013; Garg et al., 2001), indicating that greater spatial performance supports not only higher objective performance, but also more accurate self-assessment of one's performance. More broadly, the findings agree with studies showing that students with greater ability demonstrate better calibration (Morphew, 2021; Stankov et al., 2012). To our knowledge, this study provides the first evidence of a differentiated contribution of spatial skills depending on the type of calibration required: the VZ emerged as a predictor of perfect calibration, whereas MR ability, in interaction with difficulty, plays a key role in miscalibration. Other studies have reported similar findings where different spatial skills contribute differently depending on task demands (Fernández-Méndez et al., 2020; Fernández-Méndez et al., 2024b; Frick, 2019). Regarding gender differences, the zero-inflation model showed that women were less likely than men to be perfectly calibrated, indicating that men more frequently fell into the “exact calibration” category. 
However, within the continuous component (among participants who were not perfectly calibrated), women showed lower miscalibration scores, meaning that their deviations from perfect calibration were smaller. Thus, men were more often perfectly calibrated, but when miscalibrated, women tended to be closer to accuracy. These findings partially overlap with studies reporting better calibration among women in mathematical tasks (Bench et al., 2015; McMurran, 2020). However, the relationship between gender and the presence of perfect calibration or the variation in miscalibration reported in this study constitutes a novel contribution. The results suggest that, while women tend to make calibration errors, these are less pronounced than those observed in men, while the total absence of error is more frequently associated with the male gender. Interestingly, dispersion effects suggested that women and participants with higher SEC performance were more consistent in their calibration, pointing to potentially greater stability in metacognitive monitoring—an avenue worth exploring in future work.
Turning to the bias index, the hurdle model’s zero component indicated a higher probability of zero bias among men, as well as among participants with high scores on the SEC task itself and on items of lower difficulty. In other words, women were about 50% less likely than men to display zero bias. Item difficulty exerted a strong influence, with each unit increase in difficulty reducing the likelihood of zero bias by 39%. Better SEC performance increased the odds, with each 10% performance increase raising the odds of having zero bias by 63%.
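These percentage interpretations follow from exponentiating the Table 2 estimates over the relevant change in the predictor (β = −0.49 per difficulty unit; β = 4.90 over the full 0-to-1 SEC scale, so 0.1 units for a 10% increase):

```python
import math

# diffic SEC: each unit of difficulty multiplies the odds of zero bias
# by exp(-0.49) ~ 0.61, i.e., roughly a 39% reduction.
drop_per_difficulty_unit = 1 - math.exp(-0.49)

# SEC score: a 0.1 increase (10% of the 0-1 scale) multiplies the odds
# by exp(4.90 * 0.1) ~ 1.63, i.e., roughly 63% higher odds.
gain_per_10pct_sec = math.exp(4.90 * 0.1) - 1
```

Note that the full-scale OR of 134.29 (exp(4.90)) and the per-10% figure describe the same coefficient at different step sizes.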
The continuous model, which considers trend scores from − 1 (absolute overconfidence) to + 1 (absolute underconfidence), excluding values of 0, confirmed a general pattern of overconfidence, as indicated by the negative intercept. This aligns with evidence showing that overconfidence is a widespread phenomenon across cognitive and judgment tasks (Stankov et al., 2014).
Regarding the observed effects, the interaction between the score on the SEC task and the difficulty of the item stands out. Performance on the task modulates the impact of difficulty on bias: higher performance leads to a lower tendency towards overconfidence as the difficulty of the item increases. Thus, although high-performing students showed overall milder overconfidence bias, this bias still increased as items became more challenging, but to a lesser extent than in low-performing students, suggesting a better ability to adjust their confidence judgments as task demands increase. These results confirm the hypotheses put forward and align with the so-called easy-hard effect (Juslin et al., 2000; Lichtenstein & Fischhoff, 1977), in addition to showing less bias in high-achieving students (Bunce et al., 2023; Cale, 2023; Morphew, 2021).
Regarding gender, the pattern resembled that observed for calibration. In the zero-bias component, being male was associated with a higher probability of showing no bias, while in the continuous model, being female was related to a lower degree of overconfidence. This indicates that, when bias occurs, women tend to exhibit it to a lesser degree than men. Overall, these findings confirm the initial hypothesis and align with previous literature documenting greater male overconfidence in spatial tasks (Ariel et al., 2018; Pallier et al., 2002) as well as lower confidence levels among women in university contexts (de Bruin et al., 2017; Buratti et al., 2013).
Regarding the second objective of the study, the results show that confidence is a predictor of performance in the three tests (MR, VZ, and SEC). Thus, students who reported higher confidence were also those who achieved the best results, confirming the initial hypothesis and providing evidence of the importance of this monitoring measure as a predictor of performance (Stankov, 2013). Furthermore, spatial ability, measured through the VZ test, emerged as a predictor of expected performance in both MR and SEC. However, its relevance was especially pronounced in tasks involving the visualization of sections and slices in different planes of the figures in the SEC task. This suggests that visualization ability may support students more than the other abilities evaluated here when dealing with the spatial demands of anatomical tasks. This is consistent with the results of the study by Koh et al. (2023), highlighting that students with high spatial ability performed better in anatomy tests that required memorization and drawing of an anatomical structure of the heart. Since spatial skills are known to be trainable (Uttal et al., 2013), the present results support the idea that training prior to coursework may improve students' ability to interpret anatomical images, thereby enhancing future clinical diagnostic performance. Conversely, students with low spatial abilities may struggle to infer and reason about two-dimensional planes, negatively impacting course performance and slowing the acquisition of skills necessary for interpreting spatial images, as previous research on the importance of spatial ability for promoting learning in clinical tasks has already indicated (Clem et al., 2013; Garg et al., 2001; Hoyek et al., 2009).
The analysis of difficulty in the tasks where it was operationalized (section task and visualization task) shows the expected results: higher difficulty was associated with a lower probability of answering correctly. However, the interactions found indicate that difficulty interacts with both confidence and spatial ability to predict success in the SEC task. The first interaction showed that students with higher confidence experienced a sharper decline in accuracy as task difficulty increased. That is, confidence predicts accuracy on easier items but becomes less informative as tasks become more demanding. The second relevant interaction revealed that high spatial visualization ability attenuates the negative effect of difficulty: the probability of success declines more steeply with increasing item difficulty when spatial ability is low. Thus, spatial ability appears to be a stronger predictor of success on more challenging items. These findings suggest that increasing confidence alone does not guarantee better results, particularly in complex and more challenging contexts. Educational interventions should therefore aim not only to enhance confidence, but also to ensure that confidence aligns with actual performance, that is, to improve students' calibration.
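The combined effect of these two interactions can be illustrated on the logit scale. The coefficients below are hypothetical, chosen only to reproduce the qualitative pattern described above; they are not estimates from the study.

```python
# Logistic sketch of the two interactions: confidence x difficulty and
# spatial visualization x difficulty. All coefficients are hypothetical.
import numpy as np

def p_correct(confidence, difficulty, spatial,
              b0=0.5, b_conf=0.8, b_diff=-1.0, b_spat=0.6,
              b_conf_diff=-0.6, b_spat_diff=0.5):
    """P(correct) under a logistic model with both interactions
    (predictors on standardized scales; coefficients illustrative)."""
    logit = (b0 + b_conf * confidence + b_diff * difficulty
             + b_spat * spatial
             + b_conf_diff * confidence * difficulty
             + b_spat_diff * spatial * difficulty)
    return 1.0 / (1.0 + np.exp(-logit))

easy, hard = -1.0, 1.0

# Confidence separates students on easy items...
gap_easy = p_correct(1, easy, 0) - p_correct(-1, easy, 0)
# ...but its advantage shrinks on hard items (negative interaction).
gap_hard = p_correct(1, hard, 0) - p_correct(-1, hard, 0)

# Difficulty costs more when spatial ability is low (protective interaction).
drop_low_spatial = p_correct(0, easy, -1) - p_correct(0, hard, -1)
drop_high_spatial = p_correct(0, easy, 1) - p_correct(0, hard, 1)

print(gap_easy, gap_hard, drop_low_spatial, drop_high_spatial)
```

With these values the confidence gap shrinks from roughly .42 on easy items to under .10 on hard items, while the accuracy drop from easy to hard items is about three times larger for low than for high spatial ability, mirroring both reported interactions.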
Regarding the percentage of unanswered items, the results indicate that a higher percentage of answered items predicts lower accuracy in the MR task. The effect is modulated by confidence and by the level of spatial ability: confidence becomes less predictive as the number of answered items increases. In other words, answering more items affects the outcome differently depending on the level of confidence. Students with high confidence experience a larger drop in accuracy when answering more items, whereas low-confidence students show a more modest decline, as their performance was already low even with few items answered. Thus, confidence is a strong predictor overall, although it ceases to reflect actual accuracy as the number of items answered increases. These results support the idea that the time spent per item is a critical factor in time-constrained tasks such as the MRT. In this type of test, a faster response usually involves a trade-off between speed and accuracy, particularly in people with lower visualization ability. For instance, Liesefeld et al. (2015) showed that participants who modulated their speed based on item complexity committed fewer errors, while those who maintained high speed on complex items showed a substantial increase in errors in an MR task. Thus, slowing down and taking more time to respond to complex items appears to be a more effective strategy for improving accuracy. The decision to go faster or slower is also modulated by contextual factors: with an appropriate incentive, speed can be reduced and performance improved, whereas under increased time pressure execution speed rises and performance is compromised (Liesefeld et al., 2015). Likewise, strategies for solving MR items are not fixed; they can be modified through instructions (Fernández-Méndez et al., 2024b) or through manipulation of time pressure.
Voyer and Saunders (2004) identified two response styles in MR tasks and found that responding to a greater number of items (risky style) was associated with better overall performance, while a tendency not to take risks (cautious style), with a greater number of unanswered items, was linked to lower scores. However, a crucial distinction is that Voyer and Saunders analyzed total test performance, whereas the present study analyzes the probability of answering a given item correctly. These two perspectives are not mutually exclusive but complementary, since answering more items will be associated with a greater number of correct answers overall, even if the probability of getting each item right is reduced by spending less time solving it. Our results support the idea that answering quickly, and therefore answering more items, entails a loss of precision per item, although the magnitude of this cost depends on individual spatial ability. This has important implications for healthcare training, where the main objective is to increase accuracy in solving a particular diagnostic problem, regardless of the time cost.
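A toy calculation makes this complementarity concrete. The time budget and the accuracy function below are hypothetical; the point is only that the expected total score can peak at an intermediate response style even though per-item accuracy falls monotonically as more items are attempted.

```python
# Illustrative speed-accuracy trade-off: attempting more items within a
# fixed time budget lowers per-item accuracy, yet the expected total number
# of correct answers can still be highest at an intermediate style.
# The time budget and accuracy function are hypothetical.
import numpy as np

time_budget = 20.0                   # total minutes for the test (hypothetical)
items_attempted = np.arange(5, 25)   # from a cautious to a risky style

time_per_item = time_budget / items_attempted
# Per-item accuracy rises with time spent per item (logistic, hypothetical).
p_correct = 1.0 / (1.0 + np.exp(-(2.0 * time_per_item - 2.5)))
expected_correct = items_attempted * p_correct

# Per-item accuracy decreases monotonically as more items are attempted...
assert np.all(np.diff(p_correct) < 0)
# ...but the expected total peaks at an intermediate number of attempts.
best = items_attempted[np.argmax(expected_correct)]
print(best, round(float(p_correct[0]), 3), round(float(p_correct[-1]), 3))
```

Under these assumed values the cautious extreme answers almost every attempted item correctly but attempts few, the risky extreme answers many items at low accuracy, and the expected total score is maximized in between, which is the item-level cost that total-score analyses such as Voyer and Saunders (2004) do not capture.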
Regarding gender, the tested models showed no differences in MR or SEC performance, suggesting that, when other variables are held constant, gender differences only emerge in the visualization task in interaction with difficulty. Specifically, men were more likely to answer correctly on easy items, but the gender difference diminished as item difficulty increased. Notably, the gender effect favoring men commonly reported for MR tasks did not replicate here (Fernández-Méndez et al., 2024b; Zell et al., 2015). This finding may be due to the higher proportion of women in health science careers. It is possible that gender differences depend on domain-specific experience and may be influenced by self-efficacy and self-concept, which act as internal informational cues (Jansen et al., 2014; Pallier, 2003; Retelsdorf et al., 2015). In line with this, the meta-analysis by Xue et al. (2024) found no gender differences in confidence, sensitivity, or metacognitive efficiency in perceptual decision-making tasks.
In conclusion, research on these factors is crucial because reliable monitoring is essential for making appropriate control decisions (Ackerman & Thompson, 2017), such as seeking help or verifying errors (Efklides, 2011; Metcalfe, 2002, 2009). Individuals with lower confidence, a high threshold for deciding when to stop seeking further evidence, and strong cognitive abilities should achieve better results (Katyal & Fleming, 2024). This is particularly important in diagnostic judgments, where patient safety may be compromised. The present findings highlight the importance of considering both spatial ability and task difficulty when examining monitoring and accuracy in spatial tasks. For example, in visual diagnostic tasks, such as the interpretation of radiographs, monitoring is especially difficult: there are few external cues to support self-monitoring, people have difficulty recalling where they looked (Kok et al., 2017; Võ et al., 2016), and radiologists often report using visualization strategies inconsistent with their actual behavior (Aizenman et al., 2017). This study also clarifies gender differences in metacognitive tasks among health science students: it does not support the idea of greater male accuracy in spatial reasoning tasks, although men showed greater accuracy in the visualization task, modulated by item difficulty.
Further research should focus not only on examining monitoring in different tasks linked to clinical practice, but also on analyzing how actions are initiated within error detection frameworks (Jackson et al., 2016). Monitoring accuracy is crucial, but the translation of awareness into corrective action is equally important. It is therefore essential to evaluate in detail the metacognitive processes of professionals and to analyze possible changes in these processes during a particular clinical activity (Cleary et al., 2015; Eva & Regehr, 2011; Pieschl, 2009; Sargeant et al., 2010). For example, Garbayo et al. (2023) developed a self-assessment metacognitive calibration instrument to help students learn to maintain a deliberate awareness of their reasoning process in simulated clinical environments. Students with high accuracy benefited most from unrestricted control over their learning process, while students with low accuracy benefited more from restricted control, which protected them from poor self-regulated decisions. Taking into account different levels of task calibration may allow educators to design interventions that improve monitoring while also acknowledging that students do not benefit equally from self-regulated learning strategies. Furthermore, it will allow researchers to evaluate how these metacognitive judgments impact control processes, such as study time or effort invested (Mihalca et al., 2017). The disparity in self-control behaviors among students (Thiede et al., 2010) highlights the need to strengthen monitoring accuracy early in health sciences training (de Bruin et al., 2017).
It is important to note that this study was conducted in the first weeks of the first year of different health science degrees, when students had no prior exposure to spatially intensive coursework or clinical practice. This ensured that spatial or metacognitive performance was not biased by previous training. It is likely that the students' monitoring capacity will vary as experience is acquired, and, although confidence measures may provide information about metacognitive processing (Kleitman & Stankov, 2007; Stankov et al., 2012), it is necessary to evaluate how accurate this measure is with respect to actual performance and how it evolves from the early years of training to professional clinical practice, since there is no evidence of when students improve their metacognition during professional health training (Siqueira et al., 2020). Further studies are needed to analyze the progression of monitoring longitudinally and in other real clinical scenarios where spatial ability may be relevant for image interpretation and clinical diagnosis.
The study presented has certain limitations that should be mentioned for its proper interpretation. The unequal proportion of women in health science degrees resulted in a sample with many more women than men. In addition, the MRT lacked item-specific difficulty indices, making it impossible to analyze difficulty in the same way as in the SEC and VZ tests. Although difficulty in MR tasks is typically operationalized via angular disparity, the MRT contains four response options with different angular disparities, complicating the computation of item-level difficulty. Finally, it should be noted that there are other monitoring measures that were not studied here and that would add value to the present results, such as indices related to relative accuracy.
References:
Ackerman, R., & Thompson, V. A. (2017). Meta-Reasoning: Monitoring and control of thinking and reasoning. Trends in Cognitive Sciences, 21(8), 607–617. https://doi.org/10.1016/j.tics.2017.05.004
Aizenman, A., Drew, T., Ehinger, K. A., Georgian-Smith, D., & Wolfe, J. M. (2017). Comparing search patterns in digital breast tomosynthesis and full-field digital mammography: An eye tracking study. Journal of Medical Imaging, 4(4), 045501. https://doi.org/10.1117/1.JMI.4.4.045501
Ariel, R., Lembeck, N. A., Moffat, S. D., & Hertzog, C. (2018). Are there sex differences in confidence and metacognitive monitoring accuracy for everyday, academic, and psychometrically measured spatial ability? Intelligence, 70, 42–51. https://doi.org/10.1016/j.intell.2018.08.001
Ariel, R., & Moffat, S. (2018). Age-related similarities and differences in monitoring spatial cognition. Aging, Neuropsychology, and Cognition, 25(3), 351–377. https://doi.org/10.1080/13825585.2017.1305086
Arrighi, L., & Hausmann, M. (2022). Spatial anxiety and self-confidence mediate sex/gender differences in mental rotation. Learning & Memory, 29, 312–320. https://doi.org/10.1101/lm.053596.122
Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100(3), 603–617. https://doi.org/10.1348/000712608X377117
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Bates, D., Kliegl, R., Vasishth, S., & Baayen, H. (2015). Parsimonious Mixed Models. ArXiv:1506.04967. http://arxiv.org/abs/1506.04967
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
Bauer, R., & Jansen, P. (2024). A short mindfulness induction might increase women’s mental rotation performance. Consciousness and Cognition, 123, 103721. https://doi.org/10.1016/j.concog.2024.103721
Bauer, R., Jost, L., Günther, B., & Jansen, P. (2022). Pupillometry as a measure of cognitive load in mental rotation tasks with abstract and embodied figures. Psychological Research, 86(5), 1382–1396. https://doi.org/10.1007/s00426-021-01568-5
Bauer, R., Jost, L., & Jansen, P. (2021). The effect of mindfulness and stereotype threat in mental rotation: a pupillometry study. Journal of Cognitive Psychology, 33(8), 861–876. https://doi.org/10.1080/20445911.2021.1967366
de Barros, N., Rodrigues, C. J., Rodrigues, A. J., Jr., de Negri Germano, M. A., & Cerri, G. G. (2001). The value of teaching sectional anatomy to improve CT scan interpretation. Clinical Anatomy, 14(1), 36–41. https://doi.org/10.1002/1098-2353(200101)14:1%3C36::AID-CA1006%3E3.0.CO;2-G
Bench, S. W., Lench, H. C., Liew, J., Miner, K., & Flores, S. A. (2015). Gender gaps in overestimation of math performance. Sex Roles: A Journal of Research, 72(11–12), 536–546. https://doi.org/10.1007/s11199-015-0486-9
Berner, E. S., & Graber, M. L. (2008). Overconfidence as a cause of diagnostic error in medicine. The American Journal of Medicine, 121(5 Suppl), S2–S23. https://doi.org/10.1016/j.amjmed.2008.01.001
Brooks, M. E., Kristensen, K., van Benthem, K. J., Magnusson, A., Berg, C. W., Nielsen, A., Skaug, H. J., Maechler, M., & Bolker, B. M. (2017). glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. The R Journal, 9(2), 378–400. https://doi.org/10.32614/RJ-2017-066
de Bruin, A. B. H., Kok, E. M., Lobbestael, J., & de Grip, A. (2017). The impact of an online tool for monitoring and regulating learning at university: Overconfidence, learning strategy, and personality. Metacognition and Learning, 12, 21–43. https://doi.org/10.1007/s11409-016-9159-5
Bunce, D. M., Schroeder, M. J., Luning Prak, D. J., Teichert, M. A., Dillner, D. K., McDonnell, L. R., Midgette, D. P., & Komperda, R. (2023). Impact of clicker and confidence questions on the metacognition and performance of students of different achievement groups in general chemistry. Journal of Chemical Education, 100(5), 1751–1762. https://doi.org/10.1021/acs.jchemed.2c00928
Buratti, S., Allwood, C. M., & Kleitman, S. (2013). First- and second-order metacognitive judgments of semantic memory reports: The influence of personality traits and cognitive styles. Metacognition and Learning, 8, 79–102. https://doi.org/10.1007/s11409-013-9096-5
Burson, K. A., Larrick, R. P., & Klayman, J. (2006). Skilled or unskilled, but still unaware of it: How perceptions of difficulty drive miscalibration in relative comparisons. Journal of Personality and Social Psychology, 90(1), 60–77. https://doi.org/10.1037/0022-3514.90.1.60
Cale, A. S., Hoffman, L. A., & McNulty, M. A. (2023). Promoting metacognition in an allied health anatomy course. Anatomical Sciences Education, 16(3), 473–485. https://doi.org/10.1002/ase.2218
Callender, A. A., Franco-Watkins, A. M., & Roberts, A. S. (2015). Improving metacognition in the classroom through instruction, training, and feedback. Metacognition and Learning, 11, 215–235. https://doi.org/10.1007/s11409-015-9142-6
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press. https://doi.org/10.1017/CBO9780511571312
Chen, H., Cohen, P., & Chen, S. (2010). How Big is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies. Communications in Statistics - Simulation and Computation, 39(4), 860–864. https://doi.org/10.1080/03610911003650383
Cohen, C. A., & Hegarty, M. (2012). Inferring cross sections of 3D objects: A new spatial thinking test. Learning and Individual Differences, 22(6), 868–874. https://doi.org/10.1016/j.lindif.2012.05.007
Cleary, T. J., Durning, S., Gruppen, L., Hemmer, P., & Artino, A. R. (2013). Self-regulated learning in medical education. In K. Walsh (Ed.), Oxford textbook of medical education (pp. 465–477). Oxford University Press. https://doi.org/10.1093/med/9780199652679.003.0040
Cleary, T. J., Konopasky, A., La Rochelle, J. S., Neubauer, B. E., Durning, S. J., & Artino, A. R. Jr. (2019). First-year medical students’ calibration bias and accuracy across clinical reasoning activities. Advances in Health Sciences Education, 24(4), 767–781. https://doi.org/10.1007/s10459-019-09897-2
Clem, D. W., Donaldson, J., Curs, B., Anderson, S., & Hdeib, M. (2013). Role of spatial ability as a probable ability determinant in skill acquisition for sonographic scanning. Journal of Ultrasound in Medicine, 32(3), 519–528. https://doi.org/10.7863/jum.2013.32.3.519
Cooke-Simpson, A., & Voyer, D. (2007). Confidence and gender differences on the Mental Rotations Test. Learning and Individual Differences, 17, 181–186. https://doi.org/10.1016/j.lindif.2007.03.009
Dautriche, I., Rabagliati, H., & Smith, K. (2021). Subjective confidence influences word learning in a cross-situational statistical learning task. Journal of Memory and Language, 121, 104277. https://doi.org/10.1016/j.jml.2021.104277
Davis, D. A., Mazmanian, P. E., Fordis, M., Van Harrison, R., Thorpe, K. E., & Perrier, L. (2006). Accuracy of physician self-assessment compared with observed measures of competence: a systematic review. Journal Of The American Medical Association, 296(9), 1094–1102. https://doi.org/10.1001/jama.296.9.1094
Desme, C. J., Dick, A. S., Hayes, T. B., & Pruden, S. M. (2024). Individual differences in emerging adults' spatial abilities: What role do affective factors play? Cognitive Research: Principles and Implications, 9(1), 13. https://doi.org/10.1186/s41235-024-00538-w
Djulbegovic, B., van den Ende, J., Hamm, R. M., et al. (2014). How do physicians decide to treat: An empirical evaluation of the threshold model. BMC Medical Informatics and Decision Making, 14(1), 47. https://doi.org/10.1186/1472-6947-14-47
Dunlosky, J., & Hertzog, C. (1998). Training programs to improve learning in later adulthood: Helping older adults educate themselves. In D. J. Hacker, J. Dunlosky, & A. C. Graesser (Eds.), Metacognition in educational theory and practice (pp. 249–275). Erlbaum.
Dunlosky, J., & Metcalfe, J. (2009). Metacognition. Sage.
Efklides, A. (2011). Interactions of metacognition with motivation and affect in self-regulated learning: The MASRL model. Educational Psychologist, 46(1), 6–25. https://doi.org/10.1080/00461520.2011.538645
Ertl-Wagner, B., Barkhausen, J., Mahnken, A. H., Mentzel, H. J., Uder, M., Weidemann, J., … (2016). White Paper: Radiological curriculum for undergraduate medical education in Germany. RöFo – Fortschritte auf dem Gebiet der Röntgenstrahlen und der bildgebenden Verfahren, 188(11), 1017–1023. https://doi.org/10.1055/s-0042-116026
Estes, Z., & Felker, S. (2012). Confidence mediates the sex difference in mental rotation performance. Archives of Sexual Behavior, 41, 557–570. https://doi.org/10.1007/s10508-011-9875-5
Eva, K. W., Cunnington, J. P., Reiter, H. I., Keane, D. R., & Norman, G. R. (2004). How can I know what I don't know? Poor self-assessment in a well-defined domain. Advances in Health Sciences Education, 9(3), 211–224. https://doi.org/10.1023/B:AHSE.0000038209.65714.d4
Eva, K. W., & Regehr, G. (2011). Exploring the divergence between self-assessment and self-monitoring. Advances in Health Sciences Education, 16(3), 311–329. https://doi.org/10.1007/s10459-010-9263-2
Evans, J. G., & Dirks, S. J. (2001). Relationships of admissions data and measurements of psychological constructs with psychomotor performance of dental technology students. Journal of dental education, 65(9), 874–882.
Feingold, A. (2009). Effect sizes for growth-modeling analysis for controlled clinical trials in the same metric as for classical analysis. Psychological Methods, 14(1), 43–53. https://doi.org/10.1037/a0014699
Fernández-Méndez, L. M., Amores, C., Orenes, L., Prieto, I., Rodán, A., Montoro, P. R., Mayas, J., Cabestrero, R., & Contreras, M. J. (2024a). Inducing strategies to solve a mental rotation task is possible: Evidence from a sex-related eye-tracking analysis. The Journal of General Psychology, 1–26. https://doi.org/10.1080/00221309.2024.2433287
Fernández-Méndez, L. M., Contreras, M. J., Mammarella, I. C., Feraco, T., & Meneghetti, C. (2020). Mathematical achievement: The role of spatial and motor skills in 6–8 year-old children. PeerJ, 8, e10095. https://doi.org/10.7717/peerj.10095
Fernández-Méndez, L., Meneghetti, C., Martínez-Molina, A., Mammarella, I., & Contreras, M. J. (2024b). Visuospatial and Motor Ability Contributions in Primary School Geometry. Psicologica, 45(1), e16046. https://doi.org/10.20350/digitalCSIC/16046
Fiedler, K., Ackerman, R., & Scarampi, C. (2019). Metacognition: Monitoring and controlling one’s own knowledge, reasoning and decisions. In R. J. Sternberg & J. Funke (Eds.), The psychology of human thought: An introduction (pp. 89–111). Heidelberg University Publishing. https://doi.org/10.17885/heiup.470
Fife, D. (2022). Flexplot: Graphically-based data analysis. Psychological Methods, 27(4), 477–496. https://doi.org/10.1037/met0000424
Fitzsimmons, C. J., & Thompson, C. A. (2024). Why is monitoring accuracy so poor in number line estimation? The importance of valid cues and systematic variability for U.S. college students. Metacognition and Learning, 19, 21–52. https://doi.org/10.1007/s11409-023-09345-y
Flavell, J. H. (1979). Metacognition and cognitive monitoring: A new area of cognitive–developmental inquiry. American Psychologist, 34(10), 906–911. https://doi.org/10.1037/0003-066X.34.10.906
Fleming, S. M., & Lau, H. C. (2014). How to measure metacognition. Frontiers in Human Neuroscience, 8, 443. https://doi.org/10.3389/fnhum.2014.00443
Foster, N. L., Was, C. A., Dunlosky, J., & Isaacson, R. M. (2017). Even after thirteen class exams, students are still overconfident: The role of memory for past exam performance in student predictions. Metacognition & Learning, 12, 1–19. https://doi.org/10.1007/s11409-016-9158-6
Frick, A. (2019). Spatial transformation abilities and their relation to later mathematics performance. Psychological Research, 83(7), 1465–1484. https://doi.org/10.1007/s00426-018-1008-5
Frumos, F. V., & Grecu, S. P. (2019). Inaccuracy and overconfidence in metacognitive monitoring of university students. Revista de Cercetare şi Intervenţie Socială, 66, 298–314. https://doi.org/10.33788/RCIS.66.17
Garbayo, L. S., Harris, D. M., Fiore, S. M., Robinson, M., & Kibble, J. D. (2023). A metacognitive confidence calibration (MCC) tool to help medical students scaffold diagnostic reasoning in decision-making during high-fidelity patient simulations. Advances in physiology education, 47(1), 71–81. https://doi.org/10.1152/advan.00156.2021
Garg, A. X., Norman, G., & Sperotable, L. (2001). How medical students learn spatial anatomy. The Lancet, 357(9253), 363–364. https://doi.org/10.1016/S0140-6736(00)03649-7
Gordon, M. J. (1991). A review of the validity and accuracy of self-assessments in health professions training. Academic Medicine, 66(12), 762–769. https://doi.org/10.1097/00001888-199112000-00012
Guillot, A., Champely, S., Batier, C., Thiriet, P., & Collet, C. (2007). Relationship between spatial abilities, mental rotation and functional anatomy learning. Advances in Health Sciences Education, 12(4), 491–507. https://doi.org/10.1007/s10459-006-9021-7
Hadwin, A. F., & Webster, E. A. (2013). Calibration in goal setting: Examining the nature of judgments of confidence. Learning and Instruction, 24, 37–47. https://doi.org/10.1016/j.learninstruc.2012.10.001
Halpern, D. F. (2013). Sex differences in cognitive abilities (4th ed.). Psychology Press.
Händel, M., & Dresel, M. (2018). Confidence in performance judgment accuracy: The unskilled and unaware effect revisited. Metacognition and Learning, 13, 265–285. https://doi.org/10.1007/s11409-018-9185-6
Hassan, I., Gerdes, B., Koller, M., Dick, B., Hellwig, D., Rothmund, M., & Zielke, A. (2007). Spatial perception predicts laparoscopic skills on virtual reality laparoscopy simulator. Child’s Nervous System, 23(6), 685–689. https://doi.org/10.1007/s00381-007-0330-9
Hedges, L. V. (2007). Effect sizes in cluster-randomized designs. Journal of Educational and Behavioral Statistics, 32(4), 341–370. https://doi.org/10.3102/1076998606298043
Hedman, L., Ström, P., Andersson, P., Kjellin, A., Wredmark, T., & Felländer-Tsai, L. (2006). High-level visual-spatial ability for novices correlates with performance in a visual-spatial complex surgical simulator task. Surgical endoscopy, 20(8), 1275–1280. https://doi.org/10.1007/s00464-005-0036-6
Hegarty, M., Keehner, M., Cohen, C., Montello, D. R., & Lippa, Y. (2007). The Role of Spatial Cognition in Medicine: Applications for Selecting and Training Professionals. In G. L. Allen (Ed.), Applied spatial cognition: From research to cognitive technology (pp. 285–315). Lawrence Erlbaum Associates.
Hegarty, M., Keehner, M., Khooshabeh, P., & Montello, D. R. (2009). How spatial abilities enhance, and are enhanced by, dental education. Learning and Individual Differences, 19(1), 61–70. https://doi.org/10.1016/j.lindif.2008.04.006
Hegarty, M., & Waller, D. A. (2005). Individual Differences in Spatial Abilities. In P. Shah, & A. Miyake (Eds.), The Cambridge Handbook of Visuospatial Thinking (pp. 121–169). Cambridge University Press.
Hoch, E., Sidi, Y., Ackerman, R., Hoogerheide, V., & Scheiter, K. (2023). Comparing mental effort, difficulty, and confidence appraisals in problem-solving: A metacognitive perspective. Educational Psychology Review, 35, 61. https://doi.org/10.1007/s10648-023-09779-5
Honeycutt, K. J., & Miller, D. R. (2021). Learner metacognitive insights from writing professional clinical practicum reflections: An instrumental case study of a university-based MLS program. Journal of Allied Health, 50(1), 67–72.
Hoyek, N., Collet, C., Rastello, O., Fargier, P., Thiriet, P., & Guillot, A. (2009). Enhancement of mental rotation abilities and its effect on anatomy learning. Teaching and Learning in Medicine, 21(3), 201–206. https://doi.org/10.1080/10401330903014178
Huff, J. D., & Nietfeld, J. L. (2009). Using strategy instruction and confidence judgments to improve metacognitive monitoring. Metacognition and Learning, 4(2), 161–188. https://doi.org/10.1007/s11409-009-9042-8
Jackson, S. A., Kleitman, S., Stankov, L., & Howie, P. (2016). Cognitive abilities, monitoring confidence, and control thresholds explain individual differences in heuristics-and-biases tasks. Frontiers in Psychology, 7, 1559. https://doi.org/10.3389/fpsyg.2016.01559
Jansen, M., Schroeders, U., & Schiefele, U. (2014). Academic self-concept in science: Multidimensionality, relations to achievement measures, and gender differences. Learning and Instruction, 31, 65–83. https://doi.org/10.1016/j.lindif.2013.12.003
Jansen-Osmann, P., & Heil, M. (2007). Suitable stimuli to obtain (no) gender differences in the speed of cognitive processes involved in mental rotation. Brain and Cognition, 64(3), 217–227. https://doi.org/10.1016/j.bandc.2007.03.002
Jin, S., Verhaeghen, P., & Rahnev, D. (2022). Across-subject correlation between confidence and accuracy: A meta-analysis of the Confidence Database. Psychonomic Bulletin & Review, 29(4), 1405–1413. https://doi.org/10.3758/s13423-022-02063-7
Juslin, P., Winman, A., & Olsson, H. (2000). Naive empiricism and dogmatism in confidence research: A critical examination of the hard–easy effect. Psychological Review, 107(2), 384–396. https://doi.org/10.1037/0033-295X.107.2.384
Katyal, S., & Fleming, S. M. (2024). The future of metacognition research: Balancing construct breadth with measurement rigor. Cortex; A Journal Devoted To The Study Of The Nervous System And Behavior, 171, 223–234. https://doi.org/10.1016/j.cortex.2023.11.002
Keehner, M., Lippa, Y., Montello, D. R., Tendick, F., & Hegarty, M. (2006). Learning a spatial skill for surgery: How the contributions of abilities change with practice. Applied Cognitive Psychology, 20(4), 487–503. https://doi.org/10.1002/acp.1198
Khalil, M. K., Payer, A. F., & Johnson, T. E. (2005). Effectiveness of using cross-sections in the recognition of anatomical structures in radiological images. Anatomical record Part B New anatomist, 283(1), 9–13. https://doi.org/10.1002/ar.b.20053
Koh, M. Y., Tan, G. J. S., & Mogali, S. R. (2023). Spatial ability and 3D model colour-coding affect anatomy performance: A cross-sectional and randomized trial. Scientific Reports, 13(1), 7879. https://doi.org/10.1038/s41598-023-35046-2
Kok, E. M., Niehorster, D. C., van der Gijp, A., Rutgers, D. R., Auffermann, W. F., van der Schaaf, M., Kester, L., & van Gog, T. (2024). The effects of gaze-display feedback on medical students' self-monitoring and learning in radiology. Advances in Health Sciences Education, 29(5), 1689–1710. https://doi.org/10.1007/s10459-024-10322-6
Kleitman, S., & Moscrop, T. (2010). Self-confidence and academic achievements in primary-school children: Their relationships and links to parental bonds, intelligence, age, and gender. In A. Efklides, & P. Misailidi (Eds.), Trends and prospects in metacognition research (pp. 293–326). Springer. http://dx.doi.org/10.1007/978-1-4419-6546-2_14
Kleitman, S., & Stankov, L. (2007). Self-confidence and metacognitive processes. Learning and Individual Differences, 17(2), 161–173. https://doi.org/10.1016/j.lindif.2007.03.004
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77, 1121–1134. https://doi.org/10.1037/0022-3514.77.6.1121
Kuiper, R. A., & Pesut, D. J. (2004). Promoting cognitive and metacognitive reflective reasoning skills in nursing practice: Self-regulated learning theory. Journal of Advanced Nursing, 45(4), 381–391. https://doi.org/10.1046/j.1365-2648.2003.02921.x
Lai, E. R. (2011, April). Metacognition: A literature review [Research report]. Pearson.
Langlois, J., Wells, G. A., Lecourtois, M., Bergeron, G., Yetisir, E., & Martin, M. (2013). Sex differences in spatial abilities of medical graduates entering residency programs. Anatomical Sciences Education, 6(6), 368–375. https://doi.org/10.1002/ase.1360
Lemieux, C. L., Collin, C. A., & Watier, N. N. (2019). Gender differences in metacognitive judgments and performance on a goal-directed wayfinding task. Journal of Cognitive Psychology, 31(4), 453–466. https://doi.org/10.1080/20445911.2019.1625905
Levine, S. C., Huttenlocher, J., Taylor, A., & Langrock, A. (1999). Early sex differences in spatial skill. Developmental Psychology, 35(4), 940–949. https://doi.org/10.1037/0012-1649.35.4.940
Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about how much they know? Organizational Behavior and Human Performance, 20(2), 159–183. https://doi.org/10.1016/0030-5073(77)90001-0
Liesefeld, H. R., Fu, X., & Zimmer, H. D. (2015). Fast and careless or careful and slow? Apparent holistic processing in mental rotation is explained by speed-accuracy trade-offs. Journal of Experimental Psychology: Learning, Memory, and Cognition, 41(4), 1140–1151. https://doi.org/10.1037/xlm0000081
Lippa, R. A., Collaer, M. L., & Peters, M. (2010). Sex differences in mental rotation and line angle judgments are positively associated with gender equality and economic development across 53 nations. Archives of Sexual Behavior, 39(4), 990–997. https://doi.org/10.1007/s10508-008-9460-8
Lohman, D. F. (1988). Spatial abilities as traits, processes, and knowledge. In R. J. Sternberg (Ed.), Advances in the psychology of human intelligence (Vol. 4, pp. 181–248). Lawrence Erlbaum Associates.
Lundeberg, M., & Mohan, L. (2009). Context matters: Gender and cross-cultural differences in confidence. In D. J. Hacker, J. Dunlosky, & A. C. Graesser (Eds.), Handbook of metacognition in education (pp. 221–239). Routledge/Taylor & Francis Group.
Luursema, J. M., Buzink, S. N., Verwey, W. B., & Jakimowicz, J. J. (2010). Visuo-spatial ability in colonoscopy simulator training. Advances in Health Sciences Education, 15(4), 685–694. https://doi.org/10.1007/s10459-010-9230-y
Mabire-Yon, R. (2025). Hurdle Models in Psychology—A Practical Guide for Inflated Data. International Journal of Psychology, 60(3), e70042. https://doi.org/10.1002/ijop.70042
Maeda, Y., & Yoon, Y. S. (2013). A meta-analysis on gender differences in mental rotation ability measured by the Purdue Spatial Visualization Tests: Visualization of Rotations (PSVT:R). Educational Psychology Review, 25, 69–94. https://doi.org/10.1007/s10648-012-9215-x
McGillycuddy, M., Warton, D. I., Popovic, G., & Bolker, B. M. (2025). Parsimoniously fitting large multivariate random effects in glmmTMB. Journal of Statistical Software, 112(1), 1–19. https://doi.org/10.18637/jss.v112.i01
McMurran, M. B. (2020). Gender differences in confidence, calibration, and willingness to share problem solutions in math (Doctoral dissertation, University of California, Riverside). ProQuest Dissertations Publishing. https://eric.ed.gov/?id=ED652658
Medina, M. S., Castleberry, A. N., & Persky, A. M. (2017). Strategies for improving learner metacognition in health professional education. American Journal of Pharmaceutical Education, 81(4), 78. https://doi.org/10.5688/ajpe81478
Metcalfe, J. (2002). Is study time allocated selectively to a region of proximal learning? Journal of Experimental Psychology: General, 131, 349–363. https://doi.org/10.1037/0096-3445.131.3.349
Metcalfe, J. (2009). Metacognitive Judgments and Control of Study. Current Directions in Psychological Science, 18(3), 159–163. https://doi.org/10.1111/j.1467-8721.2009.01628.x
Meyer, A. N., Payne, V. L., Meeks, D. W., Rao, R., & Singh, H. (2013). Physicians' diagnostic accuracy, confidence, and resource requests: A vignette study. JAMA Internal Medicine, 173(21), 1952–1958. https://doi.org/10.1001/jamainternmed.2013.10081
Mihalca, L., Mengelkamp, C., & Schnotz, W. (2017). Accuracy of metacognitive judgments as a moderator of learner-control effectiveness in problem-solving tasks. Metacognition and Learning, 12(3), 357–379. https://doi.org/10.1007/s11409-017-9173-2
Mix, K. S., & Cheng, Y. L. (2012). The relation between space and math: Developmental and educational implications. Advances in Child Development and Behavior, 42, 197–243. https://doi.org/10.1016/B978-0-12-394388-0.00006-X
Morphew, J. W. (2021). Changes in metacognitive monitoring accuracy in an introductory physics course. Metacognition and Learning, 16(1), 89–111. https://doi.org/10.1007/s11409-020-09239-3
Murdoch-Eaton, D., & Whittle, S. (2012). Generic skills in medical education: Developing the tools for successful lifelong learning. Medical Education, 46(1), 120–128. https://doi.org/10.1111/j.1365-2923.2011.04065.x
Munns, M. E., He, C., Topete, A., & Hegarty, M. (2023). Visualizing cross-sections of 3D objects: Developing efficient measures using item response theory. Journal of Intelligence, 11(11), 205. https://doi.org/10.3390/jintelligence11110205
Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12, 595–600. https://doi.org/10.1175/1520-0450(1973)012%3C0595:ANVPOT%3E2.0.CO;2
Naug, H. L., Colson, N. J., & Donner, D. G. (2011). Promoting metacognition in first year anatomy laboratories using plasticine modeling and drawing activities: A pilot study of the blank page technique. Anatomical Sciences Education, 4(4), 231–234. https://doi.org/10.1002/ase.228
Nelson, T. O. (1996). Consciousness and metacognition. American Psychologist, 51(2), 102–116. https://doi.org/10.1037/0003-066X.51.2.102
Nelson, T. O., & Narens, L. (1990). Metamemory: A theoretical framework and new findings. In G. H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 26, pp. 125–169). Academic Press.
Nelson, T. O., & Narens, L. (1994). Why investigate metacognition? In J. Metcalfe, & A. P. Shimamura (Eds.), Metacognition: Knowing about knowing (pp. 1–25). The MIT Press.
Pallier, G. (2003). Gender differences in the self-assessment of accuracy on cognitive tasks. Sex Roles, 48, 265–276. https://doi.org/10.1023/A:1022877405718
Pallier, G., Wilkinson, R., Danthiir, V., Kleitman, S., Knezevic, G., Stankov, L., & Roberts, R. D. (2002). The role of individual differences in the accuracy of confidence judgments. The Journal of General Psychology, 129, 257–299. https://doi.org/10.1080/00221300209602099
Pauker, S. G., & Kassirer, J. P. (1980). The threshold approach to clinical decision making. The New England Journal of Medicine, 302(20), 1109–1117. https://doi.org/10.1056/NEJM198005153022003
Pek, J., & Flora, D. B. (2018). Reporting effect sizes in original psychological research: A discussion and tutorial. Psychological Methods, 23(2), 208–225. https://doi.org/10.1037/met0000126
Peters, M. (2005). Sex differences and the factor of time in solving Vandenberg and Kuse mental rotation problems. Brain and Cognition, 57, 176–184. https://doi.org/10.1016/j.bandc.2004.08.052
Peters, M., & Battista, C. (2008). Applications of mental rotation figures of the Shepard and Metzler type and description of a mental rotation stimulus library. Brain and Cognition, 66(3), 260–264. https://doi.org/10.1016/j.bandc.2007.09.003
Peters, M., Laeng, B., Latham, K., Jackson, M., Zaiyouna, R., & Richardson, C. (1995). A redrawn Vandenberg and Kuse Mental Rotations Test: Different versions and factors that affect performance. Brain and Cognition, 28(1), 39–58. https://doi.org/10.1006/brcg.1995.1032
Pieschl, S. (2009). Metacognitive calibration: An extended conceptualization and potential applications. Metacognition and Learning, 4(1), 3–31. https://doi.org/10.1007/s11409-008-9030-4
Prieto, G., & Velasco, A. D. (2002a). Construção de um teste de visualização a partir da psicologia cognitiva. Avaliação Psicológica: Interamerican Journal of Psychological Assessment, 1(1), 39–47.
Prieto, G., & Velasco, A. D. (2002b). Predicting academic success of engineering students in technical drawing from visualization test scores. Journal for Geometry and Graphics, 6, 99–109.
Prieto, G., & Velasco, A. D. (2004). Training visualization ability by technical drawing. Journal for Geometry and Graphics, 8, 107–115.
Prieto, G., Velasco, A. D., Arias-Barahona, R., Anido, M., Núñez, A. M., & Có, P. (2007). Análisis de la dificultad de un banco de ítems de visualización espacial [Difficulty analysis in a spatial visualization item bank]. Ciencias Psicológicas, 1(1), 71–79. https://doi.org/10.22235/cp.v0i1.573
Prokop, T. R. (2019). The relevance of metacognition for clinical reasoning. In G. M. Musolino, & G. Jensen (Eds.), Clinical reasoning and decision making in physical therapy (1st ed., pp. 169–176). Routledge.
Pusic, M. V., Hall, E., Billings, H., Branzetti, J., Hopson, L. R., Regan, L., Gisondi, M. A., & Cutrer, W. B. (2022). Educating for adaptive expertise: Case examples along the medical education continuum. Advances in Health Sciences Education: Theory and Practice, 27(5), 1383–1400. https://doi.org/10.1007/s10459-022-10165-z
R Core Team. (2025). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Rahe, M., & Jansen, P. (2022). Sex differences in mental rotation: The role of stereotyped material, perceived performance and extrinsic spatial ability. Journal of Cognitive Psychology, 34(3), 400–409. https://doi.org/10.1080/20445911.2021.2011896
Ramful, A., Lowrie, T., & Logan, T. (2016). Measurement of Spatial Ability: Construction and Validation of the Spatial Reasoning Instrument for Middle School Students. Journal of Psychoeducational Assessment, 35(7), 709–727. https://doi.org/10.1177/0734282916659207
Retelsdorf, J., Schwartz, K., & Asbrock, F. (2015). "Michael can't read!" Teachers' gender stereotypes and boys' reading self-concept. Journal of Educational Psychology, 107, 186–194. https://doi.org/10.1037/a0037107
Rights, J. D., & Sterba, S. K. (2019). Quantifying explained variance in multilevel models: An integrative framework for defining R-squared measures. Psychological Methods, 24(3), 309–338. https://doi.org/10.1037/met0000184
Risucci, D., Geiss, A., Gellman, L., Pinard, B., & Rosser, J. (2001). Surgeon-specific factors in the acquisition of laparoscopic surgical skills. American Journal of Surgery, 181(4), 289–293. https://doi.org/10.1016/s0002-9610(01)00574-8
Rivers, M. L., Fitzsimmons, C. J., Fisk, S. R., Dunlosky, J., & Thompson, C. A. (2021). Gender differences in confidence during number-line estimation. Metacognition and Learning, 16(1), 157–178. https://doi.org/10.1007/s11409-020-09243-7
Royce, C. S., Hayes, M. M., & Schwartzstein, R. M. (2019). Teaching critical thinking: A case for instruction in cognitive biases to reduce diagnostic errors and improve patient safety. Academic Medicine, 94(2), 187–194. https://doi.org/10.1097/ACM.0000000000002518
Sandars, J., & Cleary, T. J. (2011). Self-regulation theory: applications to medical education: AMEE Guide 58. Medical Teacher, 33(11), 875–886. https://doi.org/10.3109/0142159X.2011.595434
Sargeant, J., Armson, H., Chesluk, B., Dornan, T., Eva, K., Holmboe, E., Lockyer, J., Loney, E., Mann, K., & van der Vleuten, C. (2010). The processes and dimensions of informed self-assessment: A conceptual model. Academic Medicine: Journal of the Association of American Medical Colleges, 85(7), 1212–1220. https://doi.org/10.1097/ACM.0b013e3181d85a4e
Schraw, G. (2009). A conceptual analysis of five measures of metacognitive monitoring. Metacognition and Learning, 4(1), 33–45. https://doi.org/10.1007/s11409-008-9031-3
Schraw, G., & Dennison, R. S. (1994). Assessing metacognitive awareness. Contemporary Educational Psychology, 19(4), 460–475. https://doi.org/10.1006/ceps.1994.1033
Sendra Portero, F., Domínguez Pinos, D., & Souto Bayarri, M. (2023). The current situation of Radiology training in medical studies in Spain. Radiología, 65, 580–592. https://doi.org/10.1016/j.rx.2023.07.002
Siqueira, M. A. M., Gonçalves, J. P., Mendonça, V. S., et al. (2020). Relationship between metacognitive awareness and motivation to learn in medical students. BMC Medical Education, 20, 393. https://doi.org/10.1186/s12909-020-02318-8
Stankov, L. (2013). Noncognitive predictors of intelligence and academic achievement: An important role of confidence. Personality and Individual Differences, 55, 727–732. https://doi.org/10.1016/j.paid.2013.07.006
Stankov, L., & Lee, J. (2014). Overconfidence Across World Regions. Journal of Cross-Cultural Psychology, 45(5), 821–837. https://doi.org/10.1177/0022022114527345
Stankov, L., Lee, J., Luo, W., & Hogan, D. J. (2012). Confidence: A better predictor of academic achievement than self-efficacy, self-concept and anxiety? Learning and Individual Differences, 22, 747–758. https://doi.org/10.1016/j.lindif.2012.05.013
Tan, S. M., Ladyshewsky, R. K., & Gardner, P. (2010). Using blogging to promote clinical reasoning and metacognition in undergraduate physiotherapy field-work programmes. Australasian Journal of Educational Technology, 26(3), 355–368. https://doi.org/10.14742/ajet.1080
Thiede, K. W., Griffin, T. D., Wiley, J., & Anderson, M. C. M. (2010). Poor metacomprehension accuracy as a result of inappropriate cue use. Discourse Processes, 47, 331–362. https://doi.org/10.1080/01638530902959927
Titus, S. J., & Horsman, E. (2009). Characterizing and improving spatial visualization skills. Journal of Geoscience Education, 57(4), 242–254. https://doi.org/10.5408/1.3559671
Uttal, D. H., Meadow, N. G., Tipton, E., Hand, L. L., Alden, A. R., Warren, C., & Newcombe, N. S. (2013). The malleability of spatial skills: A meta-analysis of training studies. Psychological Bulletin, 139(2), 352–402. https://doi.org/10.1037/a0028446
Vandenberg, S. G., & Kuse, A. R. (1978). Mental rotations: a group test of three-dimensional spatial visualization. Perceptual and Motor Skills, 47, 599–604. https://doi.org/10.2466/PMS.47.6.599-604
Võ, M. L. H., Aizenman, A. M., & Wolfe, J. M. (2016). You think you know where you looked? You better look again. Journal of Experimental Psychology: Human Perception and Performance, 42(10), 1477–1481. https://doi.org/10.1037/xhp0000264
Voyer, D., & Saunders, K. A. (2004). Gender differences on the mental rotations test: a factor analysis. Acta Psychologica, 117, 74–94. https://doi.org/10.1016/j.actpsy.2004.05.003
Voyer, D., Voyer, S., & Bryden, M. P. (1995). Magnitude of sex differences in spatial abilities: A meta-analysis and consideration of critical variables. Psychological Bulletin, 117(2), 250–270. https://doi.org/10.1037/0033-2909.117.2.250
Wanzel, K. R., Hamstra, S. J., Anastakis, D. J., Matsumoto, E. D., & Cusimano, M. D. (2002). Effect of visual-spatial ability on learning of spatially-complex surgical skills. The Lancet, 359(9302), 230–231. https://doi.org/10.1016/S0140-6736(02)07441-X
Wanzel, K. R., Hamstra, S. J., Caminiti, M. F., Anastakis, D. J., Grober, E. D., & Reznick, R. K. (2003). Visual-spatial ability correlates with efficiency of hand motion and successful surgical performance. Surgery, 134(5), 750–757. https://doi.org/10.1016/S0039-6060(03)00248-4
Weimer, A., Recker, F., Vieth, T., Buggenhagen, H., Schamberger, C., Berthold, R., Berthold, S., Stein, S., Schmidmaier, G., Kloeckner, R., Neubauer, R., Müller, L., Weinmann-Menke, J., & Weimer, J. (2024). Undergraduate musculoskeletal ultrasound training based on current national guidelines: A prospective controlled study on transferability. BMC Medical Education, 24(1), 1193. https://doi.org/10.1186/s12909-024-06203-6
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4. https://ggplot2.tidyverse.org
World Health Organization & World Alliance for Patient Safety Research Priority Setting Working Group. (2008). Summary of the evidence on patient safety: Implications for research. World Health Organization. https://apps.who.int/iris/handle/10665/43874
Xue, K., Zheng, Y., Papalexandrou, C., Hoogervorst, K., Allen, M., & Rahnev, D. (2024). No gender difference in confidence or metacognitive ability in perceptual decision making. iScience, 27(12), 111375. https://doi.org/10.1016/j.isci.2024.111375
Zell, E., Krizan, Z., & Teeter, S. R. (2015). Evaluating gender similarities and differences using metasynthesis. American Psychologist, 70(1), 10–20. https://doi.org/10.1037/a0038208