Title: A Review of the Reliability and Validity of Objective Structured Clinical Examinations (OSCE) in Assessing Clinical Competency in Anaesthesia: A Systematic Review.
Sohel M. G. Ahmed 1,2 (ORCID ID 0000-0002-4992-471X)
Jibril Bello 3
1. Hamad Medical Corporation, PO Box 11340, Doha, Qatar
2. College of Medicine, University of Qatar, Doha, Qatar
3. University of South Wales, Pontypridd, South Wales, United Kingdom.
Corresponding author: sohelm@yahoo.com
Abstract
The Objective Structured Clinical Examination (OSCE) has become a cornerstone in assessing clinical competence in medical education, particularly in anaesthesia training and certification exams. This systematic review evaluates the reliability and validity of OSCEs in anaesthesia assessments, addressing the critical question of whether they effectively measure clinical competency for consequential decisions such as licensure and certification.
A comprehensive literature search was conducted using databases including PubMed, Google Scholar, Embase, ERIC, and Scopus, adhering to PRISMA guidelines. Studies published between 2000 and 2024 were included, focusing on anaesthesia trainees or practitioners undergoing assessments. Key outcomes examined were reliability metrics (internal consistency, inter-rater reliability) and validity evidence (content, construct, criterion, and consequential validity).
Findings indicate that OSCEs demonstrate strong reliability when properly designed, with Cronbach’s alpha values often exceeding 0.7 and inter-rater reliability ranging from 0.7 to 0.9. Validity evidence supports OSCEs as robust tools for assessing clinical skills, particularly when scenarios align with real-world clinical tasks and scoring rubrics are standardised. However, challenges such as examiner bias, station heterogeneity, and resource intensity were identified as limitations. Innovations like virtual OSCEs (VOSCEs) and simulation-based assessments show promise but require further validation.
The review concludes that OSCEs, when implemented with rigorous examiner training, standardized scenarios, and iterative quality improvements, offer a reliable and valid method for evaluating clinical competence in anaesthesia. Recommendations include enhancing examiner calibration, integrating workplace-based assessments, and exploring technological advancements to address current limitations. Future research should focus on multi-institutional validation and longitudinal studies to strengthen the generalizability of OSCE outcomes.
Keywords:
OSCE
reliability
validity
anaesthesia
1. Introduction
1.1 Assessments in Medical Education
The assessment of medical graduates' clinical competence has long been a matter of debate. Several methods are in use, of which the conventional oral and practical examination is common. With an increasing number of medical graduates and, more importantly, the establishment of new medical colleges and universities with medical facilities, this type of evaluation is inadequate, time-consuming, and prone to errors (Jindal & Khurana, 2016). To overcome this, examiners have moved to paper-and-pencil tests. However, to evaluate the psychomotor and affective domains of medical graduates, the Objective Structured Clinical Examination (OSCE) was introduced at the University of Dundee in the 1970s. Currently, the range of assessment methods in use includes Multiple Choice Questions (MCQs), Modified Essay Questions (MEQs), OSCEs, and various forms of structured clinical exams. Among these, the OSCE has gained increasing importance as an assessment tool in professional exams, particularly for clinical skills. In this review, we aim to answer the question of whether OSCE exams used to assess knowledge and skills in anaesthesia are reliable and valid, and to what extent they might be used as a measure of clinical competency.
1.2 Objective Structured Clinical Examination (OSCE): An Overview
OSCE has been defined as "a set of sequential skill stations, each of which concentrates on a particular skill and is rated in a standardised way through the use of carefully structured checklists or global rating scales" (Khan, 2013).
The OSCE may be used to assess a wide variety of clinical skills. These range from simple skills, such as giving information to an appropriate adult, child, relative, or patient, to more complex skills such as history taking, physical examination and interpreting investigations. The OSCE can also be used at the postgraduate level to assess other competencies that might be difficult to evaluate using more traditional methods. Examples include competencies in prescribing practices, therapeutics, teaching and learning, and research abilities.
A typical OSCE station has a card with a heading indicating the station of the examination, with a clearly stated set of instructions to the candidate. A time period is allowed to read the card prior to entering the station. Examination stations are standard for all students and are all soundproof booths in order to standardise the listening conditions. Two examiners assess each student; one asks the questions, and both mark the student's answers. The examiners should be experienced and clinically active physicians (Jindal & Khurana, 2016).
OSCEs were developed as a result of the need for an examination that was more methodical and objective in clinical evaluation. This format was specifically designed to assess the clinical competence of candidates in high-stakes licensing examinations. OSCE has its own merits and demerits as an examination format. Despite its extensive use in recent years, there is still a scarcity of large-scale data on the reliability and validity of OSCE in assessing clinical competence in anaesthesia. This information may be the pivotal factor in the decision on the use of OSCE in anaesthesia training. It is worth assessing how a shift to an environment that relies on focus stations and uses lay examiners who are sometimes trainees could affect reliability and validity (Fisseha & Desalegn, 2021).
1.3 Clinical Competency Assessments in Anaesthesia
Anaesthesia is a branch of medical science that deals with blocking patients' sensitivity to pain during, and for a short period after, surgery or potentially painful interventions, e.g. diagnostic tests. It requires an understanding of the pharmacological, physiological, and physical basis for drug-induced loss of consciousness, the management of patients during and after major surgery, and control of pain postoperatively. Safety is paramount, and in many countries, anaesthetists are involved in the delivery of intensive care medicine. Improving the standards of anaesthesia delivery will reduce global surgical and anaesthetic complications, improve patient outcomes, and reduce the marked inequality in surgery which currently exists. If, like any medical specialisation, the anaesthesia speciality is to flourish, anaesthesiologists-in-training must be of the highest standards in terms of knowledge base and clinical performance. Significant resources are used to ensure that those in training programs reach such a level. Certification examinations (e.g., M.B., DNB, FRCA) and other formative assessments, designed to have an appropriate psychometric profile, are currently used to determine whether a candidate has achieved the required standard.
An issue with written examinations is that they have limited validity in relation to the prediction of clinical competence (Jindal & Khurana, 2016). A second and equally important point is that, given the tradition of strong clinical training, there is no clear evidence to show that high performance in such written tests is indicative of safe, competent clinical practice. By definition, anaesthesia involves 'doing', and an anaesthetist must command and then apply professional skills under pressure in real-time in order to mitigate risk. With this in mind, some countries have developed the OSCE as a primary component in the examination process. This explicitly tests the clinical performance of a candidate in an examined setting, as opposed to their ability to pass a written test. In the past, the incidence of pass/fail assessments led to candidates securing results. In an effort to eliminate subjectivity in the assessment of clinical competencies, OSCE is now included in many high-stakes examinations and formative assessments to ensure a consistent and objective examination process. During these examinations, anaesthesia residents perform procedures such as intubation, spinal anaesthesia, and managing complications with graded patient actors. These clinical scenarios need to be contemporary, realistic and challenging. The results, or the outputs, from OSCE stations are now widely used in summative adversarial examinations, and in order to ensure the reliability and validity of such data, it is important to understand any inherent errors in the measurement process. The multifaceted nature of the OSCE process can be ensured by using a number of key techniques in the development and running of the OSCE stations. With particular regard to anaesthesia OSCE and the high-stakes nature of such assessments, the role of measurement bias in OSCE outputs also warrants discussion.
2. Research Methodology
2.1 Ethical Approval
The research was deemed low risk and, as such, was reviewed and approved by the Low-Risk Ethical Committee at the Faculty of Life Science and Education, University of South Wales.
2.2 Research Protocol
For this systematic review, electronic databases were searched for relevant articles. These included PubMed, Google Scholar, Embase, the Education Resources Information Center (ERIC) and Scopus. A manual search through the reference lists of relevant articles was also done to broaden the literature search. The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines were followed to improve the rigour of the evaluation (Liberati, 2009). The research question was directed towards assessing the reliability and validity of OSCEs used to evaluate clinical competence in anaesthesia. The search terms were combined with Boolean operators: (anaesthesia OR anesthesia) AND (objective structured clinical examination OR OSCE) AND reliability AND validity AND clinical competence assessment AND evaluations.
2.3 Inclusion Criteria
The inclusion criteria were developed in accordance with the PICOS framework.
1. Population
Studies that included anaesthesia trainees (technicians, nurses, residents, fellows) or practitioners undergoing formative or summative evaluations (e.g., targeted skill training, board certification, licensure, or progression in training).
2. Intervention
Studies that included the use of OSCEs to evaluate clinical competency in anaesthesia.
3. Comparison
Studies that compared OSCE to other assessment methods (e.g., written/oral exams, traditional clinical evaluations) or different OSCE formats (e.g., virtual vs. in-person). However, studies that solely evaluate OSCE's psychometric properties are also included.
4. Outcome
Reliability Outcomes: Measures such as internal consistency (Cronbach's α), inter-rater reliability, and test-retest reliability.
Validity Outcomes: Content validity (expert consensus), construct validity (factor analysis), criterion validity (concurrent/predictive), or face validity.
5. Study Design
Primary observational studies, validation studies, cross-sectional analyses, cohort studies, case reports and methodological studies reporting psychometric data were included.
Other inclusion criteria were:
Context: OSCE must be part of a formative/summative assessment.
Language: Peer-reviewed articles in the English language.
Timeframe: Studies conducted between 2000 and 2024 to ensure relevance.
Geographic Scope: No restrictions.
2.4 Exclusion Criteria
Studies outside anaesthesia (e.g., surgery).
OSCEs used for non-clinical skills (e.g. communication alone).
Studies lacking explicit psychometric data, such as reviews, opinion pieces, or studies without quantitative reliability/validity metrics.
2.5 Selection Process
Zotero was used to store the results of the literature searches and to remove duplicate articles from the list. All titles and abstracts were checked manually to select the articles to be included in the review. The full text of the selected articles was downloaded for detailed scrutiny. The reference lists of these articles were then searched manually by the authors, working separately, for further relevant studies. The final list was drawn up with reference to the research question. The Risk Of Bias In Non-randomized Studies of Interventions (ROBINS-I) tool was used to assess the risk of bias in the estimated effect of an intervention.
2.6 Data Synthesis
The studies in the final selection were scrutinised in detail with reference to the research question. A meta-analysis was not attempted because of the heterogeneity of the studies. Instead, a narrative synthesis of the selected articles was used to form the results and conclusions (Campbell, 2018).
3. Reliability of OSCE in Anaesthesia
3.1. Defining Reliability
Reliability in Objective Structured Clinical Examinations (OSCEs) refers to the consistency, reproducibility, and dependability of assessment scores in measuring a candidate's clinical competence (Downing, 2004). A reliable OSCE ensures that the same candidate would receive a similar score if reassessed under comparable conditions, that different examiners would assign consistent scores for the same performance, and that different stations or test forms measure the intended skills without undue variability. The key role of a reliable OSCE is to reduce inconsistencies caused by examiner bias, station variability, or candidate nervousness. This ensures fairness, because scores reflect the candidates' abilities rather than external factors (e.g., rater leniency, station difficulty). Additionally, reliability is a prerequisite for validity; an inconsistent test cannot be considered valid. It is paramount that high-stakes decisions (e.g., certification) are made with confidence that scores are not distorted by measurement error.
3.2. Types of Reliability
Several types of reliability are relevant to OSCEs, each addressing different sources of variability:
3.2.1. Inter-Rater Reliability (Inter-RR)
Inter-RR measures the agreement between different examiners scoring the same candidate's performance. This is important because inconsistent ratings can undermine the fairness of the process. Agreement is assessed using:
- Cohen's kappa (for categorical data):
Cohen's kappa (κ) is a statistical measure used to evaluate inter-rater reliability for categorical data (e.g., pass/fail, checklist items marked as "done/not done"). It quantifies the level of agreement between two raters beyond what would be expected by chance alone (extensions such as Fleiss' kappa handle more than two raters). A kappa value of 1 indicates perfect agreement between the examiners, and a value of 0 indicates agreement no better than chance.
- Intraclass Correlation Coefficient (ICC) (for continuous scores):
The ICC is a statistical measure used to assess inter-rater reliability or test-retest reliability for continuous data, such as OSCE station scores out of 20 points or global rating scales. It quantifies the degree to which different raters or repeated measurements agree on numerical scores.
If two raters score an anaesthesia OSCE station differently, low Inter-RR suggests rater bias or unclear marking criteria (Boulet et al., 2003).
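As an illustration of how Cohen's kappa described above is obtained, the following is a minimal sketch in Python. The function name `cohens_kappa` and the examiner ratings are hypothetical and are not taken from any included study; the calculation assumes exactly two raters scoring the same candidates on a categorical item.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items on a categorical scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: proportion of items on which the two raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(count_a) | set(count_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical "done"/"not" checklist item scored by two examiners for 10 candidates.
examiner_1 = ["done", "done", "not", "done", "not", "done", "done", "not", "done", "done"]
examiner_2 = ["done", "done", "not", "not", "not", "done", "done", "done", "done", "done"]
kappa = cohens_kappa(examiner_1, examiner_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the examiners agree on 8 of 10 candidates, yet kappa is only about 0.52, because a large share of that agreement would be expected by chance when most candidates are marked "done".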
3.2.2. Intra-Rater Reliability (Intra-RR)
Intra-RR assesses whether the same examiner scores a performance consistently over time. This is particularly relevant in longitudinal OSCE assessments. Intra-RR is commonly measured via test-retest or ICC.
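The ICC mentioned in Sections 3.2.1 and 3.2.2 exists in several forms. The sketch below assumes one common choice, ICC(2,1) (two-way random effects, absolute agreement, single rater), computed from the mean squares of a candidates-by-raters table. The function name `icc_2_1` and the scores are hypothetical; the included studies may have used other ICC forms.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per candidate, one column per rater
    (or per rating occasion, when used for intra-rater reliability).
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 candidates and >= 2 raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Sums of squares for candidates (rows), raters (columns) and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("scores have no variance; ICC is undefined")
    return (ms_rows - ms_err) / denom


# Hypothetical station scores (out of 20): rater B is consistently 2 points more lenient.
lenient_pair = [[10, 12], [14, 16], [18, 20]]
icc = icc_2_1(lenient_pair)
print(f"ICC(2,1) = {icc:.3f}")
```

Because ICC(2,1) measures absolute agreement, the constant 2-point leniency lowers the coefficient even though both raters rank the candidates identically.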
3.2.3. Internal Consistency (IC)
IC evaluates whether different OSCE stations measure the same underlying construct (e.g., clinical competence). Cronbach's alpha (α) is commonly used as the metric. An α value of ≥ 0.7 indicates acceptable reliability, while a very high alpha (> 0.9) may suggest redundancy. A low alpha could mean that stations assess different skills (Streiner, 2003).
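For reference, Cronbach's alpha can be computed from a candidates-by-stations score table as α = k/(k − 1) × (1 − Σ station variances / variance of total scores), where k is the number of stations. The sketch below is illustrative only; the function name `cronbach_alpha` and the scores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha; scores has one row per candidate and one column per station."""
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 candidates and >= 2 stations")
    # Sample variance of each station's scores across candidates.
    station_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each candidate's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have no variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(station_vars) / total_var)


# Hypothetical scores for 5 candidates across 4 stations (each out of 20).
osce_scores = [
    [14, 15, 13, 16],
    [10, 11, 12, 9],
    [18, 17, 19, 18],
    [12, 14, 11, 13],
    [16, 15, 17, 15],
]
alpha = cronbach_alpha(osce_scores)
print(f"alpha = {alpha:.2f}")
```

In this made-up data set, candidates who do well on one station do well on the others, so alpha is high; stations that rank candidates very differently would drive the value down.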
3.2.4. Test-Retest Reliability (TTR)
TTR examines the stability of scores when the same candidates retake a similar OSCE. TTRs are rarely used in OSCEs due to learning effects, but they are important in research settings.
3.2.5. Parallel-Forms Reliability (PFR)
PFR assesses consistency between different but equivalent OSCE versions (e.g., different station sets). PFR assessments are important for large-scale exams to ensure fairness across cohorts.
3.2.6. Generalizability (G-Theory)
The G-Theory extends classical reliability by analysing multiple sources of error (stations, raters, occasions). It provides a dependability coefficient (Φ) indicating how generalisable scores are across different OSCE conditions (Lawson, 2006). G-studies in anaesthesia OSCEs suggest that adding more stations makes the results more generalisable than adding more raters (Lawson, 2006).
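The effect of adding stations versus raters can be illustrated with a decision (D) study. The sketch below assumes a fully crossed candidates × stations × raters design, for which Φ is the candidate variance divided by the candidate variance plus the absolute error variance. The variance components are invented for illustration and are not taken from Lawson (2006) or any other included study; the function name `phi_coefficient` is likewise hypothetical.

```python
def phi_coefficient(var, n_stations, n_raters):
    """Dependability coefficient (Phi) for a crossed candidate x station x rater design.

    var: dict of variance components with keys
         p (candidate), s (station), r (rater), ps, pr, sr, psr_e (residual).
    """
    if n_stations < 1 or n_raters < 1:
        raise ValueError("need at least one station and one rater")
    # Absolute error variance: every non-candidate component, divided by
    # the number of conditions it is averaged over.
    abs_error = (
        var["s"] / n_stations
        + var["r"] / n_raters
        + var["ps"] / n_stations
        + var["pr"] / n_raters
        + var["sr"] / (n_stations * n_raters)
        + var["psr_e"] / (n_stations * n_raters)
    )
    return var["p"] / (var["p"] + abs_error)


# Hypothetical components: the candidate-by-station interaction dominates the error.
components = {"p": 0.30, "s": 0.10, "r": 0.02, "ps": 0.40,
              "pr": 0.03, "sr": 0.01, "psr_e": 0.25}

baseline = phi_coefficient(components, n_stations=6, n_raters=1)
more_stations = phi_coefficient(components, n_stations=12, n_raters=1)
more_raters = phi_coefficient(components, n_stations=6, n_raters=2)
print(f"6 stations, 1 rater:  {baseline:.2f}")
print(f"12 stations, 1 rater: {more_stations:.2f}")
print(f"6 stations, 2 raters: {more_raters:.2f}")
```

Under these assumed components, doubling the stations raises Φ more than doubling the raters, because most of the error comes from candidates performing differently from station to station. With a different variance profile the ordering could change.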
4. Validity of OSCEs in Anaesthesia
4.1. Defining Validity
Validity in OSCEs refers to the extent to which the assessment measures what it intends to measure (here, clinical competence) and supports the interpretations and uses of its scores.
4.2. Types of Validity
Two key frameworks for understanding validity in OSCEs are those proposed by Messick and Kane.
4.2.1. Messick’s Unified Validity Framework
Samuel Messick’s (1995) model views validity as a unified concept, where different types of evidence contribute to a comprehensive argument for validity. For OSCEs, this includes:
a. Content Validity
Does the OSCE cover relevant clinical domains (history-taking, physical examination, communication, etc.)? Are stations representative of real clinical tasks? Evidence comes from expert review of cases and blueprint alignment with learning objectives.
b. Substantive Validity (applied to Cognitive & Skill Processes)
Do candidates engage in appropriate clinical reasoning and skills during tasks? Are the tasks structured to elicit the intended competencies?
c. Structural Validity
Does the scoring rubric accurately reflect the construct (e.g., global ratings vs. checklist scores)? Are station scores internally consistent?
d. Generalizability
Can performance in the OSCE predict real-world clinical ability? Does performance remain stable across different stations, examiners, and occasions?
e. External Validity
Correlation with other assessments (e.g., workplace-based assessments, written exams).
f. Discriminant validity (e.g., differentiating between novice and advanced learners).
g. Consequential Validity
Are OSCE results used fairly for decisions (pass/fail, licensure)? Does the OSCE have positive educational impact (e.g., guiding learning)?
4.2.2. Kane’s Argument-Based Validity Framework
Michael Kane’s (1992, 2013) approach focuses on constructing a validity argument through claims and evidence. For OSCEs, this involves:
a. Scoring Validity (Interpretation of Observations)
Are scores based on reliable and relevant criteria (e.g., checklists, global ratings)? Are examiners trained to minimize bias?
b. Generalization Validity (Extending Scores to Broader Domains)
Does performance in sampled stations reflect broader clinical competence? Are enough stations used to ensure reliability?
c. Extrapolation Validity (Linking Scores to Real-World Practice)
Does OSCE performance correlate with actual clinical performance? Are simulated patients and scenarios realistic enough?
d. Decision Validity (Appropriate Use of Results)
Are cut-scores (pass/fail standards) justified? Are consequences of passing/failing fair and defensible?
While Messick provides a holistic view, emphasizing multiple sources of evidence, Kane offers a structured argument-based approach, focusing on inferences from scores to decisions. Both stress that validity is not just about the test itself but how scores are interpreted and used.
5. Results
A literature search of PubMed, Google Scholar, Embase, ERIC and Scopus identified a total of 64 articles. After removal of duplicates, 21 articles remained. The titles and abstracts of these articles were then screened, and five articles were excluded as irrelevant (Fig. 1). After full-text retrieval, the remaining 16 articles (Appendix 1) were carefully analysed for inclusion in the review.
Fig. 1. Article selection process undertaken by the author
The included studies (n = 16) investigated the reliability and validity of Objective Structured Clinical Examinations (OSCEs) in assessing clinical competence in anaesthesia. Reliability measures, such as Cronbach’s alpha, inter-rater reliability, and generalizability coefficients, were frequently reported. For instance, Afra et al. (2025) demonstrated high internal consistency (Pearson’s r = 0.861, p < 0.001), while Ruan et al. (2025) reported excellent Cronbach’s alpha values ranging from 0.916 to 0.978. Inter-rater reliability was also strong in several studies, with Greenwood and Ledvina (2023) reporting an inter-rater reliability of 0.9092, and Khalafi et al. (2024) noting an intraclass correlation coefficient (ICC) of 0.824. However, variability was observed, as Mitchell et al. (2018) reported lower reliability (KR-20 > 0.5), and Morgan et al. (2004) found Cronbach’s alpha values as low as 0.16 for certain scenarios.
Validity evidence was diverse, with many studies supporting construct validity. For example, Tanaka et al. (2020) used G-theory to demonstrate significant variance attributable to individual and program levels (G-coefficient = 0.56), while Everett et al. (2019) showed discriminant validity with significant score differences between junior and senior trainees (p = 0.04 to < 0.001). Criterion-related validity was also evident, such as in Dubois et al. (2021), where OSCE scores correlated with other assessment tools (r = 0.45–0.52). Content validity was highlighted in Weller et al. (2017), where realism and alignment with crisis management frameworks were confirmed. However, some studies, such as Rebel et al. (2019), lacked detailed validity metrics, underscoring the need for standardized reporting.
Overall, the findings suggest that OSCEs in anaesthesia assessments generally exhibit strong reliability and validity, though variability across studies indicates that performance is context-dependent.
6. Studies on Reliability and Validity in OSCEs
The statistical measures employed in anaesthesia OSCEs used for evaluation and certification can tell us much about the internal consistency and, hence, the reliability of these evaluations. Internal consistency is indicated by Cronbach’s alpha, which shows how consistently different stations measure the same construct. The articles included in this review suggest that the value of this indicator varies and may be quite low (Baig et al., 2014). Karam et al. report a Cronbach’s alpha of 0.43 for anaesthesia OSCE stations (Karam et al., 2018). This brings into question the homogeneity and reliability of the examination and points to procedural deficiencies that might exist within the process. It may suggest that standardization efforts are insufficient and that more careful station redevelopment and scoring refinement are required to improve reliability.
With the Cronbach’s alpha reported by Karam et al. being low, examiners may vary in their rating styles, stations might not be well balanced, and candidate performance may vary considerably between stations (Karam et al., 2018). This calls for in-house quality assurance through regular evaluation of station content. A broader analysis of the station elements contributing to reliability would be useful in order to enhance this form of examination (Karam et al., 2018).
Heterogeneity of station design also appears to have an effect on the internal consistency. For instance, if a single OSCE is testing many different competencies, it would have a decreased reliability indicated by Cronbach’s alpha (Karam et al., 2018). In addition to this observation by Karam et al. on station heterogeneity, it may also be useful to conduct item analysis to improve psychometric properties (Karam et al., 2018).
In order to achieve acceptable reliability, several different scoring models have been implemented. Abed and Elaraby show that a scoring system which uses the borderline students' characteristics model improves reliability from 0.57 to 0.85 between parallel circuits of OSCE stations in anaesthesia (Abed & Elaraby, 2022). Standardized methods of standard setting are therefore important for improving reliability. This higher level of reliability also improves the likelihood that the cut scores in examinations will be fair and defensible. A procedural change like this may be warranted to enhance these psychometric measures; however, the practicality and resource demands of such a model must be considered (Abed & Elaraby, 2022).
The implementation of the borderline students' characteristics model also changed the cut score, resulting in higher progression rates (Abed & Elaraby, 2022). These results also point to the importance of calibrating examiners, so that they score and pass candidates in a standardized and evidence-based manner.
Abed and Elaraby show that deleting some items such as item three and item ten would improve the reliability scores (Abed & Elaraby, 2022). This illustrates the importance of regular assessment redevelopment and item refinement in order to improve the validity and reliability of a particular OSCE (Abed & Elaraby, 2022).
Another study examined various scoring systems and how they correlate. Checklist and domain-based scores were strongly correlated, with a correlation coefficient of 0.858. However, depending on the group and the year, Cronbach’s alpha ranged from 0.2 to 0.57, which suggests this examination is variable and potentially unreliable for high-stakes anaesthesia certification (Mahmoud, 2023). It is useful to note that checklist methods test technical competence, whereas domain-based scoring examines clinical reasoning (Mahmoud, 2023). Therefore, perhaps neither scoring method is ideal on its own, and examiners may be able to assess technical competency via checklists and non-technical elements through domain-based scoring.
For instance, Cronbach’s alpha differed considerably between the two cohorts (Mahmoud, 2023). Cohort one was assessed with the same 15-station circuit over three years, whereas in cohort two examiners and station layout were frequently changed. The decline in Cronbach’s alpha for the group with more variation illustrates the importance not only of the OSCE itself but of how it is conducted (Mahmoud, 2023). Annual audits of examiners and station changes may be warranted in order to prevent inconsistencies within the process (Mahmoud, 2023). The other measure calculated in this study was skewness. Checklist and domain-based measures displayed similar trends. Most stations produced a skewed distribution of scores and demonstrated either a ceiling effect or a floor effect, depending on station difficulty. Some stations had very limited score ranges and poor variance between high and low performers. Improving the variability of performance on station items would serve to reduce score bias and enhance the discriminant validity of the examination, particularly during high-stakes anaesthesia certification (Mahmoud, 2023).
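To make the skewness and ceiling-effect observations concrete, the following sketch computes the moment-based skewness coefficient and the proportion of candidates at the maximum and minimum scores for a single station. The scores and function names are hypothetical, and this is not the analysis code used by Mahmoud (2023).

```python
def skewness(scores):
    """Moment-based skewness (g1): third central moment over the second to the power 1.5."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    if m2 == 0:
        raise ValueError("scores have no variance; skewness is undefined")
    return m3 / m2 ** 1.5


def proportion_at(scores, value):
    """Share of candidates whose score equals the given value."""
    return sum(x == value for x in scores) / len(scores)


# Hypothetical scores for one easy station marked out of 20.
station_scores = [20, 20, 19, 20, 18, 20, 17, 20, 19, 12]
skew = skewness(station_scores)
ceiling = proportion_at(station_scores, 20)
floor = proportion_at(station_scores, 0)
print(f"skewness = {skew:.2f}, at ceiling = {ceiling:.0%}, at floor = {floor:.0%}")
```

Here half of the candidates score full marks and the distribution is negatively skewed, which is the pattern of a ceiling effect: the station cannot separate strong candidates from very strong ones.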
Another measure that improves the reliability of an examination is the addition of a second examiner. Wright implemented an intervention in which two raters scored each participant (Wright, 2019). Both checklist scores and global scores were evaluated to determine how effective having two examiners would be for these types of evaluations. With a second examiner added, both checklist and global scores were more reliable than with only one evaluator (Wright, 2019). Although this shows improved reliability, there are several limitations. Some scenario recordings did not record properly, which resulted in uninterpretable data for those scenarios (Wright, 2019). The sample size was also smaller in some scenarios than in others, so the reliability estimates vary from scenario to scenario (Wright, 2019). Another interesting observation from this work is that trainees consistently scored significantly lower on both checklist scores and global scores than their senior colleagues (Wright, 2019). This supports the construct validity of this type of exam.
Even though there are many good points with having two examiners, there are limitations. A lot of time and resources are required to ensure proper administration of this form of assessment. Therefore, some considerations should be given if this intervention is attempted for reliability purposes (Wright, 2019).
Therefore, while some authors may find checklist scores more reliable and beneficial, Wright highlights that they have no added value compared to global scores in terms of inter-rater reliability (Wright, 2019). Regardless of these findings, in all assessment scenarios there were two raters, which reduces bias and variability, thereby improving the reliability. Wright shows that an appropriate intervention can be implemented for a specific issue related to reliability scores and validity scores of the OSCE and provides strategies for implementing a reasonable procedural improvement (Wright, 2019).
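The gain from a second examiner discussed above can be reasoned about with the Spearman-Brown prophecy formula, which predicts the reliability of the mean of k raters from the single-rater reliability. This is a general psychometric approximation that assumes the raters are interchangeable; it is not the analysis reported by Wright (2019), and the single-rater value of 0.60 is a hypothetical input.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Predicted reliability of the average of n_raters interchangeable raters."""
    r = single_rater_reliability
    if not 0 <= r <= 1 or n_raters < 1:
        raise ValueError("reliability must be in [0, 1] and n_raters >= 1")
    return n_raters * r / (1 + (n_raters - 1) * r)


# Hypothetical single-rater reliability of 0.60 for a station score.
one_rater = spearman_brown(0.60, 1)
two_raters = spearman_brown(0.60, 2)
three_raters = spearman_brown(0.60, 3)
print(f"1 rater: {one_rater:.2f}, 2 raters: {two_raters:.2f}, 3 raters: {three_raters:.2f}")
```

Under these assumptions the step from one rater to two gives a larger gain than the step from two to three, which is worth weighing against the extra examiner time and cost noted above.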
Studies comparing the reliability of OSCEs with other methods have illustrated some differences in anaesthesiology (O’Shaughnessy & Joyce, 2015). OSCEs were assessed in the context of internal medicine education, anaesthesia education, and surgical education. It was noted that written tests, such as multiple-choice or single-answer papers, lend themselves to high reliability because they can be marked objectively by computer (O’Shaughnessy & Joyce, 2015). O’Shaughnessy and Joyce reported an observation made by a reviewer, who felt that the differences noted were due to the inherently higher degree of standardization in multiple-choice or single-answer examinations than in OSCE-style tests (O’Shaughnessy & Joyce, 2015). The reviewer felt that the variability across candidates taking the same multiple-choice test may be far less than the variability across candidates taking an OSCE, because the latter is affected by factors such as examiner bias, candidate performance, and standardization (O’Shaughnessy & Joyce, 2015).
Validity evidence forms the bedrock of the effectiveness of OSCEs, especially in high-stakes anaesthesia examinations. This section looks critically at the available research evidence related to content, construct, consequential, criterion and other forms of validity as it relates to the use of OSCEs for measuring competency in anaesthesia practice.
The development of OSCE tools in anaesthesia can provide good content validity because these assessments have generally broad content coverage. For instance, scenarios that may address periprocedural and communication skills in anaesthesia, such as taking consent, selecting appropriate equipment or managing a complication, can align very closely with clinical activities (Rojas et al., 2025). This implies that OSCEs can be an effective tool in accurately assessing knowledge, skills and behaviors required for safe anaesthesia practice. However, the challenge with content validity is to ensure adequate representation across different settings. Although curriculum-based scenario design enhances the content validity of anaesthesia OSCEs, further research is required to ensure appropriate scenario development that can cater to the varying prevalence of certain clinical tasks across settings.
Checklist items and global rating scales support construct validity. High interrater reliability of OSCE scoring provides evidence that OSCEs can provide consistent measures of clinical competency and differentiate levels of trainee experience (Rojas et al., 2025). By contrast, global ratings can be subject to more interrater variability than checklist items because rating scales may not be applied consistently across examiners. Nevertheless, using both checklist items and global ratings to assess competencies offers good construct validity to OSCEs in high-stakes examinations, especially for measuring the ability to perform in a particular clinical domain in anaesthesia practice.
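To make the idea of interrater agreement concrete, the following is a minimal sketch of Cohen's kappa for two examiners who each assign a categorical global rating to the same candidates. It is offered only as an illustration: the function name, the rating labels, and the ratings themselves are hypothetical and are not drawn from any of the reviewed studies, which may have used other agreement statistics.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same candidates."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of candidates")
    n = len(rater_a)
    # Observed agreement: share of candidates given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical global ratings from two examiners for ten candidates (illustrative only).
examiner_1 = ["pass", "pass", "fail", "pass", "borderline",
              "pass", "fail", "pass", "pass", "borderline"]
examiner_2 = ["pass", "pass", "fail", "borderline", "borderline",
              "pass", "fail", "pass", "fail", "borderline"]

kappa = cohens_kappa(examiner_1, examiner_2)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is often preferred to simple percentage agreement when a small number of rating categories is used.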
Conducting a careful curriculum needs assessment and tailoring OSCE scenarios to those needs can ensure relevance to practice (Rojas et al., 2025). This ensures that the assessed competencies reflect real-world requirements, especially in domains such as regional anaesthesia. However, this approach carries a risk of over-specialization, which may limit the generalizability of the findings or of the competencies assessed. Periodic curriculum reviews and stakeholder input can mitigate this limitation, as clinical relevance is best ensured when OSCE scenario design covers a broad range of required competencies.
The reproducibility and feasibility of OSCEs provide good consequential validity (Rojas et al., 2025), suggesting that they can readily be integrated into regular assessment schemes for training and certification in anaesthesia practice. Also, the ability to set benchmarks or minimum acceptable clinical standards for a particular cohort of candidates supports the fairness and consistency of OSCEs as a high-stakes examination. Cost-effectiveness also supports the consequential validity of OSCEs because they are economically reasonable and sustainable.
Conversely, OSCEs must retain their relevance over time in all settings by capturing changing clinical practices, evolving curricula and advances in medical education simulation. Where they fail to do so, the consequential validity of anaesthesia OSCEs is weakened in the face of constant innovation in the health professions.
Raising cut-score thresholds during standard setting affects candidate outcomes. In high-stakes anaesthesia examinations, raising cut scores to better mirror real-world performance and increase the external validity of an anaesthesia OSCE resulted in poorer average grades and higher cut-off values. This may exclude potentially qualified candidates who would have passed under a lower cut-off, thus negatively affecting external validity (Karam et al., 2018). An audit may be required to check for disproportionate failure rates among specific student groups from different settings, to ensure fairness and minimize any bias or disadvantage to particular groups during testing.
Setting cut-off thresholds also changes the proportion of candidates who pass; therefore, the consequences of standard-setting practices may not align with clinical expectations (Karam et al., 2018). An audit should be conducted to ascertain whether newly raised standards have adverse or disproportionate effects on the quality and diversity of potential anaesthesia physicians or increase inequities, so that any unexpected negative consequences of new OSCE performance standards can be mitigated.
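The dependence of pass rates on the chosen cut score, and the kind of subgroup audit described above, can be sketched as follows. This is a simplified illustration under stated assumptions: the centre names, scores, and cut scores are hypothetical, and a real audit would also need to account for sample size and sampling error.

```python
def pass_rate(scores, cut_score):
    """Proportion of candidates whose score meets or exceeds the cut score."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score >= cut_score for score in scores) / len(scores)


def pass_rates_by_group(scores_by_group, cut_score):
    """Pass rate for each subgroup, to flag disproportionate failure rates."""
    return {group: pass_rate(scores, cut_score)
            for group, scores in scores_by_group.items()}


# Hypothetical percentage scores for two training centres (illustrative only).
scores_by_centre = {
    "centre_a": [48, 55, 60, 63, 70, 79],
    "centre_b": [52, 58, 61, 67, 74, 85],
}
all_scores = [s for scores in scores_by_centre.values() for s in scores]

# Show how the overall and per-centre pass rates shift as the cut score rises.
for cut in (50, 55, 60, 65):
    overall = pass_rate(all_scores, cut)
    by_centre = pass_rates_by_group(scores_by_centre, cut)
    print(cut, round(overall, 2), {c: round(r, 2) for c, r in by_centre.items()})
```

Comparing the per-group rates at each candidate cut score is one simple way to see whether a higher standard affects some groups more than others before it is adopted.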
A strong correlation between checklist scores and domain ratings (r = 0.858) suggests high criterion validity of the OSCE (Mahmoud, 2023). As such, both scores are likely to reflect similar information about the level of clinical competence of the OSCE examinees. This implies that either type of scoring system for OSCEs in anaesthesia is reliable enough to give good external validity evidence. However, different centers may prefer different scoring schemes, which can limit consistency across OSCE implementations in anaesthesia.
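For readers less familiar with the statistic, a correlation coefficient of this kind can be computed as in the sketch below. The code implements the standard Pearson product-moment formula; the checklist scores and domain ratings are hypothetical values chosen for illustration and are not the data reported by Mahmoud (2023).

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical checklist scores (%) and domain ratings (1-5) for eight candidates.
checklist_scores = [62, 70, 55, 81, 90, 47, 76, 68]
domain_ratings = [3, 4, 3, 4, 5, 2, 4, 3]

r = pearson_r(checklist_scores, domain_ratings)
print(f"r = {r:.3f}")
```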
Applying a borderline student model may improve reliability and validity and help optimize cut thresholds (Abed & Elaraby, 2022). Use of the borderline student model increased the reliability coefficient from 0.57 to 0.85 and yielded a cut-threshold (pass mark) of 54.79%. It also resulted in 65% of trainees achieving a grade in line with the estimated “true” grade. The borderline student approach can reduce ambiguity in test score results and further enhance the validity of high-stakes examinations in anaesthesia as a measure of competence. However, the feasibility and scalability of applying it across centers may be a limiting factor in running high-stakes anaesthesia examinations effectively.
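As a rough illustration of how borderline candidates can inform a pass mark, the sketch below uses the borderline group method, a commonly described standard-setting technique in which the cut score is the mean checklist score of candidates whom examiners rated as borderline. This is an assumption made for illustration: it is not necessarily the model applied by Abed and Elaraby (2022), and the scores, ratings, and function name are hypothetical.

```python
from statistics import mean


def borderline_group_cut_score(checklist_scores, global_ratings,
                               borderline_label="borderline"):
    """Cut score as the mean checklist score of candidates rated borderline."""
    if len(checklist_scores) != len(global_ratings):
        raise ValueError("Each candidate needs one score and one global rating")
    borderline_scores = [score for score, rating
                         in zip(checklist_scores, global_ratings)
                         if rating == borderline_label]
    if not borderline_scores:
        raise ValueError("No candidates were rated as borderline")
    return mean(borderline_scores)


# Hypothetical station data for eight candidates (illustrative only).
scores = [45, 52, 56, 58, 61, 66, 72, 80]
ratings = ["fail", "borderline", "borderline", "borderline",
           "pass", "pass", "pass", "pass"]

cut = borderline_group_cut_score(scores, ratings)
print(f"Cut score: {cut:.2f}%")
```

The method depends on having enough borderline-rated candidates at each station for the mean to be stable, which is one reason feasibility may vary between centers.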
Patient safety stands to benefit, because setting cut-off scores to better match standards of clinical performance helps ensure that only qualified candidates complete anaesthesia training (Abed & Elaraby, 2022). However, cut scores should be validated more regularly, for example through follow-up surveys that track graduates after certification in anaesthesia, to see whether the cut-score remains adequate over time. Such validation will likely need to be repeated to confirm that the validity evidence for new cut-thresholds remains sound.
Weaker in-training evaluations appear predictive of later failures in high-stakes certification examinations. The ability of in-training evaluations to predict performance on OSCEs therefore supports predictive validity. Most candidates with stronger overall in-training evaluations went on to complete their training requirements and obtain certification (Bhanji et al., 2024). On this evidence alone, the OSCE could appear to add little predictive information and to be a superfluous assessment of clinical competencies during training. However, as part of a suite of assessment tools, the OSCE can serve to bridge this gap.
Adding more components, such as OSCEs, to longitudinal in-training evaluations and other forms of formal assessment may seem superfluous. If individual assessment tools are assessing the same construct (clinical competence), and each instrument is able to provide adequate data on the competence of candidates to guide training, additional assessment may simply burden the evaluation process with redundant information. But if each assessment tool, including longitudinal evaluations, is designed to assess unique elements of competence, adding them may provide valuable information about a broader range of candidates (Bhanji et al., 2024). The best overall evaluation will likely incorporate both longitudinal in-training data and intermittent formal assessment using techniques, like OSCEs, to capture a wider array of skills and competencies. This is not to suggest that combining assessments cannot lead to redundancies. All assessments should be reviewed periodically to ensure that they continue to provide adequate data on aspects of competency without excessively burdening the overall evaluation process.
In-training performance predicts OSCE performance at the end of training (Bhanji et al., 2024). The predictive utility of longitudinal in-training performance data supports the incorporation of OSCE in the certification process. This enhances the validity evidence of OSCEs and supports their integration within a comprehensive evaluation strategy. It is important to recognize, however, that there are circumstances in which simply providing this broad overview of in-training performance may be inadequate. For example, simply categorizing anaesthesia trainees at various points in training as “meeting” performance expectations fails to provide specific feedback regarding areas in which that trainee is exhibiting deficiency, and areas in which they are thriving. Furthermore, performance data alone will not necessarily offer useful guidance with respect to the best way to assist a struggling candidate. Although some candidates may benefit from enhanced educational support, such as supplementary practice, mentoring or enhanced feedback, others might have challenges, such as physical or mental health problems, that would best be addressed through other supportive modalities.
The external and convergent validity of OSCEs is further enhanced by the fact that they can be applied in similar contexts to other health professional trainees from allied disciplines. Other healthcare fields such as medicine, pharmacy and nursing also utilize OSCEs in their licensing and certification processes to assess candidates during high-stakes examinations. Evidence can be drawn from these allied disciplines to enhance and support OSCE implementation in anaesthesia and provide further validity evidence for OSCEs.
For example, among physical therapy candidates for certification, similar levels of inter-examiner reliability and between-station correlation were found, which supports external and convergent validity (Pashmdarfard et al., 2022).
Finally, additional support of face validity and reliability evidence has been shown from feedback from stakeholders. An audit on students’ perspectives on fairness in the implementation of OSCEs provided important validity and reliability evidence for this type of assessment. Students believed that fairness in OSCEs meant that candidates must experience standardized conditions (Pashmdarfard et al., 2022). When students felt they were experiencing standardized conditions during examinations, then the content of each station was considered appropriate for the level of their education. While this suggests a degree of student acceptance and perception of fairness of standardized examinations and conditions, challenges regarding adequate OSCE time to complete cases remain. Other limitations suggested by medical students include student stress and timing to accurately assess clinical competence. Such challenges were considered by the students as barriers to providing fairness of OSCEs. These areas require special consideration in ensuring quality and reliability evidence in performing anaesthesia OSCEs.
7. Strengths and Limitations of OSCEs
7.1 Strengths
OSCEs reduce subjectivity through structured checklists and standardised patients (SPs), promoting fairness. They provide a comprehensive assessment that evaluates multiple competencies (e.g., communication, procedural skills) in a controlled environment (Hodges, 2003). OSCEs are also adaptable: virtual OSCEs (teleOSCEs) emerged during COVID-19, maintaining assessment continuity despite pandemic constraints.
7.2 Limitations
However, OSCEs have limitations of their own. They are resource-intensive, requiring significant time, funding, and personnel, which limits scalability. Simulated scenarios may also lack the complexity of real patient interactions, risking "performance vs. competence" gaps. High-stakes pressure may skew performance (Brand & Schoonheim-Klein, 2009), while rater bias (e.g., halo effects) persists despite training.
8. Comparative Analysis with Other Assessment Methods
Objective Structured Clinical Examinations (OSCEs) are widely regarded as a strong method of testing clinical skills because they offer clear benefits over traditional written and practical tests. Whereas traditional tests mostly assess theoretical knowledge through multiple-choice or essay questions, OSCEs assess practical skills, communication, and clinical reasoning in a controlled, standardised setting (Harden and Gleeson, 1979). The OSCE format reduces subjectivity and ensures that all candidates are judged against the same criteria, which makes the process fairer and more reliable. Simulations, on the other hand, are useful for training but do not always offer the same level of formal assessment as OSCEs, because they may lack standardised scoring or examiner feedback (Khan et al., 2013). However, OSCEs demand considerable time, personnel, and planning, which may make them less practical than traditional tests or low-fidelity simulations (Turner and Dankoski, 2008).
OSCEs have an advantage over real-life clinical assessments because they create a controlled environment in which patient encounters are consistent across candidates, making evaluations more reliable (Newble, 2004). Practical tests are authentic, but they can be biased because patients vary in complexity and examiner judgement is subjective (Norcini et al., 2011). OSCEs fill this gap by combining realism with standardisation, which makes them very useful for both formative and summative assessments in medical education. Critics, on the other hand, argue that OSCEs may not fully capture the unpredictability of real-world practice and may overemphasise structured scenarios at the expense of flexible clinical decision-making (Epstein, 2007). Even with these problems, OSCEs are still the best way to test clinical skills because they are more reliable and useful than methods based solely on real-life encounters or simulations.
9. Recommendations for Improving OSCEs
From the above, it is clear that the OSCE remains an important, reliable and valid tool in medical assessments in general and in high-stakes anaesthesia examinations in particular. To address the limitations associated with OSCEs, multiple innovations have been proposed. One is the integration of OSCEs with workplace-based assessments (WPBAs), such as the mini-CEX and DOPS, to balance standardization with real-world context. Another is the use of teleOSCEs for communication skills alongside in-person stations for technical proficiencies. Providing formative OSCEs throughout training might reduce anxiety and familiarity gaps.
Implementing frame-of-reference training and incorporating AI-driven analytics to standardize checklist scoring will help to improve interrater reliability and reduce bias.
10. Future Research Directions
Developments such as the Virtual Objective Structured Clinical Examination (VOSCE) are making it possible to assess clinical competencies remotely using electronic platforms in high-stakes anaesthesia assessment (Afra et al., 2023). The VOSCE has been reported to maintain strong psychometric properties while addressing logistical issues such as distance and availability for face-to-face assessments. It was also deemed very effective during the height of the COVID-19 pandemic as a way to maintain continuity of educational assessment without face-to-face interaction. Although its psychometric properties appear strong, the literature suggests that its comparability with in-person assessments should be explored further.
The VOSCE has demonstrated similar psychometric properties, such as reliability and validity, to traditional OSCEs and can be a highly effective way to remotely assess students (Afra et al., 2023). However, it is important for assessment designs to be well thought out as assessments that use computer software and internet connections have the potential for software and hardware malfunctions that may interfere with test-taker performance. Other variables that must be accounted for with new types of high-stakes testing are the need to modify test station content, such as case content, rating forms, and grading methods. While VOSCEs have been effective in maintaining psychometric integrity, the ability to replicate an actual in-person encounter must be taken into account when transitioning to a remote assessment of complex high-stakes exams.
From the test takers’ perspective, participants reported accessibility for them as a positive feature as they could attend their assessment from anywhere, thus minimizing travel time and costs (Afra et al., 2023). These points are significant because accessibility plays an important role in allowing individuals the opportunity to obtain high-stakes certification and testing, which can otherwise limit those that are unable to attend assessments due to distance from major testing centers or lack of financial support for such travel costs. These are often students that come from lower economic status. Ensuring access to high-stakes testing and certification is crucial and requires equity for all participants. Students must be afforded the ability to attend their high-stakes assessments and have the means and resources for successful completion regardless of their financial, ethnic, or geographic standing.
On the other hand, a drawback to the VOSCE implementation is that there are technological resources needed, such as a reliable and adequate internet connection, compatible computer equipment, and skills and abilities to operate an online assessment, to ensure equal access for all participants (Afra et al., 2023). These are all variables that candidates from lower socioeconomic status may not have the ability to access, which is a major gap to be identified for fair and equitable implementation of a remote high-stakes assessment. Additional research should be aimed at addressing how to bridge the gaps of varying socioeconomic and access to internet connections and reliable computer technology so as to equitably utilize VOSCEs for fair high-stakes testing opportunities for all candidate populations.
Recording of candidate performance and examiner performance monitoring (i.e., video footage of assessments) and examiner training are also valuable as they allow for review and quality improvement to assist with the maintenance of psychometric properties and standardization in scoring for high-stakes assessments (Afra et al., 2023). On the other hand, concerns such as data privacy and storage, as well as data security with any potential technological failures of recording procedures or equipment, have to be taken into account to comply with standards from local and state regulations as well as the protection of candidates and their data. Stress and anxiety can also be of great significance in the assessment performance as the recording process can be a significant stressor for some candidates. While these may seem to be minimal deterrents, it is very important to factor these in as test-takers tend to become very sensitive when participating in assessments that weigh high on their future careers, education, and certification.
While the VOSCE is a promising endeavor, technology infrastructures need to be reviewed and tested by the institutions in which these assessments are implemented (Afra et al., 2023). The initial cost for developing and implementing these types of remote assessments has to be feasible enough so that the costs can be sustained to support ongoing assessments. In addition, the gaps that are between technological resources available across institutions and educational programs need to be examined. In cases where institutions, schools, and program systems may have financial restraints, which affects technology implementations of remote assessment, it would prove helpful to conduct research regarding cost-effective ways to transition high-stakes assessment to virtual options without sacrificing the psychometric standards needed for effective high-stakes implementation.
High-fidelity simulations are becoming increasingly useful in high-stakes testing. They strengthen the validity of the assessment while improving patient safety by replicating rare yet critical patient emergencies that may not be encountered in real-patient assessment, such as acute resuscitation scenarios (Dupre & Naik, 2021). Simulation-based assessment presents the same clinical encounter to each candidate, minimizing variability in patient scenarios and cases, which improves the reliability of assessment (Dupre & Naik, 2021). Despite these advantages, high-fidelity simulation is costly and requires multiple resources, including technological equipment (high-fidelity simulators), physical space (a simulation lab), trained staff, standardized scoring, and maintenance. It is therefore important to find the right balance between high-fidelity and low-fidelity simulation, weighing which assessments each suits best and how patient safety outcomes can be improved without candidate performance being affected by equipment use, scenario, or patient variability.
Simulation-based assessment also has the ability to test non-technical skill sets, such as crisis management, communication skills, teamwork, and leadership, by creating scenarios for real-life and emergency patient scenarios (Dupre & Naik, 2021). It is imperative to remember, when incorporating the ability to assess non-technical skills, that the grading or rating of such skills may be somewhat subjective depending on the examiner’s expertise, thus requiring that a focus be kept on consistent training of examiners. Continuous quality assurance is required, along with active standardization of examiner calibration, to reduce bias and subjectivity, as well as to maintain consistent and accurate scoring, or the simulation-based assessment runs the risk of decreasing reliability. This leads to a potential decrease in candidate perception of the assessment's appropriateness, accuracy, and ultimately validity.
Scalability is an important consideration when using simulation in high-stakes assessment (Dupre & Naik, 2021). One limitation is the cost of higher-fidelity simulation: programs with minimal budgets may find it difficult to expand simulation across multiple assessments, to extend it to all learning programs within a hospital, or to fund consistent examiner training (Dupre & Naik, 2021). The resources invested in higher-fidelity simulation must therefore be balanced against the expected return, both in the validity of high-stakes assessment and in reducing patient safety gaps so that equitable, quality care is provided to the patient population (Dupre & Naik, 2021). Research could determine whether low-fidelity simulations serve the same assessment purpose without the variability, costs, and resources tied to higher-fidelity simulations. Collaboration among simulation centers can also be explored, so that smaller institutions benefit from those with greater simulation capacity (Dupre & Naik, 2021). Other opportunities include modular simulations that ease setup, maintenance, and standardization across multiple examiners, and that allow multiple stations or assessments to run simultaneously, reducing waiting times and candidate and station variability (Dupre & Naik, 2021). Such measures could alleviate many of the concerns about high-stakes assessment in anaesthesiology by improving the reliability, validity, and acceptability of OSCE implementation.
In 2003, the American Board of Anesthesiology (ABA) incorporated nine domains into their blueprint of technical and non-technical assessment skills within anaesthesiology. These nine domains cover technical skills, professionalism skills, communication skills, etc. (Warner et al., 2020). Despite all areas tested at each assessment station being tested simultaneously, each domain is independently evaluated, which allows for a better comprehensive perspective of individual skill-level mastery as well as provides better quality of feedback for those that are not successful in particular domains of practice (Warner et al., 2020). Regardless of all domains and/or stations being required for successful completion to obtain full certification, these domains are all rated and can assist with future improvement and maintenance to prevent future performance gaps (Warner et al., 2020). However, it must be taken into consideration that this only stands true for consistent and well-calibrated evaluators, as the more that examiners are in calibration with one another, the stronger the reliability of grading for this style of OSCE becomes (Warner et al., 2020). Constant reminders or standardization exercises, as well as regular feedback, may be required to keep all evaluators on task with standardization of calibration within OSCE assessment to decrease all forms of variability of evaluator inconsistencies in scoring (Warner et al., 2020).
In 2003, the ABA also expanded the examiner pool in order to test variability of graders with the intentions of maintaining higher standards for assessment and in providing feedback for all candidate performances (Warner et al., 2020). All 51 ABA board-certified and active diplomates participated, which required the ABA to take on very large logistic and administrative burdens for scheduling and coordinating the event that had multiple candidates and many different stations. The ABA incorporated quality assurance for the standardization of grading by examiner pairs for each of the stations that all candidates were required to test (Warner et al., 2020). Though the use of examiner pairs allows for more objective grading in comparison to solo examiners, and the higher numbers of examiners will, in turn, decrease variability across many OSCEs, it becomes crucial to determine the number of examiners needed without becoming a logistic nightmare (Warner et al., 2020). Also, with the recruitment of that many examiners, there becomes a wide variety of experience, specialties, and even personalities, making it extremely important that all examiners undergo the necessary training to ensure accuracy of all assessment stations (Warner et al., 2020). Continuous auditing and feedback or recalibration exercises or activities may prove essential in standardizing the evaluators’ performance. Any underperformers may require closer evaluation of their assessment styles to ensure correct implementations and to provide more effective education as required, thus requiring more ongoing supervision (Warner et al., 2020).
Station and scenario design in relation to OSCEs in high-stakes testing requires more quality assurance and review so that there are no unintended variables that skew candidate responses (Warner et al., 2020). Feedback can be obtained on station designs and how effective they are by soliciting candidate feedback. It is also critical to ensure all stations remain equally tested in levels of difficulty or complexity. In any given station, there could be an unintended weakness within a candidate’s responses; this could be related to station wording or wording complexity, scenario design that may lack clarity, a change of evaluators, test anxiety, all or most of which may create a gap within station effectiveness (Warner et al., 2020). To improve standardization of station designs, scenario content and wording may be crucial to review yearly or when OSCE implementations are conducted (Warner et al., 2020).
As for nurse anaesthesiology OSCEs, this type of performance assessment can take the place of regular written examinations of competency with more effectiveness as it assesses skills in addition to content mastery in scenario-based performance (Sammons et al., 2024). It addresses both “Knows” and “Knows How” elements of Miller’s Pyramid. OSCEs also focus more on the ability to appropriately demonstrate a task and the professional behavior while conducting the task (Sammons et al., 2024). Since it has been standard within nurse anaesthesiology education to focus more on knowledge assessment with multiple-choice/true/false examination, this transition is an area of gap for many educators (Sammons et al., 2024). Implementation with OSCEs requires time and training of examiners as well as the standardization of performance ratings (Sammons et al., 2024). When these two steps have not been given appropriate attention and focus, there can be some lack in the implementation of psychometric integrity. Therefore, it is recommended for program directors to invest in educational interventions for appropriate OSCE implementation to ensure effective psychometric integrity (Sammons et al., 2024).
In addition, another gap that is associated with implementation in nurse anaesthesiology is the initial costs of setting up a simulation room and the materials or technologies necessary for assessment stations (Sammons et al., 2024). Though some schools have limited to no barriers regarding resources, others are financially constrained and/or must utilize rooms, classrooms, or simulation laboratories that also must be used to support other educational training (Sammons et al., 2024). Other programs do not have a space at all and struggle finding one in which to do simulations or examinations as they cannot allocate funds or afford the rent for a separate building for them (Sammons et al., 2024). This places a huge inequity with OSCE implementation and may need further consideration so that all nurse anaesthesiology programs in the U.S. are equitable in their training and implementation standards for high-stakes assessments such as those utilized for National Boards. There is no published research available regarding these gaps and potential solutions to such gaps, but the use of shared-simulation labs among universities and programs, as well as the potential for implementation of mobile simulation labs within the hospital settings, are avenues for exploration. Implementation of OSCEs is of importance and offers more valuable assessment to the programs, for the nurse anaesthetist candidate, and for all the patients that
11. Conclusion
This review systematically examined the reliability and validity of OSCEs in high-stakes anaesthesia certification. The research aimed to determine whether OSCEs are reliable and valid enough for consequential exams and to assess their impact on professional preparedness. The findings suggest that OSCEs can provide a robust evaluation of anaesthetists' technical and non-technical skills, enabling the assessment of clinical competency required for high-stakes certification.
The study found that OSCEs have excellent statistical reliability when assessing at a global score or station-by-station level. However, poor station construction and examiner calibration can impact performance, underscoring the need for further investigation. The study also found that virtual and simulation-based OSCEs have allowed for increased access and decreased expenses, but have several technological and resource limitations.
The findings suggest that OSCEs are the gold standard assessment of competence in anaesthesia due to their station-by-station and scenario-based construction. However, debates surrounding standard-setting cut-off marks, examiner bias, resource costs, and time requirements persist. Future research should focus on multi-institutional, cross-national, and multi-regional validation to determine their generalizability and to explore virtual and simulation-based OSCE formats.
List of abbreviations
DNB: Diplomate of National Board
ERIC: Education Resources Information Center
FRCA: Fellowship of the Royal College of Anaesthetists
M.B.: Bachelor of Medicine
MCQs: Multiple Choice Questions
MEQs: Modified Essay Questions
OSCE: Objective Structured Clinical Examination
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
VOSCEs: Virtual Objective Structured Clinical Examinations
Declarations
Ethics approval and consent to participate:
• Reviewed and approved by the Ethical Committee at the Faculty of Life Science and Education, University of South Wales.
Consent for publication:
• Not applicable
Data Availability
All data generated or analysed during this study are included in this published article and its supplementary information files.
Funding:
No funding was received for this review.
Author Contribution
SA collected and analysed the studies and prepared the manuscript. JB supervised the work, reviewed the manuscript, and provided resources.
Acknowledgements:
• None
• Clinical trial number: not applicable.
• Authors' information: SA is a consultant anaesthetist and intensivist, a program director for the anaesthesia residency program, and an examiner with multiple national and international boards. JB is a urologist, an educator, and a lecturer at the University of South Wales.
Electronic Supplementary Material
Below is the link to the electronic supplementary material
References
Abed RA, Elaraby SE. Measuring the effect of using a borderline students' characteristics model on reliability of objective structured clinical examination. Cureus. 2022;14(5):1–7.
Afra A, Seneysel Bachari S, Ban M. Design, Implementation, and Evaluation of Anesthesia Students' Clinical Competency Based on the Virtual Objective Structured Clinical Examination. Anesth Pain Med. 2025;15(1):e155251.
Baig LA, Beran TN, Vallevand A, et al. Accuracy of portrayal by standardized patients: Results from four OSCE stations conducted for high stakes examinations. BMC Med Educ. 2014;14:97.
Boulet JR, De Champlain AF, McKinley DW. Setting defensible performance standards on OSCEs and standardized patient examinations. Med Teach. 2003;25(3):245–9.
Brand HS, Schoonheim-Klein M. Is the OSCE more stressful? Examination anxiety and its consequences in different assessment methods in dental education. Eur J Dent Educ. 2009;13(3):147–53.
Campbell M, Katikireddi SV, Sowden A, Thomson H, Stranges S. Lack of transparency in reporting narrative synthesis of quantitative data: a methodological assessment of systematic reviews. J Clin Epidemiol. 2018;105:1–9.
Downing SM. Reliability: on the reproducibility of assessment data. Med Educ. 2004;38(9):1006–12.
Dubois DG, et al. Validity of entrustment scales within anesthesiology residency training. Can J Anesth. 2020;68:53–63.
Epstein RM. Assessment in medical education. N Engl J Med. 2007;356(4):387–96.
Everett TC, McKinnon RJ, Ng E, Kulkarni P, Borges BCR, Letal M, Fleming M, Bould MD, MEPA Collaborators. Simulation-based assessment in anesthesia: an international multicentre validation study. Can J Anaesth. 2019;66(12):1440–9.
Fisseha H, Desalegn H. Perception of students and examiners about objective structured clinical examination in a teaching hospital in Ethiopia. Adv Med Educ Pract. 2021;12:1439–48.
Greenwood JE, Ledvina M. Development and Validation of a Quantitative Grading Rubric for High-Fidelity Simulation Assessment. AANA J. 2023;91(3):197–205.
Harden RM, Gleeson FA. Assessment of clinical competence using an objective structured clinical examination (OSCE). Med Educ. 1979;13(1):41–54.
Hodges B. Validity and the OSCE. Med Teach. 2003;25(3):250–4.
Jindal P, Khurana G. The opinion of postgraduate students on objective structured clinical examination in anaesthesiology: A preliminary report. Indian J Anaesth. 2016;60(3):168–73.
Kane MT. An argument-based approach to validity. Psychol Bull. 1992;112(3):527–35.
Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50(1):1–73.
Karam VY, Park YS, Tekian A, Youssef N. Evaluating the validity evidence of an OSCE: Results from a new medical school. BMC Med Educ. 2018;18:313.
Khalafi A, Abbasi A, Sarvi Sarmeydani N, Albooghobeish M. Investigating the Effectiveness of Formative OSCE Combined with Visual Feedback in Improving Clinical Competence among Iranian Nurse Anesthesia Students: A Quasi-experimental study. J Adv Med Educ Prof. 2024;12(4):251–60.
Khan KZ, Ramachandran S, Gaunt K, Pushkar P. The Objective Structured Clinical Examination (OSCE): AMEE Guide 81. Part I: An historical and theoretical perspective. Med Teach. 2013;35(9):e1437–46.
Lawson DM. Applying generalizability theory to high-stakes objective structured clinical examinations in a naturalistic environment. J Manipulative Physiol Ther. 2006;29(6):463–7.
Liberati A, Altman DG, Tetzlaff J, Mulrow C, Gøtzsche PC, Ioannidis JP, Clarke M, Devereaux PJ, Kleijnen J, Moher D. The PRISMA statement for reporting systematic reviews and meta-analyses of studies that evaluate health care interventions: explanation and elaboration.
Mahmoud A. A comparison of checklist and domain-based ratings in the assessment of objective structured clinical examination (OSCE) performance. Cureus. 2023;15(6):e40220.
Messick S. Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50(9):741–9.
Mitchell JD, Amir R, Montealegre-Gallegos M, Mahmood F, Shnider M, Mashari A, Yeh L, Bose R, Wong V, Hess P, Amador Y, Jeganathan J, Jones SB, Matyal R. Summative Objective Structured Clinical Examination Assessment at the End of Anesthesia Residency for Perioperative Ultrasound. Anesth Analg. 2018;126(6):2065–8.
Morgan P, Cleave-Hogg D, DeSousa S, Tarshis J. High-fidelity patient simulation: Validation of performance checklists. Br J Anaesth. 2004;92(3):388–92.
Newble D. Techniques for measuring clinical competence: Objective structured clinical examinations. Med Educ. 2004;38(2):199–203.
Norcini JJ, McKinley DW, Boulet JR. A methodological review of the assessment of clinical competence: Where do we go from here? Med Educ. 2011;45(6):636–45.
O'Shaughnessy SM, Joyce P. Summative and formative assessment in medicine: The experience of an anaesthesia trainee. Int J High Educ. 2015;4(2):198–206.
Pashmdarfard M, Hassani Mehraban A, Shafaroodi N, Arabshahi S, Parvizy K, Azad S, A., Esmaeili K, S. Validation of Objective Structured Clinical Examination (OSCE) based on the Occupational Therapy Practice Framework (OTPF): A pilot study. J Occup Therapy Educ. 2022;6(2):1–17.
Rebel A, DiLorenzo A, Nguyen D, Horvath I, McEvoy MD, Fragneto RY, Dority JS, Rose GL, Schell RM. Should Objective Structured Clinical Examinations Assist the Clinical Competency Committee in Assigning Anesthesiology Milestones Competency? Anesth Analg. 2019;129(1):226–34.
Rojas AF, Chen F, McMillan D, An X, Isaak R, Jolly M, Allan J, Coombs R, Nanda M, Grant SA. Beyond the Block: Development of an Assessment Tool to Evaluate Periprocedural and Communication Skills in Regional Anesthesia. J Educ Perioper Med. 2025;27(1):E743.
Ruan X, Xu X, Pei L, Yi J, Yu C, Yu X, Zhu B, Quan X, Li X, Jv H, Zhang Y, Huang Y. Chinese Anesthesiology Milestones in Resident Evaluation: Reliability, Validity, and Correlation with Objective Examination Scores: A Cross-sectional Study. Anesth Analg. 2025;141(1):190–8.
Streiner DL. Starting at the beginning: an introduction to coefficient alpha and internal consistency. J Pers Assess. 2003;80(1):99–103.
Tanaka P, Park YS, Liu L, Varner C, Kumar AH, Sandhu C, Yumul R, McCartney KT, Spilka J, Macario A. Assessment Scores of a Mock Objective Structured Clinical Examination Administered to 99 Anesthesiology Residents at 8 Institutions. Anesth Analg. 2020;131(2):613–21.
Turner JL, Dankoski ME. Objective structured clinical exams: A critical review. Fam Med. 2008;40(8):574–8.
Weller J, Bloch M, Young S, Maze M, Oyesola S, Wyner J, Dob D, Haire K, Durbridge J, Walker T, Newble D. Evaluation of high fidelity patient simulator in assessment of performance of anaesthetists. Br J Anaesth. 2002;90(1):43–7.
Wright MC. High-stakes assessment in anesthesia via simulation: Are we there yet? Can J Anesth/J Can Anesth. 2019;66:1431–6.
Appendix 1: Summary of articles included for assessing the Reliability and Validity of Objective Structured Clinical Examinations (OSCE) in Assessing Clinical Competence in Anaesthesia.