Introduction
Clinical trials have become increasingly complex due to the development of sophisticated protocol designs, the introduction of new technologies, and the growing demands for expanded data. Recent evidence shows that trial sites view protocol complexity as a greater challenge than staff turnover or limited resources (1, 2). The Tufts CSDD likewise reports that oncology trials—marked by global scope, molecularly driven designs, and restrictive eligibility criteria—are especially burdensome, with more endpoints, amendments, and intensive procedures (3–5). Broader analyses confirm a decade-long increase in procedural volume and complexity, resulting in longer development timelines and operational inefficiencies (6, 7).
The operational impact of this trend is substantial. Complex designs require strong infrastructure, multidisciplinary expertise, and institutional support to avoid overwhelming staff (8). Insufficient resources and poor workload management disproportionately affect frontline professionals, including Study Nurses (SNs), Study Coordinators (SCs), and Data Managers (DMs) (9). These roles cover patient recruitment, follow-up, data quality, safety reporting, protocol compliance, and regulatory readiness, all of which are key determinants of trial feasibility (10, 11). Evidence shows that SCs act as vital links between protocols and patient care (12), while SNs ensure ethical and safe procedures, and DMs safeguard data integrity and traceability (13).
Healthcare research consistently demonstrates that increased nursing workloads are associated with burnout, reduced performance, and adverse patient outcomes (14, 15). Comparable mechanisms, such as cognitive overload, prioritization trade-offs, and reduced vigilance, are likely to operate within research settings, threatening both data integrity and protocol compliance (16, 17). Surveys of clinical research staff confirm that unbalanced workload is a significant determinant of job dissatisfaction, turnover intention, and declining site performance (10, 12, 18, 19).
To address these challenges, several workload assessment approaches have been developed to align protocol complexity with staff capacity. However, their integration with validated site performance indicators remains limited (9, 20), leaving a critical gap in optimizing workforce allocation and sustaining research quality. The Clinical Trial Site Performance Measure (CT-SPM) (21) offers an opportunity to bridge this gap by systematically linking staffing demands with measurable trial outcomes. Such integration would not only optimize workforce allocation but also support the sustainability and quality of clinical research delivery.
Results
Of the 362 included trials, most were conducted in Oncology (67.68%) and enrolled fewer than 50 subjects (96.41%). Trial characteristics are reported in Table 1.
Table 1
Characteristics of included trials
| Characteristic | n (%) |
|---|---|
| Phase | |
| Phase I | 49 (13.5%) |
| Phase II | 99 (27.3%) |
| Phase III | 178 (49.2%) |
| Phase IV | 12 (3.3%) |
| Non Phase | 22 (6.1%) |
| Study | |
| Other | 86 (23.8%) |
| Observational | 161 (44.5%) |
| RCT | 115 (31.8%) |
| CRO Involvement | |
| No | 13 (3.59%) |
| Yes | 349 (96.41%) |
| Sponsor | |
| No | 15 (4.14%) |
| Yes | 347 (95.86%) |
| Translational | |
| No | 130 (35.91%) |
| Yes | 231 (63.81%) |
Performance Indicators
Mean scores were 3.10 (SD = 0.87) for Participant Retention (F1), 3.11 (SD = 0.36) for Data Quality (F2), 2.97 (SD = 0.40) for Adverse Events (F3), and 3.05 (SD = 0.35) for Protocol Compliance (F4). Overall performance was comparatively higher, with a mean of 3.40 (SD = 0.56).
Differences in Performance
The Kruskal–Wallis test revealed significant differences between phases for Participant Retention (F1) (χ²(5) = 50.000, p < 0.001, η²[H] = 0.126, moderate), Adverse Events (F3) (χ²(5) = 16.900, p = 0.005, η²[H] = 0.033, small), and Overall (χ²(5) = 25.400, p < 0.001, η²[H] = 0.057, small) (Fig. 1). For study design, large effect sizes were found for Participant Retention (F1) (η²[H] = 0.689) and Data Quality (F2) (η²[H] = 0.254), with moderate effects for Protocol Compliance (F4) (η²[H] = 0.095) and Overall (η²[H] = 0.100). Post-hoc analyses with Holm adjustment confirmed that, within trial phase, Participant Retention (F1) scores in Phase I were significantly higher than in Phases II, III, IV, and non-Phase (all p < 0.006). Observational studies scored considerably higher than other study types in Participant Retention (F1) and Data Quality (F2). Large effects were observed in key contrasts, such as Phase I vs. Not applicable (non-Phase) for Participant Retention (F1) (δ = 0.838, 95% CI [0.603, 0.939]) and RCT vs. Observational (δ = −0.951, 95% CI [−0.981, −0.876]).
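The η²[H] values for the phase comparison can be cross-checked from the reported test statistics. The sketch below uses the common rank-based formula η²[H] = (H − k + 1) / (n − k). The group count k = 6 (inferred from the 5 degrees of freedom) and n = 362 are assumptions, since the text does not state which inputs were used, and the function name is illustrative only.

```python
def eta_squared_h(h_stat: float, n_groups: int, n_obs: int) -> float:
    """Eta-squared derived from the Kruskal-Wallis H statistic."""
    if n_obs <= n_groups:
        raise ValueError("n_obs must be larger than n_groups")
    return (h_stat - n_groups + 1) / (n_obs - n_groups)


# Assumptions (not stated in the text): k = df + 1 = 6 groups, n = 362 trials.
K_GROUPS = 6
N_TRIALS = 362

# (H statistic, eta-squared as reported in the text)
reported = {
    "Participant Retention (F1)": (50.0, 0.126),
    "Adverse Events (F3)": (16.9, 0.033),
    "Overall": (25.4, 0.057),
}

for outcome, (h_stat, reported_eta) in reported.items():
    eta = eta_squared_h(h_stat, K_GROUPS, N_TRIALS)
    print(f"{outcome}: computed {eta:.3f}, reported {reported_eta:.3f}")
```

Under these assumptions the computed values agree with the reported ones to three decimals, which suggests, but does not confirm, that this formula was the one applied.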
Differences by study design were more pronounced. Observational studies achieved the highest Retention and Data Quality, clearly outperforming RCTs. In Retention, RCTs scored significantly lower than Observational studies, with a large effect (δ = −0.867, 95% CI [−0.918, −0.788], p < 0.001), and lower than other studies as well (δ = −0.558, 95% CI [−0.680, −0.406], p < 0.001). Data Quality was also higher in Observational studies than in RCTs, with a medium effect (δ = −0.400, 95% CI [−0.515, −0.272], p < 0.001). Protocol Compliance favored Observational studies over RCTs with a small effect (δ = −0.243, 95% CI [−0.362, −0.117], p < 0.001). The overall performance index was also higher in Observational studies than in RCTs, with a medium effect (δ = −0.382, 95% CI [−0.498, −0.252], p < 0.001). For AEs, no significant difference was found between RCTs and Observational studies (δ = 0.000, 95% CI [−0.137, 0.137], p = 0.999), while RCTs reported slightly more events than other studies, with a small effect (δ = 0.201, 95% CI [0.046, 0.347], p = 0.037).
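For readers unfamiliar with the δ statistic, the following is a minimal sketch of Cliff's delta, the share of cross-group pairs in which the first group scores higher minus the share in which it scores lower. The magnitude thresholds (0.147, 0.33, 0.474) are commonly cited conventions; the effect labels in the text are consistent with them, but their use here is an assumption. The scores are hypothetical and do not come from the study.

```python
from typing import Sequence


def cliffs_delta(x: Sequence[float], y: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if not x or not y:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for xi in x for yj in y if xi > yj)
    less = sum(1 for xi in x for yj in y if xi < yj)
    return (greater - less) / (len(x) * len(y))


def magnitude(delta: float) -> str:
    """Label |delta| using commonly cited thresholds (an assumption here)."""
    size = abs(delta)
    if size < 0.147:
        return "negligible"
    if size < 0.33:
        return "small"
    if size < 0.474:
        return "medium"
    return "large"


# Hypothetical Retention scores, for illustration only.
rct_scores = [2.0, 2.5, 3.0]
observational_scores = [3.0, 3.5, 4.0]

delta = cliffs_delta(rct_scores, observational_scores)
print(f"delta = {delta:.3f} ({magnitude(delta)})")
```

A negative δ means the first group (here RCTs) tends to score lower than the second, which matches the sign convention of the contrasts reported above.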
Canonical Discriminant Analysis
The canonical discriminant analysis yielded two significant dimensions. The first canonical variate (Can1), which explains 65.9% of the variance, was strongly defined by participant retention (standardized coefficient = 1.17) and, to a lesser extent, adverse events (0.44). This function clearly distinguished Phase I trials (M = 1.09) from all later phases, which showed substantially lower centroids. The second canonical variate (Can2), accounting for an additional 30.2% of the variance, was characterized by positive loadings on data quality (0.51) and negative loadings on adverse events (−0.68). This function separated Phase IV (M = −1.11) and non-classifiable trials (M = −0.74) from Phases II and III, which clustered near the positive end of the axis. Together, the two canonical functions captured over 96% of discriminative variance, underscoring the robustness of phase-related differences across performance outcomes.
Staffing impact on outcomes
The effect size estimates indicated that SN staffing had the strongest association with trial performance, particularly in participant retention (η²p = 0.86, large), data quality (η²p = 0.23, large), and protocol compliance (η²p = 0.07, medium), as shown in Table 2. SCs also showed meaningful contributions, with small-to-large effects across multiple outcomes, including participant retention (η²p = 0.18), data quality (η²p = 0.08), and adverse event reporting (η²p = 0.03). In contrast, DMs had only minor effects, limited to adverse event outcomes (η²p = 0.02). Trial complexity did not explain meaningful variance in any domain, with consistently negligible effect sizes (all η²p < 0.01).
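The partial eta-squared values can be related to the F statistics in Table 2 through the identity η²p = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to a few reported F values. The degrees of freedom are not given in this section, so df_effect = 1 and df_error = 357 (362 trials minus four predictors and an intercept) are assumptions chosen for illustration.

```python
def partial_eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """Partial eta-squared recovered from an F statistic and its degrees of freedom."""
    if f_stat < 0 or df_effect <= 0 or df_error <= 0:
        raise ValueError("F must be non-negative and both df must be positive")
    return (f_stat * df_effect) / (f_stat * df_effect + df_error)


# Assumed degrees of freedom (not reported in the text).
DF_EFFECT = 1
DF_ERROR = 357

# (predictor, outcome, F from Table 2, eta2p from Table 2)
rows = [
    ("Study Nurse", "F1 Participant Retention", 2216.39, 0.86),
    ("Study Nurse", "F2 Data Quality", 106.67, 0.23),
    ("Study Nurse", "F4 Protocol Compliance", 28.78, 0.07),
    ("Study Coordinator", "F1 Participant Retention", 79.99, 0.18),
    ("Study Coordinator", "F3 Adverse Events", 9.42, 0.03),
]

for predictor, outcome, f_stat, reported in rows:
    eta_p = partial_eta_squared(f_stat, DF_EFFECT, DF_ERROR)
    print(f"{predictor} / {outcome}: computed {eta_p:.2f}, reported {reported:.2f}")
```

With these assumed degrees of freedom the computed values round to the reported ones, so the table appears internally consistent; the actual model specification may differ.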
Table 2
Staffing predictors of performance outcomes in clinical trials
| Predictor | F1 – Participant Retention, F (p) [η²p] | F2 – Data Quality, F (p) [η²p] | F3 – Adverse Events, F (p) [η²p] | F4 – Protocol Compliance, F (p) [η²p] | Mokken Short, F (p) [η²p] |
|---|---|---|---|---|---|
| Study Nurse | 2216.39 (< 0.001) *** [η²p = 0.86] | 106.67 (< 0.001) *** [η²p = 0.23] | 2.44 (0.119) [η²p = 0.01] | 28.78 (< 0.001) *** [η²p = 0.07] | 60.74 (< 0.001) *** [η²p = 0.15] |
| Study Coordinator | 79.99 (< 0.001) *** [η²p = 0.18] | 30.49 (< 0.001) *** [η²p = 0.08] | 9.42 (0.002) ** [η²p = 0.03] | 3.62 (0.058) [η²p = 0.01] | 19.98 (< 0.001) *** [η²p = 0.05] |
| Data Manager | 0.74 (0.389) [η²p < 0.01] | 3.21 (0.074) [η²p = 0.01] | 5.69 (0.018) * [η²p = 0.02] | 0.08 (0.777) [η²p < 0.01] | 0.49 (0.484) [η²p < 0.01] |
| Trial Complexity | 1.55 (0.214) [η²p < 0.01] | 1.68 (0.196) [η²p < 0.01] | 0.01 (0.914) [η²p < 0.01] | 0.46 (0.499) [η²p < 0.01] | 0.17 (0.678) [η²p < 0.01] |

Notes: Statistically significant values are marked with asterisks. Significance codes: *** p < 0.001; ** p < 0.01; * p < 0.05.
Discussion
This study examined how clinical trial staff perceive the impact of team composition and study design on site-level operational performance. Using a validated, behaviorally anchored instrument and a structured complexity classifier, we triangulated staffing patterns with domain-specific performance outcomes across a heterogeneous set of protocols. Taken together, the findings argue for a dual, complementary approach to trial operations: plan capacity proportionate to protocol demands (via OPAL) and verify performance through recurrent, domain-specific signals (via CT-SPM).
Across CT-SPM domains, mean values generally indicated optimal performance, with overall scores exceeding those of individual domains—consistent with teams synthesizing multiple practices into a positive overall appraisal. However, between-group contrasts were informative. Differences by phase showed that early-phase studies clustered with stronger participant-facing performance (Retention), while design contrasts were more pronounced: observational studies outperformed RCTs on Retention and Data Quality, with medium-to-large effects, and showed modest advantages on Protocol Compliance; AE reporting differences were minimal between observational studies and RCTs. This pattern aligns with operational realities: RCTs concentrate risk at the intersection of protocol intensity, endpoint burden, and multi-site coordination, which can compromise timeliness and increase queries, even when scientific rigor is higher. Observational designs, in contrast, involve fewer invasive procedures and narrower safety constraints, allowing teams to preserve data completeness and timeliness, as well as patient workflow continuity.
The canonical discriminant analysis adds a multivariate perspective: a first function dominated by Retention (with some AE contribution) separates early-phase studies from later phases, while a second function contrasting Data Quality (positive) against AEs (negative) discriminates Phase IV and “Non-Phase” from Phases II–III. This suggests that perceived performance coalesces around two recognizable planes, participant-facing and data-facing, which is precisely the hierarchical structure supported by CT-SPM psychometrics (F1–F4 nested within higher-order dimensions). From a Quality-by-Design perspective, these planes are actionable: if a protocol’s features are expected to stress participant interfaces (e.g., frequent safety assessments), staffing and processes should prioritize protecting Retention and AE handling; if the design elevates endpoint density and documentation, attention should shift to data-facing workflows.
The staffing analysis points to a clear, modifiable driver: SNs were strongly associated with better Retention, Data Quality, and Protocol Compliance (very large to medium effects), and SCs were associated with benefits across outcomes, including AEs; DMs showed smaller, domain-specific effects. Two interpretations are plausible and not mutually exclusive.
First, SNs occupy the clinical–operational bridge, where many preventable defects originate, such as late or missed visits, consent and re-consent issues, pre-visit preparation, bedside clarification of procedures, and immediate reconciliation of safety information. By stabilizing the patient path and closing loops upstream, SNs reduce downstream burdens (queries, missingness, deviations). SCs, in turn, orchestrate logistics, calendars, and stakeholder interfaces—work that carries significant leverage on timeliness and compliance. DMs safeguard integrity and traceability, but much of the variance captured by CT-SPM domains is determined before data reach DM workflows; this may attenuate their apparent effect in perception-based models. Second, counts (rather than FTE-normalized effort or seniority) may underestimate DM contributions where one experienced DM covers multiple protocols efficiently; conversely, SN and SC effects may scale more linearly with headcount at the protocol level. Regardless, the signal is operationally valuable: the mix matters at least as much as the absolute number of staff.
Notably, OPAL-defined complexity did not explain meaningful variance in perceived performance once staffing was considered. This is not contradictory; it reflects construct divergence. OPAL encodes the expected workload/complexity from protocol features, while CT-SPM captures the realized practices as perceived by teams. A high-maturity unit with adequate SN/SC coverage and tight workflows can absorb complexity without perceiving a decline in day-to-day performance. Range restriction may also play a role: the sample was heavily oncology-weighted, with near-universal CRO involvement, which compresses variability at the higher end of complexity (18). In short, OPAL still does what it is meant to do—quantify demand—but perceived performance depends on whether capacity and process are tuned to that demand. This reinforces a plan-and-prove model: use OPAL upstream to argue for resources; use CT-SPM downstream to verify that practice patterns remain robust under the realized load.
CT-SPM is intentionally perception-based, completed jointly by the entire team to minimize single-rater bias and capture shared operational experience. This approach harnesses tacit knowledge that rarely appears in hard KPIs but predicts where defects tend to accumulate. Its validated structure (four domains nested within two higher-order dimensions), together with a scalable short form, enables both depth (domain diagnostics) and breadth (lightweight screening) in routine oversight. Still, perception brings limits: shared biases, optimism/pessimism, and context effects may influence ratings, and some domains (e.g., Protocol Compliance) showed only moderate discriminative accuracy, suggesting benefits from triangulation with objective, automated KPIs (e.g., CRF timeliness from EDC, query burden per subject-visit, adjudicated deviation counts).
Staff to the risk profile, not just the headcount. Where Retention lags or AE handling is fragile, grow SN capacity and standardize bedside workflows (visit scripts, re-consent triggers, pre-visit checklists). Where timeliness/missingness drive risk, strengthen SC-led logistics (calendar control, source prep, pre-query huddles) and pre-source checks before CRF entry; DMs should focus on traceability and reconciliation protocols for SAE data and primary outcomes. Use CT-SPM subscales to target the lever with the highest marginal return.
Use OPAL at feasibility, CT-SPM in conduct. Integrate OPAL into start-up checklists to make resource requests explicit and defensible; then embed CT-SPM (full or short form) in periodic reviews to confirm that practices occur with the required frequency. This sequencing operationalizes QbD and RBM—specifying risks ex-ante and monitoring the right behaviors in-process.
Prefer domain-aware dashboards. Visualize Retention, Data Quality, AE Reporting, and Protocol Compliance separately. Mixing them into a single index obscures the trade-offs inherent in different designs: RCTs will often carry data-facing stress; observational studies may excel in this regard but require vigilance on consent continuity when recruitment is diffuse. The bifactor architecture of CT-SPM supports exactly this domain-aware view.
Make the short form your “tripwire.” The four-item Mokken scale (queries, SAE accuracy, outcome-data queries, protocol violations) is well suited as a low-burden trigger in central monitoring to flag sites for focused review, especially between visits.
Strengths and limitations
A key strength is the integration of a validated perception instrument with a logically coherent complexity classifier, enabling separation of demand from performance. The team-consensus completion further mitigates idiosyncratic bias and aligns with how site operations are delivered—collectively, not by isolated roles. Finally, the domain-specific analysis respects real operational trade-offs across trial genres.
Limitations merit caution. First, the cross-sectional design prevents causal inference: better staffing may drive better performance, but high-performing units could also attract resources. Second, staffing was captured as counts, not FTE-normalized effort or seniority; future work should account for role expertise and turnover. Third, the sample’s oncology predominance and high CRO involvement may compress OPAL variability and limit generalizability beyond similar environments. Fourth, because CT-SPM is perception-based, common-method variance is possible; triangulation with objective KPIs is needed to solidify construct validity and to calibrate context-sensitive thresholds (e.g., via control charts over time).
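As one possible reading of the control-chart suggestion, the sketch below computes limits for an individuals (XmR) chart from a baseline series and flags later observations that fall outside them. The constant 2.66 is the standard individuals-chart factor (3 / 1.128). The quarterly scores, function names, and the choice of an XmR chart are all hypothetical; the text does not specify a chart type or provide longitudinal data.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def individuals_chart_limits(baseline: Sequence[float]) -> Tuple[float, float, float]:
    """Return (lower limit, center line, upper limit) for an XmR individuals chart."""
    if len(baseline) < 2:
        raise ValueError("at least two baseline observations are required")
    # Moving ranges are the absolute differences between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    center = mean(baseline)
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread


def flag_signals(new_values: Sequence[float], baseline: Sequence[float]) -> List[int]:
    """Indices of new observations that fall outside the baseline control limits."""
    lower, _, upper = individuals_chart_limits(baseline)
    return [i for i, value in enumerate(new_values) if value < lower or value > upper]


# Hypothetical quarterly CT-SPM domain scores for a single site.
baseline_scores = [3.0, 3.2, 3.1, 3.3, 3.0, 3.2]
recent_scores = [3.1, 2.4, 3.3]

lower, center, upper = individuals_chart_limits(baseline_scores)
print(f"limits: {lower:.2f} to {upper:.2f} (center {center:.2f})")
print("flagged periods:", flag_signals(recent_scores, baseline_scores))
```

Deriving limits from each site's own baseline is one way to obtain context-sensitive thresholds, since a score that is routine at one site may be a signal at another.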
Future directions
Future research should expand beyond cross-sectional perceptions to establish causal and longitudinal evidence linking staffing configurations, workload dynamics, and site performance. Multi-center, mixed-methods studies integrating objective metrics (e.g., CRF timeliness, deviation rates, AE reconciliation time) with perception-based measures such as the CT-SPM would enhance construct validity and help delineate the pathways through which team composition affects data quality and patient-centered outcomes. Experimental or quasi-experimental designs could test targeted staffing interventions (for instance, increasing SN or SC coverage) while monitoring their downstream effects on recruitment, retention, and regulatory compliance (24, 25).
Additionally, psychometric refinement of the CT-SPM should continue, including cross-cultural validation and the establishment of performance benchmarks stratified by therapeutic area and trial phase. Integrating the CT-SPM within centralized monitoring systems or electronic dashboards would operationalize real-time oversight, enabling early detection of risk signals and data drift. Future iterations of the OPAL–CT-SPM integration could also incorporate machine learning models to predict staffing needs or performance decline based on protocol complexity and operational history.
Finally, workforce sustainability should become a core endpoint. Understanding how workload equity, training, and professional recognition influence retention and well-being among SNs, SCs, and DMs will be critical to maintaining quality and continuity in an increasingly complex research environment.