Development and validation of a scale assessing perceived trustworthiness in large language models
Ala Yankouskaya1,*, Basad Barajeeh2, Areej Babiker3, Sameha AlShakhsi3, Yunsi Tina Ma2, Chun Sing Maxwell Ho2, Raian Ali3
1 School of Psychology, Bournemouth University, UK
2 The Education University of Hong Kong, Hong Kong
3 College of Science and Engineering, Hamad Bin Khalifa University, Qatar
Corresponding author
Ala Yankouskaya (ayankouskaya@bournemouth.ac.uk)
Authors and Affiliations
School of Psychology, Bournemouth University, Fern Barrow, Poole BH12 5BB, UK
Ala Yankouskaya
The Education University of Hong Kong, 10 Lo Ping Rd, Ting Kok, Hong Kong
Basad Barajeeh, Yunsi Tina Ma, Chun Sing Maxwell Ho
College of Science and Engineering, Hamad Bin Khalifa University, Education City - Gate 8, Ar Rayyan, Qatar
Raian Ali, Sameha AlShakhsi, Areej Babiker
Contributions
A.Y. Conceptualised and designed the study, mentored, performed and reported the statistical analysis, and wrote the original draft.
B.B. Performed and reported the statistical analysis, contributed to the original draft and prepared the material for OSF.
A.B. Conceptualised and designed the study, curated the data, and contributed to the original draft.
S.A. Conceptualised and designed the research, curated the data, performed and reported part of the analysis, and reviewed and edited the paper.
Y.T.M. Validated the analysis, contributed to the original draft, and reviewed and edited the paper.
C.S.M.H. Validated the analysis and reviewed and edited the paper.
R.A. Conceptualised, designed, and supervised the research, curated the data, secured funding, validated the analysis, and reviewed and edited the paper.
Funding
Open Access funding provided by the Qatar National Library. This publication was supported by NPRP 14 Cluster grant # NPRP 14 C-0916–210015 from the Qatar National Research Fund (a member of Qatar Foundation). The findings herein reflect the work and are solely the responsibility of the authors.
Data Availability
The datasets generated by the survey research and the analysis code are available in the Open Science Framework repository at https://osf.io/ku5sz/?view_only=794eea00b5f149b49b874c385c07651c. The authors confirm that all data generated or analysed during this study are included in this published article.
Competing interests
The authors declare no competing financial and/or non-financial interests.
ORCID:
Ala Yankouskaya: 0000-0003-0794-0989
Basad Barajeeh: 0009-0008-8389-5132
Areej Babiker: 0000-0002-8105-1664
Sameha AlShakhsi: 0000-0002-2138-4731
Yunsi Tina Ma: 0000-0002-1083-7547
Chun Sing Maxwell Ho: 0000-0002-1776-3683
Raian Ali: 0000-0002-5285-7829
Declarations
Ethical approval and consent to participate:
Ethical approval for this research was granted by the Bournemouth University Ethics Committee, UK (Reference: N62239, 03 March 2025).
Informed consent was obtained from all individual participants included in the study.
Consent to publish:
NA
Abstract
Large language models (LLMs) are increasingly part of everyday life, yet there is no established way to measure how users evaluate their trustworthiness. This study introduces the Perceived Trustworthiness of LLMs scale (PT-LLM-8), developed from the TrustLLM framework and adapted as a human-centred measure. The scale was designed to measure the perceived trustworthiness of a user's primary LLM and assesses eight dimensions: truthfulness, safety, fairness, robustness, privacy, transparency, accountability, and compliance with laws. Psychometric properties of the scale were tested with 752 LLM users in the United Kingdom (Mean age = 28.58, SD = 6.11, 50.3% males, 48.8% females). The PT-LLM-8 functions as a unidimensional measure with high internal consistency (Cronbach's alpha = 0.90, Composite Reliability = 0.91), strong item-total correlations (0.62–0.75), and measurement invariance across gender. The scale yields an overall score of perceived trustworthiness, with item-level responses available when insight into specific dimensions is needed. For researchers, practitioners, and developers, the PT-LLM-8 offers a practical instrument for evaluating interventions, comparing groups and contexts, and examining whether technical safeguards are reflected in users' perceived trustworthiness of LLMs. The scale can also be applied to guide system design, support policy development, and help organisations monitor shifts in user trust toward LLMs over time, making it applicable across research, practice, and governance.
Keywords
Large Language Models
perceived trustworthiness
human-centred measure
Generative Artificial Intelligence
Introduction
Debates about generative artificial intelligence (GenAI) are increasingly centred on whether these systems and tools can be trusted and on what grounds such trust should be established. The rise of large language models (LLMs) and their popularity have intensified this inquiry. Unlike earlier forms of automation, which were mainly limited to computational or information retrieval tasks, LLMs generate text that can simulate expertise, credibility, empathy, and advice, frequently articulated with the apparent authority of natural language (Liu et al., 2022; Sorin et al., 2024; Taillandier et al., 2025; Wester et al., 2024). This capability obscures the boundary between information provision and persuasion, and between augmenting human decision-making and supplanting it. Consequently, the question of trust becomes more consequential, not only in terms of whether people will use these systems, but also whether they will consider LLM outputs credible and actionable. Governments increasingly invoke public trust as a prerequisite for deploying AI in sensitive domains such as healthcare, education, and law, while the private sector often frames trust as essential for widespread adoption (Roski et al., 2021; Afroogh et al., 2024; Ingrams et al., 2022; Kleizen et al., 2023). This introduces a practical challenge for both researchers and policymakers. If trust is to meaningfully guide AI governance and implementation, it must be examined and defined in ways that render it both observable and comparable. Accordingly, the development of robust methodologies to assess how individuals orient themselves in relation to LLMs, whether with trust, scepticism, or uncertainty, has become a critical task. Such measurement is essential for monitoring public sentiment, promoting transparency, trust, and accountability, and elucidating the conditions under which LLMs may attain or be denied social acceptance (Afroogh et al., 2024; Sarker, 2024; Durán and Pozzi, 2025).
The concept of trust has been studied extensively in human-automation interaction (Schäfer et al., 2024; Roesler et al., 2024). In psychology and management, trust refers to a person's readiness to accept vulnerability to the actions of another, particularly in situations involving uncertainty (Mayer et al., 1995a). In a broader sense, trust represents a cognitive and emotional evaluation formed by individuals, influenced by factors such as prior experiences, personal beliefs, and situational cues (Hancock et al., 2023a). While trust may promote reliance, it does not directly reflect the system's actual capabilities (De Fine Licht and Brülde, 2021). This distinction is crucial, as users may sometimes exhibit over-trust, relying on a system beyond its functional limits, or under-trust, withholding reliance even when the system is capable (Hoff and Bashir, 2015). In human-computer interaction and human factors research, this is formalised in the notion of calibrated trust, which refers to the degree of alignment between a user's trust and the system's actual reliability and performance (Lee and See, 2004; Holland et al., 2024). Calibrated trust is considered an essential condition for safe and effective interaction because both over-trust and under-trust can impair task performance, reduce safety, and hinder technology adoption. However, researchers have cautioned that focusing on "trust" alone can be misleading, because exhorting people simply to "trust" technology is misplaced if that technology is not in fact trustworthy (Durán and Pozzi, 2025).
Trustworthiness refers to the objective, intrinsic properties or attributes of an AI system that justify or undermine people's reliance on it (Reinhardt, 2023). These are the verifiable and measurable characteristics that make a system worthy of trust. For a large language model, trustworthiness is a function of its performance, consistency, safety, and ethical alignment. For instance, a model's ability to generate factually accurate content is a component of its trustworthiness, while a user's belief that it will do so is a component of their trust. The central challenge in the field is that a system can be technically trustworthy without earning a user's trust, or conversely, a user can develop overconfidence in a system that is not fully trustworthy (Daly et al., 2025; Schlicker et al., 2025). Furthermore, evidence from human-automation studies shows that while trust and trustworthiness are correlated, they are not identical: for instance, people may continue to rely on a system they judge untrustworthy if no alternatives exist (Parasuraman and Riley, 1997; Harvey and Laurie, 2024), or conversely, they may withhold reliance from systems they explicitly recognise as trustworthy due to low trust propensity (Merritt and Ilgen, 2008; Mitchell, 2025). For example, studies of algorithm aversion and algorithm appreciation show that people's trust can fluctuate dramatically based on single errors or superficial cues (Dietvorst et al., 2015). Yet underlying these behavioural swings are implicit judgements about how people perceive trustworthiness attributes. Therefore, distinguishing between "trust in LLMs" (a user's willingness to rely) and "perceived trustworthiness of LLMs" (a judgement of system attributes) is essential, because each requires different forms of measurement: the former through behavioural and self-report indicators of reliance, and the latter through assessments of perceived accuracy, reliability, safety, robustness, and related attributes.
Several approaches have been developed to measure trust in automation and AI through self-report instruments. Early work produced the Trust in Automation Scale (TIAS), which assesses users' perceptions of system ability and integrity in generic automated contexts (Jian et al., 2000). Related measures include the Trust between People and Automation (TPA) scale, originally designed for human-machine interaction, and the Trust Scale for the AI Context (TAI), which adapts this framework for contemporary AI applications (Scharowski et al., 2024a). Both have been examined in psychometric studies, with the TAI showing better alignment to AI-specific scenarios than the older TPA. More recent efforts have turned explicitly to AI agents and LLMs. The Trust-In-LLMs Index (TILLMI) is a short instrument distinguishing two dimensions: an affective orientation towards LLMs, described as closeness, and a cognitive orientation, described as reliance (De Duro et al., 2025). A different study employed a 27-item semantic-differential scale to separate cognitive from emotional trust in AI systems, including LLMs. Outside the LLM context, the Trust of Automated Systems Test (TOAST) has been proposed as a concise nine-item measure that combines perceived understanding of a system with expectations of performance (Wojton et al., 2020). Taken together, the existing instruments demonstrate a consistent focus on trust understood as a user's stance towards a system. Whether expressed as reliance, confidence, or affective orientation, these measures capture the extent to which people are willing to engage with automated or AI systems. However, there remains a notable lack of instruments that specifically measure perceived trustworthiness - that is, users' subjective judgments about a system's qualities as distinct from their own willingness to rely on it.
Perceived trustworthiness is not a superficial attitude metric but a determinant of whether LLMs will be adopted, governed, and integrated into high-stakes sectors. First, market adoption depends on whether individuals perceive LLMs as reliable and fair. If users doubt trustworthiness, even technically robust systems face rejection, slowing productivity gains and digital transformation (OECD, 2024). Second, regulatory compliance and legitimacy hinge on perception: frameworks such as the EU AI Act (Official Journal of the European Union, 2024) and the NIST AI RMF (National Institute of Standards and Technology, 2023) require not only demonstrable safeguards but also transparency practices that are convincing to human stakeholders. Third, economic value creation and competition are perception-sensitive. For example, consumer confidence, investor sentiment, and trade in AI services are influenced by whether systems are seen as trustworthy, which in turn influences capital flows and procurement decisions (Kumar et al., 2024). Fourth, social stability and risk governance depend on perceptions aligning with reality; misperceptions (e.g., over-trust in unsafe systems or under-trust in safe systems) can produce either harmful reliance or wasted innovation potential, both of which carry direct socio-economic costs (Jacovi et al., 2021). Yet at present, no validated tools exist to measure how people judge the trustworthiness of LLMs. This gap means we cannot systematically study whether the public sees these systems as reliable, fair, and accountable, or how such views influence adoption, regulation, and wider social and economic outcomes. Developing and validating such instruments will operationalise perceived trustworthiness and provide a robust basis for practical uses such as evidence-led governance, deployment, and public monitoring.
The present study aims to develop and validate a psychometric instrument for measuring users' perceived trustworthiness of large language models by operationalising the TrustLLM conceptual framework (Huang et al., 2024), which delineates eight dimensions (truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability) that we translate into measurable content domains. Building on the framework's empirical grounding, specifically its benchmark-based evaluation of sixteen mainstream LLMs across more than thirty datasets, we convert these principles into items and subject them to staged analyses to establish dimensional structure, reliability, and construct validity, yielding a concise and interpretable scale.
We also set out a transparent workflow for development and validation to guide researchers and practitioners through the analyses and their implications, and we provide a ready-to-use instrument with clear instructions for administration, scoring and interpretation along with practical applications and limitations to support responsible deployment.
Method
Theoretical background
Trustworthiness of large language models can be approached from two complementary perspectives: as a set of intrinsic socio-technical properties sustained by institutional safeguards, and as a socio-psychological judgement - perceived trustworthiness.
Socio-technical. At the system and institutional levels, socio-technical dimensions represent a validated and auditable foundation for determining whether a system can be considered trustworthy. The recently proposed TrustLLM framework offers the most comprehensive attempt to date to consolidate these socio-technical dimensions into an integrated evaluative structure. Synthesising insights from more than 600 studies, it identifies eight dimensions that define the trustworthiness of LLMs: truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability (Huang et al., 2024) (Fig. 1). TrustLLM provides conceptual clarity about LLM trustworthiness at the system level and also grounds each trustworthiness dimension in empirical evaluation, using over 30 benchmark datasets and systematically testing 16 mainstream LLMs. In doing so, it translates broad normative and technical concerns into measurable outcomes that can be compared across systems, establishing both the scope and the limits of what can be known about the trustworthiness of current LLMs. Yet, while these dimensions serve as objective criteria, their influence is limited by the extent to which users perceive them in everyday interactions. In other words, benchmark scores and audit results establish the warrant for trust, but what actually guides user behaviour is the uptake of that warrant - users' perceived trustworthiness of LLMs. When objective properties and user perceptions converge, reliance is well-calibrated; when they diverge, patterns of over-trust or under-trust emerge. In this sense, the framework delineates the structural grounds of trustworthiness but leaves open the equally important question of how those grounds are taken up psychologically by users in real interactions with LLMs.
Socio-psychological. From a psychological perspective, perceived trustworthiness refers to the beliefs and expectations people develop about whether the necessary conditions for trustworthy performance are in place (Alarcon et al., 2016). These beliefs are influenced by prior experiences (Schwerter and Zimmermann, 2020), contextual cues (Van Der Biest et al., 2025), and institutional assurances (Li et al., 2024), and it is ultimately these perceptions, rather than formal audit results, that lead people to treat a system as trustworthy or not. Crucially, many of these beliefs are implicit: they are formed automatically, outside conscious awareness, yet guide behaviour in systematic ways (Cyrus-Lai et al., 2022). Psychological models of dual-process cognition (Hochman, 2024) and the heuristic-systematic model (Chaiken and Ledgerwood, 2012) show that such implicit evaluations can serve as fast heuristics for action, influencing whether users accept or contest a system, often before deliberate reasoning intervenes (Forscher et al., 2019). The content of these judgements is influenced by factors such as repeated interaction with a system (Glickman and Sharot, 2024), prevailing cultural narratives (Tao et al., 2024), design features that signal safety or authority (Kostick-Quenet and Gerke, 2022), and the broader regulatory environment (Kattnig et al., 2024).
The socio-psychological perspective therefore emphasises that perceived trustworthiness emerges from a combination of implicit and explicit judgements. A substantial body of psychological and organisational research shows that these judgements consistently centre on a set of recurring attributes that people use to evaluate whether an agent, institution, or system is trustworthy (Daronnat et al., 2021). One of the most fundamental of these is truthfulness, often conceptualised as integrity, honesty, or accuracy (David et al., 2015a; David et al., 2015b). Across models of interpersonal and organisational trust, integrity is identified as a core antecedent of perceived trustworthiness, with clear effects on willingness to rely on others or on systems (Mayer et al., 1995b; Colquitt et al., 2007), while evidence of deception or unreliability rapidly undermines it (Levine, 2014). In this sense, truthfulness constitutes a foundational dimension of perceived trustworthiness, anchoring people's intuitive expectations about whether a system is worthy of reliance.
Beyond truthfulness, perceived trustworthiness also depends on how information is collected, stored, and protected. In interpersonal settings, respecting confidentiality is central to maintaining trust, whereas breaches of privacy often irreparably damage it. In digital environments, privacy has been formalised in theories such as the privacy calculus model, which proposes that individuals balance anticipated benefits of disclosure against perceived risks of data misuse, with trust mediating this trade-off (Culnan and Armstrong, 1999). Empirical studies consistently report that when users believe their data are safeguarded, they are more likely to form trusting beliefs, disclose information, and continue engagement (Xu et al., 2009). Conversely, perceived risks of surveillance undermine trust and trigger avoidance behaviours (Malhotra et al., 2004). Thus, privacy protection shapes perceived trustworthiness by assuring users that their personal information will not be misused or exposed. Importantly, perceptions of privacy often intersect with concerns about fairness, since individuals evaluate not only whether their data are secured, but also whether data practices and outcomes are distributed and applied in an equitable manner across users and groups.
Perceptions of fairness have long been recognised as central to the psychology of trust. Organisational justice theory distinguishes between distributive fairness (the equity of outcomes), procedural fairness (the impartiality and consistency of decision-making processes), and interactional fairness (the quality of interpersonal treatment), with all three dimensions shown to predict perceived trustworthiness of authorities and institutions (Colquitt et al., 2001; Colquitt et al., 2013). In digital and automated decision-making environments, fairness is similarly salient: users are more likely to trust systems they perceive as unbiased and equitable in treatment or outcomes (Lee, 2018). Violations of fairness, whether through biased outputs, inconsistent processes, or unequal access, consistently undermine perceived trustworthiness and reduce willingness to rely on a system (Grgic-Hlaca et al., 2018). In this sense, fairness operates as a core evaluative lens through which individuals interpret whether trustworthy performance is present, complementing privacy by extending concerns from the security of personal information to the equitable treatment of persons and groups.
Perceptions of fairness do not exist in isolation but often interact with broader concerns about harm and protection. When outcomes are perceived by individuals as biased or unjust, this can amplify doubts about a system's ability to operate safely, since unfair treatment is frequently interpreted as a risk to personal or collective well-being. In psychological and socio-technical models of trust, perceptions of safety are closely linked to risk assessment and vulnerability. Classic trust definitions emphasise that trust entails a willingness to accept vulnerability in situations of uncertainty (Mayer et al., 1995c). The Trust-Confidence-Cooperation (TCC) model further specifies that trust reduces perceived risk and supports cooperative behaviour in domains such as environmental management and health communication (Earle and Siegrist, 2008). Recent psychological research further supports the centrality of perceived safety to trust and trustworthiness. A multidimensional model of perceived personal safety argues that safety confidence, the belief in one's ability to remain protected or to escape harm, is a component influencing overall safety judgements (Syropoulos et al., 2024). Rooted in classic notions of the "fight or flight" response (Cannon, 1932), this work shows that perceived safety is not only about the absence of external threats but also about the confidence that one can withstand or avoid them. Such perceptions are integral to trustworthiness because they determine whether users view a system as exposing them to manageable or unacceptable risks, and they form the basis upon which expectations of robustness are built.
Interestingly, in person perception research, trust depends partly on the predictability and consistency of perceived actions and behaviours, i.e., the expectation that behaviour will generalise across time and situations (Rempel et al., 1985). Classic attribution theory argues that when outcomes are consistent and caused by stable, competent dispositions, observers infer reliability and form trusting beliefs (Weiner, 1985). In parallel, moral-social judgement theories show that trust also hinges on accountability - the sense that agents are answerable for outcomes and can give adequate reasons (Tetlock, 2002). Experimental work on trust repair likewise finds that appropriate accounts (apology, explanation, or remedy) can restore perceived trustworthiness after failures, especially when the violation is competence-based (Huo et al., 2022). Together, these literatures indicate that perceived trustworthiness increases when an agent (or system) performs reliably (robustness) and remains answerable for what happens (accountability).
For perceived trustworthiness, it is not enough that reliability and accountability exist; they must also be perceptible to others. Transparency provides this perceptual bridge by reducing the psychological costs of uncertainty and suspicion while enhancing the rewards of confidence and predictability, thereby enabling robustness and accountability to be recognised as trustworthy qualities. Classic and modern psychological theories of interpersonal perception show that observability is critical: when people perceive others as open, their uncertainty diminishes and confidence in the other's intent increases (Wheeless and Grotz, 1977; Yang et al., 2021). In contrast, perceived secrecy or ambiguity raises anxiety and suspicion, degrading perceived trustworthiness (Larzelere and Huston, 1980). In addition, the Anxiety-Uncertainty Management (AUM) theory posits that reducing uncertainty through transparent signals is essential to maintaining confidence in social interaction where transparent communication serves to manage anxiety and ambiguity (Neuliep, 2017), thereby stabilising perceived trustworthiness. For instance, transparent communication of health guidelines during the COVID-19 pandemic reduced public anxiety and increased trust and compliance with protective behaviours (Lee and Li, 2021). Thus, transparency operates as a psychological mechanism that allows qualities such as reliability and accountability to be recognised and assessed. By reducing uncertainty and anxiety, it creates the conditions under which these structural features are interpreted as evidence of perceived trustworthiness.
Taken together, the evidence indicates that perceived trustworthiness is a socio-psychological construct through which people organise diverse cues such as truthfulness, privacy, fairness, safety, robustness, accountability, and transparency into an overall expectation of whether reliance is justified (Fig. 1). Because perceived trustworthiness functions as an overarching judgement that integrates multiple cues, advancing the field now requires instruments that can render this latent construct measurable from the human perspective. A validated measure would provide three core benefits. First, it would allow systematic analysis of how people perceive trustworthiness across its key dimensions, including how judgments vary across individuals, contexts, and tasks. Second, it would enable perceived trustworthiness to be transformed from a diffuse intuition into a construct that can be compared, modelled, and linked to behaviour, making it possible to test whether reliance is appropriately calibrated or misplaced, and to identify conditions that lead to over- or under-trust. Third, it would establish perceived trustworthiness as a multidimensional construct that can be tracked over time, allowing researchers to quantify its impact on behaviour, evaluate its stability and change, and generate cumulative evidence across studies and populations.
For large language models, the need for such measurement is especially important. Their outputs are probabilistic, open-ended, and often presented with a fluency that can obscure underlying uncertainty. Users must rely on perception to decide whether to accept, verify, or reject responses, and miscalibration in these judgments can have direct consequences. LLMs are also used across highly diverse contexts, from casual information seeking to professional and educational settings, where perceptions of trustworthiness may differ substantially. Without instruments that capture how people perceive truthfulness, privacy, fairness, safety, robustness, transparency, and accountability in these systems, it is not possible to determine whether reliance is proportionate, whether cues are interpreted as intended, or whether safeguards are effective.
Item development
The Perceived Trustworthiness of LLMs scale was developed to measure perceived trustworthiness of a user’s primary LLM. Item development was grounded in the TrustLLM framework, which identifies eight dimensions of trustworthiness: truthfulness, safety, fairness, robustness, privacy, machine ethics, transparency, and accountability (Huang et al., 2024). Each dimension was operationalised into a single item that could be judged directly by participants. For example, the dimension of truthfulness was operationalised as the perception of receiving “factually accurate and reliable information”. Items were phrased so that respondents evaluated the outcomes of their interaction with the LLM (e.g., receiving reliable information, unbiased and fair responses) rather than its internal technical processes.
All items were worded to reference the participant's primary LLM rather than LLMs in general, because users are more familiar with the system they use most often, which allows them to provide more accurate judgements. Each item was presented with the prompt "I trust my primary LLM will…" followed by a statement reflecting the content of the relevant dimension (e.g., "I trust my primary LLM will use my feedback to address errors and biases effectively").
Item development followed recommended guidelines for scale construction (Boateng et al., 2018). Items were created to ensure parallel structure, direct wording, and user-centred phrasing. Drafts underwent expert review for face validity, with feedback focusing on clarity, relevance, and alignment with the foundational construct of perceived trustworthiness.
Revisions were made in response to this feedback and then refined through iterative review to enhance comprehensibility and ensure consistency of language across dimensions.
It should be noted that during scale development, the dimension of machine ethics was refined into the dimension of laws and regulations. The TrustLLM framework defines machine ethics in socio-technical terms, dividing it into implicit ethics, explicit ethics, and emotional awareness (Huang et al., 2024). These aspects are conceptually diffuse and were designed for evaluating the technical functioning of LLMs rather than user perceptions. In the context of perceived trustworthiness, which captures the socio-psychological side of human-LLM interaction, machine ethics could not be framed in wording that would allow respondents to give reliable judgements. In contrast, framing the construct as perceived compliance with laws and regulations provides a clearer and more meaningful dimension for respondents.
The final set of items is presented in Table 1.
Table 1
Scale items, corresponding dimensions, and the construct measured

| Scale item ("I trust my primary LLM will…") | Corresponding dimension | What it measures (the extent to which…) |
|---|---|---|
| … provide information that is factually accurate and reliable | Truthfulness | users believe that an LLM provides accurate, reliable, and non-misleading information |
| … handle my personal data securely and confidentially | Privacy | users feel confident that an LLM respects their privacy and protects sensitive or personal data from misuse |
| … provide unbiased and fair responses | Fairness | users perceive that an LLM gives responses that are impartial and free from bias or discrimination |
| … avoid generating harmful or dangerous content | Safety | users feel assured that an LLM avoids producing unsafe, harmful, or inappropriate outputs |
| … perform well across different topics and situations | Robustness | users perceive that an LLM can perform consistently and reliably across a wide range of topics and contexts |
| … provide clear explanations about how their responses are generated | Transparency | users perceive that an LLM makes its outputs understandable by providing clear and accessible explanations |
| … use my feedback to address errors and biases effectively | Accountability | users believe that developers of LLMs take responsibility for errors and use feedback to improve system reliability and fairness |
| … provide responses that comply with relevant laws and regulations | Regulations and laws | users feel confident that an LLM's outputs align with legal and regulatory standards |
Each item was rated on an 11-point scale ranging from 0 (not at all) to 10 (completely). This format was chosen because numerical rating scales with a wider range of response options capture greater variability in perceptions than shorter Likert-type scales, thereby improving measurement sensitivity (Jebb et al., 2021). An 11-point format is widely used in the social sciences and survey research, particularly in studies of trust and satisfaction, as it allows respondents to express more nuanced judgements while remaining easy to understand (Revilla et al., 2014). The use of 0 as a true minimum anchors the scale clearly and helps to avoid ambiguity about whether low scores reflect neutrality or absence of the construct (DeCastellarnau, 2018). This design therefore ensured that ratings of perceived trustworthiness were both interpretable for respondents and psychometrically robust.
In line with the 0–10 response format, participants were first asked “To what extent do you trust your primary LLM?”, followed by the specific item statements beginning “I trust my primary LLM will…”. Framing the question in terms of extent encouraged respondents to view trust as graded rather than absolute, allowing their answers to be placed along a continuum and thus providing greater sensitivity to differences in perceived trustworthiness.
Data collection
Participants were recruited via the online platform Prolific (prolific.com), and the survey was developed and disseminated using SurveyMonkey (surveymonkey.com). Data collection was part of a larger study aimed at assessing psychological and behavioral dimensions related to the use of LLMs. The survey instrument underwent multiple iterations during its development to enhance clarity and refine the questionnaire. A pilot study was carried out in late March 2025 to assess comprehension of the questionnaire.
Ethical approval for this research was granted by the Bournemouth University Ethics Committee, UK (Reference: N62239, 03 March 2025).
Prior to participation, individuals were presented with detailed information about the study and asked to provide informed consent. They were also informed of their right to withdraw at any point without penalty. Monetary compensation was provided upon successful completion of the questionnaire. The full-scale data collection took place from late March to early August 2025.
Participants
A total of 966 participants were recruited for this study. Following data quality checks and eligibility screening, several cases were removed prior to analysis. Eligible participants were required to be at least 18 years old, currently reside in the United Kingdom, and identify as British in terms of culture and norms. Responses that were incomplete or failed to meet basic prescreening requirements were removed from the dataset (n = 55, 5.69%). Participants were further screened based on their engagement with the LLM they use most frequently, referred to as their "Primary LLM." Inclusion required meeting at least one of the following: (1) they used their Primary LLM almost daily and considered themselves significantly dependent on it, (2) they used it almost daily but did not rely on it significantly, or (3) they reported significant reliance despite not using it on a daily basis. Participants who neither used their Primary LLM frequently nor reported significant reliance were excluded from the analysis (n = 34, 3.52%).
To ensure the quality of the data, the survey incorporated attention check items. Participants who failed to respond appropriately to these checks were excluded from the final analysis (n = 125, 12.94%). After applying all exclusion criteria, including both eligibility screening and failed attention checks, a final sample of 752 participants was retained for analysis. A description of the study sample characteristics, including demographic information, education level, and employment status is provided in Supplementary Material 1.
Workflow for testing psychometric properties of the new scale
The psychometric evaluation of the PT-LLM-8 scale was conducted in several steps. First, we examined the latent structure of the scale through exploratory and confirmatory factor analyses. Next, we assessed reliability and measurement quality using internal consistency, composite reliability, convergent validity, item-level diagnostics, and measurement invariance. Finally, we evaluated validity in relation to external criteria, including personality characteristics, LLM competence, and LLM usage characteristics. Figure 1 illustrates these steps.
Fig. 1
Steps in the psychometric evaluation of the Perceived Trustworthiness of LLM Scale (PT-LLM-8)
Instruments used for external validation:
External validity of the PT-LLM-8 was assessed using established instruments measuring related psychological constructs. Specifically, we examined the associations between perceived trustworthiness of LLMs and individual personality traits, employing the following scales:
Self-efficacy Scale
Self-efficacy refers to an individual's belief in their capacity to successfully perform tasks or manage situations (Bandura, 1977). In the context of digital technologies, technological self-efficacy has been shown to influence how people engage with, evaluate, and adopt new systems (88). Several studies report that higher self-efficacy is positively correlated with perceived trustworthiness of web sources (Andreassen and Bråten, 2013), with use of sports devices (Chamorro-Koc et al., 2021), and with technology adoption (Xu et al., 2024). Based on this evidence, we expected self-efficacy to be positively associated with trustworthiness. Individuals who feel confident in their ability to handle technology experience less doubt and risk, which makes them more likely to judge digital systems as credible and dependable. We employed a single-item measure of general self-efficacy (Di et al., 2023). Psychometric evaluation of this scale showed good reliability estimates (M = 0.642) and strong criterion validity, correlating at r = 0.795 with a widely used multi-item general self-efficacy scale. Participants were asked to what extent they agree with the statement "I believe I can succeed at most endeavours I set my mind". The responses for this measure ranged from 0 to 10, with higher scores reflecting greater self-efficacy.
Self-esteem Scale: Self-esteem is generally defined as a person's overall evaluation of their own worth or value. It reflects the degree to which individuals feel confident, capable, and worthy of respect (Rosenberg, 2011). Prior work links global self-esteem positively to trust, including higher interpersonal trust and more trusting behaviour (B. Zhang et al., 2024). Accordingly, we expected higher self-esteem to be associated with greater perceived trustworthiness. Self-esteem was assessed with the Single-Item Self-Esteem Scale (SISE; Robins et al., 2001), adapted here to an 11-point response format (0–10), where 0 corresponded to "Not very true of me" and 10 to "Very true of me". The SISE has strong psychometric support: across adult samples it correlates closely with the Rosenberg Self-Esteem Scale (r = .75–.80) and shows comparable criterion validity. The scale ("I have high self-esteem") has also performed well across languages and response formats, indicating robustness to minor scaling changes (Robins et al., 2001).
Personality Traits. Personality traits were assessed using the Ten-Item Personality Inventory (TIPI; Gosling et al., 2003). The TIPI is an ultra-brief measure of the Big Five personality dimensions, consisting of two items per trait (one positively worded and one negatively worded). Participants rated the extent to which each item described them on a 5-point Likert scale ranging from 1 (Disagree strongly) to 5 (Agree strongly). The Extraversion subscale measured tendencies related to sociability and outgoingness. The Agreeableness subscale assessed trustfulness and cooperative tendencies, including the items "is generally trusting" and "tends to find faults with others." Conscientiousness was assessed with the items "tends to be lazy" and "does a thorough job." The Neuroticism subscale measured the tendency to experience emotional instability, anxiety, and difficulty coping with stress. Openness to Experience assessed curiosity, imagination, and receptivity to new ideas.
Total scores for each Big Five trait were calculated as the sum of the two corresponding items (after reverse-coding where necessary), with higher scores reflecting higher levels of that trait. In our sample, internal consistency (Cronbach's α) was as follows: Extraversion = 0.68, Agreeableness = 0.41, Conscientiousness = 0.62, Neuroticism = 0.70, and Openness = 0.45. These values are consistent with prior work showing that the brevity of the TIPI often results in modest reliability estimates, but that the instrument nevertheless demonstrates good test–retest reliability, convergent validity with longer Big Five measures, and meaningful external correlations (Gosling et al., 2003; Ehrhart et al., 2009). Previous research has shown that Agreeableness reflects a prosocial orientation and strongly predicts interpersonal trust and judgments of others' trustworthiness (Thielmann and Hilbig, 2015a). Conscientiousness also contributes to trust formation, being associated with reliability, consistency, and predictability in social exchanges, which are qualities of trustworthiness (Evans and Revelle, 2008). Extraversion supports the development of trust through active social engagement and communication (Hancock et al., 2023b). Openness to Experience has been associated with greater tolerance for ambiguity and receptivity to novelty, including new technologies, which may foster trust under conditions of uncertainty (Stanley et al., 2005). By contrast, evidence for Neuroticism (low emotional stability) shows no reliable link to trusting or trustworthy behaviour (Hancock et al., 2023b). Based on this evidence, we expected higher Agreeableness, Conscientiousness, Extraversion, and Openness to Experience to be positively associated with perceived trustworthiness, while Neuroticism would show no significant association.
LLM Competency Scale
LLM competence was measured using a single-item self-assessment, asking participants to evaluate their overall competence in using large language models ("Please rate your competency in using your primary LLM (e.g., forming queries, prompting, applying it to various topics and tasks, adjusting settings, etc.)"). Responses were recorded on an 11-point scale ranging from 0 ("not competent at all") to 10 ("very competent"). This measure was designed to capture participants' subjective perception of their ability to effectively interact with and utilize LLMs in various tasks. Perceived competence is a key determinant of technology-related attitudes and behaviors (Xie et al., 2023; Bo and Ma'rof, 2024), as individuals who feel more capable are more likely to engage with digital systems confidently and to evaluate them positively. Based on this reasoning, we hypothesized that higher levels of LLM competence would be positively associated with perceived trustworthiness of LLMs.
Additional measurements. Participants rated the perceived usefulness of their primary LLM in two domains. Personal help was measured with the item: “How much does your primary LLM help you in your personal life such as editing email or social media posts, recommending weekend activities, answering daily life queries such as meal ingredients or recipes, interpreting personal medical reports or conditions, managing daily schedules or reminders?”. Professional help was measured with the item: “How much does your primary LLM help you in your professional tasks such as styling or writing reports, drafting work emails, summarizing text, conducting research, analyzing data, or creating presentations?” Both items were answered on an 11-point scale ranging from 0 (not at all) to 10 (a great deal), with higher scores indicating greater perceived help from the LLM in that domain.
Characteristics of LLM usage were also assessed to provide external validators for the PT-LLM-8 scale.
Participants reported their frequency of use on an 11-point scale ranging from 0 (“very infrequently”) to 10 (“very frequently”). They further indicated their primary method of interaction with the LLM (text, voice, both text and voice, or other), as well as the main device used for access (PC, mobile phone, tablet, smart speaker such as Amazon Echo or Google Nest, dedicated AI device such as Humane AI Pin or Rabbit R1, or other). Finally, participants identified their typical usage location, selecting from stationary settings (e.g., home, office, school), on the move (e.g., commuting, travel), both stationary and mobile, or other.
Data analysis
Data Preprocessing. After initial inspection and removal of participants who did not complete the survey or who failed attention checks, 752 participants were included in the final analysis (Mean age = 28.58, SD = 6.11, 50.3% males, 48.8% females, 0.9% missing responses for gender).
Descriptive statistics. The mean, standard deviation, and distribution of participants’ responses were calculated for each item to examine central tendency, variability, and overall response patterns.
Item quality check procedures. Before conducting the main analyses, we evaluated item quality by examining the assumption of monotonicity. This assumption requires that respondents with higher levels of the latent construct (e.g., perceived trustworthiness of LLMs) are more likely to endorse higher response categories. Assessing monotonicity in the current study served two purposes. First, it strengthened our reliability and dimensionality evaluations, as violations may indicate that an item does not operate consistently across the trait continuum. Second, because such violations can signal multidimensionality or measurement error, early detection allows problematic items to be identified before they affect further analyses. The assessment was implemented using the mokken package in R. For each item, we obtained a scalability coefficient (H), the proportion of violated monotonicity comparisons, the maximum standardized violation (zmax), and an indicator flag when violations exceeded critical thresholds.
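For illustration, this step can be run as in the minimal sketch below; the data frame name pt_items (holding the eight 0–10 item responses) is an assumption, not the authors' actual object name.

```r
# Monotonicity check for the eight PT-LLM-8 items with the mokken package.
# 'pt_items' is an assumed name for a data frame of the 0-10 item responses.
library(mokken)

mono <- check.monotonicity(pt_items)
summary(mono)     # per-item H, number of violations (#vi), zmax, and crit flag

coefH(pt_items)   # overall and item-level scalability coefficients
```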
Exploratory Factor Analysis. The purpose of this analysis was to examine the latent dimensionality of the trust items and to identify an interpretable factor structure that could be cross-validated in an independent subsample using confirmatory factor analysis (CFA). The dataset (N = 752) was randomly divided into two independent subsamples of equal size. The first subsample (n = 376) was used for exploratory factor analysis (EFA), while the second was reserved for CFA. This split-sample design ensured that the factor structure identified in the EFA could be validated on an independent dataset.
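A split of this kind reduces to a simple random draw; the seed and object names below are illustrative assumptions rather than the authors' code.

```r
# Random split of the full sample (N = 752) into EFA and CFA halves.
set.seed(2025)                     # assumed seed, for reproducibility
idx        <- sample(seq_len(nrow(pt_items)), size = nrow(pt_items) / 2)
efa_sample <- pt_items[idx, ]      # n = 376, used for exploratory analysis
cfa_sample <- pt_items[-idx, ]     # n = 376, reserved for confirmatory analysis
```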
Prior to EFA, sampling adequacy was assessed using the Kaiser-Meyer-Olkin (KMO) statistic, which was computed both at the overall scale level and for individual items. Bartlett's test of sphericity was applied to confirm that the observed correlations differed significantly from an identity matrix, supporting the suitability of the data for factor analysis.
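Both adequacy checks are available in the psych package; a sketch under the assumed object names from above:

```r
# Sampling adequacy (KMO/MSA) and Bartlett's test of sphericity.
library(psych)

KMO(efa_sample)                                          # overall and per-item MSA
cortest.bartlett(cor(efa_sample), n = nrow(efa_sample))  # test against identity
```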
To determine the appropriate number of factors to retain, several complementary criteria were applied. First, a scree plot of eigenvalues was inspected to identify points of inflection. Second, Horn's parallel analysis was performed, comparing empirical eigenvalues to those generated from random data under maximum likelihood (ML) extraction. Third, a non-parametric bootstrap procedure with 2,000 resamples was implemented to compute mean eigenvalues and 95% confidence intervals, providing an additional robustness check on factor retention decisions.
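These retention criteria could be implemented as follows (a sketch under the assumed names, not the authors' exact code):

```r
# Scree plot and Horn's parallel analysis under ML extraction.
library(psych)
fa.parallel(efa_sample, fm = "ml", fa = "fa")

# Nonparametric bootstrap of eigenvalues: resample respondents, recompute the
# correlation matrix, and summarise each eigenvalue with a 95% interval.
set.seed(1)
boot_eigs <- replicate(2000, {
  d <- efa_sample[sample(nrow(efa_sample), replace = TRUE), ]
  eigen(cor(d, use = "pairwise"))$values
})
rowMeans(boot_eigs)                                   # mean eigenvalues
apply(boot_eigs, 1, quantile, probs = c(.025, .975))  # 95% percentile CIs
```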
Factor extraction was performed on the item correlation matrix using maximum likelihood estimation, which allows the computation of model fit indices and is robust under the assumption of multivariate normality. Competing solutions specifying one, two, and three latent factors were estimated. To allow for correlated dimensions, an oblimin rotation was applied. For each model, rotated factor loadings, item communalities, and uniqueness values were reported. When multiple factors were specified, factor correlation matrices (Φ) were also estimated.
Model fit was evaluated using a range of indices, including the chi-square goodness-of-fit test, degrees of freedom, p-value, the root mean square error of approximation (RMSEA), the Tucker–Lewis Index (TLI), the root mean square residual (RMSR), and the Bayesian Information Criterion (BIC). These indices provided information on both absolute and comparative model fit, enabling the identification of the most parsimonious and theoretically interpretable factor structure.
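A sketch of the competing EFA solutions and the fit indices named above, using psych::fa (object names assumed):

```r
# One-, two-, and three-factor ML solutions with oblimin rotation.
library(psych)

fits <- lapply(1:3, function(k)
  fa(efa_sample, nfactors = k, fm = "ml", rotate = "oblimin"))

fits[[1]]$loadings      # rotated loadings (one-factor solution)
fits[[1]]$communality   # item communalities; uniqueness = 1 - communality
fits[[2]]$Phi           # factor correlation matrix when k > 1

# Fit indices reported for each solution
sapply(fits, function(f)
  c(RMSEA = f$RMSEA[1], TLI = f$TLI, RMSR = f$rms, BIC = f$BIC))
```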
Confirmatory factor analysis (CFA). The purpose of this analysis was to test the fit of a one-factor model of perceived trustworthiness of LLMs, identified in the exploratory phase, on an independent subsample. Analyses were conducted in R using the lavaan package, with descriptive statistics and correlation diagnostics computed in psych. Items were recoded as numeric, and rows containing only missing responses were removed. Pairwise correlations and Mardia's coefficients were inspected to evaluate distributional assumptions.
A single latent factor representing Perceived trustworthiness of LLMs was defined by the eight observed trustworthiness items. The primary model was estimated using robust maximum likelihood with robust (Huber–White) standard errors and a scaled test statistic (MLR). Missing data were handled with full information maximum likelihood (FIML). The latent factor variance was fixed to 1 for identification (std.lv = TRUE). Global model fit was assessed using the chi-square test, comparative fit index (CFI), Tucker–Lewis index (TLI), root mean square error of approximation (RMSEA with 90% CI), standardized root mean square residual (SRMR), Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC). Both conventional and robust versions of the indices were reported where available. Standardized factor loadings, standard errors, z-values, and R² values were computed for each item. Residual correlation matrices and standardized residuals were examined. Modification indices were inspected (threshold ≥ 10) to identify potential sources of misfit. Checks for Heywood cases (negative residual variances) were also performed.
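The lavaan specification implied by this description is sketched below; the item names pt1–pt8 are placeholders for the actual column names.

```r
# One-factor CFA: MLR estimation, FIML for missing data, latent variance = 1.
library(lavaan)

model <- 'trust =~ pt1 + pt2 + pt3 + pt4 + pt5 + pt6 + pt7 + pt8'

fit <- cfa(model, data = cfa_sample, estimator = "MLR",
           missing = "fiml", std.lv = TRUE)

summary(fit, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
fitMeasures(fit, c("cfi.robust", "tli.robust", "rmsea.robust", "srmr"))
residuals(fit, type = "cor")           # residual correlation matrix
modindices(fit, minimum.value = 10)    # modification indices >= 10
```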
To assess stability of parameter estimates, a nonparametric bootstrap procedure was implemented on complete cases. A total of 1,000 bootstrap resamples were drawn, with models estimated by ML in each resample. For each item, distributions of standardized factor loadings and R² values were summarized with means, medians, and 95% percentile confidence intervals. Distributions of key fit indices (CFI, TLI, RMSEA, SRMR) were likewise summarized across bootstrap samples.
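A compact sketch of this bootstrap, reusing model from the previous block (convergence checks omitted for brevity):

```r
# Bootstrap of standardized loadings on complete cases (1,000 ML resamples).
complete <- na.omit(cfa_sample)
set.seed(1)
boot_loads <- replicate(1000, {
  d <- complete[sample(nrow(complete), replace = TRUE), ]
  f <- cfa(model, data = d, estimator = "ML", std.lv = TRUE)
  standardizedSolution(f)$est.std[1:8]   # the eight factor loadings
})
apply(boot_loads, 1, quantile, probs = c(.025, .5, .975))  # medians + 95% CIs
```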
Reliability analyses. The aim of this analysis was to evaluate the reliability and internal consistency of the Perceived Trustworthiness of LLM Scale (PT-LLM-8) using the full dataset (N = 752). Cronbach’s α was computed for the entire eight-item scale on the full sample to evaluate internal consistency. Item-total correlations and α-if-item-deleted statistics were examined to assess the contribution of each item to the scale. Composite reliability (CR) was calculated using standardized factor loadings and measurement error variances from the CFA model, treating all available data. CR accounts for differential item contributions and provides a more robust estimate of reliability than α. Average variance extracted (AVE) was computed from CFA estimates. An AVE value of 0.50 or greater was taken to indicate that the latent factor explained more than half of the variance in its indicators, supporting convergent validity. Standardized residuals, squared multiple correlations (R²), and modification indices were inspected across the whole sample to identify items with weaker loadings, lower communalities, or localized misfit.
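The reliability indices can be computed as sketched below; fit_full denotes an assumed refit of the one-factor model on the full sample.

```r
# Cronbach's alpha with item-total diagnostics, then CR and AVE from the
# standardized loadings of the one-factor CFA refit on all 752 cases.
library(psych)
library(lavaan)

psych::alpha(pt_items)    # alpha, item-total correlations, alpha-if-deleted

fit_full <- cfa(model, data = pt_items, estimator = "MLR",
                missing = "fiml", std.lv = TRUE)
std <- standardizedSolution(fit_full)
l   <- std$est.std[std$op == "=~"]          # standardized factor loadings
CR  <- sum(l)^2 / (sum(l)^2 + sum(1 - l^2)) # composite reliability
AVE <- mean(l^2)                            # average variance extracted
c(CR = CR, AVE = AVE)                       # AVE >= .50 supports convergence
```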
Measurement Invariance Analysis. To further support the validity of the PT-LLM-8 scale, measurement invariance tests were conducted to establish that the construct structure is equivalent across male and female groups. These tests employed multigroup confirmatory factor analysis (MG-CFA) to assess whether the factor structure of PT-LLM-8 was consistent between genders. The process involved fitting a series of nested models with increasing constraints: starting with configural invariance, where no restrictions were applied and parameters were freely estimated for both groups; followed by metric invariance, which constrained factor loadings to be equal; then scalar invariance, which additionally constrained item intercepts; and finally, strict invariance, where loadings, intercepts, and residual variances were all constrained to be equal across groups.
Model invariance was determined by comparing fit indices across these models, with invariance accepted when changes in CFI were less than 0.01 and changes in RMSEA less than 0.015, based on guidelines from Chen (2007). Model fit was evaluated against the specified criteria, and the comparison between models focused on differences in fit indices; a significant decline in fit, exceeding the cutoff thresholds, indicated a lack of invariance. Furthermore, the chi-square difference test was used to compare the models. If the resulting p-value exceeded the significance level of α = 0.05, we concluded that there was no significant difference between the models, and the more constrained model was considered acceptable.
The models were estimated using maximum likelihood with robust standard errors (MLR) to account for potential multivariate non-normality. These analyses were performed using R 4.4.2 (R Core Team, 2022) with the lavaan package (v0.6-19).
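The nested invariance models map directly onto lavaan's group.equal argument; a sketch with an assumed grouping variable gender in a data frame dat:

```r
# Nested MG-CFA invariance models across gender, estimated with MLR.
configural <- cfa(model, data = dat, group = "gender", estimator = "MLR")
metric     <- cfa(model, data = dat, group = "gender", estimator = "MLR",
                  group.equal = "loadings")
scalar     <- cfa(model, data = dat, group = "gender", estimator = "MLR",
                  group.equal = c("loadings", "intercepts"))
strict     <- cfa(model, data = dat, group = "gender", estimator = "MLR",
                  group.equal = c("loadings", "intercepts", "residuals"))

lavTestLRT(configural, metric, scalar, strict)   # scaled chi-square tests
sapply(list(configural, metric, scalar, strict), fitMeasures,
       fit.measures = c("cfi.robust", "rmsea.robust"))  # check dCFI, dRMSEA
```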
External Validation. We conducted external validation of the PT-LLM-8 scale by examining the relationships and effects of several theoretically relevant variables on perceived trustworthiness of LLM scores. Predictor variables included self-efficacy, self-esteem, and the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism), chosen based on their theoretical relevance to trust and technology adoption.
A multiple linear regression model was fitted using ordinary least squares (OLS) as the estimation method to evaluate the influence of the predictor variables on perceived trustworthiness. The significance of each predictor was assessed through t-tests. In this analysis, the dependent variable was the total score obtained by summing the responses to the items on the PT-LLM-8 scale.
Prior to the regression analysis, we examined the pairwise relationships among variables through Pearson correlation coefficients to ensure linearity and to detect potential multicollinearity issues. To assess multicollinearity, we used the common threshold of 0.9; correlations exceeding this value would indicate problematic multicollinearity among independent variables, which could distort the regression estimates (Vatcheva and Lee, 2016). Furthermore, we calculated the variance inflation factor (VIF) for each predictor to provide additional evidence of the absence of multicollinearity in the linear regression model. As a common guideline, VIF values exceeding 5 are considered indicative of problematic multicollinearity (O'brien, 2007). This process ensured that the assumptions underlying the regression analysis were adequately met, providing a robust external validation of the PT-LLM-8 scale.
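The external-validation regression reduces to an ordinary lm() call with a VIF check; all variable names below are assumptions.

```r
# OLS regression of the PT-LLM-8 total score on personality and self-beliefs.
library(car)   # provides vif()

dat$pt_total <- rowSums(dat[, paste0("pt", 1:8)])   # scale total score

m1 <- lm(pt_total ~ self_efficacy + self_esteem + openness +
           conscientiousness + extraversion + agreeableness + neuroticism,
         data = dat)
summary(m1)   # coefficients with t-tests
vif(m1)       # values above 5 would flag problematic multicollinearity
```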
Additional analyses. To further assess the external validity of the PT-LLM-8 scale, we conducted two additional analyses examining the relationships between PT-LLM-8 scores and LLM usage characteristics, the self-reported helpfulness of LLMs for personal and professional tasks, and users' competence with LLMs. We tested two multiple regression models. The first model included four predictors: LLM competency, perceived helpfulness in personal contexts, perceived helpfulness in professional contexts, and frequency of LLM use. The second model focused on characteristics of LLM usage: method of interaction (text, voice, both text and voice, or other), the device primarily used to access the LLM (PC, mobile phone, tablet, smart speaker such as Amazon Echo or Google Nest, dedicated AI device such as Humane AI Pin or Rabbit R1, or other), and the location where individuals access the LLM (stationary settings such as home, office, or school; on the move, e.g., commuting or travel; both stationary and mobile; or other).
We employed the ordinary least squares (OLS) estimation method to compute regression coefficients and evaluate the significance of each predictor. To ensure the validity of the regression models, we assessed multicollinearity among the numerical predictors in the first model by calculating the variance inflation factor (VIF); all predictors demonstrated VIF values below the conventional threshold of 5, indicating that multicollinearity was not a concern.
All analyses in the present study were conducted in R via RStudio (Version 2024.12.1+563). Code for these analyses is available on OSF (https://osf.io/ku5sz/?view_only=794eea00b5f149b49b874c385c07651c).
Results
Descriptive statistics and correlations
The mean, standard deviation, skewness, and kurtosis of each item of the PT-LLM-8 scale were assessed to examine the distribution of responses and identify any potential deviations from normality. Spearman’s rank correlation coefficients were calculated to examine the relationships between the items of the scale. Table 2 presents the descriptive statistics and correlation coefficients across the items of the PT-LLM-8 scale.
The item means ranged from 5.93 to 7.95, with standard deviations between 1.72 and 2.68. The items "handle data securely" and "good across topics" had the lowest and highest mean scores, respectively. The absolute values of skewness ranged from 0.43 to 1.26 and of kurtosis from 0.09 to 2.15. All correlations between the items of the PT-LLM-8 scale were significant at the p < .001 level, indicating strong positive relationships among the different aspects of trustworthiness.
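These descriptives and the Spearman matrix could be obtained with a short psych call; pt_items is an assumed placeholder for the item data frame.

```r
# Minimal sketch; 'pt_items' is an assumed data frame of the eight item responses.
library(psych)

describe(pt_items)[, c("mean", "sd", "skew", "kurtosis")]  # item-level descriptives
round(cor(pt_items, method = "spearman"), 3)               # Spearman correlation matrix
```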
Table 2
Descriptive statistics and Spearman coefficients of PT-LLM-8 items.
     
| Item | Mean | SD | Skewness | Kurtosis | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. Factually accurate | 7.068 | 1.841 | −.803 | .958 | – | | | | | | |
| 2. Handle data securely | 5.928 | 2.675 | −.433 | −.590 | .661 | – | | | | | |
| 3. Unbiased and fair | 6.775 | 2.352 | −.661 | −.086 | .671 | .642 | – | | | | |
| 4. No harmful content | 7.504 | 2.163 | −1.091 | .944 | .502 | .513 | .535 | – | | | |
| 5. Good across topics | 7.945 | 1.719 | −1.261 | 2.145 | .611 | .478 | .557 | .600 | – | | |
| 6. Clear explanations | 7.299 | 2.095 | −1.125 | 1.352 | .531 | .490 | .516 | .475 | .595 | – | |
| 7. Listen to feedback, address it | 7.221 | 2.198 | −1.018 | .835 | .514 | .429 | .511 | .436 | .556 | .600 | – |
| 8. Compliant with laws | 7.650 | 2.016 | −1.105 | 1.299 | .568 | .535 | .583 | .617 | .582 | .569 | .555 |

Note: Columns 1–7 contain Spearman correlations with the correspondingly numbered items. All correlation coefficients were significant at the p < .001 level.
Exploratory factor analysis
The aim of the exploratory factor analysis (EFA) was to examine the latent structure of the eight-item Perceived Trustworthiness scale and to test whether the items are best represented by a single common factor. The analysis was conducted on the Spearman correlation matrix, chosen to reduce the influence of non-normality in item distributions, with maximum likelihood (ML) extraction, which provides likelihood-based indices of model adequacy.
Preliminary diagnostics indicated that the data were highly suitable for factor analysis. The Kaiser-Meyer-Olkin (KMO) statistic was .93, with individual item MSAs ranging from .92 to .94, exceeding the recommended .80 threshold for meritorious adequacy. Bartlett’s test of sphericity was highly significant, χ²(28) = 1860.29, p < .001, N = 376, rejecting the null hypothesis of an identity matrix and confirming that the items shared substantial common variance.
The eigenvalues of the Spearman correlation matrix revealed a single dominant factor. The first eigenvalue was 5.15, accounting for 64.3% of the total variance. All subsequent eigenvalues were well below 1.0 (λ₂ = 0.67, 8.4%; λ₃ = 0.53, 6.6%; λ₄ = 0.39, 4.9%), indicating that they explained only minor amounts of variance. The scree plot displayed a sharp break after the first factor, followed by a long flat tail, consistent with unidimensionality (Fig. 2). Parallel analysis supported this conclusion: the observed first eigenvalue (4.75) far exceeded the simulated random mean (0.60), whereas the second observed eigenvalue (0.21) was only slightly larger than the random mean (0.14) and fell below the 95th percentile benchmark. The third and higher eigenvalues were at or below their simulated counterparts (e.g., observed = 0.10 vs. random mean = 0.10), providing no evidence for additional factors.
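For transparency, these steps can be reproduced with the psych package along the following lines; this is a minimal sketch under assumed object names (pt_items_efa), not the authors' published script.

```r
# Minimal sketch; 'pt_items_efa' is an assumed data frame holding the eight
# item responses for the EFA subsample (N = 376).
library(psych)

rho <- cor(pt_items_efa, method = "spearman")  # Spearman matrix to temper non-normality

KMO(rho)                        # sampling adequacy: overall KMO and per-item MSA
cortest.bartlett(rho, n = 376)  # Bartlett's test of sphericity

fa.parallel(rho, n.obs = 376, fm = "ml", fa = "fa", n.iter = 2000)  # parallel analysis
efa1 <- fa(rho, nfactors = 1, fm = "ml", n.obs = 376)               # 1-factor ML solution
print(efa1$loadings, cutoff = 0)  # standardised loadings
efa1$communality                  # h2 per item
```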
Fig. 2
Scree plot with bootstrap confidence intervals and parallel analysis for the Perceived Trustworthiness of LLMs scale
Note. The solid line with points represents the observed eigenvalues from the Spearman correlation matrix. The shaded band corresponds to the 95% confidence intervals of eigenvalues obtained from 2,000 bootstrap resamples. The dotted line indicates the mean eigenvalues from simulated random data, and the dashed line indicates the 95th percentile of these simulated eigenvalues. The horizontal dashed line at eigenvalue = 1 marks Kaiser's criterion.
The 1-factor ML solution produced strong, clean loadings across all items (|λ| range = 0.663–0.851, median = 0.774; Table 3). Communalities were generally high (h² range = 0.440–0.725, median = 0.599; Table 3), indicating that the single factor captures a substantial proportion of each item's variance. Global fit for the 1-factor model was: χ²(20) = 91.71, p < .001; RMSEA = .098; TLI = 0.945; RMSR = 0.041; BIC = −26.88. Although the chi-square test was significant and RMSEA approached .10, these indices are known to be overly sensitive to large samples and low degrees of freedom and therefore should not be treated as definitive indicators of misfit (Schermelleh-Engel et al., 2003; Kenny et al., 2015). Taken together, the strong item saturation, adequate communalities, and converging retention evidence justify interpreting the Trustworthiness scale as essentially unidimensional, consistent with best-practice recommendations for EFA (Costello and Osborne, 2005).
Table 3
The EFA results for PT-LLM-8 scale
| Item | Factor loading | Communality (h²) | Interpretation |
|---|---|---|---|
| Factually accurate | 0.851 | 0.724 | Excellent loading; well represented |
| Handle data securely | 0.758 | 0.575 | Very good loading; acceptable communality |
| Unbiased and fair | 0.812 | 0.659 | Excellent loading; well represented |
| No harmful content | 0.725 | 0.525 | Very good loading; acceptable communality |
| Good across topics | 0.782 | 0.612 | Very good loading; well represented |
| Clear explanations | 0.764 | 0.584 | Very good loading; acceptable communality |
| Listen to feedback, address it | 0.662 | 0.439 | Good loading; marginal communality |
| Compliant with laws | 0.791 | 0.627 | Very good loading; well represented |

Note: Interpretation is based on standardised, qualitative benchmarks (Tabachnick et al., 2019).
Confirmatory Factor Analysis
The aim of the confirmatory factor analysis (CFA) was to test a unidimensional measurement model for the eight-item Trustworthiness scale in the UK independent subsample (N = 376), using robust maximum likelihood with full information maximum likelihood (MLR with FIML) estimation.
The one-factor model showed good fit on robust indices: χ²(17) = 63.59, p < .001, robust CFI = 0.96, robust TLI = 0.93, robust RMSEA = 0.09 (90% CI [0.08, 0.13]), and SRMR = 0.03. The robust CFI exceeded the conventional 0.95 benchmark, and the SRMR indicated very small average residuals. Although the RMSEA was near 0.10, this value requires cautious interpretation because RMSEA is biased upward in models with few degrees of freedom (here df = 17). Simulation studies show that RMSEA can "over-reject" correctly specified low-df models, making fixed cut-offs inappropriate in such contexts (Marsh et al., 2004; Kenny et al., 2015; Groskurth et al., 2023). Evaluating model fit holistically, and given strong theoretical expectations and EFA evidence of unidimensionality, the model was judged acceptable.
At the item level, all standardized factor loadings were significant (z ≥ 11.3, p < .001) and substantial in size (Table 4). Loadings ranged from 0.646 (listen to feedback and address it) to 0.851 (factually accurate), with a median of 0.76. The factor explained a sizeable proportion of variance in each item, with R² values between 0.42 and 0.57 (median 0.50). The complementary uniqueness values (1 − R²) confirm that residual variance was moderate and well within expectations, with no Heywood cases.
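A minimal lavaan sketch of this model is shown below. Item names are placeholders; the reported degrees of freedom (df = 17 rather than the 20 of a plain one-factor model) suggest the final model also freed a small number of residual covariances, which would be added to the syntax.

```r
# Minimal sketch; 'pt_items_cfa' holds the eight items for the independent
# validation subsample (N = 376). Item names are placeholders.
library(lavaan)

cfa_syntax <- '
  trust =~ accurate + secure + fair + no_harm + topics + clear + feedback + laws
'
fit_cfa <- cfa(cfa_syntax, data = pt_items_cfa,
               estimator = "MLR", missing = "fiml", std.lv = TRUE)

fitMeasures(fit_cfa, c("chisq.scaled", "df.scaled", "cfi.robust",
                       "tli.robust", "rmsea.robust", "srmr"))
summary(fit_cfa, standardized = TRUE, rsquare = TRUE)  # loadings, z, R², uniqueness
```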
Table 4
Item-level results from confirmatory factor analysis of the PT-LLM-8 scale
| Item | Standardised factor loading | SE | z | R² | Uniqueness | Boot mean loading | 95% CI |
|---|---|---|---|---|---|---|---|
| Factually accurate | 0.746 | 0.097 | 13.639 | 0.556 | 0.444 | 0.734 | [0.667, 0.799] |
| Handle data securely | 0.664 | 0.131 | 13.328 | 0.441 | 0.559 | 0.690 | [0.617, 0.760] |
| Unbiased and fair | 0.725 | 0.112 | 15.591 | 0.526 | 0.474 | 0.743 | [0.673, 0.803] |
| No harmful content | 0.651 | 0.124 | 11.340 | 0.424 | 0.576 | 0.674 | [0.578, 0.755] |
| Good across topics | 0.755 | 0.097 | 12.677 | 0.570 | 0.430 | 0.740 | [0.667, 0.806] |
| Clear explanations | 0.685 | 0.121 | 11.409 | 0.469 | 0.531 | 0.688 | [0.590, 0.775] |
| Listen to feedback, address it | 0.646 | 0.113 | 12.549 | 0.417 | 0.583 | 0.652 | [0.571, 0.725] |
| Compliant with laws | 0.744 | 0.108 | 13.598 | 0.554 | 0.446 | 0.766 | [0.704, 0.826] |
Note. Standardised factor loadings were estimated using confirmatory factor analysis with robust maximum likelihood estimation and FIML (N = 376). SE = standard error of the loading; z = Wald z statistic testing whether the loading differs from zero; R² = proportion of variance in the observed item explained by the latent factor; Uniqueness = residual variance (1 − R²); Boot mean loading = mean standardised loading across 1,000 bootstrap replications (complete cases); 95% CI = percentile bootstrap confidence interval for the loading
Robust estimation was complemented by nonparametric bootstrapping (1,000 resamples, complete cases), which yielded closely aligned results (Table 4). Across all items, bootstrap confidence intervals were narrow, never crossed zero, and converged closely with robust point estimates. This convergence demonstrates that the factor solution is stable, replicable, and not influenced by estimation method.
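The bootstrap procedure could be sketched as follows; fit_cfa is the assumed fitted object from the sketch above, and the reported analysis used complete cases.

```r
# Minimal sketch: percentile bootstrap (1,000 resamples) of the standardised
# loadings from the fitted one-factor model 'fit_cfa'.
library(lavaan)

get_loadings <- function(f) {
  ss <- standardizedSolution(f)
  ss$est.std[ss$op == "=~"]  # one standardised loading per item
}

set.seed(1)
boot_l <- bootstrapLavaan(fit_cfa, R = 1000, FUN = get_loadings)

colMeans(boot_l)                                   # bootstrap mean loadings
apply(boot_l, 2, quantile, probs = c(.025, .975))  # percentile 95% CIs
```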
Because the RMSEA estimate in the adjusted one-factor model was 0.09, a Monte Carlo simulation study was conducted to evaluate whether this value reflected substantive model misfit or sampling variability inherent in models with low degrees of freedom. The simulation generated 2,000 synthetic datasets from the fitted adjusted CFA model (N = 376) under the assumption that this model represented the true population structure. Each dataset was re-estimated using MLR with the same model specification, and robust fit indices were recorded.
Across replications, the median robust RMSEA was .06 (95% range ≈ .05-.08), with CFI and TLI consistently high (median CFI = .97, TLI = .95). The observed RMSEA of .09 in the empirical data fell near the upper tail of this simulated distribution, indicating that such values are expected by chance even when the model is correctly specified. In contrast, when the restricted one-factor model without residual covariances was fitted to the same simulated datasets, median RMSEA increased to .11 and CFI dropped to .91, demonstrating that omitting substantively justified residual correlations produces systematic deterioration in model fit.
Sensitivity analyses varying the sample size (N = 200–750) further showed the instability of RMSEA in smaller samples. At N = 376, the distribution of simulated RMSEA values remained wide, occasionally producing estimates near .09-.10 despite the model being correctly specified. Larger sample sizes narrowed the RMSEA distribution but retained a slight positive bias relative to CFI and SRMR.
Overall, the Monte Carlo results confirm that the elevated RMSEA observed in the empirical data does not indicate serious misspecification but reflects the known upward bias of RMSEA in low-df models with moderate sample sizes. This finding supports the interpretation of model fit holistically, with stronger emphasis on the robust CFI and SRMR, which consistently indicated good fit, and aligns with prior theoretical expectations and EFA evidence of unidimensionality.
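The logic of this simulation can be sketched in lavaan as below. The population loadings are placeholders taken from Table 4; the authors' population model was the adjusted model with additional residual covariances, which this simplified sketch omits.

```r
# Minimal sketch of the Monte Carlo logic: treat a fitted model as the
# population, simulate N = 376 datasets, re-fit, and record robust fit indices.
library(lavaan)

pop_model <- '
  trust =~ 0.746*accurate + 0.664*secure + 0.725*fair + 0.651*no_harm +
           0.755*topics + 0.685*clear + 0.646*feedback + 0.744*laws
'
free_model <- 'trust =~ accurate + secure + fair + no_harm +
                        topics + clear + feedback + laws'

set.seed(2025)
sims <- replicate(2000, {
  d <- simulateData(pop_model, sample.nobs = 376, standardized = TRUE)
  f <- cfa(free_model, data = d, estimator = "MLR")
  fitMeasures(f, c("rmsea.robust", "cfi.robust", "tli.robust"))
})
apply(sims, 1, quantile, probs = c(.025, .50, .975))  # sampling distribution of fit
```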
Reliability of PT-LLM-8
We evaluated internal consistency for the eight-item PT-LLM-8 scale using the full dataset (N = 752). Cronbach’s alpha computed from the covariance matrix was α = 0.902, and the standardised alpha computed from the correlation matrix (i.e., as if all items had SD = 1) was αstd = 0.907. The small difference shows that high reliability is not an artefact of unequal item variances. The mean inter-item correlation was 0.550, indicating strong coherence without redundancy. Non-parametric bootstrapping confirmed the stability of α: the bootstrap 95% percentile interval was [0.889, 0.914].
Composite Reliability (CR) was 0.908, meaning that about 91% of the variance in the optimally weighted composite is attributable to the common factor rather than measurement error. Average Variance Extracted (AVE) was 0.552, indicating that, on average, over half of each item's variance is explained by the latent factor, consistent with satisfactory convergent validity.
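For reference, CR and AVE follow the standard formulas based on the k standardised loadings λᵢ of the one-factor model:

$$
\mathrm{CR}=\frac{\left(\sum_{i=1}^{k}\lambda_{i}\right)^{2}}{\left(\sum_{i=1}^{k}\lambda_{i}\right)^{2}+\sum_{i=1}^{k}\left(1-\lambda_{i}^{2}\right)},
\qquad
\mathrm{AVE}=\frac{1}{k}\sum_{i=1}^{k}\lambda_{i}^{2}
$$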
Per-item diagnostics show uniformly good relationships with the scale total (Table 5). Crucially, neither α-if-deleted nor ωt-if-deleted suggested that dropping any single item would help: the largest α-if-deleted was 0.897 (which is lower than α = 0.902; Δ = -0.005) and the largest ωt-if-deleted was 0.902 (lower than ωt = 0.908; Δ = -0.006). Together the results demonstrate that all eight items contribute meaningfully, and none depress scale reliability.
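The reliability statistics reported here could be computed along the following lines; pt_items and fit_cfa_full are assumed placeholder names for the full-sample item data and a full-sample one-factor fit.

```r
# Minimal sketch; 'pt_items' is the assumed full-sample (N = 752) item data.
library(psych)

rel <- alpha(pt_items, n.iter = 1000)  # Cronbach's alpha with bootstrap CI
rel$total       # raw (covariance-based) and standardised alpha, mean inter-item r
rel$alpha.drop  # alpha if each item is deleted
rel$item.stats  # includes corrected item-total correlations (r.drop)

# CR and AVE from standardised loadings of a full-sample one-factor CFA
ss     <- lavaan::standardizedSolution(fit_cfa_full)  # assumed full-sample fit
lambda <- ss$est.std[ss$op == "=~"]
sum(lambda)^2 / (sum(lambda)^2 + sum(1 - lambda^2))   # composite reliability
mean(lambda^2)                                        # average variance extracted
```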
Table 5
Item-level diagnostics for PT-LLM-8
| Item | Corrected item–total r | α if deleted | ωt if deleted |
|---|---|---|---|
| Factually accurate | 0.754 | 0.886 | 0.891 |
| Handle data securely | 0.682 | 0.893 | 0.897 |
| Unbiased and fair | 0.735 | 0.886 | 0.893 |
| No harmful content | 0.653 | 0.893 | 0.899 |
| Good across topics | 0.723 | 0.889 | 0.894 |
| Clear explanations | 0.696 | 0.889 | 0.896 |
| Listen to feedback, address it | 0.618 | 0.897 | 0.902 |
| Compliant with laws | 0.740 | 0.886 | 0.892 |
Note. Corrected item–total correlation (r) reflects the correlation of each item with the total scale score excluding that item. α if deleted shows the Cronbach’s alpha estimate of reliability if the item were removed from the scale. ωt if deleted represents McDonald’s omega total if the item were removed.
Measurement invariance tests
To evaluate whether the PT-LLM-8 scale functions equivalently across gender groups, measurement invariance testing was conducted. This process assesses whether the scale's factor structure, item loadings, and intercepts are consistent between females and males, ensuring that any observed differences in scores are attributable to genuine variations in the underlying construct rather than measurement bias. The analysis involved a series of nested models, beginning with the configural model (M1), followed by the metric (M2), scalar (M3), and strict invariance models (M4), to determine the level of invariance supported by the data. The results of these tests are reported in Table 6.
Table 6
Summary of measurement invariance analyses for the PT-LLM-8 scale by gender
Model fit indices:

| Model | χ² | df | p-value | CFI | TLI | RMSEA | SRMR | BIC |
|---|---|---|---|---|---|---|---|---|
| M1: Configural | 172.47 | 34 | < .001 | .957 | .929 | .105 | .035 | 22973.98 |
| M2: Metric | 182.46 | 41 | < .001 | .956 | .940 | .096 | .047 | 22937.68 |
| M3: Scalar | 185.49 | 48 | < .001 | .957 | .950 | .088 | .047 | 22894.41 |
| M4: Strict | 200.59 | 56 | < .001 | .955 | .955 | .083 | .047 | 22856.61 |

Model comparisons:

| Comparison | Δχ² | Δdf | p-value | ΔCFI | ΔTLI | ΔRMSEA | ΔSRMR | ΔBIC |
|---|---|---|---|---|---|---|---|---|
| M2 vs. M1 | 9.996 | 7 | .189 | −.001 | .011 | −.008 | .012 | −4.004 |
| M3 vs. M2 | 3.027 | 7 | .882 | .001 | .010 | −.009 | .000 | −10.973 |
| M4 vs. M3 | 15.106 | 8 | .057 | −.002 | .005 | −.004 | .000 | −.894 |

Note: Δ represents the difference between fit indices; df = degrees of freedom.
The configural model demonstrated acceptable fit based on CFI = .957, TLI = .929, and SRMR = .035. The χ² test was significant (χ²(34) = 172.47, p < .001).
Next, metric invariance was tested by constraining factor loadings to be equal in the two groups. The metric model showed a good fit to the data: CFI = .956; TLI = .940; SRMR = .047. The χ² test was significant (χ²(41) = 182.46, p < .001), and the RMSEA of .096 was above the conventional threshold of .08. After estimating the metric model, we compared it against the configural model. The chi-square difference test was not statistically significant (Δχ² = 9.996, df = 7, p = .189), and the absolute changes in CFI (.001) and RMSEA (.008) fell below the .01 and .015 cutoffs, respectively. This result suggests that constraining the factor loadings to be equal across gender groups did not change the model fit substantially; in other words, the constrained (metric) model fits the data equally well.
After establishing metric invariance, scalar invariance was tested by constraining the item intercepts to be equal across males and females, while retaining the constraints of the metric model. The fit indices indicated a good fit for the scalar model: CFI = 0.957; TLI = 0.950; RMSEA = 0.088; SRMR = 0.047. The χ² test for the scalar model was significant (χ²(48) = 185.49, p < .001). The comparison between the metric and scalar models revealed minimal differences: the changes in CFI and TLI were less than 0.01, and the changes in RMSEA and SRMR were negligible. The non-significant change in fit between the scalar and metric models (Δχ²(7) = 3.027, p = .882) supports scalar invariance, indicating that the PT-LLM-8 construct has equivalent factor loadings and intercepts for both men and women.
Finally, strict invariance was tested by additionally constraining the item residuals across males and females. The model demonstrated a good fit: CFI = 0.955; TLI = 0.955; RMSEA = 0.083; SRMR = 0.047. The χ² test for the strict model was also significant (χ²(56) = 200.59, p < .001). Comparing the strict model to the scalar model showed that the additional constraints did not significantly worsen the fit (Δχ²(8) = 15.106, p = .057). We therefore conclude that the PT-LLM-8 scale demonstrates strict invariance across the female and male groups.
Overall, these results indicate that the scale demonstrates configural, metric, scalar, and strict invariance across gender groups: constraining the factor loadings, item intercepts, and item residuals does not significantly affect model fit. These findings confirm that the PT-LLM-8 scale measures the same construct equivalently for males and females.
External validation
To assess the external validity of the PT-LLM-8 scale, we examined its associations with several theoretically relevant variables. Specifically, a linear regression analysis was conducted with the PT-LLM-8 score as the dependent variable and the following external variables as independent predictors: self-efficacy, self-esteem, extraversion, agreeableness, conscientiousness, neuroticism, and openness.
The absolute correlations among the external validators ranged from 0.012 to 0.608. We examined the presence of multicollinearity in the regression model using variance inflation factor (VIF). The VIF values for predictors ranged from 1.047 to 2.031, which are below the threshold of 5. Therefore, multicollinearity was not a concern in this analysis, and the regression estimates can be considered stable and reliable.
The regression model explained approximately 10.4% of the variance in PT-LLM-8 scores (R² = 0.104), indicating a modest level of predictive power. The results of the multiple linear regression fit are presented in Table 7.
Table 7
Estimates of external validators predicting PT-LLM-8
     
| Predictor | Estimate | Std. error | t-value | p-value | 95% CI |
|---|---|---|---|---|---|
| Intercept | 42.777 | 4.208 | 10.165 | < .001 | [34.515, 51.038] |
| Self-efficacy | 1.173 | .319 | 3.672 | < .001 | [.546, 1.800] |
| Self-esteem | .070 | .258 | .272 | .785 | [−.437, .578] |
| Extraversion | −.576 | .239 | −2.413 | .016 | [−1.048, −.107] |
| Agreeableness | .991 | .266 | 3.720 | < .001 | [.468, 1.514] |
| Conscientiousness | .854 | .292 | 2.926 | .003 | [.281, 1.426] |
| Neuroticism | −.344 | .262 | −1.311 | .190 | [−.859, .171] |
| Openness | −.322 | .233 | −1.385 | .167 | [−.779, .135] |

Note: CI = confidence interval.
Among the predictors, self-efficacy emerged as the strongest positive correlate of perceived trustworthiness in LLMs, suggesting that individuals with greater confidence in their own capabilities tend to view LLMs as more trustworthy. Two personality characteristics, agreeableness and conscientiousness, were also significantly positively associated with PT-LLM-8 scores, consistent with the notion that cooperative and dependable individuals are more inclined to attribute trustworthiness to LLMs. In contrast, extraversion was negatively related to PT-LLM-8, indicating that more outgoing individuals tended to rate LLMs as less trustworthy. The remaining predictors (self-esteem, neuroticism, and openness) did not show significant associations, although the directions of the effects for neuroticism and openness were negative.
Overall, these findings suggest that the PT-LLM-8 scale is meaningfully related to relevant external variables, supporting its validity as a measure of perceived trustworthiness in LLMs.
Additional analyses
To further explore the factors influencing the perceived trustworthiness of LLMs, two additional regression analyses were conducted. The first examined whether users' self-reported competence with LLMs, LLMs' helpfulness in personal and professional contexts, and frequency of LLM use predict PT-LLM-8 scores. The second investigated whether contextual variables such as location of use, device type, and interaction method are significant predictors of trustworthiness perceptions. The results of both regression models are reported in Table 8.
Table 8
Estimates of additional external validators predicting PT-LLM-8
   
| Predictor | Estimate | Std. estimate | Std. error | t-value | p-value | 95% CI |
|---|---|---|---|---|---|---|
| Model 1: | | | | | | |
| Intercept | 20.593 | – | 3.095 | 6.654 | < .001 | [14.517, 26.668] |
| Competency | 1.470 | 0.162 | 0.323 | 4.550 | < .001 | [0.836, 2.104] |
| Help in personal | 1.455 | 0.305 | 0.163 | 8.935 | < .001 | [1.135, 1.774] |
| Help in professional | 1.304 | 0.189 | 0.235 | 5.541 | < .001 | [0.842, 1.766] |
| Frequency of use | 0.658 | 0.067 | 0.361 | 1.824 | 0.069 | [−0.050, 1.366] |
| Model 2: | | | | | | |
| Intercept | 56.323 | – | 1.057 | 53.308 | < .001 | [54.249, 58.398] |
| Location: Stationary and on the move | 3.833 | – | 1.051 | 3.648 | < .001 | [1.770, 5.896] |
| Device: PC/Laptop | −1.744 | – | 1.077 | −1.620 | 0.106 | [−3.858, 0.369] |
| Interaction: Text and voice | 7.904 | – | 1.476 | 5.353 | < .001 | [5.006, 10.803] |

Note. The reference categories were "Stationary" for location of use, "Mobile" for device type, and "Text" for interaction method; coefficients for "Stationary and on the move", "PC/Laptop", and "Text and voice" are interpreted relative to these baselines.
The first regression model explained approximately 24.3% of the variance in PT-LLM-8 scores (R² = 0.243). Among the first set of predictors, LLMs' help in personal contexts was the strongest predictor of perceived trustworthiness, suggesting that individuals who seek LLMs' help in their personal lives tend to perceive LLMs as more trustworthy. Competency in using the primary LLM was also significantly positively related to PT-LLM-8, indicating that individuals who are more competent with their primary LLM consider LLMs more trustworthy. LLMs' help in professional tasks likewise showed a strong positive relationship with PT-LLM-8, suggesting that users who rely on their primary LLM for professional tasks are more likely to perceive LLMs as trustworthy. The frequency of LLM use was not significantly associated with PT-LLM-8, although the direction of the effect was positive. To check multicollinearity among predictors, we examined the VIF values, which ranged from 1.15 to 1.35, indicating no problematic multicollinearity. Overall, the results of the first model show that competency in using the primary LLM, together with LLMs' help in personal life and professional tasks, significantly contributes to the prediction of perceived trustworthiness of LLMs: users with higher scores on these attributes perceive LLMs as more trustworthy.
Next, we examined the relationship between PT-LLM-8 and the second set of predictors: location of use, device type, and interaction method. Because only a small number of participants selected certain options, sparse categories were merged before analysis. For location of use, "on the move" and "both" were combined, leaving two categories: "Stationary" (n = 487, 64.8%) and "Stationary and on the Move" (n = 265, 35.2%). For device type, PC and laptop were merged, as were tablets and mobile devices, creating two categories: "PC/Laptop" (n = 530, 70.5%) and "Mobile" (n = 222, 29.5%). For interaction method, "text" and "text and images" were combined, as were "voice" and "text and voice", resulting in two categories: "Text" (n = 662, 88.0%) and "Text and Voice" (n = 90, 12.0%).
Since all predictors in the second regression analysis were categorical, one category within each predictor served as the reference, and the results for the other categories are interpreted relative to this baseline (see Table 8). A minimal sketch of this recoding and model specification is shown below.
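This is a sketch under assumed variable names and raw category labels, not the authors' code; the first level of each factor is the reference category in the dummy-coded OLS model.

```r
# Minimal sketch of the category collapsing and dummy-coded OLS model;
# variable names and raw category labels are assumed placeholders.
dat$location <- factor(ifelse(dat$location_raw == "Stationary",
                              "Stationary", "Stationary and on the move"),
                       levels = c("Stationary", "Stationary and on the move"))
dat$device <- factor(ifelse(dat$device_raw %in% c("PC", "Laptop"),
                            "PC/Laptop", "Mobile"),
                     levels = c("Mobile", "PC/Laptop"))
dat$method <- factor(ifelse(dat$method_raw %in% c("Voice", "Text and voice"),
                            "Text and voice", "Text"),
                     levels = c("Text", "Text and voice"))

m2 <- lm(pt_llm8_total ~ location + device + method, data = dat)
summary(m2)  # coefficients are relative to the first (reference) level of each factor
```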
Results showed that the typical location of LLM use was a significant predictor of PT-LLM-8: individuals who used their primary LLM both in stationary settings and on the move perceived LLMs as more trustworthy than those who used it only in stationary settings. Device type was not significantly associated with PT-LLM-8. The interaction method was the most influential factor: users who interact with their primary LLM through both text and voice were more likely to perceive LLMs as trustworthy than those interacting via text alone.
Overall, the second regression analysis suggests that location of use and interaction method are associated with variations in PT-LLM-8 scores.
Discussion
This study makes a methodological contribution to the field of human-AI interaction by developing and validating the Perceived Trustworthiness of Large Language Models scale (PT-LLM-8). Grounded in a socio-psychological perspective, the scale conceptualises the trustworthiness of LLMs as a human judgement, formed through everyday evaluative processes when individuals interact with these systems and expressed as perceived trustworthiness. As its theoretical foundation, the study draws on the TrustLLM framework (Huang et al., 2024), which sets out key dimensions of trustworthiness such as truthfulness, safety, fairness, robustness, privacy, transparency and accountability. However, in contrast to the TrustLLM framework, which defines trustworthiness primarily from a systems-design perspective, our focus is on how these dimensions are perceived and made meaningful by users in everyday interactions.
Within this framing, the TrustLLM dimension of machine ethics requires particular consideration. In TrustLLM, machine ethics refers to embedding moral reasoning within artificial agents, ensuring that LLMs can recognise, evaluate, and enact ethically appropriate behaviour across contexts (Huang et al., 2024). While this dimension is crucial from a systems-design perspective (Zhou et al., 2024; Ji et al., 2024; Coleman et al., 2025), it is less meaningful from the vantage point of end-users. Most users do not expect LLMs to possess genuine moral agency (Myers and Everett, 2025); instead, they look for assurances that systems operate within recognisable, legitimate boundaries (Scharowski et al., 2023; Yoganathan et al., 2025). In practice, what functions as a proxy for "ethics" in the user's mind is whether the system adheres to established laws, regulations, and institutional safeguards (Afroogh et al., 2024). We therefore reframed the dimension of machine ethics as regulations and laws to capture how trust is socially constructed (i.e., users' perception that LLMs are subject to enforceable standards of oversight). This shift preserves the normative intent of machine ethics, ensuring responsible behaviour, but grounds it in the domain where human trust is actually cultivated: externally verifiable accountability structures.
The psychometric results further support the inclusion of the new item on regulations and laws. This item showed strong correlations with all other dimensions of perceived trustworthiness, loaded robustly on the single latent factor in both exploratory and confirmatory analyses, and demonstrated a high corrected item-total correlation (r = 0.74). Importantly, removing it would have reduced the overall reliability of the scale, confirming that it contributes positively to the coherence of the construct. These findings indicate that regulations and laws is not only theoretically justified as a reframing of machine ethics but also empirically validated as a consistent and meaningful component of perceived trustworthiness.
Psychometric properties of PT-LLM-8 scale. Our results provide converging evidence from exploratory and confirmatory factor analyses that the PT-LLM-8 reflects a single latent construct. Items showed substantial loadings with no cross-loading problems, and model fit indices were within acceptable ranges.
Reliability analyses supported this structure, with high internal consistency and strong item-total correlations, indicating that the eight indicators together capture a coherent judgement of perceived trustworthiness across truthfulness, privacy, fairness, safety, robustness, transparency, accountability, and regulations and laws.
Measurement invariance testing showed that the scale functions equivalently across gender groups. Configural, metric, scalar, and strict invariance were all supported, meaning that factor structure, loadings, intercepts, and residuals are stable across men and women. Observed differences in scores therefore reflect genuine differences in perceived trustworthiness rather than measurement artefacts, confirming that the PT-LLM-8 can be used for valid group comparisons in both research and applied contexts.
At the item level, descriptive analyses showed no floor or ceiling effects, and correlations among items were uniformly strong. The closest associations were among factually accurate, no harmful content, and clear explanations, highlighting accuracy, safety, and responsibility as central anchors of trust. The item “good across topics” stood out with the highest mean, robust factor loadings, and the largest explained variance. It was most strongly associated with accuracy, safety, and clarity, and less so with privacy and regulations and laws, suggesting that broad topical ability is trusted mainly when linked to reliable and safe performance. This pattern diverges from earlier models of trust in competence, which emphasise domain-specific expertise (Lee and See, 2004). In the context of LLMs, however, the systems were designed and introduced as general-purpose tools, making breadth a credible marker of consistent reliability across domains. This extends existing theory by showing that for general-purpose systems, breadth itself can function as a valid cue of trustworthiness.
An important practical consideration for any self-report measure is whether it should be used as a total score across items or also interpreted at the item level. The unidimensional structure of the PT-LLM-8 supports its use as a summed score, providing a reliable index of perceived trustworthiness. This approach is consistent with other short unidimensional trust measures in automation and AI. For example, the Trust in Automation Scale (Jian et al., 2000) and its derivatives are most often reported as composite scores, despite comprising items that refer to diverse qualities such as dependability, integrity, and reliability (McGrath et al., 2025). The same applies to more recent instruments such as the Trust Scale for the AI Context (Scharowski et al., 2024b) and the Situational Trust Scale (Holthausen et al., 2020), both of which demonstrate strong one-factor solutions and are routinely interpreted at the total-score level. In this respect, the PT-LLM-8 behaves in line with established practice: although its items capture distinct cues, they are psychometrically integrated into a single construct and can be validly aggregated.
It has to be noted that the use of aggregated scores is generally preferred because they provide a more reliable representation of the underlying construct than any single item. In Classical Test Theory, individual responses are influenced by random factors such as mood or misinterpretation, but these errors are reduced when multiple items are combined (Cappelleri et al., 2014). This logic reflects a central assumption in psychometrics: that latent traits, such as trustworthiness, are stable properties best captured through a set of indicators rather than individual items. In item response theory, the same point is expressed through local independence, which states that once the latent trait is accounted for, items share no further systematic variance (Edwards, 2009).
On the other hand, reliance on a single summed score has limitations (Gustafsson and Åberg-Bengtsson, 2010). Because the scale combines multiple attributes of trustworthiness such as truthfulness, safety, privacy, fairness, robustness, transparency, accountability, and compliance with laws, the total score cannot reveal which of these specific facets accounts for variation in responses. Tracking changes in the overall score over time may therefore mask important shifts, such as improvements in perceived accuracy but declines in privacy. This limits the usefulness of the composite score for evaluating targeted interventions or comparing institutions and systems, where different trust dimensions may move in different directions. Although the summed score provides the most reliable psychometric index, it may not be sufficient in applied settings where understanding the drivers of trust is essential. For this reason, the PT-LLM-8 should be used with both approaches: the total score as a robust global index of trustworthiness, complemented by item-level diagnostics to identify which specific facets drive perceptions in particular settings. However, such item-level interpretations should be made with caution, given the high shared variance among items.
Links with personality traits and user experiences
Personality traits are systematically linked to differences in perceiving the trustworthiness of others. Individuals with higher self-esteem tend to view others as more trustworthy, reflecting a general positivity bias in social perception (Murray et al., 1996). Similarly, higher self-efficacy has been linked with greater confidence in judging others as reliable and trustworthy, consistent with the idea that efficacy fosters approach-oriented social appraisals (Caprara et al., 2011). Among the Big Five, agreeableness is most robustly associated with perceiving others as trustworthy, as agreeable individuals are more inclined to expect benevolence and honesty (Graziano and Eisenberg, 1997),(Thielmann and Hilbig, 2015b). Extraverts also show a tendency to make more positive trustworthiness attributions, likely due to their sociability and reward orientation (Paunonen, 2003). Conscientiousness has been linked to more cautious but generally positive trust evaluations, reflecting norms of reciprocity and reliability (Thielmann and Hilbig, 2015b). In contrast, neuroticism predicts greater suspicion and lower perceptions of others’ trustworthiness, aligning with heightened threat sensitivity and interpersonal insecurity (Robinson et al., 2025). Openness to experience shows weaker but often positive links, with open individuals more willing to extend trust and interpret unfamiliar others as trustworthy (McCrae and Costa, 1997).
These established associations between personality traits and perceived trustworthiness provide anchor points for validating our scale, as they represent theoretically grounded and empirically consistent patterns against which our findings can be evaluated. Consistent with these expectations, higher PT-LLM-8 total scores were positively related to self-efficacy, agreeableness, and conscientiousness. This convergence suggests that the scale captures judgements of LLMs in ways that mirror well-established processes in human trust perception. The negative association with extraversion, in contrast to its typically positive relation with interpersonal trust (Paunonen, 2003), indicates that extraverts may apply different evaluative standards to LLMs than to humans, perhaps because their trust is grounded in social cues and reciprocal interaction, features less salient in machine communication. Similarly, the absence of significant associations with neuroticism and openness indicates that some dispositional predictors of human trust are less influential in judgements of the trustworthiness of LLMs. Interestingly, the lack of association between self-esteem and perceived trustworthiness of LLMs suggests that the positivity bias linked to self-regard in social judgement does not extend to human-machine contexts. Whereas interpersonal trust often requires appraising the reliability of another person's actions (Zhang, 2021), trust in LLMs rests on perceptions of technical reliability and consistency rather than on a generally positive view of oneself and others. In other words, self-esteem matters when trust involves managing interpersonal relationships, but it is less relevant when judging a system whose behaviour cannot be influenced or controlled by the individual.
Taken together, the findings provide strong support for the construct validity of the Perceived Trustworthiness of LLMs scale. The observed associations with traits such as agreeableness and conscientiousness replicate well-established patterns from the human trust literature, indicating that the scale captures a recognisable trust-related construct. At the same time, the divergences, such as the negative association with extraversion and the absence of links with neuroticism, openness, and self-esteem, suggest that judgements of LLM trustworthiness are not reducible to interpersonal trust mechanisms. Instead, they reflect evaluations that are grounded in dispositional tendencies yet shaped by the unique, non-reciprocal nature of human-machine interaction.
For practitioners, it is essential to understand how the perceived trustworthiness of LLMs is shaped by user experiences, since this determines how the PT-LLM-8 can inform design and implementation strategies. The present findings show that trustworthiness scores were more strongly associated with users' perceptions of how useful LLMs were in personal and professional contexts, and with their sense of competence in using their primary system, than with how frequently they used it. This points to an important distinction between exposure and efficacy in the perceived trustworthiness of LLMs. Research on technology acceptance and trust in automation shows that exposure alone has limited impact on developing trust in AI unless it is accompanied by successful outcomes (Hassan et al., 2019; Kelly et al., 2023). In contrast, efficacy, understood as individuals' judgements of their own capability to organise and execute the actions required to attain designated performances (Bandura, 1986), is self-referential and affects both the beliefs people form and the behaviours they pursue. Applied to LLMs, this means that perceptions of trustworthiness are grounded not in the sheer number of interactions but in whether users judge themselves competent in eliciting useful outputs. From a practical standpoint, this distinction implies that increasing the availability or frequency of LLM use is unlikely to enhance perceived trustworthiness unless systems simultaneously support user efficacy. Designing interfaces that provide clear prompts, transparent feedback, and corrective guidance can enhance the sense of efficacy and thereby bolster perceived trustworthiness. Training and onboarding materials that show users how to phrase queries effectively, and adaptive support that adjusts to different levels of experience, are ways to achieve this. In deployment contexts, PT-LLM-8 scores should be monitored alongside user efficacy, because increases in exposure without corresponding growth in efficacy may produce disengagement or scepticism rather than stronger trustworthiness perceptions.
Our findings also indicate that perceptions of LLM trustworthiness vary with the context of use. Participants who reported using their primary LLM in both stationary and mobile settings showed higher PT-LLM-8 scores than those who restricted use to stationary settings. This is consistent with recent studies reporting that LLMs receive more positive evaluations when they are perceived as available and flexible in everyday life (Dang and Li, 2025). Interaction modality was also influential: users who engaged through both text and voice reported markedly higher scores than text-only users. This result is consistent with recent evidence that voice interaction reduces conversational delays and creates a greater sense of responsiveness, leading users to perceive the system as more reliable (X. Zhang et al., 2024; Wang et al., 2023). Device type, by contrast, was not a significant predictor. This aligns with evidence that evaluations of AI trustworthiness are grounded in perceptions of capability, reliability, and transparency rather than in the hardware through which the system is accessed (Dang and Li, 2025). Taken together, these findings suggest that perceived trustworthiness increases when LLMs can be used effectively across multiple contexts and modalities, whereas the platform of access plays little role in such judgements.
Practical implications
The development and validation of the PT-LLM-8 have significant implications for research, practice, and system design, addressing gaps identified in previous studies (Afroogh et al., 2024; Durán and Pozzi, 2025). For researchers, the scale offers a validated instrument for comparing the perceived trustworthiness of LLMs across studies, populations, and use cases. Its demonstrated measurement invariance allows for comparisons between gender groups without introducing bias from the measurement itself. For practitioners working in organisational or educational settings, the scale can be deployed to monitor perceptions of LLM trustworthiness over time, enabling institutions to detect whether new deployments, policy changes, or training interventions influence user evaluations in positive or negative directions (Roski et al., 2021). For LLM developers and engineers, the findings highlight areas where design interventions may be most effective. The strong associations observed between perceived trustworthiness, perceived usefulness, and user competence suggest that technical enhancements should be complemented by features that actively support user efficacy, such as clearer guidance on prompt formulation, more accessible explanations of outputs, and error-correction mechanisms that transparently incorporate user feedback. This aligns with recent findings showing that, in an educational setting, trust in LLMs was maintained after instructional guidance but declined after practical engagement, potentially because system responses fell short of expectations (Kumar et al., 2024).
Engineers should also note that multimodal interaction (encompassing both text and voice) and flexible use across contexts (stationary and mobile) were associated with higher trustworthiness, indicating that investment in seamless integration of modalities and robust performance across diverse environments may yield greater benefits than hardware optimisation alone. Finally, for policymakers and regulators, the inclusion of laws and regulations as a dimension of perceived trustworthiness underlines the importance of enforceable oversight frameworks (Ingrams et al., 2022; Kleizen et al., 2023; Sarker, 2024). Users interpret compliance with regulation as a proxy for ethical assurance; thus, clear communication regarding such compliance may directly contribute to perceptions of trustworthiness in applied settings (Taillandier et al., 2025; Wester et al., 2024).
Future directions and limitations
Future research should extend the validation of the PT-LLM-8 beyond the present British sample to include diverse cultural and demographic groups. Although the scale was deliberately designed with outcome-focused and broadly interpretable items (e.g., accuracy, fairness, safety, compliance with regulations), it is reasonable to expect cultural variation in how these dimensions are weighted and understood. For instance, in regulatory environments with visible oversight, adherence to laws may be a primary marker of trustworthiness, whereas in contexts with weaker regulation, users may place greater emphasis on transparency or privacy. Similarly, expectations of fairness and accountability are shaped by cultural and institutional norms. A recent study comparing British and Arab populations found that participants from Arab cultural backgrounds generally expressed more positive attitudes toward AI systems, including higher levels of trust (Alshakhsi et al., 2025). Another study reported no significant gender differences in perceived trust across either group, but noted that UK females expressed greater privacy concerns than males, a pattern not observed among Arab participants (Rahman et al., 2025). Incorporating such cultural variability into future validation efforts would help ensure that the PT-LLM-8 remains robust and interpretable across populations.
A further direction is to use the PT-LLM-8 alongside socio-technical evaluations of LLMs. The present study approached trustworthiness of LLM from a socio-psychological perspective, capturing user judgements in everyday interaction. By contrast, frameworks such as TrustLLM conceptualise trustworthiness at the system level, emphasising technical robustness, safety, and compliance with ethical design standards. Comparing these perspectives could reveal gaps between socio-technical compliance and socio-psychological perception, for example, whether regulatory safeguards recognised by developers translate into perceived legitimacy for end-users. Such comparisons would help identify where system design and user trust diverge and inform more integrated approaches to LLM governance and deployment.
The present findings should be interpreted in light of several limitations. First, all validation analyses were based on a sample of British adults, and the psychometric parameters established here reflect the characteristics of this group. The scale’s performance may differ in other age groups, cultural contexts, or populations with varying levels of digital literacy. Applications beyond the current sample should therefore be approached with caution and accompanied by further validation.
Second, while the sample size was sufficient and included a range of ages, the exclusion criteria and quality checks led to a relatively engaged and technologically literate participant pool. This limits generalisability to populations with more casual or sceptical patterns of LLM use. Finally, although the PT-LLM-8 demonstrated unidimensionality and reliability, reliance on self-report methods introduces potential biases related to self-perception, response style, and situational factors. Future work should complement self-report measures with behavioural indicators or longitudinal designs to provide a more comprehensive understanding of perceived trustworthiness of LLM.
Conclusion
The Perceived Trustworthiness of LLMs scale (PT-LLM-8) provides a theoretically grounded and empirically validated tool for assessing how users evaluate the trustworthiness of their primary LLM. By operationalising eight dimensions (truthfulness, safety, fairness, robustness, privacy, transparency, accountability, and compliance with laws), the scale translates complex socio-technical constructs into user-centred judgements that can be measured from the end-user perspective. This perspective is essential for understanding adoption, reliance, and critical engagement with LLMs as they become embedded across personal, educational, and professional contexts.
The PT-LLM-8 shows good psychometric properties, with high internal consistency and a strong, coherent contribution of all eight dimensions to the perceived trustworthiness construct, suggesting that it can be used as a reliable measure.
As LLMs continue to integrate into every aspect of personal and professional life, the PT-LLM-8 provides researchers, practitioners, and developers with a robust instrument for monitoring perceived trustworthiness. It supports both aggregate measurement and item-level diagnostics, enabling nuanced insight into which facets of trustworthiness matter most in particular settings (see the PT-LLM-8 scale and scoring in Appendix 1).
Electronic Supplementary Material
Below is the link to the electronic supplementary material
References
Afroogh S, Akbari A, Malone E, Kargar M, Alambeigi H. Trust in AI: progress, challenges, and future directions. Humanit Soc Sci Commun. 2024;11:1568. https://doi.org/10.1057/s41599-024-04044-8.
Alarcon GM, Lyons JB, Christensen JC. The effect of propensity to trust and familiarity on perceptions of trustworthiness over time. Personal Individ Differ. 2016;94:309–15. https://doi.org/10.1016/j.paid.2016.01.031.
Alshakhsi S, Almourad MB, Babkir A, Al-Thani D, Yankouskaya A, Montag C, Ali R. Designing AI to foster acceptance: do freedom to choose and social proof impact AI attitudes among British and Arab populations? Behav. Inf Technol. 2025;1–19. https://doi.org/10.1080/0144929X.2025.2477053.
Andreassen R, Bråten I. Teachers' source evaluation self-efficacy predicts their use of relevant source features when evaluating the trustworthiness of web sources on special education. Br J Educ Technol. 2013;44:821–36. https://doi.org/10.1111/j.1467-8535.2012.01366.x.
Bandura A. Social Foundations of Thought and Action: A Social Cognitive Theory. Englewood Cliffs, NJ: Prentice-Hall; 1986.
Bandura A. Self-efficacy: Toward a unifying theory of behavioral change. Psychol Rev. 1977;84:191–215. https://doi.org/10.1037/0033-295X.84.2.191.
Belsley DA, Kuh E, Welsch RE. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley; 2005. https://doi.org/10.1002/0471725153.
Bo D, Ma’rof AA. Understanding User Attitude towards AI Agents: The Roles of Perceived Competence, Trust in Technology, and Social Influence. Int J Acad Res Bus Soc Sci. 2024;14:527–38. https://doi.org/10.6007/IJARBSS/v14-i12/24001.
Cannon WB. The Wisdom of the Body. New York, NY: W. W. Norton & Company, Inc.; 1932.
Cappelleri JC, Jason Lundy J, Hays RD. Overview of Classical Test Theory and Item Response Theory for the Quantitative Assessment of Items in Developing Patient-Reported Outcomes Measures. Clin Ther. 2014;36:648–62. https://doi.org/10.1016/j.clinthera.2014.04.006.
Caprara GV, Vecchione M, Alessandri G, Gerbino M, Barbaranelli C. The contribution of personality traits and self-efficacy beliefs to academic achievement: A longitudinal study: Personality traits, self-efficacy beliefs and academic achievement. Br J Educ Psychol. 2011;81:78–96. https://doi.org/10.1348/2044-8279.002004.
Chaiken S, Ledgerwood A. A theory of heuristic and systematic information processing. In: Handbook of Theories of Social Psychology: Volume 1. London: SAGE Publications Ltd; 2012. pp. 246–66. https://doi.org/10.4135/9781446249215.n13.
Chamorro-Koc M, Peake J, Meek A, Manimont G. Self-efficacy and trust in consumers’ use of health-technologies devices for sports. Heliyon. 2021;7:e07794. https://doi.org/10.1016/j.heliyon.2021.e07794.
Chen FF. Sensitivity of Goodness of Fit Indexes to Lack of Measurement Invariance. Struct Equ Model Multidiscip J. 2007;14:464–504. https://doi.org/10.1080/10705510701301834.
Coleman C, Neuman WR, Dasdan A, Ali S, Shah M. The convergent ethics of AI? Analyzing moral foundation priorities in large language models with a multi-framework approach. 2025. https://doi.org/10.48550/ARXIV.2504.19255.
Colquitt JA, Conlon DE, Wesson MJ, Porter COLH, Ng KY. Justice at the millennium: A meta-analytic review of 25 years of organizational justice research. J Appl Psychol. 2001;86:425–45. https://doi.org/10.1037/0021-9010.86.3.425.
Colquitt JA, Scott BA, LePine JA. Trust, trustworthiness, and trust propensity: A meta-analytic test of their unique relationships with risk taking and job performance. J Appl Psychol. 2007;92:909–27. https://doi.org/10.1037/0021-9010.92.4.909.
Colquitt JA, Scott BA, Rodell JB, Long DM, Zapata CP, Conlon DE, Wesson MJ. Justice at the millennium, a decade later: A meta-analytic test of social exchange and affect-based perspectives. J Appl Psychol. 2013;98:199–236. https://doi.org/10.1037/a0031757.
Costello AB, Osborne J. Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis. Pract Assess Res Eval. 2005;10:1–9. https://doi.org/10.7275/JYJ1-4868.
Culnan MJ, Armstrong PK. Information Privacy Concerns, Procedural Fairness, and Impersonal Trust: An Empirical Investigation. Organ Sci. 1999;10:104–15. https://doi.org/10.1287/orsc.10.1.104.
Cyrus-Lai W, Tierney W, Du Plessis C, Nguyen M, Schaerer M, Giulia Clemente E, Uhlmann EL. Avoiding Bias in the Search for Implicit Bias. Psychol Inq. 2022;33:203–12. https://doi.org/10.1080/1047840X.2022.2106762.
Daly SJ, Wiewiora A, Hearn G. Shifting attitudes and trust in AI: Influences on organizational AI adoption. Technol Forecast Soc Change. 2025;215:124108. https://doi.org/10.1016/j.techfore.2025.124108.
Dang Q, Li G. Unveiling trust in AI: the interplay of antecedents, consequences, and cultural dynamics. AI Soc. 2025. https://doi.org/10.1007/s00146-025-02477-6.
Daronnat S, Azzopardi L, Halvey M, Dubiel M. Inferring Trust From Users’ Behaviours; Agents’ Predictability Positively Affects Trust, Task Performance and Cognitive Load in Human-Agent Real-Time Collaboration. Front Robot AI. 2021;8:642201. https://doi.org/10.3389/frobt.2021.642201.
David S, Hareli S, Hess U. The influence on perceptions of truthfulness of the emotional expressions shown when talking about failure. Eur J Psychol. 2015;11:125–38. https://doi.org/10.5964/ejop.v11i1.877.
De Duro ES, Veltri GA, Golino H, Stella M. 2025. Measuring and identifying factors of individuals’ trust in Large Language Models. https://doi.org/10.48550/ARXIV.2502.21028
De Fine Licht K, Brülde B. On defining reliance and trust: purposes, conditions of adequacy, and new definitions. Philosophia. 2021;49:1981–2001. https://doi.org/10.1007/s11406-021-00339-1.
DeCastellarnau A. A classification of response scale characteristics that affect data quality: a literature review. Qual Quant. 2018;52:1523–59. https://doi.org/10.1007/s11135-017-0533-4.
Di W, Nie Y, Chua BL, Chye S, Teo T. Developing a Single-Item General Self-Efficacy Scale: An Initial Study. J Psychoeduc Assess. 2023;41:583–98. https://doi.org/10.1177/07342829231161884.
Dietvorst BJ, Simmons JP, Massey C. Algorithm aversion: People erroneously avoid algorithms after seeing them err. J Exp Psychol Gen. 2015;144:114–26. https://doi.org/10.1037/xge0000033.
Durán JM, Pozzi G. Trust and Trustworthiness in AI. Philos Technol. 2025;38:16. https://doi.org/10.1007/s13347-025-00843-2.
Earle T, Siegrist M. Trust, Confidence and Cooperation model: a framework for understanding the relation between trust and Risk Perception. Int J Glob Environ Issues. 2008;8:17. https://doi.org/10.1504/IJGENVI.2008.017257.
Edwards MC. An Introduction to Item Response Theory Using the Need for Cognition Scale. Soc Personal Psychol Compass. 2009;3:507–29. https://doi.org/10.1111/j.1751-9004.2009.00194.x.
Ehrhart MG, Ehrhart KH, Roesch SC, Chung-Herrera BG, Nadler K, Bradshaw K. Testing the latent factor structure and construct validity of the Ten-Item Personality Inventory. Personal Individ Differ. 2009;47:900–5. https://doi.org/10.1016/j.paid.2009.07.012.
Evans AM, Revelle W. Survey and behavioral measurements of interpersonal trust. J Res Personal. 2008;42:1585–93. https://doi.org/10.1016/j.jrp.2008.07.011.
Forscher PS, Lai CK, Axt JR, Ebersole CR, Herman M, Devine PG, Nosek BA. A meta-analysis of procedures to change implicit measures. J Pers Soc Psychol. 2019;117:522–59. https://doi.org/10.1037/pspa0000160.
Glickman M, Sharot T. How human–AI feedback loops alter human perceptual, emotional and social judgements. Nat Hum Behav. 2024;9:345–59. https://doi.org/10.1038/s41562-024-02077-2.
Gosling SD, Rentfrow PJ, Swann WB. A very brief measure of the Big-Five personality domains. J Res Personal. 2003;37:504–28. https://doi.org/10.1016/S0092-6566(03)00046-1.
Graziano WG, Eisenberg N. Agreeableness. Handbook of Personality Psychology. Elsevier; 1997. pp. 795–824. https://doi.org/10.1016/B978-012134645-4/50031-7.
Grgic-Hlaca N, Redmiles EM, Gummadi KP, Weller A. 2018. Human Perceptions of Fairness in Algorithmic Decision Making: A Case Study of Criminal Risk Prediction, in: Proceedings of the 2018 World Wide Web Conference - WWW ’18. Presented at the 2018 World Wide Web Conference, ACM Press, Lyon, France, pp. 903–912. https://doi.org/10.1145/3178876.3186138
Groskurth K, Bluemke M, Lechner CM. Why we need to abandon fixed cutoffs for goodness-of-fit indices: An extensive simulation and possible solutions. Behav Res Methods. 2023;56:3891–914. https://doi.org/10.3758/s13428-023-02193-3.
Gustafsson J-E, Åberg-Bengtsson L. Unidimensionality and interpretability of psychological instruments. In: Embretson SE, editor. Measuring Psychological Constructs: Advances in Model-Based Approaches. Washington: American Psychological Association; 2010. pp. 97–121. https://doi.org/10.1037/12074-005.
Hancock PA, Kessler TT, Kaplan AD, Stowers K, Brill JC, Billings DR, Schaefer KE, Szalma JL. How and why humans trust: A meta-analysis and elaborated model. Front Psychol. 2023;14:1081086. https://doi.org/10.3389/fpsyg.2023.1081086.
Harvey K, Laurie G. Proxies of Trustworthiness: A Novel Framework to Support the Performance of Trust in Human Health Research. J Bioethical Inq. 2024;21:625–45. https://doi.org/10.1007/s11673-024-10335-1.
Hassan MU, Iqbal Z, Nazeer W. Technology trust and online purchase behaviour: a multidimensional research model. Int J Bus Forecast Mark Intell. 2019;5:464. https://doi.org/10.1504/IJBFMI.2019.105342.
Hochman G. Beyond the Surface: A New Perspective on Dual-System Theories in Decision-Making. Behav Sci. 2024;14:1028. https://doi.org/10.3390/bs14111028.
Hoff KA, Bashir M. Trust in Automation: Integrating Empirical Evidence on Factors That Influence Trust. Hum. Factors J Hum Factors Ergon Soc. 2015;57:407–34. https://doi.org/10.1177/0018720814547570.
Holland C, Perry G, Neyedli HF. Calibrating trust, reliance and dependence in variable-reliability automation. Proc Hum Factors Ergon Soc Annu Meet. 2024;68:604–10. https://doi.org/10.1177/10711813241277531.
Holthausen BE, Wintersberger P, Walker BN, Riener A. 2020. Situational Trust Scale for Automated Driving (STS-AD): Development and Initial Validation, in: 12th International Conference on Automotive User Interfaces and Interactive Vehicular Applications. Presented at the AutomotiveUI ’20: 12th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, ACM, Virtual Event DC USA, pp. 40–47. https://doi.org/10.1145/3409120.3410637
Huang Y, Sun L, Wang H, Wu S, Zhang Q, Li Y, Gao C, et al. 2024. TrustLLM: Trustworthiness in Large Language Models. https://doi.org/10.48550/ARXIV.2401.05561
Huo W, Zheng G, Yan J, Sun L, Han L. Interacting with medical artificial intelligence: Integrating self-responsibility attribution, human–computer trust, and personality. Comput Hum Behav. 2022;132:107253. https://doi.org/10.1016/j.chb.2022.107253.
Ingrams A, Kaufmann W, Jacobs D. In AI we trust? Citizen perceptions of AI in government decision making. Policy Internet. 2022;14:390–409. https://doi.org/10.1002/poi3.276.
Jacovi A, Marasović A, Miller T, Goldberg Y. 2021. Formalizing Trust in Artificial Intelligence: Prerequisites, Causes and Goals of Human Trust in AI, in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. Presented at the FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, ACM, Virtual Event Canada, pp. 624–635. https://doi.org/10.1145/3442188.3445923
Jebb AT, Ng V, Tay L. A Review of Key Likert Scale Development Advances: 1995–2019. Front Psychol. 2021;12:637547. https://doi.org/10.3389/fpsyg.2021.637547.
Ji J, Chen Y, Jin M, Xu W, Hua W, Zhang Y. 2024. MoralBench: Moral Evaluation of LLMs. https://doi.org/10.48550/ARXIV.2406.04428
Jian J-Y, Bisantz AM, Drury CG. Foundations for an Empirically Determined Scale of Trust in Automated Systems. Int J Cogn Ergon. 2000;4:53–71. https://doi.org/10.1207/S15327566IJCE0401_04.
Kattnig M, Angerschmid A, Reichel T, Kern R. Assessing trustworthy AI: Technical and legal perspectives of fairness in AI. Comput Law Secur Rev. 2024;55:106053. https://doi.org/10.1016/j.clsr.2024.106053.
Kelly S, Kaye S-A, Oviedo-Trespalacios O. What factors contribute to the acceptance of artificial intelligence? A systematic review. Telemat Inf. 2023;77:101925. https://doi.org/10.1016/j.tele.2022.101925.
Kenny DA, Kaniskan B, McCoach DB. The Performance of RMSEA in Models With Small Degrees of Freedom. Sociol Methods Res. 2015;44:486–507. https://doi.org/10.1177/0049124114543236.
Kleizen B, Van Dooren W, Verhoest K, Tan E. Do citizens trust trustworthy artificial intelligence? Experimental evidence on the limits of ethical AI measures in government. Gov Inf Q. 2023;40:101834. https://doi.org/10.1016/j.giq.2023.101834.
Kostick-Quenet KM, Gerke S. AI in the hands of imperfect users. Npj Digit Med. 2022;5:197. https://doi.org/10.1038/s41746-022-00737-z.
Kumar V, Ashraf AR, Nadeem W. AI-powered marketing: What, where, and how? Int J Inf Manag. 2024;77:102783. https://doi.org/10.1016/j.ijinfomgt.2024.102783.
Larzelere RE, Huston TL. The Dyadic Trust Scale: Toward Understanding Interpersonal Trust in Close Relationships. J Marriage Fam. 1980;42:595. https://doi.org/10.2307/351903.
Lee JD, See KA. Trust in Automation: Designing for Appropriate Reliance. Hum. Factors J Hum Factors Ergon Soc. 2004;46:50–80. https://doi.org/10.1518/hfes.46.1.50_30392.
Lee MK. Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management. Big Data Soc. 2018;5:2053951718756684. https://doi.org/10.1177/2053951718756684.
Lee Y, Li JQ. The role of communication transparency and organizational trust in publics’ perceptions, attitudes and social distancing behaviour: A case study of the COVID-19 outbreak. J Contingencies Crisis Manag. 2021;29:368–84. https://doi.org/10.1111/1468-5973.12354.
Levine TR. Truth-Default Theory (TDT): A Theory of Human Deception and Deception Detection. J Lang Soc Psychol. 2014;33:378–92. https://doi.org/10.1177/0261927X14535916.
Li Y, Wu B, Huang Y, Luan S. Developing trustworthy artificial intelligence: insights from research on interpersonal, human-automation, and human-AI trust. Front Psychol. 2024;15:1382693. https://doi.org/10.3389/fpsyg.2024.1382693.
Liu D, Lemmens J, Hong X, Li B, Hao J, Yue Y. A network analysis of internet gaming disorder symptoms. Psychiatry Res. 2022;311:114507. https://doi.org/10.1016/j.psychres.2022.114507.
Malhotra NK, Kim SS, Agarwal J. Internet Users’ Information Privacy Concerns (IUIPC): The Construct, the Scale, and a Causal Model. Inf Syst Res. 2004;15:336–55. https://doi.org/10.1287/isre.1040.0032.
Marsh HW, Wen Z, Hau K-T. Structural Equation Models of Latent Interactions: Evaluation of Alternative Estimation Strategies and Indicator Construction. Psychol Methods. 2004;9:275–300. https://doi.org/10.1037/1082-989X.9.3.275.
Mayer RC, Davis JH, Schoorman FD. An Integrative Model of Organizational Trust. Acad Manage Rev. 1995;20:709–34. https://doi.org/10.2307/258792.
McCrae RR, Costa PT. Personality trait structure as a human universal. Am Psychol. 1997;52:509–16. https://doi.org/10.1037/0003-066X.52.5.509.
McGrath MJ, Lack O, Tisch J, Duenser A. Measuring trust in artificial intelligence: validation of an established scale and its short form. Front Artif Intell. 2025;8:1582880. https://doi.org/10.3389/frai.2025.1582880.
Merritt SM, Ilgen DR. Not All Trust Is Created Equal: Dispositional and History-Based Trust in Human-Automation Interactions. Hum Factors J Hum Factors Ergon Soc. 2008;50:194–210. https://doi.org/10.1518/001872008X288574.
Mitchell T. Trust and Transparency in Artificial Intelligence. Philos Technol. 2025;38:87. https://doi.org/10.1007/s13347-025-00916-2.
Murray SL, Holmes JG, Griffin DW. The self-fulfilling nature of positive illusions in romantic relationships: Love is not blind, but prescient. J Pers Soc Psychol. 1996;71:1155–80. https://doi.org/10.1037/0022-3514.71.6.1155.
Myers S, Everett JAC. People expect artificial moral advisors to be more utilitarian and distrust utilitarian moral advisors. Cognition. 2025;256:106028. https://doi.org/10.1016/j.cognition.2024.106028.
National Institute of Standards and Technology. 2023. Artificial Intelligence Risk Management Framework (AI RMF 1.0).
Neuliep JW. Anxiety/Uncertainty Management (AUM) Theory. In: Kim YY, editor. The International Encyclopedia of Intercultural Communication. Wiley; 2017. pp. 1–9. https://doi.org/10.1002/9781118783665.ieicc0007.
O’Brien RM. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Qual Quant. 2007;41:673–90. https://doi.org/10.1007/s11135-006-9018-6.
OECD. The impact of artificial intelligence on productivity, distribution and growth: key mechanisms, initial evidence and policy challenges. Paris: OECD Publishing; 2024.
Official Journal of the European Union. Regulation (EU) 2024/1689 of the European Parliament and of the Council. 2024.
Vatcheva KP, Lee M, McCormick JB, Rahbar MH. Multicollinearity in regression analyses conducted in epidemiologic studies. Epidemiol Open Access. 2016;6(2):227. https://doi.org/10.4172/2161-1165.1000227.
Parasuraman R, Riley V. Humans and Automation: Use, Misuse, Disuse, Abuse. Hum. Factors J Hum Factors Ergon Soc. 1997;39:230–53. https://doi.org/10.1518/001872097778543886.
Paunonen SV. Big Five factors of personality and replicated predictions of behavior. J Pers Soc Psychol. 2003;84:411–24.
Rahman MM, Babiker A, Ali R. Motivation, concerns, attitudes towards AI: differences by gender, age, and culture. In: Barhamgi M, Wang H, Wang X, editors. Web Information Systems Engineering – WISE 2024. Lecture Notes in Computer Science. Singapore: Springer Nature Singapore; 2024. pp. 375–91. https://doi.org/10.1007/978-981-96-0573-6_28.
Reinhardt K. Trust and trustworthiness in AI ethics. AI Ethics. 2023;3:735–44. https://doi.org/10.1007/s43681-022-00200-5.
Rempel JK, Holmes JG, Zanna MP. Trust in close relationships. J Pers Soc Psychol. 1985;49:95–112. https://doi.org/10.1037/0022-3514.49.1.95.
Revilla MA, Saris WE, Krosnick JA. Choosing the Number of Categories in Agree–Disagree Scales. Sociol Methods Res. 2014;43:73–97. https://doi.org/10.1177/0049124113509605.
Robins RW, Hendin HM, Trzesniewski KH. Measuring Global Self-Esteem: Construct Validation of a Single-Item Measure and the Rosenberg Self-Esteem Scale. Pers Soc Psychol Bull. 2001;27:151–61. https://doi.org/10.1177/0146167201272002.
Robinson MD, Irvin RL, Asad MR, Fereidouni H. Neuroticism’s link to threat sensitivity: Evidence from a dynamic affect reactivity task. Emotion. 2025;25:884–95. https://doi.org/10.1037/emo0001462.
Roesler E, Vollmann M, Manzey D, Onnasch L. The dynamics of human–robot trust attitude and behavior — Exploring the effects of anthropomorphism and type of failure. Comput Hum Behav. 2024;150:108008. https://doi.org/10.1016/j.chb.2023.108008.
Rosenberg M. 2011. Rosenberg Self-Esteem Scale. https://doi.org/10.1037/t01038-000
Roski J, Maier EJ, Vigilante K, Kane EA, Matheny ME. Enhancing trust in AI through industry self-governance. J Am Med Inf Assoc. 2021;28:1582–90. https://doi.org/10.1093/jamia/ocab065.
Sarker IH. LLM potentiality and awareness: a position paper from the perspective of trustworthy and responsible AI modeling. Discov Artif Intell. 2024;4:40. https://doi.org/10.1007/s44163-024-00129-0.
Schäfer A, Esterbauer R, Kubicek B. Trusting robots: a relational trust definition based on human intentionality. Humanit Soc Sci Commun. 2024;11:1412. https://doi.org/10.1057/s41599-024-03897-3.
Scharowski N, Benk M, Kühne SJ, Wettstein L, Brühlmann F. 2023. Certification Labels for Trustworthy AI: Insights From an Empirical Mixed-Method Study, in: 2023 ACM Conference on Fairness Accountability and Transparency. Presented at the FAccT ’23: the 2023 ACM Conference on Fairness, Accountability, and Transparency, ACM, Chicago IL USA, pp. 248–260. https://doi.org/10.1145/3593013.3593994
Scharowski N, Perrig SAC, Aeschbach LF, von Felten N, Opwis K, Wintersberger P, Brühlmann F. 2024. To Trust or Distrust Trust Measures: Validating Questionnaires for Trust in AI. https://doi.org/10.48550/ARXIV.2403.00582
Schermelleh-Engel K, Moosbrugger H, Müller H. Evaluating the fit of structural equation models: tests of significance and descriptive goodness-of-fit measures. Methods Psychol Res Online. 2003;8(2):23–74.
Schlicker N, Baum K, Uhde A, Sterz S, Hirsch MC, Langer M. How do we assess the trustworthiness of AI? Introducing the trustworthiness assessment model (TrAM). Comput Hum Behav. 2025;170:108671. https://doi.org/10.1016/j.chb.2025.108671.
Schwerter F, Zimmermann F. Determinants of trust: The role of personal experiences. Games Econ Behav. 2020;122:413–25. https://doi.org/10.1016/j.geb.2020.05.002.
Sorin V, Brin D, Barash Y, Konen E, Charney A, Nadkarni G, Klang E. Large language models and empathy: systematic review. J Med Internet Res. 2024;26:e52597. https://doi.org/10.2196/52597.
Stanley DJ, Meyer JP, Topolnytsky L. Employee Cynicism and Resistance to Organizational Change. J Bus Psychol. 2005;19:429–59. https://doi.org/10.1007/s10869-005-4518-2.
Syropoulos S, Leidner B, Mercado E, Li M, Cros S, Gómez A, Baka A, Chekroun P, Rottman J. How safe are we? Introducing the multidimensional model of perceived personal safety. Personal Individ Differ. 2024;224:112640. https://doi.org/10.1016/j.paid.2024.112640.
Tabachnick BG, Fidell LS, Ullman JB. Using multivariate statistics. 7th ed. New York: Pearson; 2019.
Taillandier P, Zucker JD, Grignard A, Gaudou B, Huynh NQ, Drogoul A. 2025. Integrating LLM in Agent-Based Social Simulation: Opportunities and Challenges. https://doi.org/10.48550/ARXIV.2507.19364
Tao Y, Viberg O, Baker RS, Kizilcec RF. Cultural bias and cultural alignment of large language models. PNAS Nexus. 2024;3:pgae346. https://doi.org/10.1093/pnasnexus/pgae346.
Tetlock PE. Social functionalist frameworks for judgment and choice: Intuitive politicians, theologians, and prosecutors. Psychol Rev. 2002;109:451–71. https://doi.org/10.1037/0033-295X.109.3.451.
Thielmann I, Hilbig BE. Trust: An Integrative Review from a Person–Situation Perspective. Rev Gen Psychol. 2015;19:249–77. https://doi.org/10.1037/gpr0000046.
Van Der Biest M, Verschooren S, Verbruggen F, Brass M. Perceptual judgments are resistant to the advisor’s perceived level of trustworthiness: A deep fake approach. PLoS ONE. 2025;20:e0319039. https://doi.org/10.1371/journal.pone.0319039.
Wang L, Song M, Rezapour R, Kwon BC, Huh-Yoo J. 2023. People’s Perceptions Toward Bias and Related Concepts in Large Language Models: A Systematic Review. https://doi.org/10.48550/ARXIV.2309.14504
Weiner B. An attributional theory of achievement motivation and emotion. Psychol Rev. 1985;92:548–73.
Wester J, De Jong S, Pohl H, Van Berkel N. Exploring people’s perceptions of LLM-generated advice. Comput Hum Behav Artif Hum. 2024;2:100072. https://doi.org/10.1016/j.chbah.2024.100072.
Wheeless LR, Grotz J. The measurement of trust and its relationship to self-disclosure. Hum Commun Res. 1977;3:250–7. https://doi.org/10.1111/j.1468-2958.1977.tb00523.x.
Wojton HM, Porter D, Lane ST, Bieber C, Madhavan P. Initial validation of the trust of automated systems test (TOAST). J Soc Psychol. 2020;160:735–50. https://doi.org/10.1080/00224545.2020.1749020.
Xie Y, Zhou R, Chan AHS, Jin M, Qu M. Motivation to interaction media: The impact of automation trust and self-determination theory on intention to use the new interaction technology in autonomous vehicles. Front Psychol. 2023;14:1078438. https://doi.org/10.3389/fpsyg.2023.1078438.
Xu H, Teo H-H, Tan BCY, Agarwal R. The Role of Push-Pull Technology in Privacy Calculus: The Case of Location-Based Services. J Manag Inf Syst. 2009;26:135–74. https://doi.org/10.2753/MIS0742-1222260305.
Xu S, Khan KI, Shahzad MF. Examining the influence of technological self-efficacy, perceived trust, security, and electronic word of mouth on ICT usage in the education sector. Sci Rep. 2024;14:16196. https://doi.org/10.1038/s41598-024-66689-4.
Yang Q, Van Den Bos K, Li Y. Intolerance of uncertainty, future time perspective, and self-control. Personal Individ Differ. 2021;177:110810. https://doi.org/10.1016/j.paid.2021.110810.
Yoganathan V, Osburg V-S, Janakiraman N. Lending Legitimacy to Corporate Digital Responsibility: Trust in Firm Versus Government Regulation of Artificial Intelligence Services. J Serv Res. 2025;10946705251345097. https://doi.org/10.1177/10946705251345097.
Zhang B, Wang A, Ye Y, Liu J, Lin L. The Relationship between Meaning in Life and Mental Health in Chinese Undergraduates: The Mediating Roles of Self-Esteem and Interpersonal Trust. Behav Sci. 2024;14:720. https://doi.org/10.3390/bs14080720.
Zhang M. Assessing Two Dimensions of Interpersonal Trust: Other-Focused Trust and Propensity to Trust. Front Psychol. 2021;12:654735. https://doi.org/10.3389/fpsyg.2021.654735.
Zhang X, Lyu X, Du Z, Chen Q, Zhang D, Hu H, Tan C, Zhao T, Wang Y, Zhang B, Lu H, Zhou Y, Qiu X. 2024. IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities. https://doi.org/10.48550/ARXIV.2410.08035
Zhou J, Hu M, Li J, Zhang X, Wu X, King I, Meng H. 2024. Rethinking Machine Ethics – Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? in: Findings of the Association for Computational Linguistics: NAACL 2024. Presented at the Findings of the Association for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico City, Mexico, pp. 2227–2242. https://doi.org/10.18653/v1/2024.findings-naacl.144
Appendix 1
PERCEIVED TRUSTWORTHINESS IN LLMs SCALE (PT-LLM-8)
Thinking specifically about the large language model (LLM) you use most often (such as ChatGPT, DeepSeek, Gemini, Copilot, etc.), please indicate how much you agree or disagree with each of the following statements using a scale from 0 to 10, where 0 means “not at all” and 10 means “completely.”
Scale items
All items complete the stem “I trust my primary LLM will…”. Each item is listed below with its corresponding dimension and a description of what it measures.

1. …provide information that is factually accurate and reliable.
Dimension: Truthfulness. Measures the extent to which users believe that an LLM provides accurate, reliable, and non-misleading information.
2. …handle my personal data securely and confidentially.
Dimension: Privacy. Measures the extent to which users feel confident that an LLM respects their privacy and protects sensitive or personal data from misuse.
3. …provide unbiased and fair responses.
Dimension: Fairness. Measures the extent to which users perceive that an LLM gives responses that are impartial and free from bias or discrimination.
4. …avoid generating harmful or dangerous content.
Dimension: Safety. Measures the extent to which users feel assured that an LLM avoids producing unsafe, harmful, or inappropriate outputs.
5. …perform well across different topics and situations.
Dimension: Robustness. Measures the extent to which users perceive that an LLM can perform consistently and reliably across a wide range of topics and contexts.
6. …provide clear explanations about how its responses are generated.
Dimension: Transparency. Measures the extent to which users perceive that an LLM makes its outputs understandable by providing clear and accessible explanations.
7. …use my feedback to address errors and biases effectively.
Dimension: Accountability. Measures the extent to which users believe that developers of LLMs take responsibility for errors and use feedback to improve system reliability and fairness.
8. …provide responses that comply with relevant laws and regulations.
Dimension: Regulations and laws. Measures the extent to which users feel confident that an LLM’s outputs align with legal and regulatory standards.
Scoring
Each item is scored directly from 0 to 10. A total score is obtained by summing the eight responses, producing a possible range from 0 to 80. Higher scores represent greater perceived trustworthiness of the respondent’s primary LLM. While the overall score provides a general measure, item-level responses can also be examined to give insight into specific dimensions of perceived trustworthiness.
The PT-LLM-8 is best interpreted as a single construct, and analyses should therefore focus on the total score rather than treating items as separate dimensions.
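For readers scoring the scale programmatically, a minimal sketch in Python (using pandas) is given below. The item column names are illustrative assumptions, not part of the published scale; the only details taken from the scale itself are the 0–10 response range, the absence of reverse-scored items, and the summed 0–80 total.

import pandas as pd

# Illustrative (hypothetical) column names for the eight items, in the
# order listed above; rename to match your own dataset.
PT_LLM8_ITEMS = ["truthfulness", "privacy", "fairness", "safety",
                 "robustness", "transparency", "accountability", "compliance"]

def score_pt_llm8(responses: pd.DataFrame) -> pd.Series:
    """Return each respondent's PT-LLM-8 total (0-80): the sum of eight 0-10 items."""
    items = responses[PT_LLM8_ITEMS]
    # Guard against out-of-range responses before summing.
    if ((items < 0) | (items > 10)).any().any():
        raise ValueError("PT-LLM-8 items must lie within the 0-10 range")
    return items.sum(axis=1)

# Example with two respondents, yielding totals of 56 and 27.
df = pd.DataFrame([[8, 7, 6, 9, 7, 5, 6, 8],
                   [3, 4, 2, 5, 4, 3, 2, 4]],
                  columns=PT_LLM8_ITEMS)
print(score_pt_llm8(df))

Because no items are reverse-scored, the total is a simple row sum; item-level columns can still be inspected individually when insight into specific dimensions is needed.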