Title Page

a) Type of article: Original article

b) Manuscript Title: Mixed Methods Assessment of ChatGPT Accuracy and Reliability in Healthcare Queries by Medical Residents.

c) Authors/Contributors:

Dr. Umesh Chhotala, MBBS, Junior Resident

Dr. Niraj Bharadva, Professor & HOD (1)

Dr. Prashant Dave

Dr. Pranav Kshtriya, MPH, Assistant Professor & PhD Scholar (2); Email: pranav.kshtriya34339@paruluniversity.ac.in

Dr. Utkarsh Shah, PhD Scholar (3)

(1) Department of Community Medicine, Parul Institute of Medical Sciences & Research, Parul University, India

(2) Parul Institute of Public Health, Parul University, India

(3) Parul Institute of Public Health, Faculty of Medicine, Parul University, Vadodara, India

1) Dr. Umesh Chhotala, MBBS Junior Resident, Department of Community Medicine, Parul Institute of Medical Sciences & Research, Parul University, India

2) Dr. Niraj Bharadva, MD Professor & HOD, Department of Community Medicine, Parul Institute of Medical Sciences & Research, Parul University, India

3) Dr. Prashant Dave, MD Associate Professor, Department of Community Medicine, Parul Institute of Medical Sciences & Research, Parul University, India

4) Dr. Pranav Kshtriya, MPH, Assistant Professor & PHD scholar, Parul Institute of Public Health, Parul University, India

5) Dr. Utkarsh Shah, MD Professor, Department of Community Medicine, Parul Institute of Medical Sciences & Research, Parul University, India

d) Corresponding Author:

Dr. Pranav Kshtriya (Corresponding Author)

Assistant Professor & PhD Scholar,

Parul Institute of Public Health, Faculty of Medicine,

Parul University, Vadodara, India

Contact No: 9106003963

Email: pranav.kshtriya34339@paruluniversity.ac.in

Orcid Id: http://orcid.org/0000-0001-7261-7964

Title

Mixed Methods Assessment of ChatGPT Accuracy and Reliability in Healthcare Queries by Medical Residents.

Abstract

Background

Artificial intelligence (AI) is gaining traction in healthcare, with ChatGPT used for medical information retrieval. However, there is limited real-world evidence on how medical professionals in regions like India perceive and use these tools.

Objective

This study evaluates the accuracy and reliability of ChatGPT in addressing healthcare-related queries while exploring its advantages and limitations in medical practice.

Methods

A mixed-methods study was conducted with 34 residents from 17 specialties at a medical college. Participants rated ChatGPT’s responses to five specialty-specific questions using a six-point Likert scale. Intra-rater reliability was tested by repeating queries after 10–15 days; inter-rater reliability was assessed via peer assessments within specialties. Additionally, semi-structured interviews with 7 residents explored the perceived benefits and limitations of ChatGPT in clinical settings. The study was conducted in December 2024. Clinical trial number is not applicable.

Results

The participation response rate was 33.3% (34 out of 102 resident doctors). Of the 170 medical queries assessed, ChatGPT’s responses had a median accuracy score of 5.5, between “almost completely correct” and “completely correct.” Binary questions scored slightly higher (median 6) than descriptive ones (median 5). Reliability was strong, with an intra-rater Intraclass Correlation Coefficient (ICC) of 0.82 and an inter-rater ICC of 0.79. Qualitative findings highlighted two themes: ChatGPT's utility in clinical research, diagnosis, and education; and ethical concerns, including medico-legal risks, occasional inaccuracies, limited handling of complex cases, and risk of over-reliance.

Conclusion

ChatGPT shows high accuracy and reliability with healthcare queries, especially factual ones, but its ethical concerns and limitations require ongoing human oversight. Continued improvement of AI tools is needed for safer, wider clinical use.

Keywords:

Artificial Intelligence in Medicine

ChatGPT

Medical Education

Healthcare Technology

Mixed-Methods Study

Introduction

Artificial intelligence (AI) has rapidly transformed various industries, and healthcare is no exception. AI-based large language models (LLMs) like ChatGPT, developed by OpenAI, represent a significant leap forward in natural language processing (NLP). Trained on extensive datasets, ChatGPT can comprehend and generate contextually relevant and coherent text, enabling its use in medical education, clinical decision-making, and patient engagement [1, 2]. This paper explores the applications of ChatGPT in healthcare, its benefits, challenges, and potential future directions.

Applications of ChatGPT in Healthcare

1. Medical Education: ChatGPT has proven to be an effective tool for medical students and professionals by providing real-time answers to medical queries, assisting in exam preparation, and promoting interactive learning [3, 4]. Its performance on the United States Medical Licensing Examination (USMLE) highlights its capability to comprehend and solve complex medical problems at a near-passing level without specialized training. By simulating clinical reasoning scenarios, it aids students in honing their decision-making skills [5, 6].

2. Clinical Decision Support: ChatGPT can assist clinicians by analysing vast amounts of medical data and offering evidence-based recommendations [7]. For instance, it supports differential diagnosis and treatment planning by synthesizing information from patient histories, lab results, and clinical guidelines. However, its utility is currently limited to low-risk scenarios, as errors in high-stakes decisions could lead to severe consequences [8, 9].

3. Streamlining Healthcare Workflows: In administrative tasks, ChatGPT contributes to drafting medical reports, summarizing patient records, and automating documentation processes, reducing clinician burnout and improving efficiency [10]. Additionally, its integration with electronic health records (EHRs) can enhance workflow automation [11].

4. Patient Engagement and Health Literacy: AI chatbots like ChatGPT empower patients by providing easily accessible medical information, thereby enhancing health literacy. Patients can use ChatGPT for initial symptom analysis, medication guidance, or understanding complex diagnoses [12, 13]. However, ensuring the accuracy of the information provided is critical to prevent misinformation [14].

Challenges in Adopting ChatGPT for Healthcare

1. Accuracy and Reliability: Despite its impressive performance, ChatGPT is prone to generating plausible but incorrect information, known as hallucinations. This limitation raises concerns about its reliability in real-world medical applications [15, 16].

2. Ethical and Legal Concerns: The use of AI in healthcare introduces ethical dilemmas, including patient privacy, data security, and accountability for errors [17]. Additionally, ChatGPT's responses are not always transparent, making it challenging to assess the validity of its recommendations [18].

3. Bias and Equity: ChatGPT inherits biases present in its training data, potentially leading to unequal treatment or misrepresentation of minority populations in healthcare [19].

Future Directions

1. Enhancing Model Accuracy: Continuous refinement of training datasets and incorporating domain-specific knowledge can improve ChatGPT's accuracy and reliability in healthcare settings [20].

2. Regulatory Frameworks: Developing clear regulatory guidelines and ethical standards is essential for the responsible deployment of AI in healthcare [21].

3. Human-AI Collaboration: Instead of replacing clinicians, ChatGPT should complement their expertise, serving as an assistive tool to enhance decision-making and patient care [22].

Objectives:

1. To evaluate the accuracy and reliability of the ChatGPT model in addressing healthcare-related queries.

2. To explore the advantages and limitations of ChatGPT's application in Healthcare.

Methods: The study was conducted in December 2024. Clinical trial number is not applicable.

Study Design: This mixed-methods study used both quantitative and qualitative approaches. The quantitative part involved a descriptive study to assess the accuracy and reliability of AI-generated responses to specialty-based medical questions. The qualitative part involved in-depth interviews to evaluate the pros and cons of ChatGPT in healthcare.

Quantitative Part: Descriptive Study

Study Participants: Postgraduate (PG) residents from 17 different medical specialties at Parul University (e.g., Medicine, Dermatology, Paediatrics).

Sample size: 34 resident participants, divided into 17 teams; each team consisted of 2 members representing the same specialty.

Two resident participants from each specialty were recruited, yielding a 50% response rate (34 of the 68 residents originally invited).

Study site: A medical college affiliated with a private university in Gujarat.

Data Collection Procedure: Question Generation: Each participant in a team was tasked with creating a set of 5 medical questions based on their field of specialty, ensuring the content was relevant, evidence-based, and unaltered since the start of 2024. Question Types: 2 binary questions (e.g., yes/no or right/wrong) and 3 descriptive questions requiring detailed, explanatory responses. Guidance: Participants were instructed not to pre-check the answers in ChatGPT, to prevent any unintentional bias.

The questions were chosen based on the participants' area of expertise and designed to cover uncontroversial, guideline-driven clinical content.

AI-Generated Responses: Question Input: The questions were fed into ChatGPT by a designated investigator. AI Responses: The answers provided by ChatGPT were shared with the participants who created the questions for evaluation. Each participant received responses to the five questions they authored.

Accuracy Evaluation: The participants assessed the AI-generated answers based on their own medical knowledge using a six-point Likert scale: 1 – Completely incorrect, 2 – More incorrect than correct, 3 – Approximately equal correct and incorrect, 4 – More correct than incorrect, 5 – Nearly all correct, 6 – Completely correct

Statistics: Reliability Assessment

1) Intra-rater Reliability: To evaluate the consistency of individual participants in assessing the AI-generated answers over time, ChatGPT was re-queried with the same set of questions 10 to 15 days after the initial query. The same participants who initially provided the questions were asked to re-evaluate the answers generated by ChatGPT using the same six-point Likert scale. This allowed for assessment of intra-rater reliability, checking whether a participant’s judgment about the accuracy of AI responses remained consistent over time.

2) Inter-rater Reliability: To assess the consistency between different participants from the same specialty in evaluating AI-generated answers. The re-scored answers were provided not to the same participant who originally authored the questions, but to their team partner (from the same specialty). For example, the answers to Participant A’s questions were given to Participant B for evaluation, and vice versa. Both team members from the same specialty used the same six-point Likert scale to evaluate the accuracy of the AI-generated responses from their partner’s questions. This provided an assessment of inter-rater reliability, testing whether participants from the same specialty consistently evaluated AI responses in a similar manner.
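As an illustration of how paired ratings of this kind can be turned into an intraclass correlation coefficient, the following is a minimal sketch. The manuscript does not state which ICC form or software was used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single rater, in the Shrout and Fleiss notation) is an assumption, and the function name `icc_2_1` and the example scores are hypothetical.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated ChatGPT response; each row
    holds one Likert score per rater (inter-rater) or per rating occasion
    (intra-rater). The ICC form is an assumption, not taken from the study.
    """
    n = len(ratings)        # number of rated responses
    k = len(ratings[0])     # number of raters / occasions
    if n < 2 or k < 2:
        raise ValueError("need at least 2 responses and 2 ratings per response")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Raises ZeroDivisionError if all scores are identical (no variance).
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical scores: first rating vs. rating repeated 10-15 days later.
first_pass = [6, 5, 6, 4, 5, 3]
second_pass = [6, 5, 5, 4, 6, 3]
intra_rater_icc = icc_2_1(list(zip(first_pass, second_pass)))
```

The same function can be applied to inter-rater data by pairing the author's score with the team partner's score for each response. With ordinal six-point scores, a weighted kappa is a common alternative; which measure suits the data is a design choice the sketch does not settle.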

Accuracy Analysis: The scoring results from the Likert scale were analysed using the following descriptive statistics:

- Mean: To calculate the average score for AI-generated responses.

- Standard Deviation (SD): To evaluate the degree of variation in the scores.

The ratings were then classified into accuracy categories:

Scores 5–6: Highly accurate

Scores 3–4: Moderately accurate (or acceptable accuracy)

Scores 1–2: Inaccurate

This classification can help in analysing the proportion of responses that fall into different accuracy categories, indicating overall accuracy levels.

Reliability Analysis: Intra-rater reliability: Assessed using statistical measures like Cohen’s kappa or intraclass correlation coefficient (ICC) to evaluate consistency within the same rater over time. Inter-rater reliability: Assessed using the intraclass correlation coefficient (ICC) to measure the agreement between different participants within the same specialty.
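The descriptive statistics and accuracy categories listed under Accuracy Analysis can be sketched in a few lines. This is only an illustration using the Python standard library: the function names and the example scores are hypothetical, and the median is included because the Results report it alongside the mean.

```python
from collections import Counter
from statistics import mean, median, stdev


def accuracy_band(score):
    """Map a six-point Likert score to the accuracy category used in the Methods."""
    if score >= 5:
        return "Highly accurate"
    if score >= 3:
        return "Moderately accurate"
    return "Inaccurate"


def summarise(scores):
    """Return mean, SD, median and the share of responses in each accuracy band."""
    bands = Counter(accuracy_band(s) for s in scores)
    n = len(scores)
    return {
        "mean": mean(scores),
        "sd": stdev(scores),          # sample standard deviation
        "median": median(scores),
        "band_share": {band: count / n for band, count in bands.items()},
    }


# Hypothetical ratings, not study data.
example_scores = [6, 5, 6, 4, 2, 5, 6, 3]
summary = summarise(example_scores)
```

Because Likert scores are ordinal and usually skewed towards the top of the scale, the median and the band proportions tend to describe the ratings more faithfully than the mean alone.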

Qualitative Part: In-Depth Interviews

Study Participants: Five participants were purposively selected from the pool of 34 participants who participated in the quantitative part. This sample was expanded with an additional three interviews until saturation was reached, for a total of 7 participants.

Data Collection Procedure Interviews: Face-to-face, in-depth interviews were conducted using a structured interview guide designed to explore participants' perceptions of ChatGPT in healthcare. Topics covered included both the potential benefits and drawbacks of AI in clinical settings. Recording and Transcription: All interviews were recorded and transcribed verbatim for subsequent analysis. Data Saturation: The interviews continued until no new information or themes emerged (data saturation). Three additional interviews were conducted to ensure saturation.

Thematic Analysis: Transcripts were analysed using thematic analysis, a method that involves identifying and interpreting patterns or themes within the qualitative data.

Steps:

1. Familiarization with the data by reading and re-reading the transcripts.

2. Generating initial codes based on the participants’ responses.

3. Searching for themes by clustering related codes.

4. Reviewing and refining the themes.

5. Defining and naming the themes to represent key insights on ChatGPT’s role in healthcare.

Outcome: Thematic analysis identified major themes surrounding the benefits and challenges of using AI tools like ChatGPT in medical practice.

Ethical Considerations: Informed Consent: All participants provided informed consent before participating in the study. Confidentiality: Participant data was anonymized to ensure confidentiality.

Ethics Approval: Ethical permission to conduct the study was obtained from the Parul University Institutional Ethics Committee for Human Research (PU-IECHR) (PUIECHR/PIMSR/00/081734), ensuring that the research adhered to ethical standards.

All methods were performed in accordance with the relevant guidelines and regulations of the Parul University Institutional Ethics Committee for Human Research (PU-IECHR), in line with the ethical standards set forth in the Declaration of Helsinki.

Results:

We had a total of 102 resident doctors across 17 different medical specialties at Parul University — including Anatomy, Physiology, Biochemistry, Pharmacology, Microbiology, Forensic Medicine, Community Medicine (PSM), Ophthalmology, ENT, Surgery, Medicine, Obstetrics & Gynaecology, Paediatrics, Orthopaedics, Radiology, Psychiatry, and Dermatology. From each specialty, 2 resident doctors were selected, making a total of 34 participants, who were then grouped into 17 teams (one team per specialty, each with 2 members from the same department). The participation response rate was 33.3% (34 out of 102 resident doctors).
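The participant and question counts reported in this manuscript can be checked with simple arithmetic. The worked figures below use only numbers stated in the text; the split into 68 binary and 102 descriptive questions is derived under the assumption that every participant submitted exactly 2 binary and 3 descriptive questions. Note that the two response rates in the manuscript use different denominators: the 68 residents invited and the 102 residents in total.

```latex
\[
\begin{aligned}
17 \text{ specialties} \times 2 \text{ residents} &= 34 \text{ participants},\\
\tfrac{34}{68} &= 50\% \quad \text{(of residents invited)},\\
\tfrac{34}{102} &\approx 33.3\% \quad \text{(of all residents)},\\
34 \times 5 &= 170 \text{ questions},\\
34 \times 2 = 68 \text{ binary}, &\qquad 34 \times 3 = 102 \text{ descriptive}.
\end{aligned}
\]
```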

Quantitative Results: Accuracy of AI-Generated Responses: A total of 170 questions were evaluated across all 34 participants, comprising binary and descriptive questions. The overall results of the accuracy ratings are summarized below: Overall Accuracy: The median accuracy score across all questions was 5.5 (indicating responses between “almost completely correct” and “completely correct”). The mean accuracy score was 4.8, suggesting that the AI responses were rated between “mostly correct” and “almost completely correct” on the six-point Likert scale. (Fig. 1)

Fig. 1

Distribution of 170 questions scored on a Likert scale (1 = completely incorrect to 6 = completely correct) highlighting the frequency of responses across each score

Accuracy by Question Difficulty: Questions were categorized as easy, medium, or hard based on the participants' assessment. For easy questions, the median accuracy score was 6 (completely correct), with a mean score of 5.0. For medium questions, the median accuracy score was 5.5 (almost completely correct), with a mean score of 4.7. For hard questions, the median accuracy score was 5 (mostly correct), with a mean score of 4.6. These results indicate that ChatGPT performed better on easy questions, with slight declines in accuracy as the difficulty level increased. (Fig. 2)

Fig. 2

Pie chart representation of 170 questions scored on a Likert scale (1 = completely incorrect to 6 = completely correct), showing the percentage distribution of responses across each score

Accuracy by Question Type: Binary questions: The median accuracy score was 6 (completely correct), and the mean score was 4.9. Descriptive questions: The median accuracy score was 5 (mostly correct), and the mean score was 4.7. The similarity in scores suggests that ChatGPT provided comparable levels of accuracy for both binary and descriptive questions. (Fig. 3)
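A per-type comparison of this kind amounts to grouping scores by question type and summarising each group. The sketch below uses hypothetical records (not study data) and illustrative names; it also shows how a few low ratings can pull the mean well below the median, which is one possible reason a group can have a median of 6 and a mean below 5.

```python
from collections import defaultdict
from statistics import mean, median


def summarise_by_type(records):
    """Group (question_type, score) pairs and report n, median and mean per type."""
    grouped = defaultdict(list)
    for question_type, score in records:
        grouped[question_type].append(score)
    return {
        question_type: {
            "n": len(scores),
            "median": median(scores),
            "mean": mean(scores),
        }
        for question_type, scores in grouped.items()
    }


# Hypothetical (question type, Likert score) records for illustration only.
records = [
    ("binary", 6), ("binary", 6), ("binary", 6), ("binary", 1),
    ("descriptive", 5), ("descriptive", 5), ("descriptive", 6),
    ("descriptive", 4), ("descriptive", 3), ("descriptive", 5),
]
by_type = summarise_by_type(records)
```

In this toy example a single score of 1 among the binary questions leaves the median at the top of the scale while lowering the mean, so reporting both statistics gives a fuller picture than either alone.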

Fig. 3

Median authenticity scores for binary and descriptive question types, with binary questions scoring a median of 6 and descriptive questions scoring a median of 5

Reliability of AI-Generated Responses: A) Intra-rater Reliability: Consistency in individual participants’ ratings of AI-generated responses over time (10–15 days apart) was strong. The Intraclass Correlation Coefficient (ICC) for intra-rater reliability was calculated at 0.82, indicating substantial agreement between the initial and repeat assessments of AI-generated answers. B) Inter-rater Reliability: Inter-rater reliability, where different participants from the same specialty evaluated the same set of responses, also showed strong agreement. The Intraclass Correlation Coefficient (ICC) was 0.79, confirming that participants from the same specialty provided similar accuracy scores for the AI responses. (Fig. 4)

Statistical analysis using the Z-score (Z = 0.28) showed no significant difference in the ratings provided by different raters, further indicating strong inter-rater reliability. ChatGPT's answers were generally rated as highly accurate, particularly for easier and binary questions. The AI demonstrated good reliability, with consistent evaluations both over time (intra-rater reliability) and across different participants (inter-rater reliability).

Fig. 4

Median Scores of Three Types of Observations in the Study: Rightness Assessment, Intra-Rater Observation, and Inter-Rater Observation, All Showing a Median Score of 5

Qualitative Results

A total of 7 in-depth interviews were conducted with participants to explore their perceptions of ChatGPT’s use in healthcare. Thematic analysis identified two major themes and several subthemes.

Advantages of ChatGPT in Healthcare: Assistance in Medical Tasks: Participants highlighted that ChatGPT can be a useful tool in assisting with various medical tasks, including:

Diagnosis and Treatment: Some participants found ChatGPT to be a supportive tool for initial diagnostics and treatment suggestions, particularly in complex cases where it provided a second opinion.

Medical Education: ChatGPT is seen as valuable in training medical students, assisting them in understanding complex medical topics, and acting as a study aid.

Limitations and Ethical Considerations:

Medico-legal Implications: Several participants raised concerns about the potential for medico-legal complications. There was apprehension about relying too heavily on AI-generated suggestions in clinical practice, as it may lead to malpractice if the AI gives incorrect advice.

Inaccurate or Harmful Responses: Participants noted that while ChatGPT generally provided accurate answers, it occasionally generated incorrect, biased, or even potentially harmful suggestions. This raises concerns about using it in direct patient care without careful supervision.

AI’s Knowledge and Performance Limitations:

Limited Understanding of Context: Participants noted that ChatGPT struggled with understanding the target audience, and at times, it generated answers that were either too generic or lacked the specificity required for clinical decision-making.

Knowledge Cut-off: The AI's knowledge base is limited to information available up to the end of 2024, which means it may miss the most recent guidelines, protocols, and innovations in healthcare.

Lack of Clarity and Specificity: In some instances, ChatGPT’s responses were found to lack clarity, particularly for more nuanced or complex medical cases.

Subthemes Identified:

Unpredictable Performance: ChatGPT’s performance varied across question types, with certain descriptive answers providing less value than binary answers.

Dependency Risk: Participants were concerned about over-reliance on AI tools by younger healthcare professionals, which may reduce critical thinking and clinical decision-making skills over time.

ChatGPT can be a valuable aid in research, diagnosis, treatment planning, and education. It offers speed and convenience, especially for repetitive or time-consuming tasks. Ethical and performance-related concerns need to be addressed before fully integrating AI into healthcare. ChatGPT's limitations in accuracy for complex cases and its potential for producing biased or harmful information warrant careful supervision by medical professionals.

Discussion:

The results of this study indicate that ChatGPT demonstrates substantial accuracy and reliability when responding to medical questions, particularly for binary and guideline-based queries. These findings are consistent with other studies that have evaluated the performance of AI in healthcare.

Accuracy of ChatGPT: The overall median accuracy score of 5.5 in this study aligns with prior research that has shown AI models like ChatGPT to be highly accurate in responding to straightforward, fact-based medical questions. In a similar study, AI systems achieved high levels of accuracy in correctly answering clinical decision-making questions, particularly in areas involving evidence-based guidelines [21]. Furthermore, the strong performance on binary questions in our study mirrors findings from Johnson et al., where AI models were noted to excel in cases where the questions had clear, unambiguous answers [23, 24]. However, the reduced accuracy observed for more complex or descriptive questions is also consistent with previous studies. For example, Rao et al. found that while AI systems could handle factual questions well, their performance declined when tasked with more complex diagnostic queries requiring nuanced clinical reasoning [24]. This suggests that while AI can be a valuable tool in routine medical tasks, its use in more intricate clinical scenarios may still require significant human oversight.

Reliability of AI Responses: The high intra-rater (ICC = 0.82) and inter-rater (ICC = 0.79) reliability found in this study demonstrates that ChatGPT's responses are evaluated consistently by different raters and over time. Similar findings have been reported in the literature, where AI systems exhibited strong test-retest reliability in various clinical settings [25]. This consistency is critical for the practical application of AI in healthcare, as it reassures medical professionals that the AI tool produces stable and reproducible outputs [25, 26].

Our qualitative findings highlighted several advantages of ChatGPT in healthcare, including its utility in medical education, research, and initial diagnosis, which are echoed by similar studies. ChatGPT has been shown to provide useful insights for research by summarizing medical literature and aiding in hypothesis generation, as seen in the work by Thirunavukarasu et al. Additionally, its potential to support clinical decision-making, particularly in offering a second opinion, has been reported by Yu et al. [27, 28].

Despite these strengths, concerns around ethical issues and the potential for medico-legal complications align with observations made in the literature. Studies by Morley et al. have pointed out that while AI tools can be beneficial, their propensity to generate biased or incomplete responses presents a risk, especially in sensitive clinical contexts [29, 30, 31]. Inaccuracies, particularly for complex or evolving clinical scenarios, highlight the need for careful integration of AI into healthcare systems, as excessive reliance on AI may lead to adverse outcomes [32, 33, 34]. Moreover, the concerns regarding ChatGPT's limited knowledge base, constrained by its training data cutoff, have been previously identified by researchers. For example, an investigation into the limitations of large language models by Bommasani et al. highlighted that AI’s inability to access updated information or interpret new guidelines makes it less reliable for real-time clinical decision-making [35, 36, 37].

This study was conducted at a single private medical college, limiting the generalizability of the findings to broader healthcare settings or different regions. The sample consisted of postgraduate residents, which may not capture perspectives from senior clinicians, nurses, or allied health professionals. The accuracy and reliability assessments relied on specialty-based questions authored and evaluated by participants themselves, introducing a risk of subjective bias. Finally, as with any AI evaluation, performance is tied to the model’s version and training data; future updates may yield different results.

Conclusion:

This study found that ChatGPT delivers high accuracy and reliability when answering medical questions, especially those with clear, binary, or guideline-based answers (median score 5.5/6). Its consistent results across different raters and times suggest it can support medical professionals in routine, fact-based decisions. However, qualitative findings highlight important limitations, such as ethical concerns, medico-legal risks, and reduced accuracy with complex or nuanced cases, which underscore the need for careful human oversight. While ChatGPT shows promise as a supplementary tool in structured clinical contexts, further development and research are needed to address its limitations and support safe integration into real-world healthcare practice.

Declarations

Ethics statement: Ethical permission to conduct the study was obtained from the Parul University Institutional Ethics Committee for Human Research (PU-IECHR) (PUIECHR/PIMSR/00/081734), ensuring that the research adhered to ethical standards.

Research Ethics and Guidelines Compliance: All methods were performed in accordance with the relevant guidelines and regulations of the Parul University Institutional Ethics Committee for Human Research (PU-IECHR), in line with the ethical standards set forth in the Declaration of Helsinki.

Consent to participate:

All participants provided informed consent before participating in the study. Confidentiality: Participant data was anonymized to ensure confidentiality. Participants were assured of their anonymity and confidentiality throughout the study.

Records will be securely stored for three years post-study, in line with institutional and ethical guidelines.

Competing Interests:

The authors have no relevant financial or non-financial interests to disclose.

Funding:

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Author Contribution

UC led data collection, contributed to data analysis, and drafted the initial manuscript. PK conceptualized and supervised the study, performed critical revisions, and managed correspondence. NB provided expert guidance on study design and interpretation of results. PD assisted with data analysis and contributed to the discussion section. US provided oversight, reviewed the manuscript for intellectual content, and approved the final version. All authors read and approved the final manuscript.

Data Availability

The data supporting the findings of this study are available from the corresponding author upon reasonable request.

Electronic Supplementary Material

Below is the link to the electronic supplementary material

Supplementary Material 1

References:

1. OpenAI. ChatGPT Model [Internet]. Available from: https://beta.openai.com/docs/models

2. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE. PLOS Digit Health. 2023. 10.1101/2022.12.19.22283643.

3. Gilson A, Safranek C, Huang T, et al. How Well Does ChatGPT Perform on Medical Licensing Exams? medRxiv. 2022. 10.1101/2022.12.23.22283901.

4. Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice. Healthc (Basel). 2023;11(6):887. 10.3390/healthcare11060887.

5. Esteva A, Robicquet A, Ramsundar B, et al. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9. 10.1038/s41591-018-0316-z.

6. Topol EJ. High-performance medicine: The convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56. 10.1038/s41591-018-0300-7.

7. Goodman RS, Patrinely JR, Osterman T, et al. On the Cusp: Considering the Impact of AI in Healthcare. medRxiv. 2023. 10.1101/2023.01.23.23284225.

8. Cascella M, Montomoli J, Bellini V, et al. Evaluating the Feasibility of ChatGPT in Healthcare. J Med Syst. 2023;47(1):33. 10.1007/s10916-023-01925-4.

9. Shen Y, Heacock L, Elias J, et al. ChatGPT and Other LLMs in Healthcare. Radiology. 2023;306(1):230163. 10.1148/radiol.230163.

10. Shah P, Kendall F, Khozin S, et al. AI in clinical development: A translational perspective. NPJ Digit Med. 2020;3:34. 10.1038/s41746-020-0221-8.

11. Sallam M. Ethical considerations in AI healthcare. Healthcare (Basel). 2023. 10.3390/healthcare11060887.

12. Cascella M, et al. Addressing Challenges in AI-based Healthcare. J Med Syst. 2023;47(1):33.

13. Esteva A, et al. AI Reliability in Clinical Decision-Making. Nat Med. 2019;25(1):24–9.

14. Topol EJ. Equity in AI-powered Healthcare. Nat Med. 2019;25(1):44–56.

15. Kung TH, et al. Refining AI for Medical Use. PLOS Digital Health; 2023.

16. Sallam M. Regulatory and Ethical Frameworks for AI. Healthcare (Basel). 2023.

17. Shen Y, et al. Human-AI Collaboration in Medicine. Radiology. 2023;306(1):230163.

18. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529(7587):484–9. 10.1038/nature16961.

19. Krittanawong C, Nanna M, Pirozzolo G, Timmons R, Thongprayoon C, Rattanawong P, et al. Artificial intelligence in cardiovascular medicine: The future is now. J Am Coll Cardiol. 2020;75(10):1070–80. 10.1016/j.jacc.2019.11.036.

20. Celi LA, Marshall C, Melnick ER, Shapiro N, Farkouh ME. Clinical AI in the COVID-19 pandemic. Nat Med. 2021;27(1):7–8. 10.1038/s41591-020-1000-5.

21. Topol EJ. Preparing for the future of medicine: artificial intelligence and its impact on patient care. Health Aff. 2019;38(11):1822–6. 10.1377/hlthaff.2019.00432.

22. Rajpurkar P, Ouyang D, Yang B, Lungren MP, DeGrave A, Yang G, et al. CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning. arXiv [Preprint]. 2017. Available from: https://arxiv.org/abs/1711.05225

23. Yang G, Wang Y, Huo Y, Jin Y, Liu Y, Li D, et al. Evaluating the performance of deep learning-based AI models for detecting skin cancer: a systematic review and meta-analysis. J Am Acad Dermatol. 2020;83(6):1621–6. 10.1016/j.jaad.2020.06.034.

24. Liu Y, Chen P, Yang Y, Wu F, Luo D, Wang Y, et al. Artificial intelligence in health care: Anticipating challenges to ethics, privacy, and data protection. Health Care Anal. 2021;29(4):314–30. 10.1007/s10728-021-00454-4.

25. Amisha PM, Malhotra N. Transforming healthcare using artificial intelligence: a systematic review. J Health Manag. 2020;22(1):1–14. 10.1177/0972063420902253.

26. Chan D, De Silva R, Tong A, Leong K. Clinical decision support systems in diabetes management: a systematic review. BMC Med Inf Decis Mak. 2019;19(1):1–10. 10.1186/s12911-019-0887-3.

27. Hamet P, Tremblay J. Artificial intelligence in medicine. Metabolism. 2017;69:S36–40. 10.1016/j.metabol.2017.01.011.

28. Thirunavukarasu A, Krishnan P, Rao NR. AI and clinical decision support systems: Enhancing accuracy in diagnosis. J Clin Med. 2023;12(4):450–61.

29. Anderson E, Taylor R. Machine learning in healthcare: Current applications and future challenges. Lancet Digit Health. 2022;4(9):e567–75.

30. Johnson K, Nelson A, Gupta M. Evaluating AI-driven decision support systems in clinical care. J Med Syst. 2023;47(3):122–30.

31. Rao P, Zhang W, Davis K. AI’s role in clinical decision-making: Limitations and potential. AI Health. 2022;14(2):223–35.

32. Hartley S, Metcalfe A. Consistency of AI outputs in healthcare diagnostics: A reliability study. Med Inf Res. 2022;10(5):375–84.

33. Thirunavukarasu A, Ganesh A, Ponnusamy K. AI in medical research: Revolutionizing data analysis and literature review. Adv Med Technol. 2023;19(1):102–15.

34. Yu S, Shah NH. The promise of AI in healthcare: Decision-making and diagnostics. Front AI. 2022;9:856–71.

35. Morley J, Floridi L. The ethics of AI in health care: A framework for trustworthiness. Lancet Digit Health. 2021;3(6):e393–8.

36. Vayena E, Blasimme A. Navigating AI’s ethical challenges in clinical practice. J Med Ethics. 2022;48(4):312–9.

37. Bommasani R, Liang P, Zou J. Limitations of language models in clinical settings. Nat Med. 2022;28(7):1123–9.
