Systematic Comparison of Multimodal Large Language Models for Pediatric Profile Orthodontic Assessment and Early Intervention: ChatGPT, DeepSeek, and Gemini

Xiangyu Ge^1,3, Jingcheng Chen^2, Chenyang Yuan^1,3, Zhenghan Chu^2, Xiangyu Li^3, Xi Zhang^3, Yanan Chen^3, WeiYing Zheng^3, Chunqin Miao^3*

Affiliations

1. ZheJiang Chinese Medical University, Hangzhou 310000, Zhejiang Province, People’s Republic of China.

2. Jiaxing Nanhu District People’s Hospital, Jiaxing 314000, Zhejiang Province, People’s Republic of China.

3. The Second Hospital of Jiaxing, Jiaxing 314000, Zhejiang Province, People’s Republic of China.

Email^*: Gexy1012@163.com

Abstract

Objective

This study aimed to compare the performance of three multimodal large language models (ChatGPT, DeepSeek, and Gemini) in analyzing pediatric profile photographs and providing early orthodontic intervention recommendations, thereby assessing their clinical feasibility and reliability.

Materials and Methods

In this cross-sectional study, 100 children aged 5–12 years who attended Jiaxing Second Hospital between January and June 2025 were enrolled. Standardized profile photographs were obtained and processed uniformly before being analyzed by the three models using identical prompts. Model outputs were anonymized and independently evaluated under single-blind, randomized conditions by orthodontic experts and parents. An eight-dimension weighted scoring system was applied, encompassing professionalism, accuracy, completeness, individualization, safety, comprehensibility, empathy, and readability. Statistical analyses included the Friedman test, Wilcoxon signed-rank test, and Kendall’s W effect size.

Results

All three models achieved high overall scores, ranging from 3.9 to 4.2. ChatGPT consistently produced slightly higher mean scores (4.07–4.15), while DeepSeek and Gemini showed comparable performance (3.91–4.09). Inter-model differences were not statistically significant (all q > 0.05), and effect sizes were uniformly negligible (Kendall’s W = 0.003–0.029).

Conclusions

ChatGPT, DeepSeek, and Gemini demonstrated comparable and overall reliable performance in pediatric orthodontic screening based on profile photographs, with ChatGPT showing a slight but nonsignificant advantage. At present, LLMs may serve as supportive tools for early orthodontic assessment but cannot substitute for clinical expertise.

Clinical Relevance:

This study highlights the potential role of large language models in pediatric orthodontic screening, suggesting they may improve accessibility and efficiency. Future research should incorporate multimodal inputs such as lateral cephalograms, CBCT, and intraoral scans, and conduct multicenter, large-scale validation to enhance clinical translation.

Keywords:

large language models; orthodontics; pediatric profile analysis; early intervention; artificial intelligence

Introduction

With the rapid advancement of large language models (LLMs), an increasing number of individuals are utilizing these tools to obtain health-related information, including disease knowledge, symptom evaluation, and treatment recommendations. However, given the sensitivity and complexity of medicine, concerns have arisen regarding the reliability and safety of AI-generated medical advice. In parallel with technological progress and heightened health awareness, more parents are paying attention to changes in their children’s facial appearance during growth. Previous studies have demonstrated that, particularly in early childhood, mouth breathing is associated with unfavorable craniofacial features such as mandibular retrusion, narrow palatal arch, elevated nasal base, lip incompetence, increased vertical facial height, and a steep mandibular plane angle, all of which can adversely affect facial esthetics [1–3]. During adolescence, when craniofacial growth peaks, harmful oral habits can lead to aberrant facial development, whereas timely and guided intervention can redirect growth toward favorable trajectories and correct the consequences of adverse habits [4]. Thus, accurate clinical judgment in pediatric profile analysis is critical for determining the timing and modality of orthodontic intervention.

In the current digital era, the internet is flooded with heterogeneous health information, making it difficult for parents without medical training to discern its validity. With the continuous evolution and widespread adoption of LLMs, patients and families have become accustomed to consulting these models for medical questions. While LLMs have demonstrated significant capability in information retrieval and patient education, it remains unclear whether they can provide guidance comparable to in-person clinical care.

Artificial intelligence (AI) and LLMs have already demonstrated multiple applications in dentistry, including diagnostic support, personalized treatment planning, intelligent follow-up, and medical education [5–7]. Despite challenges related to data privacy, ethics, and model reliability, the potential of AI in oral healthcare is substantial [8]. In recent years, general-purpose LLMs such as ChatGPT, DeepSeek, and Gemini have accelerated the intelligent transformation of dentistry, offering new opportunities for clinical decision support and imaging analysis [7, 9, 10]. ChatGPT, developed by OpenAI, is based on the generative pre-trained transformer (GPT) architecture. Its latest versions—GPT-4 and the multimodal GPT-4o (omni)—feature robust natural language understanding and generation capabilities, enabling tasks such as dialogue, text summarization, medical Q&A, case analysis, and diagnostic assistance. Prior studies have shown ChatGPT to deliver high accuracy and readability in disease-related Q&A, case summaries, decision support, and patient education, including in dentistry for common disease queries, treatment planning, and preoperative counseling [11, 12]. Nevertheless, limitations persist in processing imaging and complex multimodal data, underscoring the need for expert validation in clinical practice [13].

In comparison, DeepSeek, developed by DeepSeek AI (Hangzhou, China), is a Chinese-optimized LLM series designed for language understanding and multimodal analysis in Chinese contexts. Its DeepSeek-VL version supports text–image fusion input, enabling collaborative analysis of medical images and text. Previous studies have reported its effectiveness in Chinese medical Q&A, automated clinical report generation, and image annotation, with promising results in dental image interpretation and clinical support [14, 15]. Gemini, developed by Google DeepMind, represents a family of native multimodal LLMs (formerly Bard), capable of processing textual, visual, and auditory inputs. The Gemini 1.5 series integrates long-context reasoning with multimodal understanding. In healthcare, Gemini has been applied to medical Q&A, literature reviews, image–text case analysis, and enhanced search. In dentistry, Gemini is particularly suited for integrating photographs, radiographs, and electronic health records for comprehensive assessment, thereby supporting multimodal diagnosis and early orthodontic intervention planning [10, 16].

Although these models have shown potential in prior research, systematic evaluation of their performance in pediatric profile-based analysis and early orthodontic intervention remains limited. This study was therefore designed to directly compare ChatGPT, DeepSeek, and Gemini, with the aim of exploring their feasibility and clinical value in pediatric facial analysis and early orthodontic care.

Materials and Methods

This cross-sectional study was approved by the Institutional Review Board of Jiaxing Second Hospital (Approval No.: 2024JX157). Written informed consent was obtained from each participant’s legal guardian.

This study compared the output quality of three multimodal LLMs in pediatric profile-based orthodontic assessment. A total of 100 children presenting to the Department of Orthodontics, Jiaxing Second Hospital, from January to June 2025 were enrolled based on predefined inclusion and exclusion criteria. All data were anonymized, with only essential demographic information retained for statistical analysis.

Sample size estimation was performed using G*Power 3.1, with the following parameters: repeated-measures ANOVA (within factors), effect size f = 0.25 (medium), α = 0.05, power (1 − β) = 0.80, number of measures = 3 (ChatGPT, DeepSeek, Gemini), within-subject correlation r = 0.5, and nonsphericity correction ε = 1. The calculation yielded a minimum required sample size of 28. Considering a 20% dropout/exclusion risk, the final sample size was set at 100, ensuring adequate statistical power.
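The dropout allowance can be checked with a short calculation. The sketch below assumes the common n / (1 − d) inflation rule; the text does not state which rule was applied, and the function name is hypothetical. Under that assumption, the inflated minimum is well below the 100 children actually enrolled.

```python
def inflate_for_dropout(n_required: int, dropout_pct: int) -> int:
    """Smallest enrolment n such that n * (1 - dropout) >= n_required.

    Assumption: the usual n / (1 - d) inflation rule; the study text does
    not say how its dropout allowance was computed.
    """
    if not 0 <= dropout_pct < 100:
        raise ValueError("dropout_pct must be in [0, 100)")
    retained_pct = 100 - dropout_pct
    # Integer ceiling division avoids floating-point rounding surprises.
    return -((-n_required * 100) // retained_pct)


minimum_n = 28          # G*Power result reported in the text
dropout_pct = 20        # dropout/exclusion allowance reported in the text
adjusted_n = inflate_for_dropout(minimum_n, dropout_pct)
print(f"Minimum n after {dropout_pct}% allowance: {adjusted_n}")
```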

Inclusion criteria:

Age 5–12 years (mixed or early permanent dentition).

Availability of standardized lateral profile photographs with clear, unobstructed, and evenly illuminated views.

Normal craniofacial development without congenital anomalies.

Complete demographic and clinical data (age, sex, etc.).

Written informed consent from participants and guardians for image use.

Exclusion criteria:

Congenital craniofacial anomalies or severe developmental abnormalities.

History of orthognathic surgery or major craniofacial interventions.

Blurred, obstructed, or non-standardized photographs.

Severe systemic or neuromuscular diseases.

Inability to provide valid informed consent.

All photographs were captured by trained personnel using identical digital cameras under standardized background and lighting. Participants were instructed to adopt a natural head position (NHP) with pupils aligned horizontally, sagittal plane perpendicular to the camera axis, lips lightly closed, and teeth in gentle occlusion. Photos were taken at 1.5 m against a neutral gray background to minimize contrast interference. Preprocessing was performed with OpenCV (v4.x), including background cropping, size normalization (1024×1024 px), and automated brightness/contrast equalization. Images were saved in JPG format at ≥ 300 dpi.

The following multimodal LLMs were used: ChatGPT-4o (OpenAI), DeepSeek-Vision (DeepSeek AI), and Gemini 1.5 Pro (Google DeepMind). Each model received identical prompts in Chinese along with the profile images (Fig. 1):

Fig. 1

Chinese prompt words

Please analyze this patient’s profile photo with respect to: (1) facial type (convex/concave/straight); (2) mandibular development (retrognathia); (3) upper and lower lip position relative to the E-line; (4) need for early orthodontic intervention; (5) if indicated, recommended treatment approaches and timing.

To avoid contextual carryover effects, each case was entered into a new session. Model outputs were de-identified and randomly coded, with raters blinded to the model source (single-blind design). The sequence of cases and models was allocated using block randomization. Ratings were conducted independently with no communication permitted, and all data were recorded in real time and securely encrypted.

This study employed a customized evaluation framework (Table 1) comprising eight dimensions grouped into three functional clusters: diagnostic/professional (40%), recommendation/management (40%), and language/patient experience (20%). Each dimension was scored using a 0–5 anchor scale (5 = optimal, 0 = absent), with weights assigned according to functional importance. The diagnostic/professional cluster included professionalism (PA, 25%) and accuracy (AC, 15%), primarily evaluated by orthodontic experts. The recommendation/management cluster encompassed completeness (IC, 14%), individualization/relevance (PR, 12%), and ethics/safety (ES, 14%), also assessed by experts. The language/patient experience cluster consisted of comprehensibility (CC, 10%), empathy/humanistic concern (EHC, 5%), and readability (RE, 5%), rated by parents or guardians.

Raters were divided into two groups: the expert group (three orthodontists with ≥ 5 years of clinical experience) evaluated PA, AC, IC, PR, and ES; the parent group (three guardians/parents) evaluated CC, EHC, and RE. Final composite scores were calculated by weighted aggregation across all dimensions and used for inter-model comparisons [7].
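The blinded, block-randomized coding of model outputs described above could be organized roughly as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the block size (one block of three models per case), the four-digit code format, the seed, and the function name `build_blinded_schedule` are all hypothetical.

```python
import random

MODELS = ("ChatGPT", "DeepSeek", "Gemini")


def build_blinded_schedule(case_ids, seed=2025):
    """Return (rating_sheet, key) for a single-blind rating session.

    rating_sheet: ordered list of opaque labels shown to raters.
    key: label -> (case_id, model); held by the coordinator only.
    Assumption: one permuted block of the three models per case.
    """
    rng = random.Random(seed)          # fixed seed keeps the schedule reproducible
    cases = list(case_ids)
    rng.shuffle(cases)                 # randomize the order of cases

    items = []
    for case_id in cases:
        block = list(MODELS)
        rng.shuffle(block)             # randomize model order within the block
        items.extend((case_id, model) for model in block)

    # Unique random codes so that labels reveal neither case nor model.
    codes = rng.sample(range(1000, 10000), len(items))
    rating_sheet, key = [], {}
    for (case_id, model), code in zip(items, codes):
        label = f"R{code}"
        rating_sheet.append(label)
        key[label] = (case_id, model)
    return rating_sheet, key


sheet, key = build_blinded_schedule(range(1, 101))
print(len(sheet), "coded outputs; first label:", sheet[0])
```

Keeping the key separate from the rating sheet is what preserves blinding: raters see only the labels, and the mapping is applied after scoring is complete.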

Table 1
Eight-dimensional scoring anchors and weights (0–5 scale)
Dimension	Weight	5 points	4 points	3 points	2 points	1 point	0 points
PA	25%	Fully consistent with guidelines/consensus	Generally consistent, minor deviations	Partially consistent, key omissions	Multiple inconsistencies	Severe violation	Not addressed
AC	15%	Completely correct	Core content correct, minor detail errors	Partially correct, key errors	Multiple errors	Major errors	No information
IC	14%	Fully comprehensive	Mostly complete	Some omissions	Key information missing	Highly incomplete	Not provided
PR	12%	Highly individualized	Mostly individualized	Insufficient individualization	Clearly lacking	Template-like	No recommendation
ES	14%	Entirely safe	Generally safe	Potential risk points	Possibly inappropriate	Clearly inappropriate/hazardous	Dangerous content
CC	10%	Clear and accessible	Mostly clear	Some ambiguity/gaps	Vague/overly technical	Confusing	Unintelligible
EHC	5%	Strongly empathetic/caring	Moderately empathetic	Neutral, lacking warmth	Cold	Negative tone	No empathy
RE	5%	Well-structured, fluent	Clear but slightly complex	Partially difficult to read	Generally difficult to read	Very obscure	Unreadable
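The weighted aggregation implied by Table 1 can be written out directly. The sketch below uses the weights from the table; the example scores are hypothetical and the function name `composite_score` is illustrative, since the text does not specify how the composite was implemented.

```python
# Weights from Table 1, expressed as integer percentages (they sum to 100).
WEIGHTS_PCT = {
    "PA": 25, "AC": 15,             # diagnostic/professional cluster (40%)
    "IC": 14, "PR": 12, "ES": 14,   # recommendation/management cluster (40%)
    "CC": 10, "EHC": 5, "RE": 5,    # language/patient experience cluster (20%)
}
assert sum(WEIGHTS_PCT.values()) == 100


def composite_score(scores):
    """Weighted mean of the eight dimension scores, on the same 0-5 scale."""
    if set(scores) != set(WEIGHTS_PCT):
        raise ValueError("scores must cover exactly the eight dimensions")
    for dim, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{dim} score {value} is outside the 0-5 scale")
    return sum(WEIGHTS_PCT[d] * scores[d] for d in WEIGHTS_PCT) / 100


# Hypothetical ratings for one model output (not study data).
example = {"PA": 4, "AC": 5, "IC": 3, "PR": 4, "ES": 5, "CC": 4, "EHC": 3, "RE": 4}
print(composite_score(example))
```

Because the weights sum to 100%, the composite stays on the 0–5 scale and can be read against the same anchors as the individual dimensions.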

To minimize subjective variation and ensure consistency in scoring standards, a unified training and pilot rating session was conducted before the formal evaluation. First, all raters received standardized instruction, with operational definitions of the eight scoring dimensions explained item by item. Clinical examples were provided to clarify the anchor criteria for the 0–5 scale, ensuring uniform understanding across raters. Subsequently, the expert group performed a pilot rating exercise using 10 standardized profile photographs (not included in the study sample) to familiarize themselves with the scoring procedures. After the pilot session, any items with inter-rater discrepancies greater than one point were reviewed collectively, and rating criteria were harmonized through discussion.

To verify the reliability of the scoring system, inter-rater agreement was assessed using intraclass correlation coefficients (ICC, two-way random effects, absolute agreement) for expert-rated dimensions, while Cronbach’s α was used for readability- and empathy-related dimensions rated by parents/guardians. Thresholds were predefined as ≥ 0.70 for acceptable and ≥ 0.80 for good reliability. If reliability fell below these standards, additional training and calibration were performed until the thresholds were met. Once the reliability indices reached the predetermined criteria, the scoring protocol was finalized and applied in the formal evaluation, thereby ensuring both the reliability and validity of the assessment system (Table 2).

Table 2
Reliability indices and coefficients for each scoring dimension
Dimension	Functional cluster	Reliability index	Coefficient
PA	Diagnostic/Professional cluster	ICC(2,1)	0.761
AC	Diagnostic/Professional cluster	ICC(2,1)	0.766
IC	Recommendation/Management cluster	ICC(2,1)	0.792
PR	Recommendation/Management cluster	ICC(2,1)	0.737
ES	Recommendation/Management cluster	ICC(2,1)	0.841
CC	Language/Experience cluster	Cronbach’s α	0.932
EHC	Language/Experience cluster	Cronbach’s α	0.931
RE	Language/Experience cluster	Cronbach’s α	0.931
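For readers who want to reproduce the two reliability indices in Table 2, the sketch below computes ICC(2,1) (two-way random effects, absolute agreement, single rater) from the standard ANOVA mean squares, and Cronbach's α from item and total variances. The rating matrices are illustrative and are not study data; the study itself used SPSS, so this is an independent restatement of the formulas rather than the authors' code.

```python
from statistics import mean, variance


def icc_2_1(ratings):
    """ICC(2,1) for a subjects x raters matrix (list of equal-length rows)."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols                     # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def cronbach_alpha(ratings):
    """Cronbach's alpha, treating each rater (column) as an item."""
    k = len(ratings[0])
    item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Illustrative matrix: 6 subjects rated by 4 raters (not study data).
demo = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(2,1) = {icc_2_1(demo):.3f}")
print(f"Cronbach's alpha = {cronbach_alpha(demo):.3f}")
```

ICC(2,1) penalizes systematic differences between raters (the rater mean square appears in the denominator), which is why it suits the absolute-agreement question asked of the expert panel.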

All ratings were conducted independently, with the expert group evaluating PA, AC, IC, PR, and ES, and the parent group evaluating CC, EHC, and RE. If the score range among the three raters for any given dimension exceeded one point, a second round of reassessment was initiated. If disagreement persisted, a senior coordinator (a senior orthodontic specialist and/or the coordinating evaluation panel) rendered an adjudicated score based on the anchor criteria. The overall workflow is illustrated in Fig. 2.

Fig. 2

Workflow of the rating and adjudication process.
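The decision rule behind this workflow can be summarized in a few lines. The sketch below only encodes the branching described in the text (accept, reassess, adjudicate); the function name `triage` and its return labels are hypothetical, and how accepted scores are combined is deliberately left out because the text does not specify it.

```python
def triage(first_round, second_round=None, max_range=1):
    """Classify one dimension's three ratings according to the stated rule.

    Returns "accept" when the score range is within max_range,
    "reassess" when a second rating round is needed, and
    "adjudicate" when disagreement persists after the second round.
    """
    if max(first_round) - min(first_round) <= max_range:
        return "accept"
    if second_round is None:
        return "reassess"
    if max(second_round) - min(second_round) <= max_range:
        return "accept"
    return "adjudicate"


print(triage([4, 4, 5]))                 # range 1: within tolerance
print(triage([3, 4, 5]))                 # range 2: second round needed
print(triage([3, 4, 5], [3, 4, 5]))      # still range 2: senior coordinator
```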

Statistical Analysis

All statistical analyses were performed using SPSS version 26.0 (IBM Corp., Armonk, NY, USA). Normality of distribution for each dimension was assessed using the Shapiro–Wilk test, which indicated that most variables did not follow a normal distribution; therefore, non-parametric methods were applied. Overall comparisons across the three LLMs (ChatGPT, DeepSeek, Gemini) for the eight evaluation dimensions were conducted using the Friedman test, with Kendall’s W calculated as the effect size. False discovery rate (FDR) adjustment was applied using the Benjamini–Hochberg method. When overall differences reached statistical significance, post hoc pairwise comparisons were performed using the Wilcoxon signed-rank test, with multiple comparisons corrected by the Holm method. All tests were two-tailed, and statistical significance was set at α = 0.05.
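The omnibus comparison described above can be illustrated with a small, dependency-free implementation. This is a sketch of the textbook tie-corrected Friedman statistic and Kendall's W for exactly three related conditions, not a reproduction of the SPSS procedure used in the study; the function names and the toy score matrix are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks (1 = smallest) with tied values sharing their mean rank.

    Also returns the size of each tie group, needed for the tie correction.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1           # mean of positions i..j, 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def friedman_three_models(scores):
    """Return (chi2, kendall_w, p) for an n x 3 matrix of related scores."""
    n, k = len(scores), len(scores[0])
    if k != 3:
        raise ValueError("the p-value shortcut below assumes df = 2")
    rank_sums = [0.0] * k
    tie_term = 0
    for row in scores:
        ranks, tie_sizes = average_ranks(row)
        for j, r in enumerate(ranks):
            rank_sums[j] += r
        tie_term += sum(t ** 3 - t for t in tie_sizes)
    correction = 1 - tie_term / (n * k * (k * k - 1))
    if correction == 0:                    # every row fully tied: no information
        return 0.0, 0.0, 1.0
    chi2 = (12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
            - 3 * n * (k + 1)) / correction
    kendall_w = chi2 / (n * (k - 1))       # effect size on a 0-1 scale
    p_value = math.exp(-chi2 / 2)          # chi-square tail probability, df = 2
    return chi2, kendall_w, p_value


# Toy scores for three models on three cases (not study data).
toy = [[4, 4, 5], [4, 5, 5], [3, 4, 5]]
chi2, w, p = friedman_three_models(toy)
print(f"chi2 = {chi2:.3f}, W = {w:.3f}, P = {p:.3f}")
```

With two degrees of freedom the chi-square tail probability reduces to exp(−χ²/2), so the P values in Table 4 can be checked by hand; for example, χ² = 3.437 gives P ≈ 0.179, in agreement with the IC row.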

Results

The mean scores of the three models across the eight dimensions are shown in Table 3. All three models achieved relatively high scores (approximately 3.9–4.2), indicating similar overall performance. ChatGPT obtained the highest mean score in every dimension (4.07–4.15), DeepSeek scored in the intermediate range (3.98–4.09), and Gemini scored slightly lower (3.91–4.02). Among the individual dimensions, ChatGPT achieved the highest score in ethics/safety (ES, 4.15 ± 0.59), whereas Gemini had the lowest score in completeness (IC, 3.91 ± 0.64). The differences in mean scores were small: the largest gap between the highest- and lowest-scoring models within any dimension was 0.21 points (IC), suggesting a high degree of consistency across models. In summary, although ChatGPT had numerically higher means than the other two models, the differences from DeepSeek and Gemini were limited, and all three demonstrated comparably high performance.

Table 3
Scores of the three models across eight evaluation dimensions (mean ± SD)
Dimension	ChatGPT	DeepSeek	Gemini	Highest model	Highest mean	Lowest mean
PA	4.10 ± 0.62	4.02 ± 0.59	3.97 ± 0.61	ChatGPT	4.10	3.97
AC	4.08 ± 0.60	4.00 ± 0.57	3.95 ± 0.59	ChatGPT	4.08	3.95
IC	4.12 ± 0.63	3.98 ± 0.62	3.91 ± 0.64	ChatGPT	4.12	3.91
PR	4.07 ± 0.61	3.99 ± 0.58	3.93 ± 0.60	ChatGPT	4.07	3.93
ES	4.15 ± 0.59	4.09 ± 0.57	4.02 ± 0.60	ChatGPT	4.15	4.02
CC	4.11 ± 0.61	4.04 ± 0.60	3.96 ± 0.62	ChatGPT	4.11	3.96
EHC	4.13 ± 0.60	4.05 ± 0.58	3.98 ± 0.61	ChatGPT	4.13	3.98
RE	4.09 ± 0.62	4.01 ± 0.59	3.94 ± 0.60	ChatGPT	4.09	3.94
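The size of the between-model gaps can be read off Table 3 with a few lines of arithmetic. The sketch below transcribes the mean scores from the table and reports the gap between the highest and lowest model in each dimension; variable names are illustrative.

```python
# Mean scores from Table 3, in the order (ChatGPT, DeepSeek, Gemini).
TABLE3_MEANS = {
    "PA": (4.10, 4.02, 3.97),
    "AC": (4.08, 4.00, 3.95),
    "IC": (4.12, 3.98, 3.91),
    "PR": (4.07, 3.99, 3.93),
    "ES": (4.15, 4.09, 4.02),
    "CC": (4.11, 4.04, 3.96),
    "EHC": (4.13, 4.05, 3.98),
    "RE": (4.09, 4.01, 3.94),
}

# Gap between the best and worst model per dimension, rounded to 2 decimals
# to remove floating-point noise from the subtraction.
gaps = {dim: round(max(m) - min(m), 2) for dim, m in TABLE3_MEANS.items()}
widest = max(gaps, key=gaps.get)

for dim, gap in gaps.items():
    print(f"{dim}: {gap:.2f}")
print(f"Largest gap: {widest} ({gaps[widest]:.2f} points)")
```

All gaps are a fraction of the reported standard deviations (about 0.6), which is consistent with the negligible effect sizes reported for the Friedman tests.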

The Friedman test results (Table 4) indicated that there were no statistically significant overall differences among the three models in any dimension (all P values > 0.05, and FDR-adjusted q values also > 0.05). The completeness (IC) dimension showed the largest test statistic (χ² = 3.437, P = 0.179, q = 0.358, Kendall’s W = 0.029) but did not reach statistical significance. For all other dimensions, Kendall’s W values ranged between 0.003 and 0.011, indicating negligible effect sizes. Taken together, these findings indicate that performance differences among the three models across the eight evaluation dimensions were minimal and not statistically significant.

Table 4
Friedman test results for the three models across eight evaluation dimensions
Dimension	χ²	df	P value	Kendall’s W	q value (FDR-adjusted)
PA	0.334	2	0.846	0.006	0.846
AC	0.529	2	0.768	0.009	0.846
IC	3.437	2	0.179	0.029	0.358
PR	1.286	2	0.526	0.011	0.701
ES	0.365	2	0.833	0.003	0.846
CC	0.431	2	0.806	0.004	0.846
EHC	0.389	2	0.823	0.003	0.846
RE	0.590	2	0.745	0.005	0.846
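The Benjamini–Hochberg step-up adjustment named in the statistical analysis can be sketched as follows. This is the standard textbook procedure, shown on hypothetical P values rather than on the values in Table 4; it is not a reconstruction of how the q values above were obtained in SPSS.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q values in the same order as the input P values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical P values for four tests (not study data).
p_demo = [0.01, 0.04, 0.03, 0.20]
for p, q in zip(p_demo, benjamini_hochberg(p_demo)):
    print(f"P = {p:.3f} -> q = {q:.4f}")
```

Because the adjustment takes a running minimum from the largest P value downward, a q value can never exceed the q value of any test with a larger P value.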

The pairwise comparison results are presented in Table 5. The uncorrected Wilcoxon signed-rank test showed a nominally significant difference between ChatGPT and DeepSeek in the completeness (IC) dimension (P = 0.032); however, this difference was no longer significant after Holm correction (P = 0.097). No other inter-model comparisons across the remaining dimensions reached statistical significance.

Table 5
Pairwise comparisons among the three models across eight evaluation dimensions (Wilcoxon signed-rank test, Holm correction)
Dimension	Comparison	Wilcoxon T	P value (uncorrected)	P value (Holm-corrected)
IC	ChatGPT vs DeepSeek	1234	0.032	0.097
Other dimensions	All pairwise comparisons	–	> 0.05	> 0.05
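The Holm step-down correction applied to the pairwise tests can be sketched in the same way. Only the ChatGPT vs DeepSeek P value (0.032) comes from Table 5; the other two values in the example are hypothetical placeholders, since the table does not report them individually.

```python
def holm(p_values):
    """Return Holm-adjusted P values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, ascending P
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        # The smallest P is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - step) * p_values[idx])
        running_max = max(running_max, candidate)          # keep monotone
        adjusted[idx] = running_max
    return adjusted


# 0.032 is the reported IC result; 0.20 and 0.50 are hypothetical fillers
# standing in for the two unreported pairwise comparisons.
pairwise_p = [0.032, 0.20, 0.50]
for p, p_adj in zip(pairwise_p, holm(pairwise_p)):
    print(f"P = {p:.3f} -> Holm-adjusted P = {p_adj:.3f}")
```

With three pairwise comparisons the smallest P value is multiplied by three, giving about 0.096; this is close to the 0.097 reported in Table 5, and the small gap is presumably due to rounding of the underlying P value.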

Discussion

The rapid development of large language models (LLMs) has become closely intertwined with many aspects of daily life, and their reliability and accuracy have steadily improved with ongoing technological iterations [17, 18]. As AI continues to integrate across various domains, the medical field has likewise been affected. With increasing accessibility and democratization of LLMs, more individuals are turning to these tools for health-related information. However, medicine is inherently a highly sensitive and rigorous discipline, and the validity of medical advice generated by LLMs has yet to be fully established.

The results of this study (Tables 3–5) demonstrated that the three LLMs performed similarly across all eight evaluation dimensions, each achieving relatively high scores (mean values approximately 3.9–4.2), with no statistically significant differences. ChatGPT showed slightly higher average scores in all dimensions, most notably completeness (IC), where its advantage over DeepSeek was nominally significant before correction. However, this difference did not remain significant after multiple-comparison correction. This lack of substantial difference likely reflects the overall maturity of LLMs in medical applications: the models have converged in their ability to generate professional, accurate, and comprehensible content, and observed variations may be more attributable to prompt design, training data, or rater subjectivity. The difference in completeness (IC) suggests that models may still vary in the coverage and logical integrity of their responses, an aspect of potential relevance in clinical information gathering and patient education. Effect size estimates further confirmed the minimal impact, with Kendall’s W ranging between 0.003 and 0.029, indicating negligible practical significance. Clinically, these findings suggest that all three LLMs produced stable and consistent outputs for orthodontics-related queries and that the observed differences are insufficient to justify recommending one model over the others at this stage.

Comparison with prior studies highlights both concordance and divergence. Earlier research found that ChatGPT outperformed Google search for medical queries, providing more accurate, citable answers often sourced from academic or healthcare domains, whereas Google responses frequently drew from commercial or social media content of variable quality [10]. This aligns with the trend observed in our study, where ChatGPT scored slightly higher than Gemini. Other reports suggest that DeepSeek offers more comprehensive and timely coverage of treatment-related knowledge and symptom interpretation, while ChatGPT excels in clarity and readability, making it more suitable for patient education [9]. This corroborates our findings: ChatGPT produced concise, easily understandable outputs favored by parents, whereas DeepSeek demonstrated stronger content coverage, especially regarding clinical details. Gemini’s performance, by contrast, appeared more sensitive to prompt design and input variation [12, 19]. Sorin et al. [20] similarly noted that fewer prompts improved Gemini’s recall, whereas chain-of-thought prompting sometimes reduced its accuracy. In our study, we also observed that Gemini’s output stability was less robust than that of ChatGPT or DeepSeek under varying image quality or input conditions. Collectively, these findings underscore that in multimodal clinical contexts, model performance is strongly influenced by prompt strategy and input quality.

From a clinical perspective, all three models were able to identify essential orthodontic features such as facial type, mandibular retrusion, and lip–E line relationships, which are critical for early intervention. Moreover, the intervention suggestions provided by the models were generally consistent with conventional clinical recommendations, for example, suggesting functional appliances for mandibular retrusion. Nonetheless, important limitations remain: model outputs lacked the precision and individualization of orthodontists’ assessments, often omitting key parameters (e.g., ANB angle, transverse skeletal discrepancies, overjet/overbite). Consequently, LLMs should currently be regarded as supportive tools rather than replacements for orthodontists’ diagnostic judgment [21, 22].

This study represents one of the first systematic comparisons of multimodal LLMs in orthodontics, establishing a foundation for future exploration of AI in dental imaging and early orthodontic intervention. The study design incorporated measures to maximize fairness and rigor, including standardized prompts, controlled photographic inputs, and single-blind randomized scoring, thereby reducing information and observer bias. In addition, the use of an eight-dimension weighted scoring system ensured a comprehensive evaluation that considered both clinical accuracy and patient-centered experience.

Despite these strengths, several limitations should be acknowledged. First, the sample was derived from a single center and a homogeneous population (Chinese children), limiting generalizability. Second, the input was restricted to standardized profile photographs, without integration of lateral cephalograms, CBCT, or intraoral scans, potentially omitting key diagnostic information. Third, although the scoring system was validated using ICC and Cronbach’s α, subjectivity could not be entirely eliminated, particularly in dimensions rated by parents, where cultural background and individual interpretation may have influenced outcomes. Finally, LLMs evolve rapidly, and our findings represent only the versions tested at the time of study; future iterations may yield substantially different results.

Conclusion

ChatGPT, DeepSeek, and Gemini demonstrated comparable performance in pediatric profile-based orthodontic evaluation, all achieving high overall scores with no statistically significant differences. ChatGPT showed a slight but non-significant advantage. At this stage, LLMs may serve as supportive tools for early orthodontic screening but cannot substitute for clinical expertise. Future research should incorporate multicenter and multimodal study designs to enhance generalizability and validate the clinical applicability of these models.

Data Availability

Due to patient information privacy concerns, the datasets generated and/or analyzed during this study are not publicly available.

However, they can be obtained from the corresponding author upon reasonable request.

Acknowledgements

We thank Z.W.Y, Department of Stomatology, Jiaxing No. 2 Hospital, for supporting the work of this study.

Funding

No Funding

Contributions

Jingcheng Chen: experimental design, data analysis, and paper writing; Xiangyu Ge: data analysis and data collection; Chenyang Yuan: data collection and data analysis; Yanan Chen: data collection and data analysis; Xiangyu Li: data verification; Xi Zhang: data collection and statistical analyses; Zhenghan Chu: data collection and statistical analyses; WeiYing Zheng: experimental design guidance and technical support; Chunqin Miao: experimental design guidance and technical support.

Corresponding author

Correspondence to

Chunqin Miao

Declarations

Ethics approval and consent to participate:

This cross-sectional study was approved by the Ethics Committee of Jiaxing Second Hospital (Approval Number: 2024JX157).

All participants provided written informed consent for the use of their anonymized clinical data for research purposes.

Consent to participate:

All participants gave written informed consent to participate in this study.

Clinical trial registration:

Clinical trial number: not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

References

1. Kandasamy, S. Mouth breathing and orthodontic intervention: Does the evidence support keeping our mouths shut? Am. J. Orthod. Dentofac. Orthop. 167 (6), 629–634. https://doi.org/10.1016/j.ajodo.2025.02.005 (2025).

2. Kim, K. A., Kim, S. J. & Yoon, A. Craniofacial anatomical determinants of pediatric sleep-disordered breathing: A comprehensive review. J. Prosthodont. 34 (S1), 26–34. https://doi.org/10.1111/jopr.13984 (2025).

3. Otel, A., Montiel-Company, J. M. & Zubizarreta-Macho, Á. Comparative Analysis of Early Class III Malocclusion Treatments-A Systematic Review and Meta-Analysis. Child. (Basel). 12 (2). https://doi.org/10.3390/children12020177 (2025).

4. Sinha, A. et al. Effect of Early Orthodontic Treatment on Long-Term Stability of Class II Malocclusions. J. Pharm. Bioallied Sci. 16 (Suppl 2), S1808–S1810. https://doi.org/10.4103/jpbs.jpbs_1171_23 (2024).

5. Bianchi, J. & Zheng, M. Leveraging Generative Artificial Intelligence in Teaching, Scholarship and Dental Education: Use Cases and Reflections. Orthod. Craniofac. Res. https://doi.org/10.1111/ocr.12949 (2025).

6. Kavousinejad, S. et al. Developing an artificial intelligence-based progressive growing GAN for high-quality facial profile generation and evaluation through turing test and aesthetic analysis. Sci. Rep. 15 (1), 26611. https://doi.org/10.1038/s41598-025-11172-x (2025).

7. Chen, J. et al. Comparing orthodontic pre-treatment information provided by large language models. BMC Oral Health. 25 (1), 838. https://doi.org/10.1186/s12903-025-06246-1 (2025).

8. Olawade, D. B. et al. AI-Driven Advancements in Orthodontics for Precision and Patient Outcomes. Dent. J. (Basel). 13 (5). https://doi.org/10.3390/dj13050198 (2025).

9. Liu, Y. et al. Assessing the Role of Large Language Models Between ChatGPT and DeepSeek in Asthma Education for Bilingual Individuals: Comparative Study. JMIR Med. Inf. 13, e65365. https://doi.org/10.2196/65365 (2025).

10. Tekin, S. B., Ince, K., Tekin, B. G., Servet, E. & Bozgeyik, B. Evaluation of Google and ChatGPT responses to common patient questions about scoliosis. Spine Deform. https://doi.org/10.1007/s43390-025-01169-x (2025).

11. Ali, R., Shi, L. & Cui, H. A Comparative Study on the Use of DeepSeek-R1 and ChatGPT-4.5 in Different Aspects of Plastic Surgery. Aesthetic Plast Surg. https://doi.org/10.1007/s00266-025-05108-z (2025).

12. Yang, X. & Chen, W. The performance of ChatGPT on medical image-based assessments and implications for medical education. BMC Med. Educ. 25 (1), 1192. https://doi.org/10.1186/s12909-025-07752-0 (2025).

13. Saeedi, S. & Bakhtiar, M. Assessing the response quality and readability of ChatGPT in stuttering. J. Fluen. Disord. 85, 106149. https://doi.org/10.1016/j.jfludis.2025.106149 (2025).

14. Jin, I. et al. DeepSeek vs. ChatGPT: prospects and challenges. Front. Artif. Intell. 8, 1576992. https://doi.org/10.3389/frai.2025.1576992 (2025).

15. Xu, P. et al. DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning. Adv. Ophthalmol. Pract. Res. 5 (3), 189–195. https://doi.org/10.1016/j.aopr.2025.05.001 (2025).

16. Alamleh, S. et al. Readability, Reliability, and Quality Analysis of Internet-Based Patient Education Materials and Large Language Models on Meniere's Disease. J. Otolaryngol. Head Neck Surg. 54, 19160216251360651. https://doi.org/10.1177/19160216251360651 (2025).

17. Haque, M. A. & Siddique, H. R. Generative artificial intelligence and large language models in smart healthcare applications: Current status and future perspectives. Comput. Biol. Chem. 120 (Pt 1), 108611. https://doi.org/10.1016/j.compbiolchem.2025.108611 (2025).

18. Pal, A. et al. Generative AI/LLMs for Plain Language Medical Information for Patients, Caregivers and General Public: Opportunities, Risks and Ethics. Patient Prefer Adherence. 19, 2227–2249. https://doi.org/10.2147/ppa.S527922 (2025).

19. Deng, J. et al. Evaluating ChatGPT and DeepSeek in postdural puncture headache management: a comparative study with international consensus guidelines. BMC Neurol. 25 (1), 264. https://doi.org/10.1186/s12883-025-04280-8 (2025).

20. Sorin, V. et al. Evaluating prompt and data perturbation sensitivity in large language models for radiology reports classification. JAMIA Open. 8 (4), ooaf073. https://doi.org/10.1093/jamiaopen/ooaf073 (2025).

21. Chávez-Sevillano, M. G. et al. Three-dimensional condyle and glenoid fossa alterations after class II treatment with twin block and herbst functional appliances - a randomized clinical trial. Eur. J. Orthod. 47 (4). https://doi.org/10.1093/ejo/cjaf038 (2025).

22. Piełunowicz, M. et al. Effects of rapid maxillary expansion and functional orthodontic treatment in children with sleep disordered breathing: a systematic review. BMC Oral Health. 25 (1), 1059. https://doi.org/10.1186/s12903-025-06348-w (2025).
