Artificial Intelligence-Powered Spatial Analysis for Interpretable Simulation of Clinical Reasoning in Surgical Management of Impacted Mandibular Third Molars.

Authors

*Eduardo Luiz Delamare, DDS, MSc (DMFR) - Faculty of Medicine and Health, The University of Sydney School of Dentistry

eduardo.delamare@sydney.edu.au

Xun Li, PhD - Imaging and Computer Vision (ICV) Research Group, Commonwealth Scientific and Industrial Research Organisation (CSIRO)

xun.li@data61.csiro.au

Juan David Osorio, BSCIT - Eyes of AI

juan.osorio@eyesofai.com

Michael Hornby, BDS - Faculty of Medicine and Health, The University of Sydney School of Dentistry

michael.hornby@sydney.edu.au

Katharina Alves Rabelo, BDS, MSc (DMFR) - Faculty of Medicine and Health, The University of Sydney School of Dentistry

katharina.alvesrabelo@sydney.edu.au

Sen Le, BDent - Eyes of AI

sen.le@eyesofai.com

Khoa Le, BCom, MFin - Eyes of AI

khoa.le@eyesofai.com

Samuel Khela, BSc - Dental Biomaterials Department, Faculty of Dentistry, Ain Shams University

samuel.khela@dent.asu.edu.eg

Shenghong Li, PhD - Imaging and Computer Vision (ICV) Research Group, Commonwealth Scientific and Industrial Research Organisation (CSIRO)

shenghong.li@data61.csiro.au

Changming Sun, PhD - Imaging and Computer Vision (ICV) Research Group, Commonwealth Scientific and Industrial Research Organisation (CSIRO)

changming.sun@data61.csiro.au

Dadong Wang, PhD - Imaging and Computer Vision (ICV) Research Group, Commonwealth Scientific and Industrial Research Organisation (CSIRO)

dadong.wang@data61.csiro.au

Heiko Spallek, DMD, MSBA, Dr.med.dent. - Faculty of Medicine and Health, The University of Sydney School of Dentistry

heiko.spallek@sydney.edu.au

Abstract

Surgical management of impacted mandibular third molars (M3Ms) depends on spatial relationships among the tooth, alveolar bone, and inferior alveolar nerve (IAN). Deep learning (DL) models reliably segment these structures in cone beam computed tomography (CBCT) scans; however, their “black box” nature limits clinical trust, especially in complex decision-making scenarios. We propose a hybrid DL and rule-based (HDLRB) pipeline that pairs nnU-Net and 3D U-Net segmentations with interpretable spatial analytics mirroring clinical reasoning. Three measurements drive recommendations: impaction angle; bony impaction status; and IAN proximity (minimum Euclidean M3M to mandibular canal distance with a prespecified contact threshold). A deterministic flow yields coronectomy, complete removal, or periodic radiographic monitoring. Retrospective validation: three clinicians independently assessed 49 M3Ms; disagreements were resolved by consensus. Expert agreement: Fleiss’ κ = 0.904 (95% CI 0.797–0.977). Against consensus, HDLRB correctly classified 46/49 cases (93.9% accuracy, 95% CI 83.5–98.0), with Cohen’s κ = 0.892. Class-wise recall: 100% for coronectomy and monitoring, 85.7% for removal; precision: 88.9% (coronectomy) and 100% (removal, monitoring). All errors were conservative: removal cases suggested as coronectomy. This interpretable hybrid approach simulates surgical reasoning, achieves high expert agreement, and provides case-level justifications, addressing adoption barriers and aiding decision support.

Keywords:

Artificial intelligence

Mandibular Third Molar

Digital Health

Oral Surgery

Radiology

Cone Beam Computed Tomography

Main Text

1. Introduction

The management of an impacted mandibular third molar (M3M) is a commonly encountered yet intricate challenge in oral and maxillofacial surgery. Population studies across different countries suggest that most young adults must decide whether to have surgery to remove third molars or retain and monitor until problems arise¹. Decisions regarding surgical management depend on a careful assessment of anatomical structures, such as the mandibular canal (MC), adjacent molars, and alveolar bone, as well as their spatial interrelationships^1,2. Accurate evaluation is critical: misjudging the degree of root proximity to the inferior alveolar nerve (IAN) or overlooking subtle anatomical variations may increase the risk of nerve injury and postoperative complications^3,4,5. In this context, reliable and explainable diagnostic tools can help dentists and specialists weigh multiple factors and tailor their approach to each patient’s unique anatomy, ultimately improving treatment outcomes and patient satisfaction.

Artificial intelligence (AI), particularly deep learning (DL), has shown substantial promise in medical imaging tasks, including the segmentation of complex structures in cone-beam computed tomography (CBCT) scans^6,7. In oral and maxillofacial imaging, DL-based methods have demonstrated remarkable accuracy in delineating teeth and the MC^8,9, potentially aiding surgical planning. However, these purely data-driven models often function as “black boxes,” offering limited transparency regarding the reasoning behind their outputs when clinical decisions are needed in high-risk situations^10,11. This lack of interpretability is detrimental to clinical adoption, as surgeons must understand the rationale behind a given recommendation to confidently integrate AI insights into their decision-making processes.

While existing explainable AI (XAI) techniques — such as saliency maps, feature attribution, and post-hoc interpretability frameworks — have improved transparency to some extent^12,13, they may not fully reflect the logical, stepwise processes clinicians use to evaluate surgical risks and plan interventions. Notably, oral and maxillofacial surgeons often employ sequential reasoning, first interpreting anatomical features, then assessing spatial relationships, and finally applying established clinical criteria to inform treatment decisions^1,2,5. Recognising this, recent advances in AI propose reasoning modelling strategies, in which the decision-making pathway is segmented into modular steps, each informed by the preceding output^14,15. Such approaches parallel the cognitive processes of experienced clinicians and may better bridge the gap between model predictions and surgical reasoning.

Within this paradigm, hybrid models — DL-driven image analysis integrated with rule-based decision criteria in a logic layer — offer a particularly appealing solution¹⁶.

By encoding clinically validated guidelines and risk assessment protocols into the logic layer of an analytic pipeline, these systems can provide explanations that align with the factors surgeons consider. For instance, DL models can detect and segment relevant anatomy, but they often fall short of translating these findings into direct, actionable guidance for clinical decisions. A logic layer, encoded with explicit rules, can quantify the minimum allowable distance between the MC and the tooth root for safe extraction or identify specific angular relationships associated with increased procedural complexity. This can directly benefit resident and early-career surgeons, who incur higher rates of postoperative complications and nerve injuries in M3M surgery¹⁷. This structured incorporation of domain knowledge adds value to AI-generated detections, ensuring that machine-generated outputs not only improve diagnostic accuracy but also translate into auditable guidance on safer decisions, potentially leading to improved patient outcomes. Indeed, emerging evidence in other areas of oral and maxillofacial practice suggests that blending machine intelligence with human-like reasoning can reduce iatrogenic complications and streamline clinical workflows^18,19.

Therefore, this study aims to investigate whether a fully explainable automated system can be developed that simulates the decision-making process for the surgical management of M3Ms, as observed on CBCT. While our study uses expert labels for the automatic segmentation of key anatomical structures, we aim to generate clinical decisions that do not rely on expert-labelled data. We hypothesise that a hybrid diagnostic pipeline, which couples DL-based segmentation with a 3D spatial analytics module (the logic layer), can generate reliable recommendations for high-risk decisions. This module combines 3D data processing techniques with a set of spatial distance-based rules aligned with established surgical criteria. By channelling the predictive power of AI through a decision sequence, the system generates interpretable recommendations for managing M3Ms. We validate its performance against expert assessments of the same CBCT scans, thereby evaluating its diagnostic reliability and the clarity and relevance of its reasoning.

2. Methods

The study was approved by the Ethical Committee of the Commonwealth Scientific and Industrial Research Organisation (CSIRO) under the approval number 2022_040_LR.

It was performed in accordance with the Declaration of Helsinki.

Informed consent was obtained from all participants.

2.1 Datasets

This research involved two datasets of deidentified CBCT scans from a diverse cohort of dentate patients aged 17 and older, sourced from the CSIRO and Eyes of AI Imaging archive, a database that aggregates imaging data from three private dental clinics across New South Wales (NSW), Australia. The devices used to acquire the scans were ProMax 3D (Planmeca, Helsinki, Finland), Green X 21 (Vatech, Hwaseong-si, South Korea), and Veraview X800 (J. Morita Corp., Saitama, Japan). CBCT volumes were exported in digital imaging and communications in medicine (DICOM) file format, with varying fields of view (FoV) (from 8 × 8 cm to 23 × 26 cm) and varying voxel sizes.

2.1.1 Dataset 1 - AI-based segmentations

For the development of the segmentation models, a total of 716 scans were randomly selected. Following a thorough quality assessment, none of the scans were excluded due to significant imaging artefacts, such as movement, aliasing pattern, or beam hardening. Therefore, all 716 scans of full or partial hemi-mandible sections were included for the development of the segmentation models, including training, testing, and validation sets. The DICOM files were imported into open-source medical imaging analysis software (3D Slicer) for the segmentation of the following anatomical structures:

Crowns.

Roots.

Mandibular bone.

Mandibular canal (MC).

A team of 5 experienced dentists, each with over 5 years of experience analysing CBCT imaging, identified tooth numbers and manually segmented these structures. A board-certified Radiologist with over 10 years of expertise reviewed all manual segmentations before AI model development and adjusted occasional cases of over- and under-segmentation.

2.1.2 Dataset 2 - Clinical Evaluation

A separate control evaluation set of 85 deidentified CBCT scans, within the same range of acquisition parameters and manufacturers, was selected for this stage. The inclusion criteria for the control evaluation set comprised patients referred for CBCT assessment following initial M3M screening with conventional radiographic methods, which indicated a complex and specific clinical question²⁰.

Three experienced clinicians participated in this stage of the study: a lecturer and specialist in Dentomaxillofacial Radiology (with over 15 years of experience), a lecturer and specialist in Dentomaxillofacial Radiology (with over 10 years of experience), and a lecturer in Oral Surgery (with over 5 years of experience). The exclusion criteria were as follows:

(a) M3Ms showing immature development - wide open apices.

(b) M3Ms showing evidence of caries lesions in dentine.

Following the exclusion criteria, 51 CBCT volumes were used for calibration and analysis.

2.2 Development of the Hybrid DL and Rule-Based (HDLRB) Tool

2.2.1 DL-based Segmentation of Relevant Anatomical Structures

A retrospective development and validation design was used to build AI–based segmentation models for relevant anatomical structures on CBCT data. The models were created using two different approaches for teeth and mandibular segmentations.

A total of 587 scans were used for training and validation of tooth segmentations, and 37 scans were used for testing. This process employed a cascade model consisting of two steps:

1) Heatmap regression for coarse tooth localisation and identification.

2) Per-tooth structure segmentation (crown and root) using a customised model from the nnU-Net framework²¹, a U-Net-based implementation tailored for medical imaging with automatic architecture optimisation²².

A total of 84 scans were utilised for training and validation of Mandible and MC segmentations, and 8 scans were used for testing. An optimised nnU-Net architecture was configured to target specific functional groups relative to the Mandibular Bone and MC. Specialised preprocessing, including image standardisation and rescaling, was applied to the input image, followed by a 3D U-Net to detect the trabecular bone as a reference landmark. Based on the position of this reference landmark within the input image, a distinct crop was generated for each landmark group and subsequently segmented with a dedicated 3D U-Net. The purpose of this approach is to reduce the input size to the 3D U-Net for enhanced efficiency. Following the segmentation, the results underwent post-processing, including artifact removal and smoothing. Figure 1 illustrates the workflow of the CBCT segmentation process.

Fig. 1

Workflow of the CBCT segmentation models.

Data augmentation strategies were employed during training to expand the dataset and enhance its diversity, thereby improving the model’s generalizability and robustness.

The output of the system includes voxel masks representing the segmentation results, providing a visual representation of the identified structures within the CBCT images. Furthermore, the system generates STL mesh files, offering a three-dimensional representation of the segmented dental anatomy. Figure 2 illustrates the segmentation output from a representative CBCT scan.

Fig. 2

Output of the segmentation of a CBCT scan. Some structures were made translucent to highlight the relevant landmarks.

2.2.3 Spatial Analysis Techniques

Following the guidelines from relevant studies in the field^20,23,24,25,26, the spatial analytic module consists of three main components: angle of impaction, bony impaction, and MC contact. It also requires a pre-processing step that transforms meshes into point clouds and calculates the main axis (orientation) of each tooth. The orientation is determined using principal component analysis (PCA) by centring the points, computing the covariance matrix, and extracting the principal axis as the eigenvector with the largest eigenvalue. This axis is then symmetrically extended from the geometric centre of the point cloud and is reused in the subsequent processing steps.
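The PCA step described above can be sketched as follows. This is a minimal illustration and not the study's implementation: it assumes the tooth is available as an (N, 3) array of point coordinates, and the function names `principal_axis` and `axis_segment` are hypothetical.

```python
import numpy as np


def principal_axis(points):
    """Return (centroid, unit axis) of an (N, 3) point cloud via PCA."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    centred = pts - centroid                    # centre the points
    cov = np.cov(centred, rowvar=False)         # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    axis = eigvecs[:, -1]                       # eigenvector of the largest eigenvalue
    return centroid, axis / np.linalg.norm(axis)


def axis_segment(centroid, axis, half_length):
    """Extend the axis symmetrically about the geometric centre."""
    return centroid - half_length * axis, centroid + half_length * axis


# Toy point cloud elongated along z (illustrative data only).
cloud = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in range(-5, 6)]
centroid, axis = principal_axis(cloud)
start, end = axis_segment(centroid, axis, half_length=5.0)
```

The sign of a PCA eigenvector is arbitrary, so a real pipeline would need an additional convention (for example, orienting the axis from root to crown) before computing signed angles.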

Determining Angle of Impaction

The angle of impaction (mesioangular, distoangular, vertical, or horizontal) is determined from the angular discrepancy between the primary axis of the third molar and that of the second molar, following the classification shown in Fig. 3, adapted from Yalmaz et al.²⁷. Precise angular measurement offers a method that is considerably more accurate than human visual estimation.

Fig. 3

A. An illustration of Winter’s classification of M3M angles of impaction, adapted from Yalmaz et al.²⁷. Vertical impaction: the long axis of the M3M is parallel to the long axis of the second molar (from 10 to −10°); mesioangular impaction: the impacted tooth is tilted toward the second molar in a mesial direction (from 11 to 79°); horizontal impaction: the long axis of the M3M is horizontal (from 80 to 100°); distoangular impaction: the long axis of the M3M is angled distally/posteriorly away from the second molar (from −11 to −79°); others (from 101 to −80°). B. Sample angle of impaction: 95.52 degrees - horizontal impaction.
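The angle ranges in the caption of Fig. 3 can be expressed as a small lookup. The sketch below is illustrative only: the function name `classify_winter` is hypothetical, the input is assumed to be a signed angle in degrees (positive meaning mesial tilt), and the handling of non-integer angles that fall between the integer bounds in the caption (for example 10.5°) is an assumption.

```python
def classify_winter(angle_deg: float) -> str:
    """Map a signed M3M-to-second-molar angle (degrees) to a Winter class."""
    a = (angle_deg + 180.0) % 360.0 - 180.0   # normalise to [-180, 180)
    if -10.0 <= a <= 10.0:
        return "vertical"
    if 10.0 < a < 80.0:                       # assumption: gap 10-11 treated as mesioangular
        return "mesioangular"
    if 80.0 <= a <= 100.0:
        return "horizontal"
    if -80.0 < a < -10.0:                     # assumption: gap -10 to -11 treated as distoangular
        return "distoangular"
    return "other"                            # from 101 round to -80 degrees


sample = classify_winter(95.52)               # the sample angle from Fig. 3B
```

With these boundaries the sample angle of 95.52° from Fig. 3B falls in the horizontal class, in agreement with the caption.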

Determining Type of Impaction

Bony impaction of M3Ms is determined when the tooth completes its development while failing to reach functional occlusal height, remaining either partially or fully embedded in the alveolar bone. Accurate assessment of bony impaction requires both spatial and positional analyses relative to the surrounding bone structure. Two critical criteria are considered: the overlap between the bone and the molar, and the relative distance between the two structures. The quantitative assessment of impaction involves two main evaluations:

(b.1) Intersection and Overlap

The first criterion is to check if the M3M is inside or overlaps the mandibular bone, as opposed to the alveolar bone mesh. The results are quantified as percentages of points inside and overlapping the mandible. Such a percentage value is used as a reference, not a hard condition. To optimise the algorithm speed in this criterion, the outermost points from the tooth’s point cloud are extracted by slicing along the gravity axis to obtain the farthest points.

(b.2) Vertical Distance Calculation

Clinicians often rely on the vertical distance between the highest point of the M3M crown and the alveolar bone margin to assess the position of the tooth relative to the mandible. This metric provides a clear indication of whether the tooth is fully or partially impacted.

A threshold of ≥ 1 mm above the alveolar bone margin was established by consensus among the expert clinical researchers participating in the study to classify the tooth as either partially or fully erupted. A distance of less than 1 mm is classified as full bony impaction. We follow the same rule while considering additional rare cases. To differentiate between partial and full eruption, the percentage of crown enclosed at the mandible intersection is used, with an empirical cut-off of 10%.

To automate this evaluation, the vertical distance is calculated using projection analysis. The methodology involves:

- Defining the central axis of the tooth: this axis serves as the reference for projections and spatial alignment.

- Extracting relevant points: the crown points and the mandible points within the tooth's bounding box are identified for analysis. This ensures that the calculation considers only points spatially close to the tooth, thereby minimising computational effort.

- Projection onto the gravity axis: both crown points and mandible points are projected onto the gravity axis for spatial alignment.

- Calculating the vertical distance: the vertical distance is computed as the difference between the highest projection value of the crown (representing the highest point of the crown along the gravity axis) and the highest projection value of the mandible (representing the alveolar bone margin).

- When the growth direction of M3M is abnormal, we need to consider the intersection between the tooth and the corresponding mandible bone instead of the distance between projections.
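The projection steps listed above can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `vertical_distance` is hypothetical, the bounding box is assumed to be axis-aligned, and `up_axis` is assumed to point toward the occlusal side so that larger projections mean "higher". The abnormal-orientation case in the last bullet is not handled here.

```python
import numpy as np


def vertical_distance(crown_pts, tooth_pts, mandible_pts, up_axis, margin=0.0):
    """Highest crown projection minus highest nearby-bone projection."""
    crown = np.asarray(crown_pts, dtype=float)
    tooth = np.asarray(tooth_pts, dtype=float)
    bone = np.asarray(mandible_pts, dtype=float)
    up = np.asarray(up_axis, dtype=float)
    up = up / np.linalg.norm(up)

    # Keep only mandible points inside the tooth's (axis-aligned) bounding box.
    lo, hi = tooth.min(axis=0) - margin, tooth.max(axis=0) + margin
    near = bone[np.all((bone >= lo) & (bone <= hi), axis=1)]
    if near.size == 0:
        raise ValueError("no mandible points inside the tooth bounding box")

    # Project both sets onto the gravity axis and compare the highest values.
    return float((crown @ up).max() - (near @ up).max())


# Toy coordinates in mm (illustrative only).
tooth = [[0, 0, -10], [0, 0, 5.4], [1, 1, 0], [-1, -1, 0]]
crown = [[0, 0, 5.4], [0, 0, 3.0]]
mandible = [[0.5, 0.5, 0.0], [-0.5, 0.0, -2.0], [50, 50, 20]]
d_mm = vertical_distance(crown, tooth, mandible, up_axis=[0, 0, 1])
```

In this toy example the distant mandible point at (50, 50, 20) is discarded by the bounding-box filter, so it does not distort the estimate of the bone margin.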

The criteria for the classification of bony impaction are shown in Table 1, and an example of a fully erupted case is shown in Fig. 4.

Table 1
Bony impaction classification criteria. $D$ represents the vertical distance between the highest point of the M3M crown and the alveolar bone margin (mm), $P$ represents the percentage overlap/intersection of the M3M with the mandibular bone (%), and $E$ represents the percentage of enamel enclosed in the mandibular bone (%).
Rule ID	Classification	Condition
1	Full Bony Impaction	IF $D < 1\ \mathrm{mm}$ OR $P \ge 97\%$
2	Fully Erupted	IF $P < 97\%$ AND $D \ge 1\ \mathrm{mm}$ AND $E < 10\%$
3	Partially Erupted	IF $P < 97\%$ AND $D \ge 1\ \mathrm{mm}$ AND $E \ge 10\%$
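The three rules of Table 1 translate directly into a conditional. The sketch below is illustrative: the function name `classify_bony_impaction` and its argument names are hypothetical, while the thresholds (1 mm, 97%, 10%) are those given in Table 1.

```python
def classify_bony_impaction(d_mm: float, p_pct: float, e_pct: float) -> str:
    """Apply the Table 1 rules to D (mm), P (%) and E (%)."""
    if d_mm < 1.0 or p_pct >= 97.0:      # Rule 1
        return "full bony impaction"
    if e_pct < 10.0:                     # Rule 2 (P < 97% and D >= 1 mm hold here)
        return "fully erupted"
    return "partially erupted"           # Rule 3


# Values reported for the sample cases in Figs. 4 and 5.
case_fig4a = classify_bony_impaction(d_mm=5.4, p_pct=60.87, e_pct=5.20)
case_fig4c = classify_bony_impaction(d_mm=0.75, p_pct=79.64, e_pct=59.92)
# Fig. 5 does not report E; Rule 1 fires on P alone, so E is a placeholder.
case_fig5 = classify_bony_impaction(d_mm=10.11, p_pct=98.89, e_pct=0.0)
```

Because Rule 1 is evaluated first, the inverted tooth of Fig. 5 is classified by its overlap percentage even though its projected distance is large.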

Fig. 4

A. Point cloud visualisation of a sample M3M. The spatial analytical results: P = 60.87%, D = 5.4 mm, and E = 5.20%. It is classified as “fully erupted”. B. Visualisation of the vertical distance represented by the highest projection line of the M3M crown (red) as a yellow line and the alveolar bone margin (purple line) at 5.4 mm. C. Point cloud visualisation of a sample M3M. The spatial analytic result: P = 79.64%, D = 0.75 mm, and E = 59.92%. It is classified as “full bony impaction”. D. Visualisation of the vertical distance represented by the highest projection point of the M3M crown (yellow line) and the alveolar bone margin (purple line).

An abnormal case is shown in Fig. 5. The position of the M3M is inverted; therefore, the distance measurement that relies on the highest projection points of the crown and mandible will fail. Consequently, we use the percentage overlap/intersection of the M3M with the mandible as the primary criterion.

Fig. 5

A. Point cloud visualisation of an abnormal M3M. The spatial analytic result: P = 98.89%, D = 10.11 mm. It is classified as “full bony impaction”. B. Visualisation of the vertical distance represented by the highest projection line of the M3M crown (red) as a yellow line and the alveolar bone margin (purple line). In this case, the highest points are opposite to the adjacent tooth; therefore, such a distance measurement is not valid.

The relationship between the M3M and the IAN is a critical factor in surgical planning. A comprehensive 3-year audit of treatment decisions for M3Ms concluded that the absence of bony separation between the M3M and the MC on CBCT was the only sign with a highly significant impact on the decision to perform a coronectomy instead of full removal of the tooth (odds ratio 56.8, p < 0.001). In other words, when no bony separation between the M3M and the MC was observed on CBCT, the odds of the surgeon opting for a coronectomy increased by a factor of approximately 57²⁰.

To evaluate potential contact, Euclidean distance maps were generated to compute the minimum distance between the tooth root and the MC. The shortest Euclidean distance between the M3M and the MC was computed using a k-d tree nearest-neighbour search, which efficiently finds the closest MC point for each M3M point. The minimum distance is computed as:

$d_{\min} = \min_{i,j} \lVert \mathbf{T}_i - \mathbf{C}_j \rVert$

where

$d_{\min}$ is the minimum distance between the M3M ($\mathbf{T}$) and the MC ($\mathbf{C}$).

$\mathbf{T}_i$ represents a point on the root.

$\mathbf{C}_j$ represents a point on the MC.

$\lVert \cdot \rVert$ denotes the Euclidean norm.

A threshold was set to determine direct contact between the M3M and the MC. If the minimum distance was below the predefined threshold, the structures were considered to be in contact. The threshold is set as an empirical value of 0.2 mm. Figure 6 illustrates cases of different relationships between M3M and the MC.

Fig. 6

A. The minimum distance between the MC and the M3M is 0.01 mm; therefore, the MC and M3M are in contact. B. The minimum distance between the MC and the M3M is 0.11 mm; therefore, the MC and M3M are in contact. C. The minimum distance between the MC and the M3M is 3.24 mm; therefore, the MC and M3M are not in contact. D. The minimum distance between the MC and the M3M is 1.38 mm; therefore, the MC and M3M are not in contact.
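The nearest-neighbour search and contact test can be sketched as below. This is an illustration under stated assumptions rather than the study's code: point coordinates are assumed to be in millimetres, the names `min_distance_mm` and `in_contact` are hypothetical, and SciPy's `cKDTree` stands in for whichever k-d tree implementation was actually used. A brute-force fallback is included only so the sketch still runs where SciPy is unavailable.

```python
import numpy as np

try:
    from scipy.spatial import cKDTree
except ImportError:          # fallback keeps the sketch runnable without SciPy
    cKDTree = None

CONTACT_THRESHOLD_MM = 0.2   # empirical threshold stated in the text


def min_distance_mm(tooth_pts, canal_pts):
    """Minimum Euclidean distance between two point clouds."""
    tooth = np.asarray(tooth_pts, dtype=float)
    canal = np.asarray(canal_pts, dtype=float)
    if cKDTree is not None:
        # For each tooth point, distance to its nearest canal point.
        nearest, _ = cKDTree(canal).query(tooth, k=1)
        return float(nearest.min())
    # Brute-force equivalent: all pairwise distances.
    diffs = tooth[:, None, :] - canal[None, :, :]
    return float(np.linalg.norm(diffs, axis=2).min())


def in_contact(tooth_pts, canal_pts, threshold=CONTACT_THRESHOLD_MM):
    """Contact is declared when the minimum distance is below the threshold."""
    return min_distance_mm(tooth_pts, canal_pts) < threshold


# Toy clouds reproducing two of the distances quoted in Fig. 6.
root = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
canal_close = [[0.0, 0.0, 1.11], [5.0, 5.0, 5.0]]   # 0.11 mm away
canal_far = [[0.0, 0.0, 4.24]]                      # 3.24 mm away
```

With the 0.2 mm threshold, the first toy canal is flagged as in contact and the second is not, matching panels B and C of Fig. 6.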

2.2.4 Output of the decision

The results from the logic layer of the spatial analysis module, specifically for angle/type of impaction and contact with the MC, relevant to each CBCT and M3M, were scripted and exported in .csv format. Based on the analysis of relevant studies in the field^4,5,20,23,24, three treatment decisions were suggested:

1a. Consider coronectomy due to the complex relationship between the M3M and IAN.

1b. Consider coronectomy due to the complex relationship between the M3M and the IAN. Horizontal impaction increases surgical complexity and the risk of nerve injury during crown sectioning. Sectioning the coronal tissue in multiple planes is advised²⁴.

2. Consider complete surgical removal as the M3M does not directly contact the IAN.

3. If no symptoms or risk to the adjacent molar, consider periodic radiographic monitoring.

The final output of the decision was calculated by a conditional function following the decision criteria, illustrated by Fig. 7 below.

Fig. 7

Decision flowchart illustrating the HDLRB system reasoning structure.
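A conditional function of this kind might look like the sketch below. The branching of Fig. 7 is not reproduced in the text, so everything about the control flow here is an assumption made for illustration: in particular, the order in which the branches are tested and the `monitoring_indicated` input, which is a hypothetical flag standing in for whatever criterion the flowchart uses to select monitoring. Only the recommendation texts and the horizontal-impaction variant (1b) come from the list above.

```python
RECOMMENDATIONS = {
    "1a": "Consider coronectomy due to the complex relationship between the M3M and IAN.",
    "1b": ("Consider coronectomy due to the complex relationship between the M3M and the IAN. "
           "Horizontal impaction increases surgical complexity and the risk of nerve injury "
           "during crown sectioning."),
    "2": "Consider complete surgical removal as the M3M does not directly contact the IAN.",
    "3": "If no symptoms or risk to the adjacent molar, consider periodic radiographic monitoring.",
}


def recommend(contact_with_mc: bool, angle_class: str, monitoring_indicated: bool) -> str:
    """Return a recommendation code; the branch order is an assumption."""
    if contact_with_mc:
        # Contact with the MC leads to coronectomy; horizontal cases get variant 1b.
        return "1b" if angle_class == "horizontal" else "1a"
    if monitoring_indicated:   # hypothetical flag, not defined in the text
        return "3"
    return "2"


code = recommend(contact_with_mc=True, angle_class="horizontal", monitoring_indicated=False)
message = RECOMMENDATIONS[code]
```

Keeping the recommendation texts in a lookup separate from the branching logic makes each output traceable to the rule that produced it, which is the property the logic layer is meant to provide.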

2.3 Clinical Validation of the Tool

A comparative analysis in a retrospective observational design was used in the clinical validation of the HDLRB tool, verifying differences in decisions regarding the surgical management of M3Ms referred for CBCT assessment between clinicians and the proposed AI-based system.

A total of 51 deidentified CBCT scans were included in the control evaluation set. The examiners underwent calibration using 10 CBCT scans (15 molars) to verify interobserver agreement. Using 3D Slicer, each examiner independently evaluated the scans and provided recommendations most likely to reflect the surgical management, as defined below.

Treatment Recommendations:

1. Consider coronectomy due to the complex relationship between the M3M and IAN.

2. Consider complete surgical removal as the M3M does not directly contact the IAN.

3. Consider periodic radiographic monitoring.

Fleiss' κ was used for overall agreement among multiple raters. Acceptable reliability was defined as a value of 0.75 or greater. Intraexaminer scores reached an average κ of 0.93 (0.91–0.95) during calibration.

The remaining 41 CBCT volumes (corresponding to 49 M3Ms) were analysed in the clinical evaluation stage, following the same criteria and using the same software employed in the calibration stage. Disagreements in cases were resolved by consensus in order to produce a gold standard. The gold standard was compared to the outputs of the HDLRB tool.

2.4 Statistical Analysis

Statistical analyses were performed using SPSS version 24.0 (IBM Corp., Armonk, NY, USA).

The segmentation performance of the deep learning models for each anatomical structure was assessed by comparing the predictions with the manually delineated ground truth and was evaluated using Dice Similarity Coefficient (DSC). Accuracy, Recall, Precision, Intersection over Union (IoU) and F1 scores were utilised for the evaluation of the tooth identification task.
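For reference, the DSC between a predicted and a ground-truth mask is twice the overlap divided by the sum of the two mask sizes. The sketch below is a minimal illustration with a hypothetical function name `dice_coefficient`; returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask) -> float:
    """DSC = 2 |A ∩ B| / (|A| + |B|) for two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    total = pred.sum() + true.sum()
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * np.logical_and(pred, true).sum() / total)


# Toy 1D masks: one shared voxel, sizes 2 and 1, so DSC = 2 * 1 / 3.
dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```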

Fleiss' κ was used to assess the overall agreement among multiple raters evaluating the control set (95% bootstrap confidence interval). Disagreements in cases were resolved by consensus in order to produce a gold standard.
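As an illustration of the agreement statistic, Fleiss' κ can be computed from a table giving, for each case, the number of raters who chose each category. The sketch below uses the standard formula with a hypothetical function name `fleiss_kappa` and toy data; it does not reproduce the SPSS analysis or the bootstrap confidence interval.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] = raters assigning case i to category j."""
    n_cases = len(counts)
    n_raters = sum(counts[0])            # assumes the same number of raters per case
    n_cats = len(counts[0])

    # Proportion of all assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / (n_cases * n_raters) for j in range(n_cats)]
    # Observed agreement for each case.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]

    p_bar = sum(p_i) / n_cases           # mean observed agreement
    p_e = sum(p * p for p in p_j)        # agreement expected by chance
    return (p_bar - p_e) / (1.0 - p_e)


# Toy data: 3 raters, 2 categories, 4 cases (two unanimous, two split 2-1).
toy = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy)
```

In the toy table the mean observed agreement is 2/3 and chance agreement is 1/2, giving κ = 1/3.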

The performance metrics relative to the decisions of the HDLRB tool compared to the gold standard were calculated using True Positive (TP), False Positive (FP), False Negative (FN), True Negative (TN), Sensitivity, Specificity, Precision, Negative Predictive Value, and F1 scores. All scores used a 95% Confidence Interval (CI) for each individual class of decision (coronectomy, surgical removal, or monitoring). Macro averages of all decisions combined, as well as an overall accuracy, are also provided.

3. Results

3.1 Segmentation and Tooth Identification Performance

Table 2 presents the results for segmentation and tooth identification performance. Mandibular bone and MC DSC values were close to or above 0.9 for most cases, averaging 0.96 for mandibular bone and 0.87 for MC. Tooth structures reached an average DSC of 0.93 for root and 0.90 for crown. Tooth detection was evaluated on 37 CBCT images comprising a total of 921 teeth. Overall, the model detected 923 teeth, of which 920 were labelled with the correct tooth number, for an accuracy of 0.96. The corresponding precision, recall and F1 scores were all 0.99, demonstrating the model’s capability to accurately detect teeth in CBCT scans.

Table 2
Task	Structure	Metric	Score
Identification (n = 923)	Teeth	Accuracy	0.96
		Precision	0.99
		Recall	0.99
		F1	0.99
		IoU	0.99
Segmentation	Root	DSC	0.93
	Crown	DSC	0.90
	Mandible	DSC	0.96
	MC	DSC	0.87
	Macro-average	DSC	0.92

3.2 Expert Consensus and Inter-Rater Reliability

Forty-one CBCT scans (totalling 49 M3Ms) were evaluated by three expert clinicians. Each case was initially categorised as requiring either coronectomy, complete removal, or monitoring based on the experts’ independent analyses. The examiners’ decisions were highly consistent and are shown in Table 3. Examiners A and B each recommended coronectomy in 26 cases and removal in 19, while Examiner C recommended coronectomy in 23 and removal in 22 (all three agreed on 4 cases for monitoring). The overall multi-rater agreement was almost perfect, with a Fleiss’ κ of 0.904 (95% CI: 0.797–0.977). In 94% of cases (46/49), all three examiners concurred on the management decision. Discrepancies were resolved through consensus discussion to establish the reference standard (ground truth) for model evaluation. The final consensus classified 24 teeth (49%) as requiring coronectomy, 21 (43%) as complete removal, and 4 (8%) as monitoring.

Table 3
Management decisions of expert examiners and inter-rater agreement. Per-examiner counts for coronectomy, removal, and monitoring across 49 M3Ms; multi-rater reliability summarised by Fleiss’ κ (bootstrap 95% CI).
Decision	Examiner A	Examiner B	Examiner C
Coronectomy	26	26	23
Monitor	4	4	4
Removal	19	19	22
Fleiss’ κ	0.904
95% CI (bootstrap):	0.797–0.977

3.3 HDLRB Performance

The HDLRB tool demonstrated high overall accuracy against the expert consensus standard. It correctly predicted 46 out of 49 cases, yielding an overall accuracy of 93.9% (95% CI 83.5–98.0%). Performance was robust across all three decision classes (Table 4). The HDLRB achieved 100% sensitivity (recall) for identifying cases that required coronectomy or monitoring, meaning it missed no cases that truly needed these interventions. Sensitivity for the removal class was slightly lower at 85.7%, reflecting that the model misclassified 3 of the 21 true removal cases (these were instead predicted as coronectomy). Specificity was high for all classes, ranging from 88.0% for coronectomy to 100% for removal and monitoring. The HDLRB’s precision (positive predictive value) was 88.9% for the coronectomy decision and a perfect 100% for removal and monitoring (no false positives occurred for the latter two). Accordingly, the negative predictive value was 100% for coronectomy and monitoring classes, as the HDLRB had no false negatives in those categories. In practical terms, when the HDLRB recommended a removal or monitoring, it was always correct, and when it ruled out coronectomy or monitoring, it never missed a needed intervention. All performance metrics are summarised in Table 4, with corresponding 95% CIs reflecting the uncertainty due to the modest sample size. Notably, the sensitivity of 85.7% for the removal class corresponds to a 95% CI of approximately 65%–95%, and the nominal 100% sensitivity for the rare “monitoring” class has a wide lower bound (0.51 in Table 4) due to the low number of such cases (n = 4). Despite these wider intervals, the model’s point estimates for each class fell within the experts’ performance range, and no significant performance drop was observed for any category. The overall macro-averaged performance was excellent, with a macro-average sensitivity of 95%, specificity of 96%, and precision of 96%.
The F1-scores (harmonic mean of precision and recall) were similarly high for all classes – approximately 94% for coronectomy, 92% for removal, and 100% for monitoring – yielding a macro-average F1 of about 95%.
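
The per-class and macro-averaged values reported above can be recomputed directly from the one-vs-rest counts in Table 4. The following is a minimal sketch, not the analysis code used in the study; the function and variable names are illustrative assumptions, and only the counts come from the manuscript.

```python
# Per-class counts (TP, FP, FN, TN) copied from Table 4.
COUNTS = {
    "coronectomy": (24, 3, 0, 22),
    "removal": (18, 0, 3, 28),
    "monitoring": (4, 0, 0, 45),
}
N_CASES = 49


def class_metrics(tp, fp, fn, tn):
    """Return one-vs-rest metrics for a single class.

    Assumes every denominator is non-zero, which holds for Table 4.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def macro_average(per_class):
    """Unweighted mean of each metric across classes."""
    names = next(iter(per_class.values())).keys()
    return {m: sum(c[m] for c in per_class.values()) / len(per_class) for m in names}


per_class = {name: class_metrics(*c) for name, c in COUNTS.items()}
macro = macro_average(per_class)
# Overall accuracy: correctly classified cases (sum of TP) over all cases.
accuracy = sum(c[0] for c in COUNTS.values()) / N_CASES

for name, metrics in per_class.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
print("macro", {k: round(v, 3) for k, v in macro.items()})
print("accuracy", round(accuracy, 3))
```

Rounded to two decimals, the macro averages from this sketch agree with the macro-average row of Table 4 (0.95, 0.96, 0.96, 0.97, 0.95).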

Table 4
Performance of the HDLRB system versus consensus reference (n = 49). Per-class counts (TP, FP, FN, TN) and derived metrics (sensitivity, specificity, precision, NPV, F1) with Wilson 95% confidence intervals; macro-averaged metrics; overall accuracy with 95% CI.
Class	TP	FP	FN	TN	Sensitivity (95% CI)	Specificity (95% CI)	Precision (95% CI)	NPV (95% CI)	F1
Coronectomy	24	3	0	22	1.00 (0.86–1.00)	0.88 (0.70–0.96)	0.89 (0.72–0.96)	1.00 (0.85–1.00)	0.94
Removal	18	0	3	28	0.86 (0.65–0.95)	1.00 (0.88–1.00)	1.00 (0.82–1.00)	0.90 (0.75–0.97)	0.92
Monitoring	4	0	0	45	1.00 (0.51–1.00)	1.00 (0.92–1.00)	1.00 (0.51–1.00)	1.00 (0.92–1.00)	1.00
Macro-average					0.95	0.96	0.96	0.97	0.95
Overall Accuracy 0.94 (0.83–0.98)

Figure 8 illustrates the confusion matrix of the model’s predictions versus the consensus ground truth. Of the 49 cases, 46 were correctly classified (the counts along the diagonal). The only misclassifications were the 3 consensus “removal” cases that the model predicted as “coronectomy.” This is visible in Fig. 8 as the off-diagonal count of 3 (actual Removal cases in the row, predicted as Coronectomy in the column). There were no misclassifications in the other directions (no cases requiring coronectomy or monitoring were predicted as another class). This pattern underscores the model’s tendency to err on the side of the more conservative treatment (coronectomy) in a few borderline removal cases. The consequences of these errors are reflected in the metrics: for the Removal class, the 3 false negatives lowered sensitivity to 85.7%, while for the Coronectomy class the same errors appear as false positives (yielding 88.9% precision).
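
The link between the confusion matrix and the per-class counts can be made explicit. The sketch below is illustrative only: the 3 × 3 matrix is reconstructed from the counts stated in the text (rows are consensus classes, columns are predicted classes), and the helper name `one_vs_rest_counts` is an assumption.

```python
CLASSES = ["coronectomy", "removal", "monitoring"]

# Rows: consensus (actual) class; columns: predicted class.
# Cell values reconstructed from the counts reported in the text.
CONFUSION = [
    [24, 0, 0],  # actual coronectomy
    [3, 18, 0],  # actual removal: 3 predicted as coronectomy
    [0, 0, 4],   # actual monitoring
]


def one_vs_rest_counts(matrix, index):
    """Return (TP, FP, FN, TN) for the class at `index`."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                 # actual class, predicted as another
    fp = sum(row[index] for row in matrix) - tp  # another class, predicted as this one
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


counts = {name: one_vs_rest_counts(CONFUSION, i) for i, name in enumerate(CLASSES)}
for name, c in counts.items():
    print(name, dict(zip(("TP", "FP", "FN", "TN"), c)))
```

The same three off-diagonal cases are counted once as false negatives for Removal and once as false positives for Coronectomy, which is why a single error pattern affects two different metrics.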

Fig. 8

Confusion matrix of the HDLRB pipeline versus the gold standard.

4. Discussion

Our results demonstrate that combining DL segmentations with rule-based spatial analysis can generate reliable recommendations, offering a robust and interpretable alternative for clinical decision-making in M3M management. Although expert-labelled data were used to develop the segmentation models, to the best of our knowledge this is the first study to generate AI-based simulations of clinical decisions in M3M surgical assessment without using expert annotations of the decisions themselves.

Recent literature has shown several applications of hybrid rule-based DL tools in healthcare¹⁶. Specifically, spatial analysis coupled with DL segmentations can increase interpretability without sacrificing accuracy^28,29. MacCormick et al.²⁸ produced a 15° cup-to-disc rim profile for glaucoma, achieving high discrimination using a dataset 100 times smaller than typically required for DL methods. In urology, a 3D ultrasound system automatically segmented anterior urethral strictures; its measurements of stricture length and fibrosis matched those from intraoperative and manual assessments, and the images were readily interpretable to surgeons²⁹. Like the results of our study, these examples illustrate how thoughtfully designed AI can augment clinical insight by presenting outputs in human-interpretable formats. Such technology enables decision support that clinicians can validate and act upon with confidence.

In the context of M3M surgery, our spatial analysis strategy aims to leverage robust and promptly assessable DL-based segmentation models of maxillofacial structures^6,30, while addressing interpretability limitations noted in prior studies. Our segmentation results met or surpassed the performance reported in multiple other studies, as shown by recent literature reviews and experimental designs analysing the same structures^6,8,9. Regarding treatment decisions, recent studies using DL models have achieved impressive accuracy in tasks such as automatically classifying impaction type and assessing extraction difficulty^31,32. For example, Balel et al. integrated the classic Pell & Gregory, Winter, and Pederson schemes into a single YOLOv11-based model, enabling automated classification of impaction position and difficulty with over 95% accuracy, comparable to expert judgment³¹. Similarly, CNN approaches have been used to predict M3M extraction difficulty scores (e.g., Pederson index) from panoramic radiographs, showing that AI can mirror expert assessments to a high degree³². However, these end-to-end models function primarily as black boxes: they output a classification or score without explaining the anatomical basis for the prediction. Consequently, even when a model reaches 95% accuracy, clinicians cannot verify how it arrived at a given conclusion or whether that conclusion is correct. This limitation undermines trust and hinders clinical adoption. In contrast, our strategy uses DL for a time-consuming but lower-complexity task, namely the segmentation of key anatomical structures, while delegating the higher-complexity task of clinical decision-making to spatial analysis. In line with our methodology, Carvalho et al. used a pipeline of DL models to detect the MC and classify the degree of third-molar root overlap in panoramic imaging¹⁸. 
Their proposed system achieved high accuracy and produced a diagnostic table recommending when CBCT should be obtained, based on combinations of nerve overlap and root development status. We expand on this idea by examining cases in which CBCT was necessary, specifically those in which initial screening with conventional radiographic methods indicated a complex and specific clinical question²³. By leveraging spatial analysis, our approach addresses three critical limitations of studies in the field: it offers interpretable outputs in textual and visual forms; it reduces the need for extensive, high-quality expert-annotated datasets by encoding rule-based logic; and it addresses the unavoidable class imbalance caused by limited samples in specific clinical conditions (such as the prescription of CBCT in M3M assessment). Our design strategy not only benefits M3M assessment but could also support decision-making in other oral surgery applications that depend heavily on 2D and 3D imaging, such as orthognathic surgery¹⁹, and in areas of clinical dentistry where diagnosis and treatment planning rely on dimensional relationships between anatomical structures of known morphological variability. These include orthodontics, prosthodontics, and periodontics, which could use similar spatial analysis rules organised in different configurations.
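
To illustrate what an interpretable, rule-based spatial check on segmentation output might look like, the sketch below tests two voxel masks for contact and reports the measurement behind the result. It is a hypothetical example and not the rule set used in this study: the mask representation (sets of integer voxel coordinates), the 0.3 mm isotropic voxel spacing, the 26-neighbourhood definition of contact, and all function names are assumptions made for illustration, and no treatment recommendation is derived.

```python
import math
from itertools import product

# 26-connected neighbourhood offsets (excluding the voxel itself).
NEIGHBOURS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def in_contact(mask_a, mask_b):
    """True if any voxel of mask_a overlaps or is 26-adjacent to mask_b."""
    if mask_a & mask_b:
        return True
    return any(
        (x + dx, y + dy, z + dz) in mask_b
        for (x, y, z) in mask_a
        for (dx, dy, dz) in NEIGHBOURS
    )


def min_distance_mm(mask_a, mask_b, spacing=(0.3, 0.3, 0.3)):
    """Brute-force minimum centre-to-centre distance between two masks."""
    return min(
        math.dist(
            [a * s for a, s in zip(va, spacing)],
            [b * s for b, s in zip(vb, spacing)],
        )
        for va in mask_a
        for vb in mask_b
    )


def describe_relationship(root_mask, canal_mask, spacing=(0.3, 0.3, 0.3)):
    """Return the rule outcome together with the measurement that supports it."""
    contact = in_contact(root_mask, canal_mask)
    distance = min_distance_mm(root_mask, canal_mask, spacing)
    finding = "direct contact" if contact else "no contact"
    return {
        "contact": contact,
        "min_distance_mm": round(distance, 2),
        "explanation": f"Root-canal relationship: {finding} "
                       f"(minimum distance {distance:.2f} mm).",
    }


# Toy masks: a short root column, one canal voxel diagonally adjacent, one far away.
root = {(5, 5, z) for z in range(3)}
canal_touching = {(6, 6, 2)}
canal_distant = {(10, 10, 2)}
print(describe_relationship(root, canal_touching))
print(describe_relationship(root, canal_distant))
```

Because every output carries both the rule outcome and the underlying measurement, a clinician can audit the finding against the images, which is the property that distinguishes this kind of check from an end-to-end classifier. The brute-force distance search is adequate only for toy masks; real volumes would need a distance transform or spatial index.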

Previous studies have highlighted imaging features, such as narrowing of the canal lumen and a canal positioned in a bend or groove within the root complex²⁵, as significant aspects of M3M management. Such features were observed in a few cases in our clinical evaluation sample; however, they did not influence treatment planning decisions in our study, as all such cases also showed direct contact between the M3M and the MC. We recognise that surgical management of M3Ms must consider factors beyond the detection of diagnostic imaging features, such as patient age, complex medical history, and patient preference²⁶. Our study nevertheless focused on the key aspects driving clinical decisions, as supported by recent literature on CBCT-based M3M management^20,23. The development of a comprehensive AI-based system for M3M management that takes medical history and patient preferences into account now appears achievable. To that end, further research may explore integrating an HDLRB tool with agentic hierarchical systems^33,34, as well as prospective studies targeting clinical outcomes, particularly in cohorts of junior clinicians or early-career specialists, who could translate the assistance provided by comprehensive AI-based systems into safer clinical practice.

There are several limitations to our study. Cases in which M3Ms presented with carious lesions within dentine, immature root development, or apical inflammatory lesions had to be excluded from our validation sample. Future systems may integrate AI tools that reliably detect such findings⁷, as these features call for different surgical treatment planning strategies. Our clinical validation sample is relatively small, particularly for cases in which M3Ms were fully impacted (n = 4). Obtaining larger samples is challenging, because CBCT should be reserved for cases in which initial screening with conventional radiographic methods indicates a complex and specific clinical question²³; in one study, CBCT changed the surgical treatment plan in only 12% of cases²⁵. Larger studies are needed to verify the reliability of the system's outputs. Nevertheless, the crucial element of our design is that it enables clinicians to critically analyse the output decision in an easily interpretable format, thereby offering reassurance when the system's recommendations differ from those of dentists and specialists.

Further investigation into the ethical ramifications of AI tools such as the HDLRB tool is warranted. Although a recent checklist for AI in dentistry addresses principles such as transparency, diversity, privacy, equity, solidarity, and governance, additional research is needed on the pillars pertinent to clinical applications³⁵. These encompass wellness, respect for autonomous decision-making, accountability, responsibility, prudence, and sustainable development^35,36. As suggested by Balel et al., regionally limited datasets may affect the performance of AI models in assessing third molars in underrepresented populations³¹. Although all subjects in our study come from a single country, the robustness and generalisability of our system depend solely on the segmentation model's capacity to identify anatomy, which we consider a favourable scenario for future validation. By providing clinicians with easily explained, auditable outputs, this study offers a way to increase trustworthiness, usability, and transparency in AI applications in healthcare, as also highlighted by relevant working groups³⁷.

In conclusion, our study demonstrates that an HDLRB tool can effectively simulate the clinical decision-making process for the surgical management of M3Ms as observed on CBCTs without the need for expert-labelled data. This alternative offers a fully interpretable automated solution that addresses clinicians' concerns about ‘black box’ predictions in high-risk situations, thereby facilitating clinical adoption.

Author Contributions Statement

E.L.D., X.L., K.L., and S.L. conceptualised the study. E.L.D., M.H., X.L., J.D.O., Sh.L., S.K. and C.S. designed the study. E.L.D., M.H. and K.R. performed the clinical evaluation. E.L.D. performed the statistical analysis. E.L.D., X.L., and J.D.O. analysed the data. E.L.D. and X.L. participated in writing the original draft of the manuscript. All authors revised and approved the final version of the manuscript to be published. D.W. and H.S. provided resources and supervised the study.

Acknowledgement

E.L.D. discloses support for this research from The University of Sydney’s Postgraduate Research Support Scheme (PRSS), which funded resources and consumables used in this work.

Additional Information

Competing Interests

H.S. and E.L.D. have received compensation as members of the Clinical and Scientific Advisory Board of Eyes of AI for projects outside the scope of this publication.

Legends

Table 1. Bony impaction classification criteria. $D$ represents the vertical distance between the highest point of the M3M crown and the alveolar bone margin (mm), $P$ represents the percentage overlap/intersection of the M3M with the mandibular bone (%), and $E$ represents the percentage of the crown enclosed in the mandibular bone (%).

Table 2. Summary of identification and segmentation performance. Values are means, rounded to two decimals.

Table 3. Management decisions of expert examiners and inter-rater agreement. Per-examiner counts for coronectomy, removal, and monitoring across 49 M3Ms; multi-rater reliability summarised by Fleiss’ κ (bootstrap 95% CI).

Table 4. Performance of the HDLRB system versus consensus reference (n = 49). Per-class counts (TP, FP, FN, TN) and derived metrics (sensitivity, specificity, precision, NPV, F1) with Wilson 95% confidence intervals; macro-averaged metrics; overall accuracy with 95% CI.

Data Availability

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

1.

Marciani, R. D. Third molar removal: An overview of indications, imaging, evaluation, and assessment of risk. Oral Maxillofac. Surg. Clin. N. Am. 19, 1–13 (2007).

2.

Renton, T., Smeeton, N. & McGurk, M. Factors predictive of difficulty of mandibular third molar surgery. Br. Dent. J. 190, 607–610 (2001).

3.

Korkmaz, Y. T., Kayıpmaz, S., Senel, F. C., Atasoy, K. T. & Gumrukcu, Z. Does additional cone beam computed tomography decrease the risk of inferior alveolar nerve injury in high-risk cases undergoing third molar surgery? Does CBCT decrease the risk of IAN injury? Int. J. Oral Maxillofac. Surg. 46, 628–635 (2017).

4.

Cilasun, U., Yildirim, T., Guzeldemir, E. & Pektas, Z. O. Coronectomy in Patients With High Risk of Inferior Alveolar Nerve Injury Diagnosed by Computed Tomography. J. Oral Maxillofac. Surg. 69, 1557–1561 (2011).

5.

Hatano, Y., Kurita, K., Kuroiwa, Y., Yuasa, H. & Ariji, E. Clinical Evaluations of Coronectomy (Intentional Partial Odontectomy) for Mandibular Third Molars Using Dental Computed Tomography: A Case-Control Study. J. Oral Maxillofac. Surg. 67, 1806–1814 (2009).

6.

Xiang, B., Lu, J. & Yu, J. Evaluating tooth segmentation accuracy and time efficiency in CBCT images using artificial intelligence: A systematic review and Meta-analysis. J. Dent. 146, 105064 (2024).

7.

Orhan, K., Bayrakdar, I. S., Ezhov, M., Kravtsov, A. & Özyürek, T. Evaluation of artificial intelligence for detecting periapical pathosis on cone-beam computed tomography scans. Int. Endod. J. 53, 680–689 (2020).

8.

Oliveira-Santos, N. et al. Automated segmentation of the mandibular canal and its anterior loop by deep learning. Scientific Reports 13, (2023).

9.

Farid Naufal, M., Fatichah, C., Renwi Astuti, E. & Hardani Putra, R. Deep Learning for Mandibular Canal Segmentation in Digital Dental Radiographs: A Systematic Literature Review. IEEE Access. 12, 76794–76815 (2024).

10.

Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Medicine 17, (2019).

11.

Maier-Hein, L. et al. Why rankings of biomedical image analysis competitions should be interpreted with care. Nature Communications 9, (2018).

12.

Unified Deep Learning Model for Multitask Reaction Predictions with Explanation. 10.1021/acs.jcim.1c01467.s001

13.

Gilpin, L. H. et al. Explaining Explanations: An Overview of Interpretability of Machine Learning. In IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA) 80–89 (IEEE, 2018).

14.

Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv.org (2022). https://arxiv.org/abs/2201.11903

15.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y. & Iwasawa, Y. Large Language Models are Zero-Shot Reasoners. arXiv.org (2022). https://arxiv.org/abs/2205.11916

16.

Kierner, S., Kucharski, J. & Kierner, Z. Taxonomy of hybrid architectures involving rule-based reasoning and machine learning in clinical decision systems: A scoping review. J. Biomed. Inform. 144, 104428 (2023).

17.

Jerjes, W. et al. Experience versus complication rate in third molar surgery. Head & Face Medicine 2, (2006).

18.

Carvalho, J. S. et al. Preinterventional Third-Molar Assessment Using Robust Machine Learning. J. Dent. Res. 102, 1452–1459 (2023).

19.

Dot, G., Schouman, T., Dubois, G., Rouch, P. & Gajny, L. Fully automatic segmentation of craniomaxillofacial CT scans for computer-assisted orthognathic surgery planning using the nnU-Net framework. Eur. Radiol. 32, 3639–3648 (2022).

20.

Matzen, L. H., Villefrance, J. S., Nørholt, S. E., Bak, J. & Wenzel, A. Cone beam CT and treatment decision of mandibular third molars: removal vs. coronectomy—a 3-year audit. Dentomaxillofacial Radiol. 49, 20190250 (2020).

21.

Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv.org (2015). https://arxiv.org/abs/1505.04597

22.

Isensee, F., Jaeger, P. F., Kohl, S. A. A., Petersen, J. & Maier-Hein, K. H. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods. 18, 203–211 (2020).

23.

Matzen, L. H. & Berkhout, E. Cone beam CT imaging of the mandibular third molar: a position paper prepared by the European Academy of DentoMaxilloFacial Radiology (EADMFR). Dentomaxillofacial Radiol. 48, 20190039 (2019).

24.

Gleeson, C. F., Patel, V., Kwok, J. & Sproat, C. Coronectomy practice. Paper 1. Technique and trouble-shooting. Br. J. Oral Maxillofac. Surg. 50, 739–744 (2012).

25.

Matzen, L. H., Christensen, J., Hintze, H., Schou, S. & Wenzel, A. Influence of cone beam CT on treatment plan before surgical intervention of mandibular third molars and impact of radiographic factors on deciding on coronectomy vs surgical removal. Dentomaxillofacial Radiol. 42, 98870341 (2013).

26.

Gady, J. & Fletcher, M. C. Coronectomy. Atlas Oral Maxillofacial Surg. Clin. 21, 221–226 (2013).

27.

Yilmaz, S., Adisen, M. Z., Misirlioglu, M. & Yorubulut, S. Assessment of Third Molar Impaction Pattern and Associated Clinical Symptoms in a Central Anatolian Turkish Population. Med. Principles Pract. 25, 169–175 (2015).

28.

MacCormick, I. J. C. et al. Accurate, fast, data efficient and interpretable glaucoma diagnosis with automated spatial analysis of the whole cup to disc profile. PLOS ONE. 14, e0209409 (2019).

29.

Feng, C. et al. Optimizing anterior urethral stricture assessment: leveraging AI-assisted three-dimensional sonourethrography in clinical practice. Int. Urol. Nephrol. 56, 3783–3790 (2024).

30.

Alahmari, M. et al. Accuracy of artificial intelligence-based segmentation in maxillofacial structures: a systematic review. BMC Oral Health 25, (2025).

31.

Balel, Y. & Sağtaş, K. Deep learning-based approach to third molar impaction analysis with clinical classifications. Scientific Reports 15, (2025).

32.

Yoo, J. H. et al. Deep learning based prediction of extraction difficulty for mandibular third molars. Scientific Reports 11, (2021).

33.

Nori, H. et al. Sequential Diagnosis with Language Models. arXiv.org (2025). https://arxiv.org/abs/2506.22405

34.

Chen, W. et al. RadFabric: Agentic AI System with Reasoning Capability for Radiology. arXiv.org (2025). https://arxiv.org/abs/2506.14142

35.

Rokhshad, R. et al. Ethical considerations on artificial intelligence in dentistry: A framework and checklist. J. Dent. 135, 104593 (2023).

36.

An Artificial Intelligence Code of Conduct for Health and Medicine. (2025). 10.17226/29087

37.

Cutillo, C. M. et al. Machine intelligence in healthcare—perspectives on trustworthiness, explainability, usability, and transparency. npj Digit. Med. 3, 1–5 (2020).

Table 2
Summary of identification and segmentation performance. Values are means, rounded to two decimals.
Task	Structure	Metric	Score
Identification (n = 923)	Teeth	Accuracy	0.99
		Precision	0.99
		Recall	0.99
		F1	0.99
		IoU	0.99
Segmentation	Root	DSC	0.93
	Crown	DSC	0.90
	Mandible	DSC	0.96
	MC	DSC	0.87
	Macro-average	DSC	0.92
