A Deep Learning-Based Ni Classification System for Laryngeal NBI Images: A Multicenter Diagnostic Study
Running title: DL for Ni Classification in Larynx
Jie-Lin Huang, MD1*, Li-Juan Li, MD2*, Ji-Qing Zhu, MD1, Li-Zhou Dou, MD1, Xue Zhang, MD1, Yu-Meng Liu, MD1, Yan Ke, MD1, Yu-Da Zhao, MD1, Mei-Ling Wang, MD3, Jian-Hui Wang, MD4, Quan-Mao Zhang, MD4, Xiao-Guang Ni, MD, PhD1
Affiliations:
1Department of Endoscopy, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100020, China;
2 Department of Otorhinolaryngology, The People's Hospital of Wenshan Prefecture, Yunnan, China;
3Department of Endoscopy, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital & Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shenzhen, 518116, China;
4Department of Endoscopy, Shanxi Province Cancer Hospital/Shanxi Hospital Affiliated to Cancer Hospital, Chinese Academy of Medical Sciences/Cancer Hospital Affiliated to Shanxi Medical University, Taiyuan, 030001, China;
Send correspondence to Xiao-Guang Ni.
Xiao-Guang Ni, MD, PhD, E-mail: nixiaoguang@126.com, Tel: +86 010-87787606, Postal code: 100021, Address: Department of Endoscopy, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, No.17 Panjiayuan South Lane, Chaoyang District, Beijing, China.
Jie-Lin Huang MD and Li-Juan Li MD contributed equally to this work.
Abstract
Background
Laryngeal squamous cell carcinoma accounts for more than 95% of laryngeal tumors. Early diagnosis is crucial for preserving laryngeal function and improving prognosis. Narrow Band Imaging (NBI) and the Ni classification system provide an important basis for early diagnosis; however, poor interobserver agreement limits their standardized clinical application. This study aimed to develop and validate the first deep learning system (DL-Ni) for Ni classification of laryngeal NBI images and to evaluate its effectiveness in improving diagnostic agreement among physicians with varying levels of experience.
Methods
This multicenter diagnostic study retrospectively collected 3,023 high-quality laryngeal NBI images to construct the dataset. A dual-branch collaborative learning architecture was developed, comprising a UNet++ semantic segmentation branch and an improved ResNet classification branch.
A randomized controlled crossover experiment was conducted to evaluate the improvement in diagnostic agreement among 12 physicians with different levels of experience under AI assistance.
Results
The developed DL-Ni system demonstrated robust performance in both internal and external validations, with accuracy of 0.858 (95% CI: 0.821–0.895) and 0.827 (95% CI: 0.813–0.841), respectively. AI assistance significantly improved interobserver diagnostic agreement: the Fleiss’ κ value increased from 0.488 to 0.685 (P < 0.05) in the junior physician group, and from 0.621 to 0.791 (P < 0.05) in the expert group.
Conclusion
This study is the first to develop and validate an automated deep learning system for Ni classification of laryngeal NBI images. The system significantly improved interobserver diagnostic consistency, offering an effective tool and solution for the standardized clinical application of NBI technology.
Keywords:
Narrow Band Imaging
Ni classification
deep learning
laryngeal cancer
interobserver agreement
Clinical trial number
Not applicable.
Background
Laryngeal squamous cell carcinoma (LSCC) accounts for over 95% of malignant laryngeal tumors, with 5-year survival rates ranging from 80–90% for early-stage disease (T1–T2) to approximately 60% for advanced-stage disease (1, 2). However, superficial lesions often present with non-specific changes under conventional white light imaging (WLI), such as mucosal roughness or erythema, which frequently lead to missed diagnoses.
Narrow Band Imaging (NBI) enhances visualization of mucosal microvascular patterns using specific wavelengths (415 nm and 540 nm), demonstrating superior sensitivity and specificity in detecting early laryngeal malignancies (3–7). The Ni classification system, established in 2011, standardizes NBI interpretation by categorizing intrapapillary capillary loop (IPCL) morphology into types I through V, providing objective criteria for diagnosing malignant and precancerous laryngeal lesions (8–11). Despite clear morphological criteria, clinical adoption faces a critical challenge: significant interobserver variability. Meta-analyses report NBI sensitivity ranging from 81% to 94% and specificity between 85% and 96%, with substantial heterogeneity across studies (12). Multiple independent studies demonstrate modest interobserver agreement (κ values 0.40–0.58), far below the clinical application threshold (κ > 0.8) (13–15). This inconsistency compromises diagnostic accuracy, potentially causing missed early malignancies or overtreatment of benign lesions.
Improving NBI classification consistency has traditionally relied on standardized training and experience accumulation. However, even after systematic training courses, interobserver κ value improvements rarely exceed 0.1–0.15 (16). Moreover, high-quality educational resources remain concentrated in tertiary hospitals, limiting widespread NBI adoption in primary care settings (17).
Artificial intelligence (AI), particularly deep learning (DL), offers transformative potential in medical image analysis by identifying complex patterns and providing objective, reproducible interpretations. This technology presents a promising solution to interobserver variability: DL models trained on expert-annotated NBI images can learn characteristic IPCL features, enabling standardized, automated classification. While existing studies demonstrate the feasibility of DL in laryngeal image analysis, no research has systematically validated AI efficacy in resolving NBI interpretative discrepancies among physicians (18–20).
This study aims to develop a deep learning system for automatic Ni classification of laryngeal NBI images (DL-Ni system) and systematically evaluate its effectiveness as an assistive tool through randomized controlled human-AI collaborative experiments. By assessing whether AI assistance improves diagnostic accuracy and consistency across physicians with varying experience levels, we seek to standardize NBI interpretation, reduce diagnostic variability, and optimize early precision management of laryngeal cancer and precancerous lesions.
Methods
This study was approved by the Institutional Review Board of the National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences (Approval No. 22/454–3656), and written informed consent was obtained from all participants. The study comprised two components: a retrospective phase for model development and internal validation, and a prospective phase for human-AI collaborative evaluation.
Data Collection and Study Population
Laryngeal NBI images were retrospectively collected from the Cancer Hospital and Institute of the Chinese Academy of Medical Sciences (CHCAMS) between January 2008 and March 2022 for model development and internal validation.
External validation data were obtained from the Shenzhen Hospital of the Chinese Academy of Medical Sciences (CHSZCAMS) and the Shanxi Provincial Cancer Hospital (SXPCH). All images and videos were captured using Olympus Medical Systems devices (CV-170/ENF-VT2, CV-290/BF-H290, and CV-260/BF-260).
The inclusion criteria for images were as follows: 1) clear visualization of IPCL structures with good vascular contrast, and absence of motion blur or equipment artifacts; 2) uniform and moderate illumination without overexposure or underexposure affecting the observation of vascular patterns; 3) a lesion area occupying ≥ 30% of the image to ensure sufficient diagnostic information; and 4) availability of a histopathological gold standard for diagnostic confirmation. Exclusion criteria included: 1) technical issues such as blurring, inaccurate focus, or artifacts caused by device malfunction; 2) biological interference including hemorrhage covering ≥ 30% of the field of view, or secretions and necrotic tissue covering ≥ 20% of the field; 3) imaging parameter issues such as excessive or insufficient illumination leading to inadequate IPCL contrast; and 4) lack of a definitive pathological diagnosis or unclear pathological findings.
Following stringent quality control screening, 3,308 high-quality NBI images from 1,421 patients were included after exclusion of 1,805 images (37.4%) from an initial pool of 4,828 candidate images. The primary reasons for exclusion were image artifacts (n = 699, 14.5%), biological interference (n = 808, 16.7%), and insufficient IPCL contrast (n = 298, 6.2%).
Image annotation was performed independently by three senior NBI specialists, each with over 10 years of NBI experience and an annual examination volume exceeding 1,500 cases. Using the Ni classification criteria, the experts precisely delineated lesion regions and assigned classifications via the LabelMe software. The initial interobserver agreement among the experts yielded a Fleiss' κ of 0.780 (95% CI: 0.756–0.804). Discrepant cases (n = 156, 5.1%) were resolved through panel discussion and re-evaluation to reach a final consensus. The histopathological diagnosis served as the gold standard.
Dataset Partition
The internal dataset (sourced from CHCAMS) consisted of 3,023 images, while the external validation set (from CHSZCAMS and SXPCH) contained 285 images. The internal dataset was randomly partitioned at the patient level into training (2,144 images), validation (578 images), and internal testing (301 images) subsets in a ratio of 7:2:1. Stratified sampling was employed to ensure a balanced distribution of Ni classification types across all subsets. The distribution of the dataset is shown in Fig. 1.
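For illustration, the patient-level, Ni-type-stratified split described above could be implemented along the following lines. This is a minimal sketch, not the authors' actual pipeline; the manifest file name and the columns patient_id, ni_type, and image_path are hypothetical, and the use of scikit-learn's StratifiedGroupKFold is our assumption.

```python
# Minimal sketch of a patient-level, Ni-type-stratified 7:2:1 split.
# Assumption: a CSV manifest with one row per image and hypothetical
# columns patient_id, ni_type, image_path.
import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

df = pd.read_csv("internal_dataset.csv")

# Ten folds keep every patient's images together while roughly balancing
# Ni-type frequencies; 7 folds -> train, 2 -> validation, 1 -> test.
sgkf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=42)
held_out = [test_idx for _, test_idx in
            sgkf.split(df, y=df["ni_type"], groups=df["patient_id"])]

test_idx = pd.Index(held_out[0])
val_idx = pd.Index(held_out[1]).union(held_out[2])
train_idx = df.index.difference(test_idx).difference(val_idx)

train, val, test = df.loc[train_idx], df.loc[val_idx], df.loc[test_idx]
```

Grouping by patient before stratifying prevents images of the same lesion from leaking across the training and test subsets.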
Fig. 1
Flowchart illustrating the development and validation process of the DL-Ni system.
Deep Learning Model Development
We proposed a dual-branch architecture that jointly performs semantic segmentation and image classification, designed with two dedicated sub-networks: one for pixel-wise lesion area segmentation and the other for image-level lesion type classification. Each network was optimized to enhance feature representation capability for its respective task.
For the semantic segmentation branch, we constructed a segmentation network based on the UNet++ framework, which utilizes nested skip connections to effectively integrate features across different semantic levels. To improve the model’s focus on critical regions, we incorporated the Convolutional Block Attention Module (CBAM), which adaptively emphasizes lesion-relevant features through channel and spatial attention while suppressing background interference. Deep supervision was introduced to enhance the discriminative power of intermediate features and facilitate gradient propagation. The loss function combines weighted Binary Cross-Entropy (BCE), Dice loss, and Focal loss, jointly optimizing pixel accuracy, region overlap, and hard example learning.
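A minimal PyTorch sketch of this combined objective is given below. The relative weights and the focal γ are placeholder defaults, since the exact values are not reported; with deep supervision, this loss would additionally be summed over the nested UNet++ side outputs.

```python
# Hedged sketch of the weighted BCE + Dice + Focal segmentation loss.
import torch
import torch.nn.functional as F

def combined_seg_loss(logits, target, w_bce=1.0, w_dice=1.0, w_focal=1.0,
                      gamma=2.0, eps=1e-6):
    """logits, target: (N, 1, H, W); target is a float binary lesion mask."""
    # Pixel-accuracy term.
    bce = F.binary_cross_entropy_with_logits(logits, target)

    # Region-overlap term (soft Dice, computed per sample then averaged).
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(2, 3))
    denom = prob.sum(dim=(2, 3)) + target.sum(dim=(2, 3))
    dice = 1 - (2 * inter + eps) / (denom + eps)

    # Hard-example term: the focal factor down-weights easy pixels.
    pt = prob * target + (1 - prob) * (1 - target)  # p if y=1 else 1-p
    focal = (-(1 - pt) ** gamma * torch.log(pt.clamp(min=eps))).mean()

    return w_bce * bce + w_dice * dice.mean() + w_focal * focal
```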
For the image classification task, we designed a dual-path network comprising a main branch and a lightweight branch. The main branch employs a 17-layer ResNet to extract high-level semantic and fine-grained structural features. The lightweight branch consists of a 19-layer convolutional network integrated with a CBAM attention module, guiding the model to focus on globally salient regions while significantly reducing parameter count and improving inference efficiency. Features from both branches are independently extracted and subsequently integrated within a feature fusion module to produce the final classification output. The classification network receives input as cropped RGB images of size 224×224. During training, data augmentation techniques including normalization, rotation, and color jittering were applied. To address class imbalance, a class-weighted cross-entropy loss was adopted. A phased training strategy was employed: the first 65 epochs froze the main branch and updated only the lightweight branch and fusion module, while the subsequent 65 epochs unfroze the entire network for end-to-end fine-tuning, totaling 130 epochs. The AdamW optimizer was used with an initial learning rate of 1×10⁻³, coupled with a dynamic decay scheduling strategy to enhance convergence in later stages. The architectural framework of the model is depicted in Fig. 2.
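The two-phase schedule can be sketched as follows. The module name model.main_branch, the data loader, the class-weight tensor, and the cosine decay schedule are illustrative assumptions; the text above specifies only AdamW, the 1×10⁻³ initial learning rate, class-weighted cross-entropy, and the 65 + 65 epoch freeze/unfreeze plan.

```python
# Sketch of the phased freeze/unfreeze training strategy described above.
# Assumed to exist: `model` (with a .main_branch sub-module), `train_loader`
# yielding (images, labels), and `class_weights` (1-D tensor, one weight
# per Ni type).
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=130)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

for epoch in range(130):
    # Epochs 0-64: main branch frozen, only the lightweight branch and
    # fusion module learn. Epochs 65-129: end-to-end fine-tuning.
    for p in model.main_branch.parameters():
        p.requires_grad = epoch >= 65

    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Frozen parameters accumulate no gradients, so the optimizer simply skips them during phase 1.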
Fig. 2
Architecture of the proposed DL-Ni system for Ni classification. (A) Overall network architecture. (B) The Convolutional Block Attention Module (CBAM) is integrated after each convolutional layer within the residual blocks. It consists of two sub-modules: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). The CAM utilizes both global average-pooled and max-pooled features, computes channel weights through a shared multi-layer perceptron (MLP), and thereby emphasizes informative channels. Meanwhile, the SAM aggregates channel information via average and max pooling operations to generate a spatial attention map, which is then processed by a convolutional layer to produce spatial weights, enabling the network to focus on critical regions.
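A generic CBAM block matching this description might look like the following PyTorch sketch. This is a textbook re-implementation rather than the authors' code; the reduction ratio and the 7×7 spatial kernel are common defaults, not reported values.

```python
# Generic CBAM: channel attention (shared MLP over avg/max pooled features)
# followed by spatial attention (conv over channel-pooled maps).
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(  # shared MLP of the channel attention module
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # CAM: sigmoid(MLP(avg-pool) + MLP(max-pool)) re-weights channels.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # SAM: conv over [channel-avg, channel-max] yields spatial weights.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```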
Human-AI Collaboration Experiment
A prospective, fully crossed, multi-reader study was conducted to quantify the improvement in diagnostic consistency among physicians with the assistance of the AI system. A total of 12 physicians from three centers were recruited in a stratified manner: 1) the expert group (n = 6), comprising physicians with over 10 years of NBI experience and an annual operation volume exceeding 500 cases; and 2) the junior group (n = 6), consisting of physicians with 1–3 years of NBI experience. Each physician evaluated 200 representative images randomly selected from the external validation set, covering all Ni classification types. The evaluation was performed in two independent sessions: during the baseline phase, only the NBI images and basic clinical information were provided; in the AI-assisted phase, real-time outputs from the DL-Ni system—including probability distribution of classification, confidence score, and Grad-CAM-generated heatmaps highlighting IPCL key regions—were added. A washout period of at least 4 weeks was implemented between the two sessions to mitigate memory effects. All assessments were conducted under standardized lighting conditions using calibrated medical displays. The order of image presentation was fully randomized across sessions to eliminate sequence effects. The primary endpoint was the inter-observer agreement (Fleiss’ κ) in Ni classification with and without AI assistance. Secondary endpoints included diagnostic accuracy, sensitivity, and specificity within each physician group.
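For reference, Grad-CAM heatmaps of the kind shown to readers in the AI-assisted phase can be generated roughly as below. The choice of target layer and the single-image interface are illustrative assumptions, not details from the study.

```python
# Hedged Grad-CAM sketch: weight a convolutional layer's activations by the
# spatially averaged gradients of the target class score.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """image: (1, 3, H, W); returns a heatmap in [0, 1] at the input size."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))

    model.eval()
    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    w = grads["g"].mean(dim=(2, 3), keepdim=True)         # channel weights
    cam = F.relu((w * acts["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```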
Statistical Analysis
Interobserver agreement was quantified using Fleiss’ κ with 95% confidence intervals. Model performance was evaluated in terms of sensitivity, specificity, accuracy, and receiver operating characteristic (ROC) curves, all reported with 95% CIs. The improvement in performance metrics before and after AI assistance was compared using paired t-tests or Wilcoxon signed-rank tests, as appropriate. All statistical analyses were performed using R version 4.3.3 and Python 3.9. A two-sided p-value < 0.05 was considered statistically significant.
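As a minimal sketch of the primary agreement analysis, Fleiss' κ can be computed with statsmodels; the bootstrap over images used here for the 95% CI is our assumption, since the paper does not state the interval method.

```python
# Fleiss' kappa with a percentile-bootstrap 95% CI over images.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def fleiss_kappa_ci(ratings, n_boot=2000, seed=42):
    """ratings: (n_images, n_raters) array of integer Ni-type codes."""
    table, _ = aggregate_raters(ratings)   # images x categories count table
    kappa = fleiss_kappa(table)

    rng = np.random.default_rng(seed)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(ratings), len(ratings))
        t, _ = aggregate_raters(ratings[idx])
        boots.append(fleiss_kappa(t))
    lo, hi = np.percentile(boots, [2.5, 97.5])
    return kappa, (lo, hi)
```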
Results
Data Distribution and Baseline Characteristics
Following rigorous quality control screening, a total of 3,308 high-quality NBI images were included in this study. The distribution across Ni classification types was as follows: Type I, 333 images (10.1%); Type II, 231 images (7.0%); Type III, 1,158 images (35.0%); Type IV, 135 images (4.1%); Type Va, 745 images (22.5%); Type Vb, 338 images (10.2%); and Type Vc, 368 images (11.1%). From CHCAMS, 1,270 patients were enrolled, constituting an internal dataset of 3,023 images. An external validation set of 285 images was formed from 151 patients recruited from CHSZCAMS and SXPCH. Detailed clinical characteristics of each dataset are summarized in Table 1.
Table 1
Summary of Demographic and Clinical Characteristics Across Study Cohorts
 
(Train, Validation, and Test are internal CHCAMS cohorts; SXPCH and CHSZCAMS are external cohorts. Age and sex rows describe patients; NBI category rows count images.)

Characteristic          | Train (n = 873) | Validation (n = 269) | Test (n = 128) | SXPCH (n = 77) | CHSZCAMS (n = 74)
Age, mean ± SD, years   | 46.6 ± 12.3     | 46.1 ± 12.4          | 47.1 ± 13.2    | 47.4 ± 13.9    | 47.8 ± 12.9
Female, n (%)           | 501 (41.7%)     | 165 (40.2%)          | 70 (42.3%)     | 51 (44.1%)     | 44 (45.3%)
Male, n (%)             | 372 (53.2%)     | 104 (55.5%)          | 49 (54.8%)     | 26 (53.9%)     | 30 (51.5%)
NBI category, n (%)
  All images            | 2144            | 578                  | 301            | 139            | 146
  I                     | 211 (9.84%)     | 57 (9.86%)           | 30 (9.97%)     | 12 (8.63%)     | 23 (15.75%)
  II                    | 154 (7.18%)     | 41 (7.09%)           | 18 (5.98%)     | 10 (7.19%)     | 8 (5.48%)
  III                   | 753 (35.12%)    | 205 (35.47%)         | 102 (33.89%)   | 47 (33.81%)    | 51 (34.93%)
  IV                    | 99 (4.62%)      | 22 (3.81%)           | 10 (3.32%)     | 3 (2.16%)      | 1 (0.68%)
  Va                    | 464 (21.64%)    | 133 (23.01%)         | 80 (26.58%)    | 30 (21.58%)    | 38 (26.03%)
  Vb                    | 222 (10.35%)    | 61 (10.55%)          | 26 (8.64%)     | 17 (12.23%)    | 12 (8.22%)
  Vc                    | 241 (11.24%)    | 59 (10.21%)          | 35 (11.63%)    | 20 (14.39%)    | 13 (8.90%)
Model Performance Evaluation
On the internal test dataset, the model demonstrated the following performance: overall accuracy 0.858 (95% CI: 0.821–0.895), sensitivity 0.809 (0.799–0.819), specificity 0.974 (0.965–0.983), AUC 0.967 (0.960–0.974) (Fig. 3), and mIoU 0.731 (0.711–0.751). For Type Va lesions, the accuracy was 0.940 (0.929–0.951) and sensitivity 0.873 (0.866–0.890). The model exhibited strong generalization capability on the external validation set, achieving an overall accuracy of 0.827 (0.813–0.841), sensitivity of 0.770 (0.765–0.775), specificity of 0.969 (0.958–0.980), AUC of 0.961 (0.955–0.967), and mIoU of 0.719 (0.704–0.734). For Type Va lesions in the external set, accuracy reached 0.918 (0.911–0.925) with a sensitivity of 0.836 (0.830–0.842). The classification performance across different center-specific datasets is detailed in Table 2. Figure 4 depicts the segmentation and prediction results of the DL-Ni system, along with corresponding heatmap visualizations.
Fig. 3
Receiver operating characteristic (ROC) curves demonstrating the diagnostic performance of the DL-Ni model across different datasets. (A) ROC curve for the internal test set from CHCAMS. (B) ROC curve for the external validation cohort from CHSZCAMS and SXPCH. (C) ROC curve for the video validation cohort, reflecting the model's performance in dynamic real-time scenarios.
Table 2
Evaluation of DL-Ni's Diagnostic and Segmentation Performance in Independent Cohorts
Validation Set       | Category     | Accuracy (95% CI)   | Sensitivity (95% CI) | Specificity (95% CI) | AUC (95% CI)        | IoU (95% CI)
Internal validation  | All category | 0.858 (0.821–0.895) | 0.809 (0.799–0.819)  | 0.974 (0.965–0.983)  | 0.967 (0.960–0.974) | 0.731 (0.711–0.751)
Internal validation  | Va           | 0.940 (0.929–0.951) | 0.873 (0.866–0.890)  | 0.959 (0.955–0.964)  | 0.957 (0.951–0.963) | 0.726 (0.708–0.742)
External validation  | All category | 0.827 (0.813–0.841) | 0.770 (0.765–0.775)  | 0.969 (0.958–0.980)  | 0.961 (0.955–0.967) | 0.719 (0.704–0.734)
External validation  | Va           | 0.918 (0.911–0.925) | 0.836 (0.830–0.842)  | 0.943 (0.937–0.949)  | 0.963 (0.959–0.967) | 0.721 (0.710–0.732)
Real-time validation | All category | 0.827 (0.811–0.843) | 0.751 (0.742–0.760)  | 0.967 (0.953–0.981)  | 0.971 (0.961–0.981) | 0.713 (0.698–0.728)
Real-time validation | Va           | 0.915 (0.899–0.931) | 0.830 (0.822–0.838)  | 0.947 (0.933–0.961)  | 0.968 (0.955–0.981) | 0.714 (0.695–0.733)
Fig. 4
Visualization of DL-Ni predictions, segmentation results, and heatmaps. This figure illustrates the model's ability to identify Ni classification types across various NBI lesion images, along with corresponding heatmaps indicating lesion probability. (A) Original NBI image. (B) Lesion boundaries annotated by laryngoscopy experts. (C) Lesion boundaries predicted by DL-Ni segmentation. (D) Heatmap prediction of the lesion region generated by DL-Ni. (E) Overlay visualization combining the predicted lesion boundaries and heatmap from DL-Ni.
To assess the model’s potential for real-time application, performance was evaluated on a test set containing 200 video clips. The model achieved an overall accuracy of 0.827 (0.811–0.843), sensitivity of 0.751 (0.742–0.760), specificity of 0.967 (0.953–0.981), AUC of 0.971 (0.961–0.981), and mIoU of 0.713 (0.698–0.728). For Type Va lesions, accuracy was 0.915 (0.899–0.931) and sensitivity 0.830 (0.822–0.838), confirming the model’s diagnostic capability in dynamic imaging scenarios. Demonstrations of real-time detection and Ni-type probability prediction during laryngoscopy are provided in Supplementary Videos S1 and S2.
Impact of AI Assistance on Physician Diagnostic Agreement
Results from the human-AI collaboration experiment demonstrated that AI assistance significantly improved the diagnostic performance of both physician groups. In the junior group, the diagnostic accuracy increased from 0.661 to 0.801 with AI support, and interobserver agreement (κ value) improved from 0.488 (moderate agreement) to 0.685 (substantial agreement) (P < 0.05). Among experts, diagnostic accuracy rose from 0.787 to 0.885, while the κ value increased from 0.621 (substantial agreement) to 0.791 (substantial agreement) (P < 0.05). For the clinically critical Type Va lesions, AI assistance also led to a marked improvement in recognition accuracy: the junior group improved from 0.652 to 0.834 (P < 0.05), and the expert group increased from 0.759 to 0.891 (P < 0.05). Detailed comparative data are presented in Table 3.
Table 3
Comparison of Sensitivity, Specificity, Accuracy, and Kappa for Junior and Senior Endoscopists With Versus Without AI Assistance
Group  | Condition  | Accuracy (95% CI)   | Sensitivity (95% CI) | Specificity (95% CI) | Kappa | P value (vs without AI)
Junior | Without AI | 0.661 (0.630–0.692) | 0.658 (0.625–0.691)  | 0.943 (0.939–0.949)  | 0.488 | –
Junior | With AI    | 0.801 (0.765–0.837) | 0.801 (0.762–0.840)  | 0.967 (0.961–0.972)  | 0.685 | < 0.05
Senior | Without AI | 0.787 (0.767–0.806) | 0.784 (0.758–0.810)  | 0.964 (0.961–0.968)  | 0.621 | –
Senior | With AI    | 0.885 (0.850–0.921) | 0.882 (0.842–0.922)  | 0.981 (0.975–0.987)  | 0.791 | < 0.05
Discussion
This study successfully developed and validated a deep learning system (DL-Ni) for the automated Ni classification of laryngeal NBI images. The system demonstrated excellent diagnostic performance in both internal and external validation (accuracy: 0.858 and 0.827; AUC: 0.967 and 0.961), with particularly high sensitivity in identifying Type Va lesions (0.873 and 0.836). Through a rigorously designed human-AI collaboration experiment, we confirmed that the DL-Ni system significantly improves interobserver agreement in Ni classification among practitioners with varying experience levels, with κ values increasing by 0.197 and 0.170 in the junior and expert groups, respectively, and both groups reaching substantial agreement. This effectively addresses a core bottleneck hindering the standardized clinical application of NBI technology.
The high subjectivity in NBI interpretation—specifically, the variability among physicians in judging the morphological features of intrapapillary capillary loops (IPCLs)—has remained a major challenge limiting its widespread clinical adoption (21–24). This interobserver variability directly leads to inconsistent diagnostic results, ultimately affecting the accuracy of clinical decision-making.
The DL-Ni system developed in this study represents a fundamental shift from the traditional subjective interpretation model toward a standardized and objective analytical paradigm. Through the automatic extraction and quantitative analysis of IPCL morphological features by deep convolutional networks, it transforms traditional qualitative descriptions into quantitative parameters. The system's outputs—including probability distribution of classification, confidence scores, and Grad-CAM-generated heatmaps—provide physicians with a transparent navigation tool for diagnostic decision-making, converting a process traditionally dependent on personal experience into one based on objective data and visual evidence (25). Compared to previous AI studies on laryngeal NBI, this work offers multiple innovations: it is the first to systematically develop and validate a deep learning system for automatic Ni classification of laryngeal NBI images; through a rigorous human-AI collaboration experiment, we quantitatively evaluated for the first time the improvement in inter-physician diagnostic consistency with AI assistance, demonstrating that its effect significantly surpasses traditional training methods; and the visual explanations and quantitative confidence levels enhance the transparency and interpretability of the diagnostic process, which is crucial for clinical acceptance and adoption of AI-assisted diagnosis (26–28).
This study has several limitations that should be addressed in future work. First, although multi-center external validation was performed, the training data were primarily acquired from Olympus endoscopy systems; devices from other manufacturers may differ in spectral characteristics and image quality, a limitation that could be mitigated through cross-device generalizability studies using techniques such as domain adaptation. Second, although model performance was tested on a video dataset, its ability to capture subtle mucosal movements in real time during dynamic imaging remains to be enhanced; this could be addressed by optimizing real-time video-stream analysis with algorithms that track IPCL morphological changes and hemodynamic features under subtle mucosal motion. Third, the histopathological gold standard used in this study may be affected by biopsy sampling error, especially for heterogeneously distributed lesions; multimodal integrated diagnosis, combining NBI with other imaging modalities, could compensate for the limitations of NBI in detecting deep tissue infiltration. Additionally, long-term, multi-center prospective cohort studies are needed to systematically evaluate the impact of AI-assisted NBI classification on clinical treatment decisions and patient outcomes.
In conclusion, this study demonstrates that the DL-Ni system not only achieves high diagnostic accuracy in Ni classification of laryngeal NBI images but, more importantly, significantly improves interobserver agreement among practitioners with varying experience levels. This breakthrough provides a novel technological pathway for the early and precise diagnosis of laryngeal cancer and precancerous lesions and is expected to promote the standardized use of NBI in clinical practice, ultimately contributing to organ-preserving treatment and improving long-term patient outcomes.
Conclusion
This study successfully developed and validated the first deep learning-based automated classification system (DL-Ni) for laryngeal NBI images and demonstrated its significant effectiveness in improving diagnostic agreement among physicians with varying levels of experience. The system achieved high-precision automatic classification of NBI images (accuracy exceeding 85%), and more importantly, markedly enhanced inter-observer diagnostic agreement, with κ values increasing by 0.17–0.20. These results provide an effective technical solution to overcome the key bottleneck in the standardized clinical application of NBI technology.
Our work offers a new technological pathway for the early and precise diagnosis of laryngeal cancer and precancerous lesions, with strong potential to promote the standardized adoption of NBI in clinical practice. Ultimately, this approach is expected to improve patient outcomes and healthcare quality. With continued technical optimization and further clinical validation, AI-assisted Ni classification could become a standardized tool in the diagnosis of laryngeal lesions.
Declarations
Ethics approval and consent to participate
This study was approved by the Institutional Review Board of the National Cancer Center/Cancer Hospital, Chinese Academy of Medical Sciences (Approval No. 22/454–3656), and written informed consent was obtained from all participants.
Consent for publication
Not applicable.
Data Availability
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request. The source code of the deep learning model is not publicly available at this time to protect intellectual property prior to patent filing but may be made available under a material transfer agreement.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by National High Level Hospital Clinical Research Funding (grant number LC2024A04) and CAMS Innovation Fund for Medical Sciences (CIFMS) (grant number 2022-I2M-C&T-B-059).
Author Contribution
Jie-Lin Huang and Li-Juan Li contributed equally to this work. Conceptualization: Xiao-Guang Ni, Jian-Hui Wang. Methodology: Jie-Lin Huang, Li-Juan Li, Ji-Qing Zhu. Software: Jie-Lin Huang, Li-Zhou Dou. Validation: Xue Zhang, Yu-Meng Liu, Yan Ke. Formal analysis: Yu-Da Zhao, Mei-Ling Wang. Investigation: All authors. Resources: Xiao-Guang Ni, Quan-Mao Zhang, Jian-Hui Wang. Data Curation: Li-Juan Li, Ji-Qing Zhu. Writing – Original Draft: Jie-Lin Huang, Li-Juan Li. Writing – Review & Editing: All authors. Visualization: Li-Zhou Dou, Xue Zhang. Supervision: Xiao-Guang Ni. Project administration: Xiao-Guang Ni. Funding acquisition: Xiao-Guang Ni. All authors read and approved the final manuscript.
Acknowledgements
Not applicable.
Electronic Supplementary Material
Supplementary Videos S1 and S2 are available with the online version of this article.
References
1. Chu EA, Kim YJ. Laryngeal cancer: diagnosis and preoperative work-up. Otolaryngol Clin North Am. 2008;41(4):673–95, v.
2. Marioni G, Marchese-Ragona R, Cartei G, Marchese F, Staffieri A. Current opinion in diagnosis and treatment of laryngeal carcinoma. Cancer Treat Rev. 2006;32(7):504–15.
3. Bertino G, Cacciola S, Fernandes WB Jr, Fernandes CM, Occhini A, Tinelli C, et al. Effectiveness of narrow band imaging in the detection of premalignant and malignant lesions of the larynx: validation of a new endoscopic clinical classification. Head Neck. 2015;37(2):215–22.
4. Staníková L, Šatanková J, Kučová H, Walderová R, Zeleník K, Komínek P. The role of narrow-band imaging (NBI) endoscopy in optical biopsy of vocal cord leukoplakia. Eur Arch Otorhinolaryngol. 2017;274(1):355–9.
5. Westra JM, Zwakenberg MA, Halmos GB, van der Laan B, van der Vegt B, Plaat BEC. Narrow band imaging reveals field cancerisation undetected by conventional white light: Optical diagnosis versus histopathology. Clin Otolaryngol. 2024;49(4):429–35.
6. Galli J, Settimi S, Mele DA, Salvati A, Schiavi E, Parrilla C, et al. Role of Narrow Band Imaging Technology in the Diagnosis and Follow up of Laryngeal Lesions: Assessment of Diagnostic Accuracy and Reliability in a Large Patient Cohort. J Clin Med. 2021;10(6).
7. Ni XG, Wang GQ. The Role of Narrow Band Imaging in Head and Neck Cancers. Curr Oncol Rep. 2016;18(2):10.
8. Ni XG, He S, Xu ZG, Gao L, Lu N, Yuan Z, et al. Endoscopic diagnosis of laryngeal cancer and precancerous lesions by narrow band imaging. J Laryngol Otol. 2011;125(3):288–96.
9. Staníková L, Kántor P, Fedorová K, Zeleník K, Komínek P. Clinical significance of type IV vascularization of laryngeal lesions according to the Ni classification. Front Oncol. 2024;14:1222827.
10. Nerurkar NK, Sarkar A. Correlation of narrow-band imaging findings using the Ni and European Laryngeal Society classification systems during in-office flexible laryngoscopy with histopathology. J Laryngol Otol. 2024;138(2):203–7.
11. Fang Y, Yang Y, Chen M, Chen J, He P, Cheng L, et al. Correlating intraepithelial papillary capillary loops of vocal cord leukoplakia with histopathology. Acta Otolaryngol. 2022;142(1):106–11.
12. Sanda IA, Hainarosie R, Ionita IG, Voiosu C, Ristea MR, Zamfir Chiru Anton A. A Systematic Review Evaluating the Diagnostic Efficacy of Narrow-Band Imaging for Laryngeal Cancer Detection. Medicina (Kaunas). 2024;60(8).
13. Scholman C, Zwakenberg MA, Wedman J, Wachters JE, Halmos GB, van der Laan B, et al. The influence of clinical experience on reliable evaluation of pharyngeal and laryngeal lesions: comparison of high-definition laryngoscopy using narrow band imaging with fibre-optic laryngoscopy. J Laryngol Otol. 2024;138(4):425–30.
14. Kraus F, Gehrke S, Ehrmann-Müller D, Hofer F, Shehata-Dieler W, Hagen R, et al. Comparison of three different image enhancement systems for detection of laryngeal lesions. J Laryngol Otol. 2024;138(1):105–11.
15. Pietruszewska W, Morawska J, Rosiak O, Leduchowska A, Klimza H, Wierzbicka M. Vocal Fold Leukoplakia: Which of the Classifications of White Light and Narrow Band Imaging Most Accurately Predicts Laryngeal Cancer Transformation? Proposition for a Diagnostic Algorithm. Cancers (Basel). 2021;13(13).
16. Ni XG, Wang GQ, Hu FY, Xu XM, Xu L, Liu XQ, et al. Clinical utility and effectiveness of a training programme in the application of a new classification of narrow-band imaging for vocal cord leukoplakia: A multicentre study. Clin Otolaryngol. 2019;44(5):729–35.
17. Żurek M, Rzepakowska A, Osuch-Wójcikiewicz E, Niemczyk K. Learning curve for endoscopic evaluation of vocal folds lesions with narrow band imaging. Braz J Otorhinolaryngol. 2019;85(6):753–9.
18. Esmaeili N, Sharaf E, Gomes Ataide EJ, Illanes A, Boese A, Davaris N, et al. Deep Convolution Neural Network for Laryngeal Cancer Classification on Contact Endoscopy-Narrow Band Imaging. Sensors (Basel). 2021;21(23).
19. Xiong H, Lin P, Yu JG, Ye J, Xiao L, Tao Y, et al. Computer-aided diagnosis of laryngeal cancer via deep learning based on laryngoscopic images. EBioMedicine. 2019;48:92–9.
20. Azam MA, Sampieri C, Ioppi A, Africano S, Vallin A, Mocellin D, et al. Deep Learning Applied to White Light and Narrow Band Imaging Videolaryngoscopy: Toward Real-Time Laryngeal Cancer Detection. Laryngoscope. 2022;132(9):1798–806.
21. Arthur C, Huangfu H, Li M, Dong Z, Asamoah E, Shaibu Z, et al. The Effectiveness of White Light Endoscopy Combined With Narrow Band Imaging Technique Using Ni Classification in Detecting Early Laryngeal Carcinoma in 114 Patients: Our Clinical Experience. J Voice. 2023.
22. Chabrillac E, Dupret-Bories A, Vairel B, Woisard V, De Bonnecaze G, Vergez S. Narrow-Band Imaging in oncologic otorhinolaryngology: State of the art. Eur Ann Otorhinolaryngol Head Neck Dis. 2021;138(6):451–8.
23. Kumazawa Y, Ikenoyama Y, Takamatsu M, Kido K, Namikawa K, Tokai Y, et al. Differences in Clinical Characteristics Between Missed and Detected Laryngopharyngeal Cancers. J Gastroenterol Hepatol. 2025;40(8):2037–45.
24. Nogués-Sabaté A, Aviles-Jurado FX, Ruiz-Sevilla L, Lehrer E, Santamaría-Gadea A, Valls-Mateus M, et al. Intra and interobserver agreement of narrow band imaging for the detection of head and neck tumors. Eur Arch Otorhinolaryngol. 2018;275(9):2349–54.
25. Jahmunah V, Ng EYK, Tan RS, Oh SL, Acharya UR. Explainable detection of myocardial infarction using deep learning models with Grad-CAM technique on ECG signals. Comput Biol Med. 2022;146:105550.
26. Inaba A, Hori K, Yoda Y, Ikematsu H, Takano H, Matsuzaki H, et al. Artificial intelligence system for detecting superficial laryngopharyngeal cancer with high efficiency of deep learning. Head Neck. 2020;42(9):2581–92.
27. Li Y, Gu W, Yue H, Lei G, Guo W, Wen Y, et al. Real-time detection of laryngopharyngeal cancer using an artificial intelligence-assisted system with multimodal data. J Transl Med. 2023;21(1):698.
28. Yumii K, Ueda T, Kawahara D, Chikuie N, Taruya T, Hamamoto T, et al. Artificial intelligence-based diagnosis of the depth of laryngopharyngeal cancer. Auris Nasus Larynx. 2024;51(2):417–24.