Deep Learning Detection of Retinitis Pigmentosa Inheritance Forms through Synthetic Data Expansion of a Rare Disease Dataset
Elizabeth E. Hwang1*, Max L. Rivera1*, Lin Jia2*, Man Ting Lin1, Krish Nachnani1, Olivia Yuan1, Pulkit Madaan1, Ying Han1, Jacque L. Duncan1, Jing Shan1
1. Department of Ophthalmology, University of California, San Francisco, San Francisco, California, United States
2. Digillect LLC, San Francisco, California, United States
Corresponding author:
Jing Shan, MD, PhD
Department of Ophthalmology
University of California, San Francisco
San Francisco, CA, USA
Email: jing.shan@ucsf.edu
Elizabeth E. Hwang, Max L. Rivera and Lin Jia contributed equally to this work.
Funding Declarations: Tianqiao and Chrissy Chen Institute Scholar (JS), All May See and Think Forward Foundation (JS), UCSF Perstein Award (JS). This research was supported, in part, by the UCSF Vision Core shared resource of the NIH/NEI P30 EY002162, an unrestricted grant from Research to Prevent Blindness, and the Foundation Fighting Blindness (JLD).
Clinical Trial Number: Not applicable
No conflicting relationship exists for any author.
Abstract
Accurate classification of inheritance patterns is an integral part of diagnosis and genetic counseling for inherited retinal diseases (IRDs). Traditionally reliant on pedigree analysis, clinical phenotyping, and genetic testing, this process is often constrained by incomplete family history, ambiguous presentations, limited access to genetic testing, and inconclusive genetic test results. Deep learning (DL) applied to fundus imaging presents a promising approach for automated inference of inheritance modes; however, development has been hindered by the low prevalence of IRDs and the scarcity of annotated datasets. In this study, we focus on retinitis pigmentosa (RP), a highly heterogeneous disorder in both clinical presentation and genetic etiology. We present a first-in-class deep learning approach that leverages Vision Transformer (ViT) models to distinguish autosomal from X-linked RP using color fundus photography. To overcome challenges posed by limited data, we introduce an innovative variational autoencoder–based data expansion strategy, which improves inheritance pattern classification based on color fundus photos from 0.67 AUC to 0.79 AUC. Our findings demonstrate the potential of deep learning to uncover subtle phenotypic differences linked to genetic inheritance and introduce a novel training data augmentation method to render deep learning accessible to rare diseases.
Introduction
For rare inherited retinal diseases (IRDs), determining the mode of inheritance (i.e., autosomal versus X-linked inheritance) is crucial for providing accurate genetic counseling, guiding family planning, and predicting disease progression. While genetic testing is now routinely performed for many IRD patients, determining inheritance patterns through mutational analysis continues to present significant challenges (Britten-Jones et al., 2024; Xu et al., 2014). The most accessible and cost-effective option, whole-exome sequencing (WES), focuses on the protein-coding region, but coverage is limited and may miss disease-causing variants in non-coding regions (Burdick et al., 2020; Ross et al., 2020). Though more comprehensive, whole-genome sequencing (WGS) has other limitations, such as difficulty in detecting certain types of variants and higher costs (Marian, 2012). Furthermore, the diagnostic yield of next-generation sequencing for patients with retinitis pigmentosa (RP) ranges from 50–75% (Lynn et al., 2024; Consugar et al., 2015), and approximately half of non-syndromic RP patients have no family history of disease, which complicates the determination of inheritance patterns (Jin et al., 2008). Although determining inheritance patterns through sequencing and family history remains difficult, research indicates that distinct genetic inheritance patterns may produce subtle morphological features that are challenging to detect by the human eye alone (Currant et al., 2021; Ortin Vela et al., 2024).
Artificial intelligence (AI)-enhanced imaging tools may offer a direct, non-invasive alternative. In clinical medicine, contemporary deep learning (DL) models have achieved disease diagnostic accuracies comparable to those of experienced physicians (Zhou et al., 2023; Men et al., 2023; Hwang et al., 2023). More remarkably, some DL models can further discern sub-visual features—such as inferring biological sex or age from fundus photographs—that elude human observers (Berk et al., 2023; Korot et al., 2021; Nusinovici et al., 2022). This success, however, has come at the cost of prodigious data demands. In response, the field is converging on foundation-model strategies—both general and task-specific—that couple resource efficiency with strong cross-task generalization (Gani et al., 2020). General-purpose vision encoders typically require 142–300 million heterogeneous images to attain competitive performance (Oquab et al., 2023; Dosovitskiy et al., 2020), whereas task-tailored variants achieve state-of-the-art (SOTA) accuracy with 1.6–3.4 million curated ophthalmic images (Zhou et al., 2023; Shi et al., 2024; Qiu et al., 2024). Transfer learning techniques can further shrink data requirements by orders of magnitude, to roughly 70,000–100,000 images (Yang et al., 2024; Cohen et al., 2025). While these AI techniques have empowered detection and grading of common retinal conditions (Silva-Rodriguez et al., 2024; Sevgi et al., 2024), such as diabetic retinopathy (DR) (Men et al., 2023) and age-related macular degeneration (AMD) (Du et al., 2024), even these reduced data thresholds pose a prohibitive barrier for fields like IRDs, where annotated datasets seldom exceed a few hundred cases owing to low prevalence and fragmented data stewardship (Decherchi et al., 2021).
One promising strategy for curbing data demands is to leverage generative AI, which can augment existing datasets with high-fidelity synthetic images (Chaurasia et al., 2024; Chen et al., 2024; Kumar et al., 2022). Synthetic data has been generated and utilized in multiple settings challenged by scarce datasets, including the development of multiracial facial recognition models and the curation of customized organ models for surgical simulation and training (Park et al., 2024; Kimura et al., 2025). While showing great potential, many generative AI methods, particularly diffusion-based models, are prone to hallucination, in which the produced outputs are too far detached from reality, creating synthetic images that are plausible but nonsensical (Rajpurkar et al., 2018). To address this, we explored the use of variational autoencoders (VAEs). A VAE is a generative model that learns to encode input data into a latent space defined by a probability distribution, typically a multivariate Gaussian. During training, the model optimizes a loss function that balances reconstruction accuracy with regularization, ensuring the latent space conforms to a known prior distribution. To generate synthetic data, new samples are drawn from this prior distribution and passed through the decoder network to produce novel but statistically consistent outputs (Wei & Mahmood, 2021). By generating outputs that adhere to specific input data, the VAE is a more controllable generative model than diffusion, mitigating the problem of hallucination. Previously, we demonstrated that VAE-enhanced synthetic datasets significantly improved glaucoma detection by ViT (Chen et al., 2024). Here we report the development of a second-generation VAE-based data enhancement workflow that delivers dataset diversity beyond previous methods, and we explore how this new functionality can make DL methods accessible to rare diseases.
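The VAE objective described above (reconstruction accuracy balanced against a regularization term that pulls the latent posterior toward a known prior) can be sketched numerically. The function below is illustrative only, assuming a diagonal-Gaussian posterior and a standard-normal prior; it is not the loss of the model trained in this study:

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Illustrative VAE objective: pixel-wise reconstruction error plus a
    KL term that pulls the approximate posterior N(mu, sigma^2) toward the
    standard-normal prior N(0, I)."""
    recon = np.sum((x - x_recon) ** 2)  # reconstruction accuracy
    # Closed-form KL divergence for a diagonal Gaussian vs. N(0, I)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return recon + kl

# When the posterior already matches the prior (mu = 0, log_var = 0) and
# the reconstruction is perfect, the loss is zero.
x = np.zeros(4)
print(vae_loss(x, x, np.zeros(2), np.zeros(2)))  # prints 0.0
```

The KL term vanishes exactly when the posterior equals the prior, so the regularization only penalizes latent codes that drift away from the prior distribution being sampled at generation time.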
Methods
Patient Cohort
Fundus photographs were acquired from patients seen at the University of California, San Francisco between October 2018 and August 2024 with a familial and/or sequencing-confirmed diagnosis of non-syndromic retinitis pigmentosa with autosomal dominant (AD), autosomal recessive (AR), or X-linked recessive (XR) inheritance, or with sequencing-confirmed X-linked carrier (XLC) status. A total of 132 color fundus photographs were included in the full dataset. Symptom duration was determined from chart review by a retinal specialist with IRD expertise (J.L.D.), and was defined as the length of time between patient-reported onset of visual symptoms and imaging date, rounded up to the nearest year. Asymptomatic patients were assigned a symptom duration of 0 years. Statistical analysis was performed with GraphPad Prism software (10.0.3).
The study adhered to the tenets of the Declaration of Helsinki and was approved by the UCSF Institutional Review Board, which determined that this retrospective study qualified for a waiver of informed consent.
Data Preprocessing
For the purpose of confirming laterality, only fundus photos with clearly visible macula and optic nerve head (ONH) structures were included. If a patient had multiple imaging dates, only the most recent date was included for review. Fundus photos from the most recent date were manually assessed for image quality, and images with excessive blur, artifacts, or an insufficient field of view were excluded. In addition, all right eye (OD) images were horizontally flipped to match the orientation of left eye (OS) images prior to model training. Eyes without images meeting these quality criteria were excluded.
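The laterality standardization step amounts to a single horizontal mirror of OD images; `standardize_laterality` below is a hypothetical helper written for illustration, not the study's actual preprocessing code:

```python
import numpy as np

def standardize_laterality(image, eye):
    """Mirror right-eye (OD) fundus images horizontally so all inputs share
    the left-eye (OS) orientation before training (sketch; `eye` is 'OD'
    or 'OS')."""
    return np.fliplr(image) if eye == "OD" else image

# Toy 2x3 "image": columns are reversed for OD, untouched for OS.
img = np.array([[1, 2, 3],
                [4, 5, 6]])
print(standardize_laterality(img, "OD"))
```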
Autoencoder
To address the problem of overfitting caused by data scarcity, we employed a variational autoencoder framework to generate synthetic data. Two variations of autoencoder expansion were investigated: random noise expansion (Gen 1) and pair-wise combinatorial expansion (Gen 2).
I. Random Noise Expansion (Gen 1) VAE
Details of this first-generation expansion and training procedure were previously published (Chen et al., 2024). Briefly, synthetic images were generated by introducing noise into the latent space using four noise distributions: constant, Gaussian, uniform, and sinusoidal. For each image, one strategy was randomly selected and applied to the embedding with a strength parameter randomly sampled from 0.05 to 1, leading to additional variations in the output images. Each image, along with its random noise expansion, was added to the training set, resulting in a two-fold expansion of the training data. Our VAE utilized a dual-level ResNet-based encoder-decoder structure trained on ImageNet with a pixel-wise reconstruction loss.
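A minimal sketch of this random-noise expansion on a toy latent vector follows; `gen1_expand` and the exact noise parameterizations are illustrative stand-ins for the published procedure:

```python
import numpy as np

def gen1_expand(z, rng):
    """Gen 1-style expansion (sketch): perturb a latent vector z with one
    randomly chosen noise distribution and a random strength in [0.05, 1].
    The four noise families mirror those described in the text; the exact
    parameterizations here are illustrative."""
    strength = rng.uniform(0.05, 1.0)
    kind = rng.choice(["constant", "gaussian", "uniform", "sinusoidal"])
    if kind == "constant":
        noise = np.ones_like(z)
    elif kind == "gaussian":
        noise = rng.standard_normal(z.shape)
    elif kind == "uniform":
        noise = rng.uniform(-1.0, 1.0, size=z.shape)
    else:  # sinusoidal pattern over the latent indices
        noise = np.sin(np.linspace(0.0, 2.0 * np.pi, z.size)).reshape(z.shape)
    return z + strength * noise

rng = np.random.default_rng(0)
z = np.zeros(8)          # toy latent embedding
z_aug = gen1_expand(z, rng)  # perturbed copy added alongside the original
```

Pairing each original with one perturbed copy is what yields the two-fold training set expansion described above.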
II. Pairwise Combinatorial Expansion (Gen 2) VAE
To expand data enhancement capabilities in terms of both quantity and diversity, we developed a second-generation framework to perform pairwise combinatorial expansions using VAE. Here, synthetic images were generated by combining the latent representations of every possible pair of training images that share the same genotype label (Fig. 1). This pairwise structure ensures that every unique two-image combination within a genotype class contributes to the synthetic dataset. For each image pair, we encoded both images using a pretrained variational autoencoder (AutoencoderKL) and linearly combined their latent vectors using a predefined set of mixing ratios of 0.1, 0.3, 0.5, 0.7, and 0.9, as illustrated in Fig. 2. Each mixing ratio determines the relative weight between the original image and its paired image in the latent space. This process is repeated over every pairing within each label, where n is the number of images sharing that label, augmenting each inheritance mode with synthetic images derived from its C(n, 2) unique pairs. The resulting composite latent vector is then decoded to produce a synthetic image.
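A minimal sketch of the pairwise combinatorial expansion, operating on toy latent vectors in place of real AutoencoderKL embeddings (`gen2_expand` and `MIX_RATIOS` are illustrative names):

```python
import itertools
import numpy as np

MIX_RATIOS = (0.1, 0.3, 0.5, 0.7, 0.9)

def gen2_expand(latents):
    """Gen 2-style expansion (sketch): for every unordered pair of latent
    vectors sharing a label, form convex combinations at fixed mixing
    ratios. A real pipeline would encode images with a pretrained VAE
    first and decode each mixture back into a synthetic image."""
    synthetic = []
    for z_a, z_b in itertools.combinations(latents, 2):
        for r in MIX_RATIOS:
            synthetic.append(r * z_a + (1.0 - r) * z_b)
    return synthetic

# Three same-label latents -> C(3, 2) = 3 pairs, each mixed at 5 ratios.
latents = [np.zeros(4), np.ones(4), np.full(4, 2.0)]
mixed = gen2_expand(latents)
print(len(mixed))  # prints 15
```

Because every unique same-label pair contributes mixtures, the synthetic pool grows combinatorially with the number of originals, which is why this method's benefit scales with the size of the native dataset.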
Fig. 1
Two-way combinatorial synthetic image generation using variational autoencoder.
Fig. 2
Examples of Synthetic Images based on Autosomal Dominant (AD) RP images. Left to right A/B ratios: 0.1, 0.3, 0.5, 0.7, 0.9.
Vision Transformer Training and Evaluation
The pretrained foundation model used in this study (Google’s vit-base-patch16-224-in21k) was initialized with pretrained weights from ImageNet-21k and modified for binary classification. We applied an 80/20 train-test data split, ensuring that fundus images of both eyes from the same patient were placed in the same split, and trained two ViT models: one using only the original fundus images and the other incorporating both original and synthetic images (Fig. 3). To prepare the input images for training and evaluation, all images were resized to 224×224 pixels, matching the input size expected by the ViT model; this resolution is a standard preprocessing step for Vision Transformer (ViT) models, as demonstrated in multiple ViT-based retinal imaging studies (Wang et al., 2024; Powroznik et al., 2025). Training was performed for 30 epochs using the AdamW optimizer with a learning rate of 5e-05.
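The patient-level 80/20 split described above, in which both eyes of a patient always land in the same partition, can be sketched as follows; `patient_level_split` is a hypothetical helper, not the study's actual code:

```python
import random

def patient_level_split(records, test_frac=0.2, seed=0):
    """Patient-level 80/20 split (sketch): every image from a given patient
    (both eyes) lands in the same partition, preventing identity leakage
    between train and test. `records` maps patient_id -> list of image ids."""
    ids = sorted(records)
    random.Random(seed).shuffle(ids)  # deterministic shuffle of patients
    n_test = max(1, round(test_frac * len(ids)))
    test_ids = set(ids[:n_test])
    train = [img for pid in ids if pid not in test_ids for img in records[pid]]
    test = [img for pid in test_ids for img in records[pid]]
    return train, test

# Toy cohort: five patients, some contributing both eyes.
records = {
    "p1": ["p1_OD", "p1_OS"], "p2": ["p2_OD"], "p3": ["p3_OD", "p3_OS"],
    "p4": ["p4_OS"], "p5": ["p5_OD"],
}
train, test = patient_level_split(records)
```

Splitting at the patient level rather than the image level matters here because the two eyes of one patient are highly correlated, and mixing them across partitions would inflate test performance.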
During training, we applied data augmentations using a randomized resizing crop followed by a randomized horizontal flip, introducing variability and reducing overfitting. Cross-validation was performed by re-sampling to generate representative train/test splits. Synthetic images were generated from, and added to, only the training sets. To calculate mean accuracy, recall, and specificity, model performances were averaged over five-fold cross-validation. Pooled AUCs were calculated by aggregating labels and predictions generated across all validation sets.
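Pooled AUC aggregates labels and predictions across all validation folds before a single ROC AUC is computed over the pool. A minimal sketch using the rank-based (Mann-Whitney) formulation, with `pooled_auc` as an illustrative helper:

```python
def pooled_auc(labels, scores):
    """Pooled AUC (sketch): concatenate ground-truth labels and predicted
    scores from all validation folds, then compute one ROC AUC over the
    pool via the rank-sum (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Fraction of positive/negative pairs ranked correctly (ties count 0.5)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Predictions from all folds are pooled before the AUC is taken.
fold_labels = [[1, 0], [1, 0, 0]]
fold_scores = [[0.9, 0.2], [0.6, 0.4, 0.8]]
labels = [y for fold in fold_labels for y in fold]
scores = [s for fold in fold_scores for s in fold]
auc = pooled_auc(labels, scores)
```

Pooling before the AUC computation yields one stable estimate across small folds, rather than averaging per-fold AUCs that are noisy when each fold contains only a handful of cases.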
Fig. 3
Workflow for Data Expansion and Vision Transformer Model Training. VAE = Variational Autoencoder, ViT = Vision Transformer.
Results
Retinitis pigmentosa (RP) patient demographics
Our final cohort included 105 eyes from 53 retinitis pigmentosa (RP) patients, for a total of 132 wide-field color fundus photos (Table 1). X-linked recessive (XR) and X-linked carrier (XLC) patient characteristics were reported as a single category to protect confidentiality. Patients' mean age at time of imaging was 53 years (± 16 years) for autosomal dominant (AD), 49 years (± 17 years) for autosomal recessive (AR), and 26 years (± 19 years) for XR or XLC (Table 1, Fig. 4). Mean patient ages were significantly different (ordinary one-way ANOVA with Tukey’s multiple comparisons test, p-value < 0.01) between the autosomal (AD, AR) and X-linked (XL) groups, in line with known earlier onset of symptoms in X-linked RP (De Silva et al., 2021). Patient median symptom duration at time of imaging was 27 years (± 18 years) for AD, 22 years (± 21 years) for AR and 26 years (± 19 years) for XL, with no significant inter-group differences by one-way ANOVA (Table 1, Fig. 4).
Table 1. Study cohort. AD = Autosomal Dominant, AR = Autosomal Recessive, XL = X-linked Recessive and X-linked Carrier (combined for anonymization purposes).
Fig. 4
Patients mean age (panel A) and median symptom duration at time of imaging (panel B). AD = Autosomal Dominant, AR = Autosomal Recessive, XL = X-linked Recessive and X-linked Carrier (combined for anonymization purposes). ** denotes p-value < 0.005 and *** denotes p-value < 0.0001
Vision Transformer classification of disease inheritance mode in RP
We first evaluated our retinitis pigmentosa inheritance Vision Transformer (RP-ViT) base model, trained on the color fundus photo dataset without synthetic training data enhancements. For the purpose of binary classification, we combined the four inheritance modes into two classes: autosomal (AR and AD) and X-linked (XR and XLC). For the RP-ViT base model, pooled AUC was 0.67, mean accuracy was 0.62 ± 0.05, and mean specificity was 0.55 ± 0.07 (Fig. 5, Table 2).
Fig. 5
A. Receiver Operating Characteristic (ROC) curve with pooled AUC reported for RP ViT base (unexpanded) model. B. Confusion matrix for the base model.
Autoencoder Enhancement of ViT-based classification of RP inheritance mode
I. Random Noise Expansion (Gen 1)
We trained and evaluated RP-ViT using a two-fold augmented dataset with synthetic images produced by the first-generation VAE method (Chen et al., 2024). This improved pooled AUC to 0.75, mean accuracy to 0.69 ± 0.05, and mean specificity to 0.64 ± 0.07 (Fig. 6, Table 2).
Fig. 6
A. Receiver Operating Characteristic (ROC) curves for RP-ViT base and Gen 1 (random noise) expanded models. B. Confusion matrix for Gen 1 (random noise)-expanded models.
II. Pair-wise Combinatorial Expansion (Gen 2)
Finally, we evaluated RP-ViT trained on a combinatorial pair-wise expanded dataset with synthetic images produced by our second-generation VAE method. This final model outperformed both the base model and the random noise-expansion RP-ViT models on all measured metrics with a pooled AUC of 0.79, mean accuracy of 0.71 ± 0.10, and mean specificity of 0.68 ± 0.10 (Fig. 7, Table 2).
Fig. 7
A. Receiver Operating Characteristic (ROC) curves for RP-ViT base, Gen 1 random noise-expanded, and Gen 2 pair-wise combinatorial expanded models. B. Confusion matrix for Gen 2 RP-ViT model.
Table 2
Performance statistics for vision transformer (ViT) RP classification models, with base (unexpanded) and synthetically enhanced (Gen 1, Gen 2) datasets.
1Pooled AUC refers to AUC calculated from combined classifications across all folds.
Discussion
Deep-learning systems have achieved near-expert performance in detecting prevalent retinal disorders—most notably diabetic retinopathy and age-related macular degeneration—powered by the vast imaging datasets generated through population-wide screening initiatives (Aggarwal et al., 2021; Cen et al., 2021; Shoaib et al., 2025). Yet their translation into subspecialty clinics has been limited, with few algorithms undergoing rigorous, real-world validation in expert settings where diagnostic subtleties matter most (Abràmoff et al., 2018; Li et al., 2023). The gap widens further for rare inherited disorders such as retinitis pigmentosa, where patient numbers are small and existing datasets fall several orders of magnitude below typical deep-learning requirements. It thus remains an open question for the field whether AI models trained on such limited cohorts can deliver actionable insights.
In this study, we report on the application of generative AI and deep learning to classify genetic inheritance patterns in rare retinal diseases. Specifically, we demonstrate the feasibility of using a state-of-the-art (SOTA) vision-based deep learning model (ViT) to identify modes of inheritance from retinitis pigmentosa (RP) fundus images. To enable application of ViT to an inherently scarce dataset that is orders of magnitude smaller than previously reported deep learning use cases, we developed novel methods of training data enhancement. Using a variational autoencoder, we introduced both random-noise (Gen 1) and pair-wise combinatorial (Gen 2) expansions of training data. Results show that while both methods can improve ViT performance, Gen 2 (which fuses pairs of labeled images to simulate biologically plausible variants) offers a greater range of quantity and diversity in its synthetic data, leading to improvements beyond Gen 1 methods. By generating synthetic data via the fusion of real patient images rather than using diffusion-based methods, we minimize hallucination and maximize the clinical fidelity and relevance of the resultant synthetic training data.
Our findings introduce two important concepts in clinical AI methodology: first, the ability to identify hidden patterns that correlate with inheritance mode and are invisible to human detection, and second, an encoder-decoder generative AI approach to alleviate the scarce-data limitations of rare diseases. To our knowledge, this is the first report to use deep learning to classify images by mode of inheritance in retinitis pigmentosa (RP). We chose a binary classification approach based on both clinical and model design considerations. Classification of inheritance patterns is an important first step for diagnosis and for selecting appropriate genetic testing approaches and interpretation, which may inform the patient about relevant clinical trials or treatment options (Ullah et al., 2025; Lam et al., 2021; Gocuk et al., 2023). Distinguishing between X-linked and autosomal inheritance for rare IRDs can improve genetic counseling and family planning, and also enable more targeted diagnostic testing approaches, potentially improving the yield of diagnostic testing. In addition to its clinical relevance, our binary classifier demonstrated better accuracy than a multi-class classifier. We suspect this difference may be due to the inherent limitations of classifiers in handling multiclass problems, as well as the small size of rare disease datasets (Allwein et al., 2000). Multiclass classifiers are intrinsically more data-hungry, a limitation that becomes pronounced in rare-disease settings where sample counts are low. While our data augmentation pipelines aim to address this data scarcity problem, their benefits, particularly that of Gen 2, scale with the size of the native dataset because each new image arises from two distinct originals. Consequently, the smallest cohorts (e.g. <20 images) still pose a challenge.
Ongoing efforts therefore aim to further advance data expansion techniques to enable AI tools that perform robustly even with very small datasets, thereby improving classification accuracy in multi-class problems.
Beyond building a more robust data-enhancement architecture, it is our hope that a discovery-focused approach to AI tool development may uncover previously unrecognized deep biological signatures. Pathology unfolds across many measurement axes (e.g. structural, functional, and molecular), each contributing a slice of a high-dimensional landscape. Integrating complementary modalities, such as optical-coherence tomography, fundus imaging, and electrophysiology, provides a more complete representation of inherited retinal disorders and, accordingly, may improve DL classification performance (Hwang et al., 2025).
We acknowledge several limitations to our approach. First, we relied on a limited dataset from a single institution. While UCSF is a tertiary referral center and attracts patients worldwide, our cohort for this initial report is nevertheless limited. Our current work serves as a proof-of-concept, laying groundwork for promoting the development of future multi-center studies to validate the reported framework across more diverse populations.
Second, the autosomal and X-linked groups in our cohort differed significantly in mean age—an expected reflection of the earlier onset typical of X-linked IRDs. Although this age disparity may serve as a confounding factor, it also represents a genuine clinical distinction in onset and progression between inheritance patterns; the differences we observed likely reflect differences in disease severity that correlate with inheritance pattern. Third, similar to other deep learning models, our method involves a level of complexity that can obscure the interpretability of its decision-making process, making it difficult to determine which features were most relevant to the model predictions. While established interpretability tools exist for earlier architectures such as CNNs, the self-attention mechanisms used in Vision Transformers (ViTs) make it more challenging to generate intuitive or clinically recognizable explanations (Doncevic & Herrmann, 2023). To address this limitation, work is ongoing to develop a multimodal large language–vision model that can articulate its visual reasoning for medical image analyses.
In conclusion, this study highlights the potential of deep learning to contribute to the diagnostic workflow for inherited retinal diseases. By integrating VAE-powered data enhancement with ViT-driven binary classification, we report a novel and effective framework for mitigating the limitations imposed on AI tool development by inherently scarce data, one that may also provide information about disease severity in resource-limited settings where genetic testing is not available or results are indeterminate. The reported framework enabled classification of RP inheritance mode using a small dataset, fulfilling an important clinical objective and underscoring the potential of AI to enhance healthcare for all patient populations, irrespective of disease prevalence.
Ethics Statement:
The study adhered to the tenets of the Declaration of Helsinki and was approved by the UCSF Institutional Review Board (IRB # 21-35673), which determined that this retrospective study qualified for a waiver of informed consent.
Author Contribution
E.E.H., L.J., J.L.D., and J.S. contributed to the conception and design; E.E.H., J.L.D., and J.S. collected the data; E.E.H., M.L.R., L.J., M.T.L., K.N., O.Y., J.L.D., and J.S. prepared all the figures; E.E.H., M.L.R., L.J., M.T.L., K.N., P.M., Y.H., J.L.D., and J.S. wrote the main manuscript text.
Competing Interests:
All authors declare no financial or non-financial competing interests.
Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Abràmoff, M. D., Lavin, P. T., Birch, M., Shah, N. & Folk, J. C. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit. Med. 1, 39. https://doi.org/10.1038/s41746-018-0040-6 (2018).
Aggarwal, R. et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit. Med. 4 (1), 65. https://doi.org/10.1038/s41746-021-00438-z (2021).
Allwein, E. L., Schapire, R. E. & Singer, Y. Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2000). https://www.jmlr.org/papers/v1/allwein00a.html
Berk, A. et al. Learning from small data: Classifying sex from retinal images via deep learning. PLoS One. 18 (8), e0289211. https://doi.org/10.1371/journal.pone.0289211 (2023).
Britten-Jones, A. C. et al. Patient experiences and perceived value of genetic testing in inherited retinal diseases: a cross-sectional survey. Sci. Rep. 14 (1), 5403. https://doi.org/10.1038/s41598-024-56121-2 (2024).
Burdick, K. J. et al. & Undiagnosed Diseases Network. Limitations of exome sequencing in detecting rare and undiagnosed diseases. Am. J. Med. Genet. A 182 (6), 1400–1406. https://doi.org/10.1002/ajmg.a.61558 (2020).
Cen, L. P. et al. Automatic detection of 39 fundus diseases and conditions in retinal photographs using deep neural networks. Nat. Commun. 12 (1), 4828. https://doi.org/10.1038/s41467-021-25138-w (2021).
Chaurasia, A. K., MacGregor, S., Craig, J. E., Mackey, D. A. & Hewitt, A. W. Assessing the efficacy of synthetic optic disc images for detecting glaucomatous optic neuropathy using Deep Learning. Translational Vis. Sci. Technol. 13 (6), 1. https://doi.org/10.1167/tvst.13.6.1 (2024).
Chen, D., Han, Y., Duncan, J., Jia, L. & Shan, J. Generative Artificial Intelligence Enhancements for Reducing Image-based Training Data Requirements. Ophthalmol. Sci. 4 (5), 100531. https://doi.org/10.1016/j.xops.2024.100531 (2024).
Cohen, B. A. et al. Benchmarking ophthalmology foundation models for clinically significant age macular degeneration detection. arXiv:2505.05291 (2025). https://doi.org/10.48550/arXiv.2505.05291
Consugar, M. B. et al. Panel-based genetic diagnostic testing for inherited eye diseases is highly accurate and reproducible, and more sensitive for variant detection, than exome sequencing. Genet. Med. 17 (4), 253–261. https://doi.org/10.1038/gim.2014.172 (2015).
Currant, H. et al. Genetic variation affects morphological retinal phenotypes extracted from UK Biobank optical coherence tomography images. PLoS Genet. 17 (5), e1009497. https://doi.org/10.1371/journal.pgen.1009497 (2021).
De Silva, S. R. et al. The X-linked retinopathies: Physiological insights, pathogenic mechanisms, phenotypic features and novel therapies. Prog Retin Eye Res. 82, 100898. https://doi.org/10.1016/j.preteyeres.2020.100898 (2021).
Decherchi, S., Pedrini, E., Mordenti, M., Cavalli, A. & Sangiorgi, L. Opportunities and challenges for machine learning in rare diseases. Front. Med. 8, 747612. https://doi.org/10.3389/fmed.2021.747612 (2021).
Doncevic, D. & Herrmann, C. Biologically informed variational autoencoders allow predictive modeling of genetic and drug-induced perturbations. Bioinformatics 39 (6). https://doi.org/10.1093/bioinformatics/btad387 (2023).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929. (2020).
Du, K. et al. Detection of disease features on retinal OCT scans using retfound. Bioengineering 11 (12), 1186. https://doi.org/10.3390/bioengineering11121186 (2024).
Gani, H., Naseer, M. & Yaqub, M. How to train Vision Transformer on small-scale datasets? arXiv:2210.07240 (2020). https://doi.org/10.48550/arXiv.2210.07240
Gocuk, S. A., Edwards, T. L., Jolly, J. K. & Ayton, L. N. Perspectives of carriers of x-linked retinal diseases on genetic testing and gene therapy: A global survey. Clin. Genet. 105 (2), 150–158. https://doi.org/10.1111/cge.14442 (2023).
Hwang, E. E., Chen, D., Han, Y., Jia, L. & Shan, J. Multi-Dataset Comparison of Vision Transformers and Convolutional Neural Networks for Detecting Glaucomatous Optic Neuropathy from Fundus Photographs. Bioeng. (Basel). 10 (11). https://doi.org/10.3390/bioengineering10111266 (2023).
Hwang, E. E., Chen, D., Han, Y., Jia, L. & Shan, J. Utilization of image-based deep learning in multimodal glaucoma detection neural network from a primary patient cohort. Ophthalmol. Sci. 5 (3), 100703. https://doi.org/10.1016/j.xops.2025.100703 (2025).
Jin, Z. B. et al. Identifying pathogenic genetic background of simplex or multiplex retinitis pigmentosa patients: a large scale mutation screening study. J. Med. Genet. 45 (7), 465–472. https://doi.org/10.1136/jmg.2007.056416 (2008).
Kimura, T. et al. Development of anatomically accurate digital organ models for surgical simulation and training. PLOS ONE. 20 (4). https://doi.org/10.1371/journal.pone.0320816 (2025).
Korot, E. et al. Predicting sex from retinal fundus photographs using automated deep learning. Sci. Rep. 11 (1), 10286. https://doi.org/10.1038/s41598-021-89743-x (2021).
Lam, B. L. et al. Genetic testing and diagnosis of inherited retinal diseases. Orphanet J. Rare Dis. 16 (1). https://doi.org/10.1186/s13023-021-02145-0 (2021).
Lee, J. et al. Deep learning for rare disease: A scoping review. J. Biomed. Inf. 135, 104227. https://doi.org/10.1016/j.jbi.2022.104227 (2022).
Li, Z. et al. Artificial intelligence in ophthalmology: The path to the real-world clinic. Cell. Rep. Med. 4 (7), 101095. https://doi.org/10.1016/j.xcrm.2023.101095 (2023).
Lynn, J. et al. Expanding the mutation spectrum for inherited retinal diseases. Genes 16 (1), 32. https://doi.org/10.3390/genes16010032 (2024).
Marian, A. J. Challenges in medical applications of whole exome/genome sequencing discoveries. Trends Cardiovasc. Med. 22 (8), 219–223. https://doi.org/10.1016/j.tcm.2012.08.001 (2012).
Men, Y. et al. DRStageNet: Deep learning for diabetic retinopathy staging from fundus images. arXiv:2312.14891. https://doi.org/10.48550/arXiv.2312.14891 (2023).
Nusinovici, S. et al. Retinal photograph-based deep learning predicts biological age, and stratifies morbidity and mortality risk. Age Ageing. 51 (4). https://doi.org/10.1093/ageing/afac065 (2022).
Oquab, M. et al. DINOv2: Learning robust visual features without supervision. arXiv:2304.07193. https://doi.org/10.48550/arXiv.2304.07193 (2024).
Ortin Vela, S. et al. Phenotypic and genetic characteristics of retinal vascular parameters and their association with diseases. Nat. Commun. 15 (1), 9593. https://doi.org/10.1038/s41467-024-52334-1 (2024).
Park, Y. et al. Study on the generation and comparative analysis of ethnically diverse faces for developing a multiracial face recognition model. Electronics 13 (18), 3627. https://doi.org/10.3390/electronics13183627 (2024).
Powroznik, P. et al. Residual self-attention vision transformer for detecting acquired vitelliform lesions and age-related macular Drusen. Sci. Rep. 15 (1). https://doi.org/10.1038/s41598-025-02299-y (2025).
Qiu, J. et al. VisionFM: A multi-modal multi-task vision foundation model for generalist ophthalmic artificial intelligence. arXiv:2310.04992. https://doi.org/10.48550/arXiv.2310.04992 (2023).
Rajpurkar, P., Jia, R. & Liang, P. Know what you don’t know: unanswerable questions for SQuAD. arXiv:1806.03822. https://doi.org/10.48550/arXiv.1806.03822 (2018).
Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752. https://doi.org/10.48550/arXiv.2112.10752 (2022).
Ross, J. P., Dion, P. A. & Rouleau, G. A. Exome sequencing in genetic disease: recent advances and considerations. F1000Res. 9. https://doi.org/10.12688/f1000research.19444.1 (2020).
Sevgi, M., Ruffell, E., Antaki, F., Chia, M. A. & Keane, P. A. Foundation models in ophthalmology: Opportunities and challenges. Curr. Opin. Ophthalmol. 36 (1), 90–98. https://doi.org/10.1097/icu.0000000000001091 (2024).
Shi, D. et al. EyeFound: A Multimodal Generalist Foundation model for ophthalmic imaging. arXiv:2405.11338. (2024). https://doi.org/10.48550/arXiv.2405.11338
Shoaib, M. R. et al. Revolutionizing diabetic retinopathy diagnosis through advanced deep learning techniques: Harnessing the power of GAN model with transfer learning and the DiaGAN-CNN model. Biomed. Signal Process. Control. 99. https://doi.org/10.1016/j.bspc.2024.106790 (2025).
Silva-Rodriguez, J. et al. Exploring the Transferability of a Foundation Model for Fundus Images: Application to Hypertensive Retinopathy. Lect. Notes Comput. Sci. 14497. https://doi.org/10.1007/978-3-031-50075-6_33 (2024).
Sreejith Kumar, A. J. et al. Evaluation of generative adversarial networks for high-resolution synthetic image generation of circumpapillary optical coherence tomography images for glaucoma. JAMA Ophthalmol. 140 (10), 974. https://doi.org/10.1001/jamaophthalmol.2022.3375 (2022).
Ullah, M. et al. A comprehensive genetic landscape of inherited retinal diseases in a large Pakistani cohort. NPJ Genomic Med. 10 (1). https://doi.org/10.1038/s41525-025-00488-2 (2025).
Wang, D., Lian, J. & Jiao, W. Multi-label classification of retinal disease via a novel Vision Transformer model. Front. Neurosci. 17. https://doi.org/10.3389/fnins.2023.1290803 (2024).
Wei, R. & Mahmood, A. Recent Advances in Variational Autoencoders With Representation Learning for Biomedical Informatics: A Survey. IEEE Access. 9, 4939–4956. https://doi.org/10.1109/access.2020.3048309 (2021).
Xu, Y. et al. Mutations of 60 known causative genes in 157 families with retinitis pigmentosa based on exome sequencing. Hum. Genet. 133 (10), 1255–1271. https://doi.org/10.1007/s00439-014-1460-2 (2014).
Yang, Y., Cai, Z., Qiu, S. & Xu, P. Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image. PLOS ONE. 19 (3). https://doi.org/10.1371/journal.pone.0299265 (2024).
Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622 (7981), 156–163. https://doi.org/10.1038/s41586-023-06555-x (2023).