A robust deep learning classifier for screening multiple retinal diseases on optical coherence tomography


Introduction

According to 2023 data from the World Health Organization (WHO, https://www.who.int/), 2.2 billion people worldwide are affected by vision problems. Early and accurate diagnosis could help prevent irreversible vision loss for nearly one billion of them1. However, in many regions worldwide, a shortage of medical resources delays the diagnosis and treatment of these diseases2. Artificial intelligence (AI) has emerged as a promising, scalable, and real-time solution for the screening of retinal diseases, even in areas with limited medical resources3.

AI solutions based on deep learning (DL) have been extensively used to analyze various retinal imaging modalities4,5, such as color fundus photography6,7,8,9, ultra-widefield fundus imaging10,11, and optical coherence tomography (OCT)12,13,14, for detecting conditions including diabetic retinopathy (DR), age-related macular degeneration (AMD), diabetic macular edema (DME), glaucoma, myopia, and other retinal pathologies. These systems have demonstrated remarkable potential, automating disease detection with accuracy comparable to that of expert ophthalmologists15,16.

Among the many retinal imaging techniques, OCT stands out for its ability to generate 3D reconstructions of the retina, providing a more comprehensive view for disease screening17. Additionally, OCT delivers high-resolution, layer-specific information on retinal thickness, enabling more precise and detailed analysis. This makes it a highly valuable tool for integrating DL algorithms into clinical practice for early and accurate disease detection.

Several strategies have been developed for classifying volumetric OCT data. One widely used approach is Multiple Instance Learning (MIL)18,19,20, which classifies individual 2D slices and then aggregates the slice-level predictions into a single volume-level output, typically using mean or max pooling. MIL has been employed for disease detection, such as AMD and DME detection in OCT volumes21,22. Another common approach is direct 3D volume classification12,23, which takes the entire volume as input to make a prediction. By processing the full volumetric data at once, the model effectively captures spatial relationships within the volume. This approach has been successfully applied to glaucoma detection in OCT24,25 using 3D ResNet architectures, demonstrating its strength in leveraging spatial information.

More recently, the Variable Length Volume Feature Aggregation Transformer (VLFAT)26 has been introduced. This method aggregates the latent representations of each slice, produced by a slice feature extractor, into a comprehensive feature map for the entire volume using attention mechanisms27. This technique offers a balance between 2D slice-level and 3D volume-level approaches by processing slices independently while preserving inter-slice dependencies. VLFAT has shown robust performance in multi-class OCT classification tasks involving AMD, DME, and geographic atrophy (GA), the latter being primarily a manifestation of AMD.

In addition, the foundation model RETFound28, based on the well-known Vision Transformer (ViT) architecture, represents a significant advancement in OCT analysis.
Pretrained on large-scale OCT datasets, RETFound has demonstrated strong potential for generalization and robustness across diverse downstream tasks, including disease detection.

To address the clinical need for early detection of retinal diseases, we aim to develop a DL-based classifier capable of screening a wide range of pathologies using OCT imaging, focusing specifically on AMD, DME, vitreomacular interface disease (VID), and a final category encompassing all other pathologies. For AI to be effectively integrated into clinical practice, it must demonstrate both generalizability and robustness to adapt to real-world settings5,6. However, existing models often rely heavily on the datasets used during development and show significant performance drops when tested on external datasets. This challenge arises from variations in patient characteristics, imaging devices, and acquisition parameters, such as image resolution and slice count. Furthermore, most current works focus on single-pathology detection25,29,30. This limits their ability to handle unseen diseases or multiple coexisting pathologies, which are frequent in clinical practice. A multi-disease framework is thus essential to improve diagnostic coverage and deliver reliable performance across diverse patient profiles.

In this study, we used three multi-pathology datasets (one private and two public) spanning different imaging devices and diverse demographic populations to ensure a comprehensive evaluation. Our model was trained on one dataset and tested on two others with distinct characteristics, enabling a robust assessment of its generalization capabilities. Our ultimate goal is to create a population- and device-independent tool that can be deployed across different countries and clinical environments, facilitating its widespread adoption in real-world practice.

Methods

We utilized a total of four datasets in this study, comprising both private and publicly available datasets, with their detailed characteristics provided in Table 1. The private dataset (OCTBrest) focused on multi-disease classification, while the publicly available datasets (OCTDL, NEH, and Kermany) included both multi-disease and multi-lesion data.

Table 1 Characteristics of OCT datasets used in this study.

The private OCTBrest dataset was collected and analyzed in accordance with the MR-004 reference methodology established by the French CNIL (National Information Science and Liberties Commission). All experimental protocols were approved by the CNIL under this framework, which governs non-interventional research involving health data of public interest. For the three public datasets used in this study (OCTDL, NEH, and Kermany), all data were fully anonymized and complied with the ethical standards set forth by their original data providers. Informed consent was obtained from all subjects over 18 years of age and from a parent or legal guardian for subjects under 18, in accordance with the ethical principles outlined in the Declaration of Helsinki.

Datasets

The private OCTBrest dataset includes 663 OCT volumes from 251 patients (122 men and 129 women), aged 11 to 94 years, with an average age of 64 years (±16). The patients were seeking vision-related consultations at Brest University Hospital, France, between 2017 and 2018; the volumes were acquired with a Heidelberg Spectralis device and centered on the macula or the fovea.
Each volume was annotated by one ophthalmologist specialized in the retina (T.T., with 5 years of experience in retinal image analysis), who identified the presence of 20 different lesions or diseases (see Table 2). Normal eyes were classified based on the absence of any retinal lesion or disease. To avoid memory bias, all right-eye scans were analyzed first, followed by the left-eye scans. The clinician had no access to the patients' medical records during the annotation process. The images have scan dimensions ranging from 4.2 × 1.4 to 9.2 × 7.7 mm² (average 6.0 × 4.8 mm²) with an axial resolution of 3.9 μm. The dataset also features azimuthal and lateral resolutions between 4.9 and 13 μm, with slice resolutions of 496 pixels in height and varying widths of 512, 768, 1024, and 1536 pixels, ensuring high-quality retinal imaging.

Table 2 OCTBrest dataset - lesions and pathologies distribution.

The public OCTDL (Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods)31 dataset, acquired with the Optovue Avanti device and centered on the fovea, includes images from a Russian population. It contains OCT scans from patients diagnosed with AMD, DME, epiretinal membrane (ERM), retinal artery occlusion (RAO), retinal vein occlusion (RVO), and VID, as well as normal cases. Specifically, there are 1231 AMD scans from 421 patients, 147 DME scans from 107 patients, 155 ERM scans from 71 patients, 332 normal scans from 110 patients, 22 RAO scans from 11 patients, 101 RVO scans from 50 patients, and 76 VID scans from 51 patients, for a total of 2064 scans from 821 patients. Labeling involved an initial pass by 7 trained medical students with consensus discussions, followed by review and consensus labeling by two clinical specialists, and final diagnosis confirmation by the head clinic expert.

The public NEH (Noor Eye Hospital)32 dataset contains 148 OCT volumes (48 AMD, 50 DME, and 50 normal subjects) acquired with the Heidelberg Spectralis device. The images have scan dimensions comparable to OCTBrest (8.9 × 7.4 mm², with an axial resolution of 3.5 μm). This dataset represents an Iranian population, offering additional geographic and demographic context. The slices have resolutions of 496 or 512 pixels in height and width. A total of 4327 slices are available in the dataset, with each volume containing between 19 and 61 slices (an average of 29 slices per volume).

The public Kermany14 dataset, consisting of 109,312 OCT B-scan images, was developed to classify retinal conditions such as choroidal neovascularization (CNV), DME, drusen, and normal cases. Although it is the largest dataset in terms of image count, its primary focus is multi-lesion classification rather than true multi-disease detection.

Label standardization

The OCTBrest dataset consists of 20 classes, including 13 lesion types and 7 pathologies. The OCTDL dataset contains 8 distinct classes, while the NEH dataset comprises 3 classes. A senior ophthalmologist standardized the labels of the OCTBrest and OCTDL datasets into 5 main classes: Normal, AMD, DME, VID, and OTHER. The original classes were retained for the NEH dataset. The class distributions (illustrated in Figure 1a) are as follows: the OCTBrest dataset contains 205 Normal cases, 258 cases of AMD, 35 cases of DME, 182 cases of VID, and 135 cases in the OTHER category. The OCTDL dataset includes 332 Normal cases, 1231 cases of AMD, 147 cases of DME, 231 cases of VID, and 123 cases classified as OTHER. Lastly, the NEH dataset comprises 50 Normal cases, 48 cases of AMD, and 50 cases of DME.
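The exact grouping rules are not enumerated here, but the OCTDL mapping can be reconstructed from the reported counts (155 ERM + 76 VID = 231 VID cases; 22 RAO + 101 RVO = 123 OTHER cases). The snippet below is an illustrative sketch of that inferred mapping, not the annotation procedure used in the study.

```python
# Illustrative mapping of OCTDL's original categories onto the five standardized
# classes, inferred from the reported counts (155 ERM + 76 VID = 231 VID;
# 22 RAO + 101 RVO = 123 OTHER); the clinical grouping itself was defined by the
# senior ophthalmologist.
OCTDL_TO_STANDARD = {
    "Normal": "Normal",
    "AMD": "AMD",
    "DME": "DME",
    "ERM": "VID",    # epiretinal membrane, grouped with vitreomacular interface disease
    "VID": "VID",
    "RAO": "OTHER",  # retinal artery occlusion
    "RVO": "OTHER",  # retinal vein occlusion
}

def standardize_octdl_label(label: str) -> str:
    """Map an original OCTDL category name to one of the five standardized classes."""
    return OCTDL_TO_STANDARD[label]
```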
Fig. 1 Complete study overview. (a) OCT multi-disease datasets. Among the three datasets, only OCTBrest is a multi-disease OCT dataset that includes patients with multiple co-occurring pathological signs: among its 458 pathological OCT volumes, 122 presented with two pathologies, 12 with three, and 2 with all four. (b) Data preprocessing using OCTIP on OCTBrest. (c) FlexiVarViT architecture: (1) the patch embedding weights are dynamically resized to create variable-sized patches, enabling flexible handling of different image resolutions while preserving critical image details; (2) the positional encoding (PE) introduces spatial information, capturing the relative positions of slices within the volume and ensuring a coherent spatial representation throughout processing; (3) the Transformer blocks process the sequence of embedded patches by leveraging both spatial (through position embeddings) and contextual relationships via self-attention mechanisms, enhancing feature extraction and representation learning; (4) a classification token (cls_token) serves as a summary of the entire sequence and is passed through the MLP head, which performs the final classification, predicting pathology categories based on the learned features. (d) Model development and evaluation pipeline. (e) Our final classifier.

Data preprocessing

We used the optical coherence tomography image preprocessing pipeline (OCTIP; https://github.com/leto-atreides-2/octip), developed by our laboratory LaTIM, which segments the retinal layers and extracts only the essential parts of the image. OCT slices often contain significant noise and irrelevant information; this method effectively isolates the retinal layers, then flattens and aligns the images in depth, helping to standardize variations caused by different acquisition angles (see Figure 1b). The window height after processing is set to 200 pixels33, a value recently validated on the large public MARIO dataset34. This size is sufficient to fully include all retinal layers, even in the most challenging cases such as severe edema or epiretinal membranes.

Deep learning approaches

Three distinct approaches for classifying volumetric data were employed in this study: MIL, direct 3D volume classification, and VLFAT. For the MIL approach, we employed RETFound28, a foundation model pretrained on 736,000 OCT images with masked autoencoding, and applied both mean (RETFound-MIL-AVG) and maximum (RETFound-MIL-MAX) prediction pooling. For direct 3D volume classification, we employed MedNet3D35, a 3D foundation model based on ResNet36 architectures; among these, we utilized ResNet-18 (MedNet3D-R18) and ResNet-50 (MedNet3D-R50). MedNet3D models were pretrained on a wide range of datasets derived from various medical challenges, covering diverse imaging modalities, anatomical targets, and pathological conditions. Finally, for the VLFAT26 approach, we tested RETFound (RETFound+VLFAT) and a basic ViT27 pretrained on ImageNet (VLFAT) as feature extractors. All tested approaches support full-volume processing without subsampling the number of slices, ensuring the preservation of all available information.
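As a concrete illustration of the two MIL pooling rules, the minimal sketch below shows how per-slice class probabilities can be reduced to a volume-level prediction. The tensor shapes and the use of softmax outputs are assumptions made for the example; this is not the study's actual code.

```python
import torch

def mil_volume_prediction(slice_probs: torch.Tensor, pooling: str = "avg") -> torch.Tensor:
    """Aggregate per-slice class probabilities into a single volume-level prediction.

    slice_probs: tensor of shape (n_slices, n_classes), e.g. softmax outputs of a
    2D slice classifier such as a fine-tuned RETFound backbone (illustrative only).
    """
    if pooling == "avg":   # RETFound-MIL-AVG: mean over slices
        return slice_probs.mean(dim=0)
    if pooling == "max":   # RETFound-MIL-MAX: per-class maximum over slices
        return slice_probs.max(dim=0).values
    raise ValueError(f"unknown pooling: {pooling}")

# Example: a 25-slice volume scored over the 5 standardized classes.
probs = torch.softmax(torch.randn(25, 5), dim=1)
print(mil_volume_prediction(probs, "avg"), mil_volume_prediction(probs, "max"))
```

By contrast, VLFAT aggregates slice-level features (rather than predictions) with attention before a single volume-level classification head, as described above.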
However, these methods require resizing OCT slices to a fixed size of 224×224 pixels due to architectural constraints, such as input size limitations inherited from pretraining. This resizing limits the ability to fully exploit the variable, high-resolution details inherent to OCT imaging, reduces image quality, and may negatively impact classification performance. To overcome these limitations, we proposed a novel deep learning architecture, FlexiVarViT, specifically designed to process variable-resolution data effectively while preserving critical image details. The proposed architecture is illustrated in Figure 1c.

FlexiVarViT is based on the FlexiViT37 architecture and is specifically designed to handle variable, high-resolution data without resizing, preserving the quality of high-resolution images such as OCT scans. Like FlexiViT, FlexiVarViT uses variable-sized patches; however, our model handles data of varying resolutions by dynamically adjusting the patch size to the image dimensions. In FlexiVarViT, the patch size is adjusted so that the number of patches per slice remains fixed. This strategy preserves a trainable position encoding without requiring additional modifications27. Because the number of patches is constant, the position encoding can be applied consistently across all slices, ensuring that positional information is preserved and effectively exploited by the model. With FlexiVarViT, all images from the different datasets were processed at their original resolution after our preprocessing.
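To make the fixed-grid, dynamic-patch-size idea concrete, here is a minimal sketch assuming a hypothetical 14 × 14 token grid, a single-channel 768-dimensional patch embedding, and a simple bilinear resize of the embedding weights (FlexiViT itself derives a pseudo-inverse-based resize). It illustrates the principle only and is not the actual FlexiVarViT implementation.

```python
import math
import torch
import torch.nn.functional as F

GRID = (14, 14)  # hypothetical fixed token grid (196 patches per slice); the exact
                 # grid used by FlexiVarViT is not specified here

def dynamic_patch_size(height: int, width: int, grid=GRID):
    """Patch size chosen so that every slice, whatever its resolution, yields the
    same grid of patches (slices may need a few pixels of padding to be divisible)."""
    return math.ceil(height / grid[0]), math.ceil(width / grid[1])

def resize_patch_embedding(weight: torch.Tensor, new_size):
    """Resample the convolutional patch-embedding kernel to a new patch size.
    A bilinear resize is used for simplicity; FlexiViT uses a pseudo-inverse-based
    resize, so this is only an approximation."""
    return F.interpolate(weight, size=new_size, mode="bilinear", align_corners=False)

# Two B-scans of different native resolutions map to the same 14 x 14 token grid.
w0 = torch.randn(768, 1, 16, 16)  # assumed pretrained 16 x 16 patch-embedding weights
for h, w in [(496, 512), (496, 1536)]:
    ph, pw = dynamic_patch_size(h, w)
    print((h, w), "-> patch", (ph, pw), "-> kernel", tuple(resize_patch_embedding(w0, (ph, pw)).shape))
```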
Model development and evaluation across different parameters

We developed and evaluated our model through a systematic process aimed at ensuring both robustness and generalizability. Figure 1d illustrates the overall development and evaluation pipeline. The OCTBrest dataset was split into five folds; the class distributions per fold are reported in Supplementary Table S1. A 4-fold cross-validation was performed on folds 1 to 4, producing four models. Model ensembling, using averaged predictions, was employed to achieve robust performance. The final evaluation was conducted on fold 5, held out as an independent test set. The OCTDL and NEH datasets were used solely for testing, allowing us to assess the model's generalization to unseen data, including variations in imaging resolution and slice count.

Pretraining strategies

We explored and compared two OCT-specific pretraining strategies: RETFound, based on masked autoencoding over 736,000 OCT B-scans, and supervised training on the Kermany dataset, the largest publicly available OCT slice dataset. First, we integrated the RETFound pretrained weights into our FlexiVarViT architecture and combined them with the VLFAT framework (FlexiVarViT(retfound-p)+VLFAT), allowing our model to leverage large-scale OCT-derived features and potentially improve robustness and generalization. However, both RETFound and ImageNet-based models are trained on downsampled images (224×224), which may limit their effectiveness in architectures designed to handle native-resolution OCT data. To address this limitation, we leveraged the high-resolution capabilities of FlexiVarViT and applied supervised pretraining on the Kermany dataset (FlexiVarViT(kermany-p)). This dataset comprises 108,312 high-resolution slices from 4,686 patients, split into 80% for training and 20% for validation, with an additional 250 samples per class (from 663 patients) held out for testing. FlexiVarViT achieved a mean one-vs-all AUC of 0.99 across the four classes on this test set, demonstrating excellent performance and potential for automated OCT diagnostics.

Implementation details

All models were implemented in PyTorch (v2.4.0) and Python (v3.11.9) and trained using the AdamW optimizer with a cosine scheduler with warmup and a learning rate of 6e-6 over 100 epochs on four NVIDIA A6000 GPUs (48 GB). All models were trained with the same seed for consistency. During training, we applied random slice selection following a Gaussian distribution26, with a random number of slices N chosen from the set {5, 10, 15, 20, 25}. For 3D-CNN models and FlexiVarViT, gradient backpropagation was accumulated over 8 iterations (batch size of 8) to handle variable-length data efficiently. The best model was selected based on the highest mean one-vs-all AUC (area under the receiver operating characteristic curve) obtained on the validation set.

Statistical analysis

The performance of each algorithm was assessed by computing one-vs-all AUC values for each class in each dataset, using the 'sklearn' package (v1.5.1). Statistical analyses were conducted in R (v4.4.0). We applied the Wilcoxon signed-rank test to compare two algorithms, i.e., to compare two 13-tuples of AUC values.
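For illustration, the sketch below reproduces this evaluation logic in Python (the study used R for the Wilcoxon test): per-class one-vs-all AUCs, then a paired Wilcoxon signed-rank test over two 13-tuples of AUCs with the Bonferroni-corrected threshold of 0.05/6. The AUC values shown are placeholders, not the study's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from scipy.stats import wilcoxon

def one_vs_all_aucs(y_true: np.ndarray, y_score: np.ndarray, n_classes: int):
    """One-vs-all AUC per class; y_true holds integer labels, y_score per-class scores."""
    return [roc_auc_score((y_true == c).astype(int), y_score[:, c]) for c in range(n_classes)]

# Placeholder 13-tuples of per-class AUCs (5 OCTBrest + 5 OCTDL + 3 NEH classes)
# for two hypothetical models; in the study these values come from Table 4.
aucs_model_a = np.array([0.96, 0.95, 0.97, 0.96, 0.94, 0.92, 0.90, 0.93, 0.91, 0.88, 0.99, 0.99, 1.00])
aucs_model_b = np.array([0.93, 0.92, 0.95, 0.95, 0.91, 0.85, 0.84, 0.88, 0.87, 0.86, 0.97, 0.98, 0.99])

stat, p_value = wilcoxon(aucs_model_a, aucs_model_b)  # paired, two-sided by default
alpha = 0.05 / 6  # Bonferroni correction for the six pairwise comparisons reported
print(f"p = {p_value:.4f}, significant at {alpha:.3f}: {p_value < alpha}")
```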
Results

The classification performance of the different approaches is summarized in Tables 3 and 4. Table 3 presents global metrics, including mean one-vs-all AUC, F1-score, sensitivity, and specificity. Table 4 details class-wise AUCs for each pathology across all datasets.

Table 3 Overall classification performance across the three multi-disease datasets.
Table 4 Class-wise classification performance (one-vs-all AUC) across the three multi-disease datasets.

Among all methods, FlexiVarViT pretrained on Kermany and combined with VLFAT achieved the best results on both in-domain and out-of-domain datasets, with mean AUCs of 0.963 on OCTBrest, 0.916 on OCTDL, and 0.996 on NEH. At the class level, it obtained the highest AUC in 11 out of 13 categories, confirming its effectiveness for multi-disease detection and its strong generalization capability.

VLFAT-based models consistently outperformed both MIL and 3D-CNN approaches. RETFound-MIL (using AVG and MAX pooling) achieved stable AUCs across datasets, ranging from 0.840 to 0.953. However, their F1-scores and sensitivities were lower, particularly on OCTDL, where the F1-score dropped to around 0.491 and sensitivity remained below 0.5. On OCTBrest and NEH, their F1-scores were more stable, around 0.659 to 0.703. No notable difference in performance was observed between the AVG and MAX variants.

3D-CNN models (MedNet3D-R18 and R50) showed the weakest performance overall. On OCTDL, both models reported low AUCs (approximately 0.560 to 0.570) and F1-scores (approximately 0.214 to 0.223). While they performed well on the VID class in OCTBrest (AUC = 0.954 to 0.961), performance declined sharply on OCTDL (AUC = 0.481 to 0.603), indicating poor robustness to domain shifts. MedNet3D-R50 slightly outperformed R18, but the gains were limited.

At the architectural level, performance differences between RETFound+VLFAT, ViT+VLFAT, and FlexiVarViT+VLFAT were minimal on OCTBrest. However, in out-of-domain evaluations on OCTDL and NEH, FlexiVarViT+VLFAT outperformed the other methods in both AUC and F1-score.

Integrating RETFound weights into FlexiVarViT led to a slight performance drop across most metrics and classes. In contrast, supervised pretraining on the Kermany dataset consistently improved performance, with AUC gains in 12 out of 13 classes and a 10% increase in F1-score on OCTDL. On NEH, it led to a small decrease in F1-score (from 0.912 to 0.890) and sensitivity (from 0.8850 to 0.8494). Interestingly, for the OTHER category in OCTDL, the RETFound-based model achieved the top AUC of 0.880, outperforming all other models on this specific class.

Figure 2 presents the ROC curves illustrating the best performance for each class and dataset. The computational costs of our algorithms on the test sets of the three datasets are reported in Table 5. In addition, the detailed confusion matrices for each method and dataset are provided in Supplementary Figure S1.

Fig. 2 Best ROC curves per dataset.
Table 5 Number of parameters and inference time for OCT classification models.

Statistical analyses

Figure 3 shows boxplots of the 13 AUC scores obtained across the three datasets for each model. Due to the similar performance observed between RETFound-MIL-AVG and RETFound-MIL-MAX, and between MedNet3D-R18 and MedNet3D-R50, only the best-performing model of each pair (based on the average of the 13 AUCs) was retained for statistical comparison.

Fig. 3 Wilcoxon signed-rank test on AUC values. Note: due to the minimal differences between RETFound-MIL-AVG and RETFound-MIL-MAX, as well as between MedNet3D-R18 and MedNet3D-R50, we selected the best-performing model of each pair based on the average AUC across the 13 classes.

Each of these models was compared to our final model, FlexiVarViT(kermany-p)+VLFAT. This model achieved the highest median AUC and exhibited a narrow interquartile range, reflecting both accuracy and stability across datasets. A Bonferroni correction was applied to set a conservative significance threshold of 0.008 (0.05/6). All pairwise comparisons yielded p-values below this threshold, confirming that FlexiVarViT(kermany-p)+VLFAT is statistically significantly superior to all other tested methods.

Discussion

This study evaluated the performance of various deep learning strategies for multi-disease classification using OCT images. To ensure robustness and generalizability across diverse patient populations and imaging devices, key factors for clinical integration, models were tested on three distinct multi-disease datasets: OCTBrest, OCTDL, and NEH. We assessed three strategies: direct 3D classification (3D-CNN), Multiple Instance Learning (MIL), and the Variable Length Volume Feature Aggregation Transformer (VLFAT). We compared state-of-the-art backbones such as RETFound, MedNet3D, and Vision Transformers (ViT), alongside our proposed FlexiVarViT architecture. We also examined the benefit of OCT-specific pretraining by integrating RETFound weights into our architecture and by performing supervised pretraining on the Kermany dataset.

VLFAT consistently outperformed both MIL and 3D-CNN approaches across all datasets, demonstrating greater robustness and generalization capacity for multi-disease OCT classification, as well as more effective handling of variable-length volumes (i.e., varying numbers of B-scans). In contrast, 3D-CNN models (MedNet3D-R18 and R50) appeared less suitable for this complex task. Their relatively low number of parameters limits their capacity to capture intricate features, and their architecture is poorly adapted to variable-length volumes or multi-label classification problems.
Additionally, their performance dropped sharply on single-slice datasets such as OCTDL, indicating a high sensitivity to the shift from full-volume to single-slice inputs.

MIL-based approaches, although more effective than 3D-CNNs, remained inferior to VLFAT in terms of performance. This difference mainly results from their aggregation and labeling strategies. MIL aggregates predictions at the output level and applies strong labels to each individual slice, assuming that all slices in a volume reflect the same pathology. This can introduce inconsistencies, particularly when pathological signs are localized and not present in every slice. In contrast, VLFAT aggregates features across slices and incorporates spatial position encoding through an attention mechanism, which allows the model to consider the relative position of each slice within the volume. By assigning a single label at the volume level and learning to weight slice-level features accordingly, VLFAT achieves a more coherent and context-aware representation of the full OCT volume, resulting in improved classification accuracy.

All methods experienced a significant drop in performance on the OCTDL dataset. This can be attributed to a domain shift, including differences in acquisition protocols, image resolution, and class distributions between the training set (OCTBrest) and the test set (OCTDL). In addition, OCTDL contains only a single central slice per volume, which limits the structural context available for accurate diagnosis. Another important factor is the labeling strategy: OCTDL uses a multi-class labeling scheme, where each slice is assigned only one pathology, in contrast to the multi-label setup used during training, where multiple co-occurring conditions could be present in a single volume. This inconsistency in labeling paradigms can lead to an underestimation of model performance, especially in cases where multiple pathologies may be present but only a single condition is annotated.

RETFound consistently delivered strong performance, particularly on the "OTHER" class, likely due to its exposure to a wide range of pathologies during pretraining on 736,000 OCT B-scans. However, incorporating RETFound weights into FlexiVarViT did not lead to notable performance improvements. This suggests that RETFound already provides robust OCT feature representations, limiting the added value of architectural enhancements such as high-resolution processing.

FlexiVarViT outperformed the standard ViT in all settings, especially in out-of-domain scenarios, thanks to its ability to process native-resolution and variable-sized OCT data. Standard models such as ViT require resizing to 224×224, leading to information loss. In contrast, FlexiVarViT dynamically adjusts patch sizes to preserve fine anatomical structures.

FlexiVarViT pretrained on the Kermany dataset significantly outperformed all other models, including those using ImageNet or RETFound weights (p < 0.008). This confirms the value of architecture-specific, domain-specific pretraining for OCT classification. Supervised multi-class pretraining on Kermany likely helped the model learn pathology-specific features more effectively than masked autoencoding (MAE). On NEH, our pretraining improved overall performance but led to a slight drop in F1-score and sensitivity, potentially due to overfitting caused by class overlap between Kermany and NEH.
Expanding the pretraining dataset to include more rare or underrepresented conditions may help overcome this limitation and further improve performance and generalization.

Finally, in terms of computational cost, FlexiVarViT+VLFAT provided the best trade-off between efficiency and performance. Our current pipeline relies on an ensemble of four models to improve robustness to class imbalance, increasing inference time and memory usage by a factor of four. Future development will focus on optimizing a single high-performing model to reduce resource usage and support real-time clinical deployment. A complete overview of our final classification framework is illustrated in Figure 1e.

Limitations

Despite the promising results, this study has several limitations. First, rare pathologies were underrepresented and grouped into a single heterogeneous "OTHER" category, which includes a wide range of infrequent and diverse conditions. This grouping limits the ability to accurately evaluate the model's performance on specific rare diseases and may mask weaknesses related to underrepresented classes. Second, co-occurring pathologies were not assessed in the external test sets. External validation using OCT datasets with multi-pathology annotations is therefore necessary to better evaluate the model's performance under realistic clinical conditions. Finally, interpretability was not addressed in this study, despite being essential for clinical adoption and integration into diagnostic workflows.

Future work will aim to overcome these limitations by expanding the dataset to include a broader spectrum of rare and co-occurring pathologies, and by developing interpretability tools. We propose an interpretability strategy that identifies the most relevant slice within each OCT volume and highlights the lesion regions within that slice. This would improve transparency, facilitate clinical understanding of model decisions, and strengthen trust in automated diagnostic systems.

Conclusion

In this study, we introduced and evaluated a high-resolution transformer-based pipeline for multi-disease OCT classification. Through extensive comparison of deep learning architectures and pretraining strategies, we showed that FlexiVarViT+VLFAT, pretrained on the Kermany dataset using supervised learning, consistently achieves the highest performance across datasets. Our results confirm the importance of native-resolution processing and volume-level attention-based aggregation for accurate diagnosis. Furthermore, supervised pretraining on high-resolution OCT data offers significant gains over generic large-scale pretraining such as RETFound.

While designed for OCT, our framework is adaptable to other 3D medical imaging modalities such as brain MRI or chest CT, where both resolution and slice count vary. Its flexibility and high-resolution processing make it a strong candidate for generalization across clinical imaging domains. In summary, this work highlights the importance of matching model design and pretraining to medical imaging characteristics and provides a promising foundation for developing scalable, interpretable, and clinically applicable AI tools.

Data availability

The OCTBrest dataset is not publicly available due to privacy constraints of the project but can be obtained from the corresponding author upon reasonable request. The OCTDL dataset is publicly available at https://data.mendeley.com/datasets/sncdhf53xc/4.
The NEH dataset is publicly available at https://hrabbani.site123.me/available-datasets/dataset-for-oct-classification-50-normal-48-amd-50-dme. The Kermany dataset is publicly available at https://data.mendeley.com/datasets/rscbjbr9sj/3.

Abbreviations

AI: Artificial intelligence
AMD: Age-related macular degeneration
AUC: Area under the ROC curve
DME: Diabetic macular edema
DL: Deep learning
MIL: Multiple instance learning
NEH: Noor Eye Hospital dataset
OCT: Optical coherence tomography
OCTDL: Optical Coherence Tomography Dataset for Image-Based Deep Learning Methods
OCTIP: Optical coherence tomography image preprocessing
ROC: Receiver operating characteristic
VID: Vitreomacular interface disease
ViT: Vision transformer
VLFAT: Variable length volume feature aggregation transformer

References

1. GBD 2019 Blindness and Vision Impairment Collaborators & Vision Loss Expert Group of the Global Burden of Disease Study. Causes of blindness and vision impairment in 2020 and trends over 30 years, and prevalence of avoidable blindness in relation to VISION 2020: the Right to Sight: an analysis for the Global Burden of Disease Study. Lancet Glob. Health 9, e144–e160 (2021).
2. Haakenstad, A. et al. Measuring the availability of human resources for health and its relationship to universal health coverage for 204 countries and territories from 1990 to 2019: a systematic analysis for the Global Burden of Disease Study 2019. Lancet 399, 2129–2154 (2022).
3. Daich Varela, M. et al. Artificial intelligence in retinal disease: clinical application, challenges, and future directions. Graefes Arch. Clin. Exp. Ophthalmol. 261, 3283–3297 (2023).
4. Muchuchuti, S. & Viriri, S. Retinal disease detection using deep learning techniques: a comprehensive review. J. Imaging 9, 84 (2023).
5. Sükei, E. et al. Multi-modal representation learning in retinal imaging using self-supervised learning for enhanced clinical predictions. Sci. Rep. 14, 26802 (2024).
6. Matta, S. et al. Towards population-independent, multi-disease detection in fundus photographs. Sci. Rep. 13, 11493 (2023).
7. Li, J. et al. Automated detection of myopic maculopathy from color fundus photographs using deep convolutional neural networks. Eye Vis. 9, 13 (2022).
8. Sahlsten, J. et al. Deep learning fundus image analysis for diabetic retinopathy and macular edema grading. Sci. Rep. 9, 10750 (2019).
9. Rauf, N., Gilani, S. O. & Waris, A. Automatic detection of pathological myopia using machine learning. Sci. Rep. 11, 16570 (2021).
10. Zhang, P., Conze, P.-H., Lamard, M., Quellec, G. & Daho, M. E. H. Deep learning-based detection of referable diabetic retinopathy and macular edema using ultra-widefield fundus imaging. Preprint at https://doi.org/10.48550/arXiv.2409.12854 (2024).
11. Silva, P. S. et al. Automated machine learning for predicting diabetic retinopathy progression from ultra-widefield retinal images. JAMA Ophthalmol. 142, 171 (2024).
12. Park, S.-J., Ko, T., Park, C.-K., Kim, Y.-C. & Choi, I.-Y. Deep learning model based on 3D optical coherence tomography images for the automated detection of pathologic myopia. Diagnostics 12, 742 (2022).
13. Dow, E. R. et al. A deep-learning algorithm to predict short-term progression to geographic atrophy on spectral-domain optical coherence tomography. JAMA Ophthalmol. 141, 1052 (2023).
14. Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172, 1122–1131.e9 (2018).
15. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
16. Wang, Y. et al. Automated early detection of acute retinal necrosis from ultra-widefield color fundus photography using deep learning. Eye Vis. 11, 27 (2024).
17. Podoleanu, A. G. Optical coherence tomography. J. Microsc. 247, 209–219 (2012).
18. Qiu, J. & Sun, Y. Self-supervised iterative refinement learning for macular OCT volumetric data classification. Comput. Biol. Med. 111, 103327 (2019).
19. Quellec, G., Cazuguel, G., Cochener, B. & Lamard, M. Multiple-instance learning for medical image and video analysis. IEEE Rev. Biomed. Eng. 10, 213–234 (2017).
20. Matten, P. et al. Multiple instance learning based classification of diabetic retinopathy in weakly-labeled widefield OCTA en face images. Sci. Rep. 13, 8713 (2023).
21. de Vente, C., van Ginneken, B., Hoyng, C. B., Klaver, C. C. W. & Sánchez, C. I. Uncertainty-aware multiple-instance learning for reliable classification: application to optical coherence tomography. Med. Image Anal. 97, 103259 (2024).
22. Rong, Y. et al. Surrogate-assisted retinal OCT image classification based on convolutional neural networks. IEEE J. Biomed. Health Inform. 23, 253–263 (2019).
23. Maetschke, S. et al. A feature agnostic approach for glaucoma detection in OCT volumes. PLoS ONE 14, e0219126 (2019).
24. Noury, E. et al. Deep learning for glaucoma detection and identification of novel diagnostic areas in diverse real-world datasets. Transl. Vis. Sci. Technol. 11, 11 (2022).
25. Rasel, R. K. et al. Assessing the efficacy of 2D and 3D CNN algorithms in OCT-based glaucoma detection. Sci. Rep. 14, 11758 (2024).
26. Oghbaie, M., Araujo, T., Emre, T., Schmidt-Erfurth, U. & Bogunovic, H. Transformer-based end-to-end classification of variable-length volumetric data. Preprint at http://arxiv.org/abs/2307.06666 (2023).
27. Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. Preprint at http://arxiv.org/abs/2010.11929 (2021).
28. Zhou, Y. et al. A foundation model for generalizable disease detection from retinal images. Nature 622, 156–163 (2023).
29. Moradi, M., Chen, Y., Du, X. & Seddon, J. M. Deep ensemble learning for automated non-advanced AMD classification using optimized retinal layer segmentation and SD-OCT scans. Comput. Biol. Med. 154, 106512 (2023).
30. Wang, X. et al. UD-MIL: uncertainty-driven deep multiple instance learning for OCT image classification. IEEE J. Biomed. Health Inform. (2020).
31. Kulyabin, M. et al. OCTDL: Optical coherence tomography dataset for image-based deep learning methods. Sci. Data 11, 365 (2024).
32. Rasti, R., Rabbani, H., Mehridehnavi, A. & Hajizadeh, F. Macular OCT classification using a multi-scale convolutional neural network ensemble. IEEE Trans. Med. Imaging 37, 1024–1034 (2018).
33. Zhang, P. et al. Patch progression masked autoencoder with fusion CNN network for classifying evolution between two pairs of 2D OCT slices. Preprint at https://doi.org/10.48550/arXiv.2508.20064 (2025).
34. Quellec, G. & Zeghlache, R. MARIO: Monitoring age-related macular degeneration progression in optical coherence tomography. Zenodo https://doi.org/10.5281/zenodo.15270469 (2025).
35. Chen, S., Ma, K. & Zheng, Y. Med3D: Transfer learning for 3D medical image analysis. Preprint at https://doi.org/10.48550/arXiv.1904.00625 (2019).
36. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 770–778 (2016).
37. Beyer, L. et al. FlexiViT: One model for all patch sizes. Preprint at https://doi.org/10.48550/arXiv.2212.08013 (2023).

Acknowledgement

This work was supported by the French National Research Agency under the LabCom program (ANR-19-LCV2-0005 - ADMIRE project).

Funding

This work received state aid managed by the National Research Agency under the LabCom program (ANR-19-LCV2-0005 - ADMIRE project).

Author information

Authors and Affiliations

LaTIM UMR 1101, Inserm, Brest, France: Philippe Zhang, Gwenole Quellec, Sarah Matta, Béatrice Cochener & Mathieu Lamard
University of Western Brittany, Brest, France: Philippe Zhang, Sarah Matta, Béatrice Cochener & Mathieu Lamard
Evolucare Technologies, Villers-Bretonneux, France: Philippe Zhang, Laurent Borderie & Alexandre Le Guilcher
Ophthalmology Department, CHRU Brest, Brest, France: Tanguy Thiery & Béatrice Cochener

Contributions

G.Q., M.L. and A.L.G. designed the research. P.Z., G.Q., S.M., T.T. and M.L. contributed to data acquisition and/or research execution. P.Z., G.Q., S.M., M.L., L.B., T.T. and B.C. contributed to data analysis and/or interpretation. P.Z., G.Q., S.M. and M.L. prepared the manuscript. P.Z. prepared Figures 1, 2 and 3, Tables 1, 2, 3, 4 and 5, Supplementary Figure S1 and Supplementary Table S1. All authors reviewed the manuscript.

Corresponding author

Correspondence to Philippe Zhang.

Ethics declarations

Competing interests

The author Philippe Zhang declares no competing non-financial interests but the following competing financial interests: employee of Evolucare Technologies. The authors Sarah Matta, Tanguy Thiery and Mathieu Lamard declare no competing financial or non-financial interests. The author Laurent Borderie declares no competing non-financial interests but the following competing financial interests: employee of Evolucare Technologies.
The author Alexandre Le Guilcher declares no competing non-financial interests but the following competing financial interests: Research & Innovation director at Evolucare Technologies and CEO of OphtAI. The author Gwenolé Quellec declares no competing non-financial interests but the following competing financial interests: consultant for Evolucare Technologies and Adcis. The author Béatrice Cochener declares no competing non-financial interests but the following competing financial interests: consultant and clinical investigator for Thea, Alcon, Zeiss, B&L, Hoya, Horus, Santen, SIFI, Cutting Edge, and J&J. All the remaining authors declare no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.
Supplementary Information 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.