Introduction

Intraoperative frozen diagnosis is a critical technique in surgical pathology1, allowing rapid tissue sample analysis to guide surgical decisions in real time2. However, the accuracy of this diagnosis depends heavily on the quality of the frozen tissue sections, which is often compromised by the time-sensitive nature of surgery3. Tissue degeneration, contraction during freezing, and suboptimal staining frequently produce artifacts or unclear nuclear and cytoplasmic details, all of which can obscure diagnostic features4,5. These quality issues make frozen section interpretation challenging for pathologists, affecting diagnostic reliability and confidence6. Improving frozen section quality is therefore critical to advancing intraoperative diagnostic accuracy.

Deep learning has emerged as a powerful tool in digital pathology for various tasks, including image quality enhancement7,8. Generative adversarial networks (GANs) have shown particular promise, as they can learn data distributions and generate synthetic data that closely mirrors real samples9,10. Previous studies have shown that GANs can be effectively applied to image enhancement, producing high-resolution synthetic images that can even deceive experienced pathologists11,12. This ability makes GANs a promising candidate for improving the quality of frozen histologic images. GANs have already been shown to enhance cryosectioned image quality by transforming frozen images to resemble formalin-fixed, paraffin-embedded (FFPE) tissue images, the standard in histopathology13,14. Despite these promising advances, a comprehensive evaluation of the clinical impact, reliability, and utility of such methods within an end-to-end intraoperative diagnostic workflow is still lacking.

To address these challenges, we introduce GAS, a multifunctional, end-to-end framework tailored for intraoperative frozen diagnosis that comprises three integrated modules: the Generation, Assessment, and Support modules. The Generation module, trained using a GAN-based multimodal network, converts frozen section images into high-quality virtual FFPE-style images, guided by descriptive text annotations15. The Assessment module uses four quality control models, fine-tuned from a pathological foundation model, to objectively assess the quality of the generated images, demonstrating substantial improvements in microstructural detail over the original frozen sections. Finally, the Support module integrates GAS into routine clinical practice through a human-artificial intelligence (AI) collaboration platform, improving pathologists' diagnostic confidence in prospective cohorts. Our study highlights the clinical utility of the GAS platform in intraoperative diagnostics, establishing a model for integrating end-to-end AI solutions into standard clinical workflows.

Results

Study participants

In this study, we introduced the GAS platform to provide a comprehensive diagnostic process that optimizes frozen section images, evaluates their quality, and supports clinical decision-making through multiple integrated modules. To achieve this, we utilized eight cohorts comprising over 6700 whole-slide images (WSIs) for model training and validation, as shown in Fig. 1. The Internal GZCC cohort served as the training and internal validation dataset, consisting of 553 frozen and 844 FFPE slides from 325 patients. External validation included six retrospective cohorts and one prospective cohort. Specifically, the External JMCH cohort comprised 228 frozen and 310 FFPE slides from 191 patients; the External HZCH cohort contained 225 frozen and 225 FFPE slides from 225 patients; the External DGPH cohort contained 114 frozen and 115 FFPE slides from 115 patients; the External JMPH cohort had 103 frozen and 86 FFPE slides from 77 patients; the External ZSZL cohort had 1500 frozen slides from 674 patients; and the TCGA cohort contained 1180 frozen and 857 FFPE slides from 815 patients. The prospective cohort, Pro-External GZCC, included 189 frozen and 179 FFPE slides from 188 patients. All slides were initially scanned at ×40 magnification and subsequently downsampled to ×20 for model training, balancing clinical relevance and computational efficiency. Details of participant characteristics are presented in Table 1.

Fig. 1: The overview of the workflow. The datasets were derived from eight different cohorts, with the Internal GZCC data used for model training and data from the other cohorts used to externally test the model's generalization performance. The number of included slides from each organ is illustrated in the column chart. The GAS platform comprises three modules: the Generation module, the Assessment module, and the Support module. The Generation module was trained using a GAN-based multimodal network guided by text descriptions of the FFPE style. The Assessment module included four microstructural quality control models, developed by fine-tuning a pathological foundation model using only a small number of patches. The Support module was delivered through human-artificial intelligence collaboration software to aid pathologists in making diagnoses. FFPE formalin-fixed, paraffin-embedded; WSI whole-slide image; GAN generative adversarial network.

Table 1 Details of study participants
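As an illustration of the preprocessing step described above (×40 scans downsampled to ×20 for training), the following sketch reads a roughly ×20 tile from a WSI with the openslide-python library. The file name, tile location, and tile size are hypothetical; this is a generic example, not the authors' pipeline.

```python
# Illustrative preprocessing sketch: extract ~x20 tiles from a x40 WSI.
# Assumes openslide-python is installed; path and tile size are hypothetical.
import openslide

slide = openslide.OpenSlide("example_frozen_section.svs")  # hypothetical file

# Find the pyramid level whose downsample factor is closest to 2
# (x40 base magnification -> ~x20).
level = slide.get_best_level_for_downsample(2.0)

tile_size = 512  # hypothetical patch size
region = slide.read_region(
    location=(0, 0),   # top-left corner, in level-0 coordinates
    level=level,
    size=(tile_size, tile_size),
)
patch = region.convert("RGB")  # drop the alpha channel before model input
slide.close()
```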
Development of a generation module for optimizing frozen section images

To generate virtual FFPE-like images from frozen sections, we developed the Generation module of the GAS platform. This module employed a GAN-based multimodal unpaired transfer network with both image- and patch-level supervision, making it adaptable to various tumor types. The transformation process was guided and compared using text descriptions representing various histological styles. Among these, the style described as 'formalin-fixed paraffin-embedded tissues' consistently yielded the most favorable results within the Internal GZCC cohort (Supplementary Table 1). Consequently, this FFPE style was adopted to guide the transformation, enhancing the visual fidelity of the synthetic images to real FFPE samples (Fig. 2a). Notable quality improvements were observed across multiple organ types (Fig. 2b–d, Supplementary Fig. 1 and Supplementary Data 1).

Fig. 2: Development of the generation module. a The architecture of the generative model consisted of three key components: an encoder, a style neck, and a decoder. The style neck was responsible for transferring the style from frozen to FFPE patches. b–d Examples showing that the GAS-generated images improved image quality across various organs. e–g Confusion matrices presenting the classification results (generated or real FFPE patches) of three pathologists. FFPE formalin-fixed, paraffin-embedded; GAN generative adversarial network.
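To make the encoder/style-neck/decoder layout of Fig. 2a concrete, below is a minimal PyTorch sketch of a text-conditioned generator with that general shape. The layer sizes, the AdaIN-style conditioning, and the 512-dimensional text embedding are our illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an encoder / style-neck / decoder generator in which a
# text embedding conditions the style transfer (frozen -> FFPE-like).
# Layer sizes and the conditioning mechanism are illustrative assumptions.
import torch
import torch.nn as nn

class StyleNeck(nn.Module):
    """Residual block modulated by a text-derived style vector (AdaIN-like)."""
    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels, affine=False),
        )
        # Map the text embedding to per-channel scale and shift.
        self.to_scale_shift = nn.Linear(text_dim, 2 * channels)

    def forward(self, x, text_emb):
        h = self.conv(x)
        scale, shift = self.to_scale_shift(text_emb).chunk(2, dim=1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return x + h  # residual connection

class Generator(nn.Module):
    def __init__(self, text_dim: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(  # downsample 256 -> 64 spatially
            nn.Conv2d(3, 64, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.style_neck = nn.ModuleList(
            [StyleNeck(256, text_dim) for _ in range(4)]
        )
        self.decoder = nn.Sequential(  # upsample back to input resolution
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh(),
        )

    def forward(self, frozen_patch, text_emb):
        h = self.encoder(frozen_patch)
        for block in self.style_neck:
            h = block(h, text_emb)
        return self.decoder(h)

# Usage: text_emb would come from a text encoder (e.g., a CLIP-style model)
# applied to a prompt such as 'formalin-fixed paraffin-embedded tissues'.
g = Generator()
fake_ffpe = g(torch.randn(1, 3, 256, 256), torch.randn(1, 512))
```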
To assess the fidelity of the generated images relative to real FFPE images, a reader study was conducted. Three pathologists reviewed 200 image patches and classified each as either generated or real. The pathologists classified 50–72% of synthetic images as real FFPE and 26–47% of real FFPE images as generated (Fig. 2e–g); together with low inter-observer agreement, this indicated that pathologists could not reliably differentiate real FFPE from synthetic images. An additional reader study examined whether pathologists could distinguish frozen images from generated images. Classification accuracy was high, ranging from 89.5 to 93.5% (Supplementary Fig. 2), indicating that the generated images were visually distinct from frozen images.

Evaluation of the generation module using Fréchet inception distance (FID)

The Generation module of the GAS platform was assessed for its performance in producing FFPE-like images using the FID, a widely accepted metric of image similarity and quality13. The FID quantifies the similarity of generated images to real FFPE images (lower values indicate greater similarity) and was compared against state-of-the-art generative models across multiple datasets (Fig. 3a). In the independent test cohort of Internal GZCC, GAS achieved a superior FID of 23.021, outperforming other generative models whose FID values ranged from 23.717 to 64.977. Similar advantages were observed in the external validation cohorts (Fig. 3b–g and Supplementary Table 2). These results collectively demonstrate that GAS consistently produces high-quality FFPE-like images that closely resemble real FFPE images across diverse datasets.

Fig. 3: FID evaluation of the generation module. a The workflow for calculating the FID. b–g The FID of different generative models across different cohorts: Internal GZCC, External JMCH, External HZCH, External DGPH, External JMPH, and Pro-External GZCC. Internal GZCC, the test cohort of the retrospective internal cohort; External JMCH, External HZCH, External DGPH, and External JMPH, four retrospective external cohorts; Pro-External GZCC, the prospective study. FFPE formalin-fixed, paraffin-embedded; FID Fréchet inception distance.
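For reference, the FID compares Gaussians fitted to Inception-v3 features of the two image sets: FID = ||mu_r − mu_g||² + Tr(C_r + C_g − 2(C_r C_g)^{1/2}). A minimal sketch of this computation is given below; feature extraction is omitted, and the random 256-dimensional features stand in for the 2048-dimensional Inception activations used in practice.

```python
# Minimal FID sketch: given Inception-v3 activations for real FFPE patches
# and generated patches, fit a Gaussian to each set and compute
# FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 * sqrt(C_r @ C_g)).
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the covariance product.
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random 256-d features (real pipelines use the 2048-d
# Inception pool3 activations):
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 256)), rng.normal(size=(500, 256))))
```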
Establishment of an assessment module for quality control

To complement the evaluation of similarity between generated and FFPE images, we established the second component of the GAS platform: the Assessment module. This module comprised four quality control models designed to automatically assess critical pathological features: nuclear detail, cytoplasmic detail, extracellular fibrosis, and overall stain quality (Fig. 4a). A total of 1020 image patches from frozen, generated, and FFPE sections across multiple organs were used for model training and validation (Supplementary Table 3), with 816 patches used to fine-tune a pathological foundation model into four binary classifiers. High- and low-quality labels were established by consensus between two expert pathologists, following the criteria detailed in Supplementary Table 4. Compared with a 4-tier scoring system (excellent to very poor), the binary labeling showed higher inter-rater consistency (Supplementary Figs. 3 and 4). In independent testing, all four binary classifiers achieved strong performance: nuclear detail (area under the curve (AUC) 0.973, accuracy 0.922), cytoplasmic detail (AUC 0.975, accuracy 0.941), extracellular fibrosis (AUC 0.871, accuracy 0.877), and overall stain quality (AUC 0.965, accuracy 0.941) (Fig. 4b, c).

Fig. 4: Development of the assessment module. a The workflow of developing quality control models in the Assessment module. The adapter architecture was applied to fine-tune the foundation model for the quality assessment task. b The area under the curve (AUC) of different items (nuclear detail, cytoplasmic detail, extracellular fibrosis, and overall stain quality) in the validation cohort of the quality control models. c Confusion matrices presenting the classification results (low or high quality) of the quality control models for different items in the validation cohort. d The column chart displaying the percentage of high-quality and low-quality images for GAS-generated (G) and frozen (F) images in the test cohort of the Internal GZCC dataset. e Gradient-weighted Class Activation Mapping (Grad-CAM) highlighting the areas of focus of the quality control models.

To verify that the models effectively captured the intended features, Gradient-weighted Class Activation Mapping (Grad-CAM) was applied. Overlaying Grad-CAM heatmaps onto H&E patches revealed that the nuclear detail, cytoplasmic detail, and extracellular fibrosis models accurately focused on their respective features: nuclei, cytoplasm, and extracellular fibrosis. These visualizations confirmed that the Assessment module evaluated the quality of the intended pathological features (Fig. 4e and Supplementary Fig. 5).
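Grad-CAM weights the last convolutional feature maps by the spatial average of their gradients with respect to the target class and keeps the positive part. A minimal sketch follows; the ResNet-18 backbone and target layer are stand-ins, not the paper's quality control models.

```python
# Minimal Grad-CAM sketch for a CNN-based binary quality classifier.
# The model and target layer are stand-ins for the paper's networks.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(num_classes=2).eval()  # hypothetical binary quality model
target_layer = model.layer4             # last convolutional stage

activations, gradients = {}, {}
target_layer.register_forward_hook(
    lambda m, i, o: activations.setdefault("v", o)
)
target_layer.register_full_backward_hook(
    lambda m, gi, go: gradients.setdefault("v", go[0])
)

patch = torch.randn(1, 3, 224, 224)  # stand-in for an H&E patch
logits = model(patch)
logits[0, 1].backward()              # gradient w.r.t. the 'high quality' class

# Channel-wise weights = global-average-pooled gradients.
weights = gradients["v"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["v"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=patch.shape[-2:], mode="bilinear")
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
# 'cam' can now be overlaid on the H&E patch as a heatmap.
```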
Image quality evaluation by the assessment module

The Assessment module was used to assess and compare the quality of frozen images and GAS-generated images. Substantial quality improvements in generated patches were observed across all evaluated features in the Internal GZCC cohort. For nuclear detail, the proportion of high-quality patches increased from 38.26% in frozen patches to 61.19% in generated patches, a 22.93% improvement. For cytoplasmic detail, high-quality patches rose from 31.19% to 64.78%, an improvement of 33.59%. For extracellular fibrosis, the percentage of high-quality patches increased from 87.88% to 97.91%, reflecting a 10.03% improvement. For overall stain quality, high-quality patches increased from 70.93% to 92.23%, a 21.30% improvement (Fig. 4d).

The module was further applied across the external validation cohorts, revealing consistent enhancements in image quality across diverse datasets. In the External JMCH cohort, generated images exhibited an increase in high-quality patches, with nuclear detail improving from 77.12 to 78.86%, cytoplasmic detail from 79.93 to 87.18%, extracellular fibrosis from 97.24 to 99.00%, and overall stain quality from 91.73 to 94.55%. In the External HZCH cohort, improvements were observed in nuclear detail (from 63.98 to 65.22%), cytoplasmic detail (from 69.14 to 78.75%), and overall stain quality (from 90.36 to 91.70%). In the External DGPH cohort, increases were seen in nuclear detail (from 56.64 to 69.38%), cytoplasmic detail (from 61.07 to 75.72%), extracellular fibrosis (from 95.74 to 98.31%), and overall stain quality (from 91.20 to 94.82%). The External JMPH cohort exhibited enhancements in nuclear detail (from 84.27 to 85.96%), cytoplasmic detail (from 87.29 to 92.36%), extracellular fibrosis (from 98.76 to 99.44%), and overall stain quality (from 96.97 to 97.89%). In the prospective Pro-External GZCC cohort, quality improvements were also observed in GAS-generated images compared with frozen images: nuclear detail increased from 8.74 to 25.45%, cytoplasmic detail from 20.45 to 40.45%, extracellular fibrosis from 80.32 to 87.50%, and overall stain quality from 41.57 to 69.27% (Supplementary Fig. 6). To assess robustness, we applied the model to images with pronounced folds and poor quality; the generated images still showed an increase in high-quality patches (Supplementary Table 5). These results underscore the effectiveness of the GAS platform in enhancing the quality of pathological microstructure, facilitating improved diagnostic evaluation.

Support module for clinical application

To assess the impact of GAS-generated images on diagnostic performance, we conducted several deep-learning classification tasks, comparing results derived from GAS-generated images with those using the original frozen images as inputs. For assessing margin positivity in breast cancer, GAS-generated images achieved a higher AUC of 0.874, compared with 0.862 for frozen images (P = 0.788, t test). Similarly, in distinguishing between benign and malignant breast lesions, GAS-generated images outperformed frozen images, achieving an AUC of 0.784 versus 0.765 (P = 0.254, t test). For differentiating breast carcinoma in situ from invasive breast cancer, GAS-generated images again performed better, with an AUC of 0.722 versus 0.696 for frozen images (P = 0.057, t test) (Fig. 5a–c). We also tested three additional tasks: predicting sentinel lymph node metastasis in breast cancer, classifying lung cancers as adenocarcinoma or squamous cell carcinoma, and distinguishing benign from malignant thyroid lesions. In these tasks, the generated images likewise yielded higher diagnostic performance than frozen images (Supplementary Table 6). These consistently higher AUCs underscore the platform's ability to transform frozen images into high-quality virtual FFPE-like images that support diagnostic accuracy in clinical applications.

Fig. 5: Clinical application of the support module. a–c Violin plots illustrating the diagnostic performance of deep-learning models in margin assessment, distinguishing between benign and malignant breast lesions, and differentiating breast carcinoma in situ from invasive breast cancer, using either GAS-generated images or frozen images as inputs. Statistical significance was assessed using the t test. d Screenshot of the human–AI collaboration software. e Violin plots illustrating the initial diagnostic confidence of the three pathologists as well as their confidence after utilizing GAS assistance. Statistical significance was evaluated using the Wilcoxon signed-rank test. f–h Confusion matrices presenting the three pathologists' initial diagnostic confidence and GAS-assisted confidence.
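The legend above indicates that AUC distributions were compared with a t test. One common way to obtain such distributions is to bootstrap the test set; the sketch below shows this pattern on synthetic labels and scores (all values are stand-ins, not the study's data).

```python
# Sketch of the comparison behind Fig. 5a-c: bootstrap AUCs for models fed
# GAS-generated vs. frozen images, then a t test on the two AUC samples.
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=300)                       # ground-truth labels
score_gen = y * 0.6 + rng.normal(0.3, 0.25, size=300)  # generated-image model
score_frz = y * 0.5 + rng.normal(0.3, 0.25, size=300)  # frozen-image model

def bootstrap_aucs(y_true, scores, n_boot=200):
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if len(np.unique(y_true[idx])) < 2:
            continue  # an AUC needs both classes present
        aucs.append(roc_auc_score(y_true[idx], scores[idx]))
    return np.array(aucs)

auc_gen = bootstrap_aucs(y, score_gen)
auc_frz = bootstrap_aucs(y, score_frz)
t, p = stats.ttest_ind(auc_gen, auc_frz)
print(f"AUC generated={auc_gen.mean():.3f}, frozen={auc_frz.mean():.3f}, P={p:.3f}")
```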
To further explore the clinical utility of GAS in intraoperative diagnostic workflows, we developed human–AI collaboration software. The software allowed pathologists to review frozen WSIs and identify low-quality regions requiring closer evaluation; these regions could then be selectively converted into virtual FFPE-like images for enhanced clarity (Supplementary Movie 1). We conducted a prospective assessment using a sub-cohort of the Pro-External GZCC cohort, which included 45 breast lesion samples (34 benign and 11 malignant). Three pathologists initially provided diagnoses and assigned diagnostic confidence levels (on a scale of 1–3, where 1 represents low confidence and 3 high confidence) based on the frozen WSIs. After employing the GAS platform to convert unclear regions into virtual FFPE-like regions, the pathologists finalized their diagnoses and confidence levels. The time required by the GAS platform is summarized in Supplementary Table 7. While their diagnostic outcomes for distinguishing benign from malignant breast lesions remained unchanged with GAS assistance, their confidence levels significantly improved (all P
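Per the Fig. 5 legend, the paired confidence comparison uses the Wilcoxon signed-rank test. A minimal sketch with hypothetical per-case confidence scores (not the study's data):

```python
# Paired confidence comparison (cf. Fig. 5e): per-case confidence scores
# (1-3) before and after GAS assistance, Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

initial  = [2, 1, 3, 2, 2, 1, 2, 3, 1, 2, 2, 1]  # frozen-WSI-only confidence
assisted = [3, 2, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2]  # after GAS conversion

# Zero-difference pairs are dropped by the default zero_method.
stat, p = wilcoxon(initial, assisted)
print(f"Wilcoxon statistic={stat}, P={p:.4f}")
```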