Comparative study of five-year cervical cancer cause-specific survival prediction models based on SEER data

Introduction

According to the 2024 American Cancer Statistics, cervical cancer (CC) has been the third leading cause of cancer death among young women since 2019 [1]. CC is the most common cancer in 25 countries and the leading cause of cancer mortality in 37 countries, and survival trends for CC have plateaued [1,2]. Cause-specific survival (CSS) estimates cancer-specific mortality by excluding deaths from other causes, making it particularly suitable for cancers like CC, where the cause of death is primarily attributed to the disease, especially in HPV-related cases. Forjaz de Lacerda et al. (2019) [3] demonstrated that CSS provides reliable survival estimates in such cases, supporting its use as the primary outcome measure in this study.

Despite the potential of machine learning (ML) in survival analysis [4], several challenges remain, particularly in CC. These include handling censored survival times, managing data imbalance, and modeling complex, non-linear relationships between demographics, tumor stage, and treatment variables [5]. These issues complicate the development of accurate and generalizable models. Several methods have been designed to address them. The Survival Tree (ST) builds a binary tree whose splits maximize survival differences among patient groups [6]. The Random Survival Forest (RSF) extends this approach by constructing multiple survival trees and averaging their predictions, which improves prediction while mitigating overfitting [7]. Gradient Boosting Survival Analysis (GBSA) fits regression trees sequentially to the negative gradient of the loss function, thereby enhancing model generalizability and reducing overfitting [8]. Rahimi et al. reported that ML techniques predicted survival outcomes for CC patients more accurately than traditional statistical methods [9], underscoring the potential of ML for improving prognostic assessments in this context.
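
The survival-tree idea described above, choosing the binary split that maximizes the survival difference between the two resulting groups, can be illustrated with a two-group log-rank statistic. The sketch below is a minimal illustration, not the authors' implementation: the function names `logrank_statistic` and `best_split`, the toy data, and the use of the log-rank statistic as the splitting criterion are all assumptions (it is a common criterion, but the source does not say which one was used).

```python
def logrank_statistic(times, events, in_left):
    """Two-group log-rank chi-square statistic.

    times   : follow-up time per patient
    events  : 1 if the event was observed, 0 if censored
    in_left : True if the patient falls in the left child node
    """
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        # Patients still under observation just before time t.
        n = sum(1 for u in times if u >= t)
        n_left = sum(1 for u, g in zip(times, in_left) if u >= t and g)
        # Events occurring exactly at time t, overall and in the left group.
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d_left = sum(
            1 for u, e, g in zip(times, events, in_left)
            if u == t and e == 1 and g
        )
        observed += d_left
        expected += d * n_left / n
        if n > 1:
            variance += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    return 0.0 if variance == 0 else (observed - expected) ** 2 / variance


def best_split(feature, times, events):
    """Return (threshold, statistic) for the split `feature <= threshold`
    with the largest log-rank statistic."""
    best = (None, 0.0)
    for threshold in sorted(set(feature))[:-1]:  # keep both children non-empty
        in_left = [x <= threshold for x in feature]
        stat = logrank_statistic(times, events, in_left)
        if stat > best[1]:
            best = (threshold, stat)
    return best


# Hypothetical toy data: an ordinal stage, follow-up in months, event flag.
stage = [1, 1, 2, 2, 3, 3, 4, 4]
months = [60, 60, 60, 48, 30, 24, 12, 6]
died = [0, 0, 0, 1, 1, 1, 1, 1]

threshold, stat = best_split(stage, months, died)
print(f"best split: stage <= {threshold}, log-rank statistic = {stat:.2f}")
```

A full survival tree would apply `best_split` recursively to each child node over all candidate features, and an RSF would repeat this on bootstrap samples and average the resulting trees.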
In addition to ML models, traditional statistical approaches such as the Cox Proportional Hazards (CoxPH) and Cox Time-Varying (CoxTV) models were included for comparison to provide a broader evaluation framework [10].

CC poses a significant health challenge, and predicting patient survival outcomes is essential for refining treatment strategies and enhancing patient care. Despite their potential, ML algorithms have rarely been applied to predict survival in CC patients [9]. While prior research underscores the importance of demographic, tumor, and treatment factors in survival prediction [11,12,13], a comparative analysis of predictive models, especially of their ability to capture complex interactions among these factors, is lacking. To address this, we used the PICOS (Patient, Intervention, Comparison, Outcome, Study Design) framework to systematically identify relevant studies and refine our research question [14]. In the initial phase of our study, a PICOS-based search of PubMed returned no results, revealing a notable gap in comparative studies employing ML models to predict CSS in CC patients. This gap presents both a challenge and an opportunity for our study to contribute valuable insights to the field.

Our study aims to compare and analyze five-year CSS prediction models for CC using the SEER database [15]. We address data imbalance via SMOTE and refine feature selection via stepwise forward selection and feature importance methods. After evaluating performance via fivefold cross-validation, we identified the most effective model and feature subset. Rationality analysis and SHAP value interpretation were conducted to elucidate the model's predictions [16].
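
To make the five-year CSS outcome concrete, the sketch below codes only cervical cancer deaths as events, treats deaths from other causes and patients still alive as censored, truncates follow-up at 60 months, and then computes a Kaplan–Meier (product-limit) estimate. It is a minimal sketch under stated assumptions: the status labels, the helper names `to_css_outcome` and `kaplan_meier`, and the six toy patients are hypothetical and do not reflect SEER field names or the study's data.

```python
HORIZON = 60  # months; five-year horizon


def to_css_outcome(months, status, horizon=HORIZON):
    """Map raw follow-up to (time, event) for cause-specific survival.

    status is one of "cancer_death", "other_death", "alive" (hypothetical labels).
    Only cancer deaths within the horizon count as events; deaths from other
    causes, living patients, and anything beyond the horizon are censored.
    """
    event = 1 if status == "cancer_death" and months <= horizon else 0
    return min(months, horizon), event


def kaplan_meier(outcomes):
    """Return [(t, S(t))] at each distinct event time."""
    data = sorted(outcomes)
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = leaving = 0
        # Group all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy cohort: (follow-up in months, status at last contact).
patients = [
    (12, "cancer_death"), (24, "other_death"), (30, "alive"),
    (48, "cancer_death"), (72, "cancer_death"), (90, "alive"),
]
outcomes = [to_css_outcome(m, s) for m, s in patients]
curve = kaplan_meier(outcomes)
for t, s in curve:
    print(f"{t:>3} months: S(t) = {s:.3f}")
```

With these toy values the estimate drops to 5/6 at 12 months and to 5/9 at 48 months. The cancer death at 72 months falls outside the five-year window, so truncation records it as censored at 60 months.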
This work may improve the understanding of CC survival factors, help identify optimal predictive models, and support clinical decision-making, thereby optimizing treatment strategies and improving patient outcomes and quality of life.

Results

Patient characteristics

Using the Kaplan–Meier method, we established a survival model for CC patients who met our criteria. We did not stratify the data, so all patients were treated as a single group to assess the overall survival rate of patients with CC. The Kaplan–Meier survival curve in Fig. 1, based on truncated data, shows a smooth trend in the CSS rate over 60 months, starting at 100% and declining to approximately 75%, indicating that roughly three-quarters of the patients had not died of CC by the end of the observation period. A risk table beneath the curve gives the number of patients at each follow-up milestone; the gradual reduction reflects the expected decrease in sample size due to death, loss to follow-up, and other reasons. We also compared Kaplan–Meier survival curves for non-truncated data to evaluate the impact of survival time discretization. The non-truncated data showed a sharp decline at 60 months due to sparse long-term data (Supplementary Fig. S1). Although truncated data may overrepresent survival probabilities at this point, comparing the Kaplan–Meier curves revealed no discrepancies in survival trends up to 60 months. Because truncating survival times to 60 months aligns the data with the five-year CSS objective, the truncated dataset was used for model analysis to ensure consistency and fairness across comparisons.

Fig. 1 60-Month CSS for Cervical Cancer.

We employed a fivefold cross-validation approach to evaluate our model's predictive performance.
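
A fivefold split of this kind can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `kfold_indices`, the fixed seed, and the toy event indicators are assumptions. Comparing the event rate in each training and test fold is used here as a simple stand-in for checking that folds are balanced.

```python
import random


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical toy cohort: 1 = cancer death within 60 months, 0 = censored.
events = [1] * 25 + [0] * 75

for fold, (train_idx, test_idx) in enumerate(kfold_indices(len(events)), start=1):
    train_rate = sum(events[i] for i in train_idx) / len(train_idx)
    test_rate = sum(events[i] for i in test_idx) / len(test_idx)
    print(
        f"fold {fold}: train n={len(train_idx)}, test n={len(test_idx)}, "
        f"event rate train={train_rate:.2f}, test={test_rate:.2f}"
    )
```

When oversampling such as SMOTE is combined with cross-validation, it is usually applied to the training portion of each fold only, so that synthetic samples do not leak into the test fold.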
Supplementary Table S1 presents the baseline characteristics of patients across the training and testing sets for each fold, ensuring a balanced validation process. Despite minor statistically significant differences in some folds (P