Explainable machine learning for predicting ICU mortality in myocardial infarction patients using pseudo-dynamic data


Introduction

Acute myocardial infarction (AMI) encompasses a spectrum of clinical presentations including ST-segment elevation myocardial infarction (STEMI), non-ST-segment elevation myocardial infarction (NSTEMI), and acute coronary syndrome with confirmed myocardial damage. Myocardial infarction (MI) remains one of the greatest contributors to cardiovascular deaths worldwide, and its incidence remains critically high: approximately every 40 seconds, someone in the United States suffers an episode1. Cardiovascular diseases (CVDs) also represent a major cost burden globally, with MI in the ICU being one of the most common CVD-related conditions in the critical care system2. In 2015, there were more than 18 million CVD-related deaths, with MI accounting for over 15% of overall mortality, and research shows that healthcare costs skyrocket with longer and more inefficient treatment in the ICU3,4,5. Patients who exhibit MI are usually referred to the ICU; however, they are 10% more likely to suffer another episode in the following days and are at higher risk of death, especially the elderly6. For STEMI, NSTEMI, and general AMI patients admitted to the ICU, multicentre studies have found mortality rates as high as 17.6%, but usually closer to 11%, within the ICU7,8,9,10. Mortality prediction models can help with prioritising patients with myocardial infarction and supplant existing mortality prediction tools like the APACHE system deployed in US critical care centres, which has been criticised as too general and inaccurate for specific populations and diseases11,12. Others, like the Framingham Risk Score and the GRACE score, are simple linear risk calculators for general mortality in heart disease patients and do not support individual prognosis from longitudinal patient data13,14. Machine learning would allow us to provide individual prognosis while learning from complex longitudinal patient dynamics15.
When machine learning is combined with interpretability methods, it can be a useful tool for clinical guidance and decision-making.

Recent machine learning approaches for MI mortality prediction have shown promising but mixed results. Lin et al. demonstrated that XGBoost and random decision forest algorithms achieved superior performance (AUROC 0.835) compared to traditional scoring systems like SOFA using MIMIC-IV data16, while nomogram-based studies report AUROCs of 0.846–0.885 (refs. 17,18). However, the aforementioned methods are fundamentally static and cannot capture dynamic physiological changes during the ICU stay. For time-series analysis, deep learning approaches including LSTM networks and GRUs have been explored for ICU mortality prediction19,20, but they face challenges including computational complexity, data requirements, overfitting with sparse measurements, and a lack of the interpretability essential for clinical decision-making. Recent tabular deep learning methods like TabNet and NODE attempt to bridge performance-interpretability gaps21,22, but their superiority over traditional methods on healthcare tabular data remains contested23. Tree-based models continue to outperform deep learning on tabular data with characteristics common in healthcare (moderate dataset sizes, mixed data types, and uninformative features) due to their robustness to noise and built-in feature selection mechanisms.

Because tabular deep learning models have been proposed only recently, research applying them to healthcare challenges has been limited, with only one recent paper examining TabNet for ICU mortality prediction24. Prior deep learning work has largely focused on general ICU populations rather than MI-specific patients25,26.
Existing risk algorithms for MI patients like TIMI and GRACE are simple linear models that ignore longitudinal physiological trajectories and provide no dynamic risk estimates during the ICU stay27,28.

To address these limitations, we propose XMI-ICU (XGBoost for Myocardial Infarction in the ICU), a novel pseudo-dynamic machine learning framework that transforms time-series ICU prediction into connected static prediction problems. Our key innovation lies in the sliding time window approach: instead of using complex sequential models like LSTMs, we extract summary statistics (mean and standard deviation) from physiological measurements within defined prediction horizons (e.g., all measurements from ICU admission until 6 h before the event of interest). Each physiological variable (such as systolic blood pressure, lactate, or heart rate) becomes two features per time window, its mean value and a variability measure, creating a comprehensive but interpretable feature set. This approach enables gradient-boosted models like XGBoost to capture some temporal dynamics through these engineered features while maintaining the interpretability and robustness that tree-based models provide. The framework automatically adapts to different prediction horizons (6, 12, 18, or 24 h in advance), providing clinicians with flexible, dynamic risk assessment throughout the ICU stay while avoiding the computational complexity and “black box” nature of deep learning alternatives.

Our contributions are as follows:

1. We propose XMI-ICU, a novel pseudo-dynamic framework that transforms time-series ICU prediction into connected static problems by extracting temporal summary statistics from physiological measurements within sliding time windows.
This approach enables interpretable gradient-boosted models to capture temporal dynamics while achieving superior performance compared to both traditional risk scores and recent deep learning approaches.

2. We demonstrate robust external validation across two major ICU databases (eICU and MIMIC-IV) with consistent performance across multiple prediction horizons (6, 12, 18, and 24 h in advance), supporting the framework’s generalisability across different healthcare systems and temporal scenarios.

3. We introduce time-resolved Shapley value analysis that tracks how feature importance evolves across different prediction horizons, revealing dynamic clinical insights about which physiological measurements warrant attention at different stages of critical illness. We verify XMI-ICU’s clinical significance through decision curve analysis and comparison with APACHE-IV, demonstrating practical benefits for ICU decision-making beyond traditional performance metrics.

Results

eICU

We propose a novel integrated class-sensitive and explainable framework for ICU risk management, as detailed in the “Methods” section. We compare our proposed XMI-ICU gradient-boosted model to standard supervised learning methods. For some of the methods, such as support vector machines (SVMs), we standardised the features. All features listed in the supplementary section on data processing were used for the eICU test results, while only the most important (top 8) features identified by Shapley value analysis were used for external validation on MIMIC-IV. The APACHE IV score was not used as a feature in these models, although experiments doing so are included in the Supplementary Materials. The first set of results, in Table 1, concerns the prediction of mortality in MI patients several hours prior to the event. XMI-ICU maintains superior performance across all metrics for a priori prediction, beating state-of-the-art tabular deep learning models.
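As a rough illustration of the pseudo-dynamic feature extraction described in the Introduction (mean and standard deviation of each variable from ICU admission until a chosen number of hours before the event), the following minimal sketch uses pandas. It is not the authors' pipeline: the long-format layout and the column names `patient_id`, `hour`, `variable`, `value`, and `event_hour` are assumptions made for illustration only.

```python
import pandas as pd


def window_features(measurements: pd.DataFrame,
                    event_times: pd.DataFrame,
                    horizon_h: float) -> pd.DataFrame:
    """Summarise each variable from admission until `horizon_h` hours before the event.

    measurements: long-format table with hypothetical columns
        patient_id, hour (hours since ICU admission), variable, value
    event_times: one row per patient with hypothetical columns
        patient_id, event_hour (hours since ICU admission)
    Returns one row per patient with `<variable>_mean` and `<variable>_std` columns.
    """
    df = measurements.merge(event_times, on="patient_id")
    cutoff = df["event_hour"] - horizon_h
    # Keep only measurements taken between admission and the prediction cutoff.
    in_window = df[(df["hour"] >= 0) & (df["hour"] <= cutoff)]
    stats = (
        in_window.groupby(["patient_id", "variable"])["value"]
        .agg(["mean", "std"])   # sample std; NaN when a window holds one value
        .unstack("variable")    # columns become (statistic, variable) pairs
    )
    stats.columns = [f"{var}_{stat}" for stat, var in stats.columns]
    return stats


# One static feature table per prediction horizon, e.g.:
# tables = {h: window_features(measurements, event_times, h) for h in (6, 12, 18, 24)}
```

Each horizon yields its own static table, so a separate gradient-boosted model can be fitted per horizon. Patients with no measurements before the cutoff simply drop out of that horizon's table in this sketch; how the original framework treats such cases is not stated here.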
For AUROC and average precision, we evaluated the model at the default risk threshold in the results presented in the tables. The results can also be seen in Fig. 1, which highlights the superior performance of XMI-ICU compared to alternative supervised learning models as measured by both AUROC and average precision.

Table 1 eICU test prediction results with 95% confidence intervals for mortality prediction 6 h in advance.

Fig. 1 Evaluation performance of XMI-ICU to predict mortality 6 h in advance compared to other models for different metrics on the eICU held-out test set. AUROC and average precision results were obtained by bootstrapping. The left panel shows the AUROC and the right panel the average precision results.

After the XMI-ICU model was evaluated at 6-hour prediction prior to death, we extended the evaluation to a more dynamic setting by adapting the framework to predict death at any arbitrary time in advance. The framework automatically extracts, preprocesses, and standardises existing measurements, optimises the respective hyperparameters, and deploys the XGBoost model for test prediction. The results for XMI-ICU evaluated at 6, 12, 18, and 24-h prediction for mortality in the held-out test set of eICU can be seen in Table 2, and they continue to show reliable predictive performance across the different time windows. The table also includes the results for APACHE-IV for comparison at 24-hour prediction of mortality.

Table 2 XMI-ICU test prediction results with 95% confidence intervals for mortality prediction across different time horizons.

A plot showing the stability of predictive performance across different metrics for XMI-ICU as a function of time in the ICU prior to death can be seen in Fig. 2a, which indicates stable performance for mortality prediction and superiority over APACHE-IV (included for times beyond just 24 h).
The right panel of Fig. 2 shows the ability of the model to generalise to a completely external test set, MIMIC-IV, using only the top 8 features from eICU for training. Evaluating a time-prediction model like XMI-ICU also requires showing coherent prediction across time, and not just consistency of prediction accuracy and robustness. In the next set of results, we show that XMI-ICU has low misclassification error across time for the same patient sample. A patient is deemed misclassified if they are predicted incorrectly at time x in advance when they have been previously predicted correctly at times >x. For example, a patient might be correctly predicted to die at the 24 and 18-hour prediction windows, but at 12 h in advance they are predicted (incorrectly) to survive. These instabilities in prediction across time need to be measured if the model is to sustain reliable performance throughout the ICU stay. We define three patient subcohorts, as illustrated at the top of Table 3, where each indicates the group of patients correctly predicted at all previous time windows except one. The bottom of Table 3 presents these results for both death and heart attack prediction; the low levels of misclassification most likely reflect sensitivity to noise rather than predictive weakness.

Fig. 2 Robustness and reliability of XMI-ICU prediction performance over time in the ICU for mortality prediction (left) using all features available in eICU and as measured by a variety of metrics. The right panel contains results from the eICU held-out test set and the MIMIC-IV external cohort with only the top 8 features identified by Shapley value analysis.

Fig. 3 Ranking of most important features as identified by their relative SHAP values for XMI-ICU prediction of mortality, varied across time during the ICU stay prior to the event. The (+) and (−) signs next to each feature indicate the direction in which the feature value affects the increase in risk of death.
For example, for glucose, the higher the value, the higher the predicted risk of death. Furthermore, glucose increases in relative importance for prediction the closer the patient gets to the event. For the 6, 12, 18, and 24 h time windows, the top 13 features in each window are presented as extracted from eICU, showing how the most important features for correct prediction of mortality change as the event approaches.

Table 3 TOP: Defined patient cohorts for evaluating XMI-ICU predictive robustness across time windows.

To interpret XMI-ICU’s mortality predictions, we applied Shapley value analysis to the held-out test set and examined relative feature importance rankings. This interpretability analysis was conducted across all prediction horizons (6, 12, 18, and 24 h), with detailed results for the 6-hour prediction model presented in the Supplementary Materials (Tables S5 and S6). To validate the robustness of our interpretability findings, we conducted random perturbation tests by adding Gaussian-distributed noise features to the original feature set and evaluating whether the top-ranked variables remained stable. The introduction of noise variables produced no significant changes in the most important features identified by Shapley analysis, supporting the reliability of our interpretability findings. Complete perturbation test results are provided in the Supplementary Materials (Fig. S5).

We further stratify Shapley values as a function of time in the ICU for mortality prediction. A feature's ranking at each time point corresponds to its relative importance at that point in the ICU stay prior to the event in question. The time graphs can be seen in Fig. 3.
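The time-resolved ranking described above can be sketched as follows. This is an illustrative reading rather than the authors' code: it assumes that SHAP values have already been computed for the model at each prediction horizon and are available as arrays of shape (patients, features), and it ranks features by mean absolute SHAP value, a common convention that the text does not explicitly confirm. The function name `rank_by_horizon` is hypothetical.

```python
import numpy as np


def rank_by_horizon(shap_by_horizon, feature_names):
    """Rank features within each prediction horizon by mean |SHAP|.

    shap_by_horizon: dict mapping a horizon in hours to an array-like of
        shape (n_patients, n_features) holding precomputed SHAP values.
    Returns a dict mapping horizon -> {feature_name: rank}, where rank 1
    is the most important feature at that horizon.
    """
    ranks = {}
    for horizon, values in shap_by_horizon.items():
        # Assumed importance measure: average magnitude of attribution.
        importance = np.abs(np.asarray(values, dtype=float)).mean(axis=0)
        order = np.argsort(-importance, kind="stable")
        ranks[horizon] = {feature_names[i]: r + 1 for r, i in enumerate(order)}
    return ranks
```

Comparing a feature's rank across the 24, 18, 12, and 6 h entries then shows whether it gains or loses importance as the event approaches.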
These values were extracted for each of the time windows, in effect converting a static interpretability method into a dynamic explainability framework that shows how feature values and their importance change as the event (death or heart attack) approaches, and how the model uses these changes to learn underlying patterns for disease outcome prediction.

External validation: MIMIC-IV

For external validation, we limited our analysis to the top 8 most important features identified through Shapley value analysis on the eICU dataset. This restriction serves three purposes. First, it provides a more rigorous test of generalisability. If a model can maintain strong performance using only its most critical features on a completely independent dataset, this suggests robust identification of truly predictive clinical patterns rather than dataset-specific artifacts. Second, it addresses the clinical reality that simpler models with fewer features are more likely to be adopted in practice, as they require fewer data inputs and are easier to implement across different hospital systems with varying data collection capabilities. Third, it serves as a form of implicit regularisation, reducing the risk that our model’s performance depends on dataset-specific noise or idiosyncrasies that might not translate to other healthcare settings. We evaluated XMI-ICU on the separate and independent MIMIC-IV dataset for mortality prediction in MI patients. Using only the top 8 features identified by Shapley value analysis, we trained on the entire eICU cohort and tested on the entire MIMIC-IV cohort; the statistical distributions of these features in the two sets are included in Table 4. The distributions of the respective features can be seen in Supplementary Materials Figures S1, S2, and S3.
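The restriction to the most important features can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `importance` is assumed to be a pandas Series of per-feature importance scores (for example, mean absolute SHAP values computed on eICU), and `restrict_to_top_k` is a hypothetical helper name.

```python
import pandas as pd


def restrict_to_top_k(train: pd.DataFrame,
                      test: pd.DataFrame,
                      importance: pd.Series,
                      k: int = 8):
    """Keep only the k most important features in both cohorts.

    importance: Series indexed by feature name, higher = more important
        (assumed to come from the development cohort only).
    Returns (train_subset, test_subset, selected_feature_names).
    """
    top = importance.sort_values(ascending=False).head(k).index.tolist()
    # The external cohort must provide every selected feature.
    missing = [name for name in top if name not in test.columns]
    if missing:
        raise KeyError(f"External cohort lacks features: {missing}")
    return train[top], test[top], top
```

Selecting the features from the development cohort alone, and then applying the same column list to the external cohort, keeps the external data out of both feature selection and training.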
XMI-ICU maintains high predictive performance across metrics when tested on this external dataset, as can be seen in Table 2, without any training or tuning on it and using only the top 8 features identified by Shapley value analysis from the eICU test set. The results immediately above correspond to held-out test set performance for eICU using those same 8 features.

A plot showing predictive performance across different metrics for XMI-ICU evaluated on the MIMIC-IV cohort can be seen in Fig. 2b.

Clinical risk benefit analysis

To communicate the clinical significance of the XMI-ICU model results to clinicians, we evaluated our model with clinical impact curves (Fig. 4a) and decision curve estimates (Fig. 4b) for robust risk evaluation. A 90% confidence interval was derived with 50 bootstrap iterations on the test set. As the clinical impact curves for mortality show, XMI-ICU consistently identifies patients at risk across different risk thresholds, showing robustness to false negatives. For those at highest risk (>75%), XMI-ICU has a very low tendency towards false positives or “over-risking” in its predictions, learning to focus on those most at risk with higher specificity and sensitivity. The decision curves indicate that XMI-ICU’s approximated net benefit outperforms that of logistic regression (the underlying model used in APACHE) using only the top features identified from Shapley value analysis.

Fig. 4 Clinical decision-making evaluation performance of XMI-ICU for mortality prediction using only the top 8 features on the entire eICU test set.

To assist clinicians more readily in their decision-making process, the top features of our XMI-ICU model were used to construct a nomogram, which is included in the Supplementary Materials (Figure S7) and shows a simple representation of what could be an automatic calculator for 24-h risk calculation in the ICU.

Discussion

To our knowledge, the proposed XMI-ICU framework is the first comprehensive machine learning approach to mortality prediction for heart attack patients. For a condition with a significantly higher risk of death and worse outcomes in the ICU, this is an important problem to address. XMI-ICU does not just use demographic data but also longitudinal time-series data in a pseudo-dynamic manner, coupled with a data pre-processing and Bayesian optimisation pipeline. It is robustly tested on two large heterogeneous ICU datasets, including a multi-centre ICU data source, with external validation. XMI-ICU shows superior predictive performance for mortality prediction across different metrics. While our prediction time for mortality is at least 6 and up to 24 hours before death, the framework can be applied to any arbitrary time, which leaves clinicians with flexible extra time to prioritise high-risk patients and administer preventative measures. XMI-ICU presents a paradigm shift from prior approaches to mortality prediction for heart attack patients, like the GRACE, Framingham, and TIMI scores, which are simple linear risk calculators that only project long-term mortality at a fixed time point. We have also benchmarked against state-of-the-art tabular deep learning models like TabNet and NODE and have achieved superior performance with XMI-ICU.
XMI-ICU provides the risk of death dynamically in the ICU for heightened responsiveness and learns relevant underlying physiological dynamics of the patient state with interpretability.

The superior performance of XMI-ICU over tabular deep learning methods can be attributed to several factors. First, EHR data exhibit characteristics that favour tree-based models: moderate sample sizes, mixed data types combining categorical and continuous variables, and the presence of many potentially uninformative features due to the comprehensive nature of ICU monitoring. Second, the temporal aggregation strategy employed in our framework (using mean and standard deviation summaries) creates a feature space that aligns well with tree-based learning. The decision tree structure directly mirrors clinical decision-making processes, where clinicians use threshold-based reasoning (e.g., “if lactate > 2.5 and albumin < 3.0, then higher risk”). Finally, the class imbalance in our dataset presents challenges for deep learning models, which require careful architecture design and training procedures. Tree-based models naturally handle imbalanced datasets through their splitting criteria and can be easily adjusted using class weights, as demonstrated in our approach. The combination of these factors (moderate dataset size, mixed data types, noise resilience, and class imbalance handling) explains why our tree-based approach outperformed more complex deep learning alternatives, consistent with recent findings that tree-based models remain superior for many tabular prediction tasks23.

XMI-ICU also beats the existing prediction tool in use across ICUs in the United States, APACHE IV, by 18.3% in test AUROC and 11.1% in test accuracy at 24-hour prediction. While the APACHE IV score is calculated at admission, and thus is not the optimal comparison algorithm, it does still provide 24-hour mortality prediction, which is the time window we used for the comparison.
Once trained, which itself takes only a couple of seconds, XMI-ICU requires milliseconds to produce predictions, allowing for rapid response times in the ICU. Additionally, as Fig. 2a shows, XMI-ICU maintains stable performance across all metrics during the 24 h of ICU stay prior to death for MI patients. The model also successfully performs mortality prediction across different prediction time windows in an external patient cohort obtained from MIMIC-IV, without any training on that dataset and using only the 8 most important features identified by Shapley value analysis on eICU, as seen in Fig. 2b. The maintained performance across both eICU (using the top 8 features) and MIMIC-IV external validation suggests that these features capture fundamental physiological signals associated with mortality risk in MI patients that transcend specific hospital systems and data collection protocols. This finding has important implications for clinical translation, as it suggests that effective mortality prediction can be achieved with a focused set of readily available clinical measurements, potentially facilitating broader adoption across diverse healthcare settings. The drop in predictive performance with MIMIC-IV is expected, as we now use only 8 features without any training on the MIMIC-IV dataset itself. Table III also demonstrates this for the same 8 features in eICU, where a decrease in performance is observed because only the few most predictive features are used. Despite this challenge, XMI-ICU maintains relatively high external predictive performance.

The added interpretability provides clinical risk factor importance, which can aid physicians both in relying on the model and in investigating which physiological measurements are more informative at different times during the ICU stay.
Our time-resolved Shapley value analysis represents an extension beyond traditional model interpretability, tracking feature importance across multiple prediction horizons to provide clinicians with temporal insights into risk factor evolution during ICU stay. This approach reveals that optimal clinical monitoring strategies should adapt based on prediction timeline, focusing on baseline characteristics and comorbidities for longer-term risk assessment while emphasising acute physiological measurements for short-term mortality prediction. For mortality prediction, we observe that as patients approach the highest risk of death, mechanical ventilation status drops in importance compared to blood measurements like higher lactate, lower albumin, and systolic blood pressure. Hyperlactatemia is highly associated with in-hospital mortality in relatively small and isolated heterogeneous ICU populations29, and our findings on a much larger multi-centre patient cohort provide predictive evidence supporting this association. Previous research has established that lower serum albumin levels are good predictors of higher risk of death in ICU patients with sepsis and COVID-19, while our work suggests a similar predictive pattern for myocardial infarction patients30,31. While this remains a matter of ongoing debate in medical sciences, lower albumin levels may serve as a marker of persistent arterial injury and progression of atherosclerosis and thrombosis, with prolonged low albumin levels indicating higher risk of further acute myocardial injury, making it useful for tracking MI risk as our results suggest32,33. 
Such dynamic interpretability could inform adaptive monitoring protocols and help prioritise interventions based on temporal risk patterns, providing actionable clinical insights beyond static feature importance rankings.

Prior work has shed light on the hypothesis that hypotension, as measured by the lowering of systolic blood pressure (SBP), can be an indicator of higher risk of death in ICU patients, specifically those with acute kidney injury34. Some have suggested that myocardial injury is also more likely in cases of lower SBP values, but here we provide an early indication of the high predictive value of lower SBP levels for heart attack in the ICU35. The role of sudden rises in glucose levels and of glucose variability (as captured by our standard deviation measure for blood glucose) as strong predictors of mortality has been confirmed by several retrospective cohort studies in the ICU36,37. Work on prioritising patients with such measurements and controlling for blood glucose and albumin can more easily be extended to preventive care for MI patients as well.

A potential limitation of the study is the relatively small size of the patient cohorts extracted from large general ICU populations. A targeted dataset consisting of only those patients with a confirmed primary diagnosis of myocardial infarction in their records could allow a deeper analysis of this population. Further work can develop larger or more complex models that exploit the longitudinal, less sparse information if a more comprehensive data source with heart attack patient records becomes available. Compared with existing deep learning time-series models, which tend to be costly and complex, our XMI-ICU framework, with its simple embedded gradient-boosted model that is robust against class imbalance and its dynamic feature extraction, maintains prediction fidelity at varying time points while being faster, more interpretable, and less environmentally and financially costly to train and deploy.
The XMI-ICU dynamic framework also offers an alternative to the rush in clinical machine learning to apply costly and less interpretable deep learning models to these types of problems, while still providing a dynamic prediction framework. Another limitation of our current approach is the relatively simple temporal feature extraction strategy. While mean and standard deviation provide robust and interpretable summaries of physiological measurements within time windows, they may not capture all relevant temporal patterns that could improve predictive performance. More sophisticated temporal features could include the min-max range, the temporal slope to identify trends in physiological parameters (improving, deteriorating, or stable), and measures of temporal autocorrelation to assess the persistence of physiological states. These additional features could potentially capture subtle temporal dynamics that our current approach might miss, particularly for patients with complex physiological trajectories. Future work could integrate transformer-based architectures, as recent studies demonstrate that transformers predict patient outcomes efficiently through training-time parallelism while their attention weights yield clinically interpretable insights into the learned dynamics38,39,40. We therefore view transformer models as a natural extension of our pseudo-dynamic framework.

In conclusion, we developed a highly predictive machine learning framework that trains on time-series ICU ward data without requiring complex deep learning models. Instead, it relies on dynamic feature extraction and takes advantage of the predictive power of static models like XGBoost, which outperformed other models, including state-of-the-art tabular deep learning. The framework offers time-resolved interpretability that allows tracking changes in the importance of vital signs and blood measurements across the ICU stay for heart attack patients, with the aim of providing medical insight.
The framework could be integrated into ICU systems to predict negative outcomes in heart attack patients with real-time patient measurements.

Methods

Study design and population

The cohort data used in this study come from the eICU Collaborative Research Database and the Medical Information Mart for Intensive Care (MIMIC-IV v. 2.0, July 2022), public ICU databases available upon request and fulfilment of ethical training41,42. The eICU database was processed using PostgreSQL and the pandas package. eICU is a multi-centre ICU database with 200,859 patient unit encounters for 139,367 unique patients admitted between 2014 and 2015 to one of 335 ICUs at 208 hospitals located throughout the United States41. The database is de-identified and includes vital sign measurements, demographic data, and diagnosis information. For a full list of features used in our study, please consult the relevant tables in the Supplementary Materials (Tables S1, S2, S3, S4).

We define our myocardial infarction cohort as patients with a confirmed MI diagnosis, including both STEMI and NSTEMI presentations, as well as acute coronary syndrome patients with confirmed myocardial damage. Our inclusion criteria specifically encompass the diagnosis strings and respective ICD-10 codes described in the Supplementary Material (Table S1), which include: cardiovascular chest pain with acute coronary syndrome, acute myocardial infarction with and without ST elevation, post-PTCA myocardial infarction, and various anatomical MI locations (inferior, non-Q wave). We excluded patients with unstable angina without confirmed myocardial damage, ensuring our cohort represents true myocardial infarction cases rather than broader acute coronary syndrome presentations without tissue necrosis.

We based this study on the data preprocessing workflow used in ref. 43, but adapted it to our problem accordingly. Our inclusion criteria were patients of age >18 and