Large language models forecast patient health trajectories enabling digital twins

Introduction

Clinical forecasting involves predicting patient-specific health outcomes and clinical events over time, which is essential for patient monitoring, treatment selection, and drug development1. An emerging approach to support such forecasting is the use of digital twins2,3. These are virtual representations of patients that generate detailed, multivariable predictions of future health states by leveraging longitudinal medical history3,4. When initialized with individual patient characteristics, digital twins can simulate real-time personalized responses to medical interventions or treatments2,4,5.

Digital twins offer a comprehensive framework for patient modeling by integrating diverse data streams, which can include history of medical examinations, diagnoses and treatments, deep molecular profiling, lifestyle and environmental factors, as well as general biomedical knowledge6,7,8. They provide a holistic reflection of an individual’s status within the broader context of the patient population, accounting for the interplay of disease dynamics and medical interventions4. By bridging the gap between population-level evidence and individual-level insights, the application of digital twins is poised to revolutionize healthcare in areas such as precision and personalized medicine, predictive analytics, virtual testing, continuous monitoring, and enhanced decision support3,4.

Generative artificial intelligence (AI) holds promise for creating digital twins due to its potential to produce synthetic yet realistic data, but this area of application is still in its infancy4. Generative AI methods for predicting patient trajectories include recurrent neural networks, transformers and stable diffusion9,10,11,12,13. These often fall short in terms of handling missing data, interpretability and performance.
These challenges can be partially addressed by causal machine learning14, but these algorithms face limitations related to small datasets or being confined to simulations15.

Recent breakthroughs in generative AI have been achieved with foundation models, which are pre-trained AI models adaptable to various specific tasks involving different types of data. Most foundation models for patient forecasting focus on single-point predictions rather than comprehensive longitudinal patient trajectories, which are needed for clinical decision-making16. Recently, clinically focused, LLM-inspired methods have been proposed17; however, their evaluation still focuses on single-point predictions rather than longitudinal trajectories, and they do not use the knowledge of pretrained LLMs. Text-focused Large Language Models (LLMs) remain less explored for this purpose, although they have demonstrated forecasting capabilities18,19. Some approaches even show zero-shot forecasting, i.e., forecasting without any prior task-specific training, which highlights their remarkable generalizability20,21,22.

LLM-based forecasting has made great progress in general forecasting. However, some common methods, such as LSTPrompt20, LLMTime21, Time-LLM22, and GPT4TS23, make assumptions that may not hold in clinical trajectory forecasting. One example is channel independence: for multivariate time series, channel-independent models process each time series separately, without modeling interactions and dependencies between series. This approach may not be optimal in the clinical setting, in which we often observe correlated time series, putatively driven by causal biological links, highlighting the need to process all aspects of a patient simultaneously.

We propose the creation of digital twins based on LLMs that leverage data from electronic health records (EHRs) from real world data (RWD) and observational studies.
EHRs are a key source of training data for machine learning models in healthcare, as they record patient characteristics such as demographics, diagnoses, and lab results over time24. However, they pose specific challenges such as data heterogeneity, rare events, sparsity, and quality issues16. There have been developments in machine learning to overcome these challenges, especially for data sparsity, usually by adapting the model’s architecture, resulting in increased model complexity and the introduction of further assumptions on the data10,13.

Fig. 1: The LLM-based DT-GPT framework enables forecasting patient trajectories, identifying key variables, and zero-shot predictions. a Here exemplified, a sparse patient timeline, which b DT-GPT utilizes for generating longitudinal clinical variable forecasts, e.g., c neutrophil and d hemoglobin blood levels. DT-GPT can e chat and respond to inquiries about important variables, as well as f perform zero-shot forecasting on clinical variables previously not used during training.

We hypothesize that LLMs will empower the next generation of digital twins in healthcare. Here, we introduce the Digital Twin - Generative Pretrained Transformer (DT-GPT) model (Fig. 1), which enables: (i) forecasting of clinical variable trajectories, (ii) zero-shot predictions of clinical variables not previously trained on, and (iii) preliminary interpretability utilizing chatbot functionalities. DT-GPT is an extension of previous LLM-based forecasting solutions, based on fine-tuning LLMs on clinical data using a straightforward data encoding scheme.
The method is designed to solve clinically specific issues, to be model-agnostic, and to be applicable to any text-focused LLM without further architectural changes.

Results

We analyzed the performance of DT-GPT by forecasting various clinical values on diverse datasets, including on a short-term scale (next 24 h) for Intensive Care Unit (ICU) patients, a medium-term scale (up to 13 weeks) for non-small cell lung cancer (NSCLC) patients, as well as a long-term Alzheimer’s Disease dataset (next 24 months). The ICU dataset is based on Medical Information Mart for Intensive Care IV (MIMIC-IV)25 with 35,131 patients, whilst the NSCLC dataset is based on the nationwide Flatiron Health EHR-derived de-identified database, containing 16,496 NSCLC patients (“Methods”; Supplementary Tables 1–4; Supplementary Note 1). The Alzheimer’s disease dataset is derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset, containing 1,140 patients (Supplementary Tables 1 and 5; Supplementary Note 1). Together, the datasets show how the model performs on short-, medium- and long-term scales, as well as with different numbers of patients available for training. All details on task setup, data preprocessing, model training, and evaluation are provided in the “Methods” section.

DT-GPT achieved state-of-the-art forecasting performance

DT-GPT achieved the lowest overall scaled mean absolute error (MAE) across benchmark tasks in comparison with state-of-the-art models (Table 1), with the z-score scaling allowing comparison and aggregation across variables (“Methods”). In the NSCLC dataset, we predicted six laboratory values weekly for up to 13 weeks post-therapy initiation, leveraging all pre-treatment data to model patient trajectories under treatment. For the ICU task, we forecasted the next 24 h by predicting respiratory rate, magnesium and oxygen saturation based on the previous 24 h history, enabling real-time monitoring and timely intervention.
In the Alzheimer’s dataset, we forecasted Mini Mental State Examination (MMSE)26, Clinical Dementia Rating sum of boxes (CDR-SB)27 and Alzheimer’s Disease Assessment Scale (ADAS11)28 cognitive scores over the next 24 months at 6-month intervals using baseline measurements. All comparisons were performed on unseen patients.

Table 1: Benchmark of clinical variable forecasting across three datasets

We compared DT-GPT to 14 multi-step, multivariate baselines, ranging from a naïve model that copies over the last observed value to state-of-the-art forecasting models. These included a linear regression model, a time series LightGBM model, Temporal Fusion Transformer (TFT), Temporal Convolutional Network (TCN), Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), Transformer, and Time-series Dense Encoder (TiDE) model12,29,30. The naïve model ensured that models with better performance capture nonstationary time series, whilst advanced models were chosen for their ability to handle future variables and for achieving state-of-the-art performance in both medical and standard time series forecasting31,32. To understand the contribution of fine-tuning, we also ran the general, state-of-the-art LLM Qwen3-32B and the biomedical LLM BioMistral-7B33,34. Note that DT-GPT is a fine-tuned 7-billion-parameter model based on BioMistral, whilst Qwen3 is a significantly larger model at 32 billion parameters. Additionally, we benchmarked advanced time-series LLM-based methods, i.e.
Time-LLM and LLMTime21,22, as well as a patch-based model PatchTST35, all of which are channel-independent models, which process each input time series separately.

On the NSCLC dataset, DT-GPT achieved an average scaled MAE of 0.55 ± 0.04, whilst LightGBM, the second best model, achieved an average scaled MAE of 0.57 ± 0.05, showing a relative improvement of 3.4% (Table 1). On the ICU dataset, DT-GPT achieved an average scaled MAE of 0.59 ± 0.03, whilst the second best model, LightGBM, performed at 0.60 ± 0.03, equivalent to a 1.3% improvement (Table 1). On the Alzheimer’s disease dataset, DT-GPT achieved an average scaled MAE of 0.47 ± 0.03, with Temporal Fusion Transformer being the second best model with 0.48 ± 0.02, representing a relative improvement of 1.8%. We note that the scaled MAE is normalized by standard deviation, with DT-GPT consistently achieving absolute MAE (Supplementary Tables 6–17) that is lower than the standard deviation, indicating that forecasting errors are smaller than the natural variability present in the data. DT-GPT is shown to be the best performer out of 14 models across all datasets, and achieving statistical significance over the second-best performing model on the NSCLC (p value 0.7) with at least one fine-tuned target variable (Supplementary Fig. 10). For the remaining well-performing zero-shot targets without strong correlations, feature importance analysis and relevant literature suggest that DT-GPT may capture clinically meaningful relationships, such as the ferritin-to-hemoglobin ratio and components of the Albumin-Bilirubin (ALBI) score in NSCLC patients (Supplementary Fig. 11; Supplementary Table 23)44,45,46.

Discussion

Our main finding is that a simple yet effective method allows training LLMs on EHRs and study data to generate detailed patient trajectories that preserve inter-variable correlations.
This method achieves state-of-the-art performance in clinical forecasting, while closely reproducing the distribution of original data and outperforming baselines in predicting clinically meaningful events in the trajectory. This highlights the potential of using LLMs as a digital twin platform that can mimic individual patients, with applications such as treatment selection and clinical trial support.

Building on past LLM research in general forecasting, DT-GPT outperforms existing baselines20,21 in NSCLC, ICU and Alzheimer’s disease datasets. These findings align with recent LLM forecasting developments, demonstrating that clinically-specific adjustments enable accurate predictions18,19. Further analysis of several existing LLM forecasting approaches reveals that channel-dependent modeling is a crucial aspect for patient trajectories, with DT-GPT showing that even a simple approach here can be highly effective. Notably, fine-tuning remains necessary for optimal performance, as demonstrated by the lower accuracy of non-fine-tuned LLMs, even when benchmarked against significantly larger models. Additionally, DT-GPT’s generative nature allows for multiple trajectory simulations per patient, offering insights into possible patient scenarios, cohort simulations, and uncertainty estimates. Finally, while all models were optimized for the forecasting task only, DT-GPT consistently outperformed baselines in classification tasks in detecting clinically relevant events by achieving best or second-best performance.

The positive performance of LLMs for patient forecasting may stem from parallels between natural language and biomedical data, such as non-random missingness. For example, a doctor might skip measuring blood pressure if a patient appears healthy, indicating information by omission. Natural language implicitly handles such ambiguity; unspoken words can still convey meaning or none at all.
Recent advancements suggest that LLMs can capture these complex relationships47.

DT-GPT addresses EHR challenges including noise, sparsity, and lack of data normalization16. Unlike most established machine learning models that require data normalization and imputation, DT-GPT operates without these requirements. Here, we demonstrated its robustness to sparsity, misspellings, and noisy medical data often encountered in real-world datasets. Moreover, EHR data often contain mixed data encodings; for instance, drug information may be recorded as the dosage used or noted only as “administered”, both of which DT-GPT handles without additional preprocessing. Overall, DT-GPT simplifies and streamlines data preparation, thus enabling faster deployment across diverse datasets.

DT-GPT can be queried about the rationale of its predictions, which increases the interpretability of the model. This capability helps bridge the gap between medical expert and model, enabling the exploration of prediction rationales and alternative patient scenarios efficiently. We believe that this advancement could enhance human-computer interaction with AI predictions and may positively affect clinical practices in the near future.

DT-GPT enables zero-shot predictions, demonstrating its ability to forecast variables not explicitly included in its fine-tuning phase by learning their dynamics and adapting to novel tasks. Remarkably, zero-shot DT-GPT outperforms a supervised, fully-trained machine learning model on a subset of clinical variables, highlighting the pioneering potential of LLM-based approaches in RWD forecasting.

Applying the preliminary interpretability approach to the zero-shot variables as well, we hypothesize that the model is potentially able to capture latent clinical knowledge, such as the importance of the ferritin-to-hemoglobin ratio and parts of the Albumin-Bilirubin (ALBI) score, both of which are emerging prognostic biomarkers in NSCLC45,46.
It is important to note that the underlying BioMistral 7B model was trained on a vast amount of biomedical databases and publications. Therefore, these are preliminary hypotheses that require extensive investigation and validation from clinical experts.

DT-GPT shows promise for clinical trajectory forecasting, with strong performance on standard metrics (e.g., MAE) and robust modeling of temporal dependencies. It effectively detects moderate abnormalities such as anemia, tracks inflammation-related trends, and predicts progression markers such as elevated LDH. However, performance declines for specific acute events (e.g., severe hemoglobin drops or high leukocyte counts), highlighting the challenge of forecasting low-prevalence, high-variance outcomes. Future improvements will require methods that enhance sensitivity to high-risk events, such as tailored loss functions, anomaly detection, and integration of unstructured clinical data.

A challenge of LLM-based models is the restricted number of simultaneously forecasted variables. The current constraint on the number of forecasted variables is due to the limited sequence length of both input and output of the LLMs used in fine-tuning. Advances in extending the context length will enable modeling of additional patient variables, for example by using larger, more advanced base models such as Qwen3-32B. Furthermore, we anticipate that transitioning from zero-shot to few-shot learning, where the model receives further training on a small subset of data, would enable a wider span of forecasted variables and extend DT-GPT’s applicability to broader clinical challenges.

Future work can also take inspiration from developments in LLM-based forecasting. Specifically, ideas such as patching and prompt-as-a-prefix from Time-LLM22, as well as normalization and generation of continuous likelihoods from LLMTime21, can be adapted for clinical use, further improving forecasting performance.
Additionally, even though DT-GPT captured the clinically relevant events better than other models, its performance can still be improved to increase clinical relevance; we therefore consider the optimization of classification performance to be an important direction of future work. Related to this, future research should also focus on developing disease-specific forecasting metrics that correlate well with clinical utility.

Another established shortcoming of LLM-based models is their tendency to hallucinate and to recreate the biases of the underlying data. In our case, hallucination could be reflected in explainability results that do not necessarily provide true answers. This is a critical aspect for the medical domain, and we believe that a human-in-the-loop setup will be required, together with advanced training of clinicians on the use of LLM outputs. Regarding model biases, it is well established that models recreate the biases from the underlying data, which is especially pronounced in minority populations48. To overcome the bias issues, methodological work, training of users, as well as the gathering of large-scale, diverse clinical datasets, is needed.

Finally, we observe that high-error predictions often occur due to the high variance between the multiple generated trajectories of each patient sample, with the mean aggregation into the final prediction not capturing key dynamics. It is thus an open challenge to develop improved aggregation methods, for example by using a second LLM as an arbiter or by having a human expert select the most realistic trajectory.

In conclusion, DT-GPT highlights the utility of using LLMs as a digital twin forecasting platform, enabling state-of-the-art and stable predictions, exploratory interpretability via a natural-language interface, and forecasting of patient variables not used in fine-tuning.
Whilst further advancements are needed for wide-scale deployment, DT-GPT exhibits digital twin behaviors, potentially reproducing many aspects of the patients it represents, and surpassing traditional AI methods optimized for individual variables. We believe that through further method development and extensive validation, patient-level digital twins will impact clinical trials by supporting biomarker exploration, trial design, and interim analysis. Additionally, future digital twins will assist doctors in treatment selection and patient monitoring. Overall, we envision LLM-powered digital twins becoming integral to healthcare systems.

Methods

DT-GPT is a method that employs pre-trained LLMs fine-tuned on clinical data (Fig. 6a). Notably, this method is agnostic regarding the underlying LLM and can be applied without architectural changes to any general-purpose or specialized text-focused LLM. We trained and evaluated DT-GPT for forecasting patients’ laboratory values across three independent datasets, i.e., non-small cell lung cancer (NSCLC), intensive care unit (ICU), and Alzheimer’s disease patients.

Fig. 6: The DT-GPT framework transforms EHRs into text and subsequently fine-tunes an LLM on this data. a Overview of the pipeline: datasets are split and encoded into input/output text based on landmark timepoints, then used to fine-tune an LLM, here BioMistral. The model output is evaluated for trajectory forecasting whilst zero-shot predictions and variable importances are explored via a chat interface. b Sample size, visit frequency, and sparsity of the Alzheimer’s disease (AD), non-small cell lung cancer (NSCLC), intensive care unit (ICU) datasets. c Input and d output encoded examples, emphasizing the chronological encoding of observations.

NSCLC dataset

For the US-based NSCLC dataset, we used the nationwide Flatiron Health EHR-derived de-identified database.
The data are de-identified and subject to obligations to prevent re-identification and protect patient confidentiality. The Flatiron Health database is a longitudinal database, comprising de-identified patient-level structured and unstructured data, curated via technology-enabled abstraction49,50. During the study period, the de-identified data originated from approximately 280 cancer clinics (~800 sites of care).

The study included 16,496 patients diagnosed with NSCLC from 01 January 1991 to 06 July 2023. The majority of patients in the database originate from community oncology settings; relative community/academic proportions may vary depending on the study cohort. Patients with a birth year of 1938 or earlier may have an adjusted birth year in Flatiron Health datasets due to patient de-identification requirements. To harmonize the data, we aggregated all values in a week based on the last observed value.

We focused on the 50 most common diagnoses and 80 most common laboratory measurements, complemented by the Eastern Cooperative Oncology Group (ECOG) score, metastases, vitals, drug administrations, response, and mortality variables, totaling 773,607 patient-days across 320 variables.

For every NSCLC patient, we divided their trajectory into input and output segments based on the start date of each line of therapy to create each patient sample. All variables up to the start date were considered input data. The objective was to predict the weekly values up to 13 weeks after the start date of the following variables and their respective LOINC codes: hemoglobin (718-7), leukocytes (26464-8), lymphocytes/leukocytes (26478-8), lymphocytes (26474-7), neutrophils (26499-4) and lactate dehydrogenase (2532-0).
These variables were selected due to their frequent measurement and relevance in reflecting key characteristics of NSCLC treatment response (Supplementary Tables 1, 2).

ICU dataset

To demonstrate the generalizability of DT-GPT, we analyzed ICU trajectories from the publicly-accessible Medical Information Mart for Intensive Care IV (MIMIC-IV) dataset25. We employed an established processing pipeline, resulting in 300 input variables across 1,686,288 time points from 35,131 patients51.

Here, the objective was to predict a patient’s future hourly lab variables given their first 24 h in the ICU. Specifically, the patient history was considered as the first 24 h for all variables, and the task was to forecast the future 24 hourly values for the following variables: O2 saturation pulse oximetry, respiratory rate and magnesium. These variables were selected due to having the highest temporal variability, thus making the forecasting task more challenging, and the fact that at least 50% of patients had at least one measurement for each, highlighting their widespread clinical usage (Supplementary Tables 1, 3, 4). These criteria not only increased the forecasting challenge, but also ensured wide representation across the patient population.

Alzheimer’s disease dataset

To further demonstrate the generalizability of DT-GPT, we ran DT-GPT and the baseline models on the Alzheimer’s disease dataset, based on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).

We preprocessed the dataset, including 1,140 patients.
The task was to predict the 24-month trajectory of three cognitive variables, given the baseline measurements of the patients. Specifically, the variables were Mini Mental State Examination (MMSE), Clinical Dementia Rating sum of boxes (CDR-SB) and Alzheimer’s Disease Assessment Scale (ADAS11), which are key indicators of cognitive decline commonly measured in Alzheimer’s disease patients (Supplementary Tables 1, 5).

Data splitting and filtering

The NSCLC and ICU datasets were split at the patient level into 80% training, 10% validation, and 10% test sets. The splitting was performed randomly for the ICU dataset, whilst for the NSCLC dataset it was stratified by group stage, smoking status, number of observations per visit and number of visits with drug administrations to ensure a balanced evaluation. The Alzheimer’s disease dataset was randomly split into 80% training and 20% test, a split chosen because of the small sample size, with all hyperparameters determined via a further splitting on the training set. Thus, each set comprised disjoint sets of patients to avoid data leakage. The test sets were solely used for final evaluation and to assess the model’s generalizability (Fig. 6b).

We applied a two-step outlier filtering procedure on all datasets: first, all target values more than three standard deviations below or above the mean were filtered out; then, we calculated new standard deviation values on the filtered dataset and clipped target values lying below or above the resulting bounds. This approach ensured that the noise present in the data was removed, while some of the outliers were replaced with reasonable low or high values to maintain the biological signal. The data for all of the baselines excluding DT-GPT were then also standardized using z-scores.

Encoding

We encoded patient trajectories by using templates that converted medical histories based on EHRs into a text format compatible with LLMs, as proposed by Xue et al.19 and Liu et al.19,20 (Fig. 6c, d; Supplementary Note 4).
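To make template-based encoding concrete, the following is a minimal sketch and not the authors' template (that one is given in Supplementary Note 4 and Supplementary Fig. 12). It assumes the four-part input layout (patient history, demographics, forecast dates, prompt) and the JSON output format described in this section. The function names `encode_input` and `encode_output`, the day-offset keys, and all wording inside the template are hypothetical.

```python
import json


def encode_input(visits, demographics, forecast_days, targets):
    """Render one patient sample as plain text (hypothetical template wording)."""
    lines = ["Patient history:"]
    # Visits are written in chronological order. Only observed values appear,
    # so variables missing at a visit are simply not mentioned (no imputation).
    for day, observations in sorted(visits.items()):
        observed = ", ".join(f"{name} is {value}" for name, value in observations.items())
        lines.append(f"Day {day}: {observed}.")
    lines.append("Demographics: " + ", ".join(f"{k}: {v}" for k, v in demographics.items()) + ".")
    lines.append("Forecast days: " + ", ".join(str(d) for d in forecast_days) + ".")
    lines.append("Task: predict " + ", ".join(targets) + " on the forecast days and answer in JSON.")
    return "\n".join(lines)


def encode_output(forecast):
    """Render the target values per forecast day as JSON text, ordered by day."""
    return json.dumps({str(day): values for day, values in sorted(forecast.items())})


if __name__ == "__main__":
    visits = {-14: {"hemoglobin": 12.1, "leukocytes": 6.3}, -7: {"hemoglobin": 11.2}}
    print(encode_input(visits, {"age": 67, "sex": "female"}, [7, 14], ["hemoglobin"]))
    print(encode_output({7: {"hemoglobin": 10.9}, 14: {"hemoglobin": 10.5}}))
```

Because the history lists only what was measured, sparsity is represented by omission rather than by placeholder values, which is the property the encoding relies on.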
The input template is structured into four components: (1) patient history, (2) demographic data, (3) forecast dates and (4) prompt. The patient history contains a chronological description of patient visits, requiring no data imputation for missing variables. The output trajectories were also encoded using templates, containing only the relevant output variables for the forecasted time points. We utilized a manually developed template for input encoding and JSON-format encoding for the output (Supplementary Fig. 12).

LLMs and fine-tuning

We utilized the biomedical LLM BioMistral 7B DARE, since it is provided with an open source license and based on a recognized LLM33. Furthermore, BioMistral is instruction-tuned and, through its biomedical specialization, incorporates compressed representations of vast amounts of biomedical knowledge. We further fine-tuned this LLM using the standard cross entropy loss, masked so that the gradient was only computed on the output text. We performed 30 predictions for each patient sample during evaluation, then took the mean for each time point as the final prediction21,52. The hyperparameters used for fine-tuning DT-GPT (Supplementary Note 5) and for the baseline models (Supplementary Note 6) are provided in the supplementary material.

Handling of missing and noisy data

We investigated the ability of DT-GPT as an LLM-based model to handle missing data and misspellings in the input prompts. For the missing data study, we randomly masked between 0 and 80% of data, in addition to the already missing data in a dataset. Evaluation of the effect of missingness was performed on 200 randomly sampled patients from the test set, which can potentially lead to higher variance in the results, but allowed for a more extensive exploration.

For the noise study, we introduced a misspelling algorithm. This algorithm randomly performs either perturbation, insertion, deletion, or replacement, using all ASCII letters and digits, applied to the entire input text.
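A character-level misspelling routine of this kind might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the text does not define "perturbation", so it is assumed here to mean swapping two adjacent characters, and the function name `misspell` and its parameters are hypothetical. Each operation counts as one misspelling.

```python
import random
import string

# All ASCII letters and digits, as named in the text.
ALPHABET = string.ascii_letters + string.digits


def misspell(text, n_misspellings, rng):
    """Apply n random character-level operations anywhere in the text."""
    chars = list(text)
    for _ in range(n_misspellings):
        if not chars:
            # On an empty text, insertion is the only possible operation.
            chars.append(rng.choice(ALPHABET))
            continue
        op = rng.choice(["perturbation", "insertion", "deletion", "replacement"])
        pos = rng.randrange(len(chars))
        if op == "perturbation" and len(chars) > 1:
            # Assumption: "perturbation" swaps two adjacent characters.
            pos = min(pos, len(chars) - 2)
            chars[pos], chars[pos + 1] = chars[pos + 1], chars[pos]
        elif op == "insertion":
            chars.insert(pos, rng.choice(ALPHABET))
        elif op == "deletion":
            del chars[pos]
        else:
            # Replacement (also used when a one-character text cannot be swapped).
            chars[pos] = rng.choice(ALPHABET)
    return "".join(chars)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the noisy prompt is reproducible
    print(misspell("Day -7: hemoglobin is 11.2.", 3, rng))
```

Passing an explicit `random.Random` instance keeps the noise reproducible across evaluation runs, which matters when comparing error at different misspelling counts.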
The input text includes dates, variable names, values, baseline information, and prompts. One operation is considered one misspelling.

For the evaluation of the effects of RWD missingness and noise, we randomly sampled 200 patients of the test set, which can potentially lead to higher variance in the results, but allowed for a more extensive exploration.

Chatbot and zero-shot learning

We employed the DT-GPT model to run a chatbot based on patient histories for prediction explanation and zero-shot forecasting. For this, we first used DT-GPT to generate forecasting results from the patient history and subsequently added a task-specific prompt, surrounded by the respective instruction-indication tokens, to the DT-GPT chat history to receive a response. For prediction explanation, the prompt asked for the most important variables influencing the predicted trajectory. For zero-shot forecasting, the prompt specified the output format and days to predict new clinical variables that were not subject to optimization during training. Example prompts and chatbot interactions for both tasks are provided in Supplementary Note 7 and Fig. 5a, e.

Forecasting evaluation

Forecasting metrics, i.e. Eqs. (1)–(5), are designed to quantify the disparity between predicted and observed numeric values, providing an objective measure of the model’s predictive accuracy (Supplementary Note 8). Let $v_t^{(i)}$ be an observed (non-missing) value of clinical variable $v$ for a subject $i$, $i=1,\dots,n$, where $n$ is the total number of subjects, at time step $t$, $t=1,\dots,T_i$, where $T_i$ is the total number of time steps for subject $i$. Let $v_0^{(i)}$ be the baseline value at time step $t=0$. We denote predicted values as $\hat{v}_t^{(i)}$.
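With this notation, Eqs. (1)–(4) reduce to nested averages: first over the time steps of each subject, then over subjects. The sketch below shows one way to compute them. It assumes each subject is a list of observed (non-missing) values with an aligned list of predictions; the function names and the example numbers are illustrative and not taken from the study.

```python
def mae(observed, predicted):
    """Eq. (1): mean over subjects of the per-subject mean absolute error."""
    per_subject = [
        sum(abs(v - p) for v, p in zip(obs, pred)) / len(obs)
        for obs, pred in zip(observed, predicted)
    ]
    return sum(per_subject) / len(per_subject)


def scaled_mae(observed, predicted, sigma):
    """Eq. (2): MAE divided by the variable's standard deviation sigma."""
    return mae(observed, predicted) / sigma


def mase(observed, predicted, baselines):
    """Eq. (3): MAE relative to a forecast that repeats the baseline value v_0."""
    naive = [[b] * len(obs) for obs, b in zip(observed, baselines)]
    return mae(observed, predicted) / mae(observed, naive)


def smape(observed, predicted):
    """Eq. (4): symmetric percentage error; terms with a zero denominator count as 0."""
    per_subject = []
    for obs, pred in zip(observed, predicted):
        terms = [
            abs(v - p) / (abs(v) + abs(p)) if abs(v) + abs(p) != 0 else 0.0
            for v, p in zip(obs, pred)
        ]
        per_subject.append(sum(terms) / len(terms))
    return 200 * sum(per_subject) / len(per_subject)


if __name__ == "__main__":
    observed = [[10.0, 12.0], [8.0]]   # two subjects with T_1 = 2 and T_2 = 1
    predicted = [[11.0, 12.0], [6.0]]
    baselines = [11.0, 9.0]            # v_0 for each subject
    print(mae(observed, predicted))                 # 1.25
    print(scaled_mae(observed, predicted, 2.5))     # 0.5
    print(mase(observed, predicted, baselines))     # 1.25
```

Averaging within each subject first means that a patient with many measurements does not dominate the metric, which matters for sparse and irregularly sampled records.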
The forecasting metrics used are mean absolute error (MAE), scaled MAE, mean absolute scaled error (MASE), symmetric mean absolute percentage error (SMAPE) and Spearman correlation coefficient, defined as follows:

$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}\left|v_t^{(i)}-\hat{v}_t^{(i)}\right|$$ (1)

$$\mathrm{scaled\;MAE}=\frac{\mathrm{MAE}}{\sigma}$$ (2)

where $\sigma$ is the standard deviation of the clinical variable after outlier filtering;

$$\mathrm{MASE}=\frac{\mathrm{MAE}}{\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}\left|v_t^{(i)}-v_0^{(i)}\right|}$$ (3)

$$\mathrm{SMAPE}=\frac{200}{n}\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}\frac{\left|v_t^{(i)}-\hat{v}_t^{(i)}\right|}{\left|v_t^{(i)}\right|+\left|\hat{v}_t^{(i)}\right|}\,\mathbb{1}_{\{|v_t^{(i)}|+|\hat{v}_t^{(i)}|\neq 0\}}$$ (4)

where $\mathbb{1}$ is the indicator function to avoid division by 0;

$$\mathrm{Spearman}\;\rho=\frac{\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}\left(R[v_t^{(i)}]-\underline{R[v]}\right)\left(R[\hat{v}_t^{(i)}]-\underline{R[\hat{v}]}\right)}{\sqrt{\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}\left(R[v_t^{(i)}]-\underline{R[v]}\right)^{2}\,\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}\left(R[\hat{v}_t^{(i)}]-\underline{R[\hat{v}]}\right)^{2}}}$$ (5)

where $R[\cdot]$ is a rank function, ordering values from lowest to highest, whereby data points with the same value are assigned their average rank, and $\underline{R[v]}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}R[v_t^{(i)}]$ and $\underline{R[\hat{v}]}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{T_i}\sum_{t=1}^{T_i}R[\hat{v}_t^{(i)}]$ are the mean ranks of actual and predicted values, respectively.

We chose scaled MAE, i.e., Eq.
(2), as our primary metric as it allows comparison across all variables, and hence can be used to benchmark different models on all datasets.

Classification evaluation

Classification metrics assess the model’s clinical utility to capture events, such as abrupt changes in clinical variables indicative of acute conditions (e.g., sudden drops or increases) or prolonged trends in variable changes that are characteristic of a chronic condition (e.g., gradual increases or decreases over extended periods). Below, we provide detailed definitions of the metrics employed in our evaluation. An interpretation of the introduced metrics is provided in Supplementary Note 8.

First, we assess the model’s ability to detect values outside the normal range of clinical variables. Let $[v_{\min},\,v_{\max}]$ be the reference interval for the clinical variable $v$. We label the observed variable value $v_t^{(i)}$ as “low” if $v_t^{(i)} < v_{\min}$, as “high” if $v_t^{(i)} > v_{\max}$ and as “normal” if $v_{\min} \le v_t^{(i)} \le v_{\max}$. We define $v_t^{(i)}$ as “not low” if it is “normal” or “high”, as “not high” if it is “normal” or “low”, and as “not normal” if it is “low” or “high”. Analogously, we label each predicted variable value $\hat{v}_t^{(i)}$. This turns the evaluation into a classification task.

For the binary classification tasks “low” versus “not low”, “high” versus “not high”, and “normal” versus “not normal”, we calculate the area under the receiver operating characteristic curve (AUC ROC) and denote it as $\text{AUC}_{\text{low}}$, $\text{AUC}_{\text{high}}$ and $\text{AUC}_{\text{normal}}$, respectively. For the multiclass classification task “low” versus “normal” versus “high”, we calculate the weighted AUC ROC, denoted by $\text{AUC}_{\text{weighted}}$ (Eq.
(6)), that is given by

$$\text{AUC}_{\text{weighted}} = \frac{(\text{AUC}_{\text{low}} \times \#\text{low}) + (\text{AUC}_{\text{normal}} \times \#\text{normal}) + (\text{AUC}_{\text{high}} \times \#\text{high})}{\#\text{low} + \#\text{normal} + \#\text{high}}$$ (6)

where $\#\text{low}$, $\#\text{normal}$ and $\#\text{high}$ correspond to the number of observed variable values $v_t^{(i)}$ labeled as “low”, “normal” and “high”, respectively. Weighted aggregation accounts for the class imbalance, whereby most of the variable values fall within the reference range and are labeled as “normal”.

We evaluated the model’s trend forecasting performance by analyzing its predicted value trajectories over a specified time interval $s$. Within these forecasts, a predicted value $\hat{v}_t^{(i)}$ was classified as “decreasing trend” if $\hat{v}_{t+1}^{(i)} < \hat{v}_t^{(i)}$ or as “increasing trend” if $\hat{v}_{t+1}^{(i)} > \hat{v}_t^{(i)}$. For a trend to be classified at time $t$, the direction of change between consecutive predicted values had to be consistent throughout the entire preceding lookback window. Specifically, $\hat{v}_t^{(i)}$ was classified as “decreasing trend” only if $\hat{v}_{k+1}^{(i)} < \hat{v}_k^{(i)}$ for all time steps $k$ within the interval $[time(t)-s, time(t)]$, and as “increasing trend” only if $\hat{v}_{k+1}^{(i)} > \hat{v}_k^{(i)}$ for all $k$ in that same interval. Here, $time(t)$ represents the time since the last input measurement. Ground truth trends were derived in the same way from the observed values $v_t^{(i)}$. We then assessed the model’s classification of these trends in its forecasts using two binary classification tasks: “decreasing” versus “not decreasing”, and “increasing” versus “not increasing”. Performance was quantified by calculating the area under the receiver operating characteristic curve (AUC) based on the forecasted values, yielding $\text{AUC}_{\text{trend}\downarrow}$ and $\text{AUC}_{\text{trend}\uparrow}$.
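The labeling rules and Eq. (6) can be illustrated with a minimal Python sketch. It is not the authors’ implementation: the function names are hypothetical, values on the boundary of the reference interval are assumed to count as “normal”, the per-class AUCs are taken as given inputs, and the lookback rule is read as “every consecutive step between measurements whose times fall inside $[time(t)-s, time(t)]$ has the same sign”. Points with $time(t) < s$ receive no label because no complete lookback window exists for them.

```python
from typing import Dict, List, Optional, Sequence


def label_range(value: float, v_min: float, v_max: float) -> str:
    """Label a value against the reference interval [v_min, v_max]."""
    if value < v_min:
        return "low"
    if value > v_max:
        return "high"
    return "normal"  # assumption: boundary values count as normal


def weighted_auc(auc: Dict[str, float], counts: Dict[str, int]) -> float:
    """Eq. (6): per-class AUCs weighted by the number of observed values per class."""
    total = sum(counts[c] for c in auc)
    return sum(auc[c] * counts[c] for c in auc) / total


def trend_labels(
    times: Sequence[float], values: Sequence[float], s: float
) -> List[Optional[str]]:
    """Label each point as 'increasing', 'decreasing' or 'none'.

    `times` holds the time since the last input measurement, sorted ascending.
    Points with time < s get None (excluded: incomplete lookback window).
    """
    labels: List[Optional[str]] = []
    for t, time_t in enumerate(times):
        if time_t < s:
            labels.append(None)
            continue
        # Indices of measurements inside the lookback window ending at t.
        window = [k for k in range(t + 1) if times[k] >= time_t - s]
        # Differences between consecutive values inside the window.
        steps = [values[b] - values[a] for a, b in zip(window, window[1:])]
        if steps and all(d > 0 for d in steps):
            labels.append("increasing")
        elif steps and all(d < 0 for d in steps):
            labels.append("decreasing")
        else:
            labels.append("none")
    return labels


# Illustrative (made-up) weekly trajectory, with s = 21 days.
example_times = [7, 14, 21, 28, 35]
example_values = [10.0, 9.5, 9.0, 9.2, 9.6]
example_trends = trend_labels(example_times, example_values, s=21)
```

Applying the same `trend_labels` function to the observed trajectory gives the ground truth labels against which the forecast-based trend classification is scored.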
Forecasted values were excluded from this analysis if $time(t) < s$ to ensure a complete lookback window was available. We provide an example and illustration in Supplementary Fig. 13.

We performed the classification evaluation only on the NSCLC data. For this, we used reference ranges $[v_{\min}, v_{\max}]$ as found in the literature. For hemoglobin [g/dL], we set [14, 18] and [12, 16] for male and female patients53, respectively. We set [4.5, 11.0] for leukocytes [10⁹/L]54, [20, 40] for leukocytes/lymphocytes [%]54, [1.0, 4.0] for lymphocytes [10⁹/L]55, [1.8, 7.5] for neutrophils [10⁹/L]55 and [122, 222] for lactate dehydrogenase [U/L]36.

We further assess the model’s ability to detect a significant drop in hemoglobin associated with bleeding by calculating $\text{AUC}_{\text{low}}$ with $v_{\min} = 7.5$ g/dL. For trend detection, we consider time intervals of 3 weeks and set $s = 21$ days for all NSCLC variables. This time period is clinically relevant for capturing the increasing or decreasing dynamics of a clinical variable.

Data availability

The Flatiron Health data that support the findings of this study were originated by and are the property of Flatiron Health, Inc., which has restrictions prohibiting the authors from making the data set publicly available. Requests for data sharing by license or by permission for the specific purpose of replicating results in this manuscript can be submitted to PublicationsDataAccess@flatiron.com. The Medical Information Mart for Intensive Care IV (MIMIC-IV) is available online upon request at https://physionet.org/content/mimiciv. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset is available online upon request at https://adni.loni.usc.edu/data-samples/adni-data/.

Code availability

The code is available at https://github.com/MendenLab/DT-GPT, including all package and Python versions, as well as license information.
Specific parameters used to generate and analyze the datasets presented in this manuscript are detailed in the repository’s README file and relevant configuration files.

References

1. Schachter, A. D. & Ramoni, M. F. Clinical forecasting in drug development. Nat. Rev. Drug Discov. 6, 107–108 (2007).
2. Allen, A. et al. A digital twins machine learning model for forecasting disease progression in stroke patients. Appl. Sci. 11, 5576 (2021).
3. Boulos, M. N. K. & Zhang, P. Digital twins: From personalised medicine to precision public health. J. Pers. Med. 11, 745 (2021).
4. Bordukova, M., Makarov, N., Rodriguez-Esteban, R., Schmich, F. & Menden, M. P. Generative artificial intelligence empowers digital twins in drug discovery and clinical trials. Expert Opin. Drug Discov. 19, 33–42 (2024).
5. Coorey, G. et al. The health digital twin to tackle cardiovascular disease—a review of an emerging interdisciplinary field. npj Digit. Med. 5, 126 (2022).
6. Venkatesh, K. P., Raza, M. M. & Kvedar, J. C. Health digital twins as tools for precision medicine: Considerations for computation, implementation, and regulation. npj Digit. Med. 5, 150 (2022).
7. Bordukova, M. et al. Generative AI and digital twins: shaping a paradigm shift from precision to truly personalized medicine. Expert Opin. Drug Discov. 20, 821–826 (2025).
8. Moingeon, P., Chenel, M., Rousseau, C., Voisin, E. & Guedj, M. Virtual patients, digital twins and causal disease models: Paving the ground for in silico clinical trials. Drug Discov. Today 28, 103605 (2023).
9. Nguyen, M. et al. Predicting Alzheimer’s disease progression using deep recurrent neural networks. NeuroImage 222, 117203 (2020).
10. Jung, W., Mulyadi, A. W. & Suk, H. I. Unified Modeling of Imputation, Forecasting, and Prediction for AD Progression. in Lecture Notes in Computer Science 168–176 (2019).
11. Wu, F. et al. Forecasting Treatment Outcomes Over Time Using Alternating Deep Sequential Models. IEEE Transactions on Biomedical Engineering PP, 1–10 (2023).
12. Phetrittikun, R. et al. Temporal Fusion Transformer for forecasting vital sign trajectories in intensive care patients. in 2021 13th Biomed Eng Int Conf (BMEiCON) 1–5 (2021).
13. Chang, P. et al. A transformer-based diffusion probabilistic model for heart rate and blood pressure forecasting in Intensive Care Unit. Comput. Methods Prog. Biomed. 246, 108060 (2024).
14. Melnychuk, V., Frauen, D. & Feuerriegel, S. Causal Transformer for Estimating Counterfactual Outcomes. in International Conference on Machine Learning 15293–15293 (2022).
15. Kaddour, J., Lynch, A., Liu, Q., Kusner, M. J. & Silva, R. Causal machine learning: A survey and open problems. Foundations and Trends® in Optimization 9, 1–247 (2025).
16. Wornow, M. et al. The shaky foundations of large language models and foundation models for electronic health records. npj Digit. Med. 6, 135 (2023).
17. Renc, P. et al. Zero shot health trajectory prediction using transformer. npj Digit. Med. 7, 256 (2024).
18. Liang, Y. et al. Foundation Models for Time Series Analysis: A Tutorial and Survey. in Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining (2024).
19. Xue, H. & Salim, F. D. PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting. IEEE Transactions on Knowledge and Data Engineering (2023).
20. Liu, H., Zhao, Z., Wang, J., Kamarthi, H. & Prakash, B. B. LSTPrompt: Large Language Models as Zero-Shot Time Series Forecasters by Long-Short-Term Prompting. in Association for Computational Linguistics Findings 2024 (2024).
21. Gruver, N., Finzi, M., Qiu, S. & Wilson, A. G. Large Language Models Are Zero-Shot Time Series Forecasters. in Advances in Neural Information Processing Systems (2023).
22. Jin, M. et al. Time-LLM: Time Series Forecasting by Reprogramming Large Language Models. in International Conference on Learning Representations (2024). https://doi.org/10.48550/arxiv.2310.01728.
23. Zhou, T., Niu, P., Wang, X., Sun, L. & Jin, R. One Fits All: Power General Time Series Analysis by Pretrained LM. arXiv (2023) https://doi.org/10.48550/arxiv.2302.11939.
24. Loureiro, H. et al. Correlation between early trends of a prognostic biomarker and overall survival in non–small-cell lung cancer clinical trials. JCO Clin. Cancer Inform. 7, e2300062 (2023).
25. Johnson, A. E. et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10, 1 (2023).
26. Tombaugh, T. N. & McIntyre, N. J. The mini-mental state examination: A comprehensive review. J. Am. Geriatr. Soc. 40, 922–935 (1992).
27. O’Bryant, S. E. et al. Validation of the new interpretive guidelines for the clinical dementia rating scale sum of boxes score in the national Alzheimer’s coordinating center database. Arch. Neurol. 67, 746–749 (2010).
28. Kueper, J. K., Speechley, M. & Montero-Odasso, M. The Alzheimer’s disease assessment scale-cognitive subscale (ADAS-Cog): modifications and responsiveness in pre-dementia populations. A narrative review. Journal of Alzheimer’s Disease 63, 423–444 (2018).
29. Lim, B., Arık, S., Loeff, N. & Pfister, T. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 37, 1748–1764 (2021).
30. Das, A. et al. Long-term Forecasting with TiDE: Time-series Dense Encoder. Transactions on Machine Learning Research (2023).
31. Nespoli, L. & Medici, V. Multivariate Boosted Trees and Applications to Forecasting and Control. J. Mach. Learn. Res. 23, 1–47 (2022).
32. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q. & Liu, T. Y. Lightgbm: A highly efficient gradient boosting decision tree. in Advances in Neural Information Processing Systems 30 (2017).
33. Labrak, Y. et al. BioMistral: A collection of open-source pretrained large language models for medical domains. arXiv (2024).
34. Yang, A. et al. Qwen3 technical report. arXiv https://doi.org/10.48550/arxiv.2505.09388 (2025).
35. Nie, Y., Nguyen, N. H., Sinthong, P. & Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. in International Conference on Learning Representations (2023).
36. Farhana, A. & Lappin, S. L. Biochemistry, Lactate Dehydrogenase. (2023).
37. Margraf, A., Lowell, C. A. & Zarbock, A. Neutrophils in acute inflammation: current concepts and translational implications. Blood 139, 2130–2144 (2022).
38. Groopman, J. E. & Itri, L. M. Chemotherapy-induced anemia in adults: incidence and treatment. J. Natl. Cancer Inst. 91, 1616–1634 (1999).
39. Abdel-Razeq, H. & Hashem, H. Recent update in the pathogenesis and treatment of chemotherapy and cancer induced anemia. Crit. Rev. Oncol. Hematol. 145, 102837 (2020).
40. Wang, Y., Probin, V. & Zhou, D. Cancer therapy-induced residual bone marrow injury: Mechanisms of induction and implication for therapy. Curr. Cancer Ther. Rev. 2, 271–279 (2006).
41. Cella, D. The functional assessment of cancer therapy-anemia (FACT-An) Scale: A new tool for the assessment of outcomes in cancer anemia and fatigue. Semin. Hematol. 34, 13–19 (1997).
42. Pathak, N. et al. Improving the performance status in advanced non-small cell lung cancer patients with chemotherapy (ImPACt trial): A phase 2 study. J. Cancer Res. Clin. Oncol. 149, 6399–6409 (2023).
43. Tas, F., Ciftci, R., Kilic, L. & Karabulut, S. Age is a prognostic factor affecting survival in lung cancer patients. Oncol. Lett. 6, 1507–1513 (2013).
44. Lee, S., Jeon, H. & Shim, B. Prognostic value of ferritin-to-hemoglobin ratio in patients with advanced non-small-cell lung cancer. J. Cancer 10, 1717–1725 (2019).
45. Matsukane, R. et al. Prognostic significance of pre-treatment ALBI grade in advanced non-small cell lung cancer receiving immune checkpoint therapy. Sci. Rep. 11, 15057 (2021).
46. Tomita, M., Shimizu, T., Hara, M., Ayabe, T. & Onitsuka, T. Impact of preoperative hemoglobin level on survival of non-small cell lung cancer patients. Anticancer Res. 28, 1947–1950 (2008).
47. Sravanthi, S. L. et al. PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities. in Findings of the Association for Computational Linguistics: ACL 2024 (2024).
48. Cross, J. L., Choma, M. A. & Onofrey, J. A. Bias in medical AI: Implications for clinical decision-making. PLOS Digit. Heal. 3, e0000651 (2024).
49. Ma, X., Long, L., Moon, S., Adamson, B. & Baxi, S. Comparison of Population Characteristics in Real-World Clinical Oncology Databases in the US: Flatiron Health, SEER, and NPCR. medRxiv 2020, (2023).
50. Birnbaum, B. et al. Model-assisted cohort selection with bias analysis for generating large-scale cohorts from the EHR for oncology research. arXiv preprint arXiv:2007.XXXX (2020).
51. Gupta, M. et al. An Extensive Data Processing Pipeline for MIMIC-IV. in Proceedings of Machine Learning Research 311–325 (2022).
52. Wang, X. et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. in The Eleventh International Conference on Learning Representations (2022).
53. Billett, H. H., Walker, H. K., 1, W. D. H. & Hurst, J. W. Hemoglobin and Hematocrit. in Clinical Methods: The History, Physical, and Laboratory Examinations. 3rd Edition (1990).
54. Riley, L. K. & Rupert, J. Evaluation of patients with leukocytosis. Am. Fam. Physician 92, 1004–1011 (2015).
55. Haematology reference ranges. https://www.gloshospitals.nhs.uk/our-services/services-we-offer/pathology/haematology/haematology-reference-ranges/ (2024).

Acknowledgements

We would like to thank Anton Kraxner for providing crucial insights into NSCLC, as well as Ginte Kutkaite, Hugo Loureiro, Franziska Braun, Rudolf Kinder, Venus So, Guy Amster and Will Shapiro for their valuable input and discussions. This study was funded by F. Hoffmann-La Roche and the European Union’s Horizon 2020 Research and Innovation Programme (Grant agreement No. 950293–COMBAT-RES). The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript. Data collection and sharing for the Alzheimer’s Disease Neuroimaging Initiative (ADNI) is funded by the National Institute on Aging (National Institutes of Health Grant U19AG024904). The grantee organization is the Northern California Institute for Research and Education. In the past, ADNI has also received funding from the National Institute of Biomedical Imaging and Bioengineering, the Canadian Institutes of Health Research, and private sector contributions through the Foundation for the National Institutes of Health (FNIH), including generous contributions from the following: AbbVie; Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F.
Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics.

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Author notes

These authors contributed equally: Nikita Makarov, Maria Bordukova.

Authors and Affiliations

Roche Innovation Center Munich (RICM), Penzberg, Germany: Nikita Makarov, Maria Bordukova & Fabian Schmich
Computational Health Center, Helmholtz Munich, Munich, Germany: Nikita Makarov, Maria Bordukova, Papichaya Quengdaeng, Daniel Garger & Michael P. Menden
Department of Biology, Ludwig Maximilian University of Munich, Munich, Germany: Nikita Makarov, Maria Bordukova & Daniel Garger
TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany: Papichaya Quengdaeng
Roche Innovation Center Basel (RICB), Basel, Switzerland: Raul Rodriguez-Esteban
Department of Biochemistry and Pharmacology, Bio21 Molecular Science and Biotechnology Institute, The University of Melbourne, Melbourne, VIC, Australia: Michael P. Menden

Contributions

N.M., M.B. and P.Q.
performed data processing. N.M. and M.B. performed model implementation. N.M., M.B. and D.G. performed model evaluation. R.R.-E., F.S. and M.P.M. supervised, designed and directed the project. N.M. and M.B. drafted the manuscript. N.M., M.B., D.G., R.R.-E., F.S. and M.P.M. substantially revised this manuscript. All authors have read and approved the manuscript.

Corresponding authors

Correspondence to Fabian Schmich or Michael P. Menden.

Ethics declarations

Competing interests

N.M., M.B., R.R.-E. and F.S. are all employees of F. Hoffmann-La Roche. M.P.M. collaborates with and is financially supported by GSK, F. Hoffmann-La Roche, and AstraZeneca. M.P.M. is supported by the European Union’s Horizon 2020 Research and Innovation Programme (Grant agreement No. 950293–COMBAT-RES). N.M., M.B., R.R.-E., F.S. and M.P.M. are authors of an in-force patent entitled “Forecasting of subject-related attributes using generative machine-learning model” (patent publication number 2025/021719, patent application number EP2024070632) owned by F. Hoffmann-La Roche and Helmholtz Zentrum Munich. The patent covers the application of large language models such as DT-GPT for forecasting clinical trajectories of patients during a clinical trial.
The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.