A systematic exploration of digital biomarkers for the detection of depressive episodes in bipolar disorder

Introduction

Digital biomarkers have been introduced into several fields of medicine to enhance diagnosis1,2,3,4,5,6,7,8,9, monitor disease progression2,3,6,7,10,11,12,13,14,15, and personalize treatment plans16,17,18, ultimately improving patient outcomes. In psychiatry, wearables and smartphones are increasingly used to collect data (both passively and actively) outside traditional clinical settings19,20,21,22, a process referred to as "digital phenotyping"23. However, the promise of this approach to transform psychiatry has been impeded by a lack of attention to the crucial initial stages required to identify and validate digital biomarkers. The successful identification and validation of other types of biomarkers have required a stepwise process. For instance, the selection of reliable pharmacogenetic biomarkers in psychiatry has taken more than two decades, starting with the feasibility of characterizing specific genes and alleles associated with drug metabolism24. The next step was to assess whether these pharmacogenetic biomarkers were associated with clinical outcomes in large datasets25. Then, the potential clinical impact of using these biomarkers was assessed in small prospective studies26. Finally, their clinical efficacy in improving treatment outcomes was established in large randomized clinical trials (RCTs) comparing outcomes with and without the use of pharmacogenetic biomarkers27. We propose that a similar approach is needed to establish the usefulness of digital biomarkers. The feasibility of collecting multimodal, densely sampled data ("digital phenotyping") in patients with psychiatric disorders has now been established28,29,30,31,32,33,34,35. Before we can assess whether the potentially unlimited number of measures of objective behaviors and subjective mental states ("digital biomarkers") can be used to improve care and outcomes, we need to identify a small number of them that are accurate indicators of relevant mental states.

Explainable machine learning (ML) models offer a plausible approach for identifying digital biomarkers that accurately indicate the presence of a specific mental state36. For instance, in a 30-day longitudinal study involving 54 participants, with self-reported depression severity ranging from none to severe, explainable ML identified the best descriptors of depressive states as lower distance traveled, lower location entropy, fewer location visits, more total sleep time, and less physical activity37. In a longer, 1-year study using continuous wearable data and monthly self-rated Patient Health Questionnaire (PHQ-9) scales in 10,036 participants38, an explainable ML model identified sleep changes and lower recent step count as the most important digital biomarkers in the model's classification of depressive episodes.

Despite the effort invested in developing explainable ML models for mental disorders in general, and for mood disorders in particular, building a single one-size-fits-all classification model remains a challenge39. The different mental states that exist in mood disorders add to the complexity of developing such models. For example, a marker that is associated with bipolar disorder (BD) when the patient is in a depressive episode may no longer be associated with the disorder when the same patient is euthymic. This is partly because prediction is the last step of a three-step process: association, description (or detection), and prediction (or forecasting) of episodes of illness40,41.
In this regard, association involves determining the relationship between one or more variables and a mental state or a severity indicator of that mental state42,43. Description takes this one step further and identifies digital measures of objective behaviors and subjective mental states ("digital biomarkers") that can be integrated with sociodemographic data and clinical characteristics to detect a mental state44,45,46. Prediction aims to anticipate the future occurrence of a mental state before its symptoms unfold, based on the identification of early warning signals in historical objective and subjective data47,48,49.

In this context, we undertook a longitudinal study to systematically assess which digital biomarkers are the most accurate descriptors of depressive episodes in BD across different time scales, using longitudinal, densely sampled multivariate objective and subjective data. Unlike many prior studies that focus primarily on mood episode classification with limited methodological innovation, our work presents a robust, systematic approach to identifying a set of reliable digital descriptors of depressive episodes in BD. By integrating rigorous validation techniques, including bootstrapped feature importance and permutation testing of feature importance, selection frequency, and rank stability across multimodal longitudinal data, we move beyond classification accuracy to unveil reproducible biomarkers with strong potential for clinical actionability. Based on our previous work50, we hypothesised that the day-to-day variability in mood, activity, and sleep would be the most accurate descriptors.

Methods

Participants and data collection

The overall methods of this ongoing longitudinal study have been described previously51. In brief, between December 12, 2020, and January 15, 2024, we recruited and followed 164 adults diagnosed with BD type I or II in the outpatient mood clinics of two academic psychiatric hospitals in Canada: the Centre for Addiction and Mental Health (CAMH), Toronto, Ontario; and the Queen Elizabeth II Health Sciences Centre, Halifax, Nova Scotia. This analysis focused on identifying specific descriptors that could be used to detect depressive episodes; we therefore excluded 31 participants who entered the study in a manic, hypomanic, or mixed episode.

The study was approved by the Research Ethics Board (REB) at the Centre for Addiction and Mental Health (CAMH), Toronto, Ontario, Canada, and at the QEII Health Sciences Centre, Halifax, Nova Scotia, Canada, in accordance with the Declaration of Helsinki (REB #: 059-2019). All participants signed an informed consent form approved by the local research ethics board before any research procedure was initiated. Primary diagnoses of BD I or II were established with the Structured Clinical Interview (SCID-5)53 based on the diagnostic criteria of the Diagnostic and Statistical Manual of Mental Disorders, fifth edition (DSM-5)52. Participants completed a baseline assessment, including the gathering of their sociodemographic characteristics, clinical history, cardiovascular health, chronotype, and medication regimen. Polarity (euthymia, depression, or (hypo)mania) at baseline was defined based on the Young Mania Rating Scale (YMRS)54 and the Montgomery-Asberg Depression Rating Scale (MADRS)55: euthymia was defined as both YMRS and MADRS scores < 10; depression as MADRS scores ≥ 10; and (hypo)mania as YMRS scores ≥ 10.
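For illustration, this baseline polarity rule can be written as a small Python helper. The handling of jointly elevated YMRS and MADRS scores (a possible mixed presentation) is our assumption; the text does not specify a tie-break.

```python
def clinical_polarity(ymrs: int, madrs: int) -> str:
    """Classify baseline polarity from YMRS and MADRS total scores using the
    study's cut-offs: both < 10 -> euthymia; MADRS >= 10 -> depression;
    YMRS >= 10 -> (hypo)mania."""
    if ymrs < 10 and madrs < 10:
        return "euthymia"
    if ymrs >= 10 and madrs >= 10:
        return "mixed"  # assumed handling of jointly elevated scores (not stated in the text)
    if madrs >= 10:
        return "depression"
    return "(hypo)mania"
```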
All participants were treated by a psychiatrist affiliated with the study or by a community-based psychiatrist.

Participants were instructed to wear an Oura ring (Oura Health Oy, Generation 2, Oulu, Finland) continuously throughout the study duration, including during sleep, and to charge the device during periods of inactivity as needed. No fixed charging schedule or specific charging intervals were mandated, allowing participants to manage device battery life flexibly. The ring produces a variety of data related to activity and sleep56 (Fig. 1). As an incentive, participants received $180 upon completing their first year. For the second year, they could choose between an additional $180 payment or ownership of the Oura ring. Additionally, participants received a daily email with a link to an electronic visual analog scale (e-VAS) to self-rate their mood, energy, and anxiety levels over the previous 24 h. Each e-VAS component was scored in one-point increments ranging from 10 to 90, with participants instructed to consider "50" as their baseline (e.g., their "usual" mood, energy, or anxiety): ratings higher than 50 indicated levels higher than their usual, and ratings lower than 50 indicated levels lower than their usual. Reminders were sent to participants when e-VAS ratings had not been completed for three consecutive days. Finally, participants also received a weekly email with a link to an electronic version of the 9-item Patient Health Questionnaire (PHQ-9)57 and the Altman Self-Rating Mania scale (ASRM)58 to rate their depressive or manic symptoms, respectively, during the past week. Daily e-VAS ratings and weekly symptom ratings were stored in a REDCap (Research Electronic Data Capture)59 secure database.

Fig. 1: Overview of the clinical course and time scales for data generation. a Clinical course of two exemplary participants: one who experienced the onset of a major depressive episode and one who remained in euthymia. This panel illustrates the longitudinal variability in clinical polarity and highlights the differences between participants with and without the onset of a depressive episode. b Subjective and objective variables used to generate features: a priori selection of three out of three variables provided by participants, 1 out of 15 variables related to activity from the wearable, and 11 out of 23 variables related to sleep from the wearable. c Collection and aggregation of data over various time scales to create variables: subjective data provided by participants consist of three self-reported measures; objective data provided by wearables consist of densely sampled data on activity and sleep.

To assess participant adherence to device-wear protocols, we computed non-wear time and missing-data ratios across modalities. Non-wear time was provided directly in the Oura ring activity data dictionary, and we computed the missing data ratio (MDR) by modality (i.e., activity, sleep, scales) as the ratio of missing data observations to the total number of expected data observations over a participant's study course. We also performed a sensitivity analysis comparing the extracted digital biomarkers across two subsamples created by dividing the full sample at the median participation length.

Data generation: creation of variables (Fig. 1)

Figure 1 presents an overview of the clinical course and time scales for data generation. Based on the weekly PHQ-9, we used a 2-week rolling window to identify depressive episodes and the time of their onset: the onset of a depressive episode was defined as the start of at least two consecutive weeks with a PHQ-9 score ≥ 10. Previous studies have shown this approach to be both highly sensitive and specific60,61 in defining major depressive episodes50 (Fig. 1a).
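For illustration, a minimal Python sketch of this episode-onset rule applied to a weekly PHQ-9 series; the dates and scores are hypothetical.

```python
import pandas as pd

def depressive_episode_onsets(phq9: pd.Series, threshold: int = 10) -> pd.Index:
    """Onset = first of at least two consecutive weekly PHQ-9 scores >= threshold."""
    above = phq9 >= threshold
    starts_run = above & above.shift(-1, fill_value=False)  # this week and the next are both above
    onset = starts_run & ~above.shift(1, fill_value=False)  # and the previous week was below
    return phq9.index[onset.to_numpy()]

# Example: six weekly scores with one two-week episode starting 2021-01-17
weeks = pd.date_range("2021-01-03", periods=6, freq="W")
scores = pd.Series([4, 7, 12, 15, 8, 6], index=weeks)
print(depressive_episode_onsets(scores))  # -> DatetimeIndex(['2021-01-17'])
```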
Among the 39 data streams generated by the Oura ring, we selected a smaller set of 12 objective measures (1 out of 16 related to activity and 11 out of 23 related to sleep) (Fig. 1b). For activity, all variables included in the analysis were derived from the 5-minute sampled Metabolic Equivalent of a Task (MET)62 produced by the ring and aggregated over one-hour periods, day times, evening times, night times, one-day periods, one-week periods, and one-month periods. For sleep, all variables included in the analysis were derived from: (i) the 5-minute hypnograms56 produced by the ring and aggregated over one-hour periods; and (ii) a set of 10 daily measures (awake duration, sleep-onset latency, restless sleep duration, light sleep duration, deep sleep duration, REM duration, total sleep duration, waking-up count, getting-up count, and sleep efficiency) aggregated over weeks and months. Similarly, daily e-VAS data were aggregated over weeks and months (Fig. 1c). Overall, considering the objective data produced by the ring, the subjective (e-VAS) data provided by the participants, and the various time periods, we created a set of 49 variables (see Supplementary Table 1).

Data processing: imputation of missing data and calculation of features (Fig. 2a, b)

Given our previous work showing that data in this sample were missing not at random (MNAR)63, we generated ten imputed datasets using the K-Nearest Neighbors (KNN) method64. KNN imputation identifies the k nearest neighbours of each missing data point based on a distance metric and imputes the missing value using the median of these neighbours. This method uses information from similar observations, making it suitable for handling data that are MNAR. Each of the ten imputed datasets was generated from different yet plausible imputed values (Fig. 2a). We conducted the subsequent analyses on each of these ten imputed datasets and mean-aggregated the results to yield parameter estimates and standard errors. The following analyses were performed within each imputed dataset.
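For illustration, a minimal sketch of multiple KNN imputation. Two assumptions are worth flagging: scikit-learn's KNNImputer averages the k nearest neighbours (the study uses their median), and the ten datasets are made to differ here by varying k, since plain KNN imputation is deterministic; the authors' exact procedure may differ.

```python
import numpy as np
from sklearn.impute import KNNImputer

def knn_multiple_imputation(X: np.ndarray, n_datasets: int = 10) -> list[np.ndarray]:
    """Generate several plausible completions of a matrix containing NaNs."""
    imputed = []
    for k in range(3, 3 + n_datasets):  # k = 3..12 yields ten variants (assumed mechanism)
        imputed.append(KNNImputer(n_neighbors=k).fit_transform(X))
    return imputed

# Toy usage: a 5 x 3 matrix with two missing entries
X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 3.0],
              [1.5, 2.5, 3.5],
              [2.2, 2.1, 3.1],
              [1.8, 2.4, 3.3]])
datasets = knn_multiple_imputation(X, n_datasets=3)
```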
Fig. 2: Schematic representation of the data pipeline from data collection to feature ranking. a Variable preprocessing, missing data imputation, and feature extraction: this consists of the use of multiple imputation to generate 10 imputed datasets, followed by the extraction of the main statistical characteristics (i.e., features) of each variable, spanning time scales from five minutes to one month. b Variables across various time scales used to generate matrices of features: each variable with a temporal resolution from five minutes to a month is used to generate seven features. All the features of all participants are used to create two matrices, one corresponding to euthymic states and one corresponding to depressive states. These matrices serve as input for the binary classifier. c Classification and generation of performance metrics: input features and confounding factors used in the ensemble-based binary classifier for distinguishing clinical polarity (euthymia vs. depression). This panel also illustrates the classifier's performance metrics, including Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves for both the Decision Tree model and the XGBoost model. Performance is quantified by the Area Under the ROC Curve (AU-ROC) and the Area Under the Precision-Recall Curve (AU-PRC). d Feature ranking: SHAP (Shapley) plots are generated to show each feature's average impact on the model's predictions and class-specific impacts for the features extracted during euthymia and depression. The plots provide insights into the relative importance of each feature in driving the classifier's outputs, enhancing model interpretability.

For each of the 133 participants, for each of the 49 variables, in each of the ten imputed datasets, we calculated seven features characterizing variability over the entire duration of participation while euthymic (or depressed, if applicable): the mean, coefficient of variation (CV), kurtosis, skewness, mean absolute differences (MAD), absolute sum of consecutive changes (ASCC), and autocorrelation (AC) (Fig. 2b; Supplementary Table 2). We selected these features to assess the statistical distribution, variability, and self-similarity of the variables. We computed the CV to capture relative variability (i.e., normalising variance by the mean) rather than absolute variability (i.e., changes in the measure); we explicitly distinguished between absolute and relative variability to avoid misinterpretation in cases where mean levels differ substantially across mood states. The calculations were performed across all available time periods (from 5 min to 1 month) to analyze both short- and long-term fluctuations. To assess the impact of various temporal resolutions, we extracted within-day activity features, within-night sleep features, and weekly and monthly features for activity, sleep, and all e-VAS domains using identical preprocessing steps, standardized variability descriptors, and the same evaluation pipeline. For example, for the daily e-VAS ratings of a participant who was euthymic for five months (153 days), became depressed for two months (61 days), and became euthymic again for six months (182 days), we would calculate the seven features listed above over the 335 days of euthymia (i.e., concatenating the two periods of euthymia) and the 61 days of depression. For the same participant, for the variable corresponding to monthly mood ratings, we would first aggregate the daily e-VAS ratings over the 13 months of participation and then calculate the seven features over the 11 non-overlapping 30-day periods of euthymia (again concatenating the two periods of euthymia) and the two non-overlapping 30-day periods of depression. In total, 343 features (49 variables × 7 features) were calculated for each individual participant. See Supplementary Table 1 for the full list of data, variables, and features, and Supplementary Table 2 for the mathematical description and interpretation of the features.
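For illustration, a sketch of the seven per-variable features; the exact definitions are given in Supplementary Table 2, so the MAD formula (here, mean absolute deviation from the mean) and the autocorrelation lag used below (lag 1) are assumptions.

```python
import numpy as np
from scipy import stats

def variability_features(x: np.ndarray) -> dict[str, float]:
    """Seven features summarising a variable's distribution and variability."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    return {
        "mean": float(mean),
        "cv": float(x.std() / mean),                    # coefficient of variation
        "kurtosis": float(stats.kurtosis(x)),
        "skewness": float(stats.skew(x)),
        "mad": float(np.mean(np.abs(x - mean))),        # mean absolute difference (assumed: from the mean)
        "ascc": float(np.sum(np.abs(np.diff(x)))),      # absolute sum of consecutive changes
        "ac": float(np.corrcoef(x[:-1], x[1:])[0, 1]),  # lag-1 autocorrelation (assumed lag)
    }

# Example: features of a synthetic 30-day series of daily mood e-VAS ratings
rng = np.random.default_rng(0)
mood = 50 + rng.normal(0, 8, size=30)
features = variability_features(mood)
```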
Classification model development and evaluation of model performance

The present analyses were designed to evaluate the association between specific features and concurrently assessed clinical polarities: the ML models were used to distinguish between depressive and euthymic data rather than to forecast future episodes. Accordingly, all reported classification performance metrics (e.g., AU-ROC, AU-PRC) reflect the ability of the features to discriminate between clinical polarities in the same data segment, consistent with an associative rather than temporally predictive framework. We created two matrices based on the features of all participants, the first corresponding to the euthymic states and the second to the depressed states; these matrices were the input for supervised ML classifiers using Random Forest65, XGBoost66 (Extreme Gradient Boosting), Logistic Regression, or Support Vector Machines (SVM). Input data were split into training and test sets via a random 80/20 shuffle split stratified by binary state (euthymic = 0, depressed = 1).

We evaluated model performance using a nested cross-validation (CV) framework to ensure unbiased estimation and robust feature selection. The outer CV consisted of 5 stratified folds; in each fold, the data were split into training and held-out test sets. Within each outer fold, model optimization and feature selection were performed using an inner CV with 5 folds and a randomized hyperparameter search (50 iterations). The models considered included Random Forest, Gradient Boosting, Logistic Regression, and Support Vector Machines, each combined with one of three feature selection methods: SelectKBest, Recursive Feature Elimination (RFE), and SelectFromModel; feature selection parameters (e.g., the number of features to select) were tuned within the inner CV. Preprocessing included robust scaling and KNN imputation of missing values, and feature outliers were handled by interquartile range-based clipping. The best model and feature subset from the inner CV were retrained on the entire outer training fold and evaluated on the held-out test fold. We assessed model performance by computing the area under the receiver operating characteristic curve (AU-ROC).

To evaluate the stability and significance of our classification models, we performed label permutation testing. This involved randomly shuffling class labels and recalculating model performance (AU-ROC) across 5 cross-validation folds to generate a null distribution. The true-label model performance was then compared against this null distribution to assess whether the observed results exceeded chance levels.
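For illustration, a condensed sketch of nested cross-validation with a label-permutation test using scikit-learn; the estimator, search space, and synthetic data are illustrative stand-ins, not the study's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.impute import KNNImputer
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_val_score, permutation_test_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the euthymia/depression feature matrices
X, y = make_classification(n_samples=200, n_features=100, random_state=0)

pipe = Pipeline([
    ("impute", KNNImputer()),                       # KNN imputation of missing values
    ("scale", RobustScaler()),                      # robust scaling
    ("select", SelectKBest()),                      # one of the three selector types
    ("clf", RandomForestClassifier(random_state=0)),
])
param_dist = {                                      # illustrative search space
    "select__k": [10, 20, 50],
    "clf__n_estimators": [100, 300],
    "clf__max_depth": [3, 5, None],
}
inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Inner CV tunes hyperparameters and feature selection; outer CV scores AU-ROC
search = RandomizedSearchCV(pipe, param_dist, n_iter=10, cv=inner,
                            scoring="roc_auc", random_state=0)
nested_auroc = cross_val_score(search, X, y, cv=outer, scoring="roc_auc")

# Label-permutation test: AU-ROC under shuffled labels forms the null distribution
score, null_scores, p_value = permutation_test_score(
    pipe, X, y, cv=outer, scoring="roc_auc", n_permutations=100)
```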
Model explainability: permutation feature importance and robustness

To interpret the contribution of specific features to classification performance, we computed SHapley Additive exPlanations (SHAP)67 values on the held-out test sets to provide model-agnostic, local explanations of feature contributions. SHAP values were aggregated across all outer folds to derive an overall feature importance ranking. This method provided insights into how each feature influenced the model's classification performance; it helped identify clinically actionable digital biomarkers and uncover potential underlying mechanisms driving the model's performance. We determined the significance of individual features by calculating their impact score (IS), derived as the average of the absolute SHAP values; the IS is a positive metric that quantifies the overall predictive power of each feature on the classification model's output. Ideally, a highly impactful feature would exhibit a homogeneous impact distribution, with lower or higher values tilting towards the positive class (i.e., depressive episodes).

Moreover, to assess feature robustness and importance stability, we performed a permutation importance analysis on the held-out test sets of each outer fold. For each fold, the best model was refit on 100 bootstrapped samples of the training data. For each bootstrap, permutation importance was computed on the held-out test set using 100 repeats to ensure stability of the importance estimates. The stability metrics we calculated per feature included (i) feature permutation importance (ranked within the fixed feature scope selected by the best model), (ii) feature selection frequency (i.e., the proportion of bootstraps in which a feature ranked within the top k features; k = 5 for activity data and k = 10 for sleep and e-VAS data), and (iii) feature rank distribution. We aggregated feature ranks and importance statistics across bootstraps and outer folds to quantify feature robustness. This approach allowed us to identify features that were consistently important across resamples, enhancing confidence in their relevance.

Statistical analysis

We implemented permutation testing (N = 100) to generate a null distribution of differences between the euthymia and depression groups and bootstrapping (N = 1000) to estimate variability in the test statistics, addressing the relatively small sample size. This resulted in a total of 100 × 1000 resampled datasets derived from the original imputed dataset. The Mann-Whitney U test was applied to evaluate differences between groups across the resampled distributions, providing a robust assessment of statistical significance. To ensure the reliability of our statistical analyses, we required each clinical state to have a minimum of five data points; this threshold was chosen to improve the power of each bootstrap run of the Mann-Whitney U test. We corrected for multiple testing of all 7 (Table 2) and top 10 (Tables 3 and 4) feature comparisons with a False Discovery Rate (FDR) correction using the Benjamini–Hochberg method68; statistical significance was assumed for FDR-corrected p-value
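For illustration, a minimal sketch of a bootstrapped Mann-Whitney U comparison with Benjamini–Hochberg correction; the group sizes, effect sizes, loop counts, and the use of the median bootstrap p-value per feature are placeholder choices, not the study's exact procedure.

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)

def bootstrapped_mwu(euth: np.ndarray, dep: np.ndarray, n_boot: int = 1000) -> np.ndarray:
    """Bootstrap the Mann-Whitney U test between euthymic and depressed
    feature values, returning the distribution of p-values."""
    pvals = np.empty(n_boot)
    for b in range(n_boot):
        e = rng.choice(euth, size=euth.size, replace=True)
        d = rng.choice(dep, size=dep.size, replace=True)
        pvals[b] = mannwhitneyu(e, d).pvalue
    return pvals

# Synthetic feature values per state (>= 5 data points per state, as required)
features_euth = [rng.normal(0.0, 1.0, 40) for _ in range(7)]
features_dep = [rng.normal(mu, 1.0, 20) for mu in (0.0, 0.1, 0.4, 0.6, 0.8, 1.0, 1.2)]

# One summary p-value per feature (median over bootstraps; a simplification),
# then Benjamini-Hochberg FDR correction across the seven feature comparisons
p_per_feature = [np.median(bootstrapped_mwu(e, d))
                 for e, d in zip(features_euth, features_dep)]
rejected, p_fdr, _, _ = multipletests(p_per_feature, method="fdr_bh")
```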