Estimating motor symptom presence and severity in Parkinson’s disease from wrist accelerometer time series using ROCKET and InceptionTime

Introduction

Parkinson’s disease (PD) is a neurodegenerative condition that severely diminishes patient quality of life. The predominant physical manifestations of PD include tremor, bradykinesia, and dyskinesia, among others1. Tremor is a trembling of the affected limb that can occur at rest and during activity. Bradykinesia is slowed movement. Dyskinesia is an involuntary twitching and writhing movement. Because the prevalence of PD increases with age, affecting up to 1 % of individuals over 602, the disease poses a growing economic and social problem3. Although levodopa (L-DOPA) and similar dopaminergic medications can alleviate PD symptoms, they may also cause dyskinesia as a side effect1. Currently, PD symptoms are monitored only every three to 12 months4 during physician visits. However, symptoms may change throughout the day, e.g., as the medication wears off or as its uptake is altered by diet and disease progression. This infrequent monitoring of rapidly changing symptom severity complicates the selection of interventions that balance symptom relief against dyskinesia. Methods for continuously and automatically monitoring PD would therefore improve intervention and patient quality of life.

A frequently studied approach to continuous PD monitoring is to estimate symptom severity using unobtrusive, low-cost wearable sensors, such as smartwatch accelerometers5. These inertial sensors provide acceleration components as a multivariate time series, sometimes complemented by gyroscope and magnetometer measurements. Among the studies that developed models to estimate PD movement symptom severity from wearables, one research strand has focused on explicitly modeling (i.e., handcrafting) a fixed set of features from the inertial sensor data prior to data-driven analysis. For example, Gaussian processes successfully estimated dyskinesia and bradykinesia using wavelet-based features6. Key findings are that modeling PD as a dynamical system, with current symptoms depending on past symptoms, is advantageous and that maximum spectral power is useful for PD detection. Other studies extracted features such as acceleration variance and channel-delay correlation matrix eigenvalues, applying a Gaussian mixture model (GMM) classifier7; this supports the effectiveness of combining temporal and spectral features. In contrast, simpler statistical features, such as the root-mean-square, were successful with a support-vector machine (SVM) for bradykinesia detection8. Additional proposals include signal processing methods, using dominant pole frequency and amplitude for tremor detection or low-pass filters with explicitly modeled features for bradykinesia9. Such features have also been used in conjunction with dynamic neural networks, SVMs, and hidden Markov models to measure tremor and dyskinesia10. In summary, explicit features tend to rely on frequency analysis, but additional modeling of temporal dependencies appears beneficial.

Another research strand employed deep learning for motor symptom severity estimation, leveraging implicit11 feature extraction and eliminating the need for fixed features. Convolutional neural networks (CNNs) have been shown to detect bradykinesia with greater accuracy than fully connected neural networks, SVMs, a rule-based classifier (i.e., PART), and AdaBoost in a study of 10 patients; this method operated directly on the raw data and added noise as data augmentation12.
Deep neural network architectures designed to better model temporal correlations, such as long short-term memory networks (LSTMs), have been used for PD detection from speech signals13 and gait data14. The latter study hypothesizes and shows that LSTMs can learn long-term temporal correlations directly from raw data. CNNs, while typically used for image processing, have also been shown to outperform15 LSTMs or perform comparably16 in human activity recognition from wearable sensors; this is pertinent since activity recognition is akin to PD detection. These empirical results showing that CNNs are well suited to PD motion detection seem plausible given the close mathematical relationship between successful explicit features (wavelet, Fourier transform17, channel-delay correlation) and convolutions.

Nevertheless, all of these approaches suffer from the problems inherent to PD detection from inertial sensor data. The movement patterns characterizing PD are complex because the accelerations caused by the symptoms are superimposed on accelerations from activities of daily living (ADL). These non-symptomatic accelerations vary widely between patients, activities, and the exact sensor positioning. Furthermore, it is challenging to collect clinical data at scale, and datasets are rather small. Thus, continuous symptom monitoring necessitates an approach capable of detecting complex patterns in noisy signals with limited training data. This problem can be reframed as a time series classification task. Two recently proposed methods for time series classification offer new possibilities and address the problem of complex patterns with limited training data from different perspectives: InceptionTime18 and the RandOm Convolutional KErnel Transform (ROCKET)19 have been demonstrated to outperform previous state-of-the-art methods, including CNNs and CNN-LSTM combinations (e.g., TapNet20), on a benchmark consisting of 26 multivariate time series from various domains21.

InceptionTime and ROCKET have been shown to work for PD estimation in detecting overall PD presence from eye tracking position and velocity22, but this differs from our problem of detecting multiple PD motor symptom severities from acceleration data. More closely related work detected freezing of gait (FoG) from full-body accelerometer measurements with InceptionTime and the ROCKET-derived MiniROCKET23,24. InceptionTime has also been used for detecting PD from gait through a force-sensitive insole25, but other approaches, such as ResNet, performed better. Another interesting related work used InceptionTime for detecting PD from voice waveforms with a novel generative-adversarial technique and found that reducing InceptionTime’s model complexity and number of parameters led to higher performance, with InceptionTime ranking second-best behind the ResNet CNN variant26. However, existing work using InceptionTime and ROCKET does not provide multi-level symptom severity classification22,23 and restricts the prediction to the presence or absence of PD. Furthermore, these studies often use either research-grade22,25,26 or full-body23 sensors.

InceptionTime is an ensemble of five Inception networks with different random initializations. Each Inception network comprises at least one Inception module, followed by global average pooling and fully connected layers that generate the class predictions, as shown in Fig. 118. Inception modules transform multivariate time series using convolution filters, maximum pooling, and concatenation. The convolutional filters have varying lengths, allowing the Inception modules to learn long-duration features from time series. By stacking Inception modules, which may include long filters, InceptionTime can have a very large receptive field. Ensembling multiple Inception networks reduces variability for more consistent performance.
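For concreteness, the following is a minimal PyTorch sketch of a single Inception module as just described. The kernel lengths, filter counts, and bottleneck width are illustrative stand-ins for the defaults, and the reference implementation additionally adds residual connections between stacks of modules, so this is a sketch rather than the exact architecture.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """One Inception module: a 1x1 bottleneck, parallel convolutions of varying
    length, and a max-pooling branch, concatenated along the channel axis."""

    def __init__(self, in_channels, n_filters=32, bottleneck_channels=32,
                 kernel_sizes=(39, 19, 9)):  # illustrative values, not the defaults
        super().__init__()
        self.bottleneck = nn.Conv1d(in_channels, bottleneck_channels, 1, bias=False)
        self.convs = nn.ModuleList(
            nn.Conv1d(bottleneck_channels, n_filters, k, padding="same", bias=False)
            for k in kernel_sizes)
        self.maxpool = nn.MaxPool1d(3, stride=1, padding=1)   # parallel branch
        self.pool_conv = nn.Conv1d(in_channels, n_filters, 1, bias=False)
        self.bn = nn.BatchNorm1d(n_filters * (len(kernel_sizes) + 1))

    def forward(self, x):            # x: (batch, channels, time)
        z = self.bottleneck(x)
        branches = [conv(z) for conv in self.convs]
        branches.append(self.pool_conv(self.maxpool(x)))
        return torch.relu(self.bn(torch.cat(branches, dim=1)))

module = InceptionModule(in_channels=3)          # e.g., tri-axial acceleration
out = module(torch.randn(8, 3, 1500))            # -> shape (8, 128, 1500)
```

Stacking several such modules, followed by global average pooling over time and a fully connected layer, yields the per-window class predictions.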
We hypothesize that this expansive receptive field will enable InceptionTime to effectively learn the complex patterns present in PD time series data; potentially, it may even compete with LSTMs at learning long-term features. The parameter sharing inherent to convolutions should also suit scenarios with limited data.

Fig. 1 (a) Simplified depiction (inspired by the seminal work18) of an Inception network consisting of Inception modules, average pooling, and a fully connected neural network to generate a class prediction from an input multivariate time series (MTS). The network’s depth is six, equal to the number of Inception modules. (b) The Inception module’s bottleneck first reduces the input MTS to a univariate time series, and then convolutional filters are applied along the time axis. Additionally, the result of maximum pooling and a bottleneck is concatenated directly to the output. In this example, the Inception module takes a three-dimensional MTS as input and outputs a four-dimensional MTS. The module has three filters and a filter length of l.

ROCKET uses convolutional kernels similar to those in CNNs19. Instead of learning the kernel parameters from the training data, ROCKET generates thousands of random kernels to create features. It then learns only a small set of linear weights from the training data using ridge regression, as shown in Fig. 2. Using random rather than trained kernels makes ROCKET well-suited for PD symptom severity estimation with limited training data. It is well known that variance grows and bias shrinks as the number of model parameters increases; random kernels drastically reduce the number of trainable parameters and may thus yield a more favorable ratio of parameters to data points. As PD affects frequency-related features, we expect that the convolutions intrinsic to both methods will facilitate PD severity estimation.

Fig. 2 Simplified depiction of ROCKET19. Random convolutional kernels are applied to every time series of the input MTS (shown for the first time series only), yielding an intermediate time series. Proportion of positive values (PPV) and maximum pooling extract two scalar features per intermediate time series, which are inputs to a ridge classifier. The depicted example has n random kernels and produces 2n random scalar features per input dimension.
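For concreteness, the following is a minimal NumPy sketch of the ROCKET transform for a univariate series; the reference implementation additionally randomizes padding and draws random channel combinations for multivariate input, so the details here are illustrative rather than exact.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifierCV

rng = np.random.default_rng(0)

def sample_kernels(input_len, n_kernels):
    """Draw random kernels: length, mean-centered Gaussian weights, bias,
    and an exponentially sampled dilation, loosely following ROCKET."""
    kernels = []
    for _ in range(n_kernels):
        length = int(rng.choice([7, 9, 11]))
        weights = rng.normal(size=length)
        weights -= weights.mean()
        bias = rng.uniform(-1.0, 1.0)
        max_exp = np.log2((input_len - 1) / (length - 1))
        dilation = int(2 ** rng.uniform(0.0, max_exp))
        kernels.append((weights, bias, dilation))
    return kernels

def transform(x, kernels):
    """Convolve a univariate series with every kernel and pool each output
    into two features: the proportion of positive values (PPV) and the maximum."""
    features = []
    for weights, bias, dilation in kernels:
        idx = np.arange(len(weights)) * dilation
        conv = np.array([x[i + idx] @ weights + bias
                         for i in range(len(x) - idx[-1])])
        features += [(conv > 0).mean(), conv.max()]
    return np.array(features)

X = rng.standard_normal((20, 150))                       # 20 toy univariate series
y = rng.integers(0, 2, size=20)
kernels = sample_kernels(input_len=150, n_kernels=200)   # 10 000 in real ROCKET
Z = np.stack([transform(x, kernels) for x in X])         # two features per kernel
clf = RidgeClassifierCV(alphas=np.logspace(-3, 3, 10)).fit(Z, y)
```

Only the ridge classifier at the end is fitted to the training data; the kernels themselves are never trained.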
Despite InceptionTime’s capacity to capture complex patterns and ROCKET’s effectiveness with limited data, to the best of our knowledge, neither approach has been studied for detecting tremor, bradykinesia, and dyskinesia from wearable wrist accelerometers. The present work addresses this gap by systematically evaluating the performance, calibration, and hyperparameters of InceptionTime and ROCKET for PD motor symptom severity estimation. Our rigorous analysis highlights their advantages and disadvantages depending on the symptom of interest, the desired robustness, and the target outcomes. We demonstrate that both InceptionTime and ROCKET predict symptoms from accelerometer time series during ADL more effectively than explicitly modeled wavelet-derived features. Although these methods do not reach clinically relevant performance, this study reveals the strengths and limitations of two recently introduced time series classification approaches, ROCKET and InceptionTime, for PD symptom severity estimation. Our results show that the presented methods, with their default hyperparameters, can serve as a baseline for future research on PD symptom severity estimation.

Methods

We compare the InceptionTime ensemble and ROCKET against a baseline feature-based classifier for estimating the severity or presence/absence of tremor, bradykinesia, and dyskinesia from the patient’s smartwatch.

Data

The dataset is a subset of the publicly available 2021 Michael J. Fox Foundation (MJFF) Levodopa Response Study27,28 repository.

Participants and demographics

The MJFF’s inclusion criteria were community-dwelling, PD diagnosis, 30 to 80 years of age, taking L-DOPA at the time of data collection, self-reported motor fluctuations, self-reported dyskinesia (at least mild), and the ability to operate a smartphone27. Patients with other severe neurological issues or deep brain stimulation were excluded27. The selected subset includes data from 27 patients. Patients were 46 to 80 years of age (mean 62.9 yr, S.D. 8.63 yr) with Hoehn and Yahr scores of II (21 patients), III (3 patients), IV (2 patients), and unknown (1 patient). 70.4 % were male and 29.6 % female. PD motor symptoms affected 7.41 % of patients equally on both sides of the body, 66.7 % more on the right, and the rest more on the left. 96.3 % were right-handed and 3.70 % left-handed.

Collected measurements

Patients were monitored in a clinic on days one and four of the study. On day one, patients arrived in an on-medication state, having taken L-DOPA on their regular schedule, whereas on day four, patients arrived at least 12 hours after their last L-DOPA dose. The 27 patients wore a GENEActiv smartwatch on the most affected limb as they performed a battery of pre-defined motor tasks (such as standing, walking, and typing) in a laboratory setting27. The test battery was repeated six to eight times. On day one, the median time between the last L-DOPA dose and the motor tasks was 210 min (IQR 32.5 min–257 min). The GENEActiv measures acceleration in three dimensions at 50.0 Hz, producing multiple multivariate time series per patient and task. These time series vary in duration with the time taken for a task and have a mean duration of 29.2 s (S.D. 11.6 s).

Annotation

The Levodopa Response Study used several Movement Disorders Society Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) certified clinical researchers during the in-clinic monitoring. A given task for a given patient was rated according to the MDS-UPDRS by a single specialist27. The dataset thus has symptom severity annotations for each acceleration time series. Presence/absence of bradykinesia and dyskinesia are boolean, and tremor is rated on an ordinal scale from zero (no symptoms) to four (severe symptoms)27. We remove data points with missing or implausible annotations from the dataset. In total, the dataset contains 162.0 h of annotated GENEActiv acceleration measurements.

Evaluation design

Comparing the machine learning approaches fairly requires appropriate data handling and the selection of suitable evaluation metrics.

Data split

We split the pre-processed data into disjoint training, validation, and test sets, ensuring that no patient appears in multiple sets. Through stratification, each split resembles the class distribution of the overall dataset. Table 1 shows the split between training, validation, and test data. The total duration refers to the sum of the durations of all tasks with a given symptom annotation. There is a substantial imbalance in the class labels, with more instances of no (or mild) symptoms than of strong symptoms.

Table 1 Split between training, validation, and test data. Total duration is the sum of the durations of all time series for a given symptom severity. The number of patients refers to the count of patients with at least one task rated at the given symptom severity.

For hyperparameter tuning, we hold out the test set for final evaluation and apply grouped stratified cross-validation on the remaining data29,30. Specifically, we create five folds such that class proportions remain consistent across folds and no patient appears in multiple folds. Five cross-validation folds lead to approximately 80 % training data and 20 % validation data for each model. Scores are averaged across these folds.
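As an illustration, such a patient-grouped, stratified five-fold split can be realized with scikit-learn’s StratifiedGroupKFold; the array names and shapes below are hypothetical and not taken from our pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(0)
windows = rng.standard_normal((1000, 3, 1500))   # hypothetical 30 s windows (3 axes, 50 Hz)
labels = rng.integers(0, 2, size=1000)           # per-window symptom annotation
patient_ids = rng.integers(0, 27, size=1000)     # grouping variable

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(windows, labels, groups=patient_ids):
    # No patient appears in both folds, and class proportions stay similar.
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[val_idx])
```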
Statistical analysis

Throughout the present work, we use the metrics of balanced accuracy (BA) and average precision (AP). Balanced accuracy is the mean of the per-class recalls31. AP (Eq. 1) is the mean of the precision P at each threshold i, weighted by the change in recall R32. Averaging the one-vs.-rest per-class AP over all classes yields the mean average precision (mAP).

$$\text{AP} = \sum_{i} (R_i - R_{i-1}) P_i$$ (1)

In addition, we use accuracy, the area under the receiver operating characteristic curve (AUROC), and the F1-score (Eq. 2) for comparison with related studies.

$$\text{F1} = 2 \cdot \frac{P \cdot R}{P + R}$$ (2)

For the multiclass task of tremor prediction, F1 is macro-averaged across all classes in a one-vs.-rest fashion. Multiclass AUROC is calculated by averaging the scores of all pairwise class combinations. This one-vs.-one AUROC is less sensitive to class imbalance than a one-vs.-rest calculation33. ROCKET provides class likelihoods rather than the probabilities required for AUROC calculation and calibration curve analysis. Softmax (Eq. 3) constrains the ROCKET class likelihood vector \(\vec{x} = (x_1, x_2, \ldots, x_n)\) to the interval [0, 1], thereby approximating probabilities.

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{k=1}^{n} e^{x_k}}$$ (3)

Tremor is assessed by clinicians on an ordinal scale, and strongly misclassifying tremor severity is worse than a prediction that is off by one class. To account for class imbalance and the degree of misclassification, we use the macro-averaged mean absolute error (MAMAE), the macro-average of the per-class mean absolute errors34. MAMAE is calculated according to Eq. 4, where N is the number of test samples, C is the set of groundtruth classes, \(y_i\) is the groundtruth class of the i-th test sample, and \(\hat{y}_i\) is the predicted class.

$$\text{MAMAE} = \frac{1}{|C|} \sum_{c \in C} \frac{\sum_{i=1}^{N} \delta_{c, y_i} \left| \hat{y}_i - y_i \right|}{\sum_{i=1}^{N} \delta_{c, y_i}}, \qquad \delta_{i,j} = \begin{cases} 0 & \text{if } i \ne j \\ 1 & \text{if } i = j \end{cases}$$ (4)
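For illustration, the metrics above map onto scikit-learn and SciPy as sketched below, with MAMAE (Eq. 4) implemented manually; the data are toy values.

```python
import numpy as np
from scipy.special import softmax
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             f1_score, roc_auc_score)

y_true = np.array([0, 1, 2, 2, 3, 0, 1])        # toy groundtruth tremor severities
decision = np.random.default_rng(0).standard_normal((7, 4))  # e.g., ROCKET likelihoods
proba = softmax(decision, axis=1)               # Eq. 3: likelihoods -> probabilities
y_pred = proba.argmax(axis=1)

ba = balanced_accuracy_score(y_true, y_pred)             # mean of per-class recalls
f1 = f1_score(y_true, y_pred, average="macro")           # Eq. 2, macro-averaged
auroc = roc_auc_score(y_true, proba, multi_class="ovo")  # pairwise one-vs.-one AUROC
onehot = np.eye(proba.shape[1])[y_true]                  # one-vs.-rest targets
map_score = average_precision_score(onehot, proba, average="macro")  # mAP (Eq. 1)

def mamae(y_true, y_pred):
    """Macro-averaged mean absolute error (Eq. 4): average the per-class MAE."""
    classes = np.unique(y_true)
    return np.mean([np.abs(y_pred[y_true == c] - c).mean() for c in classes])
```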
To judge how well a model’s predicted probability matches the actual occurrence probability, reliability plots (also known as calibration curves) can be used. To compare the classifiers’ predicted probability to the actual probability, the latter must be determined; this is non-trivial because the presence/absence of a symptom is binary. Thus, patients with similar predicted symptom probabilities must be grouped together. We use the SmoothECE approach, which avoids the pitfalls of arbitrary binning by predicted probability or manually chosen smoothing parameters35. SmoothECE regresses the expected value of the conditional relative occurrence frequency E[y|f] against the predicted probability f via Gaussian kernel smoothing35. The SmoothECE metric reflects the deviation from an ideal calibration curve, with lower values indicating better calibration. Crucially, the kernel bandwidth is determined automatically, eliminating all parameter tuning and enhancing comparability with future studies.

Comparing deep learning models with statistical rigor is challenging because conventional statistical tests often assume normally distributed results, which may not hold36. Instead of a conventional test, we apply almost stochastic order (ASO)36,37 to the scores of multiple training runs for each model and run 1000 bootstrap iterations. ASO extends the concept of stochastic dominance, whereby an algorithm A is stochastically dominant over an algorithm B if and only if the empirical cumulative distribution function (CDF) of A’s scores is always greater than the CDF of B’s scores36. ASO allows stochastic dominance, which is too restrictive for practical purposes, to be violated up to a degree \(\epsilon_\text{min}\), estimating the upper bound of \(\epsilon_\text{min}\) via bootstrapping37. We train each model ten times and set the significance level \(\alpha = 0.05\); ten samples are reportedly sufficient for statistical comparison in nearly all cases. The present work henceforth considers A stochastically dominant over B for \(\epsilon_\text{min} < 0.2\) with 1000 bootstrap iterations, as suggested by37. Bootstrap power analysis of model scores38 with 5000 iterations ensures a power of at least 0.837. We perform Bonferroni correction for a total of 31 comparisons (based on four models, three symptoms, and two metrics; accounting for symmetry and insufficient power).

To explain misclassifications, we examine the distribution of performed motor tasks among false positives and false negatives. The expected count \(E_{t,\hat{y},y}\) for a motor task \(t \in T\), predicted class \(\hat{y}\), and actual symptom severity class y is computed from the observed counts O according to Eq. 5, where C denotes the set of severity levels for a given symptom:

$$E_{t,\hat{y},y} = \frac{\sum_{i \in C} O_{t,i,y}}{\sum_{k \in T} \sum_{i \in C} O_{k,i,y}} \sum_{k \in T} O_{k,\hat{y},y}$$ (5)

If the model’s misclassifications were uniformly distributed across motor tasks, the observed and expected counts would closely match; deviations indicate task-specific error patterns. Expected and observed frequencies are calculated across all ten training repetitions. To address the class imbalance arising from the limited number of strong tremor instances, tremor severity is binarized into present (severity \(\ge 1\)) and absent (severity \(= 0\)). To enhance statistical robustness, differences between observed and expected counts for a given groundtruth severity and motor task are reported only if the difference is at least 40 %, with both observed and expected counts comprising at least ten instances.
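The expected counts of Eq. 5 follow from a task × predicted × actual contingency table, e.g., as in the following NumPy sketch (the array contents and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_classes = 12, 2                       # e.g., binarized tremor severity
O = rng.integers(1, 50, size=(n_tasks, n_classes, n_classes))  # O[t, predicted, actual]

def expected_counts(O):
    """Eq. 5: the task's share of all samples with actual class y, multiplied
    by the total number of predictions yhat among samples with actual class y."""
    task_share = O.sum(axis=1) / O.sum(axis=(0, 1))   # shape (T, |C|), indexed [t, y]
    totals = O.sum(axis=0)                            # shape (|C|, |C|), indexed [yhat, y]
    return task_share[:, None, :] * totals[None, :, :]

E = expected_counts(O)
# Example: flag tasks whose false positives (yhat=1, y=0) are overrepresented
# by at least 40 %, with at least ten observed and expected instances.
mask = (O[:, 1, 0] >= 1.4 * E[:, 1, 0]) & (O[:, 1, 0] >= 10) & (E[:, 1, 0] >= 10)
print(np.flatnonzero(mask))
```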
Welch’s method39 is employed for power spectral density (PSD) estimation, enabling the identification of dominant motion frequencies within the acceleration time series for each motor task. The complexity of each time series is quantified using sample entropy, which is particularly suited to short and noisy physiological data40. Higher sample entropy values indicate greater complexity or randomness, whereas lower values suggest simpler, more regular patterns. Finally, the energy of a time series is proportional to the sum of its squared amplitudes, and its power increases with its standard deviation. PSD, standard deviation, and sample entropy are calculated from the magnitude of the acceleration time series.

Model development

We aim to optimize InceptionTime and ROCKET PD prediction through hyperparameter optimization. Two InceptionTime models are developed: one with the default hyperparameters and one with extensive tuning. To evaluate the hypothesis that ROCKET and InceptionTime will implicitly learn good features, we also develop a baseline classifier using explicit features. Finally, the developed models are trained on the combined training and validation dataset. The selected time series classification approaches require equal-length time series, but raw GENEActiv measurements are of unequal length. Thus, we normalize the data length using a forward-sliding window approach with two hyperparameters: window length and overlap proportion between windows. We use windows of 30 s length with 50 % overlap unless stated otherwise. With a fixed window length and overlap, longer series generate more training samples. Preliminary tests indicated improved performance with longer windows and greater overlap, with performance plateauing at 50 % overlap. Current research lacks consensus on the optimal window length8,9,10,12,41. Since all examples of tremor severity four in the training data have durations of less than 30 s, they are not used for training any of the models. Consequently, the learned models cannot predict tremor severity four, and we therefore combine severity classes three and four into a single class, referred to as “3–4”.

InceptionTime hyperparameter tuning

InceptionTime has several hyperparameters18: filter size is the length of the longest one-dimensional convolution filter in the Inception modules (e.g., filter size l in Fig. 1). The number of filters refers to the number of filters each Inception module contains (e.g., three filters in Fig. 1). For example, a filter size of 64 and four filters will result in filters of length 64, 32, 16, and 8 for each Inception module. Depth is the number of stacked Inception modules (e.g., depth 6 in Fig. 1). In addition to the filter size, the number of filters, and the depth, we include the window length in the hyperparameter search. We activate residual connections because they improve accuracy, and we retain the default batch size of 64, as batch size does not affect accuracy18.

Because InceptionTime was originally developed on the University of California, Riverside (UCR) archive18, which contains many different time series (including human activity recognition42), the default architecture might already transfer to use cases such as PD symptom severity estimation. We therefore first train InceptionTime with the default hyperparameters18. Next, the selected hyperparameters are optimized to maximize scores on at least one metric. We employ random search due to its efficiency compared to grid search, particularly when certain hyperparameters are more important than others43. The window length is a real number from 3 s to 30 s with a fixed 50 % overlap. The filter length is an integer from 8 to 255. The depth is an integer from 1 to 11. The number of filters is a power of two from \(2^1 = 2\) to \(2^6 = 64\). The window length is sampled uniformly with replacement; all other hyperparameters are sampled uniformly without replacement. We perform 60 random search trials, using the cross-validation described above to split the data into training and validation sets. With five folds per trial, we train 300 models for each of tremor, dyskinesia, and bradykinesia, resulting in as many as 900 models being trained. Training is terminated after 600 epochs, based on preliminary tests indicating that performance plateaus or declines beyond that point; we expect that this incentivizes architectures that are invariant to the epoch count.
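This search space can be encoded, for example, with scikit-learn’s ParameterSampler, as sketched below; the distributions approximate the stated ranges, and the exact with/without-replacement sampling described above is not reproduced.

```python
from sklearn.model_selection import ParameterSampler
from scipy.stats import randint, uniform

search_space = {
    "window_length_s": uniform(3, 27),     # real-valued, 3 s to 30 s
    "filter_length": randint(8, 256),      # integer, 8 to 255
    "depth": randint(1, 12),               # integer, 1 to 11
    "n_filters": [2, 4, 8, 16, 32, 64],    # powers of two
}
trials = list(ParameterSampler(search_space, n_iter=60, random_state=0))
for params in trials:
    print(params)  # e.g., train one InceptionTime per cross-validation fold
```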
ROCKET hyperparameter selection

ROCKET hyperparameters include the number of convolution kernels, the classifier choice (ridge or logistic regression), and the window length. This work retains the default random kernel parameters (10 000 kernels, see Fig. 2), which “form an intrinsic part of ROCKET” and “do not need to be ‘tuned’ for new datasets” because they are optimized for a variety of datasets19. We opt for the ridge classifier and tune its regularization strength via cross-validation.

Wavelet-based feature engineering and classifier development

The baseline classifier is a multi-layer perceptron (MLP) applied to 70 wavelet-based features. Our baseline deliberately combines a simple, generic classifier with wavelet features that capture domain-specific knowledge, as a counterpoint to the other black-box methods with learned features. As InceptionTime uses an MLP and ROCKET uses a ridge classifier in the final stage, pairing wavelets with an MLP allows us to compare the efficacy of wavelet-derived features, identified as state-of-the-art for PD classification in prior studies6,44,45,46, with InceptionTime’s and ROCKET’s implicit feature extraction. More concretely, wavelet features encode the domain knowledge that different symptom severities affect the wrist motion frequency, and they have been used in prior studies together with Gaussian processes and SVMs. We derive the 70 features by computing the root-mean-square, standard deviation, maximum, kurtosis, skew, power spectral density maximum, and power spectral density minimum for nine levels of wavelet decomposition and for the original signal6. In contrast to the convolutions in the first stages of InceptionTime and ROCKET, an MLP is not affected by the order of the features. The MLP is realized with two hidden layers of 128 sigmoid neurons each, as a grid search on PD data revealed this topology to perform best. We minimize the categorical cross-entropy loss with the Adam optimizer47. The categorical cross-entropy loss for an N-class prediction problem is calculated according to Eq. 6 from the probability prediction \(\hat{y}\) of the classifier and the one-hot encoded labels y.

$$L = - \sum_{i=1}^{N} y_i \cdot \log \hat{y}_i$$ (6)
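A sketch of the 70-dimensional wavelet feature extraction follows; the wavelet family and the use of Welch’s method for the spectral statistics are our assumptions for illustration and may differ from the implementation in the referenced work6.

```python
import numpy as np
import pywt                      # PyWavelets
from scipy.signal import welch
from scipy.stats import kurtosis, skew

def seven_stats(x, fs=50.0):
    """Seven statistics per (sub-)signal: RMS, S.D., max, kurtosis, skew,
    and the maximum and minimum of the Welch power spectral density."""
    _, psd = welch(x, fs=fs, nperseg=min(256, len(x)))
    return [np.sqrt(np.mean(x ** 2)), np.std(x), np.max(x),
            kurtosis(x), skew(x), psd.max(), psd.min()]

def wavelet_features(signal, wavelet="haar", levels=9):  # wavelet family assumed
    """7 statistics x (original signal + 9 wavelet detail levels) = 70 features."""
    coeffs = pywt.wavedec(signal, wavelet, level=levels)  # [cA9, cD9, ..., cD1]
    features = seven_stats(signal)
    for detail in coeffs[1:]:                             # the nine detail levels
        features += seven_stats(detail)
    return np.array(features)

window = np.random.default_rng(0).standard_normal(1500)  # 30 s magnitude at 50 Hz
assert wavelet_features(window).shape == (70,)
```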
Results

This section first describes the results of hyperparameter tuning and the models selected for the final evaluation. We then report the results of the final models on the test dataset.

Selected models and hyperparameters

During the hyperparameter tuning process, the test set is not used.

InceptionTime

As mAP and AP are positively correlated with balanced accuracy, AP is used for model selection henceforth. For tremor, the mAP has a moderate positive correlation with the window length, as shown in Fig. 3. The bradykinesia and dyskinesia AP have a weak positive correlation with the window length. All other hyperparameters have a negligible impact on the mAP or AP, except for a very slight decrease in mAP with increasing filter length for tremor.

Fig. 3 Average precision (AP) and mean AP (mAP) in relation to InceptionTime hyperparameters. Network depth, filter length, and the number of filters do not affect the AP. The mAP and AP increase with increasing window length. Spearman’s rank correlation coefficient is denoted by \(r_s\). The arrow denotes the best hyperparameters; the scores and standard deviations are in the rightmost column.

Note that the AP scores for dyskinesia are low, and the balanced accuracy is often close to the 0.5 expected from random classifiers. The standard deviations of mAP or AP and balanced accuracy are considerable, and the interval of ± one S.D. around the scores would encompass many of the architectures with lower mean scores.

ROCKET

Longer windows should lead to higher performance because ROCKET excels with little training data19 while benefiting from the longer input series during inference, as the following experiments confirm: for tremor, we find a mAP of 0.404 with 5 s windows, 0.457 with 15 s, and 0.565 with 30 s. A similar pattern is observed for bradykinesia (AP of 0.681 with 5 s, 0.712 with 15 s, and 0.727 with 30 s) and for dyskinesia (AP of 0.119 with 5 s, 0.100 with 15 s, and 0.140 with 30 s). Thus, we use 30 s windows.

Final model performance

InceptionTime and ROCKET are always stochastically dominant over the wavelet MLP. In none of the studied cases does hyperparameter tuning yield performance superior to the default InceptionTime hyperparameters. The performance of InceptionTime relative to ROCKET varies by symptom and metric. Statistical power exceeds 0.8 unless stated otherwise.

Tremor

All classifiers perform better than random classifiers, as shown in Fig. 4. Hyperparameter tuning does not improve the InceptionTime scores. Notably, the variability between training runs is very high for InceptionTime, especially after hyperparameter tuning, but low for ROCKET.

Fig. 4 Comparison of all classifiers for smartwatch wrist sensor data. Each classifier is trained and evaluated ten times. The dashed line represents the scores expected from random classifiers. The whiskers extend to the furthest data point within 1.5 interquartile ranges of the first and third quartiles. Diamonds represent outliers beyond the whiskers.

ROCKET scores higher than default InceptionTime (\(\epsilon_\text{min,mAP} = 0.0067\), \(\epsilon_\text{min,BA} = 0.0255\)), the hyperparameter-tuned InceptionTime (\(\epsilon_\text{min,mAP} = 0.0615\), \(\epsilon_\text{min,BA} = 0.000104\)), and the wavelet-feature MLP (\(\epsilon_\text{min,mAP} = 0.0\), \(\epsilon_\text{min,BA} = 0.0\)).
InceptionTime with default hyperparameters is stochastically dominant over tuned InceptionTime in terms of balanced accuracy (\(\epsilon_\text{min,BA} = 0.0\)), though not mAP (\(\epsilon_\text{min,mAP} = 0.984\)), and over the wavelet-based feature MLP (\(\epsilon_\text{min,mAP} = 0.0\), \(\epsilon_\text{min,BA} = 0.0\)). InceptionTime with optimized hyperparameters is stochastically dominant over the wavelet MLP (\(\epsilon_\text{min,mAP} = 0.0\), \(\epsilon_\text{min,BA} = 0.0\)). The confusion matrix in Fig. 5 highlights the class imbalance, which leads all models to underestimate tremor severity. ROCKET tends to underestimate tremor more than InceptionTime but is also more accurate for zero or weak tremor. ROCKET displays the lowest MAMAE (mean 0.605, S.D. 0.0598), followed by default InceptionTime (mean 0.770, S.D. 0.0758), tuned InceptionTime (mean 0.800, S.D. 0.112), and the wavelet MLP (mean 0.947, S.D. 0.0478).

Fig. 5 Confusion matrix for the predictions of all classifiers based on GENEActiv smartwatch data. Out of the ten trained and evaluated classifiers, the fourth-best one according to mAP is selected, approximating the median performance.

Fig. 6 Reliability diagrams for all binary classifiers as a smoothed scatterplot of the groundtruth label \(E\left[ y|f\right]\) vs. the predicted probability f. The red line represents the smoothed probabilities, with thicker red lines representing a higher sample density. The shaded area represents the 95 % confidence interval, with darker shades representing a higher sample density. An upward/downward tick on the x-axis signifies positive/negative groundtruth labels for a predicted f.

Bradykinesia

All classifiers substantially outperform random classifiers. Default InceptionTime produces the best AP scores for bradykinesia prediction (Fig. 4). However, the variability of the InceptionTime scores (tuned and default) is much larger than that of the other models. InceptionTime with default hyperparameters is stochastically dominant over ROCKET (\(\epsilon_\text{min,AP} = 0.0\), \(\epsilon_\text{min,BA} = 0.0\)) and the wavelet MLP (\(\epsilon_\text{min,AP} = 0.0\), \(\epsilon_\text{min,BA} = 0.0\)). InceptionTime with tuned hyperparameters scores higher than ROCKET (\(\epsilon_\text{min,AP} = 0.0\), \(\epsilon_\text{min,BA} = 0.0887\)) and the wavelet MLP (\(\epsilon_\text{min,AP} = 0.0\), \(\epsilon_\text{min,BA} = 0.0\)). ASO provides no evidence for significant stochastic dominance of default InceptionTime over tuned InceptionTime (\(\epsilon_\text{min,AP} = 0.279\), \(\epsilon_\text{min,BA} = 0.770\)). A higher predicted probability tends to imply a higher chance of actual bradykinesia, as the calibration plots in Fig. 6 show. ROCKET with softmax has an even distribution of scores, while the other classifiers tend to predict either very high or very low scores (note the line thinning around \(f = 0\)). False positives are overrepresented (observed/expected counts) during drawing for InceptionTime (458/168.7) and ROCKET (550/150.9). InceptionTime also shows a higher-than-expected frequency of walking down a passage (279/178.8) and walking straight (131/93.2) among its false positives.

Dyskinesia

The three AP scores barely exceed those expected from random classifiers (Fig. 4). ROCKET has a substantially higher AP than the other classifiers. The InceptionTime classifiers achieve higher AP than the wavelet MLP. The InceptionTime models have the highest balanced accuracy, followed by ROCKET and then the wavelet MLP.
The default InceptionTime architecture shows high variability in terms of both AP and balanced accuracy. However, the ten samples yield a power of only 0.294 for the default InceptionTime dyskinesia predictions and 0.409 for the tuned InceptionTime dyskinesia predictions, which is too low to support statements about significance. ROCKET is stochastically dominant over the wavelet MLP (\(\epsilon_\text{min,AP} = 0.0\), \(\epsilon_\text{min,BA} = 0.000914\)). Fig. 6 shows that all classifiers predict very low dyskinesia probabilities, although ROCKET again has the most homogeneous prediction probability distribution. Due to the low accuracy in dyskinesia detection, no further statements can be made about the quality of the calibration.

Misclassification analysis

Misclassifications are analyzed for ROCKET and for InceptionTime with the default hyperparameters, because InceptionTime with tuned and with default hyperparameters shows similar misclassification patterns. The magnitude of the test dataset time series has a mean sample entropy of 1.07 (S.D. 0.507, range 0.0187–2.25) and a mean standard deviation of \(1.22\ \text{m s}^{-2}\) (S.D. \(0.925\ \text{m s}^{-2}\), range \(0.0605\)–\(4.29\ \text{m s}^{-2}\)).

InceptionTime tremor false positives are strongly overrepresented (observed/expected samples) during nuts-and-bolts assembly (216/100) and drawing (200/135). ROCKET false positives are strongly overrepresented during drinking (191/73.3), organizing papers (165/80.6), drawing (143/80.6), and nuts-and-bolts assembly (99/59.9). Notably, for these tasks, the PSD in the 4 Hz to 6 Hz range is unaffected by the presence or absence of tremor. In contrast, for the remaining tasks, patients with tremor show a higher PSD in the 4 Hz to 6 Hz band than patients without tremor. False-negative tremor estimation by InceptionTime occurs more frequently than expected when walking a narrow passage (135/70.04) and drawing (188/70.4). False-negative tremor estimation by ROCKET is more frequent than expected when walking a narrow passage (148/79.5) and walking straight (154/90.3). Drawing has the highest complexity of all the tasks (mean sample entropy 1.86), while walking straight and walking down a passage involve higher-than-average amounts of motion (mean standard deviations \(2.37\ \text{m s}^{-2}\) and \(1.71\ \text{m s}^{-2}\)).

When a patient has bradykinesia according to the groundtruth, the standard deviation of the motion time series magnitude is higher on average, indicating a greater degree of motion. We find that true negative examples of bradykinesia time series have a higher average standard deviation (ROCKET \(1.43\ \text{m s}^{-2}\), InceptionTime \(1.40\ \text{m s}^{-2}\), tuned InceptionTime \(1.24\ \text{m s}^{-2}\)) than false positives (ROCKET \(0.809\ \text{m s}^{-2}\), InceptionTime \(0.962\ \text{m s}^{-2}\), tuned InceptionTime \(1.13\ \text{m s}^{-2}\)).
Similarly, true positives have a higher average standard deviation (ROCKET \(2.01\ \text{m s}^{-2}\), InceptionTime \(1.96\ \text{m s}^{-2}\)) than false negatives (ROCKET \(1.52\ \text{m s}^{-2}\), InceptionTime \(1.58\ \text{m s}^{-2}\)). As the performance of the dyskinesia classifiers is only slightly better than random, dyskinesia misclassifications are expected to be largely random as well, and no meaningful misclassification patterns can be identified.

Discussion

The results show that the time series classification approaches InceptionTime and ROCKET can learn to estimate PD symptom severity from wrist accelerometer data during ADL with performance exceeding that of an MLP with explicitly modeled wavelet features.

Dyskinesia is the most difficult PD symptom to detect in ADL using wearable accelerometers and our studied approaches. Our high accuracy is misleading due to class imbalance, and we only slightly outperform random classifiers. In contrast, all approaches substantially outperform random classifiers for tremor and bradykinesia. ROCKET substantially outperforms the other classifiers in terms of AP for dyskinesia estimation and has an acceptable margin over the random-classifier baseline. ROCKET’s higher performance could be attributed to random kernels performing better than learned kernels on smaller datasets19. Furthermore, dyskinesia has a movement frequency slower than tremor and faster than bradykinesia, causing a greater overlap between dyskinesia and ADL in the frequency domain; this characteristic may also have contributed to the lower performance. Compared with the other symptoms, dyskinesia is highly dependent on context because it describes the patient moving when he or she does not want to move. Our dataset combines various ADL, including high-movement (typing) and low-movement (sitting) activities; accurate dyskinesia presence estimation would require “detecting” the current activity and assessing movement intensity relative to that activity. Existing research has demonstrated better dyskinesia estimation, especially when adding gyroscopes48 and multiple sensor locations49, potentially even electromyography10. Using multiple sensors can aid in distinguishing between symptomatic movements and ADL. The sensors in the present work can only measure translational acceleration, but the twisting movements of dyskinesia are primarily rotational rather than translational: a rapidly rotating but relatively stationary wrist can occur in dyskinesia and will only produce a small signal in the acceleration time series. Thus, gyroscopes may be beneficial when the goal is to improve clinical performance. If dyskinesia estimation is indeed particularly challenging, deep learning models might require more training data to achieve reasonable performance, which would explain why InceptionTime lags behind ROCKET.

ROCKET and InceptionTime have similar performance for tremor estimation and estimate tremor severity with an average error of less than one severity class (see the MAMAE results). In contrast, the InceptionTime classifiers are substantially better at bradykinesia estimation. InceptionTime tends to demonstrate the highest balanced accuracy, suggesting that it might be slightly superior to ROCKET when absolute predictions are more important than good calibration. ROCKET and InceptionTime perform similarly on a variety of time series classification benchmarks19,21, which is consistent with their similar results for PD motor symptom estimation.
InceptionTime demonstrates greater score variability across training runs than the other approaches, independent of whether automated hyperparameter tuning has been performed (see the quantiles in Fig. 4). This variability aligns with the findings of the InceptionTime authors, who made it an ensemble of Inception networks due to the high variability of a single Inception network18. However, we show that, for our problem, even ensembling cannot fully mitigate the variability of Inception networks for time series classification. InceptionTime’s sensitivity to random initialization may also explain the futility of automated hyperparameter optimization: if InceptionTime’s scores hinge on thousands of trainable parameters whose random initialization training cannot fully compensate for, the impact of tuning a few additional hyperparameters may be minimal. This is evidenced by the large variation of the model scores in the cross-validation. ROCKET’s scores are remarkably stable, especially when compared with InceptionTime.

The misclassification patterns of InceptionTime and ROCKET are similar. Notably, false-positive tremor classifications occur more frequently than expected during tasks requiring fine motor coordination (drawing/writing, assembling nuts-and-bolts, organizing papers, and drinking). Generally, tremor is characterized by motions between 4 Hz and 6 Hz50. During fine-motor tasks, patients without tremor display more motion in this frequency band than those with tremor. This suggests that tremor may hinder patients from performing the requested task, reducing the overall motion during task execution and thereby increasing the risk of false-positive tremor detection during fine motor tasks. For bradykinesia, we find that misclassifications are associated with lower movement intensity. Given that bradykinesia is the unintended slowness of movement, our classifiers may fail to detect bradykinesia (false negatives) when the bradykinetic patient moves less overall, because this may look like intentionally slow movement. Conversely, the models may misclassify intentionally slow movement as bradykinesia (false positives).

Models for all symptoms demonstrate improved performance with longer window lengths up to 30 s (see Fig. 3). In our case, longer time series may give the classifiers a better chance of separating the superimposed PD symptoms from the ADL. Using random slices with a window length of 90 % of the original time-series length has been shown to improve deep learning time series classification51, but we found no research comparing this with windows shorter than 90 %. In the context of PD deep learning, some researchers have used 30 s windows for tremor and dyskinesia10, and others have used 5 s windows for bradykinesia12. None of these studies provide results for experiments on window length.

Extensive hyperparameter tuning via random search does not improve performance over the default InceptionTime hyperparameters on a held-out test dataset. Vastly different architectures perform similarly. For instance, the tuned InceptionTime tremor classifier has 7500 trainable parameters, while the default four-class InceptionTime has 491 000 trainable parameters, yet both achieve similar scores. This similarity in test performance before and after tuning may suggest overfitting of the hyperparameters to the cross-validation sets.
Alternatively, hyperparameters other than the window length may simply not impact InceptionTime performance substantially.

When comparing with the established literature, InceptionTime, ROCKET, and the wavelet MLP score lower, as Table 2 shows, but there are very few works with five classes and ADL, limiting the possibility of comparison. Using engineered features and a random forest52 yields higher AUROC than our approaches, and wavelets with Gaussian processes44 are more accurate. Other works either had patients at rest45 or performing activities optimized for tremor detection53,54,55 instead of the ADL in our research. Further work used multiple or different sensor locations56,57 and/or different numbers of classes58,59,60.

Table 2 Comparison of our mean results* with values from the literature that use similar motions and at least the same number of classes.

With regard to bradykinesia detection, two out of seven related studies report better accuracy, one out of three reports better balanced accuracy, none reports better AUROC, and all report better F1. Our approaches achieve higher BA and AUROC but lower accuracy and F1 than a study that also examined many ADL on a dataset very similar to the MJFF dataset45. We also report higher accuracy than approaches using explicitly modeled features when patients are free to move around6,44,52, although these studies also have more classes. An interesting approach of converting acceleration time series into images for classification by a vision CNN demonstrated slightly better scores48. However, it used a single classifier that categorized motions as bradykinetic, healthy, or dyskinetic, in contrast to our separate binary classifiers for bradykinesia and dyskinesia. Many other related works on bradykinesia detection cannot be compared directly because they restricted the patients’ motions12,63 or used other sensors8,64,65. In accordance with our low performance and poor calibration for dyskinesia detection, all related works (see Table 2) report higher balanced accuracy and F1. Our high accuracy stems from the tendency to predict no dyskinesia combined with the skewed class distribution (see Fig. 5). However, in absolute terms, all studies except one48 report low accuracies (under 60 %)6,44. Several other related works are excluded from the numeric comparison due to differing sensors49,65 and activities45. We note that comparisons with related work on detecting PD from wearable accelerometers are challenging because not all studies report all relevant metrics. Furthermore, datasets and sensors differ, and the classification task may be framed differently. Finally, many scores reported in the existing literature lack an uncertainty estimate, and there are subtleties in how multiclass metrics such as AUROC are calculated.

The MJFF dataset gives a single symptom severity label even if symptoms fluctuate during the task. Some works address this problem of whole time series being labeled with a single symptom despite symptom fluctuations by using so-called multiple-instance learning56,57 and/or by manually examining the time series signals to modify the class labels57. Generally, machine learning applied to manually generated features is able to outperform InceptionTime and ROCKET, especially when the patient motions are constrained. All high-performing related work on raw data used either semi-supervised learning57 or data augmentation with domain-specific knowledge48,66.
Specially designed time series ordinal regression approaches outperform nominal classifiers67 and may be worth investigating for predicting MDS-UPDRS scores. Treating symptom severity prediction as a non-linear regression problem, however, would model the clinical reality most closely.

We avoid selection bias in the classifier scoring by evaluating on a held-out test dataset while developing the models on a validation dataset with cross-validation. Repeating the training of each final model ten times ensures the reliability of the results and provides insight into the score distribution. The significance tests in our work provide statistically rigorous analysis and reduce bias, albeit with some caveats. Significance testing with ASO works better when more sources of variation are considered37. Although we vary the random initialization, we do not re-shuffle the training data, sub-sample it, or modify the train-validation-test split. In other words, when considering the data split as nested k-fold grouped stratified cross-validation, the outer loop has a count of only one. Hence, some findings reported as significant may not generalize to applications of similar pipelines68.

Our results hinge on the Levodopa Response Study dataset, which encompasses 28 patients, but strong tremors are exceedingly rare in it. Accordingly, when grouping by patients, it is nearly impossible to create training, validation, and test datasets that contain tremor severity four while retaining representative distributions of the other symptoms and severities across the datasets (see Table 1). Hence, the classifiers have very few examples from which to learn about strong tremors. Furthermore, the cross-validation-based automated hyperparameter search cannot consider scores for the strongest tremors, as only one patient across the training and validation data has these strong tremors. Our use of data stratification minimizes the impact of the class imbalance on the reported results. The limited dataset size also means that performance cannot be evaluated separately by task while preserving statistical validity. Furthermore, the Levodopa Response Study’s use of different raters for different tasks and patients may cause systematic errors or bias in the labels. However, incorrect labels would affect all classifiers, and our conclusions regarding substantial relative performance differences remain valid. Finally, bias might arise in future real-world applications because wearing the smartwatch on the predominantly affected right hand may conflict with cultural and individual preferences for wearing watches on the left (or non-dominant) hand.

Our hypothesis was that state-of-the-art end-to-end time series classification approaches would also perform well for estimating PD symptom severity during ADL. This hypothesis is supported by comparisons with our wavelet-based baseline, which demonstrate the ability of both InceptionTime and ROCKET to implicitly learn discriminative features from raw tri-axial wrist accelerometry.

Conclusion

We compared InceptionTime, ROCKET, and an MLP operating on wavelet-based features for predicting PD symptom severity from wrist-worn accelerometer data. InceptionTime is comparatively better suited to predicting tremor and bradykinesia, significantly outperforming the simpler wavelet-based MLP approach and slightly surpassing ROCKET overall. However, the performance of InceptionTime varies greatly with the random initialization of its weights.
ROCKET represents the most effective estimator for dyskinesia and is the only model that substantially outperforms random classifiers, although dyskinesia remains the most challenging symptom to detect. Our extensive hyperparameter tuning involved training 900 models, and the results indicate that InceptionTime is unlikely to benefit substantially from architectural modifications. Although InceptionTime and ROCKET represent the state of the art in time series classification with automated feature extraction, certain customized machine learning approaches, especially those employing manual feature engineering, still achieve superior performance for PD symptom severity prediction. To the best of our knowledge, this study constitutes the first application of ROCKET and InceptionTime to the prediction of PD motor symptoms from wrist accelerometry data.

Data availability

The data that support the findings of this study are available from the MJFF at https://doi.org/10.7303/syn20681023.

Code availability

Our source code is available at https://github.com/cedricdonie/tsc-for-wrist-motion-pd-detection.

References

1. Kouli, A. et al. Parkinson’s Disease (Codon Publications, 2018).
2. de Lau, L. M. L. & Breteler, M. M. B. Epidemiology of Parkinson’s disease. Lancet Neurology 5, 525–535. https://doi.org/10.1016/S1474-4422(06)70471-9 (2006).
3. Chaudhuri, K. R. et al. Economic burden of Parkinson’s disease: A multinational, real-world, cost-of-illness study. Drugs - Real World Outcomes 11, 1–11. https://doi.org/10.1007/s40801-023-00410-1 (2024).
4. National Institute for Health and Care Excellence (NICE). Parkinson’s Disease in Adults. NICE Guideline NG71 (2017).
5. Sigcha, L. et al. Deep learning and wearable sensors for the diagnosis and monitoring of Parkinson’s disease: A systematic review. Expert Systems with Applications 229, 120541. https://doi.org/10.1016/j.eswa.2023.120541 (2023).
6. Endo, S. et al. Dynamics-based estimation of Parkinson’s disease severity using Gaussian processes. In 2nd IFAC Conf. Cyber-Physical & Human Systems (IFAC, 2018).
7. Williamson, J. R., Telfer, B., Mullany, R. & Friedl, K. E. Detecting Parkinson’s disease from wrist-worn accelerometry in the U.K. Biobank. Sensors 21, 2047. https://doi.org/10.3390/s21062047 (2021).
8. Pastorino, M. et al. Assessment of bradykinesia in Parkinson’s disease patients through a multi-parametric system. In Proc. 33rd Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. https://doi.org/10.1109/iembs.2011.6090516 (IEEE, 2011).
9. Salarian, A. et al. Quantification of tremor and bradykinesia in Parkinson’s disease using a novel ambulatory monitoring system. IEEE Trans. Biomed. Eng. 54, 313–322. https://doi.org/10.1109/tbme.2006.886670 (2007).
10. Cole, B. T., Roy, S. H., De Luca, C. J. & Nawab, S. H. Dynamical learning and tracking of tremor and dyskinesia from wearable sensors. IEEE Trans. Neural Syst. Rehabil. Eng. 22, 982–991. https://doi.org/10.1109/TNSRE.2014.2310904 (2014).
11. Bengio, Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2, 5–9. https://doi.org/10.1561/2200000006 (2009).
12. Eskofier, B. M. et al. Recent machine learning advancements in sensor-based mobility analysis: Deep learning for Parkinson’s disease assessment. In 2016 38th Annu. Int. Conf. IEEE Eng. Med. Biol. Soc. (EMBC), 655–658. https://doi.org/10.1109/EMBC.2016.7590787 (2016).
13. Rizvi, D. R., Nissar, I., Masood, S., Ahmed, M. & Ahmad, F. An LSTM based deep learning model for voice-based detection of Parkinson’s disease. Int. J. Adv. Sci. Technol. 29 (2020).
14. Balaji, E., Brindha, D., Elumalai, V. K. & Vikrama, R. Automatic and non-invasive Parkinson’s disease diagnosis and severity rating using LSTM network. Appl. Soft Comput. 108, 107463. https://doi.org/10.1016/j.asoc.2021.107463 (2021).
15. Shiranthika, C. et al. Human activity recognition using CNN & LSTM. In 2020 5th Int. Conf. Information Technology Research (ICITR), 1–6. https://doi.org/10.1109/ICITR51448.2020.9310792 (IEEE, Moratuwa, Sri Lanka, 2020).
16. Khatun, M. A. et al. Deep CNN-LSTM with self-attention model for human activity recognition using wearable sensor. IEEE J. Transl. Eng. Health Med. 10, 1–16. https://doi.org/10.1109/JTEHM.2022.3177710 (2022).
17. Bracewell, R. The Fourier Transform and Its Applications 3rd edn (McGraw-Hill, 2000).
18. Fawaz, H. I. et al. InceptionTime: Finding AlexNet for time series classification. Data Mining Knowl. Discovery 34, 1936–1962. https://doi.org/10.1007/s10618-020-00710-y (2020).
19. Dempster, A., Petitjean, F. & Webb, G. I. ROCKET: Exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining Knowl. Discovery 34, 1454–1495. https://doi.org/10.1007/s10618-020-00701-z (2020).
20. Zhang, X., Gao, Y., Lin, J. & Lu, C.-T. TapNet: Multivariate time series classification with attentional prototypical network. Proc. AAAI Conf. Artif. Intell. 34(4), 6845–6852. https://doi.org/10.1609/aaai.v34i04.6165 (2020).
21. Ruiz, A. P., Flynn, M., Large, J., Middlehurst, M. & Bagnall, A. The great multivariate time series classification bake off: A review and experimental evaluation of recent algorithmic advances. Data Mining Knowl. Discovery 35, 401–449. https://doi.org/10.1007/s10618-020-00727-3 (2020).
22. Uribarri, G., von Huth, S. E., Waldthaler, J., Svenningsson, P. & Fransén, E. Deep learning for time series classification of Parkinson’s disease eye tracking data. Preprint at https://doi.org/10.48550/arXiv.2311.16381 (2023).
23. Klaver, E. C. et al. Comparison of state-of-the-art deep learning architectures for detection of freezing of gait in Parkinson’s disease. Frontiers in Neurology 14. https://doi.org/10.3389/fneur.2023.1306129 (2023).
24. Dempster, A., Schmidt, D. F. & Webb, G. I. MiniRocket: A very fast (almost) deterministic transform for time series classification. In Proc. 27th ACM SIGKDD Conf. Knowledge Discovery & Data Mining (KDD ’21), 248–257. https://doi.org/10.1145/3447548.3467231 (ACM, 2021).
25. Zhou, Z. et al. Deep learning-based classification of neurodegenerative diseases using gait dataset: A comparative study. In Proc. 2023 Int. Conf. Robotics, Control and Vision Engineering (RCVE ’23), 59–64. https://doi.org/10.1145/3608143.3608154 (ACM, New York, NY, USA, 2023).
26. Rey-Paredes, M., Pérez, C. J. & Mateos-Caballero, A. Time series classification of raw voice waveforms for Parkinson’s disease detection using generative adversarial network-driven data augmentation. IEEE Open J. Computer Society 6, 72–84. https://doi.org/10.1109/OJCS.2024.3504864 (2025).
IEEE Open Journal of the Computer Society 6, 72–84. https://doi.org/10.1109/OJCS.2024.3504864 (2025).Article  Google Scholar Daneault, J.-F. et al. Accelerometer data collected with a minimum set of wearable sensors from subjects with Parkinson’s disease. Scientific Data 8, https://doi.org/10.1038/s41597-021-00830-0 (2021).Vergara-Diaz, G. et al. Limb and trunk accelerometer data collected with wearable sensors from subjects with Parkinson’s disease. Scientific Data 8, https://doi.org/10.1038/s41597-021-00831-z (2021).Vabalas, A., Gowen, E., Poliakoff, E. & Casson, A. J. Machine learning algorithm validation with a limited sample size. PLOS ONE 14, e0224365. https://doi.org/10.1371/journal.pone.0224365 (2019).Article  CAS  PubMed  PubMed Central  Google Scholar Cawley, G. C. & Talbot, N. L. C. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010).MathSciNet  Google Scholar Mosley, L. A balanced approach to the multi-class imbalance problem. Ph.D. thesis, Iowa State University (2013). https://doi.org/10.31274/etd-180810-3375.Su, W., Yuan, Y. & Zhu, M. A relationship between the average precision and the area under the roc curve. In Proc. 2015 Int. Conf. Theory Inf. Retrieval, ICTIR ’15, 349–352, https://doi.org/10.1145/2808194.2809481 (Association for Computing Machinery, 2015).Hand, D. J. & Till, R. J. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning 45, 171–186. https://doi.org/10.1023/A:1010920819831 (2001).Article  Google Scholar Baccianella, S., Esuli, A. & Sebastiani, F. Evaluation Measures for Ordinal Regression. In 2009 Ninth International Conference on Intelligent Systems Design and Applications, 283–287, https://doi.org/10.1109/ISDA.2009.230 (2009).Blasiok, J. & Nakkiran, P. Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing. In The Twelfth International Conference on Learning Representations (2023).Dror, R., Shlomov, S. & Reichart, R. Deep dominance – how to properly compare deep neural models. In Proc. 57th Annu. Meeting Assoc. Computat. Linguistics, https://doi.org/10.18653/v1/p19-1266 (Association for Computational Linguistics, 2019).Ulmer, D., Hardmeier, C. & Frellsen, J. deep-significance - easy and meaningful statistical significance testing in the age of neural networks. CoRR (2022). arXiv:2204.06815.Yuan, K.-H. & Hayashi, K. Bootstrap approach to inference and power analysis based on three test statistics for covariance structure models. Brit. J. Math. Statistical Psychology 56, 93–110. https://doi.org/10.1348/000711003321645368 (2003).Article  MathSciNet  Google Scholar Welch, P. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on Audio and Electroacoustics 15, 70–73. https://doi.org/10.1109/TAU.1967.1161901 (1967).Article  Google Scholar Richman, J. S. & Moorman, J. R. Physiological time-series analysis using approximate entropy and sample entropy. American Journal of Physiology-Heart and Circulatory Physiology 278, H2039–H2049. https://doi.org/10.1152/ajpheart.2000.278.6.H2039 (2000).Article  CAS  PubMed  Google Scholar Bikias, T., Iakovakis, D., Hadjidimitriou, S., Charisis, V. & Hadjileontiadis, L. J. DeepFoG: An IMU-based detection of freezing of gait episodes in Parkinson’s disease patients via deep learning. Frontiers Robot. AI 8, https://doi.org/10.3389/frobt.2021.537384 (2021).Dau, H. A. et al. 
The UCR time series classification archive (2018).Bergstra, J. & Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012).MathSciNet  Google Scholar Lang, M. et al. A Multi-Layer Gaussian Process for Motor Symptom Estimation in People With Parkinson’s Disease. IEEE Transactions on Biomedical Engineering 66, 3038–3049. https://doi.org/10.1109/TBME.2019.2900002 (2019).Article  PubMed  Google Scholar Wagner, A., Fixler, N. & Resheff, Y. S. A wavelet-based approach to monitoring Parkinson’s disease symptoms. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5980–5984, https://doi.org/10.1109/ICASSP.2017.7953304 (2017).Vimalajeewa, D., McDonald, E., Tung, M. & Vidakovic, B. Parkinson’s disease diagnosis with gait characteristics extracted using wavelet transforms. IEEE Journal of Translational Engineering in Health and Medicine 11, 271–281. https://doi.org/10.1109/JTEHM.2023.3272796 (2023).Article  Google Scholar Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. Poster at the 3rd Int. Conf. Learn. Representations (2015).Pfister, F. M. J. et al. High-Resolution Motor State Detection in Parkinson’s Disease Using Convolutional Neural Networks. Scientific Reports 10, 5860. https://doi.org/10.1038/s41598-020-61789-3 (2020).Article  ADS  CAS  PubMed  PubMed Central  Google Scholar Hssayeni, M. D., Jimenez-Shahed, J., Burack, M. A. & Ghoraani, B. Dyskinesia estimation during activities of daily living using wearable motion sensors and deep recurrent networks. Scientific Reports 11, 7865. https://doi.org/10.1038/s41598-021-86705-1 (2021).Article  ADS  CAS  PubMed  PubMed Central  Google Scholar Colcher, A. & Simuni, T. Clinical Manifestations of Parkinson’s Disease. Medical Clinics of North America 83, 327–347. https://doi.org/10.1016/S0025-7125(05)70107-3 (1999).Article  CAS  PubMed  Google Scholar Le Guennec, A., Malinowski, S. & Tavenard, R. Data augmentation for time series classification using convolutional neural networks. In Proc. 2nd ECML/PKDD Workshop Adv. Analytics Learn. Temporal Data (2016).Lonini, L. et al. Wearable sensors for Parkinson’s disease: Which data are worth collecting for training symptom detection models. npj Digital Medicine 1, 1–8. https://doi.org/10.1038/s41746-018-0071-z (2018).Article  Google Scholar Polvorinos-Fernández, C. et al. Evaluation of the Performance of Wearables’ Inertial Sensors for the Diagnosis of Resting Tremor in Parkinson’s Disease. In Proceedings of the 17th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2024), vol. 2, 820–827, https://doi.org/10.5220/0012571600003657 (SciTePress, 2024).Sigcha, L. et al. Automatic Resting Tremor Assessment in Parkinson’s Disease Using Smartwatches and Multitask Convolutional Neural Networks. Sensors 21, 291. https://doi.org/10.3390/s21010291 (2021).Article  ADS  PubMed  PubMed Central  Google Scholar Legaria-Santiago, V. K., Sánchez-Fernández, L. P., Sánchez-Pérez, L. A. & Garza-Rodríguez, A. Computer models evaluating hand tremors in Parkinson’s disease patients. Computers in Biology and Medicine 140, 105059. https://doi.org/10.1016/j.compbiomed.2021.105059 (2022).Article  PubMed  Google Scholar Papadopoulos, A. et al. Detecting Parkinsonian Tremor From IMU Data Collected in-the-Wild Using Deep Multiple-Instance Learning. IEEE Journal of Biomedical and Health Informatics 24, 2559–2569, https://doi.org/10.1109/JBHI.2019.2961748 (2019-12-2024).Papadopoulos, A. & Delopoulos, A. 
Acknowledgements
We thank the Michael J. Fox Foundation for funding the MJFF Levodopa Response Study and providing the dataset used for this paper. This work was partially funded by the European Research Council (ERC) Consolidator Grant “Safe data-driven control for human-centric systems (CO-MAN)” under grant agreement number 864686, by the Federal Ministry of Education and Research of Germany (BMBF) in the program of “Souverän. Digital.
Vernetzt.” (joint project 6G-life, project identification number 16KISK002), and by the Federal Ministry of Education and Research of Germany and the Free State of Bavaria under the Excellence Strategy of the Federal Government and the States (TUM Innovation Network eXprt).

Funding
Open Access funding enabled and organized by Projekt DEAL.

Author information
Authors and Affiliations
TUM School of Computation, Information and Technology, Department of Computer Engineering, Chair of Information-oriented Control, Technical University of Munich, Munich, Germany: Cedric Donié, Neha Das, Satoshi Endo & Sandra Hirche
Munich Data Science Institute (MDSI), Munich, Germany: Sandra Hirche
Munich Institute of Robotics and Machine Intelligence (MIRMI), Munich, Germany: Satoshi Endo & Sandra Hirche

Contributions
C.D. and N.D. conceived and conducted the experiments. S.E., N.D., and S.H. developed the research idea. C.D. drafted the manuscript and drew the figures. All authors contributed to and reviewed the manuscript.

Corresponding author
Correspondence to Cedric Donié.

Ethics declarations
Competing interests
The authors declare no competing interests.

Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.