Clinical validation of a deep learning tool for characterizing spinopelvic mobility in total hip arthroplasty


Introduction

Patients with unfavorable spinopelvic mobility or significant sagittal spinal deformities in total hip arthroplasty (THA) present an elevated risk of instability1,2,3, dislocation1,2,3,4 and revision surgery1,2. Radiographic assessment is routine care for identifying at-risk patients early in the treatment process5. Sagittal radiographs in standing, seated and contra-lateral step-up functional positions allow analysis of spinopelvic sagittal mobility. Typical analysis involves landmarking key anatomic features of the lumbar spine and pelvis to calculate the sacral slope (SS), pelvic tilt (PT), and lumbar lordotic angle (LLA)4,6,7,8,9,10. These parameters allow evaluation of dislocation risk as well as impingement and prosthetic joint mechanics1,11,12,13. Utilizing these measurements, surgeons can optimize prosthetic component orientation and selection, enhancing joint functionality and reducing the likelihood of postoperative complications1,12,13.

Measurement accuracy, however, is impacted by observer expertise, imaging quality and patient-specific anatomical factors14,15,16,17. Furthermore, the lead time required to process radiographs by vendors using expert engineers poses a risk to routine analysis. The development of automated systems using deep learning (DL) offers a promising solution to enhance both accuracy and efficiency. DL algorithms have demonstrated remarkable success in image recognition tasks using medical imaging18,19,20 and, when trained on sufficiently large datasets, report a high degree of precision21,22,23,24,25,26,27, potentially surpassing the accuracy of manual measurements28. Previous works have used convolutional neural networks (CNNs) to auto-landmark spine and pelvis radiographs22,23,24,29, with some models applying successive networks to refine performance26,27.
However, existing models often lack a dataset that originates from diverse imaging centers and machines, and spans various geographical locations, limiting their applicability24,25,26. Moreover, these models frequently do not analyze multiple functional positions and lack rigorous clinical validation22,23,24,26. A DL pipeline that classifies patient functional position and measures spinopelvic mobility in lateral X-rays, using a large-scale multicenter international medical imaging database, has not been previously developed.

The purpose of this study was to investigate a method to rapidly generate spinopelvic measurements in hip arthroplasty. Specifically, we aim to: (1) train a DL model that will classify input sagittal functional pelvic radiographs based on functional position; (2) train a DL model to identify lumbar and pelvic landmarks allowing calculation of SS, PT and LLA; (3) develop a DL pipeline that provides end-to-end spinopelvic mobility measurements indistinguishable from expert engineers, with clinical validation by an orthopaedic surgeon and two senior engineers.

Materials and methods

Image dataset and landmarks

Preoperative and postoperative lateral functional imaging was retrospectively extracted from an international joint replacement registry (CorinRegistry, Corin, UK). Ethics approval was obtained prior to the start of this study (AU ethics: Bellberry: 2020-08-764-A-2, USA IRB: WCGIRB: 120190312). All study methods were carried out in accordance with relevant guidelines and regulations, and a waiver of informed consent was approved by the ethics committee for this research. The registry is a cloud-based big-data ecosystem that passively de-identifies imaging and other data from preoperative planning and postoperative analysis processes for research purposes.
It encompasses data from over 38,000 total hip arthroplasties performed between January 2017 and September 2023, collected from 391 imaging centers across 11 countries (Australia, United States, United Kingdom, France, Hungary, New Zealand, Austria, Japan, Italy, Germany and Belgium; Table S1). The lateral functional imaging consisted of standing, flexed seated and contra-lateral step-up radiographs captured either for preoperative THA planning30 or postoperative implant placement analysis. All three lateral functional X-rays were taken with the referred side positioned closest to the imaging detector and furthest from the X-ray source (Figure S1). In the flexed seated X-ray, the femur was positioned to be horizontal and parallel to the floor, as was the contralateral femur in the step-up X-ray. During the step-up, the contralateral leg was raised while the patient bore weight on the affected leg. For the seated position, the patient flexed forward as much as possible. Inclusion criteria were: undergoing or having undergone THA, with lateral radiographs previously landmarked for spinopelvic analysis and all landmarks visible on the image. All imaging in the registry has previously been evaluated by expert engineers and quality checked by senior engineers to ensure suitable image quality and correct functional positions. The maximum available imaging set per patient includes standing, flexed seated, and contralateral step-up lateral X-rays. Cases with one or more missing images were included in this analysis if the images were acceptable for preoperative planning or postoperative analysis. There were no exclusion criteria for this analysis.

The number of images used to train, validate and test each model, the proportion of preoperative versus postoperative imaging, and the number of imaging centers contributing data are shown in Table 1. The differences in total dataset sizes used to train each model are due to labelled data availability constraints.
Given the variability in the execution of imaging protocols across 391 centers worldwide, we first trained a deep learning classification model to categorize lateral functional radiographs into one of three functional positions. While imaging centers are provided with standardized protocols and instructed to include the appropriate series description in the DICOM metadata (0008,103e), adherence to these naming conventions is inconsistent. As a result, relying solely on DICOM tags is not feasible, necessitating an automated classification approach. The DICOM images were preprocessed by converting to JPG format and rescaling pixel intensities to the range 0–255. This conversion facilitates standardized input handling in the model training, validation, and testing stages. We developed an X-ray processing pipeline consisting of several DL models to identify PT, SS, and LLA from lateral functional imaging (Fig. 1). PT is defined as the angle formed by a line drawn from the midpoint of the two anterior superior iliac spine (ASIS) landmarks to the center of the pubic symphysis, and a vertical reference line. SS is defined as the angle between the superior surface of the S1 sacral endplate and a horizontal line. The LLA is determined by the angle between the superior endplates of the S1 and L1 vertebrae, which quantifies the degree of lordosis of the lower spine31.

Table 1 Overview of image processing pipeline, model specifications, training details, and demographic analysis. Chi-squared tests were used to assess differences in sex distribution across datasets, and one-way ANOVA was used to evaluate age differences.

Fig. 1 Example lateral radiographs in stand, contra-lateral step-up and flex seated positions showing spinopelvic parameters of interest: pelvic tilt (PT), sacral slope (SS) and lumbar lordotic angle (LLA).

Landmarks defined by expert engineers through a manual X-ray annotation process across the three functional positions (standing, contralateral step-up, and flexed seated) were used as ground truth for training the landmark detection model. These ground truth landmarks were identified using RadiAnt DICOM Viewer software (Medixant, Poznan, Poland) and independently verified by senior engineers accredited for quality control, each with a minimum of two years' experience in annotation and having undergone a practical assessment to demonstrate competency. To ensure accuracy during 2D landmarking, corresponding 3D landmarks captured on patient computed tomography (CT) scans were utilized. Both expert and senior engineers visually inspected the annotated landmarks, comparing their placement against 3D-rendered CT scans with corresponding 3D landmark points, which were obtained using Simpleware ScanIP (Synopsys, Sunnyvale, CA). To assist with identifying ground truth landmarks, expert engineers manually manipulated the 3D volume rendering to align with the X-ray view, ensuring accurate spatial correspondence. Additionally, they applied adjustable filters to enhance visualization, allowing for the creation of a realistic X-ray replica that closely matched the radiographic appearance. The landmark prediction algorithm outputs confidence maps, with the pixel coordinates of the highest confidence values selected as the predicted landmarks.
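To make the landmark selection and angle definitions concrete, a minimal Python sketch is given below. The confidence map, landmark coordinates, and function names are hypothetical illustrations of the definitions above, not the study's implementation; image coordinates are assumed with y increasing downwards.

```python
import math

def peak_coordinate(confidence_map):
    """Return (x, y) of the highest-confidence pixel in a 2D map (rows = y)."""
    best_val, best_xy = -math.inf, (0, 0)
    for y, row in enumerate(confidence_map):
        for x, val in enumerate(row):
            if val > best_val:
                best_val, best_xy = val, (x, y)
    return best_xy

def sacral_slope(s1_anterior, s1_posterior):
    """SS: angle between the superior S1 endplate line and the horizontal.
    Points are (x, y) in image convention, so y is negated."""
    dx = s1_anterior[0] - s1_posterior[0]
    dy = s1_anterior[1] - s1_posterior[1]
    return abs(math.degrees(math.atan2(-dy, dx)))

def pelvic_tilt(asis_left, asis_right, pubic_symphysis):
    """PT: angle between the ASIS-midpoint-to-pubic-symphysis line and vertical."""
    mid = ((asis_left[0] + asis_right[0]) / 2, (asis_left[1] + asis_right[1]) / 2)
    dx = pubic_symphysis[0] - mid[0]
    dy = pubic_symphysis[1] - mid[1]
    return abs(math.degrees(math.atan2(dx, dy)))  # measured from the vertical axis

# Hypothetical example: a tiny confidence map and made-up landmark coordinates.
heatmap = [[0.0, 0.1, 0.0],
           [0.2, 0.9, 0.1],
           [0.0, 0.3, 0.0]]
print(peak_coordinate(heatmap))                      # (1, 1)
print(round(sacral_slope((120, 80), (80, 110)), 1))  # 36.9 for these sketch values
```

The LLA would follow the same atan2 pattern, applied to the angle between the S1 and L1 superior endplate lines.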
These predicted landmarks were then used to calculate spinopelvic mobility parameters, with angular values compared against the ground truth annotations to assess accuracy.

Model pipeline

An image processing pipeline was developed utilizing deep learning techniques to streamline the classification of functional radiographs (vision transformer, ViT32), the detection of vertebrae (object detection, YOLOv833), the automated detection of anatomical landmarks (convolutional neural network, CNN34), and the derivation of spinopelvic measurements (Fig. 2).

Fig. 2 Flow diagram illustrating the deep learning pipeline for analyzing functional lateral X-ray images. X-rays are processed using a vision transformer (ViT) model for image classification. For pelvis landmark detection, a convolutional neural network (CNN) is applied directly to identify key landmarks in isolation. For lumbar landmark detection, the YOLOv8 (You Only Look Once, version 8) model is first used to isolate individual vertebrae as tiles. The L5 tile is processed by the CNN to accurately detect lumbar landmarks. Finally, the landmarks are used to calculate PT, SS and LLA.

The imaging was divided into training, validation, and testing groups following a 70:20:10 split. Data augmentation was applied online during training using PyTorch and the Albumentations library to enhance model generalization. A combination of intensity, geometric, and noise-based transformations was probabilistically applied. Brightness and contrast adjustments were applied randomly, with brightness limited to ± 40% (30% probability) and contrast limited to ± 70% (40% probability). Geometric transformations included independent random shifting (range: ± 10%, probability: 10%), scaling (range: ± 20%, probability: 10%), and rotation (range: ± 45°, probability: 80%). Horizontal flipping was applied with a 50% probability.
Noise augmentation included Gaussian noise (variance range: 50–500, probability: 85%) and blurring with a kernel size of 14–20 pixels (15% probability). Each augmentation was applied stochastically to simulate real-world imaging variations and improve model robustness.

Models were trained using Python version 3.10.10. The vision transformer was developed using Microsoft Azure's AutoML platform. For object detection, the YOLOv8 model used the Ultralytics 8.0.227 package33,35, and for landmark detection we employed the PyTorch package version 2.0.036. Data augmentations were performed using the Albumentations package version 1.3.1 (https://github.com/albumentations-team/albumentations)37. Model parameters were randomly initialised (Table 1).

Model testing and statistical analysis

Receiver Operating Characteristic Area Under the Curve (ROC-AUC) analysis was performed on the image classification algorithm. The micro- and macro-averaged AUC, sensitivity, specificity, and F1 scores (harmonic mean of precision and recall) were calculated. Precision-Recall AUC (PR-AUC) analysis was performed on the vertebra detection algorithm and the AUC calculated.

The mean absolute error (MAE) between the predicted spinopelvic measurements and the ground truth measurements (PT, SS and LLA) captured by the expert engineers during preoperative planning was calculated. MAE values for spinopelvic measurements were grouped by functional position in the X-ray and compared using Wilcoxon rank-sum tests. Robustness of the landmarking models was assessed by comparing the PT, SS and LLA MAE based on the number of images obtained from each imaging center. Imaging centers were subdivided into 4 groups (1–10, 11–50, 51–100 and > 100 images received).
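The MAE used throughout the evaluation reduces to a paired absolute-difference average; a minimal sketch with made-up angle values:

```python
def mean_absolute_error(predicted, ground_truth):
    """MAE between paired angle measurements, in degrees."""
    assert len(predicted) == len(ground_truth)
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(predicted)

# Hypothetical PT values (degrees) for a handful of images.
predicted_pt    = [12.0, 5.5, -3.0, 20.1]
ground_truth_pt = [10.5, 6.0, -1.0, 21.0]
print(mean_absolute_error(predicted_pt, ground_truth_pt))  # 1.225
```

In the study, such per-image errors were pooled per functional position and per imaging-center volume group before statistical comparison.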
To compare differences between imaging center volume groups for each functional position, a Kruskal-Wallis test was performed, with post-hoc Dunn's tests if the Kruskal-Wallis test returned a significant difference.

To prospectively assess the accuracy of the DL pipeline, a power analysis was first performed: using a two-sample, two-sided means power analysis, we calculated the number of cases needed with equal groups, alpha = 0.05, power = 0.8, a standard deviation of 2.5° and a mean difference of 2°. Groups of n = 26 were required. To account for any case failures, we extended this to n = 30. A prospective set of 30 cases was chosen from a pool of images not used during training, validation, or testing. These were randomly selected, non-consecutive, routine preoperative THA cases, chosen to represent a wide range of image quality and difficulty in landmark identification. Each case contained three functional positions (stand, step-up, flex seated) that were annotated by three expert engineers twice, in two rounds spaced two weeks apart, similar to Wakelin et al.38 and Raynauld et al.39. To provide an unbiased comparison between the DL algorithm and the expert engineers, the engineers did not have access to the 3D CT model to assist in landmarking the radiographs for these 30 cases. The interobserver agreement for the expert engineers was evaluated using the intraclass correlation coefficient (ICC). Wilcoxon rank-sum tests compared the expert engineers with each other, and the MAE between the pooled expert engineers and the DL pipeline results.

To clinically validate the DL pipeline, two senior engineers and a fellowship-trained hip arthroplasty surgeon with over 10 years' clinical experience conducted a blinded quality control (QC) check on the prospective set of 90 images (30 cases) landmarked by three expert engineers and the DL pipeline (270 images total; Fig. 3).
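The two-sample means power calculation above can be reproduced approximately with the standard normal (z) formula. This is a sketch only: dedicated power software applies a t-distribution correction, which yields a slightly larger group size, consistent with the n = 26 reported here.

```python
import math
from statistics import NormalDist

def n_per_group_means(alpha, power, sd, mean_diff):
    """Normal-approximation sample size per group for a two-sided,
    two-sample comparison of means with equal group sizes."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sd / mean_diff) ** 2

n = n_per_group_means(alpha=0.05, power=0.8, sd=2.5, mean_diff=2.0)
print(math.ceil(n))  # 25 under the z approximation; a t correction gives 26
```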
A two-sample proportion power analysis was performed with α = 0.05, 1 − β = 0.8, a sampling ratio of 1:1 and an estimated rejection rate of 5% in the expert engineer landmarked cohort compared to 15% in the comparative cohort: a minimum of 140 images analysed per group was required. The two senior engineers performed a QC check on (1) all DL pipeline landmarked images (90 each = 180 total), and (2) all manually landmarked images by the three expert engineers (90 × 3 = 270, 135 each). The surgeon QC-checked all DL pipeline landmarked images (90 total), paired with one expert engineer landmarked image selected randomly from the 3 expert engineers (90 total). The senior engineers and surgeon classified images as correctly or incorrectly landmarked and, if incorrectly landmarked, indicated which landmarks required changing. The pass-fail rates for images annotated by expert engineers versus those by the DL pipeline were compared using Chi-square tests. Fisher's exact test was used to compare differences in landmark rejection between senior engineers and the surgeon. A critical p value of 0.05 was used in all cases. All statistical testing was performed in R 4.1.2 (R Project, Vienna, Austria).

Fig. 3 Image flow diagram for clinical validation.

Results

Model training and convergence

All models successfully completed training and were able to process all test set images. Total processing time for sequential classification, object detection, and landmark detection was 1.96 ± 0.04 s per image using an Azure virtual machine (Standard_NC4as_T4_v3) with an Nvidia Tesla T4 GPU.

The convergence of estimated landmarks from the sequential 6-stage convolutional model from epoch 0–50 is shown as a heat map in Fig. 4. At epoch 0 the heatmap is broad and undifferentiated. As the epochs advance, the heatmap focus narrows with an increase in intensity over the true landmark regions. By epoch 50, the model has converged.

Fig. 4 Landmark detection model heatmap output converging from epoch 0 to 50, for the pelvis landmarks.

Image classification and object detection

High ROC-AUC and PR-AUC values were achieved for both the image classification and vertebrae detection models. The classification model demonstrated excellent performance (Figure S2), achieving micro- and macro-average AUCs, F1 scores, sensitivity, and specificity all ≥ 0.998 (Table 2). For vertebrae detection, the Precision-Recall (PR) curve (Figure S2) highlighted similarly strong performance, with an AUC-PR of 0.998, precision of 0.994, and recall of 0.994 (Table 2). Misclassifications between standing and step-up positions were rare (0.23%) and typically occurred when the vertical femur in step-up images was obscured by a limited field of view, creating the appearance of a standing posture (Figure S3). Similarly, misclassifications between flexed seated and step-up positions (0.06%) arose due to overlapping femurs in seated poses and the absence of forward flexion, which closely resembled a stepping action (Figure S3).

Table 2 Performance metrics for image classification and vertebrae detection, including AUC, F1 scores, sensitivity, specificity, and precision-recall values achieved by the deep learning pipeline for spinopelvic mobility analysis.

Clinical measurements: DL pipeline versus ground truth

The overall error and error broken down by functional position for the predicted PT, SS and LLA are shown in Fig. 5 (see also Figure S4 for boxplots with outliers and Table S2 for summary statistic values). PT is most accurately predicted, with the lowest MAE (MAE ± SD: 1.6° ± 2.1°); however, PT error is also dependent on functional position, with Stand reporting the lowest error (1.2° ± 1.3°) and Seated the highest (2.3° ± 3.2°), p