Introduction
Long non-coding RNAs (lncRNAs) are RNA transcripts longer than 200 nucleotides1,2. lncRNAs are involved in many key physiological processes, such as tissue development, tumorigenesis, and immune regulation. Furthermore, various human diseases are closely associated with the dysregulation and mutation of lncRNAs3. In particular, lncRNAs play distinct roles in the progression and development of cancers4. lncRNAs have been regarded as potential therapeutic molecular targets and offer new opportunities for targeted cancer therapy5, and many lncRNAs have been validated to modulate chemotherapy resistance in cancers6,7. Consequently, discovering potential relationships between diseases and lncRNAs helps elucidate the molecular mechanisms of complex human diseases from the perspective of lncRNAs, detect disease biomarkers, assist in diagnosis and treatment, and further promote the development of personalized medicine8.

However, identifying potential lncRNA-disease associations (LDAs) remains a huge challenge for biologists: although biological experiments have yielded some LDAs9, in vivo experiments are costly, labor-intensive, and have a low success rate. Thus, computational techniques have been increasingly applied to association prediction tasks, including LDA prediction3,10,11. Chen et al.12 constructed an LDA database that provides experimentally validated LDAs for 166 diseases. Based on this database, Chen et al.13 inferred potential LDAs by combining lncRNA expression profiles. Following this work, many computational tools, mainly network-based methods and machine learning-based methods, have been devised to uncover new LDAs14.

Network-based methods first compute lncRNA and disease similarity matrices from their biological information, and then evaluate the association score of each lncRNA-disease pair through network algorithms. These methods include Laplacian regularized least squares13, the KATZ measure15, a heterogeneous network model16, network consistency projection17, local random walk18, Laplacian normalized random walk with restart19, and bidirectional linear neighborhood label propagation20.

Machine learning has been broadly utilized in various link prediction fields, including LDA identification21,22,23. These methods first learn features of lncRNAs and diseases and then classify unknown lncRNA-disease pairs. Traditional machine learning-based LDA prediction methods include rotation forest24, random forest25, multi-label learning26, matrix factorization27, inductive matrix completion28, weighted matrix factorization29, matrix decomposition30, collaborative filtering31, bipartite local models32, and the heterogeneous Newton boosting machine2. Recently, deep learning algorithms have been gradually adopted to discover new LDAs due to their powerful representation learning ability, for example, collaborative deep learning33, deep belief networks34, generative adversarial networks35, graph contrastive learning36, deep neural networks37,38,39, heterogeneous graph learning40, capsule networks41, dual-net neural networks42, graph convolutional autoencoders43, graph attention networks44, and a residual graph convolutional network with an attention mechanism8.

Machine learning has advanced LDA prediction. However, LDA datasets are imbalanced and contain noise, and machine learning-based LDA prediction algorithms, especially traditional boosting models, still have limitations with respect to label noise, imbalanced datasets, and LDA feature extraction.
To address the above problems, in this manuscript, we develop a computational model called LDA-GARB to infer potential LDAs by combining Nonnegative Matrix Factorization (NMF), Graph Autoencoder (GAE), and noise-Robust gradient Boosting. This work makes three main contributions:

To alleviate label noise and data imbalance in the LDA classification task, we present a noise-robust gradient boosting model that classifies unobserved lncRNA-disease pairs by integrating Gradient Boosting Decision Trees (GBDT) and a robust loss.

To obtain rich LDA features, we leverage NMF for extracting linear features and GAE for extracting nonlinear features.

We predict that CCDC26 and HAR1A could be associated with colorectal cancer (CRC) and breast cancer, respectively.

Results
In this manuscript, as shown in Fig. 1, we propose an LDA prediction method, LDA-GARB, which combines LDA feature extraction through NMF and GAE45 with LDA classification via the noise-robust gradient boosting model. Finally, we predicted associated lncRNAs for CRC and breast cancer through LDA-GARB.

Fig. 1 The pipeline for LDA prediction with LDA-GARB. (i) Feature extraction. Linear and nonlinear features of lncRNAs and diseases are extracted by NMF and GAE, and each LDA is depicted as a vector by concatenating the learned linear and nonlinear features. (ii) LDA classification. The noise-robust gradient boosting model is designed to classify unobserved LDAs based on the extracted LDA features.

Data preparation
Two human LDA datasets (Dataset 1 and Dataset 2)2,42 were used to evaluate the model and obtain predictions. The lncRNAs, diseases, and experimentally confirmed LDAs in the two datasets were obtained from the lncRNADisease v2.046 and MNDR v2.047 databases, respectively. We removed diseases without MeSH information or a regular name and lncRNAs lacking sequence data. After preprocessing, Dataset 1 includes 92 lncRNAs, 157 diseases, and 605 LDAs; Dataset 2 includes 89 lncRNAs, 190 diseases, and 1,529 LDAs. The preprocessed datasets are summarized in Table 1. The association network was represented as a matrix \(\varvec{Y} \in {\Re ^{n \times m}}\), where \(y_{i j}\) is 1 when there is an association between the i-th lncRNA and the j-th disease, and \(y_{i j}\) is 0 otherwise.

Table 1 Overview of the two LDA datasets.

Experimental settings
Peng et al.2 designed multiple 5-fold Cross Validation (CV) schemes and provided insights into evaluating the performance of link prediction models. Inspired by the CVs proposed by Peng et al.2, we used three distinct 5-fold CVs to test the model performance: 5-fold CV on lncRNAs (\(CV_1\)), 5-fold CV on diseases (\(CV_2\)), and 5-fold CV on lncRNA-disease pairs (\(CV_{3}\)). In each round, the experiments were run as follows (see the numpy sketch below):

\(CV_1\): 20% of lncRNAs in \(\varvec{Y}\) were randomly held out for testing and the rest were used for training.

\(CV_2\): 20% of diseases in \(\varvec{Y}\) were randomly held out for testing and the rest were used for training.

\(CV_{3}\): 20% of lncRNA-disease pairs in \(\varvec{Y}\) were randomly held out for testing and the rest were used for training.

To assess the prediction accuracy of LDA-GARB, we used six machine learning indicators: precision, recall, accuracy, F1-score, Area Under the receiver operating characteristic (ROC) Curve (AUC), and Area Under the Precision-Recall (PR) curve (AUPR), as provided by Refs.2,48.
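The three splitting schemes can be written down directly on the association matrix \(\varvec{Y}\). The following minimal numpy sketch illustrates one round of each split; the function names (split_cv1, split_cv2, split_cv3) and the toy matrix are illustrative assumptions and not part of the LDA-GARB code.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_cv1(Y, test_frac=0.2):
    """CV_1: hide a random 20% of lncRNAs (rows of Y) for testing."""
    n = Y.shape[0]
    test_rows = rng.choice(n, size=int(test_frac * n), replace=False)
    train_mask = np.ones_like(Y, dtype=bool)
    train_mask[test_rows, :] = False          # all pairs of held-out lncRNAs become test pairs
    return train_mask, ~train_mask

def split_cv2(Y, test_frac=0.2):
    """CV_2: hide a random 20% of diseases (columns of Y) for testing."""
    m = Y.shape[1]
    test_cols = rng.choice(m, size=int(test_frac * m), replace=False)
    train_mask = np.ones_like(Y, dtype=bool)
    train_mask[:, test_cols] = False
    return train_mask, ~train_mask

def split_cv3(Y, test_frac=0.2):
    """CV_3: hide a random 20% of lncRNA-disease pairs for testing."""
    order = rng.permutation(Y.size)
    test_idx = order[: int(test_frac * Y.size)]
    train_mask = np.ones(Y.size, dtype=bool)
    train_mask[test_idx] = False
    return train_mask.reshape(Y.shape), ~train_mask.reshape(Y.shape)

# toy example: 92 lncRNAs x 157 diseases (the sizes of Dataset 1)
Y = (rng.random((92, 157)) < 0.05).astype(int)
train_mask, test_mask = split_cv3(Y)
```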
All experiments were run on Ubuntu with a 12th Gen Intel(R) Core(TM) i7-12650H CPU, an NVIDIA GeForce RTX 4060 Laptop GPU, and 32.0 GB of RAM. The related software versions were Python 3.8, Numpy 1.23.2, Pandas 2.1.4, scikit-learn 1.3.0, and XGBoost 2.0.0. The parameters of LDA-GARB and the four competing LDA prediction methods (i.e., SDLDA, LDNFSGB, LDAenDL, and LDA-VGHB) are shown in Table 2.

Table 2 Parameter settings in SDLDA, LDNFSGB, LDAenDL, LDA-VGHB, and LDA-GARB.

Baselines
We compared the performance of LDA-GARB with four state-of-the-art LDA prediction models, SDLDA49, LDNFSGB50, LDAenDL37, and LDA-VGHB2, on the two datasets. For each lncRNA-disease pair, SDLDA49 first learned features by integrating singular value decomposition (SVD) and deep learning and then determined whether the pair was associated through a fully connected layer. LDNFSGB50 first reduced the feature dimension via an autoencoder and classified the pair through a gradient boosting algorithm. LDAenDL37 extracted biological features by integrating a graph convolutional network, a convolutional neural network, and a graph attention network, and then inferred the association through a deep neural network and LightGBM. LDA-VGHB2 incorporated SVD and a variational graph autoencoder for feature learning and a heterogeneous Newton boosting machine for classification.

Additionally, following Ref.2, LDA-GARB was compared with four representative boosting models under \(CV_1\), \(CV_2\), and \(CV_3\): XGBoost51, AdaBoost52, CatBoost53, and LightGBM54. XGBoost is an extreme gradient boosting system, AdaBoost adaptively combines weak learners, CatBoost is a categorical boosting algorithm, and LightGBM combines gradient-based one-side sampling with exclusive feature bundling. The parameters of the four boosting algorithms were set to their defaults.

To evaluate the ability of LDA-GARB on imbalanced data, we compared it with five LDA prediction models, LDA-LNSUBRW55, GAMCLDA56, LDA-VGHB2, LDAGM43, and GANLDA44, whose parameters were set to their defaults. LDA-LNSUBRW55 used an unbalanced bi-random walk for negative LDA selection. GAMCLDA56 employed a cost-sensitive neural network to handle the imbalance between positive and negative LDAs. LDA-VGHB2 is a recent, high-performing LDA identification model.

Performance comparison under \(CV_1\)
To evaluate the performance of LDA-GARB when inferring diseases related to a target lncRNA under \(CV_1\), we randomly selected 80% of lncRNAs for training and the rest for testing. Figure 2A and D show the classification accuracy of LDA-GARB, SDLDA, LDNFSGB, LDAenDL, and LDA-VGHB under \(CV_1\). Figure 3A and B depict their ROC and PR curves on Dataset 1, and Fig. 3G and H depict those on Dataset 2.

Fig. 2 Performance of LDA-GARB and the other four methods. (A–C) Dataset 1 under \(CV_1\), \(CV_2\), and \(CV_3\), respectively. (D–F) Dataset 2 under \(CV_1\), \(CV_2\), and \(CV_3\), respectively.

Fig. 3 The ROC and PR curves of LDA-GARB and the other four methods. (A-B), (C-D), and (E-F) denote the ROC and PR curves under \(CV_1\), \(CV_2\), and \(CV_3\) on Dataset 1, respectively. (G-H), (I-J), and (K-L) denote the ROC and PR curves under \(CV_1\), \(CV_2\), and \(CV_3\) on Dataset 2, respectively.

Table 3 shows their precision, recall, accuracy, F1-score, AUC, and AUPR under \(CV_1\).
From the results in Table 3, we found that LDA-GARB outperformed the other four methods. It achieved the highest recall, accuracy, F1-score, AUC, and AUPR, with AUCs (0.9180 and 0.9716) 3.99% and 1.80% higher than those of LDA-VGHB on Datasets 1 and 2, respectively, and AUPRs (0.9160 and 0.9723) 2.30% and 1.09% higher than those of LDA-VGHB, respectively. In summary, LDA-GARB accurately inferred potential diseases related to a target lncRNA.

Table 3 Performance comparison under \(CV_1\).

Performance comparison under \(CV_2\)
To assess the performance of LDA-GARB when inferring lncRNAs related to a target disease under \(CV_2\), we randomly selected 80% of diseases for training and the rest for testing. Figure 2B and E show the classification accuracy of LDA-GARB, SDLDA, LDNFSGB, LDAenDL, and LDA-VGHB on the two datasets under \(CV_2\). Figure 3C and D depict their ROC and PR curves on Dataset 1 under \(CV_2\), and Fig. 3I and J depict those on Dataset 2 under \(CV_2\).

Table 4 shows the performance of LDA-GARB and the above four baselines under \(CV_2\). From the results, we observed that LDA-GARB exceeded the four baselines. It achieved the highest recall, accuracy, F1-score, and AUC, with AUCs (0.9493 and 0.9817) 3.99% and 1.80% higher than those of LDA-VGHB on Datasets 1 and 2, respectively, and an AUPR of 0.9757 on Dataset 2, 0.30% higher than that of LDA-VGHB. Although the AUPR computed by LDA-GARB was slightly smaller than that of LDA-VGHB on Dataset 1 (0.9415 vs. 0.9429), the difference was very small. In summary, LDA-GARB relatively accurately predicted lncRNAs that may be associated with a disease lacking known lncRNA data.

Table 4 Performance comparison under \(CV_2\).

Performance comparison under \(CV_3\)
To measure the performance of LDA-GARB when inferring potential LDAs under \(CV_3\), we randomly selected 80% of lncRNA-disease pairs for training and the rest for testing. Figure 2C and F show the classification accuracy of LDA-GARB, SDLDA, LDNFSGB, LDAenDL, and LDA-VGHB on the two LDA datasets under \(CV_3\). Figure 3E and F depict their ROC and PR curves on Dataset 1 under \(CV_3\), and Fig. 3K and L depict those on Dataset 2 under \(CV_3\).

Table 5 shows the values of the six indicators under \(CV_3\). The results demonstrate that LDA-GARB surpassed the other four methods. It achieved the highest recall, accuracy, F1-score, and AUC, with AUCs (0.9459 and 0.9790) 1.99% and 0.69% higher than those of LDA-VGHB on Datasets 1 and 2, respectively, and an AUPR of 0.9418 on Dataset 1, 0.57% higher than that of LDA-VGHB. Similar to \(CV_2\), although LDA-GARB computed a slightly lower AUPR than LDA-VGHB on Dataset 2, the difference was tiny. Thus, LDA-GARB can effectively capture new associations from unknown lncRNA-disease pairs.

Table 5 Performance comparison under \(CV_3\).

Performance under different boosting algorithms
LDA-GARB uses a noise-robust gradient boosting model to classify unobserved lncRNA-disease pairs. To validate the LDA classification ability of this robust boosting model, we compared LDA-GARB with XGBoost, AdaBoost, CatBoost, and LightGBM under the three CVs. The results are shown in Tables 6, 7, and 8 and Fig. 4. On both datasets, LDA-GARB achieved the best performance in most cases under all three CVs, clearly outperforming the other four boosting models.
Consequently, the noise-robust gradient boosting model can deliver effective predictions.

Table 6 Performance of different boosting algorithms under \(CV_1\).

Table 7 Performance of different boosting algorithms under \(CV_2\).

Table 8 Performance of different boosting algorithms under \(CV_3\).

Fig. 4 Performance comparison of the other four boosting algorithms under \(CV_1\), \(CV_2\), and \(CV_3\). (A–C) Dataset 1. (D–F) Dataset 2.

Performance on imbalanced data
In the LDA matrix \(\varvec{Y}\), known associations are very few and negative associations are very difficult to obtain. Consequently, most computational LDA prediction models randomly select negative associations from unobserved lncRNA-disease pairs. However, these unlabeled pairs may contain positive samples, which severely affects the model predictions. Thus, researchers have explored machine learning algorithms such as positive-unlabeled learning to select reliable negative associations, or devised more robust models to address the data imbalance issue. In this section, we adopted the noise-robust gradient boosting model to run predictions on imbalanced LDA datasets.

To investigate the LDA classification accuracy of LDA-GARB on imbalanced datasets, we compared it with five representative association prediction methods, i.e., LDA-LNSUBRW55, GAMCLDA56, LDA-VGHB2, LDAGM43, and GANLDA44. LDA-LNSUBRW55 adopted an unbalanced bi-random walk for potential LDA inference. GAMCLDA56 employed graph autoencoder matrix completion to identify new LDAs. LDA-VGHB2 classified unknown lncRNA-disease pairs through a heterogeneous Newton boosting machine. LDAGM43 learned deep topological features based on linkages among lncRNAs, diseases, and miRNAs, and then devised a multi-view heterogeneous network combined with a graph convolutional autoencoder to infer LDAs. GANLDA44 presented a graph attention network to compute the LDA score matrix. Table 9 shows their AUCs and AUPRs on the two LDA datasets under \(CV_3\), and Fig. 5 depicts the corresponding ROC and PR curves. The results show that LDA-GARB clearly surpassed LDA-LNSUBRW, GAMCLDA, LDA-VGHB, LDAGM, and GANLDA, demonstrating its ability to handle imbalanced datasets.

Table 9 Performance comparison of different methods on imbalanced datasets under \(CV_3\).

Fig. 5 Performance comparison of different methods on imbalanced datasets under \(CV_3\). (A, B) Dataset 1. (C, D) Dataset 2.

Sensitivity of parameters
LDA-GARB uses GAE to extract nonlinear features of lncRNAs and diseases, so the embeddings of lncRNAs and diseases are particularly important to LDA prediction performance. We therefore analyzed the impact of the embedding dimension k of lncRNAs and diseases and the number of encoder layers N on the model performance.

Tables 10, 11, 12, 13, 14, 15, 16, 17 and 18 show the performance of LDA-GARB when k was set to 64, 128, and 256 and N was set to 1, 2, 3, 4, and 5 under \(CV_1\), \(CV_2\), and \(CV_3\). Considering the performance of LDA-GARB across the different embedding dimensions and encoder layer numbers, we found that LDA-GARB obtained relatively good performance when \(k=64\) and \(N=1\).
Therefore, we set \(k=64\) and \(N=1\).

In addition, the noise-robust gradient boosting model uses a robust focal loss and is therefore highly robust to label noise and data imbalance. Moreover, during training, the model automatically optimizes its parameters according to the proportion of noise and uses the optimized parameters for testing. Therefore, we did not further analyze the impact of these model parameters on performance; the related parameter settings are listed in Table 2.

Table 10 Performance of LDA-GARB with \(k=64\) and different N on two datasets under \(CV_1\).

Table 11 Performance of LDA-GARB with \(k=64\) and different N on two datasets under \(CV_2\).

Table 12 Performance of LDA-GARB with \(k=64\) and different N on two datasets under \(CV_3\).

Table 13 Performance of LDA-GARB with \(k=128\) and different N on two datasets under \(CV_1\).

Table 14 Performance of LDA-GARB with \(k=128\) and different N on two datasets under \(CV_2\).

Table 15 Performance of LDA-GARB with \(k=128\) and different N on two datasets under \(CV_3\).

Table 16 Performance of LDA-GARB with \(k=256\) and different N on two datasets under \(CV_1\).

Table 17 Performance of LDA-GARB with \(k=256\) and different N on two datasets under \(CV_2\).

Table 18 Performance of LDA-GARB with \(k=256\) and different N on two datasets under \(CV_3\).

Ablation study
LDA-GARB extracts LDA linear features through NMF and nonlinear features through GAE. To measure the effect of different feature selection methods on LDA prediction performance, we conducted ablation experiments. Tables 19, 20, and 21 and Fig. 6 give the performance of LDA-GARB with linear features only, nonlinear features only, and their combination under \(CV_1\), \(CV_2\), and \(CV_3\), respectively. Under most conditions, LDA-GARB with both types of features outperformed LDA-GARB with only linear features and LDA-GARB with only nonlinear features. Thus, combining the two types of features helps improve LDA prediction.

Table 19 Performance when using different feature selection methods under \(CV_1\).

Table 20 Performance when using different feature selection methods under \(CV_2\).

Table 21 Performance when using different feature selection methods under \(CV_3\).

Fig. 6 Performance when using different feature selection methods under \(CV_1\), \(CV_2\), and \(CV_3\). (A–C) Dataset 1. (D–F) Dataset 2.

Case study
CRC and breast cancer are two cancers that severely affect human health, and identifying potential lncRNAs for them helps their diagnosis and therapy. Having validated the performance of LDA-GARB through the above experiments, we adopted it to infer associated lncRNAs for CRC and breast cancer.

Predicting associated lncRNAs for CRC
CRC is one of the most frequent cancers worldwide, and its incidence has recently increased rapidly in patients younger than 50 years57. Thus, we predicted potential lncRNAs for CRC. As shown in Table 22 and Fig. 7a, we predicted the top 20 lncRNAs that could be associated with CRC on Dataset 1. Among these 20 lncRNAs, 13 have been verified to be associated with CRC. In particular, we predicted that the lncRNA CCDC26 could be associated with CRC.
CCDC26 is a novel biomarker58 and can inhibit myeloid leukemia cells59. Its silencing suppresses the growth and migration of glioma cells60, and its downregulation contributes to imatinib resistance in gastrointestinal stromal tumors61.

Table 22 The predicted top 20 lncRNAs associated with CRC on Dataset 1.

Identifying new lncRNAs for breast cancer
Breast cancer62 is the most frequent cancer among women, with an estimated 2.3 million new cases and more than 666,000 deaths in 202263. Over the past two decades, breast cancer survival rates have improved markedly, but its incidence has continued to rise. Thus, effective therapy remains an essential problem.

As shown in Table 23 and Fig. 7b, we inferred the top 20 lncRNAs that could be associated with breast cancer on Dataset 2. Among these 20 lncRNAs, 12 have been verified to be associated with breast cancer. Based on the rankings in Table 23 and Fig. 7, we predicted that HAR1A may be associated with breast cancer. HAR1A can inhibit non-small cell lung cancer progression64, regulate oral cancer development65, and affect brain development66. The association between breast cancer and HAR1A needs further validation.

Table 23 The predicted top 20 lncRNAs associated with breast cancer on Dataset 2.

Fig. 7 (a) The top 20 potential lncRNAs for CRC on Dataset 1. (b) The top 20 potential lncRNAs for breast cancer on Dataset 2. Solid lines denote predicted LDAs that have been validated; dashed lines denote predicted LDAs that have not yet been validated.

Discussion
lncRNAs are closely associated with many important physiological processes and have been regarded as potential biomarkers of cancers. Identifying potential LDAs helps us better understand the complex molecular mechanisms of human diseases, find new biomarkers, and further facilitate disease diagnosis and therapy.

In this manuscript, we proposed a computational framework called LDA-GARB for LDA prediction. LDA-GARB first calculated disease similarity based on semantic features and Gaussian association profile kernel (GAPK) similarity, and lncRNA similarity based on functional information and GAPK similarity. Subsequently, LDA-GARB extracted linear features of lncRNAs and diseases through NMF and nonlinear features from the similarity matrices through GAE. Finally, LDA-GARB took the extracted features as inputs and designed a noise-robust gradient boosting model to decipher potential associations from unknown lncRNA-disease pairs.

To ascertain the performance of LDA-GARB, we conducted multiple comparison experiments. First, LDA-GARB was compared with four representative LDA prediction methods, SDLDA, LDNFSGB, LDAenDL, and LDA-VGHB, under three distinct CVs. LDNFSGB and LDA-VGHB are boosting-based LDA classification models, while SDLDA and LDAenDL are deep learning-based LDA inference algorithms. LDA-GARB clearly outperformed both the boosting-based and the deep learning-based methods, demonstrating its better LDA inference accuracy and feature learning ability.

Next, LDA-GARB was compared with four boosting models, i.e., XGBoost, AdaBoost, CatBoost, and LightGBM, under the three CVs. XGBoost is a scalable end-to-end extreme gradient boosting system, AdaBoost performs highly accurate prediction by integrating multiple weak learners, CatBoost is an unbiased model for categorical feature learning, and LightGBM is a highly efficient GBDT. The four methods are classical and widely used boosting algorithms.
LDA-GARB surpassed the four models, elucidating the powerful LDA classification ability of the noise-robust gradient boosting model.

After that, to analyze the performance of LDA-GARB on imbalanced data, it was compared with LDA-LNSUBRW, GAMCLDA, LDA-VGHB, LDAGM, and GANLDA, which utilize an unbalanced bi-random walk, graph autoencoder matrix completion, a heterogeneous Newton boosting machine, a multi-view heterogeneous network, and a graph attention network, respectively. LDA-GARB outperformed the five models and exhibited better classification ability on imbalanced datasets. Finally, to discern the effect of the proposed feature extraction techniques on predictions, we also conducted ablation experiments. The outcomes indicated that combining NMF-based linear feature extraction and GAE-based nonlinear feature extraction improved LDA prediction.

CRC and breast cancer are two of the most frequent cancers worldwide. After determining its performance, LDA-GARB was applied to predict possible lncRNAs for CRC and breast cancer. The results showed that CRC could be associated with the lncRNA CCDC26 and that breast cancer may be associated with HAR1A, providing new potential biomarkers for the two cancers.

LDA-GARB demonstrated two main advantages when deciphering possible LDAs. (i) It can effectively reduce the effect of label noise on predictions. Machine learning requires negative samples when predicting LDAs. However, current data resources do not provide negative samples due to experimental limitations, so most machine learning-based LDA prediction models have to obtain negative associations from unlabeled lncRNA-disease pairs through random selection. These unlabeled pairs may contain a handful of positive LDAs, which introduces label noise and severely affects model performance. To solve this issue, LDA-GARB adopted a noise-robust gradient boosting algorithm to alleviate the effect of label noise on LDA prediction. (ii) It is better suited to imbalanced LDA datasets. Current LDA datasets are imbalanced, whereas existing boosting models perform poorly on imbalanced datasets. To address this issue, LDA-GARB used a non-convex loss function and exhibited strong adaptability to imbalanced LDA datasets.

Although LDA-GARB produced better predictions, it still has limitations. During LDA prediction, we need to learn feature vectors of lncRNAs and diseases from their biological information. However, several diseases have no directed acyclic graphs, so we cannot compute their semantic similarity from their MeSH descriptors. We therefore had to use GAPK to measure their similarity and extract their feature vectors from the GAPK similarity matrix, an approach widely applied in disease-related association prediction. However, GAPK similarity is computed from association information, which may cause data leakage during testing. Data leakage is a common issue in disease-related association tasks and urgently needs to be solved. Text mining techniques can effectively capture information hidden in unstructured text data. Therefore, in the future, we will design a text mining algorithm to obtain semantic features for all diseases, especially those without directed acyclic graphs, from diverse health and medical literature.
By doing so, we can effectively avoid data leakage during LDA prediction and boost the performance of various association prediction models.

Conclusion
In this manuscript, we presented a computational model called LDA-GARB for identifying potential LDAs by integrating NMF, GAE, and a noise-robust gradient boosting model. Compared to four state-of-the-art LDA identification methods, four classical boosting models, and five algorithms designed for imbalanced data, LDA-GARB computed better predictions under three distinct CVs (i.e., \(CV_1\), \(CV_2\), and \(CV_3\)). Moreover, LDA-GARB inferred that the lncRNAs CCDC26 and HAR1A could be associated with CRC and breast cancer, respectively, and may be their biomarkers, providing clues for the treatment of the two cancers. We anticipate that LDA-GARB, as a useful computational tool for identifying potential lncRNAs for human diseases, can help find new biomarkers for various complex diseases and further promote their diagnosis and therapy.

Materials and methods
LDA-GARB mainly contains two procedures. (i) LDA feature extraction. First, LDA-GARB employs NMF and LDA information to extract linear features of each lncRNA and disease. Next, LDA-GARB computes disease similarity based on semantic features and Gaussian association profile kernel (GAPK) similarity, and lncRNA similarity based on functional information and GAPK similarity. By combining disease similarity and lncRNA similarity, LDA-GARB uses a GAE model to extract nonlinear features of lncRNAs and diseases. The extracted linear and nonlinear features of each lncRNA are concatenated into a vector that depicts the lncRNA; similarly, the learned linear and nonlinear features of each disease are concatenated into a vector that depicts the disease. The concatenation of lncRNA features and disease features is then used to characterize each lncRNA-disease pair. (ii) LDA classification. LDA-GARB takes the obtained feature vectors as input and devises a noise-robust gradient boosting model to perform predictions.

Linear feature extraction based on NMF
NMF can effectively reduce feature dimensionality under non-negativity constraints42,67. Here, we adopt NMF to learn linear representations of lncRNAs and diseases. First, we decompose the LDA matrix \(\varvec{Y}\) into two low-rank matrices \(\varvec{U}\in R^{n\times s}\) and \(\varvec{V}\in R^{s\times m}\). Next, to make \(\varvec{U}\) and \(\varvec{V}\) smoother, we add a weight matrix \(\varvec{W}\in R^{n\times m}\) and perform \(L_2\) regularization. Thus, we build an objective function with regularization parameters \(\lambda _{1}\) and \(\lambda _{2}\) to learn lncRNA linear features \(\varvec{U}\) and disease linear features \(\varvec{V}\) by Eq. (1):$$\begin{aligned} \min _{\varvec{U} \ge 0, \varvec{V} \ge 0}\Vert \varvec{W} \odot (\varvec{Y}-\varvec{U} \varvec{V})\Vert _{F}^{2}+\lambda _{1}\Vert \varvec{U}\Vert _{F}^{2}+\lambda _{2}\Vert \varvec{V}\Vert _{F}^{2} \end{aligned}$$(1)where \(\odot\) is the Hadamard product, \(\varvec{U}\ge 0\), and \(\varvec{V} \ge 0\).
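As an illustration of Eq. (1), the sketch below solves the weighted, \(L_2\)-regularized NMF with standard multiplicative update rules; the update scheme, iteration count, and names (weighted_nmf, lam1, lam2) are assumptions for illustration rather than the exact optimizer used by LDA-GARB.

```python
import numpy as np

def weighted_nmf(Y, W, s=64, lam1=0.01, lam2=0.01, n_iter=500, eps=1e-9, seed=0):
    """Weighted NMF with L2 regularization, solved by multiplicative updates (illustrative).

    Y : (n, m) binary lncRNA-disease association matrix
    W : (n, m) weight matrix
    s : rank of the factorization
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = rng.random((n, s))
    V = rng.random((s, m))
    WY = W * Y
    for _ in range(n_iter):
        WUV = W * (U @ V)
        U *= WY @ V.T / (WUV @ V.T + lam1 * U + eps)   # update lncRNA factors
        WUV = W * (U @ V)
        V *= U.T @ WY / (U.T @ WUV + lam2 * V + eps)   # update disease factors
    return U, V  # rows of U: lncRNA linear features; columns of V: disease linear features

# toy usage on a random association matrix with uniform weights
Y = (np.random.default_rng(1).random((92, 157)) < 0.05).astype(float)
U, V = weighted_nmf(Y, W=np.ones_like(Y), s=16)
```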
Nonlinear feature extraction based on GAE and similarity computation
Similarity computation
To extract nonlinear features of lncRNAs and diseases with GAE, we need to compute disease similarity and lncRNA similarity. First, we compute disease semantic similarity \(\varvec{S}_d^{sem}\) using the IDSSIM method68 based on MeSH descriptors. Since the MeSH database does not provide directed acyclic graphs for some diseases, we cannot measure their similarity from directed acyclic graphs, so we adopt GAPK to compute their similarity. Specifically, for diseases \(d_i\) and \(d_j\), their association profiles are represented as \({\varvec{Y}}_{.i}\) (the i-th column of \(\varvec{Y}\)) and \({\varvec{Y}}_{.j}\) (the j-th column of \(\varvec{Y}\)), respectively, and their GAPK similarity is defined by Eq. (2):$$\begin{aligned} \begin{aligned} \varvec{G}_d(i, j)&=\exp \left( -\theta _{d}\left\| {\varvec{Y}}_{.i}-{\varvec{Y}}_{.j}\right\| ^{2}\right) \\ \theta _{d}&=\frac{1}{m} \sum _{i=1}^{m}\Vert {\varvec{Y}}_{.i}\Vert ^{2} \end{aligned} \end{aligned}$$(2)Consequently, the disease similarity matrix \(\varvec{D}\) is built by integrating the semantic similarity and GAPK similarity by Eq. (3):$$\begin{aligned} \varvec{D}\left( i, j\right) =\left\{ \begin{array}{cl} \frac{\varvec{S}_d^{sem}\left( i, j\right) + \varvec{G}_d\left( i, j\right) }{2} & \text{ if } \varvec{S}_d^{sem} \left( i, j\right) \ne 0 \\ \\ \varvec{G}_d\left( i, j\right) & \text{ otherwise } \end{array}\right. \end{aligned}$$(3)lncRNA similarity is computed based on functional similarity and GAPK similarity2,42. lncRNA functional similarity \(\varvec{S}_l^{fun}\) is measured through the IDSSIM method68 based on \(\varvec{S}_d^{sem}\). The lncRNA GAPK similarity matrix \(\varvec{G}_l\) is calculated by Eq. (4):$$\begin{aligned} \begin{aligned} \varvec{G}_l(i, j)&=\exp \left( -\theta _{l}\left\| {\varvec{Y}}_{i.}-{\varvec{Y}}_{j.}\right\| ^{2}\right) \\ \theta _{l}&=\frac{1}{n} \sum _{i=1}^{n}\Vert {\varvec{Y}}_{i.}\Vert ^{2} \end{aligned} \end{aligned}$$(4)where \({\varvec{Y}}_{i.}\) and \({\varvec{Y}}_{j.}\) denote the i-th and j-th rows of \(\varvec{Y}\), respectively. Consequently, the lncRNA similarity matrix \(\varvec{L}\) is built by incorporating the functional similarity and GAPK similarity by Eq. (5):$$\begin{aligned} \varvec{L}\left( i, j\right) =\left\{ \begin{array}{cl} \frac{\varvec{S}_l^{fun}\left( i, j\right) + \varvec{G}_l\left( i, j\right) }{2} & \text{ if } \varvec{S}_l^{fun} \left( i, j\right) \ne 0 \\ \\ \varvec{G}_l\left( i, j\right) & \text{ otherwise. } \end{array}\right. \end{aligned}$$(5)
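The similarity construction in Eqs. (2)-(5) can be expressed compactly in numpy. The sketch below follows the bandwidth definition exactly as stated above and uses random placeholders for the IDSSIM similarities; the helper names (gapk, fuse_similarity) are illustrative and not taken from the LDA-GARB code.

```python
import numpy as np

def gapk(profiles):
    """GAPK similarity between the rows of `profiles` (cf. Eqs. (2) and (4)).

    For diseases, pass the columns of Y as rows (Y.T); for lncRNAs, pass Y itself.
    The bandwidth theta follows the definition given in the text above.
    """
    theta = np.mean(np.sum(profiles ** 2, axis=1))
    sq_dist = np.sum((profiles[:, None, :] - profiles[None, :, :]) ** 2, axis=2)
    return np.exp(-theta * sq_dist)

def fuse_similarity(S_bio, G):
    """Eqs. (3) and (5): average biological and GAPK similarity where the former exists."""
    return np.where(S_bio != 0, (S_bio + G) / 2.0, G)

# toy usage with a random association matrix and placeholder biological similarities
rng = np.random.default_rng(0)
Y = (rng.random((92, 157)) < 0.05).astype(float)
S_d_sem = rng.random((157, 157))                     # placeholder for IDSSIM semantic similarity
S_l_fun = rng.random((92, 92))                       # placeholder for IDSSIM functional similarity
D = fuse_similarity(S_d_sem, gapk(Y.T))              # disease similarity matrix (Eq. (3))
L = fuse_similarity(S_l_fun, gapk(Y))                # lncRNA similarity matrix (Eq. (5))
```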
Nonlinear feature extraction
GAE is a graph neural network model that can effectively learn graph embedding features69. Here, we extract nonlinear features of lncRNAs and diseases through the following five steps.

Step 1 Bipartite graph construction
First, a bipartite graph is constructed from the known LDA matrix \(\varvec{Y}\). The graph contains two types of nodes, i.e., lncRNAs and diseases; node features are represented by the corresponding similarity matrices, and each edge represents an association between an lncRNA and a disease.

Step 2 Feature projection
Both lncRNAs and diseases are projected into vector spaces of the same dimension through a linear transformation matrix \(\textrm{Q}_{\varnothing }\). Taking lncRNA nodes as an example, lncRNAs are projected into a k-dimensional vector space by Eq. (6):$$\begin{aligned} \textrm{Q}_{l}=\textrm{Q}_{\varnothing _{l}} \cdot \textrm{T}_{l} \end{aligned}$$(6)where \(\textrm{Q}_{l}\), \(\textrm{T}_{l}\), and \(\textrm{Q}_{\varnothing _{l}}\) denote the projected lncRNA features, the lncRNA similarity matrix, and the linear transformation matrix, which is solved by minimizing the loss function defined later in Eq. (10), respectively. Similarly, diseases are projected into a k-dimensional vector space.

Step 3 Feature aggregation
An encoder is employed to yield the embeddings of lncRNAs and diseases by combining neighborhood node information. Given an lncRNA \(l_{i}\), the aggregation \({Q_{l_{i}}^c}\) of the features of its direct neighbors \(\left\{ d_{1}, d_{2},... \right\}\) is computed by an aggregate function \(f(\cdot )\) defined by Eq. (7):$$\begin{aligned} {Q_{l_{i}}^{c}}=\frac{1}{D_{l_{i}}}f\left( Q_{d_{1}},Q_{d_{2}},\dots \right) \end{aligned}$$(7)where we usually use \(sum(\cdot )\) as the aggregator and \(D_{l_{i}}\) indicates the degree of \(l_{i}\).

Step 4 Feature concatenation
The aggregated features \(Q_{l_{i}}^{c}\) in Eq. (7) and the projected features \(Q_{l_{i}}\) in Eq. (6) are concatenated to update the features \(Q_{l_{\textrm{i}}}^{\prime }\) of \(l_{i}\) through a multi-layer perceptron by Eq. (8):$$\begin{aligned} Q_{l_{\textrm{i}}}^{\prime }=\textrm{LeakyReLU}\left( g\left( Q_{l_{\textrm{i}}}\bigoplus Q_{l_{\textrm{i}}}^{c}\right) \right) \end{aligned}$$(8)where \(\bigoplus\) denotes the concatenation operation and \(g(\cdot )\) is a multi-layer perceptron layer with LeakyReLU\((\cdot )\) and k outputs. The features \(Q_{d_{\textrm{i}}}^{\prime }\) of diseases are updated similarly.

Step 5 lncRNA and disease embedding learning
To incorporate abundant neighbor features and boost the model classification ability, we use an encoder that stacks N graph neural network layers to obtain the final embeddings (\(Q_{l}^{N}\) and \(Q_{d}^{N}\)) of lncRNAs and diseases. Subsequently, a bilinear decoder reconstructs the input graph through the association score \(\hat{y}_{{ij}}\) between \(l_{i}\) and \(d_{j}\) by Eq. (9):$$\begin{aligned} \hat{y}_{i j}=\operatorname {sigmoid}\left( Q_{d_{j}}^{N} H\left( Q_{l_{i}}^{N}\right) ^{T}\right) \end{aligned}$$(9)where H is a \(k\times k\) parameter matrix. Consequently, we obtain the nonlinear features \(Q_{l}^{N}\) of lncRNAs and \(Q_{d}^{N}\) of diseases.

During nonlinear feature learning, a cross-entropy loss is used to evaluate whether the model effectively encodes LDA features and accurately reconstructs the input graph by Eq. (10):$$\begin{aligned} Loss=-\sum _{i, j \in \mathscr {Y}^{+} \cup \mathscr {Y}^{-}}\left( y_{i j} \log \hat{y}_{i j}+\left( 1-y_{i j}\right) \log \left( 1-\hat{y}_{i j}\right) \right) \end{aligned}$$(10)where \({y}_{i j}\) denotes the known relationship between \(l_{i}\) and \(d_{j}\), and \(\mathscr {Y}^{+}\) and \(\mathscr {Y}^{-}\) indicate positive and negative LDAs, respectively. By minimizing the loss function defined by Eq. (10), we can solve the linear transformation matrix in Eq. (6).
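To make Steps 1-5 concrete, the following numpy sketch runs a single forward pass of a one-layer encoder (Eqs. (6)-(8)) and the bilinear decoder (Eq. (9)) with randomly initialized parameters. It is a simplified illustration under the assumptions of sum aggregation and \(N=1\), not the trained LDA-GARB encoder; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 92, 157, 64                           # lncRNAs, diseases, embedding dimension

Y = (rng.random((n, m)) < 0.05).astype(float)   # association matrix
L_sim = rng.random((n, n))                      # lncRNA similarity (placeholder)
D_sim = rng.random((m, m))                      # disease similarity (placeholder)

# Eq. (6): project similarity-based node features into a k-dimensional space
Q_phi_l, Q_phi_d = rng.normal(size=(k, n)), rng.normal(size=(k, m))
Q_l = (Q_phi_l @ L_sim).T                       # (n, k) projected lncRNA features
Q_d = (Q_phi_d @ D_sim).T                       # (m, k) projected disease features

# Eq. (7): sum-aggregate neighbor features, normalized by node degree
deg_l = np.maximum(Y.sum(axis=1, keepdims=True), 1.0)
deg_d = np.maximum(Y.sum(axis=0, keepdims=True).T, 1.0)
Q_l_c = (Y @ Q_d) / deg_l                       # (n, k)
Q_d_c = (Y.T @ Q_l) / deg_d                     # (m, k)

# Eq. (8): concatenate projected and aggregated features, apply one MLP layer + LeakyReLU
def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

W_l, W_d = rng.normal(size=(2 * k, k)), rng.normal(size=(2 * k, k))
Q_l_new = leaky_relu(np.concatenate([Q_l, Q_l_c], axis=1) @ W_l)   # (n, k)
Q_d_new = leaky_relu(np.concatenate([Q_d, Q_d_c], axis=1) @ W_d)   # (m, k)

# Eq. (9): bilinear decoder reconstructs association scores (entry [j, i] is y_hat_ij)
H = rng.normal(size=(k, k))
scores = 1.0 / (1.0 + np.exp(-(Q_d_new @ H @ Q_l_new.T)))          # (m, n)
```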
LDA prediction
Gradient boosting decision tree models, such as XGBoost and LightGBM, effectively combine powerful learners with optimization methods, thereby improving classification accuracy, accelerating training, and enhancing the ability to handle intricate datasets. These models overcome the computational efficiency limitations of earlier boosting models and have therefore become some of the most efficient classification tools and preferred choices for practical problems. However, the cross-entropy loss they use is non-symmetric and unbounded; as a result, they are sensitive to label noise, and the effect of noise is amplified70. To address these problems, we devise a noise-robust gradient boosting model for LDA classification.

For an LDA dataset \(D = \left\{ (\varvec{x}_i, y'_i) \right\} _{i=1}^{n}\), \(\varvec{x}_i\) and \(y'_i\) denote the i-th training sample (i.e., lncRNA-disease pair) and its label. \(\varvec{x}_i\) is represented as a feature vector by concatenating the linear and nonlinear features of the lncRNA and the linear and nonlinear features of the disease. \(y'_i=1\) when the i-th pair has a link, and \(y'_i=0\) otherwise. As shown in Algorithm 1, we perform predictions through the following five steps.

Algorithm 1 LDA classification algorithm.

Step 1 Model initialization
Let \(f_{t+1}\) denote a new decision tree, \(z_{i}^{t}=z_{i}^{0}+\alpha {\textstyle \sum _{j=1}^{t}} f_{j}(x_{i})\) denote the model's raw prediction with the initial prediction \(z_{i}^{0}\), and \(p_{i}^{t+1}=S(z_{i}^{t+1})=1/(1+e^{-z_{i}^{t+1}})\) be computed through the sigmoid function. In particular, \(z_{i}^{0}\) is set to zero (for all \(i=1,2,\cdots , n\)) to reduce the impact of the initialization on the final prediction.

Step 2 Residual calculation
Let \(l(y'_i, p)\) denote the loss function for LDA classification, where p is the probability that the i-th lncRNA-disease pair is labeled as the positive class. Consequently, \(\lim _{p \rightarrow 0}l(0, p)=0\) and \(\lim _{p \rightarrow 1}l(1, p)=0\). At the \({(t+1)}\)-th iteration (\(t \ge 0\)), the objective function with learning rate \(\alpha\) is defined as Eq. (11):$$\begin{aligned} \mathscr {L}^{t+1}=\sum _{i=1}^{n}l(y'_{i},p_{i}^{t+1})=\sum _{i=1}^nl(y'_i,S(z_i^{t+1}))=\sum _{i=1}^{n}l(y'_{i},S(z_{i}^{t}+\alpha f_{t+1}(\textbf{x}_{i}))) \end{aligned}$$(11)For convenience, we define the probability \(\hat{p}\) of the ground-truth class as$$\begin{aligned} \hat{p}={\left\{ \begin{array}{ll}p,& y'=1\\ 1-p,& y'=0\end{array}\right. } \end{aligned}$$(12)Consequently, \(l(y', p)\) is written as \(l(1,\hat{p})\) (\(l(\hat{p})\) for simplicity) and \(\lim _{\hat{p} \rightarrow 1}l(\hat{p})=0,\forall y'\in \left\{ 0,1\right\}\).

Step 3 Calculating the gradient and Hessian
Newton's method is adopted to optimize the regularized objective (13), with regularization term \(\Omega (f_{t+1} )\), for the boosting model:$$\begin{aligned} \widetilde{\mathscr {L}}^{t+1}=\sum _{i=1}^n[g_i^tf_{t+1}(\textbf{x}_i)+\frac{1}{2}h_i^tf_{t+1}(\textbf{x}_i)^2]+\Omega (f_{t+1}) \end{aligned}$$(13)where \(g_{i}^{t}\) and \(h_{i}^{t}\) denote the gradient and the Hessian, respectively. They are defined by Eq. (14):$$\begin{aligned} \begin{aligned}&g_i^t=\frac{\partial l}{\partial z_i^t}=\frac{\partial l}{\partial \hat{p}_i^t}\frac{\partial \hat{p}_i^t}{\partial p_i^t}\frac{\partial p_i^t}{\partial z_i^t}=\frac{\partial l}{\partial \hat{p}_i^t}(2y^{'}_i-1)(\hat{p}_i^t(1-\hat{p}_i^t))\\&h_i^t=\frac{\partial ^2l}{\partial (z_i^t)^2}=\frac{\partial ^2l}{\partial (\hat{p}_i^t)^2}(\hat{p}_i^t(1-\hat{p}_i^t))^2+\frac{\partial l}{\partial \hat{p}_i^t}(\hat{p}_i^t(1-\hat{p}_i^t)(1-2\hat{p}_i^t)) \end{aligned} \end{aligned}$$(14)and$$\begin{aligned} \frac{\partial p_i^t}{\partial z_i^t}=p_i^t(1-p_i^t)=\hat{p}_i^t(1-\hat{p}_i^t) \end{aligned}$$(15)
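The gradient and Hessian in Eqs. (14)-(15) are exactly what a GBDT library needs to grow trees on a custom loss. The sketch below plugs them into XGBoost's custom-objective interface, using a focal-style loss \(l(\hat{p})=-(1-\hat{p})^{\gamma }\log \hat{p}\) as a stand-in for the robust loss of LDA-GARB, whose exact form may differ; the parameter \(\gamma\) and all function names are illustrative.

```python
import numpy as np
import xgboost as xgb

GAMMA = 2.0  # focusing parameter of the stand-in focal loss (illustrative)

def focal_dl(q):
    """First derivative dl/dq of l(q) = -(1 - q)^gamma * log(q)."""
    return GAMMA * (1 - q) ** (GAMMA - 1) * np.log(q) - (1 - q) ** GAMMA / q

def focal_d2l(q):
    """Second derivative d2l/dq2 of the same loss."""
    return (-GAMMA * (GAMMA - 1) * (1 - q) ** (GAMMA - 2) * np.log(q)
            + 2 * GAMMA * (1 - q) ** (GAMMA - 1) / q
            + (1 - q) ** GAMMA / q ** 2)

def robust_obj(predt, dtrain):
    """XGBoost custom objective: Eq. (14) applied to the stand-in robust loss."""
    y = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-predt))                              # p_i = S(z_i)
    p_hat = np.clip(np.where(y == 1, p, 1 - p), 1e-6, 1 - 1e-6)   # ground-truth-class probability
    s = p_hat * (1 - p_hat)                                       # dp/dz, Eq. (15)
    grad = focal_dl(p_hat) * (2 * y - 1) * s
    hess = focal_d2l(p_hat) * s ** 2 + focal_dl(p_hat) * s * (1 - 2 * p_hat)
    return grad, np.maximum(hess, 1e-6)                           # keep the Hessian positive

# toy usage: X would be the concatenated NMF/GAE features, y the pair labels
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 32)), rng.integers(0, 2, size=1000)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain, num_boost_round=100, obj=robust_obj)
```

Clipping the Hessian to a small positive value is a common practical safeguard when a non-convex loss makes it negative on some samples.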
Step 4 Calculating the optimal weight of the leaves
For a decision tree with a fixed structure, \(f_{t+1}(\textbf{x})\) is written as \(f_{t+1}(\textbf{x})=\sum _{j=1}^Jw_jI_j\), where \(I_j\) denotes the instance set of leaf j. Supposing \(\Omega (f_{t+1})=\frac{1}{2}\lambda \sum _{j=1}^{J}w_{j}^{2}\) with \(\lambda \ge 0\), the optimal objective is rewritten as Eq. (16):$$\begin{aligned} \widetilde{\mathscr {L}}^{t+1} =\sum _{i=1}^n[g_i^tf_{t+1}(\textbf{x}_i)+\frac{1}{2}h_i^tf_{t+1}(\textbf{x}_i)^2]+\Omega (f_{t+1})=\sum _{j=1}^J[(\sum _{i\in I_j}g_i^t)w_j+\frac{1}{2}(\sum _{i\in I_j}h_i^t+\lambda )w_j^2] \end{aligned}$$(16)Since \(f_{t+1}(\textbf{x})\) is separable across leaves, the optimal weight \(w_{j}^{*}\) of leaf j can be computed by Eq. (17):$$\begin{aligned} w_j^*=\frac{-\sum _{i\in I_j}g_i^t}{\sum _{i\in I_j}h_i^t+\lambda } \end{aligned}$$(17)and the corresponding optimal objective is given by Eq. (18):$$\begin{aligned} \widetilde{\mathscr {L}}_j^*=-\frac{1}{2}\frac{(\sum _{i\in I_j}g_i^t)^2}{\sum _{i\in I_j}h_i^t+\lambda } \end{aligned}$$(18)

Step 5 Finding the best tree structure
We use the information gain to decide whether the tree will be grown and to identify the split feature and split value by Eq. (19):$$\begin{aligned} gain=\dfrac{1}{2}\Bigg [\dfrac{(\sum _{i\in I_L}g_i)^2}{\sum _{i\in I_L}h_i+\lambda }+\dfrac{(\sum _{i\in I_R}g_i)^2}{\sum _{i\in I_R}h_i+\lambda }-\dfrac{(\sum _{i\in I}g_i)^2}{\sum _{i\in I}h_i+\lambda }\Bigg ] \end{aligned}$$(19)where \(I_L\) and \(I_R\) denote the left and right node instance sets after the split, respectively, and \(I = I_L \cup I_R\) denotes the sample set of the parent node. \(I_L\), \(I_R\), and I are used to limit leaf splitting and alleviate the overfitting risk in the following three situations:

If the number of samples in a leaf is not large enough, the leaf split is stopped.

If the sum of Hessians \(\sum _{i\in I_{j}}h_{i}^{t}\) within a leaf is less than a small threshold \(\epsilon\), i.e., \(\sum _{i\in I_{j}}h_{i}^{t} < \epsilon\), the leaf split is stopped.