Machine learning-driven multi-targeted drug discovery in colon cancer using biomarker signatures

Introduction

Colon cancer (CC) is one of the most frequently diagnosed diseases and the second leading cause of cancer-related deaths globally [1]. An estimated 1.93 million new CC cases were diagnosed in 2020, making up 10% of all cancer cases globally [2]. The increasing number of CC reports worldwide is linked to efficient screening and monitoring programs that are being implemented widely and quickly [3]. However, CC mortality remains high, with 0.94 million deaths from the disease recorded in 2020, accounting for 9.4% of all cancer-related fatalities worldwide. Although effective screening and CC reduction initiatives have led to a general increase in detection, a rise in CC diagnoses has been observed in developing or emerging nations, as well as among younger people (below 50 years old) in industrialized nations [4]. The Fecal Occult Blood Test (FOBT), a crucial CC screening tool, combines the guaiac FOBT with an intestinal inflammatory examination [5]. Colonoscopy is regarded as the standard procedure for CC screening, with benefits such as high specificity, sensitivity, and accessibility, and it plays an essential role in detecting tumors and malignant lesions [6]. Imaging-based examination comprises endorectal ultrasonography, Computed Tomography (CT), abdominal ultrasonography, and Nuclear Magnetic Resonance (NMR). Nevertheless, these approaches are only successful for severe localized lesions [7]. Cancer markers are increasingly used in the diagnosis and treatment of cancer. A marker must be highly accurate for tumor screening, diagnosis, evaluation of efficacy and prognosis, detection of recurrence, and other applications, and it must be able to detect tiny lesions and reflect them quantitatively [8]. Cancer staging describes how far cancer has spread through the body. It aids in determining the severity of the cancer and the best treatment options, and doctors also use it in survival statistics [9].
Cancer is classified into five stages: 0, I, II, III, and IV. The stage reflects the location and size of the cancer, how far it has grown into neighboring tissue, whether it has reached nearby lymph nodes, and whether indicators of metastasis are present [10]. For CC discovered at its earliest stage (stage I), the 5-year survival rate for individuals aged 18 to 65 is 91% with adequate therapy. Integrating biomarkers with imaging modalities considerably improves the precision of identifying CC liver metastasis. Despite improvements in prognostic markers such as Carcinoembryonic Antigen (CEA) and Cancer Antigen 125 (CA125), and in screening techniques such as colonoscopy, tumor heterogeneity and low biomarker sensitivity still hinder early identification and prognosis [11]. While serum indicators like CEA are useful in diagnosing CC, their limitations reduce their effectiveness in detecting hepatic metastases. Thus, it is critical to investigate new biomarkers to improve diagnostic accuracy and clinical outcomes in patients. Cancer screening employs a variety of biomarkers, including Deoxyribonucleic Acid (DNA), protein, and Ribonucleic Acid (RNA) biomarkers. Transcriptional biomarkers are a promising class that captures changes in the quantities of RNA molecules transcribed from DNA, such as mRNAs, microRNAs, long non-coding RNAs, and circular RNAs. They are non-invasive and highly sensitive, making them an excellent tool for early detection and monitoring of numerous malignancies. Cancer initiation, progression, and metastasis are all shaped by intricate transcriptome changes. The onset and development of CC are significantly influenced by transcriptomic and epigenomic changes, and large-scale molecular profiling has been made possible by high-throughput technologies like microarrays and Next-Generation Sequencing (NGS).
These datasets, which are kept in databases like TCGA and GEO, make it easier to find biomarkers and build computational models. Transcriptome data processing is multifaceted and time-consuming, requiring bioinformatics and statistical expertise. Conventional approaches for interpreting transcriptome data rely on manual search and interpretation; they are costly and unsuitable for the massive quantities of data produced by current sequencing equipment. The goal of this research is to build a computational oncology framework that integrates high-dimensional molecular data to identify multi-targeted therapeutic strategies for CC. Recent advancements in bioinformatics and Machine Learning (ML) provide scalable methods for extracting predictive features from high-dimensional data. However, because of noise and data imbalance, problems with feature selection, parameter tuning, and precise classification persist. The research aims to enhance biomarker discovery, improve drug response prediction, and personalize treatment plans. The proposed method can deliver quick and precise evaluations that would enable doctors to rank patients by importance, expedite triage procedures, and potentially cut down on diagnostic errors and delays. It is also a useful addition to current hospital operations because the method design facilitates easy integration and requires little training for clinical staff.

Research on biomarkers and predictive models is being conducted to improve diagnosis and treatment results for CC, a critical health challenge. Recent research has used a variety of ML and bioinformatics tools to uncover possible biomarkers and create predictive models for colorectal cancer (CRC).

Liñares-Blanco et al. [12] confirmed a meta-signature using ML methods and used molecular docking to corroborate the interaction of FABP6 with abemaciclib. However, their validation was restricted to in-silico approaches.
Similarly, Kong et al. [13] used 3D organoid data to establish biomarkers that accurately predicted treatment responses in colorectal and bladder tumors, but their findings were limited to preclinical validation. Shuwen et al. [14] used CatBoost to obtain 99% accuracy and an Area Under Curve (AUC) of 1.0 on GSE131418 for diagnosing Colon Adenocarcinoma (COAD) liver metastases, despite limited external validation. Liu et al. [15] found five transcription factors and validated their model across four GEO datasets, achieving good survival prediction accuracy; however, further clinical validation was required. Jin et al. [16] used SVM-RFE and Cox regression to identify six lncRNAs for predicting COAD recurrence; still, their model was constrained by dataset reliance. Wang et al. [17] used bioinformatics to identify eleven hub genes for colorectal cancer, which still need to be validated experimentally. Sun et al. [18] used Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Protein-Protein Interaction (PPI) analysis to identify four influential genes and two medicines, but did not conduct any experimental testing. Fang et al. [19] identified twelve prognostic genes with a model AUC greater than 0.8 and created a nomogram, albeit using only retrospective data. Zhang et al. [20] identified eighteen Differentially Expressed Genes (DEGs), six hub genes, and two predictive indicators for colorectal liver metastases, relying on public datasets. Pan et al. [21] found IRF4 and TNFRSF17 and validated their model using GSE1433, but external clinical validation remained lacking.

Ma et al. [22] created a thirteen-gene immune-related gene classifier with AUC values ranging from 0.68 to 0.74, indicating modest accuracy. Liu et al. [23] used ESTIMATE to identify six predictive DEGs associated with immunological and stromal scores; however, their findings were based on retrospective data.
Wang et al. [24] identified fifteen important genes, including IL1RN and PRRX1, which are immune-relevant; however, their clinical correlation was poor. Su et al. [25] found that their random forest model outperformed Least Absolute Shrinkage and Selection Operator (LASSO) and Weighted Gene Co-expression Network Analysis (WGCNA), but accuracy decreased with each validation stage. Koppad et al. [26] found that random forest outperformed five other ML classifiers and identified 34 genes, although dataset heterogeneity was a concern. Li et al. [27] used SVM and LASSO to identify eleven diagnostic genes with substantial AUCs; however, their performance was only moderate. Johnson et al. [28] created a seven-gene model for metastatic colorectal cancer that outperformed current techniques; nevertheless, clinical testing was required. Ye et al. [29] developed a fifteen-gene profile that predicts survival risk, although the analysis was limited by its retrospective design. Wang et al. [30] created a four-gene model that was verified using SVM, quantitative Polymerase Chain Reaction (PCR), and immunohistochemistry (IHC), but with little external testing. Sharma et al. [31] discovered genes associated with immunological and cell cycle pathways, but these lacked in vivo validation. Leng et al. [32] discovered 249 medicines and identified TIMP1 as a critical prognostic gene, again without experimental evidence. Liu et al. [33] developed a seven-gene signature for recognizing Hepcidin Antimicrobial Peptide (HAMP); however, dataset dependency was a drawback. Kang et al. [34] created an immune-related model linked to survival, which requires prospective validation. Salimy et al. [35] found that HPOAE outperformed other multi-omics models but lacked clinical testing. Maurya et al. [36] obtained 100% accuracy with a random forest model based on Boruta feature selection; however, data imbalance was a concern.
Li et al. [37] employed WGCNA to identify thirteen gene modules, with the brown and blue modules being the most important; however, the research lacked validation through clinical trials. Mortezapour et al. [38] discovered seven upregulated genes, including miR-940, but did not carry out drug testing. Xiao et al. [39] discovered three biomarkers that link Crohn’s disease and COAD, albeit using a tiny dataset. Wang et al. [40] developed a TLS-based survival prediction model using seven genes that required clinical confirmation.

Xu et al. [41] created a six-URG signature that demonstrated good immunological and prognostic classification, albeit with dataset dependencies. Xu et al. [42] used semi-supervised machine learning on 933 data entries to generate eighteen prognostic and ten predictive signatures for 5-FU response; still, the validation was insufficient. Building on previous research, Radhakrishnan et al. [43] used ML algorithms and PPI analysis to uncover proteomic biomarkers for CRC. The study applied classifiers such as LASSO, XGBoost, and LightGBM to proteome data from both healthy individuals and CRC patients. LASSO achieved the highest AUC, at 75%. Trefoil Factor 3 (TFF3), Lipocalin 2 (LCN2), and Carcinoembryonic Antigen-Related Cell Adhesion Molecule 5 (CEACAM5) were recognized as crucial markers of cell adhesion and inflammation. Similarly, Zhang et al. [44] created a system that uses matched CRC tumor and organoid gene expression data to improve chemotherapy response prediction. Using consensus WGCNA, researchers have identified important gene modules and proposed biomarkers or important genes for patients with CC. However, these were often based on PPI networks or expression analysis alone. Identifying essential genes and understanding their molecular pathways in the growth and development of CC remains a hard task.
Additional research is necessary in this area [45,46].

The current research is significant in the field of computational oncology because it addresses the challenges of developing effective multi-targeted therapies for CC. The key contributions of this research are listed below.

The intention is to build a computational technique that targets multiple signaling pathways in CC, overcoming the limitations of traditional single-target approaches and improving treatment precision. The framework draws on high-dimensional genomic, transcriptomic, and proteomic data, enabling robust biomarker discovery and deeper insights into drug resistance mechanisms.

The research introduces the Adaptive Bacterial Foraging-CatBoost algorithm (ABF-CatBoost) to enhance parameter tuning and predictive accuracy, enabling effective classification of patients and identification of key biomarkers and drug targets.

The model personalizes therapy by predicting drug efficacy, toxicity risks, and metabolic pathways based on patient-specific molecular profiles, contributing to safer and more effective treatment strategies.

The proposed framework is adaptable to other cancers by modifying the biomarker selection and pathway analysis modules, thus expanding its impact in precision oncology and personalized medicine.

Results

The research uses Python 3.10 to assess the efficacy of the ABF-CatBoost technique for analyzing molecular data and predicting therapy results in CC. The discovered hub genes, TP53, KRAS, and CCNA2, were further investigated for their therapeutic potential in CC. The system’s effectiveness is estimated using performance evaluation metrics, the Receiver Operating Characteristic (ROC) curve for toxicity risks, PPI networks for metabolism pathways, and overall survival rates for pharmacological efficacy.
The suggested method is compared to established classification systems like Random Forest (RF) [47] and Support Vector Machine (SVM) [47] for biomarker discovery, drug response prediction, and toxicity risk assessment.

Overall survival (OS)

The OS rate is used to build drug efficacy profiles; it measures the percentage of individuals who are still alive a given time after treatment. A higher OS rate indicates that the medicine is more successful at prolonging patients’ lives. This statistic gives a direct and clinically relevant assessment of the drug’s therapeutic effect. Figure 1 shows the overall survival associated with (a) KRAS, (b) CCNA2, and (c) TP53.

Fig. 1: Prognostic significance of hub genes in colorectal cancer. Kaplan–Meier survival curves illustrating overall survival associated with (a) KRAS, (b) CCNA2, and (c) TP53 expression.

The survival curves for the hub genes TP53, KRAS, and CCNA2 show distinct prognostic implications. Among these, TP53 has the best survival rate, indicating a better long-term patient prognosis. In contrast, KRAS shows a steeper fall in survival rate, indicating a worse prognosis, while CCNA2 shows an intermediate trend. Thus, TP53 emerges as the most reliable biomarker for predicting improved overall survival in colorectal cancer patients, highlighting its importance as a treatment target.

Figure 2 compares KRAS, CCNA2, and TP53 expression to find important differences across sample groups. The comparison highlights genes with statistically significant differential expression so that their possible contribution to disease processes can be evaluated. Red and black color coding distinguishes the conditions.

Fig. 2: Survival verification of the genes.

When comparing tumor samples to normal tissues, the expression analysis of KRAS, CCNA2, and TP53 shows noticeably higher levels in tumors, with p-values indicating strong statistical significance (reported as p = 0.000).
These variations imply that these genes are essential for tumor development and cancer progression. The over-expression observed in tumor tissues highlights their potential as indicators for early cancer detection and as therapeutic targets.

Metabolism pathway

The metabolism pathway is examined using PPI networks, which show how drug-targeted proteins interact with other metabolic proteins. PPI networks, built with tools such as STRING or Cytoscape, aid in the identification of important hub proteins involved in enzymatic processes, signaling, and metabolic regulation. This method identifies drug metabolism pathways, such as ADME processes, and provides insights into the drug’s biological processing, allowing for better prediction of metabolic behavior and potential interactions. Figures 3–5 show the PPI networks for the hub proteins KRAS, CCNA2, and TP53, respectively.

Fig. 3: Network analysis of KRAS.

Fig. 4: Network analysis of CCNA2.

Fig. 5: TP53 network analysis.

The KRAS interaction network is strongly associated with the main oncogenic and signaling proteins PIK3CA, RAF1, BRAF, SOS1, and RALGDS, which largely regulate the RAS/MAPK and PI3K/AKT pathways that control cell growth, survival, and differentiation. Furthermore, KRAS has a role in calcium-mediated signaling by forming connections with downstream effectors such as PHKA1, PHKA2, and CAMK1 as well as calmodulin proteins (CALM1, CALM2, and CALM3). These links indicate that KRAS contributes to both calcium-regulated metabolic processes and traditional mitogenic signaling, which highlights its crucial role in carcinogenesis and supports its designation as a major therapeutic target in the development and management of CC.

CCNA2 is a crucial part of a complex network that regulates the cell cycle; it interacts with proteins like CDC6, E2F1, SKP2, TFDP1, CDK1, and CDK2.
The coordination of the G1/S and G2/M phase transitions, which drives the orderly progression of cell division, depends on these connections. Its connections with CDKN1B and the regulatory proteins CKS1B and CKS2 indicate that it is involved in the ubiquitin-proteasome system, which regulates checkpoints and cyclin degradation. One of the hallmarks of cancer is unregulated cellular proliferation, which is caused by the dysregulation of this carefully controlled network. Therefore, CCNA2 is a potentially effective therapeutic target for therapies that induce cell cycle arrest.

The TP53 network shows connections with important DNA damage response components and tumor suppressors, such as ATM, EP300, CREBBP, MDM2, and DAXX. TP53 regulates important physiological functions like apoptosis, DNA repair, cell cycle arrest, and senescence. Co-regulators that stabilize or repress p53’s transcriptional activity include MDM4, TP73, and FOXO4. Its connection to proteins like SIRT1 and HSP90AA1 draws attention to post-translational modifications that alter TP53 function. TP53 is an essential tumor suppressor and master regulator in cancer biology. As such, it is a crucial target for therapeutic approaches, particularly those that seek to restore p53 activity in malignancies with mutations.

The PPI network analysis gives useful information on the metabolism pathways associated with key biomarkers in CC. Drug-targeted proteins have strong interactions with metabolic regulators, indicating their participation in enzymatic activities, signal transmission, and cellular homeostasis. KRAS, CCNA2, and TP53, in particular, are highly connected to metabolic control proteins, suggesting that these genes play crucial roles in drug metabolism and processing.
These findings help clarify the underlying biological mechanisms, improve knowledge of Absorption, Distribution, Metabolism, and Excretion (ADME) behavior, and support the creation of more effective and safer therapeutic options by precisely targeting metabolic pathways.

Network analysis

An interaction network between hub genes and transcription factors (TFs) was built to determine which TFs were the most important transcriptional regulators of the hub genes. The top 10 key TFs, each with degree ≥ 4, were selected as the essential transcriptional regulators of the hub genes; key genes (KGs) are drawn as red and black ellipses, while the top-degree key TFs are drawn as green rectangles. Figure 6 shows the gene regulatory network.

Fig. 6: Expanded gene regulatory network.

The network emphasizes the crucial roles of the hub genes in apoptosis, cell cycle regulation, and oncogenic signaling: TP53 affects tumor suppressors, KRAS modifies the MAPK and PI3K pathways, and CCNA2 controls the course of the cell cycle. The presence of important transcription factors like MYC and E2F1 highlights their function in the growth and spread of tumors. The network topology, which exhibits significant linkages, suggests a sophisticated regulatory system governing cancer-related pathways. These findings support the identification of possible therapeutic targets for precision medicine and can inform oncology treatment plans.

Toxicity risks

To find candidate drugs, the research measured the IC50 values of 13 medicines for CC and evaluated how well TP53, KRAS, and CCNA2 predict drug sensitivity. Figure 7 shows the association between drug sensitivity and the hub genes TP53, KRAS, and CCNA2.

Fig. 7: Correlation between drug sensitivity and hub gene expression. Bubble plot illustrating the association of KRAS, CCNA2, and TP53 expression levels with various anticancer drugs.
Bubble color represents correlation direction (blue = negative, red = positive), and bubble size indicates the magnitude of the correlation coefficient.

The bubble plot shows significant relationships between gene expression and drug response, with CCNA2 having the strongest positive and negative associations across many medications. TP53 exhibits considerable associations, particularly with Capecitabine, whereas KRAS exhibits weak associations. Larger bubbles denote greater statistical significance, with red indicating increasing drug sensitivity and blue indicating possible resistance. These data imply that CCNA2 serves as a biomarker for drug sensitivity, while TP53’s association with Capecitabine signals a potentially powerful therapeutic effect. The findings show diverse gene-drug interactions, emphasizing the varied effects of different medications on specific genetic targets.

Treatment plans are influenced by the risk score, which aids in predicting chemotherapy sensitivity. While low-risk patients respond well to standard regimens, high-risk patients frequently exhibit resistance as a result of genetic differences. Understanding this link improves precision medicine: customizing chemotherapy according to patient risk profiles and tumor features optimizes medication selection, minimizes toxicity, and improves survival results. Figure 8 shows the association between the risk score and drug sensitivity.

Fig. 8: GE levels between risk groups.

The results show notable variations in TP53, KRAS, and CCNA2 expression between the high-risk (HRisk) and low-risk (LRisk) groups. The observed differences imply that these genes are essential for differentiating between risk categories.
These results demonstrate their potential as biomarkers for determining a patient’s sensitivity to chemotherapy and directing individualized treatment plans.

Risk scores may be influenced by demographic factors such as age and gender, which can affect disease susceptibility and gene expression. Gender differences (male vs. female) reflect differing biological responses, while age-based analysis (≤65 vs. >65) reflects physiological changes. Box plots with overlaid data points show these differences, and statistical significance is indicated by p-values. Understanding these effects supports precision medicine by enabling risk assessment and individualized treatment plans according to demographic characteristics. Figure 9 shows the demographic characteristics and risk score.

Fig. 9: Comparison of risk scores across gender and age groups. (a) Box plot showing distribution of risk scores between male and female patients. (b) Box plot showing distribution of risk scores between patients aged ≤65 and >65 years.

The p-values (0.919 for age, 0.443 for gender) show no statistically significant difference in risk scores between age groups (≤65 vs. >65) or between genders (male vs. female). This finding indicates that neither gender nor age significantly influences the observed risk score distribution, which is consistent with gene expression levels being stable across these demographic groups.

Toxicity risks are assessed using the ROC curve, which summarizes the performance of a toxicity prediction method. It plots the True Positive (TP) rate against the False Positive (FP) rate at a range of thresholds. The AUC quantifies classification performance, with a larger AUC indicating better predictive ability. This approach differentiates toxic from non-toxic compounds, allowing more dependable estimation of safety hazards during medication development. Figure 10 shows the toxicity risk.

Fig. 10: Toxicity risks.

The ROC curve demonstrates the method’s predictive reliability and discrimination capacity. The AUC indicated a high level of accuracy in discriminating toxic from non-toxic responses, supporting the model’s efficacy in assessing possible medication safety hazards.

Performance evaluation

Method performance is assessed through several metrics: accuracy, specificity, sensitivity, and F1-score. The method is compared with traditional methods like RF [47], SVM [47], K-Nearest Neighbor (KNN) [48], and Artificial Neural Network (ANN) [48] for biomarker identification and drug response prediction.

Accuracy: Accuracy is the proportion of correctly predicted cases out of all instances, that is, how often the model’s predictions are correct. Table 1 and Fig. 11 compare the classification accuracy (%) of the proposed ABF-CatBoost method with existing ML methods. Accuracy is computed using Eq. (1), where TP, TN, FP, and FN denote the numbers of True Positives, True Negatives, False Positives, and False Negatives, respectively.

$$\mathrm{Accuracy}=\frac{TN+TP}{TN+FP+TP+FN}$$ (1)

The findings show that the suggested ABF-CatBoost approach has the highest accuracy (98.6%), outperforming RF (95.8%), SVM (93.2%), KNN (91.11%), and ANN (86.71%). This demonstrates the suggested model’s higher predictive capability and its robustness compared with existing classification methods for the task. Table 2 and Fig. 12 present the comparative analysis of the different methods.

Fig. 11: Evaluation of accuracy.

Fig. 12: Estimation of performance evaluation.

Table 1: Comparing the ABF-CatBoost method’s classification accuracy to that of existing ML techniques.

Table 2: Performance assessment of the proposed and existing methods.

The suggested ABF-CatBoost model performed better than typical ML approaches, with specificity (0.984), sensitivity (0.979), and F1-score (0.978). RF had a specificity of 0.965, a sensitivity of 0.948, and an F1-score of 0.948, whereas SVM had a specificity of 0.982, a sensitivity of 0.857, and an F1-score of 0.910. ABF-CatBoost, in contrast, achieved a well-balanced trade-off between sensitivity and specificity while maximizing predictive accuracy. These findings support its usefulness for developing multi-targeted therapies for precision oncology in CC treatment. Table 3 presents the performance comparison of the ABF-CatBoost method on the lung cancer dataset and the proposed gene expression dataset.

Table 3: Performance comparison of the ABF-CatBoost method on the lung cancer and proposed gene expression datasets.

Impact of feature integration on classification accuracy

This analysis demonstrates the progressive impact of integrating various components on classification accuracy. Initially, using gene expression (GE) alone yields lower performance, while adding DEG, KEGG, and CatBoost sequentially improves the accuracy. The final combination of all components, including ABF, achieves the highest accuracy of 98.6%. This highlights the cumulative value of combining multiple features and methods. Table 4 displays the incremental impact of feature integration and method components on classification accuracy.

Table 4: Incremental impact of feature integration and method components on classification accuracy.

The ablation results showed a gradual improvement in classification accuracy as different data attributes and method elements were progressively included.
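The four metrics compared in this section all follow from the confusion-matrix counts TP, TN, FP, and FN. The following is a minimal Python sketch, not the authors' code: the class and function names are illustrative, the counts are hypothetical and not taken from the study, and the formulas for sensitivity, specificity, and F1-score are the standard definitions, which the text names but does not write out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (hypothetical container for illustration)."""
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives


def accuracy(c: ConfusionCounts) -> float:
    # Eq. (1): correct predictions over all predictions.
    return (c.tp + c.tn) / (c.tp + c.tn + c.fp + c.fn)


def sensitivity(c: ConfusionCounts) -> float:
    # True positive rate (recall): positives that were correctly detected.
    return c.tp / (c.tp + c.fn)


def specificity(c: ConfusionCounts) -> float:
    # True negative rate: negatives that were correctly rejected.
    return c.tn / (c.tn + c.fp)


def precision(c: ConfusionCounts) -> float:
    # Fraction of predicted positives that are truly positive.
    return c.tp / (c.tp + c.fp)


def f1_score(c: ConfusionCounts) -> float:
    # Harmonic mean of precision and sensitivity.
    p, r = precision(c), sensitivity(c)
    return 2 * p * r / (p + r)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = ConfusionCounts(tp=90, tn=85, fp=15, fn=10)

for name, fn_ in [("accuracy", accuracy), ("sensitivity", sensitivity),
                  ("specificity", specificity), ("F1-score", f1_score)]:
    print(f"{name}: {fn_(counts):.3f}")
```

The sketch assumes both classes are present and at least one positive prediction is made, so no denominator is zero. Note that the denominator of accuracy contains each of the four counts exactly once.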
Starting from the GE data, the performance of the method improves with the successive addition of DEG, KEGG pathway information, the CatBoost classifier, and lastly ABF. The highest accuracy of 98.6% is attained by the final configuration, which combines all the components (GE + DEG + KEGG + CatBoost + ABF), showing the value of combining integrative-omics data with innovative ML algorithms for classification.

Discussion

This research aimed to provide a suitable analytical framework for multi-targeted treatments in CC by merging biomarker profiles from multiple omics data. Several investigations [12,20,32] relied on in silico and computational analyses without experimental validation, so their predictive models and biomarkers still require laboratory-based confirmation. The use of public datasets such as TCGA and GEO adds dataset heterogeneity, which limits generalizability [30,33,39]. Furthermore, the absence of clinical validation in transcriptomic-based models emphasizes the need for prospective trials to determine real-world applicability [13,19,28]. Preclinical models, particularly 3D organoid cultures, do not fully reproduce tumor complexity and therefore require additional in vivo research [14,18]. Dataset imbalances and small sample numbers in some studies raise concerns about inflated predictive accuracy, highlighting the need for more diverse datasets [36]. Translating prognostic markers into clinical practice remains difficult, requiring standardization and regulatory approval [40,41]. Furthermore, the limited external validation of findings in several studies emphasizes the need for cross-cohort replication [21,38]. Future research should focus on experimental validation, clinical trials, and broader dataset validation to strengthen the therapeutic value of these findings.

To address these constraints, the research developed an analytical approach for multi-targeted therapy in CC by combining biomarker profiles from multiple omics datasets.
The proposed method was compared to current techniques such as RF [47] and SVM [47], which usually have drawbacks such as overfitting, insufficient feature utilization, low flexibility, and sensitivity to kernel parameters. RF [47], while able to handle high-dimensional data, is limited in how it adjusts feature importance and can produce biased predictions, while SVM [47] struggles with large datasets and requires extensive kernel tuning. The ABF-CatBoost system overcomes these constraints by flexibly selecting essential biomarkers and increasing the reliability of predictions. KNN [48] has trouble with high-dimensional or unbalanced data distributions, is slow with big datasets, and is sensitive to irrelevant features. ANNs [48] are prone to overfitting, require a lot of data, have a significant processing overhead, and frequently have decision-making processes that are difficult to interpret. To address this, an ABF-CatBoost strategy was proposed, which integrated ABF’s optimization capabilities for hyper-parameter tuning and feature refinement with CatBoost’s reliable classification capability. Additionally, it improved the handling of categorical features and reduced model bias, enabling greater generalizability and operational reliability on external validation data. This combined technique increased the reliability of the results, supporting personalized CC treatments and broadening the reach of personalized oncology. Among the genes examined, TP53, KRAS, and CCNA2 were identified as major hub genes, indicating their importance in tumor progression and their potential as treatment biomarkers. These findings help to advance precision medicine approaches and increase the possibilities for targeted therapy strategies in CC management. Drug development based on biomarker profiles improved therapeutic accuracy, identified significant molecular targets, and facilitated personalized treatment.
It helps overcome drug resistance, improves early detection, and speeds up the development of multi-targeted drugs. These techniques significantly enhanced survival rates and treatment efficiency, promoting accuracy in CC therapy. To predict drug responses and potential toxicity risks, the ABF-CatBoost model can be integrated into clinical workflows by analyzing patient-specific molecular signatures derived from blood or tissue samples. This model enables personalized medicine by identifying biomarkers associated with drug sensitivity and resistance. In the context of colon cancer management, it facilitates real-time monitoring of adaptive resistance mechanisms, allowing dynamic adjustment of therapeutic strategies. Clinicians can use this information to select the most effective drugs and tailor dosages to minimize adverse effects. The approach supports the design of multi-targeted therapies that are specifically aligned with a patient's molecular profile. This enhances treatment precision, improves clinical outcomes, and reduces unnecessary toxicity.

The proposed approach can be applied as a decision-support tool in clinical processes by integrating it with radiological systems. The technique has the potential to automatically evaluate patient data during routine screenings or diagnostic imaging procedures and flag high-risk cases for prompt professional assessment. Its capacity to deliver quick and precise evaluations would enable doctors to rank patients by priority, expedite triage procedures, and potentially cut down on diagnostic errors and delays. It is also a useful addition to existing hospital operations because the method's design facilitates easy integration and requires little training for clinical staff.

Pre-processing transcriptome data and applying optimized ML techniques enabled prediction of CC in this research, improving drug discovery and individualized treatment results. 
The goal of this research was to develop a multi-targeted treatment system for CC using an integrated ABF-CatBoost system that uses molecular data to recognize important markers and forecast drug responses. The results established that the proposed system outperformed traditional systems in terms of accuracy (0.986), F1-score (0.978), sensitivity (0.979), and specificity (0.984), effectively identifying therapeutic targets. The research identifies TP53, KRAS, and CCNA2 as important hub genes with major implications for targeted therapy and survival prediction in CC. However, the limitations of the research include small sample sizes and the complexity of integrating information. Future research will increase the dataset size, incorporate concurrent scientific data, and apply this computational framework to diverse cancer types to enhance its generalizability and impact in precision oncology. In future work, real-time datasets will be incorporated in clinical settings to validate the robustness and adaptability of the ABF-CatBoost framework. Integration with radiological systems as a decision-support tool is likewise left to future work.

Methods

This research intends to improve drug response prediction, enhance biomarker identification, and personalize treatment regimens. Gene Expression (GE) data is gathered for multi-targeted drug discovery in CC. Data pre-processing procedures, such as the Robust Multi-array Average (RMA) and Microarray Suite (MAS) approaches, are performed. The ClusterProfiler tool is used for functional enrichment analysis. A PPI network is employed to provide more information on the functional connections among DEGs. 
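The text does not state how hub genes are ranked within the PPI network. One common choice, assumed here purely for illustration, is node degree: the number of distinct interaction partners a gene has. The sketch below uses a hypothetical function name (`rank_hub_genes`) and placeholder gene labels with toy edges that are not taken from the study.

```python
from collections import Counter


def rank_hub_genes(edges, top_n=3):
    """Rank genes by degree in an undirected interaction network.

    `edges` is an iterable of (gene_a, gene_b) pairs. Duplicate edges
    (in either orientation) and self-loops are ignored.
    """
    # Collapse (a, b) and (b, a) into one undirected edge; drop self-loops.
    unique_edges = {frozenset(edge) for edge in edges if edge[0] != edge[1]}

    degree = Counter()
    for edge in unique_edges:
        for gene in edge:
            degree[gene] += 1

    # Highest degree first; ties broken alphabetically for determinism.
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Toy network with placeholder labels (assumption, not study data).
toy_edges = [
    ("G1", "G2"),
    ("G1", "G3"),
    ("G1", "G4"),
    ("G2", "G3"),
    ("G3", "G1"),  # duplicate of G1-G3 in reverse order
    ("G4", "G5"),
]
hubs = rank_hub_genes(toy_edges, top_n=3)
print(hubs)
```

Degree is only one of several centrality measures; a real analysis might weight edges by interaction confidence or combine several ranking methods.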
Validation and survival analysis are carried out for the hub genes. The ABF-CatBoost method is then explained in more detail. Figure 13 shows the methodological framework.

Fig. 13 Methodological Framework.

Data collection

The GE data is gathered from the open-source Kaggle [46]. This dataset, acquired via the ColoCare Project, contains the GE profiles of 117 mucosa tissues and 77 tumor tissues measured on Illumina Human HT12v4 gene chips. It also includes 107 tumor and 108 mucosa samples with matching DNA methylation profiles, combining data from GSE1017764. Using these combined datasets, the intricate relationship between DNA methylation and GE was examined.

Data preprocessing

The GE dataset was pre-processed to ensure accurate DEG analysis, which is carried out on the GE data to discover genes that are significantly related to CC. The analysis is carried out in an R environment using Bioconductor tools. These tools make it easier to pre-process, background-correct, and statistically evaluate GE profiles in tumor and normal tissue samples.

To verify the consistency and stability of expression values, data is normalized using the RMA and MAS approaches. Of the two, MAS-normalized data are chosen for downstream analysis because their values are closer to the median distribution. The selection criteria retained only genes with significant expression differences and biologically relevant fold changes between malignant and normal tissues. The identified DEGs provide a basis for subsequent functional enrichment analysis and biomarker development in CC.

Functional enrichment analysis

The ClusterProfiler tool (version 3.12.0) in R is used to perform functional enrichment analysis on the marker genes to investigate how these genes influence the development of CC18. 
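Over-representation analysis of the kind ClusterProfiler performs is based on a hypergeometric test: given a gene universe, a pathway gene set, and a DEG list, it asks how likely an overlap at least as large as the observed one would be by chance. The following is a minimal Python sketch of that test, not the R workflow used in the study; the function name and the toy counts are hypothetical, and multiple-testing correction is omitted.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, pathway_size, deg_count, overlap):
    """Upper-tail hypergeometric p-value, P(X >= overlap).

    X is the number of pathway genes among `deg_count` genes drawn
    without replacement from a universe of `universe_size` genes,
    of which `pathway_size` belong to the pathway.
    """
    max_overlap = min(pathway_size, deg_count)
    total = comb(universe_size, deg_count)
    # Sum the probability mass for every overlap at least as large
    # as the observed one.
    tail = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, deg_count - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy counts (assumption, not study data): 20 genes in the universe,
# 5 in the pathway, 4 DEGs, 2 of which fall in the pathway.
p_value = hypergeom_enrichment_p(
    universe_size=20, pathway_size=5, deg_count=4, overlap=2
)
print(f"enrichment p-value: {p_value:.4f}")
```

In practice the p-values from all tested pathways would then be adjusted for multiple comparisons before a significance threshold is applied.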
The purpose of the R package ClusterProfiler is to compare biological themes among gene clusters, including those found in DO, KEGG, and GO. It is used to investigate the biological significance of the discovered DEGs and their role in CC pathogenesis. The KEGG database is used to uncover overrepresented biological pathways and functional categories. KEGG pathway analysis sheds light on the molecular interaction and reaction networks in which DEGs play essential roles. A pathway is judged significantly enriched with a p-value