A novel anoikis related gene prognostic model for colorectal cancer based on single cell sequencing and bulk transcriptome analyses

Wait 5 sec.

IntroductionColorectal cancer (CRC) is ranked as the second most deadly cancer worldwide1 and the incidence is increasing each year2. Advances in the systematic treatment strategies of CRC in recent years have improved the survival rate to a certain extent, but the overall prognosis remains poor owing to lack of efficient early prognostic biomarkers which can better guide risk stratification and treatment decision3,4,5. Therefore, there is an urgent need to explore potential prognostic factors for clinical management of CRC.Anoikis is a form of cell death caused by loss of or inappropriate cell adhesion6. Anoikis is crucial for maintaining tissue homeostasis because it prevents detached epithelial cells from colonizing elsewhere. Cancer cells develop multiple ways of avoiding anoikis, allowing them to spread across tissues and organs. In CRC, anoikis resistance is a critical mechanism of tumor metastasis and progression, and the specific strategies cancer cells employed to acquire the resistance include altering the expression of adhesion molecules, cell surface receptors, key metabolic enzymes, cytokines and transcription factors7.Single-cell RNA sequencing (scRNA-seq) examines the global gene expression profiles at the level of single-cell resolution, enabling the detection of previously hidden heterogeneity8 and allowing for the assessment of specific genes and pathways in distinct cell types. Several prior studies have proposed anoikis-related prognostic signatures in CRC; however, those studies were all based on data from bulk RNA sequencing9,10,11. In this study, we aimed to develop a prognostic model for CRC based on anoikis-related genes using integrated scRNA-seq and bulk transcriptome data. First, the anoikis-related genes (ANRGs) and their cell-type-specific expression patterns were identified via scRNA-seq, focusing on the epithelial compartment. Second, bulk RNA-seq data and scRNA-seq data were integrated to screen ANRGs with prognostic relevance through weighted gene co-expression network analysis (WGCNA). Finally, a prognostic model was established and validated, followed by mechanistic exploration of key genes in tumor progression (Supplementary Fig. 1).MethodsData acquisitionANRGs were obtained through systematic search and processing of data downloaded from the Human Gene Database (GeneCards, https://www.genecards.org/) and the Harmonizome platform (https://maayanlab.cloud/Harmonizome/). ScRNA-seq data of GSE132465 12 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132465) was downloaded from the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/geo/), including 23 samples of primary colorectal cancer and 10 matched normal mucosa samples. Three bulk transcriptomic datasets were downloaded: GSE41258 13 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41258) was obtained from the GEO database, and 186 samples of primary colon adenocarcinomas were extracted and analyzed in this study; GSE17536 14 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE17536) was also downloaded from GEO database, and the samples of 177 colorectal cancer patients from the Moffitt Cancer Center were extracted and analyzed in this study; The Cancer Genome Atlas (TCGA) dataset was downloaded from UCSC Xena (https://xena.ucsc.edu/) and it contained 512 samples of colon cancer. The sequencing platform of GSE132465, GSE41258, and GSE17536 were GPL20301Illumina HiSeq 4000 (Homo sapiens), GPL570[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array, and GPL570[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array, respectively. TCGA dataset was the training set of the prognostic model, and the GSE17536 dataset was the validation set for the model. TCGA and the GSE41258 dataset was employed to evaluate the association between the expression of the identified key prognostic gene and tumor stage.Single-cell transcriptome analysisThe data were analyzed following standard pipeline of the package Seurat15 (v.4.4.0, https://satijalab.org/seurat/). Gene expression was normalized by the LogNormalize method (scale factor = 10,000). The R package harmony16 (https://github.com/immunogenomics/harmony) was used to eliminate the batch effects. The top 1500 highly variable genes were recognized by the function of FindVariableFeatures. The most significant principal component scores used for subsequent cell clustering were defined by the function of RunPCA. Distinct groups of cells identified through the principal component analysis were projected onto t-SNE (t-Distributed Stochastic Neighbor Embedding) analysis by the function of RunTSNE.Subgrouping of single-cell samples and identification of ANRGs between subgroupsTo link ANRG expression to functional differences in critical cell types, we subdivided cells into ANRG-high/low groups and identified differentially expressed genes (DEGs) to explore underlying pathways. The threshold of one standard deviation above the average expression level of the 640 ANRGs was applied to divide the single-cell samples into “ANRG-high” and “ANRG-low” groups. The DEGs between the two groups were identified by the function of FindMarkers, and genes with p 0.25 were considered significant DEGs. Further Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) analyses17,18,19 were performed by the R package clusterprofiler20 (v.3.18, https://bioconductor.org/packages/release/bioc/html/clusterProfiler.html). Gene Ontology analyses comprised of three modules: molecular function (MF), biological process (BP) and cell component (CC).WGCNA and module-trait relationship analysisTo prioritize ANRGs with clinical relevance, we used WGCNA on bulk data to identify co-expression modules associated with anoikis, then intersected these with single-cell-derived DEGs to focus on genes with both cell-type specificity and transcriptome-wide co-expression patterns. The bulk RNA-seq data from TCGA colorectal cancer samples (512 cases) were used and the gene expression values were normalized to eliminate technical biases. The R package WGCNA (v.1.72-1, https://cran.r-project.org/web/packages/WGCNA/index.html)21 was utilized. WGCNA constructs a gene co-expression network based on the similarity of gene expression patterns. To ensure the network follows a “scale-free” property (a common feature of biological networks, where most genes have few connections, and a few have many), we tested soft threshold powers from 1 to 20. The “soft threshold” determines how strongly the correlation between genes is weighted (higher values strengthen the distinction between strong and weak correlations). The adjacency between each pair of genes was first calculated based on their expression correlation, and this correlation was weighted using the optimal soft threshold power to enhance the distinction between strong and weak correlations. To group genes into co-expression modules, we calculated the ‘dissimilarity’ between genes. Genes with lower dissimilarity (that is, higher co-expression similarity) were clustered together via linkage hierarchical clustering, where genes are progressively merged into clusters based on their dissimilarity values. This clustering process generated a dendrogram, and distinct branches in the dendrogram corresponded to co-expression modules—each module consisting of genes with highly coordinated expression patterns. Subsequently, the association between WGCNA modules and anoikis and between cytotoxic T lymphocytes (CTL) and modules was further analyzed. The “eigengene” of anoikis (a representative gene expression profile) was calculated by the function of AddModuleScore in the Seurat package, and the eigengene of CTL was calculated based on the expression levels of CD8A, CD8B, GZMA, GZMB, and PRF1 22. Finally, genes in the WGCNA module most significantly associated with anoikis were intersected with the previously identified DEGs in single-cell analyses, and genes present in both sets were used in subsequent analyses.Establishing and validating a prognostic modelUnivariate cox regression analysis was performed in TCGA dataset to select genes significantly associated with the prognosis of the disease through the R package survival. Then, a multivariate cox regression model was performed with the expression levels of genes identified by univariate cox analysis as input variables, while controlling for other potential confounding variables, such as age, gender, and clinical stage. After the model was fitted, the hazard ratio (HR) and its 95% confidence interval (CI) for each gene were extracted. Then, the forestplot package in R (https://github.com/LSYS/forestplot) was used to draw a forest plot to visually display the HR and CI of each gene, helping to identify the genes significantly associated with survival time. The expression levels of the identified prognostic genes in tumor samples were compared with that in the normal samples, and the association between gene expression levels and prognosis was evaluated by survival analyses using the R package survminer23 (v.0.4.9, https://cran.r-project.org/web/packages/survminer/index.html). To minimize the risk of overfitting, genes identified in the univariate cox regression analysis were further analyzed by LASSO-Cox regression through the cv.glmnet function of the R package glmnet24 (v.4.1-8, https://cran.r-project.org/web/packages/glmnet/index.html) to identify the independent prognostic genes for the predictive model, and p