IntroductionCancer remains an escalating global health challenge, with both incidence and mortality rates steadily rising. In 2022, the United States alone reported more than 1.9 million new cancer cases and 609,360 cancer-related deaths, underscoring the pressing need for more effective interventions1. Among the many cancer types, glioma, a particularly aggressive brain tumor, is distinguished by its infiltrative growth pattern, which closely intertwines with normal brain tissue, obscuring tumor boundaries2. This makes surgical resection difficult and often incomplete, leading to residual tumor cells and recurrence. Moreover, gliomas exhibit a regenerative capacity that increases malignancy over time, further complicating treatment and worsening the prognosis3.An additional challenge in glioma management arises from its intertumoral heterogeneity4. This molecular complexity means that each tumor can present unique features, leading to variable responses to standard therapies5. Given the limited understanding of the molecular mechanisms driving glioma progression, alongside the limitations of current treatments, there is an urgent need for in-depth molecular analyses to advance precision medicine and support the development of more personalized therapeutic strategies.Transcription factors (TFs) are critical regulators of gene expression and play key roles in cancer biology. In eukaryotic cells, transcription initiation requires the coordinated action of RNA polymerase II and numerous TFs that assemble at the promoter region of a gene to form the transcription initiation complex6. TFs are generally classified as either general TFs, which help form the transcription initiation complex, or specific TFs, which regulate genes in response to stimuli such as hormones or growth factors.In cancer, TFs play a central role in various critical pathological processes, including tumorigenesis7,8,9, metabolic reprogramming10,11, and the self-renewal and differentiation of cancer stem cells (CSCs)12,13, by regulating gene expression networks. During tumorigenesis, TFs can function as oncogenes or tumor suppressors, depending on the specific cancer type10,14,15,16,17. For example, the TF Yin Yang 1 (YY1) promotes tumor cell proliferation and metastasis in most cancers18 but exhibits tumor-suppressive activity in specific cancers, such as pancreatic and esophageal tumors. In metabolic reprogramming, TFs regulate critical genes involved in metabolic pathways, such as glycolysis and fatty acid metabolism, enabling tumor cells to meet the energy demands of rapid growth19,20. TFs also play crucial roles in modulating the properties of CSCs, a population of tumor cells responsible for recurrence and metastasis. In breast cancer, forkhead box O3 (FOXO3a) negatively regulates forkhead box M1 (FOXM1)21,22,23,24, reducing CSC stemness and tumorigenicity. Given these diverse roles, TFs serve as important diagnostic and prognostic biomarkers in various cancers25,26. Together, these findings highlight the multifaceted functions of TFs in cancer biology and their potential as therapeutic targets.In this study, we focused on the role of TFs in glioma subtyping. Using nonnegative matrix factorization (NMF)27, we analyzed 795 TFs and identified two molecular subtypes of gliomas with distinct survival rates and clinical characteristics. Further analysis revealed that certain TF-related gene sets were differentially expressed across World Health Organization (WHO) grade II, III, and IV gliomas. To explore protein-level interactions among these genes, we performed a protein‒protein interaction (PPI) network analysis.To identify core genes associated with the subtyping of gliomas and the classification of gliomas according to the WHO standards, ten machine learning techniques, including least absolute shrinkage and selection operator (Lasso), Ridge, Elastic Net (Enet), StepCox, survivalSVM, CoxBoost, SuperPC, plsRcox, random survival forests (RSF), and gradient boosting machine (GBM), were combined into 101 distinct strategies. Subsequently, the support vector machine (SVM) algorithm was used to further refine the results, identifying genes that are representative of different glioma subtypes. This thorough approach led to the identification of several key genes linked to glioma subtypes. Univariate analysis was conducted to determine the main factors affecting glioma risk, and a thorough PPI network analysis of these risk factors was performed. Ultimately, we identified five key genes, EZH2, TWIST1, EGR1, FOSL2, and TCF3, which are important for understanding glioma subtypes and predicting patient outcomes.Research on transcription factors in glioma and glioblastoma has confirmed their unique regulatory roles in these types of brain tumors. Despite its recognized role in other tumors, TCF3 has been minimally studied in gliomas28,29,30, The regulatory role of TCF3 in controlling cell proliferation and migration in glioma cell lines has been reported in previous studies31, Our multi-omics and multi-dataset analyses revealed the abnormal expression of TCF3 in glioma and other cancers, a finding validated at both the cellular and tissue levels. Furthermore, we explored the relationship between TCF3 and glioma prognosis, revealing new insights into its potential as a therapeutic target.Overall, our study introduced a novel molecular classification system for gliomas and identified TCF3 as a key prognostic marker and potential therapeutic target. These findings lay the groundwork for advancing precision medicine in glioma treatment and may facilitate the development of novel therapeutic strategies.Materials and methodsData collection and processingThe structured workflow diagram of this study is illustrated in Fig. 1. Transcription factors are sourced from the TRRUST database (https://www.grnpedia.org/trrust/) , a manually curated repository of transcriptional regulatory networks in humans and mice (Table S1). These data originate from 11,237 PubMed-indexed publications documenting small-scale experimental investigations into transcriptional regulation, encompassing three categories of transcription factors: direct regulators, indirect mediators, and condition-specific modulators. Transcriptome and clinical data were obtained from two databases: the Chinese Glioma Genome Atlas (CGGA) database (http://www.cgga.org.cn) and The Cancer Genome Atlas (TCGA) database (https://portal.gdc.cancer.gov/)32,33,34,35,36,37. Among these, TCGA included 702 glioma samples, while the CGGA database aggregated three datasets (CGGA325, CGGA693, and CGGA301) totaling 1,351 samples, all confirmed as gliomas. For prognostic analysis and molecular subtyping, samples with invalid survival data (survival time ≤ 0 days or missing survival status) were excluded from the cohort. Genomic mutation data for gliomas were acquired from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/). The glioma cell lines (U251, LN229, U87, A172, and U118) and normal human astrocytes (HA) were obtained from the Chinese Academy of Science. Tumor tissue specimens, and paired adjacent normal brain tissue samples were obtained from the Tissue Repository of Lanzhou University Second Hospital. All patients provided written informed consent, and the study protocol received full ethical approval from the Clinical Ethics Committee of Lanzhou University Second Hospital.Fig. 1Structured workflow diagram of this study.Full size imageIdentification and classification of glioma subtypes using TFsA set of 795 TF-related genes (Table S1) was identified for analysis. To classify glioma subtypes on the basis of TF expression profiles, we applied NMF with the following parameters: rank = 2:10, method = “brunet”, and nrun = 10. After evaluating the clustering performance, we determined that two subtypes (clusterNum = 2) provided the most robust classification. We then assessed the relationships between these two subtypes and their clinical characteristics. Differential gene expression analyses between two glioma subtypes and across different WHO grades of glioma were performed via the limma R package to identify differentially expressed genes (DEGs). Gene Ontology (GO) analysis38 was carried out to evaluate DEGs across dimensions, including cellular component (CC), molecular function (MF), and biological process (BP). We also performed pathway enrichment analysis via the Kyoto Encyclopedia of Genes and Genomes (KEGG) 39,40,41and gene set enrichment analysis (GSEA) to explore specific functional pathways.Weighted gene co-expression network analysis (WGCNA)WGCNA42 was used to identify gene modules co-expressed in relation to the two glioma subtypes. To construct a scale-free network, we calculate the soft-thresholding power on the basis of the scale-free topology criterion and select the optimal value accordingly. Modules were defined with a minimum size of 50 genes. The dynamic tree cut method was used to detect distinct gene modules, and similar modules were merged via a module eigenvalue dissimilarity threshold (MEDissThres) of 0.25.Tumor microenvironment (TME), immune, and functional scoringThe R package ESTIMATE was used to predict immune, stromal, and total ESTIMATE scores for individual tumor samples43. To investigate immune cell interactions within the TME, data from the web portal TISIDB were used to quantify the relative abundance of various immune cell types via single-sample gene set enrichment analysis (ssGSEA). Immune-related features were further assessed via KEGG_C2 pathway scores. The R package IOBR44,45 provided additional insights into immune cell infiltration and interactions.PPI analysis and machine learning for prognostic gene identificationPPI was conducted to identify core protein-related genes that were differentially expressed across glioma subtypes. Leveraging the CGGA301 dataset as a test set, we applied ten machine learning algorithms, including Lasso, Ridge, Enet, StepCox, survivalSVM, CoxBoost, SuperPC, plsRcox, RSF, and GBM. Under the framework of cross-validation, one algorithm was used for variable selection while another was employed to construct the prognostic model. The concordance index (C-index) was calculated for each model combination (totaling 101 combinations, Table S2) on external datasets (or including the training set). For the CoxBoost model, we first determined the optimal penalty term (shrinkage parameter) by invoking the “optimCoxBoostPenalty” function. Subsequently, tenfold cross-validation was performed to identify the optimal boosting steps for the CoxBoost model, and final model fitting was accomplished using the “CoxBoost” function. In terms of stepwise Cox analysis, we utilized the survival package and evaluated model complexity based on the Akaike Information Criterion (AIC). All possible combinations of direction parameters were considered, including “both” (bidirectional), "backward," and “forward” elimination approaches. The Lasso, Ridge, and Enet models were constructed using the "cv.glmnet" function from the glmnet package. A tenfold cross-validation approach was adopted to determine the regularization parameter lambda, with the compromise parameter alpha ranging between 0 and 1 at 0.1 intervals. Specifically, when alpha = 1, the Lasso model was implemented; when alpha = 0, the Ridge model was used; and for other alpha values, the Enet (Elastic Net) model was applied. For the survival support vector machine model, we employed the “survivalsvm” function from the survivalsvm package, which is specialized for survival outcome analysis. The GBM model was fitted using the “gbm” function from the gbm package combined with tenfold cross-validation. The SuperPC model, an extension of principal component analysis (PCA), was implemented via the superpc package. During model construction, tenfold cross-validation was performed using the "superpc.cv" function. The plsRcox model was directly established using the "cv.plsRcox" function from the plsRcox package. Finally, for the RSF (Random Survival Forest) model, we utilized the “rfsrc” function from the randomForestSRC package. In parameter settings, “ntree” represents the number of trees in the random forest, and “nodesize” denotes the minimum size of terminal nodes. In this study, “ntree” was set to 1000, and the minimum variable count for screening was configured as 5. A total of 101 unique algorithmic combinations were used to identify prognostic genes. We further refined our gene selection through univariate analysis and SVM analysis, focusing on TFs differentially expressed between glioma subtypes. Survival analysis was subsequently performed via the R packages survival and survminer, applying a significance threshold of p