Brief Communication
Published: 27 June 2025

Yang Zhou (1,2), Qiongyu Sheng (1,2), Guohua Wang (3), Li Xu (4; ORCID: orcid.org/0000-0003-4950-0789) & Shuilin Jin (1,2; ORCID: orcid.org/0000-0002-2318-432X)

Nature Computational Science (2025)

Subjects: Computational models, Data integration, Statistical methods

Abstract

Batch effects substantially impede the comparison of multiple single-cell experiment batches. Existing methods for batch effect removal and quantification primarily emphasize cell alignment across batches, often overlooking gene-level batch effects. Here we introduce group technical effects (GTE), a quantitative metric to assess batch effects on individual genes. Using GTE, we show that batch effects unevenly impact genes within the dataset. A portion of highly batch-sensitive genes (HBGs) differ between datasets and dominate the batch effects, whereas non-HBGs exhibit low batch effects. We demonstrate that as few as three HBGs are sufficient to introduce substantial batch effects. Our method also enables the assessment of cell-level batch effects, outperforming existing batch effect quantification methods. We also observe that biologically similar cell types undergo similar batch effects, informing the development of data integration strategies. The GTE method is versatile and applicable to various single-cell omics data types.

Fig. 1: Quantification of batch effects for individual genes.
Fig. 2: Batch effect removal guided by GTE.

Data availability

All datasets used in this paper are publicly available. Specifically, the mouse MOp dataset can be accessed via the CELLxGENE portal at https://cellxgene.cziscience.com/collections/ae1420fe-6630-46ed-8b3d-cc6056a66467. The mouse retina and human cell line datasets are available at https://github.com/JinmiaoChenLab/Batch-effect-removal-benchmarking/tree/master/Data. The human cortical development and human PBMCs (CITE-seq) datasets can be accessed from the Gene Expression Omnibus (GEO) under accession codes GSE168408 and GSE156473, respectively. The human cortex dataset is available from https://cellxgene.cziscience.com/collections/35928d1c-36fc-4f93-9a8d-0b921ab41745. The human MTG dataset can be accessed via the Allen Brain Map portal at https://portal.brain-map.org/atlases-and-data/rnaseq. The human heart dataset is available from the Heart Cell Atlas project at https://www.heartcellatlas.org. The human PBMCs (scRNA-seq) dataset is accessible at https://github.com/satijalab/seurat-data (ref. 33). The TCGA READ bulk RNA-seq dataset can be accessed through Zenodo at https://doi.org/10.5281/zenodo.6392171 (ref. 34). The mouse brain scATAC-seq datasets (peak and gene activity versions) are available at https://doi.org/10.6084/m9.figshare.12420968.v8 (ref. 35). The mouse cell line proteomics dataset can be accessed via https://scproteomicsdb.com. Refer to Supplementary Table 2 for further details of the datasets. Processed datasets used for the analyses have been deposited in Zenodo at https://doi.org/10.5281/zenodo.13358933 (ref. 36). Source data are provided with this paper.

Code availability

The code and source data used to generate the results in this paper are available at GitHub (https://github.com/yzhou1999/GTEs; ref. 37) and at Zenodo (https://doi.org/10.5281/zenodo.15412860; ref. 38).

References

1. Youden, W. J. Enduring values. Technometrics 14, 1–11 (1972).
2. Lander, E. S. Array of hope. Nat. Genet. 21, 3–4 (1999).
3. Akey, J. M. et al. On the design and analysis of gene expression studies in human populations. Nat. Genet. 39, 807–808 (2007).
4. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 11, 733–739 (2010).
5. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–127 (2007).
6. Smyth, G. K. & Speed, T. Normalization of cDNA microarray data. Methods 31, 265–273 (2003).
7. Haghverdi, L. et al. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
8. Butler, A. et al. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
9. Lopez, R. et al. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
10. Xu, C. et al. Probabilistic harmonization and annotation of single-cell transcriptomics data with deep generative models. Mol. Syst. Biol. 17, e9620 (2021).
11. De Donno, C. et al. Population-level integration of single-cell datasets enables multi-scale analysis across samples. Nat. Methods 20, 1683–1692 (2023).
12. Yao, Z. et al. A transcriptomic and epigenomic cell atlas of the mouse primary motor cortex. Nature 598, 103–110 (2021).
13. Büttner, M. et al. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).
14. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
15. Luecken, M. D. et al. Benchmarking atlas-level data integration in single-cell genomics. Nat. Methods 19, 41–50 (2022).
16. Luecken, M. D. & Theis, F. J. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol. Syst. Biol. 15, e8746 (2019).
17. Subramanian, A. et al. Biology-inspired data-driven quality control for scientific discovery in single-cell transcriptomics. Genome Biol. 23, 267 (2022).
18. Ilicic, T. et al. Classification of low quality cells from single-cell RNA-seq data. Genome Biol. 17, 29 (2016).
19. CZI Cell Science Program et al. CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data. Nucleic Acids Res. 53, D886–D900 (2024).
20. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
21. Chazarra-Gil, R. et al. Flexible comparison of batch correction methods for single-cell RNA-seq using BatchBench. Nucleic Acids Res. 49, e42 (2021).
22. Lütge, A. et al. CellMixS: quantifying and visualizing batch effects in single-cell RNA-seq data. Life Sci. Alliance 4, e202001004 (2021).
23. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
24. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).
25. Molania, R. et al. Removing unwanted variation from large-scale RNA sequencing data with PRPS. Nat. Biotechnol. 41, 82–95 (2023).
26. Leduc, A. et al. Exploring functional protein covariation across single cells using nPOP. Genome Biol. 23, 261 (2022).
27. Derks, J. et al. Increasing the throughput of sensitive proteomics by plexDIA. Nat. Biotechnol. 41, 50–59 (2023).
28. Mimitou, E. P. et al. Scalable, multimodal profiling of chromatin accessibility, gene expression and protein levels in single cells. Nat. Biotechnol. 39, 1246–1258 (2021).
29. McCarthy, D. J. et al. Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. Bioinformatics 33, 1179–1186 (2017).
30. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
31. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
32. Yu, G. et al. ClusterProfiler: an R package for comparing biological themes among gene clusters. Omics 16, 284–287 (2012).
33. Satija, R. et al. seurat-data. GitHub https://github.com/satijalab/seurat-data (2025).
34. Molania, R. Vignettes: removing unwanted variation from TCGA RNA-seq data. Zenodo https://doi.org/10.5281/zenodo.6392171 (2025).
35. Luecken, M. et al. Benchmarking atlas-level data integration in single-cell genomics - integration task datasets. figshare https://doi.org/10.6084/m9.figshare.12420968.v8 (2022).
36. Zhou, Y., Sheng, Q., Wang, G., Xu, L. & Jin, S. Quantifying batch effects for individual genes in single-cell data. Zenodo https://doi.org/10.5281/zenodo.13358933 (2024).
37. Zhou, Y. GTEs. GitHub https://github.com/yzhou1999/GTEs (2025).
38. Zhou, Y. GTEs R package. Zenodo https://doi.org/10.5281/zenodo.15412860 (2025).
39. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).
40. Fang, R. et al. Comprehensive analysis of single cell ATAC-seq data with SnapATAC. Nat. Commun. 12, 1337 (2021).

Acknowledgements

This work was supported by the National Natural Science Foundation of China (grant no. 62271173 to S.J., no. 62172122 to L.X. and no. 124B2027 to Y.Z.), the Key Research and Development Program of Heilongjiang (grant no. 2022ZX01A19 to S.J.), the Natural Science Foundation of Heilongjiang Province, China (grant no. JQ2023A003 to S.J.), and the Fundamental Research Funds for the Central Universities (grant no. HIT.DZJJ.2023133 to Q.S. and no. HIT.DZJJ.2024043 to Y.Z.).

Author information

Authors and Affiliations
1. School of Mathematics, Harbin Institute of Technology, Harbin, China: Yang Zhou, Qiongyu Sheng & Shuilin Jin
2. Zhengzhou Research Institute, Harbin Institute of Technology, Zhengzhou, China: Yang Zhou, Qiongyu Sheng & Shuilin Jin
3. College of Computer and Control Engineering, Northeast Forestry University, Harbin, China: Guohua Wang
4. College of Computer Science and Technology, Harbin Engineering University, Harbin, China: Li Xu

Contributions
S.J. supervised the study. Y.Z. conceived and developed the method, and designed the analysis. Y.Z. and Q.S. performed the analysis. G.W. and L.X. checked the analysis results. Y.Z., Q.S., G.W., L.X. and S.J. wrote the paper. All authors read and approved the final paper.

Corresponding authors
Correspondence to Li Xu or Shuilin Jin.

Ethics declarations

Competing interests
The authors declare no competing interests.

Peer review

Peer review information
Nature Computational Science thanks Lachlan Coin, Debashis Ghosh and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information: Supplementary Notes 1–3, Algorithm 1 and Figs. 1–22.
Reporting Summary
Peer Review File
Supplementary Table 1: Identified common HBGs and non-HBGs.
Supplementary Table 2: Details of datasets used in the manuscript.

Source data

Source Data Fig. 1: The numerical source data for Fig. 1.
Source Data Fig. 2: The numerical source data for Fig. 2.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.