Introduction

Single-cell RNA sequencing (scRNA-seq) has transformed our understanding of cellular biology by enabling the study of gene expression at the resolution of individual cells. Unlike bulk RNA-seq, which assumes cellular homogeneity within a tissue, scRNA-seq has revealed that cells within the same tissue can exhibit substantial heterogeneity. This cellular diversity plays a crucial role in driving phenotypic variation across tissues and biological systems1. A further advantage of scRNA-seq over traditional bulk sequencing is the identification of rare cell types within a tissue based on features that were previously undetectable2. It also permits a more in-depth examination of processes such as cell development and differentiation3. This enhancement in resolution, however, comes at the expense of increased technical noise and bias4, which makes pre-processing, quality control, and normalization essential for a thorough analysis of scRNA-seq data5.

Additionally, advances in cell profiling techniques and rapid cost reductions have led to the sequencing of millions of cells in parallel. Modern scRNA-seq technologies, such as those from 10x Genomics, routinely generate datasets containing hundreds of thousands of cells, each with expression profiles spanning over 20,000 genes6. The Illumina HiSeq, another next-generation sequencing platform, can alone generate 100 GB of raw transcriptomic data7. This results in extremely sparse, high-dimensional matrices that demand considerable computational power for downstream analysis8,9. Preprocessing steps such as quality control, normalization, dimensionality reduction, and clustering are computationally intensive, and traditional tools often struggle to handle such datasets on commodity hardware. Furthermore, as the field of genomics advances, atlas projects such as the Human Cell Atlas10 are expected to generate vast amounts of data, often measured in terabytes. To address this, growing interest has emerged in "big single-cell data science," which aims to adapt distributed computing tools to biological data7. The development of robust computational methods and big data tools to overcome these hurdles is essential for realizing the full potential of scRNA-seq in biomedical applications, including cancer research, developmental biology, and personalized medicine11. Frameworks such as Apache Spark and Hadoop offer a robust foundation for scalable and fault-tolerant data processing, yet their integration into bioinformatics pipelines remains limited12.

Despite the availability of powerful computational tools, there remains a lack of Spark-native frameworks specifically tailored to large-scale scRNA-seq analysis. Most existing pipelines, including popular tools such as Seurat13 and Scanpy14, rely on in-memory data structures and are constrained by available RAM, limiting their ability to scale to datasets in the range of hundreds of thousands to millions of cells.

In this work, we present scSPARKL, a modular, Spark-based analytical framework designed for scalable, parallel, and memory-efficient analysis of scRNA-seq data. The pipeline implements parallel routines for data reshaping, quality control, filtering, and normalization, while seamlessly integrating common downstream tasks such as dimensionality reduction and clustering.
We demonstrate its effectiveness through the reanalysis of real-world datasets and benchmarking on varying dataset sizes to evaluate performance gains under different Spark configurations. To the best of our knowledge, this is the first attempt to leverage the computational power of Apache Spark and the efficient data storage capabilities of Apache Parquet to develop a framework specifically tailored for downstream analysis of large-scale scRNA-seq data, enabling seamless execution even on commodity hardware.

Related work

The rise of single-cell RNA sequencing has driven an explosion in genomic data generation by enabling the study of gene expression at the level of individual cells, and with it a set of complex computational challenges15. The resulting datasets are high-dimensional, sparse, and noisy16, and researchers have proposed a range of techniques and frameworks to handle their scale and complexity efficiently17. These span dimensionality reduction methods such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE)18, clustering algorithms such as k-means and hierarchical clustering, and machine learning approaches, including deep learning, for cell type identification and trajectory inference19. Together, these methods underpin the interpretation of the biological processes captured in scRNA-seq data, supporting deeper investigation of cellular diversity and dynamics within complex biological systems.

The adoption of big data analytics tools like Apache Spark has significantly enhanced the processing and analysis of large-scale genomic datasets. Spark's distributed computing capabilities have been instrumental in handling vast amounts of genomic data efficiently20. Studies have demonstrated Spark's scalability, fault tolerance, and memory management advantages in genomic data analysis, making it a preferred choice for researchers21. In scRNA-seq analysis specifically, Spark has been leveraged to build scalable pipelines and tools tailored to the unique challenges of analyzing individual cell transcriptomes.

Glow, developed by the Regeneron Genetics Center and Databricks, is an open-source toolkit built on top of Apache Spark. It integrates genomic and phenotypic data using optimized algorithms for genomic data preparation, statistical analysis, and machine learning at biobank scale. The toolkit addresses the challenges of handling large genomic datasets by providing a framework that accelerates the data engineering required to store and analyze genomic data at scale.
Glow thus bridges traditional bioinformatics toolkits and the modern data analytics ecosystem, enabling the integration of machine learning into large health datasets that combine genomic data with a wide range of phenotypes22.

While these existing frameworks and pipelines have made substantial progress in addressing the computational hurdles of genomic data analysis, the rapid growth and complexity of datasets continue to push the limits of scalability and performance. Moreover, the diversity of experimental designs and analysis requirements calls for adaptable solutions tailored to specific research objectives. Our proposed scSPARKL pipeline introduces a scalable, highly parallel computational framework for handling large-scale scRNA-seq data. By leveraging Apache Spark together with techniques for parallelization and optimization, scSPARKL aims to overcome the constraints of current tools and enable efficient analysis of massive scRNA-seq datasets on everyday hardware setups.

Methods

Dataset description

To evaluate the usability and versatility of the scSPARKL pipeline, we reanalyzed three publicly available single-cell RNA-seq datasets, each representing a distinct biological context and data structure.

Mouse brain non-myeloid cells
This dataset is part of the Tabula Muris Senis compendium, which profiles single-cell transcriptomes across 20 tissues from Mus musculus, comprising approximately 100,000 cells in total23. For our analysis, we extracted the subset of non-myeloid brain cells, which includes 3,401 cells with gene expression measurements for 23,433 genes. The accompanying metadata provides detailed annotations such as sub-tissue origin (cortex, hippocampus, cerebellum, and striatum), cell ontology classifications, mouse ID, and sex.

Jurkat-293T cell mixture
This dataset, originally published by24, contains 3,388 cells derived from a 1:1 in vitro mixture of the human Jurkat and 293T cell lines. The dataset was designed to test whether line-specific expression profiles can be cleanly separated. Cells expressing CD3D were classified as Jurkat, while those expressing XIST were identified as 293T. This dataset serves as a clear benchmark for evaluating cell line separation and clustering accuracy.

68K PBMC
To assess the pipeline's performance on larger datasets and evaluate runtime scalability, we used the 68K Peripheral Blood Mononuclear Cell (PBMC) dataset, also from24. The dataset consists of single-cell transcriptomic profiles collected from a healthy donor, encompassing a heterogeneous population of immune cells. This dataset was used to generate random subsamples of increasing sizes (2K to 10K cells) for benchmarking temporal performance.

Pipeline development
For development, we used Apache Spark version 3.1.2, Python 3.9, and JDK version 8.0. Apache Spark is a distributed analytical engine made for handling big data, providing an essential parallel processing platform for large datasets25. At its core, Spark uses an abstraction called Resilient Distributed Datasets (RDDs): read-only collections of objects partitioned across a cluster of machines that can be recomputed if a machine is lost, providing fault tolerance12. All jobs are launched in parallel through a driver program, which also maintains the high-level control flow of the applications and jobs submitted to the executors (Fig. 1). Spark follows an in-memory processing model: it loads the data to be processed directly into Random Access Memory (RAM), improving response times by avoiding repeated reads from disk25.

Fig. 1 Apache Spark workflow. The Spark driver program acts as the master and the entry point for all Spark jobs. The master submits jobs to the worker nodes. The cluster manager keeps track of the nodes and the jobs distributed to them; available cluster managers include Yet Another Resource Negotiator (YARN), Kubernetes, Mesos, and standalone (used in our case). The worker nodes are the machines where tasks are executed, and they report back to the cluster manager.
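As a minimal sketch of how such an entry point might be configured for the standalone deployments described above (the master URL, memory values, and partition count are illustrative placeholders, not the pipeline's actual settings), a Spark session with statically allocated executor memory can be created as follows:

```python
from pyspark.sql import SparkSession

# Illustrative Spark entry point for commodity hardware; all values
# here are placeholders. A standalone cluster would be addressed via
# "spark://<master-host>:7077" instead of local mode.
spark = (SparkSession.builder
         .appName("scSPARKL")
         .master("local[8]")
         .config("spark.executor.memory", "2g")
         .config("spark.driver.memory", "2g")
         .config("spark.sql.shuffle.partitions", "64")
         .getOrCreate())
```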
The scSPARKL pipeline consists of three stages. The algorithms for all three stages (Pseudocodes 1, 2, and 3) were designed so that ordinary commodity hardware can support the analysis while allowing Spark to be optimized where possible. The stages are modular, supporting a variety of analyses. For a thorough understanding of how the stages work internally, we provide a Data Flow Diagram (DFD) for each stage (Supplementary Sect. 1).

Stage 1: Preprocessing
Two dedicated Spark modules were developed for data loading and data filtering. The data_load.py module loads the data as a Spark dataframe and performs initial preprocessing, such as replacing special characters in column names (periods, commas, blank spaces, etc.) with underscores ('_'). scRNA-seq data is usually stored in wide format, whereas Spark works more efficiently with tall/long format data; hence, for optimization, we reshape the data to tall format using the pipeline's melt_Spark() function, reducing the columnar load and increasing efficiency. This step is essential for users with low-memory machines (as low as 4 GB). The dataframe is then written as a Parquet file. Apache Parquet is an open-source columnar storage format based on a record shredding and assembly algorithm, designed to handle large and complex datasets and to accelerate aggregation queries. Parquet files are compressed, which speeds up subsequent analytical operations. We rely mainly on Spark's aggregate functions, which offer fast retrieval at low computational cost. Pseudocode 1 outlines the data cleaning and reshaping in detail.
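Spark 3.1 has no built-in melt operation, so one way to express the wide-to-tall reshape that melt_Spark() performs is with a stack() expression. The following is an illustrative sketch rather than the pipeline's exact implementation; the file name counts.csv and the cell_id column are assumptions:

```python
import re
from pyspark.sql import functions as F

# Reuses the `spark` session from the sketch above. Load the wide
# cell x gene matrix (one row per cell, one column per gene).
wide = spark.read.csv("counts.csv", header=True, inferSchema=True)

# Replace special characters in column names with underscores;
# Parquet rejects characters such as ' ,;{}()=' in field names.
for old in wide.columns:
    new = re.sub(r"[^0-9A-Za-z_]", "_", old)
    if new != old:
        wide = wide.withColumnRenamed(old, new)

gene_cols = [c for c in wide.columns if c != "cell_id"]

# Wide -> tall reshape via stack(n, name1, val1, ..., nameN, valN),
# which emits one (gene, count) row per gene for every cell.
stack_expr = "stack({n}, {pairs}) as (gene, count)".format(
    n=len(gene_cols),
    pairs=", ".join(f"'{c}', `{c}`" for c in gene_cols),
)
tall = wide.select("cell_id", F.expr(stack_expr))

# Persist the tall matrix as compressed Parquet for the later stages.
tall.write.mode("overwrite").parquet("counts_tall.parquet")
```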
Stage 2: Quality control
Coherent and accurate downstream analysis of scRNA-seq data requires filtering out low-quality cells and genes. We followed the scater package9, SCANPY14, and Seurat13 in generating a variety of quality control metrics. These metrics are calculated from a number of parameters, such as the total number of genes expressed per cell, the percentage of External RNA Controls Consortium (ERCC) spike-in counts, the number of cells in which each gene is expressed above zero, and the mitochondrial percentage detected in each cell. We also compute log-scale and mean measurements of several parameters for easier interpretation, so that suitable thresholds for cell and gene filtering can be defined.

Given a cell-by-gene matrix \(A_{ij}\), where \(i \in \{\text{cell}_1, \text{cell}_2, \ldots, \text{cell}_n\}\) and \(j \in \{\text{gene}_1, \text{gene}_2, \ldots, \text{gene}_m\}\), the cell quality and gene quality summaries, denoted \(\Theta_1\) and \(\Theta_2\) respectively, are given in Tables 1 and 2.

Table 1 Θ1 summaries and formulas for cell quality.

Table 2 Θ2 summaries and formulas for gene quality.

In addition to the standard filtering procedures, we implemented a parallel Median Absolute Deviation (MAD) filtering method, which removes genes whose normalized deviation from the median exceeds a chosen threshold (usually 3 MADs). Pseudocode 2 outlines the overall procedure for computing cell and gene quality measurements, while Pseudocode 3 outlines the filtering based on these quality summaries, along with the subsequent steps of normalization, dimensionality reduction, and clustering, which are described in detail in the following sections.
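To illustrate how such summaries and the MAD cutoff can be computed with Spark aggregate functions on the tall matrix, here is a hedged sketch. The MAD filter is applied to per-cell total counts purely for illustration (the same pattern applies to gene-level summaries), and the 'mt_' gene prefix for mitochondrial genes is a dataset-dependent assumption:

```python
from pyspark.sql import functions as F

tall = spark.read.parquet("counts_tall.parquet")  # cell_id, gene, count

# Per-cell quality summaries via Spark aggregate functions: total
# counts, number of genes detected, and percent mitochondrial counts.
cell_qc = tall.groupBy("cell_id").agg(
    F.sum("count").alias("total_counts"),
    F.count(F.when(F.col("count") > 0, True)).alias("genes_detected"),
    (100 * F.sum(F.when(F.col("gene").startswith("mt_"), F.col("count"))
                 .otherwise(0)) / F.sum("count")).alias("percent_mito"),
)

# MAD filter: keep cells whose total counts lie within 3 median
# absolute deviations of the median (threshold chosen by the user).
median = cell_qc.approxQuantile("total_counts", [0.5], 0.001)[0]
mad = (cell_qc
       .withColumn("abs_dev", F.abs(F.col("total_counts") - F.lit(median)))
       .approxQuantile("abs_dev", [0.5], 0.001)[0])

kept_cells = cell_qc.filter(
    F.abs(F.col("total_counts") - F.lit(median)) <= 3 * F.lit(mad)
)
```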
Stage 3: Normalization and dimensionality reduction
Due to the inherent sparsity and variability of single-cell RNA-seq data, normalization is a critical preprocessing step to ensure that downstream analyses reflect true biological signals rather than technical noise. Proper normalization improves the comparability of gene expression across cells and enhances the accuracy of clustering, dimensionality reduction, and differential expression analyses. The data_normalize.py module in the scSPARKL framework currently supports two normalization methods: Quantile Normalization (QN) and Global Normalization (GN).

In QN, the gene expression values within each cell are first ranked from lowest to highest. Across all cells, values at the same rank are averaged to form a new reference distribution, and these averages are then mapped back to each cell according to its original rank order. This standardizes expression distributions across cells while preserving the relative expression relationships within each cell, and is particularly useful for making datasets with different expression scales comparable.

Global normalization, on the other hand, adjusts each cell's total expression to a fixed scale (e.g., counts per 10,000), followed by a log transformation. This corrects for differences in sequencing depth and is commonly used in standard scRNA-seq workflows. Each expression value in a cell is divided by that cell's total count, multiplied by 10,000, and log-transformed after the addition of a pseudocount of 1. Mathematically, if \(X_{n \times m}\) is an unnormalized filtered scRNA-seq matrix, the normalized matrix \(X_N\) is given by:

$$X_{N}(i,j) = \frac{x_{i,j}}{\sum_{k=1}^{m} x_{i,k}} \times 10{,}000 \quad (1)$$

where \(x_{i,j} \in X_{n \times m}\) and 10,000 is a scaling factor. This scaled normalized matrix is then log-transformed:

$$X_{SN} = \log\left(X_{N} + 1\right) \quad (2)$$

where 1 is a pseudocount. Both normalization strategies are integrated into the Spark-based pipeline to support parallel execution and efficient processing of large-scale datasets.
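A compact sketch of how Eqs. (1) and (2) can be evaluated in parallel on the tall matrix follows; the input file name is a placeholder for the filtered output of Stage 2, and this is an illustration rather than the module's exact code:

```python
from pyspark.sql import Window, functions as F

tall = spark.read.parquet("counts_filtered.parquet")  # cell_id, gene, count

# CP10K + log1p per Eqs. (1)-(2): a window over cell_id yields each
# cell's total counts without an explicit join.
w = Window.partitionBy("cell_id")
normalized = (tall
    .withColumn("cell_total", F.sum("count").over(w))
    .withColumn("norm", F.log1p(F.col("count") / F.col("cell_total") * 10000))
    .drop("cell_total"))
```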
Dimensionality reduction
Dimensionality reduction is a key step in scRNA-seq analysis, aimed at projecting high-dimensional gene expression data into a lower-dimensional space while preserving the underlying structure and relationships among cells26. Due to the high dimensionality and sparsity of scRNA-seq data, conventional analysis methods can be affected by the "curse of dimensionality", a phenomenon in which distances between data points become less meaningful in high-dimensional spaces27.

In this study, we employed three widely used dimensionality reduction techniques: Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP). Among these, PCA is implemented using Spark's MLlib library, allowing distributed and parallel computation across nodes. This integration aligns with the design goals of scSPARKL to support scalable, high-throughput analysis.

By contrast, UMAP and t-SNE are currently implemented using standard Python libraries and operate outside the Spark execution engine. Nevertheless, these methods read input directly from intermediate Apache Parquet files, benefiting from efficient I/O and compatibility with Spark-generated outputs. While these steps are not fully parallelized within Spark, their integration into the pipeline remains seamless and supports downstream visualization and clustering workflows.

Clustering and further downstream analysis
A major advantage of single-cell RNA sequencing lies in its ability to identify both known and novel cell populations through unsupervised clustering28. As single-cell profiling technologies continue to evolve, improvements in capture efficiency have enabled the sequencing of millions of individual cells, giving rise to increasingly large and complex datasets. This scale imposes substantial computational demands on the clustering algorithms used for cell-type discovery and classification29.

To address this, scSPARKL integrates the K-means clustering algorithm, which partitions a dataset of n observations into k clusters (k ≤ n) by minimizing the within-cluster variance. The algorithm operates iteratively, first randomly selecting k centroids and then assigning each data point to the nearest centroid. This process continues until convergence, resulting in compact and well-separated clusters.

To support an informed selection of k, the pipeline includes both the elbow plot method and silhouette score analysis, allowing users to evaluate clustering stability and separation, as sketched below.
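Under the assumption of a pivoted cell-by-gene DataFrame of normalized values (normalized_wide and gene_cols below are hypothetical names, not the pipeline's), the MLlib PCA and K-means steps, together with an elbow/silhouette sweep over k, might look as follows:

```python
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Pack the per-gene columns into a single feature vector per cell.
assembler = VectorAssembler(inputCols=gene_cols, outputCol="features")
matrix = assembler.transform(normalized_wide)

# Distributed PCA via MLlib: project cells onto the top 50 components.
reduced = PCA(k=50, inputCol="features", outputCol="pca") \
    .fit(matrix).transform(matrix)

# Sweep k, recording within-cluster cost (for the elbow plot) and the
# silhouette score, to guide the choice of k.
evaluator = ClusteringEvaluator(featuresCol="pca", metricName="silhouette")
for k in range(2, 11):
    model = KMeans(k=k, seed=1, featuresCol="pca").fit(reduced)
    clusters = model.transform(reduced)
    print(k, model.summary.trainingCost, evaluator.evaluate(clusters))
```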
Once clusters are defined, differential gene expression (DGE) analysis is performed to identify the genes most distinctively expressed between groups. For this, we apply a two-sample t-test across clusters and rank the top 10 significant genes by p-value, providing an interpretable set of marker genes for downstream biological validation.

Results

The pipeline was developed and executed on Windows 11 with 8 GB RAM and an 8-core Intel i7 3.8 GHz processor. Table 3 provides the details of the hardware used in the experiments. In the single-node Spark setup, one machine was used as a worker and another as the master. In the two-node Spark environment, two machines served as workers and one machine acted as the master node. Memory was allocated statically, with each executor provided 2 GB; the remaining memory was reserved for the OS and Hadoop daemons.

Table 3 Details of the hardware used in the experiments.

The scSPARKL pipeline (Fig. 2) provides a promising approach for single-cell RNA sequencing data analysis. To our knowledge it is the first Spark-based computational and analytical framework for scRNA-seq data, and it leverages Spark's core engine to enable parallel resource utilization, dramatically reducing processing time. The pipeline efficiently performs a comprehensive suite of analyses, including data quality control, filtering (Fig. 3a, b), normalization, dimensionality reduction, clustering, and visualization, in less than 10 minutes. Our reanalysis results closely align with those presented by the original authors23 (Brain Non-Myeloid, n = 3,401) and24 (PBMC, n = 10,000), with total computation time as low as ~8 minutes, demonstrating both the efficiency and reliability of our approach. Importantly, the scSPARKL framework is fully open source, providing the scientific community with a tool that can be freely accessed, modified, and extended to meet evolving research needs. This openness not only facilitates reproducibility but also encourages collaborative improvement and adaptation of the framework for diverse scRNA-seq applications.

Fig. 2 General workflow of the scSPARKL pipeline. A complete, step-by-step description of each module is provided in the Materials and Methods section.

Fig. 3 Histograms of gene and cell quality metrics for (a) mouse brain cells and (b) Jurkat-293T cells. These are used to decide the filtering thresholds, depending on the nature of the data and the biological question at hand.

scSPARKL shows consistency with prior analyses

To validate our framework, we reanalyzed the Brain Non-Myeloid and Jurkat-293T datasets using scSPARKL. The outcomes of our analysis closely aligned with those of the original studies, demonstrating comparable biological accuracy. The framework accurately differentiates between cell types, yielding results consistent with those reported in the original studies. In the analysis of mouse brain non-myeloid cells, our pipeline identified a predominant presence of oligodendrocytes and endothelial cells, mirroring findings from prior work (Fig. 4a, b). Similarly, in the Jurkat-293T cell mixture, scSPARKL clearly distinguished the two cell lines, effectively separating them in t-SNE space (Fig. 4c, d), further validating the biological accuracy of the pipeline.

Fig. 4 Comparison of scSPARKL with original studies. (a) t-SNE projection of the Brain Non-Myeloid data processed with SCANPY. (b) Corresponding t-SNE visualization of Brain Non-Myeloid cells generated with scSPARKL. (c) Reference t-SNE visualization of the Jurkat-293T data from the original authors. (d) t-SNE visualization of Jurkat-293T cells generated with scSPARKL. Legends are shared for panels (a) and (b), and separately for panels (c) and (d).

Some observed outliers in the t-SNE plots (Fig. 4d) likely reflect differences in cell and gene filtering thresholds as well as t-SNE parameter settings, which were kept at their defaults for consistency across datasets. The default filter parameters used during analysis included: percent_ercc > 10%, percent_mito > 5%, cells expressing