Quantification of transcript isoforms at the single-cell level using SCALPEL

Wait 5 sec.

IntroductionAlternative polyadenylation (APA) is a general mechanism of post-transcriptional regulation that significantly contributes to the diversification of gene expression patterns under diverse physiological and pathological conditions1. APA defines the end of transcripts by selecting one of the available polyA sites (PAS) at the 3’ end of genes, resulting in the generation of multiple mature RNA isoforms from the same pre-mRNA2. These isoforms may have different coding regions or contain distinct 3’ untranslated regions (3’ UTRs), which contain regulatory elements influencing mRNA stability, localization, and translational efficiency3,4,5. Transcriptomic studies have demonstrated that APA is highly regulated in a tissue specific manner6 and plays a crucial role in various biological processes, including cellular differentiation7, development8,9,10, and response to environmental cues11. Alterations in APA patterns have been linked to various diseases, where they can lead to aberrant gene expression and even cancer12,13.The development of high-throughput single-cell transcriptomics technologies (scRNA-seq) has led to the emergence of computational methods to characterize the transcriptomic profile of thousands of individual cells in a single experiment14. While these methods are mainly used to quantify gene expression, 3’ tag-based scRNA-seq protocols such as Drop-seq15 or 10x Genomics provide opportunities to study 3’ end isoform diversity. Currently, only a few computational tools allow to study isoform diversity generated by APA in scRNA-seq data and most of them face significant drawbacks. They often fail to detect polyadenylation sites (PAS) with low read coverage due to the sparse nature of single-cell data and they lack the precision needed to accurately pinpoint the exact PAS locations, leading to potential misidentification and incomplete characterization of isoform diversity16,17,18,19. Alternative methods based on isoform quantification such as scUTRquant20 have been shown to be more powerful in quantifying transcript diversity from scRNA-seq data. Yet, the main power of this method relies on an improved curated 3’ end annotation that is not available for most species.Here, we present SCALPEL, a Nextflow workflow21 to quantify isoform expression using commonly used 3’ tag-based scRNA-seq data. Comparison of SCALPEL to other existing tools using synthetic data shows that SCALPEL predictions have a higher sensitivity than other tools while maintaining a high specificity. On real data, SCALPEL predictions show a strong agreement with other tools and can be validated experimentally. Our analysis revealed that isoform-based analysis of single-cell data recapitulates known biological processes such as 3’ UTR lengthening during mouse spermatogenesis, identify novel cell populations, and reflect post-transcriptional regulatory processes such as microRNA function. Furthermore, we also demonstrate how SCALPEL can be used to improve isoform quantification at the single-cell level using paired long and short scRNA-seq data. Together, our work highlights the versatility of SCALPEL across datasets and highlights the power of isoform-based analysis in single-cell studies.ResultsSCALPEL, a new computational tool for isoform quantification using scRNA-seq dataSCALPEL is a new computational Nextflow workflow21 to quantify transcript isoforms from single-cell data. It takes as input the digital gene expression matrix (DGE) generated by a scRNA-seq processing pipeline such as CellRanger or Drop-seq tools and the mapped reads in BAM format and uses them to decompose gene expression into isoform expression data (Fig. 1a). SCALPEL workflow is divided into three main modules (Fig. 1b). In the first module, raw sequencing data and annotation files are processed to perform bulk quantification of the annotated isoforms. These isoforms are then truncated and collapsed, giving rise to a set of distinct isoforms with different 3’ ends for quantification at single-cell resolution. In the second module, scRNA-seq reads are mapped on the set of selected isoforms and reads coming from pre-mRNAs or resulting from internal priming (IP) events are discarded. In the last module, isoforms are quantified in individual cells and an isoform digital gene expression matrix (iDGE) is generated (Fig. S1). The iDGE can be processed to perform downstream single-cell level analyses such as dimensionality reduction, clustering, marker discovery and trajectory inference. Furthermore, it can also be used to study differential isoform usage (DIU) and visualize isoform coverage using additional functions included in SCALPEL repository. The main novelty that SCALPEL brings is the pseudo-assembly of reads with the same cell barcode (CB) and unique molecular identifier (UMI). This approach helps in the assignment of UMIs to individual isoforms by considering the global transcript structure and jointly modeling the distance of the reads with the same UMI to the 3’ end of the transcripts (Fig. S1).Fig. 1: SCALPEL pipeline quantifies transcript isoforms at the single-cell level.a Schematic representation of SCALPEL function. SCALPEL decomposes conventional scRNA-seq data mapped to a gene to the different isoforms that are expressed. b SCALPEL Nextflow pipeline diagram. SCALPEL is composed of four workflows performing (1) annotation preprocessing (black line); (2) read preprocessing to discard artifacts and reads derived from pre-mRNAs (blue line); (3) quantification of isoforms in individual cells (green line); and (4) characterization of differential isoform usage (orange line).Full size imageSCALPEL shows accurate quantification of isoforms at single-cell resolutionTo assess the performance of SCALPEL to quantify isoforms in scRNA-seq data, we generated synthetic single-cell isoform expression datasets. First, we simulated single-cell gene expression values for 6000 cells belonging to two different cell populations using Splatter22 v1.28.0. (Fig. 2a). These cells expressed a total of 6560 genes and 12,320 isoforms including genes with changes in expression and/or isoform usage across cell populations (Fig. 2b and Supplementary Data 1). Using this strategy, we generated three datasets with different drop-out rates: one similar to a real 10x dataset used as reference23 and two other datasets with lower coverage (Figs. 2c and S2a). Given that SCALPEL uses as input mapped reads, we developed a new method, scr4eam (https://github.com/plasslab/scr4eam), to generate isoform-aware realistic scRNA-seq reads for the synthetic iDGEs (Fig. S2b, c). We used these three datasets to compare SCALPEL isoform quantifications with simulated isoform quantification. In all datasets, we find a high correlation between simulated isoform abundances and the isoform quantification provided by SCALPEL (Pearson correlation coefficient r ≥ 0.8, Fig. 2d–f).Fig. 2: Benchmark of SCALPEL using synthetic data.a UMAP plots depicting the two cell populations, population A (orange) and population B (green), simulated with Splatter. b Graphical representation of the types of genes included in the simulated data. c Violin plots showing the relative number of UMIs per isoform in a reference 10x dataset (gray) and those in the simulated datasets with different dropout rates (high, mid and low depicted in dark, medium and light purple). The real dataset contains information for 40,750 isoforms across 2042 cells. 12,320 isoforms in 6000 cells were simulated for the high, mid and low datasets. Differences in the distribution of UMIs per isoform were tested using Mann–Whitney–Wilcoxon two-sided test. The center of the box plot is denoted by the median, a horizontal line dividing the box into two equal halves. The bounds of the box are defined by the lower quartile (25th percentile) and the upper quartile (75th percentile). The whiskers extend from the box and represent the data points that fall within 1.5 times the interquartile range (IQR) from the lower and upper quartiles. Any data point outside this range is considered an outlier and plotted individually. p-Values for pairwise comparisons (Bonferroni-adjusted) are: High vs. real p-value = 0.86; high vs. mid p-value = 4.40e−259; high vs. low p-value = 2.22e−308; mid vs. low p-value = 2.6e−174. Correlation between simulated isoform abundances (y-axis) and predicted isoform abundances (x-axis) for the high (d), medium (e) and low (f) sequencing depth simulated datasets. g–i Number of correctly identified genes (g), isoforms (h), and DIU genes (i) by each of the sequencing tools in the high (dark purple), medium (medium purple) and low (light purple) simulated datasets. As reference, we provide the number of simulated genes, isoforms and DIU genes simulated (gray bar). j Summary of the performance of the different tools benchmarked on the three synthetic datasets. Source data are provided as a Source Data file.Full size imageBenchmark of SCALPEL using synthetic data shows higher sensitivity and specificity than other toolsWe used the generated synthetic datasets to benchmark the performance of SCALPEL against existing tools developed to quantify APA in scRNA-seq data16,17,18,20,24,25. Considering the underlying quantification strategy, these methods can be divided into peak-calling based tools (Sierra, scAPA, scAPAtrap, SCAPTURE, and scDaPars), and isoform quantification tools (scUTRquant) (Supplementary Data 2). Following the quantification of peaks or isoforms according to the default parameters of each tool, we performed DIU analysis. In the case of scUTRquant20, which uses an extended curated 3’ UTR annotation (3’ UTRome), we performed the benchmarking using both the 3’ UTRome (scUTRquant) and the same annotation as the other tools (scUTRquant*).Overall, our analyses show a clear difference in sensitivity between peak- and isoform-based methods. Peak-based methods quantified fewer genes and isoforms than isoform-based methods, which showed similar sensitivity (Fig. 2g, h). Some of the sensitivity differences between peak and isoform tools can be explained by the constraints of the prediction methods. In particular, the low number of isoforms quantified by scDaPars24 could be explained because this tool only predicts PAS in annotated 3’ UTRs. As expected, sequencing depth impacts the quantification of genes and isoforms. Most methods detect fewer genes and isoforms in the datasets with medium and low UMI per isoform (Fig. 2g, h).We next identified DIU genes across the two cell populations simulated. In all three simulated datasets, SCALPEL recovered the highest number of DIU genes closely followed by scUTRquant* and scUTRquant (Fig. 2i, Supplementary Data 3). Both methods have a superior performance than peak-based tools although SCALPEL has higher sensitivity than scUTRquant (Figs. 2j and S3a–c). The higher sensitivity of SCALPEL is partly explained by a robust performance across expression ranges while all other tools misidentify lowly expressed (bottom 50% expression) DIU genes (Fig. S3d–f). In the low expression dataset, SCALPEL correctly identifies 57% of DIU genes among Q1 genes, while scUTRquant and scUTRquant* identified 19% and 22%, respectively (Supplementary Data 3). We also noticed that true DIU genes predicted by SCALPEL have a high degree of agreement and are co-detected by at least one of the other benchmarked tools (high: 95%, mid: 93%, low: 91%) (Fig. S3g–i), indicating that tool agreement can be used as a metric to assess its performances on real datasets.Finally, we decided to compare the performance of SCALPEL in terms of execution requirements and runtime. SCALPEL runtime and memory usage are comparable or better than most of the tools analyzed, and only scUTRquant is faster and more memory efficient than SCALPEL. Yet, this can be directly linked to the presence of an already processed annotation as both memory and execution time increase when no 3’ UTRome is provided to scUTRquant (scUTRquant*, Fig. S3j–l, Supplementary Data 4).SCALPEL predictions on real data are more sensitive and have a high degree of agreement with other toolsWe additionally benchmarked SCALPEL performance against the other tools using a publicly available single-cell dataset of mouse sperm cell differentiation generated using 10x Genomics platform23 (Fig. S4a). In this dataset, SCALPEL identified 51,767 isoforms in 17,525 genes that were used for downstream analyses such as dimensionality reduction and clustering. We found three main cell populations using markers from a previous study26: spermatocytes (SC), round spermatids (RS) and elongated spermatids (ES) (Fig. S4b). Pairwise comparison between these populations shows again that the number of predicted DIU genes is generally higher for isoform-based approaches than for peak based methods (Fig. S4c–e, Supplementary Data 5). The higher number of identified DIU genes by SCALPEL is partly explained by a higher prediction among less expressed genes (bottom 50% expression). Here, it is important to note that the number of DIU genes detected by scUTRquant using the standard gene annotation (scUTRquant*) is clearly reduced, indicating that the higher sensitivity of scUTRquant can be directly attributed to the use of an extended annotation (3’ UTRome) and not to the algorithm per se.Finally, we investigated the agreement in the prediction of genes with differential peak or isoform usage across tools. We observed a substantial overlap in the predictions of SCALPEL with other tools, with more than 70% of SCALPEL predictions supported by one or more tools (Fig. S4f–h). SCALPEL and scUTRquant showed the highest agreement on the identified DIU genes across the cell types (Fig. S4i–n).We additionally benchmarked the performance of SCALPEL on a shallower single-cell dataset generated in-house with a Drop-seq platform containing human induced pluripotent stem cells (iPSCs) and neural progenitor cells (NPCs) (Fig. 3a, b). In this dataset, SCALPEL quantified 68,813 isoforms in 16,544 genes and more than 10,000 genes with two or more isoforms (Supplementary Data 6–8). When performing the benchmark on the neuronal dataset, SCALPEL and scAPAtrap showed higher sensitivity in the identification of DIU genes while keeping a high degree of agreement with other tools (Fig. S5). In this case, although scUTRquant quantified a high number of isoforms (40,002) and genes with multiple isoforms (10,481), only 246 DIU genes were identified between NPCs and iPSCs (Fig. S5d, e). This drop in the number of detected DIU genes is likely arising from stringent default parameters which discard genes expressed in a few cells (minCellsPerGene = 50).Fig. 3: SCALPEL predictions on iPSC and NPC data can be experimentally validated and reflect miRNA function.a Clustering analysis identifies two main populations corresponding to iPSCs (salmon) and NPCs (cyan). b Distribution of the number of UMIs, Genes and % of mitochondrial UMIs (MT) of the two samples analyzed. iPSC violin plots show measurements for 1233 cells and the NPC violin plots show measurements for 1302 cells. Overlaid boxplots indicate the median (center line), the 25th and 75th percentiles (box limits), and the most extreme values within 1.5 times the IQR (whiskers). iPSCs: median = 7047, box = [4166, 11,495], whiskers = [597, 19,973]; NPCs: median = 2952, box = [1513, 5771], whiskers = [414, 19,975]. c SCALPEL identifies a significant change in isoform usage in EIF1 gene between iPSCs and NPCs. Coverage plots show the distribution of filtered reads along isoforms in both conditions. Relative expression of isoform abundance by SCALPEL is provided on the custom tracks. The location of the oligo dT primer (gray) and the gene-specific primer (purple) used for experimental validations are shown on top of the isoform diagrams. d Experimental validation in bulk of the isoform changes predicted by SCALPEL using 3’ RACE. EIF1-204 isoform is more highly expressed in iPSCs than in NPCs while the expression of the EIF1-201 isoform is similar in both conditions. e Quantification of JPT1 isoforms by SCALPEL. f Relative abundance of JPT1−203 in NPCs is higher than in iPSCs while JPT1-207 is more similar across conditions. g Cumulative distribution plot of log2FCs showing the difference in the expression of isoforms from DIU genes where one of them contains miR-128-5p target sites (pink) and the other does not (gray), indicating that changes in isoform usage between iPSCs and NPCs can be attributed to miRNAs known to be implicated in neurogenesis (two-tailed Kolmogorov–Smirnov test, FDR = 0.0000115). Source data are provided as a Source Data file.Full size imageOverall, the results from the benchmark show that SCALPEL has higher sensitivity than all other tools while keeping good precision and performance in real datasets generated with 10x and Drop-seq technologies.SCALPEL predictions can be experimentally validatedConsidering that iPSCs and NPCs are easily distinguished experimentally, we decided to use this dataset to experimentally validate SCALPEL predictions. To this end, we selected five genes predicted to have changes in isoform usage by SCALPEL using Chi-square statistical test (false discovery rate, FDR, adjusted p-value