Improving gene isoform quantification with miniQuant

Wait 5 sec.

MainGene isoforms, arising mainly from the alternative splicing of many genes (for example, in human1), enhance the diversity of RNAs, proteins and phenotypic traits2. Numerous studies of alternative splicing have shown critical roles in disease mechanisms3 and targeted therapies4. However, gene isoforms have not been studied as extensively as alternative splicing mechanisms because critical unresolved questions persist in gene isoform profiling.Unresolved questions pertain primarily to the read alignment uncertainty to gene isoform of origin. Since isoforms of a gene share exonic sequences, many short reads cannot be unambiguously assigned to their gene isoforms of origin5,6 (Fig. 1a). Although long-read RNA sequencing (RNA-seq) (for example, by the techniques of Pacific Biosciences (PacBio)7 and Oxford Nanopore Technologies (ONT)8) has been demonstrated to be qualitatively superior for identifying gene isoforms in many biomedical contexts9, there still exists a critical proportion of fragmental long reads that are probably assigned ambiguously to the gene isoform of origin, even for short isoforms10,11 (Supplementary Fig. 1). Most methods for estimating gene isoform abundance establish hierarchical models, where the gene isoform abundances are the unknown parameters controlling the emission probabilities of the observed reads. Gene isoform abundances are then estimated by maximizing the likelihood (Fig. 1b). Thus, gene isoform quantification can be viewed as a data deconvolution process incorporating information from both transcriptome (for example, exon–isoform structure) and data (for example, read length and alignment positions)12. Since these features of genes and data determine the difficulty of data deconvolution, the quantification errors can vary substantially across data and genes with different exon–isoform structures, that is, some genes can be quantified accurately at the isoform level, while others cannot. Therefore, we build a framework to determine whether a gene isoform can be quantified accurately and, thus, can be studied reliably.Fig. 1: K-value indexes the uncertainty in gene isoform quantification.a, Schematic illustration of gene isoform structure and corresponding short and long reads assignment. Shared exonic sequences among isoforms lead to uncertainty in assigning reads to gene isoforms of origin, especially for short reads due to the limited read lengths. b, Overview of the statistical models used in gene isoform abundance estimation. Methods have been developed to estimate gene isoform abundance using region-based or read-/fragment-based models. Both models can be unified into a GLM framework b = Aϕ. c, Top, design of a new statistic, K-value, to index the gene isoform quantification error in RNA-seq data. Middle, for full-column-rank matrix A, when \({\Vert \delta {{ \textbf{A}}}\Vert }_{2} \ll {\Vert {{\textbf{A}}}\Vert }_{2}\), K(A) is low, and the error b − Aϕ is small, the relative quantification error \(\frac{{\Vert \delta {\mathbf{\upphi}}\Vert}_{2}}{{\Vert{\mathbf{\upphi}}\Vert}_{2}}\) is bounded approximately by \(K({\textbf{A}}) \times(\frac{{\Vert \delta {\textbf{b}}\Vert }_{2}}{{\Vert {\textbf{b}}\Vert }_{2}}+\frac{{\Vert \delta {{\textbf{A}}}\Vert }_{2}}{{\Vert {{\textbf{A}}}\Vert }_{2}})\) (Supplementary Note 3). Bottom, higher K-value indicates the model is more sensitive to perturbations, thus the quantification results are less reliable. Fix the value of ϕ; given same level of perturbations on b, the linear model b = Aϕ tends to have larger estimation errors δϕ when the K-value of A is larger.Full size imageAlthough longer read length of long-read RNA-seq data has been qualitatively shown to improve gene isoform quantification13,14, this has not been assessed rigorously. Since the coverage of current long-read sequencing data is still much lower than that of short-read sequencing data, it remains unclear to what extent, and for which genes, the benefits of longer read length offset the cost of reduced library size. It is not known how long reads and short reads should be integrated to harness the strengths of both technologies, or what the optimal design of hybrid sequencing experiments using both long reads and short reads should be.Here, we propose the generalized condition number (referred to as ‘K-value’) as a gene- and data-specific proxy for gene isoform quantification error caused by data deconvolution. Some previous attempts of evaluating gene isoform quantification uncertainty simply adopted the number of isoforms to define gene isoform complexity15,16,17, which lacks a rigorous foundation in data science. Some other studies calculated the postquantification metrics, such as bootstrap-based computation of confidence interval18,19, information matrix20,21 and overdispersion22,23. Compared with the previous studies, K-value is a rigorous measurement of gene isoform complexity regarding quantification difficulty given read length, so it can be calculated before data collection and analysis and thus is useful to improve the design of data collection, such as the selection of short-read versus long-read sequencing. Using the K-value, we can sort and filter genes with regard to the reliability of their gene isoform quantifications (Fig. 1c). We show that, across different paradigms of RNA-seq quantification, K-value is a general and reliable proxy for quantification error. This is shown through mathematical derivation, simulation data, experimental validation and applications to vast amount of consortium-scale data (>17,000 datasets from GTEx24,25, TCGA25,26 and ENCODE27).Next, we present the software miniQuant28 (https://github.com/Augroup/miniQuant) that harnesses long reads to improve gene isoform quantification especially for complex exon–isoform structures with high K-values. We also dissect the weakness of long-read-based quantification on the lowly-expressed gene isoforms because of the low coverage of relatively low-throughput long-read RNA-seq. To achieve optimal integration of long reads and short reads, miniQuant adopts a machine learning approach to adaptively find the gene- and data-specific weighting scheme between two data types. In benchmarks, miniQuant improves upon naive integration of long reads and short reads with uniform weight, or existing methods that rely on short reads alone or long reads alone.Benefiting from the accurate quantification by miniQuant, we further characterize the switching of gene isoforms during the differentiation of human embryonic stem cells (ESC) to pharyngeal endoderm (PE) and primordial germ cell-like cells (PGC).ResultsIndex gene isoform quantification error by K-valueThe deconvolution of ambiguous read–isoform alignment for gene isoform quantification can be represented in a form of a generalized linear model (GLM) \({\textbf{b}}={\textbf{A}}{\mathbf{\upphi}},\) where \({\mathbf{\upphi} }={\left({\phi }_{1},\cdots ,{\phi }_{T}\right)}^{{\prime} }\) and \({\mathbf{\uptheta}} ={\left(\right.\!{\theta }_{1},{\theta }_{2},\cdots ,}\)\({\left.{\theta }_{T}\right)}^{{\prime}}\) denote the nucleotide fractions29 and the relative abundance of T gene isoforms, respectively, with \({\phi }_{t}=\frac{{\theta }_{t}{\widetilde{l}}_{t}}{\mathop{\sum }\nolimits_{r=1}^{T}{\theta }_{r}{\widetilde{l}}_{r}}\) and \({\widetilde{l}}_{t}\) being the effective length of gene isoform t; \({\textbf{b}}={\left({b}_{1},{b}_{2},\cdots ,{b}_{S}\right)}^{{\prime} }\) denotes the read proportions for S exonic regions; \({\textbf{A}}=\left({A}_{{{st}}}\right)\in {{\mathbb{R}}}^{S\times T}\) denotes the read-isoform alignments as a conditional probability matrix that incorporates exon–isoform structure and sequencing information (Fig. 1b and Supplementary Notes 1 and 2; Methods ‘K-value (generalized condition number)’). Thus, the estimation accuracy of ϕ and θ depends on A. For example, with short reads, because there are very few unique read-isoform alignments, the isoform quantification of gene FAM219A deviates by a large amount from the true values; in contrast, the estimates and true values are close for the SPINDOC isoforms since all have unique read alignments (Fig. 2a). As A represents the read–isoform alignments, we establish the statistic \(K\left({\textbf{A}}\right)=\frac{{\sigma }_{\max }}{{\sigma }_{r}}\) (referred to as ‘K-value’) to measure the influence of ambiguous read alignment in gene isoform quantification, where σmax and σr are the largest and smallest positive singular values of A (Fig. 1c; Methods ‘K-value (generalized condition number)’).Fig. 2: K-value is an indicator of gene isoform quantification error.a, Schematic illustration of the gene isoform structures and their corresponding aligned short reads of FAM219A (left) and SPINDOC (right). The two-dimension density plot (n = 200 simulations) represents the correlations between true and estimated abundance. b, Barplot representing the number of genes and corresponding isoforms within each K-value group. The first K-value group includes genes with K-value = 1. All subsequent K-value groups are genes with K-values between the two numbers above and below the line. The number of genes and isoforms are labeled on the left and right next to the bars, respectively. c, Boxplot showing the median MARD of genes within each K-value group across sequencing depths (n = 9) quantified using five different tools. Only genes with expression levels TPM > 1 are retained for visualization. In boxplots, the hinges represent the first and third quartiles, the center line represents the median, and the whiskers extend to the smallest and largest datapoints within 1.5 interquartile from the hinges. All boxplots in the subsequence analysis have the same definition unless specified. d, Barplot representing the MARD of ten indicated genes with low (red) and high (blue) K-values across sequencing depths (n = 9) quantified by kallisto. The overall MARD (gray) are calculated based on the median MARD of all genes with expression level TPM > 1. K-values are labeled in brackets along with the gene symbol. In barplots, data are presented as mean values ± s.e. All barplots in the subsequence analysis have the same definition unless specified. e, Comparison of the performance (precision, recall, accuracy and F1 score) between differentially expressed genes isoforms with low (red) and high (blue) K-values. f, Violin plot represents the performance (precision, recall, accuracy and F1 score) of identifying differentially expressed gene isoforms between ESC and DE across sequencing depths (n = 9) within each K-value group quantified using five different tools. g, Boxplot representing the median MARD (left) and irreproducibility (right) of genes within each K-value group per sample from GTEx (n = 54 human tissues), TCGA (n = 32 cancer types) and ENCODE (n = 46 cell lines and human tissues). Only genes with expression levels TPM > 1 are retained for visualization.Full size imageIn mathematics, we demonstrate that the relative quantification error \(\frac{{\|\delta {\mathbf{\upphi}} \|}_{2}}{{\|{\mathbf{\upphi} }\|}_{2}}\) is bounded by \(K({\textbf{A}})\frac{{\|\delta {\textbf{b}}\|}_{2}}{{\|{\textbf{b}}\|}_{2}}\) and approximately bounded by \(K({\textbf{A}})\frac{{\|\delta {\textbf{A}}\|}_{2}}{{\|{\textbf{A}}\|}_{2}}\) (Fig. 1c and Supplementary Note 3), where \(\delta {\mathbf{\upphi}}\) denotes the absolute quantification error, \(\delta {\textbf{b}}\) denotes the variation of sequencing coverage and \(\delta {\textbf{A}}\) denotes perturbation of read-isoform alignments that arises primarily from false gene isoform identification. That is, K-value is an amplification factor of the input data variance to quantification error, and therefore genes with high K-values are prone to be quantified erroneously.This association between K-value and quantification error can be validated and illustrated by the simulation data at different sequencing depths across five of the most commonly used gene isoform quantification tools18,19,29,30,31 (Methods ‘Existing methods,’ ‘Simulation data’ and ‘Performance evaluation’). To focus on the quantification error derived by data deconvolution (hereafter referred to as ‘deconvolution error’), we consider the genes with transcripts per million (TPM) > 1 in this section to avoid the dominating sampling error caused by the extremely small number of reads generated from low-expressed gene isoforms. We adopt the evaluation metric mean absolute relative difference19 (MARD) due to its superior robustness over relative error—with a range from 0 to 1, MARD effectively mitigates the impact of outliers while it preserves similar results to relative error (Supplementary Fig. 2; Methods ‘Simulation data’ and ‘Performance evaluation’).With 40 million short-read pairs, the median MARD of kallisto18 across genes is 0.0778 for the 9,538 GENCODE-annotated isoforms of 3,931 genes with 1 25, and remains at a high level (0.1016) even when the number of short-read pairs increases to 160 million. In particular, these problematic genes include many crucial regulators associated with human ESC pluripotency or differentiation, such as STAT3 (ref. 32), FOXP1 (ref. 33), DNM1, DDR1 and ZMYND8 (refs. 34,35,36,37) (K(A) ≥ 90, average MARD ≥ 0.24 across nine sequencing depths from 10 million to 160 million short-read pairs). Quantification errors of these regulators are much higher than both the transcriptome-wide level (average median MARD 0.1259; Methods ‘Performance evaluation’) and the other regulators with low K-values, such as FOXH1 (ref. 38), MED10, LEFTY1, ID1 and ZIC3 (refs. 34,35,36,37) (K(A) 25, the consortium-wise median values of the transcriptome-wide median MARD increase by 0.1830, 0.1559 and 0.1721 in GTEx, TCGA and ENCODE data, separately; the corresponding overall median irreproducibility increases from 1.03 to 2.12, 1.27 to 2.30 and 0.67 to 1.03, separately.Of note, the extent of these trends varies across samples (for example, cancer types, tissues and cell lines) and consortia. This variation is due to the critical differences in biological contexts (that is, different gene expressions), biological samples and resources (for example, various acquisition sources of cell lines), sequencing platforms, data quality, experimental sites and bioinformatics software used in these consortia, while K-value focuses on the influence of gene isoform complexity in quantification. Regardless of these differences in biological and technological settings, the positive associations of MARD and irreproducibility with respect to K-value hold in general. Therefore, K-value is a very robust index of the intrinsic quantification error in the inherent deconvolution procedure of RNA-seq data. Given gene isoform annotation and input data, miniQuant outputs K-value of each gene for users to interpret the reliability of the quantitative analysis at the gene isoform level.Long reads reduce deconvolution uncertaintyCompared with short reads, long reads improve gene isoform quantification accuracy in two aspects: (1) less ambiguity of long-read alignment can reduce the uncertainty of data deconvolution. For example, given the GENCODE-annotated gene structure and 2 × 150 bp short-read pairs, the MARD of the gene FAM219A with extremely high K-value (156.08) is as large as 0.7094 by kallisto, while long reads reduce MARD to 0.5858 by the long-read-alone mode of miniQuant (referred to as ‘miniQuant-L’) (Fig. 3a; Methods ‘Community-wise gene isoform abundance estimation model’ and ‘Simulation data’). In parallel, (2) long reads can generate a more accurate sample-specific annotation. To avoid the bias caused by using a specific gene isoform identification tool, we mimic the sample-specific annotation by choosing the gene isoforms with TPM ≥ 1 from a human transcriptome profile obtained from the GENCODE annotation and a real dataset from H1-hESC cell line (Methods ‘Reference genome and gene annotations’). The MARD of FAM219A is as low as 0.1696 when long-read coverage and long-read-based sample-specific annotation are used. Extending the evaluation to the entire transcriptome, the reduction of the median MARD by miniQuant-L with the long-read-based sample-specific annotation over five short-read-based tools with the GENCODE annotation is 0.0138–0.2257, even when long reads are far less than short-read pairs (5 million versus 40 million) (Fig. 3b). Moreover, the reduction of the median MARD by miniQuant-L is associated with the K-value, that is, long reads make more remarkable quantification improvement (that is, ΔMARD) when the targets are of very high K-values and thus challenging in the short-read-based quantification (Fig. 3b). For example, the quantification improvement by miniQuant-L over StringTie is 0.3117 and 0.1720 for the genes with the K-values > 25 and from 1 to 2, respectively.Fig. 3: Long reads improve gene isoform quantification by reducing data deconvolution uncertainty.a, Left, schematic illustration of the gene isoform structure and corresponding aligned short reads of FAM219A. Right, raincloud plots represent the median MARD of FAM219A quantified by miniQuant-L and kallisto under GENCODE (top) and sample-specific (bottom) annotation (n = 200 simulations). b, Barcharts represent the ΔMARD between five short-read-based tools and miniQuant-L within eight different K-value groups under GENCODE and sample-specific annotation, respectively. c, Comparison of median MARD between seven long-read-based tools and miniQuant-L across nine different sequencing depths on cDNA-ONT data under GENCODE annotation. d, Comparison of median MARD among cDNA-PacBio, cDNA-ONT and dRNA-ONT protocols across sequencing depths (n = 9) under GENCODE and sample-specific annotations. e, Two-dimension scatterplot shows the ΔMARD between kallisto and miniQuant-L against gene expression levels on different short-read and long-read sequencing depth combinations. SR, short-read pairs. LR, long reads. f, Coverage tracks and gene isoform structures of three indicated genes, OR1I1, CPLANE2 and GCLC, from Set 1, Set 2 and Set 3 in e (left), respectively. The K-value and TPM are labeled in the bracket under the gene symbols. CPLANE2 is described in Supplementary Note 5. In GCLC, low-expressed (TPM 25) (n = 3,339 genes) by miniQuant-L and kallisto across 11 short-read and long-read sequencing depths. In box plots, the hinges represent the first and third quartiles, the center line represents the median, and the whiskers extend to the smallest and largest datapoints within 1.5 interquartile from the hinges.Full size imageMiniQuant-L uses the joint likelihood function that includes comprehensive data features so that the unique information of long reads can be fully taken into account (Supplementary Note 4). MiniQuant-L outperforms seven existing long-read-based gene isoform quantification tools42,43,44,45,46,47,48 (Fig. 3c). Especially, the average ΔMARD across nine sequencing depths (0.5–30 million) of miniQuant-L over TALON47, StringTie2 (ref. 48), FLAMES44 and LIQA46 is as high as 0.2130–0.5084, and the ΔMARD becomes more striking when >1 million long reads are used. Compared with Bambu43, FLAIR45 and IsoQuant42, miniQuant-L also performs better with the average ΔMARD of 0.0140, 0.0433 and 0.0776, respectively, and the difference of their performance varies mildly across nine sequencing depths (Fig. 3c). Evaluation based on the other metrics (that is, relative error and absolute relative difference (ARD)) delivers similar results (Extended Data Fig. 6).We benchmark the performance of miniQuant-L among three different sequencing protocols, including PacBio cDNA sequencing (cDNA-PacBio), ONT cDNA sequencing (cDNA-ONT) and ONT direct RNA-seq (dRNA-ONT) (Methods ‘Simulation data’). With GENCODE annotation, the overall performance of miniQuant-L using cDNA-PacBio data outperforms cDNA-ONT and dRNA-ONT across nine sequencing depths (0.5–30 million), with an average median MARD of 0.4054 versus 0.4480 and 0.4402, respectively (Fig. 3d). This gap is attributed to the lower alignment errors and ambiguities of cDNA-PacBio reads, because of the lower sequencing error rate (1.12% versus 5.73% and 9.94% for cDNA-ONT and dRNA-ONT, respectively) and longer mean read length (2,732 bp versus 885 bp and 1,198 bp for cDNA-ONT and dRNA-ONT, respectively) (Methods ‘Simulation data’). When the sample-specific annotation is used, the errors are all reduced substantially (average median MARD 0.1411, 0.1710 and 0.1731 for cDNA-PacBio, cDNA-ONT and dRNA-ONT, separately), and the variance across nine sequencing depths becomes far smaller for all three protocols, underscoring the larger contribution of sample-specific annotation for gene isoform quantification over the choice of long-read sequencing protocols.Sampling error is critical in long-read-based quantificationAlthough long reads reduce the quantification error derived by data deconvolution (that is, deconvolution error), long-read-based quantification is limited by the substantial sampling error that results from the relatively low throughput given the current cost and yield.For example, kallisto, with 10 million short-read pairs, reports smaller MARD than miniQuant-L, with 1 million long reads on 75.80% (24,420 of 32,215) genes, many (48.31%, 11,797 of 24,420) of which can be grouped into three sets (Fig. 3e and Supplementary Note 5): Set 1 (4,596 genes) with low abundance (0