Interpretable single-cell factor decomposition using sciRED

sciRED factor discovery framework

We model the read counts for \(n\) cells and \(g\) genes (i.e., \({Y}_{n\times g}\)) as a combination of annotated and unannotated factors as follows:

$${Y}_{n\times g}={C}_{n\times p}{\beta }_{p\times g}+{F}_{n\times f}{A}_{f\times g}+{U}_{n\times u}{H}_{u\times g}+{\varepsilon }_{n\times g}$$

where:
- \({Y}_{n\times g}\) is the observed data matrix with dimensions \(n\times g\), where \(n\) is the number of samples or cells and \(g\) is the number of variables or genes
- \({C}_{n\times p}\) is the matrix of technical covariates (such as library size, batch ID, and cell cycle stage) with dimensions \(n\times p\), where \(p\) is the number of covariates
- \({\beta }_{p\times g}\) is the coefficient matrix for the technical covariates, with dimensions \(p\times g\)
- \({F}_{n\times f}\) is the matrix of annotated factors with dimensions \(n\times f\), where \(f\) is the number of annotated factors
- \({A}_{f\times g}\) is the matrix of loadings for the annotated factors, with dimensions \(f\times g\)
- \({U}_{n\times u}\) is the matrix of unannotated factors with dimensions \(n\times u\), where \(u\) is the number of unannotated factors
- \({H}_{u\times g}\) is the matrix of loadings for the unannotated factors, with dimensions \(u\times g\)
- \({\varepsilon }_{n\times g}\) is the error term, with dimensions \(n\times g\)

Annotated factors are those matched with known (given) covariates as indicated in the factor-covariate association (FCA) table. The solution to the above equation is reached through a two-step process.

1. Derivation of residuals

The first step removes the effect of technical covariates from the data. Input counts, represented by the matrix \({Y}_{n\times g}\), are modeled with a Poisson generalized linear model (GLM) to account for the matrix's distributional properties11,55,57. Technical covariates \({C}_{n\times p}\) are incorporated into the model to capture their effects on the count data. The statsmodels package58 (version 0.11.0) is used for the GLM implementation. Pearson residuals from this model are computed as the difference between observed and predicted counts divided by the square root of the variance of the predicted counts:

$${r}_{i}=\frac{{y}_{i}-\hat{{y}_{i}}}{\sqrt{{VAR}\left(\hat{{y}_{i}}\right)}}$$

where \({VAR}\) is the variance; for a Poisson GLM, \({VAR}\left(\hat{{y}_{i}}\right)=\hat{{y}_{i}}\). The Pearson residuals are used for the subsequent matrix decomposition.

Two additional residual types were evaluated but are not used in sciRED. Response residuals represent the difference between the observed count and the predicted mean count for each observation:

$${{resid}}_{i}={y}_{i}-\hat{{y}_{i}}$$

Deviance residuals represent the individual contributions of samples to the deviance \(D\), calculated as the signed square roots of the unit deviances:

$${d}_{i}={sign}\left({y}_{i}-\hat{{y}_{i}}\right)\times \sqrt{2\times \left({y}_{i}\log \left({y}_{i}/\hat{{y}_{i}}\right)-\left({y}_{i}-\hat{{y}_{i}}\right)\right)}$$

Pearson residuals are used in sciRED because they are established in other single-cell analysis methods55,56. However, sciRED results are robust to the choice of residual among Pearson, response, and deviance (see Methods section "Classifier ensemble").
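For concreteness, the following is a minimal sketch of this first step, not the sciRED implementation itself: a per-gene Poisson GLM fit with statsmodels, from which the three residual types above can be read off. The toy `counts` matrix and the choice of log library size as the sole technical covariate are assumptions made for the example.

```python
# Minimal sketch of step 1: per-gene Poisson GLM with a technical covariate,
# then Pearson residuals. The toy data and covariate choice are illustrative.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
counts = rng.poisson(5, size=(100, 3)).astype(float)  # toy cells x genes matrix
lib_size = counts.sum(axis=1)                         # total read count per cell
X = sm.add_constant(np.log(lib_size))                 # covariate matrix C

pearson = np.empty_like(counts)
for j in range(counts.shape[1]):
    fit = sm.GLM(counts[:, j], X, family=sm.families.Poisson()).fit()
    pearson[:, j] = fit.resid_pearson     # (y - mu) / sqrt(mu) for Poisson
    # fit.resid_response and fit.resid_deviance give the two alternatives
```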
2. Residual decomposition

After obtaining Pearson residuals from the count data, a matrix factorization technique is employed to uncover underlying patterns, including the annotated and unannotated factors. PCA factors are calculated using singular value decomposition (SVD)59 as implemented in scikit-learn60 (version 0.22.1). Following factor decomposition, we apply varimax rotation to enhance the interpretability of the principal axes in the feature space. To achieve this, we reimplemented the optimized estimation procedure as described below.

Rotation types

Factor analysis typically comprises two sequential stages. Initially, loadings are computed to best approximate the observed variances within the data. However, these initial loadings may lack interpretability; thus, we apply rotation to generate a revised set that is more easily interpretable. There are two primary rotation types in factor analysis: orthogonal and oblique. Orthogonal rotation, such as varimax, seeks to produce orthogonal factors with zero correlations. Intuitively, varimax rotation aims to identify factors associated with a limited number of variables, thereby promoting the detection of distinct factors rather than those affecting all variables evenly. Mathematically, interpretability is achieved by maximizing the variance of the squared loadings along each principal component axis. Varimax rotation seeks to maximize the Kaiser criterion:

$$v={\sum }_{j}{v}_{j}={\sum }_{j}\left\{\left[g{\sum }_{i}{\left({l}_{{ij}}^{2}/{h}_{i}^{2}\right)}^{2}-{\left({\sum }_{i}{l}_{{ij}}^{2}/{h}_{i}^{2}\right)}^{2}\right]/{g}^{2}\right\}$$

where \(v\) is the Kaiser criterion, \({l}_{{ij}}\) is the loading value of the \(i\)th gene on the \(j\)th factor, and \(g\) is the total number of genes. The communality \({h}_{i}^{2}\) is calculated as the sum of the squared loading values for each gene:

$${h}_{i}^{2}={\sum }_{j}{l}_{{ij}}^{2}$$

To explicitly specify the rotation matrix, we can reformulate the Kaiser criterion using the following notation. Let \(L\) be a \(g\times k\) loading matrix (eigenvectors), and let \(R\) denote a rotation matrix such that \({R}^{T}R={I}_{k}\), where \({I}_{k}\) is the \(k\times k\) identity matrix. Additionally, let \({R}_{{ij}}\) represent the scalar element in the \(i\)th row and \(j\)th column of matrix \(R\). Varimax rotation can now be described as follows:

$${R}_{{varimax}}={{argmax}}_{R}\left({\sum }_{j=1}^{k}{\sum }_{i=1}^{g}{\left({LR}\right)}_{{ij}}^{4}-\frac{1}{g}{\sum }_{j=1}^{k}{\left({\sum }_{i=1}^{g}{\left({LR}\right)}_{{ij}}^{2}\right)}^{2}\right)$$

where \({R}_{{varimax}}\) denotes the resulting rotation matrix. The rotation matrix is computed using an iterative method relying on SVD to achieve sparsity in the loadings. Subsequently, the rotated loading and score matrices are derived by multiplying the original loading and score matrices with the rotation matrix, respectively. The optimization algorithm is elaborated on in detail in Appendix A of Stegmann et al.61. We reimplemented the base R varimax rotation function in Python.
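The iterative SVD-based procedure can be sketched as follows. This is the standard varimax iteration (as in base R and common Python implementations), not necessarily identical line-for-line to sciRED's reimplementation; the function name and parameters are illustrative.

```python
# Sketch of the iterative SVD-based varimax rotation (standard formulation);
# `loadings` is a genes x factors matrix, e.g., PCA eigenvectors.
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Return the rotated loadings and the orthogonal rotation matrix R."""
    g, k = loadings.shape
    R = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion with respect to the rotation
        grad = loadings.T @ (L**3 - (gamma / g) * L @ np.diag((L**2).sum(axis=0)))
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                        # nearest orthogonal matrix (Procrustes step)
        new_crit = s.sum()
        if new_crit <= crit * (1 + tol):  # stop once the criterion plateaus
            break
        crit = new_crit
    return loadings @ R, R
```

The rotated score matrix is then obtained analogously, by multiplying the original scores with `R`.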
Oblique rotations, such as Promax62, allow factors to be correlated, thereby relaxing the orthogonality assumption63. This flexibility can be beneficial when factors are expected to be correlated within the underlying structure of the data. Promax initially applies the varimax method to generate a set of orthogonal results. It then constructs an ideal oblique solution that exaggerates the orthogonal rotated gene-loading matrix and, finally, rotates the orthogonal results to achieve a least squares fit with this ideal solution. We define a pattern matrix \(P=({p}_{{ij}})\) as follows \((k > 1)\):

$${p}_{{ij}}=\frac{|{l}_{{ij}}^{k+1}|}{{l}_{{ij}}}$$

Each element of the \(P\) matrix is the \({k}^{{th}}\) power (typically the 3rd power) of the corresponding element in the row-column normalized varimax loading matrix. Next, the least squares fit of the orthogonal matrix of factor loadings to the pattern matrix is computed:

$${R}_{{Promax}}={{argmin}}_{R}{{||}P-{L}_{{rot}}R{||}}_{2}$$

$${R}_{{Promax}}={\left({L}_{{rot}}^{{\prime} }{L}_{{rot}}\right)}^{-1}{L}_{{rot}}^{{\prime} }P$$

where \({R}_{{Promax}}\) is the unnormalized rotation matrix, \({L}_{{rot}}\) is the matrix of varimax-rotated loadings, and \(P\) is the pattern matrix defined above. The columns of \({R}_{{Promax}}\) are normalized such that their sums of squares are equal to unity. We reimplemented the base R promax function in Python.
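A minimal sketch of this Promax step is given below, following the equations above but omitting the row-column normalization of the loadings for brevity; the function name and the default power `k=3` are illustrative assumptions, not the exact sciRED or base R implementation.

```python
# Sketch of the Promax step: pattern matrix, least squares rotation, and
# column normalization (row-column loading normalization omitted).
import numpy as np

def promax(L_rot, k=3):
    """L_rot: varimax-rotated loadings (genes x factors); k: Promax power."""
    # Pattern matrix: k-th power in magnitude, preserving the sign of each loading
    P = np.abs(L_rot) ** (k + 1) / L_rot
    # Least squares fit of the rotated loadings to the pattern matrix
    R = np.linalg.lstsq(L_rot, P, rcond=None)[0]
    # Normalize columns of R so that their sums of squares equal unity
    R = R / np.sqrt((R**2).sum(axis=0))
    return L_rot @ R, R
```

In practice, `L_rot` would be the output of the varimax sketch above.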
To evaluate the impact of factor rotations, we applied PCA, sciRED (varimax-based19), and promax-rotated62 PCA to the Pearson residuals of the scMixology dataset after regressing out protocol and library size. The factor-covariate association scores reveal a high correlation between sciRED and Promax. Both methods enhance specificity and achieve a one-to-one association between factors and cell line covariates, outperforming unrotated PCA (Fig. S13). We used varimax as the default rotation method for sciRED, as it has been established to perform well for interpretable factor analysis64,65.

Factor identification benchmarking

For both the scMixology and PBMC datasets, PCA and ICA were applied to normalized (library-regressed) data, while NMF, scVI, Zinbwave, cNMF, and scCoGAPS were run on filtered raw counts. All methods used the top 2000 highly variable genes, following sciRED's standard pipeline. The number of factors (K) was set to 30 across all methods; for additional comparison, K was also set to 10 for the scMixology dataset.

Spectra was applied exclusively to the PBMC dataset, as it relies on the Cytopus knowledge base, which contains gene sets tailored for standard scRNA-seq data but lacks coverage of cancer cell line-specific gene sets. For PBMC, cell type labels were standardized to CD4-T, mono, cDC, NK, CD8-T, and B, with an "all-cells" category used for cell types without specific annotations. The number of factors per cell type was set to five to approximate the total number of factors used in the other methods.

All methods were run in single-threaded mode with default parameters on a workstation with an Intel 3.0 GHz Xeon E5-2687W chip and 64 GB of RAM.

Data pre-processing and handling of batch effects with sciRED

sciRED takes raw count (not batch-corrected) data as input, removing cells and genes with zero total read count and retaining only the top highly variable genes, typically the top 2000 as in standard scRNA-seq workflows. While using all genes is an option, the first step (fitting the Poisson GLM) becomes more time-consuming with larger gene sets; thus, a smaller set of highly variable genes is used. The minimum required covariate is library size (total read count per cell), which serves as an equivalent of a normalization step.

Additional covariates can be incorporated into the GLM based on specific analysis goals; however, technical covariates should be removed with caution to prevent the unintentional exclusion of biologically relevant signals. For instance, in analyses focused on sex-specific variation, adjusting for sample IDs as covariates can inadvertently eliminate sex-associated signals if sample and sex effects are confounded.

In multi-sample analyses, we recommend merging all samples without batch correction. Batch correction is generally unnecessary with sciRED, as our analyses demonstrate robust batch effect handling without directly correcting the data matrix or latent embedding. To evaluate batch effect handling, sciRED was tested on two PBMC datasets profiled using 10x Genomics single-cell 3' and 5' gene expression libraries with strong batch effects (Fig. S14). sciRED was applied to the combined count matrices, with library and sample IDs regressed out as covariates in the Poisson GLM step. The UMAP projection (Fig. S14A, B) reveals batch-related clustering without correction; however, sciRED analysis (Fig. S14C, D) shows well-integrated cell type identities, even across assay types. For instance, factors F1 and F4, which identify CD14+ monocytes and B cells, respectively, are well integrated across both assays. The distributions of cells over F1 and F4, colored by cell type and assay, demonstrate effective batch integration, and the corresponding box plots and UMAP enrichment patterns further confirm these results (Fig. S14E–J).

Impact of data sparsity on factor decomposition performance

To assess the impact of data sparsity on sciRED's decomposition, we systematically varied the sparsity level (0.01, 0.3, 0.5, 0.7, 0.9, 0.95, 0.99) in the human liver atlas dataset by incrementally replacing randomly selected gene expression values with zeros (Fig. S15). For each level, we evaluated the factors based on the average and maximum variance explained, the percentage of matched factors, and the percentage of matched covariates. Our results indicate that the average variance explained by all factors remains largely robust across sparsity levels, likely due to sciRED's rotation step, as this trend is not observed for PCA factors. The maximum variance explained is also more stable for sciRED's rotated factors than for PCA as sparsity increases. As expected, the percentage of matched factors decreases with rising sparsity, with a marked drop around a sparsity threshold of ~70%. At lower sparsity levels (0.01 and 0.5), the FCA heatmaps display a balanced distribution of matched covariates per factor, while at higher sparsity (0.95 and 0.99), most covariates align with only a few factors, reducing interpretability. However, the percentage of matched covariates remains stable, as most covariates continue to align with at least one factor.

Impact of factor number (K) on decomposition results

To evaluate the impact of the factor count (K) on sciRED's decomposition results, we tested various factor numbers (5, 10, 20, 30, and 50) on the human liver map (Fig. S16). At lower factor counts, such as K = 5, each factor showed more covariate associations (e.g., factor 1 was matched to nine covariates). As K increased, each factor aligned with fewer covariates, and the percentage of matched factors decreased. At K = 50, 54% of factors had no covariate matches. Correlating factors between the K = 10 and K = 30 decompositions revealed that individual factors in K = 10 (e.g., F1, F3, and F7) mapped to multiple factors in K = 30, suggesting that low factor limits may lead to each factor capturing multiple gene expression programs, complicating interpretation. Repeating this analysis on the PBMC dataset yielded similar results: as the number of factors increased, each factor aligned with fewer covariates, and a higher proportion of factors remained unmatched at higher K values (Fig. S17).

The ideal factor count varies by dataset and is affected by the diversity of biological and technical signals. While a scree plot is commonly used in PCA to determine the number of factors, sciRED's rotation step changes the ordering by variance explained, making this approach less effective. Empirically, a moderately high K (e.g., 30) has yielded interpretable results across datasets. We recommend starting with a higher K value and excluding factors with no covariate matches or low factor importance scores (FIS).

Feature importance calculation

We evaluated a range of classifiers for inclusion in the sciRED ensemble classifier. Each classifier is trained on individual levels of each covariate separately (e.g., "female" and "male" for a "biological sex" covariate). Each classifier uses a different approach to estimate feature importance:

1. Logistic regression: feature importance is the magnitude of the coefficient for each factor, which represents the change in the log-odds of belonging to a covariate level per unit of factor weight.
2. Decision trees, random forest66, and extreme gradient boosting: feature importance scores represent the decrease in covariate mixing (e.g., Gini impurity or entropy) when the feature is used within a tree, averaged across all trees.
3. K-nearest neighbor67 (KNN): feature importance is estimated as the decrease in predictive accuracy when the values of that feature are randomly permuted. This value is averaged across five permutations for each factor, following the default scikit-learn implementation.
4. Linear classifier (AUC): feature importance for a linear classifier (i.e., a fixed threshold) is calculated as the area under the receiver-operating characteristic (ROC) curve (AUC). For one-dimensional data, the AUC is equivalent to the Wilcoxon or Mann-Whitney U test statistic through the relation shown below:

$${AUC}=U/({n}_{0}\times {n}_{1})$$

where \(U\) is the Mann-Whitney U statistic, and \({n}_{0}\) and \({n}_{1}\) are the sample sizes of the two groups being compared. The Mann-Whitney U test is a non-parametric test that assesses whether two independent samples are drawn from populations with the same mean rank. Here, the samples are the factor scores of the target group (cells labeled with the covariate level of interest) and of the non-target group. In the context of feature importance, a higher AUC value indicates that the factor better separates the classes, while a lower AUC value suggests less discriminatory power. The scikit-learn package is used to implement the decision tree, random forest, logistic regression, and KNN classifiers with default parameters. The XGBoost (version 1.5.0) and SciPy68 (version 1.4.1) packages are used for XGB and the Mann-Whitney U test, respectively.
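The AUC/U relation can be illustrated with a small sketch using scipy.stats.mannwhitneyu; the factor scores and group sizes below are invented toy data, and the statistic convention assumes SciPy >= 1.7, where the returned statistic is U for the first sample.

```python
# Sketch of the AUC = U / (n0 * n1) relation with toy factor scores.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
target = rng.normal(1.0, 1.0, 200)       # factor scores, covariate level of interest
non_target = rng.normal(0.0, 1.0, 300)   # factor scores, all remaining cells

u_stat, p_value = mannwhitneyu(target, non_target, alternative="two-sided")
auc = u_stat / (len(target) * len(non_target))  # equivalent to the ROC AUC
```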
Classifier ensemble

We optimized the sciRED classifier ensemble by evaluating different classifier combinations on four independent datasets: a healthy human kidney map, a healthy human liver map, a PBMC atlas, and the scMixology benchmark dataset (Fig. S18). For each experiment, we randomly shuffled the covariate labels to generate a null distribution of classifier association scores and calculated the average number of significantly associated factors (p