Reproducible single-cell annotation of programs underlying T cell subsets, activation states and functions

Wait 5 sec.

MainT cells play critical roles in cancer, infection and autoimmune disease, driving widespread interest in characterizing their states using single-cell RNA sequencing (scRNA-seq)1,2,3. Clustering, the predominant analysis approach, has key limitations for interpreting T cell profiles. Transcriptomes reflect expression of multiple GEPs—co-regulated gene modules reflecting distinct biologic functions such as defining cell type, activation states, life cycle processes or external stimuli responses4. T cell GEPs vary continuously5, combine additively within individual cells6 and exhibit stimulus-dependent plasticity7. However, clustering discretizes cells, obscuring the coexpressed GEPs. For instance, proliferating T cells from multiple subsets may cluster together, masking their subsets. This may explain why clustering typically fails to delineate many canonical T cell subsets1,8, even with integration of surface protein markers via CITE-seq9,10.Component-based models like nonnegative matrix factorization (NMF), hierarchical Poisson factorization and SPECTRA overcome some limitations of clustering8,11,12,13,14. These methods model GEPs as gene expression vectors and transcriptomes as weighted mixtures of GEPs. Unlike principal component analysis (PCA), NMF components correspond to biologically interpretable GEPs reflecting cell types and functional states that additively contribute to a transcriptome12. Component-based approaches yield GEP vectors that serve as a fixed coordinate system for comparing GEP activities across datasets. This is similar to scoring gene-set activities15 but with variable gene weights and simultaneous modeling of multiple GEPs. This prevents confounding of related signals and enables comparison of relative GEP activities. Previous analyses of T cells using component-based models have already recognized GEPs associated with T cell activation8 and exhaustion13 but were limited in dataset size and only addressed a small number of biological contexts. Furthermore, it is not well established how well such GEPs generalize across datasets.Here, we present star-CellAnnoTator (starCAT), a framework to score cells based on a fixed, multi-dataset catalog of GEPs. ‘star’ is a wildcard placeholder based on the asterisk (*) used in programming, indicating applicability across tissues and cell types. Our specific instantiation for T cells is thus written T-CellAnnoTator (TCAT). We derive a comprehensive T cell GEP catalog by applying consensus nonnegative matrix factorization (cNMF)12 to seven scRNA-seq datasets comprising 1.7 million T cells from 38 human tissues1,2,10,16,17,18,19. Combining GEPs across datasets yields 46 consensus gene expression programs (cGEPs) capturing T cell subsets, activation states and functions (Fig. 1a). We demonstrate TCAT’s utility for inferring subset and antigen-specific activation (ASA) states and identifying cGEPs predictive of immunotherapy response across multiple tumor types.Fig. 1: Overview of starCAT.a, starCAT first identifies GEPs in multiple datasets and aggregates them into cGEPs. It then uses the cGEPs to annotate new query datasets and compute additional scores and classifiers. b, Pairwise correlations of GEPs discovered across reference datasets with insets for cGEPs derived from all seven references. Inset row and column orders are the same for all cGEPs. c, Heat map of cGEPs (rows) and which datasets the comprising GEPs were found in (columns). Green boxes indicate a GEP was found in a dataset. Colored bar indicates the cGEP’s assigned class. cGEPs corresponding to non-T cell lineages are excluded. d, Marker genes for selected example cGEPs in z-score units with the minimum value fixed at 0. The AB_ prefix indicates a surface protein.Full size imageResultsAnnotating cells with predefined GEPsWe first augmented the published cNMF algorithm to enhance GEP discovery (Fig. 1a). cNMF mitigates randomness in NMF by repeating NMF and combining outputs into robust estimates, generating GEP spectra (gene weights) and per-cell activities (‘usages’) reflecting the relative contributions of GEPs to each cell. To improve cross-dataset GEP reproducibility, we corrected batch effects which can cause cNMF to learn redundant dataset-specific GEPs. Standard batch-correction methods are incompatible with cNMF as they introduce negative values or modify low-dimensional embeddings rather than gene-level data. Therefore, we adapted Harmony20 to provide batch-corrected nonnegative gene-level data. Additionally, we modified cNMF to incorporate surface protein measurements in GEP spectra for CITE-seq datasets to enhance GEP interpretability (Methods).Next, we developed starCAT to infer the usages of GEPs learned in a reference dataset in new ‘query’ datasets. Unlike cNMF, which learns GEP spectra and usages simultaneously, starCAT quantifies the activity of predefined GEPs within each cell, using nonnegative least squares, similarly to NMFproject11. starCAT then leverages the GEP usages to predict additional cell features, including lineage, T cell antigen receptor (TCR) activation and cell cycle phase (Fig. 1a). This can provide several advantages over running cNMF or similar approaches de novo: it ensures a consistent cell state representation for comparison across datasets, can quantify rarely used GEPs that may be hard to identify de novo in small query datasets and markedly reduces run time.We benchmarked starCAT’s performance through simulations where the reference and query datasets contained only partially overlapping GEPs (Methods). Simulations included two 100,000-cell references and a 20,000-cell query, where each cell expressed one subset-defining and one or more non-subset GEPs. Cells in the reference datasets included additional GEPs or lacked certain GEPs relative to the query datasets and only shared 90% of genes in common (Extended Data Fig. 1a). We then learned GEPs from each reference with cNMF and predicted their usage in the query using starCAT.starCAT accurately inferred the usage of GEPs overlapping between the reference and query (Pearson R > 0.7) and predicted low usage of extra GEPs in the reference that were not in the query (Extended Data Fig. 1b–d). We observed similar prediction accuracies when predicting a simulated query dataset with half or fewer overlapping GEPs between the reference and query (Supplementary Fig. 1). starCAT outperformed direct application of cNMF to the query for overlapping GEPs, despite having extra or missing GEPs in the references. We suspected this was due to the larger size of the references and confirmed that starCAT maintained its performance across smaller query datasets while cNMF’s performance declined (Extended Data Fig. 1e).cGEPs for T cell annotationWe next developed a catalog of T cell GEPs to use for TCAT. We analyzed T cells from seven datasets spanning blood and tissues from healthy individuals and those with coronavirus disease 2019 (COVID-19), cancer, rheumatoid arthritis or osteoarthritis (Supplementary Table 1 and Extended Data Fig. 1f). We chose datasets to reflect phenotypic breadth, large sample sizes (>70,000 T cells) and, where possible, inclusion of CITE-seq data to aid GEP interpretation. We included two COVID-19 peripheral blood mononuclear cell (PBMC) datasets, two healthy PBMC datasets and two tissue datasets to assess cross-dataset GEP reproducibility. After quality control, 1.7 million cells remained from 905 samples from 695 individuals. We applied cNMF to each batch-corrected dataset independently (Supplementary Fig. 2 and Methods).GEPs were reproducible across datasets. To quantify this, we clustered GEPs found in different datasets and defined a cGEP as the average of each cluster (Methods). Nine cGEPs were supported by all seven datasets (mean Pearson R = 0.81, P 0.1) and cGEP-low (usage 0.75; Extended Data Fig. 3d and Methods).Next, we validated prediction of T cell polarization against canonical marker expression. We discretized cells based on the TH1-like, TH2-resting and TH17-resting cGEPs and computed per-sample pseudobulk profiles of high (usage > 0.1) and low (usage