Main

Recent advances in experimental techniques have enabled the simultaneous measurement of multiple modalities, including multiplexed imaging and different sequencing modalities at the single-cell level1,2,3,4,5,6, and various spatial transcriptomic methods in the tissue context7,8,9. While each data modality offers a different view, the true underlying cell state is not directly observable. Given that the physical, transcriptional and functional states of a cell are correlated, we expect some information to be shared among the different modalities, while some aspects of cell state are only captured in one of the modalities (Fig. 1a).

Fig. 1: Overview of APOLLO for partially shared multi-modal embeddings and cross-modality prediction.
a, As different experimental technologies capture different aspects of the cell state, we expect some cell-state information to be shared among different modalities and some information to be modality specific. b, APOLLO is an autoencoder-based approach that learns three latent spaces to disentangle the information captured by each modality. APOLLO uses a two-step training procedure. In step 1, the decoders of each modality are trained so that they can reconstruct the input data from the latent spaces. In step 2, modality-specific encoders are trained to enable inference of the latent space embedding for cells not used in training the model. c, Our model enables the prediction of missing modalities. The explicit learning of partial information sharing allows APOLLO to perform accurate cross-modality prediction. The example shows the predicted protein (CD16, CD3, CD4, CD8, lamin) fluorescence images of a cell in a held-out patient, given the cell's chromatin and γH2AX protein stain. d, The information captured by the shared and modality-specific latent spaces learned by APOLLO is interpretable.
The partially shared latent space can be interpreted through morphological features or genetic pathways.

Current computational methods for analyzing multi-modal data fall into two main categories: those that process each modality separately and compare results post hoc, and those that learn a combined, integrated representation. The first approach is inefficient and often fails to fully exploit the information in the multi-modal dataset, requiring manual interpretation that typically restricts analysis to a few cell clusters or features5,10. For example, in the context of paired single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) and single-cell RNA sequencing (scRNA-seq) data, chromatin accessibility is often summarized at the gene level to allow direct comparison with gene expression5. Such coarse-graining of the data into a shared set of features may lose important information in each modality and can only be done when the modalities are similar. To overcome this limitation, various integration methods have been developed. Linear transformations and factor analysis11,12,13 are effective for sequencing-based modalities with common features (such as genes), but cannot easily be extended to data such as imaging and do not explicitly identify shared information. Nonlinear methods such as optimal transport and generative adversarial networks14,15,16, and neighborhood-based approaches17,18, also present limitations, as they either do not learn an integrated latent space that accounts for all modalities or are restricted to data with natural distance metrics, such as sequencing-based data. More recently, various representation learning methods19, including autoencoders, have been introduced to single-cell biology20,21,22 for the automatic integration of multi-modal data for joint downstream analyses including clustering and biomarker identification23,24,25,26,27,28,29,30.
In previous work, we developed autoencoder-based approaches to integrate diverse modalities, such as chromatin imaging, RNA-seq and ATAC-seq in the single-cell domain, as well as spatial transcriptomics23,24. These frameworks extracted features from each modality and embedded the features of all input modalities into a shared latent space for downstream analysis.

Existing multi-modal integration methods learn the union of information from the input modalities, incorporating both the shared and modality-specific information without disentanglement17,24,27. As a result, it is challenging to infer which modality contributes to particular features or phenotypes. For example, while chromatin staining is standardly used as a reference in large multiplexed single-cell-imaging datasets, as it has been shown to contain rich cell-state information and to be predictive of protein subcellular localization31,32, little is known about the information shared between chromatin organization and protein localization. Given the explosion of large-scale paired measurements in the sequencing and imaging domains at the single-cell and tissue levels, and also in the medical domain in large-scale biobanks that collect paired information on many individuals33,34, computational methods are needed that automatically and comprehensively learn the shared and modality-specific information to fully exploit multi-modal measurements.

We present a method that automatically learns partial information sharing between multiple modalities by using an Autoencoder with a Partially Overlapping Latent space learned through Latent Optimization (APOLLO).
APOLLO uses one autoencoder per modality and a two-step training procedure: in step 1, the decoders of each modality are trained while the latent spaces are updated simultaneously, so that the decoders can reconstruct the input data from the latent spaces; in step 2, modality-specific encoders are trained to enable inference of the latent space embedding for cells not used in training the model (Fig. 1b). We test APOLLO's disentanglement performance on five simulated datasets with known ground-truth latent structures. We then demonstrate, using a paired scRNA-seq and scATAC-seq dataset (SHARE-seq5), that APOLLO can comprehensively and automatically identify and distinguish between gene activity captured by both transcriptomics and chromatin accessibility and gene activity captured by only one of the modalities. We further apply APOLLO to a paired scRNA-seq and surface protein dataset (CITE-seq25) to demonstrate that APOLLO can correctly disentangle known shared and modality-specific information. Beyond sequencing modalities, using multiplexed single-cell imaging datasets, we show that APOLLO identifies features of chromatin organization and protein localization that correspond to cell-state information shared between the two modalities, as well as morphological features that are captured by only one modality. The explicit learning of partial information sharing by APOLLO also enables accurate cross-modality predictions, such as predicting unmeasured proteins from chromatin imaging (Fig. 1c). Incorporating additional cellular stains, including microtubule and endoplasmic reticulum (ER) markers, we demonstrate on multiplexed imaging data from the Human Protein Atlas (HPA)35 that APOLLO can be used to uncover associations between variations in protein subcellular localization and the morphology of different cellular compartments.
Overall, APOLLO provides a general framework that enables disentangling and interpreting the shared and modality-specific information contained in multi-modal datasets to obtain biological insights by learning partially shared latent spaces (Fig. 1d).

Results

APOLLO enables learning of partial information sharing across data modalities

While a regular autoencoder learns a representation of a single modality, our previous works have demonstrated that a joint representation of multiple modalities can be obtained by using one autoencoder per modality and aligning their latent spaces23,24. However, such alignment of the latent spaces results in a single latent space that entangles shared and modality-specific information. To overcome this limitation, APOLLO only enforces alignment between a subset of latent dimensions and allows the remaining latent dimensions to account for modality-specific information. More specifically, let \(\textbf{Z}={({\textbf{Z}}_{i})}_{i\in H}\) be a latent random vector with distribution PZ representing the true underlying state of a cell. Each single-cell technology may only measure some aspect of cell state. Let Hj ⊆ H represent the subset of latent variables that can be accessed by technology j and let Dj be a modality-specific map, which takes in the latent vector \({\textbf{Z}}_{H_{j}}:={({\textbf{Z}}_{i})}_{i\in H_{j}}\) and outputs a data sample X(j) from modality j, meaning \({\textbf{X}}^{(j)}=D_{j}({\textbf{Z}}_{H_{j}})\). Note that X(j) may be high-dimensional (for example, 20,000-dimensional for scRNA-seq technologies, or the number of pixels in a chromatin image). Assuming that we have access to 1 ≤ j ≤ J paired modalities that can be measured on the same cell, each cell gives rise to a sample (X(1), …, X(J)) from a joint distribution P(X(1), …, X(J)).
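This generative set-up can be sketched with a toy numpy simulation, with linear maps standing in for the decoders Dj; all dimensions, scales and feature counts below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, d_shared, d_spec = 500, 5, 2

# True latent state Z = (Z_S, Z_S1, Z_S2); technology j only accesses Z_{H_j}.
z_shared = rng.normal(size=(n_cells, d_shared))
z_spec1 = rng.normal(size=(n_cells, d_spec))
z_spec2 = rng.normal(size=(n_cells, d_spec))
z_h1 = np.hstack([z_shared, z_spec1])  # Z_{H_1} = (Z_S, Z_{S_1})
z_h2 = np.hstack([z_shared, z_spec2])  # Z_{H_2} = (Z_S, Z_{S_2})

# Modality-specific maps D_j (linear here purely for illustration).
D1 = rng.normal(size=(d_shared + d_spec, 100))   # e.g. 100 imaging features
D2 = rng.normal(size=(d_shared + d_spec, 2000))  # e.g. 2,000 genes

x1 = z_h1 @ D1  # X^(1) = D_1(Z_{H_1})
x2 = z_h2 @ D2  # X^(2) = D_2(Z_{H_2}); (x1, x2) are paired per cell
```

Each row pair (x1[i], x2[i]) plays the role of a paired multi-modal measurement of one cell: the two observations co-vary through z_shared but each also carries its own modality-specific factor.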
Given such samples, the problem then becomes to identify the latent features Z and, in particular, to understand which latent features are shared across modalities and which are modality specific; in the case of J = 2, this leads to the task of identifying the shared features \({\textbf{Z}}_{S}={\textbf{Z}}_{{H}_{1}\cap {H}_{2}}\) as well as the modality-specific features \({\textbf{Z}}_{{S}_{1}}={\textbf{Z}}_{{H}_{1}\backslash {H}_{2}}\) and \({\textbf{Z}}_{{S}_{2}}={\textbf{Z}}_{{H}_{2}\backslash {H}_{1}}\). To generalize to J > 2, ZS could represent the information shared across all modalities, meaning \({\textbf{Z}}_{S}={\textbf{Z}}_{{\cap }_{i=1}^{J}{H}_{i}}\), and \({\textbf{Z}}_{{S}_{j}}={\textbf{Z}}_{{H}_{j}\backslash {\cap }_{i\ne j}{H}_{i}}\).

Without additional assumptions, the shared and modality-specific latent features may not be identifiable. For linear maps Dj, a recent theoretical study proved identifiability under additional assumptions, such as non-Gaussianity of each marginal distribution36. Furthermore, when parameterizing the maps Dj with neural networks and using a latent optimization strategy for training, a recent study demonstrated empirically in the context of computer vision that simple image aspects could be disentangled and represented by different latent features37. APOLLO builds on these works to automatically learn partial information sharing between multiple modalities by using an autoencoder framework with a partially overlapping latent space learned through latent optimization. While our framework is general, for simplicity we describe it in the bi-modal setting where J = 2. In this context, D1 and D2 are decoders that map from the shared and modality-specific latent features to each data modality.
Given data (x(1), x(2)) from P(X(1), X(2)), the corresponding shared and modality-specific latent features are learned simultaneously with the decoders D1 and D2 by minimizing the reconstruction error \(L({\textbf{x}}^{(1)},{D}_{1}({\textbf{z}}_{{H}_{1}}))+L({\textbf{x}}^{(2)},{D}_{2}({\textbf{z}}_{{H}_{2}}))\), where L(⋅) is the mean-squared error (MSE) or another appropriate loss function. In practice, the dimension of the shared latent space (S ≔ H1 ∩ H2) is chosen to be much larger than the dimensions of the modality-specific latent spaces (S1 ≔ H1\H2 and S2 ≔ H2\H1), to enforce that information shared by the two modalities is represented in the shared latent space. To encourage this further, as well as to enable cross-modality prediction (described below), we introduce two additional decoders \({D}_{1}^{{\prime} }\) and \({D}_{2}^{{\prime} }\) that map from the shared latent space to each of the modalities, trained by minimizing the reconstruction error \(L({\textbf{x}}^{(1)},{D}_{1}^{{\prime} }({\textbf{z}}_{S}))+L({\textbf{x}}^{(2)},{D}_{2}^{{\prime} }({\textbf{z}}_{S}))\). In the applications of APOLLO to the SHARE-seq and multiplexed imaging datasets described in the following sections, we demonstrate that APOLLO is robust to the choice of latent space dimensions as well as to whether the two additional decoders \({D}_{1}^{{\prime} }\) and \({D}_{2}^{{\prime} }\) are included. The full set-up and objective function can be found in Methods.

To generalize to unseen samples and to enable cross-modality prediction, we use a second training step to obtain encoders Ej that map from each data modality to their respective latent spaces. More precisely, given a sample (x(1), …, x(J)) and the corresponding features z obtained from the first training step, the encoders Ej, 1 ≤ j ≤ J, are obtained by minimizing the MSE \(L({\textbf{z}}_{{H}_{j}},{E}_{j}({\textbf{x}}^{(j)}))\).
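The two training steps can be sketched in a minimal numpy example. This is a sketch under simplifying assumptions, not the APOLLO implementation: linear decoders and encoders stand in for the neural networks, the latent codes are free parameters updated by gradient descent jointly with the decoders (step 1), and linear least squares stands in for encoder training (step 2). Dimensions, learning rate and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_s, d_m = 300, 4, 2   # cells, shared dims, modality-specific dims

# Toy paired data with a planted shared factor (illustrative only).
zs_true = rng.normal(size=(n, d_s))
x1 = np.hstack([zs_true, rng.normal(size=(n, d_m))]) @ rng.normal(size=(d_s + d_m, 50))
x2 = np.hstack([zs_true, rng.normal(size=(n, d_m))]) @ rng.normal(size=(d_s + d_m, 80))

# Step 1: latent optimization. The latent codes are free parameters updated
# jointly with the (linear, for illustration) decoders D1 and D2; the shared
# block z_s receives gradients from BOTH reconstruction terms.
z_s = rng.normal(size=(n, d_s)) * 0.3
z_1 = rng.normal(size=(n, d_m)) * 0.3
z_2 = rng.normal(size=(n, d_m)) * 0.3
D1 = rng.normal(size=(d_s + d_m, 50)) * 0.3
D2 = rng.normal(size=(d_s + d_m, 80)) * 0.3

def recon_mse():
    e1 = np.mean((np.hstack([z_s, z_1]) @ D1 - x1) ** 2)
    e2 = np.mean((np.hstack([z_s, z_2]) @ D2 - x2) ** 2)
    return e1 + e2

lr = 0.002
mse_init = recon_mse()
for _ in range(20000):
    zh1 = np.hstack([z_s, z_1])
    zh2 = np.hstack([z_s, z_2])
    r1 = zh1 @ D1 - x1            # reconstruction residuals
    r2 = zh2 @ D2 - x2
    g1 = r1 @ D1.T                # per-cell gradients w.r.t. z_{H_1}, z_{H_2}
    g2 = r2 @ D2.T
    D1 -= lr * zh1.T @ r1 / n     # decoder updates
    D2 -= lr * zh2.T @ r2 / n
    z_s -= lr * (g1[:, :d_s] + g2[:, :d_s])   # shared block: both gradients
    z_1 -= lr * g1[:, d_s:]
    z_2 -= lr * g2[:, d_s:]
mse_final = recon_mse()

# Step 2: fit encoders E_j mapping each modality to its latent block,
# minimizing L(z_{H_j}, E_j(x^(j))) -- here by linear least squares.
E1, *_ = np.linalg.lstsq(x1, np.hstack([z_s, z_1]), rcond=None)
E2, *_ = np.linalg.lstsq(x2, np.hstack([z_s, z_2]), rcond=None)
```

The key structural point is that z_s appears in both reconstruction terms, so its gradient aggregates information from both modalities, while z_1 and z_2 are only shaped by their own modality.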
The encoders and the previously described decoders can be any neural network architecture that is suitable for the particular data modality, such as convolutional networks for images and fully connected networks for gene expression. For cross-modality prediction, a sample x(j) from input modality j is first encoded to the latent space to obtain \({\textbf{z}}_{H_{j}}=E_{j}({\textbf{x}}^{(j)})\). Then the decoder \({D}_{k}^{{\prime} }\) of the target modality k is applied to the dimensions corresponding to the shared latent space to obtain \({\textbf{x}}_{k}={D}_{k}^{{\prime} }({\textbf{z}}_{H_{j}\cap H_{k}})\). Additional details of model set-up and training are provided in Methods.

Given the limited understanding of information sharing between multiple modalities, to systematically evaluate disentanglement performance, we design simulated datasets with known ground-truth latent structure of increasing complexity and relevance to real biological data (Extended Data Figs. 1–4 and Methods). We assess APOLLO's performance across five simulation settings that differ in two aspects: (1) whether some modality-specific latent features are children of the shared latent features; and (2) whether the observed features depend on more than one latent feature. It is worth noting that, depending on the structure of the latent causal graph, the shared and modality-specific features may not necessarily be independent (Extended Data Figs. 3a and 4a). In all five simulation settings, APOLLO correctly identifies the associations between latent features and the three latent spaces, as measured by the separation of known latent features in the corresponding latent spaces (Methods). Importantly, APOLLO has comparable performance when there is dependence between shared and modality-specific latent features (Extended Data Figs. 3 and 4) as well as when the observed features depend on more than one latent feature (Extended Data Figs.
2 and 4), which is to be expected in biological data.

In the following, we demonstrate that APOLLO is generally applicable to any multi-modal data by discussing four different multi-modal settings, including paired-sequencing-based modalities as well as multiplexed imaging.

APOLLO provides a general framework for the integration of paired-sequencing-based measurements

We demonstrate that APOLLO learns meaningful shared and modality-specific latent spaces and is generally applicable to paired-sequencing-based measurements using two representative datasets: paired scRNA-seq and scATAC-seq measured using SHARE-seq5 (Extended Data Fig. 5 and Supplementary Fig. 1) and paired scRNA-seq and cellular surface protein abundance measured using CITE-seq25 (Extended Data Fig. 6a). Training APOLLO on the SHARE-seq dataset using a 24 GB graphics processing unit (GPU; A5000) took 10.8 seconds per epoch for step 1 and 7.5 seconds per epoch for step 2.

Using the SHARE-seq data, we first assess whether the modality-specific latent spaces capture additional information compared with the shared latent space using a cell-type classification task (see Methods for details). As a baseline, we used the cell types defined by Ma et al.5 to train a classifier that predicts the cell types using the shared latent space. In addition, we trained two separate classifiers that use the ATAC-specific latent space and the RNA-specific latent space, respectively, in addition to the shared latent space. To ensure that our model is robust to the choice of latent space dimensions, we tested different dimensions for the shared and modality-specific latent spaces as well as different ratios between the latent space dimensions (Fig. 2a). The incorporation of the RNA-specific latent space improves the cell-type classification accuracy compared with using the shared latent space alone for all latent space dimensions.
This shows that the modality-specific latent spaces can capture biologically meaningful information that is not represented in the shared latent space, and that our model is robust to the choice of latent space dimensions. Uniform Manifold Approximation and Projection (UMAP) visualizations of the encoded shared and full latent spaces further support the separation of cell types captured by APOLLO (Supplementary Fig. 1d).

Fig. 2: APOLLO for identifying shared and modality-specific information in paired scRNA-seq and scATAC-seq data.
a, Comparison of four separate models with different latent space dimensions. The dimensions are listed as 'shared latent space dimension'-'modality-specific dimension'; for example, '50-20' indicates that the shared latent space has 50 dimensions and the modality-specific latent spaces have 20 dimensions each. For each model, we train four separate classifiers, with three random train–test splits, that predict cell type based on the following inputs, respectively: (1) the shared latent space encoded using scRNA-seq (RNA shared) or scATAC-seq (ATAC shared); and (2) the shared latent space concatenated with the RNA-specific latent space (RNA full) or the ATAC-specific latent space (ATAC full), encoded using scRNA-seq or scATAC-seq. The plot shows the mean and standard deviation for each model. The baseline accuracy computed by randomly permuting cell-type assignments is 0.108, with a standard deviation of 0.03 across 5 random permutations. b,c, The genes or peaks explained by each latent space can be identified by performing differential expression analysis using the cells at the two ends and at the center of each PC (b). The curves in c show the number of times among 108 random train–test splits that a gene passes a particular fold-change threshold, with the P-value cut-off set to 0.05 after multiple-testing correction (see Methods for details). The top genes with the highest area under the curve are colored.
d, GO enrichment analysis using the differentially expressed genes of the top PCs identified as described in c, for the shared and ATAC-specific latent spaces. The heatmap shows the proportion of times a GO term is observed to be enriched in one of the latent spaces. e, Summary of genes captured in each of the three latent spaces. f, Application of APOLLO to the CITE-seq data from ref. 25. Left: UMAPs of the shared and the two modality-specific latent spaces colored by cell type (top row) or experimental batch (bottom row). Middle: UMAPs obtained from Seurat WNN analysis. Right: UMAPs obtained from the shared latent space of a standard multi-modal autoencoder with an encoder and a decoder for each input modality (see Methods for details).

We next demonstrate how principal component (PC) analysis can be used to interpret the information contained in the shared and modality-specific latent spaces. Along each PC, we bin the cells based on their positions along the PC and then compare the gene expression or peak counts of the cells at the two ends of the PC with those of the cells at the center of the PC (Fig. 2b and Methods). We identify cell cycle-related genes among the top genes explained by the first PC of the RNA-specific latent space, such as Ticrr and Dsn138,39 (Fig. 2c), while the top PCs of the ATAC-specific latent space capture the activity of promoters of transcriptional regulators, for example, the promoter of the transcription factor Heyl (Fig. 2c and Supplementary Fig. 2b). Furthermore, the top genes explained by the first three PCs of the RNA and ATAC shared latent space contain known transcriptional regulators, some of which are known transcriptional activators or repressors (for example, Zeb1), based on the correlation between transcription factor activity and transcription factor RNA expression levels5 (Fig. 2c and Supplementary Fig. 3a).
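The PC-binning interpretation strategy can be sketched as follows, in a toy numpy example with a planted PC1-associated gene. The actual analysis uses differential expression with multiple-testing correction (Methods); here a simple mean-shift score between the PC ends and the PC center stands in for the test statistic, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n_cells, d_lat, n_genes = 600, 20, 100

# Toy stand-ins for a learned latent space and a gene expression matrix.
latent = rng.normal(size=(n_cells, d_lat))
expr = rng.normal(size=(n_cells, n_genes))

# PC analysis of the latent space (via SVD of the centered matrix).
lc = latent - latent.mean(axis=0)
_, _, vt = np.linalg.svd(lc, full_matrices=False)
pc1 = lc @ vt[0]
expr[:, 0] += 2.0 * pc1          # plant a gene that varies along PC1

# Bin cells along PC1; compare each end of the PC with the center.
q = np.quantile(pc1, [0.1, 0.45, 0.55, 0.9])
center = (pc1 > q[1]) & (pc1 < q[2])
fc_hi = expr[pc1 >= q[3]].mean(axis=0) - expr[center].mean(axis=0)
fc_lo = expr[pc1 <= q[0]].mean(axis=0) - expr[center].mean(axis=0)
score = np.maximum(np.abs(fc_hi), np.abs(fc_lo))
top_gene = int(np.argmax(score))  # recovers the planted gene (index 0)
```

Comparing each end separately against the center (rather than pooling both ends) matters: a gene that increases monotonically along the PC would average out if the two tails were pooled.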
Interestingly, the top genes of the ATAC-specific or the shared latent space are generally not related to the cell cycle, and the top genes of the RNA-specific latent space are not related to transcriptional regulation; exceptions are the gene Nkrf, captured in the shared latent space, and the gene Cbx2, captured in the RNA-specific latent space, which are known to regulate the cell cycle40,41 (Fig. 2c and Supplementary Fig. 3b).

Gene ontology (GO) enrichment analysis performed on the genes explained by each latent space identifies GO terms related to transcriptional regulation in the shared latent space and the ATAC-specific latent space (Fig. 2d and Supplementary Fig. 3c). This could indicate a time lag between the change in chromatin accessibility and the resulting change in gene expression. Interestingly, GO terms related to post-transcriptional regulation are found in the shared latent space. These analyses demonstrate that the shared and modality-specific latent spaces identified by APOLLO are biologically meaningful and can be efficiently analyzed to interpret the shared and modality-specific information (Fig. 2e).

To further demonstrate that APOLLO can disentangle shared and modality-specific information for different kinds of sequencing-based measurements, we applied APOLLO to a CITE-seq dataset of mouse spleen and lymph nodes with paired scRNA-seq and surface protein measurements obtained from two wild-type mice processed in two separate experiments25. We used the same training strategy and model architecture as in the application to the SHARE-seq data. As cross-modality prediction is not the goal in this application, we removed the two decoders \({D}_{1}^{{\prime} }\) and \({D}_{2}^{{\prime} }\) that map from the shared latent space to each of the modalities (Extended Data Fig. 6a).
Using a standard preprocessing and visualization procedure implemented in Scanpy42, the resulting UMAP plots show that the major cell types can be separated by either scRNA-seq or protein abundance, and that the scRNA-seq data show further separation of the cells by mouse, indicative of experimental batch effects (Extended Data Fig. 6b,c). The latent spaces of our APOLLO model sufficiently disentangle the multi-modal information for downstream applications: the shared latent space between scRNA-seq and protein abundance shows separation by cell type but not by experimental batch, while batch separation is captured by the modality-specific latent space of scRNA-seq (Fig. 2f and Extended Data Fig. 7a–h). In contrast, applying existing multi-modal integration methods without additional batch correction, such as the popular weighted-nearest-neighbor (WNN) method in Seurat17 or a standard multi-modal autoencoder24, results in a latent space that learns the union of information from both modalities, in which cells are separated by both cell type and batch without disentanglement (Fig. 2f, Extended Data Fig. 6d and Methods). Existing integration methods, as well as using APOLLO's full scRNA-seq latent space, both result in worse performance in cell-type clustering (Extended Data Fig. 7i–l). This analysis further demonstrates that APOLLO correctly disentangles the shared and modality-specific information while integrating different kinds of paired-sequencing modalities, which cannot be achieved by existing methods that perform only integration.

APOLLO learns partially shared latent spaces of chromatin and different proteins through conditioning

APOLLO is a general framework that can be applied to any paired multi-modal data by choosing appropriate modality-specific encoders and decoders. Furthermore, APOLLO can directly be applied to integrate more than two modalities.
We introduce a version of APOLLO that incorporates conditioning in the latent space to allow for the integration of chromatin images and images of multiple protein types.

In the following, we apply APOLLO to a multiplexed imaging dataset of human peripheral blood mononuclear cells (PBMCs) from ref. 43. The dataset consists of a total of 32,345 PBMCs from 40 patients with 1 of 4 diagnoses: healthy, meningioma, glioma, or head and neck tumor (Supplementary Fig. 4). For each patient, two different multiplexed imaging datasets were obtained. One subset of cells was stained with 4′,6-diamidino-2-phenylindole (DAPI) to label the chromatin, along with antibodies for CD4, CD8 and CD16. The other subset of cells was stained with DAPI and antibodies for lamin, CD3 and γH2AX. To incorporate multiple proteins, a trainable protein-ID vector, shared across all images of the same protein, is concatenated to the inputs of the decoders as well as to the last hidden layer of the encoders (Extended Data Fig. 8, Supplementary Fig. 5 and Methods). This conditional model accurately reconstructs images of cells from held-out individuals (Fig. 3a and Supplementary Fig. 6), which ensures that APOLLO can comprehensively capture biologically relevant information from the input modalities. Training APOLLO on this dataset using a 24 GB GPU (A5000) took 40 seconds per epoch for step 1 and 20 seconds per epoch for step 2.

Fig. 3: APOLLO for predicting protein localization from chromatin imaging.
a, Examples of reconstructed images of cells from patients held out from the PBMCs imaging dataset in ref. 43, compared with ground-truth images. b, Protein predictions from chromatin images for a cell where the ground-truth protein images are available for the three proteins (γH2AX, lamin and CD3).
The average prediction loss for each protein is quantified by comparing the APOLLO model with different latent space dimensions to two alternative models, namely, our model without the decoders that map from the shared latent space to the output (without shared decoders) and our model without separation of the shared and modality-specific latent spaces (without modality specific). The different latent space dimensions tested are listed as 'shared latent space size'-'modality-specific latent space size', for example, 1,024-200. The error bars show the standard deviations across 5 random initializations for the model with sizes 1,024-200 and 4 random initializations for all other models. c, Cross-modality predictions obtained using our full model compared with those obtained using a standard multi-modal autoencoder training procedure as well as a prior image-inpainting model45 (see Methods for details). Data are presented as mean ± s.d. of 5 random initializations for our model and 4 random initializations for the other models. Protein predictions from chromatin images are shown for the same cell used in a. d, Comparison of the model predictive performance using different input types: real protein/chromatin images (real protein), the reconstructed images from the full latent space (full protein latent), the reconstructed images from only the shared latent space (shared protein latent), and the protein images predicted from chromatin images (predicted from chromatin). Data are presented as mean ± s.d. of 36 different train–test splits. The asterisks indicate that the prediction performance of the protein images reconstructed from the full latent space is significantly better than that of the protein images reconstructed from only the shared latent space (P = 0.0011 for γH2AX and P = 0.098 for lamin using a one-sided two-sample t-test.
P = 0.00088 for γH2AX and P = 0.047 for lamin using a Wilcoxon signed-rank test.)

APOLLO enables accurate cross-modality prediction of protein localization from chromatin imaging

The number of proteins that can be simultaneously imaged in a cell is limited, ranging from a single protein using endogenous tagging44 to around 30 proteins in fixed samples1. In the following, we show that the shared latent space learned by APOLLO enables cross-modality predictions and can be applied to predict the unmeasured proteins in a cell from the chromatin image of that cell (Extended Data Fig. 8a and Methods).

The task of predicting images of unmeasured protein stains has been considered using supervised neural network models45,46. In particular, image-inpainting methods have been applied to predict the protein image of a target cell from its chromatin and microtubule stains, given all three images of other cells45. While such inpainting methods can capture the average distribution, they do not produce realistic single-cell images (Fig. 3b,c, Supplementary Fig. 7e and Extended Data Fig. 9). In contrast, APOLLO can accurately predict unmeasured proteins at the single-cell level in held-out individuals based on chromatin images (Extended Data Fig. 9), and the performance of APOLLO is robust to the choice of latent space dimensions (Fig. 3b,c). To interrogate the necessity of each component of APOLLO, we perform an ablation study for the cross-modality prediction task. We first compare our two-step latent-optimization training to the standard autoencoder training procedure, in which encoders and decoders have the same architecture as our default APOLLO model but are trained simultaneously without directly updating the parameters of the latent spaces (Supplementary Fig. 7a and Extended Data Fig. 10a). We find that standard one-step training is not able to generate realistic protein images in held-out individuals (Fig. 3b,c and Extended Data Fig. 9).
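The cross-modality prediction pipeline itself (encode with E1, keep the shared block, decode with D2′) can be sketched in a minimal numpy example; linear maps stand in for the trained encoder and shared decoder, the "chromatin-like" and "protein-like" modalities are synthetic, and all names and dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_s, d_m = 400, 6, 3

# Toy latent codes; linear maps stand in for the trained encoder E1 and the
# shared-latent decoder D2' (all names and dimensions are illustrative).
z = rng.normal(size=(n, d_s + d_m))            # z_{H_1} = (z_S, z_{S_1})
A1 = rng.normal(size=(d_s + d_m, 60))
x1 = z @ A1                                     # chromatin-like modality
A2 = rng.normal(size=(d_s, 40))
x2 = z[:, :d_s] @ A2                            # protein-like modality (depends on z_S only)

E1, *_ = np.linalg.lstsq(x1, z, rcond=None)     # encoder: modality 1 -> z_{H_1}
D2p, *_ = np.linalg.lstsq(z[:, :d_s], x2, rcond=None)  # decoder D2' from the shared block

def predict_modality2(x_new):
    """Cross-modality prediction: encode, keep the shared dimensions, decode."""
    z_hat = x_new @ E1              # E_1(x^(1))
    return z_hat[:, :d_s] @ D2p     # D_2'(z_S)

x2_hat = predict_modality2(x1)
err = np.mean((x2_hat - x2) ** 2) / np.mean(x2 ** 2)  # relative error; ~0 in this noiseless toy
```

Because the target modality in this toy depends only on the shared factor, restricting the prediction to the shared dimensions loses nothing; in real data the modality-specific dimensions of the target remain unpredictable from the other modality by construction.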
Furthermore, we tested the effect of removing the separation between the shared and modality-specific latent spaces while keeping the same two-step training procedure (Supplementary Figs. 7b–d and Extended Data Fig. 10b). This model without disentanglement of modality-specific information shows performance comparable to that of our APOLLO model, which uses only the (smaller-dimensional) shared latent space for prediction (Fig. 3b), indicating that the shared latent space has captured all shared information. These results suggest that the improvement in prediction performance by APOLLO is mainly a result of the two-step training procedure and that the latent space disentanglement allows for the use of a smaller latent space dimension (two-thirds of the full latent space) for cross-modality prediction at test time. As in the applications to paired-sequencing modalities, it is also possible to remove the decoders that map from the shared latent space to each of the modalities while maintaining comparable performance in cross-modality prediction (Fig. 3b).

Next, we demonstrate that the protein images predicted from chromatin images perform similarly to the real protein images on a downstream classification task. For each protein, we train four separate convolutional neural network classifiers with the same architecture to predict the phenotype of the individual from whom the input cell image was obtained: healthy, meningioma, glioma, or head and neck tumor. For each protein, the four classifiers use the following inputs, respectively: (1) real protein images; (2) reconstructed protein images using the full protein latent space (the shared and protein-specific latent spaces); (3) reconstructed protein images using only the shared latent space; and (4) protein images predicted from chromatin images through the shared latent space.
While the reconstructed protein images have phenotype classification accuracy similar to that of the real protein images across different proteins (Fig. 3d), reconstruction from the full latent space has slightly better classification accuracy than reconstruction from the shared latent space for γH2AX and lamin (Fig. 3d), indicating that the protein-specific latent spaces are able to learn disease-relevant information that is not shared with chromatin. Moreover, Fig. 3d shows that the predicted protein images and the real protein images have comparable phenotype classification accuracy, with CD3 being a better predictor of phenotype than CD16, CD4 or CD8, which suggests that the protein images predicted from chromatin capture disease-relevant information similar to that in the real protein images. These results demonstrate that APOLLO is capable of accurately predicting protein localization from chromatin organization and that the predicted protein images can be used for downstream tasks with performance mimicking that of real protein images.

APOLLO identifies interpretable morphological features between chromatin organization and protein localization

In contrast to paired-sequencing measurements, where the different modalities can be compared and interpreted at the level of genes5, shared features are not directly available for paired image modalities. APOLLO provides a systematic framework that can be applied in these settings as well, to interrogate the information that is shared between modalities and specific to each modality. To identify the morphological features captured by the shared and modality-specific latent spaces, we perform PC analysis and bin cells along each PC (see Methods for details). Cells along a particular PC can then be sampled and visualized to interrogate the information captured in the shared or modality-specific latent spaces (Fig. 4a). We use the set of chromatin and protein features defined in ref.
43 to compare the cells at the two ends of each PC and at the center of the PC for each latent space, thereby allowing us to identify the main features captured by each latent space (Fig. 4a, Supplementary Figs. 8–14 and Methods). For example, we find that the first PC of the shared latent space mainly captures bounding box area and homogeneity of chromatin (Fig. 4a).

Fig. 4: Interpretable morphological features in the shared and modality-specific APOLLO latent spaces of chromatin organization and protein localization. a, Interpretation of the latent space learned from the PBMCs multiplexed imaging dataset43. Left: cells separated along the first PC of the chromatin shared latent space into 11 bins of equal percentiles; randomly selected cells in each bin are shown. Right: 234 predefined handcrafted morphological features are computed for each cell and the values of these features are computed for the shared and modality-specific latent spaces at the two ends and at the center of each PC. The changes of representative morphological features along PC1 and PC17 of the chromatin shared latent space are plotted. FC, fold change. b, Heatmaps showing the proportion of times each representative morphological feature is explained by a top PC in the shared or the modality-specific latent space. c, Heatmap showing the morphological feature values in each cell, where each row represents a cell. The bar plot shows the fold change of prediction accuracy for each feature ablation compared with the prediction accuracy using all features. Error bars indicate standard deviations across 36 different random initializations.
P values of two-sided t-tests between the prediction accuracy of each feature ablation and the accuracy using all features are 0.46 (median_gh2ax_3d_int), 0.16 (skewness_gh2ax_3d_int), 0.11 (kurtosis_gh2ax_2d_int), 0.14 (skewness_gh2ax_2d_int) and 0.00022 (gh2ax_foci_count).

To more comprehensively characterize the information captured by each latent space, we compare the image features contained in the top PCs that explain 70% of total variance in each latent space (Fig. 4b and Supplementary Figs. 9c, 10c, 11c and 14c). Interestingly, this analysis shows that heterochromatin volume, a feature associated with aging47 and Alzheimer’s disease23, is exclusively captured by the chromatin and protein shared latent space (namely, in PC17; Fig. 4a), whereas homogeneity, mean intensity and aspect ratio of chromatin are captured by both the shared and the chromatin-specific latent spaces. A similar analysis applied to each protein shows that the foci count of γH2AX is a protein-specific feature that is not captured by the chromatin and protein shared latent space (Fig. 4b). When training separate regression models to predict the three γH2AX features shown in Fig. 4b based on chromatin images (see Methods for details), the regression outputs show the lowest correlation with the ground-truth feature values for the foci count of γH2AX compared with the other two features that are mainly captured by the shared latent space (Supplementary Fig. 13c). This indicates that APOLLO correctly identified the γH2AX foci count as a protein-specific feature while the other two features are also captured by chromatin imaging. A feature ablation test indicates that removing γH2AX foci count results in the largest reduction of phenotype classification accuracy (Fig. 4c), consistent with the significant reduction of phenotype classification accuracy observed when the protein-specific latent space of γH2AX is not used in reconstructing the protein images (Fig. 3d).
This confirms that the modality-specific latent space can capture important disease-relevant features and demonstrates how APOLLO can be used to interrogate the shared and modality-specific information between imaging modalities, where shared features are not directly available.

APOLLO identifies associations between protein subcellular localization and cell morphology

The function and activity of a protein are known to be tightly coupled to its subcellular localization, which can vary across single cells even within the same cell line35,44,48,49,50. Computational tools have been developed to analyze single-cell variability in protein localization, showing that cellular and nuclear morphology can be used to predict protein subcellular localization in single cells32,45,51. However, these models take as input images of multiple cellular components, and little is known about the association of each cellular component with the change in protein localization across single cells. We apply APOLLO to images of U2OS cells (the most abundant cell line) in the HPA35 to learn the shared and modality-specific information between the morphology of the nucleus, microtubules and the endoplasmic reticulum (ER) with respect to variations in protein subcellular localization (see Methods for details). In HPA, all cells are stained for the nucleus, microtubules and the ER (the 3 reference stains), together with 1 additional target protein (a total of 11,657 proteins across all U2OS cells; see Methods for details). Concentrating on the 25 proteins with the most variable subcellular localization (see Methods for details), we disentangle the information in the three cellular components to analyze their association with the variability in protein subcellular localization.

To this end, we separately train three APOLLO models to learn the shared and modality-specific information between each pair of the three reference stains (see Methods for details).
We then cluster the cells in each latent space into 2 clusters and test for each of the 25 proteins whether the 2 clusters capture differences in their subcellular localization, as measured by the proportion of protein localized within the cell nucleus (Fig. 5a and Methods). Notably, the three models consistently indicate that variation in intranuclear localization of many proteins is differentially captured by the three different compartments. For example, morphological features of both the ER and microtubules, but not the nucleus, are associated with the variability in intranuclear localization of DDB1 (Fig. 5a–c), which is known to change localization in response to ultraviolet exposure52. Both the microtubule and ER modality-specific latent spaces indicate that cells with smaller cytoplasmic volume and lower intensity of microtubule and ER are associated with high intranuclear localization of DDB1 (Fig. 5b,c). In contrast, the intranuclear localization of CLNS1A and C8orf59 is only associated with morphological features of the nucleus (Fig. 5a,d,e). Interestingly, for CLNS1A, a protein involved in small nuclear ribonucleoprotein biogenesis and control of cell volume53,54,55, our model suggests that higher intranuclear localization is found in cells with higher heterochromatin content (Fig. 5d); for C8orf59, a protein involved in ribosome biogenesis56, our model suggests that increased intranuclear localization is associated with more circular nuclei (Fig. 5e).

Fig. 5: APOLLO for modeling different cellular components with respect to protein subcellular localization. a, Comparison of three APOLLO models trained using nucleus and microtubule stains (left), nucleus and ER stains (middle), and microtubule and ER stains (right) from the HPA dataset35. For each model, each of the shared and the two modality-specific latent spaces is clustered into two clusters.
The heatmap shows the absolute value of the difference in the mean intranuclear proportion between the two clusters, averaged across the five random initializations (see Methods for details). *P < 0.1 in all 5 random initializations (see Methods for details). The arrows indicate proteins with example images shown below and in Supplementary Fig. 15. b–e, Intranuclear proportions are computed in a total of 37,674 single-cell images, with at least 150 single-cell images per protein. Left: box plots of the intranuclear proportions of DDB1 (b,c), CLNS1A (d) and C8orf59 (e) in the two clusters of each latent space, obtained using the models trained with either the nucleus and microtubule images (b,d,e) or the nucleus and ER images (c). Right: examples of cells in each cluster of the microtubule-specific latent space (b), the ER-specific latent space (c) or the nucleus-specific latent space (d,e) stained for the particular proteins. f, Box plot of the intranuclear proportions of ATE1 in the two clusters of each latent space, obtained using the model trained with nucleus and ER images. g, Examples of cells stained for ATE1 in each cluster of the ER-specific latent space. h, Examples of cells stained for ATE1 in each cluster of the nucleus-specific latent space. Box plots: center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range.

For some proteins, the difference in intranuclear localization is captured by more than one modality-specific latent space but not by the shared space in the same model; for example, both the nucleus-specific and ER-specific latent spaces, but not the shared space, capture differences in intranuclear localization of ATE1 across cells (Fig. 5a). This indicates that multiple aspects of nuclear and cellular morphology, captured by the different latent spaces, could simultaneously contribute to the localization of a particular protein in different ways.
Indeed, analyzing the clusters in the different modality-specific latent spaces shows that while they are associated with differences in the intranuclear localization of a protein, the cell clusters themselves are different (Supplementary Fig. 15a–c and Methods). For example, while both the ER-specific and nucleus-specific latent spaces show significant separation of cells by their intranuclear localization of ATE1 (Fig. 5a,f), visualizing cells in the clusters of the different latent spaces suggests that the ER-specific clusters capture differences in ER intensity and cytoplasmic volume (Fig. 5g), while the nucleus-specific clusters capture differences in nuclear size (Fig. 5h). A similar analysis for CIZ1 of the clusters in the microtubule-specific latent space suggests that low intranuclear localization is associated with a larger ratio of cytoplasmic-to-nuclear volume (Supplementary Fig. 15e), which is not a distinctive feature between the clusters of the nucleus-specific latent space (Supplementary Fig. 15d). This analysis demonstrates that APOLLO can generalize to different multiplexed imaging experiments and provide insights into the relations between protein subcellular localization and the morphology of different cellular components.

Discussion

We introduced APOLLO, a general computational framework that uses autoencoders with partially overlapping latent spaces to explicitly model the shared and modality-specific information across diverse multi-modal single-cell datasets. This design enables accurate cross-modality prediction and provides a systematic way to link interpretable features to cell state. Although we have demonstrated that APOLLO performs well on a variety of simulated and real data and is based on theoretical causal foundations, the latent optimization implementation does not have known theoretical guarantees, and thus the exact conditions under which APOLLO correctly disentangles multi-modal information are unknown.
Our work motivates several future research directions. First, although we demonstrated in the SHARE-seq and multiplexed imaging applications that APOLLO is robust to the choice of latent space sizes, accurate estimation of the intrinsic dimensionality of the shared and modality-specific latent spaces could provide additional guidance for parameter selection and insight into underlying biological mechanisms. While recent theoretical work shows that these dimensions are identifiable under suitable conditions36, further methodological development is needed to devise practical algorithms for estimating intrinsic dimensionality across different modalities. Second, the Gaussian noise added to APOLLO latent spaces could be replaced with learnable noise parameters to enable uncertainty quantification. Third, as multi-modal measurements become increasingly common, extending APOLLO to unpaired measurements could provide important insights when paired assays are not feasible, for example when obtaining spatial transcriptomics and multiplexed protein stains on adjacent tissue slices. One potential approach is to model unpaired data using two shared latent spaces learned during step 1 training, with a distribution-matching loss, such as maximum mean discrepancy, that encourages the shared spaces to represent the same distribution of cells across modalities.

Collectively, APOLLO enables explicit learning of shared and modality-specific information, leading to a more holistic understanding of cell states and their underlying regulatory mechanisms. While we demonstrated APOLLO on SHARE-seq, CITE-seq and multiplexed protein staining, the method is broadly applicable to other multi-modal single-cell measurements by adapting the encoder and decoder architectures. More broadly still, APOLLO could be applied beyond single-cell settings to individual-level multi-modal data, leveraging diverse medical modalities in large biobanks33,34.
APOLLO thus offers a framework that goes beyond multi-modal integration57 by enabling mechanistic interpretation of the learned latent space through information disentanglement.

Methods

Latent optimization

In practice, to improve generalization to unseen samples, we add Gaussian noise to each component of the latent space (similar to variational sampling in autoencoders) and add a regularization term (the ℓ2-norm of the latent features) in addition to the reconstruction losses. Thus, the full objective function is:

$$\mathbf{E}_{(x^{(1)},x^{(2)})\sim P(X^{(1)},X^{(2)})}\big[L\big(x^{(1)},D_{1}(z_{S}+\epsilon_{S},\,z_{S_{1}}+\epsilon_{S_{1}})\big)+L\big(x^{(2)},D_{2}(z_{S}+\epsilon_{S},\,z_{S_{2}}+\epsilon_{S_{2}})\big)+L\big(x^{(1)},D_{1}^{\prime}(z_{S}+\epsilon_{S})\big)+L\big(x^{(2)},D_{2}^{\prime}(z_{S}+\epsilon_{S})\big)\big]+\lambda\big(\|z_{S}\|_{2}+\|z_{S_{1}}\|_{2}+\|z_{S_{2}}\|_{2}\big),\tag{1}$$

where \(\epsilon_{S}\), \(\epsilon_{S_{1}}\) and \(\epsilon_{S_{2}}\) denote the Gaussian noise in each component of the latent space and λ is a hyperparameter. We minimize equation (1) to obtain the decoders \(D_{1}\), \(D_{2}\), \(D_{1}^{\prime}\) and \(D_{2}^{\prime}\), as well as the shared latent features \(z_{S}\) and the modality-specific latent features \(z_{S_{1}}\) and \(z_{S_{2}}\).

Model architecture and training of paired scRNA-seq and scATAC-seq

For both training steps, we split the data into training and validation sets, as is standard in neural network training. We use the same randomly selected 85% of cells from the SHARE-seq dataset5 to train our model and the remaining 15% to validate and test the generalization performance of our model.

Step 1 latent optimization

In this step, we train the latent spaces and the corresponding decoders to reconstruct the scRNA-seq and scATAC-seq data (Extended Data Fig. 5b and Supplementary Fig. 1a).
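The step-1 latent optimization of equation (1) can be sketched as follows: the latent codes are stored as trainable embedding tables and optimized jointly with the decoders. This is a minimal illustration, with single linear layers standing in for the real decoders and dimensions chosen arbitrarily.

```python
import torch
import torch.nn as nn

# Minimal sketch of step-1 latent optimization (equation (1)).
# The latent codes are trainable embedding tables, optimized jointly with
# the decoders; single linear layers stand in for the real decoders.
n_cells, d_shared, d_spec, d_x1, d_x2 = 100, 50, 20, 30, 40
z_S  = nn.Embedding(n_cells, d_shared)  # shared latent space
z_S1 = nn.Embedding(n_cells, d_spec)    # modality-1-specific latent space
z_S2 = nn.Embedding(n_cells, d_spec)    # modality-2-specific latent space

D1  = nn.Linear(d_shared + d_spec, d_x1)  # full latent -> modality 1
D2  = nn.Linear(d_shared + d_spec, d_x2)  # full latent -> modality 2
D1p = nn.Linear(d_shared, d_x1)           # shared latent only -> modality 1
D2p = nn.Linear(d_shared, d_x2)           # shared latent only -> modality 2

bce = nn.BCEWithLogitsLoss()
lam = 0.001  # l2 regularization weight for the latent features

def step1_loss(idx, x1, x2):
    zs, z1, z2 = z_S(idx), z_S1(idx), z_S2(idx)
    # Gaussian noise added to every latent component at each epoch
    zs_n, z1_n, z2_n = (z + torch.randn_like(z) for z in (zs, z1, z2))
    recon = (bce(D1(torch.cat([zs_n, z1_n], -1)), x1)
             + bce(D2(torch.cat([zs_n, z2_n], -1)), x2)
             + bce(D1p(zs_n), x1)
             + bce(D2p(zs_n), x2))
    reg = lam * (zs.norm(2) + z1.norm(2) + z2.norm(2))
    return recon + reg
```

Minimizing this loss updates the embedding tables directly, which is what distinguishes latent optimization from a standard autoencoder pass.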
The training was performed on one 24 GB GPU for 35 hours, with 10.8 seconds per epoch.

Latent spaces of paired scRNA-seq and scATAC-seq

The shared latent space has 50 dimensions and each of the two modality-specific latent spaces has 20 dimensions. The shared latent space is chosen to be much larger than the modality-specific latent spaces to ensure that the shared space has enough capacity to contain all shared information. Latent spaces are initialized using ‘torch.nn.Embedding’ with the default parameters. Independent Gaussian noise with zero mean and unit variance is added to the latent spaces at each training epoch. The hyperparameter λ in equation (1) for the ℓ2 regularization of the latent spaces is set to 0.001 and the learning rate of the ADAM optimizer is set to 0.001, which are the same hyperparameter values as in ref. 37.

Decoders of paired-sequencing-based modalities

Four decoders are trained in step 1: (1) a decoder that reconstructs scRNA-seq from the shared latent space; (2) a decoder that reconstructs scRNA-seq from the full RNA-seq latent space (the shared latent space embedding concatenated with the RNA-specific latent space embedding); (3) a decoder that reconstructs scATAC-seq from the shared latent space; and (4) a decoder that reconstructs scATAC-seq from the full ATAC-seq latent space (the shared latent space embedding concatenated with the ATAC-specific latent space embedding). The architectures of the decoders are shown in Extended Data Fig. 5b. The input feature dimension is 70 for the full latent space decoder and 50 for the shared latent space decoder, except when different latent space sizes are tested (Fig. 2a). Each decoder has 3 hidden layers, each of which has dimension 1,024. All hidden layers are linear layers with a dropout rate of 0.01 and are followed by LeakyReLU activation and a batch normalization layer. The output layer is a linear layer with a dropout rate of 0.01 and sigmoid activation.
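The decoder just described can be sketched as a small PyTorch module. This is an illustrative reconstruction from the text, not the released code; the module returns logits so that the loss can be computed before the sigmoid, matching the BCEWithLogitsLoss set-up.

```python
import torch.nn as nn

# Sketch of the sequencing-modality decoder described above: three hidden
# blocks (linear, dropout 0.01, LeakyReLU, batch norm) followed by a linear
# output layer with dropout. Returns logits; the sigmoid is applied outside
# the module. Dimensions follow the SHARE-seq application.
def make_decoder(latent_dim, out_dim, hidden=1024, n_hidden=3):
    layers, d = [], latent_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.Dropout(0.01),
                   nn.LeakyReLU(), nn.BatchNorm1d(hidden)]
        d = hidden
    layers += [nn.Linear(d, out_dim), nn.Dropout(0.01)]  # logits
    return nn.Sequential(*layers)

rna_decoder_full   = make_decoder(70, 9153)  # shared + RNA-specific latents
rna_decoder_shared = make_decoder(50, 9153)  # shared latents only
```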
This decoder architecture follows standard autoencoder set-ups23,58. For both modalities, we use binary cross-entropy loss to minimize the objective function defined in equation (1), which is calculated using ‘torch.nn.BCEWithLogitsLoss’ based on the output before the sigmoid activation, with pos_weight set inversely proportional to the total counts of genes or peaks. The learning rate of the decoders is 0.0001 and ADAM is used for optimization, the same as in ref. 37.

Step 2 inference

In this step, we train the encoders to infer the learned latent spaces from the scRNA-seq and scATAC-seq data without updating the latent spaces or the decoders (Extended Data Fig. 5c and Supplementary Fig. 1b). The training was performed on one 24 GB GPU for 9 hours, with 7.5 seconds per epoch, to train the model past convergence.

Encoders of paired scRNA-seq and scATAC-seq

We train two separate encoders for scRNA-seq and scATAC-seq with identical structures (Extended Data Fig. 5c). Each encoder starts with two linear layers with a dropout rate of 0.01, each of which has 1,024 dimensions and is followed by LeakyReLU activation and batch normalization, which is a standard set-up for autoencoders23,58. After the first two hidden layers, two separate linear layers with LeakyReLU activation are used to obtain separate hidden layers for the shared and modality-specific latent spaces. A linear layer is applied to each hidden layer to obtain the shared or modality-specific latent space. The inferred latent spaces from the autoencoder are compared with the latent spaces learned in step 1 through MSE loss. The MSE loss is minimized using the ADAM optimizer with a learning rate of 0.0001, the same learning rate as the decoder.

Application to paired scRNA-seq and protein abundance

The same set-up is applied to the paired scRNA-seq and protein abundance data measured by CITE-seq25, with dimension 50 for the shared latent space and dimension 30 for each modality-specific latent space.
Given that cross-modality prediction is not the goal in this application, and to demonstrate the robustness of our approach, the model is trained without the two decoders \(D_{1}^{\prime}\) and \(D_{2}^{\prime}\) that map from the shared latent space to each of the modalities (Extended Data Fig. 6a).

Pre-processing of scRNA-seq and scATAC-seq data

The paired scRNA-seq and scATAC-seq datasets from ref. 5 are separately filtered for genes or peaks that are non-zero in at least 300 cells and then filtered for cells with at least 300 non-zero genes or peaks. The common cells between the two modalities are then selected for another round of filtering with the same criteria. These two rounds of filtering result in 28,098 cells with 9,153 genes in the scRNA-seq data, and 58,170 peaks in the scATAC-seq data. The data are then log-transformed and min–max scaled, such that the minimum count in each cell is 0 and the maximum count in each cell is 1. Our filtering and normalization steps follow standard procedures for preprocessing scRNA-seq and scATAC-seq data (see, for example, refs. 5,22,23,42).

Pre-processing of scRNA-seq and cellular surface protein abundance data

Following the approach taken by a previous study that analyzes this dataset25, we use the top 4,005 highly variable genes in the scRNA-seq data and all 110 proteins. As in the preprocessing of scRNA-seq and scATAC-seq described above, the data are log-transformed and min–max scaled, such that the minimum count in each cell is 0 and the maximum count in each cell is 1.

Alternative models for scRNA-seq and cellular surface protein abundance data

We use the default parameter setting for the Seurat WNN method17. For testing the standard multi-modal autoencoder method, we use the same encoder and decoder architectures as the full APOLLO model with 80 latent dimensions, which is equal to the full latent space dimension of the APOLLO model.
Instead of performing a two-step training, this model trains the encoder and decoder together in a single step, without directly updating the latent spaces as parameters for optimization. MSE loss is added between the latent spaces obtained from the two encoders of the two input modalities. Similar to the APOLLO model, which adds Gaussian noise in the latent space, this model performs a variational sampling step as in a standard variational autoencoder and the resulting latent spaces are passed to the decoder for reconstruction. In each epoch, we alternate between training the scRNA-seq autoencoder and training the protein autoencoder.

Cell-type prediction based on the inferred latent spaces of scRNA-seq and scATAC-seq

We train four separate neural network classifiers with the same architecture to predict cell types based on one of the following inputs: (1) the shared latent space inferred from scRNA-seq data using the RNA encoder; (2) the shared latent space inferred from scATAC-seq using the ATAC encoder; (3) both the shared and modality-specific latent spaces inferred from scRNA-seq data using the RNA encoder; and (4) both the shared and modality-specific latent spaces inferred from scATAC-seq data using the ATAC encoder. We use a standard feedforward neural network architecture in our classifiers; see, for example, ref. 19. Each classifier consists of 4 layers and outputs the probability of a given cell being assigned to each of the 23 cell types. The first three layers are followed by LeakyReLU activation and batch normalization. All 4 layers have a dropout rate of 0.1 and a hidden dimension of 128. We use the ADAM optimizer with a learning rate of 0.00001 to minimize the cross-entropy loss between our prediction and the cell-type labels assigned by ref. 5.
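The classifier architecture described above can be sketched as follows; this is an illustrative reconstruction from the text, not the released code.

```python
import torch.nn as nn

# Sketch of the cell-type classifier described above: four linear layers with
# hidden dimension 128 and dropout 0.1; the first three are followed by
# LeakyReLU and batch normalization, and the final layer outputs one logit
# per cell type (cross-entropy is applied to these logits).
def make_celltype_classifier(in_dim, n_types=23, hidden=128):
    layers, d = [], in_dim
    for _ in range(3):
        layers += [nn.Linear(d, hidden), nn.Dropout(0.1),
                   nn.LeakyReLU(), nn.BatchNorm1d(hidden)]
        d = hidden
    layers += [nn.Linear(d, n_types), nn.Dropout(0.1)]
    return nn.Sequential(*layers)

clf_shared = make_celltype_classifier(50)  # input: shared latent space only
clf_full   = make_celltype_classifier(70)  # input: shared + modality-specific
```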
We use the same train–validation split to train the classifiers as in the training of the APOLLO model, which means the same 85% of cells are used to train the APOLLO model and the cell-type classifiers.

Model architecture and training of paired chromatin and protein imaging

For both training steps, we hold out all images from one patient in each of the four phenotypes for testing the model (Supplementary Fig. 4).

Step 1 latent optimization

In this step, we train the latent spaces, protein IDs and the corresponding decoders to reconstruct the chromatin and protein images (Extended Data Fig. 8a and Supplementary Fig. 5a). The training was performed on one 24 GB GPU with 40 seconds per epoch.

Latent spaces and protein IDs of paired chromatin and protein imaging

For each protein, we randomly initialize a trainable 64-dimensional protein ID vector that is shared across all images of that protein by using ‘torch.nn.Embedding’. The protein IDs are used in both the encoding and decoding steps by concatenating them to the latent space embeddings, similar to a conditional autoencoder model (Extended Data Fig. 8). The shared latent space has 1,024 dimensions and each of the two modality-specific latent spaces has 200 dimensions, except when different latent space sizes are tested (Fig. 3b). Similar to the application to paired scRNA-seq and scATAC-seq, the shared latent space is chosen to be much larger than the modality-specific latent spaces to ensure that the shared space has enough capacity to contain all the shared information. Latent spaces are initialized using ‘torch.nn.Embedding’ with the default parameters. Independent Gaussian noise with zero mean and unit variance is added to the latent spaces at each training epoch. The hyperparameter λ in equation (1) for the ℓ2 regularization of the latent spaces is set to 0.001 and the learning rate of the ADAM optimizer is set to 0.001, which are the same hyperparameter values as in ref.
37.

Decoders of paired chromatin and protein imaging

Four decoders are trained in step 1: (1) a decoder that reconstructs chromatin images from the shared latent space; (2) a decoder that reconstructs chromatin images from the full chromatin latent space (the shared latent space embedding concatenated with the chromatin-specific latent space embedding); (3) a decoder that reconstructs protein images from the shared latent space; and (4) a decoder that reconstructs protein images from the full protein latent space (the shared latent space embedding concatenated with the protein-specific latent space embedding). The architectures of the decoders are shown in Extended Data Fig. 8a. The latent space embedding, after adding noise and being concatenated with the protein ID, is passed through a linear layer with ReLU activation and reshaped to 4 × 4 × 96 dimensions for the subsequent convolutions. This is followed by 5 convolutional layers with a kernel size of 4 and a stride of 2. The number of channels in each hidden layer is listed in Extended Data Fig. 8a. The first four convolutional layers are followed by batch normalization and LeakyReLU activation. The last convolutional layer is followed by sigmoid activation to scale the output image from 0 to 1. For both imaging modalities, we use binary cross-entropy loss to minimize the objective function defined in equation (1), which is calculated using ‘torch.nn.BCEWithLogitsLoss’ based on the output before the sigmoid activation. The learning rate of the decoders is 0.0001 and ADAM is used for optimization. The decoder architectures and training procedure follow standard set-ups for autoencoders (see, for example, refs. 23,37,45).

Step 2 inference

In this step, we train the encoders to infer the learned latent spaces from the chromatin and protein images without updating the latent spaces, protein IDs or the decoders (Extended Data Fig. 8b and Supplementary Fig. 5b).
The training was performed on one 24 GB GPU with 20 seconds per epoch.

Encoders of paired chromatin and protein imaging

We train two separate encoders for chromatin and protein images with identical structures (Extended Data Fig. 8b). Each encoder starts with five convolutional layers with LeakyReLU activation, a standard set-up for autoencoders23,37,45. The dimensions of the hidden layers are listed in Extended Data Fig. 8b. The output of the last convolutional layer is divided into two sets of channels that are used to derive the shared and modality-specific latent spaces, respectively. Eighty out of the 96 channels of the last hidden layer are flattened, concatenated with the protein ID and passed through a linear layer to obtain the shared latent space. Similarly, the modality-specific latent space is obtained from the remaining 16 channels. The inferred latent spaces from the autoencoder are compared with the latent spaces learned in step 1 through MSE loss. The MSE loss is minimized using the ADAM optimizer with a learning rate of 0.001.

Cross-modality predictions

To predict unmeasured proteins, the shared latent space is first inferred from the chromatin image of a cell using the chromatin encoder. Each protein image can then be predicted by decoding the inferred shared latent space through the protein decoder using the protein ID of the target protein (Extended Data Fig. 8a).

Application to HPA data

The same set-up is applied to the HPA data35, with a shared latent space dimension of 1,024 and a modality-specific latent space dimension of 200. The two decoders \(D_{1}^{\prime}\) and \(D_{2}^{\prime}\) that map from the shared latent space to each of the modalities are not used, as their omission does not impact cross-modality-prediction performance (Fig. 3b) and results in correct disentanglement of the shared and modality-specific information (Fig. 2f).

Alternative models

The models used for benchmarking our full APOLLO model are described below.
In addition to using binary cross-entropy (BCE) loss for the decoder outputs, we also tested an alternative loss function, namely the MSE loss. All model training and testing use the same train–test split. For benchmarking cross-modality prediction, the prediction results are first thresholded using the mode intensity and then compared using ℓ1 loss, which is not used for training any of the models.

Standard autoencoder training using a single step

This model has the same encoder and decoder architectures as the full APOLLO model, and the same dimensions of the shared, chromatin-specific and protein-specific latent spaces are used. Protein IDs are concatenated to the hidden layers during encoding and decoding as described in the full model. However, instead of performing a two-step training, this model trains the encoder and decoder together in a single step, without directly updating the latent spaces as parameters for optimization (Extended Data Fig. 10a). MSE loss is added between the shared latent spaces obtained from the two encoders of the two input modalities. Similar to the APOLLO model, which adds Gaussian noise in the latent space, this model performs a variational sampling step as in a standard variational autoencoder and the resulting latent spaces are passed to the decoder for reconstruction. In each epoch, we alternate between training the chromatin autoencoder and training the protein autoencoder.

Our model without modality-specific latent space

This model has the same two-step training procedure as the full APOLLO model, but does not separate the shared and the modality-specific latent spaces (Extended Data Fig. 10b). There are only two decoders, which decode the two modalities from the shared latent space and are trained in step 1.
The encoders only output the shared latent space, which has the same dimension as the combined dimension of the shared and the modality-specific latent spaces in the full APOLLO model.

Inpainting model

For the inpainting model developed in ref. 45, we adapted the original code from TensorFlow to PyTorch. We used the same hyperparameter values as in the original publication, including the use of MSE loss, except for a slight modification to adapt to the single-channel input image without a microtubule channel.

Data simulation and assessment of APOLLO’s disentanglement performance

To systematically evaluate APOLLO’s ability to disentangle shared and modality-specific representations, we generate simulated multi-modal datasets with ground-truth latent structure. Given the recent theoretical results on the identifiability of the shared and modality-specific variables36, we only allow directed edges from the shared latent variables to the modality-specific latent variables, but not vice versa. For all simulations, we assign the latent variables Z1 and Z2 to the shared latent space, Z3 and Z4 to the modality 1 specific latent space, and Z5 to the modality 2 specific latent space. The latent causal graph and the corresponding probability distribution are specified in panel a of Extended Data Figs. 1–4, from which 2,000 samples of latent features are independently drawn for each simulation and visualized in panel b of Extended Data Figs. 1–4. Given the causal graphs, we expect an accurate disentanglement to identify that the modality 1 specific latent space also captures Z2 in simulations 3–5 but not in simulations 1 and 2. Each observed feature is a linear combination of one or more latent features, with the coefficients independently drawn from a standard normal distribution.
In simulations 1 and 3, where each observed feature depends on a single latent feature, 10 observed features per modality are generated from each latent feature, resulting in a feature dimension of 40 in modality 1 and 30 in modality 2. In simulations 2, 4 and 5, where we include observed features with multiple parents, 10 observed features in modality 1 are generated from: (1) each one of Z2, Z3 and Z4 (30 observed features total); (2) each pair of Z2, Z3 and Z4 (30 observed features total); (3) Z2, Z3 and Z4 together; and (4) all four modality 1 latent features. Similarly, a total of 40 observed features in modality 2 are generated from: (1) each one of Z2 and Z5 (20 observed features total); (2) Z2 and Z5 together (10 observed features); and (3) all three modality 2 latent features. This results in a feature dimension of 80 in modality 1 and 40 in modality 2. In simulation 5, we tested the effects of higher weights for the observed features that depend only on one latent feature by using the same set-up as simulation 4 while drawing the coefficients of the observed features from \({\mathcal{N}}(0,16)\). For APOLLO training, we use fully connected decoders with a similar architecture as in the paired-sequencing applications and an MSE loss. Each of the shared and modality-specific latent spaces has two dimensions. Accuracy in disentanglement is assessed by the separation of the binary variables Z1 and Z3 in the three latent spaces, which we quantify using silhouette scores59 and UMAP visualizations (Extended Data Figs. 1–4).

Pre-processing of paired chromatin and protein imaging

The imaging data are obtained from ref. 43, where the patients are annotated with one of the following four phenotypic classes: healthy, meningioma, glioma, or head and neck tumor (Supplementary Fig. 4). Each single-cell image of each protein stain and chromatin stain is min–max scaled such that the minimum pixel value is 0 and the maximum is 1, a standard image normalization step for neural networks43,45.
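A minimal sketch of the per-image min–max normalization described above (the `eps` guard for constant images is our addition):

```python
import numpy as np

def minmax_scale(img, eps=1e-8):
    """Min-max scale a single-cell image so its pixel values span [0, 1]."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    return (img - lo) / max(hi - lo, eps)  # eps avoids division by zero on flat images
```

Each protein-stain and chromatin-stain crop is scaled independently, so the network sees per-cell relative intensities rather than absolute ones.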
For each cell nucleus, an image patch of dimension 128 × 128 pixels centered at the centroid of the nucleus is cropped from the whole image. A total of 29,174 cell images were used in training.

Pre-processing of the HPA images

The same procedure as in a previous study32 is used to pre-process the images. We use all images of U2OS cells (the cell line with the most data in the HPA) that are also used in ref. 32 for studying protein localization variability (2,973 proteins used) and exclude proteins that are stained in fewer than 150 cells, resulting in a total of 141 proteins used for training the model. This subset of proteins has been shown in ref. 32 to have diverse subcellular localizations. Nuclear segmentation obtained using StarDist60 is used for computing the intranuclear proportion of the protein in each image. The standard deviation of the intranuclear proportion of each protein is computed across all images stained for the protein. The 25 proteins with the largest standard deviations are used for interpreting the disentanglement results (Fig. 5).

Phenotype classification using real or reconstructed protein and chromatin images

We use the ResNet-18 model61, a popular neural network model for image classification, in PyTorch for the phenotype classifiers to predict phenotypes from real protein/chromatin images, reconstructed protein/chromatin images, or protein images predicted from chromatin (Fig. 3d). The first layer of the ResNet-18 model is adjusted to take single-channel input images. We use the cross-entropy loss and set the weight of each phenotype class to be inversely proportional to the fraction of cells in that particular phenotype class. The classifiers are trained using the ADAM optimizer with a learning rate of 0.001. The training of each classifier is repeated 36 times with 6 different groups of held-out patients, where each held-out group contains one patient from each phenotype class.
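The class weighting for the phenotype classifiers can be sketched as follows. The paper states only that weights are inversely proportional to the class fractions, so the normalization (weights summing to the number of classes) is our assumption; in PyTorch the resulting vector would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)`.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Class weights inversely proportional to the fraction of cells in each
    phenotype class; normalized to sum to n_classes (our convention)."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    frac = counts / counts.sum()      # fraction of cells per phenotype class
    w = 1.0 / frac                    # inverse-frequency weights
    return w * n_classes / w.sum()    # rescale so the weights sum to n_classes
```

This counteracts class imbalance across the four phenotype classes (healthy, meningioma, glioma, head and neck tumor) so that rare classes contribute comparably to the loss.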
One-sided two-sample t-tests and Wilcoxon signed-rank tests are used for the null hypothesis that the phenotype prediction accuracy using the reconstruction from the full latent space is smaller than or equal to the accuracy using the reconstruction from the shared latent space.

Interpretation of the partially shared latent spaces

In the following, we describe a procedure to analyze the shared latent space as well as the modality-specific latent spaces, which can be applied to any data modality. We compute the PCs of the latent spaces and group the cells based on their positions along the PCs. For both of our applications, we group cells into 11 bins with equal percentile ranges along each PC. In the following description of our approach, we denote the percentile bin of PCi that is centered around 0 as \({\mathrm{PC}}_{i}^{0}\), and the bins ordered from negative to positive PC values are denoted as \({\mathrm{PC}}_{i}^{-5},\cdots,{\mathrm{PC}}_{i}^{-1},{\mathrm{PC}}_{i}^{0},{\mathrm{PC}}_{i}^{1},\cdots,{\mathrm{PC}}_{i}^{5}\). When comparing cells along PCi, we only use the cells at the center of all other PCs, meaning the 15% of cells with the smallest Euclidean distance to the PC origin, calculated using all PCj for j ≠ i and j < 10. The number of PCs considered in the analysis can be adjusted given the variance explained by each PC in a particular dataset. Using cells at the center of all other PCs, except the PC being interpreted, ensures that we are only considering the variation of cells along one PC, to tease apart the variations explained by the different PCs. In the following, we explain this procedure in more detail in the context of the paired-sequencing and paired-imaging datasets, where we also perform subsampling for a more robust identification of the features explained by each PC.

Interpretation of the latent spaces of paired scRNA-seq and scATAC-seq

We test for significant gene expression and peak count changes along each PC.
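The percentile binning and center-cell selection described above can be sketched as follows (a sketch under the stated conventions: 11 equal-percentile bins labeled −5 to 5, and the 15% of cells nearest the origin of all other PCs):

```python
import numpy as np

def percentile_bins(pc_values, n_bins=11):
    """Assign each cell to one of n_bins equal-percentile bins along one PC,
    labeled -5..5 with bin 0 centered on the median."""
    qs = np.percentile(pc_values, np.linspace(0, 100, n_bins + 1))
    idx = np.clip(np.searchsorted(qs, pc_values, side="right") - 1, 0, n_bins - 1)
    return idx - n_bins // 2

def center_cells(pcs, i, frac=0.15):
    """Indices of the `frac` of cells closest (Euclidean) to the origin of all PCs j != i."""
    other = np.delete(pcs, i, axis=1)       # drop the PC being interpreted
    d = np.linalg.norm(other, axis=1)
    k = max(1, int(frac * len(d)))
    return np.argsort(d)[:k]
```

When interpreting PCi, one would first call `center_cells(pcs, i)` and then bin only those cells with `percentile_bins`, so that variation along the other PCs is held near zero.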
The cells are divided into 36 random train–test splits with 20% of cells used for testing in each split, to validate the significant genes and peaks identified along each PC. We consider three groups of cells for each PCi that are all at the center of the other PCs: (1) cells in \({\mathrm{PC}}_{i}^{-5}\) and \({\mathrm{PC}}_{i}^{-4}\); (2) cells in \({\mathrm{PC}}_{i}^{0}\); and (3) cells in \({\mathrm{PC}}_{i}^{4}\) and \({\mathrm{PC}}_{i}^{5}\). Cells in each group are compared with all cells in the other two groups through t-tests using Scanpy’s ‘rank_genes_groups’ function42. The threshold of P values after multiple-testing correction is set to 0.05 for both modalities. The number of times across all train–test splits that a gene or peak is found significant in both the training and testing cells with the same direction of fold change is plotted for different thresholds of log fold change magnitude (Fig. 2c and Supplementary Figs. 2 and 3). The ATAC-seq peaks in the shared and modality-specific latent spaces are summarized in Fig. 2d. For each ATAC-seq peak, we counted the number of PCs in the shared and modality-specific latent spaces, respectively, in which the peak is found to be significant. The genes are then grouped by their GO annotations and the total counts of PCs in the two latent spaces are normalized to 1.

Interpretation of the latent spaces of paired chromatin and protein imaging

Imaging data can be visually inspected by sampling cells at each percentile bin along a particular PC (Fig. 4a). In addition, for a more comprehensive assessment, predefined handcrafted image features43,62 can be used to test for significant feature changes along each PC. For the following analysis, we use a total of 234 chromatin image features and around 30 image features per protein obtained from ref. 43. The patients are divided into 12 different train–test splits, which are summarized in Supplementary Fig. 4, to validate the significant features identified along each PC.
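The split-consistency counting used in both interpretation analyses (a gene, peak or feature counts only in splits where it is significant in both training and testing cells, with the same fold-change direction) can be sketched per feature as follows, assuming the per-split corrected P values and log fold changes have already been computed (for example, via Scanpy's `rank_genes_groups`):

```python
import numpy as np

def consistent_count(train_p, test_p, train_lfc, test_lfc, alpha=0.05, lfc_t=0.0):
    """For one gene/peak/feature: number of train-test splits in which it is
    significant (corrected P < alpha) in BOTH training and testing cells,
    with |log FC| above the threshold and the same sign of fold change."""
    train_p, test_p = np.asarray(train_p), np.asarray(test_p)
    train_lfc, test_lfc = np.asarray(train_lfc), np.asarray(test_lfc)
    hit = (
        (train_p < alpha) & (test_p < alpha)
        & (np.abs(train_lfc) > lfc_t) & (np.abs(test_lfc) > lfc_t)
        & (np.sign(train_lfc) == np.sign(test_lfc))
    )
    return int(hit.sum())
```

Sweeping `lfc_t` reproduces the kind of threshold curves shown in Fig. 2c; the function itself is our schematic restatement of the counting rule, not the authors' code.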
As in the application to paired scRNA-seq and scATAC-seq, we test for significant feature changes among three groups of cells for each PCi that are all at the center of the other PCs: (1) cells in \({\mathrm{PC}}_{i}^{-5}\) and \({\mathrm{PC}}_{i}^{-4}\); (2) cells in \({\mathrm{PC}}_{i}^{0}\); and (3) cells in \({\mathrm{PC}}_{i}^{4}\) and \({\mathrm{PC}}_{i}^{5}\). A feature is considered significant if the P value after multiple-testing correction is less than 0.05 and the absolute value of the log fold change is greater than log2(FCt) in both the training and testing patients, where FCt is set to 1.2 for chromatin features and 1.4 for protein features. The fold-change thresholds are chosen so that most representative morphological features are represented by at least one PC. In addition, the direction of fold change is required to be the same in the training and testing patients. Figure 4a and Supplementary Fig. 19 show the chromatin features that are significant in at least 8 out of the 12 train–test splits (Supplementary Fig. 4). We summarize the morphological features in all top PCs in the shared and modality-specific latent spaces, respectively, in Fig. 4b and Supplementary Figs. 9c, 10c, 11c and 14c. For this, we counted the number of PCs for which each feature is found to be significant in the shared and modality-specific latent spaces and normalized the total counts in the two latent spaces to 1.

To obtain a more concise representation, we group the morphological features of chromatin and each protein by Pearson correlation across all cells and plot one representative feature per group in Fig. 4a,b and Supplementary Figs. 8–14. The feature groupings for each stain are obtained as follows: we build a network with the features as nodes, and an edge is added between each pair of features whose Pearson correlation is greater than 0.7. Features within the same connected component of the graph are considered in the same group.
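The correlation-based feature grouping can be sketched as follows; connected components are found here with a simple depth-first search, and the representative of each group is taken as the highest-degree node. The 0.7 threshold follows the text; tie-breaking among equal-degree nodes is unspecified in the paper, so this sketch simply keeps the first one found.

```python
import numpy as np

def group_features(X, r_threshold=0.7):
    """Group the feature columns of X: connect two features when their Pearson
    correlation exceeds r_threshold, take connected components as groups, and
    pick the highest-degree node as each group's representative."""
    R = np.corrcoef(X, rowvar=False)                       # feature-feature Pearson correlations
    A = (R > r_threshold) & ~np.eye(R.shape[0], dtype=bool)  # adjacency, no self-loops
    n = R.shape[0]
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        comp, stack = [], [s]                              # depth-first search from s
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in np.flatnonzero(A[u]):
                if v not in seen:
                    seen.add(int(v))
                    stack.append(int(v))
        rep = max(comp, key=lambda node: int(A[node].sum()))  # highest-degree node
        groups.append((rep, sorted(comp)))
    return groups
```

Applied to the 234 chromatin features, this procedure yields the 18 chromatin feature groups referred to in the text.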
The representative feature of a group is the node with the highest degree. This results in 18 feature groups for chromatin, 5 groups for γH2AX, 5 groups for lamin, 6 groups for CD8, 4 groups for CD4, 5 groups for CD3 and 4 groups for CD16.

Interpretation of the latent spaces of the HPA images

For each subset of U2OS cells stained with the same protein, we separately perform k-means (k = 2) clustering in the shared and the two modality-specific latent spaces of each model. The clustering in each latent space is repeated five times with different random seeds. For each random seed, a t-test is performed to compare the proportion of intranuclear protein localization in the two clusters, and the absolute value of the difference between the cluster means is computed. A clustering shows a significant difference of intranuclear proportions if P < 0.01 in all five random initializations. The P-value threshold is chosen by applying a Bonferroni correction to the overall probability of type 1 error at 0.05 for a total of 225 tests performed for 25 proteins in each of the three latent spaces of the three separate models.

γH2AX feature prediction using chromatin images

We use the ResNet-18 model61 in PyTorch to train separate regression models that predict the predefined γH2AX features from chromatin images. Similar to the previous section, for each of the three γH2AX features (Fig. 4b), a regression model is trained 12 times with different train–test splits of the patients (Supplementary Fig. 4). The outputs of each regression model on the held-out patients are then compared with the ground truth using Pearson’s correlation (Supplementary Fig. 13c).

Statistics and reproducibility

Publicly available datasets were used in this study. Standard filtering steps were performed, and otherwise, no data were excluded from our analyses.
Randomization and blinding were not applicable in this study.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All datasets used in this study are publicly available. The SHARE-seq data from ref. 5 are available under accession number GEO GSE140203. The CITE-seq data from ref. 25 are available under accession number GEO GSE150599. The PBMCs multiplex imaging dataset from ref. 43 is available from the PSI Public Data Repository at https://doi.org/10.16907/b039dc4e-9366-413c-8f34-92ce9110cc14. The Human Protein Atlas images from ref. 35 are available at https://www.proteinatlas.org. Source data for Figs. 2 and 3b,c are available with this paper. Data in Figs. 3d, 4 and 5 can be reproduced using our deposited code and publicly available data.

Code availability

The code is available via Zenodo at https://doi.org/10.5281/zenodo.17841315 (ref. 63) and via GitHub at https://github.com/uhlerlab/APOLLO/. For ease of use, a modularized version of our method, for each application, is available on GitHub with clear step-by-step instructions regarding data input and use of the method.

References

Tan, W. C. C. et al. Overview of multiplex immunohistochemistry/immunofluorescence techniques in the era of cancer immunotherapy. Cancer Commun. 40, 135–153 (2020).
Lin, J.-R., Fallahi-Sichani, M., Chen, J.-Y. & Sorger, P. K. Cyclic immunofluorescence (CycIF), a highly multiplexed method for single-cell imaging. Curr. Protoc. Chem. Biol. 8, 251–264 (2016).
Kuett, L. et al. Three-dimensional imaging mass cytometry for highly multiplexed molecular and cellular mapping of tissues and the tumor microenvironment. Nat. Cancer 3, 122–133 (2022).
Altemose, N. et al. μDamID: a microfluidic approach for joint imaging and sequencing of protein–DNA interactions in single cells. Cell Syst.
11, 354–366.e9 (2020).
Ma, S. et al. Chromatin potential identified by shared single-cell profiling of RNA and chromatin. Cell 183, 1103–1116.e20 (2020).
Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
Zeng, H. et al. Integrative in situ mapping of single-cell transcriptional states and tissue histopathology in a mouse model of Alzheimer’s disease. Nat. Neurosci. 26, 430–446 (2023).
Xia, C., Fan, J., Emanuel, G., Hao, J. & Zhuang, X. Spatial transcriptome profiling by MERFISH reveals subcellular RNA compartmentalization and cell cycle-dependent gene expression. Proc. Natl Acad. Sci. USA 116, 19490–19499 (2019).
Eng, C.-H. L. et al. Transcriptome-scale super-resolved imaging in tissues by RNA seqFISH+. Nature 568, 235–239 (2019).
Laber, S. et al. Discovering cellular programs of intrinsic and extrinsic drivers of metabolic traits using LipocyteProfiler. Cell Genom. 3, 100346 (2023).
Gower, J. C. Generalized procrustes analysis. Psychometrika 40, 33–51 (1975).
Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).
Amodio, M. & Krishnaswamy, S. MAGAN: aligning biological manifolds. In Proc. 35th International Conference on Machine Learning Vol. 80 (eds Dy, J. & Krause, A.) 215–223 (PMLR, 2018).
Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV) 2242–2251 (IEEE, 2017).
Klein, D. et al. Mapping cells through time and space with moscot.
Nature 638, 1065–1075 (2025).
Hao, Y. et al. Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29 (2021).
Jain, M. S. et al. MultiMAP: dimensionality reduction and integration of multimodal data. Genome Biol. 22, 346 (2021).
Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016); https://www.deeplearningbook.org/
Svensson, V., Gayoso, A., Yosef, N. & Pachter, L. Interpretable factor models of single-cell RNA-seq via variational autoencoders. Bioinformatics 36, 3418–3421 (2020).
Brbić, M. et al. Annotation of spatially resolved single-cell data with STELLAR. Nat. Methods 19, 1411–1418 (2022).
Eraslan, G., Simon, L. M., Mircea, M., Mueller, N. S. & Theis, F. J. Single-cell RNA-seq denoising using a deep count autoencoder. Nat. Commun. 10, 390 (2019).
Zhang, X., Wang, X., Shivashankar, G. V. & Uhler, C. Graph-based autoencoder integrates spatial transcriptomics with chromatin images and identifies joint biomarkers for Alzheimer’s disease. Nat. Commun. 13, 7480 (2022).
Yang, K. D. et al. Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nat. Commun. 12, 31 (2021).
Gayoso, A. et al. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nat. Methods 18, 272–282 (2021).
Ashuach, T. et al. MultiVI: deep generative model for the integration of multimodal data. Nat. Methods 20, 1222–1231 (2023).
Gong, B., Zhou, Y. & Purdom, E. Cobolt: integrative analysis of multimodal single-cell sequencing data. Genome Biol. 22, 351 (2021).
Zhang, Z., Yang, C. & Zhang, X. scDART: integrating unmatched scRNA-seq and scATAC-seq data and learning cross-modality relationship simultaneously. Genome Biol. 23, 139 (2022).
Boehm, K.
M., Khosravi, P., Vanguri, R., Gao, J. & Shah, S. P. Harnessing multimodal data integration to advance precision oncology. Nat. Rev. Cancer 22, 114–126 (2022).
Cao, Z.-J. & Gao, G. Multi-omics single-cell data integration and regulatory inference with graph-linked embedding. Nat. Biotechnol. 40, 1458–1466 (2022).
Zhang, X. et al. Unsupervised representation learning of chromatin images identifies changes in cell state and tissue organization in DCIS. Nat. Commun. 15, 6112 (2024).
Zhang, X., Tseo, Y., Bai, Y., Chen, F. & Uhler, C. Prediction of protein subcellular localization in single cells. Nat. Methods 22, 1265–1275 (2025).
Sudlow, C. et al. UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
All of Us Research Program Investigators. The ‘All of Us’ Research Program. N. Engl. J. Med. 381, 668–676 (2019).
Thul, P. J. et al. A subcellular map of the human proteome. Science 356, eaal3321 (2017).
Sturma, N., Squires, C., Drton, M. & Uhler, C. Unpaired multi-domain causal representation learning. In 37th Conference on Neural Information Processing Systems (NeurIPS) (eds Oh, A. et al.) 34465–34492 (Curran Associates, 2023).
Gabbay, A. & Hoshen, Y. Demystifying inter-class disentanglement. In International Conference on Learning Representations (ICLR) (2020).
Giotti, B. et al. Assembly of a parts list of the human mitotic cell cycle machinery. J. Mol. Cell Biol. 11, 703–718 (2019).
Sansam, C. L. et al. A vertebrate gene, ticrr, is an essential checkpoint and replication regulator. Genes Dev. 24, 183–194 (2010).
Chen, F., Castranova, V. & Shi, X. New insights into the role of nuclear factor-kappaB in cell growth regulation. Am. J. Pathol. 159, 387–397 (2001).
Clermont, P.-L. et al.
Identification of the epigenetic reader CBX2 as a potential drug target in advanced prostate cancer. Clin. Epigenetics 8, 16 (2016).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Challa, K. et al. Imaging and AI based chromatin biomarkers for diagnosis and therapy evaluation from liquid biopsies. npj Precis. Oncol. 7, 135 (2023).
Cho, N. H. et al. OpenCell: endogenous tagging for the cartography of human cellular organization. Science 375, eabi6983 (2022).
Lu, A. X., Kraus, O. Z., Cooper, S. & Moses, A. M. Learning unsupervised feature representations for single cell microscopy images with paired cell inpainting. PLoS Comput. Biol. 15, e1007348 (2019).
Ounkomol, C., Seshamani, S., Maleckar, M. M., Collman, F. & Johnson, G. R. Label-free prediction of three-dimensional fluorescence images from transmitted-light microscopy. Nat. Methods 15, 917–920 (2018).
Lee, J.-H., Kim, E. W., Croteau, D. L. & Bohr, V. A. Heterochromatin: an epigenetic point of view in aging. Exp. Mol. Med. 52, 1466–1474 (2020).
Viana, M. P. et al. Integrated intracellular organization and its variations in human iPS cells. Nature 613, 345–354 (2023).
Handfield, L.-F., Chong, Y. T., Simmons, J., Andrews, B. J. & Moses, A. M. Unsupervised clustering of subcellular protein expression patterns in high-throughput microscopy images reveals protein complexes and functional relationships between proteins. PLoS Comput. Biol. 9, e1003085 (2013).
Kobayashi, H., Cheveralls, K. C., Leonetti, M. D. & Royer, L. A. Self-supervised deep learning encodes high-resolution features of protein subcellular localization. Nat. Methods 19, 995–1003 (2022).
Donovan-Maiye, R. M. et al.
A deep generative model of 3D single-cell organization. PLoS Comput. Biol. 18, e1009155 (2022).
Li, J. et al. DNA damage binding protein component DDB1 participates in nucleotide excision repair through DDB2 DNA-binding and cullin 4A ubiquitin ligase activity. Cancer Res. 66, 8590–8597 (2006).
Guderian, G. et al. RioK1, a new interactor of protein arginine methyltransferase 5 (PRMT5), competes with pICln for binding and modulates PRMT5 complex composition and substrate specificity. J. Biol. Chem. 286, 1976–1986 (2011).
Chari, A. et al. An assembly chaperone collaborates with the SMN complex to generate spliceosomal SnRNPs. Cell 135, 497–509 (2008).
Krapivinsky, G., Pu, W., Wickman, K., Krapivinsky, L. & Clapham, D. E. pICln binds to a mammalian homolog of a yeast protein involved in regulation of cell morphology. J. Biol. Chem. 273, 10811–10814 (1998).
Badertscher, L. et al. Genome-wide RNAi screening identifies protein modules required for 40S subunit synthesis in human cells. Cell Rep. 13, 2879–2891 (2015).
Radhakrishnan, A. et al. Cross-modal autoencoder framework learns holistic representations of cardiovascular state. Nat. Commun. 14, 2436 (2023).
Lopez, R., Regier, J., Cole, M. B., Jordan, M. I. & Yosef, N. Deep generative modeling for single-cell transcriptomics. Nat. Methods 15, 1053–1058 (2018).
Rousseeuw, P. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Weigert, M., Schmidt, U., Haase, R., Sugawara, K. & Myers, G. Star-convex polyhedra for 3D object detection and segmentation in microscopy. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV) 3655–3662 (IEEE, 2020); https://doi.org/10.1109/WACV45572.2020.9093435
He, K., Zhang, X., Ren, S. & Sun, J.
Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 770–778 (IEEE, 2016).
Stirling, D. R. et al. CellProfiler 4: improvements in speed, utility and usability. BMC Bioinformatics 22, 433 (2021).
Zhang, X. uhlerlab/APOLLO: manuscript partially shared multi-modal embedding learns holistic representation of cell state. Zenodo https://doi.org/10.5281/zenodo.17841315 (2025).

Acknowledgements

We thank all members of the Uhler and Shivashankar labs for discussions and E. Forte for editing the manuscript. X.Z. was supported by the Eric and Wendy Schmidt Center at the Broad Institute. G.V.S. was partially supported by the Swiss National Science Foundation (310030 208046). C.U. was partially supported by NCCIH/NIH (1DP2AT012345), ONR (N00014-22-1-2116), AstraZeneca, the MIT-IBM Watson AI Lab, MIT J-Clinic for Machine Learning and Health, and a Simons Investigator Award.

Author information

Authors and Affiliations

Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA: Xinyi Zhang & Caroline Uhler
Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, MA, USA: Xinyi Zhang & Caroline Uhler
Department of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland: G. V. Shivashankar
Paul Scherrer Institute, Villigen, Switzerland: G. V. Shivashankar

Contributions

X.Z. designed the research, developed and implemented the algorithms, performed model and data analysis, and wrote the paper. G.V.S. and C.U. designed and supervised the research, and wrote the paper.

Corresponding authors

Correspondence to G. V.
Shivashankar or Caroline Uhler.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Michelle Badri, in collaboration with the Nature Computational Science team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Simulation 1 - All shared latent features are independent of all modality-specific latent features; each observed feature depends on a single latent feature. (a) Left: Latent causal graph of the two modalities. Right: Probability distributions from which the latent features are sampled. (b) Histograms of the 2000 samples of each latent feature (Z1, Z2, Z3, Z4, Z5). (c) Training curves of the APOLLO model. Left: Reconstruction losses of the two modalities measured by mean squared error (MSE). Right: ℓ2 norms of the three latent spaces. (d) UMAPs of the shared latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the shared latent features and Z1 or Z3 as labels. (e) UMAPs of the Modality 1 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 1 specific latent features and Z1 or Z3 as labels. (f) UMAPs of the Modality 2 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 2 specific latent features and Z1 or Z3 as labels.

Extended Data Fig. 2 Simulation 2 - All shared latent features are independent of all modality-specific latent features; each observed feature depends on multiple latent features. (a) Left: Latent causal graph of the two modalities.
Right: Probability distributions from which the latent features are sampled. (b) Histograms of the 2000 samples of each latent feature (Z1, Z2, Z3, Z4, Z5). (c) Training curves of the APOLLO model. Left: Reconstruction losses of the two modalities measured by mean squared error (MSE). Right: ℓ2 norms of the three latent spaces. (d) UMAPs of the shared latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the shared latent features and Z1 or Z3 as labels. (e) UMAPs of the Modality 1 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 1 specific latent features and Z1 or Z3 as labels. (f) UMAPs of the Modality 2 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 2 specific latent features and Z1 or Z3 as labels.Extended Data Fig. 3 Simulation 3 - Some modality-specific latent features are children of shared latent features; each observed feature depends on a single latent feature.(a) Left: Latent causal graph of the two modalities. Right: Probability distributions from which the latent features are sampled. (b) Histograms of the 2000 samples of each latent feature (Z1, Z2, Z3, Z4, Z5). (c) Training curves of the APOLLO model. Left: Reconstruction losses of the two modalities measured by mean squared error (MSE). Right: ℓ2 norms of the three latent spaces. (d) UMAPs of the shared latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the shared latent features and Z1 or Z3 as labels. (e) UMAPs of the Modality 1 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 1 specific latent features and Z1 or Z3 as labels. (f) UMAPs of the Modality 2 specific latent space colored by values of Z1 (left) or Z3 (right). 
Silhouette scores are computed using the Modality 2 specific latent features and Z1 or Z3 as labels.Extended Data Fig. 4 Simulations 4 and 5 - Some modality-specific latent features are children of shared latent features; each observed feature depends on multiple latent features.(a) Left: Latent causal graph of the two modalities. Right: Probability distributions from which the latent features are sampled. (b) Histograms of the 2000 samples of each latent feature (Z1, Z2, Z3, Z4, Z5). (c) Training curves of the APOLLO model trained on simulation 4 data. Left: Reconstruction losses of the two modalities measured by mean squared error (MSE). Right: ℓ2 norms of the three latent spaces. (d) UMAPs of the shared latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the shared latent features and Z1 or Z3 as labels. (e) UMAPs of the Modality 1 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 1 specific latent features and Z1 or Z3 as labels. (f) UMAPs of the Modality 2 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 2 specific latent features and Z1 or Z3 as labels. (g) Training curves of the APOLLO model trained on simulation 5 data, which has higher weights for the observed features that depend on a single latent feature by using the same setup as in simulation 4 (see panel a) while drawing coefficients of the observed features from N (0, 16). Left: Reconstruction losses of the two modalities measured by mean squared error (MSE). Right: ℓ2 norms of the three latent spaces. (h) UMAPs of the shared latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the shared latent features and Z1 or Z3 as labels. (i) UMAPs of the Modality 1 specific latent space colored by values of Z1 (left) or Z3 (right). 
Silhouette scores are computed using the Modality 1 specific latent features and Z1 or Z3 as labels. (j) UMAPs of the Modality 2 specific latent space colored by values of Z1 (left) or Z3 (right). Silhouette scores are computed using the Modality 2 specific latent features and Z1 or Z3 as labels.

Extended Data Fig. 5 APOLLO architecture for paired scATAC-seq and scRNA-seq data. (a) APOLLO uses a two-step training procedure. In step 1, the shared and modality-specific latent spaces as well as the modality-specific decoders are trained so that the decoders can reconstruct the scATAC-seq and scRNA-seq data from the latent spaces. For each modality, there is one decoder that reconstructs the modality using only the shared latent space and another decoder that reconstructs the modality using the full latent space, that is, the shared latent space and the modality-specific latent space. In step 2, modality-specific encoders are trained to enable inferring the latent space embedding for cells not used in training the model and thus enable cross-modality prediction. The encoders are trained to map the input features, for example scRNA-seq and scATAC-seq, to the latent space embeddings obtained in step 1. (b) The decoders of scRNA-seq and scATAC-seq have the same architectures with four fully connected layers, except that the output layer decodes to different feature dimensions for the two modalities. The input feature dimension is 70 for the full latent space decoder and 50 for the shared latent space decoder. Each decoder has three hidden layers, each of which has dimension 1024. All hidden layers are linear layers with a dropout rate of 0.01 and are followed by LeakyReLU activation and a batch normalization layer. The output layer is a linear layer with a dropout rate of 0.01 and sigmoid activation. (c) The encoders of scRNA-seq and scATAC-seq have the same architectures, except for the input feature dimensions.
Each encoder starts with two linear layers with a dropout rate of 0.01, each of which has 1024 dimensions and is followed by LeakyReLU activation and batch normalization, a standard setup for autoencoders23,58. After the first two hidden layers, two separate linear layers with LeakyReLU activation are used to obtain separate hidden layers for the shared and modality-specific latent spaces. A linear layer is applied to each hidden layer to obtain the shared or modality-specific latent space.

Extended Data Fig. 6 Application of APOLLO to paired scRNA-seq and protein abundance data.

(a) We applied the same encoder and decoder architecture as well as the default two-step training procedure to the paired scRNA-seq and protein abundance data (see Methods for details). (b) UMAP visualization of the scRNA-seq data using the Scanpy package42. (c) UMAP visualization of the protein abundance data using the Scanpy package42. (d) UMAPs obtained from the shared latent space of standard autoencoders, encoded using scRNA-seq data, separately colored by cell types and experimental batches.

Extended Data Fig. 7 Comparison of clustering results using APOLLO’s latent spaces, the protein abundance data, and a standard variational autoencoder.

(a) Leiden clustering of APOLLO’s shared latent space. The resolution of the Leiden clustering was adjusted to obtain the minimum number of clusters such that the major cell types are separated into different clusters. (b) Leiden clustering of the cellular surface protein abundance data. The resolution of the Leiden clustering was adjusted to obtain the same number of clusters as in (a). (c) Heatmap showing the number of cells in each of the two batches, for each cluster of the shared latent space. (d) Heatmap showing the number of cells in each of the two batches, for each cluster of the cellular protein abundance data. (e) Heatmap showing the number of cells in each cell type, for each cluster of the shared latent space.
(f) Heatmap showing the number of cells in each cell type, for each cluster of the cellular protein abundance data. (g) Leiden clustering of APOLLO’s RNA-specific latent space. (h) Heatmap showing the number of cells in each of the two batches, for each cluster of the RNA-specific latent space. (i) Heatmap showing the number of cells in each of the two batches, for each cluster of the full latent space of scRNA-seq learned using APOLLO. (j) Heatmap showing the number of cells in each of the two batches, for each cluster of the latent space of scRNA-seq learned using a standard variational autoencoder (Methods). (k) Heatmap showing the number of cells in each cell type, for each cluster of the full latent space of scRNA-seq learned using APOLLO. (l) Heatmap showing the number of cells in each cell type, for each cluster of the full latent space of scRNA-seq learned using a standard variational autoencoder (Methods).

Extended Data Fig. 8 APOLLO architecture for paired chromatin and protein images.

(a) APOLLO can be applied to multiplexed imaging data using conditional decoders and encoders. In step 1, the shared and modality-specific latent spaces as well as the modality-specific decoders are trained to minimize the reconstruction error of chromatin and protein images from the latent spaces. In addition to the latent space embeddings, the decoders also take in a trainable vector representing each protein ID. In step 2, modality-specific encoders are trained to enable inferring the latent space embedding for cells not used in training the model and thus enable cross-modality prediction. The encoders are trained to map protein and chromatin images together with the trainable protein IDs to the latent space embeddings obtained in step 1. (b) The decoders of chromatin and protein images have the same architecture: a fully connected layer followed by five convolutional layers.
The input to the decoder is the full or shared latent space with added noise, concatenated with a learnable protein ID. The latent space embedding, after adding noise and being concatenated with the protein ID, is passed through a linear layer with ReLU activation and reshaped to 4 × 4 × 96 dimensions for the subsequent convolutions. This is followed by five convolutional layers with a kernel size of 4 and stride of 2. The first four convolutional layers are followed by batch normalization and LeakyReLU activation. The last convolutional layer is followed by sigmoid activation to scale the output image from 0 to 1. (c) The encoder of each modality starts with five convolutional layers with LeakyReLU activation, a standard setup for autoencoders23,37. The output of the last convolutional layer is divided into two sets of channels that are used to derive the shared and modality-specific latent spaces, respectively. 80 of the 96 channels of the last hidden layer are flattened, concatenated with the protein ID, and passed through a linear layer to obtain the shared latent space. Similarly, the modality-specific latent space is obtained from the remaining 16 channels.

Extended Data Fig. 9 Examples of protein images predicted from chromatin images of held-out patients obtained from our full APOLLO model are compared to the outputs of different alternative models.

‘Autoencoder’ has the same encoder and decoder structure as our full APOLLO model but is trained in a single step without directly optimizing the latent spaces (see Methods for details). ‘Without modality-specific latent’ represents our APOLLO model with only the shared latent space. ‘Inpainting’ corresponds to the model developed in ref. 45.

Extended Data Fig. 10 Architecture details of model ablation tests and comparison of reconstruction loss.

(a) The schematic shows the architecture of our model with a partially overlapping latent space trained as a standard autoencoder in a single step.
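The convolutional image decoder of Extended Data Fig. 8b could be sketched as below. This is a minimal PyTorch illustration, not the authors' implementation: the latent dimension, protein-ID embedding size, intermediate channel counts and noise scale are my own assumptions, and the upsampling layers are interpreted as transposed convolutions, since the decoder must grow a 4 × 4 feature map into an image. Only the 4 × 4 × 96 reshape, the five layers with kernel 4 and stride 2, the batch norm + LeakyReLU on the first four, and the final sigmoid come from the caption.

```python
import torch
import torch.nn as nn

class ConvDecoder(nn.Module):
    """Sketch: a noisy latent embedding is concatenated with a learnable
    protein-ID vector, mapped by a linear layer to a 4x4x96 feature map,
    then upsampled by five transposed convolutions (kernel 4, stride 2);
    the first four use batch norm + LeakyReLU, the last a sigmoid."""

    def __init__(self, latent_dim=70, id_dim=16, n_proteins=5):
        super().__init__()
        self.protein_id = nn.Embedding(n_proteins, id_dim)  # trainable IDs
        self.fc = nn.Linear(latent_dim + id_dim, 4 * 4 * 96)
        chans = [96, 64, 48, 32, 16]           # assumed channel counts
        ups = []
        for c_in, c_out in zip(chans, chans[1:]):
            ups += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                    nn.BatchNorm2d(c_out), nn.LeakyReLU()]
        ups += [nn.ConvTranspose2d(chans[-1], 1, 4, stride=2, padding=1),
                nn.Sigmoid()]                  # scale output to [0, 1]
        self.up = nn.Sequential(*ups)

    def forward(self, z, protein_idx, noise_std=0.1):
        z = z + noise_std * torch.randn_like(z)             # add noise
        h = torch.cat([z, self.protein_id(protein_idx)], dim=1)
        h = torch.relu(self.fc(h)).view(-1, 96, 4, 4)       # reshape
        return self.up(h)

dec = ConvDecoder()
img = dec(torch.randn(4, 70), torch.tensor([0, 1, 2, 3]))
print(img.shape)   # torch.Size([4, 1, 128, 128])
```

Each transposed convolution with kernel 4, stride 2 and padding 1 doubles the spatial resolution, so five of them take the 4 × 4 map to a 128 × 128 image.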
The encoders and decoders have the same architecture as our APOLLO model, but the encoder and decoder are trained together in a single step without directly optimizing the latent spaces. A mean squared error loss is applied to the shared latent space encoded from each modality. (b) The schematic shows the architecture of our model without the modality-specific latent spaces. The model is trained with the same two-step training procedure as used for the full APOLLO model. There are only two decoders, which decode the two modalities from the shared latent space and are trained in step 1. The encoders output only the shared latent space, which has the same dimension as the combined dimension of the shared and the modality-specific latent spaces in the full APOLLO model.

Supplementary information
Supplementary Information: Supplementary Figs. 1–15 and References.
Reporting summary
Peer Review File

Source data
Source Data Fig. 2: Statistical source data.
Source Data Fig. 3: Statistical source data.

Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.