Unifying fragmented perspectives with additive deep learning for high-dimensional models from partial faceted datasets

Introduction

As told through the centuries, the "Blind Men and the Elephant" is a fable of blind individuals attempting to comprehend the appearance and nature of an elephant by independent exploration (Fig. 1a). Each individual has limited information and understanding, acquired through independent experience. However, by sharing, comparing, and synthesizing their experiences, the group can gain a more comprehensive understanding of the elephant as a whole. Similarly, biological systems are complex networks with thousands of interacting molecular components [1,2,3]. Biological function and dysfunction are often emergent properties of these complex networks. It can be challenging to quantify the contributions of all variables to the biological function simultaneously, making it difficult to obtain a full understanding of the system. More often, a subset of variables is measured and quantified, yielding a projection (or facet) of the relationship between the biological output and the underlying variables. Therefore, just as in the "Blind Men and the Elephant" example, it is desirable to reconstruct the full relationship between the biological output and all the underlying variables from many sets of faceted data.

Fig. 1: Overview of the problem and proposed machine learning framework for biological network reconstruction. Schematic illustration of the problem and the proposed machine learning framework. (a) Blind men and the elephant problem. Each observer measures a facet of the problem, and therefore receives a biased view. Combining data from all observers will generate a full model. (b) A biological function is a mapping from cell components to an observable, or output. (c) Biological network model reconstruction from mapping of data distribution functions. The original data are the joint probability distributions of partial inputs and the output. We dissect the joint distributions into several consecutive conditional distributions and directly fit the conditional distributions to obtain model parameters. (d) Data structure in the faceted learning procedure. l faceted data sets are collected, each containing only partial dimensions of the input x and the output y. Each data set contains M data points, with N = M × l total data points.

With advancements in machine learning (ML) and artificial intelligence (AI), there are now many methods that can predict outcomes from complex high-dimensional data [4,5,6,7]. However, in a typical biological experiment, the full space of underlying variables is rarely measured. Here we present a machine learning-based method to reconstruct the complete biological network from faceted data sets. The method allows for incremental improvement of the learned network and is a systematic way of obtaining the global predictive model from multiple independent measurements and observations. When new hidden variables are discovered, new measurements can be added to the existing model to improve the model and its predictions.

The basic biological unit is a single cell. Each cell is characterized by its proteome, genetic material, and other components such as lipids, small molecules, ions, and so on. Therefore, the underlying variable that describes the single cell, x = (x_1, x_2, x_3, …), is a high-dimensional vector, where x_i is the quantity of the i-th component. The minimal set of variables that defines x is the proteome composition, or the quantities of the expressed proteins in the cell, since, given the same genetic sequence, the proteome composition should determine the small-molecule, lipid, and ionic contents of the cell, as well as the post-translationally modified forms of proteins. However, proteome composition itself probably does not fully specify biological function, since environmental chemical [8,9], mechanical [10,11], and electrical variables [12] also contribute.
Therefore, x will minimally contain the expression levels of all genes and the environmental variables.

If x is defined as the expression levels of genes, then the distribution of x, ρ(x), is often referred to as a 'gene network' [13,14]. In the context of gene regulatory networks, the discussions in our paper also apply (see Example 2: P53 network).

At the simplest level, a particular biological function/observable, F, is a function of the underlying variable: F(x). For example, F could be the cell size, the cell cycle length, the growth rate, or the cell migration speed, which should be measured at the single-cell level. This is because much recent work has demonstrated that there is additional complexity and phenotypic variation, even for isogenic cells [15,16]. The reasons for this are complex and could encompass epigenetic mechanisms and cellular memory [17,18]. Therefore, F(x) is a complex mapping from biological variables to biological function. It should be noted that recent advancements in AI and machine learning have in fact solved the high-dimensional regression problem. If data for F(x) are available, neural networks or other types of methods can now map biological variables to biological function. The problem, therefore, is not the lack of methods to find F(x). The problem is the lack of multi-dimensional methods that obtain data for all relevant x and simultaneously measure F at the single-cell level.

Thus, the function F(x) is difficult to learn in an unbiased way, and there are no systematic efforts to map F for major biological problems of interest. In most experiments, such as flow cytometry or Western blot experiments, only a few of the x_i out of thousands are quantified in a meaningful way. Moreover, it is typical that each researcher measures a different subset of the x_i, and therefore studies a particular 'facet' of the problem, precisely the situation described in the "blind men" story. The global picture is generally missing.
There have been extensive studies in the ML field on system reconstruction from partial data sets based on eigenvectors of the system [19,20]. However, it is desirable to have a method that can combine data from all individual facets and progressively arrive at a global picture.

There are now an increasing number of experimental methods to quantify cell components (e.g., RNAseq [21,22], protein secretome [23], and morphological data [24,25]) at the single-cell level. For example, single-cell RNAseq quantifies RNA at the genome-wide level. However, mRNA levels do not easily translate to proteomic composition [26,27,28], and no biological observable, F, is typically measured at the single-cell level during sequencing. On the other hand, methods such as flow cytometry, Western blots, and immunohistochemistry allow one to examine a handful of proteins at a quantitative level, but it is generally difficult to examine biological function or observables at the single-cell level with them. There are now highly accurate methods to measure cell size, cell contractility, and cell cycle at the single-cell level. It remains to be seen whether single-cell methods for quantifying cell components can be combined with single-cell measurements of function to produce truly predictive models of biological function.

In this paper, we first describe the general idea of faceted learning based on multiple data subsets of the same problem. We then illustrate the method using machine learning models based on polynomial regression and neural networks, respectively. Two concrete examples are discussed: a mechanical spring network system and a small biological network including the cellular senescence marker p53. The full system is successfully reconstructed from faceted data for both problems. Interestingly, we find that the mechanism regulating the p53 level is the same for cells in different growth conditions. The only difference is the underlying proteome distribution of network components.
Our method separates the regulatory network that governs the p53 level from the intrinsic distribution of the input variables. The polynomial regression model also allows us to explore mechanistic aspects of the network, such as whether components of the network act synergistically or antagonistically. We also discuss the additive property of the faceted approach, where the model accuracy increases with an increasing number of simultaneously measured variables (the dimension of the subsets). Our approach provides a novel method that uses conditional distributions to integrate different pieces of information and reconstruct complex high-dimensional biological systems.

Reconstructing the systems model from facets of probability distributions: statement of the problem

We consider a system described by the function y = F(x; θ), where θ is a set of model parameters. For simplicity, we assume that y is a one-dimensional output and x is a d-dimensional input vector (e.g., for the system of a cell, cell volume is a function of protein content and kinase activity) (Fig. 1b). In experiments, we assume only p(p 1, which provides information about the correlation among different input variables (x). It is also possible to perform multiple measurements to obtain different subsets of variables (x, y). Note that data-driven methods of manifold learning using principal component analysis (PCA) for learning models of (x, y) have been investigated extensively [29,30]. Here we take these available methods as given.

Experimental measurements will generate probability distributions of (x, y). In the biological context, each instance of (x, y) arises from a single cell, and many cells are typically measured in a single experiment.
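
The faceted data structure described above can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions, not the procedure used in this paper: the output function `toy_output`, the helper `make_faceted_datasets`, the choice of independent standard-normal inputs, and the decision to generate one data set for every p-variable subset are all hypothetical choices made for illustration. Each simulated "cell" has a full d-dimensional x, but every data set records only p of the inputs together with y.

```python
import itertools
import math
import random


def toy_output(x):
    """Hypothetical F(x): linear terms plus one pairwise interaction."""
    return 1.0 + 0.5 * x[0] - 0.3 * x[1] + 0.8 * x[0] * x[2]


def make_faceted_datasets(d, p, M, F, seed=0):
    """Return one data set per p-variable subset of the d inputs.

    Each data set holds M simulated cells. Only the p chosen inputs and
    the output y are recorded; the other d - p inputs stay hidden.
    """
    rng = random.Random(seed)
    datasets = {}
    for subset in itertools.combinations(range(d), p):
        rows = []
        for _ in range(M):
            # Assumption: inputs are independent standard normals.
            x = [rng.gauss(0.0, 1.0) for _ in range(d)]
            y = F(x)
            rows.append(([x[i] for i in subset], y))
        datasets[subset] = rows
    return datasets


d, p, M = 4, 2, 500
facets = make_faceted_datasets(d, p, M, toy_output)

n_subsets = math.comb(d, p)  # number of distinct p-variable subsets
l = len(facets)              # number of faceted data sets generated
N = M * l                    # total number of data points
# Sample mean of the output over all simulated cells.
mean_y = sum(y for rows in facets.values() for _, y in rows) / N
print(n_subsets, l, N)
```

Here every data set sees the same underlying system through a different pair of inputs, so no single data set contains enough information to recover the interaction between inputs that were never measured together with the others; that is the gap a faceted learning procedure has to close.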
Given the distribution of x across cells, ρ(x), the mean biological output is

$$\langle F\rangle =\int d\boldsymbol{x}\,F(\boldsymbol{x})\,\rho(\boldsymbol{x}) \tag{1}$$

We assume that it is possible to eventually measure the d × d covariance matrix of x and the mean value of the input variable x, denoted by Σ and μ, respectively. We denote all the d input variables as a universal set U = {x_1, x_2, ..., x_d}. Assume that each measurement includes p input variables, and we denote the simultaneously measured variables as S_i, which is a subset of U. There are in total $n_{s}=\binom{d}{p}$ different subsets (i