Large-scale causal discovery using interventional data sheds light on gene network structure in K562 cells

Introduction

Recent developments in the understanding of complex-trait genetics have led to calls for increased study of directed biological networks, which are crucial for dissecting the regulatory architecture of complex traits and finding pathways that can be targeted for treatment1,2,3. However, interrogating causal structure (i.e., causal discovery) is notoriously difficult owing to factors such as unmeasured confounding, reverse causation, and the presence of cycles4. Even assuming that all relevant variables are measured and that the underlying graph is acyclic, the exact network is not identifiable from observational data alone5, and identifying the set of observationally equivalent graphs is computationally intractable6.

Interventional data improve the identifiability of causal models7 and can eliminate biases due to unobserved confounding8. Advances in the scope and scale of gene-targeting CRISPR experiments9 have therefore created an ideal setting for the development of methods for large-scale causal graph inference. Recent advances include dotears10, which extends the convex-optimization approach notears11 to interventional data, and igsp12, which learns an equivalence class of graphs using a permutation-based approach. However, existing methods may assume a strong intervention model10, return an unweighted graph12, or be intractable for large graphs10,11. They also generally assume that the graph is acyclic and unconfounded10,11,12.

Here, we introduce an approach to causal discovery from interventional data using a two-stage procedure (Fig. 1a). We treat guide RNAs as instrumental variables and thus leverage standard procedures for estimating the marginal average causal effect (ACE) of every feature on every other feature, yielding the matrix \(\hat{R}\).
We show that, given this, the causal graph G can be obtained via the relationship \(G=I-R^{-1}D[1/R^{-1}]\), where / indicates element-wise division and the operator D[A] sets the off-diagonal entries of a matrix to 0 (ref. 13). Importantly, while G can be determined exactly given the true matrix R, we only have access to a noisy estimate \(\hat{R}\), which may not be well-conditioned or even invertible. Our primary contribution is a procedure for estimating a sparse approximate inverse of the ACE matrix by solving the constrained optimization problem

$$\min_{\{U,V:\,VU=I\}}\ \frac{1}{2}\|W\circ(\hat{R}-U)\|_{F}^{2}+\lambda\sum_{i\ne j}|V_{ij}|.\qquad(1)$$

This approximate inverse is then used to estimate G via \(\hat{G}=I-VD[1/V]\). We call this approach inverse sparse regression (inspre). Here, U approximates \(\hat{R}\) while its left inverse V has sparsity controlled by the L1 penalty parameter λ. The weight matrix W allows us to place less emphasis on entries of \(\hat{R}\) with high standard error. For complete details, see “Methods”.

Fig. 1: Algorithm and simulation results.
a Interventions (green squares) are used to estimate marginal average causal effects between all pairs of features (orange rectangles and blue diamonds). These pairwise estimates are used to infer a sparse network on the features. b Structural Hamming distance (SHD) versus time for inspre compared to commonly used methods in the setting of high-density graphs with confounding, cycles, and large edge weights (lower is better). inspre_DAG represents inspre with an approximate acyclicity constraint; see “Methods” for details. c Averaged over density, intervention strength, graph type, and edge effect size, inspre is the most performant approach by several metrics, even in acyclic graphs without confounding. MAE: mean absolute error. S: seconds.
Source data are provided in Supplementary Data 1.

Working with the bi-directional ACE matrix R rather than the full data matrix provides several advantages. First, interventional data can be used to estimate effects that are robust to unobserved confounding. Second, leveraging bi-directed ACE estimates that include both the effect of feature i on j and of j on i allows us to accommodate graphs with cycles. Finally, the feature-by-feature ACE matrix is typically much smaller than the original samples-by-features data matrix, providing a dramatic speedup that enables inference in settings with hundreds or even thousands of features.

Results

Simulation studies
We evaluated the performance of inspre under a host of different simulation settings while comparing against other commonly used methods for causal discovery from observational (LinGAM14, notears11, golem15) and interventional (GIES16, igsp12, dotears10) data. We tested 50-node cyclic and acyclic graphs with 100 interventional samples per node and 5000 total control samples. We simulated graphs with and without confounding while varying the graph type (Erdős-Rényi random17 vs scale-free18), density (high vs low), edge weights (large vs small), and intervention strength (strong vs weak). We conducted each of these 64 experiments 10 times and compared methods by structural Hamming distance (SHD), precision, recall, F1-score, mean absolute error, and runtime (Fig. 1b, c and Supplementary Data 1).

inspre outperforms the other tested methods in cyclic graphs with confounding by a large margin, even when interventions are weak (Fig. 1b). Perhaps more interestingly, inspre still obtains the highest precision, lowest SHD, and lowest MAE in acyclic graphs without confounding when averaged over graph type, density, edge weight, and intervention strength (Fig. 1c). On top of this, inspre takes just seconds to run, while comparable optimization-based approaches can take up to 10 h.
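The recovery step underlying inspre can be illustrated on a toy example. The sketch below (our illustration in numpy with a hypothetical 4-gene direct-effect matrix, not the authors' implementation) builds the total-effect matrix R for a linear model containing a cycle and recovers the direct effects via \(G=I-R^{-1}D[1/R^{-1}]\):

```python
import numpy as np

# Hypothetical 4-gene direct-effect matrix G (G[i, j] = direct effect
# of gene i on gene j, zero diagonal); illustration only, not data
# from the study. Genes 2 and 3 form a cycle.
G = np.array([
    [0.0, 0.8, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.4],
    [0.0, 0.0, 0.3, 0.0],
])
n = G.shape[0]

# In a linear model, the matrix of total (marginal average) causal
# effects is R = (I - G)^{-1}.
R = np.linalg.inv(np.eye(n) - G)

# Recover the direct effects: G = I - R^{-1} D[1/R^{-1}], where D[A]
# zeroes the off-diagonal entries of A.
R_inv = np.linalg.inv(R)
G_recovered = np.eye(n) - R_inv @ np.diag(1.0 / np.diag(R_inv))

assert np.allclose(G_recovered, G)
```

In practice only the noisy estimate \(\hat{R}\) is available, which is why inspre estimates a sparse approximate left inverse V rather than inverting \(\hat{R}\) directly.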
Of course, the performance of inspre depends on edge weight and intervention strength: when network effects are small and interventions are weak, inspre performs comparatively poorly. However, even in this setting, the weighting scheme biases our approach toward high precision and low recall, and the resulting SHD is still comparable to that of other methods even in the absence of confounding (Supplementary Data 1).

K562 Perturb-seq analysis
We applied inspre to the K562 genome-wide Perturb-seq9 experiment targeting essential genes, choosing the essential-scale screen over the genome-wide screen for our initial analysis because of its larger average number of guides per gene and resulting lower noise floor. We selected 788 genes on the basis of guide effectiveness and the number of cells receiving a guide targeting each gene (Supplementary Data 2). Specifically, we included only genes whose targeting guide RNA reduced the expression of its target by at least 0.75 standard deviations of the untargeted expression level, and for which at least 50 cells received that gene-targeting guide. After estimating the ACE of each gene on every other gene, we found 131,943 significant effects at FDR 5%. We then used inspre to construct a graph on these 788 nodes, which contained 10,423 edges (1.68% non-zero, Fig. 2a–c). We calculated the eigencentrality19 (Fig. 2d), in-degree (Fig. 2e), and out-degree (Fig. 2f) of every node in the network. We observed an exponential decay in both the in-degree and out-degree distributions, indicating that the graph has scale-free-like properties. However, there is an interesting asymmetry between the degree distributions, with the out-degree distribution showing a strong mode at 0 but a much longer tail. That is, in our graph estimate most genes do not regulate other genes, but those that do often regulate many.
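For intuition, degree and centrality summaries of this kind can be computed directly from an adjacency matrix. The sketch below uses a hypothetical 5-gene binary adjacency and power iteration on the symmetrized adjacency for eigencentrality (an assumption for illustration; the paper's exact centrality computation may differ):

```python
import numpy as np

# Hypothetical binary adjacency for a 5-gene network (A[i, j] = 1 if
# gene i regulates gene j); illustration only, not the inferred graph.
A = np.array([
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
])

out_degree = A.sum(axis=1)   # genes that regulate many others
in_degree = A.sum(axis=0)    # genes regulated by many others

# Eigencentrality by power iteration on the symmetrized adjacency,
# normalized so the most central gene scores 1.
M = A + A.T
x = np.ones(len(A))
for _ in range(200):
    x = M @ x
    x = x / np.linalg.norm(x)
centrality = x / x.max()
```

Here gene 0 is a hub-like regulator (out-degree 3, in-degree 0), while gene 4 sits at the network periphery with the lowest centrality.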
These genes include DYNLL1 (dynein light chain 1, out-degree 422), HSPA9 (heat shock 70 kDa protein 9, out-degree 374), PHB (prohibitin, out-degree 355), MED10 (mediator complex subunit 10, out-degree 306), and NACA (nascent-polypeptide-associated complex alpha polypeptide, out-degree 284). These are highly conserved genes that play important roles in key cellular processes, particularly transcriptional regulation20,21,22,23. These genes also have high eigencentrality in our graph, but the most central genes also include several ribosomal proteins: RPS3, RPS11, and RPS16. We note that while there are only a few genes with very high eigencentrality, there are many other genes with non-trivial centrality. Specifically, there are 125 genes with eigencentrality above 0.2 (Supplementary Data 2).

Fig. 2: A causal network on 788 genes from the genome-wide Perturb-seq experiment.
a For each gene pair, we calculate pairwise marginal average causal effects. b Using inspre, we infer a shrunken approximation to this matrix that reduces noise, and c is supported by an estimate of the underlying causal graph. We plot the distribution of eigencentrality (d), in-degree (e), and out-degree (f), revealing scale-free and small-world properties. Source data for (a–c) have been deposited to Zenodo; see Data Availability. Source data for (d–f) are available in Supplementary Data 2.

Next, we used our graph estimate to calculate the shortest path, and the total effect induced by that path, for all pairs of vertices. 47.5% of gene pairs are connected by at least one path, with a median path length of 2.67 (standard deviation, sd = 0.78) for all pairs and 2.46 (sd = 0.77) for FDR-significant pairs (Fig. 3a). Next, we calculated the percentage of the total effect explained by the shortest path for all FDR 5% significant gene pairs.
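Concretely, under a linear model the effect carried by a path is the product of its edge weights, while the total effect is the corresponding entry of \((I-G)^{-1}\); the ratio of the two is the quantity computed here. A minimal sketch with a hypothetical three-gene network (not the study's graph):

```python
from collections import deque

import numpy as np

# Hypothetical 3-gene direct-effect matrix (illustration only): gene 0
# affects gene 2 both directly (0.1) and through gene 1 (0.5 * 0.4).
G = np.array([
    [0.0, 0.5, 0.1],
    [0.0, 0.0, 0.4],
    [0.0, 0.0, 0.0],
])
n = G.shape[0]

# Total effect of i on j in a linear model: [(I - G)^{-1}]_{ij}.
total = np.linalg.inv(np.eye(n) - G)

def shortest_path(adj, src, dst):
    """Fewest-hop path from src to dst via breadth-first search."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in range(len(adj)):
            if adj[u][v] != 0 and v not in prev:
                prev[v] = u
                queue.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

path = shortest_path(G, 0, 2)  # the direct edge 0 -> 2
path_effect = np.prod([G[a, b] for a, b in zip(path, path[1:])])
share = path_effect / total[0, 2]  # fraction of total effect on the shortest path
```

In this toy example the direct edge is the shortest path but carries only one third of the total effect (0.1 of 0.3), with the remainder flowing through gene 1.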
If this number is low, there are many paths between a given pair of nodes, with the shortest making up only a fraction of the total effect; if this number is close to 1, the shortest path explains most of the total effect; and if this number is above 1, other paths in the network cancel out some of the effect of the shortest path. We find that the average effect explained by the shortest path is low (median = 11.14%), and that there are many pairs where the effect explained exceeds 100% (5448 pairs, Fig. 3b). This indicates that there tend to be many important network paths contributing to the effect of one gene on another.

Fig. 3: Properties of our gene network estimate.
a The shortest path between pairs of nodes that have an FDR 5% significant total effect. b The percentage of the total effect explained by the shortest path indicates that many paths are responsible for the total effect of one gene on another. c Eigencentrality is correlated with multiple measures of gene essentiality, including pLI in gnomAD. d For genes with high genome-wide heritability in blood, there is a correlation between heritability and the variance explained in that gene by our K562 gene network. Lines represent 25%, 50%, and 75% quantiles. Source data for (a, b) have been deposited to Zenodo; see Data Availability. Source data for (c) are available in Supplementary Data 2. Source data for (d) are provided as a Source data file.

We also asked whether centrality was related to other measures of gene importance from sources such as gnomAD24 and ExAC25. We fit a beta regression model of eigencentrality on 16 such annotations (Supplementary Data 3) while controlling the family-wise error rate using Holm’s method26. We found associations between eigencentrality and numerous measures of loss-of-function intolerance (Supplementary Data 4), indicating that loss-of-function-intolerant genes tend to be more central.
Specifically, eigencentrality was associated with loss-of-function intolerance (gnomad_pLI, padj = 2.9 × 10−8, Fig. 3c and Supplementary Fig. 1b), the selection coefficient on heterozygous loss-of-function mutations (sHet, padj = 4.9 × 10−8, Supplementary Fig. 1c), haploinsufficiency score (HI_index, padj = 4.1 × 10−7, Supplementary Fig. 1d), probability of being haploinsufficient (pHaplo, padj = 5.2 × 10−6, Supplementary Fig. 1e), missense constraint (gnomad_MisOEUF, padj = 4.5 × 10−4, Supplementary Fig. 1f), and loss-of-function constraint (gnomad_LOUEF, padj = 4.3 × 10−3, Supplementary Fig. 1g). Eigencentrality was also strongly associated with the number of protein-protein interactions (n_ppis, padj = 1.3 × 10−12, Supplementary Fig. 1a).

Finally, we wondered whether there might be a relationship between our graph estimates and human gene expression data from whole blood. Specifically, we asked whether genes with high genome-wide (cis plus trans) expression heritability would have more of their variance explained by the network (see “Methods”). Using data from the INTERVAL study27, we calculated total heritability (h2) for all genes using GCTA28. We then calculated the correlation of the variance explained by the network with genome-wide heritability. While the overall correlation is low, restricting the analysis to genes with an estimated h2 above a given threshold revealed increasing correlation as the threshold was raised (Fig. 3d). Specifically, we observed a weak but statistically significant correlation when including genes with estimated heritability of at least 0.2 (ρ = 0.36, bootstrap p = 0.04), and a stronger correlation at a threshold of at least 0.25 (ρ = 0.54, bootstrap p = 0.002). Beyond this, the small number of remaining genes resulted in bootstrap estimates with large variation.

Explained variance analysis and validation using the genome-wide screen
We sought to empirically analyze the goodness of fit of our estimated graph.
We considered two approaches: randomizing the incoming edges of each node while controlling for in-degree, and randomizing the entire graph (see “Methods”). We used randomized incoming edges to compare the variance in gene expression explained by our inferred edges with that explained by the same number of randomly selected edges, and calculated the proportion of genes with significant variance explained at a given false discovery rate (Fig. 4a). We found that 605 of 788 (76.8%) genes had significant variance explained compared to random edges at 5% FDR, with a Q-Q plot of the permutation p-values showing enrichment over the null at every quantile (Fig. 4b). Next, we followed Ružičková et al.29, generating random graphs by shuffling the node labels on our estimated graph. This yields random graphs with the same degree distribution and edge weights while breaking the association between individual genes. We then calculated the variance explained in the total observed dataset by our graph compared to random graphs from the same distribution. We found that our estimated graph explained 5.84% of the variance in the data, while random graphs explained an average of 1.35% (permutation p