Precise, predictable genome integrations by deep-learning-assisted design of microhomology-based templates

Main

The precise and targeted integration of transgenes using CRISPR–Cas technology holds great promise for applications in biotechnology and gene therapy1. However, it is paramount that genomic integrity is maintained to avoid unintended side effects and that the integration technique is suitable for targeting the intended cell types2,3. Typically, CRISPR–Cas-mediated integration relies on homology-directed repair (HDR), which necessitates large homology arms and is only active in proliferating cells, or on nonhomologous end joining (NHEJ), microhomology (µH)-mediated end joining (MMEJ) or single-strand annealing4. However, NHEJ and MMEJ may result in unintended genomic alterations at transgene–genome borders, including deletions within the surrounding genome or transgene, potentially disrupting neighboring genes5,6.

In humans, naturally occurring double-strand breaks (DSBs) are typically repaired accurately; however, occasionally, inherently mutagenic MMEJ repair results in genetic errors. Microdeletion variants account for 20–25% of all clinically pathogenic sequence variants7,8,9. The majority of these mutations display a local sequence signature characteristic of deletions through µHs and are often three adjacent base pairs in length. Using this natural MMEJ mechanism for frame-retaining DSB repair of coding sequences offers biotechnological opportunities.

MMEJ as a repair mechanism for DSBs induced by CRISPR–Cas is conserved across a broad spectrum of organisms, ranging from Hydrozoa10 and plants11 to zebrafish12,13, Xenopus14 and humans15,16. Such MMEJ repair occurs in a nonrandom fashion and is predictable by algorithms and deep learning models, such as inDelphi17,18,19. This predictability has been harnessed to establish programmable smaller17 and larger20,21 deletions after DSB repair but never transgene insertions. 
While MMEJ-mediated approaches have been successfully used for integration (for example, GeneWeld22 and PITCh23,24,25), these did not offer control over gene-editing outcomes at genome–transgene repair boundaries. On the other hand, prime editing’s effectiveness depends on the coordination of multiple components and is traditionally restricted to edits ranging from 1 to ~50 bp, rendering larger insertions inaccessible26. New tools that combine prime editors with serine integrases, such as TwinPE27, PASTE28 and PASSIGE29, have been shown to enable larger DNA insertions yet leave a footprint, making them less suitable for protein tagging applications.

The CRISPR–Cas system has been widely adopted in biotechnology and basic research. Here, we explore the insertion of transgenic cassettes using the CRISPR–Cas system and the predictable nature of DSB repair mechanisms when introducing exogenous genetic material. We harnessed deep learning models, pretrained on DNA repair outcomes, to develop optimal rules for designing repair arms, both to integrate transgenic cassettes and to establish small point mutations. This results in predictable editing outcomes driving intended edits and integrations.

We used tandem repeats of µHs, placed at the edges of transgene cassettes to facilitate on-target integration by MMEJ using CRISPR–Cas. We find that DSB repair is nonrandom on the interface between the genome and such µH tandem repeat repair arms of transgenic cassettes in vitro and in vivo. Moreover, µH tandem repeat repair arms safeguard the boundaries during integration, precluding extensive DNA trimming. We deduced optimal design rules and showed integration using µH tandem repeats to be effective in cell contexts where HDR is largely ineffective, such as rapidly cycling vertebrate embryos (Xenopus) and adult postmitotic mouse neuronal cells. 
Lastly, we extend the notion of predictability to the rational design of small repair templates for the introduction of desired point mutations at permissive loci with single-stranded oligodeoxynucleotide (ssODN) donor templates.

Cas9 integration with donor templates is nonrandom and predictable

Endogenous DNA repair outcomes following DSBs induced by CRISPR–Cas (specifically Streptococcus pyogenes Cas9) are nonrandom and can be predicted on the basis of the local sequence context15,16,17,18. We explored whether one such algorithm, inDelphi17, could also predict editing outcomes at the interface between endogenous DSB edges and exogenous donor DNA. When the inDelphi model predicted a µH-mediated 4-bp deletion as the major editing outcome of an example sequence (Fig. 1a), adding the 3 bp present on the left side of the cut to the sequence right of the cut pivoted the most frequent predicted outcome toward a 3-bp deletion. This effectively removed the inserted 3-bp µH, overruling the previously dominant 4-bp deletion. Further repeating the 3-bp sequences in tandem increased the proportion of predicted editing outcomes that use an inserted artificial µH from 52% to 62% (Fig. 1a). Extending the in silico simulation to 250,000 putative guide RNA (gRNA) target loci on human chromosome 1 revealed an increase in artificial µH usage for DNA repair with an increasing number of tandem repeats, plateauing at five tandem repeats (Fig. 1b and Supplementary Fig. 1). The local sequence context strongly influenced the use of µH tandem repeats (Fig. 1c), suggesting that the optimal design needs to be computed for each gRNA and its surrounding genomic sequence.

Fig. 1: Modeling predicted gene-editing outcomes using inDelphi while providing synthetic µHs. a, Predicted editing outcomes are shown using inDelphi (HEK293T) on synthetic DNA. Adding tandem repeats of the bases left of the CRISPR–Cas cut site to the right of the cut affected the predicted editing outcomes. 
Cumulative µH repair is defined as the percentage of editing outcomes that mobilize (delete) synthetic µHs during repair. Iterative recutting of products is not computationally modeled. b, Modeling of expected editing outcomes across 250,000 distinct gRNA target sites across human Chr1, when adding the 3 bp flanking the left side of the CRISPR–Cas cut site either as a single repeat (1×) or as tandem repeats (2×–8×). The percentage of repair by µH usage is shown. Box plots show the median, interquartile range (IQR) and whiskers extending to 1.5× the IQR with n = 250,000. c, Heat map highlighting the expected percentage of repair by µH as a function of the length of µH and the number of tandem repeats for 25 gRNAs, demonstrating that there is a sequence-context-specific optimal solution for maximizing the percentage of µH repair outcomes. d, Schematic of the experimental setup: PaqCI digestion releases the linear dsDNA donor, which contains 5× 3-bp µH tandem repeat arms, and is codelivered with RNP targeting AAVS1. e, Sequence of the target locus and 3-bp µH tandem repeat repair arms. f, After 14 days, flow cytometry indicates an increase in stable integration in cells transfected with the linear dsDNA template. g, Integration occurs specifically with PaqCI-linearized templates; circular templates show no detectable on-target integration. h, Quantification of integration efficiency of AAVS1 gRNA compared to a negative control gRNA. Statistical analysis was performed using an unpaired two-tailed t-test; P = 0.021 (n = 3 independent biological replicates). Error bars represent the s.d. i,j, The inDelphi HEK293T model accurately predicts the observed frequency of distinct editing outcomes in the µH tandem repeat arms at both junctions. Data points are the means of three independent biological replicates. A two-sided Pearson correlation was applied (i, r = 0.815, P = 0.00022; j, r = 0.969, P = 1.10 × 10−8). No multiple comparisons were performed. 
Some schematics were created with BioRender.com.

Next, we experimentally investigated whether inDelphi predictions of repair outcomes between endogenous DSB edges and exogenous donor DNA would facilitate CRISPR–Cas-mediated knock-in. For this, the AAVS1 landing site was targeted in HEK293T cells (Fig. 1d). We added five tandem repeats of 3-bp µH (5× 3-bp µH) to the left and right of the donor cassette, matching the sequence context left and right of the AAVS1 cut site (Fig. 1e). We assessed the resulting scarring patterns and validated the predictability of DNA repair at genome–transgene borders and the increased frame retention. To more easily customize the donor edges without undesired 5′->3′ overhangs, we added two PaqCI type IIS endonuclease restriction sites invertedly flanking the donor cassette (pCMV:eGFP) for in vitro release of linear DNA (PaqMan plasmids; Supplementary Fig. 2a). PaqMan linearization facilitated on-target genomic integration (5.2% GFP+), whereas nonlinearized plasmid donor merely resulted in random integration (2.3% GFP+), demonstrated by boundary PCR analysis (Fig. 1f,g and Supplementary Fig. 2b). On-target integration only occurred with an AAVS1-targeting ribonucleoprotein (RNP) and never with control RNP (gRNA target site not present in the human genome) (Fig. 1h and Supplementary Fig. 2c).

Using 3-bp µH tandem repeat repair arms provided us with a unique way to sample the distribution of editing outcomes at the interface between endogenous DNA and exogenous cargo. 
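
The arm design described above (tandem repeats of the µH matching the sequence immediately left and right of the cut, placed at the left and right donor edges) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `cut_index_from_pam`, `tandem_repeat_arms` and `build_donor` are hypothetical, the target is assumed to be given 5′ to 3′ on the protospacer strand, the cut is assumed to be blunt and 3 bp upstream of the NGG PAM, and neither the PaqCI sites nor any inDelphi prediction is modeled.

```python
def cut_index_from_pam(pam_start: int) -> int:
    """Index of the first base right of the cut (assumed blunt cut 3 bp upstream of the PAM)."""
    return pam_start - 3


def tandem_repeat_arms(target: str, cut_index: int, mh_len: int = 3, repeats: int = 5):
    """Return (left_arm, right_arm) for a donor cassette.

    left_arm  : `repeats` copies of the `mh_len` bases immediately left of the cut
    right_arm : `repeats` copies of the `mh_len` bases immediately right of the cut
    """
    target = target.upper()
    if not (mh_len <= cut_index <= len(target) - mh_len):
        raise ValueError("cut site is too close to the end of the target sequence")
    left_mh = target[cut_index - mh_len:cut_index]
    right_mh = target[cut_index:cut_index + mh_len]
    return left_mh * repeats, right_mh * repeats


def build_donor(cargo: str, left_arm: str, right_arm: str) -> str:
    """Linear donor: left µH repeats, cargo, right µH repeats."""
    return left_arm + cargo.upper() + right_arm


# Toy example (sequence invented for illustration; PAM "AGG" starts at index 12).
toy_target = "AAACCCGGGTTTAGGCC"
cut = cut_index_from_pam(12)
left_arm, right_arm = tandem_repeat_arms(toy_target, cut, mh_len=3, repeats=2)
donor = build_donor("ATGC", left_arm, right_arm)
```

Because the text notes that the optimal design depends on each gRNA and its sequence context, `mh_len` and `repeats` are left as parameters here rather than fixed at 3 bp and five repeats.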
Targeted amplicon sequencing of the boundary PCR products revealed that the rate of µH tandem repeat use after DNA integration observed experimentally correlated well with the inDelphi predictions at the left (r = 0.81, P 25% and 10 million gRNAs across the human genome revealed variations in predicted repair outcomes driven by µH composition, particularly linked to the nucleotide at position −4 (counting the NGG protospacer-adjacent motif (PAM) as nucleotides 0–2) (Fig. 2e). G at position −4 was predicted to enhance integration over C, A or T and was independent of the PAM sequence used (Supplementary Figs. 5 and 6). No similar effects were noted for any other position in the gRNA (Supplementary Figs. 5 and 6). This indicated that the nucleotide located immediately to the left of the CRISPR–Cas-induced DSB (position −4) could be a parameter to improve integration.

To test this, we targeted 32 genes in HEK293T cells and codelivered target-specific repair templates with five µH tandem repeats. To avoid a potential negative selection effect, we chose nonessential genes32. We ensured that the gRNAs had similar predicted on-target efficiency and a balanced distribution across different G+C contexts (Fig. 2f). To directly assess whether the nucleotides at position −7 to −4 influence integration, we only considered gRNA target sites with AGG at nucleotides −3 to −1. The 32 targets were chosen to fall into one of eight classes, each representing a distinct combination of strong (G or C) or weak (A or T) bases at positions −4 to −7 (n = 4 per class) (Fig. 2f,g). Within each class, we binned gRNAs according to predicted MMEJ repair usage. Target-specific repair templates, incorporating five 3-bp µH tandem repeats, were generated by overhang PCR (Fig. 2h).

Across all 32 targets, we observed a median 1.6-fold increase in integration efficiencies comparing on-target RNP to negative control RNP (median on-target integration of 3.61%, P