Semantic design of functional de novo genes from a genomic language model

Main

Although generative artificial intelligence (AI) promises to accelerate the design of functional biological systems, articulating ‘function’ to a generative model remains challenging and often underspecified. In natural language, distributional semantics hypothesizes that meaning can be represented by word co-occurrence, that is, ‘you shall know a word by the company it keeps’3,4 (Fig. 1a). In biology, an emerging distributional hypothesis defines the function of a gene through its interactions with other genes, that is, ‘you shall know a gene by the company it keeps’2.

Fig. 1: In-context genomic modelling and design with Evo. a, In natural language, distributional semantics holds that lexically distinct but functionally similar words often occur in similar contexts with shared sets of neighbouring words. b, In semantic design, a genomic language model trained across multiple genes learns to map genes with related functions to similar semantic spaces, enabling the generation of functionally related yet sequence-diverse genes. c, Sequence recovery assessments, where a genomic language model is used to autocomplete three conserved prokaryotic genes, show consistent improvements from Evo 1 131K and Evo 1 8K to Evo 1.5, reflecting an enhanced ability to leverage genomic context. The bar height denotes mean, and the error bars indicate standard error. n = 100 generated sequences. d, Completion of conserved E. coli trp operon gene sequences using both sense and antisense strand prompting yields high sequence recovery and predicted structural conservation in the generated sequences across the operon. TM, template modelling. e, A positional entropy comparison between natural and generated modB sequences at both the nucleotide and the amino acid level shows conservation of essential amino acid residues while maintaining high nucleotide diversity.
n = 500 sequences.

In prokaryotic genomes, functionally related genes are often positioned next to each other in gene clusters or operons5,6. Researchers have exploited this property, known as guilt by association, to characterize unknown genes neighbouring functionally characterized genes7,8, leading to discoveries of new molecular mechanisms and important biotechnological tools9,10,11,12. At its core, guilt by association leverages the distributional hypothesis of gene function for function-guided discovery.

A capable generative model of prokaryotic genomic sequences could learn this distributional notion of function to perform function-guided design. Progress in long-context machine learning has enabled generative models of genomic sequences at the multi-kilobase scale1. These models predict the next base pair in a sequence, enabling them to generate DNA sequences conditioned on a genomic sequence prompt (Fig. 1b).

Given the success of guilt by association, we reasoned that prompt engineering of a generative genomic model with a sequence of known function could direct the model to sample novel, functionally related sequences in its response. We call this approach ‘semantic design’, a generative strategy that harnesses the multi-gene relationships in prokaryotic genomes to design novel DNA sequences enriched for targeted biological functions. Of note, unlike traditional biological design, which involves combining or optimizing characterized sequences13, semantic design potentially allows for the exploration of new regions of functional sequence space.

Here we show that Evo, a genomic language model, learns a distributional semantics over genes that enables the function-guided design of new sequences reflecting prokaryotic functional relationships.
We first demonstrate that Evo enables in-context genomic design, leveraging sequence conservation patterns to complete prokaryotic genes and operons.

We then apply semantic design to generate genes with high novelty and specified functional activity. First, we produce several novel toxin–antitoxin pairs with high experimental success rates, including a toxic gene with no significant sequence similarity to known bacterial toxins and a functional RNA antitoxin. We further generate multiple functional anti-CRISPRs (Acrs) that lack sequence or predicted structural similarity to known Acrs.

We also report SynGenome (https://evodesign.org/syngenome/), an AI-generated genomics database, containing over 120 billion base pairs of synthetic DNA sequences derived from prompts spanning 9,000 functional terms. We make SynGenome openly available to facilitate semantic design across diverse functions.

In summary, we demonstrate that semantic design, with its versatility and robust success rates, offers a promising framework for function-guided design that can generalize beyond observed evolutionary sequence landscapes.

Evo enables in-context genomic design

To apply semantic design for function-guided sequence generation, a model must understand not just individual gene sequences, as with protein language models14,15, but also how genes relate to each other within broader genomic contexts. Evo, trained on prokaryotic DNA sequences from OpenGenome (Methods), processes long genomic sequences at single-nucleotide resolution1, enabling it to link nucleotide-level patterns to kilobase-scale genomic context. Given that functionally related sequences cluster on prokaryotic genomes, supplying appropriate genomic context as a prompt could condition Evo to generate novel genes whose functions mirror those found in similar natural contexts (Fig.
1a,b).

As an initial experiment, we assessed the ability of Evo to leverage genomic context by performing an ‘autocomplete’ task in which we prompted the model with partial sequences of highly conserved prokaryotic genes. We tested several archaeal and bacterial genes, including RNA polymerase sigma factor rpoS from Escherichia coli, DNA gyrase subunit A gyrA from Salmonella enterica and cell division protein ftsZ1 from Haloferax volcanii (Fig. 1c and Extended Data Fig. 1a). We also tested three versions of the Evo model: Evo 1 8K (pretrained at 8,192 context length), Evo 1 131K (extended to 131,072 context length) and Evo 1.5, newly introduced here, which extends the pretraining of Evo 1 8K by 50% to 450 billion tokens (see Methods; Extended Data Fig. 1b,c). For each gene, we prompted the models with varying amounts (30%, 50% and 80%) of input sequence and evaluated their ability to complete the remainder.

Evo 1.5 consistently demonstrated the highest recovery of the natural protein sequence, particularly at lower prompt lengths. For instance, with just 30% of the input sequence, Evo 1.5 achieved 85% amino acid sequence recovery for rpoS, compared with 65% for Evo 1 131K. This performance advantage was maintained across all tested genes and prompt lengths, with Evo 1.5 achieving near-perfect sequence recovery at 80% input. These experiments align with previous findings that longer pretraining can improve learning of long-range interactions in sequence models16,17,18. We therefore selected Evo 1.5 for further investigation, and all results attributed to Evo in this study were produced by the Evo 1.5 model.

To further test Evo’s understanding of genomic context at a multi-gene scale, we next evaluated its ability to predict gene sequences based on operonic neighbours (Fig. 1d and Extended Data Fig. 1d).
We prompted the model with sequences of genes either upstream or downstream of target genes in the trp and modABC operons, leveraging DNA complementarity to control directionality through sense or antisense strand prompts. Evo demonstrated robust predictive performance across all tested configurations, achieving over 80% protein sequence recovery for all target genes. Furthermore, the model exhibited adaptability to genomic orientation, generating upstream gene sequences when prompted with the reverse complement of downstream genes, and vice versa. These results indicate that Evo not only learns the primary sequence of genes but also captures the broader genomic organization of bacterial operons.

To assess whether the generations from Evo went beyond trivial memorization of training sequences, we analysed the per-position entropy of both amino acid and nucleotide sequences in the model’s generations (Fig. 1e and Extended Data Fig. 1e). Using the modABC operon as a test case, we prompted the model with sequences encoding modA from E. coli K-12 and analysed variability in the generated modB responses. Amino acid-level entropy analysis revealed selective conservation, with generally lower entropy at key positions and higher variability in less-conserved regions, consistent with natural protein evolution. Further analysis of amino acid substitution patterns using BLOSUM62 showed that when the model generated sequences with amino acid changes, it preferentially selected conservative substitutions, mirroring natural evolutionary constraints (Extended Data Fig. 1f).

At the nucleotide level, we observed substantially higher entropy, with variation even in regions encoding conserved amino acids. We also observed lower sequence recovery in genes with low sequence conservation (Extended Data Fig. 1g).
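
As an illustration of the per-position entropy comparison described above, the following minimal sketch computes Shannon entropy for each column of a set of pre-aligned sequences. It is not the analysis pipeline used in this study: the function name `positional_entropy`, the use of bits (log base 2), the assumption that sequences are already aligned to equal length, and the toy sequences are all illustrative assumptions.

```python
import math
from collections import Counter


def positional_entropy(aligned_seqs):
    """Return the Shannon entropy (in bits) of each column of aligned sequences.

    Assumes all sequences have already been aligned to the same length;
    how the alignment is produced is outside the scope of this sketch.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")

    entropies = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        total = len(column)
        # H = -sum(p * log2(p)) over the symbols observed in this column.
        entropy = sum(
            -(count / total) * math.log2(count / total)
            for count in counts.values()
        )
        entropies.append(entropy)
    return entropies


# Toy data (not from the study): four DNA sequences that all encode Met-Leu
# but use different synonymous leucine codons (CTG, CTC, TTA, CTT).
toy_nucleotides = ["ATGCTG", "ATGCTC", "ATGTTA", "ATGCTT"]
toy_proteins = ["ML", "ML", "ML", "ML"]

nt_entropy = positional_entropy(toy_nucleotides)
aa_entropy = positional_entropy(toy_proteins)
```

In this toy case the amino acid entropy is zero at both positions while the third codon position of leucine is maximally variable, which is the qualitative pattern the text describes: nucleotide-level variation in regions that encode conserved amino acids.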
Given that a single prompt was used to generate all response sequences, these results suggest that Evo is not simply reproducing memorized sequences; rather, it is synthesizing information from across its training set, reflecting biological constraints while generating diversity in a manner reminiscent of natural evolution.

Semantic design of multi-component systems

Next, we explored whether we could apply semantic design to biology with higher evolutionary variation: phage and bacterial defence systems. These systems, which are shaped by the evolutionary arms race between bacteria and phage19, are some of the most rapidly evolving systems in nature. Consequently, defence systems exhibit vast amounts of functional diversity and share limited sequence conservation across species20,21. Defence systems frequently cluster into defence islands, enabling the discovery of new systems through guilt-by-association approaches22.

Given the natural diversity and genomic colocalization of these systems, we sought to determine whether semantic design could be used to generate new defence systems. As an initial test case, we focused on type II toxin–antitoxin (T2TA) systems, some of which have a role in phage defence23. T2TAs consist of a toxin protein that inhibits bacterial growth or induces death under stress, paired with an antitoxin protein that binds and neutralizes the toxin in homeostatic conditions (Fig. 2a,b). These systems often maintain conserved genomic architectures despite sequence divergence, with toxin and antitoxin genes arranged in adjacent positions24.

Fig. 2: Evo generates functional toxin–antitoxin complexes with low similarity to nature. a, Semantic design of T2TA systems begins by generating toxins (Ts) from toxin genomic context prompts and testing growth inhibition. Functional toxins (EvoTs) then serve as prompts to generate cognate antitoxins (ATs), which are validated through growth recovery. Ara, arabinose; PBAD, araBAD promoter.
b, T2TA (protein–protein) and T3TA (protein–RNA) systems function via antitoxin binding and neutralization of toxins. c, AlphaFold 3 structure comparison between the generated type II toxin EvoRelE1 and its closest BLAST match. d, Relative bacterial survival after 12 h when testing generated toxin–antitoxins, normalized to uninduced toxin controls. The type II antitoxins EvoAT1–4 rescue growth against EvoRelE1, whereas the type III antitoxin EvoAT6 rescues growth against ToxN. The bar height denotes the mean, the error bars indicate standard error, and the circles represent biological replicates (n = 3 for type II and n = 6 for type III). Significance was determined by a one-sided Student’s t-test against EvoRelE1 + eGFP (P = 1.6 × 10^−7, 1.4 × 10^−7, 2.3 × 10^−5 and 1.7 × 10^−5 for EvoAT1–4) and ToxN wild type (WT) only (P = 1.84 × 10^−8) for type II and type III systems, respectively. e, Alignment of EvoAT1–4 amino acid sequences reveals discrete areas of conservation. f, Relative bacterial survival when testing EvoAT1–4 and RelB against natural type II toxins. EvoAT2 and EvoAT4 inhibit multiple natural toxins. g, Number of structural and sequence matches between EvoAT1–4 and natural antitoxin families. B, BLAST; F, Foldseek; H, HHpred. h, AlphaFold 3 structures of EvoAT1–4 in complex with EvoRelE1 and structural alignments of EvoAT1–4 to their closest BLAST matches. EvoAT1–4 have confident predicted structures despite limited sequence identity to natural proteins. i, AlphaFold 3 structure of the novel toxic protein EvoT1. j, Structural and sequence-similarity analyses of EvoT1 show no significant similarity to natural proteins. k, Minimum free energy (MFE) secondary structure and sequence of EvoAT6 show similarity to ToxI antitoxins. The blue lettering shows sequence differences.

To generate diversified T2TAs using Evo 1.5, we developed a prompting strategy that leveraged the colocalization of these systems (Fig. 2a).
We curated eight types of prompts: toxin and antitoxin sequences, their reverse complements, and the upstream or downstream contexts of toxin and antitoxin loci (Extended Data Fig. 2a,b). Following sampling using these prompts, we filtered generations for sequences encoding protein pairs that exhibited in silico predicted complex formation (Methods). We also included a novelty filter, requiring at least one component to have only limited sequence identity to known T2TA proteins (see Methods; Extended Data Fig. 2c).

Using a growth inhibition assay (Methods), we identified a functional bacterial toxin, EvoRelE1, that exhibited strong growth inhibition (approximately 70% reduction in relative survival) while possessing 71% sequence identity to a known RelE toxin (Fig. 2c,d and Extended Data Fig. 2d).

We subsequently prompted Evo 1.5 with the sequence of EvoRelE1, hypothesizing that the model could use the context of the toxin to generate cognate antitoxins (Fig. 2a). Following sampling, we found that generated sequences were enriched for antitoxin-like genes (Extended Data Fig. 2b), demonstrating that context could guide generation towards desired functional outcomes. After filtering generations using the same criteria as for toxins (Methods), we identified ten antitoxin candidates with minimal sequence identity to natural proteins.

Following co-expression with EvoRelE1, 50% of the Evo-generated antitoxin candidates rescued cell growth (Fig. 2d and Extended Data Fig. 2e). The most effective candidates, EvoAT1 and EvoAT2, restored growth to 95–100% of normal cell survival, with candidates EvoAT3 and EvoAT4 demonstrating moderate rescue activity (70% and 90% relative survival, respectively). Sequence alignments of the successful antitoxins revealed discrete regions of conservation (Fig. 2e), potentially highlighting motifs required for toxin neutralization despite their overall sequence diversity.
Furthermore, when tested against natural RelE, MazF and YoeB toxins, several of the generated antitoxins were able to rescue growth across multiple toxins, with EvoAT2 showing inhibitory activity against all three toxins and EvoAT4 rescuing growth against RelE and YoeB (Fig. 2f and Extended Data Fig. 2f). In contrast, the natural RelB antitoxin only neutralized its cognate RelE toxin and EvoRelE1. This finding is notable given that the EvoATs share limited overall sequence identity with the antitoxin counterpart of each toxin (Fig. 2g and Extended Data Fig. 2g,h). When co-folded using in silico structure prediction methods, several of the EvoATs had low-confidence predicted complex formation with the natural toxins that they inhibited (Extended Data Fig. 2i–k), highlighting the potential for semantic design to generate molecular interactions not readily identifiable by structure prediction models.

EvoAT1 through EvoAT4 all had relatively minimal sequence identity to natural proteins (21–27%), a range in which sequence similarity alone can make it difficult to predict shared function25,26. The closest direct sequence matches for EvoAT2 and EvoAT3 were to proteins not annotated as antitoxins, with EvoAT2 showing 25% sequence identity to an uncharacterized Magnetococcus sp. YQC-5 hypothetical protein (e-value of 0.06) and EvoAT3 showing 21% sequence identity to a Jatrophihabitans hypothetical protein (e-value of 2.2; Fig. 2h). As these natural proteins appear in T2TA-like genomic contexts (Supplementary Fig. 1), our results suggest that they may function as part of toxin–antitoxin systems.
These findings illustrate how Evo’s understanding of genomic context may help with functional annotation of previously uncharacterized proteins, as well as functional design that is unconstrained by existing annotations.

Structural predictions for EvoAT1 and EvoAT2 co-folded with EvoRelE1 had high confidence (predicted local distance difference test (pLDDT) scores of 0.85 and 0.83, respectively; Fig. 2h) and exhibited minimal predicted position error (Extended Data Fig. 2l). This, coupled with the strong functional activity of EvoAT1–4, was particularly noteworthy given the limited sequence identity of these antitoxins to known antitoxins (Fig. 2h). To further assess the novelty of EvoAT1–4, we performed a residue coverage analysis to determine the number of natural proteins required to account for each amino acid position (Methods). Although natural toxins and antitoxins could be constructed by recombining 2–6 different natural proteins, EvoAT1–4 required fragments from 15–20 different proteins (Extended Data Fig. 3a,b). Overall, these results further underscore the ability of Evo to generate functional proteins with low similarity to natural proteins.

More sensitive sequence and structural similarity searches using BLAST, HHpred, Dali and Foldseek (Methods) revealed that the EvoATs showed similarity to multiple independent antitoxin superfamilies, particularly ParD, MazE, HicB and VapB (Fig. 2g and Extended Data Fig. 3c–e). This finding, coupled with the activity of EvoAT2 and EvoAT4 against multiple toxins, is notable because their cognate toxins use different mechanisms of action27,28,29,30.
This suggests that Evo may have identified a broader functional compatibility between antitoxins and toxins than is typically observed in nature, highlighting the potential of semantic design to provide new insights into protein–protein interaction compatibility31.

To evaluate the utility of semantic design for systems containing non-coding RNAs, we next focused on type III toxin–antitoxin systems (T3TAs). Like type II systems, T3TAs maintain a conserved genomic architecture of adjacent toxin and antitoxin genes. However, instead of an antitoxin protein, T3TA systems include a repetitive RNA antitoxin that directly binds to the toxin protein, repressing toxins in homeostasis32 (Fig. 2b).

Using a similar prompting approach to that of the T2TAs, we used Evo 1.5 to sample sequences with prompts derived from individual T3TA genes, their reverse complements and their respective upstream and downstream sequences (Extended Data Fig. 4a–c). To identify candidate sequences, we filtered generations for sequences containing a complex tandem repeat sequence and at least one RNA structure, Rfam or Pfam match to a T3TA-associated family (see Methods; Extended Data Fig. 4d).

Using a growth rescue assay of generated RNA antitoxin candidates against wild-type (WT) type III toxins ToxN, TenpN and CptN (Extended Data Fig. 4a), we identified an Evo-generated antitoxin candidate, EvoAT6, that had neutralizing activity (88% relative survival) against ToxN33 (see Methods; Fig. 2d and Extended Data Fig. 4e). This antitoxin showed only moderate sequence identity to known type III antitoxins, with 78% identity to ToxI from Bacillus multifaciens (Fig. 2k and Extended Data Fig. 4f). Despite this sequence divergence, the predicted consensus repeat secondary structure of EvoAT6 closely resembled that of the natural ToxI sequence, indicating that our design approach successfully diversified the antitoxin while preserving essential structural features (Extended Data Fig.
4g–i).

Using T3TA prompts, we also semantically designed a toxic protein, EvoT1, that showed strong growth inhibition (33% relative survival) upon expression in E. coli (Fig. 2d,i). EvoT1 was not neutralized by EvoAT6 or natural antitoxins (Extended Data Fig. 4e), suggesting diverse mechanisms among our generated sequences. Of note, EvoT1 showed no strong sequence or predicted structural similarity to known bacterial toxin–antitoxin genes, even when using sensitive similarity detection methods (Fig. 2j and Extended Data Fig. 4j,k). A residue coverage analysis (Methods) required recombining segments from over 40 proteins to account for all amino acids in EvoT1, which more closely resembled the compositionality of de novo proteins than either natural proteins or sequences designed by protein language models (Extended Data Fig. 4l). In summary, these results demonstrate that semantic design can generate multi-component systems containing proteins and RNA with high degrees of novelty and functional specificity.

Semantic design of de novo anti-CRISPRs

We next explored whether semantic design could generate sequences with even greater evolutionary novelty. Anti-CRISPRs (Acrs) are proteins used by phages to neutralize bacterial CRISPR–Cas systems (Fig. 3a). Many Acrs represent striking examples of rapid protein evolution, appearing as novel innovations without detectable similarity to other protein families34,35,36. Their diversity spans varied mechanisms of action, ranging from direct Cas binding and DNA mimicry to transcriptional silencing37,38,39,40, making them valuable tools for understanding protein evolution and developing control systems for CRISPR41.

Fig. 3: Evo generates functional anti-CRISPR proteins with no clear similarity to known proteins. a, Type II anti-CRISPR systems feature anti-CRISPR (acr) genes encoding inhibitors of type II-A Cas nucleases that often co-occur with anti-CRISPR-associated genes (aca).
b, Semantic design of Acrs uses known type II Acr genomic contexts as prompts. c, PaCRISPR classification shows significant enrichment of Acr-like sequences in generations from Acr prompts. The bar height denotes the mean, the error bars represent standard error, and the circles show batches of generations (n = 4 generation batches). gDNA, genomic DNA. d, Sequence identity matrix demonstrates high diversity among randomly sampled generated Acrs. e, In protection assays, functional Acrs block SpCas9 cleavage of a kanamycin resistance gene, enabling cell survival in kanamycin-supplemented media. kanR, kanamycin resistance gene. f, Relative bacterial survival after 8 h when testing generated Acrs normalized to uninduced SpCas9 controls. EvoAcr1–5 confer protection against SpCas9, with EvoAcr3–5 showing comparable activity with AcrIIA2. The bar height denotes the mean, the error bars show the standard error, and the circles represent biological replicates (n = 3). Significance was determined by a one-sided Student’s t-test compared with random control (P = 8.3 × 10^−6, 6.2 × 10^−6, 8.2 × 10^−6, 1.4 × 10^−6 and 8.2 × 10^−7 for EvoAcr1–5, respectively) and AcrIIA2 (P = 5.1 × 10^−4 for EvoAcr5). g, T4 phage plaque assays validating Acr activity. Plaque formation indicates successful Acr protection. The experiment was performed in triplicate with representative images shown. h, AlphaFold 3 structures for EvoAcr1–2 show low-confidence predictions. i, AlphaFold 3 structures comparing EvoAcr3 and its closest BLAST match show limited sequence and structural similarity to a sigma-70 family protein. j, AlphaFold 3 structures comparing EvoAcr4–5 with their closest BLAST matches show moderate structural and sequence similarity to known Acr proteins. k, Sequence similarity analysis of EvoAcr1–5 against BLAST nr and OpenGenome.
EvoAcr1–2 demonstrate no significant sequence identity to known proteins, EvoAcr3 exhibits sequence similarity to proteins at percent identities too low for reliable functional inference, and EvoAcr4–5 share limited sequence identity with known Acrs.

Despite this diversity, many Acr operons maintain a somewhat conserved architecture, consisting of multiple acr genes appearing together alongside anti-CRISPR-associated (aca) genes42 (Fig. 3b). This architectural conservation makes Acrs an ideal test case for assessing the ability of semantic design to generalize with respect to sequence while retaining a desired higher-level function.

To semantically design Acrs, we leveraged sequences from known Cas9-targeting Acr operons as prompts (see Methods; Fig. 3b and Extended Data Fig. 5a), including type II acr genes, their associated aca genes, the 500 bp upstream and downstream of each acr gene and the reverse complements of both gene types. After filtering for size, complexity and structure (Extended Data Fig. 5b), we next used PaCRISPR, a machine learning model trained to identify potential Acr proteins, to evaluate our generated candidates43. Consistent with our prompts providing successful functional conditioning, generations derived from Acr-containing genomic contexts were substantially more likely to be classified as potential Acrs by PaCRISPR than negative control sequences (Fig. 3c and Extended Data Fig. 5a). Furthermore, the distribution of sequence identities among the predicted Acrs showed a wide range of novelty, with most candidates showing low similarity to each other (median pairwise sequence identity of 23%; Fig. 3d). This enrichment for diverse Acr-like sequences suggests that semantic design can bias generations towards a desired function even in the absence of clear sequence conservation.

To test the protection ability of Acr candidates against SpCas9, we co-transformed E.
coli with plasmids encoding candidate Acrs and a CRISPR-targeted kanamycin resistance gene, where functional Acrs would preserve kanamycin resistance by inhibiting CRISPR-mediated cleavage44 (Fig. 3e). We found that 17% of tested sequences exhibited measurable Acr activity (Extended Data Fig. 5c,d), a notably high success rate given the lack of structural priors or conditioning (Extended Data Fig. 5e) and the use of a single Cas nuclease for screening. From this pool, we further identified five proteins (EvoAcr1–5) that demonstrated strong protection against SpCas9 cleavage in both liquid culture survival assays (Fig. 3f and Extended Data Fig. 5d) and phage infection experiments (Fig. 3g and Extended Data Fig. 5f), while maintaining normal host growth (Extended Data Fig. 5g).

Detailed bioinformatic analysis of these five Acrs revealed a high level of sequence diversity. EvoAcr4 and EvoAcr5 shared moderate sequence similarity to known Acrs, with 58% and 31% sequence identity to AcrIIA2 and AcrIIA4, respectively (Fig. 3j and Extended Data Fig. 6a). Both demonstrated robust protection against SpCas9, showing activity comparable with the positive control AcrIIA2 in liquid culture assays (relative survival rates of 0.91 and 1.01 out of 1.0, respectively, compared with 0.87 for AcrIIA2; Fig. 3f). EvoAcr3 presented an intriguing case: although sharing limited sequence and predicted structural alignment (sequence ID = 25%, E = 0.006, template modelling score = 0.29) with a sigma-70 family RNA polymerase sigma factor, it maintained strong Acr activity (relative survival of 0.89; Fig. 3f,i). Further characterization using HHpred (Methods) revealed moderate-coverage alignments to various DNA-binding proteins, none of which were previously associated with Acr activity (Extended Data Fig. 6a,b).
This suggests a potential mode of CRISPR inhibition that is not well characterized in existing functional databases.

Most notably, EvoAcr1 and EvoAcr2 represented proteins that eluded both sequence and structural characterization, showing no significant sequence identity to proteins in OpenGenome or BLAST nr (Fig. 3h,k). Further characterization of EvoAcr1 and EvoAcr2 using Dali (Methods) found no strong structural alignments to natural proteins (Extended Data Fig. 6a,c), although the low confidence scores of the predicted structures limited the reliability of this structural comparison (Fig. 3h and Extended Data Fig. 6d). In addition, a residue-level compositionality analysis (Methods) revealed that EvoAcr1 and EvoAcr2 required pieces from 28 and 31 different natural proteins, respectively, to achieve full sequence coverage (Extended Data Fig. 6e,f). This level of novelty is comparable with established de novo proteins such as RFdiffusion-generated serine hydrolases (21 proteins)45 and BindCraft-generated BBF-14 binders (29 proteins)46 and is substantially more novel than natural Acrs (2–6 proteins; Extended Data Fig. 6f).

Using more sensitive methods such as HHpred to characterize EvoAcr1 and EvoAcr2 also produced limited significant results, with EvoAcr1 having only low-coverage matches to proteins not thought to be Acrs and EvoAcr2 having no significant matches (Extended Data Fig. 6a,b). Despite this lack of similarity to known Acrs, both EvoAcr1 and EvoAcr2 demonstrated robust protection in both liquid culture and phage infection assays, with relative survival rates of 0.82 and 0.74 (out of 1.0), respectively. This experimental validation of novel, functional Acrs supports the ability of semantic design to leverage learned genomic organization patterns to access unexplored regions of sequence space.
Together with EvoAcr3–5, these results demonstrate that semantic design can guide the generation of diverse Acr proteins, from variants of natural proteins to entirely new sequences.

SynGenome: 120 gigabases of Evo-generated DNA

Following our validation that semantic design can generate functional proteins from genomic context alone, we reasoned that semantic design could be applied to create genes from across prokaryotic biology. To this end, we developed SynGenome, a database of synthetic DNA sequences designed using Evo. Applying the principles underlying semantic design, we prompted the model with 1.7 million prokaryotic and phage genes, generating sequences encompassing the broad functional diversity encoded in prokaryotic genomes.

To construct SynGenome, we leveraged the UniProt database to identify protein-coding genes and their adjacent sequences from prokaryotic organisms and bacteriophages. For each coding sequence, we extracted six distinct prompts: the upstream region, coding sequence, downstream region and their respective reverse complements (Fig. 4a). Using the Evo 1.5 model, we generated multiple synthetic sequences for each prompt, resulting in a database containing over 120 billion DNA base pairs (Methods).

Fig. 4: 120 billion base pairs of AI-generated genomic sequences with SynGenome. a, To construct SynGenome, we derived prompts from known protein-coding genes, generated synthetic DNA sequences using Evo and bioinformatically characterized the generated sequences. b, Number of prompts and generated sequences in SynGenome, along with their associated features. GO, Gene Ontology. c, Codon usage patterns in a representative sample of Prodigal-predicted ORFs from prompt and generated sequences show preservation of codon preferences (n = 36,762 sequences per sample). d, Bar plot showing relative proportions of generation and prompt sequences in the most populous Leiden clusters, with percentages indicating the fraction of generated sequences per cluster.
e, Distribution of Prodigal-predicted ORF lengths for representative samples of OpenGenome and SynGenome sequences. ORFs in SynGenome follow natural-looking length distributions. n = 36,762 sequences per sample. nt, nucleotide. f, Distribution of the number of Pfam protein family occurrences in representative samples of SynGenome and OpenGenome. Both follow a similar long-tailed distribution of family abundance. n = 36,762 sequences per sample. g, Scatterplot showing individual Pfam protein family frequencies in SynGenome and OpenGenome. Frequencies appear to be correlated, suggesting preservation of natural protein family abundance patterns. h, Circos plot showing the most enriched connections between Pfam clans found in natural sequence-derived prompts and generated responses across SynGenome. i, Scatterplots showing the most enriched prompt–response associations for two example DUFs in SynGenome. DUF2871 associates with cytochrome c and cytochrome oxidase domains, whereas DUF2797 associates with the rhomboid N-terminal domain, peptidase family M1 domain and Zn-dependent protease domains, consistent with previously hypothesized roles and domain associations for these DUFs48,49. j, Examples of chimeric proteins found in SynGenome representing potentially novel fusions of protein domains (Extended Data Fig. 7g,h).

To facilitate functional exploration of the database, we organized the generated sequences according to the Gene Ontology and InterPro domain annotations of their corresponding prompts47,48 (Fig. 4b), expecting enrichment for functionally related elements. After removing low-complexity sequences (Methods), we also used ESMFold to obtain predicted structures of 3.7 million putative protein-coding genes, enabling structure-based downstream analyses.

To characterize SynGenome, we first examined codon usage patterns between generated sequences and prompts.
This analysis revealed that generated sequences closely mirrored prompt sequences, maintaining similar codon preferences (Fig. 4c). We further examined the prompt–generation relationships in Evo embedding space by performing Leiden clustering (Fig. 4d and Extended Data Fig. 7a,b). As expected, most clusters contained a mix of prompt and generated sequences, indicating that generations and natural sequences generally occupied similar regions of embedding space. However, we observed that 54 clusters (19% of the generated sequences) consisted primarily of generated sequences, potentially highlighting synthetic sequences that extended beyond the semantic space occupied by natural sequence embeddings (Extended Data Fig. 7c).

When compared with sequences from OpenGenome, we found that SynGenome-generated open reading frames (ORFs) followed the natural prokaryotic ORF length distribution (Fig. 4e). At the protein level, SynGenome matched natural Pfam domain frequencies globally (Fig. 4f) and individually (Pearson correlation coefficient r = 0.78; Fig. 4g). These analyses demonstrate that SynGenome recapitulates the general characteristics and protein diversity of natural prokaryotic sequences.

To probe the functional relationships captured by SynGenome, we constructed and analysed a network linking protein families in SynGenome prompts to those in generated responses (Methods). We found that the protein clans in prompt–response pairs mirrored natural genomic colocalization patterns (Fig. 4h and Extended Data Fig. 7d,e), further supporting that genomic conditioning can guide the generation of functionally related responses. We then investigated whether the functional associations captured by SynGenome could provide functional hypotheses for domains of unknown function (DUFs). 
As one example, we observed that DUF2871 strongly co-occurred with ‘cytochrome c’ (co-occurring in 14 prompt–response pairs) and ‘N-terminal domain of cytochrome oxidase-cbb3, FixP’ (co-occurring in 25 prompt–response pairs), consistent with previous structural hypotheses linking DUF2871 to cytochrome c proteins49 (Fig. 4i and Extended Data Fig. 7f). These findings demonstrate that the genomic associations captured in SynGenome not only recapitulate known colocalization relationships but may also aid with function prediction. We have provided an interactive visualization of this network (https://evodesign.org/syngenome/network), which may also serve as a valuable tool to identify appropriate prompts for functions of interest.

In addition to capturing known domain associations, SynGenome also contains several chimeric proteins (Fig. 4j and Extended Data Fig. 7g,h) with potentially novel domain fusions not widely known to exist in natural proteins. These fusions could represent new functional combinations or provide insights into unexplored protein architectures, opening avenues for designing proteins with new or enhanced functional properties.

Together, these results highlight the potential for SynGenome to become a valuable tool for exploring and expanding protein function through semantic relationships. By searching through SynGenome, researchers could discover functionally related proteins that extend beyond natural sequence space, gain insights into potential functions of uncharacterized genes, and create diverse screening libraries for exploring context-informed functions (Extended Data Fig. 8a,b).

SynGenome, including all 120 billion generated base pairs and 3.7 million predicted structures, is freely available (https://evodesign.org/syngenome/). The database is searchable with protein names, UniProt IDs, InterPro domains, species names or Gene Ontology terms of interest. 
We anticipate that SynGenome can serve as a practical tool that facilitates gene discovery and engineering with semantic design for the broader scientific community.

Discussion

Advanced genomic sequence models, trained on hundreds of billions of DNA base pairs across prokaryotic life, can enable powerful capabilities for understanding and engineering biological systems. Here, we demonstrate that Evo enables controllable design of desired functions encoded in prokaryotic genomes by leveraging natural genomic contexts, achieving high experimental success rates of 17–50% when testing just tens of variants and surpassing success rates of many protein design methods15,45. Many of the designed proteins have no significant sequence identity to proteins of a similar function or, in some cases, to any known protein. These results blur the line between de novo protein design50,51,52 and diversification based on evolutionary models14,15,53, providing an ‘existence proof’ that sequence models can meaningfully generalize beyond natural sequences.

Semantic design represents a fundamentally new approach for protein design that is complementary to existing approaches. First, unlike methods using task-specific fine-tuning1,14,54, semantic design requires no additional training that could bias generations towards characterized examples. Second, in contrast with approaches that specify function through natural language descriptions from existing knowledgebases55, semantic design accesses the functional diversity embedded within genomic sequences. This enables functional design that can leverage biological processes that are not yet characterized. For example, we generated antitoxins that suggest broader functional compatibility between diverse toxin–antitoxin systems (Fig. 2f) and an anti-CRISPR that maps to a protein family with a different putative function (Fig. 3i). 
Third, by leveraging genomic context as functional conditioning, semantic design does not require structural or mechanistic hypotheses; indeed, protein design pipelines that filter out low-confidence structure predictions46,56 would have removed many of our functional designs. Semantic design therefore represents a powerful orthogonal approach to current biological design strategies.

Semantic design could be particularly valuable for generating novel starting points for directed evolution or rational protein design, providing access to functional protein sequences beyond the constraints of characterized natural sequences57. Genomic conditioning is also useful when specifying functions such as anti-CRISPR activity that could be accomplished by many structures and mechanisms40. Of note, semantic design is not limited to Evo 1.5 and can leverage any language model trained on prokaryotic or phage genomes. Improvements in genomic language models, as well as a better understanding of prokaryotic gene synteny, should therefore directly translate to improvements in semantic design.

Traditional biological sequence discovery using guilt by association, which motivates many ideas in this study, is constrained to observed evolutionary diversity generated over billions of years. By contrast, semantic design enables the rapid generation of extensive sequence diversity for a biological system of interest. To facilitate broader accessibility of this new source of sequence material, we have reported SynGenome, a database of 120 billion base pairs of AI-generated genomic sequences, which we have made publicly available. This resource enables researchers, especially those without resources to conduct large-scale sampling from generative models, to find synthetic sequences related to their function of interest. This data could potentially contain new molecular tools and provide insights into protein function and evolution (Fig. 4i,j).

Although semantic design represents a new level of sequence novelty and functional improvement for generative genomics, several fundamental limitations and challenges remain. Autoregressive generation is prone to sampling repetitive sequences or to hallucinating realistic but non-functional designs. In addition, semantic design may yield genes that are contextually related to the prompt but encode unrelated functions: for instance, generating regulatory proteins controlling the expression of a gene with a desired function rather than the gene itself. Semantic design therefore requires both in silico filtering and experimental testing to validate downstream functions. Semantic design is also limited to functions encoded by contextual relationships in nature, particularly in prokaryotic organisms. However, we note that only a small fraction of prokaryotic functional diversity has been discovered and that the mining of this diversity has led to powerful biotechnologies such as PCR, optogenetics and genome editing58,59,60. Functional conditioning based on gene synteny also does not extend to many eukaryotic design tasks; however, future eukaryotic applications of semantic design could potentially leverage learned associations between proximal coding and non-coding DNA or within gene clusters.

Looking forward, the development of more capable pretrained models and an increase in sequencing data could reinforce the capabilities of semantic design. We also anticipate that combining the rich information learned by pretrained models with more advanced inference-time strategies will improve generation quality. Genomic language models that generate multi-component systems, as was done with our T2TA systems, could accelerate the development of synthetic biological circuits, pathways or even genomes. 
By leveraging semantic design, exploration of synthetic genomic space may reveal biological discoveries that complement and extend beyond those discovered in natural organisms.

Methods

Evo 1.5 pretraining

Evo 1.5 was generated by extending the pretraining of the Evo 1 model, which was originally trained at a sequence context of 8,192 tokens with an initial learning rate of 0.003 after a warmup of 1,200 iterations, followed by a cosine decay schedule with a maximum decay iteration of 120,000 and a minimum learning rate of 0.0003. The original training used a global batch size of 4,194,304 tokens and 75,000 iterations, processing a total of 315 billion tokens. Other details on hyperparameters related to the model architecture and optimizer can be found in ref. 1. Pretraining of the model, including all model states, optimizer states, data loading schedule and learning rate schedule, was resumed from 75,000 iterations to 112,000 iterations (processing a total of 470 billion tokens). The model trained up to 112,000 iterations is referred to as Evo 1.5.

As in Nguyen et al.1, Evo 1.5 was trained on the OpenGenome dataset, a comprehensive collection of prokaryotic genomic sequences. In brief, the dataset consists of approximately 80,000 prokaryotic genomes from bacteria and archaea and over 2 million phage and plasmid sequences, totalling approximately 300 billion nucleotides. OpenGenome was carefully curated to provide diverse, non-redundant genomic sequences from three primary sources: (1) bacterial and archaeal genomes from the Genome Taxonomy Database61, (2) prokaryotic viral sequences from the IMG/VR database62, and (3) plasmid sequences from the IMG/PR database63. To reduce redundancy, only representative genomes for each species, viral operational taxonomic unit, or plasmid taxonomic unit were retained. The dataset was further filtered to exclude potential eukaryotic viruses and sequences with poor taxonomic specificity. 
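The token totals quoted for Evo 1 and Evo 1.5 pretraining follow from the global batch size multiplied by the iteration count, and the stated warmup and cosine decay settings can be turned into a simple schedule. The sketch below is illustrative only: the function names are hypothetical, and the linear warmup and half-cosine decay shape are assumptions, since the text gives the schedule parameters but not its exact functional form.

```python
import math

GLOBAL_BATCH_TOKENS = 4_194_304  # tokens per iteration, as stated in the text


def tokens_processed(iterations: int) -> int:
    """Total tokens seen after a given number of training iterations."""
    return iterations * GLOBAL_BATCH_TOKENS


def learning_rate(step: int,
                  peak_lr: float = 0.003,
                  min_lr: float = 0.0003,
                  warmup: int = 1_200,
                  max_decay: int = 120_000) -> float:
    """Hypothetical schedule: linear warmup, then cosine decay to min_lr.

    ASSUMPTION: the warmup is linear and the decay is a half cosine that
    reaches min_lr at max_decay; only the parameter values come from the text.
    """
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(1.0, (step - warmup) / (max_decay - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(round(tokens_processed(75_000) / 1e9))   # 315 (billion tokens, Evo 1)
print(round(tokens_processed(112_000) / 1e9))  # 470 (billion tokens, Evo 1.5)
```

Under these assumptions, training stopped at 112,000 iterations ends slightly before the schedule reaches its minimum learning rate at iteration 120,000.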
The training data include both positive and negative strands of the genomic sequences, allowing the model to learn the complementary nature of DNA. A detailed description of the dataset curation process is provided in Nguyen et al.1.

Autoregressive sampling

To sample from Evo, a standard low-temperature autoregressive sampling algorithm was used. Sampling code (https://github.com/evo-design/evo/) based on the reference implementation (https://github.com/togethercomputer/stripedhyena) that leverages kv-caching of Transformer layers and the recurrent formulation of hyena layers64,65 was used to achieve efficient, low-memory autoregressive generation. Parameter optimization was performed across various temperatures (0.1–1.5, increments of 0.1), top-k values (1–4) and top-p values (0.1–1.0, increments of 0.1) using the Evo 1.5 model. For each parameter combination, 100 sequences of 1,000 nt were generated from a test set of five gene prompts encoding 50% of a highly conserved protein. Following identification of ORFs in generated sequences using Prodigal (v2.6.3) with default parameters in metagenome mode (-p meta)66, generated proteins were aligned against the full-length prompt protein sequence using MAFFT (v7.526)67 for sequence identity calculations. For evaluating sequence degeneracy, DustMasker (v2.14.1+galaxy2)68,69,70 was run across the full-length generations using default parameters and the proportion of masked nucleotides was calculated. Final parameters (temperature = 0.7, top-k = 4, top-p = 1.0) were selected based on maximizing sequence completion accuracy while maintaining DustMasker proportions below 0.2, a value chosen to be slightly higher than the typical frequency of non-coding DNA in prokaryotic genomes71,72. 
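To make the roles of temperature, top-k and top-p concrete, the following is a minimal sketch of a single sampling step over a nucleotide vocabulary. It is not the Evo sampling code: the function name, the dictionary-of-logits input and the order in which the top-k and top-p filters are applied are all assumptions made for illustration.

```python
import math
import random


def sample_next_token(logits, temperature=0.7, top_k=4, top_p=1.0, rng=random):
    """Draw one token from a dict of {token: logit}.

    ASSUMPTION: temperature scaling is applied first, then top-k truncation,
    then nucleus (top-p) truncation, then renormalization.
    """
    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda item: item[1], reverse=True)

    # Keep only the k most probable tokens.
    ranked = ranked[:top_k]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens and sample.
    norm = sum(prob for _, prob in kept)
    tokens = [tok for tok, _ in kept]
    probs = [prob / norm for _, prob in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical logits for the next base.
example_logits = {"A": 2.0, "C": 1.0, "G": 0.5, "T": -1.0}
print(sample_next_token(example_logits))
```

With a four-letter nucleotide vocabulary, top-k = 4 and top-p = 1.0 leave the distribution untruncated, so in this sketch the selected settings reduce to temperature-only sampling.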
All code for sampling and downstream analysis using Evo was written in Python (v3.11.8).

Sequence completion prompt compilation and evaluation

Sequences of highly conserved genes from across prokaryotic biology were downloaded in FASTA format from NCBI GenBank73. Selected genes included rpoS from E. coli K-12 (GenBank: NC_000913.3, coordinates c2866559–2867551), gyrA from S. enterica LT2 (GenBank: NC_003197.2, coordinates c2373710–2376422) and ftsZ from H. volcanii DS2 (GenBank: NC_013967.1, coordinates 643397–644536). Prompts were prepared by extracting 30%, 50% and 80% sequence lengths from the 5′ end. For the analysis of the gene completion ability of Evo 1.5 using sequences with varying levels of conservation, sequences encoding moderately conserved (gloA and pilA) and poorly conserved (tnsA and yagL) genes were used: gloA from E. coli K-12 (GenBank: U57363.1, coordinates 1–408), pilA from Pseudomonas aeruginosa (GenBank: AE004091.2, coordinates c5069082–5069531), tnsA from E. coli O3 (GenBank: NZ_JALKIH010000010.1, coordinates 27218–28039) and yagL from E. coli K-12 (GenBank: U73857.1, coordinates c1018–1716).

Sequence completion performance was evaluated across varying prompt lengths (30%, 50% and 80% of input sequence) using optimal sampling parameters for the Evo 1.5, Evo 1 8K (previous version of Evo trained with context length of 8,192 tokens) and Evo 1 131K (previous version of Evo trained with context length of 131,072 tokens) models (temperature = 0.7, top-k = 4, top-p = 1)1. For each prompt, 500 sequences of length 2,500 nt were generated and filtered to remove generations with DustMasker proportions above 0.2. Prompts were subsequently appended to the start of each generated sequence and ORFs were identified using Prodigal (v2.6.3) with default parameters in metagenome mode (-p meta). 
Generated proteins were then aligned against their corresponding natural sequences using MAFFT (v7.526) with default parameters for sequence identity calculations.

Operon completion prompt compilation and evaluation

Sequences encoding the trp operon and modABC operon from E. coli K-12 (GenBank: NC_000913.3) were downloaded in FASTA format from NCBI GenBank. For modABC, prompts were prepared from the full coding sequences for modA (coordinates 795089–795862), modB (coordinates 795862–796551), modC (coordinates 796554–797612) and acrZ (coordinates 794773–794922). For trp, prompts were prepared from the full coding sequences for trpE (coordinates c1321384–1322946), trpD (coordinates c1319789–1321384), trpC (coordinates c1318427–1319788), trpB (coordinates c1317222–1318415) and trpA (coordinates c1316416–1317222). For testing the ability of the model to generate sequences on the antisense strand, the reverse complement of each sense strand-derived prompt sequence was generated using Biopython74.

Sequence completion performance was evaluated across the compiled operon completion prompts using previously identified optimal sampling parameters for Evo 1.5. For each prompt, 5,000 sequences of length 2,500 nt were generated. Following filtering of generations with DustMasker proportions above 0.2 and identification of ORFs using Prodigal (v2.6.3, default parameters, -p meta), directional completion was assessed by searching trpE-prompted generations for trpD-like ORFs, trpD reverse complement-prompted generations for trpE-like ORFs, and similar pairing combinations across both modABC and trp operons. Protein sequences were then aligned against their corresponding wild-type proteins for sequence identity calculations using MAFFT (v7.526). Structural similarity was evaluated by generating protein structure predictions using AlphaFold 3 (ref. 75) for both generated and wild-type sequences, with structural alignments and TM-score calculations performed using TM-align76. 
Natural and predicted protein structures were subsequently visualized using ChimeraX77.

Positional entropy evaluation

Per-position amino acid and nucleotide entropies were calculated from multiple sequence alignments of 500 generated and natural modB and trpA sequences. Natural modB and trpA sequences were fetched by querying ‘ModB’ and ‘TrpA’ in NCBI protein, filtering by bacteria, and downloading the corresponding amino acid and nucleotide sequences in ‘FASTA’ and ‘FASTA CDS’ formats, respectively. Generated modB and trpA sequences were chosen by selecting a random sample of 500 modB and trpA ORFs (more than 80% sequence identity to E. coli sequence) from the modB and trpA sequences generated by prompting with modA and trpB, respectively, during the operon completion evaluation. First, nucleotide and amino acid sequences were aligned with MAFFT (v7.526) and trimmed to remove gaps appearing in more than 80% of sequences. For each alignment position, the Shannon entropy was then calculated as H = −Σ_i p_i × log2(p_i), where p_i represents the frequency of amino acid or nucleotide i at that position, and normalized by dividing the calculated entropy by the maximum Shannon entropy (2 for nucleotides, meaning all four bases are equally present; 4.32 for amino acids, meaning all 20 standard amino acids are equally present).

Analysis of species tag-prompting methods

For evaluation of the effect of species-specific prompting on sequence completion, sequences encoding dnaK (GenBank: NC_000913.3, coordinates 12163–14079) and recA (GenBank: U00096.3, coordinates c2822708–2823769) in E. coli K-12 and secY (GenBank: AF395886.1, coordinates 203–1669) and tfb2 (GenBank: AF143693.1, coordinates 140–1138) in H. volcanii were downloaded in FASTA format from NCBI GenBank73. Base prompts were prepared by extracting 50% of the sequence lengths from the 5′ end. 
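The normalized per-position Shannon entropy described in the section ‘Positional entropy evaluation’ can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the function names are hypothetical, and ignoring gap characters within a column is an assumption, because the text only states that gap-rich columns were trimmed beforehand.

```python
import math
from collections import Counter


def normalized_entropy(column, alphabet_size):
    """Shannon entropy of one alignment column, divided by log2(alphabet_size).

    ASSUMPTION: gap characters ('-') are ignored when counting frequencies.
    """
    residues = [symbol for symbol in column if symbol != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    # H = sum_i p_i * log2(1 / p_i), which equals -sum_i p_i * log2(p_i).
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return entropy / math.log2(alphabet_size)


def positional_entropies(alignment, alphabet_size):
    """Normalized entropy for every column of equal-length aligned sequences."""
    return [normalized_entropy(column, alphabet_size) for column in zip(*alignment)]


# Toy nucleotide alignment (alphabet size 4, maximum entropy log2(4) = 2).
toy_alignment = ["ACGT", "ACGA", "ACCC", "ATTG"]
print(positional_entropies(toy_alignment, alphabet_size=4))
```

For amino acid alignments the same function would be called with `alphabet_size=20`, giving the normalizing constant log2(20) ≈ 4.32 quoted in the text.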
To generate species-specific prompt tags, the specific domain, phylum, class, family, genus, order and species information was extracted for each species from the Global Biodiversity Information Facility API (https://api.gbif.org/) and appended to the start of each base prompt. The species-specific prompts used are shown below:

(1) |d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia | |

(2) |d__Archaea;p__Halobacteriota;c__Halobacteria;o__Haloferacales;f__Haloferacaceae;g__Haloferax;s__Haloferax volcanii | |

Sequence completion performance was evaluated by sampling the Evo 1 131K model using prompts with and without appended species tags following the method described in the section ‘Sequence completion prompt compilation and evaluation’ above.

To evaluate the effect of species-specific prompts on sequence entropy, prompts encoding fusA (GenBank: AH002539.2, coordinates 1243–1521) upstream of the evaluated tufA gene encoding EF-Tu from E. coli K-12 were fetched as described in the section ‘Operon completion prompt compilation and evaluation’. Sampling was performed as described before, but with species-specific prompt tags appended to the start of each prompt. Positional entropy was determined for tufA sequences generated with and without species tags as described in the section ‘Positional entropy evaluation’.

Amino acid substitution analysis

Generated (see ‘Operon completion prompt compilation and evaluation’) and natural sequences encoding genes in the trp operon and modABC operon were first filtered to select for those with over 95% and less than 100% minimum amino acid sequence identity to their respective wild-type E. coli K-12 reference sequence using MMseqs2. From these filtered sequences, 100 sequences encoding each of modA, modB, modC, acrZ, trpA, trpB, trpC, trpD and trpE were randomly selected for both the Evo-generated and natural sequence groups. 
These sequences were subsequently aligned against their respective E. coli K-12 reference sequence using pairwise alignment with MAFFT (v7.526) and default parameters. Following alignment, amino acid substitutions were identified at each aligned position by comparing variant residues to the E. coli K-12 reference, excluding gap positions from analysis. Each substitution was then scored using the BLOSUM62 (ref. 78) matrix, with scores greater than or equal to 0 indicating biochemically conservative changes and scores less than 0 indicating non-conservative changes. BLOSUM scores for substitutions across all evaluated genes in the trp and modABC operons were then aggregated and plotted to get the final distribution of BLOSUM scores.

T2TA prompt compilation and analysis

Genomic loci and sequences encoding T2TA system sequences were obtained by downloading the nucleotide sequence information for all experimentally validated T2TAs from the TADB 3.0 database24. Using the NCBI Entrez Programming Utilities API (EFetch from the nuccore database using the genomic loci from TADB 3.0)73, the 500 bp of upstream and downstream flanking sequence were extracted for each T2TA locus. In total, for each T2TA system, eight types of prompts were prepared: (1) individual toxin sequences, (2) individual antitoxin sequences, (3) the reverse complement of individual toxin sequences, (4) the reverse complement of individual antitoxin sequences, (5) the upstream context of the toxin loci, (6) the downstream context of the toxin loci, (7) the upstream context of the antitoxin loci, and (8) the downstream context of the antitoxin loci. 
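The eight prompt types listed above can be derived mechanically from a contig and the coordinates of the two genes. The sketch below is a hypothetical illustration rather than the pipeline used in the study: the function names, the 0-based half-open coordinate convention, the flank being measured from each gene boundary and the assumption that both genes lie on the forward strand are all choices made here for simplicity.

```python
# Translation table for complementing DNA bases (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_prompts(contig: str, toxin: tuple, antitoxin: tuple, flank: int = 500) -> dict:
    """Return the eight prompt types for one toxin-antitoxin locus.

    ASSUMPTIONS: `toxin` and `antitoxin` are (start, end) pairs in 0-based,
    half-open coordinates on the forward strand, and the flanking context is
    taken immediately adjacent to each gene.
    """
    def upstream(start, _end):
        return contig[max(0, start - flank):start]

    def downstream(_start, end):
        return contig[end:end + flank]

    toxin_seq = contig[toxin[0]:toxin[1]]
    antitoxin_seq = contig[antitoxin[0]:antitoxin[1]]
    return {
        "toxin": toxin_seq,
        "antitoxin": antitoxin_seq,
        "toxin_rc": reverse_complement(toxin_seq),
        "antitoxin_rc": reverse_complement(antitoxin_seq),
        "toxin_upstream": upstream(*toxin),
        "toxin_downstream": downstream(*toxin),
        "antitoxin_upstream": upstream(*antitoxin),
        "antitoxin_downstream": downstream(*antitoxin),
    }


# Toy contig with made-up gene coordinates and a short flank.
toy_contig = "A" * 10 + "CCCGGG" + "T" * 4 + "GATTACA" + "C" * 10
for name, prompt in build_prompts(toy_contig, (10, 16), (20, 27), flank=5).items():
    print(name, prompt)
```

A real implementation would also need to handle genes annotated on the reverse strand, where upstream and downstream are swapped relative to contig coordinates.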
Following successful identification of an Evo-generated toxin (see ‘Evaluation of types II and III toxin activity’ below), conjugate antitoxins were subsequently generated via prompting with the generated DNA sequence encoding the toxin.

To evaluate the frequency with which each prompt type generated toxin and antitoxin sequences, remaining protein sequences following sequence complexity and pDockQ filtering (see ‘T2TA sampling and filtering’) were evaluated using HMMER (v3.3.0) hmmscan (https://hmmer.org) against the Pfam-A database (v35.0)79,80. Generations with Pfam matches against known type II toxin or antitoxin-related families were counted as hits and mapped back to the prompt type used to generate the sequence, with the generation frequency calculated as the number of toxin or antitoxin hits divided by the total number of remaining generations for each prompt classification.

T3TA prompt compilation and analysis

Genomic loci and sequences encoding T3TA system sequences were obtained by downloading the nucleotide sequence information for all experimentally validated and computationally predicted T3TAs from the TADB 3.0 database24. Using the NCBI Entrez Programming Utilities API (EFetch from the nuccore database using the genomic loci from TADB 3.0), the 1,000 bp of upstream and downstream flanking sequence were extracted for each T3TA locus. 
For each T3TA system, eight types of prompts were prepared: (1) individual toxin sequences, (2) individual antitoxin sequences, (3) the reverse complement of individual toxin sequences, (4) the reverse complement of individual antitoxin sequences, (5) the upstream context of the toxin loci, (6) the downstream context of the toxin loci, (7) the upstream context of the antitoxin loci, and (8) the downstream context of the antitoxin loci.

To evaluate the frequency with which each prompt type generated toxin and antitoxin sequences, generations with Rfam or Pfam matches against known T3TA-related families (see ‘T3TA sampling and filtering’) were mapped back to their prompt type and classified as hits. The overall generation frequency was calculated by dividing the number of hits by the total number of remaining generations for each prompt classification.

T2TA sampling and filtering

To generate T2TA candidates, 53,104 sequences of 2,000 nucleotides each were first generated using Evo 1.5 (temperature = 0.7, top-k = 4, top-p = 1.0) from our compiled T2TA prompts. A multi-stage filtering pipeline was then applied to identify promising candidates. First, Prodigal (v2.6.3, default parameters, -p meta) was used to identify ORFs, excluding sequences containing proteins over 300 amino acids or less than to amino acids, resulting in 130,754 called proteins total66. Next, SegMasker (v2.14.1+galaxy2) with default parameters69 was used to remove sequences containing low-complexity regions with limited amino acid diversity, with 58,704 proteins remaining post-filtering. Next, any proteins that belonged to generations with only one passing protein were removed, resulting in 32,181 remaining protein candidates.

The protein–protein interaction potential of co-generated ORFs was then assessed by co-folding all ORF pairs within each remaining generation using ESMFold17. 
Generations were retained if they contained paired proteins with pDockQ81 scores greater than 0.23 and individual pLDDT scores greater than 0.3, resulting in 945 remaining pairs of proteins. Following the removal of any protein pairs with more than 40% sequence identity (MMseqs282) between both proteins, 777 proteins remained. To identify novel candidates, the remaining sequences were searched against the non-redundant protein sequence database using BLAST68 (e-value cut-off of 0.05), selecting for generations containing at least one component with no significant BLAST hits to known toxins or antitoxins and the other component matching a known toxin or antitoxin. Following this filtering step, a total of 36 protein pairs remained. Ten final toxin candidates were then selected based on high-confidence interaction prediction using AlphaFold 3 (ref. 75).

Following the identification of functional toxin candidates via experimental testing, in which two toxins were found to be active, four could not be successfully cloned and three were inactive, the Evo-generated sequence encoding the strongest Evo-generated toxin, EvoRelE1, was used as a prompt to generate further diversified antitoxin candidates. After generating a total of 3,000 sequences from the EvoRelE1 prompt, 7,708 generated ORFs were filtered as above (744 remaining) before being co-folded with EvoRelE1. As with the first round of generations, candidates were filtered for high pDockQ scores, moderate pLDDT scores using ESMFold-derived co-folds (122 candidates remaining) and less than 40% identity to known antitoxins using BLAST (43 candidates remaining) before being evaluated for strong predicted co-folds using AlphaFold 3 (ipTM > 0.7). 
Remaining antitoxin candidates were further characterized using Foldseek Search Server83 searches of the AlphaFold 3-predicted structures (probability threshold of 0.6), blastp searches against the non-redundant protein database (e-value threshold of 1) and HHpred searches (probability threshold of more than 90%)84 to select a final set of ten antitoxin candidates.

T3TA sampling and filtering

To generate T3TA candidates, 25,960 sequences of 3,000 nucleotides each were first generated using Evo 1.5 (temperature = 0.7, top-k = 4, top-p = 1.0) from our compiled T3TA prompts. A multi-stage filtering pipeline was then applied to identify promising candidates. First, Prodigal (v2.6.3, default parameters, -p meta) was used to identify ORFs66, excluding sequences containing proteins over 400 amino acids or less than 50 amino acids, resulting in 80,298 called proteins total. Next, SegMasker (v2.14.1+galaxy2) with default parameters69 was used to remove sequences containing low-complexity regions with limited amino acid diversity, with 34,131 proteins remaining post-filtering. On sequences with at least one high-quality ORF present, ESMFold and Tandem Repeats Finder85 were used to identify generations with at least one ORF with a pLDDT > 0.3 and at least one tandem repeat, respectively. Tandem Repeats Finder was run using parameters: match = 2, mismatch = 7, delta = 7, PM = 80, PI = 10, minscore = 50, maxperiod = 2,000, resulting in 3,847 remaining generations. Consensus and full repeats were subsequently folded using ViennaRNA’s RNAfold86,87. Following filtering to remove any called tandem repeats with no hairpins, minimum free energies of more than −3.0 and without all four nucleotides present, a total of 428 sequences remained.

Remaining ORFs and identified tandem repeats from the filtered generations were subsequently evaluated using HMMER (v3.3.0) hmmscan against the Pfam-A database80 (v35.0) and rnascan88 against the Rfam database (v15.0)89, respectively. 
Tandem repeats were also run against the AbiF5_iter3.CM and the diverse_rna_xinsi.CM files from Zilberzwige-Tal et al.90 using Infernal’s cmscan91. As a point of comparison, natural T3TA sequences were also run against Pfam-A, Rfam and the AbiF-related covariance models. Generated sequences were retained if they had a match (E