Genome modelling and design across all domains of life with Evo 2


Published: 04 March 2026 | Open access | Nature (2026)

Garyk Brixi, Matthew G. Durrant, Jerome Ku, Mohsen Naghipourfar, Michael Poli, Gwanggyu Sun, Greg Brockman, Daniel Chang, Alison Fanton, Gabriel A. Gonzalez, Samuel H. King, David B. Li, Aditi T. Merchant, Eric Nguyen, Chiara Ricci-Tam, David W. Romero, Jonathan C. Schmok, Ali Taghibakhshi, Anton Vorontsov, Brandon Yang, Myra Deng, Liv Gorton, Nam Nguyen, Nicholas K. Wang, Michael T. Pearce, Elana Simon, Etowah Adams, Zachary J. Amador, Euan A. Ashley, Stephen A. Baccus, Haoyu Dai, Steven Dillmann, Stefano Ermon, Daniel Guo, Michael H. Herschl, Rajesh Ilango, Ken Janik, Amy X. Lu, Reshma Mehta, Mohammad R. K. Mofrad, Madelena Y. Ng, Jaspreet Pannu, Christopher Ré, John St. John, Jeremy Sullivan, Joseph Tey, Ben Viggiano, Kevin Zhu, Greg Zynda, Daniel Balsam, Patrick Collison, Anthony B. Costa, Tina Hernandez-Boussard, Eric Ho, Ming-Yu Liu, Thomas McGrath, Kimberly Powell, Sudarshan Pinglay, Dave P. Burke, Hani Goodarzi, Patrick D. Hsu and Brian L. Hie

Subjects: Evolutionary biology, Genomics, Machine learning

Abstract

All of life encodes information with DNA. Although tools for genome sequencing, synthesis and editing have transformed biological research, we still lack sufficient understanding of the immense complexity encoded by genomes to predict the effects of many classes of genomic changes or to intelligently compose new biological systems. Artificial intelligence models that learn information from genomic sequences across diverse organisms have increasingly advanced prediction and design capabilities1,2. Here we introduce Evo 2, a biological foundation model trained on 9 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life, with a 1-million-token context window at single-nucleotide resolution. Evo 2 learns to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific fine-tuning. Mechanistic interpretability analyses reveal that Evo 2 learns representations associated with biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements and prophage genomic regions. The generative abilities of Evo 2 produce mitochondrial, prokaryotic and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Evo 2 also generates experimentally validated chromatin accessibility patterns when guided by predictive models3,4 and inference-time search. We have made Evo 2 fully open, including model parameters, training code5, inference code and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

Main

Biological research spans scales from molecules to systems to organisms, seeking to understand and design functional components across all domains of life.
Creating a machine to design functions across the diversity of life would require it to learn a deep, generalist representation of biological complexity. Although this complexity surpasses straightforward human intuition, advances in artificial intelligence offer a universal framework that leverages data and compute at scale to uncover higher-order patterns6,7. We reasoned that training a model with these capabilities would require data spanning the full spectrum of biological diversity to discover emergent properties similar to those found in other fields8.

We previously demonstrated that machine learning models trained on prokaryotic genomic sequences can model the function of DNA, RNA and proteins, as well as their interactions that create complex molecular machines1,2. Here we present Evo 2, a biological foundation model trained on a representative snapshot of genomes spanning all domains of life. We extend the sequence modelling paradigm to the scale and complexity of eukaryotic genomes through advances in data curation, model architecture, large-scale pre-training, advanced interpretability methods and inference-time prediction and generation approaches.

Emphasizing generalist capabilities over task-specific optimization, Evo 2 represents an important milestone in biological sequence modelling, laying a broad foundation for prediction and design tasks that are relevant to all modalities of the central dogma, that span molecular to genome scale and that generalize across all domains of life.

Evo 2 architecture, training, and data

Evo 2 was trained on prokaryotic and eukaryotic genetic sequences, with potential downstream utility for predictive and generative tasks across multiple scales of complexity (Fig. 1a). We trained two versions of Evo 2: a smaller version with 7 billion parameters trained on 2.4 trillion tokens (Evo 2 7B), and a larger version with 40 billion parameters trained on 9.3 trillion tokens (Evo 2 40B).
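The training-data overview (Fig. 1b) embeds each genome on the basis of its k-mer frequencies before UMAP projection. As an illustrative sketch of that representation (not the paper's exact preprocessing), a normalized k-mer frequency vector can be computed as follows:

```python
from itertools import product

def kmer_frequencies(seq, k=3):
    """Normalized k-mer frequency vector over the 4**k possible DNA k-mers.
    Windows containing characters outside ACGT are skipped."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in kmers]

vec = kmer_frequencies("ACGTACGTACGT", k=2)
print(len(vec), round(sum(vec), 6))  # 16 2-mers, frequencies sum to 1.0
```

Vectors like these give every genome a fixed-length description regardless of its size, which is what makes a single scatter plot across trillions of nucleotides possible.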
This new training dataset, which we call OpenGenome2, was compiled from curated, non-redundant nucleotide sequence data with a total of more than 8.8 trillion nucleotides from bacteria, archaea, eukarya and bacteriophage (Fig. 1b and Extended Data Fig. 1a).

Fig. 1: Overview of model architecture, training procedure, datasets and evaluations for Evo 2. a, Evo 2 models DNA sequence and enables applications across the central dogma, scaling from molecules to genomes and spanning all domains of life. b, Evo 2 was trained on data encompassing trillions of nucleotide sequences from all domains of life. Each point in the UMAP (uniform manifold approximation and projection) graph represents a single genome in the training dataset that is embedded on the basis of the genome's k-mer frequencies. Arabidopsis thaliana, Bacillus subtilis, Bacteroides fragilis, Caenorhabditis elegans, Chlamydomonas reinhardtii, D. melanogaster, E. coli, Gallus gallus, Gorilla gorilla, Haloferax volcanii, Homo sapiens, Mycobacterium tuberculosis, Pan troglodytes, Pseudomonas aeruginosa, S. cerevisiae and Tetrahymena thermophila are highlighted. c, A two-phase training strategy was used to optimize model performance while expanding the context length up to 1 million base pairs to capture wide-ranging biological patterns. M. genitalium, Mycoplasma genitalium; TAD, topologically associating domain. d, Novel data augmentation and weighting approaches prioritize functional genetic elements during pretraining and long-sequence composition during midtraining. GTDB, Genome Taxonomy Database; IMG/VR, Integrated Microbial Genomes/Virus database. e, The number of tokens used to train Evo 2 40B and 7B, split into the shorter sequence pretraining and the long context midtraining. f, Schematic of the new multi-hybrid StripedHyena 2 architecture, showing the efficient block layout of short explicit (SE), medium regularized (MR) and long implicit (LI) hyena operators.
g, Comparison of iteration time at 1,024-GPU, 40B scale between StripedHyena 2, StripedHyena 1 and Transformers, showing improved throughput. h, Validation perplexity of Evo 2 midtraining comparing the model size and context length, showing benefits with scale and increasing context length. i, A modified needle-in-a-haystack task was used to evaluate long-context recall ability up to 1 million tokens of sequence length, and shows that Evo 2 performs effective recall at a 1-million-token context.

Both Evo 2 7B and 40B are trained in two phases to capture biological length scales from molecular to organismal (Fig. 1c–e). Our first stage of pretraining uses a context length of 8,192 tokens, with data weighting focused on genic windows to learn functional genetic elements, followed by a multi-stage midtraining phase over which we extend the context length of Evo 2 to 1 million tokens to learn the relationships between elements across long genomic distances (Fig. 1c–e and Methods). This matches best practice in natural language, in which initial pretraining at shorter context lengths improves both efficiency and overall model quality9,10,11. As in Evo 1, we excluded genomic sequences from viruses that infect eukaryotic hosts from the training data for biosafety purposes. We verified that these data exclusions led to high perplexity on genomic sequences from eukaryotic viruses (Extended Data Fig. 2a), indicating poor language modelling performance in this domain.

Evo 2 uses StripedHyena 2, a convolutional multi-hybrid architecture5 that relies on a combination of three different variants of input-dependent convolution operators12 and attention (Fig. 1f and Extended Data Fig. 1b), improving training efficiency at scale on both short and long sequences, as well as allowing each layer to model interactions at variable distances.
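Perplexity, used here both to track midtraining quality and to confirm poor modelling of the excluded eukaryotic viral genomes, is the exponential of the mean per-token negative log-likelihood. A minimal sketch:

```python
import math

def perplexity(logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities:
    exp of the negative mean log-likelihood. Lower is better; a model with
    high perplexity on a domain (for example, deliberately excluded
    eukaryotic viral genomes) models that domain poorly."""
    return math.exp(-sum(logprobs) / len(logprobs))

# A model that assigns uniform probability 1/4 to each nucleotide
# has perplexity exactly 4 on DNA.
uniform = [math.log(0.25)] * 100
print(round(perplexity(uniform), 6))  # 4.0
```

A perplexity near 4 on DNA therefore means "no better than guessing", which is the desired outcome on the withheld viral sequences.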
StripedHyena 2 provides substantially higher throughput (at 40 billion parameters, up to 3× speedup at 1 million context length) than highly optimized Transformer6 baselines and previous-generation hybrid models based on recurrences or long convolutions, such as StripedHyena 1 (ref. 13) (Fig. 1g). StripedHyena 2 also improves loss scaling on DNA against both Transformers and StripedHyena 1 (Extended Data Fig. 1c), thereby achieving lower prediction error with the same amount of training data and enabling more efficient use of computational resources.

We extend the context length up to 1 million base pairs through a multi-stage extension phase, which showed improvements in loss with both model scale and longer context (Fig. 1h). With a synthetic long-context evaluation called 'needle-in-a-haystack', we show that Evo 2 can identify and predict the value of a specific 100-base-pair sequence (the needle) hidden within 1 million base pairs of random DNA (the haystack), serving as a synthetic quality check that the model can retrieve information from its full context window, as desired for long-context models (Fig. 1i and Extended Data Fig. 1d,e).

Evo 2 learns evolutionary constraint

By learning the likelihood of sequences across vast evolutionary datasets, biological sequence models capture conserved sequence patterns that often reflect functional importance. These constraints allow the models to perform zero-shot prediction without any task-specific fine-tuning or supervision1,14,15,16. Here, likelihood refers to the probability that the model assigns to a given sequence, where mutations that reduce this probability are predicted to be deleterious. Given that Evo 2 learns a likelihood landscape across all three modalities of the central dogma (DNA, RNA and protein) and all three domains of life, we sought to assess whether Evo 2 could perform mutational effect prediction across these modalities and organisms (Fig. 2a).

Fig. 2: Evo 2 predicts mutational effects on protein, RNA and organismal fitness across all domains of life. a, Evo 2-predicted zero-shot likelihoods can be used to predict the effects of DNA, RNA or protein mutations on molecular function or organismal fitness. WT, wild type. b, Effects on Evo 2 prediction of sequence likelihood caused by mutations along gene start sites for various model species across the domains of life. See Extended Data Fig. 3a,b for additional analyses. T. kodakarensis, Thermococcus kodakarensis. c,d, For different prokaryotic (c) and eukaryotic (d) sequences, the likelihoods of different types of mutations in different genomic elements were scored using Evo 2 7B. Scatter represents the median change in likelihood from wild-type to mutant sequence per species, coloured by domain (c) or kingdom (d). Horizontal line indicates the median of the scatter distribution. lncRNA, long noncoding RNA; snRNA, small nuclear RNA. e, Mutational likelihoods were used to assess the ability of Evo 2 to differentiate between genomic sequences of model organisms on the basis of their usage of different stop codons. Shown are the standardized medians of delta likelihood values across 5 species, where medians were calculated across approximately 4,100 randomly selected mutation loci. M. pneumoniae, Mycoplasma pneumoniae; P. tetraurelia, Paramecium tetraurelia. f, DMS assays were used to assess the Spearman correlation of zero-shot likelihoods from models with experimental assays. Notably, Evo 1 and GenSLM were exclusively trained on prokaryotic datasets. g, Schematic of our single-nucleotide resolution exon classifier based on embeddings from Evo 2. h, Single-nucleotide exon classifiers were trained on embeddings from Evo 2, Nucleotide Transformer (NT) and Evo 1, and were evaluated on the basis of their AUROC across eight held-out species.
Performance was compared to SegmentNT-30 kb multispecies (asterisks indicate species in SegmentNT training data), to ab initio AUGUSTUS, and to baseline nucleotide content and conservation metrics. D. rerio, Danio rerio; H. vulgare, Hordeum vulgare; S. moell., Selaginella moellendorffii; S. oleracea, Spinacia oleracea; T. cacao, Theobroma cacao; V. vinifera, Vitis vinifera. i, Genome browser track showing predictions from the Evo 2 embedding-based exon classifier scanned across the human STOML2 locus, where the vertical axis is the predicted classifier score and the horizontal axis is genome position. j, Evo 2 predicts genes as essential or nonessential, as determined by experimental gene essentiality assays across bacterial, archaeal and phage species (shown as overlaid scatter), using the mutational likelihood of premature stop codon insertions (as a genetic perturbation).

To assess whether Evo 2 captures core biological principles, we first evaluated how single nucleotide variants (SNVs) affect Evo 2 likelihoods in the genomic sequences around the start codons of protein-coding genes. We introduced these mutations at each position in the wild-type sequence and calculated the resulting changes in Evo 2-predicted likelihoods across thousands of such loci (Fig. 2b and Extended Data Fig. 3a). We observed strong changes in the likelihood for mutations within the start codons in both prokaryotes and eukaryotes. This was followed by a three-base periodicity pattern reflecting the triplet codons, with changes at the wobble positions showing lower impact on likelihood. For both prokaryotic and eukaryotic genomes, we observed a pattern upstream of the coding DNA sequence (CDS) that was consistent with the locations of known consensus sequences associated with translation initiation, namely the Shine–Dalgarno sequence17 for prokaryotes and the Kozak sequence18 for eukaryotes. We also observed similar patterns for SNVs around stop codons (Extended Data Fig. 3b).

Next we measured the effect of mutations across a variety of both noncoding and coding sequences (Fig. 2c,d). Across 20 prokaryotic species and 16 eukaryotic species, we observed changes in model likelihoods consistent with known biological constraints. Non-synonymous mutations, premature stop codons and frameshift mutations caused much larger changes in likelihood than synonymous mutations. In noncoding regions, deletions in transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs) had much larger effects than deletions in intergenic and other noncoding loci, reflecting the known essential roles of these RNAs. The 40B model exhibited higher sensitivity to deletions in microRNA (miRNA) and small nucleolar RNA (snoRNA) sequences compared with the 7B model. Evo 2 also predicted that less efficiently translated codons had lower likelihoods than more efficient codons (Extended Data Fig. 3c–e).

Recognizing that our training data contained genomes with distinct genetic codes, we tested how different premature stop codons impacted species that differ in their stop codon usage (Fig. 2e). We found that the model learned the difference between the standard code (stop codons TAA, TAG and TGA), the mycoplasma code (Code 4, stop codons TAA and TAG) and the ciliate code (Code 6, stop codon TGA). When ciliate genomes were artificially recoded to the standard genetic code, Evo 2 predicted mutations from the standard stop codons as deleterious, demonstrating that the model relies on sequence context to determine the appropriate genetic code (Extended Data Fig. 3f).

Although Evo 2 likelihoods reflect the expected importance of different genetic alterations, a key question is whether these likelihoods also correlate with functional effects, which can be empirically measured via deep mutational scanning (DMS) of proteins and noncoding RNAs (ncRNAs).
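The mutational scans described above follow a common recipe: mutate each position, re-score the sequence with the model, and compare to wild type. A minimal sketch, with a hypothetical `toy_model` standing in for an autoregressive genome model's per-token log-probabilities (not Evo 2's actual interface):

```python
import math

def sequence_logprob(seq, token_logprob):
    """Autoregressive sequence log-likelihood: sum of per-token
    log-probabilities, each conditioned on the preceding context."""
    return sum(token_logprob(seq[:i], seq[i]) for i in range(len(seq)))

def saturation_scan(seq, token_logprob, alphabet="ACGT"):
    """In-silico saturation mutagenesis: substitute every alternative
    nucleotide at every position and record the change in sequence
    log-likelihood relative to wild type. Negative deltas mark
    substitutions the model considers deleterious."""
    wt = sequence_logprob(seq, token_logprob)
    deltas = {}
    for pos, ref in enumerate(seq):
        for alt in alphabet:
            if alt != ref:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                deltas[(pos, alt)] = sequence_logprob(mutant, token_logprob) - wt
    return deltas

# Hypothetical stand-in model: strongly prefers G after C, otherwise uniform.
def toy_model(context, nt):
    if context.endswith("C"):
        return math.log(0.85) if nt == "G" else math.log(0.05)
    return math.log(0.25)

deltas = saturation_scan("ACGT", toy_model)
print(deltas[(2, "A")])  # breaking the CG dinucleotide is penalized (negative)
```

Aggregating such deltas per position across thousands of loci produces the position-wise profiles around start and stop codons shown in Fig. 2b.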
Although state-of-the-art methods for this task tend to leverage both sequence alignments and structural conditioning, general-purpose single-sequence protein language models also learn likelihood distributions that correlate with fitness15. Evo 2 sequence likelihoods correlate with diverse definitions of fitness across nine prokaryotic protein datasets; six eukaryotic protein datasets; and seven datasets of rRNAs, tRNAs and ribozymes (Fig. 2f). Evo 2 is competitive with widely used ProGen language models for protein DMS and with RNA language models for ncRNA DMS, although it underperforms state-of-the-art models on protein DMS. Consistent with observed trends for protein language models, the performance of Evo 2 on these fitness prediction benchmarks begins to saturate and can decrease at the largest model scales19,20,21. We also tested the ability of Evo 2 to predict mutation effects in protein sequences from viruses that infect human hosts. We found no correlation between Evo 2 likelihood and viral protein fitness (Extended Data Fig. 2b), consistent with our data exclusions having the intended effect of weakening both language modelling and downstream performance (Extended Data Fig. 2a). Evo 2 likelihoods also have a modest zero-shot association with human mRNA decay rates (Extended Data Fig. 3g and Supplementary Information B.2).

Since Evo 2 learns from eukaryotic genomes, which can be challenging to annotate, we assessed whether its embeddings capture exon–intron architecture. We trained lightweight models on Evo 2 7B base embeddings to develop single-nucleotide resolution classifiers of exon labels (Fig. 2g and Methods). On eight diverse species held out from classifier training, our best classifier achieved areas under the receiver operating characteristic curve (AUROCs) ranging from 0.91 to 0.99 (Fig. 2h,i), outperforming models trained on embeddings from other genomic language models, Nucleotide Transformer22 and Evo 1 (ref. 1), as well as classification by conservation metrics (local GC content and PhyloP). As a practical baseline, we show that our classifier outperforms ab initio AUGUSTUS23 across all species tested. Evo 2 also outperforms SegmentNT24 on all species outside the SegmentNT training set and on one of the three species in its training set. These results suggest that combining Evo 2 sequence embeddings with supervised approaches can aid the functional annotation of genetic components across diverse species, including non-model organisms.

Beyond molecular or gene-level prediction tasks, we previously showed that high likelihood under Evo 1 is associated with whole-organism replication fitness in prokaryotes and phage, as quantified by gene essentiality experiments1. Using zero-shot likelihoods to score the effects of premature stop codon insertions into bacterial, archaeal and phage genomes, we found that Evo 2 models performed similarly to Evo 1 and better than other zero-shot methods in predicting gene essentiality across diverse species (Fig. 2j and Extended Data Fig. 3h). On zero-shot prediction of human gene essentiality (Methods), Evo 2 40B (AUROC = 0.66, area under the precision-recall curve (AUPRC) = 0.15) outperformed other genomic language models (AUROC range 0.50–0.59, AUPRC range 0.09–0.12) and performed within the range of four PhyloP conservation scores (AUROC range 0.65–0.71, AUPRC range 0.13–0.21) (Extended Data Fig. 3i), although the overall predictive performance remains modest.

Together, these results demonstrate that Evo 2 captures information across biological modalities and domains of life. Notably, the 7B and 40B models expand predictive capabilities without compromising the prokaryotic insights captured by Evo 1.
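The DMS benchmarks above (Fig. 2f) rank models by the Spearman correlation between model likelihood and measured fitness, that is, the Pearson correlation of the ranks. A self-contained sketch with toy values in place of real DMS measurements:

```python
def rankdata(v):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman rho: Pearson correlation computed on the rank vectors."""
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

# Toy model log-likelihood deltas vs measured fitness for four variants:
print(spearman([-3.0, -1.0, -2.0, 0.0], [0.1, 0.8, 0.4, 0.9]))  # 1.0
```

Because only ranks matter, the metric is insensitive to the scale of the model's likelihoods, which is why it is the standard choice for comparing heterogeneous models on DMS data.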
The utility of both zero-shot likelihoods and simple classifiers trained on Evo 2 embeddings for a variety of predictive tasks across prokaryotic and eukaryotic genomes indicates that Evo 2 provides a strong foundation model for downstream applications in computational biology.

Human variant effect prediction

Variant effect prediction represents a critical challenge in genomics, with direct implications for clinical diagnosis and therapeutic development. Genomic language models have previously struggled in eukaryotic variant effect prediction, lagging considerably behind species-specific models that use multiple sequence alignments16,22,25. Evo 2 can perform accurate zero-shot variant effect prediction for both coding and noncoding DNA by considering the changes in the model's likelihoods after introducing mutations involving single or multiple nucleotides (Fig. 3a).

Fig. 3: Evo 2 enables accurate zero-shot human variant effect prediction. a, Overview of zero-shot variant effect prediction using Evo 2. Evo 2 was used to assign likelihood scores to human genetic variants, distinguishing pathogenic and benign variants in both coding and noncoding regions. b,c, Zero-shot evaluation of variant pathogenicity within the coding (b; n = 14,319 SNVs, n = 1,236 non-SNVs) and noncoding (c; n = 34,761 SNVs, n = 3,894 non-SNVs) regions. Shown are the AUROCs and AUPRCs for classifying pathogenic and benign variants from ClinVar, across models. For non-SNV evaluations, a modified version of PhyloP was used (Methods). d, Zero-shot evaluation on splice-altering variants in SpliceVarDB, split by exonic (n = 1,181) and intronic (n = 3,769) scoring. e, Evo 2 and other models were used to evaluate BRCA1 variant effect predictions against BRCA1 saturation mutagenesis data, comparing classification of loss-of-function versus functional and intermediate variants in both coding (n = 2,077 SNVs) and noncoding (n = 1,125 SNVs) regions.
f, Evo 2 zero-shot likelihood scores plotted for loss-of-function (LOF) versus functional/intermediate variants (n = 3,893), demonstrating the ability of Evo 2 to separate these classes. P value calculated by two-sided Wilcoxon rank-sum test. g, Evo 2 embeddings were extracted and concatenated to train a supervised classifier for BRCA1 variant effect prediction. h, Predictions of the supervised classifier on functional/intermediate variants compared with true loss-of-function variants on the test set (n = 789), with decision scores on the horizontal axis. P value calculated by two-sided Wilcoxon rank-sum test. i, Comparison of a supervised classifier trained on Evo 2 embeddings on the BRCA1 test set against zero-shot baselines, highlighting the value of using Evo 2 embeddings to build lightweight supervised models.

We used annotations of human clinical and experimentally determined variants to evaluate the ability of Evo 2 to predict biologically important sequence variation. We also contextualize the performance of Evo 2 against a wide range of models, including statistical measures of conservation (for example, PhyloP); unsupervised language models of proteins, RNA and DNA (for example, ESM-1b); supervised splicing prediction models (for example, Pangolin and SpliceAI); and human variant effect prediction models (for example, AlphaMissense, GPN-MSA and CADD).

Using the ClinVar database, we compared the ability of Evo 2 against other methods to predict the pathogenic effects of human genetic variants across diverse variant classes (Supplementary Data 1). For coding-region SNVs, the 40B and 7B models performed competitively, ahead of zero-shot methods including ESM-2, but behind ESM-1b, GPN-MSA and some PhyloP variants (Fig. 3b).
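The comparisons in this section report AUROC, which for zero-shot scoring is simply the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. A self-contained sketch using the rank-sum (Mann-Whitney) identity, with toy scores:

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum identity: the fraction of
    (positive, negative) pairs in which the positive outranks the negative,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy variant log-likelihoods where pathogenic variants (label 1) score lower;
# negate so that a higher score means "more pathogenic".
likelihoods = [-9.0, -2.0, -8.5, -1.0]
labels      = [1,     0,    1,    0]
print(auroc([-l for l in likelihoods], labels))  # 1.0: perfect separation
```

An AUROC of 0.5 corresponds to random ranking, which makes the metric a natural yardstick for comparing unsupervised likelihoods against supervised predictors.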
For non-SNV coding variants (for example, insertions and deletions), both Evo 2 models outperformed all other methods; notably, these non-SNV variants cannot be scored by leading models such as AlphaMissense and GPN-MSA (Fig. 3b). For noncoding SNVs, Evo 2 40B ranked first among unsupervised models and trailed only supervised models (Fig. 3c). For noncoding non-SNVs, Evo 2 40B outperformed all models tested (Fig. 3c). Across variants stratified by level of conservation or distance from splice sites, Evo 2 maintained competitive performance among unsupervised models for noncoding variants and the best performance for coding and noncoding non-SNVs out of all methods tested (Extended Data Fig. 4a–c and Supplementary Information B.3).

To further evaluate performance on splice variants, we used SpliceVarDB, a repository of experimentally validated splicing effects. For both exonic and intronic variants, Evo 2 40B and 7B ranked first among unsupervised models (Fig. 3d). On intronic variants, zero-shot prediction with Evo 2 was competitive with supervised models, slightly trailing SpliceAI and CADD but ahead of Pangolin; on exonic variants, Evo 2 trailed specialized supervised models but outperformed all zero-shot models (Fig. 3d).

We next focused on a dataset measuring the functional consequences of variants across both exons and introns of the BRCA1 gene26. Zero-shot prediction with Evo 2 exhibited strong performance on coding SNVs and outperformed all other models on BRCA1 noncoding SNVs (Fig. 3e). Evo 2 7B and 40B achieved better performance than other models when coding and noncoding SNVs were evaluated together, suggesting well-calibrated predictions across the included variant types (Extended Data Fig. 5a). When separately considering BRCA1 noncoding variants near or far from splice sites, Evo 2 40B outperformed all tested models, including supervised splicing predictors (Extended Data Fig. 5b).
A recently released BRCA2 variant dataset with experimental measurements27 enabled us to extend this analysis to a related gene. Evo 2 surpassed specialized models such as GPN-MSA when predicting coding and noncoding variants together, achieving second-best performance behind CADD, a supervised model (Extended Data Fig. 5c). These results indicate that Evo 2 is an effective zero-shot predictor across diverse types of functional human variants.

Although zero-shot scoring is particularly valuable when task-specific training data are unavailable, model-derived embeddings can also serve as inputs to supervised classifiers that learn task-specific decision boundaries, thereby enhancing both sensitivity and specificity. To illustrate this capability, we assessed whether a simple ridge regression model trained with Evo 2 embeddings exclusively on BRCA1 variants could surpass zero-shot prediction with Evo 2 (Fig. 3g). Given that different layers within large language models capture distinct features, we systematically extracted sequence embeddings from each block of the Evo 2 40B model to identify which layer yielded the most informative features for variant classification (Extended Data Fig. 5d and Methods). Our supervised model achieved a clear separation between loss-of-function variants and all other variants (Fig. 3h and Extended Data Fig. 5e), outperforming zero-shot prediction by Evo 2 40B on the test set (AUROC = 0.95, AUPRC = 0.88) (Fig. 3i). These results underscore how Evo 2 embeddings can be harnessed to train models aimed at more specialized tasks, including those with high clinical relevance.

Unlike the highly constrained sequences typically found in clinical variant datasets, which are biased towards coding, splicing or untranslated region (UTR) variants, other regulatory sequences—particularly those distal to genes—exhibit substantially lower conservation.
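The supervised BRCA1 model described above is a simple ridge regression over Evo 2 embeddings. As an illustrative sketch with tiny made-up "embeddings" (the real pipeline uses high-dimensional per-variant Evo 2 features and a held-out test set):

```python
def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression, w = (X^T X + lam*I)^-1 X^T y,
    solved here by Gaussian elimination (fine for small feature counts)."""
    d = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(d)]
    for col in range(d):                       # forward elimination
        piv = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, d):
            f = A[r][col] / A[col][col]
            for c in range(col, d):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * d                              # back substitution
    for r in range(d - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, d))) / A[r][r]
    return w

def ridge_predict(X, w):
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]

# Toy 2-D "embeddings" with a linear functional signal on the first feature.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [1.0, 0.0, 1.0, 2.0]
w = ridge_fit(X, y, lam=0.1)
print(ridge_predict([[3.0, 1.0]], w))  # close to [3.0]: first feature dominates
```

The regularization term `lam` keeps the weights small, which is what lets such a lightweight model generalize from a few thousand labelled variants without overfitting the embedding dimensions.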
In this context, we used DART-eval to assess how effectively Evo 2 embeddings and likelihoods capture regulatory function28. On zero-shot tasks in DART-eval, Evo 2 40B (chromatin accessibility quantitative trait loci (caQTL) AUROC = 0.58, DNase I sensitivity quantitative trait loci (dsQTL) AUROC = 0.66) outperforms other unsupervised DNA language models, such as Nucleotide Transformer (caQTL AUROC = 0.52, dsQTL AUROC = 0.61), but trails sequence-to-function models trained on accessibility data, such as ChromBPNet (caQTL AUROC = 0.77, dsQTL AUROC = 0.89) (Extended Data Fig. 5f). These results indicate that, while multi-species language models trained on sequence alone capture some regulatory information, sequence-to-function models with task-specific training achieve higher performance in this setting.

On human clinical variant prediction, Evo 2 represents a major improvement over previous multi-species DNA language models across different variant types, with leading performance on non-SNVs (insertions, deletions and duplications), and maintains this performance even in the absence of strong site-independent sequence conservation, although it falls behind supervised models for distal regulatory variants. Furthermore, leveraging these representations in a supervised setting illustrates how Evo 2 embeddings can serve as a foundation for downstream prediction tasks. Notably, Evo 2 is not trained on any human genetic variation or functional genomics data. In sum, these findings support the versatility of Evo 2 as a genome-scale language model for both unsupervised and supervised variant effect prediction.

Feature interpretation in Evo 2

Evo 2 learns complex representations of genomic sequences without explicit biological labels or annotations.
Contrary to the common critique of large language models as black-box systems, recent advances in the field known as mechanistic interpretability have demonstrated that sparse autoencoders (SAEs) can reveal latent dimensions that correspond to semantically meaningful features in natural language29,30,31. Without any prior biological annotations or labels, we trained SAEs on Evo 2 representations (or neuron firing patterns) to decompose the model into sparse, high-dimensional representations in which each latent dimension often exhibits human-interpretable patterns (Fig. 4a).

Fig. 4: Mechanistic interpretability of Evo 2 reveals DNA, RNA, protein and organism-level features. a, SAEs were trained on Evo 2 to extract SAE features associated with interpretable biological function that can be used for annotation, discovery and steering of sequence generations. b, Phage-associated feature activates preferentially on RefSeq-annotated prophages (left and top right) in the E. coli K12 MG1655 genome and fires on phage-derived spacer sequences within CRISPR arrays (bottom right). c, Activations of features associated with ORFs, intergenic loci, tRNAs and rRNAs, in a 100-kb region in E. coli K12 MG1655. d, Activations of features associated with α-helices, β-sheets and tRNAs at an E. coli K12 MG1655 locus containing tufB and a tRNA array ending with thrT (left) and the rpoB–rpoC locus (right). AlphaFold 3 (AF3) structure predictions with feature activations overlaid, of EF–Tu in complex with the tRNA (left) and of RpoB and RpoC in complex (right). e, A feature in the human genome with preferential activation immediately after frameshift mutations over other, less deleterious mutation types. f, Features with activation on DNA motifs in the human genome that correspond to transcription factor-binding motifs.
g, Features associated with exons, introns and their boundaries in the human genome generalize to a segment of the woolly mammoth genome.

We trained a Batch-TopK SAE32 on Evo 2 representations from layer 26 (Methods). The SAE was trained on representations from 1 billion tokens evenly split across several complete eukaryotic and prokaryotic genomes (Extended Data Fig. 6a–f).

We matched learned SAE latent dimensions, also referred to as features, to known biological concepts by finding features that were enriched in sequence segments containing a particular annotation, a process that we refer to as contrastive feature search (Extended Data Fig. 7a). This revealed diverse features that align with known biological concepts. For example, Evo 2 developed internal representations associated with mobile genetic elements. Feature f/19746 is closely associated with prophage regions across prokaryotes (Extended Data Fig. 7b) and activates on annotated prophages in the Escherichia coli genome, including the cryptic prophage CPZ-55 (Fig. 4b). This feature also activates on spacer sequences within a CRISPR array, which are integrated during CRISPR adaptation from foreign genetic material such as phage DNA (Fig. 4b), as well as after the last CRISPR direct repeat and on synthetic, scrambled spacer sequences, suggesting that Evo 2 associates CRISPR spacers with phage sequences rather than directly memorizing phage genomes (Fig. 4b and Extended Data Fig. 7c). This feature also activates on other regions that are not annotated as phage by geNomad33 yet contain genes associated with prophages, such as integrases and invertases (Extended Data Fig. 7d).

Next, we sought to identify concepts associated with canonical biological genomic elements. We identified diverse features corresponding to open reading frames (ORFs), intergenic regions, tRNAs and rRNAs in the E. coli genome (Fig. 4c and Extended Data Fig. 7e,f).
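The Batch-TopK variant of the SAE differs from a per-example TopK in that sparsity is enforced across the whole batch: for a batch of B activation vectors, only the B·k largest latent activations overall are kept, which lets individual examples use more or fewer features. A minimal NumPy sketch of the forward pass, with random untrained weights standing in for a fitted SAE (the shapes and variable names are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def batch_topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a Batch-TopK sparse autoencoder.

    x: (B, d) batch of model activations (e.g. hidden states from one layer).
    Keeps the B*k largest ReLU latent activations across the whole batch,
    zeroes the rest, then reconstructs x from the sparse code.
    """
    z = np.maximum(x @ W_enc + b_enc, 0.0)         # (B, m) dense ReLU codes
    flat = z.ravel()
    keep = flat.argsort()[-(x.shape[0] * k):]      # indices of top B*k values
    mask = np.zeros_like(flat)
    mask[keep] = 1.0
    z_sparse = (flat * mask).reshape(z.shape)      # batch-level top-k code
    x_hat = z_sparse @ W_dec + b_dec               # reconstruction
    return z_sparse, x_hat

# Tiny example with random weights (untrained; for shape checking only).
rng = np.random.default_rng(0)
B, d, m, k = 4, 8, 32, 3                           # batch, input dim, latents, per-example budget
x = rng.normal(size=(B, d))
W_enc = rng.normal(size=(d, m)); b_enc = np.zeros(m)
W_dec = rng.normal(size=(m, d)); b_dec = np.zeros(d)
z, x_hat = batch_topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Training would minimize the reconstruction error between x and x_hat; after training, each latent dimension of z is a candidate feature whose activations over annotated sequence segments can be tested for enrichment, as in the contrastive feature search described above.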
We further probed for structural signatures at the protein level and identified features linked to protein secondary structures, such as α-helices and β-sheets (Fig. 4d and Extended Data Fig. 7g,h). These associations highlight the multimodal nature of genome language modelling, capturing higher-order structural information beyond DNA alone.

We extended our analysis to the human genome in search of eukaryotic features. By introducing mutations into thousands of human coding sequences and applying contrastive feature search on a eukaryotic-only SAE, we identified a mutation-sensitive feature (f/24278) that preferentially activates on frameshifts and premature stop mutations (Fig. 4e and Extended Data Fig. 8a,b). We also observed other activations on DNA motifs in the promoter regions of human genes (Fig. 4f, left) that closely resemble the known binding sites of human transcription factors (Fig. 4f, right). Across a random sample of human promoter sequences, Evo 2 unsupervised SAE features have significant hits (q < […]

[Extended Data figure legend fragment:] […] >2-fold change and one sequence with >3-fold change in mean coverage were observed (where K562 has higher mean coverage), representing a 4–17% success rate. (c) Summary results of designs in human cells that attempt to either maximize accessibility (“K562 on” and “HEK293T on”) or minimize accessibility (“K562 off” and “HEK293T off”) across the full designed sequence (see panels (d) and (e), respectively). “Mean coverage” indicates the mean of coverage values across all 1-kb sequence positions. All designs that maximize accessibility have mean coverage values > 2.7 and all designs that minimize accessibility have mean coverage values < […] for K562 (log(1 + TPM) > 1.5) (i) and for HEK293T (log(1 + TPM) > 2.15) (j).
This expression cutoff was used to determine whether TF motifs found in the B7, B10, B11, and B12 designs were significantly enriched for TFs expressed in K562 (i) or in HEK293T cells (j).

Supplementary information

Supplementary Information
This file contains Methods, supplementary text and appendix.

Reporting Summary

Supplementary Data 1
Human variant effect prediction results. Scores and metadata for human variant effect prediction analysis, related to Fig. 3 and Extended Data Fig. 4.

Supplementary Data 2
Mechanistic interpretability transcription factor motif report. A comprehensive list of promoter-enriched motifs from the HOCOMOCO v.12 CORE database and associated SAE feature hits, related to Fig. 4f and Extended Data Fig. 8f.

Supplementary Data 3
DNA sequences for experimental chromatin accessibility designs. Experimentally tested DNA sequences in the Morse code mESC and HEK293T/K562 chromatin accessibility design tasks, related to Fig. 6 and Extended Data Fig. 10.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.