Main

Predicting the functional consequences of genetic variants, known as variant effect prediction (VEP), is a fundamental computational challenge with important applications in human genetics, drug development and protein engineering1,2,3,4,5. In recent years, the field of VEP has seen substantial advancements, particularly with the emergence of protein language models (PLMs)6,7,8,9,10,11,12,13. These models, inspired by transformative developments in natural language processing, are trained on vast repositories of protein sequences8 to capture the intricate patterns and relationships within the protein sequence space, demonstrating remarkable capabilities in VEP and other protein-related tasks, including protein structure prediction9,14, function annotation15 and protein design16.

To further increase predictive performance, VEP methods often employ a hybrid approach, combining PLMs with additional sources of information such as multiple sequence alignments (MSA), protein structures and population genetics data. Indeed, the top-performing methods on the ProteinGym Deep Mutational Scan (DMS) benchmark17, including SaProt14, TranceptEVE18 and others19,20, are all PLMs that have been trained on three-dimensional (3D) structure or MSA data and use this information during inference to make accurate predictions of variant effects. The recently developed closed-source models PrimateAI-3D13 and AlphaMissense7 also follow a hybrid approach, demonstrating similar performance gains by integrating both modalities with additional fine-tuning on human and primate population genetics data.
In theory, protein sequence information alone should be sufficient to achieve the same level of accuracy; at the time of testing, however, sequence-only PLMs are substantially underperforming compared to these more recent approaches.

Although effective, these hybrid approaches come with increased complexity and may have limited applicability compared to sequence-only PLMs in scenarios where such additional data are unavailable, incomplete or computationally expensive to obtain. Moreover, the integration of diverse data sources can introduce biases and dependencies that may affect both the interpretability and generalizability of the predictions in downstream tasks. Experimental 3D structures, for example, are only available for a small fraction of proteins, algorithmic and hyperparameter choices can have variable effects on MSA quality, and the use of population genetics data can lead to data circularity concerns in clinical genetics applications. Even though having this extra information alongside the sequence can certainly help on average, it is not clear how each modality contributes to each prediction and, more importantly, how reliable the corresponding model can be in its absence. The performance of the sequence-only PLM versions of TranceptEVE and SaProt, for example, in the absence of MSA and 3D structure decreases substantially, indicating a strong dependence on the availability of these additional modalities14. By contrast, pure evolutionary scale PLMs, trained solely on unaligned protein sequences, have the potential to offer a more balanced, streamlined and broadly applicable approach to VEP.

In this work, we focus on the widely adopted, sequence-only PLMs of the evolutionary scale modeling (ESM) family8,9,21 and show that their performance is not fundamentally limited compared to the above more complex modeling approaches.
In particular, we demonstrate that better detection of the evolutionary signals captured during the pretraining of different ESM models has the capacity to substantially increase the VEP performance of all individual models without using any external information. To achieve this, we introduce a co-distillation framework within which multiple models can learn from each other, alternating their role as teachers and students depending on their estimated confidence for each prediction; the log-likelihood ratio (LLR) of the model that provides the most confident prediction for any given mutation is used to refine the predictions of the others. Through comprehensive evaluation across multiple VEP benchmarks, we demonstrate that our approach produces substantially improved variant-effect ESM models (VESM) that match or surpass current state-of-the-art VEP methods, effectively closing the performance gap between hybrid and sequence-only PLMs. We further extend this framework to incorporate structural information in a modular fashion, preserving VESM’s advantages while improving performance on structure-dependent tasks. Finally, we demonstrate our models’ capacity to accurately quantify the continuous effect of variants on clinical phenotypes in large-scale population studies, extending their utility from binary classification to quantitative trait prediction in clinical genetics.

Results

A maximum confidence strategy for detecting evolutionary signals across multiple PLMs

A PLM tries to learn an approximate evolutionary fitness landscape by capturing evolutionary patterns across the millions of unaligned sequences it has seen during pretraining. This is an incredibly challenging task, and although the resulting models are surprisingly effective in identifying conserved motifs, domains and other co-evolution signals, they can also have certain blind spots. For example, in Fig.
1, we show that ESM2 models (independent of parameter size) consistently fail to identify conserved KRAB domains as annotated within 519 proteins on UniProt, whereas ESM1b and ESM1v (both 650-million (hereafter denoted as 650M)-parameter models) consistently fail to identify the less common but well conserved BRICHOS domains as annotated within 48 proteins. For each domain, however, there exists at least one model that has detected the corresponding motifs and can correctly identify them as conserved mutationally sensitive regions within the protein. This large effect is quite unexpected as all of these models, members of the ESM family, have almost identical architecture and were trained on similar data8,9,21. In the context of predicting the pathogenicity of variants, failure to detect these signals would lead to an increased number of false-negative errors for each model. Yet the information for detecting both domains is captured within the model family.

Fig. 1: Enriching for evolutionary signals by searching across multiple PLMs.

a, An example of ESM2 and ESM1b complementarity in detecting evolutionary conserved domains as mutationally sensitive; the 20×L heat map of all LLR scores obtained from each model is shown for two proteins (ZFP57 and ITM2B) that are involved in human disease. ESM1b is able to correctly identify the majority of mutations within the annotated KRAB domain in ZFP57 as damaging (yellow color), whereas ESM2 predicts them as neutral (blue color). The opposite is true for the BRICHOS domain in ITM2B. b, An evaluation of the capacity of different ESM models in detecting these domains across multiple proteins (n = 519 and n = 48 proteins with annotated KRAB and BRICHOS domains, respectively). The average score of all possible mutations that fall within each domain is calculated for each protein as a proxy for each model’s ability to detect it.
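The domain-detection proxy in panel b can be sketched in a few lines; the 20×L matrix, domain boundaries and random values below are illustrative stand-ins for real model outputs, not the actual data:

```python
import numpy as np

# Hypothetical example: given a 20 x L matrix of LLR scores for one protein
# (rows = 20 amino-acid substitutions, columns = sequence positions), score a
# model's ability to detect an annotated domain by averaging the LLRs of all
# possible mutations falling inside the domain (lower = more damaging).
def domain_mean_llr(llr: np.ndarray, start: int, end: int) -> float:
    """Mean LLR over all substitutions within positions [start, end) (0-based)."""
    return float(llr[:, start:end].mean())

rng = np.random.default_rng(0)
L = 100
llr = rng.normal(loc=0.0, scale=1.0, size=(20, L))  # mostly neutral background
llr[:, 40:60] -= 8.0                                # a strongly conserved domain

# The annotated domain scores far lower (more mutationally sensitive)
# than the rest of the protein.
assert domain_mean_llr(llr, 40, 60) < domain_mean_llr(llr, 0, 40)
```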
The violin plot shows these average scores across all proteins for each model (median, center line; interquartile range, box; whiskers, 1.5× interquartile range). c, The average LLR per position as predicted by ESM2 and ESM1b is visualized for two other proteins (ZNF93 and PSPC) with experimentally determined structures of the two domains (PDB: 7Z36 and 2YAD). d, An illustration of the maximum confidence approach; signal-enriched data are generated for all possible mutations by combining predictions from N models. Individual mutations are scored by choosing the most confident prediction (minimum LLR; indicated by ‘min()’) across all models (Methods).

Here, we first consider a simple approach that can exploit this complementarity and enrich for evolutionary signals captured during the pretraining of multiple PLMs. In particular, we use all individual models to calculate the LLRs of all possible missense mutations for a given wild-type (WT) sequence and aggregate their predictions at the variant level by taking the minimum (ESMIN; Fig. 1d). This approach effectively scores each variant by choosing the model that outputs the minimum LLR or, equivalently, has the maximum confidence in the WT residue compared to the mutated amino acid (Methods). Intuitively, this is most effective when different models have varying strengths in predicting mutations in different protein contexts, as demonstrated in the above examples. We contrast this with the more commonly used averaging approach, which can potentially dilute subtle evolutionary signals detected only by a small fraction of the models.

To make this intuition precise, we develop a simple theoretical framework (Supplementary Note 1) in which, conditional on the true class (pathogenic versus benign), the LLR scores generated by N base models are independently drawn from a Gaussian distribution.
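As a concrete sketch, the maximum confidence aggregation reduces to an element-wise minimum over the stacked per-model LLR matrices (random matrices stand in for real model outputs here):

```python
import numpy as np

# Sketch of the maximum-confidence (ESMIN) aggregation: stack the 20 x L LLR
# matrices produced by N models for the same protein and take the element-wise
# minimum, i.e. for each variant keep the score of the model that is most
# confident in the wild-type residue relative to the mutant.
def esmin(llr_matrices):
    stacked = np.stack(llr_matrices, axis=0)  # (N, 20, L)
    return stacked.min(axis=0)                # (20, L)

rng = np.random.default_rng(1)
models = [rng.normal(size=(20, 50)) for _ in range(3)]
combined = esmin(models)

assert combined.shape == (20, 50)
# The aggregated score never exceeds any individual model's score.
assert all(np.all(combined <= m) for m in models)
```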
This mathematical model is governed by two interpretable quantities: the separation between the class means and the ratio of their variances (Extended Data Fig. 1a). For both the minimum (maximum confidence) and the averaging aggregators, we derive analytical expressions for the area under the receiver operating characteristic (ROC) curve (area under the curve (AUC)) and evaluate them numerically to map the regimes in which each strategy performs best. In this independent and identically distributed (i.i.d.) setting, we show that minimum aggregation can substantially outperform averaging when the pathogenic LLRs are much more dispersed than the benign LLRs. Empirically, we observe that ESM models exhibit this property (Extended Data Fig. 1b). Although ESM models are not independent, our analysis clarifies how maximum confidence can exploit their variance asymmetry to better detect and amplify evolutionary signals captured within the family.

To evaluate the efficacy of the maximum confidence strategy in a VEP setting, we aggregated proteome-wide predictions from all individual, sequence-only models within the ESM family (ESM1b, five ESM1v models and five ESM2 models, excluding only the largest 15-billion-parameter model) and benchmarked the resulting predictions (ESMIN) on both clinical and DMS22 data (Extended Data Fig. 2). To address concerns that ESMIN might indiscriminately predict all variants of the same gene to be pathogenic, we specifically assessed its performance on a per-gene class-balanced version of the ClinVar dataset (Methods). We also investigated the correlation structure within the ESM family and how much each individual model contributes to ESMIN (Extended Data Fig. 2c,d). The models cluster into three groups by prediction similarity, with ESM1b and ESM2-3B contributing the most to ESMIN (25.7% and 32.6%, respectively). ESM2-650M contributes only 5%, but it is highly correlated with ESMIN scores, probably due to its similarity with ESM2-3B.
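The variance-asymmetry intuition can be checked numerically. The following Monte Carlo sketch follows our simplified i.i.d. Gaussian reading of the framework (the parameter values are illustrative, not those of Supplementary Note 1):

```python
import numpy as np

# Conditional on the class, each of N models emits an i.i.d. Gaussian LLR;
# pathogenic LLRs are more dispersed than benign ones. Under this asymmetry,
# minimum aggregation can beat averaging.
def auc(pos, neg):
    # Rank-based (Mann-Whitney) estimate of ROC-AUC, where LOWER scores
    # indicate the positive (pathogenic) class.
    scores = np.concatenate([pos, neg])
    ranks = scores.argsort().argsort() + 1
    u = ranks[: len(pos)].sum() - len(pos) * (len(pos) + 1) / 2
    return 1.0 - u / (len(pos) * len(neg))

rng = np.random.default_rng(0)
n, N = 20000, 5
benign = rng.normal(0.0, 1.0, size=(n, N))   # tight benign LLRs
patho = rng.normal(-2.0, 4.0, size=(n, N))   # dispersed pathogenic LLRs

auc_min = auc(patho.min(axis=1), benign.min(axis=1))
auc_avg = auc(patho.mean(axis=1), benign.mean(axis=1))
assert auc_min > auc_avg   # maximum confidence wins in this regime
```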
Further, including the much larger ESM2-15B model when calculating ESMIN scores does not seem to provide additional performance gains compared to the 11-model ESM family ensemble (Extended Data Fig. 2e).

Our results show that ESMIN surpasses alternative ensemble methods, including averaging the predictions of all models, averaging the predictions of the best two models as well as the ESM1v ensemble that was specifically developed for VEP. This extends to the DMS benchmark from ProteinGym, where ESMIN achieves a higher average Spearman correlation between model predictions and experimental measurements across 120 organismal fitness and activity assays. Importantly, ESMIN achieves better performance than all individual ESM models in a much larger fraction (~50%) of the DMS assays compared to other ensembles (~20%), indicating that maximum confidence benefits multiple proteins and can potentially be used to provide strong training signals for improving the performance of ESM models.

Maximum confidence co-distillation of ESM models improves their performance in VEP and downstream tasks

Although leveraging the most confident model for scoring individual mutations enhances VEP performance, it introduces practical challenges and limitations compared to using a single PLM. The collective parameter count of all models contributing to ESMIN, for example, exceeds 7 billion, making it computationally prohibitive for large-scale VEP. Furthermore, the multimodel approach complicates the use of embeddings for supervised fine-tuning on downstream tasks, which is a key application of PLMs. To overcome these limitations, we develop a co-distillation framework that uses the same maximum confidence approach within the ESM family to collectively improve the performance of individual ESM models (Fig. 2a).

Fig.
2: Maximum confidence co-distillation of the ESM family substantially improves individual ESM models.

a, An overview of the co-distillation framework; for each protein, all models make predictions, which are then combined into a single 20×L matrix by taking the element-wise minimum (maximum confidence per variant). Each individual model is then updated by calculating a loss with respect to its own output (Methods). b, A performance evaluation of individual co-distilled models compared to their baseline on ClinVar (2,400 genes with a total of ~27K pathogenic and benign variants, class balanced per gene (left)) and ProteinGym (120 organismal fitness and activity DMS assays (right)). The dashed red lines indicate the performance of ESMIN. c, An ablation study varying the percentage of proteins used as training data during co-distillation. The full dataset (100%, including the validation split) contains 18,683 human proteins. Ablations were made by first excluding all proteins that are similar to the ones used in the benchmark (5,279 proteins, >30% sequence similarity) and then sequentially eliminating proteins from the remaining set (nested subsets). d, The performance of learned model embeddings in nine downstream tasks (supervised fine-tuning on experimental measurements). Task-independent (frozen) embeddings were extracted from the base and co-distilled versions of ESM2-650M and used to train task-specific models (Methods). The table reports the mean and s.d. across three random seeds; statistically significant differences (two-sided paired t-test; P < 0.05) are highlighted.

Individual co-distilled models recovered most of ESMIN’s performance (>98% and >93% on Balanced ClinVar and ProteinGym DMS, respectively). The improvements were particularly striking for smaller models, with VESM-35M improving performance by more than 20% on ClinVar and more than 70% on DMS compared to the original ESM2-35M (Fig.
3d, right).

Sequence-only VESM models outperform state-of-the-art methods in predicting the clinical impact of variants

Co-distillation of ESM2-3B alongside the rest of the ESM family models produced VESM-3B, which is a single, 3-billion-parameter PLM, effectively trained only on unaligned (UniRef34) sequences. This model was able to surpass the collective performance of the ESM family, regardless of how individual model predictions were combined (maximum confidence, average of all models, ESM1v, average of ESM1b and ESM2-650M), effectively achieving a ~61% reduction in total number of parameters compared to the full ensemble. Smaller models distilled directly from VESM-3B maintained comparable performance with only 8%, 2% and 0.5% of the parameters. Here, we explore how these models compare with state-of-the-art VEP methods and hybrid approaches that further require the use of MSA or 3D structure information to enhance VEP performance.

To provide an objective assessment of the models’ capabilities, we evaluate clinical VEP performance on an independent, publicly available ClinVar benchmark obtained from ProteinGym17, comprising 2.57 × 10⁴ benign and 2.68 × 10⁴ pathogenic variants across 2,227 disease-associated genes (Fig. 4). Overall, we compare our VESM models with 24 other methods, including sequence-only PLMs, recently developed hybrid PLMs trained on protein structure (for example, SaProt14, ESM335 or ProSST19) or MSA information (for example, PoET20, EVE36, TranceptEVE18, GEMME37 or VespaG38), as well as other widely used VEP methods that leverage homology-based (for example, SIFT39 or PROVEAN40) or population-based (for example, PrimateAI13,41 or CADD42) approaches (Methods).

Fig.
4: Performance evaluation of sequence-only VESM models on clinical VEP.

a, Benchmarking of VESM models against 24 other PLMs and VEP methods on an independent, publicly available ClinVar benchmark from ProteinGym (2.57 × 10⁴ benign and 2.68 × 10⁴ pathogenic variants across 2,227 genes). Models are color-coded based on the additional sources of information they use. Top: global AUC is shown for each model. Error bars correspond to s.d. of the ROC-AUC scores centered around the mean (estimated by bootstrapping; n = 50 label-balanced resamples of 6,000 variants). Bottom: the corresponding ROC curve is shown for the top-performing models of each class. b, A boxplot of the ClinVar AUC per gene for n = 259 genes that have at least ten pathogenic or benign labeled variants. c, The percentage of ClinVar variants that can be confidently classified by each model. The prediction accuracy and the total number of annotated variants are computed by varying the classification threshold for each method. The average and a two-standard-deviation interval across n = 10 bootstraps are shown (Methods). d, A comparison with AlphaMissense7 on a recent (March 2025) release of the ClinVar dataset. A boxplot of the AUC achieved by each model across a range of MAF filtering thresholds from gnomAD v4 (Methods). e, A performance evaluation on a subset of the ClinVar dataset that excludes all human variants that have been used for training AlphaMissense (MAF > 1 × 10⁻⁵ in gnomAD v2). The error bars correspond to s.d. of the ROC-AUC scores centered around the mean (estimated by bootstrapping; n = 100 label-balanced resamples of 10,000 variants). f, Calibrated binary classification metrics evaluating the performance of each model on the same MAF-filtered ClinVar dataset used in e.
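The bootstrap error bars described in the caption can be sketched as follows, with synthetic scores and labels standing in for the benchmark data (resample sizes here are illustrative):

```python
import numpy as np

# s.d. of ROC-AUC over label-balanced bootstrap resamples.
def roc_auc(scores, labels):
    # Mann-Whitney estimate; higher score = more likely pathogenic (label 1).
    ranks = scores.argsort().argsort() + 1
    pos = labels == 1
    u = ranks[pos].sum() - pos.sum() * (pos.sum() + 1) / 2
    return u / (pos.sum() * (labels == 0).sum())

def bootstrap_auc_sd(scores, labels, n_boot=50, per_class=3000, seed=0):
    rng = np.random.default_rng(seed)
    pos_idx, neg_idx = np.where(labels == 1)[0], np.where(labels == 0)[0]
    aucs = []
    for _ in range(n_boot):
        # Label-balanced resample: equal numbers of each class, with replacement.
        i = np.concatenate([rng.choice(pos_idx, per_class),
                            rng.choice(neg_idx, per_class)])
        aucs.append(roc_auc(scores[i], labels[i]))
    return float(np.mean(aucs)), float(np.std(aucs))

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 20000)
scores = labels * 1.0 + rng.normal(0, 1, 20000)  # informative synthetic scores
mean_auc, sd_auc = bootstrap_auc_sd(scores, labels)
assert 0.7 < mean_auc < 0.85 and sd_auc < 0.02
```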
The boxplots show: center line (median); box limits (Q1, Q3); whiskers (±1.5× interquartile range); points (outliers).

Surprisingly, our results demonstrate that all sequence-only VESM models performed exceedingly well on ClinVar, using both global and per-gene average AUC as the evaluation metric, effectively closing the performance gap to more complex VEP methods (Fig. 4a,b). Notably, all VESM models showed substantial performance gains over all other sequence-only PLMs, including the recently released ESM-C43. We also evaluated the entire ROC curve for the top-performing models of each class (VESM-3B, sequence only; SaProt, structure based; PoET, alignment based; and CADD, population based). We observe that even though SaProt matches VESM-3B performance in the low false positive regime (high specificity), it is not able to maintain a similarly high true positive rate for the rest of the curve (high sensitivity regime). The opposite holds for CADD and PoET. VESM-3B, on the other hand, is more balanced, maintaining high performance across the entire ROC curve.

We also evaluated how VESM-3B compares with other unsupervised methods in a clinical annotation task. In particular, we computed a curve that measures the percentage of ClinVar variants that each method can confidently annotate (either as benign or pathogenic) as a function of prediction accuracy (Fig. 4c and Extended Data Fig. 5a; Methods). We see that VESM-3B consistently outperforms the other methods, maintaining the highest annotation coverage across all accuracy levels. Specifically, VESM-3B annotations at 90% accuracy can cover 90% of ClinVar, which is a substantial improvement from the 79% achieved by ESM1b, the currently best-performing sequence-only PLM. Moreover, at this level of accuracy, the second-best method, SaProt (87% coverage), performs substantially better than the other two homology-based methods, PoET (81%) and EVE (73%).
At 95% accuracy, a level that SaProt could achieve for only 25% of the variants, VESM-3B is still able to annotate ~67%, compared to 54% achieved by PoET. The binary classification performance of all methods is further evaluated across multiple metrics in Extended Data Fig. 5b using a fixed threshold of 0.5. All VESM models performed exceptionally well on this benchmark, with VESM-3B and VESM-650M surpassing all other non-VESM methods across all evaluated metrics (area under the ROC, area under the precision-recall curve, accuracy, diagnostic odds ratio (DOR), Matthews correlation coefficient (MCC) and F1 score).

Finally, we provide a direct comparison between VESM-3B and AlphaMissense7, a state-of-the-art closed-source VEP model. AlphaMissense is an ensemble model building on top of the AlphaFold architecture, trained on 3D protein structure and MSA information. Beyond these two modalities, AlphaMissense is further trained using human and primate population allele frequency information, enabling state-of-the-art variant pathogenicity classification performance. However, this introduces data circularity with respect to ClinVar labels, because the allele frequency of a variant is considered as strong supporting evidence contributing to its clinical annotation (ACMG guidelines3). Indeed, the VEP performance of AlphaMissense as evaluated on a more recent release of ClinVar (Mar 2025) shows a strong dependence on the minor allele frequency (MAF, gnomAD v4 (ref. 44)) of the included variants (Fig. 4d; Methods). VESM-3B, on the other hand, maintains the same performance independent of MAF, outperforming AlphaMissense in the clinically challenging task of distinguishing between rare benign and rare pathogenic mutations. We further evaluated all models on the subset of the ClinVar dataset that excludes variants with MAF > 1 × 10⁻⁵ that have been used for training AlphaMissense (Fig. 4e and Extended Data Fig. 5c).
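The annotation-coverage analysis above (Fig. 4c) can be sketched as follows, under our simplifying assumption of a symmetric confidence band around a single decision threshold (the scores and labels are synthetic):

```python
import numpy as np

# Variants whose score falls outside a confidence band around the threshold
# are "confidently annotated"; accuracy is measured on those variants only.
# Widening the band trades coverage for accuracy.
def coverage_at_accuracy(scores, labels, target_acc, threshold=0.5):
    for margin in np.linspace(0.0, 0.5, 201):
        keep = np.abs(scores - threshold) >= margin
        if keep.sum() == 0:
            break
        pred = (scores[keep] >= threshold).astype(int)
        acc = (pred == labels[keep]).mean()
        if acc >= target_acc:
            return float(keep.mean())   # fraction of variants annotated
    return 0.0

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 10000)
noise = rng.normal(0, 0.25, 10000)
scores = np.clip(0.25 + 0.5 * labels + noise, 0, 1)  # synthetic probabilities

cov90 = coverage_at_accuracy(scores, labels, 0.90)
cov95 = coverage_at_accuracy(scores, labels, 0.95)
# Demanding higher accuracy can only shrink the annotated fraction.
assert cov90 >= cov95 > 0
```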
Interestingly, all VESM models (3B, 650M, 150M and 35M) are able to surpass the performance of AlphaMissense on this benchmark, substantially outperforming the corresponding base ESM models both in terms of AUC and other binary classification metrics (Fig. 4f and Extended Data Fig. 5d; Methods).

Incorporating structure information further improves VESM on both clinical and DMS benchmarks

Having developed sequence-only PLMs with substantially improved performance compared to existing structure- and MSA-based models, we next investigated whether incorporating structural information could provide additional gains. In principle, the amino acid sequence of a protein encodes all the information necessary to determine its structure, suggesting that sequence-based models alone might suffice. However, both sequence and structure data are limited in practice, offering incomplete but potentially complementary views of the protein fitness landscape. Although joint sequence-structure pretraining could theoretically outperform sequence-only methods, our clinical VEP results (Fig. 4) suggest that existing models using this strategy (for example, SaProt, ProSST and ESM3) may underutilize sequence information when structure is readily available during training. More recently, structure-aware fine-tuning of sequence models was proposed as a promising approach to incorporate structure information into sequence-only PLMs, but the resulting model (ISM45, based on ESM2) underperformed on our ClinVar evaluations compared to SaProt (also based on ESM2).

Here, in line with our proposed sequence-only training framework, we considered an alternative approach. Rather than using structural supervision to improve a sequence model, we instead use VESM to enhance the sequence representation within a structure-based model architecture.
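In miniature, the co-distillation objective introduced earlier (Fig. 2a) pairs an element-wise minimum target with a per-model regression loss; the shapes and random values below are illustrative, and in practice the 20×L LLR matrices come from the PLMs themselves:

```python
import numpy as np

# Shared maximum-confidence teacher signal: element-wise minimum over the family.
def codistillation_target(llr_matrices):
    return np.stack(llr_matrices, axis=0).min(axis=0)  # (20, L)

# Each student is regressed onto the shared target with a mean squared error loss.
def mse_loss(student_llr, target_llr):
    return float(((student_llr - target_llr) ** 2).mean())

rng = np.random.default_rng(3)
family = [rng.normal(size=(20, 30)) for _ in range(4)]
target = codistillation_target(family)
losses = [mse_loss(m, target) for m in family]

assert all(l >= 0 for l in losses)          # MSE is non-negative
assert all(np.all(target <= m) for m in family)  # target is the family minimum
```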
We selected ESM3 for its modular design, which cleanly separates sequence and structure components, and fine-tuned only the sequence-related parameters with the same sequence-based mean squared error (MSE) loss used within our co-distillation framework. This strategy allowed us to distill the benefits of VESM into a structurally informed model with minimal overhead, resulting in improved performance across both clinical and DMS VEP benchmarks (Fig. 5; Methods). We refer to the fine-tuned version of ESM3 as VESM3 and further combine it with the sequence-only VESM-3B to produce a structure-aware ensemble model, named VESM++.

Fig. 5: VESM achieves state-of-the-art VEP performance in both clinical and DMS benchmarks.

a, A comparison of model performance on clinical VEP (ClinVar AUC, same dataset as in Fig. 4a) versus experimental VEP from human protein DMS assays (Spearman correlation, ProteinGym fitness and activity assays). Points denote individual models; the dashed gray line shows the fitted linear regression. Shaded bands (95% CIs) around the regression indicate uncertainty in the fit. The models are color-coded based on the additional sources of information they use. b, Performance on the ProteinGym DMS benchmark (rank score), with assays categorized by selection type: fitness/activity (x axis) versus binding, stability and expression (y axis). VESM3 and VESM++ achieve state-of-the-art performance across both assay types, surpassing existing structure- and MSA-based methods. c, Pairwise win rates between models across 120 DMS fitness and activity assays, quantifying the proportion of assays in which a given model outperforms another. d, A comparison of VESM models and ESM baselines across nonhuman DMS assays, grouped by taxonomic category (eukaryote, prokaryote and virus). Base models for individual VESM models: ESM2 (3B), ESM2 (650M), ESM2 (150M), ESM2 (35M) and ESM3_struct.
For VESM++, the baseline is given by the ensemble of the corresponding base models ESM2 (3B) and ESM3_struct.

To evaluate the models’ performance on human proteins, we curated a human DMS dataset from ProteinGym consisting of 50 assays measuring protein fitness and assessed the concordance between performance on ClinVar (AUC) and human DMS (Spearman correlation) benchmarks (Fig. 5a). Our results demonstrate that model improvements in clinical VEP translate to corresponding performance gains on human DMS assays, with a Pearson correlation of 0.78 between the two benchmarks. Notably, VESM3 outperformed its base model, ESM3, by a large margin on both clinical and DMS evaluations, achieving performance comparable to VESM-3B. The combined model, VESM++, further improved performance relative to individual VESM models. All individual VESM models, including VESM-35M, matched or surpassed the performance of all other MSA- and structure-based methods across both benchmarks.

We next expanded our DMS benchmark to include nonhuman proteins and categorized assays by primary readout: fitness/activity versus structure-dependent measures such as binding, stability and expression (Fig. 5b). This stratification highlights the distinct contributions of MSA and structural information to model performance; MSA-based models performed much better on fitness and activity assays, whereas structure-based models primarily improved performance on binding and stability-related tasks. Currently available PLMs that do not incorporate structure or MSA information (for example, ProGen246,47, Tranception48, ESM2 or ESM-C) appeared to have reached a performance plateau on both assay types, independent of model size and architecture. By contrast, our sequence-only VESM models notably surpassed this barrier, matching the performance of MSA-based methods.
Incorporating structure information further improved upon the sequence-only VESM models, with VESM3 and VESM++ simultaneously achieving state-of-the-art performance across both assay categories.

Pairwise evaluations across all ProteinGym fitness and activity DMS assays (Fig. 5c) further quantified the improved performance of VESM models, as measured by pairwise win rates (that is, the fraction of assays in which one model outperformed another; Methods). Our results showed that VESM++ and VESM3 achieve the highest overall win rates, outperforming both MSA- and structure-based models. This is also the case for the sequence-only VESM models, except for the smaller (150M and 35M parameter) models, which still consistently outperformed all structure-based models. Relative to their respective base models, VESM3 surpassed ESM3_struct in 85% of assays, VESM-650M surpassed ESM2_650M in 78% and VESM-3B surpassed ESM2_3B in 79%.

Finally, examining model performance improvements across taxonomic categories revealed disproportionately larger gains for viral proteins relative to other nonhuman proteins (Fig. 5d). Similar gains were observed for ESM1b, ESM2 and ESM1v models during the first round of co-distillation (Extended Data Fig. 6). This is surprising, as these models were co-distilled exclusively on human protein sequences (Methods). Moreover, these improvements transferred effectively to VESM3 through VESM-based distillation, even though the corresponding ESM3 base model and its associated training data explicitly excluded all viral sequences before public release35.

VESM can predict the severity of missense variant effects on continuous clinical phenotypes in the UK Biobank data

Although traditional VEP methods primarily focus on binary pathogenicity classification, extending these predictions toward quantitative, clinically relevant outcomes offers substantial potential for human genetics applications.
To investigate whether VESM scores can quantify the severity of missense variants on continuous clinical traits, we analyzed genotype-phenotype summary statistics from the UK Biobank49 as provided by Genebass50, focusing specifically on genes associated with blood biochemistry biomarkers.

We downloaded summary statistics for all 332 significantly associated gene–phenotype pairs involving 186 genes and 27 distinct clinical phenotypes. Each gene was selected to have a minimum of 25 missense variants with Genebass-estimated single-variant association effect sizes (β coefficients) and P values. We then examined the relationship between variant-level VESM scores and their observed clinical impact by performing linear regression of VESM predictions against the reported β coefficients, quantifying model accuracy at the gene–phenotype level using the Pearson correlation coefficient and its corresponding regression-derived P value (Methods).

Remarkably, VESM-3B-derived correlations strongly aligned with gene-level missense burden test effect sizes and SKAT-O51 P values across the majority of tested gene–phenotype pairs (Fig. 6a), underscoring the model’s capability in capturing not just pathogenicity but the magnitude and directionality of clinical impact. Upon closer examination of initial outliers, we observed that VESM-3B predictions corresponded more closely to burden test results derived from predicted loss-of-function (pLoF) variants rather than missense variants alone. These examples include well-established associations of APOA1 with apolipoprotein A levels, UGT1A8 with bilirubin metabolism and CASR with calcium levels. Importantly, the observed alignment between VESM correlations and pLoF-based effect sizes for these gene–trait pairs shows that VESM predictions can capture clinically relevant missense variant effects that traditional missense SKAT-O analyses may underestimate.
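The per gene–phenotype regression can be sketched as follows, with synthetic variant scores and effect sizes standing in for the Genebass data; in the paper, the two-sided P value is obtained from scipy.stats.pearsonr:

```python
import numpy as np

# Summarize one gene-phenotype pair by the Pearson correlation between
# variant-level model scores (LLRs) and single-variant effect sizes (betas).
def pearson_r(x, y):
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

rng = np.random.default_rng(4)
n_var = 40                                     # >= 25 missense variants per gene
llr = rng.normal(-5, 3, n_var)                 # variant-level scores (LLRs)
beta = -0.1 * llr + rng.normal(0, 0.3, n_var)  # more damaging -> larger effect

r = pearson_r(llr, beta)
# Strongly negative: lower LLR (more damaging) tracks a larger phenotype shift.
assert -1.0 <= r < -0.4
```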
We highlight that this regression approach, unlike SKAT-O, can be used to test for association between genes and phenotypes with only summary statistics.

Fig. 6: VESM models accurately quantify missense variant severity across continuous clinical phenotypes in UK Biobank.

a, A correlation between variant-level VESM-3B predictions and single-variant association effect sizes (Genebass β coefficients) across 153 gene–phenotype pairs. The circle size represents missense SKAT-O significance (gene-level association P value from Genebass); color indicates gene-level pLoF effect size (burden test; Genebass). Prominent outliers are highlighted. b, A performance comparison of VESM models against AlphaMissense, ESM base models and an allele-frequency baseline across 103 strongly associated gene–phenotype pairs (Methods). Performance is summarized by average association strength (−log10 regression P value (top)) and stratified by phenotype category (bottom). Regression-derived P values were computed using Pearson’s product-moment correlation (two-sided test; scipy.stats.pearsonr); no adjustment for multiple comparisons was applied. c, A Pearson correlation between VESM-3B variant-level predictions and single-variant association effect sizes for gene–phenotype pairs with regression-derived P value < 0.1 (two-sided test; as in b). Phenotypes are grouped by phenotype category (lipid metabolism, liver function and renal function), and the color denotes pLoF burden test effect size.

To compare VESM-3B with other models, we removed nonsignificantly associated missense variants (single-variant association P value > 0.05) and further filtered the gene–phenotype pairs to include only those with strong associations (missense SKAT-O; Methods).

Methods

ClinVar

We considered variants with gnomAD MAF > 1 × 10⁻⁵ as ‘used for training AlphaMissense’. Finally, we mapped precomputed AlphaMissense scores using hg38 genome coordinates from AlphaMissense_hg38.tsv.gz (downloaded from https://doi.org/10.5281/zenodo.8208688).
After filtering out missing values, we obtained a dataset of 142,951 variants (46,247 pathogenic and 96,704 benign) across 2,685 genes.

ProteinGym DMS

The DMS benchmark consists of 217 DMS assays covering 696,311 single variants. The assays are categorized by taxon (96 human, 40 eukaryote, 50 prokaryote and 31 virus) and coarse selection type (77 organismal fitness, 43 activity, 66 stability, 18 expression and 13 binding assays). For our evaluations, we further combined them into structure-related assays (stability, expression and binding; 97 total) and function-related assays (fitness and activity; 120 total). We used the provided experimental scores for all DMS assays except for the fitness and activity measurements of CALM1, TPK1, UBC9, RASH, TADBP, SYUA and SRC. For these assays, we applied a transformation of the form x → abs(x − WT), where WT represents the wild-type (WT) measurement (either 0 or 1, depending on the assay). This adjustment reflects the fact that, for these assays, variants scoring higher than the WT are typically associated with deleterious effects, as discussed in the original studies63. This preprocessing step improved performance across all evaluated methods, with the exception of the structure-based model ProSST19, whose performance on TADBP dropped substantially, from 0.54 to 0.08 (Extended Data Fig. 6b). Although we were unable to pinpoint the cause of this discrepancy, it is unlikely to be related to structural features, as the regions assayed in TADBP are intrinsically disordered64.

Protein structure data for evaluation

For ProteinGym DMS, we used the 3D structures provided by the benchmark.
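The WT-centering transform is a one-liner; a minimal sketch (the input values below are illustrative, not from any of the listed assays):

```python
import numpy as np

def center_on_wildtype(scores, wt):
    """Fold assay measurements around the wild-type value so that any
    deviation from WT, in either direction, counts as deleterious."""
    return np.abs(np.asarray(scores, dtype=float) - wt)

# Variants scoring above or below WT = 1.0 get the same folded magnitude.
folded = center_on_wildtype([0.2, 1.0, 1.8], wt=1.0)
```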
For ProteinGym ClinVar, we used the 3D structures provided by the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk) for 2,300 sequence-matched proteins and generated 3D structures for the remaining proteins using the AlphaFold2 script from ColabFold65 (https://github.com/sokrypton/ColabFold).

Comparisons with state-of-the-art PLMs and VEP methods

We compared the VESM models with 25 models on ClinVar and 39 models on the DMS benchmark. These baselines are the latest or most widely used PLMs and VEP methods, including PLMs trained on unaligned sequences (for example, ESM, ProGen2 and Tranception), structure-based models (SaProt14, ESM335, ProSST19 and ISM45), MSA-based models (PoET20, EVE36, TranceptEVE18, GEMME37, VespaG38 and RSALOR61) and other VEP methods that use homology-based (for example, SIFT39 and PROVEAN40) or population-based (for example, PrimateAI13 and CADD42) approaches. See Supplementary Table 4 for a detailed description. For the ProteinGym benchmark, we obtained precomputed scores from ProteinGym for all methods except RSALOR61, ProSST, SaProt, ESM3, ISM, VespaG and ESMC. Among these models, ISM scores are currently not reported in the benchmark, and the other models do not have scores for the ClinVar benchmark. Furthermore, we wanted to ensure that all structure-based models use the same set of 3D structures for evaluation. We therefore computed variant effect scores for all of the above models, following the inference scripts provided in the corresponding repositories. For the comparison with AlphaMissense7, we used the precomputed scores provided by the authors (https://zenodo.org/records/10813168).

Performance evaluation on ClinVar

We evaluated performance on ClinVar using the area under the receiver operating characteristic curve (AUC) as our primary metric. We reported both the global AUC (calculated across all variants) and the average per-gene AUC (calculated separately for each gene with at least ten annotated variants).
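A minimal sketch of these two metrics, assuming scikit-learn's roc_auc_score is available; the gene grouping and the ten-variant floor follow the description above, while the function name and toy data are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def clinvar_aucs(labels, scores, genes, min_variants=10):
    """Global AUC over all variants plus the average per-gene AUC for
    genes with enough annotated variants of both classes."""
    labels, scores, genes = map(np.asarray, (labels, scores, genes))
    global_auc = roc_auc_score(labels, scores)
    per_gene = []
    for g in np.unique(genes):
        m = genes == g
        # Per-gene AUC needs both labels present in the gene.
        if m.sum() >= min_variants and len(set(labels[m])) == 2:
            per_gene.append(roc_auc_score(labels[m], scores[m]))
    return global_auc, float(np.mean(per_gene))

# Toy data: one gene, perfectly separated scores (0 = benign, 1 = pathogenic).
labels = [0] * 5 + [1] * 5
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9]
genes = ["GENE1"] * 10
global_auc, per_gene_auc = clinvar_aucs(labels, scores, genes)
```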
We bootstrapped the global AUC calculation in Fig. 4a by randomly sampling 50 label-balanced sets of 6,000 variants (3,000 benign and 3,000 pathogenic). The plot shows the estimated mean and standard deviation.

Prediction accuracy versus variant annotation

For this analysis, the scores of each method were first calibrated by performing logistic regression on a small set of ClinVar variants (n = 1,000), randomly sampled to contain an equal number of benign and pathogenic labels. This step ensures that all scores are comparable, lying between 0 and 1 with 0.5 as the classification threshold. We then varied the classification threshold symmetrically to exclude the variants that each method is most uncertain about (those with scores around 0.5). For each threshold, we computed the prediction accuracy, defined as the average accuracy of correctly annotating both benign and pathogenic variants, as well as the total number of variants that each method was able to annotate (that is, those not excluded by the threshold). This process was repeated ten times by randomly resampling the calibration set, and the two quantities (±2 standard deviations around the mean) were plotted against each other in the figure.

Comparison with AlphaMissense/MAF filtering

We compared the VESM models with AlphaMissense across different MAF thresholds (1 × 10−5, 1 × 10−4, 1 × 10−3, 1 × 10−2 and 1 × 10−1) to assess the impact of population frequency on model performance. For each threshold, we filtered the ClinVar dataset (February 2025 release) to include only variants below the specified MAF value and calculated the AUC for each model by randomly sampling 100 label-balanced sets of 20,000 variants (10,000 benign and 10,000 pathogenic) from genes that contain both labels. Even at the lowest MAF threshold (1 × 10−5), the filtered ClinVar dataset remained well balanced, with a total of 18,145 benign and 40,046 pathogenic variants across 2,559 genes.
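The calibration-and-abstention procedure used in the accuracy-versus-annotation analysis can be sketched as below, assuming scikit-learn for the logistic calibration step; the margin grid, function name and toy data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def accuracy_vs_coverage(raw_scores, labels, calib_idx, margins=(0.0, 0.1, 0.2)):
    """Calibrate raw scores to [0, 1] on a small subset, then sweep a
    symmetric abstention margin around 0.5 and report, per margin, the
    balanced accuracy and the number of variants still annotated."""
    raw = np.asarray(raw_scores, dtype=float).reshape(-1, 1)
    y = np.asarray(labels)
    model = LogisticRegression().fit(raw[calib_idx], y[calib_idx])
    prob = model.predict_proba(raw)[:, 1]
    results = []
    for m in margins:
        keep = np.abs(prob - 0.5) >= m      # drop the most uncertain variants
        pred = (prob[keep] >= 0.5).astype(int)
        acc_benign = np.mean(pred[y[keep] == 0] == 0)
        acc_path = np.mean(pred[y[keep] == 1] == 1)
        results.append(((acc_benign + acc_path) / 2, int(keep.sum())))
    return results

# Toy, perfectly separable scores; calibration uses all six variants here.
raw = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
labels = [0, 0, 0, 1, 1, 1]
results = accuracy_vs_coverage(raw, labels, calib_idx=np.arange(6))
```

Plotting balanced accuracy against the retained-variant count across margins reproduces the accuracy-versus-coverage curves described above.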
We also performed a specific comparison on the subset of ClinVar variants not used in AlphaMissense training, by removing gnomAD v2 variants with MAF > 1 × 10−5. This resulted in a set of 31,967 pathogenic and 15,392 benign variants across 1,510 genes. We emphasize that this filtering step is based on MAF calculated from gnomAD v2 and does not exclude the 4,818 variants that are reported in gnomAD v4 with MAF > 1 × 10−5. As above, the mean AUC and standard deviation were calculated over 100 bootstraps, but with 10,000 (instead of 20,000) variants sampled in each iteration to account for the smaller size of the filtered dataset. On this dataset, we also computed binary classification metrics by first calibrating the predictions of all models as in Fig. 4c (logistic regression, n = 1,000 variants) and using 0.5 as the classification threshold. The reported metrics were estimated by resampling the calibration set 50 times.

Performance evaluation on DMS data

We evaluated model performance on the ProteinGym DMS benchmark using the weighted average of Spearman correlations between model predictions and experimental measurements, following the scripts and scoring guidelines outlined in the ProteinGym GitHub repository17 (https://github.com/OATML-Markslab/ProteinGym). We compared all VESM models (3B, 650M, 150M, 35M, VESM3 and VESM++) against 39 other methods (see 'Comparisons' in Supplementary Table 4) across four datasets: (1) human DMS assays measuring fitness and activity (Fig. 5a), (2) all DMS assays measuring fitness and activity (Fig. 5b,c), (3) all DMS assays measuring binding, stability and expression (Fig. 5b) and (4) nonhuman DMS assays measuring fitness and activity (Fig. 5d). To assess consistency between the human DMS and ClinVar benchmarks, we calculated the correlation between model performances on the two datasets, considering only methods evaluated on both (Fig. 4a, Fig. 5b).
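A sketch of the per-assay DMS scoring, under the assumption that the weighting averages Spearman correlations within each protein before averaging across proteins (so proteins with many assays are not over-weighted); function name and toy data are illustrative:

```python
from collections import defaultdict

import numpy as np
from scipy.stats import spearmanr

def weighted_average_spearman(assays):
    """assays: iterable of (protein_id, predictions, measurements).
    Per-assay Spearman correlations are averaged within each protein
    first, then across proteins (assumed weighting scheme)."""
    per_protein = defaultdict(list)
    for protein, preds, meas in assays:
        rho, _ = spearmanr(preds, meas)
        per_protein[protein].append(rho)
    return float(np.mean([np.mean(v) for v in per_protein.values()]))

# Toy: protein P1 has two assays (rho = 1 and -1), P2 has one assay (rho = 1),
# so the weighted average is (0 + 1) / 2 = 0.5.
assays = [
    ("P1", [1, 2, 3, 4], [10, 20, 30, 40]),
    ("P1", [1, 2, 3, 4], [40, 30, 20, 10]),
    ("P2", [1, 2, 3, 4], [2, 3, 5, 9]),
]
score = weighted_average_spearman(assays)
```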
In addition to average performance, we computed a rank score by sorting methods in ascending order from 1 to 43 based on their weighted average Spearman correlations (Fig. 5b). We also assessed pairwise win rates66 between models (Fig. 5c). For each pair of models, we computed the proportion of assays on which the first model achieved a higher Spearman correlation than the second. Formally, the win rate of model \({m}_{1}\) over \({m}_{2}\) is defined as \(\mathrm{win}\,\mathrm{rate}({m}_{1},{m}_{2})=\frac{1}{N}{\sum }_{i=1}^{N}1\{{S}_{i}({m}_{1}) > {S}_{i}({m}_{2})\}\), where \({S}_{i}(m)\) denotes the Spearman correlation achieved by model \(m\) on assay \(i\), \(N\) is the total number of assays and \(1\{\cdot \}\) is the indicator function. We note that although win rates were used in previous work66 to compare models with nonoverlapping sets of VEP scores, here we compare models on exactly the same assays and variants.

UK Biobank summary statistics and the Genebass clinical phenotyping benchmark

Genebass summary statistics

We downloaded summary statistics (gene-level effect size and association P value) for the top pLoF gene burden associations with blood biochemistry biomarkers (pLoF SKAT-O P value