Target sequence-conditioned design of peptide binders using masked language modeling

Main

The development of therapeutics largely relies on the ability to design small-molecule-based or protein-based binders to pathogenic target proteins of interest1. These binders can be used either as inhibitors or as functional recruiters of effector enzymes2. For example, proteolysis-targeting chimeras (PROTACs) or molecular glues are heterobifunctional small molecules that bind and recruit endogenous E3 ubiquitin ligases for targeted protein degradation (TPD)3,4. Still, these small-molecule-based methods rely on the existence of accessible cryptic or canonical binding sites, which are not present on classically ‘undruggable’ intracellular proteins5,6. With the advent of deep-learning-based structure prediction tools such as AlphaFold2 and AlphaFold3 (refs. 7,8), combined with generative modeling1, algorithms such as RFdiffusion and MASIF-Seed enable researchers to conduct de novo protein binder design from target structure alone9,10. Nonetheless, much of the undruggable proteome, including dysregulated proteins such as transcription factors and fusion oncoproteins, is conformationally disordered, thus biasing design toward a small subset of disease-related proteins1,6.

Over the past few years, deep learning has revolutionized natural language processing (NLP), particularly through the implementation of the attention mechanism11. This foundational advancement has transcended the boundaries of natural language analysis, finding applications in the modeling of other languages, such as proteins, which are fundamentally sequences of amino acids12. Recently, several protein language models (pLMs) trained on distinct transformer architectures, such as ProtT5, ProGen2, ProtGPT2 and the ESM series, have accurately captured critical physicochemical properties of proteins13,14,15,16.
Notably, ESM-2 currently stands as a state-of-the-art model for protein sequence representation, functioning as an encoder-only model that discerns co-evolutionary patterns among protein sequences via a masked language modeling (MLM) training task17,18. These models have been extended to powerful applications, including antibody design, the creation of novel proteins and structure prediction, offering a streamlined approach to embedding useful protein information14,15,17,18. Recently, our laboratory has leveraged the expressivity of pLMs to both generate and prioritize effective peptidic binder motifs to targets of interest, enabling design of peptide-guided protein degraders19,20 that are modeled after the ubiquibody (uAb) architecture developed by Portnoff et al.21,22. As such, uAbs now represent a programmable, CRISPR-like approach for TPD. Our early models, Cut&CLIP and SaLT&PepPr, rely on the existence of interacting partner sequences as scaffolds for peptide design19,23. Most recently, our PepPrCLIP model generates de novo peptides by first sampling the ESM-2 latent space for naturalistic peptide candidates and then screening these candidates through a contrastive model to determine target sequence specificity20. However, a purely de novo, target sequence-conditioned binder design algorithm has yet to be developed.

To achieve this goal, we introduce PepMLM, a Peptide binder design algorithm via Masked Language Modeling, built upon the foundations of ESM-2 (ref. 17). PepMLM employs a masking strategy that uniquely positions the entire peptide binder sequence at the terminus of target protein sequences, compelling ESM-2 to reconstruct the entire binding region (Fig. 1a).
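
The masking strategy described above can be illustrated with a minimal Python sketch. It is not the PepMLM implementation: the function names, the `<mask>` token string, the use of `None` for positions excluded from the loss, and the choice to append the peptide at the C-terminal end of the target are all assumptions made for illustration. The sketch shows only how a target–peptide pair could be arranged so that every peptide position is masked and must be reconstructed, and how a generation input of a chosen length could be formed.

```python
from typing import List, Optional, Tuple

# Assumption: a single mask token string in the style of ESM tokenizers.
MASK_TOKEN = "<mask>"


def build_training_example(
    target: str, peptide: str
) -> Tuple[List[str], List[Optional[str]]]:
    """Arrange a target-peptide pair for masked reconstruction (hypothetical helper).

    The target residues stay visible; every peptide residue is replaced by
    the mask token. Labels are None where no loss would be computed (the
    target) and the true residue where the model must reconstruct it.
    """
    if not target or not peptide:
        raise ValueError("target and peptide must be non-empty")
    tokens = list(target) + [MASK_TOKEN] * len(peptide)
    labels: List[Optional[str]] = [None] * len(target) + list(peptide)
    return tokens, labels


def build_generation_input(target: str, peptide_length: int) -> List[str]:
    """Form a generation input: target residues followed by mask tokens.

    A fine-tuned masked language model would then fill the masked
    positions to propose a peptide of the requested length.
    """
    if not target:
        raise ValueError("target must be non-empty")
    if peptide_length < 1:
        raise ValueError("peptide_length must be at least 1")
    return list(target) + [MASK_TOKEN] * peptide_length


if __name__ == "__main__":
    # Toy sequences, chosen only to keep the example short.
    tokens, labels = build_training_example("MKTAYI", "GSW")
    print(tokens)
    print(labels)
    print(build_generation_input("MKTAYI", 4))
```

In practice the token lists would be converted to integer IDs by the model's tokenizer, and the `None` labels would map to whatever ignore index the training loss uses; both details are omitted here.
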
PepMLM-derived linear peptides achieve low perplexities, matching or improving upon validated peptide–protein sequence pairs in the test dataset; outperform the state-of-the-art RFdiffusion model for peptide design on structured targets in silico9; and experimentally exhibit potent and specific binding to disease-relevant targets and degradation of difficult-to-drug drivers of Huntington’s disease and emergent viral phosphoproteins when incorporated into the uAb architecture. Overall, by focusing on the complete reconstruction of peptide regions, PepMLM serves as a completely sequence-based, target-conditioned de novo binder design tool, paving the way for the development of more effective, therapeutic binders to conformationally diverse proteins of interest.

Fig. 1: Overview and evaluation of the PepMLM model.
a, The architecture of the PepMLM model. Based on the fine-tuning of ESM-2, the model incorporates the target protein sequence along with a masked binder region during the training phase. During the generation phase, the model can accept target protein sequences and mask tokens to facilitate the creation of peptides of specified lengths.
b, Perplexity distribution comparison. The perplexity values were calculated for test and designed peptides, encompassing the target proteins in the test set.
c, The density distribution visualization of the log perplexity values for target–peptide pairs, encompassing test peptides, PepMLM-650M-designed peptides, ESM-2-650M-designed peptides and random peptides.
d, In silico hit rate assessment of RFdiffusion (left) and PepMLM (right). Using AlphaFold-Multimer, ipTM scores were computed for both the designed and test peptides in conjunction with the target protein sequence. The entries are organized in accordance with the ipTM scores attributed to the test set peptides. The hit rate counts the designed peptides exhibiting ipTM scores ≥ those of the test peptides.
e, Binding specificity analysis through permutation tests. The distribution of PPL scores for matched target–binder pairs (blue) is compared with randomly shuffled mismatched pairs (red). Each target’s binder was shuffled 100 times to generate the mismatched distribution. Statistical significance was determined using a t-test (P 40), validating the effective ability of PepMLM to model them accurately (Fig. 1b and Supplementary Table 2). Our distribution analysis revealed that PepMLM closely mirrors the low PPL region of real binders, a deviation from the distribution shifts observed with the original ESM-2 model alone and with randomly designed binders, indicating that PepMLM can use PPL scores to distinguish binders from non-binders, especially random binders (Fig. 1c).

To further understand the PPL score, we co-folded the test binders with their respective target proteins using AlphaFold-Multimer, which has proven effective at predicting peptide–protein complexes27,28. The predicted local distance difference test (pLDDT) and interface-predicted template modeling (ipTM) scores, verified metrics within AlphaFold2 (ref. 7), function as critical indicators of the structural integrity and the potential interface binding affinity of the peptide–protein complex, respectively, providing an external quantitative assessment of our generation. PPL, our confidence metric, showed significant agreement with folding scores, as the extracted ipTM and pLDDT values from our benchmarking indicated a statistically significant negative correlation (P