Zero-shot design of drug-binding proteins via neural iterative selection–expansion

Main

The de novo design of small-molecule binding proteins remains a considerable challenge1,2,5,6,8,9,10,11,12,13,14, despite rapid progress elsewhere in the field15,16,17,18,19,20,21,22,23,24,25,26. Notable successes1,2,3,4,6,13,27,28,29 have relied mainly on high-throughput experimental selection. The few cases3,4 with high computational hit rates (33%) approximated the functional groups of ligands as parts of amino acids. To generalize binder design, neural networks can in principle learn directly from training data to predict protein sequences and protein–ligand co-structures.

Self-consistency describes the agreement between the intended structure of a designed sequence and its predicted structure. Maximizing self-consistency has been a guiding principle for the design of new topologies30 and of binders to proteins and peptides15,16,23,31. However, this principle has not been extended to small-molecule binder design because of the complexities of encoding non-amino-acid chemistry. Models such as RoseTTAFold-All Atom (RFAA)8, Boltz-1/2 (refs. 32,33) and AlphaFold3 (AF3) (ref. 34) can predict a protein–ligand co-structure from a protein sequence and a ligand simplified molecular-input line-entry system (SMILES) string. Using such models, a self-consistent design has not only a predicted structure that closely resembles the intended backbone, but also a ligand that is predicted to bind in the intended site (Fig. 1a). This additional dimension should enable a more nuanced assessment of design quality. We reasoned that maximizing sequence–structure–ligand self-consistency would lead to the design of high-affinity small-molecule binding proteins with good success rates.

Fig. 1: A self-consistency optimization algorithm for the design of small-molecule binding proteins.
a, The principle of self-consistency shown for a four-helix bundle and a docked small molecule (shown as a stick model in blue).
Top, two self-consistent designs with respect to backbone, computed with, for example, AF2. Bottom, one of these self-consistent backbone designs is self-consistent with respect to the ligand (predicted docked ligand in magenta), computed using, for instance, RFAA. The schematic heat map shows a typical outcome for a NISE trajectory with the goal of populating the bottom left corner. Marginal distributions are shown using coloured bars for protein sequences 1–3 (s1–s3).
b, Schematic of the NISE protocol. NISE starts with an initial protein structure (backbone coordinates only) and a docked ligand location (magenta). The NISE process iteratively applies neural-network-based sequence design (expansion) and co-structure prediction. High-confidence (high ligand pLDDT) self-consistent designs are used as new inputs (selection) for sequence design. The depicted proteins and ligands are coloured by model confidence (low to high, red–yellow–green–cyan–blue). lig, ligand; P, probability; seq, sequence; struct, structure.
c, Comparison of neural and energy-based iterative selection–expansion (ISE) protocols for generating modified protein–ligand coordinates. In both cases, LASErMPNN was used to design the sequences. For energy-based ISE, the co-structure predictor was replaced with Rosetta energy minimization and designs were selected on the basis of low ligand energy. After 35 rounds of ISE, the structures of all designed sequences were predicted using RFAA. Ligand pLDDT (red, third quartile) and sequence negative log-likelihood (NLL; blue, first quartile) were plotted against design iteration. NISE (but not energy-based ISE) simultaneously optimized both ligand confidence (higher pLDDT) and protein sequence quality (lower NLL). Data are from a NISE trajectory using the input structure that produced EPIC (Fig. 3) and with exatecan used as the ligand. Quartiles were produced from n = 1,500 designs per iteration (n = 500 for the first round).
d, Simultaneous optimization of designs along two reciprocal conditional-probability distributions, P(seq|struct, lig) and P(struct, lig|seq), in c suggests that NISE is optimizing within the joint probability distribution, P(seq, struct, lig). conf., conformation.

To design binders, we implemented a self-consistency optimization algorithm, NISE, which explicitly considers small-molecule ligands. We applied NISE to two small-molecule drugs, exatecan and apixaban, and obtained highest-affinity binders with dissociation constants (Kd) of 120 nM and 80 pM, respectively. These affinities considerably surpass those achieved by other methods4,6.

NISE sampling algorithm

The de novo design of small-molecule binding proteins typically begins by docking a ligand into a precomputed protein scaffold, and then designing a sequence for the resulting pose1,4. Because the initial pose is rarely optimal, the designed sequence is also unlikely to be optimal. Therefore, a means of jointly refining the sequence, backbone and ligand conformation in a coupled manner is needed. A few methods, such as COMBS3,4, RIFdock2,6 and even brute-force rigid-body docking (Supplementary Information), can position ligands into a blank backbone for subsequent sequence design, with optional constraints on ligand orientation and burial. NISE then performs the joint refinement.

To maximize tripartite self-consistency for a given input structure (backbone and docked ligand), we iteratively designed sequences and predicted co-structures (Fig. 1b). For each sequence, a protein–ligand co-structure was computed and compared to the input structure from the previous round. Only designs with high self-consistency (low root mean square deviation (r.m.s.d.)) with respect to both backbone and ligand coordinates were kept. We then selected a few co-structures with the most confidently predicted ligands as new inputs (both backbone and ligand atoms) for the next round of sequence design.
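
The selection–expansion loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the `Prediction` record, the r.m.s.d. cut-offs and the toy stand-ins for the sequence designer and co-structure predictor are all assumptions introduced here. In practice the two callables would wrap a sequence-design network and a co-structure predictor.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Prediction:
    """Hypothetical record for one designed sequence and its predicted co-structure."""
    sequence: str
    backbone_rmsd: float  # r.m.s.d. to the input backbone (angstroms)
    ligand_rmsd: float    # r.m.s.d. to the input ligand pose (angstroms)
    ligand_plddt: float   # predictor confidence for the ligand
    structure: Any        # predicted backbone + ligand coordinates (opaque here)


def nise(initial_structure: Any,
         design_sequences: Callable[[Any, int, random.Random], List[str]],
         predict_costructure: Callable[[str, Any, random.Random], Prediction],
         rng: random.Random,
         n_rounds: int = 3,
         n_seqs_per_input: int = 20,
         max_backbone_rmsd: float = 1.5,  # assumed cut-off, not from the source
         max_ligand_rmsd: float = 1.5,    # assumed cut-off, not from the source
         n_keep: int = 2) -> Tuple[List[Any], List[List[Prediction]]]:
    inputs = [initial_structure]
    history: List[List[Prediction]] = []
    for _ in range(n_rounds):
        # Expansion: sample many sequences per input and predict each co-structure.
        candidates = [predict_costructure(seq, structure, rng)
                      for structure in inputs
                      for seq in design_sequences(structure, n_seqs_per_input, rng)]
        # Keep only designs that are self-consistent for backbone AND ligand.
        consistent = [p for p in candidates
                      if p.backbone_rmsd <= max_backbone_rmsd
                      and p.ligand_rmsd <= max_ligand_rmsd]
        history.append(consistent)
        if not consistent:
            continue  # nothing passed the filter; reuse the previous inputs
        # Selection: the most confidently predicted ligands seed the next round.
        consistent.sort(key=lambda p: p.ligand_plddt, reverse=True)
        inputs = [p.structure for p in consistent[:n_keep]]
    return inputs, history


# Toy stand-ins so the sketch runs end to end; they carry no chemistry.
def toy_design(structure: Any, n: int, rng: random.Random) -> List[str]:
    return ["".join(rng.choice("ACDEFGHIKLMNPQRSTVWY") for _ in range(10))
            for _ in range(n)]


def toy_predict(sequence: str, input_structure: Any, rng: random.Random) -> Prediction:
    return Prediction(sequence=sequence,
                      backbone_rmsd=rng.uniform(0.0, 3.0),
                      ligand_rmsd=rng.uniform(0.0, 3.0),
                      ligand_plddt=rng.uniform(0.0, 100.0),
                      structure=("predicted", sequence))


if __name__ == "__main__":
    final_inputs, per_round = nise("initial pose", toy_design, toy_predict,
                                   random.Random(0))
    print(len(final_inputs), [len(kept) for kept in per_round])
```

Passing the designer and predictor as callables keeps the loop independent of any particular model, which mirrors the way the protocol is described: filter on self-consistency first, then rank the survivors by ligand confidence.
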
Throughout the NISE process, we avoid using energy functions, relying instead on neural networks and model confidence to guide optimization. To avoid getting trapped in local minima, we encouraged exploration by sampling many sequences for a given backbone–ligand pair in each round, drawing from the probability distribution of the LASErMPNN neural network (see below) at a high softmax temperature. The structures, sequences and ligand conformers become progressively more compatible after each round (Fig. 1c, left).

This approach is similar to iterative coordinate ascent35, in which alternating argmax sampling from two conditional distributions, P(a|b) and P(b|a), locally climbs to a high-probability mode in the joint distribution, P(a, b). NISE samples broadly from P(sequence|structure, ligand conformation) and takes a confidence-based argmax over P(structure, ligand conformation|sequence). Together, this means that NISE climbs towards a high-probability mode in the joint distribution representative of the underlying training data, protein–ligand co-structures in the Protein Data Bank (PDB; Fig. 1d).

LASErMPNN neural network

We set out to train a neural network to learn the probability distribution of protein sequences conditioned on the three-dimensional structures of protein–ligand complexes. LASErMPNN is a heterograph neural network trained on protein–ligand co-crystal structures from the PDB. It predicts protein sequences and side-chain dihedral angles from protein backbone coordinates and ligand atomic coordinates (Fig. 2a and Supplementary Fig. 1). During inference, LASErMPNN autoregressively decodes each residue from an input protein–ligand complex using a randomly chosen decoding order. We trained the model until the accuracies of the training and validation sets overlapped. We then tuned the hyperparameters of the model to maximize foldability of the designed sequences without markedly losing sequence recovery (Supplementary Fig. 2).
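
Two of the sampling ideas mentioned above, drawing at a high softmax temperature and decoding residues in a randomly chosen order, can be illustrated with a small, self-contained sketch. It is not LASErMPNN: `toy_logits` is a hypothetical stand-in for a network that would score the 20 amino acids at a position given the structure and the residues decoded so far, and the temperature values are arbitrary.

```python
import math
import random
from typing import Callable, Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(logits: List[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax; higher temperatures flatten the distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decode_random_order(length: int,
                        logits_fn: Callable[[int, Dict[int, str]], List[float]],
                        temperature: float,
                        rng: random.Random) -> str:
    """Sample one residue per position, visiting positions in a random order."""
    order = list(range(length))
    rng.shuffle(order)
    decoded: Dict[int, str] = {}
    for pos in order:
        # Each prediction may depend on the residues decoded so far.
        probs = softmax(logits_fn(pos, decoded), temperature)
        index = rng.choices(range(len(AMINO_ACIDS)), weights=probs, k=1)[0]
        decoded[pos] = AMINO_ACIDS[index]
    return "".join(decoded[i] for i in range(length))


def toy_logits(pos: int, decoded: Dict[int, str]) -> List[float]:
    """Hypothetical scorer: favours one amino acid per position, ignores context."""
    logits = [0.0] * len(AMINO_ACIDS)
    logits[pos % len(AMINO_ACIDS)] = 4.0
    return logits


if __name__ == "__main__":
    rng = random.Random(0)
    print(decode_random_order(10, toy_logits, temperature=0.01, rng=rng))
    print(decode_random_order(10, toy_logits, temperature=2.0, rng=rng))
```

At a very low temperature the sampler behaves like argmax decoding, whereas a high temperature spreads probability over more residues, which is the kind of exploration the text describes for each round of sequence design.
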
The model accurately recovered native protein sequences in a held-out test set, outperformed a similarly trained ligand-free version of itself, and was comparable to (even slightly outperformed) a retrained version of a similar model, LigandMPNN36 (Fig. 2b).

Fig. 2: LASErMPNN designs protein sequences conditioned on protein–ligand co-structures.
a, LASErMPNN is an encoder–decoder network. It uses protein–ligand co-structures as input (protein backbone atoms and ligand atoms) and is trained to autoregressively predict (box labelled autoregressive sampling) the amino-acid identity and side-chain dihedral angles of each protein residue. LASErMPNN forms a heterograph from the protein–ligand co-structures, encoding the protein and ligand nodes separately. The ligand node embeddings (e1, e2, etc.) are taken from a pretrained ligand encoder (box labelled pretrained ligand encoder) and locked during training. The ligand encoder is trained to predict the atomic properties of ligands (such as the partial atomic charge) conditioned on atomic coordinates and elements. Autoregressive sampling (AS) is performed in turn for each of the L residues of the protein. Each new residue prediction is conditioned on previously decoded residue identities and their associated dihedral angles (green). LASErMPNN outputs a three-dimensional structure of the sampled side-chain coordinates built from predicted residue identities (Ai) and side-chain dihedral angles (χ1, χ2, χ3 and χ4).
b, LASErMPNN performance on a held-out test set of proteins compared with a similarly trained protein-only model (left, all ligand information removed) and a retrained LigandMPNN model (right). These plots show the prediction accuracy across binding-site residues when the model is tasked with predicting the entire protein sequence of each test protein (argmax sampling).
c, Left, LASErMPNN scores the wild-type (WT) sequence of monomeric streptavidin favourably, with high-probability amino acids (a.a.; shown in magenta) located in the biotin binding site and protein core. Right, the LASErMPNN design with highest binding-site sequence recovery (out of 10,000 sequences). Accurately predicted residues are depicted in green and two designed mutations in red. This design was in the top 25 when ranked by buried, non-hydrogen-bonded polar atoms.
d, LASErMPNN designs using the PiB structure (PDB: 8TN6_A) as the input. RFAA-predicted co-structures for 1,000 designs were filtered and ranked by the number of buried, polar non-hydrogen-bonded atoms. The original high-affinity PiB sequence was ranked fourth.

LASErMPNN differs from LigandMPNN in several key ways. LASErMPNN contains a distinct, pretrainable ligand encoder; performs simultaneous decoding of side-chain dihedral angles and amino-acid identity; and includes ligand nodes in each round of encoding and decoding. To enable accurate decoding of side-chain dihedrals in the presence of backbone noise, we first idealized and then noised entire backbone ‘frames’. This was necessary to reduce the memorization of crystal-structure artefacts, and contrasts with the approach of LigandMPNN, which applies noise to backbone atoms independently (Supplementary Information). Ablation studies showed that the performance of LASErMPNN depends on the simultaneous prediction of side-chain dihedral angles during sequence decoding and on pretraining a ligand encoder module using a large set of synthetic ligands37,38. The training task of the ligand encoder was to predict atom-level properties derived from quantum chemical computations, such as partial charge (Extended Data Table 1). A useful by-product of ligand-encoder pretraining is that predicted partial charges provide a diagnostic read-out of the model’s understanding of new ligands, a capability that LigandMPNN does not have.
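
The difference between noising a whole backbone frame and noising atoms independently can be made concrete with a short sketch. The code below is an illustration under stated assumptions and not the procedure from the Supplementary Information: the backbone coordinates are approximate ideal values, the noise magnitudes are arbitrary, the idealization step is not shown, and the function names are hypothetical. A frame perturbation is modelled as one small random rotation plus one translation applied to all backbone atoms of a residue.

```python
import math
import random
from typing import Dict, Tuple

Vec = Tuple[float, float, float]

# Approximate ideal backbone geometry for one residue (angstroms); illustrative values.
IDEAL_RESIDUE: Dict[str, Vec] = {
    "N": (-0.525, 1.363, 0.0),
    "CA": (0.0, 0.0, 0.0),
    "C": (1.526, 0.0, 0.0),
}


def distance(a: Vec, b: Vec) -> float:
    return math.dist(a, b)


def per_atom_noise(atoms: Dict[str, Vec], sigma: float, rng: random.Random) -> Dict[str, Vec]:
    """Independent Gaussian jitter on every atom; distorts bond lengths and angles."""
    return {name: tuple(x + rng.gauss(0.0, sigma) for x in xyz)
            for name, xyz in atoms.items()}


def random_unit_vector(rng: random.Random) -> Vec:
    v = [rng.gauss(0.0, 1.0) for _ in range(3)]
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def rotate(v: Vec, axis: Vec, angle: float) -> Vec:
    """Rodrigues' rotation of v about a unit axis by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    kx, ky, kz = axis
    vx, vy, vz = v
    cross = (ky * vz - kz * vy, kz * vx - kx * vz, kx * vy - ky * vx)
    dot = kx * vx + ky * vy + kz * vz
    return tuple(vi * c + cr * s + ki * dot * (1.0 - c)
                 for vi, cr, ki in zip(v, cross, axis))


def frame_noise(atoms: Dict[str, Vec], trans_sigma: float, rot_sigma: float,
                rng: random.Random) -> Dict[str, Vec]:
    """One rigid-body perturbation per residue; internal geometry is preserved."""
    axis = random_unit_vector(rng)
    angle = rng.gauss(0.0, rot_sigma)
    shift = tuple(rng.gauss(0.0, trans_sigma) for _ in range(3))
    origin = atoms["CA"]  # rotate about the alpha carbon
    noised = {}
    for name, xyz in atoms.items():
        local = tuple(x - o for x, o in zip(xyz, origin))
        rotated = rotate(local, axis, angle)
        noised[name] = tuple(r + o + t for r, o, t in zip(rotated, origin, shift))
    return noised


if __name__ == "__main__":
    rng = random.Random(0)
    for label, noised in (("frame", frame_noise(IDEAL_RESIDUE, 0.1, 0.1, rng)),
                          ("per-atom", per_atom_noise(IDEAL_RESIDUE, 0.1, rng))):
        print(label, round(distance(noised["N"], noised["CA"]), 4))
```

Because a rigid-body move leaves every intra-residue distance unchanged, the N–CA and CA–C bond lengths stay ideal under frame noise, whereas independent per-atom noise alters them.
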
In a head-to-head comparison between LASErMPNN and LigandMPNN, we observed a general tendency for LigandMPNN to produce designs that were more overpacked, as measured by an increased density of protein heavy atoms (within 5 Å of the ligand) and a higher van der Waals repulsion energy of both the ligand and overall structure (Kolmogorov–Smirnov test, P