Conditional diffusion with locality-aware modal alignment for generating diverse protein conformational ensembles

Main

Proteins are fundamental building blocks of life and play an integral role in cellular and biological processes. Many proteins possess inherent flexibility that enables them to function through the interconversion of different conformational states with varying energy levels. This dynamic nature has profoundly shaped the protein functional repertoire in various contexts, including ligand interactions, enzymatic reactions and molecular evolution. As a result, accurately generating conformational ensembles for given protein sequences is vital for elucidating the mechanisms underlying protein functionality, with wide applications in the biological and medical sciences1,2,3.

In the past, experimental structure determination primarily focused on a single (or at best a few) discrete, static protein structures because of the substantial cost involved4. To achieve a more comprehensive delineation, molecular dynamics (MD) simulations are widely used to generate coherent trajectories of protein conformations. Owing to their high computational complexity and the need to simulate long timescales, MD simulations can be very time-consuming and resource demanding5,6. The emergence of AlphaFold2 has dramatically advanced the state of the art in protein structure prediction. By integrating structural and co-evolutionary information through Evoformer attention in an end-to-end learning architecture, AlphaFold2 enables the faithful prediction of an individual (arguably the most probable) protein structure7. A number of variants8,9,10,11,12,13 based on AlphaFold2 have been further proposed to produce multiple conformations for a protein by expanding the output of AlphaFold2 or by exploiting evolutionary information.
Examples include modifying multiple sequence alignment (MSA) depth through subsampling8,9 or residue replacement10, performing MSA cluster-level structure prediction11, using state-specific structural templates12 and employing multiple structural outputs as initialization for enhanced sampling13. Despite this great potential, how to fully recover the conformational heterogeneity observed in many proteins under standard AlphaFold2 inference protocols remains to be further explored14,15. More recently, Gao et al. proposed AF2Complex16, which allows the prediction of alternative conformations that emerge in the presence of protein–protein interactions. Incorporating such partners enables the discovery of conformational states that are inaccessible when proteins are modelled in isolation, thereby substantially improving the prediction of biologically relevant structural ensembles17,18. This interaction-driven conformational heterogeneity offers an interesting perspective for extending AlphaFold2's capabilities towards modelling diverse, functionally relevant conformational ensembles.

In recent years, non-deterministic generative models have drawn considerable attention for systematically generating the conformational ensembles of proteins. Early works focused on generative adversarial networks or variational autoencoders19,20,21. Considerable interest then turned to diffusion models22 because of their promising results in generating realistic samples from the distribution they are trained on. Examples include non-conditional diffusion models such as FoldingDiff23 and Str2Str24; conditional generative models that use sequence representations from structure prediction models (as conditions) and various types of equivariant networks (for denoising), such as EigenFold25, ConfDiff26, DiG27 and BioEmu28; and methods using advanced flow models and diffusion transformers, such as AlphaFlow29 and IdpSAM30.
Furthermore, diffusion models have also been applied successfully to protein design tasks31,32.

Most existing conditional diffusion methods24,25,26,27,28 have relied heavily on structure prediction models such as AlphaFold27, ESMFold33 or OmegaFold34, both for the protein geometric representation (for example, residue frames7 and Cα atom coordinates25) and for the denoising network architectures used in these methods (for example, invariant point attention7,31). Meanwhile, the initial sequence embedding used as the condition for the generative model was also obtained from these structure prediction models. For example, EigenFold used the residue embeddings from OmegaFold, both DiG and BioEmu used those from AlphaFold2, and ConfDiff used the sequence embedding from ESMFold. Although encouraging results have been observed, further exploration is needed to determine whether sufficient structural heterogeneity can be extracted from the sequence representations of structure prediction models. This is because many structure prediction models, under their default settings, were designed to predict a single structure for a given sequence, so the resultant representations might be biased towards the dominant structure that these models tend to predict35.

Here we present modal-aligned conditional diffusion (Mac-Diff), a score-based conditional diffusion model that generates realistic and diverse protein conformational ensembles. Mac-Diff performs iterative denoising on protein backbone geometries (target view) while continually receiving guidance from the protein sequence (conditional view). Instead of using structure prediction models such as AlphaFold2, Mac-Diff adopts protein language models (PLMs) such as ESM-233 to obtain the initial representation of protein sequences as conditions.
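To make the conditioning idea concrete, one common way to turn per-residue PLM embeddings into pair-level conditioning features is outer concatenation: the feature for residue pair (i, j) is the concatenation of the embeddings of residues i and j. This is a hedged sketch of that generic pattern, not Mac-Diff's actual conditioning scheme, and the function name and dimensions are illustrative assumptions:

```python
import numpy as np

def pair_condition_features(residue_emb: np.ndarray) -> np.ndarray:
    """Build an L x L x 2d pair-conditioning tensor from per-residue
    embeddings of shape (L, d) by outer concatenation: entry (i, j)
    is the concatenation of the embeddings of residues i and j.
    Illustrative sketch only; not the paper's implementation."""
    L, d = residue_emb.shape
    row = np.broadcast_to(residue_emb[:, None, :], (L, L, d))  # embedding of i
    col = np.broadcast_to(residue_emb[None, :, :], (L, L, d))  # embedding of j
    return np.concatenate([row, col], axis=-1)

# Toy example: 4 residues, 8-dimensional embeddings
emb = np.random.default_rng(0).standard_normal((4, 8))
pair = pair_condition_features(emb)
print(pair.shape)  # (4, 4, 16)
```

In practice the per-residue embeddings would come from a PLM such as ESM-2, and the pair features would be projected to the channel width of the denoising network.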
ESM-2 was trained by unsupervised masked language modelling on massive protein sequence datasets, allowing it to capture a wide spectrum of information, ranging from evolutionary patterns, structural motifs and functional properties to broader biological knowledge at different scales. This semantically rich sequence representation showed great potential as a scalable and alignment-free alternative for capturing the conformational diversity of proteins in our study. Notably, the PLM-derived residue embedding has also been used in Chai-136 in its single-sequence mode, achieving strong structure prediction performance, particularly for protein–ligand complexes and multimers.

Central to Mac-Diff is an attention module, termed locality-aware modal alignment attention (LAMA-attention), that bridges the gap between the conditional modality (protein sequence) and the target modality (residue geometry). Compared with text-to-image tasks37, which require only loose, unstructured alignment between text tokens and image pixels, LAMA-attention enforces a physically more delicate alignment between sequence and structure. In particular, while the direct correspondence between a specific residue and its own three-dimensional (3D) coordinates is trivial, the critical alignment lies between a residue in 3D space and its spatially interacting neighbours traced back to the input sequence. Capturing these interactions is the key to injecting useful sequence information into the target space for structural denoising. By restricting the attention field of each residue to its most likely local interacting environment, the locality-aware alignment between the two modalities computes highly contextualized features in the target space to recover realistic and diverse protein structures.
See Supplementary Note 1 for a comparison of the different levels of modal alignment required for text-to-image generation and for protein conformational ensemble generation, as well as a discussion of the limitations of conventional cross-attention in the latter task.

Mac-Diff showed promising results in generating realistic conformational ensembles for given protein sequences. Empirically, Mac-Diff effectively recovered the conformational distributions of 12 fast-folding proteins from the benchmark test set24,38 that it had never seen before, in terms of several important evaluation metrics, such as the Jensen–Shannon (JS) divergence on the Cα–Cα distance distribution, the radius of gyration distribution and the Cα–Cα distance distribution projected onto the top two components (TICs) of time-lagged independent component analysis (TICA). Notably, the conformations generated by Mac-Diff exhibited greater diversity than those of competing methods while preserving a high level of accuracy in terms of ensemble distributions.

Furthermore, Mac-Diff demonstrated a promising ability to predict alternative conformations that are potentially biologically relevant, even for proteins not encountered during training. For example, it recovered important conformational substates of bovine pancreatic trypsin inhibitor (BPTI) that were observed in long (1 ms) MD simulations, and it also predicted the closed and open states of adenylate kinase (AdK), an allosteric protein involved in energy metabolism. Finally, Mac-Diff achieved a sampling speed approximately 3,000 times faster than conventional MD simulations, that is, over three orders of magnitude. Overall, we believe that Mac-Diff has the potential to improve our understanding of protein folding dynamics and to provide insights into the intricate relationship between protein sequence, structure and function.
The capability of Mac-Diff to predict conformational heterogeneity will also be useful in applications such as structure-based drug design and protein engineering.

Results

Figure 1 illustrates the overall design of Mac-Diff. Figure 1a shows the backbone geometric representation used in Mac-Diff, comprising the pairwise Cβ distance, a dihedral angle, a planar angle and a padding channel, all of which are invariant to 3D rotation and translation. Figure 1b gives the model overview. Mac-Diff is a score-based conditional generative model capable of recovering the conformational distribution of a protein by generating backbone geometric structures and converting them to atom-level coordinates through the Rosetta folding protocol. The forward diffusion process iteratively injects noise into the geometric tensor, and the backward process achieves iterative denoising. The denoising network is a U-Net structure with five downsampling/upsampling stages. Each stage contains a ResNet block to integrate time-step embeddings with residue-pair representations, and a TransFormer block to update residue-pair representations via self-attention and LAMA-attention. Figure 1c depicts the LAMA-attention. It enforces a well-controlled spatial alignment between the sequence view and the structure view by forcing each residue to attend only to those neighbouring residues with high contact probability, thereby updating residue-pair representations with highly relevant and contextualized sequence features for denoising. More detailed descriptions can be found in the Methods.

Fig. 1: Overview of Mac-Diff architecture.

a, Protein backbone geometric representation of a protein with L residues as an L × L × 5 tensor, with the pairwise Cβ distance, the dihedral angle ω along two Cβ atoms, the dihedral angles θ and the bond angles ϕ (direction of the Cβ atom of one residue in a reference frame centred on the other residue), and a padding channel indicating sequence length. b, Mac-Diff workflow.
The forward diffusion process iteratively injects noise into the geometric tensor, and the backward process performs iterative denoising. The denoising network is a U-Net structure with five downsampling/upsampling stages, each with a ResNet block and a TransFormer block (self-attention and LAMA-attention). c, LAMA-attention, which allows each residue to attend only to neighbouring residues with high contact probability, updating residue-pair representations with highly relevant, contextualized sequence features for denoising. repr., representation; seq, protein sequence; Conv, convolutional layer; TF, TransFormer block; RN, ResNet block; Dn, downsampling stage.

Figure 2 illustrates the difference between the attention modules of Mac-Diff and stable diffusion, a popular diffusion framework for text-to-image generation37. In stable diffusion, the attention between pixels (queries) and words (keys) is dense and global, indicating that the alignment between pixels and words is unstructured, purely data-driven and without prior algorithmic control (Fig. 2a). In comparison, LAMA-attention focuses only on the interacting neighbours (their amino acid types) of a residue when updating its representation (Fig. 2b). This effectively narrows the attention field from the whole sequence to only a small fraction of useful residues in the conditional view, marking a key distinction from the global cross-attention module in stable diffusion.

Fig. 2: Schematic comparison of cross-attention (in text-to-image generation) and LAMA-attention (in protein conformational ensemble generation).

a, In traditional cross-attention, each pixel in the generated image is potentially related to all tokens in the input text, without prior algorithmic control on the attention field.
b, In LAMA-attention, each residue-pair (Res-Pair) representation ij is related to only those residues that are likely to interact with residues i and j biologically, yielding a stronger, locality-aware spatial alignment between sequence and structure.

We performed comprehensive evaluations of Mac-Diff against state-of-the-art approaches for generating realistic and diverse protein conformational ensembles, making use of both carefully curated training datasets and widely used public benchmark datasets. Two complementary categories of tasks were designed to assess model performance. First, we evaluated the ability of Mac-Diff to recover the underlying distribution of conformational ensembles and to identify key conformational substates, using MD trajectories of fast-folding proteins and the BPTI benchmark as reference standards24,27,28,38,39. Second, we assessed its capability to predict alternative conformations of proteins, leveraging experimentally validated 3D structures from a subset of the Cfold dataset40 and AdK, a well-characterized model protein.

The Mac-Diff model was trained using a pretraining–fine-tuning strategy. It was first pretrained on 619,045 experimentally determined protein sequence–structure pairs from the Protein Data Bank (PDB) and subsequently fine-tuned on 1,674 MD trajectories collected from two public sources: 371 trajectories from GPCRmd41 and 1,303 trajectories from the Atlas of Protein Molecular Dynamics (ATLAS)42 (see dataset details in the ‘Experimental settings’ section in the Methods). Importantly, in both the pretraining and fine-tuning stages, we excluded all proteins whose sequence identity with any test protein exceeded 40%, ensuring a rigorous evaluation of the model’s generalization ability. Detailed settings and statistics of the training data are provided in the ‘Training data’ section in the Methods.
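The locality-aware restriction of LAMA-attention can be illustrated with a minimal masked cross-attention sketch. The single-head formulation, the top-k neighbourhood rule and the use of a precomputed contact-probability matrix are all simplifying assumptions made for this sketch; the paper's actual formulation may differ:

```python
import numpy as np

def lama_attention_sketch(struct_q, seq_kv, contact_prob, k=8):
    """Cross-attention from structure-view queries to sequence-view
    features, masked so each residue attends only to its k most
    probable contact neighbours. Illustrative sketch only.

    struct_q:     (L, d) query features from the structure view
    seq_kv:       (L, d) key/value features from the sequence view
    contact_prob: (L, L) predicted contact probabilities
    """
    L, d = struct_q.shape
    scores = struct_q @ seq_kv.T / np.sqrt(d)        # (L, L) attention logits
    # Restrict each residue's attention field to its k most likely contacts
    topk = np.argsort(-contact_prob, axis=1)[:, :k]
    mask = np.full((L, L), -np.inf)
    np.put_along_axis(mask, topk, 0.0, axis=1)
    scores = scores + mask
    # Softmax over the unmasked (local) neighbourhood only
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ seq_kv                          # (L, d) updated features

rng = np.random.default_rng(1)
L, d = 16, 32
out = lama_attention_sketch(rng.standard_normal((L, d)),
                            rng.standard_normal((L, d)),
                            rng.random((L, L)), k=4)
print(out.shape)  # (16, 32)
```

The key contrast with the dense cross-attention of stable diffusion is the additive mask: logits outside the predicted local interacting environment are set to negative infinity, so the softmax distributes attention only over the small fraction of biologically relevant residues.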
For all other competing models, results were obtained without redundancy control, by directly using their released models trained on the original datasets (as retraining on de-redundant data would be prohibitively costly). These stringent comparative conditions ensure a rigorous and fair evaluation of Mac-Diff's performance.

Performance evaluation on fast-folding proteins benchmark

We first evaluated Mac-Diff in generating the conformational ensembles of fast-folding proteins. To evaluate how well the generative model reproduces the conformational distribution of the original MD data, we used the complete D. E. Shaw Research (DESRES) dataset, which includes 12 structurally diverse fast-folding proteins, along with the BPTI protein. These proteins have been widely used as benchmarks for evaluating the quality of equilibrium distributions generated by computational models24,26,38. The reference conformational distribution for each protein is represented by 1,000 conformations sampled at a fixed stride from its MD simulation trajectories, which range from 100 μs to 1 ms and cover multiple folding or unfolding events43. See the detailed protein trajectory information in the ‘Test data’ section in the Methods and Supplementary Table 1.

We generated 1,000 conformations with Mac-Diff and with each of the competing methods to recover the protein conformational distribution, and examined the quality by assessing their deviation from the reference MD distribution.
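One way to score such a deviation is the JS divergence between histograms of a scalar observable computed over the two ensembles, for example the radius of gyration. This is a hedged sketch with toy data; bin counts and the exact observables used in the paper follow its Methods, not this snippet:

```python
import numpy as np

def js_divergence(p, q, eps=1e-10):
    """Jensen-Shannon divergence (base 2) between two histograms."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def radius_of_gyration(coords):
    """Rg of one conformation from its Ca coordinates, shape (N, 3)."""
    centred = coords - coords.mean(axis=0)
    return np.sqrt((centred ** 2).sum(axis=1).mean())

# Toy ensembles: 1,000 conformations of a 20-residue chain each
rng = np.random.default_rng(0)
ens_a = rng.standard_normal((1000, 20, 3))
ens_b = rng.standard_normal((1000, 20, 3)) * 1.2  # slightly more expanded
rg_a = np.array([radius_of_gyration(c) for c in ens_a])
rg_b = np.array([radius_of_gyration(c) for c in ens_b])

# Shared bins so the two histograms are directly comparable
bins = np.histogram_bin_edges(np.concatenate([rg_a, rg_b]), bins=50)
h_a, _ = np.histogram(rg_a, bins=bins)
h_b, _ = np.histogram(rg_b, bins=bins)
print(float(js_divergence(h_a, h_b)))
```

The base-2 JS divergence is symmetric and bounded in [0, 1], with 0 indicating identical distributions, which makes it convenient for averaging across proteins and metrics.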
To provide a quantitative analysis, we computed the JS divergence on the following distributions: (1) the pairwise Cα atom distance distribution (JS-PwD), which reflects the global spatial structure of the protein; (2) the radius of gyration distribution (JS-Rg), which is based on the distances between each Cα atom and the centre of mass of the protein, reflecting the compactness of the protein; and (3) the pairwise Cα atom distance matrices projected onto the top two TICs (JS-TIC), which are commonly used to analyse the slow dynamics of protein MD trajectories44,45,46. Each metric is evaluated independently for each protein and subsequently averaged across the 12 fast-folding proteins. Detailed definitions of the three metrics can be found in the ‘Evaluation metrics’ section in the Methods.

As shown in Fig. 3a, Mac-Diff achieved competitive performance compared with existing diffusion- and flow-based models across all equilibrium metrics on the fast-folding proteins. Averaged over the 12 fast-folding proteins, Mac-Diff reduced the errors in JS-PwD, JS-Rg and JS-TIC by approximately 18%, 22% and 5%, respectively, relative to the best competing method. In statistical testing (two-sided Wilcoxon test), Mac-Diff outperformed five of the competing methods under JS-PwD and JS-TIC (P < 0.05 with medium-to-large effect sizes) and outperformed all other competing methods under JS-Rg (P < 0.05 with medium-to-large effect sizes), as detailed in Supplementary Table 2. In Supplementary Tables 3–5, we provide more detailed comparative results on each of the 12 fast-folding proteins under these three metrics. These results demonstrated the potential of Mac-Diff in recovering protein conformational distributions.

Fig. 3: Performance of all competing methods in recovering the conformational distributions of the fast-folding proteins.

a, The JS divergence between the generated conformational distribution and the reference MD distribution for 7 competing methods, averaged over 12 fast-folding proteins43, when considering the pairwise Cα atom distances (JS-PwD), the radius of gyration (JS-Rg) and the pairwise Cα atom distance matrices projected onto the top two TICs (JS-TIC). Data were derived from a sample size of n = 12 independent test proteins, where each data point represents the mean of 3 independent runs. The box plots show the medians as the centre line, the 25th and 75th percentiles as the lower and upper quartiles, and 1.5 times the interquartile range as whiskers. Asterisks above boxes indicate statistically significant differences compared with Mac-Diff, as determined by a two-sided Wilcoxon signed-rank test (*P