Introduction

Motivation

Pathogenic bacteria invade the body and disrupt normal physiological functions, resulting in serious infectious diseases that have posed a significant threat to human and animal health throughout history [1]. These bacteria have evolved diverse strategies to establish infections, evade the host immune system, and thrive within the host environment [2,3]. The biological molecules produced by pathogenic bacteria during these processes can have a notably harmful impact on the host, leading to disease. These molecules are classified as virulence factors (VFs). VFs are specific components or molecules, such as enzymes, toxins, secreted factors, surface structures (e.g., pili or capsules), and genetic elements (e.g., plasmids or pathogenicity islands), that enable bacteria to colonize the host, evade or suppress the immune response, and cause damage to host cells or tissues [4].

Understanding and identifying VFs is crucial for advancing our knowledge of infectious diseases and developing targeted strategies for prevention and treatment. However, due to the limitations of bacterial culture technology, most bacterial strains remain unexplored, and only a few VFs have been identified [5]. Furthermore, bacteria undergo genetic mutations and recombination over time, and these changes in their genetic material can give rise to new strains with altered characteristics.

Traditionally, the identification of virulence factors has relied heavily on wet-lab experiments, such as mutagenesis studies, transcriptomic profiling, and proteomic analyses. These approaches involve labor-intensive techniques such as gene knockouts, in which the effect of specific genes on bacterial pathogenicity is studied, or differential expression analysis under various environmental conditions to identify genes associated with virulence [6,7]. While these methods have provided valuable insights into virulence mechanisms, they are time-consuming, costly, and often limited in scale. As a result, computational approaches have emerged to complement wet-lab efforts by rapidly predicting potential virulence factors from genomic and proteomic data [8].

Related work

Recent computational approaches have primarily aimed to develop effective feature extraction methods from protein sequences to address the task of VF prediction. VirulentPred [9] utilized sequence-based features, such as Amino Acid Composition (AAC) and Dipeptide Composition (DPC), along with evolutionary features obtained through the Position-Specific Scoring Matrix (PSSM), and used a Support Vector Machine (SVM) together with the BLAST tool for VF prediction. MP3 [10] utilized sequence-based features such as AAC and dipeptide frequency, employing an SVM and a Hidden Markov Model (HMM) to predict VFs. PBVF [11] introduced an additional feature called sequence similarity (seqsim), which directly considers sequence similarity, and demonstrated its significance in VF prediction. DeepVF [12] proposed a hybrid ensemble model that integrates machine learning models (SVM, Random Forest, XGBoost, MLP) and deep learning models (CNN, LSTM, DNN) for classification, incorporating various sequence features. VF-Pred [13] incorporates physicochemical property-based feature extraction, Pseudo Amino Acid Composition (PseAAC), Quasi-sequence order (QSO), and sequence alignment features representing the percentage of sequence alignment with both the negative and positive datasets.
GTAE-VF [14], the most recently proposed VF prediction model, introduced a novel graph transformer autoencoder that leverages ESMFold-predicted 3D structures.

Recent studies [11,12,13] indicate that well-defined protein feature extraction methods play a crucial role in achieving accurate VF prediction. However, traditional protein feature extraction methods have certain limitations. For example, AAC is useful for broad characterizations of proteins, but since it only considers amino acid frequency without positional information, it lacks the specificity needed to capture detailed functional or structural insights [15]. PSSM is effective at identifying conserved sequence motifs but is insufficient to capture complex dependencies and interactions, particularly the correlation information between residues, as it focuses on position-specific probabilities without modeling the interdependence of residue positions [16].

Our approach

VF prediction is a critical task for understanding bacterial pathogenicity and developing targeted treatments for infectious diseases. Existing computational approaches have achieved varying levels of success, but they often lack the ability to incorporate evolutionary interdependencies within protein sequences. Thus, we aimed to develop a more powerful protein feature extraction method for VF prediction. Coevolutionary information, which reflects the evolutionary interdependencies between amino acid residues, has been shown to provide biological insights into functional and structural relationships within proteins [17,18]. While coevolutionary information has primarily been applied to tasks such as protein structure prediction and protein function analysis, its ability to reveal functional correlations among residues suggests potential applicability to VF prediction. Leveraging coevolutionary information for accurate VF prediction, we propose a novel model, MSA-VF Predictor (MVP), and a protein feature extraction method named MSA-composition, which utilizes MSA data to effectively capture coevolutionary information from protein sequences. Additionally, MVP provides a web server that allows users to determine whether a protein is a VF by simply inputting the desired protein sequence, offering valuable support for further research (http://bhi4.snu.ac.kr:7978).

MSA-composition is a core component of MVP: it projects protein sequences into an embedding space by featurizing the coevolution-enriched latent representations extracted via the MSA Transformer [19]. Transformer architectures have been successful on heterogeneous biological sequence and signal data, supporting our choice of the MSA Transformer to exploit information present in homologous alignments [20]. The MSA Transformer, serving as the MSA encoder module within MVP, has previously demonstrated its ability to capture coevolutionary information and homologous protein information in MSA data.

As a result, we demonstrate that our approach achieves state-of-the-art (SOTA) performance compared to baseline models, as presented in Sect. “Comparison with existing models”. Section “Comparison of feature extraction methods” highlights the effectiveness of the proposed coevolution-based feature extraction method, MSA-composition. Section “Coevolutionary-based feature effectiveness analysis” investigates whether the coevolution-based feature extraction method actually influenced the performance of VF prediction. Sections “Analysis of coevolutionary information in positive test data during the inference stage” and
“Impact of MSA coevolutionary information on MVP performance” show that MVP depends on the coevolutionary information of VFs, using Mutual Information Z-scores [21] derived from MSA data. In Sect. “Interpreting the correlation between MSA Transformer attention mechanism and performance of MVP”, we investigate the impact of the MSA Transformer on VF prediction performance through a correlation analysis of attention blocks within the MSA Transformer and their relevance to the task. Finally, in Sect. “Evaluation of model generalizability”, we validate on an external dataset that MVP outperforms competing baseline models on entirely new experimental VF data.

Materials and methods

In this section, we describe the architecture of MVP and its implementation details. An overview of the MVP construction procedure is depicted in Fig. 1.

Fig. 1 The overall workflow of our proposed model. First, we transform protein sequences into an MSA, where the MSA consists of 128 sequences, each with a length of 1,024. Second, we compute a 3-dimensional representation with an embedding dimension of 768 through a pre-trained MSA Transformer. Then, we use only the embedding data of the query sequence, referred to as the MSA embedding. Using the MSA embedding, we construct a novel protein feature extraction method, MSA-composition, with a size of (768 × 20). Finally, we predict VFs by flattening the MSA-composition and using an MLP as the classifier.

Dataset

Training and test dataset

A high-quality dataset with no information leakage is the most crucial factor for conducting a prediction task. We set DeepVF as the baseline model and used the same benchmark dataset from the work of DeepVF to ensure a fair comparison. They collected 9,749 positive VF samples (Victors [22], VFDB [5], and PATRIC [23]) and 66,928 negative non-VF samples from PBVF [11], constructed with PBVF's effective negative data selection strategy. To compile a set of VF-unlike sequences, PBVF developed an annotation-based approach: it assesses the annotation frequency of each GO term in the VF foreground against the background of all VF-contributing proteomes. A hypergeometric distribution and Bonferroni correction are applied to obtain a P-value for each term, with lower values indicating statistical evidence for enrichment. Each protein is then assigned the lowest P-value observed among its annotated terms. Based on the P-value distribution and the required number of sequences, a threshold is applied to obtain the final selection. All collected data are in the form of amino acid sequences.

To reduce data bias, CD-HIT [24] was used to cluster the sequences and remove sequence redundancy in the positive and negative datasets, respectively. The sequence identity threshold was set to 0.3. As a result, the final non-redundant dataset contained 3,576 VFs and 4,910 non-VFs. Among these, 576 VFs and 576 non-VFs were randomly selected as the independent test dataset, following the same dataset configuration as the previous research (DeepVF) for fair comparison, while the remainder were used as the training dataset (3,000 VFs and 4,334 non-VFs).
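As an illustration of this annotation-based negative selection, the per-term enrichment test described above can be sketched with scipy as follows. The data structures and helper names are our own simplification of the PBVF pipeline [11], not its actual implementation.

```python
from scipy.stats import hypergeom

def go_term_enrichment_pvalues(fg_counts, bg_counts, fg_total, bg_total):
    """Bonferroni-corrected hypergeometric enrichment P-value per GO term.

    fg_counts : dict, GO term -> number of foreground (VF) proteins annotated with it
    bg_counts : dict, GO term -> number of background proteins annotated with it
    fg_total  : total number of foreground proteins
    bg_total  : total number of background proteins
    """
    n_terms = len(bg_counts)
    pvals = {}
    for term, k in fg_counts.items():
        K = bg_counts.get(term, k)  # term-annotated proteins in the whole background
        # P(X >= k) when drawing fg_total proteins from a pool of bg_total,
        # K of which carry this GO term
        p = hypergeom.sf(k - 1, bg_total, K, fg_total)
        pvals[term] = min(1.0, p * n_terms)  # Bonferroni correction
    return pvals

def protein_min_pvalue(protein_terms, pvals):
    """Assign each protein the lowest corrected P-value among its annotated terms."""
    return min((pvals[t] for t in protein_terms if t in pvals), default=1.0)
```

Thresholding these per-protein minimum P-values then yields the final VF-unlike negative set, as described above.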
External validation dataset

To evaluate the model's performance on a completely independent dataset that was not used in the training process, we used the VF sequences from the VFDB database updated on January 5, 2024. To ensure minimal redundancy with the existing VF dataset used in our study, we applied CD-HIT to remove sequences with a similarity threshold of 30%, resulting in a final set of 39 sequences. The results of the model performance comparison on the external dataset are presented in Sect. “Evaluation of model generalizability”.

Model architecture

Preprocessing MSA data

For each protein sequence in the dataset, we collected homologous sequences and constructed an MSA m using UniClust30 [25] and HHblits [26]. Because the number of aligned protein sequences varies per query, the number of sequences included in the MSA is adjusted to n using a diversity-maximizing strategy: a greedy procedure that starts from the reference sequence and iteratively adds the sequence with the highest average Hamming distance to the current set. To match the input format of the MSA Transformer, the collected sequences were truncated to a length of l; if a sequence was shorter than l, zero padding was applied. Here, \(m \in R^{n \times l}\) denotes the MSA of the query protein sequence.

MSA encoder

The MSA Transformer [19] is a specialized deep learning model that captures coevolutionary features from the MSA m. We employed the pre-trained MSA Transformer as the MSA encoder.

$$E = f_{MSA}(m), \quad E \in R^{n \times d \times l}$$ (1)

where \(f_{MSA}\) denotes the MSA Transformer, E denotes the latent representation of the MSA produced by the MSA Transformer, n denotes the number of aligned sequences, d denotes the dimension of the latent representation, and l denotes the maximum input sequence length of the MSA Transformer. In our study, we adopted the values specified in the original study [19]: n set to 128, d set to 768, and l set to 1,024.

The MSA Transformer produces a latent representation for each sequence in the MSA. In this study, we exclusively utilized the latent representation of the actual query protein sequence of the MSA, denoted as \(E_{0} \in R^{d \times l}\), and we call this the MSA embedding.

MSA-composition

Here we define the 'MSA-composition' feature, the average of the MSA embedding's latent representation over each amino acid type in the protein sequence.

$$E_{0} = [x_{1}, x_{2}, \ldots, x_{L}], \quad x_{i} \in R^{d}$$ (2)

$$S = [a_{1}, a_{2}, \ldots, a_{L}], \quad a_{i} \in AA$$ (3)

$$AA = [A_{1}, A_{2}, \ldots, A_{20}]$$ (4)

$$c_{j} = \frac{1}{L}\sum_{i=1}^{L} \mathbb{1}(a_{i} = A_{j})\, x_{i}, \quad \mathbb{1}(a_{i} = A_{j}) = \begin{cases} 1 & \text{if } a_{i} = A_{j} \\ 0 & \text{if } a_{i} \ne A_{j} \end{cases}$$ (5)

$$C = [c_{1}, c_{2}, \ldots, c_{20}], \quad C \in R^{d \times 20}$$ (6)

where S denotes the amino acid sequence of the query protein, AA represents the 20 amino acid types, L denotes the sequence length of the query protein, and C represents the MSA-composition.
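The computation in Equations (2)-(6) reduces to a masked average over embedding positions. Below is a minimal numpy sketch under the notation above; the function name and array layout are our own, and padding positions are assumed to have been trimmed so that E0 covers exactly the L residues of the query.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acid types A_1..A_20

def msa_composition(E0: np.ndarray, seq: str) -> np.ndarray:
    """Compute the MSA-composition C in R^{d x 20} from the query embedding.

    E0  : array of shape (d, L), the MSA embedding of the query sequence
    seq : amino acid sequence of the query protein, length L
    """
    d, L = E0.shape
    assert len(seq) == L
    C = np.zeros((d, len(AA)))
    for j, aa in enumerate(AA):
        mask = np.fromiter((a == aa for a in seq), dtype=bool, count=L)
        # Eq. (5): sum the embeddings of residues of type A_j, divide by L
        C[:, j] = E0[:, mask].sum(axis=1) / L
    return C

# The flattened C (768 * 20 = 15,360 values) is later concatenated with the
# 2-dimensional seqsim feature and fed to the MLP classifier.
```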
Sequence similarity

The seqsim feature directly measures sequence similarity. Previous works [11,12] have demonstrated that incorporating the seqsim feature during model training significantly enhances performance. To generate the seqsim feature, we used the BLAST+ tool to search each query protein against the positive and negative training datasets with an E-value cutoff of 10, following previous work [11]. We then generated a two-dimensional seqsim feature consisting of the highest bit score against the positive training dataset and the highest bit score against the negative training dataset. To prevent data leakage, seqsim features for the test dataset were generated against the positive and negative training datasets only.

Model classifier

In the final stage of the model, a multi-layer perceptron (MLP) classifier is employed to predict whether each query sequence is a virulence factor, leveraging the learned latent representation generated via MSA-composition combined with the seqsim feature. We simply concatenate the flattened MSA-composition and the seqsim feature, then feed the result into the MLP. An MLP is a feed-forward neural network consisting of multiple layers of nodes: an input layer, one or more hidden layers, and an output layer.

Training and optimization

To train the classifier in MVP, we used PyTorch [27] to build an MLP with three hidden layers containing 4096, 1024, and 512 nodes, respectively, and a dropout rate of 0.4. To prevent overfitting, the number of epochs was set to 40, and we defined a callback function to enable early termination of training while preserving the optimal model. We used the Adam optimizer with binary cross-entropy loss as the objective function.

Performance evaluation

Since handling imbalanced data is crucial in model training, we constructed five balanced training datasets (3,000 VFs and 3,000 non-VFs) by randomly sampling 3,000 negatives from the 4,334 negative training samples for each run. We then performed five independent training runs, evaluating each model on the same fixed independent test set (576 VFs and 576 non-VFs), and the model's performance was determined by averaging the five results. Accuracy, F1-score, Sensitivity, Specificity, and the Matthews correlation coefficient (MCC) were used to evaluate performance.

In addition, for hyperparameter tuning, we applied 10-fold cross-validation within each of the five balanced training datasets. The hyperparameters achieving the best cross-validation performance were selected and subsequently used to train the final models, which were then evaluated on the independent test dataset. The full cross-validation results are reported in Table S1.

$$ACC = \frac{TP+TN}{TP+FN+FP+TN}$$ (7)

$$F1\text{-}score = \frac{2TP}{2TP+FP+FN}$$ (8)

$$SN = \frac{TP}{TP+FN}$$ (9)

$$SP = \frac{TN}{FP+TN}$$ (10)

$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$ (11)

Category-wise performance evaluation

For the category-wise analysis in Sect. “Category-wise evaluation by VF type”, we formulated the evaluation as follows. We performed 5-fold stratified splitting on the positive VF pool to preserve per-category proportions (per fold: 2,861 positives for training, 715 positives for testing). The negative sets were held fixed across folds (4,334 for training; 576 for testing). In each fold, a model was trained from scratch with the same architecture and hyperparameters as in the main experiments. We computed per-category sensitivity (recall) on the positive test subset with per-category counts; other metrics (specificity, precision, F1, MCC) are not reported because the negative test set does not change across folds or categories.
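For concreteness, the following is a minimal PyTorch sketch of the classifier and training setup described in Sects. “Model classifier” and “Training and optimization” above. The layer sizes, dropout rate, optimizer, and loss follow the text; the input dimension (768 × 20 flattened MSA-composition plus the 2-dimensional seqsim feature) and the remaining details are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn

class MVPClassifier(nn.Module):
    """MLP head of MVP: flattened MSA-composition (768*20) + seqsim (2) -> P(VF)."""

    def __init__(self, in_dim: int = 768 * 20 + 2, dropout: float = 0.4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 4096), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(4096, 1024), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # raw logit; apply sigmoid for a probability

model = MVPClassifier()
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy computed on logits

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    """One optimization step over a batch of features x and 0/1 labels y."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x), y.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```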
Methods for result analysis

Mutual Information Z-score

To analyse the coevolutionary information in MSA data, we computed the Mutual Information (MI) Z-score. Gloor et al. [21] proposed that coevolutionary signals between two residues within a sequence can be inferred through mutual information. Essentially, MI measures how much knowing one distribution X tells us about another distribution Y:

$$MI(X;Y) = \sum_{x \in X}\sum_{y \in Y} P_{X,Y}(x,y) \cdot \log_{2}\frac{P_{X,Y}(x,y)}{P_{X}(x)P_{Y}(y)}$$ (12)

In practical terms, MI is used to analyse an amino acid sequence alongside multiple homologous sequences. Each column in the alignment represents the frequency distribution of amino acids at that position. High MI between two positions in the alignment suggests that changes (mutations) at one position are often associated with specific changes at the other, indicating potential coevolution or coordinated evolution between the two positions.

To distinguish meaningful coevolution signals from random changes arising from overall sequence conservation within an alignment column, the MI of the actual alignment is compared to MI values from randomized variants of the alignment, in which each alignment column undergoes random shuffling. The MI scores obtained from these variants are then aggregated to calculate a Z-score, providing a statistical measure of the significance of coevolution between the two positions. This Z-score-based analysis helps identify pairs of positions that exhibit statistically significant coevolution, aiding in the understanding of functional or structural relationships within biological sequences.

$$Z_{MI} = \frac{MI - \mu(MI_{\mathrm{shuffle}})}{\sigma(MI_{\mathrm{shuffle}})}$$ (13)

Protein coevolutionary ratio

To analyse the proportion of coevolving amino acid pairs within a protein, we defined the coevolutionary ratio, calculated as follows:

$$coevolutionary\ ratio = \frac{n(Z_{MI} > 1.645)}{(sequence\ length)^{2}}$$ (14)

A Z-score cutoff of 1.645 (the one-sided 95th percentile of the standard normal distribution) was employed to count the residue pairs in the top 5% of mutual information scores in each MSA, and the coevolutionary ratio was obtained by dividing this count by the squared sequence length.

Protein contact map

A protein contact map is a 2-dimensional matrix that represents the spatial proximity between amino acids in a protein's 3-dimensional structure. It captures which residues are "in contact" with each other, typically within a certain distance threshold (e.g., 8 Å). Many studies use contact maps to study how proteins fold and to predict the 3-dimensional structure of proteins, as they offer insights into the folding process by showing which residues interact. Coevolution of residues can also be observed by analyzing contact maps [28].

Error analysis during the inference stage

To determine whether the coevolutionary feature played a major role in the MVP model's VF prediction, we conducted an error analysis comparing the coevolutionary ratios of correctly and incorrectly predicted sequences. The results for the positive test dataset are described in Sect. “Analysis of coevolutionary information in positive test data during the inference stage”, while the results for the negative test dataset are presented in Figure S1.
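The following is a minimal numpy sketch of Equations (12)-(14), treating the MSA as a character matrix of shape (number of sequences, alignment length). The number of shuffles and the pair-counting convention are illustrative choices rather than the exact settings of the original analysis [21], and the implementation is unoptimized.

```python
import numpy as np

def mutual_information(col_x: np.ndarray, col_y: np.ndarray) -> float:
    """MI between two alignment columns (1-D arrays of residue characters), Eq. (12)."""
    mi = 0.0
    for x in np.unique(col_x):
        px = np.mean(col_x == x)
        for y in np.unique(col_y):
            py = np.mean(col_y == y)
            pxy = np.mean((col_x == x) & (col_y == y))
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

def mi_zscore(msa: np.ndarray, i: int, j: int, n_shuffles: int = 100) -> float:
    """Z-score of MI(i, j) against column-shuffled null alignments, Eq. (13)."""
    rng = np.random.default_rng(0)
    observed = mutual_information(msa[:, i], msa[:, j])
    null = np.array([
        mutual_information(rng.permutation(msa[:, i]), rng.permutation(msa[:, j]))
        for _ in range(n_shuffles)
    ])
    return (observed - null.mean()) / (null.std() + 1e-12)

def coevolutionary_ratio(msa: np.ndarray, z_cutoff: float = 1.645) -> float:
    """Fraction of (unordered) residue pairs with significant coevolution, Eq. (14)."""
    L = msa.shape[1]
    n_sig = sum(
        1 for i in range(L) for j in range(i + 1, L) if mi_zscore(msa, i, j) > z_cutoff
    )
    return n_sig / L**2
```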
Results

Comparison with existing models

To evaluate the VF prediction performance of our model MVP, we used DeepVF, GTAE-VF, VF-Pred, PBVF, and MP3 as baseline models. MP3 [10] is a VF prediction tool that combines an SVM classifier trained with sequence-based features and a hidden Markov model. PBVF [11] proposed the effective negative data selection strategy that was used for our negative dataset and introduced the seqsim feature; it achieved high performance in VF prediction using an RF-based hybrid classifier. DeepVF [12] is a hybrid ensemble model integrating ML models and DL models as the classifier, using various feature extraction methods. VF-Pred [13] incorporates physicochemical property-based feature extraction methods, Pseudo Amino Acid Composition (PseAAC) and Quasi-sequence order (QSO), and also proposed and used sequence alignment features. GTAE-VF [14] is the latest VF prediction model, employing a novel graph transformer autoencoder over ESMFold-predicted 3D structures. To ensure a fair comparison of experimental results, we used the same datasets as those used in DeepVF, GTAE-VF, and VF-Pred.

Table 1 summarizes the performance comparison of MVP and the five baseline models. As shown in Table 1, MVP outperformed all baseline models, achieving an ACC of 0.869 and an F1-score of 0.868. Among the baseline models, GTAE-VF shows the second-best performance, with an ACC of 0.849.

Table 1 Performance comparison of existing models for predicting VFs on the test data.

Comparison of feature extraction methods

Although raw amino acid sequences contain essential information, they often fail to capture subtle residue interactions, making protein sequence featurization a critical step in many computational approaches to protein function prediction. Existing methods, such as amino acid composition (AAC), dipeptide composition (DPC), and position-specific scoring matrices (PSSM), typically struggle to encode coevolutionary signals, yielding an incomplete picture of how residues interact and evolve over time.

To address this limitation, we introduce "MSA-composition", a novel approach that leverages multiple sequence alignments to detect correlated mutations among residues. Unlike AAC, which captures only amino acid frequencies, and PSSM, which focuses on conserved positions without modeling interdependencies, MSA-composition aggregates information from aligned homologs and thereby encodes coevolutionary dependencies between residue positions. By incorporating evolution-based insights, MSA-composition provides a richer, biologically informed representation that improves predictive performance in tasks such as virulence factor detection and beyond. To evaluate MSA-composition's impact on the predictive capabilities of MVP, we conducted comparative experiments using various classifiers with different featurization methods.

In this section, we investigate the performance of various protein feature extraction methods. Each method captures distinct characteristics of protein sequences: AAC and DPC capture sequence-based features, PSSM represents evolutionary features, and the Multiple Sequence Alignment (MSA) captures coevolutionary features. In the VF prediction task, a previous study [12] identified PSSM-based methods as the most critical among traditional methods. However, PSSM is limited in its ability to capture the coevolutionary information of protein sequences. Coevolutionary patterns between residues can provide valuable insights into regions of proteins that are functionally or structurally significant, such as motifs, domains, or active sites.
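For reference, the two classical sequence-composition baselines discussed above can be sketched as follows; standard feature-extraction toolkits provide equivalent (and faster) implementations.

```python
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Amino Acid Composition: 20 relative frequencies, position-agnostic."""
    L = len(seq)
    return [seq.count(a) / L for a in AA]

def dpc(seq: str) -> list[float]:
    """Dipeptide Composition: 400 relative frequencies of adjacent residue pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    return [pairs.count(p) / total for p in ("".join(t) for t in product(AA, repeat=2))]

# Both features discard long-range positional information, which is precisely
# what MSA-composition is designed to recover.
print(len(aac("MKTAYIAKQR")), len(dpc("MKTAYIAKQR")))  # 20 400
```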
Table 2 summarizes the VF prediction results using various feature extraction methods combined with traditional classifiers, including MLP, Random Forest, XGBoost, and SVM. For the traditional classifiers, we utilized the scikit-learn package [29]; the detailed hyperparameter tuning for each classifier is provided in the Supplementary Data, below Table S2. All experimental results are average values obtained by training on the five balanced training datasets and testing on the independent test dataset, and all results are based on models trained with the seqsim features included. Table 2 shows that the MSA-composition feature extraction method, which captures coevolutionary features, exhibits superior performance in VF prediction. Additionally, Table S2 provides experimental results for other feature extraction methods.

Table 2 Performance comparison of machine learning models and combined features (+seqsim) for predicting VFs on the five balanced training datasets and the test data.

While MSA-composition alone demonstrated strong performance in VF prediction, we conducted additional experiments to evaluate its effectiveness when combined with other feature extraction methods. As shown in Table S3, combining AAC with MSA-composition and using an MLP as the classifier resulted in an accuracy of 0.874 for VF prediction. These results suggest that future research could further enhance VF prediction performance by integrating well-defined feature extraction methods with MSA-composition. Additionally, to assess the importance of the seqsim feature, we evaluated the model using only MSA-composition (excluding seqsim) and obtained an accuracy of 0.821.

Coevolutionary-based feature effectiveness analysis

Transformers have proven to be a powerful deep learning architecture for modeling sequential data, including protein sequences, by learning latent representations that capture interactions among positions within a sequence. To determine whether MVP's improved performance arises from the transformer architecture alone or from the MSA Transformer's ability to capture coevolutionary signals, we conducted a comparative experiment with a non-MSA-based transformer encoder. Specifically, we evaluated VF prediction using two widely used protein language models, ESM2 [30] and the MSA Transformer.

Table 3 presents the VF prediction results for the two protein encoders. ESM2 is a SOTA general-purpose protein language model. Since ESM2 takes only individual sequences as input, its latent representation has the same size as the MSA embedding described in Sect. “MSA encoder”. Using the functions described in Sect. “MSA-composition”, we derived the final features from ESM2 and performed VF prediction. Under an identical training and evaluation protocol, the MSA Transformer-based MVP outperformed the single-sequence ESM2, with higher sensitivity and comparable specificity. These results confirm that feature extraction methods incorporating coevolutionary information significantly enhance VF prediction performance, highlighting the importance of coevolutionary insights for this task.

Table 3 Performance comparison with a protein encoder that does not consider coevolutionary information.
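Both encoders are distributed through the fair-esm package, and the sketch below shows how per-residue representations (and predicted contact maps, used in the next analysis) can be obtained from each. The entry points follow the package's published pretrained checkpoints, but the specific checkpoint choices, layer indices, and toy sequences here are our own illustrative assumptions, not necessarily the exact configuration used in this study.

```python
import torch
import esm

# Single-sequence encoder (no coevolutionary context) and the MSA encoder used in MVP
esm2, esm2_alphabet = esm.pretrained.esm2_t33_650M_UR50D()
msa_tr, msa_alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()

seq = ("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")        # placeholder sequence
msa = [seq, ("hom1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # placeholder alignment

with torch.no_grad():
    # ESM2: one sequence in, per-residue embedding and contact map out
    _, _, toks = esm2_alphabet.get_batch_converter()([seq])
    out = esm2(toks, repr_layers=[33], return_contacts=True)
    esm2_repr = out["representations"][33][0, 1:-1]  # strip BOS/EOS -> (L, d_esm2)
    esm2_contacts = out["contacts"][0]               # predicted contact map

    # MSA Transformer: whole alignment in; row 0 holds the query embedding E0
    _, _, msa_toks = msa_alphabet.get_batch_converter()([msa])
    out = msa_tr(msa_toks, repr_layers=[12])
    E0 = out["representations"][12][0, 0, 1:]        # drop BOS -> (L, 768)
    msa_contacts = msa_tr.predict_contacts(msa_toks)[0]
```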
Both the MSA Transformer and ESM2 can provide latent representations and contact maps as output. To evaluate whether the MSA Transformer captures more coevolutionary information than ESM2, we compared their predicted protein contact maps, as coevolutionary signals are critical for accurate contact map prediction. Figure 2 shows the protein contact maps for (a) protein 1EKR and (b) protein 1YWF, two VFs for which the MSA Transformer produced accurate predictions while ESM2 failed. Ground-truth protein contact maps were derived from the three-dimensional protein structures in the RCSB PDB database [31]. In the regions highlighted by red boxes in Fig. 2, the contact maps predicted by the MSA Transformer more accurately identify interactions between distantly positioned amino acid residues. This observation suggests that the MSA Transformer captures coevolutionary information more effectively than ESM2. Supplementary Figure S2 presents additional contact map comparison results.

Fig. 2 Protein contact maps of (a) 1EKR and (b) 1YWF (left: ESM2 predicted contact map; middle: MSA Transformer predicted contact map; right: real contact map).

Interpreting the performance of MVP

Analysis of coevolutionary information in positive test data during the inference stage

Figure 3 presents a comparison of coevolutionary ratios between true positives and false negatives obtained from the positive test dataset during inference. The coevolutionary ratio was calculated as described in Equation (14).

Fig. 3 Coevolutionary ratio comparison between true positive and false negative test data. After training MVP on the entire training dataset, we examined the difference in coevolutionary ratio between true positive data and false negative data through the inference results on the positive test dataset (Mann-Whitney U test P-value < 0.001).

The higher coevolutionary ratio observed in the true positive data suggests that MVP predicts VFs more effectively when they contain a greater proportion of coevolution-related amino acids. This implies that MVP is strongly influenced by the coevolutionary signals within proteins, which enhance its predictive power for VF prediction. Additional details of this analysis are provided in Sect. “Impact of MSA coevolutionary information on MVP performance”.

Impact of MSA coevolutionary information on MVP performance

To isolate the contribution of coevolutionary information (inter-column signal) from general MSA properties, we performed an MSA column shuffling experiment at test time. For each protein, we computed and aggregated per-residue coevolution scores from the MI Z-scores to rank columns. We then selected the top x% high-MI columns and, for those columns only, independently permuted the MSA rows within each column. This procedure preserves per-column amino acid frequencies, entropy, gaps, sequence length, and MSA depth, while ablating inter-column (coevolutionary) dependencies for any pair involving the shuffled columns. The training data were not modified; shuffling was applied only to test MSAs. As a control, we shuffled the same number (x%) of low-MI columns, selected to be entropy-matched to the high-MI set. The results are reported in Table 4.

Table 4 Effect of MSA column shuffling on MVP performance.
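A minimal sketch of the column-shuffling ablation is given below; the column-ranking and entropy-matching steps are simplified relative to the full procedure described above.

```python
import numpy as np

def shuffle_columns(msa: np.ndarray, cols: list[int], seed: int = 0) -> np.ndarray:
    """Permute MSA rows independently within each selected column.

    Preserves per-column amino acid frequencies, entropy, and gap content,
    while destroying inter-column (coevolutionary) dependencies for any pair
    involving the shuffled columns. msa: (n_sequences, L) character matrix.
    """
    rng = np.random.default_rng(seed)
    out = msa.copy()
    for c in cols:
        out[:, c] = rng.permutation(out[:, c])
    return out

def top_mi_columns(per_col_mi: np.ndarray, frac: float) -> list[int]:
    """Select the top x% of columns by aggregated per-column MI Z-score."""
    k = max(1, int(frac * len(per_col_mi)))
    return list(np.argsort(per_col_mi)[::-1][:k])

# Usage: shuffle the top 10% high-MI columns of a test MSA, then re-run inference.
# cols = top_mi_columns(per_col_mi, 0.10)
# msa_ablated = shuffle_columns(msa, cols)
```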
Shuffling high-MI columns led to a pronounced decrease in sensitivity, whereas shuffling low-MI columns caused only minor changes. In both conditions, specificity increased slightly, and the magnitude of this increase was similar for high-MI and low-MI shuffles, consistent with the classifier becoming more conservative when informative columns are perturbed. Despite the modest rise in specificity, overall performance declined in both the high-MI and low-MI conditions (e.g., accuracy and MCC decreased), indicating a net loss of predictive power when coevolutionary signal is selectively disrupted. These patterns support the interpretation that MVP's correct VF calls rely on the coevolutionary information in MSA data. Together with the higher coevolutionary ratios among true positives (Fig. 3), this experiment strengthens the link between coevolutionary signal strength and MVP's positive-class accuracy.

Interpreting the correlation between MSA Transformer attention mechanism and performance of MVP

We observed that MVP's strong performance in VF prediction is attributable to its ability to model coevolutionary information. Since the MSA Transformer plays a critical role in MVP's feature extraction and the attention mechanism is a key component of the Transformer architecture, we conducted experiments to investigate whether the attention mechanism is particularly important for VFs that contain highly conserved, coevolved residues.

The MSA Transformer consists of 12 layers with 12 heads each, resulting in a total of 144 attention blocks. To evaluate the impact of these attention blocks on VF prediction, we investigated their association with VFs by systematically pruning each attention block and assessing the resulting prediction accuracy. If pruning a specific attention block causes a significant drop in prediction accuracy, this can be interpreted as evidence of that block's significant association with the corresponding VFs. The model was trained without pruning; pruning was applied only during inference to evaluate the impact of each attention block. We analysed the model's sensitivity to pruning by measuring the decrease in performance and visualized the results in a heatmap, highlighting the contributions of specific attention blocks to VF prediction.

The left plot in Fig. 4 presents a heatmap of the experimental results, measured on the positive VF test dataset. Darker colors indicate attention blocks whose pruning yields lower sensitivity, suggesting that these blocks play a critical role in VF prediction. We next investigated whether these key attention blocks are associated with the coevolutionary information of VFs during prediction. For this experiment, we measured the average coevolutionary ratio of the VF proteins that were initially predicted correctly but were no longer predicted accurately after pruning specific attention blocks, analyzing each block individually.
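Schematically, pruning one attention block amounts to zeroing that head's contribution during the forward pass at inference time. The sketch below illustrates this operation on a generic per-head attention tensor; it is a conceptual stand-in for the procedure, not the MSA Transformer's internal API.

```python
import torch

def prune_head(attn_out_per_head: torch.Tensor, layer_head: tuple[int, int],
               layer_idx: int) -> torch.Tensor:
    """Zero one head's output in a (batch, heads, seq, head_dim) attention tensor.

    attn_out_per_head : per-head attention outputs of the current layer
    layer_head        : (layer, head) pair selected for pruning
    layer_idx         : index of the layer currently being computed
    """
    layer, head = layer_head
    if layer_idx == layer:
        attn_out_per_head = attn_out_per_head.clone()
        attn_out_per_head[:, head] = 0.0  # ablate this head's contribution
    return attn_out_per_head

# Sweeping all 12 x 12 = 144 (layer, head) pairs and re-measuring test
# sensitivity after each ablation yields the heatmap in Fig. 4 (left).
```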
Fig. 4 Left: heatmap of the MSA Transformer attention block pruning experiment, showing how much impact a specific attention block has on VF prediction sensitivity; darker colors signify attention blocks that play a more crucial role in capturing VFs. Right: average coevolutionary ratio of the VF proteins that were originally predicted correctly but were no longer predicted correctly after a specific attention block was pruned.

The right plot in Fig. 4 shows the result of this experiment. The x-axis represents the attention blocks ordered by their VF prediction sensitivity after pruning. The Pearson correlation coefficient of this graph is \(-0.735^{**}\), indicating a negative correlation. This result reveals that when attention blocks with a significant impact on VF prediction were pruned, the model failed to correctly predict VF proteins with a high coevolutionary ratio. These experiments demonstrate that certain attention blocks have a critical influence on VF prediction sensitivity and are particularly responsive to VFs that contain highly conserved, coevolved residues.

This suggests that the attention mechanism plays a central role in enhancing MVP's performance by selectively modeling long-range residue dependencies driven by coevolutionary signals. The pruning experiments further validate this interpretation: removing the most influential attention blocks caused marked declines in prediction accuracy, providing functional evidence that these blocks are indispensable for integrating coevolutionary information into VF prediction.

Category-wise evaluation by VF type

To provide a more granular view for VF analyses, we conducted a category-wise evaluation using VFDB type annotations. Because category sizes are highly imbalanced, we adopted a 5-fold stratified scheme to preserve per-type proportions in the positive set. In each fold, we trained on 2,861 positive sequences and tested on 715 positive sequences, while keeping the negative sets fixed across all folds (4,334 negatives for training and 576 negatives for testing), identical to the main experiments. Our objective here is to assess how well MVP recognizes VFs within each category, so we report per-category sensitivity (recall) together with per-category sample counts in Table 5. For additional model-level insight, we provide attention-block pruning profiles by category in Figure S3, summarizing the change in sensitivity when pruning individual layer-head blocks at inference.

Table 5 Performance of 5-fold stratified cross-validation for each VF category and statistics of the VF categories.

Evaluation of model generalizability

In this section, we performed VF prediction on an external dataset to validate the model. For the experiment, we used the 39 VF protein sequences described in Sect. “External validation dataset”; Table 6 presents the results. For DeepVF prediction, we tested their web server using its default cutoff thresholds. Because PBVF does not provide a web server for VF prediction, we implemented an RF-based hybrid classifier, which was reported as the top-performing method in the PBVF study. The BLAST model performed VF prediction using only the seqsim feature: for each query, we compute the highest BLAST bit score against the positive training set and the highest bit score against the negative training set, and the query is predicted as a VF if the positive bit score exceeds the negative one.

Our previous experiments revealed that MVP's strong performance in VF prediction is attributable to its heightened sensitivity to VFs. This enhanced sensitivity led to a significant improvement in MVP's predictive performance compared to the baseline model, DeepVF. The results of this experiment further validate this observation on additional data.

Table 6 Sensitivity comparison for VF prediction on the external dataset.
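The BLAST baseline's decision rule can be reproduced with the BLAST+ command-line tools roughly as follows, assuming protein databases have been built from the positive and negative training sets with makeblastdb; the database paths and helper names here are hypothetical.

```python
import subprocess

def max_bitscore(query_fasta: str, db: str) -> float:
    """Highest blastp bit score of the query against a protein database."""
    res = subprocess.run(
        ["blastp", "-query", query_fasta, "-db", db,
         "-evalue", "10", "-outfmt", "6 bitscore"],
        capture_output=True, text=True, check=True,
    )
    scores = [float(s) for s in res.stdout.split()]
    return max(scores, default=0.0)

def blast_predict(query_fasta: str, pos_db: str = "vf_pos", neg_db: str = "vf_neg") -> bool:
    """Predict VF if the best hit in the positive set outscores the negative set."""
    return max_bitscore(query_fasta, pos_db) > max_bitscore(query_fasta, neg_db)

# The same two bit scores also form the 2-dimensional seqsim feature used by MVP.
```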
Web server availability and usage

We provide a web server to facilitate easy use of MVP for virulence factor prediction (http://bhi4.snu.ac.kr:7978). Users can submit up to five protein sequences at a time, and the predictions are displayed directly on the web page. Scores are reported as percentages (0-100); consistent with this study, sequences with a score over 50% are predicted as VFs. For server stability, all data processing and model inference are executed on a CPU backend; consequently, each protein sequence typically requires 3-5 minutes. For large-scale studies with many protein sequences, we strongly recommend using the MVP model via our GitHub repository; the web server is intended for quick, small-scale checks.

Conclusion

In this work, we proposed a novel virulence factor prediction method that considers the coevolutionary information of proteins to achieve accurate VF identification in bacterial pathogens. For a fair and accurate comparison with SOTA models, we used the dataset from DeepVF, one of the most well-constructed recent benchmark datasets. This study represents the first application of MSA data to VF prediction and introduces MSA-composition, a novel protein feature extraction method based on this approach. To effectively capture the coevolutionary information within MSA data and the relationships among proteins in the alignment, we employed the MSA Transformer. Among the four traditional machine learning classifiers evaluated, the MLP produced the best results. When evaluated against existing SOTA methods (MP3, PBVF, VF-Pred, DeepVF, and GTAE-VF) on the test dataset, our model, MVP, demonstrated superior performance, achieving an accuracy (ACC) of 0.869 for VF prediction.

The major finding of our work is the importance of coevolution-based feature extraction for VF prediction. To analyse whether MVP's strong performance is attributable to the use of protein coevolutionary information, we used MI Z-scores and the protein coevolutionary ratio. MVP showed better predictive power for VFs containing a high ratio of coevolution-related amino acid residues, highlighting MVP's sensitivity to the coevolutionary information embedded in VFs. Additionally, we investigated the influence of attention blocks within the MSA Transformer on VF prediction. Notably, attention blocks that had a significant impact on VF prediction sensitivity were also found to be particularly responsive to VF proteins with high coevolutionary information.

In summary, MVP is a VF prediction model that incorporates the MSA-composition feature extraction method, effectively capturing and representing the coevolutionary information of proteins. Through extensive experiments, we have highlighted the critical role of coevolutionary information in enhancing VF identification accuracy.

Due to the rise of antibiotic resistance, recent research has focused on developing strategies to specifically inhibit the activity of bacterial virulence factors, such as those of Staphylococcus aureus, as an alternative treatment approach [32]. Accurate identification of virulence factors is crucial for advancing this process. However, pathogen detection in the wet lab faces several limitations, particularly in terms of time and cost. For instance, Metagenomic Next-Generation Sequencing (mNGS) is often limited by the overwhelming presence of human nucleic acids in samples, which reduces sensitivity for pathogen detection. Additionally, the risk of microbial contamination from reagents or the laboratory environment complicates result interpretation, requiring stringent quality control [33].
To overcome these challenges, efficient and cost-effective computational methods are needed. Our proposed MVP model can process bacterial protein sequence data efficiently, as it is an already-trained deep learning model, and can thus help identify virulence factors. Identifying virulence factors enables the differentiation of pathogenic microorganisms from commensal or environmental microbes, offering critical insights into their disease-causing potential [34].

Future research could focus on incorporating three-dimensional structural information of virulence factors (VFs) to further enhance VF prediction accuracy. Additionally, developing a model that directly integrates phylogenetic information as features would be an interesting avenue to explore. Despite these potential advancements, the proposed MVP model already provides significant support for predicting and studying bacterial pathogenicity.

Data availability

The data used for training and evaluation in this article were curated from https://deepvf.erc.monash.edu/. The external data used for evaluation were curated from https://www.mgc.ac.cn/VFs/. All data and the source code underlying this article are available at our repository: https://github.com/kimtaegyuu/MSA-VFpredictor.

References

1. Morens, D. M. & Fauci, A. S. Emerging infectious diseases: threats to human health and global stability. PLoS Pathog 9, e1003467. https://doi.org/10.1371/journal.ppat.1003467 (2013).
2. Diard, M. & Hardt, W.-D. Evolution of bacterial virulence. FEMS Microbiol Rev 41, 679–697. https://doi.org/10.1093/femsre/fux023 (2017).
3. Casadevall, A. & Pirofski, L.-A. Virulence factors and their mechanisms of action: the view from a damage-response framework. J Water Health 7, S2–S18. https://doi.org/10.2166/wh.2009.036 (2009).
4. Sarowska, J. et al. Virulence factors, prevalence and potential transmission of extraintestinal pathogenic Escherichia coli isolated from different sources: recent reports. Gut Pathog 11, 49. https://doi.org/10.1186/s13099-019-0290-0 (2019).
5. Liu, B., Zheng, D., Jin, Q., Chen, L. & Yang, J. VFDB 2019: a comparative pathogenomic platform with an interactive web interface. Nucleic Acids Res 47, D687–D692. https://doi.org/10.1093/nar/gky1080 (2019).
6. Falkow, S. Molecular Koch's postulates applied to microbial pathogenicity. Rev Infect Dis 10, S274–S276. https://doi.org/10.1093/clinids/10.Supplement_2.S274 (1988).
7. Pérez-Llarena, F. J. & Bou, G. Proteomics as a tool for studying bacterial virulence and antimicrobial resistance. Front Microbiol 7, 410. https://doi.org/10.3389/fmicb.2016.00410 (2016).
8. Chen, L., Xiong, Z., Sun, L., Yang, J. & Jin, Q. VFDB 2012 update: toward the genetic diversity and molecular evolution of bacterial virulence factors. Nucleic Acids Res 40, D641–D645. https://doi.org/10.1093/nar/gkr989 (2012).
9. Garg, A. & Gupta, D. VirulentPred: a SVM-based prediction method for virulent proteins in bacterial pathogens. BMC Bioinformatics 9, 62. https://doi.org/10.1186/1471-2105-9-62 (2008).
10. Gupta, A., Kapil, R., Dhakan, D. B. & Sharma, V. K. MP3: a software tool for the prediction of pathogenic proteins in genomic and metagenomic data. PLoS One 9, e93907. https://doi.org/10.1371/journal.pone.0093907 (2014).
11. Rentzsch, R., Deneke, C., Nitsche, A. & Renard, B. Y. Predicting bacterial virulence factors: evaluation of machine learning and negative data strategies. Brief Bioinform 21, 1596–1608. https://doi.org/10.1093/bib/bbz083 (2020).
12. Xie, R. et al. DeepVF: a deep learning-based hybrid framework for identifying virulence factors using the stacking strategy. Brief Bioinform 22, bbaa125. https://doi.org/10.1093/bib/bbaa125 (2021).
13. Singh, S., Le, N. Q. K. & Wang, C. VF-Pred: predicting virulence factor using sequence alignment percentage and ensemble learning models. Comput Biol Med 168, 107662. https://doi.org/10.1016/j.compbiomed.2023.107662 (2024).
14. Li, G., Bai, P., Chen, J. & Liang, C. Identifying virulence factors using graph transformer autoencoder with ESMFold-predicted structures. Comput Biol Med 170, 108062. https://doi.org/10.1016/j.compbiomed.2024.108062 (2024).
15. Zhao, X.-M., Chen, L. & Aihara, K. Protein function prediction with high-throughput data. Amino Acids 35, 517–530. https://doi.org/10.1007/s00726-008-0078-5 (2008).
16. Ahmad, S. & Mizuguchi, K. Partner-aware prediction of interacting residues in protein-protein complexes from sequence data. PLoS One 6, e29104. https://doi.org/10.1371/journal.pone.0029104 (2011).
17. Xu, J., Mcpartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat Mach Intell 3, 601–609. https://doi.org/10.1038/s42256-021-00348-5 (2021).
18. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat Biotechnol 35, 128–135. https://doi.org/10.1038/nbt.3769 (2017).
19. Rao, R. M. et al. MSA Transformer. In Proceedings of the 38th International Conference on Machine Learning, 8844–8856 (PMLR, 2021).
20. Dai, Q. et al. Precision DNA methylation typing via hierarchical clustering of nanopore current signals and attention-based neural network. Brief Bioinform 25, bbae596. https://doi.org/10.1093/bib/bbae596 (2024).
21. Gloor, G. B., Martin, L. C., Wahl, L. M. & Dunn, S. D. Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry 44, 7156–7165. https://doi.org/10.1021/bi050293e (2005).
22. Sayers, S. et al. Victors: a web-based knowledge base of virulence factors in human and animal pathogens. Nucleic Acids Res 47, D693–D700. https://doi.org/10.1093/nar/gky999 (2019).
23. Davis, J. J. et al. The PATRIC bioinformatics resource center: expanding data and analysis capabilities. Nucleic Acids Res 48, D606–D612. https://doi.org/10.1093/nar/gkz943 (2020).
24. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152. https://doi.org/10.1093/bioinformatics/bts565 (2012).
25. Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res 45, D170–D176. https://doi.org/10.1093/nar/gkw1081 (2017).
26. Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9, 173–175. https://doi.org/10.1038/nmeth.1818 (2012).
27. Paszke, A. et al. PyTorch: an imperative style, high-performance deep learning library. In Adv Neural Inf Process Syst 32, 8024–8035 (Curran Associates, 2019).
28. Baker, F. N. & Porollo, A. CoeViz: a web-based tool for coevolution analysis of protein residues. BMC Bioinformatics 17, 119. https://doi.org/10.1186/s12859-016-0975-z (2016).
29. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J Mach Learn Res 12, 2825–2830 (2011).
30. Lin, Z. et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. bioRxiv. https://doi.org/10.1101/2022.07.20.500902 (2022).
31. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res 28, 235–242. https://doi.org/10.1093/nar/28.1.235 (2000).
32. Kane, T. L., Carothers, K. E. & Lee, S. W. Virulence factor targeting of the bacterial pathogen Staphylococcus aureus for vaccine and therapeutics. Curr Drug Targets 19, 111–127. https://doi.org/10.2174/1389450118666170823162252 (2018).
33. Gu, W., Miller, S. & Chiu, C. Y. Clinical metagenomic next-generation sequencing for pathogen detection. Annu Rev Pathol 14, 319–338. https://doi.org/10.1146/annurev-pathmechdis-012418-012751 (2019).
34. Reiland, H. A., Omolo, M. A., Johnson, T. J. & Baumler, D. J. A survey of Escherichia coli O157:H7 virulence factors: the first 25 years and 13 genomes. Adv Microbiol. https://doi.org/10.4236/aim.2014.412106 (2014).

Acknowledgements

This research was supported by the Bio & Medical Technology Development Program of the National Research Foundation (NRF), funded by the Ministry of Science and ICT, Republic of Korea (grant number RS-2022-NR067933). This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00257479).
This work was also supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)].

Author information

Authors and Affiliations

Interdisciplinary Program in Artificial Intelligence, Seoul National University, 1 Gwanak-ro, 08826, Seoul, Republic of Korea: Taegyu Kim & Sun Kim
Interdisciplinary Program in Bioinformatics, Seoul National University, 1 Gwanak-ro, 08826, Seoul, Republic of Korea: Changyun Cho & Sun Kim
Department of Computer Science and Engineering, Seoul National University, 1 Gwanak-ro, 08826, Seoul, Republic of Korea: Sun Kim
School of Biological Sciences and Institute of Microbiology, Seoul National University, 1 Gwanak-ro, 08826, Seoul, Republic of Korea: Yeong-Jae Seok
AIGENDRUG Co., Ltd., Gwanak-ro 1, Gwanak-gu, 08826, Seoul, Republic of Korea: Changyun Cho & Sun Kim
Bioinformatics Institute, Seoul National University, Seoul, Republic of Korea: Dohoon Lee
BK21 FOUR Intelligence Computing, Seoul National University, Seoul, Republic of Korea: Dohoon Lee

Contributions

T.K., C.C., and S.K. conceived the experiments; T.K. conducted the experiments; T.K., C.C., and S.K. analysed the results; T.K. and C.C. wrote the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Sun Kim.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.