Unlocking data in Klebsiella lysogens to predict capsular type-specificity of phage depolymerases

Wait 5 sec.

IntroductionHost recognition is considered a critical step for successful bacteriophage (or phage) infection1. This process is mediated by virus-encoded receptor-binding proteins (RBPs), which mostly comprise tail fiber and tail spike proteins. Tail fiber proteins are generally non-enzymatic and bind to a protein receptor on the host surface2. In contrast, tail spikes are characterized by the presence of a polysaccharide-degrading domain, also referred to as depolymerase3,4. The depolymerase domain is responsible for the specific recognition and degradation of complex sugars like exopolysaccharides (EPS), capsular polysaccharides (CPS) or lipopolysaccharide (LPS)5. This primary binding to the capsule by the depolymerase allows subsequent access to the secondary receptor located on the membrane surface before entry into the host6,7. Once inside, the phage continues its cycle by either replicating into the bacteria and propagating, or remaining dormant as a prophage in the host, referred to as a lysogen.Phages are particularly relevant against ESKAPE pathogens, which are regarded as being among the most urgent threats to global health8. This group includes Klebsiella pneumoniae, a Gram-negative bacterium with a highly diverse capsule with 77 serotypes and over 180 K-loci (KL types) identified through analysis of the K-locus, the region responsible for capsule biosynthesis9,10. The KL type is the most determining factor of infectivity for most phages targeting Klebsiella11,12,13, but our ability to determine the KL tropism of a given phage based on depolymerase sequence alone remains limited. This challenge is further complicated by the rapid evolution of the capsule locus in bacteria and depolymerases in phages, frequently subjected to horizontal gene transfer (HGT)14,15,16,17,18,19.Recent advances in the field of machine learning have paved the way for addressing important questions related to phages20. Many of these advancements have been grounded in the attention mechanism, which can be defined as the ability of a model to focus on specific parts of the input to enhance its performance21,22,23. This has led to the development of transformer architectures, including protein language models (PLM)24. PLMs, such as Evolutionary Scale Modeling 2 (ESM2), can capture complex patterns and relationships between amino acids in protein sequence into embedded representations25. These embeddings have been successfully applied to downstream tasks such as sequence or token classification, yielding remarkable results26,27. PLMs offer a promising approach for modeling phage-bacteria interactions, enabling the extraction of relevant sequence features and improving predictive accuracy28,29,30,31. However, in some specific settings, such as the therapeutic one, predictions at the genus and species levels are not satisfactory as phages typically infect a few strains within a species32,33.Three aspects are crucial for attaining a higher level of resolution when predicting the outcome of phage-bacteria interactions: (i) extensive high-quality training data, (ii) strong biological premises and (iii) state-of-the-art machine learning techniques. Recent studies34,35 made great progress in this direction, but still reported limitations related to the lack of training data. While this is an inherent limitation when using experimentally validated phage-bacteria interactions, here this challenge is addressed by leveraging information contained in Klebsiella lysogens. Large numbers of prophage-encoded depolymerase domain sequences were retrieved alongside with the K-locus of the host to predict the capsular specificity of depolymerases based on their sequence. This approach introduced several technical challenges for modeling: phages often encode multiple depolymerases with varying specificities, individual depolymerases can exhibit multi-KL targeting (i.e., binding to multiple capsular types), and frequent HGT among depolymerases obscures direct sequence-KL correlations36,37,38.To address this multi-label classification task and its inherent challenges, we conceptualized a hierarchical framework, akin to a binary relevance method39. First, KL-specific binary classifiers were designed to isolate depolymerase features associated with individual KL types. Second, an ensemble recommender system integrated predictions across classifiers to rank KL targets based on the output probabilities, explicitly accommodating the multi-label nature of depolymerase-KL interactions. Two complementary approaches were employed to model interactions between depolymerase domains, prophages, and bacterial hosts: a directed acyclic graph-based model (TropiGAT) and a sequence clustering-based model (TropiSEQ). Moreover, these approaches enabled the generation of a comprehensive database linking depolymerase domain sequences to their respective KL targets, offering valuable insights into phage-bacteria interactions. This resource holds promise for therapeutic applications, such as targeted phage therapy, and industrial uses, where accurate KL type specificity is crucial40,41.ResultsA comprehensive profiling of prophage-encoded depolymerases targeting KlebsiellaTo gather a collection of prophages with labeled KL type, all Klebsiella genomes were first downloaded from the NCBI database (n = 14,601), retaining those with a confident assigned KL type (n = 12,003). The screening of these bacterial genomes identified 77,802 prophages. Clusters of prophage sequence showing high identity (99% over more than 80% of their genome) were identified to remove redundancy, resulting in a total of 16,077 prophage vOTUs. Then, for each prophage within a prophage vOTU, the KL of the last common ancestor of the infected bacteria was identified and used as a proxy of the most probable KL type of the host at the time of phage infection (see method section). Out of the 77,802 prophages, only 3500 presented an ambiguous target KL type, resulting in a collection of 74,302 KL-labeled prophages.Next, the depolymerase domain sequences within these prophages were screened using three methods (Fig. 1A). Two methods relied on alignment against experimentally validated depolymerases (BLASTp) and HMM profiles of domains associated with polysaccharide-degrading enzymes. The third method employed DepoScope, a machine learning tool combining a fine-tuned ESM-2 protein language model with convolutional neural networks to detect structural folds characteristic of depolymerases42. Identification with one of the three methods was sufficient to include a depolymerase in our dataset. This identified 19,600 depolymerase domain sequences, with 3908 unique sequences after removing redundancies (100% coverage and 100% percentage identity). Interestingly, for 80% of the prophages, no depolymerase was detected, suggesting prophage degradation or alternative modes of infection33,43. For the remaining 15,230 prophages with at least one depolymerase, ~72% harbored a single depolymerase, ~20% encoded two depolymerases, with some encoding as many as 12. This diversity resulted in an average of 1.3 depolymerase sequences per prophage across the database. The depolymerase domain can take on different structural folds, as previously described in studies of carbohydrate-active enzymes42,44,45. In the present prophage analysis, these folds were found in varying frequencies: the most prevalent fold was the right-handed β-helix, representing 69% (2722) of the identified sequences, followed by the n-bladed β-propeller at 18% (714). The other folds represented a smaller fraction, among which the TIM β/α-barrel with 7.5% (294), the α/β hydrolase fold with 3% (114), and finally the triple-helix (32) and α/α toroid (29) represented less than 1% (Fig. 1B, C).Fig. 1: Pipeline for the identification and characterization of the depolymerase domains.A Pipeline for the identification of depolymerase domain sequences from prophage proteomes. B Venn diagram of the depolymerase domain identified through the 3 methods employed. They involved the deep learning based method Deposcope, a method based on HMM profiles of proteins with depolymerase polysaccharides degrading activity using HHMscan and BLASTp to scan the proteins against a collection of characterized depolymerases from the literature. C Proportions of the depolymerase folds identified. D Bar plot of the number of prophages carrying at latest one depolymerase on the y-axis across each KL types represented on the x-axis. The KL types represented by less than 5 prophages were assigned to the “Other” category.Full size imageTo prepare the dataset for modeling, redundant prophages originating from the same bacterial clone were removed, yielding a final dataset of 8,871 prophages, each labeled with one of 128 distinct KL types. (Fig. 1D). The distribution of prophages in the database was not uniform across KL types. Indeed, ~44% of the prophages were assigned to 6 KL types: KL107 (1121), KL64 (897), KL47 (551), KL106 (488), KL17 (481) and KL2 (351). Among the remaining 122 KL types, half contained between 20–300 prophages, while the other half had fewer than 20 prophages.Developing models to predict depolymerase tropismTo optimize individual KL classifiers, two different approaches were employed: a DAG-based approach, denoted TropiGAT, and a sequence clustering-based approach, referred to as TropiSEQ. In TropiGAT, depolymerases encoded by a given prophage were represented as embedding vectors computed using the ESM2 model. These vectors were aggregated using two strategies: an attention-based method and a baseline averaging method. The aggregated vector was then passed through a forward neural network for binary classification. Each model was trained on unique sets of depolymerases corresponding to a specific KL type. Model performance was evaluated using weighted MCC values, a robust metric for classifier quality that accounts for the number of training instances. The attention-based aggregation outperformed the baseline, achieving a weighted MCC of 0.547 compared to 0.528 for the averaging method (P = 0.0477). The attention-based aggregator generalized well for the most represented KL types, notably KL17 (n = 223), KL102 (n = 128) and KL3 (n = 147) with MCC scores > 0.8 (Pearson coefficient = 0.41; P-value