Understanding the regulatory grammar of sepsis-causing Staphylococcus aureus bacteria using contextualised DNA language models


Introduction

Sepsis is a life-threatening condition marked by a severe inflammatory response to infection. To illustrate its substantial impact on healthcare, it is responsible for approximately 20% of global hospital mortality, and many survivors experience adverse long-term effects1,2. In addition, it is a common co-morbidity accompanying a range of other already potent diseases such as cancer or COVID-19, compounding the challenges faced by patients and healthcare professionals alike. The rapidly escalating arms race of global antibiotic resistance only amplifies this already worsening situation.

Consequently, it comes as no surprise that there is a robust and growing interest in researching the mechanisms of sepsis. The pursuit of early detection methods and targeted clinical interventions has increasingly taken centre stage, as improving patient outcomes in the face of this relentless adversary becomes an imperative goal3,4. More specifically, developing biomarkers and characterising functional molecules involved in gene regulation of these microbes is of critical importance in designing targeted drugs for sepsis.

Similar to cancer, sepsis is a complex, multi-faceted syndrome with various underlying causes, and its severity can vary greatly from one patient to another. Therefore, a more holistic and ideally systems-level approach is required to fully grasp the nuances of such a condition. However, achieving this goal is non-trivial due to the enormous complexity of biological systems. Across these broad landscapes, an array of genome regulatory elements, encompassing transcription factors, their corresponding binding sites and genome occupancy, orchestrates the resulting functional omics components (for instance transcriptomics, proteomics and metabolomics) into producing a specific output. In this study, we successfully gathered the above data from bacteria which are known to cause sepsis in hospitals.
Considering the functional omics perspective independently, we conducted multi-omics integration to find single-molecule-resolution signatures for functional omics data. We use a previously developed multi-omics integration pipeline which has been demonstrated to be successful in identifying therapeutic targets5. Our approach to characterise individual molecules, which could be genes, proteins, or metabolites, is distinct from a traditional pathway-level analysis that often involves surveying the collective behaviour of groups of molecules within predefined pathways or functional categories. In contrast, our approach offers a more fine-grained perspective on the molecular interactions and functions within a biological system. In this study, we aim to expand our previous research scope by delving into a more comprehensive exploration of systems orchestration. Specifically, we seek to gain insights into the regulatory signature of sepsis by using our novel approach of genomeNLP6. Our overarching goal is to establish a cohesive understanding of systems biology by bridging the functional and regulatory components of multi-omics data.

Intuitively, a genome is fundamentally a language, and its regulation can be described by an internal set of grammar and semantic rules, most of which remain obscured. We therefore considered methods which have been successfully used to process natural languages when designing our regulatory signature extraction workflow. We observed that the usage of natural language processing (NLP) in the biological sciences to successfully characterise genes and proteins is steadily gaining traction for a wide range of applications7,8,9, driven mainly by recent leaps in the field of deep learning10 and the visible success of large language models (LLM)11.
Recently, both NLP-inspired methods and direct applications of NLP have been used to perform phylogenetic tree construction12 and protein structure prediction13, or to gain a deeper understanding of the “language” of the genome as well as the proteome7,14. Overall, these examples illustrate NLP’s unsurprising effectiveness when applied to biological sequence data.

Therefore, we chose to formulate our regulatory omics component signature investigation as an NLP problem. Specifically, using a corpus of promoter sequences, we intended to create a semantic model of promoter regions for the purpose of extracting meaningful genomic “words”. Such “words” are significant, as they constitute potential regulatory elements, or “motifs”, corresponding to transcription factor binding sites (TFBS) or RNA-binding domains (RBD), where the binding of transcription factors (TF) or RNA results in a modulation of gene activity.

However, there are two major challenges in analysing regulatory elements across the genome. First, motifs corresponding to TFBS or RBD are variable in length and composition even within the same species, requiring specialised pattern matching techniques such as position-weight matrices (PWMs). Unfortunately, these conventional methods are restricted to only predicting patterns that resemble known PWMs15,16,17, or require heavily enriched sequences to consolidate hidden motifs18. This constraint makes them unsuitable for application to newly sequenced genomes, where we anticipate the presence of numerous novel motifs. We overcome this challenge with an intuitive approach which empirically segments genomic sequence to capture “words” hidden in the data for input into machine or deep learning pipelines6. Our method is distinct from conventional methods which ignore biological semantics by arbitrarily dividing genomic sequence into “word” subunits known as “tokens” or “k-mers” for usage in classification and subsequent motif discovery.
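For contrast, the conventional fixed-length splitting described above can be sketched in a few lines. This is only an illustration of arbitrary k-mer tokenisation, not of the empirical segmentation used in genomeNLP; the function name `kmer_tokens`, its parameters and the toy sequence are assumptions introduced here.

```python
def kmer_tokens(sequence: str, k: int, stride: int = 1) -> list[str]:
    """Split a nucleotide sequence into fixed-length k-mers.

    Token boundaries depend only on k and stride, never on the
    sequence content, which is why this splitting is "arbitrary".
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


# Toy sequence made of two hexamers (hypothetical example, not study data).
example = "TTGACATATAAT"

overlapping = kmer_tokens(example, k=6)                 # sliding window, stride 1
non_overlapping = kmer_tokens(example, k=6, stride=6)   # keeps both hexamers whole
misaligned = kmer_tokens(example, k=4, stride=4)        # cuts through both hexamers
```

With `k=6` and `stride=6` the two hexamers happen to survive intact, whereas `k=4` cuts through both of them. A fixed token length therefore only preserves a motif by coincidence, which is the limitation that a data-driven segmentation aims to avoid.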
Additionally, the full meaning of a “word” is often heavily dependent on the context of both adjacent and distant words, in both natural language and biology. Conventional text processing strategies in NLP do not account for longer range “genomic context”, commonly using numerical representations (a.k.a. embeddings) of words with narrow context windows19, or word frequencies without any context20. Both tactics subsequently dilute method effectiveness. In contrast, we account for bidirectional, long-range context by deploying state-of-the-art transformer-based10 NLP methods which are capable of incorporating context over a substantially longer sequence length than a few adjacent “words”. Combining this with our empirical segmentation approach that also accounts for short range “genomic context”, we successfully apply this to motif discovery and detection with a deep learning NLP pipeline that uses distilBERT21 as its core neural network architecture. Finally, we assemble the overall story by bridging both the multi-omics functional and governing regulatory-omics signatures to create a combined systems biology signature.

Results

Problem formulation

Multiple strains of Staphylococcus aureus are the focus of this study, given their role in sepsis in humans. Data was previously generated and covers five Staphylococcus aureus strains, with each strain being grown in a control non-infection condition or a simulated blood infection condition, including six biological replicates per condition. As part of an Australian National Framework Project facilitated by Bioplatforms Australia Ltd., our team has provided genomic data for these newly sequenced Australian strains, as well as transcriptomics, proteomics, and metabolomics data2.

Obtaining a functional multi-omics signature from quantitative data

We first obtained a high-resolution functional multi-omics signature serving as the foundational component of our integrated systems biology signature.
Data is present in the form of abundance matrices. These abundance matrices are the result of preprocessing and quantifying raw transcripts mapped to the genome. Similarly, protein and metabolite counts were derived from deconvoluted mass spectra. Applying our previously developed pipeline to the data5, we identified strong positive as well as negative correlations across individual transcripts, proteins and metabolites driving the differences across the non-infection and simulated infection state using a sparse Partial Least Squares Discriminant Analysis (sPLSDA). All associated code, parameters and detailed information are present in a version controlled repository for transparency and reproducibility: https://gitlab.com/tyagilab/sepsis_integration (ref. 22).

We considered our functional multi-omics findings in the context of more conventional methods widely used in the field. Conventional low-resolution methods aggregate results at a pathway level. In contrast, our method generates a functional multi-omics signature at the deeper resolution of single molecules. Concordance of the high-resolution signature with the established low-resolution signatures cements the biological significance of our results (Fig. 2). It accurately recapitulates essential pathways such as cholesterol metabolism, fatty acid metabolism, nitrogen metabolism, and metal uptake pathways2. Using the molecular-level signature, we were able to map most features to known functional annotations, while others remain unannotated. Within each strain, we identified a total of 5 molecules per transcriptomics, proteomics and metabolomics data block as highly indicative of the sepsis state, for a total of 75 individual molecules across omics data (5 molecules × 3 data blocks × 5 strains). Of these molecules, 16 of 25 genes, 18 of 25 proteins and 25 of 25 metabolites were mapped to known annotations.
Supplementing our single-molecule features with gene ontology (GO) analysis23,24,25, we observed that various metal ion binding or transport functions were highly represented, in particular iron2. In the process, we were able to add new information (Supplementary Tables 1, 2; ref. 22).

Fig. 1. Evaluation metrics on test data for five Staphylococcus aureus strains BPH2760, 2819, 2900, 2947, 2986. A set of feature-rich regions and feature-poor regions in genomic DNA were obtained. The data was tokenised using our custom tokeniser and a simple two-case classification was performed on these corpora with our genomeNLP pipeline.

Fig. 2. Molecular correlations across individual transcript, protein and metabolite for five Staphylococcus aureus strains BPH2760, 2819, 2900, 2947, 2986 in non-infection vs simulated infection states, and associated promoter regions with regions of interest highlighted. Left: circos plots depict the high multivariate correlations between the selected features from each omics data block. Red and blue lines indicate positive and negative correlations respectively. Blue, green and red blocks indicate metabolomics, proteomics and transcriptomics data blocks respectively, where individual blocks correspond to individual molecules. Right: promoter regions associated with a gene from the functional multi-omics analysis per strain. Green highlights indicate regions associated with promoters, and red highlights indicate regions associated with protein-coding sequences. Highlight intensity indicates the corresponding score associated with the category of interest (promoter or protein-coding).

Obtaining a regulatory omics signature from qualitative data

Next, to acquire the regulatory omics signature, we considered that motifs are a key component in gene regulation and focused our attention on motif-rich promoter regions.
At the same time, it is possible to reference discovered motifs against known ones, acting as a layer of validation for our findings. We then harnessed the correlated functional multi-omics signature obtained in the previous step. From this, we selected a set of windows -60:20 around the transcription start site of each gene, corresponding to approximate promoter regions rich in motifs. We then extracted raw genomic sequences from corresponding promoter and protein-coding regions as input into our deep learning pipeline genomeNLP, and applied a novel empirical tokenisation approach6 that would generate “words” or “oligonucleotides” corresponding to the promoter and non-promoter set of sequences. Then, we conducted a standard two-case classification with our custom genomic version of the distilBERT algorithm21. We note that the classification itself is of minor interest in this specific context, though we provide the obtained metrics for completeness and transparency (Fig. 1). Instead, we intercept the classification scores for each oligonucleotide “word”, and inspect the oligonucleotide sequences which strongly weight the classifier towards the promoter category (Fig. 2).

Detecting, discovering and annotating new motifs

Identifying known regulatory protein binding sites with computational and literature searches

We annotated these oligonucleotide sequences corresponding to potential TFBS or RBD with a combination of both automated and manual curation. First, MAST was used to match the MEME suite’s combined database of prokaryotic motifs against our oligonucleotide set18,26,27,28. From the full set of oligonucleotides identified by genomeNLP, we retrieved a total of 23 motifs across 182 oligonucleotide sequences, with a total of 13 unique TFBS across 166 unique oligonucleotide sequences.
Here, we display the oligonucleotide sequences and their corresponding motif matches (Table 1).

Table 1. 13 unique motifs identified by MAST across 20 unique oligonucleotide sequences reported by genomeNLP.

After the initial screen, we considered that many transcription factors in Staphylococcus aureus are not well characterised29, as well as the effects of database drift30. Therefore, we performed an extensive search on current prokaryotic literature as a second layer of annotation. We detected some motifs associated with known biological functions for genes of interest, and detail them along with corresponding references. Full details of results are available in the associated version controlled software repository in both human and machine readable formats (Supplementary Table 3; ref. 22).

Among the known TFBS detected in the initial screen, binding sites for the lysR transcriptional regulator, trpI tryptophan synthase regulators31, malT maltose uptake regulators32, metJ repressor33, the DNA-binding protein H-NS34 and the narL nitrate regulator35 were detected multiple times. Interestingly, multiple separate transcription factors had the conserved helix-turn-helix (HTH) pattern belonging to a known family of transcription factor binding domains. The HTH binding motif is well-defined and well-conserved in DNA36. Similarly, H-NS is a global transcriptional regulator of pathogenicity and has a strongly conserved binding sequence34,37. In addition, a known lexA SOS regulatory region was recapitulated from the gene brnQ38.

Hierarchical ontology of regulatory protein binding sites and the corresponding modules regulated by them

Having examined the individual motifs and their biological functions, we next categorise their regulatory roles in broader detail (Fig. 3).
First, numerous metabolism functions including amino acid synthesis, carbohydrate metabolism and nitrate assimilation were shown, corresponding with the list of functional omics features identified in the functional omics component of this manuscript, as well as in previously published results2. Second, motifs involved in regulating epigenetic processes were detected, primarily the global transcriptional regulator H-NS, which acts in nucleoid remodelling, general silencing and the regulation of virulence. H-NS is also known to play an indirect but central role in short non-coding RNA (shRNA) regulation, interacting directly with the global shRNA regulator DsrA and being capable of targeted RNA binding39. In addition, H-NS is implicated in direct post-transcriptional regulation40. We also note that H-NS binding sites are detectable on Escherichia coli plasmids37, although information on this is not available for the bacterial strains under study. At the same time, we also identified the transcription factors narL35, malT39 and torR41, which are known to be part of the H-NS regulatory network. Methylation-related factors in the form of ADA regulators42 were also present. Third, farR is a known regulator of Staphylococcus farE, which confers resistance to antimicrobial linoleic acid43 previously identified in our studies2. Other interesting regulators of heat shock and biofilm formation were detected, which have implications for bacterial survival.

Fig. 3. Regulatory elements detected with our pipeline, associated targets and regulatory hierarchy. From top, the 3 layers indicate global regulators, gene module regulators and target genes. Dashed lines indicate regulatory elements which were not directly detected by our pipeline but are part of two-component systems.
farR itself is not directly part of a regulon, but regulates other global regulators44.

At the same time, we observed a noticeable number of partial matches to saeR binding sites across all five Staphylococcus aureus strains. saeR is part of a bacterial two-component system, regulating pathways producing secretory proteins known to be associated with Staphylococcus aureus virulence29. We were also able to recapitulate the expected Shine-Dalgarno sequence and TATA boxes in many instances. Similarly, the previously mentioned lysR transcriptional regulator is abundant across the genome45. The natural emergence of these strongly conserved patterns from the data, in addition to HTH and H-NS, implies that our minimal assumptions strategy is capable of discovering interesting genomic features and provides another layer of validation for our NLP approach.

Exploiting conserved genomic features and TFBS to predict gene function

It is of interest that we observed many oligonucleotide sequences with no currently known annotation or function, although in some cases they represent “genomically interesting” features such as known conserved regions, transposon footprints or known over/underrepresented oligomers in the genome46. Short sequences in particular were challenging to annotate due to their ubiquity in the genome. Overall, these oligonucleotide sequences represent empirically derived grammatical and semantic components of the genome, and are candidates worth investigating for potential functional roles which may include putative TFBS or RBS.

Discovery of potential new motifs

Having shown that we recapitulate known signals in the data from both automated and manual annotations, we next consider some genes of unknown and/or putative function in these new Staphylococcus aureus strains, and highlight some interesting examples.
A region homologous to an upstream region of an antiseptic resistance gene in the same strain was detected47, corresponding to a putative membrane protein of unknown function. In another case, a gpNTRc binding site was detected48, with the corresponding unknown gene putatively annotated as being involved in oxidative stress. Meanwhile, a metJ binding site was discovered in a putatively annotated membrane protein. We also detected a conserved region upstream of an acetyl-coA transferase gene in an unannotated gene, which narrowed down its putative function49. We note that these results collectively raise the possibility of using our motif discovery tool to putatively annotate genes as well, and thus demonstrate that our minimal assumption strategy inherently has a broad scope.

Discussion

We demonstrated a capability to not only identify both a functional-omics and a regulatory-omics signature, but also to bridge the gap between functional and regulatory omics data at a low level, as opposed to a high-level pathway overview. Until now, this fine-grained level of harmonisation has remained unresolved due to the inherent differences in the conventional data structures of each omics paradigm. For instance, functional-omics data is often represented as numerical matrices while regulatory-omics data pertaining to the genome are represented as strings of text. Through splitting our workflow into two components and later reunifying the processed data by mapping the information to the reference genome, we successfully overcome this obstacle.

Delving deeper into our two-component strategy, we next consider the key features and limitations of each method. The functional-omics strategy, along with its capabilities and limitations, has been covered in detail previously5. In the case of the regulatory-omics signature, we illustrate that our approach discovers both novel and known variable length motifs in the data with a unique NLP strategy.
In contrast to conventional consensus sequence matching or sequence logo fitting, we empirically derive k-mers, tokens or “words” from unannotated sequence data while accounting for genome semantics. Alternatively, we can view this as a form of feature selection, where the most important “words” naturally emerge, which commonly correspond to TFBS, RBD, mobile genetic elements or other important genomic features. An example shown in this experiment is the natural emergence of the HTH and H-NS class of transcription factors, which bind to highly conserved regions of DNA.

Some limitations of the regulatory-omics component of our approach are shared with other motif discovery algorithms. For example, it is theoretically possible for weaker signals or motifs in the data to be overwhelmed by stronger signals. For instance, the trinucleotide repeating sequence for protein coding genes50, or the 10 bp AT-rich dinucleotide periodicity in nucleosomal regions51 may override weaker signals. While we did not observe this in practice, we use this example as a hypothetical illustration of possible confounding factors.

In designing our strategy, we favoured simplicity over sophistication, formulating motif discovery as a straightforward classification problem. To retrieve motifs, we exploited interpretability tools to intercept highly weighted “words” which bias the classifier towards a promoter category. We initially considered employing a zero-shot classification approach, but note the limited effectiveness of such approaches on DNA52. Nevertheless, our results showed that the classification strategy was effective.

The organic nature of our strategy, which minimises both data loss and assumptions, has the side effect of allowing additional insights to naturally emerge from the data. We detected not only known TFBS but also genomic features which naturally arise from the data, as well as oligonucleotide sequences corresponding to potentially novel TFBS or RBD.
Mobile genetic elements or highly conserved regions may be used for species classification purposes in metagenomics studies and broader phylogenetic analyses12. Although this is out of the scope of our study, we demonstrate the potential applicability of our approach outside of systems biology.

We identified and accounted for a possible confounding issue that affects many bacterial genomes. Bacterial genomes are compact, and as a result, many of the -60:20 regions around the TSS which we surveyed overlapped with preceding genes, naturally including some coding regions. We intentionally did not discard sequences in these cases for several reasons. From a biological viewpoint, these sequences still retain their identity as promoter regions, and bacterial genomes are known for encoding genetic information in multiple frames53. We expect from this information density of bacterial genomes that sequences extending into the coding region likely retain motifs, and combined with the broader information rich region are unlikely to negatively affect motif capture. From a technical perspective, removing overlapping sequences would result in a reduced corpus size as well as create an imbalanced dataset, which is known to disrupt the performance of deep learning algorithms54. Finally, our results demonstrated that we were able to detect known and novel genomic features of interest regardless, implying that our inclusion of overlapping regions did not adversely affect performance.

Ideally, a benchmark would be performed on multiple contemporary methods to showcase our method’s comparative performance. This was challenging as our method operates under a unique paradigm which limits the experimental design of such a study for a direct comparison. While our method performs the same functions as well-established software packages such as HOMER55 and MEME28, their core approaches are distinct.
HOMER and MEME operate as motif enrichment tools, where the frequency of occurrence of a subset of motifs in embedded sequences is a major factor driving discovery. Uniquely, our method attempts to learn the grammar of the genome instead of relying on enrichment. This low-level difference results in an implicit requirement of HOMER and MEME to have a large set of sequences with matched motifs or from specific genomic regions, while our limited dataset is not enriched for any particular motif, being drawn from every promoter in the genome. Conversely, this unique mode of operation indicates that our method is free from such constraints, as shown in this study. More broadly, other recent context-aware NLP-based approaches for nucleic and amino acid analysis are available, but none share our application of motif discovery7,13,14.

At a broader level, we clearly illustrate with our newly sequenced Staphylococcus aureus strains that annotations are not required throughout our integrated systems biology approach. The majority of motif and bacterial annotation studies use prior knowledge from the model organism Escherichia coli as a central reference point or as a case study. In contrast, we detected de novo motifs which have never been characterised in the Staphylococcus aureus strains under study. Therefore, the scope of our approach is naturally suited to new or unidentified bacterial strains. Beyond S. aureus, the workflow can in theory be applied to any species or datasets where genomic sequences are available for the same purpose, i.e. motif discovery. It may even be possible to apply our workflow in detecting putative mutation sites. However, we predict that mutation sites may require the training of denser models, due to the faint signal that mutation sites present in contrast to conventional motifs. The workflow is not restricted to biomolecular sequence type, rendering the method applicable to non-coding RNA or protein as well.
Considering that a sepsis patient may have multiple bacterial species present of unknown identity, applying a generic workflow capable of detecting de novo signals in the data is particularly pertinent.

Conversely, our overall integrated approach inherits the limitations of its component parts. A sufficiently replicated set of matching multi-omics samples will be required. The cost of such an experiment may be prohibitive in many cases. Furthermore, the nuances of each individual experiment may make it even more challenging to obtain multi-omics samples in certain cases, for example in transient biological states. While genomic or molecular annotations are not required in order for the method to work, it does not innately identify the genes, molecules or motifs associated with an experiment, and merely infers causal links across molecules. Annotation is performed independently of our pipeline.

Beyond the topic of sepsis, our workflow’s agnosticism to input data removes constraints for its application on other biological systems. From a species and system perspective, our pipeline is readily testable across domains. Outside of the motif discovery space, the word inference module specifically may also be applicable to species classification. K-mer groups, analogous to our word inference approach, are applied to classify different species. Therefore, replacing k-mers with our enriched set of inferred words is a legitimate approach, though testing it is beyond the scope of this manuscript. It is even theoretically possible for our data representation to be merged with clinical notes directly, providing a direct link between electronic health records (EHR) and the genome as another step towards personalised medicine, although testing the viability of this approach is outside the scope of this experiment.
We also caution that closing the link between disparate modalities will be a problem-specific challenge, much like the linking strategy used in our study.

Our pipeline is equally capable of building independent classification models for the purpose of classifying genomic sequences into functional categories, a use which differs from the scope of this study. A myriad of possibilities exist for other applications in annotating promoter regions or genes of unknown function by first training a model, and then examining either the scores or the semantic proximity of an unknown oligonucleotide sequence to that of known oligonucleotide sequences. A conceptually identical use case is common in natural language data56, further legitimising this strategy.

Finally, we emphasise that many of our proposed alternate applications of our workflow are untested hypotheses which are outside the scope of this study. In these examples, the key underlying point we wish to highlight is the intentionally designed receptiveness of our pipeline to very common data representations which are not domain-specific. Therefore, it is reasonable to state that it is possible to operate our pipeline across domains and a broad scope of applications to test the aforementioned hypotheses. Naturally, the results of each individual experiment will require a case-by-case review by domain experts.

Conclusion

Overall, we demonstrated that it is possible to build an integrated systems biology signature of sepsis even in the case of limited annotation, which is particularly pertinent in cases of newly sequenced and discovered bacterial strains. To achieve this, we developed and used a multi-step, data-driven workflow. First, we identified individual molecules comprising a multi-omics functional signature. Next, we applied a transformer-based contextualised DNA language representation model to retrieve motifs from corresponding promoter regions.
Ontology, literature and database searches identified regulatory elements of interest which govern the corresponding functional omics output. Our results and the methodology presented here significantly contribute to a deeper understanding of the systems biology underlying sepsis.

Methods

Source data generation and availability

Original data was previously published. All accession identifiers, code and data generation steps including bacterial culture, biomolecule extraction, and computational steps are identical to our previous publication2. In this study specifically, we use the previously generated genomic sequence data as well as the abundance matrices for transcriptomics, proteomics and metabolomics data for all five Staphylococcus aureus strains BPH2760, BPH2819, BPH2900, BPH2947, BPH2986 generated as part of an Australian National Framework project.

Systems biology signature integration workflow

Functional multi-omics component (quantitative)

We first briefly describe the more abstract logic flow of our multiomics pipeline. It requires a series of matrices representing abundance data, for example a table of gene expression data with their samples and gene identifiers. Each matrix in this series represents a specific data modality (proteomics, metabolomics, etc). Of note is that sample identifiers must be matched across these matrices. The data is then normalised in each matrix. Correlations are calculated between each feature (e.g. transcripts, proteins, metabolites), along with their weight or contribution towards a condition of interest. Highly weighted features are then selected, corresponding to inter-correlated features (e.g. transcript, proteins, metabolites) driving a biological state.

In applying this pipeline in the experiment, abundance matrices were obtained for transcriptomics, proteomics and metabolomics data, and linked by unique sample identifiers for vertical integration. Data was first standardised by rescaling all blocks to unit variance.
A PLSDA between simulated infection and uninfected states was performed for each individual omics data block, followed by its sparse variant (sPLSDA) across all omics data blocks together, using multiomics version 1.2.3. Hyperparameter tuning was carried out to identify the best discriminating factors between the two states, yielding an optimal subset of variables corresponding to the individual transcripts, proteins and metabolites driving the different conditions, along with their classification weights and correlation scores.

We previously published the functional omics software component of our pipeline [5,57]. The workflow used in this study is conceptually identical to the case study featured there, and additional case studies are publicly available [58].

For annotations, automated ontology was performed by mapping protein orthologues with EGGNOG version 2.1.12 [25]. Data was sourced from both the genomic and proteomic omics data blocks comprising the identified functional omics signature.

Reproducible scripts, code parameters and documentation for the analysis associated with the study in this manuscript are available in a gitlab repository: https://gitlab.com/tyagilab/sepsis_integration [22].

Regulatory omics component (qualitative)

As before, we first illustrate the overall logical flow of our genomenlp pipeline. It ingests sets of plain text files as input - for example, FASTA formatted reads. Two categories are required, a case set and a control set, where each set of text files contains text belonging to exactly one category. A classifier employing a state-of-the-art machine learning model is then trained on the data, following standard industry best practices. Key to our approach are our interpretability metrics, which capture sub-regions in the text which predispose the classifier towards a certain category.
Similar to the functional omics pipeline, these sub-regions are weighted by the amount of interest the classifier shows in them. These sub-regions correspond directly to words or motifs of high interest.

Following the logic flow above, for each gene in all five Staphylococcus aureus strains (BPH2760, BPH2819, BPH2900, BPH2947, BPH2986), we extracted a pair of nucleotide sequences corresponding to -60:20 and 21:81 relative to its transcription start site. Using these as a corpus of “promoter” and “protein-coding” sequences respectively, we first conducted empirical tokenisation to derive “words” or tokens of interest from the combined corpus using genomenlp version 2.8.5. Next, we performed a simple binary classification for these two conditions, generating appropriate quality control metrics.

From the transcript subset composing the functional omics signature identified in the previous step, we obtained the corresponding genes of interest, and extracted the -60:20 promoter region of each gene as in the previous step. (Note that we did not discard sequences where the promoter window overlapped with a preceding gene.) We then applied our trained model to the promoter regions, with the specific intention of intercepting model scores to highlight regions of interest which weight the classifier’s decision towards the “promoter” category. In the scope of our study, we emphasise that the classifier itself is not of interest.

A manuscript on the regulatory omics component of our pipeline is in review but available as a preprint [59], along with code and documentation [6]. The workflow used in this study is conceptually similar to the DNA deep learning case study featured in the preprint and corresponding online documentation.

We next annotated the resulting highly scoring oligonucleotide sequences from the promoter. To annotate motifs, MAST version 5.5.4 from the MEME suite was used to search against a combined database of prokaryotes [28].
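The extraction of the -60:20 and 21:81 windows described earlier in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the study: it assumes the offsets are half-open, Python-style intervals relative to a zero-based transcription start site index, that minus-strand genes are returned in transcript orientation, and that windows running past the end of a linear sequence are skipped rather than wrapped around a circular chromosome. The names `window` and `promoter_and_coding` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def window(genome, tss, start, end, strand="+"):
    """Return the bases at offsets [start, end) relative to the TSS.

    Offsets are counted in transcript orientation, so negative offsets are
    upstream of the TSS on either strand (an assumed convention).
    """
    if strand == "+":
        lo, hi = tss + start, tss + end
    else:
        # On the minus strand, upstream lies at higher genomic coordinates.
        lo, hi = tss - end + 1, tss - start + 1
    if lo < 0 or hi > len(genome):
        return None  # window falls off the sequence; circularity is ignored
    seq = genome[lo:hi]
    return seq if strand == "+" else reverse_complement(seq)


def promoter_and_coding(genome, tss, strand="+"):
    """Return the (-60:20, 21:81) sequence pair for one gene."""
    return (
        window(genome, tss, -60, 20, strand),
        window(genome, tss, 21, 81, strand),
    )


# Toy sequence (hypothetical, for illustration only).
toy_genome = "".join("ACGT"[(i * i + i // 3) % 4] for i in range(200))
promoter, coding = promoter_and_coding(toy_genome, tss=100, strand="+")
```

Under these assumptions the promoter window is 80 bases long and the protein-coding window is 60 bases long; a different reading of the coordinate convention (for example, inclusive end points) would change both lengths by one.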
Manual annotations were also performed by literature search and associated references were recorded in a table. This database, along with reproducible scripts, code parameters and documentation associated with the study in this manuscript, is available in a gitlab repository: https://gitlab.com/tyagilab/sepsis_integration [22].

Finally, the combined systems biology approach that we developed for integrating both functional and regulatory signatures is summarised in Fig. 4 and Supplementary Figure 1. The overall convergence of the functional and regulatory omics pipelines is also visible in Fig. 2.

Fig. 4 Overall systems biology workflow overlaying multi-omics functional as well as regulatory omics data. First, raw functional omics data is obtained in the form of abundance matrices with matched samples. Second, a functional multi-omics integration is performed at the level of individual molecules, identifying a subset of high resolution correlations across molecules in different omics data in the process. Third, taking promoter and protein coding sequences of implicated genes, we preprocessed this data using an NLP approach customised for genomics, and trained a deep learning classifier with the ultimate goal of intercepting classification weights corresponding to regulatory elements or other interesting genomic signals. The combined functional and regulatory signatures form the overall systems biology signature.

Data availability

Original data was previously published. All accession identifiers, code and data generation steps including bacterial culture, biomolecule extraction, and computational steps are identical to our previous publication available at: https://doi.org/10.1038/s41467-023-37200-w.
In this study specifically, we use the previously generated genomic sequence data as well as the abundance matrices for transcriptomics, proteomics and metabolomics data for all five Staphylococcus aureus strains (BPH2760, BPH2819, BPH2900, BPH2947, BPH2986) generated as part of an Australian National Framework project. Reproducible scripts, code parameters and documentation for the analysis associated with the study in this manuscript are available in a gitlab repository: https://gitlab.com/tyagilab/sepsis_integration

Accession codes

All data and accession codes are identical to previously generated data [2]. Code to replicate our workflow is in a version controlled open source software repository: https://gitlab.com/tyagilab/sepsis_integration [22].

References

1. Thompson, K. et al. Health-related outcomes of critically ill patients with and without sepsis. Intensive Care Med. 44, 1249–1257 (2018).
2. Mu, A. et al. Integrative omics identifies conserved and pathogen-specific responses of sepsis-causing bacteria. Nat. Commun. 14, 1530 (2023).
3. Chuang, Y.-P., Fang, C.-T., Lai, S.-Y., Chang, S.-C. & Wang, J.-T. Genetic determinants of capsular serotype k1 of klebsiella pneumoniae causing primary pyogenic liver abscess. J. Infect. Dis. 193, 645–654 (2006).
4. Evans, L. et al. Executive summary: Surviving sepsis campaign: international guidelines for the management of sepsis and septic shock 2021. Crit. Care Med. 49, 1974–1982 (2021).
5. Chen, T., Philip, M., Lê Cao, K.-A. & Tyagi, S. A multi-modal data harmonisation approach for discovery of COVID-19 drug targets. Brief. Bioinform. https://doi.org/10.1093/bib/bbab185 (2021).
6. Chen, T., Tyagi, N., Chauhan, S., Peleg, A. Y. & Tyagi, S. genomicBERT and data-free deep-learning model evaluation. https://doi.org/10.5281/zenodo.8135590 (2023).
7. Asgari, E. & Mofrad, M. R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PloS one 10, e0141287 (2015).
8. Miller, D., Stern, A. & Burstein, D. Deciphering microbial gene function using natural language processing. Nat. Commun. 13, 5731 (2022).
9. Benegas, G., Ye, C., Albors, C., Li, J. C. & Song, Y. S. Genomic language models: Opportunities and challenges. Trends Genet. (2025).
10. Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst. 30 (2017).
11. Wolf, T. et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019).
12. Brown, C. T. & Irber, L. sourmash: A library for minhash sketching of DNA. J. Open Source Softw. 1, 27 (2016).
13. Jumper, J. et al. Highly accurate protein structure prediction with alphafold. Nature 596, 583–589 (2021).
14. Boshar, S., Trop, E., de Almeida, B. P., Copoiu, L. & Pierrot, T. Are genomic language models all you need? exploring genomic language models on protein downstream tasks. Bioinformatics 40, btae529. https://doi.org/10.1093/bioinformatics/btae529 (2024). https://academic.oup.com/bioinformatics/article-pdf/40/9/btae529/59117017/btae529.pdf.
15. Matys, V. et al. Transfac® and its module Transcompel®: Transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).
16. Giudice, G., Sánchez-Cabo, F., Torroja, C. & Lara-Pezzi, E. Attract-a database of rna-binding proteins and associated motifs. Database 2016, baw035 (2016).
17. Rauluseviciute, I. et al. Jaspar 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 52, D174–D182 (2024).
18. Bailey, T. L. et al. Meme suite: Tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208 (2009).
19. Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. arXiv preprint arXiv:1701.06279 (2017).
20. Jones, K. S. A statistical interpretation of term specificity and its application in retrieval. J. Docum. (1972).
21. Sanh, V. Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
22. tyronechen, Chen, T., Peleg, A. & Tyagi, S. Understanding the regulatory grammar of sepsis-causing bacteria using contexualised DNA language models. https://doi.org/10.5281/zenodo.10032374 (2023).
23. Kanehisa, M. & Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
24. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
25. Huerta-Cepas, J. et al. eggnog 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
26. Bailey, T. L. & Gribskov, M. Score distributions for simultaneous matching to multiple motifs. J. Comput. Biol. 4, 45–59 (1997).
27. Bailey, T. L. & Gribskov, M. Methods and statistics for combining motif match scores. J. Comput. Biol. 5, 211–221 (1998).
28. Bailey, T. L., Johnson, J., Grant, C. E. & Noble, W. S. The meme suite. Nucleic Acids Res. 43, W39–W49 (2015).
29. Ibarra, J. A., Pérez-Rueda, E., Carroll, R. K. & Shaw, L. N. Global analysis of transcriptional regulators in Staphylococcus aureus. BMC Genomics 14, 1–12 (2013).
30. Beber, M. E., Muskhelishvili, G. & Hütt, M.-T. Effect of database drift on network topology and enrichment analyses: A case study for Regulondb. Database 2016, baw003 (2016).
31. Chang, M., Hadero, A. & Crawford, I. Sequence of the pseudomonas aeruginosa trpi activator gene and relatedness of trpi to other procaryotic regulatory genes. J. Bacteriol. 171, 172–183 (1989).
32. Cole, S. T. & Raibaud, O. The nucleotide sequence of the malt gene encoding the positive regulator of the Escherichia coli maltose regulon. Gene 42, 201–208 (1986).
33. Old, I. G., Phillips, S. E., Stockley, P. G. & Saint Girons, I. Regulation of methionine biosynthesis in the Enterobacteriaceae. Prog. Biophys. Mol. Biol. 56, 145–185 (1991).
34. Fitzgerald, S. et al. Redefining the h-ns protein family: A diversity of specialized core and accessory forms exhibit hierarchical transcriptional network integration. Nucleic Acids Res. 48, 10184–10198 (2020).
35. Rabin, R. & Stewart, V. Dual response regulators (narl and narp) interact with dual sensors (narx and narq) to control nitrate-and nitrite-regulated gene expression in Escherichia coli k-12. J. Bacteriol. 175, 3259–3268 (1993).
36. Seetharaman, J., Kumaran, D., Bonanno, J. B., Burley, S. K. & Swaminathan, S. Crystal structure of a putative hth-type transcriptional regulator yxaf from bacillus subtilis. Proteins 63, 1087 (2006).
37. Lang, B. et al. High-affinity DNA binding sites for h-ns provide a molecular basis for selective silencing within proteobacterial genomes. Nucleic Acids Res. 35, 6330–6337 (2007).
38. Fernández de Henestrosa, A. R. et al. Identification of additional genes belonging to the lexa regulon in Escherichia coli. Mol. Microbiol. 35, 1560–1572 (2000).
39. Brescia, C. C., Kaw, M. K. & Sledjeski, D. D. The DNA binding protein h-ns binds to and alters the stability of RNA in vitro and in vivo. J. Mol. Biol. 339, 505–514 (2004).
40. Johansson, J., Dagberg, B., Richet, E. & Uhlin, B. E. H-ns and stpa proteins stimulate expression of the maltose regulon in Escherichia coli. J. Bacteriol. 180, 6117–6125 (1998).
41. Ansaldi, M., Simon, G., Lepelletier, M. & Mejean, V. The torr high-affinity binding site plays a key role in both torr autoregulation and torcad operon expression in Escherichia coli. J. Bacteriol. 182, 961–966 (2000).
42. Nakabeppu, Y. & Sekiguchi, M. Regulatory mechanisms for induction of synthesis of repair enzymes in response to alkylating agents: ADA protein acts as a transcriptional regulator. Proc. Natl. Acad. Sci. 83, 6297–6301 (1986).
43. Bonn, C. M. et al. Repeated emergence of variant tetr family regulator, farr, and increased resistance to antimicrobial unsaturated fatty acid among clonal complex 5 methicillin-resistant Staphylococcus aureus. Antimicrob. Agents Chemother. 67, e00749-22 (2023).
44. Nguyen, M.-T. et al. Inactivation of farr causes high rhodomyrtone resistance and increased pathogenicity in Staphylococcus aureus. Front. Microbiol. 10, 1157 (2019).
45. Maddocks, S. E. & Oyston, P. C. Structure and function of the lysr-type transcriptional regulator (lttr) family proteins. Microbiology 154, 3609–3623 (2008).
46. Burge, C., Campbell, A. M. & Karlin, S. Over-and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. 89, 1358–1362 (1992).
47. Zaki, M. E. S., Bastawy, S. & Montasser, K. Molecular study of resistance of Staphylococcus aureus to antiseptic quaternary ammonium compounds. J. Glob. Antimicrob. Resist. 17, 94–97 (2019).
48. Hirschman, J., Wong, P., Sei, K., Keener, J. & Kustu, S. Products of nitrogen regulatory genes ntra and ntrc of enteric bacteria activate glna transcription in vitro: Evidence that the ntra product is a sigma factor. Proc. Natl. Acad. Sci. 82, 7525–7529 (1985).
49. Badua, A. T., Boonyayatra, S., Awaiwanont, N., Gaban, P. B. V. & Mingala, C. N. Antibiotic resistance and genotyping of meca-positive methicillin-resistant Staphylococcus aureus (mrsa) from milk and nasal carriage of dairy water buffaloes (Bubalus bubalis) in the Philippines. J. Adv. Vet. Anim. Res. 7, 397 (2020).
50. Shepherd, J. Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc. Natl. Acad. Sci. 78, 1596–1600 (1981).
51. Trifonov, E. N. & Sussman, J. L. The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc. Natl. Acad. Sci. 77, 3816–3820 (1980).
52. Benegas, G., Batra, S. S. & Song, Y. S. DNA language models are powerful zero-shot predictors of genome-wide variant effects. bioRxiv 2022–08 (2022).
53. Chandler, M. & Fayet, O. Translational frameshifting in the control of transposition in bacteria. Mol. Microbiol. 7, 497–503 (1993).
54. Thabtah, F., Hammoud, S., Kamalov, F. & Gonsalves, A. Data imbalance in classification: Experimental evaluation. Inf. Sci. 513, 429–441 (2020).
55. Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b cell identities. Mol. Cell 38, 576–589 (2010).
56. Egger, R. Text Representations and Word Embeddings: Vectorizing Textual Data (Springer, 2022).
57. Chen, T. & Tyagi, S. Integrative computational epigenomics to build data-driven gene regulation hypotheses. GigaScience 9, 1–13. https://doi.org/10.1093/gigascience/giaa064 (2020).
58. Chen, T., Abadi, A. J., Lê Cao, K.-A. & Tyagi, S. multiomics: A user-friendly multi-omics data harmonisation r pipeline. F1000Research 10, 538 (2023).
59. Chen, T., Tyagi, N., Chauhan, S., Peleg, A. Y. & Tyagi, S. genomicbert and data-free deep-learning model evaluation. bioRxiv 2023–05 (2023).

Acknowledgements

TC was supported by an Australian Government Research Training Program (RTP) Scholarship and Monash Faculty of Science Dean’s Postgraduate Research Scholarship. ST acknowledges support from the Early Mid-Career Fellowship by the Australian Academy of Science and Australian Women Research Success Grant at Monash University. AP and ST acknowledge the Medical Research Future Fund and Genomics Health Futures Mission funding for the SuperbugAI flagship project. We thank Benjamin Vezina and Vinícius Salazar for helpful discussions on bacterial gene annotation. We would like to acknowledge the contribution of the Antibiotic Resistant Sepsis Pathogens Framework Initiative consortium (https://data.bioplatforms.com/organization/pages/bpa-sepsis/consortium) in the generation of data used in this publication. This work was supported by the MASSIVE HPC facility (www.massive.org.au). We acknowledge the helpful discussions and compute resources from the Monash eResearch Platform and Monash Bioinformatics Platform. We thank the three anonymous reviewers whose constructive criticism considerably improved our manuscript. Biorender was used to create many figures in this publication.
We acknowledge and pay respects to the Elders and Traditional Owners of the land on which our four Australian campuses stand.

Funding

TC was supported by an Australian Government Research Training Program (RTP) Scholarship and Monash Faculty of Science Dean’s Postgraduate Research Scholarship. ST acknowledges support from the Early Mid-Career Fellowship by the Australian Academy of Science and Australian Women Research Success Grant at Monash University. AP and ST acknowledge the Medical Research Future Fund and Genomics Health Futures Mission funding for the SuperbugAI flagship project.

Author information

Authors and Affiliations

Department of Infectious Diseases, The Alfred Hospital and School of Translational Medicine, Monash University, Melbourne, VIC, 3004, Australia: Tyrone Chen, Anton Y. Peleg & Sonika Tyagi

Centre to Impact AMR, Monash University, Melbourne, VIC, 3004, Australia: Anton Y. Peleg & Sonika Tyagi

Infection Program, Department of Microbiology, Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3145, Australia: Anton Y. Peleg

School of Computing Technologies, Royal Melbourne Institute of Technology University, Melbourne, VIC, 3000, Australia: Sonika Tyagi

Contributions

T. C.: Conceptualization, Methodology, Software, Data Curation, Writing - Original draft preparation, Writing - Reviewing and Editing, Visualisation, Investigation, Validation. A. Y. P.: Supervision. S. T.: Conceptualization, Methodology, Writing - Original draft preparation, Writing - Reviewing and Editing, Supervision.

Corresponding author

Correspondence to Sonika Tyagi.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1 to 7.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.