IntroductionCarbohydrate active enzymes (CAZy) are essential proteins supporting the synthesis, modification and deconstruction of carbohydrates. CAZy proteins are central to ecosystem functioning, human health and have many biotechnological applications (e.g., biofuels and biotherapeutics). CAZy proteins correspond to functional protein domains in either single-domain (SD) or multi-domain (MD) proteins. CAZy domains are divided into five classes based on their biochemical reactions: glycoside hydrolases (GH), polysaccharide lyases (PL), carbohydrate esterases (CE), auxiliary activities (AA), and glycosyl transferases (GT)1. In addition, accessory non-catalytic carbohydrate binding modules (CBM) help the associated CAZy domains bind to specific substrates [e.g2,3]. Initially, many CAZy families were named according to their described activities. However, as more proteins were characterized, new activities were described in previously identified CAZy families leading to conflicting names. For example, GH families 5 and 18 were originally named “cellulase” and “chitinase”4,5 however these families now include many proteins which are de facto not cellulolytic nor chitinolytic enzymes respectively. For example, some GH5 are xylanases6. Eventually, a naming convention focusing on the type of reaction being catalyzed (e.g., glycoside hydrolase) was adopted and implemented in the carbohydrate active enzyme (CAZy) database (cazy.org)5,7.Among others, the CAZy database lists the reported enzymatic activity of the biochemically characterized CAZy proteins1,7,8. As of 2024, although some CAZy families (e.g., GH22) are associated with only one reported enzymatic activity, most are associated with several activities. For each family, the CAZy database provides an updated list of associated activities in the form of “Enzyme Commission” (EC) numbers (e.g., > 30 EC numbers for GH5). In addition, the recent implementation of CAZac provides detailed information about the substrates and products of each reaction9. Yet, even in polyspecific CAZy families, individual enzymes are usually associated with a single reported function1 and variations in the substrate specificity and activity within CAZy families mirror diverging evolutionary processes and have been associated with subtle changes in the structure of the substrate binding cleft and active site.The CAZy database also defines CAZy domains and classifies them into families based on conserved 3D-structures and amino acid sequences7. Therefore, Multiple Sequence Alignments (MSA) and Hidden Markov Model (HMM) profiles have been frequently used to identify CAZy domains [e.g10,11,12]. However, while the CAZy database defines CAZy families, no explicit definitions (e.g., HMM-profiles, MSA) are provided and no sequence is directly available on the CAZy database. Before 2012, the CAZy database provided links to other relevant databases useful for domain definition, sequence identification, and characterization (e.g., Pfam, InterPro). However, these links have been systematically removed from the cazy.org website. Conversely, links redirecting the users to deprecated databases are still in place (e.g., PRINTS and HOMSTRAD). In this context, the reduction in the usability of the CAZy database prompted the development of alternative annotation systems for CAZy domain identification relying on publicly accessible information [e.g10,12,13]. For some CAZy families, sequence variations leading to polyspecificity were accounted for and integrated into sorting algorithms, resulting in the creation of subfamilies with somewhat conserved substrate specificity [e.g14,15,16,17]. For example, per the CAZy database, the subfamily 7 in the GH5 family (i.e., GH5_7) is associated with just two reported activities.Besides CAZy proteins with a single reported enzymatic activity, some CAZy proteins are associated with multiple activities. Briefly, “Tier 1” multifunctional enzymes are single-domain CAZy proteins that display multiple activities. These proteins generally display high activity toward one substrate and reduced activity on alternative promiscuous substrates18. As proteins are flexible 3D structures, it is expected that most CAZy can accommodate and process structurally related substrates albeit with variable affinity and kinetic efficiency18,19. Alternatively, multidomain enzymes with a single catalytic domain are called “Tier 2”, and those with multiple catalytic domains are “Tier 3” multifunctional enzymes18. These multi-domain proteins result from genetic recombination that shuffles sequences for protein domains20,21. While domain shuffling is essentially random and has the potential to generate many domain combinations12, most characterized multi-domain proteins listed on the CAZy database consist of a single catalytic domain associated with CBM(s) and perform a single function. Proteins with multiple catalytic domains are relatively scarce among CAZy12,20,21. However, the fusion of protein domains involved in the same process and displaying synergies might be under positive selection and few Tier 3 multifunctional CAZy have been identified and characterized [e.g12,21,22]. For most of these multi-domain multi-activity proteins, biochemical evidence linking specific activities to specific domains is generally missing. In consequence, for Tier 3 CAZy proteins, all the described activities are associated with each identified domain and incorporated in the list of activities found in the corresponding CAZy family. This lack of biochemical evidence ends up ballooning the number of activities associated with each CAZy domain and family1,8.While no annotation system is directly available from the CAZy database, the compiled information on the CAZy database is used extensively to link sequences identified in (meta)-genomes to enzymatic function and to ecosystem performance and health [e.g23,24,25]. However, as the CAZy database was not designed for sequence annotation and offered limited resources for that purpose, linking identified sequences to function frequently relies on a “majority rule”: in absence of biochemical evidence, the function of a newly identified CAZy sequence is assumed to be the dominant activity identified in the corresponding CAZy family or potential subfamily. For example, for the lack of a better solution, GH5 identified in metagenomes are generally assumed to be cellulases [e.g26,27,28] however, many distinct activities are associated with GH5. This potentially misleading majority rule is also implemented in some annotation systems, at the GH family or sub-family level29. Overall, although the “majority rule” makes sense for monospecific CAZy families and families dominated by a single enzymatic activity, no rationale supports its application to mostly polyspecific CAZy families.Here, we present ez-CAZy, a custom database for linking newly identified sequences to specific enzymatic activities. We retrieved and re-annotated the sequences of biochemically characterized GHs listed on the CAZy database. For each sequence, we identified the precise domain architecture using publicly accessible tools to link sequence variation to the identified enzymatic function. The data was consolidated in a single publicly available FASTA file. Next, we investigated how the enzymatic activity (i.e., EC) and the overall domain architecture of GHs relate to variation in the sequence of the catalytic domains. We hypothesized that more similar sequences would display similar enzymatic activities and that the clumped distribution of enzymatic activities among characterized CAZy would reflect their multi-domain architecture. Finally, we tested if the activity of newly identified and characterized GHs – not listed in our initial dataset – could be predicted by performing local sequence alignment. Here, we predicted that, since the EC activity and the multi-domain architecture are not randomly distributed among GHs, matching unknown sequences to characterized GHs will facilitate the prediction of the enzymatic activity of newly identified sequences. This work, based on publicly accessible data and bioinformatics tools, provides an easy way to (re)analyze available sequences and envision custom annotation of CAZy sequences from newly sequenced samples.ResultsSequences retrieval and reannotationWe retrieved the sequence and functional information of 7,198 characterized GH proteins listed on CAZy in Spring 2023. The sequences were re-annotated using Pfam HMM-profiles (Supplementary data 1). A total of 14,841 protein domains from 445 distinct domain families were identified. Of the characterized GHs, 3,169 were single-domain proteins, while the remaining 4,029 multi-domain proteins contained 11,672 protein domains. Multi-domain CAZy with two (n = 2,081) or three (n = 1,229) domains made up the majority of multidomain proteins, but more than 170 proteins contained at least 6 identified domains and one GH70 protein (i.e., CDX66820.1) was associated with 19 domains. Our Pfam-based annotation matched most of the annotation from CAZy, except for the recently coined CAZy families for which no HMM-profile nor multiple sequence alignment were publicly available at the time of analysis. These proteins were generally annotated as members of previously defined CAZy families. For example, the b-galactosidase from Bacteroides ovatus ATCC 8483 (ALJ49469.1), although listed as a GH147 on the CAZy database, was annotated as a GH5 (PF00150), suggesting the structural proximity between these families. In addition, some of the unmatched proteins were assigned to domains not known for their involvement in carbohydrate metabolism. For example, the GH120 from Bifidobacterium adolescentis ATCC 15,703 (BAF39080.1) was annotated as a domain of unknown function (DUF1565 - PF07602) and a b-helix domain (PF13229).The reannotated sequences were compiled in two separate FASTA files containing (i) the whole protein sequence (ez-CAZy_V1a.faa) and (ii) the sequence of the identified catalytic domain only (ez-CAZy_V1b.faa) available in supplementary data.GH multi-domain architecture vs. enzymatic activityFollowing the creation of the databases, we investigated the domain organization from N- to C-terminal across characterized GHs (Supplementary data 1). Several GH families contained only single-domain proteins. However, most of these families were recently created or contained few characterized proteins with some notable exceptions including GH22, GH34, and GH56 with 62, 30, and 29 single domain proteins, respectively. When considering GH families with ≥10 characterized proteins, some families were still dominated by single-domain proteins (e.g., GH8, GH51) but most contained proteins with variable and sometimes complex multi-domain architectures. Some large families consisted of mostly multi-domain proteins (e.g., GH3, GH9, GH13, GH32) (Fig. 1A). For these predominantly multi-domain GH families, the existence of specific CAZy subdomains increased the prevalence of multi-domain proteins. For example, many characterized GH3 (PF00933) contained a specific C-terminal subdomain (PF01915) only found in some GH3. Similarly, many GH9 (PF00759) were associated with a specific CelD_N subdomain (PF02927) at the N-terminal end (Fig. 2). Although these subdomains were only found in association with GH3 and GH9 respectively, they were not systematically identified. Besides family-specific subdomains, proteins from many GH families were associated with myriads of domains. For example, 61, 55, 21, and 13 distinct types of protein domains were associated with characterized GH5, GH13, GH9, and GH3 respectively. Most domains associated with GHs were small non-catalytic domains such as carbohydrate binding modules (CBM), dockerin domains, and domains of unknown function (DUF). Finally, some characterized GH proteins contained multiple catalytic domains sometimes in conjunction with accessory non-catalytic domains. For example, celA (ACM60955.1) from Caldicellulosiruptor bescii DSM 6725 consisted of N-GH9-3×(CBM3)-GH48-C22. Other characterized proteins consisted of a single GH catalytic domain and a non-CAZy catalytic domain. For example, RmNag3 (ACY47819.1) from Rhodothermus marinus DSM4252 was composed of typical “GH3-GH3c” at the N-terminal end associated with a functional ß-lactamase domain, for penicillin degradation, at the C-terminal end30.Fig. 1(A) Distribution of multi-domain GH architecture among families with ≥50 characterized proteins at the time of analysis. (Proteins are sorted by proportion of single-domain GHs). (B) Distribution of reported enzymatic activity among GH families with ≥50 characterized proteins at the time of analysis. Number of reported activities may not match the number of proteins as some proteins display multiple activities. (Protein families are sorted by the prevalence of their most frequent activity).Full size imageFig. 2Tree of the GH9 catalytic domains (PF00759, in red) highlighting the distribution of multi-domain architecture, from N- to C-terminal. All the displayed proteins exhibit only cellulolytic activity (EC.3.2.1.4) unless otherwise noted.Full size imageMost single-domain GH families (e.g., GH22, GH34, GH56) were affiliated with a reduced number of enzymatic activities. For example, all identified GH22 were described as lysozyme (EC.3.2.1.17). Conversely, most predominantly multi-domain GH families (e.g., GH3, GH13, GH5) were affiliated with many enzymatic activities (Fig. 1B) suggesting that the addition of accessory domains affected the activity of the catalytic domains. However, in some GH families, variable multi-domain architecture was not associated with variable enzymatic activity. Indeed, although characterized GH9 displayed many variable architectures, most (86%) acted as cellulase (i.e., EC.3.2.1.4) and eventually displayed additional activities (Fig. 2). Conversely, the mostly single-domain GH8 proteins displayed variable activities (e.g., EC.3.2.1.4, EC.3.2.1.6, EC.3.2.1.8).To investigate how the discrete evolution of the catalytic domains and the addition of accessory domains impacted the reported enzymatic activity, we created trees using the sequence of the identified catalytic domains and investigated the distribution of the protein multi-domain architecture and the reported EC activity/ies. For example, when considering biochemically characterized GH9, our initial dataset included 151 proteins accounting for 10 distinct enzymatic activities, with a total of 374 domains from 21 domain families (Fig. 2). The distribution of non-GH9 domains relative to the sequence of the GH9 domain was strongly clustered. Specifically, apart from a few deeply branched GH9 sequences, all the proteins with an N-terminal “CelD_N” domain (PF02927) belonged in a single cluster. Within this cluster, some smaller clusters were composed of proteins with additional domains located before or after the core CelD_N-GH9 domains. Overall, proteins with complex multidomain architecture formed smaller homogenous clusters. The only notable exceptions to this clustered distribution are the presence of CBM_3 (PF00942), and the dockerin domain (PF00404) which were found in multiple clusters.Regarding the reported enzymatic activity (EC) of biochemically characterized GH9 proteins, most acted as cellulase (EC.3.2.1.4), while enzymes with alternative activities were often clustered and displayed specific multi-domain architecture (Fig. 2). However, within these clusters, not all enzymes were reported to share the same activity.The clustered distribution of the multidomain architecture of biochemically characterized GH9 along with the distribution of the reported enzymatic activities was further investigated using an ANOSIM test. Both the distribution of multidomain architecture and the reported enzymatic activity were positive, thus indicating a significantly clustered distribution (p 0, p 0.05). In addition, most of the GH families where the distribution of multi-domain architectures was randomly distributed were families largely dominated by single-domain proteins (e.g., GH8) or GH families with few characterized proteins (e.g., n = 17 in GH130) (Fig. 3A).Fig. 3(A) ANOSIM test investigating the clustering of reported EC activities and multi-domain architectures across GH families with > 10 biochemically characterized enzymes. (B) Frequency of predicted enzymatic activity of newly listed GHs matching the reported biochemically characterized activity.Full size imageThe distribution of the reported enzymatic activities across biochemically characterized GH families, was significantly clustered in most protein families (R-ANOSIMEC>0, p