Algorithmic overview of DIAMOND DeepClust

Representative-based clustering

Following a strategy analogous to the gold-standard approaches implemented in CD-HIT12 and UCLUST13, we define a clustering of an input dataset of protein sequences as a subset of representative sequences such that every input sequence lies within a user-defined distance threshold of at least one representative sequence. Each input sequence is thus assigned to one representative sequence. This threshold setting (sequence alignment identity, coverage and e value) is also referred to as the clustering criterion. For the purpose of this study, in addition to a basic e value threshold of 0.001, we require that a pairwise local alignment between sequences satisfies a minimum sequence identity and a minimum length coverage of both the representative and the co-clustered (non-representative) sequence (bi-directional coverage clustering). In addition, we also carried out clustering runs using uni-directional coverage clustering, which only requires coverage of the co-clustered (non-representative) sequence. In this context, sequence coverage refers to the length of the range spanned by a local alignment divided by the length of the respective sequence.

Greedy vertex cover

Starting from a set of alignments that meet the user-specified or default clustering criterion, we compute a set of representative sequences by first encoding the alignments as a directed graph G, where nodes represent individual protein sequences and edges denote pairwise local alignments between them; a directed edge from sequence A to sequence B indicates that A can represent B according to the clustering criterion. In a second step, we apply the greedy vertex cover algorithm17 to the alignment graph G to determine a near-minimal covering set of graph vertices.
The algorithm repeatedly selects the vertex with the highest out-degree and removes it from the graph together with its out-neighbors to form a new cluster, until the graph is completely clustered. The greedy vertex cover approach can be generalized to also include indirect neighbors of the designated representative sequence, up to a certain search depth on the graph, in the newly formed cluster; an infinite search depth would form clusters corresponding to the connected components of the graph. DIAMOND DeepClust uses a search depth of one for the first two rounds of cascaded clustering and a depth of three for the remaining rounds. While the greedy vertex cover algorithm finds a near-minimal covering set of representatives and thereby minimizes the cluster count, it does not attempt to identify weak edges between densely connected communities of the graph that could indicate false positives and should be ignored. We plan to integrate Markov Clustering18 into DIAMOND DeepClust to address this limitation.

Minimizer sampling

In the first round of cascaded clustering, we subsample the seed space using minimizers19 with a window size of 12, which we empirically found to provide a good balance between speed and sensitivity. While similar to the sketching approach proposed by MMseqs2/Linclust9, our approach samples a number of minimizers that depends on the length of the sequence instead of a fixed number per sequence.

Linear-stage clustering

MMseqs2/Linclust achieves linear computational scaling of comparisons by considering only seed hits against the longest sequence sharing an identical seed, rather than trialing all possible combinations9.
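To illustrate the greedy vertex cover step described above, the following is a minimal Python sketch (the function name and data layout are our own, not DIAMOND's optimized implementation; it includes the depth-limited recruitment of indirect neighbors):

```python
from collections import defaultdict

def greedy_vertex_cover(edges, depth=1):
    """Greedy vertex cover clustering on a directed alignment graph.

    `edges` is an iterable of (a, b) pairs meaning sequence `a` can
    represent sequence `b` under the clustering criterion.  Repeatedly
    picks the unclustered node covering the most unclustered
    out-neighbours and recruits its out-neighbours up to `depth` hops
    into a new cluster.  Illustrative sketch only.
    """
    out = defaultdict(set)
    nodes = set()
    for a, b in edges:
        out[a].add(b)
        nodes.update((a, b))
    assignment = {}  # sequence -> representative
    while True:
        # pick the remaining node covering the most unclustered neighbours
        rep = max((n for n in nodes if n not in assignment),
                  key=lambda n: sum(m not in assignment for m in out[n]),
                  default=None)
        if rep is None:
            break
        assignment[rep] = rep
        frontier = [rep]
        for _ in range(depth):  # breadth-first recruitment up to the depth
            frontier = [m for n in frontier for m in out[n]
                        if m not in assignment]
            for m in frontier:
                assignment[m] = rep
    return assignment
```

With a search depth of one, each cluster member is a direct out-neighbor of its representative; larger depths approximate connected-component clustering, as noted above.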
The Linclust heuristic is sufficient in the context of uni-directional coverage clustering to find meaningful representatives, as the seeds at this stage are selected to be highly specific and the longest sequence is a priori the most likely to maximize recruitment of member sequences under the uni-directional coverage criterion. It is insufficient for bi-directional coverage clustering, however, as it considers comparisons that cannot possibly satisfy the bi-directional coverage criterion based solely on the lengths of the sequences. We observe that for a coverage threshold c and a length ratio \(\frac{l_1}{l_2}\) \(\left(l_1\le l_2\right)\) between two sequences, a local alignment of the two sequences can only reach a coverage of c of both sequences if \(\frac{l_1}{l_2} > c-\varepsilon\) holds (with ε > 0 being some tolerance to account for gaps). To make use of this property, our algorithm conducts comparisons for a group of n sequences \(s_1,\cdots,s_n\) of lengths \(l_1,\cdots,l_n\) (\(l_i\ge l_{i+1}\)) containing the same seed as follows. An initial range [i;j] is constructed with i = 1 and \(j=\max\{k\in[i;n]:\frac{l_k}{l_i}\ge c+\delta\}\). All sequences with index in the range [i;j] are then compared to the ‘median’ sequence of index \(\lfloor\frac{i+j}{2}\rfloor\). If j = n the algorithm terminates; otherwise the interval is advanced by setting \(i=\min\{k\in[i+1;n]:\frac{l_{j+1}}{l_k}\ge c+\delta\}\) and j as above, followed again by comparison against the median sequence and advancement of the interval. We note that instead of subtracting a tolerance ε from the coverage threshold, we add a positive value δ (δ = 0.05 by default) to make the criterion more stringent, which works better in practice as the narrower length intervals increase the number of comparisons.
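The interval scheme above can be sketched as follows (an illustrative Python sketch using 0-based indices; the function name is ours, and the repeated comparisons that arise from overlapping intervals are not deduplicated):

```python
def linear_stage_groups(lengths, c, delta=0.05):
    """Yield (median_index, interval_indices) pairs for one seed group.

    `lengths` must be sorted in descending order (l_i >= l_{i+1}).
    Sequences whose length ratio to the interval's longest member falls
    below c + delta cannot satisfy a bi-directional coverage of c, so
    the group is processed as a sliding interval [i, j]; every member of
    the interval is compared only against its 'median' sequence.
    """
    n = len(lengths)
    i = 0
    while i < n:
        # extend j while the shortest/longest ratio stays above c + delta
        j = i
        while j + 1 < n and lengths[j + 1] / lengths[i] >= c + delta:
            j += 1
        yield (i + j) // 2, list(range(i, j + 1))
        if j + 1 >= n:  # whole group processed
            break
        # advance i to the smallest index whose length still allows
        # sequence j+1 to reach bi-directional coverage c
        i += 1
        while lengths[j + 1] / lengths[i] < c + delta:
            i += 1
```

For example, with lengths [100, 95, 90, 50, 48] and c = 0.8, the first interval covers the three long sequences (compared against the sequence of length 95) and the second covers the two short ones, avoiding hopeless long-versus-short comparisons entirely.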
This interval-advancing scheme is designed to select a subset of potential representative sequences likely to maximize recruitment of cluster members under a bi-directional coverage criterion.

Cascaded clustering

As exhaustive all-versus-all alignment of protein datasets comprising hundreds of millions of sequences is prohibitively expensive, we adopt cascaded clustering (personal communication with the DeepMSA2 team) to gradually construct larger sequence clusters in several rounds of comparison with increasing alignment sensitivity (iterating through the modes --fast, default, --sensitive, --very-sensitive and --ultra-sensitive, corresponding to the sensitivity modes of our DIAMOND v.2 aligner10). A round of cascaded clustering performs self-comparison of the input sequences at a round-specific sensitivity. We then compute representative sequences from the resulting alignments as explained above under greedy vertex cover, and these are passed on to the next round of cascaded clustering and subjected to an all-versus-all search at increased sensitivity. Depending on the desired clustering depth, two to six of these alignment rounds are chained until the sensitivity suffices for most representatives within clustering distance to have been discovered. The first round of cascaded clustering uses the linear-stage algorithm in conjunction with minimizer subsampling. Contrary to MMseqs2, which uses a single linear round, we add a second linear round without minimizer subsampling, which we found to provide an additional 36% reduction of the sequence space (in the case of clustering the NR database) without requiring expensive all-versus-all comparison. In addition, we found it advantageous to use a more stringent clustering criterion in earlier rounds of cascaded clustering and to relax it gradually in later rounds.
More specifically, we add 7 percentage points to the coverage and sequence identity thresholds up to and including the round using default-mode sensitivity, and 5 percentage points to the coverage threshold in the round using the --sensitive mode. For example, when the base identity cutoff is set to 20%, all rounds up to the default mode use a 27% threshold. The algorithm is memory efficient (Supplementary Discussion 1) and permits adding new sequences to an existing clustering (Supplementary Discussion 2).

Self-alignment optimization

We optimized self-alignment of the representative databases by taking advantage of the symmetry of queries and targets in the seeding stage, avoiding the evaluation of redundant seed hits and thus doubling the performance of this computation. When the database is processed in blocks, this is accounted for by eliminating redundant block combinations.

Bi-directional coverage optimization

As explained above, many sequence pairs can be excluded from satisfying a bi-directional coverage criterion based on the ratio of their respective lengths. We take full advantage of this property in the earliest stage of the pipeline for all-versus-all alignment computations. In the context of the double-indexing algorithm of DIAMOND, this is accomplished by sorting the seed index tables by the length of the sequences in descending order. When processing the hits of a seed, for a given query sequence of length \(l_q\) and a set of n target sequences of lengths \(l_1,\cdots,l_n\), we determine a range [i;j] of target sequences such that \(i=\min\{k\in[1;n]:\frac{l_q}{l_k}\ge c-\varepsilon\}\) and \(j=\max\{k\in[1;n]:\frac{l_k}{l_q}\ge c-\varepsilon\}\) (using a value of ε = 0.05 by default). We then compare the query sequence to all target sequences within the index range [i;j]. To process the next query sequence, both indexes i and j are then incremented until the conditions above are met again.
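The admissible target range for a single query can be sketched as follows (illustrative Python using a linear scan over the symmetric shorter-to-longer length ratio; the function name is ours, and the real implementation instead advances i and j incrementally across length-sorted queries):

```python
def admissible_target_range(l_q, target_lengths, c, eps=0.05):
    """Return the inclusive index range (i, j) of targets that can
    satisfy a bi-directional coverage c with a query of length l_q.

    `target_lengths` is sorted in descending order, as in length-sorted
    seed index tables.  A pair can only reach coverage c of both
    sequences if shorter/longer >= c - eps, so targets that are too
    long (l_q / l_t < c - eps) or too short (l_t / l_q < c - eps) are
    excluded.  Returns None if no target qualifies.
    """
    t = c - eps
    idx = [k for k, l_t in enumerate(target_lengths)
           if min(l_q, l_t) / max(l_q, l_t) >= t]
    return (idx[0], idx[-1]) if idx else None
```

Because the admissible ratio band shifts monotonically as query lengths decrease, incrementing i and j per query (rather than rescanning) yields the minimal-cost update described next.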
Due to the length sorting of both the query and the target index tables, advancing these indexes requires only minimal computation.

Multiple spaced seeds for linear mode of DIAMOND DeepClust

The Linclust approach9 as published depends on choosing long and specific seeds, which in turn limits its sensitivity and applicability to deep clustering. To overcome this limitation, we use multiple spaced seeds20 to retain specificity while increasing the sensitivity of the seeding. Previous approaches to computing multiple spaced seeds21,22 used theoretical models to derive a near-optimal set of seed patterns. We instead propose to empirically learn these patterns from a collection of real protein alignments. To this end, we aligned a sample of 10,000 sequences from the UniRef5023 database against the whole database using DIAMOND v.2 in very-sensitive mode, retaining at most 300 alignments per query, yielding a collection of 1.47 million alignments. Our algorithm parses the collection of alignments and determines the seed shape pattern of a defined weight and maximum length that hits the largest number of alignments. This pattern is saved, all alignments that it hits are removed from the collection, and the procedure repeats until a predefined number of seed shape patterns has been designed. When running in linear mode, DIAMOND DeepClust uses cascaded clustering with the first two rounds identical to the workflow described above. The third round also uses the linear-stage algorithm, in conjunction with 30 seed shapes of weight 10 and a search depth of 2 for connected-component clustering.

Multi-node parallelization

Clustering at the protein level is an important component of addressing the challenges of learning from exponentially growing experimental data, with current databases containing several tens of billions of sequences.
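The empirical seed-shape learning described above follows a greedy set-cover strategy, sketched here on a toy scale (function names and the boolean per-column identity encoding of alignments are our own simplifications; real collections contain millions of alignments and the enumeration is correspondingly optimized):

```python
from itertools import combinations

def seed_hits(pattern, matches):
    """True if the spaced seed `pattern` (tuple of offsets carrying a
    '1') hits the alignment, where `matches` is a per-column boolean
    list of residue identities."""
    span = pattern[-1] + 1
    return any(all(matches[off + p] for p in pattern)
               for off in range(len(matches) - span + 1))

def learn_seed_shapes(alignments, weight, max_len, n_shapes):
    """Greedy set-cover selection of spaced seed shapes.

    Enumerates all shapes of the given weight (>= 2) up to the maximum
    length, with the first and last position fixed to '1'.  The shape
    hitting the most remaining alignments is kept, the alignments it
    hits are removed, and the procedure repeats.
    """
    shapes = [(0,) + mid + (length - 1,)
              for length in range(weight, max_len + 1)
              for mid in combinations(range(1, length - 1), weight - 2)]
    remaining = list(alignments)
    chosen = []
    while remaining and len(chosen) < n_shapes:
        best = max(shapes,
                   key=lambda s: sum(seed_hits(s, a) for a in remaining))
        chosen.append(best)
        remaining = [a for a in remaining if not seed_hits(best, a)]
    return chosen
```

Each round thus maximizes the marginal number of newly hit alignments, the standard greedy approximation for set cover.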
While MMseqs2/Linclust marked an important advance in providing a sequence clustering algorithm that scales linearly with dataset size, for practical purposes its scalability to large datasets is impeded by several factors. First, it can only process datasets of up to 4 billion sequences due to its use of 32-bit sequence identifiers. Second, the software requires efficient random access to the entire input database. This means that the entire database either needs to fit into memory on the host system or the storage system needs to provide such efficient access, an assumption that usually only holds when the data resides on a fast local solid-state drive but typically fails on cluster- or cloud-based storage systems. Finally, it is unable to parallelize across multiple compute nodes, running a clustering job for a single dataset on only a single compute server (we note that while the regular mmseqs cluster workflow based on all-versus-all alignment can run on multiple nodes using MPI, the Linclust part of the workflow only runs on a single node). In our implementation of the linear mode of DIAMOND DeepClust, we aimed to address these limitations, providing a tool capable of massively parallel cluster- or cloud-based computation and scaling to trillions of sequences and petabytes of data. While clustering based on all-versus-all alignment is trivially parallelizable by splitting the input database into chunks and computing all-versus-all alignments of these chunks, the Linclust algorithm cannot be parallelized by simply partitioning the input database, owing to the global dependency of comparing each sequence to the longest sequences with which it shares a common seed.

Our algorithm is implemented as a series of modular computing steps that operate on tables that can span many storage volumes on a distributed file system, ingested and processed in parallel on many compute nodes.
The first step reads the input database and produces an output table of (seed, sequence ID, sequence length) triplets according to the seeding strategy used by the clustering. Next, this table is sorted on the seed column using our own parallel radix sort implementation, optimized toward distributed external sorting of datasets ranging up to petabytes in size. We perform recursive radix clustering on the highest relevant bits of the key not yet subjected to sorting, until a bucket is small enough to be processed in memory on a single compute node. The sorted seed table is then processed seed by seed. A pair table containing pairs of putative representative and cluster member sequences is produced by assigning each sequence to the longest sequence in the same seed group in the case of uni-directional clustering, or by applying our modified logic in the case of bi-directional clustering (section ‘Linear-stage clustering’). The pair table is then radix sorted on the representative sequence and read linearly, building a table that assigns database sequences to chunks of sequences that need to be processed together, with chunk sizes limited to what can be processed in memory on a single compute node. The output table from this step is then radix sorted on the sequence ID and processed linearly together with the input database, storing each sequence in the chunks to which it has been assigned. These chunks are then processed independently one by one, computing all associated alignments and storing the results as a table of directed graph edges, where an edge from sequence A to sequence B means that A can represent B according to the clustering criterion. The last two steps can optionally be processed in multiple rounds to save storage space.
Finally, we reduce the inbound edges of each node to the edge whose source is the longest of the possible representative sequences in the case of uni-directional coverage clustering, or the one with the highest out-degree in the case of bi-directional coverage, and output the transitive closure of the resulting graph as the final clustering. All tables are partitioned into 256 radix buckets on creation to reduce the cost of radix sorting.

To benchmark our implementation, we ran the linear mode of DIAMOND DeepClust on our experimental study dataset of 22 billion sequences, clustering at 90% sequence identity and 80% uni-directional coverage, performing runs on 1, 2, 4, 8, 16 and 32 compute nodes (each equipped with dual AMD EPYC Genoa 9554 CPUs with 128 cores and 512 GB of RAM). We observe that our implementation exhibits nearly ideal scaling to multiple nodes, reducing the runtime from 15.3 h on a single node to 35 min on 32 nodes (Extended Data Fig. 1).

The multi-node feature is available since DIAMOND v.2.1.14 and can be accessed by running the DIAMOND Linclust workflow and setting the --parallel-tmpdir parameter to a working directory shared by all nodes. It supports bi-directional and uni-directional coverage clustering at any sequence identity threshold, including deep clustering using multiple spaced seeds.

Gapped alignment computation

We produced a new vectorized Smith–Waterman implementation based on a modified SWIPE24 approach that was originally developed for the first DIAMOND version11 but dropped out of its code base soon after the initial release. While SWIPE vectorized the Smith–Waterman algorithm by computing alignments of the same query against multiple targets, we generalize this approach to computing alignments of multiple independent query–target pairs. This is accomplished by using score profiles for the queries that store alignment scores along the sequence for each of the amino acid residues.
For computing one column of the dynamic programming matrix, we maintain pointers into these profiles for each SWIPE lane and apply an AVX2-optimized matrix transposition to interleave the query–target scores for each dynamic programming cell into the same register, then compute the cell updates according to the standard SWIPE logic. Contrary to the original SWIPE design, this approach permits the computation of banded and anchored alignments, which in turn enables optimizations such as the cheap determination of alignment start and end coordinates as well as X-drop termination. Compared to our implementation of the original SWIPE algorithm, we measured ~20% higher cost per cell update for this approach on the Intel Ice Lake architecture, while the number of cell updates can be greatly reduced as discussed above.

Benchmarks

Benchmark design

Our clustering benchmark is based on the NCBI NR database25 downloaded in June 2023, containing 545,815,195 sequences. As we aimed to benchmark both the precision and the sensitivity of a clustering against an annotated ground truth, we aligned all sequences against Pfam-A 33.126 and determined27 domain architectures for all sequences with 80% residue-level annotation with Pfam families, resulting in a dataset of 149,824,975 annotated sequences (Supplementary Notes) that we use for the evaluation. To evaluate a clustering run, we define the sequence-level sensitivity of a sequence as the number of sequences in its cluster that share its domain architecture, divided by the number of sequences with this architecture in the input database. We define the sequence-level precision as the number of sequences in the cluster sharing the same domain architecture, divided by the size of the cluster. For the purpose of determining precision, we consider annotations with different Pfam families belonging to the same Pfam clan as equivalent, as members of the same Pfam clan are assumed to be remotely homologous.
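These two measures can be sketched as follows (minimal Python; the mapping names are ours, and the Pfam-clan equivalence used for the precision measure is omitted for brevity):

```python
from collections import Counter, defaultdict

def clustering_metrics(cluster_of, arch_of):
    """Mean sequence-level sensitivity and precision of a clustering.

    `cluster_of` maps sequence ID -> cluster (representative) ID and
    `arch_of` maps sequence ID -> domain architecture for annotated
    sequences only; non-annotated sequences are ignored, as in the
    benchmark.  Returns the arithmetic means of both measures.
    """
    arch_total = Counter(arch_of.values())  # per-architecture input counts
    members = defaultdict(list)             # cluster -> annotated members
    for seq, rep in cluster_of.items():
        if seq in arch_of:
            members[rep].append(seq)
    sens, prec = [], []
    for seqs in members.values():
        arch_count = Counter(arch_of[s] for s in seqs)
        for s in seqs:
            same = arch_count[arch_of[s]]   # same-architecture cluster mates
            sens.append(same / arch_total[arch_of[s]])
            prec.append(same / len(seqs))
    return sum(sens) / len(sens), sum(prec) / len(prec)
```

For example, a cluster holding two sequences of architecture X and one of architecture Y (with two X and two Y sequences in the input overall) yields per-sequence sensitivities of 1, 1 and 0.5 and precisions of 2/3, 2/3 and 1/3.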
For each clustering run, we visualize the distribution of the sequence-level sensitivity and precision (Fig. 1c) and average these numbers using the arithmetic mean to obtain a compound value.

We conducted clustering runs for DIAMOND DeepClust, MMseqs29 and FLSHclust14 on a 64-core server with 2 TB of RAM using the full NR database as input. To determine sensitivity and precision, non-annotated sequences in the output are ignored. We opted for this benchmark design instead of clustering only the subset of annotated sequences because the focus of this work is computational performance and scalability. We highlight that since the tools have no way of knowing which of the input sequences are annotated, this benchmark design does not provide any kind of unfair advantage to our own tool.

As a single clustering run of MMseqs2 takes more than 3 weeks for the NR database, we did not find it practical to repeat this computation many times with different parameter settings to produce a receiver-operating-characteristic-like curve. Instead, we attempted to determine the best possible parameter settings of MMseqs2 (maximizing sensitivity) while providing at least 95% precision (Supplementary Notes) and used these settings for a single clustering run of the NR database. We chose the parameters of DIAMOND DeepClust accordingly to provide a result comparable to MMseqs2 and conducted a single run with these parameters. We ran FLSHclust using an 80% coverage threshold and a 20% sequence identity threshold as a three-step cascaded clustering run (Supplementary Notes).

For all tools, we conducted runs with identical parameters on randomly downsampled versions of the NCBI NR database, using sample sizes of 270 million, 135 million, 68 million, 27 million and 10 million sequences (Fig. 1).

In addition to the runs presented in Fig. 1, we also conducted a run of DIAMOND DeepClust using a 30% sequence identity threshold and a 90% uni-directional coverage threshold (corresponding to the settings used for the experimental study). Compared to the run using no sequence identity threshold and an 80% uni-directional coverage threshold, this change reduced the sensitivity from 52.3% to 42.5% and increased the precision from 64.9% to 72.3%. The runtime on the benchmark system was 39 h 40 min.

Limitations of DIAMOND DeepClust and future improvements

While DIAMOND DeepClust allows users to efficiently scale protein sequence clustering to billions of sequences, our method has its limitations. In particular, the maximum sensitivity achievable at the level of pairwise sequence comparisons is a limiting factor in detecting very distant homologs. The aim of this software and the accompanying experimental study is to provide a scalable software solution and clustering methodology to associate the billions of protein sequences available today through local pairwise alignment. In future work, we will focus on refining these relationships in terms of sequence alignment composition and on sensitivity beyond pairwise alignments. For future developments of DIAMOND DeepClust, we therefore expect that the integration of profile-based searches will further increase its sensitivity.

While our software aims to find sequences of homologous origin, the distinction between orthologs and paralogs is important in the context of phylogenomics. We believe that further work is needed to enable this analysis once millions of genomes become available.

Experimental study

Recently, we introduced DIAMOND v.210 to unlock the familiar functionality of a BLASTP search for tree-of-life-scale applications in the Earth BioGenome era28,29,30,31. Searching protein sequences against millions of species and yielding trillions of pairwise alignments still remains a computational challenge.
Overcoming this bottleneck requires extensive dimensionality reduction of the protein sequence space into sequence clusters, such that pairwise alignments are performed only on the set of representative sequences rather than on all sequences.

Using DIAMOND DeepClust, we performed an experimental study to showcase the power of dimensionality reduction through sequence clustering once the Earth BioGenome project has successfully sequenced and assembled all ~1.8 million species. For this purpose, we collected ~22 billion protein sequences across all kingdoms of life to match the order of magnitude and sequence diversity of the estimated ~27 billion eukaryotic protein sequences planned to be generated as part of the Earth BioGenome project (assuming ~1.8 million species times an average of ~15,000 genes per species). We note that compared to our collection of ~22 billion sequences (~19.4 billion deduplicated sequences), the Earth BioGenome set will cover a much broader sequence diversity derived from eukaryotes (while our current dataset is enriched in proteins mostly derived from microbial metagenomic samples). We projected that our software will scale to the full Earth BioGenome dataset and yield