ImmunoMatch learns and predicts cognate pairing of heavy and light immunoglobulin chains

Article | Open access | Published: 18 November 2025

Dongjun Guo (ORCID: orcid.org/0009-0004-1037-9269), Deborah K. Dunn-Walters, Franca Fraternali (ORCID: orcid.org/0000-0002-3143-6574), … & Joseph C. F. Ng (ORCID: orcid.org/0000-0002-3617-5211)

Nature Methods (2025)

Subjects: Adaptive immunity, Lymphocytes, Machine learning, Software

Abstract

The development of stable antibodies formed by compatible heavy (H) and light (L) chain pairs is crucial in both in vivo maturation of antibody-producing cells and ex vivo designs of therapeutic antibodies. We present ImmunoMatch, a machine-learning framework trained on paired H and L sequences from human B cells to identify molecular features underlying chain compatibility. ImmunoMatch distinguishes cognate from random H–L pairs and captures differences associated with κ and λ light chains, reflecting B cell selection mechanisms in the bone marrow. We apply ImmunoMatch to reconstruct paired antibodies from spatial VDJ sequencing data and study the refinement of H–L pairing across B cell maturation stages in health and disease. We find further that ImmunoMatch is sensitive to sequence differences at the H–L interface. These insights provide a computational lens into the broader biological principles governing antibody assembly and stability.

Main

The immune system produces an extraordinarily diverse repertoire of antibodies to combat a wide range of immune challenges from invading pathogens as well as endogenous, aberrantly expressed antigens in contexts such as cancer. Our antibody repertoire is diverse, encompassing over 10^12 distinct antibodies1,2,3,4.
Such diversity is achieved mainly by two processes: first, a random recombination of genetic fragments in the immunoglobulin gene locus occurs independently for the two types of antibody chains, the heavy (H) and light (L) chains5,6,7,8,9, which assemble to form an antibody molecule. These recombination processes already generate 2 × 10^6 different combinations of H–L chains10, the sequence diversity of which is further potentiated by the imprecise joining of these gene fragments11,12,13,14. Second, mutations accumulated in the antibody variable region substantially increase repertoire diversity15,16,17,18. Investigation into the nature, dynamics and regulation of the antibody response in vivo19,20,21, as well as the engineering of highly specific antibodies22,23,24, are both active fields of modern biomedical research.

Studies of antibodies have gradually recognized the importance of a wide variety of ‘developability’ factors beyond the binding affinity of the antibody to its antigen25,26,27,28. One such issue pertains to thermostability, which is crucial in ensuring that the H and L chain can assemble to constitute a manufacturable, functional antibody therapeutic29,30,31. To identify stable H–L pairs using high-throughput sequencing methods, researchers generated separate sequencing libraries of H and L chains, manually paired expanded H and L clonotypes, and expressed and tested these antibodies in vitro32,33; this has been applied, for instance, to identify broadly neutralizing antibodies against HIV-1 (refs. 34,35). Recent single-cell methods produce paired H–L sequences from thousands of antibody-producing cells36,37,38,39, which can circumvent problems in the traditional approaches, yet these methods are costly and still fall short of the true repertoire diversity40.
The study of how stable antibody H–L pairs are generated is also relevant in basic biology: B cells, precursors to the antibody-secreting plasma cells, express B cell receptors (BCRs) comprising mainly the membrane-bound version of antibody molecules41,42. During its development in the bone marrow, a B cell undergoes checks to ensure its BCR can be stably assembled and thus can sustain cellular signals to maintain viability43,44,45,46, while autoreactive B cells are removed via cell death47,48,49,50. This poses an inherent challenge in characterizing cases where stable, non-autoreactive H–L pairs fail to be formed in vivo. Such knowledge is important to understand the interplay between the formation of a functional BCR and the development of the antibody response, versus the possible autoimmunity arising from defects of this process51,52. Therefore, deciphering the molecular rules governing the pairing of H and L chains will benefit both basic B cell immunology and antibody discovery.

The issue of whether H–L antibody pairing specificity is predictable has been under debate for the past few decades. Seminal antibody structural analyses highlighted the interaction between the H and L chains at the antigen-binding site, consisting mainly of hypervariable regions in each chain known as the complementarity determining region (CDR) (Fig. 1a)53,54. Contact between the H and L chains is crucial to maintain antibody stability, as well as the orientation of the antigen-binding site55,56,57,58. The H–L interface is formed by contacts in the CDR as well as the antibody framework region (FWR). Molecular biology experiments have identified FWR mutations that alter H–L interaction geometries and consequently abolish antigen binding55,57. Furthermore, mouse models coexpressing engineered H and L chains often further edit the L chains by introducing mutations that enhance the viability of B cells59,60,61.
Computationally, earlier analyses focused on observations of nonrandom, over-represented H–L chain partners; however, statistical power to identify such associations was limited by the small number of paired H–L sequences and structures available for this type of analysis62,63,64. The development of single-cell sequencing methods, where single B cells are isolated, followed by extraction and sequencing of their H and L chain transcripts36,37,38, has provided new insights into this problem. For instance, statistical analyses in DeKosky et al.65 suggested that H–L pairing preference was random; however, others argued this could be due to idiosyncrasies in the experimental protocol38. More recently, a growing body of evidence suggested coherence between H and L chain choices in B cells. A study by Jaffe et al.39 using newer single-cell methods found that H chains in mature, antigen-experienced B cells tended to use more restricted L chain partners than their naive counterparts. Furthermore, comparisons between artificial intelligence (AI) models trained on antibody sequences posited that models trained on paired H–L data outperformed those trained on unpaired chains, in terms of learning biologically interpretable sequence embeddings and predicting antigen specificity66; moreover, given the H chain sequence, a stable L chain partner can be generated de novo67. Taken together, these findings suggest the existence of a set of molecular rules underlying the specificity of antibody H–L pairs. Tools for exploring these rules hold promise for understanding and designing better, more stable antibody pairs, facilitating the development of antibody therapeutics.

Fig. 1: ImmunoMatch for predicting cognate antibody chain pairing.
a, Illustration of the antibody H–L interface (Protein Data Bank 6zlr), and its diversity potentiated by genetic recombinations in vivo.
Inset on the right illustrates the H–L interface, with amino acid side chains (in sticks) to highlight positions in direct contact with the partner chains. CDR, complementarity determining region. b, Schematic of model training using curated positive and pseudo-negative examples from single-cell antibody repertoire datasets. See main text for further details. c, Proportion of the total surface area of the interface formed between the VH and the VL, contributed by individual CDR loops and the FWR, for n = 3,781 human antibody structures70. The box-and-whisker plots depict distribution medians, lower and upper quartiles, and 1.5 × interquartile range. d–f, Accuracy of models trained solely on H and L chain V and J gene usage (d), one-hot-encoded CDRH3 and CDRL3 sequences (e) and full-length VH and VL sequences (f). Each data point represents a separate fold from the tenfold validation strategy employed during training (Methods), tested on the withheld test set (n = 23,388). The box-and-whisker plots depict distribution medians, lower and upper quartiles, and 1.5 × interquartile range. The final ImmunoMatch model is indicated in f. g, ROC of the final ImmunoMatch model, calculated on the withheld test set (AUC = 0.75) and an external test set constituted by n = 3 donors unseen during training (AUC = 0.66).

Here, we present ImmunoMatch, a suite of fine-tuned AI models for the classification of cognate antibody chain pairs. Based on an antibody-specific language model (AntiBERTa2; ref. 68), ImmunoMatch was fine-tuned on full-length H and L chain variable domain sequences extracted from paired antibody repertoire data from healthy donors. This model outperformed baseline models using either CDR sequences or immunoglobulin gene usage as inputs. We found that further optimization to generate ImmunoMatch variants specific to antibody L chain types improved classification performance.
We applied ImmunoMatch to study B cell development through the lens of optimizing H–L pairing, and identified chain pairing refinement as a hallmark of B cell maturation in both health and disease. We also validated ImmunoMatch in its ability to recover partner chains in therapeutic antibodies, and highlighted its ability to pinpoint important sequence patterns driving these predictions. Our results underscore the complexity in H–L chain pairing, and highlight the importance of chain pairing in understanding B cell development and engineering stable, functional antibodies.

Results

Machine-learning models for identifying cognate antibody chain pairing

We posited that machine-learning methods could allow us to test two competing hypotheses, namely that antibody H and L chain pairing preference can be predicted from sequence information, or that such pairing is random. Framing this as a binary classification task to distinguish cognate, observed H–L pairs from randomly generated pairs, we curated paired H–L sequences, sampled from single-cell antibody repertoire datasets where their cell origin was barcoded as short nucleotide strings40 (Fig. 1b). The coexistence of an H and an L chain with the same cell barcode was taken as evidence for paired chains, constituting our positive training examples. Owing to the removal of nonviable H–L pairs by natural selection69, it was not possible to obtain negative training examples. We instead used a random shuffling strategy to generate ‘pseudo-negative’ examples, exchanging the light-chain partners between the observed, positive pairs. This procedure also guaranteed a balanced dataset with equal amounts of positive and pseudo-negative examples.
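The shuffling strategy described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (shuffle_light_chains, build_dataset), the cyclic-shift scheme used to exchange partners, and the choice to drop shuffled pairs that coincide with an observed pair are all assumptions made for illustration.

```python
import random


def shuffle_light_chains(pairs, seed=0):
    """Return pseudo-negative (heavy, light) pairs by exchanging light chains.

    Hypothetical helper: a random ordering of the cells is drawn, and each
    heavy chain receives the light chain of the next cell in that ordering
    (a cyclic shift), so no heavy chain keeps the light chain of its own cell.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two observed pairs to exchange partners")
    rng = random.Random(seed)
    order = list(range(len(pairs)))
    rng.shuffle(order)
    observed = set(pairs)
    negatives = []
    for k, i in enumerate(order):
        j = order[(k + 1) % len(order)]
        candidate = (pairs[i][0], pairs[j][1])
        # Assumption: skip shuffled pairs identical to an observed pair, which
        # can happen when different cells share the same light chain sequence.
        if candidate not in observed:
            negatives.append(candidate)
    return negatives


def build_dataset(pairs, seed=0):
    """Label observed pairs 1 and shuffled (pseudo-negative) pairs 0."""
    positives = [(h, l, 1) for h, l in pairs]
    negatives = [(h, l, 0) for h, l in shuffle_light_chains(pairs, seed)]
    return positives + negatives


if __name__ == "__main__":
    toy_pairs = [("H1", "L1"), ("H2", "L2"), ("H3", "L3"), ("H4", "L4")]
    for heavy, light, label in build_dataset(toy_pairs, seed=1):
        print(heavy, light, label)
```

When every light chain sequence is distinct, this sketch yields exactly one pseudo-negative per observed pair, so the two classes are balanced; if duplicates are skipped, the balance would need to be restored by further sampling.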
Using three separate datasets38,39,65 covering six donors, in total we curated 233,880 H–L pairs for training and testing, after balancing sample sizes over each donor to avoid bias due to the immunological background of individual donors.

We tested the contribution of different input features in combination with multiple machine-learning strategies. Analyzing antibody structural data70,71, we observed that both the antibody FWR and the CDR3 contributed substantially to the interface between the variable heavy (VH) and variable light (VL) domains (Fig. 1c). Indeed, logistic regression and XGBoost models built solely on V and J gene usage achieved accuracies of 0.50 and 0.52, respectively, indicating limited predictive capability for heavy–light pairing preferences (Fig. 1d). To improve predictive performance, we next explored using CDR3 sequences as predictive features, in view of their substantial contribution to the VH–VL interface (Fig. 1c) and their high sequence diversity. We used a one-hot encoding approach and trained a convolutional neural network (CNN) with the CDR3 fragments of the H and L chains, leveraging its ability to capture local patterns within the data. Although the CNN model demonstrated moderate performance, attempts to further improve the model by changing the optimizer, incorporating additional convolutional layers or adopting the ResNet72 architecture yielded minimal improvement (Fig. 1e). Taken together, these results highlighted the inherent limitations of only considering CDR3 or gene usage, potentially due to the lack of information on specific framework residues that participate in the VH–VL interface (Fig. 1c).

We therefore explored strategies to incorporate full-length VH and VL sequences in prediction, by capitalizing on recent advancement in protein language models73, as well as those specifically trained using antibody sequences68.
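As an illustration of the one-hot encoding of CDR3 fragments used for the CNN baseline described above, the following Python sketch encodes a CDRH3 and CDRL3 pair as a fixed-size binary matrix. The padding lengths (max_h, max_l), the zero-row treatment of padding and unknown residues, and the function names are hypothetical choices, not details taken from the study.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq, max_len):
    """Encode a sequence as max_len rows of 20 binary columns.

    Positions beyond the end of the sequence, and residues outside the
    20-letter alphabet, are left as all-zero rows (an assumption).
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def encode_cdr3_pair(cdrh3, cdrl3, max_h=30, max_l=15):
    """Stack the heavy and light CDR3 encodings into one (max_h + max_l) x 20 input."""
    return one_hot(cdrh3, max_h) + one_hot(cdrl3, max_l)


if __name__ == "__main__":
    # Toy CDR3 strings, not taken from any dataset.
    encoded = encode_cdr3_pair("CARDY", "CQQYF")
    print(len(encoded), len(encoded[0]))
```

A matrix of this kind could be passed to a one-dimensional convolutional network, which scans along the sequence axis for local residue patterns.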
The transformer architecture, as employed in language models, excels at capturing long-range amino acid interactions74,75. Here, we compared an antibody-specific language model (AntiBERTa2; ref. 68) against a generic protein language model (ESM-2 (ref. 73), 150M parameters) (Fig. 1f). We observed that fine-tuning ESM-2 substantially increased its classification performance, to a level comparable to that of AntiBERTa2 before fine-tuning. The superior performance of AntiBERTa2 suggested that antibody-specific characteristics learned during pretraining were informative for our task (Fig. 1f). Following these investigations of different machine-learning architectures (Fig. 1d–f), we selected the fine-tuned model based on AntiBERTa2 as the final model to classify antibody cognate H–L chain pairing. This model, ImmunoMatch, demonstrated an area under the receiver operating characteristic curve (AUC-ROC) of 0.75 (Fig. 1g). To further validate the generalizability of ImmunoMatch, we curated data from n = 3 donors from Jaffe et al.39 that were unseen during both pretraining and fine-tuning. On this external evaluation dataset, ImmunoMatch achieved an AUC-ROC of 0.66 (Fig. 1g). Details on other performance metrics of ImmunoMatch can be found in Table 1. We further confirmed that model performance was not impacted by immunoglobulin V and J gene usage (Extended Data Fig. 1a), suggesting that ImmunoMatch has learnt features beyond gene usage in H–L pairing prediction. We also validated that the combination of donors used here for training is informative for the model to learn pairing preferences (Supplementary Note 1, Supplementary Figs. 1–3 and Supplementary Table 2).
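For readers less familiar with the AUC-ROC metric reported above, the following Python sketch computes it from labels and pairing scores using its rank-based interpretation: the probability that a randomly chosen cognate pair scores higher than a randomly chosen pseudo-negative pair, with ties counted as one half. The function name and the toy scores are illustrative assumptions and are unrelated to the values reported in the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equivalent to the normalized Mann-Whitney U statistic. The quadratic loop is
    fine for a small illustration but slow for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: 1 = cognate pair, 0 = pseudo-negative pair.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, scores))
```

Under this interpretation, a value of 0.5 corresponds to scores that carry no information about pairing, and 1.0 to perfect separation of cognate from shuffled pairs.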
Altogether, this suggests that ImmunoMatch can distinguish cognate antibody VH–VL pairing from randomly paired chains.

Table 1: Performance metrics of ImmunoMatch and its variants

ImmunoMatch performance could be further optimized via a light-chain-specific training strategy

We investigated whether the performance of ImmunoMatch can be further improved. Human antibodies use either one of the two types of light chains, κ and λ, which are encoded by distinct DNA fragments located on separate chromosomes in the human genome5. The VL domains encoded by κ and λ genes are substantially different, as evidenced by pairwise sequence comparison of n = 3,832 antibody structures70 (Fig. 2a): on average κ VL domains share 47.8% sequence identity with λ VL domains, lower than the average identity of 69.1% within κ light chains, and 61.8% within λ. We therefore hypothesized that training separate models on VH–Vκ and VH–Vλ sequence pairs could learn pairing patterns specific to either type of light chains. Two specialized models, ImmunoMatch-κ and ImmunoMatch-λ (Fig. 2b), were fine-tuned using the same workflow as the original ImmunoMatch, with the sole exception that the models were only exposed to a light chain of one specific type during training.

Fig. 2: An L chain-specific training strategy of ImmunoMatch was consistent with the in vivo mechanism of B cell development.
a, Sequence identity comparison between VL sequences taken from human antibody structures (n = 3,832) utilizing the κ and λ light chains. The averaged sequence identity within κ sequences, within λ sequences, and between κ and λ, are given on the plot. b, Strategy to extract H–κ and H–λ pairs from publicly available datasets to train separate ImmunoMatch-κ and ImmunoMatch-λ models. c, Pairing scores of H–κ pairs calculated by ImmunoMatch and ImmunoMatch-κ. d, Pairing scores of H–λ pairs calculated by ImmunoMatch and ImmunoMatch-λ.
e, Accuracy of ImmunoMatch-κ and ImmunoMatch-λ on withheld test sets comprised solely of H–κ and H–λ paired sequences. Each data point corresponds to a separate training fold in the cross-validation framework, tested on a withheld test set (n = 21,598). The box-and-whisker plots depict distribution medians, lower and upper quartiles and 1.5 × interquartile range. f, Schematic to illustrate the formation of the BCR in vivo.

We investigated the change in pairing scores between the original ImmunoMatch model and the light-chain-specific models, for paired VH–Vκ and VH–Vλ accordingly (Fig. 2c,d). We noticed that there were some antibodies about which the original ImmunoMatch model was ambivalent (pairing scores peaked at around 0.5) for both H–κ and H–λ pairs. These peaks shifted toward pairing scores of 1 when ImmunoMatch-κ (Fig. 2c) and ImmunoMatch-λ (Fig. 2d) were used for prediction. Therefore, ImmunoMatch-κ and ImmunoMatch-λ are more specialized in capturing signals embedded in the sequences that are informative of H–L pairing. The performance of these variants of ImmunoMatch was further evaluated using separate datasets of antibodies with κ and λ light chains, which were withheld from fine-tuning. The distributions of pairing scores for their withheld test sets are shown in Extended Data Fig. 2, and a detailed summary of performance metrics of these models can be found in Table 1. ImmunoMatch-κ achieved high accuracy (0.817) on κ datasets, while ImmunoMatch-λ performed comparably well on λ datasets, with an accuracy of 0.764 (Fig. 2e and Extended Data Fig. 1b,c); both represent substantial improvements over the original ImmunoMatch (Table 1).

We also examined the generalizability of ImmunoMatch-κ and ImmunoMatch-λ, by testing them on H–L pairs of light-chain types different to their respective training sets.
When ImmunoMatch-κ was tested on λ datasets, we observed that this model could still achieve an accuracy above 0.5, albeit with performance decreased in comparison to the κ test set (Fig. 2e). Of note, the performance of ImmunoMatch-λ on κ datasets remained largely unaffected by the differing distributions of the light-chain types between training and testing data (Fig. 2e). This suggests that ImmunoMatch-λ is more generalizable in learning pairing rules for antibodies with κ and λ light chains. We further investigated the reason behind this by analyzing their confusion matrices, and found that ImmunoMatch-λ made fewer false-negative predictions on H–κ sequences, compared to applying ImmunoMatch-κ on H–λ pairs (Extended Data Fig. 3). This generalizability, caused by fewer false-negative predictions, may be linked to the process of B cell development in vivo (Fig. 2f). Initially, the heavy chains undergo gene rearrangement, followed by the formation of H–κ pairs. They are then subject to central tolerance69, which either removes B cells expressing unstable and autoreactive pairs of heavy and light chains by signaling them to undergo cell death, or instructs them to rearrange the λ gene locus to generate an H–λ pair9,76,77,78. These H–λ pairs are also subject to positive selection of B cells that express a stable H–L chain pair46,79,80 and negative selection of those which react to self-antigens (Fig. 2f)76. If the H chain is able to pair with κ, the B cell will proceed in maturation, even though this H chain can theoretically form a pair with λ as well. ImmunoMatch-κ thus has difficulty in distinguishing between true negative and false negative H–λ pairs; however, an observed H–λ pair implies that the H chain would have failed to pair with κ. Therefore, ImmunoMatch-λ is still able to capture the signals embedded in the sequence of negative H–κ examples, leading to a low number of false-negative cases.
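The confusion-matrix comparison discussed above can be sketched in Python as follows. This is an illustrative example only: the threshold of 0.5 for calling a pair cognate, the function names and the toy scores are assumptions, not values or code from the study.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Count true/false positives and negatives for pairing scores.

    Assumption: a pair is predicted as cognate when its score is at or above
    the threshold (0.5 here, chosen for illustration).
    """
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if y == 1 and predicted == 1:
            counts["tp"] += 1
        elif y == 1 and predicted == 0:
            counts["fn"] += 1  # a cognate pair scored as not paired
        elif y == 0 and predicted == 1:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


def false_negative_rate(counts):
    """Fraction of cognate pairs that the model fails to recognize."""
    cognate = counts["tp"] + counts["fn"]
    if cognate == 0:
        raise ValueError("no cognate pairs in the evaluation set")
    return counts["fn"] / cognate


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0]
    scores = [0.9, 0.6, 0.2, 0.7, 0.1]
    counts = confusion_counts(labels, scores)
    print(counts, false_negative_rate(counts))
```

Computing these counts separately for each model and light-chain type would allow a cross-type comparison of false-negative predictions like the one described in the text.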
In comparison to H–κ, H–λ pairs represent a more homogeneous dataset encompassing H–L pairing features, leading to different performance of the two models (Discussion).

ImmunoMatch facilitates pairing of heavy and light immunoglobulin chains in spatial transcriptomics data

The analysis above demonstrated that using ImmunoMatch-κ and ImmunoMatch-λ on H–κ and H–λ pairs, respectively, would be more accurate in H–L pairing prediction in comparison to the original ImmunoMatch model (Table 1). Using this approach, ImmunoMatch can be used to score and predict whether H and L chains given by the user form a cognate pair. This makes ImmunoMatch useful for comparing different single-cell BCR library preparation methods in their fidelity to generate well-resolved paired sequences (Supplementary Note 2, Supplementary Table 3 and Supplementary Fig. 4); however, there still remain data types that do not yet have single-cell resolution, and therefore necessitate H–L pairing annotation. A good example of this is spatial transcriptomics, where spatial VDJ methods have been described81 (amplifying VDJ sequences from 10x Genomics Visium slides for long-read sequencing). Spatial methods are increasingly adopted in cancer and tissue immunology studies, but here the Visium protocol precludes direct identification of paired H–L chains, which will be necessary for downstream analysis such as antibody production to investigate their specificity.

In the original spatial VDJ manuscript, the authors proposed to predict H–L pairs by examining, in the spatial data, the colocalization of heavy and light-chain clones within the same tissue section81. Here we believe that ImmunoMatch can complement this method, going beyond comparing the transcript counts of the clones, to consider the complementarity of the full-length VH and VL sequences.
We analyzed two breast tumor samples presented by Engblom et al.81, which have been profiled using a multiregion Visium-based protocol and single-cell BCR sequencing. Applying ImmunoMatch-κ and ImmunoMatch-λ on their respective light-chain types, we first validated the performance of our models to correctly identify paired H–L chains in the single-cell libraries from these tumor samples (Fig. 3a). We then applied ImmunoMatch on the spatial VDJ data, and observed that the ImmunoMatch pairing score successfully identified H–L pairs from the spatial data which overlapped with the single-cell data (Fig. 3b), especially when used in conjunction with the colocalization-based method (‘repair’) presented by Engblom et al.81. We further examined the H–L pair predictions on the tumor slides, and observed that ImmunoMatch can complement the Engblom et al. ‘repair’ method to predict H–L pairs from intratumoral B cells (Fig. 3c and Extended Data Fig. 4). Since ImmunoMatch directly considers the full-length VH and VL sequences, it addresses the pairing problem from an orthogonal perspective compared to identifying colocalized H and L chain transcripts. We therefore believe that ImmunoMatch could potentially be used to facilitate the identification of cognate H–L pairs for antibody discovery applications in tissue immunology.

Fig. 3: ImmunoMatch facilitates pairing of heavy and light immunoglobulin chains in spatial transcriptomics data.
a, Distribution of ImmunoMatch pairing scores in single-cell, H–L paired sequences (n = 1,326) from n = 2 breast tumors analyzed in Engblom et al.81. b, Comparison between ImmunoMatch pairing score versus the ‘repair’ score from Engblom et al., for n = 112 H–L pairs reconstructed from spatial VDJ sequencing generated from 10x Visium profiling of breast tumor sections analyzed in Engblom et al.
Each data point corresponds to one H–L pair, grouped by whether the same CDRH3 and CDRL3 have been observed together with the same cell barcode in the single-cell library generated on the same sample. The decision boundary of ImmunoMatch and the ‘repair’ method from Engblom et al. is indicated by the dashed line. c, Examples of H–L pairs analyzed by the ‘repair’ method of Engblom et al. and ImmunoMatch. The expression of VH and VL sequences in each H–L pair (row) across the analyzed tissue section is shown and overlaid on top of one another. Each dot corresponds to one spot in the 10x Visium slide. The tissue section hematoxylin and eosin (H&E)-stained image is shown for reference.

Refinement of immunoglobulin chain pairing is a hallmark of B cell maturation

We next asked whether chain pairing likelihood would vary across stages of B cell development. The classical theory of B cell maturation posits that upon activation, naive B cells enter the germinal center (GC) to edit and optimize their BCRs to specifically bind their cognate antigens, with the successful binders subsequently exiting the GC and differentiating into memory B cells82,83,84. We collected paired H–L sequences from naive, GC and memory B cells from published studies39,85, and scored the H–κ and H–λ sequences with ImmunoMatch-κ and ImmunoMatch-λ, respectively. Comparing the pairing scores from these ImmunoMatch models between the B cell subsets, we observed that memory B cells have substantially higher pairing scores than their naive counterparts, with the distribution of GC B cells positioned between the two (Fig. 4a). We further defined clonally related naive and memory B cells based on CDRH3 sequence similarity, and identified examples of clonal expansions where pairing score increased as the clonotype diversified from the germline origin (Fig. 4b).
We propose this continuum of chain pairing likelihood to be a feature of B cell maturation: as BCRs undergo class-switch recombination and somatic hypermutation, H–L chain pairing is refined together with these processes, both of which are integral to B cell maturation17. To test this hypothesis, we utilized a single-cell RNA sequencing (scRNA-seq) dataset of B cells sampled from the tonsil85, and compared the H–L pairing score inferred using ImmunoMatch-κ and ImmunoMatch-λ for sequences of different heavy chain isotypes and mutational levels. Of note, pairing score increases as the H chain switches away from IgM and IgD isotypes to IgG and IgA (Fig. 4c). Moreover, pairing scores display an inverse relationship with the H chain germline identity (Fig. 4c), independent of B cell subtypes (Extended Data Fig. 5). ImmunoMatch pairing scores therefore embed information about B cell maturation, and highlight the increase in H–L pairing specificity as BCRs undergo maturation processes.

Fig. 4: ImmunoMatch revealed a continuum of BCR chain pairing likelihood across B cell development stages.
The box-and-whisker plots depict distribution medians, lower and upper quartiles and 1.5 × interquartile range. a, Pairing scores of BCRs from naive B cells (n = 3 donors), GC B cells (n = 6 donors) and memory B cells (n = 3 donors). Each data point represents the average score per donor, calculated separately for H–κ (panel ‘IGK’, scored using the ImmunoMatch-κ model) and H–λ (‘IGL’, scored using ImmunoMatch-λ) pairs. b, Example clonotype tree from Jaffe et al.39 data with ImmunoMatch pairing scores mapped to individual observations as colored dots in the tree leaves. The germline configuration of VH and VL sequences are illustrated. c, ImmunoMatch pairing scores for paired VH and VL sequences from tonsil B cells (n = 10,264) in the King et al.
dataset85, grouped by their germline identity as a proxy of somatic hypermutation status (top), or by the heavy chain isotype to illustrate class-switch recombination status (bottom). As a control, analogous annotation with ImmunoMatch was performed for each cell, retaining the same observed VH sequence, but each paired with a randomly reshuffled L chain partner. d, Boxplots (right) depicting pairing scores of BCRs from leukemia and lymphoma samples (n = 123) curated from the literature (Methods). Data were organized by cancer subtypes and ordered by their corresponding B cell development stages when oncogenesis was thought to occur, according to published reviews86,87 (schematic on the left). ALL, acute lymphoblastic leukemia; CLL, chronic lymphocytic leukemia; DLBCL, diffuse large B cell lymphoma; GCB, germinal center B cell; ABC, activated B cell.

We further investigated whether a similar trend in the pairing scores can be found in BCRs isolated from diseases arising from B cell development. We collected n = 123 paired sequences from leukemia and lymphoma samples collated from the literature and publicly available databases of cancer cell lines, and mapped these samples to the different B cell developmental stages from which these cancers were thought to initiate86,87. Applying ImmunoMatch-κ and ImmunoMatch-λ, we observed a continuum of H–L pairing scores for these sequences (Fig. 4d). Specifically, leukemia originating from pre-B cells in the bone marrow displayed a notably low pairing likelihood, reflecting their immature origin88,89. In contrast, in agreement with the need for a functional BCR for B cell activation and antigen interactions in these cancers90,91, lymphoma samples typically displayed high pairing scores.
These analyses suggest that ImmunoMatch models can be used to annotate immunoglobulin chain pairing, and that the refinement of chain pairing preference is a hallmark of B cell maturation in both health and disease.

ImmunoMatch is sensitive to sequence differences in therapeutic antibodies

We finally investigated whether our ImmunoMatch models can be applied in an antibody discovery context. Specifically, we simulated an antibody triaging application, where ImmunoMatch-κ and ImmunoMatch-λ were used to score a random library of germline recombinations of H chain V and J gene segments against the cognate L chain partner (Fig. 5a). We performed this experiment on n = 625 therapeutic antibodies, for which we generated random VH domain sequences, while preserving the observed CDRH3 fragment, for scoring against their cognate VL domains. We first verified that the ImmunoMatch pairing score was an effective discriminant (P

Bannish, G., Fuentes-Pananá, E. M., Cambier, J. C., Pear, W. S. & Monroe, J. G. Ligand-independent signaling functions for the B lymphocyte antigen receptor and their role in positive selection during B lymphopoiesis. J. Exp. Med. 194, 1583–1596 (2001).
Louzoun, Y., Friedman, T., Luning Prak, E., Litwin, S. & Weigert, M. Analysis of B cell receptor production and rearrangement: part I. Light chain rearrangement. Semin. Immunol. 14, 169–190 (2002).
Engblom, C. et al. Spatial transcriptomics of B cell and T cell receptors reveals lymphocyte clonal dynamics. Science 382, eadf8486 (2023).
De Silva, N. S. & Klein, U. Dynamics of B cells in germinal centres. Nat. Rev. Immunol. 15, 137–148 (2015).
Young, C. & Brink, R. The unique biology of germinal center B cells. Immunity 54, 1652–1664 (2021).
Victora, G. D. & Nussenzweig, M. C.
Germinal centers. Ann. Rev. Immunol. 40, 413–442 (2022).
King, H. W. et al. Single-cell analysis of human B cell maturation predicts how antibody class switching shapes selection dynamics. Sci. Immunol. 6, eabe6291 (2021).
Küppers, R. Mechanisms of B-cell lymphoma pathogenesis. Nat. Rev. Cancer 5, 251–262 (2005).
Shaffer, A. L. R., Young, R. M. & Staudt, L. M. Pathogenesis of human B cell lymphomas. Ann. Rev. Immunol. 30, 565–610 (2012).
le Viseur, C. et al. In childhood acute lymphoblastic leukemia, blasts at different stages of immunophenotypic maturation have stem cell properties. Cancer Cell 14, 47–58 (2008).
Moorman, A. V. The clinical relevance of chromosomal and genomic abnormalities in B-cell precursor acute lymphoblastic leukaemia. Blood Rev. 26, 123–135 (2012).
Davis, R. E. et al. Chronic active B-cell-receptor signalling in diffuse large B-cell lymphoma. Nature 463, 88–92 (2010).
Young, R. M. et al. Survival of human lymphoma cells requires B-cell receptor engagement by self-antigens. Proc. Natl Acad. Sci. USA 112, 13447–13454 (2015).
Attaf, N. et al. FB5P-seq: FACS-based 5-prime end single-cell RNA-seq for integrative analysis of transcriptome and antigen receptor repertoire in B and T cells. Front. Immunol. 11, 216 (2020).
Subas Satish, H. P. et al. NAb-seq: an accurate, rapid, and cost-effective method for antibody long-read sequencing in hybridoma cell lines and single B cells. mAbs 14, 2106621 (2022).
Setliff, I. et al. High-throughput mapping of B cell receptor sequences to antigen specificity. Cell 179, 1636–1646.e15 (2019).
Raybould, M. I. J., Turnbull, O. M., Suter, A., Guloglu, B. & Deane, C. M. Contextualising the developability risk of antibodies with lambda light chains using enhanced therapeutic antibody profiling. Commun. Biol. 7, 62 (2024).
Tennenhouse, A. et al. Computational optimization of antibody humanness and stability by systematic energy-based ranking. Nat. Biomed. Eng. 8, 30–44 (2024).
Emami, P., Perreault, A., Law, J., Biagioni, D. & John, P. S. Plug & play directed evolution of proteins with gradient-based discrete MCMC. Machine Learn. Sci. Technol. 4, 025014 (2023).
Makowski, E. K. et al. Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space. Nat. Commun. 13, 3788 (2022).
Tučs, A. et al. Extensive antibody search with whole spectrum black-box optimization. Sci. Rep. 14, 552 (2024).
Townsend, C. L. et al. Significant differences in physicochemical properties of human immunoglobulin kappa and lambda CDR3 regions. Front. Immunol. 7, 388 (2016).
Chailyan, A., Marcatili, P., Cirillo, D. & Tramontano, A. Structural repertoire of immunoglobulin λ light chains. Proteins 79, 1513–1524 (2011).
Ralph, D. K. & Matsen, F. A. Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation. PLoS Comput. Biol. 12, e1004409 (2016).
Marcou, Q., Mora, T. & Walczak, A. M. High-throughput immune repertoire analysis with IGoR. Nat. Commun. 9, 561 (2018).
Weber, C. R. et al. immuneSIM: tunable multi-feature simulation of B- and T-cell receptor repertoires for immunoinformatics benchmarking. Bioinformatics 36, 3594–3596 (2020).
Le Quý, K. et al. Benchmarking and integrating human B-cell receptor genomic and antibody proteomic profiling. NPJ Syst. Biol. Appl. 10, 73 (2024).
Bonissone, S. R. et al. Serum proteomics expands on high-affinity antibodies in immunized rabbits than deep B-cell repertoire sequencing alone. Preprint at bioRxiv https://doi.org/10.1101/833871 (2020).
Cheung, W. C. et al. A proteomics approach for the identification and cloning of monoclonal antibodies from serum. Nat. Biotechnol. 30, 447–452 (2012).
Snapkov, I. et al. Progress and challenges in mass spectrometry-based analysis of antibody repertoires. Trends Biotechnol. 40, 463–481 (2022).
Chernigovskaya, M. et al. Systematic benchmarking of mass spectrometry-based antibody sequencing reveals methodological biases. Preprint at bioRxiv https://doi.org/10.1101/2024.11.11.622451 (2024).
Lavinder, J. J., Horton, A. P., Georgiou, G. & Ippolito, G. C. Next-generation sequencing and protein mass spectrometry for the comprehensive analysis of human cellular and serum antibody repertoires. Curr. Opin. Chem. Biol. 24, 112–120 (2015).
Edwards, B. M. et al. The remarkable flexibility of the human antibody repertoire; isolation of over one thousand different antibodies to a single protein, BLyS. J. Mol. Biol. 334, 103–118 (2003).
Yang, X. et al. Large-scale analysis of 2,152 Ig-seq datasets reveals key features of B cell biology and the antibody repertoire. Cell Rep. 35, 109110 (2021).
Olsen, T. H., Boyles, F. & Deane, C. M. Observed antibody space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. 31, 141–146 (2022).
Ye, J., Ma, N., Madden, T. L. & Ostell, J. M. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 41, W34–W40 (2013).
Yaari, G. & Kleinstein, S. H. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 7, 121 (2015).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
Lindeman, I. et al. BraCeR: B-cell-receptor reconstruction and clonality inference from single-cell RNA-seq. Nat. Methods 15, 563–565 (2018).
Gupta, N. T. et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31, 3356–3358 (2015).
Felsenstein, J. PHYLIP - phylogeny inference package (version 3.2). Cladistics 5, 164–166 (1989).
Stewart, A. et al. Pandemic, epidemic, endemic: B cell repertoire analysis reveals unique anti-viral responses to SARS-CoV-2, ebola and respiratory syncytial virus. Front. Immunol. 13, 807104 (2022).
McCann, K. J., Johnson, P. W. M., Stevenson, F. K. & Ottensmeier, C. H. Universal N-glycosylation sites introduced into the B-cell receptor of follicular lymphoma by somatic mutation: a second tumorigenic event? Leukemia 20, 530–534 (2006).
Fais, F. et al. Immunoglobulin V region gene use and structure suggest antigen selection in AIDS-related primary effusion lymphomas. Leukemia 13, 1093–1099 (1999).
Terness, P. et al. Idiotypic vaccine for treatment of human B-cell lymphoma. Construction of IgG variable regions from single malignant B cells. Hum. Immunol. 56, 17–27 (1997).
Ebeling, S. B., Schutte, M. E. & Logtenberg, T. Molecular analysis of VH and VL regions expressed in IgG-bearing chronic lymphocytic leukemia (CLL): further evidence that CLL is a heterogeneous group of tumors. Blood 82, 1626–1631 (1993).
Fais, F. et al. CD1d is expressed on B-chronic lymphocytic leukemia cells and mediates α-galactosylceramide presentation to natural killer T lymphocytes. Int. J. Cancer 109, 402–411 (2004).
Messmer, B. T. et al. Multiple distinct sets of stereotyped antigen receptors indicate a role for antigen in promoting chronic lymphocytic leukemia. J. Exp. Med. 200, 519–525 (2004).
Colombo, M. et al. Intraclonal cell expansion and selection driven by B cell receptor in chronic lymphocytic leukemia. Mol. Med. 17, 834–839 (2011).
Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
Tan, K.-T. et al. Profiling the B/T cell receptor repertoire of lymphocyte derived cell lines. BMC Cancer 18, 940 (2018).
Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).
Dunbar, J. & Deane, C. M. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics 32, 298–300 (2016).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Raybould, M. I. J.
et al. Thera-SAbDab: the Therapeutic Structural Antibody Database. Nucleic Acids Res. 48, D383–D388 (2020).

Acknowledgements

We thank all members of the Fraternali group for comments and suggestions. This work was supported by the Biotechnology and Biological Sciences Research Council (https://bbsrc.ukri.org/, BB/T002212/1 to F.F., D.K.D.-W. and J.C.F.N.; BB/B000745/1 to F.F. and J.C.F.N.). D.G. was supported by a PhD scholarship from the China Scholarship Council (no. 202008440414). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the article.

Author information

Authors and Affiliations

Research Department of Structural and Molecular Biology, Division of Biosciences, University College London, London, UK
Dongjun Guo, Franca Fraternali & Joseph C. F. Ng

Institute of Structural and Molecular Biology, University College London, London, UK
Dongjun Guo, Franca Fraternali & Joseph C. F. Ng

Randall Centre for Cell & Molecular Biophysics, King’s College London, London, UK
Dongjun Guo

School of Biosciences and Medicine, University of Surrey, Guildford, UK
Deborah K. Dunn-Walters

Department of Biological Sciences, Birkbeck, University of London, London, UK
Franca Fraternali & Joseph C. F. Ng

Contributions

D.G. curated training, testing and validation datasets, implemented ImmunoMatch and performed model comparisons, supervised by F.F., J.C.F.N. and D.K.D.-W. J.C.F.N. and D.G. curated ImmunoMatch use cases and carried out computational analyses. F.F. conceived the project and acquired funding. J.C.F.N. and D.G. wrote the manuscript and the Methods section with critical input from F.F. All authors read, commented on and approved the final manuscript.

Corresponding authors

Correspondence to Franca Fraternali or Joseph C. F. Ng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks Kenneth Hoehn, who co-reviewed with Hunter Melton, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Madhura Mukhopadhyay, in collaboration with the Nature Methods team. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Prediction accuracy on withheld sequences, grouped by the frequency of V and J gene usage in the H and L chains.
For each chain we grouped sequences by their gene usage into five bins. Results are shown for (a) ImmunoMatch (n = 23,388), (b) ImmunoMatch-κ (n = 21,598) and (c) ImmunoMatch-λ (n = 21,598) models separately.

Extended Data Fig. 2 Distribution of pairing scores from withheld test sequences for the ImmunoMatch models.
Shown here are data for the ImmunoMatch (top, n = 23,388), ImmunoMatch-κ (bottom left, n = 21,598) and ImmunoMatch-λ (bottom right, n = 21,598) models. The positive (orange) and pseudo-negative (blue) pairs are visualized with separate density curves.

Extended Data Fig. 3 Confusion matrices of ImmunoMatch-κ and ImmunoMatch-λ on H-κ and H-λ sequences.
Data for the ImmunoMatch-κ and ImmunoMatch-λ models are organized by rows, tested on H-κ and H-λ pairs (columns). False negative (FN) predictions when models are tested on datasets of different L types are highlighted.

Extended Data Fig. 4 Additional examples of H–L pairs analyzed by Engblom et al.’s ‘repair’ method and ImmunoMatch.
The expression of VH and VL sequences in each H–L pair (row) across the analyzed tissue sections is shown, with the two overlaid on top of one another. Each dot corresponds to one spot in the 10X Visium slide. The tissue section hematoxylin and eosin (H&E) stained images are shown for reference.

Extended Data Fig. 5 Relationship between somatic hypermutation level and ImmunoMatch pairing score in the King et al. tonsil dataset.
In each panel, somatic hypermutation level (horizontal axis, represented as sequence identity to germline in the heavy chain) is plotted against ImmunoMatch pairing score (vertical axis). Data from the King et al. tonsil single-cell BCR repertoire dataset (n = 10,264 cells) are shown. This relationship is statistically tested (p