Evaluating the utility of amino acid similarity-aware kmers to represent TCR repertoires for classification

Wait 5 sec.

by Hannah Kockelbergh, Shelley C. Evans, Liam Brierley, Peter L. Green, Andrea L. Jorgensen, Elizabeth J. Soilleux, Anna FowlerInsights gained through interpretation of models trained on the T-cell receptor (TCR) repertoire contribute to advances in understanding of immune-mediated disease. This has the potential to improve diagnostic tests and treatments, particularly for autoimmune diseases. However, TCR repertoire datasets with samples from donors of known autoimmune disease status generally include orders of magnitude fewer samples than TCR sequences. Promising TCR repertoire classification approaches consider relationships between non-identical TCR sequences. In particular, kmer methods demonstrate strong and stable performance for small datasets. We propose a TCR repertoire representation that considers the relationships between amino acids within kmers flexibly and efficiently. XGBoost and logistic regression models are trained and tested on kmer representations of TCR repertoire datasets including samples from patients with coeliac disease as well as donors with previous cytomegalovirus infection. XGBoost models outperform logistic regression, indicating that interactions may be crucial for discriminative ability. We find that a reduced alphabet based on BLOSUM62 can lead to a model with slightly stronger XGBoost testing performance than other kmer features. Though it remains unclear whether there is an amino acid encoding that can substantially improve TCR repertoire classification with reduced alphabet kmers, evidence that this representation enables faster training of XGBoost models in comparison to kmer clusters suggests that our reduced alphabet approach permits wider exploration of amino acid similarity in practice. Finally, we detail motifs which are important in each top-performing XGBoost model and compare them to TCR sequences previously associated with each immune status. We highlight the challenge of interpreting non-linear TCR repertoire classification models trained on kmers which, if overcome, could lead to biomarker discovery for autoimmune diseases.