Abstract
Complementarity-determining region 3 (CDR3) is the most hyper-variable region in B cell receptor (BCR) and T cell receptor (TCR) genes, and the most critical structure in antigen recognition and thereby in determining the fates of developing and responding lymphocytes. There are millions of different TCR Vβ chain or BCR heavy chain CDR3 sequences in human blood. Even now, when high-throughput sequencing becomes widely used, CDR3 length distributions (also called spectratypes) are still a much quicker and cheaper method of assessing repertoire diversity. However, distribution complexity and the large amount of information per sample (e.g. 32 distributions of the TCRα chain, and 24 of TCRβ) calls for the use of machine learning tools for full exploration. We have examined the ability of supervised machine learning, which uses computational models to find hidden patterns in predefined biological groups, to analyze CDR3 length distributions from various sources, and distinguish between experimental groups. We found that (a) splenic BCR CDR3 length distributions are characterized by low standard deviations and few local maxima, compared to peripheral blood distributions; (b) healthy elderly people's BCR CDR3 length distributions can be distinguished from those of the young; and (c) a machine learning model based on TCR CDR3 distribution features can detect myelodysplastic syndrome with approximately 93% accuracy. Overall, we demonstrate that using supervised machine learning methods can contribute to our understanding of lymphocyte repertoire diversity.