Immune2vec: Embedding B/T Cell Receptor Sequences in ℝ^N Using Natural Language Processing

Miri Ostrovsky-Berman; Boaz Frankel; Pazit Polak; Gur Yaari

doi:10.3389/fimmu.2021.680687

Frontiers in Immunology (Jul 2021)

Affiliations

Miri Ostrovsky-Berman: Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
Miri Ostrovsky-Berman: Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel
Boaz Frankel: Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
Boaz Frankel: Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel
Pazit Polak: Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
Pazit Polak: Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel
Gur Yaari: Bioengineering, Faculty of Engineering, Bar Ilan University, Ramat Gan, Israel
Gur Yaari: Bar Ilan Institute of Nanotechnologies and Advanced Materials, Bar Ilan University, Ramat Gan, Israel

DOI: https://doi.org/10.3389/fimmu.2021.680687
Journal volume & issue: Vol. 12

Abstract

The adaptive branch of the immune system learns pathogenic patterns and remembers them for future encounters. It does so through dynamic and diverse repertoires of T- and B-cell receptors (TCRs and BCRs, respectively). These huge immune repertoires in each individual present investigators with the challenge of extracting meaningful biological information from multi-dimensional data. The ability to embed these DNA and amino acid textual sequences in a vector space is an important step towards developing effective analysis methods. Here we present Immune2vec, an adaptation of a natural language processing (NLP)-based embedding technique for BCR repertoire sequencing data. We validate Immune2vec first on amino acid 3-gram sequences, then on longer BCR sequences, and finally on entire repertoires. Our work demonstrates Immune2vec to be a reliable low-dimensional representation that preserves relevant information of immune sequencing data, such as n-gram properties and IGHV gene family classification. Applying Immune2vec along with machine learning approaches to patient data exemplifies how distinct clinical conditions can be effectively stratified, indicating that the embedding space can be used for feature extraction and exploratory data analysis.
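
To make the 3-gram idea concrete, here is a minimal Python sketch of how an amino acid sequence might be split into overlapping 3-grams and then summarized as a single vector by averaging per-3-gram vectors. It is an illustration only, not the authors' implementation: the abstract does not specify the tokenization or aggregation details, and the function names `split_ngrams` and `embed_sequence`, the averaging step, and the toy 2-dimensional vectors are all assumptions made for this example.

```python
from typing import Dict, List


def split_ngrams(sequence: str, n: int = 3) -> List[str]:
    """Return all overlapping n-grams of an amino acid sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def embed_sequence(sequence: str,
                   table: Dict[str, List[float]],
                   n: int = 3) -> List[float]:
    """Average the vectors of the n-grams found in `table`.

    Averaging is an assumption of this sketch; n-grams missing from
    the table are skipped.
    """
    vectors = [table[g] for g in split_ngrams(sequence, n) if g in table]
    if not vectors:
        raise ValueError("no known n-grams in sequence")
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy 2-dimensional vectors, invented for illustration (not trained values).
toy_table = {"CAR": [1.0, 0.0], "ARD": [0.0, 1.0], "RDY": [1.0, 1.0]}

grams = split_ngrams("CARDY")                # overlapping 3-grams of "CARDY"
vector = embed_sequence("CARDY", toy_table)  # mean of the three toy vectors
```

In a real setting the lookup table would come from a trained embedding model rather than hand-written values, and its dimension N would be chosen as part of training.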

Published in Frontiers in Immunology

ISSN: 1664-3224 (Online)
Publisher: Frontiers Media S.A.
Country of publisher: Switzerland
LCC subjects: Medicine: Internal medicine: Specialties of internal medicine: Immunologic diseases. Allergy
Website: http://journal.frontiersin.org/journal/immunology
