Analysis of Adaptive Immune Receptor Repertoire Sequencing data with LymphoSeq2

library(LymphoSeq2)
#> Loading required package: data.table
library(RColorBrewer)
library(grDevices)
library(wordcloud2)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::between()     masks data.table::between()
#> ✖ dplyr::filter()      masks stats::filter()
#> ✖ dplyr::first()       masks data.table::first()
#> ✖ lubridate::hour()    masks data.table::hour()
#> ✖ lubridate::isoweek() masks data.table::isoweek()
#> ✖ dplyr::lag()         masks stats::lag()
#> ✖ dplyr::last()        masks data.table::last()
#> ✖ lubridate::mday()    masks data.table::mday()
#> ✖ lubridate::minute()  masks data.table::minute()
#> ✖ lubridate::month()   masks data.table::month()
#> ✖ lubridate::quarter() masks data.table::quarter()
#> ✖ lubridate::second()  masks data.table::second()
#> ✖ purrr::transpose()   masks data.table::transpose()
#> ✖ lubridate::wday()    masks data.table::wday()
#> ✖ lubridate::week()    masks data.table::week()
#> ✖ lubridate::yday()    masks data.table::yday()
#> ✖ lubridate::year()    masks data.table::year()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(vroom)
#> 
#> Attaching package: 'vroom'
#> 
#> The following objects are masked from 'package:readr':
#> 
#>     as.col_spec, col_character, col_date, col_datetime, col_double,
#>     col_factor, col_guess, col_integer, col_logical, col_number,
#>     col_skip, col_time, cols, cols_condense, cols_only, date_names,
#>     date_names_lang, date_names_langs, default_locale, fwf_cols,
#>     fwf_empty, fwf_positions, fwf_widths, locale, output_column,
#>     problems, spec

Adaptive Immune Receptor Repertoire Sequencing (AIRR-seq) provides a unique opportunity to interrogate the adaptive immune repertoire under various clinical conditions. The utility offered by this technology has quickly garnered interest from a community of clinicians and researchers investigating the immunological landscapes of a large spectrum of health and disease states. LymphoSeq2 is a toolkit that allows users to import, manipulate, and visualize AIRR-seq data from various AIRR-seq assays such as Adaptive ImmunoSEQ and BGI-IRSeq, with support for 10X VDJ sequencing coming soon. The platform also supports importing AIRR-seq data processed using the MiXCR pipeline. This vignette highlights some of the key features of LymphoSeq2.

Importing data

The function readImmunoSeq imports AIRR-seq receptor files from the Adaptive ImmunoSEQ assay as well as the BGI-IRSeq assay. The input can be (.tsv) files processed using one of the following three platforms: the Adaptive Biotechnologies ImmunoSEQ analyzer, the BGI IR-SEQ iMonitor platform, and the MiXCR pipeline for AIRR-seq data analysis. The function identifies the file type from the headers in the (.tsv) file and transforms the data accordingly into a format that is compatible with the AIRR-Community guidelines (https://github.com/airr-community/airr-standards).

To explore the features of LymphoSeq2, this package includes two example data sets. The first is a data set of T cell receptor beta (TCRB) sequencing from 10 blood samples acquired serially from a single patient who underwent a bone marrow transplant (Kanakry, C.G., et al. JCI Insight 2016;1(5):pii: e86252). The second is a data set of B cell receptor immunoglobulin heavy (IGH) chain sequencing from Burkitt lymphoma tumor biopsies acquired from 10 different individuals (Lombardo, K.A., et al. Blood Advances 2017 1:535-544). To improve performance, both data sets contain only the top 1,000 most frequent sequences. The complete data sets are publicly available through Adaptive's immuneACCESS portal. As shown in the example below, you can specify the path to the example data sets using the command:

system.file("extdata", "TCRB_sequencing", package = "LymphoSeq2") # For the TCRB files
#> [1] "/home/runner/work/_temp/Library/LymphoSeq2/extdata/TCRB_sequencing"
system.file("extdata", "IGH_sequencing", package = "LymphoSeq2") # For the IGH files.
#> [1] "/home/runner/work/_temp/Library/LymphoSeq2/extdata/IGH_sequencing"

readImmunoSeq can take as input a single file name, a list of files, or the path to a directory containing AIRR-seq data. The columns are renamed to follow AIRR-community guidelines based on the input file type. The function returns a tibble with individual file names set as the repertoire_id. The CDR3 nucleotide and amino acid sequences are denoted by the junction and junction_aa fields, respectively. The counts of the CDR3 sequences observed and their frequencies in each individual repertoire are denoted by the duplicate_count and duplicate_frequency fields, respectively.

study_files <- system.file("extdata", "TCRB_sequencing", package = "LymphoSeq2")
study_table <- LymphoSeq2::readImmunoSeq(study_files, threads = 1) %>%
  topSeqs(top = 100)

Looking at the study_table we see a tibble with 145 columns and 1000 rows

study_table
#> # A tibble: 1,000 × 145
#>    sequence_id   sequence sequence_aa rev_comp productive vj_in_frame stop_codon
#>    <chr>         <chr>    <chr>       <lgl>    <lgl>      <lgl>       <lgl>     
#>  1 TRB_CD4_949_4 GAGTCAG… CASSESAGST… FALSE    FALSE      NA          FALSE     
#>  2 TRB_CD4_949_5 GCCCTCA… NA          FALSE    TRUE       NA          TRUE      
#>  3 TRB_CD4_949_6 ATTCCCT… NA          FALSE    TRUE       NA          TRUE      
#>  4 TRB_CD4_949_7 GTGACAT… CASSPRQGES… FALSE    FALSE      NA          FALSE     
#>  5 TRB_CD4_949_8 ACCTTGG… CASSLDGQGQ… FALSE    FALSE      NA          FALSE     
#>  6 TRB_CD4_949_9 GTGACCA… CSAKTSGITY… FALSE    FALSE      NA          FALSE     
#>  7 TRB_CD4_949_… ACCCTGC… CASSQD*ASS… FALSE    TRUE       NA          TRUE      
#>  8 TRB_CD4_949_… CTCCTTC… CAWSDFQGPR… FALSE    FALSE      NA          FALSE     
#>  9 TRB_CD4_949_… CTGACGA… CASSPDKWGY… FALSE    FALSE      NA          FALSE     
#> 10 TRB_CD4_949_… TCAGAAC… CASSFRTGPT… FALSE    FALSE      NA          FALSE     
#> # ℹ 990 more rows
#> # ℹ 138 more variables: complete_vdj <lgl>, locus <chr>, v_call <chr>,
#> #   d_call <chr>, d2_call <chr>, j_call <chr>, c_call <chr>,
#> #   sequence_alignment <chr>, sequence_alignment_aa <chr>,
#> #   germline_alignment <chr>, germline_alignment_aa <chr>, junction <chr>,
#> #   junction_aa <chr>, np1 <chr>, np1_aa <chr>, np2 <chr>, np2_aa <chr>,
#> #   np3 <chr>, np3_aa <chr>, cdr1 <chr>, cdr1_aa <chr>, cdr2 <chr>, …

Since the study table is a tibble, we can use tidyverse syntax to extract a list of sample names

study_table %>%
  dplyr::pull(repertoire_id) %>%
  unique()
#>  [1] "TRB_CD4_949"       "TRB_CD8_949"       "TRB_CD8_CMV_369"  
#>  [4] "TRB_Unsorted_0"    "TRB_Unsorted_1320" "TRB_Unsorted_1496"
#>  [7] "TRB_Unsorted_32"   "TRB_Unsorted_369"  "TRB_Unsorted_83"  
#> [10] "TRB_Unsorted_949"

Subsetting Data

The tibble structure of the TCR data allows for easy subsetting of the data. To select the TCR sequences from any given sample in the dataset, the filter function from the dplyr package can be used.

TRB_Unsorted_0 <- study_table %>%
  dplyr::filter(repertoire_id == "TRB_Unsorted_0")
TRB_Unsorted_0
#> # A tibble: 100 × 145
#>    sequence_id   sequence sequence_aa rev_comp productive vj_in_frame stop_codon
#>    <chr>         <chr>    <chr>       <lgl>    <lgl>      <lgl>       <lgl>     
#>  1 TRB_Unsorted… TCAATTC… NA          FALSE    TRUE       NA          TRUE      
#>  2 TRB_Unsorted… CTGATTC… CASSPVSNEQ… FALSE    FALSE      NA          FALSE     
#>  3 TRB_Unsorted… ATCAATT… CASSQEVPPY… FALSE    FALSE      NA          FALSE     
#>  4 TRB_Unsorted… CACACCC… CASSQEASGR… FALSE    FALSE      NA          FALSE     
#>  5 TRB_Unsorted… TGCCATC… NA          FALSE    TRUE       NA          TRUE      
#>  6 TRB_Unsorted… GCCAGCA… CASSLEHTGA… FALSE    FALSE      NA          FALSE     
#>  7 TRB_Unsorted… CCCCTGA… CASSPGDEQYF FALSE    FALSE      NA          FALSE     
#>  8 TRB_Unsorted… AGTGCCC… CSARSPSTGT… FALSE    FALSE      NA          FALSE     
#>  9 TRB_Unsorted… GGAGCTT… NA          FALSE    TRUE       NA          TRUE      
#> 10 TRB_Unsorted… CTGTAGT… CASSEKREGH… FALSE    FALSE      NA          FALSE     
#> # ℹ 90 more rows
#> # ℹ 138 more variables: complete_vdj <lgl>, locus <chr>, v_call <chr>,
#> #   d_call <chr>, d2_call <chr>, j_call <chr>, c_call <chr>,
#> #   sequence_alignment <chr>, sequence_alignment_aa <chr>,
#> #   germline_alignment <chr>, germline_alignment_aa <chr>, junction <chr>,
#> #   junction_aa <chr>, np1 <chr>, np1_aa <chr>, np2 <chr>, np2_aa <chr>,
#> #   np3 <chr>, np3_aa <chr>, cdr1 <chr>, cdr1_aa <chr>, cdr2 <chr>, …

The str_detect function from the stringr package can be used in conjunction with filter to find samples whose names match a pattern

CMV <- study_table %>%
  dplyr::filter(str_detect(repertoire_id, "CMV"))
CMV
#> # A tibble: 100 × 145
#>    sequence_id   sequence sequence_aa rev_comp productive vj_in_frame stop_codon
#>    <chr>         <chr>    <chr>       <lgl>    <lgl>      <lgl>       <lgl>     
#>  1 TRB_CD8_CMV_… CAGCGCA… CASSPPTGER… FALSE    FALSE      NA          FALSE     
#>  2 TRB_CD8_CMV_… CAGCCCT… CASSPAGAYY… FALSE    FALSE      NA          FALSE     
#>  3 TRB_CD8_CMV_… CAGCCTG… CASSQDWERL… FALSE    FALSE      NA          FALSE     
#>  4 TRB_CD8_CMV_… TCGGCCC… CASSQDLMTV… FALSE    FALSE      NA          FALSE     
#>  5 TRB_CD8_CMV_… ATCCTGG… CASSLQGREK… FALSE    FALSE      NA          FALSE     
#>  6 TRB_CD8_CMV_… GAGGATC… NA          FALSE    TRUE       NA          TRUE      
#>  7 TRB_CD8_CMV_… ACCCTGC… CASSQDLGQA… FALSE    FALSE      NA          FALSE     
#>  8 TRB_CD8_CMV_… GAGTCCG… CASSLAGDSQ… FALSE    FALSE      NA          FALSE     
#>  9 TRB_CD8_CMV_… CTCCTCA… CAISDTGELFF FALSE    FALSE      NA          FALSE     
#> 10 TRB_CD8_CMV_… TCCAGCC… NA          FALSE    TRUE       NA          TRUE      
#> # ℹ 90 more rows
#> # ℹ 138 more variables: complete_vdj <lgl>, locus <chr>, v_call <chr>,
#> #   d_call <chr>, d2_call <chr>, j_call <chr>, c_call <chr>,
#> #   sequence_alignment <chr>, sequence_alignment_aa <chr>,
#> #   germline_alignment <chr>, germline_alignment_aa <chr>, junction <chr>,
#> #   junction_aa <chr>, np1 <chr>, np1_aa <chr>, np2 <chr>, np2_aa <chr>,
#> #   np3 <chr>, np3_aa <chr>, cdr1 <chr>, cdr1_aa <chr>, cdr2 <chr>, …

A metadata file for the TCR sequencing samples can easily be combined with the study_table by reading in the metadata file as a tibble and using the dplyr::left_join function to merge the two tables. In the example below, a metadata file is imported for the example TCRB data set which contains information on the number of days post bone marrow transplant the sample was collected and the cellular phenotype the blood sample was sorted for prior to sequencing.

TCRB_metadata <- readr::read_csv(system.file("extdata", "TCRB_metadata.csv", package = "LymphoSeq2"), show_col_types = FALSE)
TCRB_metadata
#> # A tibble: 10 × 3
#>    samples             day phenotype
#>    <chr>             <dbl> <chr>    
#>  1 TRB_Unsorted_0        0 Unsorted 
#>  2 TRB_Unsorted_32      32 Unsorted 
#>  3 TRB_Unsorted_83      82 Unsorted 
#>  4 TRB_CD8_CMV_369     369 CD8+CMV+ 
#>  5 TRB_Unsorted_369    369 Unsorted 
#>  6 TRB_CD4_949         949 CD4+     
#>  7 TRB_CD8_949         949 CD8+     
#>  8 TRB_Unsorted_949    949 Unsorted 
#>  9 TRB_Unsorted_1320  1320 Unsorted 
#> 10 TRB_Unsorted_1496  1496 Unsorted

study_table <- dplyr::left_join(study_table, TCRB_metadata, by = c("repertoire_id" = "samples"))
study_table
#> # A tibble: 1,000 × 147
#>    sequence_id   sequence sequence_aa rev_comp productive vj_in_frame stop_codon
#>    <chr>         <chr>    <chr>       <lgl>    <lgl>      <lgl>       <lgl>     
#>  1 TRB_CD4_949_4 GAGTCAG… CASSESAGST… FALSE    FALSE      NA          FALSE     
#>  2 TRB_CD4_949_5 GCCCTCA… NA          FALSE    TRUE       NA          TRUE      
#>  3 TRB_CD4_949_6 ATTCCCT… NA          FALSE    TRUE       NA          TRUE      
#>  4 TRB_CD4_949_7 GTGACAT… CASSPRQGES… FALSE    FALSE      NA          FALSE     
#>  5 TRB_CD4_949_8 ACCTTGG… CASSLDGQGQ… FALSE    FALSE      NA          FALSE     
#>  6 TRB_CD4_949_9 GTGACCA… CSAKTSGITY… FALSE    FALSE      NA          FALSE     
#>  7 TRB_CD4_949_… ACCCTGC… CASSQD*ASS… FALSE    TRUE       NA          TRUE      
#>  8 TRB_CD4_949_… CTCCTTC… CAWSDFQGPR… FALSE    FALSE      NA          FALSE     
#>  9 TRB_CD4_949_… CTGACGA… CASSPDKWGY… FALSE    FALSE      NA          FALSE     
#> 10 TRB_CD4_949_… TCAGAAC… CASSFRTGPT… FALSE    FALSE      NA          FALSE     
#> # ℹ 990 more rows
#> # ℹ 140 more variables: complete_vdj <lgl>, locus <chr>, v_call <chr>,
#> #   d_call <chr>, d2_call <chr>, j_call <chr>, c_call <chr>,
#> #   sequence_alignment <chr>, sequence_alignment_aa <chr>,
#> #   germline_alignment <chr>, germline_alignment_aa <chr>, junction <chr>,
#> #   junction_aa <chr>, np1 <chr>, np1_aa <chr>, np2 <chr>, np2_aa <chr>,
#> #   np3 <chr>, np3_aa <chr>, cdr1 <chr>, cdr1_aa <chr>, cdr2 <chr>, …
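
Because a left join keeps every row of the study table and fills unmatched metadata columns with NA, it can be worth checking that each repertoire_id has a matching metadata row before merging. The sketch below illustrates that check with base R on small made-up tables; toy_study and toy_metadata are hypothetical stand-ins, not objects provided by LymphoSeq2.

```r
# Toy stand-ins for study_table and TCRB_metadata (hypothetical values).
toy_study <- data.frame(
  repertoire_id = c("TRB_Unsorted_0", "TRB_Unsorted_0", "TRB_Unsorted_32", "TRB_CD4_949"),
  duplicate_count = c(10, 5, 7, 3)
)
toy_metadata <- data.frame(
  samples = c("TRB_Unsorted_0", "TRB_Unsorted_32"),
  day = c(0, 32)
)

# Repertoires present in the study table but absent from the metadata.
missing_metadata <- setdiff(unique(toy_study$repertoire_id), toy_metadata$samples)
missing_metadata

# Base R analogue of a left join: all.x = TRUE keeps unmatched study rows,
# which then carry NA in the metadata columns.
joined <- merge(toy_study, toy_metadata,
  by.x = "repertoire_id", by.y = "samples", all.x = TRUE
)
joined
```

Any name reported in missing_metadata would end up with NA metadata after the join, so later filters on day or phenotype would silently drop those rows.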

Now the metadata information can be used to further subset the data. For instance, to select all “Unsorted” samples collected more than 300 days after bone marrow transplant, we would use the following code

unsorted_300 <- study_table %>%
  dplyr::filter(day > 300 & phenotype == "Unsorted")
unsorted_300
#> # A tibble: 400 × 147
#>    sequence_id   sequence sequence_aa rev_comp productive vj_in_frame stop_codon
#>    <chr>         <chr>    <chr>       <lgl>    <lgl>      <lgl>       <lgl>     
#>  1 TRB_Unsorted… CAGCCCT… CASSPAGAYY… FALSE    FALSE      NA          FALSE     
#>  2 TRB_Unsorted… CAGCGCA… CASSPPTGER… FALSE    FALSE      NA          FALSE     
#>  3 TRB_Unsorted… GAGGATC… NA          FALSE    TRUE       NA          TRUE      
#>  4 TRB_Unsorted… ATCCTGG… CASSLQGREK… FALSE    FALSE      NA          FALSE     
#>  5 TRB_Unsorted… GAGTCAG… CASSESAGST… FALSE    FALSE      NA          FALSE     
#>  6 TRB_Unsorted… CAGCCTG… CASSQDWERL… FALSE    FALSE      NA          FALSE     
#>  7 TRB_Unsorted… GAGTCCG… CASSLAGDSQ… FALSE    FALSE      NA          FALSE     
#>  8 TRB_Unsorted… GCCCTCA… NA          FALSE    TRUE       NA          TRUE      
#>  9 TRB_Unsorted… TCGGCCC… CASSQDLMTV… FALSE    FALSE      NA          FALSE     
#> 10 TRB_Unsorted… CTCAGGC… CASSYVGDGY… FALSE    FALSE      NA          FALSE     
#> # ℹ 390 more rows
#> # ℹ 140 more variables: complete_vdj <lgl>, locus <chr>, v_call <chr>,
#> #   d_call <chr>, d2_call <chr>, j_call <chr>, c_call <chr>,
#> #   sequence_alignment <chr>, sequence_alignment_aa <chr>,
#> #   germline_alignment <chr>, germline_alignment_aa <chr>, junction <chr>,
#> #   junction_aa <chr>, np1 <chr>, np1_aa <chr>, np2 <chr>, np2_aa <chr>,
#> #   np3 <chr>, np3_aa <chr>, cdr1 <chr>, cdr1_aa <chr>, cdr2 <chr>, …

Extracting productive sequences

When AIRR-seq samples are derived from genomic DNA rather than complementary DNA made from RNA, the data will contain both productive and unproductive sequences. Productive sequences are defined as in-frame sequences without any early stop codons. To keep only the productive sequences, use the productiveSeq function, which removes unproductive sequences and recomputes the duplicate_frequency to reflect the productive amino acid or nucleotide sequence frequencies.

If you are interested in just the complementarity determining region 3 (CDR3) amino acid sequences, set aggregate to junction_aa and the duplicate_count values for duplicate amino acid sequences will be summed. The resulting tibble will have junction_aa, duplicate_count, duplicate_frequency, reading_frame, and the most frequent VDJ gene combination for each of the duplicated amino acid sequences, along with the corresponding gene family names. These gene names are kept only for consistency of the tibble structure. Since a single amino acid sequence can be generated from different VDJ combinations, it is inadvisable to use these values for downstream analysis.

aa_table <- LymphoSeq2::productiveSeq(study_table = study_table, aggregate = "junction_aa", prevalence = FALSE)
aa_table
#> # A tibble: 810 × 11
#>    repertoire_id junction_aa     v_call d_call j_call v_family d_family j_family
#>    <chr>         <chr>           <chr>  <chr>  <chr>  <chr>    <chr>    <chr>   
#>  1 TRB_CD4_949   CAISVGGSSPLHF   TRBV1… TRBD2… TRBJ1… TRBV10   TRBD2    TRBJ1   
#>  2 TRB_CD4_949   CASDGGFRNTIYF   TRBV1… TRBD2… TRBJ1… TRBV19   TRBD2    TRBJ1   
#>  3 TRB_CD4_949   CASGGLNTEAFF    NA     NA     TRBJ1… NA       NA       TRBJ1   
#>  4 TRB_CD4_949   CASGLVAGSTLGGE… TRBV1… TRBD2… TRBJ2… TRBV12   TRBD2    TRBJ2   
#>  5 TRB_CD4_949   CASGTGGETQYF    TRBV6… TRBD2… TRBJ2… TRBV6    TRBD2    TRBJ2   
#>  6 TRB_CD4_949   CASHSSGNTIYF    TRBV6… NA     TRBJ1… TRBV6    NA       TRBJ1   
#>  7 TRB_CD4_949   CASKPPGQGGYGYTF TRBV6… TRBD1… TRBJ1… TRBV6    TRBD1    TRBJ1   
#>  8 TRB_CD4_949   CASMIDPSGNTIYF  TRBV5… NA     TRBJ1… TRBV5    NA       TRBJ1   
#>  9 TRB_CD4_949   CASNARVDSPLHF   TRBV6… TRBD1… TRBJ1… TRBV6    TRBD1    TRBJ1   
#> 10 TRB_CD4_949   CASRLGESPLHF    NA     NA     TRBJ1… NA       NA       TRBJ1   
#> # ℹ 800 more rows
#> # ℹ 3 more variables: reading_frame <chr>, duplicate_count <dbl>,
#> #   duplicate_frequency <dbl>

Alternatively, you can aggregate by junction to group sequences by CDR3 nucleotide sequence. This option will produce a tibble similar to the output of readImmunoSeq. Many of the functions within LymphoSeq2 use the results from the productiveSeq function. Please be sure to check the function documentation.

nuc_table <- LymphoSeq2::productiveSeq(
  study_table = study_table, aggregate = "junction",
  prevalence = FALSE
)
nuc_table
#> # A tibble: 812 × 12
#>    repertoire_id junction     junction_aa v_call d_call j_call v_family d_family
#>    <chr>         <chr>        <chr>       <chr>  <chr>  <chr>  <chr>    <chr>   
#>  1 TRB_CD4_949   AACCTGAGCTC… CASSVEVGSA… TRBV9… NA     TRBJ1… TRBV9    NA      
#>  2 TRB_CD4_949   AACCTGAGCTC… CASSVMVGTE… TRBV9… NA     TRBJ1… TRBV9    NA      
#>  3 TRB_CD4_949   AACGCCTTGGA… CASSDSGVPG… TRBV5… TRBD1… TRBJ1… TRBV5    TRBD1   
#>  4 TRB_CD4_949   AACGCCTTGTT… CASSSQGLNT… TRBV5… TRBD1… TRBJ2… TRBV5    TRBD1   
#>  5 TRB_CD4_949   AACGCCTTGTT… CASSLTGRSD… TRBV5… NA     TRBJ2… TRBV5    NA      
#>  6 TRB_CD4_949   AAGATCCAGCC… CASSSNPDQP… NA     NA     TRBJ1… NA       NA      
#>  7 TRB_CD4_949   AATCTTCACAT… CASSQGGPLHF NA     TRBD2… TRBJ1… NA       TRBD2   
#>  8 TRB_CD4_949   AATGTGAACGC… CASSLAGNTE… TRBV5… TRBD2… TRBJ1… TRBV5    TRBD2   
#>  9 TRB_CD4_949   AATTCCCTGGA… CASSQPGLTN… NA     TRBD1… TRBJ1… NA       TRBD1   
#> 10 TRB_CD4_949   AATTCCCTGGA… CASSQGGSYN… NA     NA     TRBJ1… NA       NA      
#> # ℹ 802 more rows
#> # ℹ 4 more variables: j_family <chr>, reading_frame <chr>,
#> #   duplicate_count <dbl>, duplicate_frequency <dbl>

If the parameter prevalence is set to TRUE, then a new column is added to each of the data frames giving the prevalence of each TCR beta CDR3 amino acid sequence in 55 healthy donor peripheral blood samples. Values range from 0 to 100 percent where 100 percent means the sequence appeared in the blood of all 55 individuals.

Notice in the example below that no amino acid sequences are given in the first, fifth, and ninth rows of study_table for sample “TRB_Unsorted_0”. This is because the nucleotide sequence is out of frame and cannot be translated into a productive amino acid sequence. If an asterisk (*) appears in an amino acid sequence, it indicates an early stop codon.

study_table %>%
  dplyr::filter(repertoire_id == "TRB_Unsorted_0")
#> # A tibble: 100 × 147
#>    sequence_id   sequence sequence_aa rev_comp productive vj_in_frame stop_codon
#>    <chr>         <chr>    <chr>       <lgl>    <lgl>      <lgl>       <lgl>     
#>  1 TRB_Unsorted… TCAATTC… NA          FALSE    TRUE       NA          TRUE      
#>  2 TRB_Unsorted… CTGATTC… CASSPVSNEQ… FALSE    FALSE      NA          FALSE     
#>  3 TRB_Unsorted… ATCAATT… CASSQEVPPY… FALSE    FALSE      NA          FALSE     
#>  4 TRB_Unsorted… CACACCC… CASSQEASGR… FALSE    FALSE      NA          FALSE     
#>  5 TRB_Unsorted… TGCCATC… NA          FALSE    TRUE       NA          TRUE      
#>  6 TRB_Unsorted… GCCAGCA… CASSLEHTGA… FALSE    FALSE      NA          FALSE     
#>  7 TRB_Unsorted… CCCCTGA… CASSPGDEQYF FALSE    FALSE      NA          FALSE     
#>  8 TRB_Unsorted… AGTGCCC… CSARSPSTGT… FALSE    FALSE      NA          FALSE     
#>  9 TRB_Unsorted… GGAGCTT… NA          FALSE    TRUE       NA          TRUE      
#> 10 TRB_Unsorted… CTGTAGT… CASSEKREGH… FALSE    FALSE      NA          FALSE     
#> # ℹ 90 more rows
#> # ℹ 140 more variables: complete_vdj <lgl>, locus <chr>, v_call <chr>,
#> #   d_call <chr>, d2_call <chr>, j_call <chr>, c_call <chr>,
#> #   sequence_alignment <chr>, sequence_alignment_aa <chr>,
#> #   germline_alignment <chr>, germline_alignment_aa <chr>, junction <chr>,
#> #   junction_aa <chr>, np1 <chr>, np1_aa <chr>, np2 <chr>, np2_aa <chr>,
#> #   np3 <chr>, np3_aa <chr>, cdr1 <chr>, cdr1_aa <chr>, cdr2 <chr>, …
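
The two signals described above (a missing amino acid sequence for an out-of-frame rearrangement, and an asterisk for an early stop codon) can be expressed as a simple check. This is only a sketch of the idea on made-up input; is_productive_aa is an illustrative helper and not a LymphoSeq2 function, and productiveSeq may apply additional criteria.

```r
# Hypothetical helper: flag amino acid sequences that are present (in frame)
# and contain no stop codon marker.
is_productive_aa <- function(junction_aa) {
  !is.na(junction_aa) & !grepl("*", junction_aa, fixed = TRUE)
}

# Toy input: two clean sequences, one missing, one with a stop codon marker.
toy_junctions <- c("CASSPGDEQYF", NA, "CASSQD*ASSF", "CAISDTGELFF")
productive_flags <- is_productive_aa(toy_junctions)
productive_flags

# Keep only the sequences that pass the check.
toy_junctions[productive_flags]
```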

After productiveSeq is run, the unproductive sequences are removed and the duplicate_frequency is recalculated for each sequence. If there were two identical amino acid sequences that differed in their nucleotide sequence, they would be combined and their counts added together.

aa_table %>%
  dplyr::filter(repertoire_id == "TRB_Unsorted_0")
#> # A tibble: 83 × 11
#>    repertoire_id  junction_aa    v_call d_call j_call v_family d_family j_family
#>    <chr>          <chr>          <chr>  <chr>  <chr>  <chr>    <chr>    <chr>   
#>  1 TRB_Unsorted_0 CAISDLAVPPSYN… TRBV1… TRBD2… TRBJ2… TRBV10   TRBD2    TRBJ2   
#>  2 TRB_Unsorted_0 CARPPYWDYGYTF  TRBV1… NA     TRBJ1… TRBV10   NA       TRBJ1   
#>  3 TRB_Unsorted_0 CASKYGGAEKLFF  TRBV7… TRBD2… TRBJ1… TRBV7    TRBD2    TRBJ1   
#>  4 TRB_Unsorted_0 CASREAWTATNEK… TRBV2… TRBD1… TRBJ1… TRBV2    TRBD1    TRBJ1   
#>  5 TRB_Unsorted_0 CASRHREANYGYTF TRBV2… NA     TRBJ1… TRBV28   NA       TRBJ1   
#>  6 TRB_Unsorted_0 CASRPDRGSSPLHF TRBV2… TRBD1… TRBJ1… TRBV28   TRBD1    TRBJ1   
#>  7 TRB_Unsorted_0 CASRPGQGVGEQYF TRBV1… TRBD1… TRBJ2… TRBV10   TRBD1    TRBJ2   
#>  8 TRB_Unsorted_0 CASRPTKNSDGEL… TRBV1… NA     TRBJ2… TRBV19   NA       TRBJ2   
#>  9 TRB_Unsorted_0 CASRSGRTNQPQHF TRBV2… TRBD2… TRBJ1… TRBV2    TRBD2    TRBJ1   
#> 10 TRB_Unsorted_0 CASSARSYEQYF   TRBV7… NA     TRBJ2… TRBV7    NA       TRBJ2   
#> # ℹ 73 more rows
#> # ℹ 3 more variables: reading_frame <chr>, duplicate_count <dbl>,
#> #   duplicate_frequency <dbl>
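
To make the aggregation step concrete, the sketch below collapses nucleotide sequences that translate to the same amino acid sequence and then recomputes the frequencies within each repertoire. It uses base R on a made-up table of short toy fragments (not real CDR3 sequences), and it assumes the frequency is expressed as a fraction of the repertoire total; the scale used by productiveSeq may differ.

```r
# Toy table: three nucleotide variants encode "CASS", one encodes "CAWS".
toy_nuc <- data.frame(
  repertoire_id = "sample_1",
  junction = c("TGTGCCAGCAGC", "TGCGCCAGCAGC", "TGTGCCAGCAGT", "TGTGCCTGGAGC"),
  junction_aa = c("CASS", "CASS", "CASS", "CAWS"),
  duplicate_count = c(6, 3, 1, 10)
)

# Sum the counts of nucleotide sequences sharing an amino acid sequence.
toy_aa <- aggregate(
  duplicate_count ~ repertoire_id + junction_aa,
  data = toy_nuc, FUN = sum
)

# Recompute the frequency within each repertoire (assumed to be a fraction).
toy_aa$duplicate_frequency <- ave(
  toy_aa$duplicate_count, toy_aa$repertoire_id,
  FUN = function(x) x / sum(x)
)
toy_aa
```

Here the three "CASS" variants collapse into one row with a count of 10, so the two amino acid sequences end up with equal frequencies.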

Create a table of summary statistics

To create a table summarizing, for each imported file, the total number of sequences, number of unique productive sequences, number of genomes, clonality, Gini coefficient, frequency (%) of the top productive sequence, Simpson index, inverse Simpson index, Hill diversity index, Chao1 index, and Kemp index, use the function clonality.

LymphoSeq2::clonality(study_table = study_table)
#> # A tibble: 10 × 8
#>    repertoire_id    total_sequences unique_productive_se…¹ total_count clonality
#>    <chr>                      <int>                  <int>       <dbl>     <dbl>
#>  1 TRB_CD4_949                  100                     80       23093     0.349
#>  2 TRB_CD8_949                  100                     81       23072     0.292
#>  3 TRB_CD8_CMV_369              100                     78        1456     0.305
#>  4 TRB_Unsorted_0               100                     83       14776     0.128
#>  5 TRB_Unsorted_13…             100                     83      157660     0.279
#>  6 TRB_Unsorted_14…             100                     82       28876     0.260
#>  7 TRB_Unsorted_32              100                     82       17043     0.105
#>  8 TRB_Unsorted_369             100                     80      274812     0.387
#>  9 TRB_Unsorted_83              100                     81      170526     0.328
#> 10 TRB_Unsorted_949             100                     82        4971     0.247
#> # ℹ abbreviated name: ¹unique_productive_sequences
#> # ℹ 3 more variables: gini_coefficient <dbl>, top_productive_sequence <dbl>,
#> #   convergence <dbl>

The clonality score is derived from the Shannon entropy of the frequencies of all productive sequences. The entropy is divided by the logarithm of the total number of unique productive sequences to give a normalized entropy, which is then inverted (1 - normalized entropy) to produce the clonality metric.
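
As a worked illustration of this definition, the sketch below computes a clonality score from a vector of sequence counts in base R. The function name clonality_score and the toy counts are hypothetical, and the clonality function in LymphoSeq2 may differ in details such as how edge cases are handled.

```r
# Hypothetical helper: 1 - (Shannon entropy / log(number of unique sequences)).
clonality_score <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  # With a single sequence the normalization log(1) is zero, so the
  # normalized entropy is undefined; return NA in this sketch.
  if (length(p) < 2) {
    return(NA_real_)
  }
  entropy <- -sum(p * log(p))
  # The ratio does not depend on the logarithm base.
  1 - entropy / log(length(p))
}

even_repertoire <- c(5, 5, 5, 5)      # all clones equally frequent
skewed_repertoire <- c(97, 1, 1, 1)   # one dominant clone

clonality_score(even_repertoire)      # 0 for a perfectly even repertoire
clonality_score(skewed_repertoire)    # closer to 1
```

Values near 0 indicate an even, polyclonal repertoire, while values approaching 1 indicate a repertoire dominated by a few clones.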

The Gini coefficient, Chao1 estimate, Kemp estimate, Hill estimate, Simpson index, and inverse Simpson index are alternative metrics for measuring sequence diversity within the immune repertoire.

The Gini coefficient is derived from the Lorenz curve. The Lorenz curve is drawn such that the x-axis represents the cumulative percentage of unique sequences and the y-axis represents the cumulative percentage of reads. A line passing through the origin with a slope of 1 reflects equal frequencies of all clones. The Gini coefficient is the ratio of the area between the line of equality and the observed Lorenz curve over the total area under the line of equality.
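
The area ratio described above can be sketched in a few lines of base R by building the Lorenz curve from sorted counts and integrating it with the trapezoid rule. The function name gini_coefficient is illustrative, and this simple estimator is an assumption for demonstration; the value reported by LymphoSeq2 may come from a different estimator.

```r
# Hypothetical helper: Gini coefficient from the Lorenz curve of clone counts.
gini_coefficient <- function(counts) {
  x <- sort(counts)
  n <- length(x)
  # Cumulative share of reads after each additional unique sequence.
  lorenz <- cumsum(x) / sum(x)
  # Trapezoid rule on an x-axis split into n equal steps from 0 to 1.
  area_under_lorenz <- sum((c(0, lorenz[-n]) + lorenz) / 2) / n
  # The area under the line of equality is 0.5.
  (0.5 - area_under_lorenz) / 0.5
}

gini_coefficient(c(1, 1, 1, 1))    # equal frequencies
gini_coefficient(c(0, 0, 0, 10))   # all reads in one clone
```

With equal counts the Lorenz curve coincides with the line of equality and the coefficient is 0. When all reads belong to one of n clones, this estimator gives 1 - 1/n.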

Calculate clonal relatedness

One of the drawbacks of the clonality metric is that it does not take into account sequence similarity. This is particularly important when studying affinity maturation or B cell malignancies (Lombardo, K.A., et al. Blood Advances 2017 1:535-544). Clonal relatedness is a useful metric that takes into account sequence similarity without regard for clonal frequency. It is defined as the proportion of nucleotide sequences that are related by a defined edit distance threshold. The value ranges from 0 to 1, where 0 indicates no sequences are related and 1 indicates all sequences are related. Edit distance is a way of quantifying how dissimilar two sequences are to one another by counting the minimum number of operations required to transform one sequence into the other. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences differ by a single amino acid or nucleotide.

IGH_path <- system.file("extdata", "IGH_sequencing", package = "LymphoSeq2")
IGH_table <- LymphoSeq2::readImmunoSeq(path = IGH_path, threads = 1) %>%
  LymphoSeq2::topSeqs(top = 100)
LymphoSeq2::clonalRelatedness(study_table = IGH_table, edit_distance = 10)
#> # A tibble: 10 × 2
#>    repertoire_id     relatedness
#>    <chr>                   <dbl>
#>  1 IGH_MVQ108911A_BL        0.61
#>  2 IGH_MVQ194745A_BL        0.7 
#>  3 IGH_MVQ81231A_BL         0.61
#>  4 IGH_MVQ89037A_BL         0.35
#>  5 IGH_MVQ90143A_BL         0.03
#>  6 IGH_MVQ92552A_BL         0.08
#>  7 IGH_MVQ93505A_BL         0.31
#>  8 IGH_MVQ93631A_BL         0.83
#>  9 IGH_MVQ94865A_BL         0.06
#> 10 IGH_MVQ95413A_BL         0.01
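
To illustrate the idea behind edit distance and relatedness, the sketch below uses utils::adist from base R, which computes the generalized Levenshtein distance between strings. The relatedness helper is hypothetical: it assumes a sequence counts as related when at least one other sequence in the sample lies within the threshold, which may not match the exact rule used by clonalRelatedness. Short amino acid strings are used only for readability.

```r
# Toy sequences; the second differs from the first by one substitution.
toy_junctions <- c("CASSPVSNEQFF", "CASSPVSNEQYF", "CASSPGDEQYF", "CAISDTGELFF")

# Pairwise edit distances between all sequences.
distances <- utils::adist(toy_junctions)
distances[1, 2]

# Hypothetical helper: fraction of sequences with at least one neighbour
# within the edit distance threshold.
relatedness <- function(sequences, edit_distance) {
  d <- utils::adist(sequences)
  diag(d) <- NA  # ignore each sequence's distance to itself
  related <- apply(d, 1, function(row) any(row <= edit_distance, na.rm = TRUE))
  mean(related)
}

toy_relatedness <- relatedness(toy_junctions, edit_distance = 1)
toy_relatedness
```

With a threshold of 1, only the first two toy sequences are within reach of each other, so half of the sequences are related.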

Draw a phylogenetic tree

A phylogenetic tree is a useful way to visualize the similarity between sequences. The phyloTree function creates a phylogenetic tree of a single sample using neighbor joining tree estimation for amino acid or nucleotide CDR3 sequences. Each leaf in the tree represents a sequence color coded by the V, D, and J gene usage. The number next to each leaf refers to the sequence count. A triangle shaped leaf indicates the most frequent sequence. The distance between leaves on the horizontal axis corresponds to the sequence similarity (i.e. the further apart the leaves are horizontally, the less similar the sequences are to one another).

nuc_IGH_table <- LymphoSeq2::productiveSeq(study_table = IGH_table, aggregate = "junction")
LymphoSeq2::phyloTree(
  study_table = nuc_IGH_table,
  repertoire_ids = "IGH_MVQ92552A_BL",
  type = "junction",
  layout = "rectangular"
)
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> ℹ The deprecated feature was likely used in the LymphoSeq2 package.
#>   Please report the issue at
#>   <https://github.com/shashidhar22/LymphoSeq2/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

Multiple sequence alignment

In LymphoSeq2, you can perform a multiple sequence alignment using one of three methods provided by the Bioconductor msa package (ClustalW, ClustalOmega, or Muscle). Note that the function now returns an msa S4 object. You can align all amino acid or nucleotide sequences in a single sample. Alternatively, you can search for a given sequence within a list of samples using an edit distance threshold.

alignment <- LymphoSeq2::alignSeq(
  study_table = nuc_IGH_table,
  repertoire_ids = "IGH_MVQ92552A_BL",
  type = "junction_aa",
  method = "ClustalW"
)

#> use default substitution matrix

LymphoSeq2::plotAlignment(alignment)
#> Registered S3 methods overwritten by 'ggalt':
#>   method                  from   
#>   grid.draw.absoluteGrob  ggplot2
#>   grobHeight.absoluteGrob ggplot2
#>   grobWidth.absoluteGrob  ggplot2
#>   grobX.absoluteGrob      ggplot2
#>   grobY.absoluteGrob      ggplot2
#> Coordinate system already present. Adding new coordinate system, which will
#> replace the existing one.

Searching for sequences

To search for one or more amino acid or nucleotide CDR3 sequences in a study table, use the function searchSeq. The function allows sequence search with an edit distance threshold. For example, an edit distance of 0 means the sequences are identical and an edit distance of 1 indicates that the sequences differ by a single amino acid or nucleotide. Match options include “global” matching, which performs end-to-end matching of sequences, and “partial” matching, which allows searching for substrings within CDR3 sequences.

LymphoSeq2::searchSeq(
  study_table = aa_table,
  sequence = "CASSPVSNEQFF",
  seq_type = "junction_aa",
  match = "global",
  edit_distance = 0
)
#> # A tibble: 1 × 13
#>   repertoire_id  junction_aa  v_call   d_call  j_call v_family d_family j_family
#>   <chr>          <chr>        <chr>    <chr>   <chr>  <chr>    <chr>    <chr>   
#> 1 TRB_Unsorted_0 CASSPVSNEQFF TRBV28-1 TRBD2-1 TRBJ2… TRBV28   TRBD2    TRBJ2   
#> # ℹ 5 more variables: reading_frame <chr>, duplicate_count <dbl>,
#> #   duplicate_frequency <dbl>, edit_distance <dbl>, searchSequence <chr>
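
The difference between end-to-end and substring matching can be illustrated with utils::adist from base R, whose partial argument switches between the two behaviours. This is an analogy under stated assumptions rather than a description of how searchSeq is implemented; the query "SPVS" is a made-up motif chosen because it occurs inside the CDR3 sequence searched for above.

```r
query <- "SPVS"
target <- "CASSPVSNEQFF"

# End-to-end comparison: the whole query must be transformed into the whole
# target, so every missing residue counts as an edit.
global_distance <- drop(utils::adist(query, target))

# Substring comparison: the query only needs to match some part of the target.
partial_distance <- drop(utils::adist(query, target, partial = TRUE))

global_distance
partial_distance
```

The motif appears exactly inside the target, so the partial distance is 0, while the end-to-end distance equals the eight residues that would have to be inserted.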

Searching for published sequences

To search your entire study table for published amino acid CDR3 TCRB sequences with known antigen specificity, use the function searchPublished.

LymphoSeq2::searchPublished(study_table = aa_table) %>%
  dplyr::filter(!is.na(PMID))
#> # A tibble: 26 × 16
#>    repertoire_id     junction_aa v_call d_call j_call v_family d_family j_family
#>    <chr>             <chr>       <chr>  <chr>  <chr>  <chr>    <chr>    <chr>   
#>  1 TRB_CD4_949       CASSQDPGYE… TRBV4… TRBD1… TRBJ2… TRBV4    TRBD1    TRBJ2   
#>  2 TRB_CD8_949       CASSPGTGTY… TRBV1… TRBD1… TRBJ1… TRBV10   TRBD1    TRBJ1   
#>  3 TRB_CD8_949       CASSPSRNTE… TRBV4… TRBD2… TRBJ1… TRBV4    TRBD2    TRBJ1   
#>  4 TRB_CD8_949       CASSYSGNTE… NA     NA     TRBJ1… NA       NA       TRBJ1   
#>  5 TRB_CD8_CMV_369   CASSPARNTE… TRBV4… NA     TRBJ1… TRBV4    NA       TRBJ1   
#>  6 TRB_CD8_CMV_369   CASSPGTGTY… TRBV1… TRBD1… TRBJ1… TRBV10   TRBD1    TRBJ1   
#>  7 TRB_CD8_CMV_369   CASSPSRNTE… TRBV4… TRBD2… TRBJ1… TRBV4    TRBD2    TRBJ1   
#>  8 TRB_CD8_CMV_369   CASSYSGNTE… NA     NA     TRBJ1… NA       NA       TRBJ1   
#>  9 TRB_Unsorted_0    CASSPQRNTE… TRBV4… TRBD2… TRBJ1… TRBV4    TRBD2    TRBJ1   
#> 10 TRB_Unsorted_1320 CASSLEGDQP… TRBV5… TRBD1… TRBJ1… TRBV5    TRBD1    TRBJ1   
#> # ℹ 16 more rows
#> # ℹ 8 more variables: reading_frame <chr>, duplicate_count <dbl>,
#> #   duplicate_frequency <dbl>, PMID <fct>, HLA <fct>, antigen <fct>,
#> #   epitope <fct>, prevalence <dbl>

For each sequence found, a table is provided listing the antigen, epitope, HLA type, PubMed ID (PMID), and prevalence percentage of the sequence among 55 healthy donor blood samples.

You can also search for productive CDR3 amino acid sequences from the repertoires that are found in public databases such as VdjDB, IEDB, and McPAS-TCR using the function searchDB. By specifying dbname="all", searchDB will look up each CDR3 amino acid sequence in the dataset in all three public databases. You can also pass a vector with any of the three databases (“VdjDB”, “IEDB”, “McPAS-TCR”) to search just those databases.

LymphoSeq2::searchDB(study_table = aa_table, dbname = "all", chain = "trb")
#> # A tibble: 839 × 26
#>    repertoire_id junction_aa     v_call d_call j_call v_family d_family j_family
#>    <chr>         <chr>           <chr>  <chr>  <chr>  <chr>    <chr>    <chr>   
#>  1 TRB_CD4_949   CAISVGGSSPLHF   TRBV1… TRBD2… TRBJ1… TRBV10   TRBD2    TRBJ1   
#>  2 TRB_CD4_949   CASDGGFRNTIYF   TRBV1… TRBD2… TRBJ1… TRBV19   TRBD2    TRBJ1   
#>  3 TRB_CD4_949   CASGGLNTEAFF    NA     NA     TRBJ1… NA       NA       TRBJ1   
#>  4 TRB_CD4_949   CASGLVAGSTLGGE… TRBV1… TRBD2… TRBJ2… TRBV12   TRBD2    TRBJ2   
#>  5 TRB_CD4_949   CASGTGGETQYF    TRBV6… TRBD2… TRBJ2… TRBV6    TRBD2    TRBJ2   
#>  6 TRB_CD4_949   CASHSSGNTIYF    TRBV6… NA     TRBJ1… TRBV6    NA       TRBJ1   
#>  7 TRB_CD4_949   CASKPPGQGGYGYTF TRBV6… TRBD1… TRBJ1… TRBV6    TRBD1    TRBJ1   
#>  8 TRB_CD4_949   CASMIDPSGNTIYF  TRBV5… NA     TRBJ1… TRBV5    NA       TRBJ1   
#>  9 TRB_CD4_949   CASNARVDSPLHF   TRBV6… TRBD1… TRBJ1… TRBV6    TRBD1    TRBJ1   
#> 10 TRB_CD4_949   CASRLGESPLHF    NA     NA     TRBJ1… NA       NA       TRBJ1   
#> # ℹ 829 more rows
#> # ℹ 18 more variables: reading_frame <chr>, duplicate_count <dbl>,
#> #   duplicate_frequency <dbl>, tra_cdr3_aa <chr>, gene <chr>, epitope <chr>,
#> #   pathology <chr>, antigen <chr>, tra_v_call <chr>, tra_j_call <chr>,
#> #   mhc_allele <chr>, reference <chr>, score <dbl>, cell_type <chr>,
#> #   source <chr>, trb_v_call <chr>, trb_j_call <chr>, Species <chr>

Visualizing repertoire diversity

Antigen receptor repertoire diversity can be characterized by a number such as clonality or Gini coefficient calculated by the clonality function. Alternatively, you can visualize the repertoire diversity by plotting the Lorenz curve for each sample as defined above. In this plot, the more diverse samples will appear near the dotted diagonal line (the line of equality) whereas the more clonal samples will appear to have a more bowed shape.

samples <- aa_table %>%
  dplyr::pull(repertoire_id) %>%
  unique()
LymphoSeq2::lorenzCurve(repertoire_ids = samples, study_table = aa_table)
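The relationship between the Lorenz curve and the Gini coefficient can be sketched in base R. This is an illustration of the standard definitions on made-up clone counts, not the code that lorenzCurve or clonality runs. The helper names lorenz_points and gini are hypothetical.

```r
# Lorenz curve: cumulative share of reads (L) against cumulative share
# of clones (p), with clones ordered from least to most abundant.
lorenz_points <- function(counts) {
  sorted <- sort(counts)
  data.frame(
    p = c(0, seq_along(sorted) / length(sorted)),
    L = c(0, cumsum(sorted) / sum(sorted))
  )
}

# Gini coefficient from sorted counts (standard rank-based formula).
gini <- function(counts) {
  x <- sort(counts)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Hypothetical toy repertoires: one perfectly even, one dominated by a clone.
even_sample <- c(25, 25, 25, 25)
clonal_sample <- c(1, 1, 1, 97)

lorenz_points(even_sample)   # lies on the line of equality (p equals L)
lorenz_points(clonal_sample) # bows away from the diagonal
gini(even_sample)
gini(clonal_sample)
```

An even repertoire gives a Gini coefficient of 0 and a curve on the diagonal; the more a few clones dominate, the further the curve bows and the closer the coefficient gets to 1.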

Alternatively, you can get a feel for the repertoire diversity by plotting the cumulative frequency of a selected number of the most frequent clones using the function topSeqsPlot. In this case, each of the top sequences is represented by a different color, and all less frequent clones are assigned a single color (violet).

LymphoSeq2::topSeqsPlot(study_table = aa_table, top = 10)

Both of these functions are built using the ggplot2 package, so you can reformat the plots using ggplot2 functions. Please refer to the lorenzCurve and topSeqsPlot manual pages for specific examples.

Comparing samples

To compare the T or B cell repertoires of all samples in a pairwise fashion, use the scoringMatrix function with mode set to “Bhattacharyya” or “Similarity”. Both the Bhattacharyya coefficient and the similarity score measure the amount of overlap between two samples. Each ranges from 0 to 1, where 1 indicates that the sequence frequencies are identical in the two samples and 0 indicates that no sequences are shared. The two differ in how shared sequences are weighted: the Bhattacharyya coefficient sums, over the shared sequences, the geometric mean of each sequence's frequencies in the two samples (the square root of their product), whereas the similarity score weights each shared sequence by its combined abundance in the two samples, which corresponds to an arithmetic mean.

bhattacharyya_matrix <- LymphoSeq2::scoringMatrix(aa_table, mode = "Bhattacharyya")
LymphoSeq2::pairwisePlot(bhattacharyya_matrix)
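For intuition about the score itself, the standard Bhattacharyya coefficient between two frequency distributions is the sum, over shared sequences, of the square root of the product of the two frequencies. The sketch below computes it in base R on made-up named frequency vectors. The helper bhattacharyya is hypothetical and is not the internals of scoringMatrix.

```r
# Bhattacharyya coefficient for two named frequency vectors.
# Sequences absent from either sample contribute nothing to the sum.
bhattacharyya <- function(freq_a, freq_b) {
  shared <- intersect(names(freq_a), names(freq_b))
  sum(sqrt(freq_a[shared] * freq_b[shared]))
}

# Hypothetical toy repertoires (frequencies sum to 1 within each sample).
sample_a <- c(CASSAAF = 0.5, CASSBBF = 0.5)
sample_b <- c(CASSAAF = 0.5, CASSBBF = 0.5)   # identical to sample_a
sample_c <- c(CASSAAF = 0.25, CASSDDF = 0.75) # one shared sequence
sample_d <- c(CASSEEF = 1)                    # nothing shared

bhattacharyya(sample_a, sample_b) # identical distributions
bhattacharyya(sample_a, sample_c) # partial overlap
bhattacharyya(sample_a, sample_d) # no overlap
```

Identical distributions score 1, disjoint ones score 0, and partial overlap falls in between.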

To view sequences shared between two or more samples, use the function commonSeqs. This function requires that a productive amino acid table be specified.

common <- LymphoSeq2::commonSeqs(
  study_table = aa_table,
  repertoire_ids = c("TRB_Unsorted_0", "TRB_Unsorted_32")
)
common
#> # A tibble: 1 × 3
#>   junction_aa     TRB_Unsorted_0 TRB_Unsorted_32
#>   <chr>                    <dbl>           <dbl>
#> 1 CASSQDRTGQYGYTF        0.00429          0.0152
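Conceptually, the result above is an inner join of the two samples on the CDR3 amino acid sequence. The base R sketch below reproduces that idea on a made-up table. The column names mirror those in the output above, but the data and the suffixes are assumptions, and this is not the commonSeqs implementation.

```r
# Hypothetical toy study table in long format.
toy <- data.frame(
  repertoire_id = c("S1", "S1", "S2", "S2"),
  junction_aa = c("CASSAAF", "CASSBBF", "CASSAAF", "CASSCCF"),
  duplicate_frequency = c(0.6, 0.4, 0.3, 0.7)
)

# Split out the two samples, keeping only sequence and frequency.
s1 <- toy[toy$repertoire_id == "S1", c("junction_aa", "duplicate_frequency")]
s2 <- toy[toy$repertoire_id == "S2", c("junction_aa", "duplicate_frequency")]

# merge() performs an inner join by default, so only sequences present
# in both samples survive.
common <- merge(s1, s2, by = "junction_aa", suffixes = c("_S1", "_S2"))
common
```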

To visualize the number of overlapping sequences between two or three samples in the form of a Venn diagram, use the function commonSeqsVenn.

LymphoSeq2::commonSeqsVenn(
  repertoire_ids = c("TRB_Unsorted_32", "TRB_Unsorted_83"),
  amino_table = aa_table
)

LymphoSeq2::commonSeqsVenn(
  repertoire_ids = c("TRB_Unsorted_0", "TRB_Unsorted_32", "TRB_Unsorted_83"),
  amino_table = aa_table
)

To compare the frequency of sequences between two samples as a scatter plot, use the function commonSeqsPlot.

LymphoSeq2::commonSeqsPlot("TRB_Unsorted_32", "TRB_Unsorted_83",
  amino_table = aa_table, show = "common"
)

If you have more than 3 samples to compare, use the commonSeqsBar function. You can choose to color a single sample with the color_sample argument or a desired intersection with the color_intersection argument.

LymphoSeq2::commonSeqsBar(
  amino_table = aa_table,
  repertoire_ids = c(
    "TRB_CD4_949", "TRB_CD8_949",
    "TRB_Unsorted_949", "TRB_Unsorted_1320"
  ),
  color_sample = "TRB_CD8_949",
  labels = "no"
)

Differential abundance

When comparing a sample from two different time points, it is useful to identify sequences that are significantly more or less abundant in one versus the other time point (DeWitt, W.S., et al. Journal of Virology 2015 89(8):4517-4526). The differentialAbundance function uses a Fisher exact test to calculate differential abundance of each sequence in two time points and reports the log2 transformed fold change, P value and adjusted P value.

LymphoSeq2::differentialAbundance(
  study_table = aa_table,
  repertoire_ids = c(
    "TRB_Unsorted_949",
    "TRB_Unsorted_1320"
  ),
  type = "junction_aa", q = 0.01
)
#> # A tibble: 107 × 6
#>    junction_aa     TRB_Unsorted_949 TRB_Unsorted_1320        p        q    l2fc
#>    <chr>                      <dbl>             <dbl>    <dbl>    <dbl>   <dbl>
#>  1 CAIKMETPNGEQYF                29               326 1.14e- 6 1.14e- 6   -3.49
#>  2 CAISEGQGVKPQHF                 0               167 1.09e- 2 1.09e- 2 -Inf   
#>  3 CAISESGVLNEKLFF               13               150 1.20e- 3 1.20e- 3   -3.53
#>  4 CASDGGFRNTIYF                 17               387 1.40e- 1 1.40e- 1   -4.51
#>  5 CASKPPGQGGYGYTF                0               173 1.12e- 2 1.12e- 2 -Inf   
#>  6 CASNRVPEETQYF                  0               127 5.75e- 2 5.75e- 2 -Inf   
#>  7 CASNSKADSTDTQYF               21              1325 1.52e- 3 1.52e- 3   -5.98
#>  8 CASRDGQGSGNTIYF               48               358 6.73e-16 6.73e-16   -2.90
#>  9 CASREDRGSSPLHF                 0               147 2.45e- 2 2.45e- 2 -Inf   
#> 10 CASRLGPGAGDEAFF               12               619 1.26e- 1 1.26e- 1   -5.69
#> # ℹ 97 more rows
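The test behind each row can be sketched with base R's stats::fisher.test. In the output above, the l2fc values are consistent with the log2 ratio of the two counts (for example, log2(29/326) ≈ -3.49). The sketch below uses those two counts but invents the per-sample totals, the extra P values, and the choice of the Benjamini-Hochberg adjustment, so treat the contingency layout and the correction method as assumptions rather than the package's exact procedure.

```r
# Counts for one sequence at two time points (taken from the first row of
# the output above); the total read counts per sample are hypothetical.
count_a <- 29
count_b <- 326
total_a <- 10000
total_b <- 10000

# 2 x 2 contingency table: rows are "this sequence" vs "all others",
# columns are the two samples.
tab <- matrix(
  c(count_a, total_a - count_a, count_b, total_b - count_b),
  nrow = 2,
  dimnames = list(c("sequence", "other"), c("sample_a", "sample_b"))
)

ft <- stats::fisher.test(tab)

# log2 fold change of the first sample relative to the second.
l2fc <- log2(count_a / count_b)

# Adjust for multiple testing across sequences; the other two P values
# are made up, and "BH" is an assumed method.
p_values <- c(ft$p.value, 0.04, 0.5)
q_values <- stats::p.adjust(p_values, method = "BH")

ft$p.value
l2fc
q_values
```

A sequence with a count of 0 in the first sample gives log2(0), which R evaluates to -Inf, matching the -Inf entries in the output above.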

Finding recurring sequences

To create a tibble of unique, productive amino acid sequences as rows and sample names as headers use the seqMatrix function. Each value in the data frame represents the frequency that each sequence appears in the sample. You can specify your own list of sequences or all unique sequences in the list using the output of the function uniqueSeqs. The uniqueSeqs function creates a tibble of all unique, productive sequences and reports the total count in all samples.

unique_seqs <- LymphoSeq2::uniqueSeqs(productive_table = aa_table)
unique_seqs
#> # A tibble: 438 × 2
#>    junction_aa            duplicate_count
#>    <chr>                            <dbl>
#>  1 CASSQDWERLGEQFF                  99480
#>  2 CASSLQGREKLFF                    90563
#>  3 CASSQDLMTVDSLFAGANVLTF           68679
#>  4 CASSPAGAYYNEQFF                  30418
#>  5 CASSPPTGERDTQYF                  24552
#>  6 CASSLAGDSQETQYF                  22147
#>  7 CASSESAGSTGELFF                  17438
#>  8 CASRDGQGSGNTIYF                  11516
#>  9 CASSPSRNTEAFF                     8705
#> 10 CASSQDRTGQYGYTF                   8017
#> # ℹ 428 more rows

sequence_matrix <- LymphoSeq2::seqMatrix(amino_table = aa_table, sequences = unique_seqs$junction_aa)
sequence_matrix
#> # A tibble: 438 × 11
#>    junction_aa        TRB_CD4_949 TRB_CD8_949 TRB_CD8_CMV_369 TRB_Unsorted_0
#>    <chr>                    <dbl>       <dbl>           <dbl>          <dbl>
#>  1 CAISVGGSSPLHF         0.000695           0               0              0
#>  2 CASDGGFRNTIYF         0.0318             0               0              0
#>  3 CASGGLNTEAFF          0.00160            0               0              0
#>  4 CASGLVAGSTLGGETQYF    0.00202            0               0              0
#>  5 CASGTGGETQYF          0.00146            0               0              0
#>  6 CASHSSGNTIYF          0.000765           0               0              0
#>  7 CASKPPGQGGYGYTF       0.00827            0               0              0
#>  8 CASMIDPSGNTIYF        0.000765           0               0              0
#>  9 CASNARVDSPLHF         0.000834           0               0              0
#> 10 CASRLGESPLHF          0.00167            0               0              0
#> # ℹ 428 more rows
#> # ℹ 6 more variables: TRB_Unsorted_1320 <dbl>, TRB_Unsorted_1496 <dbl>,
#> #   TRB_Unsorted_32 <dbl>, TRB_Unsorted_369 <dbl>, TRB_Unsorted_83 <dbl>,
#> #   TRB_Unsorted_949 <dbl>
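The shape of this matrix (one row per sequence, one column per sample, 0 where a sequence is absent) can be reproduced on toy data with base R's xtabs. The table below is made up, and this is only a sketch of the reshaping step, not the seqMatrix implementation.

```r
# Hypothetical toy study table in long format.
toy <- data.frame(
  repertoire_id = c("S1", "S1", "S2", "S2", "S3"),
  junction_aa = c("CASSAAF", "CASSBBF", "CASSAAF", "CASSCCF", "CASSAAF"),
  duplicate_frequency = c(0.6, 0.4, 0.3, 0.7, 1.0)
)

# Cross-tabulate frequency by sequence (rows) and sample (columns).
# Combinations that never occur are filled with 0.
seq_matrix <- as.data.frame.matrix(
  xtabs(duplicate_frequency ~ junction_aa + repertoire_id, data = toy)
)
seq_matrix
```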

If just the top clones with a frequency greater than a specified amount are of interest to you, then use the topFreq function. This creates a tibble of the top productive amino acid sequences having a minimum specified frequency and reports the minimum, maximum, and mean frequency that the sequence appears in a list of samples. For TCRB sequences, the prevalence percentage and the published antigen specificity of that sequence are also provided.

top_freq <- LymphoSeq2::topFreq(productive_table = aa_table, frequency = 0.001)
top_freq
#> # A tibble: 425 × 7
#>    junction_aa  minFrequency maxFrequency meanFrequency numberSamples prevalence
#>    <chr>               <dbl>        <dbl>         <dbl>         <int>      <dbl>
#>  1 CASSQDRTGQY…      0.00429       0.0248       0.0106              9          0
#>  2 CASSLQGREKL…      0.0569        0.322        0.113               8          0
#>  3 CASSQDLMTVD…      0.0292        0.166        0.0924              8          0
#>  4 CASSREGDQPQ…      0.00157       0.0520       0.00913             8          0
#>  5 CASRDGQGSGN…      0.00278       0.0351       0.0155              7          0
#>  6 CASSPFDRGPD…      0.00508       0.0165       0.0112              7          0
#>  7 CASSQDLGQAF…      0.00223       0.0235       0.0111              7          0
#>  8 CASSQDSSDTE…      0.00147       0.0488       0.0106              7          0
#>  9 CAIKMETPNGE…      0.00253       0.0118       0.00704             7          0
#> 10 CASSPGTGTYG…      0.00121       0.0156       0.00585             7          0
#> # ℹ 415 more rows
#> # ℹ 1 more variable: antigen <fct>
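The per-sequence summary columns can be illustrated on toy data in base R. This sketch assumes that the statistics are taken over the samples in which a sequence appears and that the threshold is applied to the maximum frequency; both are assumptions, as is the made-up table. The ordering (number of samples, then mean frequency, both descending) is chosen because it is consistent with the output shown above.

```r
# Hypothetical toy study table in long format.
toy <- data.frame(
  repertoire_id = c("S1", "S1", "S1", "S2", "S2", "S3"),
  junction_aa = c("CASSAAF", "CASSBBF", "CASSDDF",
                  "CASSAAF", "CASSCCF", "CASSAAF"),
  duplicate_frequency = c(0.5995, 0.4, 0.0005, 0.3, 0.7, 1.0)
)

# Summarize each sequence across the samples where it occurs.
stats_list <- lapply(
  split(toy$duplicate_frequency, toy$junction_aa),
  function(f) c(min = min(f), max = max(f), mean = mean(f), n = length(f))
)
summary_table <- as.data.frame(do.call(rbind, stats_list))
summary_table$junction_aa <- rownames(summary_table)

# Keep sequences that reach the frequency threshold in at least one sample
# (assumed rule), then order by sample count and mean frequency.
top <- summary_table[summary_table$max >= 0.001, ]
top <- top[order(-top$n, -top$mean), ]
top
```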

One very useful thing to do is merge the output of seqMatrix and topFreq.

top_freq_matrix <- dplyr::full_join(top_freq, sequence_matrix)
#> Joining with `by = join_by(junction_aa)`
top_freq_matrix
#> # A tibble: 438 × 17
#>    junction_aa  minFrequency maxFrequency meanFrequency numberSamples prevalence
#>    <chr>               <dbl>        <dbl>         <dbl>         <int>      <dbl>
#>  1 CASSQDRTGQY…      0.00429       0.0248       0.0106              9          0
#>  2 CASSLQGREKL…      0.0569        0.322        0.113               8          0
#>  3 CASSQDLMTVD…      0.0292        0.166        0.0924              8          0
#>  4 CASSREGDQPQ…      0.00157       0.0520       0.00913             8          0
#>  5 CASRDGQGSGN…      0.00278       0.0351       0.0155              7          0
#>  6 CASSPFDRGPD…      0.00508       0.0165       0.0112              7          0
#>  7 CASSQDLGQAF…      0.00223       0.0235       0.0111              7          0
#>  8 CASSQDSSDTE…      0.00147       0.0488       0.0106              7          0
#>  9 CAIKMETPNGE…      0.00253       0.0118       0.00704             7          0
#> 10 CASSPGTGTYG…      0.00121       0.0156       0.00585             7          0
#> # ℹ 428 more rows
#> # ℹ 11 more variables: antigen <fct>, TRB_CD4_949 <dbl>, TRB_CD8_949 <dbl>,
#> #   TRB_CD8_CMV_369 <dbl>, TRB_Unsorted_0 <dbl>, TRB_Unsorted_1320 <dbl>,
#> #   TRB_Unsorted_1496 <dbl>, TRB_Unsorted_32 <dbl>, TRB_Unsorted_369 <dbl>,
#> #   TRB_Unsorted_83 <dbl>, TRB_Unsorted_949 <dbl>

Tracking sequences across samples

To visually track the frequency of sequences across multiple samples, use the function cloneTrack. This function takes the output from the seqMatrix function. You can specify a character vector of amino acid sequences using the parameter track to highlight those sequences with a different color. Alternatively, you can highlight all of the sequences from a given sample using the parameter map. If the mapping feature is used, then you must specify a productive amino acid list and a character vector of labels to title the mapped samples.

ctable <- LymphoSeq2::cloneTrack(
  study_table = aa_table,
  sample_list = c("TRB_CD8_949", "TRB_CD8_CMV_369")
)
LymphoSeq2::plotTrack(ctable)

You can track particular sequences across samples by providing an optional list of CDR3 amino acid sequences.

ttable <- LymphoSeq2::topSeqs(aa_table, top = 10)
ctable <- LymphoSeq2::cloneTrack(ttable)
LymphoSeq2::plotTrack(ctable, alist = c("CASSESAGSTGELFF", "CASSLAGDSQETQYF")) + ggplot2::theme(legend.position = "bottom")

Alternatively, you can use the function plotTrackSingular to retrieve a list of alluvial diagrams, each tracking a single amino acid sequence from the clone track table. Because a plot is generated for each unique CDR3 sequence, we recommend running this feature on a clone track table derived from only the top sequences from each repertoire, as described in the example above.

lalluvial <- ctable %>%
  LymphoSeq2::topSeqs(top = 1) %>%
  LymphoSeq2::plotTrackSingular()
lalluvial[[1]]

Comparing V(D)J gene usage

To compare the V, D, and J gene usage across samples, start by creating a data frame of V, D, and J gene counts and frequencies using the function geneFreq. You can specify whether you are interested in the “VDJ”, “DJ”, “VJ”, “V”, “D”, or “J” loci using the locus parameter. Set family to TRUE if you prefer the family names instead of the gene names as reported by ImmunoSeq.

vGenes <- LymphoSeq2::geneFreq(nuc_table, locus = "V", family = TRUE)
vGenes
#> # A tibble: 167 × 5
#>    repertoire_id gene_name duplicate_count gene_type gene_frequency
#>    <chr>         <chr>               <dbl> <chr>              <dbl>
#>  1 TRB_CD4_949   NA                   2945 v_family         0.205  
#>  2 TRB_CD4_949   TRBV10               5071 v_family         0.353  
#>  3 TRB_CD4_949   TRBV11                107 v_family         0.00744
#>  4 TRB_CD4_949   TRBV12                 29 v_family         0.00202
#>  5 TRB_CD4_949   TRBV18                226 v_family         0.0157 
#>  6 TRB_CD4_949   TRBV19               1643 v_family         0.114  
#>  7 TRB_CD4_949   TRBV2                 230 v_family         0.0160 
#>  8 TRB_CD4_949   TRBV21                 18 v_family         0.00125
#>  9 TRB_CD4_949   TRBV27                 84 v_family         0.00584
#> 10 TRB_CD4_949   TRBV28                208 v_family         0.0145 
#> # ℹ 157 more rows
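The counting behind a table like this can be sketched in base R: sum the counts per gene family within each repertoire, then divide by the repertoire total. The toy table below is made up, and this is an illustration of the idea rather than the geneFreq implementation.

```r
# Hypothetical toy table: one row per clone with its V family and count.
toy <- data.frame(
  repertoire_id = c("S1", "S1", "S1", "S2", "S2"),
  v_family = c("TRBV10", "TRBV10", "TRBV19", "TRBV10", "TRBV28"),
  duplicate_count = c(30, 20, 50, 10, 90)
)

# Total count per V family within each repertoire.
gene_counts <- stats::aggregate(
  duplicate_count ~ repertoire_id + v_family,
  data = toy,
  FUN = sum
)

# Frequency of each family relative to its repertoire's total count.
gene_counts$gene_frequency <- gene_counts$duplicate_count /
  ave(gene_counts$duplicate_count, gene_counts$repertoire_id, FUN = sum)

gene_counts
```

Note that the formula interface of aggregate drops rows with a missing family by default, whereas the output above keeps NA as its own row.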

To create a chord diagram showing VJ or DJ gene associations from one or more samples, combine the output of geneFreq with the function chordDiagramVDJ. This function works well with the topSeqs function, which creates a data frame of a selected number of top productive sequences. In the example below, a chord diagram is made showing the association between V and J genes of just the single dominant clone in each sample. The size of the ribbons connecting VJ genes corresponds to the number of samples that have that recombination event. The thicker the ribbon, the higher the frequency of the recombination.

top_seqs <- LymphoSeq2::topSeqs(nuc_table, top = 1)
LymphoSeq2::chordDiagramVDJ(
  study_table = top_seqs,
  association = "VJ",
  colors = c("darkred", "navyblue")
)

You can also visualize the results of geneFreq as a heat map, word cloud, or cumulative frequency bar plot with the support of additional R packages as shown below.

vGenes <- LymphoSeq2::geneFreq(nuc_table, locus = "V", family = TRUE)
RedBlue <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(11, "RdBu")))(256)
vtable <- vGenes %>%
  dplyr::filter(repertoire_id == "TRB_Unsorted_83") %>%
  dplyr::select(gene_name, gene_frequency)
wordcloud2::wordcloud2(
  data = vtable,
  color = RedBlue
)

vGenes <- LymphoSeq2::geneFreq(nuc_table, locus = "V", family = TRUE) %>%
  tidyr::pivot_wider(
    id_cols = gene_name,
    names_from = repertoire_id,
    values_from = gene_frequency,
    values_fn = sum,
    values_fill = 0
  )
gene_names <- vGenes %>%
  dplyr::pull(gene_name)
vGenes <- vGenes %>%
  dplyr::select(-gene_name) %>%
  as.matrix()
rownames(vGenes) <- gene_names
pheatmap::pheatmap(vGenes, scale = "row")

vGenes <- LymphoSeq2::geneFreq(nuc_table, locus = "V", family = TRUE)
multicolors <- grDevices::colorRampPalette(rev(RColorBrewer::brewer.pal(9, "Set1")))(28)
ggplot2::ggplot(vGenes, ggplot2::aes(x = repertoire_id, y = gene_frequency, fill = gene_name)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::theme_minimal() +
  ggplot2::scale_y_continuous(expand = c(0, 0)) +
  ggplot2::guides(fill = ggplot2::guide_legend(ncol = 2)) +
  ggplot2::scale_fill_manual(values = multicolors) +
  ggplot2::labs(y = "Frequency (%)", x = "", fill = "") +
  ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5, hjust = 1))

Removing sequences

Occasionally you may identify one or more sequences in your data set that appear to be contamination. You can remove an amino acid sequence from all data frames using the function removeSeq, which also recomputes duplicate_frequency for all remaining sequences.

LymphoSeq2::searchSeq(study_table = aa_table, sequence = "CASSESAGSTGELFF", seq_type = "junction_aa")
#> # A tibble: 4 × 13
#>   repertoire_id     junction_aa  v_call d_call j_call v_family d_family j_family
#>   <chr>             <chr>        <chr>  <chr>  <chr>  <chr>    <chr>    <chr>   
#> 1 TRB_CD4_949       CASSESAGSTG… TRBV1… TRBD2… TRBJ2… TRBV10   TRBD2    TRBJ2   
#> 2 TRB_Unsorted_1320 CASSESAGSTG… TRBV1… TRBD2… TRBJ2… TRBV10   TRBD2    TRBJ2   
#> 3 TRB_Unsorted_1496 CASSESAGSTG… TRBV1… TRBD2… TRBJ2… TRBV10   TRBD2    TRBJ2   
#> 4 TRB_Unsorted_949  CASSESAGSTG… TRBV1… TRBD2… TRBJ2… TRBV10   TRBD2    TRBJ2   
#> # ℹ 5 more variables: reading_frame <chr>, duplicate_count <dbl>,
#> #   duplicate_frequency <dbl>, edit_distance <dbl>, searchSequence <chr>

cleansed <- LymphoSeq2::removeSeq(study_table = aa_table, sequence = "CASSESAGSTGELFF")
LymphoSeq2::searchSeq(study_table = cleansed, sequence = "CASSESAGSTGELFF", seq_type = "junction_aa")
#> # A tibble: 0 × 13
#> # ℹ 13 variables: repertoire_id <chr>, junction_aa <chr>, v_call <chr>,
#> #   d_call <chr>, j_call <chr>, v_family <chr>, d_family <chr>, j_family <chr>,
#> #   reading_frame <chr>, duplicate_count <dbl>, duplicate_frequency <dbl>,
#> #   edit_distance <dbl>, searchSequence <chr>
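The two steps involved, dropping the sequence and renormalizing the frequencies within each repertoire, can be sketched in base R on a made-up table. This is an illustration of the idea under the assumption that frequencies are recomputed from duplicate_count per repertoire; it is not the removeSeq implementation.

```r
# Hypothetical toy study table; CASSBBF plays the role of the contaminant.
toy <- data.frame(
  repertoire_id = c("S1", "S1", "S1", "S2", "S2"),
  junction_aa = c("CASSAAF", "CASSBBF", "CASSCCF", "CASSBBF", "CASSCCF"),
  duplicate_count = c(60, 30, 10, 50, 50)
)

# Step 1: drop every row carrying the unwanted sequence.
cleansed <- toy[toy$junction_aa != "CASSBBF", ]

# Step 2: recompute frequencies so they sum to 1 within each repertoire.
cleansed$duplicate_frequency <- cleansed$duplicate_count /
  ave(cleansed$duplicate_count, cleansed$repertoire_id, FUN = sum)

cleansed
```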

Rarefaction curves

Rarefaction and extrapolation curves allow for comparison of TCR diversity across repertoires given an ideal sequencing depth. The curves are drawn by subsampling a sequencing dataset to various depths to understand the trajectory of sequence diversity and then extrapolating the curve to an ideal depth.

LymphoSeq2::plotRarefactionCurve(study_table = aa_table)
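The rarefaction half of this idea can be sketched in base R by repeatedly subsampling reads without replacement and counting the distinct clones observed at each depth. The clone counts, depths, and number of replicates below are made up, and the sketch does not cover extrapolation or reproduce what plotRarefactionCurve does internally.

```r
set.seed(1)

# Hypothetical clone counts for one repertoire (100 reads in total).
clone_counts <- c(CASSAAF = 50, CASSBBF = 30, CASSCCF = 15,
                  CASSDDF = 4, CASSEEF = 1)

# Expand the counts into one entry per read.
reads <- rep(names(clone_counts), times = clone_counts)

# For each depth, draw that many reads without replacement 100 times
# and average the number of distinct clones seen.
depths <- c(10, 25, 50, 100)
richness <- vapply(depths, function(d) {
  mean(replicate(100, length(unique(sample(reads, size = d)))))
}, numeric(1))

rarefaction <- data.frame(depth = depths, mean_unique_clones = richness)
rarefaction
```

At the full depth every clone is recovered, while at shallow depths rare clones are often missed, which is why comparing repertoires at a common depth matters.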

Session info

sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] vroom_1.6.5        lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1     
#>  [5] dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.0       
#>  [9] tibble_3.2.1       ggplot2_3.4.4      tidyverse_2.0.0    wordcloud2_0.2.1  
#> [13] RColorBrewer_1.1-3 LymphoSeq2_1.0.0   data.table_1.14.10
#> 
#> loaded via a namespace (and not attached):
#>   [1] jsonlite_1.8.8          shape_1.4.6             magrittr_2.0.3         
#>   [4] iNEXT_3.0.0             farver_2.1.1            rmarkdown_2.25         
#>   [7] GlobalOptions_0.1.2     fs_1.6.3                zlibbioc_1.48.0        
#>  [10] ragg_1.2.7              vctrs_0.6.5             memoise_2.0.1          
#>  [13] RCurl_1.98-1.14         R4RNA_1.30.0            ggtree_3.11.0          
#>  [16] htmltools_0.5.7         progress_1.2.3          lambda.r_1.2.4         
#>  [19] gridGraphics_0.5-1      proj4_1.0-13            sass_0.4.8             
#>  [22] KernSmooth_2.23-22      bslib_0.6.1             htmlwidgets_1.6.4      
#>  [25] desc_1.4.3              plyr_1.8.9              futile.options_1.0.1   
#>  [28] cachem_1.0.8            ggalt_0.4.0             igraph_1.6.0           
#>  [31] lifecycle_1.0.4         pkgconfig_2.0.3         Matrix_1.6-1.1         
#>  [34] R6_2.5.1                fastmap_1.1.1           GenomeInfoDbData_1.2.11
#>  [37] digest_0.6.34           dtplyr_1.3.1            aplot_0.2.2            
#>  [40] colorspace_2.1-0        patchwork_1.2.0         S4Vectors_0.40.2       
#>  [43] textshaping_0.3.7       labeling_0.4.3          fansi_1.0.6            
#>  [46] timechange_0.2.0        polyclip_1.10-6         compiler_4.3.2         
#>  [49] bit64_4.0.5             withr_2.5.2             UpSetR_1.4.0           
#>  [52] highr_0.10              ggforce_0.4.1           Rttf2pt1_1.3.12        
#>  [55] maps_3.4.2              MASS_7.3-60             tools_4.3.2            
#>  [58] ape_5.7-1               extrafontdb_1.0         glue_1.7.0             
#>  [61] VennDiagram_1.7.3       quadprog_1.5-8          nlme_3.1-163           
#>  [64] grid_4.3.2              stringdist_0.9.12       reshape2_1.4.4         
#>  [67] generics_0.1.3          gtable_0.3.4            tzdb_0.4.0             
#>  [70] hms_1.1.3               utf8_1.2.4              XVector_0.42.0         
#>  [73] BiocGenerics_0.48.1     pillar_1.9.0            yulab.utils_0.1.3      
#>  [76] ineq_0.2-13             circlize_0.4.15         tweenr_2.0.2           
#>  [79] treeio_1.27.0.002       lattice_0.21-9          bit_4.0.5              
#>  [82] tidyselect_1.2.0        Biostrings_2.70.1       knitr_1.45             
#>  [85] gridExtra_2.3           msa_1.34.0              IRanges_2.36.0         
#>  [88] stats4_4.3.2            futile.logger_1.4.3     xfun_0.41              
#>  [91] pheatmap_1.0.12         stringi_1.8.3           seqmagick_0.1.7        
#>  [94] lazyeval_0.2.2          ggfun_0.1.3             yaml_2.3.8             
#>  [97] evaluate_0.23           codetools_0.2-19        extrafont_0.19         
#> [100] ggmsa_1.3.4             ggplotify_0.1.2         cli_3.6.2              
#> [103] ash_1.0-15              systemfonts_1.0.5       munsell_0.5.0          
#> [106] jquerylib_0.1.4         Rcpp_1.0.12             GenomeInfoDb_1.38.5    
#> [109] parallel_4.3.2          ellipsis_0.3.2          pkgdown_2.0.7          
#> [112] prettyunits_1.2.0       ggalluvial_0.12.5       bitops_1.0-7           
#> [115] phangorn_2.11.1         tidytree_0.4.6          scales_1.3.0           
#> [118] crayon_1.5.2            rlang_1.1.3             fastmatch_1.1-4        
#> [121] formatR_1.14

Elena Wu

Shashidhar Ravishankar

David Coffey

2024-01-13