Calculate repertoire diversity metrics

Quantify immune repertoire diversity using multiple complementary metrics including Shannon clonality, Gini coefficient, Simpson indices, and TCR/BCR convergence. Optionally correct for sequencing depth bias using rarefaction.

Usage

clonality(study_table, rarefy = FALSE, iterations = 100, min_count = 1000)

Arguments

study_table

A tibble of antigen receptor sequences from readImmunoSeq(). Must contain "junction_aa", "duplicate_count", and "duplicate_frequency" columns. Use productive junction sequences (not aggregated by amino acid) for accurate clonality estimates.

rarefy

Logical. Should diversity be normalized for sequencing depth?

TRUE: Apply rarefaction by subsampling all repertoires to min_count depth, repeating the subsampling for the number of rounds set by iterations, and averaging the results. Use this when comparing samples with different sequencing depths.
FALSE (default): Calculate raw diversity metrics without normalization.

iterations

Number of subsampling iterations for rarefaction (default 100). Higher values give more stable averages but take longer to compute.

min_count

Target sequencing depth for rarefaction (default 1000). Repertoires with fewer sequences than this will be excluded with a warning.

Value

A tibble with one row per repertoire containing:

total_sequences: Total number of sequences
unique_productive_sequences: Number of unique clones
total_count: Sum of UMI/read counts
clonality: Shannon clonality (0 = diverse, 1 = monoclonal)
gini_coefficient: Gini coefficient (0 = even, 1 = skewed)
simpson_index: Simpson's D (0 = diverse, 1 = monoclonal)
inverse_simpson: Effective number of dominant clones
top_productive_sequence: Frequency (%) of most abundant clone
convergence: Average nucleotide sequences per amino acid

Details

Diversity Metrics:

Shannon Clonality - Measures the evenness of the clone distribution. Calculated as 1 - (Shannon entropy / log(number of unique clones)). Values near 0 indicate a diverse, evenly distributed repertoire; values near 1 indicate oligoclonal or monoclonal expansion.
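
The formula above can be sketched in base R as follows. shannon_clonality() is a hypothetical helper written for illustration, not a LymphoSeq2 function, and the natural logarithm and the handling of single-clone input are assumptions; the package's internal calculation may differ.

```r
# Hypothetical helper, not part of LymphoSeq2.
shannon_clonality <- function(counts) {
  counts <- counts[counts > 0]           # clones with zero count carry no information
  if (length(counts) < 2) {
    return(1)                            # assumed convention: one clone = monoclonal
  }
  p <- counts / sum(counts)              # clone frequencies
  entropy <- -sum(p * log(p))            # Shannon entropy (natural log)
  1 - entropy / log(length(counts))      # 1 - normalized entropy
}

shannon_clonality(c(25, 25, 25, 25))  # even repertoire: essentially 0
shannon_clonality(c(97, 1, 1, 1))     # one expanded clone: about 0.88
```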

Gini Coefficient - Borrowed from economics to measure inequality. Based on the Lorenz curve of cumulative clone frequencies. Ranges 0-1 where 0 is perfect equality and 1 is maximal inequality (single dominant clone).
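
As a rough illustration of the Lorenz-curve idea, the sketch below computes a Gini coefficient from clone counts in base R. gini_lorenz() is a hypothetical helper, not a LymphoSeq2 function, and the trapezoid-based formula is one common definition rather than a confirmed description of the package's implementation. With this definition, the maximum for n clones is (n - 1) / n, which approaches 1 as n grows.

```r
# Hypothetical helper, not part of LymphoSeq2.
gini_lorenz <- function(counts) {
  x <- sort(counts)                         # ascending clone sizes
  n <- length(x)
  lorenz <- cumsum(x) / sum(x)              # cumulative share of sequences
  lorenz_prev <- c(0, lorenz[-n])           # Lorenz curve shifted by one clone
  area <- sum((lorenz_prev + lorenz) / 2) / n  # trapezoid area under the curve
  1 - 2 * area                              # Gini = 1 - 2 * area under Lorenz curve
}

gini_lorenz(rep(10, 5))        # perfectly even: essentially 0
gini_lorenz(c(97, 1, 1, 1))    # one dominant clone: about 0.72
```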

Simpson Index - Probability that two randomly selected sequences belong to the same clone. Higher values indicate lower diversity.

Inverse Simpson - The reciprocal of Simpson's D (1 / D): the number of equally abundant clones that would produce the observed diversity. Easier to interpret than Simpson's D because higher values mean more diversity.
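
A minimal base R sketch of the two Simpson metrics follows. Both functions are hypothetical helpers, not LymphoSeq2 functions. The sum-of-squared-frequencies form used here corresponds to drawing two sequences with replacement, which is an assumption; the package may use a different estimator.

```r
# Hypothetical helpers, not part of LymphoSeq2.
simpson_d <- function(counts) {
  p <- counts / sum(counts)   # clone frequencies
  sum(p^2)                    # chance two draws (with replacement) share a clone
}

inverse_simpson <- function(counts) {
  1 / simpson_d(counts)       # effective number of equally abundant clones
}

simpson_d(c(25, 25, 25, 25))        # 0.25
inverse_simpson(c(25, 25, 25, 25))  # 4
simpson_d(c(97, 1, 1, 1))           # about 0.94
inverse_simpson(c(97, 1, 1, 1))     # about 1.06
```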

Rarefaction (rarefy = TRUE):

When samples have different sequencing depths, raw diversity metrics are not directly comparable. Rarefaction corrects this by: (1) subsampling all repertoires to the same depth (min_count), (2) calculating diversity on the subsampled data, (3) repeating steps 1-2 for the number of rounds set by iterations, and (4) averaging the results.

This allows fair comparison between a deeply-sequenced blood sample and a shallow tumor sample. Samples with fewer than min_count sequences are excluded.
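
The four steps can be sketched in base R as below. rarefied_mean() and simpson_d() are hypothetical helpers written for illustration, not LymphoSeq2 functions. Sampling without replacement and returning NA for shallow samples are assumptions of this sketch; clonality() itself excludes such samples with a warning.

```r
# Hypothetical helpers, not part of LymphoSeq2.
simpson_d <- function(counts) {
  p <- counts / sum(counts)
  sum(p^2)
}

rarefied_mean <- function(counts, metric, min_count = 1000, iterations = 100) {
  if (sum(counts) < min_count) {
    return(NA_real_)  # too shallow to subsample to min_count
  }
  # One clone label per observed sequence
  labels <- rep(seq_along(counts), times = counts)
  estimates <- vapply(seq_len(iterations), function(i) {
    # Step 1 (assumption: without replacement): subsample to min_count sequences
    picked <- labels[sample.int(length(labels), size = min_count)]
    # Step 2: recompute the metric on the subsampled clone counts
    metric(tabulate(picked, nbins = length(counts)))
  }, numeric(1))
  mean(estimates)     # Steps 3-4: average over all iterations
}

set.seed(42)
deep    <- c(5000, 3000, 2000)  # 10,000 sequences
shallow <- c(500, 300, 200)     # 1,000 sequences, same clone proportions
rarefied_mean(deep, simpson_d, min_count = 1000, iterations = 50)
rarefied_mean(shallow, simpson_d, min_count = 1000, iterations = 50)
rarefied_mean(c(5, 3, 2), simpson_d, min_count = 1000)  # NA: below min_count
```

Because both samples are evaluated at the same depth, their rarefied Simpson values land close to each other even though one was sequenced ten times deeper.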

Examples

file_path <- system.file("extdata", "TCRB_sequencing",
  package = "LymphoSeq2"
)
study_table <- LymphoSeq2::readImmunoSeq(path = file_path, threads = 1)
#> Dataset Analysis:
#>   Files: 10, Total: 0.00 GB, Largest: 0.0 MB
#>   Available memory: 14.2 GB
study_table <- LymphoSeq2::topSeqs(study_table, top = 100)
raw_clonality <- LymphoSeq2::clonality(study_table)
sampled_clonality <- LymphoSeq2::clonality(study_table,
  rarefy = TRUE,
  iterations = 100,
  min_count = 100
)
