Quantify immune repertoire diversity using multiple complementary metrics including Shannon clonality, Gini coefficient, Simpson indices, and TCR/BCR convergence. Optionally correct for sequencing depth bias using rarefaction.
Arguments
- study_table
A tibble of antigen receptor sequences from readImmunoSeq(). Must contain "junction_aa", "duplicate_count", and "duplicate_frequency" columns. Use productive junction sequences (not aggregated by amino acid) for accurate clonality estimates.
- rarefy
Logical. Should diversity be normalized for sequencing depth?
TRUE: Apply rarefaction by subsampling all repertoires to min_count depth, repeating for iterations, and averaging the results. Use this when comparing samples with different sequencing depths.
FALSE (default): Calculate raw diversity metrics without normalization.
- iterations
Number of bootstrap iterations for rarefaction (default 100). Higher values increase precision but take longer to compute.
- min_count
Target sequencing depth for rarefaction (default 1000). Repertoires with fewer sequences than this will be excluded with a warning.
Value
A tibble with one row per repertoire containing:
- total_sequences: Number of total sequences
- unique_productive_sequences: Number of unique clones
- total_count: Sum of UMI/read counts
- clonality: Shannon clonality (0 = diverse, 1 = monoclonal)
- gini_coefficient: Gini coefficient (0 = even, 1 = skewed)
- simpson_index: Simpson's D (0 = diverse, 1 = monoclonal)
- inverse_simpson: Effective number of dominant clones
- top_productive_sequence: Frequency (%) of most abundant clone
- convergence: Average nucleotide sequences per amino acid
Details
Diversity Metrics:
Shannon Clonality - Measures evenness of clone distribution. Calculated as 1 - (entropy / log(unique clones)). Values near 0 indicate diverse repertoires; near 1 indicate oligoclonal expansion.
Gini Coefficient - Borrowed from economics to measure inequality. Based on the Lorenz curve of cumulative clone frequencies. Ranges 0-1 where 0 is perfect equality and 1 is maximal inequality (single dominant clone).
Simpson Index - Probability that two randomly selected sequences belong to the same clone. Higher values indicate lower diversity.
Inverse Simpson - Number of equally abundant clones that would produce the observed diversity. More intuitive than Simpson's D because higher values mean a more diverse repertoire.
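The definitions above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the code used by clonality(): the helper name diversity_sketch is hypothetical, Simpson's D is taken as the sum of squared clone frequencies (sampling with replacement), and the Gini coefficient uses one common rank-based estimator whose maximum for n clones is (n - 1) / n rather than exactly 1. The package may use different estimators or edge-case handling.

```r
# Hypothetical helper for illustration only; not a LymphoSeq2 function.
# `counts` is a numeric vector with one read/UMI count per clone.
diversity_sketch <- function(counts) {
  counts <- counts[counts > 0]
  n <- length(counts)
  p <- counts / sum(counts)

  # Shannon clonality: 1 - (entropy / log(unique clones)).
  # Assumption: a repertoire with a single clone is treated as monoclonal (1).
  entropy <- -sum(p * log(p))
  clonality <- if (n > 1) 1 - entropy / log(n) else 1

  # Gini coefficient from the sorted counts (rank-based Lorenz curve formula).
  x <- sort(counts)
  gini <- 2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n

  # Simpson's D: probability that two draws (with replacement) share a clone.
  simpson <- sum(p^2)

  c(
    clonality = clonality,
    gini_coefficient = gini,
    simpson_index = simpson,
    inverse_simpson = 1 / simpson
  )
}

even <- diversity_sketch(c(25, 25, 25, 25))  # four equally abundant clones
skewed <- diversity_sketch(c(97, 1, 1, 1))   # one dominant clone
```

For the even repertoire, clonality and the Gini coefficient are 0 and the inverse Simpson index equals the number of clones (4). For the skewed repertoire, clonality, Gini, and Simpson's D all move toward 1 while the inverse Simpson index drops toward 1.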
Rarefaction (rarefy = TRUE):
When samples have different sequencing depths, raw diversity metrics are not directly comparable. Rarefaction corrects this by: (1) Subsampling all repertoires to the same depth (min_count), (2) Calculating diversity on the subsampled data, (3) Repeating steps 1-2 iterations times, and (4) Averaging the results.
This allows fair comparison between a deeply-sequenced blood sample and a shallow tumor sample. Samples with fewer than min_count sequences are excluded.
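The four rarefaction steps can be sketched in base R as shown here. This is an illustrative sketch, not the package's implementation: rarefy_metric and simpson_d are hypothetical names, and subsampling is assumed to be done without replacement at the level of individual reads, which the text above does not specify.

```r
# Hypothetical helpers for illustration only; not LymphoSeq2 functions.
simpson_d <- function(counts) {
  p <- counts / sum(counts)
  sum(p^2)
}

# Rarefy one repertoire: subsample to `min_count` reads, compute `metric`,
# repeat `iterations` times, and return the mean.
rarefy_metric <- function(counts, metric, min_count = 1000, iterations = 100) {
  if (sum(counts) < min_count) {
    warning("Repertoire has fewer than min_count sequences; excluded.")
    return(NA_real_)
  }
  # Expand counts into one entry per read, labelled by clone index.
  pool <- rep(seq_along(counts), times = counts)
  values <- vapply(seq_len(iterations), function(i) {
    # Assumption: draw reads without replacement.
    drawn <- pool[sample.int(length(pool), size = min_count)]
    metric(tabulate(drawn, nbins = length(counts)))
  }, numeric(1))
  mean(values)
}

set.seed(1)
deep <- c(5000, 3000, 2000)  # deeply sequenced repertoire
shallow <- c(10, 5)          # below the target depth
deep_d <- rarefy_metric(deep, simpson_d, min_count = 500, iterations = 20)
shallow_d <- suppressWarnings(
  rarefy_metric(shallow, simpson_d, min_count = 500, iterations = 20)
)
```

Here deep_d should land close to the unrarefied Simpson's D of 0.38, while shallow_d is NA because the repertoire has fewer reads than min_count.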
Examples
file_path <- system.file("extdata", "TCRB_sequencing",
  package = "LymphoSeq2"
)
study_table <- LymphoSeq2::readImmunoSeq(path = file_path, threads = 1)
#> Dataset Analysis:
#> Files: 10, Total: 0.00 GB, Largest: 0.0 MB
#> Available memory: 14.2 GB
study_table <- LymphoSeq2::topSeqs(study_table, top = 100)
raw_clonality <- LymphoSeq2::clonality(study_table)
sampled_clonality <- LymphoSeq2::clonality(study_table,
  rarefy = TRUE,
  iterations = 100,
  min_count = 100
)
