Estimating the sensitivity and specificity of a linkage criteria

The sensitivity and specificity of the linkage criteria to distinguish between linked and unlinked infection pairs can be determined by applying a specific linkage criteria to an existing dataset for which the true transmission events are known, and then extrapolating these results to a new or larger set of data for which transmission are not known.

However, there are many scenarios in which this existing dataset is not available or representative of the desired sampling population. In this package, we provide a method for calculating the sensitivity and specificity of genetic distance – the number of mutations between pathogen genome sequences – as a linkage criteria. Further research is needed to extend this method beyond genetic distance as the linkage metric.

In this package, we provide a function (gendist_sensspec_cutoff()) to calculate the sensitivity and specificity when using a specific genetic distance threshold (e.g. 2 mutations) or a vector of candidate thresholds (e.g. 1-10 mutations). Note that samples separated by a genetic distance of less than the cutoff are considered linked, so the minimum cutoff to consider is 1 (linked cases have identical genomes).

library(phylosamp)
gendist_sensspec_cutoff(cutoff=2, 
                        mut_rate=1, 
                        mean_gens_pdf=c(0.02,0.08,0.15,0.75), 
                        max_link_gens=1)
##      cutoff sensitivity specificity
## [1,]      2   0.7357589   0.8662894
gendist_sensspec_cutoff(cutoff=1:10, 
                        mut_rate=1, 
                        mean_gens_pdf=c(0.02,0.08,0.15,0.75), 
                        max_link_gens=1)
##       cutoff sensitivity specificity
##  [1,]      1   0.3678794 0.967314682
##  [2,]      2   0.7357589 0.866289434
##  [3,]      3   0.9196986 0.697765199
##  [4,]      4   0.9810118 0.499227295
##  [5,]      5   0.9963402 0.316627604
##  [6,]      6   0.9994058 0.178637743
##  [7,]      7   0.9999168 0.090198436
##  [8,]      8   0.9999898 0.041044514
##  [9,]      9   0.9999989 0.016951040
## [10,]     10   0.9999999 0.006396198

This method requires knowledge of the pathogen mutation rate, the distribution of generations between cases, and the number of generations between cases considered linked. We break down each of these inputs in turn:

1. The mutation rate of the pathogen

The mutation rate of the pathogen must be provided in mutations per genome per generation, and represents the average number of mutations observed between the genomes of an infector/infectee pair.

Many pathogen mutation rates are reported in substitutions per site per year. We provide an example of converting the substitution rate for Ebola virus (roughly \(1.2\times 10^{-3} subs/site/year\)) to mutations (used interchangeably with substitutions for this purpose) per genome per generation:

\[ 1.2\times 10^{-3}\:subs/site/year \:\times\:\frac{19000\text{ sites}}{1\text{ genome}} \:\times\:\frac{1\text{ year}}{365\text{ days}}\:\times\:\frac{15\text{ days}}{1\text{ generation}} = 0.94\:subs/genome/generation\]

In other words, assuming an approximate generation time of 15 days, we would expect to see 1 mutation per generation of transmission.

2. The distribution of generations between cases

The number of generations between all pairs of cases in the population is key to determining how many of them are actually linked (separated by less than max_link_gens generations of transmission), which in turn is used to calculate the sensitivity and specificity. This distribution is not trivial, but depends on the reproductive number of the pathogen. We have simulated this distribution for values of \(R\) between 1.3 and 18 and averaged the results of 1000 simulations for each value. The results are provided as part of this package at data/genDistSim.Rda and can be used as a reasonable proxy for the generation distribution for large outbreaks.

We provide an example of using these results for given the mutation rate (~1) and reproductive number (1.5) of an Ebola-like virus:

R <- 1.5

data("genDistSim")
mgd <- as.numeric(genDistSim[genDistSim$R == R, -(1:2)])

gendist_sensspec_cutoff(cutoff=1:10, 
                        mut_rate=1, 
                        mean_gens_pdf=mgd, 
                        max_link_gens=1)
##       cutoff sensitivity specificity
##  [1,]      1   0.3678794   0.9992910
##  [2,]      2   0.7357589   0.9971700
##  [3,]      3   0.9196986   0.9933593
##  [4,]      4   0.9810118   0.9877469
##  [5,]      5   0.9963402   0.9801562
##  [6,]      6   0.9994058   0.9703223
##  [7,]      7   0.9999168   0.9579376
##  [8,]      8   0.9999898   0.9426909
##  [9,]      9   0.9999989   0.9242939
## [10,]     10   0.9999999   0.9025040

We can then choose the optimal threshold and the associated sensitivity and specificity for our transmission study. Many methods exist for choosing the optimal threshold, including the Youden index, closest to corner, etc. Using either of these methods, the optimal threshold in this example is 5 mutations. In other words, cases separate by less than 5 mutations will be considered linked given this linkage criteria, which has a sensitivity of 98.1% and a specificity of 98.7%.