---
title: "Sampling the infectivity profile from published serial interval estimates"
output: html_document
vignette: >
  %\VignetteIndexEntry{Sampling the infectivity profile from published serial interval estimates}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

The infectivity profile is commonly modelled from contact tracing data where
onset of symptoms in infector-infectee pairs is known, or where infection time
can be estimated by known constraints in exposure such as travellers to and from
an outbreak. Serial interval estimates that measure time from symptom onset of
infector to time of symptom onset of infectee are an imperfect proxy for
infectivity profile but are usually adequate in the early phases of an epidemic
when better data is sparse.

In practice we often use serial interval data from the literature to approximate
the infectivity profile. In this case random samples must be drawn from the
parameters published in the literature and the infectivity profile distributions
reconstructed. There is the potential for bias to be introduced in this process
which is investigated here.


```{r setup}
library(patchwork)
library(ggoutbreak)
utils::data("MockRotavirus",package="EpiEstim")
si_data = MockRotavirus$si_data
si_distr = MockRotavirus$si_distr
source(here::here("vignettes/vignette-utils.R"))
```

Estimation of the infectivity profile is usually done by fitting a 
doubly censored survival model with an underlying gamma probability distribution
to data and can be done with MCMC using `EpiEstim`. In this case we demonstrate
this with a mock Rotavirus dataset from `EpiEstim`:

```{r}

dic.fit.mcmc = memoise::memoise(coarseDataTools::dic.fit.mcmc)

clever_init_param <- EpiEstim::init_mcmc_params(si_data, "G")
SI_fit_clever <- dic.fit.mcmc(dat = si_data,
  dist = "G", # Gamma distribution
  init.pars = clever_init_param,
  burnin = 1000,
  n.samples = 5000,
  verbose = 10000)

SI_fit_clever

```

The median estimate and confidence intervals are quantiles of the posterior
samples each of which define a candidate gamma distribution. The density of a
subset of these are plotted here. 

```{r}
# The sample quantiles
# apply(SI_fit_clever@samples,MARGIN = 2,FUN = quantile, p=c(0.5,0.025,0.975))

gammas = SI_fit_clever@samples %>% 
  dplyr::transmute(
    rate = 1/var2, 
    shape = var1, 
    scale=var2, 
    mean = var1*var2, 
    sd = sqrt(var1)*var2, 
    lmean = log(mean), 
    lsd = log(sd))

max_x = ceiling(stats::quantile(stats::qgamma(0.95, gammas$shape, gammas$rate),0.75))

if (is.null(si_distr)) si_distr = EpiEstim::discr_si(0:max_x, stats::quantile(gammas$mean, 0.5), stats::quantile(gammas$sd, 0.5))

ggplot2::ggplot()+
  purrr::pmap(gammas %>% utils::tail(200), 
    function(shape,rate,...) {
      ggplot2::geom_function(fun = function(x) stats::dgamma(x, shape=shape, rate=rate), alpha=0.05, xlim=c(0,max_x))
    }
  )+
  ggplot2::ylab("density")
  
```


In many cases we will want to estimate the infectivity profile from published literature.
Typically a serial interval analysis will publish a summary of the parameters of 
these posteriors which might include statistics as follows:

```{r}

gammas %>% 
  dplyr::select(mean, sd) %>%
  tidyr::pivot_longer(cols=tidyselect::everything()) %>%
  dplyr::group_by(parameter = name) %>%
  dplyr::summarise(
    mean = sprintf("%1.2f", mean(value)),
    sd = sprintf("%1.2f", stats::sd(value)),
    `median [95% CrI]` = sprintf("%1.2f [%1.2f \u2014 %1.2f]",
      stats::quantile(value, p=0.5),
      stats::quantile(value, p=0.025),
      stats::quantile(value, p=0.975))
    )

```

In many cases it will only be the median and 95% credible intervals that are 
published, and in some cases not even that. In very rare cases posterior estimates of
shape and scale parameters of the gamma distribution may be published.

`EpiEstim` can use these published estimates including credible intervals to
infer a set of samples of the mean and SD of the the infectivity profile
distribution. This is done using independent sampling from a truncated normal
distribution for both mean and SD. It is not enforced, but assumed that the
selected values for the mean are larger than 1. 

To perform resampling `EpiEstim` uses the mean and SD of each of the mean and SD
parameters of the infectivity profile distributions (and not the medians). There
is a potential for confusion here and also as there is sometimes a lack of
clarity about whether a serial interval study is publishing a mean of the mean,
or a median of the mean, there is a risk of incorrectly interpreting published
values.

```{r}

summary = gammas %>% 
  dplyr::select(mean, sd) %>%
  tidyr::pivot_longer(cols=tidyselect::everything()) %>%
  dplyr::group_by(parameter = name) %>%
  dplyr::summarise(
    mean = mean(value),
    sd = stats::sd(value),
    min = min(value),
    max = max(value)
  )

config = list(
  mean_si = summary$mean[1], std_mean_si = summary$sd[1],
  min_mean_si = max(c(1,summary$min[1])), max_mean_si = summary$max[1],
  std_si = summary$mean[2], std_std_si = summary$sd[2],
  min_std_si = max(c(summary$min[2],0)), max_std_si = summary$max[2], n1 = 5000
)

# The following code is copied from EpiEstim:
mean_si_sample <- rep(-1, config$n1)
std_si_sample <- rep(-1, config$n1)

for (k in seq_len(config$n1)) {
    while (mean_si_sample[k] < config$min_mean_si || 
      mean_si_sample[k] > config$max_mean_si) {
      mean_si_sample[k] <- stats::rnorm(1, mean = config$mean_si, 
        sd = config$std_mean_si)
    }
    while (std_si_sample[k] < config$min_std_si || 
      std_si_sample[k] > config$max_std_si) {
      std_si_sample[k] <- stats::rnorm(1, mean = config$std_si, 
        sd = config$std_std_si)
    }
}

epiestim_si_samples = tibble::tibble(
  mean = mean_si_sample,
  sd = std_si_sample
) %>% dplyr::mutate(
  shape = mean^2/sd^2,
  scale = sd^2/mean,
  rate = 1/scale,
  lmean = log(mean),
  lsd = log(sd)
)
```


In many cases this strategy gives a very good fit to the original mcmc
estimates, but due to the properties of the gamma distribution, if the standard
deviation is larger than the mean, the shape parameter is smaller than 1 and the
resulting gamma distribution is heavily left skewed and more closely resembles
an exponential distribution with a probability density at zero which is
non-zero. In MCMC sampling these solutions are generally found to be
inconsistent with the data and are rejected. As a result posterior gamma
distribution estimates will have a correlation between mean and standard
deviation when the standard deviation approaches the mean. `EpiEstim`s
re-sampled mean and SD do not represent the underlying data distribution very
accurately in this instance.

```{r}

comparison = dplyr::bind_rows(
    epiestim_si_samples %>% dplyr::mutate(source="epiestim"),
    gammas %>% dplyr::mutate(source = "raw mcmc")
  ) %>% dplyr::mutate(source = factor(source,labels = c("raw mcmc","epiestim")))

p1 = ggplot2::ggplot(comparison)+
  ggplot2::geom_point(mapping=ggplot2::aes(x=mean,y=sd,colour=source), alpha = 0.1)+
  ggplot2::geom_abline()

p2 = ggplot2::ggplot(comparison)+
  ggplot2::geom_point(mapping=ggplot2::aes(x=shape,y=rate,colour=source),alpha=0.1)

p1+p2+patchwork::plot_annotation(tag_levels = "A")

  
# ggplot2::ggplot()+
#   purrr::pmap(epiestim_si_samples %>% utils::tail(200), 
#     function(shape,rate,...) {
#       ggplot2::geom_function(fun = function(x) stats::dgamma(x, shape=shape, rate=rate), alpha=0.05, xlim=c(0,max_x))
#     }
#   )+
#   ggplot2::ylab("density")#+


```

# Alternative resampling

The mean and SD of the gamma distribution are both modelled in `EpiEstim` as
independent truncated normally distributed quantities, and it is assumed we know
the mean, sd, and lower and upper bounds for this distribution. This assumption
breaks down when the SD of the gamma distributions is close to the mean, as
there is correlation between the SD and mean.

An alternative assumption is that the mean and SD of the infectivity profile 
gamma distributions are correlated log-normal 
distributed quantities. In this case we see that the results of MCMC fitting
do seem to be correlated on a log scale and are feasible normally distributed.

```{r}

# They are a correlated approximately lognormal distributed quantity
ggplot2::ggplot(gammas,ggplot2::aes(x=lmean,y=lsd))+ggplot2::geom_point(alpha=0.1)+ggplot2::geom_rug(alpha=0.02)
correlation = stats::cor(gammas$mean, gammas$sd)
correlation
```

Let us also assume that rather than knowledge of mean, sd, etc, we know only
median and 95% credible intervals of the distribution of mean and SD, which are
more commonly what is reported in the literature. For the time being we also
assume we know the degree of correlation.

```{r}
# However typically we will be provided with quantiles of mean and SD
# We can log transform quantiles and estimate a normal distribution
# on the log scale.
quantiles = gammas %>% 
  dplyr::select(mean, sd) %>% 
  dplyr::reframe(dplyr::across(tidyselect::everything(), ~ stats::quantile(p=c(0.025,0.5,0.975),.x))) %>% dplyr::mutate(p=c(0.025,0.5,0.975))
quantiles %>% dplyr::glimpse()

```

```{r}
# We can log transform quantiles and estimate a normal distribution
# on the log scale.
quantiles = quantiles %>% dplyr::mutate(lmean = log(mean), lsd = log(sd), z = stats::qnorm(p))

# this linear model fits a (log)normal distribution to provided quantiles.
# lm coefficients - intercept is mean and z gradient is sd.
lmMean = stats::lm(formula = lmean~z, quantiles)$coeff
lmSd = stats::lm(formula = lsd~z, quantiles)$coeff

# means here is mean of mu=log(mean) and mean of sigma=log(sd)
means2 = c(lmMean[1],lmSd[1])
names(means2) = c("lmean","lsd")

# sds here is sd of mu=log(mean) and sd of sigma=log(sd)
sds2 = c(lmMean[2],lmSd[2])
names(sds2) = c("lmean","lsd")

.cor2cov = function(correlation, sds) {
  corMatrix = matrix(c(1,correlation,correlation,1),nrow = 2)
  covMatrix2 = diag(sds) %*% corMatrix %*% diag(sds)
  colnames(covMatrix2) = names(sds)
  rownames(covMatrix2) = names(sds)
  return(covMatrix2)
}

covMatrix2 = .cor2cov(correlation, sds2)

simulated2 = MASS::mvrnorm(n=5000, mu=means2, Sigma = covMatrix2) %>% as.data.frame()
simulated2 = simulated2 %>% dplyr::mutate(mean = exp(lmean), sd = exp(lsd), shape = (mean^2)/(sd^2), rate = mean/(sd^2))
```

This re-sampling process produces a closer set of distributions than the `EpiEstim`
approach, but does assume we can estimate the correlation between mean and SD.

```{r}
comparison2 = dplyr::bind_rows(
    simulated2 %>% dplyr::mutate(source="corr log normal"),
    gammas %>% dplyr::mutate(source = "raw mcmc")
  ) %>% dplyr::mutate(source = factor(source,labels = c("raw mcmc","corr log normal")))

p1 = ggplot2::ggplot(comparison2)+
  ggplot2::geom_point(mapping=ggplot2::aes(x=mean,y=sd,colour=source), alpha = 0.1)+
  ggplot2::geom_abline()

p2 = ggplot2::ggplot(comparison2)+
  ggplot2::geom_point(mapping=ggplot2::aes(x=shape,y=rate,colour=source),alpha=0.1)

p1+p2+patchwork::plot_annotation(tag_levels = "A")
```

These distributions are a better fit for those derived from the original data
than the sampling strategy from `EpiEstim` but they do produce some resampled
distributions that are incompatible with the discretisation strategy in
`EpiEstim` as they have a mean < 1. These samples would have to be excluded to 
allow them to be used with the Cori method.

Does this make any difference? 

```{r}

comparison = dplyr::bind_rows(
    epiestim_si_samples %>% dplyr::mutate(source="epiestim"),
    simulated2 %>% dplyr::mutate(source="corr log normal"),
    gammas %>% dplyr::mutate(source = "raw mcmc")
  )

original_disc = comparison %>%
  dplyr::filter(mean > 1) %>%
  dplyr::group_by(source) %>%
  dplyr::transmute(
    coll = dplyr::row_number(),
    disc = purrr::map2(mean,sd, ~ tibble::tibble(
                         x=0:50,
                         p = EpiEstim::discr_si(0:50, .x, .y))),
    disc_type = "discr_si"
) %>% tidyr::unnest(disc)


tmp = original_disc %>% dplyr::mutate(source=factor(source, levels = c("raw mcmc","epiestim","corr log normal")))

sources = length(levels(tmp$source))

tmp_summ = tmp %>% dplyr::group_by(source,x) %>% dplyr::summarise(p = mean(p))

tmp_mean = tmp %>% dplyr::group_by(source,coll) %>% 
  dplyr::summarise(mean = sum(x*p)) %>% 
  dplyr::summarise(sd = stats::sd(mean),mean = mean(mean), parameter="mean", disc_type="epiestim") %>%
  dplyr::bind_rows(
    gammas %>% dplyr::summarise(sd = stats::sd(mean),mean=mean(mean), parameter="mean",source="raw",disc_type="none" )
  )

ggplot2::ggplot(tmp)+
  ggplot2::geom_segment(mapping=ggplot2::aes(x=x+as.numeric(source)/sources-1/sources,xend=x+as.numeric(source)/sources,y=p, colour=source), alpha=0.01)+
  ggplot2::geom_segment(data = tmp_summ, mapping=ggplot2::aes(x=x+as.numeric(source)/sources-1/sources,xend=x+as.numeric(source)/sources,y=p), colour="black")+
  ggplot2::xlab("time")+
  ggplot2::coord_cartesian(xlim=c(0,max_x+1))+
  ggplot2::guides(colour = ggplot2::guide_legend(override.aes = list(alpha = 1)))

tmp_mean

```

`EpiEstim`'s default resampling strategy in this example results in a slightly
less skewed discretised infectivity profile distribution than either the
original data derived (`raw mcmc`) or log-normally resampled with correlation
(`corr log normal`). The result of this can be examined by looking at `EpiEstim`
estimates of the reproduction number using each of the different infectivity
profiles. We simulate an artificial incidence timeseries for each of a set of
growth rates and use `EpiEstim` to produce reproduction number estimates for
each of these synthetic timeseries using either the data-derived infectivity
profiles (`raw mcmc`), `EpiEstim` resampled infectivity profiles (`epiestim`), and
correlated log-normal resampled infectivity profiles (`corr log normal`):


```{r}

r = c(0.4,0.2,0.1,0.05,0,-0.05,-0.1)

# select a shorter list of samples
tmp2 = tmp %>% dplyr::group_by(source) %>% dplyr::reframe(
  si_matrix = list(matrix(as.vector(p),nrow=51)[,1:250]),
  r = list(r)
) %>% tidyr::unnest(r)

# tmp2$si_matrix[[1]][1:10,1:10]

cached_estimate_R = memoise::memoise(EpiEstim::estimate_R)

compare_R = tmp2 %>% dplyr::mutate(R = purrr::map2(si_matrix, r, ~ {
  ts = tibble::tibble(
        t = 0:50
      ) %>% dplyr::mutate(
        I = 100*exp(.y*t)
      )
  return(cached_estimate_R(ts, 
      method="si_from_sample", si_sample = .x,
      config = EpiEstim::make_config(t_start=2, t_end = 50))$R)
  })
)

ggplot2::ggplot(compare_R %>% tidyr::unnest(R))+
  ggplot2::geom_point(ggplot2::aes(x=as.factor(r), colour=source, y=`Median(R)`), position=ggplot2::position_dodge(width=0.8))+
  ggplot2::geom_errorbar(ggplot2::aes(x=as.factor(r), colour=source, ymin=`Quantile.0.025(R)`, ymax=`Quantile.0.975(R)`), position=ggplot2::position_dodge(width=0.8), width=0.2)+
  ggplot2::xlab("Growth rate (day⁻¹)")+
  ggplot2::ylab("Reproduction number")+
  ggplot2::geom_hline(yintercept=1)


```

Compared to reproduction number estimates from the `mcmc` originals, `epiestim`
and re-sampling using correlated log-normals are very similar. To produce the 
correlated log-normals we are assuming knowledge of the correlation between
mean and SD, which is not usually specified in published data.

# Conclusion

There is an assumption made in `EpiEstim` that uncertainty in infectivity
profile distributions, can be specified as the mean, SD, upper and lower limit
of truncated normal distributions which in turn define the mean and SD of a gamma
distribution. This gamma distribution is discretised using the PDF of the gamma
distribution offset by 1 day. This requires a lower bound for the mean of 1 day.

Changing these assumptions and instead specifying the parameters of the gamma
distribution as a pair of correlated log normally distributed quantities gives a
qualitatively better fit to the original distribution. This does not, however, affect
reproduction number estimation, and assumes knowledge of the correlation between
mean and SD of the gamma distribution.

We find no evidence that there is benefit in changing the resampling procedure
from that used in `EpiEstim`, but care must be taken to ensure that the
resampling algorithm is given the correct parameters. Published serial interval
estimates are often given as the median and 95% credible intervals of
distribution mean and SD parameters which is not what `EpiEstim` expects.

Any reproduction number estimation using `EpiEstim` is subject to the constraint
that the infectivity at time zero is zero. Coupled with discretisation this is
another potential source of bias that can only be investigated using a different
reproduction number framework.

# Addendum

It is possible that the left truncation that is inherent in `EpiEstim`
discretisation and $R_t$ estimation is hiding an effect so we repeated the
analysis. This time we make infectivity profiles from posterior samples or from
resampled of summarised parameters from `EpiEstim` and using the correlated
log-normal method described here but using a different discretisation framework,
and estimate $R_t$ using a method that does not require that the probability of
infection at time 0 is zero. The result is more or less the same - although we
could possibly argue in this case that the `EpiEstim` re-sampling procedure has
resulted in more uncertainty at higher growth rates, but in this analysis using
correlated log-normal resampling does not make a huge improvement.


```{r}


comparison2 = dplyr::bind_rows(
  do.call(make_posterior_ip, gammas %>% dplyr::select(shape,rate)) %>% dplyr::mutate(source="raw mcmc"),
  do.call(make_posterior_ip, simulated2 %>% dplyr::select(shape,rate)) %>% dplyr::mutate(source="corr log normal"),
  do.call(make_posterior_ip, epiestim_si_samples %>% dplyr::select(shape,rate)) %>% dplyr::mutate(source="epiestim")
) %>% tidyr::nest(ip = -source)

tmp3 = tibble::tibble(r = r) %>%
  dplyr::mutate(
    data = purrr::map(r, ~ {
      tibble::tibble(
        time = 0:30,
        count = stats::rpois(0:30, 100*exp(.x*0:30))
      )}
    )) %>%
  dplyr::cross_join(comparison2) %>%
  dplyr::mutate(fit = purrr::map2(data,ip, ~
      .x %>% 
        poisson_locfit_model() %>%
        rt_from_incidence(ip = .y)
    )
  )

tmp4 = tmp3 %>% dplyr::select(-ip) %>% tidyr::unnest(fit) %>% dplyr::filter(time>10) %>% 
  dplyr::group_by(source,r) %>%
  dplyr::summarise(
    rt.0.025 = mean(rt.0.025),
    rt.0.5 = mean(rt.0.5),
    rt.0.975 = mean(rt.0.975)
  )

ggplot2::ggplot(tmp4)+
  ggplot2::geom_point(ggplot2::aes(x=as.factor(r), colour=source, y=rt.0.5), position=ggplot2::position_dodge(width=0.8))+
  ggplot2::geom_errorbar(ggplot2::aes(x=as.factor(r), colour=source, ymin=rt.0.025, ymax=rt.0.975), position=ggplot2::position_dodge(width=0.8), width=0.2)+
  ggplot2::xlab("Growth rate (day⁻¹)")+
  ggplot2::ylab("Reproduction number")+
  ggplot2::geom_hline(yintercept=1)

```