| Title: | Approximate Bayesian Computing with Tidy Data |
|---|---|
| Description: | A flexible framework for Approximate Bayesian Computation (ABC) that integrates with the tidyverse. Define simulation models and summary statistics as standard R functions, use 'dist_fns' to represent prior and posterior distributions, and perform inference via rejection sampling, Sequential Monte Carlo (SMC), or Adaptive ABC. The package provides tools for diagnostics, visualization, and convergence assessment, enabling reproducible Bayesian inference for complex models with intractable likelihoods. |
| Authors: | Robert Challen [aut, cre] (ORCID: <https://orcid.org/0000-0002-5504-7768>) |
| Maintainer: | Robert Challen <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.0.1 |
| Built: | 2026-05-23 08:07:41 UTC |
| Source: | https://github.com/ai4ci/tidyabc |
This function will execute a simulation for a random selection of parameters.
Based on the acceptance_rate it will reject a proportion of the results.
The remaining results are weighted (using a kernel with a tolerance
equivalent to half the acceptance rate). Empirical distributions are fitted
to weighted parameter particles and from these proposals are generated for
further waves by fresh sampling. Waves are executed until a maximum is
reached or the results converge sufficiently that the changes between waves
are small. A relatively small number of simulations may be attempted with a
high acceptance rate, over multiple waves.
abc_adaptive( obsdata, priors_list, sim_fn, scorer_fn, n_sims, acceptance_rate, ..., max_time = 5 * 60, converged_fn = default_termination_fn(), obsscores = NULL, distance_method = "euclidean", seed = NULL, knots = NULL, parallel = FALSE, max_recover = 3, allow_continue = interactive(), debug_errors = FALSE, kernel = "epanechnikov", distfit = c("empirical", "analytical"), bw = 0.1, widen_by = 1.05, scoreweights = NULL, use_proposal_correlation = TRUE, ess_limit = c(200, n_sims * acceptance_rate) )abc_adaptive( obsdata, priors_list, sim_fn, scorer_fn, n_sims, acceptance_rate, ..., max_time = 5 * 60, converged_fn = default_termination_fn(), obsscores = NULL, distance_method = "euclidean", seed = NULL, knots = NULL, parallel = FALSE, max_recover = 3, allow_continue = interactive(), debug_errors = FALSE, kernel = "epanechnikov", distfit = c("empirical", "analytical"), bw = 0.1, widen_by = 1.05, scoreweights = NULL, use_proposal_correlation = TRUE, ess_limit = c(200, n_sims * acceptance_rate) )
obsdata |
The observational data. The data in this will typically be a named list, but could be anything, e.g. dataframe. It is the reference data that the simulation model is aiming to replicate. |
priors_list |
a named list of priors specified as a |
sim_fn |
a user defined function that takes a set of parameters named
the same as |
scorer_fn |
a user supplied function that matches the following
signature |
n_sims |
The number of simulations to run per wave (for SMC and Adaptive) or overall (for Rejection). For rejection sampling a large number is recommended, for the others sma |
acceptance_rate |
What proportion of simulations to keep in ABC rejection or hard ABC parts of the algorithms. |
... |
must be empty |
max_time |
the maximum time in seconds to spend in ABC waves before admitting defeat. This time may not be all used if the algorithm converges. |
converged_fn |
a function that takes a |
obsscores |
Summary scores for the observational data. This will
be a named list, and is equivalent to the output of |
distance_method |
what metric is used to combine |
seed |
an optional random seed |
knots |
the number of knots to model the CDF with. Optional, and will be typically inferred from the data size. Small numbers tend to work better if we expect the distribution to be unimodal. |
parallel |
parallelise the simulation? If this is set to true then the
simulation step will be parallelised using |
max_recover |
if the effective sample size of SMC or adaptive algorithms
drops below 200, the algorithm will retry the wave with double the sample
size to try and recover the shape of the distribution, up to a maximum of
|
allow_continue |
if SMC or adaptive algorithms have not converged after
|
debug_errors |
Errors that crop up in |
kernel |
one of |
distfit |
one of |
bw |
for Adaptive ABC data distributions are smoothed before modelling empirical CDF. Over smoothing can reduce convergence rate, under-smoothing may result in noisy posterior estimates, and appearance of local modes. |
widen_by |
change the dispersion of the empirical proposal distribution in ABC
adaptive, preserving the median. This is akin to a nonlinear,
heteroscedastic random walk in the quantile space, and can help address
over-fitting or local modes in the ABC adaptive waves. |
scoreweights |
A named vector with names matching output of |
use_proposal_correlation |
When calculating the weight of a particle the proposal correlation structure is available, to help determine how unusual or otherwise a particle is. |
ess_limit |
a numeric vector of length 2 which for ABC adaptive, defines the limits which rate at which the algorithm will converge in terms of effective sample size. If for example the algorithm is converging too quickly and some high weight particles are dominating then the ESS will drop below the lower limit. In this case more particles will be accepted to try and offset this. On the other hand if the algorithm is converging too slowly low probability particles in proposal space are not filtered out quickly enough and this can lead to too much importance being given to unlikely proposals and wide bi-modal peaked posteriors. |
Performs the ABC Adaptive algorithm.
This iterative method refines parameter estimates across waves by fitting
empirical proposal distributions to the weighted posterior samples from the
previous wave. Unlike ABC-SMC, which uses a fixed perturbation kernel,
abc_adaptive constructs a new proposal distribution at
each wave .
Initialization (Wave 1):
Parameters are sampled from the prior .
Simulations are run, summary statistics are computed,
and distances are calculated.
A tolerance threshold is set as the
quantile of these distances.
Unnormalized weights are calculated using a kernel
.
Subsequent Waves ():
Proposal Generation: An empirical joint proposal distribution
is constructed from the weighted posterior sample
of the previous wave.
This is done by fitting marginal empirical distributions
to each parameter , using the
empirical() function with the prior as a link to
enforce support constraints. These marginals are assumed independent,
but their weighted correlation matrix is retained as an
attribute and used to induce dependence in the MVN sampling space.
New proposals are generated by:
Sampling a vector in a
correlated standard normal space.
Mapping each component to uniform space:
.
Mapping to the parameter space using the empirical quantile
functions: .
Simulation and Weighting: Simulations are run for the new proposals.
Distances are computed and the tolerance
is set as the -quantile of the current wave's distances.
The unnormalized weight for particle in wave is calculated as:
where is the prior
density (assuming independence), is the ABC kernel,
and is the
empirical proposal density (also assuming independence for density
calculation, consistent with the marginal fitting). The correlation
structure is handled in the sampling process, and optionally in the density
evaluation.
Normalization: Weights are normalized to sum to one.
The algorithm includes a recovery mechanism: if the Effective Sample Size
(ESS) falls below a threshold (e.g., 200), the acceptance rate is
increased (i.e., is made larger) to accept more particles
and improve the ESS.
Termination: The process repeats until a maximum time is reached or convergence criteria based on parameter stability and credible interval contraction are met.
an S3 object of class abc_fit this contains the following:
type: the type of ABC algorithm
iterations: number of completed iterations
converged: boolean - did the result meet convergence criteria
waves: a list of dataframes of wave convergence metrics
summary: a dataframe with the summary of the parameter fits after each wave.
priors: the priors for the fit as a abc_prior S3 object
posteriors: the final wave posteriors
fit = abc_adaptive( obsdata = example_obsdata(), priors_list = example_priors_list(), sim_fn = example_sim_fn, scorer_fn = example_scorer_fn, n_sims = 1000, acceptance_rate = 0.25, max_time = 5, # 5 seconds to fit within examples limit parallel = FALSE, allow_continue = FALSE ) summary(fit)fit = abc_adaptive( obsdata = example_obsdata(), priors_list = example_priors_list(), sim_fn = example_sim_fn, scorer_fn = example_scorer_fn, n_sims = 1000, acceptance_rate = 0.25, max_time = 5, # 5 seconds to fit within examples limit parallel = FALSE, allow_continue = FALSE ) summary(fit)
abc_fit S3 classA class holding the output of a single ABC model fitting, either as a single ABC rejection round or after a set of SMC waves.
new_abc_fit(type, iterations, converged, priors_list, wave_df, summ_df, sim_df) ## S3 method for class 'abc_fit' format(x, ...) ## S3 method for class 'abc_fit' summary(object, ..., truth = NULL) tidy.abc_fit(x, ...) ## S3 method for class 'abc_fit' print(x, ...) ## S3 method for class 'abc_fit' plot(x, ..., truth = NULL)new_abc_fit(type, iterations, converged, priors_list, wave_df, summ_df, sim_df) ## S3 method for class 'abc_fit' format(x, ...) ## S3 method for class 'abc_fit' summary(object, ..., truth = NULL) tidy.abc_fit(x, ...) ## S3 method for class 'abc_fit' print(x, ...) ## S3 method for class 'abc_fit' plot(x, ..., truth = NULL)
type |
the type of ABC algorithm |
iterations |
number of completed iterations |
converged |
boolean - did the result meet convergence criteria |
priors_list |
a named list of priors specified as a |
wave_df |
a list of dataframes of wave convergence metrics |
summ_df |
the summary of the parameter fits after each wave. |
sim_df |
the final wave posteriors |
x |
a |
... |
passed on to methods
Named arguments passed on to
|
object |
a |
truth |
a named numeric vector of known parameter values |
an S3 object of class abc_fit this contains the following:
type: the type of ABC algorithm
iterations: number of completed iterations
converged: boolean - did the result meet convergence criteria
waves: a list of dataframes of wave convergence metrics
summary: a dataframe with the summary of the parameter fits after each wave.
priors: the priors for the fit as a abc_prior S3 object
posteriors: the final wave posteriors
format(abc_fit): S3 format method
summary(abc_fit): S3 summary method
print(abc_fit): S3 print method
plot(abc_fit): S3 plot method
new_abc_fit(): Create a abc_fit object
tidy.abc_fit(): S3 summary method
abc_prior S3 classabc_prior S3 class
new_abc_prior(.dists, .constraints = list(), .derived = list(), .cor = NULL) as.abc_prior(x, ...) is.abc_prior(x, ...) ## S3 method for class 'abc_prior' format(x, ...) ## S3 method for class 'abc_prior' print(x, ...) ## S3 method for class 'abc_prior' plot(x, ...)new_abc_prior(.dists, .constraints = list(), .derived = list(), .cor = NULL) as.abc_prior(x, ...) is.abc_prior(x, ...) ## S3 method for class 'abc_prior' format(x, ...) ## S3 method for class 'abc_prior' print(x, ...) ## S3 method for class 'abc_prior' plot(x, ...)
.dists |
distribution functions as a named list of S3 |
.constraints |
a list of one sided formulae the result each of which should evaluate to a boolean when compared against the names of the priors and derived values. |
.derived |
a list of two sided formulae. The RHS refer to the priors, and the LHS as a name to derive. |
.cor |
(optional) a correlation matrix for the priors |
x |
an |
... |
passed on to methods |
an S3 object of class abc_prior which contains
a list of dist_fns
a cor attribute describing their correlation
a derived attribute describing derive values
a constraints attribute listing the constraints
a params attribute listing the names of the parameters
format(abc_prior): Format an abc_prior
print(abc_prior): Print an abc_prior
plot(abc_prior): Plot an abc_prior
new_abc_prior(): Create a new prior
as.abc_prior(): Create a prior from a named list of dist_fns
is.abc_prior(): Test is an abc_prior
p = new_abc_prior(
.dists = list(
mean = as.dist_fns("norm",4,2),
sd = as.dist_fns("gamma",2)
),
.derived = list(
shape ~ mean^2 / sd^2,
rate ~ mean / sd^2
),
.constraints = list(
~ mean > sd
)
)
testthat::expect_equal(
format(p),
"Parameters: \n* mean: norm(mean = 4, sd = 2)\n* sd: gamma(shape = 2, rate = 1)\nConstraints:\n* mean > sd\nDerived values:\n* shape = mean^2/sd^2\n* rate = mean/sd^2"
)
This function will execute a simulation for a random selection of parameters
and identify the best matching acceptance_rate percent, as defined by the
summary distance metric. A large number of simulations and a low acceptance
rate are best here.
abc_rejection( obsdata, priors_list, sim_fn, scorer_fn, n_sims, acceptance_rate, ..., converged_fn = default_termination_fn(), obsscores = NULL, distance_method = "euclidean", keep_simulations = FALSE, seed = NULL, parallel = FALSE, debug_errors = FALSE, kernel = "epanechnikov", scoreweights = NULL )abc_rejection( obsdata, priors_list, sim_fn, scorer_fn, n_sims, acceptance_rate, ..., converged_fn = default_termination_fn(), obsscores = NULL, distance_method = "euclidean", keep_simulations = FALSE, seed = NULL, parallel = FALSE, debug_errors = FALSE, kernel = "epanechnikov", scoreweights = NULL )
obsdata |
The observational data. The data in this will typically be a named list, but could be anything, e.g. dataframe. It is the reference data that the simulation model is aiming to replicate. |
priors_list |
a named list of priors specified as a |
sim_fn |
a user defined function that takes a set of parameters named
the same as |
scorer_fn |
a user supplied function that matches the following
signature |
n_sims |
The number of simulations to run per wave (for SMC and Adaptive) or overall (for Rejection). For rejection sampling a large number is recommended, for the others sma |
acceptance_rate |
What proportion of simulations to keep in ABC rejection or hard ABC parts of the algorithms. |
... |
must be empty |
converged_fn |
a function that takes a |
obsscores |
Summary scores for the observational data. This will
be a named list, and is equivalent to the output of |
distance_method |
what metric is used to combine |
keep_simulations |
keep the individual simulation results in the output of an ABC workflow. This can have large implications for the size of the result. It may also not be what you want and it is probably worth considering resampling the posteriors rather than keeping the simulations. |
seed |
an optional random seed |
parallel |
parallelise the simulation? If this is set to true then the
simulation step will be parallelised using |
debug_errors |
Errors that crop up in |
kernel |
one of |
scoreweights |
A named vector with names matching output of |
Parameters are sampled independently from the prior
distribution for .
For each , simulated data
is generated via the simulator function sim_fn, and a vector of summary
statistics
is computed and compared to the observed summary statistics
.
A distance metric is computed. By default,
this is the Euclidean distance:
where is a vector of optional summary statistic weights
(scoreweights), and denotes element-wise multiplication.
Other supported metrics include Manhattan () and Mahalanobis
distance (using the empirical covariance from the first wave).
The tolerance threshold is set to the
quantile of the distances
:
Unnormalized ABC weights are then assigned using a kernel function
:
where is one of the kernels defined in kernels.R
(e.g., Epanechnikov: ).
These weights are then transformed via a logistic ("expit") function and
normalized to sum to one:
The resulting weighted sample approximates
the ABC posterior distribution .
an S3 object of class abc_fit this contains the following:
type: the type of ABC algorithm
iterations: number of completed iterations
converged: boolean - did the result meet convergence criteria
waves: a list of dataframes of wave convergence metrics
summary: a dataframe with the summary of the parameter fits after each wave.
priors: the priors for the fit as a abc_prior S3 object
posteriors: the final wave posteriors
fit = abc_rejection( example_obsdata(), example_priors_list(), example_sim_fn, example_scorer_fn, n_sims = 10000, acceptance_rate = 0.01 ) summary(fit)fit = abc_rejection( example_obsdata(), example_priors_list(), example_sim_fn, example_scorer_fn, n_sims = 10000, acceptance_rate = 0.01 ) summary(fit)
This function will execute a simulation for a random selection of parameters.
Based on the acceptance_rate it will reject a proportion of the results.
The remaining results are weighted (using a kernel with a tolerance
equivalent to half the acceptance rate). Weighted parameter particles
generate proposals for further waves but a particle perturbation. Waves
are executed until a maximum is reached or the results converge sufficiently
that the changes between waves are small. A relatively small number of
simulations may be attempted with a high acceptance rate, over multiple waves.
abc_smc( obsdata, priors_list, sim_fn, scorer_fn, n_sims, acceptance_rate, ..., max_time = 5 * 60, converged_fn = default_termination_fn(), obsscores = NULL, distance_method = "euclidean", seed = NULL, parallel = FALSE, allow_continue = interactive(), debug_errors = FALSE, kernel = "epanechnikov", scoreweights = NULL )abc_smc( obsdata, priors_list, sim_fn, scorer_fn, n_sims, acceptance_rate, ..., max_time = 5 * 60, converged_fn = default_termination_fn(), obsscores = NULL, distance_method = "euclidean", seed = NULL, parallel = FALSE, allow_continue = interactive(), debug_errors = FALSE, kernel = "epanechnikov", scoreweights = NULL )
obsdata |
The observational data. The data in this will typically be a named list, but could be anything, e.g. dataframe. It is the reference data that the simulation model is aiming to replicate. |
priors_list |
a named list of priors specified as a |
sim_fn |
a user defined function that takes a set of parameters named
the same as |
scorer_fn |
a user supplied function that matches the following
signature |
n_sims |
The number of simulations to run per wave (for SMC and Adaptive) or overall (for Rejection). For rejection sampling a large number is recommended, for the others sma |
acceptance_rate |
What proportion of simulations to keep in ABC rejection or hard ABC parts of the algorithms. |
... |
must be empty |
max_time |
the maximum time in seconds to spend in ABC waves before admitting defeat. This time may not be all used if the algorithm converges. |
converged_fn |
a function that takes a |
obsscores |
Summary scores for the observational data. This will
be a named list, and is equivalent to the output of |
distance_method |
what metric is used to combine |
seed |
an optional random seed |
parallel |
parallelise the simulation? If this is set to true then the
simulation step will be parallelised using |
allow_continue |
if SMC or adaptive algorithms have not converged after
|
debug_errors |
Errors that crop up in |
kernel |
one of |
scoreweights |
A named vector with names matching output of |
Performs the ABC Sequential Monte Carlo (SMC) algorithm. This iterative method refines parameter estimates across multiple waves.
Initialization (Wave 1):
Parameters are sampled from the prior .
Simulations are run, summaries
are computed, and distances are calculated.
A tolerance threshold is set as the
quantile of these initial distances.
Unnormalized weights are calculated using a kernel
.
Subsequent Waves ():
Proposal Generation: A particle is selected
from the previous wave's accepted particles with probability proportional
to its weight . The particle is then perturbed in a
transformed MVN space using a multivariate normal kernel with covariance
,
where is the weighted covariance from wave ,
is the parameter dimension, and is the kernel_t parameter.
The new proposal is generated as:
This proposal is mapped back to the original parameter space using the prior's copula transformation (from MVN space defined by prior CDFs).
Simulation and Weighting: Simulations
are run for the new proposals. Distances are computed.
The tolerance is set as the -quantile of
distances from the current wave's simulations. The unnormalized weight
for particle in wave is calculated as:
where is the prior density,
is the ABC kernel, and is the proposal density
from the previous wave's weighted particles (calculated using the
perturbation kernel). This proposal density is computed as a weighted sum:
where is the PDF of a multivariate normal
with mean and covariance .
Normalization: Weights are normalized to sum to one.
Particles with negligible weights are typically filtered out.
Termination: The process repeats until a maximum number of waves or time is reached, or convergence criteria are met based on changes in parameter estimates or effective sample size (ESS).
an S3 object of class abc_fit this contains the following:
type: the type of ABC algorithm
iterations: number of completed iterations
converged: boolean - did the result meet convergence criteria
waves: a list of dataframes of wave convergence metrics
summary: a dataframe with the summary of the parameter fits after each wave.
priors: the priors for the fit as a abc_prior S3 object
posteriors: the final wave posteriors
fit = abc_smc( obsdata = example_obsdata(), priors_list = example_priors_list(), sim_fn = example_sim_fn, scorer_fn = example_scorer_fn, n_sims = 1000, acceptance_rate = 0.25, max_time = 5, # 5 seconds to fit within examples limit parallel = FALSE, allow_continue = FALSE ) summary(fit)fit = abc_smc( obsdata = example_obsdata(), priors_list = example_priors_list(), sim_fn = example_sim_fn, scorer_fn = example_scorer_fn, n_sims = 1000, acceptance_rate = 0.25, max_time = 5, # 5 seconds to fit within examples limit parallel = FALSE, allow_continue = FALSE ) summary(fit)
dist_fns S3 objectA class wrapping a single (or set) of parametrised distributions and provides access to the quantile, cumulative probability and random functions of that specific distribution. Parametrisation is handled on construction.
## S3 method for class 'character' as.dist_fns(x, ...) ## S3 method for class ''function'' as.dist_fns(x, ...) ## S3 method for class 'fitdist' as.dist_fns(x, ...) as.dist_fns(x, ...)## S3 method for class 'character' as.dist_fns(x, ...) ## S3 method for class ''function'' as.dist_fns(x, ...) ## S3 method for class 'fitdist' as.dist_fns(x, ...) as.dist_fns(x, ...)
x |
a |
... |
passed onto methods |
dist_fns and dist_fns_list objects support $ access for fields and @ access
for attributes. dist_fns_lists can be made with the c() or rep() functions,
or with the purrr style map functions, and they support subsetting.
Individual dist_fns members of dist_fns_lists can be accessed with [[.
a dist_fns S3 object
as.dist_fns(character): Construct a distribution by name
as.dist_fns(`function`): From a statistical function
as.dist_fns(fitdist): From a fitdistrplus::fitdist output
pois = as.dist_fns("pois",lambda = 8)
n = as.dist_fns("norm",mean=4)
link_fns S3 objectThe link function class allows forwards and backwards transformation. Link functions can be defined by name or using a statistical distribution in which case the forward link is a logit of the cumulative probability and the reverse is the quantile of the expit.
## S3 method for class 'character' as.link_fns(x, ...) ## S3 method for class 'dist_fns' as.link_fns(x, ...) ## S3 method for class 'family' as.link_fns(x, ...) ## S3 method for class 'numeric' as.link_fns(x, ..., na.rm = TRUE) as.link_fns(x, ...)## S3 method for class 'character' as.link_fns(x, ...) ## S3 method for class 'dist_fns' as.link_fns(x, ...) ## S3 method for class 'family' as.link_fns(x, ...) ## S3 method for class 'numeric' as.link_fns(x, ..., na.rm = TRUE) as.link_fns(x, ...)
x |
a range of values that for the support |
... |
ignored |
na.rm |
remove NAs when estimating mean and sd for data driven link functions |
A link_fns S3 object encapsulates a monotonic transformation function
, its inverse , and their derivatives and
. It also defines the support (domain) of the
original space and the range of the transformed space.
The function dispatches based on the input x:
character: Selects standard links (e.g., "log", "logit", "probit", "identity").
For example, "log" defines with support .
dist_fns: Defines the link as the logit of the CDF and the quantile of the expit:
, ,
where and are the CDF and quantile functions from the dist_fns object.
The support is determined by the quantile function's range (e.g., ).
family (from stats): Uses the link function and its inverse from the GLM family object.
numeric: If length 2, interprets as a support range and creates a logit-like
transformation mapping this range to : .
If all values are finite and length > 2, creates a standardization link: ,
where and are the mean and standard deviation of the input vector.
a link_fns S3 object
as.link_fns(character): Link function from name
as.link_fns(dist_fns): Link function from name
as.link_fns(family): Link function from name
as.link_fns(numeric): Link function from support vector
links = c("ident", "log", "logit", "probit", "cloglog", "neginv", "inv2")
test = seq(0.1,0.9,0.1) # within support of all links
for (l in links) {
lfn = as.link_fns(l)
t = lfn$trans(test)
i = lfn$inv(t)
testthat::expect_equal(i,test)
}
dist_fns S3 object or dist_fns_listsConcatenate a dist_fns S3 object or dist_fns_lists
## S3 method for class 'dist_fns' c(...)## S3 method for class 'dist_fns' c(...)
... |
some of |
a dist_fns_list
link_fns S3 object or link_fns_listsConcatenate a link_fns S3 object or link_fns_lists
## S3 method for class 'link_fns' c(...)## S3 method for class 'link_fns' c(...)
... |
some of |
a link_fns_list
This function takes reference data in the form, for example of count data, and returns a crated function to calculate the mean squared error from simulated data to the observed data.
where are the simulated data points, are the observed
data points, and is the number of data points. Both input vectors
must be of the same length. Missing values (NA) are removed
pairwise before calculation. Returns NA if the input vectors are
not of equal length.
calculate_rmse(sim, obs)calculate_rmse(sim, obs)
sim |
A vector of simulated counts |
obs |
A vector of observed counts |
The square root of the average squared distance between simulation and observation. Simulation and observation must be the same length.
# zero if no distance
testthat::expect_equal(calculate_rmse(0:10, 0:10), 0)
testthat::expect_equal(
calculate_rmse(rep(5, 11), 0:10),
3.16227766016838
)
withr::with_seed(100, {
ref = rnorm(1000)
cmp1 = rnorm(1000)
cmp2 = rnorm(1000, sd=2)
})
breaks = seq(-2,2,length.out=8)
nref = table(cut(ref,breaks))
ncmp1 = table(cut(cmp1,breaks))
ncmp2 = table(cut(cmp2,breaks))
tmp1 = calculate_rmse(ncmp1, nref)
tmp2 = calculate_rmse(ncmp2, nref)
testthat::expect_equal(tmp1, 10.3578817470424)
testthat::expect_equal(tmp2, 68.8403951179829)
# example case counts from an exponential growth process sim = table(floor(rexpgrowth(1000, 0.05, 40, 0))) obs = table(floor(rexpgrowth(1000, 0.075, 40, 0))) obs2 = table(floor(rexpgrowth(1000, 0.05, 40, 0))) # obs is a different distribution to sim (larger growth) calculate_rmse(sim, obs) # obs2 is from the same distribution as sim so the RMSE should be lower: calculate_rmse(sim, obs2)# example case counts from an exponential growth process sim = table(floor(rexpgrowth(1000, 0.05, 40, 0))) obs = table(floor(rexpgrowth(1000, 0.075, 40, 0))) obs2 = table(floor(rexpgrowth(1000, 0.05, 40, 0))) # obs is a different distribution to sim (larger growth) calculate_rmse(sim, obs) # obs2 is from the same distribution as sim so the RMSE should be lower: calculate_rmse(sim, obs2)
This function takes simulation and observed data and calculates a normalised Wasserstein distance.
calculate_wasserstein(sim, obs, debias = FALSE, bootstraps = 1)calculate_wasserstein(sim, obs, debias = FALSE, bootstraps = 1)
sim |
A vector of simulated data points |
obs |
A vector of observed data points |
debias |
Should the simulations be shifted to match the mean of the observed data |
bootstraps |
Randomly resample from the simulated data points to match the observed size this many times and combine the output by averaging. The alternative, when this is 1 (the default) matches the sizes by selecting and/or repeating the simulated data points in order (deterministically) |
In the comparison unequal lengths of the data can be accommodated. The simulated data is sorted and linearly interpolated to the same length as the observed data before the comparison.
where are the ordered observed data points, are the
simulated data points after matching size (potentially via interpolation),
debiasing (optional), and sorting, is the number of observed
points, and is the standard deviation of the observed data
(used for normalisation).
Size matching via linear interpolation:
Let be the sorted simulated
data of length , and be the target length.
Indices are generated as
for . Then is calculated as:
where . If ,
then .
For bootstrapping (bootstraps > 1), the process is repeated bootstraps
times with random sampling, and the final distance is the average of the results.
a length normalised wasserstein distance. This is the average distance an individual simulated data point must be shifted to match the observed data normalised by the average distance of the observed data from the mean.
testthat::expect_equal(calculate_wasserstein(0:10, 10:0), 0)
# zero if no distance
testthat::expect_equal(calculate_wasserstein(0:10, 0:10), 0)
testthat::expect_equal(calculate_wasserstein(10:0, 0:10), 0)
# normalised so that all mass at mean = 1
ref = mean(abs(0:10-5))/sd(0:10)
testthat::expect_equal(calculate_wasserstein(rep(5, 11), 0:10), ref)
# smaller sample recycled and normalises to same value
testthat::expect_equal(calculate_wasserstein(rep(5, 5), 0:10), ref)
# should be ((0+1+0+1+0+0+0+1+0+1+0) / 11) / ((5+4+3+2+1+0+1+2+3+4+5) / 11) = 0.1333...
testthat::expect_equal(
calculate_wasserstein(c(0, 0, 2, 2, 4, 5, 6, 8, 8, 10, 10), 0:10),
0.109640488937369
)
withr::with_seed(100, {
ref = rnorm(1000)
cmp1 = rnorm(1000)
cmp2 = rnorm(1000, sd=2)
cmp3 = rnorm(100)
})
testthat::expect_equal(
calculate_wasserstein(cmp1, ref),
0.0458673541615467
)
testthat::expect_equal(
calculate_wasserstein(cmp2, ref),
0.835125114099775
)
testthat::expect_equal(
calculate_wasserstein(cmp3, ref),
0.180171816429487
)
tmp = withr::with_seed(100, {
calculate_wasserstein(cmp1,ref,bootstraps=10)
})
testthat::expect_equal(tmp, 0.0678606864134826)
gen = function(n, mean=0, sd=1) {
sample(pnorm(seq(-1,1,length.out = n - n
}
# there should be approximately zero
testthat::expect_equal(calculate_wasserstein(gen(1000), gen(1000)), 0)
testthat::expect_equal(calculate_wasserstein(gen(1000), gen(100)), 0)
testthat::expect_equal(
calculate_wasserstein(gen(100), gen(1000)),
0,tolerance = 0.01
)
testthat::expect_equal(
calculate_wasserstein(gen(200), gen(1000)),
0,tolerance = 0.01
)
# these should be approximately equal:
tmp2 = max(abs(diff(c(
calculate_wasserstein(gen(100,0.1), gen(1000)),
calculate_wasserstein(gen(200,0.1), gen(1000)),
calculate_wasserstein(gen(1000,0.1), gen(1000)),
calculate_wasserstein(gen(1000, 0.1), gen(200)),
calculate_wasserstein(gen(1000, 0.1), gen(100))
))))
testthat::expect_equal(tmp2, 0, tolerance = 0.01)
# example case timeseries from an exponential growth process sim = rexpgrowth(1000, 0.05, 40, 0) obs = rexpgrowth(1000, 0.075, 40, 0) obs2 = rexpgrowth(1000, 0.05, 40, 0) # obs is a different distribution to sim (larger growth) calculate_wasserstein(sim, obs) # obs2 is from the same distribution as sim so the distance should be lower: calculate_wasserstein(sim, obs2)# example case timeseries from an exponential growth process sim = rexpgrowth(1000, 0.05, 40, 0) obs = rexpgrowth(1000, 0.075, 40, 0) obs2 = rexpgrowth(1000, 0.05, 40, 0) # obs is a different distribution to sim (larger growth) calculate_wasserstein(sim, obs) # obs2 is from the same distribution as sim so the distance should be lower: calculate_wasserstein(sim, obs2)
Density, distribution function, quantile function and random
generation for the Beta distribution with parameters shape1 and
shape2 (and optional non-centrality parameter ncp).
dbeta2(x, prob, kappa, log = FALSE)dbeta2(x, prob, kappa, log = FALSE)
x |
vector of quantiles |
prob |
the mean probability (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
log |
logical; if TRUE, probabilities p are given as log(p). |
dbeta gives the density, pbeta the distribution
function, qbeta the quantile function, and rbeta
generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rbeta, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
dbeta2(c(0.25,0.5,0.75), 0.5, 0.25)dbeta2(c(0.25,0.5,0.75), 0.5, 0.25)
The following conversion describes the parameters mean and kappa
dcgamma(x, mean, kappa = 1/mean, log = FALSE)dcgamma(x, mean, kappa = 1/mean, log = FALSE)
x |
vector of quantiles |
mean |
the mean value on the true scale (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
log |
logical; if TRUE, probabilities p are given as log(p). |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
dcgamma(seq(0,4,0.25), 2, 0.5)dcgamma(seq(0,4,0.25), 2, 0.5)
Convergence is assessed on firstly whether the central estimate of the parameters being assessed is stable, and not changing from one wave to the next, and secondly if the 95 percent credible interval is stable between waves. If the parameter central estimate is stationary but the credible intervals are still dropping then continuing simulation may get better estimates of confidence.
default_termination_fn(stability = 0.01, confidence = 0.1)default_termination_fn(stability = 0.01, confidence = 0.1)
stability |
how close do sequential estimates need to be before declaring them as a good set of parameter estimates. This is in the units of the parameter. If this is given as a single number it applies to all parameters equally, alternatively a named vector can be used to set parameter specific cutoffs. |
confidence |
how stable do the 95% confidence intervals need to be before we are happy with the parameter estimates. This is in the scale of the parameters, but represents a change in IQR from wave to wave. If this is given as a single number it applies to all parameters equally, alternatively a named vector can be used to set parameter specific cutoffs. |
a function that specifies the convergence.
# A more permissive definition of convergence has # less strict stability criteria (sequential estimates varying by less than 5%) # and confidence intervals not changing by more than 1 unit between waves.) check = default_termination_fn(0.05, 1.0) fit = example_smc_fit() # This is performed as an integral part of the SMC and adaptive # fitting and is here only for example last_wave_metrics = utils::tail(fit$waves,1) converged = check( wave = 0, summary = last_wave_metrics$summary[[1]], per_param = last_wave_metrics$per_param[[1]] ) if (isTRUE(converged)) print("Converged (permissive definition)")# A more permissive definition of convergence has # less strict stability criteria (sequential estimates varying by less than 5%) # and confidence intervals not changing by more than 1 unit between waves.) check = default_termination_fn(0.05, 1.0) fit = example_smc_fit() # This is performed as an integral part of the SMC and adaptive # fitting and is here only for example last_wave_metrics = utils::tail(fit$waves,1) converged = check( wave = 0, summary = last_wave_metrics$summary[[1]], per_param = last_wave_metrics$per_param[[1]] ) if (isTRUE(converged)) print("Converged (permissive definition)")
Density, distribution function, quantile function and random
generation for the Gamma distribution with parameters shape and
scale.
dgamma2(x, mean, sd = sqrt(mean), log = FALSE)dgamma2(x, mean, sd = sqrt(mean), log = FALSE)
x |
vector of quantiles |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
log |
logical; if TRUE, probabilities p are given as log(p). |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
dgamma2(seq(0,4,0.25), 2, 1)dgamma2(seq(0,4,0.25), 2, 1)
dist_fns_list
Boilerplate S3 class operation
dist_fns()dist_fns()
an empty dist_fns_list
Density, distribution function, quantile function and random
generation for the log normal distribution whose logarithm has mean
equal to meanlog and standard deviation equal to sdlog.
dlnorm2(x, mean = 1, sd = sqrt(exp(1) - 1), log = FALSE)dlnorm2(x, mean = 1, sd = sqrt(exp(1) - 1), log = FALSE)
x |
vector of quantiles |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
log |
logical; if TRUE, probabilities p are given as log(p). |
The log normal distribution has density
where and are the mean and standard
deviation of the logarithm.
The mean is ,
the median is , and the variance
and hence the coefficient of variation is
which is
approximately when that is small (e.g., ).
dlnorm gives the density,
plnorm gives the distribution function,
qlnorm gives the quantile function, and
rlnorm generates random deviates.
The length of the result is determined by n for
rlnorm, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
The cumulative hazard
is -plnorm(t, r, lower = FALSE, log = TRUE).
dlnorm is calculated from the definition (in ‘Details’).
[pqr]lnorm are based on the relationship to the normal.
Consequently, they model a single point mass at exp(meanlog)
for the boundary case sdlog = 0.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, volume 1, chapter 14. Wiley, New York.
Distributions for other standard distributions, including
dnorm for the normal distribution.
dlnorm2(seq(0,4,0.25), 2, 1)dlnorm2(seq(0,4,0.25), 2, 1)
The logit-normal distribution has a support of 0 to 1.
dlogitnorm(x, meanlogit = 0, sdlogit = 1, log = FALSE)dlogitnorm(x, meanlogit = 0, sdlogit = 1, log = FALSE)
x |
vector of quantiles |
meanlogit |
the mean on the logit scale |
sdlogit |
the sd on the logit scale |
log |
logical; if TRUE, probabilities p are given as log(p). |
a vector of probabilities, quantiles, densities or samples.
dlogitnorm(seq(0.1,0.9,0.1), 0, 1)dlogitnorm(seq(0.1,0.9,0.1), 0, 1)
The logit-normal distribution has a support of 0 to 1.
dlogitnorm2(x, prob.0.5 = 0.5, kappa = 1 - exp(-1), log = FALSE)dlogitnorm2(x, prob.0.5 = 0.5, kappa = 1 - exp(-1), log = FALSE)
x |
vector of quantiles |
prob.0.5 |
the median on the true scale |
kappa |
a dispersion parameter from 0 (none) to 1 maximum dispersion |
log |
logical; if TRUE, probabilities p are given as log(p). |
a vector of probabilities, quantiles, densities or samples.
q = seq(0.1,0.9,0.1) eps = sqrt(.Machine$double.eps) dlogitnorm2(q, 0.75, 0.2) (plogitnorm2(q+eps, 0.75, 0.2) - plogitnorm2(q-eps, 0.75, 0.2)) / (2*eps)q = seq(0.1,0.9,0.1) eps = sqrt(.Machine$double.eps) dlogitnorm2(q, 0.75, 0.2) (plogitnorm2(q+eps, 0.75, 0.2) - plogitnorm2(q-eps, 0.75, 0.2)) / (2*eps)
Density, distribution function, quantile function and random
generation for the negative binomial distribution with parameters
size and prob.
dnbinom2(x, mean, sd = sqrt(mean), log = FALSE)dnbinom2(x, mean, sd = sqrt(mean), log = FALSE)
x |
vector of (non-negative integer) quantiles. |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
log |
logical; if TRUE, probabilities p are given as log(p). |
The negative binomial distribution with size and
prob has density
for , and .
This represents the number of failures which occur in a sequence of
Bernoulli trials before a target number of successes is reached.
The mean is and variance .
A negative binomial distribution can also arise as a mixture of
Poisson distributions with mean distributed as a gamma distribution
(see pgamma) with scale parameter (1 - prob)/prob
and shape parameter size. (This definition allows non-integer
values of size.)
An alternative parametrization (often used in ecology) is by the
mean mu (see above), and size, the dispersion
parameter, where prob = size/(size+mu). The variance
is mu + mu^2/size in this parametrization.
If an element of x is not integer, the result of dnbinom
is zero, with a warning.
The case size == 0 is the distribution concentrated at zero.
This is the limiting distribution for size approaching zero,
even if mu rather than prob is held constant. Notice
though, that the mean of the limit distribution is 0, whatever the
value of mu.
The quantile is defined as the smallest value such that
, where is the distribution function.
dnbinom gives the density,
pnbinom gives the distribution function,
qnbinom gives the quantile function, and
rnbinom generates random deviates.
Invalid size or prob will result in return value
NaN, with a warning.
The length of the result is determined by n for
rnbinom, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
rnbinom returns a vector of type integer unless generated
values exceed the maximum representable integer when double
values are returned.
dnbinom computes via binomial probabilities, using code
contributed by Catherine Loader (see dbinom).
pnbinom uses pbeta.
qnbinom uses the Cornish–Fisher Expansion to include a skewness
correction to a normal approximation, followed by a search.
rnbinom uses the derivation as a gamma mixture of Poisson
distributions, see
Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer-Verlag, New York. Page 480.
Distributions for standard distributions, including
dbinom for the binomial, dpois for the
Poisson and dgeom for the geometric distribution, which
is a special case of the negative binomial.
dnbinom2(0:5, 2, sqrt(2))dnbinom2(0:5, 2, sqrt(2))
Null distributions always returns NA
dnull(x, ...)dnull(x, ...)
x |
vector of quantiles |
... |
not used |
dnull(c(0.25,0.5,0.75), 0.5, 0.25)dnull(c(0.25,0.5,0.75), 0.5, 0.25)
The wedge distribution has a support of 0 to 1 and has a linear probability density function over that support.
dwedge(x, a, lower.tail = TRUE, log = FALSE)dwedge(x, a, lower.tail = TRUE, log = FALSE)
x |
vector of quantiles |
a |
a gradient from -2 (left skewed) to 2 (right skewed) |
lower.tail |
logical; if TRUE (default), probabilities are |
log |
logical; if TRUE, probabilities p are given as log(p). |
The rwedge can be combined with quantile functions to
skew standard distributions, or introduce correlation or down weight
certain parts of the distribution.
a vector of probabilities, quantiles, densities or samples.
pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )
This creates statistical distribution functions fitting to data in a
transformed space. Inputs can either be weighted data observations (x,w pairs)
of a CDF (x,p pairs). X axis transformation is specified in the link
parameter and is either something like "log", "logit", etc or can also be
specified as a statistical distribution, or even a length 2 numeric vector
defining support.
empirical( x, ..., w = NULL, p = NULL, link = "ident", fit_spline = !is.null(knots), knots = NULL, name = NULL )empirical( x, ..., w = NULL, p = NULL, link = "ident", fit_spline = !is.null(knots), knots = NULL, name = NULL )
x |
either a vector of samples from a distribution |
... |
Named arguments passed on to
Named arguments passed on to
|
w |
for data fits, a vector the same length as |
p |
for CDF fits, a vector the same length as |
link |
a link function. Either as name, or a |
fit_spline |
for distributions from data should we fit a spline to
reduce memory usage and speed up sampling? If |
knots |
for spline fitting from data how many points do we use to model the cdf? By default this will be estimated from the data. I recommend an uneven number, without a lot of data this will tend to overfit 9 knots for 1000 samples seems OK, max 7 for 250, 5 for 100. Less is usually more. |
name |
a label for this distribution. If not given one will be generated from the distribution and parameters. This can be used as part of plotting. |
If the empirical distribution if from data it is fully transformed to a
logit-z space where the cumulative weighted data is interpolated with a
kernel weighted linear model.
In the case of CDF data the empirical distribution is fitted with a piecewise
linear or monotonically increasing spline fit to data in Q-Q space. The
spline is bounded by 0 and 1 in both dimensions.
This function imputes tails of distributions within the constraints of the link functions. Given perfect data as samples or as quantiles it should well approximate the tail.
If p is provided, data is treated as CDF points .
The function calls empirical_cdf(x, p, ...) internally. This involves:
Transformation: values are mapped via a link function to
a standardized scale .
Interpolation: Monotonic splines (or piecewise linear functions if
smooth=FALSE) are fitted between pairs in Q-Q space,
yielding functions and .
Tail Extrapolation: The fit is extended to and if
necessary.
Final Functions: and
.
If w is provided (or p is NULL), data is treated as weighted samples .
The function calls empirical_data(x, w, ...) internally. This involves:
Transformation: values are mapped via a link function to
.
Standardization: Values are standardized .
Weighted CDF: Empirical CDF is calculated from .
Logit Transformation: .
Local Fitting: locfit is used to fit models between and .
Final Functions: Composed from fitted models and the inverse link .
If fit_spline=TRUE (or knots is specified) when fitting from data,
the resulting empirical_data fit is re-interpolated using empirical_cdf
at quantiles defined by knots.
a dist_fns S3 object containing functions p() for CDF, q()
for quantile, and r() for a RNG, and d() for density. The density
function may be discontinuous.
# from samples:
withr::with_seed(123,{
e2 = empirical(rnorm(10000))
testthat::expect_equal(e2$p(-5:5), pnorm(-5:5), tolerance=0.01)
testthat::expect_equal(e2$q(seq(0,1,0.1)), qnorm(seq(0,1,0.1)), tolerance=0.05)
})
p2 = seq(0,1,0.1)
testthat::expect_equal( e2$p(e2$q(p2)), p2, tolerance = 0.001)
# updating a prior, with a horribly skewed gamma distribution
# not a great fit but not great data
withr::with_seed(124,{
data = rgamma(500,1)
e4 = empirical(data, link=as.dist_fns("unif",0,10))
if (interactive()) plot(e4)+ggplot2::geom_function(fun = ~ dgamma(.x, 1))
testthat::expect_equal(mean(e4$r(10000)), 1, tolerance=0.1)
e5 = empirical(data,link="log")
testthat::expect_equal(mean(e5$r(10000)), 1, tolerance=0.1)
if (interactive()) plot(e5)+ggplot2::geom_function(fun = ~ dgamma(.x, 1))
})
withr::with_seed(123,{
data = c(rnorm(200,4),rnorm(200,7))
weights = c(rep(0.1,200), rep(0.3,200))
e6 = empirical(x=data,w = weights)
plot(e6)
testthat::expect_equal(
format(e6),
"empirical; Median (IQR) 6.57 [5.2 — 7.43]"
)
})
# Construct a normal using a sequence and density as weight.
e7 = empirical(x=seq(-10,10,length.out=1000),w=dnorm(seq(-10,10,length.out=1000)),knots = 20)
testthat::expect_equal(e7$p(-5:5), pnorm(-5:5), tolerance=0.01)
# A random sample from a distribution: sample = rgamma2(1000, mean=5, sd=2) # fit direct from data fit = empirical(sample, link="log") plot(fit)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2)) # suppose we only have quantiles p = seq(0.1,0.9, 0.1) quantiles = quantile(sample, p) # fit from quantiles: fit2 = empirical(x=quantiles,p=p, link="log") plot(fit2)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2)) # fit weighted data samples = seq(0,10,0.1) weights = dgamma2(samples, mean=5, sd=2) fit3 = empirical(x=samples, w=weights, link="log") plot(fit3)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2))# A random sample from a distribution: sample = rgamma2(1000, mean=5, sd=2) # fit direct from data fit = empirical(sample, link="log") plot(fit)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2)) # suppose we only have quantiles p = seq(0.1,0.9, 0.1) quantiles = quantile(sample, p) # fit from quantiles: fit2 = empirical(x=quantiles,p=p, link="log") plot(fit2)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2)) # fit weighted data samples = seq(0,10,0.1) weights = dgamma2(samples, mean=5, sd=2) fit3 = empirical(x=samples, w=weights, link="log") plot(fit3)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2))
This fits a CDF and quantile function to cumulative probabilities in a transformed space. X value
transformation is specified in the link parameter and is either something
like "log", "logit", etc or can also be specified as the logit transformed cdf and quantile function from a statistical distribution.
empirical_cdf(x, p, link = "ident", smooth = TRUE, name = NULL, ...)empirical_cdf(x, p, link = "ident", smooth = TRUE, name = NULL, ...)
x |
either a vector of samples from a distribution |
p |
for CDF fits, a vector the same length as |
link |
a link function. Either as name, or a |
smooth |
fits the empirical distribution with a pair of splines for CDF
and quantile function, creating a mostly smooth density. This smoothness
comes at the price of potential over-fitting and will produce small
differences between |
name |
a label for this distribution. If not given one will be generated from the distribution and parameters. This can be used as part of plotting. |
... |
not used |
The empirical distribution fitted is a piecewise linear or monotonically
increasing spline fit to CDF in transformed X and logit Y space. The end
points are linearly interpolated in this space to the tail_pth quantile.
The function can fit data provided as x, P(X<=x) pairs.
Constructs an empirical distribution from CDF points by fitting
monotonic splines in a transformed quantile–quantile (Q–Q) space.
The input is first mapped to a standardized probability scale via a
link-dependent transformation . The resulting pairs
are then used to build two monotonic interpolation functions:
where and are constructed as follows:
If smooth = FALSE, both are piecewise linear interpolants
(via stats::approx).
If smooth = TRUE, both are strictly monotonic cubic splines
fitted using the "monoH.FC" method from stats::splinefun,
converted to polynomial spline form (polySpline). Monotonicity is
enforced by requiring and to be strictly increasing after
tie-breaking perturbations.
Tail extrapolation is applied by linearly extending the first and last
segments in Q–Q space to the points and if not already
included. The final distribution functions are:
where and are derived from the link argument.
This function imputes tails of distributions. Given perfect data as samples or as quantiles it should recover the tail
a dist_fns S3 object containing functions p() for CDF, q()
for quantile, and r() for a RNG, and d() for density. The density
function may be discontinuous.
#from cdf:
xs = c(2,3,6,9)
ps = c(0.1,0.4,0.6,0.95)
e = empirical_cdf(xs, ps, link="log")
testthat::expect_equal(e$p(xs), ps)
testthat::expect_equal(e$q(ps), xs)
# quantiles:
p = c(0.025,0.05,0.10,0.25,0.5,0.75,0.9,0.95,0.975)
q = stats::qgamma(p, shape=2)
shape2_gamma = as.dist_fns(pgamma, shape=2)
gemp = empirical_cdf(q,p,link = shape2_gamma)
withr::with_seed(123, {
testthat::expect_equal(mean(gemp$r(100000)),2, tolerance=0.01)
testthat::expect_equal(sd(gemp$r(100000)), sqrt(2), tolerance=0.01)
})
# With perfect input can recover the underlying distribution including tails:
tmp = empirical_cdf(x=seq(0.01,0.99,0.01),p=seq(0.01,0.99,0.01),link = as.dist_fns(punif,0, 1), knots = 100)
testthat::expect_equal(
tmp$q(c(0.01, 0.1, 0.25, 0.75, 0.9, 0.99)),
c(0.01, 0.1, 0.25, 0.75, 0.9, 0.99),
tolerance = 0.002
)
# bimodal with log link and end defined
tmp3 = empirical_cdf(x = 1:7, c(0.1,0.3,0.4,0.4,0.5,0.9,1),link="log")
testthat::expect_equal(
tmp3$p(-1:8),
c(0, 0, 0.1, 0.3, 0.4, 0.4, 0.5, 0.9, 1, 1),
tolerance = 0.01
)
testthat::expect_equal(
tmp3$q(seq(0, 1, 0.2)),
c(0, 1.63476839034733, 3, 5.05332769159444, 5.51891960240613, 7)
)
# A random sample from a distribution: sample = rgamma2(1000, mean=5, sd=2) # suppose we only have quantiles p = seq(0.1,0.9, 0.1) quantiles = quantile(sample, p) # fit from quantiles: fit2 = empirical(x=quantiles,p=p, link="log") plot(fit2)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2))# A random sample from a distribution: sample = rgamma2(1000, mean=5, sd=2) # suppose we only have quantiles p = seq(0.1,0.9, 0.1) quantiles = quantile(sample, p) # fit from quantiles: fit2 = empirical(x=quantiles,p=p, link="log") plot(fit2)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2))
This fits a CDF and quantile function to ranked data in a
transformed space. X value transformation is specified in the link
parameter and is either something like "log", "logit", etc.
empirical_data( x, w = NULL, link = "identity", ..., name = NULL, bw = wbw.nrd(x, w)^sqrt(2) )empirical_data( x, w = NULL, link = "identity", ..., name = NULL, bw = wbw.nrd(x, w)^sqrt(2) )
x |
either a vector of samples from a distribution |
w |
for data fits, a vector the same length as |
link |
a link function. Either as name, or a |
... |
Named arguments passed on to
Named arguments passed on to
|
name |
a label for this distribution. If not given one will be generated from the distribution and parameters. This can be used as part of plotting. |
bw |
a bandwidth expressed in terms of the probability width, or proportion of observations. |
The empirical distribution fitted is a piecewise linear in z transformed X and logit Y space. The evaluation points are linearly interpolated in this space given a bandwidth for interpolation.
Link Transformation: Input data x is transformed using the specified
link function: , where is the transformation defined
by the link argument.
Standardization (Z-space): Transformed values are standardized:
, where and
are the weighted mean and standard deviation of .
Weighted Empirical CDF: The cumulative weights are calculated to form
probabilities .
Logit Transformation: CDF probabilities are transformed:
.
Local Fitting (via .logit_z_locfit): Local likelihood models (using locfit)
are fitted between and to represent the CDF, its derivative
(density), and the inverse (quantile) function in the transformed space.
Function Construction: The final p, q, r, and d functions are
constructed by composing the fitted models from step 5 with the inverse
link transformation . For example, the final CDF is
, and the final
quantile function is , where
is the quantile function derived from the fitted models in Z-space.
This function imputes tails of distributions. Given perfect data as samples or as quantiles it should recover the tail
a dist_fns S3 object that function that contains statistical
distribution functions for this data.
# from samples:
withr::with_seed(123,{
e2 = empirical_data(rnorm(10000), bw=0.1)
testthat::expect_equal(e2$p(-5:5), pnorm(-5:5), tolerance=0.01)
testthat::expect_equal(e2$d(-5:5), dnorm(-5:5), tolerance=0.05)
testthat::expect_equal(e2$q(seq(0,1,0.1)), qnorm(seq(0,1,0.1)), tolerance=0.025)
})
# Construct a normal using a sequence and density as weight.
e7 = empirical_data(
x=seq(-10,10,length.out=1000),
w=dnorm(seq(-10,10,length.out=1000))
)
plot(e7)+ggplot2::geom_function(fun = dnorm)
testthat::expect_equal(e7$p(-5:5), pnorm(-5:5), tolerance=0.01)
# A random sample from a distribution: sample = rgamma2(1000, mean=5, sd=2) # fit direct from data fit = empirical(sample, link="log") plot(fit)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2)) # fit weighted data samples = seq(0,10,0.1) weights = dgamma2(samples, mean=5, sd=2) fit3 = empirical(x=samples, w=weights, link="log") plot(fit3)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2))# A random sample from a distribution: sample = rgamma2(1000, mean=5, sd=2) # fit direct from data fit = empirical(sample, link="log") plot(fit)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2)) # fit weighted data samples = seq(0,10,0.1) weights = dgamma2(samples, mean=5, sd=2) fit3 = empirical(x=samples, w=weights, link="log") plot(fit3)+ggplot2::geom_function(fun= ~ dgamma2(.x, mean=5, sd=2))
Run the SMC or adaptive algorithm for a set number of waves
fixed_wave_termination_fn(max_wave)fixed_wave_termination_fn(max_wave)
max_wave |
the number of waves to run |
A function that will test for completion
# Declare converged after 3 waves: converged_fn = fixed_wave_termination_fn(3)# Declare converged after 3 waves: converged_fn = fixed_wave_termination_fn(3)
dist_fns S3 objectFormat a dist_fns S3 object
## S3 method for class 'dist_fns' format(x, ..., digits = 3)## S3 method for class 'dist_fns' format(x, ..., digits = 3)
x |
a |
... |
passed onto methods |
digits |
the number of significant digits |
a character value
link_fns S3 objectFormat a link_fns S3 object
## S3 method for class 'link_fns' format(x, ...)## S3 method for class 'link_fns' format(x, ...)
x |
a |
... |
passed onto methods |
a character value
dist_fns S3 objectCheck if this is a dist_fns S3 object
is.dist_fns(x)is.dist_fns(x)
x |
anything |
TRUE or FALSE
dist_fns_list S3 objectCheck if this is a dist_fns_list S3 object
is.dist_fns_list(x)is.dist_fns_list(x)
x |
anything |
TRUE or FALSE
link_fns S3 objectCheck if this is a link_fns S3 object
is.link_fns(x)is.link_fns(x)
x |
anything |
TRUE or FALSE
link_fns_list S3 objectCheck if this is a link_fns_list S3 object
is.link_fns_list(x)is.link_fns_list(x)
x |
anything |
TRUE or FALSE
Calculate the excess kurtosis of a set of data
kurtosis(x, na.rm = FALSE, excess = TRUE)kurtosis(x, na.rm = FALSE, excess = TRUE)
x |
a vector of observations |
na.rm |
remove |
excess |
if false calculates raw kurtosis rather than excess |
the excess kurtosis
kurtosis(stats::rnorm(1000)) kurtosis(stats::rpois(1000, 2)) # leptokurtic > 0 (usually) kurtosis(stats::runif(1000)) # platykurtic: < 0 kurtosis(stats::rlnorm(1000))kurtosis(stats::rnorm(1000)) kurtosis(stats::rpois(1000, 2)) # leptokurtic > 0 (usually) kurtosis(stats::runif(1000)) # platykurtic: < 0 kurtosis(stats::rlnorm(1000))
link_fns_list
Boilerplate S3 class operation
link_fns()link_fns()
an empty link_fns_list
dist_fns_list
Analogous to purrr::map_dbl()
map_dist_fns(.x, .f, ..., .progress = FALSE)map_dist_fns(.x, .f, ..., .progress = FALSE)
.x |
A list or atomic vector. |
.f |
a function to apply that returns a |
... |
Additional arguments passed on to the mapped function. We now generally recommend against using # Instead of x |> map(f, 1, 2, collapse = ",") # do: x |> map(\(x) f(x, 1, 2, collapse = ",")) This makes it easier to understand which arguments belong to which function and will tend to yield better error messages. |
.progress |
Whether to show a progress bar. Use |
a dist_fns_list
link_fns_list
Analogous to purrr::map_dbl()
map_link_fns(.x, .f, ..., .progress = FALSE)map_link_fns(.x, .f, ..., .progress = FALSE)
.x |
A list or atomic vector. |
.f |
a function to apply that returns a |
... |
Additional arguments passed on to the mapped function. We now generally recommend against using # Instead of x |> map(f, 1, 2, collapse = ",") # do: x |> map(\(x) f(x, 1, 2, collapse = ",")) This makes it easier to understand which arguments belong to which function and will tend to yield better error messages. |
.progress |
Whether to show a progress bar. Use |
a link_fns_list
dist_fns_list
Analogous to purrr::map2_dbl()
map2_dist_fns(.x, .y, .f, ..., .progress = FALSE)map2_dist_fns(.x, .y, .f, ..., .progress = FALSE)
.x, .y
|
A pair of vectors, usually the same length. If not, a vector of length 1 will be recycled to the length of the other. |
.f |
a function to apply to each |
... |
Additional arguments passed on to the mapped function. We now generally recommend against using # Instead of x |> map(f, 1, 2, collapse = ",") # do: x |> map(\(x) f(x, 1, 2, collapse = ",")) This makes it easier to understand which arguments belong to which function and will tend to yield better error messages. |
.progress |
Whether to show a progress bar. Use |
a dist_fns_list
link_fns_list
Analogous to purrr::map2_dbl()
map2_link_fns(.x, .y, .f, ..., .progress = FALSE)map2_link_fns(.x, .y, .f, ..., .progress = FALSE)
.x, .y
|
A pair of vectors, usually the same length. If not, a vector of length 1 will be recycled to the length of the other. |
.f |
a function to apply to each |
... |
Additional arguments passed on to the mapped function. We now generally recommend against using # Instead of x |> map(f, 1, 2, collapse = ",") # do: x |> map(\(x) f(x, 1, 2, collapse = ",")) This makes it easier to understand which arguments belong to which function and will tend to yield better error messages. |
.progress |
Whether to show a progress bar. Use |
a link_fns_list
Constructs a mixture distribution from a list of component distributions
with corresponding weights .
The CDF of the mixture is a weighted sum of the component CDFs:
where is the CDF of the -th component distribution and
. The implementation first evaluates the weighted CDF on a grid
of values (including tail points defined by tail_p and potentially
knot points from empirical components). The resulting
pairs are then used as input to empirical_cdf to create the final smooth or
piecewise linear dist_fns object representing the mixture distribution.
mixture(dists, weights = 1, steps = 200, tail_p = 1e-04, ..., name = "mixture")mixture(dists, weights = 1, steps = 200, tail_p = 1e-04, ..., name = "mixture")
dists |
a |
weights |
a vector of weights |
steps |
the number of points that the mixture distribution is evaluated at to construct the empirical mixture |
tail_p |
the support fo the tail of the distribution |
... |
Named arguments passed on to
|
name |
a name for the mixture |
a dist_fn of the mixture distribution
dists = c(
as.dist_fns("norm", mean=-1),
as.dist_fns("norm", mean=1),
as.dist_fns("gamma", shape=2)
)
weights = c(1,1,0.3)
mix = mixture(dists,weights)
testthat::expect_equal(
format(mix),
"mixture; Median (IQR) 0.289 [-0.886 — 1.31]"
)
dists = c( as.dist_fns("norm", mean=-1), as.dist_fns("norm", mean=1), as.dist_fns("gamma", shape=2) ) weights = c(1,1,0.3) mix = mixture(dists,weights) plot(c(dists,mix))+ggplot2::facet_wrap(~name)dists = c( as.dist_fns("norm", mean=-1), as.dist_fns("norm", mean=1), as.dist_fns("gamma", shape=2) ) weights = c(1,1,0.3) mix = mixture(dists,weights) plot(c(dists,mix))+ggplot2::facet_wrap(~name)
Density, distribution function, quantile function and random
generation for the Beta distribution with parameters shape1 and
shape2 (and optional non-centrality parameter ncp).
pbeta2(q, prob, kappa, lower.tail = TRUE, log.p = FALSE)pbeta2(q, prob, kappa, lower.tail = TRUE, log.p = FALSE)
q |
vector of quantiles |
prob |
the mean probability (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
dbeta gives the density, pbeta the distribution
function, qbeta the quantile function, and rbeta
generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rbeta, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
pbeta2(c(0.25,0.5,0.75), 0.5, 0.25)pbeta2(c(0.25,0.5,0.75), 0.5, 0.25)
The following conversion describes the parameters mean and kappa
pcgamma(q, mean, kappa = 1/mean, lower.tail = TRUE, log.p = FALSE)pcgamma(q, mean, kappa = 1/mean, lower.tail = TRUE, log.p = FALSE)
q |
vector of quantiles |
mean |
the mean value on the true scale (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
pcgamma(seq(0,4,0.25), 2, 0.5)pcgamma(seq(0,4,0.25), 2, 0.5)
Density, distribution function, quantile function and random
generation for the Gamma distribution with parameters shape and
scale.
pgamma2(q, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)pgamma2(q, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)
q |
vector of quantiles |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
pgamma2(seq(0,4,0.25), 2, 1)pgamma2(seq(0,4,0.25), 2, 1)
Density, distribution function, quantile function and random
generation for the log normal distribution whose logarithm has mean
equal to meanlog and standard deviation equal to sdlog.
plnorm2(q, mean = 1, sd = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)plnorm2(q, mean = 1, sd = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)
q |
vector of quantiles |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are
|
log.p |
logical; if TRUE, probabilities p are given as log(p). |
The log normal distribution has density
where and are the mean and standard
deviation of the logarithm.
The mean is ,
the median is , and the variance
and hence the coefficient of variation is
which is
approximately when that is small (e.g., ).
dlnorm gives the density,
plnorm gives the distribution function,
qlnorm gives the quantile function, and
rlnorm generates random deviates.
The length of the result is determined by n for
rlnorm, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
The cumulative hazard
is -plnorm(t, r, lower = FALSE, log = TRUE).
dlnorm is calculated from the definition (in ‘Details’).
[pqr]lnorm are based on the relationship to the normal.
Consequently, they model a single point mass at exp(meanlog)
for the boundary case sdlog = 0.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, volume 1, chapter 14. Wiley, New York.
Distributions for other standard distributions, including
dnorm for the normal distribution.
plnorm2(seq(0,4,0.25), 2, 1)plnorm2(seq(0,4,0.25), 2, 1)
The logit-normal distribution has a support of 0 to 1.
plogitnorm(q, meanlogit = 0, sdlogit = 1, lower.tail = TRUE, log.p = FALSE)plogitnorm(q, meanlogit = 0, sdlogit = 1, lower.tail = TRUE, log.p = FALSE)
q |
vector of quantiles |
meanlogit |
the mean on the logit scale |
sdlogit |
the sd on the logit scale |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
a vector of probabilities, quantiles, densities or samples.
plogitnorm(seq(0.1,0.9,0.1), 0, 1)plogitnorm(seq(0.1,0.9,0.1), 0, 1)
The logit-normal distribution has a support of 0 to 1.
plogitnorm2( q, prob.0.5 = 0.5, kappa = 1 - exp(-1), lower.tail = TRUE, log.p = FALSE )plogitnorm2( q, prob.0.5 = 0.5, kappa = 1 - exp(-1), lower.tail = TRUE, log.p = FALSE )
q |
vector of quantiles |
prob.0.5 |
the median on the true scale |
kappa |
a dispersion parameter from 0 (none) to 1 maximum dispersion |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
a vector of probabilities, quantiles, densities or samples.
plogitnorm2(seq(0.1,0.9,0.1), 0.75, 0.2) plogitnorm2(qlogitnorm2(seq(0.1,0.9,0.1), 0.75, 0.2), 0.75, 0.2)plogitnorm2(seq(0.1,0.9,0.1), 0.75, 0.2) plogitnorm2(qlogitnorm2(seq(0.1,0.9,0.1), 0.75, 0.2), 0.75, 0.2)
Plot convergence metrics by wave for SMC and adaptive ABC
plot_convergence(fit)plot_convergence(fit)
fit |
A S3 |
a patchwork plot of convergence metrics
plot_convergence( example_adaptive_fit() )plot_convergence( example_adaptive_fit() )
A parameter posterior correlation plot
plot_correlations(posteriors_df, truth = NULL)plot_correlations(posteriors_df, truth = NULL)
posteriors_df |
a dataframe of posteriors that have been selected by ABC
this may include columns for scores, weight and/or simulation outputs (
|
truth |
a named numeric vector of known parameter values |
a patchwork of ggplots including density and 2d scatters for each combination of posteriors.
p = plot_correlations( example_adaptive_fit(), example_truth() ) p & ggplot2::theme( axis.title.y = ggplot2::element_text(angle=70,vjust=0.5), axis.title.x = ggplot2::element_text(angle=20,hjust=0.5) )p = plot_correlations( example_adaptive_fit(), example_truth() ) p & ggplot2::theme( axis.title.y = ggplot2::element_text(angle=70,vjust=0.5), axis.title.x = ggplot2::element_text(angle=20,hjust=0.5) )
Plot the evolution of the density function by wave for SMC and adaptive ABC
plot_evolution(fit, truth = NULL, ..., what = c("posteriors", "proposals"))plot_evolution(fit, truth = NULL, ..., what = c("posteriors", "proposals"))
fit |
A S3 |
truth |
a named numeric vector of known parameter values |
... |
passed on to methods
Named arguments passed on to
|
what |
plot posterior densities or proposal distributions? |
a plot of the density functions by wave
plot_evolution( example_adaptive_fit(), example_truth() )plot_evolution( example_adaptive_fit(), example_truth() )
Spaghetti plot of resampled posterior fits
plot_simulations(obsdata, fit, sim_fn, method = "auto", n = 200, ...)plot_simulations(obsdata, fit, sim_fn, method = "auto", n = 200, ...)
obsdata |
The observational data. The data in this will typically be a named list, but could be anything, e.g. dataframe. It is the reference data that the simulation model is aiming to replicate. |
fit |
A S3 |
sim_fn |
a user defined function that takes a set of parameters named
the same as |
method |
one of |
n |
the number of simulations to plot |
... |
Named arguments passed on to
|
a patchwork of ggplots
plot_simulations( example_obsdata(), example_adaptive_fit(), example_sim_fn )plot_simulations( example_obsdata(), example_adaptive_fit(), example_sim_fn )
dist_fns S3 objectPlot a dist_fns S3 object
## S3 method for class 'dist_fns' plot(x, ...)## S3 method for class 'dist_fns' plot(x, ...)
x |
a |
... |
Named arguments passed on to
|
a ggplot
dist_fns_list S3 objectPlot a smoothed version of the PDFs of a set of dist_fns. These are ggplots
that can be facetted by names, id or group
## S3 method for class 'dist_fns_list' plot( x, ..., mapping = .gg_check_for_aes(...), steps = 200, tail = 0.001, plot_quantiles = TRUE, smooth = TRUE )## S3 method for class 'dist_fns_list' plot( x, ..., mapping = .gg_check_for_aes(...), steps = 200, tail = 0.001, plot_quantiles = TRUE, smooth = TRUE )
x |
a |
... |
passed to |
mapping |
override default aesthetics with |
steps |
resolution of the plot |
tail |
the minimum tail probability to plot |
plot_quantiles |
by default the quantiles of the distribution are plotted over the density sometimes this makes it hard to read |
smooth |
by default some additional smoothing is used to cover up small discontinuities in the PDF. |
a ggplot
dist_fns_list
Analogous to purrr::pmap_dbl()
pmap_dist_fns(.l, .f, ..., .progress = FALSE)pmap_dist_fns(.l, .f, ..., .progress = FALSE)
.l |
A list of vectors. The length of Vectors of length 1 will be recycled to any length; all other elements must be have the same length. A data frame is an important special case of |
.f |
a function to apply to each |
... |
Additional arguments passed on to the mapped function. We now generally recommend against using # Instead of x |> map(f, 1, 2, collapse = ",") # do: x |> map(\(x) f(x, 1, 2, collapse = ",")) This makes it easier to understand which arguments belong to which function and will tend to yield better error messages. |
.progress |
Whether to show a progress bar. Use |
a dist_fns_list
link_fns_list
Analogous to purrr::pmap_dbl()
pmap_link_fns(.l, .f, ..., .progress = FALSE)pmap_link_fns(.l, .f, ..., .progress = FALSE)
.l |
A list of vectors. The length of Vectors of length 1 will be recycled to any length; all other elements must be have the same length. A data frame is an important special case of |
.f |
a function to apply to each |
... |
Additional arguments passed on to the mapped function. We now generally recommend against using # Instead of x |> map(f, 1, 2, collapse = ",") # do: x |> map(\(x) f(x, 1, 2, collapse = ",")) This makes it easier to understand which arguments belong to which function and will tend to yield better error messages. |
.progress |
Whether to show a progress bar. Use |
a link_fns_list
Density, distribution function, quantile function and random
generation for the negative binomial distribution with parameters
size and prob.
pnbinom2(q, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)pnbinom2(q, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)
q |
vector of quantiles. |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are
|
log.p |
logical; if TRUE, probabilities p are given as log(p). |
The negative binomial distribution with size and
prob has density
for , and .
This represents the number of failures which occur in a sequence of
Bernoulli trials before a target number of successes is reached.
The mean is and variance .
A negative binomial distribution can also arise as a mixture of
Poisson distributions with mean distributed as a gamma distribution
(see pgamma) with scale parameter (1 - prob)/prob
and shape parameter size. (This definition allows non-integer
values of size.)
An alternative parametrization (often used in ecology) is by the
mean mu (see above), and size, the dispersion
parameter, where prob = size/(size+mu). The variance
is mu + mu^2/size in this parametrization.
If an element of x is not integer, the result of dnbinom
is zero, with a warning.
The case size == 0 is the distribution concentrated at zero.
This is the limiting distribution for size approaching zero,
even if mu rather than prob is held constant. Notice
though, that the mean of the limit distribution is 0, whatever the
value of mu.
The quantile is defined as the smallest value such that
, where is the distribution function.
dnbinom gives the density,
pnbinom gives the distribution function,
qnbinom gives the quantile function, and
rnbinom generates random deviates.
Invalid size or prob will result in return value
NaN, with a warning.
The length of the result is determined by n for
rnbinom, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
rnbinom returns a vector of type integer unless generated
values exceed the maximum representable integer when double
values are returned.
dnbinom computes via binomial probabilities, using code
contributed by Catherine Loader (see dbinom).
pnbinom uses pbeta.
qnbinom uses the Cornish–Fisher Expansion to include a skewness
correction to a normal approximation, followed by a search.
rnbinom uses the derivation as a gamma mixture of Poisson
distributions, see
Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer-Verlag, New York. Page 480.
Distributions for standard distributions, including
dbinom for the binomial, dpois for the
Poisson and dgeom for the geometric distribution, which
is a special case of the negative binomial.
pnbinom2(0:5, 2, sqrt(2))pnbinom2(0:5, 2, sqrt(2))
Null distributions always returns NA
pnull(q, ...)pnull(q, ...)
q |
vector of quantiles |
... |
not used |
pnull(c(0.25,0.5,0.75), 0.5, 0.25)pnull(c(0.25,0.5,0.75), 0.5, 0.25)
The component scores are summary statistics output by the user supplied
scorer_fn as a named list. These can be variable in scale and location and
various options exist for combining them. They may need to be weighted by
scale as well as importance to get a model that works well. Such weights can
be input into the ABC algortihms using the scoreweights parameter. This
function helps provide diagnostics for calibrating the scoreweights
parameter.
posterior_distance_metrics(posteriors_df, obsscores = NULL, keep_data = FALSE)posterior_distance_metrics(posteriors_df, obsscores = NULL, keep_data = FALSE)
posteriors_df |
a dataframe of posteriors that have been selected by ABC
this may include columns for scores, weight and/or simulation outputs (
|
obsscores |
Summary scores for the observational data. This will
be a named list, and is equivalent to the output of |
keep_data |
mainly for internal use this flag gives you the component scores as a matrix |
Given a list of component scores
from simulations and observed scores ,
this function calculates:
The mean and standard deviation of each score component :
The covariance matrix of the score components.
The root mean squared difference (RMSD) between simulated and observed scores for each component:
A recommended vector of scoreweights , calculated as the ratio of the component's
standard deviation to its RMSD, normalized to sum to 1:
These weights help balance the influence of different summary statistics in the overall distance metric, especially when using Euclidean distance.
a list containing the following items:
obsscores: the input reference scores for each component
means, sds: the means and sds of each score component
cov: the covariance matrix for the scores
mad: the mean absolute differences between the simscores and the obsscores
rmsd: the root mean squared differences between the simscores and the obsscores
scoreweights: a the sds divided by the rmsd. This weight should
mean that the weights of the individual summary scores have similar influence
in the overall abc_summary_distance output once combined during SMC and
adaptive waves, especially if euclidean distances are involved.
simscores: (if keep_data) a matrix of all the scores from these input
simulation posteriors
deltascores: (if keep_data) a matrix of the differences between simscores and obsscores.
fit = example_rejection_fit() metrics = posterior_distance_metrics(fit$posteriors) # other elements available but this is the most important and tells you what # the relative sizes of the component scores from `scorer_fn` in this sample. # If this is a sample from the prior then this gives us a way to judge the # most appropriate relative weighting of each component: metrics$scoreweightsfit = example_rejection_fit() metrics = posterior_distance_metrics(fit$posteriors) # other elements available but this is the most important and tells you what # the relative sizes of the component scores from `scorer_fn` in this sample. # If this is a sample from the prior then this gives us a way to judge the # most appropriate relative weighting of each component: metrics$scoreweights
This function allows "updating" of the prior with a posterior distribution from the same family as the prior with updated parameters.
posterior_fit_analytical(posteriors_df, priors_list)posterior_fit_analytical(posteriors_df, priors_list)
posteriors_df |
a dataframe of posteriors that have been selected by ABC
this may include columns for scores, weight and/or simulation outputs (
|
priors_list |
a named list of priors specified as a |
This takes weighted posterior samples
for each parameter and fits an analytical distribution
function that approximates the posterior marginal for
that parameter.
The resulting empirical distribution is a dist_fns object that
includes the support constraints from the original prior. The set of all fitted
marginal distributions forms the new proposal list for the next
wave.
Additionally, the weighted covariance matrix of the samples in the
MVN space (defined by the prior copula) is calculated:
where is the quantile function of the standard normal.
This covariance matrix is stored as an attribute ("cor") of the returned
proposal list and is used to induce correlation structure when sampling
new proposals from the empirical distributions in subsequent waves.
an abc_prior S3 object approximating the distribution of the
posterior samples from the same family as the prior.
fit = example_smc_fit() proposals = posterior_fit_analytical(fit$posteriors, fit$priors) proposalsfit = example_smc_fit() proposals = posterior_fit_analytical(fit$posteriors, fit$priors) proposals
This function allows "updating" of the prior with an empirical posterior distribution which will retain the bounds of the prior, and is effectively a spline based transform of the prior distribution, based on the density of data in prior space. This gives a clean density when data is close to a prior distribution limit and work better than a standard density.
posterior_fit_empirical( posteriors_df, priors_list, knots = NULL, bw = 0.1, widen_by = 1 )posterior_fit_empirical( posteriors_df, priors_list, knots = NULL, bw = 0.1, widen_by = 1 )
posteriors_df |
a dataframe of posteriors that have been selected by ABC
this may include columns for scores, weight and/or simulation outputs (
|
priors_list |
a named list of priors specified as a |
knots |
the number of knots to model the CDF with. Optional, and will be typically inferred from the data size. Small numbers tend to work better if we expect the distribution to be unimodal. |
bw |
for Adaptive ABC data distributions are smoothed before modelling empirical CDF. Over smoothing can reduce convergence rate, under-smoothing may result in noisy posterior estimates, and appearance of local modes. |
widen_by |
change the dispersion of the empirical proposal distribution in ABC
adaptive, preserving the median. This is akin to a nonlinear,
heteroscedastic random walk in the quantile space, and can help address
over-fitting or local modes in the ABC adaptive waves. |
This takes weighted posterior samples
for each parameter and constructs an empirical distribution
function that approximates the posterior marginal for
that parameter.
For each parameter , the empirical distribution is
fitted using the empirical() function. The fitting is performed on the
weighted samples . A key feature is the use
of a link function during the fitting process. This link
function is typically taken from the original prior distribution
(i.e., ). This ensures that the fitted empirical distribution
respects the support constraints defined by the prior (e.g., positive
values for a log-normal prior). The fitting effectively occurs in the transformed
space defined by the link: .
The resulting empirical distribution is a dist_fns object that
includes the support constraints from the original prior. The set of all fitted
marginal distributions forms the new proposal list for the next
wave.
Additionally, the weighted covariance matrix of the samples in the
MVN space (defined by the prior copula) is calculated:
where is the quantile function of the standard normal.
This covariance matrix is stored as an attribute ("cor") of the returned
proposal list and is used to induce correlation structure when sampling
new proposals from the empirical distributions in subsequent waves.
an abc_prior S3 object approximating the distribution of the
posterior samples, and adhering to the support of the provided priors.
fit = example_smc_fit() proposals = posterior_fit_empirical(fit$posteriors, fit$priors) proposalsfit = example_smc_fit() proposals = posterior_fit_empirical(fit$posteriors, fit$priors) proposals
Once an ABC model fitting is complete the simulation data is generally only one possible realisation of the parameters, which has been selected for closeness. To properly compare the output with the observed data we need a set of posterior re-samples, which are selected from posteriors according to importance.
posterior_resample( posteriors_df, sim_fn, n_resamples = 1, seed = NULL, parallel = FALSE, max_samples = 200 )posterior_resample( posteriors_df, sim_fn, n_resamples = 1, seed = NULL, parallel = FALSE, max_samples = 200 )
posteriors_df |
a dataframe of posteriors that have been selected by ABC
this may include columns for scores, weight and/or simulation outputs (
|
sim_fn |
a user defined function that takes a set of parameters named
the same as |
n_resamples |
the number of resamples for each parameter combination. |
seed |
an optional random seed |
parallel |
parallelise the simulation? If this is set to true then the
simulation step will be parallelised using |
max_samples |
the maximum total number of resamples to pick. |
a dataframe of the posteriors with an abc_sim list column
containing the output of sim_fn called with the parameters in that row.
fit = example_adaptive_fit() sample_df = posterior_resample( fit$posteriors, sim_fn = example_sim_fn ) # the fitted simulations are in the `abc_sim` column sim1 = sample_df$abc_sim[[1]] sim1 %>% lapply(head, 10)fit = example_adaptive_fit() sample_df = posterior_resample( fit$posteriors, sim_fn = example_sim_fn ) # the fitted simulations are in the `abc_sim` column sim1 = sample_df$abc_sim[[1]] sim1 %>% lapply(head, 10)
Calculate a basket of summaries from a weighted list of posterior samples
posterior_summarise( posteriors_df, priors_list, p = c(0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975) )posterior_summarise( posteriors_df, priors_list, p = c(0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975) )
posteriors_df |
a dataframe of posteriors that have been selected by ABC
this may include columns for scores, weight and/or simulation outputs (
|
priors_list |
a named list of priors specified as a |
p |
a |
a dataframe indexed by parameter with useful summary metrics.
fit = example_adaptive_fit() summ = posterior_summarise(fit$posteriors, fit$priors) summ %>% dplyr::glimpse()fit = example_adaptive_fit() summ = posterior_summarise(fit$posteriors, fit$priors) summ %>% dplyr::glimpse()
abc_prior S3 objects are used to hold the specification of prior and
intermediate proposal distributions. They are inputs to the main abc_...()
workflow functions.
priors(...)priors(...)
... |
a list of formulae. Two sided will be interpreted as distribution
or derived value specifications. One sided as constraints between parameters.
A distribution is specified as the name of the family of statistical distributions
and their parameters: e.g.: |
an S3 object of class abc_prior which contains
a list of dist_fns
a cor attribute describing their correlation
a derived attribute describing derive values
a constraints attribute listing the constraints
a params attribute listing the names of the parameters
p = priors( mean ~ tidyabc::rgamma2(4,2), sd ~ gamma2(2,1), shape ~ mean^2 / sd^2, rate ~ mean / sd^2, ~ mean > sd ) print(p) # Plot methods are also provided: if (interactive()) plot(p) # constraints: p@constraintsp = priors( mean ~ tidyabc::rgamma2(4,2), sd ~ gamma2(2,1), shape ~ mean^2 / sd^2, rate ~ mean / sd^2, ~ mean > sd ) print(p) # Plot methods are also provided: if (interactive()) plot(p) # constraints: p@constraints
The wedge distribution has a support of 0 to 1 and has a linear probability density function over that support.
pwedge(q, a, log.p = FALSE)pwedge(q, a, log.p = FALSE)
q |
vector of quantiles |
a |
a gradient from -2 (left skewed) to 2 (right skewed) |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
The rwedge can be combined with quantile functions to
skew standard distributions, or introduce correlation or down weight
certain parts of the distribution.
a vector of probabilities, quantiles, densities or samples.
pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )
Density, distribution function, quantile function and random
generation for the Beta distribution with parameters shape1 and
shape2 (and optional non-centrality parameter ncp).
qbeta2(p, prob, kappa, lower.tail = TRUE, log.p = FALSE)qbeta2(p, prob, kappa, lower.tail = TRUE, log.p = FALSE)
p |
vector of probabilities |
prob |
the mean probability (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
dbeta gives the density, pbeta the distribution
function, qbeta the quantile function, and rbeta
generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rbeta, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
qbeta2(c(0.25,0.5,0.75), 0.5, 0.25)qbeta2(c(0.25,0.5,0.75), 0.5, 0.25)
The following conversion describes the parameters mean and kappa
qcgamma(p, mean, kappa = 1/mean, lower.tail = TRUE, log.p = FALSE)qcgamma(p, mean, kappa = 1/mean, lower.tail = TRUE, log.p = FALSE)
p |
vector of probabilities |
mean |
the mean value on the true scale (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
qcgamma(c(0.25,0.5,0.75), 2, 0.5)qcgamma(c(0.25,0.5,0.75), 2, 0.5)
Density, distribution function, quantile function and random
generation for the Gamma distribution with parameters shape and
scale.
qgamma2(p, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)qgamma2(p, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)
p |
vector of probabilities |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
qgamma2(c(0.25,0.5,0.75), 2, 1)qgamma2(c(0.25,0.5,0.75), 2, 1)
Density, distribution function, quantile function and random
generation for the log normal distribution whose logarithm has mean
equal to meanlog and standard deviation equal to sdlog.
qlnorm2(p, mean = 1, sd = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)qlnorm2(p, mean = 1, sd = sqrt(exp(1) - 1), lower.tail = TRUE, log.p = FALSE)
p |
vector of probabilities. |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are
|
log.p |
logical; if TRUE, probabilities p are given as log(p). |
The log normal distribution has density
where and are the mean and standard
deviation of the logarithm.
The mean is ,
the median is , and the variance
and hence the coefficient of variation is
which is
approximately when that is small (e.g., ).
dlnorm gives the density,
plnorm gives the distribution function,
qlnorm gives the quantile function, and
rlnorm generates random deviates.
The length of the result is determined by n for
rlnorm, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
The cumulative hazard
is -plnorm(t, r, lower = FALSE, log = TRUE).
dlnorm is calculated from the definition (in ‘Details’).
[pqr]lnorm are based on the relationship to the normal.
Consequently, they model a single point mass at exp(meanlog)
for the boundary case sdlog = 0.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, volume 1, chapter 14. Wiley, New York.
Distributions for other standard distributions, including
dnorm for the normal distribution.
qlnorm2(c(0.25,0.5,0.72), 2, 1)qlnorm2(c(0.25,0.5,0.72), 2, 1)
The logit-normal distribution has a support of 0 to 1.
qlogitnorm(p, meanlogit = 0, sdlogit = 1, lower.tail = TRUE, log.p = FALSE)qlogitnorm(p, meanlogit = 0, sdlogit = 1, lower.tail = TRUE, log.p = FALSE)
p |
vector of probabilities |
meanlogit |
the mean on the logit scale |
sdlogit |
the sd on the logit scale |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
a vector of probabilities, quantiles, densities or samples.
qlogitnorm(c(0.25,0.5,0.75), 0, 1)qlogitnorm(c(0.25,0.5,0.75), 0, 1)
The logit-normal distribution has a support of 0 to 1.
qlogitnorm2( p, prob.0.5 = 0.5, kappa = 1 - exp(-1), lower.tail = TRUE, log.p = FALSE )qlogitnorm2( p, prob.0.5 = 0.5, kappa = 1 - exp(-1), lower.tail = TRUE, log.p = FALSE )
p |
vector of probabilities |
prob.0.5 |
the median on the true scale |
kappa |
a dispersion parameter from 0 (none) to 1 maximum dispersion |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
a vector of probabilities, quantiles, densities or samples.
qlnorm2(c(0.25,0.5,0.72), 2, 1)qlnorm2(c(0.25,0.5,0.72), 2, 1)
Density, distribution function, quantile function and random
generation for the negative binomial distribution with parameters
size and prob.
qnbinom2(p, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)qnbinom2(p, mean, sd = sqrt(mean), lower.tail = TRUE, log.p = FALSE)
p |
vector of probabilities. |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
lower.tail |
logical; if TRUE (default), probabilities are
|
log.p |
logical; if TRUE, probabilities p are given as log(p). |
The negative binomial distribution with size and
prob has density
for , and .
This represents the number of failures which occur in a sequence of
Bernoulli trials before a target number of successes is reached.
The mean is and variance .
A negative binomial distribution can also arise as a mixture of
Poisson distributions with mean distributed as a gamma distribution
(see pgamma) with scale parameter (1 - prob)/prob
and shape parameter size. (This definition allows non-integer
values of size.)
An alternative parametrization (often used in ecology) is by the
mean mu (see above), and size, the dispersion
parameter, where prob = size/(size+mu). The variance
is mu + mu^2/size in this parametrization.
If an element of x is not integer, the result of dnbinom
is zero, with a warning.
The case size == 0 is the distribution concentrated at zero.
This is the limiting distribution for size approaching zero,
even if mu rather than prob is held constant. Notice
though, that the mean of the limit distribution is 0, whatever the
value of mu.
The quantile is defined as the smallest value such that
, where is the distribution function.
dnbinom gives the density,
pnbinom gives the distribution function,
qnbinom gives the quantile function, and
rnbinom generates random deviates.
Invalid size or prob will result in return value
NaN, with a warning.
The length of the result is determined by n for
rnbinom, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
rnbinom returns a vector of type integer unless generated
values exceed the maximum representable integer when double
values are returned.
dnbinom computes via binomial probabilities, using code
contributed by Catherine Loader (see dbinom).
pnbinom uses pbeta.
qnbinom uses the Cornish–Fisher Expansion to include a skewness
correction to a normal approximation, followed by a search.
rnbinom uses the derivation as a gamma mixture of Poisson
distributions, see
Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer-Verlag, New York. Page 480.
Distributions for standard distributions, including
dbinom for the binomial, dpois for the
Poisson and dgeom for the geometric distribution, which
is a special case of the negative binomial.
qnbinom2(c(0.25,0.5,0.75), 5, sqrt(5))qnbinom2(c(0.25,0.5,0.75), 5, sqrt(5))
Null distributions always returns NA
qnull(p, ...)qnull(p, ...)
p |
vector of probabilities |
... |
not used |
qnull(c(0.25,0.5,0.75), 0.5, 0.25)qnull(c(0.25,0.5,0.75), 0.5, 0.25)
The wedge distribution has a support of 0 to 1 and has a linear probability density function over that support.
qwedge(p, a, lower.tail = TRUE, log.p = FALSE)qwedge(p, a, lower.tail = TRUE, log.p = FALSE)
p |
vector of probabilities |
a |
a gradient from -2 (left skewed) to 2 (right skewed) |
lower.tail |
logical; if TRUE (default), probabilities are |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
The rwedge can be combined with quantile functions to
skew standard distributions, or introduce correlation or down weight
certain parts of the distribution.
a vector of probabilities, quantiles, densities or samples.
pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )
A random Bernoulli sample as a logical value
rbern(n, prob)rbern(n, prob)
n |
number of observations |
prob |
the mean probability (vectorised) |
a vector of logical values of size n
table(rbern(100, 0.25))table(rbern(100, 0.25))
Density, distribution function, quantile function and random
generation for the Beta distribution with parameters shape1 and
shape2 (and optional non-centrality parameter ncp).
rbeta2(n, prob, kappa)rbeta2(n, prob, kappa)
n |
number of observations |
prob |
the mean probability (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
dbeta gives the density, pbeta the distribution
function, qbeta the quantile function, and rbeta
generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rbeta, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
rbeta2(3, c(0.1,0.5,0.9),0.1)rbeta2(3, c(0.1,0.5,0.9),0.1)
Sampling from the multinomial equivalent of the Bernoulli distribution
rcategorical(n, prob, factor = FALSE)rcategorical(n, prob, factor = FALSE)
n |
a sample size |
prob |
a (optionally named) vector of probabilities that will be normalised to sum to 1 |
factor |
if not FALSE then factor levels will either be taken from the
names of |
a vector of random class labels of length n. Labels come from names
of prob or from a character vector in factor.
prob = c("one"=0.1,"two"=0.2,"seven"=0.7) table(rcategorical(1000,prob)) rcategorical(10,prob,factor=TRUE) rcategorical(10,rep(1,26),factor=letters)prob = c("one"=0.1,"two"=0.2,"seven"=0.7) table(rcategorical(1000,prob)) rcategorical(10,prob,factor=TRUE) rcategorical(10,rep(1,26),factor=letters)
The following conversion describes the parameters mean and kappa
rcgamma(n, mean, kappa = 1/mean)rcgamma(n, mean, kappa = 1/mean)
n |
number of observations |
mean |
the mean value on the true scale (vectorised) |
kappa |
a coefficient of variation. where 0 is no variability and 1 is maximally variability (vectorised) |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
rcgamma(10, 2, 0.5)rcgamma(10, 2, 0.5)
Randomly sample incident times in an exponentially growing process
rexpgrowth(n, r, t_end, t_start = 0)rexpgrowth(n, r, t_end, t_start = 0)
n |
the number of items to sample |
r |
and exponential growth rate (per unit time) |
t_end |
the end of the observation period |
t_start |
the start of the observation period |
a vector of n samples from an exponential growth process
graphics::hist(rexpgrowth(1000,0.1,40), breaks=40) graphics::hist(rexpgrowth(1000,-0.1,40), breaks=40)graphics::hist(rexpgrowth(1000,0.1,40), breaks=40) graphics::hist(rexpgrowth(1000,-0.1,40), breaks=40)
Randomly sample incident times in an exponentially growing process with initial case load
rexpgrowthI0(I0, r, t_end, t_start = 0)rexpgrowthI0(I0, r, t_end, t_start = 0)
I0 |
the expected number of cases observed in the first day |
r |
and exponential growth rate (per unit time) |
t_end |
the end of the observation period |
t_start |
the start of the observation period |
a vector of n samples from an exponential growth process
graphics::hist(rexpgrowthI0(10,0.1,20), breaks=40) graphics::hist(rexpgrowthI0(1000,-0.1,40), breaks=40)graphics::hist(rexpgrowthI0(10,0.1,20), breaks=40) graphics::hist(rexpgrowthI0(1000,-0.1,40), breaks=40)
Density, distribution function, quantile function and random
generation for the Gamma distribution with parameters shape and
scale.
rgamma2(n, mean, sd = sqrt(mean))rgamma2(n, mean, sd = sqrt(mean))
n |
number of observations |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
dgamma gives the density,
pgamma gives the distribution function,
qgamma gives the quantile function, and
rgamma generates random deviates.
Invalid arguments will result in return value NaN, with a warning.
The length of the result is determined by n for
rgamma, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
rgamma2(10, 2, 1)rgamma2(10, 2, 1)
Density, distribution function, quantile function and random
generation for the log normal distribution whose logarithm has mean
equal to meanlog and standard deviation equal to sdlog.
rlnorm2(n, mean = 1, sd = sqrt(exp(1) - 1))rlnorm2(n, mean = 1, sd = sqrt(exp(1) - 1))
n |
number of observations. If |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
The log normal distribution has density
where and are the mean and standard
deviation of the logarithm.
The mean is ,
the median is , and the variance
and hence the coefficient of variation is
which is
approximately when that is small (e.g., ).
dlnorm gives the density,
plnorm gives the distribution function,
qlnorm gives the quantile function, and
rlnorm generates random deviates.
The length of the result is determined by n for
rlnorm, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
The cumulative hazard
is -plnorm(t, r, lower = FALSE, log = TRUE).
dlnorm is calculated from the definition (in ‘Details’).
[pqr]lnorm are based on the relationship to the normal.
Consequently, they model a single point mass at exp(meanlog)
for the boundary case sdlog = 0.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, volume 1, chapter 14. Wiley, New York.
Distributions for other standard distributions, including
dnorm for the normal distribution.
rlnorm2(10, 2, 1)rlnorm2(10, 2, 1)
The logit-normal distribution has a support of 0 to 1.
rlogitnorm(n, meanlogit = 0, sdlogit = 1)rlogitnorm(n, meanlogit = 0, sdlogit = 1)
n |
number of observations |
meanlogit |
the mean on the logit scale |
sdlogit |
the sd on the logit scale |
a vector of probabilities, quantiles, densities or samples.
rlogitnorm(10, 0, 1)rlogitnorm(10, 0, 1)
The logit-normal distribution has a support of 0 to 1.
rlogitnorm2(n, prob.0.5 = 0.5, kappa = 1 - exp(-1))rlogitnorm2(n, prob.0.5 = 0.5, kappa = 1 - exp(-1))
n |
number of observations |
prob.0.5 |
the median on the true scale |
kappa |
a dispersion parameter from 0 (none) to 1 maximum dispersion |
a vector of probabilities, quantiles, densities or samples.
mean(rlogitnorm2(10000,0.75,0.2))mean(rlogitnorm2(10000,0.75,0.2))
Density, distribution function, quantile function and random
generation for the negative binomial distribution with parameters
size and prob.
rnbinom2(n, mean, sd = sqrt(mean))rnbinom2(n, mean, sd = sqrt(mean))
n |
number of observations. If |
mean |
the mean value on the true scale (vectorised) |
sd |
the standard deviation on the true scale (vectorised) |
The negative binomial distribution with size and
prob has density
for , and .
This represents the number of failures which occur in a sequence of
Bernoulli trials before a target number of successes is reached.
The mean is and variance .
A negative binomial distribution can also arise as a mixture of
Poisson distributions with mean distributed as a gamma distribution
(see pgamma) with scale parameter (1 - prob)/prob
and shape parameter size. (This definition allows non-integer
values of size.)
An alternative parametrization (often used in ecology) is by the
mean mu (see above), and size, the dispersion
parameter, where prob = size/(size+mu). The variance
is mu + mu^2/size in this parametrization.
If an element of x is not integer, the result of dnbinom
is zero, with a warning.
The case size == 0 is the distribution concentrated at zero.
This is the limiting distribution for size approaching zero,
even if mu rather than prob is held constant. Notice
though, that the mean of the limit distribution is 0, whatever the
value of mu.
The quantile is defined as the smallest value such that
, where is the distribution function.
dnbinom gives the density,
pnbinom gives the distribution function,
qnbinom gives the quantile function, and
rnbinom generates random deviates.
Invalid size or prob will result in return value
NaN, with a warning.
The length of the result is determined by n for
rnbinom, and is the maximum of the lengths of the
numerical arguments for the other functions.
The numerical arguments other than n are recycled to the
length of the result. Only the first elements of the logical
arguments are used.
rnbinom returns a vector of type integer unless generated
values exceed the maximum representable integer when double
values are returned.
dnbinom computes via binomial probabilities, using code
contributed by Catherine Loader (see dbinom).
pnbinom uses pbeta.
qnbinom uses the Cornish–Fisher Expansion to include a skewness
correction to a normal approximation, followed by a search.
rnbinom uses the derivation as a gamma mixture of Poisson
distributions, see
Devroye, L. (1986) Non-Uniform Random Variate Generation. Springer-Verlag, New York. Page 480.
Distributions for standard distributions, including
dbinom for the binomial, dpois for the
Poisson and dgeom for the geometric distribution, which
is a special case of the negative binomial.
rnbinom2(10, 5, sqrt(5))rnbinom2(10, 5, sqrt(5))
Null distributions always returns NA
rnull(n, ...)rnull(n, ...)
n |
number of observations |
... |
not used |
rnull(3, c(0.1,0.5,0.9),0.1)rnull(3, c(0.1,0.5,0.9),0.1)
The wedge distribution has a support of 0 to 1 and has a linear probability density function over that support.
rwedge(n, a)rwedge(n, a)
n |
number of observations |
a |
a gradient from -2 (left skewed) to 2 (right skewed) |
The rwedge can be combined with quantile functions to
skew standard distributions, or introduce correlation or down weight
certain parts of the distribution.
a vector of probabilities, quantiles, densities or samples.
pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )
sim_outbreak datasetThis is a generated data set of an outbreak based on a branching process model with fixed parameters. There is a both the full infection line list as it happened in the simulation and the subset of this data that might have been observed in a contact tracing exercise where identification is done through symptoms.
sim_outbreaksim_outbreak
An object of class list of length 3.
sim_outbreak is a named list with 3 items| Item | Type | Description |
parameters |
list[dbl] |
the ground truth of the simulation parameters |
outbreak_truth |
df[outbreak_truth]* |
the full infection line list |
contact_tracing |
df[contact_tracing]* |
the "observed" contact tracing subset |
df[outbreak_truth] dataframe with 663 rows and 11 columnsThe simulation details
| Column | Type | Description |
time |
dbl |
The true time of infection |
id |
int |
Person unique id |
generation_interval |
dbl |
The time since infector's infection |
infector |
int |
The unique id of the infector |
generation |
dbl |
Which generation is this infection since the simulation start |
symptom |
lgl |
Did this person experience symptoms |
symptom_delay |
dbl |
How long after infection were their symptoms? |
symptom_time |
dbl |
When? (from the simulation start) |
observation |
lgl |
Was this person detected (only if symptoms) |
observation_delay |
dbl |
How long after symptoms were they observed? |
observation_time |
dbl |
When? (from the simulation start) |
df[contact_tracing] dataframe with 663 rows and 4 columnsA minimal set of data that might be collected in a contact tracing exercise.
| Column | Type | Description |
id |
int |
Unique person id |
contact_id |
int |
Unique id of infectious contact |
onset_time |
int |
Time of symptom onset (from the simulation start) |
obs_time |
int |
Time of observation (from the simulation start) |
https://ai4ci.github.io/ggoutbreak/articles/simulation-test-models.html#line-list-simulations
Calculate the skew of a set of data
skew(x, na.rm = FALSE)skew(x, na.rm = FALSE)
x |
a vector of observations |
na.rm |
remove |
the skew
skew(stats::rnorm(1000)) skew(stats::rbeta(1000, 1, 8)) # positively (left) skewed skew(stats::rbeta(1000, 8, 1)) # negatively (right) skewed skew(stats::rlnorm(1000))skew(stats::rnorm(1000)) skew(stats::rbeta(1000, 1, 8)) # positively (left) skewed skew(stats::rbeta(1000, 8, 1)) # negatively (right) skewed skew(stats::rlnorm(1000))
Run the simulation for one set of parameters
test_simulation( sim_fn, scorer_fn, ..., params = NULL, obsdata = NULL, seed = NULL, debug = FALSE, debug_errors = !debug )test_simulation( sim_fn, scorer_fn, ..., params = NULL, obsdata = NULL, seed = NULL, debug = FALSE, debug_errors = !debug )
sim_fn |
a user defined function that takes a set of parameters named
the same as |
scorer_fn |
a user supplied function that matches the following
signature |
... |
simulation parameters, must be named |
params |
a named list of simulation parameters to test (as an alternative
to including them in |
obsdata |
The observational data. The data in this will typically be a named list, but could be anything, e.g. dataframe. It is the reference data that the simulation model is aiming to replicate. |
seed |
an optional random seed |
debug |
start the simulation function in debug mode. This will step
through both the |
debug_errors |
Errors that crop up in |
a list containing the parameters as truth and an
instance of the simulation as obsdata, obsscores is the result of
comparing the obsdata with itself and is usually going to result in
zeros. Both sim_fn and scorer_fn are rewritten to make sure that all
package names are fully qualified. The rewritten versions are refurned in
the result list (as sim_fn and scorer_fn respectively)
test = test_simulation( example_sim_fn, example_scorer_fn, # Model parameters to test with: mean = 4, sd1 = 3, sd2 = 2, obsdata = example_obsdata() ) # the rewritten function: cat(test$sim_fn, sep="\n") # The scores resulting from this one simulation, when compared to the # reference `obsdata`. test$obsscorestest = test_simulation( example_sim_fn, example_scorer_fn, # Model parameters to test with: mean = 4, sd1 = 3, sd2 = 2, obsdata = example_obsdata() ) # the rewritten function: cat(test$sim_fn, sep="\n") # The scores resulting from this one simulation, when compared to the # reference `obsdata`. test$obsscores
Generates a new distribution by applying a link transformation to an existing
distribution dist. If and is the link function,
this function returns the distribution of .
The CDF , quantile function , PDF , and RNG
of the transformed distribution are derived from the original distribution's
functions , , , and the link function
and its inverse as follows:
where is the derivative of the link function. This function implements
these transformations for the p, q, d, and r functions of the resulting
dist_fns object.
transform(link, dist, ..., name = NULL)transform(link, dist, ..., name = NULL)
link |
a link function (or name of a link function) |
dist |
distribution(s) as a name, function or a |
... |
parameters for the underlying distribution if |
name |
a name for the link function |
a dist_fn or dist_fn_list holding the transformed distribution(s)
t = transform("log","norm")
ps = seq(0,1,0.1)
qs = 0:6
testthat::expect_equal(t$q(ps),qlnorm(ps))
testthat::expect_equal(t$p(qs),plnorm(qs))
testthat::expect_equal(t$d(qs),dlnorm(qs))
t2 = transform("log","norm", 0.4, 0.1)
testthat::expect_equal(t2$q(ps),qlnorm(ps,0.4,0.1))
n = as.dist_fns("norm",mean=0.5, sd=0.1) t = transform("log",n) plot(t)+ggplot2::geom_function(fun=~ dlnorm(.x, 0.5, 0.1), linetype="dashed")n = as.dist_fns("norm",mean=0.5, sd=0.1) t = transform("log",n) plot(t)+ggplot2::geom_function(fun=~ dlnorm(.x, 0.5, 0.1), linetype="dashed")
Generates a truncated distribution from an existing distribution dist.
The new distribution is restricted to the interval
(or if open bounds are intended, though the
quantile function will map 0 to and 1 to ).
The probability density function (PDF) of the truncated distribution
is related to the original PDF by a normalization constant:
where is the original CDF and is the indicator function.
Consequently, the CDF and quantile function are:
where is the quantile function of the original distribution.
This function implements these transformations for the p, q, and r functions
of the resulting dist_fns object.
truncate(dist, x_left, x_right, ..., name = NULL)truncate(dist, x_left, x_right, ..., name = NULL)
dist |
a distribution as a name, function or a |
x_left |
The lower end of the interval or NA for open |
x_right |
The upper end of the interval or NA for open |
... |
parameters for the underlying distribution if |
name |
a name for the truncation |
a dist_fn or dist_fn_list holding the truncated distribution(s)
shape2_gamma = as.dist_fns(pgamma, shape=2) g2 = truncate(shape2_gamma, 0.5, 4) testthat::expect_equal( format(g2), "trunc(gamma(shape = 2, rate = 1), 0.50 - 4.00); Median (IQR) 1.68 [1.08 — 2.46]" ) testthat::expect_equal( g2$p(0:5), c(0, 0.212702667019627, 0.615716430100344, 0.868531239886668, 1, 1) ) testthat::expect_equal(g2$q(seq(0, 1, 0.2)), c( 0.5, 0.971743593113941, 1.42711359317907, 1.95304152540128, 2.6642447272259, 4 )) g3 = truncate(shape2_gamma, NA, 4) testthat::expect_equal( format(g3), "trunc(gamma(shape = 2, rate = 1), 0.00 - 4.00); Median (IQR) 1.54 [0.899 — 2.35]" ) g4 = truncate(shape2_gamma, 2, NA) testthat::expect_equal( format(g4), "trunc(gamma(shape = 2, rate = 1), 2.00 - Inf); Median (IQR) 2.97 [2.42 — 3.87]" )
shape2_gamma = as.dist_fns(pgamma, shape=2) g2 = truncate(shape2_gamma, 0.5, 4) plot(g2)+ggplot2::geom_function(fun = ~ dgamma(.x,shape=2), xlim = c(0,5))shape2_gamma = as.dist_fns(pgamma, shape=2) g2 = truncate(shape2_gamma, 0.5, 4) plot(g2)+ggplot2::geom_function(fun = ~ dgamma(.x,shape=2), xlim = c(0,5))
wasserstein_calculator(obs, debias = FALSE, bootstraps = 1)wasserstein_calculator(obs, debias = FALSE, bootstraps = 1)
obs |
A vector of observations |
debias |
Should the simulations be shifted to match the mean of the observed data |
bootstraps |
Randomly resample from the simulated data points to match the observed size this many times and combine the output by averaging. The alternative, when this is 1 (the default) matches the sizes by selecting and/or repeating the simulated data points in order (deterministically) |
This function takes reference data in the forms of individual observations in
the form of for example event times and returns a crated function to
calculate the wasserstein distance from simulated data to the observed data.
For large simulations this will be quicker than calculate_wasserstein but
must be used with a crated scorer_fn
In the comparison unequal lengths of the data can be accommodated. The simulated data is recycled, and sampled, until the same length as the observed data before the comparison.
a function that takes parameter sim and returns a
length normalised wasserstein distance. This is the average distance an
individual data point must be shifted to match the reference data normalised
by the average distance of the reference data from the mean. The function
also takes a p parameter which can be a progressr progress bar which must
be named.
tmp = wasserstein_calculator(0:10) # zero if no distance testthat::expect_equal(tmp(0:10), 0) testthat::expect_equal(tmp(10:0), 0)
# example case counts from an exponential growth process sim = rexpgrowth(1000, 0.05, 40, 0) obs = rexpgrowth(1000, 0.075, 40, 0) obs2 = rexpgrowth(1000, 0.05, 40, 0) calc = wasserstein_calculator(sim) # obs is a different distribution to sim (larger growth) calc(obs) # obs2 is from the same distribution as sim so the RMSE should be lower: calc(obs2)# example case counts from an exponential growth process sim = rexpgrowth(1000, 0.05, 40, 0) obs = rexpgrowth(1000, 0.075, 40, 0) obs2 = rexpgrowth(1000, 0.05, 40, 0) calc = wasserstein_calculator(sim) # obs is a different distribution to sim (larger growth) calc(obs) # obs2 is from the same distribution as sim so the RMSE should be lower: calc(obs2)
Weighted bandwidth selector
wbw.nrd(x, w = NULL)wbw.nrd(x, w = NULL)
x |
data |
w |
weights |
a bandwidth based on weighted data
x = runif(1000) w = rgamma(1000,2) wbw.nrd(x) wbw.nrd(x,w)x = runif(1000) w = rgamma(1000,2) wbw.nrd(x) wbw.nrd(x,w)
The wedge distribution has a support of 0 to 1 and has a linear probability density function over that support.
n |
number of observations |
x |
vector of quantiles |
q |
vector of quantiles |
p |
vector of probabilities |
log |
logical; if TRUE, probabilities p are given as log(p). |
log.p |
logical; if TRUE, probabilities p are given as log(p). |
lower.tail |
logical; if TRUE (default), probabilities are |
a |
a gradient from -2 (left skewed) to 2 (right skewed) |
The rwedge can be combined with quantile functions to
skew standard distributions, or introduce correlation or down weight
certain parts of the distribution.
a vector of probabilities, quantiles, densities or samples.
pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )pwedge(seq(0,1,0.1), a=1) dwedge(seq(0,1,0.1), a=1) qwedge(c(0.25,0.5,0.75), a=-1) stats::cor( stats::qnorm(rwedge(1000, a=2)), stats::qnorm(rwedge(1000, a=-2)) )
Increases the dispersion (spread) of a distribution by transforming its quantile function in a standardized Q-Q space.
widen(x, scale, knots = NULL, name = NULL)widen(x, scale, knots = NULL, name = NULL)
x |
a distribution as a |
scale |
acts as a multiplier for the log-odds difference from the
median, effectively acting like an odds-ratio parameter. A |
knots |
the number of knots in the transformed distribution, if it is not already an empirical CDF distribution. |
name |
a name for the widened distribution |
The transformation aims to increase the standard deviation by a factor
scale while preserving the median. It operates on the
internal Q-Q space representation (qx, qy) of an empirical distribution
generated by empirical_cdf.
Applies a logit-space scaling transformation centred on the median quantile.
This transformation modifies the quantile axis (qx) in the Q-Q space of an
empirical distribution to change its dispersion while preserving the median
value.
Let qx be the original quantile coordinate in [0, 1], and qmedian be the
quantile corresponding to the median (e.g., qx_from_qy(0.5)). The transformation
is defined as:
where and
are the standard logit and inverse-logit functions, respectively.
an empirical dist_fn with the same median and increased SD. This
transformation will change the mean of skewed distributions.
d1 = as.dist_fns("norm",4,2) w1 = widen(d1, scale=1.5)d1 = as.dist_fns("norm",4,2) w1 = widen(d1, scale=1.5)
a simple alias for base weighted,mean
wmean(x, w = NULL, na.rm = TRUE)wmean(x, w = NULL, na.rm = TRUE)
x |
either a vector of samples from a distribution |
w |
for data fits, a vector the same length as |
na.rm |
remove NAs (default TRUE) |
a standard deviation
#' # unweighted: wmean(x = stats::rnorm(1000)) # weighted: wmean(x = seq(-2,2,0.1), w = stats::dnorm(seq(-2,2,0.1)))#' # unweighted: wmean(x = stats::rnorm(1000)) # weighted: wmean(x = seq(-2,2,0.1), w = stats::dnorm(seq(-2,2,0.1)))
This quantile function has different order of parameters from base quantile. It takes a weight and a link function specification which allows us to define the support of the quantile function. It is optimised for imputing the tail of distributions and not speed.
wquantile(p, x, w = NULL, link = "identity", names = TRUE, window = 7)wquantile(p, x, w = NULL, link = "identity", names = TRUE, window = 7)
p |
the probabilities for which to estimate quantiles from the data |
x |
a set of observations |
w |
for data fits, a vector the same length as |
link |
a link function. Either as name, or a |
names |
name the resulting quantile vector |
window |
the number of data points to include when estimating the
quantile. The closest |
The process involves:
Link transformation: values are transformed using the link function:
.
Standardization: Transformed values are standardized:
, where and
are the weighted mean and standard deviation of .
Weighted CDF calculation: The empirical CDF is calculated from weights .
Logit transformation: is transformed:
.
Local interpolation: For a target probability ,
is calculated. A window of points is selected from the
pairs around . A weighted linear model is fitted using Gaussian kernel
weights based on distance in space: ,
where is the normalized distance.
Quantile estimation: The local model predicts for .
Back-transformation: The quantile is transformed back:
.
This is a moderately expensive function to call (in memory terms), as it
needs to construct the whole quantile function. if there are multiple calls
consider using empirical() to build a quantile function and using that.
a vector of quantiles
test = function(rfn,qfn, link,..., n = 100000, tol=1000/(n+10000)) {
testthat::expect_equal(abs(
unname(wquantile(c(0.025, 0.5, 0.975),rfn(100000,...),link=link,names=FALSE)-
qfn(c(0.025, 0.5, 0.975), ...))
), c(0,0,0), tolerance=tol)
}
withr::with_seed(123, {
test(stats::rnorm,stats::qnorm,"identity",n = 10000)
test(stats::rnorm,stats::qnorm,"identity",mean=4,n = 10000)
test(stats::rnorm,stats::qnorm,"identity",sd=3, n = 100000, tol=0.05)
test(stats::rnorm,stats::qnorm,"identity",n = 5000)
test(stats::rnorm,stats::qnorm,"identity",n = 1000)
test(stats::rnorm,stats::qnorm,"identity",n = 100)
test(stats::rnorm,stats::qnorm,"identity",n = 30)
test(stats::rgamma,stats::qgamma,"log", 4,n = 10000)
test(stats::rgamma,stats::qgamma,"log", 4,n = 5000)
test(stats::rgamma,stats::qgamma,"log", 4,n = 1000)
test(stats::rgamma,stats::qgamma,"log", 4, 3,n = 100)
test(stats::rgamma,stats::qgamma,"log", 4,n = 30)
test(stats::runif,stats::qunif,as.link_fns(c(0,10)),0,10)
})
# fit weighted data samples = seq(0,10,0.01) weights = dgamma2(samples, mean=5, sd=2) wquantile(c(0.25,0.5,0.75), x = samples, w = weights, link="log") # compared to the sampled distribution qgamma2(c(0.25,0.5,0.75), mean=5, sd=2) # unweighted: wquantile(p = c(0.25,0.5,0.75), x = stats::rnorm(1000)) qnorm(p = c(0.25,0.5,0.75))# fit weighted data samples = seq(0,10,0.01) weights = dgamma2(samples, mean=5, sd=2) wquantile(c(0.25,0.5,0.75), x = samples, w = weights, link="log") # compared to the sampled distribution qgamma2(c(0.25,0.5,0.75), mean=5, sd=2) # unweighted: wquantile(p = c(0.25,0.5,0.75), x = stats::rnorm(1000)) qnorm(p = c(0.25,0.5,0.75))
Weighted standard deviation
wsd(x, w = NULL, na.rm = TRUE)wsd(x, w = NULL, na.rm = TRUE)
x |
either a vector of samples from a distribution |
w |
for data fits, a vector the same length as |
na.rm |
remove NAs (default TRUE) |
a standard deviation
# unweighted: wsd(x = stats::rnorm(1000)) # weighted: wsd(x = seq(-2,2,0.1), w = stats::dnorm(seq(-2,2,0.1)))# unweighted: wsd(x = stats::rnorm(1000)) # weighted: wsd(x = seq(-2,2,0.1), w = stats::dnorm(seq(-2,2,0.1)))