This vignette provides a detailed guide on writing the core
user-defined functions required for Approximate Bayesian Computation
(ABC) using tidyabc: the simulation function
(sim_fn), the scoring function (scorer_fn),
and the convergence function (converged_fn). Understanding
the precise input and output structure of these functions is crucial for
correctly implementing an ABC workflow.
sim_fn)The simulation function (sim_fn) defines your generative
model. It takes parameter values as input and returns simulated
data.
sim_fn must
accept named arguments corresponding to the parameters defined in your
priors_list. For example, if you have
priors(norm_mean ~ unif(0, 10),norm_sd ~ unif(0, 10)), your
sim_fn must accept norm_mean and
norm_sd as arguments.... (Optional): While not strictly
required by the abc_* functions themselves, it’s good
practice to allow for ... in the function definition if you
anticipate passing extra arguments during debugging or testing.# Example sim_fn matching the parameters from priors defined later
example_sim_fn = function(norm_mean, norm_sd, gamma_mean, gamma_sd) {
# Simulate data based on parameters
A = rnorm(1000, norm_mean, norm_sd)
B = rgamma2(1000, gamma_mean, gamma_sd) # Using a custom dist from tidyabc
C = rgamma2(1000, gamma_mean, gamma_sd)
# Return a list containing the simulated outputs
return(
list(
data1 = A + B - C, # Example composite output
data2 = B * C # Example composite output
)
)
}scorer_fn)The scoring function (scorer_fn) compares simulated data
to observed data, producing a set of summary statistics (distances or
other metrics) that quantify the “closeness” of the simulation to the
observation.
simdata: This is the output list from
your sim_fn.obsdata: This is the observed data
provided to the abc_* function, which should be in the same
format as the simdata list.simdata and
obsdata, but they could also be other metrics like summary
moments (mean, variance).scorer_fn are crucial. They must be consistent and
identifiable, as they are used for weighting (scoreweights)
and internal processing.scorer_fn is called
repeatedly within the ABC workflow, once for each simulation
(simdata) against the fixed obsdata.scorer_fn(simdata, obsdata) is a list of component
scores (e.g.,
list(data1 = dist1, data2 = dist2)).distance_method
argument in abc_* functions:
"euclidean" (default): Calculates the
weighted Euclidean distance:dist1, dist2) and \(w_i\) is the corresponding weight from the
scoreweights vector."manhattan": Calculates the weighted
Manhattan distance:"mahalanobis": Calculates the
Mahalanobis distance using the covariance matrix of the component scores
from the first wave, incorporating weights.obsscores: This argument allows you to
provide pre-calculated component scores for the obsdata
itself (e.g., list(data1 = 0, data2 = 0) if comparing
against obsdata using
scorer_fn(obsdata, obsdata) yields zeros). This is optional
and often inferred internally if scorer_fn is run on
obsdata itself initially.epsilon) using a kernel function
(defined by the kernel argument) to calculate the ABC
weights for each simulation.# Example scorer_fn matching the simdata structure from example_sim_fn
example_scorer_fn = function(simdata, obsdata) {
# Compare corresponding elements of simdata and obsdata
# Using calculate_wasserstein from tidyabc as an example distance
return(list(
data1 = calculate_wasserstein(simdata$data1, obsdata$data1),
data2 = calculate_wasserstein(simdata$data2, obsdata$data2)
))
# Output: list(data1 = <dist_val1>, data2 = <dist_val2>)
}scoreweights)The scoreweights argument allows you to assign relative
importance to each component score returned by
scorer_fn.
scorer_fn.data1 and
data2 have different scales or importance, you can use
weights to balance their contribution to the summary distance. For
instance, c(data1 = 1, data2 = 2) means
data2’s distance contributes twice as much as
data1’s distance to the final \(d_{summary}\) (before applying the square
root for Euclidean).posterior_distance_metrics() function can analyze results
from a trial run (e.g., abc_rejection) and suggest suitable
scoreweights based on the scale of the summary
statistics.It is a good idea to run your simulation with some test parameters
first. Similarly testing the scoring function is correctly working is a
good idea. An end to end test harness is supplied in
test_simulation() that will run your simulation and scorer
for a fixed set of parameters. It can be configured to drop to an
interactive debugging browser if it encounters an error.
Although testing against one set of parameters will catch major
problems edge cases often arise during the main ABC. Each of the ABC
functions can also drop the user to a debugging browser if an issue is
detected. This is not compatible with parallelisation though so best to
run a small set of simulations using
e.g. abc_rejection().
For parallel execution the simulation and scorer functions must be completely self contained, and make no reference to global environment variables, or unqualified references to package functions. Global references will in general be automatically detected, and you will be alerted and your functions will be rewritten with common package names qualified so that they are self contained.
If you need to use globals carrier::crateing your
simulation function will achieve the same thing.
converged_fn)The convergence function (converged_fn) determines when
the iterative ABC-SMC or ABC-Adaptive algorithm should stop. It
evaluates the change in the posterior approximation between waves.
converged_fnThe converged_fn is called internally by
abc_smc and abc_adaptive at the end of each
wave. It receives three arguments:
wave: The wave number as an
integersummary: A single-row data
frame containing summary metrics aggregated across all
parameters for the current wave. Common columns include:
abs_distance: The absolute tolerance threshold
(epsilon) for the current wave.abs_distance_redn: The absolute reduction in
epsilon compared to the previous wave
(epsilon_{prev} - epsilon_{current}).rel_distance_redn: The relative reduction in
epsilon compared to the previous wave
((epsilon_{prev} - epsilon_{current}) / epsilon_{prev}).ESS: The Effective Sample Size of the current wave’s
particle set.per_param: A single-row data
frame containing metrics calculated for each individual
parameter in the current wave, compared to the previous wave.
Common columns include:
param: The name of the parameter (e.g.,
“norm_mean”).IQR_95_redn: The reduction in the 95% credible interval
(IQR) range (q0.975 - q0.025) for this parameter compared
to the previous wave (IQR_{prev} - IQR_{current}).abs_variance_redn: The absolute reduction in the
posterior variance for this parameter compared to the previous wave
(var_{prev} - var_{current}).rel_mean_change: The relative change in the posterior
mean for this parameter compared to the previous wave
(|mean_{current} - mean_{prev}| / |mean_{prev}|).converged_fnTRUE or FALSE.
TRUE: Signals that the algorithm has converged, and the
iterative process stops.FALSE: Signals that the algorithm should continue to
the next wave.default_termination_fntidyabc provides a default_termination_fn()
which implements a common convergence strategy:
rel_mean_change for all
parameters is below a specified stability threshold.IQR_95_redn for
all parameters is below a specified confidence
threshold.TRUE, indicating
convergence.# Example of using default_termination_fn with custom thresholds
default_converged_fn = default_termination_fn(stability = 0.01, confidence = 0.1)
# This means: converge if mean changes are < 1% and 95% CI shrinks by < 0.1 units for all params.
# If you have 2 parameters `norm_mean` and `norm_sd` you can specify different
# convergence criteria by nameing the stability and confidence inputs:
default_converged_fn2 = default_termination_fn(
stability = c(norm_mean = 0.01, norm_sd = 0.1),
confidence = c(norm_mean = 0.01, norm_sd = 0.1)
)
# this would accept more variation in the `norm_sd` estimate compared to that
# of the `norm_mean`Writing a custom convergence function allows you to tailor the stopping criteria to your specific needs.
A custom convergence function must adhere to the input signature:
function(wave, summary, per_param). It processes the data
frames provided and returns a single TRUE or
FALSE.
This example defines convergence as achieving a high Effective Sample Size (ESS) and a minimal reduction in the absolute distance threshold.
# Define a custom convergence function
custom_converged_fn = function(wave, summary, per_param) {
# Extract values from the summary data frame (it's a single row)
current_ess = summary$ESS
abs_dist_redn = summary$abs_distance_redn # Absolute reduction in epsilon
# Define criteria
min_ess_threshold = 500
max_dist_reduction_threshold = 0.001 # Stop if epsilon isn't reducing much
# Check conditions
ess_condition = current_ess >= min_ess_threshold
dist_condition = abs_dist_redn <= max_dist_reduction_threshold
# Return TRUE only if BOTH conditions are met
return(ess_condition && dist_condition)
}
# Example usage (conceptual - requires obsdata, priors, etc. defined)
# smc_result = abc_smc(
# obsdata = test_obsdata,
# priors_list = test_priors,
# sim_fn = example_sim_fn,
# scorer_fn = example_scorer_fn,
# n_sims = 1000,
# acceptance_rate = 0.5,
# converged_fn = custom_converged_fn # Use the custom function
# )This example focuses on the stability of parameter uncertainty, converging when the absolute variance reduction for all parameters is small.
# Define a custom convergence function based on variance reduction
custom_converged_fn_variance = function(wave, summary, per_param) {
# Extract the 'abs_variance_redn' column from the per_param data frame
abs_variance_reductions = per_param$abs_variance_redn
# Define a threshold for variance reduction
variance_reduction_threshold = 0.0001
# Check if ALL absolute variance reductions are below the threshold
all_stable <- all(abs_variance_reductions < variance_reduction_threshold)
return(all_stable)
}
# Example usage (conceptual)
# smc_result = abc_smc(
# obsdata = test_obsdata,
# priors_list = test_priors,
# sim_fn = example_sim_fn,
# scorer_fn = example_scorer_fn,
# n_sims = 1000,
# acceptance_rate = 0.5,
# converged_fn = custom_converged_fn_variance
# )summary and per_param are data frames. Access
columns using $ (e.g., summary$ESS,
per_param$rel_mean_change).per_param data frame
has one row per parameter. Use functions like all(),
any(), or max() on the relevant column to
combine information across parameters for a single logical output.browser(): You can get a better feel
for the statistics available by running an interactive browser from
within a convergence function while testing. —Understanding the precise structure and purpose of
sim_fn, scorer_fn, and
converged_fn is important for effectively using
tidyabc. The simulation function generates data, the
scoring function quantifies similarity, and the convergence function
determines when the iterative process halts. By carefully specifying
these functions, you can tailor the ABC workflow to your model and
inference goals.