---
title: "Simulation, scoring and convergence functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulation, scoring and convergence functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
link-citations: TRUE
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyabc)
library(dplyr)
library(ggplot2)
```

## Introduction

This vignette provides a detailed guide on writing the core user-defined
functions required for Approximate Bayesian Computation (ABC) using `tidyabc`:
the simulation function (`sim_fn`), the scoring function (`scorer_fn`), and the
convergence function (`converged_fn`). Understanding the precise input and
output structure of these functions is crucial for correctly implementing an ABC
workflow.

---

## 1. Writing the Simulation Function (`sim_fn`)

The simulation function (`sim_fn`) defines your generative model. It takes
parameter values as input and returns simulated data.

### Inputs

- **Named Arguments**: 
The `sim_fn` must accept named arguments corresponding to the parameters defined
in your `priors_list`. For example, if you have 
`priors(norm_mean ~ unif(0, 10),norm_sd ~ unif(0, 10))`, your `sim_fn` must 
accept `norm_mean` and `norm_sd` as arguments.
- **`...` (Optional)**: 
While not strictly required by the `abc_*` functions themselves, it's good
practice to allow for `...` in the function definition if you anticipate passing
extra arguments during debugging or testing.

### Output

- **List**: 
The function must return a **list** containing the simulated data. The structure
of this list is entirely up to your model. It could be a list of vectors, a list
of data frames, or any other R object that represents the output of your
simulation model.

### Example

```{r}
# Example sim_fn matching the parameters from priors defined later
example_sim_fn = function(norm_mean, norm_sd, gamma_mean, gamma_sd) {
  # Simulate data based on parameters
  A = rnorm(1000, norm_mean, norm_sd)
  B = rgamma2(1000, gamma_mean, gamma_sd) # Using a custom dist from tidyabc
  C = rgamma2(1000, gamma_mean, gamma_sd)

  # Return a list containing the simulated outputs
  return(
    list(
      data1 = A + B - C,  # Example composite output
      data2 = B * C       # Example composite output
    )
  )
}
```

---

## 2. Writing the Scoring Function (`scorer_fn`)

The scoring function (`scorer_fn`) compares simulated data to observed data,
producing a set of summary statistics (distances or other metrics) that quantify
the "closeness" of the simulation to the observation.

### Inputs

- **`simdata`**: This is the output list from your `sim_fn`.
- **`obsdata`**: 
This is the observed data provided to the `abc_*` function, which should be in
the same format as the `simdata` list.

### Output

- **List**: 
The function must return a **list**. Each element of this list corresponds to a
*component* of the overall summary statistic used for comparison. Commonly,
these components are distances (e.g., Wasserstein, RMSE) between corresponding
parts of `simdata` and `obsdata`, but they could also be other metrics like
summary moments (mean, variance).
- **Naming**: 
The names of the list elements returned by `scorer_fn` are crucial. They must be
consistent and identifiable, as they are used for weighting (`scoreweights`) and
internal processing.

### Combining Scorer Outputs and Comparison to Observed Scores

1.  **Execution**: 
The `scorer_fn` is called repeatedly within the ABC workflow, once for each
simulation (`simdata`) against the fixed `obsdata`.
2.  **Component Scores**: 
The output of `scorer_fn(simdata, obsdata)` is a list of *component scores*
(e.g., `list(data1 = dist1, data2 = dist2)`).
3.  **Combining Components**: The ABC algorithm combines these component scores
into a single *summary distance* for each simulation. The method is specified by
the `distance_method` argument in `abc_*` functions:
    - **`"euclidean"` (default)**: Calculates the weighted Euclidean distance:  
      \[
      d_{summary} = \sqrt{\sum_i (w_i \cdot s_i)^2}
      \]
      where \(s_i\) is the \(i\)-th component score (e.g., `dist1`, `dist2`) and
      \(w_i\) is the corresponding weight from the `scoreweights` vector.
    - **`"manhattan"`**: Calculates the weighted Manhattan distance:  
      \[
      d_{summary} = \sum_i w_i \cdot |s_i|
      \]
    - **`"mahalanobis"`**: Calculates the Mahalanobis distance using the 
      covariance matrix of the component scores from the first wave, 
      incorporating weights.
4.  **`obsscores`**: 
This argument allows you to provide pre-calculated component scores for the
`obsdata` itself (e.g., `list(data1 = 0, data2 = 0)` if comparing against
`obsdata` using `scorer_fn(obsdata, obsdata)` yields zeros). This is optional
and often inferred internally if `scorer_fn` is run on `obsdata` itself
initially.
5.  **Tolerance and Kernels**: 
The final $d_{summary}$ is compared against a tolerance threshold (`epsilon`)
using a kernel function (defined by the `kernel` argument) to calculate the ABC
weights for each simulation.

### Example

```{r}
# Example scorer_fn matching the simdata structure from example_sim_fn
example_scorer_fn = function(simdata, obsdata) {
  # Compare corresponding elements of simdata and obsdata
  # Using calculate_wasserstein from tidyabc as an example distance
  return(list(
    data1 = calculate_wasserstein(simdata$data1, obsdata$data1),
    data2 = calculate_wasserstein(simdata$data2, obsdata$data2)
  ))
  # Output: list(data1 = <dist_val1>, data2 = <dist_val2>)
}
```

### Score Weights (`scoreweights`)

The `scoreweights` argument allows you to assign relative importance to each 
component score returned by `scorer_fn`.

- **Type**: 
A named numeric vector, where names correspond to the names of the list elements
returned by `scorer_fn`.
- **Purpose**: 
If `data1` and `data2` have different scales or importance, you can use weights
to balance their contribution to the summary distance. For instance, 
`c(data1 = 1, data2 = 2)` means `data2`'s distance contributes twice as 
much as `data1`'s distance to the final $d_{summary}$ (before applying the
square root for Euclidean).
- **Automatic Calculation**: 
The `posterior_distance_metrics()` function can analyze results from a trial run
(e.g., `abc_rejection`) and suggest suitable `scoreweights` based on the scale
of the summary statistics.

## 3. Debugging and Parallelisation

It is a good idea to run your simulation with some test parameters first.
Similarly testing the scoring function is correctly working is a good idea. An
end to end test harness is supplied in `test_simulation()` that will run your
simulation and scorer for a fixed set of parameters. It can be configured to
drop to an interactive debugging browser if it encounters an error.

Although testing against one set of parameters will catch major problems edge
cases often arise during the main ABC. Each of the ABC functions can also drop 
the user to a debugging browser if an issue is detected. This is not 
compatible with parallelisation though so best to run a small set of simulations
using e.g. `abc_rejection()`.

For parallel execution the simulation and scorer functions must be completely
self contained, and make no reference to global environment variables, or
unqualified references to package functions. Global references will in general
be automatically detected, and you will be alerted and your functions will be
rewritten with common package names qualified so that they are self contained.

If you need to use globals `carrier::crate`ing your simulation function will 
achieve the same thing.

---

## 4. The Convergence Function (`converged_fn`)

The convergence function (`converged_fn`) determines when the iterative ABC-SMC
or ABC-Adaptive algorithm should stop. It evaluates the change in the posterior
approximation between waves.

### Inputs to `converged_fn`

The `converged_fn` is called internally by `abc_smc` and `abc_adaptive` at the
end of each wave. It receives three arguments:

- **`wave`**: 
The wave number as an integer
- **`summary`**: 
A **single-row data frame** containing summary metrics aggregated across *all
parameters* for the *current wave*. Common columns include:
    - `abs_distance`: The absolute tolerance threshold (`epsilon`) for the 
    current wave.
    - `abs_distance_redn`: The absolute reduction in `epsilon` compared to the 
    previous wave (`epsilon_{prev} - epsilon_{current}`).
    - `rel_distance_redn`: The relative reduction in `epsilon` compared to the 
    previous wave (`(epsilon_{prev} - epsilon_{current}) / epsilon_{prev}`).
    - `ESS`: The Effective Sample Size of the current wave's particle set.
    - Other aggregated metrics might be present depending on the algorithm's implementation.
- **`per_param`**: A **single-row data frame** containing metrics calculated 
  *for each individual parameter* in the current wave, compared to the previous 
  wave. Common columns include:
    - `param`: The name of the parameter (e.g., "norm_mean").
    - `IQR_95_redn`: The reduction in the 95% credible interval (IQR) range 
    (`q0.975 - q0.025`) for this parameter compared to the previous wave 
    (`IQR_{prev} - IQR_{current}`).
    - `abs_variance_redn`: The absolute reduction in the posterior variance for 
    this parameter compared to the previous wave (`var_{prev} - var_{current}`).
    - `rel_mean_change`: The relative change in the posterior mean for this 
    parameter compared to the previous wave 
    (`|mean_{current} - mean_{prev}| / |mean_{prev}|`).
    - This data frame has one row for each parameter being estimated.

### Output of `converged_fn`

- **Logical**: The function must return a single `TRUE` or `FALSE`.
    - `TRUE`: Signals that the algorithm has converged, and the iterative process stops.
    - `FALSE`: Signals that the algorithm should continue to the next wave.

### The `default_termination_fn`

`tidyabc` provides a `default_termination_fn()` which implements a common convergence strategy:

- It checks if the `rel_mean_change` for *all* parameters is below a 
  specified `stability` threshold.
- *AND* it checks if the `IQR_95_redn` for *all* parameters is below a 
  specified `confidence` threshold.
- If both conditions are met, it returns `TRUE`, indicating convergence.

```{r}
# Example of using default_termination_fn with custom thresholds
default_converged_fn = default_termination_fn(stability = 0.01, confidence = 0.1)
# This means: converge if mean changes are < 1% and 95% CI shrinks by < 0.1 units for all params.

# If you have 2 parameters `norm_mean` and `norm_sd` you can specify different
# convergence criteria by nameing the stability and confidence inputs:

default_converged_fn2 = default_termination_fn(
  stability = c(norm_mean = 0.01, norm_sd = 0.1),
  confidence = c(norm_mean = 0.01, norm_sd = 0.1)
)

# this would accept more variation in the `norm_sd` estimate compared to that
# of the `norm_mean`

```

---

## 5. Writing a Custom Convergence Function

Writing a custom convergence function allows you to tailor the stopping 
criteria to your specific needs.

### Structure

A custom convergence function must adhere to the input signature:
`function(wave, summary, per_param)`. It processes the data frames provided and
returns a single `TRUE` or `FALSE`.

### Example: Convergence Based on ESS and Distance Reduction

This example defines convergence as achieving a high Effective Sample Size (ESS)
*and* a minimal reduction in the absolute distance threshold.

```{r}
# Define a custom convergence function
custom_converged_fn = function(wave, summary, per_param) {
  # Extract values from the summary data frame (it's a single row)
  current_ess = summary$ESS
  abs_dist_redn = summary$abs_distance_redn # Absolute reduction in epsilon

  # Define criteria
  min_ess_threshold = 500
  max_dist_reduction_threshold = 0.001 # Stop if epsilon isn't reducing much

  # Check conditions
  ess_condition = current_ess >= min_ess_threshold
  dist_condition = abs_dist_redn <= max_dist_reduction_threshold

  # Return TRUE only if BOTH conditions are met
  return(ess_condition && dist_condition)
}

# Example usage (conceptual - requires obsdata, priors, etc. defined)
# smc_result = abc_smc(
#   obsdata = test_obsdata,
#   priors_list = test_priors,
#   sim_fn = example_sim_fn,
#   scorer_fn = example_scorer_fn,
#   n_sims = 1000,
#   acceptance_rate = 0.5,
#   converged_fn = custom_converged_fn # Use the custom function
# )
```

### Example: Convergence Based on Parameter Variance

This example focuses on the stability of parameter uncertainty, converging when
the absolute variance reduction for *all* parameters is small.

```{r}
# Define a custom convergence function based on variance reduction
custom_converged_fn_variance = function(wave, summary, per_param) {
  # Extract the 'abs_variance_redn' column from the per_param data frame
  abs_variance_reductions = per_param$abs_variance_redn

  # Define a threshold for variance reduction
  variance_reduction_threshold = 0.0001

  # Check if ALL absolute variance reductions are below the threshold
  all_stable <- all(abs_variance_reductions < variance_reduction_threshold)

  return(all_stable)
}

# Example usage (conceptual)
# smc_result = abc_smc(
#   obsdata = test_obsdata,
#   priors_list = test_priors,
#   sim_fn = example_sim_fn,
#   scorer_fn = example_scorer_fn,
#   n_sims = 1000,
#   acceptance_rate = 0.5,
#   converged_fn = custom_converged_fn_variance
# )
```

### Key Considerations for Custom Functions

- **Data Frame Access**: 
Remember that `summary` and `per_param` are data frames. Access columns using
`$` (e.g., `summary$ESS`, `per_param$rel_mean_change`).
- **Aggregation**: 
The `per_param` data frame has one row per parameter. Use functions like
`all()`, `any()`, or `max()` on the relevant column to combine information
across parameters for a single logical output.
- **Robustness**: 
Consider edge cases, like what happens in the very first wave if previous values
are not available (though the algorithm typically handles initialization).
- **Monitoring**: 
Convergence criteria are crucial for computational efficiency and result
quality. Choose metrics that reflect the stability you require for your specific
inference problem.
- **`browser()`**:
You can get a better feel for the statistics available by running an interactive
browser from within a convergence function while testing.
---

## Conclusion

Understanding the precise structure and purpose of `sim_fn`, `scorer_fn`, and
`converged_fn` is important for effectively using `tidyabc`. The simulation
function generates data, the scoring function quantifies similarity, and the
convergence function determines when the iterative process halts. By carefully
specifying these functions, you can tailor the ABC workflow to your model
and inference goals.