---
title: "ABC Sequential Monte-Carlo"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ABC Sequential Monte-Carlo}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
bibliography: references.bib
link-citations: TRUE
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(tidyabc)
library(dplyr)
library(ggplot2)

ggplot2::set_theme(theme_minimal())
```

## Introduction

This vignette demonstrates the **Approximate Bayesian Computation Sequential
Monte Carlo (ABC-SMC)** algorithm using `tidyabc`. ABC-SMC is an iterative
refinement method that improves upon basic rejection sampling by using
information from previous "waves" to guide the search for plausible parameters.
This typically requires far fewer simulations than rejection sampling to achieve
a similar posterior approximation, making it more efficient for complex models.

This example builds upon the model defined in the "Getting started" vignette,
using the same simulation function (`test_simulation_fn`), scoring function
(`test_scorer_fn`), priors (`test_priors`), and observed data (`test_obsdata`).

---

## Step 1: Define the Model and Data (Recap)

We reuse the same model and data from the "Getting started" vignette:

- **Simulation Function (`test_simulation_fn`)**: Generates synthetic data 
  `data1` and `data2` based on four parameters (`norm_mean`, `norm_sd`, 
  `gamma_mean`, `gamma_sd`).
- **Scoring Function (`test_scorer_fn`)**: Calculates the Wasserstein distance 
  between simulated and observed data for `data1` and `data2`.
- **Priors (`test_priors`)**: Uniform distributions for each parameter, with a 
  constraint `gamma_mean > gamma_sd`.
- **Observed Data (`test_obsdata`)**: Generated using `norm_mean = 4`, 
  `norm_sd = 2`, `gamma_mean = 6`, `gamma_sd = 2`.

---

```{r}

# example simulation
# Well be trying to recover norm and gamma parameters
# We'll use this function for both example generation and fitting
test_simulation_fn = function(norm_mean, norm_sd, gamma_mean, gamma_sd) {
  
  A = rnorm(1000, norm_mean, norm_sd)
  B = rgamma2(1000, gamma_mean, gamma_sd)
  C = rgamma2(1000, gamma_mean, gamma_sd)

  return(
    list(
      data1 = A + B - C,
      data2 = B * C
    )
  )
  
}


test_scorer_fn = function(simdata, obsdata) {
  return(list(
    data1 = calculate_wasserstein(simdata$data1, obsdata$data1),
    data2 = calculate_wasserstein(simdata$data2, obsdata$data2)
  ))
}

tmp = test_simulation(
  test_simulation_fn,
  test_scorer_fn,
  norm_mean = 4, norm_sd=2, gamma_mean=6, gamma_sd=2,
  seed = 123
)

truth = tmp$truth
test_obsdata = tmp$obsdata


test_priors = priors(
  norm_mean ~ unif(0, 10),
  norm_sd ~ unif(0, 10),
  gamma_mean ~ unif(0, 10),
  gamma_sd ~ unif(0, 10),
  # enforces convex gamma:
  ~ gamma_mean > gamma_sd 
)

ggplot(
  tibble(data1 = test_obsdata$data1), aes(x=data1))+
  geom_histogram(bins=50, fill="steelblue", color="white")+
  xlab("A + B - C") +
  theme_minimal()

ggplot(tibble(data2 = test_obsdata$data2), aes(x=data2))+
  geom_histogram(bins=50, fill="coral", color="white")+
  xlab("B × C") +
  theme_minimal()
```

## Step 2: Inform Scoring Weights (Optional but Recommended)

Before running SMC, it's often beneficial to run a small **trial rejection** ABC
fit to gather information about the scale and variability of the summary
statistics. This helps in setting appropriate weights (`scoreweights`) for
combining the distances from different summary statistics, ensuring they
contribute meaningfully to the overall distance metric.

```{r}
trial_fit = abc_rejection(
  obsdata = test_obsdata,
  priors_list = test_priors,
  sim_fn = test_simulation_fn,
  scorer_fn = test_scorer_fn,
  n_sims = 1000, # A smaller number for the trial run
  acceptance_rate = 0.5 # A high acceptance rate for the trial to get diverse samples
)

# Calculate metrics, including recommended score weights
metrics = posterior_distance_metrics(trial_fit)
test_scorewt = metrics$scoreweights # Extract the recommended weights

# Print the metrics to see the recommended weights and other diagnostics
metrics
```

The `posterior_distance_metrics()` function analyzes the distances from the
trial fit. The `scoreweights` it calculates (often based on the standard
deviation of each summary statistic divided by its root mean squared deviation
from the observed value) help balance the influence of `data1` and `data2` in
the Euclidean distance calculation, especially if they are on different scales.
Using these weights in the SMC fit can improve convergence and accuracy.

---

## Step 3: Run ABC-SMC

Now we execute the ABC-SMC algorithm. Unlike rejection sampling, SMC iteratively
refines the parameter estimates.

```{r}
smc_fit = abc_smc(
  obsdata = test_obsdata,
  priors_list = test_priors,
  sim_fn = test_simulation_fn,
  scorer_fn = test_scorer_fn,
  n_sims = 1000,              # Number of simulations per wave
  acceptance_rate = 0.5,      # Proportion of particles accepted each wave (affects epsilon decay)
  distance_method = "euclidean", # How to combine summary statistic distances
  parallel = FALSE,           # Disable parallel processing for vignette
  scoreweights = test_scorewt # Use the weights calculated from the trial fit
)

# Print a summary of the SMC results
summary(smc_fit)
```

- **`n_sims`**: The number of simulations performed in *each* wave.
- **`acceptance_rate`**: This value (e.g., 0.5) determines how the tolerance 
  threshold (epsilon) decreases. After each wave, epsilon is set to the quantile 
  of distances corresponding to this rate among the *current wave's* simulations. 
  A rate of 0.5 means epsilon becomes the median distance of the wave.
- **`scoreweights`**: The weights calculated earlier are used here.
- The `summary` output shows the final posterior statistics (median, IQR, ESS) 
  aggregated across all waves, similar to the rejection sampling output but 
  reflecting the sequential refinement.

---

## Step 4: Visualize the Results

`tidyabc` provides several plots to understand the SMC fit. The first is similar 
to the rejection sampling plot, showing the final posterior.

```{r}
plot(smc_fit, truth = truth)
```

This plot shows the marginal posterior densities for each parameter estimated
after the final SMC wave, compared to the true values (`truth`). It should
ideally show good agreement, similar to the rejection sampling result but
potentially with higher effective sample size (ESS) for the same total number of
simulations.

---

## Step 5: Diagnose Convergence

The `plot_convergence()` function is crucial for SMC, showing how the fit 
improved over the sequence of waves.

```{r}
plot_convergence(smc_fit)
```

This multi-panel plot shows key metrics tracked across waves (wave number on the x-axis):

1.  **`abs_distance`**: 
The absolute value of the tolerance threshold (epsilon) for each wave. This
represents the "closeness" required for a simulation to be accepted in that
wave. In SMC, epsilon typically decreases rapidly (often exponentially) as the
algorithm focuses on regions of higher posterior density. The plot should show a
clear downward trend, eventually plateauing as the algorithm converges or
reaches the resolution limit of the model/data noise.
2.  **`ESS` (Effective Sample Size)**: 
The ESS of the particle set at each wave. A higher ESS indicates a more diverse
and informative posterior sample. The ESS often starts low (as the first wave
might accept many particles) and then increases as the proposal distribution
(based on previous particles) becomes more efficient at finding high-probability
regions. It typically plateaus when the algorithm converges. The plot should
show an increasing trend, ideally reaching a stable, high value.
3.  **`IQR_95_redn` (95% CI Reduction)**: 
The reduction in the 95% credible interval (IQR) range for each parameter
compared to the previous wave. This indicates how much the uncertainty in each
parameter estimate is decreasing per wave. It should generally show a declining
trend (often roughly linear on this log scale initially), approaching zero as
the parameter estimates stabilize.
4.  **`rel_mean_change` (Relative Mean Change)**: 
The relative change in the posterior mean of each parameter from one wave to the
next (`|mean_t - mean_{t-1}| / |mean_{t-1}|`). This tracks the stability of the
central parameter estimates. It should also show a declining trend (often
roughly linear), approaching zero as the estimates stabilise.

For a successful SMC run, you expect epsilon to drop significantly, ESS to rise
and plateau, and both `IQR_95_redn` and `rel_mean_change` to trend towards zero.

---

## Step 6: Visualize Parameter Evolution

The `plot_evolution()` function shows how the estimated parameter distributions 
change over the waves.

```{r}
plot_evolution(smc_fit, truth)
```

This plot displays the posterior density estimates for each parameter, faceted
by wave number. It allows you to visualize how the algorithm progressively
narrows down the plausible parameter space from the initial prior distributions
towards the final posterior, centred around the `truth`.

---

## Step 7: Check Parameter Correlations

Finally, `plot_correlations()` helps understand the relationships between
parameters in the final posterior sample.

```{r}
plot_correlations(smc_fit$posteriors, truth)
```

This heatmap shows the correlation coefficients between pairs of parameters
based on the final set of weighted particles. Correlations can indicate
dependencies learned by the model or potential identifiability issues. Comparing
the correlation structure to the `truth` values (if applicable) can provide
further insight.

---

## Conclusion

ABC-SMC provides an efficient way to perform Bayesian inference for
models where the likelihood is intractable. By iteratively refining the
parameter proposals based on previous results, it can achieve good posterior
approximations with significantly fewer simulations than basic rejection
sampling. Diagnosing convergence using `plot_convergence()` is important to
ensure the algorithm has run for enough waves to stabilize the estimates.