--- title: "Getting started with tidyabc" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with tidyabc} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib link-citations: TRUE --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(tidyabc) library(dplyr) library(ggplot2) ggplot2::set_theme(theme_minimal()) ``` ## Introduction This vignette walks you through a complete **Approximate Bayesian Computation (ABC)** workflow using `tidyabc`, from defining a simulation model to fitting parameters from observed data. ABC is used when the likelihood function is intractable or expensive to compute — common in complex models like those in systems biology, ecology, or social science. We’ll simulate data from a model with two latent processes — a normal distribution and a gamma distribution — and then use ABC to recover their parameters from summary statistics, even without knowing the exact likelihood. --- ## Step 1: Define the Simulation Function We begin by defining a simulation function that generates synthetic data from known parameters. This represents our "forward model" — the mechanism we believe generated the real-world observations. ```{r} # example simulation # We'll be trying to recover norm and gamma parameters # We'll use this function for both example generation and fitting test_simulation_fn = function(norm_mean, norm_sd, gamma_mean, gamma_sd) { A = rnorm(1000, norm_mean, norm_sd) B = rgamma2(1000, gamma_mean, gamma_sd) C = rgamma2(1000, gamma_mean, gamma_sd) return( list( data1 = A + B - C, data2 = B * C ) ) } ``` Here, `data1 = A + B - C` combines a normal variable with two gamma variables, and `data2 = B * C` is their product. The true parameters are unknown to the ABC algorithm — we’ll recover them from summary statistics. --- ## Step 2: Define the Scoring Function Since we can’t compute the likelihood directly, we compare simulated and observed data using **summary statistics**. Here, we use the **Wasserstein distance** — a robust metric for comparing distributions — to measure how similar the simulated and observed data are for each output. ```{r} test_scorer_fn = function(simdata, obsdata) { return(list( data1 = calculate_wasserstein(simdata$data1, obsdata$data1), data2 = calculate_wasserstein(simdata$data2, obsdata$data2) )) } ``` Each element of the returned list represents a distance between the simulated and observed version of `data1` and `data2`. These distances become the basis for accepting or rejecting parameter proposals. --- ## Step 3: Generate Observed Data (Ground Truth) We now generate “observed” data using known parameters — this simulates real-world measurements. In practice, this would come from your actual dataset. ```{r} tmp = test_simulation( test_simulation_fn, test_scorer_fn, norm_mean = 4, norm_sd=2, gamma_mean=6, gamma_sd=2, seed = 123 ) truth = tmp$truth test_obsdata = tmp$obsdata ``` The `test_simulation()` function runs the model once with the specified parameters and returns both the raw simulated data (`obsdata`) and the summary distances (`obsscores`). The `truth` object contains the known parameter values: - `norm_mean = 4` - `norm_sd = 2` - `gamma_mean = 6` - `gamma_sd = 2` We’ll see how well ABC recovers these values. --- ## Step 4: Visualize the Observed Data Let’s look at the two summary variables we’re using for inference. ```{r} ggplot( tibble(data1 = test_obsdata$data1), aes(x=data1))+ geom_histogram(bins=50, fill="steelblue", color="white")+ xlab("A + B - C") ggplot(tibble(data2 = test_obsdata$data2), aes(x=data2))+ geom_histogram(bins=50, fill="coral", color="white")+ xlab("B × C") ``` These histograms show the empirical distributions of the two summary statistics. `data1` is roughly symmetric (due to the normal + gamma combination), while `data2` is right-skewed (product of two gamma variables). ABC will use these shapes to infer the underlying parameters. --- ## Step 5: Define Prior Distributions In Bayesian inference, we express our uncertainty about the parameters before seeing the data using **priors**. Here, we define uniform priors for all parameters, and add a constraint to ensure the gamma distribution has a meaningful shape (mean > standard deviation). ```{r} test_priors = priors( norm_mean ~ unif(0, 10), norm_sd ~ unif(0, 10), gamma_mean ~ unif(0, 10), gamma_sd ~ unif(0, 10), # enforces convex gamma: mean > sd ~ gamma_mean > gamma_sd ) test_priors ``` The output shows the prior distributions for each parameter and the constraint. The constraint `~ gamma_mean > gamma_sd` ensures that the gamma distribution is not overly flat — a common real-world assumption. --- ## Step 6: Run ABC Rejection Sampling Now we perform **ABC rejection sampling**: generate many parameter sets from the prior, simulate data for each, and accept those whose summary statistics are close enough to the observed ones. ```{r} rejection_fit = abc_rejection( obsdata = test_obsdata, priors_list = test_priors, sim_fn = test_simulation_fn, scorer_fn = test_scorer_fn, n_sims = 10000, acceptance_rate = 0.01, parallel = FALSE ) summary(rejection_fit) ``` - `n_sims = 10000`: We simulate 10,000 parameter sets. - `acceptance_rate = 0.01`: We accept the top 1% of simulations with smallest distances (i.e., best matches). The `summary()` output shows: - The **median** and **IQR** of the posterior for each parameter (i.e., the best estimates after conditioning on the data). - The **Effective Sample Size (ESS)**: a measure of how much information is in the posterior (higher = better). - The **convergence status**. Compare the posterior medians to the true values (`truth`). With enough simulations and a good distance metric, we expect them to be close. --- ## Step 7: Plot the Results Finally, we visualize the posterior distributions alongside the true parameter values. ```{r} plot(rejection_fit, truth=truth) ``` The plot shows: - **Marginal posterior densities** for each parameter (red curves), estimated using empirical distribution fitting. - **Dotted vertical lines** indicating the true parameter values (`truth`). - **Solid vertical marks** representing the median and 95% credible intervals. If the true values lie within the credible intervals and are near the peak of the posterior, we can conclude that ABC successfully recovered the underlying parameters — even without knowing the likelihood! --- ## Conclusion This vignette demonstrates a complete ABC workflow that could work with observed data: 1. Define a simulation model (`sim_fn`), 2. Define a distance metric (`scorer_fn`), 3. Specify priors, 4. Run ABC rejection sampling, 5. Summarize and visualize the posterior. The `tidyabc` package makes this process intuitive and modular. You can now replace `test_simulation_fn` and `test_scorer_fn` with your own model and summary statistics — and let ABC do the inference for you. For more efficiency, consider switching to `abc_smc()` or `abc_adaptive()` in larger problems, which use sequential refinement to reduce the number of required simulations.