--- title: "Statistical distributions in tidyabc" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Statistical distributions in tidyabc} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib link-citations: TRUE --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7 ) ``` ```{r setup} library(tidyabc) library(dplyr) library(ggplot2) ggplot2::set_theme(theme_minimal()) ``` ## Introduction There are parts to this vignette. A description of extra distribution functions available in `tidyabc` and explanation of the `dist_fns` S3 class which collects statistical distribution functions for use as a single entity. ## Reparameterised statistical distributions The `tidyabc` package provides several useful probability distribution functions beyond the standard R families available through `stats`. These are particularly handy for modelling scenarios common in ABC workflows, offering convenient parametrizations (e.g., by mean and standard deviation) or specific support ranges. ### Core Functions Like standard R distributions (e.g., `stats::dnorm`, `stats::pnorm`), these custom families provide four main types of functions: - **`dXXX(x, ...)`**: Density (PDF) or Probability Mass (PMF) function. - **`pXXX(q, ...)`**: Cumulative Distribution Function (CDF). - **`qXXX(p, ...)`**: Quantile function (inverse CDF). - **`rXXX(n, ...)`**: Random number generation function. ### Selected Examples Here are examples of a few key families available in `tidyabc`: **1. Logit-Normal (`logitnorm` / `logitnorm2`)** A distribution supported on (0, 1), useful for modelling probabilities or proportions. The `logitnorm2` family parametrizes via the median (`prob.0.5`) and a dispersion parameter (`kappa`). ```{r} # Density at x=0.7 with median 0.6 and low dispersion dlogitnorm2(0.7, prob.0.5 = 0.6, kappa = 0.2) # CDF value at q=0.5 plogitnorm2(0.5, prob.0.5 = 0.6, kappa = 0.2) # Generate 5 random samples rlogitnorm2(5, prob.0.5 = 0.6, kappa = 0.2) ``` **2. Beta (`beta2`)** Another distribution supported on (0, 1), re-parametrized for mean (`prob`) and a coefficient of variation (`kappa`), useful for modelling uncertainty around a central probability. ```{r} # Density at x=0.3 with mean 0.3 and variability 0.1 dbeta2(0.3, prob = 0.3, kappa = 0.1) # Quantile for the 90th percentile qbeta2(0.9, prob = 0.3, kappa = 0.1) # Generate 5 random samples rbeta2(5, prob = 0.3, kappa = 0.1) ``` **3. Gamma (`gamma2`, `cgamma`)** Distributions supported on (0, ∞). `gamma2` is parametrized by mean and standard deviation. `cgamma` is a "convex" gamma (mean > standard deviation), useful for positive variables with a defined mode. ```{r} # Density at x=6 for a gamma with mean=5, sd=2 dgamma2(6, mean = 5, sd = 2) # Quantile for the 75th percentile for convex gamma qcgamma(0.75, mean = 3, kappa = 0.4) # kappa is coef. of variation here # Generate 5 random samples from gamma2 rgamma2(5, mean = 5, sd = 2) ``` **4. Wedge (`wedge`)** A distribution supported on [0, 1] with a linear probability density function. The `a` parameter controls skewness (a=0 is uniform, a>0 is right-skewed, a<0 is left-skewed). ```{r} # Density at x=0.8 with right skew (a=1) dwedge(0.8, a = 1) # Quantile for the 50th percentile (median) with left skew (a=-1) qwedge(0.5, a = -1) # Generate 5 random samples rwedge(5, a = 1) ``` ### List of Additional Families - `logitnorm`, `logitnorm2` - `wedge` - `beta2` - `lnorm2` (Log-Normal) - `gamma2`, `cgamma` (Convex Gamma) - `nbinom2` (Negative Binomial) - `null` (Always returns `NA` - via `rnull`, `pnull`, etc.) These functions can be used directly in R for calculations, simulations, or defining priors before wrapping them into `dist_fns` objects. ### Specialized Random Number Generators `tidyabc` also provides specific functions designed primarily for random sampling, which are particularly useful for forward simulation within ABC workflows. **1. Bernoulli (`rbern`)** Generates logical values (`TRUE`/`FALSE`) based on a probability of success. ```{r} # Simulate 10 Bernoulli trials with prob=0.3 rbern(10, prob = 0.3) ``` **2. Categorical (`rcategorical`)** Generates random samples from a multinomial-like distribution over a set of named or numbered categories, based on provided probabilities. ```{r} # Define probabilities for categories "A", "B", "C" probs <- c("A" = 0.1, "B" = 0.3, "C" = 0.6) # Sample 5 categories according to the probabilities rcategorical(5, prob = probs) # Sample 5 categories and return them as a factor rcategorical(5, prob = probs, factor = TRUE) ``` **3. Exponential Growth Process Times (`rexpgrowth`, `rexpgrowthI0`)** These functions sample event times from an exponentially growing (or decaying) process. This is useful for simulating arrival times, infection times, or other phenomena where the underlying rate changes exponentially over time. - `rexpgrowth(n, r, t_end, t_start)`: Generates `n` event times assuming an exponential growth rate `r` over the interval `[t_start, t_end]`. - `rexpgrowthI0(I0, r, t_end, t_start)`: Generates event times aiming for an *expected* initial number of events `I0` in the first unit of time, given the growth rate `r`. ```{r} # Example: Simulate 100 event times from a process growing at rate 0.1 # over the interval [0, 40] event_times_growth <- rexpgrowth(100, r = 0.1, t_end = 40, t_start = 0) hist(event_times_growth, breaks = 20, main = "Event Times (Growth r=0.1)") # Example: Simulate event times aiming for an initial rate of 10 events per day # over 20 days with a growth rate of 0.05 event_times_I0 <- rexpgrowthI0(I0 = 10, r = 0.05, t_end = 20, t_start = 0) hist(event_times_I0, breaks = 20, main = "Event Times (I0=10, r=0.05)") ``` These functions provide convenient ways to model specific random processes often encountered in simulation models used with ABC. --- # Distribution family S3 class: The `dist_fns` S3 class in `tidyabc` provides a powerful and unified interface for working with probability distributions. It encapsulates the core statistical functions (CDF `p`, quantile `q`, density `d`, random `r`) into a single object, enabling flexible creation, manipulation, and analysis. This vignette demonstrates how to create `dist_fns` objects from standard families, manipulate them within data frames using `tidyverse` principles, fit empirical distributions from data or quantiles, and combine them into mixtures. ## Creating `dist_fns` Objects ### From Standard Families The most straightforward way to create a `dist_fns` object is from standard R distribution families using `as.dist_fns()`. ```{r} # Create a single normal distribution object norm_dist <- as.dist_fns("norm", mean = 4, sd = 3) print(norm_dist) # Access its functions directly norm_dist$p(5) # CDF at x=5 norm_dist$q(0.95) # 95th percentile norm_dist$r(5) # Generate 5 random samples ``` ### Creating Multiple Distributions with `pmap_dist_fns` You can create a *list* of `dist_fns` objects, each with different parameters. This is efficiently done using `pmap_dist_fns()` within a `dplyr::mutate()` call, mapping parameter values from columns of a data frame. ```{r} # Define a data frame with parameters for different families param_df <- tibble( id = 1:3, family = c("norm", "gamma2", "lnorm2"), mean = c(0, 2, 5), sd = c(1, 1.5, 2) ) # Use pmap_dist_fns to create a list column of dist_fns objects dist_df <- param_df %>% mutate( dist_obj = pmap_dist_fns( ., function(family, mean, sd, ...) as.dist_fns(family, mean, sd) ) ) # N.b. annoyingly RStudio does not use the default pillar printer and the # details do not show up when rendering a tibble in RStudio print(dist_df) ``` This pattern is intended for scenarios like storing prior distributions or posterior approximations for multiple parameters within a single data frame. Plotting distributions together is managed with the `plot` function which can be modified with `ggplot` commands. ```{r} plot(dist_df$dist_obj) ``` --- ## Manipulating `dist_fns` Objects in Data Frames Once `dist_fns` objects are stored in a data frame column, you can manipulate them using `map_dist_fns()` or standard `purrr`/`dplyr` functions. ### Applying Transformations: Truncation Example You can apply transformations like `truncate()` to `dist_fns` objects stored in a column. ```{r} # Truncate each distribution from the previous example to be positive dist_df_truncated <- dist_df %>% mutate( dist_truncated = map_dist_fns(dist_obj, ~ truncate(.x, x_left = 0, x_right = 5)) ) # Plot the truncated versions of the distributions plot(dist_df_truncated$dist_truncated) ``` --- ## Combining `dist_fns`: Mixture Distributions A powerful feature is creating mixture distributions from a list of `dist_fns` objects, for example, using `dplyr::summarise()`. ```{r} # Create a list of different distribution objects (e.g., representing different scenarios/models) mix_df = dist_df %>% # Assign weights to each component mutate(weights = c(0.4, 0.4, 0.2)) %>% summarise( mix = c(mixture(dist_obj, weights = weights, name="weighted")) ) mixture_dist = mix_df$mix # Plot the mixture plot(mixture_dist) # Evaluate the mixture CDF mixture_dist$p(c(0, 1, 2, 3, 4)) ``` --- ## Fitting Empirical Distributions `tidyabc` provides robust tools for fitting empirical distributions from data or quantiles. ### From Quantiles using `empirical_cdf` If you have quantile values (e.g., from another distribution or simulation output), you can fit an empirical CDF. This takes a link parameter which lets you easily define support for the empirical CDF. The resulting fit will impute the tails of the distribution to prevent truncation. ```{r} # Example: Fit an empirical distribution to quantiles of a known distribution target_dist <- as.dist_fns("gamma", shape = 2, rate = 1) # Define probability points p_points <- c(0.025, 0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.975) # Get corresponding quantiles q_values <- target_dist$q(p_points) # Fit an empirical distribution using these quantiles empirical_from_quantiles <- empirical_cdf(x = q_values, p = p_points, link = "log", name = "Empirical from Quantiles") # Compare the original and fitted plot(c(target_dist,empirical_from_quantiles)) ``` #### Ensemble of Quantile-Based Distributions as a Mixture You can create an ensemble of such quantile-based fits and combine them into a mixture. ```{r} # From previous example we create a data frame with quantiles in it quantile_df = dist_df %>% mutate( quantiles = purrr::map(dist_obj, ~ .x$q(p_points)) ) %>% select(-dist_obj) # now if this was our input we could fit empirical distributions: # N.B. We could do this with a grouped data frame. ensemble_df = quantile_df %>% mutate( emp_cdf = map_dist_fns(quantiles, function(q) empirical(x=q,p = p_points)) ) %>% summarise( ensemble = c(mixture(emp_cdf,name = "unweighted ensemble")) ) plot(c(ensemble_df$ensemble, mixture_dist)) ``` ### From Samples using `empirical_data` For fitting directly from sample data (e.g., MCMC output, simulation results), use `empirical()` or `empirical_data()`. ```{r} # Generate sample data from a distribution set.seed(123) sample_data <- rgamma(1000, shape = 2, rate = 1) # Fit an empirical distribution to the samples empirical_from_samples <- empirical( x = sample_data, name = "Empirical from Samples", bw=0.25) # Compare the sample histogram with the fitted density tibble(sample = sample_data) %>% ggplot(aes(x = sample)) + geom_histogram(aes(y = after_stat(density)), bins = 50, alpha = 0.5, fill = "lightblue", color = "black") + stat_function(fun = empirical_from_samples$d, color = "red", linewidth = 1) + labs(title = "Sample Histogram vs. Fitted Empirical Density", x = "x", y = "Density") ``` #### Weighted Sample Fitting `empirical()` also handles *weighted* samples, which is crucial in methods like ABC where particles have importance weights. ```{r} # Simulate weighted samples (e.g., importance sampling output) set.seed(456) n_samples <- 500 raw_samples <- c( rnorm(n_samples / 2, mean = 1, sd = 0.5), rnorm(n_samples / 2, mean = 3, sd = 0.7) ) # Simulate some weights (e.g., importance weights) # these are correlated to the distance from 3, which for example might be the # true value weights <- runif(n_samples, min = 0.1, max = 2.0) * exp(-((raw_samples - 3)/2)^2) # Normalize weights (often done automatically internally, but good practice) weights <- weights / sum(weights) # Fit empirical distribution from weighted samples empirical_weighted <- empirical( x = raw_samples, w = weights, name = "weighted", bw=0.25 ) empirical_unweighted <- empirical( x = raw_samples, name = "unweighted", bw=0.25 ) # Plot the fitted distribution plot(c(empirical_weighted,empirical_unweighted)) ``` #### Using `wquantile` for Quantiles from Weighted Data The approach above can give you quantiles from the data. It is however fairly heavyweight if you know you are only going ot need a few quantiles from the weighted data. The `wquantile()` function provides a quicker way to estimate quantiles from weighted data, including use of link functions, using a similar but less complex approach to `empirical`. In this case agreement is close but not perfect ```{r} # Calculate specific quantiles from the weighted samples using wquantile weighted_quantiles <- wquantile( p = c(0.025, 0.25, 0.5, 0.75, 0.975), # Common quantiles x = raw_samples, w = weights, link = "log", # Use identity link for unbounded support, or specify prior support ) print(tibble(Probability = c(0.025, 0.25, 0.5, 0.75, 0.975), Quantile = weighted_quantiles)) # Compare with quantiles from the fitted empirical distribution (which used the same weighted data) fitted_quantiles <- empirical_weighted$q(c(0.025, 0.25, 0.5, 0.75, 0.975)) print(tibble(Probability = c(0.025, 0.25, 0.5, 0.75, 0.975), Quantile_Fitted = fitted_quantiles)) # Values should be similar. # Choices regarding the degree of smoothing will affect the exact answer. ``` --- ## Conclusion The `dist_fns` class provides a flexible and consistent way to represent, manipulate, and combine probability distributions in R, especially within a `tidyverse` workflow. Its integration with data frames allows for powerful batch operations and storage of complex distributional information. The empirical fitting capabilities make it suitable for approximating complex distributions derived from data or simulation results, including handling weighted samples common in Bayesian and ABC methods.