Validate fusion output — validate • fusionModel

Performs internal validation analyses on fused microdata to estimate how well the simulated variables reflect patterns in the dataset used to train the underlying fusion model (i.e. observed/donor data). This provides a standard approach to validating fusion output and associated models. See Examples for recommended usage.

Usage

validate(
  observed,
  implicates,
  subset_vars,
  weight = NULL,
  min_size = 30,
  plot = TRUE,
  cores = 1
)

Arguments

observed: Data frame. Observed data against which to validate the simulated variables. Typically the same dataset used to train the fusion model used to generate implicates.
implicates: Data frame. Implicates of synthetic (fused) variables. Typically generated by fuse. The implicates should be row-stacked and identified by integer column "M".
subset_vars: Character. Vector of columns in observed used to define the population subsets across which the fusion variables are validated. The levels of each variable in subset_vars, together with all two-way interactions among those variables, define the population subsets. Continuous subset_vars are converted to a five-level ordered factor based on a univariate k-means clustering.
weight: Character. Name of the observation weights column in observed. If NULL (default), uniform weights are assumed.
min_size: Integer. Subsets with fewer than min_size observations are excluded, since subsets with few observations are unlikely to give reliable estimates for validation purposes.
plot: Logical. If TRUE (default), plot_valid is called internally and summary plots are returned along with complete validation results. Requires the ggplot2 package.
cores: Integer. Number of cores used. Only applicable on Unix systems.
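
The conversion of continuous subset_vars described above can be illustrated with a minimal sketch. It assumes stats::kmeans with five centers and orders the clusters by their centers; kmeans_bin is a hypothetical helper written for this illustration and is not part of fusionModel, whose internal implementation may differ.

```r
# Hypothetical helper (not part of fusionModel): bin a continuous variable
# into an ordered factor using univariate k-means clustering.
kmeans_bin <- function(x, k = 5) {
  stopifnot(is.numeric(x), length(unique(x)) >= k)
  km <- stats::kmeans(x, centers = k, nstart = 25)
  # Rank clusters by their centers so that level 1 holds the smallest values
  rank_of_cluster <- rank(km$centers[, 1])
  factor(rank_of_cluster[km$cluster], levels = seq_len(k), ordered = TRUE)
}

# Illustrative values only; any numeric column could be used
income_like <- c(1, 2, 3, 11, 12, 13, 21, 22, 23, 31, 32, 33, 41, 42, 43) * 1000
set.seed(1)  # k-means uses random starts; fix the seed for reproducibility
bins <- kmeans_bin(income_like)
table(bins)
```

Ordering the levels by cluster center keeps the resulting factor monotone in the original variable, which is what makes an ordered factor a sensible representation.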

Value

If plot = FALSE, a data frame containing complete validation results. If plot = TRUE, a list containing full results as well as additional plot objects as described in plot_valid.

Details

The objective of validate is to confirm that the fusion output is sensible and help establish the utility of the synthetic data across myriad analyses. Utility here is based on comparison of point estimates and confidence intervals derived using multiple-implicate synthetic data with those derived using the original donor data.
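
To make the idea of a multiple-implicate point estimate and confidence interval concrete, here is a minimal sketch. This page does not state which pooling rule validate uses, so the sketch assumes Rubin's rules purely as a familiar example; pool_rubin is a hypothetical helper and the input numbers are made up.

```r
# Hypothetical helper: pool M per-implicate estimates with Rubin's rules.
# est: point estimates, one per implicate
# se:  standard errors of those estimates
pool_rubin <- function(est, se, level = 0.95) {
  M <- length(est)
  stopifnot(M > 1, length(se) == M)
  q_bar <- mean(est)                  # pooled point estimate
  u_bar <- mean(se^2)                 # average within-implicate variance
  b <- stats::var(est)                # between-implicate variance
  t_var <- u_bar + (1 + 1 / M) * b    # total variance
  # Degrees of freedom; infinite when all implicates agree exactly
  df <- if (b > 0) (M - 1) * (1 + u_bar / ((1 + 1 / M) * b))^2 else Inf
  half <- stats::qt(1 - (1 - level) / 2, df) * sqrt(t_var)
  c(estimate = q_bar, se = sqrt(t_var),
    lower = q_bar - half, upper = q_bar + half)
}

# Made-up estimates of a subset mean from three implicates
pooled <- pool_rubin(est = c(10, 12, 11), se = c(1, 1, 1))
pooled
```

The pooled interval can then be set beside the interval computed from the observed data for the same subset, which is the kind of comparison described above.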

The specific analyses tested include variable levels (means and proportions) across population subsets of varying size. This allows estimates of how each of the synthetic variables performs in analyses with real-world relevance, at varying levels of complexity. In effect, validate() performs a large number of analyses of the kind that the analyze function is designed to do on a one-by-one basis.
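
As a rough illustration of a single analysis of this kind, the sketch below computes a weighted mean and a weighted proportion within each level of one subset variable. The toy data, column names, and the helpers subset_mean and subset_prop are all hypothetical; this is base R and not the code used by validate or analyze.

```r
# Toy data; column names and values are illustrative only
d <- data.frame(
  income = c("low", "low", "high", "high", "high"),
  electricity = c(100, 200, 300, 500, 400),
  aircon = c("yes", "no", "yes", "yes", "no"),
  weight = c(1, 3, 2, 1, 1)
)

# Weighted mean of a continuous variable within each subset
subset_mean <- function(data, var, by, w) {
  sapply(split(data, data[[by]]), function(g) {
    stats::weighted.mean(g[[var]], g[[w]])
  })
}

# Weighted proportion of one level of a categorical variable within each subset
subset_prop <- function(data, var, level, by, w) {
  sapply(split(data, data[[by]]), function(g) {
    sum(g[[w]][g[[var]] == level]) / sum(g[[w]])
  })
}

m <- subset_mean(d, "electricity", "income", "weight")
p <- subset_prop(d, "aircon", "yes", "income", "weight")
```

Running the same calculation on the observed data and on each implicate, for every subset with at least min_size observations, gives the pairs of estimates that a validation compares.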

Most users will want to keep the default setting plot = TRUE, which also returns plots of the validation results. Plot creation is detailed in plot_valid.

Examples

# Build a fusion model using RECS microdata
# Note that "fusion_model.fsn" will be written to the working directory
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs,
                  y = fusion.vars,
                  x = predictor.vars,
                  weight = "weight")

# Fuse back onto the donor data (multiple implicates)
sim <- fuse(data = recs,
            fsn = fsn.path,
            M = 20)

# Calculate validation results
valid <- validate(observed = recs,
                  implicates = sim,
                  subset_vars = c("income", "education", "race", "urban_rural"))