Creates and optionally saves to disk representative plots of the validation results returned by validate. Requires the suggested ggplot2 package. By default, this function is called within validate, but it can also be used on its own to save graphics to disk or to generate plots for a subset of the fusion variables.

Usage

plot_valid(valid, y = NULL, path = NULL, cores = 1, ...)

Arguments

valid

Object returned by validate.

y

Character. Fusion variables to use for validation graphics. Useful for plotting partial validation results. Default is to use all fusion variables present in valid.

path

Character. Path to directory where .png graphics are to be saved. Directory is created if necessary. If NULL (default), no files are saved to disk.

cores

Integer. Number of cores used. Only applicable on Unix systems.

...

Arguments passed to ggsave to control .png graphics saved to disk.

Value

A list with "plots", "smooth", and "data" slots. The "plots" slot contains the following ggplot objects:

  • est: Comparison of point estimates (median absolute percent error).

  • moe: Comparison of 90% margin of error (median ratio of simulated-to-observed MOE).

  • Additional named slots, one per fusion variable, contain variable-specific versions of the plots described above along with scatterplot results.

"smooth" is a data frame with the plotting values used to produce the smoothed median plots. "data" is a data frame with the complete validation results as returned by the original call to validate.

Details

Validation results are visualized to convey the expected, typical (median) performance of the fusion variables. That is, how well do the simulated data match the observed data with respect to point estimates and confidence intervals for population subsets of various sizes?

Error metrics suited to plotting are derived from the input validation data. For comparison of point estimates, the error metric is absolute percent error for continuous variables; in the categorical case it is absolute error, scaled such that the maximum possible error is 1. Since these metrics are not strictly comparable, the all-variable plots denote categorical fusion variables with dotted lines.
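As a rough sketch of how such metrics could be computed (the helper names below are illustrative only and not part of the package API; the categorical scaling shown is one plausible construction):

```r
# Illustrative helpers; not part of the package API

# Continuous variables: absolute percent error of a point estimate
ape <- function(obs, sim) abs(sim - obs) / abs(obs)

# Categorical variables: absolute error in an estimated proportion,
# scaled by the largest error possible given the observed proportion,
# so the resulting metric is bounded above by 1
scaled_abs_error <- function(obs, sim) abs(sim - obs) / pmax(obs, 1 - obs)

ape(100, 110)              # 10% error on a continuous estimate
scaled_abs_error(0.25, 1)  # worst-case categorical error scales to 1
```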

For a given fusion variable, the error metric exhibits variation (often quite skewed) even across subsets of comparable size, because each subset reflects a unique partition of the data. To convey how expected, typical performance varies with subset size, the smoothed median error conditional on subset size is approximated and plotted.
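One simple way to approximate a smoothed conditional median is a running median over subsets ordered by size. The sketch below uses simulated data and stats::runmed; the package's actual smoothing procedure may differ:

```r
# Sketch only: simulated skewed validation errors, not package output
set.seed(1)
df <- data.frame(size = sort(round(runif(200, 100, 5000))))
df$error <- rexp(200, rate = sqrt(df$size))  # skewed errors that shrink as subsets grow

# Approximate the median error conditional on subset size
# with a running median over a window of 21 neighboring subsets
df$smooth <- stats::runmed(df$error, k = 21)
```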

Examples

# Build a fusion model using RECS microdata
# Note that "fusion_model.fsn" will be written to working directory
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs,
                  y = fusion.vars,
                  x = predictor.vars,
                  weight = "weight")

# Fuse back onto the donor data (multiple implicates)
sim <- fuse(data = recs,
            file = fsn.path,
            M = 30)

# Calculate validation results but do not generate plots
valid <- validate(observed = recs,
                  implicates = sim,
                  subset_vars = c("income", "education", "race", "urban_rural"),
                  weight = "weight",
                  plot = FALSE)

# Create validation plots
valid <- plot_valid(valid)

# View some of the plots
valid$plots$est
valid$plots$moe
valid$plots$electricity$bias

# Can also save the plots to disk at creation
# Will save .png files to 'valid_plots' folder in working directory
# Note that it is fine to pass a 'valid' object with existing $plots slot
# In that case, the plots are simply re-generated
vplots <- plot_valid(valid,
                     path = file.path(getwd(), "valid_plots"),
                     width = 8, height = 6)
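Using the documented 'y' argument, a sketch of generating graphics for only a subset of the fusion variables (assumes the 'valid' object created in the examples above):

```r
# Generate plots for a single fusion variable
# ('electricity' is one of the fusion.vars defined above)
v_elec <- plot_valid(valid, y = "electricity")
v_elec$plots$electricity$bias
```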