Validate fusion output
Performs internal validation analyses on fused microdata to estimate how well the simulated variables reflect patterns in the dataset used to train the underlying fusion model (i.e. observed/donor data). This provides a standard approach to validating fusion output and associated models. See Examples for recommended usage.
Usage
validate(
  observed,
  implicates,
  subset_vars,
  weight = NULL,
  min_size = 30,
  plot = TRUE,
  cores = 1
)
Arguments
- observed
Data frame. Observed data against which to validate the simulated variables. Typically the same dataset used to train the fusion model that generated the simulated variables.
- implicates
Data frame. Implicates of synthetic (fused) variables. Typically generated by fuse. The implicates should be row-stacked and identified by integer column "M".
- subset_vars
Character. Vector of columns in observed used to define the population subsets across which the fusion variables are validated. The levels of each of the subset_vars (including all two-way interactions of subset_vars) define the population subsets. Continuous subset_vars are converted to a five-level ordered factor based on a univariate k-means clustering.
- weight
Character. Name of the observation weights column in observed. If NULL (default), uniform weights are assumed.
- min_size
Integer. Subsets with fewer than min_size observations are excluded. Subsets with few observations are unlikely to give reliable estimates, so they are not considered for validation purposes.
- plot
Logical. If TRUE (default), plot_valid is called internally and summary plots are returned along with complete validation results. Requires the ggplot2 package.
- cores
Integer. Number of cores used. Only applicable on Unix systems.
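The conversion of a continuous subset variable to a five-level ordered factor can be pictured with a minimal sketch. This is an illustration only, not the package's internal code: the helper name to_ordered_levels is hypothetical, and it assumes stats::kmeans is used with cluster centers ranked from low to high.

```r
# Hypothetical helper for illustration; not a function from the package
to_ordered_levels <- function(x, k = 5) {
  km <- stats::kmeans(x, centers = k, nstart = 10)
  # Cluster ids are arbitrary, so rank the cluster centers to order the levels
  center_rank <- rank(km$centers[, 1])
  factor(unname(center_rank[km$cluster]), levels = seq_len(k), ordered = TRUE)
}

set.seed(42)  # k-means uses random starting centers
# Toy continuous variable (made-up values) with five well-separated groups
x <- as.numeric(c(1:10, 101:110, 201:210, 301:310, 401:410))
f <- to_ordered_levels(x)
table(f)
```

Because the levels are ordered by cluster center, level 1 always holds the lowest values and level 5 the highest, regardless of the arbitrary cluster ids returned by k-means.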
Value
If plot = FALSE, a data frame containing complete validation results. If plot = TRUE, a list containing full results as well as additional plot objects as described in plot_valid.
Details
The objective of validate is to confirm that the fusion output is sensible and help establish the utility of the synthetic data across myriad analyses. Utility here is based on comparison of point estimates and confidence intervals derived using multiple-implicate synthetic data with those derived using the original donor data.
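The comparison of point estimates can be sketched on toy data. This is a simplified illustration, not the computation validate performs: the data values are made up, rows of each implicate are assumed to align with rows of the observed data, and the per-implicate estimates are pooled by a simple average. Confidence intervals are not shown.

```r
# Toy donor data (hypothetical values, for illustration only)
observed <- data.frame(
  urban = c("yes", "yes", "no", "no", "no"),
  electricity = c(10, 12, 20, 22, 24),
  weight = c(1, 2, 1, 1, 2)
)

# Toy implicates: row-stacked and identified by integer column "M"
implicates <- data.frame(
  M = rep(1:2, each = 5),
  electricity = c(11, 11, 21, 21, 25,  # implicate 1
                  9, 13, 19, 23, 23)   # implicate 2
)

# Assumption: rows within each implicate are in the same order as `observed`
subset_rows <- observed$urban == "no"
w <- observed$weight[subset_rows]

# Point estimate from the donor data for this population subset
est_obs <- weighted.mean(observed$electricity[subset_rows], w)

# One estimate per implicate, then a simple average across implicates
est_by_m <- sapply(
  split(implicates$electricity, implicates$M),
  function(y) weighted.mean(y[subset_rows], w)
)
est_sim <- mean(est_by_m)

c(observed = est_obs, simulated = est_sim)
```

The closer the simulated estimate is to the observed one across many subsets, the more confidence one can place in analyses that use the fused variables.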
The specific analyses tested include variable levels (means and proportions) across population subsets of varying size. This allows estimates of how each of the synthetic variables performs in analyses with real-world relevance, at varying levels of complexity. In effect, validate() performs a large number of analyses of the kind that the analyze function is designed to do on a one-by-one basis.
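How the population subsets arise from subset_vars and min_size can be sketched as follows. This is only an illustration under stated assumptions: the toy data and the counting approach are hypothetical and are not taken from the package source.

```r
# Toy data (hypothetical); two categorical subset variables
df <- data.frame(
  income = rep(c("low", "high"), times = c(60, 40)),
  urban = rep(c("yes", "no", "yes", "no"), times = c(45, 15, 10, 30))
)
subset_vars <- c("income", "urban")
min_size <- 30

# Each variable on its own, plus every two-way combination
combos <- c(as.list(subset_vars), combn(subset_vars, 2, simplify = FALSE))

# Count the observations in every subset defined by each combination
counts <- do.call(rbind, lapply(combos, function(v) {
  n <- as.data.frame(table(df[v]), responseName = "n")
  data.frame(
    vars = paste(v, collapse = " x "),
    levels = do.call(paste, c(n[v], sep = " / ")),
    n = n$n
  )
}))

# Subsets with fewer than min_size observations are dropped
kept <- counts[counts$n >= min_size, ]
kept
```

In this toy case two of the four income-by-urban cells fall below min_size and are excluded, while all single-variable subsets are retained.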
Most users will want to use the default setting plot = TRUE to simultaneously return visualizations (plots) of the validation results. Plot creation is detailed in plot_valid.
Examples
# Build a fusion model using RECS microdata
# Note that "fusion_model.fsn" will be written to working directory
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs,
                  y = fusion.vars,
                  x = predictor.vars,
                  weight = "weight")

# Fuse back onto the donor data (multiple implicates)
sim <- fuse(data = recs,
            fsn = fsn.path,
            M = 20)

# Calculate validation results
valid <- validate(observed = recs,
                  implicates = sim,
                  subset_vars = c("income", "education", "race", "urban_rural"))