Analyze fusion output
analyze.Rd
Calculation of point estimates and associated margin of error for analyses using fused/synthetic microdata. Can calculate means, proportions, sums, counts, and medians, optionally across population subgroups.
Usage
analyze(
  x,
  implicates,
  static = NULL,
  weight = NULL,
  rep_weights = NULL,
  by = NULL,
  fun = NULL,
  var_scale = 4,
  cores = 1
)
Arguments
- x
List. Named list specifying the desired analysis type(s) and the associated target variable(s). Example:
x = list(mean = c("v1", "v2"), median = "v3")
translates as: "Return the mean value of variables v1 and v2 and the median of v3". Supported analysis types includemean
,sum
, andmedian
. Mean and sum automatically return proportions and counts, respectively, if the target variable is a factor. Target variables must be inimplicates
,static
, or a data.frame returned by a customfun
.- implicates
Data frame. Implicates of synthetic (fused) variables. Typically generated by fuse. The implicates should be row-stacked and identified by integer column "M".
- static
Data frame. Optional static (non-synthetic) variables that do not vary across implicates. Note that nrow(static) = nrow(implicates) / max(implicates$M) and the row ordering is assumed to be consistent between static and implicates.
- weight
Character. Name of the observation weights column in static. If NULL (default), uniform weights are assumed.
- rep_weights
Character. Optional vector of replicate weight columns in static. If provided, the returned margins of error reflect the additional variance due to uncertainty in the sample weights.
- by
Character. Optional column name(s) in implicates or static (typically factors) that collectively define the set of population subgroups for which each analysis is executed. If NULL, the analysis is done for the whole sample.
- fun
Function. Optional function applied to input data prior to executing analyses. Can be used to do non-conventional/custom analyses.
- var_scale
Scalar. Factor by which to scale the unadjusted replicate weight variance. This is determined by the survey design. The default (var_scale = 4) is appropriate for ACS and RECS.
- cores
Integer. Number of cores used. Only applicable on Unix systems.
Value
A data.table reporting analysis results, possibly across subgroups defined in by. The returned quantities include:
- N
Number of observations used for the analysis.
- y
Target variable.
- level
Levels of factor target variables.
- type
Type of estimate returned: mean, proportion, sum, count, or median.
- est
Point estimate.
- moe
Margin of error associated with the 90% confidence interval.
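Because the result is a data.table, the reported quantities can be subset directly. A small sketch, using the column names listed above (result is assumed to come from a call like those in the Examples; any by variables appear as additional columns):

library(data.table)

# Keep only the proportion estimates (factor targets), ordered by margin of error
props <- result[type == "proportion"][order(moe)]

# Approximate 90% confidence interval implied by the point estimate and MOE
props[, `:=`(lower = est - moe, upper = est + moe)]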
Details
At a minimum, the user must supply synthetic implicates (typically generated by fuse). Inputs are checked for consistent dimensions.
If implicates contains only a single implicate and rep_weights = NULL, the "typical" standard error is returned with a warning to make sure the user is aware of the situation.
Estimates and standard errors for the requested analysis are calculated separately for each implicate. The final point estimate is the mean estimate across implicates. The final standard error is the pooled SE across implicates, calculated using Rubin's pooling rules (1987) with a finite population adjustment of the degrees of freedom (Barnard and Rubin 1999).
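As a rough sketch of those pooling rules (an illustration, not the package's exact internals; est and se are assumed to be vectors of per-implicate point estimates and standard errors):

# Rubin's (1987) pooling of estimates across M implicates
pool_rubin <- function(est, se) {
  M <- length(est)
  qbar <- mean(est)               # pooled point estimate
  ubar <- mean(se^2)              # average within-implicate variance
  b <- var(est)                   # between-implicate variance
  tvar <- ubar + (1 + 1 / M) * b  # total variance
  c(est = qbar, se = sqrt(tvar))
}

The Barnard and Rubin (1999) degrees-of-freedom adjustment used for the reported margin of error is omitted here for brevity.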
When replicate weights are provided, the standard errors of each implicate are calculated via the variance of estimates across replicates. Calculations leverage data.table operations for speed and memory efficiency. The within-implicate variance is calculated around the point estimate (rather than around the mean of the replicates). This is equivalent to mse = TRUE in svrepdesign and is generally the appropriate method for most surveys.
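One plausible sketch of that calculation, assuming the unadjusted variance is the mean squared deviation of the replicate estimates around the full-sample point estimate (the mse = TRUE analogue), scaled by var_scale:

# Hypothetical replicate-weight SE; 'est' is the full-sample estimate and
# 'est_rep' the vector of estimates computed with each set of replicate weights
replicate_se <- function(est, est_rep, var_scale = 4) {
  sqrt(var_scale * mean((est_rep - est)^2))
}

The exact scaling depends on the survey's replication design; see var_scale above.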
If replicate weights are NOT provided, the standard errors of each implicate are calculated using the variance within the implicate. For means, the ratio variance approximation of Cochran (1977) is used, as this is known to be a good approximation of bootstrapped SE's for weighted means (Gatz and Smith 1995). For proportions, a generalization of the unweighted SE formula is used. For regression coefficients, the standard error is calculated by summary.glm.
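For reference, a sketch of that ratio variance approximation for a weighted mean, following the formula reported by Gatz and Smith (1995); x is the target variable, w the weights, and this is an illustration rather than the package's exact code:

# Approximate SE of a weighted mean (Cochran 1977; Gatz and Smith 1995)
weighted_mean_se <- function(x, w) {
  n <- length(x)
  xw <- sum(w * x) / sum(w)   # weighted mean
  wbar <- mean(w)             # mean weight
  v <- n / ((n - 1) * sum(w)^2) *
    (sum((w * x - wbar * xw)^2) -
       2 * xw * sum((w - wbar) * (w * x - wbar * xw)) +
       xw^2 * sum((w - wbar)^2))
  sqrt(v)
}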
References
Barnard, J., & Rubin, D.B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86, 948-955.
Cochran, W. G. (1977). Sampling Techniques (3rd Edition). Wiley, New York.
Gatz, D.F., & Smith, L. (1995). The standard error of a weighted mean concentration — I. Bootstrapping vs other methods. Atmospheric Environment, 29(11), 1185-1193.
Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.
Examples
# Build a fusion model using RECS microdata
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)
# Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient
sim <- fuse(data = recs, fsn = fsn.path, M = 30)
head(sim)
#---------
# Multiple types of analyses can be done at once
# This calculates estimates using the full sample
result <- analyze(x = list(mean = c("natural_gas", "aircon"),
median = "electricity",
sum = c("electricity", "aircon")),
implicates = sim,
weight = "weight")
View(result)
#-----
# Mean electricity consumption, by climate zone and urban/rural status
result1 <- analyze(x = list(mean = "electricity"),
                   implicates = sim,
                   static = recs,
                   weight = "weight",
                   by = c("climate", "urban_rural"))
# Same as above but including sample weight uncertainty
# Note that only the first 30 replicate weights are used internally
result2 <- analyze(x = list(mean = "electricity"),
                   implicates = sim,
                   static = recs,
                   weight = "weight",
                   rep_weights = paste0("rep_", 1:96),
                   by = c("climate", "urban_rural"))
# Helper function for comparison plots
pfun <- function(x, y) {plot(x, y); abline(0, 1, lty = 2)}
# Inclusion of replicate weights does not affect estimates, but it does
# increase margin of error due to uncertainty in RECS sample weights
pfun(result1$est, result2$est)
pfun(result1$moe, result2$moe)
# Notice that relative uncertainty declines with subset size
plot(result1$N, result1$moe / result1$est)
#-----
# Use a custom function to perform more complex analyses
# Custom function should return a data frame with non-standard target variables
my_fun <- function(data) {
  # Manipulate 'data' as desired
  # All variables in 'implicates' and 'static' are available
  # Construct electricity consumption per square foot
  kwh_per_ft2 <- data$electricity / data$square_feet
  # Binary (T/F) indicator of whether the household uses natural gas
  use_natural_gas <- data$natural_gas > 0
  # Return a data.frame of custom variables to be analyzed
  data.frame(kwh_per_ft2, use_natural_gas)
}
# Do analysis using variables produced by custom function
# Non-custom target variables can be included as well
result <- analyze(x = list(mean = c("kwh_per_ft2", "use_natural_gas", "electricity")),
implicates = sim,
static = recs,
weight = "weight",
fun = my_fun)