Analyze fusion output — analyze2 • fusionModel

Calculation of point estimates and associated margin of error for analyses using fused/synthetic microdata with replicate weights. Efficiently computes means, proportions, sums, counts, medians, standard deviations, and variances, optionally across population subgroups. This differs from analyze in that it requires replicate weights and calculates uncertainty using full replicate weight variance (no approximation).

Usage

analyze2(
  analyses,
  implicates,
  static,
  weight,
  rep_weights,
  by = NULL,
  var_scale = 4,
  cores = 1
)

Arguments

analyses: List. Specifies the desired analyses. See Details and Examples. Variables referenced in analyses must be in implicates or static.
implicates: Data frame or file path. Implicates of synthetic (fused) variables; typically the output from fuse. The implicates should be row-stacked and identified by integer column "M". If a file path to a ".fst" file, only the necessary columns are read into memory.
static: Data frame or file path. Static variables that do not vary across implicates; typically the "recipient" microdata passed to fuse. At a minimum, static must contain weight and rep_weights. If a file path to a ".fst" file, only the necessary columns are read into memory. Note that nrow(static) = nrow(implicates) / max(implicates$M) and the row-ordering is assumed to be consistent between static and implicates.
weight: Character. Name of the primary observation weights column in static.
rep_weights: Character. Vector of replicate weight columns in static.
by: Character. Optional column name(s) in implicates or static (typically factors) that collectively define the set of population subgroups for which each analysis is executed. If NULL, analysis is done for the whole sample.
var_scale: Scalar. Factor by which to scale the unadjusted replicate weight variance. This is determined by the survey design. The default (var_scale = 4) is appropriate for ACS and RECS.
cores: Integer. Number of cores used for multithreading in collapse-package functions.
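
The row-count and row-order requirement on static and implicates can be checked before calling analyze2(). The following is a minimal sketch with made-up toy data; the object names and values are illustrative assumptions, and it shows one way to line up static rows with row-stacked implicates, not necessarily how analyze2() does it internally.

```r
# Toy inputs (hypothetical values): 3 recipient rows, 2 implicates
static <- data.frame(weight = c(1.5, 2.0, 0.5))
implicates <- data.frame(
  M = rep(1:2, each = 3),
  electricity = c(10, 20, 30, 11, 19, 31)
)

# The documented requirement: one static row per observation per implicate
n_implicates <- max(implicates$M)
stopifnot(nrow(static) == nrow(implicates) / n_implicates)

# Recycle the static rows so they align with the row-stacked implicates
idx <- rep(seq_len(nrow(static)), times = n_implicates)
implicates$weight <- static$weight[idx]
```

If the check fails, the implicates were probably generated from a different recipient data set or have been re-sorted since.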

Value

A tibble reporting analysis results, possibly across subgroups defined in by. The returned quantities include:

lhs: Optional analysis name; the "left hand side" of the analysis formula.
rhs: The "right hand side" of the analysis formula.
type: Type of analysis: sum, mean, median, sd, var, prop(ortion) or count.
level: Factor levels for categorical analyses; NA or omitted otherwise.
est: Point estimate; mean estimate across implicates.
moe: Margin of error associated with the 90% confidence interval.
rshare: Share of MOE attributable to replicate weights (as opposed to variance across implicates).

Details

The final point estimates are the mean estimates across implicates. The final margin of error is derived from the pooled standard error across implicates, calculated using Rubin's (1987) pooling rules. The within-implicate standard errors are calculated using the replicate weights and var_scale.
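
The pooling described above can be sketched in base R. This is a minimal illustration, not a description of analyze2() internals. It assumes the textbook form of Rubin's rules and a replicate-weight variance equal to var_scale times the mean squared deviation of the replicate estimates from the primary estimate. The helper names repvar and pool_rubin, the degrees-of-freedom formula, and the use of a t quantile for the 90% margin of error are all assumptions made for illustration.

```r
# Hypothetical helper: within-implicate variance from replicate estimates.
# est: estimate using the primary weight
# rep_est: estimates obtained with each replicate weight
repvar <- function(est, rep_est, var_scale = 4) {
  var_scale * mean((rep_est - est)^2)
}

# Hypothetical helper: pool M implicate-level results with Rubin's rules.
# est: point estimate per implicate; u: within-implicate variance per implicate
pool_rubin <- function(est, u, level = 0.90) {
  M <- length(est)
  qbar <- mean(est)                 # pooled point estimate
  ubar <- mean(u)                   # mean within-implicate variance
  b <- var(est)                     # between-implicate variance
  total <- ubar + (1 + 1 / M) * b   # total variance
  # Rubin (1987) degrees of freedom; infinite if implicates do not differ
  df <- if (b > 0) (M - 1) * (1 + ubar / ((1 + 1 / M) * b))^2 else Inf
  se <- sqrt(total)
  moe <- qt(1 - (1 - level) / 2, df) * se
  c(est = qbar, se = se, moe = moe)
}

# Toy inputs (made-up numbers): 3 implicates with equal within-variance
pooled <- pool_rubin(est = c(10, 11, 12), u = c(0.5, 0.5, 0.5))
```

The between-implicate term grows when the fused values disagree across implicates, so the pooled margin of error reflects both sampling and modeling uncertainty.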

Each entry in the analyses list is a formula of the format Z ~ F(E), where Z is an optional, user-friendly name for the analysis, F is an allowable “outer function”, and E is an “inner expression” containing one or more microdata variables. For example:

mysum ~ mean(Var1 + Var2)

In this case, the outer function is mean(). Allowable outer functions are: mean(), sum(), median(), sd(), and var(). When the inner expression contains more than one variable, the expression is evaluated first and F() is then applied to the result. In the example above, an internal variable X = Var1 + Var2 is generated across all observations, and then mean(X) is computed.
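
To make the two-step evaluation concrete, the following base R sketch splits a formula into its name, outer function, and inner expression, then evaluates the inner expression before applying a weighted mean. The toy data and the object names (d, lhs, outer_fun, inner) are illustrative assumptions, and the parsing shown here is not necessarily how analyze2() works internally.

```r
# Toy microdata (made-up values) with an observation weight 'w'
d <- data.frame(Var1 = c(1, 2, 3), Var2 = c(10, 20, 30), w = c(1, 1, 2))

fobj <- mysum ~ mean(Var1 + Var2)

lhs <- as.character(fobj[[2]])             # analysis name: "mysum"
outer_fun <- as.character(fobj[[3]][[1]])  # outer function: "mean"
inner <- fobj[[3]][[2]]                    # inner expression: Var1 + Var2

# Step 1: evaluate the inner expression across all observations
X <- eval(inner, d)

# Step 2: apply the (weighted) outer function to the result
est <- weighted.mean(X, d$w)
```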

If no inner expression is desired, the analyses list can use the following convenient syntax to apply a single outer function to multiple variables:

mean = c("Var1", "Var2")

The inner expression can also utilize any function that takes variable names as arguments and returns a vector with the same length as the inputs. This is useful for defining complex operations in a separate function (e.g. microsimulation). For example:

myfun = function(Var1, Var2) {Var1 + Var2}

mysum ~ mean(myfun(Var1, Var2))

The use of sum() or mean() with an inner expression that returns a categorical vector automatically results in category-wise weighted counts and proportions, respectively. For example, the following analysis would fail if evaluated literally, since mean() expects numeric input but the inner expression returns a character vector. Instead, it is interpreted as a request to return weighted proportions for each categorical outcome.

myprop ~ mean(ifelse(Var1 > 10, 'Yes', 'No'))
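
The category-wise interpretation of the example above can be reproduced by hand in base R. This is a sketch with made-up data and a hypothetical weight column w; it shows only the arithmetic behind weighted counts and proportions, not the analyze2() implementation.

```r
# Toy microdata (made-up values) with an observation weight 'w'
d <- data.frame(Var1 = c(5, 12, 20, 8), w = c(1, 2, 1, 4))

# The inner expression returns a character vector
X <- ifelse(d$Var1 > 10, "Yes", "No")

# sum() on a categorical result: weighted count per category
counts <- tapply(d$w, X, sum)

# mean() on a categorical result: weighted proportion per category
props <- counts / sum(d$w)
```

Each category becomes its own row in the output, which is why the result includes a level column for categorical analyses.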

analyze2() uses "fast" versions of the allowable outer functions, as provided by fast-statistical-functions in the collapse package. These functions are highly optimized for weighted, grouped calculations. In addition, outer functions mean(), sum(), and median() enjoy the use of platform-independent multithreading across columns when cores > 1. Analyses with numerical inner expressions are processed using a series of calls to collap with unique observation weights. Analyses with categorical inner expressions utilize a series of calls to fsum.
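
As a rough illustration of the kind of weighted, grouped calculation involved, the sketch below computes a weighted mean by group in base R and, if the collapse package is installed, repeats it with collapse::fmean(). The toy vectors are made up, and the call shown is a simple standalone use of fmean(), not a reproduction of the collap or fsum calls that analyze2() makes.

```r
# Toy data (made-up values): values, grouping variable, observation weights
x <- c(10, 20, 30, 40)
g <- c("a", "a", "b", "b")
w <- c(1, 3, 1, 1)

# Base R reference: weighted mean within each group
base_res <- sapply(
  split(seq_along(x), g),
  function(i) weighted.mean(x[i], w[i])
)

# Same calculation with collapse, only if the package is available
if (requireNamespace("collapse", quietly = TRUE)) {
  fast_res <- collapse::fmean(x, g = g, w = w)
}
```

With replicate weights, a calculation like this is repeated once per weight column, which is where optimized grouped functions pay off.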

References

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.

Examples

# Build a fusion model using RECS microdata
fusion.vars <- c("electricity", "natural_gas", "aircon", "insulation")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)

# Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient
recipient <- recs[c(predictor.vars, "weight", paste0("rep_", 1:96))]
sim <- fuse(data = recipient, fsn = fsn.path, M = 30)
head(sim)

#-----

# Example of custom pre-processing function
myfun <- function(v1, v2, v3) v1 + v2 + v3

# Various ways to specify analyses...
my.analyses <- list(
  # Return means for 'electricity' and proportions for 'aircon'
  mean = c("electricity", "aircon"),
  # Identical to mean = "electricity"; duplicate analyses automatically removed
  electricity ~ mean(electricity),
  # Simple addition in the inner expression
  mysum ~ sum(electricity + natural_gas),
  # Standard deviation of electricity
  sd = "electricity",
  # Unnamed analyses (no left-hand side in formula)
  ~ var(electricity + natural_gas),
  ~ mean(insulation),  # Proportions
  ~ sum(insulation),  # Counts
  # Proportions involving manipulation of >1 variable
  myprop ~ mean(aircon != "No air conditioning" & insulation < "Adequately insulated"),
  # Custom inner function
  mycustom ~ median(myfun(electricity, natural_gas, v3 = 100))
)

# Do the requested analyses, by "division"
result <- analyze2(
  analyses = my.analyses,
  implicates = sim,
  static = recipient,
  weight = "weight",
  rep_weights = paste0("rep_", 1:96),
  by = "division"
)
head(result)

#-----

# To calculate a conditional estimate, set unused/ignored observations to NA
# All outer functions execute with 'na.rm = TRUE'
# Example: mean natural_gas conditional on natural_gas > 0
# data.table::fifelse() is much faster than base::ifelse() for large data
result <- analyze2(
  analyses = ~mean(data.table::fifelse(natural_gas > 0, natural_gas, NA_real_)),
  implicates = sim,
  static = recipient,
  weight = "weight",
  rep_weights = paste0("rep_", 1:96),
  by = "division"
)