
Calculates point estimates and associated margins of error for analyses of fused/synthetic microdata with replicate weights. Efficiently computes means, proportions, sums, counts, medians, standard deviations, and variances, optionally across population subgroups. Unlike analyze, this function requires replicate weights and calculates uncertainty using the full replicate weight variance (no approximation).

Usage

analyze2(
  analyses,
  implicates,
  static,
  weight,
  rep_weights,
  by = NULL,
  var_scale = 4,
  cores = 1
)

Arguments

analyses

List. Specifies the desired analyses. See Details and Examples. Variables referenced in analyses must be in implicates or static.

implicates

Data frame or file path. Implicates of synthetic (fused) variables; typically the output from fuse. The implicates should be row-stacked and identified by integer column "M". If a file path to a ".fst" file, only the necessary columns are read into memory.

static

Data frame or file path. Static variables that do not vary across implicates; typically the "recipient" microdata passed to fuse. At a minimum, static must contain weight and rep_weights. If a file path to a ".fst" file, only the necessary columns are read into memory. Note that nrow(static) = nrow(implicates) / max(implicates$M) and the row-ordering is assumed to be consistent between static and implicates.

weight

Character. Name of the primary observation weights column in static.

rep_weights

Character. Vector of replicate weight columns in static.

by

Character. Optional column name(s) in implicates or static (typically factors) that collectively define the set of population subgroups for which each analysis is executed. If NULL, analysis is done for the whole sample.

var_scale

Scalar. Factor by which to scale the unadjusted replicate weight variance. This is determined by the survey design. The default (var_scale = 4) is appropriate for ACS and RECS.

cores

Integer. Number of cores used for multithreading in collapse-package functions.

Value

A tibble reporting analysis results, possibly across subgroups defined in by. The returned quantities include:

lhs

Optional analysis name; the "left hand side" of the analysis formula.

rhs

The "right hand side" of the analysis formula.

type

Type of analysis: sum, mean, median, prop(ortion) or count.

level

Factor levels for categorical analyses; NA or omitted otherwise.

est

Point estimate; mean estimate across implicates.

moe

Margin of error associated with the 90% confidence interval.

rshare

Share of MOE attributable to replicate weights (as opposed to variance across implicates).

Details

The final point estimates are the mean estimates across implicates. The final margin of error is derived from the pooled standard error across implicates, calculated using Rubin's (1987) pooling rules. The within-implicate standard errors are calculated using the replicate weights and var_scale.
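The pooling described above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the toy numbers are invented, the within-implicate variance is assumed to take the common replicate-weight form var_scale / R * sum((rep_est - est)^2), and the degrees of freedom follow Rubin's original formula.

```r
# Hypothetical toy inputs: M = 3 implicates (rows), R = 4 replicate weights (columns)
est_m <- c(10, 12, 11)            # estimate per implicate using the primary weight
rep_est <- rbind(                 # same estimate re-computed with each replicate weight
  c(10.5,  9.5, 10.5,  9.5),
  c(12.5, 11.5, 12.5, 11.5),
  c(11.5, 10.5, 11.5, 10.5)
)
var_scale <- 4

# Within-implicate variance (assumed form): scaled mean squared deviation
# of the replicate estimates from the primary-weight estimate
R <- ncol(rep_est)
var_m <- var_scale / R * rowSums((rep_est - est_m)^2)

# Rubin's pooling rules
M <- length(est_m)
est <- mean(est_m)                         # final point estimate
ubar <- mean(var_m)                        # mean within-implicate variance
b <- var(est_m)                            # between-implicate variance
total_var <- ubar + (1 + 1 / M) * b        # pooled variance
df <- (M - 1) * (1 + ubar / ((1 + 1 / M) * b))^2

# Margin of error for a 90% confidence interval
moe <- qt(0.95, df) * sqrt(total_var)
```

With few implicates the degrees of freedom are small, so the t quantile is noticeably larger than the normal value of about 1.645; more implicates tighten the interval when between-implicate variance matters.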

Each entry in the analyses list is a formula of the format Z ~ F(E), where Z is an optional, user-friendly name for the analysis, F is an allowable “outer function”, and E is an “inner expression” containing one or more microdata variables. For example:

mysum ~ mean(Var1 + Var2)

In this case, the outer function is mean(). Allowable outer functions are: mean(), sum(), median(), sd(), and var(). When the inner expression contains more than one variable, it is first evaluated and then F() is applied to the result. In the example above, an internal variable X = Var1 + Var2 is generated across all observations, and then mean(X) is computed.
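The two-step evaluation can be mimicked by hand for a single implicate. The sketch below uses an invented data frame and base R's weighted.mean() as a stand-in for the weighted outer function; it is meant only to show the order of operations.

```r
# Hypothetical microdata for one implicate
d <- data.frame(
  Var1 = c(1, 2, 3),
  Var2 = c(10, 20, 30),
  weight = c(1, 1, 2)
)

# Step 1: evaluate the inner expression across all observations
X <- with(d, Var1 + Var2)

# Step 2: apply the (weighted) outer function to the result
est <- weighted.mean(X, w = d$weight)
```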

If no inner expression is desired, the analyses list can use the following convenient syntax to apply a single outer function to multiple variables:

mean = c("Var1", "Var2")
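Judging from the Examples section, where electricity ~ mean(electricity) is described as identical to mean = "electricity", the shorthand appears to expand to one formula per variable. A sketch of the two presumably equivalent specifications, using the placeholder variable names from above:

```r
# Shorthand: one outer function applied to several variables
short_form <- list(mean = c("Var1", "Var2"))

# Presumed long-form equivalent: one formula per variable,
# with the variable name used as the analysis name
long_form <- list(
  Var1 ~ mean(Var1),
  Var2 ~ mean(Var2)
)
```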

The inner expression can also utilize any function that takes variable names as arguments and returns a vector with the same length as the inputs. This is useful for defining complex operations in a separate function (e.g. microsimulation). For example:

myfun = function(Var1, Var2) {Var1 + Var2}

mysum ~ mean(myfun(Var1, Var2))

Using sum() or mean() with an inner expression that returns a categorical vector automatically produces category-wise weighted counts or proportions, respectively. For example, the following analysis would fail if evaluated literally, since mean() expects numeric input but the inner expression returns character. Instead, it is interpreted as a request to return weighted proportions for each categorical outcome.

myprop ~ mean(ifelse(Var1 > 10, 'Yes', 'No'))
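To make the categorical interpretation concrete, the following base R sketch computes weighted counts and proportions by hand for one implicate. The data are invented and the approach is only illustrative; it is not how analyze2() performs the calculation internally.

```r
# Hypothetical microdata for one implicate
d <- data.frame(
  Var1 = c(5, 12, 20, 8),
  weight = c(1, 2, 1, 4)
)

# Inner expression returns a character vector
category <- ifelse(d$Var1 > 10, "Yes", "No")

# Weighted counts: sum of weights within each category (what sum() would report)
counts <- sapply(split(d$weight, category), sum)

# Weighted proportions: counts divided by total weight (what mean() would report)
props <- counts / sum(d$weight)
```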

analyze2() uses "fast" versions of the allowable outer functions, as provided by fast-statistical-functions in the collapse package. These functions are highly optimized for weighted, grouped calculations. In addition, the outer functions mean(), sum(), and median() use platform-independent multithreading across columns when cores > 1. Analyses with numerical inner expressions are processed using a series of calls to collap with unique observation weights. Analyses with categorical inner expressions use a series of calls to fsum.
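For readers unfamiliar with collapse, the sketch below shows the kind of weighted, grouped calculation these functions perform, calling fmean() and fsum() directly on invented vectors. It is not analyze2()'s internal code, and it assumes the collapse package is installed.

```r
if (requireNamespace("collapse", quietly = TRUE)) {
  x <- c(1, 2, 3, 4)            # values of a numeric variable
  g <- c("a", "a", "b", "b")    # subgroup membership (as in 'by')
  w <- c(1, 3, 1, 1)            # observation weights

  # Weighted mean of x within each group
  grp_mean <- collapse::fmean(x, g = g, w = w)

  # Sum of weights within each group (a weighted count)
  grp_count <- collapse::fsum(w, g = g)
}
```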

References

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.

Examples

# Build a fusion model using RECS microdata
fusion.vars <- c("electricity", "natural_gas", "aircon", "insulation")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)

# Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient
recipient <- recs[c(predictor.vars, "weight", paste0("rep_", 1:96))]
sim <- fuse(data = recipient, fsn = fsn.path, M = 30)
head(sim)

#-----

# Example of custom pre-processing function
myfun <- function(v1, v2, v3) v1 + v2 + v3

# Various ways to specify analyses...
my.analyses <- list(
  # Return means for 'electricity' and proportions for 'aircon'
  mean = c("electricity", "aircon"),
  # Identical to mean = "electricity"; duplicate analyses automatically removed
  electricity ~ mean(electricity),
  # Simple addition in the inner expression
  mysum ~ sum(electricity + natural_gas),
  # Standard deviation of electricity
  sd = "electricity",
  # Unnamed analyses (no left-hand side in formula)
  ~ var(electricity + natural_gas),
  ~ mean(insulation),  # Proportions
  ~ sum(insulation),  # Counts
  # Proportions involving manipulation of >1 variable
  myprop ~ mean(aircon != "No air conditioning" & insulation < "Adequately insulated"),
  # Custom inner function
  mycustom ~ median(myfun(electricity, natural_gas, v3 = 100))
)

# Do the requested analyses, by "division"
result <- analyze2(
 analyses = my.analyses,
 implicates = sim,
 static = recipient,
 weight = "weight",
 rep_weights = paste0("rep_", 1:96),
 by = "division"
)
head(result)

#-----

# To calculate a conditional estimate, set unused/ignored observations to NA
# All outer functions execute with 'na.rm = TRUE'
# Example: mean natural_gas conditional on natural_gas > 0
# data.table::fifelse() is much faster than base::ifelse() for large data
result <- analyze2(
 analyses = ~mean(data.table::fifelse(natural_gas > 0, natural_gas, NA_real_)),
 implicates = sim,
 static = recipient,
 weight = "weight",
 rep_weights = paste0("rep_", 1:96),
 by = "division"
)
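The NA-based approach to conditional estimates can be checked by hand on a small invented vector. The sketch below uses base R's weighted.mean() with na.rm = TRUE as a stand-in for the outer function; observations set to NA drop out together with their weights.

```r
# Hypothetical values and weights for one implicate
natural_gas <- c(0, 10, 20, 0)
weight <- c(5, 1, 3, 2)

# Set observations outside the condition to NA
conditional <- ifelse(natural_gas > 0, natural_gas, NA_real_)

# NA observations (and their weights) are excluded from the estimate
cond_mean <- weighted.mean(conditional, w = weight, na.rm = TRUE)
```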