
Assembles the data inputs that are passed to train and fuse to perform survey fusion. Adds fusion, replicate weight, and/or spatial variables and checks that the donor and recipient output data frames are consistent.

Usage

assemble(
  x,
  fusion.variables = NULL,
  spatial.datasets = "all",
  window = 2,
  pca = NULL,
  replicates = FALSE,
  agg_fun = NULL,
  agg_adj = NULL
)

Arguments

x

List object produced by prepare.

fusion.variables

Character. Names of donor variables to include in the output as fusion candidates. If NULL (default), an attempt is made to return all donor variables not used in the predictor harmonization process.

spatial.datasets

Character. Vector of requested spatial datasets to merge (e.g. "EPA-SLD") or one of two special values: "all" (default) or "none".

window

Integer. Size of the allowable temporal window, in years, when merging spatial variables. Defaults to 2. window = 0 means that a spatial variable is only included if it has the same vintage as the survey. See Details.

pca

Numeric. Controls whether/how PCA is used to reduce dimensionality of spatial variables. Default (NULL) is no PCA. If non-NULL, should be a numeric vector of length two; e.g. pca = c(50, 0.95). First number is the maximum number of components to return; second number is target proportion of variance explained. See Details.

replicates

Logical. Should replicate observation weights be included, if available? Defaults to FALSE.

agg_fun

List. See fusionInput.

agg_adj

List. See fusionInput.

Value

A list of length two containing donor and recipient microdata to pass to train and fuse.

Details

Spatial variables are included if the associated vintage is within +/- window years of the survey vintage. When a spatial variable has multiple vintages equidistant from the survey vintage, the older vintage is selected. Variables with vintage = "always" are always included.
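
The vintage rule can be sketched in a few lines of R. This is an illustration only, not the package's internal code: select_vintage is a hypothetical helper, and it assumes that vintages are single years and that the vintage closest to the survey year is preferred before the tie-break is applied.

```r
# Hypothetical helper (not part of the package): choose which vintage of a
# spatial variable would be merged for a given survey year.
select_vintage <- function(vintages, survey_year, window = 2) {
  # Variables flagged "always" are included regardless of the survey year
  if ("always" %in% vintages) return("always")
  v <- as.integer(vintages)
  dist <- abs(v - survey_year)
  ok <- dist <= window
  # No vintage inside the window: the variable is not included
  if (!any(ok)) return(NA_character_)
  # Assumption: closest vintage wins; ties go to the older vintage
  candidates <- v[ok]
  best <- candidates[order(dist[ok], candidates)][1]
  as.character(best)
}

select_vintage(c("2010", "2014", "2016"), survey_year = 2015)  # tie: older vintage "2014"
select_vintage(c("2005", "2010"), survey_year = 2015)          # outside window: NA
select_vintage("always", survey_year = 2015)                   # "always"
```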

PCA is restricted to numeric spatial variables and is computed using prcomp. The number of principal components returned is the lesser of pca[1] and the number of components needed to explain at least pca[2] proportion of the variance. For example, pca = c(50, 0.95) will select the smallest number of components that explain 95% of the variance, up to a maximum of 50 components. NAs in numeric spatial variables are imputed using the median value prior to computing the principal components.
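
The component-selection rule can be sketched as follows. This is a minimal illustration, not the package's actual implementation: reduce_spatial is a hypothetical helper, and centering and scaling the variables before prcomp is an assumption that the text above does not confirm.

```r
# Hypothetical helper (not part of the package): median-impute numeric
# columns, run prcomp, and keep the number of components implied by pca.
reduce_spatial <- function(X, pca = c(50, 0.95)) {
  X <- as.matrix(X)
  # Impute NAs with the column median before computing the components
  for (j in seq_len(ncol(X))) {
    X[is.na(X[, j]), j] <- median(X[, j], na.rm = TRUE)
  }
  # Assumption: variables are centered and scaled
  fit <- prcomp(X, center = TRUE, scale. = TRUE)
  # Cumulative proportion of variance explained by the leading components
  cum_var <- cumsum(fit$sdev^2) / sum(fit$sdev^2)
  # Fewest components reaching the target pca[2], capped at pca[1]
  k <- min(pca[1], which(cum_var >= pca[2])[1])
  fit$x[, seq_len(k), drop = FALSE]
}

set.seed(1)
X <- matrix(rnorm(200 * 10), ncol = 10)  # simulated stand-in for spatial variables
X[sample(length(X), 20)] <- NA           # inject some missing values
pcs <- reduce_spatial(X, pca = c(5, 0.95))
dim(pcs)
```

With uncorrelated simulated inputs such as these, the cap in pca[1] is what limits the result; with strongly correlated spatial variables the variance target in pca[2] would typically be reached first.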

Examples

prep <- prepare(donor = "RECS_2015",
                recipient = "ACS_2015",
                respondent = "household",
                implicates = 3)

data <- assemble(x = prep)