Assemble data used for survey fusion
assemble.Rd
Assembles data inputs to pass to train and fuse to perform survey fusion. Adds fusion, replicate weight, and/or spatial variables and checks that donor and recipient output data frames are consistent.
Usage
assemble(
  x,
  fusion.variables = NULL,
  spatial.datasets = "all",
  window = 2,
  pca = NULL,
  replicates = FALSE,
  agg_fun = NULL,
  agg_adj = NULL
)
Arguments
- x
List object produced by prepare.
- fusion.variables
Character. Names of donor variables to be included in output as fusion candidates. If NULL (default), an attempt is made to return all donor variables not used in the predictor harmonization process.
- spatial.datasets
Character. Vector of requested spatial datasets to merge (e.g. "EPA-SLD") or either of two special values: "all" (default) or "none".
- window
Integer. Size of allowable temporal window, in years, when merging spatial variables. window = 0 means that a spatial variable is only included if it has the same vintage as the survey; the default shown in Usage is window = 2. See Details.
- pca
Numeric. Controls whether/how PCA is used to reduce dimensionality of spatial variables. Default (NULL) is no PCA. If non-NULL, should be a numeric vector of length two; e.g. pca = c(50, 0.95). First number is the maximum number of components to return; second number is the target proportion of variance explained. See Details.
- replicates
Logical. Should replicate observation weights be included, if available? Defaults to FALSE.
- agg_fun
List. See fusionInput.
- agg_adj
List. See fusionInput.
Details
Spatial variables are included if the associated vintage is within +/- window years of the survey vintage. In cases where the spatial variable has multiple vintages equidistant from the survey vintage, the older vintage is selected. Variables with vintage = "always" are, of course, always included.
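The window and tie-breaking rule described above can be illustrated with a small sketch. The helper select_vintage() is hypothetical and is not part of the package; it assumes vintages are supplied as character years or the special value "always", and it only mirrors the rule as stated here, not the package's internal implementation.

```r
# Hypothetical helper (not a package function): choose which vintage of a
# spatial variable would be merged for a given survey year and window.
select_vintage <- function(vintages, survey_year, window = 2) {
  # Time-invariant variables are included regardless of the window
  if ("always" %in% vintages) return("always")

  years <- as.integer(vintages)
  dist <- abs(years - survey_year)
  ok <- dist <= window

  # No vintage falls inside the window: the variable is not included
  if (!any(ok)) return(NA_character_)

  cand <- years[ok]
  cdist <- dist[ok]

  # Sort by distance first, then by year, so equidistant ties go to the older vintage
  as.character(cand[order(cdist, cand)][1])
}

# 2014 and 2016 are equidistant from 2015; the older vintage is chosen
select_vintage(c("2010", "2014", "2016"), survey_year = 2015)

# The only vintage is 5 years away, outside the window
select_vintage("2010", survey_year = 2015, window = 2)

# "always" is kept even with window = 0
select_vintage("always", survey_year = 2015, window = 0)
```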
PCA is restricted to numeric spatial variables and is computed using prcomp. The number of principal components returned is the lesser of pca[1] and the number of components needed to explain at least pca[2] proportion of the variance. For example, pca = c(50, 0.95) will select the smallest number of components that explain 95% of the variance, up to a maximum of 50 components. NAs in numeric spatial variables are imputed with the variable's median prior to computing the principal components.
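A minimal sketch of this component-selection rule, using simulated data in place of real spatial variables. The matrix X, the object names, and the choice to center and scale inside prcomp() are assumptions made for illustration; the package's own preprocessing may differ.

```r
set.seed(1)

# Toy stand-in for numeric spatial variables (simulated, not real data)
X <- matrix(rnorm(200 * 6), ncol = 6)
X[sample(length(X), 20)] <- NA

# Impute missing values with the column median before PCA
for (j in seq_len(ncol(X))) {
  X[is.na(X[, j]), j] <- median(X[, j], na.rm = TRUE)
}

pca <- c(50, 0.95)  # maximum components, target proportion of variance

# Centering and scaling are assumptions of this sketch
fit <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(fit$sdev^2) / sum(fit$sdev^2)

# Smallest k reaching the target, capped at the requested maximum
k_var <- which(cum_var >= pca[2])[1]
k <- min(pca[1], k_var)

# Component scores that would replace the original numeric variables
components <- fit$x[, seq_len(k), drop = FALSE]
```

With only six input variables, the cap of 50 never binds here; it matters when the number of numeric spatial variables is large.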
Examples
prep <- prepare(donor = "RECS_2015",
                recipient = "ACS_2015",
                respondent = "household",
                implicates = 3)
data <- assemble(x = prep)