Generate input files needed for fusion
Handles all operations needed to generate /input files from successfully ingested and harmonized donor survey microdata. Optionally uploads the resulting local /input data files to the correct location in remote storage (via uploadFiles).
NOTE: Argument test_mode = TRUE by default, which causes the local "/fusion_" sub-directory to be used (creating it if necessary). This prevents any overwrite of production data in "/fusion" (no underscore) while in test mode.
Usage
fusionInput(
  donor,
  recipient,
  respondent,
  fuse = NULL,
  force = NULL,
  note = NULL,
  agg_fun = NULL,
  agg_adj = NULL,
  test_mode = TRUE,
  ncores = getOption("fusionData.cores")
)
Arguments
- donor
Character. Donor survey identifier (e.g. "RECS_2015").
- recipient
Character. Recipient (ACS) survey identifier (e.g. "ACS_2015").
- respondent
Character. Desired respondent level of microdata. Either "household" or "person".
- fuse
Character or list. Names of donor variables to be fused to recipient. If fuse is a list, each entry is a character vector possibly indicating multiple variables to fuse as a block. The order of the fuse variables does not matter, since prepXY is used internally to determine a plausible fusion sequence. If NULL (default), an attempt is made to return all donor variables not used in the predictor harmonization process (specifying fuse explicitly is preferable, though).
- force
Character. Pre-specified subset of potential predictor variables to "force" as included predictors. These variables are also used within fusionOutput to create validation subsets. We generally select the variables that best reflect the following socioeconomic and geographic concepts: income; race/ethnicity; education; household size; housing tenure; and the highest-resolution location variable for which the donor survey is thought to be representative.
- note
Character. Optional note supplied by user. Inserted in the log file for reference.
- agg_fun
List. Optional override of default aggregation function for person-level fuse variables when respondent = "household". Passed to assemble internally. See Details.
- agg_adj
List. Optional pre-aggregation adjustment code to apply to person-level fuse variables when respondent = "household". Passed to assemble internally. See Details.
- test_mode
Logical. If TRUE (default), function uses the local "/fusion_" sub-directory (creating it if necessary). Only when test_mode = FALSE is it possible to overwrite production data in "/fusion" (no underscore).
- ncores
Integer. Number of physical CPU cores used for parallel computation.
Value
Invisibly returns the path to the local directory where files were saved. Messages are printed to the console noting progress. Resulting /input data files are saved to the appropriate local directory and (optionally) to remote Google Drive storage. Also saves a .txt log file alongside the data files that records console output from fusionInput.
Details
The function checks arguments and determines the file path to the appropriate /input directory (creating it if necessary), based on donor, recipient, respondent, and test_mode. It then executes the following steps:
1. Check for custom pre-processing script. Looks for an optional, pre-existing .R script in /input starting with "(00)". This script can be used to inject custom code prior to any other operations. Most likely, the custom code is used to set or modify function arguments that cannot be specified manually at function call. If found, the .R file is sourced locally and code comments are printed to the console.
2. prepare() microdata. prepare is called with sensible default values.
3. assemble() microdata. assemble is called with sensible default values. Output from assemble consists of an object named data with the harmonized donor microdata in data[[1]] and harmonized ACS microdata in data[[2]].
4. Check for custom .R scripts. Looks for optional, pre-existing .R scripts in /input starting with "(01)", "(02)", etc. These scripts can be used to inject custom code prior to the next step. Most likely, the custom code is used to add or remove non-standard variables from data or otherwise adjust the default harmonized microdata. The custom code must modify the fuse vector (or list) if changes to the initial function argument are desired. If found, the .R files are sourced locally and code comments are printed to the console.
5. Check categorical harmonized variables. Computes the similarity of donor and ACS categorical harmonized variables by comparing proportions for the observable factor levels. Under the assumption that the donor and ACS sample the same underlying population, we expect the proportions to be fairly similar. Returns a "Similarity score" for each categorical harmonized variable, ranging from 0 to 1. Variables with scores below ~0.8 should probably be checked by the analyst (via harmony) to confirm that the harmonization strategy is valid. User is prompted in the console to indicate which variables should be dropped, if any.
6. Check location variables. Checks the number of factor levels in each location predictor variable and compares it to the number of levels in the "representative" location variable passed via force. If a location variable has more levels than the representative one, it is flagged as a potential issue since it suggests the presence of a location variable with greater spatial resolution than the one known to be representative. User is prompted in the console to indicate if they would like to remove the flagged variable(s).
7. Check fusion and predictor variables. The full set of fusion and potential predictor variables is determined. Summary information is printed to the console. User is prompted in the console to confirm that everything looks OK before proceeding.
8. Run fusionModel::prepXY(). prepXY is called with sensible default values. The output is written to the appropriate /input directory and noted in console. The prepXY argument fraction is automatically set to use 10% of donor observations or 50k rows (10k in test mode), whichever is higher. Sampling often has minimal effect on results but speeds up computation.
9. Write training and prediction datasets to disk. The (donor) training and (ACS) prediction datasets are written to the appropriate /input directory as fully-compressed fst files. Output file names are noted in console. In test mode, no more than 10k rows of each are written to disk (for speed). In this case, the expected production file size is printed to the console.
10. Upload /input files to Google Drive. User is prompted in the console to confirm if they would like to upload the resulting /input data files to the analogous location in the remote Google Drive storage.
11. fusionInput() is finished! Upon completion, a log file named "inputlog.txt" is written to /input for reference.
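Two of the checks in the steps above can be illustrated with a minimal sketch. The exact similarity metric is not stated here, so similarity_score() below assumes one plausible choice (the overlap of category proportions, which also ranges from 0 to 1). prep_fraction() expresses the stated sampling rule as a fraction of donor rows; capping it at 1 is an assumption. Both helper names are hypothetical and not part of the package.

```r
# Hypothetical helper: overlap of category proportions between two variables.
# ASSUMPTION: the package's actual "Similarity score" may be computed differently.
similarity_score <- function(donor_x, acs_x) {
  lev <- union(levels(factor(donor_x)), levels(factor(acs_x)))
  p <- as.numeric(prop.table(table(factor(donor_x, levels = lev))))
  q <- as.numeric(prop.table(table(factor(acs_x, levels = lev))))
  sum(pmin(p, q))  # 1 = identical proportions, 0 = no shared levels
}

# Hypothetical helper: share of donor rows to sample, i.e. the larger of
# 10% of rows or a fixed row floor (50k, or 10k in test mode).
# ASSUMPTION: the fraction is capped at 1 for small donor samples.
prep_fraction <- function(n, test_mode = TRUE) {
  floor_rows <- if (test_mode) 10e3 else 50e3
  min(1, max(0.1 * n, floor_rows) / n)
}

donor_tenure <- c("own", "own", "rent", "rent")
acs_tenure   <- c("own", "own", "own", "rent")
similarity_score(donor_tenure, acs_tenure)  # 0.75, below the ~0.8 guideline

prep_fraction(2e6, test_mode = FALSE)  # 10% of rows exceeds the 50k floor
prep_fraction(5000, test_mode = TRUE)  # fewer rows than the floor: use all
```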
The user is prompted for console input (including asking about GDrive upload) only if interactive() is TRUE. Otherwise, the steps proceed without user input.
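The gating described above can be sketched as follows. The helper ask_yes_no() is hypothetical, and the answer assumed in a non-interactive session (the default argument) is an illustration only; the choices fusionInput() actually makes without user input are not documented here.

```r
# Hypothetical helper: prompt only in an interactive session.
# ASSUMPTION: in a non-interactive session a preset default is returned.
ask_yes_no <- function(question, default = FALSE,
                       is_interactive = interactive()) {
  if (!is_interactive) return(default)
  answer <- readline(paste0(question, " (y/n): "))
  tolower(trimws(answer)) %in% c("y", "yes")
}

# In a batch run (e.g. via Rscript) this returns the default without prompting.
ask_yes_no("Upload /input files to Google Drive?", is_interactive = FALSE)
```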
If donor refers to a survey with both household- and person-level microdata and respondent = "household" and fuse includes person-level variables, then person-level fusion variables need to be aggregated to household-level prior to fusion.
This is done automatically within assemble. In this scenario, person-level fusion variables are aggregated based on their class.
By default, numeric variables return the household total (sum), unordered factors return the level of the household's reference person, and ordered factors return the household's maximum level.
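The default rules can be illustrated on toy data. This is a base-R sketch of the stated behavior, not the code used inside assemble; the data frame, its column names, and default_agg() are all hypothetical.

```r
# Toy person-level records for two households (all names are illustrative).
persons <- data.frame(
  hid    = c(1, 1, 2, 2, 2),
  ref    = c(TRUE, FALSE, FALSE, TRUE, FALSE),  # reference person flag
  income = c(30, 10, 5, 50, 0),
  race   = factor(c("A", "B", "B", "A", "A")),
  educ   = factor(c("HS", "BA", "HS", "HS", "MA"),
                  levels = c("HS", "BA", "MA"), ordered = TRUE)
)

# Choose the aggregation rule from the variable's class.
default_agg <- function(x, ref) {
  if (is.numeric(x)) {
    sum(x, na.rm = TRUE)   # numeric: household total
  } else if (is.ordered(x)) {
    max(x, na.rm = TRUE)   # ordered factor: household maximum level
  } else {
    x[ref][1]              # unordered factor: reference person's level
  }
}

# Aggregate each household's rows to a single household-level record.
households <- do.call(rbind, lapply(split(persons, persons$hid), function(d) {
  data.frame(
    hid    = d$hid[1],
    income = default_agg(d$income, d$ref),
    race   = default_agg(d$race, d$ref),
    educ   = default_agg(d$educ, d$ref)
  )
}))
households
```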
The agg_fun argument can be used to override the default aggregation function for specific fusion variables. It can reference any function that takes a vector, returns a single value, and includes a 'na.rm' argument. Two special, package-specific functions are also available, "ref" and "mode", that return the reference person value and modal value, respectively. These functions are comparatively slow, especially "mode". See Examples.
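As a sketch of the function contract described above (vector in, single value out, with a 'na.rm' argument), here is a hypothetical modal-value function. It is not the package's "mode" implementation, and the tie-breaking rule (first value seen wins) is an assumption.

```r
# Hypothetical modal-value aggregator satisfying the stated contract.
modal_value <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  if (length(x) == 0) return(x[NA_integer_])  # typed NA when nothing is left
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]       # ties: first value seen wins
}

modal_value(c("No", "Yes", "No", NA))

# Base functions such as mean() already satisfy the same contract.
mean(c(1, NA, 3), na.rm = TRUE)
```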
The agg_adj argument can be used to adjust/modify a person-level fusion variable prior to aggregation. This may be necessary if the variable is defined or measured in a way that does not allow for straightforward aggregation to the household level. agg_adj supplies named formulas to an internal mutate call, allowing for complex modifications. See Examples.
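To see why a pre-aggregation adjustment can matter, consider a value that is recorded identically on every person row of a household. Summing it would count it once per person. The base-R sketch below uses toy data with hypothetical column names and mimics an adjustment of the form ~if.else(duplicated(hid), 0, value); it does not call the package.

```r
# Toy data: a household-level amount repeated on each person record.
persons <- data.frame(
  hid    = c(1, 1, 1, 2, 2),
  hipval = c(1200, 1200, 1200, 800, 800)
)

# Naive household total counts the repeated value once per person.
naive <- tapply(persons$hipval, persons$hid, sum)

# Adjustment: keep the value on the first record of each household only,
# then aggregate with the same sum.
adjusted <- ifelse(duplicated(persons$hid), 0, persons$hipval)
fixed <- tapply(adjusted, persons$hid, sum)

naive  # inflated totals
fixed  # one value per household
```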
Note in the Examples the use of the convenience utility function if.else(). It wraps if_else and can be used identically but preserves factor levels and ordering in the result if possible.
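The motivation for such a wrapper can be shown in base R, where ifelse() does not return a factor when given factor input. The function if_else_keep() below is a hypothetical base-R stand-in written for illustration; the package's if.else() wraps if_else and may behave differently in its details.

```r
# Hypothetical stand-in: conditional replacement that keeps factor levels
# and ordering from whichever branch is a factor.
if_else_keep <- function(condition, true, false) {
  template <- if (is.factor(false)) false else if (is.factor(true)) true else NULL
  if (is.null(template)) return(ifelse(condition, true, false))
  out <- ifelse(condition, as.character(true), as.character(false))
  factor(out, levels = levels(template), ordered = is.ordered(template))
}

kidcneed <- factor(c("NIU: Over 14", "Yes", "No"),
                   levels = c("NIU: Over 14", "No", "Yes"))

# Base ifelse() loses the factor class of 'kidcneed'.
is.factor(ifelse(kidcneed == "NIU: Over 14", "No", kidcneed))

# The stand-in recodes the value and keeps the original levels.
recoded <- if_else_keep(kidcneed == "NIU: Over 14", "No", kidcneed)
recoded
```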
Examples
# Since 'test_mode = TRUE' by default, this will affect files in local /fusion_ directory
dir <- fusionInput(donor = "RECS_2015",
                   recipient = "ACS_2015",
                   respondent = "household",
                   fuse = c("btung", "btuel", "cooltype"),
                   force = c("moneypy", "householder_race", "education", "nhsldmem", "kownrent", "recs_division"),
                   note = "Hello world. Reminder: running in test mode by default.")
# List files in the /input directory
list.files(dir)
# Complicated ASEC example using custom aggregation arguments
dir <- fusionInput(donor = "ASEC_2019",
                   recipient = "ACS_2019",
                   respondent = "household",
                   fuse = c("heatsub", "heatval", "kidcneed", "hipval", "spmwic", "spmmort"),
                   agg_adj = list(
                     hipval = ~if.else(duplicated(asec_2019_hid), 0, hipval),
                     kidcneed = ~if.else(kidcneed == "NIU: Over 14", "No", kidcneed),
                     spmwic = ~if.else(duplicated(data.table(asec_2019_hid, spmfamunit)), 0, spmwic)
                   ),
                   agg_fun = list(
                     spmwic = "mean",
                     kidcneed = "mode"
                   ))