Generate input files needed for fusion — fusionInput • fusionData

Handles all operations needed to generate /input files from successfully ingested and harmonized donor survey microdata. Optionally uploads the resulting local /input data files to the correct location in remote storage (via uploadFiles).

NOTE: Argument test_mode = TRUE by default, which causes local "/fusion_" sub-directory to be used (creating it if necessary). This prevents any overwrite of production data in "/fusion" (no underscore) while in test mode.

Usage

fusionInput(
  donor,
  recipient,
  respondent,
  fuse = NULL,
  force = NULL,
  note = NULL,
  agg_fun = NULL,
  agg_adj = NULL,
  test_mode = TRUE,
  ncores = getOption("fusionData.cores")
)

Arguments

donor: Character. Donor survey identifier (e.g. "RECS_2015").
recipient: Character. Recipient (ACS) survey identifier (e.g. "ACS_2015").
respondent: Character. Desired respondent level of microdata. Either "household" or "person".
fuse: Character or list. Names of donor variables to be fused to recipient. If fuse is a list, each entry is a character vector, possibly naming multiple variables to fuse as a block. The order of the fuse variables does not matter, since prepXY is used internally to determine a plausible fusion sequence. If NULL (default), an attempt is made to return all donor variables not used in the predictor harmonization process, though it is preferable to specify fuse explicitly.
force: Character. Pre-specified subset of potential predictor variables to "force" as included predictors. These variables are also used within fusionOutput to create validation subsets. We generally select the variables that best reflect the following socioeconomic and geographic concepts: income; race/ethnicity; education; household size; housing tenure; and the highest-resolution location variable for which the donor survey is thought to be representative.
note: Character. Optional note supplied by user. Inserted in the log file for reference.
agg_fun: List. Optional override of default aggregation function for person-level fuse variables when respondent = "household". Passed to assemble internally. See Details.
agg_adj: List. Optional pre-aggregation adjustment code to apply to person-level fuse variables when respondent = "household". Passed to assemble internally. See Details.
test_mode: Logical. If TRUE (default), function uses the local "/fusion_" sub-directory (creating it if necessary). Only when test_mode = FALSE is it possible to overwrite production data in "/fusion" (no underscore).
ncores: Integer. Number of physical CPU cores used for parallel computation.

Value

Invisibly returns the path to the local directory where files were saved. Progress messages are printed to the console. Resulting /input data files are saved to the appropriate local directory and (optionally) to remote Google Drive storage. Also saves a .txt log file alongside the data files that records console output from fusionInput.

Details

The function checks arguments and determines the file path to the appropriate /input directory (creating it if necessary), based on donor, recipient, respondent, and test_mode. It then executes the following steps:

Check for custom pre-processing script. Looks for an optional, pre-existing .R script in /input starting with "(00)". This script can be used to inject custom code prior to any other operations. Most likely, the custom code is used to set or modify function arguments that cannot be specified manually at function call. If found, the .R file is source-d locally and code comments are printed to the console.
prepare() microdata. prepare is called with sensible default values.
assemble() microdata. assemble is called with sensible default values. Output from assemble consists of an object named data with the harmonized donor microdata in data[[1]] and harmonized ACS microdata in data[[2]].
Check for custom .R scripts. Looks for optional, pre-existing .R scripts in /input starting with "(01)", "(02)", etc. These scripts can be used to inject custom code prior to the next step. Most likely, the custom code is used to add or remove non-standard variables from data or otherwise adjust the default harmonized microdata. The custom code must modify the fuse vector (or list) if changes to the initial function argument are desired. If found, the .R files are source-d locally and code comments are printed to the console.
Check categorical harmonized variables. Computes the similarity of donor and ACS categorical harmonized variables by comparing proportions for the observable factor levels. Under the assumption that the donor and ACS sample the same underlying population, we expect the proportions to be fairly similar. Returns a "Similarity score" for each categorical harmonized variable, ranging from 0 to 1. Variables with scores below ~0.8 should probably be checked by the analyst (via harmony) to confirm that the harmonization strategy is valid. User is prompted in the console to indicate which variables should be ignored/dropped/removed, if any.
Check location variables. Checks the number of factor levels in each location predictor variable and compares to the number of levels in the "representative" location variable passed via force. If a location variable has more levels than the representative one, it is flagged as a potential issue since it suggests the presence of a location variable with greater spatial resolution than the one known to be representative. User is prompted in the console to indicate if they would like to remove the flagged variable(s).
Check fusion and predictor variables. The full set of fusion and potential predictor variables is determined. Summary information is printed to the console. User is prompted in the console to confirm that everything looks OK before proceeding.
Run fusionModel::prepXY(). prepXY is called with sensible default values. The output is written to the appropriate /input directory and noted in console. prepXY argument fraction is automatically set to use 10% of donor observations or 50k rows (10k in test mode), whichever is higher. Sampling often has minimal effect on results but speeds up computation.
Write training and prediction datasets to disk. The (donor) training and (ACS) prediction datasets are written to the appropriate /input directory as fully-compressed fst files. Output file names are noted in the console. In test mode, no more than 10k rows of each dataset are written to disk (for speed). In this case, the expected production file size is printed to the console.
Upload /input files to Google Drive. User is prompted in the console to confirm if they would like to upload resulting /input data files to the analogous location in the remote Google Drive storage.
fusionInput() is finished! Upon completion, a log file named "inputlog.txt" is written to /input for reference.
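The sampling rule described in the prepXY step can be written out as simple arithmetic. The sketch below is not the package's code: sample_fraction() and its argument names are hypothetical, and it assumes that fraction is expressed as a proportion of donor rows capped at 1.

```r
# Hypothetical helper (not part of fusionData): proportion of donor rows to
# sample for prepXY(), following the rule described above: 10% of rows or a
# fixed row floor, whichever is higher.
# Assumption: 'fraction' is a proportion in (0, 1].
sample_fraction <- function(n_donor, test_mode = TRUE) {
  floor_rows  <- if (test_mode) 10e3 else 50e3   # 10k rows in test mode, else 50k
  target_rows <- max(0.1 * n_donor, floor_rows)  # whichever is higher
  min(1, target_rows / n_donor)                  # cannot sample more than all rows
}

sample_fraction(5000, test_mode = FALSE)   # fewer rows than the floor: use all rows
sample_fraction(2e6, test_mode = FALSE)    # large donor survey: 10% of rows
sample_fraction(60000, test_mode = TRUE)   # test mode: 10k-row floor exceeds 10%
```

For small donor surveys the row floor dominates, so sampling only takes effect once the donor microdata exceed the floor.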

The user is prompted for console input (including the question about Google Drive upload) only if interactive() returns TRUE. Otherwise, the steps proceed without user input.
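One way to picture this gating is a small prompt helper that falls back to a default answer in non-interactive sessions. This is a minimal sketch using only base R: ask_yes_no() is a hypothetical name, and the default answer chosen here is an assumption, since the text above does not say which choices are made when prompts are skipped.

```r
# Hypothetical helper (not part of fusionData): ask a yes/no question only in
# an interactive session. In non-interactive runs (e.g. Rscript or batch jobs)
# the supplied default is returned without prompting.
ask_yes_no <- function(question, default = FALSE) {
  if (!interactive()) return(default)
  answer <- readline(paste0(question, " (y/n): "))
  tolower(trimws(answer)) %in% c("y", "yes")
}

# Assumption: skipping the upload is the safe default when nobody can answer
upload <- ask_yes_no("Upload /input files to Google Drive?", default = FALSE)
```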

If donor refers to a survey with both household- and person-level microdata and respondent = "household" and fuse includes person-level variables, then we have a situation where person-level fusion variables need to be aggregated to household-level prior to fusion. This is done automatically within assemble. In this scenario, person-level fusion variables are aggregated based on their class. By default, numeric variables return the household total (sum), unordered factors return the level of the household's reference person, and ordered factors return the household's maximum level.
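The default rules can be illustrated with a base R sketch. It is not the code used inside assemble: agg_default(), the is_ref flag, and the toy person-level data are all made up for illustration, and the sketch assumes each household has a reference person and at least one non-missing value.

```r
# Hypothetical sketch of the default aggregation rules described above.
# 'x' is one person-level variable for a single household; 'is_ref' flags the
# household's reference person.
agg_default <- function(x, is_ref) {
  if (is.numeric(x)) {
    sum(x, na.rm = TRUE)      # numeric: household total
  } else if (is.ordered(x)) {
    max(x, na.rm = TRUE)      # ordered factor: household's maximum level
  } else {
    x[is_ref][1]              # unordered factor: reference person's level
  }
}

# Toy person-level data (invented for illustration only)
persons <- data.frame(
  hid    = c(1, 1, 1, 2, 2),
  is_ref = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  income = c(30, 20, 0, 50, 10),
  status = factor(c("employed", "student", "child", "retired", "employed")),
  educ   = factor(c("HS", "BA", "None", "BA", "HS"),
                  levels = c("None", "HS", "BA"), ordered = TRUE)
)

# Aggregate each variable within household
by_hh <- split(persons, persons$hid)
households <- data.frame(
  hid    = as.numeric(names(by_hh)),
  income = sapply(by_hh, function(d) agg_default(d$income, d$is_ref)),
  status = sapply(by_hh, function(d) as.character(agg_default(d$status, d$is_ref))),
  educ   = sapply(by_hh, function(d) as.character(agg_default(d$educ, d$is_ref)))
)
households
```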

The agg_fun argument can be used to override the default aggregation function for specific fusion variables. It can reference any function that takes a vector, returns a single value, and includes a 'na.rm' argument. Two special, package-specific functions are also available, "ref" and "mode", which return the reference person value and the modal value, respectively. These functions are comparatively slow, especially "mode". See Examples.
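As an illustration of that contract, the sketch below defines a custom aggregator that takes a vector, returns a single value, and has an 'na.rm' argument. share_positive() is a hypothetical function, and referencing it by name in agg_fun (as "mean" is in the Examples) is an assumption that has not been checked against the package.

```r
# Hypothetical custom aggregator: share of household members with a positive
# value. Takes a vector, returns one value, and has an 'na.rm' argument.
share_positive <- function(x, na.rm = FALSE) {
  mean(x > 0, na.rm = na.rm)
}

share_positive(c(120, 0, NA, 35), na.rm = TRUE)   # 2 of the 3 non-missing values

# Assumption: a custom function would be referenced by name, for example
#   agg_fun = list(heatval = "share_positive")
# match.fun() shows how a name resolves to the function in base R.
f <- match.fun("share_positive")
f(c(1, 0), na.rm = TRUE)
```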

The agg_adj argument can be used to adjust/modify a person-level fusion variable prior to aggregation. This may be necessary if the variable is defined or measured in a way that does not allow for straightforward aggregation to the household level. agg_adj supplies named formulas to an internal mutate call, allowing for complex modifications. See Examples.

Note in the Examples the use of the convenience utility function if.else(). It wraps if_else and can be used identically but preserves factor levels and ordering in the result if possible.
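The behavior described for if.else() can be approximated in base R to show why it matters: base ifelse() drops factor attributes, so levels and ordering are lost. The function below is a hypothetical stand-in, not the package's implementation. It handles only factor and character inputs, and whether the real if.else() keeps unused levels is not stated above.

```r
# Hypothetical stand-in (not fusionData's if.else): vectorised if/else for
# factor or character inputs that restores factor levels and ordering when
# every result value is a known level.
if_else_keep <- function(condition, true, false) {
  n <- length(condition)
  out <- ifelse(condition,
                rep_len(as.character(true), n),
                rep_len(as.character(false), n))
  # Use whichever input is a factor as the template for levels and ordering
  template <- if (is.factor(false)) false else if (is.factor(true)) true else NULL
  if (!is.null(template) && all(out %in% c(levels(template), NA))) {
    out <- factor(out, levels = levels(template), ordered = is.ordered(template))
  }
  out
}

# Toy values mirroring the 'kidcneed' adjustment in the Examples
kidcneed <- factor(c("Yes", "NIU: Over 14", "No"),
                   levels = c("No", "Yes", "NIU: Over 14"))
res <- if_else_keep(kidcneed == "NIU: Over 14", "No", kidcneed)
res
levels(res)
```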

Examples

# Since 'test_mode = TRUE' by default, this will affect files in local /fusion_ directory
dir <- fusionInput(donor = "RECS_2015",
                   recipient = "ACS_2015",
                   respondent = "household",
                   fuse = c("btung", "btuel", "cooltype"),
                   force = c("moneypy", "householder_race", "education", "nhsldmem", "kownrent", "recs_division"),
                   note = "Hello world. Reminder: running in test mode by default.")

# List files in the /input directory
list.files(dir)

# Complicated ASEC example using custom aggregation arguments
dir <- fusionInput(donor = "ASEC_2019",
                   recipient = "ACS_2019",
                   respondent = "household",
                   fuse = c("heatsub", "heatval", "kidcneed", "hipval", "spmwic", "spmmort"),
                   agg_adj = list(
                      hipval = ~if.else(duplicated(asec_2019_hid), 0, hipval),
                      kidcneed = ~if.else(kidcneed == "NIU: Over 14", "No", kidcneed),
                      spmwic = ~if.else(duplicated(data.table(asec_2019_hid, spmfamunit)), 0, spmwic)
                   ),
                   agg_fun = list(
                      spmwic = "mean",
                      kidcneed = "mode"
                   ))