Fuse variables to a recipient dataset

Fuse variables to a recipient dataset using a .fsn model produced by train. Output can be passed to analyze and validate.

Usage

fuse(
  data,
  fsn,
  fsd = NULL,
  M = 1,
  retain = NULL,
  kblock = 10,
  margin = 2,
  cores = 1
)

Arguments

data: Data frame. Recipient dataset. All categorical variables should be factors and ordered whenever possible. Data types and levels are strictly validated against predictor variables defined in fsn.
fsn: Character. Path to fusion model file (.fsn) generated by train.
fsd: Character. Optional path of the fusion output file to create; must end in .fsd (i.e. "fused data"). This is a compressed binary file that can be read using the fst package. If fsd = NULL (the default), the fusion results are returned as a data.table.
M: Integer. Number of implicates to simulate.
retain: Character. Names of columns in data that should be retained in the output; i.e. repeated across implicates. Useful for retaining ID or weight variables for use in subsequent analysis of fusion output.
kblock: Integer. Fixed number of nearest neighbors to use when fusing variables in a block. Must be >= 5 and <= 30. Not applicable for variables fused on their own (i.e. no block).
margin: Numeric. Safety margin used when estimating how many implicates can be processed in memory at once. Set higher if fuse() experiences a memory shortfall. Alternatively, can be set to a negative value to manually specify the number of chunks to use. For example, margin = -3 splits M implicates into three chunks of approximately equal size.
cores: Integer. Number of cores used. LightGBM prediction is parallel-enabled on all systems if OpenMP is available.
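
The chunking behavior described for a negative margin can be pictured with a short base R sketch. This is only an illustration of splitting M implicates into chunks of approximately equal size; the variable names (n.chunks, chunk.id, chunks) are hypothetical and the way fuse() assigns implicates to chunks internally may differ.

```r
# Illustration only: NOT the internal logic of fuse().
# Assumes margin = -3 with M = 10, i.e. ten implicates in three chunks.
M <- 10
margin <- -3
n.chunks <- abs(margin)

# Assign each implicate to a chunk so that chunk sizes differ by at most one
chunk.id <- sort(rep(seq_len(n.chunks), length.out = M))
chunks <- split(seq_len(M), chunk.id)

# Number of implicates handled per chunk
lengths(chunks)
```

More chunks means fewer implicates are held in memory at once, at the cost of more passes over the recipient data.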

Value

If fsd = NULL, a data.table with M * nrow(data) rows. Integer column "M" indicates the implicate to which each row belongs. Recipient observations appear in the same order within every implicate, so do not reorder the rows if the output will be passed to analyze.

If fsd is specified, the path to the .fsd file where results were written. Metadata for column classes and factor levels are stored in the column names. Use read_fsd to load files saved via the fsd argument.
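
Because recipient rows keep the same order within each implicate, the in-memory result can be linked back to the recipient data by position. The sketch below uses small toy stand-ins rather than real fuse() output: the objects recipient and sim, the id column, and the electricity values are all made up for illustration.

```r
# Toy stand-ins (hypothetical data, not produced by fuse())
recipient <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
M <- 2
sim <- data.frame(
  M = rep(seq_len(M), each = nrow(recipient)),
  electricity = c(10, 20, 30, 11, 19, 32)
)

# The result has M * nrow(data) rows
stopifnot(nrow(sim) == M * nrow(recipient))

# Position of each row within its implicate; relies only on the row
# order being consistent within implicates
pos <- ave(seq_len(nrow(sim)), sim$M, FUN = seq_along)

# Attach the recipient identifier by position
sim$id <- recipient$id[pos]
sim
```

In practice the retain argument achieves the same thing directly by carrying ID or weight columns through to the output.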

Details

TO UPDATE.

Examples

# Build a fusion model using RECS microdata
# Note that "fusion_model.fsn" will be written to working directory
?recs
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)

# Generate a single implicate of synthetic 'fusion.vars',
# using original RECS data as the recipient
recipient <- recs[predictor.vars]
sim <- fuse(data = recipient, fsn = fsn.path)
head(sim)

# Calling fuse() again produces different results
sim <- fuse(data = recipient, fsn = fsn.path)
head(sim)

# Generate multiple implicates
sim <- fuse(data = recipient, fsn = fsn.path, M = 5)
head(sim)
table(sim$M)

# Optionally, write results directly to disk
# Note that "results.fsd" will be written to working directory
sim <- fuse(data = recipient, fsn = fsn.path, M = 5, fsd = "results.fsd")
sim <- read_fsd(sim)
head(sim)