Fuse variables to a recipient dataset
fuse.Rd
Fuse variables to a recipient dataset using a .fsn model produced by train. Output can be passed to analyze and validate.
Arguments
- data
Data frame. Recipient dataset. All categorical variables should be factors and ordered whenever possible. Data types and levels are strictly validated against the predictor variables defined in fsn.
- fsn
Character. Path to fusion model file (.fsn) generated by train.
- fsd
Character. Optional fusion output file to be created, ending in .fsd (i.e. "fused data"). This is a compressed binary file that can be read using the fst package. If fsd = NULL (the default), the fusion results are returned as a data.table.
- M
Integer. Number of implicates to simulate.
- retain
Character. Names of columns in data that should be retained in the output; i.e. repeated across implicates. Useful for retaining ID or weight variables for use in subsequent analysis of fusion output.
- kblock
Integer. Fixed number of nearest neighbors to use when fusing variables in a block. Must be >= 5 and <= 30. Not applicable for variables fused on their own (i.e. no block).
- margin
Numeric. Safety margin used when estimating how many implicates can be processed in memory at once. Set higher if fuse() experiences a memory shortfall. Alternatively, can be set to a negative value to manually specify the number of chunks to use. For example, margin = -3 splits M implicates into three chunks of approximately equal size.
- cores
Integer. Number of cores used. LightGBM prediction is parallel-enabled on all systems if OpenMP is available.
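As a rough illustration of what margin = -3 implies, the sketch below splits M implicates into three chunks of approximately equal size using base R only. It is an assumption about what such a split could look like, not the internal code of fuse(), and the helper name split_implicates is hypothetical.

```r
# Hypothetical helper (not part of the package): assign implicates 1..M
# to a fixed number of chunks of approximately equal size.
split_implicates <- function(M, nchunks) {
  stopifnot(M >= 1, nchunks >= 1)
  nchunks <- min(nchunks, M)  # cannot have more chunks than implicates
  # Give each implicate a chunk index, then group implicates by that index
  chunk.id <- ceiling(seq_len(M) * nchunks / M)
  unname(split(seq_len(M), chunk.id))
}

# Ten implicates processed in three passes, analogous to margin = -3
chunks <- split_implicates(M = 10, nchunks = 3)
lengths(chunks)
```

Each list element holds the implicate numbers that would be simulated together in one pass; more chunks means fewer implicates held in memory at once.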
Value
If fsd = NULL, a data.table with number of rows equal to M * nrow(data). Integer column "M" indicates implicate assignment of each observation. Note that the ordering of recipient observations is consistent within implicates, so do not change the row order if the output will be used with analyze.
If fsd is specified, the path to the .fsd file where results were written. Metadata for column classes and factor levels are stored in the column names. read_fsd should be used to load files saved via the fsd argument.
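The layout described above can be sketched with a small mock object. The data frame below is built by hand with made-up values and only mimics the shape of the output (an integer "M" column and M * nrow(data) rows); it is not produced by fuse(), and the column name obs is a hypothetical addition for illustration.

```r
# Mock stand-in for fusion output: 4 recipient rows, M = 3 implicates.
# Values are made up; only the layout mirrors the description above.
n <- 4
M <- 3
set.seed(1)
sim <- data.frame(
  M = rep(seq_len(M), each = n),                 # implicate assignment
  electricity = round(runif(M * n, 500, 1500))   # hypothetical fused values
)

# Row count equals M * n
nrow(sim) == M * n

# Because row order is consistent within implicates, the i-th row of each
# implicate refers to the same recipient observation.
sim$obs <- ave(seq_len(nrow(sim)), sim$M, FUN = seq_along)

# Mean of the fused variable within each implicate, then across implicates
by.implicate <- tapply(sim$electricity, sim$M, mean)
mean(by.implicate)
```

Reordering rows would break the positional link between implicates that the obs index relies on, which is why the row order should be left unchanged.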
Examples
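# Load the package that provides train(), fuse() and read_fsd()
# (assumed here to be fusionModel)
library(fusionModel)
# Build a fusion model using RECS microdata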
# Note that "fusion_model.fsn" will be written to working directory
?recs
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)
# Generate single implicate of synthetic 'fusion.vars',
# using original RECS data as the recipient
recipient <- recs[predictor.vars]
sim <- fuse(data = recipient, fsn = fsn.path)
head(sim)
# Calling fuse() again produces different results
sim <- fuse(data = recipient, fsn = fsn.path)
head(sim)
# Generate multiple implicates
sim <- fuse(data = recipient, fsn = fsn.path, M = 5)
head(sim)
table(sim$M)
# Optionally, write results directly to disk
# Note that "results.fsd" will be written to working directory
# When 'fsd' is specified, fuse() returns the path to the .fsd file
fsd.path <- fuse(data = recipient, fsn = fsn.path, M = 5, fsd = "results.fsd")
sim <- read_fsd(fsd.path)
head(sim)