Train a fusion model
Train a fusion model on "donor" data using sequential LightGBM models to model the conditional distributions. The resulting fusion model (.fsn file) can be used with fuse to simulate outcomes for a "recipient" dataset.
Usage
train(
data,
y,
x,
fsn = "fusion_model.fsn",
weight = NULL,
nfolds = 5,
nquantiles = 2,
nclusters = 2000,
krange = c(10, 500),
hyper = NULL,
fork = FALSE,
cores = 1
)
Arguments
- data
Data frame. Donor dataset. Categorical variables must be factors and ordered whenever possible.
- y
Character or list. Variables in data to eventually fuse to a recipient dataset. Variables are fused in the order provided. If y is a list, each entry is a character vector possibly indicating multiple variables to fuse as a block.
- x
Character or list. Predictor variables in data common to the donor and eventual recipient. If a list, each slot specifies the x predictors to use for each y.
- fsn
Character. File path where the fusion model will be saved. Must use the .fsn suffix.
- weight
Character. Name of the observation weights column in data. If NULL (default), uniform weights are assumed.
- nfolds
Numeric. Number of cross-validation folds used for LightGBM model training. Or, if nfolds < 1, the fraction of observations to use for the training set; the remainder is used for validation (faster than cross-validation).
- nquantiles
Numeric. Number of quantile models to train for continuous y variables, in addition to the conditional mean. nquantiles evenly-distributed percentiles are used. For example, the default nquantiles = 2 yields quantile models for the 25th and 75th percentiles (see the note after this list). Higher values may produce more accurate conditional distributions at the expense of computation time. An even nquantiles is recommended, since the conditional mean tends to capture the central tendency, making a median model superfluous.
- nclusters
Numeric. Maximum number of k-means clusters to use. Higher is better but at computational cost. nclusters = 0 or nclusters = Inf turns off clustering.
- krange
Numeric. Minimum and maximum number of nearest neighbors used to construct continuous conditional distributions. A higher max(krange) is better but at computational cost.
- hyper
List. LightGBM hyperparameters to use during model training. If NULL, default values are used. See Details and Examples.
- fork
Logical. Should parallel processing via forking be used, if possible? See Details.
- cores
Integer. Number of physical CPU cores used for parallel computation. When fork = FALSE or on Windows (where forking is not possible), the fusion variables/blocks are processed serially but LightGBM uses cores for internal multithreading via OpenMP. On a Unix system, if fork = TRUE, cores > 1, and cores <= length(y), then the fusion variables/blocks are processed in parallel via mclapply.
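A note on the nquantiles spacing referenced above: one natural way to place nquantiles evenly-distributed percentiles is at the midpoints of equal-width probability bins, which reproduces the documented 25th and 75th percentiles for the default nquantiles = 2. The helper below is only an illustration of that spacing (a hypothetical function, not part of the package):

# Hypothetical illustration of evenly-spaced percentile targets for 'nquantiles':
# midpoints of equal-width probability bins; nquantiles = 2 gives 0.25 and 0.75
quantile_targets <- function(nquantiles) (seq_len(nquantiles) - 0.5) / nquantiles
quantile_targets(2)  # 0.25 0.75
quantile_targets(4)  # 0.125 0.375 0.625 0.875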
Details
When y is a list, each slot indicates either a single variable or, alternatively, multiple variables to fuse as a block. Variables within a block are sampled jointly from the original donor data during fusion. See Examples.
y variables that exhibit no variance, or continuous y variables with fewer than 10 * nfolds non-zero observations (the minimum required for cross-validation), are automatically removed with a warning.
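As a rough pre-check before training, you can count non-zero observations for continuous candidate variables yourself. This is only an illustrative sketch of the rule above, not the filter train() applies internally, and it assumes the recs example data:

# Illustrative pre-check: continuous variables with fewer than 10 * nfolds
# non-zero observations would be dropped by train() with a warning
nfolds <- 5
cont.vars <- names(recs)[sapply(recs, is.numeric)]
nonzero <- sapply(recs[cont.vars], function(v) sum(v != 0, na.rm = TRUE))
nonzero[nonzero < 10 * nfolds]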
The fusion model written to fsn is a zipped archive created by zip containing the models and data required by fuse.
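Because the .fsn file is an ordinary zip archive, its contents can be inspected without extracting it (a minimal sketch, assuming the default output path):

# List the models and data files stored inside a saved fusion model
utils::unzip("fusion_model.fsn", list = TRUE)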
The hyper argument can be used to specify the LightGBM hyperparameter values over which to perform a "grid search" during model training. See here for the full list of parameters. For each combination of hyperparameters, nfolds cross-validation is performed using lgb.cv with an early stopping condition. The parameter combination with the lowest loss function value is used to fit the final model via lgb.train. The more candidate parameter values specified in hyper, the longer the processing time (a sketch below shows how to count the implied combinations). If hyper = NULL, a single set of parameters is used with the following default values:
boosting = "gbdt"
data_sample_strategy = "goss"
num_leaves = 31
feature_fraction = 0.8
max_depth = 5
min_data_in_leaf = max(10, round(0.001 * nrow(data)))
num_iterations = 2500
learning_rate = 0.1
max_bin = 255
min_data_in_bin = 3
max_cat_threshold = 32
Typical users will only have reason to modify the hyperparameters listed above. Note that num_iterations only imposes a ceiling, since early stopping will typically result in models with fewer iterations. See Examples.
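Since every combination of the candidate values in hyper is cross-validated, a quick way to gauge the cost of a grid search is to count the combinations up front (a simple illustration, not something train() requires):

# Number of hyperparameter combinations implied by a 'hyper' grid
hyper <- list(max_depth = c(5, 10), feature_fraction = c(0.7, 0.9))
nrow(expand.grid(hyper))  # 4 combinations, each cross-validated with 'nfolds' folds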
Testing with small-to-medium size datasets suggests that forking is typically faster than OpenMP multithreading (the default). However, forking will sometimes "hang" (continue to run with no CPU usage or error message) if an OpenMP process has been used previously in the same session. The issue appears to be related to Intel's OpenMP implementation (see here). It can be triggered when operations that use data.table or fst in multithreaded mode are called before train(). If you experience hung forking, try calling data.table::setDTthreads(1) and fst::threads_fst(1) immediately after library(fusionModel) in a new session.
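For example, the suggested workaround amounts to the following at the start of a fresh session:

# Limit data.table and fst to a single thread before any multithreaded use,
# which avoids the hung-forking issue described above
library(fusionModel)
data.table::setDTthreads(1)
fst::threads_fst(1)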
Examples
# Build a fusion model using RECS microdata
# Note that "fusion_model.fsn" will be written to working directory
?recs
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)
# When 'y' is a list, it can specify variables to fuse as a block
fusion.vars <- list("electricity", "natural_gas", c("heating_share", "cooling_share", "other_share"))
fusion.vars
train(data = recs, y = fusion.vars, x = predictor.vars)
# When 'x' is a list, it specifies which predictor variables to use for each 'y'
xlist <- list(predictor.vars[1:4], predictor.vars[2:8], predictor.vars)
xlist
train(data = recs, y = fusion.vars, x = xlist)
# Specify a single set of LightGBM hyperparameters
# Here we use Random Forests instead of the default Gradient Boosting Decision Trees
train(data = recs, y = fusion.vars, x = predictor.vars,
hyper = list(boosting = "rf",
feature_fraction = 0.6,
max_depth = 10
))
# Specify a range of LightGBM hyperparameters to search over
# This takes longer, because there are more models to test
train(data = recs, y = fusion.vars, x = predictor.vars,
hyper = list(max_depth = c(5, 10),
feature_fraction = c(0.7, 0.9)
))
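# The 'fork' and 'cores' arguments can process the fusion variables/blocks
# in parallel on Unix-alikes. A brief sketch (adjust 'cores' to your machine;
# forking is not available on Windows):
train(data = recs, y = fusion.vars, x = predictor.vars, fork = TRUE, cores = 2)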