Prepare the 'x' and 'y' inputs
prepXY.Rd
Optional-but-useful function to: 1) provide a plausible ordering of the 'y' (fusion) variables and 2) identify the subset of 'x' (predictor) variables likely to be consequential during subsequent model training. Output can be passed directly to train(). Most useful for large datasets with many and/or highly correlated predictors. Employs an absolute Spearman rank correlation screen followed by LASSO models (via glmnet) to return a plausible ordering of 'y' and the preferred subset of 'x' variables associated with each.
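The correlation screen described above can be sketched in base R. This is only an illustration of the idea, not the function's internal code: the built-in mtcars dataset stands in for a training dataset, and the names y_var, x_vars, rho, keep, and screened_out are chosen here for the example. Observation weights, categorical variables, and the subsequent LASSO step are not covered.

```r
# Illustrative sketch of an absolute Spearman correlation screen.
# Not the package's internal code; `mtcars` stands in for a training dataset.
data(mtcars)

y_var <- "mpg"                           # stand-in for a single 'y' (fusion) variable
x_vars <- setdiff(names(mtcars), y_var)  # stand-in for the candidate 'x' predictors
cor_thresh <- 0.05                       # same default as in the usage signature

# Spearman (rank) correlation of each predictor with the 'y' variable
rho <- vapply(
  x_vars,
  function(v) cor(mtcars[[v]], mtcars[[y_var]], method = "spearman"),
  numeric(1)
)

# Predictors whose absolute correlation falls below the threshold are dropped
keep <- names(rho)[abs(rho) >= cor_thresh]
screened_out <- setdiff(x_vars, keep)
```

Raising cor_thresh shrinks keep, which leaves fewer candidates for a later, more expensive selection step.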
Usage
prepXY(
  data,
  y,
  x,
  weight = NULL,
  cor_thresh = 0.05,
  lasso_thresh = 0.95,
  xmax = 100,
  xforce = NULL,
  fraction = 1,
  cores = 1
)
Arguments
- data
Data frame. Training dataset. All categorical variables should be factors and ordered whenever possible.
- y
Character or list. Variables in data to eventually fuse to a recipient dataset. If y is a list, each entry is a character vector, possibly naming multiple variables to fuse as a block.
- x
Character. Predictor variables in data common to donor and eventual recipient.
- weight
Character. Name of the observation weights column in data. If NULL (default), uniform weights are assumed.
- cor_thresh
Numeric. Predictors that exhibit less than cor_thresh absolute Spearman (rank) correlation with a y variable are screened out prior to the LASSO step. This is a fast way to exclude predictors that the LASSO step probably doesn't need to consider.
- lasso_thresh
Numeric. Controls how aggressively the LASSO step screens out predictors. A lower value is more aggressive. lasso_thresh = 0.95, for example, retains predictors that collectively explain at least 95% of the deviance explained by a "full" model.
- xmax
Integer. Maximum number of predictors returned by the LASSO step. Does not strictly control the number of final predictors returned (especially for categorical y variables), but is useful for setting a (very) soft upper bound. A lower xmax can help control computation time if a large number of x pass the correlation screen. xmax = Inf imposes no restriction.
- xforce
Character. Subset of x variables to "force" as included predictors in the results.
- fraction
Numeric. Fraction of observations in data to randomly sample. For larger datasets, sampling often has minimal effect on results but speeds up computation.
- cores
Integer. Number of cores used. Only applicable on Unix systems.
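As a rough sketch of how the arguments above fit together, the following builds an argument set in which y is a list containing one single variable and one two-variable block. All column names, the my_data object, and the fusionModel package name are assumptions made for illustration. The actual call is left commented out so the snippet runs without the package installed.

```r
# Hypothetical column names, for illustration only
y_blocks <- list(
  "income",                     # fused on its own
  c("elec_kwh", "gas_therms")   # two variables fused together as a block
)
x_vars <- c("age", "hh_size", "region", "tenure")

prep_args <- list(
  y = y_blocks,
  x = x_vars,
  weight = "wt",          # name of a weights column; NULL means uniform weights
  cor_thresh = 0.05,      # documented default
  lasso_thresh = 0.95,    # documented default
  xmax = 100,             # documented default
  xforce = "region",      # must be a subset of 'x'
  fraction = 0.5,         # randomly sample half of the observations
  cores = 1
)

# The forced predictors are drawn from the 'x' variables
stopifnot(all(prep_args$xforce %in% prep_args$x))

# Assumed call, with `my_data` a hypothetical training data frame:
# prep <- do.call(fusionModel::prepXY, c(list(data = my_data), prep_args))
```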