Prepare the 'x' and 'y' inputs
Optional but useful function that 1) provides a plausible ordering of the 'y' (fusion) variables and 2) identifies the subset of 'x' (predictor) variables likely to be consequential during subsequent model training. The output can be passed directly to train. Most useful for large datasets with many and/or highly correlated predictors. Applies an absolute Spearman rank correlation screen followed by LASSO models (via glmnet) to return a plausible ordering of 'y' and the preferred subset of 'x' variables associated with each 'y'.
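The correlation screen described above can be sketched in base R. This is a minimal illustration only, not the package's internal code: the toy data frame and its column names (x1, x2, x3, y1) are hypothetical, and the 0.05 threshold simply mirrors the default value of cor_thresh.

```r
# Hypothetical toy data; the column names are illustrative only
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y1 <- 2 * toy$x1 + rnorm(n)  # y1 depends on x1 only

cor_thresh <- 0.05  # same value as the prepXY() default
xvars <- c("x1", "x2", "x3")

# Absolute Spearman (rank) correlation between each predictor and y1
rho <- sapply(toy[xvars], function(v) abs(cor(v, toy$y1, method = "spearman")))

# Predictors below the threshold are screened out; the rest are kept
keep <- names(rho)[rho >= cor_thresh]
print(round(rho, 3))
print(keep)
```

Only the predictors in keep would then be considered by a subsequent, more expensive selection step.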
Usage
prepXY(
  data,
  y,
  x,
  weight = NULL,
  cor_thresh = 0.05,
  lasso_thresh = 0.95,
  xmax = 100,
  xforce = NULL,
  fraction = 1,
  cores = 1
)
Arguments
- data
Data frame. Training dataset. All categorical variables should be factors and ordered whenever possible.
- y
Character or list. Variables in data to eventually fuse to a recipient dataset. If y is a list, each entry is a character vector possibly indicating multiple variables to fuse as a block.
- x
Character. Predictor variables in data common to donor and eventual recipient.
- weight
Character. Name of the observation weights column in data. If NULL (default), uniform weights are assumed.
- cor_thresh
Numeric. Predictors whose absolute Spearman (rank) correlation with a y variable is less than cor_thresh are screened out prior to the LASSO step. This provides fast exclusion of predictors that the LASSO step probably doesn't need to consider.
- lasso_thresh
Numeric. Controls how aggressively the LASSO step screens out predictors. A lower value is more aggressive. lasso_thresh = 0.95, for example, retains predictors that collectively explain at least 95% of the deviance explained by a "full" model.
- xmax
Integer. Maximum number of predictors returned by the LASSO step. Does not strictly control the number of final predictors returned (especially for categorical y variables), but is useful for setting a (very) soft upper bound. A lower xmax can help control computation time if a large number of x pass the correlation screen. xmax = Inf imposes no restriction.
- xforce
Character. Subset of x variables to "force" as included predictors in the results.
- fraction
Numeric. Fraction of observations in data to randomly sample. For larger datasets, sampling often has minimal effect on results but speeds up computation.
- cores
Integer. Number of cores used. Only applicable on Unix systems.
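The lasso_thresh rule can be illustrated with a small base R sketch. It is a stand-in under stated assumptions, not what prepXY does internally: it uses nested linear models ordered by absolute Spearman correlation in place of a LASSO path from glmnet, treats R-squared as the share of deviance explained (Gaussian case only), and uses hypothetical variables x1 to x4 and yv.

```r
# Hypothetical predictors and response; names are illustrative only
set.seed(42)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
yv <- 3 * X$x1 + 1 * X$x2 + 0.1 * X$x3 + rnorm(n)

lasso_thresh <- 0.95  # same value as the prepXY() default

# Share of deviance (here, variance) explained by the "full" model
full_r2 <- summary(lm(yv ~ ., data = X))$r.squared

# Stand-in for a LASSO path: order predictors by absolute Spearman correlation
rho <- sapply(X, function(v) abs(cor(v, yv, method = "spearman")))
ord <- names(sort(rho, decreasing = TRUE))

# Explained share for each leading subset of the ordered predictors
r2_path <- sapply(seq_along(ord), function(k) {
  summary(lm(yv ~ ., data = X[ord[seq_len(k)]]))$r.squared
})

# Smallest leading subset reaching lasso_thresh of the full model's share
k_keep <- which(r2_path >= lasso_thresh * full_r2)[1]
keep <- ord[seq_len(k_keep)]
print(keep)
```

Lowering lasso_thresh in this sketch shrinks keep, which matches the description of lower values being more aggressive.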