
Optional but useful function that (1) provides a plausible ordering of the 'y' (fusion) variables and (2) identifies the subset of 'x' (predictor) variables likely to be consequential during subsequent model training. The output can be passed directly to train(). It is most useful for large datasets with many and/or highly correlated predictors. The function applies an absolute Spearman rank correlation screen and then fits LASSO models (via glmnet) to select the preferred subset of 'x' variables associated with each 'y'.

Usage

prepXY(
  data,
  y,
  x,
  weight = NULL,
  cor_thresh = 0.05,
  lasso_thresh = 0.95,
  xmax = 100,
  xforce = NULL,
  fraction = 1,
  cores = 1
)

Arguments

data

Data frame. Training dataset. All categorical variables should be factors and ordered whenever possible.

y

Character or list. Variables in data to eventually fuse to a recipient dataset. If y is a list, each entry is a character vector possibly indicating multiple variables to fuse as a block.
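
The two accepted forms of y can be illustrated with a small base-R sketch. The variable names below are hypothetical and are not taken from any dataset in the package; the point is only that a list entry with more than one name denotes a block to be fused jointly.

```r
# Hypothetical variable names, for illustration only
y_vec <- c("income", "rent", "electricity", "gas")

# Character form: every variable is fused on its own
y_single <- y_vec

# List form: "electricity" and "gas" form one block; the others stay single
y_blocks <- list("income", "rent", c("electricity", "gas"))

# Number of variables in each list entry
block_sizes <- lengths(y_blocks)
print(block_sizes)
```

Either y_single or y_blocks could then be supplied as the y argument, assuming the named columns exist in data.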

x

Character. Predictor variables in data common to donor and eventual recipient.

weight

Character. Name of the observation weights column in data. If NULL (default), uniform weights are assumed.

cor_thresh

Numeric. Predictors whose absolute Spearman (rank) correlation with a y variable is less than cor_thresh are screened out prior to the LASSO step. This provides fast exclusion of predictors that the LASSO step probably does not need to consider.
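
As a rough illustration of this kind of screen, the following base-R sketch computes absolute Spearman correlations on synthetic data and keeps predictors at or above the threshold. It is not the internal implementation of prepXY(): the data, the column names, and the unweighted use of cor() are all assumptions made for the example.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for a training dataset (hypothetical column names)
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# y1 depends monotonically (but nonlinearly) on x1 only
d$y1 <- exp(d$x1) + rnorm(n, sd = 0.1)

cor_thresh <- 0.05
xvars <- c("x1", "x2", "x3")

# Absolute Spearman (rank) correlation of each predictor with y1
rho <- abs(cor(d[xvars], d$y1, method = "spearman"))[, 1]

# Predictors below the threshold are screened out; the rest pass
passed <- names(rho)[rho >= cor_thresh]
print(round(rho, 3))
print(passed)
```

Because the rank correlation only requires a monotone relationship, x1 passes easily despite the nonlinear link. With a threshold as low as 0.05 and n = 500, unrelated predictors can also pass by chance, which is consistent with treating the screen as a coarse first pass ahead of the LASSO step.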

lasso_thresh

Numeric. Controls how aggressively the LASSO step screens out predictors. Lower values are more aggressive. For example, lasso_thresh = 0.95 retains predictors that collectively explain at least 95% of the deviance explained by a "full" model.
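
One way to picture this threshold is with a plain glmnet fit on synthetic data. The sketch below is an assumption about how such a rule could work, not a description of prepXY() internals: it treats the least-penalized model on the glmnet path as the "full" model and picks the most-penalized step whose deviance ratio reaches lasso_thresh times the full model's. It runs only if the glmnet package is installed.

```r
# Sketch only: requires the 'glmnet' package; not the prepXY() implementation
keep <- NULL
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  n <- 200

  # Synthetic predictors (hypothetical names); only x1 and x2 matter
  X <- matrix(rnorm(n * 10), nrow = n,
              dimnames = list(NULL, paste0("x", 1:10)))
  yv <- 2 * X[, "x1"] - X[, "x2"] + rnorm(n)

  # LASSO path; dev.ratio is the fraction of deviance explained at each step
  fit <- glmnet::glmnet(X, yv, family = "gaussian")

  lasso_thresh <- 0.95

  # Assumption: the last (least-penalized) step stands in for the "full" model
  target <- lasso_thresh * max(fit$dev.ratio)

  # First step along the path that reaches the target
  i <- which(fit$dev.ratio >= target)[1]

  # Predictors with non-zero coefficients at that step
  b <- fit$beta[, i]
  keep <- names(b)[b != 0]
  print(keep)
}
```

Lowering lasso_thresh in this sketch moves the selected step toward the heavily penalized end of the path, so fewer predictors are retained.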

xmax

Integer. Maximum number of predictors returned by the LASSO step. It does not strictly control the number of final predictors returned (especially for categorical y variables), but it is useful for setting a (very) soft upper bound. A lower xmax can help control computation time if a large number of x variables pass the correlation screen. xmax = Inf imposes no restriction.

xforce

Character. Subset of x variables to "force" as included predictors in the results.

fraction

Numeric. Fraction of observations in data to randomly sample. For larger datasets, sampling often has minimal effect on results but speeds up computation.
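
The effect of this argument is comparable to drawing a simple random sample of rows before analysis. A minimal base-R sketch on synthetic data follows; how prepXY() itself rounds the sample size or whether it accounts for weights when sampling is not stated here, so the use of ceiling() and unweighted sampling are assumptions.

```r
set.seed(123)

# Synthetic stand-in for a large training dataset
d <- data.frame(id = 1:1000, x1 = rnorm(1000))

fraction <- 0.25

# Simple random sample of rows, without replacement
idx <- sample.int(nrow(d), size = ceiling(fraction * nrow(d)))
d_sub <- d[idx, , drop = FALSE]

print(nrow(d_sub))
```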

cores

Integer. Number of cores used. Only applicable on Unix systems.

Value

List with named slots "y" and "x", each a list of the same length. The former gives the preferred fusion order; the latter gives the preferred set of predictor variables for each entry in that order.

Examples

library(fusionModel)

y <- names(recs)[c(14:16, 20:22)]
x <- names(recs)[2:13]

# Fusion variable "blocks" are respected by prepXY()
y <- c(list(y[1:2]), y[-c(1:2)])

# Do the prep work...
prep <- prepXY(data = recs, y = y, x = x)

# The result can be passed to train()
train(data = recs, y = prep$y, x = prep$x)