Prepare ML-ready data sets from outcome and predictor data

Given feature, a tibble representing a wide format feature matrix, and outcome, a tibble containing the outcome information (regression, classification and survival are supported), prepare_ml() will provide data sets suitable for various machine learning problems along with additional information. The data preparation steps include, but are not limited to, data splitting, handling of missing values, normalization, and removal of redundant information (highly correlated features). Please refer to the Details section for more information.

Usage

prepare_ml(
  feature,
  outcome,
  outcome_name = NULL,
  level_order = NULL,
  prep_recipe = NULL,
  train_prop = 3/4,
  strata_trt = FALSE,
  seed = 1130,
  prep_step_normalize = TRUE,
  prep_step_knnimpute = TRUE,
  prep_step_log = TRUE,
  prep_step_corr = TRUE,
  prep_step_dummy = FALSE,
  thres_log = 2,
  thres_count = 10,
  thres_corr = 0.9,
  thres_lump = 0.05,
  thres_imp = 0.8,
  thres_nzv_freq = 95/5,
  thres_nzv_unique = 10,
  vars_imp_ignore = c(".trt"),
  vars_fct_expl_na = NULL,
  vars_ordinalscore = NULL,
  vars_keep_corr = NULL,
  one_hot = NULL,
  log_base = exp(1),
  outlier_remove = FALSE,
  outlier_ctrl = list(coef = 3),
  quiet = FALSE
)

Arguments

feature: feature matrix in wide format, e.g. output object of build(), i.e. containing .id column and predictors
outcome: tibble containing the .id column and the outcome of interest; see prepare_ml_outcome().
outcome_name: single character string giving the name of the outcome column for regression or classification. For survival analysis and for repeated measurements analysis (classification or regression), a named vector of length two needs to be specified: c(.time = "<time-coln>", .status = "<status-coln>") for survival and c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>") for repeated measurements. See the Details section.
level_order: level order for a classification outcome. Default NULL keeps the natural order (only used for classification).
prep_recipe: a custom, pre-defined recipes::recipe() may be provided for data preparation. Defaults to NULL, yielding a data-driven preparation. Please refer to the details section to learn about the individual recipe steps.
train_prop: the proportion of data to be used for the training set. Has to be in [0.5;1.0]. Defaults to 3/4, keeping a quarter of the data for testing.
strata_trt: boolean. Expand the default stratum variable (.out for classification, .status for survival, NULL for regression) by trt (if character, else ignored). Defaults to FALSE, but setting it to TRUE is highly recommended.
seed: optionally set a seed before the data splitting.
prep_step_knnimpute, prep_step_log, prep_step_normalize, prep_step_corr, prep_step_dummy: logicals determining whether or not the corresponding step function should be included in the recipe, possibly specified further using additional parameters (thres_*, log_base, one_hot). Please refer to the details section for the full list of recipe steps.
thres_log: variables will be log-transformed (with base log_base) if prep_step_log = TRUE, all observations are positive, and e1071::skewness() > thres_log, where thres_log defaults to 2.
thres_count: integer variables with no more than thres_count distinct values are considered as count variables and are excluded from the log-transformation and normalization. Defaults to 10.
thres_corr: if prep_step_corr = TRUE, thres_corr is passed to the threshold argument of recipes::step_corr() to remove highly correlated features. Defaults to 0.9.
thres_lump: this parameter is used to prevent renaming of a single low frequency class to 'other' by recipes::step_other(), to which thres_lump is passed as parameter threshold. Defaults to 0.05.
thres_imp: minimal proportion of non-missing data per feature required for the feature to be kept in the data and completed using recipes::step_impute_knn(). Variables not meeting the threshold will be dropped and will not be included in the data_prep data. By default thres_imp = 0.8, i.e. variables will be dropped if the proportion of available data is less than 80%. Variables listed in vars_imp_ignore will never be imputed; observations with missing data in these variables will be removed.
thres_nzv_freq, thres_nzv_unique: parameters passed to recipes::step_nzv() with defaults thres_nzv_freq = 95/5 and thres_nzv_unique = 10.
vars_imp_ignore: variables that shall not be imputed can be specified in vars_imp_ignore (vector of column names, defaults to vars_imp_ignore = '.trt'). Observations with missing values in these variables will be removed. Removal is documented in removed$rows.
vars_fct_expl_na: column names of factors for which NAs should be treated as an explicit factor level. Defaults to NULL.
vars_ordinalscore: column names of ordinal factor variables to be converted into numeric scores. Defaults to NULL.
vars_keep_corr: choose these variables over other options when removing variables due to high correlation in recipes::step_corr(). See recipes::step_rm() below for details.
one_hot: boolean. Passed to recipes::step_dummy() to choose one hot encoding over dummy encoding.
log_base: base to use for log-transformation in recipes::step_log(). Defaults to exp(1).
outlier_remove, outlier_ctrl: For outcome mode regression only, see prepare_ml_outcome() for details on how outliers are removed from outcome variables. outlier_remove defaults to FALSE, outlier_ctrl to list(coef = 3).
quiet: boolean. Suppress the console messages on NA and outlier removal that are issued during outcome preparation. Defaults to FALSE.
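The screening rules described for thres_imp, thres_count and thres_log can be sketched in a few lines of R. This is only an illustration of the documented criteria, not the package's implementation: the helper names keep_for_imputation(), is_count_var() and is_log_candidate() are hypothetical, and the actual variable selection is done internally (see prepare_ml_vars()).

```r
# Hypothetical helpers illustrating the documented thresholds (not martini code).
library(e1071)

# Keep a feature only if at least `thres_imp` of its values are observed.
keep_for_imputation <- function(x, thres_imp = 0.8) {
  mean(!is.na(x)) >= thres_imp
}

# Integer variables with no more than `thres_count` distinct values
# are treated as count variables.
is_count_var <- function(x, thres_count = 10) {
  is.integer(x) && length(unique(x[!is.na(x)])) <= thres_count
}

# Candidate for log-transformation: not a count variable, all observed
# values positive, and skewness above `thres_log`.
is_log_candidate <- function(x, thres_log = 2, thres_count = 10) {
  x_obs <- x[!is.na(x)]
  !is_count_var(x, thres_count) &&
    all(x_obs > 0) &&
    e1071::skewness(x_obs) > thres_log
}

skewed    <- c(rep(1, 99), 1000)        # one extreme value, strongly right-skewed
symmetric <- as.numeric(1:100)          # positive but symmetric
visits    <- c(1L, 2L, 2L, 3L, NA, 1L)  # integer with few distinct values

is_log_candidate(skewed)     # TRUE
is_log_candidate(symmetric)  # FALSE
is_count_var(visits)         # TRUE
keep_for_imputation(visits)  # TRUE, 5 of 6 values are observed
```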

Value

Data sets

prepare_ml() produces a list that contains the data set both with (data_prep) and without (data_raw) the specified ML preparation steps applied. Both versions are split into a training and a test set. In addition, split contains the combined rsample::initial_split() object that the train and test data were extracted from. Depending on the programming workflow, one might be more convenient to use than the other. If train_prop was set to 1 (i.e. no splitting was done), both data_test slots as well as split are NULL, and train contains the full ML data set.
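As a rough illustration of how a split object relates to the train and test data, the following sketch uses rsample directly on a toy data frame. The data and column values are made up, and whether prepare_ml() calls rsample::initial_split() with exactly these arguments is an assumption.

```r
library(rsample)

set.seed(1130)  # a seed is set before the data splitting
dat <- data.frame(.id = 1:40, .out = rnorm(40), x = rnorm(40))

# 3/4 of the rows go to the training set, the remaining quarter to the test set
split <- initial_split(dat, prop = 3/4)
train <- training(split)
test  <- testing(split)

nrow(train)
nrow(test)
```

Keeping the split object around allows both sets to be re-extracted later with training() and testing(), which is handy in tidymodels workflows that expect a split rather than two data frames.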

The slot outcome contains a list giving name, the standardized names of the outcome columns in the data sets (.out for regression/classification, .time and .status for survival), as well as mode, a character string giving the outcome mode (regression/classification/survival).

The dictionary available as an attribute of feature is updated with information on the outcome variable, any log-transformation, as well as alternative labels (label2, label3) indicating correlated variable groups, e.g. HB (HCT), where HB is kept for the analysis and HCT was dropped due to an absolute correlation above thres_corr. The dictionary is available from the dict slot, which is NULL if no such attribute is defined.

The source slot simply passes the source attribute of feature, NULL if no such attribute is defined. If build() from the martini package was used to generate feature, this attribute lists the full paths of the files that were used in data generation of feature.

Data preparation and documentation

prep_recipe: the prepared recipe object.
prep_params: documents the parameters/thresholds used in the data preparation, giving bare value slots as well as a verbose description in text.
removed: a list of removed rows and columns along with information on why and in which recipe step the data was removed.
high_corr: a tibble listing correlations above thres_corr. NULL if prep_step_corr = FALSE.
input: a list giving the martini packageVersion and a list of (most) input parameters, including the seed used.

Details

The following order of recipe steps for data preparation will be applied (if no recipe is provided). The variable sets that a particular step function will be applied to are determined based on user input and output of the function prepare_ml_vars(), respectively. Further details on particular steps are given below.

drop variables, e.g. those not meeting the minimum threshold for the proportion of non-missing data, or those removed in relation to the vars_keep_corr parameter (see below) (step_rm())
remove observations with missing data in outcome (step_naomit())
knn imputation on variables with missing values that are not explicitly excluded from imputation (vars_imp_ignore). Please note that missing values can still occur after imputation if a large majority (or all) of the imputing variables are also missing (see ?recipes::step_impute_knn()). The affected subjects/observations will be removed to obtain a complete data set and are listed in removed$rows of the output object.
omit observations with remaining missing values (i.e. in variables that were excluded from imputation and not dropped before) (step_naomit())
removal of near-zero variance variables (step_nzv())
log-transformation (step_log())
normalization (step_normalize())
removal of highly correlated variables (step_corr())
lumping of low frequency factor levels into a single class (step_other())
transform ordinal factors into numeric variables (step_ordinalscore())
dummy/one hot encoding (step_dummy())
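The order of steps listed above can be sketched with plain recipes calls on a toy data set. This is a simplified illustration under assumptions, not the recipe that prepare_ml() builds: the data are made up, the variable sets are chosen by hand instead of via prepare_ml_vars(), and steps such as step_rm(), step_naomit() and step_ordinalscore() are left out for brevity.

```r
library(recipes)

set.seed(1130)
n  <- 100
x1 <- rnorm(n)
dat <- data.frame(
  .out = rnorm(n),
  x1   = x1,
  x2   = x1 + rnorm(n, sd = 0.01),  # almost identical to x1
  x3   = rexp(n),                   # positive, right-skewed
  grp  = factor(rep(c("a", "b", "c"), times = c(60, 37, 3)))
)
dat$x3[1:5] <- NA                   # a few missing values to impute

rec <- recipe(.out ~ ., data = dat)
rec <- step_impute_knn(rec, x3)                                    # knn imputation
rec <- step_nzv(rec, all_predictors(), freq_cut = 95/5, unique_cut = 10)
rec <- step_log(rec, x3, base = exp(1))                            # log-transformation
rec <- step_normalize(rec, all_numeric_predictors())               # center and scale
rec <- step_corr(rec, all_numeric_predictors(), threshold = 0.9)   # drop correlated
rec <- step_other(rec, all_nominal_predictors(), threshold = 0.05) # lump rare levels
rec <- step_dummy(rec, all_nominal_predictors(), one_hot = FALSE)  # dummy encoding

prepped <- prep(rec, training = dat)     # estimate all step parameters
baked   <- bake(prepped, new_data = NULL) # return the processed training data

names(baked)
```

Here prep() estimates everything the steps need from the training data (imputation neighbors, means and standard deviations, which columns to drop), and bake() applies those estimates, so the same prepped recipe can later be baked on test data.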

The vars_keep_corr parameter allows prioritizing these variables in the step_corr() part of the recipe over the variables that are highly correlated with them (i.e. exceeding thres_corr). This makes it possible to choose a representative from a set of correlated variables that is, e.g., commonly used in the context of the indication or easier to interpret. Please note that these imposed restrictions may increase the total number of variables removed in this step in comparison to the unrestricted version.

A note on step_impute_knn() and the interpretation of the prep()ped recipe: The variables listed for this step are the ones that are used for the imputation step. It does not mean that missing values in these variables have been or will be imputed. For more details on this matter please refer to the documentation of tidymodels and the difference in prep() and bake(), in particular. For example, vars_imp_ignore includes the standard treatment variable .trt by default to prevent any imputations; however, it will be listed in the variable set of the prep()ped recipe (for older versions of recipes package). Don't panic. #rtfm.

For repeated measurement analyses, all observations of the same .id will end up either in the training or in the test set (using rsample::group_initial_split()). Note that the strata argument will be ignored (with a warning) for rsample versions below 1.1.1. Currently, grouping is not accounted for in missing value imputation.
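The grouped splitting behaviour can be illustrated with rsample::group_initial_split() on a made-up long-format data set. The column names follow the standardized names used in this documentation; the data themselves, and the exact arguments prepare_ml() passes, are assumptions. The function requires a reasonably recent rsample version.

```r
library(rsample)

set.seed(1130)
long <- data.frame(
  .id     = rep(1:20, each = 3),   # 20 subjects with 3 time points each
  .rmtime = rep(1:3, times = 20),
  .out    = rnorm(60)
)

# Split by subject: all rows of one .id go to the same side of the split
split <- group_initial_split(long, group = .id, prop = 3/4)
train <- training(split)
test  <- testing(split)

# No subject appears in both sets
length(intersect(train$.id, test$.id))
```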

Specification of outcome_name for survival analysis or repeated measurements: For survival analysis, specify column names for 'time' and 'status' of the Surv object: c(.time = "<time-coln>", .status = "<status-coln>"), where .time is numeric and .status is binary with 0 coding for censored, and 1 coding for event. Currently, only right-censoring is supported.

For repeated measurements, specify outcome_name as c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>"). The outcome mode will be guessed as regression or classification according to the type of the column specified in .out.

If outcome_name = NULL (default), the first column in outcome that's not .id is chosen for outcome_name and the outcome mode is guessed accordingly. Thus, neither survival nor repeated measurement analysis will ever be guessed.
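The three ways of specifying outcome_name can be sketched as follows. The outcome tibble and its column names (os_time, os_event, response, visit, score) are invented for illustration, and guess_mode() is a hypothetical stand-in for the internal guessing logic, which may well use different rules.

```r
# Toy outcome data; all column names except .id are made up
outcome <- data.frame(
  .id      = 1:4,
  os_time  = c(12.5, 3.1, 8.0, 20.2),
  os_event = c(1, 0, 1, 0),          # 0 = censored, 1 = event
  response = factor(c("yes", "no", "yes", "no"))
)

# Survival: named vector mapping .time and .status to column names
outcome_name_surv <- c(.time = "os_time", .status = "os_event")

# Repeated measurements: named vector mapping .rmtime and .out
outcome_name_rm <- c(.rmtime = "visit", .out = "score")

# Default (outcome_name = NULL): the first column that is not .id
default_name <- setdiff(names(outcome), ".id")[1]

# Hypothetical mode guess based on the column type
guess_mode <- function(x) if (is.numeric(x)) "regression" else "classification"

default_name                          # "os_time"
guess_mode(outcome[[default_name]])   # "regression"
guess_mode(outcome$response)          # "classification"
```

With this toy data the default would pick os_time and treat it as a regression outcome, which shows why the survival specification has to be given explicitly.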

Authors

Maike Ahrens (ahrensmaike), Sebastian Voss (svoss09)