Given feature, a tibble representing a wide-format feature matrix, and
outcome, a tibble containing the outcome information (regression,
classification, and survival are supported), prepare_ml() provides data sets
suitable for various machine learning problems along with additional
information. The data preparation steps include, but are not limited to,
data splitting, handling of missing values, normalization, and removal of
redundant information (highly correlated features).
Please refer to the Details section for more information.
Usage
prepare_ml(
  feature,
  outcome,
  outcome_name = NULL,
  level_order = NULL,
  prep_recipe = NULL,
  train_prop = 3/4,
  strata_trt = FALSE,
  seed = 1130,
  prep_step_normalize = TRUE,
  prep_step_knnimpute = TRUE,
  prep_step_log = TRUE,
  prep_step_corr = TRUE,
  prep_step_dummy = FALSE,
  thres_log = 2,
  thres_count = 10,
  thres_corr = 0.9,
  thres_lump = 0.05,
  thres_imp = 0.8,
  thres_nzv_freq = 95/5,
  thres_nzv_unique = 10,
  vars_imp_ignore = c(".trt"),
  vars_fct_expl_na = NULL,
  vars_ordinalscore = NULL,
  vars_keep_corr = NULL,
  one_hot = NULL,
  log_base = exp(1),
  outlier_remove = FALSE,
  outlier_ctrl = list(coef = 3),
  quiet = FALSE
)
Arguments
- feature
feature matrix in wide format, e.g. the output object of build(), i.e. containing an .id column and predictors
- outcome
tibble containing an .id column and the outcome of interest (see prepare_ml_outcome())
- outcome_name
single character giving the name of the outcome for regression or classification. For survival analysis and for repeated measurements analysis (classification or regression), a named vector of length two needs to be specified: c(.time = "<time-coln>", .status = "<status-coln>") for survival and c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>") for repeated measurements. See the Details section.
- level_order
level order for a classification outcome. The default NULL keeps the natural order (only used for classification).
- prep_recipe
a custom, pre-defined recipes::recipe() may be provided for data preparation. Defaults to NULL, yielding a data-driven preparation. Please refer to the Details section to learn about the individual recipe steps.
- train_prop
the proportion of data to be used for the training set. Must be in [0.5, 1.0]. Defaults to 3/4, keeping a quarter of the data for testing.
- strata_trt
boolean. Expand the default stratum variable (.out for classification, .status for tte, i.e. time-to-event/survival, NULL for regression) by trt (if character, else ignored). Defaults to FALSE, but setting it to TRUE is highly recommended.
- seed
optionally set a seed before the data splitting.
- prep_step_knnimpute, prep_step_log, prep_step_normalize, prep_step_corr, prep_step_dummy
logicals determining whether or not the corresponding step function should be included in the recipe, possibly specified further using additional parameters (thres_*, log_base, one_hot). Please refer to the Details section for the full list of recipe steps.
- thres_log
variables will be log-transformed (with base log_base) if prep_step_log = TRUE, all observations are positive, and e1071::skewness() > thres_log, where thres_log defaults to 2.
- thres_count
integer variables with no more than thres_count distinct values are considered count variables and are excluded from the log-transformation and normalization. Defaults to 10.
- thres_corr
if prep_step_corr = TRUE, thres_corr is passed to the threshold argument of recipes::step_corr(), with a default of 0.9, to remove highly correlated features.
- thres_lump
this parameter is used to prevent renaming of a single low-frequency class to 'other' by recipes::step_other(), to which thres_lump is passed as parameter threshold. Defaults to 0.05.
- thres_imp
minimal proportion of non-missing data per feature required for the feature to be kept in the data and completed using recipes::step_impute_knn(). Variables not meeting the threshold will be dropped and will not be included in the data_prep data. By default thres_imp = 0.8, i.e. variables will be dropped if the proportion of available data is less than 80%. Variables listed in vars_imp_ignore will never be imputed; observations with missing data in the respective variables will be removed.
- thres_nzv_freq, thres_nzv_unique
parameters passed to recipes::step_nzv(), with defaults thres_nzv_freq = 95/5 and thres_nzv_unique = 10.
- vars_imp_ignore
variables that shall not be imputed can be specified in vars_imp_ignore (vector of column names, defaults to vars_imp_ignore = '.trt'). Observations with missing values in these variables will be removed. Removal is documented in removed$rows.
- vars_fct_expl_na
column names of factors for which NAs should be treated as an explicit factor level. Defaults to NULL.
- vars_ordinalscore
column names of ordinal factor variables to be converted into numeric scores. Defaults to NULL.
- vars_keep_corr
choose these variables over other options when removing variables due to high correlation in recipes::step_corr(). See recipes::step_rm() below for details.
- one_hot
boolean. Passed to recipes::step_dummy() to choose one-hot encoding over dummy encoding.
- log_base
base to use for the log-transformation in recipes::step_log(). Defaults to exp(1).
- outlier_remove, outlier_ctrl
For outcome mode regression only; see prepare_ml_outcome() for details on how outliers are removed from outcome variables. outlier_remove defaults to FALSE, outlier_ctrl to list(coef = 3).
- quiet
boolean. Suppress console messages on NA and outlier removal during outcome preparation. Defaults to FALSE.
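The interplay of thres_log and thres_count can be illustrated with a minimal base R sketch. The helpers skew() and is_log_candidate() are hypothetical and only restate the rule documented above; they are not martini's internal code, and the moment-based skewness is assumed to be close to what e1071::skewness() returns by default.

```r
# Hypothetical helper: moment-based sample skewness (assumed to be close to
# the default estimate of e1071::skewness()).
skew <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / stats::sd(x)^3
}

# Hypothetical helper restating the documented rule for log-transformation.
is_log_candidate <- function(x, thres_log = 2, thres_count = 10) {
  if (!is.numeric(x)) return(FALSE)
  obs <- x[!is.na(x)]
  # integer variables with few distinct values are treated as counts
  is_count <- is.integer(x) && length(unique(obs)) <= thres_count
  !is_count && all(obs > 0) && skew(obs) > thres_log
}

toy <- data.frame(
  skewed    = c(rep(1, 99), 1000),           # positive, strongly right-skewed
  symmetric = as.numeric(1:100),             # positive, but not skewed
  visits    = c(rep(1L, 95), rep(9L, 5))     # skewed, but a count variable
)

res <- vapply(toy, is_log_candidate, logical(1))
res
```

Only skewed qualifies here: symmetric fails the skewness criterion, and visits is excluded because it is an integer variable with no more than thres_count distinct values.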
Value
Data sets
prepare_ml() produces a list that contains the data set both with (data_prep)
and without (data_raw) applying the specified ML preparation steps. Both
versions are split into a train and a test set. In addition, split contains
the combined rsample::initial_split() object that the train and test data
were extracted from. Depending on the programming workflow, one might be
more convenient to use than the other. Both data_test slots as well as split
are NULL if train_prop was set to 1 (i.e. no splitting was done), and train
contains the full ML data set.
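As a rough illustration of how a split object relates to its train and test sets, the following sketch calls rsample directly on made-up data. It assumes the rsample package is installed; the toy data and the stratification by .out are illustrative choices, not a reproduction of what prepare_ml() does internally.

```r
library(rsample)

set.seed(1130)  # same value as the default seed argument above

# Made-up classification data with an imbalanced outcome
dat <- data.frame(
  .id  = 1:100,
  .out = factor(rep(c("yes", "no"), times = c(30, 70))),
  x    = rnorm(100)
)

# Keep 3/4 of the rows for training, stratified by the outcome
split <- initial_split(dat, prop = 3/4, strata = .out)
train <- training(split)
test  <- testing(split)

# Every row ends up in exactly one of the two sets
c(train = nrow(train), test = nrow(test))
```

Keeping the split object around is convenient when downstream tidymodels functions expect it, while the extracted data frames are handy for everything else.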
The slot outcome contains a list giving name, the standardized names of the
outcome column(s) in the data sets (.out for regression/classification,
.time and .status for survival), as well as mode, a character string giving
the outcome mode (regression/classification/survival).
The dictionary available as an attribute of feature is updated with
information on the outcome variable, any log-transformation, as well as
alternative labels (label2, label3) indicating correlated variable groups,
e.g. HB (HCT), where HB is kept for the analysis and HCT was dropped due to
an absolute correlation above thres_corr. The dictionary is available from
the dict slot, which is NULL if no such attribute is defined.
The source slot simply passes on the source attribute of feature and is NULL
if no such attribute is defined. If build() from the martini package was
used to generate feature, this attribute lists the full paths of the files
that were used in the data generation of feature.
Data preparation and documentation
prep_recipe contains the prepared recipe object.
prep_params documents the parameters/thresholds used in the data
preparation, giving bare value slots as well as a verbose description in
text.
removed gives a list of removed rows and columns along with the information
on why/in which recipe step the data was removed.
high_corr is a tibble listing correlations above thres_corr. It is NULL if
prep_step_corr = FALSE.
input is a list giving the martini packageVersion and a list of (most) input
parameters, including the seed used.
Details
The following order of recipe steps for data preparation will be applied
(if no recipe is provided). The variable sets that a particular step function
will be applied to are determined based on user input and the output of the
function prepare_ml_vars(), respectively.
Further details on particular steps are given below.
- drop variables, e.g. those not meeting the minimum threshold for the non-missing data proportion (step_rm()) or for variable removal related to the vars_keep_corr parameter (see below).
- remove observations with missing data in the outcome (step_naomit())
- knn imputation on variables with missing values that are not explicitly excluded from imputation (vars_imp_ignore). Please note that missing values can still occur after imputation if a large majority (or all) of the imputing variables are also missing (see ?recipes::step_impute_knn()). Related subjects/observations will be removed to obtain a complete data set and are listed in removed$rows of the output object.
- omit observations with remaining missing values (i.e. in variables that were excluded from imputation and not dropped before) (step_naomit())
- removal of near-zero variance variables (step_nzv())
- log-transformation (step_log())
- normalization (step_normalize())
- removal of highly correlated variables (step_corr())
- lumping of low-frequency factor levels into a single class (step_other())
- transformation of ordinal factors into numeric variables (step_ordinalscore())
- dummy/one-hot encoding (step_dummy())
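The order of steps listed above can be sketched with the recipes package on made-up data. This is an illustration under stated assumptions, not the recipe that prepare_ml() builds: the toy data, the choice of x2 as the only log-transformed variable, and the omission of step_rm() and step_ordinalscore() are simplifications, and the recipes package is assumed to be installed.

```r
library(recipes)

set.seed(1130)
n  <- 80
x1 <- rnorm(n)
toy <- data.frame(
  .out   = rnorm(n),
  x1     = x1,
  x1_dup = x1 + rnorm(n, sd = 0.01),  # nearly collinear with x1
  x2     = exp(rnorm(n)),             # strictly positive, right-skewed
  grp    = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$x2[c(3, 17)] <- NA  # missing predictor values, to be imputed
toy$.out[5]      <- NA  # missing outcome, this row will be dropped

rec <- recipe(.out ~ ., data = toy) %>%
  step_naomit(all_outcomes()) %>%                          # NA in outcome
  step_impute_knn(all_predictors()) %>%                    # knn imputation
  step_naomit(all_predictors()) %>%                        # remaining NAs
  step_nzv(all_predictors(), freq_cut = 95 / 5, unique_cut = 10) %>%
  step_log(x2, base = exp(1)) %>%                          # skewed variable
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  step_other(all_nominal_predictors(), threshold = 0.05) %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

# Estimate the steps on the toy data and return the processed training data
baked <- bake(prep(rec, training = toy), new_data = NULL)
names(baked)
```

In this sketch, one of x1 and x1_dup is dropped by step_corr(), the row with the missing outcome is removed, and grp is replaced by dummy columns.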
The vars_keep_corr parameter allows prioritizing these variables in the
step_corr() part of the recipe over the variables that yield high
correlations with them (i.e. exceeding thres_corr).
This makes it possible to choose a representative from a set of correlated
variables that is e.g. commonly used in the context of the indication or
easier to interpret. Please note that these imposed restrictions may increase
the total number of variables removed in this step in comparison to the
unrestricted version.
A note on step_impute_knn() and the interpretation of the prep()ped recipe:
The variables listed for this step are the ones that are used for the
imputation step. It does not mean that missing values in these variables
have been or will be imputed.
For more details on this matter please refer to the documentation of
tidymodels and the difference between prep() and bake(), in particular.
For example, vars_imp_ignore includes the standard treatment variable .trt
by default to prevent it from being imputed; however, it will be listed in
the variable set of the prep()ped recipe (for older versions of the recipes
package). Don't panic. #rtfm.
For repeated measurement analyses, all observations of the same .id will end
up either in the training or in the test set (using
rsample::group_initial_split()). Note that the strata argument will be
ignored (with a warning) for rsample versions below 1.1.1.
Currently, grouping is not accounted for in missing value imputation.
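The grouped split can be illustrated with rsample directly. The sketch below uses made-up long-format data and assumes an rsample version that provides group_initial_split(); it is not the call made inside prepare_ml().

```r
library(rsample)

set.seed(1130)

# Made-up repeated measurements: 20 subjects with 3 time points each
long <- data.frame(
  .id     = rep(sprintf("S%02d", 1:20), each = 3),
  .rmtime = rep(1:3, times = 20),
  .out    = rnorm(60)
)

# All rows sharing an .id are assigned to the same side of the split
split <- group_initial_split(long, group = .id, prop = 3/4)
train <- training(split)
test  <- testing(split)

# No subject appears in both sets
intersect(train$.id, test$.id)
```

Splitting by subject rather than by row avoids leaking information from a subject's training observations into the test set.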
Specification of outcome_name for survival analysis or repeated
measurements: For survival analysis, specify the column names for 'time' and
'status' of the Surv object:
c(.time = "<time-coln>", .status = "<status-coln>"), where .time is numeric
and .status is binary with 0 coding for censored and 1 coding for event.
Currently, only right-censoring is supported.
For repeated measurements, specify outcome_name as
c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>").
The outcome mode will be guessed as regression or classification according
to the type of the column specified in .out.
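A small base R sketch of the survival specification follows. The column names os_months and os_event are made up, and the select-and-rename step only mimics the standardized column names described in the Value section; it is an assumption, not martini's actual code.

```r
# Hypothetical outcome table; the column names are made up for illustration.
outcome <- data.frame(
  .id       = c("S01", "S02", "S03", "S04"),
  os_months = c(12.5, 30.1, 7.8, 24.0),
  os_event  = c(1, 0, 1, 0)  # 1 = event, 0 = right-censored
)

# Names are the standardized columns, values are the columns in `outcome`.
outcome_name <- c(.time = "os_months", .status = "os_event")

# Assumed standardization step: select the columns and rename them.
std <- outcome[, c(".id", unname(outcome_name))]
names(std) <- c(".id", names(outcome_name))

# Checks mirroring the documented requirements for a survival outcome
stopifnot(is.numeric(std$.time), all(std$.status %in% c(0, 1)))
names(std)
```

For repeated measurements, the same pattern would apply with the names '.rmtime' and '.out' in place of .time and .status.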
If outcome_name = NULL (default), the first column in outcome that's not .id
is chosen for outcome_name and the outcome mode is guessed accordingly.
Thus, neither survival nor repeated measurement analysis will ever be
guessed.
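The default behaviour can be sketched in base R. The helper guess_outcome() is hypothetical, and the mapping of numeric columns to regression and all other types to classification is an assumption about how the mode is guessed from the column type.

```r
# Hypothetical helper restating the documented default; not martini's code.
guess_outcome <- function(outcome, outcome_name = NULL) {
  if (is.null(outcome_name)) {
    # first column that is not the identifier
    outcome_name <- setdiff(names(outcome), ".id")[1]
  }
  column <- outcome[[outcome_name]]
  # assumed mapping: numeric -> regression, anything else -> classification
  mode <- if (is.numeric(column)) "regression" else "classification"
  list(name = outcome_name, mode = mode)
}

outcome <- data.frame(
  .id      = c("S01", "S02", "S03"),
  response = factor(c("yes", "no", "yes")),
  change   = c(-1.2, 0.4, 2.3)
)

guess_default  <- guess_outcome(outcome)            # picks `response`
guess_explicit <- guess_outcome(outcome, "change")  # explicit outcome name
```

Since only a single column is ever picked this way, a survival or repeated measurements outcome always has to be requested explicitly through a named outcome_name vector.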