Prepares an ML-ready outcome data set (used in prepare_ml)
Usage
prepare_ml_outcome(
outcome,
outcome_name = NULL,
level_order = NULL,
outlier_remove = FALSE,
outlier_ctrl = list(coef = 3)
)
Arguments
- outcome
tibble containing the .id column and the outcome of interest
- outcome_name
single character giving the name of the outcome for regression or classification. For survival analysis and for repeated measurements analysis (classification or regression), a named vector of length two needs to be specified: c(.time = "<time-coln>", .status = "<status-coln>") for survival and c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>") for repeated measurements. See Details section.
- level_order
Level order for a classification outcome. NULL keeps the natural order (only used for classification).
- outlier_remove
Remove outliers in a regression outcome based on the 'boxplot definition'. The outlier coefficient can be modified in outlier_ctrl (only used for regression).
- outlier_ctrl
Control list for the outlier removal, if outlier_remove is TRUE. Currently, the list contains only the boxplot outlier coefficient coef, which defaults to 3.
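The 'boxplot definition' with a coefficient can be illustrated in base R. The sketch below assumes the rule is equivalent to what boxplot.stats() implements (points lying more than coef times the hinge spread beyond the hinges); whether prepare_ml_outcome uses this exact function internally is not stated here, and the data are made up.

```r
# Hypothetical regression outcome with one extreme value.
y <- c(10, 12, 11, 13, 12, 14, 11, 100)

# boxplot.stats() flags points lying more than coef * (hinge spread)
# below the lower hinge or above the upper hinge.
# coef = 3 matches the documented default of outlier_ctrl.
flagged <- boxplot.stats(y, coef = 3)$out

# Keep only the observations that were not flagged.
y_clean <- y[!(y %in% flagged)]
```

A coefficient of 3 is stricter about what counts as an outlier than the value of 1.5 that boxplot() uses by default, so fewer observations are removed.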
Value
A list with the following entries:
- outcome
The outcome data set containing only the id and one or two columns with standardized column names (.out for regression or classification, with an additional .rmtime column in case of repeated measurements; .time and .status for survival).
- outcome_name
Named vector with the original name(s) of the outcome variable(s).
- outcome_label
Named vector with the label(s) of the outcome variable(s). If the columns of outcome do not contain labels, the column name is used instead.
- outcome_mode
The outcome mode (regression, classification or survival). If a single column was specified as outcome, outcome_mode is guessed to be either classification or regression based on the class of the column.
- outcome_dict
Dictionary tibble for the outcome variable(s). If no label was provided for the selected columns, the column name will be reused as label in the dictionary.
- na_outcome
The IDs of NAs in outcome.
- id_outlier
The IDs of removed outliers.
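As a rough illustration of what na_outcome holds, the IDs of rows with a missing outcome can be collected in base R as follows. This is a sketch on a made-up data frame with a hypothetical outcome column y, not the package's own code.

```r
# Hypothetical outcome table; only the .id column name comes from the documentation.
outcome <- data.frame(
  .id = c("a", "b", "c", "d"),
  y = c(1.2, NA, 3.4, NA),
  stringsAsFactors = FALSE
)

# IDs whose outcome value is missing.
na_ids <- outcome$.id[is.na(outcome$y)]
```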
Details
Specification of outcome_name for survival analysis or repeated measurements:
For survival analysis, specify column names for 'time' and 'status' of the Surv object: c(.time = "<time-coln>", .status = "<status-coln>"),
where .time is numeric and .status is binary with 0 coding for censored, and 1 coding for event.
Currently, only right-censoring is supported.
For repeated measurements, specify outcome_name as c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>").
The outcome mode will be guessed as regression or classification according to the type of the column specified in .out.
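The two named vectors described above can be built and sanity-checked in base R. In this sketch the column names os_months, os_event, visit and score and the helper standardize() are hypothetical; only the vector names .time, .status, .rmtime and .out come from the documentation. The helper merely mimics the renaming to standardized column names and is not the package's implementation.

```r
# Named vectors in the documented shape (values are made-up column names).
surv_name <- c(.time = "os_months", .status = "os_event")
rm_name <- c(.rmtime = "visit", .out = "score")

# Hypothetical survival outcome table.
surv_data <- data.frame(
  .id = c("p1", "p2", "p3"),
  os_months = c(12.5, 30.1, 8.2),
  os_event = c(1, 0, 1),
  stringsAsFactors = FALSE
)

# Checks mirroring the stated requirements:
# .time must be numeric, .status must be coded 0 (censored) or 1 (event).
time_ok <- is.numeric(surv_data[[surv_name[[".time"]]]])
status_ok <- all(surv_data[[surv_name[[".status"]]]] %in% c(0, 1))

# Illustrative helper: keep .id plus the selected columns and
# rename them to the names of the outcome_name vector.
standardize <- function(data, outcome_name) {
  out <- data[, c(".id", unname(outcome_name)), drop = FALSE]
  names(out) <- c(".id", names(outcome_name))
  out
}

surv_std <- standardize(surv_data, surv_name)
```

With the package loaded, surv_name or rm_name would be passed as the outcome_name argument of prepare_ml_outcome.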
If outcome_name = NULL (default), the first column other than .id is used as the outcome
and the outcome mode is guessed from that column. Consequently, neither survival nor repeated measurements analysis is ever guessed.
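The default behaviour can be sketched in base R. The helper guess_outcome() is hypothetical, and the rule "numeric column means regression, anything else means classification" is an assumption made for illustration; the documentation only says the mode is guessed from the class of the column.

```r
# Illustrative sketch of the default: pick the first non-.id column
# and guess the mode from its class (assumed rule, see lead-in).
guess_outcome <- function(outcome) {
  name <- setdiff(names(outcome), ".id")[1]
  column <- outcome[[name]]
  mode <- if (is.numeric(column)) "regression" else "classification"
  list(outcome_name = name, outcome_mode = mode)
}

# Hypothetical inputs.
reg_data <- data.frame(.id = 1:3, weight = c(70.1, 82.5, 64.0))
cls_data <- data.frame(.id = 1:3, response = factor(c("yes", "no", "yes")))

reg_guess <- guess_outcome(reg_data)
cls_guess <- guess_outcome(cls_data)
```

Because this kind of guess only ever looks at a single column, a two-column specification such as the survival or repeated measurements vectors always has to be given explicitly.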