Prepares an ML-ready outcome data set (used in prepare_ml).
Usage
prepare_ml_outcome(
outcome,
outcome_name = NULL,
level_order = NULL,
outlier_remove = FALSE,
outlier_ctrl = list(coef = 3)
)
Arguments
- outcome
tibble containing an .id column and the outcome of interest.
- outcome_name
Single character string giving the name of the outcome for regression or classification. For survival analysis and for repeated measurements analysis (classification or regression), a named vector of length two needs to be specified: c(.time = "<time-coln>", .status = "<status-coln>") for survival and c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>") for repeated measurements. See the Details section.
- level_order
Level order for a classification outcome. NULL keeps the natural order (only used for classification).
- outlier_remove
Remove outliers in a regression outcome based on the 'boxplot definition'. The outlier coefficient can be modified in outlier_ctrl (only used for regression).
- outlier_ctrl
Control list for the outlier removal, if outlier_remove is TRUE. Currently, the list contains only the boxplot outlier coefficient coef, which defaults to 3.
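As a rough illustration of the 'boxplot definition' with coef = 3, the following sketch uses base R's boxplot.stats() to flag outliers. It assumes that the function's outlier rule matches boxplot.stats(), which the text above does not confirm; the data and the object names outcome_clean and out_values are made up for the example.

```r
# Hypothetical regression outcome; IDs and values are made up for illustration.
outcome <- data.frame(
  .id = paste0("id", 1:8),
  .out = c(10, 11, 12, 13, 14, 15, 16, 100),
  stringsAsFactors = FALSE
)

# Control list as described above: only the boxplot outlier coefficient.
outlier_ctrl <- list(coef = 3)

# Assumption: outliers are the values that boxplot.stats() reports, i.e. values
# lying more than coef times the hinge spread beyond the lower or upper hinge.
out_values <- grDevices::boxplot.stats(outcome$.out, coef = outlier_ctrl$coef)$out

# IDs of the flagged observations and the outcome with those rows removed.
id_outlier <- outcome$.id[outcome$.out %in% out_values]
outcome_clean <- outcome[!outcome$.out %in% out_values, ]
```

With a larger coef, fewer values are flagged; coef = 1.5 corresponds to the usual boxplot whiskers.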
Value
A list with the following entries:
- outcome
The outcome data set containing only the id and one or two columns with standardized column names (.out for regression or classification, with an additional .rmtime column in case of repeated measurements; .time and .status for survival).
- outcome_name
Named vector with the original name(s) of the outcome variable(s).
- outcome_label
Named vector with the label(s) of the outcome variable(s). If the columns of outcome do not contain labels, the column name is used instead.
- outcome_mode
The outcome mode (regression, classification or survival). If a single column was specified as outcome, outcome_mode is guessed to be either classification or regression based on the class of the column.
- outcome_dict
Dictionary tibble for the outcome variable(s). If no label was provided for the selected columns, the column name will be reused as label in the dictionary.
- na_outcome
The IDs of NAs in outcome.
- id_outlier
The IDs of removed outliers.
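To make the na_outcome entry concrete, here is a minimal base R sketch of how IDs with a missing outcome could be collected. It is an illustration under the assumption that missingness is checked on the standardized .out column; the data and the name outcome_complete are hypothetical.

```r
# Hypothetical outcome with two missing values; data are made up.
outcome <- data.frame(
  .id = c("a", "b", "c", "d"),
  .out = c(1.2, NA, 3.4, NA),
  stringsAsFactors = FALSE
)

# IDs whose outcome is NA (what na_outcome is described to hold).
na_outcome <- outcome$.id[is.na(outcome$.out)]

# Rows with an observed outcome.
outcome_complete <- outcome[!is.na(outcome$.out), ]
```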
Details
Specification of outcome_name for survival analysis or repeated measurements:
For survival analysis, specify column names for 'time' and 'status' of the Surv object: c(.time = "<time-coln>", .status = "<status-coln>"), where .time is numeric and .status is binary with 0 coding for censored and 1 coding for event. Currently, only right-censoring is supported.
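A survival specification could be assembled as in the sketch below. The column names os_months and os_event and the data are hypothetical, and the renaming step only mimics the standardized column names described in the Value section; it is not taken from the function's source.

```r
# Hypothetical survival data: follow-up time and event indicator (0 = censored, 1 = event).
outcome <- data.frame(
  .id = 1:4,
  os_months = c(5.2, 12.0, 3.1, 8.0),
  os_event = c(1, 0, 1, 0)
)

# Named vector of length two: names are the standardized names,
# values are the original column names.
outcome_name <- c(.time = "os_months", .status = "os_event")

# Check the requirements stated above: numeric time, binary 0/1 status.
stopifnot(
  is.numeric(outcome[[outcome_name[[".time"]]]]),
  all(outcome[[outcome_name[[".status"]]]] %in% c(0, 1))
)

# Keep .id plus the two selected columns and apply the standardized names.
std <- outcome[, c(".id", unname(outcome_name))]
names(std) <- c(".id", names(outcome_name))
```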
For repeated measurements, specify outcome_name as c('.rmtime' = "<timepoint-coln>", '.out' = "<endpoint-coln>"). The outcome mode will be guessed as regression or classification according to the type of the column specified in .out.
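The repeated measurements case might look as follows. The column names visit_week and response are made up, and guess_mode() is a hypothetical helper: the rule "numeric means regression, anything else means classification" is an assumption about how the type of .out could be mapped to a mode, not the package's documented logic.

```r
# Hypothetical long-format data: two visits per subject.
outcome <- data.frame(
  .id = rep(c("s1", "s2"), each = 2),
  visit_week = rep(c(0, 4), times = 2),
  response = factor(c("no", "yes", "no", "no")),
  stringsAsFactors = FALSE
)

# Named vector of length two for repeated measurements.
outcome_name <- c(".rmtime" = "visit_week", ".out" = "response")

# Keep .id plus the two selected columns and apply the standardized names.
std <- outcome[, c(".id", unname(outcome_name))]
names(std) <- c(".id", names(outcome_name))

# Assumed guessing rule (hypothetical helper, not from the package source):
# numeric columns suggest regression, everything else classification.
guess_mode <- function(x) if (is.numeric(x)) "regression" else "classification"
outcome_mode <- guess_mode(std$.out)
```

Here .out is a factor, so this sketch yields a classification mode; a numeric endpoint column would yield regression.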
If outcome_name = NULL (default), the first column that's not .id is chosen for outcome_name and the outcome mode is guessed accordingly. Thus, neither survival nor repeated measurement analysis will ever be guessed.
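The default selection can be pictured with the short sketch below, which picks the first column other than .id when outcome_name is NULL. The data are hypothetical and the code only mirrors the behaviour described above; it is not the function's actual implementation.

```r
# Hypothetical outcome tibble stand-in with two candidate columns.
outcome <- data.frame(
  .id = 1:3,
  age = c(34, 51, 47),
  sex = factor(c("f", "m", "f"))
)

outcome_name <- NULL

# With the default NULL, fall back to the first column that is not .id.
if (is.null(outcome_name)) {
  outcome_name <- setdiff(names(outcome), ".id")[1]
}

# The class of the selected column is what the mode guess would be based on.
selected_class <- class(outcome[[outcome_name]])
```

Because only a single column is ever selected this way, a survival or repeated measurements outcome has to be requested explicitly through a named vector of length two.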