MeasurementPhenotype Tutorial
The MeasurementPhenotype is used to handle any numerical values in real-world data (RWD). This includes observation results, such as height and weight, and blood lab tests, such as 'hemoglobin level'.
Numerical values in RWD sources are usually found in event-based tables, with each row recording a single measurement value for a single patient associated with a single date*. All numerical values are in a single 'measurement_value' column. A medical code is associated with each event, which indicates the type of numerical measurement recorded. Units of measurement are in an additional column.
PersonID | MedicalCode | EventDate | Value | Unit |
---|---|---|---|---|
1 | HbA1c | 2010-01-01 | 4.2 | % |
1 | HT | 2010-01-02 | 121 | cm |
2 | WT | 2010-01-01 | 130 | kg |
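As a rough illustration only (plain Python, not PhenEx code), the table above can be pictured as a list of event rows. The lowercase field names are hypothetical stand-ins for the columns shown, and the selection by medical code mirrors what a codelist does conceptually.

```python
from datetime import date

# Hypothetical in-memory version of the event table above (not a PhenEx object).
events = [
    {"person_id": 1, "medical_code": "HbA1c", "event_date": date(2010, 1, 1), "value": 4.2, "unit": "%"},
    {"person_id": 1, "medical_code": "HT", "event_date": date(2010, 1, 2), "value": 121.0, "unit": "cm"},
    {"person_id": 2, "medical_code": "WT", "event_date": date(2010, 1, 1), "value": 130.0, "unit": "kg"},
]

# The medical code tells us which kind of measurement each row holds.
height_rows = [row for row in events if row["medical_code"] == "HT"]
patients_with_height = sorted({row["person_id"] for row in height_rows})
print(patients_with_height)
```

Each row carries exactly one value, one date and one unit, which is the shape MeasurementPhenotype expects.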
MeasurementPhenotype is a subclass of CodelistPhenotype, inheriting all of its functionality to identify patients by single medical codes or sets of medical codes. For example, we can identify the patients with a LOINC code of '8480-6', meaning a measurement of systolic blood pressure was performed, within a specified time period; see the CodelistPhenotype tutorial for more information.
MeasurementPhenotype adds functionality for dealing with numeric values, such as:
- performing simple aggregations, such as mean or daily_mean
- identifying patients with a measurement value or aggregated measurement value within a value range and
- returning measurement values, either all measurement values, or the measurement value nearest/furthest from the anchor date
*If multiple dates are associated with an event, MeasurementPhenotype alone cannot be used. Either data cleaning operations must be performed on the input measurement table to resolve the multiple dates to a single date, or LogicPhenotype can be used in conjunction with a MeasurementPhenotype for each date in order to resolve this; see the tutorial on LogicPhenotype for more information.
After this tutorial, we will be able to answer the following questions:
- which patients had a measurement for systolic blood pressure recorded at any time in the data source?
- which patients had a measurement for systolic blood pressure recorded one year prior to index date?
- which patients had a measurement for systolic blood pressure recorded in units 'mmHg' one year prior to index date?
- which patients have one or more systolic blood pressure measurements greater than 200 mmHg recorded within one year prior to index date?
- which patients have one or more systolic blood pressure measurements between 120 and 160 mmHg recorded within one year prior to index date?
- which patients have a systolic blood pressure measurement greater than 200 mmHg, with systolic blood pressure defined as the mean of all SBP measurements in the baseline period (baseline period = 1 year pre index)?
- which patients have a systolic blood pressure measurement greater than 200 mmHg, with systolic blood pressure defined as the median of all values recorded on a single day?
- how can I see all measurements for systolic blood pressure recorded one year prior to index date?
- what is the value of the systolic blood pressure recorded nearest to the index date?
- what is the mean value of the systolic blood pressure recorded in the one year pre index period?
- what is the date and value of the first systolic blood pressure measurement recorded within the one year pre index period?
- what is the date and value of the first systolic blood pressure measurement recorded after the index date?
- what is the date and value of the systolic blood pressure measurements greater than 200 mmHg, with systolic blood pressure defined as median of values occurring on the same day?
- I see measurements > 300 mmHg in my dataset, which are obviously due to error. Which patients have a systolic blood pressure measurement greater than 200 mmHg, having removed SBP measurements > 300 mmHg?
Step 1 : Define CodelistPhenotype arguments
MeasurementPhenotype has all the functionality of CodelistPhenotype for identifying patients by codelists, time ranges in relation to an anchor, and categorical values in other columns. Visit the CodelistPhenotype tutorial for more information on these parameters. The only keyword argument of CodelistPhenotype that requires special attention is return_date, which we will discuss in detail below.
Just like CodelistPhenotype, the two minimum arguments are 'domain' and 'codelist'.
- Measurements in our case are recorded in the observation table, so we are using the 'observation' domain.
- We will need a Codelist for 'Systolic Blood Pressure'. This is a single code. We create a Codelist as follows; see the Codelist tutorial for more information on how to define codelists.
sbp_codelist = Codelist(
name_codelist='systolic_blood_pressure',
codes = '8480-6'
)
We can now make our first MeasurementPhenotypes!
Examples 1, 2 & 3
from phenex.phenotypes import MeasurementPhenotype
from phenex.core.constants import ONEYEAR_PREINDEX
# Ex.1
# which patients had a measurement for systolic blood pressure recorded any time in the data source
sbp1 = MeasurementPhenotype(
name = 'sbp_patients_any_time',
codelist = sbp_codelist,
domain = 'observation'
)
# Ex.2
# which patients had a measurement for systolic blood pressure recorded one year prior to index date?
sbp2 = MeasurementPhenotype(
name = 'sbp_patients_one_year_preindex',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX # we set a time_range_filter, exactly like for a CodelistPhenotype
)
# Ex.3
# which patients had a measurement for systolic blood pressure recorded in units 'mmHg' one year prior to index date?
sbp3 = MeasurementPhenotype(
name = 'sbp_patients_one_year_preindex_in_mmHg',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
categorical_filter = CategoricalFilter(allowed_values=['mmHg'], columnname='UNIT') # we set a categorical_filter to specify units
)
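To make the effect of the time range and the unit filter concrete, here is a minimal plain-Python sketch. It does not use PhenEx; the data, the index dates and the helper in_one_year_preindex are all hypothetical, and the exact window boundaries (365 days, index day excluded) are an assumption made for this sketch only.

```python
from datetime import date, timedelta

# Hypothetical SBP rows: (person_id, event_date, value, unit). All data is made up.
sbp_events = [
    (1, date(2019, 6, 1), 150.0, "mmHg"),
    (1, date(2017, 1, 1), 140.0, "mmHg"),  # more than one year before index
    (2, date(2019, 9, 1), 20.0, "kPa"),    # in the window, but not in mmHg
    (3, date(2020, 2, 1), 135.0, "mmHg"),  # after index
]
# Hypothetical index date per patient.
index_dates = {1: date(2020, 1, 1), 2: date(2020, 1, 1), 3: date(2020, 1, 1)}

def in_one_year_preindex(event_date, index_date):
    """Assumed window: the 365 days before the index date, index day excluded."""
    return index_date - timedelta(days=365) <= event_date < index_date

# Time range only (compare Ex.2), then time range plus unit restriction (compare Ex.3).
in_window = [row for row in sbp_events if in_one_year_preindex(row[1], index_dates[row[0]])]
patients_any_unit = sorted({row[0] for row in in_window})
patients_mmhg = sorted({row[0] for row in in_window if row[3] == "mmHg"})
print(patients_any_unit, patients_mmhg)
```

Patient 2 drops out only once the unit restriction is applied, which is the situation the categorical filter is meant to handle.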
Just like CodelistPhenotype, the MeasurementPhenotypes above return all patients that have a recorded event matching the CodelistPhenotype parameters we set.
Suggestion : use categorical_filter to restrict units if only a subset of units is allowed.
Step 2 : Define a value_filter
So far we have seen how to select patients that have any recorded event of the measurement type defined by our codelist. MeasurementPhenotype allows further selection of patients based on the measurement value. To do this, we set the value_filter keyword argument to either a single threshold value or an allowed value range.
Using the value_filter of MeasurementPhenotype, I can ask questions such as 'which patients had a measurement greater than 200 in the pre index period?'.
Output tables return only patient ids that fulfill our CodelistPhenotype criteria and are within the ranges defined by our value_filter. Unless return_value is defined, only patient_ids are returned (one row per patient).
# Ex.4
# which patients have one or more systolic blood pressure measurements greater than 200 mmHg recorded within one year prior to index date?
sbp4 = MeasurementPhenotype(
name = 'sbp_preindex_measurements_ge200',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
value_filter = GreaterThanOrEqualTo(200) # note: this threshold is inclusive, i.e. it matches values >= 200
)
# Ex.5
# which patients have one or more systolic blood pressure measurements between 120 and 160 mmHg recorded within one year prior to index date?
sbp5 = MeasurementPhenotype(
name = 'sbp_preindex_measurements_between_120_160',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
value_filter = GreaterThanOrEqualTo(120) & LessThanOrEqualTo(160) # we can use logical operations on ValueFilters to define ranges
)
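Conceptually, a value filter keeps a patient if at least one of their measurements passes the threshold or falls in the allowed range. The sketch below shows that logic in plain Python; it is not how PhenEx implements value_filter, and the function patients_with_any_value and the data are hypothetical.

```python
# Hypothetical SBP values per patient within the time window (mmHg).
sbp_values = {1: [150.0, 210.0], 2: [130.0, 145.0], 3: [200.0], 4: [110.0]}

def patients_with_any_value(values_by_patient, lower=None, upper=None):
    """Patients with at least one value satisfying lower <= value <= upper (inclusive bounds)."""
    selected = []
    for person_id, values in values_by_patient.items():
        if any(
            (lower is None or v >= lower) and (upper is None or v <= upper)
            for v in values
        ):
            selected.append(person_id)
    return sorted(selected)

# Single inclusive threshold (compare Ex.4) and an inclusive range (compare Ex.5).
at_least_200 = patients_with_any_value(sbp_values, lower=200)
between_120_and_160 = patients_with_any_value(sbp_values, lower=120, upper=160)
print(at_least_200, between_120_and_160)
```

Note that patient 3, whose only value is exactly 200, is selected because the bound is inclusive.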
Step 3 : Define value_aggregation
So far, the value_filters we have seen filter on the entries recorded directly in the event tables. However, it is common to want to filter on some aggregation of the data found in our event tables. In these cases, we can think of the measurement events in our data as 'raw' values that do not directly reflect the definition of the measurement we are interested in. For example,
- Blood pressure fluctuates very rapidly and changes over time; I therefore do not trust a single value, or even a handful of values. I may want to define 'hypertension' not as a single measurement event of systolic blood pressure greater than 160; instead, I may want to perform value filtering on the 'mean systolic blood pressure in the one year pre index period'. Notice that this requires an aggregation of the 'raw' event-based data in the one year pre-index period, meaning that our definition of 'systolic blood pressure' is not that recorded in the raw data, but rather the 'mean SBP in the one year pre index period'. After this aggregation is performed, we then want to perform the value filtering to find those with a 'mean SBP in the one year pre index period' > 160.
- Another common issue is that we often see duplicated entries for a lab measurement performed on the same day. For instance, a systolic blood pressure measurement may be recorded 10 times on one day. This is often an issue of data quality, and it is suggested to have pipelines to de-duplicate values. However, we can use MeasurementPhenotype to perform this de-duplication for us. In essence, we create a new definition for systolic blood pressure which could be called 'daily_median_systolic_blood_pressure', and then perform further value filtering on this new aggregated value.
In order to perform value aggregation, we use the value_aggregation keyword argument. The options for value_aggregation are the obvious mean, median, min and max, which perform the named aggregation on all values selected by the CodelistPhenotype arguments, i.e. codelist criteria, time_range_filters and categorical_filters.
In addition to mean, median, min and max, we also have the daily variants of these aggregations (for example daily_mean and daily_median), which aggregate the values recorded on the same day for each patient.
# Ex.6
# which patients have a systolic blood pressure measurement greater than 200 mmHg,
# with systolic blood pressure defined as the mean of all SBP measurements in the baseline period (baseline period = 1 year pre index)?
sbp6 = MeasurementPhenotype(
name = 'sbp_mean_baseline_ge200',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
value_aggregation = 'mean',
value_filter = GreaterThanOrEqualTo(200),
)
# Ex.7
# which patients have a systolic blood pressure measurement greater than 200 mmHg,
# with systolic blood pressure defined as the median of all values recorded on a single day?
sbp7 = MeasurementPhenotype(
name = 'sbp_daily_median_ge200',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
value_aggregation = 'daily_median',
value_filter = GreaterThanOrEqualTo(200),
)
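The difference between a period aggregation and a daily aggregation can be shown with a few lines of plain Python. This is only a sketch of the idea using the standard library, not PhenEx internals, and the event data is made up.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, median

# Hypothetical SBP events for one patient: (event_date, value in mmHg).
events = [
    (date(2019, 3, 1), 190.0),
    (date(2019, 3, 1), 210.0),
    (date(2019, 3, 1), 230.0),
    (date(2019, 8, 15), 150.0),
]

# Like value_aggregation = 'mean': one number for the whole period, no date attached.
period_mean = mean(value for _, value in events)

# Like value_aggregation = 'daily_median': one number per calendar day, date kept.
values_by_day = defaultdict(list)
for event_date, value in events:
    values_by_day[event_date].append(value)
daily_median = {day: median(values) for day, values in values_by_day.items()}

# The value filter is applied after aggregation, so the two definitions can disagree.
mean_ge_200 = period_mean >= 200
any_daily_median_ge_200 = any(v >= 200 for v in daily_median.values())
print(period_mean, daily_median, mean_ge_200, any_daily_median_ge_200)
```

Here the same patient fails the threshold under the period mean but passes it under the daily median, which is why the choice of aggregation is part of the phenotype definition.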
Step 4 : Define return_value
By default, MeasurementPhenotype only returns patient ids. However, MeasurementPhenotype is also able to return the values associated with a measurement event. To do this, we must define the return_value keyword argument. The options for return_value are "all", "first" and "last".
Setting return_value to 'all', we can see all values for all patients with a measurement that fulfills our phenotype criteria. Note that 'all' can result in multiple rows per patient.
We can also return only the first or last value in our time range using 'first' or 'last'. In a pre-index time range, 'last' is the value closest to the anchor date.
Note : if value_aggregation is set to mean, median, max or min, the concepts of 'first' and 'last' are nonsensical and are not allowed! A mean over a period no longer has a date associated with it. However, first and last can be used if daily aggregations are used.
Note : if return_date is set, it must be equal to the return_value parameter!
# Ex.8
# how can I see all measurements for systolic blood pressure recorded one year prior to index date?
sbp8 = MeasurementPhenotype(
name = 'sbp_all_measurements_one_year_preindex',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
return_value = 'all' # this will return all values within the year prior to index
)
# Ex.9
# what is the value of the systolic blood pressure recorded nearest to the index date?
sbp9 = MeasurementPhenotype(
name = 'sbp_closest_to_index',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
return_value = 'last' # this will return value nearest to the index date
)
# Ex.10
# what is the mean value of the systolic blood pressure recorded in the one year pre index period?
sbp10 = MeasurementPhenotype(
name = 'sbp_mean_one_year_preindex',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
value_aggregation = 'mean',
return_value = 'all' # this must be all, as the value_aggregation of 'mean' removes the concept of a date to the measurement
)
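The three return_value options can be pictured as different selections from the date-ordered events of a patient. The following is a plain-Python sketch with made-up data, not PhenEx code.

```python
from datetime import date

# Hypothetical pre-index SBP events for one patient: (event_date, value in mmHg).
events = [
    (date(2019, 11, 20), 142.0),
    (date(2019, 2, 3), 155.0),
    (date(2019, 7, 9), 138.0),
]

ordered = sorted(events)  # tuples sort by date first

# Like return_value = 'all': every value, so possibly several rows per patient.
all_values = [value for _, value in ordered]

# Like 'first' and 'last': the earliest and latest event in the time range.
first_date, first_value = ordered[0]
last_date, last_value = ordered[-1]
print(all_values, first_value, last_value)
```

Because the window here lies entirely before the index date, the last event is also the one nearest to the index date.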
Step 5: Define return_date
As mentioned earlier, return_date is the only keyword argument that requires special consideration. The return_date, if defined, must be equal to the return_value parameter.
# Ex.11
# what is the date and value of the first systolic blood pressure measurement recorded within the one year pre index period?
sbp11 = MeasurementPhenotype(
name = 'sbp_date_and_value_furthest_from_index',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
return_date = 'first',
return_value = 'first'
)
# Ex.12
# what is the date and value of the first systolic blood pressure measurement recorded after the index date?
sbp12 = MeasurementPhenotype(
name = 'sbp_date_and_value_first_post_index_measurement',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_POSTINDEX, # ONEYEAR_POSTINDEX must be imported first, presumably like ONEYEAR_PREINDEX
return_value = 'first', # notice that if multiple values exist on the same day, this will return multiple rows per patient
return_date = 'first'
)
# Ex.13
# what is the date and value of the systolic blood pressure measurements greater than 200 mmHg,
# with systolic blood pressure defined as median of values occurring on the same day?
sbp13 = MeasurementPhenotype(
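name = 'sbp_date_and_value_daily_median_ge200',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
value_aggregation = 'daily_median',
value_filter = GreaterThanOrEqualTo(200), # the threshold from the question, applied to the daily medians
return_value = 'first',
return_date = 'first' # return_date must be equal to return_value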
)
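As noted in Ex.12, asking for the 'first' measurement can still yield several rows per patient when multiple values share the earliest date. The plain-Python sketch below illustrates that tie behaviour; the helper first_rows and the data are hypothetical and do not reflect PhenEx internals.

```python
from datetime import date

# Hypothetical post-index SBP events per patient: (event_date, value in mmHg).
events_by_patient = {
    1: [(date(2020, 3, 1), 128.0), (date(2020, 3, 1), 134.0), (date(2020, 5, 2), 130.0)],
    2: [(date(2020, 4, 10), 145.0)],
}

def first_rows(events):
    """All (date, value) rows sharing the earliest date; ties are kept, not collapsed."""
    earliest = min(event_date for event_date, _ in events)
    return [(d, v) for d, v in events if d == earliest]

first_by_patient = {pid: first_rows(ev) for pid, ev in events_by_patient.items()}
print(first_by_patient)
```

If one row per patient is required, a daily aggregation such as daily_median collapses same-day values before the first date is selected.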
Step 6: Define cleaning value filters
RWD sources are often quite messy; measurement values are often manually entered and thus we see typos and obviously faulty data in our measurement tables. MeasurementPhenotype allows us to ignore obviously faulty data using the clean_nonphysiologicals_value_filter. This filter works prior to value_aggregation, so that obviously erroneous data does not enter our value aggregation and final results. First define what physiological thresholds look like. Take care when defining these physiological thresholds as RWD sources, while messy, are also very large; rare physiological outliers are therefore 'common'.
# Ex.14
# I see measurements > 300mmHg in my dataset, which are obviously due to error.
# Which patients have a systolic blood pressure measurement greater than 200 mmHg, having removed SBP measurements >300mmHg?
sbp14 = MeasurementPhenotype(
name = 'sbp_ge200_removing_nonphysiologicals',
codelist = sbp_codelist,
domain = 'observation',
relative_time_range = ONEYEAR_PREINDEX,
clean_nonphysiologicals_value_filter = GreaterThanOrEqualTo(300),
value_filter = GreaterThanOrEqualTo(200),
)
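Why cleaning has to happen before aggregation is easiest to see with numbers. The sketch below uses plain Python with made-up values; the 300 mmHg cut-off is the one from the example question, and the filtering shown is an illustration of the idea, not PhenEx's implementation.

```python
from statistics import mean

# Hypothetical raw SBP values for one patient (mmHg); 1900 looks like a typo for 190.
raw_values = [180.0, 190.0, 1900.0]

# Assumption for this sketch: values above 300 mmHg are treated as nonphysiological.
cleaned_values = [v for v in raw_values if v <= 300]

mean_without_cleaning = mean(raw_values)
mean_with_cleaning = mean(cleaned_values)

# Only the uncleaned mean crosses the 200 mmHg threshold.
flagged_without_cleaning = mean_without_cleaning >= 200
flagged_with_cleaning = mean_with_cleaning >= 200
print(mean_with_cleaning, flagged_without_cleaning, flagged_with_cleaning)
```

A single erroneous entry is enough to push the patient over the threshold, so removing it first changes who is selected.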
Cheat Sheet
- Are there 'nonsense' raw measurement values outside the physiological range, for example due to measurement error? yes = set clean_nonphysiologicals_value_filter
- Do I want to aggregate raw values, for example perform the daily median operation, or the mean of all values in the time period? yes = set value_aggregation
- Do I want to set value thresholds or allowed ranges? yes = set value_filter
- Do I want to return a value occurring on a specific day? yes = set return_date and return_value
- Do I want to return all values? yes = set return_value to 'all'