MeasurementPhenotype Tutorial¶
The MeasurementPhenotype is used to handle any numerical values in real world data. This includes observation results, such as height and weight, or blood lab tests, such as 'hemoglobin level'.
Numerical values in RWD sources are usually found in event-based tables, with each row recording a single measurement value for a single patient associated with a single date*. All numerical values are in a single 'measurement_value' column. A medical code is associated with each event, which indicates the type of numerical measurement recorded. Units of measurement are in an additional column.
PersonID | MedicalCode | EventDate | Value | Unit |
---|---|---|---|---|
1 | HbA1c | 2010-01-01 | 4.2 | % |
1 | HT | 2010-01-02 | 121 | cm |
2 | WT | 2010-01-01 | 130 | kg |
MeasurementPhenotype is a subclass of CodelistPhenotype, inheriting all of its functionality to identify patients by single or sets of medical codes. For example, we can identify the patients with a Loinc code of '8480-6', meaning a measurement of systolic blood pressure was performed, within a specified time period; see the CodelistPhenotype tutorial for more information.
MeasurementPhenotype adds additional functionality dealing with numeric values, such as :
- performing simple aggregations, such as mean or daily_mean
- identifying patients with a measurement value or aggregated measurement value within a value range and
- returning measurement values, either all measurements values, or the measurment value nearest/furthest from the anchor date
*if multiple dates are associated with an event, MeasurementPhenotype alone cannot be used. Either data cleaning operations must be performed to the input measurement table to resolve the multiple dates to a single date, or LogicPhenotype can be used in conjunction with a MeasurementPhenotype for each date in order to resolve this; see the tutorial on LogicPhenotype for more information.
After this tutorial, we will be able to answer the following questions :
- which patients had a measurement for systolic blood pressure recorded any time in the data source
- which patients had a measurement for systolic blood pressure recorded one year prior to index date?
- which patients had a measurement for systolic blood pressure recorded in units 'mmHg' one year prior to index date?
- which patients have one or more systolic blood pressure measurements greater than 200 mmHg recorded within one year prior to index date?
- which patients have one or more systolic blood pressure measurements between 120 and 160 mmHg recorded within one year prior to index date?
- How can I see all measurements for systolic blood pressure recorded one year prior to index date?
- How can I see all measurements for systolic blood pressure recorded one year prior to index date less than 180 mmHg?
- What is the patient mean systolic blood pressure in the baseline period?
- What is the patient daily mean systolic blood pressure?
- what is value of the systolic blood pressure recorded nearest to the index date?
- what is mean value of the systolic blood pressure recorded in the one year pre index period?
- I see measurements > 300mmHg in my dataset, which are obviously due to error. Which patients have a systolic blood pressure measurements greater than 200 mmHg, having removed SBP measurements >300mmHg?
For the purposes of this tutorial we will be using dummy generated data. For our first set of examples, we have five patients. The index date is hard coded as January 1, 2020 for all patients. Look at the patients below
from dummy_measurement_data import create_ex1_dummy_measurement_mapped_tables
# %pip install -e /Users/ahartens/src/PhenEx
import ibis
ibis.options.interactive = True
mapped_tables = create_ex1_dummy_measurement_mapped_tables()
mapped_tables['MEASUREMENT']
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ MEASUREMENT_TYPE_CONCEPT_ID ┃ VALUE_AS_NUMBER ┃ UNIT ┃ MEASUREMENT_DATE ┃ INDEX_DATE ┃ EVENT_DATE ┃ CODE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ string │ date │ date │ date │ string │ int64 │ ├───────────┼─────────────────────────────┼─────────────────┼────────┼──────────────────┼────────────┼────────────┼────────┼───────┤ │ P0 │ SBP │ 120 │ Hg │ 2019-12-02 │ 2020-01-01 │ 2019-12-02 │ SBP │ 120 │ │ P1 │ SBP │ 120 │ mmHg │ 2020-01-31 │ 2020-01-01 │ 2020-01-31 │ SBP │ 120 │ │ P2 │ SBP │ 160 │ mmHg │ 2019-12-02 │ 2020-01-01 │ 2019-12-02 │ SBP │ 160 │ │ P3 │ SBP │ 160 │ mmHg │ 2020-01-31 │ 2020-01-01 │ 2020-01-31 │ SBP │ 160 │ │ P4 │ DBP │ 80 │ mmHg │ 2019-12-02 │ 2020-01-01 │ 2019-12-02 │ DBP │ 80 │ └───────────┴─────────────────────────────┴─────────────────┴────────┴──────────────────┴────────────┴────────────┴────────┴───────┘
Step 1 : Define CodelistPhenotype arguments¶
MeasurementPhenotype has all the functionality of CodelistPhenotype for identifying patients by codelists, time ranges in relation to an anchor, and categorical values in other columns. Visit the CodelistPhenotype tutorial for more information on these parameters. The only keyword argument of CodelistPhenotype that requires special attention is return_date, which we will discuss in detail below.
Just like CodelistPhenotype, the two minimum arguments are 'domain' and 'codelist'.
- Measurements in our case are recorded in the observation table, so we are using the 'observation' domain.
- We will need a Codelist for 'Systolic Blood Pressure'. This is a single code. We create a Codelist as follows; see the Codelist tutorial for more information on how to define codelists.
from phenex.codelists import Codelist
sbp_codelist = Codelist(
name='systolic_blood_pressure',
codelist = 'SBP'
)
Let's also make necessary imports for the following tutorial, and also define some constants we will use throughout such as one year pre and post index. In order to run the following phenotypes, you will also have to create a database connector and define mapped tables. We are using the OMOP mapped tables for these examples. See the PhenEx Study Tutorial for more information on how to connect to a database.
from phenex.phenotypes import MeasurementPhenotype
from phenex.filters import (
RelativeTimeRangeFilter,
GreaterThan,
GreaterThanOrEqualTo,
LessThan,
LessThanOrEqualTo,
CategoricalFilter,
ValueFilter,
)
from phenex.aggregators import (
DailyMean,
DailyMedian,
Mean,
)
ONEYEAR_PREINDEX = RelativeTimeRangeFilter(
min_days = GreaterThanOrEqualTo(0),
max_days = LessThan(365),
when = "before"
)
ONEYEAR_POSTINDEX = RelativeTimeRangeFilter(
min_days = GreaterThanOrEqualTo(0),
max_days = LessThan(365),
when = "after"
)
We can now make our first MeasurementPhenotype! Let's get all SBP measurements in the database.
eg 1 : Measurement name¶
Which patients had a measurement for systolic blood pressure recorded any time in the data source?
sbp1 = MeasurementPhenotype(
name = 'sbp_patients_any_time',
codelist = sbp_codelist,
domain = 'MEASUREMENT'
)
sbp1.execute(mapped_tables)
2025-05-16 16:20:17,719 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_patients_any_time': executing... 2025-05-16 16:20:17,750 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_patients_any_time': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P0 │ True │ 2019-12-02 │ 120 │ │ P1 │ True │ 2020-01-31 │ 120 │ │ P2 │ True │ 2019-12-02 │ 160 │ │ P3 │ True │ 2020-01-31 │ 160 │ └───────────┴─────────┴────────────┴───────┘
Executing the phenotype, we get the PhenotypeTable. This contains all patients that fulfill the criteria of our phenotype. In this case, it is all patients that have a systolic blood pressure recorded at any time. We see that patients 0-3 are included, and the fourth patient, who has no SBP and only a DBP recorded is not included in the phenotype output.
eg 2 : Relative time range¶
Which patients had a measurement for systolic blood pressure recorded one year prior to index date?
sbp2 = MeasurementPhenotype(
name = 'sbp_patients_on_year_preindex',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX # we set a time_range_filter, exactly like for a CodelistPhenotype
)
sbp2.execute(mapped_tables)
2025-05-16 16:20:18,197 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_patients_on_year_preindex': executing... 2025-05-16 16:20:18,232 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_patients_on_year_preindex': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P0 │ True │ 2019-12-02 │ 120 │ │ P2 │ True │ 2019-12-02 │ 160 │ └───────────┴─────────┴────────────┴───────┘
We see that patients 0 and 2 remain in this phenotype's output as they have preindex measuremnts, while patients 1 and 3 have been removed, as their measurement was recorded post index (recall that index_date is hardcoded as Jan 1, 2020)
eg 3 : Categorical values e.g. units¶
Which patients had a measurement for systolic blood pressure recorded in units 'mmHg' one year prior to index date? As MeasurementPhenotype has all the functionality of CodelistPhenotype, it also supports categorical filtering. Like CodelistPhenotype, this can be used to require conditions such as 'occurred during an inpatient visit'. This is especially useful in the context of the MeasurementPhenotype for requiring measurement units. Here we create a categorical filter requiring the units be in 'mmHg'
sbp3 = MeasurementPhenotype(
name = 'sbp_patients_on_year_preindex_in_mmHg',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
categorical_filter = CategoricalFilter(
allowed_values=['mmHg'],
column_name='UNIT'
)
)
sbp3.execute(mapped_tables)
2025-05-16 16:20:18,838 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_patients_on_year_preindex_in_mmHg': executing... 2025-05-16 16:20:18,873 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_patients_on_year_preindex_in_mmHg': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P2 │ True │ 2019-12-02 │ 160 │ └───────────┴─────────┴────────────┴───────┘
Notice that only P2 remains, as P0 had units recorded in 'Hg', not 'mmHg'.
Suggestion : use categorical_filter to define units, if only some subset of units are allowed.
Step 2 : Define a value_filter¶
Till now we have seen how to select patients that have any recorded event of the measurement type defined by our codelist. MeasurementPhenotype allows further selection of patients based on the measurement value. To do this, we define the value_filter keyword argument to either a single threshold value or to an allowed value range.
Using the value_filter of MeasurementPhenotype, we can ask questions such as 'which patients had a measurement greater than 130 in the pre index period?'.
Output tables return only patient ids that fulfill our CodelistPhenotype criteria and are within the ranges defined by our value_filter. Unless return_value is defined, only patient_ids are returned (one row per patient).
eg 4 : Value threshold¶
which patients have one or more systolic blood pressure measurements greater than 200 mmHg recorded within one year prior to index date?
sbp4 = MeasurementPhenotype(
name = 'sbp_preindex_measurements_ge130',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
value_filter = ValueFilter(
min_value = GreaterThanOrEqualTo(130),
column_name = "VALUE"
)
)
sbp4.execute(mapped_tables)
2025-05-16 16:20:20,273 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_preindex_measurements_ge130': executing... 2025-05-16 16:20:20,288 - phenex.phenotypes.measurement_phenotype - DEBUG - Applying null filtering for sbp_preindex_measurements_ge130 /Users/ahartens/.pyenv/versions/3.12.7/envs/phenex2/lib/python3.12/site-packages/ibis/expr/types/relations.py:685: FutureWarning: Selecting/filtering arbitrary expressions in `Table.__getitem__` is deprecated and will be removed in version 10.0. Please use `Table.select` or `Table.filter` instead. warnings.warn( 2025-05-16 16:20:20,302 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_preindex_measurements_ge130': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P2 │ True │ 2019-12-02 │ 160 │ └───────────┴─────────┴────────────┴───────┘
eg 5 : Value range¶
which patients have one or more systolic blood pressure measurements between 120 and 160 mmHg recorded within one year prior to index date?
sbp5 = MeasurementPhenotype(
name = 'sbp_preindex_measurements_between_100_130',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
value_filter = ValueFilter(
min_value = GreaterThanOrEqualTo(100),
max_value = LessThanOrEqualTo(130),
column_name = "VALUE"
)
)
sbp5.execute(mapped_tables)
2025-05-16 16:20:21,142 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_preindex_measurements_between_100_130': executing... 2025-05-16 16:20:21,154 - phenex.phenotypes.measurement_phenotype - DEBUG - Applying null filtering for sbp_preindex_measurements_between_100_130 /Users/ahartens/.pyenv/versions/3.12.7/envs/phenex2/lib/python3.12/site-packages/ibis/expr/types/relations.py:685: FutureWarning: Selecting/filtering arbitrary expressions in `Table.__getitem__` is deprecated and will be removed in version 10.0. Please use `Table.select` or `Table.filter` instead. warnings.warn( 2025-05-16 16:20:21,168 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_preindex_measurements_between_100_130': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P0 │ True │ 2019-12-02 │ 120 │ └───────────┴─────────┴────────────┴───────┘
Step 3 : Define value_aggregation¶
Till now, the value_filters we have seen perform filtering on the entries recorded directly in the event tables. Indeed, the generated dummy data we have been using has a single measurement per patient. However, we usually have many values recorded for each patient, on different days and on the same day. It is common to want to perform filtering on some aggregation of the data found in our event tables. In these cases, we can think of the measurement events in our data as 'raw' values that do not directly actually reflect the definition of a measurement we are interested in. For example,
- Blood Pressure fluctuates very rapidly and changes over time; we therefore do not trust a single, or even multiple values. We may want to define 'hypertension' as not just a single measurement event of systolic blood pressure greater than 160; instead, we may want want to perform value filtering on the 'mean systolic blood pressure in the one year pre index period'. Notice that this requires an aggregation of the 'raw' event based data in the one year pre-index period, meaning that our definition of 'systolic blood pressure' is not that recorded in the raw data, but rather the 'mean SBP in the one year pre index period'. After this aggregation is performed, we then want to perform the value filtering to find those with a 'mean SBP in the one year pre index period' > 160.
- Another common issue is that we often see duplicated entries for a lab measurement performed on the same day. For instance, a systolic blood pressure measurement may be performed 10 times on one day. This is often an issue of data quality, and it is suggested to have pipelines to de-duplicate values. However, we can use MeasurementPhenotype to perform this de-duplication for us. In essence, we create a new definition for systolic blood pressured which could be called 'daily_median_systolic_blood_pressure', and then perform further value_filtering on this new aggregated value.
In order to perform value aggregation, we use the value_aggregation keyword argument. The options for value_aggregation are the obvious mean, median, min and max, which perform the named aggregation on all values defined by the CodelistPhenotype arguments i.e. codelist criteria, time_range_filters and categorical_filters.
In addition to mean, median, min and max, we also have the options of daily mean, median, min and max, which will return the daily means
Lets first look at a new generated dummy data set; There are only two patients, P5 and P6. Each has 3 SBP measurements recorded, P5 has these three on the same day, while patient 6 has them on three different days.
from dummy_measurement_data import create_ex2_dummy_measurement_mapped_tables
mapped_tables = create_ex2_dummy_measurement_mapped_tables()
mapped_tables['MEASUREMENT']
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ MEASUREMENT_TYPE_CONCEPT_ID ┃ VALUE_AS_NUMBER ┃ UNIT ┃ MEASUREMENT_DATE ┃ INDEX_DATE ┃ EVENT_DATE ┃ CODE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ string │ string │ int64 │ string │ date │ date │ date │ string │ int64 │ ├───────────┼─────────────────────────────┼─────────────────┼────────┼──────────────────┼────────────┼────────────┼────────┼───────┤ │ P5 │ SBP │ 115 │ mmHg │ 2019-12-02 │ 2020-01-01 │ 2019-12-02 │ SBP │ 115 │ │ P5 │ SBP │ 120 │ mmHg │ 2019-12-02 │ 2020-01-01 │ 2019-12-02 │ SBP │ 120 │ │ P5 │ SBP │ 190 │ mmHg │ 2019-12-02 │ 2020-01-01 │ 2019-12-02 │ SBP │ 190 │ │ P6 │ SBP │ 115 │ mmHg │ 2019-12-03 │ 2020-01-01 │ 2019-12-03 │ SBP │ 115 │ │ P6 │ SBP │ 120 │ mmHg │ 2019-12-02 │ 2020-01-01 │ 2019-12-02 │ SBP │ 120 │ │ P6 │ SBP │ 190 │ mmHg │ 2019-12-01 │ 2020-01-01 │ 2019-12-01 │ SBP │ 190 │ └───────────┴─────────────────────────────┴─────────────────┴────────┴──────────────────┴────────────┴────────────┴────────┴───────┘
eg 6 : Return all recorded values (many rows per patient)¶
How can I see all measurements for systolic blood pressure recorded one year prior to index date?
By default, PhenEx returns all values recorded, with multiple rows per patient returned. This means that all values recorded in the time range will be returned.
sbp6 = MeasurementPhenotype(
name = 'sbp_mean_baseline',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
)
sbp6.execute(mapped_tables)
2025-05-16 16:20:25,110 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_mean_baseline': executing... 2025-05-16 16:20:25,144 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_mean_baseline': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P5 │ True │ 2019-12-02 │ 115 │ │ P5 │ True │ 2019-12-02 │ 120 │ │ P5 │ True │ 2019-12-02 │ 190 │ │ P6 │ True │ 2019-12-03 │ 115 │ │ P6 │ True │ 2019-12-02 │ 120 │ │ P6 │ True │ 2019-12-01 │ 190 │ └───────────┴─────────┴────────────┴───────┘
eg 7 : Return all recorded values (many rows per patient) within a value range¶
How can I see all measurements for systolic blood pressure recorded one year prior to index date less than 180 mmHg?
Setting ValueFilters works as expected with duplicate values; all values within the value threshold are returned, with multiple rows per patient
sbp7 = MeasurementPhenotype(
name = 'sbp_mean_baseline',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
value_filter=ValueFilter(
max_value=LessThan(180)
)
)
sbp7.execute(mapped_tables)
2025-05-16 16:20:26,338 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_mean_baseline': executing... 2025-05-16 16:20:26,351 - phenex.phenotypes.measurement_phenotype - DEBUG - Applying null filtering for sbp_mean_baseline /Users/ahartens/.pyenv/versions/3.12.7/envs/phenex2/lib/python3.12/site-packages/ibis/expr/types/relations.py:685: FutureWarning: Selecting/filtering arbitrary expressions in `Table.__getitem__` is deprecated and will be removed in version 10.0. Please use `Table.select` or `Table.filter` instead. warnings.warn( 2025-05-16 16:20:26,367 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_mean_baseline': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P5 │ True │ 2019-12-02 │ 115 │ │ P5 │ True │ 2019-12-02 │ 120 │ │ P6 │ True │ 2019-12-03 │ 115 │ │ P6 │ True │ 2019-12-02 │ 120 │ └───────────┴─────────┴────────────┴───────┘
eg 8 : Whole time period value aggregation (single row per patient)¶
What is the patient mean systolic blood pressure in the baseline period?
sbp8 = MeasurementPhenotype(
name = 'sbp_mean_baseline',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
value_aggregation = Mean(),
)
sbp8.execute(mapped_tables)
2025-05-16 16:20:27,636 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_mean_baseline': executing... 2025-05-16 16:20:27,671 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_mean_baseline': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ string │ boolean │ date │ float64 │ ├───────────┼─────────┼────────────┼────────────┤ │ P6 │ True │ NULL │ 141.666667 │ │ P5 │ True │ NULL │ 141.666667 │ └───────────┴─────────┴────────────┴────────────┘
Here we see that there is a single row per patient, as all values within the specified time period (one year pre-index) have been averaged.
notice that value_aggregation occurs prior to value filtering! This is desired behavior, as if the mean over a time period is requested, it is nonsensical to remove values not within the threshold. See the clean_nonphysiologicals_value_filter below to set a threshold prior to value aggregation, if desired.
eg 9 : Daily value aggregation (many rows per patient)¶
What is the patient daily mean systolic blood pressure?
sbp9 = MeasurementPhenotype(
name = 'sbp_daily_median_ge200',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
value_aggregation = DailyMean(),
)
sbp9.execute(mapped_tables)
2025-05-16 16:20:29,541 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_daily_median_ge200': executing... 2025-05-16 16:20:29,578 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_daily_median_ge200': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ string │ boolean │ date │ float64 │ ├───────────┼─────────┼────────────┼────────────┤ │ P6 │ True │ NULL │ 115.000000 │ │ P6 │ True │ NULL │ 120.000000 │ │ P6 │ True │ NULL │ 190.000000 │ │ P5 │ True │ NULL │ 141.666667 │ └───────────┴─────────┴────────────┴────────────┘
Here we see that P5 has a single row, as all measurements were performed on the same day. P6 has three values performed on three different days, so we see three rows for the three different days.
Step 3 : Define return_date¶
As seen in example 6, MeasurementPhenotype often returns multiple rows per patient, depending on any value aggregation is performed or not. If no aggregation or daily aggregation is performed, there are multiple rows per patient returned. Only if a time range aggregator is used (mean, median, max, min) is there a single row per patient.
It is also common to want to see the value closest to our index date of interest. In order to do this, we use the return_date keyword argument. Options are 'first', 'last' and 'all'(default)
Note : if value_aggregation is set to mean, median, max or min (i.e. a time range aggregator), the concept of 'first' and 'last' are nonsensical and an error will be thrown! A mean over a period means no date exists any more. However, first and last can be used if daily aggregations are used (daily mean, median, max, min)
eg 10 : Return all values nearest prior to index (many rows per patient)¶
what is value of the systolic blood pressure recorded nearest to the index date?
sbp10 = MeasurementPhenotype(
name = 'sbp_closest_to_index',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
return_date = 'last' # this will return value nearest to the index date (last if prior to index, first if post index)
)
sbp10.execute(mapped_tables)
2025-05-16 16:20:32,221 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_closest_to_index': executing... 2025-05-16 16:20:32,264 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_closest_to_index': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━┩ │ string │ boolean │ date │ int64 │ ├───────────┼─────────┼────────────┼───────┤ │ P5 │ True │ 2019-12-02 │ 115 │ │ P5 │ True │ 2019-12-02 │ 120 │ │ P5 │ True │ 2019-12-02 │ 190 │ │ P6 │ True │ 2019-12-03 │ 115 │ └───────────┴─────────┴────────────┴───────┘
Here we see that P5 still has multiple rows, as three measurements were performed on the date closest to index
eg 11 : Return daily aggregation nearest prior to index (single row per patient)¶
what is mean value of the systolic blood pressure recorded in the one year pre index period?
sbp11 = MeasurementPhenotype(
name = 'sbp_closest_to_index',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
value_aggregation = DailyMean(),
return_date = 'last'
)
sbp11.execute(mapped_tables)
2025-05-16 16:20:34,327 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_closest_to_index': executing... 2025-05-16 16:20:34,368 - phenex.phenotypes.phenotype - INFO - Phenotype 'sbp_closest_to_index': execution completed.
┏━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┓ ┃ PERSON_ID ┃ BOOLEAN ┃ EVENT_DATE ┃ VALUE ┃ ┡━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ string │ boolean │ date │ float64 │ ├───────────┼─────────┼────────────┼────────────┤ │ P5 │ True │ NULL │ 141.666667 │ │ P6 │ True │ NULL │ 115.000000 │ └───────────┴─────────┴────────────┴────────────┘
Using a daily aggregator and setting return_date to 'first' or 'last', we have a single measurement for each patient closest to the index date
Step 4: Define cleaning value filters¶
RWD sources are often quite messy; measurement values are often manually entered and thus we see typos and obviously faulty data in our measurement tables. MeasurementPhenotype allows us to ignore obviously faulty data using the clean_nonphysiologicals_value_filter. This filter works prior to value_aggregation, so that obviously erroneous data does not enter our value aggregation and final results. First define what physiological thresholds look like. Take care when defining these physiological thresholds as RWD sources, while messy, are also very large; rare physiological outliers are therefore 'common'.
eg 12 : clean nonphysiological values¶
I see measurements > 300mmHg in my dataset, which are obviously due to error. Which patients have a systolic blood pressure measurements greater than 200 mmHg, having removed SBP measurements >300mmHg?
sbp12 = MeasurementPhenotype(
name = 'sbp_remove_nonphysiological',
codelist = sbp_codelist,
domain = 'MEASUREMENT',
relative_time_range = ONEYEAR_PREINDEX,
clean_nonphysiologicals_value_filter = ValueFilter(
min_value = GreaterThanOrEqualTo(300),
),
value_filter = ValueFilter(
min_value = GreaterThanOrEqualTo(200),
column_name = "VALUE"
)
)
Cheat Sheet¶
- Are all raw measurment values within a physiological range? Are there 'nonsense' values that are due to measurement error? yes = set clean_nonphysiologicals_value_filter
- Do I want to want to aggregate raw values, for example perform the daily median operation, or the mean of all values in the time_period? yes = set value_aggregation
- Do I want to set value thresholds or allowed ranges? yes = set value_filter
- Do I want to return a value occurring on a specific day (first, last) or all values? yes = set return_date