CodelistPhenotype Tutorial¶
The CodelistPhenotype is used to identify patients based on the presence of medical codes in some medical coding system.
The vast majority of the data entered into real-world data sources, including records of diagnoses, medical procedures, medications, and lab tests, uses medical codes and coding systems. For this reason, the CodelistPhenotype is the most used phenotype.
In this tutorial, we will present how to use the CodelistPhenotype. Questions we will be able to answer after this tutorial include:
- Which patients had an atrial fibrillation diagnosis at any time in the data source?
- Which patients had an ECG procedure performed at any time in the data source?
- Which patients had an atrial fibrillation diagnosis one year prior to index date?
- Which patients had an atrial fibrillation diagnosis one year after index date?
- Which patients had an ECG performed one month prior to atrial fibrillation diagnosis?
- Which patients had an atrial fibrillation diagnosis in the inpatient setting?
- Which patients had an atrial fibrillation diagnosis in the inpatient setting or outpatient setting and primary diagnosis position?
- Which patients had an ECG performed one month prior to an atrial fibrillation diagnosis in the inpatient setting, or in the outpatient setting and primary diagnosis position, with the diagnosis occurring one year prior to index date?
- What was the date of the first atrial fibrillation diagnosis for patients that had an atrial fibrillation diagnosis?
- Which patients had an ECG performed within one month prior to the first atrial fibrillation diagnosis?
CodelistPhenotype makes it possible to answer all these questions, and many more. Let's see how...
Step 1 : Define a codelist¶
As the name implies, codelists are required to use the CodelistPhenotype. Codelists are handled by the Codelist class. Users can create codelists in various ways, using various types of Codelist classes; see the Codelists Tutorial for further detail.
For this tutorial, we will create two Codelist objects by hardcoding the codes to be used.
from phenex.codelists.codelists import Codelist
# Create a codelist for Atrial Fibrillation
af_codelist = Codelist(
    name = 'atrial_fibrillation',
    codelist = {
        'ICD10CM':
            ['I48.0', 'I48.1', 'I48.11', 'I48.19', 'I48.2', 'I48.20', 'I48.21', 'I48.91'],
        'ICD9CM':
            ['427.31']
    }
)
# Create a codelist for electrocardiogram (ECG)
ecg_codelist = Codelist(
    name = 'electrocardiogram',
    codelist = {
        'CPT': ['93000','93005','93010','93040','93041','93042']
    }
)
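To make the structure of a codelist concrete, here is a minimal plain-Python sketch of what a codelist lookup amounts to: a row matches when its code is listed under its own code type. This is only an illustration and not PhenEx's implementation; the row layout and the field names `code_type` and `code` are assumptions made for this example.

```python
# Illustration only: plain Python, not PhenEx internals.
# The field names 'code_type' and 'code' are assumed for this sketch.
af_codes = {
    "ICD10CM": ["I48.0", "I48.1", "I48.11", "I48.19", "I48.2", "I48.20", "I48.21", "I48.91"],
    "ICD9CM": ["427.31"],
}


def matches_codelist(row, codelist):
    """Return True if the row's code is listed under the row's code type."""
    return row["code"] in codelist.get(row["code_type"], [])


rows = [
    {"person_id": 1, "code_type": "ICD10CM", "code": "I48.0"},
    {"person_id": 2, "code_type": "ICD9CM", "code": "427.31"},
    {"person_id": 3, "code_type": "ICD10CM", "code": "427.31"},  # right code, wrong system
    {"person_id": 4, "code_type": "ICD10CM", "code": "I10"},     # not in the codelist
]

matching_ids = [r["person_id"] for r in rows if matches_codelist(r, af_codes)]
print(matching_ids)  # [1, 2]
```

Keying the codes by coding system is what keeps a code such as 427.31 from matching a row recorded under a different system.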
Step 2 : Define a domain¶
We now need to point our CodelistPhenotypes to the input data. A CodelistPhenotype always works on a single table. To point the CodelistPhenotype to this table, we use the domain keyword argument.
First, note that our input data is a dictionary where the keys are domains and the values are input tables.
We need to pass the CodelistPhenotype one of these keys (a domain). For our examples, we will be working with two domains:
- for atrial fibrillation, we are interested in diagnosis codes, which are stored in the condition_occurrence table/domain
- for ECGs, we are interested in procedures, which are stored in the procedure_occurrence table/domain
Note: The reason these are called domains and not tables is that, in the background, PhenEx may work on raw tables or a subset of the raw tables, depending on the stage of execution.
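The domain-to-table mapping described above can be pictured with a small plain-Python sketch. The toy rows and the helper `table_for` are hypothetical and exist only for this illustration; real input tables are not lists of dictionaries.

```python
# Illustration only: toy stand-ins for the input data, not real PhenEx tables.
tables = {
    "condition_occurrence": [
        {"person_id": 1, "code": "I48.0"},
        {"person_id": 2, "code": "I10"},
    ],
    "procedure_occurrence": [
        {"person_id": 1, "code": "93000"},
    ],
}


def table_for(domain, tables):
    """Look up the single table a phenotype works on; fail loudly on unknown domains."""
    if domain not in tables:
        raise KeyError(f"Unknown domain {domain!r}; available: {sorted(tables)}")
    return tables[domain]


diagnoses = table_for("condition_occurrence", tables)
print(len(diagnoses))  # 2
```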
The codelist and domain arguments are the minimum required to create a CodelistPhenotype. We're now ready to create our first phenotypes to answer our first example questions!
Examples 1 & 2¶
from phenex.phenotypes import CodelistPhenotype
# Ex.1
# Which patients had an atrial fibrillation diagnosis at **any time** in the data source?
af_phenotype = CodelistPhenotype(
    codelist = af_codelist,
    domain = 'condition_occurrence'
)
# Ex.2
# Which patients had an ECG procedure performed at **any time** in the data source?
ecg_phenotype = CodelistPhenotype(
    codelist = ecg_codelist,
    domain = 'procedure_occurrence'
)
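Each of these CodelistPhenotypes creates a table containing only patients that have one or more occurrences of a matching code. For example, af_phenotype keeps only patients with at least one atrial fibrillation code (ICD10CM or ICD9CM) at any time within the condition_occurrence table; ecg_phenotype does the same for ECG codes in the procedure_occurrence table.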
A note on CodelistPhenotypes' name argument...¶
Every phenotype requires a name in PhenEx. However, for simplicity, PhenEx attempts to derive a name for a phenotype from the information you enter for that phenotype.
For CodelistPhenotype, PhenEx will set the name to the name of the codelist, if the name of the codelist is specified. If the name of the codelist is not specified, an error will be thrown.
In this tutorial, we will be using the atrial fibrillation and ECG codelists repeatedly; each phenotype that uses them would be identically named, which leads to errors. It is therefore best practice to always set name explicitly to a unique value. All following examples will specify name.
Step 3 : Define time ranges¶
It is very common in real-world data (RWD) studies to want to know if a medical code occurred within some time period of interest. To define time ranges, we pass an instance of the RelativeTimeRangeFilter class to a CodelistPhenotype using the relative_time_range keyword argument.
To create a RelativeTimeRangeFilter, we need to define (1) an anchor and (2) the range in days in relation to the anchor.
What is an anchor?¶
Time ranges always require two dates; we always talk about some date in relation to some other date; for example, is date1 before date2, after date2 or on date2?
In the context of a CodelistPhenotype, one date is always the event_date of the medical code being assigned to a patient. The second date is referred to as the anchor date.
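As a rough illustration of the two-date comparison, the sketch below computes how many days an event lies from an anchor using only the standard library. The helper name `days_relative_to_anchor` and the sign convention (negative means before the anchor) are choices made for this example, not part of PhenEx.

```python
from datetime import date


def days_relative_to_anchor(event_date, anchor_date):
    """Positive: event is after the anchor; negative: before; zero: same day."""
    return (event_date - anchor_date).days


anchor = date(2020, 6, 1)
print(days_relative_to_anchor(date(2020, 5, 2), anchor))  # -30
print(days_relative_to_anchor(date(2020, 6, 1), anchor))  # 0
print(days_relative_to_anchor(date(2020, 7, 1), anchor))  # 30
```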
RelativeTimeRangeFilter with index date as anchor¶
In RWD studies, the most common anchor date is the index date of each patient. For example :
- the time range for baseline characteristics is some time period, referred to as the baseline period, prior to index date
- the time range for outcomes is some time period, referred to as the followup period, after the index date
This means that time range filters used to define phenotypes for all baseline characteristics and all outcomes will use the index date as the anchor date.
Because it is so common to use the index date as the anchor, the default anchor for RelativeTimeRangeFilter is the index date.
Note : The components of EntryPhenotype must define an anchor phenotype (see below), as no index date has been defined! Only after an entry phenotype has been defined do subsequent phenotypes have access to the (possible) index date. See the tutorial on EntryPhenotype for more details.
Examples 3 & 4¶
from phenex.phenotypes import CodelistPhenotype
from phenex.filters import (
    GreaterThanOrEqualTo,
    LessThan,
    RelativeTimeRangeFilter,
)
# Ex.3
# Which patients had an atrial fibrillation diagnosis **one year prior to index date**?
# Window: 0 <= days before index < 365
one_year_before_index = RelativeTimeRangeFilter(
    when="before",
    min_days = GreaterThanOrEqualTo(0),
    max_days = LessThan(365)
)
af_phenotype = CodelistPhenotype(
    name = 'af_one_year_before_index',
    codelist = af_codelist,
    domain = 'condition_occurrence',
    relative_time_range = one_year_before_index
)
# Ex.4
# Which patients had an atrial fibrillation diagnosis **one year after index date**?
# Window: 0 <= days after index < 365
one_year_after_index = RelativeTimeRangeFilter(
    when="after",
    min_days = GreaterThanOrEqualTo(0),
    max_days = LessThan(365)
)
af_phenotype = CodelistPhenotype(
    name = 'af_one_year_after_index',
    codelist = af_codelist,
    domain = 'condition_occurrence',
    relative_time_range = one_year_after_index
)
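The sketch below shows, in plain Python, one way to read a window that is anchored on the index date. It assumes a one-year window written as 0 <= days < 365 on the chosen side of the index date; the helper `in_window` and its parameters are hypothetical and are not PhenEx code.

```python
from datetime import date


def in_window(event_date, index_date, when, min_days, max_days):
    """True if the event lies min_days <= d < max_days from the index date on the given side."""
    delta = (event_date - index_date).days
    days = -delta if when == "before" else delta
    return min_days <= days < max_days


index_date = date(2021, 1, 1)
events = [date(2019, 12, 1), date(2020, 3, 15), date(2021, 1, 1), date(2021, 6, 30)]

baseline = [e for e in events if in_window(e, index_date, "before", 0, 365)]
followup = [e for e in events if in_window(e, index_date, "after", 0, 365)]

print(baseline)  # [datetime.date(2020, 3, 15), datetime.date(2021, 1, 1)]
print(followup)  # [datetime.date(2021, 1, 1), datetime.date(2021, 6, 30)]
```

Note that with a lower bound of zero on both sides, an event on the index date itself falls into both windows. Whether the index day belongs to baseline, follow-up, or both is a study design decision that the lower bound makes explicit.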
RelativeTimeRangeFilter anchored by another phenotype¶
Another common pattern is to define time ranges in relation to other phenotypes. In this case, we explicitly set the anchor to the date returned by some other phenotype.
Phenotypes do not return dates by default. It is therefore important to remember to define which date the anchor phenotype should return, as this greatly affects your query. The options for return date include 'first', 'last', and 'all'; see the return date section (Step 5) below for more information.
Note : The components of EntryPhenotype must define an anchor phenotype if using RelativeTimeRangeFilter, as no index date is defined. See the tutorial on EntryPhenotype for more details.
Example 5¶
from phenex.phenotypes import CodelistPhenotype
from phenex.filters import (
    LessThanOrEqualTo,
    RelativeTimeRangeFilter
)
# Ex.5
# Which patients had an ECG performed **one month prior** to atrial fibrillation diagnosis?
# Create the anchor phenotype
af_phenotype = CodelistPhenotype(
    name = 'all_af_diagnosis_events',
    codelist = af_codelist,
    domain = 'condition_occurrence',
    return_date = 'all'
)
# Create the time range filter: up to 30 days before the anchor (AF diagnosis) date
one_month_before_af_diag = RelativeTimeRangeFilter(
    anchor_phenotype = af_phenotype,
    when = 'before',
    max_days = LessThanOrEqualTo(30)
)
# Create the final phenotype
ecg = CodelistPhenotype(
    name = 'ecg_with_af_one_month_prior',
    codelist = ecg_codelist,
    domain = 'procedure_occurrence',
    relative_time_range = one_month_before_af_diag
)
Notice that there are many ways to define the same phenotype. For example, example 5 could be flipped, so that the anchor phenotype is ECG and the final phenotype is atrial fibrillation. Which order to use is up to the user's discretion.
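To see what anchoring on every diagnosis date means in practice, here is a plain-Python sketch of the comparison behind example 5. It is not PhenEx code: the toy data, the helper `ecg_before_af`, and the assumption that a same-day ECG counts (a lower bound of zero days) are all made up for this illustration.

```python
from datetime import date

# Toy data: patient id -> list of event dates (hypothetical values).
af_dates = {1: [date(2020, 3, 1), date(2020, 9, 1)], 2: [date(2020, 5, 10)]}
ecg_dates = {1: [date(2020, 8, 20)], 2: [date(2020, 1, 5)], 3: [date(2020, 2, 2)]}


def ecg_before_af(ecg_dates, af_dates, max_days=30):
    """Patients with any ECG 0..max_days days before any of their AF diagnosis dates."""
    hits = set()
    for pid, ecgs in ecg_dates.items():
        for ecg in ecgs:
            for af in af_dates.get(pid, []):
                if 0 <= (af - ecg).days <= max_days:
                    hits.add(pid)
    return sorted(hits)


print(ecg_before_af(ecg_dates, af_dates))  # [1]
```

Patient 1 qualifies only because the ECG is compared against every diagnosis date; if the anchor returned just the first diagnosis (2020-03-01), the same ECG would not qualify.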
Step 4 : Categorical filters¶
A common pattern in RWD studies is to further qualify a codelist query for additional information present in other columns of the same table. For example, it is common to specify that a diagnosis code must be present in the inpatient hospital setting, or in the primary diagnosis position.
To further qualify a CodelistPhenotype, we can use the categorical_filter keyword argument. We must first define a CategoricalFilter.
Example 6¶
from phenex.phenotypes import CodelistPhenotype
from phenex.filters import (
    CategoricalFilter
)
# Ex.6
# Which patients had an atrial fibrillation diagnosis **in the inpatient setting**?
inpatient_setting = CategoricalFilter(columnname = 'encounter_type', allowed_values = ['inpatient'])
inpatient_af_phenotype = CodelistPhenotype(
    name = 'af_inpatient',
    codelist = af_codelist,
    domain = 'condition_occurrence',
    categorical_filter = inpatient_setting
)
We can either define a single categorical filter, as seen in the above example, or perform logical operations using categorical filters to combine them.
Example 7¶
from phenex.phenotypes import CodelistPhenotype
from phenex.filters import (
    CategoricalFilter
)
# Ex.7
# Which patients had an atrial fibrillation diagnosis **in the inpatient setting, or in the
# outpatient setting and primary diagnosis position**?
# create all necessary component categorical filters
inpatient_setting = CategoricalFilter(columnname = 'encounter_type', allowed_values = ['inpatient'])
outpatient_setting = CategoricalFilter(columnname = 'encounter_type', allowed_values = ['outpatient'])
primary_diagnosis_position = CategoricalFilter(columnname = 'diagnosis_position', allowed_values = ['primary'])
# we can join categorical filters using logical operations; be careful with parentheses!
final_categorical_filter = inpatient_setting | (outpatient_setting & primary_diagnosis_position)
# create final phenotype
af_phenotype = CodelistPhenotype(
    name = 'af_inpatient_or_primary_diagnosis_outpatient',
    codelist = af_codelist,
    domain = 'condition_occurrence',
    categorical_filter = final_categorical_filter
)
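The warning about parentheses can be illustrated with a small stand-in class. `ColumnFilter` below is hypothetical and is not PhenEx's CategoricalFilter; it only mimics the idea of combining row-level conditions with `&` and `|` so the effect of grouping can be seen on a single row.

```python
class ColumnFilter:
    """Hypothetical stand-in for a categorical filter: a predicate over one row."""

    def __init__(self, predicate):
        self.predicate = predicate

    @classmethod
    def isin(cls, column, allowed_values):
        return cls(lambda row: row[column] in allowed_values)

    def __and__(self, other):
        return ColumnFilter(lambda row: self.predicate(row) and other.predicate(row))

    def __or__(self, other):
        return ColumnFilter(lambda row: self.predicate(row) or other.predicate(row))

    def __call__(self, row):
        return self.predicate(row)


inpatient = ColumnFilter.isin("encounter_type", ["inpatient"])
outpatient = ColumnFilter.isin("encounter_type", ["outpatient"])
primary = ColumnFilter.isin("diagnosis_position", ["primary"])

intended = inpatient | (outpatient & primary)
regrouped = (inpatient | outpatient) & primary

# An inpatient diagnosis in a secondary position
row = {"encounter_type": "inpatient", "diagnosis_position": "secondary"}
print(intended(row), regrouped(row))  # True False
```

In Python, `&` binds more tightly than `|`, so `a | b & c` is read as `a | (b & c)`. Writing the parentheses anyway keeps the intended grouping visible to the reader.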
Putting it all together¶
We can use codelists, time range filters and categorical filters all together in whatever ways are necessary. The following example shows this.
Example 8¶
from phenex.phenotypes import CodelistPhenotype
from phenex.filters import (
    LessThanOrEqualTo,
    RelativeTimeRangeFilter,
    CategoricalFilter
)
# Ex.8
# Which patients had an ECG performed **one month prior** to an atrial fibrillation
# diagnosis **in the inpatient setting, or in the outpatient setting and primary diagnosis
# position**, with the diagnosis occurring **one year prior to index date**?
# create all necessary component categorical filters
inpatient_setting = CategoricalFilter(columnname = 'encounter_type', allowed_values = ['inpatient'])
outpatient_setting = CategoricalFilter(columnname = 'encounter_type', allowed_values = ['outpatient'])
primary_diagnosis_position = CategoricalFilter(columnname = 'diagnosis_position', allowed_values = ['primary'])
# we can join categorical filters using logical operations; be careful with parentheses!
final_categorical_filter = inpatient_setting | (outpatient_setting & primary_diagnosis_position)
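# restrict the diagnosis to the year before index date (default anchor: index date)
one_year_before_index = RelativeTimeRangeFilter(
    when = 'before',
    max_days = LessThanOrEqualTo(365)
)
# create the anchor phenotype
af_phenotype = CodelistPhenotype(
    name = 'af_inpatient_or_primary_diagnosis_outpatient_dates',
    codelist = af_codelist,
    domain = 'condition_occurrence',
    categorical_filter = final_categorical_filter,
    relative_time_range = one_year_before_index,
    return_date = 'all'
)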
# Create the time range filter: up to 30 days before the anchor (AF diagnosis) date
one_month_before_af_diag = RelativeTimeRangeFilter(
    anchor_phenotype = af_phenotype,
    when = 'before',
    max_days = LessThanOrEqualTo(30)
)
# Create the final phenotype
ecg = CodelistPhenotype(
    name = 'ecg_with_af_inpatient_or_outpatient_primary_one_month_prior',
    codelist = ecg_codelist,
    domain = 'procedure_occurrence',
    relative_time_range = one_month_before_af_diag
)
Step 5 : Define a return date¶
By default, CodelistPhenotype returns only the ids of patients that fulfill the CodelistPhenotype criteria. However, it is also possible to return a date associated with the codelist event. To do this, we specify the return_date keyword argument.
Note: When using a CodelistPhenotype as an anchor for another phenotype (see above), it is required to define the return_date.
There are multiple options for return_date:
- 'all' will return the date of all events that fulfill the CodelistPhenotype criteria. Note that there are multiple rows per patient in this case (no reduction). All other options return a single row per patient.
- 'first' will return the date of the first occurrence of a code in the codelist
- 'last' will return the date of the last occurrence of a code in the codelist
- 'nearest' will return the date of the occurrence of a code in the codelist nearest to the anchor date of the relative_time_range filter. Note that relative_time_range must be defined to use 'nearest'. If multiple time range filters are defined (i.e. relative_time_range is a list), the anchor is taken from the first time range filter in the list
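The four options can be pictured as different reductions over one patient's qualifying event dates. The following plain-Python sketch is an illustration under that reading, not PhenEx's implementation; the helper `reduce_dates` and its `anchor_date` parameter are hypothetical.

```python
from datetime import date


def reduce_dates(event_dates, return_date, anchor_date=None):
    """Reduce one patient's qualifying event dates according to return_date."""
    if return_date == "all":
        return sorted(event_dates)  # no reduction: one row per event
    if return_date == "first":
        return [min(event_dates)]
    if return_date == "last":
        return [max(event_dates)]
    if return_date == "nearest":
        if anchor_date is None:
            raise ValueError("'nearest' needs an anchor date")
        return [min(event_dates, key=lambda d: abs((d - anchor_date).days))]
    raise ValueError(f"Unknown return_date: {return_date!r}")


events = [date(2020, 9, 1), date(2020, 3, 1), date(2021, 2, 1)]
index_date = date(2021, 1, 1)

print(reduce_dates(events, "first"))                # [datetime.date(2020, 3, 1)]
print(reduce_dates(events, "last"))                 # [datetime.date(2021, 2, 1)]
print(reduce_dates(events, "nearest", index_date))  # [datetime.date(2021, 2, 1)]
print(len(reduce_dates(events, "all")))             # 3
```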
from phenex.phenotypes import CodelistPhenotype
from phenex.filters import (
    LessThanOrEqualTo,
    RelativeTimeRangeFilter,
)
# Ex.9
# What was the date of the first atrial fibrillation diagnosis for patients
# that had an atrial fibrillation diagnosis?
af_phenotype = CodelistPhenotype(
    name = 'af_date_first_diagnosis',
    codelist = af_codelist,
    domain = 'condition_occurrence',
    return_date = 'first'
)
# Ex.10
# Which patients had an ECG performed within one month prior to the first
# atrial fibrillation diagnosis?
ecg_phenotype = CodelistPhenotype(
    name = 'ecg_one_month_before_first_af',
    codelist = ecg_codelist,
    domain = 'procedure_occurrence',
    relative_time_range = RelativeTimeRangeFilter(
        when = 'before',
        max_days = LessThanOrEqualTo(30),
        anchor_phenotype = af_phenotype  # set the anchor to the phenotype defined above
    )
)