PhenEx Study Tutorial¶
On this page we show you how to use PhenEx to:
- Connect to a Snowflake Database
- Work with OMOP data
- Create a simple cohort
- View cohort summary statistics
First, make sure that your PhenEx version is up to date.
# For updating PhenEx to latest released version
# !pip install -Uq PhenEx
import ibis
ibis.options.interactive = True
# import os
# # Method 1: set the authentication environment variables directly in the notebook
# os.environ.update({
#     'SNOWFLAKE_ACCOUNT': 'ACCOUNT',
#     'SNOWFLAKE_WAREHOUSE': 'WAREHOUSE',
#     'SNOWFLAKE_ROLE': 'ROLE',
#     'SNOWFLAKE_USER': 'USER',
# })
Method 2 : env file¶
You can also specify these using a dotenv file (https://github.com/motdotla/dotenv). One advantage of doing this is that you do not put sensitive credential information into your Jupyter notebook.
from dotenv import load_dotenv
load_dotenv()
If you see True above, Python was able to find and load your environment file.
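A True return value does not by itself guarantee that every variable you need is defined. The following is a minimal sketch of a sanity check using only the standard library; the helper name missing_env_vars is hypothetical (it is not part of PhenEx or dotenv), and the variable names are the four listed in the commented cell above.

```python
import os

# Variable names taken from the credentials cell above; adjust to your own setup.
REQUIRED_VARS = [
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_ROLE",
    "SNOWFLAKE_USER",
]


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Snowflake environment variables are set.")
```

Running this before creating the connector makes a missing credential visible early, rather than as a login failure later.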
Connect to the database¶
We will now establish a connection to Snowflake using a SnowflakeConnector; these connectors will use your environment variables (set above) for login credentials.
At this point we must define two databases in Snowflake:
- Source : the Snowflake location where input data to PhenEx should come from
- Destination (dest) : the Snowflake location where output data from PhenEx should be written. The destination will be created if it does not exist.
Run this cell to connect to these databases; this cell will open up two browser tabs (if you're using browser authentication). After those pages load (wait for them to say completed!), close them and return to this notebook.
%%capture
from phenex.ibis_connect import SnowflakeConnector
con = SnowflakeConnector(
    # SNOWFLAKE_SOURCE_DATABASE = 'SOURCE_DATABASE', # enter this here, or use the .env file
    # SNOWFLAKE_DEST_DATABASE = 'DEST_DATABASE' # enter this here, or use the .env file
)
Notice that both of these locations can also be specified using environment variables (like we did in method 1/2 for credentials), and vice versa (credentials can be passed to a connector as keyword arguments, rather than being hidden in the .env file). However, as credentials generally remain the same between projects and the database locations are project dependent, it is best practice to define database locations with the connector.
Define input data structure¶
PhenEx needs to know a little bit about the structure of the input data in order to help us make phenotypes and cohorts.
What this means is that PhenEx needs to know in which table and column to find information such as patient ID, year of birth, diagnosis events, etc. This information is generally present in all real-world data (RWD) sources, but each data source (1) organizes it in a different way and (2) can use different column names.
When using a new data source, we need to onboard that database for usage with PhenEx (tell it about table structure and column names). Go to the tutorial on onboarding a new database to learn how to onboard a database.
For the purposes of this tutorial, we will be using OMOP data, which is already onboarded and available in the PhenEx library. All we have to do is import the OMOPDomains and then get the mapped tables.
from phenex.mappers import OMOPDomains
mapped_tables = OMOPDomains.get_mapped_tables(con)
list(mapped_tables.keys())
Looking at input data¶
PhenEx bundles all input data into a dictionary, in this case in the variable called mapped_tables. The keys in this dictionary are known as 'domains'; we can access the input data by these domain keys. The values for each key are the actual tables.
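To make the shape of this dictionary concrete, here is a plain-Python stand-in. It is only an illustration: the real values are PhenEx tables backed by the database connection, not lists, and the rows and column names below are made up for the example.

```python
from datetime import date

# Plain-Python stand-in for mapped_tables: domain name -> table.
# Each "table" here is just a list of row dicts with made-up values.
mapped_tables_demo = {
    "PERSON": [
        {"PERSON_ID": 1, "YEAR_OF_BIRTH": 1950},
        {"PERSON_ID": 2, "YEAR_OF_BIRTH": 1985},
    ],
    "CONDITION_OCCURRENCE": [
        {"PERSON_ID": 1, "CONDITION_START_DATE": date(2020, 5, 1)},
    ],
}

# The keys are the domains ...
domains = sorted(mapped_tables_demo.keys())
print(domains)

# ... and indexing by a domain returns that domain's table.
person_table = mapped_tables_demo["PERSON"]
print(len(person_table), "rows in PERSON")
```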
Integrating medical codelists¶
Medical codelists are an integral part of any observational study. PhenEx has functionality to help you use medical codelist files (CSVs or Excels). Go to the Codelist Tutorial to find out more about codelists. For the purpose of this tutorial, we will create a LocalCSVCodelistFactory that opens a codelist file, and returns medical codelists ready for use with PhenEx.
from phenex.codelists import LocalCSVCodelistFactory
# create the codelist factory. We have to map column names; see the Codelist Tutorial for more info
codelist_factory = LocalCSVCodelistFactory(
    path='./codelists_for_tutorial.csv',
    name_code_column='CONCEPT_ID',
    name_codelist_column='CODELIST',
    name_code_type_column='VOCABULARY_ID'
)
# let's see what codelists are available
codelist_factory.get_codelists()
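To illustrate what the three mapped columns mean, the sketch below groups the rows of a small codelist file by codelist name and code type using only the standard library. It does not show how LocalCSVCodelistFactory works internally; the function group_codelists is hypothetical, and the codes in the sample text are placeholders rather than real concept IDs.

```python
import csv
import io
from collections import defaultdict

# Made-up file contents: the codes below are placeholders, not real concept IDs.
csv_text = """CODELIST,CONCEPT_ID,VOCABULARY_ID
ATRIAL_FIBRILLATION,1001,SNOMED
ATRIAL_FIBRILLATION,1002,SNOMED
MYOCARDIAL_INFARCTION,2001,SNOMED
"""


def group_codelists(text):
    """Group codes by codelist name, then by code type (vocabulary)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(text)):
        grouped[row["CODELIST"]][row["VOCABULARY_ID"]].append(row["CONCEPT_ID"])
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {name: dict(by_type) for name, by_type in grouped.items()}


codelists = group_codelists(csv_text)
print(sorted(codelists))
```

Each codelist name maps to its codes, organized by code type; this is the information the three column-name arguments point the factory at.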
Study Definition¶
Now we're ready to use PhenEx to specify our observational study! For the purposes of this tutorial, we will use the following dummy study definition:
AIM : to characterize patients with atrial fibrillation
Entry Criterion
incident atrial fibrillation, as defined by the first occurrence of a diagnosis code for atrial fibrillation
Inclusion criteria
- At least one year lookback period
- greater than or equal to 18 years old
Exclusion criteria
- No myocardial infarction hospitalization in the year prior to index
Baseline characteristics
- age at index
- sex
- number of deaths within 30 days of index
Time to event analysis for death post index
To use PhenEx, we'll translate this written text into a PhenEx executable study definition. We do that by creating a phenotype for each one of these study elements, and then we will put them all together in the 'Cohort' section.
Entry criterion¶
The entry criterion defines study entry i.e. it defines the index date. At this point, it is only a potential index date; only after we evaluate the in/exclusion criteria at this possible index date does it become the true index date.
from phenex.phenotypes.codelist_phenotype import CodelistPhenotype
from phenex.codelists.codelists import Codelist
# create a codelist for atrial fibrillation
cl_af = codelist_factory.get_codelist('ATRIAL_FIBRILLATION').copy(use_code_type=False)
# create a phenotype that uses this codelist.
pt_entry = CodelistPhenotype(
    name='first_atrial_fibrillation_diagnosis',
    domain='CONDITION_OCCURRENCE',
    codelist=cl_af,
    return_date='first', # return the first occurrence
)
# we can execute the phenotype and look at the results. this is not required!
pt_entry.execute(mapped_tables)
pt_entry.table
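Conceptually, the entry phenotype keeps one row per patient: the earliest event whose code is in the codelist. The sketch below shows that logic in plain Python on made-up events; it is an illustration of the idea, not PhenEx's implementation, and first_occurrence is a hypothetical helper.

```python
from datetime import date

# Made-up diagnosis events: (person_id, code, event_date).
events = [
    (1, "AF", date(2020, 5, 1)),
    (1, "AF", date(2019, 3, 15)),
    (2, "AF", date(2021, 1, 10)),
    (3, "OTHER", date(2018, 7, 4)),
]


def first_occurrence(rows, codes):
    """Return {person_id: earliest event date} for events whose code is in `codes`."""
    first = {}
    for person_id, code, event_date in rows:
        if code not in codes:
            continue
        if person_id not in first or event_date < first[person_id]:
            first[person_id] = event_date
    return first


# Each date is a potential index date until the in/exclusion criteria are applied.
potential_index = first_occurrence(events, {"AF"})
print(potential_index)
```

Patient 3 has no matching code and therefore has no potential index date.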
Inclusion 1 : At least one year lookback period¶
from phenex.phenotypes import TimeRangePhenotype
from phenex.filters import RelativeTimeRangeFilter, GreaterThanOrEqualTo
pt_inclusion1 = TimeRangePhenotype(
    name='one_year_coverage',
    relative_time_range=RelativeTimeRangeFilter(
        when='before',
        min_days=GreaterThanOrEqualTo(365),
        anchor_phenotype=pt_entry # this is only necessary if we want to execute pt_inclusion1 outside of a cohort.
    )
)
# we can again execute immediately (outside of the Cohort) if we want to observe the results
# pt_inclusion1.execute(mapped_tables)
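The lookback requirement boils down to a date difference: the patient must have been observable for at least 365 days before the index date. A minimal sketch, assuming coverage is described by a single start date per patient (how PhenEx derives coverage from the data is not shown here, and has_lookback is a hypothetical helper):

```python
from datetime import date


def has_lookback(coverage_start, index_date, min_days=365):
    """True if coverage began at least `min_days` days before the index date."""
    return (index_date - coverage_start).days >= min_days


index_date = date(2020, 1, 1)
print(has_lookback(date(2019, 1, 1), index_date))  # exactly 365 days of lookback
print(has_lookback(date(2019, 1, 2), index_date))  # 364 days of lookback
```

The comparison is inclusive, matching GreaterThanOrEqualTo(365) in the phenotype above.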
Inclusion 2 : Age greater than or equal to 18¶
from phenex.phenotypes import AgePhenotype
from phenex.filters import ValueFilter, GreaterThanOrEqualTo
pt_inclusion2 = AgePhenotype(
    name='age_ge18',
    value_filter=ValueFilter(
        min_value=GreaterThanOrEqualTo(18) # the study definition includes patients aged exactly 18
    ),
    anchor_phenotype=pt_entry # this is only necessary if we want to execute pt_inclusion2 outside of a cohort.
)
# we can again execute immediately (outside of the Cohort) if we want to observe the results
# pt_inclusion2.execute(mapped_tables)
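The study definition asks for patients who are greater than or equal to 18 years old at index, so the boundary case matters. The sketch below computes age in completed years and shows the difference between an inclusive and a strict comparison. It assumes a full date of birth is available (some data sources only record the year), and age_at is a hypothetical helper, not a PhenEx function.

```python
from datetime import date


def age_at(index_date, birth_date):
    """Age in completed years on `index_date`."""
    had_birthday = (index_date.month, index_date.day) >= (birth_date.month, birth_date.day)
    return index_date.year - birth_date.year - (0 if had_birthday else 1)


birth = date(2002, 6, 1)
age_on_birthday = age_at(date(2020, 6, 1), birth)   # turns 18 on this day
age_day_before = age_at(date(2020, 5, 31), birth)   # still 17

print(age_on_birthday >= 18)  # inclusive comparison keeps the 18-year-old
print(age_on_birthday > 18)   # strict comparison would drop them
```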
# Create the final list of inclusions. These criteria will be executed sequentially when creating the attrition table, so adjust the order as desired for the attrition table.
inclusions = [pt_inclusion1, pt_inclusion2]
Exclusions¶
We now create a list of exclusion phenotypes; these phenotypes must evaluate to 'false' for a patient to enter the cohort, i.e. patients may NOT fulfill the criteria defined by the exclusion phenotypes. We go one by one, implementing each criterion.
Exclusion 1 : Inpatient myocardial infarction diagnosis¶
from phenex.filters import CategoricalFilter, RelativeTimeRangeFilter, GreaterThan, GreaterThanOrEqualTo, LessThan
f_inpatient = CategoricalFilter(
    allowed_values=[
        9203, # Emergency Room Visit
        262, # Emergency Room and Inpatient Visit
        9201, # Inpatient Visit
    ],
    column_name='VISIT_CONCEPT_ID',
    domain='VISIT_OCCURRENCE'
)
f_one_year_pre_index = RelativeTimeRangeFilter(
    when='before',
    anchor_phenotype=pt_entry, # this is only necessary if we want to execute pt_exclusion1 outside of a cohort.
    min_days=GreaterThanOrEqualTo(0),
    max_days=LessThan(365),
)
cl_mi = codelist_factory.get_codelist('MYOCARDIAL_INFARCTION').copy(use_code_type=False)
pt_exclusion1 = CodelistPhenotype(
    name='myocardial_infarction_hospitalization',
    domain='CONDITION_OCCURRENCE',
    codelist=cl_mi,
    categorical_filter=f_inpatient,
    relative_time_range=f_one_year_pre_index
)
# we can again execute immediately (outside of the Cohort) if we want to observe the results; patients in this table will be EXCLUDED from the final cohort
# pt_exclusion1.execute(mapped_tables)
exclusions = [pt_exclusion1]
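The relative time range used for this exclusion is a half-open window: 0 or more days before index, but fewer than 365. A minimal plain-Python sketch of that window check follows; in_window_before is a hypothetical helper that mirrors the min_days and max_days bounds above, not PhenEx's implementation.

```python
from datetime import date, timedelta


def in_window_before(event_date, index_date, min_days=0, max_days=365):
    """True if min_days <= (days between event and index) < max_days, event before index."""
    days_before = (index_date - event_date).days
    return min_days <= days_before < max_days


index_date = date(2020, 1, 1)
for offset in (0, 364, 365, -1):
    event_date = index_date - timedelta(days=offset)
    print(offset, in_window_before(event_date, index_date))
```

An event on the index date (offset 0) and one 364 days earlier fall inside the window; an event 365 days earlier, or one after index (offset -1), falls outside. Patients with any qualifying event inside the window are excluded.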
Characteristics¶
We now create a list of baseline characteristic phenotypes; these phenotypes are run on the final cohort only. We can observe the results in Table 1. We go one by one, implementing each characteristic.
from phenex.phenotypes import AgePhenotype, BinPhenotype, CategoricalPhenotype, DeathPhenotype
pt_characteristic1 = AgePhenotype()
pt_characteristic2 = BinPhenotype(name='binned_age', phenotype=pt_characteristic1)
pt_characteristic3 = CategoricalPhenotype(
    name='sex',
    categorical_filter=CategoricalFilter(column_name="GENDER_SOURCE_VALUE"),
    domain="PERSON"
)
pt_characteristic4 = DeathPhenotype(
    name='death_30days',
    domain='DEATH',
    relative_time_range=RelativeTimeRangeFilter(
        when='after',
        min_days=GreaterThan(0),
        max_days=LessThan(30)
    )
)
characteristics = [pt_characteristic1, pt_characteristic2, pt_characteristic3, pt_characteristic4]
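Binning turns a numeric characteristic such as age into a categorical one. The sketch below shows the idea with fixed-width, half-open bins. The 10-year width and the label format are assumptions made for this illustration; the bins that BinPhenotype produces by default may differ.

```python
def bin_age(age, width=10):
    """Label the half-open bin [lower, lower + width) that contains `age`."""
    lower = (age // width) * width
    return f"[{lower}-{lower + width})"


ages = [18, 67, 70]
binned = [bin_age(a) for a in ages]
print(binned)
```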
Outcomes¶
We now create a list of outcome phenotypes; these phenotypes are run on the final cohort only. We can observe the results in a time to event analysis. We go one by one, implementing each outcome.
f_postindex = RelativeTimeRangeFilter(
    when='after',
    min_days=GreaterThan(0),
)
pt_outcome1 = CodelistPhenotype(
    name='myocardial_infarction_after_index',
    domain='CONDITION_OCCURRENCE',
    codelist=cl_mi,
    categorical_filter=f_inpatient,
    relative_time_range=f_postindex
)
outcomes = [pt_outcome1]
Cohort¶
We now put everything together in a PhenEx cohort. This takes the entry phenotype, and all the lists of phenotypes we created above. We can then execute the cohort.
from phenex.phenotypes.cohort import Cohort
cohort = Cohort(
    name='study_tutorial_cohort',
    entry_criterion=pt_entry,
    inclusions=inclusions,
    exclusions=exclusions,
    characteristics=characteristics,
    outcomes=outcomes,
)
cohort.execute(mapped_tables, con = con, n_threads=6, overwrite=True, lazy_execution=True)
Reporting¶
Once you're done executing the cohort, PhenEx provides basic reporting of attrition, baseline characteristics, and time to event.
Attrition¶
The attrition table shows the flow of patients that results in your final cohort. The first row is the entry criterion.
- In the first row, the N column shows how many patients in the entire dataset fulfill the entry criterion. In the following rows, the N column shows how many of the patients that fulfill the entry criterion also fulfill the criterion on that row.
- The Remaining column shows how many patients remain after applying the criterion on that row.
- The % column shows the number of remaining patients as a percentage of the patients that fulfill the entry criterion.
- The delta column shows how many patients are lost after applying the criterion on each row.
from phenex.reporting import Waterfall
reporter = Waterfall()
reporter.execute(cohort)
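The column definitions above can be reproduced on toy data. The sketch below works on sets of made-up patient IDs and is only meant to make the arithmetic explicit; it is not how the Waterfall reporter is implemented. The function attrition is hypothetical, and the sign convention for delta (change in the remaining count, so zero or negative) is an assumption.

```python
def attrition(entry_ids, inclusions, exclusions):
    """Compute attrition rows from sets of patient ids.

    inclusions and exclusions are lists of (name, ids) pairs, where ids is the
    set of patients that fulfill that criterion.
    """
    entry_ids = set(entry_ids)
    n_entry = len(entry_ids)
    remaining = set(entry_ids)
    rows = [{"name": "entry", "N": n_entry, "remaining": n_entry,
             "pct": 100.0 if n_entry else 0.0, "delta": None}]
    steps = [(name, ids, True) for name, ids in inclusions]
    steps += [(name, ids, False) for name, ids in exclusions]
    for name, ids, is_inclusion in steps:
        before = len(remaining)
        # Inclusions keep patients that fulfill the criterion; exclusions drop them.
        remaining = remaining & ids if is_inclusion else remaining - ids
        rows.append({
            "name": name,
            "N": len(entry_ids & ids),
            "remaining": len(remaining),
            "pct": 100 * len(remaining) / n_entry if n_entry else 0.0,
            "delta": len(remaining) - before,
        })
    return rows


rows = attrition(
    entry_ids={1, 2, 3, 4, 5},
    inclusions=[("one_year_coverage", {1, 2, 3, 4}), ("age", {1, 2, 3, 5})],
    exclusions=[("myocardial_infarction_hospitalization", {3})],
)
for row in rows:
    print(row)
```

Because the criteria are applied sequentially, reordering the inclusions changes the Remaining and delta columns of the intermediate rows but not the final cohort size.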
Table 1¶
We can look at summary statistics of our baseline characteristics using PhenEx. The order of the phenotypes in the list of characteristics determines the order in Table 1.
- The N column shows how many of the patients that fulfill the entry criterion fulfill the criterion on that row.
- The % column shows the percentage of patients that fulfill the entry criterion who also fulfill the criterion on that row.
- For categorically valued phenotypes (e.g. CategoricalPhenotype, BinPhenotype), we see a row for each category found.
- For numerically valued phenotypes (e.g. MeasurementPhenotype, AgePhenotype), we see summary statistics.
cohort.table1
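As an illustration of the two kinds of rows, the sketch below summarizes one numeric and one categorical characteristic with the standard library. The values are made up, the choice of statistics is an assumption (the statistics PhenEx reports may differ), and the denominator for the percentage is passed in explicitly.

```python
from collections import Counter
from statistics import mean, median


def summarize_numeric(values):
    """Summary statistics for a numerically valued characteristic."""
    return {
        "N": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


def summarize_categorical(values, n_total):
    """One entry per category found, with its count and percentage of n_total."""
    counts = Counter(values)
    return {cat: {"N": n, "pct": 100 * n / n_total} for cat, n in sorted(counts.items())}


ages = [45, 60, 72, 81]
sexes = ["F", "M", "F", "F"]

age_summary = summarize_numeric(ages)
sex_summary = summarize_categorical(sexes, n_total=len(sexes))
print(age_summary)
print(sex_summary)
```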
Time to event analysis¶
PhenEx allows you to do basic time to event analyses. Kaplan-Meier curves are currently supported. You simply define your right censoring phenotypes, and then create a time to event reporter. The survival curve is then generated for you.
import datetime
from phenex.reporting import TimeToEvent
from phenex.phenotypes import DeathPhenotype
end_of_followup = TimeRangePhenotype(
    name='end_of_followup',
    relative_time_range=RelativeTimeRangeFilter(when='after')
)
death_right_censor = DeathPhenotype(
    name='death_censoring',
    domain='DEATH',
    relative_time_range=f_postindex
)
right_censor_phenotypes = [end_of_followup, death_right_censor]
tte = TimeToEvent(
    right_censor_phenotypes=right_censor_phenotypes,
    end_of_study_period=datetime.date(2025, 12, 12)
)
tte.execute(cohort)
tte.plot_single_kaplan_meier(xlim=[0,90], outcome_index=0)
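For readers who want to see what sits behind such a curve, here is a minimal sketch of right censoring and the Kaplan-Meier product-limit estimator in plain Python. It runs on made-up data and is not the TimeToEvent implementation; follow_up and kaplan_meier are hypothetical helpers, and treating an outcome that falls on the censoring date as observed is an assumption.

```python
from datetime import date


def follow_up(index_date, outcome_date, censor_dates, end_of_study):
    """Return (days of follow-up, observed) for one patient.

    Follow-up ends at the outcome if it occurs on or before the earliest
    censoring date; otherwise the patient is right censored at that date.
    """
    censor = min([d for d in censor_dates if d is not None] + [end_of_study])
    if outcome_date is not None and outcome_date <= censor:
        return (outcome_date - index_date).days, True
    return (censor - index_date).days, False


def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time with at least one event."""
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if e and d == t)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# One patient censored by death before the outcome could be observed.
print(follow_up(date(2020, 1, 1), date(2020, 3, 1), [date(2020, 2, 1)], date(2025, 12, 12)))

# Made-up follow-up times in days; False marks a right-censored patient.
durations = [5, 10, 10, 20, 30]
observed = [True, False, True, True, False]
curve = kaplan_meier(durations, observed)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients still count in the at-risk denominator up to their censoring time, which is why the curve differs from a simple proportion of patients without the outcome.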