Python for PhenEx Tutorial
Many people who conduct observational studies do not consider themselves coders/programmers; they see themselves primarily as epidemiologists, guided by the research question at hand. Among those doing observational research who do program, R or even SAS is the usual choice, and common data science languages such as python are rarely used. This is in large part because tools for observational research have historically not been written in python.
PhenEx is designed to reflect the thought processes of doing observational research with real world data. This means that many epidemiologists find themselves very quickly able to read a PhenEx cohort specification without any python knowledge.
Nonetheless, to write PhenEx cohorts from scratch (rather than copy and paste), it is useful to know some of the basics of python programming. We have collected here some notes and solutions from teaching PhenEx to python beginners, with simple explanations and links to resources to dive deeper into a topic.
This is by no means exhaustive and simply a starting point! As with all tutorials, it is hard to determine the level of the reader. We have tried to create this for total beginners, though that is difficult. It is important, for complete beginners, to not be overwhelmed! At the beginning, programming can seem overly complicated or magical, but it is important to remember that it is definitely not magical, and shockingly simple once understood.
Use a text editor or IDE with syntax highlighting!
Syntax highlighting means that different programmatic entities are shown in different colors. This makes reading and writing code much easier! Suggested text editors are Sublime Text and Zed; for a more fully featured IDE (integrated development environment) you can use VS Code. Jupyter notebooks are also highly recommended, as you can then interact with your data (i.e. see the output immediately), and notebooks also provide basic syntax highlighting. Once you are in your text editor, make sure to specify that you are reading/writing a python file.
Variables
Variables are familiar from algebra. In the expression x = 3, x is a variable that is assigned the value 3. We can then use x in later expressions, and it keeps its value; for example, after y = x + 5, y is assigned the value 8. This seemingly trivial concept is used extensively in programming languages, and in PhenEx we will be using variables extensively. There are two important reasons for using variables:
- To define something once and use it in many different places
- To break up a long expression into smaller, more easily understandable parts
Let's look at an example where we create a variable called one_year_pre_index and assign a PhenEx component called a RelativeTimeRangeFilter to it.
# create a relative time range filter looking one year pre-index
one_year_pre_index = RelativeTimeRangeFilter(
    when='before',
    min_days = GreaterThanOrEqualTo(0),
    max_days = LessThanOrEqualTo(365)
)
We can now re-use one_year_pre_index in multiple places! For example, we can create two code list phenotypes that both use the same relative time range filter.
phenotype1 = CodelistPhenotype(
    ...
    relative_time_range = one_year_pre_index
)

phenotype2 = CodelistPhenotype(
    ...
    relative_time_range = one_year_pre_index
)
Notice that we've created two new variables called phenotype1 and phenotype2. We can then use these phenotypes in downstream code, for example, using one of these as an anchor phenotype, or adding these phenotypes to a list of inclusion criteria.
In general, if you use some PhenEx component in more than one place, it is recommended to assign that component to a variable and then use the variable name everywhere the component is needed. This reduces errors when you need to make a change: you only need to change the component in one place, not in many. For example, if we decided to change our one_year_pre_index definition so that it does NOT include the index date (day 0), we would only have to change the definition of the variable one_year_pre_index, and this change would apply everywhere that variable is used.
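The "define once, change in one place" idea can be tried without PhenEx. The following is a minimal sketch in plain python; the variable names are made up for illustration and are not PhenEx components.

```python
# Define a value once ...
days_per_year = 365

# ... and reuse it in several places.
one_year = 1 * days_per_year
two_years = 2 * days_per_year

print(one_year, two_years)  # 365 730
```

If we edit the first line (for example to 366) and run the code again, both one_year and two_years pick up the new value without any further changes.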
Another reason to use variables is to make your code easier to read. Phenotypes can become quite complicated, with many PhenEx components being used, for example, for codelists, relative time range filters, categorical filters, anchor phenotypes and so on. We can use variables to break our code down into easily readable parts. As an illustration of how long PhenEx code can get, let us revisit the definition of phenotype1 from above, but this time NOT use the variable one_year_pre_index; notice that relative_time_range does NOT use the variable, and instead defines the RelativeTimeRangeFilter within the definition of the CodelistPhenotype.
# example of NOT using variables for relative_time_range
phenotype1 = CodelistPhenotype(
    ...
    relative_time_range = RelativeTimeRangeFilter(
        when='before',
        min_days = GreaterThanOrEqualTo(0),
        max_days = LessThanOrEqualTo(365)
    )
)
While doing this is possible, it is much harder to read this code! Using the variable one_year_pre_index defined separately would make this code easier to read.
Built-in datatypes
Now we know that variables exist and we can assign them values using an equal sign. But what can those values be? In the above example, we set x = 3, and we also saw things like phenotype1 = CodelistPhenotype(...).
When programming, it is important to know that variables have a datatype. Some other programming languages make this very explicit, and in those languages you have to state what datatype a variable is. In python, a variable's datatype is implicit: we do not state a variable's datatype, even though it has one! Depending on what datatype a variable is, we can do different things with it. PhenEx provides many different datatypes for you to use (e.g. CodelistPhenotype is a datatype).
There are several built-in datatypes provided by the python language itself that you should be familiar with. To use PhenEx we need to be able to recognize, manipulate, and finally create these built-in datatypes. This is for the most part not too difficult! These built-in datatypes are:
- int : integer values i.e. non-decimal numbers
- float : decimal numbers
- boolean : True or False
- string : a string of characters! recognizable by single or double quotations e.g. "this is a string" or 'this is a string'
- list : a list of other datatypes, starting/ending with square brackets, items separated by commas e.g. [1,2,3]
- dictionary : key value pairs, starting/ending with curly brackets, pairs separated by colon, items separated by commas, e.g. {"a":"value1"}
- none type : meaning no value is passed, use the capitalized None
Most of them are fairly easy to recognize; integer numbers and decimal numbers are very straightforward. The booleans are also easy; we just need to remember that we capitalize True and False. Similarly, if we want to be explicit that we are not setting a value, we can use the none-type, which is just a capitalized None. String datatypes are also relatively straightforward; we just need to remember to use matching single or double quotation marks at the beginning and end of the string.
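To check what datatype a variable has, python provides the built-in function type(). The short sketch below uses made-up variable names and values purely for illustration.

```python
age = 40                # an integer
bmi = 23.5              # a decimal number
is_enrolled = True      # a boolean
drug_name = "warfarin"  # a string
no_value = None         # the none type

print(type(age))          # <class 'int'>
print(type(bmi))          # <class 'float'>
print(type(is_enrolled))  # <class 'bool'>
print(type(drug_name))    # <class 'str'>
print(type(no_value))     # <class 'NoneType'>
```

Note that python's own names for these datatypes are slightly shorter than the everyday words: bool for boolean and str for string.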
The list and dictionary are a bit more complicated, but once mastered, are incredibly powerful. A list (called an array in most other languages) is simply an ordered list of things. Lists in python can contain anything you want to - the contents of a list do not need to have the same data type (this is not the case in most other programming languages). Remember, to recognize a list you look for square brackets and commas.
# a list containing phenotypes!
list_of_phenotypes = [phenotype1, phenotype2]
Once you've created your list, you can also access the items in the list. Remember, lists are ordered, and the order does not change! This means we access the items using their index (i.e. position) within the list. Somewhat confusingly, the index starts at 0 so the first item in the list is at index 0.
# write the name of the list you want to access, followed by square brackets containing the index of the item you want to access
list_of_phenotypes[0] # this will be phenotype1, the first item in the list
list_of_phenotypes[1] # this will be phenotype2, the second item in the list
In PhenEx, the only time you are required to create a list is to define inclusion/exclusion criteria and baseline characteristics. These are lists of phenotypes, so you simply have to know to put square brackets at the beginning and end, and commas between the phenotypes of interest, as seen above.
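A few more list operations are worth knowing when building up criteria. This is a small sketch in plain python; the strings are placeholders standing in for phenotypes, not real PhenEx objects.

```python
# placeholder strings standing in for baseline characteristic phenotypes
characteristics = ["age", "sex", "bmi"]

print(len(characteristics))   # 3, the number of items in the list
print(characteristics[0])     # age, the first item
print(characteristics[-1])    # bmi, a negative index counts from the end

# add another item to the end of the list
characteristics.append("smoking status")
print(characteristics)        # ['age', 'sex', 'bmi', 'smoking status']
```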
Dictionaries
Next up is dictionaries! First, to recognize a dictionary, look for curly brackets, colons and commas. Dictionaries are different from lists in that we access items in the dictionary not by their positional index (as in a list), but rather by something we define ourselves called the 'key'. When creating the dictionary, we define both the keys and the values the keys should correspond to. In the following example, we create a dictionary that represents a person, with keys to the left of the colon, values to the right, and commas between key/value pairs.
person1 = {
    'name':'Alice', # create a key, the string 'name', and assign it the string value 'Alice'
    'age':40, # create a key, the string 'age', and assign it the integer value 40
    'occupation':'researcher'
}
We access values in a dictionary by using square brackets containing the key whose value we want to access.
# write the name of dictionary you want to access, followed by square brackets containing the key you want to access
person1['name'] # we want the value for the key which is a string with value 'name'
'Alice'
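It also helps to know how to check whether a key exists, how to add a key, and what happens when we ask for a key that is missing. The sketch below reuses the person1 example; the extra keys are made up for illustration.

```python
person1 = {"name": "Alice", "age": 40, "occupation": "researcher"}

# check whether a key exists before using it
print("age" in person1)       # True
print("height" in person1)    # False

# .get() returns None instead of raising an error when the key is missing
print(person1.get("height"))  # None

# add a new key/value pair
person1["height_cm"] = 170
print(list(person1.keys()))   # ['name', 'age', 'occupation', 'height_cm']

# square brackets with a missing key raise a KeyError
try:
    person1["weight"]
except KeyError as error:
    print("missing key:", error)  # missing key: 'weight'
```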
When to use dictionaries in PhenEx
In PhenEx the only time you are required to create a dictionary is when onboarding a new datasource. However, after this data onboarding process, we will continue to use that dictionary for almost all phenotypes. Therefore, it is important to understand the key/value concept of dictionaries. Each phenotype accesses the onboarded dataset's mapped_tables dictionary using a key. For example, each CodelistPhenotype has a domain parameter; this domain is a key within the mapped_tables dictionary. If that key does not exist, PhenEx will throw an error.
In PhenEx, we can use built in mapped datasets such as OMOP.
from phenex.mappers import OMOPDomains

# con is assumed to be an existing database connection, created elsewhere
omop_mapped_tables = OMOPDomains.get_mapped_tables(con)
print(omop_mapped_tables.keys()) # this will list the keys of the mapped tables dictionary

phenotype3 = CodelistPhenotype(
    ...
    domain = 'CONDITION_OCCURRENCE' # this must be one of the keys in our mapped_tables dictionary
)
Indentation
All programming languages have a way of beginning and ending a continuous block of code. In many other programming languages, curly brackets are used. In python, line indentation (the number of spaces or tabs at the beginning of a line) is used! This means that python code must be indented correctly: if the indentation is incorrect, the code will not run! Generally, consecutive lines at the same level of indentation (the same number of spaces/tabs at the beginning of the line) are all part of one block of code. When the indentation level changes, we have entered or exited a code block!
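To see indentation in action, here is a small sketch using a for loop and an if statement, two python constructs that each start an indented block. The ages are made-up example values.

```python
ages = [34, 67, 81]
older_adults = []

for age in ages:
    # indented once: this block runs once for every item in the list
    if age >= 65:
        # indented twice: this block runs only when the condition is true
        older_adults.append(age)

# no indentation: we have left both blocks, so this runs once, after the loop
print(older_adults)  # [67, 81]
```

A common convention is to use four spaces per indentation level and to avoid mixing tabs and spaces within one file.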
Functions
A function in Python is a block of reusable code that performs a specific task. Functions help us organize our code and make it more readable. We first write a function, and then we can run the code in the function later by calling the function.
Functions in python begin with def. Then comes the function name, followed by parentheses containing input parameters, and then a colon. The following lines are then indented and contain the body of the function. The last line of the function is an optional return statement, which defines the output of the function (if any).
Let's create a function called 'greet' that takes one parameter, which is assigned to a variable called 'name'. The function will print 'Hello, ' followed by the name.
def greet(name):
    print("Hello, " + name)
Now we have defined the function! We haven't run the code in that function yet. Let's run the function, passing the parameter 'Alice'.
greet("Alice") # Output: Hello, Alice
Hello, Alice
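The greet function prints something but does not return a value. To illustrate the return statement, here is a minimal sketch of a function that hands a result back to the caller; the function name days_to_years is made up for this example.

```python
def days_to_years(days):
    # the body is indented; it runs only when the function is called
    years = days / 365
    return years  # the output of the function

# call the function and store its output in a variable
result = days_to_years(730)
print(result)  # 2.0
```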
When writing a PhenEx cohort definition we will probably not need to understand much about functions. However, for complicated cohorts, it is generally recommended to break up our definition into logical components as follows :
def create_my_cohort():
    # define the entry criterion
    entry_criterion = CodelistPhenotype(...)
    # call all functions that further define our cohort
    inclusion_criteria = create_inclusion_criteria(entry_criterion)
    exclusion_criteria = create_exclusion_criteria(entry_criterion)
    baseline_characteristics = create_baseline_characteristics(entry_criterion)
    # put the cohort together using all components we have created
    cohort = Cohort(
        entry = entry_criterion,
        inclusion = inclusion_criteria,
        exclusion = exclusion_criteria,
        characteristics = baseline_characteristics
    )
    return cohort

def create_inclusion_criteria(entry_criterion):
    # ... create inclusion criteria here
    return inclusion_criteria

def create_exclusion_criteria(entry_criterion):
    # ... create exclusion criteria here
    return exclusion_criteria

def create_baseline_characteristics(entry_criterion):
    # ... create baseline characteristics here
    return baseline_characteristics

# call the function that builds the entire cohort
cohort = create_my_cohort()
cohort.execute()
Here we see that we can use functions to break up the definition of the inclusion criteria, exclusion criteria and baseline characteristics. Further, if we have a single phenotype that is very complicated, with many dependent phenotypes (for example, CHA2DS2-VASc), we can write a function that encapsulates (i.e. contains) that phenotype.
Keyword Arguments
We saw in the above 'functions' section that functions take parameters (the part within parentheses). We define in our function definition what parameters a function takes; in our 'greet' function we had a parameter called 'name'. When calling a function, we can explicitly use the parameter name defined in the function definition. This is how we use 'keyword arguments': we name the parameter and set its value when we call the function.
Let's see this by calling our greet function, explicitly using the name keyword.
greet(name='Alice') # use the keyword 'name' when calling greet
Hello, Alice
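Keyword arguments become more useful when a function has several parameters, some of which have default values. The following sketch uses a made-up function, describe_window, which is not part of PhenEx; it only illustrates how keyword arguments and defaults behave.

```python
def describe_window(when, min_days=0, max_days=365):
    # min_days and max_days have default values, so they are optional
    return f"{min_days} to {max_days} days {when} index"

# positional argument only; the defaults are used for the rest
print(describe_window("before"))                    # 0 to 365 days before index

# keyword arguments make it clear which parameter each value belongs to
print(describe_window(when="before", max_days=90))  # 0 to 90 days before index

# with keywords, the order of the arguments no longer matters
print(describe_window(max_days=30, when="after"))   # 0 to 30 days after index
```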
When using components provided by PhenEx, for example CodelistPhenotype, we make extensive use of keyword arguments! (A side note: most components provided by PhenEx are classes, not functions, but the concept of keyword arguments is the same.) We use PhenEx components by assigning values to the required keyword arguments, thus parameterizing PhenEx so that it extracts the data we want it to extract. When we go to the documentation of PhenEx, we see a list of the keyword arguments that a PhenEx component can receive; we need to assign values to these keyword arguments. These keyword arguments thus tell us what PhenEx is capable of doing, and what kind of questions we need to ask ourselves when constructing our cohorts.
phenotype4 = CodelistPhenotype(
    name = 'atrial fibrillation', # the keyword argument name takes a value of type string; we can define the name of our phenotype!
    domain = 'CONDITION_OCCURRENCE', # the keyword argument domain takes a value of type string; this must be one of the keys in our mapped_tables dictionary
    codelist = Codelist(['c1']), # the keyword argument codelist takes a value of type Codelist; we need to create this!
    relative_time_range = one_year_pre_index # the keyword argument relative_time_range takes a value of type RelativeTimeRangeFilter, which we must create
)
Scope
In programming, "scope" refers to the region of the code where a particular variable or function is accessible. For our purposes there are two types of scope:
- Global Scope: Variables defined outside of any function have global scope. They can be accessed from anywhere in the file after they have been defined.
- Local Scope: Variables defined inside a function have local scope. They can only be accessed within that function.
In Python, the body of a function is marked by indentation, so a variable created inside the indented body of a function is local to that function. (Other indented blocks, such as if statements and for loops, do not create a new scope.) Here's a simple example to illustrate scope in Python. We will create a variable with global scope called x, and a variable with local scope only called y. Notice that x is available everywhere, while y is only available within the function!
# we create a variable called x with global scope (no indentation! not in a function)
x = 10

def some_function():
    # create a variable called y with local scope
    y = 5
    print("Inside function, x:", x) # Accessing global variable
    print("Inside function, y:", y) # Accessing local variable

some_function()
print("Outside function, x:", x) # Accessing global variable
# print("Outside function, y:", y) # This would cause an error because y is not in the global scope
Inside function, x: 10
Inside function, y: 5
Outside function, x: 10
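To see what the error looks like without stopping the program, we can catch it. This is a small sketch with made-up names; try/except is python's way of handling an error instead of crashing.

```python
def make_local_variable():
    local_only = 5  # exists only inside this function
    return local_only

returned_value = make_local_variable()
print(returned_value)  # 5, the value was handed back by return

error_message = ""
try:
    print(local_only)  # the name itself is not defined out here
except NameError as error:
    error_message = str(error)

print(error_message)  # name 'local_only' is not defined
```

The value can leave the function through return, but the variable name cannot be used outside of it.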
In PhenEx, we need to be careful of scope when defining variables that should be reused. For example, we may want to use the variable one_year_pre_index in multiple functions, such as in the functions create_inclusion_criteria and create_baseline_characteristics. If we want it to be available to both of these functions but only want to define it once, we have two options :
- define one_year_pre_index as a global variable i.e. we don't define it in a function. To do this, just put it at the top of the file with no indents
- pass one_year_pre_index to each function. This means we add a keyword argument 'relative_time_range' to all functions that should use one_year_pre_index, and then when calling that function, pass it our one_year_pre_index variable.
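The two options above can be sketched in plain python as follows. The string lookback is a placeholder standing in for a RelativeTimeRangeFilter, and the function bodies are simplified stand-ins, not real PhenEx code.

```python
# Option 1: define the variable globally, at the top of the file with no indent
lookback = "one year before index"  # placeholder for a RelativeTimeRangeFilter

def create_inclusion_criteria():
    # reads the global variable directly
    return ["inclusion using " + lookback]

# Option 2: pass the variable to the function as a keyword argument
def create_baseline_characteristics(relative_time_range):
    # uses only what was passed in
    return ["characteristic using " + relative_time_range]

inclusion = create_inclusion_criteria()
characteristics = create_baseline_characteristics(relative_time_range=lookback)

print(inclusion)        # ['inclusion using one year before index']
print(characteristics)  # ['characteristic using one year before index']
```

Passing the variable in (option 2) makes it visible at the call site which definition a function depends on, at the cost of a little more typing.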
Imports
PhenEx is a python library that provides many components to create a cohort. To use PhenEx, you first need to install it, and then you need to import the components you want to use. Import statements can be anywhere in a file, but it is recommended to keep them at the top of your file. Then, for each PhenEx component you want to use, you import it. Look into the PhenEx documentation for where to import things from.
# import phenotypes one by one
from phenex.phenotypes import CodelistPhenotype
from phenex.phenotypes import MeasurementPhenotype

# import multiple phenotypes at once. This is identical to above
from phenex.phenotypes import (CodelistPhenotype, MeasurementPhenotype)
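The same import syntax works for any python library. As an example that runs without installing anything, the sketch below imports from datetime, a module that ships with python; the dates are made-up example values.

```python
# import a whole module; its contents are then accessed with a dot
import datetime

# import specific names from a module; they can then be used directly
from datetime import date, timedelta

index_date = date(2020, 1, 1)
one_year_before = index_date - timedelta(days=365)
print(one_year_before)  # 2019-01-01

# both import styles give access to the same thing
same_date = datetime.date(2020, 1, 1)
print(same_date == index_date)  # True
```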