Codelist

Codelist is a class that allows us to conveniently work with medical codes used in RWD analyses. A Codelist represents a (single) specific medical concept, such as 'atrial fibrillation' or 'myocardial infarction'. A Codelist is associated with a set of medical codes from one or multiple source vocabularies (such as ICD10CM or CPT); we call these vocabularies 'code types'. Code type is important, as there are no assurances that codes from different vocabularies (different code types) do not overlap. It is therefore highly recommended to always specify the code type when using a codelist.

Codelist is a simple class that stores the codelist as a dictionary. The dictionary is keyed by code type, and each value is a list of codes. Codelist also provides convenience methods for reading codelists from Excel, CSV, or YAML files and for exporting them to Excel files.

Fuzzy codelists allow the use of '%' as a wildcard character in codes. This can be useful when you want to match a range of codes that share a common prefix. For example, 'I48.%' will match any code that starts with 'I48.'. Multiple fuzzy matches can be passed just like ordinary codes in a list.

If a codelist contains more than 100 fuzzy codes, a warning will be issued as performance may suffer significantly.
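As a rough illustration of the wildcard semantics, the sketch below translates a fuzzy code into a regular expression. It assumes '%' behaves like the SQL LIKE wildcard (any run of characters, including none); `fuzzy_to_regex` and `matches_any` are hypothetical helpers written for this example and are not part of the Codelist API.

```python
import re


def fuzzy_to_regex(pattern: str) -> re.Pattern:
    # Escape the literal pieces so the '.' in 'I48.%' is not treated as a
    # regex wildcard, then join them with '.*' wherever a '%' appeared.
    pieces = [re.escape(piece) for piece in pattern.split("%")]
    return re.compile(".*".join(pieces))


def matches_any(code: str, patterns: list) -> bool:
    # A code matches if the whole code fits at least one pattern.
    return any(fuzzy_to_regex(p).fullmatch(code) for p in patterns)


patterns = ["I48.%", "427.31"]  # fuzzy and exact codes can be mixed
print(matches_any("I48.91", patterns))  # True
print(matches_any("427.31", patterns))  # True
print(matches_any("I49.0", patterns))   # False
```

Every fuzzy code adds one more pattern to test against each candidate code, which is one plausible reason a large number of fuzzy codes slows matching down.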

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| name | Optional[str] | Descriptive name of the codelist. | None |
| codelist | Union[str, List, Dict[str, List]] | The codes, given as a string, a list of strings, or a dictionary keyed by code type. In the first two cases, the class converts the input to a dictionary with the single key None, so all consumers of a Codelist instance can assume the dictionary format. | required |
| use_code_type | Optional[bool] | Whether the code type should be used. Set to False automatically when the only key is None. | True |
| remove_punctuation | Optional[bool] | Whether punctuation ('.') should be removed from codes. | False |

Methods:

| Name | Description |
|------|-------------|
| from_yaml | Load a codelist from a YAML file. |
| from_excel | Load a codelist from an Excel file. |
| from_csv | Load a codelist from a CSV file. |
File Formats

YAML: The YAML file should contain a dictionary where the keys are code types (e.g., "ICD-9", "ICD-10") and the values are lists of codes for each type.

Example:

ICD-9:
  - "427.31"  # Atrial fibrillation
ICD-10:
  - "I48.0"   # Paroxysmal atrial fibrillation
  - "I48.1"   # Persistent atrial fibrillation
  - "I48.2"   # Chronic atrial fibrillation
  - "I48.91"  # Unspecified atrial fibrillation

Excel: The Excel file should contain a minimum of two columns for code and code_type. If multiple codelists exist in the same table, an additional column for codelist names is required.

Example (Single codelist):

| code_type | code   |
|-----------|--------|
| ICD-9     | 427.31 |
| ICD-10    | I48.0  |
| ICD-10    | I48.1  |
| ICD-10    | I48.2  |
| ICD-10    | I48.91 |

Example (Multiple codelists):

| code_type | code   | codelist           |
|-----------|--------|--------------------|
| ICD-9     | 427.31 | atrial_fibrillation|
| ICD-10    | I48.0  | atrial_fibrillation|
| ICD-10    | I48.1  | atrial_fibrillation|
| ICD-10    | I48.2  | atrial_fibrillation|
| ICD-10    | I48.91 | atrial_fibrillation|

CSV: The CSV file should follow the same format as the Excel file, with columns for code, code_type, and optionally codelist names.
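To show how the tabular layout maps onto the dictionary format, here is a minimal sketch using only the standard library. It is not the library's loader (which reads the file with pandas); `rows_to_code_dict` is a hypothetical helper, and the myocardial infarction row is invented sample data so that the filtering step has something to exclude.

```python
import csv
import io
from collections import defaultdict

# In-memory stand-in for a CSV file holding two codelists.
csv_text = """code_type,code,codelist
ICD-9,427.31,atrial_fibrillation
ICD-10,I48.0,atrial_fibrillation
ICD-10,I48.91,atrial_fibrillation
ICD-10,I21.9,myocardial_infarction
"""


def rows_to_code_dict(text, codelist_name=None):
    # Group codes by code type, optionally keeping only one named codelist.
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        if codelist_name is not None and row["codelist"] != codelist_name:
            continue
        grouped[row["code_type"]].append(row["code"])
    return dict(grouped)


print(rows_to_code_dict(csv_text, codelist_name="atrial_fibrillation"))
# {'ICD-9': ['427.31'], 'ICD-10': ['I48.0', 'I48.91']}
```

The resulting dictionary has the same shape as the one accepted by the Codelist constructor.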

Example:

# Initialize with a list
cl = Codelist(
    ['x', 'y', 'z'],
    'mycodelist'
)
print(cl.codelist)
# {None: ['x', 'y', 'z']}

Example:

# Initialize with a string
cl = Codelist(
    'SBP'
)
print(cl.codelist)
# {None: ['SBP']}

Example:

# Initialize with a dictionary
atrial_fibrillation_icd_codes = {
    "ICD-9": [
        "427.31"  # Atrial fibrillation
    ],
    "ICD-10": [
        "I48.0",  # Paroxysmal atrial fibrillation
        "I48.1",  # Persistent atrial fibrillation
        "I48.2",  # Chronic atrial fibrillation
        "I48.91", # Unspecified atrial fibrillation
    ]
}
cl = Codelist(
    atrial_fibrillation_icd_codes,
    'atrial_fibrillation',
)
print(cl.codelist)
# {'ICD-9': ['427.31'], 'ICD-10': ['I48.0', 'I48.1', 'I48.2', 'I48.91']}

Example:

# Initialize with a fuzzy codelist; all code types go in a single dictionary
anemia = Codelist(
    {
        'ICD10CM': ['D55%', 'D56%', 'D57%', 'D58%', 'D59%', 'D60%'],
        'ICD9CM': ['284%', '285%', '282%'],
    },
    'fuzzy_codelist'
)
Source code in phenex/codelists/codelists.py
class Codelist:
    """
    Codelist is a class that allows us to conveniently work with medical codes used in RWD analyses. A Codelist represents a (single) specific medical concept, such as 'atrial fibrillation' or 'myocardial infarction'. A Codelist is associated with a set of medical codes from one or multiple source vocabularies (such as ICD10CM or CPT); we call these vocabularies 'code types'. Code type is important, as there are no assurances that codes from different vocabularies (different code types) do not overlap. It is therefore highly recommended to always specify the code type when using a codelist.

    Codelist is a simple class that stores the codelist as a dictionary. The dictionary is keyed by code type, and each value is a list of codes. Codelist also provides convenience methods for reading codelists from Excel, CSV, or YAML files and for exporting them to Excel files.

    Fuzzy codelists allow the use of '%' as a wildcard character in codes. This can be useful when you want to match a range of codes that share a common prefix. For example, 'I48.%' will match any code that starts with 'I48.'. Multiple fuzzy matches can be passed just like ordinary codes in a list.

    If a codelist contains more than 100 fuzzy codes, a warning will be issued as performance may suffer significantly.

    Parameters:
        name: Descriptive name of codelist
        codelist: The codes, given as a string, a list of strings, or a dictionary keyed by code type. In the first two cases, the class converts the input to a dictionary with the single key None, so all consumers of a Codelist instance can assume the dictionary format.
        use_code_type: User can define whether code type should be used or not.
        remove_punctuation: User can define whether punctuation should be removed from codes or not.

    Methods:
        from_yaml: Load a codelist from a YAML file.
        from_excel: Load a codelist from an Excel file.
        from_csv: Load a codelist from a CSV file.

    File Formats:
        YAML:
        The YAML file should contain a dictionary where the keys are code types
        (e.g., "ICD-9", "ICD-10") and the values are lists of codes for each type.

        Example:
        ```yaml
        ICD-9:
          - "427.31"  # Atrial fibrillation
        ICD-10:
          - "I48.0"   # Paroxysmal atrial fibrillation
          - "I48.1"   # Persistent atrial fibrillation
          - "I48.2"   # Chronic atrial fibrillation
          - "I48.91"  # Unspecified atrial fibrillation
        ```

        Excel:
        The Excel file should contain a minimum of two columns for code and code_type. If multiple codelists exist in the same table, an additional column for codelist names is required.

        Example (Single codelist):
        ```markdown
        | code_type | code   |
        |-----------|--------|
        | ICD-9     | 427.31 |
        | ICD-10    | I48.0  |
        | ICD-10    | I48.1  |
        | ICD-10    | I48.2  |
        | ICD-10    | I48.91 |
        ```

        Example (Multiple codelists):
        ```markdown
        | code_type | code   | codelist           |
        |-----------|--------|--------------------|
        | ICD-9     | 427.31 | atrial_fibrillation|
        | ICD-10    | I48.0  | atrial_fibrillation|
        | ICD-10    | I48.1  | atrial_fibrillation|
        | ICD-10    | I48.2  | atrial_fibrillation|
        | ICD-10    | I48.91 | atrial_fibrillation|
        ```

        CSV:
        The CSV file should follow the same format as the Excel file, with columns for code, code_type, and optionally codelist names.

    Example:
    ```python
    # Initialize with a list
    cl = Codelist(
        ['x', 'y', 'z'],
        'mycodelist'
        )
    print(cl.codelist)
    {None: ['x', 'y', 'z']}
    ```

    Example:
    ```python
    # Initialize with string
    cl = Codelist(
        'SBP'
        )
    print(cl.codelist)
    {None: ['SBP']}
    ```

    Example:
    ```python
    # Initialize with a dictionary
    atrial_fibrillation_icd_codes = {
        "ICD-9": [
            "427.31"  # Atrial fibrillation
        ],
        "ICD-10": [
            "I48.0",  # Paroxysmal atrial fibrillation
            "I48.1",  # Persistent atrial fibrillation
            "I48.2",  # Chronic atrial fibrillation
            "I48.91", # Unspecified atrial fibrillation
        ]
    }
    cl = Codelist(
        atrial_fibrillation_icd_codes,
        'atrial_fibrillation',
    )
    print(cl.codelist)
    # {'ICD-9': ['427.31'], 'ICD-10': ['I48.0', 'I48.1', 'I48.2', 'I48.91']}
    ```

    ```python
    # Initialize with a fuzzy codelist; all code types go in a single dictionary
    anemia = Codelist(
        {
            'ICD10CM': ['D55%', 'D56%', 'D57%', 'D58%', 'D59%', 'D60%'],
            'ICD9CM': ['284%', '285%', '282%'],
        },
        'fuzzy_codelist'
    )
    ```
    """

    def __init__(
        self,
        codelist: Union[str, List, Dict[str, List]],
        name: Optional[str] = None,
        use_code_type: Optional[bool] = True,
        remove_punctuation: Optional[bool] = False,
    ) -> None:
        self.name = name

        if isinstance(codelist, dict):
            self.codelist = codelist
        elif isinstance(codelist, list):
            self.codelist = {None: codelist}
        elif isinstance(codelist, str):
            if name is None:
                self.name = codelist
            self.codelist = {None: [codelist]}
        else:
            raise TypeError("Input codelist must be a dictionary, list, or string.")

        if list(self.codelist.keys()) == [None]:
            self.use_code_type = False
        else:
            self.use_code_type = use_code_type

        self.remove_punctuation = remove_punctuation

        self.fuzzy_match = False
        for code_type, codelist in self.codelist.items():
            if any(["%" in str(code) for code in codelist]):
                self.fuzzy_match = True
                if len(codelist) > 100:
                    warnings.warn(
                        f"Detected fuzzy codelist match with > 100 regex's for code type {code_type}. Performance may suffer significantly."
                    )

        self._resolved_codelist = None

    def copy(
        self,
        name: Optional[str] = None,
        use_code_type: bool = True,
        remove_punctuation: bool = False,
        rename_code_type: dict = None,
    ) -> "Codelist":
        """
        Codelists are immutable. To change how a codelist is resolved, make a copy of it with different resolution parameters.

        Parameters:
            name: Name for newly created code list if different from the old one.
            use_code_type: If False, merge all the code lists into one with None as the key.
            remove_punctuation: If True, remove '.' from all codes.
            rename_code_type: Dictionary defining code types that should be renamed. For example, if the original code type is 'ICD-10-CM', but it is 'ICD10' in the database, we must rename the code type. This keyword argument is a dictionary with keys being the current code type and the value being the desired code type. Code types not included in the mapping are left unchanged.

        Returns:
            Codelist instance with the updated resolution options.
        """
        _codelist = self.codelist.copy()
        if rename_code_type is not None and isinstance(rename_code_type, dict):
            for current, renamed in rename_code_type.items():
                if _codelist.get(current) is not None:
                    _codelist[renamed] = _codelist[current]
                    del _codelist[current]

        return Codelist(
            _codelist,
            name=name or self.name,
            use_code_type=use_code_type,
            remove_punctuation=remove_punctuation,
        )

    @property
    def resolved_codelist(self):
        """
        Retrieve the actual codelists used for filtering after processing for punctuation and code type options (see __init__()).
        """
        if self._resolved_codelist is None:
            resolved_codelist = {}

            for code_type, codes in self.codelist.items():
                if self.remove_punctuation:
                    codes = [code.replace(".", "") for code in codes]
                if self.use_code_type:
                    resolved_codelist[code_type] = codes
                else:
                    if None not in resolved_codelist:
                        resolved_codelist[None] = []
                    resolved_codelist[None] = list(
                        set(resolved_codelist[None]) | set(codes)
                    )
            self._resolved_codelist = resolved_codelist

        return self._resolved_codelist

    @classmethod
    def from_yaml(cls, path: str) -> "Codelist":
        """
        Load a codelist from a yaml file.

        The YAML file should contain a dictionary where the keys are code types
        (e.g., "ICD-9", "ICD-10") and the values are lists of codes for each type.

        Example:
        ```yaml
        ICD-9:
          - "427.31"  # Atrial fibrillation
        ICD-10:
          - "I48.0"   # Paroxysmal atrial fibrillation
          - "I48.1"   # Persistent atrial fibrillation
          - "I48.2"   # Chronic atrial fibrillation
          - "I48.91"  # Unspecified atrial fibrillation
        ```

        Parameters:
            path: Path to the YAML file.

        Returns:
            Codelist instance.
        """
        import yaml

        with open(path, "r") as f:
            data = yaml.safe_load(f)
        return cls(
            data, name=os.path.basename(path.replace(".yaml", "").replace(".yml", ""))
        )

    @classmethod
    def from_excel(
        cls,
        path: str,
        sheet_name: Optional[str] = None,
        codelist_name: Optional[str] = None,
        code_column: Optional[str] = "code",
        code_type_column: Optional[str] = "code_type",
        codelist_column: Optional[str] = "codelist",
    ) -> "Codelist":
        """
         Load a single codelist located in an Excel file.

         The Excel file must contain at least two columns, one for code and one for code_type. The actual column names can be specified using the code_column and code_type_column parameters.

         If multiple codelists exist in the same excel table, the codelist_column and codelist_name are required to point to the specific codelist of interest.

         It is possible to specify the sheet name if the codelist is in a specific sheet.

         1. Single table, single codelist : The table (whether an entire excel file, or a single sheet in an excel file) contains only one codelist. The table should have columns for code and code_type.

             ```markdown
             | code_type | code   |
             |-----------|--------|
             | ICD-9     | 427.31 |
             | ICD-10    | I48.0  |
             | ICD-10    | I48.1  |
             | ICD-10    | I48.2  |
             | ICD-10    | I48.91 |
             ```

        2. Single table, multiple codelists: A single table (whether an entire file, or a single sheet in an excel file) contains multiple codelists. A column for the name of each codelist is required. Use codelist_name to point to the specific codelist of interest.

             ```markdown
             | code_type | code   | codelist           |
             |-----------|--------|--------------------|
             | ICD-9     | 427.31 | atrial_fibrillation|
             | ICD-10    | I48.0  | atrial_fibrillation|
             | ICD-10    | I48.1  | atrial_fibrillation|
             | ICD-10    | I48.2  | atrial_fibrillation|
             | ICD-10    | I48.91 | atrial_fibrillation|
             ```

         Parameters:
             path: Path to the Excel file.
             sheet_name: An optional label for the sheet to read from. If defined, the codelist will be taken from that sheet. If no sheet_name is defined, the first sheet is taken.
             codelist_name: An optional name of the codelist which to extract. If defined, codelist_column must be present and the codelist_name must occur within the codelist_column.
             code_column: The name of the column containing the codes.
             code_type_column: The name of the column containing the code types.
             codelist_column: The name of the column containing the codelist names.

         Returns:
             Codelist instance.
        """
        import pandas as pd

        if sheet_name is None:
            _df = pd.read_excel(path)
        else:
            xl = pd.ExcelFile(path)
            if sheet_name not in xl.sheet_names:
                raise ValueError(
                    f"Sheet name {sheet_name} not found in the Excel file."
                )
            _df = xl.parse(sheet_name)

        if codelist_name is not None:
            # codelist name is not none, therefore we subset the table to the current codelist
            _df = _df[_df[codelist_column] == codelist_name]

        code_dict = _df.groupby(code_type_column)[code_column].apply(list).to_dict()

        if codelist_name is not None:
            name = codelist_name
        elif sheet_name is not None:
            name = sheet_name
        else:
            name = path.split(os.sep)[-1].replace(".xlsx", "")

        return cls(code_dict, name=name)

    @classmethod
    def from_csv(
        cls,
        path: str,
        codelist_name: Optional[str] = None,
        code_column: Optional[str] = "code",
        code_type_column: Optional[str] = "code_type",
        codelist_column: Optional[str] = "codelist",
    ) -> "Codelist":
        _df = pd.read_csv(path)

        if codelist_name is not None:
            # codelist name is not none, therefore we subset the table to the current codelist
            _df = _df[_df[codelist_column] == codelist_name]

        code_dict = _df.groupby(code_type_column)[code_column].apply(list).to_dict()

        # Prefer the explicit codelist name; otherwise fall back to the file name
        if codelist_name is not None:
            name = codelist_name
        else:
            name = path.split(os.sep)[-1].replace(".csv", "")

        return cls(code_dict, name=name)

    @classmethod
    def from_medconb(cls, codelist):
        """
        Converts a MedConB style Codelist into a PhenEx style codelist.

        Example:

        ```python
        from medconb_client import Client
        endpoint = "https://api.medconb.example.com/graphql/"
        token = get_token()
        client = Client(endpoint, token)

        medconb_codelist = client.get_codelist(
            codelist_id="9c4ad312-3008-4d95-9b16-6f9b21ec1ad9"
        )
        phenex_codelist = Codelist.from_medconb(medconb_codelist)
        ```
        """
        phenex_codelist = {}
        for codeset in codelist.codesets:
            phenex_codelist[codeset.ontology] = [c[0] for c in codeset.codes]
        return cls(codelist=phenex_codelist, name=codelist.name)

    def to_tuples(self) -> List[tuple]:
        """
        Convert the codelist to a list of tuples, where each tuple is of the form
        (code_type, code).
        """
        return sum(
            [[(ct, c) for c in self.codelist[ct]] for ct in self.codelist.keys()],
            [],
        )

    def __repr__(self):
        return f"""Codelist(
    name='{self.name}',
    codelist={self.codelist}
)"""

    def to_pandas(self) -> pd.DataFrame:
        """
        Export the codelist to a pandas DataFrame. The DataFrame will have three columns: code_type, code, and codelist.
        """

        _df = pd.DataFrame(self.to_tuples(), columns=["code_type", "code"])
        _df["codelist"] = self.name
        return _df

    def to_dict(self):
        return to_dict(self)

    def __add__(self, other):
        codetypes = list(set(list(self.codelist.keys()) + list(other.codelist.keys())))
        new_codelist = {}
        for codetype in codetypes:
            new_codelist[codetype] = list(
                set(self.codelist.get(codetype, []) + other.codelist.get(codetype, []))
            )
        if self.remove_punctuation != other.remove_punctuation:
            raise ValueError(
                "Cannot add codelists with non-matching remove_punctuation settings."
            )
        if self.use_code_type != other.use_code_type:
            raise ValueError(
                "Cannot add codelists with non-matching use_code_type settings."
            )

        return Codelist(
            new_codelist,
            name=f"({self.name}_union_{other.name})",
            remove_punctuation=self.remove_punctuation,
            use_code_type=self.use_code_type,
        )

    def __sub__(self, other):
        codetypes = list(self.codelist.keys())
        new_codelist = {}
        for codetype in codetypes:
            new_codelist[codetype] = [
                x
                for x in self.codelist.get(codetype, [])
                if x not in other.codelist.get(codetype, [])
            ]

        if self.remove_punctuation != other.remove_punctuation:
            raise ValueError(
                "Cannot create difference of codelists with non-matching remove_punctuation settings."
            )
        if self.use_code_type != other.use_code_type:
            raise ValueError(
                "Cannot create difference of codelists with non-matching use_code_type settings."
            )

        return Codelist(
            new_codelist,
            name=f"{self.name}_excluding_{other.name}",
            remove_punctuation=self.remove_punctuation,
            use_code_type=self.use_code_type,
        )

resolved_codelist property

Retrieve the actual codelists used for filtering after processing for punctuation and code type options (see __init__()).
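The resolution step can be pictured with a small standalone function. This is a sketch of the behavior described above rather than the library's code: `resolve` is a hypothetical name, only '.' is treated as punctuation, and the merged list is sorted purely to make the output deterministic.

```python
def resolve(codelist, use_code_type=True, remove_punctuation=False):
    resolved = {}
    for code_type, codes in codelist.items():
        if remove_punctuation:
            # Strip '.' so that 'I48.0' becomes 'I480'.
            codes = [code.replace(".", "") for code in codes]
        if use_code_type:
            resolved[code_type] = codes
        else:
            # Ignore code types: merge everything under the single key None.
            merged = set(resolved.get(None, [])) | set(codes)
            resolved[None] = sorted(merged)
    return resolved


codes = {"ICD-9": ["427.31"], "ICD-10": ["I48.0", "I48.91"]}

print(resolve(codes, remove_punctuation=True))
# {'ICD-9': ['42731'], 'ICD-10': ['I480', 'I4891']}

print(resolve(codes, use_code_type=False))
# {None: ['427.31', 'I48.0', 'I48.91']}
```

Merging through a set also removes codes that appear under more than one code type.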

copy(name=None, use_code_type=True, remove_punctuation=False, rename_code_type=None)

Codelists are immutable. To change how a codelist is resolved, make a copy of it with different resolution parameters.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| name | Optional[str] | Name for the newly created codelist, if different from the old one. | None |
| use_code_type | bool | If False, merge all the code lists into one with None as the key. | True |
| remove_punctuation | bool | If True, remove '.' from all codes. | False |
| rename_code_type | dict | Dictionary defining code types that should be renamed. For example, if the original code type is 'ICD-10-CM' but it is 'ICD10' in the database, the code type must be renamed. Keys are the current code types and values are the desired code types. Code types not included in the mapping are left unchanged. | None |

Returns:

| Type | Description |
|------|-------------|
| Codelist | Codelist instance with the updated resolution options. |
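The effect of rename_code_type can be sketched on a plain dictionary. The function below is a hypothetical stand-in for illustration, not the copy method itself, and the CPT entry is made-up sample data.

```python
def rename_code_types(codelist, rename_code_type=None):
    # Work on a shallow copy so the original mapping is left untouched.
    renamed = dict(codelist)
    for current, new in (rename_code_type or {}).items():
        if current in renamed:
            renamed[new] = renamed.pop(current)
    return renamed


original = {"ICD-10-CM": ["I48.0", "I48.91"], "CPT": ["93000"]}
in_database = rename_code_types(original, {"ICD-10-CM": "ICD10"})

print(in_database)
# {'CPT': ['93000'], 'ICD10': ['I48.0', 'I48.91']}
print(sorted(original))
# ['CPT', 'ICD-10-CM']
```

Code types that are absent from the mapping, such as CPT here, pass through unchanged.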

Source code in phenex/codelists/codelists.py
def copy(
    self,
    name: Optional[str] = None,
    use_code_type: bool = True,
    remove_punctuation: bool = False,
    rename_code_type: dict = None,
) -> "Codelist":
    """
    Codelists are immutable. To change how a codelist is resolved, make a copy of it with different resolution parameters.

    Parameters:
        name: Name for newly created code list if different from the old one.
        use_code_type: If False, merge all the code lists into one with None as the key.
        remove_punctuation: If True, remove '.' from all codes.
        rename_code_type: Dictionary defining code types that should be renamed. For example, if the original code type is 'ICD-10-CM', but it is 'ICD10' in the database, we must rename the code type. This keyword argument is a dictionary with keys being the current code type and the value being the desired code type. Code types not included in the mapping are left unchanged.

    Returns:
        Codelist instance with the updated resolution options.
    """
    _codelist = self.codelist.copy()
    if rename_code_type is not None and isinstance(rename_code_type, dict):
        for current, renamed in rename_code_type.items():
            if _codelist.get(current) is not None:
                _codelist[renamed] = _codelist[current]
                del _codelist[current]

    return Codelist(
        _codelist,
        name=name or self.name,
        use_code_type=use_code_type,
        remove_punctuation=remove_punctuation,
    )

from_excel(path, sheet_name=None, codelist_name=None, code_column='code', code_type_column='code_type', codelist_column='codelist') classmethod

Load a single codelist located in an Excel file.

The Excel file must contain at least two columns, one for code and one for code_type. The actual column names can be specified using the code_column and code_type_column parameters.

If multiple codelists exist in the same Excel table, the codelist_column and codelist_name parameters are required to select the codelist of interest.

If the codelist is in a specific sheet, specify it with sheet_name.

  1. Single table, single codelist: The table (whether an entire Excel file or a single sheet in an Excel file) contains only one codelist. The table should have columns for code and code_type.

    | code_type | code   |
    |-----------|--------|
    | ICD-9     | 427.31 |
    | ICD-10    | I48.0  |
    | ICD-10    | I48.1  |
    | ICD-10    | I48.2  |
    | ICD-10    | I48.91 |
    
  2. Single table, multiple codelists: A single table (whether an entire Excel file or a single sheet in an Excel file) contains multiple codelists. A column for the name of each codelist is required. Use codelist_name to select the codelist of interest.

    | code_type | code   | codelist           |
    |-----------|--------|--------------------|
    | ICD-9     | 427.31 | atrial_fibrillation|
    | ICD-10    | I48.0  | atrial_fibrillation|
    | ICD-10    | I48.1  | atrial_fibrillation|
    | ICD-10    | I48.2  | atrial_fibrillation|
    | ICD-10    | I48.91 | atrial_fibrillation|
    

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | Path to the Excel file. | required |
| sheet_name | Optional[str] | Optional label of the sheet to read from. If defined, the codelist is taken from that sheet; otherwise the first sheet is used. | None |
| codelist_name | Optional[str] | Optional name of the codelist to extract. If defined, codelist_column must be present and codelist_name must occur within it. | None |
| code_column | Optional[str] | The name of the column containing the codes. | 'code' |
| code_type_column | Optional[str] | The name of the column containing the code types. | 'code_type' |
| codelist_column | Optional[str] | The name of the column containing the codelist names. | 'codelist' |

Returns:

| Type | Description |
|------|-------------|
| Codelist | Codelist instance. |

Source code in phenex/codelists/codelists.py
@classmethod
def from_excel(
    cls,
    path: str,
    sheet_name: Optional[str] = None,
    codelist_name: Optional[str] = None,
    code_column: Optional[str] = "code",
    code_type_column: Optional[str] = "code_type",
    codelist_column: Optional[str] = "codelist",
) -> "Codelist":
    """
     Load a single codelist located in an Excel file.

     The Excel file must contain at least two columns, one for code and one for code_type. The actual column names can be specified using the code_column and code_type_column parameters.

     If multiple codelists exist in the same excel table, the codelist_column and codelist_name are required to point to the specific codelist of interest.

     It is possible to specify the sheet name if the codelist is in a specific sheet.

     1. Single table, single codelist : The table (whether an entire excel file, or a single sheet in an excel file) contains only one codelist. The table should have columns for code and code_type.

         ```markdown
         | code_type | code   |
         |-----------|--------|
         | ICD-9     | 427.31 |
         | ICD-10    | I48.0  |
         | ICD-10    | I48.1  |
         | ICD-10    | I48.2  |
         | ICD-10    | I48.91 |
         ```

    2. Single table, multiple codelists: A single table (whether an entire file, or a single sheet in an excel file) contains multiple codelists. A column for the name of each codelist is required. Use codelist_name to point to the specific codelist of interest.

         ```markdown
         | code_type | code   | codelist           |
         |-----------|--------|--------------------|
         | ICD-9     | 427.31 | atrial_fibrillation|
         | ICD-10    | I48.0  | atrial_fibrillation|
         | ICD-10    | I48.1  | atrial_fibrillation|
         | ICD-10    | I48.2  | atrial_fibrillation|
         | ICD-10    | I48.91 | atrial_fibrillation|
         ```

     Parameters:
         path: Path to the Excel file.
         sheet_name: An optional label for the sheet to read from. If defined, the codelist will be taken from that sheet. If no sheet_name is defined, the first sheet is taken.
         codelist_name: An optional name of the codelist which to extract. If defined, codelist_column must be present and the codelist_name must occur within the codelist_column.
         code_column: The name of the column containing the codes.
         code_type_column: The name of the column containing the code types.
         codelist_column: The name of the column containing the codelist names.

     Returns:
         Codelist instance.
    """
    import pandas as pd

    if sheet_name is None:
        _df = pd.read_excel(path)
    else:
        xl = pd.ExcelFile(path)
        if sheet_name not in xl.sheet_names:
            raise ValueError(
                f"Sheet name {sheet_name} not found in the Excel file."
            )
        _df = xl.parse(sheet_name)

    if codelist_name is not None:
        # codelist name is not none, therefore we subset the table to the current codelist
        _df = _df[_df[codelist_column] == codelist_name]

    code_dict = _df.groupby(code_type_column)[code_column].apply(list).to_dict()

    if codelist_name is not None:
        name = codelist_name
    elif sheet_name is not None:
        name = sheet_name
    else:
        name = path.split(os.sep)[-1].replace(".xlsx", "")

    return cls(code_dict, name=name)
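
The last branch of from_excel decides which name the resulting Codelist carries. The following is a minimal sketch of that precedence, written as a standalone function for illustration; the helper name `resolve_name` and the example path are assumptions, not part of the library.

```python
import os
from typing import Optional


def resolve_name(
    path: str,
    sheet_name: Optional[str] = None,
    codelist_name: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the precedence used in from_excel."""
    if codelist_name is not None:
        # An explicit codelist name wins over everything else.
        return codelist_name
    if sheet_name is not None:
        # Otherwise the sheet name is used.
        return sheet_name
    # Fall back to the file name with the ".xlsx" text removed.
    return path.split(os.sep)[-1].replace(".xlsx", "")


# Illustrative path, built with os.path.join so it uses the local separator.
path = os.path.join("codelists", "cardio.xlsx")

print(resolve_name(path))
print(resolve_name(path, sheet_name="af"))
print(resolve_name(path, sheet_name="af", codelist_name="atrial_fibrillation"))
```

Because the fallback splits on `os.sep`, a path written with a separator other than the local one is not split and the whole string (minus ".xlsx") becomes the name.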

from_medconb(codelist) classmethod

Converts a MedConB style Codelist into a PhenEx style codelist.

Example:

from medconb_client import Client
endpoint = "https://api.medconb.example.com/graphql/"
token = get_token()
client = Client(endpoint, token)

medconb_codelist = client.get_codelist(
    codelist_id="9c4ad312-3008-4d95-9b16-6f9b21ec1ad9"
)
phenex_codelist = Codelist.from_medconb(medconb_codelist)
Source code in phenex/codelists/codelists.py
@classmethod
def from_medconb(cls, codelist):
    """
    Converts a MedConB style Codelist into a PhenEx style codelist.

    Example:

    ```python
    from medconb_client import Client
    endpoint = "https://api.medconb.example.com/graphql/"
    token = get_token()
    client = Client(endpoint, token)

    medconb_codelist = client.get_codelist(
        codelist_id="9c4ad312-3008-4d95-9b16-6f9b21ec1ad9"
    )
    phenex_codelist = Codelist.from_medconb(medconb_codelist)
    ```
    """
    phenex_codelist = {}
    for codeset in codelist.codesets:
        phenex_codelist[codeset.ontology] = [c[0] for c in codeset.codes]
    return cls(codelist=phenex_codelist, name=codelist.name)
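
The conversion loop can be tried out without a MedConB server. The sketch below uses stand-in objects built with `SimpleNamespace`; the attribute names `codesets`, `ontology`, `codes` and `name` come from the source above, while the assumption that each code is a `(code, description)` pair is made only for this example (the source shows just that index 0 is used). The helper name `medconb_to_code_dict` is hypothetical.

```python
from types import SimpleNamespace
from typing import Dict, List


def medconb_to_code_dict(codelist) -> Dict[str, List[str]]:
    """Hypothetical helper: one key per ontology, first element of each entry as the code."""
    return {
        codeset.ontology: [code[0] for code in codeset.codes]
        for codeset in codelist.codesets
    }


# Stand-in object; a real MedConB codelist would come from the client.
fake_codelist = SimpleNamespace(
    name="atrial_fibrillation",
    codesets=[
        SimpleNamespace(
            ontology="ICD-10",
            codes=[
                ("I48.0", "Paroxysmal atrial fibrillation"),
                ("I48.1", "Persistent atrial fibrillation"),
            ],
        ),
        SimpleNamespace(
            ontology="ICD-9",
            codes=[("427.31", "Atrial fibrillation")],
        ),
    ],
)

code_dict = medconb_to_code_dict(fake_codelist)
print(code_dict)
```

The resulting dictionary has the same shape that the Codelist constructor accepts: code type as key, list of codes as value.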

from_yaml(path) classmethod

Load a codelist from a YAML file.

The YAML file should contain a dictionary where the keys are code types (e.g., "ICD-9", "ICD-10") and the values are lists of codes for each type.

Example:

ICD-9:
  - "427.31"  # Atrial fibrillation
ICD-10:
  - "I48.0"   # Paroxysmal atrial fibrillation
  - "I48.1"   # Persistent atrial fibrillation
  - "I48.2"   # Chronic atrial fibrillation
  - "I48.91"  # Unspecified atrial fibrillation

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | Path to the YAML file. | required |

Returns:

| Type | Description |
|------|-------------|
| Codelist | Codelist instance. |

Source code in phenex/codelists/codelists.py
@classmethod
def from_yaml(cls, path: str) -> "Codelist":
    """
    Load a codelist from a YAML file.

    The YAML file should contain a dictionary where the keys are code types
    (e.g., "ICD-9", "ICD-10") and the values are lists of codes for each type.

    Example:
    ```yaml
    ICD-9:
      - "427.31"  # Atrial fibrillation
    ICD-10:
      - "I48.0"   # Paroxysmal atrial fibrillation
      - "I48.1"   # Persistent atrial fibrillation
      - "I48.2"   # Chronic atrial fibrillation
      - "I48.91"  # Unspecified atrial fibrillation
    ```

    Parameters:
        path: Path to the YAML file.

    Returns:
        Codelist instance.
    """
    import yaml

    with open(path, "r") as f:
        data = yaml.safe_load(f)
    return cls(
        data, name=os.path.basename(path.replace(".yaml", "").replace(".yml", ""))
    )
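
The YAML example quotes every code. In YAML an unquoted scalar such as 427.31 is read as a floating-point number rather than a string, so it can be worth checking the parsed data before building a codelist from it. The sketch below is one way to do that on an already parsed dictionary; the helper name `check_code_dict` is hypothetical and is not part of the library.

```python
from typing import Any, Dict, List


def check_code_dict(data: Any) -> Dict[str, List[str]]:
    """Hypothetical check: expect a mapping of code type -> list of string codes."""
    if not isinstance(data, dict):
        raise TypeError("Expected a mapping of code type to list of codes.")
    for code_type, codes in data.items():
        if not isinstance(codes, list):
            raise TypeError(f"Codes for {code_type!r} must be a list.")
        bad = [code for code in codes if not isinstance(code, str)]
        if bad:
            raise TypeError(f"Non-string codes for {code_type!r}: {bad}")
    return data


# What a YAML parser returns for quoted entries such as - "427.31".
quoted = {"ICD-9": ["427.31"], "ICD-10": ["I48.0", "I48.1"]}
check_code_dict(quoted)

# What an unquoted entry such as - 427.31 turns into: a float.
unquoted = {"ICD-9": [427.31]}
try:
    check_code_dict(unquoted)
except TypeError as err:
    print(err)
```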

to_pandas()

Export the codelist to a pandas DataFrame. The DataFrame will have three columns: code_type, code, and codelist.

Source code in phenex/codelists/codelists.py
def to_pandas(self) -> pd.DataFrame:
    """
    Export the codelist to a pandas DataFrame. The DataFrame will have three columns: code_type, code, and codelist.
    """

    _df = pd.DataFrame(self.to_tuples(), columns=["code_type", "code"])
    _df["codelist"] = self.name
    return _df

to_tuples()

Convert the codelist to a list of tuples, where each tuple is of the form (code_type, code).

Source code in phenex/codelists/codelists.py
def to_tuples(self) -> List[tuple]:
    """
    Convert the codelist to a list of tuples, where each tuple is of the form
    (code_type, code).
    """
    return sum(
        [[(ct, c) for c in self.codelist[ct]] for ct in self.codelist.keys()],
        [],
    )
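
To make the output shape concrete, the flattening can be reproduced on a plain dictionary. This is a sketch on sample data, not a call into the library; the variable names and the codelist name used for the three-column rows are illustrative.

```python
from typing import Dict, List, Tuple

codelist: Dict[str, List[str]] = {
    "ICD-9": ["427.31"],
    "ICD-10": ["I48.0", "I48.1"],
}

# Same construction as in to_tuples: one list per code type, then concatenate.
tuples_via_sum: List[Tuple[str, str]] = sum(
    [[(ct, c) for c in codelist[ct]] for ct in codelist.keys()],
    [],
)

# Equivalent single comprehension over the dictionary items.
tuples_via_comprehension = [
    (ct, c) for ct, codes in codelist.items() for c in codes
]

# Rows in the column order that to_pandas describes: code_type, code, codelist.
records = [(ct, c, "atrial_fibrillation") for ct, c in tuples_via_sum]

print(tuples_via_sum)
print(records)
```

Both constructions preserve the dictionary's insertion order of code types and the order of codes within each type.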

LocalCSVCodelistFactory

LocalCSVCodelistFactory allows for the creation of multiple codelists from a single CSV file. Use this class when you have a single CSV file that contains multiple codelists.

To use, create an instance of the class and call the get_codelist method with the name of the codelist you want to retrieve; the name must appear in the column given by name_codelist_column. Call get_codelists to list the codelist names available in the file.
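
The expected CSV layout and the role of the three column-name parameters can be illustrated with the standard library alone. The sketch below is a stand-in for the factory's logic using `csv.DictReader` instead of pandas; the CSV content, the non-default header names, and the helper names `list_codelists` and `get_code_dict` are all assumptions made for this example.

```python
import csv
import io
from typing import Dict, List

# Illustrative CSV with non-default headers.
CSV_TEXT = """vocabulary,concept_code,concept
ICD-9,427.31,atrial_fibrillation
ICD-10,I48.0,atrial_fibrillation
ICD-10,I48.1,atrial_fibrillation
ICD-10,I50.9,heart_failure
"""

# These play the role of name_code_column, name_codelist_column
# and name_code_type_column.
CODE_COL, CODELIST_COL, CODE_TYPE_COL = "concept_code", "concept", "vocabulary"

reader = csv.DictReader(io.StringIO(CSV_TEXT))
rows = list(reader)

# Fail early if a required column is missing, as the factory does.
missing = [
    col
    for col in (CODE_COL, CODELIST_COL, CODE_TYPE_COL)
    if col not in (reader.fieldnames or [])
]
if missing:
    raise ValueError(f"Missing columns: {', '.join(missing)}")


def list_codelists(rows: List[Dict[str, str]]) -> List[str]:
    """Unique codelist names in order of first appearance."""
    return list(dict.fromkeys(row[CODELIST_COL] for row in rows))


def get_code_dict(rows: List[Dict[str, str]], name: str) -> Dict[str, List[str]]:
    """Codes of one codelist, grouped by code type."""
    grouped: Dict[str, List[str]] = {}
    for row in rows:
        if row[CODELIST_COL] == name:
            grouped.setdefault(row[CODE_TYPE_COL], []).append(row[CODE_COL])
    return grouped


names = list_codelists(rows)
af = get_code_dict(rows, "atrial_fibrillation")
print(names)
print(af)
```

With the class itself, the same header names would be passed as the name_code_column, name_codelist_column and name_code_type_column arguments of the constructor.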

Source code in phenex/codelists/codelists.py
class LocalCSVCodelistFactory:
    """
    LocalCSVCodelistFactory allows for the creation of multiple codelists from a single CSV file. Use this class when you have a single CSV file that contains multiple codelists.

    To use, create an instance of the class and then call the `get_codelist` method with the name of the codelist you want to retrieve; this codelist name must be an entry in the name_codelist_column.
    """

    def __init__(
        self,
        path: str,
        name_code_column: str = "code",
        name_codelist_column: str = "codelist",
        name_code_type_column: str = "code_type",
    ) -> None:
        """
        Parameters:
            path: Path to the CSV file.
            name_code_column: The name of the column containing the codes.
            name_codelist_column: The name of the column containing the codelist names.
            name_code_type_column: The name of the column containing the code types.
        """
        self.path = path
        self.name_code_column = name_code_column
        self.name_codelist_column = name_codelist_column
        self.name_code_type_column = name_code_type_column
        try:
            self.df = pd.read_csv(path)
        except:
            raise ValueError("Could not read the file at the given path.")

        # Check if the required columns exist in the DataFrame
        required_columns = [
            name_code_column,
            name_codelist_column,
            name_code_type_column,
        ]
        missing_columns = [
            col for col in required_columns if col not in self.df.columns
        ]
        if missing_columns:
            raise ValueError(
                f"The following required columns are missing in the CSV: {', '.join(missing_columns)}"
            )

    def get_codelists(self) -> List[str]:
        """
        Get a list of all codelists in the supplied CSV.
        """
        return self.df[self.name_codelist_column].unique().tolist()

    def get_codelist(self, name: str) -> Codelist:
        """
        Retrieve a single codelist by name.
        """
        try:
            df_codelist = self.df[self.df[self.name_codelist_column] == name]
            code_dict = (
                df_codelist.groupby(self.name_code_type_column)[self.name_code_column]
                .apply(list)
                .to_dict()
            )
            return Codelist(name=name, codelist=code_dict)
        except:
            raise ValueError("Could not find the codelist with the given name.")

__init__(path, name_code_column='code', name_codelist_column='codelist', name_code_type_column='code_type')

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | Path to the CSV file. | required |
| name_code_column | str | The name of the column containing the codes. | 'code' |
| name_codelist_column | str | The name of the column containing the codelist names. | 'codelist' |
| name_code_type_column | str | The name of the column containing the code types. | 'code_type' |

Source code in phenex/codelists/codelists.py
def __init__(
    self,
    path: str,
    name_code_column: str = "code",
    name_codelist_column: str = "codelist",
    name_code_type_column: str = "code_type",
) -> None:
    """
    Parameters:
        path: Path to the CSV file.
        name_code_column: The name of the column containing the codes.
        name_codelist_column: The name of the column containing the codelist names.
        name_code_type_column: The name of the column containing the code types.
    """
    self.path = path
    self.name_code_column = name_code_column
    self.name_codelist_column = name_codelist_column
    self.name_code_type_column = name_code_type_column
    try:
        self.df = pd.read_csv(path)
    except:
        raise ValueError("Could not read the file at the given path.")

    # Check if the required columns exist in the DataFrame
    required_columns = [
        name_code_column,
        name_codelist_column,
        name_code_type_column,
    ]
    missing_columns = [
        col for col in required_columns if col not in self.df.columns
    ]
    if missing_columns:
        raise ValueError(
            f"The following required columns are missing in the CSV: {', '.join(missing_columns)}"
        )

get_codelist(name)

Retrieve a single codelist by name.

Source code in phenex/codelists/codelists.py
def get_codelist(self, name: str) -> Codelist:
    """
    Retrieve a single codelist by name.
    """
    try:
        df_codelist = self.df[self.df[self.name_codelist_column] == name]
        code_dict = (
            df_codelist.groupby(self.name_code_type_column)[self.name_code_column]
            .apply(list)
            .to_dict()
        )
        return Codelist(name=name, codelist=code_dict)
    except:
        raise ValueError("Could not find the codelist with the given name.")

get_codelists()

Get a list of all codelists in the supplied CSV.

Source code in phenex/codelists/codelists.py
def get_codelists(self) -> List[str]:
    """
    Get a list of all codelists in the supplied CSV.
    """
    return self.df[self.name_codelist_column].unique().tolist()

MedConBCodelistFactory

Retrieve Codelists for use in PhenEx from MedConB.

Example:

from medconb_client import Client
endpoint = "https://api.medconb.example.com/graphql/"
token = get_token()
client = Client(endpoint, token)
medconb_factory = MedConBCodelistFactory(client)

phenex_codelist = medconb_factory.get_codelist(
    id="9c4ad312-3008-4d95-9b16-6f9b21ec1ad9"
)

Source code in phenex/codelists/codelists.py
class MedConBCodelistFactory:
    """
    Retrieve Codelists for use in PhenEx from MedConB.

    Example:
    ```python
    from medconb_client import Client
    endpoint = "https://api.medconb.example.com/graphql/"
    token = get_token()
    client = Client(endpoint, token)
    medconb_factory = MedConBCodelistFactory(client)

    phenex_codelist = medconb_factory.get_codelist(
        id="9c4ad312-3008-4d95-9b16-6f9b21ec1ad9"
    )
    ```
    """

    def __init__(
        self,
        medconb_client,
    ):
        self.medconb_client = medconb_client

    def get_codelist(self, id: str):
        """
        Resolve the codelist by querying the MedConB client.
        """
        medconb_codelist = self.medconb_client.get_codelist(codelist_id=id)
        return Codelist.from_medconb(medconb_codelist)

    def get_codelists(self):
        """
        Returns a list of all available codelist IDs.
        """
        return sum(
            [c.items for c in self.medconb_client.get_workspace().collections], []
        )

get_codelist(id)

Resolve the codelist by querying the MedConB client.

Source code in phenex/codelists/codelists.py
def get_codelist(self, id: str):
    """
    Resolve the codelist by querying the MedConB client.
    """
    medconb_codelist = self.medconb_client.get_codelist(codelist_id=id)
    return Codelist.from_medconb(medconb_codelist)

get_codelists()

Returns a list of all available codelist IDs.

Source code in phenex/codelists/codelists.py
def get_codelists(self):
    """
    Returns a list of all available codelist IDs.
    """
    return sum(
        [c.items for c in self.medconb_client.get_workspace().collections], []
    )
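
The two calls the factory makes on the client can be exercised with a stub. The sketch below models only `get_codelist(codelist_id=...)` and `get_workspace().collections[*].items`, which are the names that appear in the source above; the `StubClient` class and the use of plain strings as collection items are assumptions made for illustration, since the source does not show what `items` holds in a real workspace.

```python
from types import SimpleNamespace


class StubClient:
    """Stand-in for a MedConB client; only the calls used by the factory are modelled."""

    def __init__(self, collections):
        self._collections = collections
        self.requested_ids = []

    def get_codelist(self, codelist_id):
        # Record the request and return an empty stand-in codelist.
        self.requested_ids.append(codelist_id)
        return SimpleNamespace(name=codelist_id, codesets=[])

    def get_workspace(self):
        return SimpleNamespace(collections=self._collections)


client = StubClient(
    collections=[
        SimpleNamespace(items=["item-a", "item-b"]),
        SimpleNamespace(items=["item-c"]),
    ]
)

# Same flattening as in get_codelists: concatenate the items of every collection.
all_items = sum([c.items for c in client.get_workspace().collections], [])

# Same delegation as in get_codelist: the id is passed on as codelist_id.
stub_codelist = client.get_codelist(codelist_id="item-a")

print(all_items)
print(client.requested_ids)
```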