API documentation

Parser for JSON files from Illumina Connected Annotations pipeline.

add_gene_types(positions)

Adds the gene type to each transcript.

Transcripts will be annotated with the gene type (oncogene, tsg, mixed) by adding a new attribute geneType. Only transcripts with one of these three gene types get this additional annotation. Other transcripts will not get the geneType attribute.

Parameters:

  • positions (list) –

    list of filtered or unfiltered positions from JSON files.

Returns:

  • list

    list of positions with additional annotation of transcripts.

Examples:

>>> import icaparser as icap
>>> positions = icap.add_gene_types(positions)

apply_mutation_classification_rules(positions, rule_set=get_default_mutation_classification_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)

Applies mutation classification rules to all positions.

Each variant is categorized for each transcript that overlaps with the genomic position of the variant. Each transcript that passes the "mutated" or "uncertain" mutation classification rules gets a new attribute mutation_status with the value "mutated" or "uncertain". The input list of positions is modified by adding the mutation_status attribute to transcripts, and the modified list of positions is returned as the first element of the returned tuple.

In addition to modifying and returning the list of positions, this function also returns the assembled mutation status after aggregating the impact on all transcripts covering a variant. This is returned as the second item of the returned tuple. The impact depends on the type of gene ("gof" or "lof"), so the impacts are assembled separately for each gene type.

The impact of a particular mutational variant can be different for different overlapping transcript variants of a gene, and the transcript variants can also belong to different genes. The strongest impact on any overlapping transcript of a gene is defined as the impact of that mutational variant on the gene. The analyst must decide which isoforms are used to classify genes. For example, only canonical transcripts may be considered. Alternatively, all transcripts or a subset of transcripts may be used. Therefore, it is necessary to first apply transcript-level filters to all genomic positions before this function is called for determining the mutation status of genes.

The returned value is a multi-dimensional dictionary:

sample_id → gene → gene_type → variant_id → mutationStatus
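
A minimal sketch of how a nested dictionary of this shape could be flattened into rows. The sample ID, gene symbols, variant IDs, and status values below are made-up placeholders; the exact key formats produced by the function are not specified here.

```python
# Hypothetical example of the nested structure:
# sample_id -> gene -> gene_type -> variant_id -> mutation status.
sample_muts = {
    "sample_1": {
        "KRAS": {
            "gof": {"variant_a": "mutated"},
        },
        "TP53": {
            "lof": {
                "variant_b": "uncertain",
                "variant_c": "mutated",
            },
        },
    },
}


def iter_mutation_status(muts):
    """Yield one flat tuple per (sample, gene, gene type, variant)."""
    for sample_id, genes in muts.items():
        for gene, gene_types in genes.items():
            for gene_type, variants in gene_types.items():
                for variant_id, status in variants.items():
                    yield sample_id, gene, gene_type, variant_id, status


rows = list(iter_mutation_status(sample_muts))
for row in rows:
    print(row)
```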

Parameters:

  • positions (list) –

    list of positions.

  • rule_set (dict, default: get_default_mutation_classification_rules() ) –

    rules for classifying "gof" and "lof" genes. See the default value for an example if a custom rule set is needed.

  • gene_type_map (dict, default: get_default_gene_type_map() ) –

    dictionary for mapping gene types to canonical gene types. See the default value for an example if a custom map is needed.

Returns:

  • tuple[list, dict]

    A list of positions and a dictionary with assembled and aggregated mutations.

Examples:

>>> import icaparser as icap
>>> positions, sample_muts = icap.apply_mutation_classification_rules(positions)

cleanup_cosmic(positions)

Remove Cosmic entries with alleles not matching the variant alleles.

ICA attaches Cosmic entries to variants based on position only, which leads to wrong assignments of Cosmic entries to variants. This function removes all Cosmic entries from a variant for which reference and altered alleles do not match those of the variant.

Filtering is done in place.

Parameters:

  • positions (list) –

    list of positions to clean up.

Returns:

  • list

    list of positions with cleaned up Cosmic entries.
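
The allele matching described above can be sketched for a single variant as follows. This is an illustration, not the package's implementation: the key names 'cosmic', 'refAllele', and 'altAllele' and the Cosmic IDs are assumptions, and the helper name is hypothetical.

```python
def remove_mismatched_cosmic(variant):
    """Keep only Cosmic entries whose alleles match the variant's alleles.

    Modifies the variant in place. The key names 'cosmic', 'refAllele',
    and 'altAllele' are assumptions made for this sketch.
    """
    matching = [
        entry for entry in variant.get("cosmic", [])
        if entry.get("refAllele") == variant.get("refAllele")
        and entry.get("altAllele") == variant.get("altAllele")
    ]
    if matching:
        variant["cosmic"] = matching
    else:
        variant.pop("cosmic", None)  # drop the key when nothing matches
    return variant


variant = {
    "refAllele": "C",
    "altAllele": "T",
    "cosmic": [
        {"id": "COSV_A", "refAllele": "C", "altAllele": "T"},  # same alleles
        {"id": "COSV_B", "refAllele": "C", "altAllele": "G"},  # same position only
    ],
}
remove_mismatched_cosmic(variant)
print([entry["id"] for entry in variant["cosmic"]])
```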

common_variant_filter(variant, max_af=0.001)

Get a variant filter based on gnomAD, gnomAD Exome, and 1000 Genomes.

Returns True if none of the maximum allele frequencies from gnomAD, gnomAD Exome, and 1000 Genomes is greater than max_af. The default value of 0.1 % for the maximum allele frequency corresponds to that of the AACR GENIE project.

Parameters:

  • variant (dict) –

    the variant to investigate.

  • max_af (float, default: 0.001 ) –

    the maximum allele frequency threshold.

Returns:

  • bool

    True if this is not a common variant.
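
A rough sketch of the threshold logic, assuming each annotation source stores its allele frequencies in a dictionary under keys ending in 'Af'. That layout, the example values, and both helper names are assumptions for illustration; the source names are the ones listed for get_max_af.

```python
def max_af_for_source(variant, source):
    """Largest allele frequency annotated by one source, or 0.0 if absent.

    Assumes frequencies are stored under keys ending in 'Af'.
    """
    annotation = variant.get(source, {})
    freqs = [value for key, value in annotation.items() if key.endswith("Af")]
    return max(freqs, default=0.0)


def is_not_common(variant, max_af=0.001):
    """True if no source reports an allele frequency above max_af."""
    sources = ("gnomad", "gnomadExome", "oneKg")
    return all(max_af_for_source(variant, s) <= max_af for s in sources)


rare = {"gnomad": {"afrAf": 0.0002, "nfeAf": 0.0001}}
common = {"gnomad": {"afrAf": 0.0002}, "oneKg": {"eurAf": 0.03}}
print(is_not_common(rare), is_not_common(common))
```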

explode_consequence(mutation_table, inplace=False)

Explode the VEP consequence column of a mutation table.

Exploding the VEP consequence column with the standard Pandas explode() function would return consequences as strings, not as ordered categories. This function instead returns a consequence column that is an ordered category. The categories are ordered by their impact.

Exploding means that if a row of the input table has multiple consequences in the consequence column, the list of consequences will be split into single consequences and the output table will have multiple rows with a single consequence per row.

Parameters:

  • mutation_table (DataFrame) –

    the mutation table to explode

  • inplace (bool, default: False ) –

    if True, then modify the mutation_table in place instead of returning a new object

Returns:

  • DataFrame

    new mutation table with exploded consequences.

Examples:

>>> import icaparser as icap
>>> icap.explode_consequence(mutation_table, inplace=True)
>>> mutation_table_exploded = icap.explode_consequence(mutation_table)

filter_positions_by_transcripts(positions, filter_func)

Filter positions based on a filter function for transcripts.

Apply a filter function to all transcripts of each position. Transcripts not passing the filter are removed from the variants of a position. Variants without any transcript left are removed from a position. Positions without any variants left are removed from the returned list of positions.

Parameters:

  • positions (list) –

    list of positions to filter.

  • filter_func (Callable[[dict], bool]) –

    function taking a transcript and returning a bool. True means to keep the transcript.

Returns:

  • list

    filtered positions.

Examples:

>>> is_canonical_transcript = lambda x: x.get('isCanonical', False)
>>> canonical_positions = icap.filter_positions_by_transcripts(
        non_common_positions,
        is_canonical_transcript
    )
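
The cascading removal described above (transcripts, then empty variants, then empty positions) can be sketched as follows. The 'variants' and 'transcripts' keys, the 'vid' labels, and the helper name are assumptions for illustration; this sketch builds new dictionaries and may differ from the package in whether the input is modified.

```python
def filter_by_transcripts(positions, keep):
    """Drop failing transcripts, then empty variants, then empty positions."""
    result = []
    for position in positions:
        kept_variants = []
        for variant in position.get("variants", []):
            kept = [t for t in variant.get("transcripts", []) if keep(t)]
            if kept:  # a variant survives only if a transcript survives
                kept_variants.append({**variant, "transcripts": kept})
        if kept_variants:  # a position survives only if a variant survives
            result.append({**position, "variants": kept_variants})
    return result


positions = [
    {"position": 100, "variants": [
        {"vid": "v1", "transcripts": [{"isCanonical": True}, {}]},
        {"vid": "v2", "transcripts": [{}]},
    ]},
    {"position": 200, "variants": [{"vid": "v3", "transcripts": [{}]}]},
]
canonical = filter_by_transcripts(positions, lambda t: t.get("isCanonical", False))
print(len(canonical), [v["vid"] for v in canonical[0]["variants"]])
```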

filter_positions_by_variants(positions, filter_func)

Filter positions based on a filter function for variants.

Apply a filter function to all variants of each position. Variants not passing the filter are removed from a position. Positions without any variants passing the filter are removed from the returned list.

Parameters:

  • positions (list) –

    list of positions to filter.

  • filter_func (Callable[[dict], bool]) –

    function taking a variant and returning a bool. True means to keep the variant.

Returns:

  • list

    filtered positions.

Examples:

>>> import icaparser as icap
>>> max_af = 0.001
>>> is_not_common_variant = lambda x: icap.common_variant_filter(x, max_af)
>>> non_common_positions = icap.filter_positions_by_variants(
        positions,
        is_not_common_variant
    )

filter_variants_by_transcripts(variants, filter_func)

Filter variants based on a filter function for transcripts.

Apply a filter function to all transcripts of each variant. Transcripts not passing the filter are removed from a variant. Variants without any transcripts passing the filter are removed from the returned list.

Parameters:

  • variants (list) –

    list of variants to filter.

  • filter_func (Callable[[dict], bool]) –

    function taking a transcript and returning a bool. True means to keep the transcript.

Returns:

  • list

    filtered variants.

get_aggregated_mutation_table(positions, sample_muts=None, mutation_classification_rules=get_default_mutation_classification_rules(), mutation_aggregation_rules=get_default_mutation_aggregation_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)

Returns a sample-gene-mutationStatus table.

This function applies mutation classification rules to all mutational variants and aggregates the mutations according to the aggregation rules. This results in a table with one row for each sample-gene pair. The table contains several columns with impacts according to lof and gof rules on allele level and gene level and with one additional column with the maximum impact for both allele and gene level.

Parameters:

  • positions (list) –

    list of positions. If sample_muts is also specified, it is assumed that the positions have already been processed previously by apply_mutation_classification_rules and we do not have to run mutation classification again.

  • sample_muts (dict, default: None ) –

    if apply_mutation_classification_rules has been run before, you can use the second return value of that function as the sample_muts argument. This is helpful for very large datasets because otherwise apply_mutation_classification_rules will be run again as an internal call within get_aggregated_mutation_table, which is time consuming for very large data sets. This also means that if sample_muts is provided as an argument, the mutation_classification_rules argument is ignored and has no effect.

  • mutation_classification_rules (dict, default: get_default_mutation_classification_rules() ) –

    rules for classifying single mutations. See get_default_mutation_classification_rules() for details.

  • mutation_aggregation_rules (dict, default: get_default_mutation_aggregation_rules() ) –

    rules for aggregating mutations. See get_default_mutation_aggregation_rules() for details.

  • gene_type_map (dict, default: get_default_gene_type_map() ) –

    dictionary for mapping gene types to canonical gene types. See get_default_gene_type_map() for details.

Returns:

  • DataFrame

    mutation table.

get_biotype_priority(biotype)

Get the numeric priority of a biotype.

The numeric priority of a biotype that is returned by this function is the same as defined by vcf2maf.pl by MSKCC. Biotypes are 'protein_coding', 'LRG_gene', 'miRNA', ...

Parameters:

  • biotype (str) –

    the biotype for which the priority is to be returned.

Returns:

  • int

    the priority, smaller values mean higher priority.

get_clinvar(variant)

Get a table of all ClinVar annotations for a variant.

Parameters:

  • variant (dict) –

    the variant to investigate.

Returns:

  • DataFrame

    table with ClinVar annotations.

get_clinvar_max_significance(variant, ordered_significances=_CLINVAR_ORDERED_SIGNIFICANCES)

Get the maximum significance for all ClinVar annotations of a variant.

Parameters:

  • variant (dict) –

    the variant to investigate.

  • ordered_significances (list, default: _CLINVAR_ORDERED_SIGNIFICANCES ) –

    ranked order of ClinVar significances.

Returns:

  • str

    ClinVar significance of highest rank for the variant.

get_consequences(transcript)

Get a list of consequences for a transcript.

A list of consequences of a variant for a transcript is returned. If any of the annotated consequences is a combination of single consequences, separated by ampersands (&) or commas, the consequence is split into single consequences.

Parameters:

  • transcript (dict) –

    the transcript for which the consequences are to be returned.

Returns:

  • list

    the consequences, a list of strings.
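
The splitting step can be sketched with a regular expression. The helper name is hypothetical, and the input is taken to be a plain list of consequence strings rather than a transcript dictionary.

```python
import re


def split_consequences(consequences):
    """Split combined consequence terms on '&' or ',' into single terms."""
    single = []
    for term in consequences:
        parts = (part.strip() for part in re.split(r"[&,]", term))
        single.extend(part for part in parts if part)
    return single


combined = ["missense_variant&splice_region_variant", "intron_variant"]
print(split_consequences(combined))
```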

get_cosmic_max_sample_count(variant, only_allele_specific=True)

Get the maximum sample count for all Cosmic annotations of a variant.

A variant can have no, one or multiple associated Cosmic identifiers. This function returns the maximum sample count of all Cosmic identifiers. For each Cosmic identifier, sample numbers are summed up across all indications. Returns 0 if no Cosmic identifier exists for this variant.

The 'only_allele_specific' argument is used to exclude Cosmic entries that annotate the same chromosomal location but an allele that is different from the allele of the annotated variant. ICA annotates a variant with all Cosmic entries for that chromosomal location, irrespective of alleles. When counting Cosmic samples, this leads to an overestimation of Cosmic sample counts for a particular variant. Therefore, 'only_allele_specific' is True by default to count only samples from Cosmic entries with matching alleles. Occasionally, it may be desired, though, to count all samples with mutations at a given position, irrespective of allele. For example, several different alleles at a functional site of a gene can lead to function-disrupting mutations, so we want to get the maximum sample count for any allele at that position. One might also think of adding the sample counts for all Cosmic entries annotating a variant, but this does not work due to redundancy of Cosmic entries. Older Cosmic versions often included the same sample in different Cosmic entries. And newer Cosmic versions often have multiple entries for an allele, one for each transcript variant, with the same underlying samples.

Parameters:

  • variant (dict) –

    the variant to investigate

  • only_allele_specific (bool, default: True ) –

    consider only cosmic entries with alleles matching the allele of the annotated variant

Returns:

  • int

    maximum cosmic sample count
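
A sketch of the counting logic described above: sum the counts across indications for each Cosmic entry, then take the maximum over entries instead of adding entries together. The key names ('cosmic', 'refAllele', 'altAllele', 'cancerTypesAndCounts', 'count'), the IDs, and the numbers are assumptions for illustration.

```python
def cosmic_max_sample_count(variant, only_allele_specific=True):
    """Maximum per-entry sample count; 0 if the variant has no Cosmic entry.

    Assumes each entry lists per-indication counts under
    'cancerTypesAndCounts' as dictionaries with a 'count' key.
    """
    best = 0
    for entry in variant.get("cosmic", []):
        alleles_match = (
            entry.get("refAllele") == variant.get("refAllele")
            and entry.get("altAllele") == variant.get("altAllele")
        )
        if only_allele_specific and not alleles_match:
            continue
        # Sum across indications within one entry, but never across entries.
        total = sum(c.get("count", 0) for c in entry.get("cancerTypesAndCounts", []))
        best = max(best, total)
    return best


variant = {
    "refAllele": "G",
    "altAllele": "A",
    "cosmic": [
        {"id": "COSV_A", "refAllele": "G", "altAllele": "A",
         "cancerTypesAndCounts": [{"count": 30}, {"count": 12}]},
        {"id": "COSV_B", "refAllele": "G", "altAllele": "T",
         "cancerTypesAndCounts": [{"count": 100}]},
    ],
}
allele_specific = cosmic_max_sample_count(variant)
any_allele = cosmic_max_sample_count(variant, only_allele_specific=False)
print(allele_specific, any_allele)
```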

get_data_sources(file)

Extract a table with annotation data sources from the JSON header.

Parameters:

  • file (str) –

    name of the ICA JSON file.

Returns:

  • DataFrame

    table with annotation data sources and their versions.

get_default_gene_type_map()

Returns the default gene type map.

The canonical gene types are gof, lof, and the union of both. Genes that need to be activated to drive a tumor are of type gof. Genes that need to be deactivated to drive a tumor are of type lof. Genes that need to be activated or deactivated depending on the context are of the union of both types. Genes for which it is unknown if they need to be activated or deactivated are also annotated with both types. Genes can be originally annotated with other type names than the canonical ones. The gene type map is used to map these other gene type names to the canonical gene types.

The default map is:

  • oncogene: {"gof"}
  • tsg: {"lof"}
  • Act: {"gof"}
  • LoF: {"lof"}
  • mixed: {"gof", "lof"}
  • ambiguous: {"gof", "lof"}

Returns:

  • dict

    mappings from gene types to canonical gene types.

Examples:

>>> import icaparser as icap
>>> icap.get_default_gene_type_map()
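
For illustration, the default map listed above can be written as a plain dictionary of sets and used for lookups. The helper and its fallback to both types for unlisted names are assumptions of this sketch, not documented behavior.

```python
# The default map from the list above, written as a dictionary of sets.
GENE_TYPE_MAP = {
    "oncogene": {"gof"},
    "tsg": {"lof"},
    "Act": {"gof"},
    "LoF": {"lof"},
    "mixed": {"gof", "lof"},
    "ambiguous": {"gof", "lof"},
}


def canonical_gene_types(gene_type, gene_type_map=GENE_TYPE_MAP):
    """Map an annotated gene type name to its canonical gene types.

    Unlisted names fall back to both types here (an assumption).
    """
    return gene_type_map.get(gene_type, {"gof", "lof"})


print(canonical_gene_types("tsg"), sorted(canonical_gene_types("mixed")))
```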

get_default_mutation_aggregation_rules()

Returns the default mutation aggregation rules.

Two types of the mutation status of a gene are defined - allele level and gene level:

  • For gof genes (like oncogenes) it is sufficient that one of the alleles of one of the relevant isoforms has an activating mutation.
  • For lof genes (like tumor suppressor genes) all alleles of all relevant isoforms need to be functionally disrupted, either by mutations or by other means.
  • For ambiguous or other genes, the impact of a mutation is defined as the highest impact according to gof rules and lof rules.

For gain of function (gof) genes, the classifications at both the allele and gene levels are identical unless there is supplementary information about activating modifications beyond mutations. In contrast, for loss of function (lof) genes, classifications at the allele and gene levels may diverge. For instance, a truncating mutation in a tumor suppressor gene typically disrupts the function of the affected allele. However, other alleles of the same gene may remain functionally active, meaning the gene as a whole can still be operational, unless the mutated allele is a dominant negative variant. For a gene to be considered completely dysfunctional, all its alleles must be impaired, either through additional mutations or other mechanisms such as copy number deletions or hypermethylation. Consequently, a single variant that disrupts function at the allele level does not necessarily imply disruption at the gene level.

For loss of function (lof) genes, the available information often falls short of allowing a reliable estimation of functional effects. As a result, heuristic rules must be employed, and the analyst is tasked with deciding whether to utilize allele-level or gene-level classifications. A lof gene is classified as functionally disrupted at gene level (strong impact) if it harbors at least two mutations, each either of strong impact or of uncertain impact. Should a lof gene possess only one such mutation, it is classified as having an uncertain impact at the gene level, regardless of whether the mutation exhibits a strong impact at the allele level. By differentiating the effects at both the allele and gene levels, we maintain the flexibility to determine in subsequent analyses how to consolidate these categories for further statistical evaluations.

The function returns a dictionary containing two keys: gof and lof. Associated with each key is a function that accepts a dictionary of counts as its input and outputs a tuple comprising two elements: the mutation status at the allele level and at the gene level. The input dictionary of counts is expected to have two keys, mutated and uncertain. The value for each key represents the number of variants within a gene classified as mutated or uncertain, respectively.

Returns:

  • dict

    the gof and lof allele level and gene level aggregation rules.

Examples:

>>> import icaparser as icap
>>> icap.get_default_mutation_aggregation_rules()
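
A sketch of what aggregation rules with this interface could look like, written from the description above and not taken from the package. The label "wt" for genes without relevant mutations is a placeholder, and the actual defaults may use different status labels.

```python
def allele_level_status(counts):
    """Strongest status of any single variant in the gene."""
    if counts.get("mutated", 0) > 0:
        return "mutated"
    if counts.get("uncertain", 0) > 0:
        return "uncertain"
    return "wt"  # placeholder label, an assumption of this sketch


def gof_rule(counts):
    """gof genes: allele-level and gene-level status are identical."""
    status = allele_level_status(counts)
    return status, status


def lof_rule(counts):
    """lof genes: at least two hits are needed for a gene-level call."""
    hits = counts.get("mutated", 0) + counts.get("uncertain", 0)
    if hits >= 2:
        gene_level = "mutated"
    elif hits == 1:
        gene_level = "uncertain"
    else:
        gene_level = "wt"
    return allele_level_status(counts), gene_level


rules = {"gof": gof_rule, "lof": lof_rule}
print(rules["lof"]({"mutated": 1, "uncertain": 0}))
print(rules["lof"]({"mutated": 1, "uncertain": 1}))
```

With these rules, a single truncating mutation in a lof gene yields "mutated" at the allele level but only "uncertain" at the gene level.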

get_default_mutation_classification_rules(cosmic_threshold=10)

Returns the default rules for classifying mutations.

Defines the default rules for classifying mutations. The returned dictionary has keys "gof" and "lof", and the respective values are the rule sets for these gene types. Each rule set is a dictionary with the keys "mutated" and "uncertain". The values for "mutated" or "uncertain" are dictionaries with three filter functions, a "position_filter", a "variant_filter", and a "transcript_filter". For example, a transcript will be called "mutated" if all three filters for "mutated" return True, and it will be called "uncertain", if all three filter functions for "uncertain" return True.

These are the default rules returned by this function:

GOF

mutated: non-deleterious hotspot mutations.

  • position_filter: all positions retained (no restrictions by position).
  • variant_filter: keep only hotspot variants with a Cosmic sample count >= cosmic_threshold.
  • transcript_filter: keep only amino acid sequence modifying variants that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.

uncertain: non-deleterious mutations that aren't hotspots.

  • position_filter: all positions retained (no restrictions by position).
  • variant_filter: keep only non-hotspot variants with a Cosmic sample count < cosmic_threshold.
  • transcript_filter: keep only amino acid sequence modifying variants that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.

LOF

mutated: deleterious mutations (such as truncations, start or stop codon loss).

  • position_filter: all positions retained (no restrictions by position).
  • variant_filter: all variants retained (no restrictions by variant).
  • transcript_filter: keep only deleterious variants, such as truncations or stop codon loss or start codon loss.

uncertain: amino acid sequence modifying mutations that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.

  • position_filter: all positions retained (no restrictions by position).
  • variant_filter: all variants retained (no restrictions by variant).
  • transcript_filter: keep only amino acid sequence modifying variants that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.

Parameters:

  • cosmic_threshold (int, default: 10 ) –

    for "gof" genes, this is the "hotspot threshold" for Cosmic, i.e., the minimum number of samples in Cosmic having that mutation to consider a mutation a hot spot and, therefore, call the mutation "mutated". If the number of Cosmic samples is smaller, the mutation is called "uncertain".

Returns:

  • dict

    default mutation classification rules.

Examples:

>>> import icaparser as icap
>>> icap.get_default_mutation_classification_rules()
>>> icap.get_default_mutation_classification_rules(cosmic_threshold=20)
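
To make the shape of a rule set concrete, here is a simplified sketch of a "gof" rule set and of how the three filters could be combined. The 'cosmic_count' key, the consequence list, and the helper names are assumptions for illustration; the filters returned by the package may work differently.

```python
COSMIC_THRESHOLD = 10
MODIFYING = {"missense_variant", "inframe_insertion", "inframe_deletion"}


def keep_all(_):
    """Filter without restrictions."""
    return True


def is_modifying(transcript):
    """True for sequence-modifying consequences in the set above."""
    return bool(MODIFYING & set(transcript.get("consequence", [])))


gof_rules = {
    "mutated": {
        "position_filter": keep_all,
        "variant_filter": lambda v: v.get("cosmic_count", 0) >= COSMIC_THRESHOLD,
        "transcript_filter": is_modifying,
    },
    "uncertain": {
        "position_filter": keep_all,
        "variant_filter": lambda v: v.get("cosmic_count", 0) < COSMIC_THRESHOLD,
        "transcript_filter": is_modifying,
    },
}


def classify(position, variant, transcript, rules):
    """Return the first status whose three filters all return True."""
    for status in ("mutated", "uncertain"):
        filters = rules[status]
        if (filters["position_filter"](position)
                and filters["variant_filter"](variant)
                and filters["transcript_filter"](transcript)):
            return status
    return None


missense = {"consequence": ["missense_variant"]}
synonymous = {"consequence": ["synonymous_variant"]}
hotspot = classify({}, {"cosmic_count": 250}, missense, gof_rules)
non_hotspot = classify({}, {"cosmic_count": 2}, missense, gof_rules)
silent = classify({}, {"cosmic_count": 250}, synonymous, gof_rules)
print(hotspot, non_hotspot, silent)
```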

get_dna_json_files(base_dir, pattern='*MergedVariants_Annotated_filtered.json.gz')

Find DNA annotation JSON files in or below base_dir.

Searches for ICA DNA annotation JSON files in and below base_dir. All file names matching pattern are returned.

Parameters:

  • base_dir (str) –

    base directory of directory subtree where to search for DNA annotation JSON files.

  • pattern (str, default: '*MergedVariants_Annotated_filtered.json.gz' ) –

    files names matching this pattern are returned.

Returns:

  • list

    file names.
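
A recursive search of this kind can be sketched with pathlib. The helper name and the file names in the demonstration are made up, and the package may implement the search differently.

```python
import tempfile
from pathlib import Path


def find_json_files(base_dir, pattern="*MergedVariants_Annotated_filtered.json.gz"):
    """Return sorted paths of files in or below base_dir matching pattern."""
    return sorted(str(p) for p in Path(base_dir).rglob(pattern) if p.is_file())


# Demonstration in a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run1"
    run_dir.mkdir()
    (run_dir / "S1_MergedVariants_Annotated_filtered.json.gz").touch()
    (run_dir / "S1_other.json.gz").touch()
    found = find_json_files(tmp)
    print([Path(name).name for name in found])
```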

get_gene_type(gene_symbol)

Get the gene type (oncogene, tsg, mixed) for a gene.

Parameters:

  • gene_symbol (str) –

    the gene symbol of the gene.

Returns:

  • str

    the gene type.

get_genes(file)

Extract gene annotation from an ICA JSON file.

The genes section of ICA JSON files is optional. If this section is not included in the file, an empty list is returned.

Parameters:

  • file (str) –

    name of the ICA JSON file.

Returns:

  • list

    gene annotations.

get_gnomad_exome_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])

Get the maximum allele frequency for gnomAD Exome.

Get the maximum allele frequency across all major cohorts annotated by gnomAD Exome, excluding bottleneck populations (Ashkenazi Jewish and Finnish) and other.

Parameters:

  • variant (dict) –

    the variant to investigate.

  • cohorts (list, default: ['afr', 'amr', 'eas', 'nfe', 'sas'] ) –

    subpopulations to include.

Returns:

  • float

    maximum gnomAD Exome allele frequency.

get_gnomad_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])

Get the maximum allele frequency for gnomAD.

Get the maximum allele frequency across all major cohorts annotated by gnomAD, excluding bottleneck populations (Ashkenazi Jewish and Finnish) and other.

Parameters:

  • variant (dict) –

    the variant to investigate.

  • cohorts (list, default: ['afr', 'amr', 'eas', 'nfe', 'sas'] ) –

    subpopulations to include.

Returns:

  • float

    maximum gnomAD allele frequency.

get_header(file)

Extract the header element from an ICA JSON file.

Parameters:

  • file (str) –

    name of the ICA JSON file.

Returns:

  • dict

    header from the JSON file.

get_header_scalars(file)

Extract a table with all scalar attributes from the JSON header.

Parameters:

  • file (str) –

    name of the ICA JSON file.

Returns:

  • DataFrame

    table of scalar attributes and their values.

get_max_af(variant, source, cohorts=None)

Get the maximum allele frequency for a particular annotation source.

Get the maximum allele frequency across all cohorts annotated by the annotation source.

Parameters:

  • variant (dict) –

    the variant to investigate.

  • source (str) –

    the annotation source to use, for example 'gnomad' or 'gnomadExome' or 'oneKg'.

  • cohorts (list, default: None ) –

    subpopulations to include; include all if None.

Returns:

  • float

    the maximum allele frequency.

Examples:

>>> import icaparser as icap
>>> icap.get_max_af(variant, 'gnomad')

get_multi_sample_positions(files, *args, **kwargs)

Extract all positions for a set of ICA JSON files.

The sample id is stored as an additional new attribute of the samples element of a position. The samples element is a list, although ICA usually only creates single sample JSON files.

Parameters:

  • files (list) –

    names of the ICA JSON files.

  • args (object, default: () ) –

    extra arguments forwarded to get_positions().

  • kwargs (object, default: {} ) –

    extra named arguments forwarded to get_positions().

Returns:

  • list

    filtered positions from all files.

Examples:

>>> import icaparser as icap
>>> positions = icap.get_multi_sample_positions(json_files)
>>> print(positions[0]['samples'][0]['sampleId'])

get_mutation_table_for_files(json_files, max_af=0.001, min_vep_consequence_priority=6, min_cosmic_sample_count=0, only_canonical=False, extra_variant_filters=[], extra_transcript_filters=[])

Get an annotated table of all filtered transcripts from a list of ICA JSON files.

Load all positions from a list of ICA JSON files and filter them. Positions having any remaining variants and transcripts passing the filter are returned as an annotated table.

Parameters:

  • json_files (list) –

    list of ICA JSON files

  • max_af (float, default: 0.001 ) –

    maximum allele frequency for gnomAD, gnomAD Exome and 1000 Genomes. Only variants with maximum allele frequencies below this threshold will be returned.

  • min_vep_consequence_priority (int, default: 6 ) –

    only transcripts with a minimum VEP consequence priority not larger than this threshold will be retained. Consequences with priorities <= 6 change the protein sequence, consequences with priorities > 6 do not change the protein sequence.

  • min_cosmic_sample_count (int, default: 0 ) –

    only variants with a maximum cosmic sample count not lower than this threshold will be retained

  • only_canonical (bool, default: False ) –

    if true, only canonical transcripts will be retained

  • extra_variant_filters (list, default: [] ) –

    any additional filters to apply to variants. Filters shall return True to keep a variant.

  • extra_transcript_filters (list, default: [] ) –

    any additional filters to apply to transcripts. Filters shall return True to keep a transcript.

Returns:

  • DataFrame

    table of annotated mutations and affected transcripts.

Examples:

>>> import icaparser as icap
>>> extra_transcript_filters = [
        lambda x: x.get('source', '') == 'Ensembl',
        lambda x: x.get('hgnc', '') == 'KRAS'
    ]
>>> mut_table = icap.get_mutation_table_for_files(
        json_files,
        extra_transcript_filters=extra_transcript_filters
    )

get_mutation_table_for_position(position)

Get an annotated table of all transcripts for a single position.

Returns an annotated table of all transcripts that are affected by a mutation at a position.

Parameters:

  • position (dict) –

    the position to investigate.

Returns:

  • DataFrame

    table of annotated mutations and affected transcripts.

get_mutation_table_for_positions(positions, hide_progress=False)

Get an annotated table of all transcripts for all positions.

Returns an annotated table of all transcripts that are affected by a mutation at any of the positions.

Parameters:

  • positions (list) –

    the positions to investigate.

Returns:

  • DataFrame

    table of annotated mutations and affected transcripts.

get_onekg_max_af(variant)

Get the maximum allele frequency for the 1000 Genomes Project.

Get the maximum allele frequency across all cohorts annotated by the 1000 Genomes Project.

Parameters:

  • variant (dict) –

    the variant to investigate.

Returns:

  • float

    maximum 1000 Genomes allele frequency.

get_pipeline_metadata(files)

Extract a table with metadata of the annotation pipeline runs from the JSON headers.

Parameters:

  • files (list) –

    names of the ICA JSON files.

Returns:

  • DataFrame

    table with metadata of pipeline runs.

get_position_by_coordinates(positions, chromosome, position)

Extract a particular position from a position list.

Parameters:

  • positions (list) –

    list of input positions.

  • chromosome (str) –

    name of the chromosome.

  • position (int) –

    numeric position on the chromosome.

Returns:

  • dict

    the position for the specified chromosome and numeric position.

Examples:

>>> import icaparser as icap
>>> icap.get_position_by_coordinates(positions, 'chr1', 204399064)

get_positions(file, variant_filters=[], transcript_filters=[])

Extract all positions from an ICA JSON file.

The sample id is stored as an additional new attribute of the samples element of a position. The samples element is a list, although ICA usually only creates single sample JSON files.

Parameters:

  • file (str) –

    name of the ICA JSON file

  • variant_filters (list, default: [] ) –

    any filters to apply to variants. Filters shall return True to keep a variant.

  • transcript_filters (list, default: [] ) –

    any filters to apply to transcripts. Filters shall return True to keep a transcript.

Returns:

  • list

    filtered positions from file.

Examples:

>>> import icaparser as icap
>>> transcript_filters = [
        lambda x: x.get('source', '') == 'Ensembl',
        lambda x: x.get('hgnc', '') == 'KRAS'
    ]
>>> positions = icap.get_positions(
        json_file,
        transcript_filters=transcript_filters
    )
>>> print(positions[0]['samples'][0]['sampleId'])

get_sample(file, suffix='(-D[^.]*)?\\.bam')

Extract the sample name from an ICA JSON file.

Parameters:

  • file (str) –

    name of the ICA JSON file.

  • suffix (str, default: '(-D[^.]*)?\\.bam' ) –

    regular expression to remove from the sample name in the JSON file. Defaults to '(-D[^.]*)?\\.bam'.

Returns:

  • str

    name of the sample annotated in the JSON file.
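
The effect of the default suffix pattern can be sketched with re.sub. The sample names are made up, and applying the pattern with re.sub is an assumption about how the function uses it.

```python
import re


def strip_sample_suffix(name, suffix=r"(-D[^.]*)?\.bam"):
    """Remove the part of a sample name that matches the suffix pattern."""
    return re.sub(suffix, "", name)


print(strip_sample_suffix("PATIENT01-D1.bam"))
print(strip_sample_suffix("PATIENT02.bam"))
```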

get_strongest_vep_consequence_name(transcript)

Get the name of the strongest VEP consequence for a transcript.

Parameters:

  • transcript (dict) –

    the transcript to investigate.

Returns:

  • str

    the consequence.

get_strongest_vep_consequence_priority(transcript)

Get the strongest priority of VEP consequence for a transcript.

Get the strongest numeric priority of all VEP consequences for a transcript. Smaller numeric priorities mean stronger impact.

Parameters:

  • transcript (dict) –

    the transcript to investigate.

Returns:

  • int

    the strongest numeric priority, smaller values mean higher priority.

get_strongest_vep_consequence_rank(transcript)

Get the strongest rank of VEP consequences for a transcript.

Get the strongest numeric rank of all VEP consequences for a transcript. Smaller ranks mean stronger impact.

The priority of consequences is taken into account first. So if two consequences have different priorities, the consequence with the higher priority (lower priority number) will be used, and the rank for this consequence will be returned. If there are multiple consequences with the same priority, the lowest (strongest) rank will be returned.

For clarification: ranks are unique, i.e. all VEP consequences ordered as listed on the VEP documentation page get the row number of this table assigned as rank.

However, several consequences can have the same priority (e.g., stop gained and frameshift have the same priority). Priorities are copied from vcf2maf.pl of MSKCC.

Parameters:

  • transcript (dict) –

    the transcript to investigate.

Returns:

  • int

    the rank of the VEP consequence with strongest impact.
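
The selection order (priority first, then rank) can be sketched with tuple comparison. The table below is a small illustrative subset; its priority and rank values are placeholders for this sketch and should not be read as the authoritative vcf2maf.pl or VEP values.

```python
# term: (priority, rank); placeholder values for illustration only.
CONSEQUENCES = {
    "stop_gained": (3, 4),
    "frameshift_variant": (3, 5),
    "missense_variant": (6, 12),
    "synonymous_variant": (9, 20),
}


def strongest_rank(terms):
    """Rank of the term with the lowest priority number, ties broken by rank."""
    # Tuples compare element by element: priority first, then rank.
    _priority, rank = min(CONSEQUENCES[term] for term in terms)
    return rank


print(strongest_rank(["missense_variant", "frameshift_variant", "stop_gained"]))
```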

get_vep_consequence_for_rank(rank)

Get the VEP consequence term of a numeric rank.

Parameters:

  • rank (int) –

    the numeric rank of the consequence term.

Returns:

  • str

    the consequence.

get_vep_priority_for_consequence(consequence)

Get the numeric priority of a VEP consequence term.

The numeric priority of a consequence that is returned by this function is the same as defined by vcf2maf.pl of MSKCC.

Parameters:

  • consequence (str) –

    the consequence term of the variant.

Returns:

  • int

    the priority of the consequence, smaller values mean higher priority.

get_vep_rank_for_consequence(consequence)

Get the numeric rank of a VEP consequence term.

The numeric rank of a consequence is the position of the consequence in the list of consequences documented for the Variant Effect Predictor (VEP).

Parameters:

  • consequence (str) –

    the consequence term of the variant.

Returns:

  • int

    the rank of the consequence, smaller values mean higher rank.

split_multi_sample_json_file(json_file, output_dir)

Splits a multi-sample JSON file into sample specific JSON files.

This function reads a multi-sample JSON file that was generated by annotating a multi-sample VCF file with ICA and splits it into sample-specific JSON files.

Annotating very many single-sample VCF files with ICA is very time consuming, because ICA reads all annotation sources for each VCF file and this is dominating the runtime of ICA. It is therefore helpful to first merge many single-sample VCF files into one or a small number of multi-sample VCF files (for example, with bcftools merge), to annotate the multi-sample VCF file with ICA, and then to split the multi-sample JSON output of ICA into single-sample JSON files. These single-sample JSON files are required for the rest of this package.

Parameters:

  • json_file (str) –

    the multi-sample json input file.

  • output_dir (str) –

    the directory where to write the single sample JSON files. The directory will be created if it does not exist.

Returns:

  • None

    None.

strip_json_file(ifname, ofname)

Reduce the JSON file size by keeping only 'PASS' variants.

JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.

This function reads a single JSON file and creates a single JSON output file by removing all variants that do not pass Illumina's quality criteria.

Parameters:

  • ifname (str) –

    name of the input file.

  • ofname (str) –

    name of the output file.

Returns:

  • None

    None.

strip_json_files(source_dir, target_dir, pattern='*.json.gz')

Strip all JSON files of a project by keeping only 'PASS' variants.

JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.

This function searches source_dir recursively for all files matching pattern. Each of those files is processed, and a stripped version keeping only variants that PASS Illumina's quality criteria is created. The directory structure below source_dir is replicated in target_dir. Each output file is named after its input file but gets the suffix '_filtered.json.gz'.

Parameters:

  • source_dir (str) –

    directory where to search for input JSON files.

  • target_dir (str) –

    directory where to save the stripped output JSON files.

  • pattern (str, default: '*.json.gz' ) –

    files matching this pattern will be processed.

Returns:

  • None

    None.

Examples:

>>> import icaparser as icap
>>> icap.strip_json_files('../Data/Original', '../Data/Derived')