API documentation
Parser for JSON files from Illumina Connected Annotations pipeline.
add_gene_types(positions)
Adds the gene type to each transcript.
Transcripts will be annotated with the gene type (oncogene, tsg, mixed) by adding a new attribute geneType. Only transcripts with one of these three gene types get this additional annotation. Other transcripts will not get the geneType attribute.
Parameters:
-
positions
(list
) –list of filtered or unfiltered positions from JSON files.
Returns:
-
list
–list of positions with additional annotation of transcripts.
apply_mutation_classification_rules(positions, rule_set=get_default_mutation_classification_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)
Applies mutation classification rules to all positions.
Each variant is categorized for each transcript that overlaps with the genomic position of the variant. Each transcript that passes the "mutated" or "uncertain" mutation classification rules gets a new attribute mutation_status with the value "mutated" or "uncertain". The input list of positions is modified by adding the mutation_status attribute to transcripts, and the modified list of positions is returned as the first element of the returned tuple.
In addition to modifying and returning the list of positions, this function also returns the assembled mutation status after aggregating the impact on all transcripts covering a variant. This is returned as the second item of the returned tuple. The impact depends on the type of gene ("gof" or "lof"), so the impacts are assembled separately for each gene type.
The impact of a particular mutational variant can be different for different overlapping transcript variants of a gene, and the transcript variants can also belong to different genes. The strongest impact on any overlapping transcript of a gene is defined as the impact of that mutational variant on the gene. The analyst must decide which isoforms are used to classify genes. For example, only canonical transcripts may be considered. Alternatively, all transcripts or a subset of transcripts may be used. Therefore, it is necessary to first apply transcript-level filters to all genomic positions before this function is called for determining the mutation status of genes.
The returned value is a multi-dimensional dictionary:
sample_id
→ gene
→ gene_type
→ variant_id
→ mutationStatus
Parameters:
-
positions
(list
) –list of positions.
-
rule_set
(dict
, default:get_default_mutation_classification_rules()
) –rules for classifying "gof" and "lof" genes. See the default value for an example if a custom rule set is needed.
-
gene_type_map
(dict
, default:get_default_gene_type_map()
) –dictionary for mapping gene types to canonical gene types. See the default value for an example if a custom map is needed.
Returns:
-
tuple[list, dict]
–A list of positions and a dictionary with assembled and aggregated mutations.
Examples:
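The aggregation step described above can be illustrated with a minimal sketch. It is not the package's implementation: the names IMPACT_ORDER, strongest_status and assemble, the flat input records, and the variant identifiers are all hypothetical and only serve to show how the strongest transcript-level impact could be collected into the nested dictionary.

```python
from collections import defaultdict

# Hypothetical ordering of impacts: a larger number means a stronger impact.
IMPACT_ORDER = {"uncertain": 1, "mutated": 2}


def strongest_status(statuses):
    """Return the strongest status in an iterable, or None if it is empty."""
    return max(statuses, key=IMPACT_ORDER.__getitem__, default=None)


def assemble(records):
    """Build sample_id -> gene -> gene_type -> variant_id -> status.

    Each record is a hypothetical tuple
    (sample_id, gene, gene_type, variant_id, transcript_status)
    describing the classification of one transcript for one variant.
    """
    grouped = defaultdict(list)
    for sample_id, gene, gene_type, variant_id, status in records:
        grouped[(sample_id, gene, gene_type, variant_id)].append(status)

    result = {}
    for (sample_id, gene, gene_type, variant_id), statuses in grouped.items():
        by_type = result.setdefault(sample_id, {}).setdefault(gene, {})
        by_type.setdefault(gene_type, {})[variant_id] = strongest_status(statuses)
    return result


# Made-up records: two transcripts of one gene disagree on the impact.
records = [
    ("S1", "TP53", "lof", "variant-1", "uncertain"),
    ("S1", "TP53", "lof", "variant-1", "mutated"),
    ("S1", "KRAS", "gof", "variant-2", "uncertain"),
]
sample_muts = assemble(records)
print(sample_muts["S1"]["TP53"]["lof"])  # {'variant-1': 'mutated'}
```

Because the strongest impact wins, which transcripts are present matters, which is why transcript-level filters should be applied before classification.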
cleanup_cosmic(positions)
Remove Cosmic entries with alleles not matching the variant alleles.
ICA attaches Cosmic entries to variants based on position only, which leads to wrong assignments of Cosmic entries to variants. This function removes all Cosmic entries from a variant for which reference and altered alleles do not match those of the variant.
Filtering is done in place.
Parameters:
-
positions
(list
) –list of positions to clean up.
Returns:
-
list
–list of positions with cleaned up Cosmic entries.
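The allele check can be sketched as follows. This is an illustration under assumptions, not the package's code: the helper name drop_mismatching_cosmic is hypothetical, and it assumes that both the variant and each Cosmic entry carry "refAllele" and "altAllele" keys and that Cosmic entries are stored under the key "cosmic".

```python
def drop_mismatching_cosmic(variant):
    """Keep only Cosmic entries whose alleles equal those of the variant.

    Assumed layout: the variant and every Cosmic entry have the keys
    "refAllele" and "altAllele". The variant is modified in place.
    """
    entries = variant.get("cosmic")
    if entries is None:
        return variant  # nothing to clean up
    variant["cosmic"] = [
        entry
        for entry in entries
        if entry.get("refAllele") == variant.get("refAllele")
        and entry.get("altAllele") == variant.get("altAllele")
    ]
    return variant


# Made-up variant with one matching and one non-matching Cosmic entry.
variant = {
    "refAllele": "C",
    "altAllele": "T",
    "cosmic": [
        {"id": "entry_A", "refAllele": "C", "altAllele": "T"},
        {"id": "entry_B", "refAllele": "C", "altAllele": "G"},
    ],
}
drop_mismatching_cosmic(variant)
print([entry["id"] for entry in variant["cosmic"]])  # ['entry_A']
```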
common_variant_filter(variant, max_af=0.001)
Get a variant filter based on gnomAD, gnomAD Exome, and 1000 Genomes.
Returns True if none of the maximum allele frequencies from gnomAD, gnomAD Exome and 1000 Genomes is greater than max_af. The default value of 0.1 % for the maximum allele frequency corresponds to that of the AACR GENIE project.
Parameters:
-
variant
(dict
) –the variant to investigate.
-
max_af
(float
, default:0.001
) –the maximum allele frequency threshold.
Returns:
-
bool
–True if this is not a common variant.
explode_consequence(mutation_table, inplace=False)
Explode the VEP consequence column of a mutation table.
Exploding the VEP consequence column with the standard Pandas explode() function would return consequences as strings, not as ordered categories. This function will instead return a consequence column which is an ordered category. The categories are ordered by their impact.
Exploding means that if a row of the input table has multiple consequences in the consequence column, the list of consequences will be split into single consequences and the output table will have multiple rows with a single consequence per row.
Parameters:
-
mutation_table
(DataFrame
) –the mutation table to explode
-
inplace
(bool
, default:False
) –if True, then modify the mutation_table in place instead of returning a new object
Returns:
-
DataFrame
–new mutation table with exploded consequences.
filter_positions_by_transcripts(positions, filter_func)
Filter positions based on a filter function for transcripts.
Apply a filter function to all transcripts of each position. Transcripts not passing the filter are removed from the variants of a position. Variants without any transcript left are removed from a position. Positions without any variants left are removed from the returned list of positions.
Parameters:
-
positions
(list
) –list of positions to filter.
-
filter_func
(Callable[[dict], bool]
) –function taking a transcript and returning a bool. True means to keep the transcript.
Returns:
-
list
–filtered positions.
Examples:
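The cascading removal described above (transcripts, then empty variants, then empty positions) can be sketched as below. This is a stand-in under assumptions, not the package's function: it assumes that each position holds a "variants" list and each variant an optional "transcripts" list, and it returns shallow copies instead of modifying the input, which may differ from the actual behavior.

```python
def filter_positions_by_transcripts_sketch(positions, filter_func):
    """Return new position dicts with failing transcripts removed.

    Variants without remaining transcripts and positions without
    remaining variants are dropped. The input is not modified.
    """
    kept_positions = []
    for position in positions:
        kept_variants = []
        for variant in position.get("variants", []):
            transcripts = [
                t for t in variant.get("transcripts", []) if filter_func(t)
            ]
            if transcripts:
                kept_variants.append({**variant, "transcripts": transcripts})
        if kept_variants:
            kept_positions.append({**position, "variants": kept_variants})
    return kept_positions


# Made-up positions; only transcript T1 is flagged as canonical.
positions = [
    {"position": 100, "variants": [
        {"vid": "v1", "transcripts": [
            {"transcript": "T1", "isCanonical": True},
            {"transcript": "T2"},
        ]},
        {"vid": "v2", "transcripts": [{"transcript": "T3"}]},
    ]},
    {"position": 200, "variants": [
        {"vid": "v3", "transcripts": [{"transcript": "T4"}]},
    ]},
]

canonical_only = filter_positions_by_transcripts_sketch(
    positions, lambda t: t.get("isCanonical", False)
)
print([p["position"] for p in canonical_only])  # [100]
```

The filter function receives a single transcript dictionary, so any transcript-level criterion (canonical flag, biotype, consequence) can be expressed as a small callable.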
filter_positions_by_variants(positions, filter_func)
Filter positions based on a filter function for variants.
Apply a filter function to all variants of each position. Variants not passing the filter are removed from a position. Positions without any variants passing the filter are removed from the returned list.
Parameters:
-
positions
(list
) –list of positions to filter.
-
filter_func
(Callable[[dict], bool]
) –function taking a variant and returning a bool. True means to keep the variant.
Returns:
-
list
–filtered positions.
filter_variants_by_transcripts(variants, filter_func)
Filter variants based on a filter function for transcripts.
Apply a filter function to all transcripts of each variant. Transcripts not passing the filter are removed from a variant. Variants without any transcripts passing the filter are removed from the returned list.
Parameters:
-
variants
(list
) –list of variants to filter.
-
filter_func
(Callable[[dict], bool]
) –function taking a transcript and returning a bool. True means to keep the transcript.
Returns:
-
list
–filtered variants.
get_aggregated_mutation_table(positions, sample_muts=None, mutation_classification_rules=get_default_mutation_classification_rules(), mutation_aggregation_rules=get_default_mutation_aggregation_rules(), gene_type_map=get_default_gene_type_map(), hide_progress=False)
Returns a sample-gene-mutationStatus table.
This function applies mutation classification rules to all mutational variants and aggregates the mutations according to the aggregation rules. This results in a table with one row for each sample-gene pair. The table contains several columns with impacts according to lof and gof rules on allele level and gene level and with one additional column with the maximum impact for both allele and gene level.
Parameters:
-
positions
(list
) –list of positions. If sample_muts is also specified, it is assumed that the positions have already been processed by apply_mutation_classification_rules and that mutation classification does not have to be run again.
-
sample_muts
(dict
, default:None
) –if apply_mutation_classification_rules has been run before, you can use the second return value of that function as the sample_muts argument. This is helpful for very large data sets because otherwise apply_mutation_classification_rules will be run again as an internal call within get_aggregated_mutation_table, which is time consuming. This also means that if sample_muts is provided as an argument, the mutation_classification_rules argument is ignored and has no effect.
-
mutation_classification_rules
(dict
, default:get_default_mutation_classification_rules()
) –rules for classifying single mutations. See get_default_mutation_classification_rules() for details.
-
mutation_aggregation_rules
(dict
, default:get_default_mutation_aggregation_rules()
) –rules for aggregating mutations. See get_default_mutation_aggregation_rules() for details.
-
gene_type_map
(dict
, default:get_default_gene_type_map()
) –dictionary for mapping gene types to canonical gene types. See get_default_gene_type_map() for details.
Returns:
-
DataFrame
–mutation table.
get_biotype_priority(biotype)
Get the numeric priority of a biotype.
The numeric priority of a biotype that is returned by this function is the same as defined by vcf2maf.pl by MSKCC. Biotypes are 'protein_coding', 'LRG_gene', 'miRNA', ...
Parameters:
-
biotype
(str
) –the biotype for which the priority is to be returned.
Returns:
-
int
–the priority, smaller values mean higher priority.
get_clinvar(variant)
Get a table of all ClinVar annotations for a variant.
Parameters:
-
variant
(dict
) –the variant to investigate.
Returns:
-
DataFrame
–table with ClinVar annotations.
get_clinvar_max_significance(variant, ordered_significances=_CLINVAR_ORDERED_SIGNIFICANCES)
Get the maximum significance for all ClinVar annotations of a variant.
Parameters:
-
variant
(dict
) –the variant to investigate.
-
ordered_significances
(list
, default:_CLINVAR_ORDERED_SIGNIFICANCES
) –ranked order of ClinVar significances.
Returns:
-
str
–ClinVar significance of highest rank for the variant.
get_consequences(transcript)
Get a list of consequences for a transcript.
A list of consequences of a variant for a transcript is returned. If any of the annotated consequences is a combination of single consequences, separated by ampersands (&) or commas, the consequence is split into single consequences.
Parameters:
-
transcript
(dict
) –the transcript for which the consequences are to be returned.
Returns:
-
list
–the consequences, a list of strings.
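The splitting rule can be illustrated with a short sketch. It is not the package's implementation: the helper name split_consequences is hypothetical, and it assumes that a transcript stores its consequences as a list of strings under the key "consequence".

```python
import re


def split_consequences(transcript):
    """Split combined consequence terms into single terms.

    Terms joined by ampersands or commas are separated; the order of
    appearance is preserved.
    """
    singles = []
    for term in transcript.get("consequence", []):
        singles.extend(part for part in re.split(r"[&,]", term) if part)
    return singles


transcript = {
    "consequence": ["missense_variant&splice_region_variant", "intron_variant"]
}
print(split_consequences(transcript))
# ['missense_variant', 'splice_region_variant', 'intron_variant']
```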
get_cosmic_max_sample_count(variant, only_allele_specific=True)
Get the maximum sample count for all Cosmic annotations of a variant.
A variant can have no, one or multiple associated Cosmic identifiers. This function returns the maximum sample count of all Cosmic identifiers. For each Cosmic identifier, sample numbers are summed up across all indications. Returns 0 if no Cosmic identifier exists for this variant.
The 'only_allele_specific' argument is used to exclude Cosmic entries that annotate the same chromosomal location but an allele that is different from the allele of the annotated variant. ICA annotates a variant with all Cosmic entries for that chromosomal location, irrespective of alleles. When counting Cosmic samples, this leads to an overestimation of Cosmic sample counts for a particular variant. Therefore, 'only_allele_specific' is True by default to count only samples from Cosmic entries with matching alleles. Occasionally, it may be desired, though, to count all samples with mutations at a given position, irrespective of allele. For example, several different alleles at a functional site of a gene can lead to function-disrupting mutations, so we want to get the maximum sample count for any allele at that position. One might also think of adding the sample counts for all Cosmic entries annotating a variant, but this does not work due to redundancy of Cosmic entries. Older Cosmic versions often included the same sample in different Cosmic entries. And newer Cosmic versions often have multiple entries for an allele, one for each transcript variant, with the same underlying samples.
Parameters:
-
variant
(dict
) –the variant to investigate
-
only_allele_specific
(bool
, default:True
) –consider only cosmic entries with alleles matching the allele of the annotated variant
Returns:
-
int
–maximum cosmic sample count
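The counting logic (sum per Cosmic identifier, then take the maximum rather than the sum across identifiers) can be sketched as follows. The layout is an assumption made for illustration only: the key "countsByIndication" and the helper name cosmic_max_sample_count_sketch are hypothetical and do not describe the actual ICA JSON fields.

```python
def cosmic_max_sample_count_sketch(variant, only_allele_specific=True):
    """Maximum over Cosmic entries of the per-entry sample total.

    Hypothetical layout: each Cosmic entry has "refAllele", "altAllele",
    and "countsByIndication", a dict mapping an indication to a count.
    Returns 0 if no Cosmic entry qualifies.
    """
    totals = []
    for entry in variant.get("cosmic", []):
        same_allele = (
            entry.get("refAllele") == variant.get("refAllele")
            and entry.get("altAllele") == variant.get("altAllele")
        )
        if only_allele_specific and not same_allele:
            continue
        # Sum the samples of one entry across all indications.
        totals.append(sum(entry.get("countsByIndication", {}).values()))
    # Take the maximum, not the sum, because entries can be redundant.
    return max(totals, default=0)


# Made-up variant: one entry matches the alleles, one does not.
variant = {
    "refAllele": "C",
    "altAllele": "T",
    "cosmic": [
        {"refAllele": "C", "altAllele": "T",
         "countsByIndication": {"lung": 3, "colon": 4}},
        {"refAllele": "C", "altAllele": "G",
         "countsByIndication": {"skin": 20}},
    ],
}
print(cosmic_max_sample_count_sketch(variant))         # 7
print(cosmic_max_sample_count_sketch(variant, False))  # 20
```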
get_data_sources(file)
Extract a table with annotation data sources from the JSON header.
Parameters:
-
file
(str
) –name of the ICA JSON file.
Returns:
-
DataFrame
–table with annotation data sources and their versions.
get_default_gene_type_map()
Returns the default gene type map.
The canonical gene types are gof, lof, and the union of both.
Genes that need to be activated to drive a tumor are of type gof.
Genes that need to be deactivated to drive a tumor are of type lof.
Genes that need to be activated or deactivated depending on the context are of the union of both types.
Genes for which it is unknown if they need to be activated or deactivated are also annotated with both types.
Genes can be originally annotated with other type names than the canonical ones. The gene type map is used to map these other gene type names to the canonical gene types.
The default map is:
oncogene → {"gof"}
tsg → {"lof"}
Act → {"gof"}
LoF → {"lof"}
mixed → {"gof", "lof"}
ambiguous → {"gof", "lof"}
Returns:
-
dict
–mappings from gene types to canonical gene types.
Examples:
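Written out as a Python dictionary, the default map listed above would look like the sketch below. The constant and helper names are hypothetical, and returning an empty set for unknown gene types is an assumption of this sketch, not documented behavior.

```python
# The default map from the description above, with sets as values.
DEFAULT_GENE_TYPE_MAP = {
    "oncogene": {"gof"},
    "tsg": {"lof"},
    "Act": {"gof"},
    "LoF": {"lof"},
    "mixed": {"gof", "lof"},
    "ambiguous": {"gof", "lof"},
}


def canonical_gene_types(gene_type, gene_type_map=DEFAULT_GENE_TYPE_MAP):
    """Map an annotated gene type to its set of canonical gene types.

    Unknown gene types yield an empty set in this sketch (an assumption).
    """
    return set(gene_type_map.get(gene_type, set()))


print(canonical_gene_types("tsg"))  # {'lof'}
```

A custom map would follow the same shape: keys are the gene type names used in your annotation, values are sets drawn from "gof" and "lof".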
get_default_mutation_aggregation_rules()
Returns the default mutation aggregation rules.
Two types of mutation status are defined for a gene, allele level and gene level:
- For gof genes (like oncogenes) it is sufficient that one of the alleles of one of the relevant isoforms has an activating mutation.
- For lof genes (like tumor suppressor genes) all alleles of all relevant isoforms need to be functionally disrupted, either by mutations or by other means.
- For ambiguous or other genes, the impact of a mutation is defined as the highest impact according to gof rules and lof rules.
For gain of function (gof) genes, the classifications at both the allele and gene levels are identical unless there is supplementary information about activating modifications beyond mutations. In contrast, for loss of function (lof) genes, classifications at the allele and gene levels may diverge. For instance, a truncating mutation in a tumor suppressor gene typically disrupts the function of the affected allele. However, other alleles of the same gene may remain functionally active, meaning the gene as a whole can still be operational, unless the mutated allele is a dominant negative variant. For a gene to be considered completely dysfunctional, all its alleles must be impaired, either through additional mutations or other mechanisms such as copy number deletions or hypermethylation. Consequently, a single variant that disrupts function at the allele level does not necessarily imply disruption at the gene level.
For loss of function (lof) genes, the available information often falls short of allowing a reliable estimation of functional effects. As a result, heuristic rules must be employed, and the analyst is tasked with deciding whether to utilize allele-level or gene-level classifications. A lof gene is classified as functionally disrupted at gene level (strong impact) if it harbors at least two mutations, each either of strong impact or of uncertain impact. Should a lof gene possess only one such mutation, it is classified as having an uncertain impact at the gene level, regardless of whether the mutation exhibits a strong impact at the allele level. By differentiating the effects at both the allele and gene levels, we maintain the flexibility to determine in subsequent analyses how to consolidate these categories for further statistical evaluations.
The function returns a dictionary containing two keys: gof and lof. Associated with each key is a function that accepts a dictionary of counts as its input and outputs a tuple comprising two elements: the mutation status at the allele level and at the gene level. The input dictionary of counts is expected to have two keys, mutated and uncertain. The value for each key represents the number of variants within a gene classified as mutated or uncertain, respectively.
Returns:
-
dict
–the gof and lof allele level and gene level aggregation rules.
Examples:
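The heuristic described above can be sketched as a pair of functions. This is a literal reading of the text, not the package's code: the function names are hypothetical, the status labels "mutated" and "uncertain" at gene level are assumed to mirror the allele-level labels, and None is used as a placeholder when a gene has no relevant mutation.

```python
def gof_rule(counts):
    """Return (allele_level, gene_level) for a gof gene.

    Without extra information both levels are identical.
    """
    if counts.get("mutated", 0) > 0:
        status = "mutated"
    elif counts.get("uncertain", 0) > 0:
        status = "uncertain"
    else:
        status = None  # placeholder for "no relevant mutation" (assumption)
    return status, status


def lof_rule(counts):
    """Return (allele_level, gene_level) for a lof gene."""
    mutated = counts.get("mutated", 0)
    uncertain = counts.get("uncertain", 0)

    # Allele level: the strongest single-variant classification.
    if mutated > 0:
        allele_level = "mutated"
    elif uncertain > 0:
        allele_level = "uncertain"
    else:
        allele_level = None

    # Gene level: at least two strong or uncertain mutations are needed
    # for a strong impact; a single one is only "uncertain".
    total = mutated + uncertain
    if total >= 2:
        gene_level = "mutated"
    elif total == 1:
        gene_level = "uncertain"
    else:
        gene_level = None
    return allele_level, gene_level


aggregation_rules = {"gof": gof_rule, "lof": lof_rule}
print(aggregation_rules["lof"]({"mutated": 1, "uncertain": 0}))
# ('mutated', 'uncertain')
```

The example shows the divergence discussed above: a single truncating mutation is strong at the allele level but only uncertain at the gene level.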
get_default_mutation_classification_rules(cosmic_threshold=10)
Returns the default rules for classifying mutations.
Defines the default rules for classifying mutations. The returned dictionary has keys "gof" and "lof", and the respective values are the rule sets for these gene types. Each rule set is a dictionary with the keys "mutated" and "uncertain". The values for "mutated" or "uncertain" are dictionaries with three filter functions, a "position_filter", a "variant_filter", and a "transcript_filter". For example, a transcript will be called "mutated" if all three filters for "mutated" return True, and it will be called "uncertain", if all three filter functions for "uncertain" return True.
These are the default rules returned by this function:
GOF
mutated: non-deleterious hotspot mutations.
- position_filter: all positions retained (no restrictions by position).
- variant_filter: keep only hotspot variants with a Cosmic sample count >= cosmic_threshold.
- transcript_filter: keep only amino acid sequence modifying variants that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.
uncertain: non-deleterious mutations that aren't hotspots.
- position_filter: all positions retained (no restrictions by position).
- variant_filter: keep only non-hotspot variants with a Cosmic sample count < cosmic_threshold.
- transcript_filter: keep only amino acid sequence modifying variants that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.
LOF
mutated: deleterious mutations (such as truncations, start or stop codon loss).
- position_filter: all positions retained (no restrictions by position).
- variant_filter: all variants retained (no restrictions by variant).
- transcript_filter: keep only deleterious variants, such as truncations or stop codon loss or start codon loss.
uncertain: amino acid sequence modifying mutations that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.
- position_filter: all positions retained (no restrictions by position).
- variant_filter: all variants retained (no restrictions by variant).
- transcript_filter: keep only amino acid sequence modifying variants that are not most likely deleterious. This includes missense mutations and in-frame insertions and deletions.
Parameters:
-
cosmic_threshold
(int
, default:10
) –for "gof" genes, this is the "hotspot threshold" for Cosmic, i.e., the minimum number of samples in Cosmic having that mutation to consider a mutation a hot spot and, therefore, call the mutation "mutated". If the number of Cosmic samples is smaller, the mutation is called "uncertain".
Returns:
-
dict
–default mutation classification rules.
Examples:
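The shape of such a rule set, and how the three filters combine, can be sketched as below. Everything here is illustrative: make_rules_sketch and classify are hypothetical names, the consequence sets are a small excerpt rather than the package's lists, and the key "cosmic_count" stands in for however the Cosmic sample count is actually obtained from a variant.

```python
# Small illustrative excerpts of consequence terms (not the full lists).
MODIFYING = {"missense_variant", "inframe_insertion", "inframe_deletion"}
DELETERIOUS = {"stop_gained", "frameshift_variant", "stop_lost", "start_lost"}


def keep_all(_item):
    """Filter without restrictions."""
    return True


def make_rules_sketch(cosmic_threshold=10):
    """Build a rule set with the same nesting as described above."""

    def is_deleterious(transcript):
        return bool(DELETERIOUS & set(transcript["consequence"]))

    def is_modifying(transcript):
        terms = set(transcript["consequence"])
        return bool(MODIFYING & terms) and not (DELETERIOUS & terms)

    def is_hotspot(variant):
        return variant["cosmic_count"] >= cosmic_threshold

    def is_not_hotspot(variant):
        return variant["cosmic_count"] < cosmic_threshold

    def rule(variant_filter, transcript_filter):
        return {
            "position_filter": keep_all,
            "variant_filter": variant_filter,
            "transcript_filter": transcript_filter,
        }

    return {
        "gof": {
            "mutated": rule(is_hotspot, is_modifying),
            "uncertain": rule(is_not_hotspot, is_modifying),
        },
        "lof": {
            "mutated": rule(keep_all, is_deleterious),
            "uncertain": rule(keep_all, is_modifying),
        },
    }


def classify(position, variant, transcript, rule_set):
    """Return "mutated", "uncertain", or None for one gene type's rules."""
    for status in ("mutated", "uncertain"):
        filters = rule_set[status]
        if (
            filters["position_filter"](position)
            and filters["variant_filter"](variant)
            and filters["transcript_filter"](transcript)
        ):
            return status
    return None


rules = make_rules_sketch(cosmic_threshold=10)
missense = {"consequence": ["missense_variant"]}
print(classify({}, {"cosmic_count": 25}, missense, rules["gof"]))  # mutated
print(classify({}, {"cosmic_count": 3}, missense, rules["gof"]))   # uncertain
```

A custom rule set only needs to keep this nesting; each filter is an ordinary callable returning a bool.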
get_dna_json_files(base_dir, pattern='*MergedVariants_Annotated_filtered.json.gz')
Find DNA annotation JSON files in or below base_dir.
Searches for ICA DNA annotation JSON files in and below base_dir. All file names matching pattern are returned.
Parameters:
-
base_dir
(str
) –base directory of the directory subtree to search for DNA annotation JSON files.
-
pattern
(str
, default:'*MergedVariants_Annotated_filtered.json.gz'
) –file names matching this pattern are returned.
Returns:
-
list
–file names.
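A recursive search of this kind can be sketched with pathlib. This is a stand-in, not the package's implementation: the helper name find_json_files is hypothetical, and returning the paths sorted and as strings is an assumption of the sketch.

```python
import tempfile
from pathlib import Path


def find_json_files(base_dir, pattern="*MergedVariants_Annotated_filtered.json.gz"):
    """Return paths of files in or below base_dir whose name matches pattern."""
    return sorted(str(p) for p in Path(base_dir).rglob(pattern) if p.is_file())


# Demonstration on a throwaway directory tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "run1" / "sampleA"
    nested.mkdir(parents=True)
    (nested / "A_MergedVariants_Annotated_filtered.json.gz").touch()
    (nested / "A_other.json.gz").touch()
    found_names = [Path(f).name for f in find_json_files(tmp)]

print(found_names)  # ['A_MergedVariants_Annotated_filtered.json.gz']
```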
get_gene_type(gene_symbol)
Get the gene type (oncogene, tsg, mixed) for a gene.
Parameters:
-
gene_symbol
(str
) –the gene symbol of the gene.
Returns:
-
str
–the gene type.
get_genes(file)
Extract gene annotation from an ICA JSON file.
The genes section of ICA JSON files is optional. If this section is not included in the file, an empty list is returned.
Parameters:
-
file
(str
) –name of the ICA JSON file.
Returns:
-
list
–gene annotations.
get_gnomad_exome_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])
Get the maximum allele frequency for gnomAD Exome.
Get the maximum allele frequency across all major cohorts annotated by gnomAD Exome, excluding bottleneck populations (Ashkenazi Jews and Finnish) and other.
Parameters:
-
variant
(dict
) –the variant to investigate.
-
cohorts
(list
, default:['afr', 'amr', 'eas', 'nfe', 'sas']
) –subpopulations to include.
Returns:
-
float
–maximum GnomAD Exome allele frequency.
get_gnomad_max_af(variant, cohorts=['afr', 'amr', 'eas', 'nfe', 'sas'])
Get the maximum allele frequency for gnomAD.
Get the maximum allele frequency across all major cohorts annotated by gnomAD, excluding bottleneck populations (Ashkenazi Jews and Finnish) and other.
Parameters:
-
variant
(dict
) –the variant to investigate.
-
cohorts
(list
, default:['afr', 'amr', 'eas', 'nfe', 'sas']
) –subpopulations to include.
Returns:
-
float
–maximum GnomAD allele frequency.
get_header(file)
Extract the header element from an ICA JSON file.
Parameters:
-
file
(str
) –name of the ICA JSON file.
Returns:
-
dict
–header from the JSON file.
get_header_scalars(file)
Extract a table with all scalar attributes from the JSON header.
Parameters:
-
file
(str
) –name of the ICA JSON file.
Returns:
-
DataFrame
–table of scalar attributes and their values.
get_max_af(variant, source, cohorts=None)
Get the maximum allele frequency for a particular annotation source.
Get the maximum allele frequency across all cohorts annotated by the annotation source.
Parameters:
-
variant
(dict
) –the variant to investigate.
-
source
(str
) –the annotation source to use, for example 'gnomad' or 'gnomadExome' or 'oneKg'.
-
cohorts
(list
, default:None
) –subpopulations to include; include all if None.
Returns:
-
float
–the maximum allele frequency.
Examples:
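The lookup can be sketched as follows. It rests on assumptions that should be checked against real files: that the annotation of a source is a dictionary stored under the source name, that allele frequency keys consist of a cohort prefix followed by "Af" (for example "afrAf"), and that 0.0 is a sensible result when nothing is annotated. The helper name max_af_sketch is hypothetical.

```python
def max_af_sketch(variant, source, cohorts=None):
    """Maximum allele frequency of a variant for one annotation source.

    cohorts=None includes every key ending in "Af"; otherwise only the
    listed cohort prefixes are considered.
    """
    annotation = variant.get(source, {})
    if cohorts is None:
        values = [v for k, v in annotation.items() if k.endswith("Af")]
    else:
        values = [
            annotation[f"{cohort}Af"]
            for cohort in cohorts
            if f"{cohort}Af" in annotation
        ]
    return max(values, default=0.0)


# Made-up frequencies for illustration.
variant = {
    "gnomad": {"allAf": 0.0004, "afrAf": 0.002, "nfeAf": 0.0001,
               "finAf": 0.01, "coverage": 30},
}
print(max_af_sketch(variant, "gnomad"))                  # 0.01
print(max_af_sketch(variant, "gnomad", ["afr", "nfe"]))  # 0.002
print(max_af_sketch(variant, "oneKg"))                   # 0.0

# A common-variant check then compares each maximum with a threshold.
is_rare = all(
    max_af_sketch(variant, source, ["afr", "nfe"]) <= 0.001
    for source in ("gnomad", "gnomadExome", "oneKg")
)
print(is_rare)  # False
```

Restricting the cohorts matters: in the made-up data the bottleneck cohort dominates the unrestricted maximum.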
get_multi_sample_positions(files, *args, **kwargs)
Extract all positions for a set of ICA JSON files.
The sample id is stored as an additional new attribute of the samples element of a position. The samples element is a list, although ICA usually only creates single sample JSON files.
Parameters:
-
files
(list
) –names of the ICA JSON files.
-
args
(object
, default:()
) –extra arguments forwarded to get_positions().
-
kwargs
(object
, default:{}
) –extra named arguments forwarded to get_positions().
Returns:
-
list
–filtered positions from all files.
get_mutation_table_for_files(json_files, max_af=0.001, min_vep_consequence_priority=6, min_cosmic_sample_count=0, only_canonical=False, extra_variant_filters=[], extra_transcript_filters=[])
Get an annotated table of all filtered transcripts from a list of ICA JSON files.
Load all positions from a list of ICA JSON files and filter them. Positions having any remaining variants and transcripts passing the filter are returned as an annotated table.
Parameters:
-
json_files
(list
) –list of ICA JSON files
-
max_af
(float
, default:0.001
) –maximum allele frequency for gnomAD, gnomAD Exome and 1000 Genomes. Only variants with maximum allele frequencies below this threshold will be returned.
-
min_vep_consequence_priority
(int
, default:6
) –only transcripts with a minimum VEP consequence priority not larger than this threshold will be retained. Consequences with priorities <= 6 change the protein sequence, consequences with priorities > 6 do not change the protein sequence.
-
min_cosmic_sample_count
(int
, default:0
) –only variants with a maximum cosmic sample count not lower than this threshold will be retained
-
only_canonical
(bool
, default:False
) –if true, only canonical transcripts will be retained
-
extra_variant_filters
(list
, default:[]
) –any additional filters to apply to variants. Filters shall return True to keep a variant.
-
extra_transcript_filters
(list
, default:[]
) –any additional filters to apply to transcripts. Filters shall return True to keep a transcript.
Returns:
-
DataFrame
–table of annotated mutations and affected transcripts.
get_mutation_table_for_position(position)
Get an annotated table of all transcripts for a single position.
Returns an annotated table of all transcripts that are affected by a mutation at a position.
Parameters:
-
position
(dict
) –the position to investigate.
Returns:
-
DataFrame
–table of annotated mutations and affected transcripts.
get_mutation_table_for_positions(positions, hide_progress=False)
Get an annotated table of all transcripts for all positions.
Returns an annotated table of all transcripts that are affected by a mutation at any of the positions.
Parameters:
-
positions
(list
) –the positions to investigate.
Returns:
-
DataFrame
–table of annotated mutations and affected transcripts.
get_onekg_max_af(variant)
Get the maximum allele frequency for the 1000 Genomes Project.
Get the maximum allele frequency across all cohorts annotated by the 1000 Genomes Project.
Parameters:
-
variant
(dict
) –the variant to investigate.
Returns:
-
float
–maximum 1000 Genomes allele frequency.
get_pipeline_metadata(files)
Extract a table with metadata of annotation pipeline runs from the JSON headers.
Parameters:
-
files
(list
) –names of the ICA JSON files.
Returns:
-
DataFrame
–table with metadata of pipeline runs.
get_position_by_coordinates(positions, chromosome, position)
Extract a particular position from a position list.
Parameters:
-
positions
(list
) –list of input positions.
-
chromosome
(str
) –name of the chromosome.
-
position
(int
) –numeric position on the chromosome.
Returns:
-
dict
–the position for the specified chromosome and numeric position.
Examples:
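A lookup of this kind can be sketched as a linear search. The helper name find_position is hypothetical; the sketch assumes that positions carry the keys "chromosome" and "position", and returning None when nothing matches is an assumption rather than documented behavior.

```python
def find_position(positions, chromosome, position):
    """Return the first position with matching coordinates, else None."""
    for candidate in positions:
        if (
            candidate.get("chromosome") == chromosome
            and candidate.get("position") == position
        ):
            return candidate
    return None


# Made-up positions for illustration.
positions = [
    {"chromosome": "chr1", "position": 100, "refAllele": "A"},
    {"chromosome": "chr2", "position": 100, "refAllele": "G"},
]
print(find_position(positions, "chr2", 100)["refAllele"])  # G
```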
get_positions(file, variant_filters=[], transcript_filters=[])
Extract all positions from an ICA JSON file.
The sample id is stored as an additional new attribute of the samples element of a position. The samples element is a list, although ICA usually only creates single sample JSON files.
Parameters:
-
file
(str
) –name of the ICA JSON file
-
variant_filters
(list
, default:[]
) –any filters to apply to variants. Filters shall return True to keep a variant.
-
transcript_filters
(list
, default:[]
) –any filters to apply to transcripts. Filters shall return True to keep a transcript.
Returns:
-
list
–filtered positions from file.
Examples:
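Reading positions and tagging them with the sample id can be sketched as below. This is a simplified stand-in, not the package's reader: it assumes a top-level object with a "header" (holding a "samples" list of names) and a "positions" list, the attribute name "sampleId" and the helper name load_positions are hypothetical, filters are omitted, and the whole file is read into memory.

```python
import gzip
import json
import os
import tempfile


def load_positions(file):
    """Load positions from a gzip-compressed JSON file.

    Each entry of a position's "samples" list is tagged with the name of
    the corresponding sample from the header.
    """
    with gzip.open(file, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    names = data.get("header", {}).get("samples", [])
    positions = data.get("positions", [])
    for position in positions:
        for name, sample in zip(names, position.get("samples", [])):
            sample["sampleId"] = name
    return positions


# Demonstration with a tiny made-up file.
toy = {
    "header": {"samples": ["S1"]},
    "positions": [
        {"chromosome": "chr1", "position": 100,
         "samples": [{"genotype": "0/1"}]},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        json.dump(toy, handle)
    loaded = load_positions(path)

print(loaded[0]["samples"])  # [{'genotype': '0/1', 'sampleId': 'S1'}]
```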
get_sample(file, suffix='(-D[^.]*)?\\.bam')
Extract the sample name from an ICA JSON file.
Parameters:
-
file
(str
) –name of the ICA JSON file.
-
suffix
(str
, default:'(-D[^.]*)?\\.bam'
) –regular expression to remove from the sample name in the JSON file. Defaults to '(-D[^.]*)?\\.bam'.
Returns:
-
str
–name of the sample annotated in the JSON file.
get_strongest_vep_consequence_name(transcript)
Get the name of the strongest VEP consequence for a transcript.
Parameters:
-
transcript
(dict
) –the transcript to investigate.
Returns:
-
str
–the consequence.
get_strongest_vep_consequence_priority(transcript)
Get the strongest priority of VEP consequence for a transcript.
Get the strongest numeric priority of all VEP consequences for a transcript. Smaller numeric priorities mean stronger impact.
Parameters:
-
transcript
(dict
) –the transcript to investigate.
Returns:
-
int
–the strongest numeric priority, smaller values mean higher priority.
get_strongest_vep_consequence_rank(transcript)
Get the strongest rank of VEP consequences for a transcript.
Get the strongest numeric rank of all VEP consequences for a transcript. Smaller ranks mean stronger impact.
The priority of consequences is taken into account first. So if two consequences have different priorities, the consequence with the higher priority (lower priority number) will be used, and the rank for this consequence will be returned. If there are multiple consequences with the same priority, the lowest (strongest) rank will be returned.
For clarification: ranks are unique, i.e. all VEP consequences ordered as listed on the VEP documentation page get the row number of this table assigned as rank.
However, several consequences can have the same priority (e.g., stop gained and frameshift have the same priority). Priorities are copied from vcf2maf.pl of MSKCC.
Parameters:
-
transcript
(dict
) –the transcript to investigate.
Returns:
-
int
–the rank of the VEP consequence with strongest impact.
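The selection order (priority first, rank as tie-breaker) can be sketched with a toy lookup table. The numbers below are placeholders chosen for illustration and are not the package's or VEP's actual tables; the names TOY_TABLE and strongest_rank are hypothetical.

```python
# Illustrative excerpt only: consequence -> (priority, rank).
# Ranks are unique; several consequences may share a priority.
TOY_TABLE = {
    "stop_gained": (3, 4),
    "frameshift_variant": (3, 5),
    "missense_variant": (6, 12),
    "intron_variant": (12, 25),
}


def strongest_rank(consequences, table=TOY_TABLE):
    """Return the rank of the strongest consequence.

    Tuples compare element by element, so min() picks the lowest
    priority number first and the lowest rank among ties.
    """
    _priority, rank = min(table[term] for term in consequences)
    return rank


print(strongest_rank(["missense_variant", "frameshift_variant", "stop_gained"]))  # 4
```

In the example, stop_gained and frameshift_variant share a priority, so the lower rank decides.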
get_vep_consequence_for_rank(rank)
Get the VEP consequence term of a numeric rank.
Parameters:
-
rank
(int
) –the numeric rank of the consequence term.
Returns:
-
str
–the consequence.
get_vep_priority_for_consequence(consequence)
Get the numeric priority of a VEP consequence term.
The numeric priority of a consequence that is returned by this function is the same as defined by vcf2maf.pl of MSKCC.
Parameters:
-
consequence
(str
) –the consequence term of the variant.
Returns: the priority of the consequence, smaller values mean higher priority.
get_vep_rank_for_consequence(consequence)
Get the numeric rank of a VEP consequence term.
The numeric rank of a consequence is the position of the consequence in this list of consequences for the Variant Effect Predictor VEP.
Parameters:
-
consequence
(str
) –the consequence term of the variant.
Returns:
-
int
–the rank of the consequence, smaller values mean higher rank.
split_multi_sample_json_file(json_file, output_dir)
Splits a multi-sample JSON file into sample specific JSON files.
This function reads a multi-sample JSON file that was generated by annotating a multi-sample VCF file with ICA and splits it into sample-specific JSON files.
Annotating very many single-sample VCF files with ICA is very time consuming, because ICA reads all annotation sources for each VCF file and this dominates the runtime of ICA. It is therefore helpful to first merge many single-sample VCF files into one or a small number of multi-sample VCF files (for example, with bcftools merge), to annotate the multi-sample VCF file with ICA, and then to split the multi-sample JSON output of ICA into single-sample JSON files. These single-sample JSON files are required for the rest of this package.
Parameters:
-
json_file
(str
) –the multi-sample json input file.
-
output_dir
(str
) –the directory where to write the single sample JSON files. The directory will be created if it does not exist.
Returns:
-
None
–None.
strip_json_file(ifname, ofname)
Reduce the JSON file size by keeping only 'PASS' variants.
JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.
This function reads a single JSON file and creates a single JSON output file by removing all variants that do not pass Illumina's quality criteria.
Parameters:
-
ifname
(str
) –name of the input file.
-
ofname
(str
) –name of the output file.
Returns:
-
None
–None.
strip_json_files(source_dir, target_dir, pattern='*.json.gz')
Strip all JSON files of a project by keeping only 'PASS' variants.
JSON files from Illumina's ICA pipeline can be very large because they contain any deviation from the reference genome, irrespective of the quality of the mutation call. Gzip compressed JSON files with sizes in the gigabyte range cannot be processed by JSON packages that read the entire file into memory. It is necessary to first reduce the size of JSON files by removing all variants that do not meet Illumina's quality criteria.
This function searches source_dir recursively for all files matching pattern. Each of those files is processed and a stripped version keeping only variants that PASS Illumina's quality criteria is created. The output file has the same name as the input file. The directory structure below source_dir is replicated in target_dir. Output files get the suffix '_filtered.json.gz'.
Parameters:
-
source_dir
(str
) –directory where to search for input JSON files.
-
target_dir
(str
) –directory where to save the stripped output JSON files.
-
pattern
(str
, default:'*.json.gz'
) –files matching this pattern will be processed.
Returns:
-
None
–None.