Comments on the GENIE data

Mutations are not normalized

The mutations contained in GENIE are not always normalized (left-shifted). In particular, insertions and deletions do not always include the leading common nucleotide. As a result, the same genomic alteration may be reported differently by different assays. Before mutations are aggregated across panels, the representation of each genomic alteration must therefore be normalized.

To do so, we create a VCF file from the GENIE mutation data, normalize it, and then re-annotate it (here with Illumina Connected Annotations).
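To illustrate what normalization does, here is a minimal Python sketch of the usual left-align-and-trim procedure. It is not the code used to build the package files, and in practice an established VCF normalization tool would do this step. The function name `normalize_variant`, the toy reference string, and the convention that `-` marks an allele without an anchor base are assumptions made for this example.

```python
def normalize_variant(pos, ref, alt, chrom_seq):
    """Left-align and trim one variant against a reference sequence.

    pos        1-based coordinate of the first base of `ref`; for an empty
               `ref` (pure insertion) the coordinate of the base immediately
               to the right of the insertion point (assumed convention)
    ref, alt   allele strings; "-" or "" marks an allele without anchor base
    chrom_seq  reference sequence of the chromosome (a toy string below)
    """
    ref = "" if ref == "-" else ref.upper()
    alt = "" if alt == "-" else alt.upper()
    if ref == alt:
        raise ValueError("reference and alternate allele are identical")

    changed = True
    while changed:
        changed = False
        # Dropping a shared trailing base is what shifts an indel to the left.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # An empty allele needs an anchor: prepend the preceding reference base.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot left-extend at the start of the sequence")
            pos -= 1
            anchor = chrom_seq[pos - 1].upper()
            ref, alt = anchor + ref, anchor + alt
            changed = True

    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; three ways of reporting the same CA deletion.
toy = "GGGCACACACAGGG"
reports = [(8, "CA", "-"), (7, "ACA", "A"), (3, "GCA", "G")]
print({normalize_variant(p, r, a, toy) for p, r, a in reports})
# {(3, 'GCA', 'G')}
```

All three reports collapse to a single representation, which is what makes counts comparable across panels.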

GENIE uses ENSEMBL transcripts

Although the annotation of mutations provided by GENIE also contains RefSeq transcripts, the RefSeq annotation is incomplete. Many mutations have only an ENSEMBL transcript but no RefSeq transcript assigned. Conversely, there is no alteration that has an assigned RefSeq transcript but no ENSEMBL transcript.

Based on this observation, there are several options for creating a unified, consistent annotation across panels:

Option 1: Keep ENSEMBL transcripts reported by GENIE

Consistently re-annotate all genomic variants, extract annotations for the ENSEMBL transcripts contained in GENIE, and map these updated annotations to the genomic changes reported by GENIE, using the ENSEMBL transcript ID as a mapping link.
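The mapping step can be sketched as a keyed lookup. The following Python example is only an illustration under stated assumptions: the record layout (dicts with `variant`, `transcript`, and `hgvsp` keys), the function names, and all identifiers in the toy data are made up, and stripping a version suffix from the transcript ID is an assumption about how the IDs might differ between the two sources.

```python
def strip_version(transcript_id):
    """Drop a trailing version suffix, e.g. 'ENST00000000001.4' -> 'ENST00000000001'."""
    return transcript_id.split(".", 1)[0]


def remap_annotations(genie_records, reannotated):
    """Attach the re-annotated consequence to each GENIE record.

    Records are linked by (normalized variant key, ENSEMBL transcript ID).
    A record without a match keeps its other fields and gets hgvsp=None.
    """
    index = {
        (a["variant"], strip_version(a["transcript"])): a
        for a in reannotated
    }
    updated = []
    for rec in genie_records:
        hit = index.get((rec["variant"], strip_version(rec["transcript"])))
        updated.append({**rec, "hgvsp": hit["hgvsp"] if hit else None})
    return updated


# Made-up placeholder data, not taken from GENIE.
genie_records = [
    {"sample": "S1", "variant": "1:100:A:T", "transcript": "ENST00000000001", "hgvsp": "p.old"},
    {"sample": "S2", "variant": "1:200:G:GA", "transcript": "ENST00000000002.3", "hgvsp": None},
]
reannotated = [
    {"variant": "1:100:A:T", "transcript": "ENST00000000001.5", "hgvsp": "p.new1"},
    {"variant": "1:100:A:T", "transcript": "ENST00000000009.1", "hgvsp": "p.other"},
    {"variant": "1:200:G:GA", "transcript": "ENST00000000002.3", "hgvsp": "p.new2"},
]
for rec in remap_annotations(genie_records, reannotated):
    print(rec["sample"], rec["transcript"], rec["hgvsp"])
```

Only the annotation for the transcript that GENIE originally reported is kept; annotations for other transcripts hit by the same variant are ignored in this option.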

This option keeps the annotation as close as possible to the original GENIE annotations and just ensures that we get consistent annotations across all assays.

However, this option gives no preference to MANE Select transcripts.

Option 2: Focus on MANE transcripts

Clinical reporting of mutations should preferably be based on MANE transcripts. GENIE is based on GRCh37, whereas MANE is defined for GRCh38. NCBI has remapped the RefSeq sequences of the MANE transcripts back to the GRCh37 genome; Ensembl has not. Although a mapping table from MANE transcripts to GRCh37 Ensembl transcripts exists, it does not cover all MANE transcripts, and the GRCh37 transcripts are not always identical to the MANE transcripts. Therefore, when focusing on MANE transcripts while working with the GRCh37 genome (which GENIE is based on), using RefSeq transcripts rather than Ensembl transcripts is preferable.

Focusing on MANE requires some additional decisions. If a genomic variant affects a MANE transcript, the consequence on that MANE transcript can be reported. Even then, it must be decided how to handle situations where both MANE Select and MANE Plus Clinical transcripts are affected.

If a genomic variant does not affect a MANE transcript, there is no clear and obvious procedure for choosing which transcript to report. One could report only effects on MANE transcripts and omit all genomic variants that do not affect a MANE transcript. Alternatively, one could omit non-MANE transcripts for genes for which a MANE transcript is defined, even if the MANE transcript itself is not affected by the genomic variant, but report non-MANE transcripts for genes for which no MANE transcript is defined. For such genes, it still must be decided which of the non-MANE transcripts to report.

So there are several different options when focusing on MANE. Unless the report is MANE-only, i.e. filters out any non-MANE transcript, it depends very much on the use case which transcripts to keep.
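The two policies described above can be written down as a small selection function. This Python sketch is one possible formalization, not the logic of the genie package: the policy names, the `prefer_select` switch, and the record layout are assumptions chosen for illustration, and the fallback policy simply returns all non-MANE transcripts of genes without a MANE transcript, leaving the final choice among them open.

```python
MANE_SELECT = "MANE Select"
MANE_PLUS_CLINICAL = "MANE Plus Clinical"


def transcripts_to_report(affected, genes_with_mane, policy="mane_only",
                          prefer_select=True):
    """Choose which affected transcripts of one variant to report.

    affected         list of dicts with keys 'gene', 'transcript', 'mane'
                     ('mane' is MANE_SELECT, MANE_PLUS_CLINICAL or None)
    genes_with_mane  set of genes for which any MANE transcript is defined
    policy           'mane_only' or 'mane_with_fallback' (hypothetical names)
    prefer_select    if True, drop MANE Plus Clinical transcripts of a gene
                     whose MANE Select transcript is also affected
    """
    mane = [t for t in affected if t["mane"] is not None]
    if prefer_select:
        select_genes = {t["gene"] for t in mane if t["mane"] == MANE_SELECT}
        mane = [t for t in mane
                if t["mane"] == MANE_SELECT or t["gene"] not in select_genes]
    if policy == "mane_only":
        return mane
    if policy == "mane_with_fallback":
        # Non-MANE transcripts are kept only for genes without any MANE transcript.
        fallback = [t for t in affected
                    if t["mane"] is None and t["gene"] not in genes_with_mane]
        return mane + fallback
    raise ValueError(f"unknown policy: {policy!r}")


# Made-up example: one variant overlapping transcripts of three genes.
affected = [
    {"gene": "G1", "transcript": "T1", "mane": MANE_SELECT},
    {"gene": "G1", "transcript": "T2", "mane": MANE_PLUS_CLINICAL},
    {"gene": "G1", "transcript": "T3", "mane": None},
    {"gene": "G2", "transcript": "T4", "mane": None},  # gene without MANE
    {"gene": "G3", "transcript": "T5", "mane": None},  # MANE defined, not hit
]
genes_with_mane = {"G1", "G3"}

for policy in ("mane_only", "mane_with_fallback"):
    chosen = transcripts_to_report(affected, genes_with_mane, policy)
    print(policy, [t["transcript"] for t in chosen])
# mane_only ['T1']
# mane_with_fallback ['T1', 'T4']
```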

Conclusion

The genie package provides two types of annotations: one is based on the ENSEMBL transcripts originally used by GENIE, the other is MANE-only and uses RefSeq. The universe argument of many genie functions can be used to select the annotation type.

If an enhanced "MANE + Non-MANE" version is required, the file genie.annot.tsv.gz can be used as input for a customized filtering process outside of the genie package. This file was generated when creating the auxiliary files for a new GENIE release and contains annotations for all transcripts affected by a genomic variant.
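As a starting point for such a customized filter, the file can be streamed with standard tools. The sketch below uses only the Python standard library. The column layout of genie.annot.tsv.gz is not described here, so the helper names are hypothetical, and any column name used in a filter is an assumption to be checked against the actual header first.

```python
import csv
import gzip


def read_columns(path):
    """Return the column names from the header line of a gzipped TSV file."""
    with gzip.open(path, "rt", newline="") as handle:
        return next(csv.reader(handle, delimiter="\t"))


def filter_rows(path, keep):
    """Yield each row (as a dict) for which the caller-supplied predicate is true."""
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if keep(row):
                yield row


# Possible usage; "some_column" is a placeholder, not a known column name:
#   print(read_columns("genie.annot.tsv.gz"))
#   rows = filter_rows("genie.annot.tsv.gz", lambda row: row["some_column"] == "value")
```

Because rows are yielded one at a time, the whole file never has to be held in memory, and the predicate can encode whichever "MANE + Non-MANE" rule the use case requires.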