Annotate a VCF with GraphKB¶
A common format for small mutations is the VCF file, of which there are a number of variants. To help users get started, we have provided this example script showing how to convert a VCF file to the JSON format that the IPR python adapter expects.
We expect users to annotate their VCF files with protein and/or cds HGVS notation prior to this.
Download Files¶
Install¶
To do this we are going to install a couple of common python libraries. We will use pysam to read the vcf file (Pysam - Working with VCF).
!pip install pysam pandas seaborn graphkb
Prepare Canonical Transcripts List¶
First we will read the tabbed biomart export file. This contains a list of gene and transcript IDs for canonical transcripts only. We'll use this to filter annotations in a later step.
import pandas as pd
mart_df = pd.read_csv('https://raw.githubusercontent.com/bcgsc/pori/feature/colab-notebooks/demo/mart_export.protein_coding.canonical.txt', sep='\t')
mart_df
In this case we know that the transcripts in our VCF are unversioned, so we will also store the transcript names without their versions for quick access in later steps.
mart_df['unversioned'] = mart_df['Transcript stable ID version'].str.split('.').str[0]
canonical_transcripts = set(mart_df['Transcript stable ID version'].unique().tolist() + mart_df['unversioned'].unique().tolist())
len(canonical_transcripts)
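As a quick, self-contained illustration of the version-stripping logic (using a tiny stand-in table with the same column name as the biomart export; the transcript IDs below are examples only):

```python
import pandas as pd

# tiny stand-in for the biomart export file
mart_df = pd.DataFrame({
    'Transcript stable ID version': ['ENST00000256078.8', 'ENST00000311936.7'],
})

# strip the trailing version number so both forms can be matched later
mart_df['unversioned'] = mart_df['Transcript stable ID version'].str.split('.').str[0]

canonical_transcripts = set(
    mart_df['Transcript stable ID version'].tolist() + mart_df['unversioned'].tolist()
)
print('ENST00000256078' in canonical_transcripts)  # True
print(len(canonical_transcripts))  # 4 (two versioned + two unversioned IDs)
```

Keeping both the versioned and unversioned forms in one set means a simple membership test works regardless of which form appears in the VCF annotations.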
Process the VCF¶
Then we are ready to read the VCF file as input
!wget https://raw.githubusercontent.com/bcgsc/pori/feature/colab-notebooks/demo/example_snvs.variants.passed.4.3.eff.hgvs.vcf
from pysam import VariantFile
fh = VariantFile('/content/example_snvs.variants.passed.4.3.eff.hgvs.vcf')
Then we will iterate over the VCF variant records. From each record we need to pull out the HGVSp notation. Since this is output from SnpEff, we can refer to their documentation for how to parse it. According to their site, this is the expected format of the EFF field, which contains the HGVS notation. The field is pipe-delimited inside parentheses.
EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank | Genotype_Number [ | ERRORS | WARNINGS ] )
Note: EFF is a deprecated field. Newer versions of SnpEff use the ANN field instead; parsing should be similar.
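To make the layout concrete, here is a hypothetical EFF entry (the values are invented for illustration) split according to the format above:

```python
import re

# made-up EFF entry following the pipe-delimited-inside-parentheses layout
eff = 'NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Atc/Gtc|p.Ile63Val/c.187A>G||KRAS|protein_coding|CODING|ENST00000256078|3|1)'

effect = eff.split('(')[0]
# capture everything inside the outer parentheses, then split on pipes
fields = re.match(r'^[^(]+\((.+)\)$', eff).group(1).split('|')

print(effect)     # NON_SYNONYMOUS_CODING
print(fields[3])  # p.Ile63Val/c.187A>G  (amino acid and cds changes, slash-separated)
print(fields[8])  # ENST00000256078  (transcript ID)
```

Note that the Amino_Acid_Change field can carry both the protein and cds notation separated by a slash, which is why the function below splits it again.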
We define the following functions below to help with parsing the INFO fields in the VCF.
import re
def parse_eff(eff):
    match = re.match(r'^[^(]+\((.+)\)$', eff)
    if match:
        [impact, functional_class, codon_change, aa_change, _, gene_name, transcript_biotype, gene_coding, transcript_id, _, genotype_allele] = match.group(1).split('|')[:11]
        changes = aa_change.split('/')
        protein_changes = [p for p in changes if p.startswith('p.')]
        cds_changes = [p for p in changes if p.startswith('c.')]
        return [
            impact,
            gene_name,
            transcript_biotype,
            gene_coding,
            transcript_id,
            protein_changes[0] if protein_changes else None,
            cds_changes[0] if cds_changes else None,
            genotype_allele,
        ]
    raise NotImplementedError('cannot parse eff string. Does not match the expected pattern')
Now we want to convert the parsed content into a JSON/Dict to match the fields that we expect in IPR. If you are not uploading to IPR you can leave variants in their original format instead.
Note: Some of the fields (support counts) require picking specific samples. By default we have assumed the sample names are NORMAL and TUMOR; these may need to be customized for other inputs.
from graphkb.util import convert_aa_3to1
def pick_effect(record):
    non_canonical = []
    for eff in record.info['EFF']:
        [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele] = parse_eff(eff)
        # prefer canonical annotation
        if transcript_id in canonical_transcripts:
            return [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele]
        non_canonical.append([impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele])
    # fall back to the first non-canonical annotation
    return non_canonical[0]
ipr_variants = []

for record in fh.fetch():
    tumour_sample = record.samples['TUMOR']
    normal_sample = record.samples['NORMAL']
    ref_key = f'{record.ref}U'

    [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele] = pick_effect(record)
    alt_key = f'{alt_allele}U'

    # convert variant string from 3 to 1 letter AA if it is protein notation
    if protein_change:
        protein_change = convert_aa_3to1(protein_change)

    ipr_format = {
        'gene': gene_name,
        'transcript': transcript_id,
        'chromosome': record.chrom,
        'startPosition': record.pos,
        'endPosition': record.pos + record.rlen,
        'proteinChange': protein_change or cds_change,  # if no protein change use cds here instead
        'refSeq': record.ref,
        'altSeq': alt_allele,
        'tumourDepth': tumour_sample['DP'],
        'normalDepth': normal_sample['DP'],
        'normalRefCount': normal_sample[ref_key],
        'normalAltCount': normal_sample[alt_key],
        'tumourRefCount': tumour_sample[ref_key],
        'tumourAltCount': tumour_sample[alt_key],
        'hgvsProtein': f'{gene_name}:{protein_change}' if protein_change else '',
        'hgvsCds': f'{transcript_id}:{cds_change}' if cds_change else '',
    }
    ipr_variants.append(ipr_format)
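The 3-to-1 letter amino acid conversion above comes from graphkb.util.convert_aa_3to1. A minimal sketch of the idea (covering only a few residues for illustration, not the full HGVS grammar the real function handles) is:

```python
import re

# partial 3-letter to 1-letter amino acid map, for illustration only
AA_3TO1 = {'Gly': 'G', 'Val': 'V', 'Ile': 'I', 'Asp': 'D', 'Ter': '*'}

def convert_3to1(hgvsp: str) -> str:
    # replace each capitalized 3-letter code with its 1-letter equivalent
    return re.sub(r'[A-Z][a-z]{2}', lambda m: AA_3TO1.get(m.group(0), m.group(0)), hgvsp)

print(convert_3to1('p.Gly12Asp'))  # p.G12V-style output: p.G12D
```

The 1-letter form is what GraphKB expects when matching protein-level variant notation in the next step.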
# just for displaying in notebook
variants_df = pd.DataFrame.from_records(ipr_variants)
variants_df
Annotate¶
Now we are ready to add GraphKB annotations. First connect to the API. Here we are using the demo credentials.
If you were uploading to IPR you would skip this step and just input the variants. The IPR python adapter performs matching against GraphKB prior to uploading the report.
NOTE: The demo server is incomplete. It contains a small subset of the data we would expect in a normal production instance of GraphKB and should be used for demonstration and testing only.
from graphkb import GraphKBConnection
GKB_API_URL = 'https://pori-demo.bcgsc.ca/graphkb-api/api'
GKB_USER = 'colab_demo'
GKB_PASSWORD = 'colab_demo'
graphkb_conn = GraphKBConnection(GKB_API_URL, use_global_cache=False)
graphkb_conn.login(GKB_USER, GKB_PASSWORD)
Now use the matching functions to get the equivalent variant forms from GraphKB. This may take several minutes on the demo server as it has limited resources and open access.
from graphkb.match import match_positional_variant
from graphkb.util import FeatureNotFoundError
for variant in ipr_variants:
    # for the purposes of this tutorial we will skip non-protein changes for now
    if not variant['hgvsProtein']:
        continue
    variant_name = variant['gene'] + ':' + variant['proteinChange']
    try:
        variant_matches = match_positional_variant(graphkb_conn, variant_name)
        variant['_variantMatches'] = variant_matches
        if variant_matches:
            print(f'{variant_name} matched {len(variant_matches)} other variant representations')
    except FeatureNotFoundError:
        pass  # if the gene isn't in the db, there will not be annotations
    except Exception as err:
        print(variant_name, err)

print(f'queried {len(ipr_variants)} variants')
Now that we have matched the variants, we will fetch the related statements to annotate each variant with its possible relevance.
from graphkb.constants import BASE_RETURN_PROPERTIES, GENERIC_RETURN_PROPERTIES
from graphkb.util import convert_to_rid_list
# return properties should be customized to the user's needs
return_props = (
    BASE_RETURN_PROPERTIES
    + ['sourceId', 'source.name', 'source.displayName']
    + [f'conditions.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'subject.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'evidence.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'relevance.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'evidenceLevel.{p}' for p in GENERIC_RETURN_PROPERTIES]
)
results = []

for variant in ipr_variants:
    if not variant.get('_variantMatches'):
        continue
    variant_name = variant['gene'] + ':' + variant['proteinChange']
    variant_matches = variant['_variantMatches']

    statements = graphkb_conn.query(
        {
            'target': 'Statement',
            'filters': {'conditions': convert_to_rid_list(variant_matches), 'operator': 'CONTAINSANY'},
            'returnProperties': return_props,
        }
    )
    print(variant_name)
    print(f'annotated {len(variant_matches)} variant matches with {len(statements)} statements')
    print()

    for statement in statements:
        results.append((
            variant_name,
            len(variant_matches),
            ';'.join([c['displayName'] for c in statement['conditions'] if c['@class'].endswith('Variant')]),
            statement['relevance']['displayName'],
            statement['subject']['displayName'],
            statement['source']['displayName'] if statement['source'] else '',
            ';'.join([c['displayName'] for c in statement['evidence']]),
        ))
Now we will put this into a dataframe to display it nicely in the notebook.
df = pd.DataFrame(results, columns=['variant_name', 'variant_matches', 'statement_variants', 'relevance', 'subject', 'statement_source', 'evidence'])
df
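From here the results table can be explored with ordinary pandas operations, for example counting statements per relevance term. The rows below are invented placeholders shaped like the real output, just to show the column layout:

```python
import pandas as pd

# invented placeholder rows with the same columns as the results table above
df = pd.DataFrame(
    [
        ('KRAS:p.G12D', 2, 'KRAS:p.G12D', 'resistance', 'cetuximab', 'CIViC', 'pubmed#1'),
        ('KRAS:p.G12D', 2, 'KRAS:p.G12D', 'sensitivity', 'trametinib', 'CIViC', 'pubmed#2'),
        ('KRAS:p.G12D', 2, 'KRAS:p.G12D', 'resistance', 'panitumumab', 'CIViC', 'pubmed#3'),
    ],
    columns=['variant_name', 'variant_matches', 'statement_variants', 'relevance', 'subject', 'statement_source', 'evidence'],
)
print(df['relevance'].value_counts())  # resistance: 2, sensitivity: 1
```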