Skip to content

Annotate a VCF with GraphKB

Open In Colab

A common format for small mutations is the VCF file. There are a number of variants of this format. To help users getting started we have provided this example script of how to convert from a VCF file to the JSON format that the IPR python adapter expects.

We expect users to annotate their VCF files with protein and/or cds HGVS notation prior to this.

Download Files


To do this we are going to install a couple of common python libraries. We will use pysam to read the vcf file (Pysam - Working with VCF).

!pip install pysam pandas seaborn graphkb
Collecting pysam
  Downloading (9.9MB)
     |████████████████████████████████| 9.9MB 4.3MB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (0.11.1)
Collecting graphkb
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: matplotlib>=2.2 in /usr/local/lib/python3.7/dist-packages (from seaborn) (3.2.2)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from seaborn) (1.4.1)
Requirement already satisfied: typing-extensions<4,>= in /usr/local/lib/python3.7/dist-packages (from graphkb) (
Requirement already satisfied: requests<3,>=2.22.0 in /usr/local/lib/python3.7/dist-packages (from graphkb) (2.23.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (2021.5.30)
Installing collected packages: pysam, graphkb
Successfully installed graphkb-1.5.4 pysam-

Prepare Canonical Transcripts List

First we will read the tabbed biomart export file. This contains a list of gene and transcript IDs for only canonical transcripts. We'll use this to filter annotations in a later step

import pandas as pd 

mart_df = pd.read_csv('', sep='\t')
Gene stable ID version Transcript stable ID version
0 ENSG00000198888.2 ENST00000361390.2
1 ENSG00000198763.3 ENST00000361453.3
2 ENSG00000198804.2 ENST00000361624.2
3 ENSG00000198712.1 ENST00000361739.1
4 ENSG00000228253.1 ENST00000361851.1
... ... ...
22791 ENSG00000181817.6 ENST00000315732.3
22792 ENSG00000116885.18 ENST00000235532.9
22793 ENSG00000116898.12 ENST00000373116.6
22794 ENSG00000119535.18 ENST00000373106.6
22795 ENSG00000142694.7 ENST00000490466.2

22796 rows × 2 columns

In this case we know that the transcripts in our VCF are unversioned so we will store the transcript names without their versions as well for quick access in later steps

mart_df['unversioned'] = mart_df['Transcript stable ID version'].str.split('.').str[0]
canonical_transcripts = set(mart_df['Transcript stable ID version'].unique().tolist() + mart_df['unversioned'].unique().tolist())

Process the VCF

Then we are ready to read the VCF file as input

--2021-06-17 20:42:11--
Resolving (,,, ...
Connecting to (||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 387417 (378K) [text/plain]
Saving to: ‘example_snvs.variants.passed.4.3.eff.hgvs.vcf.1’

example_snvs.varian 100%[===================>] 378.34K  --.-KB/s    in 0.04s   

2021-06-17 20:42:12 (8.94 MB/s) - ‘example_snvs.variants.passed.4.3.eff.hgvs.vcf.1’ saved [387417/387417]

from pysam import VariantFile

fh = VariantFile('/content/example_snvs.variants.passed.4.3.eff.hgvs.vcf')

Then we will iterate over the variant VCF records. From here we will need to pull out the HGVSp notation. Since this is output from SnpEff we can refer to their documentation for how to parse this. According to their site this is the expected format of the EFF field which contains the HGVS notation. This field is pipe delimited inside parentheses.

EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )

Note: EFF is a deprecated field. Newer versions use the ANN field. Parsing should be similar.

We define the following functions below to help with parsing the INFO fields in the VCF

import re 

def parse_eff(eff):
  match = re.match(r'^[^(]+\((.+)\)$', eff)
  if match:
    [impact, functional_class, codon_change, aa_change, _, gene_name, transcript_biotype, gene_coding, transcript_id, _, genotype_allele ] ='|')[:11]

    changes = aa_change.split('/')
    protein_changes = [p for p in changes if p.startswith('p.')]
    cds_changes = [p for p in changes if p.startswith('c.')]

    return [
      protein_changes[0] if protein_changes else None, 
      cds_changes[0] if cds_changes else None,
  raise NotImplementedError('cannot parse eff string. Does not match the expected pattern')

Now we want to convert the parsed content into a JSON/Dict to match the fields that we expect in IPR. If you are not uploading to IPR you can leave variants in their original format instead.

Note: Some of the fields (support counts) require picking the samples. By default we have assumed the names are NORMAL and TUMOR. These may need to be customized for other input

from graphkb.util import convert_aa_3to1

def pick_effect(record):
  non_canonical = []

  for eff in['EFF']:
    [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele] = parse_eff(eff)

    # prefer canonical annotation
    if transcript_id in canonical_transcripts:
      return [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele]
    non_canonical.append([impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele])

  return non_canonical[0]

ipr_variants = []

for record in fh.fetch():
  tumour_sample = get_sample_by_name(record, 'TUMOR')
  normal_sample = record.samples['NORMAL']
  ref_key = f'{record.ref}U'

  no_canonical = True

  [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele] = pick_effect(record)

  alt_key = f'{alt_allele}U'
  if protein_change:
    protein_change = convert_aa_3to1(protein_change)

  # convert variant string from 3 to 1 letter AA if it is protein notation
  ipr_format = {
      'gene': gene_name,
      'transcript': transcript_id,
      'chromosome': record.chrom,
      'startPosition': record.pos,
      'endPosition': record.pos + record.rlen,
      'proteinChange': protein_change or cds_change,  # if no protein change use cds here instead
      'refSeq': record.ref,
      'altSeq': alt_allele,
      'tumourDepth': tumour_sample['DP'],
      'normalDepth': normal_sample['DP'],
      'normalRefCount': normal_sample[ref_key],
      'normalAltCount': normal_sample[alt_key],
      'tumourRefCount': tumour_sample[ref_key],
      'tumourAltCount': tumour_sample[alt_key],
      'hgvsProtein': f'{gene_name}:{protein_change}' if protein_change else '',
      'hgvsCds': f'{transcript_id}:{cds_change}' if cds_change else ''

# just for displaying in notebook
variants_df = pd.DataFrame.from_records(ipr_variants)
gene transcript chromosome startPosition endPosition proteinChange refSeq altSeq tumourDepth normalDepth normalRefCount normalAltCount tumourRefCount tumourAltCount hgvsProtein hgvsCds
0 NOC2L ENST00000327044 1 883516 883517 p.L552L G A 69 44 (42, 45) (0, 0) (16, 16) (53, 56) NOC2L:p.L552L ENST00000327044:c.1654C>T
1 ATAD3B ENST00000378741 1 1418472 1418473 p.R114* A T 41 33 (31, 33) (0, 0) (29, 34) (10, 14) ATAD3B:p.R114* ENST00000378741:c.340A>T
2 PRDM16 ENST00000270722 1 3329312 3329313 p.P851S C T 52 49 (47, 49) (2, 2) (40, 44) (10, 10) PRDM16:p.P851S ENST00000270722:c.2551C>T
3 TNFRSF8 ENST00000263932 1 12170228 12170229 p.P215S C T 57 26 (24, 27) (0, 1) (12, 12) (43, 44) TNFRSF8:p.P215S ENST00000263932:c.643C>T
4 CLCNKB ENST00000375679 1 16372084 16372085 p.F44F C T 70 51 (51, 52) (0, 0) (50, 51) (18, 19) CLCNKB:p.F44F ENST00000375679:c.132C>T
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
410 ZNF280C ENST00000370978 X 129362963 129362964 p.H379Y G A 40 19 (19, 21) (0, 0) (0, 0) (39, 42) ZNF280C:p.H379Y ENST00000370978:c.1135C>T
411 ZNF280C ENST00000370978 X 129362964 129362965 p.P378P G A 40 19 (19, 21) (0, 0) (0, 1) (39, 41) ZNF280C:p.P378P ENST00000370978:c.1134C>T
412 PNMA3 ENST00000370264 X 152226437 152226438 p.P342L C T 60 28 (28, 32) (0, 0) (0, 1) (57, 61) PNMA3:p.P342L ENST00000370264:c.1025C>T
413 WASH6P ENST00000359512 X 155253820 155253821 p.G373S G A 74 40 (39, 59) (0, 1) (61, 114) (9, 20) WASH6P:p.G373S ENST00000359512:c.1117G>A
414 AC134882.1 ENST00000331172 Y 13524548 13524549 p.Y57S T G 15 15 (13, 13) (2, 3) (0, 0) (14, 26) AC134882.1:p.Y57S ENST00000331172:c.170A>C

415 rows × 16 columns


Now we are ready to add GraphKB annotations. First connect to the API. Here we are using the demo credentials.

If you were uploading to IPR you would skip this step and just input the variants. The IPR python adapter performs matching against GraphKB prior to uploading the report.

NOTE: The demo server is incomplete. It contains a small subset of the data we would expect in a normal production instance of GraphKB and should be used for demonstration and testing only

from graphkb import GraphKBConnection

GKB_USER = 'colab_demo'
GKB_PASSWORD = 'colab_demo'

graphkb_conn = GraphKBConnection(GKB_API_URL, use_global_cache=False)

graphkb_conn.login(GKB_USER, GKB_PASSWORD)

Now use the matching functions to get the equivalent variant forms from GraphKB. This may take several minutes on the demo server as it has limited resources and open access.

from graphkb.match import match_positional_variant
from graphkb.util import FeatureNotFoundError

for variant in ipr_variants:
  # for the purposes of this tutorial we will skip non-protein changes for now
  if not variant['hgvsProtein']:
  variant_name = variant['gene'] + ':' + variant['proteinChange']
    variant_matches = match_positional_variant(graphkb_conn, variant_name)
    variant['_variantMatches'] = variant_matches
    if variant_matches:
      print(f'{variant_name} matched {len(variant_matches)} other variant representations')
  except FeatureNotFoundError:
    pass # if the gene isn't in the db, there will not be annotations
  except Exception as err:
    print(variant_name, err)

print(f'queried {len(ipr_variants)} variants')
SF3B1:p.P718L matched 1 other variant representations
STK19:p.D89N matched 2 other variant representations
BRAF:p.V600E matched 6 other variant representations
queried 415 variants

Now that we have matched the variant we will fetch the related statements to annotate this variant with its possible relevance

from graphkb.util import convert_to_rid_list

# return properties should be customized to the users needs
return_props = (
    + ['sourceId', '', 'source.displayName']
    + [f'conditions.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'subject.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'evidence.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'relevance.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'evidenceLevel.{p}' for p in GENERIC_RETURN_PROPERTIES]

results = []

for variant in ipr_variants:
  if not variant.get('_variantMatches'):
  variant_name = variant['gene'] + ':' + variant['proteinChange']
  variant_matches = variant['_variantMatches']
  statements = graphkb_conn.query(
          'target': 'Statement',
          'filters': {'conditions': convert_to_rid_list(variant_matches), 'operator': 'CONTAINSANY'},
          'returnProperties': return_props,
  print(f'annotated {len(variant_matches)} variant matches with {len(statements)} statements')

  for statement in statements:
          ';'.join([c['displayName'] for c in statement['conditions'] if c['@class'].endswith('Variant')]),
          statement['source']['displayName'] if statement['source'] else '',
          ';'.join([c['displayName'] for c in statement['evidence']])
annotated 1 variant matches with 5 statements

annotated 2 variant matches with 2 statements

annotated 6 variant matches with 135 statements

Now we will put this into a dataframe to display nicely in this notebook

df = pd.DataFrame(results, columns=['variant_name', 'variant_matches', 'statement_variants', 'relevance', 'subject', 'statement_source', 'evidence'])
variant_name variant_matches statement_variants relevance subject statement_source evidence
0 SF3B1:p.P718L 1 SF3B1 mutation favourable prognosis patient CIViC pmid:21995386
1 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:26837699
2 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:24943832
3 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:23086750
4 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:23568491
... ... ... ... ... ... ... ...
137 BRAF:p.V600E 6 BRAF mutation resistance Irinotecan [c62040] CIViC pmid:19603024
138 BRAF:p.V600E 6 BRAF mutation resistance Oxaliplatin [c1181] CIViC pmid:19603024
139 BRAF:p.V600E 6 BRAF mutation unfavourable prognosis patient CIViC pmid:19603024
140 BRAF:p.V600E 6 BRAF mutation resistance cetuximab + chemotherapy CIViC pmid:20619739
141 BRAF:p.V600E 6 BRAF mutation resistance Cetuximab [c1723] CIViC pmid:22586653

142 rows × 7 columns

Back to top