Annotate a VCF with GraphKB

The VCF file is a common format for small mutations (SNVs and indels), and there are a number of variants of this format. To help users get started, this example script shows how to convert a VCF file to the JSON format that the IPR python adapter expects.

We expect users to annotate their VCF files with protein and/or CDS HGVS notation before this step.

Download Files

Install

To do this, we will install a few common Python libraries. We will use pysam to read the VCF file (see Pysam - Working with VCF).

!pip install pysam pandas seaborn graphkb
Collecting pysam
  Downloading https://files.pythonhosted.org/packages/20/85/335857b9888f6d9a13b03a8f21b0a6228b180c361631d9d70e7be3e22163/pysam-0.16.0.1-cp37-cp37m-manylinux1_x86_64.whl (9.9MB)
     |████████████████████████████████| 9.9MB 4.3MB/s 
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (0.11.1)
Collecting graphkb
  Downloading https://files.pythonhosted.org/packages/cc/34/48ec589f81fce66cc70053e666a7834c3a7f708cee9a101dfc3d90158fe6/graphkb-1.5.4-py3-none-any.whl
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: matplotlib>=2.2 in /usr/local/lib/python3.7/dist-packages (from seaborn) (3.2.2)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from seaborn) (1.4.1)
Requirement already satisfied: typing-extensions<4,>=3.7.4.2 in /usr/local/lib/python3.7/dist-packages (from graphkb) (3.7.4.3)
Requirement already satisfied: requests<3,>=2.22.0 in /usr/local/lib/python3.7/dist-packages (from graphkb) (2.23.0)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests<3,>=2.22.0->graphkb) (2021.5.30)
Installing collected packages: pysam, graphkb
Successfully installed graphkb-1.5.4 pysam-0.16.0.1

Prepare Canonical Transcripts List

First we will read the tab-delimited BioMart export file. This contains gene and transcript IDs for canonical transcripts only. We'll use this to filter annotations in a later step.

import pandas as pd 

mart_df = pd.read_csv('https://raw.githubusercontent.com/bcgsc/pori/feature/colab-notebooks/demo/mart_export.protein_coding.canonical.txt', sep='\t')
mart_df
       Gene stable ID version  Transcript stable ID version
0      ENSG00000198888.2       ENST00000361390.2
1      ENSG00000198763.3       ENST00000361453.3
2      ENSG00000198804.2       ENST00000361624.2
3      ENSG00000198712.1       ENST00000361739.1
4      ENSG00000228253.1       ENST00000361851.1
...    ...                     ...
22791  ENSG00000181817.6       ENST00000315732.3
22792  ENSG00000116885.18      ENST00000235532.9
22793  ENSG00000116898.12      ENST00000373116.6
22794  ENSG00000119535.18      ENST00000373106.6
22795  ENSG00000142694.7       ENST00000490466.2

22796 rows × 2 columns

In this case we know that the transcripts in our VCF are unversioned, so we will also store the transcript names without their versions for quick access in later steps.

mart_df['unversioned'] = mart_df['Transcript stable ID version'].str.split('.').str[0]
canonical_transcripts = set(mart_df['Transcript stable ID version'].unique().tolist() + mart_df['unversioned'].unique().tolist())
len(canonical_transcripts)
45592
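The lookup set built above can be sketched as a small helper: both the versioned and unversioned form of each ID go into the set, so either style found in a VCF will match (the IDs shown are illustrative):

```python
def canonical_lookup(versioned_ids):
    # include both versioned and unversioned forms so either style matches
    unversioned = {tid.split('.')[0] for tid in versioned_ids}
    return set(versioned_ids) | unversioned

# Hypothetical Ensembl-style IDs for illustration
lookup = canonical_lookup(['ENST00000361390.2', 'ENST00000361453.3'])
```

This is why the set above has exactly twice as many entries as the BioMart export has rows.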

Process the VCF

Now we are ready to read the VCF file as input:

!wget https://raw.githubusercontent.com/bcgsc/pori/feature/colab-notebooks/demo/example_snvs.variants.passed.4.3.eff.hgvs.vcf
--2021-06-17 20:42:11--  https://raw.githubusercontent.com/bcgsc/pori/feature/colab-notebooks/demo/example_snvs.variants.passed.4.3.eff.hgvs.vcf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 387417 (378K) [text/plain]
Saving to: ‘example_snvs.variants.passed.4.3.eff.hgvs.vcf.1’

example_snvs.varian 100%[===================>] 378.34K  --.-KB/s    in 0.04s   

2021-06-17 20:42:12 (8.94 MB/s) - ‘example_snvs.variants.passed.4.3.eff.hgvs.vcf.1’ saved [387417/387417]


from pysam import VariantFile

fh = VariantFile('/content/example_snvs.variants.passed.4.3.eff.hgvs.vcf')

Next we will iterate over the VCF variant records and pull out the HGVS notation. Since this output comes from SnpEff, we can refer to their documentation for how to parse it. According to their site, the EFF field, which contains the HGVS notation, has the following format: pipe-delimited sub-fields inside parentheses.

EFF= Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_Change| Amino_Acid_Length | Gene_Name | Transcript_BioType | Gene_Coding | Transcript_ID | Exon_Rank  | Genotype_Number [ | ERRORS | WARNINGS ] )

Note: EFF is a deprecated field. Newer versions use the ANN field. Parsing should be similar.
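For reference, a minimal sketch of parsing the newer ANN field follows. The sub-field order is taken from the SnpEff ANN specification (only the leading fields are named here), and the example record is fabricated for illustration:

```python
# Leading ANN sub-fields, per the SnpEff ANN specification
ANN_KEYS = [
    'allele', 'annotation', 'impact', 'gene_name', 'gene_id',
    'feature_type', 'feature_id', 'transcript_biotype', 'rank',
    'hgvs_c', 'hgvs_p',
]

def parse_ann(ann: str) -> dict:
    # ANN entries are pipe-delimited; trailing fields beyond our keys are dropped
    fields = ann.split('|')
    return dict(zip(ANN_KEYS, fields))

# Fabricated ANN entry for illustration
example = ('A|missense_variant|MODERATE|BRAF|ENSG00000157764|transcript|'
           'ENST00000288602|protein_coding|15/18|c.1799T>A|p.Val600Glu')
parsed = parse_ann(example)
```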

We define the following function to help with parsing the INFO fields in the VCF:

import re 


def parse_eff(eff):
  match = re.match(r'^[^(]+\((.+)\)$', eff)
  if match:
    [impact, functional_class, codon_change, aa_change, _, gene_name, transcript_biotype, gene_coding, transcript_id, _, genotype_allele ] = match.group(1).split('|')[:11]

    changes = aa_change.split('/')
    protein_changes = [p for p in changes if p.startswith('p.')]
    cds_changes = [p for p in changes if p.startswith('c.')]

    return [
      impact, 
      gene_name,
      transcript_biotype,
      gene_coding, 
      transcript_id, 
      protein_changes[0] if protein_changes else None, 
      cds_changes[0] if cds_changes else None,
      genotype_allele
    ]
  raise NotImplementedError('cannot parse eff string. Does not match the expected pattern')
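To see what this extracts, here is the same regex applied to a fabricated EFF entry (values are illustrative, following the field order documented above):

```python
import re

# Fabricated EFF entry following the documented layout
eff = ('NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gTg/gAg|p.Val600Glu/c.1799T>A'
       '|766|BRAF|protein_coding|CODING|ENST00000288602|15|1)')

# strip the effect name and parentheses, then split the pipe-delimited body
match = re.match(r'^[^(]+\((.+)\)$', eff)
fields = match.group(1).split('|')[:11]
```

The transcript ID is the 9th sub-field (`fields[8]`) and the amino-acid change (`fields[3]`) can hold both protein and CDS notation separated by `/`.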

Now we want to convert the parsed content into a JSON/dict matching the fields that IPR expects. If you are not uploading to IPR, you can leave variants in their original format instead.

Note: Some of the fields (support counts) require picking the samples. By default we have assumed the names are NORMAL and TUMOR. These may need to be customized for other inputs.
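The 3-to-1 amino-acid conversion performed below by graphkb.util.convert_aa_3to1 can be approximated with a simplified stand-in (the mapping here is partial and for illustration only):

```python
import re

# Partial 3-to-1 amino acid code mapping, for illustration only
AA_3TO1 = {'Val': 'V', 'Glu': 'E', 'Gly': 'G', 'Arg': 'R', 'Ter': '*'}

def aa_3to1(notation: str) -> str:
    # replace each 3-letter code in the HGVS string with its 1-letter code
    return re.sub('|'.join(AA_3TO1), lambda m: AA_3TO1[m.group(0)], notation)

converted = aa_3to1('p.Val600Glu')  # -> 'p.V600E'
```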

from graphkb.util import convert_aa_3to1


def pick_effect(record):
  non_canonical = []

  for eff in record.info['EFF']:
    [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele] = parse_eff(eff)

    # prefer canonical annotation
    if transcript_id in canonical_transcripts:
      return [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele]
    non_canonical.append([impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele])

  # otherwise fall back to the first non-canonical annotation
  return non_canonical[0]



ipr_variants = []

for record in fh.fetch():
  tumour_sample = record.samples['TUMOR']
  normal_sample = record.samples['NORMAL']
  ref_key = f'{record.ref}U'

  [impact, gene_name, transcript_biotype, gene_coding, transcript_id, protein_change, cds_change, alt_allele] = pick_effect(record)

  alt_key = f'{alt_allele}U'

  # convert the variant string from 3 to 1 letter AA if it is protein notation
  if protein_change:
    protein_change = convert_aa_3to1(protein_change)

  ipr_format = {
      'gene': gene_name,
      'transcript': transcript_id,
      'chromosome': record.chrom,
      'startPosition': record.pos,
      'endPosition': record.pos + record.rlen,
      'proteinChange': protein_change or cds_change,  # if no protein change use cds here instead
      'refSeq': record.ref,
      'altSeq': alt_allele,
      'tumourDepth': tumour_sample['DP'],
      'normalDepth': normal_sample['DP'],
      'normalRefCount': normal_sample[ref_key],
      'normalAltCount': normal_sample[alt_key],
      'tumourRefCount': tumour_sample[ref_key],
      'tumourAltCount': tumour_sample[alt_key],
      'hgvsProtein': f'{gene_name}:{protein_change}' if protein_change else '',
      'hgvsCds': f'{transcript_id}:{cds_change}' if cds_change else ''
  }
  ipr_variants.append(ipr_format)

# just for displaying in notebook
variants_df = pd.DataFrame.from_records(ipr_variants)
variants_df
gene transcript chromosome startPosition endPosition proteinChange refSeq altSeq tumourDepth normalDepth normalRefCount normalAltCount tumourRefCount tumourAltCount hgvsProtein hgvsCds
0 NOC2L ENST00000327044 1 883516 883517 p.L552L G A 69 44 (42, 45) (0, 0) (16, 16) (53, 56) NOC2L:p.L552L ENST00000327044:c.1654C>T
1 ATAD3B ENST00000378741 1 1418472 1418473 p.R114* A T 41 33 (31, 33) (0, 0) (29, 34) (10, 14) ATAD3B:p.R114* ENST00000378741:c.340A>T
2 PRDM16 ENST00000270722 1 3329312 3329313 p.P851S C T 52 49 (47, 49) (2, 2) (40, 44) (10, 10) PRDM16:p.P851S ENST00000270722:c.2551C>T
3 TNFRSF8 ENST00000263932 1 12170228 12170229 p.P215S C T 57 26 (24, 27) (0, 1) (12, 12) (43, 44) TNFRSF8:p.P215S ENST00000263932:c.643C>T
4 CLCNKB ENST00000375679 1 16372084 16372085 p.F44F C T 70 51 (51, 52) (0, 0) (50, 51) (18, 19) CLCNKB:p.F44F ENST00000375679:c.132C>T
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
410 ZNF280C ENST00000370978 X 129362963 129362964 p.H379Y G A 40 19 (19, 21) (0, 0) (0, 0) (39, 42) ZNF280C:p.H379Y ENST00000370978:c.1135C>T
411 ZNF280C ENST00000370978 X 129362964 129362965 p.P378P G A 40 19 (19, 21) (0, 0) (0, 1) (39, 41) ZNF280C:p.P378P ENST00000370978:c.1134C>T
412 PNMA3 ENST00000370264 X 152226437 152226438 p.P342L C T 60 28 (28, 32) (0, 0) (0, 1) (57, 61) PNMA3:p.P342L ENST00000370264:c.1025C>T
413 WASH6P ENST00000359512 X 155253820 155253821 p.G373S G A 74 40 (39, 59) (0, 1) (61, 114) (9, 20) WASH6P:p.G373S ENST00000359512:c.1117G>A
414 AC134882.1 ENST00000331172 Y 13524548 13524549 p.Y57S T G 15 15 (13, 13) (2, 3) (0, 0) (14, 26) AC134882.1:p.Y57S ENST00000331172:c.170A>C

415 rows × 16 columns

Annotate

Now we are ready to add GraphKB annotations. First connect to the API. Here we are using the demo credentials.

If you were uploading to IPR you would skip this step and just input the variants. The IPR python adapter performs matching against GraphKB prior to uploading the report.

NOTE: The demo server is incomplete. It contains only a small subset of the data we would expect in a normal production instance of GraphKB and should be used for demonstration and testing only.

from graphkb import GraphKBConnection

GKB_API_URL = 'https://pori-demo.bcgsc.ca/graphkb-api/api'
GKB_USER = 'colab_demo'
GKB_PASSWORD = 'colab_demo'

graphkb_conn = GraphKBConnection(GKB_API_URL, use_global_cache=False)

graphkb_conn.login(GKB_USER, GKB_PASSWORD)

Now use the matching functions to get the equivalent variant forms from GraphKB. This may take several minutes on the demo server as it has limited resources and open access.

from graphkb.match import match_positional_variant
from graphkb.util import FeatureNotFoundError

for variant in ipr_variants:
  # for the purposes of this tutorial we will skip non-protein changes for now
  if not variant['hgvsProtein']:
    continue
  variant_name = variant['gene'] + ':' + variant['proteinChange']
  try:
    variant_matches = match_positional_variant(graphkb_conn, variant_name)
    variant['_variantMatches'] = variant_matches
    if variant_matches:
      print(f'{variant_name} matched {len(variant_matches)} other variant representations')
  except FeatureNotFoundError:
    pass # if the gene isn't in the db, there will not be annotations
  except Exception as err:
    print(variant_name, err)

print(f'queried {len(ipr_variants)} variants')
SF3B1:p.P718L matched 1 other variant representations
STK19:p.D89N matched 2 other variant representations
BRAF:p.V600E matched 6 other variant representations
queried 415 variants

Now that we have matched the variants, we will fetch the related statements to annotate them with their possible relevance.

from graphkb.constants import BASE_RETURN_PROPERTIES, GENERIC_RETURN_PROPERTIES
from graphkb.util import convert_to_rid_list

# return properties should be customized to the user's needs
return_props = (
    BASE_RETURN_PROPERTIES
    + ['sourceId', 'source.name', 'source.displayName']
    + [f'conditions.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'subject.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'evidence.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'relevance.{p}' for p in GENERIC_RETURN_PROPERTIES]
    + [f'evidenceLevel.{p}' for p in GENERIC_RETURN_PROPERTIES]
)
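To see what this expands to, here is the same pattern with a small stand-in subset of properties (not the real BASE_RETURN_PROPERTIES/GENERIC_RETURN_PROPERTIES lists):

```python
# Stand-in property lists, for illustration only
base = ['@rid', '@class']
generic = ['displayName', 'name']

# each linked record's generic properties are requested with a dotted prefix
props = (
    base
    + [f'conditions.{p}' for p in generic]
    + [f'relevance.{p}' for p in generic]
)
```

Each dotted entry asks the API to return that property of the linked record, which is why the statement results below can be rendered without extra queries.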

results = []

for variant in ipr_variants:
  if not variant.get('_variantMatches'):
    continue
  variant_name = variant['gene'] + ':' + variant['proteinChange']
  variant_matches = variant['_variantMatches']
  statements = graphkb_conn.query(
      {
          'target': 'Statement',
          'filters': {'conditions': convert_to_rid_list(variant_matches), 'operator': 'CONTAINSANY'},
          'returnProperties': return_props,
      }
  )
  print(variant_name)
  print(f'annotated {len(variant_matches)} variant matches with {len(statements)} statements')
  print()

  for statement in statements:
      results.append((
          variant_name, 
          len(variant_matches), 
          ';'.join([c['displayName'] for c in statement['conditions'] if c['@class'].endswith('Variant')]),
          statement['relevance']['displayName'],
          statement['subject']['displayName'],
          statement['source']['displayName'] if statement['source'] else '',
          ';'.join([c['displayName'] for c in statement['evidence']])
      ))
SF3B1:p.P718L
annotated 1 variant matches with 5 statements

STK19:p.D89N
annotated 2 variant matches with 2 statements

BRAF:p.V600E
annotated 6 variant matches with 135 statements


Finally, we will put the results into a dataframe to display them nicely in this notebook.

df = pd.DataFrame(results, columns=['variant_name', 'variant_matches', 'statement_variants', 'relevance', 'subject', 'statement_source', 'evidence'])
df
variant_name variant_matches statement_variants relevance subject statement_source evidence
0 SF3B1:p.P718L 1 SF3B1 mutation favourable prognosis patient CIViC pmid:21995386
1 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:26837699
2 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:24943832
3 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:23086750
4 SF3B1:p.P718L 1 SF3B1 mutation unfavourable prognosis patient CIViC pmid:23568491
... ... ... ... ... ... ... ...
137 BRAF:p.V600E 6 BRAF mutation resistance Irinotecan [c62040] CIViC pmid:19603024
138 BRAF:p.V600E 6 BRAF mutation resistance Oxaliplatin [c1181] CIViC pmid:19603024
139 BRAF:p.V600E 6 BRAF mutation unfavourable prognosis patient CIViC pmid:19603024
140 BRAF:p.V600E 6 BRAF mutation resistance cetuximab + chemotherapy CIViC pmid:20619739
141 BRAF:p.V600E 6 BRAF mutation resistance Cetuximab [c1723] CIViC pmid:22586653

142 rows × 7 columns
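From here the statement table can be summarized further, for example by counting statements per relevance term. A minimal sketch using a stand-in list (in the notebook this would be `df['relevance']`, e.g. via `value_counts()`):

```python
from collections import Counter

# Stand-in relevance column; illustrative values only
relevances = ['resistance', 'resistance', 'unfavourable prognosis']

# tally statements per relevance term
summary = Counter(relevances)
```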
