Calculating Expression Outliers Against a Reference Data Set¶
Download the Data¶
First we need to download the data set we will use. We are going to use the TCGA data here since it is publically available and well known (UCSC Xena TCGA PANCAN). We will download the TPM matrix as well as the corresponding metadata file
!wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_RSEM_gene_tpm.gz
!wget https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/Survival_SupplementalTable_S1_20171025_xena_sp
The matrix we have downloaded is very large as it contains many different disease cohorts. For the purposes of this tutorial we want a smaller data set so we are going to choose the smallest TCGA cohort, CHOL. To do this we need to use the metadata file to subset the main matrix file
Install Dependencies¶
For the following tutorial we are going to use some common data processing and visualization libraries in python: seaborn and pandas. We need to install them before we proceed
!pip install pandas seaborn
Create the Disease Matrix Subset¶
First read the metadata file
import json
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
sns.set_style('whitegrid')
meta_df = pd.read_csv('Survival_SupplementalTable_S1_20171025_xena_sp', sep='\t')
meta_df
Now select the rows which match the disease we are intereseted in, CHOL. Print out the number of rows after the selecton. There should be 45
chol_meta_df = meta_df[meta_df['cancer type abbreviation'] == 'CHOL']
chol_meta_df.shape[0]
Next we are going to read in the data matrix. We will use the metadata we just collected to only load the columns for the cohort of interest. This will save memory and speed up the process. This will still probably take a few seconds.
columns_to_keep = ['sample'] + chol_meta_df['sample'].tolist()
chol_matrix = pd.read_csv('tcga_RSEM_gene_tpm.gz', compression='gzip', sep='\t', usecols=columns_to_keep)
To make things less confusing we will rename the "sample" column as "gene" since that is what is actually in that column
chol_matrix = chol_matrix.rename(columns={'sample': 'gene'})
chol_matrix = chol_matrix.set_index('gene')
chol_matrix
Now that we have our matrix we are ready to start calculating metrics
Calculating Metrics¶
The input for an analysis where a patient sample is being compared to the TCGA cohort of interest would typically be an expression matrix using the same normalization method as the comparator cohort.
Note: Differences in library preparation, sequencing and bioinformatics pipelines can lead to variation in expression quantification. It is important to understand the impact these differences have on expression quantification before comparing expression values that are produced using different pipelines.
For the purposes of this tutorial we are going to choose one of the CHOL samples from our expression matrix to use as our input. We arbitrarily chose the first sample.
test_sample_id = chol_meta_df['sample'].tolist()[0]
Percentile¶
We will start by calculating the percentile rank
percentiles = chol_matrix.copy()
percentiles = percentiles.rank(1, pct=True, numeric_only=True).apply(lambda x: round(x * 100))
percentiles
We can plot a couple of these as a sanity check. The line plots should never go up and then down again
gene = 'ENSG00000105063.18'
fig = sns.relplot(kind='line', x=chol_matrix.loc[gene], y=percentiles.loc[gene])
fig.set(xlabel='TPM', ylabel='Percentile Rank')
Now let's look at the percentiles for our test sample
percentiles[[test_sample_id]]
We can select the highest and lowest percentiles as a way to look for possible outliers (arbitrarily selecting 95 and 5 as thresholds).
percentiles[[test_sample_id]][(percentiles[test_sample_id] >= 95) | (percentiles[test_sample_id] <= 5)]
Calculating kIQR¶
The next metric we are going to look at is the kIQR. This metric is different from percentile in that it takes into account the spread or variance of the distribution. To be able to calculate this we first need to calculate some parameters for each gene
quantiles = pd.concat([
chol_matrix[test_sample_id],
chol_matrix.quantile(0.25, 1),
chol_matrix.quantile(0.5, 1),
chol_matrix.quantile(0.75, 1),
], axis=1).rename(
columns={0.25: 'q1', 0.5: 'q2', 0.75: 'q3', test_sample_id: 'tpm'}
).reset_index()
quantiles['iqr'] = quantiles.q3 - quantiles.q1
quantiles
Now we are ready to calculate the kIQR for our sample
quantiles['kiqr'] = quantiles[quantiles.iqr > 0].apply(lambda row: (row.tpm - row.q2) / row.iqr, axis=1)
Based on the newly calcuated kIQR we pick thresholds to select outliers. We can do these alone or with the percentile filters. We have arbitrarily chosen 3 for the purposes of this tutorial
quantile_outliers = quantiles[(quantiles.kiqr >= 3) | (quantiles.kiqr <= -3)].reset_index()
quantile_outliers
Now we pick some of these genes to plot
outlier_gene = quantile_outliers.iloc[0].gene
outlier_gene
And create a histogram which highlights where the current sample lands relative to the other samples
ax = sns.histplot(chol_matrix.T[outlier_gene], kde=True)
ax.set_title(outlier_gene)
ax.set_xlabel('TPM')
# add a marker over the bin the sample lands in
current = quantile_outliers[quantile_outliers.gene == outlier_gene].iloc[0].tpm
for p in ax.patches:
if current >= p.get_x() and current <= p.get_x() + p.get_width():
p.set_color('black')
plt.text(
p.get_x() + (p.get_width() / 2),
p.get_height(),
'*',
horizontalalignment='center',
verticalalignment='bottom',
)
Creating the Report Input¶
Since the required input to the IPR python adapter is JSON we will format the output from this analysis to match that specification. Note this will only contain the expression relevant portions.
We will start by combining the percentile, tpm, and kiqr into a single matrix
metrics_df = pd.concat([
quantiles.set_index('gene'),
percentiles[[test_sample_id]].rename(columns={test_sample_id: 'percentile'})
], axis=1)
metrics_df
Next we are going to use our thresholds to assign labels
percentile_threshold = 97.5
kiqr_threshold = 2.5
metrics_df.loc[(metrics_df.percentile >= percentile_threshold) & (metrics_df.kiqr >= kiqr_threshold), 'kbCategory'] = 'increased expression'
metrics_df.loc[(metrics_df.percentile <= 100 - percentile_threshold) & (metrics_df.kiqr <= -1 * kiqr_threshold), 'kbCategory'] = 'reduced expression'
metrics_df[~metrics_df.kbCategory.isnull()]
Finally we can use this to create our output. The percentiles and kIQR values we have calculated above were for our disease comparator but a similar process would be used for normal and biopsy comparators
metrics_df = metrics_df.rename(columns={'kiqr': 'diseasekIQR', 'percentile': 'diseasePercentile'}).reset_index()
We often leave non-variant expression records in to supplement the copy variant and structural variant information. For these records the kbCategory field is a null value since they are not an expression outlier
expression_variants = metrics_df[['tpm', 'diseasekIQR', 'diseasePercentile', 'gene', 'kbCategory']].replace({np.nan:None}).to_dict('records')
expression_variants[0]
We will also add the disease comparator information to the JSON
output_json = {
'comparators': [
{'analysisRole': 'expression (disease)', 'name': 'TCGA CHOL', 'size': chol_matrix.shape[0]}
],
'expressionVariants': expression_variants
}
with open('result.json', 'w') as fh:
json.dump(output_json, fh)