Calculating Expression Outliers Against a Reference Data Set¶

Download the Data¶

First we need to download the data set we will use. We are going to use the TCGA data here since it is publically available and well known (UCSC Xena TCGA PANCAN). We will download the TPM matrix as well as the corresponding metadata file

!wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_RSEM_gene_tpm.gz
!wget https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/Survival_SupplementalTable_S1_20171025_xena_sp

--2021-06-09 19:24:14--  https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_RSEM_gene_tpm.gz
Resolving toil-xena-hub.s3.us-east-1.amazonaws.com (toil-xena-hub.s3.us-east-1.amazonaws.com)... 52.216.80.45
Connecting to toil-xena-hub.s3.us-east-1.amazonaws.com (toil-xena-hub.s3.us-east-1.amazonaws.com)|52.216.80.45|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 740772247 (706M) [binary/octet-stream]
Saving to: ‘tcga_RSEM_gene_tpm.gz.1’

tcga_RSEM_gene_tpm. 100%[===================>] 706.46M  25.4MB/s    in 24s     

2021-06-09 19:24:38 (30.0 MB/s) - ‘tcga_RSEM_gene_tpm.gz.1’ saved [740772247/740772247]

--2021-06-09 19:24:38--  https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/Survival_SupplementalTable_S1_20171025_xena_sp
Resolving tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com (tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com)... 52.217.93.16
Connecting to tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com (tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com)|52.217.93.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2419504 (2.3M) [binary/octet-stream]
Saving to: ‘Survival_SupplementalTable_S1_20171025_xena_sp.1’

Survival_Supplement 100%[===================>]   2.31M  7.80MB/s    in 0.3s    

2021-06-09 19:24:38 (7.80 MB/s) - ‘Survival_SupplementalTable_S1_20171025_xena_sp.1’ saved [2419504/2419504]

The matrix we have downloaded is very large as it contains many different disease cohorts. For the purposes of this tutorial we want a smaller data set so we are going to choose the smallest TCGA cohort, CHOL. To do this we need to use the metadata file to subset the main matrix file

Install Dependencies¶

For the following tutorial we are going to use some common data processing and visualization libraries in python: seaborn and pandas. We need to install them before we proceed

!pip install pandas seaborn

Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (0.11.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: matplotlib>=2.2 in /usr/local/lib/python3.7/dist-packages (from seaborn) (3.2.2)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from seaborn) (1.4.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (0.10.0)

Create the Disease Matrix Subset¶

First read the metadata file

import json

import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

sns.set_style('whitegrid')

meta_df = pd.read_csv('Survival_SupplementalTable_S1_20171025_xena_sp', sep='\t')
meta_df

	sample	_PATIENT	cancer type abbreviation	age_at_initial_pathologic_diagnosis	gender	race	ajcc_pathologic_tumor_stage	clinical_stage	histological_type	histological_grade	initial_pathologic_dx_year	menopause_status	birth_days_to	vital_status	tumor_status	last_contact_days_to	death_days_to	cause_of_death	new_tumor_event_type	new_tumor_event_site	new_tumor_event_site_other	new_tumor_event_dx_days_to	treatment_outcome_first_course	margin_status	residual_tumor	OS	OS.time	DSS	DSS.time	DFI	DFI.time	PFI	PFI.time	Redaction
0	TCGA-OR-A5J1-01	TCGA-OR-A5J1	ACC	58.0	MALE	WHITE	Stage II	NaN	Adrenocortical carcinoma- Usual Type	NaN	2000.0	NaN	-21496.0	Dead	WITH TUMOR	NaN	1355.0	NaN	Distant Metastasis	Peritoneal Surfaces	NaN	754.0	Complete Remission/Response	NaN	NaN	1.0	1355.0	1.0	1355.0	1.0	754.0	1.0	754.0	NaN
1	TCGA-OR-A5J2-01	TCGA-OR-A5J2	ACC	44.0	FEMALE	WHITE	Stage IV	NaN	Adrenocortical carcinoma- Usual Type	NaN	2004.0	NaN	-16090.0	Dead	WITH TUMOR	NaN	1677.0	NaN	Distant Metastasis	Soft Tissue	NaN	289.0	Progressive Disease	NaN	NaN	1.0	1677.0	1.0	1677.0	NaN	NaN	1.0	289.0	NaN
2	TCGA-OR-A5J3-01	TCGA-OR-A5J3	ACC	23.0	FEMALE	WHITE	Stage III	NaN	Adrenocortical carcinoma- Usual Type	NaN	2008.0	NaN	-8624.0	Alive	WITH TUMOR	2091.0	NaN	NaN	Distant Metastasis	Lung	NaN	53.0	Complete Remission/Response	NaN	NaN	0.0	2091.0	0.0	2091.0	1.0	53.0	1.0	53.0	NaN
3	TCGA-OR-A5J4-01	TCGA-OR-A5J4	ACC	23.0	FEMALE	WHITE	Stage IV	NaN	Adrenocortical carcinoma- Usual Type	NaN	2000.0	NaN	-8451.0	Dead	WITH TUMOR	NaN	423.0	NaN	Locoregional Recurrence	Peritoneal Surfaces	NaN	126.0	Progressive Disease	NaN	NaN	1.0	423.0	1.0	423.0	NaN	NaN	1.0	126.0	NaN
4	TCGA-OR-A5J5-01	TCGA-OR-A5J5	ACC	30.0	MALE	WHITE	Stage III	NaN	Adrenocortical carcinoma- Usual Type	NaN	2000.0	NaN	-11171.0	Dead	WITH TUMOR	NaN	365.0	NaN	Locoregional Recurrence	Other, specify	vena cava thrombus	50.0	Progressive Disease	NaN	NaN	1.0	365.0	1.0	365.0	NaN	NaN	1.0	50.0	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
12586	TCGA-YZ-A980-01	TCGA-YZ-A980	UVM	75.0	MALE	WHITE	Stage IIIA	Stage IIIA	Spindle Cell\|Epithelioid Cell	NaN	2010.0	NaN	-27716.0	Alive	TUMOR FREE	1862.0	NaN	NaN	New Primary Tumor	Other, specify	Scalp	1556.0	NaN	NaN	NaN	0.0	1862.0	0.0	1862.0	NaN	NaN	1.0	1556.0	NaN
12587	TCGA-YZ-A982-01	TCGA-YZ-A982	UVM	79.0	FEMALE	WHITE	Stage IIIB	Stage IIIB	Spindle Cell	NaN	2013.0	NaN	-28938.0	Alive	TUMOR FREE	495.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	495.0	0.0	495.0	NaN	NaN	0.0	495.0	NaN
12588	TCGA-YZ-A983-01	TCGA-YZ-A983	UVM	51.0	FEMALE	WHITE	Stage IIB	Stage IIB	Epithelioid Cell	NaN	2013.0	NaN	-18769.0	Alive	TUMOR FREE	798.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	798.0	0.0	798.0	NaN	NaN	0.0	798.0	NaN
12589	TCGA-YZ-A984-01	TCGA-YZ-A984	UVM	50.0	FEMALE	WHITE	Stage IIB	Stage IIIA	Spindle Cell\|Epithelioid Cell	NaN	2011.0	NaN	-18342.0	Dead	WITH TUMOR	NaN	1396.0	Metastatic Uveal Melanoma	New Primary Tumor	Other, specify	Thyroid	154.0	NaN	NaN	NaN	1.0	1396.0	1.0	1396.0	NaN	NaN	1.0	154.0	NaN
12590	TCGA-YZ-A985-01	TCGA-YZ-A985	UVM	41.0	FEMALE	WHITE	Stage IIIA	Stage IIIA	Spindle Cell	NaN	2012.0	NaN	-15164.0	Alive	TUMOR FREE	1184.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	1184.0	0.0	1184.0	NaN	NaN	0.0	1184.0	NaN

12591 rows × 34 columns

Now select the rows which match the disease we are intereseted in, CHOL. Print out the number of rows after the selecton. There should be 45

chol_meta_df = meta_df[meta_df['cancer type abbreviation'] == 'CHOL']
chol_meta_df.shape[0]

Next we are going to read in the data matrix. We will use the metadata we just collected to only load the columns for the cohort of interest. This will save memory and speed up the process. This will still probably take a few seconds.

columns_to_keep = ['sample'] + chol_meta_df['sample'].tolist()
chol_matrix = pd.read_csv('tcga_RSEM_gene_tpm.gz', compression='gzip', sep='\t', usecols=columns_to_keep)

To make things less confusing we will rename the "sample" column as "gene" since that is what is actually in that column

chol_matrix = chol_matrix.rename(columns={'sample': 'gene'})
chol_matrix = chol_matrix.set_index('gene')
chol_matrix

	TCGA-ZH-A8Y5-01	TCGA-4G-AAZT-01	TCGA-W5-AA2R-11	TCGA-W5-AA2I-01	TCGA-W5-AA34-01	TCGA-W5-AA2X-01	TCGA-W5-AA2H-01	TCGA-3X-AAVB-01	TCGA-3X-AAVA-01	TCGA-ZH-A8Y4-01	TCGA-W5-AA31-11	TCGA-W5-AA2U-01	TCGA-W5-AA2Q-01	TCGA-ZH-A8Y2-01	TCGA-W5-AA30-01	TCGA-WD-A7RX-01	TCGA-ZU-A8S4-11	TCGA-W5-AA2U-11	TCGA-3X-AAVE-01	TCGA-W5-AA2G-01	TCGA-W5-AA39-01	TCGA-4G-AAZO-01	TCGA-W5-AA2R-01	TCGA-W5-AA34-11	TCGA-W5-AA2I-11	TCGA-W6-AA0S-01	TCGA-W5-AA2W-01	TCGA-YR-A95A-01	TCGA-W5-AA33-01	TCGA-ZD-A8I3-01	TCGA-W5-AA2O-01	TCGA-W5-AA31-01	TCGA-W5-AA2Q-11	TCGA-W5-AA36-01	TCGA-W5-AA38-01	TCGA-ZU-A8S4-01	TCGA-W5-AA2Z-01	TCGA-3X-AAVC-01	TCGA-W5-AA30-11	TCGA-ZH-A8Y1-01	TCGA-3X-AAV9-01	TCGA-W5-AA2T-01	TCGA-ZH-A8Y6-01	TCGA-ZH-A8Y8-01	TCGA-W5-AA2X-11
gene
ENSG00000242268.2	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-4.2934	-9.9658	-4.2934	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-4.0350	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-4.0350	-9.9658	-3.8160	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
ENSG00000259041.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
ENSG00000270112.3	-9.9658	-9.9658	-9.9658	-6.5064	-6.5064	-9.9658	-9.9658	-1.9942	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-3.1714	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-6.5064	-6.5064	-9.9658
ENSG00000167578.16	4.0575	5.2430	3.1179	4.3716	4.6282	4.7203	6.7259	5.4233	4.4243	4.8105	3.2841	3.9892	3.8778	5.6244	4.7345	4.9175	3.3647	3.4277	4.8405	4.0046	5.7520	4.6955	5.0104	3.1393	3.2282	4.9855	5.9074	4.2958	4.6106	4.8510	4.5940	5.3035	3.8993	4.6525	5.6795	5.4216	4.2297	4.6101	3.7846	4.8802	4.8905	4.2025	5.7632	5.1611	3.8808
ENSG00000278814.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ENSG00000273233.1	-9.9658	-1.9942	-9.9658	-1.5951	-3.3076	-9.9658	-3.4580	-3.0469	-3.1714	-0.6873	-9.9658	-2.2447	-3.1714	-9.9658	-0.1504	-3.4580	-9.9658	-9.9658	-9.9658	-3.0469	-2.2447	-9.9658	-9.9658	-9.9658	-3.6259	-3.4580	-2.7274	-9.9658	-3.0469	-1.9379	-9.9658	-9.9658	-9.9658	-1.2828	-9.9658	-1.9942	-3.1714	-9.9658	-3.3076	-9.9658	-9.9658	-0.9132	-3.1714	-3.4580	-9.9658
ENSG00000105063.18	5.1157	4.9901	3.5187	5.1277	4.6713	4.9322	6.0830	5.5604	5.1056	4.8445	3.2374	4.9275	4.7049	5.2327	4.9589	4.6910	2.9224	2.8462	5.4266	5.3005	4.3876	5.4654	5.0246	3.3842	2.8096	4.0207	5.8985	4.6944	5.4199	4.9741	5.1027	5.8483	3.5742	5.1211	5.4019	5.2114	4.7939	5.8783	2.9839	5.0747	5.0352	5.2250	5.6744	4.9978	3.4778
ENSG00000231119.2	-4.2934	-5.5735	-3.8160	-9.9658	-6.5064	-2.6349	-5.5735	-3.3076	-9.9658	-4.0350	-3.1714	-4.2934	-5.5735	-9.9658	-4.0350	-2.8262	-3.4580	-5.0116	-4.6082	-3.6259	-5.0116	-4.2934	-3.8160	-3.0469	-4.0350	-4.2934	-2.2447	-3.6259	-3.8160	-5.5735	-4.6082	-2.1140	-3.0469	-9.9658	-3.6259	-1.8314	-3.8160	-4.6082	-3.3076	-3.1714	-2.4659	-4.0350	-2.4659	-2.7274	-2.9324
ENSG00000280861.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
ENSG00000181518.3	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658

60498 rows × 45 columns

Now that we have our matrix we are ready to start calculating metrics

Calculating Metrics¶

The input for an analysis where a patient sample is being compared to the TCGA cohort of interest would typically be an expression matrix using the same normalization method as the comparator cohort.

Note: Differences in library preparation, sequencing and bioinformatics pipelines can lead to variation in expression quantification. It is important to understand the impact these differences have on expression quantification before comparing expression values that are produced using different pipelines.

For the purposes of this tutorial we are going to choose one of the CHOL samples from our expression matrix to use as our input. We arbitrarily chose the first sample.

test_sample_id = chol_meta_df['sample'].tolist()[0]

Percentile¶

We will start by calculating the percentile rank

percentiles = chol_matrix.copy()
percentiles = percentiles.rank(1, pct=True, numeric_only=True).apply(lambda x: round(x * 100))
percentiles

	TCGA-ZH-A8Y5-01	TCGA-4G-AAZT-01	TCGA-W5-AA2R-11	TCGA-W5-AA2I-01	TCGA-W5-AA34-01	TCGA-W5-AA2X-01	TCGA-W5-AA2H-01	TCGA-3X-AAVB-01	TCGA-3X-AAVA-01	TCGA-ZH-A8Y4-01	TCGA-W5-AA31-11	TCGA-W5-AA2U-01	TCGA-W5-AA2Q-01	TCGA-ZH-A8Y2-01	TCGA-W5-AA30-01	TCGA-WD-A7RX-01	TCGA-ZU-A8S4-11	TCGA-W5-AA2U-11	TCGA-3X-AAVE-01	TCGA-W5-AA2G-01	TCGA-W5-AA39-01	TCGA-4G-AAZO-01	TCGA-W5-AA2R-01	TCGA-W5-AA34-11	TCGA-W5-AA2I-11	TCGA-W6-AA0S-01	TCGA-W5-AA2W-01	TCGA-YR-A95A-01	TCGA-W5-AA33-01	TCGA-ZD-A8I3-01	TCGA-W5-AA2O-01	TCGA-W5-AA31-01	TCGA-W5-AA2Q-11	TCGA-W5-AA36-01	TCGA-W5-AA38-01	TCGA-ZU-A8S4-01	TCGA-W5-AA2Z-01	TCGA-3X-AAVC-01	TCGA-W5-AA30-11	TCGA-ZH-A8Y1-01	TCGA-3X-AAV9-01	TCGA-W5-AA2T-01	TCGA-ZH-A8Y6-01	TCGA-ZH-A8Y8-01	TCGA-W5-AA2X-11
gene
ENSG00000242268.2	46.0	46.0	46.0	46.0	46.0	46.0	46.0	92.0	46.0	92.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	97.0	46.0	46.0	46.0	46.0	46.0	46.0	46.0	97.0	46.0	100.0	46.0	46.0	46.0	46.0	46.0
ENSG00000259041.1	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0
ENSG00000270112.3	44.0	44.0	44.0	92.0	92.0	44.0	44.0	100.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	98.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	44.0	92.0	92.0	44.0
ENSG00000167578.16	29.0	80.0	2.0	38.0	49.0	56.0	100.0	87.0	40.0	60.0	9.0	24.0	18.0	89.0	58.0	71.0	11.0	13.0	62.0	27.0	93.0	53.0	76.0	4.0	7.0	73.0	98.0	36.0	47.0	64.0	42.0	82.0	22.0	51.0	91.0	84.0	33.0	44.0	16.0	67.0	69.0	31.0	96.0	78.0	20.0
ENSG00000278814.1	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ENSG00000273233.1	23.0	86.0	23.0	91.0	59.0	23.0	52.0	73.0	66.0	98.0	23.0	81.0	66.0	23.0	100.0	52.0	23.0	23.0	23.0	73.0	81.0	23.0	23.0	23.0	47.0	52.0	78.0	23.0	73.0	89.0	23.0	23.0	23.0	93.0	23.0	86.0	66.0	23.0	59.0	23.0	23.0	96.0	66.0	52.0	23.0
ENSG00000105063.18	64.0	49.0	18.0	69.0	27.0	42.0	100.0	89.0	62.0	38.0	11.0	40.0	33.0	76.0	44.0	29.0	7.0	4.0	84.0	78.0	24.0	87.0	53.0	13.0	2.0	22.0	98.0	31.0	82.0	47.0	60.0	93.0	20.0	67.0	80.0	71.0	36.0	96.0	9.0	58.0	56.0	73.0	91.0	51.0	16.0
ENSG00000231119.2	37.0	17.0	54.0	6.0	11.0	89.0	17.0	70.0	6.0	46.0	74.0	37.0	17.0	6.0	46.0	84.0	67.0	23.0	29.0	62.0	23.0	37.0	54.0	79.0	46.0	37.0	96.0	62.0	54.0	17.0	29.0	98.0	79.0	6.0	62.0	100.0	54.0	29.0	70.0	74.0	92.0	46.0	92.0	87.0	82.0
ENSG00000280861.1	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0
ENSG00000181518.3	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0	51.0

60498 rows × 45 columns

We can plot a couple of these as a sanity check. The line plots should never go up and then down again

gene = 'ENSG00000105063.18'

fig = sns.relplot(kind='line', x=chol_matrix.loc[gene], y=percentiles.loc[gene])
fig.set(xlabel='TPM', ylabel='Percentile Rank')

<seaborn.axisgrid.FacetGrid at 0x7f6b96ba7f50>

Now let's look at the percentiles for our test sample

percentiles[[test_sample_id]]

	TCGA-3X-AAV9-01
gene
ENSG00000242268.2	46.0
ENSG00000259041.1	51.0
ENSG00000270112.3	44.0
ENSG00000167578.16	69.0
ENSG00000278814.1	51.0
...	...
ENSG00000273233.1	23.0
ENSG00000105063.18	56.0
ENSG00000231119.2	92.0
ENSG00000280861.1	51.0
ENSG00000181518.3	51.0

60498 rows × 1 columns

We can select the highest and lowest percentiles as a way to look for possible outliers (arbitrarily selecting 95 and 5 as thresholds).

percentiles[[test_sample_id]][(percentiles[test_sample_id] >= 95) | (percentiles[test_sample_id] <= 5)]

	TCGA-3X-AAV9-01
gene
ENSG00000280143.1	98.0
ENSG00000232001.1	98.0
ENSG00000206072.12	100.0
ENSG00000254102.1	100.0
ENSG00000165312.6	100.0
...	...
ENSG00000103241.6	100.0
ENSG00000237803.5	100.0
ENSG00000270987.1	98.0
ENSG00000186115.12	4.0
ENSG00000123685.8	96.0

3530 rows × 1 columns

Calculating kIQR¶

The next metric we are going to look at is the kIQR. This metric is different from percentile in that it takes into account the spread or variance of the distribution. To be able to calculate this we first need to calculate some parameters for each gene

quantiles = pd.concat([
  chol_matrix[test_sample_id], 
  chol_matrix.quantile(0.25, 1), 
  chol_matrix.quantile(0.5, 1), 
  chol_matrix.quantile(0.75, 1),
], axis=1).rename(
    columns={0.25: 'q1', 0.5: 'q2', 0.75: 'q3', test_sample_id: 'tpm'}
).reset_index()
quantiles['iqr'] = quantiles.q3 - quantiles.q1
quantiles

	gene	tpm	q1	q2	q3	iqr
0	ENSG00000242268.2	-9.9658	-9.9658	-9.9658	-9.9658	0.0000
1	ENSG00000259041.1	-9.9658	-9.9658	-9.9658	-9.9658	0.0000
2	ENSG00000270112.3	-9.9658	-9.9658	-9.9658	-9.9658	0.0000
3	ENSG00000167578.16	4.8905	4.0046	4.6525	5.0104	1.0058
4	ENSG00000278814.1	-9.9658	-9.9658	-9.9658	-9.9658	0.0000
...	...	...	...	...	...	...
60493	ENSG00000273233.1	-9.9658	-9.9658	-3.4580	-3.0469	6.9189
60494	ENSG00000105063.18	5.0352	4.6713	4.9978	5.2327	0.5614
60495	ENSG00000231119.2	-2.4659	-4.6082	-3.8160	-3.1714	1.4368
60496	ENSG00000280861.1	-9.9658	-9.9658	-9.9658	-9.9658	0.0000
60497	ENSG00000181518.3	-9.9658	-9.9658	-9.9658	-9.9658	0.0000

60498 rows × 6 columns

Now we are ready to calculate the kIQR for our sample

quantiles['kiqr'] = quantiles[quantiles.iqr > 0].apply(lambda row: (row.tpm - row.q2) / row.iqr, axis=1)

Based on the newly calcuated kIQR we pick thresholds to select outliers. We can do these alone or with the percentile filters. We have arbitrarily chosen 3 for the purposes of this tutorial

quantile_outliers = quantiles[(quantiles.kiqr >= 3) | (quantiles.kiqr <= -3)].reset_index()
quantile_outliers

	index	gene	tpm	q1	q2	q3	iqr	kiqr
0	245	ENSG00000259677.1	-9.9658	-4.6082	-3.6259	-2.8262	1.7820	-3.557744
1	481	ENSG00000223922.1	-9.9658	-4.2934	-3.4580	-2.3147	1.9787	-3.288927
2	598	ENSG00000255236.2	-9.9658	-3.3076	-2.1140	-1.3921	1.9155	-4.099086
3	652	ENSG00000057149.14	4.7464	-9.9658	-9.9658	-5.5735	4.3923	3.349544
4	1159	ENSG00000271916.1	-9.9658	-2.2447	-0.4719	0.4125	2.6572	-3.572896
...	...	...	...	...	...	...	...	...
121	58758	ENSG00000253492.1	-9.9658	-0.5756	1.7702	2.6161	3.1917	-3.677037
122	58852	ENSG00000169562.9	4.1612	6.8178	7.3061	7.6176	0.7998	-3.932108
123	59675	ENSG00000270617.1	-9.9658	-2.9324	-2.4659	-0.7834	2.1490	-3.489949
124	59884	ENSG00000235298.1	-9.9658	-1.0862	-0.3201	0.6239	1.7101	-5.640430
125	59912	ENSG00000070526.14	4.1252	-2.7274	-1.8314	-1.0262	1.7012	3.501411

126 rows × 8 columns

Now we pick some of these genes to plot

outlier_gene = quantile_outliers.iloc[0].gene 
outlier_gene

'ENSG00000259677.1'

And create a histogram which highlights where the current sample lands relative to the other samples

ax = sns.histplot(chol_matrix.T[outlier_gene], kde=True)
ax.set_title(outlier_gene)
ax.set_xlabel('TPM')

# add a marker over the bin the sample lands in
current = quantile_outliers[quantile_outliers.gene == outlier_gene].iloc[0].tpm

for p in ax.patches:
  if current >= p.get_x() and current <= p.get_x() + p.get_width():
      p.set_color('black')
      plt.text(
          p.get_x() + (p.get_width() / 2),
          p.get_height(),
          '*',
          horizontalalignment='center',
          verticalalignment='bottom',
      )

Creating the Report Input¶

Since the required input to the IPR python adapter is JSON we will format the output from this analysis to match that specification. Note this will only contain the expression relevant portions.

We will start by combining the percentile, tpm, and kiqr into a single matrix

metrics_df = pd.concat([
  quantiles.set_index('gene'), 
  percentiles[[test_sample_id]].rename(columns={test_sample_id: 'percentile'})
], axis=1)
metrics_df

	tpm	q1	q2	q3	iqr	kiqr	percentile
gene
ENSG00000242268.2	-9.9658	-9.9658	-9.9658	-9.9658	0.0000	NaN	46.0
ENSG00000259041.1	-9.9658	-9.9658	-9.9658	-9.9658	0.0000	NaN	51.0
ENSG00000270112.3	-9.9658	-9.9658	-9.9658	-9.9658	0.0000	NaN	44.0
ENSG00000167578.16	4.8905	4.0046	4.6525	5.0104	1.0058	0.236628	69.0
ENSG00000278814.1	-9.9658	-9.9658	-9.9658	-9.9658	0.0000	NaN	51.0
...	...	...	...	...	...	...	...
ENSG00000273233.1	-9.9658	-9.9658	-3.4580	-3.0469	6.9189	-0.940583	23.0
ENSG00000105063.18	5.0352	4.6713	4.9978	5.2327	0.5614	0.066619	56.0
ENSG00000231119.2	-2.4659	-4.6082	-3.8160	-3.1714	1.4368	0.939658	92.0
ENSG00000280861.1	-9.9658	-9.9658	-9.9658	-9.9658	0.0000	NaN	51.0
ENSG00000181518.3	-9.9658	-9.9658	-9.9658	-9.9658	0.0000	NaN	51.0

60498 rows × 7 columns

Next we are going to use our thresholds to assign labels

percentile_threshold = 97.5
kiqr_threshold = 2.5


metrics_df.loc[(metrics_df.percentile >= percentile_threshold) & (metrics_df.kiqr >= kiqr_threshold), 'kbCategory'] = 'increased expression'
metrics_df.loc[(metrics_df.percentile <= 100 - percentile_threshold) & (metrics_df.kiqr <= -1 * kiqr_threshold), 'kbCategory'] = 'reduced expression'

metrics_df[~metrics_df.kbCategory.isnull()]

	tpm	q1	q2	q3	iqr	kiqr	percentile	kbCategory
gene
ENSG00000057149.14	4.7464	-9.9658	-9.9658	-5.5735	4.3923	3.349544	100.0	increased expression
ENSG00000140522.11	0.0580	-9.9658	-9.9658	-6.5064	3.4594	2.897554	100.0	increased expression
ENSG00000089116.3	-0.0277	-9.9658	-9.9658	-6.5064	3.4594	2.872781	100.0	increased expression
ENSG00000087495.16	4.8095	-3.0469	-2.0529	-0.4325	2.6144	2.624847	98.0	increased expression
ENSG00000272702.1	-6.5064	-2.5479	-1.5951	-1.1488	1.3991	-3.510328	2.0	reduced expression
ENSG00000254231.1	-2.2447	0.5955	1.2756	1.9527	1.3572	-2.593796	2.0	reduced expression
ENSG00000073282.12	1.4652	-4.2934	-3.6259	-2.3884	1.9050	2.672493	100.0	increased expression
ENSG00000142700.11	0.3796	-9.9658	-9.9658	-6.5064	3.4594	2.990519	100.0	increased expression
ENSG00000077616.10	1.1641	-2.6349	-2.1779	-1.4699	1.1650	2.868670	100.0	increased expression
ENSG00000227392.1	-9.9658	-1.3183	-0.2159	0.5568	1.8751	-5.199669	2.0	reduced expression
ENSG00000147689.16	5.2694	-6.5064	-5.5735	-3.1714	3.3350	3.251244	100.0	increased expression
ENSG00000174332.5	1.3846	-5.0116	-4.2934	-3.3076	1.7040	3.332160	100.0	increased expression
ENSG00000163207.6	2.4008	-9.9658	-9.9658	-6.5064	3.4594	3.574782	100.0	increased expression
ENSG00000158786.4	-0.6643	-9.9658	-9.9658	-6.5064	3.4594	2.688761	100.0	increased expression
ENSG00000261857.6	7.6963	-2.7274	-1.2481	-0.2845	2.4429	3.661386	100.0	increased expression
ENSG00000265179.6	-0.4325	-9.9658	-9.9658	-6.5064	3.4594	2.755767	98.0	increased expression
ENSG00000198910.12	5.3948	-3.4580	-1.8836	-0.9406	2.5174	2.891237	100.0	increased expression
ENSG00000196154.11	11.1159	5.3827	6.2565	7.3173	1.9346	2.511837	100.0	increased expression
ENSG00000148704.12	-0.1828	-9.9658	-9.9658	-6.5064	3.4594	2.827947	100.0	increased expression
ENSG00000171956.6	-0.2159	-9.9658	-9.9658	-6.5064	3.4594	2.818379	98.0	increased expression
ENSG00000228630.5	0.9568	-9.9658	-9.9658	-6.5064	3.4594	3.157368	98.0	increased expression
ENSG00000132361.16	3.7825	5.2987	5.4922	5.7616	0.4629	-3.693454	2.0	reduced expression
ENSG00000113430.9	-0.1993	-9.9658	-9.9658	-6.5064	3.4594	2.823177	100.0	increased expression
ENSG00000166869.2	-0.1504	-9.9658	-9.9658	-6.5064	3.4594	2.837313	98.0	increased expression
ENSG00000198488.10	0.4967	-9.9658	-9.9658	-6.5064	3.4594	3.024368	98.0	increased expression
ENSG00000101470.9	2.6510	-2.5479	-1.8836	-1.3183	1.2296	3.687866	100.0	increased expression
ENSG00000099958.14	7.1647	1.5563	2.5238	3.2002	1.6439	2.823104	100.0	increased expression
ENSG00000125820.5	2.4805	-9.9658	-9.9658	-6.5064	3.4594	3.597820	100.0	increased expression
ENSG00000215115.6	0.7664	-9.9658	-9.9658	-6.5064	3.4594	3.102330	100.0	increased expression
ENSG00000130035.6	1.9931	-4.0350	-3.4580	-2.3884	1.6466	3.310519	100.0	increased expression
ENSG00000101160.13	8.6516	7.4139	7.6005	7.8323	0.4184	2.512189	98.0	increased expression
ENSG00000204930.9	-3.8160	-6.5064	-6.5064	-5.5735	0.9329	2.883910	100.0	increased expression
ENSG00000215113.6	0.7664	-9.9658	-9.9658	-6.5064	3.4594	3.102330	100.0	increased expression
ENSG00000175928.5	3.2796	-2.8262	-1.9379	-0.7588	2.0674	2.523701	98.0	increased expression
ENSG00000186081.11	6.5720	-3.6259	-2.3884	-0.1828	3.4431	2.602422	100.0	increased expression
ENSG00000181690.7	0.9789	-4.0350	-3.3076	-2.6349	1.4001	3.061567	98.0	increased expression
ENSG00000182103.4	-1.4305	-6.5064	-6.5064	-5.0116	1.4948	3.395705	100.0	increased expression
ENSG00000271474.1	-0.9686	-4.6082	-3.8160	-3.6259	0.9823	2.898707	100.0	increased expression
ENSG00000169962.4	1.3109	-2.7274	-2.1140	-1.4305	1.2969	2.640836	98.0	increased expression
ENSG00000166105.15	-1.0262	-9.9658	-9.9658	-6.5064	3.4594	2.584148	100.0	increased expression
ENSG00000143199.17	-6.5064	-1.6850	-0.4521	0.5763	2.2613	-2.677354	2.0	reduced expression
ENSG00000180616.8	0.9191	-2.6349	-2.1140	-1.4305	1.2044	2.518349	100.0	increased expression
ENSG00000230587.1	0.0990	-3.6259	-3.1714	-2.4659	1.1600	2.819310	100.0	increased expression
ENSG00000256162.2	-9.9658	-1.0559	0.6880	3.1604	4.2163	-2.526813	2.0	reduced expression
ENSG00000205209.7	-6.5064	-3.4580	-2.5479	-1.9942	1.4638	-2.704263	2.0	reduced expression
ENSG00000204542.2	3.6382	-9.9658	-9.9658	-4.6082	5.3576	2.539197	98.0	increased expression
ENSG00000166961.14	2.4359	-9.9658	-9.9658	-5.5735	4.3923	2.823509	100.0	increased expression
ENSG00000102096.9	5.7455	2.5112	3.0393	3.5657	1.0545	2.566335	98.0	increased expression
ENSG00000124191.17	5.1023	0.2522	0.9568	1.4911	1.2389	3.346113	98.0	increased expression
ENSG00000107485.15	3.9175	-1.9379	-1.3921	0.0014	1.9393	2.737895	100.0	increased expression
ENSG00000065618.16	5.4741	-4.6082	-2.9324	-1.4699	3.1383	2.678680	98.0	increased expression
ENSG00000158816.15	-0.1828	-9.9658	-9.9658	-6.5064	3.4594	2.827947	100.0	increased expression
ENSG00000135374.9	3.6077	-9.9658	-9.9658	-5.5735	4.3923	3.090294	100.0	increased expression
ENSG00000145113.21	5.3618	-2.5479	-1.9942	-1.0262	1.5217	4.834067	100.0	increased expression
ENSG00000162367.11	2.7847	-1.5522	-1.0559	-0.2328	1.3194	2.910869	100.0	increased expression
ENSG00000105371.8	1.3735	-3.8160	-3.0469	-2.0529	1.7631	2.507175	98.0	increased expression
ENSG00000070526.14	4.1252	-2.7274	-1.8314	-1.0262	1.7012	3.501411	98.0	increased expression

Finally we can use this to create our output. The percentiles and kIQR values we have calculated above were for our disease comparator but a similar process would be used for normal and biopsy comparators

metrics_df = metrics_df.rename(columns={'kiqr': 'diseasekIQR', 'percentile': 'diseasePercentile'}).reset_index()

We often leave non-variant expression records in to supplement the copy variant and structural variant information. For these records the kbCategory field is a null value since they are not an expression outlier

expression_variants = metrics_df[['tpm', 'diseasekIQR', 'diseasePercentile', 'gene', 'kbCategory']].replace({np.nan:None}).to_dict('records')
expression_variants[0]

{'diseasePercentile': 46.0,
 'diseasekIQR': None,
 'gene': 'ENSG00000242268.2',
 'kbCategory': None,
 'tpm': -9.9658}

We will also add the disease comparator information to the JSON

output_json = {
    'comparators': [
        {'analysisRole': 'expression (disease)', 'name': 'TCGA CHOL', 'size': chol_matrix.shape[0]}
    ],
    'expressionVariants': expression_variants
}
with open('result.json', 'w') as fh:
  json.dump(output_json, fh)