Computing RNA Expression Similarity¶
Download the Data¶
First we need to download the data set we will use. We are going to use the TCGA data here since it is publically available and well known (UCSC Xena TCGA PANCAN). We will download the TPM matrix as well as the corresponding metadata file
!wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_RSEM_gene_tpm.gz
!wget https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/Survival_SupplementalTable_S1_20171025_xena_sp
--2021-06-10 17:34:56-- https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_RSEM_gene_tpm.gz
Resolving toil-xena-hub.s3.us-east-1.amazonaws.com (toil-xena-hub.s3.us-east-1.amazonaws.com)... 52.216.136.30
Connecting to toil-xena-hub.s3.us-east-1.amazonaws.com (toil-xena-hub.s3.us-east-1.amazonaws.com)|52.216.136.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 740772247 (706M) [binary/octet-stream]
Saving to: ‘tcga_RSEM_gene_tpm.gz’
tcga_RSEM_gene_tpm. 100%[===================>] 706.46M 15.5MB/s in 49s
2021-06-10 17:35:46 (14.5 MB/s) - ‘tcga_RSEM_gene_tpm.gz’ saved [740772247/740772247]
--2021-06-10 17:35:46-- https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/Survival_SupplementalTable_S1_20171025_xena_sp
Resolving tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com (tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com)... 52.216.94.142
Connecting to tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com (tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com)|52.216.94.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2419504 (2.3M) [binary/octet-stream]
Saving to: ‘Survival_SupplementalTable_S1_20171025_xena_sp’
Survival_Supplement 100%[===================>] 2.31M 1.75MB/s in 1.3s
2021-06-10 17:35:48 (1.75 MB/s) - ‘Survival_SupplementalTable_S1_20171025_xena_sp’ saved [2419504/2419504]
The matrix we have downloaded is very large as it contains many different disease cohorts. For the purposes of this tutorial we want a smaller data set so we are going to choose the smallest TCGA cohort, CHOL. To do this we need to use the metadata file to subset the main matrix file
Install Dependencies¶
For the following tutorial we are going to use some common data processing and visualization libraries in python: seaborn and pandas. We need to install them before we proceed
!pip install pandas seaborn
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (0.11.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: matplotlib>=2.2 in /usr/local/lib/python3.7/dist-packages (from seaborn) (3.2.2)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from seaborn) (1.4.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (1.3.1)
Create the Disease Matrix¶
First read the metadata file
import json
import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
sns.set_style('whitegrid')
meta_df = pd.read_csv('Survival_SupplementalTable_S1_20171025_xena_sp', sep='\t')
meta_df
sample | _PATIENT | cancer type abbreviation | age_at_initial_pathologic_diagnosis | gender | race | ajcc_pathologic_tumor_stage | clinical_stage | histological_type | histological_grade | initial_pathologic_dx_year | menopause_status | birth_days_to | vital_status | tumor_status | last_contact_days_to | death_days_to | cause_of_death | new_tumor_event_type | new_tumor_event_site | new_tumor_event_site_other | new_tumor_event_dx_days_to | treatment_outcome_first_course | margin_status | residual_tumor | OS | OS.time | DSS | DSS.time | DFI | DFI.time | PFI | PFI.time | Redaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TCGA-OR-A5J1-01 | TCGA-OR-A5J1 | ACC | 58.0 | MALE | WHITE | Stage II | NaN | Adrenocortical carcinoma- Usual Type | NaN | 2000.0 | NaN | -21496.0 | Dead | WITH TUMOR | NaN | 1355.0 | NaN | Distant Metastasis | Peritoneal Surfaces | NaN | 754.0 | Complete Remission/Response | NaN | NaN | 1.0 | 1355.0 | 1.0 | 1355.0 | 1.0 | 754.0 | 1.0 | 754.0 | NaN |
1 | TCGA-OR-A5J2-01 | TCGA-OR-A5J2 | ACC | 44.0 | FEMALE | WHITE | Stage IV | NaN | Adrenocortical carcinoma- Usual Type | NaN | 2004.0 | NaN | -16090.0 | Dead | WITH TUMOR | NaN | 1677.0 | NaN | Distant Metastasis | Soft Tissue | NaN | 289.0 | Progressive Disease | NaN | NaN | 1.0 | 1677.0 | 1.0 | 1677.0 | NaN | NaN | 1.0 | 289.0 | NaN |
2 | TCGA-OR-A5J3-01 | TCGA-OR-A5J3 | ACC | 23.0 | FEMALE | WHITE | Stage III | NaN | Adrenocortical carcinoma- Usual Type | NaN | 2008.0 | NaN | -8624.0 | Alive | WITH TUMOR | 2091.0 | NaN | NaN | Distant Metastasis | Lung | NaN | 53.0 | Complete Remission/Response | NaN | NaN | 0.0 | 2091.0 | 0.0 | 2091.0 | 1.0 | 53.0 | 1.0 | 53.0 | NaN |
3 | TCGA-OR-A5J4-01 | TCGA-OR-A5J4 | ACC | 23.0 | FEMALE | WHITE | Stage IV | NaN | Adrenocortical carcinoma- Usual Type | NaN | 2000.0 | NaN | -8451.0 | Dead | WITH TUMOR | NaN | 423.0 | NaN | Locoregional Recurrence | Peritoneal Surfaces | NaN | 126.0 | Progressive Disease | NaN | NaN | 1.0 | 423.0 | 1.0 | 423.0 | NaN | NaN | 1.0 | 126.0 | NaN |
4 | TCGA-OR-A5J5-01 | TCGA-OR-A5J5 | ACC | 30.0 | MALE | WHITE | Stage III | NaN | Adrenocortical carcinoma- Usual Type | NaN | 2000.0 | NaN | -11171.0 | Dead | WITH TUMOR | NaN | 365.0 | NaN | Locoregional Recurrence | Other, specify | vena cava thrombus | 50.0 | Progressive Disease | NaN | NaN | 1.0 | 365.0 | 1.0 | 365.0 | NaN | NaN | 1.0 | 50.0 | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12586 | TCGA-YZ-A980-01 | TCGA-YZ-A980 | UVM | 75.0 | MALE | WHITE | Stage IIIA | Stage IIIA | Spindle Cell|Epithelioid Cell | NaN | 2010.0 | NaN | -27716.0 | Alive | TUMOR FREE | 1862.0 | NaN | NaN | New Primary Tumor | Other, specify | Scalp | 1556.0 | NaN | NaN | NaN | 0.0 | 1862.0 | 0.0 | 1862.0 | NaN | NaN | 1.0 | 1556.0 | NaN |
12587 | TCGA-YZ-A982-01 | TCGA-YZ-A982 | UVM | 79.0 | FEMALE | WHITE | Stage IIIB | Stage IIIB | Spindle Cell | NaN | 2013.0 | NaN | -28938.0 | Alive | TUMOR FREE | 495.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 495.0 | 0.0 | 495.0 | NaN | NaN | 0.0 | 495.0 | NaN |
12588 | TCGA-YZ-A983-01 | TCGA-YZ-A983 | UVM | 51.0 | FEMALE | WHITE | Stage IIB | Stage IIB | Epithelioid Cell | NaN | 2013.0 | NaN | -18769.0 | Alive | TUMOR FREE | 798.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 798.0 | 0.0 | 798.0 | NaN | NaN | 0.0 | 798.0 | NaN |
12589 | TCGA-YZ-A984-01 | TCGA-YZ-A984 | UVM | 50.0 | FEMALE | WHITE | Stage IIB | Stage IIIA | Spindle Cell|Epithelioid Cell | NaN | 2011.0 | NaN | -18342.0 | Dead | WITH TUMOR | NaN | 1396.0 | Metastatic Uveal Melanoma | New Primary Tumor | Other, specify | Thyroid | 154.0 | NaN | NaN | NaN | 1.0 | 1396.0 | 1.0 | 1396.0 | NaN | NaN | 1.0 | 154.0 | NaN |
12590 | TCGA-YZ-A985-01 | TCGA-YZ-A985 | UVM | 41.0 | FEMALE | WHITE | Stage IIIA | Stage IIIA | Spindle Cell | NaN | 2012.0 | NaN | -15164.0 | Alive | TUMOR FREE | 1184.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 1184.0 | 0.0 | 1184.0 | NaN | NaN | 0.0 | 1184.0 | NaN |
12591 rows × 34 columns
To facilitate running this quickly and with low compute resource requirements we will use a subset of the total set of samples. In practice we would likely store this in a database to improve performance. If you would like to use the entire data set you can remove the "usecols" part of read_csv. Let's use the metadata file to see how many samples we have in each of the different cohorts
meta_df.groupby('cancer type abbreviation').agg({'sample': 'nunique'})
sample | |
---|---|
cancer type abbreviation | |
ACC | 92 |
BLCA | 436 |
BRCA | 1236 |
CESC | 312 |
CHOL | 45 |
COAD | 545 |
DLBC | 48 |
ESCA | 204 |
GBM | 602 |
HNSC | 604 |
KICH | 91 |
KIRC | 944 |
KIRP | 352 |
LAML | 200 |
LGG | 529 |
LIHC | 438 |
LUAD | 641 |
LUSC | 623 |
MESO | 87 |
OV | 604 |
PAAD | 196 |
PCPG | 187 |
PRAD | 566 |
READ | 183 |
SARC | 271 |
SKCM | 479 |
STAD | 511 |
TGCT | 139 |
THCA | 580 |
THYM | 126 |
UCEC | 583 |
UCS | 57 |
UVM | 80 |
Now we will select a couple of cohorts - CHOL - LIHC - SKCM
columns_to_keep = set(['sample'] + meta_df[meta_df['cancer type abbreviation'].isin({'CHOL','SKCM', 'LIHC'})]['sample'].tolist())
len(columns_to_keep)
963
Reading the data matrix is a lot of data and so this process is slow. This will still probably take a few minutes. In practice we likely store this in a database to improvement performance
exp_matrix = pd.read_csv('tcga_RSEM_gene_tpm.gz', compression='gzip', sep='\t', usecols=lambda x: x in columns_to_keep)
To make things less confusing we will rename the "sample" column as "gene" since that is what is actually in that column
exp_matrix = exp_matrix.rename(columns={'sample': 'gene'})
exp_matrix = exp_matrix.set_index('gene')
exp_matrix
TCGA-G3-A3CH-11 | TCGA-RP-A695-06 | TCGA-DD-AAW0-01 | TCGA-EE-A17X-06 | TCGA-DD-AACA-02 | TCGA-DD-AACA-01 | TCGA-D3-A8GD-06 | TCGA-DD-A3A6-11 | TCGA-K7-AAU7-01 | TCGA-FS-A1ZF-06 | TCGA-EB-A41A-01 | TCGA-RC-A7SH-01 | TCGA-CC-A3MA-01 | TCGA-DD-A3A5-11 | TCGA-RC-A7SB-01 | TCGA-D3-A3ML-06 | TCGA-EE-A2GB-06 | TCGA-DD-AAVZ-01 | TCGA-DD-A116-11 | TCGA-ED-A627-01 | TCGA-ER-A19S-06 | TCGA-EE-A3AB-06 | TCGA-DD-AAVV-01 | TCGA-XV-AAZW-01 | TCGA-D3-A5GR-06 | TCGA-BC-4073-01 | TCGA-W3-AA1R-06 | TCGA-QA-A7B7-01 | TCGA-DA-A1IC-06 | TCGA-D3-A51G-06 | TCGA-BF-A5ER-01 | TCGA-HP-A5N0-01 | TCGA-D3-A5GU-06 | TCGA-ZH-A8Y5-01 | TCGA-FS-A1ZB-06 | TCGA-DD-A114-11 | TCGA-DD-A1EG-11 | TCGA-DD-AAEI-01 | TCGA-4G-AAZT-01 | TCGA-DD-AACT-01 | ... | TCGA-BC-A112-01 | TCGA-BW-A5NO-01 | TCGA-FV-A3I1-11 | TCGA-D3-A3MU-06 | TCGA-ER-A2NB-01 | TCGA-XV-A9W2-01 | TCGA-UB-A7MF-01 | TCGA-WX-AA44-01 | TCGA-ER-A194-01 | TCGA-DD-AACQ-01 | TCGA-D3-A2JH-06 | TCGA-EE-A2GR-06 | TCGA-G3-A25U-01 | TCGA-FS-A4F4-06 | TCGA-DD-A1EK-01 | TCGA-EB-A3XE-01 | TCGA-DD-A11C-11 | TCGA-BC-A10Q-11 | TCGA-EE-A2GI-06 | TCGA-EP-A12J-01 | TCGA-W5-AA2X-11 | TCGA-D9-A1JX-06 | TCGA-D3-A2JL-06 | TCGA-EE-A2MM-06 | TCGA-DD-AAED-01 | TCGA-DA-A95X-06 | TCGA-DD-A1EG-01 | TCGA-BC-A10R-01 | TCGA-EE-A2MF-06 | TCGA-G3-A3CG-01 | TCGA-DD-A1ED-01 | TCGA-2Y-A9H8-01 | TCGA-2Y-A9GT-01 | TCGA-3N-A9WD-06 | TCGA-DA-A95W-06 | TCGA-EE-A2GN-06 | TCGA-D9-A4Z6-06 | TCGA-EE-A29L-06 | TCGA-DD-A115-01 | TCGA-FV-A3I0-11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ENSG00000242268.2 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -3.1714 | -9.9658 | -9.9658 | -4.6082 | -9.9658 | -9.9658 | -9.9658 | -5.0116 | -3.8160 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -4.6082 | -2.3884 | -1.8314 | -5.0116 | -9.9658 | -3.3076 | -9.9658 | -4.0350 | -9.9658 | -9.9658 | -1.7322 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -3.8160 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -3.4580 | -9.9658 | -1.9942 | -9.9658 | -1.4699 | -9.9658 | -0.4325 | -9.9658 | -9.9658 | -9.9658 | -1.5105 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -5.0116 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -2.6349 | -2.6349 | -9.9658 | -9.9658 | -4.6082 | -9.9658 | -9.9658 |
ENSG00000259041.1 | -9.9658 | -9.9658 | -9.9658 | -1.6850 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -3.3076 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -0.9406 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 |
ENSG00000270112.3 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -6.5064 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -2.0529 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 0.9343 | -9.9658 | -9.9658 | -4.2934 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | ... | -6.5064 | -9.9658 | -9.9658 | -4.2934 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -6.5064 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -6.5064 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -6.5064 | -9.9658 | -9.9658 | -4.2934 | -9.9658 | -9.9658 | -9.9658 | -9.9658 |
ENSG00000167578.16 | 3.5572 | 5.2308 | 4.6077 | 5.5741 | 4.080749 | 3.2018 | 6.1010 | 3.1540 | 4.5098 | 5.6767 | 5.7337 | 5.8094 | 3.7367 | 3.4739 | 4.8125 | 3.8954 | 5.2972 | 4.1700 | 3.1028 | 4.8485 | 5.1223 | 5.2235 | 4.8440 | 6.3684 | 4.8299 | 4.3169 | 5.7307 | 4.7539 | 4.0636 | 6.1067 | 4.6916 | 3.9487 | 4.6854 | 4.0575 | 5.6186 | 4.1310 | 3.1129 | 2.9966 | 5.2430 | 4.5772 | ... | 4.5324 | 4.9294 | 3.2557 | 5.3917 | 5.1052 | 4.8033 | 5.5799 | 5.0009 | 5.6930 | 4.5367 | 5.5571 | 4.6759 | 3.2988 | 5.7052 | 4.3903 | 5.7391 | 3.5299 | 3.2174 | 4.8827 | 4.0488 | 3.8808 | 5.5568 | 5.8390 | 4.2943 | 4.4647 | 4.6764 | 4.4310 | 4.7735 | 5.5503 | 3.7475 | 4.2072 | 4.6960 | 4.4324 | 4.5311 | 4.3541 | 3.7730 | 5.6846 | 5.2817 | 4.0260 | 3.0876 |
ENSG00000278814.1 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
ENSG00000273233.1 | -9.9658 | -3.4580 | -9.9658 | -9.9658 | -4.442233 | -9.9658 | -9.9658 | -9.9658 | -2.0529 | -2.7274 | -9.9658 | -2.5479 | -3.4580 | -9.9658 | -9.9658 | -4.0350 | -3.6259 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -3.6259 | -3.4580 | -9.9658 | -3.6259 | -2.8262 | -9.9658 | -1.9379 | -9.9658 | -3.3076 | -3.3076 | -9.9658 | -2.7274 | -9.9658 | -9.9658 | -9.9658 | -1.9942 | -9.9658 | ... | -9.9658 | -9.9658 | -9.9658 | -2.4659 | -9.9658 | -2.3884 | -4.0350 | -2.9324 | -9.9658 | -9.9658 | -3.4580 | -9.9658 | -9.9658 | -3.1714 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -3.3076 | -2.4659 | -9.9658 | -9.9658 | -9.9658 | -3.8160 | -2.8262 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -2.5479 | -9.9658 | -2.6349 | -9.9658 | -9.9658 |
ENSG00000105063.18 | 3.4358 | 6.5907 | 4.6317 | 7.8058 | 5.034368 | 5.2365 | 6.0885 | 2.2175 | 4.9346 | 5.3906 | 5.1416 | 5.7926 | 4.9393 | 2.9356 | 4.5154 | 4.8832 | 7.1736 | 4.6247 | 2.8301 | 4.3855 | 5.7978 | 5.4300 | 5.2211 | 5.0879 | 5.6574 | 4.8435 | 6.5130 | 5.2825 | 5.7211 | 5.4050 | 6.0856 | 3.4277 | 5.0972 | 5.1157 | 5.9709 | 3.8591 | 2.8681 | 4.2366 | 4.9901 | 4.2699 | ... | 5.2169 | 3.9974 | 3.2677 | 6.4069 | 4.6236 | 4.7507 | 5.1930 | 5.1343 | 5.5236 | 4.5571 | 6.1048 | 5.5448 | 3.9289 | 5.7727 | 4.5820 | 4.9069 | 2.9857 | 2.7551 | 5.9905 | 4.6106 | 3.4778 | 5.8316 | 5.1575 | 6.0275 | 4.6053 | 5.5871 | 4.9141 | 4.6815 | 5.6848 | 4.2189 | 3.3549 | 5.0900 | 4.0884 | 4.5892 | 4.7735 | 5.7247 | 6.5297 | 6.0788 | 4.4647 | 3.2174 |
ENSG00000231119.2 | -2.5479 | -9.9658 | -6.5064 | -6.5064 | -1.910505 | -2.4659 | -1.4305 | -1.8836 | -3.1714 | -2.1779 | -5.0116 | -4.2934 | -4.2934 | -3.0469 | -2.4659 | -1.3548 | -1.9942 | -4.0350 | -4.0350 | -2.1779 | -2.7274 | -2.8262 | -1.8314 | -3.8160 | -2.3147 | -3.3076 | -2.4659 | -9.9658 | -3.4580 | -0.7108 | -5.5735 | -3.6259 | -3.1714 | -4.2934 | -6.5064 | -5.0116 | -3.8160 | -2.5479 | -5.5735 | -2.9324 | ... | -5.0116 | -1.7809 | -2.5479 | -4.6082 | -9.9658 | -5.0116 | -3.8160 | -2.6349 | -9.9658 | -4.0350 | -3.6259 | -2.7274 | -3.6259 | -5.0116 | -4.0350 | -1.0559 | -4.2934 | -5.0116 | -4.2934 | -2.6349 | -2.9324 | -1.0862 | -5.5735 | -9.9658 | -2.6349 | -1.6850 | -1.5522 | -3.6259 | -3.8160 | -3.3076 | -1.0262 | -5.0116 | -3.3076 | -2.9324 | -5.0116 | -4.2934 | -3.3076 | -9.9658 | -2.3884 | -3.0469 |
ENSG00000280861.1 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 |
ENSG00000181518.3 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -5.5735 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 |
60498 rows × 936 columns
Now that we have our matrix we are ready to start the correlation
Spearman Correlation¶
The input for an analysis where a patient sample is being compared to the TCGA cohort of interest would typically be an expression matrix using the same normalization method as the comparator cohort. Note that differences in library preparation, sequencing and bioinformatics pipelines can lead to variation in expression quantification. It is important to understand the impact these differences have on expression quantification before comparing expression values that are produced using different pipelines.
For the purposes of this tutorial we are going to choose one of the CHOL samples from our expression matrix to use as our input. We arbitrarily choose the first sample.
test_sample_id = str(exp_matrix.iloc[0].index[0])
meta_df[meta_df['sample'] == test_sample_id]
sample | _PATIENT | cancer type abbreviation | age_at_initial_pathologic_diagnosis | gender | race | ajcc_pathologic_tumor_stage | clinical_stage | histological_type | histological_grade | initial_pathologic_dx_year | menopause_status | birth_days_to | vital_status | tumor_status | last_contact_days_to | death_days_to | cause_of_death | new_tumor_event_type | new_tumor_event_site | new_tumor_event_site_other | new_tumor_event_dx_days_to | treatment_outcome_first_course | margin_status | residual_tumor | OS | OS.time | DSS | DSS.time | DFI | DFI.time | PFI | PFI.time | Redaction | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6583 | TCGA-G3-A3CH-11 | TCGA-G3-A3CH | LIHC | 53.0 | MALE | ASIAN | Stage IIIA | NaN | Hepatocellular Carcinoma | G2 | 2010.0 | NaN | -19473.0 | Alive | WITH TUMOR | 780.0 | NaN | NaN | Intrahepatic Recurrence | Liver | NaN | 116.0 | NaN | NaN | R0 | 0.0 | 780.0 | 0.0 | 780.0 | 1.0 | 116.0 | 1.0 | 116.0 | NaN |
Now we compute the correlation of this sample with all other samples in the matrix excluding itself
corr = exp_matrix[[c for c in exp_matrix.columns.tolist() if c != test_sample_id]].corrwith(
exp_matrix[test_sample_id], method='spearman'
).to_frame()
corr = corr.rename(columns={ corr.columns[0]: "correlation" }).reset_index().rename(columns={'index': 'sample'})
corr
sample | correlation | |
---|---|---|
0 | TCGA-RP-A695-06 | 0.840728 |
1 | TCGA-DD-AAW0-01 | 0.901187 |
2 | TCGA-EE-A17X-06 | 0.825520 |
3 | TCGA-DD-AACA-02 | 0.871860 |
4 | TCGA-DD-AACA-01 | 0.883324 |
... | ... | ... |
930 | TCGA-EE-A2GN-06 | 0.838558 |
931 | TCGA-D9-A4Z6-06 | 0.824104 |
932 | TCGA-EE-A29L-06 | 0.817158 |
933 | TCGA-DD-A115-01 | 0.885494 |
934 | TCGA-FV-A3I0-11 | 0.908514 |
935 rows × 2 columns
Merge that with the metadata to group the correlations into their respective cohorts for plotting
corr = corr.merge(meta_df[['sample', 'cancer type abbreviation']], on=['sample'], how='left')
corr
sample | correlation | cancer type abbreviation | |
---|---|---|---|
0 | TCGA-RP-A695-06 | 0.840728 | SKCM |
1 | TCGA-DD-AAW0-01 | 0.901187 | LIHC |
2 | TCGA-EE-A17X-06 | 0.825520 | SKCM |
3 | TCGA-DD-AACA-02 | 0.871860 | LIHC |
4 | TCGA-DD-AACA-01 | 0.883324 | LIHC |
... | ... | ... | ... |
930 | TCGA-EE-A2GN-06 | 0.838558 | SKCM |
931 | TCGA-D9-A4Z6-06 | 0.824104 | SKCM |
932 | TCGA-EE-A29L-06 | 0.817158 | SKCM |
933 | TCGA-DD-A115-01 | 0.885494 | LIHC |
934 | TCGA-FV-A3I0-11 | 0.908514 | LIHC |
935 rows × 3 columns
Now plot the correlations
fig = sns.catplot(kind='box', data=corr, x='correlation', y='cancer type abbreviation')
title = test_sample_id + ' (' + meta_df[meta_df['sample'] == test_sample_id].iloc[0]['cancer type abbreviation'] +')'
fig.set(xlabel='spearman correlation', ylabel='', title=title)
<seaborn.axisgrid.FacetGrid at 0x7f3bc4d89350>
As expected, the average pairwise correlation is highest within the same disease cohort as the sample
Principle Component Analysis (PCA)¶
The principal component analysis would serve a similar function to the pairwise spearman correlations plot above and could be included in place of said plot if preferred. For this we will need to install more python libraries
!pip install scikit-learn
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (0.22.2.post1)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)
Now we are ready to set up the PCA model.
from sklearn.decomposition import PCA
pca = PCA()
Our input to the PCA model is a matrix with each sample as a row. Since in our current matrix the genes are rows we need to transpose this
X = exp_matrix.T
X
gene | ENSG00000242268.2 | ENSG00000259041.1 | ENSG00000270112.3 | ENSG00000167578.16 | ENSG00000278814.1 | ENSG00000078237.5 | ENSG00000269416.5 | ENSG00000263642.1 | ENSG00000146083.11 | ENSG00000158486.13 | ENSG00000273639.4 | ENSG00000198242.13 | ENSG00000231981.3 | ENSG00000269475.2 | ENSG00000134108.12 | ENSG00000261030.1 | ENSG00000172137.18 | ENSG00000276644.4 | ENSG00000240423.1 | ENSG00000271616.1 | ENSG00000234881.1 | ENSG00000236040.1 | ENSG00000231105.1 | ENSG00000094963.13 | ENSG00000182141.9 | ENSG00000280143.1 | ENSG00000251334.2 | ENSG00000231112.1 | ENSG00000258610.1 | ENSG00000264981.1 | ENSG00000275265.1 | ENSG00000185105.4 | ENSG00000233540.1 | ENSG00000102174.8 | ENSG00000166391.14 | ENSG00000232001.1 | ENSG00000270469.1 | ENSG00000225275.4 | ENSG00000234253.1 | ENSG00000070087.13 | ... | ENSG00000279778.1 | ENSG00000223671.2 | ENSG00000263573.1 | ENSG00000222213.1 | ENSG00000214124.3 | ENSG00000206836.1 | ENSG00000233845.1 | ENSG00000066044.13 | ENSG00000264491.1 | ENSG00000146587.17 | ENSG00000278151.1 | ENSG00000228658.1 | ENSG00000173930.8 | ENSG00000274396.1 | ENSG00000107863.16 | ENSG00000199892.2 | ENSG00000221760.1 | ENSG00000253333.1 | ENSG00000213782.7 | ENSG00000146707.14 | ENSG00000212084.2 | ENSG00000248838.2 | ENSG00000255083.1 | ENSG00000158417.10 | ENSG00000223665.1 | ENSG00000203729.8 | ENSG00000238300.1 | ENSG00000221756.1 | ENSG00000089177.17 | ENSG00000186115.12 | ENSG00000009694.13 | ENSG00000238244.3 | ENSG00000216352.1 | ENSG00000123685.8 | ENSG00000267117.1 | ENSG00000273233.1 | ENSG00000105063.18 | ENSG00000231119.2 | ENSG00000280861.1 | ENSG00000181518.3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TCGA-G3-A3CH-11 | -9.9658 | -9.9658 | -9.9658 | 3.557200 | -9.9658 | 0.099000 | -9.965800 | -9.9658 | 1.832300 | -9.96580 | -9.9658 | 8.372500 | -9.9658 | -9.9658 | 3.502200 | -9.9658 | -9.965800 | -1.148800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | -1.086200 | -1.117200 | -1.214200 | -9.9658 | -9.9658 | 0.84880 | -9.9658 | -3.625900 | -9.965800 | -9.965800 | -5.573500 | 5.737700 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.150900 | ... | -6.5064 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 3.408800 | -9.965800 | 0.099000 | -9.9658 | -9.9658 | 0.6969 | -9.9658 | 2.111400 | -9.9658 | -9.9658 | -9.9658 | 3.570600 | 0.566600 | -9.9658 | -9.9658 | -9.9658 | 3.149100 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 1.590200 | 6.864000 | -1.7809 | -9.9658 | -9.9658 | 0.264200 | -3.6259 | -9.965800 | 3.435800 | -2.547900 | -9.9658 | -9.9658 |
TCGA-RP-A695-06 | -9.9658 | -9.9658 | -9.9658 | 5.230800 | -9.9658 | 3.092700 | -9.965800 | -9.9658 | 3.201800 | -3.45800 | -9.9658 | 10.228500 | -9.9658 | -9.9658 | 5.436700 | -9.9658 | -1.994200 | -5.573500 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -5.011600 | -2.465900 | -0.284500 | 0.566600 | -9.9658 | -9.9658 | 1.18330 | -9.9658 | -9.965800 | -6.506400 | -9.965800 | -3.046900 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 6.128700 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 5.601400 | -9.965800 | 1.580600 | -9.9658 | -9.9658 | -4.6082 | -9.9658 | 3.502200 | -9.9658 | -9.9658 | -9.9658 | 4.384800 | 3.737800 | -9.9658 | -9.9658 | -9.9658 | 4.073000 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.451800 | -4.293400 | -6.5064 | -9.9658 | -9.9658 | 1.316700 | -9.9658 | -3.458000 | 6.590700 | -9.965800 | -9.9658 | -9.9658 |
TCGA-DD-AAW0-01 | -9.9658 | -9.9658 | -9.9658 | 4.607700 | -9.9658 | 2.646400 | -4.293400 | -9.9658 | 3.808500 | -4.29340 | -9.9658 | 8.705700 | -9.9658 | -9.9658 | 4.635200 | -9.9658 | -9.965800 | -1.780900 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | 0.783200 | 0.215400 | 0.444700 | -9.9658 | -9.9658 | 1.48590 | -9.9658 | -9.965800 | -5.573500 | -9.965800 | -5.011600 | 5.382700 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 1.196000 | ... | -4.6082 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 4.635800 | -9.965800 | 1.339700 | -9.9658 | -9.9658 | -4.0350 | -9.9658 | 3.489400 | -9.9658 | -9.9658 | -9.9658 | 4.165200 | 1.566100 | -9.9658 | -9.9658 | -4.6082 | 3.956100 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.186200 | 7.373500 | -5.5735 | -9.9658 | -9.9658 | 0.001400 | -9.9658 | -9.965800 | 4.631700 | -6.506400 | -9.9658 | -9.9658 |
TCGA-EE-A17X-06 | -9.9658 | -1.6850 | -9.9658 | 5.574100 | -9.9658 | 3.165300 | -9.965800 | -9.9658 | 2.582800 | -6.50640 | -9.9658 | 10.009000 | -9.9658 | -9.9658 | 4.784600 | -9.9658 | -2.314700 | -4.608200 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -5.011600 | -2.314700 | -0.338300 | -0.249800 | -9.9658 | -9.9658 | 0.64250 | -9.9658 | -9.965800 | -9.965800 | -0.997100 | -6.506400 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 5.923400 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 6.418900 | -9.965800 | 1.322500 | -9.9658 | -9.9658 | -5.5735 | -9.9658 | 4.085800 | -9.9658 | -9.9658 | -9.9658 | 5.278400 | 4.536100 | -9.9658 | -9.9658 | -9.9658 | 5.495700 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.205100 | -3.816000 | -9.9658 | -9.9658 | -9.9658 | -0.687300 | -3.6259 | -9.965800 | 7.805800 | -6.506400 | -9.9658 | -9.9658 |
TCGA-DD-AACA-02 | -9.9658 | -9.9658 | -9.9658 | 4.080749 | -9.9658 | -2.114057 | 0.407079 | -9.9658 | 3.499608 | -3.62591 | -9.9658 | 10.322895 | -9.9658 | -9.9658 | 4.162807 | -9.9658 | -4.442233 | -3.921375 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -5.965822 | -3.717877 | 0.276236 | -0.619235 | -9.9658 | -9.9658 | 0.79993 | -9.9658 | -3.380826 | -2.775961 | -4.293386 | -7.380866 | 1.543998 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.055201 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 4.657673 | -3.625967 | 0.856772 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 3.205689 | -9.9658 | -9.9658 | -9.9658 | 4.628535 | 2.168608 | -9.9658 | -9.9658 | -9.9658 | 4.183576 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 1.331141 | 6.773435 | -9.9658 | -9.9658 | -9.9658 | -0.641625 | -9.9658 | -4.442233 | 5.034368 | -1.910505 | -9.9658 | -9.9658 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
TCGA-EE-A2GN-06 | -9.9658 | -9.9658 | -4.2934 | 3.773000 | -9.9658 | 3.740000 | -2.826200 | -9.9658 | 4.802200 | -2.46590 | -9.9658 | 11.006800 | -9.9658 | -9.9658 | 5.737700 | -9.9658 | -2.314700 | -3.046900 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -3.307600 | -1.469900 | 0.723300 | 0.888300 | -9.9658 | -9.9658 | 1.88790 | -9.9658 | -9.965800 | -3.046900 | -9.965800 | -3.046900 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 4.970900 | ... | -3.4580 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 5.912400 | -9.965800 | 2.694000 | -9.9658 | -9.9658 | -6.5064 | -9.9658 | 3.270700 | -9.9658 | -9.9658 | -0.7108 | 5.218000 | 4.586200 | -9.9658 | -9.9658 | -9.9658 | 5.684000 | -9.9658 | -3.8160 | -9.9658 | -9.9658 | 3.132700 | -6.506400 | -3.6259 | -9.9658 | -9.9658 | 1.526600 | -3.6259 | -2.547900 | 5.724700 | -4.293400 | -9.9658 | -9.9658 |
TCGA-D9-A4Z6-06 | -9.9658 | -9.9658 | -9.9658 | 5.684600 | -9.9658 | 3.996500 | -6.506400 | -9.9658 | 4.482900 | -2.05290 | -9.9658 | 10.159600 | -9.9658 | -9.9658 | 4.879800 | -9.9658 | -0.284500 | -2.932400 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -4.293400 | -1.639400 | 1.098300 | -0.087700 | -9.9658 | -9.9658 | 1.26960 | -9.9658 | -2.177900 | -5.011600 | -9.965800 | -3.307600 | -9.965800 | -9.9658 | -4.2934 | -9.9658 | -9.9658 | 6.171300 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 6.079200 | -9.965800 | 2.018300 | -9.9658 | -9.9658 | -4.0350 | -9.9658 | 2.912800 | -9.9658 | -9.9658 | -1.0262 | 5.615300 | 3.476500 | -9.9658 | -9.9658 | -9.9658 | 4.551600 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.498500 | -9.965800 | -9.9658 | -9.9658 | -9.9658 | 0.368500 | -3.1714 | -9.965800 | 6.529700 | -3.307600 | -9.9658 | -9.9658 |
TCGA-EE-A29L-06 | -4.6082 | -9.9658 | -9.9658 | 5.281700 | -9.9658 | 4.370200 | -6.506400 | -9.9658 | 3.438400 | -9.96580 | -9.9658 | 10.427600 | -9.9658 | -9.9658 | 6.153200 | -9.9658 | -5.573500 | -5.011600 | -9.9658 | -9.9658 | -9.9658 | -2.1779 | -6.506400 | -1.181100 | 2.046500 | -1.430500 | -9.9658 | -9.9658 | -0.43250 | -9.9658 | -2.727400 | -4.035000 | -2.388400 | -6.506400 | -9.965800 | -9.9658 | -5.0116 | -9.9658 | -9.9658 | 7.421200 | ... | -6.5064 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 5.801700 | -9.965800 | 3.513600 | -9.9658 | -9.9658 | -5.5735 | -9.9658 | 4.181200 | -9.9658 | -9.9658 | -9.9658 | 5.083300 | 3.028700 | -9.9658 | -9.9658 | -9.9658 | 6.211000 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.985700 | -9.965800 | -6.5064 | -9.9658 | -9.9658 | 2.070700 | -9.9658 | -2.634900 | 6.078800 | -9.965800 | -9.9658 | -9.9658 |
TCGA-DD-A115-01 | -9.9658 | -9.9658 | -9.9658 | 4.026000 | -9.9658 | 2.154100 | -1.318300 | -9.9658 | 3.260200 | -4.03500 | -3.8160 | 9.137500 | -9.9658 | -4.6082 | 4.263100 | -9.9658 | -5.573500 | -3.307600 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -5.573500 | -0.042500 | -0.375200 | 0.240000 | -9.9658 | -9.9658 | 2.28130 | -9.9658 | -9.965800 | -9.965800 | -2.932400 | -5.573500 | 2.899400 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 2.716100 | ... | -1.1172 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 4.764500 | -9.965800 | 0.749300 | -9.9658 | -9.9658 | -1.7322 | -9.9658 | 3.411600 | -9.9658 | -9.9658 | -9.9658 | 3.997400 | 1.164100 | -9.9658 | -9.9658 | -9.9658 | 3.720400 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 1.820100 | 6.673000 | -2.1779 | -9.9658 | -9.9658 | 0.311500 | -9.9658 | -9.965800 | 4.464700 | -2.388400 | -9.9658 | -9.9658 |
TCGA-FV-A3I0-11 | -9.9658 | -9.9658 | -9.9658 | 3.087600 | -9.9658 | -0.575600 | -9.965800 | -9.9658 | 1.692000 | -9.96580 | -9.9658 | 8.292000 | -9.9658 | -9.9658 | 3.569400 | -9.9658 | -5.011600 | -2.388400 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.965800 | -1.595100 | -1.994200 | -0.687300 | -9.9658 | -9.9658 | -0.32010 | -9.9658 | -9.965800 | -9.965800 | -9.965800 | -6.506400 | 5.202100 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 3.309000 | ... | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | 3.577900 | -9.965800 | 0.357300 | -9.9658 | -9.9658 | 0.0158 | -9.9658 | 2.160600 | -9.9658 | -9.9658 | -9.9658 | 3.577900 | 0.774800 | -9.9658 | -9.9658 | -9.9658 | 3.082500 | -9.9658 | -9.9658 | -9.9658 | -9.9658 | -0.249800 | 7.026400 | -2.2447 | -9.9658 | -9.9658 | 0.614500 | -9.9658 | -9.965800 | 3.217400 | -3.046900 | -9.9658 | -9.9658 |
936 rows × 60498 columns
Now we are ready to fit the PCA
fit = pca.fit(X)
fit
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
svd_solver='auto', tol=0.0, whiten=False)
Plot the explained variance by the number of components for any components that explain at least 0.5% of the variance
components_df = pd.DataFrame(enumerate(fit.explained_variance_ratio_), columns=['component', 'variance_explained'])
components_df['cum_var'] = components_df.variance_explained.cumsum()
components_df
component | variance_explained | cum_var | |
---|---|---|---|
0 | 0 | 2.098103e-01 | 0.209810 |
1 | 1 | 4.941317e-02 | 0.259223 |
2 | 2 | 2.684464e-02 | 0.286068 |
3 | 3 | 2.096802e-02 | 0.307036 |
4 | 4 | 1.821842e-02 | 0.325255 |
... | ... | ... | ... |
931 | 931 | 2.724521e-04 | 0.999199 |
932 | 932 | 2.708955e-04 | 0.999470 |
933 | 933 | 2.679152e-04 | 0.999738 |
934 | 934 | 2.619554e-04 | 1.000000 |
935 | 935 | 2.545080e-28 | 1.000000 |
936 rows × 3 columns
fig = sns.relplot(data=components_df[components_df.variance_explained >= 0.005], x='component', y='variance_explained', kind='line')
fig.set(xlabel='Component #', ylabel='Proportion of Variance Explained')
<seaborn.axisgrid.FacetGrid at 0x7f3bc2ef49d0>
If we want to see the variance explained as we keep X components we can plot that as well
fig = sns.relplot(data=components_df, x='component', y='cum_var', kind='line')
fig.set(xlabel='Number of Components', ylabel='Proportion of Variance Explained')
<seaborn.axisgrid.FacetGrid at 0x7f3bc2ec0450>
To visualize this in 2-d we can only plot the first two components
pca_xy = PCA(n_components=2)
fit_xy = pd.DataFrame(pca_xy.fit_transform(X), index=X.index, columns=['component 1', 'component 2'])
fit_xy
component 1 | component 2 | |
---|---|---|
TCGA-G3-A3CH-11 | 313.877223 | 58.038358 |
TCGA-RP-A695-06 | -178.026704 | 152.348532 |
TCGA-DD-AAW0-01 | 205.633995 | -104.460686 |
TCGA-EE-A17X-06 | -185.690333 | 172.857407 |
TCGA-DD-AACA-02 | 226.342307 | -60.439144 |
... | ... | ... |
TCGA-EE-A2GN-06 | -253.444806 | 21.297225 |
TCGA-D9-A4Z6-06 | -243.644653 | 27.199689 |
TCGA-EE-A29L-06 | -220.859424 | 123.556796 |
TCGA-DD-A115-01 | 250.808496 | -44.392051 |
TCGA-FV-A3I0-11 | 375.594610 | 127.824032 |
936 rows × 2 columns
Now join this back to the metadata to get the groupings
fig_df = fit_xy.reset_index().rename(columns={'index': 'sample'}).merge(meta_df, on=['sample'], how='left')
fig_df[['sample', 'component 1', 'component 2', 'cancer type abbreviation']]
sample | component 1 | component 2 | cancer type abbreviation | |
---|---|---|---|---|
0 | TCGA-G3-A3CH-11 | 313.877223 | 58.038358 | LIHC |
1 | TCGA-RP-A695-06 | -178.026704 | 152.348532 | SKCM |
2 | TCGA-DD-AAW0-01 | 205.633995 | -104.460686 | LIHC |
3 | TCGA-EE-A17X-06 | -185.690333 | 172.857407 | SKCM |
4 | TCGA-DD-AACA-02 | 226.342307 | -60.439144 | LIHC |
... | ... | ... | ... | ... |
931 | TCGA-EE-A2GN-06 | -253.444806 | 21.297225 | SKCM |
932 | TCGA-D9-A4Z6-06 | -243.644653 | 27.199689 | SKCM |
933 | TCGA-EE-A29L-06 | -220.859424 | 123.556796 | SKCM |
934 | TCGA-DD-A115-01 | 250.808496 | -44.392051 | LIHC |
935 | TCGA-FV-A3I0-11 | 375.594610 | 127.824032 | LIHC |
936 rows × 4 columns
Add a flag column for our sample of interest
fig_df.loc[fig_df['sample'] == test_sample_id, 'target sample'] = True
fig_df['target sample'] = fig_df['target sample'].fillna(False)
fig_df[['sample', 'component 1', 'component 2', 'cancer type abbreviation', 'target sample']]
sample | component 1 | component 2 | cancer type abbreviation | target sample | |
---|---|---|---|---|---|
0 | TCGA-G3-A3CH-11 | 313.877223 | 58.038358 | LIHC | True |
1 | TCGA-RP-A695-06 | -178.026704 | 152.348532 | SKCM | False |
2 | TCGA-DD-AAW0-01 | 205.633995 | -104.460686 | LIHC | False |
3 | TCGA-EE-A17X-06 | -185.690333 | 172.857407 | SKCM | False |
4 | TCGA-DD-AACA-02 | 226.342307 | -60.439144 | LIHC | False |
... | ... | ... | ... | ... | ... |
931 | TCGA-EE-A2GN-06 | -253.444806 | 21.297225 | SKCM | False |
932 | TCGA-D9-A4Z6-06 | -243.644653 | 27.199689 | SKCM | False |
933 | TCGA-EE-A29L-06 | -220.859424 | 123.556796 | SKCM | False |
934 | TCGA-DD-A115-01 | 250.808496 | -44.392051 | LIHC | False |
935 | TCGA-FV-A3I0-11 | 375.594610 | 127.824032 | LIHC | False |
936 rows × 5 columns
fig = sns.relplot(kind='scatter', data=fig_df, x='component 1', y='component 2', hue='cancer type abbreviation')
Add the annotation to pick out our current sample in the PCA. We are going to plot overtop of the original plot to accomplish this
ax = sns.scatterplot(data=fig_df, x='component 1', y='component 2', hue='cancer type abbreviation', alpha=0.5)
ax = sns.scatterplot(data=fig_df[fig_df['sample'] == test_sample_id], x='component 1', y='component 2', marker='X', alpha=1, ax=ax, color='black', s=300)
As we saw with the correlation plot, the LIHC sample is grouped with its disease cohort