Computing RNA Expression Similarity¶

Download the Data¶

First we need to download the data set we will use. We are going to use the TCGA data here since it is publicly available and well known (UCSC Xena TCGA PANCAN). We will download the TPM matrix as well as the corresponding metadata file.

!wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_RSEM_gene_tpm.gz
!wget https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/Survival_SupplementalTable_S1_20171025_xena_sp

--2021-06-10 17:34:56--  https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_RSEM_gene_tpm.gz
Resolving toil-xena-hub.s3.us-east-1.amazonaws.com (toil-xena-hub.s3.us-east-1.amazonaws.com)... 52.216.136.30
Connecting to toil-xena-hub.s3.us-east-1.amazonaws.com (toil-xena-hub.s3.us-east-1.amazonaws.com)|52.216.136.30|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 740772247 (706M) [binary/octet-stream]
Saving to: ‘tcga_RSEM_gene_tpm.gz’

tcga_RSEM_gene_tpm. 100%[===================>] 706.46M  15.5MB/s    in 49s     

2021-06-10 17:35:46 (14.5 MB/s) - ‘tcga_RSEM_gene_tpm.gz’ saved [740772247/740772247]

--2021-06-10 17:35:46--  https://tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com/download/Survival_SupplementalTable_S1_20171025_xena_sp
Resolving tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com (tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com)... 52.216.94.142
Connecting to tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com (tcga-pancan-atlas-hub.s3.us-east-1.amazonaws.com)|52.216.94.142|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2419504 (2.3M) [binary/octet-stream]
Saving to: ‘Survival_SupplementalTable_S1_20171025_xena_sp’

Survival_Supplement 100%[===================>]   2.31M  1.75MB/s    in 1.3s    

2021-06-10 17:35:48 (1.75 MB/s) - ‘Survival_SupplementalTable_S1_20171025_xena_sp’ saved [2419504/2419504]
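Because the expression matrix is roughly 706 MB, an interrupted transfer is easy to miss. The following is a minimal sketch of a size check, assuming the byte counts reported in the wget log above are still current for the hosted files. The helper name `has_expected_size` is ours and is not part of any library.

```python
import os

# Byte counts copied from the "Length:" lines of the wget log above.
# Assumption: the hosted files have not changed since that log was produced.
EXPECTED_SIZES = {
    'tcga_RSEM_gene_tpm.gz': 740772247,
    'Survival_SupplementalTable_S1_20171025_xena_sp': 2419504,
}


def has_expected_size(path, expected_bytes):
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


for name, size in EXPECTED_SIZES.items():
    status = 'ok' if has_expected_size(name, size) else 'missing or incomplete'
    print(f'{name}: {status}')
```

If either file is reported as missing or incomplete, rerunning the corresponding download is the simplest fix.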

The matrix we have downloaded is very large because it contains many different disease cohorts. For the purposes of this tutorial we want a smaller data set, so we will keep only a few cohorts, including the smallest TCGA cohort, CHOL. To do this we need to use the metadata file to subset the main matrix file.

Install Dependencies¶

For the following tutorial we are going to use some common data processing and visualization libraries in Python: seaborn and pandas. We need to install them before we proceed.

!pip install pandas seaborn

Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (1.1.5)
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (0.11.1)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/dist-packages (from pandas) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas) (2.8.1)
Requirement already satisfied: numpy>=1.15.4 in /usr/local/lib/python3.7/dist-packages (from pandas) (1.19.5)
Requirement already satisfied: matplotlib>=2.2 in /usr/local/lib/python3.7/dist-packages (from seaborn) (3.2.2)
Requirement already satisfied: scipy>=1.0 in /usr/local/lib/python3.7/dist-packages (from seaborn) (1.4.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (2.4.7)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.7/dist-packages (from matplotlib>=2.2->seaborn) (1.3.1)

Create the Disease Matrix¶

First read the metadata file

import json

import seaborn as sns
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

sns.set_style('whitegrid')

meta_df = pd.read_csv('Survival_SupplementalTable_S1_20171025_xena_sp', sep='\t')
meta_df

	sample	_PATIENT	cancer type abbreviation	age_at_initial_pathologic_diagnosis	gender	race	ajcc_pathologic_tumor_stage	clinical_stage	histological_type	histological_grade	initial_pathologic_dx_year	menopause_status	birth_days_to	vital_status	tumor_status	last_contact_days_to	death_days_to	cause_of_death	new_tumor_event_type	new_tumor_event_site	new_tumor_event_site_other	new_tumor_event_dx_days_to	treatment_outcome_first_course	margin_status	residual_tumor	OS	OS.time	DSS	DSS.time	DFI	DFI.time	PFI	PFI.time	Redaction
0	TCGA-OR-A5J1-01	TCGA-OR-A5J1	ACC	58.0	MALE	WHITE	Stage II	NaN	Adrenocortical carcinoma- Usual Type	NaN	2000.0	NaN	-21496.0	Dead	WITH TUMOR	NaN	1355.0	NaN	Distant Metastasis	Peritoneal Surfaces	NaN	754.0	Complete Remission/Response	NaN	NaN	1.0	1355.0	1.0	1355.0	1.0	754.0	1.0	754.0	NaN
1	TCGA-OR-A5J2-01	TCGA-OR-A5J2	ACC	44.0	FEMALE	WHITE	Stage IV	NaN	Adrenocortical carcinoma- Usual Type	NaN	2004.0	NaN	-16090.0	Dead	WITH TUMOR	NaN	1677.0	NaN	Distant Metastasis	Soft Tissue	NaN	289.0	Progressive Disease	NaN	NaN	1.0	1677.0	1.0	1677.0	NaN	NaN	1.0	289.0	NaN
2	TCGA-OR-A5J3-01	TCGA-OR-A5J3	ACC	23.0	FEMALE	WHITE	Stage III	NaN	Adrenocortical carcinoma- Usual Type	NaN	2008.0	NaN	-8624.0	Alive	WITH TUMOR	2091.0	NaN	NaN	Distant Metastasis	Lung	NaN	53.0	Complete Remission/Response	NaN	NaN	0.0	2091.0	0.0	2091.0	1.0	53.0	1.0	53.0	NaN
3	TCGA-OR-A5J4-01	TCGA-OR-A5J4	ACC	23.0	FEMALE	WHITE	Stage IV	NaN	Adrenocortical carcinoma- Usual Type	NaN	2000.0	NaN	-8451.0	Dead	WITH TUMOR	NaN	423.0	NaN	Locoregional Recurrence	Peritoneal Surfaces	NaN	126.0	Progressive Disease	NaN	NaN	1.0	423.0	1.0	423.0	NaN	NaN	1.0	126.0	NaN
4	TCGA-OR-A5J5-01	TCGA-OR-A5J5	ACC	30.0	MALE	WHITE	Stage III	NaN	Adrenocortical carcinoma- Usual Type	NaN	2000.0	NaN	-11171.0	Dead	WITH TUMOR	NaN	365.0	NaN	Locoregional Recurrence	Other, specify	vena cava thrombus	50.0	Progressive Disease	NaN	NaN	1.0	365.0	1.0	365.0	NaN	NaN	1.0	50.0	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
12586	TCGA-YZ-A980-01	TCGA-YZ-A980	UVM	75.0	MALE	WHITE	Stage IIIA	Stage IIIA	Spindle Cell\|Epithelioid Cell	NaN	2010.0	NaN	-27716.0	Alive	TUMOR FREE	1862.0	NaN	NaN	New Primary Tumor	Other, specify	Scalp	1556.0	NaN	NaN	NaN	0.0	1862.0	0.0	1862.0	NaN	NaN	1.0	1556.0	NaN
12587	TCGA-YZ-A982-01	TCGA-YZ-A982	UVM	79.0	FEMALE	WHITE	Stage IIIB	Stage IIIB	Spindle Cell	NaN	2013.0	NaN	-28938.0	Alive	TUMOR FREE	495.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	495.0	0.0	495.0	NaN	NaN	0.0	495.0	NaN
12588	TCGA-YZ-A983-01	TCGA-YZ-A983	UVM	51.0	FEMALE	WHITE	Stage IIB	Stage IIB	Epithelioid Cell	NaN	2013.0	NaN	-18769.0	Alive	TUMOR FREE	798.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	798.0	0.0	798.0	NaN	NaN	0.0	798.0	NaN
12589	TCGA-YZ-A984-01	TCGA-YZ-A984	UVM	50.0	FEMALE	WHITE	Stage IIB	Stage IIIA	Spindle Cell\|Epithelioid Cell	NaN	2011.0	NaN	-18342.0	Dead	WITH TUMOR	NaN	1396.0	Metastatic Uveal Melanoma	New Primary Tumor	Other, specify	Thyroid	154.0	NaN	NaN	NaN	1.0	1396.0	1.0	1396.0	NaN	NaN	1.0	154.0	NaN
12590	TCGA-YZ-A985-01	TCGA-YZ-A985	UVM	41.0	FEMALE	WHITE	Stage IIIA	Stage IIIA	Spindle Cell	NaN	2012.0	NaN	-15164.0	Alive	TUMOR FREE	1184.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	1184.0	0.0	1184.0	NaN	NaN	0.0	1184.0	NaN

12591 rows × 34 columns

To keep this tutorial fast and light on compute resources, we will use a subset of the total set of samples. In practice we would likely store this in a database to improve performance. If you would like to use the entire data set, you can remove the "usecols" argument from the read_csv call further down. First, let's use the metadata file to see how many samples we have in each of the different cohorts.

meta_df.groupby('cancer type abbreviation').agg({'sample': 'nunique'})

	sample
cancer type abbreviation
ACC	92
BLCA	436
BRCA	1236
CESC	312
CHOL	45
COAD	545
DLBC	48
ESCA	204
GBM	602
HNSC	604
KICH	91
KIRC	944
KIRP	352
LAML	200
LGG	529
LIHC	438
LUAD	641
LUSC	623
MESO	87
OV	604
PAAD	196
PCPG	187
PRAD	566
READ	183
SARC	271
SKCM	479
STAD	511
TGCT	139
THCA	580
THYM	126
UCEC	583
UCS	57
UVM	80

Now we will select three cohorts: CHOL, LIHC, and SKCM.

columns_to_keep = set(['sample'] + meta_df[meta_df['cancer type abbreviation'].isin({'CHOL','SKCM', 'LIHC'})]['sample'].tolist())
len(columns_to_keep)

The data matrix is large, so reading it is slow. Even when we load only the selected columns, this will probably take a few minutes. In practice we would likely store this in a database to improve performance.

exp_matrix = pd.read_csv('tcga_RSEM_gene_tpm.gz', compression='gzip', sep='\t', usecols=lambda x: x in columns_to_keep)
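To illustrate how the "usecols" callable filters columns while the file is parsed, here is a small sketch on an in-memory, tab-separated table. The sample IDs and values are invented for illustration; only the shape of the read_csv call mirrors the one above. The sketch also lists requested samples that are absent from the file, which can happen when a sample has metadata but no expression data.

```python
import io

import pandas as pd

# Toy stand-in for the expression file: one 'sample' column holding gene IDs,
# followed by one column per sample. All IDs and values here are invented.
toy_tsv = (
    'sample\tS1\tS2\tS3\n'
    'geneA\t1.0\t2.0\t3.0\n'
    'geneB\t4.0\t5.0\t6.0\n'
)

# S4 is requested but does not exist in the toy file.
toy_columns_to_keep = {'sample', 'S1', 'S3', 'S4'}

toy_matrix = pd.read_csv(
    io.StringIO(toy_tsv),
    sep='\t',
    # The callable is evaluated against each column name; columns for which
    # it returns False are skipped.
    usecols=lambda name: name in toy_columns_to_keep,
)

# Requested columns that were not found in the file.
missing = sorted(toy_columns_to_keep - set(toy_matrix.columns))
print(toy_matrix.columns.tolist())
print(missing)
```

Applied to the real data, the same set difference between columns_to_keep and the loaded columns would show which selected samples have no expression profile.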

To make things less confusing we will rename the "sample" column as "gene" since that is what is actually in that column

exp_matrix = exp_matrix.rename(columns={'sample': 'gene'})
exp_matrix = exp_matrix.set_index('gene')
exp_matrix

	TCGA-G3-A3CH-11	TCGA-RP-A695-06	TCGA-DD-AAW0-01	TCGA-EE-A17X-06	TCGA-DD-AACA-02	TCGA-DD-AACA-01	TCGA-D3-A8GD-06	TCGA-DD-A3A6-11	TCGA-K7-AAU7-01	TCGA-FS-A1ZF-06	TCGA-EB-A41A-01	TCGA-RC-A7SH-01	TCGA-CC-A3MA-01	TCGA-DD-A3A5-11	TCGA-RC-A7SB-01	TCGA-D3-A3ML-06	TCGA-EE-A2GB-06	TCGA-DD-AAVZ-01	TCGA-DD-A116-11	TCGA-ED-A627-01	TCGA-ER-A19S-06	TCGA-EE-A3AB-06	TCGA-DD-AAVV-01	TCGA-XV-AAZW-01	TCGA-D3-A5GR-06	TCGA-BC-4073-01	TCGA-W3-AA1R-06	TCGA-QA-A7B7-01	TCGA-DA-A1IC-06	TCGA-D3-A51G-06	TCGA-BF-A5ER-01	TCGA-HP-A5N0-01	TCGA-D3-A5GU-06	TCGA-ZH-A8Y5-01	TCGA-FS-A1ZB-06	TCGA-DD-A114-11	TCGA-DD-A1EG-11	TCGA-DD-AAEI-01	TCGA-4G-AAZT-01	TCGA-DD-AACT-01	...	TCGA-BC-A112-01	TCGA-BW-A5NO-01	TCGA-FV-A3I1-11	TCGA-D3-A3MU-06	TCGA-ER-A2NB-01	TCGA-XV-A9W2-01	TCGA-UB-A7MF-01	TCGA-WX-AA44-01	TCGA-ER-A194-01	TCGA-DD-AACQ-01	TCGA-D3-A2JH-06	TCGA-EE-A2GR-06	TCGA-G3-A25U-01	TCGA-FS-A4F4-06	TCGA-DD-A1EK-01	TCGA-EB-A3XE-01	TCGA-DD-A11C-11	TCGA-BC-A10Q-11	TCGA-EE-A2GI-06	TCGA-EP-A12J-01	TCGA-W5-AA2X-11	TCGA-D9-A1JX-06	TCGA-D3-A2JL-06	TCGA-EE-A2MM-06	TCGA-DD-AAED-01	TCGA-DA-A95X-06	TCGA-DD-A1EG-01	TCGA-BC-A10R-01	TCGA-EE-A2MF-06	TCGA-G3-A3CG-01	TCGA-DD-A1ED-01	TCGA-2Y-A9H8-01	TCGA-2Y-A9GT-01	TCGA-3N-A9WD-06	TCGA-DA-A95W-06	TCGA-EE-A2GN-06	TCGA-D9-A4Z6-06	TCGA-EE-A29L-06	TCGA-DD-A115-01	TCGA-FV-A3I0-11
gene
ENSG00000242268.2	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	-9.9658	-9.9658	-9.9658	-3.1714	-9.9658	-9.9658	-4.6082	-9.9658	-9.9658	-9.9658	-5.0116	-3.8160	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-4.6082	-2.3884	-1.8314	-5.0116	-9.9658	-3.3076	-9.9658	-4.0350	-9.9658	-9.9658	-1.7322	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-3.8160	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-3.4580	-9.9658	-1.9942	-9.9658	-1.4699	-9.9658	-0.4325	-9.9658	-9.9658	-9.9658	-1.5105	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-5.0116	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-2.6349	-2.6349	-9.9658	-9.9658	-4.6082	-9.9658	-9.9658
ENSG00000259041.1	-9.9658	-9.9658	-9.9658	-1.6850	-9.965800	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-3.3076	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-0.9406	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
ENSG00000270112.3	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	-9.9658	-9.9658	-9.9658	-6.5064	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-2.0529	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	0.9343	-9.9658	-9.9658	-4.2934	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-6.5064	-9.9658	-9.9658	-4.2934	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-6.5064	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-6.5064	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-6.5064	-9.9658	-9.9658	-4.2934	-9.9658	-9.9658	-9.9658	-9.9658
ENSG00000167578.16	3.5572	5.2308	4.6077	5.5741	4.080749	3.2018	6.1010	3.1540	4.5098	5.6767	5.7337	5.8094	3.7367	3.4739	4.8125	3.8954	5.2972	4.1700	3.1028	4.8485	5.1223	5.2235	4.8440	6.3684	4.8299	4.3169	5.7307	4.7539	4.0636	6.1067	4.6916	3.9487	4.6854	4.0575	5.6186	4.1310	3.1129	2.9966	5.2430	4.5772	...	4.5324	4.9294	3.2557	5.3917	5.1052	4.8033	5.5799	5.0009	5.6930	4.5367	5.5571	4.6759	3.2988	5.7052	4.3903	5.7391	3.5299	3.2174	4.8827	4.0488	3.8808	5.5568	5.8390	4.2943	4.4647	4.6764	4.4310	4.7735	5.5503	3.7475	4.2072	4.6960	4.4324	4.5311	4.3541	3.7730	5.6846	5.2817	4.0260	3.0876
ENSG00000278814.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
ENSG00000273233.1	-9.9658	-3.4580	-9.9658	-9.9658	-4.442233	-9.9658	-9.9658	-9.9658	-2.0529	-2.7274	-9.9658	-2.5479	-3.4580	-9.9658	-9.9658	-4.0350	-3.6259	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-3.6259	-3.4580	-9.9658	-3.6259	-2.8262	-9.9658	-1.9379	-9.9658	-3.3076	-3.3076	-9.9658	-2.7274	-9.9658	-9.9658	-9.9658	-1.9942	-9.9658	...	-9.9658	-9.9658	-9.9658	-2.4659	-9.9658	-2.3884	-4.0350	-2.9324	-9.9658	-9.9658	-3.4580	-9.9658	-9.9658	-3.1714	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-3.3076	-2.4659	-9.9658	-9.9658	-9.9658	-3.8160	-2.8262	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-2.5479	-9.9658	-2.6349	-9.9658	-9.9658
ENSG00000105063.18	3.4358	6.5907	4.6317	7.8058	5.034368	5.2365	6.0885	2.2175	4.9346	5.3906	5.1416	5.7926	4.9393	2.9356	4.5154	4.8832	7.1736	4.6247	2.8301	4.3855	5.7978	5.4300	5.2211	5.0879	5.6574	4.8435	6.5130	5.2825	5.7211	5.4050	6.0856	3.4277	5.0972	5.1157	5.9709	3.8591	2.8681	4.2366	4.9901	4.2699	...	5.2169	3.9974	3.2677	6.4069	4.6236	4.7507	5.1930	5.1343	5.5236	4.5571	6.1048	5.5448	3.9289	5.7727	4.5820	4.9069	2.9857	2.7551	5.9905	4.6106	3.4778	5.8316	5.1575	6.0275	4.6053	5.5871	4.9141	4.6815	5.6848	4.2189	3.3549	5.0900	4.0884	4.5892	4.7735	5.7247	6.5297	6.0788	4.4647	3.2174
ENSG00000231119.2	-2.5479	-9.9658	-6.5064	-6.5064	-1.910505	-2.4659	-1.4305	-1.8836	-3.1714	-2.1779	-5.0116	-4.2934	-4.2934	-3.0469	-2.4659	-1.3548	-1.9942	-4.0350	-4.0350	-2.1779	-2.7274	-2.8262	-1.8314	-3.8160	-2.3147	-3.3076	-2.4659	-9.9658	-3.4580	-0.7108	-5.5735	-3.6259	-3.1714	-4.2934	-6.5064	-5.0116	-3.8160	-2.5479	-5.5735	-2.9324	...	-5.0116	-1.7809	-2.5479	-4.6082	-9.9658	-5.0116	-3.8160	-2.6349	-9.9658	-4.0350	-3.6259	-2.7274	-3.6259	-5.0116	-4.0350	-1.0559	-4.2934	-5.0116	-4.2934	-2.6349	-2.9324	-1.0862	-5.5735	-9.9658	-2.6349	-1.6850	-1.5522	-3.6259	-3.8160	-3.3076	-1.0262	-5.0116	-3.3076	-2.9324	-5.0116	-4.2934	-3.3076	-9.9658	-2.3884	-3.0469
ENSG00000280861.1	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658
ENSG00000181518.3	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-5.5735	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658

60498 rows × 936 columns

Now that we have our matrix we are ready to start the correlation

Spearman Correlation¶

The input for an analysis where a patient sample is being compared to the TCGA cohort of interest would typically be an expression matrix using the same normalization method as the comparator cohort. Note that differences in library preparation, sequencing and bioinformatics pipelines can lead to variation in expression quantification. It is important to understand the impact these differences have on expression quantification before comparing expression values that are produced using different pipelines.
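As one concrete case, the values in the matrix above bottom out at -9.9658, which equals log2(0.001). That is consistent with the matrix being stored as log2(TPM + 0.001), although this tutorial does not state the unit, so treat it as an assumption to confirm against the data set description. Under that assumption, putting linear-scale TPM values from another pipeline onto the same scale could be sketched as follows. The function name and the example values are illustrative.

```python
import numpy as np
import pandas as pd

# Assumption: the comparator matrix is log2(TPM + 0.001).
PSEUDOCOUNT = 0.001


def to_log2_tpm(tpm):
    """Convert linear-scale TPM values to log2(TPM + pseudocount)."""
    return np.log2(tpm + PSEUDOCOUNT)


# Invented linear-scale TPM values for three genes of a new sample.
new_sample_tpm = pd.Series({'geneA': 0.0, 'geneB': 1.0, 'geneC': 250.0})
new_sample_log = to_log2_tpm(new_sample_tpm)
print(new_sample_log.round(4))
```

Because Spearman correlation depends only on ranks, a monotonic transform such as this one does not change it. PCA, by contrast, is sensitive to the scale of the input values.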

For the purposes of this tutorial we are going to choose one of the samples from our expression matrix to use as our input. We arbitrarily choose the first sample, which, as the metadata lookup below shows, belongs to the LIHC cohort.

test_sample_id = str(exp_matrix.columns[0])
meta_df[meta_df['sample'] == test_sample_id]

	sample	_PATIENT	cancer type abbreviation	age_at_initial_pathologic_diagnosis	gender	race	ajcc_pathologic_tumor_stage	clinical_stage	histological_type	histological_grade	initial_pathologic_dx_year	menopause_status	birth_days_to	vital_status	tumor_status	last_contact_days_to	death_days_to	cause_of_death	new_tumor_event_type	new_tumor_event_site	new_tumor_event_site_other	new_tumor_event_dx_days_to	treatment_outcome_first_course	margin_status	residual_tumor	OS	OS.time	DSS	DSS.time	DFI	DFI.time	PFI	PFI.time	Redaction
6583	TCGA-G3-A3CH-11	TCGA-G3-A3CH	LIHC	53.0	MALE	ASIAN	Stage IIIA	NaN	Hepatocellular Carcinoma	G2	2010.0	NaN	-19473.0	Alive	WITH TUMOR	780.0	NaN	NaN	Intrahepatic Recurrence	Liver	NaN	116.0	NaN	NaN	R0	0.0	780.0	0.0	780.0	1.0	116.0	1.0	116.0	NaN

Now we compute the correlation of this sample with all other samples in the matrix excluding itself

corr = exp_matrix[[c for c in exp_matrix.columns.tolist() if c != test_sample_id]].corrwith(
  exp_matrix[test_sample_id], method='spearman'
).to_frame()
corr = corr.rename(columns={ corr.columns[0]: "correlation" }).reset_index().rename(columns={'index': 'sample'})
corr

	sample	correlation
0	TCGA-RP-A695-06	0.840728
1	TCGA-DD-AAW0-01	0.901187
2	TCGA-EE-A17X-06	0.825520
3	TCGA-DD-AACA-02	0.871860
4	TCGA-DD-AACA-01	0.883324
...	...	...
930	TCGA-EE-A2GN-06	0.838558
931	TCGA-D9-A4Z6-06	0.824104
932	TCGA-EE-A29L-06	0.817158
933	TCGA-DD-A115-01	0.885494
934	TCGA-FV-A3I0-11	0.908514

935 rows × 2 columns
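Spearman correlation is the Pearson correlation computed on ranks, which is why it is insensitive to monotonic rescaling of the expression values. The sketch below checks this equivalence on a small invented matrix. The sample names and values are placeholders, not TCGA data.

```python
import pandas as pd

# Invented expression values: rows are genes, columns are samples.
toy = pd.DataFrame(
    {
        'query': [0.1, 2.0, 5.5, 3.2, 9.0],
        'S1': [0.3, 1.5, 6.0, 2.9, 20.0],
        'S2': [9.0, 4.0, 0.2, 3.0, 0.5],
    },
    index=['g1', 'g2', 'g3', 'g4', 'g5'],
)

others = toy.drop(columns='query')

# Spearman correlation of every other sample with the query sample.
spearman = others.corrwith(toy['query'], method='spearman')

# The same quantity obtained by ranking each column, then using Pearson.
pearson_on_ranks = others.rank().corrwith(toy['query'].rank(), method='pearson')

print(spearman)
print(pearson_on_ranks)
```

S1 orders the genes exactly as the query does, so its correlation is 1 even though the raw values differ.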

Merge that with the metadata to group the correlations into their respective cohorts for plotting

corr = corr.merge(meta_df[['sample', 'cancer type abbreviation']], on=['sample'], how='left')
corr

	sample	correlation	cancer type abbreviation
0	TCGA-RP-A695-06	0.840728	SKCM
1	TCGA-DD-AAW0-01	0.901187	LIHC
2	TCGA-EE-A17X-06	0.825520	SKCM
3	TCGA-DD-AACA-02	0.871860	LIHC
4	TCGA-DD-AACA-01	0.883324	LIHC
...	...	...	...
930	TCGA-EE-A2GN-06	0.838558	SKCM
931	TCGA-D9-A4Z6-06	0.824104	SKCM
932	TCGA-EE-A29L-06	0.817158	SKCM
933	TCGA-DD-A115-01	0.885494	LIHC
934	TCGA-FV-A3I0-11	0.908514	LIHC

935 rows × 3 columns
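A left merge keeps every row of the correlation table even when a sample has no metadata entry. In that case the cohort comes back as NaN, and the sample would silently drop out of a grouped plot. Here is a small sketch of how one might check for this, using invented sample IDs. The indicator=True option of merge adds a "_merge" column recording where each row came from.

```python
import pandas as pd

# Invented correlation results and metadata; 'S3' has no metadata row.
toy_corr = pd.DataFrame(
    {'sample': ['S1', 'S2', 'S3'], 'correlation': [0.91, 0.84, 0.88]}
)
toy_meta = pd.DataFrame(
    {'sample': ['S1', 'S2'], 'cancer type abbreviation': ['LIHC', 'SKCM']}
)

merged = toy_corr.merge(toy_meta, on='sample', how='left', indicator=True)

# Rows marked 'left_only' exist in the correlations but not in the metadata.
unmatched = merged.loc[merged['_merge'] == 'left_only', 'sample'].tolist()
print(unmatched)
```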

Now plot the correlations

fig = sns.catplot(kind='box', data=corr, x='correlation', y='cancer type abbreviation')
title = test_sample_id + ' (' + meta_df[meta_df['sample'] == test_sample_id].iloc[0]['cancer type abbreviation'] +')'
fig.set(xlabel='spearman correlation', ylabel='', title=title)

<seaborn.axisgrid.FacetGrid at 0x7f3bc4d89350>

As expected, the average pairwise correlation is highest within the same disease cohort as the sample
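The same comparison can be made numerically. The sketch below summarizes the distribution per cohort on invented correlation values. Applied to the corr table above, it would use the same column names.

```python
import pandas as pd

# Invented correlations for illustration only.
toy_corr = pd.DataFrame(
    {
        'correlation': [0.90, 0.88, 0.92, 0.83, 0.82, 0.85],
        'cancer type abbreviation': ['LIHC', 'LIHC', 'LIHC', 'SKCM', 'SKCM', 'CHOL'],
    }
)

summary = (
    toy_corr.groupby('cancer type abbreviation')['correlation']
    .agg(['count', 'median', 'mean'])
    .sort_values('median', ascending=False)
)
print(summary)

# Cohort whose samples are most similar to the query sample, by median.
closest_cohort = summary.index[0]
```

Reporting the count next to the median is useful because a small cohort such as CHOL contributes far fewer comparisons than the larger ones.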

Principal Component Analysis (PCA)¶

Principal component analysis serves a similar function to the pairwise Spearman correlation plot above and could be included in place of that plot if preferred. For this we will need to install another Python library.

!pip install scikit-learn

Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (0.22.2.post1)
Requirement already satisfied: scipy>=0.17.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.4.1)
Requirement already satisfied: numpy>=1.11.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.19.5)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.7/dist-packages (from scikit-learn) (1.0.1)

Now we are ready to set up the PCA model.

from sklearn.decomposition import PCA

pca = PCA()

Our input to the PCA model is a matrix with each sample as a row. Since in our current matrix the genes are rows we need to transpose this

X = exp_matrix.T
X

gene	ENSG00000242268.2	ENSG00000259041.1	ENSG00000270112.3	ENSG00000167578.16	ENSG00000278814.1	ENSG00000078237.5	ENSG00000269416.5	ENSG00000263642.1	ENSG00000146083.11	ENSG00000158486.13	ENSG00000273639.4	ENSG00000198242.13	ENSG00000231981.3	ENSG00000269475.2	ENSG00000134108.12	ENSG00000261030.1	ENSG00000172137.18	ENSG00000276644.4	ENSG00000240423.1	ENSG00000271616.1	ENSG00000234881.1	ENSG00000236040.1	ENSG00000231105.1	ENSG00000094963.13	ENSG00000182141.9	ENSG00000280143.1	ENSG00000251334.2	ENSG00000231112.1	ENSG00000258610.1	ENSG00000264981.1	ENSG00000275265.1	ENSG00000185105.4	ENSG00000233540.1	ENSG00000102174.8	ENSG00000166391.14	ENSG00000232001.1	ENSG00000270469.1	ENSG00000225275.4	ENSG00000234253.1	ENSG00000070087.13	...	ENSG00000279778.1	ENSG00000223671.2	ENSG00000263573.1	ENSG00000222213.1	ENSG00000214124.3	ENSG00000206836.1	ENSG00000233845.1	ENSG00000066044.13	ENSG00000264491.1	ENSG00000146587.17	ENSG00000278151.1	ENSG00000228658.1	ENSG00000173930.8	ENSG00000274396.1	ENSG00000107863.16	ENSG00000199892.2	ENSG00000221760.1	ENSG00000253333.1	ENSG00000213782.7	ENSG00000146707.14	ENSG00000212084.2	ENSG00000248838.2	ENSG00000255083.1	ENSG00000158417.10	ENSG00000223665.1	ENSG00000203729.8	ENSG00000238300.1	ENSG00000221756.1	ENSG00000089177.17	ENSG00000186115.12	ENSG00000009694.13	ENSG00000238244.3	ENSG00000216352.1	ENSG00000123685.8	ENSG00000267117.1	ENSG00000273233.1	ENSG00000105063.18	ENSG00000231119.2	ENSG00000280861.1	ENSG00000181518.3
TCGA-G3-A3CH-11	-9.9658	-9.9658	-9.9658	3.557200	-9.9658	0.099000	-9.965800	-9.9658	1.832300	-9.96580	-9.9658	8.372500	-9.9658	-9.9658	3.502200	-9.9658	-9.965800	-1.148800	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	-1.086200	-1.117200	-1.214200	-9.9658	-9.9658	0.84880	-9.9658	-3.625900	-9.965800	-9.965800	-5.573500	5.737700	-9.9658	-9.9658	-9.9658	-9.9658	2.150900	...	-6.5064	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	3.408800	-9.965800	0.099000	-9.9658	-9.9658	0.6969	-9.9658	2.111400	-9.9658	-9.9658	-9.9658	3.570600	0.566600	-9.9658	-9.9658	-9.9658	3.149100	-9.9658	-9.9658	-9.9658	-9.9658	1.590200	6.864000	-1.7809	-9.9658	-9.9658	0.264200	-3.6259	-9.965800	3.435800	-2.547900	-9.9658	-9.9658
TCGA-RP-A695-06	-9.9658	-9.9658	-9.9658	5.230800	-9.9658	3.092700	-9.965800	-9.9658	3.201800	-3.45800	-9.9658	10.228500	-9.9658	-9.9658	5.436700	-9.9658	-1.994200	-5.573500	-9.9658	-9.9658	-9.9658	-9.9658	-5.011600	-2.465900	-0.284500	0.566600	-9.9658	-9.9658	1.18330	-9.9658	-9.965800	-6.506400	-9.965800	-3.046900	-9.965800	-9.9658	-9.9658	-9.9658	-9.9658	6.128700	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	5.601400	-9.965800	1.580600	-9.9658	-9.9658	-4.6082	-9.9658	3.502200	-9.9658	-9.9658	-9.9658	4.384800	3.737800	-9.9658	-9.9658	-9.9658	4.073000	-9.9658	-9.9658	-9.9658	-9.9658	2.451800	-4.293400	-6.5064	-9.9658	-9.9658	1.316700	-9.9658	-3.458000	6.590700	-9.965800	-9.9658	-9.9658
TCGA-DD-AAW0-01	-9.9658	-9.9658	-9.9658	4.607700	-9.9658	2.646400	-4.293400	-9.9658	3.808500	-4.29340	-9.9658	8.705700	-9.9658	-9.9658	4.635200	-9.9658	-9.965800	-1.780900	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	0.783200	0.215400	0.444700	-9.9658	-9.9658	1.48590	-9.9658	-9.965800	-5.573500	-9.965800	-5.011600	5.382700	-9.9658	-9.9658	-9.9658	-9.9658	1.196000	...	-4.6082	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	4.635800	-9.965800	1.339700	-9.9658	-9.9658	-4.0350	-9.9658	3.489400	-9.9658	-9.9658	-9.9658	4.165200	1.566100	-9.9658	-9.9658	-4.6082	3.956100	-9.9658	-9.9658	-9.9658	-9.9658	2.186200	7.373500	-5.5735	-9.9658	-9.9658	0.001400	-9.9658	-9.965800	4.631700	-6.506400	-9.9658	-9.9658
TCGA-EE-A17X-06	-9.9658	-1.6850	-9.9658	5.574100	-9.9658	3.165300	-9.965800	-9.9658	2.582800	-6.50640	-9.9658	10.009000	-9.9658	-9.9658	4.784600	-9.9658	-2.314700	-4.608200	-9.9658	-9.9658	-9.9658	-9.9658	-5.011600	-2.314700	-0.338300	-0.249800	-9.9658	-9.9658	0.64250	-9.9658	-9.965800	-9.965800	-0.997100	-6.506400	-9.965800	-9.9658	-9.9658	-9.9658	-9.9658	5.923400	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	6.418900	-9.965800	1.322500	-9.9658	-9.9658	-5.5735	-9.9658	4.085800	-9.9658	-9.9658	-9.9658	5.278400	4.536100	-9.9658	-9.9658	-9.9658	5.495700	-9.9658	-9.9658	-9.9658	-9.9658	2.205100	-3.816000	-9.9658	-9.9658	-9.9658	-0.687300	-3.6259	-9.965800	7.805800	-6.506400	-9.9658	-9.9658
TCGA-DD-AACA-02	-9.9658	-9.9658	-9.9658	4.080749	-9.9658	-2.114057	0.407079	-9.9658	3.499608	-3.62591	-9.9658	10.322895	-9.9658	-9.9658	4.162807	-9.9658	-4.442233	-3.921375	-9.9658	-9.9658	-9.9658	-9.9658	-5.965822	-3.717877	0.276236	-0.619235	-9.9658	-9.9658	0.79993	-9.9658	-3.380826	-2.775961	-4.293386	-7.380866	1.543998	-9.9658	-9.9658	-9.9658	-9.9658	2.055201	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	4.657673	-3.625967	0.856772	-9.9658	-9.9658	-9.9658	-9.9658	3.205689	-9.9658	-9.9658	-9.9658	4.628535	2.168608	-9.9658	-9.9658	-9.9658	4.183576	-9.9658	-9.9658	-9.9658	-9.9658	1.331141	6.773435	-9.9658	-9.9658	-9.9658	-0.641625	-9.9658	-4.442233	5.034368	-1.910505	-9.9658	-9.9658
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
TCGA-EE-A2GN-06	-9.9658	-9.9658	-4.2934	3.773000	-9.9658	3.740000	-2.826200	-9.9658	4.802200	-2.46590	-9.9658	11.006800	-9.9658	-9.9658	5.737700	-9.9658	-2.314700	-3.046900	-9.9658	-9.9658	-9.9658	-9.9658	-3.307600	-1.469900	0.723300	0.888300	-9.9658	-9.9658	1.88790	-9.9658	-9.965800	-3.046900	-9.965800	-3.046900	-9.965800	-9.9658	-9.9658	-9.9658	-9.9658	4.970900	...	-3.4580	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	5.912400	-9.965800	2.694000	-9.9658	-9.9658	-6.5064	-9.9658	3.270700	-9.9658	-9.9658	-0.7108	5.218000	4.586200	-9.9658	-9.9658	-9.9658	5.684000	-9.9658	-3.8160	-9.9658	-9.9658	3.132700	-6.506400	-3.6259	-9.9658	-9.9658	1.526600	-3.6259	-2.547900	5.724700	-4.293400	-9.9658	-9.9658
TCGA-D9-A4Z6-06	-9.9658	-9.9658	-9.9658	5.684600	-9.9658	3.996500	-6.506400	-9.9658	4.482900	-2.05290	-9.9658	10.159600	-9.9658	-9.9658	4.879800	-9.9658	-0.284500	-2.932400	-9.9658	-9.9658	-9.9658	-9.9658	-4.293400	-1.639400	1.098300	-0.087700	-9.9658	-9.9658	1.26960	-9.9658	-2.177900	-5.011600	-9.965800	-3.307600	-9.965800	-9.9658	-4.2934	-9.9658	-9.9658	6.171300	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	6.079200	-9.965800	2.018300	-9.9658	-9.9658	-4.0350	-9.9658	2.912800	-9.9658	-9.9658	-1.0262	5.615300	3.476500	-9.9658	-9.9658	-9.9658	4.551600	-9.9658	-9.9658	-9.9658	-9.9658	2.498500	-9.965800	-9.9658	-9.9658	-9.9658	0.368500	-3.1714	-9.965800	6.529700	-3.307600	-9.9658	-9.9658
TCGA-EE-A29L-06	-4.6082	-9.9658	-9.9658	5.281700	-9.9658	4.370200	-6.506400	-9.9658	3.438400	-9.96580	-9.9658	10.427600	-9.9658	-9.9658	6.153200	-9.9658	-5.573500	-5.011600	-9.9658	-9.9658	-9.9658	-2.1779	-6.506400	-1.181100	2.046500	-1.430500	-9.9658	-9.9658	-0.43250	-9.9658	-2.727400	-4.035000	-2.388400	-6.506400	-9.965800	-9.9658	-5.0116	-9.9658	-9.9658	7.421200	...	-6.5064	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	5.801700	-9.965800	3.513600	-9.9658	-9.9658	-5.5735	-9.9658	4.181200	-9.9658	-9.9658	-9.9658	5.083300	3.028700	-9.9658	-9.9658	-9.9658	6.211000	-9.9658	-9.9658	-9.9658	-9.9658	2.985700	-9.965800	-6.5064	-9.9658	-9.9658	2.070700	-9.9658	-2.634900	6.078800	-9.965800	-9.9658	-9.9658
TCGA-DD-A115-01	-9.9658	-9.9658	-9.9658	4.026000	-9.9658	2.154100	-1.318300	-9.9658	3.260200	-4.03500	-3.8160	9.137500	-9.9658	-4.6082	4.263100	-9.9658	-5.573500	-3.307600	-9.9658	-9.9658	-9.9658	-9.9658	-5.573500	-0.042500	-0.375200	0.240000	-9.9658	-9.9658	2.28130	-9.9658	-9.965800	-9.965800	-2.932400	-5.573500	2.899400	-9.9658	-9.9658	-9.9658	-9.9658	2.716100	...	-1.1172	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	4.764500	-9.965800	0.749300	-9.9658	-9.9658	-1.7322	-9.9658	3.411600	-9.9658	-9.9658	-9.9658	3.997400	1.164100	-9.9658	-9.9658	-9.9658	3.720400	-9.9658	-9.9658	-9.9658	-9.9658	1.820100	6.673000	-2.1779	-9.9658	-9.9658	0.311500	-9.9658	-9.965800	4.464700	-2.388400	-9.9658	-9.9658
TCGA-FV-A3I0-11	-9.9658	-9.9658	-9.9658	3.087600	-9.9658	-0.575600	-9.965800	-9.9658	1.692000	-9.96580	-9.9658	8.292000	-9.9658	-9.9658	3.569400	-9.9658	-5.011600	-2.388400	-9.9658	-9.9658	-9.9658	-9.9658	-9.965800	-1.595100	-1.994200	-0.687300	-9.9658	-9.9658	-0.32010	-9.9658	-9.965800	-9.965800	-9.965800	-6.506400	5.202100	-9.9658	-9.9658	-9.9658	-9.9658	3.309000	...	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	-9.9658	3.577900	-9.965800	0.357300	-9.9658	-9.9658	0.0158	-9.9658	2.160600	-9.9658	-9.9658	-9.9658	3.577900	0.774800	-9.9658	-9.9658	-9.9658	3.082500	-9.9658	-9.9658	-9.9658	-9.9658	-0.249800	7.026400	-2.2447	-9.9658	-9.9658	0.614500	-9.9658	-9.965800	3.217400	-3.046900	-9.9658	-9.9658

936 rows × 60498 columns

Now we are ready to fit the PCA

fit = pca.fit(X)
fit

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Tabulate the variance explained by each component, then plot it for any component that explains at least 0.5% of the variance.

components_df = pd.DataFrame(enumerate(fit.explained_variance_ratio_), columns=['component', 'variance_explained'])
components_df['cum_var'] = components_df.variance_explained.cumsum()
components_df

	component	variance_explained	cum_var
0	0	2.098103e-01	0.209810
1	1	4.941317e-02	0.259223
2	2	2.684464e-02	0.286068
3	3	2.096802e-02	0.307036
4	4	1.821842e-02	0.325255
...	...	...	...
931	931	2.724521e-04	0.999199
932	932	2.708955e-04	0.999470
933	933	2.679152e-04	0.999738
934	934	2.619554e-04	1.000000
935	935	2.545080e-28	1.000000

936 rows × 3 columns
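The final component in the table explains essentially none of the variance (about 2.5e-28). This is expected: PCA centers each gene by subtracting its mean across samples, so a matrix with n samples has at most n - 1 directions of non-zero variance. The sketch below reproduces this with plain NumPy on random data, computing the explained variance ratios from the singular values of the centered matrix. The data and variable names are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented data: 6 samples (rows) by 50 features (columns), so there are
# more features than samples, as in the expression matrix.
toy_X = rng.normal(size=(6, 50))

# Center each feature (column) on its mean across samples.
centered = toy_X - toy_X.mean(axis=0)

# Singular values of the centered matrix give the component variances.
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance = singular_values ** 2 / (toy_X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio)
```

With 6 samples, only the first 5 ratios are meaningfully above zero, mirroring the 935 informative components out of 936 in the table above.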

fig = sns.relplot(data=components_df[components_df.variance_explained >= 0.005], x='component', y='variance_explained', kind='line')
fig.set(xlabel='Component #', ylabel='Proportion of Variance Explained')

<seaborn.axisgrid.FacetGrid at 0x7f3bc2ef49d0>

If we want to see the cumulative variance explained as we keep more components, we can plot that as well.

fig = sns.relplot(data=components_df, x='component', y='cum_var', kind='line')
fig.set(xlabel='Number of Components', ylabel='Cumulative Proportion of Variance Explained')

<seaborn.axisgrid.FacetGrid at 0x7f3bc2ec0450>
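A cumulative curve like this is often used to decide how many components to keep. The sketch below reads that number off programmatically, assuming a hypothetical target of 90% of the variance and invented variance ratios. With the real data one would pass fit.explained_variance_ratio_ instead.

```python
import numpy as np


def components_for_variance(variance_ratio, target):
    """Smallest number of leading components whose cumulative ratio >= target."""
    cumulative = np.cumsum(variance_ratio)
    # searchsorted returns the first index where cumulative >= target;
    # add 1 to turn a zero-based index into a count of components.
    count = int(np.searchsorted(cumulative, target, side='left')) + 1
    # Guard against a target that rounding keeps just out of reach.
    return min(count, len(cumulative))


# Invented ratios that sum to 1.0.
toy_ratios = np.array([0.5, 0.25, 0.125, 0.0625, 0.0625])
print(components_for_variance(toy_ratios, 0.90))
```

The 90% threshold is only an example. The appropriate cut-off depends on what the reduced representation will be used for.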

To visualize this in 2-D we plot only the first two components.

pca_xy = PCA(n_components=2)
fit_xy = pd.DataFrame(pca_xy.fit_transform(X), index=X.index, columns=['component 1', 'component 2'])
fit_xy

	component 1	component 2
TCGA-G3-A3CH-11	313.877223	58.038358
TCGA-RP-A695-06	-178.026704	152.348532
TCGA-DD-AAW0-01	205.633995	-104.460686
TCGA-EE-A17X-06	-185.690333	172.857407
TCGA-DD-AACA-02	226.342307	-60.439144
...	...	...
TCGA-EE-A2GN-06	-253.444806	21.297225
TCGA-D9-A4Z6-06	-243.644653	27.199689
TCGA-EE-A29L-06	-220.859424	123.556796
TCGA-DD-A115-01	250.808496	-44.392051
TCGA-FV-A3I0-11	375.594610	127.824032

936 rows × 2 columns
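Because the components are ordered by explained variance, refitting with n_components=2 should give the same coordinates as taking the first two columns of the full model's projection, up to a possible sign flip per component. The sketch below checks this on random data, assuming scikit-learn is installed as above. It fixes svd_solver='full' so that both fits use the same exact solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
toy_X = rng.normal(size=(20, 8))  # invented data: 20 samples, 8 features

full_scores = PCA(svd_solver='full').fit_transform(toy_X)
two_scores = PCA(n_components=2, svd_solver='full').fit_transform(toy_X)

# Compare absolute values so a sign flip of a component does not matter.
same = np.allclose(np.abs(full_scores[:, :2]), np.abs(two_scores))
print(same)
```

With the default svd_solver='auto', scikit-learn may pick a randomized solver for a two-component fit on a large matrix, so on the real data the agreement would be approximate rather than exact.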

Now join this back to the metadata to get the groupings

fig_df = fit_xy.reset_index().rename(columns={'index': 'sample'}).merge(meta_df, on=['sample'], how='left')
fig_df[['sample', 'component 1', 'component 2', 'cancer type abbreviation']]

	sample	component 1	component 2	cancer type abbreviation
0	TCGA-G3-A3CH-11	313.877223	58.038358	LIHC
1	TCGA-RP-A695-06	-178.026704	152.348532	SKCM
2	TCGA-DD-AAW0-01	205.633995	-104.460686	LIHC
3	TCGA-EE-A17X-06	-185.690333	172.857407	SKCM
4	TCGA-DD-AACA-02	226.342307	-60.439144	LIHC
...	...	...	...	...
931	TCGA-EE-A2GN-06	-253.444806	21.297225	SKCM
932	TCGA-D9-A4Z6-06	-243.644653	27.199689	SKCM
933	TCGA-EE-A29L-06	-220.859424	123.556796	SKCM
934	TCGA-DD-A115-01	250.808496	-44.392051	LIHC
935	TCGA-FV-A3I0-11	375.594610	127.824032	LIHC

936 rows × 4 columns

Add a flag column for our sample of interest

fig_df['target sample'] = fig_df['sample'] == test_sample_id
fig_df[['sample', 'component 1', 'component 2', 'cancer type abbreviation', 'target sample']]

	sample	component 1	component 2	cancer type abbreviation	target sample
0	TCGA-G3-A3CH-11	313.877223	58.038358	LIHC	True
1	TCGA-RP-A695-06	-178.026704	152.348532	SKCM	False
2	TCGA-DD-AAW0-01	205.633995	-104.460686	LIHC	False
3	TCGA-EE-A17X-06	-185.690333	172.857407	SKCM	False
4	TCGA-DD-AACA-02	226.342307	-60.439144	LIHC	False
...	...	...	...	...	...
931	TCGA-EE-A2GN-06	-253.444806	21.297225	SKCM	False
932	TCGA-D9-A4Z6-06	-243.644653	27.199689	SKCM	False
933	TCGA-EE-A29L-06	-220.859424	123.556796	SKCM	False
934	TCGA-DD-A115-01	250.808496	-44.392051	LIHC	False
935	TCGA-FV-A3I0-11	375.594610	127.824032	LIHC	False

936 rows × 5 columns

fig = sns.relplot(kind='scatter', data=fig_df, x='component 1', y='component 2', hue='cancer type abbreviation')

Add an annotation to pick out our current sample in the PCA plot. To accomplish this we draw a second scatter plot, containing only that sample, over the top of the original.

ax = sns.scatterplot(data=fig_df, x='component 1', y='component 2', hue='cancer type abbreviation', alpha=0.5)
ax = sns.scatterplot(data=fig_df[fig_df['sample'] == test_sample_id], x='component 1', y='component 2', marker='X', alpha=1, ax=ax, color='black', s=300)

As we saw with the correlation plot, the LIHC sample is grouped with its disease cohort