Multi-omics integration (TCGA-BRCA)

Overview

This repository contains a single, end-to-end Python notebook that integrates three multi-omics profiles from The Cancer Genome Atlas (TCGA) Breast Cancer (BRCA) cohort: gene expression (RNA-seq), DNA methylation (450K), and copy-number variation (CNV). The goal is to discover unsupervised patient groups and interpret them biologically.

Links - GitHub
Status: Complete

Why this project

Project goals:

Build a clean, reproducible multi-omics integration workflow in Python
Identify molecularly distinct tumor groups using unsupervised learning
Provide biologically grounded interpretation using canonical BRCA markers, CNV driver loci, and pathway enrichment
Demonstrate practical skills in data harmonization, preprocessing, integration, clustering, and interpretation

Highlights

Multi-omics data handling (RNA + methylation + CNV)
Sample harmonization with TCGA barcodes
Robust preprocessing and missing-value handling
Dimensionality reduction (PCA) and integration (early integration of PCs)
Unsupervised clustering with model selection (silhouette score)
Biological interpretation via:
- canonical marker genes,
- CNV driver loci inspection,
- pathway enrichment (Hallmark)
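
The harmonization, integration, and clustering steps listed above can be sketched on synthetic data. This is not the notebook's code: the helper names (`sample_id`, `harmonize`, `integrate`, `pick_k`), the fake barcodes, the number of PCs, the choice of k-means, and the range of k are all assumptions made for illustration. The sketch assumes NumPy and scikit-learn are installed, and it skips missing-value handling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def sample_id(barcode):
    """Truncate a full TCGA barcode to its 15-character sample-level ID."""
    return barcode[:15]


def harmonize(layers):
    """Keep only samples present in every layer, in one shared order.

    `layers` maps a layer name to (barcodes, matrix); matrix rows follow
    the order of barcodes (samples x features).
    """
    ids = {name: [sample_id(b) for b in bcs] for name, (bcs, _) in layers.items()}
    shared = sorted(set.intersection(*(set(v) for v in ids.values())))
    aligned = {}
    for name, (_, matrix) in layers.items():
        row_of = {}
        for i, s in enumerate(ids[name]):
            row_of.setdefault(s, i)  # keep the first aliquot per sample
        aligned[name] = matrix[[row_of[s] for s in shared]]
    return shared, aligned


def integrate(aligned, n_pcs=5):
    """Early integration: scale each layer, reduce with PCA, concatenate PCs."""
    blocks = []
    for matrix in aligned.values():
        scaled = StandardScaler().fit_transform(matrix)
        blocks.append(PCA(n_components=n_pcs, random_state=0).fit_transform(scaled))
    return np.hstack(blocks)


def pick_k(embedding, k_range=range(2, 7)):
    """Return the k with the highest mean silhouette score, plus all scores."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embedding)
        scores[k] = silhouette_score(embedding, labels)
    return max(scores, key=scores.get), scores


# Synthetic stand-in data: 60 made-up patients in 3 hidden groups.
rng = np.random.default_rng(0)
patients = [f"TCGA-XX-{i:04d}" for i in range(60)]
group = np.repeat([0, 1, 2], 20)


def fake_layer(suffix, n_features, drop):
    """Build one fake omics layer that is missing every 10th sample."""
    keep = [i for i in range(60) if i % 10 != drop]
    barcodes = [f"{patients[i]}-01A-{suffix}" for i in keep]
    centers = rng.normal(scale=3.0, size=(3, n_features))
    matrix = centers[group[keep]] + rng.normal(size=(len(keep), n_features))
    return barcodes, matrix


layers = {
    "rna": fake_layer("11R", 40, drop=0),
    "methylation": fake_layer("11D", 30, drop=1),
    "cnv": fake_layer("01D", 20, drop=2),
}

shared, aligned = harmonize(layers)
embedding = integrate(aligned)
best_k, scores = pick_k(embedding)
print(f"{len(shared)} shared samples, embedding shape {embedding.shape}, best k = {best_k}")
```

Truncating barcodes to 15 characters keeps the sample-type code (for example, 01 for a primary tumor), so tumor and normal samples from the same patient stay distinct. Truncating to 12 characters would instead match at the patient level.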

Results

Figures
Figure 1: CNV of key genes by cluster
Figure 2: Integrated multi-omics embedding (colored by cluster)
Figure 3: Marker gene expression by multi-omics cluster
Figure 4: Overall survival by multi-omics cluster (TCGA-BRCA)

How to run

git clone https://github.com/Anwesha19-prog/TCGA-BRCA-Multi-Omics-Intergration
cd TCGA-BRCA-Multi-Omics-Intergration
python -m pip install -r requirements.txt
jupyter notebook Project-3.ipynb