Overview

This repository contains a single end-to-end Python notebook that integrates three multi-omics profiles from The Cancer Genome Atlas (TCGA) Breast Cancer (BRCA) cohort: gene expression (RNA-seq), DNA methylation (450K), and copy-number variation (CNV). The notebook uses these profiles to discover patient groups by unsupervised clustering and to interpret those groups biologically.

  • Links: GitHub (https://github.com/Anwesha19-prog/TCGA-BRCA-Multi-Omics-Intergration)
  • Status: Complete

Why this project

Project goals:

  • Build a clean, reproducible multi-omics integration workflow in Python
  • Identify molecularly distinct tumor groups using unsupervised learning
  • Provide biologically grounded interpretation using canonical BRCA markers, CNV driver loci, and pathway enrichment
  • Demonstrate practical skills in data harmonization, preprocessing, integration, clustering, and interpretation

Highlights

  • Multi-omics data handling (RNA + methylation + CNV)
  • Sample harmonization with TCGA barcodes
  • Robust preprocessing and missing-value handling
  • Dimensionality reduction (PCA) and integration (early integration of PCs)
  • Unsupervised clustering with model selection (silhouette)
  • Biological interpretation via:

    • canonical marker genes,
    • CNV driver loci inspection,
    • pathway enrichment (Hallmark)
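
The harmonization, PCA, early-integration, and silhouette-based clustering steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. The function names (`to_sample_id`, `harmonize`, `embed_block`, `cluster_integrated`) are hypothetical. The choices of median imputation, k-means, truncating barcodes to 15 characters, and 5 PCs per omics layer are all assumptions. Each input is assumed to be a samples × features pandas DataFrame indexed by TCGA barcode.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def to_sample_id(barcode: str) -> str:
    """Truncate a TCGA barcode to its sample-level prefix (assumed: 15 chars),
    e.g. TCGA-A1-A0SB-01A-11R-A144-07 -> TCGA-A1-A0SB-01."""
    return barcode[:15]


def harmonize(blocks: dict) -> dict:
    """Keep only samples present in every omics block, in one shared order."""
    renamed = {}
    for name, df in blocks.items():
        df = df.copy()
        df.index = [to_sample_id(b) for b in df.index]
        # If several aliquots map to one sample, keep the first (assumption).
        renamed[name] = df[~df.index.duplicated(keep="first")]
    shared = sorted(set.intersection(*(set(df.index) for df in renamed.values())))
    return {name: df.loc[shared] for name, df in renamed.items()}


def embed_block(df: pd.DataFrame, n_components: int) -> np.ndarray:
    """Median-impute, z-score, and project one omics block onto its top PCs."""
    x = SimpleImputer(strategy="median").fit_transform(df.to_numpy(dtype=float))
    x = StandardScaler().fit_transform(x)
    return PCA(n_components=n_components, random_state=0).fit_transform(x)


def cluster_integrated(blocks, n_components=5, k_range=range(2, 7), seed=0):
    """Early integration: concatenate per-omics PCs, then pick k by silhouette."""
    aligned = harmonize(blocks)
    embedding = np.hstack([embed_block(df, n_components) for df in aligned.values()])
    scores, labels_by_k = {}, {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embedding)
        scores[k] = silhouette_score(embedding, labels)
        labels_by_k[k] = labels
    best_k = max(scores, key=scores.get)  # highest mean silhouette wins
    sample_ids = next(iter(aligned.values())).index
    return pd.Series(labels_by_k[best_k], index=sample_ids, name="cluster"), scores
```

Calling `cluster_integrated({"rna": rna_df, "meth": meth_df, "cnv": cnv_df})` would return one cluster label per shared sample, plus the silhouette score for each candidate k. Running PCA on each layer before concatenation keeps a layer with many features, such as 450K methylation, from dominating the integrated embedding simply because of its size.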

Results

Figure 1: CNV of key genes by cluster
Figure 2: Integrated multi-omics embedding (colored by cluster)
Figure 3: Marker gene expression by multi-omics cluster
Figure 4: Overall survival by multi-omics cluster (TCGA BRCA)

How to run

git clone https://github.com/Anwesha19-prog/TCGA-BRCA-Multi-Omics-Intergration
cd TCGA-BRCA-Multi-Omics-Intergration
python -m pip install -r requirements.txt
jupyter notebook Project-3.ipynb

Updated: