Breast cancer microarray analysis

Overview

This notebook demonstrates an end-to-end gene expression analysis workflow starting from a raw GEO Series Matrix file. The focus is on a core real-world bioinformatics skill: turning “messy” public repository files into analysis-ready matrices and metadata, then performing basic differential expression and exploratory analysis.

Although the dataset is public and compact, the same workflow scales to large transcriptomics studies and is directly applicable in oncology pipelines.

Links: GitHub
Status: Complete

Why this project

This project demonstrates a reproducible end-to-end gene expression analysis workflow turning chaotic public repository data into analysis-ready matrices and metadata.

Highlights

Parsing GEO Series Matrix structure:
- detecting !series_matrix_table_begin / !series_matrix_table_end
- separating metadata lines from expression table
Building a clean expression matrix:
- rows = genes
- columns = samples
Extracting phenotype labels (Tumor vs Normal) from sample characteristics

Results

Figure
Figure 1: Volcano plot Tumor vs Normal
Figure 2: PCA plot Tumor vs Normal
Figure 3: Top 20 genes based on p-value

How to run

git clone https://github.com/Anwesha19-prog/Breast-Cancer-Microarray-Analysis
cd Breast-Cancer-Microarray-Analysis
pip install pandas numpy scipy matplotlib seaborn scikit-learn
jupyter notebook Project-1_clean.ipynb