Overview

This notebook demonstrates an end-to-end gene expression analysis workflow starting from a raw GEO Series Matrix file. The focus is on a core real-world bioinformatics skill: turning “messy” public repository files into analysis-ready matrices and metadata, then performing basic differential expression and exploratory analysis.

Although the dataset is public and compact, the same workflow scales to large transcriptomics studies and is directly applicable in oncology pipelines.

  • Links: GitHub
  • Status: Complete

Why this project

This project demonstrates a reproducible end-to-end gene expression analysis workflow turning chaotic public repository data into analysis-ready matrices and metadata.

Highlights

  • Parsing GEO Series Matrix structure:
    • detecting !series_matrix_table_begin / !series_matrix_table_end
    • separating metadata lines from expression table
  • Building a clean expression matrix:
    • rows = genes
    • columns = samples
  • Extracting phenotype labels (Tumor vs Normal) from sample characteristics

Results

Figure
Volcano Plot
Figure 1: Volcano plot Tumor vs Normal
PCA Plot
Figure 2: PCA plot Tumor vs Normal
Top Genes
Figure 3: Top 20 genes based on p-value

How to run

git clone https://github.com/Anwesha19-prog/Breast-Cancer-Microarray-Analysis
cd Breast-Cancer-Microarray-Analysis
pip install pandas numpy scipy matplotlib seaborn scikit-learn
jupyter notebook Project-1_clean.ipynb

Updated: