
About the Author

Taimisson

Software Developer & Data Science Enthusiast - explore more projects, skills, and contact info on my portfolio.


Breast Cancer Diagnosis via Supervised Learning

A reproducible machine learning pipeline for binary classification of breast tumors (malignant vs. benign) using morphological features from the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.

Python 3.8+ · MIT License · 96.5% Accuracy

Abstract

This project implements an end-to-end supervised classification pipeline applied to the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg et al., 1995). The dataset comprises 569 observations with 30 numeric features derived from digitized images of fine needle aspirate (FNA) cytology samples. Each feature describes morphological properties of cell nuclei - including radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension - computed as mean, standard error, and worst-case values.

The pipeline encompasses exploratory data analysis, multicollinearity-based feature selection, standardization, and comparative evaluation of Logistic Regression and Random Forest classifiers. The best-performing model (Logistic Regression) achieved 96.5% accuracy and 97.5% precision for the malignant class on the held-out test set.

| Metric | Value | Detail |
|---|---|---|
| Observations | 569 | 357 benign, 212 malignant |
| Original Features | 30 | Reduced to 23 after selection |
| Best Accuracy | 96.5% | Logistic Regression |
| Malignant Precision | 97.5% | Low false-positive rate |

Background

Breast cancer remains the most frequently diagnosed malignancy among women globally, with early detection substantially improving five-year survival rates (WHO, 2024). Computer-aided diagnosis (CAD) systems that leverage machine learning have demonstrated clinical value in supporting pathologists during screening workflows.

The WDBC dataset, originally compiled at the University of Wisconsin–Madison, provides quantitative descriptors of cell nuclei observed in FNA biopsy images. These features capture geometric and textural properties that correlate with malignancy, making the dataset a well-established benchmark in medical classification literature.

This project applies standard supervised learning techniques to this dataset, with emphasis on interpretable models suited to clinical decision support.


Methodology

The analysis follows a structured six-stage pipeline:

1. Data Acquisition: Dataset loaded from the UCI Machine Learning Repository via the ucimlrepo library. Target variable encoded as binary (M = 1, B = 0).
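The acquisition step can be sketched as follows. Because ucimlrepo fetches over the network, this sketch instead uses scikit-learn's bundled copy of the same WDBC data (`load_breast_cancer`), re-encoding the target to match the M = 1, B = 0 convention; the ucimlrepo call the project actually uses is shown in the comment:

```python
# The notebook loads the data over the network:
#   from ucimlrepo import fetch_ucirepo
#   wdbc = fetch_ucirepo(id=17)   # 17 = Breast Cancer Wisconsin (Diagnostic)
#   X, y = wdbc.data.features, wdbc.data.targets
# scikit-learn ships the same 569x30 dataset, which avoids the network call:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
X = data.data                                # 30 numeric features
# sklearn encodes malignant = 0, benign = 1; re-encode to M = 1, B = 0
y = (data.target == 0).astype(int)

print(X.shape)           # (569, 30)
print(int(y.sum()))      # 212 malignant cases
```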

2. Exploratory Data Analysis: Class distribution analysis, descriptive statistics, missing-value verification, and distribution assessment for each feature group (mean, SE, worst).

3. Feature Preprocessing: Column names standardized to reflect measurement type (mean, standard error, worst-case). No missing values or categorical encoding required.

4. Feature Selection: Pairwise Pearson correlations computed; features with |r| > 0.95 removed to mitigate multicollinearity, reducing the feature space from 30 to 23 variables.
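The correlation filter can be sketched as below. The function name is illustrative, and the choice of which member of a correlated pair to drop (here, the later column) may differ from the notebook's exact tie-breaking:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from every pair with |Pearson r| above the threshold."""
    corr = X.corr().abs()
    # Keep only the upper triangle (k=1) so each pair is inspected once
    # and no column is compared with itself.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)

# Toy example: column b duplicates a (r = 1), so it is removed
df = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                   "b": [2, 4, 6, 8, 10],
                   "c": [1, 3, 2, 5, 4]})
print(list(drop_highly_correlated(df).columns))   # ['a', 'c']
```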

5. Model Training: Logistic Regression and Random Forest classifiers trained on standardized features (StandardScaler) with a 75/25 train-test split and a fixed random state for reproducibility.
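The training and evaluation stages might look like the following sketch. The random seed, the `stratify` option, and the default hyperparameters are assumptions, and the full 30 features are used here for self-containment rather than the pruned 23, so the scores will differ slightly from those reported:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)   # M = 1, B = 0

# 75/25 split; the fixed random state keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Fit the scaler on training data only to avoid leakage into the test set
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

log_reg = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
forest = RandomForestClassifier(random_state=42).fit(X_train_s, y_train)

print(classification_report(y_test, log_reg.predict(X_test_s),
                            target_names=["benign", "malignant"]))
```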

6. Evaluation: Performance assessed via accuracy, precision, recall, F1-score, and confusion-matrix analysis on the held-out test set.


Results

Classification Performance (Test Set)

| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 96.5% | 97.5% | 92.9% | 95.1% |
| Random Forest | 94.7% | 97.4% | 88.1% | 92.5% |

Clinical Relevance of Recall

In diagnostic screening, recall (sensitivity) for the malignant class is particularly important: a false negative - classifying a malignant tumor as benign - carries significantly higher clinical risk than a false positive. Logistic Regression achieved 92.9% recall, reducing the probability of missed malignant diagnoses.
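To make the asymmetry concrete, recall is just the fraction of malignant cases the model catches. Using counts consistent with Fig. 6 (3 false negatives out of 42 malignant test cases, hence 39 true positives):

```python
# Sensitivity (recall) for the malignant class: TP / (TP + FN)
tp, fn = 39, 3            # malignant cases caught vs. missed (per Fig. 6)
recall = tp / (tp + fn)
print(f"recall = {recall:.1%}")   # 92.9%
```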

Key Predictive Features

Feature importance analysis identified the following morphological descriptors as the strongest predictors of malignancy:

  1. Concave points (mean and worst) - captures the number of concave portions of the cell nucleus contour, a known indicator of irregular cell morphology
  2. Radius measurements (mean and worst) - reflects overall nucleus size, which tends to be larger in malignant cells
  3. Concavity metrics - quantifies the severity of concave portions, correlating with irregular nuclear shape

These findings are consistent with established cytopathological criteria, where nuclear size and shape irregularity are primary indicators of malignancy (Wolberg et al., 1995).
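The Random Forest ranking behind these findings (Fig. 8) can be reproduced in outline. Here the model is refit on the full bundled dataset for brevity, so the scores and exact ordering will not match the notebook:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer(as_frame=True)
X, y = data.data, (data.target == 0).astype(int)   # M = 1, B = 0

forest = RandomForestClassifier(random_state=42).fit(X, y)

# Mean decrease in Gini impurity, the same measure used in Fig. 8
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(5))
```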


Visualizations

Class Distribution


Fig. 1. Target variable distribution in the WDBC dataset. The dataset exhibits moderate class imbalance with 62.7% benign (n=357) and 37.3% malignant (n=212) cases.

Feature Distributions by Diagnosis


Fig. 2. Distribution of selected features by diagnosis class, illustrating separability between benign and malignant populations.


Fig. 3. Pearson correlation matrix for the full feature set. Highly correlated pairs (|r| > 0.95) were identified for removal during feature selection.

Feature Selection and Correlation Analysis


Fig. 4. Top 15 features ranked by absolute Pearson correlation with the target variable. Concave points (worst) exhibits the strongest linear association (r = 0.79).


Fig. 5. Pairwise correlation heatmap of the top 10 predictive features. Multicollinearity between radius, perimeter, and area is evident.

Confusion Matrices


Fig. 6. Confusion matrix - Logistic Regression. 3 false negatives and 2 false positives on the test set (n=143).


Fig. 7. Confusion matrix - Random Forest. 5 false negatives and 1 false positive on the test set (n=143).

Feature Importance


Fig. 8. Feature importance scores from the Random Forest model, computed via mean decrease in Gini impurity.


Fig. 9. Standardized coefficient magnitudes from Logistic Regression, indicating each feature's contribution to the decision boundary.


Implementation Details

Environment and Dependencies

Design Decisions

Reproducing the Analysis

```bash
git clone https://github.com/Taimisson/breast-cancer-wdbc-ml.git
cd breast-cancer-wdbc-ml
pip install -r requirements.txt
jupyter notebook notebooks/projeto_cancer_mama.ipynb
```

Project Structure

```
breast-cancer-wdbc-ml/
├── README.md
├── requirements.txt
├── LICENSE
├── CITATION.cff
├── notebooks/
│   └── projeto_cancer_mama.ipynb    # Full analysis notebook
└── docs/
    ├── index.md                      # This page
    ├── notebook.html                 # Exported notebook (HTML)
    └── assets/
        ├── css/custom.css
        └── img/                      # Generated figures
```

Resources


References

- Wolberg, W. H., Mangasarian, O. L., Street, N., & Street, W. (1995). Breast Cancer Wisconsin (Diagnostic) Data Set. UCI Machine Learning Repository.
- World Health Organization (2024). Breast cancer. WHO Fact Sheets.


License & Citation

Licensed under the MIT License. See LICENSE.

```bibtex
@software{breast_cancer_wdbc_ml,
  author = {Taimisson},
  title = {Breast Cancer Diagnosis ML Pipeline},
  year = {2026},
  url = {https://github.com/Taimisson/breast-cancer-wdbc-ml}
}
```