Breast Cancer Diagnosis via Supervised Learning

A reproducible machine learning pipeline for binary classification of breast tumors (malignant vs. benign) using morphological features from the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.

Abstract

This project implements an end-to-end supervised classification pipeline applied to the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg et al., 1995). The dataset comprises 569 observations with 30 numeric features derived from digitized images of fine needle aspirate (FNA) cytology samples. Each feature describes one of ten morphological properties of the cell nuclei (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), each reported as a mean, a standard error, and a worst-case value.

The pipeline encompasses exploratory data analysis, multicollinearity-based feature selection, standardization, and comparative evaluation of Logistic Regression and Random Forest classifiers. The best-performing model (Logistic Regression) achieved 96.5% accuracy and 97.5% precision for the malignant class on the held-out test set.

Observations: 569 (357 benign, 212 malignant)
Original features: 30 (reduced to 23 after selection)
Best accuracy: 96.5% (Logistic Regression)
Malignant precision: 97.5% (low false-positive rate)

Background

Breast cancer remains the most frequently diagnosed malignancy among women globally, with early detection substantially improving five-year survival rates (WHO, 2024). Computer-aided diagnosis (CAD) systems that leverage machine learning have demonstrated clinical value in supporting pathologists during screening workflows.

The WDBC dataset, originally compiled at the University of Wisconsin–Madison, provides quantitative descriptors of cell nuclei observed in FNA biopsy images. These features capture geometric and textural properties that correlate with malignancy, making the dataset a well-established benchmark in medical classification literature.

This project applies standard supervised learning techniques to this dataset, with emphasis on interpretable models suited to clinical decision support.

Methodology

The analysis follows a structured six-stage pipeline:

Data Acquisition

Dataset loaded from the UCI Machine Learning Repository via the ucimlrepo library. Target variable encoded as binary (M = 1, B = 0).
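A minimal sketch of this stage. It assumes the `fetch_ucirepo` entry point of ucimlrepo, dataset id 17 for WDBC, and a target column named `Diagnosis`; none of these are stated on this page, so check them against the notebook. The helper names `encode_diagnosis` and `load_wdbc` are hypothetical.

```python
def encode_diagnosis(labels):
    """Map diagnosis codes to integers: 'M' (malignant) -> 1, 'B' (benign) -> 0."""
    mapping = {"M": 1, "B": 0}
    return [mapping[label] for label in labels]


def load_wdbc():
    """Fetch WDBC from the UCI repository (needs network access and ucimlrepo)."""
    from ucimlrepo import fetch_ucirepo  # imported lazily: optional dependency

    dataset = fetch_ucirepo(id=17)  # assumption: 17 is the WDBC dataset id
    features = dataset.data.features  # pandas DataFrame of numeric features
    # Assumption: the single target column is called "Diagnosis".
    target = encode_diagnosis(dataset.data.targets["Diagnosis"])
    return features, target


print(encode_diagnosis(["M", "B", "B"]))  # [1, 0, 0]
```

Only `encode_diagnosis` runs when the block is executed; `load_wdbc` is defined but not called, so the example works offline.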

Exploratory Data Analysis

Class distribution analysis, descriptive statistics, missing value verification, and distribution assessment for each feature group (mean, SE, worst).
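The checks listed above could look roughly like the following. As an assumption, the sketch uses the copy of WDBC bundled with scikit-learn rather than the ucimlrepo download, so column names differ from those in the notebook.

```python
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled WDBC copy stands in for the ucimlrepo download.
frame = load_breast_cancer(as_frame=True).frame

# scikit-learn encodes malignant as 0 and benign as 1, the reverse of this project.
diagnosis = frame["target"].map({0: "malignant", 1: "benign"})

class_counts = diagnosis.value_counts()
class_share = (class_counts / len(frame) * 100).round(1)
missing_total = int(frame.isna().sum().sum())

# Descriptive statistics for the ten "mean" features (the first ten columns).
mean_summary = frame.iloc[:, :10].describe().T[["mean", "std", "min", "max"]]

print(class_counts.to_dict())
print(class_share.to_dict())
print("missing values:", missing_total)
print(mean_summary.round(2))
```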

Feature Preprocessing

Column names standardized to reflect measurement type (mean, standard error, worst-case). No missing values or categorical encoding required.
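One way to express the renaming, as a sketch only. It assumes the raw columns carry a numeric suffix per measurement type (for example `radius1`, `radius2`, `radius3`); that naming and the helper `rename_column` are assumptions, not something this page confirms.

```python
# Assumption: raw columns carry a numeric suffix per measurement type
# (e.g. "radius1", "radius2", "radius3"); verify against the loaded frame.
SUFFIX_TO_LABEL = {"1": "mean", "2": "se", "3": "worst"}


def rename_column(raw_name):
    """Turn 'radius1' into 'radius_mean'; leave unrecognized names unchanged."""
    base, suffix = raw_name[:-1], raw_name[-1:]
    if base and suffix in SUFFIX_TO_LABEL:
        return f"{base}_{SUFFIX_TO_LABEL[suffix]}"
    return raw_name


raw_columns = ["radius1", "texture2", "concave_points3", "Diagnosis"]
print([rename_column(name) for name in raw_columns])
```

With pandas, the same function can be passed directly as `df.rename(columns=rename_column)`.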

Feature Selection

Pairwise Pearson correlation computed; features with |r| > 0.95 were removed to mitigate multicollinearity, reducing the feature space from 30 to 23 variables.
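A common way to implement this kind of filter is to scan the upper triangle of the absolute correlation matrix and drop any column that is highly correlated with an earlier one. The sketch below illustrates that approach on toy data; whether the notebook uses exactly this rule is an assumption, and `drop_highly_correlated` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features, threshold=0.95):
    """Drop every column whose |Pearson r| with an earlier column exceeds threshold."""
    corr = features.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Toy data: "perimeter" is an almost exact linear function of "radius".
rng = np.random.default_rng(0)
radius = rng.normal(14.0, 3.5, size=200)
toy = pd.DataFrame({
    "radius": radius,
    "perimeter": 2 * np.pi * radius + rng.normal(0.0, 0.1, size=200),
    "texture": rng.normal(19.0, 4.0, size=200),
})

reduced, dropped = drop_highly_correlated(toy)
print(dropped)  # ['perimeter']
```

Because only the later member of each pair is removed, one representative of every correlated group (here `radius`) stays in the feature set.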

Model Training

Logistic Regression and Random Forest classifiers trained on standardized features (StandardScaler) with a 75/25 train-test split and fixed random state for reproducibility.
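The training stage can be sketched as follows. Two assumptions apply: the data comes from scikit-learn's bundled WDBC copy, and all 30 features are used (no correlation pruning), so the printed accuracies will not necessarily match the figures reported on this page. `random_state=42` and stratification follow the settings listed under Implementation Details.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: bundled WDBC copy with all 30 features (no correlation pruning).
X, y_sklearn = load_breast_cancer(return_X_y=True)
y = 1 - y_sklearn  # scikit-learn uses 0 = malignant; flip to M = 1, B = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    # Each pipeline fits the scaler on the training split only, avoiding leakage.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Random Forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(random_state=42)
    ),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Wrapping the scaler and classifier in one pipeline keeps the test split out of the scaling statistics, which is the main design point of the sketch.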

Evaluation

Performance assessed via accuracy, precision, recall, F1-score, and confusion matrix analysis on the held-out test set.
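The metrics named above map onto `sklearn.metrics` as in this short sketch. The labels are hypothetical and chosen only so the arithmetic is easy to follow; they are not the project's predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels for illustration only (1 = malignant, 0 = benign).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),    # (TP + TN) / total
    "precision": precision_score(y_true, y_pred),  # TP / (TP + FP)
    "recall": recall_score(y_true, y_pred),        # TP / (TP + FN)
    "f1": f1_score(y_true, y_pred),                # harmonic mean of the two
}
print(tn, fp, fn, tp)
print(metrics)
```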

Results

Classification Performance (Test Set)

Model	Accuracy	Precision	Recall	F1-Score
Logistic Regression	96.5%	97.5%	92.9%	95.1%
Random Forest	94.7%	97.4%	88.1%	92.5%
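As a quick consistency check, the F1 column can be recomputed from the precision and recall columns of the table above, since F1 is their harmonic mean. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above (malignant class).
reported = {
    "Logistic Regression": (0.975, 0.929),
    "Random Forest": (0.974, 0.881),
}

for model, (precision, recall) in reported.items():
    print(f"{model}: F1 = {f1_from(precision, recall):.1%}")
# Logistic Regression: F1 = 95.1%
# Random Forest: F1 = 92.5%
```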

Clinical Relevance of Recall

In diagnostic screening, recall (sensitivity) for the malignant class is particularly important: a false negative (classifying a malignant tumor as benign) carries significantly higher clinical risk than a false positive. Logistic Regression achieved 92.9% recall compared with 88.1% for Random Forest, so it missed fewer malignant cases on the test set.

Key Predictive Features

Feature importance analysis identified the following morphological descriptors as the strongest predictors of malignancy:

Concave points (mean and worst) - capture the number of concave portions of the cell nucleus contour, a known indicator of irregular cell morphology
Radius measurements (mean and worst) - reflect overall nucleus size, which tends to be larger in malignant cells
Concavity metrics - quantify the severity of concave portions, which correlates with irregular nuclear shape

These findings are consistent with established cytopathological criteria, where nuclear size and shape irregularity are primary indicators of malignancy (Wolberg et al., 1995).

Visualizations

Class Distribution

Fig. 1. Target variable distribution in the WDBC dataset. The dataset exhibits moderate class imbalance with 62.7% benign (n=357) and 37.3% malignant (n=212) cases.

Feature Distributions by Diagnosis

Fig. 2. Kernel density estimates of selected features by diagnosis class, illustrating separability between benign and malignant populations.

Fig. 3. Pearson correlation matrix for the full feature set. Highly correlated pairs (|r| > 0.95) were identified for removal during feature selection.

Feature Selection and Correlation Analysis

Fig. 4. Top 15 features ranked by absolute Pearson correlation with the target variable. Concave points (worst) exhibits the strongest linear association (r = 0.79).

Fig. 5. Pairwise correlation heatmap of the top 10 predictive features. Multicollinearity between radius, perimeter, and area is evident.

Confusion Matrices

Fig. 6. Confusion matrix - Logistic Regression. 3 false negatives and 2 false positives on the test set (n=143).

Fig. 7. Confusion matrix - Random Forest. 5 false negatives and 1 false positive on the test set (n=143).

Feature Importance

Fig. 8. Feature importance scores from the Random Forest model, computed via mean decrease in Gini impurity.

Fig. 9. Standardized coefficient magnitudes from Logistic Regression, indicating each feature's contribution to the decision boundary.
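Both kinds of score can be read directly from fitted scikit-learn estimators, as in the sketch below. It assumes the bundled WDBC copy with all 30 features and fits on the full data purely to show where the numbers come from, so the ranking may differ from Figs. 8 and 9.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Assumption: bundled WDBC copy, all 30 features, fitted on the full data.
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = 1 - data.target  # flip scikit-learn's encoding to malignant = 1
names = np.array(data.feature_names)

forest = RandomForestClassifier(random_state=42).fit(X, y)
gini_importance = forest.feature_importances_  # mean decrease in impurity, sums to 1

logreg = LogisticRegression().fit(X, y)
coef_magnitude = np.abs(logreg.coef_[0])  # one coefficient per standardized feature

for label, scores in [
    ("Random Forest", gini_importance),
    ("Logistic Regression", coef_magnitude),
]:
    top = np.argsort(scores)[::-1][:5]  # indices of the five largest scores
    print(label, [(str(names[i]), round(float(scores[i]), 3)) for i in top])
```

Coefficient magnitudes are only comparable across features because the inputs were standardized first.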

Implementation Details

Environment and Dependencies

Language: Python 3.8+
Core libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn
Data source: ucimlrepo (UCI ML Repository API)
Scaling: StandardScaler (standardization to zero mean and unit variance)
Train/test split: 75/25, stratified, random_state=42

Design Decisions

Feature selection threshold (|r| > 0.95): chosen to remove near-redundant features while preserving information. Radius/perimeter/area groups exhibited the highest multicollinearity.
Logistic Regression as primary model: preferred for interpretability and strong performance on linearly separable data, which aligns with the structure of morphological features.
Random Forest as baseline comparison: included to evaluate whether non-linear decision boundaries improve classification. Its test accuracy was 1.8 pp lower than that of Logistic Regression (94.7% vs. 96.5%), which suggests that a linear decision boundary is sufficient for this feature space.
No hyperparameter tuning: default scikit-learn parameters were used. Further gains may be achievable through cross-validated grid search.
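A cross-validated grid search of the kind mentioned in the last item might look like this. It is a sketch under stated assumptions: the parameter grid, the choice of recall as the selection metric, and `max_iter=1000` are illustrative choices, and the data is scikit-learn's bundled WDBC copy with all 30 features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y_sklearn = load_breast_cancer(return_X_y=True)
y = 1 - y_sklearn  # malignant = 1, benign = 0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Recall on the malignant class (label 1) is the selection criterion here.
search = GridSearchCV(pipeline, param_grid, scoring="recall", cv=5)
search.fit(X_train, y_train)

print(search.best_params_, round(search.best_score_, 3))
print("held-out recall:", round(search.score(X_test, y_test), 3))
```

Tuning inside the pipeline means the scaler is refit within every cross-validation fold, so validation folds never influence the scaling statistics.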

Reproducing the Analysis

git clone https://github.com/Taimisson/breast-cancer-wdbc-ml.git
cd breast-cancer-wdbc-ml
pip install -r requirements.txt
jupyter notebook notebooks/projeto_cancer_mama.ipynb

Project Structure

breast-cancer-wdbc-ml/
├── README.md
├── requirements.txt
├── LICENSE
├── CITATION.cff
├── notebooks/
│   └── projeto_cancer_mama.ipynb    # Full analysis notebook
└── docs/
    ├── index.md                      # This page
    ├── notebook.html                 # Exported notebook (HTML)
    └── assets/
        ├── css/custom.css
        └── img/                      # Generated figures

Resources

Repository
Full Notebook
Interactive Version

References

Wolberg, W. H., Street, W. N., & Mangasarian, O. L. (1995). Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, 17(2), 77–87.
UCI Machine Learning Repository - Breast Cancer Wisconsin (Diagnostic)
World Health Organization (2024). Breast Cancer Fact Sheet. who.int

License & Citation

Licensed under the MIT License. See LICENSE.

@software{breast_cancer_wdbc_ml,
  author = {Taimisson},
  title = {Breast Cancer Diagnosis ML Pipeline},
  year = {2026},
  url = {https://github.com/Taimisson/breast-cancer-wdbc-ml}
}
