Breast Cancer Diagnosis via Supervised Learning
A reproducible machine learning pipeline for binary classification of breast tumors (malignant vs. benign) using morphological features from the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.
Abstract
This project implements an end-to-end supervised classification pipeline applied to the Wisconsin Diagnostic Breast Cancer (WDBC) dataset (Wolberg et al., 1995). The dataset comprises 569 observations with 30 numeric features derived from digitized images of fine needle aspirate (FNA) cytology samples. Each feature describes morphological properties of cell nuclei - including radius, texture, perimeter, area, smoothness, compactness, concavity, symmetry, and fractal dimension - computed as mean, standard error, and worst-case values.
The pipeline encompasses exploratory data analysis, multicollinearity-based feature selection, standardization, and comparative evaluation of Logistic Regression and Random Forest classifiers. The best-performing model (Logistic Regression) achieved 96.5% accuracy and 97.5% precision for the malignant class on the held-out test set.
Background
Breast cancer remains the most frequently diagnosed malignancy among women globally, with early detection substantially improving five-year survival rates (WHO, 2024). Computer-aided diagnosis (CAD) systems that leverage machine learning have demonstrated clinical value in supporting pathologists during screening workflows.
The WDBC dataset, originally compiled at the University of Wisconsin–Madison, provides quantitative descriptors of cell nuclei observed in FNA biopsy images. These features capture geometric and textural properties that correlate with malignancy, making the dataset a well-established benchmark in medical classification literature.
This project applies standard supervised learning techniques to this dataset, with emphasis on interpretable models suited to clinical decision support.
Methodology
The analysis follows a structured six-stage pipeline:
Data Acquisition
Dataset loaded from the UCI Machine Learning Repository via the ucimlrepo library. Target variable encoded as binary (M = 1, B = 0).
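A minimal sketch of this stage, assuming the dataset is fetched with `fetch_ucirepo` and that UCI dataset id 17 corresponds to Breast Cancer Wisconsin (Diagnostic). The helper names `load_wdbc` and `encode_diagnosis` are hypothetical and may differ from what the notebook uses.

```python
import pandas as pd


def load_wdbc():
    """Fetch WDBC features and raw labels (needs network access and ucimlrepo)."""
    # Imported lazily so the rest of this sketch runs without the package installed.
    from ucimlrepo import fetch_ucirepo

    # Assumption: id=17 is Breast Cancer Wisconsin (Diagnostic) in the UCI repository.
    dataset = fetch_ucirepo(id=17)
    features = dataset.data.features
    labels = dataset.data.targets.iloc[:, 0]  # single column holding "M" / "B"
    return features, labels


def encode_diagnosis(labels: pd.Series) -> pd.Series:
    """Map diagnosis labels to the binary target: M = 1, B = 0."""
    encoded = labels.map({"M": 1, "B": 0})
    if encoded.isna().any():
        raise ValueError("unexpected diagnosis label; expected only 'M' or 'B'")
    return encoded.astype(int)


# Small offline check of the encoding on hand-written labels.
example = encode_diagnosis(pd.Series(["M", "B", "B", "M"]))
print(example.tolist())
```

Failing loudly on an unexpected label is a design choice for this sketch; it avoids silently turning a mislabelled row into a missing target.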
Exploratory Data Analysis
Class distribution analysis, descriptive statistics, missing value verification, and distribution assessment for each feature group (mean, SE, worst).
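The checks listed here could be gathered in one helper, sketched below. The function name `summarize` and the tiny demonstration frame are illustrative assumptions, not the notebook's code or data.

```python
import pandas as pd


def summarize(features: pd.DataFrame, target: pd.Series) -> dict:
    """Collect class balance, missing-value count, and descriptive statistics."""
    return {
        "class_counts": target.value_counts().to_dict(),
        "class_share": target.value_counts(normalize=True).round(3).to_dict(),
        "missing_values": int(features.isna().sum().sum()),
        "describe": features.describe().T,  # one row per feature
    }


# Illustrative toy data only; not drawn from WDBC.
toy_features = pd.DataFrame({"radius_mean": [12.0, 20.5, 11.2, 14.8]})
toy_target = pd.Series([0, 1, 0, 0])
summary = summarize(toy_features, toy_target)
print(summary["class_counts"], summary["missing_values"])
```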
Feature Preprocessing
Column names standardized to reflect measurement type (mean, standard error, worst-case). No missing values or categorical encoding required.
Feature Selection
Pairwise Pearson correlation computed; features with |r| > 0.95 were removed to mitigate multicollinearity, reducing the feature space from 30 to 23 variables.
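One common way to implement such a filter is sketched below. The text does not say which member of a correlated pair is removed, so this version assumes the later column is dropped and the earlier one kept; the function name `drop_highly_correlated` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.95):
    """Drop every column whose |r| with an earlier column exceeds the threshold."""
    corr = features.corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # Column j is dropped if any earlier column i (i < j) has |r| > threshold.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop), to_drop


# Toy example: "perimeter" is almost a multiple of "radius"; "texture" is not.
demo = pd.DataFrame(
    {
        "radius": [1, 2, 3, 4, 5, 6],
        "perimeter": [2, 4, 6, 8, 10, 12.1],
        "texture": [3, 1, 4, 1, 5, 9],
    },
    dtype=float,
)
reduced, dropped = drop_highly_correlated(demo, threshold=0.95)
print(dropped, list(reduced.columns))
```

Because the result depends on column order, a different ordering of the same data can retain a different representative from each correlated group.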
Model Training
Logistic Regression and Random Forest classifiers trained on standardized features (StandardScaler) with a 75/25 train-test split and fixed random state for reproducibility.
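The training setup can be approximated as follows. This is a sketch under stated assumptions: it uses scikit-learn's bundled copy of WDBC (`load_breast_cancer`) instead of the `ucimlrepo` download, skips the correlation filter, and wraps each model in a pipeline so the scaler is fitted on the training split only. Scores from this sketch are not expected to reproduce the reported results exactly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the UCI download.
data = load_breast_cancer()
X = data.data
# The bundled copy encodes malignant as 0; flip it to match M = 1, B = 0.
y = (data.target == 0).astype(int)

# 75/25 stratified split with a fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Default hyperparameters; the forest gets a seed so repeated runs agree.
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "Random Forest": make_pipeline(
        StandardScaler(), RandomForestClassifier(random_state=42)
    ),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Scaling does not change how a Random Forest splits the data; it is kept in both pipelines here only so the two models see identical inputs.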
Evaluation
Performance assessed via accuracy, precision, recall, F1-score, and confusion matrix analysis on the held-out test set.
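For reference, the four scalar metrics follow directly from the confusion-matrix counts, with malignant as the positive class. The sketch below uses hypothetical counts chosen for easy arithmetic, not the project's results; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: share of predicted-malignant cases that are truly malignant.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall (sensitivity): share of malignant cases that were detected.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```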
Results
Classification Performance (Test Set)
| Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | 96.5% | 97.5% | 92.9% | 95.1% |
| Random Forest | 94.7% | 97.4% | 88.1% | 92.5% |
Clinical Relevance of Recall
In diagnostic screening, recall (sensitivity) for the malignant class is particularly important: a false negative - classifying a malignant tumor as benign - carries significantly higher clinical risk than a false positive. Logistic Regression achieved 92.9% recall versus 88.1% for Random Forest, so it missed fewer malignant cases on the test set.
Key Predictive Features
Feature importance analysis identified the following morphological descriptors as the strongest predictors of malignancy:
- Concave points (mean and worst) - capture the number of concave portions of the cell nucleus contour, a known indicator of irregular cell morphology
- Radius measurements (mean and worst) - reflect overall nucleus size, which tends to be larger in malignant cells
- Concavity metrics - quantify the severity of concave portions, correlating with irregular nuclear shape
These findings are consistent with established cytopathological criteria, where nuclear size and shape irregularity are primary indicators of malignancy (Wolberg et al., 1995).
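Rankings of this kind can be read off fitted scikit-learn models, as in the sketch below. It assumes the bundled `load_breast_cancer` copy of the data and, for brevity, fits on the full dataset rather than the training split, so its ordering may differ from the one reported here. The helper `top_features` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = (data.target == 0).astype(int)  # malignant = 1, benign = 0

logreg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(random_state=42).fit(X, y)


def top_features(scores, names, k=5):
    """Return the k (name, score) pairs with the largest scores, descending."""
    order = np.argsort(scores)[::-1][:k]
    return [(str(names[i]), float(scores[i])) for i in order]


# Coefficient magnitudes are comparable because the inputs are standardized.
coef_ranking = top_features(np.abs(logreg.coef_[0]), data.feature_names)
# Impurity-based importances (mean decrease in Gini impurity).
gini_ranking = top_features(forest.feature_importances_, data.feature_names)

for name, score in coef_ranking:
    print(f"|coef| {name}: {score:.3f}")
for name, score in gini_ranking:
    print(f"gini   {name}: {score:.3f}")
```

The two rankings measure different things: coefficient magnitude reflects a feature's weight in a linear decision boundary, while impurity-based importance reflects how often and how effectively a feature is used for splits. Correlated features can share or trade importance in either ranking.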
Visualizations
Class Distribution
Fig. 1. Target variable distribution in the WDBC dataset. The dataset exhibits moderate class imbalance with 62.7% benign (n=357) and 37.3% malignant (n=212) cases.
Feature Distributions by Diagnosis
Fig. 2. Distribution of selected features by diagnosis class, illustrating separability between benign and malignant populations.
Fig. 3. Pearson correlation matrix for the full feature set. Highly correlated pairs (|r| > 0.95) were identified for removal during feature selection.
Feature Selection and Correlation Analysis
Fig. 4. Top 15 features ranked by absolute Pearson correlation with the target variable. Concave points (worst) exhibits the strongest linear association (r = 0.79).
Fig. 5. Pairwise correlation heatmap of the top 10 predictive features. Multicollinearity between radius, perimeter, and area is evident.
Confusion Matrices
Fig. 6. Confusion matrix - Logistic Regression. 3 false negatives and 2 false positives on the test set (n=143).
Fig. 7. Confusion matrix - Random Forest. 5 false negatives and 1 false positive on the test set (n=143).
Feature Importance
Fig. 8. Feature importance scores from the Random Forest model, computed via mean decrease in Gini impurity.
Fig. 9. Standardized coefficient magnitudes from Logistic Regression, indicating each feature's contribution to the decision boundary.
Implementation Details
Environment and Dependencies
- Language: Python 3.8+
- Core libraries: pandas, NumPy, matplotlib, seaborn, scikit-learn
- Data source: ucimlrepo (UCI ML Repository API)
- Scaling: StandardScaler - zero mean, unit variance normalization
- Train/test split: 75/25, stratified, random_state=42
Design Decisions
- Feature selection threshold (|r| > 0.95): chosen to remove near-redundant features while preserving information. Radius/perimeter/area groups exhibited the highest multicollinearity.
- Logistic Regression as primary model: preferred for its interpretable coefficients and because it performs well when classes are close to linearly separable, as the morphological features in this dataset appear to be.
- Random Forest as baseline comparison: included to evaluate whether non-linear decision boundaries improve classification. The marginal accuracy decrease (−1.8 pp) suggests linear separation is sufficient for this feature space.
- No hyperparameter tuning: default scikit-learn parameters were used. Further gains may be achievable through cross-validated grid search.
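A cross-validated grid search of the kind mentioned in the last point could look like the sketch below. The parameter grid, the choice of recall as the selection metric, and the use of the bundled `load_breast_cancer` data are all assumptions made for illustration; none of this was run as part of the project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)  # malignant = 1, benign = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so each CV fold fits its own scaler.
pipeline = Pipeline(
    [("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]
)

# Hypothetical grid over the inverse regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Assumption: select on recall for the malignant class, given its clinical weight.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV recall:", round(search.best_score_, 3))
```

The held-out test set is left untouched during the search, so it can still give an unbiased estimate for whichever configuration is selected.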
Reproducing the Analysis
git clone https://github.com/Taimisson/breast-cancer-wdbc-ml.git
cd breast-cancer-wdbc-ml
pip install -r requirements.txt
jupyter notebook notebooks/projeto_cancer_mama.ipynb
Project Structure
breast-cancer-wdbc-ml/
├── README.md
├── requirements.txt
├── LICENSE
├── CITATION.cff
├── notebooks/
│   └── projeto_cancer_mama.ipynb   # Full analysis notebook
└── docs/
    ├── index.md                    # This page
    ├── notebook.html               # Exported notebook (HTML)
    └── assets/
        ├── css/custom.css
        └── img/                    # Generated figures
Resources
References
- Wolberg, W. H., Street, W. N., & Mangasarian, O. L. (1995). Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, 17(2), 77–87.
- UCI Machine Learning Repository - Breast Cancer Wisconsin (Diagnostic)
- World Health Organization (2024). Breast Cancer Fact Sheet. who.int
License & Citation
Licensed under the MIT License. See LICENSE.
@software{breast_cancer_wdbc_ml,
author = {Taimisson},
title = {Breast Cancer Diagnosis ML Pipeline},
year = {2026},
url = {https://github.com/Taimisson/breast-cancer-wdbc-ml}
}