Developing an SVM-based Diagnostic Tool for Identifying Benign and Malignant Cells
About Data
The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007).
Feature | Description |
---|---|
ID | Sample identifier |
Clump | Clump thickness |
UnifSize | Uniformity of cell size |
UnifShape | Uniformity of cell shape |
MargAdh | Marginal adhesion |
SingEpiSize | Single epithelial cell size |
BareNuc | Bare nuclei |
BlandChrom | Bland chromatin |
NormNucl | Normal nucleoli |
Mit | Mitoses |
Class | Benign or malignant |
The goal is to develop a Support Vector Machine (SVM)-based diagnostic tool that classifies cells as benign or malignant.
Introduction
The notebook begins with an overview of its objective, which is to develop an SVM-based tool for identifying benign and malignant cells using a dataset from the UCI Machine Learning Repository.
Data Description
The data features are described, including cell characteristics like Clump Thickness, Uniformity of Cell Size, and others. The target variable is the Class field indicating benign (2) or malignant (4) diagnoses.
Development Process
Data Loading and Preprocessing:
- Libraries such as pandas, numpy, and sklearn are imported.
- The dataset is loaded into a DataFrame, and initial exploration is performed.
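The loading step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual cell: the inline rows are made up, the `load_cells` helper is hypothetical, and the use of `?` to mark a missing BareNuc value is an assumption about how the raw file encodes gaps. Only the column names come from the feature table above.

```python
import io
import pandas as pd

# Made-up sample rows for illustration; a real run would read the full CSV file.
SAMPLE_CSV = """ID,Clump,UnifSize,UnifShape,MargAdh,SingEpiSize,BareNuc,BlandChrom,NormNucl,Mit,Class
1,3,1,1,1,2,1,2,1,1,2
2,4,2,1,1,2,2,3,1,1,2
3,9,8,8,6,7,10,7,8,2,4
4,7,6,5,4,5,?,6,5,1,4
"""

def load_cells(source):
    """Read the cell data and keep only rows whose BareNuc value is numeric."""
    df = pd.read_csv(source)
    # Assumption: non-numeric entries (such as '?') mark missing values.
    df["BareNuc"] = pd.to_numeric(df["BareNuc"], errors="coerce")
    df = df.dropna(subset=["BareNuc"]).copy()
    df["BareNuc"] = df["BareNuc"].astype(int)
    return df

cell_df = load_cells(io.StringIO(SAMPLE_CSV))
print(cell_df.shape)                    # rows and columns after cleaning
print(cell_df["Class"].value_counts())  # class balance (2 = benign, 4 = malignant)
```

Dropping incomplete rows is only one option; imputing the missing values would keep more samples at the cost of introducing estimated data.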
Data Exploration and Visualization:
- This section includes visualizations and statistical analysis to understand the data better.
Model Development:
- The SVM model is created using sklearn's SVM module.
- Features are selected, and the model is trained on the dataset.
- Code cells are dedicated to setting up the SVM, adjusting parameters, and fitting the model to the data.
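The model development steps above might look like the sketch below. It runs on synthetic stand-in data rather than the real dataset, and the kernel, `C`, and `gamma` values are illustrative assumptions, since the settings used in the notebook are not stated here. The 1 to 10 feature scale is likewise assumed.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

FEATURES = ["Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize",
            "BareNuc", "BlandChrom", "NormNucl", "Mit"]

# Synthetic stand-in for the real data (assumed 1-10 scale):
# low scores for benign (class 2), high scores for malignant (class 4).
rng = np.random.default_rng(0)
benign = rng.integers(1, 5, size=(60, len(FEATURES)))      # values 1..4
malignant = rng.integers(6, 11, size=(40, len(FEATURES)))  # values 6..10
X = np.vstack([benign, malignant]).astype(float)
y = np.array([2] * 60 + [4] * 40)

# Hold out 20% of the samples for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Kernel and regularization strength are illustrative choices only.
clf = svm.SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Test accuracy:", (y_pred == y_test).mean())
```

Swapping `kernel="rbf"` for `"linear"`, `"poly"`, or `"sigmoid"` is the usual way to compare decision boundaries with `SVC`.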
Model Evaluation:
The model was evaluated on the held-out test set of 137 samples. The results are summarized in a confusion matrix and a classification report:
Confusion Matrix:
- True Negative (Benign correctly identified): 85
- False Positive (Benign misclassified as Malignant): 5
- True Positive (Malignant correctly identified): 47
- False Negative (Malignant misclassified as Benign): 0
With malignant treated as the positive class, the model produced no false negatives: every malignant sample in the test set was identified.
Classification Report:
- Class 2 (Benign):
  - Precision: 1.00 (every sample predicted benign was actually benign)
  - Recall: 0.94 (94% of the benign cases were correctly identified)
  - F1-Score: 0.97 (the harmonic mean of precision and recall)
  - Support: 90 (number of actual benign samples in the test set)
- Class 4 (Malignant):
  - Precision: 0.90 (90% of the samples predicted malignant were actually malignant)
  - Recall: 1.00 (every malignant case was identified correctly)
  - F1-Score: 0.95
  - Support: 47
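The per-class figures in the report can be re-derived from the four confusion-matrix counts. The sketch below uses plain arithmetic rather than the notebook's own evaluation code; the `prf` helper and the variable names are hypothetical, and the accuracy line is computed here from the same counts.

```python
# Counts taken from the confusion matrix reported above.
benign_correct = 85        # benign predicted as benign
benign_as_malignant = 5    # benign predicted as malignant
malignant_correct = 47     # malignant predicted as malignant
malignant_as_benign = 0    # malignant predicted as benign

def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# For each class, "fp" counts the other class's samples predicted as this class.
benign = prf(tp=benign_correct, fp=malignant_as_benign, fn=benign_as_malignant)
malignant = prf(tp=malignant_correct, fp=benign_as_malignant, fn=malignant_as_benign)

total = benign_correct + benign_as_malignant + malignant_correct + malignant_as_benign
accuracy = (benign_correct + malignant_correct) / total

print("Benign    P/R/F1:", [round(v, 2) for v in benign])     # [1.0, 0.94, 0.97]
print("Malignant P/R/F1:", [round(v, 2) for v in malignant])  # [0.9, 1.0, 0.95]
print("Accuracy:", round(accuracy, 2))                        # 0.96
```

The benign recall of 0.94 and the malignant precision of 0.90 both trace back to the same five benign samples that were predicted malignant.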
import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
# train_test_split lives in sklearn.model_selection (the old sklearn.cross_validation module was removed)
from sklearn.model_selection import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt
# Train/test split: hold out 20% of the samples for evaluation.
# X (feature matrix) and y (class labels) are assumed to be defined earlier in the notebook.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)