Diagnostic Tool with SVM by Raji Kudus Adewale

Developing an SVM-based Diagnostic Tool for Identifying Benign and Malignant Cells

About Data

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007).

Feature	Description
ID	Sample identifier
Clump	Clump thickness
UnifSize	Uniformity of cell size
UnifShape	Uniformity of cell shape
MargAdh	Marginal adhesion
SingEpiSize	Single epithelial cell size
BareNuc	Bare nuclei
BlandChrom	Bland chromatin
NormNucl	Normal nucleoli
Mit	Mitoses
Class	Benign or malignant

The goal is to develop a Support Vector Machine (SVM)-based diagnostic tool that classifies cells as benign or malignant.

Introduction

The notebook begins with an overview of its objective, which is to develop an SVM-based tool for identifying benign and malignant cells using a dataset from the UCI Machine Learning Repository.

Data Description

The features describe cell characteristics such as clump thickness, uniformity of cell size, and the other fields listed in the table above. The target variable is the Class field, which takes the value 2 for a benign diagnosis and 4 for a malignant one.
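The feature matrix and target vector described above could be assembled along the following lines. This is a minimal sketch, not the notebook's own code: the rows are made-up values, and the "?" marker for missing Bare Nuclei entries is an assumption about how the raw file encodes them.

```python
import numpy as np
import pandas as pd

# Hypothetical rows using the column names from the feature table;
# the values are invented for illustration, not taken from the dataset.
cell_df = pd.DataFrame({
    "ID": [1001, 1002, 1003, 1004],
    "Clump": [5, 3, 8, 1],
    "UnifSize": [1, 1, 10, 1],
    "UnifShape": [1, 1, 10, 1],
    "MargAdh": [1, 1, 8, 1],
    "SingEpiSize": [2, 2, 7, 2],
    "BareNuc": ["1", "2", "10", "?"],  # assumption: missing values appear as "?"
    "BlandChrom": [3, 3, 9, 3],
    "NormNucl": [1, 1, 7, 1],
    "Mit": [1, 1, 1, 1],
    "Class": [2, 2, 4, 2],  # 2 = benign, 4 = malignant
})

# Convert BareNuc to numbers; entries that cannot be parsed become NaN
# and the corresponding rows are dropped.
cell_df["BareNuc"] = pd.to_numeric(cell_df["BareNuc"], errors="coerce")
cell_df = cell_df.dropna(subset=["BareNuc"]).astype({"BareNuc": int})

# Every column except ID and Class is used as a feature.
feature_cols = ["Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize",
                "BareNuc", "BlandChrom", "NormNucl", "Mit"]
X = cell_df[feature_cols].to_numpy()
y = cell_df["Class"].to_numpy()

print(X.shape, y)
```

The ID column is left out of X because an identifier carries no diagnostic information.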

Development Process

Data Loading and Preprocessing:
- Libraries such as pandas, numpy, and sklearn are imported.
- The dataset is loaded into a DataFrame, and initial exploration is performed.
Data Exploration and Visualization:
- This section includes visualizations and statistical analysis to understand the data better.
Model Development:
- The SVM model is created using sklearn's SVM module.
- Features are selected, and the model is trained on the dataset.
- Code cells are dedicated to setting up the SVM, adjusting parameters, and fitting the model to the data.
Model Evaluation:

The SVM-based diagnostic tool was evaluated on the test set, and the results are summarized in a confusion matrix and a classification report:
- Confusion Matrix (malignant treated as the positive class):
  - True Negative (benign correctly identified): 85
  - False Positive (benign misclassified as malignant): 5
  - True Positive (malignant correctly identified): 47
  - False Negative (malignant misclassified as benign): 0. The model therefore missed no malignant samples in this test set.
- Classification Report:
  - Class 2 (Benign):
    - Precision: 1.00 (every sample predicted benign was actually benign)
    - Recall: 0.94 (94% of the benign cases were correctly identified)
    - F1-Score: 0.97 (the harmonic mean of precision and recall)
    - Support: 90 (number of actual occurrences of the class in the test set)
  - Class 4 (Malignant):
    - Precision: 0.90 (90% of the samples predicted malignant were actually malignant)
    - Recall: 1.00 (every malignant case was identified correctly)
    - F1-Score: 0.95
    - Support: 47
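The figures in the classification report follow directly from the four confusion-matrix counts. As a check, the sketch below rebuilds label arrays that match those counts and recomputes the metrics with scikit-learn; the arrays are reconstructed from the reported totals, not taken from the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Reported counts: 85 benign correct, 5 benign predicted malignant,
# 47 malignant correct, 0 malignant predicted benign.
y_true = np.array([2] * 90 + [4] * 47)
y_pred = np.array([2] * 85 + [4] * 5 + [4] * 47)

# Rows are true classes, columns are predicted classes, in the order [2, 4].
cm = confusion_matrix(y_true, y_pred, labels=[2, 4])
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[2, 4]
)

print(cm)
for label, p, r, f, s in zip([2, 4], precision, recall, f1, support):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={s}")
```

For example, malignant precision is 47 / (47 + 5), which rounds to 0.90, and benign recall is 85 / 90, which rounds to 0.94.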


import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
# train_test_split is imported from sklearn.model_selection further down;
# the old sklearn.cross_validation module has been removed from scikit-learn.
%matplotlib inline
import matplotlib.pyplot as plt


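# Train/test split
# X (feature matrix) and y (Class labels) are assumed to have been built
# from the DataFrame in an earlier notebook cell that is not shown here.

from sklearn.model_selection import train_test_split
# Hold out 20% of the samples for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)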
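After a split like this, the model-development and evaluation steps might look roughly as follows. This is a sketch under stated assumptions: the data is synthetic (generated with make_classification so the block runs on its own), and the RBF kernel is an illustrative choice, since the summary does not say which kernel or parameters the notebook used.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the nine cell features (assumption: not the real data).
X, y = make_classification(n_samples=200, n_features=9, random_state=4)

# Same split settings as in the cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

# Assumption: an RBF-kernel SVM with default parameters.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Predict on the held-out samples and summarize the results.
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

With the real dataset, X and y would come from the cell features and the Class column, and the report would list classes 2 and 4 in place of the synthetic labels 0 and 1.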