Diagnostic Tool with SVM

Python Machine Learning

Developing an SVM-based Diagnostic Tool for Identifying Benign and Malignant Cells

About Data

The example is based on a dataset that is publicly available from the UCI Machine Learning Repository (Asuncion and Newman, 2007).


Feature Description
ID Sample identifier
Clump Clump thickness
UnifSize Uniformity of cell size
UnifShape Uniformity of cell shape
MargAdh Marginal adhesion
SingEpiSize Single epithelial cell size
BareNuc Bare nuclei
BlandChrom Bland chromatin
NormNucl Normal nucleoli
Mit Mitoses
Class Benign or malignant

The goal is to develop a Support Vector Machine (SVM)-based diagnostic tool that classifies cells as benign or malignant.


Introduction

The notebook begins with an overview of its objective, which is to develop an SVM-based tool for identifying benign and malignant cells using a dataset from the UCI Machine Learning Repository.


Data Description

The data features are described, including cell characteristics like Clump Thickness, Uniformity of Cell Size, and others. The target variable is the Class field indicating benign (2) or malignant (4) diagnoses.
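As a rough illustration of the layout described above, the sketch below builds a small DataFrame with the feature names from the table and separates the nine cell characteristics from the Class target. The rows are made-up values for illustration only, and the names `cell_df`, `feature_cols`, and `label_names` are assumptions rather than names confirmed by the notebook.

```python
import pandas as pd

# Hypothetical rows for illustration only; they are NOT taken from the real dataset.
cell_df = pd.DataFrame({
    "ID":          [1001, 1002, 1003, 1004],
    "Clump":       [5, 3, 8, 1],
    "UnifSize":    [1, 1, 10, 1],
    "UnifShape":   [1, 1, 10, 1],
    "MargAdh":     [1, 1, 8, 1],
    "SingEpiSize": [2, 2, 7, 2],
    "BareNuc":     [1, 2, 10, 1],
    "BlandChrom":  [3, 3, 9, 3],
    "NormNucl":    [1, 1, 7, 1],
    "Mit":         [1, 1, 1, 1],
    "Class":       [2, 2, 4, 2],   # 2 = benign, 4 = malignant
})

# The ID column identifies a sample and carries no diagnostic signal, so it is left out.
feature_cols = ["Clump", "UnifSize", "UnifShape", "MargAdh", "SingEpiSize",
                "BareNuc", "BlandChrom", "NormNucl", "Mit"]

X = cell_df[feature_cols].to_numpy()   # feature matrix, one row per sample
y = cell_df["Class"].to_numpy()        # target vector with values 2 or 4

# Human-readable view of the class balance.
label_names = {2: "benign", 4: "malignant"}
print(cell_df["Class"].map(label_names).value_counts())
```

Keeping the original 2/4 coding in `y` means the labels in later reports match the Class field directly.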


Development Process

  1. Data Loading and Preprocessing:
    • Libraries such as pandas, numpy, and sklearn are imported.
    • The dataset is loaded into a DataFrame, and initial exploration is performed.
  2. Data Exploration and Visualization:
    • This section includes visualizations and statistical analysis to understand the data better.
  3. Model Development:
    • The SVM model is created using sklearn's SVM module.
    • Features are selected, and the model is trained on the dataset.
    • Code cells are dedicated to setting up the SVM, adjusting parameters, and fitting the model to the data.
  4. Model Evaluation:

    The SVM-based diagnostic tool was evaluated on the held-out test set, and the results are summarized in a confusion matrix and a classification report:


    • Confusion Matrix (taking malignant as the positive class):
      • True Negative (Benign correctly identified): 85
      • False Positive (Benign misclassified as Malignant): 5
      • True Positive (Malignant correctly identified): 47
      • False Negative (Malignant misclassified as Benign): 0
      With no false negatives, the model identified every malignant sample in the test set; its only errors were the 5 benign samples flagged as malignant.
    • Classification Report:
      • Class 2 (Benign):
        • Precision: 1.00 (every sample predicted as benign was actually benign)
        • Recall: 0.94 (94% of the benign cases were correctly identified)
        • F1-Score: 0.97 (the harmonic mean of precision and recall)
        • Support: 90 (number of actual occurrences of the class in the dataset)
      • Class 4 (Malignant):
        • Precision: 0.90 (90% of the samples predicted as malignant were actually malignant)
        • Recall: 1.00 (every malignant case was identified correctly)
        • F1-Score: 0.95
        • Support: 47
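The figures in the classification report follow directly from the four confusion-matrix counts. The sketch below recomputes them by hand; the helper name `prf` and the variable names are illustrative assumptions, and the overall accuracy is derived here from the same counts rather than quoted from the report.

```python
# Counts taken from the confusion matrix reported above.
benign_correct = 85        # benign predicted as benign
benign_as_malignant = 5    # benign predicted as malignant
malignant_correct = 47     # malignant predicted as malignant
malignant_as_benign = 0    # malignant predicted as benign


def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# For each class, "positive" means "belongs to this class".
benign = prf(tp=benign_correct, fp=malignant_as_benign, fn=benign_as_malignant)
malignant = prf(tp=malignant_correct, fp=benign_as_malignant, fn=malignant_as_benign)

total = benign_correct + benign_as_malignant + malignant_correct + malignant_as_benign
accuracy = (benign_correct + malignant_correct) / total

print("Benign    P/R/F1:", [round(v, 2) for v in benign])     # [1.0, 0.94, 0.97]
print("Malignant P/R/F1:", [round(v, 2) for v in malignant])  # [0.9, 1.0, 0.95]
print("Accuracy:", round(accuracy, 2))                        # 0.96
```

The supports (90 benign, 47 malignant) are the row totals 85 + 5 and 47 + 0.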

import pandas as pd
import pylab as pl
import numpy as np
import scipy.optimize as opt
from sklearn import preprocessing
# from sklearn.cross_validation import train_test_split
%matplotlib inline
import matplotlib.pyplot as plt

# Train_Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
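After the split, the model-development and evaluation steps could look roughly like the sketch below. It is self-contained and runs on synthetic data, not on the real dataset: the random features, the labelling rule, and the choice of an RBF kernel are all assumptions for illustration, since the summary does not state which kernel or parameters the notebook used.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic stand-in data (assumption): 200 samples, 9 integer features in 1..10,
# labelled 4 (malignant) when the mean feature value is high, else 2 (benign).
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(200, 9))
y = np.where(X.mean(axis=1) > 5.5, 4, 2)

# Same split settings as in the cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4)

# Kernel choice is an assumption; SVC also supports 'linear', 'poly' and 'sigmoid'.
clf = svm.SVC(kernel="rbf")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes and columns are predicted classes, in the order [2, 4].
cm = confusion_matrix(y_test, y_pred, labels=[2, 4])
print(cm)
print(classification_report(y_test, y_pred))
```

Passing `labels=[2, 4]` fixes the row and column order of the matrix so that benign comes first, which makes it easier to compare against the per-class report.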
