A machine learning algorithm based on a Support Vector Machine to predict cancer cells
Published on August 06, 2021 by Vimal Octavius PJ
Machine Learning SVM
Building a classification model from human cell records to classify samples as benign or malignant.
This exercise uses a Support Vector Machine (SVM).
The dataset is publicly available at the UCI Machine Learning Repository (Asuncion and Newman, 2007). It contains several hundred human cell sample records, each of which holds the values of a set of cell characteristics. The fields in each record are:
Field name | Description |
---|---|
ID | Sample identifier |
Clump | Clump thickness |
UnifSize | Uniformity of cell size |
UnifShape | Uniformity of cell shape |
MargAdh | Marginal adhesion |
SingEpiSize | Single epithelial cell size |
BareNuc | Bare nuclei |
BlandChrom | Bland chromatin |
NormNucl | Normal nucleoli |
Mit | Mitoses |
Class | Benign or malignant |
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

%matplotlib inline
import matplotlib.pyplot as plt
df_cell_data = pd.read_csv("/content/drive/MyDrive/Datasets/CancerCellSamples.csv")
df_cell_data.head()
| | Unnamed: 0 | ID | Clump | UnifSize | UnifShape | MargAdh | SingEpiSize | BareNuc | BlandChrom | NormNucl | Mit | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2 |
| 1 | 1 | 1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2 |
| 2 | 2 | 1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 2 |
| 3 | 3 | 1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | 2 |
| 4 | 4 | 1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | 2 |
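The Unnamed: 0 column is an artifact of the CSV having been saved together with its DataFrame index; it carries no information. It is left in place here so the outputs below match the original run, but it could be dropped (a small sketch, not part of the original notebook):

# Hypothetical cleanup: remove the saved-index column if present
if 'Unnamed: 0' in df_cell_data.columns:
    df_cell_data = df_cell_data.drop(columns=['Unnamed: 0'])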
The fields Clump, UnifSize, UnifShape, MargAdh, SingEpiSize, BareNuc, BlandChrom, NormNucl, and Mit hold values graded from 1 to 10, with 1 being the closest to benign.
The Class field holds the diagnosis: benign (value = 2) or malignant (value = 4).
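As a quick sanity check on the class balance (a small sketch; the resulting counts are not quoted here since they are not in the original post):

# Count benign (2) vs malignant (4) samples
print(df_cell_data['Class'].value_counts())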
The distribution of the classes by Clump thickness and Uniformity of cell size can be visualized:
ax = df_cell_data[df_cell_data['Class'] == 4][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='DarkBlue', label='malignant');
df_cell_data[df_cell_data['Class'] == 2][0:50].plot(kind='scatter', x='Clump', y='UnifSize', color='Yellow', label='benign', ax=ax);
plt.show()
A view of the columns' data types:
df_cell_data.dtypes
Unnamed: 0      int64
ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object
The BareNuc column contains values that are not numerical. Dropping those rows:
df_cell_data = df_cell_data[pd.to_numeric(df_cell_data['BareNuc'], errors='coerce').notnull()]
df_cell_data['BareNuc'] = df_cell_data['BareNuc'].astype('int')
df_cell_data.dtypes
Unnamed: 0     int64
ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int64
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object
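The pd.to_numeric(..., errors='coerce') idiom above converts anything non-numeric to NaN, and .notnull() then filters those rows out. A minimal, self-contained illustration of the same pattern (the '?' value is only an assumed stand-in for a non-numeric entry):

import pandas as pd

s = pd.Series(['1', '10', '?'])
mask = pd.to_numeric(s, errors='coerce').notnull()  # True for '1' and '10', False for '?'
print(s[mask])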
df_features = df_cell_data[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(df_features)
X[0:5]
array([[ 5,  1,  1,  1,  2,  1,  3,  1,  1],
       [ 5,  4,  4,  5,  7, 10,  3,  2,  1],
       [ 3,  1,  1,  1,  2,  2,  3,  1,  1],
       [ 6,  8,  8,  1,  3,  4,  3,  7,  1],
       [ 4,  1,  1,  3,  2,  1,  3,  1,  1]])
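All nine features already share the same 1-to-10 grading scale, so explicit scaling is not strictly required here. SVMs are sensitive to feature magnitudes in general, though, so for mixed-scale data a standardization step would typically precede the fit. A sketch using the preprocessing module imported earlier (optional; not applied in this run):

# Hypothetical: rescale features to zero mean and unit variance
X_scaled = preprocessing.StandardScaler().fit_transform(X)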
The model should predict the Class field: benign (= 2) or malignant (= 4). Let's cast it to an integer type and build the target vector:
df_cell_data['Class'] = df_cell_data['Class'].astype('int')
y = np.asarray(df_cell_data['Class'])
y[0:5]
array([2, 2, 2, 2, 2])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)
Train set: (546, 9) (546,)
Test set: (137, 9) (137,)
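With random_state=4 the split is reproducible. Because the two classes are unevenly represented, a stratified split is a common variant that preserves the benign/malignant ratio in both sets (sketch only; the original run does not stratify):

# Hypothetical variant: keep class proportions equal across train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4, stratify=y)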
Mapping data into a higher-dimensional space is called kernelling. With SVM, there are a few kernel functions to choose from; the mathematical function used for the transformation is called the kernel function. The common options are listed below (a quick comparison is sketched after the list):
1. Sigmoid
2. Polynomial
3. Radial basis function (RBF)
4. Linear
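For reference, the RBF kernel computes K(x, x') = exp(-gamma * ||x - x'||^2). A quick, illustrative way to compare the four options on the split above (a sketch beyond the original post; scores will vary):

from sklearn import svm

# Hypothetical comparison: fit one SVC per kernel and report test accuracy
for k in ('sigmoid', 'poly', 'rbf', 'linear'):
    model = svm.SVC(kernel=k).fit(X_train, y_train)
    print(k, model.score(X_test, y_test))  # mean accuracy on the test set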
Each of these returns different results and has its own pros and cons. Let's use the default, RBF (Radial Basis Function):
from sklearn import svm
clf = svm.SVC(kernel='rbf')
clf.fit(X_train, y_train)
SVC()
Prediction:
yhat = clf.predict(X_test)
yhat[0:5]
array([2, 4, 2, 4, 2])
Evaluating the model
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def confusion_matrix_plot(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Print and plot the confusion matrix.
    Normalization is applied when `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Confusion Matrix, Normalized")
    else:
        print('Confusion Matrix, not Normalized')
    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # Annotate each cell with its count (or rate, when normalized)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[2,4])
np.set_printoptions(precision=2)
print(classification_report(y_test, yhat))
# Plot non-normalized confusion matrix
plt.figure()
confusion_matrix_plot(cnf_matrix, classes=['Benign(2)', 'Malignant(4)'], normalize=False, title='Confusion matrix')
              precision    recall  f1-score   support

           2       1.00      0.94      0.97        90
           4       0.90      1.00      0.95        47

    accuracy                           0.96       137
   macro avg       0.95      0.97      0.96       137
weighted avg       0.97      0.96      0.96       137

Confusion Matrix, not Normalized
[[85  5]
 [ 0 47]]
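The report's numbers can be read straight off the confusion matrix, where rows are true labels and columns are predictions, ordered [2, 4]. A quick manual check:

# recall for class 2    = 85 / (85 + 5) ≈ 0.94  -> matches the report
# precision for class 4 = 47 / (47 + 5) ≈ 0.90  -> matches the report
print(85 / 90, 47 / 52)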
Using f1_score from the sklearn library:
from sklearn.metrics import f1_score
f1_score(y_test, yhat, average='weighted')
0.9639038982104676
Using the Jaccard index for accuracy:
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat, pos_label=2)
0.9444444444444444
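For the positive label, the Jaccard score reduces to TP / (TP + FP + FN). Reading the counts for class 2 off the confusion matrix above (TP = 85, FN = 5, FP = 0) reproduces the score:

tp, fn, fp = 85, 5, 0
print(tp / (tp + fn + fp))  # 0.9444..., matching jaccard_score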
Building the model with a linear kernel:
clf2 = svm.SVC(kernel='linear')
clf2.fit(X_train, y_train)
yhat2 = clf2.predict(X_test)
print("Avg F1-score: %.4f" % f1_score(y_test, yhat2, average='weighted'))
print("Jaccard score: %.4f" % jaccard_score(y_test, yhat2,pos_label=2))
Avg F1-score: 0.9639
Jaccard score: 0.9444
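Both kernels happen to score identically on this particular split. A more systematic comparison (a sketch going beyond the original post) would cross-validate over the kernel and the regularization strength C:

from sklearn.model_selection import GridSearchCV

# Hypothetical extension: grid-search kernel and C with 5-fold cross-validation
param_grid = {'kernel': ['rbf', 'linear'], 'C': [0.1, 1, 10]}
grid = GridSearchCV(svm.SVC(), param_grid, scoring='f1_weighted', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)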