Classification is a type of supervised machine learning problem with a categorical target (response) variable. Given the known labels in the training data, the classifier approximates a mapping function (f) from the input variables (X) to the output variable (Y).
Import Libraries and Load Dataset
To begin, we must import the following libraries: pandas (for loading the dataset), numpy (for matrix manipulation), matplotlib and seaborn (for visualisation), and sklearn (for building the classifiers). Make sure they are installed before importing them. To import the libraries and load the dataset, use the following code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from pandas.plotting import parallel_coordinates
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
To load the dataset, we can use the read_csv function from pandas (my code also includes the option of loading it through a URL).
data = pd.read_csv('data.csv')
After we load the data, we can take a look at the first couple of rows through the head function:
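# show the first few rows of the dataset
data.head()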
We can now divide the dataset into two parts: training and testing. In general, we should also have a validation set, used to evaluate the performance of each classifier and to fine-tune the model parameters in search of the best model, with the test set reserved for final reporting. However, because this dataset is small, we can simplify the process by letting the test set double as the validation set.
In addition, I used a stratified hold-out approach to estimate model accuracy. Cross-validation is another method for reducing bias and variance.
train, test = train_test_split(data, test_size=0.4, stratify=data['species'], random_state=42)
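As a minimal sketch of the cross-validation alternative (assuming the four measurement columns as features and 'species' as the target; the Gaussian Naive Bayes model here is just a placeholder):

from sklearn.model_selection import cross_val_score

# assumed feature columns and target; adjust to your dataset's names
X = data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y = data['species']
# 5-fold cross-validated accuracy for a placeholder Gaussian Naive Bayes model
print(cross_val_score(GaussianNB(), X, y, cv=5).mean())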
Exploratory Data Analysis
Now that we have split the dataset, we can explore the training data. Matplotlib and seaborn both provide excellent plotting tools that we can use for visualisation.
Let's start by making some univariate plots with a histogram for each feature:
n_bins = 10
fig, axs = plt.subplots(2, 2)
axs[0,0].hist(train['sepal_length'], bins=n_bins)
axs[0,0].set_title('Sepal Length')
axs[0,1].hist(train['sepal_width'], bins=n_bins)
axs[0,1].set_title('Sepal Width')
axs[1,0].hist(train['petal_length'], bins=n_bins)
axs[1,0].set_title('Petal Length')
axs[1,1].hist(train['petal_width'], bins=n_bins)
axs[1,1].set_title('Petal Width')
# add some spacing between subplots
fig.tight_layout(pad=1.0)
It's worth noting that for both petal length and petal width, there appears to be a group of data points with lower values than the rest, suggesting that there may be distinct groups in this data.
Let's try some side-by-side box plots next:
fig, axs = plt.subplots(2, 2)
fn = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
cn = ['setosa', 'versicolor', 'virginica']
sns.boxplot(x='species', y='sepal_length', data=train, order=cn, ax=axs[0,0])
sns.boxplot(x='species', y='sepal_width', data=train, order=cn, ax=axs[0,1])
sns.boxplot(x='species', y='petal_length', data=train, order=cn, ax=axs[1,0])
sns.boxplot(x='species', y='petal_width', data=train, order=cn, ax=axs[1,1])
# add some spacing between subplots
fig.tight_layout(pad=1.0)
The two plots at the bottom suggest that the group of data points with smaller petal measurements we saw earlier are setosas. Their petal measurements are smaller and less spread out than those of the other two species. Comparing the remaining two, versicolor has lower average values than virginica.
Another type of visualisation is the violin plot, which combines the advantages of both the histogram and the box plot:
sns.violinplot(x="species", y="petal_length", data=train, size=5, order = cn, palette = 'colorblind');
Gaussian Naive Bayes Classifier
Naive Bayes is a popular classification model. It is called "naive" because it makes a key assumption of class-conditional independence: given the class, each feature's value is assumed to be independent of every other feature's value (read more here).
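In symbols, for features x_1, ..., x_p and class y, the assumption is that the class-conditional density factorises as P(x_1, ..., x_p | y) = P(x_1 | y) × ... × P(x_p | y), so each feature contributes independently given the class.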
We know that this is not the case here, as evidenced by the high correlation between the petal features. Let's look at the test accuracy using this model to see if this assumption is sound:
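The original snippet isn't shown here; a minimal sketch of the fit, assuming the four measurement columns as features and 'species' as the target, would be:

# separate features and labels in the train and test sets
X_train = train[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y_train = train['species']
X_test = test[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
y_test = test['species']

# fit a Gaussian Naive Bayes model and score it on the test set
mod_gnb = GaussianNB().fit(X_train, y_train)
acc = metrics.accuracy_score(y_test, mod_gnb.predict(X_test))
print('The accuracy of the Gaussian Naive Bayes Classifier on test data is {:.3f}'.format(acc))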
The accuracy of the Gaussian Naive Bayes Classifier on test data is 0.933
What about the result if we only use the petal features:
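A sketch of the same fit restricted to the two petal features (reusing the splits above; variable names are illustrative):

# refit using only petal length and petal width
petals = ['petal_length', 'petal_width']
mod_gnb_2 = GaussianNB().fit(X_train[petals], y_train)
acc = metrics.accuracy_score(y_test, mod_gnb_2.predict(X_test[petals]))
print('The accuracy of the Gaussian Naive Bayes Classifier with 2 predictors on test data is {:.3f}'.format(acc))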
The accuracy of the Gaussian Naive Bayes Classifier with 2 predictors on test data is 0.950
Interestingly, using only two features results in more correctly classified points, implying that using all features may result in over-fitting. Our Naive Bayes classifier appears to have done a good job.
Linear Discriminant Analysis (LDA)
If we model the class-conditional density with a single multivariate Gaussian distribution rather than a product of univariate Gaussians (as in Naive Bayes), we get an LDA model (read more here). The key assumption of LDA is that all classes share the same covariance matrix. We can examine the test accuracy using all features and then only the petal features:
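A sketch of both fits, reusing the splits from the Naive Bayes step (variable names are illustrative):

# LDA with all four features
mod_lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
acc = metrics.accuracy_score(y_test, mod_lda.predict(X_test))
print('The accuracy of the LDA Classifier on test data is {:.3f}'.format(acc))

# LDA with only the two petal features
mod_lda_2 = LinearDiscriminantAnalysis().fit(X_train[petals], y_train)
acc = metrics.accuracy_score(y_test, mod_lda_2.predict(X_test[petals]))
print('The accuracy of the LDA Classifier with two predictors on test data is {:.3f}'.format(acc))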
The accuracy of the LDA Classifier on test data is 0.983
The accuracy of the LDA Classifier with two predictors on test data is 0.933
The use of all features improves the test accuracy of our LDA model.
We can use our LDA model fitted with only the petal features and plot the test data to visualise the decision boundary in 2D:
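The exact plotting code isn't shown in the source; one hedged way to draw the boundary is to predict the two-feature model over a grid of petal measurements, shade the regions, and overlay the test points:

# build a fine grid over the petal feature space
x_min, x_max = X_test['petal_length'].min() - 0.5, X_test['petal_length'].max() + 0.5
y_min, y_max = X_test['petal_width'].min() - 0.5, X_test['petal_width'].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02), np.arange(y_min, y_max, 0.02))

# predict a species for every grid point and map the string labels to integer
# codes for contour plotting (sklearn may warn about missing feature names here)
Z = mod_lda_2.predict(np.c_[xx.ravel(), yy.ravel()])
Z = np.searchsorted(mod_lda_2.classes_, Z).reshape(xx.shape)

# shade the decision regions and overlay the test points by true species
plt.contourf(xx, yy, Z, alpha=0.3)
for species in cn:
    subset = test[test['species'] == species]
    plt.scatter(subset['petal_length'], subset['petal_width'], label=species)
plt.xlabel('petal_length')
plt.ylabel('petal_width')
plt.legend()
plt.show()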
Three virginica and one versicolor test points are misclassified.