
Supervised Learning in AI

Learning

  1. Supervised Learning
    1. Classification Tasks
    2. Nearest Neighbor Classification
    3. K Nearest Neighbor Algorithm
    4. Perceptron Learning
    5. Support Vector Machine
    6. Regression Tasks
    7. Loss Functions
    8. Overfitting
    9. Regularization
  2. Scikit-Learn
    1. K Nearest Neighbor Implementation
    2. Perceptron Algorithm Implementation
    3. Support Vector Machine Algorithm Implementation
    4. Regression Implementation

Machine Learning

In the Artificial Intelligence tutorial so far, we have learned how to use AI to solve a particular problem by giving it a set of instructions: how to search for a solution and how to satisfy certain constraints in order to produce the output for a given input.

We will now learn about Machine Learning, which refers to the idea of not giving a computer a set of instructions. Rather, we give the computer information in the form of data, from which it can learn and figure out on its own how the task should be performed.

So, according to the Oxford Dictionary:

Machine Learning is the use of computer systems in a way that they can learn and figure out the solutions by themselves with the help of data, rather than following a set of instructions.

Various algorithms and statistical models are used to analyze the system's data, find patterns in it, and draw results from those patterns.

Machine Learning is a very wide field with many different forms, and in this tutorial we will learn about some foundational algorithms used in different areas of Machine Learning.

Supervised Machine Learning

In Supervised Machine Learning, the data given to the machine consists of input-output pairs, and the system figures out a function that maps inputs to outputs. So, we provide the computer with a large amount of data that includes both inputs and outputs, and the system learns to predict the output for a given input.

In other words, the system trains its model on the given data, figures out the relationships between input and output, and based on those relationships predicts the output for a given input.

Supervised Machine Learning problems can be classified into two types:

Classification Problem: A task is called a classification task when the output variable is a discrete category, for example 0 or 1.

For Example:

A task to predict whether the ball is blue or red.

A task to predict whether the person has diabetes or not.

Regression problem: A task is called a Regression task when the output variable is some real value.

For Example,

A task to predict the price of the house.

A task to predict the weight of a person.

Classification Tasks

Now, let's take an example of a supervised learning classification task: predicting whether it's going to rain or not.

For this task, we will provide our system with data collected by a human, which contains details like the days when it rained and the days when it didn't. We can structure the data as in the table below:

Date     Humidity    Pressure    Rain
Jan 1    93%         999.7       Rain
Jan 2    49%         1015.5      No Rain
Jan 3    79%         1031.1      No Rain
Jan 4    65%         984.9       Rain
Jan 5    85%         975.2       Rain

In the table above, Date, Humidity, and Pressure are input variables, also known as independent variables, and the last variable, Rain, is the output variable, also known as the dependent variable, since its value depends on the other variables.

The system will use the independent variable to predict the dependent variable. Mathematically they can be written as:

f (humidity, pressure)

f (93,999.7) = Rain

f (49, 1015.5) = No Rain

'f' is a complex function, and it becomes even more complex as the number of variables inside it grows. So, we try to estimate this function, i.e., calculate a rough approximation of it. For that, we use a Hypothesis Function: it tries to approximate what f does, taking the same inputs and producing the same outputs, 'Rain' or 'No Rain'.

H (humidity, pressure) 

Now, to estimate this function, we can plot all the data points on a graph. In this example we have two variables, so the plot is two-dimensional, but a computer can work with any number of dimensions.

Now, in this graph, blue dots represent 'Rain' and red dots represent 'No Rain' at particular values of humidity and pressure. All these points come from the data given to us. We now have to predict the result for the white dot, which represents a value of humidity and pressure at which we have to predict whether it will rain or not.

We can see that the white dot is much closer to the blue dots, so we will classify it as 'Rain', since the blue dots represent Rain.

These kinds of classification algorithms are very popular in Machine Learning, and they are known as nearest-neighbor classification.

Nearest-neighbor classification

This algorithm chooses the class (category) based on the single nearest data point to the given input data point, as in the example above, where the white dot was classified as blue based on its nearest neighbor. This algorithm can work well sometimes, but now let's consider the situation in which the white dot is at a different position, as shown in the graph below.

Now, if we follow the same strategy, the white dot should be colored red because, according to the nearest-neighbor algorithm, the nearest data point is red. But if we look at the bigger picture, there are more blue points than red ones close to the white dot, so predicting the output in this situation is difficult. This limitation can be addressed by considering multiple data points instead of just the single closest one, which is what K-nearest-neighbor classification does.

K-nearest-neighbor classification

This algorithm chooses the class (category) based on the k nearest data points to a given input data point, where 'k' is the number of data points to consider while choosing the class. The value of 'k' is chosen by the programmer.

For example, if we choose k = 5, the 5 nearest neighbors are considered; if 3 of them say the output is 'Rain' and 2 of them say 'No Rain', the final output will be 'Rain'.
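
The sketch below implements this voting in plain Python (not the scikit-learn version used later in this tutorial); the humidity/pressure values and the query point are made up purely for illustration.

 # Minimal k-nearest-neighbor sketch: the k closest training points vote on the class.
 import math
 from collections import Counter

 def knn_predict(training_data, query, k=5):
     # Distance from the query point to every training point.
     distances = [(math.dist(features, query), label) for features, label in training_data]
     distances.sort(key=lambda pair: pair[0])
     # Majority vote among the k closest points.
     votes = [label for _, label in distances[:k]]
     return Counter(votes).most_common(1)[0][0]

 # Hypothetical (humidity %, pressure) readings labelled 'Rain' or 'No Rain'.
 data = [((93, 999.7), 'Rain'), ((49, 1015.5), 'No Rain'),
         ((79, 1031.1), 'No Rain'), ((65, 984.9), 'Rain'),
         ((85, 975.2), 'Rain')]
 print(knn_predict(data, (70, 990.0), k=3))   # prints the majority class of the 3 nearest points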

One drawback of this nearest-neighbor approach is that it is time-consuming, since distances must be calculated from every data point before the closest ones are selected.

The time-consumption problem can be eased with data structures designed to speed up nearest-neighbor search, but even so, there are a number of different algorithms we can apply to a particular type of problem, and we will look at a few of them in this tutorial.

A lot of Machine Learning research is about finding the best algorithm for a particular kind of problem, as every algorithm has its own advantages and disadvantages, and the results and performance of every machine learning algorithm depend on the type of problem and data.

Now, we will look at a different kind of approach to the above problem, in which we try to define some kind of decision boundary; this approach is called perceptron learning.

Perceptron Learning

In this approach, we create a decision boundary separating the different categories. With two-dimensional data, this boundary is a line separating the two classes; in the above example, a line can be drawn which separates the blue and red data points, as we can see in the graph below. If the input data point for which we have to make a prediction falls on the red side, the output will be 'No Rain'; otherwise, 'Rain'.

The limitation of this approach is that real-world data is usually messy, and it is difficult to separate the data with any straight line. So, we often try to draw the best-fitting line, which separates the maximum number of data points correctly while ignoring some of the misfits.

To define the decision boundary, first, we will estimate the hypothesis function 'h'.

Let's take the above example of predicting whether it will rain or not, so again, in this case, we have two variables, which are:

  • Humidity = x1
  • Pressure = x2

So, the hypothesis function will be h(x1, x2), which will predict the output, i.e., whether it will rain or not, by measuring the location of the input data point.

If we represent the condition on the basis of which the hypothesis function takes the decision, it will look like this:

The decision boundary is defined by the linear mathematical expression w0 + w1x1 + w2x2 ≥ 0, where w0, w1, and w2 are weights whose values are decided by the algorithm itself; these weights determine the slope and position of the line, called the decision boundary, which separates the different classes.

Instead of using category names like 'Rain' or 'No Rain', the computer deals with numbers, so they are coded as '1' and '0' respectively.

The weights should be chosen in such a way that if the linear expression yields a value of 0 or more, the output is 1 (Rain), and otherwise 0 (No Rain).

Oftentimes this linear expression is represented with vectors (a vector is a sequence of numbers; in Python, it can be stored in a list or a tuple).

Considering the above example, two vectors are produced:

  • Weight vector w: (w0,w1,w2)
  • Input vector x: (1,x1,x2)

Each weight is multiplied by the corresponding input value: w0 by 1, w1 by x1, and so on.

w0 is just a bias weight, which shifts the overall value obtained by multiplying the weights with the input variables. We put a 1 in the input vector x so that both vectors have the same length and the bias is included in the multiplication.

This is where the data matters in supervised machine learning: every data point is looked at, often multiple times, in order to figure out a weight vector that estimates the output accurately.

The formula for updating the weights is:

wi ← wi + α (y − hw(x)) xi

In this formula, an additional expression is added to the original value of the weight.

In this expression, 'y' is the actual value, and hw(x) is the hypothesis for x, i.e., the estimate. If the estimated value equals the actual value, the difference is zero and there is no change in wi.

'α' (alpha) is the learning rate, i.e., the scale by which the weight values change each time they are updated. The best value of alpha depends on the type of problem; sometimes a higher value is helpful and sometimes a lower one.

To understand this formula better, consider a situation in which the estimated value is greater than the actual value. In that case we need to decrease the weights, and indeed the additional expression (y − hw(x)) is negative, so the weight value is decreased, and vice versa.
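
Below is a minimal sketch of this update rule in plain Python (not the scikit-learn Perceptron class used later); the starting weights, learning rate, and data points are invented for illustration.

 # Minimal perceptron-learning sketch using the rule wi <- wi + alpha * (y - h) * xi.
 def predict(weights, x):
     # Hard threshold: 1 (Rain) if w0*1 + w1*x1 + w2*x2 >= 0, else 0 (No Rain).
     total = sum(w * xi for w, xi in zip(weights, x))
     return 1 if total >= 0 else 0

 def train(data, alpha=0.1, epochs=10):
     weights = [0.0, 0.0, 0.0]                # w0 (bias), w1, w2
     for _ in range(epochs):
         for x, y in data:                    # x = (1, x1, x2), y = actual label (1 or 0)
             h = predict(weights, x)
             # No change when the prediction is correct, since (y - h) is zero.
             weights = [w + alpha * (y - h) * xi for w, xi in zip(weights, x)]
     return weights

 # Hypothetical scaled (1, humidity, pressure) inputs with 1 = Rain, 0 = No Rain.
 data = [((1, 0.93, -0.5), 1), ((1, 0.49, 0.8), 0), ((1, 0.85, -0.9), 1)]
 print(train(data))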

After the training process is completed, i.e., after proper values for all the weights have been calculated, we obtain a threshold function: below the threshold value the output is '0', and once the estimated value crosses the threshold the output is '1'.

This type of function only gives two outputs, either 1 or 0 and is called Hard Threshold. It fails to express uncertainty.

In order to express uncertainty, we can use the Logistic Function, which provides a soft threshold. Uncertainty here simply means the probability of a particular event.

The logistic function gives a value between 0 and 1: the closer the value is to 1, the more likely it is to 'Rain', and vice versa. For example:

  • If a probability of 0.95 is obtained, it is more likely to 'Rain'.
  • If a probability of 0.11 is obtained, it is more likely to be 'No Rain'.
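
As a quick illustration of the soft threshold, the short sketch below applies the standard logistic (sigmoid) function to a weighted sum; the input values are placeholders chosen to reproduce the probabilities above.

 # Logistic (sigmoid) function: maps any weighted sum to a value between 0 and 1.
 import math

 def logistic(z):
     return 1.0 / (1.0 + math.exp(-z))

 # Hypothetical weighted sums (w0 + w1*x1 + w2*x2) for two input points.
 print(logistic(3.0))    # about 0.95 -> very likely 'Rain'
 print(logistic(-2.1))   # about 0.11 -> very likely 'No Rain'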

Support Vector Machine

The Support Vector Machine (SVM) is a very popular approach to training on a dataset. The idea behind the SVM is that many decision boundaries could be drawn, but the aim is always to come up with the best one. For example, consider this set of data, where yellow and blue points represent outputs 0 and 1, respectively.

There are various lines by which we can separate the yellow and blue data points. In the first graph, the points are separated by a vertical decision boundary; in this case, if a new input point falls on the yellow side of the line even though it is actually closer to the blue data points, it will still be declared yellow according to this decision boundary, when practically it should be blue. So we can conclude that this decision boundary is not accurate and points near it can be incorrectly classified; the same problem exists with the diagonal decision boundary drawn in the second graph.

So, to solve this problem, SVM uses the Maximum Margin Separator approach.

Support Vector Machines are designed to find the maximum margin separator, which is the boundary that maximizes the distance to the nearest data points on either side. As we can see in the third graph, this decision boundary has the maximum distance from the data points on both sides. It is found by identifying the support vectors, i.e., the data points closest to the line, and then maximizing the distance to those points. This approach works well in two dimensions but is also capable of working in higher dimensions, where a hyperplane is found; a hyperplane is likewise a kind of decision boundary that separates one set of data from another.

This algorithm is also useful for data that is not linearly separable: an SVM can represent decision boundaries in more than two dimensions as well as decision boundaries that are not linear, as shown in the figure below.
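
Here is a short sketch of how a maximum margin classifier can be trained with scikit-learn's SVC class; the toy points and labels below are made up, and real data would need the same preprocessing used elsewhere in this tutorial.

 # Sketch: training a support vector machine on toy data with scikit-learn.
 from sklearn.svm import SVC

 # Hypothetical 2-D points: class 0 (yellow) and class 1 (blue).
 X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 6], [6, 7]]
 y = [0, 0, 0, 1, 1, 1]

 clf = SVC(kernel='linear')      # a linear kernel looks for the maximum margin separator
 clf.fit(X, y)
 print(clf.support_vectors_)     # the points closest to the decision boundary
 print(clf.predict([[2, 2], [6, 6]]))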

To summarize the algorithms we have seen so far: nearest-neighbor classification, perceptron learning, and Support Vector Machines. There are many more algorithms for solving classification problems; every algorithm has its own advantages and disadvantages and performs differently on every problem, i.e., the performance of an algorithm depends entirely on the dataset used and on the problem itself.

Regression Task

Regression is a supervised machine learning task where the function is defined to map inputs to outputs, and the output value is a continuous value, i.e., some real values.

For Example, predicting the selling price of any House.

It is different from the classification task. In a classification problem, output value is discrete, like categories (Rain or No Rain).

We can understand the regression task more thoroughly by considering an example:

Predicting the effect of advertising on the sales of a product, i.e., how advertising dollars spent translate into sales for the company's product.

We can define a function that takes as input the amount of money spent on advertising. We could take any number of inputs, depending on the data we are dealing with; here we are considering just one.

 f(advertising) = Sales

‘advertising' is the input that represents the amount of money spent on advertising.

‘sales’ represent the sales amount.

For Example

f(1000) = 6000

f(2500) = 12500

Now, as in the classification task, we estimate a hypothesis function, which will produce a real value for the amount of sales given an input amount spent on advertising in a month, week, or day, whatever unit of time we choose to measure things in.

To solve this type of problem, we can use linear regression.

As we can see in the graph above, we have plotted all the data points. On the x-axis, we have advertising, and on the y axis we have sales, then we try to draw a line that does a good job in estimating the relationship between sales and advertising.

In this case, unlike before in the classification tasks, we will not separate the data into categories, but we will just try to find the best line that approximates the relationship between advertising and sales parameters.

Then, for any advertising budget, the number of sales can be predicted with the help of this line.

Now the question arises: how do we evaluate these approaches and hypothesis functions? Every algorithm gives us some sort of hypothesis function that maps inputs to outputs, and we want to know how good that function is.

Evaluating these hypotheses can be framed as an optimization problem.

In optimization problems, we either try to maximize the objective function by finding the global maximum, or we try to minimize some cost function by finding the global minimum.

When evaluating a hypothesis, similar to minimizing the cost function in an optimization problem, we can try to minimize a loss function.

Loss Functions

A loss function tells us how poorly our hypothesis function performs. It is like a loss of utility: whenever a wrong prediction is made, that loss of utility is added to the output of the loss function.

The loss function is the mathematical way of estimating the difference between the predicted output and actual output for some given data points.

There are some popular loss functions for classification and regression problems, and they are:

For Classification task

  • 0-1 Loss Function

For Regression Task

  • L1 and L2 Loss Function

0-1 Loss Function – This approach is used when the output is discrete, i.e., for classification tasks. The loss function takes the actual output and the predicted output as inputs; if the actual and predicted values are the same, the loss value is '0', and if they are not, the loss value is '1'.

L(actual, predicted) = 0 if actual = predicted, and 1 otherwise

Then the goal in these situations is to minimize the loss function value.

For Example, let's consider the Example of Rain and No Rain.

Suppose in the graph above, we can see two blue data points marked as 1 and two yellow data points marked as 1. It means data points that are marked as 1 are incorrectly classified.

Each data point in this graph has some label like Rain or No Rain, so we can compare it with our predictions and assign it numerical value 0 if predictions are correct and 1 if predictions are wrong, and then total empirical Loss is calculated.

In this case above, we have a total loss of 4.
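
A tiny sketch of the 0-1 loss is shown below: it simply counts the misclassified points, so four wrong predictions give a total loss of 4. The labels used here are placeholders for illustration.

 # 0-1 loss: add 1 for every prediction that differs from the actual label.
 def zero_one_loss(actual, predicted):
     return sum(0 if a == p else 1 for a, p in zip(actual, predicted))

 actual    = ['Rain', 'Rain', 'No Rain', 'No Rain', 'Rain', 'No Rain']
 predicted = ['Rain', 'No Rain', 'Rain', 'No Rain', 'No Rain', 'Rain']
 print(zero_one_loss(actual, predicted))   # 4 wrong predictions -> loss of 4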

L1 and L2 Loss Functions – These loss functions are used when dealing with real-valued outputs, i.e., regression tasks, for example the problem of predicting the effect of advertising on the sales of a product. In these kinds of problems, we don't focus on whether a prediction is exactly correct or incorrect; instead, we focus on how close or far apart the actual and predicted values are.

For Example, the actual value in sales is 3000 dollars, and the prediction made is 2900 dollars, then this prediction can be considered as a good prediction.

And, if the prediction made is 1000 dollars and the actual value is again 3000 dollars, then this prediction is really bad.

So, while dealing with the regression problem, we want our function to not only take correct and incorrect predictions into consideration but if the predictions are incorrect, then it should also be able to calculate the difference between the actual and predicted values.

L1: L(actual, predicted) = |actual − predicted|

The L1 loss takes the actual and predicted values as input and considers the absolute value of the actual value minus the predicted value; the differences for all data points are added together to calculate the final loss.

Suppose in the graph above all the data points are plotted, with the actual value of sales on the y-axis and advertising on the x-axis, and the line defines the estimate, i.e., the predicted sales for any amount of advertising. The L1 loss is then the sum of all the individual vertical distances between the line and the data points, and the goal is to estimate the line in such a way that this loss is minimized.

There are other loss functions as well. Another popular one is the L2 loss function, which uses the square of the actual minus predicted value and therefore penalizes worse predictions much more harshly than the L1 loss.

L2: L(actual, predicted) = (actual − predicted)²

Suppose we have two data points representing actual sales values for some advertising amount: the first is 1 unit away from the estimated line and the second is 2 units away. The L2 loss penalizes the second data point more harshly, because it squares the distance.
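
The sketch below computes both losses on made-up sales figures so the difference is easy to see: the L2 loss punishes the 2000-dollar miss far more heavily than the 100-dollar miss.

 # L1 and L2 losses over made-up actual vs. predicted sales values.
 def l1_loss(actual, predicted):
     return sum(abs(a - p) for a, p in zip(actual, predicted))

 def l2_loss(actual, predicted):
     return sum((a - p) ** 2 for a, p in zip(actual, predicted))

 actual    = [3000, 3000]
 predicted = [2900, 1000]            # one good prediction and one bad one
 print(l1_loss(actual, predicted))   # 100 + 2000 = 2100
 print(l2_loss(actual, predicted))   # 100**2 + 2000**2 = 4010000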

We can use the L2 loss when we want to treat outliers in our data more strictly; when there are many outliers in the data and we choose to downplay them while modeling, the L1 loss is useful. These decisions depend entirely on the type of data and problem we are dealing with.

Overfitting

Overfitting is one of the biggest and most common problems in machine learning. It happens when, in the course of minimizing the loss, a model fits too closely to a particular dataset and as a result fails to generalize. By generalize, we mean that we want our model not only to predict correct results for the data it has, but also to make accurate predictions for values it has not seen before.

For example, we train our model to predict Rain or No Rain using past data, but our main goal is to predict the future, i.e., whether it will rain or not when the model is given the same kinds of parameters it was trained on, such as pressure and humidity in this particular example.

But, when Overfitting happens, i.e., the model is too closely tied to the data it has, then the model doesn't generalize very well.

We can again consider the two examples: first, predicting whether it will rain or not, and second, predicting the amount of sales for a given advertising spend.

The overfitted model for both of these examples is given below:

The left graph is an overfitted model of the classification task. Here the estimated boundary has perfectly separated all the yellow and blue points in the available data, but the isolated yellow data point might just be an outlier. When making predictions, if a point falls near that yellow outlier, this model will predict it as yellow, i.e., No Rain, even though practically that point is closer to many more blue points and should be categorized as blue, i.e., Rain.

Likewise, Overfitting can happen in Regression tasks as well. On the right side, there is an overfitted model of a regression task. Here the estimated line perfectly fits all the data points, and there is no loss. This model can give the perfect predictions for the data available to it on which it is trained, but again it will not generalize well.

We always try to avoid overfitting, and there are various strategies for doing so; one of them is regularization.

Regularization

As in an optimization problem, there is some cost, and we try to minimize it. So far, for finding an efficient estimate, we have defined that cost as equal to the empirical loss, i.e., the total sum of all the individual losses. But when we minimize only the empirical loss, the result can be an overfitted model.

cost(h) = loss(h)

So, to avoid overfitting, we can add a measure of the complexity of the hypothesis to the cost function.

cost(h) = loss(h) + λ complexity(h)

Here, the complexity of the hypothesis is basically defined as the complexity of the estimated line (decision boundary).

This is the Occam's razor style approach, where preference is given to the simpler decision boundary. The idea is that the simpler decision boundary is probably the better solution and can generalize well, i.e., give accurate predictions to other inputs.

Now a measure of loss and a measure of complexity are both taken into account, and we need some balance between them; that is why the complexity is multiplied by a parameter 'λ' (lambda). A higher value of lambda penalizes more complex hypotheses more heavily, and vice versa. It is up to the machine learning programmer to decide the value of lambda; depending on the data and the problem, the best value may vary, and we may need to experiment with different values to find the right choice.

This process of considering Loss as well as some measure of the complexity is known as Regularization.

We can define regularization as the process of penalizing hypotheses that are more complex and favoring hypotheses that are simpler and more general.
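
One way to picture this in code is the small sketch below, which combines an empirical loss with a complexity penalty scaled by lambda. Using the sum of squared weights as the complexity measure is just one common choice, assumed here for illustration, not the only option.

 # Regularized cost = empirical loss + lambda * complexity(hypothesis).
 def regularized_cost(loss, weights, lam=0.5):
     complexity = sum(w ** 2 for w in weights)   # one common complexity measure
     return loss + lam * complexity

 # Hypothetical: a complex hypothesis with low loss vs. a simpler one with slightly higher loss.
 print(regularized_cost(loss=2.0, weights=[4.0, -3.5, 2.2, 1.8]))   # heavily penalized
 print(regularized_cost(loss=3.0, weights=[1.0, 0.5]))              # lower total cost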

There are other ways to avoid overfitting as well. We can run experiments and check whether our model gives correct predictions on data other than what it was trained on. This is why we do not use all of the data for training the model; instead, we keep some of the data aside for testing. This method is called holdout cross-validation.

Holdout cross-validation is a method in which we split our dataset into a training set and testing set.

Training data: This data contains both input and output information, i.e., the machine has access to both the inputs and the outputs, and it uses this data to train itself and learn the relationships between input and output.

Test data: Here the machine only has access to the input data; the model makes predictions for the given inputs based on the knowledge it gained from the training data, and those predictions are compared with the actual outputs given in the dataset. It is used to test the machine learning model.

One of the disadvantages of holdout cross-validation is that we do not use the whole dataset for training, and a model trained on less information tends to be less accurate.

So, to deal with this issue, we use k-Fold Cross-Validation.

In this technique, we divide the dataset into k sets. Suppose for now we take k=10.

So the dataset will now be divided into 10 equal sets, and our model will be trained and tested 10 times. Suppose we decide to use 9 sets for training and 1 for testing; then a different combination of those sets will be used every time.

For example, k=10

1  2  3  4  5  6  7  8  9  10

And these are the 10 sets of data.

Case 1

Training Data

1  2  3  4  5  6  7  8  9

Testing Data

10

Case 2:

Training Data

1  2  3  4  5  6  7  8  10

Testing Data

9

In the end, we will have 10 different accuracy results for the model, and we can calculate their average to figure out how accurately our model performs. In this way, we train and test on different subsets of the data each time without losing any data, and the number of iterations depends on the value of k we choose.
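
Here is a short scikit-learn sketch of 10-fold cross-validation; the KNeighborsClassifier and the randomly generated X and y arrays are stand-ins for whatever model and dataset you are evaluating.

 # 10-fold cross-validation with scikit-learn: train and test 10 times, then average.
 import numpy as np
 from sklearn.model_selection import cross_val_score
 from sklearn.neighbors import KNeighborsClassifier

 # Hypothetical dataset: 100 samples with 2 features and a binary label.
 rng = np.random.default_rng(0)
 X = rng.normal(size=(100, 2))
 y = (X[:, 0] + X[:, 1] > 0).astype(int)

 model = KNeighborsClassifier(n_neighbors=5)
 scores = cross_val_score(model, X, y, cv=10)   # one accuracy score per fold
 print(scores)
 print(scores.mean())                           # average accuracy over the 10 folds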

Scikit-Learn

If we want to implement any of these supervised (or unsupervised) machine learning techniques in Python, one option is to write all the code from scratch ourselves; another is to use one of the open-source libraries that already implement these algorithms.

One of the most popular libraries is scikit-learn, which lets us take advantage of existing implementations of these algorithms. The library already provides implementations of nearest-neighbor classification, perceptron learning, regression, and so on. So, using this library and the different methods it provides, we can create a machine learning model and test how it works.

To understand the implementation of all the algorithms discussed above, we are going to take an example:

Suppose we own a car showroom, and we have to make a list of the customers most likely to buy a new SUV. To predict whether a customer will purchase the new SUV or not, we will use a dataset which contains the salary and age of the customers along with their past purchase decision, i.e., whether they bought the new SUV or not. Based on that data, we will train our model and make predictions using the different algorithms we have studied so far.

Above is the screenshot of the dataset.

In this dataset, the first two columns are the independent variables, age and estimated salary, and the third column contains the purchase decision value 0 or 1, which is the dependent variable because the purchase decision depends on estimated salary and age.

Now, we will make the Machine Learning model to predict whether the customer will buy the new SUV or not using the KNN algorithm.

Before moving further, you should know some concepts of Data Preprocessing, which will be used in this code.

KNN Algorithm Implementation

First Step- Importing Libraries

Below are some libraries we need in the program; we will use the short forms np, plt, pd, etc., instead of using the whole name.

Numpy means Numerical Python; this library is used for mathematical calculations.

Matplotlib.pyplot is used for plotting graphs etc.

Pandas is used for reading the data from the dataset, and further data handling and data manipulation is done using this library.

 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd

Second Step

Importing the dataset using the pandas library and then dividing the dataset into two variables, 'X' and 'y' using the 'iloc' function. X contains the independent variables, which are 'age' and 'estimated salary' and y contains the dependent variable, which is output.

 # Importing the dataset
 dataset = pd.read_csv('Social_Network_Ads.csv')
 X = dataset.iloc[:, :-1].values
 y = dataset.iloc[:, -1].values 

Third Step:

Now we need to split the dataset into the training set and test set.

For this, we have to import 'train_test_split' from the scikit-learn library, and we have kept the size of the test set at 25% of the whole dataset.

 # Splitting the dataset into the Training set and Test set
 from sklearn.model_selection import train_test_split
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
 print(X_train)
 print(y_train)
 print(X_test)
 print(y_test) 

We can inspect the training and test data simply by printing them.

Fourth Step

This step is feature scaling. It is one of the most important steps while building a machine learning model because, as we can see, there is a huge difference between the ranges of the age and salary values, and if our model tries to learn relationships from such unscaled variables, the results will not be fair or accurate. So, we need to normalize all the values using feature scaling techniques.

Standard Scaler is used to perform Standardisation.

Here, we have defined an object 'sc' as a StandardScaler.

Now, for the data in X_train, we use fit_transform. The 'fit' method computes the scaling parameters from the features in the dataset, and the 'transform' method replaces all the values with newly calculated ones. In standardisation we calculate the mean and standard deviation of the dataset, in this case the whole X_train, and based on these two values all the values in X_train are transformed.

Now here comes the tricky part: we do not call the 'fit' method on X_test, because we don't want to show the testing data to our model. Instead, we use the same scaler to transform the X_test values, which means the same mean and standard deviation are used to transform them.

 # Feature Scaling
 from sklearn.preprocessing import StandardScaler
 sc = StandardScaler()
 X_train = sc.fit_transform(X_train)
 X_test = sc.transform(X_test)
 print(X_train)
 print(X_test)

We can visualize the X_train and X_test values, and we can notice that now all the values in 'age' and 'salary' are of the same scale.

Fifth Step

In this step, we will finally build our machine learning model.

First, we import KNeighborsClassifier from the sklearn.neighbors module. Then we define an object called classifier by calling the KNeighborsClassifier class with some parameters, such as n_neighbors, metric, and p.

'n_neighbors' is the number of neighbors that we want our model to consider.

'metric' and 'p' are the parameters that define how the distance between the points is calculated.

Finally, we train the model on X_train and y_train using the fit method.

The 'fit' method fits the classifier to the training data.

 # Training the K-NN model on the Training set
 from sklearn.neighbors import KNeighborsClassifier
 classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
 classifier.fit(X_train, y_train)

Sixth Step

Next, we test our model on X_test, since it is the unseen data. Only X_test is passed as a parameter to the predict method, because now we are testing our model; later we will compare the predicted results with the actual results in y_test.

Next, we have printed the predicted result and actual results.

 # Predicting the Test set results
 y_pred = classifier.predict(X_test)
 print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1))

We can predict results for any value. Here we have used sc.transform because again, we need to normalize the values of 'age' and 'salary'.

 # Predicting a new result
 print(classifier.predict(sc.transform([[30,87000]])))

Seventh Step

To find out how well our model predicts, we will build the confusion matrix and calculate the accuracy.

# Making the Confusion Matrix
 from sklearn.metrics import confusion_matrix, accuracy_score
 cm = confusion_matrix(y_test, y_pred)
 print(cm)
 accuracy_score(y_test, y_pred) 

Eighth Step

At last, we will plot a two-dimensional graphical visualization of all the data points and decision boundaries made by the KNN Algorithm.

More specifically, we are about to plot a two-dimensional graph with two axes, X and Y: on the x-axis we have the first feature, 'age', and on the y-axis we have the second feature, 'estimated salary', so each point you see on the plot corresponds to a specific customer. On the y-axis the estimated salary scale runs from about 4000 to 14000, and on the x-axis 'age' runs from about 10 to 60; this grid is made with the help of the 'meshgrid' function.

Now, to plot all the data points and decision boundary on the graph, the trick is to apply the predict method on all the data points. Green represents 1, which means the customer has bought the SUV, and Red represents 0, which means the customer hasn't bought SUV.

Steps:

  • Firstly, ListedColormap is used to add colors in the graph and to all the data points.
  • First, we have defined two local variables X_set and y_set, which have the values of X_train and y_train, respectively.
  • Then, by using ‘np.meshgrid', we have created a grid that has salary on the y axis and age on the x-axis.
  • We have used the predict method to categorize each data point. If the predicted result is 0, then the data point is Red, and if it's 1, then the data point is Green.
  • And 'contourf' is used to color the two sides of the decision boundary differently, red and green.
# Visualising the Training set results
 from matplotlib.colors import ListedColormap
 X_set, y_set = sc.inverse_transform(X_train), y_train
 X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 1),np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))
 plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape), alpha = 0.75, cmap = ListedColormap(('red', 'green')))
 plt.xlim(X1.min(), X1.max())
 plt.ylim(X2.min(), X2.max())
 for i, j in enumerate(np.unique(y_set)):
     plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
 plt.title('K-NN (Training set)')
 plt.xlabel('Age')
 plt.ylabel('Estimated Salary')
 plt.legend()
 plt.show() 

In the same way, we can visualize for the test set too.

# Visualising the Test set results
 from matplotlib.colors import ListedColormap
 X_set, y_set = sc.inverse_transform(X_test), y_test
 X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 1), np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))
 plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape), alpha = 0.75, cmap = ListedColormap(('red', 'green')))
 plt.xlim(X1.min(), X1.max())
 plt.ylim(X2.min(), X2.max())
 for i, j in enumerate(np.unique(y_set)):
     plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
 plt.title('K-NN (Test set)')
 plt.xlabel('Age')
 plt.ylabel('Estimated Salary')
 plt.legend()
 plt.show() 

Perceptron Learning Implementation

First Step- Importing Libraries

 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd 

Second Step

Importing the dataset using the pandas library and then dividing the dataset into two variables ‘X’ and ‘y’ using the ‘iloc’ function. X contains the independent variables which are ‘age’ and ‘estimated salary’ and y contains the dependent variable which is output.

 # Importing the dataset
 dataset = pd.read_csv('Social_Network_Ads.csv')
 X = dataset.iloc[:, :-1].values
 y = dataset.iloc[:, -1].values 

Third Step:

Now we need to split the dataset into a training set and test set.

For this we have to import 'train_test_split' from the scikit-learn library, and we have kept the size of the test set at 25% of the whole dataset.

 # Splitting the dataset into the Training set and Test set
 from sklearn.model_selection import train_test_split
 X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
 print(X_train)
 print(y_train)
 print(X_test)
 print(y_test) 

Fourth Step

This step also remains the same as in the KNN implementation.

 # Feature Scaling
 from sklearn.preprocessing import StandardScaler
 sc = StandardScaler()
 X_train = sc.fit_transform(X_train)
 X_test = sc.transform(X_test)
 print(X_train)
 print(X_test)

Fifth Step

In this step, we will finally build our machine learning model.

First, we import Perceptron from the sklearn.linear_model module. Then we define an object called model by creating a Perceptron instance.

Finally, we train the model on X_train and y_train using the fit method.

The 'fit' method fits the model to the training data.

 # Training the Perceptron model on the Training set
 from sklearn.linear_model import Perceptron
 model = Perceptron()
 model.fit(X_train, y_train)

Sixth Step

Next, we test our model on X_test, since it is the unseen data. Only X_test is passed as a parameter to the predict method, because now we are testing our model; later we will compare the predicted results with the actual results in y_test.

Next, we have printed the predicted result and actual results.

 # Predicting the Test set results
 y_pred = model.predict(X_test)
 print(np.concatenate((y_pred.reshape(len(y_pred),1),y_test.reshape(len(y_test),1)),1))

We can predict results for any value; here we use sc.transform because, again, we need to normalize the values of 'age' and 'salary'.

 # Predicting a new result
 print(model.predict(sc.transform([[30,87000]])))

Seventh Step

To find out how well our model predicts, we will build the confusion matrix and calculate the accuracy.

 # Making the Confusion Matrix
 from sklearn.metrics import confusion_matrix, accuracy_score
 cm = confusion_matrix(y_test, y_pred)
 print(cm)
 accuracy_score(y_test, y_pred)

Eighth Step

At last, we will plot a two-dimensional graphical visualization of all the data points and decision boundaries made by the Perceptron Algorithm in the same way as in KNN implementation for the training set and test set.

For Training Set:

 # Visualising the Training set results
 from matplotlib.colors import ListedColormap
 X_set, y_set = sc.inverse_transform(X_train), y_train
 X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 1),  np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))
 plt.contourf(X1,X2,model.predict(sc.transform(np.array([X1.ravel(),X2.ravel()]).T)).reshape(X1.shape),alpha = 0.75, cmap = ListedColormap(('red', 'green')))
 plt.xlim(X1.min(), X1.max())
 plt.ylim(X2.min(), X2.max())
 for i, j in enumerate(np.unique(y_set)):
     plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
 plt.title('Perceptron (Training set)')
 plt.xlabel('Age')
 plt.ylabel('Estimated Salary')
 plt.legend()
 plt.show() 

For Test Set

 # Visualising the Test set results
 from matplotlib.colors import ListedColormap
 X_set, y_set = sc.inverse_transform(X_test), y_test
 X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 1), np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 1))
 plt.contourf(X1,X2,model.predict(sc.transform(np.array([X1.ravel(),X2.ravel()]).T)).reshape(X1.shape), alpha = 0.75, cmap = ListedColormap(('red', 'green')))
 plt.xlim(X1.min(), X1.max())
 plt.ylim(X2.min(), X2.max())
 for i, j in enumerate(np.unique(y_set)):
     plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
 plt.title('Perceptron (Test set)')
 plt.xlabel('Age')
 plt.ylabel('Estimated Salary')
 plt.legend()
 plt.show()

Comparison of the result between KNN and Perceptron models

KNN Model:
  • The accuracy obtained with the KNN algorithm is 93%.
  • The decision boundary of the KNN machine learning model is more accurate, separating all the data points accurately.

Perceptron Model:
  • The accuracy obtained with the Perceptron algorithm is 87%.
  • The decision boundary of the Perceptron machine learning model is straight and does not separate all the data points accurately.

The results of every machine learning model depend on the Algorithm that we have used and the type of dataset. So, any particular algorithm can give good results on one dataset and can perform poorly on another dataset.

In this case, the K-Nearest Algorithm performed better than Perceptron.

Regression Implementation

To understand the regression problems, let’s consider the following dataset:

This is the Salary dataset, which contains the salary and years of experience of an employee.

So, our task is to predict the salary of the employee based on years of experience. 'Years of experience' will be the independent variable, and 'Salary' will be the dependent variable.

Now, we will write the code for Linear Regression and Polynomial Regression Algorithms.

There are other algorithms as well for building the regression model, but here we will only see Linear Regression and Polynomial Regression.

Linear Regression

In Linear Regression, we draw a straight line, and our task is to find the best line, i.e., the one that describes the values of all the data points as accurately as possible.

The whole approach is based on one linear equation, which is mathematically written as:

y = mX + c

It can also be written as:

Yi = b0 + b1Xi

Yi = dependent variable

b0 = constant (intercept)

Xi = independent variable

b1 = coefficient of the independent variable

Now, suppose in the figure above we consider the scale of 1 year, i.e., how much the salary increases in 1 year; if that is 10 thousand, then b1 will be 10 thousand.

And the constant b0 is the starting value of the dependent variable, i.e., its value when the independent variable is zero.

Now, in the second graph above, Yi represents the actual value of a data point, in this case the actual salary for a particular amount of experience, and Yi' represents the predicted value. To find the best fit line, we make the differences between the actual and predicted values as small as possible:

Sum (Yi − Yi')² = minimum

The line that satisfies this condition is called the best fit line.
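
As a quick illustration, numpy's polyfit can find the slope and intercept of the best fit line by minimizing the squared differences between actual and predicted salaries; the experience and salary numbers below are invented for the example.

 # Finding the best fit line salary = b1 * experience + b0 by least squares (toy numbers).
 import numpy as np

 experience = np.array([1, 2, 3, 4, 5])
 salary = np.array([40000, 50000, 61000, 69000, 81000])

 b1, b0 = np.polyfit(experience, salary, deg=1)   # slope and intercept of the best fit line
 print(b1, b0)                                    # roughly 10100 per year and a 29900 base
 predicted = b1 * experience + b0
 print(np.sum((salary - predicted) ** 2))         # the minimized sum of squared differences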

Polynomial Linear Regression

In polynomial linear regression, all the calculations are based on a mathematical equation of the form:

y = b0 + b1x + b2x^2 + ... + bnx^n

Here, the word 'linear' does not describe the relationship between x and y; it refers to the model being linear in its coefficients.

Code of Linear Regression and Polynomial Regression

First Step:

Importing necessary libraries.

 # Importing the libraries
 import numpy as np
 import matplotlib.pyplot as plt
 import pandas as pd

Second Step

Importing the dataset using the pandas library and then dividing the dataset into two variables, 'X' and 'y', using the 'iloc' function. X contains the independent variable, 'Years of Experience', and y contains the dependent variable, 'Salary', which is the output.

 # Importing the dataset
 dataset = pd.read_csv('Salary_Data.csv')
 X = dataset.iloc[:, :-1].values
 y = dataset.iloc[:, -1].values

Third Step:

We will train the Linear Regression model on the whole dataset; we will not split it, as the dataset is small, and no encoding or scaling is required as was done in the classification task, because in this dataset there is only one independent variable and one continuous dependent variable, i.e., Salary, which is to be predicted.

So, first we import 'LinearRegression' from 'sklearn.linear_model'. Then we create an object called lin_reg for the Linear Regression model and finally fit the model on the whole dataset by directly passing 'X' and 'y' as parameters.

from sklearn.linear_model import LinearRegression
 lin_reg = LinearRegression()
 lin_reg.fit(X, y)
 lin_pred=lin_reg.predict(X)

Fourth Step:

In this step, we will train the Polynomial Linear Regression model on the whole dataset.

The first step is to decide the degree, i.e., up to which power of x we want to build the polynomial features for our regression model.

Suppose we choose degree = 10; then the feature matrix will contain x, x^2, x^3, ..., x^10.

Mathematically, it can be written as:

y = b0 + b1x + b2x^2 + ... + b10x^10

So, on the basis of the above equation, the regression curve will be made, and it can fit the data more closely than the plain linear regression model.

  • The first step is to import PolynomialFeatures from Sklearn.processing.
  • The second step is to make an object called 'poly_reg’ for Polynomial Feature with a degree =10 as a parameter.
  • Third step is to 'fit' and 'transform' all the values of 'X' into a new matrix of degree 10.

The 'fit' method computes all the degree terms of 'X', and the 'transform' method replaces the values with the new feature matrix.

  • The fourth step is to create another object called 'lin_reg_2' for the Polynomial Linear Regression and then call its 'fit' method with the new parameters: 'X_poly', which contains the updated matrix of degree 10, and 'y', the continuous 'Salary' output. The model is then trained.
 from sklearn.preprocessing import PolynomialFeatures
 poly_reg = PolynomialFeatures(degree = 10)
 X_poly = poly_reg.fit_transform(X)
 lin_reg_2 = LinearRegression()
 lin_reg_2.fit(X_poly, y)

Fifth Step:

Now, let’s visualize the Linear Regression model.

  • The first step is to plot all the data points, and it is done in the first line using 'X' and 'y'.
  • In second line, decision boundary is drawn, as we know for plotting decision boundary, we need predicted results and ‘lin_pred’ contains the predicted results.
  • Then, further title, X label, Y label are set.
 plt.scatter(X, y, color = 'red')
 plt.plot(X, lin_pred, color = 'blue')
 plt.title('Expected Salary (Linear Regression)')
 plt.xlabel('Experience in years')
 plt.ylabel('Salary')
 plt.show()

We can notice that the decision boundary in the Linear Regression model is straight.

Now, let’s visualize Polynomial Linear Regression model.

  • The first step is to plot all the data points, and it is done in the first line using 'X' and 'y'.
  • In the second line, the regression curve is drawn; for plotting it we need predicted results, and 'lin_reg_2.predict(X_poly)' gives those predicted results. X_poly is used because it contains the polynomial feature matrix, which we need for Polynomial Linear Regression and for its visualization as well.
  • Then, further title, X label, Y label are set.
 plt.scatter(X, y, color = 'red')
 plt.plot(X, lin_reg_2.predict(X_poly), color = 'blue')
 plt.title('Expected Salary (Polynomial Regression)')
 plt.xlabel('Experience in years')
 plt.ylabel('Salary')
 plt.show()

The code given below is just to smooth the regression curve and make it look better; in real-world problems with many variables, we don't usually need to smooth the regression line in this way.

 X_grid = np.arange(min(X), max(X), 0.1)
 X_grid = X_grid.reshape((len(X_grid), 1))
 plt.scatter(X, y, color = 'red')
 plt.plot(X_grid, lin_reg_2.predict(poly_reg.fit_transform(X_grid)), color = 'blue')
 plt.title('Expected Salary (Polynomial Regression)')
 plt.xlabel('Experience in years')
 plt.ylabel('Salary')
 plt.show()

The trick to plotting this smoother curve is that instead of taking only the integers 0, 1, 2, ..., 10 on the x-axis (the number of years of experience), we increase the density of the points by also including 0.1, 0.2, ..., 9.9, i.e., a fine grid of real values for the years of experience.

Sixth Step:

The final step is to predict the estimated salary for 6.5 and 9.5 years of experience using both the Linear Regression model and the Polynomial Linear Regression model.

 lin_reg.predict([[6.5]])
 lin_reg_2.predict(poly_reg.fit_transform([[6.5]]))
 lin_reg.predict([[9.5]])
 lin_reg_2.predict(poly_reg.fit_transform([[9.5]]))

We can clearly notice the difference between the graphs of Linear Regression and Polynomial Linear Regression, as well as the difference in their predicted results.

Key Points

  • Polynomial Linear Regression model looks overfitted because no test set has been used, and the whole dataset has been used to train the model as the dataset was small.
  • Overfitting can be avoided by dividing the dataset into training and test set.