Naïve Bayes Algorithm in C++
Naïve Bayes is a supervised machine learning algorithm for classification tasks such as text classification. The Naïve Bayes algorithm belongs to a family of generative learning algorithms, which model the distribution of inputs for a given class. Naïve Bayes algorithms assume that the features of the input data are independent of the given class, which makes the algorithm make fast and quick predictions accurately.
Naïve Bayes classifiers are simple probabilistic classifiers that make use of Baye's Theorem. This Theorem is based on hypothesis probability, given the data and some prior knowledge. Naïve Bayes makes an assumption that all features are independent of each other in the input data, which is not true in real-world situations. When working on text classification, the Naïve Bayes algorithm is very good because of its good performance and its efficiency.
Naïve Bayes classifiers are the simplest Bayesian network. Still, they have the ability to achieve high accuracy when integrated with kernel density estimation. It is used mainly on text classification that includes a high-dimensional training dataset. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is made of two words, Naïve and Bayes, which can be defined as follows:
- Naïve: It is called Naïve because it makes the assumption that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if the fruit is identified on the basis of color, shape, and taste, then red, spherical, and sweet fruit is identified as an apple. Hence, to identify an apple, each feature is contributed individually without on each other.
- Bayes: It is called Bayes because it is based on the principle of Bayes' Theorem.
Bayes' Theorem:
- Bayes' Theorem is also called Bayes' Rule or Bayes' Law, which is used to find the probability of a hypothesis with previous knowledge. It is based on the conditional probability.
- The formula of Bayes' Theorem is defined as:
P(A|B) = P (B|A) P(A)/P(B)
Where,
- P (A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
- P(B|A) is Likelihood probability: The probability of the evidence given that the probability of a hypothesis is true.
- P (A) is Prior Probability: The probability of the hypothesis before observing the evidence.
- P (B) is marginal probability, which is the probability of evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be explained with the help of the below example:
Consider a dataset of weather conditions and the corresponding target variable "Play". So, using this dataset, we need to classify whether a player should play or not on a particular day given the weather conditions. So, to solve the problem, we need to follow the following step:
- Convert the given dataset into frequency tables.
- Generate a Likelihood table by finding the probability of given features.
- Now, use the Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play?
Solution: To solve this, first consider the below dataset:
Outlook | Play | |
0 | Rainy | Yes |
1 | Sunny | Yes |
2 | Overcast | Yes |
3 | Overcast | Yes |
4 | Sunny | No |
5 | Rainy | Yes |
6 | Sunny | Yes |
7 | Overcast | Yes |
8 | Rainy | No |
9 | Sunny | No |
10 | Sunny | Yes |
11 | Rainy | No |
12 | Overcast | Yes |
13 | Overcast | Yes |
Frequency table for the Weather Conditions:
Weather | Yes | No |
Overcast | 5 | 0 |
Rainy | 2 | 2 |
Sunny | 3 | 2 |
Total | 10 | 5 |
Likelihood table weather condition:
Weather | No | Yes | |
Overcast | 0 | 5 | 5/14= 0.35 |
Rainy | 2 | 2 | 4/14=0.29 |
Sunny | 2 | 3 | 5/14=0.35 |
All | 4/14=0.29 | 10/14=0.71 | , |
Applying Bayes' Theorem:
P(Yes|Sunny) = P(Sunny|Yes) *P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So, P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny) = P(Sunny|No) *P(No)/P(Sunny)
P(Sunny|NO) = 2/4=0.5
P(No)= 0.29
P(Sunny)= 0.35
So, P(No|Sunny) = 0.5*0.29/0.35 = 0.41
So, as we can see from the above calculation P(Yes|Sunny) > P(No|Sunny)
Advantages of Naïve Bayes Classifier:
- Naïve Bayes is one of the fastest and easiest ML algorithms to predict a class of datasets.
- It can be used for Binary as well as Multi-class Classifications.
- It performs well in multi-class predictions as compared to the other Algorithms.
- It is the most popular choice for,text classification problems.
Disadvantages of Naïve Bayes Classifier:
- Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.
Applications of Naïve Bayes Classifier:
- It is used for,Credit Scoring.
- It is used in,medical data classification.
- It can be used in,real-time predictions because the Naïve Bayes Classifier is an eager learner.
- It is used in Text classification, such as Spam filtering,and,Sentiment analysis.
Types of Naïve Baiyes Model:
There are three types of Naïve Bayes Models, which are defined as follows:
- Gaussian: The Gaussian model makes the assumption that features follow a normal distribution. This means if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution.
- Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomial distributed. It is primarily used for document classification problems; it means a particular document belongs to which category, such as Sports, Politics, education, etc.
- The classifier uses the frequency of words for the predictors.
- Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are the independent Booleans variables, such as if a particular word is present or not in a document. This model is also famous for document classification tasks.
C++ Implementation of Naïve Bayes algorithm:
Code:
Include the required header files, such as iostream and stream, for loading files and string and vector for creating the 2D dynamic array.
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
Code:
An instance will contain the image and its classification, which takes the image and classification. If there is a face, then 1 and if there is a non-face, then 0.
//an instance will contain the image and its classification(face or non-face)
struct Instance
{
vector<string> image;
int classification; //1 if face, 0 if non-face
};
Explanation:
The vector<string> image represents the image associated with the instance. This is the strings vector, where each string is likely to correspond to a row or line of the image. The second member of the structure is a classification of the integer type, which represents 1 as face and 0 as non-face.
Code:
Define the function fill_othertrainingsets for filling the training set, getData, which takes the input as string data, data label in string format and data.
//function prototypes
void fill_othertrainingsets();
void getData(string data, string datalabel, vector<Instance>& dataset);
void count(vector<vector<int>>& count_h_given_f, vector<vector<int>>& count_b_given_f, vector<Instance>& train);
void findProbability(vector<vector<float>>& prob_h_given_f, char h, int f);
void findProbabiltyGiven(vector<vector<float>>& probability, char h, vector<Instance>& traindata);
void classify();
int classifyInstance(Instance& instance, int instance_number);
void printStats();
void calculateProbabilities();
Code:
//global variables
vector<Instance> trainingset(451); //training set consisting of all 451 instances
vector<Instance> trainingset_faces, trainingset_nonfaces; //training set containing faces and nonfaces respectively
vector<Instance> testingset(150); //testing set consisting of all 150 instances
vector<int> testedlabels(150); //vector containing the classification of all test instances
int faces = 217, nonfaces = 234;
//the total probabilities of faces and nonfaces
float prob_face = faces / 451.0, prob_nonface = nonfaces / 451.0;
vector<vector<float>> prob_hash_givenface(70, vector<float>(60));
vector<vector<float>> prob_hash_givennonface(70, vector<float>(60));
vector<vector<float>> prob_blank_givenface(70, vector<float>(60));
vector<vector<float>> prob_blank_givennonface(70, vector<float>(60));
Explanation:
The above code snippet defines the global variable that is used for implementing the Naïve Bayes algorithm,
- the training set is a vector of features or data points that is used for training that is initialized with a size of 451
- the trainingset_faces and trainingset_nonfaces are used to separate the vectors for storing the rows labelled as faces and non-faces, respectively.
- The testing set is a vector of instances for testing that is set to a size of 150.
- Testlabels is a vector that stores the classification labels for the test rows.
- Faces and nonfaces are the number or count of face and non-face rows in the train set.
- Prob_face and prop_nonface are the probabilities of the face or non-face of the rows that are encountered in the training set.
- Prob_has_givenface, prob_hash_givennonface, prob_blank_givenface and prob_blank_givennonface are the 2-dimensional vectors that represent conditional probabilities.
The code snippet takes a dataset of size 451 rows for training the naïve Bayes algorithm and 150 rows for the testing set; the training set is split into labelled instances as faces and non-faces. The probabilities are calculated for faces and non-face instances that are encountered in the training set.
Code:
The naïve Bayes algorithm is implemented in the main function by calling all the functions defined to implement the naïve Bayes algorithm.
int main()
{
//fill the training set with all images and their classification, and calculate the number of faces and non-faces
getData("datatrain/facedatatrainlabels.txt", "datatrain/facedatatrain.txt", training set);
//fill trainingset_faces and trainingset_nonfaces
fill_othertrainingsets();
//fill testing set
getData("datatest/facedatatestlabels.txt", "datatest/facedatatest.txt", testing set);
//calculate all probabilities
calculate probabilities();
//classify new instances and subsequently store them to tested labels vector and also to classified.txt
classify();
//print accuracy and confusion matrix
printStats();
return 0;
}
Explanation:
In the main function, the getData function is called to read the data from the text file, which is named facedatatrainlabels.txt and facedatatest.txt for training and testing purposes; the function fill_othertrainingsets split the training data into two vectors, traingset_faces and trainingset_nonfaces that is based on the labels, the probabilities for training sets are calculated that includes global probabilities of faces and non-faces with the help of calculated probabilities Function, with the help classify function the new instances are classified in the testing set with the help of naïve bayes algorithm after that stores the predicted labels in the vector called tested labels and prints the statistics with the help of printStats function.
Code:
The function below calculates the conditional probabilities for a specific attribute given whether a row is a face or non-face.
void calculateProbabilities()
{
findprobability(prob_hash_givenface, '#', 1);
findProbability(prob_hash_givennonface, '#', 0);
findProbability(prob_blank_givenface, ' ', 1);
findProbability(prob_blank_givennonface, ' ', 0);
}
Explanation:
The above function findProbability evaluates the conditional probabilities based on the presence of specific labels ('#' or ' ') and the class label (1 for face, 0 for non-face). The first line of the function gives the probabilities of the "#" features, given that the instance is a face. The second line gives the probabilities of the "#" feature, given that the instance is a non-face. The third line of the function gives the probabilities of the blank features given that the instance is a face, and the fourth line gives the probabilities of the blank features given that the instance is a non-face.
Code:
The printStats function evaluates and prints the various statistical measures for the naïve Bayes algorithm.
void printStats()
{
int tp = 0, fn = 0, fp = 0, tn = 0;
for (int i = 0; i < testingset.size(); i++)
{
if (testingset[i].classification == 1)
{
if (testedlabels[i] == 1)
{
tp++;
}
else
{
fn++;
}
}
else
{
if (testedlabels[i] == 1)
{
fp++;
}
else
{
tn++;
}
}
}
cout << "Accuracy : " << ((tp + tn) / float(testingset.size())) * 100 << "%\n\n";
cout << "------------- Confusion Matrix -------------\n\n";
cout << "Number of True Positives (Actually Face, Predicted Face) : " << tp << "\n";
cout << "Number of False Negatives (Actually Face, Predicted Non-Face) : " << fn << "\n";
cout << "Number of False Positives (Actually Non-Face, Predicted Face) : " << fp << "\n";
cout << "Number of True Negatives (Actually Non-Face, Predicted Non-Face) : " << tn << "\n";
cout << "\nPrecision : " << (tp / float(tp + fp)) * 100 << "%\n";
cout << "\nRecall : " << (tp / float(tp + fn)) * 100<< "%\n";
cout << "\nF - measure : " << ((2*(tp / float(tp + fn))*(tp / float(tp + fp))) / ((tp / float(tp + fn))+(tp / float(tp + fp)))) * 100 << "%\n\n";
}
Explanation:
In the function printStats, four variables are defined named tp, which stands for true positive (rows that are correctly identified as faces), fn which stands for false negative (rows that are wrongly identified as non-faces but face), fp stands for false positive (rows that are wrongly identified as faces but are actually non-faces), and tn stands for true negatives (rows that are correctly identified as non-faces). Then, a confusion matrix is created, which gives the summary of the performance of the naïve Bayes algorithm by showing the counts of true positive, false negative, false positive, and true negative classifications. Then, four statistical measures are calculated, which are Accuracy, Precision, Recall and F measures and the values of each are printed.
Code:
The classify function is used to apply the trained model to the instances of the testing set.
void classify()
{
ofstream fout("datatest/classified.txt");
for (int i = 0; i < testingset.size(); i++)
{
testedlabels[i] = classifyInstance(testingset[i], i);
fout << testedlabels[i] << "\n";
}
fout.close();
}
Explanation:
The function classifies () opens a file called classified.txt to write the predicted labels, and then it iterates through each instance in the testing set. The classifyInstance() function is used to pass the current instance and the index 'i', which stores the result of the classification.
Code:
The function classifyInstance() is used for classifying instance that represents an image with the help of Naïve Bayes algorithm. The function evaluated probabilities that are based on specific features present ('#' or '') in the image and uses these probabilities to determine whether the instance is classified as face or not.
int classifyInstance(Instance& instance, int instance_number)
{
long double probability_face = 1.0, probability_nonface = 1.0;
long double p_face = 1.0, p_nonface = 1.0;
float m;
if (instance_number == 61)
{
m = 1.2;
}
else
{
m = 1.5;
}
for (int i = 0; i < 70; i++)
{
for (int j = 0; j < 60; j++)
{
if (instance.image[i][j] == '#')
{
p_face = p_face * prob_hash_givenface[i][j] * m;
p_nonface = p_nonface * prob_hash_givennonface[i][j] * m;
}
else
{
p_face = p_face * prob_blank_givenface[i][j] * m;
p_nonface = p_nonface * prob_blank_givennonface[i][j] * m;
}
}
}
probability_face = prob_face * p_face;
probability_nonface = prob_nonface * p_nonface;
if (probability_face > probability_nonface)
{
return 1;
}
else
{
return 0;
}
}
Explanation:
The code initializes a probability of face and the probability of non-face, and it initializes another variable and then adjusts the value of m that is based on the number. The adjustment implementation is specific and may be determined with the help of experimentation or fine-tuning. The function then iterates through the pixels of an image (70 rows and 60 columns). The probabilities are then updated for each pixel, the probabilities based on the presence of specified features ('#' or '') and the conditional probabilities like 'prob_hash_givenface, the probability of a non-blank face and the probability of the given blank face that is given.
Code:
The function fill_othertrainingsets() is used to iterate over the training vector, which contains attributes with image data and their corresponding classification (1 for face, 0 for non-face). It splits these attributes into two different vectors.
void fill_othertrainingsets()
{
for (int i = 0; i < trainingset.size(); i++)
{
if (trainingset[i].classification == 1)
{
trainingset_faces.push_back(trainingset[i]);
}
else
{
trainingset_nonfaces.push_back(trainingset[i]);
}
}
}
Explanation:
The iteration is done with the help of a for loop to iterate over each attribute in the training vector, and for every row, the classification member value is checked. If the current row is classified as a face, the classification = 1 and it is added to the training set faces vector, and If the current row is non-face, the classification == 0 and it is added to the training set non-faces vector.
Code:
The function count is responsible for counting the occurrences of specific attributes ('#' and '') in the images of the instances with the training data.
void count(vector<vector<int>>& count_h_given_f, vector<vector<int>>& count_b_given_f, vector<Instance>& traindata)
{
for (int i = 0; i < traindata.size(); i++)
{
for (int j = 0; j < 70; j++)
{
for (int k = 0; k < 60; k++)
{
if (traindata[i].image[j][k] == '#')
{
count_h_given_f[j][k]++;
}
else
{
count_b_given_f[j][k]++;
}
}
}
}
}
Explanation:
The function takes the three input count_h_given_f, a 2-dimensional vector for storing the counts of the '#' feature given that the instance is facing, count_b_given_f is a 2-dimensional vector that stores the count of the blank space in the image ('') or blank space feature that instance is a face and a traindata that is a vector of instances that represents the training data. A nested loop is run with three loops that iterate through each instance in the training data ('i'), each row of the image ('j'), and each column of the image ('k'). The value of pixels is checked in the innermost loop, the value of pixel in the current position of the image for the current row that is traindata[i].image[j][k]. If the value of the pixel is '#', the count is incremented in the vector 'count_h_given_f', and if the value of the pixel is '' that is blank space, then the count is incremented in the vector count_b_given_f. The main aim of this function is to count the occurrences of specific attributes in the image of the training instances, separately for '#' and ''.
Code:
The function find probability is designed to evaluate the probabilities of encountering specific features at each position in the image.
void findProbabiltyGiven(vector<vector<float>>& probability, char h, vector<Instance>& traindata)
{
float n = traindata.size();
float pseudo = 1.9; //pseudocount taken to be 1.9
vector<vector<int>> count_hash_given(70, vector<int>(60));
vector<vector<int>> count_blank_given(70, vector<int>(60));
count(count_hash_given, count_blank_given, traindata);
if (h == '#')
{
for (int i = 0; i < 70; i++)
{
for (int j = 0; j < 60; j++)
{
probability[i][j] = ((count_hash_given[i][j] + (0.5*pseudo)) / (n + pseudo)); //prior probability is 0.5
}
}
}
else
{
for (int i = 0; i < 70; i++)
{
for (int j = 0; j < 60; j++)
{
probability[i][j] = ((count_blank_given[i][j] + (0.5*pseudo)) / (n + pseudo)); //prior probability is 0.5
}
}
}
}
Explanation:
The function takes three inputs: the probability, which is a 2-dimensional vector that stores the evaluated probabilities for every position in the image, h which is the specific feature for which probabilities are calculated ('#', ''); and a train data, which is a vector of rows that represents the training data. Laplace smoothing is applied for calculating the probability with the help of a pseudocount that helps to avoid zero probabilities and addresses the issue of unseen features. The function takes the help of the count function to calculate the counts of occurrences of '#' and '' for every position in the image, given the class label. On the basis of a specific feature ('#' or ''), the function evaluates the probabilities for each position in the image. The probability is evaluated with the help of the Laplace-smoothed formula that is given by (count + (0.5 * pseudo)) / (n + pseudo), where the count is the number of occurrences for the feature, pseudo is the pseudo count and n total number of instances. The probabilities that are evaluated are stored in the probability vector.
Code:
The function findProbability evaluates the probabilities to the findProbabilityGiven function; the function takes the input as a 2-dimensional vector named probability and a character h that represents a specific feature ('#' or ''), and an integer f that indicates whether to evaluate the probabilities for faces (1) or non-faces(0).
void findProbability(vector<vector<float>>& probability, char h, int f)
{
if (f == 1)
{
findProbabiltyGiven(probability, h, trainingset_faces);
}
else
{
findProbabiltyGiven(probability, h, trainingset_nonfaces);
}
}
Explanation:
This function aims to decide whether to evaluate the probabilities for faces or non-faces and then give the task to the function findProbabilityGiven.
Code:
The function getData is built to read the data from the two files. One file contains the labels, and the other file contains the image data.
void getData(string datalabeltxt, string datatxt, vector<Instance>& dataset)
{
ifstream trainingfile(datalabeltxt);
string line;
vector<int> labels; //this array contains the classifications of the training/testing set
for (int i = 0; getline(training file, line); i++)
{
labels.push_back(stoi(line));
}
trainingfile.close();
trainingfile.open(datatxt);
for (int i = 0; i < dataset.size();i++)
{
dataset[i].classification = labels[i];
}
int i = 0, z = 1;
while (getline(trainingfile, line))
{
dataset[i].image.push_back(line);
z++;
if ((z - 1) % 70 == 0)
{
i++;
}
}
trainingfile.close();
}
Explanation:
The function gives access to open and read the file, which contains the labels of every instance in the dataset. Each line is read, and then each line is converted into a label from a string to an integer with the help of a stop.
When you run the program, the output will be like as given below, which tells the accuracy of the naïve Bayes algorithm and a confusion matrix corresponding to the naïve Bayes algorithm, which gives the true positive value, true negative value, false positive value, false negative value and the three statistical measure such as precision, recall and f-measure.
Output:
Accuracy: 89.3333%
Confusion Matrix
Predicted Face Predicted Non-Face
Actual Face 68 5
Actual Non-Face 11 66
Precision: 86.076%
Recall: 93.1507%
F-measure: 89.4737%
Conclusion:
The naïve Bayes algorithm is a type of supervised machine learning algorithm that works on the basis of Baye's Theorem. The naïve Bayes algorithm makes the assumption that there is independence among features, and that simplifies complex problems. The Naïve Bayes algorithm is commonly used for classification tasks, especially when working with textual data; it evaluates the probabilities of an instance that belongs to a particular class by multiplying the probabilities of the individual attribute of that given class. The naïve Bayes algorithm is effective and particularly used for text classification and spam filtering. This algorithm is efficient in computation and requires minimal training data, making it suitable for various applications.