Spam Filter- Machine Learning

A spam filter is software designed to detect unwanted emails, also known as spam, and separate them from legitimate messages. One of the most effective ways to achieve this is through machine learning.


Machine learning is a method of teaching computers to learn from data without being explicitly programmed. This allows for the creation of algorithms that can identify patterns and make predictions based on input data. In the case of spam filters, machine learning algorithms are trained on a dataset of both spam and legitimate emails, allowing them to learn the characteristics that distinguish the two types of messages.

There are a few different types of machine learning algorithms that are commonly used in spam filters.

  • One popular method is the use of Bayesian filters, which use Bayes' theorem to calculate the probability that an email is spam based on the presence of certain words or phrases; a small worked example follows this list.
  • Another method is the use of support vector machines (SVMs), which can effectively classify emails based on a set of features, such as the sender's email address or the presence of certain words in the subject line.
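
As a rough illustration of the Bayesian idea, the per-word spam probability follows directly from Bayes' theorem. The numbers below are made up purely for the example and are not taken from any real dataset:

# Hypothetical statistics for the word "offer" in a small, made-up training set
p_spam = 0.25                      # prior probability that any email is spam
p_word_given_spam = 0.40           # fraction of spam emails containing "offer"
p_word_given_ham = 0.05            # fraction of legitimate emails containing "offer"

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)           # ~0.73 for these made-up numbers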

One of the key advantages of using machine learning for spam filtering is its ability to adapt to new types of spam. As spamming techniques evolve, traditional rule-based spam filters can quickly become outdated. Machine learning algorithms, on the other hand, are able to learn from new examples and adjust their classifications accordingly.

However, machine learning-based spam filters are not without their limitations. One major issue is the need for a large and diverse dataset of both spam and legitimate emails for training. Without enough data, the algorithm may not be able to accurately distinguish between the two types of messages. Additionally, machine learning models can make mistakes, which can result in false positives (legitimate emails being classified as spam) or false negatives (spam emails slipping through the filter).

Now, using naive Bayes and a labeled dataset of emails from Kaggle, we will build a basic spam filter as an example. We will walk through several preprocessing techniques for text data, followed by feature extraction and a classification model.

Importing Libraries

import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import re
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
%matplotlib inline

Exploratory Data Analysis

emails_files = pd.read_csv('emails.csv')




emails_files.head()

Output:

# Reading one email


emails_files.loc[58, 'text']

Output:

# Total Emails in the dataset
emails_files.shape

Output:

# Class Distribution
emails_files.groupby('spam').count()


# Spam makes up 23.88% of emails, which is plenty for our objective.

Output:

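As a quick cross-check of the class balance quoted in the comment above, the spam share can be computed directly from the label column (assuming the spam column is coded as 0/1, as in this dataset):

print(emails_files['spam'].mean() * 100)   # percentage of emails labelled as spam
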
# Distribution of Spam using graphs
counts_labels = emails_files.spam.value_counts()
plt.figure(figsize = (12,6))
sns.barplot(x = counts_labels.index, y = counts_labels.values, alpha = 0.9)


plt.xticks(rotation = 'vertical')
plt.xlabel('Spam', fontsize =12)
plt.ylabel('Counts', fontsize = 12)
plt.show()

Output:

# Checking whether the length of an email is correlated with spam
emails_files['length'] = emails_files['text'].map(lambda text: len(text))


emails_files.groupby('spam').length.describe()


Output:

# There are some severe outliers in email length; let's restrict to emails shorter than 1800 characters and examine the length distribution.
subset_of_emails = emails_files[emails_files.length < 1800]
subset_of_emails.hist(column='length', by='spam', bins=50)


# Nothing significant here; let's now process the mail's contents to develop a spam filter.

Output:

array([<AxesSubplot:title={'center':'0'}>,

       <AxesSubplot:title={'center':'1'}>], dtype=object)


Data Preprocessing

Since the emails in the dataset are raw, unstructured text, it's important to preprocess them before feature extraction and modeling. Thanks to the nltk package, this preprocessing takes just a few lines of Python code.

Tokenization

Tokenization splits the continuous stream of text into individual tokens, one for each word.

import nltk
nltk.download('punkt')
emails_files['tokens'] = emails_files['text'].map(lambda text: nltk.tokenize.word_tokenize(text))

Output:

# tokenized text from the 1st email


print(emails_files['tokens'][1])

Output:


Removal of Stop Words

Stop words are the most common terms in a language, such as "the," "a," and "as." Let's remove them, since they rarely carry information that is useful to a spam filter.

# Removing stop words
import nltk
nltk.download('stopwords')


words_stops = set(nltk.corpus.stopwords.words('english'))
emails_files['filtered_text'] = emails_files['tokens'].map(lambda tokens: [w for w in tokens if not w in words_stops])

Output:

# Dropping the first two tokens (the leading 'Subject :' prefix) from every email
emails_files['filtered_text'] = emails_files['filtered_text'].map(lambda text: text[2:])


# Comparing the tokenized email with its stop-word-filtered version
print(emails_files['tokens'][3], end='\n\n')
print(emails_files['filtered_text'][3])


# several stop words such as 'the' and 'of' have been removed

Output:

# There are still a lot of special character tokens in emails that might not be useful for spam filters; let's get rid of them.
# forming a string by joining all tokens together
emails_files['filtered_text'] = emails_files['filtered_text'].map(lambda text: ' '.join(text))


# Taking special characters out of every email
emails_files['filtered_text'] = emails_files['filtered_text'].map(lambda text: re.sub('[^A-Za-z0-9]+', ' ', text))

Lemmatization

Lemmatization groups together a word's inflected forms so they can be analyzed as a single item, identified by the word's lemma, or dictionary form. For example, words like "moved" and "moving" are reduced to "move."

import nltk
nltk.download('wordnet')
wnl = nltk.WordNetLemmatizer()
# Lemmatizing each word individually, then rejoining the words into a single string
emails_files['filtered_text'] = emails_files['filtered_text'].map(lambda text: ' '.join(wnl.lemmatize(word) for word in text.split()))

Output:

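Note that WordNetLemmatizer assumes nouns by default, so verb forms are only reduced when a part-of-speech tag is supplied. A quick illustration of this behaviour:

wnl.lemmatize('moving')            # 'moving' (treated as a noun by default)
wnl.lemmatize('moving', pos='v')   # 'move'
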
# Checking email after preprocessing
emails_files['filtered_text'][4]

Output:

# Wordcloud of spam mail
words_spam = ' '.join(list(emails_files[emails_files['spam']==1]['filtered_text']))
wordcloud_spam = WordCloud(width = 512, height = 512).generate(words_spam)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(wordcloud_spam)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

Output:

# Wordcloud of non-spam mail
words_ham = ' '.join(list(emails_files[emails_files['spam']==0]['filtered_text']))
wordcloud_ham = WordCloud(width = 512, height = 512).generate(words_ham)
plt.figure(figsize = (10, 8), facecolor = 'k')
plt.imshow(wordcloud_ham)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()

Output:


Models for Spam Filtering

Now that the text has been cleaned up through preprocessing, let's turn these emails into vectors of numbers using two popular techniques: Bag of Words and TF-IDF. After obtaining a vector for each email, we will use Naive Bayes to build our classifier.

1. Bag of Words

In essence, it represents each email as a vector of word frequencies over the vocabulary of the corpus. As the name implies, a bag of words treats text as an unordered collection of words rather than a sequence. These vectors are easy to build with scikit-learn's CountVectorizer().
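
To make the idea concrete, here is a small illustrative example on two toy sentences (not from the Kaggle dataset; get_feature_names_out assumes scikit-learn 1.0 or later):

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(["free offer buy now", "meeting notes attached now"])
print(toy_vect.get_feature_names_out())   # vocabulary learned from the two sentences
print(toy_counts.toarray())               # one row of word counts per sentence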

count_vect = CountVectorizer()
counts = count_vect.fit_transform(emails_files['filtered_text'].values)


print(counts.shape)

Output:


Naive Bayes Classifier

classifier_mnb = MultinomialNB()
targets = emails_files['spam'].values
classifier_mnb.fit(counts, targets)

Output:

# Prediction
examples_to_predict = ['cheap Viagra', "Forwarding you minutes of meeting"]
example_counts = count_vect.transform(examples_to_predict)
predictions = classifier_mnb.predict(example_counts)


print(predictions)

Output:

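The classifier above is trained on the full dataset, so these predictions give no sense of the false positives and false negatives discussed earlier. Below is a minimal sketch of a held-out evaluation, reusing the counts and targets built above; the split ratio and variable names are illustrative:

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

X_train, X_test, y_train, y_test = train_test_split(counts, targets, test_size=0.2, random_state=42)
nb_eval = MultinomialNB()
nb_eval.fit(X_train, y_train)
y_pred = nb_eval.predict(X_test)

# Rows are the true classes (ham, spam); columns are the predicted classes.
# The off-diagonal cells count false positives and false negatives.
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))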

2. TF-IDF

TF-IDF (term frequency-inverse document frequency) is a numerical statistic intended to reflect how important a word is to a given email relative to the whole corpus. Each email is again represented as a vector, this time containing the tf-idf value of every word. To build these vectors, we'll use scikit-learn's TfidfTransformer() on the counts produced above.
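
For reference, with TfidfTransformer's default settings (smooth_idf=True, norm='l2'), each raw count is weighted as

tf-idf(t, d) = tf(t, d) * (ln((1 + n) / (1 + df(t))) + 1)

where n is the total number of emails and df(t) is the number of emails containing term t, and each email's vector is then L2-normalized.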

tfidf_vect = TfidfTransformer().fit(counts)
tfidf = tfidf_vect.transform(counts)


print(tfidf.shape)

Output:

classifier_mnb = MultinomialNB()
targets_ = emails_files['spam'].values
classifier_mnb.fit(tfidf, targets_)

Output:

#Predictions
examples_to_predict = ['Free Offer Buy now',"Lottery from Nigeria","Please send the files"]
example_counts = count_vect.transform(examples_to_predict)
example_tfidf = tfidf_vect.transform(example_counts)
predictions_tfidf = classifier_mnb.predict(example_tfidf)


print(predictions_tfidf)

Output:


In conclusion, machine learning is a powerful tool for creating effective spam filters. These algorithms can learn to identify patterns and characteristics that distinguish spam from legitimate emails, which allows them to adapt to new types of spam. However, it's important to have a large, diverse dataset for training and understand that machine learning models can also make mistakes.