Handling Imbalanced Data with SMOTE and Near Miss Algorithm in Python

Imbalanced data distribution is a term frequently used in machine learning and data science. It typically occurs when the number of observations in one class is significantly higher or lower than in the other classes. Machine learning algorithms do not take the class distribution into account, since they prefer to improve accuracy by reducing the overall error. Examples of this issue are common in fraud detection, anomaly detection, facial recognition, and similar applications.

Traditional ML methods such as Decision Trees and Logistic Regression favor the majority class and tend to disregard the minority class. Because they tend to predict the majority class almost exclusively, they frequently misclassify the minority class. In other words, if our dataset has an imbalanced distribution of data, our model is more susceptible to situations where the recall for the minority class is zero or extremely low.

Techniques for Handling Imbalanced Data: There are primarily two techniques that are frequently used to handle an imbalanced class distribution.

  1. SMOTE
  2. Near Miss Algorithm

Machine learning often faces difficulties with imbalanced data, particularly when the classes in a dataset are not evenly distributed. Two popular methods for addressing this issue are SMOTE (Synthetic Minority Over-sampling Technique) and the Near Miss Algorithm. While the Near Miss Algorithm keeps only a portion of the majority class to balance the dataset, SMOTE creates synthetic samples for the minority class. The rest of this article demonstrates how to apply both approaches in Python.

Make sure the necessary libraries are installed first. If you don't already have them, you may install them via pip:

pip install numpy scikit-learn imbalanced-learn

SMOTE – Oversampling:

SMOTE (Synthetic Minority Over-sampling Technique) is a popular oversampling method used to address imbalanced datasets. It tackles the class-imbalance problem by creating synthetic samples for the minority class, thereby boosting its representation in the dataset. By doing so, biased predictions can be avoided and machine learning models can perform better on imbalanced data.

SMOTE's fundamental principle is to interpolate between existing minority class samples to produce new synthetic samples. Here is how SMOTE functions:

  1. Identify the minority class: The first step is to identify the underrepresented (minority) class in the dataset.
  2. Select a minority class sample: A sample from the minority class is chosen at random to serve as the basis for generating synthetic samples.
  3. Find the k-nearest neighbors: For the chosen sample, identify its k-nearest neighbors within the minority class (k is often set to 5). A distance metric such as Euclidean distance is typically used to determine the nearest neighbors.
  4. Generate synthetic samples: Compute the difference between the feature values of the chosen sample and each of its k-nearest neighbors, multiply that difference by a random number between 0 and 1, and add the result to the feature values of the chosen sample. This produces synthetic samples along the line segments connecting the chosen sample to its k-nearest neighbors (see the sketch after this list).
  5. Repeat: Keep choosing random samples from the minority class and generating synthetic samples until the required amount of oversampling is reached.
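To make step 4 concrete, here is a minimal NumPy sketch of the interpolation rule described above; the arrays sample and neighbor are made-up illustrative values, not part of any library API:

import numpy as np
rng = np.random.default_rng(0)
# A chosen minority-class sample and one of its k-nearest minority neighbors (illustrative values)
sample = np.array([1.0, 2.0, 3.0])
neighbor = np.array([2.0, 1.5, 3.5])
# Difference between the neighbor and the chosen sample
diff = neighbor - sample
# Scale the difference by a random factor in [0, 1) and add it to the chosen sample,
# giving a synthetic point on the line segment between the two samples
synthetic = sample + rng.random() * diff
print(synthetic)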

SMOTE is available in a number of libraries, including the imbalanced-learn package in Python, and is simple to include in your machine learning workflow for handling imbalanced datasets. A quick example of SMOTE usage in Python is shown below:

from imblearn.over_sampling import SMOTE
# Oversample the minority class in the training data only
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

Some crucial things to think about when using SMOTE:

  • Use SMOTE exclusively on the training data: To prevent information leakage and skewed results, SMOTE should only be applied to the training data, never to the test data.
  • Combine SMOTE with other techniques: SMOTE is just one way to deal with imbalanced data. For better outcomes, consider combining it with undersampling strategies, class weights, alternative algorithms, or ensemble methods.
  • Evaluate model performance: Always assess your model's performance on a separate test set to make sure that the SMOTE oversampling did not result in overfitting or biased predictions.
  • Pick an appropriate value for k: The SMOTE parameter k (number of nearest neighbors) affects the quality of the generated synthetic samples. In practice, k is frequently set to 5, but depending on your dataset you might need to adjust it (see the sketch below).
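In the imbalanced-learn implementation, this neighbor count is exposed through SMOTE's k_neighbors parameter (it defaults to 5). A minimal sketch of adjusting it, assuming X_train and y_train are already defined:

from imblearn.over_sampling import SMOTE
# Use 3 nearest neighbors instead of the default 5 when interpolating synthetic samples
smote = SMOTE(k_neighbors=3, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)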

Keep in mind that SMOTE is not a one-size-fits-all approach and that it may or may not be useful, depending on the particular dataset and issue at hand. To choose the optimal method for handling unbalanced data in your specific situation, it is critical to understand the properties of your data and experiment with several strategies.

Near Miss Algorithm – Undersampling:

Undersampling methods like the Near Miss Algorithm are employed to address imbalanced datasets. Whereas oversampling approaches increase the number of samples in the minority class, undersampling strategies balance the dataset by reducing the number of samples in the majority class.

The Near Miss Algorithm identifies the majority class samples that lie near the decision boundary, i.e., those most similar to the minority class samples, and keeps only these while discarding the rest of the majority class. Retaining just these informative boundary samples shrinks the majority class without throwing away the region where the two classes meet, which can improve the performance of machine learning models.

The Near Miss Algorithm comes in three variants, known as NearMiss-1, NearMiss-2, and NearMiss-3. Each version uses a different criterion for deciding which majority class samples to keep and which to discard. An overview of each version is given below:

  • NearMiss-1: Keeps the majority class samples whose average distance to their k nearest minority class samples is the smallest, so the retained majority samples concentrate near the class boundary (a simplified sketch of this criterion follows the list).
  • NearMiss-2: Keeps the majority class samples whose average distance to their k farthest minority class samples is the smallest. This version tends to retain majority samples that are close to the minority class as a whole and to discard outlying, noisy ones.
  • NearMiss-3: Works in two steps: first, for each minority class sample, a fixed number of its nearest majority class neighbors are short-listed; then, among the short-listed samples, those with the largest average distance to their nearest minority neighbors are kept. This version aims to ensure every minority sample is covered by some majority samples while keeping the retained set small.
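As a rough illustration of the NearMiss-1 criterion (a simplified sketch, not the library's implementation), the code below uses scikit-learn's NearestNeighbors to keep the majority samples with the smallest mean distance to their k nearest minority samples; X_majority, X_minority, and n_keep are hypothetical inputs:

import numpy as np
from sklearn.neighbors import NearestNeighbors
def near_miss_1(X_majority, X_minority, n_keep, k=3):
    # Index the minority class so nearest neighbors can be queried efficiently
    nn = NearestNeighbors(n_neighbors=k).fit(X_minority)
    # Distances from every majority sample to its k nearest minority samples
    distances, _ = nn.kneighbors(X_majority)
    # Keep the n_keep majority samples with the smallest mean distance (closest to the boundary)
    keep = np.argsort(distances.mean(axis=1))[:n_keep]
    return X_majority[keep]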

The Near Miss Algorithm is available in several libraries, including the imbalanced-learn package in Python. It can be used for undersampling as follows:

from imblearn.under_sampling import NearMiss
# Undersample the majority class using the NearMiss-1 criterion
near_miss = NearMiss(version=1)
X_train_near_miss, y_train_near_miss = near_miss.fit_resample(X_train, y_train)

When utilizing the Near Miss Algorithm, keep the following things in mind:

  1. Apply Near Miss solely to the training data, as with SMOTE, to prevent information leakage and skewed evaluations.
  2. Better outcomes might be obtained by combining undersampling with additional strategies such as oversampling, other algorithms, or class weights (see the pipeline sketch after this list).
  3. To check that the undersampling with Near Miss has not resulted in the loss of crucial information or biased predictions, evaluate the model's performance on a separate test set.
  4. Which version of Near Miss (NearMiss-1, NearMiss-2, or NearMiss-3) to use depends on the particular problem and your dataset. Experiment with the different versions to determine which works best for your data.
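One possible way to combine resampling steps with a model, as mentioned in point 2, is imbalanced-learn's Pipeline, which applies the samplers only while fitting so the test data is never resampled. A minimal sketch, assuming X_train and y_train are already defined and using LogisticRegression purely as an illustrative classifier:

from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from sklearn.linear_model import LogisticRegression
# Oversample the minority class to half the majority size, then undersample the majority
# class down to the new minority size, then fit a classifier on the balanced data
pipeline = Pipeline([
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
    ('near_miss', NearMiss(version=1)),
    ('model', LogisticRegression(max_iter=1000))
])
pipeline.fit(X_train, y_train)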

In general, the Near Miss Algorithm is a helpful tool for coping with imbalanced datasets by reducing the number of samples from the majority class, but like any technique, its performance depends on the properties of the data and the learning task at hand. Always consider trying several class-imbalance handling techniques to find the most effective fix for your particular problem.

Now let's examine how to handle imbalanced data with SMOTE and the Near Miss Algorithm end to end:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from collections import Counter
# Create an imbalanced binary dataset: roughly 90% majority and 10% minority samples
X, y = make_classification(n_classes=2, class_sep=2, weights=[0.9, 0.1],
                           n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)
# Hold out a test set before any resampling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Class distribution in the original data:", Counter(y_train))
# Oversample the minority class with SMOTE
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
print("Class distribution after SMOTE:", Counter(y_train_smote))
# Undersample the majority class with NearMiss-1
near_miss = NearMiss(version=1)
X_train_near_miss, y_train_near_miss = near_miss.fit_resample(X_train, y_train)
print("Class distribution after Near Miss:", Counter(y_train_near_miss))

Output:

Class distribution in the original data: Counter({0: 716, 1: 84})
Class distribution after SMOTE: Counter({0: 716, 1: 716})
Class distribution after Near Miss: Counter({0: 84, 1: 84})
  • In this example, we first create an imbalanced dataset using make_classification, with only 10% of the samples belonging to the minority class. We then split the data into training and testing sets.
  • Next, the minority class is oversampled using SMOTE, which generates synthetic samples that balance the training data. This is done with the SMOTE object's fit_resample method.
  • After that, we undersample the majority class using the Near Miss Algorithm with version=1. The undersampling is carried out with the NearMiss object's fit_resample method.
  • Finally, we print the class distributions before and after applying SMOTE and the Near Miss Algorithm to see the changes.
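As a possible next step (not part of the original example), here is a short sketch of fitting a classifier on the SMOTE-resampled training set and checking per-class recall on the untouched test set; LogisticRegression is just one illustrative choice of model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Train on the resampled data, but evaluate on the untouched test set
model = LogisticRegression(max_iter=1000)
model.fit(X_train_smote, y_train_smote)
y_pred = model.predict(X_test)
# The report shows per-class precision and recall, including recall for the minority class
print(classification_report(y_test, y_pred))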