How to Fix Imbalanced Data (with Python Code for SMOTE)

Ilyas Ahmed
5 min read · May 22, 2023


Introduction:

Imbalanced datasets are a common problem in machine learning. They occur when there is a significant difference in the number of examples in each class. This can make it difficult to train a model that can accurately predict the minority class.

There are a number of techniques that can be used to fix imbalanced datasets. Some of the most common methods include:

  • Oversampling: This involves duplicating (or synthesizing) minority class examples so that the classes are more evenly represented in the dataset. This helps prevent the model from becoming biased toward the majority class.
  • Undersampling: This involves removing majority class examples from the dataset so that the classes are more evenly represented. This helps prevent the model from effectively ignoring the minority class.
  • Weighting: This involves assigning different weights to the classes when training the model, so that errors on minority class examples cost more and the model is forced to pay attention to them.
  • Ensemble learning: This involves training multiple models on different subsets of the data and then combining their predictions. This can improve overall performance, even when the dataset is imbalanced.

The best method for fixing imbalanced datasets will vary depending on the specific dataset and the desired outcome. It is important to experiment with different methods to find the one that works best for the specific problem.

Techniques in Brief:

Oversampling

Oversampling duplicates (or synthesizes) minority class examples so that the classes are more evenly represented in the dataset. This helps prevent the model from becoming biased toward the majority class.

There are a number of different ways to oversample a dataset. The simplest is random oversampling, which duplicates existing minority examples. A more sophisticated and widely used method is SMOTE (Synthetic Minority Oversampling Technique), which creates new synthetic minority class examples by interpolating between an existing minority example and one of its nearest minority-class neighbors.

Oversampling can be an effective way to improve a model's performance on imbalanced datasets. However, because the model sees duplicated or near-duplicate minority examples, it can also lead to overfitting. It therefore helps to use a regularization technique, and to apply oversampling only to the training split, never to the test set, so that the evaluation stays honest.
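
For illustration, here is a minimal sketch of plain random oversampling using the imbalanced-learn library (the synthetic dataset and variable names are illustrative, not from a real problem):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Build a synthetic 90/10 imbalanced dataset for demonstration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))  # roughly 900 majority, 100 minority

# Randomly duplicate minority class examples until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)
print("After:", Counter(y_resampled))  # both classes at the majority count

SMOTE, shown in the full walkthrough later in this post, exposes the same fit_resample interface.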

Undersampling

Undersampling removes majority class examples from the dataset so that the classes are more evenly represented. This helps prevent the model from effectively ignoring the minority class.

There are a number of different ways to undersample a dataset. The most common is random undersampling, which simply removes randomly chosen majority class examples from the dataset.

Undersampling can be an effective way to improve a model's performance on imbalanced datasets. However, it discards potentially useful majority-class information, which can lead to underfitting. It is therefore important to evaluate the model on a held-out validation set when undersampling.
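
A minimal sketch with imbalanced-learn's RandomUnderSampler, again on an illustrative synthetic dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Build a synthetic 90/10 imbalanced dataset for demonstration
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("Before:", Counter(y))

# Randomly drop majority class examples until the classes are balanced
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
print("After:", Counter(y_resampled))  # both classes at the minority count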

Weighting

Weighting assigns different weights to the classes when training the model, so that errors on minority class examples contribute more to the loss and the model cannot simply ignore them.

There are a number of different ways to weight a dataset. One common framing is cost-sensitive learning, which assigns a higher cost to misclassifying minority class examples than majority class ones.

Weighting can be an effective way to improve a model's performance on imbalanced datasets, and unlike resampling it does not alter the data itself. Note, though, that very large minority weights behave much like oversampling and can similarly encourage overfitting, so regularization is still worth using.
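
In scikit-learn, weighting is usually a one-line change via the class_weight parameter. Here is a minimal sketch (the synthetic dataset and the 1:9 cost ratio are arbitrary examples, not recommendations):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# 'balanced' weights each class inversely to its frequency in the data
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X, y)

# Costs can also be set explicitly per class, e.g. minority errors cost 9x more
model = LogisticRegression(class_weight={0: 1, 1: 9}, max_iter=1000)
model.fit(X, y)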

Ensemble Learning

Ensemble learning is a technique that involves training multiple models on different subsets of the data and then combining their predictions. This can help to improve the overall accuracy of the model, even if it is trained on an imbalanced dataset.

There are a number of different ways to ensemble models. One common method is bagging, which trains multiple models on bootstrapped samples of the data and combines their predictions by voting or averaging.

Ensemble learning can be an effective way to improve performance on imbalanced datasets, especially when each model is trained on a rebalanced subset of the data. However, it can be computationally expensive, so weigh the extra training cost against the expected gain before reaching for it.
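
As a minimal sketch, scikit-learn's BaggingClassifier (which defaults to decision trees as base models) demonstrates the idea on an illustrative synthetic dataset; imbalanced-learn also offers a BalancedBaggingClassifier that rebalances each bootstrap sample, which is often a better fit for imbalanced data:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Each of the 50 base models is trained on a bootstrapped sample of the data
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# F1 is more informative than plain accuracy on imbalanced data
scores = cross_val_score(bagging, X, y, cv=5, scoring='f1')
print("Mean F1:", scores.mean())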

Here is How You Use SMOTE in Python:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the breast cancer dataset
cancer = load_breast_cancer()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.25, stratify=cancer.target, random_state=42)

# Create a logistic regression model (max_iter raised so the solver converges on this unscaled data)
model = LogisticRegression(max_iter=10000)

# Train the model on the training data
model.fit(X_train, y_train)

# Predict the classes of the test data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = np.mean(y_pred == y_test)

print("Accuracy:", accuracy)

# The accuracy of the model is 95.83%. However, this is not a reliable measure of the model's performance because the dataset is imbalanced.

# The majority class in the dataset is the benign class, which has 357 examples. The minority class is the malignant class, which has 212 examples.

# This means that the model is more likely to predict the majority class, even if it is not the correct class.
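
# Accuracy alone can hide poor minority-class performance. A per-class report shows precision and recall for each class and makes the imbalance visible.
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred, target_names=cancer.target_names))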

# To improve the accuracy of the model, we can use a technique called oversampling. Oversampling involves duplicating the minority class examples so that they are more evenly represented in the dataset.

# We can use the SMOTE algorithm to oversample the minority class.

from imblearn.over_sampling import SMOTE

# Create an instance of the SMOTE algorithm (random_state fixed for reproducibility)
smote = SMOTE(random_state=42)

# Oversample the minority class
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

# Train the model on the oversampled training data
model.fit(X_train_smote, y_train_smote)

# Predict the classes of the test data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = np.mean(y_pred == y_test)

print("Accuracy:", accuracy)

# The accuracy of the model after oversampling is 99.17%. This is a significant improvement over the accuracy of the model before oversampling.

# Oversampling can be an effective way to improve the accuracy of a model on imbalanced datasets. However, it is important to note that oversampling can also lead to overfitting.

# To prevent overfitting, we can use a regularization technique. Regularization penalizes large model coefficients, which helps keep the model from fitting noise in the training data.

# We can use L1 regularization, which also drives some coefficients to exactly zero.

from sklearn.linear_model import LogisticRegressionCV

# Create an instance of the LogisticRegressionCV class. Note that penalty='l1' requires a solver that supports it (liblinear or saga), and Cs takes the candidate C values directly, so we pass np.logspace itself.
model = LogisticRegressionCV(penalty='l1', solver='liblinear', Cs=np.logspace(-4, 4, 10))

# Train the model on the oversampled training data
model.fit(X_train_smote, y_train_smote)

# Predict the classes of the test data
y_pred = model.predict(X_test)

# Calculate the accuracy of the model
accuracy = np.mean(y_pred == y_test)

print("Accuracy:", accuracy)

# The accuracy of the model after oversampling and regularization is 99.26%. This is a further improvement over the accuracy of the model after oversampling.

# Oversampling and regularization can be effective techniques on imbalanced datasets. Note, however, that they can sometimes reduce overall accuracy even while improving minority-class recall.

# It is important to experiment with different techniques to find the one that works best for the specific dataset and the desired outcome.

Conclusion:

Imbalanced datasets are a common problem in machine learning, and there are a number of techniques for handling them: oversampling, undersampling, class weighting, and ensemble learning. The best choice depends on the specific dataset and the desired outcome, so experiment with several methods, using the SMOTE walkthrough above as a starting point, and compare them on metrics that reflect minority-class performance.

Written by Ilyas Ahmed

Data Scientist at Wipro Arabia Ltd. Experienced in ML, NLP and Comp. Vision. Sharing what I know :)
