Managing Imbalanced Data in Machine Learning: A beginner's guide

A brief look into several techniques for dealing with imbalanced data while tackling classification use cases


Introduction

Imbalanced data, where one class heavily outnumbers the other, is one of the most frequent issues when working with classification tasks in machine learning. For instance, in credit card fraud detection, there will be far fewer fraudulent transactions (positive class) than non-fraudulent transactions (negative class). It is even possible that 99.99% of transactions are legitimate and only 0.01% are fraudulent.

Both binary and multi-class classification problems can suffer from class imbalance. In this article, let's go through a few popular techniques practitioners use to tackle the class imbalance problem in predictive modelling.

The problem with Imbalanced data

Think about the same credit card fraud detection scenario, where the ratio of non-fraudulent to fraudulent transactions is 99% to 1%. This dataset is heavily imbalanced. Because the classifier will learn the patterns of the majority class and predict nearly everything as a non-fraud transaction, training a model on this dataset can yield accuracy as high as 99%, yet the model won't be able to generalise to new data. This is also why accuracy is not an appropriate evaluation metric when working with imbalanced data.
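As a quick illustration, here is a minimal sketch (using scikit-learn's DummyClassifier on a made-up 99:1 dataset) of how a model that always predicts the majority class already reaches ~99% accuracy while catching no fraud at all:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 99:1 imbalanced dataset: 9,900 non-fraud (0) and 100 fraud (1) samples
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 5))
y = np.array([0] * 9_900 + [1] * 100)

# A "model" that always predicts the majority class
clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = clf.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))        # ~0.99, looks impressive
print("Recall (fraud):", recall_score(y, y_pred))    # 0.0, not a single fraud detected
```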

Examples of Imbalanced Data

Here are a few instances in machine learning when we encounter imbalanced data:

  • Fraud Detection

  • Claim Prediction

  • Default Prediction

  • Churn Prediction

  • Spam Detection

  • Anomaly Detection

  • Outlier Detection

  • Intrusion Detection

  • Conversion Prediction

Managing Imbalanced data

Let's attempt to go through some of the techniques for dealing with an imbalanced dataset-

Collecting more samples

When your data is imbalanced, it is best practice to check whether you can obtain additional data to reduce the class imbalance. Most of the time, however, the nature of the problem you are trying to solve means the extra data simply isn't available.

Trying different evaluation metrics

As mentioned above, accuracy is not a useful metric when dealing with imbalanced classes. Here are some other classification metrics that can offer more insight. You can choose the metric depending on the use case or problem you are trying to solve.

Precision: Of all the examples the model predicted as positive, what proportion were actually positive?

Recall: Also known as Sensitivity or True Positive Rate (TPR). Of all the actual positive examples, how many did the model correctly predict as positive?

F1 Score: The harmonic mean of Precision and Recall. The result lies between 0 and 1, with 1 being the best possible score. It provides a single measure that balances precision and recall.

ROC-AUC: The Receiver Operating Characteristic (ROC) curve is a graph that shows how well a classification model performs across different probability thresholds. It is plotted with FPR (False Positive Rate) on the x-axis and TPR (True Positive Rate) on the y-axis for threshold values between 0.0 and 1.0; the AUC is the area under this curve.
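As a rough sketch, these metrics could be computed with scikit-learn as shown below (the labels and scores are made-up placeholders, not results from a real model):

```python
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Hypothetical ground-truth labels and model outputs, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]                        # hard class predictions
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7]   # predicted probabilities

print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print("ROC-AUC:  ", roc_auc_score(y_true, y_score))     # area under the ROC curve
```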

Trying different ML algorithms

Try comparing results across different machine learning algorithms. Tree-based classifiers such as Decision Trees, Extra Trees, and Random Forests often perform well even on an imbalanced dataset.
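One possible sketch of such a comparison is shown below, pitting a Random Forest against a logistic regression baseline on a synthetic imbalanced dataset (the class_weight="balanced" setting is an assumption on my part, not a requirement of the technique):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic imbalanced dataset (roughly 95:5), purely for illustration
X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Compare a linear baseline with a tree-based ensemble on the minority-class F1 score
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(class_weight="balanced", random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "F1:", f1_score(y_test, model.predict(X_test)))
```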

Implementing Resampling Techniques

Resampling is a technique for balancing the majority and minority classes in the dataset. There are two approaches:

Undersampling: Reduce the number of samples from the majority class in the dataset. Since we are deleting majority-class samples from the original data, some information is lost.

Oversampling: Increase the number of minority class examples in the dataset. The important benefit of oversampling is that no data from the majority or minority classes is lost in the process; however, because it duplicates existing minority samples, it is prone to overfitting.
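A minimal sketch of both approaches using the imbalanced-learn library (assuming it is installed, e.g. via pip install imbalanced-learn):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

# Synthetic imbalanced dataset (roughly 90:10), for illustration
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
print("Original:    ", Counter(y))

# Undersampling: drop majority-class samples until the classes are balanced
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("Undersampled:", Counter(y_under))

# Oversampling: duplicate minority-class samples until the classes are balanced
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X, y)
print("Oversampled: ", Counter(y_over))
```

Note that resampling should be applied only to the training split, not the test data, so that the evaluation still reflects the real-world class distribution.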

Generating artificial data points

The basic over-sampling strategy balances the dataset by generating duplicates of minority class samples. An improvement on this is to synthesise entirely new minority class samples instead. Two of the most frequently used methods are SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic sampling).
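A short sketch of generating synthetic minority-class samples with SMOTE and ADASYN from imbalanced-learn (again on a made-up dataset, and again something to apply only to the training split):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Synthetic imbalanced dataset (roughly 90:10), for illustration
X, y = make_classification(n_samples=1_000, weights=[0.9, 0.1], random_state=0)
print("Original:", Counter(y))

# SMOTE interpolates between existing minority samples to create new ones
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:   ", Counter(y_sm))

# ADASYN focuses the synthetic samples on regions that are harder to learn
X_ad, y_ad = ADASYN(random_state=0).fit_resample(X, y)
print("ADASYN:  ", Counter(y_ad))
```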

Conclusion

You now know what class imbalance is and how several approaches to dealing with imbalanced data in machine learning work. As a data scientist, there are many "shiny new things" you can try, but you should always ask yourself which ones are likely to produce the best return on your time investment. There are many strategies for handling class imbalance that I have not covered here, but in practice, sticking to the simplest strategies and deploying your model as soon as feasible often delivers the most significant return on investment.