Assignment 2 Task 2 SMOTENC

by Jan Marcel Ehrlinspiel

Dear IML Team, 

While doing the assignment, I started to wonder whether, for the Random Forest, it makes a difference to apply SMOTENC first and then one-hot encode the categorical variables, or to do it the other way around and apply SMOTE after the one-hot encoding.

With kind regards

In reply to Jan Marcel Ehrlinspiel

Re: Assignment 2 Task 2 SMOTENC

by Renato Miguel Sopa Gonçalves
Hi Jan,

Let us think about the methods for a moment.

SMOTE is a balancing technique for numerical data. You could indeed apply one-hot encoding (OHE) first and then run SMOTE on the binary-encoded categorical data, but would that be sensible? SMOTE relies on distances to interpolate between samples, and distances between categories are not really meaningful; after OHE, the distances are still computed between binary features that merely represent categories (e.g., what is the distance between movie A and movie B?).
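To make the point about distances and interpolation concrete, here is a minimal sketch in plain Python. It is not the imbalanced-learn implementation; the function name `smote_interpolate` and the three movie columns are made up for illustration, and only the core SMOTE step (picking a random point on the segment between a sample and one of its neighbours) is reproduced.

```python
import math
import random


def smote_interpolate(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor.

    Simplified sketch of the SMOTE interpolation step only
    (hypothetical helper, not a library function).
    """
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


# Hypothetical one-hot columns: [movie_A, movie_B, movie_C]
movie_a = [1.0, 0.0, 0.0]
movie_b = [0.0, 1.0, 0.0]
movie_c = [0.0, 0.0, 1.0]

# Every pair of distinct categories is equally far apart after OHE,
# so the distance carries no information about the categories.
dist_ab = math.dist(movie_a, movie_b)
dist_ac = math.dist(movie_a, movie_c)
print(dist_ab, dist_ac)

# Interpolating between two one-hot rows gives fractional entries,
# i.e. a sample that is "partly movie A and partly movie B".
rng = random.Random(0)
synthetic = smote_interpolate(movie_a, movie_b, rng)
print(synthetic)
```

The synthetic row no longer corresponds to any single category, which is the practical reason to question running plain SMOTE on one-hot encoded columns.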

SMOTENC, in contrast, handles both numerical and categorical data (it uses the mode for the latter). The question is then whether it requires some form of prior encoding of the categorical data. For that, it is best to consult the documentation (https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html), which details how the method is used. Notice, in particular, that the first parameter expects the user to specify something, and that the second parameter hints at a transformation that already takes place inside the algorithm (SMOTENC returns the features in their original format afterwards).
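The idea of treating the two kinds of features differently can be sketched as follows. This is a simplified illustration in plain Python under stated assumptions, not the imbalanced-learn code: the helper `synthesize_mixed`, the feature values, and the genre labels are all hypothetical, and the neighbour search itself is left out.

```python
import random
from collections import Counter


def synthesize_mixed(x_num, nb_num, neighbor_cats, rng):
    """Build one synthetic sample from mixed-type data (illustrative only).

    Numerical part: interpolate between the sample and one neighbour.
    Categorical part: take the most frequent value among the neighbours.
    """
    gap = rng.random()  # uniform in [0, 1)
    new_num = [a + gap * (b - a) for a, b in zip(x_num, nb_num)]
    new_cat = Counter(neighbor_cats).most_common(1)[0][0]
    return new_num, new_cat


rng = random.Random(0)
new_num, new_cat = synthesize_mixed(
    x_num=[2.0, 10.0],                          # numerical features of the sample
    nb_num=[4.0, 20.0],                         # numerical features of one neighbour
    neighbor_cats=["drama", "comedy", "drama"],  # category of each neighbour
    rng=rng,
)
print(new_num, new_cat)
```

Here the categorical value of the synthetic sample is always one of the original categories, never a mixture, while the numerical values lie between those of the sample and its neighbour.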

With this in mind, you will be able to arrive at the answer.