Assignment 2 Task 2 SMOTENC

by Renato Miguel Sopa Gonçalves

Hi Jan,

Let us think about the methods for a moment.

SMOTE is a balancing technique for numerical data. You could, of course, one-hot encode (OHE) the categorical features and then apply SMOTE to the resulting binary columns, but would that be sensible? SMOTE relies on distances to find neighbours and then interpolates between them, and distances between categories are not very meaningful: after OHE, they are still computed between binary features that merely represent categories (e.g., what is the distance between movie A and movie B?). Interpolating between two such vectors can also yield fractional values that correspond to no category at all.
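To make this concrete, here is a minimal sketch of the interpolation step applied to two one-hot vectors. The three-movie layout and the helper name `interpolate` are made up for illustration; this is not imbalanced-learn code, only the textbook formula x_new = x_i + lam * (x_j - x_i).

```python
import math


def interpolate(x_i, x_j, lam):
    """SMOTE-style synthetic point: x_i + lam * (x_j - x_i), with 0 <= lam <= 1."""
    return [a + lam * (b - a) for a, b in zip(x_i, x_j)]


# Hypothetical one-hot layout: [movie_A, movie_B, movie_C]
movie_a = [1.0, 0.0, 0.0]
movie_b = [0.0, 1.0, 0.0]
movie_c = [0.0, 0.0, 1.0]

# Every pair of distinct categories is exactly the same distance apart,
# so the distance says nothing about which movies are "similar".
print(math.dist(movie_a, movie_b), math.dist(movie_a, movie_c))

# Interpolating between two categories gives a row that is neither of them.
synthetic = interpolate(movie_a, movie_b, lam=0.25)
print(synthetic)  # [0.75, 0.25, 0.0]
```

The synthetic row is "75% movie A and 25% movie B", which is not a valid category.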

SMOTENC, in contrast, handles both numerical and categorical data (for the latter, it uses the mode, i.e., the most frequent category among the nearest neighbours). The question, then, is whether it requires some form of prior encoding of the categorical data. For that, it is best to consult the documentation (https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html), which details how the method is used. Notice, in particular, that the first parameter expects the user to specify something, and that the second parameter hints at a transformation that already takes place internally for the algorithm to work (SMOTENC returns the features to their original format afterwards).

With this in mind, you will be able to arrive at the answer.