Did you know that there are three main ways of selecting the most important features before feeding your dataset into an ML model?
- FILTER METHODS are generally used to quickly remove irrelevant and redundant features (constant, quasi-constant and duplicated features). They also rely on statistical tests to rank the most important features, independently of any ML model. Mutual Information, Fisher Score/Chi-Square, Univariate ANOVA and Univariate ROC-AUC/MSE are among the most useful tests, just to name a few.
- WRAPPER METHODS are usually the most effective ones, since they pick features by evaluating the very ML algorithm you plan to use afterwards. Step Forward, Step Backward and Exhaustive Search are part of this category. However, they are computationally expensive, because the model is retrained for every candidate subset of features. Mlxtend is a Python package that implements these algorithms.
- EMBEDDED METHODS build feature selection into the ML algorithm itself, so the best features are selected while the model is being trained. Lasso and tree-based feature importance are the most used within the data science community. Advantages: they are faster than wrapper methods, they take interactions among variables into account, and they are generally more accurate than filter methods.
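
To make the three families concrete, here is a minimal sketch with one example of each on a synthetic dataset. It rests on a few assumptions that are not in the post: it uses scikit-learn throughout (its SequentialFeatureSelector stands in for the Mlxtend one mentioned above), the data comes from make_classification, and K = 3 is an arbitrary, hypothetical number of features to keep.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (
    SelectFromModel,
    SelectKBest,
    SequentialFeatureSelector,
    VarianceThreshold,
    mutual_info_classif,
)
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption for illustration): 200 rows, 10 features, 3 informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=2, random_state=0
)

K = 3  # hypothetical number of features to keep

# FILTER: drop constant features, then rank the rest by mutual information.
# No ML model is involved; each feature is scored on its own.
# Note: the synthetic data has no constant columns, so column indices are unchanged.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X)
mi_score = partial(mutual_info_classif, random_state=0)  # fixed seed for repeatability
filter_idx = SelectKBest(score_func=mi_score, k=K).fit(X_var, y).get_support(indices=True)

# WRAPPER: step forward selection. The model is refit (with cross-validation)
# for every candidate feature at every step, which is why this is the slow one.
wrapper_idx = (
    SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=K,
        direction="forward",
        cv=3,
    )
    .fit(X, y)
    .get_support(indices=True)
)

# EMBEDDED: train a tree ensemble once and keep the K features with the
# highest importances learned during training.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
embedded_idx = (
    SelectFromModel(forest, threshold=-np.inf, max_features=K)
    .fit(X, y)
    .get_support(indices=True)
)

print("filter  :", filter_idx)
print("wrapper :", wrapper_idx)
print("embedded:", embedded_idx)
```

The three approaches will not necessarily agree on the same columns. The filter scores each feature in isolation, the wrapper optimizes for the logistic regression it is wrapped around, and the embedded selector reflects what the random forest found useful while training.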