
Feature Selection in Python


We will provide a walk-through example of how you can choose the most important features. We will work with a classification problem, but the same approach can be extended to regression by adjusting the scoring functions.

We will work with the breast-cancer dataset. Let’s start:

import pandas as pd
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFpr, chi2, SelectKBest, SelectFwe, f_classif, SelectFdr

import matplotlib.pyplot as plt
%matplotlib inline  


# https://www.kaggle.com/uciml/breast-cancer-wisconsin-data?select=data.csv
df = pd.read_csv("data.csv")

# replace M with 1 and B with 0
my_map = {
          'M':1,
          'B' :0
         }

df['diagnosis'] = df['diagnosis'].map(my_map)

# remove the id column
df.drop(['id'], axis=1, inplace=True)

df

Statistically Significant Features with the t-test

Since our target is binary, we can compare the values of each independent variable across the two groups (0 and 1) by applying a t-test.

my_important = []

# compare the mean of each feature between malignant (1) and benign (0) tumors
for x in df.columns[1:]:
    pvalue = stats.ttest_ind(df.loc[df.diagnosis == 1, x], df.loc[df.diagnosis == 0, x])[1]
    if pvalue < 0.05:
        my_important.append(x)
        print(f'The variable {x} is statistically significant with a pvalue = {pvalue:.2}')
    else:
        print(f'The variable {x} is NOT statistically significant')
 

And we get:

The variable radius_mean is statistically significant with a pvalue = 8.5e-96
The variable texture_mean is statistically significant with a pvalue = 4.1e-25
The variable perimeter_mean is statistically significant with a pvalue = 8.4e-101
The variable area_mean is statistically significant with a pvalue = 4.7e-88
The variable smoothness_mean is statistically significant with a pvalue = 1.1e-18
The variable compactness_mean is statistically significant with a pvalue = 3.9e-56
The variable concavity_mean is statistically significant with a pvalue = 1e-83
The variable concave points_mean is statistically significant with a pvalue = 7.1e-116
The variable symmetry_mean is statistically significant with a pvalue = 5.7e-16
The variable fractal_dimension_mean is NOT statistically significant
The variable radius_se is statistically significant with a pvalue = 9.7e-50
The variable texture_se is NOT statistically significant
The variable perimeter_se is statistically significant with a pvalue = 1.7e-47
The variable area_se is statistically significant with a pvalue = 5.9e-46
The variable smoothness_se is NOT statistically significant
The variable compactness_se is statistically significant with a pvalue = 1e-12
The variable concavity_se is statistically significant with a pvalue = 8.3e-10
The variable concave points_se is statistically significant with a pvalue = 3.1e-24
The variable symmetry_se is NOT statistically significant
The variable fractal_dimension_se is NOT statistically significant
The variable radius_worst is statistically significant with a pvalue = 8.5e-116
The variable texture_worst is statistically significant with a pvalue = 1.1e-30
The variable perimeter_worst is statistically significant with a pvalue = 5.8e-119
The variable area_worst is statistically significant with a pvalue = 2.8e-97
The variable smoothness_worst is statistically significant with a pvalue = 6.6e-26
The variable compactness_worst is statistically significant with a pvalue = 7.1e-55
The variable concavity_worst is statistically significant with a pvalue = 2.5e-72
The variable concave points_worst is statistically significant with a pvalue = 2e-124
The variable symmetry_worst is statistically significant with a pvalue = 3e-25
The variable fractal_dimension_worst is statistically significant with a pvalue = 2.3e-15

So, the statistically significant variables are:

my_important
['radius_mean',
 'texture_mean',
 'perimeter_mean',
 'area_mean',
 'smoothness_mean',
 'compactness_mean',
 'concavity_mean',
 'concave points_mean',
 'symmetry_mean',
 'radius_se',
 'perimeter_se',
 'area_se',
 'compactness_se',
 'concavity_se',
 'concave points_se',
 'radius_worst',
 'texture_worst',
 'perimeter_worst',
 'area_worst',
 'smoothness_worst',
 'compactness_worst',
 'concavity_worst',
 'concave points_worst',
 'symmetry_worst',
 'fractal_dimension_worst']
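
Note that we ran one t-test per column, which is a multiple-comparisons situation. If you want to be stricter, you can adjust the raw p-values before filtering, for example with a Benjamini-Hochberg correction. Below is a minimal sketch that assumes the statsmodels package is available (it is not imported in the snippets above); it is the same idea that the SelectFdr selector applies later on.

from statsmodels.stats.multitest import multipletests

# raw t-test p-values for every feature, in column order
pvalues = [stats.ttest_ind(df.loc[df.diagnosis == 1, x], df.loc[df.diagnosis == 0, x])[1]
           for x in df.columns[1:]]

# Benjamini-Hochberg correction at alpha = 0.05
reject, pvals_corrected, _, _ = multipletests(pvalues, alpha=0.05, method='fdr_bh')

# features that remain significant after the correction
corrected_important = [x for x, keep in zip(df.columns[1:], reject) if keep]
corrected_important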

Random Forest Feature Importance

We can also fit a model, such as a Random Forest, and look at its feature importances.

clf = RandomForestClassifier(n_estimators=50)

X = df.drop(['diagnosis'], axis=1)
y = df.diagnosis

model = clf.fit(X, y)

# impurity-based importance of each feature, sorted from most to least important
feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar')
 

Let’s see the 10 most important features according to Random Forest:

feat_importances.index[0:10]
 
Index(['concave points_worst', 'perimeter_worst', 'radius_worst',
       'concave points_mean', 'area_worst', 'perimeter_mean', 'area_mean',
       'concavity_worst', 'area_se', 'perimeter_se'],
      dtype='object')
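
If you want to check how well these 10 columns alone perform, here is a quick sketch with cross-validation (the cutoff of 10 features and the 5 folds are arbitrary choices, only for illustration):

from sklearn.model_selection import cross_val_score

# keep only the 10 features ranked highest by the random forest
top_features = feat_importances.index[0:10]

# cross-validated accuracy using only those columns
scores = cross_val_score(RandomForestClassifier(n_estimators=50), X[top_features], y, cv=5)
scores.mean()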

Feature Selection with Scikit-Learn

We can also work with scikit-learn's feature_selection module; you can find more details in the documentation. We will provide some examples:

k-best

SelectKBest keeps the k highest-scoring features. In our case, we will score them with the chi-square test, which requires non-negative features (as is the case here). Keep in mind that new_data is the transformed array that contains only the selected columns.

selector = SelectKBest(score_func=chi2, k=5)
new_data = selector.fit_transform(X, y)

mask = selector.get_support()
new_features = X.columns[mask]
new_features
 
Index(['perimeter_mean', 'area_mean', 'area_se', 'perimeter_worst',
       'area_worst'],
      dtype='object')
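
If you want to see why these five columns were selected, the fitted selector exposes the chi-square statistic and the corresponding p-value for every feature. A small sketch using the selector fitted above:

# chi-square score and p-value per column, sorted by score
chi2_scores = pd.DataFrame({'score': selector.scores_, 'pvalue': selector.pvalues_}, index=X.columns)
chi2_scores.sort_values(by='score', ascending=False).head(10)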

Fwe

SelectFwe controls the family-wise error rate and is similar to what we did at the beginning with the t-test. It can be used with a chi-square test or with ANOVA (which, for a binary target, is equivalent to the t-test).

# chi-square
selector = SelectFwe(score_func=chi2, alpha=0.05)
new_data = selector.fit_transform(X, y)

mask = selector.get_support()
new_features = X.columns[mask]
new_features
 
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se',
       'area_se', 'radius_worst', 'texture_worst', 'perimeter_worst',
       'area_worst', 'compactness_worst', 'concavity_worst',
       'concave points_worst'],
      dtype='object')

# ANOVA

selector = SelectFwe(score_func=f_classif, alpha=0.05)
new_data = selector.fit_transform(X, y)

mask = selector.get_support()
new_features = X.columns[mask]
new_features
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'radius_se', 'perimeter_se',
       'area_se', 'compactness_se', 'concavity_se', 'concave points_se',
       'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst',
       'smoothness_worst', 'compactness_worst', 'concavity_worst',
       'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

FDR

The false discovery rate (FDR) approach takes multiple comparisons into account: SelectFdr keeps the features whose p-values pass the Benjamini-Hochberg procedure.

selector = SelectFdr(chi2, alpha=0.05)
new_data = selector.fit_transform(X, y)

mask = selector.get_support()
new_features = X.columns[mask]
new_features
 
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'compactness_mean', 'concavity_mean', 'concave points_mean',
       'radius_se', 'perimeter_se', 'area_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst',
       'concave points_worst'],
      dtype='object')
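
As a quick way to compare how strict the different criteria are, you can count how many columns each selector keeps. A small sketch reusing the same chi-square scoring function:

# number of features kept by each chi-square based selector
for name, sel in [('SelectKBest (k=5)', SelectKBest(chi2, k=5)),
                  ('SelectFwe', SelectFwe(chi2, alpha=0.05)),
                  ('SelectFdr', SelectFdr(chi2, alpha=0.05))]:
    sel.fit(X, y)
    print(name, sel.get_support().sum())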
 

From the Model

We can keep the most important features as determined by a fitted model, using SelectFromModel. Let's consider the random forest again:

from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(estimator=RandomForestClassifier(n_estimators=50)).fit(X, y)
mask = selector.get_support()
new_features = X.columns[mask]
new_features
 
Index(['perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean',
       'radius_worst', 'perimeter_worst', 'area_worst',
       'concave points_worst'],
      dtype='object')
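
In practice, it is usually a good idea to wrap the selector and the downstream estimator in a Pipeline, so that the selection is re-fitted on every training fold during cross-validation and no information leaks from the test folds. A minimal sketch, keeping the random forest as the downstream model:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('select', SelectFromModel(estimator=RandomForestClassifier(n_estimators=50))),
    ('clf', RandomForestClassifier(n_estimators=50))
])

# the feature selection step is re-fitted inside every fold
scores = cross_val_score(pipe, X, y, cv=5)
scores.mean()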

Discussion

This was an example of how you can get a good idea of which variables are the most important for your model. Keep in mind that you should always run exploratory data analysis first. For a binary target, a box plot is an appropriate plot. Let's look at the box plot of concave points_worst for benign (0) and malignant (1) tumors. As we can see, the difference between the groups appears substantial, which is what the tests above confirmed.

[Box plot of concave points_worst for benign (0) and malignant (1) tumors]
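
The plot above can be reproduced directly from the DataFrame, for example with pandas' built-in boxplot (a small sketch):

# box plot of concave points_worst for benign (0) vs malignant (1) tumors
df.boxplot(column='concave points_worst', by='diagnosis', figsize=(8, 6))
plt.suptitle('')
plt.title('concave points_worst by diagnosis')
plt.show()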

Finally, you can find a detailed explanation of the Chi-Square test.

More Data Science Hacks?

You can follow us on Medium for more Data Science Hacks
