We will walk through an example of how you can choose the most important features. For this example, we will work with a classification problem, but the approach can be extended to regression cases too by adjusting the parameters of the functions.

We will work with the breast-cancer dataset. Let’s start:

    import pandas as pd
    import numpy as np
    from scipy import stats
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFpr, chi2, SelectKBest, SelectFwe, f_classif, SelectFdr
    import matplotlib.pyplot as plt
    %matplotlib inline

    # https://www.kaggle.com/uciml/breast-cancer-wisconsin-data?select=data.csv
    df = pd.read_csv("data.csv")

    # replace M with 1 and B with 0
    my_map = {'M': 1, 'B': 0}
    df['diagnosis'] = df['diagnosis'].map(my_map)

    # remove the id column
    df.drop(['id'], axis=1, inplace=True)
    df

## Statistically Significant Features with the t-test

Since our target is binary, we can compare the values of each independent variable between the two groups (0, 1) by applying a `t-test`.

    my_important = []
    for x in df.columns[1:]:
        pvalue = stats.ttest_ind(df.loc[df.diagnosis==1][x], df.loc[df.diagnosis==0][x])[1]
        if pvalue < 0.05:
            my_important.append(x)
            print(f'The variable {x} is statistically significant with a pvalue = {pvalue:.2}')
        else:
            print(f'The variable {x} is NOT statistically significant')

And we get:

    The variable radius_mean is statistically significant with a pvalue = 8.5e-96
    The variable texture_mean is statistically significant with a pvalue = 4.1e-25
    The variable perimeter_mean is statistically significant with a pvalue = 8.4e-101
    The variable area_mean is statistically significant with a pvalue = 4.7e-88
    The variable smoothness_mean is statistically significant with a pvalue = 1.1e-18
    The variable compactness_mean is statistically significant with a pvalue = 3.9e-56
    The variable concavity_mean is statistically significant with a pvalue = 1e-83
    The variable concave points_mean is statistically significant with a pvalue = 7.1e-116
    The variable symmetry_mean is statistically significant with a pvalue = 5.7e-16
    The variable fractal_dimension_mean is NOT statistically significant
    The variable radius_se is statistically significant with a pvalue = 9.7e-50
    The variable texture_se is NOT statistically significant
    The variable perimeter_se is statistically significant with a pvalue = 1.7e-47
    The variable area_se is statistically significant with a pvalue = 5.9e-46
    The variable smoothness_se is NOT statistically significant
    The variable compactness_se is statistically significant with a pvalue = 1e-12
    The variable concavity_se is statistically significant with a pvalue = 8.3e-10
    The variable concave points_se is statistically significant with a pvalue = 3.1e-24
    The variable symmetry_se is NOT statistically significant
    The variable fractal_dimension_se is NOT statistically significant
    The variable radius_worst is statistically significant with a pvalue = 8.5e-116
    The variable texture_worst is statistically significant with a pvalue = 1.1e-30
    The variable perimeter_worst is statistically significant with a pvalue = 5.8e-119
    The variable area_worst is statistically significant with a pvalue = 2.8e-97
    The variable smoothness_worst is statistically significant with a pvalue = 6.6e-26
    The variable compactness_worst is statistically significant with a pvalue = 7.1e-55
    The variable concavity_worst is statistically significant with a pvalue = 2.5e-72
    The variable concave points_worst is statistically significant with a pvalue = 2e-124
    The variable symmetry_worst is statistically significant with a pvalue = 3e-25
    The variable fractal_dimension_worst is statistically significant with a pvalue = 2.3e-15

So, the statistically significant variables are:

my_important

['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'radius_se', 'perimeter_se', 'area_se', 'compactness_se', 'concavity_se', 'concave points_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
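The loop above runs one pooled-variance t-test per column. As a minimal sketch of a variation, the same screening can be done in a single vectorized call, here with Welch's t-test (`equal_var=False`), which does not assume equal variances in the two groups. This is an illustration, not the article's method. It assumes that scikit-learn's bundled copy of the Wisconsin data (`load_breast_cancer`) can stand in for `data.csv`; that copy uses different column names (for example `worst concave points` instead of `concave points_worst`) and the opposite label encoding, so the labels are flipped to match the mapping used here.

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled dataset stands in for the Kaggle data.csv.
data = load_breast_cancer(as_frame=True)
X = data.data
# scikit-learn encodes malignant as 0 and benign as 1; flip to match M=1, B=0.
y = 1 - data.target

malignant = X[y == 1].to_numpy()
benign = X[y == 0].to_numpy()

# One call tests every column (axis=0); equal_var=False requests Welch's t-test.
t_stat, p_values = stats.ttest_ind(malignant, benign, axis=0, equal_var=False)

results = (
    pd.DataFrame({"t": t_stat, "p_value": p_values}, index=X.columns)
    .sort_values("p_value")
)
significant = results.index[results["p_value"] < 0.05].tolist()

print(results.head())
print(f"{len(significant)} of {X.shape[1]} features have p < 0.05")
```

Because Welch's test uses a different variance estimate, its p-values can differ slightly from the ones printed above.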

## Random Forest Feature Importance

We can also fit a model, such as a Random Forest, and see which features it considers the most important.

    clf = RandomForestClassifier(n_estimators=50)
    X = df.drop(['diagnosis'], axis=1)
    y = df.diagnosis
    model = clf.fit(X, y)
    feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["Importance"])
    feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
    feat_importances.plot(kind='bar')

Let’s see the 10 most important features according to Random Forest:

feat_importances.index[0:10]

Index(['concave points_worst', 'perimeter_worst', 'radius_worst', 'concave points_mean', 'area_worst', 'perimeter_mean', 'area_mean', 'concavity_worst', 'area_se', 'perimeter_se'], dtype='object')
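A random forest is randomized, so the ranking above can change from one run to the next. The sketch below shows one way to make it reproducible by fixing `random_state`, and how to check how much the top 10 changes under a different seed. The helper `top_features` and the seed values are hypothetical names chosen for illustration, and scikit-learn's bundled copy of the dataset is assumed as a stand-in for `data.csv`.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled dataset replaces data.csv; labels flipped so malignant = 1.
data = load_breast_cancer(as_frame=True)
X, y = data.data, 1 - data.target


def top_features(seed, k=10):
    """Fit a forest with a fixed seed and return its k largest importances."""
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X, y)
    importances = pd.Series(clf.feature_importances_, index=X.columns)
    return importances.nlargest(k)


run_a = top_features(seed=0)
run_b = top_features(seed=0)  # same seed: identical forest
run_c = top_features(seed=1)  # different seed: ranking may shift

print(run_a)
print("same seed gives same ranking:", list(run_a.index) == list(run_b.index))
print("features shared by seeds 0 and 1:", len(set(run_a.index) & set(run_c.index)))
```

If the top 10 differs a lot between seeds, increasing `n_estimators` is one common way to stabilize the importances.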

## Feature Selection with Scikit-Learn

We can also use the feature selection tools in `scikit-learn`. You can find more details in the documentation. Here are some examples:

**k-best**

It selects the k highest-scoring features. In our case, we will score the features with the chi-square test, which requires non-negative feature values. Keep in mind that **new_data** holds the final data after the non-selected variables have been removed.

    selector = SelectKBest(score_func=chi2, k=5)
    new_data = selector.fit_transform(X, y)
    mask = selector.get_support()
    new_features = X.columns[mask]
    new_features

Index(['perimeter_mean', 'area_mean', 'area_se', 'perimeter_worst', 'area_worst'], dtype='object')
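As noted in the introduction, the same selectors extend to regression by changing the scoring function. A minimal sketch, assuming a synthetic dataset from `make_regression` (not part of the breast-cancer example) and `f_regression` as the score function:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data, used purely for illustration.
X_reg, y_reg = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# f_regression scores each feature with a univariate linear-regression F-test.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X_reg, y_reg)

selected_idx = np.flatnonzero(selector.get_support())
print("selected column indices:", selected_idx)
print("shape after selection:", X_selected.shape)
```

Unlike `chi2`, `f_regression` accepts negative feature values, so no rescaling is needed here.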

**Fwe**

This is similar to what we did at the beginning with the t-test, except that `SelectFwe` controls the family-wise error rate across all the tests. It can be done with a `chi-square test` or with `ANOVA` (which in the binary case is equivalent to the t-test).

    # chi-square
    selector = SelectFwe(score_func=chi2, alpha=0.05)
    new_data = selector.fit_transform(X, y)
    mask = selector.get_support()
    new_features = X.columns[mask]
    new_features

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se', 'area_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst'], dtype='object')

    # ANOVA
    selector = SelectFwe(score_func=f_classif, alpha=0.05)
    new_data = selector.fit_transform(X, y)
    mask = selector.get_support()
    new_features = X.columns[mask]
    new_features

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'radius_se', 'perimeter_se', 'area_se', 'compactness_se', 'concavity_se', 'concave points_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'], dtype='object')
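Two claims in this section can be checked numerically. With two groups, the ANOVA F statistic equals the square of the pooled-variance t statistic, and `SelectFwe` keeps a feature when its p-value is below `alpha` divided by the number of features (a Bonferroni-style threshold). The sketch below is an illustration under the assumption that scikit-learn's bundled copy of the dataset stands in for `data.csv`; `manual_mask` is a name introduced here.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFwe, f_classif

# Assumption: the bundled dataset replaces data.csv; labels flipped so malignant = 1.
data = load_breast_cancer(as_frame=True)
X, y = data.data, 1 - data.target

# ANOVA F-test per feature.
f_stat, f_pvalues = f_classif(X, y)

# Two-sample t-test per feature (ttest_ind pools the variances by default).
t_stat, _ = stats.ttest_ind(X[y == 1].to_numpy(), X[y == 0].to_numpy(), axis=0)

print("F equals t squared:", np.allclose(f_stat, t_stat ** 2, rtol=1e-6))

# SelectFwe compares each p-value against alpha / n_features.
alpha = 0.05
selector = SelectFwe(score_func=f_classif, alpha=alpha).fit(X, y)
manual_mask = f_pvalues < alpha / X.shape[1]

print("SelectFwe matches the manual rule:",
      np.array_equal(selector.get_support(), manual_mask))
print("kept by SelectFwe:", int(selector.get_support().sum()))
print("kept by uncorrected p < alpha:", int((f_pvalues < alpha).sum()))
```

When the significant p-values are extremely small, as they are for this dataset, the corrected and uncorrected rules can end up selecting the same features.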

**FDR**

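The false discovery rate (FDR) approach accounts for multiple comparisons by controlling the expected proportion of false positives among the selected features. `SelectFdr` uses the Benjamini-Hochberg procedure.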

    selector = SelectFdr(chi2, alpha=0.05)
    new_data = selector.fit_transform(X, y)
    mask = selector.get_support()
    new_features = X.columns[mask]
    new_features

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se', 'area_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst'], dtype='object')
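To make the multiple-comparison adjustment concrete, here is a minimal sketch of the Benjamini-Hochberg step-up rule, compared against `SelectFdr` on the chi-square p-values. The function `benjamini_hochberg` and the `toy` p-values are hypothetical, written for illustration; scikit-learn's bundled copy of the dataset is assumed as a stand-in for `data.csv`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFdr, chi2


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH step-up rule."""
    p_values = np.asarray(p_values, dtype=float)
    n = p_values.size
    sorted_p = np.sort(p_values)
    # The i-th smallest p-value is compared against alpha * i / n.
    thresholds = alpha * np.arange(1, n + 1) / n
    passing = sorted_p[sorted_p <= thresholds]
    if passing.size == 0:
        return np.zeros(n, dtype=bool)
    # Reject every hypothesis whose p-value is at most the largest passing one.
    return p_values <= passing.max()


# Toy p-values: only the two smallest fall under their BH thresholds.
toy = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
toy_mask = benjamini_hochberg(toy, alpha=0.05)
print("toy rejections:", int(toy_mask.sum()))

# Assumption: the bundled dataset replaces data.csv; labels flipped so malignant = 1.
data = load_breast_cancer(as_frame=True)
X, y = data.data, 1 - data.target

_, chi2_pvalues = chi2(X, y)
selector = SelectFdr(chi2, alpha=0.05).fit(X, y)
bh_mask = benjamini_hochberg(chi2_pvalues, alpha=0.05)
print("manual BH matches SelectFdr:", np.array_equal(selector.get_support(), bh_mask))
```

The BH thresholds grow with the rank of the p-value, which is why FDR control is typically less strict than the family-wise correction used by `SelectFwe`.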

**From the Model**

We can keep the most important features derived from the model. Let’s consider again the random forest:

    from sklearn.feature_selection import SelectFromModel

    selector = SelectFromModel(estimator=RandomForestClassifier(n_estimators=50)).fit(X, y)
    mask = selector.get_support()
    new_features = X.columns[mask]
    new_features

Index(['perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'radius_worst', 'perimeter_worst', 'area_worst', 'concave points_worst'], dtype='object')
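It helps to know where the cut-off comes from. For an estimator like a random forest, and with no `threshold` given, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below checks this and shows a stricter, scaled threshold. The value `"1.5*mean"` and the fixed `random_state` are illustrative choices, not the article's settings, and scikit-learn's bundled dataset is assumed as a stand-in for `data.csv`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Assumption: the bundled dataset replaces data.csv; labels flipped so malignant = 1.
data = load_breast_cancer(as_frame=True)
X, y = data.data, 1 - data.target

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Default threshold for a random forest: the mean feature importance.
selector = SelectFromModel(estimator=forest).fit(X, y)
importances = selector.estimator_.feature_importances_
manual_mask = importances >= importances.mean()

print("threshold used:", selector.threshold_)
print("mean importance:", importances.mean())
print("matches manual rule:", np.array_equal(selector.get_support(), manual_mask))

# A stricter cut-off expressed as a multiple of the mean importance.
stricter = SelectFromModel(estimator=forest, threshold="1.5*mean").fit(X, y)
print("kept with default threshold:", int(selector.get_support().sum()))
print("kept with 1.5*mean:", int(stricter.get_support().sum()))
```

Since the forest is randomized, the selected set in the article can vary between runs unless `random_state` is fixed.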

## Discussion

This was an example of how you can get a good idea of which variables are the most important for your model. Keep in mind that you should always run exploratory data analysis. For binary cases, a box plot is usually an appropriate choice. Let’s see the box plot of the `concave points_worst` for both benign (0) and malignant tumors (1). As we can see, the difference between the groups seems to be significant, and this is what we confirmed by running the tests above.
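A box plot like the one described can be drawn along the following lines. This is a sketch, not the article's original plotting code. It assumes scikit-learn's bundled copy of the dataset, where the column is named `worst concave points` rather than `concave points_worst`, and it flips the labels so that 1 means malignant.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled dataset replaces data.csv; labels flipped so malignant = 1.
data = load_breast_cancer(as_frame=True)
df = data.data.copy()
df["diagnosis"] = 1 - data.target

feature = "worst concave points"  # scikit-learn's name for 'concave points_worst'

# One array of values per diagnosis group, in the order benign (0), malignant (1).
groups = [df.loc[df["diagnosis"] == g, feature].to_numpy() for g in (0, 1)]

fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticks([1, 2])
ax.set_xticklabels(["benign (0)", "malignant (1)"])
ax.set_ylabel(feature)
ax.set_title(f"{feature} by diagnosis")

# Group medians give a quick numeric summary of what the plot shows.
medians = df.groupby("diagnosis")[feature].median()
print(medians)
```

In a notebook the figure renders inline; in a plain script, call `plt.show()` or `fig.savefig(...)` to see it.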

Finally, you can find a detailed explanation of the Chi-Square test.
