This walk-through shows how you can identify the most important features in a dataset. We will work with a classification problem, but the same approach extends to regression by changing the statistical tests and scoring functions that are used.
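As a rough sketch of the regression case, a scikit-learn selector can be given a regression scoring function such as `f_regression`. The data below is synthetic (generated with `make_regression`), and the names `X_reg`, `y_reg` and the values of `n_features` and `k` are arbitrary assumptions chosen for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression data: 10 features, only 3 of which drive the target
X_reg, y_reg = make_regression(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Same selector as in the classification case, but with a regression score function
selector = SelectKBest(score_func=f_regression, k=3)
X_reg_selected = selector.fit_transform(X_reg, y_reg)

print(X_reg_selected.shape)
```

The only change compared with classification is the `score_func` argument.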
We will work with the breast-cancer dataset. Let’s start:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFpr, chi2, SelectKBest, SelectFwe, f_classif, SelectFdr
import matplotlib.pyplot as plt
%matplotlib inline

# https://www.kaggle.com/uciml/breast-cancer-wisconsin-data?select=data.csv
df = pd.read_csv("data.csv")

# replace M with 1 and B with 0
my_map = {'M': 1, 'B': 0}
df['diagnosis'] = df['diagnosis'].map(my_map)

# remove the id column
df.drop(['id'], axis=1, inplace=True)
df
Statistically Significant Features with t-test
Since our target is binary, we can compare the values of each independent variable between the two groups (0 and 1) by applying a t-test.
my_important = []
for x in df.columns[1:]:
    pvalue = stats.ttest_ind(df.loc[df.diagnosis==1][x], df.loc[df.diagnosis==0][x])[1]
    if pvalue<0.05:
        my_important.append(x)
        print(f'The variable {x} is statistically significant with a pvalue = {pvalue:.2}')
    else:
        print(f'The variable {x} is NOT statistically significant')
And we get:
The variable radius_mean is statistically significant with a pvalue = 8.5e-96
The variable texture_mean is statistically significant with a pvalue = 4.1e-25
The variable perimeter_mean is statistically significant with a pvalue = 8.4e-101
The variable area_mean is statistically significant with a pvalue = 4.7e-88
The variable smoothness_mean is statistically significant with a pvalue = 1.1e-18
The variable compactness_mean is statistically significant with a pvalue = 3.9e-56
The variable concavity_mean is statistically significant with a pvalue = 1e-83
The variable concave points_mean is statistically significant with a pvalue = 7.1e-116
The variable symmetry_mean is statistically significant with a pvalue = 5.7e-16
The variable fractal_dimension_mean is NOT statistically significant
The variable radius_se is statistically significant with a pvalue = 9.7e-50
The variable texture_se is NOT statistically significant
The variable perimeter_se is statistically significant with a pvalue = 1.7e-47
The variable area_se is statistically significant with a pvalue = 5.9e-46
The variable smoothness_se is NOT statistically significant
The variable compactness_se is statistically significant with a pvalue = 1e-12
The variable concavity_se is statistically significant with a pvalue = 8.3e-10
The variable concave points_se is statistically significant with a pvalue = 3.1e-24
The variable symmetry_se is NOT statistically significant
The variable fractal_dimension_se is NOT statistically significant
The variable radius_worst is statistically significant with a pvalue = 8.5e-116
The variable texture_worst is statistically significant with a pvalue = 1.1e-30
The variable perimeter_worst is statistically significant with a pvalue = 5.8e-119
The variable area_worst is statistically significant with a pvalue = 2.8e-97
The variable smoothness_worst is statistically significant with a pvalue = 6.6e-26
The variable compactness_worst is statistically significant with a pvalue = 7.1e-55
The variable concavity_worst is statistically significant with a pvalue = 2.5e-72
The variable concave points_worst is statistically significant with a pvalue = 2e-124
The variable symmetry_worst is statistically significant with a pvalue = 3e-25
The variable fractal_dimension_worst is statistically significant with a pvalue = 2.3e-15
So, the statistically significant variables are:
my_important
['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'radius_se', 'perimeter_se', 'area_se', 'compactness_se', 'concavity_se', 'concave points_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst']
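The loop above runs one test per feature at an uncorrected 0.05 level and uses SciPy's default equal-variance t-test. As a hedged variation, the sketch below wraps the same idea in a function that uses Welch's t-test (`equal_var=False`) and a Bonferroni-corrected threshold. The function name `significant_features` and the toy data are hypothetical and exist only for illustration; they are not part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy import stats


def significant_features(data, target, alpha=0.05, equal_var=False):
    """Return the features whose two-sample t-test p-value is below alpha / n_features."""
    features = [c for c in data.columns if c != target]
    threshold = alpha / len(features)  # Bonferroni correction
    selected = []
    for col in features:
        group1 = data.loc[data[target] == 1, col]
        group0 = data.loc[data[target] == 0, col]
        pvalue = stats.ttest_ind(group1, group0, equal_var=equal_var).pvalue
        if pvalue < threshold:
            selected.append(col)
    return selected


# Toy data: "signal" differs strongly between the groups, "noise" does not by construction
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "label": np.repeat([0, 1], 100),
    "signal": np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)]),
    "noise": rng.normal(0, 1, 200),
})

selected = significant_features(toy, target="label")
print(selected)
```

With the article's data, the call would presumably be `significant_features(df, target="diagnosis")`.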
Random Forest Feature Importance
We can also fit a model, such as Random Forest, and see which features it considers the most important.
clf = RandomForestClassifier(n_estimators=50)
X = df.drop(['diagnosis'], axis=1)
y = df.diagnosis
model = clf.fit(X, y)

feat_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=["Importance"])
feat_importances.sort_values(by='Importance', ascending=False, inplace=True)
feat_importances.plot(kind='bar')
Let’s see the 10 most important features according to Random Forest:
feat_importances.index[0:10]
Index(['concave points_worst', 'perimeter_worst', 'radius_worst', 'concave points_mean', 'area_worst', 'perimeter_mean', 'area_mean', 'concavity_worst', 'area_se', 'perimeter_se'], dtype='object')
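Because no `random_state` is set above, the ranking can change slightly from run to run. The following is a minimal, reproducible sketch. It assumes the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) rather than the Kaggle CSV, so the column names differ (for example `worst concave points` instead of `concave points_worst`) and the target coding is reversed (0 = malignant, 1 = benign). The seed value and the `_demo` variable names are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Bundled copy of the breast-cancer data (assumption: used here instead of data.csv)
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

# Fixing random_state makes the importances repeatable across runs
clf_demo = RandomForestClassifier(n_estimators=50, random_state=42)
clf_demo.fit(X_demo, y_demo)

importances = pd.Series(clf_demo.feature_importances_, index=X_demo.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
```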
Feature Selection with Scikit-Learn
We can also use the feature selection tools that ship with scikit-learn. You can find more details in the documentation. Here are some examples:
k-best
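SelectKBest keeps the k features with the highest scores. In our case, we will score the features with the chi-square test, which requires non-negative feature values. Keep in mind that new_data contains only the selected columns, i.e. the data that remains after the other variables have been removed.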
selector = SelectKBest(score_func=chi2, k=5)
new_data = selector.fit_transform(X, y)
mask = selector.get_support()
new_features = X.columns[mask]
new_features
Index(['perimeter_mean', 'area_mean', 'area_se', 'perimeter_worst', 'area_worst'], dtype='object')
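To see why these five columns are chosen, it can help to inspect the scores that the selector computed. The sketch below again assumes scikit-learn's bundled copy of the dataset (different column names than the CSV). One thing to keep in mind is that the chi-square statistic as computed by `chi2` grows with the magnitude of the feature values, which may explain why the large-valued area and perimeter columns rank highest.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

# Bundled copy of the breast-cancer data (assumption: used here instead of data.csv)
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

selector_demo = SelectKBest(score_func=chi2, k=5).fit(X_demo, y_demo)

# scores_ and pvalues_ hold one entry per input feature
scores = pd.DataFrame(
    {"chi2": selector_demo.scores_, "pvalue": selector_demo.pvalues_},
    index=X_demo.columns,
).sort_values("chi2", ascending=False)

print(scores.head(5))
```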
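FWE
SelectFwe keeps the features whose p-values pass a threshold that controls the family-wise error rate. This is similar to what we did at the beginning with the t-test. It can be done with a chi-square test or with ANOVA (which, in the binary case, is equivalent to the t-test).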
# chi-square
selector = SelectFwe(score_func=chi2, alpha=0.05)
new_data = selector.fit_transform(X, y)
mask = selector.get_support()
new_features = X.columns[mask]
new_features
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se', 'area_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst'], dtype='object')
# ANOVA
selector = SelectFwe(score_func=f_classif, alpha=0.05)
new_data = selector.fit_transform(X, y)
mask = selector.get_support()
new_features = X.columns[mask]
new_features
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'radius_se', 'perimeter_se', 'area_se', 'compactness_se', 'concavity_se', 'concave points_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst'], dtype='object')
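Two claims from this section can be checked numerically. The sketch below assumes scikit-learn's bundled copy of the dataset and compares the ANOVA F-test with the equal-variance t-test for a binary target. It also compares the `SelectFwe` mask with a plain Bonferroni rule (p-value below alpha divided by the number of features), which is my reading of how the selector behaves rather than something stated in the article.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFwe, f_classif

# Bundled copy of the breast-cancer data (assumption: used here instead of data.csv)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# ANOVA F-test and pooled-variance t-test, computed for every column at once
f_stat, p_anova = f_classif(X_demo, y_demo)
t_stat, p_ttest = stats.ttest_ind(X_demo[y_demo == 1], X_demo[y_demo == 0], axis=0)

# With two groups, F should equal t squared and the p-values should agree
same_statistic = np.allclose(f_stat, t_stat ** 2)
same_pvalues = np.allclose(p_anova, p_ttest)

# Compare SelectFwe with a manual Bonferroni threshold
alpha = 0.05
selector_demo = SelectFwe(score_func=f_classif, alpha=alpha).fit(X_demo, y_demo)
bonferroni_mask = selector_demo.pvalues_ < alpha / X_demo.shape[1]
same_mask = np.array_equal(selector_demo.get_support(), bonferroni_mask)

print(same_statistic, same_pvalues, same_mask)
```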
FDR
The false discovery rate (FDR) approach takes multiple comparisons into account. SelectFdr applies the Benjamini-Hochberg procedure to the p-values.
selector = SelectFdr(chi2, alpha=0.05)
new_data = selector.fit_transform(X, y)
mask = selector.get_support()
new_features = X.columns[mask]
new_features
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'radius_se', 'perimeter_se', 'area_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst'], dtype='object')
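To make the multiple-comparison correction more concrete, here is a minimal sketch of the Benjamini-Hochberg step-up rule written with NumPy only. The function name `benjamini_hochberg` and the p-values are made up for illustration; this is not the scikit-learn implementation itself.

```python
import numpy as np


def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a boolean mask of the hypotheses rejected at FDR level alpha."""
    pvalues = np.asarray(pvalues, dtype=float)
    m = pvalues.size
    sorted_p = np.sort(pvalues)
    # The i-th smallest p-value is compared with alpha * i / m
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = sorted_p[sorted_p <= thresholds]
    if passing.size == 0:
        return np.zeros(m, dtype=bool)
    # Reject every hypothesis whose p-value is at most the largest passing one
    return pvalues <= passing.max()


# Hypothetical p-values for eight features
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bh_mask = benjamini_hochberg(pvals)
print(bh_mask)
```

In this toy example, five of the p-values are below an uncorrected 0.05, but only the two smallest survive the correction.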
From the Model
We can keep the most important features derived from the model. Let’s consider again the random forest:
from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(estimator=RandomForestClassifier(n_estimators=50)).fit(X, y)
mask = selector.get_support()
new_features = X.columns[mask]
new_features
Index(['perimeter_mean', 'area_mean', 'concavity_mean', 'concave points_mean', 'radius_worst', 'perimeter_worst', 'area_worst', 'concave points_worst'], dtype='object')
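By default, `SelectFromModel` with a random forest keeps the features whose importance is at least the mean importance, so the number of selected features is not fixed in advance. If you want an exact number instead, one option is to combine `max_features` with `threshold=-np.inf`. The sketch below assumes scikit-learn's bundled copy of the dataset, and the value 5 and the seed are arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Bundled copy of the breast-cancer data (assumption: used here instead of data.csv)
data = load_breast_cancer(as_frame=True)
X_demo, y_demo = data.data, data.target

# threshold=-np.inf disables the importance cut-off, so only max_features applies
selector_demo = SelectFromModel(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    max_features=5,
    threshold=-np.inf,
).fit(X_demo, y_demo)

top_features = X_demo.columns[selector_demo.get_support()]
print(list(top_features))
```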
Discussion
This was an example of how you can get a good idea of which variables are the most important for your model. Keep in mind that you should always run exploratory data analysis. For a binary target, a box plot per group is usually a good choice. Let’s see the box plot of concave points_worst for both benign (0) and malignant (1) tumors. As we can see, the difference between the groups appears to be substantial, and this is what we confirmed by running the tests above.
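A box plot like the one described can be drawn roughly as follows. The sketch assumes scikit-learn's bundled copy of the dataset, where the column is called `worst concave points` and the target coding is reversed compared with the CSV (0 = malignant, 1 = benign).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer

# Bundled copy of the breast-cancer data (assumption: used here instead of data.csv)
frame = load_breast_cancer(as_frame=True).frame
column = "worst concave points"

# One array of values per diagnosis group
groups = [
    frame.loc[frame["target"] == 1, column].to_numpy(),  # benign
    frame.loc[frame["target"] == 0, column].to_numpy(),  # malignant
]

fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticks([1, 2])
ax.set_xticklabels(["benign", "malignant"])
ax.set_ylabel(column)
ax.set_title("Worst concave points by diagnosis")
# In a notebook with %matplotlib inline the figure renders automatically;
# in a script, call plt.show() afterwards.
```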
Finally, you can find a detailed explanation of the Chi-Square test.