Predictive Hacks

The Ultimate Guide of Feature Importance in Python


Feature importance is a score assigned to each feature of a machine learning model that describes how much that feature contributes to the model's predictions. It can help with feature selection and give very useful insights into our data. We will show you how to get it for the most common machine learning models.

We will use the famous Titanic Dataset from Kaggle.

import pandas as pd
import numpy as np

import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import CountVectorizer

#we used only the train dataset from Titanic
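data=pd.read_csv('train.csv')
#keep only the columns we need and drop rows with missing values
#(chaining dropna avoids calling inplace=True on a slice, which triggers SettingWithCopyWarning)
data=data[['Sex','Age','Embarked','Pclass','SibSp','Parch','Survived']].dropna()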

Feature Importance in Sklearn Linear Models

model=LogisticRegression(random_state=1)

features=pd.get_dummies(data[['Sex','Embarked','Pclass','SibSp','Parch']],drop_first=True)
features['Age']=data['Age']

model.fit(features,data['Survived'])

feature_importance=pd.DataFrame({'feature':list(features.columns),'feature_importance':[abs(i) for i in model.coef_[0]]})
feature_importance.sort_values('feature_importance',ascending=False)

#if you don't want the absolute value
#feature_importance=pd.DataFrame({'feature':list(features.columns),'feature_importance':[i for i in model.coef_[0]]})
#feature_importance.sort_values('feature_importance',ascending=False)
      feature     feature_importance
3    Sex_male               2.501471
0      Pclass               1.213811
4  Embarked_Q               0.595491
5  Embarked_S               0.380094
1       SibSp               0.336785
6         Age               0.042501
2       Parch               0.029937

As you can see, we took the absolute value of the coefficients because we want the importance of a feature regardless of whether its effect is negative or positive. If you want to keep the sign, you can remove the abs call from the code. Keep in mind that you will not have this option with tree-based models like Random Forest or XGBoost, whose importances are always non-negative.
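One caveat when ranking by coefficient size: a coefficient is expressed per unit of its feature, so features measured on larger scales (such as Age in years) tend to get smaller coefficients. The following is a minimal sketch on synthetic data (the data, the column names small and large, and the StandardScaler pipeline are our assumptions for illustration, not part of the Titanic example) showing how standardizing the features first makes the magnitudes comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two features that are
# equally informative per standard deviation but live on different scales.
rng = np.random.default_rng(0)
n = 2000
small = rng.normal(0, 1, n)      # standard deviation around 1
large = rng.normal(0, 100, n)    # standard deviation around 100
logit = 1.0 * small + 0.01 * large
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)
X = pd.DataFrame({'small': small, 'large': large})

# Same model, fitted once on raw features and once on standardized features.
raw = LogisticRegression(max_iter=1000).fit(X, y)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

comparison = pd.DataFrame({
    'feature': X.columns,
    'raw_coef': np.abs(raw.coef_[0]),
    'standardized_coef': np.abs(scaled[-1].coef_[0]),
})
print(comparison)
```

On the raw features the coefficient of large looks negligible next to small, while after standardization the two are of similar size.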

Feature Importance in Sklearn Ensemble Models

model=RandomForestClassifier()

model.fit(features,data['Survived'])

feature_importances=pd.DataFrame({'features':features.columns,'feature_importance':model.feature_importances_})
feature_importances.sort_values('feature_importance',ascending=False)
     features  feature_importance
6         Age            0.416853
3    Sex_male            0.288845
0      Pclass            0.145641
1       SibSp            0.063167
2       Parch            0.052152
5  Embarked_S            0.025383
4  Embarked_Q            0.007959
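Because RandomForestClassifier() is created above without a random_state, the exact numbers change from run to run. As a rough sketch (using synthetic data from make_classification and made-up column names f0 to f4, which are our assumptions), we can fit the forest with a few different seeds and look at the mean and spread of the importances. Each run's importances sum to 1.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption for illustration): 3 informative, 2 noise features.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
cols = [f'f{i}' for i in range(X.shape[1])]

# Fit the same model with different seeds and collect the importances.
runs = []
for seed in range(5):
    rf = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
    runs.append(rf.feature_importances_)

# One row per feature, with the mean and standard deviation across seeds.
summary = pd.DataFrame(runs, columns=cols).agg(['mean', 'std']).T
print(summary.sort_values('mean', ascending=False))
```

If the ranking of two features flips between seeds, their difference in importance is probably not meaningful.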

Feature Importance in Statsmodels

model=smf.logit('Survived~Sex+Age+Embarked+Pclass+SibSp+Parch',data=data)
result = model.fit()

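#note: conf_int()[1] is the upper bound of the 95% confidence interval of each coefficient, not the fitted coefficient itself
#the fitted coefficients are in result.params; to rank by those, build the frame from pd.DataFrame(result.params) and rename column 0 instead
feature_importances=pd.DataFrame(result.conf_int()[1]).rename(columns={1:'Coefficients'}).eval("absolute_coefficients=abs(Coefficients)")
feature_importances.sort_values('absolute_coefficients',ascending=False).drop('Intercept')[['absolute_coefficients']]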

               absolute_coefficients
Sex[T.male]                 2.204154
Pclass                      0.959873
Embarked[T.Q]               0.329163
Parch                       0.192208
SibSp                       0.103804
Embarked[T.S]               0.084723
Age                         0.027517
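In statsmodels the fitted coefficients are stored in result.params, while result.conf_int() returns the lower (column 0) and upper (column 1) bounds of their confidence intervals. The sketch below, on synthetic data with hypothetical columns x1, x2 and y (our assumptions, not the Titanic data), puts the coefficient, its interval and its p-value side by side and ranks by the absolute coefficient.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({'x1': rng.normal(size=n), 'x2': rng.normal(size=n)})
p = 1 / (1 + np.exp(-(0.5 + 1.5 * toy['x1'] - 0.5 * toy['x2'])))
toy['y'] = (rng.uniform(size=n) < p).astype(int)

# disp=0 silences the optimizer's convergence message.
result = smf.logit('y ~ x1 + x2', data=toy).fit(disp=0)

ci = result.conf_int()
table = pd.DataFrame({
    'coefficient': result.params,   # fitted coefficients
    'ci_lower': ci[0],              # lower bound of the 95% interval
    'ci_upper': ci[1],              # upper bound of the 95% interval
    'p_value': result.pvalues,
}).drop('Intercept')
table['abs_coefficient'] = table['coefficient'].abs()
print(table.sort_values('abs_coefficient', ascending=False))
```

Keeping the signed coefficient and the interval in the same table makes it easy to see both the direction of the effect and how uncertain it is.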

Feature Importance in XGBoost

model=XGBClassifier()

model.fit(features,data['Survived'])

feature_importances=pd.DataFrame({'features':features.columns,'feature_importance':model.feature_importances_})
print(feature_importances.sort_values('feature_importance',ascending=False))
     features  feature_importance
3    Sex_male            0.657089
0      Pclass            0.163064
1       SibSp            0.067181
6         Age            0.041643
5  Embarked_S            0.029463
2       Parch            0.027073
4  Embarked_Q            0.014488

Feature Importance when using a Word Vectorizer

In most cases, when we deal with text we apply a word vectorizer such as Count or TF-IDF. The features that we feed to our model are then a sparse matrix and not a structured data frame with column names. However, we can still get the feature importances using the following technique.

We are using a dataset from Kaggle about spam or ham message classification. This will be interesting because a word with a high positive coefficient is a word whose presence makes a message more likely to be spam.

df=pd.read_csv('SPAM text message 20170820 - Data.csv')
df.head()
  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...

v = CountVectorizer(ngram_range=(1,1))
x = v.fit_transform(df['Message'])

model=LogisticRegression()
model.fit(x,df['Category'])

#we are not getting the absolute value
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
feature_importance=pd.DataFrame({'feature':v.get_feature_names_out(),'feature_importance':model.coef_[0]})
feature_importance.sort_values('feature_importance',ascending=False).head(10)
       feature     feature_importance
2978     error               2.606383
7982       txt               2.178409
6521  ringtone               1.788390
7640      text               1.777959
8012        uk               1.717855
1824      call               1.709997
6438     reply               1.643512
1975      chat               1.528649
5354       new               1.441076
8519       won               1.436101

Here we can see how useful feature importance can be. In the example above, the word error has the largest coefficient when classifying a message. Because we didn't take the absolute value, the sign is meaningful: scikit-learn sorts the class labels, so ham comes first and positive coefficients point towards spam. In other words, if this word is contained in a message, the message is more likely to be spam.
