Feature importance is a score assigned to the features of a machine learning model that describes how important each feature is to the model's predictions. It can help with feature selection and can give us very useful insights into our data. We will show you how to get it for the most common machine learning models.
We will use the famous Titanic Dataset from Kaggle.
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.feature_extraction.text import CountVectorizer

# we use only the train dataset from the Titanic data
data = pd.read_csv('train.csv')
data = data[['Sex', 'Age', 'Embarked', 'Pclass', 'SibSp', 'Parch', 'Survived']]
data.dropna(inplace=True)
Feature Importance in Sklearn Linear Models
model = LogisticRegression(random_state=1)

features = pd.get_dummies(data[['Sex', 'Embarked', 'Pclass', 'SibSp', 'Parch']], drop_first=True)
features['Age'] = data['Age']

model.fit(features, data['Survived'])

feature_importance = pd.DataFrame({'feature': list(features.columns),
                                   'feature_importance': [abs(i) for i in model.coef_[0]]})
feature_importance.sort_values('feature_importance', ascending=False)

# if you don't want the absolute value
# feature_importance = pd.DataFrame({'feature': list(features.columns),
#                                    'feature_importance': [i for i in model.coef_[0]]})
# feature_importance.sort_values('feature_importance', ascending=False)
feature feature_importance
3 Sex_male 2.501471
0 Pclass 1.213811
4 Embarked_Q 0.595491
5 Embarked_S 0.380094
1 SibSp 0.336785
6 Age 0.042501
2 Parch 0.029937
As you can see, we took the absolute value of the coefficients because we want to rank features by the strength of their effect, regardless of whether that effect is negative or positive. If you want to keep the sign, remove the abs() from the code (see the commented-out lines above). Keep in mind that you will not have this option with tree-based models like Random Forest or XGBoost: their importances are always non-negative, so they tell you nothing about the direction of the effect.
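One caveat worth keeping in mind: the magnitude of a logistic regression coefficient depends on the scale of its feature, so Age looks unimportant here partly because it is measured in years while the dummy columns are 0/1. A minimal sketch of how you could standardize the features before fitting so that the coefficient magnitudes become comparable (the scaling step below is our own addition, not part of the original snippet):

from sklearn.preprocessing import StandardScaler

# standardize every column so the coefficients live on a comparable scale
scaler = StandardScaler()
features_scaled = pd.DataFrame(scaler.fit_transform(features),
                               columns=features.columns, index=features.index)

scaled_model = LogisticRegression(random_state=1)
scaled_model.fit(features_scaled, data['Survived'])

pd.DataFrame({'feature': features.columns,
              'feature_importance': abs(scaled_model.coef_[0])}).sort_values('feature_importance', ascending=False)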
Feature Importance in Sklearn Ensemble Models
model = RandomForestClassifier()
model.fit(features, data['Survived'])

feature_importances = pd.DataFrame({'features': features.columns,
                                    'feature_importance': model.feature_importances_})
feature_importances.sort_values('feature_importance', ascending=False)
features feature_importance
6 Age 0.416853
3 Sex_male 0.288845
0 Pclass 0.145641
1 SibSp 0.063167
2 Parch 0.052152
5 Embarked_S 0.025383
4 Embarked_Q 0.007959
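The values above are the impurity-based importances, which are known to favour high-cardinality numeric features such as Age. As a sanity check (this part is our own addition, not from the original walkthrough), you can compute permutation importances, which measure how much the score drops when each column is shuffled:

from sklearn.inspection import permutation_importance

# shuffle each column several times and record the average drop in accuracy
perm = permutation_importance(model, features, data['Survived'],
                              n_repeats=10, random_state=1)

pd.DataFrame({'features': features.columns,
              'feature_importance': perm.importances_mean}).sort_values('feature_importance', ascending=False)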
Feature Importance in Statsmodels
model = smf.logit('Survived~Sex+Age+Embarked+Pclass+SibSp+Parch', data=data)
result = model.fit()

# conf_int() returns the lower and upper bound of each coefficient's confidence
# interval; column 1 is the upper bound, which is what we rank on here
feature_importances = pd.DataFrame(result.conf_int()[1]).rename(columns={1: 'Coefficients'}).eval("absolute_coefficients=abs(Coefficients)")
feature_importances.sort_values('absolute_coefficients', ascending=False).drop('Intercept')[['absolute_coefficients']]
absolute_coefficients
Sex[T.male] 2.204154
Pclass 0.959873
Embarked[T.Q] 0.329163
Parch 0.192208
SibSp 0.103804
Embarked[T.S] 0.084723
Age 0.027517
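Note that the snippet above ranks the features by the upper bound of each coefficient's confidence interval rather than by the coefficient itself. If you prefer to rank by the fitted coefficients directly, a minimal alternative (our addition) is:

# rank by the absolute value of the fitted coefficients themselves
coefficients = result.params.drop('Intercept')
coefficients.abs().sort_values(ascending=False)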
Feature Importance in XGBoost
model = XGBClassifier()
model.fit(features, data['Survived'])

feature_importances = pd.DataFrame({'features': features.columns,
                                    'feature_importance': model.feature_importances_})
print(feature_importances.sort_values('feature_importance', ascending=False))
features feature_importance
3 Sex_male 0.657089
0 Pclass 0.163064
1 SibSp 0.067181
6 Age 0.041643
5 Embarked_S 0.029463
2 Parch 0.027073
4 Embarked_Q 0.014488
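XGBoost actually defines importance in more than one way (weight, gain, cover, and their totals), and feature_importances_ reports only one of them. A small sketch of how you could inspect the other definitions through the underlying booster (this part is our own addition):

# the underlying booster exposes several importance definitions
booster = model.get_booster()

for importance_type in ['weight', 'gain', 'cover']:
    scores = booster.get_score(importance_type=importance_type)
    # print the three highest-scoring features for each definition
    print(importance_type, sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])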
Feature Importance when using a Word Vectorizer
In most cases, when we are dealing with text we apply a word vectorizer such as CountVectorizer or TF-IDF. The features we feed the model are then a sparse matrix rather than a structured data frame with column names. However, we can still get the feature importances using the following technique.
We are using a Kaggle dataset for spam/ham message classification. This makes the example interesting because the words with high importance are exactly the words that, when they appear in a message, make that message more likely to be spam.
df = pd.read_csv('SPAM text message 20170820 - Data.csv')
df.head()
Category Message
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
v = CountVectorizer(ngram_range=(1, 1))
x = v.fit_transform(df['Message'])

model = LogisticRegression()
model.fit(x, df['Category'])

# this time we are not taking the absolute value
# (on scikit-learn >= 1.0 use v.get_feature_names_out() instead)
feature_importance = pd.DataFrame({'feature': v.get_feature_names(),
                                   'feature_importance': model.coef_[0]})
feature_importance.sort_values('feature_importance', ascending=False).head(10)
feature feature_importance
2978 error 2.606383
7982 txt 2.178409
6521 ringtone 1.788390
7640 text 1.777959
8012 uk 1.717855
1824 call 1.709997
6438 reply 1.643512
1975 chat 1.528649
5354 new 1.441076
8519 won 1.436101
Here we can see how useful feature importance can be. From the example above we get that the word "error" is very important when classifying a message. In other words, because we did not take the absolute value, we can say that if this word is contained in a message, then the message is most likely spam.
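Since we kept the sign of the coefficients, sorting in the opposite direction shows the words that most strongly indicate a legitimate (ham) message; a small sketch of that flip side (the ascending sort is our addition):

# words with the most negative coefficients push a message towards "ham"
feature_importance.sort_values('feature_importance', ascending=True).head(10)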