We have all been in the situation where we didn't know which model was best for our ML project, so we ended up training and evaluating many models just to see how they behaved on our data. This is not a simple task, and it takes time and effort.
Fortunately, we can do it with only a few lines of code using LazyPredict. It will run more than 20 different ML models and return their performance statistics.
Installation
pip install lazypredict
Example
Let’s see an example using the Titanic dataset from Kaggle.
import pandas as pd
import numpy as np
from lazypredict.Supervised import LazyClassifier, LazyRegressor
from sklearn.model_selection import train_test_split

data = pd.read_csv('train.csv')
data.head()
Here, we will try to predict whether a passenger survived the Titanic sinking, so we have a classification problem.
LazyPredict can also do basic data preprocessing, like filling NA values and creating dummy variables. That means we can test the models immediately after reading the data, without getting any errors. However, we can also feed it our own preprocessed data, so the model comparison will be more accurate because it is closer to our final models.
For this example, we will not do any preprocessing and will let LazyPredict do all the work.
# we are selecting the following columns as features for our models
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# fit LazyClassifier
reg = LazyClassifier(ignore_warnings=True, random_state=7, verbose=False)

# we have to pass both the train and test sets so it can evaluate the models
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
models
As you can see, it returns a data frame that contains the models and their statistics. We can see that tree-based models are performing better than the others, so we may want to focus on tree-based models in our final approach.
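The exact scores depend on your data and split, but since the returned `models` object is a plain pandas DataFrame, you can sort and filter it like any other frame. A minimal sketch with made-up scores standing in for real LazyPredict results (the column names are assumptions based on its typical output):

```python
import pandas as pd

# mock of the frame LazyPredict returns (illustrative values, not real results)
models = pd.DataFrame(
    {"Accuracy": [0.81, 0.79, 0.74], "F1 Score": [0.80, 0.78, 0.72]},
    index=["LGBMClassifier", "RandomForestClassifier", "LogisticRegression"],
)

# keep only models above an accuracy threshold, best first
top = models[models["Accuracy"] >= 0.78].sort_values("Accuracy", ascending=False)
print(top.index.tolist())  # ['LGBMClassifier', 'RandomForestClassifier']
```

This makes it easy to shortlist a handful of candidates for proper tuning instead of eyeballing the full table.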
You can also get the complete pipeline and the model parameters that LazyPredict used, as follows.
# we will get the pipeline of LGBMClassifier
reg.models['LGBMClassifier']
Pipeline(steps=[('preprocessor',
ColumnTransformer(transformers=[('numeric',
Pipeline(steps=[('imputer',
SimpleImputer()),
('scaler',
StandardScaler())]),
Index(['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')),
('categorical_low',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('encoding',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
Index(['Sex', 'Embarked'], dtype='object')),
('categorical_high',
Pipeline(steps=[('imputer',
SimpleImputer(fill_value='missing',
strategy='constant')),
('encoding',
OrdinalEncoder())]),
Index([], dtype='object'))])),
('classifier', LGBMClassifier(random_state=7))])
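If you ever want the same kind of preprocessing without LazyPredict, an equivalent pipeline can be assembled by hand in scikit-learn. The sketch below is an assumption-laden re-creation on a tiny made-up frame (not the real Titanic file, and not LazyPredict's internal code), with LogisticRegression standing in for LGBMClassifier:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# tiny illustrative dataset (column names borrowed from the Titanic data)
X = pd.DataFrame({
    "Age": [22.0, None, 35.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})
y = [0, 1, 1, 0]

# numeric columns: impute missing values, then standardize
numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])

# categorical columns: impute with a constant, then one-hot encode
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("encoding", OneHotEncoder(handle_unknown="ignore")),
])

pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("numeric", numeric, ["Age", "Fare"]),
        ("categorical", categorical, ["Sex", "Embarked"]),
    ])),
    ("classifier", LogisticRegression()),  # stand-in for LGBMClassifier
])

pipe.fit(X, y)
print(pipe.predict(X))
```

Wrapping the preprocessing and the classifier in one Pipeline means a single `fit`/`predict` call handles imputation and encoding consistently for both training and new data.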
Also, you can use the complete model pipeline for prediction.
reg.models['LGBMClassifier'].predict(X_test)
array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0,
0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
0, 0, 1], dtype=int64)
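These predictions can then be scored against the held-out labels with the usual scikit-learn metrics. A sketch with made-up arrays standing in for `y_test` and the predictions above:

```python
from sklearn.metrics import accuracy_score

y_test = [0, 1, 1, 0, 1, 0]  # made-up hold-out labels
y_pred = [0, 1, 0, 0, 1, 0]  # made-up model predictions

print(accuracy_score(y_test, y_pred))  # → 0.8333333333333334 (5 of 6 correct)
```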
In the same way as LazyClassifier, we can use LazyRegressor to test models for regression problems.
Summing it up
LazyPredict can help us get a basic understanding of which models perform better on our data. It can be run with almost no data preprocessing, so we can test models immediately after reading the data.
It is worth noting that there are other ways to do automated machine learning model testing, such as auto-sklearn, but it is much harder to install, especially on Windows.