Definition of Decision Boundary
In classification problems with two or more classes, a decision boundary is a hypersurface that partitions the underlying vector space into regions, one for each class. Andrew Ng gives a nice example of the decision boundary in logistic regression: the model predicts class 1 whenever the estimated probability is at least 0.5, which happens exactly where the linear score is non-negative, so the boundary is the set of points where that score equals zero.
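As a quick numerical illustration of that condition (a minimal sketch with made-up parameters, not Andrew Ng's example):

import numpy as np

def sigmoid(z):
    # Logistic function: maps a linear score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted parameters for two features (illustration only)
w = np.array([1.5, -2.0])
b = 0.25

x = np.array([0.8, 0.7])   # an example point
score = w @ x + b          # linear score
prob = sigmoid(score)      # predicted probability of class 1
pred = int(score >= 0)     # class 1 iff score >= 0, i.e., prob >= 0.5
print(prob, pred)          # ~0.512, 1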
Some algorithms produce linear decision boundaries (like logistic regression) and others produce non-linear ones (like Random Forest). Let's create a dummy dataset with two explanatory variables and a binary target, and look at the decision boundaries of different algorithms.
Create the Dummy Dataset
Using scikit-learn, we will create a dummy dataset of 200 rows with 2 informative independent variables and a binary target.
from sklearn.datasets import make_classification

# 200 samples, 2 informative features, no redundant features, 2 classes
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=1)
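Before fitting any model, it helps to look at the data. A quick scatter plot (a minimal sketch) colors each point by its class:

import matplotlib.pyplot as plt

# Scatter the two explanatory variables, colored by class label
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Dummy dataset: two classes')
plt.show()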
Plot the Decision Boundary of Each Classifier
We will compare the following 6 classification algorithms:
- Logistic Regression
- Decision Tree
- Random Forest
- Support Vector Machines (SVM)
- Naive Bayes
- Neural Network
We will work with the mlxtend library. For simplicity, we keep the default parameters of every algorithm (apart from setting gamma='auto' for the SVC).
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from mlxtend.plotting import plot_decision_regions
%matplotlib inline

# Initializing Classifiers
clf1 = LogisticRegression()
clf2 = DecisionTreeClassifier()
clf3 = RandomForestClassifier()
clf4 = SVC(gamma='auto')
clf5 = GaussianNB()
clf6 = MLPClassifier()

# Lay out a 3x2 grid of subplots, one per classifier
gs = gridspec.GridSpec(3, 2)
fig = plt.figure(figsize=(14, 10))
labels = ['Logistic Regression', 'Decision Tree', 'Random Forest',
          'SVM', 'Naive Bayes', 'Neural Network']

for clf, lab, grd in zip([clf1, clf2, clf3, clf4, clf5, clf6],
                         labels,
                         [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    plot_decision_regions(X=X, y=y, clf=clf, legend=2)
    plt.title(lab)

plt.show()
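Because logistic regression is a linear model, its boundary can be read directly off the fitted parameters. The sketch below assumes clf1 has already been fitted by the loop above; it prints the coefficients and overlays the line where the linear score is zero:

import numpy as np
import matplotlib.pyplot as plt

# The logistic regression boundary is the line w1*x1 + w2*x2 + b = 0
w1, w2 = clf1.coef_[0]
b = clf1.intercept_[0]
print(f"Boundary: {w1:.3f}*x1 + {w2:.3f}*x2 + {b:.3f} = 0")

# Overlay the line on the data (solving for x2 in terms of x1)
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(w1 * x1 + b) / w2
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k')
plt.plot(x1, x2, 'k--', label='Decision boundary')
plt.legend()
plt.show()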
Discussion
Clearly, logistic regression has a linear decision boundary, whereas the tree-based algorithms (Decision Tree and Random Forest) create rectangular, axis-aligned partitions. Gaussian Naive Bayes yields a quadratic decision boundary in general (it reduces to a linear one when the classes share the same variance), which is what we see in our case. SVMs can capture many different boundaries depending on the gamma and the kernel, and the same flexibility applies to Neural Networks.
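To see how gamma shapes the SVM's boundary, we can refit the RBF-kernel SVC with a few values and reuse the same plotting helper. This is a minimal sketch, and the specific gamma values are arbitrary: smaller values give smoother, nearly linear boundaries, while larger values produce increasingly wiggly regions.

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from sklearn.svm import SVC

fig = plt.figure(figsize=(14, 4))
for i, gamma in enumerate([0.1, 1, 10]):
    # Refit the SVC with a different kernel width each time
    svm = SVC(gamma=gamma).fit(X, y)
    ax = plt.subplot(1, 3, i + 1)
    plot_decision_regions(X=X, y=y, clf=svm, legend=2)
    plt.title(f'SVC (RBF kernel), gamma={gamma}')
plt.show()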