In this post, we will show some techniques you can use to prevent overfitting in neural networks when working with TensorFlow 2.0. We will apply the following techniques at the same time: L2 regularization, dropout, batch normalization, and early stopping.
We will work with the diabetes dataset provided by scikit-learn.
Load the Data
We will load the data and split it into train and test sets (90%-10%).
import tensorflow as tf
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense, BatchNormalization
from tensorflow.keras import regularizers
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

diabetes_dataset = load_diabetes()
data = diabetes_dataset['data']
targets = diabetes_dataset['target']

train_data, test_data, train_targets, test_targets = train_test_split(data, targets, test_size=0.1)
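If you want a quick sanity check on the 90-10 split, you can print the array shapes (this is just an optional check, not part of the workflow above):

# Sanity check on the split sizes (diabetes has 442 samples and 10 features)
print(data.shape)        # (442, 10)
print(train_data.shape)  # roughly 90% of the rows
print(test_data.shape)   # roughly 10% of the rows (45 samples)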
Build the Model
We will build a model with six hidden layers of 128 neurons each, adding L2 regularization with a rate of 0.00001 to every hidden layer and a dropout of 30% between layers. Finally, we will add one batch normalization layer after the first hidden layer.
wd = 0.00001
rate = 0.3

model = Sequential([
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu",
          input_shape=(train_data.shape[1],)),
    BatchNormalization(),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dense(1)
])

model.summary()
The model summary:
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_7 (Dense) (None, 128) 1408 _________________________________________________________________ batch_normalization_1 (Batch (None, 128) 512 _________________________________________________________________ dropout_5 (Dropout) (None, 128) 0 _________________________________________________________________ dense_8 (Dense) (None, 128) 16512 _________________________________________________________________ dropout_6 (Dropout) (None, 128) 0 _________________________________________________________________ dense_9 (Dense) (None, 128) 16512 _________________________________________________________________ dropout_7 (Dropout) (None, 128) 0 _________________________________________________________________ dense_10 (Dense) (None, 128) 16512 _________________________________________________________________ dropout_8 (Dropout) (None, 128) 0 _________________________________________________________________ dense_11 (Dense) (None, 128) 16512 _________________________________________________________________ dropout_9 (Dropout) (None, 128) 0 _________________________________________________________________ dense_12 (Dense) (None, 128) 16512 _________________________________________________________________ dense_13 (Dense) (None, 1) 129 ================================================================= Total params: 84,609 Trainable params: 84,353 Non-trainable params: 256
Compile the Model
We will compile the model using the Adam optimizer and the mse loss function. As a metric, we will report the mae.
# Compile the model
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
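Equivalently, you could pass an optimizer instance if you want control over the learning rate; this is just an optional variant, with 0.001 being Adam's default:

# Equivalent compile call with an explicit optimizer object
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='mse',
              metrics=['mae'])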
Train the Model
We will train the model for up to 1000 epochs, keeping an internal validation set of 15% and using a batch size of 64. Finally, we will apply early stopping with patience equal to 20, meaning that if the model has not improved on the validation set for 20 consecutive epochs, training stops.
model_history = model.fit(train_data, train_targets, epochs=1000,
                          validation_split=0.15, batch_size=64, verbose=False,
                          callbacks=[tf.keras.callbacks.EarlyStopping(patience=20)])
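By default, EarlyStopping monitors val_loss. If you also want the model to roll back to the weights of its best epoch once training stops, you can pass restore_best_weights=True; this is an optional variant, not used in the run above:

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # quantity watched for improvement (the default)
    patience=20,                  # stop after 20 epochs without improvement
    restore_best_weights=True     # revert to the best weights seen on the validation set
)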
Plot the Learning Curves
We can plot the loss of the model across epochs for the training and validation datasets:
# Plot the training and validation loss
plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('Loss vs. epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()
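Since we also track the mae metric, we could plot its learning curve in exactly the same way (an optional extra plot):

# Plot the training and validation mae
plt.plot(model_history.history['mae'])
plt.plot(model_history.history['val_mae'])
plt.title('MAE vs. epochs')
plt.ylabel('MAE')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()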
As we can see, the model ran for only 36 epochs and then stopped, since the validation loss was no longer improving. We can evaluate the model on the test dataset:
# Evaluate the model on the test set
model.evaluate(test_data, test_targets, verbose=2)
45/1 - 0s - loss: 4512.4256 - mae: 54.0345
[4839.613487413195, 54.034515]
Let’s compare the regularized model with a model that uses none of the techniques mentioned above.
Model without Regularization
model = Sequential([ Dense(128, activation="relu", input_shape=(train_data.shape[1],)), Dense(128, activation="relu"), Dense(128, activation="relu"), Dense(128, activation="relu"), Dense(128, activation="relu"), Dense(128, activation="relu"), Dense(1) ]) import matplotlib.pyplot as plt plt.plot(model_history.history['loss']) plt.plot(model_history.history['val_loss']) plt.title('Loss vs. epochs') plt.ylabel('Loss') plt.xlabel('Epoch') plt.legend(['Training', 'Validation'], loc='upper right') plt.show()
# Evaluate the model on the test set
model.evaluate(test_data, test_targets, verbose=2)
45/1 - 0s - loss: 8459.9902 - mae: 66.0895
[7391.370073784722, 66.08951]
As we can see, without regularization the mae on the test dataset was 66.09 compared to 54.03, and the loss was 7391 compared to 4839. Finally, we also saved a lot of training time, since the regularized model ran for only 36 epochs thanks to early stopping, compared to 1000 epochs without it.
The Takeaway
When we build a neural network there is a high chance of overfitting, much more so than with bagging and boosting algorithms like Random Forest, Gradient Boosting, XGBoost, LightGBM, etc. For that reason, we should be very careful and always add regularization techniques, especially when we build complicated neural networks with many layers and many neurons, and when we train for many epochs, which raises the risk of overfitting even further.
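As a compact reference, here is a minimal sketch of the pattern used in this post, combining L2 regularization, dropout, batch normalization, and early stopping. The helper function build_regularized_model is just an illustrative name, and the network is shorter than the six-layer model above; the rates and sizes are the example values from this post.

def build_regularized_model(input_dim, wd=1e-5, rate=0.3):
    # L2 weight decay on every Dense layer, dropout between layers,
    # and one batch normalization layer after the first hidden layer
    return Sequential([
        Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu",
              input_shape=(input_dim,)),
        BatchNormalization(),
        Dropout(rate),
        Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
        Dropout(rate),
        Dense(1)
    ])

model = build_regularized_model(train_data.shape[1])
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Early stopping keeps the number of epochs (and the overfitting risk) in check
model.fit(train_data, train_targets, epochs=1000, validation_split=0.15,
          batch_size=64, verbose=False,
          callbacks=[tf.keras.callbacks.EarlyStopping(patience=20)])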