Predictive Hacks

How to Prevent Overfitting in Neural Networks with TensorFlow 2.0

In this post, we show several techniques for preventing overfitting in neural networks when working with TensorFlow 2.0. We will apply the following techniques at the same time: L2 regularization of the weights, dropout, batch normalization and early stopping.

We will work with the diabetes dataset provided by scikit-learn.

We will load the data and split it into training and test sets (90-10).

import tensorflow as tf
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout, Dense, BatchNormalization
from tensorflow.keras import regularizers

import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

# Load the diabetes dataset (442 samples, 10 features)
diabetes_dataset = load_diabetes()
data = diabetes_dataset['data']
targets = diabetes_dataset['target']

# Hold out 10% of the samples as the test set
train_data, test_data, train_targets, test_targets = train_test_split(data, targets, test_size=0.1)


Build the Model

We will build a model with 6 hidden layers of 128 neurons each, adding L2 regularization with a rate of 0.00001 to every hidden layer and a dropout of 30% after each of the first five. We also add one batch normalization layer after the first hidden layer.
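To make the two regularizers concrete, here is a minimal plain-Python sketch of what they compute. It assumes the definitions given in the Keras documentation: `regularizers.l2(wd)` adds `wd` times the sum of squared weights to the loss, and `Dropout(rate)` zeroes each input with probability `rate` during training and scales the survivors by `1 / (1 - rate)`. The helper names and the small weight matrix are hypothetical and only for illustration.

```python
import random

def l2_penalty(weights, wd):
    """L2 penalty: wd times the sum of squared weights."""
    return wd * sum(w * w for row in weights for w in row)

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

# Hypothetical 2x3 weight matrix, for illustration only
weights = [[0.5, -1.0, 2.0], [0.0, 1.5, -0.5]]
penalty = l2_penalty(weights, wd=0.00001)

# Apply 30% dropout to ten activations that are all equal to 1.0
rng = random.Random(0)
dropped = inverted_dropout([1.0] * 10, rate=0.3, rng=rng)

print(penalty)
print(dropped)
```

With such a small `wd`, the penalty contributes very little to the loss, so most of the regularization in this model is likely to come from dropout and early stopping.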

wd = 0.00001
rate = 0.3

model = Sequential([
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu", input_shape=(train_data.shape[1],)),
    BatchNormalization(),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dropout(rate),
    Dense(128, kernel_regularizer=regularizers.l2(wd), activation="relu"),
    Dense(1)
])

model.summary()


The model summary:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_7 (Dense)              (None, 128)               1408
_________________________________________________________________
batch_normalization_1 (Batch (None, 128)               512
_________________________________________________________________
dropout_5 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_8 (Dense)              (None, 128)               16512
_________________________________________________________________
dropout_6 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_9 (Dense)              (None, 128)               16512
_________________________________________________________________
dropout_7 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_10 (Dense)             (None, 128)               16512
_________________________________________________________________
dropout_8 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_11 (Dense)             (None, 128)               16512
_________________________________________________________________
dropout_9 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_12 (Dense)             (None, 128)               16512
_________________________________________________________________
dense_13 (Dense)             (None, 1)                 129
=================================================================
Total params: 84,609
Trainable params: 84,353
Non-trainable params: 256
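The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas: a dense layer has one weight per input-output pair plus one bias per output, and a batch normalization layer keeps four values per feature, of which the moving mean and variance are not trainable. The helper names are our own.

```python
def dense_params(n_in, n_out):
    # Weights (n_in * n_out) plus one bias per output unit
    return n_in * n_out + n_out

def batch_norm_params(n_features):
    # gamma and beta are trainable; moving mean and variance are not
    trainable = 2 * n_features
    non_trainable = 2 * n_features
    return trainable, non_trainable

n_features = 10  # the diabetes dataset has 10 input features
hidden = 128

first = dense_params(n_features, hidden)    # first hidden layer
middle = 5 * dense_params(hidden, hidden)   # remaining five hidden layers
output = dense_params(hidden, 1)            # output layer
bn_train, bn_fixed = batch_norm_params(hidden)

total = first + middle + output + bn_train + bn_fixed
trainable = total - bn_fixed

print(first, dense_params(hidden, hidden), output)
print(total, trainable, bn_fixed)
```

The totals agree with the 84,609 parameters, 84,353 trainable and 256 non-trainable, reported by `model.summary()`.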


Compile the Model

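We will compile the model using the Adam optimizer, with the mean squared error (MSE) as the loss function. As a metric we will report the mean absolute error (MAE).

# Compile the model

model.compile(optimizer='adam', loss='mse', metrics=['mae'])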
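For reference, the loss and the metric named above can be computed by hand as in the following sketch. The targets and predictions are made-up numbers, not values from the diabetes dataset.

```python
def mse(y_true, y_pred):
    # Mean of the squared errors
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # Mean of the absolute errors
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up targets and predictions, for illustration only
y_true = [150.0, 100.0, 200.0]
y_pred = [140.0, 120.0, 170.0]

print(mse(y_true, y_pred), mae(y_true, y_pred))
```

Because the MSE squares each error, a few large misses dominate it, while the MAE stays in the units of the target and is easier to read.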


Train the Model

We will train the model for up to 1000 epochs, holding out 15% of the training data as an internal validation set and using a batch size of 64. Finally, we apply early stopping with patience equal to 20, meaning that training stops if the validation loss has not improved for 20 consecutive epochs.

model_history = model.fit(train_data, train_targets, epochs=1000,
                          validation_split=0.15, batch_size=64, verbose=False,
                          callbacks=[tf.keras.callbacks.EarlyStopping(patience=20)])
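The patience rule can be illustrated with a simplified plain-Python sketch. This is not the Keras implementation, which also supports options such as `min_delta` and `restore_best_weights`. It only mimics the basic idea of counting epochs without improvement. The function name and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return how many epochs run before a patience-based stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # Patience never ran out: all epochs were used
    return len(val_losses)

# Made-up validation losses: improve for 3 epochs, then plateau
losses = [5.0, 4.0, 3.0, 3.1, 3.2, 3.05, 3.3]
stopped_at = early_stopping_epoch(losses, patience=3)
print(stopped_at)
```

Under this simplified logic, a run with patience 20 stops 20 epochs after its best validation loss, so the weights at the end of training are not necessarily the best ones seen.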


Plot the Learning Curves

We can plot the Loss of the model across epochs for the “training” and “validation” dataset:

# Plot the training and validation loss

plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('Loss vs. epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()


As we can see, the model ran for only 36 epochs and then stopped because the validation loss was no longer improving. We can evaluate the model on the test dataset:

# Evaluate the model on the test set

model.evaluate(test_data, test_targets, verbose=2)

45/1 - 0s - loss: 4512.4256 - mae: 54.0345
[4839.613487413195, 54.034515]


Let’s compare the regularized model with a model that uses none of the techniques mentioned above.

Model without Regularization

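model = Sequential([
    Dense(128, activation="relu", input_shape=(train_data.shape[1],)),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(128, activation="relu"),
    Dense(1)
])

# Compile and train without early stopping
# (optimizer, loss, validation split and batch size are assumed to match the regularized run)
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

model_history = model.fit(train_data, train_targets, epochs=1000,
                          validation_split=0.15, batch_size=64, verbose=False)

# Plot the training and validation loss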

plt.plot(model_history.history['loss'])
plt.plot(model_history.history['val_loss'])
plt.title('Loss vs. epochs')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Training', 'Validation'], loc='upper right')
plt.show()

# Evaluate the model on the test set

model.evaluate(test_data, test_targets, verbose=2)

45/1 - 0s - loss: 8459.9902 - mae: 66.0895
[7391.370073784722, 66.08951]


As we can see, without regularization the MAE on the test dataset was 66.09, compared to 54.03 with regularization. The loss was 7391 compared to 4839. Finally, we saved a lot of time, since the regularized model trained for only 36 epochs, compared to the full 1000 epochs without early stopping.
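To put the two results side by side, the relative reduction can be computed directly from the numbers reported above. This is a quick sketch, and since the figures come from a single random train-test split, they will vary from run to run.

```python
# Test-set results reported above
mae_plain, mae_reg = 66.09, 54.03
loss_plain, loss_reg = 7391.37, 4839.61

def relative_reduction(before, after):
    # Fraction by which `after` is lower than `before`
    return (before - after) / before

mae_drop = relative_reduction(mae_plain, mae_reg)
loss_drop = relative_reduction(loss_plain, loss_reg)

print(f"MAE reduced by {mae_drop:.1%}, loss reduced by {loss_drop:.1%}")
```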

The Takeaway

When we build a neural network, there is a high chance of overfitting. Bagging and boosting algorithms such as Random Forest, Gradient Boosting, XGBoost and LightGBM tend to be less prone to this with their default settings, although they can overfit too. For that reason, we should be very careful and always consider regularization techniques, especially when we build complicated neural networks with many layers and many neurons. The risk of overfitting is even higher when we train for many epochs.
