Predictive Hacks

# How to Fine-Tune an NLP Regression Model with Transformers and HuggingFace

HuggingFace provides state-of-the-art pre-trained models that can be used in many different applications. In this post, we will show you how to use a pre-trained model for a regression problem. The pre-trained model that we are going to use is DistilBERT, a lighter and faster version of the famous BERT that retains about 95% of its performance.

Suppose that we have the text of online ads and each ad's response rate, normalized within its ad set. Our goal is to build a Machine Learning model that predicts the performance of an ad from its text.

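Let’s start coding by importing the necessary libraries and loading our data:

import numpy as np
import pandas as pd

import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import Dataset  # used below to convert the DataFrame

# X is a pandas DataFrame with a "text" column and a float "label" column
X.head(3)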

The text represents the ad text and the label is the normalized response rate.
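The post does not show how the label was built. As one possible reading, the sketch below z-scores a raw response rate within each ad set, so that an ad is compared only against other ads in the same set. The column names ad_set and response_rate and the toy values are assumptions made for illustration, not the author's data.

```python
import pandas as pd

# Toy data; the ad_set and response_rate columns are hypothetical.
raw = pd.DataFrame({
    "ad_set": ["a", "a", "a", "b", "b", "b"],
    "text": ["Sale ends today", "New arrivals", "Free shipping",
             "Last chance", "Just dropped", "Members only"],
    "response_rate": [0.01, 0.02, 0.03, 0.10, 0.20, 0.30],
})

# Z-score the response rate within each ad set (one possible normalization).
grouped = raw.groupby("ad_set")["response_rate"]
raw["label"] = (raw["response_rate"] - grouped.transform("mean")) / grouped.transform("std")

# Keep only the two columns the rest of the post relies on.
X = raw[["text", "label"]]
```

After this step, ads from a high-traffic set and a low-traffic set share the same scale, which is what lets a single model learn from both.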

## Pandas To Dataset

In order to use our data for training, we need to convert the Pandas DataFrame into the ‘Dataset’ format of the datasets library. We also want to split the data into train and test sets so that we can evaluate the model. Both can be done by running the following:

dataset = Dataset.from_pandas(X, preserve_index=False)
dataset = dataset.train_test_split(test_size=0.3)  # 70% train, 30% test

dataset

As you can see, the dataset object contains both train and test sets. You can still access the data as shown below:

dataset['train']['text'][:5]

## Tokenization & How To Add New Tokens

We will use a pre-trained model so we need to import its tokenizer and tokenize our data.

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Let’s tokenize a sentence and see what we got:

tokenizer('🚨 JUNE DROP LIVE 🚨')['input_ids']

We can decode these ids and see the actual tokens:

[tokenizer.decode(i) for i in tokenizer('🚨 JUNE DROP LIVE 🚨')['input_ids']]

The [CLS] and [SEP] are special tokens that the tokenizer always adds at the beginning and at the end of the input. As you can see, the emoji ‘🚨’ has been replaced by the [UNK] token, which means that the token is unknown. This is because the vocabulary of the pre-trained DistilBERT model does not contain emojis. However, we can add more tokens to the tokenizer so that their embeddings are learned when we fine-tune the model on our data. Let’s add some emojis to our tokenizer.

# add_tokens accepts a list and skips tokens that are already in the vocabulary
tokenizer.add_tokens(['🚨', '🙂', '😍', '✌️', '🤩'])

Now, if you tokenize the sentence again, you will see that the emoji is kept as its own token instead of being replaced by [UNK].

[tokenizer.decode(i) for i in tokenizer('🚨 JUNE DROP LIVE 🚨')['input_ids']]
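To make the effect of adding tokens concrete, here is a toy sketch with a tiny hand-made vocabulary. It is not how DistilBERT's WordPiece tokenizer works internally; the vocabulary, the encode helper and the add_tokens helper are hypothetical stand-ins that only mimic the fall-back-to-[UNK] behaviour.

```python
# Toy vocabulary lookup; the real DistilBERT vocabulary has roughly 30k entries.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "june": 3, "drop": 4, "live": 5}

def encode(text, vocab):
    """Map lowercased, whitespace-separated words to ids, falling back to [UNK]."""
    words = text.lower().split()
    ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

def add_tokens(new_tokens, vocab):
    """Append unseen tokens at the end of the vocabulary; return how many were added."""
    added = 0
    for token in new_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
            added += 1
    return added

before = encode("🚨 JUNE DROP LIVE 🚨", vocab)  # the emoji maps to the [UNK] id
add_tokens(["🚨"], vocab)
after = encode("🚨 JUNE DROP LIVE 🚨", vocab)   # the emoji now has its own id
```

Because new tokens receive ids past the end of the original vocabulary, the model's embedding table has to grow to match, which is why the embeddings are resized later in the post.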

The next step is to tokenize the data.

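def tokenize_function(examples):
    # pad or truncate every text to the model's maximum input length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)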
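Training in batches requires every sequence in a batch to have the same length, which tokenizers usually achieve by padding short inputs and truncating long ones. The helper below is a simplified, hypothetical illustration of that idea with made-up ids; unlike a real tokenizer, it does not preserve the closing special token when truncating.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return fixed-length ids plus an attention mask (1 = real token, 0 = padding)."""
    ids = list(input_ids[:max_length])
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# A short sequence is padded, a long one is cut off.
short_ids, short_mask = pad_or_truncate([101, 2238, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 7, 8, 9, 10, 11, 102], max_length=5)
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the prediction.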

## Fine-Tuning The Model

It’s time to import the pre-trained model.

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1)

According to the documentation, for regression problems, we have to pass num_labels=1.
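With num_labels=1, the head on top of the encoder outputs a single number per text, and the Transformers library treats the task as regression with a mean squared error loss, so the labels should be floats. The NumPy sketch below only illustrates that idea; the shapes, random values and variable names are made up and it is not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled sentence vectors: 4 examples, hidden size 8.
hidden = rng.normal(size=(4, 8))
labels = np.array([0.5, -1.0, 0.0, 2.0])

# A regression head is a linear layer with a single output unit.
weights = rng.normal(size=(8, 1))
bias = np.zeros(1)
logits = hidden @ weights + bias   # shape (4, 1)

# Mean squared error between the single output and the float label.
predictions = logits.squeeze(-1)   # shape (4,)
mse_loss = np.mean((predictions - labels) ** 2)
```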

Now, we need to resize the token embeddings because we added more tokens to our tokenizer.

model.resize_token_embeddings(len(tokenizer))
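Conceptually, resizing keeps the pre-trained embedding rows and appends freshly initialized rows for the new tokens, which are then learned during fine-tuning. The following NumPy sketch shows that idea on a toy matrix; the resize_embeddings helper and the 0.02 initialization scale are assumptions for illustration, not the library's actual code.

```python
import numpy as np

def resize_embeddings(embeddings, new_vocab_size, rng):
    """Return a copy with rows added (or dropped) to match new_vocab_size."""
    old_vocab_size, dim = embeddings.shape
    resized = rng.normal(scale=0.02, size=(new_vocab_size, dim))
    rows_to_copy = min(old_vocab_size, new_vocab_size)
    resized[:rows_to_copy] = embeddings[:rows_to_copy]  # keep the pre-trained rows
    return resized

rng = np.random.default_rng(0)
pretrained = rng.normal(size=(10, 4))             # toy vocabulary of 10 tokens
resized = resize_embeddings(pretrained, 15, rng)  # room for 5 new tokens
```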

### Metrics Function

In a regression problem, you are trying to predict a continuous value, so you need metrics that measure the distance between the predicted value and the true value. The most common metrics are MSE (Mean Squared Error) and RMSE (Root Mean Squared Error). For this application, we will use RMSE, and we need to create a function that computes it during training and evaluation.

from sklearn.metrics import mean_squared_error

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # RMSE is the square root of the mean squared error
    rmse = np.sqrt(mean_squared_error(labels, predictions))
    return {"rmse": rmse}
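As a quick sanity check of what RMSE measures, the small worked example below computes it by hand with NumPy. The numbers are made up; the predictions are given the shape (n, 1) because a single-output model returns one column per example.

```python
import numpy as np

labels = np.array([0.0, 1.0, 2.0, 3.0])
predictions = np.array([[0.0], [1.0], [2.0], [5.0]])  # shape (4, 1)

errors = predictions.squeeze(-1) - labels  # [0, 0, 0, 2]
mse = np.mean(errors ** 2)                 # (0 + 0 + 0 + 4) / 4 = 1.0
rmse = np.sqrt(mse)                        # sqrt(1.0) = 1.0
```

RMSE is expressed in the same units as the label, which makes it easier to interpret than MSE.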

### Train The Model

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="test_trainer",
    logging_strategy="epoch",
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    save_total_limit=2,
    save_strategy="no",  # no checkpoints are written during training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)
trainer.train()

## Save And Load The Pre-Trained Model And Tokenizer

To save and load the model, run the following:

# save the model/tokenizer
model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")

# load the model/tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("model")
tokenizer = AutoTokenizer.from_pretrained("tokenizer")

## How To Use The Model

Once we have loaded the tokenizer and the model, we can use the Trainer from Transformers to get predictions from text input. I created a function that takes the text as input and returns the prediction. The steps we need to follow are:

1. Add the text to a dataframe, in a column called text.
2. Convert the dataframe into a dataset.
3. Tokenize the dataset.
4. Make the prediction using the trainer.

Of course, you can also skip the function and pass more than one input at once. This will be faster, as the trainer makes predictions in batches.

from transformers import Trainer
trainer = Trainer(model=model)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

def pipeline_prediction(text):
    df = pd.DataFrame({'text': [text]})
    dataset = Dataset.from_pandas(df, preserve_index=False)
    tokenized_datasets = dataset.map(tokenize_function)
    raw_pred, _, _ = trainer.predict(tokenized_datasets)
    return raw_pred[0][0]

pipeline_prediction("🚨 Get 50% now!")
-0.019468416
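The batching idea mentioned above can be sketched in plain Python. The helper below is hypothetical and uses a stand-in scoring function instead of the fine-tuned model; the trainer already batches a multi-row dataset internally, so this only illustrates why passing many texts at once costs fewer model calls than predicting them one by one.

```python
def predict_in_batches(texts, predict_batch, batch_size=16):
    """Call predict_batch on consecutive slices of texts and collect the results."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        predictions.extend(predict_batch(batch))
    return predictions

# Stand-in for the model: scores each text by its length (hypothetical).
def fake_predict_batch(batch):
    return [float(len(text)) for text in batch]

scores = predict_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_predict_batch, batch_size=2
)
```

Here five texts are handled in three calls instead of five, and the results come back in the original order.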

## Summing It Up

In this post, we showed you how to use pre-trained models for regression problems. We used HuggingFace’s transformers library to load the pre-trained DistilBERT model and fine-tune it on our data. We think that transformer models are very powerful and, if used right, can lead to much better results than more classic approaches such as word2vec embeddings and TF-IDF.

If you want to learn more about HuggingFace models you can check out their documentation on transformers.
