
How to Fine-Tune an NLP Regression Model with Transformers and HuggingFace


HuggingFace provides us with state-of-the-art pre-trained models that can be used in many different applications. In this post, we will show you how to use a pre-trained model for a regression problem. The pre-trained model that we are going to use is DistilBERT, a lighter and faster version of the famous BERT that retains about 97% of its language-understanding performance.

Suppose that we have the text from online ads and its response rate normalized by the ad set. Our goal is to create a Machine Learning model that can predict the performance of an ad.

Let’s start coding by importing the necessary libraries and loading our data:

import numpy as np
import pandas as pd

from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

X = pd.read_csv('ad_data.csv')
X.head(3)

The text represents the ad text and the label is the normalized response rate.
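
If you don’t have the CSV at hand, a toy stand-in with the same two columns lets you follow along (the ads and response rates below are made up for illustration):

# hypothetical stand-in for ad_data.csv, values invented for illustration
X = pd.DataFrame({
    'text': ['🚨 JUNE DROP LIVE 🚨',
             'Summer sale starts today 😍',
             'Get 50% off now'],
    'label': [0.32, 0.18, 0.25],  # response rate, normalized by ad set
})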

Pandas To Dataset

In order to use our data for training, we need to convert the pandas DataFrame into the ‘Dataset’ format. Also, we want to split the data into train and test sets so we can evaluate the model. Both can be done easily by running the following:

dataset = Dataset.from_pandas(X, preserve_index=False)
dataset = dataset.train_test_split(test_size=0.3)

dataset
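With a 70/30 split, the dataset object should print something along these lines (row counts depend on your data):

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: ...
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: ...
    })
})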

As you can see, the dataset object contains both train and test sets. You can still access the data as shown below:

dataset['train']['text'][:5]

Tokenization & How To Add New Tokens

We will use a pre-trained model, so we need to load its tokenizer and tokenize our data.

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Let’s tokenize a sentence and see what we get:

tokenizer('🚨 JUNE DROP LIVE 🚨')['input_ids']

We can decode these ids and see the actual tokens:

[tokenizer.decode(i) for i in tokenizer('🚨 JUNE DROP LIVE 🚨')['input_ids']]
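
For distilbert-base-uncased, the decoded tokens should look something like this:

['[CLS]', '[UNK]', 'june', 'drop', 'live', '[UNK]', '[SEP]']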

The [CLS] and [SEP] are special tokens that always appear at the beginning and the end of a sequence. As you can see, the emoji ‘🚨’ has been replaced by the [UNK] token, which means the token is unknown. This is because the pre-trained DistilBERT tokenizer does not have emojis in its vocabulary. However, we can add new tokens to the tokenizer, and their embeddings will be learned when we fine-tune the model on our data. Let’s add some emojis to our tokenizer.

# add_tokens also accepts a list of new tokens
tokenizer.add_tokens(['🚨', '🙂', '😍', '✌️', '🤩'])

Now, if you tokenize the sentence again, you will see that the emoji remains an emoji instead of being mapped to the [UNK] token.

[tokenizer.decode(i) for i in tokenizer('🚨 JUNE DROP LIVE 🚨')['input_ids']]
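
This time the decoded tokens should come back with the emoji intact:

['[CLS]', '🚨', 'june', 'drop', 'live', '🚨', '[SEP]']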

The next step is to tokenize the data.

def tokenize_function(examples):
    # pad/truncate every ad to the model's maximum sequence length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
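
The map call keeps the original columns and appends the tokenizer’s outputs, so a quick check of the column names should show something like:

print(tokenized_datasets["train"].column_names)
# expected: ['text', 'label', 'input_ids', 'attention_mask']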

Fine-Tuning The Model

It’s time to load the pre-trained model.

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=1)

According to the documentation, for regression problems we have to pass num_labels=1. With a single label, the model automatically computes a mean-squared-error loss, which is exactly what we want for regression.

Now, we need to resize the token embeddings because we added more tokens to our tokenizer.

model.resize_token_embeddings(len(tokenizer))
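
As a sanity check, the input embedding matrix should now have exactly one row per tokenizer entry, including the new emojis:

# after resizing, vocabulary size and embedding rows must match
assert model.get_input_embeddings().weight.shape[0] == len(tokenizer)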

Metrics Function

In a regression problem, you are trying to predict a continuous value, so you need metrics that measure the distance between the predicted and the true values. The most common metrics are MSE (Mean Squared Error) and RMSE (Root Mean Squared Error). For this application we will use RMSE, and we need to wrap it in a function that the Trainer can call during training.

from sklearn.metrics import mean_squared_error


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # squared=False makes sklearn return the RMSE instead of the MSE
    rmse = mean_squared_error(labels, predictions, squared=False)
    return {"rmse": rmse}
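
A quick sanity check with made-up numbers (during evaluation, the Trainer passes predictions of shape (n, 1) and labels of shape (n,)):

import numpy as np

dummy_preds = np.array([[0.1], [0.4]])   # shape (n, 1), as the Trainer emits them
dummy_labels = np.array([0.0, 0.5])      # shape (n,)
print(compute_metrics((dummy_preds, dummy_labels)))  # {'rmse': 0.1} (approximately)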

Train The Model

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer",
                                  logging_strategy="epoch",      # log the training loss once per epoch
                                  evaluation_strategy="epoch",   # report the test RMSE once per epoch
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16,
                                  num_train_epochs=3,
                                  save_total_limit=2,
                                  save_strategy='no',            # skip writing checkpoints to disk
                                  load_best_model_at_end=False
                                  )


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics
)
trainer.train()
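
After training, you can run one more evaluation pass on the held-out split; the Trainer reports our metric under the key eval_rmse:

# final evaluation on the test split
metrics = trainer.evaluate()
print(metrics["eval_rmse"])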

Save And Load The Pre-Trained Model And Tokenizer

To save and load the model, run the following:

# save the model/tokenizer

model.save_pretrained("model")
tokenizer.save_pretrained("tokenizer")



# load the model/tokenizer

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("model")
tokenizer = AutoTokenizer.from_pretrained("tokenizer")

How To Use The Model

Once we have loaded the tokenizer and the model, we can use the Transformers Trainer to get predictions from text input. Below is a function that takes a text as input and returns the prediction. The steps are the following:

  1. Put the text into a dataframe column called text.
  2. Convert the dataframe into a Dataset.
  3. Tokenize the Dataset.
  4. Make the prediction using the Trainer.

Of course, you can also skip the function and pass more than one input at a time; that is faster, since the Trainer makes predictions in batches (see the batch sketch after the single-input example below).

from transformers import Trainer
trainer = Trainer(model=model)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

def pipeline_prediction(text):
    # wrap the text in a one-row dataframe, convert it to a Dataset,
    # tokenize it, and let the Trainer run the forward pass
    df = pd.DataFrame({'text': [text]})
    dataset = Dataset.from_pandas(df, preserve_index=False)
    tokenized_datasets = dataset.map(tokenize_function)
    raw_pred, _, _ = trainer.predict(tokenized_datasets)
    return raw_pred[0][0]

pipeline_prediction("🚨 Get 50% now!")
-0.019468416
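
And here is a minimal sketch of the batched variant, reusing the tokenize_function above (the ad texts are made up for illustration):

# predict several ads in one batched call
texts = ["🚨 Get 50% now!", "Summer sale starts today 😍"]
batch_df = pd.DataFrame({'text': texts})
batch_ds = Dataset.from_pandas(batch_df, preserve_index=False)
batch_ds = batch_ds.map(tokenize_function, batched=True)
raw_preds, _, _ = trainer.predict(batch_ds)
print(raw_preds.squeeze())  # one predicted response rate per ad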

Summing It Up

In this post, we showed you how to use pre-trained models for regression problems. We used HuggingFace’s transformers library to load the pre-trained DistilBERT model and fine-tune it on our data. Transformer models are very powerful and, used correctly, can produce far better results than more classic approaches such as word2vec embeddings and TF-IDF.

If you want to learn more about HuggingFace models you can check out their documentation on transformers.


