AWS BlazingText Tutorial

In this tutorial, we will show you how to train a text classifier using AWS SageMaker BlazingText. We will work with the womens_clothing_ecommerce_reviews_balanced.csv dataset, where the sentiment column has 3 classes:

  • -1: Negative
  • 0: Neutral
  • 1: Positive

Our goal is to build a classifier that takes the “review_body” as input and returns the predicted sentiment. Since we are working with SageMaker, we will load the following libraries and set up the role and the bucket.

import boto3
import sagemaker
import pandas as pd
import json

sess   = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

import matplotlib.pyplot as plt
%matplotlib inline

Prepare the Data

BlazingText expects a particular input format, which, in my opinion, is not user-friendly at all! In particular, it expects the following:

__label__<label> "<features>"

For example:

__label__-1 "this is bad"
__label__0 "this is ok"
__label__1 "this is great"
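
To make this concrete, here is a minimal sketch of a hypothetical helper (format_example is our own name, not part of any SDK) that maps a (sentiment, review) pair to this format:

def format_example(sentiment, review):
    # Prepend the __label__ prefix that BlazingText expects
    return '__label__{} "{}"'.format(sentiment, review)

print(format_example(1, 'this is great'))  # __label__1 "this is great"

In practice, we will produce this format below using pandas' to_csv with a space separator.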

Let’s load the data:

df = pd.read_csv('./womens_clothing_ecommerce_reviews_balanced.csv', delimiter=',')
df

Note that the dataset is balanced based on sentiment labels:

df.sentiment.value_counts()

Now, we will add the prefix __label__ to each sentiment value and we will tokenize the review body using the nltk library.

import nltk
nltk.download('punkt')

Let’s define a prepare_data function to transform the dataset:

def tokenize(review):
    # Remove commas and quotation marks, lowercase, tokenize,
    # and join the tokens back into a space-separated string
    cleaned = str(review).replace(',', '').replace('"', '').lower()
    return ' '.join([str(token) for token in nltk.word_tokenize(cleaned)])

def prepare_data(df):
    # Prepend the __label__ prefix to each sentiment value
    df['sentiment'] = df['sentiment'].map(lambda sentiment: '__label__{}'.format(str(sentiment).replace('__label__', '')))
    # Tokenize each review body
    df['review_body'] = df['review_body'].map(lambda review: tokenize(review))
    return df
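
As a quick sanity check, we can call tokenize on a made-up sample review, just for illustration:

print(tokenize('Love this dress, it fits "perfectly"!'))
# love this dress it fits perfectly !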



df_blazingtext = df[['sentiment', 'review_body']].reset_index(drop=True)
df_blazingtext = prepare_data(df_blazingtext)
df_blazingtext.head()

Finally, we will split the data into train (90%) and validation (10%) sets.

from sklearn.model_selection import train_test_split

# Split all data into 90% train and 10% holdout
df_train, df_validation = train_test_split(df_blazingtext, 
                                           test_size=0.10,
                                           stratify=df_blazingtext['sentiment'])
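
As a quick sanity check, we can print the size of each split:

print('Train: {} rows'.format(len(df_train)))
print('Validation: {} rows'.format(len(df_validation)))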

Save the datasets as CSV files and upload them to the S3 bucket:

blazingtext_train_path = './train.csv'
df_train[['sentiment', 'review_body']].to_csv(blazingtext_train_path, index=False, header=False, sep=' ')

blazingtext_validation_path = './validation.csv'
df_validation[['sentiment', 'review_body']].to_csv(blazingtext_validation_path, index=False, header=False, sep=' ')


train_s3_uri = sess.upload_data(bucket=bucket, key_prefix='blazingtext/data', path=blazingtext_train_path)
validation_s3_uri = sess.upload_data(bucket=bucket, key_prefix='blazingtext/data', path=blazingtext_validation_path)
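
To verify the uploads, we can print the returned S3 URIs:

print(train_s3_uri)
print(validation_s3_uri)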

Train the Model

First, we need to retrieve the BlazingText container image.

image_uri = sagemaker.image_uris.retrieve(
    region=region,
    framework='blazingtext'
)

Then, we create an estimator instance, passing in the container image.

estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    volume_size=30,
    max_run=7200,
    sagemaker_session=sess
)

Finally, we set the hyperparameters:

estimator.set_hyperparameters(mode='supervised',   # supervised (text classification)
                              epochs=10,           # number of complete passes through the dataset: 5 - 15
                              learning_rate=0.01,  # step size for the numerical optimizer: 0.005 - 0.01
                              min_count=2,         # discard words that appear less than this number: 0 - 100
                              vector_dim=300,      # number of dimensions in vector space: 32 - 300
                              word_ngrams=3)       # number of words in a word n-gram: 1 - 3

Now, we need to create the train and validation data channels.

train_data = sagemaker.inputs.TrainingInput(
    train_s3_uri, 
    distribution='FullyReplicated', 
    content_type='text/plain', 
    s3_data_type='S3Prefix'
)


validation_data = sagemaker.inputs.TrainingInput(
    validation_s3_uri, 
    distribution='FullyReplicated', 
    content_type='text/plain', 
    s3_data_type='S3Prefix'
)


# Organize the data channels defined above as a dictionary
data_channels = {
    'train': train_data,
    'validation': validation_data
}

We’re ready to start fitting the model to the datasets:

estimator.fit(
    inputs=data_channels, 
    wait=False
)

Wait for the training job to finish, then get the accuracy on the train and validation datasets:

estimator.latest_training_job.wait(logs=False)
estimator.training_job_analytics.dataframe()
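
The analytics dataframe has one row per reported metric. Assuming the standard BlazingText metric name (validation:accuracy), we can pull out the validation accuracy like this:

metrics = estimator.training_job_analytics.dataframe()
# Keep only the validation accuracy rows
print(metrics[metrics['metric_name'] == 'validation:accuracy'])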

If we go to the S3 bucket, we can find the trained model artifact.
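
Alternatively, the estimator exposes the S3 URI of the model artifact directly:

print(estimator.model_data)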

Deploy the Model

In this step, we will deploy the trained model as an endpoint.

text_classifier = estimator.deploy(initial_instance_count=1,
                                   instance_type='ml.m5.large',
                                   serializer=sagemaker.serializers.JSONSerializer(),
                                   deserializer=sagemaker.deserializers.JSONDeserializer())

print()
print('Endpoint name:  {}'.format(text_classifier.endpoint_name))

Output:

Endpoint name:  blazingtext-2021-08-03-18-01-26-021
CPU times: user 215 ms, sys: 35.1 ms, total: 250 ms
Wall time: 8min 32s

If we go to Amazon SageMaker -> Endpoints -> [Endpoint name] in the console, we will be able to see the endpoint.
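
We can also check the endpoint status programmatically via the low-level boto3 client:

sm_client = boto3.client('sagemaker')
response = sm_client.describe_endpoint(EndpointName=text_classifier.endpoint_name)
print(response['EndpointStatus'])  # 'InService' once the endpoint is ready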

Test the Model

Let’s take three example reviews for which we want predictions:

  • ‘This product is great!’
  • ‘OK, but not great’
  • ‘This is not the right product.’

Tokenize the reviews and specify the payload to use when calling the REST API.

reviews = ['This product is great!',
           'OK, but not great',
           'This is not the right product.'] 

tokenized_reviews = [' '.join(nltk.word_tokenize(review)) for review in reviews]

payload = {"instances" : tokenized_reviews}
print(payload)

Output:

{'instances': ['This product is great !', 'OK , but not great', 'This is not the right product .']}

Now we can predict the sentiment for each review. Call the predict method of the text classifier, passing the tokenized reviews (the payload) to the data argument.

predictions = text_classifier.predict(data=payload)
for prediction in predictions:
    # Use replace() rather than lstrip(): lstrip() strips a set of characters, not a prefix
    print('Predicted class: {}'.format(prediction['label'][0].replace('__label__', '')))

Output:

Predicted class: 1
Predicted class: 0
Predicted class: -1
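
Each prediction also carries a confidence score. Assuming the usual BlazingText response format, where 'label' and 'prob' are parallel lists, we can print both:

for prediction in predictions:
    label = prediction['label'][0].replace('__label__', '')
    # 'prob' holds the model's confidence for the returned label
    print('Predicted class: {} (probability: {:.3f})'.format(label, prediction['prob'][0]))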

Finally, we can clean up the endpoint as follows:

text_classifier.delete_endpoint()
