In this tutorial, we will show you how to train a text classifier using AWS SageMaker BlazingText. We will work with the womens_clothing_ecommerce_reviews_balanced.csv dataset, whose sentiment column has 3 classes:
- -1: Negative
- 0: Neutral
- 1: Positive
Our goal is to build a classifier that takes the “review_body” as input and returns the predicted sentiment. Since we are working with SageMaker, we load the following libraries and set up the session, role, and bucket.
import boto3
import sagemaker
import pandas as pd
import json
import matplotlib.pyplot as plt
%matplotlib inline

sess = sagemaker.Session()
bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
region = boto3.Session().region_name
Prepare the Data
BlazingText expects a particular input format, which in my opinion is not user-friendly at all! In particular, it expects the following format:
__label__<label> "<features>"
For example:
__label__-1 "this is bad"
__label__0 "this is ok"
__label__1 "this is great"
Let’s load the data:
df = pd.read_csv('./womens_clothing_ecommerce_reviews_balanced.csv', delimiter=',')
df

Note that the dataset is balanced based on sentiment labels:
df.sentiment.value_counts()

Now, we will add the prefix __label__ to each sentiment value and tokenize the review body using the nltk library.
import nltk
nltk.download('punkt')
Let’s define a prepare_data function to transform the dataset:
def tokenize(review):
    # Remove commas and quotation marks, lowercase the text, apply
    # tokenization, and join the tokens back into a space-separated string
    review = str(review).replace(',', '').replace('"', '').lower()
    return ' '.join([str(token) for token in nltk.word_tokenize(review)])


def prepare_data(df):
    df['sentiment'] = df['sentiment'].map(
        lambda sentiment: '__label__{}'.format(str(sentiment).replace('__label__', '')))
    df['review_body'] = df['review_body'].map(lambda review: tokenize(review))
    return df


df_blazingtext = df[['sentiment', 'review_body']].reset_index(drop=True)
df_blazingtext = prepare_data(df_blazingtext)
df_blazingtext.head()

Finally, we will split the data into train (90%) and validation (10%) sets.
from sklearn.model_selection import train_test_split

# Split all data into 90% train and 10% holdout
df_train, df_validation = train_test_split(df_blazingtext,
                                           test_size=0.10,
                                           stratify=df_blazingtext['sentiment'])
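Since we pass stratify, both splits should preserve the balanced class distribution; a quick optional sanity check:

# The stratified split should keep the three classes balanced in both splits
print(df_train['sentiment'].value_counts())
print(df_validation['sentiment'].value_counts())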
Save the datasets as CSV files and upload them to the S3 bucket:
blazingtext_train_path = './train.csv'
df_train[['sentiment', 'review_body']].to_csv(blazingtext_train_path,
                                              index=False, header=False, sep=' ')

blazingtext_validation_path = './validation.csv'
df_validation[['sentiment', 'review_body']].to_csv(blazingtext_validation_path,
                                                   index=False, header=False, sep=' ')

train_s3_uri = sess.upload_data(bucket=bucket, key_prefix='blazingtext/data',
                                path=blazingtext_train_path)
validation_s3_uri = sess.upload_data(bucket=bucket, key_prefix='blazingtext/data',
                                     path=blazingtext_validation_path)
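Before training, it is worth peeking at the saved training file to confirm it matches the __label__&lt;label&gt; format described above; a minimal check:

# Print the first few lines of the training file to verify the BlazingText format
with open(blazingtext_train_path, 'r') as f:
    for _ in range(3):
        print(f.readline().strip())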
Train the Model
First, we need to retrieve the BlazingText container image.
image_uri = sagemaker.image_uris.retrieve(
    region=region,
    framework='blazingtext'
)
Then, we create an estimator instance, passing in the container image.
estimator = sagemaker.estimator.Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.large',
    volume_size=30,
    max_run=7200,
    sagemaker_session=sess
)
Finally, we set the hyper-parameters:
estimator.set_hyperparameters(mode='supervised',   # supervised (text classification)
                              epochs=10,           # number of complete passes through the dataset: 5 - 15
                              learning_rate=0.01,  # step size for the numerical optimizer: 0.005 - 0.01
                              min_count=2,         # discard words that appear less than this number: 0 - 100
                              vector_dim=300,      # number of dimensions in vector space: 32 - 300
                              word_ngrams=3)       # number of words in a word n-gram: 1 - 3
Now, we need to create the train and validation data channels.
train_data = sagemaker.inputs.TrainingInput(
    train_s3_uri,
    distribution='FullyReplicated',
    content_type='text/plain',
    s3_data_type='S3Prefix'
)

validation_data = sagemaker.inputs.TrainingInput(
    validation_s3_uri,
    distribution='FullyReplicated',
    content_type='text/plain',
    s3_data_type='S3Prefix'
)

# Organize the data channels defined above as a dictionary
data_channels = {
    'train': train_data,
    'validation': validation_data
}
We’re ready to start fitting the model to the datasets:
estimator.fit(
    inputs=data_channels,
    wait=False
)

Wait for the training job to finish, then get the accuracy on the train and validation datasets:
estimator.latest_training_job.wait(logs=False)
estimator.training_job_analytics.dataframe()

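If you prefer to pull the final metric values programmatically instead of reading the dataframe, something along these lines should work (assuming BlazingText reports the metric names train:accuracy and validation:accuracy):

# Extract the last reported value for each accuracy metric
# (metric names assumed: 'train:accuracy' and 'validation:accuracy')
df_metrics = estimator.training_job_analytics.dataframe()
for metric_name in ['train:accuracy', 'validation:accuracy']:
    values = df_metrics[df_metrics['metric_name'] == metric_name]['value']
    if not values.empty:
        print('{}: {}'.format(metric_name, values.iloc[-1]))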
If we go to the S3 bucket, we can find the trained model artifact.
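Alternatively, the estimator exposes the artifact's S3 URI directly, so we don't have to browse for it:

# S3 URI of the trained model artifact (model.tar.gz)
print(estimator.model_data)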

Deploy the Model
In this step, we will deploy the trained model as an endpoint.
%%time

text_classifier = estimator.deploy(initial_instance_count=1,
                                   instance_type='ml.m5.large',
                                   serializer=sagemaker.serializers.JSONSerializer(),
                                   deserializer=sagemaker.deserializers.JSONDeserializer())

print()
print('Endpoint name: {}'.format(text_classifier.endpoint_name))
Output:
Endpoint name: blazingtext-2021-08-03-18-01-26-021
CPU times: user 215 ms, sys: 35.1 ms, total: 250 ms
Wall time: 8min 32s
If we go to Amazon SageMaker -> Endpoints -> [Endpoint name], we will be able to see the endpoint:

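The same check can be done from the notebook with the low-level boto3 client; a minimal sketch:

# Describe the endpoint to check its status without leaving the notebook
sm_client = boto3.client('sagemaker', region_name=region)
response = sm_client.describe_endpoint(EndpointName=text_classifier.endpoint_name)
print(response['EndpointStatus'])  # 'InService' once the deployment finishes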
Test the Model
Let’s provide an example of 3 reviews for which we want to get predictions:
- ‘This product is great!’
- ‘OK, but not great’
- ‘This is not the right product.’
Tokenize the reviews and specify the payload to use when calling the REST API.
reviews = ['This product is great!',
           'OK, but not great',
           'This is not the right product.']

tokenized_reviews = [' '.join(nltk.word_tokenize(review)) for review in reviews]

payload = {"instances": tokenized_reviews}
print(payload)
Output:
{'instances': ['This product is great !', 'OK , but not great', 'This is not the right product .']}
Now we can predict the sentiment for each review. Call the predict method of the text classifier, passing the tokenized sentence instances (payload) into the data argument.
predictions = text_classifier.predict(data=payload)

for prediction in predictions:
    # Use replace() rather than lstrip(): lstrip() strips a set of characters,
    # not a prefix, and can eat leading characters of the label itself
    print('Predicted class: {}'.format(prediction['label'][0].replace('__label__', '')))
Output:
Predicted class: 1
Predicted class: 0
Predicted class: -1
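BlazingText also returns a confidence score alongside each label; if you want to see it, you can zip the raw responses (whose prob field holds the score) with the original reviews:

# Print each review with its predicted label and confidence score
for review, prediction in zip(reviews, predictions):
    print('{} -> {} ({:.3f})'.format(review,
                                     prediction['label'][0].replace('__label__', ''),
                                     prediction['prob'][0]))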
Finally, we can clean up the endpoint as follows:
text_classifier.delete_endpoint()