Predictive Hacks

Content-Based Recommender Systems with TensorFlow Recommenders


In this post, we take the “Building deep retrieval models” tutorial from TensorFlow as a reference point and go deeper by showing:

  • How to add more features
  • How to retrieve the Users’ and Items’ Embeddings
  • How to make predictions using the model
  • How to make predictions using the Embeddings

Content-Based Recommender Systems

In the previous tutorial, we built a recommender system that took only the Users and the Items into consideration and learned their corresponding embeddings. However, if we know the users’ and items’ attributes, we can enrich the model by adding these features. Such models are called content-based: in essence, they recommend items by matching the features of the user with the features of the item.

For example, think of a function that takes the user and movie features as input and tries to predict the rating a user would give to a movie, or which movie a user would prefer.

Coursera

The idea is to build a user vector \(x_u\) and a movie vector \(x_m\), which can have different dimensions, and pass each through its own network so that the final layers of the two networks have the same size. The predicted value is the dot product of the two final-layer outputs (shown in purple).

Coursera
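To make the idea concrete, here is a minimal NumPy sketch of the two "towers". The feature sizes (4 and 6), the shared output size (3), and the random weights are all made up for illustration; a real model learns the weights and usually stacks several layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw feature vectors with different sizes.
x_u = rng.normal(size=4)   # user features
x_m = rng.normal(size=6)   # movie features

# One linear layer per tower, both ending in the same size (3 here).
W_u = rng.normal(size=(4, 3))
W_m = rng.normal(size=(6, 3))

v_u = x_u @ W_u            # final user-tower output, shape (3,)
v_m = x_m @ W_m            # final movie-tower output, shape (3,)

# The prediction is the dot product of the two final-layer outputs.
score = float(np.dot(v_u, v_m))
print(v_u.shape, v_m.shape, score)
```

The inputs can have any size; only the last layer of each tower has to agree so that the dot product is defined.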

Load the Data and the Libraries

We will work with the movielens/100k-ratings dataset from TensorFlow Datasets. Feel free to experiment with other datasets. In the previous tutorial, we showed how to convert a pandas data frame to a TensorFlow dataset. We will work in Colab so that you can code along.

!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds

import tensorflow_recommenders as tfrs
 

Load the ratings and the movies.

ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")
 

If we want to have a look at the first records we can run:

for record in ratings.take(5):
  print(record)
 

We can also convert the dataset into a pandas data frame to run an exploratory data analysis. Let’s get the first records from the ratings dataset.

# convert ratings to a Pandas data frame
ratings_df = tfds.as_dataframe(ratings)
ratings_df.head(10)
 

After running an EDA, we decided to keep the following features:

ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_gender": x["user_gender"],
    "bucketized_user_age": x["bucketized_user_age"],
    "user_occupation_label": x["user_occupation_label"],
    "timestamp": x["timestamp"]
})
movies = movies.map(lambda x: x["movie_title"])
 

Users and Movies Features

For this model, we consider the following features.

For Users:

  • Gender
  • Age Group
  • Occupation
  • Time

For Movies:

  • Movie Title (both as a whole-title lookup and as tokenized title text)

I urge you to have a look at the Feature preprocessing tutorial, which explains how to create features in TensorFlow. Once you have read it, you can continue with this post. At this point, we will build our features. Note that we bucketize the timestamp feature using 1,000 evenly spaced bucket boundaries.

timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))

max_timestamp = timestamps.max()
min_timestamp = timestamps.min()

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000,
)

unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))

unique_user_gender = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_gender"]))))

unique_bucketized_user_age = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["bucketized_user_age"]))))
unique_user_occupation_label = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_occupation_label"]))))
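To see what bucketizing does, here is a small NumPy sketch. It uses np.digitize as a stand-in for the Discretization layer used later, on the assumption that both assign a value to the number of boundaries that are less than or equal to it. The timestamps and the five boundaries are toy values, not MovieLens data.

```python
import numpy as np

# Toy timestamps (hypothetical values, not taken from MovieLens).
toy_timestamps = np.array([90, 100, 130, 200, 250])

# Five evenly spaced boundaries between 100 and 200.
toy_boundaries = np.linspace(100, 200, num=5)  # [100, 125, 150, 175, 200]

# np.digitize returns, for each value, how many boundaries are <= that value.
bucket_ids = np.digitize(toy_timestamps, toy_boundaries)
print(bucket_ids)  # [0 1 2 5 5]

# n boundaries give n + 1 possible bucket ids (0 .. n).
num_ids = len(toy_boundaries) + 1
```

This is why the timestamp embedding in the UserModel has len(timestamp_buckets) + 1 rows: values below the first boundary and values at or above the last one each get their own id.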
 

Build the Models

We will build a UserModel and expand it with hidden layers in a QueryModel. Then, we will build a MovieModel and expand it with hidden layers in a CandidateModel. Finally, we will build a combined model (the MovielensModel class) that brings the QueryModel and the CandidateModel together and implements the loss and metrics logic.

UserModel

class UserModel(tf.keras.Model):
  
  def __init__(self):
    super().__init__()

    self.age_embedding = tf.keras.Sequential([
        tf.keras.layers.IntegerLookup(
            vocabulary=unique_bucketized_user_age, mask_token=None),
        tf.keras.layers.Embedding(len(unique_bucketized_user_age) + 1, 32),
    ])

    self.occupation_embedding = tf.keras.Sequential([
        tf.keras.layers.IntegerLookup(
            vocabulary=unique_user_occupation_label, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_occupation_label) + 1, 32),
    ])

    self.gender_embedding = tf.keras.Sequential([
        tf.keras.layers.IntegerLookup(
            vocabulary=unique_user_gender, mask_token=None),
        tf.keras.layers.Embedding(len(unique_user_gender) + 1, 32),
    ])
        
    self.timestamp_embedding = tf.keras.Sequential([
        tf.keras.layers.Discretization(timestamp_buckets.tolist()),
        tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
    ])
    self.normalized_timestamp = tf.keras.layers.Normalization(
        axis=None
    )

    self.normalized_timestamp.adapt(timestamps)

  def call(self, inputs):
    # Take the input dictionary, pass it through each input layer,
    # and concatenate the result.
    return tf.concat([
        self.age_embedding(inputs["bucketized_user_age"]),
        self.occupation_embedding(inputs["user_occupation_label"]),
        self.gender_embedding(inputs["user_gender"]),
        self.timestamp_embedding(inputs["timestamp"]),
        tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)),
    ], axis=1)
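
The lookup-then-embed-then-concatenate pattern of the UserModel can be sketched in plain NumPy. The vocabularies and the random embedding tables below are hypothetical; the sketch only illustrates how ids and shapes flow, with id 0 reserved for out-of-vocabulary values as in the lookup layers above.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 32

def lookup(vocabulary, values):
    """Map each value to 1 + its position in the vocabulary; 0 means out-of-vocabulary."""
    index = {v: i + 1 for i, v in enumerate(vocabulary)}
    return np.array([index.get(v, 0) for v in values])

# Hypothetical vocabularies; the real ones come from the dataset.
age_vocab = [1, 18, 25, 35]
occupation_vocab = [0, 5, 17]

# One row per vocabulary entry plus one row for the out-of-vocabulary id.
age_table = rng.normal(size=(len(age_vocab) + 1, embedding_dim))
occupation_table = rng.normal(size=(len(occupation_vocab) + 1, embedding_dim))

ages = [25, 99]          # 99 is not in the vocabulary
occupations = [17, 5]

age_ids = lookup(age_vocab, ages)                        # [3, 0]
occupation_ids = lookup(occupation_vocab, occupations)   # [3, 2]

user_vectors = np.concatenate(
    [age_table[age_ids], occupation_table[occupation_ids]], axis=1)
print(user_vectors.shape)  # (2, 64)
```

With the four 32-dimensional embeddings plus the single normalized timestamp column, the UserModel above outputs 4 × 32 + 1 = 129 columns per user.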
  

QueryModel

class QueryModel(tf.keras.Model):
  """Model for encoding user queries."""

  def __init__(self, layer_sizes):
    """Model for encoding user queries.

    Args:
      layer_sizes:
        A list of integers where the i-th entry represents the number of units
        the i-th layer contains.
    """
    super().__init__()

    # We first use the user model for generating embeddings.
    self.embedding_model = UserModel()

    # Then construct the layers.
    self.dense_layers = tf.keras.Sequential()

    # Use the ReLU activation for all but the last layer.
    for layer_size in layer_sizes[:-1]:
      self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu"))

    # No activation for the last layer.
    for layer_size in layer_sizes[-1:]:
      self.dense_layers.add(tf.keras.layers.Dense(layer_size))
    
  def call(self, inputs):
    feature_embedding = self.embedding_model(inputs)
    return self.dense_layers(feature_embedding)
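
The layer_sizes logic (ReLU on every layer except the last) can be illustrated with a small NumPy sketch. The helper names build_tower and run_tower and the random weights are ours, for illustration only; the input width of 129 matches the UserModel output.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_tower(input_dim, layer_sizes):
    """Create (weights, bias, use_relu) triples; ReLU on all but the last layer."""
    layers = []
    for i, size in enumerate(layer_sizes):
        weights = rng.normal(scale=0.1, size=(input_dim, size))
        bias = np.zeros(size)
        use_relu = i < len(layer_sizes) - 1
        layers.append((weights, bias, use_relu))
        input_dim = size
    return layers

def run_tower(layers, x):
    """Apply each dense layer in turn, with ReLU where requested."""
    for weights, bias, use_relu in layers:
        x = x @ weights + bias
        if use_relu:
            x = np.maximum(x, 0.0)
    return x

tower = build_tower(input_dim=129, layer_sizes=[64, 32])
batch = rng.normal(size=(5, 129))
out = run_tower(tower, batch)
print(out.shape)  # (5, 32)
```

Leaving the last layer linear lets the final embedding take negative values, which matters because the score is a dot product.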
  

MovieModel

class MovieModel(tf.keras.Model):
  
  def __init__(self):
    super().__init__()

    max_tokens = 10_000

    self.title_embedding = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
          vocabulary=unique_movie_titles,mask_token=None),
      tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
    ])

    self.title_vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=max_tokens)

    self.title_text_embedding = tf.keras.Sequential([
      self.title_vectorizer,
      tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
      tf.keras.layers.GlobalAveragePooling1D(),
    ])

    self.title_vectorizer.adapt(movies)

  def call(self, titles):
    return tf.concat([
        self.title_embedding(titles),
        self.title_text_embedding(titles),
    ], axis=1)
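
The title_text_embedding branch tokenizes each title, embeds the tokens, and averages them while ignoring padding. Below is a rough NumPy sketch of that pipeline. The tokenizer only approximates the layer's default standardization, the vocabulary is sorted alphabetically rather than by frequency, and the embedding size of 8 is chosen for readability; all of these are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

titles = ["Heat (1995)", "Jaws (1975)", "Twelve Monkeys (1995)"]

def tokenize(title):
    # Lowercase, drop punctuation, split on whitespace.
    cleaned = "".join(c for c in title.lower() if c.isalnum() or c.isspace())
    return cleaned.split()

tokens = [tokenize(t) for t in titles]
vocab = sorted({tok for toks in tokens for tok in toks})
token_id = {tok: i + 2 for i, tok in enumerate(vocab)}  # 0 = padding, 1 = unknown

# Pad every title to the same length with id 0.
max_len = max(len(toks) for toks in tokens)
ids = np.zeros((len(titles), max_len), dtype=int)
for row, toks in enumerate(tokens):
    ids[row, :len(toks)] = [token_id[tok] for tok in toks]

table = rng.normal(size=(len(vocab) + 2, 8))
mask = (ids != 0)[..., None]                      # ignore padded positions
pooled = (table[ids] * mask).sum(axis=1) / mask.sum(axis=1)
print(pooled.shape)  # (3, 8)
```

Because of the mask, a two-word title is averaged over two tokens, not over the padded length.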
  

CandidateModel

class CandidateModel(tf.keras.Model):
  """Model for encoding movies."""

  def __init__(self, layer_sizes):
    """Model for encoding movies.

    Args:
      layer_sizes:
        A list of integers where the i-th entry represents the number of units
        the i-th layer contains.
    """
    super().__init__()

    self.embedding_model = MovieModel()

    # Then construct the layers.
    self.dense_layers = tf.keras.Sequential()

    # Use the ReLU activation for all but the last layer.
    for layer_size in layer_sizes[:-1]:
      self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu"))

    # No activation for the last layer.
    for layer_size in layer_sizes[-1:]:
      self.dense_layers.add(tf.keras.layers.Dense(layer_size))
    
  def call(self, inputs):
    feature_embedding = self.embedding_model(inputs)
    return self.dense_layers(feature_embedding)
  

CombinedModel

class MovielensModel(tfrs.models.Model):

  def __init__(self, layer_sizes):
    super().__init__()
    self.query_model = QueryModel(layer_sizes)
    self.candidate_model = CandidateModel(layer_sizes)
    self.task = tfrs.tasks.Retrieval(
        metrics=tfrs.metrics.FactorizedTopK(
            candidates=movies.batch(128).map(self.candidate_model),
        ),
    )

  def compute_loss(self, features, training=False):

    query_embeddings = self.query_model({
        "bucketized_user_age": features["bucketized_user_age"],
        "user_occupation_label": features["user_occupation_label"],
        "user_gender": features["user_gender"],
        "timestamp": features["timestamp"],
    })
    movie_embeddings = self.candidate_model(features["movie_title"])

    return self.task(
        query_embeddings, movie_embeddings, compute_metrics=not training)
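
As far as we understand, the Retrieval task by default treats the other movies in the same batch as negatives and applies a softmax cross-entropy over the query-movie scores. The NumPy sketch below illustrates that idea with random embeddings; summing the per-example losses is an assumption about the reduction, and the real task may differ in details.

```python
import numpy as np

rng = np.random.default_rng(0)

batch_size, dim = 4, 32
query_embeddings = rng.normal(size=(batch_size, dim))
movie_embeddings = rng.normal(size=(batch_size, dim))

# Score every query against every movie in the batch.
logits = query_embeddings @ movie_embeddings.T        # shape (4, 4)

# Row i's positive is column i; the other columns act as negatives.
logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
per_example_loss = -np.diag(log_probs)

loss = per_example_loss.sum()
print(loss)
```

Minimizing this loss pushes each query embedding towards the movie the user actually watched and away from the other movies in the batch.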
  

Training the Model

We will split the data into training and test datasets (80-20).

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()
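
The take/skip split only yields disjoint sets because the dataset is shuffled once with a fixed order (reshuffle_each_iteration=False). The same logic, sketched on plain indices with NumPy as an analogy rather than the tf.data implementation:

```python
import numpy as np

n_examples = 100_000
rng = np.random.default_rng(42)

order = rng.permutation(n_examples)   # shuffle once, with a fixed seed
train_idx = order[:80_000]            # analogous to shuffled.take(80_000)
test_idx = order[80_000:]             # analogous to shuffled.skip(80_000)

# The two sets are disjoint and together cover every example.
overlap = np.intersect1d(train_idx, test_idx)
print(len(train_idx), len(test_idx), len(overlap))  # 80000 20000 0
```

If the order changed between the take and the skip, some ratings could end up in both sets and leak into the evaluation.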
 

Let’s run the model using one layer of 32 units.

num_epochs = 50

model = MovielensModel([32])
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

one_layer_history = model.fit(
    cached_train,
    validation_data=cached_test,
    validation_freq=5,
    epochs=num_epochs,
    verbose=0)

accuracy = one_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"][-1]
print(f"Top-100 accuracy: {accuracy:.2f}.")
 
Top-100 accuracy: 0.16.
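
A top-100 accuracy of 0.16 means that for about 16% of the test ratings, the movie the user actually rated appears among the 100 highest-scoring candidates. Here is a minimal NumPy sketch of a top-k accuracy of this kind, with random embeddings and made-up sizes; it shows the idea and is not the FactorizedTopK implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

num_queries, num_candidates, dim, k = 6, 20, 8, 5
queries = rng.normal(size=(num_queries, dim))
candidates = rng.normal(size=(num_candidates, dim))
true_ids = rng.integers(0, num_candidates, size=num_queries)

scores = queries @ candidates.T                        # (6, 20)

# Rank of the true candidate = number of candidates scoring strictly higher.
true_scores = scores[np.arange(num_queries), true_ids]
ranks = (scores > true_scores[:, None]).sum(axis=1)    # 0 = best

hits = ranks < k
top_k_accuracy = hits.mean()
print(top_k_accuracy)
```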

Note that if you want to run a deeper model with two layers, for example a layer of 64 units followed by a layer of 32 units, you can pass the layer sizes as follows.

model = MovielensModel([64, 32])
 

Notice that the accuracy is low because we excluded the user IDs as a feature, since we wanted to focus on the users’ attributes only. However, we can easily add them if we want. Feel free to experiment with different features.

Predictions

The TensorFlow tutorial does not show you how to make predictions! Let’s see how we can get recommendations for the following input.

{
    "bucketized_user_age": np.array([25]),
    "user_occupation_label": np.array([17]),
    "user_gender": np.array([True]),
    "timestamp": np.array([879024327]),
}

index = tfrs.layers.factorized_top_k.BruteForce(model.query_model)
index.index_from_dataset(
  tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.candidate_model)))
)


_, titles = index({
    "bucketized_user_age": np.array([25]),
    "user_occupation_label": np.array([17]),
    "user_gender": np.array([True]),
    "timestamp": np.array([879024327])},
    k=50
)
 

Get the top 50 recommended movies.

titles[0].numpy()
 
array([b'Guilty as Sin (1993)', b'Malice (1993)', b'Tank Girl (1995)',
       b'Body Parts (1991)', b'Murder in the First (1995)',
       b'Secret Garden, The (1993)', b'Death and the Maiden (1994)',
       b'Devil in a Blue Dress (1995)', b'Professional, The (1994)',
       b'Dolores Claiborne (1994)', b'Kazaam (1996)',
       b'Madonna: Truth or Dare (1991)', b'Heat (1995)',
       b'Leaving Las Vegas (1995)', b'Smoke (1995)',
       b'Twelve Monkeys (1995)', b'Rock, The (1996)',
       b'Turbo: A Power Rangers Movie (1997)',
       b'Mis\xc3\xa9rables, Les (1995)', b'Legends of the Fall (1994)',
       b'Big Night (1996)',
       b'Willy Wonka and the Chocolate Factory (1971)',
       b'Hunchback of Notre Dame, The (1996)', b'101 Dalmatians (1996)',
       b'Hercules (1997)', b'Robin Hood: Prince of Thieves (1991)',
       b'True Romance (1993)', b'Fugitive, The (1993)',
       b'Jungle Book, The (1994)', b'Substitute, The (1996)',
       b'Bridges of Madison County, The (1995)', b'Ransom (1996)',
       b'Fargo (1996)', b'Eat Drink Man Woman (1994)',
       b'Client, The (1994)', b'No Escape (1994)', b'Firm, The (1993)',
       b'Psycho (1960)', b'Jerry Maguire (1996)', b'Kalifornia (1993)',
       b'Homeward Bound II: Lost in San Francisco (1996)',
       b"Preacher's Wife, The (1996)", b'Donnie Brasco (1997)',
       b'Jaws (1975)', b'Jude (1996)',
       b'Lost World: Jurassic Park, The (1997)',
       b"Miller's Crossing (1990)", b'Mirror Has Two Faces, The (1996)',
       b'In the Line of Fire (1993)', b'Diabolique (1996)'], dtype=object)
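
Conceptually, a brute-force index scores the query embedding against every candidate embedding and returns the k best. A minimal NumPy sketch with four placeholder titles and random embeddings (all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

candidate_titles = np.array(["Movie A", "Movie B", "Movie C", "Movie D"])
candidate_embeddings = rng.normal(size=(4, 32))
query_embedding = rng.normal(size=(1, 32))

k = 2
scores = query_embedding @ candidate_embeddings.T        # (1, 4)
top_k = np.argsort(-scores, axis=1)[:, :k]                # indices of the k best
top_scores = np.take_along_axis(scores, top_k, axis=1)
top_titles = candidate_titles[top_k]
print(top_scores, top_titles)
```

The two returned arrays mirror the pair we unpack from the index call above: the scores first, then the identifiers (here, the titles).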

How to Get Users and Movies Embeddings

We can easily generate the QueryModel and CandidateModel embeddings. Let’s get the embeddings of our previous input to the query model.

model.query_model.predict({ "bucketized_user_age": np.array([25]),
                           "user_gender": np.array([True]),
                           "user_occupation_label": np.array([17]),
                           "timestamp": np.array([879024327])})
 
1/1 [==============================] - 0s 160ms/step
array([[-0.18362528,  0.34076893,  0.30787617, -0.05884944, -0.1902037 ,
        -0.2550715 , -0.504346  ,  0.2827947 , -0.78624237, -0.6882196 ,
         0.33391213,  0.44255227,  0.31369513,  0.8378982 ,  0.31462622,
         1.4631445 , -0.20804778,  0.520661  , -1.0084451 ,  0.05185542,
         0.43012613,  1.2417494 , -0.5769971 ,  0.04806066,  0.14111865,
        -0.44874376,  1.0474757 , -0.37607613, -0.00435135,  0.3776496 ,
         0.09939097,  1.0909896 ]], dtype=float32)

Similarly, we can get the embeddings of the “Red Rock West (1992)” movie.

model.candidate_model.predict(["Red Rock West (1992)"])
 
1/1 [==============================] - 0s 170ms/step
array([[ 0.88278526, -0.2830179 ,  0.04174206,  0.5652223 , -0.73649615,
        -0.5735289 ,  0.22620858,  0.16391961, -0.17856865, -0.3096914 ,
        -0.7460059 , -0.11384141, -0.22080638, -0.3547582 , -0.01823226,
         0.7193038 , -0.52851117,  0.21540207, -0.36750677, -0.396778  ,
         0.46227038,  0.1192332 , -0.3032046 , -0.44636455, -0.39206988,
        -0.19133675,  0.08684351,  0.02275734, -0.04393479, -0.16428764,
        -0.10052559,  0.47641134]], dtype=float32)

Note that we get the same values by calling the model directly instead of using the predict method; the direct call returns a tf.Tensor rather than a NumPy array. For example:

model.query_model({ "bucketized_user_age": np.array([25]),
                           "user_gender": np.array([True]),
                           "user_occupation_label": np.array([17]),
                           "timestamp": np.array([879024327])})
 

Make Predictions using Embeddings and Dot Product

We can make predictions using the dot product of the QueryModel and CandidateModel embeddings. The higher the dot product, the better the match. Let’s say that we want to recommend one of the following three movies to a user with specific attributes.

  • Red Rock West (1992)
  • Palookaville (1996)
  • Beautiful Thing (1996)

# Embed the user query and the three candidate movies

query_01 = model.query_model.predict({ "bucketized_user_age": np.array([25]),
                           "user_gender": np.array([True]),
                           "user_occupation_label": np.array([17]),
                           "timestamp": np.array([879024327])})


movie_01 = model.candidate_model.predict(["Red Rock West (1992)"])
movie_02 = model.candidate_model.predict(["Palookaville (1996)"])
movie_03 = model.candidate_model.predict(["Beautiful Thing (1996)"])
 

We get the score of each movie:

np.matmul(query_01, np.transpose(np.concatenate((movie_01,movie_02,movie_03), axis=0)))
 
array([[ 2.3073187 ,  0.42857647, -0.6794552 ]], dtype=float32)

Hence, based on the above example, we should recommend the “Red Rock West (1992)” movie.
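
To turn the scores into a ranked list programmatically, we can sort them in descending order. The sketch below reuses the three scores printed above; the variable names are ours.

```python
import numpy as np

movie_titles = ["Red Rock West (1992)", "Palookaville (1996)", "Beautiful Thing (1996)"]
scores = np.array([[2.3073187, 0.42857647, -0.6794552]], dtype=np.float32)

# Sort the movies from the highest to the lowest dot product.
order = np.argsort(-scores[0])
ranked = [(movie_titles[i], float(scores[0, i])) for i in order]
best_title = ranked[0][0]
print(best_title)  # Red Rock West (1992)
```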

Save and Load the Model

We can save and load the model as follows:

import os
import tempfile

# Export the BruteForce index, which wraps the query model.
with tempfile.TemporaryDirectory() as tmp:
  path = os.path.join(tmp, "model")

  # Save the index.
  tf.saved_model.save(index, path)

  # Load it back; can also be done in TensorFlow Serving.
  loaded = tf.saved_model.load(path)

  # Pass the user features in, get the top predicted movie titles back.
  _, titles = loaded({
      "bucketized_user_age": np.array([25]),
      "user_occupation_label": np.array([17]),
      "user_gender": np.array([True]),
      "timestamp": np.array([879024327])})

  print(f"Recommendations: {titles[0][:10]}")
 
Recommendations: [b'Guilty as Sin (1993)' b'Malice (1993)' b'Tank Girl (1995)'
 b'Body Parts (1991)' b'Murder in the First (1995)'
 b'Secret Garden, The (1993)' b'Death and the Maiden (1994)'
 b'Devil in a Blue Dress (1995)' b'Professional, The (1994)'
 b'Dolores Claiborne (1994)']

We can also call the query model nested inside the loaded index to get the embeddings as follows:

loaded.query_model({ "bucketized_user_age": np.array([25]),
                           "user_gender": np.array([True]),
                           "user_occupation_label": np.array([17]),
                           "timestamp": np.array([879024327])})
 
