In this post, we use the “Building deep retrieval models” tutorial from TensorFlow as a reference point and go deeper by showing:
- How to add more features
- How to retrieve the Users’ and Items’ Embeddings
- How to make predictions using the model
- How to make predictions using the Embeddings
Content-Based Recommender Systems
In the previous tutorial, we built a recommender system that took only the Users and the Items into consideration and learned their corresponding embeddings. However, if we know the users’ and items’ attributes, we can enrich the model by adding these features. Such models are called content-based: in essence, they recommend items by matching the features of the user with the features of the item.
For example, think of a function that takes the user and movie features as input and tries to predict the rating a user would give to a movie, or the movie a user would prefer.
The idea is to build a user vector \(x_u\) and a movie vector \(x_m\), which can be of different dimensions, and pass each through its own network so that the two final layers have the same size. The predicted value of the function is the dot product of the two final layers (purple).
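To make the idea concrete, here is a minimal NumPy sketch of the scoring step. It is not the model built later in this post: the feature sizes, the random weights `W_u` and `W_m`, and the single linear projection per tower are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw feature vectors with different dimensions.
x_u = rng.normal(size=5)   # user features
x_m = rng.normal(size=8)   # movie features

# Hypothetical, untrained projections: each tower maps its input to 4 dimensions.
W_u = rng.normal(size=(5, 4))
W_m = rng.normal(size=(8, 4))

v_u = x_u @ W_u   # final user vector, shape (4,)
v_m = x_m @ W_m   # final movie vector, shape (4,)

# The predicted affinity is the dot product of the two final vectors.
score = float(v_u @ v_m)
print(v_u.shape, v_m.shape, score)
```

The only hard requirement is that both towers end in vectors of the same size, because the dot product is undefined otherwise.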
Load the Data and the Libraries
We will work with the movielens/100k-ratings dataset from TensorFlow Datasets. Feel free to experiment with other datasets. In the previous tutorial, we showed how to convert a pandas data frame to a TensorFlow dataset. We will work with Colab and you can code along.
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets

import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs
Load the ratings and the movies.
ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")
If we want to have a look at the first records we can run:
for record in ratings.take(5):
    print(record)
Alternatively, we can convert the dataset into a pandas data frame to run an exploratory data analysis. Let’s get the first records from the ratings dataset.
# Convert ratings to a pandas data frame
ratings_df = tfds.as_dataframe(ratings)
ratings_df.head(10)
After running an EDA, we decided to keep the following features:
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_gender": x["user_gender"],
    "bucketized_user_age": x["bucketized_user_age"],
    "user_occupation_label": x["user_occupation_label"],
    "timestamp": x["timestamp"]
})

movies = movies.map(lambda x: x["movie_title"])
Users and Movies Features
For this model, we consider the following features.
For Users:
- Gender
- Age Group
- Occupation
- Time
For Movies:
- Movie Title (used as a categorical identifier)
- Movie Title text (the tokenized words of the title)
I urge you to have a look at the Feature preprocessing tutorial, which explains how to create features in TensorFlow, before continuing with this one. At this point, we will build our features. Note that we bucketize the timestamp feature using 1000 bucket boundaries.
timestamps = np.concatenate(list(ratings.map(lambda x: x["timestamp"]).batch(100)))

max_timestamp = timestamps.max()
min_timestamp = timestamps.min()

timestamp_buckets = np.linspace(
    min_timestamp, max_timestamp, num=1000,
)

unique_movie_titles = np.unique(np.concatenate(list(movies.batch(1000))))

unique_user_gender = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_gender"]))))

unique_bucketized_user_age = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["bucketized_user_age"]))))

unique_user_occupation_label = np.unique(np.concatenate(list(ratings.batch(1_000).map(
    lambda x: x["user_occupation_label"]))))
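As a rough illustration of what bucketizing does, the sketch below maps a few values to bucket indices with `np.digitize`. The boundaries and values are toy numbers, not the real MovieLens timestamps, and the assumption is that the Keras `Discretization` layer used later assigns buckets in the same way.

```python
import numpy as np

# Toy boundaries standing in for timestamp_buckets (hypothetical values).
boundaries = np.linspace(0.0, 999.0, num=1000)

# For each value, np.digitize counts how many boundaries are <= that value.
values = np.array([-5.0, 0.0, 500.5, 999.0, 5000.0])
bucket_ids = np.digitize(values, boundaries)

# 1000 boundaries produce 1001 possible bucket ids (0 through 1000).
num_buckets = len(boundaries) + 1

print(bucket_ids.tolist())  # [0, 1, 501, 1000, 1000]
print(num_buckets)          # 1001
```

This is why an embedding that follows the bucketization needs `len(timestamp_buckets) + 1` rows: values below the first boundary and values above the last one each get their own bucket.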
Build the Models
We will build a UserModel and expand it with hidden layers in a QueryModel. Then, we will build a MovieModel and expand it with hidden layers in a CandidateModel. Finally, we will build a CombinedModel by combining the QueryModel and the CandidateModel and implementing the loss and metrics logic.
UserModel
class UserModel(tf.keras.Model):

    def __init__(self):
        super().__init__()

        self.age_embedding = tf.keras.Sequential([
            tf.keras.layers.IntegerLookup(
                vocabulary=unique_bucketized_user_age, mask_token=None),
            tf.keras.layers.Embedding(len(unique_bucketized_user_age) + 1, 32),
        ])
        self.occupation_embedding = tf.keras.Sequential([
            tf.keras.layers.IntegerLookup(
                vocabulary=unique_user_occupation_label, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_occupation_label) + 1, 32),
        ])
        self.gender_embedding = tf.keras.Sequential([
            tf.keras.layers.IntegerLookup(
                vocabulary=unique_user_gender, mask_token=None),
            tf.keras.layers.Embedding(len(unique_user_gender) + 1, 32),
        ])
        self.timestamp_embedding = tf.keras.Sequential([
            tf.keras.layers.Discretization(timestamp_buckets.tolist()),
            tf.keras.layers.Embedding(len(timestamp_buckets) + 1, 32),
        ])
        self.normalized_timestamp = tf.keras.layers.Normalization(
            axis=None
        )
        self.normalized_timestamp.adapt(timestamps)

    def call(self, inputs):
        # Take the input dictionary, pass it through each input layer,
        # and concatenate the result.
        return tf.concat([
            self.age_embedding(inputs["bucketized_user_age"]),
            self.occupation_embedding(inputs["user_occupation_label"]),
            self.gender_embedding(inputs["user_gender"]),
            self.timestamp_embedding(inputs["timestamp"]),
            tf.reshape(self.normalized_timestamp(inputs["timestamp"]), (-1, 1)),
        ], axis=1)
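The `+ 1` in each Embedding size above leaves room for one out-of-vocabulary index. Here is a minimal pure-Python sketch of that idea, assuming the lookup layer reserves index 0 for unknown values and numbers the vocabulary from 1. The helper names and the occupation labels are hypothetical.

```python
def build_lookup(vocabulary):
    """Map each known token to an index starting at 1; index 0 is kept for unknowns."""
    return {token: i + 1 for i, token in enumerate(vocabulary)}


def lookup(table, token):
    """Return the index of a token, or 0 if it is not in the vocabulary."""
    return table.get(token, 0)


occupations = [0, 3, 17, 21]  # hypothetical occupation labels
table = build_lookup(occupations)

indices = [lookup(table, token) for token in [17, 0, 99]]
embedding_rows = len(occupations) + 1  # one extra row for unknown values

print(indices)         # [3, 1, 0]
print(embedding_rows)  # 5
```

A value that was never seen during training, such as 99 here, still gets a valid embedding row instead of raising an error.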
QueryModel
class QueryModel(tf.keras.Model):
    """Model for encoding user queries."""

    def __init__(self, layer_sizes):
        """Model for encoding user queries.

        Args:
          layer_sizes:
            A list of integers where the i-th entry represents the number of units
            the i-th layer contains.
        """
        super().__init__()

        # We first use the user model for generating embeddings.
        self.embedding_model = UserModel()

        # Then construct the layers.
        self.dense_layers = tf.keras.Sequential()

        # Use the ReLU activation for all but the last layer.
        for layer_size in layer_sizes[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu"))

        # No activation for the last layer.
        for layer_size in layer_sizes[-1:]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size))

    def call(self, inputs):
        feature_embedding = self.embedding_model(inputs)
        return self.dense_layers(feature_embedding)
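The effect of `layer_sizes` can be sketched in NumPy as a stack of dense layers with ReLU on every layer except the last. This is an illustration with random, untrained weights, not the Keras implementation. The input width of 129 assumes the UserModel output above (four 32-dimensional embeddings plus one normalized timestamp), and `build_tower` and `forward` are hypothetical helper names.

```python
import numpy as np


def build_tower(input_dim, layer_sizes, rng):
    """Create one (weights, bias) pair per entry in layer_sizes."""
    dims = [input_dim] + list(layer_sizes)
    return [
        (rng.normal(scale=0.1, size=(d_in, d_out)), np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]


def forward(tower, x):
    """Apply the layers in order, with ReLU on all but the last layer."""
    for i, (weights, bias) in enumerate(tower):
        x = x @ weights + bias
        if i < len(tower) - 1:
            x = np.maximum(x, 0.0)
    return x


rng = np.random.default_rng(0)
features = rng.normal(size=(3, 129))  # a batch of 3 concatenated feature vectors

tower = build_tower(129, [64, 32], rng)
out = forward(tower, features)
print(out.shape)  # (3, 32)
```

Leaving the last layer linear keeps the final embedding unconstrained in sign, which matters because the score is a dot product between the two towers.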
MovieModel
class MovieModel(tf.keras.Model):

    def __init__(self):
        super().__init__()

        max_tokens = 10_000

        self.title_embedding = tf.keras.Sequential([
            tf.keras.layers.StringLookup(
                vocabulary=unique_movie_titles, mask_token=None),
            tf.keras.layers.Embedding(len(unique_movie_titles) + 1, 32)
        ])

        self.title_vectorizer = tf.keras.layers.TextVectorization(
            max_tokens=max_tokens)

        self.title_text_embedding = tf.keras.Sequential([
            self.title_vectorizer,
            tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
            tf.keras.layers.GlobalAveragePooling1D(),
        ])

        self.title_vectorizer.adapt(movies)

    def call(self, titles):
        return tf.concat([
            self.title_embedding(titles),
            self.title_text_embedding(titles),
        ], axis=1)
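The title text branch turns each title into word tokens, embeds every token, and averages the token vectors while ignoring padding. The NumPy sketch below imitates those three steps under simplifying assumptions: token ids are assigned in order of first appearance (a real vectorizer may order its vocabulary differently), the embedding table is random, and `tokenize` is a hypothetical helper.

```python
import re

import numpy as np


def tokenize(title):
    """Lowercase, strip punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", title.lower()).split()


titles = ["Heat (1995)", "Twelve Monkeys (1995)"]

# Index 0 is reserved for padding and index 1 for unknown tokens.
vocab = {"": 0, "[UNK]": 1}
for title in titles:
    for token in tokenize(title):
        vocab.setdefault(token, len(vocab))

# Pad every title to the same length with zeros.
max_len = max(len(tokenize(title)) for title in titles)
ids = np.zeros((len(titles), max_len), dtype=int)
for row, title in enumerate(titles):
    tokens = tokenize(title)
    ids[row, :len(tokens)] = [vocab[token] for token in tokens]

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # hypothetical 4-dim embeddings

token_vectors = embedding_table[ids]   # shape (2, max_len, 4)
mask = (ids != 0)[..., None]           # True for real tokens, False for padding
title_vectors = (token_vectors * mask).sum(axis=1) / mask.sum(axis=1)

print(ids.tolist())         # [[2, 3, 0], [4, 5, 3]]
print(title_vectors.shape)  # (2, 4)
```

Because both titles share the token "1995", they share part of their representation, which is what lets the text branch generalize across movies in a way a per-title lookup cannot.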
CandidateModel
class CandidateModel(tf.keras.Model):
    """Model for encoding movies."""

    def __init__(self, layer_sizes):
        """Model for encoding movies.

        Args:
          layer_sizes:
            A list of integers where the i-th entry represents the number of units
            the i-th layer contains.
        """
        super().__init__()

        self.embedding_model = MovieModel()

        # Then construct the layers.
        self.dense_layers = tf.keras.Sequential()

        # Use the ReLU activation for all but the last layer.
        for layer_size in layer_sizes[:-1]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size, activation="relu"))

        # No activation for the last layer.
        for layer_size in layer_sizes[-1:]:
            self.dense_layers.add(tf.keras.layers.Dense(layer_size))

    def call(self, inputs):
        feature_embedding = self.embedding_model(inputs)
        return self.dense_layers(feature_embedding)
CombinedModel
class MovielensModel(tfrs.models.Model):

    def __init__(self, layer_sizes):
        super().__init__()
        self.query_model = QueryModel(layer_sizes)
        self.candidate_model = CandidateModel(layer_sizes)
        self.task = tfrs.tasks.Retrieval(
            metrics=tfrs.metrics.FactorizedTopK(
                candidates=movies.batch(128).map(self.candidate_model),
            ),
        )

    def compute_loss(self, features, training=False):
        query_embeddings = self.query_model({
            "bucketized_user_age": features["bucketized_user_age"],
            "user_occupation_label": features["user_occupation_label"],
            "user_gender": features["user_gender"],
            "timestamp": features["timestamp"],
        })
        movie_embeddings = self.candidate_model(features["movie_title"])

        return self.task(
            query_embeddings, movie_embeddings, compute_metrics=not training)
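To give some intuition for the loss, the sketch below computes an in-batch softmax cross-entropy in NumPy: each query is scored against every candidate in the batch, and the candidate on the same row is treated as the correct one. This is my understanding of the default behaviour of the retrieval task, so treat it as an assumption rather than a description of the library internals. The function name is hypothetical.

```python
import numpy as np


def in_batch_softmax_loss(query_emb, candidate_emb):
    """Sum of cross-entropy losses where row i of the queries matches row i of the candidates."""
    scores = query_emb @ candidate_emb.T                  # (batch, batch) score matrix
    scores = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.trace(log_probs))                    # diagonal holds the positives


rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 32))
candidates = rng.normal(size=(4, 32))

random_loss = in_batch_softmax_loss(queries, candidates)
aligned_loss = in_batch_softmax_loss(10 * np.eye(4), 10 * np.eye(4))

print(random_loss, aligned_loss)
```

When each query points strongly at its own candidate and away from the others, as in the `aligned_loss` case, the loss approaches zero.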
Training the Model
We will split the data into training and test datasets (80-20).
tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

train = shuffled.take(80_000)
test = shuffled.skip(80_000).take(20_000)

cached_train = train.shuffle(100_000).batch(2048)
cached_test = test.batch(4096).cache()
Let’s run the model using one layer of 32 units.
num_epochs = 50

model = MovielensModel([32])
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))

one_layer_history = model.fit(
    cached_train,
    validation_data=cached_test,
    validation_freq=5,
    epochs=num_epochs,
    verbose=0)

accuracy = one_layer_history.history["val_factorized_top_k/top_100_categorical_accuracy"][-1]
print(f"Top-100 accuracy: {accuracy:.2f}.")
Top-100 accuracy: 0.16.
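Top-100 accuracy is the share of test interactions for which the movie the user actually watched appears among the 100 highest-scoring candidates. Here is a small NumPy sketch of the same metric with toy scores and a hypothetical `top_k_accuracy` helper (k is kept small so the numbers are easy to follow):

```python
import numpy as np


def top_k_accuracy(scores, true_idx, k):
    """Fraction of rows whose true candidate is among the k highest scores."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return float(hits.mean())


# Rows are queries, columns are candidates (toy values).
scores = np.array([
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.7, 0.1],
    [0.4, 0.3, 0.2, 0.6],
])
true_idx = np.array([2, 0, 3])  # index of the true candidate per query

acc_at_1 = top_k_accuracy(scores, true_idx, k=1)
acc_at_2 = top_k_accuracy(scores, true_idx, k=2)
acc_at_3 = top_k_accuracy(scores, true_idx, k=3)
print(acc_at_1, acc_at_2, acc_at_3)
```

In this toy case the accuracy is 1/3 at k=1, 2/3 at k=2, and 1.0 at k=3.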
Note that to run a deeper model, for example a layer of 64 units followed by a layer of 32 units, you can pass the layer sizes as follows.
model = MovielensModel([64, 32])
Notice that the accuracy is low because we excluded the user ID as a feature in order to focus on the Users’ attributes only. However, we can easily add it if we want. Feel free to experiment with different features.
Predictions
The TensorFlow tutorial does not show you how to make predictions! Let’s see how we can get recommendations for the following input.
"bucketized_user_age": np.array([25])
"user_occupation_label": np.array([17])
"user_gender": np.array([True])
"timestamp": np.array([879024327])
index = tfrs.layers.factorized_top_k.BruteForce(model.query_model)
index.index_from_dataset(
    tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.candidate_model)))
)

_, titles = index({
    "bucketized_user_age": np.array([25]),
    "user_occupation_label": np.array([17]),
    "user_gender": np.array([True]),
    "timestamp": np.array([879024327])},
    k=50
)
Get the top 50 recommended movies.
titles[0].numpy()
array([b'Guilty as Sin (1993)', b'Malice (1993)', b'Tank Girl (1995)',
b'Body Parts (1991)', b'Murder in the First (1995)',
b'Secret Garden, The (1993)', b'Death and the Maiden (1994)',
b'Devil in a Blue Dress (1995)', b'Professional, The (1994)',
b'Dolores Claiborne (1994)', b'Kazaam (1996)',
b'Madonna: Truth or Dare (1991)', b'Heat (1995)',
b'Leaving Las Vegas (1995)', b'Smoke (1995)',
b'Twelve Monkeys (1995)', b'Rock, The (1996)',
b'Turbo: A Power Rangers Movie (1997)',
b'Mis\xc3\xa9rables, Les (1995)', b'Legends of the Fall (1994)',
b'Big Night (1996)',
b'Willy Wonka and the Chocolate Factory (1971)',
b'Hunchback of Notre Dame, The (1996)', b'101 Dalmatians (1996)',
b'Hercules (1997)', b'Robin Hood: Prince of Thieves (1991)',
b'True Romance (1993)', b'Fugitive, The (1993)',
b'Jungle Book, The (1994)', b'Substitute, The (1996)',
b'Bridges of Madison County, The (1995)', b'Ransom (1996)',
b'Fargo (1996)', b'Eat Drink Man Woman (1994)',
b'Client, The (1994)', b'No Escape (1994)', b'Firm, The (1993)',
b'Psycho (1960)', b'Jerry Maguire (1996)', b'Kalifornia (1993)',
b'Homeward Bound II: Lost in San Francisco (1996)',
b"Preacher's Wife, The (1996)", b'Donnie Brasco (1997)',
b'Jaws (1975)', b'Jude (1996)',
b'Lost World: Jurassic Park, The (1997)',
b"Miller's Crossing (1990)", b'Mirror Has Two Faces, The (1996)',
b'In the Line of Fire (1993)', b'Diabolique (1996)'], dtype=object)
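Conceptually, a brute-force index scores the query embedding against every candidate embedding and keeps the k best. A minimal NumPy sketch of that idea follows, with toy 2-dimensional vectors, made-up candidate names, and a hypothetical `brute_force_top_k` helper:

```python
import numpy as np


def brute_force_top_k(query_vec, candidate_vecs, candidate_ids, k):
    """Score every candidate with a dot product and return the k best."""
    scores = candidate_vecs @ query_vec
    order = np.argsort(-scores)[:k]
    return scores[order], [candidate_ids[i] for i in order]


candidate_ids = ["A", "B", "C", "D"]
candidate_vecs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
])
query_vec = np.array([1.0, 0.5])

top_scores, top_ids = brute_force_top_k(query_vec, candidate_vecs, candidate_ids, k=2)
print(top_ids)  # ['C', 'A']
```

Scoring every candidate is affordable for a catalogue of this size; for much larger catalogues an approximate nearest-neighbour index is the usual alternative.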
How to Get Users and Movies Embeddings
We can easily generate the QueryModel and CandidateModel embeddings. Let’s get the embeddings of our previous input to the query model.
model.query_model.predict({
    "bucketized_user_age": np.array([25]),
    "user_gender": np.array([True]),
    "user_occupation_label": np.array([17]),
    "timestamp": np.array([879024327])})
1/1 [==============================] - 0s 160ms/step
array([[-0.18362528, 0.34076893, 0.30787617, -0.05884944, -0.1902037 ,
-0.2550715 , -0.504346 , 0.2827947 , -0.78624237, -0.6882196 ,
0.33391213, 0.44255227, 0.31369513, 0.8378982 , 0.31462622,
1.4631445 , -0.20804778, 0.520661 , -1.0084451 , 0.05185542,
0.43012613, 1.2417494 , -0.5769971 , 0.04806066, 0.14111865,
-0.44874376, 1.0474757 , -0.37607613, -0.00435135, 0.3776496 ,
0.09939097, 1.0909896 ]], dtype=float32)
Similarly, we can get the embeddings of the “Red Rock West (1992)” movie.
model.candidate_model.predict(["Red Rock West (1992)"])
1/1 [==============================] - 0s 170ms/step
array([[ 0.88278526, -0.2830179 , 0.04174206, 0.5652223 , -0.73649615,
-0.5735289 , 0.22620858, 0.16391961, -0.17856865, -0.3096914 ,
-0.7460059 , -0.11384141, -0.22080638, -0.3547582 , -0.01823226,
0.7193038 , -0.52851117, 0.21540207, -0.36750677, -0.396778 ,
0.46227038, 0.1192332 , -0.3032046 , -0.44636455, -0.39206988,
-0.19133675, 0.08684351, 0.02275734, -0.04393479, -0.16428764,
-0.10052559, 0.47641134]], dtype=float32)
Note that we get the same values without using the predict method, returned as a tensor rather than a NumPy array. For example:
model.query_model({
    "bucketized_user_age": np.array([25]),
    "user_gender": np.array([True]),
    "user_occupation_label": np.array([17]),
    "timestamp": np.array([879024327])})
Make Predictions using Embeddings and Dot Product
We can make predictions using the dot product of the QueryModel and CandidateModel embeddings. The higher the dot product, the better the score. Let’s say that we want to recommend one of the following three movies to a user with specific attributes.
- Red Rock West (1992)
- Palookaville (1996)
- Beautiful Thing (1996)
# dot product
query_01 = model.query_model.predict({
    "bucketized_user_age": np.array([25]),
    "user_gender": np.array([True]),
    "user_occupation_label": np.array([17]),
    "timestamp": np.array([879024327])})

movie_01 = model.candidate_model.predict(["Red Rock West (1992)"])
movie_02 = model.candidate_model.predict(["Palookaville (1996)"])
movie_03 = model.candidate_model.predict(["Beautiful Thing (1996)"])
We get the score of each movie:
np.matmul(query_01, np.transpose(np.concatenate((movie_01,movie_02,movie_03), axis=0)))
array([[ 2.3073187 , 0.42857647, -0.6794552 ]], dtype=float32)
Hence, based on the above example, we should recommend the “Red Rock West (1992)” movie.
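Picking the winner can also be done programmatically. The short sketch below reuses the three scores printed above and ranks the titles from best to worst. The variable names are only for illustration.

```python
titles = ["Red Rock West (1992)", "Palookaville (1996)", "Beautiful Thing (1996)"]
scores = [2.3073187, 0.42857647, -0.6794552]  # values copied from the output above

# Sort the (title, score) pairs by score, highest first.
ranked = sorted(zip(titles, scores), key=lambda pair: pair[1], reverse=True)
best_title = ranked[0][0]

print(best_title)  # Red Rock West (1992)
```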
Save and Load the Model
We can save and load the model as follows:
import os
import tempfile

# Export the retrieval index, which wraps the query model.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model")

    # Save the index.
    tf.saved_model.save(index, path)

    # Load it back; can also be done in TensorFlow Serving.
    loaded = tf.saved_model.load(path)

    # Pass the user features in, get top predicted movie titles back.
    _, titles = loaded({
        "bucketized_user_age": np.array([25]),
        "user_occupation_label": np.array([17]),
        "user_gender": np.array([True]),
        "timestamp": np.array([879024327])}
    )

    print(f"Recommendations: {titles[0][:10]}")
Recommendations: [b'Guilty as Sin (1993)' b'Malice (1993)' b'Tank Girl (1995)'
b'Body Parts (1991)' b'Murder in the First (1995)'
b'Secret Garden, The (1993)' b'Death and the Maiden (1994)'
b'Devil in a Blue Dress (1995)' b'Professional, The (1994)'
b'Dolores Claiborne (1994)']
We can also call the query model attached to the loaded index in order to get the embeddings as follows:
loaded.query_model({
    "bucketized_user_age": np.array([25]),
    "user_gender": np.array([True]),
    "user_occupation_label": np.array([17]),
    "timestamp": np.array([879024327])})