Predictive Hacks

How to run Recommender Systems in Python


A Brief Introduction to Recommender Systems

Nowadays, many companies use Recommender Systems (RecSys), a subclass of information filtering systems that seek to predict the “rating” or “preference” a user would give to an item. They are primarily used in commercial applications. Some famous examples:

  • Amazon: Among the first companies to apply Recommender Systems extensively, around 1998. It suggests similar products based on the user’s preferences. It started with books and now covers all of its products.
  • YouTube: Based on the videos you have watched, it suggests other videos that you are likely to enjoy.
  • Spotify: Its successful Recommender System helped make it famous, and many people let Spotify choose the music for them.
  • Facebook: It shows at the top of the feed the posts that are most likely to interest you.
  • Instagram: It suggests profiles to follow based on your preferences.
  • Netflix: It recommends movies for you based on your past ratings. It is worth mentioning the Netflix Prize, an open competition for the best collaborative filtering algorithm to predict user ratings for films based on previous ratings alone, without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest. On September 21, 2009, the $1M Grand Prize was awarded to team “BellKor’s Pragmatic Chaos”. So, build an improved Recommender System of your own and you might become rich one day 🙂

Surprise for Recommender Systems

Recommender Systems remain an active field of research that attracts a lot of interest. Our goal here is to show how you can easily apply a Recommender System without explaining the underlying maths. We will work with the surprise package, an easy-to-use Python scikit for recommender systems.

The available prediction algorithms are:

  • random_pred.NormalPredictor: Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
  • baseline_only.BaselineOnly: Algorithm predicting the baseline estimate for a given user and item.
  • knns.KNNBasic: A basic collaborative filtering algorithm.
  • knns.KNNWithMeans: A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
  • knns.KNNWithZScore: A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
  • knns.KNNBaseline: A basic collaborative filtering algorithm, taking into account a baseline rating.
  • matrix_factorization.SVD: The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization [SM08].
  • matrix_factorization.SVDpp: The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
  • matrix_factorization.NMF: A collaborative filtering algorithm based on Non-negative Matrix Factorization.
  • slope_one.SlopeOne: A simple yet accurate collaborative filtering algorithm.
  • co_clustering.CoClustering: A collaborative filtering algorithm based on co-clustering.
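To give a feel for what the simplest of these predictors compute, here is a minimal plain-Python sketch of a baseline estimate: the global mean plus a user offset and an item offset. It is an illustration only. The offsets here are plain averages of deviations, whereas Surprise fits them with regularization, so the numbers will not match `BaselineOnly`. The function names `baseline_estimates` and `predict_baseline` and the toy ratings are made up for the example.

```python
from collections import defaultdict

def baseline_estimates(ratings):
    """Return (mu, user_bias, item_bias) from (user, item, rating) triples.

    Simplified illustration: biases are unregularized mean deviations.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item offset: how far an item's ratings sit from the global mean.
    item_dev = defaultdict(list)
    for _, i, r in ratings:
        item_dev[i].append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}

    # User offset: what remains after removing the global mean and item offset.
    user_dev = defaultdict(list)
    for u, i, r in ratings:
        user_dev[u].append(r - mu - item_bias[i])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, user_bias, item_bias

def predict_baseline(mu, user_bias, item_bias, user, item):
    # Unknown users or items contribute an offset of 0.
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)

# Made-up ratings: u1 rates generously, item A is better liked than item B.
toy = [("u1", "A", 5), ("u1", "B", 3), ("u2", "A", 4), ("u2", "B", 2)]
mu, bu, bi = baseline_estimates(toy)
print(predict_baseline(mu, bu, bi, "u1", "A"))
```

A user or item that never appears in the training data simply falls back towards the global mean, which is why a baseline can always produce a prediction.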

Build your own Recommender System

We will provide an example of how you can build your own recommender. We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.

Let’s get our hands dirty!

import pandas as pd
import numpy as np

columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)

columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]

combined_movies_data = pd.merge(df, movie_names, on='item_id')
combined_movies_data = combined_movies_data[['user_id','movie title', 'rating']]
combined_movies_data.head()
   user_id  movie title   rating
0  196      Kolya (1996)  3
1  63       Kolya (1996)  3
2  226      Kolya (1996)  5
3  154      Kolya (1996)  3
4  306      Kolya (1996)  5

I will also add my own ratings for some movies from this dataset, since my ultimate goal is to get recommendations for myself ;). Below you can see my preferences. I will assign myself the user_id 1001.

# my user_id is 1001
my_ratings = pd.read_csv('my_movies_rating.csv')
my_ratings 
    user_id  movie title                          rating
0   1001     Aladdin (1992)                       1.0
1   1001     Braveheart (1995)                    5.0
2   1001     Clockwork Orange, A (1971)           2.0
3   1001     Dances with Wolves (1990)            3.5
4   1001     English Patient, The (1996)          2.0
5   1001     Face/Off (1997)                      3.5
6   1001     Forrest Gump (1994)                  4.0
7   1001     Game, The (1997)                     3.5
8   1001     Godfather, The (1972)                5.0
9   1001     Jurassic Park (1993)                 3.5
10  1001     Liar Liar (1997)                     2.0
11  1001     Lion King, The (1994)                2.5
12  1001     Pulp Fiction (1994)                  4.0
13  1001     Reservoir Dogs (1992)                4.0
14  1001     Return of the Jedi (1983)            1.0
15  1001     Rock, The (1996)                     4.0
16  1001     Scream (1996)                        1.0
17  1001     Seven (Se7en) (1995)                 5.0
18  1001     Silence of the Lambs, The (1991)     5.0
19  1001     Star Trek: First Contact (1996)      1.0
20  1001     Star Trek: The Wrath of Khan (1982)  1.0
21  1001     Star Wars (1977)                     1.0
22  1001     Terminator 2: Judgment Day (1991)    3.5
23  1001     Titanic (1997)                       4.0
24  1001     Trainspotting (1996)                 3.0

The next step is to append my ratings to the rest of the ratings. We will also keep only the movies that have more than 25 reviews.

combined_movies_data = pd.concat([combined_movies_data, my_ratings], axis=0)

# rename the columns to userID, itemID and rating
combined_movies_data.columns = ['userID', 'itemID', 'rating']

# count the ratings per itemID (the movie title) with transform, then keep the movies with more than 25 reviews
combined_movies_data['reviews'] = combined_movies_data.groupby('itemID')['rating'].transform('count')

combined_movies_data = combined_movies_data[combined_movies_data['reviews'] > 25][['userID', 'itemID', 'rating']]
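To make the filtering step concrete, the following plain-Python sketch applies the same idea to a handful of made-up rows: count the ratings per title, then keep only the rows whose title passes the threshold. The helper name `keep_popular` and the toy data are assumptions for illustration; in the post itself the work is done by `groupby(...).transform('count')`.

```python
from collections import Counter

def keep_popular(rows, min_reviews):
    """Keep (user, item, rating) rows whose item has more than min_reviews ratings."""
    counts = Counter(item for _, item, _ in rows)
    return [row for row in rows if counts[row[1]] > min_reviews]

# Made-up rows: one title rated three times, another rated once.
toy_rows = [
    (1, "Kolya (1996)", 3),
    (2, "Kolya (1996)", 5),
    (3, "Kolya (1996)", 4),
    (1, "Scream (1996)", 2),
]

# With a threshold of 2, only the title rated three times survives.
filtered = keep_popular(toy_rows, min_reviews=2)
print(filtered)
```

Note that the comparison is strict, as in the post's code: a title with exactly the threshold number of reviews is dropped.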

Our dataset is now ready, and we can apply different recommender systems using the surprise package.

from surprise import NMF, SVD, SVDpp, KNNBasic, KNNWithMeans, KNNWithZScore, CoClustering
from surprise.model_selection import cross_validate
from surprise import Reader, Dataset
# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(combined_movies_data, reader)

Naturally, we want to exclude the movies that I have already rated from the suggestions. Let’s remove them:

# get the list of the movie ids
unique_ids = combined_movies_data['itemID'].unique()

# get the list of the ids that the userid 1001 has rated
iids1001 = combined_movies_data.loc[combined_movies_data['userID']==1001, 'itemID']

# remove the rated movies for the recommendations
movies_to_predict = np.setdiff1d(unique_ids,iids1001)

Recommender Systems using NMF

algo = NMF()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001,iid=iid).est))
    
pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)

My recommendations according to NMF:

     iid                                predictions
547  Once Were Warriors (1994)          4.246192
30   All About Eve (1950)               4.043059
543  On Golden Pond (1981)              3.968875
770  True Romance (1993)                3.850275
399  Jaws (1975)                        3.841900
667  Singin’ in the Rain (1952)         3.841563
99   Blade Runner (1982)                3.835287
141  Casablanca (1942)                  3.815703
661  Shawshank Redemption, The (1994)   3.811728
646  Secret of Roan Inish, The (1994)   3.800933
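The same fit, predict and sort loop is repeated for every algorithm below, so it can be convenient to wrap the ranking part in a small helper. The sketch below is one possible way to do that. `top_n` is a hypothetical name, and `predict` is any function that maps an item id to an estimated rating (with Surprise that could be `lambda iid: algo.predict(uid=1001, iid=iid).est`). A dictionary of made-up scores stands in for a fitted model so that the example runs on its own.

```python
import heapq

def top_n(predict, candidates, n=10):
    """Return the n (item, estimate) pairs with the highest estimates."""
    scored = ((iid, predict(iid)) for iid in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Stand-in for a fitted model: made-up estimates, not real predictions.
fake_estimates = {"Movie A": 3.2, "Movie B": 4.5, "Movie C": 2.8, "Movie D": 4.1}

best = top_n(fake_estimates.get, fake_estimates.keys(), n=2)
print(best)
```

Using `heapq.nlargest` avoids sorting the full candidate list when only the top few items are needed.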

Recommender Systems using SVD

algo = SVD()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001,iid=iid).est))
    
pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)

My recommendations according to SVD:

     iid                                      predictions
608  Rear Window (1954)                       3.923817
536  North by Northwest (1959)                3.855131
661  Shawshank Redemption, The (1994)         3.838807
757  To Kill a Mockingbird (1962)             3.820112
30   All About Eve (1950)                     3.783826
549  One Flew Over the Cuckoo’s Nest (1975)   3.780879
1    12 Angry Men (1957)                      3.762648
719  Sunset Blvd. (1950)                      3.746023
324  Godfather: Part II, The (1974)           3.737058
330  Graduate, The (1967)                     3.725382

Recommender Systems using SVD++

algo = SVDpp()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001,iid=iid).est))
    
pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)

My recommendations according to SVD++:

     iid                                      predictions
168  Close Shave, A (1995)                    4.111232
608  Rear Window (1954)                       3.947740
641  Schindler’s List (1993)                  3.931670
536  North by Northwest (1959)                3.875646
1    12 Angry Men (1957)                      3.862930
330  Graduate, The (1967)                     3.850796
787  Vertigo (1958)                           3.832873
463  Manchurian Candidate, The (1962)         3.832311
549  One Flew Over the Cuckoo’s Nest (1975)   3.824999
459  Maltese Falcon, The (1941)               3.812775

Recommender Systems using KNN with Z-Score

algo = KNNWithZScore()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001,iid=iid).est))
    
pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)

My recommendations according to KNN with Z-Score:

     iid                                      predictions
425  L.A. Confidential (1997)                 4.284275
661  Shawshank Redemption, The (1994)         4.214673
168  Close Shave, A (1995)                    4.177236
820  Wrong Trousers, The (1993)               4.161339
608  Rear Window (1954)                       4.118845
55   As Good As It Gets (1997)                4.108190
1    12 Angry Men (1957)                      4.096164
549  One Flew Over the Cuckoo’s Nest (1975)   4.088914
784  Usual Suspects, The (1995)               4.086552
647  Secrets & Lies (1996)                    4.082041

Recommender Systems using Co-Clustering

algo = CoClustering()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001,iid=iid).est))
    
pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)

My recommendations according to Co-Clustering:

     iid                                                      predictions
168  Close Shave, A (1995)                                    3.889752
820  Wrong Trousers, The (1993)                               3.864782
141  Casablanca (1942)                                        3.855470
796  Wallace & Gromit: The Best of Aardman Animation (1996)   3.846441
608  Rear Window (1954)                                       3.786240
784  Usual Suspects, The (1995)                               3.784448
1    12 Angry Men (1957)                                      3.742680
740  Third Man, The (1949)                                    3.732013
158  Citizen Kane (1941)                                      3.691609
757  To Kill a Mockingbird (1962)                             3.690918

How to Evaluate the Recommender Systems

We saw earlier that each algorithm suggested different movies. The question is which one performs best and how to choose between them.

As in any Machine Learning problem, we can split our dataset into train and test sets and evaluate the performance on the test set. We will apply 3-fold Cross Validation and report the average RMSE across the 3 folds.

cv = []
# Iterate over all recommender system algorithms
for recsys in [NMF(), SVD(), SVDpp(), KNNWithZScore(), CoClustering()]:
    # Perform cross validation
    tmp = cross_validate(recsys, data, measures=['RMSE'], cv=3, verbose=False)
    # Store the algorithm's class name together with its mean test RMSE
    cv.append((type(recsys).__name__, tmp['test_rmse'].mean()))

pd.DataFrame(cv, columns=['RecSys', 'RMSE'])

Average RMSE on the Test Dataset

   RecSys         RMSE
0  NMF            0.959439
1  SVD            0.934698
2  SVD++          0.916328
3  KNNWithZScore  0.942878
4  CoClustering   0.950441

As we can see, SVD++ had the best performance (lowest RMSE).
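For reference, RMSE is the square root of the mean squared difference between predicted and actual ratings, so lower values mean the predictions are closer to the true ratings on average. A minimal sketch in plain Python with made-up numbers (the function name `rmse` is ours, not part of Surprise):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings: the errors are 1, 0, 1 and 2 stars.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.0, 3.0, 4.0, 4.0]
error = rmse(actual, predicted)
print(error)
```

Because the errors are squared before averaging, a single prediction that is off by two stars weighs more than two predictions that are each off by one star.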

Discussion


We built several Recommender Systems with an RMSE below 1. Our models took into consideration only the UserID, the ItemID and the ratings. This post explains briefly the logic of item-based and user-based collaborative filtering. You can also find an example of item-based collaborative filtering. We can apply different algorithms by taking into account other attributes like the genre of the movie, the release date, the director, the actors, the budget, the duration and so on. In this case, we are referring to content-based recommenders, which treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on an item’s features. In such a system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. Finally, we can even take into consideration the user’s attributes, such as gender, age, location, language, etc.
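As a rough illustration of the content-based idea, the sketch below builds a user profile as a rating-weighted sum of genre vectors and ranks unseen items by cosine similarity to that profile. Everything here is an assumption for illustration: the titles, the three-genre encoding, and the helper names `build_profile` and `cosine` are hypothetical. It also uses a simple similarity score rather than a trained classifier, so it is a simplification of the approach described above.

```python
import math

def build_profile(rated, features):
    """Sum item feature vectors, weighted by each rating's distance from the user's mean."""
    mean = sum(rated.values()) / len(rated)
    size = len(next(iter(features.values())))
    profile = [0.0] * size
    for item, rating in rated.items():
        for k, value in enumerate(features[item]):
            profile[k] += (rating - mean) * value
    return profile

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical genre flags: [crime, animation, sci-fi]
features = {
    "Crime Film 1": [1, 0, 0],
    "Cartoon 1": [0, 1, 0],
    "Space Film 1": [0, 0, 1],
    "Crime Film 2": [1, 0, 0],
    "Cartoon 2": [0, 1, 0],
}
# Made-up ratings for one user: likes crime, dislikes animation.
my_rated = {"Crime Film 1": 5.0, "Cartoon 1": 1.0, "Space Film 1": 3.0}

profile = build_profile(my_rated, features)
unseen = [title for title in features if title not in my_rated]
ranked = sorted(unseen, key=lambda t: cosine(profile, features[t]), reverse=True)
print(ranked)
```

Unlike collaborative filtering, this needs no other users' ratings, only a description of each item.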
