A Brief Introduction to Recommender Systems
Nowadays, many companies use Recommender Systems (RecSys), a subclass of information filtering systems that seek to predict the “rating” or “preference” a user would give to an item. They are primarily used in commercial applications. Some famous examples:
- Amazon: One of the first companies to apply Recommender Systems extensively, around 1998. It suggested similar products based on the user’s preferences, first for books and now for all of its products.
- YouTube: Based on the videos that you have watched, it suggests other videos that you are likely to enjoy.
- Spotify: Its successful Recommender System helped make it famous, and many people let Spotify choose the music for them.
- Facebook: It shows at the top of your feed the posts that are most likely to interest you.
- Instagram: It suggests profiles to follow based on your preferences.
- Netflix: It recommends movies for you based on your past ratings. It is worth mentioning the Netflix Prize, an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest. On September 21, 2009 they awarded the $1M Grand Prize to team “BellKor’s Pragmatic Chaos”. So, you can build your own improved Recommender System and you can become rich one day 🙂
Surprise for Recommender Systems
Recommender Systems still attract a lot of interest and remain an active field of research. Our goal here is to show how you can easily build your own Recommender System without going into the underlying maths. We will work with the surprise package, which is an easy-to-use Python scikit for recommender systems.
The available prediction algorithms are:
Algorithm | Description |
---|---|
random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. |
baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item. |
knns.KNNBasic | A basic collaborative filtering algorithm. |
knns.KNNWithMeans | A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
knns.KNNWithZScore | A basic collaborative filtering algorithm, taking into account the z-score normalization of each user. |
knns.KNNBaseline | A basic collaborative filtering algorithm taking into account a baseline rating. |
matrix_factorization.SVD | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. When baselines are not used, this is equivalent to Probabilistic Matrix Factorization [SM08]. |
matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
matrix_factorization.NMF | A collaborative filtering algorithm based on Non-negative Matrix Factorization. |
slope_one.SlopeOne | A simple yet accurate collaborative filtering algorithm. |
co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering. |
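Several of the algorithms above rely on a "baseline estimate", which combines the global mean rating with a per-user and a per-item bias. The sketch below illustrates that idea on made-up toy data. It is a simplified, unregularized single pass written for illustration and is not how the surprise package fits its baselines; the function names and the toy ratings are assumptions.

```python
from collections import defaultdict


def fit_baselines(ratings):
    """ratings: list of (user, item, rating). Returns (mu, user_bias, item_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Item bias: average deviation of an item's ratings from the global mean.
    item_dev = defaultdict(list)
    for _, item, r in ratings:
        item_dev[item].append(r - mu)
    item_bias = {i: sum(d) / len(d) for i, d in item_dev.items()}

    # User bias: average of what is left after removing mu and the item bias.
    user_dev = defaultdict(list)
    for user, item, r in ratings:
        user_dev[user].append(r - mu - item_bias[item])
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}
    return mu, user_bias, item_bias


def predict_baseline(mu, user_bias, item_bias, user, item):
    # Unknown users or items simply contribute a bias of zero.
    return mu + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)


# Hypothetical toy ratings: (user, item, rating)
toy = [("ann", "A", 5), ("ann", "B", 3), ("bob", "A", 4), ("bob", "B", 2)]
mu, bu, bi = fit_baselines(toy)
print(predict_baseline(mu, bu, bi, "ann", "A"))  # 5.0
```

Here the global mean is 3.5, item A sits one point above it and ann rates half a point above average, so the estimate for (ann, A) is 5.0.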
Build your own Recommender System
We will provide an example of how you can build your own recommender. We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.
Let’s get our hands dirty!
import pandas as pd
import numpy as np

# ratings: one row per (user, item) pair
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)

# movie metadata: title, dates and one flag column per genre
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
           'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
           'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]

# attach the title to each rating and keep only the columns we need
combined_movies_data = pd.merge(df, movie_names, on='item_id')
combined_movies_data = combined_movies_data[['user_id', 'movie title', 'rating']]
combined_movies_data.head()
index | user_id | movie title | rating |
---|---|---|---|
0 | 196 | Kolya (1996) | 3 |
1 | 63 | Kolya (1996) | 3 |
2 | 226 | Kolya (1996) | 5 |
3 | 154 | Kolya (1996) | 3 |
4 | 306 | Kolya (1996) | 5 |
I will also provide my own ratings for some movies from this dataset, since my ultimate goal is to get recommendations for myself ;). Below you can see my preferences. I will assign myself the user_id 1001.
# my user_id is 1001
my_ratings = pd.read_csv('my_movies_rating.csv')
my_ratings
index | user_id | movie title | rating |
---|---|---|---|
0 | 1001 | Aladdin (1992) | 1.0 |
1 | 1001 | Braveheart (1995) | 5.0 |
2 | 1001 | Clockwork Orange, A (1971) | 2.0 |
3 | 1001 | Dances with Wolves (1990) | 3.5 |
4 | 1001 | English Patient, The (1996) | 2.0 |
5 | 1001 | Face/Off (1997) | 3.5 |
6 | 1001 | Forrest Gump (1994) | 4.0 |
7 | 1001 | Game, The (1997) | 3.5 |
8 | 1001 | Godfather, The (1972) | 5.0 |
9 | 1001 | Jurassic Park (1993) | 3.5 |
10 | 1001 | Liar Liar (1997) | 2.0 |
11 | 1001 | Lion King, The (1994) | 2.5 |
12 | 1001 | Pulp Fiction (1994) | 4.0 |
13 | 1001 | Reservoir Dogs (1992) | 4.0 |
14 | 1001 | Return of the Jedi (1983) | 1.0 |
15 | 1001 | Rock, The (1996) | 4.0 |
16 | 1001 | Scream (1996) | 1.0 |
17 | 1001 | Seven (Se7en) (1995) | 5.0 |
18 | 1001 | Silence of the Lambs, The (1991) | 5.0 |
19 | 1001 | Star Trek: First Contact (1996) | 1.0 |
20 | 1001 | Star Trek: The Wrath of Khan (1982) | 1.0 |
21 | 1001 | Star Wars (1977) | 1.0 |
22 | 1001 | Terminator 2: Judgment Day (1991) | 3.5 |
23 | 1001 | Titanic (1997) | 4.0 |
24 | 1001 | Trainspotting (1996) | 3.0 |
The next step is to append my ratings to the rest of the ratings. We will also keep only the movies that have more than 25 reviews.
combined_movies_data = pd.concat([combined_movies_data, my_ratings], axis=0)

# rename the columns to userID, itemID and rating
combined_movies_data.columns = ['userID', 'itemID', 'rating']

# use the transform method to count the reviews per itemID, then keep the movies with more than 25 reviews
combined_movies_data['reviews'] = combined_movies_data.groupby(['itemID'])['rating'].transform('count')
combined_movies_data = combined_movies_data[combined_movies_data.reviews > 25][['userID', 'itemID', 'rating']]
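To see what the `transform('count')` filter does, here is a small sketch on made-up data. The toy DataFrame and the threshold of 2 (standing in for 25) are assumptions for illustration. Note that the code above uses a strict `>` comparison, so an item with exactly the threshold number of reviews is dropped.

```python
import pandas as pd

# Hypothetical toy ratings: item A has 3 reviews, B has 2, C has 1
toy = pd.DataFrame({
    "userID": [1, 2, 3, 1, 2, 1],
    "itemID": ["A", "A", "A", "B", "B", "C"],
    "rating": [4, 5, 3, 2, 4, 5],
})

# transform('count') broadcasts each item's review count back onto its rows,
# so the result has the same length as the original DataFrame.
toy["reviews"] = toy.groupby("itemID")["rating"].transform("count")

min_reviews = 2  # hypothetical threshold, standing in for 25
strictly_more = toy[toy["reviews"] > min_reviews]["itemID"].unique().tolist()
at_least = toy[toy["reviews"] >= min_reviews]["itemID"].unique().tolist()

print(strictly_more)  # ['A']
print(at_least)       # ['A', 'B']
```

If you want to keep movies with at least 25 reviews rather than more than 25, use `>=` instead of `>`.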
Now our dataset is ready and we can apply different recommender systems using the surprise package.
from surprise import NMF, SVD, SVDpp, KNNBasic, KNNWithMeans, KNNWithZScore, CoClustering
from surprise.model_selection import cross_validate
from surprise import Reader, Dataset

# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(combined_movies_data, reader)
The recommendations should not include movies that I have already rated, so let’s exclude them:
# get the list of the movie ids
unique_ids = combined_movies_data['itemID'].unique()

# get the list of the ids that the userid 1001 has rated
iids1001 = combined_movies_data.loc[combined_movies_data['userID']==1001, 'itemID']

# remove the rated movies for the recommendations
movies_to_predict = np.setdiff1d(unique_ids, iids1001)
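As a quick illustration of the set difference used above, `np.setdiff1d` returns the sorted, unique values of the first array that do not appear in the second. The titles below are made-up toy data.

```python
import numpy as np

# Hypothetical titles; "Heat (1995)" appears twice on purpose
all_titles = np.array(["Heat (1995)", "Alien (1979)", "Heat (1995)", "Fargo (1996)"])
already_rated = np.array(["Fargo (1996)"])

to_predict = np.setdiff1d(all_titles, already_rated)
print(to_predict.tolist())  # ['Alien (1979)', 'Heat (1995)']
```

Because the result is sorted, the row index in the recommendation tables appears to follow the alphabetical order of the titles.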
Recommender Systems using NMF
algo = NMF()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001, iid=iid).est))

pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)
My recommendations according to NMF:
index | iid | predictions |
---|---|---|
547 | Once Were Warriors (1994) | 4.246192 |
30 | All About Eve (1950) | 4.043059 |
543 | On Golden Pond (1981) | 3.968875 |
770 | True Romance (1993) | 3.850275 |
399 | Jaws (1975) | 3.841900 |
667 | Singin’ in the Rain (1952) | 3.841563 |
99 | Blade Runner (1982) | 3.835287 |
141 | Casablanca (1942) | 3.815703 |
661 | Shawshank Redemption, The (1994) | 3.811728 |
646 | Secret of Roan Inish, The (1994) | 3.800933 |
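The same fit-and-predict loop is reused for every algorithm in the sections that follow. One way to avoid repeating it is a small helper like the sketch below. The name `top_n` is hypothetical, and `FakeAlgo` is only a stand-in with made-up scores so that the sketch runs without the dataset; in practice you would pass a fitted surprise algorithm instead.

```python
from collections import namedtuple


def top_n(algo, uid, candidate_iids, n=10):
    """Return the n (iid, estimate) pairs with the highest estimated rating.

    `algo` is any fitted object whose predict(uid=..., iid=...) returns
    something with an `est` attribute, as the surprise algorithms do.
    """
    scored = [(iid, algo.predict(uid=uid, iid=iid).est) for iid in candidate_iids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


# Stand-in for a fitted model (hypothetical scores, for illustration only)
FakePrediction = namedtuple("FakePrediction", ["est"])


class FakeAlgo:
    scores = {"A": 3.2, "B": 4.5, "C": 4.1}

    def predict(self, uid, iid):
        return FakePrediction(est=self.scores[iid])


print(top_n(FakeAlgo(), uid=1001, candidate_iids=["A", "B", "C"], n=2))
# [('B', 4.5), ('C', 4.1)]
```

With a real model this would be called as `top_n(algo, 1001, movies_to_predict)` after `algo.fit(...)`.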
Recommender Systems using SVD
algo = SVD()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001, iid=iid).est))

pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)
My recommendations according to SVD:
index | iid | predictions |
---|---|---|
608 | Rear Window (1954) | 3.923817 |
536 | North by Northwest (1959) | 3.855131 |
661 | Shawshank Redemption, The (1994) | 3.838807 |
757 | To Kill a Mockingbird (1962) | 3.820112 |
30 | All About Eve (1950) | 3.783826 |
549 | One Flew Over the Cuckoo’s Nest (1975) | 3.780879 |
1 | 12 Angry Men (1957) | 3.762648 |
719 | Sunset Blvd. (1950) | 3.746023 |
324 | Godfather: Part II, The (1974) | 3.737058 |
330 | Graduate, The (1967) | 3.725382 |
Recommender Systems using SVD++
algo = SVDpp()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001, iid=iid).est))

pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)
My recommendations according to SVD++:
index | iid | predictions |
---|---|---|
168 | Close Shave, A (1995) | 4.111232 |
608 | Rear Window (1954) | 3.947740 |
641 | Schindler’s List (1993) | 3.931670 |
536 | North by Northwest (1959) | 3.875646 |
1 | 12 Angry Men (1957) | 3.862930 |
330 | Graduate, The (1967) | 3.850796 |
787 | Vertigo (1958) | 3.832873 |
463 | Manchurian Candidate, The (1962) | 3.832311 |
549 | One Flew Over the Cuckoo’s Nest (1975) | 3.824999 |
459 | Maltese Falcon, The (1941) | 3.812775 |
Recommender Systems using KNN with Z-Score
algo = KNNWithZScore()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001, iid=iid).est))

pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)
My recommendations according to KNN with Z-Score:
index | iid | predictions |
---|---|---|
425 | L.A. Confidential (1997) | 4.284275 |
661 | Shawshank Redemption, The (1994) | 4.214673 |
168 | Close Shave, A (1995) | 4.177236 |
820 | Wrong Trousers, The (1993) | 4.161339 |
608 | Rear Window (1954) | 4.118845 |
55 | As Good As It Gets (1997) | 4.108190 |
1 | 12 Angry Men (1957) | 4.096164 |
549 | One Flew Over the Cuckoo’s Nest (1975) | 4.088914 |
784 | Usual Suspects, The (1995) | 4.086552 |
647 | Secrets & Lies (1996) | 4.082041 |
Recommender Systems using Co-Clustering
algo = CoClustering()
algo.fit(data.build_full_trainset())

my_recs = []
for iid in movies_to_predict:
    my_recs.append((iid, algo.predict(uid=1001, iid=iid).est))

pd.DataFrame(my_recs, columns=['iid', 'predictions']).sort_values('predictions', ascending=False).head(10)
My recommendations according to Co-Clustering:
index | iid | predictions |
---|---|---|
168 | Close Shave, A (1995) | 3.889752 |
820 | Wrong Trousers, The (1993) | 3.864782 |
141 | Casablanca (1942) | 3.855470 |
796 | Wallace & Gromit: The Best of Aardman Animation (1996) | 3.846441 |
608 | Rear Window (1954) | 3.786240 |
784 | Usual Suspects, The (1995) | 3.784448 |
1 | 12 Angry Men (1957) | 3.742680 |
740 | Third Man, The (1949) | 3.732013 |
158 | Citizen Kane (1941) | 3.691609 |
757 | To Kill a Mockingbird (1962) | 3.690918 |
How to Evaluate the Recommender Systems
We saw earlier that each recommender algorithm suggested different movies. The question is which one performs best and how we can choose between different algorithms.
As in most Machine Learning problems, we can split our dataset into train and test sets and evaluate the performance on the test set. We will apply 3-fold Cross Validation and take the average RMSE (root mean squared error) over the 3 folds.
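To make the metric concrete, here is a minimal sketch of RMSE and of averaging it over k folds, using only the standard library. The ratings are made up, the "model" just predicts the mean training rating, and the folds are assigned deterministically without shuffling, which is a simplification for illustration; `rmse` and `k_fold_indices` are hypothetical helper names.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx); fold f tests on rows i with i % k == f."""
    for f in range(k):
        test = [i for i in range(n) if i % k == f]
        train = [i for i in range(n) if i % k != f]
        yield train, test


# Hypothetical ratings
actual = [3, 4, 5, 2, 4, 1]

fold_scores = []
for train, test in k_fold_indices(len(actual), k=3):
    # Toy "model": always predict the mean rating seen in the training rows
    mean_rating = sum(actual[i] for i in train) / len(train)
    fold_scores.append(rmse([actual[i] for i in test], [mean_rating] * len(test)))

avg_rmse = sum(fold_scores) / len(fold_scores)
print(len(fold_scores), round(avg_rmse, 3))
```

Each row is used for testing exactly once, and the reported score is the mean of the per-fold RMSE values. A lower RMSE means the predicted ratings are closer to the true ones.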
cv = []
# Iterate over all recommender system algorithms
for recsys in [NMF(), SVD(), SVDpp(), KNNWithZScore(), CoClustering()]:
    # Perform cross validation
    tmp = cross_validate(recsys, data, measures=['RMSE'], cv=3, verbose=False)
    cv.append((str(recsys).split(' ')[0].split('.')[-1], tmp['test_rmse'].mean()))

pd.DataFrame(cv, columns=['RecSys', 'RMSE'])
Average RMSE on the Test Dataset
index | RecSys | RMSE |
---|---|---|
0 | NMF | 0.959439 |
1 | SVD | 0.934698 |
2 | SVD++ | 0.916328 |
3 | KNNWithZScore | 0.942878 |
4 | CoClustering | 0.950441 |
As we can see, SVD++ had the best performance (lowest RMSE).
Discussion
We built several Recommender Systems whose RMSE was less than 1. For our models, we took into consideration only the UserID and the ItemID, together with the ratings. This post explains briefly the logic of item-based and user-based collaborative filtering. You can also find an example of item-based collaborative filtering. We can apply different algorithms by taking into account other attributes like the genre of the movie, the release date, the director, the actors, the budget, the duration and so on. In this case, we are referring to Content-based recommenders, which treat recommendation as a user-specific classification problem and learn a classifier for the user’s likes and dislikes based on an item’s features. In such a system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. Finally, we can even take into consideration the user’s attributes, like gender, age, location, language, etc.
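As a rough illustration of the content-based idea, the sketch below builds a user profile from genre flags and ranks unseen movies by cosine similarity to that profile. This is a simple similarity-based variant rather than a trained classifier, and the movies, genres and ratings are all made up for the example.

```python
import math

GENRES = ["Action", "Comedy", "Drama"]  # hypothetical feature set

# Hypothetical items described by binary genre flags
items = {
    "Movie A": [1, 0, 0],
    "Movie B": [0, 1, 0],
    "Movie C": [1, 0, 1],
    "Movie D": [0, 1, 1],
}
user_ratings = {"Movie A": 5.0, "Movie B": 1.0}


def build_profile(ratings, items):
    """Sum the genre vectors of rated items, weighted by rating minus the user's mean."""
    mean = sum(ratings.values()) / len(ratings)
    profile = [0.0] * len(GENRES)
    for title, r in ratings.items():
        for g, flag in enumerate(items[title]):
            profile[g] += (r - mean) * flag
    return profile


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


profile = build_profile(user_ratings, items)
unseen = [t for t in items if t not in user_ratings]
ranked = sorted(unseen, key=lambda t: cosine(profile, items[t]), reverse=True)
print(ranked)  # ['Movie C', 'Movie D']
```

The user liked an action movie and disliked a comedy, so the unseen action/drama title ranks above the unseen comedy/drama title.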