Predictive Hacks

# Item-Based Collaborative Filtering in Python

In another post, we explained how we can easily apply advanced Recommender Systems. In this post, we provide an example of item-based collaborative filtering by showing how to find similar movies. There are many different approaches and techniques; here we work with the Singular Value Decomposition (SVD). Good lectures on matrix factorization by Gilbert Strang (MIT) are available online, for example on the LU and SVD decompositions.

## Item-Based Collaborative Filtering on Movies

We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.

import pandas as pd
import numpy as np
import sklearn
from sklearn.decomposition import TruncatedSVD

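# Ratings file of MovieLens 100k: tab-separated, no header row
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)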

columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]

combined_movies_data = pd.merge(df, movie_names, on='item_id')
combined_movies_data.head()

We want to create the user-item table, so we need to pivot our data.

rating_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
rating_crosstab.head()
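To make the pivot step concrete, here is a minimal sketch on a hypothetical toy ratings table (the values are made up and are not taken from MovieLens). It shows that each user becomes a row, each title a column, and that missing ratings are filled with 0.

```python
import pandas as pd

# Hypothetical toy ratings, for illustration only
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'movie title': ['A', 'B', 'A', 'B'],
    'rating': [5, 3, 4, 2],
})

# Same call pattern as above: users as rows, titles as columns
toy_crosstab = toy.pivot_table(values='rating', index='user_id',
                               columns='movie title', fill_value=0)
print(toy_crosstab)
```

Note that `fill_value=0` treats "not rated" the same as a rating of 0, which is a simplification worth keeping in mind when reading the similarities later on.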

Since we want item-based collaborative filtering, we transpose the rating_crosstab matrix so that each row corresponds to a movie.

X = rating_crosstab.T

## SVD

Using Scikit-Learn we can easily run the SVD.

SVD = TruncatedSVD(n_components=12, random_state=5)

resultant_matrix = SVD.fit_transform(X)

resultant_matrix.shape
(1664, 12)

As we can see, we created a matrix of 1664 rows (one per unique movie title) and 12 columns, which are the latent variables.
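As a rough illustration of what the decomposition returns, the sketch below runs `TruncatedSVD` on a small random matrix (hypothetical data, 6 items by 5 users) and checks the output shape and how much variance the kept components explain. The choice of 2 components here is arbitrary, just as 12 is a tunable choice above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical item-by-user matrix: 6 items, 5 users, ratings 0..5
rng = np.random.default_rng(0)
toy_X = rng.integers(0, 6, size=(6, 5)).astype(float)

toy_svd = TruncatedSVD(n_components=2, random_state=5)
toy_latent = toy_svd.fit_transform(toy_X)

print(toy_latent.shape)                            # one row per item, one column per component
print(toy_svd.explained_variance_ratio_.sum())     # share of variance kept by the components
```

Inspecting `explained_variance_ratio_` on the real data is one possible way to judge whether 12 components are enough.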

## Pearson Correlation

We can use different similarity measures, such as Pearson correlation or cosine similarity. Today, we will work with Pearson correlation, computed between the rows (movies) of the reduced matrix. Let’s create the correlation matrix:

### correlation matrix
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape
(1664, 1664)
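The two measures mentioned above are closely related: the Pearson correlation between two rows equals the cosine similarity of those rows after subtracting each row's mean. A small sketch on hypothetical latent vectors (random values, not the MovieLens result) checks this numerically.

```python
import numpy as np

# Hypothetical latent matrix: 4 items x 3 latent features
rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 3))

# np.corrcoef treats each row as one variable, so this is item-by-item
pearson = np.corrcoef(latent)

# Cosine similarity of the row-centered vectors
centered = latent - latent.mean(axis=1, keepdims=True)
unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
cosine_centered = unit @ unit.T

print(np.allclose(pearson, cosine_centered))
```

Plain cosine similarity (without centering) would generally give different numbers, so the choice of measure can change the ranking of similar movies.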

## Find Similar Movies

Now we are ready to find similar movies. Let’s find movies similar to Star Wars (1977).

### Similar Movies to Star Wars (1977)

col_idx = rating_crosstab.columns.get_loc("Star Wars (1977)")
corr_specific = corr_mat[col_idx]
pd.DataFrame({'corr_specific':corr_specific, 'Movies': rating_crosstab.columns})\
.sort_values('corr_specific', ascending=False)\
.head(10)

### Similar Movies to Godfather, The (1972)

col_idx = rating_crosstab.columns.get_loc("Godfather, The (1972)")
corr_specific = corr_mat[col_idx]
pd.DataFrame({'corr_specific':corr_specific, 'Movies': rating_crosstab.columns})\
.sort_values('corr_specific', ascending=False)\
.head(10)

### Similar Movies to Pulp Fiction (1994)

col_idx = rating_crosstab.columns.get_loc("Pulp Fiction (1994)")
corr_specific = corr_mat[col_idx]
pd.DataFrame({'corr_specific':corr_specific, 'Movies': rating_crosstab.columns})\
.sort_values('corr_specific', ascending=False)\
.head(10)
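The three lookups above repeat the same steps, so they could be wrapped in a small helper. The function name `similar_movies` and its parameters are hypothetical, not part of the original post; the sketch is demonstrated on toy data and also drops the movie itself from the result.

```python
import numpy as np
import pandas as pd

def similar_movies(title, corr_mat, titles, n=10):
    """Return the n titles most correlated with `title`, excluding itself."""
    titles = pd.Index(titles)
    idx = titles.get_loc(title)   # row position of the movie (titles assumed unique)
    scores = pd.Series(corr_mat[idx], index=titles, name='corr_specific')
    return scores.drop(title).sort_values(ascending=False).head(n)

# Toy example: 'B' moves with 'A', 'C' moves against it
toy_titles = ['A', 'B', 'C']
toy_latent = np.array([[1.0, 2.0, 3.0],
                       [2.0, 4.0, 6.5],
                       [3.0, 2.0, 1.0]])
toy_corr = np.corrcoef(toy_latent)

top = similar_movies('A', toy_corr, toy_titles, n=2)
print(top)
```

With the objects built in this post, the call would look like `similar_movies("Star Wars (1977)", corr_mat, rating_crosstab.columns)`.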

## Sum Up

We saw that every movie has a Pearson correlation of 1 with itself, as expected. With item-based collaborative filtering, we can recommend movies based on what a user already likes. For example, if someone likes “Pulp Fiction (1994)”, we can recommend “Usual Suspects, The (1995)”. It also works the other way around: if someone does not like “Star Wars (1977)”, we can suggest that they skip “Return of the Jedi (1983)”.
