Predictive Hacks

Item-Based Collaborative Filtering in Python


In another post, we explained how to apply advanced Recommender Systems easily. In this post we give an example of Item-Based Collaborative Filtering by showing how to find similar movies. There are many different approaches and techniques; here we work with the Singular Value Decomposition (SVD). Good lectures on matrix factorization are available online, for example Gilbert Strang's (MIT) lectures on the LU and SVD decompositions.

Item-Based Collaborative Filtering on Movies

We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.

import pandas as pd
import numpy as np
import sklearn
from sklearn.decomposition import TruncatedSVD


columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)


columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]

combined_movies_data = pd.merge(df, movie_names, on='item_id')
combined_movies_data.head()
   user_id  item_id  rating  timestamp  movie title
0      196      242       3  881250949  Kolya (1996)
1       63      242       3  875747190  Kolya (1996)
2      226      242       5  883888671  Kolya (1996)
3      154      242       3  879138235  Kolya (1996)
4      306      242       5  876503793  Kolya (1996)

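To build the user-item table we pivot the data, so that each row is a user, each column is a movie, and each cell holds the rating (0 where the user has not rated the movie).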

rating_crosstab = combined_movies_data.pivot_table(values='rating', index='user_id', columns='movie title', fill_value=0)
rating_crosstab.head()
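movie title  'Til There Was You (1997)  1-900 (1994)  101 Dalmatians (1996)  12 Angry Men (1957)  \
user_id
1                                    0             0                      2                    5
2                                    0             0                      0                    0
3                                    0             0                      0                    0
4                                    0             0                      0                    0
5                                    0             0                      2                    0

movie title  187 (1997)  2 Days in the Valley (1996)  20,000 Leagues Under the Sea (1954)  \
user_id
1                     0                            0                                    3
2                     0                            0                                    0
3                     2                            0                                    0
4                     0                            0                                    0
5                     0                            0                                    0

movie title  2001: A Space Odyssey (1968)  3 Ninjas: High Noon At Mega Mountain (1998)  \
user_id
1                                       4                                            0
2                                       0                                            1
3                                       0                                            0
4                                       0                                            0
5                                       4                                            0

movie title  39 Steps, The (1935)  ...  Yankee Zulu (1994)  Year of the Horse (1997)  \
user_id
1                               0  ...                   0                         0
2                               0  ...                   0                         0
3                               0  ...                   0                         0
4                               0  ...                   0                         0
5                               0  ...                   0                         0

movie title  You So Crazy (1994)  Young Frankenstein (1974)  Young Guns (1988)  \
user_id
1                              0                          5                  3
2                              0                          0                  0
3                              0                          0                  0
4                              0                          0                  0
5                              0                          0                  4

movie title  Young Guns II (1990)  Young Poisoner's Handbook, The (1995)  Zeus and Roxanne (1997)  \
user_id
1                               0                                      0                        0
2                               0                                      0                        0
3                               0                                      0                        0
4                               0                                      0                        0
5                               0                                      0                        0

movie title  unknown  Á köldum klaka (Cold Fever) (1994)
user_id
1                  4                                   0
2                  0                                   0
3                  0                                   0
4                  0                                   0
5                  4                                   0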
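To make the pivot step easier to follow, here is a minimal sketch on a tiny made-up ratings table (the users, titles and ratings are invented for illustration and are not from MovieLens). It uses the same pivot_table call as above. Note that fill_value=0 marks "not rated" rather than a low rating, and that pivot_table averages the ratings by default if the same user and title appear more than once.

```python
import pandas as pd

# Toy ratings in the same long format as the merged MovieLens frame.
# All values here are made up for illustration.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie title": ["A", "B", "A", "C"],
    "rating": [5, 3, 4, 2],
})

# Rows become users, columns become movies; unrated pairs are filled with 0.
toy_crosstab = toy.pivot_table(values="rating", index="user_id",
                               columns="movie title", fill_value=0)
print(toy_crosstab)
```

Each user becomes one row and each title one column, which is the shape the rest of the post relies on.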

Since we want item-based collaborative filtering, we transpose the rating_crosstab matrix so that each row is a movie and each column is a user.

X = rating_crosstab.T

SVD

Using Scikit-Learn we can easily run the SVD.

SVD = TruncatedSVD(n_components=12, random_state=5)

resultant_matrix = SVD.fit_transform(X)

resultant_matrix.shape
(1664, 12)

As we can see, we created a matrix with 1664 rows (one per unique movie title) and 12 columns, which are the latent variables.
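As a rough illustration of what the reduction does, the sketch below runs TruncatedSVD on a small random item-by-user matrix (the matrix, its size and the choice of 2 components are assumptions made only for this example). For a rank-k truncation X ≈ U_k Σ_k V_k^T, the array returned by fit_transform corresponds to U_k Σ_k, so every item gets k coordinates. The explained_variance_ratio_ attribute gives a hint of how much information the chosen number of components keeps.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical item-by-user matrix: 6 items rated by 8 users (0 = not rated).
toy_X = rng.integers(0, 6, size=(6, 8)).astype(float)

toy_svd = TruncatedSVD(n_components=2, random_state=5)
toy_latent = toy_svd.fit_transform(toy_X)

# One row per item, one column per latent factor.
print(toy_latent.shape)

# Share of the variance retained by the 2 latent factors.
print(toy_svd.explained_variance_ratio_.sum())
```

On the real data, checking the same attribute on the fitted SVD object is one way to judge whether 12 components is a reasonable choice.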


Pearson Correlation

We can use different similarity measures, such as Pearson correlation or cosine similarity. Here we work with Pearson correlation. Let’s create the correlation matrix:

### correlation matrix
corr_mat = np.corrcoef(resultant_matrix)
corr_mat.shape
(1664, 1664)
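To see how the two measures mentioned above relate, here is a small sketch on three made-up latent vectors (the numbers are assumptions for illustration only). np.corrcoef treats each row as a variable, so it returns one correlation per pair of rows. Pearson correlation is the same as cosine similarity computed after subtracting each row's own mean.

```python
import numpy as np

# Three made-up latent vectors (rows = items, columns = latent factors).
latent = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.5],
    [3.0, 1.0, -2.0],
])

def cosine_matrix(m):
    """Cosine similarity between all pairs of rows of m."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

# Pearson correlation between rows.
pearson = np.corrcoef(latent)

# Plain cosine similarity between rows.
cosine = cosine_matrix(latent)

# Cosine similarity of the row-centered vectors reproduces Pearson.
centered = latent - latent.mean(axis=1, keepdims=True)
centered_cosine = cosine_matrix(centered)

print(np.round(pearson, 3))
print(np.round(cosine, 3))
print(np.allclose(pearson, centered_cosine))
```

Swapping corr_mat for a cosine similarity matrix would be an alternative design; the rankings can differ because cosine similarity does not remove each row's mean.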

Find Similar Movies

Now we are ready to find similar movies. Let’s find the movies most similar to Star Wars (1977).

Similar Movies to Star Wars (1977)

col_idx = rating_crosstab.columns.get_loc("Star Wars (1977)")
corr_specific = corr_mat[col_idx]
pd.DataFrame({'corr_specific':corr_specific, 'Movies': rating_crosstab.columns})\
.sort_values('corr_specific', ascending=False)\
.head(10)
      corr_specific  Movies
1398       1.000000  Star Wars (1977)
1234       0.988090  Return of the Jedi (1983)
1460       0.942415  Terminator 2: Judgment Day (1991)
1523       0.932997  Toy Story (1995)
1461       0.931761  Terminator, The (1984)
1205       0.925040  Raiders of the Lost Ark (1981)
 456       0.923516  Empire Strikes Back, The (1980)
 570       0.915802  Fugitive, The (1993)
 414       0.914633  Die Hard (1988)
  44       0.893270  Aliens (1986)

Similar Movies to Godfather, The (1972)

col_idx = rating_crosstab.columns.get_loc("Godfather, The (1972)")
corr_specific = corr_mat[col_idx]
pd.DataFrame({'corr_specific':corr_specific, 'Movies': rating_crosstab.columns})\
.sort_values('corr_specific', ascending=False)\
.head(10)
      corr_specific  Movies
 612       1.000000  Godfather, The (1972)
 498       0.922046  Fargo (1996)
 613       0.921390  Godfather: Part II, The (1974)
 623       0.901607  GoodFellas (1990)
1398       0.867224  Star Wars (1977)
 237       0.867152  Bronx Tale, A (1993)
 209       0.864241  Boot, Das (1981)
 389       0.856926  Dead Man Walking (1995)
 622       0.845238  Good, The Bad and The Ugly, The (1966)
1190       0.844128  Pulp Fiction (1994)

Similar Movies to Pulp Fiction (1994)

col_idx = rating_crosstab.columns.get_loc("Pulp Fiction (1994)")
corr_specific = corr_mat[col_idx]
pd.DataFrame({'corr_specific':corr_specific, 'Movies': rating_crosstab.columns})\
.sort_values('corr_specific', ascending=False)\
.head(10)
      corr_specific  Movies
1190       1.000000  Pulp Fiction (1994)
1572       0.974776  Usual Suspects, The (1995)
 571       0.971498  Full Metal Jacket (1987)
1329       0.969880  Silence of the Lambs, The (1991)
 623       0.968037  GoodFellas (1990)
1534       0.960548  True Romance (1993)
1183       0.959407  Professional, The (1994)
1231       0.953151  Reservoir Dogs (1992)
1301       0.950477  Seven (Se7en) (1995)
1440       0.943600  Swimming with Sharks (1995)
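The same three lines of lookup code are repeated for every movie above, so they could be wrapped in a small helper. The function name top_similar and its parameters are hypothetical, introduced only for this sketch. The tiny correlation matrix is made up so that the block runs on its own.

```python
import numpy as np
import pandas as pd

def top_similar(title, corr_mat, titles, n=10):
    """Return the n titles with the highest correlation to `title`.

    Hypothetical helper: `titles` must be in the same order as the
    rows of `corr_mat`, and each title must appear only once.
    """
    titles = pd.Index(titles)
    idx = titles.get_loc(title)  # row position of the query movie
    return (pd.DataFrame({"corr_specific": corr_mat[idx], "Movies": titles})
              .sort_values("corr_specific", ascending=False)
              .head(n))

# Tiny made-up example (not MovieLens data).
demo_titles = ["A", "B", "C"]
demo_corr = np.array([[1.0, 0.9, 0.1],
                      [0.9, 1.0, 0.3],
                      [0.1, 0.3, 1.0]])
result = top_similar("A", demo_corr, demo_titles, n=2)
print(result)
```

With the objects built in this post, the call would look like top_similar("Star Wars (1977)", corr_mat, rating_crosstab.columns).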

Sum Up

As expected, every movie has a Pearson correlation of 1 with itself. With item-based collaborative filtering we can recommend movies based on a user's preferences. For example, if someone likes “Pulp Fiction (1994)”, we can recommend “Usual Suspects, The (1995)”. It also works the other way around: if someone does not like “Star Wars (1977)”, we can suggest that they skip “Return of the Jedi (1983)”.

