In another post, we explained how to apply advanced recommender systems. In this post we walk through an example of item-based collaborative filtering by finding similar movies. There are many different approaches and techniques; here we work with Singular Value Decomposition (SVD). Good lectures on matrix factorization are available online, for example Gilbert Strang's (MIT) lectures on the LU and SVD decompositions.
Item-Based Collaborative Filtering on Movies
We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.
    import pandas as pd
    import numpy as np
    from sklearn.decomposition import TruncatedSVD

    # Ratings: one row per (user, item) pair
    columns = ['user_id', 'item_id', 'rating', 'timestamp']
    df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)

    # Movie metadata: title, dates, URL and one flag column per genre
    columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL',
               'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime',
               'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
               'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
    movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')

    # Attach the movie titles to the ratings
    movie_names = movies[['item_id', 'movie title']]
    combined_movies_data = pd.merge(df, movie_names, on='item_id')
    combined_movies_data.head()
| | user_id | item_id | rating | timestamp | movie title |
---|---|---|---|---|---|
0 | 196 | 242 | 3 | 881250949 | Kolya (1996) |
1 | 63 | 242 | 3 | 875747190 | Kolya (1996) |
2 | 226 | 242 | 5 | 883888671 | Kolya (1996) |
3 | 154 | 242 | 3 | 879138235 | Kolya (1996) |
4 | 306 | 242 | 5 | 876503793 | Kolya (1996) |
We want a user-item table, so we pivot the data: one row per user, one column per movie, with each cell holding that user's rating. Missing ratings are filled with 0.
    rating_crosstab = combined_movies_data.pivot_table(
        values='rating', index='user_id', columns='movie title', fill_value=0)
    rating_crosstab.head()
user_id | ‘Til There Was You (1997) | 1-900 (1994) | 101 Dalmatians (1996) | 12 Angry Men (1957) | 187 (1997) | 2 Days in the Valley (1996) | 20,000 Leagues Under the Sea (1954) | 2001: A Space Odyssey (1968) | 3 Ninjas: High Noon At Mega Mountain (1998) | 39 Steps, The (1935) | … | Yankee Zulu (1994) | Year of the Horse (1997) | You So Crazy (1994) | Young Frankenstein (1974) | Young Guns (1988) | Young Guns II (1990) | Young Poisoner’s Handbook, The (1995) | Zeus and Roxanne (1997) | unknown | Á köldum klaka (Cold Fever) (1994) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 2 | 5 | 0 | 0 | 3 | 4 | 0 | 0 | … | 0 | 0 | 0 | 5 | 3 | 0 | 0 | 0 | 4 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | … | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 4 | 0 |
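To make the pivot step concrete, here is a minimal sketch on a tiny made-up ratings table. The data and the names `toy` and `toy_crosstab` are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Hypothetical long-format ratings (not taken from MovieLens)
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie title": ["A", "B", "A", "B"],
    "rating": [5, 3, 4, 2],
})

# One row per user, one column per movie; unrated pairs become 0
toy_crosstab = toy.pivot_table(values="rating", index="user_id",
                               columns="movie title", fill_value=0)
print(toy_crosstab)
```

Note that `fill_value=0` makes "not rated" indistinguishable from a rating of zero. That keeps the example simple, but it is a modelling choice rather than a neutral default.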
Since we want item-based collaborative filtering, we transpose the rating_crosstab matrix so that each row is a movie.
X = rating_crosstab.T
SVD
Using Scikit-Learn we can easily run the SVD.
    SVD = TruncatedSVD(n_components=12, random_state=5)
    resultant_matrix = SVD.fit_transform(X)
    resultant_matrix.shape
(1664, 12)
As we can see, the result is a matrix with 1664 rows (one per unique movie) and 12 columns, which are the latent variables.
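The choice of 12 components is a judgment call. One rough way to gauge it is to look at how much variance the retained components capture. The sketch below does this on a random stand-in matrix, so `toy_X` and its dimensions are assumptions, not the MovieLens data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
# Hypothetical stand-in for the item-user matrix: 50 items x 30 users, ratings 0-5
toy_X = rng.integers(0, 6, size=(50, 30)).astype(float)

svd = TruncatedSVD(n_components=12, random_state=5)
toy_latent = svd.fit_transform(toy_X)

# Fraction of the total variance captured by the 12 latent variables
explained = svd.explained_variance_ratio_.sum()
print(toy_latent.shape, round(explained, 3))
```

On the real data, the same `explained_variance_ratio_` attribute of the fitted model can be inspected to compare different values of `n_components`.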
Pearson Correlation
We can use different similarity measures, such as the Pearson correlation or cosine similarity. Today we will work with the Pearson correlation. Let’s create the correlation matrix:
    # Correlation matrix: np.corrcoef treats each row (movie) as a variable
    corr_mat = np.corrcoef(resultant_matrix)
    corr_mat.shape
(1664, 1664)
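Cosine similarity was mentioned as an alternative. The following sketch compares the two measures on three made-up latent vectors (the array `toy_latent` is hypothetical). It uses scikit-learn's `cosine_similarity`, which, like `np.corrcoef`, compares rows pairwise.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical latent vectors, one row per item
toy_latent = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],  # same direction as row 0
    [3.0, 2.0, 1.0],  # row 0 reversed
])

cos_mat = cosine_similarity(toy_latent)  # angle between the raw vectors
corr_toy = np.corrcoef(toy_latent)       # each row is mean-centered first

print(np.round(cos_mat, 3))
print(np.round(corr_toy, 3))
```

Rows 0 and 2 have a fairly high cosine similarity but a Pearson correlation of -1, because Pearson subtracts each row's mean before comparing. The two measures can therefore rank neighbours differently.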
Find Similar Movies
Now we are ready to find similar movies. Let’s start with the movies most similar to Star Wars (1977).
Similar Movies to Star Wars (1977)
    col_idx = rating_crosstab.columns.get_loc("Star Wars (1977)")
    corr_specific = corr_mat[col_idx]
    pd.DataFrame({'corr_specific': corr_specific, 'Movies': rating_crosstab.columns})\
        .sort_values('corr_specific', ascending=False)\
        .head(10)
| | corr_specific | Movies |
---|---|---|
1398 | 1.000000 | Star Wars (1977) |
1234 | 0.988090 | Return of the Jedi (1983) |
1460 | 0.942415 | Terminator 2: Judgment Day (1991) |
1523 | 0.932997 | Toy Story (1995) |
1461 | 0.931761 | Terminator, The (1984) |
1205 | 0.925040 | Raiders of the Lost Ark (1981) |
456 | 0.923516 | Empire Strikes Back, The (1980) |
570 | 0.915802 | Fugitive, The (1993) |
414 | 0.914633 | Die Hard (1988) |
44 | 0.893270 | Aliens (1986) |
Similar Movies to Godfather, The (1972)
    col_idx = rating_crosstab.columns.get_loc("Godfather, The (1972)")
    corr_specific = corr_mat[col_idx]
    pd.DataFrame({'corr_specific': corr_specific, 'Movies': rating_crosstab.columns})\
        .sort_values('corr_specific', ascending=False)\
        .head(10)
| | corr_specific | Movies |
---|---|---|
612 | 1.000000 | Godfather, The (1972) |
498 | 0.922046 | Fargo (1996) |
613 | 0.921390 | Godfather: Part II, The (1974) |
623 | 0.901607 | GoodFellas (1990) |
1398 | 0.867224 | Star Wars (1977) |
237 | 0.867152 | Bronx Tale, A (1993) |
209 | 0.864241 | Boot, Das (1981) |
389 | 0.856926 | Dead Man Walking (1995) |
622 | 0.845238 | Good, The Bad and The Ugly, The (1966) |
1190 | 0.844128 | Pulp Fiction (1994) |
Similar Movies to Pulp Fiction (1994)
    col_idx = rating_crosstab.columns.get_loc("Pulp Fiction (1994)")
    corr_specific = corr_mat[col_idx]
    pd.DataFrame({'corr_specific': corr_specific, 'Movies': rating_crosstab.columns})\
        .sort_values('corr_specific', ascending=False)\
        .head(10)
| | corr_specific | Movies |
---|---|---|
1190 | 1.000000 | Pulp Fiction (1994) |
1572 | 0.974776 | Usual Suspects, The (1995) |
571 | 0.971498 | Full Metal Jacket (1987) |
1329 | 0.969880 | Silence of the Lambs, The (1991) |
623 | 0.968037 | GoodFellas (1990) |
1534 | 0.960548 | True Romance (1993) |
1183 | 0.959407 | Professional, The (1994) |
1231 | 0.953151 | Reservoir Dogs (1992) |
1301 | 0.950477 | Seven (Se7en) (1995) |
1440 | 0.943600 | Swimming with Sharks (1995) |
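The three lookups above repeat the same steps, so they could be wrapped in a small helper. This is only a sketch: the function name `top_similar` and the toy inputs are hypothetical, and it assumes the titles are unique so that `get_loc` returns a single position.

```python
import numpy as np
import pandas as pd

def top_similar(title, corr_mat, titles, n=10):
    """Return the n titles most correlated with `title`, best first."""
    titles = pd.Index(titles)
    idx = titles.get_loc(title)  # row of the correlation matrix for this title
    return (pd.DataFrame({"corr_specific": corr_mat[idx], "Movies": titles})
              .sort_values("corr_specific", ascending=False)
              .head(n))

# Toy check with a hand-written 3x3 correlation matrix
toy_titles = ["A", "B", "C"]
toy_corr = np.array([[1.0, 0.9, 0.2],
                     [0.9, 1.0, 0.1],
                     [0.2, 0.1, 1.0]])
result = top_similar("A", toy_corr, toy_titles, n=2)
print(result)
```

With the variables from this post, the call would look like `top_similar("Star Wars (1977)", corr_mat, rating_crosstab.columns)`. The first row is always the queried movie itself, so it can be dropped when building actual recommendations.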
Sum Up
We saw that every movie has a Pearson correlation of 1 with itself, as expected. With item-based collaborative filtering we can recommend movies based on a user's preferences. For example, if someone likes “Pulp Fiction (1994)”, we can recommend “Usual Suspects, The (1995)”. It also works the other way around: if someone does not like “Star Wars (1977)”, we can suggest that they skip “Return of the Jedi (1983)”.