What is Market Basket Analysis
Intuitively, Market Basket Analysis works as follows: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items that are frequently purchased together. The outcome of the algorithm is a recommendation of the form: if you buy one or more specific items, then you are more (or less) likely to buy this extra item. For example, if someone has already added cereal to their basket, then they are more likely to add milk. In practice, Market Basket Analysis does not look for similar products but for complementary products: the algorithm identifies products that tend to be purchased together.
The Market Basket Analysis can be applied in the following cases:
- Build a movie/song recommendation engine
- Build a live recommendation algorithm on an e-commerce store
- Cross-sell or Upsell products in a supermarket
Association Rules
Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. For example, we can extract information on purchasing behavior like "If someone buys beer and sausage, then they are likely to also buy mustard."
Let’s define the main Association Rule metrics:
Support
It measures how often an item (or set of items) is purchased and is given by the formulas:
\(Support(X) = \frac{Frequency(X)}{N\;(\#\,of\;transactions)}\)

\(Support(X \rightarrow Y) = \frac{Frequency(X \cap Y)}{N\;(\#\,of\;transactions)}\)
Confidence
It measures how often items in Y appear in transactions that contain X and is given by the formula:
\(Confidence(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X) }\)
Lift
It tells us how likely item Y is to be bought together with item X, i.e., how much better the rule is at predicting the consequent than simply assuming the consequent's baseline frequency. When lift > 1, the items appear together more often than expected by chance, so the rule is better at predicting the result than guessing; when lift < 1, the rule does worse than informed guessing. It is given by the formula:
\(Lift(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X)\times Support(Y) }\)
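To make the three metrics concrete, here is a small worked example on a hypothetical five-transaction dataset (the item names and numbers are made up for illustration), computing support, confidence, and lift for the rule {cereal} → {milk}:

```python
# Hypothetical 5-transaction dataset
transactions = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"cereal", "tea"},
    {"milk", "bread"},
    {"tea", "bread"},
]
n = len(transactions)

def support(itemset):
    # fraction of transactions that contain the whole itemset
    return sum(itemset <= t for t in transactions) / n

sup_x  = support({"cereal"})          # 3/5 = 0.6
sup_y  = support({"milk"})            # 3/5 = 0.6
sup_xy = support({"cereal", "milk"})  # 2/5 = 0.4

confidence = sup_xy / sup_x           # 0.4 / 0.6 ≈ 0.667
lift = sup_xy / (sup_x * sup_y)       # 0.4 / 0.36 ≈ 1.111
```

Since the lift is greater than 1, cereal and milk appear together slightly more often than we would expect if they were purchased independently.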
Market Basket Analysis in Movies
In previous posts, we showed how to apply Item-Based Collaborative Filtering and how to run Recommender Systems on movies. Today, we will apply Market Basket Analysis to the same Movies dataset. Here, each "basket" is the set of movies watched by a user_id, without taking the rating into consideration.
Let’s start.
We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.
```python
import pandas as pd
import numpy as np

columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)

columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL',
           'unknown', 'Action', 'Adventure', 'Animation', 'Childrens', 'Comedy', 'Crime',
           'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery',
           'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')

movie_names = movies[['item_id', 'movie title']]
combined_movies_data = pd.merge(df, movie_names, on='item_id')
combined_movies_data = combined_movies_data[['user_id', 'movie title']]
combined_movies_data.head()
```
|   | user_id | movie title |
|---|---------|-------------|
| 0 | 196 | Kolya (1996) |
| 1 | 63 | Kolya (1996) |
| 2 | 226 | Kolya (1996) |
| 3 | 154 | Kolya (1996) |
| 4 | 306 | Kolya (1996) |
Then we want to create a "one-hot" data frame, with True where a movie has been watched by a user_id and False otherwise.
```python
onehot = combined_movies_data.pivot_table(index='user_id', columns='movie title',
                                          aggfunc=len, fill_value=0)
onehot = onehot > 0
```
Generate the Association Rules of 2 Items
We will generate the Association Rules of 2 Items using the Apriori algorithm.
```python
from mlxtend.frequent_patterns import association_rules, apriori

# compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, min_support=0.001, max_len=2, use_colnames=True)

# compute all association rules for the frequent itemsets
rules = association_rules(frequent_itemsets)
rules.head()
```
|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (‘Til There Was You (1997)) | (My Best Friend’s Wedding (1997)) | 0.009544 | 0.182397 | 0.008484 | 0.888889 | 4.873385 | 0.006743 | 7.358431 |
| 1 | (All Dogs Go to Heaven 2 (1996)) | (101 Dalmatians (1996)) | 0.015907 | 0.115589 | 0.012725 | 0.800000 | 6.921101 | 0.010887 | 4.422057 |
| 2 | (101 Dalmatians (1996)) | (Independence Day (ID4) (1996)) | 0.115589 | 0.454931 | 0.098621 | 0.853211 | 1.875473 | 0.046037 | 3.713282 |
| 3 | (101 Dalmatians (1996)) | (Return of the Jedi (1983)) | 0.115589 | 0.537646 | 0.093319 | 0.807339 | 1.501620 | 0.031173 | 2.399838 |
| 4 | (101 Dalmatians (1996)) | (Star Wars (1977)) | 0.115589 | 0.618240 | 0.100742 | 0.871560 | 1.409744 | 0.029281 | 2.972277 |
The table above shows the association rules with their left-hand side (antecedents) and right-hand side (consequents). Let’s try to find recommendations for the movie 101 Dalmatians:
```python
rules[rules.antecedents.apply(str).str.contains("101 Dalmatians")].sort_values('lift', ascending=False)
```
|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 75 | (101 Dalmatians (1996)) | (Independence Day (ID4) (1996)) | 0.115589 | 0.454931 | 0.098621 | 0.853211 | 1.875473 | 0.046037 | 3.713282 |
| 96 | (101 Dalmatians (1996)) | (Toy Story (1995)) | 0.115589 | 0.479321 | 0.099682 | 0.862385 | 1.799180 | 0.044278 | 3.783598 |
| 88 | (101 Dalmatians (1996)) | (Return of the Jedi (1983)) | 0.115589 | 0.537646 | 0.093319 | 0.807339 | 1.501620 | 0.031173 | 2.399838 |
| 91 | (101 Dalmatians (1996)) | (Star Wars (1977)) | 0.115589 | 0.618240 | 0.100742 | 0.871560 | 1.409744 | 0.029281 | 2.972277 |
According to the lift metric, 101 Dalmatians is most strongly associated with Independence Day (ID4).
Let’s get the top 5 associated movies according to the “lift” measure.
```python
rules.sort_values('lift', ascending=False).head(5)
```
|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 1255 | (Amityville: A New Generation (1993)) | (Amityville 1992: It’s About Time (1992)) | 0.005302 | 0.005302 | 0.005302 | 1.000000 | 188.600000 | 0.005274 | inf |
| 1254 | (Amityville 1992: It’s About Time (1992)) | (Amityville: A New Generation (1993)) | 0.005302 | 0.005302 | 0.005302 | 1.000000 | 188.600000 | 0.005274 | inf |
| 1251 | (Amityville 3-D (1983)) | (Amityville 1992: It’s About Time (1992)) | 0.006363 | 0.005302 | 0.005302 | 0.833333 | 157.166667 | 0.005268 | 5.968187 |
| 1266 | (Amityville 3-D (1983)) | (Amityville: A New Generation (1993)) | 0.006363 | 0.005302 | 0.005302 | 0.833333 | 157.166667 | 0.005268 | 5.968187 |
| 1267 | (Amityville: A New Generation (1993)) | (Amityville 3-D (1983)) | 0.005302 | 0.006363 | 0.005302 | 1.000000 | 157.166667 | 0.005268 | inf |
Market Basket Analysis in a Grocery Store
The Movies dataset is relatively large for tutorial purposes. For that reason, we will provide another example with a smaller dataset of hypothetical transactions (baskets) from a grocery store.
```python
groceries = pd.read_csv("groceries.txt", sep=";")
groceries
```
|   | ID | Transaction |
|---|----|-------------|
| 0 | 0 | milk,bread,biscuit |
| 1 | 1 | bread,milk,biscuit,cereal |
| 2 | 2 | bread,tea |
| 3 | 3 | jam,bread,milk |
| 4 | 4 | tea,biscuit |
| 5 | 5 | bread,tea |
| 6 | 6 | tea,cereal |
| 7 | 7 | bread,tea,biscuit |
| 8 | 8 | jam,bread,tea |
| 9 | 9 | bread,milk |
| 10 | 10 | coffee,orange,biscuit,cereal |
| 11 | 11 | coffee,orange,biscuit,cereal |
| 12 | 12 | coffee,sugar |
| 13 | 13 | bread,coffee,orange |
| 14 | 14 | bread,sugar,biscuit |
| 15 | 15 | coffee,sugar,cereal |
| 16 | 16 | bread,sugar,biscuit |
| 17 | 17 | bread,coffee,sugar |
| 18 | 18 | bread,coffee,sugar |
| 19 | 19 | tea,milk,coffee,cereal |
As you can see, the items of each transaction are in the same row, separated by commas. There are two ways to create the onehot data frame: one is to use the CountVectorizer, as explained in another post, and the other is to use the TransactionEncoder, as we will show now. For this example, we will work with association rules of up to 3 items.
```python
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder

# get all the transactions as a list of sorted item lists
transactions = list(groceries['Transaction'].apply(lambda x: sorted(x.split(','))))

# instantiate and fit the transaction encoder
encoder = TransactionEncoder().fit(transactions)
onehot = encoder.transform(transactions)

# convert the one-hot encoded data to a DataFrame
onehot = pd.DataFrame(onehot, columns=encoder.columns_)

# compute frequent itemsets using the Apriori algorithm - up to three items
frequent_itemsets = apriori(onehot, min_support=0.001, max_len=3, use_colnames=True)

# compute all association rules for the frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
```
Let’s say that we want to get the top association rules where the left-hand side has two items: which item is most likely to be added to the basket next?
```python
rules['lhs items'] = rules['antecedents'].apply(lambda x: len(x))
rules[rules['lhs items'] > 1].sort_values('lift', ascending=False).head()
```
|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | lhs items |
|---|---|---|---|---|---|---|---|---|---|---|
| 60 | (coffee, biscuit) | (orange) | 0.10 | 0.15 | 0.10 | 1.000000 | 6.666667 | 0.0850 | inf | 2 |
| 53 | (cereal, biscuit) | (orange) | 0.15 | 0.15 | 0.10 | 0.666667 | 4.444444 | 0.0775 | 2.55 | 2 |
| 112 | (tea, coffee) | (milk) | 0.05 | 0.25 | 0.05 | 1.000000 | 4.000000 | 0.0375 | inf | 2 |
| 64 | (cereal, bread) | (milk) | 0.05 | 0.25 | 0.05 | 1.000000 | 4.000000 | 0.0375 | inf | 2 |
| 104 | (milk, tea) | (cereal) | 0.05 | 0.30 | 0.05 | 1.000000 | 3.333333 | 0.0350 | inf | 2 |
As we can see, if someone has already added (coffee, biscuit) or (cereal, biscuit) to their basket, then the item most likely to be added next is orange.
Visualize Market Basket Analysis
Now, we will show how to visualize the Market Basket Analysis association rules using a heatmap. We will show all the rules where the left-hand side consists of 2 items and we are looking for an extra one.
```python
# import seaborn and matplotlib under their standard aliases
import seaborn as sns
import matplotlib.pyplot as plt

# replace frozen sets with strings
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(list(a)))

# transform the DataFrame of rules into a matrix using the lift metric
pivot = rules[rules['lhs items'] > 1].pivot(index='antecedents_',
                                            columns='consequents_',
                                            values='lift')

# generate a heatmap with annotations on
sns.heatmap(pivot, annot=True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
```