Predictive Hacks

A Tutorial about Market Basket Analysis in Python


What is Market Basket Analysis


Intuitively, Market Basket Analysis works as follows: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items that are frequently purchased together. The outcome is a recommendation of the form “if you buy one or more specific items, then you are more (or less) likely to buy this extra item”. For example, if someone has already added cereal to their basket, then they are more likely to add milk. In practice, Market Basket Analysis does not look for similar products but for complementary ones, so the algorithm needs to identify products that tend to be purchased together.

The Market Basket Analysis can be applied in the following cases:

  • Build a movie/song recommendation engine
  • Build a live recommendation algorithm on an e-commerce store
  • Cross-sell or Upsell products in a supermarket

Association Rules

Given a set of transactions, the task is to find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. For example, we can extract information on purchasing behavior such as “if someone buys beer and sausage, then they are likely to buy mustard”.

Let’s define the main metrics used to evaluate Association Rules:

Support

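It measures how often an item (or itemset) is purchased, as a fraction of all transactions. For a rule X → Y, it is the fraction of transactions that contain both X and Y. It is given by the formulas: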

\(Support(X) = \frac{Frequency(X)}{N (\#of \;Transactions)}\)

\(Support(X \rightarrow Y) = \frac{Frequency(X \bigcap Y)}{N (\#of \;Transactions)}\)

Confidence

It measures how often the items in Y appear in transactions that contain X, and is given by the formula:

\(Confidence(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X) }\)
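To make the two formulas concrete, here is a minimal pure-Python sketch. The five baskets and the helper names `support` and `confidence` are hypothetical and only serve as an illustration; they are not part of any library.

```python
# Hypothetical toy baskets: each transaction is a set of items
baskets = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"cereal", "bread"},
    {"milk", "bread"},
    {"bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support(X -> Y) / Support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


support_cereal = support({"cereal"}, baskets)                # cereal is in 3 of 5 baskets
support_rule = support({"cereal", "milk"}, baskets)          # both are in 2 of 5 baskets
confidence_rule = confidence({"cereal"}, {"milk"}, baskets)  # 2 of the 3 cereal baskets

print(support_cereal, support_rule, round(confidence_rule, 3))
```

In this toy data, milk appears in two of the three baskets that contain cereal, so the confidence of cereal → milk is 2/3.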

Lift

It tells us how much more likely item Y is to be bought together with item X than we would expect if the two were purchased independently. In other words, it measures how much better the rule is at predicting Y than simply guessing Y from its overall frequency. A lift greater than 1 indicates that the items tend to be purchased together, a lift of 1 means the rule does no better than that guess, and a lift less than 1 means the rule does worse. It is given by the formula:

\(Lift(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X)\times Support(Y) }\)
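As a short derivation, substituting the definition of Confidence into the formula shows that lift compares the confidence of the rule with the baseline frequency of Y, and that it is symmetric in X and Y (both directions count the same transactions containing X and Y):

\(Lift(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X)\times Support(Y) } = \frac{ Confidence(X \rightarrow Y )}{ Support(Y) } = Lift(Y \rightarrow X )\)

If X and Y were purchased independently, we would have \(Support(X \rightarrow Y ) = Support(X)\times Support(Y)\) and the lift would equal 1, which is why 1 is the reference value.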

Market Basket Analysis in Movies

In previous posts, we showed how to apply Item-Based Collaborative Filtering and how to run Recommender Systems on movies. Today, we will apply Market Basket Analysis to the same movies dataset. Here, each “basket” is the set of movies watched by a user, without taking the ratings into consideration.

Let’s start.

We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.

import pandas as pd
import numpy as np
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
          'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
          'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]
combined_movies_data = pd.merge(df, movie_names, on='item_id')
combined_movies_data = combined_movies_data[['user_id','movie title']]
combined_movies_data.head()
   user_id   movie title
0      196  Kolya (1996)
1       63  Kolya (1996)
2      226  Kolya (1996)
3      154  Kolya (1996)
4      306  Kolya (1996)

Then we want to create a one-hot DataFrame, with True where a movie has been watched by a user_id and False otherwise.

onehot = combined_movies_data.pivot_table(index='user_id', columns='movie title', aggfunc=len, fill_value=0)
onehot = onehot>0
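As an alternative sketch of the same step, `pd.crosstab` counts (user, movie) pairs directly, and comparing the counts with zero gives the True/False flags. The toy DataFrame below is hypothetical and only mimics the two columns of combined_movies_data.

```python
import pandas as pd

# Hypothetical toy data: one row per (user, movie) pair
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "movie title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
})

# Count (user, movie) pairs, then turn the counts into True/False flags
toy_onehot = pd.crosstab(toy["user_id"], toy["movie title"]) > 0
print(toy_onehot)
```

Each row of `toy_onehot` is one user (one basket) and each column is one movie, which is the layout that the apriori function expects.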

Generate the Association Rules of 2 Items

We will generate the Association Rules of 2 Items using the Apriori algorithm.

from mlxtend.frequent_patterns import association_rules, apriori

# compute frequent items using the Apriori algorithm
frequent_itemsets = apriori(onehot, min_support = 0.001, max_len = 2, use_colnames=True)

# compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets)

rules.head()
      antecedents                       consequents                        antecedent support  consequent support   support  confidence        lift  leverage  conviction
   0  ('Til There Was You (1997))       (My Best Friend's Wedding (1997))            0.009544            0.182397  0.008484    0.888889    4.873385  0.006743    7.358431
   1  (All Dogs Go to Heaven 2 (1996))  (101 Dalmatians (1996))                      0.015907            0.115589  0.012725    0.800000    6.921101  0.010887    4.422057
   2  (101 Dalmatians (1996))           (Independence Day (ID4) (1996))              0.115589            0.454931  0.098621    0.853211    1.875473  0.046037    3.713282
   3  (101 Dalmatians (1996))           (Return of the Jedi (1983))                  0.115589            0.537646  0.093319    0.807339    1.501620  0.031173    2.399838
   4  (101 Dalmatians (1996))           (Star Wars (1977))                           0.115589            0.618240  0.100742    0.871560    1.409744  0.029281    2.972277
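To give a rough idea of what the Apriori step does for itemsets of up to 2 items, here is a pure-Python sketch. It is not mlxtend's implementation; the function name `frequent_pairs` and the toy baskets are hypothetical. The key idea is that a pair can only be frequent if both of its items are frequent on their own, so infrequent items are dropped before pairs are counted.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for all item pairs with support >= min_support."""
    n = len(baskets)

    # Level 1: keep only the items that are frequent on their own
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = {item for item, c in item_counts.items() if c / n >= min_support}

    # Level 2: count pairs built only from frequent items (the Apriori pruning step)
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


# Hypothetical toy baskets
toy_baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}]
pairs = frequent_pairs(toy_baskets, min_support=0.5)
print(pairs)
```

With a minimum support of 0.5, item D is discarded at the first level, so no pair containing D is ever counted.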

The table above lists the association rules, each with a left-hand side (antecedents) and a right-hand side (consequents). Let’s try to find recommendations for one movie:

101 Dalmatians

rules[rules.antecedents.apply(str).str.contains("101 Dalmatians")].sort_values('lift', ascending=False)
      antecedents              consequents                      antecedent support  consequent support   support  confidence        lift  leverage  conviction
  75  (101 Dalmatians (1996))  (Independence Day (ID4) (1996))            0.115589            0.454931  0.098621    0.853211    1.875473  0.046037    3.713282
  96  (101 Dalmatians (1996))  (Toy Story (1995))                         0.115589            0.479321  0.099682    0.862385    1.799180  0.044278    3.783598
  88  (101 Dalmatians (1996))  (Return of the Jedi (1983))                0.115589            0.537646  0.093319    0.807339    1.501620  0.031173    2.399838
  91  (101 Dalmatians (1996))  (Star Wars (1977))                         0.115589            0.618240  0.100742    0.871560    1.409744  0.029281    2.972277

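Sorted by lift, the strongest rule for 101 Dalmatians points to Independence Day (lift ≈ 1.88), followed by Toy Story (lift ≈ 1.80). Note that association_rules was called with its default settings, so only rules with a confidence of at least 0.8 are listed.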
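One caveat with the string filter used above is that it also matches any other title that happens to contain the same text. A stricter variant, sketched below, tests membership in the frozenset itself. The toy rules table, its lift values, and the title "1101 Dalmatians Road (1990)" are made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy rules table with the same column names as the rules DataFrame
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"101 Dalmatians (1996)"}),
        frozenset({"1101 Dalmatians Road (1990)"}),  # made-up title that merely contains the text
        frozenset({"Toy Story (1995)"}),
    ],
    "consequents": [
        frozenset({"Toy Story (1995)"}),
        frozenset({"Star Wars (1977)"}),
        frozenset({"Star Wars (1977)"}),
    ],
    "lift": [1.8, 1.2, 1.1],  # made-up values
})

title = "101 Dalmatians (1996)"

# Substring matching on the stringified frozenset matches both "Dalmatians" rows
loose = toy_rules[toy_rules["antecedents"].apply(str).str.contains("101 Dalmatians")]

# Membership test keeps only the rules whose antecedents contain exactly this title
exact = toy_rules[toy_rules["antecedents"].apply(lambda items: title in items)]

print(len(loose), len(exact))
```

On the real rules DataFrame, the same membership test can be followed by `.sort_values('lift', ascending=False)` as before.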

Let’s get the top 5 rules according to the “lift” measure.

rules.sort_values('lift', ascending=False).head(5)
      antecedents                                consequents                                antecedent support  consequent support   support  confidence        lift  leverage  conviction
1255  (Amityville: A New Generation (1993))      (Amityville 1992: It's About Time (1992))            0.005302            0.005302  0.005302    1.000000  188.600000  0.005274         inf
1254  (Amityville 1992: It's About Time (1992))  (Amityville: A New Generation (1993))                0.005302            0.005302  0.005302    1.000000  188.600000  0.005274         inf
1251  (Amityville 3-D (1983))                    (Amityville 1992: It's About Time (1992))            0.006363            0.005302  0.005302    0.833333  157.166667  0.005268    5.968187
1266  (Amityville 3-D (1983))                    (Amityville: A New Generation (1993))                0.006363            0.005302  0.005302    0.833333  157.166667  0.005268    5.968187
1267  (Amityville: A New Generation (1993))      (Amityville 3-D (1983))                              0.005302            0.006363  0.005302    1.000000  157.166667  0.005268         inf
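As a quick sanity check of the top lift value, assume the dataset contains 943 users (an assumption that is not stated above, but it is consistent with the supports shown, since 5/943 ≈ 0.005302). Two movies watched by exactly the same five users then give:

\(Lift = \frac{5/943}{(5/943)\times(5/943)} = \frac{943}{5} = 188.6\)

Under that assumption, the highest-lift rules rest on only about five baskets each, so a very high lift combined with a very low support should be read with some caution.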


Market Basket Analysis in Grocery Basket

The Movies dataset is relatively large for tutorial purposes. For that reason, we provide another example with a smaller dataset of hypothetical transactions (baskets) from a grocery store.

groceries  = pd.read_csv("groceries.txt", sep=";")
groceries
    ID  Transaction
 0   0  milk,bread,biscuit
 1   1  bread,milk,biscuit,cereal
 2   2  bread,tea
 3   3  jam,bread,milk
 4   4  tea,biscuit
 5   5  bread,tea
 6   6  tea,cereal
 7   7  bread,tea,biscuit
 8   8  jam,bread,tea
 9   9  bread,milk
10  10  coffee,orange,biscuit,cereal
11  11  coffee,orange,biscuit,cereal
12  12  coffee,sugar
13  13  bread,coffee,orange
14  14  bread,sugar,biscuit
15  15  coffee,sugar,cereal
16  16  bread,sugar,biscuit
17  17  bread,coffee,sugar
18  18  bread,coffee,sugar
19  19  tea,milk,coffee,cereal

As you can see, the items of each transaction are in the same row, separated by commas. There are two ways to create the one-hot DataFrame: one is to work with the CountVectorizer, as explained in another post, and the other is to work with the TransactionEncoder, as we will show right now. For this example, we will work with association rules of up to 3 items.

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder


# get all the transactions as a list of item lists
transactions = list(groceries['Transaction'].apply(lambda x: sorted(x.split(','))))


# instantiate the transaction encoder and learn the set of distinct items
encoder = TransactionEncoder().fit(transactions)

onehot = encoder.transform(transactions)

# convert the one-hot encoded data to a DataFrame
onehot = pd.DataFrame(onehot, columns=encoder.columns_)
# compute frequent items using the Apriori algorithm - Get up to three items
frequent_itemsets = apriori(onehot, min_support = 0.001, max_len = 3, use_colnames=True)

# compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
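To see what the encoding step produces, here is a pure-Python sketch of the same idea on three of the baskets. It only mimics the result (one column per distinct item, one True/False row per transaction) and is not the TransactionEncoder implementation; the variable names are hypothetical.

```python
# Three of the grocery baskets, already split into item lists
sample = [
    ["milk", "bread", "biscuit"],
    ["bread", "tea"],
    ["tea", "biscuit"],
]

# One column per distinct item, in sorted order
columns = sorted({item for basket in sample for item in basket})

# One row per basket: True if the basket contains the column's item
rows = [[item in basket for item in columns] for basket in sample]

print(columns)
for row in rows:
    print(row)
```

The resulting boolean matrix has the same shape as the onehot DataFrame built above, just without pandas.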

Let’s say that we want the top association rules whose left-hand side has two items: given those two items, which item is most likely to be added to the basket?


rules['lhs items'] = rules['antecedents'].apply(lambda x:len(x) )
rules[rules['lhs items']>1].sort_values('lift', ascending=False).head()
     antecedents        consequents  antecedent support  consequent support  support  confidence      lift  leverage  conviction  lhs items
 60  (coffee, biscuit)  (orange)                   0.10                0.15     0.10    1.000000  6.666667    0.0850         inf          2
 53  (cereal, biscuit)  (orange)                   0.15                0.15     0.10    0.666667  4.444444    0.0775        2.55          2
112  (tea, coffee)      (milk)                     0.05                0.25     0.05    1.000000  4.000000    0.0375         inf          2
 64  (cereal, bread)    (milk)                     0.05                0.25     0.05    1.000000  4.000000    0.0375         inf          2
104  (milk, tea)        (cereal)                   0.05                0.30     0.05    1.000000  3.333333    0.0350         inf          2

As we can see, if someone has already added (coffee, biscuit) or (cereal, biscuit) to their basket, then the item most strongly associated with it (by lift) is orange.
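The top rule can be checked by hand from the twenty baskets listed earlier. The sketch below recomputes support, confidence, lift and leverage for (coffee, biscuit) → orange in plain Python; the helper name `support` is hypothetical, and leverage is taken here as Support(X → Y) minus Support(X) × Support(Y).

```python
# The twenty grocery baskets from the groceries table
raw = [
    "milk,bread,biscuit", "bread,milk,biscuit,cereal", "bread,tea", "jam,bread,milk",
    "tea,biscuit", "bread,tea", "tea,cereal", "bread,tea,biscuit", "jam,bread,tea",
    "bread,milk", "coffee,orange,biscuit,cereal", "coffee,orange,biscuit,cereal",
    "coffee,sugar", "bread,coffee,orange", "bread,sugar,biscuit", "coffee,sugar,cereal",
    "bread,sugar,biscuit", "bread,coffee,sugar", "bread,coffee,sugar", "tea,milk,coffee,cereal",
]
baskets = [set(line.split(",")) for line in raw]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)


lhs, rhs = {"coffee", "biscuit"}, {"orange"}

sup_rule = support(lhs | rhs)                       # baskets with coffee, biscuit and orange
conf = sup_rule / support(lhs)                      # Support(X -> Y) / Support(X)
lift = sup_rule / (support(lhs) * support(rhs))     # Support(X -> Y) / (Support(X) * Support(Y))
leverage = sup_rule - support(lhs) * support(rhs)

print(round(sup_rule, 2), round(conf, 2), round(lift, 6), round(leverage, 4))
```

The values agree with row 60 of the table: the two baskets that contain coffee and biscuit both contain orange, while orange appears in only three of the twenty baskets overall.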

Visualize Market Basket Analysis

Now, we will show how to visualize the Association Rules using a heatmap. We will show all the rules where the left-hand side consists of 2 items and we are looking for an extra one.

# Import seaborn and matplotlib under their standard aliases
import seaborn as sns
import matplotlib.pyplot as plt

# Replace frozen sets with strings (sorted so that the labels are deterministic)
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(sorted(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(sorted(a)))


# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules[rules['lhs items']>1].pivot(index = 'antecedents_', 
                    columns = 'consequents_', values= 'lift')

# Generate a heatmap with annotations on
sns.heatmap(pivot, annot = True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
[Figure: heatmap of lift, with the two-item antecedents on the rows and the consequents on the columns]
