Predictive Hacks

## What is Market Basket Analysis

Intuitively, Market Basket Analysis works as follows: given a database of customer transactions, where each transaction is a set of items, the goal is to find groups of items that are frequently purchased together. The outcome is a recommendation of the form: if you buy one or more specific items, then you are more (or less) likely to buy this extra item. For example, if someone has already added cereal to their basket, they are more likely to add milk. In practice, Market Basket Analysis looks for complementary products rather than similar ones, so the algorithm needs to identify products that tend to be purchased together.

The Market Basket Analysis can be applied in the following cases:

• Build a movie/song recommendation engine
• Build a live recommendation algorithm on an e-commerce store
• Cross-sell or Upsell products in a supermarket

## Association Rules

Given a set of transactions, the task is to find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. For example, we can extract information on purchasing behavior such as “If someone buys beer and sausage, then they are likely to buy mustard as well.”

Let’s define the main Association Rule metrics:

### Support

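It measures how often an item or itemset is purchased, that is, the fraction of all transactions that contain it. For a rule X → Y, it is the fraction of transactions that contain both X and Y. It is given by the formulas: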

$$Support(X) = \frac{Frequency(X)}{N (\#of \;Transactions)}$$

$$Support(X \rightarrow Y) = \frac{Frequency(X \cap Y)}{N (\#of \;Transactions)}$$

### Confidence

It measures how often the items in Y appear in transactions that contain X, and is given by the formula:

$$Confidence(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X) }$$

### Lift

It tells us how much more likely item Y is to be bought together with item X than we would expect if the two were purchased independently. Values greater than one indicate that the items tend to be purchased together. In other words, lift measures how much better a rule is at predicting the result than just assuming the result in the first place. When lift > 1, the rule predicts the result better than guessing; when lift < 1, it does worse than informed guessing. It is given by the formula:

$$Lift(X \rightarrow Y ) = \frac{ Support(X \rightarrow Y )}{ Support(X)\times Support(Y) }$$
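To make the three formulas concrete, here is a minimal sketch in plain Python. The baskets are invented for illustration, and the helper names `support`, `confidence`, and `lift` are our own, not part of any library.

```python
# Toy baskets, invented purely for illustration.
transactions = [
    {"cereal", "milk"},
    {"cereal", "milk", "bread"},
    {"bread", "butter"},
    {"cereal", "bread"},
    {"milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(x, y, transactions):
    """Support of the rule X -> Y divided by the support of X."""
    both = set(x) | set(y)  # a basket must contain all items of X and of Y
    return support(both, transactions) / support(x, transactions)


def lift(x, y, transactions):
    """Support of the rule divided by the product of the individual supports."""
    both = set(x) | set(y)
    return support(both, transactions) / (
        support(x, transactions) * support(y, transactions)
    )


print("Support(cereal)          =", support({"cereal"}, transactions))
print("Support(cereal -> milk)  =", support({"cereal", "milk"}, transactions))
print("Confidence(cereal->milk) =", round(confidence({"cereal"}, {"milk"}, transactions), 3))
print("Lift(cereal -> milk)     =", round(lift({"cereal"}, {"milk"}, transactions), 3))
```

In this toy data, cereal appears in 3 of 5 baskets, milk in 3 of 5, and both together in 2 of 5, so the lift is 0.4 / (0.6 × 0.6) ≈ 1.11, slightly above 1.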

## Market Basket Analysis in Movies

In the previous posts, we showed how to apply Item-Based Collaborative Filtering and how to run Recommender Systems on movies. Today, we will apply Market Basket Analysis to the same movies dataset. Here, each “basket” is the set of movies watched by a user ID, without taking the rating into consideration.

Let’s start.

We will work with the MovieLens dataset, collected by the GroupLens Research Project at the University of Minnesota.

import pandas as pd
import numpy as np

# ratings file: one row per (user, movie) rating, tab-separated in ml-100k
columns = ['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv('ml-100k/u.data', sep='\t', names=columns)

# movie metadata file: pipe-separated, latin-1 encoded
columns = ['item_id', 'movie title', 'release date', 'video release date', 'IMDb URL', 'unknown', 'Action', 'Adventure',
'Animation', 'Childrens', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
movies = pd.read_csv('ml-100k/u.item', sep='|', names=columns, encoding='latin-1')
movie_names = movies[['item_id', 'movie title']]

# attach the movie title to every rating and keep only user and title
combined_movies_data = pd.merge(df, movie_names, on='item_id')
combined_movies_data = combined_movies_data[['user_id','movie title']]
combined_movies_data.head()

Then we want to create a “one-hot” DataFrame, with True where a movie has been watched by a given user_id and False otherwise.

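# count how many times each user appears with each movie (rows: users, columns: titles)
onehot = pd.crosstab(combined_movies_data['user_id'], combined_movies_data['movie title'])
# True if the user has watched the movie, False otherwise
onehot = onehot > 0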
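To see what such a one-hot frame looks like, here is a small sketch on a made-up viewing log. The data and the names `toy` and `toy_onehot` are hypothetical, and `pd.crosstab` is just one way to build the table.

```python
import pandas as pd

# Hypothetical viewing log: one row per (user, movie) pair.
# User 2 appears twice with movie "C" to show that duplicates collapse to True.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "movie title": ["A", "B", "A", "C", "C", "B"],
})

# Count occurrences per user and movie, then reduce the counts to True/False.
toy_onehot = pd.crosstab(toy["user_id"], toy["movie title"]) > 0
print(toy_onehot)
```

Each row is one basket (a user) and each column is one item (a movie), which is the input layout that `apriori` from mlxtend expects.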

### Generate the Association Rules of 2 Items

We will generate the Association Rules of 2 Items using the Apriori algorithm.

from mlxtend.frequent_patterns import association_rules, apriori

# compute frequent items using the Apriori algorithm
frequent_itemsets = apriori(onehot, min_support = 0.001, max_len = 2, use_colnames=True)

# compute association rules for frequent_itemsets
# (with the defaults, only rules with confidence >= 0.8 are kept)
rules = association_rules(frequent_itemsets)
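For intuition about what the frequent-itemset step computes, here is a simplified plain-Python sketch limited to single items and pairs. It is not mlxtend's implementation. The function name `frequent_itemsets_up_to_pairs` and the toy baskets are our own assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets_up_to_pairs(transactions, min_support):
    """Return {frozenset: support} for frequent single items and pairs."""
    n = len(transactions)

    # Pass 1: count single items and keep those that reach min_support.
    item_counts = Counter(item for basket in transactions for item in set(basket))
    frequent = {
        frozenset([item]): count / n
        for item, count in item_counts.items()
        if count / n >= min_support
    }
    frequent_items = {item for itemset in frequent for item in itemset}

    # Pass 2: a pair can only be frequent if both of its items are frequent
    # (the Apriori pruning idea), so infrequent items are dropped before counting.
    pair_counts = Counter()
    for basket in transactions:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))

    for pair, count in pair_counts.items():
        if count / n >= min_support:
            frequent[frozenset(pair)] = count / n
    return frequent


# Toy baskets, invented for illustration.
baskets = [["A", "B"], ["A", "B", "C"], ["A", "C"], ["B"], ["A", "B"]]
result = frequent_itemsets_up_to_pairs(baskets, min_support=0.4)
for itemset, supp in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), supp)
```

With `min_support=0.4`, the pair (B, C) is dropped because it appears in only one of the five baskets.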



The resulting table lists all the association rules, with the left-hand side (antecedents) and the right-hand side (consequents) of each rule. Let’s try to find recommendations for a movie.

### 101 Dalmatians

rules[rules.antecedents.apply(str).str.contains("101 Dalmatians")].sort_values('lift', ascending=False)
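The filter above matches a substring of the stringified frozenset, so it also picks up any other title that contains the same text. A hedged alternative is an exact membership test on each frozenset. The sketch below uses a made-up rules table; the exact title string `"101 Dalmatians (1996)"` is an assumption about how the title appears in the dataset, and the other titles and lift values are hypothetical.

```python
import pandas as pd

# Hypothetical rules table shaped like the output of association_rules.
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"101 Dalmatians (1996)"}),
        frozenset({"Movie X (hypothetical)"}),
        frozenset({"101 Dalmatians (1996)", "Movie X (hypothetical)"}),
    ],
    "consequents": [
        frozenset({"Movie Y (hypothetical)"}),
        frozenset({"101 Dalmatians (1996)"}),
        frozenset({"Movie Z (hypothetical)"}),
    ],
    "lift": [1.8, 1.2, 2.5],
})

title = "101 Dalmatians (1996)"  # assumed exact title

# Keep the rules whose antecedents contain exactly this title.
mask = toy_rules["antecedents"].apply(lambda items: title in items)
matches = toy_rules[mask].sort_values("lift", ascending=False)
print(matches)
```

Note also that `str.contains` treats its pattern as a regular expression by default, so a pattern that includes the parentheses of the year would need `regex=False` to be matched literally.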

Ranked by lift, 101 Dalmatians is most strongly associated with Independence Day.

Let’s get the top 5 rules overall according to the “lift” measure.

rules.sort_values('lift', ascending=False).head(5)

The Movies dataset is relatively large for tutorial purposes. For that reason, we will provide another example with a smaller dataset of hypothetical transactions (baskets) from a grocery store.

groceries  = pd.read_csv("groceries.txt", sep=";")
groceries

As you can see, the items of each transaction are in the same row, separated by commas. There are two ways to create the one-hot data frame: one is to work with the CountVectorizer, as explained in another post, and the other is to work with the TransactionEncoder, as we will show right now. For this example, we will work with association rules of up to 3 items.

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder

# get all the transactions as a list of item lists
transactions = list(groceries['Transaction'].apply(lambda x: sorted(x.split(','))))

# instantiate the transaction encoder and learn the unique items
encoder = TransactionEncoder().fit(transactions)

# one row per transaction, one boolean column per item
onehot = encoder.transform(transactions)

# convert one-hot encode data to DataFrame
onehot = pd.DataFrame(onehot, columns=encoder.columns_)
# compute frequent items using the Apriori algorithm - Get up to three items
frequent_itemsets = apriori(onehot, min_support = 0.001, max_len = 3, use_colnames=True)

# compute all association rules for frequent_itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
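To make the encoding step less of a black box, here is a rough plain-Python equivalent of what a transaction encoder produces. It mirrors the idea only and is not mlxtend's code; the baskets and the names `items` and `rows` are invented for illustration.

```python
# Toy baskets, invented for illustration.
baskets = [
    ["milk", "bread"],
    ["bread", "coffee", "biscuit"],
    ["milk"],
]

# One column per distinct item, in sorted order.
items = sorted({item for basket in baskets for item in basket})

# One row per basket: True where the basket contains the item.
rows = [[item in basket for item in items] for basket in baskets]

print(items)
for row in rows:
    print(row)
```

Passing `rows` and `items` to `pd.DataFrame(rows, columns=items)` would give a boolean frame of the same kind as `onehot` above.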

Let’s say that we want the top association rules where the left-hand side has two items: which item is most likely to be added to the basket?

rules['lhs items'] = rules['antecedents'].apply(lambda x:len(x) )
rules[rules['lhs items']>1].sort_values('lift', ascending=False).head()

As we can see, if someone has already added (coffee, biscuit) or (cereal, biscuit) to their basket, then the item most likely to be added is orange.

Now, we will show how to visualize the Market Basket Analysis association rules using a heatmap. We will show all the rules where the left-hand side consists of 2 items and we are looking for an extra one.

# Import seaborn under its standard alias, plus matplotlib for the tick and show calls
import seaborn as sns
import matplotlib.pyplot as plt

# Replace frozen sets with strings (sorted so the labels are deterministic)
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(sorted(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(sorted(a)))

# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules[rules['lhs items']>1].pivot(index = 'antecedents_',
                                          columns = 'consequents_', values= 'lift')

# Generate a heatmap with annotations on
sns.heatmap(pivot, annot = True)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()
