Market basket analysis is important for an e-shop because it can provide insights into customer purchasing behavior and help identify patterns in what products are often purchased together. This information can be useful for a number of purposes, including:
- Cross-selling and upselling: By identifying which products are frequently purchased together, you can make recommendations to customers about related products that they might be interested in. This can help increase sales and customer satisfaction.
- Product bundling: By grouping products that are frequently purchased together into bundles, you can create a more convenient shopping experience for customers and potentially increase the overall value of their purchase.
- Inventory management: By understanding which products are often purchased together, you can optimize your inventory levels to ensure that you have the right products in stock at the right time. This can help reduce the risk of stockouts and improve customer satisfaction.
- Marketing and advertising: By understanding which products are frequently purchased together, you can create targeted marketing and advertising campaigns that are more likely to be effective.
In this post, we will demonstrate how to conduct market basket analysis using a real-world scenario on Shopify, a leading e-commerce platform with millions of users globally.
Getting the Data
We will utilize Shopify’s API to access all orders within a store. In order to use the API and retrieve data, the store owner will need to create a private app and share the credentials. For more guidance on this process, you can refer to the tutorial on how to get all orders from Shopify.
Once you have the credentials, you can build a URL in the following format:
https://{apikey}:{password}@{hostname}/admin/api/{version}/orders.json
To retrieve all orders from the store, we will need to include certain variables in the URL of the API call.
The ‘limit’ variable sets the maximum number of orders to retrieve in a single call; we set it to 250, the highest value the API accepts. The ‘status’ variable should be set to “any” so that orders of every status are returned.
To retrieve more than 250 orders, we can use a loop to make successive API calls, using the ID of the last order returned by the previous call as the ‘since_id’ for the next call. When the API returns fewer orders than the specified limit, we can stop the loop and return the collected orders in a DataFrame.
You do not need to follow every detail of the above. Just run the following code and make sure the variables that make up the API URL hold the right values.
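import pandas as pd
import requests

# apikey, password, hostname and version must be defined beforehand
# with the credentials of the store's private app.

def get_all_orders():
    last = 0  # ID of the last order seen so far; 0 starts from the beginning
    batches = []
    while True:
        url = f"https://{apikey}:{password}@{hostname}/admin/api/{version}/orders.json?limit=250&status=any&since_id={last}"
        response = requests.get(url)
        response.raise_for_status()  # fail loudly on an HTTP error status
        df = pd.DataFrame(response.json()['orders'])
        if df.empty:  # nothing left to fetch (also avoids a KeyError on df['id'])
            break
        batches.append(df)
        last = df['id'].iloc[-1]
        if len(df) < 250:  # a short page means this was the last one
            break
    return pd.concat(batches, ignore_index=True) if batches else pd.DataFrame()

df = get_all_orders()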
Running the above, you should get a data frame containing the following columns (the last one, ‘list_items’, is not returned by the API; it is added in the next step):
['id', 'admin_graphql_api_id', 'app_id', 'browser_ip', 'buyer_accepts_marketing', 'cancel_reason', 'cancelled_at', 'cart_token', 'checkout_id', 'checkout_token', 'client_details', 'closed_at', 'confirmed', 'contact_email', 'created_at', 'currency', 'current_subtotal_price', 'current_subtotal_price_set', 'current_total_discounts', 'current_total_discounts_set', 'current_total_duties_set', 'current_total_price', 'current_total_price_set', 'current_total_tax', 'current_total_tax_set', 'customer_locale', 'device_id', 'discount_codes', 'email', 'estimated_taxes', 'financial_status', 'fulfillment_status', 'gateway', 'landing_site', 'landing_site_ref', 'location_id', 'merchant_of_record_app_id', 'name', 'note', 'note_attributes', 'number', 'order_number', 'order_status_url', 'original_total_duties_set', 'payment_gateway_names', 'phone', 'presentment_currency', 'processed_at', 'processing_method', 'reference', 'referring_site', 'source_identifier', 'source_name', 'source_url', 'subtotal_price', 'subtotal_price_set', 'tags', 'tax_lines', 'taxes_included', 'test', 'token', 'total_discounts', 'total_discounts_set', 'total_line_items_price', 'total_line_items_price_set', 'total_outstanding', 'total_price', 'total_price_set', 'total_shipping_price_set', 'total_tax', 'total_tax_set', 'total_tip_received', 'total_weight', 'updated_at', 'user_id', 'billing_address', 'customer', 'discount_applications', 'fulfillments', 'line_items', 'payment_details', 'payment_terms', 'refunds', 'shipping_address', 'shipping_lines', 'list_items']
Next, we will extract the unique names of the products contained in each order from the ‘line_items’ column. This will result in a list of products that were purchased in each specific order.
df['list_items']=df['line_items'].apply(lambda x: list(set([i['title'] for i in x])))
Now we need to create a separate row for every item in an order. This can be done easily using the explode function of pandas.
f=df.explode('list_items')
Finally, in order to perform market basket analysis, we need to transform the data into a data frame with the order ID as the index and the unique items in the store as columns. If a particular item is included in an order, it will be represented by a value of 1 in the corresponding column, while items not included in the order will be represented by a value of 0. This can be achieved as follows:
df2 = pd.crosstab(f['id'], f['list_items'])
df2.head()
The data are ready for our analysis!
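To make the three transformation steps concrete, here is a minimal sketch on three made-up orders. The order IDs and product titles are hypothetical; only the 'id' and 'line_items' column names and the 'title' key mirror the Shopify data used above.

```python
import pandas as pd

# Hypothetical orders shaped like Shopify's 'id' and 'line_items' fields.
toy = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "line_items": [
        [{"title": "Mug"}, {"title": "Coffee"}, {"title": "Mug"}],  # Mug listed twice
        [{"title": "Coffee"}],
        [{"title": "Mug"}, {"title": "Tea"}],
    ],
})

# Unique product titles per order (sorted only to keep the result deterministic).
toy["list_items"] = toy["line_items"].apply(
    lambda items: sorted({i["title"] for i in items})
)

# One row per (order, product) pair, then one column per product.
exploded = toy.explode("list_items")
basket = pd.crosstab(exploded["id"], exploded["list_items"])
print(basket)
```

Because duplicates are removed before exploding, every cell of the resulting table is 0 or 1, even for the order that lists the same product twice.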
Market Basket Analysis
When performing market basket analysis, we are interested in three main metrics.
- Support is used to measure the overall popularity of a product. It is calculated by dividing the number of transactions containing the item by the total number of transactions. For example, if milk is present in 80% of all purchases, the support for milk would be 0.8.
- Confidence measures how likely the second item is to be purchased when the first item is purchased. It is calculated by dividing the number of transactions containing both items by the number of transactions containing the first item (the antecedent). For example, if 50% of customers who bought bread also purchased butter, the confidence of the rule from bread to butter would be 0.5.
- Lift measures how much more likely the second item is to be purchased when the first item is in the basket, compared with how often it is purchased overall. It is calculated by dividing the confidence of the rule by the support of the second item (the consequent). For example, if the lift of bread and butter is 1.2, customers who buy bread are 1.2 times more likely to buy butter than customers in general. A lift of 1 means the two items are bought independently of each other.
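The three definitions can be checked by hand on a tiny example. The sketch below uses five made-up transactions and plain Python; the helper names `support`, `confidence` and `lift` are illustrative and are not part of any library.

```python
# Five made-up transactions; the product names are illustrative only.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Share of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence of the rule relative to the consequent's overall support."""
    return confidence(antecedent, consequent) / support(consequent)

bread, butter = {"bread"}, {"butter"}
print(support(bread))             # 4 of 5 transactions -> 0.8
print(confidence(bread, butter))  # 2 of the 4 bread transactions -> 0.5
print(lift(bread, butter))        # 0.5 / 0.4 -> 1.25
```

Here butter appears in 40% of all transactions but in 50% of the transactions that contain bread, which is why the lift comes out above 1.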
Although the concepts of support, confidence, and lift may seem complex, they can be computed with the apriori and association_rules functions of the mlxtend library in just two lines of code.
from mlxtend.frequent_patterns import apriori, association_rules

# Cast the 0/1 counts to booleans, the input type mlxtend recommends
frequent_itemsets = apriori(df2.astype(bool), min_support=0.001, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="lift")
rules.head()
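One possible way to turn the rules table into a shortlist is to keep only rules with a lift above 1 and a reasonable confidence, then sort by lift. The sketch below uses a small hand-made table with the same column names that association_rules returns; the products, the numbers and both thresholds are invented for illustration and should be tuned to the store.

```python
import pandas as pd

# Hand-made stand-in for the `rules` table; all values are invented.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"Mug"}), frozenset({"Tea"}), frozenset({"Coffee"})],
    "consequents": [frozenset({"Coffee"}), frozenset({"Mug"}), frozenset({"Filter"})],
    "support": [0.020, 0.004, 0.010],
    "confidence": [0.60, 0.15, 0.45],
    "lift": [2.4, 0.9, 3.1],
})

# Arbitrary example thresholds: positively associated, reasonably reliable rules.
shortlist = (
    toy_rules[(toy_rules["lift"] > 1) & (toy_rules["confidence"] >= 0.3)]
    .sort_values("lift", ascending=False)
    .reset_index(drop=True)
)
print(shortlist[["antecedents", "consequents", "confidence", "lift"]])
```

The same filter can be applied to the real `rules` data frame by replacing `toy_rules` with `rules`.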
Now, it’s up to the owner and/or the marketing team to take action.
Conclusion
Market basket analysis is a powerful tool that can provide significant value to any store. As demonstrated, it is easy to implement and should be a key tool in the toolkit of any data scientist. I recently conducted a market basket analysis for a store and the marketing team used the insights effectively by creating product bundles. As a result, the store experienced a 300% increase in sales of these bundles. Overall, market basket analysis is an effective way to identify patterns and relationships in purchase data, which can be used to inform business decisions and drive sales growth.