Instacart Market Basket Analysis

8 min read · Jan 29, 2021

Instacart is an American company that provides grocery delivery and pick-up services in the U.S. and Canada through its website and mobile app. Unlike e-commerce sites that ship products directly from seller to customer, Instacart lets users buy products from participating vendors, and a personal shopper does the shopping.

Task: Build a model that predicts whether a user will order a product again or not.

Outline:

Business problem
Data Source
Data Overview
Mapping to a Machine learning problem
EDA(Exploratory data analysis)
Feature Engineering
Splitting data for train and cross-validation and test_data
Modeling
Deployment using flask
Conclusion
Future works
References

1. Business Problem

Instacart is an American company that operates a grocery delivery and pick-up service in the United States and Canada. The company offers its services via a website and mobile app. In 2017 the company hosted a Kaggle competition and provided fully anonymized data on Instacart users. The main objective is to predict which products a user will purchase again. This differs from a recommendation system: here we predict whether the user will reorder a product or not, based on the user's prior (previous) orders.

2. Data source

Instacart released this data on Kaggle for the competition:

Instacart Market Basket Analysis

Which products will an Instacart consumer purchase again?

www.kaggle.com

3. Data overview

You can download the data from the data source above. It has seven files:

aisles.csv: aisle_id (primary key) and aisle name
departments.csv: department_id (primary key) and department name
products.csv: product_id (primary key), product_name, and the aisle_id and department_id the product belongs to
order_products__prior.csv: the prior orders (order history) of each user, with order_id, product_id, add_to_cart_order, and reordered (whether the product was reordered or not)
order_products__train.csv: the training data, with order_id, product_id, add_to_cart_order, and reordered
orders.csv: all the orders (prior, train, test), with order_id (primary key), user_id, eval_set, order_number, order_dow, order_hour_of_day, and days_since_prior_order
sample_submission.csv: all the test order_ids for which we have to predict

4. Mapping to ML problem:

Going through the business problem and the data, this looks like a kind of recommendation problem, but we frame it as a classification problem: for each product, the model predicts whether it will be reordered or not. From these predictions we create the list of products that are likely to be reordered.

Metric: mean F1 score

The mean F1 score is the F1 score calculated for each order and then averaged over all orders.
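To make the metric concrete, here is a minimal sketch of a per-order F1 averaged over orders. The function names and the toy product ids are assumptions for illustration, not the competition's official scoring code.

```python
def f1_for_order(predicted, actual):
    """F1 score between the predicted and actual product ids of one order."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, truths):
    """Average the per-order F1 over every order id in `truths`."""
    scores = [f1_for_order(predictions.get(order_id, []), products)
              for order_id, products in truths.items()]
    return sum(scores) / len(scores)


# Toy data: order ids and product ids are made up.
truths = {1: [10, 20, 30], 2: ["None"]}
predictions = {1: [10, 20, 40], 2: ["None"]}
score = mean_f1(predictions, truths)
print(round(score, 4))
```

In this toy case the first order gets precision and recall of 2/3 each, the second order is a perfect match, and the two F1 values are averaged.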

5. EDA

Based on the prior order data, here is a word cloud of the products.

In the figure above, the size of each product name reflects how many times the product was reordered.

As we can see above, banana products such as Organic Banana and Bag of Organic Bananas are the most frequently bought items on Instacart.

From the above plot, we can see that the top 3 products are Banana, Bag of Organic Bananas, and Organic Strawberry.

From the above plot, we can see that the produce department has the highest purchase frequency.

From the above two plots of reorder ratio by hour and by day, we can see that the peak shopping hours are between 9:00 and 16:00 and the peak shopping days are Saturday and Sunday.

The top 50 products purchased on a weekly basis (7 days)

Top 50 products purchased on a monthly basis (30 days)

The weekly and monthly plots above show the top 50 products purchased in each period.

For in-depth EDA, you can go to my GitHub repository.

6. Feature Engineering

Before feature engineering, I added a product named 'None' (with its ID, department, and aisle also set to 'None') to the list of products. If a user's order contains no reordered products (the sum of reordered over all its products is 0), we append the product 'None' to that order with reordered=1. This way, if the user reordered nothing, the model can simply predict 'None'.

I tried my own features and took some from blogs and Kaggle discussions. All features are calculated from the prior data only; the train and test data are not used.

User_product_ratio
Day_of_week_reordered ratio
Hour_of_day_reordered ratio
Day_since_prior_order_reordered_ratio
Product_Day_of_week reordered ratio
Product_Hour_of_day_reordered ratio
User_hour_of_day
User_day_of_week
Day since prior order for a particular product
How many times user purchased the product
Word2Vec of products+Department+aisle
Converting Day of the week to cyclic
Converting hour of the day to cyclic
Weighted_features with respect to 7,14,30 Days

User_product_ratio:

What is the reorder ratio of a particular product with respect to all products? It is calculated as the number of times the product was reordered divided by the total number of reorders.

Day_week_Reordered ratio:

What is the reorder ratio of a particular day with respect to all days? It is calculated as the total number of reorders on that day divided by the total number of orders across all days.
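A minimal pandas sketch of this ratio on toy data follows. It assumes the prior order lines have already been joined with orders.csv so that each row carries order_dow, and it takes "total number of orders" to mean the total number of rows; the values and the variable names are made up.

```python
import pandas as pd

# Toy stand-in for prior order lines joined with orders.csv (values are made up).
prior = pd.DataFrame({
    "order_dow": [0, 0, 1, 1, 1, 6],
    "reordered": [1, 0, 1, 1, 0, 1],
})

# Reorders on each day divided by the number of order lines across all days.
dow_reorder_ratio = prior.groupby("order_dow")["reordered"].sum() / len(prior)
print(dow_reorder_ratio)
```

The same pattern with order_hour_of_day or days_since_prior_order as the grouping column would give the hour and days-since-prior-order ratios.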

Hour_of_day Reordered ratio:

What is the reorder ratio of a particular hour with respect to all hours? It is calculated the same way as the day-of-week ratio.

Day_since_prior_order Reordered Ratio:

The reorder ratio of a particular days_since_prior_order value with respect to all values. It is calculated as the total number of reorders for that days_since_prior_order value divided by the total number of orders.

Product_day_week Reorder_ratio:

The reorder ratio of a product for a particular day. It is the total number of reorders of that product on that day divided by the total number of reorders on that day only.
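The product-by-day ratio can be sketched with two groupbys and a merge. This is an illustration on made-up data, not the original notebook's code; the column name product_dow_reorder_ratio is an assumption.

```python
import pandas as pd

# Toy prior order lines with the day of week attached (values are made up).
prior = pd.DataFrame({
    "product_id": [1, 1, 2, 2, 1, 2],
    "order_dow":  [0, 0, 0, 0, 1, 1],
    "reordered":  [1, 1, 1, 0, 0, 1],
})

# Reorders of each product on each day.
product_day = (prior.groupby(["product_id", "order_dow"])["reordered"]
               .sum().rename("product_day_reorders").reset_index())

# Reorders of all products on each day.
day_total = (prior.groupby("order_dow")["reordered"]
             .sum().rename("day_reorders").reset_index())

product_day = product_day.merge(day_total, on="order_dow")
product_day["product_dow_reorder_ratio"] = (
    product_day["product_day_reorders"] / product_day["day_reorders"])
print(product_day)
```

A day with zero reorders would produce a division by zero here, so real data may need a fill value for that case.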

Product_Hour_of day Reorder_ratio:

The reorder ratio of a product for a particular hour of the day. It is calculated the same way as above.

User_hour_Reordered Ratio:

What is the reorder ratio of the user for a particular hour? It is calculated as the user's reorders in that hour divided by the total number of reorders for the day.

User_day Reordered Ratio:

What is the reorder ratio of the user for a particular day?

Days_since_product:

How many days have passed since the user last purchased the product.

User_times_product:

How many times the user purchased the product.
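One possible way to compute these two user-product features is sketched below on a toy history. It assumes one row per product per order with days_since_prior_order attached, and it measures days up to the user's latest prior order; the original notebook may define the reference point differently.

```python
import pandas as pd

# Toy single-user history (values are made up); one row per product in an order.
history = pd.DataFrame({
    "user_id":      [1, 1, 1, 1, 1],
    "order_number": [1, 1, 2, 3, 3],
    "product_id":   [10, 20, 10, 20, 30],
    "days_since_prior_order": [0.0, 0.0, 7.0, 5.0, 5.0],
})

# Days elapsed since the user's first order, one value per order.
orders = (history.drop_duplicates(["user_id", "order_number"])
          .sort_values(["user_id", "order_number"]).copy())
orders["day"] = orders.groupby("user_id")["days_since_prior_order"].cumsum()
history = history.merge(orders[["user_id", "order_number", "day"]],
                        on=["user_id", "order_number"])

# Purchase count and day of the last purchase per user-product pair.
features = (history.groupby(["user_id", "product_id"])
            .agg(user_times_product=("order_number", "count"),
                 last_day=("day", "max"))
            .reset_index())

# Day of the user's latest prior order, used as the reference point.
latest = (history.groupby("user_id")["day"].max()
          .rename("latest_day").reset_index())
features = features.merge(latest, on="user_id")
features["days_since_product"] = features["latest_day"] - features["last_day"]
print(features)
```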

Word2Vec of product+Dep+Aisle

By combining the product, department, and aisle names, we computed vectors using the spaCy library with its default pre-trained model (en_core_web_sm), which gives 90-dimensional vectors.

import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")
# one document vector per combined product+department+aisle string
vectors = []
for i in tqdm(all_items.concat):
  vectors.append(nlp(i).vector.tolist())

Converting day and hours to cyclic Features:

#Reference: http://blog.davidkaleko.com/feature-engineering-cyclical-features.html

As shown in the figure above, the day features are converted to cyclic features. Here is why, with an example: before the cyclic conversion, the distance between hour 23 and hour 0 is greater than the distance between hour 21 and hour 23, even though 23 and 0 are only one hour apart. After the conversion, the distance between hour 23 and hour 0 is smaller than the distance between hour 21 and hour 23. So we convert the hour and day features to cyclic features.

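import numpy as np
# hours run from 0 to 23, so one full cycle is 24
df['hr_sin'] = np.sin(df.hr*(2.*np.pi/24))
df['hr_cos'] = np.cos(df.hr*(2.*np.pi/24))
# days of the week run from 0 to 6, so one full cycle is 7
df['day_sin'] = np.sin(df.day*(2.*np.pi/7))
df['day_cos'] = np.cos(df.day*(2.*np.pi/7))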
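The distance argument can be checked numerically. The sketch below maps an hour to its (sin, cos) point and compares Euclidean distances; the helper names are made up for illustration.

```python
import numpy as np


def hour_to_xy(hour):
    """Map an hour in 0-23 to a point on the unit circle."""
    angle = 2.0 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])


def cyclic_distance(h1, h2):
    """Euclidean distance between two hours after the cyclic encoding."""
    return float(np.linalg.norm(hour_to_xy(h1) - hour_to_xy(h2)))


# Raw integer gaps: 23 vs 0 looks far apart, 21 vs 23 looks close.
raw_gap_23_0 = abs(23 - 0)
raw_gap_21_23 = abs(21 - 23)

# After encoding, 23 and 0 are neighbours on the circle.
d_23_0 = cyclic_distance(23, 0)
d_21_23 = cyclic_distance(21, 23)
print(raw_gap_23_0, raw_gap_21_23)
print(round(d_23_0, 3), round(d_21_23, 3))
```

On the raw scale the first gap is 23 and the second is 2, while on the circle the one-hour gap comes out shorter than the two-hour gap.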

Weighted features with respect to 7,14,30 days prior

These periods get extra weight because users mostly shop on a weekly (7 days), fortnightly (14 days), or monthly (30 days) cycle.

data['weight7days_sin_since_prod']=(1.01 + np.sin(2*np.pi*(data['pro']/7)))/2
data['weight7days_cos_since_prod']=(1.01 + np.cos(2*np.pi*(data['pro']/7)))/2
#similarly for 30 and 14 days
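To see what these weights do, the sketch below wraps the same formula in a helper and evaluates it for a few values of days since the product was last bought. The helper name and the sample day values are assumptions.

```python
import numpy as np


def periodic_weights(days, period):
    """Sine and cosine weights from the formula above for a period in days."""
    angle = 2 * np.pi * (np.asarray(days, dtype=float) / period)
    return (1.01 + np.sin(angle)) / 2, (1.01 + np.cos(angle)) / 2


days = [0, 3.5, 7, 14]
sin_w, cos_w = periodic_weights(days, period=7)
print(np.round(cos_w, 3))
```

With a period of 7, the cosine weight is highest when the gap is a whole number of weeks (0, 7, 14 days) and lowest halfway between, which is how the feature favours weekly repeat purchases. The 1.01 offset keeps every weight strictly positive.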

7. Splitting Data into Train_Test Split:

Take every product in the prior orders of all users, merge with the order_products train data, and fill the missing values in the reordered column with zero. Then merge with the feature files.

#train_data
# attach user_id to every prior order line, then keep the unique user-product pairs
prior_ = prior_data.merge(orders[['user_id','order_id','eval_set']],on='order_id',how='left')
prior_.drop(['order_id','reordered','eval_set'],axis=1,inplace=True)
prior_ = prior_.drop_duplicates()
# pair each prior product with the user's train order
train=prior_.merge(train_data[['user_id','order_id','order_number','order_dow','order_hour_of_day','days_since_prior_order']],on='user_id').drop_duplicates()
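The labelling step described above (merge with the train order lines and fill missing reordered values with zero) can be sketched on toy data as follows. The frame names prior_pairs and train_lines are hypothetical, and the train lines are assumed to already carry user_id.

```python
import pandas as pd

# Toy data (made up): unique products each user bought in prior orders.
prior_pairs = pd.DataFrame({
    "user_id":    [1, 1, 1, 2],
    "product_id": [10, 20, 30, 10],
}).drop_duplicates()

# Toy train order lines: what each user actually put in the train order.
train_lines = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "product_id": [10, 40, 10],
    "reordered":  [1, 0, 1],
})

# A left merge keeps every prior user-product pair; pairs missing from the
# train order get NaN, which becomes the negative label 0.
candidates = prior_pairs.merge(train_lines, on=["user_id", "product_id"], how="left")
candidates["reordered"] = candidates["reordered"].fillna(0).astype(int)
print(candidates)
```

Product 40 is dropped because the user never bought it before, so it cannot be a reorder candidate.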

Test data: take all the prior products of each user in the test set and merge with the feature files.

test_order = orders[orders.eval_set=='test']
test_order = test_order.drop(['eval_set'],axis=1)
temp = prior_data[['user_id','product_id']].drop_duplicates()
test_data = test_order.merge(temp,on='user_id')

Here we fill all the NaN values with zero, because for a user's first order the days since the prior order is NaN.

8. Modeling:

For cross-validation, the training data is split by user. There are 131,209 users in total, so I used 101,209 users for training and 30,000 users for cross-validation.
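A user-based split can be sketched as below: sample a set of user ids for cross-validation and route every row of those users to the CV part. The toy frame, the seed, and the sample size are assumptions; the original notebook may have selected the users differently.

```python
import numpy as np
import pandas as pd

# Toy feature table: 10 users with 3 rows each (values are made up).
data = pd.DataFrame({
    "user_id": np.repeat(np.arange(10), 3),
    "feature": np.arange(30),
})

# Sample whole users, not rows, for the cross-validation part.
rng = np.random.default_rng(0)
users = data["user_id"].unique()
cv_users = rng.choice(users, size=3, replace=False)

cv_mask = data["user_id"].isin(cv_users)
train_part, cv_part = data[~cv_mask], data[cv_mask]
print(len(train_part), len(cv_part))
```

Splitting by user keeps all of a user's candidate products on the same side, so the CV score is measured on users the model has not seen.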

For modeling, I tried Logistic Regression, SVM, Decision Tree, Random Forest, and LightGBM. Of these, LightGBM worked best and was very fast, so I used LightGBM with RandomizedSearchCV for hyperparameter tuning.

from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV

lgbm=LGBMClassifier(device_type='gpu')
# search space for the randomized search
prams={
    'learning_rate':[0.05,0.1,0.15,0.2],
     'n_estimators':[50,100,150,200,300],
     'max_depth':[3,5,8,10],
     'num_leaves':[31,62,93],
     'class_weight':[{0:1,1:5},{0:1,1:10},{0:1,1:50}],
}
lgbm_cfl1=RandomizedSearchCV(lgbm,param_distributions=prams,verbose=10,scoring='f1',return_train_score=True,cv=3)
lgbm_cfl1.fit(X_train,y_train)#best1
lgbm_cfl1.best_params_
# Output:
# {'class_weight': {0: 1, 1: 10},
#  'learning_rate': 0.1,
#  'max_depth': 10,
#  'n_estimators': 300,
#  'num_leaves': 62}

After getting the best params, I tried more estimators and the CV result also increased. The final params I used are:

lgbm=LGBMClassifier(device_type='gpu',class_weight={0:1,1:5},learning_rate=0.1,max_depth=10,n_estimators=1200,num_leaves=63,random_state=0)
lgbm.fit(X_train,y_train)

These params gave an F1 score of about 0.4273 on the cross-validation data.

Feature importances after modeling with LightGBM

plt.figure(figsize=(10,10))
plt.bar(X_train.columns,lgbm.feature_importances_)
plt.xticks(rotation='vertical')
plt.show()

After finding the best model, I saved it and ran it on the test data to get the results. For submitting on Kaggle, I created a function that merges the predicted products of each order into a list.

def sub(result):
  '''If any order_id has no predicted products, explicitly give it the value None and submit'''
  sub = pd.DataFrame()
  sub['order_id'] = Test.order_id
  sub['products'] = Test.product_id
  sub['reordered'] = result
  # orders for which no product was predicted as reordered
  temp = sub.groupby('order_id').agg({'reordered':'sum'}).reset_index()
  temp = temp[temp.reordered==0].copy()
  print(len(temp))
  temp['reordered'] = 'None'
  temp.columns = ['order_id','products']
  # keep the predicted reorders and join the product ids of each order
  sub = sub[sub.reordered==1].drop(['reordered'],axis=1)
  sub = sub.sort_values(by ='order_id')
  sub.products = sub.products.astype('str')
  sub = sub.groupby('order_id')['products'].apply(' '.join).reset_index()
  # concatenating an empty temp leaves the submission unchanged
  pd.concat([sub,temp],ignore_index=True).to_csv('/content/submission.csv',index=False)
sub(result)

My Kaggle submission score

9. Deployment

I used Flask to deploy this ML model.

10. Conclusion

As the Kaggle screenshot above shows, I am able to predict the products for the given orders with a mean F1 score of 0.38, which places me in the top 15 percent of the competition.

11. Future works:

In the future, I want to try some neural networks.

Create a separate model to predict the 'None' value; some discussions and blogs report that training a separate model for predicting 'None' gave a good score.

Create more features from the add_to_cart_order column, which I haven't tried in this project.

12. References

Instacart Market Basket Analysis

Winner’s Interview: 2nd place, Kazuki Onodera

medium.com

Instacart Market Basket Analysis Challenged

This blog is about my first Kaggle Challenge and Everything I did to solve the challenge. Hopefully, this blog will…

vishalmendekarhere.medium.com

alexanderrich/instacart-analysis

This repository contains scripts to produce the 23rd place solution to Kaggle's Instacart Market Basket Analysis…

github.com

To check out my entire work you can visit my Github Repository

SivaPrasad02/Instacart-Market-Basket-Anlaysis

Instacart is an American company that operates a grocery delivery and pick-up service in the United States and Canada…

github.com

Written by Siva Prasad
