Instacart Market Basket Analysis

Siva Prasad
8 min read · Jan 29, 2021

Instacart is an American company that provides a grocery delivery and pick-up service. The company operates in the U.S. and Canada and offers its services via a website and a mobile app. Unlike e-commerce websites that ship products directly from seller to customer, Instacart lets users buy products from participating vendors, and the shopping is done by a personal shopper.

Task: Build a model that predicts whether a user will order a product again.

Outline:

  1. Business problem
  2. Data Source
  3. Data Overview
  4. Mapping to a Machine learning problem
  5. EDA (Exploratory Data Analysis)
  6. Feature Engineering
  7. Splitting data into train, cross-validation, and test sets
  8. Modeling
  9. Deployment using Flask
  10. Conclusion
  11. Future works
  12. References

1. Business Problem

Instacart is an American company that operates a grocery delivery and pick-up service in the United States and Canada. The company offers its services via a website and mobile app. In 2017 the company hosted a Kaggle competition and provided fully anonymized data of Instacart users. The main objective of this problem is to predict which products a user will purchase again. The problem differs from a typical recommendation system because here we predict whether the user will reorder a product, based on the user's prior (previous) orders.

2. Data source

Instacart open-sourced this data on Kaggle for the competition.

3. Data overview

You can download the data from the data source. It has seven files:

  1. aisles.csv: Contains the aisle name and aisle_id (primary key)
  2. departments.csv: Contains the department name and department_id (primary key)
  3. products.csv: Contains product_id (primary key) and product_name, along with the aisle_id and department_id the product belongs to
  4. order_products__prior.csv: Contains the prior orders (order history) of each user: order_id, product_id, add_to_cart_order, reordered (whether the product was reordered)
  5. order_products__train.csv: Contains the training data: order_id, product_id, add_to_cart_order, reordered
  6. orders.csv: Contains all the orders (prior, train, test) with order_id (primary key), user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
  7. sample_submission.csv: Contains all the test order_ids for which we have to predict

4. Mapping to ML problem:

Going through the business problem and the data, this looks like a kind of recommendation problem, but we will frame it as a classification problem. For each product a user has bought before, the model predicts whether it will be reordered or not. Finally, we can create a list of products that are likely to be reordered.

Metric: Mean F1 score

Here the mean F1 is obtained by calculating the F1 score for each order and then averaging over all orders.
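
To make the metric concrete, here is a minimal sketch of a per-order F1 averaged over orders. The function names and the toy dictionaries are my own illustration, not the official Kaggle scorer.

```python
def order_f1(true_products, predicted_products):
    """F1 score for a single order, comparing two collections of product ids."""
    true_set, pred_set = set(true_products), set(predicted_products)
    hits = len(true_set & pred_set)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(true_set)
    return 2 * precision * recall / (precision + recall)


def mean_f1(true_by_order, pred_by_order):
    """Average the per-order F1 over every order in true_by_order."""
    scores = [
        order_f1(products, pred_by_order.get(order_id, []))
        for order_id, products in true_by_order.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: order_id -> list of product ids
truth = {1: [10, 20], 2: ["None"]}
preds = {1: [10, 30], 2: ["None"]}
score = mean_f1(truth, preds)  # order 1 scores 0.5, order 2 scores 1.0
```

Because every order contributes equally, a small order that is predicted perfectly counts as much as a large one.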

5. EDA

Based on the prior order data, here is a word cloud of the products.

In the figure above, the size of each product name indicates how many times the product was reordered.

As we can see, banana products such as Organic Banana and Bag of Organic Bananas are the ones most often bought on Instacart.

Top 100 products purchased on Instacart

From the plot above, we can see that the top 3 products are Banana, Bag of Organic Bananas, and Organic Strawberries.

Department frequency of reorders

From the plot above, we can see that the produce department has the highest purchase frequency.

Reorder_ratio by day_of_week
Reorder_ratio by hour

From the two plots above of reorder_ratio by hour and by day, we can see that the peak shopping hours are between 9:00 and 16:00 and the peak shopping days are Saturday and Sunday.

Top 50 products purchased on a weekly basis (7 days)
Top 50 products purchased on a monthly basis (30 days)

The weekly and monthly plots above show the top 50 products purchased.

For in-depth EDA, you can visit my GitHub repository.

6. Feature Engineering

Before feature engineering, I added a product named ‘None’ (with its own ID, department, and aisle) to the list of products. If a user's order contains no reordered products (the sum of reordered over all its products is 0), we append the product ‘None’ to that order with reordered=1, so that the model can directly predict ‘None’ when the user reorders nothing.

I tried some features of my own along with some taken from blogs and Kaggle discussions. These features are calculated only from the prior data; the train and test data were not included.

  1. User_product_ratio
  2. Day_of_week_reordered ratio
  3. Hour_of_day_reordered ratio
  4. Day_since_prior_order_reordered_ratio
  5. Product_Day_of_week reordered ratio
  6. Product_Hour_of_day_reordered ratio
  7. User_hour_of_day
  8. User_day_of_week
  9. Day since prior order for a particular product
  10. How many times user purchased the product
  11. Word2Vec of products+Department+aisle
  12. Converting Day of the week to cyclic
  13. Converting hour of the day to cyclic
  14. Weighted_features with respect to 7,14,30 Days

User_product_ratio:

What is the reorder ratio of a particular product with respect to all products? It is calculated as the number of times the product was reordered divided by the total number of reorders.

Day_week_Reordered ratio:

What is the reorder ratio of a particular day with respect to all days? It is calculated as the total number of reorders on that day divided by the total number of orders across all days.

Hour_of_day Reordered ratio:

What is the reorder ratio of a particular hour with respect to all hours? It is calculated the same way as the day_week ratio.
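
As a rough illustration of the day and hour ratios above, the sketch below computes them with a pandas groupby. The toy frame and the helper name `reorder_ratio` are hypothetical, and I am assuming that "total number of orders" means the total number of product rows in the prior data.

```python
import pandas as pd

# Hypothetical toy slice of the prior data (one row per ordered product)
prior = pd.DataFrame({
    "order_dow":         [0, 0, 1, 1, 1, 2],
    "order_hour_of_day": [9, 9, 10, 10, 15, 15],
    "reordered":         [1, 0, 1, 1, 0, 0],
})


def reorder_ratio(df, key):
    """Reorders within each value of `key`, divided by all rows in df."""
    return df.groupby(key)["reordered"].sum() / len(df)


dow_ratio = reorder_ratio(prior, "order_dow")
hour_ratio = reorder_ratio(prior, "order_hour_of_day")

# Attach the ratios back to every row as features
prior["dow_reorder_ratio"] = prior["order_dow"].map(dow_ratio)
prior["hour_reorder_ratio"] = prior["order_hour_of_day"].map(hour_ratio)
```

The same helper works for any single grouping column, so it could also be applied to days_since_prior_order.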

Day_since_prior_order Reordered Ratio:

The reorder ratio of a particular days_since_prior_order value with respect to all values. It is calculated as the total number of reorders for that days_since_prior_order value divided by the total number of orders.

Product_day_week Reorder_ratio:

The reorder ratio of a product for a particular day. It is the total number of reorders of that product on that day divided by the total number of reorders on that day only.

Product_Hour_of day Reorder_ratio:

The reorder ratio of a product for a particular hour of the day. It is calculated the same way as above.
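
The product-level ratios can be sketched the same way with a two-level groupby. This is only an illustration on a made-up frame; the output column names are my own.

```python
import pandas as pd

# Hypothetical toy slice of the prior data
prior = pd.DataFrame({
    "product_id": [1, 1, 2, 2, 1],
    "order_dow":  [0, 0, 0, 1, 1],
    "reordered":  [1, 1, 1, 0, 1],
})

# Reorders of each product on each day
grp = prior.groupby(["product_id", "order_dow"], as_index=False)["reordered"].sum()

# Total reorders on that day, across all products
grp["day_total"] = grp.groupby("order_dow")["reordered"].transform("sum")

# Ratio of the two; a day with no reorders at all gives 0/0, which we set to 0
grp["product_dow_reorder_ratio"] = (grp["reordered"] / grp["day_total"]).fillna(0.0)

lookup = grp.set_index(["product_id", "order_dow"])["product_dow_reorder_ratio"]
```

Replacing `order_dow` with `order_hour_of_day` would give the hour-of-day version.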

User_hour_Reordered Ratio:

What is the reorder ratio of the user for a particular hour? It is calculated as the number of reorders in that hour divided by the total number of reorders of the day.

User_day Reordered Ratio:

What is the reorder ratio of the user for a particular day?

Days_since_product:

How many days have passed since the user last purchased the product?

User_times_product:

How many times has the user purchased the product?
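
One possible way to derive Days_since_product and User_times_product is sketched below. I am assuming that days_since_prior_order can be accumulated per user into a running day count, and that the user's last order in the toy `orders` frame is the one being predicted. The frames and column names other than the Kaggle ones are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: order 103 is the order we want to predict
orders = pd.DataFrame({
    "user_id": [1, 1, 1, 1],
    "order_id": [100, 101, 102, 103],
    "order_number": [1, 2, 3, 4],
    "days_since_prior_order": [np.nan, 7, 3, 10],
})
prior = pd.DataFrame({
    "order_id": [100, 100, 101, 102],
    "product_id": [5, 6, 5, 6],
})

# Running day count per user (the first order is day 0)
orders = orders.sort_values(["user_id", "order_number"]).reset_index(drop=True)
orders["day"] = (
    orders["days_since_prior_order"].fillna(0).groupby(orders["user_id"]).cumsum()
)
target_day = orders.groupby("user_id")["day"].max()

# Per user-product: purchase count and the day of the last purchase
hist = prior.merge(orders[["order_id", "user_id", "day"]], on="order_id")
feats = (
    hist.groupby(["user_id", "product_id"])
    .agg(user_times_product=("day", "size"), last_day=("day", "max"))
    .reset_index()
)
feats["days_since_product"] = feats["user_id"].map(target_day) - feats["last_day"]
```

In this toy example product 5 was last bought on day 7 and the target order falls on day 20, so 13 days have passed.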

Word2Vec of product+Dep+Aisle

By combining the product name, department, and aisle into one string, we computed vectors using the spaCy library with its default pre-trained model (en_core_web_sm), which gives 90-dimensional vectors.

import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")
vectors = []
# all_items['concat'] holds the combined product + department + aisle text
for text in tqdm(all_items['concat']):
    vectors.append(nlp(text).vector.tolist())

Converting day and hours to cyclic Features:

#Reference: http://blog.davidkaleko.com/feature-engineering-cyclical-features.html

As the figure above shows, the day features are converted to cyclic features. Here is an example of why this conversion is needed: before the conversion, the distance between hour 23 and hour 0 is greater than the distance between hour 21 and hour 23, even though hour 0 comes right after hour 23. After converting to cyclic features, the distance between hour 23 and hour 0 is smaller than the distance between hour 21 and hour 23. So we convert the hour and day features to cyclic features.

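import numpy as np

# Hour of day runs from 0 to 23, so the period is 24
df['hr_sin'] = np.sin(df.hr*(2.*np.pi/24))
df['hr_cos'] = np.cos(df.hr*(2.*np.pi/24))
# Day of week runs from 0 to 6, so the period is 7
df['day_sin'] = np.sin(df.day*(2.*np.pi/7))
df['day_cos'] = np.cos(df.day*(2.*np.pi/7))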
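
The distance argument for cyclic features can be checked numerically. The small sketch below maps an hour to its (sin, cos) point and compares Euclidean distances; the helper names are my own.

```python
import numpy as np


def hour_to_point(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])


def cyclic_distance(h1, h2):
    """Euclidean distance between two hours after the cyclic conversion."""
    return float(np.linalg.norm(hour_to_point(h1) - hour_to_point(h2)))


# Before the conversion: 23 and 0 look far apart
raw_23_0 = abs(23 - 0)    # 23
raw_21_23 = abs(21 - 23)  # 2

# After the conversion: 23 and 0 are neighbours on the circle
cyc_23_0 = cyclic_distance(23, 0)    # about 0.26
cyc_21_23 = cyclic_distance(21, 23)  # about 0.52
```

The ordering flips after the conversion, which is the behaviour we want for the model.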

Weighted features with respect to 7,14,30 days prior

We give more weight to these intervals because users mostly shop on a weekly (7 days), fortnightly (14 days), or monthly (30 days) cycle.

data['weight7days_sin_since_prod']=(1.01 + np.sin(2*np.pi*(data['pro']/7)))/2
data['weight7days_cos_since_prod']=(1.01 + np.cos(2*np.pi*(data['pro']/7)))/2
#similarly for 30 and 14 days
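
The same formula can be applied to all three periods in a loop. The sketch below is one way to do that; the function name is hypothetical, and I am assuming the 'pro' column holds the number of days since the product was last ordered.

```python
import numpy as np
import pandas as pd


def add_periodic_weights(data, col="pro", periods=(7, 14, 30)):
    """Add sin/cos weight columns for each period, using the formula above."""
    out = data.copy()
    for p in periods:
        angle = 2 * np.pi * out[col] / p
        out[f"weight{p}days_sin_since_prod"] = (1.01 + np.sin(angle)) / 2
        out[f"weight{p}days_cos_since_prod"] = (1.01 + np.cos(angle)) / 2
    return out


# Hypothetical toy values for days since the product was last ordered
demo = pd.DataFrame({"pro": [0, 7, 14, 30]})
weighted = add_periodic_weights(demo)
```

The cosine weight is largest when the gap is an exact multiple of the period, for example a gap of 7 or 14 days for the 7-day weight.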

7. Splitting Data into Train and Test:

We take all the products in the prior orders of each user, merge them with the order_products train data, and fill the missing values in the reordered column with zero. The result is then merged with the feature files.

# train_data
prior_ = prior_data.merge(orders[['user_id', 'order_id', 'eval_set']], on='order_id', how='left')
prior_.drop(['order_id', 'reordered', 'eval_set'], axis=1, inplace=True)
prior_ = prior_.drop_duplicates()
# Pair every previously bought product with the user's train order
train = prior_.merge(train_data[['user_id', 'order_id', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']], on='user_id').drop_duplicates()
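
The labelling step described above (merge with the order_products train data and fill missing reordered values with zero) could look roughly like this. The frame names `candidates` and `order_products_train` and the toy values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical candidates: every product user 1 bought before, paired with train order 500
candidates = pd.DataFrame({
    "user_id": [1, 1, 1],
    "order_id": [500, 500, 500],
    "product_id": [10, 20, 30],
})

# Hypothetical contents of the train order: product 10 reordered, product 40 bought for the first time
order_products_train = pd.DataFrame({
    "order_id": [500, 500],
    "product_id": [10, 40],
    "reordered": [1, 0],
})

# Left merge keeps every candidate; products absent from the train order get NaN
train = candidates.merge(
    order_products_train[["order_id", "product_id", "reordered"]],
    on=["order_id", "product_id"],
    how="left",
)
# A candidate that is missing from the train order was not reordered
train["reordered"] = train["reordered"].fillna(0).astype(int)
```

Product 40 never appears as a candidate because the user had not bought it before, so it cannot be a reorder.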

Test_data: We take all the prior products of each user in the test data and merge them with the feature files.

test_order = orders[orders.eval_set=='test']
test_order = test_order.drop(['eval_set'],axis=1)
temp = prior_data[['user_id','product_id']].drop_duplicates()
test_data = test_order.merge(temp,on='user_id')

Here we fill all the NaN values with zero, because for the first order of a user the days since the prior order is given as NaN.

8. Modeling:

For cross-validation, the training data is split by user. There are 131,209 users in total, so I used 101,209 users for training and 30,000 users for cross-validation.

For modeling, I tried Logistic Regression, SVM, Decision Tree, Random Forest, and LightGBM. Of these, LightGBM performed best and was very fast, so I used LightGBM with RandomizedSearchCV for hyperparameter tuning.
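
A user-based split can be sketched as follows. This is a minimal illustration with a hypothetical helper and toy data, not necessarily the exact code used in the project.

```python
import numpy as np
import pandas as pd


def split_by_user(df, n_cv_users, seed=0):
    """Hold out all rows of n_cv_users randomly chosen users for cross-validation."""
    rng = np.random.default_rng(seed)
    users = df["user_id"].unique()
    cv_users = rng.choice(users, size=n_cv_users, replace=False)
    is_cv = df["user_id"].isin(cv_users)
    return df[~is_cv], df[is_cv]


# Hypothetical toy data with four users
demo = pd.DataFrame({"user_id": [1, 1, 2, 2, 3, 3, 4, 4], "product_id": range(8)})
train_part, cv_part = split_by_user(demo, n_cv_users=1)
```

Splitting by user keeps all rows of a user on one side, so the cross-validation score is measured on users the model has not seen.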

from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV

lgbm = LGBMClassifier(device_type='gpu')
prams = {
    'learning_rate': [0.05, 0.1, 0.15, 0.2],
    'n_estimators': [50, 100, 150, 200, 300],
    'max_depth': [3, 5, 8, 10],
    'num_leaves': [31, 62, 93],
    'class_weight': [{0: 1, 1: 5}, {0: 1, 1: 10}, {0: 1, 1: 50}],
}
lgbm_cfl1 = RandomizedSearchCV(lgbm, param_distributions=prams, verbose=10, scoring='f1', return_train_score=True, cv=3)
lgbm_cfl1.fit(X_train, y_train)
# best1
lgbm_cfl1.best_params_
# Output:
# {'class_weight': {0: 1, 1: 10},
#  'learning_rate': 0.1,
#  'max_depth': 10,
#  'n_estimators': 300,
#  'num_leaves': 62}

After getting the best params, I tried increasing the number of estimators, and the CV result also improved. The final params I used are:

lgbm=LGBMClassifier(device_type='gpu',class_weight={0:1,1:5},learning_rate=0.1,max_depth=10,n_estimators=1200,num_leaves=63,random_state=0)
lgbm.fit(X_train,y_train)

These params gave me an F1 score of 0.4273 on the CV (cross-validation) data.

Feature importances after modeling with LightGBM

import matplotlib.pyplot as plt

plt.figure(figsize=(10,10))
plt.bar(X_train.columns,lgbm.feature_importances_)
plt.xticks(rotation='vertical')
plt.show()
Feature importances

After finding the best model, I saved it and ran it on the test data to get the results. For submitting on Kaggle, I created a function that merges all the predicted products of an order into a list.

import pandas as pd

def sub(result):
    '''If an order_id has no predicted products, explicitly submit the value None for it.'''
    # Test is the test DataFrame; result holds the 0/1 predictions in the same row order
    df = pd.DataFrame()
    df['order_id'] = Test.order_id
    df['products'] = Test.product_id
    df['reordered'] = result
    # Orders for which no product is predicted as reordered
    temp = df.groupby('order_id').agg({'reordered': 'sum'}).reset_index()
    temp = temp[temp.reordered == 0].copy()
    print(len(temp))
    # Keep the predicted reorders and join the product ids of each order into one string
    df = df[df.reordered == 1].drop(['reordered'], axis=1).sort_values(by='order_id')
    df['products'] = df['products'].astype('str')
    df = df.groupby('order_id')['products'].apply(' '.join).reset_index()
    if len(temp) != 0:
        temp['reordered'] = 'None'
        temp.columns = ['order_id', 'products']
        df = pd.concat([df, temp], ignore_index=True)
    df.to_csv('/content/submission.csv', index=False)

sub(result)

My Kaggle submission score

9. Deployment

I used Flask to deploy this ML model.

10. Conclusion

As the Kaggle screenshots above show, I am able to predict the products for the given orders with a mean F1 score of 0.38, which places me in the top 15 percent of the competition.

11. Future works:

In the future, I want to try some neural networks.

Create a separate model to predict the ‘None’ value, because I have seen discussions and blogs where training a separate model for predicting None gave a good score.

Create more features from the add_to_cart feature, which I haven’t tried in this project.

12. References

To check out my entire work, you can visit my GitHub repository.
