Instacart Market Basket Challenge

Jean-Pierre Rouhana and Chiemi Kato

San Francisco State University, Statistical Learning & Data Mining 2019

Introduction - Instacart Background

  • Instacart is a grocery ordering and delivery app that aims to make ordering groceries more convenient for both stores and customers

  • Customers place orders at partnering stores (e.g., Whole Foods, Trader Joe’s) and are connected with in-person shoppers who drive to the store, purchase the goods, and deliver them

  • Customers can use the delivery service for a \$3.99 fee per order, or for free if they have a membership (\$99/year) and the order is above \$35

About the Data

  • The data set was released by Instacart as a machine learning competition via Kaggle

  • The data sets consist of over 3 million anonymized grocery orders, linking the products that belong to each order and the orders that belong to each user (a sketch of loading these files in R follows this list)

  • Additional information includes the day of week and hour each order was placed, the relative amount of time between a user’s orders, grocery department and aisle labels, and the sequence in which products were added to the basket

  • The goal of the Kaggle competition was to predict which products a customer is likely to reorder
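
A minimal sketch of loading these files (file names as distributed on Kaggle; read with base read.csv, whose pre-R 4.0 default of stringsAsFactors = TRUE matches the factor columns shown in the outputs below):

orders               <- read.csv("orders.csv")
products             <- read.csv("products.csv")
aisles               <- read.csv("aisles.csv")
departments          <- read.csv("departments.csv")
order_products_prior <- read.csv("order_products__prior.csv")
order_products_train <- read.csv("order_products__train.csv")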

Exploratory Data Analysis

The files provide a combination of categorical data, both ordinal and nominal, describing the goods being ordered, with features such as “dairy,” “produce,” “organic,” etc.

In [5]:
head(products, n=10)
length(unique(products$product_id)) # 49688 total products
A data.frame: 10 × 4

| product_id <int> | product_name <fct>                                                | aisle_id <int> | department_id <int> |
|---|---|---|---|
| 1  | Chocolate Sandwich Cookies                                        | 61  | 19 |
| 2  | All-Seasons Salt                                                  | 104 | 13 |
| 3  | Robust Golden Unsweetened Oolong Tea                              | 94  | 7  |
| 4  | Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce | 38  | 1  |
| 5  | Green Chile Anytime Sauce                                         | 5   | 13 |
| 6  | Dry Nose Oil                                                      | 11  | 11 |
| 7  | Pure Coconut Water With Orange                                    | 98  | 7  |
| 8  | Cut Russet Potatoes Steam N' Mash                                 | 116 | 1  |
| 9  | Light Strawberry Blueberry Yogurt                                 | 120 | 16 |
| 10 | Sparkling Orange Juice & Prickly Pear Beverage                    | 115 | 7  |

49688
In [8]:
head(order_products_train, n=15)
| order_id | product_id | add_to_cart_order | reordered |
|---|---|---|---|
| 1  | 49302 | 1 | 1 |
| 1  | 11109 | 2 | 1 |
| 1  | 10246 | 3 | 0 |
| 1  | 49683 | 4 | 0 |
| 1  | 43633 | 5 | 1 |
| 1  | 13176 | 6 | 0 |
| 1  | 47209 | 7 | 0 |
| 1  | 22035 | 8 | 1 |
| 36 | 39612 | 1 | 0 |
| 36 | 19660 | 2 | 1 |
| 36 | 49235 | 3 | 0 |
| 36 | 43086 | 4 | 1 |
| 36 | 46620 | 5 | 1 |
| 36 | 34497 | 6 | 1 |
| 36 | 48679 | 7 | 1 |
In [3]:
head(orders, n=5)
A data.frame: 5 × 7

| order_id <int> | user_id <int> | eval_set <fct> | order_number <int> | order_dow <int> | order_hour_of_day <int> | days_since_prior_order <int> |
|---|---|---|---|---|---|---|
| 2539329 | 1 | prior | 1 | 2 | 8  | NA |
| 2398795 | 1 | prior | 2 | 3 | 7  | 15 |
| 473747  | 1 | prior | 3 | 3 | 12 | 21 |
| 2254736 | 1 | prior | 4 | 4 | 7  | 29 |
| 431534  | 1 | prior | 5 | 4 | 15 | 28 |

Days of the week when people order

In [47]:
orders %>% 
  ggplot(aes(x=order_dow, height=.2)) + 
  geom_histogram(stat="count",fill="slateblue", position=position_dodge(width=1))
Warning message:
“Ignoring unknown parameters: binwidth, bins, pad”

Relative Time Between Orders

In [48]:
orders %>% 
  ggplot(aes(x=days_since_prior_order)) + 
geom_histogram(stat="count",fill="slateblue")
Warning message:
“Ignoring unknown parameters: binwidth, bins, pad”
Warning message:
“Removed 63100 rows containing non-finite values (stat_count).”
In [50]:
mostcommon <- order_products_train %>% 
  group_by(product_id) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count)) 

head(mostcommon, n=10)
A tibble: 10 × 3

| product_id <int> | count <int> | product_name <fct> |
|---|---|---|
| 24852 | 18726 | Banana |
| 13176 | 15480 | Bag of Organic Bananas |
| 21137 | 10894 | Organic Strawberries |
| 21903 | 9784  | Organic Baby Spinach |
| 47626 | 8135  | Large Lemon |
| 47766 | 7409  | Organic Avocado |
| 47209 | 7293  | Organic Hass Avocado |
| 16797 | 6494  | Strawberries |
| 26209 | 6033  | Limes |
| 27966 | 5546  | Organic Raspberries |
In [51]:
mostcommon %>% 
  ggplot(aes(x=reorder(product_name,-count), y=count))+
  geom_bar(stat="identity",fill="slateblue")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())

What items are most often reordered?

In [8]:
item_reorders <- order_products_train %>% 
  group_by(product_id) %>% 
  summarize(proportion_reordered = mean(reordered), n=n()) %>% 
  filter(n>40) %>% 
  top_n(10,wt=proportion_reordered) %>% 
  arrange(desc(proportion_reordered)) %>% 
  left_join(products,by="product_id")

head(data.frame(item_reorders), n=15)
A data.frame: 10 × 6

| product_id <int> | proportion_reordered <dbl> | n <int> | product_name <fct> | aisle_id <int> | department_id <int> |
|---|---|---|---|---|---|
| 1729  | 0.9347826 | 92    | 2% Lactose Free Milk                  | 84  | 16 |
| 20940 | 0.9130435 | 368   | Organic Low Fat Milk                  | 84  | 16 |
| 12193 | 0.8983051 | 59    | 100% Florida Orange Juice             | 98  | 7  |
| 21038 | 0.8888889 | 81    | Organic Spelt Tortillas               | 128 | 3  |
| 31764 | 0.8888889 | 45    | Original Sparkling Seltzer Water Cans | 115 | 7  |
| 24852 | 0.8841717 | 18726 | Banana                                | 24  | 4  |
| 117   | 0.8833333 | 120   | Petit Suisse Fruit                    | 2   | 16 |
| 39180 | 0.8819876 | 483   | Organic Lowfat 1% Milk                | 84  | 16 |
| 12384 | 0.8810409 | 269   | Organic Lactose Free 1% Lowfat Milk   | 91  | 16 |
| 24024 | 0.8785249 | 461   | 1% Lowfat Milk                        | 84  | 16 |
In [54]:
item_reorders %>% 
  ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
  geom_bar(stat="identity",fill="slateblue")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95))

Proportion of Products Reordered

In [55]:
reorder %>% 
  ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity")
In [9]:
reorder.table<- order_products_train %>% 
  count(reordered) %>%            
  mutate(prop = prop.table(n))    

as.data.frame(reorder.table)   

reorder <- order_products_train %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))
A data.frame: 2 × 3

| reordered <int> | n <int> | prop <dbl> |
|---|---|---|
| 0 | 555793 | 0.4014056 |
| 1 | 828824 | 0.5985944 |

Limitations:

  • When working with such a large data set, iterative techniques such as cross-validation become very challenging. These methods are computationally expensive, and many of them were beyond our computing resources when applied to the full original data set
  • For example: cross-validation, stepwise regression, ROC curves/threshold determination, and diagnostic plotting (a sketch of cross-validation on a subsample follows this list)
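
One way to approximate cross-validation within our resources is to run it on a small random subsample. This is only a sketch (not part of the original analysis); the 1% fraction, fold count, and 0.5 threshold are illustrative, and it assumes the Combined.Orders data frame and the same predictors used later in Model 1:

library(dplyr)

set.seed(1)
cv.data <- Combined.Orders %>% sample_frac(0.01)        # hypothetical 1% subsample
folds   <- sample(rep(1:5, length.out = nrow(cv.data))) # random 5-fold assignment
cv.acc  <- numeric(5)

for (k in 1:5) {
  fit <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders +
               total_orders + add_to_cart_order + order_dow + days_since_prior_order +
               department_id,
             family = binomial, data = cv.data[folds != k, ])
  prob      <- predict(fit, newdata = cv.data[folds == k, ], type = "response")
  pred      <- ifelse(prob > 0.5, "1", "0")
  cv.acc[k] <- mean(pred == cv.data$reordered[folds == k])  # accuracy on held-out fold
}

mean(cv.acc)  # estimated out-of-sample accuracy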

Code examples that did not complete:

ROC

install.packages("pROC") library(pROC) plot(roc(Test.Orders$reordered, PredM1 , direction=">"), #col="yellow", lwd=3, main="Reordering ROC")

Preprocessing

Main Approach to Organizing the Variables

1. Feature Engineering

We had to think about what variables could be created to help us solve our problem: predicting the probability that a user will reorder a specific product.

A few new columns we created (a sketch of how such features could be computed follows the list):

  1. user_prop_reordered: proportion of products reordered for each user for all products
  2. total_orders: total number of orders for each user
  3. reordered.Num: total number of reorders per product
  4. proportion_reordered: proportion of reorders per product and per user
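
The exact definitions used to build these columns are not reproduced here, so the following is only an illustrative dplyr sketch of how per-user and per-product summaries of this kind could be computed from the orders and order_products_prior tables (the original notebook's exact formulas may differ):

library(dplyr)

# Per-user summaries (sketch): number of orders and reorder behavior per user
user_features <- order_products_prior %>%
  left_join(select(orders, order_id, user_id), by = "order_id") %>%
  group_by(user_id) %>%
  summarize(total_orders        = n_distinct(order_id),
            user_prop_reordered = mean(reordered))

# Per-product summaries (sketch): order counts and reorders per product
product_features <- order_products_prior %>%
  group_by(product_id) %>%
  summarize(product_orders       = n(),
            reordered.Num        = sum(reordered),
            proportion_reordered = mean(reordered))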

2. Selection of Variables

  • Some of the variables we were able to take out immediately, since they were originally created for referential purposes only (i.e., eval_set, user_id, order_id, order_number, product_name)
  • Others we took out because we had already engineered features summarizing each user's behavior and the proportion of reordered items for each product
  • Of the aisle, product, and department categories for the products, we decided to keep only one (department_id), since the three are correlated and related to one another (see the sketch after this list)
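
As a sketch of this step (illustrative only; the original dropping code is not shown, and the Model.Data name is hypothetical), the reference-only identifiers and the redundant product/aisle categories could be removed with dplyr, keeping department_id as the single category variable:

library(dplyr)

Model.Data <- Combined.Orders %>%
  select(-eval_set, -user_id, -order_id, -order_number,
         -product_name, -product_id, -aisle_id)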

3. Null Values: Non-random Missing Data

  • The variable days_since_prior_order had a large number of null values. We had originally omitted these rows from the data set, but after further evaluation we saw that the null values actually had a meaning.
  • We realized that every null value in days_since_prior_order corresponded to the first order made by a user. The column is null there because it represents a relative amount of time, and there is no "prior" order to measure from.
  • Since we were only looking to examine data involving reordered products, we decided that eliminating all of the first orders from the data set would not pose a major problem for our analysis (a sketch of this filtering step follows this list)
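
A minimal sketch of this filtering step (illustrative; the original code is not shown): because every NA in days_since_prior_order corresponds to a user's first order, filtering out the NAs removes exactly those first orders.

library(dplyr)

Combined.Orders <- Combined.Orders %>%
  filter(!is.na(days_since_prior_order))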

Original Data Sets

In [30]:
head(orders)
head(order_products_prior)
A data.frame: 6 × 7

| order_id <int> | user_id <int> | eval_set <fct> | order_number <int> | order_dow <int> | order_hour_of_day <int> | days_since_prior_order <int> |
|---|---|---|---|---|---|---|
| 2539329 | 1 | prior | 1 | 2 | 8  | NA |
| 2398795 | 1 | prior | 2 | 3 | 7  | 15 |
| 473747  | 1 | prior | 3 | 3 | 12 | 21 |
| 2254736 | 1 | prior | 4 | 4 | 7  | 29 |
| 431534  | 1 | prior | 5 | 4 | 15 | 28 |
| 3367565 | 1 | prior | 6 | 2 | 7  | 19 |

A data.frame: 6 × 4

| order_id <int> | product_id <int> | add_to_cart_order <int> | reordered <int> |
|---|---|---|---|
| 2 | 33120 | 1 | 1 |
| 2 | 28985 | 2 | 1 |
| 2 | 9327  | 3 | 0 |
| 2 | 45918 | 4 | 1 |
| 2 | 30035 | 5 | 0 |
| 2 | 17794 | 6 | 1 |

Updated Data Sets

In [16]:
head(Combined.Orders, n=10)
A data.frame: 10 × 18

| user_id | user_prop_reordered | total_orders | product_id | product_orders | proportion_reordered | order_id | add_to_cart_order | reordered | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order | product_name | aisle_id | department_id | reordered.Num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.728571 | 70 | 46149 | 2699  | 1.810671 | 1187899 | 1 | 1 | train | 11 | 4 | 8  | 14 | Zero Calorie Cola                | 77  | 7  | 2 |
| 1 | 1.728571 | 70 | 26405 | 405   | 1.429630 | 1187899 | 4 | 1 | train | 11 | 4 | 8  | 14 | XL Pick-A-Size Paper Towel Rolls | 54  | 17 | 2 |
| 1 | 1.728571 | 70 | 10258 | 569   | 1.690685 | 431534  | 3 | 1 | prior | 5  | 4 | 15 | 28 | Pistachios                       | 117 | 19 | 2 |
| 1 | 1.728571 | 70 | 196   | 11936 | 1.785690 | 3367565 | 1 | 1 | prior | 6  | 2 | 7  | 19 | Soda                             | 77  | 7  | 2 |
| 1 | 1.728571 | 70 | 30450 | 7286  | 1.586467 | 473747  | 5 | 0 | prior | 3  | 3 | 12 | 21 | Creamy Almond Butter             | 88  | 13 | 1 |
| 1 | 1.728571 | 70 | 12427 | 2059  | 1.739679 | 550135  | 3 | 1 | prior | 7  | 1 | 9  | 20 | Original Beef Jerky              | 23  | 19 | 2 |
| 1 | 1.728571 | 70 | 12427 | 2059  | 1.739679 | 2398795 | 3 | 1 | prior | 2  | 3 | 7  | 15 | Original Beef Jerky              | 23  | 19 | 2 |
| 1 | 1.728571 | 70 | 26405 | 405   | 1.429630 | 2254736 | 5 | 1 | prior | 4  | 4 | 7  | 29 | XL Pick-A-Size Paper Towel Rolls | 54  | 17 | 2 |
| 1 | 1.728571 | 70 | 26088 | 851   | 1.537015 | 2398795 | 5 | 1 | prior | 2  | 3 | 7  | 15 | Aged White Cheddar Popcorn       | 23  | 19 | 2 |
| 1 | 1.728571 | 70 | 13032 | 1131  | 1.639257 | 2398795 | 6 | 0 | prior | 2  | 3 | 7  | 15 | Cinnamon Toast Crunch            | 121 | 14 | 1 |

Splitting the Data

In [9]:
set.seed(1)
Train<- sample(1:nrow(Combined.Orders), nrow(Combined.Orders)*.8)
Training.Orders <- Combined.Orders[Train,]
Test.Orders <- Combined.Orders[-Train,]
In [10]:
nrow(Training.Orders)
nrow(Test.Orders)
7771023
1942756

Model 1: Logistic Regression

In [ ]:
Logistic.TestModel2 <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, family = binomial, data = Training.Orders)
summary(Logistic.TestModel2)

In [18]:
departments
A data.frame: 21 × 2

| department_id <int> | department <fct> |
|---|---|
| 1  | frozen |
| 2  | other |
| 3  | bakery |
| 4  | produce |
| 5  | alcohol |
| 6  | international |
| 7  | beverages |
| 8  | pets |
| 9  | dry goods pasta |
| 10 | bulk |
| 11 | personal care |
| 12 | meat seafood |
| 13 | pantry |
| 14 | breakfast |
| 15 | canned goods |
| 16 | dairy eggs |
| 17 | household |
| 18 | babies |
| 19 | snacks |
| 20 | deli |
| 21 | missing |

Model Performance

Prediction and Accuracy

In [57]:
attach(Combined.Orders)
set.seed(1)

Accuracy <- function(table)
{
  # Classification accuracy from a 2 x 2 confusion matrix:
  # (correctly predicted 0s + correctly predicted 1s) / total observations
  n11   <- table[1,1]
  n22   <- table[2,2]
  Total <- table[1,1] + table[2,2] + table[2,1] + table[1,2]
  return((n11 + n22) / Total)
}
In [98]:
ProbM1 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM1 <- ifelse(ProbM1 > .5, "1" , "0")
TableM1 <- table(PredM1, Test.Orders$reordered)
TableM1
Accuracy(TableM1)
      
PredM1       0       1
     0  370828  175666
     1  349293 1047199
0.729818434100915

Deviance Test Against the Null Model

In [26]:
Null.TestModel2 <- glm(reordered ~ 1, family = binomial, data = Training.Orders)

summary(Null.TestModel2)
Call:
glm(formula = reordered ~ 1, family = binomial, data = Training.Orders)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4092  -1.4092   0.9621   0.9621   0.9621  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 0.5300206  0.0007428   713.6   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 10245747  on 7771022  degrees of freedom
Residual deviance: 10245747  on 7771022  degrees of freedom
AIC: 10245749

Number of Fisher Scoring iterations: 4
In [29]:
anova(Null.TestModel2, Logistic.TestModel2, test='Chisq')
A anova: 2 × 5

| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 7771022 | 10245747 | NA | NA      | NA |
| 7770990 | 8321585  | 32 | 1924161 | 0  |

Model 2: Classification Tree

In [30]:
library(rpart)
In [ ]:
Tree.TestModel <- rpart(reordered ~ user_prop_reordered + proportion_reordered + product_orders + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,method = "class", data = Training.Orders)

plot(Tree.TestModel)
text(Tree.TestModel)

Pred.Tree <- predict(Tree.TestModel, newdata = Test.Orders, type = "class")
Tree.Table <- table(Pred.Tree, Test.Orders$reordered)
Tree.Table

Accuracy(Tree.Table)
In [101]:

         
Pred.Tree       0       1
        0  459415  219611
        1  387676 1003254
0.706618401550564

Model 3: Support Vector Machine Model

In [12]:
set.seed(1)
Trying <- sample(1:nrow(Combined.Orders), nrow(Combined.Orders)*.001)
DF <- Combined.Orders[Trying,]
Train.Trying <- sample(1:nrow(DF), nrow(DF)*.8)
Trying.DF.Train <- DF[Train.Trying, ]
Trying.DF.Test <- DF[-Train.Trying, ]
In [ ]:
library(e1071)  # provides svm()
SVM.TestModel <- svm(reordered ~ user_prop_reordered + proportion_reordered + product_orders + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, data = Trying.DF.Train)
In [58]:
summary(SVM.TestModel)
Call:
svm(formula = reordered ~ user_prop_reordered + proportion_reordered + 
    product_orders + total_orders + add_to_cart_order + order_dow + 
    days_since_prior_order + department_id, data = Trying.DF.Train)


Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.03030303 

Number of Support Vectors:  4727

 ( 2351 2376 )


Number of Classes:  2 

Levels: 
 0 1


In [59]:
SVM.Pred <- predict(SVM.TestModel, Trying.DF.Test)
SVM.Table <- table(SVM.Pred, Trying.DF.Test$reordered)

Accuracy(SVM.Table)
0.739063304168811

Conclusion

Model 1: Logistic Regression - 0.7298 accuracy

Model 2: Classification Tree - 0.7066 accuracy

Model 3: SVM - 0.7391 accuracy (fit and evaluated on a 0.1% subsample of the data)