Instacart Market Basket Challenge - Exploratory Data Analysis and Predictive Modeling¶

By Chiemi Kato and JP Rouhana

Introduction¶

Instacart is a grocery ordering and delivery service app, which aims to make the process of ordering groceries more convenient for grocery stores and shoppers. The purpose of this competition is to make predictions about which products are likely to be reordered by the same customers.

Reasons why this competition could benefit the company include finding ways to optimize the company's recommendation system, improving inventory and staffing infrastructures, and acquiring better insights about longer term customers who rely on Instacart as part of their regular shopping routines.

Click here to see more information on our final results and presentation.

Looking at the raw data...¶

The data is separated into six csv files, which describe the different aspects of the products hosted on the website. It is a random sample of over 3 million grocery orders from over 200,000 instacart users. We will use R Statistical Software to analyze the data.

The files provide a combination of categorical data, both ordinal and nominal pertaining to the features of the actual goods being ordered such as “dairy,” “produce,” “organic,” etc. There are several other variables including the store aisle, and the sequence of orders which are placed in the basket, day of the week, time of the day, and relative time (in days) between orders for each user. Also notice, in the data set titled 'order_products_train' (as well as 'order_products_prior') there are columns labeled 'reordered' which have values of either 0 or 1. We will use these columns are our response variables, which represent 0 for items that are not reordered, and 1 for items that are reordered.

library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(DT)
library(tidyr)
library(tidyverse)
library(randomForest)
library(ROCR)
library(repr)

options(repr.plot.width=5, repr.plot.height=4)

library(dplyr)
library(ggplot2)
library(caret)
library(DT)

aisles <- read.csv('data/aisles.csv', header=TRUE)
departments <- read.csv('data/departments.csv', header=TRUE)
products <- read.csv('data/products.csv',header=TRUE)
order_products_train <- read.csv('data/order_products__train.csv', header=TRUE)
order_products_prior <- read.csv('data/order_products__prior.csv', header=TRUE)
orders <- read.csv('data/orders.csv', header=TRUE)

head(products, n=10)
length(unique(products$product_id)) # 49688 total products

head(order_products_train, n=15)

Note that in the above data set, we are provided with a response column of reordered= 0 or 1.¶

head(orders, n=5)

Quickly recode to convert character variables to factors

orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))

Looking at Some Histograms...¶

Days of the week when people order¶

By simply looking at a few histograms, you can spot out some of the stand out groups that have more observations that the rest.

orders %>% 
  ggplot(aes(x=order_dow)) + 
  geom_histogram(stat="count",fill="slateblue") + ggtitle("Order Day of Week")

Warning message:
“Ignoring unknown parameters: binwidth, bins, pad”

Hour of day when people order¶

orders %>% 
  ggplot(aes(x=days_since_prior_order)) + 
geom_histogram(stat="count",fill="slateblue") + ggtitle("Days Since Last Order (relative days between orders)")

Warning message:
“Ignoring unknown parameters: binwidth, bins, pad”Warning message:
“Removed 63100 rows containing non-finite values (stat_count).”

What are the most popular items ordered on Instacart?¶

mostcommon <- order_products_train %>% 
  group_by(product_id) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count)) 

head(mostcommon, n=10)

mostcommon %>% 
  ggplot(aes(x=reorder(product_name,-count), y=count))+
  geom_bar(stat="identity",fill="slateblue")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank()) +  ggtitle("Most Popular Items Overall")

What items are most often reordered?¶

item_reorders <- order_products_train %>% 
  group_by(product_id) %>% 
  summarize(proportion_reordered = mean(reordered), n=n()) %>% 
  filter(n>40) %>% 
  top_n(10,wt=proportion_reordered) %>% 
  arrange(desc(proportion_reordered)) %>% 
  left_join(products,by="product_id")

head(data.frame(item_reorders), n=15)

item_reorders %>% 
  ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
  geom_bar(stat="identity",fill="slateblue")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95)) + ggtitle("Most Popular Reordered Items")

Proportion of Products Reordered¶

reorder.table<- order_products_train %>% 
  count(reordered) %>%            
  mutate(prop = prop.table(n))    

as.data.frame(reorder.table)   

reorder <- order_products_train %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))

reorder %>% 
  ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity")+ggtitle("Proportion of Reordered / Non-Reordered\n Products")

Now let's take a look at some specific "Power" Instacart users¶

# First calculating proportion of reorders per order and n number of products in a single order
power.users <- order_products_prior %>% 
  group_by(order_id) %>% 
  summarize(reordered.prop = mean(reordered),n=n()) %>% 
  right_join(filter(orders,order_number>2), by="order_id")
# take out first order (reordered response column is irrelevant fir first orders)
# for each order, calculating reordered.prop, then joining with orders table to add reordered.prop and n

power.users2 <- power.users %>% 
  filter(eval_set =="prior") %>% 
  group_by(user_id) %>% 
  summarize(orders_total = sum(reordered.prop==1,na.rm=T), percent_equal = orders_total/n()) %>% 
  filter(percent_equal == 1) %>% 
arrange(desc(orders_total))
# lists users with 100% reorders in the order of highest number of total orders

head(power.users2, n=10)
head(power.users)

User 99753 and 55331 have the highest frequency of orders AND the highest proportions of reorders per individual order.¶

Below are lists of the specific items ordered across orders for each of the two "Power Users." Notice that the number of products purchased per order is not considered in our identification of the "Power Users."

power.order.numbers <- filter(power.users, user_id == 99753)$order_id
power.products <- order_products_prior %>% 
  filter(order_id %in% power.order.numbers) %>% 
  left_join(products, by="product_id")

head(power.products, n=10)[,1:5]

power.order.numbers <- filter(power.users, user_id == 55331)$order_id
power.products <- order_products_prior %>% 
  filter(order_id %in% power.order.numbers) %>% 
  left_join(products, by="product_id")

head(power.products, n=10)[,1:5]

What's Next??¶

Predictive Analysis¶

Now that we have a better picture about the data we're working with, we can begin to use our best judgement to train models based on our historical data and test the models to see which ones work best on unseen data.

Preprocessing¶

Before diving into the models, we must first reorganize the data to be able to easily fit the models using our data. For smoother computability, we will decrease the size of the overall data, Combined.orders by 1/4 using random sampling. We will merge the aisles and department data sets at a later point.

All.Orders = merge(order_products_prior, orders)
All.Orders2 = merge(order_products_train, orders, by="order_id")
Combined.Orders = rbind(All.Orders,All.Orders2)
Combined.Orders <- merge(Combined.Orders, products, by = "product_id")
Combined.Orders$reordered <- as.factor(Combined.Orders$reordered)

set.seed(1)

subset <- sample(nrow(Combined.Orders),nrow(Combined.Orders)/4 )
Combined.Orders <- Combined.Orders[subset,]
nrow(Combined.Orders)

attach(Combined.Orders)
Combined.Orders$department_id <- as.factor(Combined.Orders$department_id)
Combined.Orders$user_id <- as.factor(Combined.Orders$user_id)

# Sort Combined.Orders by user_id
Combined.Orders <- Combined.Orders[order(user_id),]

Main Approach to Organizing the Variables¶

1. Feature Engineering¶

A few new columns we created for the purpose of evaluating association of particular variables (specific users and productss) with the reordered response

user_prop_reordered: proportion of products reordered for each user for all products
total_orders: total number of orders for each user
reordered.Num: total number of reorders per product
proportion_reordered: proportion of reorders per product and per user

2. Selection of Variables¶

Some of the variables we were able to automatically take out, since they were either originally created for referential purposes only (ie eval_set, user_id, order_id, order_number, product_name)
Others we took out since we had already created information about the users and proportion of reordered items across for each product
Out of aisle, product, and departmental categories for the products, we decided to keep only one of these variables, since they are all correlated and related to one another

3. Null Values: Non-random Missing Data¶

The variable days_since_prior_order had a null value assigned to the first order made for every customer (since this variable is measured as a relative unit of time between orders). We decided to omit all of these first orders, since our response variable of specific reordered products is also measured relative to the products purchased in the first orders.

Creation of Dataframes for Proportion Reordered per User (User.Reordered and User.data)¶

User.data <- data.frame(unique(Combined.Orders[,5]))

User.Reordered <-Combined.Orders %>%
  group_by(user_id) %>%
  count(reordered)

User.Reordered[1:14,]

# Create list of unique user_id's
id.list <- unique(Combined.Orders$user_id)
re.prop <- c()

# Calculating proportion of items reordered for each user in id.list
for (userid in id.list)
{
  subset <- User.Reordered[User.Reordered$user_id==userid,]
  if (nrow(subset)==1){
    if (subset$reordered == 0){
      pro = 0
    }
    else {
      pro = 1
    }
    } 
  else
  {
    total <- sum(subset$n)
    pro <- subset$n[subset$reordered==1]/total
  }
  re.prop <- append(re.prop, pro)
}

User.data <- data.frame(user_id=id.list, reorder.prop=re.prop)

head(User.data, n=10)

Combined.Orders$reordered.num <- as.numeric(Combined.Orders$reordered)-1

#########   User  Proportion Reorders
group_userid <- group_by(Combined.Orders, Combined.Orders$user_id)

User_Reorders = summarise(group_userid, total_orders=n())
User_Reorders$user_prop_reordered = User.data$reorder.prop

head(User_Reorders)

Creation of Dataframe for Proportion Reordered per Product (item_reorders)¶

item_reorders <- Combined.Orders %>%
  group_by(product_id) %>%
  summarize(n=n())

# Create list of unique product_id's
prodid.list <- (unique(Combined.Orders$product_id))
prodid.list <- sort(prodid.list)

re.prop <- c()

head(item_reorders)

# Sort Combined.Orders by product_id for easier parsing in for loop below
Combined.Orders <- Combined.Orders[order(Combined.Orders$product_id),]
attach(Combined.Orders)

for (prodid in prodid.list)
{
  subset <- Combined.Orders[Combined.Orders$product_id==prodid,]
  if (nrow(subset)==1){
    if (subset$reordered == 0){
      pro=0
    }
    else {
      pro = 1
    }
  } 
  else
  {
    total <- nrow(subset)
    pro <- nrow(subset[subset$reordered==1,])/total
    if (length(pro) == 0) {
      pro= 0
    }
  }
  re.prop <- append(re.prop, pro)
}

item_reorders$proportion_reordered <- re.prop
ProductCounts <- item_reorders

ProductCounts$product_orders <- ProductCounts$n
ProductCounts<- ProductCounts[,-2]

head(ProductCounts)

Completing the Dataset¶

# Merge Combined.Orders with ProductCounts
Combined.Orders <- merge(ProductCounts, Combined.Orders, by = "product_id")

# Merge Combined.Orders with user_Reorders
User_Reorders$user_id <- User_Reorders$'Combined.Orders$user_id'
User_Reorders <- User_Reorders[,-1]
Combined.Orders <- merge(User_Reorders, Combined.Orders , by = "user_id")
Combined.Orders$order_dow <- as.factor(Combined.Orders$order_dow)
names(Combined.Orders)[6] <- 'product_orders_total'

Combined.Orders$reordered <- as.factor(Combined.Orders$reordered)
Combined.Orders$department_id = as.factor(Combined.Orders$department_id)

head(Combined.Orders)

str(Combined.Orders)

'data.frame':	2428182 obs. of  18 variables:
 $ user_id               : Factor w/ 62866 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ total_orders          : int  19 19 19 19 19 19 19 19 19 19 ...
 $ user_prop_reordered   : num  0.737 0.737 0.737 0.737 0.737 ...
 $ product_id            : int  196 25133 25133 39657 39657 38928 49235 12427 13176 10258 ...
 $ proportion_reordered  : num  0.782 0.747 0.747 0.779 0.779 ...
 $ product_orders        : int  3007 498 498 403 403 1022 5751 501 30440 140 ...
 $ product_orders_total  : int  3007 498 498 403 403 1022 5751 501 30440 140 ...
 $ order_id              : int  2550362 3367565 473747 2550362 1187899 1187899 1187899 2254736 431534 2254736 ...
 $ add_to_cart_order     : int  1 4 4 3 5 3 10 2 8 3 ...
 $ reordered             : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ order_number          : int  10 6 3 10 11 11 11 4 5 4 ...
 $ order_dow             : Factor w/ 7 levels "0","1","2","3",..: 5 3 4 5 5 5 5 5 5 5 ...
 $ order_hour_of_day     : int  8 7 12 8 8 8 8 7 15 7 ...
 $ days_since_prior_order: int  30 19 21 30 14 14 14 29 28 29 ...
 $ product_name          : Factor w/ 49688 levels ".5\\\" Waterproof Tape",..: 41347 31986 31986 25489 25489 31 30307 32766 3460 35029 ...
 $ aisle_id              : int  77 21 21 45 45 120 53 23 24 117 ...
 $ department_id         : Factor w/ 21 levels "1","2","3","4",..: 7 16 16 19 19 16 16 19 4 19 ...
 $ reordered.num         : num  1 1 0 0 1 1 1 1 1 1 ...

Eliminating the null values (in days_since_prior_order) as was discussed in previous section (#3 in Preprocessing section)¶

Combined.Orders <- na.omit(Combined.Orders)

head(Combined.Orders, n=10)

Modeling¶

Splitting the Data¶

Now that we have reorganized the data by creating additional variables to consider that give us information about various proportions of reordered products, we can now split the data into a training set to train the data and a test/validation set to test the generalizability of our model.

set.seed(1)
Train<- sample(1:nrow(Combined.Orders), nrow(Combined.Orders)*.8)
Training.Orders <- Combined.Orders[Train,]
Test.Orders <- Combined.Orders[-Train,]

nrow(Training.Orders)
nrow(Test.Orders)

Model 1a: Logistic Regression¶

We can apply a logistic regression model to this data because we are looking to make predictions on a binomial response variable (0 for a purchased product that was not previously ordered and 1 for one that was).

The first model below indicates that the subset of variables discussed in the preprocessing section (user_prop_reordered, proportion_reordered, product_orders_total, total_orders, add_to_cart_order, order_dow, days_since_prior_order) and WITHOUT the department_id indicator columns all have coefficients that have significant associations with the reordered variable (ie p-values are all much smaller than the significance level of .05)

Logistic.TestModel1 <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order, family = binomial, data = Training.Orders)
summary(Logistic.TestModel1)

Call:
glm(formula = reordered ~ user_prop_reordered + proportion_reordered + 
    product_orders_total + total_orders + add_to_cart_order + 
    order_dow + days_since_prior_order, family = binomial, data = Training.Orders)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9123  -0.8964   0.4977   0.8079   3.3840  

Coefficients:
                         Estimate Std. Error  z value Pr(>|z|)    
(Intercept)            -4.217e+00  1.106e-02 -381.392  < 2e-16 ***
user_prop_reordered     4.288e+00  1.238e-02  346.384  < 2e-16 ***
proportion_reordered    4.274e+00  1.298e-02  329.200  < 2e-16 ***
product_orders_total    7.885e-06  3.715e-07   21.228  < 2e-16 ***
total_orders            1.557e-03  2.695e-05   57.778  < 2e-16 ***
add_to_cart_order      -5.416e-02  2.588e-04 -209.283  < 2e-16 ***
order_dow1             -2.985e-02  5.705e-03   -5.232 1.68e-07 ***
order_dow2             -7.870e-02  6.176e-03  -12.742  < 2e-16 ***
order_dow3             -8.478e-02  6.342e-03  -13.367  < 2e-16 ***
order_dow4             -8.205e-02  6.384e-03  -12.854  < 2e-16 ***
order_dow5             -6.011e-02  6.174e-03   -9.736  < 2e-16 ***
order_dow6             -2.330e-02  6.014e-03   -3.875 0.000107 ***
days_since_prior_order  6.650e-03  2.091e-04   31.805  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2560373  on 1942399  degrees of freedom
Residual deviance: 2033676  on 1942387  degrees of freedom
AIC: 2033702

Number of Fisher Scoring iterations: 4

Model 1b: Logistic Regression¶

Similarly, the logistic regression model that considers the same variables used in the previous model AS WELL AS the department indicator variables have coefficients that are almost all significant. From the results below, we can see that there are two departments with insignificant coefficients to the response variable. These two departments are department 12 (meat/seafood) and department 15 (canned goods).

Logistic.TestModel2 <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, family = binomial, data = Training.Orders)
summary(Logistic.TestModel2)

Call:
glm(formula = reordered ~ user_prop_reordered + proportion_reordered + 
    product_orders_total + total_orders + add_to_cart_order + 
    order_dow + days_since_prior_order + department_id, family = binomial, 
    data = Training.Orders)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9205  -0.8958   0.4976   0.8076   3.3781  

Coefficients:
                         Estimate Std. Error  z value Pr(>|z|)    
(Intercept)            -4.103e+00  1.324e-02 -309.872  < 2e-16 ***
user_prop_reordered     4.302e+00  1.240e-02  346.928  < 2e-16 ***
proportion_reordered    4.169e+00  1.567e-02  265.994  < 2e-16 ***
product_orders_total    1.012e-05  4.252e-07   23.798  < 2e-16 ***
total_orders            1.564e-03  2.698e-05   57.975  < 2e-16 ***
add_to_cart_order      -5.408e-02  2.596e-04 -208.312  < 2e-16 ***
order_dow1             -2.859e-02  5.708e-03   -5.009 5.48e-07 ***
order_dow2             -7.775e-02  6.181e-03  -12.579  < 2e-16 ***
order_dow3             -8.400e-02  6.347e-03  -13.234  < 2e-16 ***
order_dow4             -8.160e-02  6.389e-03  -12.772  < 2e-16 ***
order_dow5             -6.033e-02  6.179e-03   -9.763  < 2e-16 ***
order_dow6             -2.378e-02  6.016e-03   -3.953 7.73e-05 ***
days_since_prior_order  6.638e-03  2.092e-04   31.725  < 2e-16 ***
department_id2         -1.372e-01  5.501e-02   -2.495  0.01260 *  
department_id3         -3.248e-02  1.107e-02   -2.934  0.00335 ** 
department_id4         -8.767e-02  7.419e-03  -11.817  < 2e-16 ***
department_id5          2.524e-01  2.620e-02    9.637  < 2e-16 ***
department_id6         -7.945e-02  1.991e-02   -3.991 6.59e-05 ***
department_id7         -4.397e-02  8.943e-03   -4.917 8.80e-07 ***
department_id8          8.658e-02  3.257e-02    2.658  0.00785 ** 
department_id9         -2.802e-02  1.183e-02   -2.368  0.01790 *  
department_id10        -2.777e-01  5.222e-02   -5.318 1.05e-07 ***
department_id11        -2.130e-01  1.641e-02  -12.975  < 2e-16 ***
department_id12        -7.148e-03  1.285e-02   -0.556  0.57791    
department_id13        -1.695e-01  9.798e-03  -17.301  < 2e-16 ***
department_id14        -9.672e-02  1.286e-02   -7.522 5.38e-14 ***
department_id15         5.436e-03  1.099e-02    0.495  0.62078    
department_id16        -3.436e-02  7.847e-03   -4.379 1.19e-05 ***
department_id17        -1.236e-01  1.280e-02   -9.653  < 2e-16 ***
department_id18        -1.762e-01  1.596e-02  -11.045  < 2e-16 ***
department_id19        -1.079e-01  8.447e-03  -12.772  < 2e-16 ***
department_id20        -4.875e-02  1.132e-02   -4.307 1.65e-05 ***
department_id21        -7.725e-02  3.548e-02   -2.177  0.02945 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2560373  on 1942399  degrees of freedom
Residual deviance: 2032748  on 1942367  degrees of freedom
AIC: 2032814

Number of Fisher Scoring iterations: 4

Insignificant Variables¶

The coefficients for the indicator variables, department_id==12 (meat/seafood) and department_id==15 (canned goods) have large p-values. In these cases where the null hypotheses refer to there being no effect on the response variable from these department indicator variables, we cannot reject the null hypothesis, since the probability that the data is a result of the null hypothesis is very high.

We will conclude that the meat/seafood and canned goods departments do not show significant contribution to the variance of the response outcomes. This means that products purchased in these departments are usually not likely to affect whether or not the item is a reordered product or not. Interpretations for this insignificance can vary, and may or may not involve the rates of which these grocery products spoil (ie, bulk items do not have sensitive timelines that they have to be used and meats/seafoods have especially sensitive timelines). All in all, further exploration of the data would need to be completed in order to come to stronger insights about why these particular departments lack significance to the response variable over others.

For a glimpse of the specific types of products in these categories, there are lists of the most popular items in these categories as well as bar charts measuring the response variables per department below.

dept12 <- Combined.Orders[Combined.Orders$department_id == 12,]

reorder.table12<- dept12 %>% 
  count(reordered) %>%            
  mutate(prop = prop.table(n))    

as.data.frame(reorder.table12)   

reorder12 <- dept12 %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))

reorder12 %>% 
  ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity")+ ggtitle("Proportion of Reordered Products \n in Dept 12 (meat/seafood)") + theme(plot.title = element_text(hjust = 0.5))

item_list12 <- unique(dept12$product_id)

prop12<- c()
n = c()

for (prodid in item_list12){
    subset <- dept12[dept12$product_id == prodid,]
    avg = (nrow(subset[subset$reordered==1,]))/nrow(subset)
    
    prop12 <- append(prop12, avg)
    n = append(n, nrow(subset))
}

dept12proportions <- data.frame(item_list12,prop12,n)
names(dept12proportions)[1]<- 'product_id'
dept12proportions<- merge(dept12proportions, products, by='product_id')
dept12proportions<-dept12proportions %>% arrange(desc(dept12proportions$n))

head(dept12proportions, n=15)

dept15 <- Combined.Orders[Combined.Orders$department_id == 15,]

reorder.table15<- dept15 %>% 
  count(reordered) %>%            
  mutate(prop = prop.table(n))    

as.data.frame(reorder.table15)   

reorder15 <- dept15 %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))

reorder15 %>% 
  ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity") + ggtitle("Proportion of Reordered Products \n in Dept 15 (canned goods)") + theme(plot.title = element_text(hjust = 0.5))

item_list15 <- unique(dept15$product_id)

prop15<- c()
n = c()

for (prodid in item_list15){
    subset <- dept15[dept15$product_id == prodid,]
    avg = (nrow(subset[subset$reordered==1,]))/nrow(subset)
    
    prop15 <- append(prop15, avg)
    n = append(n, nrow(subset))
}

dept15proportions <- data.frame(item_list15,prop15,n)
names(dept15proportions)[1]<- 'product_id'
dept15proportions<- merge(dept15proportions, products, by='product_id')
dept15proportions<-dept15proportions %>% arrange(desc(dept15proportions$n))

head(dept15proportions, n=15)

Model Performance¶

Prediction and Accuracy¶

attach(Combined.Orders)
set.seed(1)

Accuracy <- function(table)
{
 n11 <- table[1,1] 
 n22 <- table[2,2]
 Total <- table[1,1]+table[2,2]+table[2,1]+table[1,2]
 Total
 return((n11+n22)/Total)
}

ProbM1 <- predict.glm(Logistic.TestModel1, newdata = Test.Orders, type = "response")
PredM1 <- ifelse(ProbM1 > .5, "1" , "0")
TableM1 <- table(PredM1, Test.Orders$reordered)
TableM1
Accuracy(TableM1)

      
PredM1      0      1
     0  95350  42530
     1  84329 263392

print('Sensitivity:')
sensitivity(TableM1)
print('Specificity:')
specificity(TableM1)

[1] "Sensitivity:"

[1] "Specificity:"

ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .5, "1" , "0")

TableM2 <- table(PredM2, Test.Orders$reordered) 
TableM2

Accuracy(TableM2)

      
PredM2      0      1
     0  95348  42445
     1  84331 263477

print('Sensitivity:')
sensitivity(TableM2)
print('Specificity:')
specificity(TableM2)

[1] "Sensitivity:"

[1] "Specificity:"

# DIFFERENT?? WHY
ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .5, "1" , "0")

TableM2 <- table(PredM2, Test.Orders$reordered) 
TableM2

Accuracy(TableM2)

      
PredM2      0      1
     0  95348  42445
     1  84331 263477

The accuracy results of the two models are very close; however, the second model (the model including the department indicator variables) is marginally better by 2e-4.¶

The sensitivity measures the proportion of correctly classified "success" responses (the true positive rates or the number of correctly classified reorder=1 outcomes) out of the true number of "success" responses.

The specificity measures the proportion of correctly classified "failure" responses (the true negatie rates or the number of correctly classified reorder=0 outcomes) out of the true number of "failure" responses.

From the results above, we get that

ROC Curve¶

When interpreting the classification error rates, how can we be sure that the best decision boundary for identifying a reorder=1 response should be when the likelihood, p=success > .5?

Depending on the business costs of Type I errors (false positive, ie a product/user/time observation is incorrectly recommended/estimated to reorder the product when in reality they would not reorder the product)and Type II errors (false negative, ie a product/user/time observation is not estimated to reorder the product when in reality they would want to reorder the product at thist time), we may want to adjust our decision boundary thresholds in order to minimize the two types of possible errors that can occur.

Thus, we will use a Receiver Operating Characteric (ROC) curve, which plots the true positive and false positive rates of the data at various decision boundary threshold settings.

library(Epi)

ROC(form=(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id) , data=Training.Orders, plot=
"ROC", MX=TRUE, MI=FALSE, main="ROC Curve")

According to the results above, we can see that the optimal decision boundary threshold for our data is at p(reordered=1) = .639.¶

With this new information, we can recalculate our accuracy, sensitivity, and specificity rates of our logistic regression model. As we can see below, the accuracy rate has decreased from 0.7389 to .7186; however, the important difference to note here is that the sensitivity and specificity are now closer to eachother (where previously they were 0.5307 and 0.8613 respectively).

All in all, the optimal decision boundary should be chosen with consideration of several business implications. Thus, for our purposes, we will leave this question aside and use the .5 decision threshold and associated accuracy rate for comparisons with the other two models.¶

ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .639, "1" , "0")

TableM2 <- table(PredM2, Test.Orders$reordered) 
TableM2

Accuracy(TableM2)

      
PredM2      0      1
     0 126675  83639
     1  53004 222283

print('Sensitivity:')
sensitivity(TableM2)
print('Specificity:')
specificity(TableM2)

[1] "Sensitivity:"

[1] "Specificity:"

Model 2: Classification Tree¶

We will next look at the classification tree technique, which makes predictions about the response variable based on the mean responses of the training observations that belong to the leaves (or endpoints).

The tree model below, Tree.TestModel, uses the same predictors as the ones used in the previous logistic regression model, Logistic.TestModel2.

library(rpart)

Tree.TestModel <- rpart(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,method = "class", data = Training.Orders)

plot(Tree.TestModel, uniform=TRUE, margin=.05,
   main="Classification Tree for Reordered")
text(Tree.TestModel, use.n=TRUE, all=TRUE, cex=.57)

Cross validation using complexity parameter (cp)¶

Tree.TestModelR <- capture.output(summary(Tree.TestModel))
plotcp(Tree.TestModel)

The graph above and outputs below show that the relative error decreases as the complexity parameter decreases. We can see that the optimal complexity parameter value associated with the minimum error is .01.¶

printcp(Tree.TestModel) # display the results 

Tree.TestModel$cptable

Classification tree:
rpart(formula = reordered ~ user_prop_reordered + proportion_reordered + 
    product_orders_total + total_orders + add_to_cart_order + 
    order_dow + days_since_prior_order + department_id, data = Training.Orders, 
    method = "class")

Variables actually used in tree construction:
[1] proportion_reordered user_prop_reordered 

Root node error: 719122/1942400 = 0.37022

n= 1942400 

        CP nsplit rel error  xerror       xstd
1 0.155263      0   1.00000 1.00000 0.00093582
2 0.029823      1   0.84474 0.84474 0.00089851
3 0.017304      3   0.78509 0.78448 0.00087981
4 0.010000      5   0.75048 0.74991 0.00086792

Tree.TestModel$cptable[which.min(Tree.TestModel$cptable[,'xerror']), 'CP']

Pruning¶

Normally, we would want to prune the tree in order to prevent overfitting and decrease the amount of variance from our original tree; however, the optimal CP value came out to be the same as the default CP value for the rpart tree method. Thus, we will keep the original tree model built.

Pred.Tree <- predict(Tree.TestModel, newdata = Test.Orders, type = "class")
Tree.Table <- table(Pred.Tree, Test.Orders$reordered)
Tree.Table

Accuracy(Tree.Table)

         
Pred.Tree      0      1
        0  71961  27142
        1 107718 278780

print('Sensitivity:')
sensitivity(Tree.Table)
print('Specificity:')
specificity(Tree.Table)

[1] "Sensitivity:"

[1] "Specificity:"

Random Forest¶

To improve upon the tree method used above, a random forest allows a way to make decorrelated trees to lower the variance of the model. By forcing each split in the tree to be selected based on only a subset of all of the predictors available for use, the random forest prevents a dominant predictor from ruling the results of the trees built.

As demonstrated below, the randomForest function from the R package will not run on large data frames. In order for the randomForest to run, we had to cut the data by 50% (with 971200 observations).

library(randomForest)

# First attempt: Error from the training set being too large :(
randomForest.fit <- randomForest(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,   data=Training.Orders)

Error in randomForest.default(m, y, ...): long vectors (argument 26) are not supported in .C
Traceback:

1. randomForest(reordered ~ user_prop_reordered + proportion_reordered + 
 .     product_orders_total + total_orders + add_to_cart_order + 
 .     order_dow + days_since_prior_order + department_id, data = Training.Orders)
2. randomForest.formula(reordered ~ user_prop_reordered + proportion_reordered + 
 .     product_orders_total + total_orders + add_to_cart_order + 
 .     order_dow + days_since_prior_order + department_id, data = Training.Orders)
3. randomForest.default(m, y, ...)

# Split the data in half

set.seed(1)
subset <- sample(1:nrow(Training.Orders), nrow(Training.Orders)/2)
Training.Orders2 <- Training.Orders[subset,]

nrow(Training.Orders2)

# Second attempt
randomForest.fit2 <- randomForest(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,   data=Training.Orders2)

print(randomForest.fit2) 
importance(randomForest.fit2)

Call:
 randomForest(formula = reordered ~ user_prop_reordered + proportion_reordered +      product_orders_total + total_orders + add_to_cart_order +      order_dow + days_since_prior_order + department_id, data = Training.Orders2) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 26.07%
Confusion matrix:
       0      1 class.error
0 191642 168236   0.4674806
1  84935 526387   0.1389366

Using the random forest method, our model has a test error rate of 26.23% and an accuracy rate of 73.77%. Also by calculating the mean decrease in Gini index for each predictor relative to the varible with the largest decrease, we can see that the variables with the highest importance are user_prop_reordered, proportion_reordered, product_order_total, and total orders.

varImpPlot(randomForest.fit2, main='Importance Plot of Random Forest')

The good news is that the resulting accuracy rates from the original tree model came out to 72.22% and from the random forest model was 73.77, which is comparable to the 74% accuracy of the logistic regression fits computed above.¶

The bad news is that the trees mostly use information based off of our newly engineered variables, proportion_reordered and user_prop_reordered, based off of the importance parameters provided above. Because these variables indicate aggregated information about the products and users who have purchased these products, we cannot easily identify helpful information about the various time and departmental features around these products.¶

Thus, although our tree model may have a higher accuracy rate, more work needs to be done to be able to use this model for industry purposes.

Model 3: K Nearest Neighbor (KNN) Classification¶

Next we will look at K Nearest Neighbor (KNN) classification, which applies distance formulas to locate points that are nearest to our training data points. Once this is done, a prediction model is fitted using the labels that are the most frequently out of each k neighbor group for each data point.

An important piece of information to consider is that the KNN classification requires that all predictors used in the model to be numeric. Thus, we will take out of consideration the following variables, which are not numeric or ordinal: user_id, eval_set, order_number, order_dow, product_name, product_id, aisle_id, department_id)

As for the ordinal categorical predictor variables (ie total_orders, product_orders_total, add_to_cart_order, order_hour_of_day), we will coerce these variables to be continuous instead of discrete.

library(MASS)
library(class)

train.y <- Training.Orders[9]
train.X <- Training.Orders[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]

test.X <- Test.Orders[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]

train.y = train.y[,1]

nrow(train.X)
length(train.y)
nrow(test.X)

head(train.X)

attach(train.X)

Often with KNN, the scale of the predictor variables should be considered so that variables with larger numbers will not dominant the classification over those with smaller values (since distances between points are evaluated to locate the K nearest neighbors).¶

Thus, below we make sure to normalize each of the variables used in the KNN classification training and test sets.

normalize <- function(x){
    return((x-min(x)) / (max(x)-min(x)))    
}

train.X.normalized <- as.data.frame(lapply(train.X, normalize))
test.X.normalized <- as.data.frame(lapply(test.X, normalize))

set.seed(1)

# Arbitrarily chose a K value of 5 here, which is a common choice for industry use
knn.pred = knn(train.X.normalized,test.X.normalized,train.y ,k=5)

test.y <- Test.Orders[9]
test.y = test.y[,1]

knn.table<- table(knn.pred, test.y)
knn.table
Accuracy(knn.table)

        test.y
knn.pred      0      1
       0  94310  54707
       1  85369 251215

print('Sensitivity:')
sensitivity(knn.table)
print('Specificity:')
specificity(knn.table)

[1] "Sensitivity:"

[1] "Specificity:"

levels(knn.pred)
knn.pred[1:15]

knn.table<- table(knn.pred, test.y)
knn.table
Accuracy(knn.table)

        test.y
knn.pred      0      1
       0  94310  54707
       1  85369 251215

We get an accuracy rate of 71.15%. This rate is comparable to the accuracies of the other models computed above as well. To improve and tweak the model further, we would want to compute some sort of cross-validation method in order to identify the optimal K number of neighbors to consider when creating the prediction model; however, our we unfortunately do not have¶

K-Fold Cross Validation to Optimize K for KNN Classification¶

Cross-Validation is a commonly used method to to test the performance reliability of a model on different groupings of unseen data. For KNN classification, we will consider a variety of different K values that select the K nearest neighbors of the training data. We will compute 10 different KNN classification models using different K values from 1 to 10. We will then select the model with the lowest test/validation error as the final model to use.

train.normalized <- cbind(train.X.normalized, train.y)
names(train.normalized)[9] <- 'reordered'
attach(train.normalized)

# In order to make the training and test samples the same size for cross-validation evaluation
set.seed(1)
Training.Orders2 <- Training.Orders[sample(1:nrow(train.X.normalized), 485601),]

train.y2 <- Training.Orders2[9]
train.X2 <- Training.Orders2[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]
train.y2 = train.y2[,1]

train.X.normalized2 <- as.data.frame(lapply(train.X2, normalize))

set.seed(1)

# Creating the 5 lists of indices that represent the 5 folds of the cross validation
idx <- createFolds(train.y2, k=5)

idx[[1]]

set.seed(1)

ks<- 1:10

res <- sapply(ks, function(k) {
  # try out each version of k from 1 to 10
  res.k <- sapply(seq_along(idx), function(i) {
    # looping over each of the 5 cross-validation folds to create predicted classifications of the training data
    pred <- knn(train=train.X.normalized2[-idx[[i]], ],
                test=test.X.normalized[idx[[i]], ],
                cl=train.y2[-idx[[i]]], k = k)
    # calculation of ratio of misclassified training samples
    mean(train.y2[idx[[i]]] != pred)
  })
  # averaging the error rates over the 5 folds per K-value group
  mean(res.k)
})

The outputted values are the error rates of each of the 10 different KNN models (using a range of K values from 1 to 10). From the plot below, we can see that the model with the minimal error considers K=10 nearest neighbors.¶

res

plot(ks, res, type="o",xlab='K value', ylab="Misclassification Test Error", main='5-Fold Cross Validation Error (for K= 1 to 10)')

which.min(res)

res[10]

set.seed(1)

knn.pred2 = knn(train.X.normalized,test.X.normalized,train.y ,k=10)

knn.table2<- table(knn.pred2, test.y)
knn.table2
Accuracy(knn.table2)

        test.y
knn.pred      0      1
       0  92965  48051
       1  86714 257871

print('Sensitivity:')
sensitivity(knn.table2)
print('Specificity:')
specificity(knn.table2)

[1] "Sensitivity:"

[1] "Specificity:"

The final accuracy rate of the knn.pred2 model is 72.24%, which is an improvement from the previous accuracy rate of 71.15% from the original knn.pred model using K=5 nearest neighbors. Next let's compare the different models below!

Conclusion:¶

Model 1: Logistic Regression - 0.7389
Model 2: Classification Tree - 0.7223 / Random Forest (on 1/2 of data) - .7377
Model 3: KNN - 0.7224

From the results above, we can see that the logistic regression model (with the department indicator variables included) gives us the highest accuracy/minimal error rate. Because the accuracy rates are so close, there is a chance that one of the other two models may be able to outperform the logistic regression model through using additional techniques to improve the model (ie bagging or boosting of the classification tree).

Interaction Terms and Understanding:¶

Another factor to consider is the use of interaction terms. To keep the models simple, we chose to not implement interaction terms; however, since the likelihood of a user reordering a product involves a nuanced relationship between the user to specific products, it would make sense to include interaction terms to account for the variety of relationships between the two variables. Thus, to improve upon all of the three models provided, it is recommended to include interaction terms between product_id's and user_ids's.

Subset Selection:¶

One may notice that the variables that came out as particularly effective and high in "importance" (according to our importance plot above) were the features that we engineered using aggregated reordered data across user_id's and product_id's; however, as stated in the Classification Tree analysis, there are some inherent problems with considering these variables because by their nature, they will have bias since we used the reordered columns to extract these new columns.

Thus, for further exploration, it is recommended to consider models that use a subset of the variables we have considered. There are various subset selection methods that can applied such as backward and forward stepwise regression, principal component analysis, and lasso and ridge regression that will help eliminate some of the unnecessary variables. Additionally, one should take a look at a correlation map in order to identify some of the highest correlated variables to take out of the model (to minimize the chance of overfitting).

References:¶

https://www.kaggle.com/c/instacart-market-basket-analysis

https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2

product_id	count	product_name
24852	18726	Banana
13176	15480	Bag of Organic Bananas
21137	10894	Organic Strawberries
21903	9784	Organic Baby Spinach
47626	8135	Large Lemon
47766	7409	Organic Avocado
47209	7293	Organic Hass Avocado
16797	6494	Strawberries
26209	6033	Limes
27966	5546	Organic Raspberries

product_id	proportion_reordered	n	product_name	aisle_id	department_id
<int>	<dbl>	<int>	<fct>	<int>	<int>
1729	0.9347826	92	2% Lactose Free Milk	84	16
20940	0.9130435	368	Organic Low Fat Milk	84	16
12193	0.8983051	59	100% Florida Orange Juice	98	7
21038	0.8888889	81	Organic Spelt Tortillas	128	3
31764	0.8888889	45	Original Sparkling Seltzer Water Cans	115	7
24852	0.8841717	18726	Banana	24	4
117	0.8833333	120	Petit Suisse Fruit	2	16
39180	0.8819876	483	Organic Lowfat 1% Milk	84	16
12384	0.8810409	269	Organic Lactose Free 1% Lowfat Milk	91	16
24024	0.8785249	461	1% Lowfat Milk	84	16

reordered	n	prop
<int>	<int>	<dbl>
0	555793	0.4014056
1	828824	0.5985944

order_id	product_id	add_to_cart_order	reordered	product_name
46614	27845	1	1	Organic Whole Milk
46614	38689	2	1	Organic Reduced Fat Milk
67223	27845	1	1	Organic Whole Milk
67223	38689	2	1	Organic Reduced Fat Milk
214506	27845	1	1	Organic Whole Milk
214506	38689	2	1	Organic Reduced Fat Milk
240832	27845	1	1	Organic Whole Milk
240832	38689	2	1	Organic Reduced Fat Milk
260804	27845	1	1	Organic Whole Milk
260804	38689	2	1	Organic Reduced Fat Milk

order_id	product_id	add_to_cart_order	reordered	product_name
19195	21333	1	1	Original Whipped Cream Cheese
19195	17744	2	1	Dark Chocolate with A Touch Of Sea Salt
19195	40028	3	1	Deluxe Cinnamon Raisin Bagels
47826	21333	1	1	Original Whipped Cream Cheese
47826	6723	2	1	Palmiers, Petite
47826	17744	3	1	Dark Chocolate with A Touch Of Sea Salt
47826	40028	4	1	Deluxe Cinnamon Raisin Bagels
71579	17744	1	1	Dark Chocolate with A Touch Of Sea Salt
71579	40028	2	1	Deluxe Cinnamon Raisin Bagels
115171	17744	1	1	Dark Chocolate with A Touch Of Sea Salt

product_id	product_name	aisle_id	department_id
1	Chocolate Sandwich Cookies	61	19
2	All-Seasons Salt	104	13
3	Robust Golden Unsweetened Oolong Tea	94	7
4	Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce	38	1
5	Green Chile Anytime Sauce	5	13
6	Dry Nose Oil	11	11
7	Pure Coconut Water With Orange	98	7
8	Cut Russet Potatoes Steam N' Mash	116	1
9	Light Strawberry Blueberry Yogurt	120	16
10	Sparkling Orange Juice & Prickly Pear Beverage	115	7

order_id	product_id	add_to_cart_order	reordered
1	49302	1	1
1	11109	2	1
1	10246	3	0
1	49683	4	0
1	43633	5	1
1	13176	6	0
1	47209	7	0
1	22035	8	1
36	39612	1	0
36	19660	2	1
36	49235	3	0
36	43086	4	1
36	46620	5	1
36	34497	6	1
36	48679	7	1

order_id	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
2539329	1	prior	1	2	8	NA
2398795	1	prior	2	3	7	15
473747	1	prior	3	3	12	21
2254736	1	prior	4	4	7	29
431534	1	prior	5	4	15	28

user_id	orders_total	percent_equal
99753	97	1
55331	49	1
106510	49	1
111365	47	1
74656	46	1
170174	45	1
12025	43	1
164779	37	1
37075	34	1
110225	33	1

order_id	reordered.prop	n	user_id	eval_set	order_number	order_dow	order_hour_of_day	days_since_prior_order
473747	0.6000000	5	1	prior	3	3	12	21
2254736	1.0000000	5	1	prior	4	4	7	29
431534	0.6250000	8	1	prior	5	4	15	28
3367565	1.0000000	4	1	prior	6	2	7	19
550135	1.0000000	5	1	prior	7	1	9	20
3108588	0.6666667	6	1	prior	8	1	14	14

user_id	reorder.prop
<fct>	<dbl>
1	0.7368421
2	0.5106383
3	0.7000000
4	0.1428571
5	0.3636364
6	0.0000000
7	0.6562500
8	0.2000000
9	0.4193548
10	0.2888889

CP	nsplit	rel error	xerror	xstd
0.15526295	0	1.0000000	1.0000000	0.0009358192
0.02982317	1	0.8447371	0.8447440	0.0008985051
0.01730374	3	0.7850907	0.7844802	0.0008798050
0.01000000	5	0.7504832	0.7499061	0.0008679229

	MeanDecreaseGini
user_prop_reordered	100512.53
proportion_reordered	81943.24
product_orders_total	53429.74
total_orders	62535.25
add_to_cart_order	40436.10
order_dow	32692.11
days_since_prior_order	38478.58
department_id	32674.32

A data.frame: 15 × 6
product_id	prop12	n	product_name	aisle_id	department_id
<int>	<dbl>	<int>	<fct>	<int>	<int>
25890	0.7203591	3787	Boneless Skinless Chicken Breasts	49	12
38293	0.6815102	2066	Ground Turkey Breast	35	12
6046	0.7020798	1779	Boneless Skinless Chicken Breast	35	12
47672	0.6695769	1489	Uncured Hickory Smoked Sunday Bacon	106	12
5646	0.6835237	1226	Organic Turkey Bacon	106	12
45633	0.6210351	1198	Organic Beef Hot Dogs	106	12
17461	0.6814310	1174	Air Chilled Organic Boneless Skinless Chicken Breasts	35	12
10017	0.6493625	1098	Tilapia Filet	39	12
9339	0.6947271	1081	Organic Chicken & Apple Sausage	106	12
27695	0.6951673	1076	Natural Chicken & Sage Breakfast Sausage	106	12
13198	0.6952880	955	85% Lean Ground Beef	122	12
12456	0.6473214	896	Organic Sunday Bacon	106	12
4421	0.5612245	882	Organic Beef Uncured Hot Dogs	106	12
21927	0.7262639	811	Organic Boneless Skinless Chicken Breast	35	12
9020	0.6377171	806	Boneless Skinless Chicken Thighs	35	12

A data.frame: 15 × 6
product_id	prop15	n	product_name	aisle_id	department_id
<int>	<dbl>	<int>	<fct>	<int>	<int>
27156	0.6182759	2900	Organic Black Beans	59	15
42768	0.5831123	2262	Organic Garbanzo Beans	59	15
28849	0.6348528	2106	No Salt Added Black Beans	59	15
890	0.5033475	1643	Organic Diced Tomatoes	81	15
7021	0.4450072	1382	Organic Tomato Paste	81	15
45535	0.4969088	1294	Organic Low Sodium Chicken Broth	69	15
26790	0.7570394	1243	Organic AppleApple	99	15
8670	0.4750890	1124	Diced Tomatoes	81	15
46720	0.4669951	1015	Organic Free Range Chicken Broth	69	15
35042	0.5390071	987	Black Beans	59	15
22395	0.4773196	970	Tomato Sauce	81	15
9092	0.5161290	961	Organic Vegetable Broth	69	15
44449	0.4796238	957	Organic Tomato Sauce	81	15
3376	0.5164718	941	Organic Whole Kernel Sweet Corn No Salt Added	81	15
11408	0.5096701	879	Crushed Tomatoes	81	15

A data.frame: 6 × 8
	total_orders	user_prop_reordered	proportion_reordered	product_orders_total	order_id	add_to_cart_order	order_hour_of_day	days_since_prior_order
	<int>	<dbl>	<dbl>	<int>	<int>	<int>	<int>	<int>
2367507	30	0.7000000	0.6840405	11147	1565475	5	0	30
1582830	147	0.7142857	0.6295110	8119	182523	20	13	13
912381	33	0.6666667	0.4566474	173	3119618	3	6	30
2261718	31	0.7419355	0.7200675	593	2512048	6	12	8
1431980	297	0.7609428	0.7500000	24	763357	11	14	7
682798	84	0.6547619	0.6628571	875	2902612	14	13	14