integrate_eda_Copy2 with writing

Instacart Market Basket Challenge - Exploratory Data Analysis and Predictive Modeling

By Chiemi Kato and JP Rouhana

Introduction

Instacart is a grocery ordering and delivery service app, which aims to make the process of ordering groceries more convenient for grocery stores and shoppers. The purpose of this competition is to make predictions about which products are likely to be reordered by the same customers.

Reasons why this competition could benefit the company include finding ways to optimize the company's recommendation system, improving inventory and staffing infrastructures, and acquiring better insights about longer term customers who rely on Instacart as part of their regular shopping routines.

Click here to see more information on our final results and presentation.

Looking at the raw data...

The data is separated into six csv files, which describe the different aspects of the products hosted on the website. It is a random sample of over 3 million grocery orders from over 200,000 instacart users. We will use R Statistical Software to analyze the data.

The files provide a combination of categorical data, both ordinal and nominal pertaining to the features of the actual goods being ordered such as “dairy,” “produce,” “organic,” etc. There are several other variables including the store aisle, and the sequence of orders which are placed in the basket, day of the week, time of the day, and relative time (in days) between orders for each user. Also notice, in the data set titled 'order_products_train' (as well as 'order_products_prior') there are columns labeled 'reordered' which have values of either 0 or 1. We will use these columns are our response variables, which represent 0 for items that are not reordered, and 1 for items that are reordered.

In [ ]:
library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
library(DT)
library(tidyr)
library(tidyverse)
library(randomForest)
library(ROCR)
library(repr)

options(repr.plot.width=5, repr.plot.height=4)

library(dplyr)
library(ggplot2)
library(caret)
library(DT)
In [2]:
aisles <- read.csv('data/aisles.csv', header=TRUE)
departments <- read.csv('data/departments.csv', header=TRUE)
products <- read.csv('data/products.csv',header=TRUE)
order_products_train <- read.csv('data/order_products__train.csv', header=TRUE)
order_products_prior <- read.csv('data/order_products__prior.csv', header=TRUE)
orders <- read.csv('data/orders.csv', header=TRUE)
In [7]:
head(products, n=10)
length(unique(products$product_id)) # 49688 total products
product_idproduct_nameaisle_iddepartment_id
1 Chocolate Sandwich Cookies 61 19
2 All-Seasons Salt 104 13
3 Robust Golden Unsweetened Oolong Tea 94 7
4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce 38 1
5 Green Chile Anytime Sauce 5 13
6 Dry Nose Oil 11 11
7 Pure Coconut Water With Orange 98 7
8 Cut Russet Potatoes Steam N' Mash 116 1
9 Light Strawberry Blueberry Yogurt 120 16
10 Sparkling Orange Juice & Prickly Pear Beverage 115 7
49688
In [8]:
head(order_products_train, n=15)
order_idproduct_idadd_to_cart_orderreordered
1 493021 1
1 111092 1
1 102463 0
1 496834 0
1 436335 1
1 131766 0
1 472097 0
1 220358 1
36 396121 0
36 196602 1
36 492353 0
36 430864 1
36 466205 1
36 344976 1
36 486797 1

Note that in the above data set, we are provided with a response column of reordered= 0 or 1.

In [9]:
head(orders, n=5)
order_iduser_ideval_setorder_numberorder_doworder_hour_of_daydays_since_prior_order
25393291 prior 1 2 8 NA
23987951 prior 2 3 7 15
4737471 prior 3 3 12 21
22547361 prior 4 4 7 29
4315341 prior 5 4 15 28

Quickly recode to convert character variables to factors

In [3]:
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))

Looking at Some Histograms...

Days of the week when people order

In [ ]:
By simply looking at a few histograms, you can spot out some of the stand out groups that have more observations that the rest.
In [239]:
orders %>% 
  ggplot(aes(x=order_dow)) + 
  geom_histogram(stat="count",fill="slateblue") + ggtitle("Order Day of Week")
Warning message:
“Ignoring unknown parameters: binwidth, bins, pad”

Hour of day when people order

In [240]:
orders %>% 
  ggplot(aes(x=days_since_prior_order)) + 
geom_histogram(stat="count",fill="slateblue") + ggtitle("Days Since Last Order (relative days between orders)")
Warning message:
“Ignoring unknown parameters: binwidth, bins, pad”Warning message:
“Removed 63100 rows containing non-finite values (stat_count).”
In [54]:
mostcommon <- order_products_train %>% 
  group_by(product_id) %>% 
  summarize(count = n()) %>% 
  top_n(10, wt = count) %>%
  left_join(select(products,product_id,product_name),by="product_id") %>%
  arrange(desc(count)) 

head(mostcommon, n=10)
product_idcountproduct_name
24852 18726 Banana
13176 15480 Bag of Organic Bananas
21137 10894 Organic Strawberries
21903 9784 Organic Baby Spinach
47626 8135 Large Lemon
47766 7409 Organic Avocado
47209 7293 Organic Hass Avocado
16797 6494 Strawberries
26209 6033 Limes
27966 5546 Organic Raspberries
In [7]:
mostcommon %>% 
  ggplot(aes(x=reorder(product_name,-count), y=count))+
  geom_bar(stat="identity",fill="slateblue")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank()) +  ggtitle("Most Popular Items Overall")

What items are most often reordered?

In [241]:
item_reorders <- order_products_train %>% 
  group_by(product_id) %>% 
  summarize(proportion_reordered = mean(reordered), n=n()) %>% 
  filter(n>40) %>% 
  top_n(10,wt=proportion_reordered) %>% 
  arrange(desc(proportion_reordered)) %>% 
  left_join(products,by="product_id")

head(data.frame(item_reorders), n=15)
A data.frame: 10 × 6
product_idproportion_reorderednproduct_nameaisle_iddepartment_id
<int><dbl><int><fct><int><int>
17290.9347826 922% Lactose Free Milk 8416
209400.9130435 368Organic Low Fat Milk 8416
121930.8983051 59100% Florida Orange Juice 98 7
210380.8888889 81Organic Spelt Tortillas 128 3
317640.8888889 45Original Sparkling Seltzer Water Cans115 7
248520.884171718726Banana 24 4
1170.8833333 120Petit Suisse Fruit 216
391800.8819876 483Organic Lowfat 1% Milk 8416
123840.8810409 269Organic Lactose Free 1% Lowfat Milk 9116
240240.8785249 4611% Lowfat Milk 8416
In [242]:
item_reorders %>% 
  ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
  geom_bar(stat="identity",fill="slateblue")+
  theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95)) + ggtitle("Most Popular Reordered Items")

Proportion of Products Reordered

In [254]:
reorder.table<- order_products_train %>% 
  count(reordered) %>%            
  mutate(prop = prop.table(n))    

as.data.frame(reorder.table)   

reorder <- order_products_train %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))

reorder %>% 
  ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity")+ggtitle("Proportion of Reordered / Non-Reordered\n Products")
A data.frame: 2 × 3
reorderednprop
<int><int><dbl>
05557930.4014056
18288240.5985944

Now let's take a look at some specific "Power" Instacart users

In [13]:
# First calculating proportion of reorders per order and n number of products in a single order
power.users <- order_products_prior %>% 
  group_by(order_id) %>% 
  summarize(reordered.prop = mean(reordered),n=n()) %>% 
  right_join(filter(orders,order_number>2), by="order_id")
# take out first order (reordered response column is irrelevant fir first orders)
# for each order, calculating reordered.prop, then joining with orders table to add reordered.prop and n

power.users2 <- power.users %>% 
  filter(eval_set =="prior") %>% 
  group_by(user_id) %>% 
  summarize(orders_total = sum(reordered.prop==1,na.rm=T), percent_equal = orders_total/n()) %>% 
  filter(percent_equal == 1) %>% 
arrange(desc(orders_total))
# lists users with 100% reorders in the order of highest number of total orders

head(power.users2, n=10)
head(power.users)
user_idorders_totalpercent_equal
9975397 1
5533149 1
10651049 1
11136547 1
7465646 1
17017445 1
1202543 1
16477937 1
3707534 1
11022533 1
order_idreordered.propnuser_ideval_setorder_numberorder_doworder_hour_of_daydays_since_prior_order
473747 0.60000005 1 prior 3 3 12 21
2254736 1.00000005 1 prior 4 4 7 29
431534 0.62500008 1 prior 5 4 15 28
3367565 1.00000004 1 prior 6 2 7 19
550135 1.00000005 1 prior 7 1 9 20
3108588 0.66666676 1 prior 8 1 14 14

User 99753 and 55331 have the highest frequency of orders AND the highest proportions of reorders per individual order.

Below are lists of the specific items ordered across orders for each of the two "Power Users." Notice that the number of products purchased per order is not considered in our identification of the "Power Users."

In [15]:
power.order.numbers <- filter(power.users, user_id == 99753)$order_id
power.products <- order_products_prior %>% 
  filter(order_id %in% power.order.numbers) %>% 
  left_join(products, by="product_id")

head(power.products, n=10)[,1:5]

power.order.numbers <- filter(power.users, user_id == 55331)$order_id
power.products <- order_products_prior %>% 
  filter(order_id %in% power.order.numbers) %>% 
  left_join(products, by="product_id")

head(power.products, n=10)[,1:5]
order_idproduct_idadd_to_cart_orderreorderedproduct_name
46614 27845 1 1 Organic Whole Milk
46614 38689 2 1 Organic Reduced Fat Milk
67223 27845 1 1 Organic Whole Milk
67223 38689 2 1 Organic Reduced Fat Milk
214506 27845 1 1 Organic Whole Milk
214506 38689 2 1 Organic Reduced Fat Milk
240832 27845 1 1 Organic Whole Milk
240832 38689 2 1 Organic Reduced Fat Milk
260804 27845 1 1 Organic Whole Milk
260804 38689 2 1 Organic Reduced Fat Milk
order_idproduct_idadd_to_cart_orderreorderedproduct_name
19195 21333 1 1 Original Whipped Cream Cheese
19195 17744 2 1 Dark Chocolate with A Touch Of Sea Salt
19195 40028 3 1 Deluxe Cinnamon Raisin Bagels
47826 21333 1 1 Original Whipped Cream Cheese
47826 6723 2 1 Palmiers, Petite
47826 17744 3 1 Dark Chocolate with A Touch Of Sea Salt
47826 40028 4 1 Deluxe Cinnamon Raisin Bagels
71579 17744 1 1 Dark Chocolate with A Touch Of Sea Salt
71579 40028 2 1 Deluxe Cinnamon Raisin Bagels
115171 17744 1 1 Dark Chocolate with A Touch Of Sea Salt

What's Next??

Predictive Analysis

Now that we have a better picture about the data we're working with, we can begin to use our best judgement to train models based on our historical data and test the models to see which ones work best on unseen data.

Preprocessing

Before diving into the models, we must first reorganize the data to be able to easily fit the models using our data. For smoother computability, we will decrease the size of the overall data, Combined.orders by 1/4 using random sampling. We will merge the aisles and department data sets at a later point.

In [3]:
All.Orders = merge(order_products_prior, orders)
All.Orders2 = merge(order_products_train, orders, by="order_id")
Combined.Orders = rbind(All.Orders,All.Orders2)
Combined.Orders <- merge(Combined.Orders, products, by = "product_id")
Combined.Orders$reordered <- as.factor(Combined.Orders$reordered)

set.seed(1)

subset <- sample(nrow(Combined.Orders),nrow(Combined.Orders)/4 )
Combined.Orders <- Combined.Orders[subset,]
nrow(Combined.Orders)
2587444
In [4]:
attach(Combined.Orders)
Combined.Orders$department_id <- as.factor(Combined.Orders$department_id)
Combined.Orders$user_id <- as.factor(Combined.Orders$user_id)

# Sort Combined.Orders by user_id
Combined.Orders <- Combined.Orders[order(user_id),]

Main Approach to Organizing the Variables

1. Feature Engineering

A few new columns we created for the purpose of evaluating association of particular variables (specific users and productss) with the reordered response

  1. user_prop_reordered: proportion of products reordered for each user for all products
  2. total_orders: total number of orders for each user
  3. reordered.Num: total number of reorders per product
  4. proportion_reordered: proportion of reorders per product and per user

2. Selection of Variables

  • Some of the variables we were able to automatically take out, since they were either originally created for referential purposes only (ie eval_set, user_id, order_id, order_number, product_name)
  • Others we took out since we had already created information about the users and proportion of reordered items across for each product
  • Out of aisle, product, and departmental categories for the products, we decided to keep only one of these variables, since they are all correlated and related to one another

3. Null Values: Non-random Missing Data

  • The variable days_since_prior_order had a null value assigned to the first order made for every customer (since this variable is measured as a relative unit of time between orders). We decided to omit all of these first orders, since our response variable of specific reordered products is also measured relative to the products purchased in the first orders.

Creation of Dataframes for Proportion Reordered per User (User.Reordered and User.data)

In [5]:
User.data <- data.frame(unique(Combined.Orders[,5]))

User.Reordered <-Combined.Orders %>%
  group_by(user_id) %>%
  count(reordered)

User.Reordered[1:14,]
A tibble: 14 × 3
user_idreorderedn
<fct><fct><int>
10 2
1116
2027
2132
30 6
3121
40 4
41 1
50 9
51 3
60 6
61 1
7011
7141
In [6]:
# Create list of unique user_id's
id.list <- unique(Combined.Orders$user_id)
re.prop <- c()

# Calculating proportion of items reordered for each user in id.list
for (userid in id.list)
{
  subset <- User.Reordered[User.Reordered$user_id==userid,]
  if (nrow(subset)==1){
    if (subset$reordered == 0){
      pro = 0
    }
    else {
      pro = 1
    }
    } 
  else
  {
    total <- sum(subset$n)
    pro <- subset$n[subset$reordered==1]/total
  }
  re.prop <- append(re.prop, pro)
}
In [7]:
User.data <- data.frame(user_id=id.list, reorder.prop=re.prop)
In [18]:
head(User.data, n=10)
A data.frame: 10 × 2
user_idreorder.prop
<fct><dbl>
1 0.7368421
2 0.5106383
3 0.7000000
4 0.1428571
5 0.3636364
6 0.0000000
7 0.6562500
8 0.2000000
9 0.4193548
100.2888889
In [8]:
Combined.Orders$reordered.num <- as.numeric(Combined.Orders$reordered)-1

#########   User  Proportion Reorders
group_userid <- group_by(Combined.Orders, Combined.Orders$user_id)

User_Reorders = summarise(group_userid, total_orders=n())
User_Reorders$user_prop_reordered = User.data$reorder.prop
In [46]:
head(User_Reorders)
A tibble: 6 × 3
total_ordersuser_prop_reordereduser_id
<int><dbl><fct>
190.73684211
470.51063832
200.70000003
70.14285714
110.36363645
10.00000006

Creation of Dataframe for Proportion Reordered per Product (item_reorders)

In [9]:
item_reorders <- Combined.Orders %>%
  group_by(product_id) %>%
  summarize(n=n())

# Create list of unique product_id's
prodid.list <- (unique(Combined.Orders$product_id))
prodid.list <- sort(prodid.list)

re.prop <- c()
In [22]:
head(item_reorders)
A tibble: 6 × 2
product_idn
<int><int>
1147
2 10
3 37
4 21
7 6
8 7
In [ ]:
# Sort Combined.Orders by product_id for easier parsing in for loop below
Combined.Orders <- Combined.Orders[order(Combined.Orders$product_id),]
attach(Combined.Orders)
In [11]:
for (prodid in prodid.list)
{
  subset <- Combined.Orders[Combined.Orders$product_id==prodid,]
  if (nrow(subset)==1){
    if (subset$reordered == 0){
      pro=0
    }
    else {
      pro = 1
    }
  } 
  else
  {
    total <- nrow(subset)
    pro <- nrow(subset[subset$reordered==1,])/total
    if (length(pro) == 0) {
      pro= 0
    }
  }
  re.prop <- append(re.prop, pro)
}
In [12]:
item_reorders$proportion_reordered <- re.prop
ProductCounts <- item_reorders
In [13]:
ProductCounts$product_orders <- ProductCounts$n
ProductCounts<- ProductCounts[,-2]
In [32]:
head(ProductCounts)
A tibble: 6 × 3
product_idproportion_reorderedproduct_orders
<int><dbl><int>
10.5986395147
20.2000000 10
30.7567568 37
40.3809524 21
70.3333333 6
80.2857143 7

Completing the Dataset

In [14]:
# Merge Combined.Orders with ProductCounts
Combined.Orders <- merge(ProductCounts, Combined.Orders, by = "product_id")
In [15]:
# Merge Combined.Orders with user_Reorders
User_Reorders$user_id <- User_Reorders$'Combined.Orders$user_id'
User_Reorders <- User_Reorders[,-1]
Combined.Orders <- merge(User_Reorders, Combined.Orders , by = "user_id")
Combined.Orders$order_dow <- as.factor(Combined.Orders$order_dow)
names(Combined.Orders)[6] <- 'product_orders_total'
In [16]:
Combined.Orders$reordered <- as.factor(Combined.Orders$reordered)
Combined.Orders$department_id = as.factor(Combined.Orders$department_id)
In [48]:
head(Combined.Orders)
A data.frame: 6 × 20
user_idtotal_orders.xuser_prop_reordered.xproduct_idproportion_reorderedproduct_orderstotal_orders.yuser_prop_reordered.yproduct_orders_totalorder_idadd_to_cart_orderreorderedorder_numberorder_doworder_hour_of_daydays_since_prior_orderproduct_nameaisle_iddepartment_idreordered.num
<fct><int><dbl><int><dbl><int><int><dbl><int><int><int><fct><int><fct><int><int><fct><int><fct><dbl>
1190.7368421 1960.78150983007190.73684213007255036211104 830Soda 777 1
1190.7368421251330.7469880 498190.7368421 498336756541 62 719Organic String Cheese 21161
1190.7368421251330.7469880 498190.7368421 498 47374740 331221Organic String Cheese 21160
1190.7368421396570.7791563 403190.7368421 403255036230104 830Milk Chocolate Almonds 45190
1190.7368421396570.7791563 403190.7368421 403118789951114 814Milk Chocolate Almonds 45191
1190.7368421389280.80626221022190.73684211022118789931114 8140% Greek Strained Yogurt120161
In [63]:
str(Combined.Orders)
'data.frame':	2428182 obs. of  18 variables:
 $ user_id               : Factor w/ 62866 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ total_orders          : int  19 19 19 19 19 19 19 19 19 19 ...
 $ user_prop_reordered   : num  0.737 0.737 0.737 0.737 0.737 ...
 $ product_id            : int  196 25133 25133 39657 39657 38928 49235 12427 13176 10258 ...
 $ proportion_reordered  : num  0.782 0.747 0.747 0.779 0.779 ...
 $ product_orders        : int  3007 498 498 403 403 1022 5751 501 30440 140 ...
 $ product_orders_total  : int  3007 498 498 403 403 1022 5751 501 30440 140 ...
 $ order_id              : int  2550362 3367565 473747 2550362 1187899 1187899 1187899 2254736 431534 2254736 ...
 $ add_to_cart_order     : int  1 4 4 3 5 3 10 2 8 3 ...
 $ reordered             : Factor w/ 2 levels "0","1": 2 2 1 1 2 2 2 2 2 2 ...
 $ order_number          : int  10 6 3 10 11 11 11 4 5 4 ...
 $ order_dow             : Factor w/ 7 levels "0","1","2","3",..: 5 3 4 5 5 5 5 5 5 5 ...
 $ order_hour_of_day     : int  8 7 12 8 8 8 8 7 15 7 ...
 $ days_since_prior_order: int  30 19 21 30 14 14 14 29 28 29 ...
 $ product_name          : Factor w/ 49688 levels ".5\\\" Waterproof Tape",..: 41347 31986 31986 25489 25489 31 30307 32766 3460 35029 ...
 $ aisle_id              : int  77 21 21 45 45 120 53 23 24 117 ...
 $ department_id         : Factor w/ 21 levels "1","2","3","4",..: 7 16 16 19 19 16 16 19 4 19 ...
 $ reordered.num         : num  1 1 0 0 1 1 1 1 1 1 ...

Eliminating the null values (in days_since_prior_order) as was discussed in previous section (#3 in Preprocessing section)

In [17]:
Combined.Orders <- na.omit(Combined.Orders)
In [28]:
head(Combined.Orders, n=10)
A data.frame: 10 × 18
user_idtotal_ordersuser_prop_reorderedproduct_idproportion_reorderedproduct_orders_totalorder_idadd_to_cart_orderreorderedeval_setorder_numberorder_doworder_hour_of_daydays_since_prior_orderproduct_nameaisle_iddepartment_idreordered.num
<fct><int><dbl><int><dbl><int><int><int><fct><fct><int><fct><int><int><fct><int><fct><dbl>
1180.8888889124270.7584830 501 55013531prior 71 920Original Beef Jerky 23191
1180.8888889124270.7584830 501336756521prior 62 719Original Beef Jerky 23191
1180.8888889124270.7584830 501 43153421prior 541528Original Beef Jerky 23191
1180.8888889396570.7667494 403255036230prior104 830Milk Chocolate Almonds 45190
1180.8888889130320.6328671 286255036281prior104 830Cinnamon Toast Crunch 121141
1180.8888889 1960.78350523007310858821prior 811414Soda 777 1
1180.8888889396570.7667494 403118789951train114 814Milk Chocolate Almonds 45191
1180.8888889251330.7309237 498255036251prior104 830Organic String Cheese 21161
1180.8888889264050.4811321 106225473651prior 44 729XL Pick-A-Size Paper Towel Rolls 54171
1180.8888889171220.67832851029 43153460prior 541528Honeycrisp Apples 244 0

Modeling

Splitting the Data

Now that we have reorganized the data by creating additional variables to consider that give us information about various proportions of reordered products, we can now split the data into a training set to train the data and a test/validation set to test the generalizability of our model.

In [18]:
set.seed(1)
Train<- sample(1:nrow(Combined.Orders), nrow(Combined.Orders)*.8)
Training.Orders <- Combined.Orders[Train,]
Test.Orders <- Combined.Orders[-Train,]
In [18]:
nrow(Training.Orders)
nrow(Test.Orders)
1942400
485601

Model 1a: Logistic Regression

We can apply a logistic regression model to this data because we are looking to make predictions on a binomial response variable (0 for a purchased product that was not previously ordered and 1 for one that was).

The first model below indicates that the subset of variables discussed in the preprocessing section (user_prop_reordered, proportion_reordered, product_orders_total, total_orders, add_to_cart_order, order_dow, days_since_prior_order) and WITHOUT the department_id indicator columns all have coefficients that have significant associations with the reordered variable (ie p-values are all much smaller than the significance level of .05)

In [234]:
Logistic.TestModel1 <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order, family = binomial, data = Training.Orders)
summary(Logistic.TestModel1)
Call:
glm(formula = reordered ~ user_prop_reordered + proportion_reordered + 
    product_orders_total + total_orders + add_to_cart_order + 
    order_dow + days_since_prior_order, family = binomial, data = Training.Orders)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9123  -0.8964   0.4977   0.8079   3.3840  

Coefficients:
                         Estimate Std. Error  z value Pr(>|z|)    
(Intercept)            -4.217e+00  1.106e-02 -381.392  < 2e-16 ***
user_prop_reordered     4.288e+00  1.238e-02  346.384  < 2e-16 ***
proportion_reordered    4.274e+00  1.298e-02  329.200  < 2e-16 ***
product_orders_total    7.885e-06  3.715e-07   21.228  < 2e-16 ***
total_orders            1.557e-03  2.695e-05   57.778  < 2e-16 ***
add_to_cart_order      -5.416e-02  2.588e-04 -209.283  < 2e-16 ***
order_dow1             -2.985e-02  5.705e-03   -5.232 1.68e-07 ***
order_dow2             -7.870e-02  6.176e-03  -12.742  < 2e-16 ***
order_dow3             -8.478e-02  6.342e-03  -13.367  < 2e-16 ***
order_dow4             -8.205e-02  6.384e-03  -12.854  < 2e-16 ***
order_dow5             -6.011e-02  6.174e-03   -9.736  < 2e-16 ***
order_dow6             -2.330e-02  6.014e-03   -3.875 0.000107 ***
days_since_prior_order  6.650e-03  2.091e-04   31.805  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2560373  on 1942399  degrees of freedom
Residual deviance: 2033676  on 1942387  degrees of freedom
AIC: 2033702

Number of Fisher Scoring iterations: 4

Model 1b: Logistic Regression

Similarly, the logistic regression model that considers the same variables used in the previous model AS WELL AS the department indicator variables have coefficients that are almost all significant. From the results below, we can see that there are two departments with insignificant coefficients to the response variable. These two departments are department 12 (meat/seafood) and department 15 (canned goods).

In [235]:
Logistic.TestModel2 <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, family = binomial, data = Training.Orders)
summary(Logistic.TestModel2)
Call:
glm(formula = reordered ~ user_prop_reordered + proportion_reordered + 
    product_orders_total + total_orders + add_to_cart_order + 
    order_dow + days_since_prior_order + department_id, family = binomial, 
    data = Training.Orders)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.9205  -0.8958   0.4976   0.8076   3.3781  

Coefficients:
                         Estimate Std. Error  z value Pr(>|z|)    
(Intercept)            -4.103e+00  1.324e-02 -309.872  < 2e-16 ***
user_prop_reordered     4.302e+00  1.240e-02  346.928  < 2e-16 ***
proportion_reordered    4.169e+00  1.567e-02  265.994  < 2e-16 ***
product_orders_total    1.012e-05  4.252e-07   23.798  < 2e-16 ***
total_orders            1.564e-03  2.698e-05   57.975  < 2e-16 ***
add_to_cart_order      -5.408e-02  2.596e-04 -208.312  < 2e-16 ***
order_dow1             -2.859e-02  5.708e-03   -5.009 5.48e-07 ***
order_dow2             -7.775e-02  6.181e-03  -12.579  < 2e-16 ***
order_dow3             -8.400e-02  6.347e-03  -13.234  < 2e-16 ***
order_dow4             -8.160e-02  6.389e-03  -12.772  < 2e-16 ***
order_dow5             -6.033e-02  6.179e-03   -9.763  < 2e-16 ***
order_dow6             -2.378e-02  6.016e-03   -3.953 7.73e-05 ***
days_since_prior_order  6.638e-03  2.092e-04   31.725  < 2e-16 ***
department_id2         -1.372e-01  5.501e-02   -2.495  0.01260 *  
department_id3         -3.248e-02  1.107e-02   -2.934  0.00335 ** 
department_id4         -8.767e-02  7.419e-03  -11.817  < 2e-16 ***
department_id5          2.524e-01  2.620e-02    9.637  < 2e-16 ***
department_id6         -7.945e-02  1.991e-02   -3.991 6.59e-05 ***
department_id7         -4.397e-02  8.943e-03   -4.917 8.80e-07 ***
department_id8          8.658e-02  3.257e-02    2.658  0.00785 ** 
department_id9         -2.802e-02  1.183e-02   -2.368  0.01790 *  
department_id10        -2.777e-01  5.222e-02   -5.318 1.05e-07 ***
department_id11        -2.130e-01  1.641e-02  -12.975  < 2e-16 ***
department_id12        -7.148e-03  1.285e-02   -0.556  0.57791    
department_id13        -1.695e-01  9.798e-03  -17.301  < 2e-16 ***
department_id14        -9.672e-02  1.286e-02   -7.522 5.38e-14 ***
department_id15         5.436e-03  1.099e-02    0.495  0.62078    
department_id16        -3.436e-02  7.847e-03   -4.379 1.19e-05 ***
department_id17        -1.236e-01  1.280e-02   -9.653  < 2e-16 ***
department_id18        -1.762e-01  1.596e-02  -11.045  < 2e-16 ***
department_id19        -1.079e-01  8.447e-03  -12.772  < 2e-16 ***
department_id20        -4.875e-02  1.132e-02   -4.307 1.65e-05 ***
department_id21        -7.725e-02  3.548e-02   -2.177  0.02945 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2560373  on 1942399  degrees of freedom
Residual deviance: 2032748  on 1942367  degrees of freedom
AIC: 2032814

Number of Fisher Scoring iterations: 4

Insignificant Variables

The coefficients for the indicator variables, department_id==12 (meat/seafood) and department_id==15 (canned goods) have large p-values. In these cases where the null hypotheses refer to there being no effect on the response variable from these department indicator variables, we cannot reject the null hypothesis, since the probability that the data is a result of the null hypothesis is very high.

We will conclude that the meat/seafood and canned goods departments do not show significant contribution to the variance of the response outcomes. This means that products purchased in these departments are usually not likely to affect whether or not the item is a reordered product or not. Interpretations for this insignificance can vary, and may or may not involve the rates of which these grocery products spoil (ie, bulk items do not have sensitive timelines that they have to be used and meats/seafoods have especially sensitive timelines). All in all, further exploration of the data would need to be completed in order to come to stronger insights about why these particular departments lack significance to the response variable over others.

For a glimpse of the specific types of products in these categories, there are lists of the most popular items in these categories as well as bar charts measuring the response variables per department below.

In [249]:
dept12 <- Combined.Orders[Combined.Orders$department_id == 12,]

reorder.table12<- dept12 %>% 
  count(reordered) %>%            
  mutate(prop = prop.table(n))    

as.data.frame(reorder.table12)   

reorder12 <- dept12 %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))

reorder12 %>% 
  ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity")+ ggtitle("Proportion of Reordered Products \n in Dept 12 (meat/seafood)") + theme(plot.title = element_text(hjust = 0.5))
A data.frame: 2 × 3
reorderednprop
<fct><int><dbl>
0205970.390961
1320860.609039
In [96]:
item_list12 <- unique(dept12$product_id)

prop12<- c()
n = c()

for (prodid in item_list12){
    subset <- dept12[dept12$product_id == prodid,]
    avg = (nrow(subset[subset$reordered==1,]))/nrow(subset)
    
    prop12 <- append(prop12, avg)
    n = append(n, nrow(subset))
}

dept12proportions <- data.frame(item_list12,prop12,n)
names(dept12proportions)[1]<- 'product_id'
dept12proportions<- merge(dept12proportions, products, by='product_id')
dept12proportions<-dept12proportions %>% arrange(desc(dept12proportions$n))

head(dept12proportions, n=15)
A data.frame: 15 × 6
product_idprop12nproduct_nameaisle_iddepartment_id
<int><dbl><int><fct><int><int>
258900.72035913787Boneless Skinless Chicken Breasts 4912
382930.68151022066Ground Turkey Breast 3512
60460.70207981779Boneless Skinless Chicken Breast 3512
476720.66957691489Uncured Hickory Smoked Sunday Bacon 10612
56460.68352371226Organic Turkey Bacon 10612
456330.62103511198Organic Beef Hot Dogs 10612
174610.68143101174Air Chilled Organic Boneless Skinless Chicken Breasts 3512
100170.64936251098Tilapia Filet 3912
93390.69472711081Organic Chicken & Apple Sausage 10612
276950.69516731076Natural Chicken & Sage Breakfast Sausage 10612
131980.6952880 95585% Lean Ground Beef 12212
124560.6473214 896Organic Sunday Bacon 10612
44210.5612245 882Organic Beef Uncured Hot Dogs 10612
219270.7262639 811Organic Boneless Skinless Chicken Breast 3512
90200.6377171 806Boneless Skinless Chicken Thighs 3512
In [250]:
dept15 <- Combined.Orders[Combined.Orders$department_id == 15,]

reorder.table15<- dept15 %>% 
  count(reordered) %>%            
  mutate(prop = prop.table(n))    

as.data.frame(reorder.table15)   

reorder15 <- dept15 %>% 
  group_by(reordered) %>% 
  summarize(count = n()) %>% 
  mutate(proportion = count/sum(count))

reorder15 %>% 
  ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity") + ggtitle("Proportion of Reordered Products \n in Dept 15 (canned goods)") + theme(plot.title = element_text(hjust = 0.5))
A data.frame: 2 × 3
reorderednprop
<fct><int><dbl>
0409290.5098154
1393530.4901846
In [107]:
item_list15 <- unique(dept15$product_id)

prop15<- c()
n = c()

for (prodid in item_list15){
    subset <- dept15[dept15$product_id == prodid,]
    avg = (nrow(subset[subset$reordered==1,]))/nrow(subset)
    
    prop15 <- append(prop15, avg)
    n = append(n, nrow(subset))
}

dept15proportions <- data.frame(item_list15,prop15,n)
names(dept15proportions)[1]<- 'product_id'
dept15proportions<- merge(dept15proportions, products, by='product_id')
dept15proportions<-dept15proportions %>% arrange(desc(dept15proportions$n))

head(dept15proportions, n=15)
A data.frame: 15 × 6
product_idprop15nproduct_nameaisle_iddepartment_id
<int><dbl><int><fct><int><int>
271560.61827592900Organic Black Beans 5915
427680.58311232262Organic Garbanzo Beans 5915
288490.63485282106No Salt Added Black Beans 5915
8900.50334751643Organic Diced Tomatoes 8115
70210.44500721382Organic Tomato Paste 8115
455350.49690881294Organic Low Sodium Chicken Broth 6915
267900.75703941243Organic AppleApple 9915
86700.47508901124Diced Tomatoes 8115
467200.46699511015Organic Free Range Chicken Broth 6915
350420.5390071 987Black Beans 5915
223950.4773196 970Tomato Sauce 8115
90920.5161290 961Organic Vegetable Broth 6915
444490.4796238 957Organic Tomato Sauce 8115
33760.5164718 941Organic Whole Kernel Sweet Corn No Salt Added8115
114080.5096701 879Crushed Tomatoes 8115

Model Performance

Prediction and Accuracy

In [ ]:
attach(Combined.Orders)
set.seed(1)

Accuracy <- function(table)
{
 n11 <- table[1,1] 
 n22 <- table[2,2]
 Total <- table[1,1]+table[2,2]+table[2,1]+table[1,2]
 Total
 return((n11+n22)/Total)
}
In [21]:
ProbM1 <- predict.glm(Logistic.TestModel1, newdata = Test.Orders, type = "response")
PredM1 <- ifelse(ProbM1 > .5, "1" , "0")
TableM1 <- table(PredM1, Test.Orders$reordered)
TableM1
Accuracy(TableM1)
      
PredM1      0      1
     0  95350  42530
     1  84329 263392
0.738758775208453
In [23]:
print('Sensitivity:')
sensitivity(TableM1)
print('Specificity:')
specificity(TableM1)
[1] "Sensitivity:"
0.530668581192015
[1] "Specificity:"
0.860977634821948
In [22]:
ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .5, "1" , "0")

TableM2 <- table(PredM2, Test.Orders$reordered) 
TableM2

Accuracy(TableM2)
      
PredM2      0      1
     0  95348  42445
     1  84331 263477
0.738929697426488
In [24]:
print('Sensitivity:')
sensitivity(TableM2)
print('Specificity:')
specificity(TableM2)
[1] "Sensitivity:"
0.530657450230689
[1] "Specificity:"
0.861255483423879
In [96]:
# DIFFERENT?? WHY
ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .5, "1" , "0")

TableM2 <- table(PredM2, Test.Orders$reordered) 
TableM2

Accuracy(TableM2)
      
PredM2      0      1
     0  95348  42445
     1  84331 263477
0.738929697426488

The accuracy results of the two models are very close; however, the second model (the model including the department indicator variables) is marginally better by 2e-4.

The sensitivity measures the proportion of correctly classified "success" responses (the true positive rates or the number of correctly classified reorder=1 outcomes) out of the true number of "success" responses.

The specificity measures the proportion of correctly classified "failure" responses (the true negatie rates or the number of correctly classified reorder=0 outcomes) out of the true number of "failure" responses.

From the results above, we get that

ROC Curve

When interpreting the classification error rates, how can we be sure that the best decision boundary for identifying a reorder=1 response should be when the likelihood, p=success > .5?

Depending on the business costs of Type I errors (false positive, ie a product/user/time observation is incorrectly recommended/estimated to reorder the product when in reality they would not reorder the product)and Type II errors (false negative, ie a product/user/time observation is not estimated to reorder the product when in reality they would want to reorder the product at thist time), we may want to adjust our decision boundary thresholds in order to minimize the two types of possible errors that can occur.

Thus, we will use a Receiver Operating Characteric (ROC) curve, which plots the true positive and false positive rates of the data at various decision boundary threshold settings.

In [25]:
library(Epi)
In [121]:
ROC(form=(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id) , data=Training.Orders, plot=
"ROC", MX=TRUE, MI=FALSE, main="ROC Curve")

According to the results above, we can see that the optimal decision boundary threshold for our data is at p(reordered=1) = .639.

With this new information, we can recalculate our accuracy, sensitivity, and specificity rates of our logistic regression model. As we can see below, the accuracy rate has decreased from 0.7389 to .7186; however, the important difference to note here is that the sensitivity and specificity are now closer to eachother (where previously they were 0.5307 and 0.8613 respectively).

All in all, the optimal decision boundary should be chosen with consideration of several business implications. Thus, for our purposes, we will leave this question aside and use the .5 decision threshold and associated accuracy rate for comparisons with the other two models.

In [27]:
ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .639, "1" , "0")

TableM2 <- table(PredM2, Test.Orders$reordered) 
TableM2

Accuracy(TableM2)
      
PredM2      0      1
     0 126675  83639
     1  53004 222283
0.718610546518644
In [28]:
print('Sensitivity:')
sensitivity(TableM2)
print('Specificity:')
specificity(TableM2)
[1] "Sensitivity:"
0.705007262952265
[1] "Specificity:"
0.72660024450677

Model 2: Classification Tree

We will next look at the classification tree technique, which makes predictions about the response variable based on the mean responses of the training observations that belong to the leaves (or endpoints).

The tree model below, Tree.TestModel, uses the same predictors as the ones used in the previous logistic regression model, Logistic.TestModel2.

In [59]:
library(rpart)
In [60]:
Tree.TestModel <- rpart(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,method = "class", data = Training.Orders)
In [93]:
plot(Tree.TestModel, uniform=TRUE, margin=.05,
   main="Classification Tree for Reordered")
text(Tree.TestModel, use.n=TRUE, all=TRUE, cex=.57)

Cross validation using complexity parameter (cp)

In [105]:
Tree.TestModelR <- capture.output(summary(Tree.TestModel))
plotcp(Tree.TestModel)

The graph above and outputs below show that the relative error decreases as the complexity parameter decreases. We can see that the optimal complexity parameter value associated with the minimum error is .01.

In [129]:
printcp(Tree.TestModel) # display the results 

Tree.TestModel$cptable
Classification tree:
rpart(formula = reordered ~ user_prop_reordered + proportion_reordered + 
    product_orders_total + total_orders + add_to_cart_order + 
    order_dow + days_since_prior_order + department_id, data = Training.Orders, 
    method = "class")

Variables actually used in tree construction:
[1] proportion_reordered user_prop_reordered 

Root node error: 719122/1942400 = 0.37022

n= 1942400 

        CP nsplit rel error  xerror       xstd
1 0.155263      0   1.00000 1.00000 0.00093582
2 0.029823      1   0.84474 0.84474 0.00089851
3 0.017304      3   0.78509 0.78448 0.00087981
4 0.010000      5   0.75048 0.74991 0.00086792
A matrix: 4 × 5 of type dbl
CPnsplitrel errorxerrorxstd
0.1552629501.00000001.00000000.0009358192
0.0298231710.84473710.84474400.0008985051
0.0173037430.78509070.78448020.0008798050
0.0100000050.75048320.74990610.0008679229
In [111]:
Tree.TestModel$cptable[which.min(Tree.TestModel$cptable[,'xerror']), 'CP']
0.01

Pruning

Normally, we would want to prune the tree in order to prevent overfitting and decrease the amount of variance from our original tree; however, the optimal CP value came out to be the same as the default CP value for the rpart tree method. Thus, we will keep the original tree model built.

In [140]:
Pred.Tree <- predict(Tree.TestModel, newdata = Test.Orders, type = "class")
Tree.Table <- table(Pred.Tree, Test.Orders$reordered)
Tree.Table

Accuracy(Tree.Table)
         
Pred.Tree      0      1
        0  71961  27142
        1 107718 278780
0.722282285250648
In [126]:
print('Sensitivity:')
sensitivity(Tree.Table)
print('Specificity:')
specificity(Tree.Table)
[1] "Sensitivity:"
0.400497553971249
[1] "Specificity:"
0.911278038192742

Random Forest

To improve upon the tree method used above, a random forest allows a way to make decorrelated trees to lower the variance of the model. By forcing each split in the tree to be selected based on only a subset of all of the predictors available for use, the random forest prevents a dominant predictor from ruling the results of the trees built.

As demonstrated below, the randomForest function from the R package will not run on large data frames. In order for the randomForest to run, we had to cut the data by 50% (with 971200 observations).

In [51]:
library(randomForest)

# First attempt: Error from the training set being too large :(
randomForest.fit <- randomForest(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,   data=Training.Orders)
Error in randomForest.default(m, y, ...): long vectors (argument 26) are not supported in .C
Traceback:

1. randomForest(reordered ~ user_prop_reordered + proportion_reordered + 
 .     product_orders_total + total_orders + add_to_cart_order + 
 .     order_dow + days_since_prior_order + department_id, data = Training.Orders)
2. randomForest.formula(reordered ~ user_prop_reordered + proportion_reordered + 
 .     product_orders_total + total_orders + add_to_cart_order + 
 .     order_dow + days_since_prior_order + department_id, data = Training.Orders)
3. randomForest.default(m, y, ...)
In [52]:
# Split the data in half

set.seed(1)
subset <- sample(1:nrow(Training.Orders), nrow(Training.Orders)/2)
Training.Orders2 <- Training.Orders[subset,]
In [53]:
nrow(Training.Orders2)
971200
In [54]:
# Second attempt
randomForest.fit2 <- randomForest(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,   data=Training.Orders2)
In [55]:
print(randomForest.fit2) 
importance(randomForest.fit2)
Call:
 randomForest(formula = reordered ~ user_prop_reordered + proportion_reordered +      product_orders_total + total_orders + add_to_cart_order +      order_dow + days_since_prior_order + department_id, data = Training.Orders2) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 26.07%
Confusion matrix:
       0      1 class.error
0 191642 168236   0.4674806
1  84935 526387   0.1389366
A matrix: 8 × 1 of type dbl
MeanDecreaseGini
user_prop_reordered100512.53
proportion_reordered 81943.24
product_orders_total 53429.74
total_orders 62535.25
add_to_cart_order 40436.10
order_dow 32692.11
days_since_prior_order 38478.58
department_id 32674.32

Using the random forest method, our model has a test error rate of 26.23% and an accuracy rate of 73.77%. Also by calculating the mean decrease in Gini index for each predictor relative to the varible with the largest decrease, we can see that the variables with the highest importance are user_prop_reordered, proportion_reordered, product_order_total, and total orders.

In [58]:
varImpPlot(randomForest.fit2, main='Importance Plot of Random Forest')

The good news is that the resulting accuracy rates from the original tree model came out to 72.22% and from the random forest model was 73.77, which is comparable to the 74% accuracy of the logistic regression fits computed above.

The bad news is that the trees mostly use information based off of our newly engineered variables, proportion_reordered and user_prop_reordered, based off of the importance parameters provided above. Because these variables indicate aggregated information about the products and users who have purchased these products, we cannot easily identify helpful information about the various time and departmental features around these products.

Thus, although our tree model may have a higher accuracy rate, more work needs to be done to be able to use this model for industry purposes.

Model 3: K Nearest Neighbor (KNN) Classification

Next we will look at K Nearest Neighbor (KNN) classification, which applies distance formulas to locate points that are nearest to our training data points. Once this is done, a prediction model is fitted using the labels that are the most frequently out of each k neighbor group for each data point.

An important piece of information to consider is that the KNN classification requires that all predictors used in the model to be numeric. Thus, we will take out of consideration the following variables, which are not numeric or ordinal: user_id, eval_set, order_number, order_dow, product_name, product_id, aisle_id, department_id)

As for the ordinal categorical predictor variables (ie total_orders, product_orders_total, add_to_cart_order, order_hour_of_day), we will coerce these variables to be continuous instead of discrete.

In [25]:
library(MASS)
library(class)
In [26]:
train.y <- Training.Orders[9]
train.X <- Training.Orders[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]

test.X <- Test.Orders[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]
In [27]:
train.y = train.y[,1]
In [32]:
nrow(train.X)
length(train.y)
nrow(test.X)
1942400
1942400
485601
In [34]:
head(train.X)
A data.frame: 6 × 8
total_ordersuser_prop_reorderedproportion_reorderedproduct_orders_totalorder_idadd_to_cart_orderorder_hour_of_daydays_since_prior_order
<int><dbl><dbl><int><int><int><int><int>
2367507 300.70000000.6840405111471565475 5 030
15828301470.71428570.6295110 8119 182523201313
912381 330.66666670.4566474 1733119618 3 630
2261718 310.74193550.7200675 5932512048 612 8
14319802970.76094280.7500000 24 7633571114 7
682798 840.65476190.6628571 8752902612141314
In [ ]:
attach(train.X)

Often with KNN, the scale of the predictor variables should be considered so that variables with larger numbers will not dominant the classification over those with smaller values (since distances between points are evaluated to locate the K nearest neighbors).

Thus, below we make sure to normalize each of the variables used in the KNN classification training and test sets.

In [29]:
normalize <- function(x){
    return((x-min(x)) / (max(x)-min(x)))    
}
In [160]:
train.X.normalized <- as.data.frame(lapply(train.X, normalize))
test.X.normalized <- as.data.frame(lapply(test.X, normalize))
In [174]:
set.seed(1)

# Arbitrarily chose a K value of 5 here, which is a common choice for industry use
knn.pred = knn(train.X.normalized,test.X.normalized,train.y ,k=5)
In [31]:
test.y <- Test.Orders[9]
test.y = test.y[,1]
In [41]:
knn.table<- table(knn.pred, test.y)
knn.table
Accuracy(knn.table)
        test.y
knn.pred      0      1
       0  94310  54707
       1  85369 251215
0.711540956464258
In [176]:
print('Sensitivity:')
sensitivity(knn.table)
print('Specificity:')
specificity(knn.table)
[1] "Sensitivity:"
0.524880481302768
[1] "Specificity:"
0.821173370989991
In [45]:
levels(knn.pred)
knn.pred[1:15]
  1. '0'
  2. '1'
  1. 1
  2. 1
  3. 1
  4. 1
  5. 1
  6. 1
  7. 0
  8. 1
  9. 1
  10. 0
  11. 1
  12. 1
  13. 0
  14. 1
  15. 1
Levels:
  1. '0'
  2. '1'
In [48]:
knn.table<- table(knn.pred, test.y)
knn.table
Accuracy(knn.table)
        test.y
knn.pred      0      1
       0  94310  54707
       1  85369 251215
0.711540956464258

We get an accuracy rate of 71.15%. This rate is comparable to the accuracies of the other models computed above as well. To improve and tweak the model further, we would want to compute some sort of cross-validation method in order to identify the optimal K number of neighbors to consider when creating the prediction model; however, our we unfortunately do not have

K-Fold Cross Validation to Optimize K for KNN Classification

Cross-Validation is a commonly used method to to test the performance reliability of a model on different groupings of unseen data. For KNN classification, we will consider a variety of different K values that select the K nearest neighbors of the training data. We will compute 10 different KNN classification models using different K values from 1 to 10. We will then select the model with the lowest test/validation error as the final model to use.

In [ ]:
train.normalized <- cbind(train.X.normalized, train.y)
names(train.normalized)[9] <- 'reordered'
attach(train.normalized)
In [116]:
# In order to make the training and test samples the same size for cross-validation evaluation
set.seed(1)
Training.Orders2 <- Training.Orders[sample(1:nrow(train.X.normalized), 485601),]
485601
In [117]:
train.y2 <- Training.Orders2[9]
train.X2 <- Training.Orders2[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]
train.y2 = train.y2[,1]
In [119]:
train.X.normalized2 <- as.data.frame(lapply(train.X2, normalize))
In [121]:
set.seed(1)

# Creating the 5 lists of indices that represent the 5 folds of the cross validation
idx <- createFolds(train.y2, k=5)
In [64]:
idx[[1]]
  1. 10
  2. 14
  3. 18
  4. 20
  5. 26
  6. 29
  7. 31
  8. 48
  9. 53
  10. 54
  11. 60
  12. 72
  13. 73
  14. 84
  15. 85
  16. 91
  17. 94
  18. 95
  19. 97
  20. 99
In [122]:
set.seed(1)

ks<- 1:10

res <- sapply(ks, function(k) {
  # try out each version of k from 1 to 10
  res.k <- sapply(seq_along(idx), function(i) {
    # looping over each of the 5 cross-validation folds to create predicted classifications of the training data
    pred <- knn(train=train.X.normalized2[-idx[[i]], ],
                test=test.X.normalized[idx[[i]], ],
                cl=train.y2[-idx[[i]]], k = k)
    # calculation of ratio of misclassified training samples
    mean(train.y2[idx[[i]]] != pred)
  })
  # averaging the error rates over the 5 folds per K-value group
  mean(res.k)
})

The outputted values are the error rates of each of the 10 different KNN models (using a range of K values from 1 to 10). From the plot below, we can see that the model with the minimal error considers K=10 nearest neighbors.

In [126]:
res

plot(ks, res, type="o",xlab='K value', ylab="Misclassification Test Error", main='5-Fold Cross Validation Error (for K= 1 to 10)')
  1. 0.460814533256717
  2. 0.462124251621861
  3. 0.451230541581707
  4. 0.450579795905251
  5. 0.446941007679733
  6. 0.446774206991236
  7. 0.444366880201086
  8. 0.44426391424615
  9. 0.442661774452777
  10. 0.442437310042531
In [127]:
which.min(res)
10
In [172]:
res[10]
0.442437310042531
In [168]:
set.seed(1)

knn.pred2 = knn(train.X.normalized,test.X.normalized,train.y ,k=10)
In [171]:
knn.table2<- table(knn.pred2, test.y)
knn.table2
Accuracy(knn.table2)
        test.y
knn.pred      0      1
       0  92965  48051
       1  86714 257871
0.722477919114664
In [173]:
print('Sensitivity:')
sensitivity(knn.table2)
print('Specificity:')
specificity(knn.table2)
[1] "Sensitivity:"
0.517394909811386
[1] "Specificity:"
0.842930550924746

The final accuracy rate of the knn.pred2 model is 72.24%, which is an improvement from the previous accuracy rate of 71.15% from the original knn.pred model using K=5 nearest neighbors. Next let's compare the different models below!

Conclusion:

  • Model 1: Logistic Regression - 0.7389

  • Model 2: Classification Tree - 0.7223 / Random Forest (on 1/2 of data) - .7377

  • Model 3: KNN - 0.7224

From the results above, we can see that the logistic regression model (with the department indicator variables included) gives us the highest accuracy/minimal error rate. Because the accuracy rates are so close, there is a chance that one of the other two models may be able to outperform the logistic regression model through using additional techniques to improve the model (ie bagging or boosting of the classification tree).

Interaction Terms and Understanding:

Another factor to consider is the use of interaction terms. To keep the models simple, we chose to not implement interaction terms; however, since the likelihood of a user reordering a product involves a nuanced relationship between the user to specific products, it would make sense to include interaction terms to account for the variety of relationships between the two variables. Thus, to improve upon all of the three models provided, it is recommended to include interaction terms between product_id's and user_ids's.

Subset Selection:

One may notice that the variables that came out as particularly effective and high in "importance" (according to our importance plot above) were the features that we engineered using aggregated reordered data across user_id's and product_id's; however, as stated in the Classification Tree analysis, there are some inherent problems with considering these variables because by their nature, they will have bias since we used the reordered columns to extract these new columns.

Thus, for further exploration, it is recommended to consider models that use a subset of the variables we have considered. There are various subset selection methods that can applied such as backward and forward stepwise regression, principal component analysis, and lasso and ridge regression that will help eliminate some of the unnecessary variables. Additionally, one should take a look at a correlation map in order to identify some of the highest correlated variables to take out of the model (to minimize the chance of overfitting).

Projects Main Page