Instacart is a grocery ordering and delivery app that aims to make ordering groceries more convenient for both grocery stores and shoppers. The purpose of this competition is to predict which previously purchased products a customer will reorder.
This work could benefit the company by improving its recommendation system, informing inventory and staffing decisions, and yielding better insight into long-term customers who rely on Instacart as part of their regular shopping routines.
The data is separated into six CSV files that describe different aspects of the products hosted on the site. It is a random sample of over 3 million grocery orders from more than 200,000 Instacart users. We will use R statistical software to analyze the data.
The files provide a combination of categorical data, both ordinal and nominal, pertaining to features of the goods being ordered, such as "dairy," "produce," and "organic." Several other variables are included: the store aisle, the sequence in which items are added to the basket, the day of the week, the hour of the day, and the relative time (in days) between orders for each user. Also note that the data sets 'order_products_train' and 'order_products_prior' each contain a column labeled 'reordered' with values of either 0 or 1. We will use this column as our response variable: 0 for items that are not reorders and 1 for items that are.
library(data.table)
library(tidyverse) # loads dplyr, ggplot2, tidyr, and friends
library(knitr)
library(DT)
library(randomForest)
library(ROCR)
library(caret)
library(repr)
options(repr.plot.width=5, repr.plot.height=4)
aisles <- read.csv('data/aisles.csv', header=TRUE)
departments <- read.csv('data/departments.csv', header=TRUE)
products <- read.csv('data/products.csv',header=TRUE)
order_products_train <- read.csv('data/order_products__train.csv', header=TRUE)
order_products_prior <- read.csv('data/order_products__prior.csv', header=TRUE)
orders <- read.csv('data/orders.csv', header=TRUE)
head(products, n=10)
length(unique(products$product_id)) # 49688 total products
head(order_products_train, n=15)
head(orders, n=5)
Quickly recode the variables: convert the character variables to factors and order_hour_of_day to numeric.
orders <- orders %>% mutate(order_hour_of_day = as.numeric(order_hour_of_day), eval_set = as.factor(eval_set))
products <- products %>% mutate(product_name = as.factor(product_name))
aisles <- aisles %>% mutate(aisle = as.factor(aisle))
departments <- departments %>% mutate(department = as.factor(department))
A few quick histograms reveal the standout groups that have more observations than the rest.
orders %>%
ggplot(aes(x=order_dow)) +
geom_histogram(stat="count",fill="slateblue") + ggtitle("Order Day of Week")
orders %>%
ggplot(aes(x=days_since_prior_order)) +
geom_histogram(stat="count",fill="slateblue") + ggtitle("Days Since Last Order (relative days between orders)")
mostcommon <- order_products_train %>%
group_by(product_id) %>%
summarize(count = n()) %>%
top_n(10, wt = count) %>%
left_join(select(products,product_id,product_name),by="product_id") %>%
arrange(desc(count))
head(mostcommon, n=10)
mostcommon %>%
ggplot(aes(x=reorder(product_name,-count), y=count))+
geom_bar(stat="identity",fill="slateblue")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank()) + ggtitle("Most Popular Items Overall")
item_reorders <- order_products_train %>%
group_by(product_id) %>%
summarize(proportion_reordered = mean(reordered), n=n()) %>%
filter(n>40) %>%
top_n(10,wt=proportion_reordered) %>%
arrange(desc(proportion_reordered)) %>%
left_join(products,by="product_id")
head(data.frame(item_reorders), n=15)
item_reorders %>%
ggplot(aes(x=reorder(product_name,-proportion_reordered), y=proportion_reordered))+
geom_bar(stat="identity",fill="slateblue")+
theme(axis.text.x=element_text(angle=90, hjust=1),axis.title.x = element_blank())+coord_cartesian(ylim=c(0.85,0.95)) + ggtitle("Most Popular Reordered Items")
reorder.table<- order_products_train %>%
count(reordered) %>%
mutate(prop = prop.table(n))
as.data.frame(reorder.table)
reorder <- order_products_train %>%
group_by(reordered) %>%
summarize(count = n()) %>%
mutate(proportion = count/sum(count))
reorder %>%
ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity")+ggtitle("Proportion of Reordered / Non-Reordered\n Products")
# First, calculate the proportion of reorders per order and n, the number of products in the order
power.users <- order_products_prior %>%
group_by(order_id) %>%
summarize(reordered.prop = mean(reordered),n=n()) %>%
right_join(filter(orders,order_number>2), by="order_id")
# drop each user's earliest orders (the reordered column is irrelevant for first orders)
# for each order, calculate reordered.prop and n, then join with the orders table to combine them
power.users2 <- power.users %>%
filter(eval_set =="prior") %>%
group_by(user_id) %>%
summarize(orders_total = sum(reordered.prop==1,na.rm=T), percent_equal = orders_total/n()) %>%
filter(percent_equal == 1) %>%
arrange(desc(orders_total))
# lists users with 100% reorders in the order of highest number of total orders
head(power.users2, n=10)
head(power.users)
Below are the specific items ordered across orders for each of the two "Power Users." Note that the number of products purchased per order is not considered in our identification of the "Power Users."
power.order.numbers <- filter(power.users, user_id == 99753)$order_id
power.products <- order_products_prior %>%
filter(order_id %in% power.order.numbers) %>%
left_join(products, by="product_id")
head(power.products, n=10)[,1:5]
power.order.numbers <- filter(power.users, user_id == 55331)$order_id
power.products <- order_products_prior %>%
filter(order_id %in% power.order.numbers) %>%
left_join(products, by="product_id")
head(power.products, n=10)[,1:5]
Now that we have a better picture of the data we're working with, we can use our best judgment to train models on the historical data and test them on unseen data to see which performs best.
Before diving into the models, we must first reorganize the data so that the models can be fit easily. For smoother computation, we will reduce the combined data, Combined.Orders, to one quarter of its original size via random sampling. We will merge in the aisles and departments data sets at a later point.
All.Orders = merge(order_products_prior, orders)
All.Orders2 = merge(order_products_train, orders, by="order_id")
Combined.Orders = rbind(All.Orders,All.Orders2)
Combined.Orders <- merge(Combined.Orders, products, by = "product_id")
Combined.Orders$reordered <- as.factor(Combined.Orders$reordered)
set.seed(1)
subset <- sample(nrow(Combined.Orders),nrow(Combined.Orders)/4 )
Combined.Orders <- Combined.Orders[subset,]
nrow(Combined.Orders)
attach(Combined.Orders)
Combined.Orders$department_id <- as.factor(Combined.Orders$department_id)
Combined.Orders$user_id <- as.factor(Combined.Orders$user_id)
# Sort Combined.Orders by user_id
Combined.Orders <- Combined.Orders[order(user_id),]
Below are a few new columns we created to evaluate the association of particular variables (specific users and products) with the reordered response.
User.data <- data.frame(unique(Combined.Orders[,5]))
User.Reordered <-Combined.Orders %>%
group_by(user_id) %>%
count(reordered)
User.Reordered[1:14,]
# Create list of unique user_id's
id.list <- unique(Combined.Orders$user_id)
re.prop <- c()
# Calculating proportion of items reordered for each user in id.list
for (userid in id.list)
{
subset <- User.Reordered[User.Reordered$user_id==userid,]
if (nrow(subset)==1){
if (subset$reordered == 0){
pro = 0
}
else {
pro = 1
}
}
else
{
total <- sum(subset$n)
pro <- subset$n[subset$reordered==1]/total
}
re.prop <- append(re.prop, pro)
}
User.data <- data.frame(user_id=id.list, reorder.prop=re.prop)
head(User.data, n=10)
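As an aside, the same per-user proportions can be computed without an explicit loop. A minimal dplyr sketch (the User.data.alt name is just for illustration; reordered is the 0/1 factor created earlier):
User.data.alt <- Combined.Orders %>%
group_by(user_id) %>%
summarize(reorder.prop = mean(reordered == "1")) # share of a user's items that are reorders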
Combined.Orders$reordered.num <- as.numeric(Combined.Orders$reordered)-1
######### User Proportion Reorders
group_userid <- group_by(Combined.Orders, Combined.Orders$user_id)
User_Reorders = summarise(group_userid, total_orders=n())
User_Reorders$user_prop_reordered = User.data$reorder.prop
head(User_Reorders)
item_reorders <- Combined.Orders %>%
group_by(product_id) %>%
summarize(n=n())
# Create list of unique product_id's
prodid.list <- (unique(Combined.Orders$product_id))
prodid.list <- sort(prodid.list)
re.prop <- c()
head(item_reorders)
# Sort Combined.Orders by product_id for easier parsing in for loop below
Combined.Orders <- Combined.Orders[order(Combined.Orders$product_id),]
attach(Combined.Orders)
for (prodid in prodid.list)
{
subset <- Combined.Orders[Combined.Orders$product_id==prodid,]
if (nrow(subset)==1){
if (subset$reordered == 0){
pro=0
}
else {
pro = 1
}
}
else
{
total <- nrow(subset)
pro <- nrow(subset[subset$reordered==1,])/total
if (length(pro) == 0) {
pro= 0
}
}
re.prop <- append(re.prop, pro)
}
item_reorders$proportion_reordered <- re.prop
ProductCounts <- item_reorders
ProductCounts$product_orders <- ProductCounts$n
ProductCounts<- ProductCounts[,-2]
head(ProductCounts)
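The same note applies here: the per-product loop above collapses into a single grouped summary. A sketch of the equivalent dplyr computation (ProductCounts.alt is an illustrative name; reordered.num is the numeric 0/1 column created above):
ProductCounts.alt <- Combined.Orders %>%
group_by(product_id) %>%
summarize(proportion_reordered = mean(reordered.num), # share of purchases that are reorders
product_orders = n()) # total times the product was purchased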
# Merge Combined.Orders with ProductCounts
Combined.Orders <- merge(ProductCounts, Combined.Orders, by = "product_id")
# Merge Combined.Orders with user_Reorders
User_Reorders$user_id <- User_Reorders$'Combined.Orders$user_id'
User_Reorders <- User_Reorders[,-1]
Combined.Orders <- merge(User_Reorders, Combined.Orders , by = "user_id")
Combined.Orders$order_dow <- as.factor(Combined.Orders$order_dow)
names(Combined.Orders)[6] <- 'product_orders_total'
Combined.Orders$reordered <- as.factor(Combined.Orders$reordered)
Combined.Orders$department_id = as.factor(Combined.Orders$department_id)
head(Combined.Orders)
str(Combined.Orders)
Combined.Orders <- na.omit(Combined.Orders)
head(Combined.Orders, n=10)
Now that we have reorganized the data, creating additional variables that capture various proportions of reordered products, we can split the data into a training set for fitting the models and a test/validation set for assessing how well they generalize.
set.seed(1)
Train<- sample(1:nrow(Combined.Orders), nrow(Combined.Orders)*.8)
Training.Orders <- Combined.Orders[Train,]
Test.Orders <- Combined.Orders[-Train,]
nrow(Training.Orders)
nrow(Test.Orders)
We can apply a logistic regression model to this data because we are predicting a binomial response variable (0 for a purchased product that was not previously ordered, 1 for one that was).
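Concretely, logistic regression models the log-odds of a reorder as a linear function of the predictors:
log(p / (1 - p)) = b0 + b1*x1 + ... + bk*xk, where p = P(reordered = 1 | x1, ..., xk),
and the fitted probability p can then be thresholded (e.g., at 0.5) to produce a 0/1 prediction.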
The first model below uses the subset of variables discussed in the preprocessing section (user_prop_reordered, proportion_reordered, product_orders_total, total_orders, add_to_cart_order, order_dow, days_since_prior_order) WITHOUT the department_id indicator columns; all of its coefficients show significant associations with the reordered variable (i.e., all p-values are much smaller than the .05 significance level).
Logistic.TestModel1 <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order, family = binomial, data = Training.Orders)
summary(Logistic.TestModel1)
Similarly, the logistic regression model that uses the same variables AS WELL AS the department indicator variables has coefficients that are almost all significant. From the results below, two departments have insignificant coefficients: department 12 (meat/seafood) and department 15 (canned goods).
Logistic.TestModel2 <- glm(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, family = binomial, data = Training.Orders)
summary(Logistic.TestModel2)
The coefficients for the indicator variables department_id==12 (meat/seafood) and department_id==15 (canned goods) have large p-values. Here the null hypotheses state that these department indicators have no effect on the response; we cannot reject them, because data at least this extreme would be quite likely if the null hypotheses were true.
We will conclude that the meat/seafood and canned goods departments do not contribute significantly to explaining the variance in the response. In other words, knowing that a purchase came from one of these departments tells us little about whether the item is a reorder. Interpretations of this insignificance can vary and may involve how quickly these groceries spoil (e.g., canned goods have no pressing use-by timeline, while meats and seafoods have especially sensitive ones). All in all, further exploration of the data would be needed to reach stronger insights about why these particular departments lack significance relative to the others.
For a glimpse of the specific products in these categories, below are lists of the most popular items in each department, along with bar charts of the response variable per department.
dept12 <- Combined.Orders[Combined.Orders$department_id == 12,]
reorder.table12<- dept12 %>%
count(reordered) %>%
mutate(prop = prop.table(n))
as.data.frame(reorder.table12)
reorder12 <- dept12 %>%
group_by(reordered) %>%
summarize(count = n()) %>%
mutate(proportion = count/sum(count))
reorder12 %>%
ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity")+ ggtitle("Proportion of Reordered Products \n in Dept 12 (meat/seafood)") + theme(plot.title = element_text(hjust = 0.5))
item_list12 <- unique(dept12$product_id)
prop12<- c()
n = c()
for (prodid in item_list12){
subset <- dept12[dept12$product_id == prodid,]
avg = (nrow(subset[subset$reordered==1,]))/nrow(subset)
prop12 <- append(prop12, avg)
n = append(n, nrow(subset))
}
dept12proportions <- data.frame(item_list12,prop12,n)
names(dept12proportions)[1]<- 'product_id'
dept12proportions<- merge(dept12proportions, products, by='product_id')
dept12proportions<-dept12proportions %>% arrange(desc(dept12proportions$n))
head(dept12proportions, n=15)
dept15 <- Combined.Orders[Combined.Orders$department_id == 15,]
reorder.table15<- dept15 %>%
count(reordered) %>%
mutate(prop = prop.table(n))
as.data.frame(reorder.table15)
reorder15 <- dept15 %>%
group_by(reordered) %>%
summarize(count = n()) %>%
mutate(proportion = count/sum(count))
reorder15 %>%
ggplot(aes(x=reordered,y=count, fill=reordered))+
geom_bar(stat="identity") + ggtitle("Proportion of Reordered Products \n in Dept 15 (canned goods)") + theme(plot.title = element_text(hjust = 0.5))
item_list15 <- unique(dept15$product_id)
prop15<- c()
n = c()
for (prodid in item_list15){
subset <- dept15[dept15$product_id == prodid,]
avg = (nrow(subset[subset$reordered==1,]))/nrow(subset)
prop15 <- append(prop15, avg)
n = append(n, nrow(subset))
}
dept15proportions <- data.frame(item_list15,prop15,n)
names(dept15proportions)[1]<- 'product_id'
dept15proportions<- merge(dept15proportions, products, by='product_id')
dept15proportions<-dept15proportions %>% arrange(desc(dept15proportions$n))
head(dept15proportions, n=15)
attach(Combined.Orders)
set.seed(1)
Accuracy <- function(tab)
{
# proportion of correctly classified observations: (n11 + n22) / total
n11 <- tab[1,1]
n22 <- tab[2,2]
Total <- sum(tab)
return((n11+n22)/Total)
}
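A quick sanity check of the helper on a toy confusion matrix:
# 50 + 70 correct classifications out of 150 observations -> accuracy 0.8
Accuracy(matrix(c(50, 10, 20, 70), nrow=2))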
ProbM1 <- predict.glm(Logistic.TestModel1, newdata = Test.Orders, type = "response")
PredM1 <- ifelse(ProbM1 > .5, "1" , "0")
TableM1 <- table(PredM1, Test.Orders$reordered)
TableM1
Accuracy(TableM1)
print('Sensitivity:')
sensitivity(TableM1)
print('Specificity:')
specificity(TableM1)
ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .5, "1" , "0")
TableM2 <- table(PredM2, Test.Orders$reordered)
TableM2
Accuracy(TableM2)
print('Sensitivity:')
sensitivity(TableM2)
print('Specificity:')
specificity(TableM2)
The sensitivity measures the proportion of correctly classified "success" responses (the true positive rate, i.e., correctly classified reordered=1 outcomes) out of the true number of "success" responses.
The specificity measures the proportion of correctly classified "failure" responses (the true negative rate, i.e., correctly classified reordered=0 outcomes) out of the true number of "failure" responses.
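In terms of confusion-matrix cells (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives):
sensitivity = TP / (TP + FN)
specificity = TN / (TN + FP)
accuracy = (TP + TN) / (TP + TN + FP + FN)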
From the results above, we get an accuracy of 0.7389, a sensitivity of 0.5307, and a specificity of 0.8613: the model is much better at recognizing non-reorders than reorders.
When interpreting these classification error rates, how can we be sure that the best decision boundary for identifying a reordered=1 response is a success likelihood p > .5?
Depending on the business costs of Type I errors (false positives, i.e., a product/user/time observation is incorrectly predicted to be a reorder when in reality the user would not reorder the product) and Type II errors (false negatives, i.e., an observation is predicted not to be a reorder when in reality the user would reorder the product at this time), we may want to adjust the decision threshold to balance the two types of error.
Thus, we will use a Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate across a range of decision-threshold settings.
library(Epi)
ROC(form=(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id), data=Training.Orders, plot="ROC", MX=TRUE, MI=FALSE, main="ROC Curve")
With this new information, we can recalculate the accuracy, sensitivity, and specificity of our logistic regression model. As we can see below, the accuracy rate has decreased from 0.7389 to 0.7186; however, the important difference to note is that the sensitivity and specificity are now much closer to each other (previously they were 0.5307 and 0.8613, respectively).
ProbM2 <- predict.glm(Logistic.TestModel2, newdata = Test.Orders, type = "response")
PredM2 <- ifelse(ProbM2 > .639, "1" , "0")
TableM2 <- table(PredM2, Test.Orders$reordered)
TableM2
Accuracy(TableM2)
print('Sensitivity:')
sensitivity(TableM2)
print('Specificity:')
specificity(TableM2)
We will next look at the classification tree technique, which predicts the response based on the most common class among the training observations that fall into each leaf (terminal node).
The tree model below, Tree.TestModel, uses the same predictors as the ones used in the previous logistic regression model, Logistic.TestModel2.
library(rpart)
Tree.TestModel <- rpart(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id,method = "class", data = Training.Orders)
plot(Tree.TestModel, uniform=TRUE, margin=.05,
main="Classification Tree for Reordered")
text(Tree.TestModel, use.n=TRUE, all=TRUE, cex=.57)
Tree.TestModelR <- capture.output(summary(Tree.TestModel))
plotcp(Tree.TestModel)
printcp(Tree.TestModel) # display the results
Tree.TestModel$cptable
Tree.TestModel$cptable[which.min(Tree.TestModel$cptable[,'xerror']), 'CP']
Normally we would prune the tree to prevent overfitting and reduce the variance of the original tree; however, the cross-validated optimal CP value came out equal to the default CP value used by rpart, so we will keep the original tree.
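For completeness, had the cross-validated optimal CP differed from the default, the pruning step would be a one-liner with rpart's prune():
# prune back to the CP value with the lowest cross-validated error (xerror)
best.cp <- Tree.TestModel$cptable[which.min(Tree.TestModel$cptable[,'xerror']), 'CP']
Tree.Pruned <- prune(Tree.TestModel, cp=best.cp)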
Pred.Tree <- predict(Tree.TestModel, newdata = Test.Orders, type = "class")
Tree.Table <- table(Pred.Tree, Test.Orders$reordered)
Tree.Table
Accuracy(Tree.Table)
print('Sensitivity:')
sensitivity(Tree.Table)
print('Specificity:')
specificity(Tree.Table)
To improve upon the single tree above, a random forest builds many decorrelated trees to lower the variance of the model. By forcing each split to choose from only a random subset of all available predictors, the random forest prevents one dominant predictor from ruling every tree that is built.
As demonstrated below, the randomForest() function would not run on a training set of this size. To get it to run, we halved the training data, leaving 971,200 observations.
library(randomForest)
# First attempt: errored because the training set was too large, so the call is left commented out
# randomForest.fit <- randomForest(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, data=Training.Orders)
# Split the data in half
set.seed(1)
subset <- sample(1:nrow(Training.Orders), nrow(Training.Orders)/2)
Training.Orders2 <- Training.Orders[subset,]
nrow(Training.Orders2)
# Second attempt
randomForest.fit2 <- randomForest(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, data=Training.Orders2)
print(randomForest.fit2)
importance(randomForest.fit2)
Using the random forest method, our model has an out-of-bag error rate of 26.23%, i.e., an accuracy rate of 73.77%. Also, by comparing the mean decrease in Gini index for each predictor relative to the variable with the largest decrease, we can see that the variables with the highest importance are user_prop_reordered, proportion_reordered, product_orders_total, and total_orders.
varImpPlot(randomForest.fit2, main='Importance Plot of Random Forest')
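For reference, the Gini index at a node with class proportions p0 and p1 is G = p0*(1 - p0) + p1*(1 - p1); it is zero for a pure node and maximal at a 50/50 split, and a variable's importance is the total decrease in G achieved by its splits, averaged over all trees. The degree of decorrelation itself is controlled by randomForest's mtry argument, which for classification defaults to roughly the square root of the number of predictors. A hedged sketch with illustrative (untuned) values:
# mtry = predictors randomly tried at each split; ntree = number of trees grown
randomForest.fit3 <- randomForest(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, data=Training.Orders2, mtry=4, ntree=500)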
All in all, although the tree-based models achieve competitive accuracy rates, more work needs to be done before such a model could be used for industry purposes.
Next we will look at K-Nearest Neighbors (KNN) classification, which uses a distance measure to locate the training points nearest each observation; the model then predicts the label that occurs most frequently within each point's group of k nearest neighbors.
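The distance used is typically Euclidean, d(x, x') = sqrt((x1 - x'1)^2 + ... + (xp - x'p)^2), computed over the p predictor values of each pair of points.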
An important consideration is that KNN classification requires all predictors used in the model to be numeric. Thus, we will drop the following variables, which are neither numeric nor ordinal: user_id, eval_set, order_number, order_dow, product_name, product_id, aisle_id, department_id.
As for the ordinal categorical predictor variables (i.e., total_orders, product_orders_total, add_to_cart_order, order_hour_of_day), we will treat them as continuous rather than discrete.
library(MASS)
library(class)
train.y <- Training.Orders[9]
train.X <- Training.Orders[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]
test.X <- Test.Orders[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]
train.y = train.y[,1]
nrow(train.X)
length(train.y)
nrow(test.X)
head(train.X)
attach(train.X)
Because KNN relies on distances, predictors on larger scales would dominate the distance calculation; thus, below we normalize each variable used in the KNN training and test sets to the [0, 1] range.
normalize <- function(x){
return((x-min(x)) / (max(x)-min(x)))
}
train.X.normalized <- as.data.frame(lapply(train.X, normalize))
test.X.normalized <- as.data.frame(lapply(test.X, normalize))
set.seed(1)
# Arbitrarily chose a K value of 5 here, which is a common choice for industry use
knn.pred = knn(train.X.normalized,test.X.normalized,train.y ,k=5)
test.y <- Test.Orders[9]
test.y = test.y[,1]
knn.table<- table(knn.pred, test.y)
knn.table
Accuracy(knn.table)
print('Sensitivity:')
sensitivity(knn.table)
print('Specificity:')
specificity(knn.table)
Cross-validation is a commonly used method to test how reliably a model performs on different groupings of unseen data. For KNN classification, we will consider a variety of K values for selecting the K nearest neighbors of the training data. We will compute 10 different KNN classifiers using K values from 1 to 10, then select the model with the lowest validation error as the final model.
train.normalized <- cbind(train.X.normalized, train.y)
names(train.normalized)[9] <- 'reordered'
attach(train.normalized)
# Subsample the training set so cross-validation runs in a reasonable time
set.seed(1)
Training.Orders2 <- Training.Orders[sample(1:nrow(train.X.normalized), 485601),]
train.y2 <- Training.Orders2[9]
train.X2 <- Training.Orders2[c(-1, -4, -9, -10, -11, -12, -15, -16, -17, -18 )]
train.y2 = train.y2[,1]
train.X.normalized2 <- as.data.frame(lapply(train.X2, normalize))
set.seed(1)
# Creating the 5 lists of indices that represent the 5 folds of the cross validation
idx <- createFolds(train.y2, k=5)
idx[[1]]
set.seed(1)
ks<- 1:10
res <- sapply(ks, function(k) {
# try out each value of k from 1 to 10
res.k <- sapply(seq_along(idx), function(i) {
# loop over the 5 cross-validation folds: train on four folds, predict the held-out fold
pred <- knn(train=train.X.normalized2[-idx[[i]], ],
test=train.X.normalized2[idx[[i]], ],
cl=train.y2[-idx[[i]]], k = k)
# proportion of misclassified held-out training samples
mean(train.y2[idx[[i]]] != pred)
})
# average the error rates over the 5 folds for this value of k
mean(res.k)
})
res
plot(ks, res, type="o",xlab='K value', ylab="Misclassification Test Error", main='5-Fold Cross Validation Error (for K= 1 to 10)')
which.min(res)
res[10]
set.seed(1)
knn.pred2 = knn(train.X.normalized,test.X.normalized,train.y ,k=10)
knn.table2<- table(knn.pred2, test.y)
knn.table2
Accuracy(knn.table2)
print('Sensitivity:')
sensitivity(knn.table2)
print('Specificity:')
specificity(knn.table2)
The final accuracy rate of the knn.pred2 model is 72.24%, an improvement over the 71.15% accuracy of the original knn.pred model that used K=5 nearest neighbors. Next, let's compare the different models below!
Model 1: Logistic Regression - 0.7389
Model 2: Classification Tree - 0.7223 / Random Forest (on 1/2 of the data) - 0.7377
Model 3: KNN (K=10) - 0.7224
From the results above, the logistic regression model (with the department indicator variables included) gives the highest accuracy (lowest error rate). Because the accuracy rates are so close, one of the other two models may yet outperform logistic regression with additional techniques (e.g., bagging or boosting the classification tree).
Another factor to consider is the use of interaction terms. To keep the models simple, we chose not to include them; however, since the likelihood of a user reordering a product involves a nuanced relationship between that user and specific products, it would make sense to include interaction terms that account for this relationship. Thus, to improve all three models, we recommend adding interaction terms between product_id's and user_id's.
One may notice that the variables that came out as particularly effective and high in "importance" (according to the importance plot above) were the features we engineered by aggregating reordered outcomes across user_id's and product_id's; however, as noted in the classification tree analysis, these variables are inherently problematic: because they were derived from the same reordered column we are predicting, they carry a built-in bias.
Thus, for further exploration, we recommend models that use a subset of the variables considered here. Various subset selection methods can be applied, such as backward and forward stepwise regression, principal component analysis, and lasso and ridge regression, to help eliminate unnecessary variables. Additionally, a correlation map can help identify the most highly correlated variables to remove from the model (to reduce the chance of overfitting).
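As a concrete starting point for that follow-up, here is a hedged sketch of a cross-validated lasso fit with the glmnet package (glmnet needs a numeric model matrix, so the factors are expanded into indicator columns first); predictors whose coefficients are shrunk to zero drop out of the model:
library(glmnet)
# build a numeric design matrix from the same formula as Logistic.TestModel2
x <- model.matrix(reordered ~ user_prop_reordered + proportion_reordered + product_orders_total + total_orders + add_to_cart_order + order_dow + days_since_prior_order + department_id, data=Training.Orders)[,-1]
y <- Training.Orders$reordered
# 10-fold cross-validated lasso (alpha = 1) for the binomial response
cv.lasso <- cv.glmnet(x, y, family="binomial", alpha=1)
coef(cv.lasso, s="lambda.min") # zeroed coefficients mark eliminated variables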