My goal in this project is to determine which items Instacart should recommend to each user the next time they log into their account, based on predictions of which previously purchased items they will reorder.
Dataset originated here: https://www.kaggle.com/competitions/instacart-market-basket-analysis/overview
My summative project report is available here, or read the summary below to see what can be found in each notebook:
-
In the wrangling notebook, I load the raw data and merge product information with data about users and their orders.
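A minimal sketch of that merge, assuming the file names from the Kaggle data page (column names follow the competition schema):

```python
import pandas as pd

# Raw files from the Kaggle competition
orders = pd.read_csv("orders.csv")
order_products = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")

# Attach aisle and department names to each product
products = (
    products
    .merge(aisles, on="aisle_id", how="left")
    .merge(departments, on="department_id", how="left")
)

# One row per (user, order, product): join order/user info onto line items
df = (
    order_products
    .merge(orders, on="order_id", how="left")
    .merge(products, on="product_id", how="left")
)
```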
-
💜 Exploratory data analysis begins in notebook 2 but continues all the way through notebook 6, as more insights about the data become possible once additional features are engineered. Specifically, in notebooks 3 and 4, I explore a single user's purchasing and reordering habits in depth in order to understand the factors that might support accurate predictions of the items they will reorder in the future.
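The single-user deep dive boils down to filtering the merged dataframe to one `user_id` and summarizing that user's per-product history; a rough sketch (the id 777 is an arbitrary example):

```python
# One user's complete order history from the merged dataframe
one_user = df[df["user_id"] == 777]

# Per-product purchase counts and reorder rates for this user
habits = (
    one_user.groupby("product_name")
    .agg(times_purchased=("order_id", "nunique"),
         reorder_rate=("reordered", "mean"))
    .sort_values("times_purchased", ascending=False)
)
print(habits.head(10))
```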
-
Notebooks 5.1 and 5.2 are dedicated to the most crucial step: 💜 adding non-ordered items to each order by each user. This spans two notebooks only because the size of the dataset slowed my work in the first. The step is crucial because the raw data contains only the items a user actually purchased in each order. To make predictions about what a user will reorder in their 100th order, for example, we need to know which items they did and did not reorder in their 99th order from among all items they had ever ordered. This required creating new rows so that each order grew cumulatively over time for each user, resulting in a dataframe with approximately 15 times as many rows as the raw set.
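A loop-based sketch of that expansion (the notebooks presumably vectorize this for speed; `in_cart` is an illustrative label name, other columns follow the Kaggle schema):

```python
import pandas as pd

def expand_orders(user_df: pd.DataFrame) -> pd.DataFrame:
    """For one user, grow each order to include every product seen in any
    earlier order, labeling whether it was actually in the current cart."""
    rows = []
    seen = set()  # products this user has ordered in any earlier order
    for order_number, order in user_df.groupby("order_number"):
        in_cart = set(order["product_id"])
        # Candidates are everything previously ordered; the first order
        # emits no rows because there is no history to reorder from
        for product_id in seen:
            rows.append({"order_number": order_number,
                         "product_id": product_id,
                         "in_cart": int(product_id in in_cart)})
        seen |= in_cart
    return pd.DataFrame(rows)

# Apply per user; on the full dataset this grows the row count ~15x,
# which is why the work was split across notebooks 5.1 and 5.2
expanded = (
    df.sort_values(["user_id", "order_number"])
      .groupby("user_id")
      .apply(expand_orders)
      .reset_index(level=0)  # recover user_id as a column
)
```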
-
I had the most fun with notebook 6, where I 💜 engineered several new features. For example, for each item in each order, I added a column indicating the 💜 percentage of past orders in which that item had been purchased. Other new features included "percent of past orders where this product was one of the first 6 items placed in this user's cart" and product keywords such as "organic" or "fresh." These features improve modeling because "percent of past orders where this item was purchased," for example, is highly correlated with whether an item will be reordered in the future, yet until that feature was engineered, a model had no way to pick up on this valuable variable. See visualizations of the top three most predictive features (a code sketch of the feature follows them):
-
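A sketch of how an expanding-window feature like "percent of past orders where this item was purchased" might be computed on the expanded frame (`in_cart` and `pct_past_orders` are illustrative names; the notebook's exact denominator may differ):

```python
# Sort so each user/product group runs in order-number sequence
expanded = expanded.sort_values(["user_id", "product_id", "order_number"])

# shift(1) excludes the current order, so the feature summarizes only
# *past* orders and never leaks the label it is meant to predict
expanded["pct_past_orders"] = (
    expanded.groupby(["user_id", "product_id"])["in_cart"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
# The first candidate row per group has no history and comes out NaN;
# fill with 0 or leave it for the model to handle, as appropriate
```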
In notebook 7, I tried multiple strategies for encoding categorical data and chose a 💜 Target Encoder with default hyperparameters to encode categorical features such as the user ID, the product name, and the aisle and department in which each product is classified.
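A sketch of that encoding step, assuming the `category_encoders` implementation of the Target Encoder (the train/test split and column names are illustrative):

```python
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split

cat_cols = ["user_id", "product_name", "aisle", "department"]
X = expanded.drop(columns="in_cart")
y = expanded["in_cart"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on training data only: target encoding replaces each category with
# a smoothed mean of the target, so fitting on everything would leak labels
enc = TargetEncoder(cols=cat_cols)  # default hyperparameters
X_train_enc = enc.fit_transform(X_train, y_train)
X_test_enc = enc.transform(X_test)
```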
-
Finally, in notebook 8, I used a randomized grid search with cross-validation to test multiple classifiers. I selected the 💜 Random Forest Classifier because it performed best overall in a combined evaluation using log loss, ROC AUC, and F1 metrics; I then tuned its hyperparameters and finalized the model (a sketch of the search setup follows the metrics below).
💜 Log loss: 2.55 vs. 5.68 with a naive model
💜 ROC AUC: 0.71
💜 F1: 0.4 vs. 0.09 with a naive model
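A sketch of the final search for the chosen Random Forest (the notebook also compared other classifiers; the search space, fold count, and iteration count here are illustrative):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(5, 30),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    param_distributions=param_dist,
    n_iter=20,
    # Score on all three metrics named above; refit on one of them and
    # inspect the rest in cv_results_ for the combined evaluation
    scoring={"log_loss": "neg_log_loss", "roc_auc": "roc_auc", "f1": "f1"},
    refit="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X_train_enc, y_train)
print(search.best_params_)
```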
Project based on the cookiecutter data science project template. #cookiecutterdatascience