Before Covid-19 hit in France, my colleague Rafael de Rezende asked me if I wanted to join his team for competing in the kaggle M5 Data Science competition on predicting Walmart Sales.
The given data set consisted of a sales history of US stores in three different states in three different categories (e.g. FOOD). We did not get any information on the specific items, e.g. items were just labeled ‘FOOD-123’. Our sales data was cut at a certain moment in history and we were to predict the following weeks of sales as a probability distribution.
My role was primarily business analysis using Python Jupyter notebooks to figure out the impact of aspects of the time series such as day of the week, month-based seasonality, the impact of calendar events such as Christmas (which varied a lot depending on the representation of religions in the different states), but also the effect of food stamp distribution that varied greatly by state.
The team then used this insight to craft a multi-stage state-space model (states inactive or active) with Monte Carlo simulations to generate our predictions as negative binomial probability distributions.
If you want to know more:
- Rafael’s summary on kaggle
- Lokad’s blog post
- Rafael was a guest at a Lokad TV‘s episode about the competition (23 min)
