Dealing with Data Bias Part I

One major challenge we face as supply chain scientists is bias in data. In this series of blog post, I would like to describe a few of the major biases that we see, why they are harmful for any prediction or sales analysis and how to deal with them.

While I write from my perspective as supply chain scientists, this analysis applies to basically anyone who is doing any sorts of data analysis or research on data.

Bias #1: Confounding variables

In a nutshell, at Lokad, we use past sales data to make predictions on future demand. At first glance, this sounds perfectly reasonable – you measure past demand to predict future demand – so where’s the bias?

In fact, there is quite a range of other variables, that would make meaningful predictions hard to impossible. The term ‘confounding variable’ refers to any sorts of influences that impact the cause-and-effect relationship that you would like to study which are outside of your current considerations.

Let’s assume you are studying the following sales history:

Day Nb. of Sales
Monday 8
Tuesday 11
Wednesday 9
Thursday 2
Friday 15
Saturday 15
Sunday 12

So what happened on on Thursday? Maybe mid-week is just the slow period for you? It could, however also be that there are some confounding variables at play.

One basic question one should start with is the following: Were clients able to buy all the time or was there some obstacle preventing sales?

One such factor could be website downtime if you are, for example, an eCommerce. In our example above, maybe, the webshop was down from Wednesday night to Thursday night. This is very valuable information to store. There might also be some “soft downtimes” if, for example, there was a small bug in the shop software preventing one product category to be shown properly at it’s usual place.

Even with a functioning web shop, you cannot sell what you do not have (except in the cases where one allows backorders). Stockouts represent one of the major form of confounding variables we encounter frequently.  Low observed demand might just be caused by low stock or stock outs of a top selling product. In some businesses, stock levels are visible to clients either in a reduced way if there is little stock to indicate urgency to the client (“Only two articles in size 40 left”) or sometimes fully (“300 pieces in available”) to demonstrate stock health to potential big customers. The latter is known as a facing quantity.

Without understanding the relationship between stock levels and sales, demand can be critically underestimated: In our example, we may have encountered a stockout on Wednesday evening. Thursday morning, we might have inbounded two returns  that were sold right away before we received a replenishment before business started on Friday. (In real life, this usually drags out a bit longer and you might see a longer period of low sales days depending on your replenishment cycle length) Note that keeping the stock out history would probably not have caught this issue in the example, since at some point two pieces were in stock on Thursday, but clients were not completely able to buy freely, since the quota of the day was capped by the two pieces in stock.

In general, keeping a history of stock levels or at least a history of stock outs per product is generally very much advised.

Another confounding bias may be evolution of prices. Depending on your vertical, your clients may be more or less sensitive to price changes and therefore knowing the sell price of each sale might give you important context of the demand. In our example above, the product might have been discounted for all week days apart from Thursday.

Unfortunately, this only gives you the demand at the price you had for a certain day. It is not possible to “replay” the demand at a given day at another price to study what would have happened if you had discounted or up-priced more. Here,  a more dedicated price elasticity study would be necessary. On the other hand, knowing for each sale the price allows to establish a certain baseline around the main prices classes (such as e.g. original price, -20%, -30%) so that you can understand which periods in history were associated with an inflated demand due to discounts, or deflated demand due to surges.

How to cope with confounding bias

Well, the most ideal course of action is to try and study all possible confounding variables and make them part of your consideration so they can become genuine variables of your analysis.

To be able to do so, understanding how the business works is crucial. The key take away for me for confounding bias is that looking at the data alone is not enough, it is analyzing the business together with the data that makes any data investigation meaningful leading to usable results for business.

In some cases, this might mean to start tracking or snapshotting some extra data (downtimes, stock levels, etc), which usually means a little extra effort and some more storage consumption, but this can pay out quite well.