Dealing with Data Bias Part II

In this series of blog posts I explore the different types of bias that I come across as a supply chain scientist. While my examples tend to come from the supply chain and eCommerce world, the biases described are quite general.

Bias #2: Selection Bias

Another tricky bias to deal with is selection bias. Selection bias is the systematic under- or overrepresentation of data sources. Think of a polling agency trying to predict the outcome of an upcoming election. To do so, they need to sample people’s opinions. Assume the election is next Sunday, and on the Wednesday before, at 11:30, the pollsters go to their city’s central marketplace and start asking people who they would vote for. It turns out that people who happen to be at a marketplace at 11:30 on a weekday may not represent the general population, so a sample like this will not yield trustworthy results.
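To make the effect concrete, here is a minimal Python sketch. All numbers are invented for illustration: a hypothetical population of two groups with different voting preferences, and an assumed marketplace crowd in which one group is overrepresented.

```python
# Hypothetical numbers, invented purely for illustration.
# Share of each group in the general population, and its support for candidate A.
population_share = {"retired": 0.25, "working": 0.75}
support_for_a = {"retired": 0.60, "working": 0.40}

# Assumed mix of people you meet at the marketplace on a weekday at 11:30.
marketplace_share = {"retired": 0.70, "working": 0.30}


def expected_support(group_share, support):
    """Expected vote share for A, given the group mix of the people asked."""
    return sum(group_share[group] * support[group] for group in group_share)


true_support = expected_support(population_share, support_for_a)
polled_support = expected_support(marketplace_share, support_for_a)

print(f"Support for A in the population: {true_support:.0%}")
print(f"Support for A in the poll:       {polled_support:.0%}")
```

With these invented numbers, the marketplace poll reports a majority for candidate A even though A is behind in the population as a whole. Nothing inside the polled sample itself reveals the problem; you only see it once you know how the sample was collected.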

In the era of big data analysis, one might think: ‘Great, now with our fancy Big Data technology we no longer have to rely on unrepresentative samples. We can just process all the data there is and, voilà, no more selection bias.’ In a perfect world with only clean data, this would be true. However, in the real world, where real processes or humans create your data feed, systematic corruption of parts of your data can again cause selection-type bias.

Usually, as a business grows, the number of input data sources grows as well. For example, in a retail business the number of sales channels increases (phone, the company’s own web shop, Amazon Marketplace, etc.). These different data sources for sales do not necessarily capture data in the same way or with the same level of completeness: for some edge cases, one sales channel may record data poorly while others record it correctly. Over time, these differences may accumulate and, when you look at the results as a whole, skew your perception in one direction or the other.

Another very common source of this type of bias is replacing one data tracking system (e.g. the ERP tool) with another. This leads to a legacy system coexisting with a new one that may have a completely different philosophy, use cases and configuration. Defining a proper data mapping between these systems can be quite a complex task, and avoiding systematic errors when moving data from one to the other is delicate.

How to deal with selection bias

The problem with selection bias is that you cannot detect it by looking only at the data you have. So there is no fancy high-tech statistical data science method to conquer this bias.

The way to deal with selection bias is to carefully study the underlying business processes to first understand potential sources of selection bias and then find proper business indicators to check for these.

Part one involves a proper business analysis alongside an analysis of the data architecture supporting it. In my own work, I usually try to draw two diagrams: one of the flow of goods from conception to delivery to the customer, including the return processes if applicable, and a second one of the flow of data for each step in the flow of goods.

At each “arrow” of such a diagram (e.g. from production/manufacturing to storage, or from storage to delivery), that is, at each handover from one physical system to another and at each interface between data systems, data may be lost, skewed, delayed or counted multiple times. From the business, you need to get the appropriate real-life statistics on what is supposed to happen between these systems, such as how many items were sent from suppliers to the central warehouse each day or how much sales volume was generated by each channel on any given day. Then you can compare the measured values with the figures coming from the business to spot gaps.
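As a rough sketch of such a comparison, the Python snippet below reconciles daily quantities reported by the business with the quantities seen in a data feed. The figures, the dates, the 5% tolerance and the function name `find_gaps` are all assumptions made for illustration, not part of any particular tool.

```python
# All figures below are made up for illustration.
# Units shipped per day according to the business (e.g. supplier delivery notes).
business_figures = {"2024-03-04": 120, "2024-03-05": 95, "2024-03-06": 110}
# Units received per day according to the warehouse data feed.
measured_figures = {"2024-03-04": 120, "2024-03-05": 80, "2024-03-06": 131}

TOLERANCE = 0.05  # assumed acceptable relative deviation (5%)


def find_gaps(expected, measured, tolerance):
    """Return (day, expected, measured, relative gap) for days outside the tolerance."""
    gaps = []
    for day in sorted(expected):
        exp = expected[day]
        got = measured.get(day, 0)  # a day missing from the feed counts as zero
        if exp == 0:
            # No reference volume: any measured quantity is suspicious.
            if got != 0:
                gaps.append((day, exp, got, None))
            continue
        rel_gap = (got - exp) / exp
        if abs(rel_gap) > tolerance:
            gaps.append((day, exp, got, rel_gap))
    return gaps


gaps = find_gaps(business_figures, measured_figures, TOLERANCE)
for day, exp, got, rel_gap in gaps:
    label = "n/a" if rel_gap is None else f"{rel_gap:+.1%}"
    print(f"{day}: expected {exp}, measured {got} ({label})")
```

A negative gap hints at lost or delayed records, a positive one at double counting. Either way, the gap only tells you where to start investigating; the explanation has to come from the business process behind that arrow.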

This requires quite extensive knowledge, accounting and documentation of the business and the data architecture. In the long run, the effort spent on this will yield a high return on investment.

Dealing with Data Bias Part I

One major challenge we face as supply chain scientists is bias in data. In this series of blog posts, I would like to describe a few of the major biases that we see, why they are harmful to any prediction or sales analysis, and how to deal with them.

While I write from my perspective as a supply chain scientist, this analysis applies to basically anyone who is doing any sort of data analysis or research on data.

Bias #1: Confounding variables

In a nutshell, at Lokad, we use past sales data to make predictions on future demand. At first glance, this sounds perfectly reasonable – you measure past demand to predict future demand – so where’s the bias?

In fact, there is quite a range of other variables that can make meaningful predictions hard or even impossible. The term ‘confounding variable’ refers to any influence on the cause-and-effect relationship you would like to study that lies outside of your current considerations.

Let’s assume you are studying the following sales history:

Day         Number of sales
Monday      8
Tuesday     11
Wednesday   9
Thursday    2
Friday      15
Saturday    15
Sunday      12

So what happened on Thursday? Maybe mid-week is just the slow period for you? It could, however, also be that there are some confounding variables at play.

One basic question one should start with is the following: Were clients able to buy all the time or was there some obstacle preventing sales?

One such factor could be website downtime if you are, for example, an eCommerce business. In our example above, maybe the webshop was down from Wednesday night to Thursday night. This is very valuable information to store. There might also be some “soft downtimes” if, for example, a small bug in the shop software prevented one product category from being shown properly in its usual place.
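If downtimes are stored, they can be put next to the sales history. The sketch below reuses the sales figures from the table above; the downtime log and the four-hour threshold are assumptions invented for the example.

```python
# Sales history from the example above.
sales = {
    "Monday": 8, "Tuesday": 11, "Wednesday": 9, "Thursday": 2,
    "Friday": 15, "Saturday": 15, "Sunday": 12,
}
# Hypothetical downtime log: hours per day during which the webshop was unreachable.
downtime_hours = {"Wednesday": 2, "Thursday": 22}

MAX_DOWNTIME_HOURS = 4  # assumed threshold above which a day is treated as unreliable

# Keep only the days on which clients could buy (almost) all the time.
reliable_days = [day for day in sales if downtime_hours.get(day, 0) <= MAX_DOWNTIME_HOURS]

naive_mean = sum(sales.values()) / len(sales)
adjusted_mean = sum(sales[day] for day in reliable_days) / len(reliable_days)

print(f"Mean daily sales, all days:           {naive_mean:.1f}")
print(f"Mean daily sales, reliable days only: {adjusted_mean:.1f}")
```

Dropping the affected days is the simplest possible treatment. It does not estimate what Thursday's demand would have been; it only stops the outage from dragging the average down.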

Even with a functioning web shop, you cannot sell what you do not have (except where backorders are allowed). Stockouts are one of the major forms of confounding variables we encounter frequently. Low observed demand might just be caused by low stock or stockouts of a top-selling product. In some businesses, stock levels are visible to clients, either partially when little stock is left, to signal urgency (“Only two articles in size 40 left”), or sometimes fully (“300 pieces available”), to demonstrate stock health to potential big customers. The latter is known as a facing quantity.

Without understanding the relationship between stock levels and sales, demand can be critically underestimated. In our example, we may have encountered a stockout on Wednesday evening. On Thursday morning, we might have inbounded two returns that were sold right away, and then received a replenishment before business started on Friday. (In real life, this usually drags out a bit longer, and you might see a longer period of low-sales days depending on the length of your replenishment cycle.) Note that keeping only a stockout history would probably not have caught this issue in the example: at some point on Thursday two pieces were in stock, yet clients were not able to buy freely, since that day’s sales were capped by the two pieces in stock.

Keeping a history of stock levels, or at least a history of stockouts per product, is therefore strongly advised.
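With a stock level history at hand, days on which sales were capped by stock can be flagged. The following is a minimal sketch under assumed data: the `available` quantities (opening stock plus anything inbounded during the day) are invented to match the story above, and the helper `is_censored` is hypothetical.

```python
# Hypothetical daily records: units available for sale during the day
# (opening stock plus anything inbounded that day) and units actually sold.
history = [
    {"day": "Wednesday", "available": 9, "sold": 9},
    {"day": "Thursday", "available": 2, "sold": 2},
    {"day": "Friday", "available": 40, "sold": 15},
]


def is_censored(record):
    """Sales that exhaust the available stock are only a lower bound on demand."""
    return record["sold"] >= record["available"]


censored_days = [record["day"] for record in history if is_censored(record)]
print("Days where observed sales understate demand:", censored_days)
```

On the flagged days, the sales figure should be read as "demand was at least this much" rather than as the demand itself. Thursday is caught here even though stock was briefly available, which a pure stockout flag would have missed.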

Another confounding variable may be the evolution of prices. Depending on your vertical, your clients may be more or less sensitive to price changes, so knowing the selling price of each sale can give you important context for the demand. In our example above, the product might have been discounted on every day of the week apart from Thursday.

Unfortunately, this only gives you the demand at the price you had on a certain day. It is not possible to “replay” the demand of a given day at another price to study what would have happened if you had discounted or raised prices more. For that, a dedicated price elasticity study would be necessary. On the other hand, knowing the price of each sale allows you to establish a baseline around the main price classes (e.g. original price, -20%, -30%), so that you can understand which periods in history were associated with inflated demand due to discounts or deflated demand due to price surges.
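Such a baseline per price class can be sketched in a few lines of Python. The daily history below is invented for illustration; only the three price classes are taken from the text.

```python
from collections import defaultdict

# Hypothetical daily history: (price class in effect that day, units sold).
daily_history = [
    ("original price", 2), ("original price", 3),
    ("-20%", 8), ("-20%", 10),
    ("-30%", 15), ("-30%", 13),
]

# Group the daily sales by the price class that was active on that day.
units_by_class = defaultdict(list)
for price_class, units in daily_history:
    units_by_class[price_class].append(units)

# Average daily sales per price class serve as a descriptive baseline.
baseline = {
    price_class: sum(units) / len(units)
    for price_class, units in units_by_class.items()
}

for price_class, mean_units in baseline.items():
    print(f"{price_class:>15}: {mean_units:.1f} units per day")
```

This is a descriptive summary, not a price elasticity model. It shows which periods sold under which price class, but other factors that changed along with the price are still mixed into these averages.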

How to cope with confounding bias

Well, the ideal course of action is to study all possible confounding variables and bring them into your considerations so that they become genuine variables of your analysis.

To be able to do so, understanding how the business works is crucial. The key takeaway for me on confounding bias is that looking at the data alone is not enough: it is analyzing the business together with the data that makes a data investigation meaningful and leads to results the business can use.

In some cases, this might mean starting to track or snapshot some extra data (downtimes, stock levels, etc.), which usually means a little extra effort and some more storage consumption, but this can pay off quite well.
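A snapshot does not have to be elaborate. As a minimal sketch, the Python function below appends one dated row per product to a CSV-style output. The function name `append_snapshot`, the SKU names and the quantities are hypothetical, and an in-memory buffer stands in for a real file or database table.

```python
import csv
import io
from datetime import date


def append_snapshot(out, snapshot_date, stock_levels):
    """Append one row per product: snapshot date, product identifier, stock on hand."""
    writer = csv.writer(out)
    for sku, quantity in sorted(stock_levels.items()):
        writer.writerow([snapshot_date.isoformat(), sku, quantity])


# An in-memory buffer stands in for a file opened in append mode.
buffer = io.StringIO()
append_snapshot(buffer, date(2024, 3, 4), {"SKU-1": 12, "SKU-2": 0})
append_snapshot(buffer, date(2024, 3, 5), {"SKU-1": 9, "SKU-2": 40})

rows = buffer.getvalue().splitlines()
for row in rows:
    print(row)
```

Run once a day, something like this costs a few rows per product and gives you the history needed to tell a genuinely slow day from a day on which there was nothing to sell.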