Dealing with Data Bias Part II

In this series of blog posts I explore the different type of biases that I come across as supply chain scientist. While my particular examples usually tend to come from the supply chain and eCommerce world, the biases described are quite general.

Bias #2: Selection Bias

Another tricky bias to deal with is selection bias. Selection bias is the systematic under- or over representation of data sources. Think of a polling agency trying to predict the outcomes of an upcoming election. In order to do so, they need to sample people’s opinion. So, assume next Sunday are the election and on this Wednesday at 11:30 they go their own city’s central market place and start asking people who they would vote for. It turns out, people happening to be on a market place at 11:30 on a week day might actually not be correctly representing the general population and hence a sampling like this will not yield to trustworthy results.

In the era of Tech data analysis one might think ‘great, now with our fancy Big Data technology we no longer have to rely on unrepresentative samples, we can just process all the data there is – voilà no more selection bias‘. And in a perfect world with only clean data, this would be true. However, in the real world with real processes or humans creating your data feed, systematic corruption of parts of your data might again cause selection type bias.

Usually, with the growth of a business, the number of input data sources grows as well. For example, if you think of a commerce, the number of sales channels increases (phone, their own web shop, Amazon market place etc). All these different data sources for sales do not necessarily capture data in the same way and the same level of completeness e.g. for some edge cases one sales channel may be bad at recording, while others may do it correctly. Over time, these differences may accumulate and if you look at the results as a whole, skew your perception in one direction or the other.

One other very common source for this type of bias is when you replace one data source tracking system (i.e. the ERP tool) with another one. This yields to one legacy system coexisting with a new one with potentially completely different philosophy, use cases and configuration. Defining a proper data mapping between these systems can be a quite complex task. Avoiding introducing systematic errors from one to the other is quite delicate.

How to deal with selection bias

The problem with selection bias is that you will not be able to tell it from only looking at the data you have. So, there is no fancy high tech statistical Data Science method to conquer this bias.

The way to deal with selection bias is to carefully study the underlying business processes to first understand potential sources of selection bias and then find proper business indicators to check for these.

Part one involves a proper business analysis along side with an analysis of the data architecture supporting it. In my own work, I usually try to draw a diagram of the flow of goods from conception to delivery to the customer and if applicable, the return processes and a second diagram of the flow of data for each of the steps of the diagram of the flow of goods.

For each of the “arrows” of such a diagram (i.e. from production/manufacturing to storage, from storage to delivery etc), i.e. each cross section of a physical system to another and each interface between data systems, data may be lost, skewed, delayed or counted multiple times. From business one needs to get the appropriate real life statistics of what is supposed to happen between these systems such as how many items have been send from suppliers to the central warehouse each day, how much sales volume has been generated from each channel on any given day etc. Then you can compare the measured values with the figured coming from business to spot gaps.

This requires a quite extensive knowledge and accounting and documentation of the business and the data architecture. In the long run, the effort spend on this will lead a high return on investment.