Data Science Resources

In this post, I will collect some online data science resources such as online classes, blog posts etc. that I personally find very useful. (Ratings: ⭐ useful, ⭐⭐ very well done, ⭐⭐⭐ recommend absolutely)

Coursera: IBM Data Science Professional Certificate ⭐⭐
This class is a perfect introduction for anyone entering the field. It does not assume much prior knowledge, although some programming experience is helpful. Even if most of the material is not completely new to you, as was the case for me and probably for most people, it is a great refresher: if you already know some topics, you will get through the exercises and labs more quickly, i.e. you won’t lose money because of it.

Pricing: around 35€ / month.

Pros:

  • Very comprehensive: covers theoretical methodology, SQL, Python tutorials, visualisation, etc.
  • Nicely done quizzes and labs.
  • The community is quite active in the forums: Your questions will be answered quickly.
  • The class gives an overview of the IBM data science framework around Watson Studio.
  • Not very expensive. The more time you have, the less you spend.

Cons:

  • Within the certificate, some of the individual classes are a bit redundant.
  • Lecturers have varying presentational skills.
  • The final project for the last class (Predicting car accident severity based on a police accident report data set) was less well curated.
  • It does not go beyond IBM tools.
  • There is little information or material on how to “production-ize” a machine learning project.

Coursera: Deep Learning Specialization (deeplearning.ai) ⭐⭐⭐
This class picks up where the IBM classes stopped. Once you have mastered the foundations of machine learning such as logistic regression, you move on to more complex concepts such as neural networks. I find this class extremely well curated. This doesn’t come as a surprise: Andrew Ng, the course lecturer, is also the co-founder of Coursera, so it makes sense that he is able to leverage the platform to its fullest.

Pricing: 41€/month

Pros:

  • Andrew Ng’s tutorial videos are extremely well done and easy to follow.
  • The labs are interactive (automatically graded) and count towards your final grade. (Usually on Coursera, it’s quizzes and peer-reviewed assignments.)
  • The community support by the organizer team is very quick and comprehensive.

Cons:

  • The forums are sometimes a bit spammed, which can make it hard to find a relevant post. This is probably a general Coursera problem.

I’m currently enrolled in this class, so this verdict is not final.

The Art of Dealing with Data Instability

Carrying out any data-driven initiative for larger organisations is hard. This is not just because tools like deep learning or other data science methods are complicated. Simply getting largish data (let’s say > 1 million lines) from a few source systems to your workspace system, processing it, and producing meaningful results relevant to an ever-changing business strategy, all in production-grade quality, is highly non-trivial.

Pipelines break easily for various reasons, ranging from schedulers crashing because of daylight saving time switches to timeouts because of increased data volumes, expiring authentication credentials or crashing source systems. However, production-grade initiatives need to be able to cope with this by having a Plan B for this sort of situation and by being able to recover quickly once the issue is resolved.

Of course the speed of recovery is key when it happens, but there are also steps that one can take in advance to smooth out these situations.

Make stakeholders aware. For non-technical stakeholders of your initiative, the different steps of the data pipeline may be far from obvious. A simple diagram with the key steps together with a timeline can be very helpful. A world in which a 4G mobile connection is enough to watch movies on public transport has made people believe that any data transfer is basically instantaneous. Putting timespans and also prices on data transfers can do wonders to put things into context.
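To make such a timespan tangible, a back-of-the-envelope estimate is often enough. The following is a minimal sketch only: the function name, the efficiency factor and the example figures are my own assumptions for illustration, not measurements from a real pipeline.

```python
def transfer_hours(size_gb: float, bandwidth_mbit_s: float, efficiency: float = 0.8) -> float:
    """Rough estimate of the hours needed to move `size_gb` over a link.

    `efficiency` is an assumed fudge factor for protocol overhead and
    shared bandwidth; it is not a measured value.
    """
    if size_gb < 0 or bandwidth_mbit_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, bandwidth > 0, efficiency in (0, 1]")
    size_megabits = size_gb * 8 * 1000  # decimal units: 1 GB = 8000 megabits
    seconds = size_megabits / (bandwidth_mbit_s * efficiency)
    return seconds / 3600


# Hypothetical example: a 90 GB extract over a 100 Mbit/s line
print(f"{transfer_hours(90, 100):.1f} hours")
```

A number like this, placed next to the corresponding step in the pipeline diagram, tends to be easier to discuss than an abstract "the export takes a while".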

Figure out key GO NOGO criteria. The one thing that enabled NASA to put people on the moon with 1960s technology is a meticulous system of GO NOGO criteria and checklists, making sure that each piece is working properly and that all foreseeable eventualities are taken care of. Depending on what exactly breaks in your pipeline, the impact might range from a slight warning to a “stop using results immediately” message to all end users. We have had good experiences with defining the maximal allowed data age for each data source, so that the criticality of an incident can be assessed immediately. When, for example, a catalog table changes only little from day to day, having a two-day-old catalog is usually less harmful than, say, working on a two-week-old stock table.

Having a clearly defined list of criteria determining whether the produced results can be used for business decisions makes matters transparent and quick to assess for everyone, including the non-technical stakeholders. Being able to flag corrupted results before further use minimizes business damage and enables trust in the process and thus in the system.

This way, users can be confident and efficient in working with these results knowing that they have a “green light” on the most critical points of potential issues.
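A check on maximal data age can be sketched in a few lines. This is only an illustration under assumptions: the source names, the thresholds in MAX_AGE and the function names are hypothetical and would have to be agreed on with the business for each real data source.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds: maximal allowed data age per source.
MAX_AGE = {
    "catalog": timedelta(days=2),
    "stock": timedelta(days=1),
}


def go_nogo(last_refresh: dict, now: datetime) -> dict:
    """Return "GO" or "NOGO" per source, based on the age of its last refresh."""
    status = {}
    for source, max_age in MAX_AGE.items():
        refreshed = last_refresh.get(source)
        # A source that was never refreshed is treated as a NOGO.
        if refreshed is None or now - refreshed > max_age:
            status[source] = "NOGO"
        else:
            status[source] = "GO"
    return status


def overall(status: dict) -> str:
    """The results get a green light only if every source is a GO."""
    return "GO" if all(v == "GO" for v in status.values()) else "NOGO"


now = datetime(2024, 1, 10, 6, 0)
last_refresh = {
    "catalog": datetime(2024, 1, 9, 6, 0),  # one day old
    "stock": datetime(2024, 1, 7, 6, 0),    # three days old
}
status = go_nogo(last_refresh, now)
print(status, overall(status))
```

The per-source status is what makes the result useful for stakeholders: it says not only that something is wrong, but which input is too old.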

Classify problematic cases. The list of usual suspects that might break or endanger your pipeline boils down to about a dozen common cases. Having a paragraph or two ready that describes the potential impact on results and business decisions can be very handy when facing these situations. In the best case, these can be communicated to stakeholders automatically when an incident occurs or, if automation is not possible, be at hand for the person who notices the issue, so that it can be communicated as quickly as possible.

Don’t panic. Last but not least, things go wrong all the time, and they will probably go wrong for the first time when the principal data engineer is mountain climbing without cell phone reception. This is not a sign of personal failure but a consequence of the complicated nature of orchestration. Being prepared, however, makes it less stressful.

Dealing with Data Bias Part I

One major challenge we face as supply chain scientists is bias in data. In this series of blog posts, I would like to describe a few of the major biases that we see, why they are harmful for any prediction or sales analysis and how to deal with them.

While I write from my perspective as a supply chain scientist, this analysis applies to basically anyone who is doing any sort of data analysis or research on data.

Bias #1: Confounding variables

In a nutshell, at Lokad, we use past sales data to make predictions on future demand. At first glance, this sounds perfectly reasonable – you measure past demand to predict future demand – so where’s the bias?

In fact, there is quite a range of other variables that can make meaningful predictions hard or even impossible. The term ‘confounding variable’ refers to any influence that affects the cause-and-effect relationship you would like to study but lies outside of your current considerations.

Let’s assume you are studying the following sales history:

Day          Nb. of Sales
Monday       8
Tuesday      11
Wednesday    9
Thursday     2
Friday       15
Saturday     15
Sunday       12

So what happened on Thursday? Maybe mid-week is just the slow period for you? It could, however, also be that there are some confounding variables at play.

One basic question one should start with is the following: Were clients able to buy all the time or was there some obstacle preventing sales?

One such factor could be website downtime if you are, for example, an eCommerce business. In our example above, maybe the webshop was down from Wednesday night to Thursday night. This is very valuable information to store. There might also be some “soft downtimes” if, for example, a small bug in the shop software prevented one product category from being shown properly in its usual place.

Even with a functioning web shop, you cannot sell what you do not have (except in the cases where one allows backorders). Stockouts are one of the major forms of confounding variables we encounter frequently. Low observed demand might just be caused by low stock or stockouts of a top-selling product. In some businesses, stock levels are visible to clients, either in a reduced way when there is little stock, to indicate urgency to the client (“Only two articles in size 40 left”), or sometimes fully (“300 pieces available”), to demonstrate stock health to potential big customers. The latter is known as a facing quantity.

Without understanding the relationship between stock levels and sales, demand can be critically underestimated: in our example, we may have encountered a stockout on Wednesday evening. On Thursday morning, we might have inbounded two returns that were sold right away, before we received a replenishment ahead of the start of business on Friday. (In real life, this usually drags out a bit longer and you might see a longer period of low-sales days, depending on the length of your replenishment cycle.) Note that keeping only the stockout history would probably not have caught this issue in the example, since at some point on Thursday two pieces were in stock. Still, clients were not able to buy freely, since the sales of the day were capped by the two pieces in stock.

Keeping a history of stock levels, or at least a history of stockouts per product, is therefore very much advised.
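With a stock level history at hand, days on which sales were capped by the available stock can be flagged before they enter a demand analysis. The sketch below reuses the sales figures of the example week; the stock and inbound figures, the DayRecord structure and the capping rule are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DayRecord:
    day: str
    sales: int
    opening_stock: int  # units in stock at the start of the day
    inbound: int        # units received during the day (returns, replenishment)


def capped_days(records):
    """Days on which everything available was sold, so the observed
    sales are only a lower bound on the actual demand."""
    return [r.day for r in records if r.sales >= r.opening_stock + r.inbound]


# Sales as in the example week; stock and inbound figures are hypothetical.
week = [
    DayRecord("Monday", 8, 28, 0),
    DayRecord("Tuesday", 11, 20, 0),
    DayRecord("Wednesday", 9, 9, 0),   # sold out by the evening
    DayRecord("Thursday", 2, 0, 2),    # only two returns available
    DayRecord("Friday", 15, 0, 60),    # replenishment before opening
    DayRecord("Saturday", 15, 45, 0),
    DayRecord("Sunday", 12, 30, 0),
]
print(capped_days(week))
```

In this made-up week, both Wednesday and Thursday are flagged, although a pure stockout history would only show an empty shelf between Wednesday evening and Thursday morning. Flagged days are better treated as "demand was at least this high" than as genuine low-demand days.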

Another confounding variable may be the evolution of prices. Depending on your vertical, your clients may be more or less sensitive to price changes, and therefore knowing the selling price of each sale might give you important context for the demand. In our example above, the product might have been discounted on all days of the week apart from Thursday.

Unfortunately, this only gives you the demand at the price you had on a certain day. It is not possible to “replay” the demand of a given day at another price to study what would have happened if you had discounted or up-priced more. Here, a more dedicated price elasticity study would be necessary. On the other hand, knowing the price of each sale allows you to establish a certain baseline around the main price classes (such as original price, -20%, -30%), so that you can understand which periods in history were associated with inflated demand due to discounts, or deflated demand due to price surges.
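Such a baseline per price class can be sketched as follows. This is an illustration under assumptions, not a price elasticity model: the function names, the three price classes and the sample figures are hypothetical, and price surges are not modelled (an up-priced day simply falls into the "original price" class here).

```python
from collections import defaultdict

# Assumed main price classes, as discount fractions off the original price.
PRICE_CLASSES = (0.0, 0.2, 0.3)


def price_class(original_price: float, selling_price: float) -> float:
    """Map a selling price to the nearest main discount class."""
    discount = 1 - selling_price / original_price
    return min(PRICE_CLASSES, key=lambda c: abs(c - discount))


def mean_daily_sales_by_class(days):
    """`days` is an iterable of (original_price, selling_price, units_sold),
    one entry per day. Returns the average units sold per day and class."""
    totals = defaultdict(lambda: [0, 0])  # class -> [units, number of days]
    for original_price, selling_price, units in days:
        c = price_class(original_price, selling_price)
        totals[c][0] += units
        totals[c][1] += 1
    return {c: units / n for c, (units, n) in totals.items()}


# Hypothetical history for one product with an original price of 100
history = [(100, 100, 2), (100, 80, 8), (100, 80, 12), (100, 70, 15)]
print(mean_daily_sales_by_class(history))
```

Comparing a day's sales to the baseline of its own price class, rather than to the overall average, keeps discount periods from being misread as a general rise in demand.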

How to cope with confounding bias

Well, the ideal course of action is to study all possible confounding variables and make them part of your considerations, so that they become genuine variables of your analysis.

To be able to do so, understanding how the business works is crucial. The key takeaway for me regarding confounding bias is that looking at the data alone is not enough. It is analyzing the business together with the data that makes a data investigation meaningful and leads to results the business can use.

In some cases, this might mean starting to track or snapshot some extra data (downtimes, stock levels, etc.), which usually means a little extra effort and some more storage consumption, but it can pay off quite well.