The Art of Dealing with Data Instability

Carrying out any data-driven initiative in a larger organisation is hard. And this is not just because tools like deep learning or any other data-science method are complicated to apply. Simply getting largish data (say, more than a million rows) from a few source systems to your workspace system, processing it, and producing meaningful results relevant to an ever-changing business strategy, all of this in production-grade quality, is highly non-trivial.

Pipelines break easily and for various reasons, ranging from schedulers crashing because of daylight-saving-time switches to timeouts caused by increased data volumes, expiring credentials or crashing source systems. Production-grade initiatives need to be able to cope with this by having a Plan B for this sort of situation and by being able to recover quickly once the issue is resolved.
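
What such a Plan B looks like depends heavily on your stack, but the basic pattern is often the same: retry with backoff, and if the source stays unreachable, fall back to the last good snapshot while clearly marking the data as stale. Below is a minimal sketch in Python; `extract` and `load_last_snapshot` are hypothetical callables standing in for whatever your pipeline actually uses.

```python
import logging
import time

log = logging.getLogger("pipeline")

def fetch_with_fallback(extract, load_last_snapshot, retries=3, base_delay=60):
    """Try the live extract a few times; fall back to the last good snapshot."""
    for attempt in range(1, retries + 1):
        try:
            return extract(), "fresh"
        except Exception as exc:  # timeouts, expired credentials, source down, ...
            log.warning("Extract failed (attempt %d/%d): %s", attempt, retries, exc)
            time.sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
    log.error("All extract attempts failed, falling back to the last snapshot")
    return load_last_snapshot(), "stale"
```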

Of course, the speed of recovery is key when something does break, but there are also steps one can take in advance to smooth out these situations.

Make stakeholders aware. For non-technical stakeholders of your initiative, the different steps of the data pipeline may be far from obvious. A simple diagram of the key steps together with a timeline can be very helpful. A world in which a 4G mobile connection is enough to watch movies on public transport has made people believe that any data transfer is basically instantaneous. Putting actual timespans, and even prices, next to data volumes and transfer rates can do wonders to put things into context.
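
A back-of-the-envelope calculation like the following already changes the conversation; the numbers are purely illustrative assumptions, not measurements from a real system.

```python
# Illustrative numbers only: a nightly export of a largish table
table_size_gb = 50          # size of the export
bandwidth_mbit_s = 100      # effective throughput between source and workspace

size_mbit = table_size_gb * 8 * 1000           # GB -> megabits (decimal units)
transfer_seconds = size_mbit / bandwidth_mbit_s
print(f"Transfer takes ~{transfer_seconds / 3600:.1f} hours")  # ~1.1 hours, not seconds
```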

Figure out key GO/NO-GO criteria. The one thing that enabled NASA to put people on the Moon with 1960s technology was a meticulous system of GO/NO-GO criteria and checklists, making sure that each piece works properly and that every thinkable eventuality is taken care of. Depending on what exactly breaks in your pipeline, the impact might range from a slight warning to a “stop using the results immediately” message to all end users. We have had good experience with defining, for each data source, the maximum allowed data age, so that the criticality of an incident can be assessed immediately. When, for example, a catalog table changes only a little from day to day, a two-day-old catalog is usually less harmful than, say, working with a two-week-old stock table.
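
As a sketch of what such an age-based criterion can look like in practice (source names and thresholds below are made up for illustration, the real values come from the business owners):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-source thresholds for the maximum allowed data age.
MAX_AGE = {
    "catalog": timedelta(days=2),
    "stock":   timedelta(hours=12),
    "orders":  timedelta(hours=4),
}

def check_data_age(source: str, last_loaded: datetime) -> str:
    """Return GO or NO-GO for one source; last_loaded must be timezone-aware (UTC)."""
    age = datetime.now(timezone.utc) - last_loaded
    return "GO" if age <= MAX_AGE[source] else "NO-GO"
```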

Having a clearly defined list of criteria determining whether the produced results can be used for business decisions makes matters transparent and quick to assess for everyone, including non-technical stakeholders. Being able to flag corrupted results before further use minimizes business damage and builds trust in the process and thus in the system.

This way, users can work with the results confidently and efficiently, knowing that they have a “green light” on the most critical potential issues.
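
One way to make that green light explicit is to publish a small status report together with the results. The sketch below assumes the per-source age check from above plus whatever other checks you define; names and values are illustrative.

```python
def go_nogo_report(checks: dict) -> dict:
    """Aggregate individual GO/NO-GO checks into one flag shipped with the results."""
    overall = "GO" if all(v == "GO" for v in checks.values()) else "NO-GO"
    return {"overall": overall, "checks": checks}

# Example: combine the age checks with any other criteria you track.
report = go_nogo_report({
    "catalog_age": "GO",
    "stock_age":   "NO-GO",   # stock table too old -> results must not be used
    "row_counts":  "GO",
})
print(report["overall"])      # NO-GO
```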

Classify problematic cases. The list of usual suspects that might break or endanger your pipeline boils down to about a dozen common cases. Having a paragraph or two for each, describing the potential impact on results and business decisions, can be very handy when facing these situations. In the best case, these descriptions can be communicated to stakeholders automatically when an incident occurs or, if automation is not possible, be ready at hand for whoever notices the issue, so that it can be communicated as quickly as possible.
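
A simple way to keep those descriptions ready for automation is a small catalogue of incident classes with canned impact texts; the classes and wording below are invented for illustration.

```python
INCIDENT_CATALOG = {
    "source_down": (
        "A source system is unreachable. Results are based on yesterday's data; "
        "figures depending on today's loads should not be used."
    ),
    "expired_credentials": (
        "Authentication to a source system expired. No data loss, but the refresh "
        "is delayed until the credentials are renewed."
    ),
    "schedule_skipped": (
        "A scheduled run did not start (e.g. daylight-saving switch). The next run "
        "will catch up; intermediate results may be one cycle old."
    ),
}

def incident_message(kind: str, source: str) -> str:
    """Build the stakeholder message for a classified incident."""
    return f"[{kind}] {source}: {INCIDENT_CATALOG[kind]}"

# e.g. posted to a mailing list or chat channel by the monitoring job
print(incident_message("source_down", "stock"))
```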

Don’t panic. Last but not least, things go wrong all the time, and they will probably go wrong the first time the principal data engineer is mountain climbing without cell-phone reception. This is not a sign of personal failure but a consequence of the complicated nature of orchestration. Being prepared, however, makes it less stressful.