The Art of Dealing with Data Instability

Carrying out any data driven initiative for larger organisations is hard. And this is not just because tools like deep learning or any other data science-type method is complicated to do, but just getting largish (let’s say > 1 Mio. lines) data from a few source systems to your workspace system, processing it and producing meaningful results relevant to an ever-changing business strategy and all of this in production grade quality is highly non trivial.

Pipelines break easily for various reasons, starting from scheduling crashing because of daylight saving time switches to timeouts because of increased data volumes, expiring authentifications or crashing source systems. However, production grade initiatives need to be able to cope with this by having a Plan B for this sort of situation and also being able to quickly recover once the issue is resolved.

Of course the speed of recovery is key when it happens, but there are also steps that one can take in advance to smooth out these situations.

Make stakeholders aware. For non-technical stakeholders of your initiative, the different steps of the data pipeline may be far from obvious. A simple diagramme with the key steps together with a timeline can become very helpful. In a world, where the 4G mobile connection may be enough to watch movies while using public transport has made people believe that any data transfer is basically instantaneous. Putting timespans and also prices per data transfer rates can do wonders to put things into context.

Figure out key GO NOGO criteria. The one piece of technology that made NASA able to put people on the moon while having 1960s technology is a meticulous system of GO NOGO criteria and checklists making sure that each piece is properly working and for thinkable eventualities are taken care of. Depending on what exactly breaks in your pipeline, the impact might range from a slight warning to “stop using results immediately” issues to all end users. We have made quite good experience by defining for each data source the maximal allowed data age, so that criticality of the incident can be immediately assessed. When, for example, a catalog table changes only little from day to day, having a two day old catalog is usually less harmful than e.g. working on a two-week old stock table.

Having a clearly defined list of criteria determining whether the produced can be used for business decisions can make matters transparent and quick to assess for everyone including the non-technical stakeholders. Being able to flag corrupted results before further use minimizes business damage and enables trust in the process and thus in the system.

This way, users can be confident and efficient in working with these results knowing that they have a “green light” on the most critical points of potential issues.

Classify problematic cases. The list of usual suspects of situations that might break or endanger your pipeline boils down to about a dozen common cases. Having a paragraph or two reading describing the potential impact on results and business descriptions can be very handy when facing these situations. In the best case, these can be communicated to stakeholders automatically when an incident occurs or, if automation is not possible, be ready for the human noticing the issue and communicating this as quickly as possible.

Don’t panic. Last but not least, things go wrong all the time and they will probably go wrong the first time the principal data engineer is mountain climbing without cell phone reception. This is not a sign of personal failure but due to the complicated nature of orchestration. Being prepared, however, makes it less stressful.

Reading Rec: The inspection paradox

I just stumbled upon this very interesting read on a specific type of observation bias, called Inspection Bias. Inspection bias is the phenomenon that happens when the point of view of the observer creates a bias. An example is that e.g. when you ride your bike on a track you may observe a lot of much slower riders and a lot faster riders than yourself just by the fact that due to your own speed keeping other people of the same speed at approximately the same distance.

Check out the article here.

What Pro Data Analysis can learn from Strava

 

If you are a cyclist or a runner, chances are that you use or have at least heard of Strava. Strava is a platform for athletes to analyze and share their activities’ data and virtually compete with one another.

With the rise of sports tracking devices tracing position and pacing via GPS and additionally measuring heart rate and elevation, Strava leverages this data that users upload to create a frame of reference for several types of activities. First founded in 2009, Strava has more than 10 Mio active members (in fact they emphasize not to call them users) in more than 195 countries. In 2017 alone, cyclists shared over 7.3 Bio km worth of rides.

In 2009, when they began as a typical Californian data start up, they were highly dependent on the hardware vendor Garmin: In fact, in the beginning uploading data to Strava was only possible directly from a Garmin device leaving early Strava at the mercy of Garmin. Today, the power dynamics have changed a lot: It is now Strava-compatibility that drives hardware sales. Automatic data synchronisation to Strava or even live Strava powered analytics during an activity enable not only Garmin but also their competitors like Polar and Wahoo to sell their newest generations of devices.

How do they make money

Strava’s main sources of revenue is first of all their premium membership options (59,99€ a year or 7,99€ a month).

Secondly, industry partners can sponsor challenges, that is specific goals on a specific time frame, Strava members can commit to to motivate themselves. One example is the Rapha 500 Challenge [1] of the bike vendor Rapha challenging its participants to ride 500km between Christmas Eve and New Year’s Eve.

As Big Data company, the selling of data to third party is also part of Strava’s business. For now, they are committed to share data only in an aggregated and therefore anonymized form with partners that are aligned with Strava’s vision of enabling and helping athletes. Notably, the project Strava Metro [2] aims to partner with city planners around the globe to make e.g. bike paths and most frequent bike tracks safer. On their website, you can find a case study of the partnership with the Seattle Department of Transportation.

By 2018, Strava has yet to become profitable.

My personal use cases

To navigate on my bike following pre-planned tracks, I bought a Garmin Edge 800 a few years ago. For this, I create a GPX, short for GPS exchange format, file of my route and upload it onto my Garmin device.

GPX track covering both the Gampen Pass and Mendel Pass in italy planned on GPSies

 

GPX is an XML schema which can be used to track GPS based waypoints  and routes together with timestamps. In this format, one can both store pre-planned routes which can be later used for navigation as well as recording timestamps when passing these waypoints on a bike ride or a run. I use my Garmin both for navigation as well as recording, but any smartphone can do the trick as well (within its battery limitations).

When I ride, I have my track as a purple line embedded into a map [3] that I can follow to pick the right turns. After my ride, I use Garmin’s own software Garmin Express to read out the recorded GPS/time data as well as my heart rate. It is automatically transferred to the Garmin platform Garmin Connect. Garmin connect offers similar features as Strava while being restricted to its own devices. In my opinion, their dashboard composition used to be a bit messy. The new modern look has improved matters quite a lot, however this was too late and many users like myself went to look for alternatives.

Garmin exposes data of newly created activities to Strava via an API, automatically uploading any rides making them visible to my community of friends and acquaintances. Over there, I get an instant analysis of my ride: How I performed on pre-defined segments during my ride:  Did I hit a personal best? Have I been able to score top 10 for women? How do I rank compared to my friends that have done this particular segment as well? Subsequently, my activity becomes visible to my friends (or to the world if I chose so).

Activity on Strava

 

From a data analysis perspective, Strava does a few things well from which the pro data analysis world could benefit as well.

 

Easy and powerful visualizations and tracking tools geared for its user base yield a powerful Business Intelligence

One of the main challenges for most amateur athletes is to keep up motivation to continue with one’s sport. On the one hand, it’s the community part that allows sharing your passion, but as well your challenges with your friends in real life like your bike club or with other like minded people that you know only online. 

On the other hand, you can follow your own progress and try to beat your past self. I particularly like the feature of tracking the number of weekly activities, the overall length and elevation gain to motivate myself to keep up my rides and training.

Strava, of course, benefits from the fact that their are a lot of canonical KPIs for sports activities such as distance covered, speed, heart rate, elevation gain etc. that quite easily open the door to make a sports tracking platform’s insights relevant and meaningful for the user.

Neatly visualizing this data adapted to the needs of the particular type of sport is on the other hand much more difficult. In my opinion, Strava’s success is mainly due to its strength there outperforming Garmin with a little-cluttered interface and visuals.

Well incentivized community dynamics keep the platform and its data relevant

The heart of Strava’s data analysis capabilities are segments. Segments are short tracks of variable length between two points which cover a part of a road or a route. A typical example would be the start of a slope of a mountain to its highest point. Users can create segments themselves, but also flag segments as duplicates or irrelevant (sprints of only a couple of meters).

Even though Strava has recently invested in getting rid of most obvious duplicates in segments, it mostly relies on the communities to do their own clean ups: A lot of Strava members develop quite some enthusiasm to curate the most relevant segments that appear on their routes in order to track and showcase their performances.

The same is true for fraudulent data: If you track your “performance” on an e-Bike or a motorcycle in order to score a good ranking on an at least moderately frequented segment, you can be sure that other Top 10-candidates will be quick to report the activity to get a “fair” ranking.

This principle of community police as curator allows to avoid one of the most common threats of any Big Data endeavour, namely the loss of meaning of data due to spam and irrelevance.

Gather your stars as marketers

On the one hand, some KOMs and QOMs, short for King of the mountain and Queen of the mountain denoting the respective leader on a segment, are pretty much completely out of reach if a major competition has traversed one’s territory, it is invariably cool that one can also follow people like Romain Bardet (who is competing in the Tour de France at the moment) and see how they perform on your favourite segment.

Below is a screenshot of a segment that I just rode. – A segment that has been part of this year’s edition of the Giro d’Italia race allowing to compare world class cyclists Romain Bardet and Vicenzo Nibali.

Having the industry stars on ones platform is a great marketing coup to showcase ones functionalities and gives professional athletes a platform to interact with their fans.

A thought-through Premium membership principle

Currently, I use the free membership option of Strava. The premium option would allow me to get more detailed analysis such as power meter analysis, live feedback and personalized coaching to reach more customized goals.

One fun example of what could be gained from a premium membership is the possibility to get live segment information during my ride: I would see exactly how I’d need to perform to score a good ranking on say my favorite hill. Strava’s philosophy is that most people will sign up for the free option and quite steadily go for the premium option once they have been with Strava for a while.

And even while you are not paying, you are still contributing to the richness of data accumulated and curated in Strava. While Strava has yet to reach profitability, this balance seems to be quite powerful for Strava to generate value to both members and partners.

The big key word for the future of Strava is ‘Discovery’: Assume you travel to a new city and you want to go for a run. Strava knows your typical distance and whether you like hilly terrain or flats and can recommend you routes that other athletes just like you do in this particular city. To which extent this will be part of the premium part of Strava, is not yet clear, but to me, these kinds of recommendations would be very valuable and something, I would definitely consider paying for.

Grow with challenges

As a data company/social media platform, you are under constant public scrutiny. In the beginning of the year, a story broke of a secret US military air base [5] being exposed on a Strava heat map: Soldiers had been recording their training as ‘public’ on Strava. Even though the data was anonymized, having a well-frequented running course in an Afghanistan desert left not much room for speculation.

Even though one can clearly argue that this incident was largely due to the carelessness of the people uploading their data publicly without second thought, this still is a challenge for a community to educate its members on the consequences of privacy. This holds both for Strava itself, but also for mainstream journalism who mistakenly called this a ‘data leak’ or ‘data breach’ which it most definitely wasn’t.

Strava itself took action to highlight in detail the opting out possibilities in order to avoid these including the introduction of a minimum numbers of activities for a path to show up on any heat map. Furthermore, heat maps are refreshed regularly so that activities that are later made private no longer show up. This means that even if a group of soldiers mistakenly uploads an activity of a secret location, they can still take action to have it be hidden and further damage can be avoided.

Another ongoing discussion that concerns a far greater base of users is the possibility to opt out of certain aspects of data sharing. Unfortunately, many athletes, in particular women, are not comfortable sharing timed location data of their runs publicly, since it could be very easy for stalkers or even attackers to guess patterns and pose a serious threat. For now, the only option is to not share an activity at all publicly. Strava has said that they currently explore of how to make only parts of an activity publicly visible while still integrating the other relevant parts of the activity.  

 

Notes

[1] Rapha 500 Challenge, see e.g. the 2016 version here:https://www.strava.com/challenges/rapha-festive500-2016

[2] Strava Metro: Insights business for city planning https://metro.strava.com/

[3] For planning, I use www.gpsies.com. Strava offers its own planning tool, Strava routes https://www.strava.com/routes. Out of say a bit arbitrary historic reasons, this tool however has not yet gained much traction in my friend cycle after the tool recommended to one of our more passionate road cyclist to use a less than optimal gravel path for his precious road bike.

[4] Open Street Map https://www.openstreetmap.org/ is an open source mapping database permitting to download any map selection in Garmin-compatible formats

[5] BBC article on the Strava Military Base incident from 28/01/2018 https://www.bbc.com/news/technology-42853072

Do you also have Peter’s Problem?

Pet·ers Pro·blem.

The set of problems that arise if one’s online profile is incorrect, resulting in unsuitable online advertising and ratings which may limit your access to information and services regulated by these ratings and clustering based on one’s profile.

In his latest novel (1), the German cabaret artist Marc-Uwe Kling envisions an absurdist post-modern future in a country that for marketing reason chose to rename itself “Qualityland” and of which (also for marketing reasons) all inhabitants must only use superlatives when speaking about it.

The protagonist, Peter Arbeitsloser (Peter Unemployed – to increase transparency, children in Qualityland take their parents’ job title at the time of birth as surname), suffers from a particular problem: The major online actors, TheShopThe world’s most popular online retailer -, which sends you products you will like based on your profile, the social media platform Everybody – everybody is on Everybody! and as well as QualityPartner the dating service that chooses your partner for you-, have gotten it wrong:  Peter receives things he has no use for, he is made to hang out with people he does not like and is dating someone he cannot stand. All of it is due to the fact that people have completely surrendered to the suggestions and recommendations of the underlying algorithms making it virtually impossible to consciously make one’s own choices.

Now Peter has been misclassified by all of these platforms. The most obvious consequences like the TheShop sending him highly inappropriate dolphin-shaped items seem like nothing more than a nuisance at first glance. Much more problematic though is the fact that Qualityland’s society has decided to also completely rely on one’s online profile for any sort of aptitude tests, such as recruiting or any sort of social stance. Much like Black Mirror’s episode Nosedive,  one’s profile is condensed into a personal score which functions as a shortcut to one’s worth in society.

Peter, at some point, drops down to a 9 out of a 100 and is therefore part of the “useless” category of society, banning him from most of the daily life. (For real life examples of this see (4))

So, what does Peter’s tale wants us to do? To stop using our favorite online services is hardly an option. But to take ownership of our profiles and to what purpose it is used might be worthwhile.

The twist however is where this highly amusing piece of fiction comes to its limits: Having Peter’s problem becomes less and less likely. Already today, having access to someone’s “likes” on social media makes algorithms far more capable of judging one’s personality than ones acquaintances, friends and family. (2)

While the current mood tends to go to more transparency toward what our data is used for, it must still be clear, that once we create data points in public, conclusions can be drawn from that. This includes also not quite voluntarily produced data aka data collection of which we cannot opt out easily, such as face recognition in public – see again (4))

It is the far greater challenge for us to decide, what we, both as citizens and consumers, would like to do with this data and how we would like to steer our ownership of it.

As organizations collecting these personal profiles, the need for a transparent usage policy of these profiles, that is a

  • clearly defined data model/structure,
  • an explicit formulation of intent of usage visible to the user,
  • and a defined data life cycle from collection to an option of deletion (e.g. flow chart style)

becomes, in my opinion, more and more important. First, because of  legal reasons (5), and secondly, to continue to provide any “data driven” added value of quality (pun intended) based on correct and consciously given data.

Notes:

(1) For now, Qualityland is only available in German – but Marc-Uwe Kling’s first series The Kangaroo Chronicles is already available in English.

(2) Private traits and attributes are predictable from digital records of human behavior.  Michal KosinskiDavid Stillwell, and Thore Graepel. 

(3) If you’re interested in what kind of conclusions Google and co have drawn from your online behavior, check out the following links:

(4) The Chinese Government seems to also to be quite ambitious in this direction.

(5) GDPR is in place since May 25th 2018.

Heute Mathe, morgen Lokad

Am 12. Juni 2018 bin ich auf Kurzbesuch in meiner Uni, an der TU Darmstadt und halte einen Vortrag im Seminar ‘Heute Mathe Morgen ?’ und spreche über meine in Paris ansässige Firma Lokad:

12.06.18 — 13:30 Uhr in S2|15 51

Wer Interesse hat mehr zu erfahren über:

  • Deep Learning / Machine Learning in der Industrie,
  • Quantitative Logistik Optimierung,
  • was endliche l_1-Folgen mit Logistik zu tun haben,
  • Arbeiten in Frankreich,

oder wer tendenziell Interesse hat

ist herzlich willkommen!

Bitte teilen!

 

On June 12th  I have the great honor to be back in Darmstadt! I’ll be speaking in the alumni seminar ‘Heute Mathe Morgen ?’

Deep Learning: What it is and how it relates to supply chains

Disclaimer: In this post I’m going to write about how we use Deep Learning in my company, Lokad.

 

When you follow the news about deep learning, you might have come across exciting breakthroughs such as algorithms which are able to colorize black and white photographs or automatic real-life translations of texts on pictures taken by a phone app .  While these are all pretty cool applications, they do not immediately give any direct use cases for most traditional businesses. At Lokad, our goal is to translate the stunning reach of deep learning capabilities into the real world, to optimize supply chains everywhere.

So, before going into detail how we do that, let me quickly and very roughly summarize what Deep Learning actually entails without going too much into technical details.

First of all, deep learning is a flavor of machine learning. Regular non-machine learning algorithms require full prior knowledge on the task (and no training data whatsoever). An expert-knowledge approach to demand forecasting would require you to specify in advance all specific rules and patterns such as

“All articles that have category=Spring will peak in May and slowly die down until October.”

This, however may only be true for some products of this category. It is also possible that  there might be subcategories that behave a bit differently and so on. Combining these with a moving average forecast already yields an overall understanding of future demand which is not so far from reality.

However it does have the following downsides:

  1. It does not embrace uncertainty — In our experience, risk and uncertainty are crucial for supply chains, since it’s mostly the boundary cases that can be either very profitable or very costly if ignored,
  2. You have to maintain and manage the complexity of your rule set – An approach is only as powerful as the set of rules that are applied to it. Maintaining rules is very costly i.e. for each rule in the algorithm, we calculate there is an initial cost of about 1 man-day of implementation, testing and proper documentation initially and about half a day of maintenance. Assuming you keep on refining your rules and therefore have to readjust the old ones this yields a cost of 8k € per rule for a five year period. It is worth noting that this only applies  for one rule and does not take into account the exponential increase in complexity that arises when dealing with more complex product portfolios. Even demand patterns for small businesses usually exhibit dozens of influences making their maintenance incredibly costly.

Now imagine that there is a technology that could, like a human child, learn on its own to deduce patterns from data and could thus independently predict how your portfolio of products develop throughout a year.

Just like a child in development, a deep learning algorithm will try to make sense of the world by trying to deduce correlations from observations. It will test them and discard those that do not make any sense for the remaining data.

Again following our analogy, like a child learning to makes sense of the world, a deep learning algorithm is consuming lots and lots of data and the key lies in grasping the information that is actually relevant. While a child in a big city might be completely overwhelmed with all the different colors, noises and smells, it will learn later that the traffic lights are the ones to watch out for in combination with noises coming from approaching cars that are most critical when crossing a street. The same mechanism is in place for deep learning. The algorithm may process a vast amount of data and needs to find out the essence of what drives demand.

The way to figure out what is important and what is not is carried out via repeating similar situations several times, like you would repeat correct traffic behavior with a child. A human brain is highly parallelizing its sensory input processing and reaction, so that it is able to react quickly to urgent new data such as a car that is approaching while crossing the street.

With the rise of big data, parallelization became also a key topic driving efficiency and, in fact, feasibility of a “human-like” autonomous learning process.

At Lokad, we actually use the parallelized computing power high end gaming graphic cards in our cloud servers, to efficiently run our optimization for our clients, processing for a portfolio of 10.000 products with five years of sales data in less half an hour while largely outperforming any conventional moving average based algorithms (or even Lokad’s own earlier generation machine learning forecasts) in accuracy.  

Lokad then uses the demand forecasting results which come in a probabilistic format to optimize the supply chain decisions taking into account economic drivers such as one’s stance on growth vs. profitability. With these analyses, Lokad directly delivers the best supply chain  decisions such as purchase orders or dispatching decisions. “Best” here refers to the economic driver set up (i.e. growth vs. profitability, opportunity costs etc.) that has been put in place supply chain decisions. It it will scale with the business as one’s portfolio and demand patterns become more complex making any hard coded demand forecasting rules which need to be maintained by a human completely obsolete.  

 

Notes:

Average Developer salary in Germany 58k €, 261 working days – 30 days of vacation in 2018 yields a 250€ manday rate)

Technology: A tool or an agent?

Last week, the CEO of Google, Sundar Pichai, presented in a quite impressive  keynote the capacities of their new digital assistant Google Duplex. They showed how their assistant was able to make calls to book appointments or make restaurant reservations while carrying out quite naturally sounding conversations. In fact, it was able to understand and react to ambiguity and could deal with information that was not quite what it originally asked for – Quite a leap forward from the very limited interactions and tasks that for example Apple’s Siri used to be able to perform.

Sparked by this presentation, this article juxtaposes the most dominating two views on the position of technology currently exposed by the main technology providers:

One view is that technology mainly serves as a tool for people, as sort of a “bicycle of the mind” enhancing human capabilities. This view is notably demonstrated by the philosophies of Apple and Microsoft. Both these companies come from a background of being the main drivers of personal computing and come from the same era in the 1970’s.

The other view is that technology is supposed to act as an agent, that is by carrying out tasks for you independently. This view is attributed to Google and Facebook – companies that are much younger than Apple and Microsoft from the internet era. In fact, this is exactly what the new abilities of Duplex should show you: It is able to carry out tedious tasks, such as making appointments, for you instead of you. In this second scope, you could also place the autonomously driving cars.

One of the main implications of these different view points is their different ethical setup. While in the tool model, the users – the ones literally using the tools – are always in control and therefore assume responsibility of the actions of operating these tools. On the other hand, the question of responsibility becomes less clear in the second model when technology has its own agency. If this technological agent, say your autonomous digital assistant or your autonomous car does something that goes against your intentions? Who is at fault? You, the technology provider, or maybe the tech agent itself? What if it actually acted upon something that you wanted but maybe only subconsciously?

Therefore, defining the exact scope, the precise intentions and possible means of the actions performed by such an agent seems crucial. In my opinion, having these three things transparent is what we as users should demand from all our “agent providers” aka Google and Facebook.

Repost: Can an algorithm beat VOGUE in predicting fashion trends?

What is Fashion’s next big trend is a three digit billion dollar question. It is however not a question, that, personally, I thought I wouldn’t have much to contribute to, given that I work in a technology company, where the biggest trend seems to be goofy meme t-shirts. So, naturally, our approach to fashion trends is completely non-traditional and does not rely on our own fashion expertise (which is probably a good thing).

Traditionally, trend scouting starts by closely following high fashion designers and fashion shows, which may be picked up later by the mainstream fashion industry.

Other trend sources can come from films and TV series, influences from YouTube and the blogosphere. To determine exactly what will sell in the next seasons used to be a question only industry experts, with decades of experience, could determine. Then, it was more of a guessing game, when a trend wave would pick up and how long it would last before slowly ebbing out or being completely consumed by another trend.

At Lokad, our data scientists have developed algorithms to revolutionize this trend scouting process, allowing for large scale reliable and accurate sales predictions based on prior seasons’ sales history. Furthermore, while trend-scouting may be the holy grail of purchasing decisions for fashion organisation, identifying statistical noise is equally important and usually completely overlooked. Excluding noise in demand, that is observed “links” between products that are happening at random without any particular trend connection, can boost the effectiveness of any buying decisions.

To find out what actually drives sales, that is what attributes of a product make customers buy a product is somewhat elusive. Defining a product for now is more of an art than a science. At Lokad, our quantitative approach allows us to study precisely what has triggered sales, may it be a fancy product name, a combination of price, value of material or may be a certain visible accessory, which made a customer choose one product over other similar products.

In particular, Lokad can leverage similarities between products to predict sales quantities of products that have never been sold anywhere by comparing it to sales of other products in the past. To look for similarities between products, Lokad analyses not only product attributes such as materials or colours, but also envisions leveraging the selling history of customers.

Identifying early trend adopters within a customer base allows Lokad to obtain a next season “preview”: for example, imagine you have that one very fashionable colleague who always wears a certain colour before everyone else does. With big data crunching tools, Lokad can identify this colleague and through their current shopping habits can pick up what many others will want to buy next season.

However, in the end, there is no way around a team of fashion experts who can distinguish the mega trends from the “ordinary” ones. However, with Lokad, this expert group can leverage their insights on a much higher scale with high statistically sound computational power – minimizing costs for any brand or market place. This is what we refer to as augmented human intelligence, putting the enablement of the expert group as the ultimate goal of our work.

 

This article was first published on LinkedIn on October 14, 2017. 

Disclaimer: Katharina currently works as a supply chain scientist at Lokad.