Should a knowledge graph replace commercial scientific journals?

The current system of contributing to science via publications is inherently flawed:

It only incentivises additions to public knowledge through writing more articles, and gives little to no credit to those who review (mostly anonymously), those who spot mistakes or those who actually fix them. To advance in their careers, researchers are generally measured by the number of their publications and the reputation of the journals they appear in.

Furthermore, journals make a profit while authors, editors and reviewers either do their work for free or even have to pay to publish (e.g. for an open access licence).

In the special case of mathematics, all scientific results are self-contained and can, in theory, be verified by anyone knowledgeable in the field (e.g. no special equipment is necessary).

Wouldn’t it make more sense to craft a knowledge tree that semantically links to previous results, instead of a linear journal?

For illustration, let’s consider a new paper in mathematics that publishes, say, three theorems. In the proofs, several older results are used and quoted via citation.

The reader then either has to trust the author or the peer review process that these older results have indeed been proven correctly in the other papers (as well as their quoted results), or she has to verify the cited results and their quoted results until arriving at the axioms.

In some cases, mistakes happen, and a quoted result does not (yet) exist or is even provably false. However, there is no immediate incentive (apart from the author’s honor code) to correct mistakes. Thus, some unsuspecting grad student might continue to quote a wrong or only partly proven result, and either has to figure out the mistake themselves or falls into the trap and builds on top of this possibly false assertion.

I propose to instead build a semantic knowledge graph, in the following style:

As a researcher, you can contribute in several ways:

1. You can add a new node, that is, a new theorem, by submitting the statement and its proof together with links to the correct quoted parent theorem nodes.
2. You can challenge an existing edge of the graph if you spot holes in a proof or have a counterexample.
3. You can formalize existing results into the knowledge graph.

Item 3 could be a nice source for students’ theses. The role of editors and reviewers could be to function as administrators for new additions to the graph.

Advantages:

The knowledge graph becomes crawlable. Issues with a parent node can automatically be inherited by child theorem nodes.
Contributions to rectify issues with papers can be acknowledged by the scientific community as a valuable contribution.
Automatic proof checking becomes an option if the proofs are formalized (see e.g. Lean).
Researchers might become quicker in discovering related results.
AI can function as a proof helper, either auto-completing (like ChatGPT for coding) or even suggesting noteworthy theorems that are corollaries to current knowledge.
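
To make the first advantage a bit more concrete, here is a minimal Python sketch of how an issue raised on one theorem could be passed on to everything that depends on it. The node names and the `parents` dictionary are invented for illustration; a real system would need a proper data model behind it.

```python
from collections import deque

# Hypothetical toy graph: each theorem maps to the parent theorems its proof quotes.
parents = {
    "axiom_A": [],
    "lemma_1": ["axiom_A"],
    "lemma_2": ["axiom_A"],
    "theorem_3": ["lemma_1", "lemma_2"],
    "corollary_4": ["theorem_3"],
    "theorem_5": ["lemma_2"],
}


def affected_by(challenged, parents):
    """Return all nodes whose proofs depend, directly or indirectly, on `challenged`."""
    # Invert the parent links so we can walk from a node to its children.
    children = {node: [] for node in parents}
    for node, node_parents in parents.items():
        for parent in node_parents:
            children[parent].append(node)

    # Breadth-first walk over everything downstream of the challenged node.
    affected = set()
    queue = deque([challenged])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


print(sorted(affected_by("lemma_1", parents)))  # ['corollary_4', 'theorem_3']
```

If someone challenges `lemma_1`, both `theorem_3` and `corollary_4` could be flagged automatically, while `theorem_5` stays untouched.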

Disadvantages / Challenges

There needs to be software and infrastructure to run the knowledge graph.
Change Management in the scientific community would have to take place.
Commercial journals might oppose this heavily.
The graph might quickly become “unreadable” due to complex combinations. One would need a suitable filtering / tag structure in order to be able to see only relevant content.

Publication of our EU tender

I’m excited to announce that we have just published our tender for our “Supply Chain Cockpit” on the EU’s public tender platform:

https://ted.europa.eu/en/notice/-/detail/122003-2024

Since my company, DB Fernverkehr AG (long distance trains in Germany) is publicly owned, we have to follow EU tender laws for all tenders that exceed a certain volume.
I’m very grateful for the great work of my cross-functional team of supply chain specialists, architects, business analysts etc., who have made this possible.

Composing a tender for an AI-based SaaS for a complex supply chain has been quite an exciting challenge and I’m eager to see how the market reacts.

Important: Please use DB Fernverkehr’s bidding platform for any communication regarding the tender.

Data facts for pregnancy

This might be an interesting link for all parents-to-be who are in need of scientific facts about early pregnancy:

Miscarriage Odds Reassurer

I find that most online resources for parents lack scientific evidence. For me, actual evidence-based data is the best reassurance…

Cycling & Pregnancy

For anyone who speaks German: I have written a post about my experience of cycling (road bike) during my two pregnancies, including some hacks on how to enjoy riding throughout those quite intensive nine months. I included some data on the heart rate zones I target, as well as the progress of my performance indicators during all three trimesters.

There were scarcely any resources on the subject on the internet, so I hope this might be useful to other people in the future.

Rennrad Training in der Schwangerschaft

Data Science Resources

In this post, I will collect some online data science resources such as online classes, blog posts etc. that I personally find very useful. (Ratings: ⭐ useful, ⭐⭐ very well done, ⭐⭐⭐ recommend absolutely)

Coursera: IBM Data Science Professional Certificate ⭐⭐
This class is a perfect introduction for anyone entering the field. It does not assume much prior knowledge, even though some prior programming experience is helpful. Even if most of it is not completely new material, as was the case for me and probably for most people, it is a great refresher: if you already know some topics, you will get through the exercises and labs more quickly, i.e. you won’t lose money because of it.

Pricing: around 35€ / month.

Pros:

Very comprehensive: Covering theoretic methodology, SQL, Python tutorials, visualisation, etc.
Nicely done quizzes and labs.
The community is quite active in the forums: Your questions will be answered quickly.
The class gives an overview of the IBM data science framework around Watson Studio.
Not very expensive. The more time you have, the less you spend.

Cons:

Within the class, some of the individual classes are a bit redundant.
Lecturers have varying presentational skills.
The final project for the last class (Predicting car accident severity based on a police accident report data set) was less well curated.
It does not go beyond IBM tools.
There is little information or material on how to “production-ize” a machine learning project.

Coursera: Deep Learning Specialization (deeplearning.ai) ⭐⭐⭐
This class picks up where the IBM classes stopped. Once you have mastered the foundations of machine learning such as logistic regression, you start with more complex concepts such as neural networks. I find this class extremely well curated. This doesn’t come as a surprise, since Andrew Ng, the course lecturer, is also a co-founder of Coursera, so it makes sense that he is able to leverage this platform to its fullest.

Pricing: 41€/month

Pros:

Andrew Ng’s tutorial videos are extremely well done and easy to follow.
The labs are interactive (auto-correcting) and count towards your final grade. (Usually on Coursera, it’s quizzes and peer-reviewed assignments.)
The community support by the organizer team is very quick and comprehensive.

Cons:

The forums are a bit spammed sometimes and it is sometimes hard to find a relevant post. This is probably a general Coursera problem.

I’m currently enrolled in this class, so this verdict is not final.

Ranked 6th at Walmart Data Science Competition

Before Covid-19 hit in France, my colleague Rafael de Rezende asked me if I wanted to join his team to compete in the Kaggle M5 Data Science competition on predicting Walmart sales.

The given data set consisted of a sales history of US stores in three different states in three different categories (e.g. FOOD). We did not get any information on the specific items, e.g. items were just labeled ‘FOOD-123’. Our sales data was cut at a certain moment in history and we were to predict the following weeks of sales as a probability distribution.

My role was primarily business analysis using Python Jupyter notebooks to figure out the impact of aspects of the time series such as day of the week, month-based seasonality, the impact of calendar events such as Christmas (which varied a lot depending on the representation of religions in the different states), but also the effect of food stamp distribution that varied greatly by state.

The team then used this insight to craft a multi-stage state-space model (states inactive or active) with Monte Carlo simulations to generate our predictions as negative binomial probability distributions.
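
For readers who want a feel for what such an approach looks like, here is a heavily simplified Python sketch. It is not the team’s actual model: the two-state switching rule, all parameter values, the 28-day horizon and the function names are assumptions made purely for illustration. An item switches between “active” and “inactive”, daily sales in the active state are drawn from a negative binomial distribution (sampled as a gamma-Poisson mixture), and many simulated paths are summarised as quantiles.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson sample with Knuth's method (fine for small rates)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_negative_binomial(rng, mean, dispersion):
    """Negative binomial as a gamma-Poisson mixture with the given mean."""
    lam = rng.gammavariate(dispersion, mean / dispersion)
    return sample_poisson(rng, lam)


def simulate_path(rng, horizon, p_stay_active, p_stay_inactive, mean, dispersion):
    """Total sales of one item over `horizon` days for one simulated future."""
    active = True
    total = 0
    for _ in range(horizon):
        stay = p_stay_active if active else p_stay_inactive
        if rng.random() >= stay:
            active = not active  # switch between active and inactive
        if active:
            total += sample_negative_binomial(rng, mean, dispersion)
    return total


def forecast_quantiles(n_paths=2000, horizon=28, quantiles=(0.05, 0.5, 0.95), seed=42):
    """Monte Carlo estimate of quantiles of total sales (all parameters invented)."""
    rng = random.Random(seed)
    totals = sorted(
        simulate_path(rng, horizon, 0.95, 0.80, 2.0, 1.5) for _ in range(n_paths)
    )
    return {q: totals[min(int(q * n_paths), n_paths - 1)] for q in quantiles}


print(forecast_quantiles())
```

The point of the sketch is only that the output is a distribution (here summarised by three quantiles) rather than a single number.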

If you want to know more:

Rafael’s summary on Kaggle
Lokad’s blog post
Rafael was a guest on a Lokad TV episode about the competition (23 min)

The Art of Dealing with Data Instability

Carrying out any data-driven initiative in a larger organisation is hard. This is not just because tools like deep learning or other data science methods are complicated. Simply getting largish data (let’s say > 1 million lines) from a few source systems into your workspace system, processing it and producing meaningful results that are relevant to an ever-changing business strategy, all in production-grade quality, is highly non-trivial.

Pipelines break easily for various reasons, from schedulers crashing because of daylight saving time switches, to timeouts because of increased data volumes, expiring authentication credentials or crashing source systems. However, production-grade initiatives need to be able to cope with this by having a Plan B for this sort of situation and by being able to recover quickly once the issue is resolved.

Of course the speed of recovery is key when it happens, but there are also steps that one can take in advance to smooth out these situations.

Make stakeholders aware. For non-technical stakeholders of your initiative, the different steps of the data pipeline may be far from obvious. A simple diagram with the key steps together with a timeline can be very helpful. A world in which a 4G mobile connection may be enough to watch movies on public transport has made people believe that any data transfer is basically instantaneous. Putting timespans and also prices on data transfers can do wonders to put things into context.

Figure out key GO NOGO criteria. The one piece of technology that enabled NASA to put people on the moon with 1960s technology was a meticulous system of GO NOGO criteria and checklists, making sure that each piece was working properly and that all conceivable eventualities were taken care of. Depending on what exactly breaks in your pipeline, the impact might range from a slight warning to a “stop using results immediately” notice to all end users. We have had good experience with defining the maximum allowed data age for each data source, so that the criticality of an incident can be assessed immediately. When, for example, a catalog table changes only a little from day to day, having a two-day-old catalog is usually less harmful than working on a two-week-old stock table.

Having a clearly defined list of criteria that determines whether the produced results can be used for business decisions makes matters transparent and quick to assess for everyone, including the non-technical stakeholders. Being able to flag corrupted results before further use minimizes business damage and enables trust in the process and thus in the system.

This way, users can be confident and efficient in working with these results knowing that they have a “green light” on the most critical points of potential issues.
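
As an illustration of the “maximum allowed data age” idea, a check along these lines could run before results are released. This is only a sketch: the source names, the age limits and the `go_nogo` function are made up for this example and would differ in any real pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical freshness limits per data source (values are illustrative only).
MAX_AGE = {
    "catalog": timedelta(days=2),
    "stock": timedelta(days=1),
    "sales": timedelta(days=1),
}


def go_nogo(last_refresh, now, max_age=MAX_AGE):
    """Return ("GO" or "NOGO", list of sources that are missing or too old)."""
    stale = []
    for source, limit in max_age.items():
        refreshed = last_refresh.get(source)
        if refreshed is None or now - refreshed > limit:
            stale.append(source)
    return ("NOGO" if stale else "GO"), stale


now = datetime(2020, 3, 10, 8, 0)
last_refresh = {
    "catalog": datetime(2020, 3, 8, 9, 0),  # just under two days old: acceptable
    "stock": datetime(2020, 2, 25, 6, 0),   # two weeks old: too old
    "sales": datetime(2020, 3, 10, 2, 0),   # refreshed this morning
}

status, stale = go_nogo(last_refresh, now)
print(status, stale)  # NOGO ['stock']
```

The returned list of stale sources is what you would show to stakeholders, so that they can see at a glance why the light is red.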

Classify problematic cases. The list of usual suspects of situations that might break or endanger your pipeline boils down to about a dozen common cases. Having a paragraph or two ready that describes the potential impact on results and business decisions can be very handy when facing these situations. In the best case, these can be communicated to stakeholders automatically when an incident occurs or, if automation is not possible, be ready for the human who notices the issue to communicate it as quickly as possible.

Don’t panic. Last but not least, things go wrong all the time and they will probably go wrong the first time the principal data engineer is mountain climbing without cell phone reception. This is not a sign of personal failure but due to the complicated nature of orchestration. Being prepared, however, makes it less stressful.

Reading Rec: The inspection paradox

I just stumbled upon this very interesting read on a specific type of observation bias, called inspection bias. Inspection bias occurs when the point of view of the observer itself creates a bias. For example, when you ride your bike on a track, you may observe many much slower and many much faster riders than yourself, simply because riders travelling at about your own speed stay at approximately the same distance from you, so you rarely meet them.

Check out the article here.
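
The cycling example is easy to check with a small simulation. The sketch below is my own toy setup, not taken from the article: the speed distribution, the length of the stretch of road and the riding time are arbitrary assumptions. Riders are scattered along a road, and I only “meet” those who pass me or whom I pass.

```python
import random


def simulate_encounters(my_speed=25.0, hours=2.0, n_riders=20000, seed=0):
    """Compare the speed gap of all riders with that of the riders I actually meet."""
    rng = random.Random(seed)
    all_gaps, met_gaps = [], []
    for _ in range(n_riders):
        speed = rng.gauss(25.0, 4.0)         # km/h, assumed speed distribution
        position = rng.uniform(-50.0, 50.0)  # km ahead of (+) or behind (-) me at the start
        gap = speed - my_speed
        all_gaps.append(abs(gap))
        # Position relative to me after the ride; a sign change means we passed each other.
        final_position = position + gap * hours
        if position * final_position < 0:
            met_gaps.append(abs(gap))
    mean_all = sum(all_gaps) / len(all_gaps)
    mean_met = sum(met_gaps) / len(met_gaps)
    return mean_all, mean_met


mean_all, mean_met = simulate_encounters()
print(f"average speed gap, all riders: {mean_all:.1f} km/h")
print(f"average speed gap, riders met: {mean_met:.1f} km/h")
```

The riders I meet differ from my own speed noticeably more than the average rider does, even though most riders on the road are about as fast as I am.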

What Pro Data Analysis can learn from Strava

If you are a cyclist or a runner, chances are that you use or have at least heard of Strava. Strava is a platform for athletes to analyze and share their activities’ data and virtually compete with one another.

With the rise of sports tracking devices tracing position and pacing via GPS and additionally measuring heart rate and elevation, Strava leverages the data that users upload to create a frame of reference for several types of activities. Founded in 2009, Strava has more than 10 million active members (in fact, they make a point of not calling them users) in more than 195 countries. In 2017 alone, cyclists shared over 7.3 billion km worth of rides.

In 2009, when they began as a typical Californian data start up, they were highly dependent on the hardware vendor Garmin: In fact, in the beginning uploading data to Strava was only possible directly from a Garmin device leaving early Strava at the mercy of Garmin. Today, the power dynamics have changed a lot: It is now Strava-compatibility that drives hardware sales. Automatic data synchronisation to Strava or even live Strava powered analytics during an activity enable not only Garmin but also their competitors like Polar and Wahoo to sell their newest generations of devices.

How do they make money

Strava’s main source of revenue is, first of all, its premium membership options (59,99€ a year or 7,99€ a month).

Secondly, industry partners can sponsor challenges, that is, specific goals within a specific time frame that Strava members can commit to in order to motivate themselves. One example is the Rapha 500 Challenge [1] by the bike vendor Rapha, challenging its participants to ride 500 km between Christmas Eve and New Year’s Eve.

As a Big Data company, Strava also makes selling data to third parties part of its business. For now, they are committed to sharing data only in an aggregated and therefore anonymized form, with partners that are aligned with Strava’s vision of enabling and helping athletes. Notably, the project Strava Metro [2] aims to partner with city planners around the globe to make e.g. bike paths and the most frequented bike tracks safer. On their website, you can find a case study of the partnership with the Seattle Department of Transportation.

As of 2018, Strava has yet to become profitable.

My personal use cases

To navigate on my bike following pre-planned tracks, I bought a Garmin Edge 800 a few years ago. For this, I create a GPX file (GPX is short for GPS Exchange Format) of my route and upload it onto my Garmin device.

GPX track covering both the Gampen Pass and Mendel Pass in Italy, planned on GPSies

GPX is an XML schema which can be used to track GPS-based waypoints and routes together with timestamps. In this format, one can store pre-planned routes, which can later be used for navigation, as well as record timestamps when passing these waypoints on a bike ride or a run. I use my Garmin both for navigation and for recording, but any smartphone can do the trick as well (within its battery limitations).
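
To give an idea of what such a file looks like, here is a small Python sketch that reads track points from a GPX 1.1 snippet using only the standard library. The sample coordinates, elevations and timestamps are made up, and `read_track_points` is just a helper name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Namespace used by GPX 1.1 files.
GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

# A tiny made-up track; real files contain thousands of points.
SAMPLE_GPX = """<gpx version="1.1" creator="example" xmlns="http://www.topografix.com/GPX/1/1">
  <trk>
    <name>Example ride</name>
    <trkseg>
      <trkpt lat="46.5300" lon="11.1000"><ele>1200.0</ele><time>2018-07-01T08:00:00Z</time></trkpt>
      <trkpt lat="46.5310" lon="11.1015"><ele>1215.5</ele><time>2018-07-01T08:01:30Z</time></trkpt>
      <trkpt lat="46.5325" lon="11.1032"><ele>1231.0</ele></trkpt>
    </trkseg>
  </trk>
</gpx>
"""


def read_track_points(gpx_text):
    """Return the track points of a GPX document as a list of dictionaries."""
    root = ET.fromstring(gpx_text)
    points = []
    for trkpt in root.findall(".//gpx:trkpt", GPX_NS):
        ele = trkpt.find("gpx:ele", GPX_NS)
        time = trkpt.find("gpx:time", GPX_NS)
        points.append({
            "lat": float(trkpt.get("lat")),
            "lon": float(trkpt.get("lon")),
            # Elevation and time are optional: planned routes usually have no timestamps.
            "ele": float(ele.text) if ele is not None else None,
            "time": time.text if time is not None else None,
        })
    return points


for point in read_track_points(SAMPLE_GPX):
    print(point)
```

The last point has no timestamp, which is what a pre-planned route point looks like before it has been ridden.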

When I ride, I have my track as a purple line embedded into a map [3] that I can follow to pick the right turns. After my ride, I use Garmin’s own software Garmin Express to read out the recorded GPS/time data as well as my heart rate. It is automatically transferred to the Garmin platform Garmin Connect. Garmin Connect offers similar features to Strava while being restricted to its own devices. In my opinion, their dashboard composition used to be a bit messy. The new modern look has improved matters quite a lot, but it came too late: many users like myself had already gone looking for alternatives.

Garmin exposes the data of newly created activities to Strava via an API, automatically uploading any rides and making them visible to my community of friends and acquaintances. Over there, I get an instant analysis of how I performed on pre-defined segments during my ride: Did I hit a personal best? Have I been able to score top 10 for women? How do I rank compared to my friends that have done this particular segment as well? Subsequently, my activity becomes visible to my friends (or to the world if I choose so).

From a data analysis perspective, Strava does a few things well from which the pro data analysis world could benefit as well.

Easy and powerful visualizations and tracking tools geared for its user base yield a powerful Business Intelligence

One of the main challenges for most amateur athletes is to keep up the motivation to continue with one’s sport. On the one hand, it’s the community part that allows you to share your passion, but also your challenges, with your friends in real life, like your bike club, or with other like-minded people that you know only online.

On the other hand, you can follow your own progress and try to beat your past self. I particularly like the feature of tracking the number of weekly activities, the overall length and elevation gain to motivate myself to keep up my rides and training.

Strava, of course, benefits from the fact that there are a lot of canonical KPIs for sports activities, such as distance covered, speed, heart rate, elevation gain etc., which make it quite easy for a sports tracking platform to offer insights that are relevant and meaningful for the user.

Neatly visualizing this data, adapted to the needs of the particular type of sport, is on the other hand much more difficult. In my opinion, Strava’s success is mainly due to its strength there, outperforming Garmin with an uncluttered interface and visuals.

Well incentivized community dynamics keep the platform and its data relevant

The heart of Strava’s data analysis capabilities is its segments. Segments are short tracks of variable length between two points which cover a part of a road or a route. A typical example would be a climb from the foot of a mountain to its highest point. Users can create segments themselves, but also flag segments as duplicates or irrelevant (e.g. sprints of only a couple of meters).

Even though Strava has recently invested in getting rid of most obvious duplicates in segments, it mostly relies on the communities to do their own clean ups: A lot of Strava members develop quite some enthusiasm to curate the most relevant segments that appear on their routes in order to track and showcase their performances.

The same is true for fraudulent data: If you track your “performance” on an e-Bike or a motorcycle in order to score a good ranking on an at least moderately frequented segment, you can be sure that other Top 10-candidates will be quick to report the activity to get a “fair” ranking.

This principle of the community acting as police and curator makes it possible to avoid one of the most common threats to any Big Data endeavour, namely the loss of meaning of data due to spam and irrelevance.

Gather your stars as marketers

Some KOMs and QOMs (short for King of the Mountain and Queen of the Mountain, denoting the respective leader on a segment) are pretty much completely out of reach once a major competition has traversed one’s territory. On the other hand, it is undeniably cool that one can also follow people like Romain Bardet (who is competing in the Tour de France at the moment) and see how they perform on your favourite segment.

Below is a screenshot of a segment that I just rode: a segment that was part of this year’s edition of the Giro d’Italia race, which makes it possible to compare the world-class cyclists Romain Bardet and Vincenzo Nibali.

Having the industry’s stars on one’s platform is a great marketing coup to showcase one’s functionalities, and it gives professional athletes a platform to interact with their fans.

A thought-through Premium membership principle

Currently, I use the free membership option of Strava. The premium option would allow me to get more detailed analysis such as power meter analysis, live feedback and personalized coaching to reach more customized goals.

One fun example of what could be gained from a premium membership is the possibility to get live segment information during my ride: I would see exactly how I’d need to perform to score a good ranking on say my favorite hill. Strava’s philosophy is that most people will sign up for the free option and quite steadily go for the premium option once they have been with Strava for a while.

And even while you are not paying, you are still contributing to the richness of data accumulated and curated in Strava. While Strava has yet to reach profitability, this balance seems to be quite powerful for Strava to generate value to both members and partners.

The big key word for the future of Strava is ‘Discovery’: Assume you travel to a new city and you want to go for a run. Strava knows your typical distance and whether you like hilly terrain or flats, and can recommend routes that other athletes just like you run in this particular city. To what extent this will be part of the premium part of Strava is not yet clear, but to me, these kinds of recommendations would be very valuable and something I would definitely consider paying for.

Grow with challenges

As a data company/social media platform, you are under constant public scrutiny. In the beginning of the year, a story broke of a secret US military air base [5] being exposed on a Strava heat map: Soldiers had been recording their training as ‘public’ on Strava. Even though the data was anonymized, having a well-frequented running course in an Afghanistan desert left not much room for speculation.

Even though one can clearly argue that this incident was largely due to the carelessness of the people uploading their data publicly without a second thought, it is still a challenge for a community to educate its members on the consequences of their privacy settings. This holds both for Strava itself and for mainstream journalism, which mistakenly called this a ‘data leak’ or ‘data breach’. It most definitely wasn’t one.

Strava itself took action to highlight the opt-out possibilities in detail in order to avoid such incidents, and introduced a minimum number of activities for a path to show up on any heat map. Furthermore, heat maps are refreshed regularly so that activities that are later made private no longer show up. This means that even if a group of soldiers mistakenly uploads an activity at a secret location, they can still take action to have it hidden, and further damage can be avoided.
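
The idea of a minimum number of activities can be sketched in a few lines of Python. To be clear, this is not Strava’s implementation, which I don’t know. The grid cell names, the threshold of three and the `heatmap_cells` function are assumptions for illustration only.

```python
from collections import Counter


def heatmap_cells(public_activities, min_activities=3):
    """Keep only the map cells that were crossed by at least `min_activities` activities."""
    counts = Counter()
    for cells in public_activities:
        counts.update(set(cells))  # count each activity at most once per cell
    return {cell: n for cell, n in counts.items() if n >= min_activities}


# Each activity is a list of (hypothetical) map grid cells it passed through.
public_activities = [
    ["park_north", "park_south", "river"],
    ["park_north", "park_south"],
    ["park_north", "river", "remote_loop", "remote_loop"],
    ["park_north", "park_south"],
]

print(heatmap_cells(public_activities))
```

Here `remote_loop` was only used in a single activity and therefore never makes it onto the map. Recomputing the map from the current set of public activities is also what makes later opt-outs take effect.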

Another ongoing discussion that concerns a far greater base of users is the possibility to opt out of certain aspects of data sharing. Unfortunately, many athletes, in particular women, are not comfortable sharing timed location data of their runs publicly, since it could be very easy for stalkers or even attackers to guess patterns and pose a serious threat. For now, the only option is to not share an activity publicly at all. Strava has said that they are currently exploring how to make only parts of an activity publicly visible while still integrating the other relevant parts of the activity.

Notes

[1] Rapha 500 Challenge, see e.g. the 2016 version here: https://www.strava.com/challenges/rapha-festive500-2016

[2] Strava Metro: Insights business for city planning https://metro.strava.com/

[3] For planning, I use www.gpsies.com. Strava offers its own planning tool, Strava Routes https://www.strava.com/routes. For somewhat arbitrary historical reasons, however, this tool has not yet gained much traction in my circle of friends, after it recommended a less than optimal gravel path to one of our more passionate road cyclists for his precious road bike.

[4] Open Street Map https://www.openstreetmap.org/ is an open source mapping database that allows downloading any map selection in Garmin-compatible formats

[5] BBC article on the Strava Military Base incident from 28/01/2018 https://www.bbc.com/news/technology-42853072

Dealing with Data Bias Part II

In this series of blog posts I explore the different types of biases that I come across as a supply chain scientist. While my particular examples usually tend to come from the supply chain and eCommerce world, the biases described are quite general.

Bias #2: Selection Bias

Another tricky bias to deal with is selection bias. Selection bias is the systematic under- or over-representation of data sources. Think of a polling agency trying to predict the outcome of an upcoming election. In order to do so, they need to sample people’s opinions. So, assume the election is next Sunday, and on Wednesday at 11:30 they go to their own city’s central market place and start asking people who they would vote for. It turns out that people who happen to be at a market place at 11:30 on a weekday might not correctly represent the general population, and hence a sample like this will not yield trustworthy results.

In the era of Tech data analysis one might think ‘great, now with our fancy Big Data technology we no longer have to rely on unrepresentative samples, we can just process all the data there is – voilà no more selection bias‘. And in a perfect world with only clean data, this would be true. However, in the real world with real processes or humans creating your data feed, systematic corruption of parts of your data might again cause selection type bias.

Usually, with the growth of a business, the number of input data sources grows as well. For example, if you think of a retailer, the number of sales channels increases (phone, their own web shop, Amazon marketplace etc.). All these different data sources for sales do not necessarily capture data in the same way and with the same level of completeness: for some edge cases, one sales channel may be bad at recording, while others may do it correctly. Over time, these differences may accumulate and, if you look at the results as a whole, skew your perception in one direction or the other.

One other very common source of this type of bias is when you replace one data source tracking system (e.g. the ERP tool) with another one. This leads to a legacy system coexisting with a new one with a potentially completely different philosophy, use cases and configuration. Defining a proper data mapping between these systems can be quite a complex task, and avoiding the introduction of systematic errors from one to the other is quite delicate.

How to deal with selection bias

The problem with selection bias is that you will not be able to detect it by only looking at the data you have. So, there is no fancy high-tech statistical Data Science method to conquer this bias.

The way to deal with selection bias is to carefully study the underlying business processes to first understand potential sources of selection bias and then find proper business indicators to check for these.

Part one involves a proper business analysis alongside an analysis of the data architecture supporting it. In my own work, I usually try to draw a diagram of the flow of goods from conception to delivery to the customer and, if applicable, the return processes, and a second diagram of the flow of data for each of the steps in the first diagram.

At each of the “arrows” of such a diagram (e.g. from production/manufacturing to storage, from storage to delivery etc.), i.e. at each transition from one physical system to another and each interface between data systems, data may be lost, skewed, delayed or counted multiple times. From the business side, one needs to get the appropriate real-life statistics of what is supposed to happen between these systems, such as how many items have been sent from suppliers to the central warehouse each day, how much sales volume has been generated from each channel on any given day etc. Then you can compare the measured values with the figures coming from the business to spot gaps.

This requires quite extensive knowledge, accounting and documentation of the business and the data architecture. In the long run, the effort spent on this will yield a high return on investment.
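
The comparison of measured values against business figures can be as simple as the following Python sketch. The channel names, dates, numbers, the 2% tolerance and the `find_gaps` helper are all invented for illustration; the real work lies in getting trustworthy reference figures from the business.

```python
def find_gaps(measured, expected, tolerance=0.02):
    """List the (channel, day) pairs where the data deviates from the business figure."""
    gaps = []
    for key in sorted(set(measured) | set(expected)):
        m = measured.get(key, 0)
        e = expected.get(key, 0)
        if e == 0:
            relative_error = 0.0 if m == 0 else float("inf")
        else:
            relative_error = abs(m - e) / e
        if relative_error > tolerance:
            gaps.append((key, m, e))
    return gaps


# Units sold per (sales channel, day): what arrived in the data feed ...
measured = {
    ("webshop", "2018-06-01"): 1180,
    ("phone", "2018-06-01"): 95,
    ("marketplace", "2018-06-01"): 410,
}
# ... versus what the business says was sold (hypothetical reference figures).
expected = {
    ("webshop", "2018-06-01"): 1200,
    ("phone", "2018-06-01"): 96,
    ("marketplace", "2018-06-01"): 520,
}

for key, m, e in find_gaps(measured, expected):
    print(key, "measured:", m, "expected:", e)
```

In this made-up example only the marketplace channel is flagged: about a fifth of its sales never reached the data feed, which is exactly the kind of systematic under-representation described above.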

Bits and Pieces

A personal blog on topics related to Data Science

Author: Katharina