Should a knowledge graph replace commercial scientific journals?

The current system of contributing to science via publications is inherently flawed:

It incentivises only additions to the general body of public knowledge, i.e. writing more articles, and gives little to no credit to those who review (mostly anonymously), those who spot mistakes, or those who actually fix mistakes. To advance in their careers, researchers are generally measured by the number of their publications and the reputation of the journals they appear in.

Furthermore, journals make a profit while authors, editors and reviewers either do their work for free or even have to pay to publish (e.g. for an open access licence).

In the special case of mathematics, all scientific results are self-contained and can, in theory, be verified by anyone knowledgeable in the field (e.g. no special equipment is necessary).

Wouldn’t it make more sense to craft a knowledge graph that semantically links to previous results, instead of a linear journal?

For illustration, let’s consider a new paper in mathematics that proves, say, three theorems. In the proofs, several older results are used and quoted via citation.

The reader then either has to trust the author or the peer review process that these older results have indeed been proven correctly in the other papers (as well as the results those papers quote), or she has to verify the cited results and their quoted results until arriving at the axioms.

In some cases, mistakes happen, and a quoted result does not (yet) exist or is even provably false. However, there is no immediate incentive (apart from the author’s honor code) to correct mistakes. Thus, some unsuspecting grad student might continue to quote a wrong or only partly proven result, having either to figure out the mistake themselves or to fall into the trap and build on top of this possibly false assertion.

I propose to instead build a semantic knowledge graph, in the following style:

As a researcher, you can contribute in several ways:

  1. You can add a new node, that is, a new theorem, by submitting its statement and proof as well as links to the quoted parent theorem nodes.
  2. You can challenge an existing edge of the graph if you spot holes in a proof or have a counterexample.
  3. You can formalize existing results into the knowledge graph.

Item 3 could be a nice source for students’ theses. The role of editors and reviewers could be to function as administrators for new additions to the graph.
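To make this concrete, here is a minimal sketch (in Python) of what a node of such a graph could look like. All names and fields are hypothetical illustrations, not a reference implementation:

```python
from dataclasses import dataclass, field

@dataclass
class TheoremNode:
    """One node of the knowledge graph: a statement together with its proof."""
    statement: str
    proof: str
    parents: list["TheoremNode"] = field(default_factory=list)   # results the proof quotes
    children: list["TheoremNode"] = field(default_factory=list)  # results quoting this one
    challenged: bool = False

    def add_parent(self, parent: "TheoremNode") -> None:
        """Create the edge 'this proof quotes parent'."""
        self.parents.append(parent)
        parent.children.append(self)

    def challenge(self) -> None:
        """Mark this node as disputed and propagate the flag to all
        descendants, whose proofs (transitively) rely on it."""
        self.challenged = True
        for child in self.children:
            if not child.challenged:  # each node is flagged at most once
                child.challenge()

# Usage: a new theorem quoting two established parent results.
lemma_a = TheoremNode("Lemma A", "proof of A")
lemma_b = TheoremNode("Lemma B", "proof of B")
theorem_1 = TheoremNode("Theorem 1", "proof using A and B")
theorem_1.add_parent(lemma_a)
theorem_1.add_parent(lemma_b)

lemma_a.challenge()          # someone spots a hole in Lemma A ...
print(theorem_1.challenged)  # True: the issue is inherited automatically
```

Flagging a node automatically flags everything built on top of it, which is exactly the inheritance mechanism listed as advantage 1 below.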

Advantages:

  1. The knowledge graph becomes crawlable. Issues with a parent node can automatically be inherited by child theorem nodes.
  2. Work that rectifies issues with existing results can be acknowledged by the scientific community as a valuable contribution.
  3. Automatic proof checking becomes an option if the proofs are formalized (see, e.g., Lean; a small example follows this list).
  4. Researchers might become quicker at discovering related results.
  5. AI can function as a proof helper, either auto-completing (like ChatGPT for coding) or even suggesting noteworthy theorems that are corollaries of current knowledge.
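For illustration, here is what a single formalized node might look like in Lean 4, assuming a setup where the library lemma `Nat.sub_add_cancel` is available (it ships with Mathlib). The theorem itself is deliberately trivial and invented for this sketch; the point is that the proof explicitly names its parent node, so the citation edge is machine-checkable:

```lean
-- A tiny "child node": its entire proof consists of citing one
-- "parent node", the library lemma Nat.sub_add_cancel.
-- The Lean kernel verifies that the edge is legitimate.
theorem sub_one_add_one (n : Nat) (h : 0 < n) : n - 1 + 1 = n :=
  Nat.sub_add_cancel h
```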

Disadvantages / Challenges:

  1. There needs to be software and infrastructure to run the knowledge graph.
  2. Change management in the scientific community would have to take place.
  3. Commercial journals might oppose this heavily.
  4. The graph might quickly become “unreadable” due to the sheer number of nodes and their complex interconnections. One would need a suitable filtering / tag structure in order to see only relevant content.

Dealing with Data Bias Part I

One major challenge we face as supply chain scientists is bias in data. In this series of blog posts, I would like to describe a few of the major biases that we see, why they are harmful for any prediction or sales analysis, and how to deal with them.

While I write from my perspective as a supply chain scientist, this analysis applies to basically anyone doing any sort of data analysis or research on data.

Bias #1: Confounding variables

In a nutshell, at Lokad, we use past sales data to make predictions on future demand. At first glance, this sounds perfectly reasonable – you measure past demand to predict future demand – so where’s the bias?

In fact, there is quite a range of other variables that can make meaningful predictions hard or even impossible. The term ‘confounding variable’ refers to any sort of influence that impacts the cause-and-effect relationship you would like to study but lies outside of your current considerations.

Let’s assume you are studying the following sales history:

| Day | Nb. of Sales |
| --- | --- |
| Monday | 8 |
| Tuesday | 11 |
| Wednesday | 9 |
| Thursday | 2 |
| Friday | 15 |
| Saturday | 15 |
| Sunday | 12 |

So what happened on Thursday? Maybe mid-week is just the slow period for you? It could, however, also be that there are some confounding variables at play.

One basic question one should start with is the following: were clients able to buy all the time, or was there some obstacle preventing sales?

One such factor could be website downtime if you are, for example, an e-commerce business. In our example above, maybe the webshop was down from Wednesday night to Thursday night. This is very valuable information to store. There might also be some “soft downtimes” if, for example, a small bug in the shop software prevented one product category from being shown at its usual place.
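Once such downtimes are stored, they are easy to join against the sales history. A minimal sketch in Python/pandas, where the downtime log and all column names are hypothetical:

```python
import pandas as pd

# Hypothetical inputs: the daily sales from the table above
# (2024-01-01 was a Monday) and a log of downtime windows.
sales = pd.DataFrame({
    "day": pd.date_range("2024-01-01", periods=7, freq="D"),
    "units_sold": [8, 11, 9, 2, 15, 15, 12],
})
downtime = pd.DataFrame({
    "start": pd.to_datetime(["2024-01-03 22:00"]),  # Wednesday night ...
    "end":   pd.to_datetime(["2024-01-04 21:00"]),  # ... to Thursday night
})

# Flag every day that overlaps a downtime window, so it can be excluded
# or treated specially when estimating demand.
def overlaps_downtime(day: pd.Timestamp) -> bool:
    day_end = day + pd.Timedelta(days=1)
    return bool(((downtime["start"] < day_end) & (downtime["end"] > day)).any())

sales["shop_was_down"] = sales["day"].map(overlaps_downtime)
print(sales)  # Wednesday and Thursday show up as affected
```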

Even with a functioning web shop, you cannot sell what you do not have (except where backorders are allowed). Stockouts represent one of the major forms of confounding variables we encounter frequently. Low observed demand might just be caused by low stock or a stockout of a top-selling product. In some businesses, stock levels are visible to clients, either in a reduced way when stock is low, to signal urgency to the client (“Only two articles in size 40 left”), or sometimes fully (“300 pieces available”) to demonstrate stock health to potential big customers. The latter is known as a facing quantity.

Without understanding the relationship between stock levels and sales, demand can be critically underestimated. In our example, we may have encountered a stockout on Wednesday evening. On Thursday morning, we might have inbounded two returns that were sold right away, before a replenishment arrived ahead of business on Friday. (In real life, this usually drags out a bit longer, and you might see a longer period of low sales days depending on your replenishment cycle length.) Note that keeping a stockout history alone would probably not have caught this issue in the example: at some point two pieces were in stock on Thursday, yet clients were still not able to buy freely, since the day’s sales were capped by the two pieces in stock.

In general, keeping a history of stock levels, or at least a history of stockouts per product, is very much advised.
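As a sketch of why this history pays off, assuming a daily end-of-day stock snapshot (data and column names are made up): days on which the product ran out only give a lower bound on demand, so a naive average over all days underestimates it.

```python
import pandas as pd

history = pd.DataFrame({
    "day":              ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "units_sold":       [8, 11, 9, 2, 15],
    "end_of_day_stock": [40, 29, 0, 0, 25],  # snapshot taken at closing time
})

# A day is treated as censored if the product was out of stock at the end
# of the day or already started the day without stock: observed sales are
# then only a lower bound on true demand.
started_empty = history["end_of_day_stock"].shift(1) == 0
history["demand_censored"] = (history["end_of_day_stock"] == 0) | started_empty

print(history["units_sold"].mean())                                   # 9.0, biased low
print(history.loc[~history["demand_censored"], "units_sold"].mean())  # 9.5
```

As the Thursday story above shows, even this heuristic is imperfect: the two returned pieces put the stock briefly above zero while demand stayed capped.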

Another confounding bias may be the evolution of prices. Depending on your vertical, your clients may be more or less sensitive to price changes, and therefore knowing the selling price of each sale might give you important context for the demand. In our example above, the product might have been discounted on all weekdays apart from Thursday.

Unfortunately, this only gives you the demand at the price you charged on a given day. It is not possible to “replay” the demand on a given day at another price to study what would have happened if you had discounted more or raised the price. Here, a dedicated price elasticity study would be necessary. On the other hand, knowing the price of each sale allows you to establish a baseline around the main price classes (such as, e.g., original price, -20%, -30%), so that you can understand which periods in history were associated with inflated demand due to discounts, or deflated demand due to price surges.
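A sketch of such a baseline, assuming each sale record carries its realized discount (the data and column names are, again, made up):

```python
import pandas as pd

sales = pd.DataFrame({
    "day":        ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    "units_sold": [8, 11, 9, 2, 15, 15, 12],
    "discount":   [0.2, 0.2, 0.2, 0.0, 0.2, 0.2, 0.2],  # fraction off original price
})

# Average demand per price class: a crude baseline that makes
# discount-inflated periods visible at a glance.
print(sales.groupby("discount")["units_sold"].mean())
# discount
# 0.0     2.000000
# 0.2    11.666667
```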

How to cope with confounding bias

Well, the ideal course of action is to study all possible confounding variables and make them part of your considerations, so that they become genuine variables of your analysis.

To be able to do so, understanding how the business works is crucial. The key takeaway for me regarding confounding bias is that looking at the data alone is not enough; it is analyzing the business together with the data that makes a data investigation meaningful and leads to usable results for the business.
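A minimal sketch of what “making them genuine variables” can mean in practice: once the confounders are explicit columns, any model can condition on them. Here, a plain least-squares fit over entirely made-up data, just to show the shape of the approach:

```python
import numpy as np
import pandas as pd

# Hypothetical daily table where the confounders discussed above have
# been promoted to explicit columns.
data = pd.DataFrame({
    "units_sold":    [8, 11, 9, 2, 15, 15, 12],
    "shop_was_down": [0, 0, 0, 1, 0, 0, 0],
    "stocked_out":   [0, 0, 1, 1, 0, 0, 0],
    "discount":      [0.2, 0.2, 0.0, 0.0, 0.3, 0.3, 0.0],
})

# Design matrix: intercept plus one column per confounder. The fitted
# coefficients separate the effect of each confounder from base demand.
X = np.column_stack([np.ones(len(data)),
                     data[["shop_was_down", "stocked_out", "discount"]]])
coef, *_ = np.linalg.lstsq(X, data["units_sold"].to_numpy(), rcond=None)
print(coef)  # [intercept, downtime effect, stockout effect, discount effect]
```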

In some cases, this might mean starting to track or snapshot some extra data (downtimes, stock levels, etc.), which usually means a little extra effort and some more storage consumption, but it can pay off quite well.
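Finally, a sketch of what such extra tracking can look like in practice: a small job, scheduled once per day, that appends the current stock levels to a history file. The path, the schema, and the `load_current_stock` helper are all hypothetical:

```python
import datetime as dt
import os
import pandas as pd

def snapshot_stock_levels(current_stock: pd.DataFrame, history_path: str) -> None:
    """Append today's stock levels to a running history file."""
    snap = current_stock.copy()
    snap["snapshot_date"] = dt.date.today().isoformat()
    # Append mode creates the file if missing; write the header only once.
    write_header = not os.path.exists(history_path)
    snap.to_csv(history_path, mode="a", header=write_header, index=False)

# Usage: scheduled once per day, e.g. via cron.
# snapshot_stock_levels(load_current_stock(), "stock_history.csv")
```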