We all like infoboxes – Google uses them, and so do Wikipedia, Bing, and DuckDuckGo. But how do you add infoboxes to your own content?

To do so, you first have to distill all your content into a knowledge graph, and then link the content back to that knowledge graph through infoboxes – much like Wikipedia page previews.

The problem is that knowledge graphs are hard to create – they are big, and they must be manually curated to stay correct and up to date.

A good definition of the (Google) Knowledge Graph is given by DBpedia:

The Google Knowledge Graph is a knowledge base used by Google and its services to enhance its search engine’s results with information gathered from a variety of sources. The information is presented to users in an infobox next to the search results. Knowledge Graph infoboxes were added to Google’s search engine in May 2012, starting in the United States, with international expansion by the end of the year. The information covered by Google’s Knowledge Graph grew quickly after launch, tripling its size within seven months and answering “roughly one-third” of the 100 billion monthly searches Google processed in May 2016. It has been criticized for providing answers without source attribution or citation.

Often, content creators take the lazy path and link their content only to free knowledge graphs like DBpedia (6 million entities, 9.5 billion relations), Wikidata, and YAGO, which may lack the factual detail specific to that content creator. The largest knowledge graphs – Google Knowledge Graph (1 billion entities, 70 billion relations), Amazon Product Graph, Microsoft Satori, and Facebook Entity Graph – are proprietary and cannot be linked to.

The PiniTree Knowledge Graph

This is where the PiniTree Knowledge Graph editor comes in, developed by the SELMA project partner IMCS, University of Latvia, together with the PiniTree.com startup. It allows you to annotate your own content with entities, link the entities together into a knowledge graph, and finally visualize your content with infoboxes similar to Wikipedia page previews. See a demo here.
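To make the idea concrete, here is a minimal sketch of how annotated text spans can be linked to entities in a graph. All names and IDs below are hypothetical illustrations, not the PiniTree API:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    """A node in the knowledge graph."""
    id: str
    label: str

@dataclass
class Annotation:
    """Links a character span in a source document to a graph entity."""
    doc_id: str
    start: int  # span start offset in the document text
    end: int    # span end offset (exclusive)
    entity: Entity

# A tiny document and one annotation
doc_id = "news-001"                      # hypothetical document ID
text = "Steve Jobs founded Apple in 1976."
jobs = Entity("ent-jobs", "Steve Jobs")  # hypothetical entity ID
ann = Annotation(doc_id, 0, 10, jobs)

assert text[ann.start:ann.end] == "Steve Jobs"
```

Once every mention is annotated this way, rendering an infobox for a span is a matter of looking up the entity and its facts in the graph.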

To mitigate the attribution problem of other knowledge graphs, the PiniTree editor always preserves the attribution of each fact to its source documents. The PiniTree Knowledge Graph editor was first presented at the Extended Semantic Web Conference (ESWC-2020), and its various use cases were recently presented at the 17th Baltic Conference on Intellectual Cooperation (BCIC). Here is a live PiniTree editor demo.

Separate the wheat from the chaff

The key problem with creating a knowledge graph from text is that you have to distill what matters from what does not – you cannot show everything in the infobox.

Infobox, an example

Schema.org is often used to guide the distillation process by defining the important kinds of facts and their canonical naming in the (Google) Knowledge Graph. With PiniTree, we have found that an ontology consisting of merely nine FrameNet frames – Being_born, Death, Personal_relationship, Education_teaching, Being_employed, Membership, Possession, Participation, and Statement – is sufficient for most news media use cases, which significantly simplifies custom knowledge graph creation. Knowledge graphs also excel at merging information from diverse sources, including legacy databases such as enterprise registries.
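As a rough illustration (a sketch under our own assumptions, not the PiniTree implementation), such a nine-frame ontology can be expressed as a fixed vocabulary of allowed relations over subject–object triples, with each fact keeping its source attribution:

```python
# The nine FrameNet frames used as the relation vocabulary
FRAMES = {
    "Being_born", "Death", "Personal_relationship",
    "Education_teaching", "Being_employed", "Membership",
    "Possession", "Participation", "Statement",
}

def add_fact(graph: list, subject: str, frame: str, obj: str, source: str) -> None:
    """Add a triple, rejecting relations outside the nine-frame ontology
    and preserving attribution to the source document."""
    if frame not in FRAMES:
        raise ValueError(f"Relation {frame!r} is not in the ontology")
    graph.append({"s": subject, "p": frame, "o": obj, "source": source})

graph: list = []
add_fact(graph, "Steve Jobs", "Possession", "Apple", "doc-42")  # hypothetical source ID
```

Restricting the relation vocabulary this way is exactly what makes the distillation tractable: anything that does not fit one of the nine frames simply does not enter the graph.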

Finally, there are also purely lexical knowledge graphs like WordNet, FrameNet, and Tezaurs.lv, the latter of which was recently imported into the PiniTree editor to study the coverage of Latvian dictionaries. This shows that knowledge graphs scale to a complete natural-language vocabulary when restricted-schema distillation is not the aim.

Knowledge graphs also have something in common with large neural language models like GPT-3 (Generative Pretrained Transformer): both can generate text. As the user navigates through the knowledge graph, the navigation path is verbalized into a readable sentence – “Steve Jobs owned Apple (which) owns Beats Electronics” – displayed in the History panel of the PiniTree editor shown above. GPT-3, meanwhile, generates text as the statistically most likely continuation of a given prompt.
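A minimal sketch of such path verbalization, assuming each graph edge carries a verb phrase (this is an illustration, not PiniTree's actual code):

```python
def verbalize(path):
    """Turn a navigation path of (subject, verb, object) edges into one
    readable sentence, chaining subsequent edges with '(which)'."""
    parts = []
    for i, (subj, verb, obj) in enumerate(path):
        if i == 0:
            parts.append(f"{subj} {verb} {obj}")
        else:
            # the subject of a chained edge is the previous object,
            # so it is replaced by '(which)'
            parts.append(f"(which) {verb} {obj}")
    return " ".join(parts)

path = [("Steve Jobs", "owned", "Apple"),
        ("Apple", "owns", "Beats Electronics")]
print(verbalize(path))
# -> Steve Jobs owned Apple (which) owns Beats Electronics
```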

Developing a unified theory for these very different knowledge representation and text generation approaches is a hot research topic, also investigated in the SELMA project.


Article written by Guntis Barzdins, Lead Researcher at the AI Lab, University of Latvia, and co-founder of PiniTree.com

Photo by Kelli McClintock on Unsplash