Abstract: In a project with the Wildenstein Plattner Institute in New York, we worked on constructing a knowledge graph from art-historical documents. The goal is to create a graph-based representation of the relationships between artworks, artists, and historical events, and to use this representation to gain insights and enable downstream applications and analysis.

Setting: The project described in the following was conducted as part of a lecture/seminar called “Knowledge Graphs” during my master’s degree at HPI, together with Konstantin Dobler and Malte Barth. The seminar was split into two parts: first, learning the central concepts and becoming familiar with the current scientific literature; second, a practical project in collaboration with the Wildenstein Plattner Institute in New York - a non-profit foundation dedicated to the study of art history, including, for example, the digitization of art-historical artifacts and documents.

What is a knowledge graph? A knowledge graph stores and represents knowledge in a structured way. It typically uses a network of nodes and edges to represent entities and their relationships. Entities can be people, places, things, or concepts. The relationships between entities are represented by edges, with optional properties associated with them. Essentially, a knowledge graph represents facts as triples of subject (entity), predicate (relation), and object (entity). An example would be (Barack Obama - born - Honolulu); a minimal code sketch of this representation follows the list below. This allows for a more flexible and expressive representation of knowledge than traditional relational databases and enables more advanced analysis and machine learning. Common use cases for knowledge graphs include, among others[1]:

  • Search: Knowledge graphs improve search capabilities by connecting different pieces of information.
  • Recommendation systems: Knowledge graphs can be used to make personalized recommendations by understanding the relationships between entities.
  • Semantic Web: Knowledge graphs can link and connect information from different sources, improving web search and the presentation of information to the user.
  • Other NLP tasks: Knowledge graphs can help capture a text’s meaning and improve the accuracy of natural language tasks such as machine translation and question answering.
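
To make the triple representation described above concrete, here is a minimal sketch in plain Python (the facts and the helper function are purely illustrative):

```python
# A minimal knowledge graph: a set of (subject, predicate, object) triples.
triples = {
    ("Barack Obama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
    ("Barack Obama", "profession", "politician"),
}

def facts_about(subject):
    """Return all (predicate, object) pairs known for a subject."""
    return [(p, o) for (s, p, o) in triples if s == subject]

print(facts_about("Barack Obama"))
# e.g. [('born_in', 'Honolulu'), ('profession', 'politician')] (set order varies)
```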

Important sidenote: While knowledge graphs are an interesting approach, there are other approaches to the above use cases that also exploit semantic information in texts. Overall, however, knowledge graphs enjoy great popularity: in the enterprise world, Google and Microsoft, for example, maintain knowledge graphs with several billion facts that are actively used in their products (as of 2019)[2].

What is an ontology? An ontology formally represents the concepts that make up a specific domain of knowledge and the relationships between them. It defines the types of entities, their properties, and their relationships. An ontology is usually expressed in a machine-readable format, such as RDF (Resource Description Framework) or OWL (Web Ontology Language). These formats provide a standardized way to define classes and properties and to express constraints on their relationships.
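
As a hedged illustration of such a machine-readable format, the following sketch uses the Python library rdflib to define a tiny, hypothetical art-history ontology and one conforming fact; the namespace and class names are made up for this example:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

EX = Namespace("http://example.org/art/")  # hypothetical namespace
g = Graph()
g.bind("ex", EX)

# A tiny ontology: two classes and one property connecting them.
g.add((EX.Artist, RDF.type, OWL.Class))
g.add((EX.Artwork, RDF.type, OWL.Class))
g.add((EX.painted, RDF.type, OWL.ObjectProperty))
g.add((EX.painted, RDFS.domain, EX.Artist))   # subjects of "painted" are Artists
g.add((EX.painted, RDFS.range, EX.Artwork))   # objects of "painted" are Artworks

# A fact conforming to the ontology.
g.add((EX.PabloPicasso, RDF.type, EX.Artist))
g.add((EX.Guernica, RDF.type, EX.Artwork))
g.add((EX.PabloPicasso, EX.painted, EX.Guernica))

print(g.serialize(format="turtle"))  # rdflib >= 6 returns a string
```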

Overall, working with knowledge graphs comprises multiple steps, depending on how the knowledge graph is used. The framework in the following picture gives an excellent high-level overview[3].

Knowledge Graph Creation Process

Our Project: Open Knowledge Graph Creation for art-historical documents with the Wildenstein Plattner Institute New York

As part of the seminar, we worked on a large body of art-historical documents from which we created a knowledge graph for further use cases. Our focus was on knowledge graph construction to enable downstream applications in the future.

The data consisted of already scanned and preprocessed art-historical documents that had been converted into digital form via OCR (optical character recognition). The dataset contained approximately one million sentences stored in chunked .txt files.

When starting, we used techniques such as Named Entity Recognition (see picture; created using spaCy, a Python-based NLP framework) to get a first understanding of the dataset and a first take at the task at hand. As with any approach tied to a predefined ontology, the problem is that you need an ontology that fits your domain and a way to associate extracted entities and relations with it.

Named Entity Recognition
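
For illustration, a minimal NER run with spaCy might look like the following sketch (assuming the small English model en_core_web_sm is installed; the entity labels depend on the model):

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

doc = nlp("Pablo Picasso painted Guernica in Paris in 1937.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Typical output (model-dependent):
#   Pablo Picasso PERSON
#   Guernica GPE or WORK_OF_ART (often mislabeled - a hint at the ontology problem)
#   Paris GPE
#   1937 DATE
```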

However, with this approach, we were limited in what we could discover as part of the ontology - and it was unclear what to do with anything that fell outside of it (out of ontology). So we chose a green-field approach without a predefined ontology. The project mainly focused on creating a knowledge graph from said data and later evaluating the quality of the knowledge stored there. It comprised three main steps:

  • Open Information Extraction (OpenIE): Extraction of triples (subject - predicate - object) from texts.
  • Knowledge Graph Embeddings: Representation of knowledge in a versatile fashion for downstream applications such as machine learning.
  • Canonicalization via clustering of relations & entities: Converting multiple representations of the same entity or relation into a standard format. The results of OpenIE are not canonicalized, which leads to the storage of redundant and ambiguous facts.

Details on Open Information Extraction: For this task, we used the existing OpenIE6 system, which uses iterative grid labeling and coordination analysis[4]. Before that, we performed coreference resolution, which clusters and replaces mentions in a text that refer to the same entity. For example, the two sentences “Christian learned a lot about knowledge graphs. He now reads blogs about it often” would be replaced by “Christian learned a lot about knowledge graphs. >Christian< now reads blogs about >knowledge graphs< often.” OpenIE then transforms single sentences into triples (subject - predicate - object) for later processing. We used filter heuristics to reduce the number of non-relevant triples such as “One - can skip - truth” (a sketch of such heuristics follows below). Unfiltered, there were roughly 3 million triples; after preprocessing, we landed at around 500,000 triples.
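
Our exact heuristics were corpus-specific, but a minimal sketch of this kind of triple filtering could look as follows (the rules shown are illustrative, not our actual ones):

```python
# Illustrative post-filtering of OpenIE triples; rules are examples only.
GENERIC_TERMS = {"one", "it", "this", "that", "he", "she", "they",
                 "someone", "something"}

def keep_triple(subj: str, pred: str, obj: str) -> bool:
    """Reject triples whose arguments are too generic to be useful facts."""
    for phrase in (subj, obj):
        tokens = phrase.lower().split()
        if not tokens or all(t in GENERIC_TERMS for t in tokens):
            return False
    if len(pred.split()) > 6:  # overly long predicates are often extraction noise
        return False
    return True

print(keep_triple("One", "can skip", "truth"))              # False -> filtered
print(keep_triple("Pablo Picasso", "painted", "Guernica"))  # True  -> kept
```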

Details on Knowledge Graph Embeddings: Constructing/learning embeddings of entities and relations in a knowledge graph is a versatile way to represent that knowledge and to perform machine learning or other tasks on it. In detail, we tried out multiple embedding methods such as TransE, HolE, and ConvE. CESI, in turn, uses such embeddings together with additional side information to canonicalize knowledge graphs.
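
As an example of how such embeddings work, the following numpy sketch shows the core idea of TransE: a fact (h, r, t) is modeled as a translation, so h + r should land close to t (toy random vectors here; in practice, the embeddings are trained on the extracted triples):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 100  # we trained ~100-dimensional embeddings

# Toy vectors standing in for trained embeddings.
h = rng.normal(size=dim)      # head entity, e.g. "Pablo Picasso"
r = rng.normal(size=dim)      # relation,    e.g. "painted"
t = rng.normal(size=dim)      # tail entity, e.g. "Guernica"
t_neg = rng.normal(size=dim)  # corrupted tail for a negative sample

def transe_score(h, r, t):
    """Distance of the translation h + r from t; lower = more plausible."""
    return np.linalg.norm(h + r - t)

# Margin-based ranking loss: push true triples below corrupted ones.
margin = 1.0
loss = max(0.0, margin + transe_score(h, r, t) - transe_score(h, r, t_neg))
print(loss)
```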

Details on Canonicalization: The goal is to reduce multiple non-standard representations of entities or relations to one standard representation. An example could be my name appearing as the entities “Christian Warmuth”, “C. Warmuth”, or “Christian W.” across the knowledge graph. Through embeddings, we represent this information in a high-dimensional space. If we then cluster that space and pick one representative per cluster to replace all entities in the cluster, we would, in our example, end up with a single representative of my name, such as “C. Warmuth”. CESI, for instance, uses Hierarchical Agglomerative Clustering (HAC), which has at least quadratic runtime complexity (O(n^2)) - not feasible for our case of ~500k triples. We therefore used DBSCAN, a classic density-based clustering algorithm that does not need a predefined number of clusters.
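
A minimal sketch of this clustering step with scikit-learn’s DBSCAN might look as follows (toy random vectors instead of real phrase embeddings; eps and min_samples are illustrative and need tuning on real data):

```python
from collections import Counter

import numpy as np
from sklearn.cluster import DBSCAN

# Toy stand-ins for learned phrase embeddings (rows align with phrases).
phrases = ["Christian Warmuth", "C. Warmuth", "Christian W.", "Pablo Picasso"]
embeddings = np.random.default_rng(0).normal(size=(len(phrases), 100))

# Density-based clustering without a predefined number of clusters.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(embeddings)

# Canonicalization: map every phrase in a cluster to one representative
# (label -1 marks noise points, which keep their original form).
canonical = {}
for cluster in set(labels) - {-1}:
    members = [p for p, l in zip(phrases, labels) if l == cluster]
    representative = Counter(members).most_common(1)[0][0]  # simplest choice
    for member in members:
        canonical[member] = representative
print(canonical)
```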

Result of the canonicalization:

Input:
  • ~500k noun phrases
  • ~62k relation phrases
  • ~370k unique triples

Output:
  • ~350k noun phrases (diff: ~150k, -30%)
  • ~13k relation phrases (diff: ~49k, -79%)
  • ~345k unique triples (diff: ~25k, -7%)

Evaluation: In general, it is not easy to evaluate the quality of a knowledge graph, especially when ground-truth labels are missing. The knowledge graph can be qualitatively assessed by evaluating individual triples and their truthfulness. We can, for example, fact-check whether the triple “Pablo Picasso - painted - Guernica” is a valid fact.

Knowledge Graph Quality Evaluation

The cluster quality can be evaluated manually by assessing individual clusters and their entities/relations. As this is not quantitative, we created ground-truth labels to objectively assess the quality of our knowledge graph. Based on this labeling, we performed a set of queries against the knowledge graph. This also allowed us to compare different embedding/canonicalization configurations and their influence.

An example query was to ask whether the triple “Picasso - painted - Guernica” or any related but similar form is present in the knowledge graph (a sketch of such a lookup follows below). Our best results were achieved with CESI using HolE as the embedding method, trained with ~100-dimensional embeddings (and using pre-trained GloVe embeddings as vector representations of words). In this setting, 80% of our predefined queries were answered correctly by the knowledge graph.
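
Conceptually, such a query boils down to mapping surface forms to their canonical representatives and then checking triple membership; a hedged sketch with made-up mappings:

```python
# Illustrative query against the canonicalized graph.
canonical = {"Picasso": "Pablo Picasso", "Pablo Picasso": "Pablo Picasso"}
kg_triples = {("Pablo Picasso", "painted", "Guernica")}

def triple_in_graph(subj, pred, obj):
    """Check a fact after resolving entity surface forms to canonical ones."""
    s = canonical.get(subj, subj)
    o = canonical.get(obj, obj)
    return (s, pred, o) in kg_triples

print(triple_in_graph("Picasso", "painted", "Guernica"))  # True
```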

Summary: Overall, the project was an excellent opportunity to work with a non-public dataset and to immerse ourselves in the challenges of knowledge graph construction and the downstream work with knowledge graphs. While evaluating such results without ground-truth data is difficult, we achieved good results with our knowledge graph based on art-historical documents.