As the world moves faster, more and more information is generated every day. In order to make sense of such large amounts of information, journalists and media monitors benefit from automatic processes that are capable of extracting entities present in the text.

Multilingual Media Monitoring with Named Entity Recognition

This task is called Named Entity Recognition (NER) and encompasses both the detection of entities in a text and their classification with a specific label (e.g. organization, person, etc.). Having this information has several advantages:

  • It allows journalists and media monitors to quickly assess the entities present in a text.
  • Entities can be later linked to a knowledge base to retrieve more information about a given organization, person, etc.
  • It makes it possible to discover trending entities and more easily cover certain topics or ongoing events.
Example of annotated text, with location, organization, and person tags (retrieved from internal tooling)

The goal of SELMA is to tackle media understanding in multiple languages. This adds a new challenge: how can we perform named entity recognition in multiple languages? One possible strategy is to take advantage of multilingual transfer learning. With this method, a large machine learning model is trained in multiple languages at once on a specific set of tasks, which then produces meaningful word-level representations that are useful for Named Entity Recognition and other downstream tasks such as Sentiment Analysis.

Our participation at SlavNER – a challenge in Slavic languages

We showcased the potential of such approaches in a shared task, which is an event that targets a specific problem, and where different research teams compete and share results and solutions. We participated in the SlavNER event, a Multilingual Named Entity Challenge in Slavic languages. This shared task was part of the 8th Workshop on Balto-Slavic Natural Language Processing, held in conjunction with the 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021).

The SlavNER shared task focused “on the analysis of Named Entities in multilingual Web documents” and covered a total of six languages: Bulgarian, Czech, Polish, Russian, Slovene, and Ukrainian. Among these, Bulgarian, Polish, Russian, and Ukrainian are covered by SELMA. Despite being restricted to a set of Slavic languages, this shared task allows us to highlight the power of multilingual transfer learning, applied not only to a set of multiple languages but also to different scripts.

Best performer with multilingual model

Our submission to the NER portion of the shared task corresponded to a single model able to predict entities for all six languages. It consisted of three modules: a large pre-trained multilingual model, a model that is capable of extracting information at the character level, and, finally, a classifier model. Combining these three modules and training the full model on the provided data we were able to outperform all other teams and submitted models and achieve the best performance for NER across all metrics.

At SELMA our goal is to continue improving existing multilingual approaches and further advance the state of the art when it comes to detecting and classifying named entities in text, regardless of language.

Further reading