TNA Fellow Vladimir Polomac

Universal Dependencies for Old Serbian and Serbian Church Slavonic: Creating a training data set for lemmatization and morphosyntactic annotation using UDPipe

The UDPipe tool makes automatic lemmatization and morphosyntactic annotation available for contemporary languages, for many classical languages (e.g. Ancient Greek, Paleo-Hebrew, Coptic, Gothic, Sanskrit, Latin), and for older stages of modern Indo-European languages (e.g. Old French). For older stages of the Slavic languages, UDPipe currently offers models for Old Church Slavonic and Old East Slavic, and initial steps towards automatic lemmatization and morphosyntactic annotation with UDPipe have recently been taken for Old Czech. The general aim of our project is to create an initial data set for training UDPipe models that automatically lemmatize and morphosyntactically annotate the oldest preserved Serbian texts, which date from the 12th and 13th centuries. The specific goals of the project are: a) to define the principles for lemmatizing Old Serbian and Serbian Church Slavonic texts, b) to create a set of tags for morphosyntactic annotation in accordance with the Universal Dependencies standards for Old Church Slavonic and the modern Slavic languages, and c) to manually create an initial data set for training a model with UDPipe.
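Training data for UDPipe is written in the CoNLL-U format used by Universal Dependencies, in which each token occupies one line with ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following Python code is a minimal sketch of how a manually annotated file in that format could be read and checked. The helper names (`Token`, `parse_feats`, `parse_conllu`) are hypothetical, and the sample sentence uses placeholder forms and lemmas, not real Old Serbian data or the project's actual tag set.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """One CoNLL-U token line: the ten standard fields, kept as strings."""
    id: str
    form: str
    lemma: str
    upos: str
    xpos: str
    feats: str
    head: str
    deprel: str
    deps: str
    misc: str


def parse_feats(feats):
    """Turn 'Case=Nom|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of Token objects."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines (sent_id, text, ...) are skipped in this sketch.
            continue
        columns = line.split("\t")
        if len(columns) != 10:
            raise ValueError(f"expected 10 fields, got {len(columns)}: {line!r}")
        current.append(Token(*columns))
    if current:
        sentences.append(current)
    return sentences


# Placeholder sentence: forms and lemmas are hypothetical stand-ins.
SAMPLE = (
    "# sent_id = placeholder-1\n"
    "1\tFORM_A\tLEMMA_A\tNOUN\t_\tCase=Nom|Number=Sing\t2\tnsubj\t_\t_\n"
    "2\tFORM_B\tLEMMA_B\tVERB\t_\tNumber=Sing|Person=3\t0\troot\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in parse_conllu(SAMPLE):
        for token in sentence:
            print(token.form, token.lemma, token.upos, parse_feats(token.feats))
```

A check of this kind only confirms that every token line has the expected ten columns; it does not validate the tags themselves against the Universal Dependencies guidelines.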

In this video, CLS INFRA TNA Fellow Vladimir Polomac discusses his TNA Fellowship project, “Universal Dependencies for Old Serbian and Serbian Church Slavonic: Creating a training data set for lemmatization and morphosyntactic annotation using UDPipe”.