TNA Fellow Riva Quiroga: Improving Part of Speech Tagging for Latin-American Spanish Corpora
Part of Speech (POS) tagging is a key step when preprocessing a corpus, as the correct assignment of parts of speech to words affects the results of lemmatization and dependency parsing. This is particularly important in a field such as Computational Literary Studies, where identifying actions, characters, and instances of particular linguistic constructions can help researchers gain insights into a corpus.
Although POS taggers are currently available for Spanish, most of them have been trained exclusively on texts written in European Spanish, a variety used by only around 10% of all native speakers of the language. For example, AnCora (the biggest Spanish corpus available at https://universaldependencies.org, and the one used by tools such as spaCy and UDPipe) is composed only of newspaper and newswire articles written by media from Spain. As a consequence, POS taggers have lower accuracy on texts written in other varieties of Spanish, and on genres that do not rely exclusively on formal speech. This is a serious limitation for exploring corpora of Latin-American literature with computational methods.
To fill this gap, this project aims to train a model for POS tagging texts written in Latin-American Spanish. To that end, a corpus of literary and historical texts written in this variety will be used to train a model with UDPipe, a trainable pipeline for Natural Language Processing tasks developed at the Institute of Formal and Applied Linguistics at Charles University, CZ. The implementation plan involves assembling the corpus, documenting the annotation process, training the new model, and evaluating its results. The outputs will include the tagged corpus (available via https://universaldependencies.org/ for reusability), an article about the corpus and the training process, and a tutorial submitted to Programming Historian (https://programminghistorian.org/) on how to use this model to annotate and explore Latin-American literary texts.
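The evaluation step of this plan can be illustrated with a small sketch that compares the POS tags a model predicts against hand-annotated tags in CoNLL-U, the tab-separated format used by Universal Dependencies treebanks. This is not the project's own evaluation code: the function names `conllu_row`, `read_upos`, and `upos_accuracy` are hypothetical, the three-word sentence and its "predicted" tags are invented toy data, and the sketch assumes both annotations share the same tokenization.

```python
def conllu_row(index, form, upos):
    """Build one 10-column CoNLL-U token line, leaving unused columns as '_'."""
    return "\t".join([str(index), form, "_", upos, "_", "_", "_", "_", "_", "_"])


def read_upos(conllu_text):
    """Return a list of (form, upos) pairs from CoNLL-U text."""
    tokens = []
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence boundaries) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        tokens.append((cols[1], cols[3]))  # FORM is column 2, UPOS is column 4
    return tokens


def upos_accuracy(gold_text, pred_text):
    """Share of tokens whose predicted UPOS tag matches the gold tag."""
    gold = read_upos(gold_text)
    pred = read_upos(pred_text)
    if not gold:
        raise ValueError("gold annotation contains no tokens")
    if [form for form, _ in gold] != [form for form, _ in pred]:
        raise ValueError("gold and predicted tokenization differ")
    correct = sum(g_tag == p_tag for (_, g_tag), (_, p_tag) in zip(gold, pred))
    return correct / len(gold)


# Toy data (invented for illustration): a voseo sentence with hand-assigned
# gold tags, and a made-up prediction that mislabels the verb as a noun.
gold = "\n".join([
    "# text = Vos tenés razón",
    conllu_row(1, "Vos", "PRON"),
    conllu_row(2, "tenés", "VERB"),
    conllu_row(3, "razón", "NOUN"),
])
pred = "\n".join([
    "# text = Vos tenés razón",
    conllu_row(1, "Vos", "PRON"),
    conllu_row(2, "tenés", "NOUN"),
    conllu_row(3, "razón", "NOUN"),
])

print(f"UPOS accuracy: {upos_accuracy(gold, pred):.2%}")
```

In practice the same comparison would be run over a held-out portion of the annotated corpus, so that an existing Spanish model and a newly trained one can be scored on identical Latin-American texts.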
Watch this short video to find out more about the project's aims and outcomes, as Riva Quiroga talks us through her progress during her TNA fellowship.