TNA Fellow Andressa Rodrigues Gomide

Compiling a literary corpus with minimal resources

Andressa Rodrigues Gomide is a researcher at the University of Coimbra. In this project, she addresses the creation of the literary subcorpus of the Corpus Pluricêntrico da Língua Portuguesa (CPLP), a large reference corpus of Portuguese language varieties. Given the obstacles to working with copyrighted documents, the aim of the project is to devise a system that allows optimal literary data collection with limited resources. To achieve that, quantitative linguistic analysis will be piloted on six literary datasets that pose different levels of difficulty for data collection. Knowing how these datasets differ from one another is expected to aid the creation of a framework for fast and simple data collection that does not compromise the quality of the final data.