Project outputs

D5.2: Case Studies in Data Preparation and Sharing

Building on previous CLS INFRA deliverables, this report provides step-by-step case studies of research questions involving digitisation and transformation processes of literary corpora. The case studies are:

  1. Creation of an ELTeC affine corpus of the Slovak novel (chapter 2)
  2. Finding the haiku across multilingual corpora (chapter 3)
  3. Measuring entropy and surprisal in the prose of the Tsarist Empire Devoted to Terrorism (Russian and Polish Texts) (chapter 4)

These case studies uniquely address not only the tools and resources available to CLS researchers but also the complexities of collaborative decision-making regarding research methodologies.
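As a rough illustration of the quantities named in the third case study, the sketch below computes per-token surprisal and Shannon entropy for a list of tokens. It assumes a simple unigram model with maximum-likelihood probabilities; the deliverable does not state its method here, so the function names and the toy sentence are hypothetical and not taken from the report.

```python
import math
from collections import Counter


def unigram_surprisal(tokens):
    """Return surprisal in bits, -log2 p(token), for each distinct token.

    Assumption: p(token) is the token's relative frequency in `tokens`
    (a unigram maximum-likelihood estimate), not a context-aware model.
    """
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: -math.log2(c / total) for tok, c in counts.items()}


def shannon_entropy(tokens):
    """Return the Shannon entropy in bits of the unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy example (hypothetical data, whitespace tokenisation only).
sample = "the cat sat on the mat".split()
surprisal = unigram_surprisal(sample)
entropy = shannon_entropy(sample)
print(surprisal["the"], surprisal["cat"], entropy)
```

Frequent tokens receive lower surprisal than rare ones, and entropy is the frequency-weighted average of those surprisal values. A real study would likely use a proper tokeniser and a language model that conditions on context.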

D8.1: Tools for Basic Natural Language Processing (NLP) Tasks

In this video, Prof. Dr. Julie Birkholz and Mgr. Dr. Silvie Cinková discuss D8.1. This report lists and describes a selection of Natural Language Processing (NLP) tools which are considered to form a Corpus-Enrichment and NLP toolchain for common CLS research tasks. The tools were selected to be:

  • safely positioned in their life cycle, i.e., state-of-the-art and mature as well as continuously maintained, or in development and promised as CLS INFRA deliverables by March 2025
  • as multilingual as possible (beyond English and several major European languages)
  • as interoperable as possible with other tools and texts in other languages.

Read the deliverable here


D3.2: Survey of Methods

This survey documents current, widespread practices in research areas or issues that are prominent within CLS. Though it is not intended as a primer, the Survey Grid provides useful, targeted information in a format suitable for gaining a broad understanding of methods and issues. Fields include authorship attribution, genre analysis, literary history, gender analysis, and canonicity.

Click here to use the Survey Grid and review the deliverable.


In this video CLS INFRA TNA Fellow, Khanim Garayeva, speaks about the project: ‘Calculations of similarities or distances in Peter Ackroyd’s historiographic metafictions and lexical diversity in Dan Brown’s straightforward storytelling’.


The CLS INFRA Transnational Access Fellowship Programme funds scholars from literary studies or with an interest in Computational Literary Studies methods to visit leading research institutions and infrastructures and become part of the larger CLS community.

This archive includes interviews with the TNA Fellows on their experience, their research project and outcomes, video testimonials produced during their fellowship and access to their full reports. Check out the archives here.


D7.1: On Programmable Corpora and DraCor

Work Package 7 of the CLS INFRA project, entitled “Building the Ecosystem of and for Programmable Corpora”, is developing a small-scale but highly functional prototype for an infrastructural ecosystem for CLS research, following the concept of a network-based software architecture. The prototype, implemented as the multi-component system “DraCor” (Drama Corpora Platform), realizes the concept of “Programmable Corpora”, defined as corpora that expose an open, transparently documented and (at least partly) research-driven API to make texts machine-actionable. This report gives a detailed description of the DraCor system as a prototype for “Programmable Corpora”. It also shares two first experiments in adapting and transferring the approach of an API-based CLS research infrastructure to other systems and resources.

Read the deliverable

CLS INFRA TNA Fellow Federico Pianzola

In this video CLS INFRA TransNational Access Fellow, Assistant Professor Federico Pianzola, speaks about the project: ‘Programmable Corpora as Linked Data’.

CLS INFRA TNA Fellow Cassandra Ulph

In this video CLS INFRA TNA Fellow, Dr Cassandra Ulph, speaks about her project: ‘Developing Attribute-based Sentiment Analysis Model for Romantic-period Letters’.


In this video CLS INFRA TNA Fellow, Ivan Pozdniakov, speaks about his project: ‘Building an R Package and Web Application to Interact Within a Digital Ecosystem for Literary Studies’. 

Deliverable 5.1: Review of the Data Landscape

In this video, PD Dr. Michal Mrugalski summarises CLS INFRA Deliverable 5.1, ‘Review of the Data Landscape’. This landscape review focuses on intellectual access, i.e. providing guidance for finding and sharing literary data. D6.1 approaches the task from a more technological side, collecting and analyzing literary corpora, available formats, tools, and metadata in order to create an exploratory catalogue or inventory of literary corpora and to provide a transformation matrix or toolbox for solving common issues. Yet we coordinate our efforts, beginning with the compilation of the table of literary collections, so the two deliverables can be regarded as two sides of the same coin.

The review’s point of departure is the abundance of existing data and their diversity or heterogeneity as regards corpus design and underlying concepts: the definition of text (is it a source, an edition, a data set? see chapter 3), the purpose of a corpus (e.g. general, reference, or monitoring corpora, special purpose corpora; see chapter 4), and central considerations or criteria regarding the construction of a corpus (sampling, balancing, representativeness, annotation model(s), data format(s); see likewise chapter 4).

We ask: How can we assist literary scholars in searching for and finding existing data that are relevant to their own research questions? How can researchers go about obtaining data without transgressing ethical or legal boundaries (see chapter 5)? And what kinds of research questions are relevant given the present-day state of the data landscape, literariness, and textuality?

Read the deliverable. 

Srishti Sharma - Can a Book Make You Happy?

On Wednesday 29 June Srishti Sharma presented the results of her fellowship as a Transnational Access Research Fellow on the H2020 Computational Literary Studies project at GhentCDH. Her research project explores the effect of the emotions expressed in fictional novels on the emotions experienced by their readers. The corpus includes more than 400 English books from 9 different genres and their corresponding reviews from the Goodreads platform. Using sentiment analysis and emotion recognition she seeks to investigate the emotional links between genre, plot, and reader response.


In this video CLS INFRA TNA Fellow, Srishti Sharma, speaks about her project ‘Can A Book Make You Happy?’ which is being hosted at Ghent University.

CLS INFRA TNA Fellow Riva Quiroga

In this video CLS INFRA TNA Fellow, Riva Quiroga, speaks about her ‘Improving Part of Speech Tagging for Latin-American Spanish Corpora’ project which is being hosted at Charles University, Prague.

CLS INFRA TNA Fellow Lou Burnard

In this video CLS INFRA TNA Fellow, Lou Burnard, speaks about his project ‘Reviving the Victorian Plays Project’, which is being hosted by the Moore Institute at NUI Galway, Ireland.

Deliverable 4.1: Skills gap analysis

We have explored gaps in the teaching of research skills for computational literary studies, both to inform the CLS INFRA project’s own approach to training schools and to chart the territory and gain broader insight into current CLS teaching practices. To understand supply, we manually annotated a sample of European university courses in Digital Humanities and summer school workshops. To index demand, we set up an online survey asking the community to evaluate a set of predetermined ‘skills’ based on their perceived future prospects in the field and in teaching (1-5 scale response, 118 participants).

The survey also offered a chance to observe the demographic structure of the CLS community. The prevalence of early career respondents indicates a new generational wave within computational literary studies. Participant gender was balanced, although introduction of variables such as career stage, self-reported proficiency, and discipline demonstrated skewness. Researchers who work in the field of CLS also report more experience in computational methods, which suggests that these go hand in hand in current practice. Despite the gap in skills education being more general in nature, we identified areas of heightened interest. These are the skills that make up the backbone of computational research: from designing the study to text collection, to multivariate analysis and statistical modeling. Survey responses reiterated that the current gap in schooling is quantitative rather than qualitative. Moreover, there was a consensus among participants that the institutionalized training of a new generation of researchers is instrumental to disciplinary advancement of CLS.

Download and read the full report here.

Download and view the poster presentation of results here.

Deliverable 3.1: Baseline Methodological User Needs Analysis

The purpose of this task was to identify, document and showcase best practices in CLS research in order to specify infrastructure requirements for the community. The central concerns of our study are the data formats, tools and methods most widely mentioned in publications related to CLS. These findings also play an important role in the training programme within the project, as they show what key qualifications are required for literary studies, what data formats researchers deal with and what methods and tools are especially relevant in the CLS field.

Download and read this report here

Watch a summary of the findings on our YouTube channel

Training School: Prague

The training school on Data and Annotation took place from 7 to 9 June 2022 and was hosted by the Institute of Formal and Applied Linguistics of the Faculty of Mathematics and Physics, Charles University in Prague, Czechia.

The materials, videos of sessions, and other information can be found on the DARIAH campus:

Kraków DH Lunch: the CLS INFRA project

CLS INFRA Principal Investigator, Professor Maciej Eder (IJP PAN), gives a quick tour around the CLS INFRA project for computational literary studies at the Kraków DH Lunch on 11th February 2022.

This special DH lunch introduces the Computational Literary Studies Infrastructure (CLS INFRA) project, a multinational European collaboration to connect people, data, tools, and methods, focused on large-scale analysis of literary sources.

D4.1 Skills matrix survey


Researcher profile: Professor Maciej Eder (PI)

In this video, our project’s Principal Investigator Professor Maciej Eder introduces us to the CLS INFRA project and how Computational Literary Studies methodologies and research intersect with his own scholarship and teaching. Make sure to subscribe to our YouTube channel so that you don’t miss our upcoming researcher profiles.