The Data of the Corpus of the Saeima

The source data for this corpus was crawled from the Saeima's website, where verbatim reports of all sessions of the Saeima are published in text format. The texts are processed with a semi-automatic pipeline that identifies the boundaries of speeches and their speakers. The text is then split into utterances, each containing the speech of exactly one speaker.

The Corpus of the Saeima includes transcriptions of parliamentary debates from 8 parliamentary terms (5th-12th), covering the years 1993-2017. The transcriptions contain 38 million tokens, 497 thousand utterances and 468 speakers.

The metadata available for each utterance includes the date and type of the parliamentary session and the speaker's name and affiliation.

Linked Data

Linked Data allows us to represent structured information about parliamentary debates by describing the properties of objects from the domain of parliamentary meetings and the relations between these objects. In accordance with Linked Data principles, this information is represented using the Resource Description Framework (RDF).

The types of objects in the LinkedSaeima dataset are:

For data modelling, we reuse the work of the LinkedEP project (European Parliament debates as Linked Data) and its Linkedpolitics vocabulary, which is referenced in the RDF data using the prefixes lpv and lpv_eu.

For example, a Speech is represented by lpv_eu:Speech, its properties include date (dc:date), sequence number and spoken text (lpv:spokenText), and it is related to the Meeting it is a part of (dct:isPartOf), to the Speaker (lpv:speaker) and its Role (lpv:spokenAs), and to the named entities mentioned in the text (schema:mentions).
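
The Speech description above can be sketched as a handful of RDF triples. The following Python snippet is only an illustration, not the pipeline used to build LinkedSaeima: the ex: base URI, the resource identifiers and the sample text are hypothetical, and the namespace URIs given for lpv and lpv_eu are assumptions that should be checked against the published dataset. The property for the sequence number is left out because the text above does not name it.

```python
# Prefix-to-namespace map. The rdf, dc, dct and schema URIs are the standard ones.
# ASSUMPTION: the lpv and lpv_eu URIs; verify them against the dataset.
# HYPOTHETICAL: the ex: namespace and every identifier built on it.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
    "lpv": "http://purl.org/linkedpolitics/vocabulary/",
    "lpv_eu": "http://purl.org/linkedpolitics/vocabulary/eu/plenary/",
    "ex": "http://example.org/saeima/",
}


def expand(curie):
    """Turn a prefixed name such as 'dc:date' into a full IRI in angle brackets."""
    prefix, local = curie.split(":", 1)
    return "<" + PREFIXES[prefix] + local + ">"


def literal(text):
    """Format a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def speech_triples(speech, meeting, speaker, role, date, text, mentions):
    """Build (subject, predicate, object) triples for one Speech."""
    s = expand(speech)
    triples = [
        (s, expand("rdf:type"), expand("lpv_eu:Speech")),
        (s, expand("dc:date"), literal(date)),
        (s, expand("lpv:spokenText"), literal(text)),
        (s, expand("dct:isPartOf"), expand(meeting)),
        (s, expand("lpv:speaker"), expand(speaker)),
        (s, expand("lpv:spokenAs"), expand(role)),
    ]
    # One schema:mentions triple per named entity found in the text.
    triples += [(s, expand("schema:mentions"), expand(m)) for m in mentions]
    return triples


def to_ntriples(triples):
    """Serialise the triples one per line in N-Triples style."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical sample values, for illustration only.
example = speech_triples(
    speech="ex:speech/1",
    meeting="ex:meeting/1",
    speaker="ex:speaker/1",
    role="ex:role/1",
    date="2017-01-12",
    text="Sample speech text.",
    mentions=["ex:entity/1"],
)
print(to_ntriples(example))
```

Each printed line is one statement about the speech; the same pattern would apply to Meetings, Speakers and Roles, each with its own class and properties.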

Datasets and access points