The Data of the Corpus of the Saeima
The source data for this corpus was crawled from the website of the Saeima (the Parliament of Latvia), where verbatim reports of all Saeima sessions are published in text format. The texts are processed using a semi-automatic pipeline to identify the boundaries of speeches and the speakers. The text is split into utterances, where each utterance contains a speech from only one speaker.
The Corpus of the Saeima includes transcriptions of parliamentary debates from 8 parliamentary terms (5th-12th), covering the years 1993-2017. The transcriptions in the Corpus of the Saeima contain 38 million tokens, 497 thousand utterances and 468 speakers.
The metadata available for each utterance includes the date and type of the parliamentary session and the speaker's name and affiliation.
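The utterance structure described above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline, which is semi-automatic and not described here in detail. The turn marker format "Name (Affiliation):" at the start of a line, the Utterance class and the split_utterances function are all hypothetical, chosen only to show how a transcript could be split so that each utterance belongs to exactly one speaker and carries the metadata listed above.

```python
import re
from dataclasses import dataclass
from typing import List

# ASSUMPTION: a speaker turn starts with "Name (Affiliation):".
# The real Saeima transcripts may mark turns differently.
TURN = re.compile(
    r"^(?P<name>[^():\n]+?)\s*\((?P<affiliation>[^()\n]+)\):\s*(?P<text>.*)$"
)


@dataclass
class Utterance:
    """One speech by one speaker, with the metadata named in the text."""
    date: str           # date of the session, e.g. "2016-01-14"
    session_type: str   # type of the session (hypothetical label)
    speaker: str
    affiliation: str
    text: str


def split_utterances(transcript: str, date: str, session_type: str) -> List[Utterance]:
    """Split a transcript into utterances, one speaker per utterance."""
    utterances: List[Utterance] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = TURN.match(line)
        if match:
            # A new turn marker opens a new utterance.
            utterances.append(
                Utterance(date, session_type, match["name"],
                          match["affiliation"], match["text"])
            )
        elif utterances:
            # Any other line continues the current speaker's utterance.
            last = utterances[-1]
            last.text = f"{last.text} {line}".strip()
    return utterances
```

Lines that appear before the first turn marker are ignored in this sketch; a real pipeline would likely need manual review for such cases, which is presumably why the original processing is only semi-automatic.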
Linked Data allows us to represent structured information about parliamentary debates by describing the properties of objects from the domain of parliamentary meetings and the relations between these objects. According to Linked Data principles, this information is represented using the Resource Description Framework (RDF).
The types of objects in the LinkedSaeima dataset are:
- Meeting - a top-level concept representing one parliament meeting (a plenary) usually consisting of multiple Speeches;
- Speech - an individual speech given at a Meeting by a particular Speaker in some Role;
- Speaker - a person giving a speech;
- Role - a role (e.g. Prime Minister) which the person represented when giving a Speech.
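The four object types listed above and the relations between them can be pictured as plain Python classes. This is only an illustrative sketch of the conceptual model, not the RDF representation itself; the class fields and the add_speech helper are assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Speaker:
    """A person giving a speech."""
    name: str


@dataclass(frozen=True)
class Role:
    """The role the person represented when speaking, e.g. Prime Minister."""
    label: str


@dataclass
class Speech:
    """An individual speech given at a Meeting by a Speaker in some Role."""
    sequence_number: int
    speaker: Speaker
    role: Role
    text: str


@dataclass
class Meeting:
    """One parliament meeting (a plenary), usually with multiple Speeches."""
    date: str
    speeches: List[Speech] = field(default_factory=list)

    def add_speech(self, speaker: Speaker, role: Role, text: str) -> Speech:
        # Hypothetical helper: number speeches in the order they are added.
        speech = Speech(len(self.speeches) + 1, speaker, role, text)
        self.speeches.append(speech)
        return speech
```

Keeping Role separate from Speaker reflects the list above: the same person can speak in different roles at different times, so the role is attached to each Speech and not to the person.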
For data modelling, we reuse the work of the LinkedEP project (European Parliament debates as Linked Data) and its Linkedpolitics vocabulary, referenced in the RDF data using the prefixes lpv and lpv_eu.
For example, a Speech is represented by lpv_eu:Speech. Its properties include the date (dc:date), a sequence number and the spoken text (lpv:spokenText). It is related to the Meeting it is a part of (dct:isPartOf), to the Speaker (lpv:speaker) and the Role in which the speech was given (lpv:spokenAs), and to the named entities mentioned in the text (schema:mentions).
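To make the Speech description above more concrete, the following Python sketch builds the corresponding triples and prints them in N-Triples syntax using only the standard library. Several things here are assumptions and not taken from the dataset: the namespace URIs given for lpv and lpv_eu, the example.org identifiers for the speech, meeting, speaker and role, and the function names. The property holding the sequence number is not named in the text, so it is left out.

```python
# ASSUMPTION: the lpv and lpv_eu namespace URIs below are our best guess;
# check the dataset's own prefix declarations before relying on them.
PREFIXES = {
    "lpv": "http://purl.org/linkedpolitics/vocabulary/",
    "lpv_eu": "http://purl.org/linkedpolitics/vocabulary/eu/plenary/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
    "schema": "http://schema.org/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ex": "http://example.org/saeima/",  # hypothetical resource namespace
}


def expand(curie):
    """Turn a prefixed name such as 'dc:date' into an N-Triples IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def literal(value):
    """Encode a plain string literal with N-Triples escaping."""
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def speech_triples(speech, meeting, speaker, role, date, text, mentions=()):
    """Return (subject, predicate, object) triples describing one Speech."""
    s = expand(speech)
    triples = [
        (s, expand("rdf:type"), expand("lpv_eu:Speech")),
        (s, expand("dc:date"), literal(date)),
        (s, expand("lpv:spokenText"), literal(text)),
        (s, expand("dct:isPartOf"), expand(meeting)),
        (s, expand("lpv:speaker"), expand(speaker)),
        (s, expand("lpv:spokenAs"), expand(role)),
    ]
    # One schema:mentions triple per named entity found in the text.
    triples += [(s, expand("schema:mentions"), expand(m)) for m in mentions]
    return triples


def to_ntriples(triples):
    """Serialise triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(speech_triples(
        "ex:speech/1", "ex:meeting/1", "ex:speaker/1", "ex:role/1",
        "2016-01-14", "Good morning.", mentions=["ex:entity/Riga"],
    )))
```

In practice an RDF library would handle serialisation and datatypes; the hand-written version is used here only to keep the example self-contained and to show which property connects which pair of objects.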
Datasets and access points
- Triple pattern fragments server
- RDF data dump
- Universal Dependencies (CoNLL-U)
- Bonito corpus browser (NoSketch engine)