This repository has been archived by the owner on May 8, 2024. It is now read-only.
ninpnin edited this page Mar 17, 2021 · 1 revision

Swedish parliamentary proceedings - Riksdagens protokoll 1921-2020

Westac Project, 2020-21

Overview

The primary objective of the project is to process both digital speech records and scanned images of parliamentary proceedings, curate the data and segment it into speeches. Additionally, a catalog of the MPs will be created and linked to the speeches.

The programmatic part of the project is implemented as a Python package. The documentation of the package can be found under this link.

1. Gathering the data

Digital originals, 1990s onwards

OCR'd text

2. Curation and segmentation

Some key goals and assumptions:

  1. Most changes in the data are orthogonal to other changes. For example, a change in segmentation in one protocol has no effect on the segmentation of other protocols, and curation on page 12 has no effect on curation on page 35.
  2. Manual curation is assumed to be superior in quality to automatic curation.
  3. Curation and segmentation are ongoing processes, and it is important to facilitate adding improvements.
  4. Keeping the curation separate from the actual data and code follows the principles for annotation of textual data (Pustejovsky and Stubbs 2013), of which curation can be seen as a special case.

The following design choices were made based on the assumptions:

  1. All changes to curation and segmentation are stored and version controlled with git.
  2. When possible, the orthogonality between different parts of the data is retained.
    • Git allows for orthogonality on the file level.
    • Manual curation is orthogonal outside of the changed paragraph.
    • The orthogonality constraint in automatic curation is delegated to the author of the changes.
  3. In cases of conflict, manual curation takes precedence.

The repository provides a regexp-based approach to automatic curation. The general flow of the process is the following:

[Figure: data process diagram]

Input data in yellow, scripts in white, processing databases in blue, intermediary formats in gray, end products in green.
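The interplay between regexp-based automatic curation and manual curation can be illustrated with a minimal sketch. It is not the package's actual implementation: the rule list `AUTO_RULES`, the function `curate`, and the paragraph IDs are hypothetical names chosen for this example. It assumes curation operates on paragraphs and that a manual correction for a paragraph overrides any automatic rule.

```python
import re

# Hypothetical automatic rules, applied in order to each paragraph.
AUTO_RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),  # rejoin words hyphenated across a line break
    (re.compile(r"\s+"), " "),              # collapse whitespace runs into one space
]


def curate(paragraphs, manual_fixes):
    """Return curated paragraphs keyed by paragraph ID.

    paragraphs:   dict mapping a paragraph ID to its OCR'd text.
    manual_fixes: dict mapping a paragraph ID to manually corrected text.
    """
    curated = {}
    for par_id, text in paragraphs.items():
        if par_id in manual_fixes:
            # Manual curation takes precedence; automatic rules are skipped.
            curated[par_id] = manual_fixes[par_id]
            continue
        for pattern, replacement in AUTO_RULES:
            text = pattern.sub(replacement, text)
        curated[par_id] = text.strip()
    return curated


ocr = {
    "p1": "Herr tal-\nman!  Jag  yrkar bifall.",
    "p2": "Herr Svenson yttrade:",
}
manual = {"p2": "Herr Svensson yttrade:"}
result = curate(ocr, manual)
```

Because each rule only sees one paragraph at a time, a change to one paragraph cannot affect another, which is one way to keep automatic curation orthogonal.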

3. Metadata catalog

Wikipedia is currently the most extensive digital data collection of historical Swedish MPs. While there are gaps, its current scope is a good starting point and useful in many applications. Additionally, a (presumably comprehensive) list of MPs from 1990 onwards can be found at https://data.riksdagen.se/data/ledamoter/.

In the data, each speech is associated with an ID that points to a specific MP in the MP metadata database. A subset of MPs can then be selected from the MP database, and the speeches given by those MPs can be retrieved through their IDs.
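The ID-based selection could look roughly like the sketch below. The `MP` and `Speech` classes, their fields, and the sample records are invented for illustration and do not reflect the actual schema of the metadata database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MP:
    mp_id: str   # hypothetical identifier format
    name: str
    party: str


@dataclass(frozen=True)
class Speech:
    who: str     # MP ID, or "unknown"
    text: str


def speeches_by(mps, speeches, predicate):
    """Return the speeches whose speaker is an MP matching the predicate."""
    selected_ids = {mp.mp_id for mp in mps if predicate(mp)}
    return [speech for speech in speeches if speech.who in selected_ids]


# Invented sample data, for illustration only.
mps = [
    MP("mp1", "Example Person A", "Party X"),
    MP("mp2", "Example Person B", "Party Y"),
]
speeches = [
    Speech("mp1", "First speech."),
    Speech("mp2", "Second speech."),
    Speech("unknown", "Speech with no identified speaker."),
]
party_x_speeches = speeches_by(mps, speeches, lambda mp: mp.party == "Party X")
```

Speeches marked "unknown" never match a selected ID, so they are excluded from every MP-based subset.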

4. Parla-CLARIN as the output format

Parla-CLARIN is based on TEI and allows for fine-grained distinctions and segmentation of the data. Since most of our tagging is automatic, it is not feasible to use every feature in the specification. Currently, we include:

  • Utterances <u> with the attribute @who, containing an MP ID or "unknown". The <u> tags contain <seg> tags, which correspond to paragraphs in the data.
  • Notes <note>, with the following subtypes:
    • @type="speaker" Introduction to a speech, e.g. "Herr Svensson yttrade:".
    • @type="date" Date, e.g. "Onsdagen den 21. December 2020".
    • No type specified: generic note, metadata or similar.
  • Page beginnings <pb> with the @facs attribute pointing to the image of the page.
  • <teiHeader> containing several metadata tags such as <docDate>.
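To show how the elements listed above fit together, the following sketch builds a small fragment with Python's standard `xml.etree.ElementTree`. It is a simplification: the TEI namespace and the <teiHeader> are omitted, the enclosing <body> element is an assumption of this example, and the function name `build_fragment`, the sample text, and the image URL are placeholders.

```python
import xml.etree.ElementTree as ET


def build_fragment(facs, intro, who, paragraphs):
    """Build a page break, a speaker note and one utterance with its segments."""
    body = ET.Element("body")  # assumed container element for this sketch

    # Page beginning, pointing to the image of the page.
    ET.SubElement(body, "pb", {"facs": facs})

    # Introduction to the speech.
    note = ET.SubElement(body, "note", {"type": "speaker"})
    note.text = intro

    # The utterance itself, with one <seg> per paragraph.
    utterance = ET.SubElement(body, "u", {"who": who})
    for paragraph in paragraphs:
        seg = ET.SubElement(utterance, "seg")
        seg.text = paragraph
    return body


fragment = build_fragment(
    facs="https://example.org/page-12.jpg",  # placeholder URL
    intro="Herr Svensson yttrade:",
    who="unknown",
    paragraphs=["First paragraph of the speech.", "Second paragraph."],
)
xml_text = ET.tostring(fragment, encoding="unicode")
```

Printing `xml_text` gives the serialized fragment, which can be compared against the element list above.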

5. Train and test sets for curation and segmentation

For quality assurance, we have manually curated and segmented 2+2 randomly selected pages per decade. Furthermore, a CI check that verifies that this gold standard remains unchanged is run for each pull request.
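One possible way to detect changes to the gold standard is to compare content digests, as sketched below. This is an assumption about how such a check could work, not a description of the project's actual CI; the function names and page IDs are hypothetical.

```python
import hashlib


def digest(text):
    """Return the SHA-256 hex digest of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(gold_digests, current_pages):
    """Return the sorted IDs of gold-standard pages that are missing or altered.

    gold_digests:  dict mapping a page ID to the digest recorded at curation time.
    current_pages: dict mapping a page ID to the page text in the pull request.
    """
    changed = []
    for page_id, expected in gold_digests.items():
        current = current_pages.get(page_id)
        if current is None or digest(current) != expected:
            changed.append(page_id)
    return sorted(changed)


# Invented page IDs and content, for illustration only.
gold = {"page-a": digest("curated text A"), "page-b": digest("curated text B")}
unchanged = changed_pages(gold, {"page-a": "curated text A", "page-b": "curated text B"})
altered = changed_pages(gold, {"page-a": "curated text A", "page-b": "edited text"})
```

A CI job could fail the pull request whenever the returned list is non-empty.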

End product

  • Speeches in Parla-CLARIN format
  • Structured catalog of MPs and associated metadata
  • Test set consisting of images of text with corresponding correct transcriptions
  • Interface to edit the files both manually and programmatically, with tests / continuous integration