Home
Westac Project, 2020-21
The primary objective of the project is to process both digital speech records and scanned images of parliamentary proceedings, curate the data and segment it into speeches. Additionally, a catalog of the MPs will be created and linked to the speeches.
The programmatic part of the project is implemented as a Python package. The documentation of the package can be found under this link.
- Original data already in XML format are available from 1993-01-01 and onward. We will parse all available current data. These are located at: https://data.riksdagen.se/data/dokument/ and https://www.riksdagen.se/sv/dokument-lagar
- Earlier parliamentary proceedings from the bicameral era (up until 1970) are available as PDFs at: https://riksdagstryck.kb.se/tvakammarriksdagen.html
- Proceedings from the unicameral era (1970 onward) are available at: https://data.riksdagen.se/data/dokument/
- In the initial curation, the data we are going to use are the parliamentary proceedings from 1945-01-01 until 1992-12-31. Later on, proceedings from 1920 to 1945 will be added.
- As with the digital originals, all parliamentary data can be accessed at https://betalab.kb.se/
Some key goals and assumptions:
- Most changes in the data are orthogonal to other changes. For example, a change in segmentation in one protocol has no effect on the segmentation of other protocols, and curation on page 12 has no effect on curation on page 35.
- Manual curation is assumed to be superior in quality to automatic curation.
- Curation and segmentation are ongoing processes, and it is important to facilitate adding improvements.
- Keeping the curation separate from the actual data and code follows the principles for annotation of textual data (Pustejovsky and Stubbs 2013), of which curation can be seen as a special case.
The following design choices were made based on the assumptions:
- All changes to curation and segmentation are stored and version controlled with git.
- When possible, the orthogonality between different parts of the data is retained.
- Git allows for orthogonality on the file level.
- Manual curation is orthogonal outside of the changed paragraph.
- The orthogonality constraint in automatic curation is delegated to the author of the changes.
- In cases of conflict, manual curation takes precedence.
The repository provides a regexp-based approach to automatic curation. The general flow of the process is the following:
Input data in yellow, scripts in white, processing databases in blue, intermediary formats in gray, end products in green.
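As a rough illustration of regexp-based curation with manual precedence, the following minimal sketch applies a list of substitution rules to each paragraph independently and lets a manually curated paragraph override the automatic result. The rules, the function names `curate_paragraph` and `curate_page`, and the sample text are assumptions made for this example; they are not taken from the repository.

```python
import re

# Hypothetical curation rules as (compiled pattern, replacement) pairs.
# These are illustrative only and are not the project's actual rules.
RULES = [
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),  # rejoin words hyphenated across a line break
    (re.compile(r"[ \t]{2,}"), " "),        # collapse runs of spaces or tabs
]


def curate_paragraph(text, rules=RULES):
    """Apply every rule, in order, to a single paragraph."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def curate_page(paragraphs, manual=None):
    """Curate each paragraph independently of the others.

    `manual` maps a paragraph index to manually curated text. When an
    index is present there, the manual text is used as-is and the
    automatic rules are skipped for that paragraph.
    """
    manual = manual or {}
    return [
        manual[i] if i in manual else curate_paragraph(p)
        for i, p in enumerate(paragraphs)
    ]


page = ["riks-\ndagen  sammanträdde", "Herr  Svenson yttrade:"]
curated = curate_page(page, manual={1: "Herr Svensson yttrade:"})
```

Because each paragraph is processed on its own, a manual fix to one paragraph leaves the others untouched, which mirrors the orthogonality assumption stated above.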
Wikipedia is currently the most extensive digital data collection of historical Swedish MPs. While there are gaps, its current scope is a good starting point and useful in many applications. Additionally, a (presumably comprehensive) list of the MPs can be found at https://data.riksdagen.se/data/ledamoter/ from 1990 onwards.
In the data, speeches are associated with an ID that points to a specific MP in the MP metadata database. Subsets of MPs can then be selected in the MP database, and the speeches can be filtered down to those given by the selected MPs.
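The lookup described here can be sketched as a two-step filter: first pick MP IDs from the metadata, then keep the speeches whose ID is in that set. The `MP` and `Speech` classes, their fields, and the sample records below are hypothetical and only serve the illustration; the actual database schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MP:
    # Hypothetical metadata fields; the real catalog may hold others.
    mp_id: str
    name: str
    party: str


@dataclass(frozen=True)
class Speech:
    who: str   # an MP ID, or "unknown" when the speaker was not identified
    text: str


def select_speeches(speeches, mps, predicate):
    """Return the speeches given by MPs for which `predicate(mp)` is true."""
    selected_ids = {mp.mp_id for mp in mps if predicate(mp)}
    return [s for s in speeches if s.who in selected_ids]


# Made-up sample records, for illustration only.
mps = [MP("mp-0001", "A. Svensson", "X"), MP("mp-0002", "B. Larsson", "Y")]
speeches = [
    Speech("mp-0001", "First speech."),
    Speech("unknown", "Unattributed speech."),
    Speech("mp-0002", "Second speech."),
]
party_x = select_speeches(speeches, mps, lambda mp: mp.party == "X")
```

Speeches marked "unknown" never match an MP ID, so they drop out of any MP-based subset.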
4. Parla-CLARIN as the output format
Parla-Clarin is based on TEI, and it allows for rather delicate distinctions and segmentation of the data. Since most of the tagging we do is automatic, it is not feasible to use all of the spec'd features. Currently, we include
- Utterances <u> with the attribute @who, containing an MP ID or "unknown". The <u> tags include <seg> tags, which correspond to paragraphs in the data.
- Notes <note>. These are subtypes:
  - @type="speaker": Introduction to a speech, e.g. "Herr Svensson yttrade:".
  - @type="date": Date, e.g. "Onsdagen den 21. December 2020".
  - No type specified: generic note, metadata or similar.
- Page beginnings <pb> with the @facs attribute pointing to the image of the page.
- <teiHeader> containing several metadata tags such as <docDate>.
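To make the tag inventory above concrete, here is a small sketch that parses a simplified document using only the elements listed and reads out speakers, paragraphs, notes and the page image reference. The sample XML is invented for this example, has not been validated against the Parla-CLARIN schema, and uses placeholder IDs and a placeholder image URL.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up sample. Element nesting is reduced to the tags
# named in the text and is not guaranteed to be schema-valid.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><docDate>2020-12-21</docDate></teiHeader>
  <text><body><div>
    <pb facs="https://example.org/page-001.jpg"/>
    <note type="date">Onsdagen den 21. December 2020</note>
    <note type="speaker">Herr Svensson yttrade:</note>
    <u who="mp-0001">
      <seg>First paragraph of the speech.</seg>
      <seg>Second paragraph of the speech.</seg>
    </u>
    <u who="unknown"><seg>A paragraph with no identified speaker.</seg></u>
  </div></body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
root = ET.fromstring(SAMPLE)


def utterances(root):
    """Return (who, [paragraph, ...]) for every <u> element."""
    result = []
    for u in root.iterfind(".//tei:u", NS):
        segs = [" ".join(seg.text.split()) for seg in u.iterfind("tei:seg", NS)]
        result.append((u.get("who"), segs))
    return result


def notes_of_type(root, note_type):
    """Return the text of every <note> whose @type equals `note_type`."""
    return [
        n.text
        for n in root.iterfind(".//tei:note", NS)
        if n.get("type") == note_type
    ]


page_image = root.find(".//tei:pb", NS).get("facs")
```

The TEI namespace has to be passed to every lookup, since ElementTree does not resolve unprefixed element names against the default namespace.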
For quality assurance, we have manually curated and segmented 2+2 randomly selected pages per decade. Furthermore, a CI check that verifies that this gold standard remains unchanged runs on each pull request.
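One way such a check could work is to store a fingerprint of every gold standard page and compare the regenerated pages against it. The sketch below is an assumption about the mechanism, not a description of the project's actual CI; the names `fingerprint` and `changed_pages` and the page IDs are hypothetical.

```python
import hashlib


def fingerprint(text):
    """SHA-256 hex digest of a page's text, encoded as UTF-8."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def changed_pages(gold_pages, manifest):
    """Return the sorted IDs of pages that differ from the manifest.

    `gold_pages` maps a page ID to its current text and `manifest` maps
    a page ID to the stored fingerprint. A page is reported when its
    content changed or when it is missing on either side.
    """
    current = {pid: fingerprint(text) for pid, text in gold_pages.items()}
    all_ids = current.keys() | manifest.keys()
    return sorted(pid for pid in all_ids if current.get(pid) != manifest.get(pid))


# Made-up page IDs and contents, for illustration only.
gold = {"1950-p12": "Herr Svensson yttrade:", "1960-p35": "Herr talman!"}
manifest = {pid: fingerprint(text) for pid, text in gold.items()}

gold_after_edit = dict(gold, **{"1960-p35": "Herr talman! "})
```

A CI job could fail the pull request whenever `changed_pages` returns a non-empty list.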
- Speeches in Parla-Clarin format
- Structured catalog of MPs and associated metadata
- Test set consisting of images of text and with corresponding correct transcriptions
- Interface to edit the files both manually and programmatically, with tests / continuous integration