Quality control

The corpus is quality controlled statistically. We have drawn random samples of the pages and used them to estimate the accuracy of each subtask described below.

There are currently three subtasks that are quality controlled. For all of them, the threshold is 90% accuracy. Every subtask needs to reach this accuracy for the corpus to count as quality controlled.
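The pass criterion can be sketched as a small check. This is only an illustration: the function name, the dictionary keys, and the choice to use the most recent sample of each subtask are assumptions, and a subtask without a completed sample is treated as not passing.

```python
THRESHOLD = 0.90  # 90% accuracy, as stated above


def corpus_passes(accuracies, threshold=THRESHOLD):
    """Return True only if every subtask has an estimate at or above the threshold.

    `accuracies` maps a subtask name (hypothetical keys) to an accuracy in [0, 1],
    or to None when no sample has been completed for that subtask.
    """
    return all(acc is not None and acc >= threshold for acc in accuracies.values())


# Most recent figures listed on this page (assumed to be the ones that count).
latest = {
    "introduction_detection": 0.911,
    "mp_detection": 0.766,
    "paragraph_classification": None,  # no completed sample
}

print(corpus_passes(latest))
```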

#1 Introduction detection

In the corpus, each speaker is introduced before they speak. Ideally, the line where this happens is tagged with a <note type="speaker"> tag. However, the tagging is done automatically and is not 100% accurate.

To assess the accuracy, we

Sampled 25 pages per decade at uniform probability
Counted how many real introductions there were
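
The sampling step can be sketched as follows. This is a minimal illustration, not the project's actual script: the (page_id, year) input layout, the function name, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def sample_pages_per_decade(pages, per_decade=25, seed=0):
    """Draw up to `per_decade` pages uniformly at random from each decade.

    `pages` is an iterable of (page_id, year) tuples (hypothetical layout).
    Returns a dict mapping the decade's first year to the sampled page ids.
    """
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    by_decade = defaultdict(list)
    for page_id, year in pages:
        by_decade[year // 10 * 10].append(page_id)

    sample = {}
    for decade in sorted(by_decade):
        candidates = by_decade[decade]
        k = min(per_decade, len(candidates))  # a decade may have fewer pages
        sample[decade] = rng.sample(candidates, k)  # without replacement
    return sample


# Toy input: 100 pages per decade for three decades.
pages = [(f"page_{year}_{i}", year) for year in range(1920, 1950) for i in range(10)]
sample = sample_pages_per_decade(pages)
```

Sampling a fixed number of pages per decade gives every decade the same weight in the estimate, regardless of how many pages it contains.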

Completed samples

First sample, accuracy 88.4%
Second sample, accuracy 91.1%

#2 MP detection in introduction

Each introduction that we detect will go through an MP detection process. Optimally, each <note type="speaker"> tag will be associated with

To assess the accuracy, we

Sampled 25 pages per decade at uniform probability
For each introduction, we checked what the next who attribute was
- If the who attribute was correct (e.g. "Väinö Yrjänäinen" is associated with vaino_yrjanainen_1234), we deem the MP detection correct
- If the who attribute was incorrect (e.g. "Väinö Yrjänäinen" is associated with mans_magnusson_1234), we deem the MP detection incorrect
- If the who attribute was "unknown", we deem the MP detection incorrect
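
The scoring rules can be sketched in code. The function names and the (expected_id, detected_who) pair layout are assumptions made for illustration. Because "unknown" always counts as incorrect, the share of non-"unknown" values is an upper bound on the accuracy, and it can be computed without manual annotation.

```python
def mp_detection_accuracy(pairs):
    """Share of introductions whose `who` attribute matches the expected MP id.

    `pairs` is a list of (expected_id, detected_who) tuples (hypothetical layout).
    A wrong id and the value "unknown" both count as incorrect.
    """
    correct = sum(
        1 for expected, who in pairs if who != "unknown" and who == expected
    )
    return correct / len(pairs)


def accuracy_upper_bound(who_values):
    """Share of `who` values that are not "unknown".

    Every "unknown" is incorrect, so accuracy cannot exceed this share.
    """
    known = sum(1 for who in who_values if who != "unknown")
    return known / len(who_values)


annotated = [
    ("vaino_yrjanainen_1234", "vaino_yrjanainen_1234"),  # correct
    ("vaino_yrjanainen_1234", "mans_magnusson_1234"),    # wrong MP
    ("mans_magnusson_1234", "unknown"),                  # not detected
    ("mans_magnusson_1234", "mans_magnusson_1234"),      # correct
]

print(mp_detection_accuracy(annotated))                        # 0.5
print(accuracy_upper_bound([who for _, who in annotated]))     # 0.75
```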

Completed samples

Automatic check which only counted "unknowns", accuracy upper bound 68.7%
Test sample with 5 pages per decade, accuracy 76.6%

#3 Paragraph classification

After detecting introductions, the rest of the plaintext is categorized into written information (note) and transcriptions of people speaking (u, for utterance).

To assess the accuracy, we

Sampled 25 pages per decade at uniform probability
For each paragraph classified as a <note> in the sample
- If it is not a transcription of a person speaking, it is deemed correct, otherwise incorrect.
For each paragraph inside a <u> tag in the sample
- If it is a person speaking, it is deemed correct, otherwise incorrect.
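
The rules above can be sketched as a small scoring function. The (tag, is_speech) input layout and the function name are assumptions for illustration; `is_speech` stands for the annotator's judgement of whether the paragraph is a transcription of a person speaking.

```python
def paragraph_accuracy(annotations):
    """Share of paragraphs whose tag agrees with the annotator's judgement.

    `annotations` is a list of (tag, is_speech) tuples (hypothetical layout),
    where `tag` is "note" or "u". A <u> paragraph is correct if it is speech;
    a <note> paragraph is correct if it is not speech.
    """
    correct = 0
    for tag, is_speech in annotations:
        if tag == "u" and is_speech:
            correct += 1
        elif tag == "note" and not is_speech:
            correct += 1
    return correct / len(annotations)


annotated = [
    ("u", True),       # speech tagged as utterance: correct
    ("note", False),   # written information tagged as note: correct
    ("note", True),    # speech tagged as note: incorrect
    ("u", False),      # written information tagged as utterance: incorrect
    ("u", True),       # correct
]

print(paragraph_accuracy(annotated))  # 0.6
```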

Completed samples

None.
