This repository has been archived by the owner on May 8, 2024. It is now read-only.

Quality control

ninpnin edited this page Jul 29, 2021 · 5 revisions

The corpus is quality controlled statistically: we draw random samples of pages and estimate the accuracy of each subtask from them.

There are currently three subtasks that are quality controlled. For all of them, the threshold is 90% accuracy. Every subtask needs to reach this accuracy for the corpus to pass quality control.

#1 Introduction detection

In the corpus, each speaker is introduced before they speak. In the optimal case, the line where this is done is tagged with a <note type="speaker"> tag. However, the tagging is done automatically and is not 100% accurate.

To assess the accuracy, we

  1. Sampled 25 pages per decade at uniform probability
  2. Counted how many real introductions there were
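
The sampling step can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name, the `(page_id, year)` input format and the page identifiers are hypothetical, and "uniform probability" is assumed to mean simple random sampling without replacement within each decade.

```python
import random
from collections import defaultdict


def sample_pages_per_decade(pages, per_decade=25, seed=0):
    """Draw up to `per_decade` pages uniformly at random from each decade.

    `pages` is an iterable of (page_id, year) tuples (hypothetical format).
    Returns a dict mapping the decade's first year to a list of page ids.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_decade = defaultdict(list)
    for page_id, year in pages:
        by_decade[year // 10 * 10].append(page_id)

    sample = {}
    for decade, ids in sorted(by_decade.items()):
        # A decade with fewer pages than requested is taken in full.
        k = min(per_decade, len(ids))
        sample[decade] = rng.sample(ids, k)
    return sample


# Toy data: 10 pages for each year from 1920 to 1939.
pages = [(f"prot-{year}-p{n}", year) for year in range(1920, 1940) for n in range(10)]
sample = sample_pages_per_decade(pages)
```

Sampling within each decade separately keeps every decade equally represented, even if some decades contain far more pages than others.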

Completed samples

  1. First sample, accuracy 88.4%
  2. Second sample, accuracy 91.1%

#2 MP detection in introduction

Each introduction that we detect goes through an MP detection process. Optimally, each <note type="speaker"> tag will be associated with the correct MP through the who attribute that follows it.

To assess the accuracy, we

  1. Sampled 25 pages per decade at uniform probability
  2. For each introduction, we checked what the next who attribute was
    • If the who attribute was correct (e.g. "Väinö Yrjänäinen" is associated with vaino_yrjanainen_1234), we deem the MP detection correct
    • If the who attribute was incorrect (e.g. "Väinö Yrjänäinen" is associated with mans_magnusson_1234), we deem the MP detection incorrect
    • If the who attribute was "unknown", we deem the MP detection incorrect
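
A small sketch of how these judgements could be turned into numbers. The function name and the label strings are assumptions made for illustration. The upper bound is assumed to be the share of introductions whose who attribute is not "unknown", since an "unknown" can never count as correct; that matches an automatic check that only counts unknowns.

```python
from collections import Counter


def mp_detection_accuracy(judgements):
    """Return (accuracy, upper_bound) for a list of manual judgements.

    Each judgement is one of "correct", "incorrect" or "unknown"
    (hypothetical labels). Only "correct" counts towards accuracy.
    """
    counts = Counter(judgements)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the sample contains no introductions")
    accuracy = counts["correct"] / total
    # Every "unknown" is wrong by definition, so accuracy cannot exceed this.
    upper_bound = 1 - counts["unknown"] / total
    return accuracy, upper_bound


# Toy sample: 7 correct, 1 wrong MP, 2 unknown.
judgements = ["correct"] * 7 + ["incorrect"] + ["unknown"] * 2
accuracy, upper_bound = mp_detection_accuracy(judgements)
```

In this toy sample the accuracy is 7 out of 10 and the upper bound is 8 out of 10. The gap between the two is the share of introductions matched to the wrong MP.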

Completed samples

  1. Automatic check which only counted "unknowns", accuracy upper bound 68.7%
  2. Test sample with 5 per decade, accuracy 76.6%

#3 Paragraph classification

After detecting introductions, the rest of the plaintext is categorized into written information (<note>) and transcriptions of people speaking (<u>, for utterance).

To assess the accuracy, we

  1. Sampled 25 pages per decade at uniform probability
  2. For each paragraph classified as a <note> in the sample
    • If it is not a transcription of a person speaking, it is deemed correct, otherwise incorrect.
  3. For each paragraph inside a <u> tag in the sample
    • If it is a person speaking, it is deemed correct, otherwise incorrect.
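
The two rules above amount to checking whether the assigned tag agrees with a manual judgement of each paragraph. A minimal sketch, assuming each sampled paragraph is recorded as a hypothetical `(tag, is_speech)` pair where `tag` is "note" or "u" and `is_speech` is the annotator's verdict:

```python
def paragraph_accuracy(paragraphs):
    """Share of paragraphs whose tag matches the manual judgement.

    A "u" paragraph is correct if it is a person speaking;
    a "note" paragraph is correct if it is not.
    """
    if not paragraphs:
        raise ValueError("the sample contains no paragraphs")
    correct = sum((tag == "u") == is_speech for tag, is_speech in paragraphs)
    return correct / len(paragraphs)


# Toy sample: two paragraphs tagged correctly, two tagged incorrectly.
paragraphs = [
    ("note", False),  # written information tagged as note: correct
    ("u", True),      # speech tagged as utterance: correct
    ("u", False),     # written information tagged as utterance: incorrect
    ("note", True),   # speech tagged as note: incorrect
]
accuracy = paragraph_accuracy(paragraphs)
```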

Completed samples

None.