This repository has been archived by the owner on May 8, 2024. It is now read-only.

Submit corrections

ninpnin edited this page Mar 17, 2021 · 1 revision

All curation and edits to segmentation are done directly in the Parla-CLARIN files in the data folder.

Manual curation

Editing the protocols on GitHub

It is possible to make manual corrections directly on the project's GitHub page.

First, find the protocol file and click on edit:

Open protocol file for editing GIF

When you are done with the editing, create a pull request:

Create pull request GIF

Alternatively, you can clone the repository, make and commit the changes, push them, and create a pull request. Once you have created a pull request, automatic tests are run to verify the validity of the format.

Making changes to Parla-CLARIN formatted protocols

Error curation is relatively straightforward since you only edit the plaintext. Segmentation, however, needs to adhere to the Parla-CLARIN structure.

All plaintext is located inside a tag, like this: <tagname>The actual plaintext</tagname>. In addition to text, tags can contain other tags, as in <tagname1><tagname2>The actual plaintext</tagname2></tagname1>.
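To make the nesting concrete, here is a minimal sketch that parses such a snippet with Python's standard xml.etree.ElementTree module and lists each tag together with its text. The tag names are the placeholders from the paragraph above, not real Parla-CLARIN tags, and the helper name describe is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder tag names taken from the paragraph above, not real Parla-CLARIN tags.
snippet = "<tagname1><tagname2>The actual plaintext</tagname2></tagname1>"
root = ET.fromstring(snippet)


def describe(elem, depth=0):
    """Return (depth, tag, text) triples for an element and all its descendants."""
    text = (elem.text or "").strip()
    rows = [(depth, elem.tag, text)]
    for child in elem:
        rows.extend(describe(child, depth + 1))
    return rows


rows = describe(root)
for depth, tag, text in rows:
    # Indent each tag by its nesting depth.
    print("  " * depth + f"<{tag}> {text!r}")
```

The outer tag carries no text of its own; the plaintext belongs to the innermost tag.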

For example, you can correct the text of

[...]
<note>
  $ 39 Övriga diskussioner
</note>
[...]

to this

[...]
<note>
  § 39 Övriga diskussioner
</note>
[...]

Tags might also have attributes, for example <note type="speaker">.

Specifically in the Parla-CLARIN format, the tags we use are <note>, <u> and <seg>. <u> tags only contain <seg> tags, and <seg> and <note> tags contain text.
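Before opening a pull request, you may want to check this nesting rule locally. The following is only a sketch, not the project's own test suite: the function name structure_problems is hypothetical, the rule is read strictly (<seg> and <note> hold text only), and XML namespaces, which real protocol files may use, are ignored.

```python
import xml.etree.ElementTree as ET


def structure_problems(root):
    """List violations of: <u> holds only <seg>; <seg> and <note> hold only text."""
    problems = []
    for elem in root.iter():
        if elem.tag == "u":
            # Text directly inside <u>, before the first child, is not allowed.
            if (elem.text or "").strip():
                problems.append("<u> contains bare text")
            for child in elem:
                if child.tag != "seg":
                    problems.append(f"<u> contains <{child.tag}>")
                # Text between or after the children of <u> is not allowed either.
                if (child.tail or "").strip():
                    problems.append("<u> contains bare text")
        elif elem.tag in ("seg", "note") and len(elem) > 0:
            problems.append(f"<{elem.tag}> contains child tags")
    return problems


good = ET.fromstring(
    '<body><note type="speaker">Herr Lunström yttrade:</note>'
    "<u><seg>Herr Talman!</seg></u></body>"
)
bad = ET.fromstring("<body><u><note>Herr Talman!</note></u></body>")

print(structure_problems(good))
print(structure_problems(bad))
```

The first call reports no problems; the second flags the <note> placed inside <u>.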

For example, you can fix the segmentation of

[...]
<note type="speaker">
  Herr Lunström yttrade:
</note>
<u>
  <seg>
    Herr Talman!
  </seg>
</u>
<note>
  Jag glömde vad jag skulle säga.
</note>
[...]

to this

[...]
<note type="speaker">
  Herr Lunström yttrade:
</note>
<u>
  <seg>
    Herr Talman!
  </seg>
  <seg>
    Jag glömde vad jag skulle säga.
  </seg>
</u>
[...]

Note that the last sentence moved inside the <u> tag and its tag changed from <note> to <seg>.
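The same segmentation fix can also be made programmatically. The sketch below uses Python's standard xml.etree.ElementTree; the helper name move_note_into_u is hypothetical, and the snippet leaves out namespaces and attributes that a real protocol file may carry.

```python
import xml.etree.ElementTree as ET


def move_note_into_u(parent, u, note):
    """Move the text of a mis-tagged <note> into the utterance as a new <seg>."""
    seg = ET.SubElement(u, "seg")  # appended as the last child of <u>
    seg.text = note.text
    parent.remove(note)            # drop the old <note> from its parent
    return seg


body = ET.fromstring(
    "<body>"
    '<note type="speaker">Herr Lunström yttrade:</note>'
    "<u><seg>Herr Talman!</seg></u>"
    "<note>Jag glömde vad jag skulle säga.</note>"
    "</body>"
)
u = body.find("u")
stray_note = body.findall("note")[1]  # the second <note> is the mis-tagged one
move_note_into_u(body, u, stray_note)

print(ET.tostring(body, encoding="unicode"))
```

The speaker introduction stays a <note>, while the utterance now holds both sentences as <seg> tags.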

You do not need to reproduce the formatting shown in the example. Formatting is applied automatically after a pull request.

Automatic curation

Rule number one: do not overwrite manual corrections. Each element has a hash in its @n attribute. A new hash can be calculated with a function in the Python module. If this new hash does not match the one stored in @n, the content has been changed in the meantime, that is, manually.

For this reason, an automatic workflow should be done in the following manner:

  1. Check whether the element has been manually edited by comparing the old and the new hash
  2. If the element has not been manually edited, apply the potential changes
  3. If the element has not been manually edited, recalculate the hash and set @n to it
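The three steps above can be sketched as follows. This is an illustration under assumptions, not the project's implementation: content_hash is a stand-in that uses MD5 over the whitespace-normalized text, whereas the real hash function lives in the project's Python module and may work differently, and the names content_hash and curate are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def content_hash(elem):
    """Stand-in hash of an element's text; the project's real function may differ."""
    text = " ".join((elem.text or "").split())
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def curate(elem, fix):
    """Apply `fix` to the element's text unless the element was manually edited.

    Returns True if the change was applied, False if the element was skipped.
    """
    # 1. A stored hash that differs from the recomputed one signals a manual edit.
    if elem.get("n") != content_hash(elem):
        return False
    # 2. Not manually edited: apply the change.
    elem.text = fix(elem.text or "")
    # 3. Recalculate the hash and store it in @n.
    elem.set("n", content_hash(elem))
    return True


def fix_section_sign(text):
    """Example automatic correction: replace a misread "$" with "§"."""
    return text.replace("$", "§")


# An untouched element: the stored hash matches, so the fix is applied.
untouched = ET.fromstring("<note>$ 39 Övriga diskussioner</note>")
untouched.set("n", content_hash(untouched))
applied = curate(untouched, fix_section_sign)

# A manually edited element: the text changed after the hash was stored.
edited = ET.fromstring("<note>$ 40 Motioner</note>")
edited.set("n", content_hash(edited))
edited.text = "§ 40 Motioner, rättad för hand"
skipped = not curate(edited, fix_section_sign)

print(applied, untouched.text)
print(skipped, edited.text)
```

Checking the hash before every change is what keeps an automatic pass from silently overwriting a manual correction.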

When you are done with the editing, submit a pull request to the main branch.