This repository has been archived by the owner on May 8, 2024. It is now read-only.

Parla Clarin cheat sheet

ninpnin edited this page Jul 29, 2021 · 3 revisions

Parla-Clarin is a flexible and complex format. In this project, we use a subset of it. Here's a quick rundown.

Overview

<teiCorpus>
  <teiHeader>
  <!-- metadata on the whole corpus -->
  </teiHeader>
  <TEI>
    <teiHeader>
    <!-- metadata on the current protocol -->
    </teiHeader>
    <text>
      <body>
        <div>
          <!-- actual content -->
        </div>
      </body>
    </text>
  </TEI>
</teiCorpus>
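
The nesting in the skeleton can be walked with Python's standard library. The following is only a sketch: `CORPUS_XML` and `content_divs` are hypothetical names, and the toy input assumes no XML namespace. If the real files declare one, the path would need namespace-qualified element names.

```python
import xml.etree.ElementTree as ET

# Toy corpus mirroring the skeleton; not taken from a real protocol file.
CORPUS_XML = """
<teiCorpus>
  <teiHeader><!-- metadata on the whole corpus --></teiHeader>
  <TEI>
    <teiHeader><!-- metadata on the current protocol --></teiHeader>
    <text>
      <body>
        <div><note>Session started</note></div>
      </body>
    </text>
  </TEI>
</teiCorpus>
"""

def content_divs(corpus_xml):
    """Return every <div> nested as <text>/<body>/<div> inside a protocol."""
    root = ET.fromstring(corpus_xml.strip())
    # "*" matches any direct child of the corpus root, so the path relies
    # only on the text/body/div nesting, not on the wrapper's exact name.
    return root.findall("./*/text/body/div")

divs = content_divs(CORPUS_XML)
print(len(divs), [child.tag for child in divs[0]])
```

The corpus-level `<teiHeader>` has no `<text>` child, so the path skips it and returns only the content of each protocol.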

Metadata

TODO: write summary

Content

Here's a snippet of the content of a Parla-Clarin protocol file:

<pb xml:n="1" facs="https://website.com/protokoll-page-1.jpg"/>
<note>
  Session started
</note>
<note type="date" when="2020-02-05">
  Onsdagen den 5. februari 2020
</note>
<note type="speaker">
  Herr talmannen anförde:
</note>
<u who="talmannen_u42fd2">
  <seg>
    Hej allihopa!
  </seg>
  <seg>
    Trevligt att träffas.
  </seg>
</u>

Speeches

Each speech is contained in a <u> element. Inside the <u> element there are one or more <seg> elements, one for each paragraph of the speech.

<u who="talmannen_u42fd2">
  <seg>Hej allihopa!</seg>
  <seg>Trevligt att träffas.</seg>
</u>

In addition to containing <seg> elements, <u> elements carry attributes:

<u who="talmannen_u42fd2">

The who attribute identifies the speaker.
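
As a rough illustration of reading a speech, the sketch below pulls the speaker and the paragraphs out of one `<u>` element. `SPEECH_XML` and `speech_paragraphs` are hypothetical names, and the input is the example from this page rather than a real protocol.

```python
import xml.etree.ElementTree as ET

SPEECH_XML = """
<u who="talmannen_u42fd2">
  <seg>Hej allihopa!</seg>
  <seg>Trevligt att träffas.</seg>
</u>
"""

def speech_paragraphs(u_elem):
    """Return (speaker id, list of paragraph strings) for one <u> element."""
    paragraphs = []
    for seg in u_elem.findall("seg"):
        # Join all text inside the <seg> and collapse indentation and
        # line breaks into single spaces.
        paragraphs.append(" ".join("".join(seg.itertext()).split()))
    return u_elem.get("who"), paragraphs

who, paragraphs = speech_paragraphs(ET.fromstring(SPEECH_XML.strip()))
print(who, paragraphs)
```

Whitespace is normalized because the pretty-printed form of the files puts each paragraph on its own indented line.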

<u xml:id="speech1" cont="speech2" who="talmannen_u42fd2">Hej!</u>
<note>30 sec paus</note>
<u xml:id="speech2" prev="speech1" who="talmannen_u42fd2">Hejdå!</u>

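With xml:id and the cont and prev attributes, the parts of a speech stay connected even when a note comes between them: cont points forward to the continuation, and prev points back to the preceding part.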
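
One way to stitch linked parts back together is to follow the cont references from each part that has no prev. This is a minimal sketch under the assumption that cont holds the bare xml:id of the next part, as in the example; `FRAGMENT` and `joined_speeches` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

FRAGMENT = """
<div>
  <u xml:id="speech1" cont="speech2" who="talmannen_u42fd2">Hej!</u>
  <note>30 sec paus</note>
  <u xml:id="speech2" prev="speech1" who="talmannen_u42fd2">Hejdå!</u>
</div>
"""

def joined_speeches(div):
    """Return one (who, [part texts]) tuple per logical speech."""
    by_id = {u.get(XML_ID): u for u in div.iter("u") if u.get(XML_ID)}
    speeches = []
    for u in div.iter("u"):
        if u.get("prev") is not None:
            continue  # a continuation; collected via its first part
        parts, current, seen = [], u, set()
        # Follow cont links; `seen` guards against accidental cycles.
        while current is not None and id(current) not in seen:
            seen.add(id(current))
            parts.append("".join(current.itertext()).strip())
            current = by_id.get(current.get("cont"))
        speeches.append((u.get("who"), parts))
    return speeches

speeches = joined_speeches(ET.fromstring(FRAGMENT.strip()))
print(speeches)
```

The note between the two parts is left where it is; only the `<u>` elements are grouped.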

Notes

<note type="date" when="2020-02-05">Onsdagen den 5. februari 2020</note>

The when attribute gives the date in YYYY-MM-DD format. The text content keeps the date as it was written in the original.
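
Because when is in YYYY-MM-DD format, it can be parsed directly with the standard library. A small sketch, where `note_date` is a hypothetical helper and not part of this project:

```python
import datetime
import xml.etree.ElementTree as ET

note = ET.fromstring(
    '<note type="date" when="2020-02-05">Onsdagen den 5. februari 2020</note>'
)

def note_date(note_elem):
    """Return the date of a date note, or None for any other note."""
    when = note_elem.get("when")
    if note_elem.get("type") != "date" or when is None:
        return None
    # date.fromisoformat accepts exactly the YYYY-MM-DD form.
    return datetime.date.fromisoformat(when)

parsed = note_date(note)
print(parsed.isoformat(), "|", note.text)
```

The machine-readable date comes from the attribute, while `note.text` still holds the original wording.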

<note type="speaker">Herr talmannen anförde:</note>

Page beginnings

<pb xml:n="1" facs="https://website.com/protokoll-page-1.jpg"/>

Page beginnings have no textual content. They have two attributes: n, which gives the page number, and facs, which links to the scanned image of the page.
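
Reading the two attributes could look like the sketch below. The example writes the page number as xml:n, so the hypothetical helper `page_info` checks that spelling first and falls back to a plain n; which spelling the real files use is an assumption to verify.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:-prefixed attributes under the reserved XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

def page_info(pb_elem):
    """Return (page number as a string, facsimile URL) for one <pb> element."""
    number = pb_elem.get(XML_NS + "n", pb_elem.get("n"))
    return number, pb_elem.get("facs")

pb = ET.fromstring(
    '<pb xml:n="1" facs="https://website.com/protokoll-page-1.jpg"/>'
)
page, url = page_info(pb)
print(page, url)
```

The page number is kept as a string, since nothing here guarantees that every value is a plain integer.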

In our case, the links point to Betalab (e.g. https://betalab.kb.se/prot-1949--ak--12/prot_1949__ak__12-002.jp2/_view). The files are not under copyright, but you need credentials to access them.