Parla-Clarin cheat sheet
Parla-Clarin is a flexible and complex format. In this project, we use a subset of it. Here's a quick rundown.
<teiCorpus>
<teiHeader>
<!-- metadata on the whole corpus -->
</teiHeader>
<TEI>
<teiHeader>
<!-- metadata on the current protocol -->
</teiHeader>
<text>
<body>
<div>
<!-- actual content -->
</div>
</body>
</text>
</TEI>
</teiCorpus>
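As a rough illustration of how this nesting can be walked programmatically, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample document and the helper name iter_protocol_divs are hypothetical. The sample also omits the TEI namespace (http://www.tei-c.org/ns/1.0) that real files typically declare; with a namespace present, the tag names in the paths would need to be qualified.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature corpus mirroring the skeleton above (no namespace declared).
CORPUS_XML = """
<teiCorpus>
  <teiHeader><!-- metadata on the whole corpus --></teiHeader>
  <TEI>
    <teiHeader><!-- metadata on the current protocol --></teiHeader>
    <text><body><div><note>Session started</note></div></body></text>
  </TEI>
</teiCorpus>
"""


def iter_protocol_divs(corpus_root):
    """Yield (protocol_index, div_element) for every content <div> in the corpus."""
    for index, protocol in enumerate(corpus_root.findall("TEI")):
        for div in protocol.findall("./text/body/div"):
            yield index, div


root = ET.fromstring(CORPUS_XML)
divs = list(iter_protocol_divs(root))
print(len(divs), divs[0][1][0].tag)
```

Each protocol is one TEI element, so the index returned alongside each div tells you which protocol the content belongs to.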
TODO: write summary
Here's a snippet of the content of a Parla-Clarin protocol file:
<pb n="1" facs="https://website.com/protokoll-page-1.jpg"/>
<note>
Session started
</note>
<note type="date" when="2020-02-05">
Onsdagen den 5. februari 2020
</note>
<note type="speaker">
Herr talmannen anförde:
</note>
<u who="talmannen_u42fd2">
<seg>
Hej allihopa!
</seg>
<seg>
Trevligt att träffas.
</seg>
</u>
Speeches are contained in a <u> element. Inside each <u>, one or more <seg> elements hold the paragraphs of the speech:
<u who="talmannen_u42fd2">
<seg>Hej allihopa!</seg>
<seg>Trevligt att träffas.</seg>
</u>
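To show how the paragraphs of a speech might be read out, here is a minimal Python sketch using xml.etree.ElementTree. The wrapping <div> and the helper name speech_paragraphs are assumptions made for illustration, and the sample again leaves out any namespace declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical content fragment: one speech with two paragraphs.
SNIPPET = """
<div>
  <u who="talmannen_u42fd2">
    <seg>Hej allihopa!</seg>
    <seg>Trevligt att träffas.</seg>
  </u>
</div>
"""


def speech_paragraphs(u):
    """Return the whitespace-normalised text of each <seg> directly inside a <u>."""
    return [" ".join("".join(seg.itertext()).split()) for seg in u.findall("seg")]


div = ET.fromstring(SNIPPET)
for u in div.iter("u"):
    print(speech_paragraphs(u))
```

Normalising whitespace matters because pretty-printed files, like the longer snippet earlier on this page, put line breaks and indentation around the text inside each <seg>.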
In addition to <seg> children, <u> elements carry attributes:
<u who="talmannen_u42fd2">
The who attribute identifies the speaker.
<u xml:id="speech1" cont="speech2" who="talmannen_u42fd2">Hej!</u>
<note>30 sec paus</note>
<u xml:id="speech2" prev="speech1" who="talmannen_u42fd2">Hejdå!</u>
With the cont and prev attributes, which point to the xml:id of the following and preceding <u>, a speech can stay connected even when a note sits in between.
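One way to reassemble such split speeches is to start from every <u> that has no prev attribute and follow its cont links. The Python sketch below does this with xml.etree.ElementTree. The helper name joined_speeches and the wrapping <div> are hypothetical, and the sketch assumes that cont holds the bare xml:id of the next part, as in the example above.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the fixed XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SNIPPET = """
<div>
  <u xml:id="speech1" cont="speech2" who="talmannen_u42fd2">Hej!</u>
  <note>30 sec paus</note>
  <u xml:id="speech2" prev="speech1" who="talmannen_u42fd2">Hejdå!</u>
</div>
"""


def joined_speeches(div):
    """Follow cont links and return one list of text parts per complete speech."""
    by_id = {u.get(XML_ID): u for u in div.iter("u") if u.get(XML_ID)}
    speeches = []
    for u in div.iter("u"):
        if u.get("prev") is not None:
            continue  # a continuation; it is reached through its predecessor
        parts, current, seen = [], u, set()
        while current is not None and id(current) not in seen:
            seen.add(id(current))  # guard against accidental cont cycles
            parts.append("".join(current.itertext()).strip())
            next_id = current.get("cont")
            current = by_id.get(next_id) if next_id else None
        speeches.append(parts)
    return speeches


print(joined_speeches(ET.fromstring(SNIPPET)))
```

Starting only from elements without prev keeps each speech from being emitted twice, once per part.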
<note type="date" when="2020-02-05">Onsdagen den 5. februari 2020</note>
The when attribute gives the date in YYYY-MM-DD format. The text content keeps the date as it is written in the original protocol.
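Because when uses the ISO YYYY-MM-DD form, it can be parsed directly with Python's standard library. A minimal sketch follows; the helper name note_date is hypothetical.

```python
import datetime
import xml.etree.ElementTree as ET

note = ET.fromstring(
    '<note type="date" when="2020-02-05">Onsdagen den 5. februari 2020</note>'
)


def note_date(note):
    """Return the date of a date note, or None if the element is not a date note."""
    if note.get("type") != "date" or note.get("when") is None:
        return None
    return datetime.date.fromisoformat(note.get("when"))


parsed = note_date(note)
print(parsed.isoformat(), parsed.weekday())  # weekday(): Monday is 0
```

Reading the machine-readable attribute rather than the text content avoids having to parse Swedish date strings.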
<note type="speaker">Herr talmannen anförde:</note>
<pb n="1" facs="https://website.com/protokoll-page-1.jpg"/>
Page beginnings (<pb>) have no text content. They have two attributes: n, the page number, and facs, a link to the scanned image of the page.
In our case, the links point to Betalab (e.g. https://betalab.kb.se/prot-1949--ak--12/prot_1949__ak__12-002.jp2/_view). The files are not copyrighted, but you need credentials to access them.
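To list the page numbers and scan links in a protocol, something like the following Python sketch could be used. The helper name page_images and the wrapping <div> are hypothetical. As a precaution, the helper accepts the page number under either n or xml:n. It only collects the links and does not try to fetch them, since that requires credentials.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:-prefixed attributes under the fixed XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

SNIPPET = """
<div>
  <pb n="1" facs="https://website.com/protokoll-page-1.jpg"/>
  <note>Session started</note>
</div>
"""


def page_images(div):
    """Return (page_number, facsimile_url) for every <pb> in document order."""
    return [
        (pb.get("n") or pb.get(XML_NS + "n"), pb.get("facs"))
        for pb in div.iter("pb")
    ]


print(page_images(ET.fromstring(SNIPPET)))
```

The page number comes back as a string, exactly as written in the attribute.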