Update MS Word sample
wolfgangmm committed Feb 2, 2024
1 parent 370dc62 commit f22d713
Showing 1 changed file with 97 additions and 96 deletions.
193 changes: 97 additions & 96 deletions data/test/test.docx.xml
@@ -13,22 +13,15 @@
<p>Information about the source</p>
</sourceDesc>
</fileDesc>
<profileDesc>
<abstract>
<p>Sample automatically converted from Word DOCX format into TEI, demonstrating all features covered by the transformation.</p>
</abstract>
<textClass>
<catRef scheme="#feature" target="#docx"/>
</textClass>
</profileDesc>
</teiHeader>
<text>
<body>
<div>
<head>Word to TEI</head>
<quote>This is a sample document to test conversion of word docx to TEI using the TEI processing model. It was generated by uploading the word document located in <gi>data/doc/test.docx</gi> in your local TEI Publisher installation. You can also download it from the <ref target="https://teipublisher.com/exist/apps/tei-publisher/data/doc/test.docx">
<hi rend="u">TEI Publisher website</hi>
<quote>This is a sample document to test conversion of word docx to TEI using the TEI processing model. It was generated by uploading the word document located in <gi>data/doc/test.docx</gi> in your local TEI Publisher installation. You can also download it from the <ref target="https://github.com/eeditiones/tei-publisher-app/blob/master/data/doc/test.docx">
<hi rend="u">TEI Publisher git repository</hi>
</ref>, edit it and upload it via the upload panel on the start page of your local TEI Publisher.</quote>
<p>The following sections document and provide examples for all features and conventions implemented by the default ODD transformation.</p>
<div>
<head>Style Conversion</head>
<p>Rather than trying to convert everything, the default ODD transformation attempts to preserve the semantics of the text. Most style properties are thus ignored. This is by intention: trying to preserve as much as possible would likely just add noise and result in low-quality TEI.</p>
@@ -37,12 +30,13 @@
<div>
<head>Inline Styles</head>
<p>Text styled as <hi rend="i">italic</hi>, <hi rend="u">underline</hi> or <hi rend="b caps">bold</hi> will be transformed into a <tag>tei:hi</tag> with a corresponding <gi>@rend</gi> attribute. Other types of formatting will be ignored.</p>
<p>Inline styles whose name starts with „tei:“ are transformed into TEI elements with the same name. So if a character sequence uses a style called „tei:persName”, it will be wrapped into a TEI <tag>persName</tag> element in the output, e.g. <persName>Johann Wolfgang Goethe</persName>. A <placeName>place name</placeName> can be marked with a style „tei:placeName” and should be transformed accordingly: <placeName>Frankfurt</placeName>, <placeName>Berlin</placeName>, <placeName>München</placeName>. And <supplied>damaged text could be encoded</supplied> by applying a style „tei:supplied“.</p>
<p>There’s also a default convention for encoding additional attributes: text in angle brackets will be interpreted as a list of attribute=value pairs. Multiple items should be separated with a “;”. For example, to set a <gi>@rend</gi> and provide a <gi>@ref</gi> for a <tag>placeName</tag>, you can write <placeName rend="smallcaps" ref="Frankfurt am Main">Frankfurt</placeName>
<p>
<anchor xml:id="target1"/>Inline styles whose name starts with „tei:“ are transformed into TEI elements with the same name. So if a character sequence uses a style called „tei:persName”, it will be wrapped into a TEI <tag>persName</tag> element in the output, e.g. <persName>Johann Wolfgang Goethe</persName>. A <placeName>place name</placeName> can be marked with a style <gi>tei:placeName</gi> and should be transformed accordingly: <placeName>Frankfurt</placeName>, <placeName>Berlin</placeName>, <placeName>München</placeName>. And <supplied>damaged text could be encoded</supplied> by applying a style <gi>tei:supplied</gi>.</p>
<p>There’s also a default convention for encoding additional attributes: text in angle brackets will be interpreted as a list of attribute=value pairs. Multiple items should be separated with a “;”. For example, to set a <gi>@rend</gi> and provide a <gi>@ref</gi> for a <tag>placeName</tag>, you can write <code>Frankfurt&lt;rend=smallcaps;ref=Frankfurt am Main&gt;</code>, which would be rendered in the output as <placeName rend="smallcaps" ref="Frankfurt am Main">Frankfurt</placeName>
<note place="footnote">
<p> Text content in angle brackets will be automatically stripped from an inline element by the post-processing step, so you do not need to handle this within the ODD.</p>
</note>.</p>
<p>This notation requires quite some typing. You may always extend the ODD with additional rules for easier conventions though. For example, if <tag>persName</tag> does always need a <gi>@ref</gi> attribute in your edition, you could have a simplified rule which parses: <persName ref="http://d-nb.info/gnd/118527908">Friedrich Dürrenmatt</persName>.</p>
<p>This notation requires quite some typing. You may always extend the ODD with additional rules for easier conventions though. For example, if <tag>persName</tag> does always need a <gi>@ref</gi> attribute in your edition, you could have a simplified rule which parses <code>Friedrich Dürrenmatt&lt;118527908&gt;</code>. In the output this should appear as: <persName ref="http://d-nb.info/gnd/118527908">Friedrich Dürrenmatt</persName>.</p>
<p>Because Word has a tendency to split character ranges at random points, some pre-processing is applied before the docx is passed to the ODD for processing: subsequent ranges referencing the same character style are combined by nesting them into an additional <tag>w:r</tag> range element, which references the common character style and the style is then removed from the individual ranges.</p>
<p>You can thus safely assume within the ODD that the content of a range includes all sibling text using the same character style.</p>
<p>By design, Word does not support nested character styles. It is thus not possible to e.g. mark up a <tag>persName</tag> inside a <tag>supplied</tag>. The standard character styles for <hi rend="b">bold</hi>, <hi rend="i">italics</hi> and <hi rend="u">underline</hi> are preserved though – like in the following paragraph which is marked up as supplied:</p>
@@ -54,7 +48,7 @@
<head>Paragraph Styles</head>
<div>
<head>Headings</head>
<p>Word does not have a concept for text division, so we have to reconstruct them:</p>
<p>Word does not have a concept for text divisions, so we have to reconstruct them:</p>
<list type="ordered">
<item>
<p>Paragraph styles starting with „heading“, „title“ or „subtitle“ generate a <tag>tei:head</tag>. The outline level assigned to the heading is recorded as well.</p>
@@ -83,20 +77,27 @@
<p> Another footnote written by the original author of the text</p>
</note>, and text-critical notes.</p>
<p>By default, footnotes with a custom mark are encoded with <tag>note n=”custom mark” type=”original”</tag>. Instead of being numbered automatically in the output, they appear with the custom mark.</p>
<p>Text-critical notes usually enclose a span of text: to encode them in Word, <anchor xml:id="ac1" type="note"/>we can <tag>abuse</tag> comments<note place="footnote" target="ac1">
<p>You may also use endnotes<note place="endnote">
<p> This is an endnote containing a link to an <ref target="https://teipublisher.com/">
<hi rend="u">external website</hi>
</ref>.</p>
</note>, though they will render as normal footnotes in web output by default – unless you define a different behaviour in the output ODD.</p>
<p>Text-critical notes usually enclose a span of text: to encode them in Word, <anchor xml:id="ac5" type="note"/>we can <hi rend="i">abuse</hi> comments<note place="footnote" target="ac5">
<p>Person <persName>whoever</persName> referenced here.</p>
<p>A <supplied>text-critical</supplied> footnote</p>
</note>. Word comments insert a start and end marker, which can be <anchor xml:id="ac2" type="note"/>easily converted to TEI<note place="footnote" target="ac2">
</note>. Word comments insert a start and end marker, which can be <anchor xml:id="ac6" type="note"/>easily converted to TEI<note place="footnote" target="ac6">
<p>By default, the ODD inserts an <tag>anchor xml:id=”a1” type=”note”</tag> at the start of the span and a <tag>note target=”a1”</tag> at the end.</p>
</note> and output accordingly later.</p>
</div>
<div>
<head>Lists</head>
<p>Lists are tricky, because Word essentially just stores list items in a flat list. Reconstructing nesting thus requires looking at the list level associated with every item. Simple lists are easy:</p>
<p>Lists are tricky, because Word essentially just stores list items in a flat sequence. Reconstructing nesting thus requires looking at the list level associated with every item. Simple lists are easy:</p>
<list>
<item>
<p>A list item<note place="footnote">
<p> And here we have a footnote with a <hi rend="u">link</hi> to another place in the document.</p>
<p> And here we have a footnote with a <ref target="#target1">
<hi rend="u">link</hi>
</ref> to another place in the document.</p>
</note>
</p>
</item>
@@ -142,86 +143,86 @@
<l>I graunt I never saw a goddesse goe,</l>
<l>My Mistres when shee walkes treads on the ground.</l>
</div>
<div>
<head>Tables</head>
<p>We can do simple tables very well. Spanning multiple colums is also easy, but things become more difficult for row spans, which are not implemented yet.</p>
<table>
<row>
<cell>
<p>Item</p>
</cell>
<cell>
<p>Hours</p>
</cell>
<cell>
<p>Hourly rate</p>
</cell>
<cell>
<p>Price</p>
</cell>
</row>
<row>
<cell>
<p>Customize ODD</p>
</cell>
<cell>
<p>3</p>
</cell>
<cell>
<p>120</p>
</cell>
<cell>
<p>360</p>
</cell>
</row>
<row>
<cell>
<p>Generate App</p>
</cell>
<cell>
<p>4</p>
</cell>
<cell>
<p>120</p>
</cell>
<cell>
<p>480</p>
</cell>
</row>
<row>
<cell>
<p>Test and Deploy</p>
</cell>
<cell>
<p>2</p>
</cell>
<cell>
<p>120</p>
</cell>
<cell>
<p>240</p>
</cell>
</row>
<row>
<cell cols="3">
<p>Total</p>
</cell>
<cell>
<p>1080</p>
</cell>
</row>
</table>
</div>
<div>
<head>Embedded Images</head>
<p>Below image will be embedded:</p>
<p>
<graphic url="test.docx.media/image1.png"/>
</p>
<p>Inside eXist, images are copied into a subcollection starting with the name of the docx file being processed and suffixed with <hi rend="i">.media</hi>.</p>
</div>
</div>
</div>
<div>
<head>Tables</head>
<p>We can do simple tables very well. Spanning multiple colums is also easy, but things become more difficult for row spans, which are not implemented yet.</p>
<table>
<row>
<cell>
<p>Item</p>
</cell>
<cell>
<p>Hours</p>
</cell>
<cell>
<p>Hourly rate</p>
</cell>
<cell>
<p>Price</p>
</cell>
</row>
<row>
<cell>
<p>Customize ODD</p>
</cell>
<cell>
<p>3</p>
</cell>
<cell>
<p>120</p>
</cell>
<cell>
<p>360</p>
</cell>
</row>
<row>
<cell>
<p>Generate App</p>
</cell>
<cell>
<p>4</p>
</cell>
<cell>
<p>120</p>
</cell>
<cell>
<p>480</p>
</cell>
</row>
<row>
<cell>
<p>Test and Deploy</p>
</cell>
<cell>
<p>2</p>
</cell>
<cell>
<p>120</p>
</cell>
<cell>
<p>240</p>
</cell>
</row>
<row>
<cell cols="3">
<p>Total</p>
</cell>
<cell>
<p>1080</p>
</cell>
</row>
</table>
</div>
<div>
<head>Embedded Images</head>
<p>Below image will be embedded:</p>
<p>
<graphic url="test.docx.media/image1.png"/>
</p>
<p>Inside eXist, images are copied into a subcollection starting with the name of the docx file being processed and suffixed with <hi rend="i">.media</hi>.</p>
</div>
</div>
</body>
</text>
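Note on the attribute convention described in the "Inline Styles" section of the sample: the text says that content in angle brackets is read as a list of attribute=value pairs separated by ";". The sketch below illustrates that parsing step in Python. It is not the TEI Publisher implementation (which lives in the ODD and its post-processing step); the function name `split_attributes` and the regular expression are assumptions made only for this illustration.

```python
import re

# Hypothetical helper, not part of TEI Publisher: matches a trailing
# "<...>" block at the end of a styled character run.
_TRAILER = re.compile(r"^(?P<text>.*?)<(?P<attrs>[^<>]*)>\s*$", re.DOTALL)


def split_attributes(run_text):
    """Split a styled run into (text, attributes).

    The part before the angle brackets is the element content; the part
    inside is treated as ';'-separated attribute=value pairs.
    """
    match = _TRAILER.match(run_text)
    if match is None:
        # No angle-bracket block: the run carries no extra attributes.
        return run_text, {}
    attrs = {}
    for item in match.group("attrs").split(";"):
        name, sep, value = item.partition("=")
        # Items without "=" are skipped; a custom rule would be needed
        # to give them a meaning.
        if sep and name.strip():
            attrs[name.strip()] = value.strip()
    return match.group("text"), attrs


text, attrs = split_attributes("Frankfurt<rend=smallcaps;ref=Frankfurt am Main>")
print(text, attrs)
```

With the example from the sample, the text part would become the content of the `placeName` element and the dictionary entries its `@rend` and `@ref` attributes.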