From f22d713c328f82318b121368ccb61ea6b62df6c2 Mon Sep 17 00:00:00 2001
From: Wolfgang Meier
Date: Fri, 2 Feb 2024 18:11:26 +0100
Subject: [PATCH] Update MS Word sample

---
 data/test/test.docx.xml | 193 ++++++++++++++++++++--------------------
 1 file changed, 97 insertions(+), 96 deletions(-)

diff --git a/data/test/test.docx.xml b/data/test/test.docx.xml
index d6ca4e86..9fd66188 100644
--- a/data/test/test.docx.xml
+++ b/data/test/test.docx.xml
@@ -13,22 +13,15 @@

Information about the source

- - -

Sample automatically converted from Word DOCX format into TEI, demonstrating all features covered by the transformation.

-
- - - -
Word to TEI - This is a sample document to test conversion of word docx to TEI using the TEI processing model. It was generated by uploading the word document located in data/doc/test.docx in your local TEI Publisher installation. You can also download it from the - TEI Publisher website + This is a sample document to test conversion of word docx to TEI using the TEI processing model. It was generated by uploading the word document located in data/doc/test.docx in your local TEI Publisher installation. You can also download it from the + TEI Publisher git repository , edit it and upload it via the upload panel on the start page of your local TEI Publisher. +

The following sections document and provide examples for all features and conventions implemented by the default ODD transformation.

Style Conversion

Rather than trying to convert everything, the default ODD transformation attempts to preserve the semantics of the text. Most style properties are thus ignored. This is by intention: trying to preserve as much as possible would likely just add noise and result in low-quality TEI.

@@ -37,12 +30,13 @@
Inline Styles

Text styled as italic, underline or bold will be transformed into a tei:hi with a corresponding @rend attribute. Other types of formatting will be ignored.

-

Inline styles whose name starts with „tei:“ are transformed into TEI elements with the same name. So if a character sequence uses a style called „tei:persName”, it will be wrapped into a TEI persName element in the output, e.g. Johann Wolfgang Goethe. A place name can be marked with a style „tei:placeName” and should be transformed accordingly: Frankfurt, Berlin, München. And damaged text could be encoded by applying a style „tei:supplied“.

-

There’s also a default convention for encoding additional attributes: text in angle brackets will be interpreted as a list of attribute=value pairs. Multiple items should be separated with a “;”. For example, to set a @rend and provide a @ref for a placeName, you can write Frankfurt +

+ Inline styles whose name starts with „tei:“ are transformed into TEI elements with the same name. So if a character sequence uses a style called „tei:persName”, it will be wrapped into a TEI persName element in the output, e.g. Johann Wolfgang Goethe. A place name can be marked with a style tei:placeName and should be transformed accordingly: Frankfurt, Berlin, München. And damaged text could be encoded by applying a style tei:supplied.

+

There’s also a default convention for encoding additional attributes: text in angle brackets will be interpreted as a list of attribute=value pairs. Multiple items should be separated with a “;”. For example, to set a @rend and provide a @ref for a placeName, you can write Frankfurt<rend=smallcaps;ref=Frankfurt am Main>, which would be rendered in the output as Frankfurt

Text content in angle brackets will be automatically stripped from an inline element by the post-processing step, so you do not need to handle this within the ODD.

.
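The angle-bracket convention described above can be illustrated with a minimal Python sketch. This is not how the ODD or its post-processing step is implemented; the function name `split_inline_attributes` and the regular expression are assumptions made purely for illustration. It separates the visible text from the trailing `<attribute=value;...>` list and returns the pairs as a dictionary.

```python
import re

# Hypothetical helper, not part of TEI Publisher: matches "text<attr=value;attr=value>".
_ATTR_SUFFIX = re.compile(r"^(?P<text>[^<]*)<(?P<attrs>[^>]*)>$")


def split_inline_attributes(content):
    """Return (text, attributes) for content following the angle-bracket convention."""
    match = _ATTR_SUFFIX.match(content)
    if match is None:
        # No trailing angle-bracket list: nothing to strip, no attributes.
        return content, {}
    attributes = {}
    for item in match.group("attrs").split(";"):
        # Split each item at the first "=" only, so values may contain spaces.
        name, separator, value = item.partition("=")
        if separator and name.strip():
            attributes[name.strip()] = value.strip()
    return match.group("text"), attributes


text, attrs = split_inline_attributes("Frankfurt<rend=smallcaps;ref=Frankfurt am Main>")
print(text, attrs)
```

Splitting each item at the first "=" keeps values such as "Frankfurt am Main" intact, which matters because the convention only reserves ";" as a separator.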

-

This notation requires quite some typing. You may always extend the ODD with additional rules for easier conventions though. For example, if persName does always need a @ref attribute in your edition, you could have a simplified rule which parses: Friedrich Dürrenmatt.

+

This notation requires quite some typing. You may always extend the ODD with additional rules for easier conventions though. For example, if persName does always need a @ref attribute in your edition, you could have a simplified rule which parses Friedrich Dürrenmatt<118527908>. In the output this should appear as: Friedrich Dürrenmatt.

Because Word has a tendency to split character ranges at random points, some pre-processing is applied before the docx is passed to the ODD for processing: subsequent ranges referencing the same character style are combined by nesting them into an additional w:r range element, which references the common character style and the style is then removed from the individual ranges.

You can thus safely assume within the ODD that the content of a range includes all sibling text using the same character style.
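The effect of this pre-processing can be sketched on a simplified model. The real step operates on the w:r elements of the docx; here a run is assumed to be a plain `(style, text)` tuple and the function name `group_runs_by_style` is hypothetical. Consecutive runs sharing a character style end up under one wrapper that carries the style, while the individual texts no longer do.

```python
from itertools import groupby


def group_runs_by_style(runs):
    """Group consecutive (style, text) runs that share the same character style.

    A style of None stands for a run without a character style (an assumption
    of this sketch); such runs are left ungrouped.
    """
    grouped = []
    for style, members in groupby(runs, key=lambda run: run[0]):
        texts = [text for _, text in members]
        if style is None:
            grouped.extend((None, [text]) for text in texts)
        else:
            # One wrapper per style; the member texts carry no style of their own.
            grouped.append((style, texts))
    return grouped


runs = [
    ("tei:persName", "Johann "),
    ("tei:persName", "Wolfgang "),
    ("tei:persName", "Goethe"),
    (None, " was born in "),
    ("tei:placeName", "Frank"),
    ("tei:placeName", "furt"),
]
print(group_runs_by_style(runs))
```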

By design, Word does not support nested character styles. It is thus not possible to e.g. mark up a persName inside a supplied. The standard character styles for bold, italics and underline are preserved though – like in the following paragraph which is marked up as supplied:

@@ -54,7 +48,7 @@ Paragraph Styles
Headings -

Word does not have a concept for text division, so we have to reconstruct them:

+

Word does not have a concept for text divisions, so we have to reconstruct them:

Paragraph styles starting with „heading“, „title“ or „subtitle“ generate a tei:head. The outline level assigned to the heading is recorded as well.

@@ -83,20 +77,27 @@

Another footnote written by the original author of the text

, and text-critical notes.

By default, footnotes with a custom mark are encoded with note n=”custom mark” type=”original”. Instead of being numbered automatically in the output, they appear with the custom mark.

-

Text-critical notes usually enclose a span of text: to encode them in Word, we can abuse comments +

You may also use endnotes +

This is an endnote containing a link to an + external website + .

+ , though they will render as normal footnotes in web output by default – unless you define a different behaviour in the output ODD.

+

Text-critical notes usually enclose a span of text: to encode them in Word, we can abuse comments

Person whoever referenced here.

A text-critical footnote

- . Word comments insert a start and end marker, which can be easily converted to TEI + . Word comments insert a start and end marker, which can be easily converted to TEI

By default, the ODD inserts an anchor xml:id=”a1” type=”note” at the start of the span and a note target=”a1” at the end.

and output accordingly later.
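As a rough illustration of the default encoding described in the note above, the following Python sketch builds the output for a single commented span. The function `encode_comment_span` and its parameters are hypothetical; the attribute values simply follow the wording of the note and are not taken from the ODD itself.

```python
from xml.sax.saxutils import escape, quoteattr


def encode_comment_span(before, span, after, note, anchor_id):
    """Serialize one commented span: an anchor at its start, a note at its end."""
    return (
        escape(before)
        + f'<anchor xml:id={quoteattr(anchor_id)} type="note"/>'
        + escape(span)
        + f"<note target={quoteattr(anchor_id)}>{escape(note)}</note>"
        + escape(after)
    )


print(encode_comment_span("we can abuse ", "comments", ".", "A text-critical footnote", "a1"))
```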

Lists -

Lists are tricky, because Word essentially just stores list items in a flat list. Reconstructing nesting thus requires looking at the list level associated with every item. Simple lists are easy:

+

Lists are tricky, because Word essentially just stores list items in a flat sequence. Reconstructing nesting thus requires looking at the list level associated with every item. Simple lists are easy:

A list item -

And here we have a footnote with a link to another place in the document.

+

And here we have a footnote with a + link + to another place in the document.

@@ -142,86 +143,86 @@ I graunt I never saw a goddesse goe, My Mistres when shee walkes treads on the ground.
-
- Tables -

We can do simple tables very well. Spanning multiple columns is also easy, but things become more difficult for row spans, which are not implemented yet.

- - - -

Item

-
- -

Hours

-
- -

Hourly rate

-
- -

Price

-
-
- - -

Customize ODD

-
- -

3

-
- -

120

-
- -

360

-
-
- - -

Generate App

-
- -

4

-
- -

120

-
- -

480

-
-
- - -

Test and Deploy

-
- -

2

-
- -

120

-
- -

240

-
-
- - -

Total

-
- -

1080

-
-
-
-
-
- Embedded Images -

Below image will be embedded:

-

- -

-

Inside eXist, images are copied into a subcollection starting with the name of the docx file being processed and suffixed with .media.

-
+
+ Tables +

We can do simple tables very well. Spanning multiple columns is also easy, but things become more difficult for row spans, which are not implemented yet.

+ + + +

Item

+
+ +

Hours

+
+ +

Hourly rate

+
+ +

Price

+
+
+ + +

Customize ODD

+
+ +

3

+
+ +

120

+
+ +

360

+
+
+ + +

Generate App

+
+ +

4

+
+ +

120

+
+ +

480

+
+
+ + +

Test and Deploy

+
+ +

2

+
+ +

120

+
+ +

240

+
+
+ + +

Total

+
+ +

1080

+
+
+
+
+
+ Embedded Images +

Below image will be embedded:

+

+ +

+

Inside eXist, images are copied into a subcollection starting with the name of the docx file being processed and suffixed with .media.

+