These are conventions for organizing the repository, proposed for discussion and evaluation. This preamble will be revised once we have agreed on a protocol.
The main repo has several subdirectories:
- input for all input files and all canonical results
- one directory for each solution to the task (at the time of writing, these are accumulators, inside-out, python_string, python_xml, right-sibling, and regex, but more may be added as we think of new solutions)
- doc for our Balisage paper and the Balisage author-kit artifacts
- lib for any shared code
- testing for data and code related to testing for correctness or performance
The subdirectory names and their contents are discussed below.
The main repo contains a single input directory with separate subdirectories, one for each type of input, with the following directory names:
- input/basic: Uses Trojan attributes in the th: namespace. No overlap, no non-Trojan attributes, no non-Trojan empty elements. Basic test of whether the method works.
- input/extended: As above, but with non-Trojan attributes (on start-markers only) and with non-Trojan empty elements. Makes sure that the method doesn’t over-generalize.
- input/overlap: Uses Trojan attributes in the th: namespace. Simple example of overlapping hierarchies to see what the method produces. Four possible outcomes (I think): a) throws an error, b) raises as much as it can without creating overlap and leaves other markers unraised, c) raises everything, moving tags (as it were) to force proper nesting, or d) creates overlapping markup, which is not well formed.
- input/frankenstein: Uses @ana and @loc. Raising all flattened elements that use @ana and @loc is guaranteed to produce well-formed output. Other flattened elements, not to be raised on the pass we are discussing, may use other markup (e.g., the <seg> elements). I would suggest that for Frankenstein we put the flattened version in the TEI namespace if that’s what it’s in in Real Life, which means that the output must respect that namespace.
- input/brown: In the TEI namespace, with Trojan attributes in the th: namespace. Currently holds Corpus_flattened.xml (full corpus, 75M) and r02_flattened.xm (56k).
TODO: Should the brown files be in separate subdirectories of input? Or separate subdirectories of input/brown? Otherwise they don’t fully follow the naming conventions (below) because we have to distinguish the two samples.
Each of the input subdirectories described above must include two versions of each logically distinct input file, named
- filename.xml (or, if desired for clarity, flattened. + filename + .xml), which is in flattened form and is to be raised
- target.filename.xml, which is what it should be raised to
where filename uniquely identifies each logically distinct input file.
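The pairing of flattened and target files can be sketched in Python as follows. This is only an illustration of the naming rule above, not code from the repo: the function name target_name and the file name sample are hypothetical, and treating a leading flattened. as optional is one assumed reading of the convention.

```python
from pathlib import PurePosixPath

FLATTENED_PREFIX = "flattened."
TARGET_PREFIX = "target."


def target_name(flattened_path: str) -> str:
    """Return the path of the target file that pairs with a flattened input.

    Assumes the flattened file is named either filename.xml or
    flattened.filename.xml, and that its target is target.filename.xml
    in the same directory.
    """
    path = PurePosixPath(flattened_path)
    name = path.name
    # Drop the optional "flattened." prefix before adding "target.".
    if name.startswith(FLATTENED_PREFIX):
        name = name[len(FLATTENED_PREFIX):]
    return str(path.with_name(TARGET_PREFIX + name))


print(target_name("input/basic/flattened.sample.xml"))
print(target_name("input/extended/sample.xml"))
```

Both calls map to a target.sample.xml file in the same directory as the input.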
Each input directory may also include (in a subdirectory called aux) files used to create flattened.filename.xml, such as original.filename.xml and a flattening script. This genetic subdirectory is optional because information about how the files were flattened is not crucial for the purpose of raising them. The contents and filenames in the aux subdirectory are not standardized.
It is not practical to use diffxml on input/brown/Corpus_target.xml, so a wrapped version was created with xmllint --format Corpus_target.xml > Corpus_target.wrapped.xml. Output of transformations can be validated by performing the same xmllint pretty-print operation on the output and then using regular diff, which does not make the same demand on resources as diffxml.
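A minimal Python sketch of this check, assuming xmllint is on the PATH. The function names pretty_print and first_mismatch are hypothetical, and the line-by-line comparison only stands in for regular diff to the extent of finding the first differing line; it streams both files so that it stays cheap on large inputs.

```python
import shutil
import subprocess
from itertools import zip_longest
from pathlib import Path
from typing import Optional, Tuple


def pretty_print(source: Path, wrapped: Path) -> None:
    """Write the result of `xmllint --format source` to wrapped."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    with wrapped.open("wb") as out:
        subprocess.run(["xmllint", "--format", str(source)], stdout=out, check=True)


def first_mismatch(expected: Path, actual: Path) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (line number, expected line, actual line) for the first
    differing line, or None if the two files are identical.

    A missing line (one file is shorter) is reported as None.
    """
    with expected.open(encoding="utf-8") as left_file, actual.open(encoding="utf-8") as right_file:
        pairs = zip_longest(left_file, right_file)
        for number, (left, right) in enumerate(pairs, start=1):
            if left != right:
                return number, left, right
    return None
```

Typical use would be to call pretty_print on a method's output and then first_mismatch against the wrapped target file.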
The Frankenstein data uses two forms of start- and end-marker:
- empty TEI elements with @ana attributes whose values are start or end, coindexed by the @loc attribute
- tei:seg elements with @xml:id attributes whose values end in _start or _end, coindexed by the part of the @xml:id attribute before that suffix
All other data uses co-indexed @th:sID and @th:eID attributes in namespace http://www.blackmesatech.com/2017/nss/trojan-horse to signal start- and end-markers. (At the time of writing we make no use of the @th:soleID or @th:doc attributes.)
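The three marker styles described above could be recognized along the following lines. This is a sketch using Python's xml.etree.ElementTree, not the code of any of our methods: the function name marker_info and the sample document (including the milestone element) are invented for illustration, and the sketch does not check that a marker element is actually empty.

```python
import xml.etree.ElementTree as ET

TH_NS = "http://www.blackmesatech.com/2017/nss/trojan-horse"
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def marker_info(element):
    """Return (role, key) for a start- or end-marker, or None otherwise.

    role is "start" or "end"; key is the value that coindexes the pair.
    """
    # Trojan attributes: th:sID marks a start, th:eID marks an end.
    start_id = element.get("{%s}sID" % TH_NS)
    if start_id is not None:
        return "start", start_id
    end_id = element.get("{%s}eID" % TH_NS)
    if end_id is not None:
        return "end", end_id
    # Frankenstein, first form: @ana is start or end, @loc is the key.
    if element.get("ana") in ("start", "end") and element.get("loc") is not None:
        return element.get("ana"), element.get("loc")
    # Frankenstein, second form: tei:seg whose @xml:id ends in _start or _end.
    xml_id = element.get(XML_ID)
    if element.tag == "{%s}seg" % TEI_NS and xml_id is not None:
        for suffix, role in (("_start", "start"), ("_end", "end")):
            if xml_id.endswith(suffix):
                return role, xml_id[: -len(suffix)]
    return None


# Invented sample input, mixing all three styles for demonstration only.
doc = ET.fromstring(
    '<text xmlns="http://www.tei-c.org/ns/1.0" xmlns:th="%s">'
    '<p th:sID="p1"/>words<p th:eID="p1"/>'
    '<milestone ana="start" loc="x1"/>'
    '<seg xml:id="s1_end"/>'
    '</text>' % TH_NS
)
for child in doc:
    print(marker_info(child))
```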
When raising elements marked with @th:*: remove all @th:* attributes and the relevant namespace declaration.
When raising (in Frankenstein) elements marked with @ana: remove @ana, refactor @loc as @xml:id.
When raising (in Frankenstein) elements marked with @xml:id: include @xml:id on the raised element, with the prefix (before _start or _end) of the input value.
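The three attribute rules can be expressed as one small Python function over a start-marker's attributes. This is a sketch under assumptions: the function name raised_attributes is hypothetical, attribute names are given in the {namespace}local form that xml.etree.ElementTree uses, and the style labels th, ana, and xmlid are borrowed from the th-style parameter values. Removing the namespace declaration is left to the serializer and is not shown.

```python
TH_PREFIX = "{http://www.blackmesatech.com/2017/nss/trojan-horse}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def raised_attributes(marker_attributes, style):
    """Return the attributes for a raised element, given the attributes
    of its start-marker and the marker style ("th", "ana", or "xmlid")."""
    attributes = dict(marker_attributes)  # never modify the caller's dict
    if style == "th":
        # Remove every attribute in the th: namespace.
        return {k: v for k, v in attributes.items() if not k.startswith(TH_PREFIX)}
    if style == "ana":
        # Remove @ana and carry the @loc value over as @xml:id.
        attributes.pop("ana", None)
        if "loc" in attributes:
            attributes[XML_ID] = attributes.pop("loc")
        return attributes
    if style == "xmlid":
        # Keep @xml:id, but without its _start or _end suffix.
        value = attributes.get(XML_ID, "")
        for suffix in ("_start", "_end"):
            if value.endswith(suffix):
                attributes[XML_ID] = value[: -len(suffix)]
        return attributes
    raise ValueError("unknown marker style: %r" % style)


print(raised_attributes({TH_PREFIX + "sID": "p1", "n": "3"}, "th"))
print(raised_attributes({"ana": "start", "loc": "x1"}, "ana"))
print(raised_attributes({XML_ID: "s1_start"}, "xmlid"))
```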
There is one code subdirectory in the main repo for each method, called accumulators, inside-out, python_string, python_xml, right-sibling, and regex (etc.).
Each code directory must contain the transformation, as a single file if possible, named raise with an appropriate filename extension (e.g., raise.xsl, raise.py, raise.sh). The same file should work for the first three input types: basic, extended, and overlap. If Frankenstein input requires a different transformation file, it should be called raise_frankenstein.xsl, etc. Notes:
- If we write multiple versions of any given method, their names should have the form raise + infix + . + extension. For example, versions written for XSLT 1.0 and 3.0 might be raise_1.0.xsl and raise_3.0.xsl; versions which differ in using a function or a named template might be raise_f.xsl and raise_t.xsl. The simplest version (the one for someone to look at first to understand the method) should be raise.xsl.
- To avoid operating-system-specific command-line oddities with regex, regular expressions should be either in a Python file (raise.py) or a sed file (raise.sed). The latter may be invoked by a shell script.
Code directories must contain a subdirectory called output that contains the results of the raising operations on the input files. The output directory should contain sub-subdirectories matching those of input. The output for an input file whose name is of the form flattened. + filename + .xml should be raised. + filename + .xml. For example, the result of raising input/basic/flattened.xml would be in .../output/basic/raised.xml. If an attempt at raising produces no output (e.g. because the method dies on that particular input), an empty output file may be left to signal the failure. If possible, a file named filename + .stderr should be provided to record the error(s) raised.
When there are multiple versions of the implementation (e.g. raise_1.0 and raise_3.0), and when we test with multiple processors (e.g. xsltproc, Saxon 6.5.3, Saxon 9.8 HE), the output filename should take the form
`raised_` + _filename_ + `_` + _version_ + `_` + _processor_ + `.xml`.
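As an illustration of this pattern, here is a small Python helper that builds the full output path for one run. The function name versioned_output_path, the file name sample, and the processor label saxon9.8he are all hypothetical; how processors should be spelled in filenames has not been settled.

```python
from pathlib import PurePosixPath


def versioned_output_path(method, input_type, filename, version, processor):
    """Build method/output/input_type/raised_filename_version_processor.xml."""
    name = "raised_{}_{}_{}.xml".format(filename, version, processor)
    return PurePosixPath(method) / "output" / input_type / name


print(versioned_output_path("right-sibling", "basic", "sample", "3.0", "saxon9.8he"))
```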
Code directories must contain a very brief markdown file called README.md that explains how to run the transformation. It should not, at least for now, explain how the code works (that’s in our paper); its purpose is just to tell users what they have to type to produce output. It should also include dependency information, e.g., Python 3 but not Python 2, XSLT 3.0 but not XSLT 1.0.
Code directories may optionally contain a shell script or Windows batch file to run the transformations that use that code. These should be called raise.sh or raise.bat. If the shell script depends on a particular shell or on system-specific configuration information (e.g., how to run saxon on the command line may differ from user to user), that should be documented in the README.md file. Include this only if you find it personally useful.
Additional files go in an optional subdirectory called aux. For example, the Python files were developed in a Jupyter notebook interface, and the notebook files will be placed in aux.
Each XSLT stylesheet should accept a parameter named debug, which can be used to guard debugging code (typically surrounded by <xsl:if test="$debug">...</xsl:if>).
To minimize the effect of the debugging code on the runtime of the final product, in XSLT 3.0 the parameter should be declared with static="yes". (If this is done, we can add use-when="$debug" to the xsl:message or other debugging code, and dispense with the surrounding xsl:if.)
To eliminate the need for separate stylesheets for the different styles of markers, stylesheets can (should?) also accept a parameter named th-style with values th, ana, or xmlid, which can be used to signal which kind of marker the stylesheet should process.
The doc directory will contain our Balisage paper, a stylesheet named local.xsl, and a subdirectory named balisage with the Balisage authors' kit (stylesheets, sample document and its images, etc.). The main purpose of local.xsl is to redefine various parameters so that the CSS and XSLT stylesheets do not need to be in the same directory as the document.
As we attempt to improve our code, we may find ourselves reusing bits (for example, functions to determine whether a given node is or is not a start- or end-marker). To simplify maintenance, modules containing such reused code should be placed in the lib directory.
At the time of writing, it's not clear exactly what will need to go here; the following proposal will need revision in the light of experience.
The testing directory has several subdirectories:
- bin for testing scripts, processing scripts, etc.
- raw for raw test output (if, for example, we acquire timing data by copying stderr and stdout into files to be parsed for the times, the raw logs go here).
- reports for final output of a test script
It also contains a README.md file to record any special information needed.
For the moment, we assume all test reports will be XML, either in an ad hoc vocabulary or in XHTML.
A test report should begin with a summary, but should normally either contain or link to full data on the test results. If full data and summary are in separate files, the filenames should have the form 'full.' or 'summary.' + yyyymmddThhmmss.ss + '.xml' or '.xhtml'. If they are in the same document, the filename prefix should be 'report.'.
If the test report is selective (only certain input data, only certain methods), the file name should use infixes to signal what it covers, and the meaning of the infixes used should be recorded in README.md.
We give test reports names that include a creation time in order to allow ourselves to keep more than one set of test results around.
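A timestamped report name of the kind described above might be generated as follows. This is a sketch: the function name report_name is hypothetical, and reading the trailing .ss of yyyymmddThhmmss.ss as hundredths of a second is an assumption. Infixes for selective reports are not handled, since their position in the name has not been specified.

```python
from datetime import datetime


def report_name(kind, created, extension="xml"):
    """Return a name such as summary.yyyymmddThhmmss.ss.xml.

    kind is "full", "summary", or "report"; extension is "xml" or "xhtml".
    """
    if kind not in ("full", "summary", "report"):
        raise ValueError("unknown report kind: %r" % kind)
    if extension not in ("xml", "xhtml"):
        raise ValueError("unknown extension: %r" % extension)
    # Assumption: the fractional part is hundredths of a second.
    hundredths = created.microsecond // 10000
    stamp = created.strftime("%Y%m%dT%H%M%S") + ".%02d" % hundredths
    return "%s.%s.%s" % (kind, stamp, extension)


print(report_name("summary", datetime(2018, 6, 1, 14, 30, 5, 250000)))
print(report_name("report", datetime.now(), "xhtml"))
```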