A DFDL Parser is an application or code library that takes as input:
- A DFDL annotated XML schema
- A data stream
It uses the DFDL schema description to interpret the data stream and realize the DFDL Information Set. If successful, the data stream is said to be well-formed for the data format described by the DFDL Schema. The information set could then be written out (for example, it could be realized as an XML or JSON text string) or it could be accessed by an application through an API (for example, a DOM-like tree could be created in memory for access by applications).
Symmetrically, there is a notion of a DFDL Unparser. The unparser works from an instance of the DFDL Information Set and a DFDL annotated schema, and writes out to a target data stream in the appropriate representation formats.
Often both parser and unparser would be implemented in the same body of software and so we do not always distinguish them. Collectively they are called a DFDL Processor. The parser and unparser MAY, of course, be different bodies of software. Conforming DFDL processors MAY implement only a parser, because the unparser is an optional feature of DFDL.
The DFDL logical parser is a recursive-descent parser[8] having guided, but potentially unbounded look ahead that is used to resolve points of uncertainty. A DFDL parser reads a specification (the DFDL schema) and it recursively walks down and up the schema as it processes the data. This is done in a manner consistent with the scoping of properties and variables described in Section 8 Property Scoping and DFDL Schema Checking.
The unbounded look ahead means that there are situations where the parser MUST speculatively attempt to parse data where the occurrence of a processing error causes the parser to suppress the error, back out and make another attempt.
Implementations of DFDL MAY provide control mechanisms for limiting the speculative search behavior of DFDL parsers. The nature of these mechanisms is beyond the scope of the DFDL specification which defines the behavior of conforming parsers only on data that does not cause an implementation to reach such a control-mechanism limit. Any such control mechanisms MUST be documented by the implementation and are thus implementation-defined.
The logical parser recursively descends the DFDL schema beginning with the distinguished global element declaration, or root element, which is the one global element declaration in the DFDL schema that defines the overall data format being parsed. The distinguished global element, or root, is specified to the processor in an implementation-defined manner; see Section 18. Depending on the kind of schema construct that is encountered, the DFDL annotations on it, and the pre-existing context, the parser performs specific parsing operations on the data stream. These parsing operations typically recognize and consume data from the stream and construct values in the logical model. For values of complex types and for arrays, these logical model values may incorporate values created by recursive parsing.
DFDL Implementations are free to use whatever techniques for parsing they wish so long as the semantics are equivalent to that of the speculative recursive-descent logical parser described in this specification. Implementations MUST distinguish the various kinds of errors (Schema Definition Error, processing error, etc.) no matter when they are detected. Some implementations MAY defer detecting certain Schema Definition Errors until data are being parsed; however, they MUST still distinguish Schema Definition Errors (which indicate that the schema itself is not meaningful) from parsing errors (which indicate that the input data does not satisfy the requirements of the schema) and from unparsing errors (which indicate that the Infoset does not satisfy the requirements of the schema).
[8] A "top-down" parser built from a set of mutually-recursive procedures or a non-recursive equivalent where each such procedure usually implements one of the productions of the grammar. Thus, the structure of the resulting program closely mirrors that of the grammar it recognizes. See [RDP].
If a DFDL schema contains no Schema Definition Errors, then there is the additional possibility of a processing error when processing data using a DFDL schema. A processing error occurs when parsing if the data does not conform to the format described by the schema, that is to say, the data is not well-formed relative to the schema. A processing error occurs when unparsing when the incoming Infoset does not conform to the logical structure described by the schema.
Processing errors interact with the schema’s points of uncertainty. A point of uncertainty occurs in the data stream when there is more than one schema component that can describe the data format at that point. Points of uncertainty arise from the schema’s use of xs:choice model groups, optional and array elements with varying numbers of occurrences, unordered sequences, and sequences with floating elements.
When a DFDL parser encounters a processing error, then that error is said to be suppressed by a point of uncertainty if there is another schema component that can be selected by the parsing algorithm which subsequently parses successfully. The details of the DFDL parsing algorithm are described in Section 9.4.
Processing errors MUST be able to be suppressed by a point of uncertainty. See section 9.4.3.
Note that unlike processing errors, Schema Definition Errors cannot be suppressed by points of uncertainty when parsing data. That is, a Schema Definition Error is fatal. It does not trigger search or backtracking to find alternative ways to parse the data.
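The difference between a suppressible processing error and a fatal Schema Definition Error can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any DFDL implementation is structured: the exception classes, `parse_with_alternatives`, and the three toy parsers are hypothetical names invented for this example.

```python
class ProcessingError(Exception):
    """Hypothetical: the data does not conform to the schema (suppressible)."""


class SchemaDefinitionError(Exception):
    """Hypothetical: the schema itself is not meaningful (always fatal)."""


def parse_with_alternatives(alternatives, data):
    """Try each alternative in order, suppressing processing errors only."""
    last_error = None
    for parse in alternatives:
        try:
            return parse(data)
        except ProcessingError as err:
            # Suppressed by the point of uncertainty: back out, try the next one.
            last_error = err
        # A SchemaDefinitionError is deliberately not caught here, so it
        # propagates without triggering any search or backtracking.
    raise ProcessingError("no alternative parsed successfully") from last_error


def parse_int(data):
    if not data.isdigit():
        raise ProcessingError(f"not an integer: {data!r}")
    return int(data)


def parse_word(data):
    if not data.isalpha():
        raise ProcessingError(f"not a word: {data!r}")
    return data


def broken_schema(data):
    raise SchemaDefinitionError("schema is not meaningful")
```

With these toy parsers, `parse_with_alternatives([parse_int, parse_word], "abc")` falls through to the second alternative, while placing `broken_schema` first aborts the whole attempt.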
Exceptions that occur in the evaluation of the DFDL expression language are processing errors.
Non-conformance with the XSD minOccurs or XSD maxOccurs constraints is either a processing error or only a validation error depending on the settings of certain DFDL properties (see section 16 below).
The recoverable error type is used with the dfdl:assert annotation when parsing to permit the checking of physical format constraints without terminating a parse. For example, some formats have redundancy by having known lengths as well as delimiters. A recoverable error can be issued by using an assert to check a physical length constraint when property dfdl:lengthKind is 'delimited'.
Recoverable errors are independent of validation, and when resolving points of uncertainty, recoverable errors are ignored.
Data in a format describable via a DFDL schema obeys the grammar given here. A given DFDL schema is read by the DFDL processor to provide specific meaning to the terminals and decisions in this grammar.
The bits of the data are divided into two broad categories:
- Content
- Framing
The content is the bits of data that are interpreted to compute a logical value.
Framing is the term used to describe the delimiters, length fields, and other parts of the data stream which are present and may be necessary to determine the length or position of the content of DFDL Infoset items.
Note that sometimes the framing is not strictly necessary for parsing, but adds useful redundancy to the data format, allowing corrupt data to be more robustly detected, and sometimes the framing adds human readability to the data format.
In the grammar tables below, the terminal symbols are shown in bold italic font.
Productions |
---|
Document = SimpleElement | ComplexElement |
SimpleElement = SimpleLiteralNilElementRep | SimpleEmptyElementRep | SimpleNormalRep |
SimpleEnclosedElement = SimpleElement | AbsentElementRep |
ComplexElement = ComplexLiteralNilElementRep | ComplexNormalRep | ComplexEmptyElementRep |
ComplexEnclosedElement = ComplexElement | AbsentElementRep |
EnclosedElement = SimpleEnclosedElement | ComplexEnclosedElement |
AbsentElementRep = Absent |
SimpleEmptyElementRep = EmptyElementLeftFraming EmptyElementRightFraming |
ComplexEmptyElementRep = EmptyElementLeftFraming EmptyElementRightFraming |
EmptyElementLeftFraming = LeadingAlignment EmptyElementInitiator PrefixLength |
EmptyElementRightFraming = EmptyElementTerminator TrailingAlignment |
SimpleLiteralNilElementRep = NilElementLeftFraming [NilLiteralCharacters | NilElementLiteralContent] NilElementRightFraming |
ComplexLiteralNilElementRep = NilElementLeftFraming NilLiteralValue NilElementRightFraming |
NilElementLeftFraming = LeadingAlignment NilElementInitiator PrefixLength |
NilElementRightFraming = NilElementTerminator TrailingAlignment |
NilElementLiteralContent = LeftPadding NilLiteralValue RightPadOrFill |
SimpleNormalRep = LeftFraming PrefixLength SimpleContent RightFraming |
ComplexNormalRep = LeftFraming PrefixLength ComplexContent RightFraming |
LeftFraming = LeadingAlignment Initiator |
RightFraming = Terminator TrailingAlignment |
PrefixLength = SimpleContent | PrefixPrefixLength SimpleContent |
PrefixPrefixLength = SimpleContent |
SimpleContent = LeftPadding [ SimpleLogicalValue ] RightPadOrFill |
SimpleLogicalValue = SimpleNormalValue | NilLogicalValue |
ComplexContent = ComplexValue ElementUnused |
ComplexValue = Sequence | Choice |
Sequence = LeftFraming SequenceContent RightFraming |
SequenceContent = [ PrefixSeparator EnclosedContent [ Separator EnclosedContent ]* PostfixSeparator ] |
Choice = LeftFraming ChoiceContent RightFraming |
ChoiceContent = [ EnclosedContent ] ChoiceUnused |
EnclosedContent = [ EnclosedElement | Array | Sequence | Choice ] |
Array = [ EnclosedElement [ Separator EnclosedElement ]* [ Separator StopValue ] ] |
StopValue = SimpleElement |
LeadingAlignment = LeadingSkip AlignmentFill |
TrailingAlignment = TrailingSkip |
RightPadOrFill = RightPadding | RightFill | RightPadding RightFill |
Table 11 DFDL Grammar Productions
XML Schema and DFDL properties are used to control constraints on the terminals of the above grammar, as well as repetition (the "*" operator), and alternatives (the "|" operator). For a given set of XML Schema and DFDL properties, and prior data, any terminal may be allowed to be length zero, to contain specific data, or to contain a variety of different admissible data.
Some definitions are needed to cover the range of representations that are possible in the data stream for an element. The representations are:
- Nil Representation
- Empty Representation
- Normal Representation
- Absent Representation
We also define below the concepts:
- Zero-Length Representation
- Missing
These definitions are with respect to the grammar above, and they do reference some DFDL properties necessary for their definitions. These properties are defined in sections 9.8 and beyond.
Some examples follow the definitions.
An element occurrence has a nil representation if the element declaration has XSD nillable property 'true' and the occurrence either:
- conforms to the grammar for SimpleLiteralNilElementRep or ComplexLiteralNilElementRep. Specifically, the NilElementInitiator and NilElementTerminator regions must be conformant with property dfdl:nilValueDelimiterPolicy[9]. (If non-conformant it is not a processing error and the representation is not nil).
- conforms to the grammar for SimpleNormalRep and its value is NilLogicalValue.
The LeadingAlignment, TrailingAlignment, PrefixLength regions may be present.
An element occurrence has an empty representation if the occurrence does not have a nil representation and it conforms to the grammar for SimpleEmptyElementRep or ComplexEmptyElementRep. Specifically, the EmptyElementInitiator and EmptyElementTerminator regions must be conformant with dfdl:emptyValueDelimiterPolicy[10] and the occurrence's SimpleContent or ComplexContent region in the data must be of length zero. (If non-conformant it is not a processing error and the representation is not empty).
LeadingAlignment, TrailingAlignment, PrefixLength regions may be present.
The empty representation is special in DFDL because when parsing it is used to determine when default values are created in the Infoset. The empty representation can require initiators or terminators to be present, to enable data formats which explicitly distinguish occurrences with empty string/hexBinary values from occurrences that are absent. See Section 9.5 Element Defaults below about default values. Hence, the empty representation might not be zero-length; it may require specific non-zero-length syntax in the data stream.
The empty representation is not possible for fixed-length elements with a non-zero length.
[9] For dfdl:nilValueDelimiterPolicy, see Section 12.2 Properties for Specifying Delimiters.
[10] For dfdl:emptyValueDelimiterPolicy, see Section 12.2 Properties for Specifying Delimiters.
An element occurrence has a normal representation if the occurrence does not have the nil representation or the empty representation and it conforms to the grammar for SimpleNormalRep or ComplexNormalRep.
Note that it is possible for the normal representation to be of zero length, but this can only happen when zero-length is neither the nil nor the empty representation, and the simple type is xs:string or xs:hexBinary. For all other simple types, the normal representation cannot be zero length.
Often, we know the location where an element or group's representation would be in the data based on the delimiters of an enclosing group. (An example: if there are adjacent delimiters of an enclosing sequence.) When this location in the data, which is of zero length, cannot be a nil, empty, or normal representation, then we say it has absent representation, or "the representation is absent".
Absent representation differs from empty representation because absent representation is always zero length, whereas the empty representation may be specifically intended to require a non-zero-length representation. However, when the empty representation is zero-length, then the absent representation is not applicable.
More formally, an element occurrence has an absent representation if the occurrence does not have a nil or empty or normal representation, and it conforms to the grammar for AbsentElementRep. Specifically, the occurrence's representation in the data stream is of length zero. Consequently, the Initiator, Terminator, LeadingAlignment, TrailingAlignment, PrefixLength regions must not be present.
As an example of an absent representation: during unparsing, if an optional element does not have an item in the Infoset then nothing is output. However, if a separator of an enclosing structure is subsequently output as the immediate next thing, then a subsequent parse of the element may return a representation of length zero. If this happens, and this zero-length representation does not conform to any of the nil representation, the empty representation, or the normal representation, then it is the absent representation, and it behaves as if the element occurrence is 'missing'. (The term 'missing' is defined below.)
We use the term zero-length representation to describe the situations where any of the above representations turn out to be of length zero due to specific combinations of data type and format properties:
- The nil representation can be a zero-length representation if dfdl:nilValue is ‘%ES;’ or ‘%WSP*;’ appearing on its own as a literal nil value and there is no framing or framing is suppressed by dfdl:nilValueDelimiterPolicy.
- The empty representation can be a zero-length representation if there is no framing or framing is suppressed by dfdl:emptyValueDelimiterPolicy.
- The normal representation can be a zero-length representation if the type is xs:string or xs:hexBinary and there is no framing.
- The absent representation always has a zero-length representation.
If the nil representation may be zero-length, then the absent representation cannot occur because zero-length will be interpreted as nil representation.
If the nil representation may not be zero length, but the empty representation is zero-length, then the absent representation cannot occur because zero-length will be interpreted as the empty representation.
If the nil and empty representations cannot be zero-length, but the normal representation may be zero length then the absent representation cannot occur because zero length will be interpreted as a normal representation.
If the nil representation may not be zero-length, the empty representation may not be zero-length, and the normal representation may not be zero-length, then a zero-length representation is the absent representation, or "is absent".
When parsing, an element occurrence is missing if it does not have nil, empty, or normal representations, or it has the absent representation.
When parsing, the term missing really covers two situations. First, it subsumes absent representation. Secondly it applies when an element does not have a representation at all in the data stream, that is, when we do not even have the constructs in the data stream to determine the location of the representation of the element; hence, none of the concepts above apply. This will be made clearer in the examples below. If an element occurrence is missing when parsing, no item is ever added to the Infoset.
When unparsing, an element occurrence is missing if there is no item in the Infoset. For a required element occurrence, it is this condition that can trigger the creation of a default value in the augmented Infoset. See Section 9.5 Element Defaults below about default values. For an optional element occurrence, no item is ever added to the augmented Infoset nor any representation ever output in the data stream.
The following examples illustrate missing and empty representation.
<xs:sequence dfdl:separator="," dfdl:terminator="@" ...>
  <xs:element name="A" type="xs:string" dfdl:lengthKind="delimited"/>
  <xs:element name="B" type="xs:string" minOccurs="0" dfdl:lengthKind="delimited"/>
  <xs:element name="C" type="xs:string" minOccurs="0" dfdl:lengthKind="delimited"/>
</xs:sequence>
In data stream 'aaa,@' element B has the empty representation, and element C does not have a representation so is missing.
<xs:sequence dfdl:separator="," ...>
  <xs:element name="A" type="xs:string" dfdl:lengthKind="delimited"
              dfdl:initiator="A:" dfdl:emptyValueDelimiterPolicy="initiator"/>
  <xs:element name="B" type="xs:string" minOccurs="0" dfdl:lengthKind="delimited"
              dfdl:initiator="B:" dfdl:emptyValueDelimiterPolicy="initiator"/>
  <xs:element name="C" type="xs:string" minOccurs="0" dfdl:lengthKind="delimited"
              dfdl:initiator="C:" dfdl:emptyValueDelimiterPolicy="initiator"/>
</xs:sequence>
In data stream 'A:aaaa,C:cccc' element B does not have a representation at all, so is missing.
In data stream 'A:aaaa,B:,C:cccc' element B has the empty representation. The format definition requires element B to have its initiator in order to indicate the empty representation.
In the data stream 'A:aaaa,,C:cccc' element B has the absent representation, because we are able to tell where element B would appear, but the syntax there does not contain the required initiator delimiter; hence, it does not satisfy any of nil, empty, or normal representation. Since we know its location, and the data stream there (between the two separators) is zero-length, it is the absent representation, and so is missing.
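The three data streams above can be classified with a small Python sketch. It is a deliberately simplified toy model of the second example schema only (comma separator, initiator "B:"), not a general DFDL parser; the function name `classify_b` and its string results are assumptions made for illustration.

```python
def classify_b(data, separator=",", initiator="B:"):
    """Classify optional element B for the initiated example schema (toy model)."""
    slots = data.split(separator)
    candidates = slots[1:]  # slot 0 is taken by the required element A
    for slot in candidates:
        if slot.startswith(initiator):
            content = slot[len(initiator):]
            # Initiator present with zero-length content: empty representation.
            return "empty" if content == "" else "normal"
    # No initiator was found. A zero-length slot between separators still
    # tells us where B would have been, which is the absent representation.
    if "" in candidates:
        return "absent (missing)"
    # Nothing in the data locates B at all.
    return "no representation (missing)"


for stream in ("A:aaaa,C:cccc", "A:aaaa,B:,C:cccc", "A:aaaa,,C:cccc"):
    print(stream, "->", classify_b(stream))
```

Both of the "missing" outcomes add no item to the Infoset when parsing; they differ only in whether the location of B could be determined from the data.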
The overlapping nature of the possible representations: normal, empty, nil, and absent, creates a number of ambiguities where taking an Infoset, unparsing it, and reparsing it will result in a second Infoset that is not the same as the original. However, taking the second Infoset, unparsing it, and reparsing it, will result in a third Infoset which is the same as the second.
When unparsing, if a string Infoset item happens to contain a string that matches either one of the dfdl:nilValue list values or the default value, it is not given any special treatment. The string's characters are output, or if the value is the empty string, zero length content is output. (In both cases along with an initiator or terminator if defined.) This creates an ambiguity where one can unparse an Infoset item which has member [nilled] false, but when reparsed will produce an Infoset item which has member [nilled] true.
These ambiguities are natural and unavoidable. For example, if the dfdl:nilValue is the 3-character string "nil", then encountering the characters "nil" in the data stream will parse to produce an Infoset item with [nilled] true in the Infoset. If you unparsed a string Infoset item with contents of the 3 characters "nil", this will be output as the letters "nil", which on parse will not produce a string with the characters "nil", but rather an Infoset item with no data value and member [nilled] true.
To avoid this issue, one can use validation, along with a pattern that prevents the string from matching any of the nil values.
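The round-trip behavior described above can be reproduced with a toy model in Python. The dictionary-based Infoset item and the functions `parse_item` and `unparse_item` are hypothetical simplifications that assume a single dfdl:nilValue of "nil" and no framing.

```python
NIL_VALUE = "nil"  # assumed single literal nil value, as in the example above


def parse_item(text):
    """Parse text into a toy Infoset item."""
    if text == NIL_VALUE:
        return {"nilled": True, "dataValue": None}
    return {"nilled": False, "dataValue": text}


def unparse_item(item):
    """Unparse a toy Infoset item; strings get no special treatment."""
    if item["nilled"]:
        return NIL_VALUE
    return item["dataValue"]


first = {"nilled": False, "dataValue": "nil"}   # a plain string that spells "nil"
second = parse_item(unparse_item(first))        # comes back nilled
third = parse_item(unparse_item(second))        # stable from here on

print(first != second, second == third)
```

The first round trip changes the item, and every later round trip leaves it unchanged, which matches the statement that the third Infoset equals the second.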
A DFDL parser proceeds by determining the existence of occurrences of schema components. It does this by examining the data and the schema, to:
- Establish representation
- Resolve points of uncertainty
These two activities are defined below. They are mutually recursive in the expected way as a DFDL schema is a recursive nest of schema components.
The parsing algorithm described here has many aspects which depend on the definitions of numerous DFDL properties. The properties are defined in sections 9.8 and beyond.
Establishing the representation of an occurrence of a schema component and resolving points of uncertainty involve the concepts of known-to-exist and known-not-to-exist.
9.3.1.1 Known-to-exist
An occurrence of a schema component is said to be known-to-exist when any of these positive determinations hold:
- There is a dfdl:discriminator[11] applying to the component and its expression evaluates to true or its regular expression pattern matches.
- The component is a direct child of an xs:sequence or xs:choice with dfdl:initiatedContent[12] 'yes' and a dfdl:initiator defined for the component is found.
- The component is a direct child of an xs:choice with dfdl:choiceDispatchKey[13] and the result of the dfdl:choiceDispatchKey expression matches one of the dfdl:choiceBranchKey property values of the child.
If none of those hold because they are not applicable then the occurrence is still known-to-exist if ALL of the following hold, and no processing error occurs during their determination:
- When there are dfdl:assert[14] statements with failureType 'processingError' on the component, all their expressions evaluate to true or their regular expression patterns match.
- It has nil, empty, or normal representation.
- When it has normal representation the content of the representation is convertible to the element type without error.
Note that validation errors or recoverable errors do not prevent determination that a component is known-to-exist.
[11] DFDL discriminators are described in Section 7.4 The dfdl:discriminator Statement Annotation Element.
[12] For dfdl:initiator and dfdl:initiatedContent, see Section 12.2 Properties for Specifying Delimiters.
[13] For dfdl:choiceDispatchKey and dfdl:choiceBranchKey, see Section 15.1.2 Resolving Choices via Direct Dispatch.
[14] DFDL asserts are described in Section 7.3 The dfdl:assert Statement Annotation Element.
9.3.1.2 Processing Error After Determining Known-to-exist
Note that it is possible for an occurrence of a schema component to be known-to-exist due to a positive discrimination, but then subsequently a processing error occurs when evaluating a statement annotation such as a dfdl:assert or a dfdl:setVariable, or a processing error occurs when determining the representation, or in the case of normal representation and simpleType, when converting that representation's content into a value of the type. This processing error does not change the fact that the schema component was determined to be known-to-exist. This is important in the discussion of resolving Points of Uncertainty below.
9.3.1.3 Known-not-to-exist
An occurrence of a schema component is known-not-to-exist when any of these negative determinations holds:
- There is a dfdl:discriminator applying to the component and its expression evaluates to false or its regular expression pattern fails to match, or a processing error occurs while processing the dfdl:discriminator.
- The component is a direct child of an xs:sequence or xs:choice with dfdl:initiatedContent 'yes' and an initiator defined for the component is not found.
- The component is a direct child of an xs:choice with dfdl:choiceDispatchKey and the result of the dfdl:choiceDispatchKey expression does not match any of the dfdl:choiceBranchKey property values of the child.
If none of those hold because they are not applicable, then a schema component is known-not-to-exist when any of the following hold:
- The occurrence is missing
- There is a dfdl:assert with failureType 'processingError' on the component and its expression evaluates to false or its regular expression pattern fails to match, or a processing error occurs while processing the dfdl:assert.
- A processing error occurs when parsing the component. Processing errors include, but are not limited to, inability to identify any of nil, empty, normal or absent representations, or failure to convert a value to the built-in logical type.
Note that validation errors or recoverable errors do not cause a component to be known-not-to-exist.
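The two-stage determination above (explicit determinations first, then the fallback conditions) can be summarized as a decision function. The Python below is a sketch under simplifying assumptions: each explicit determination is passed as `True`, `False`, or `None` for "not applicable", at most one of them is assumed to apply so that precedence between them does not arise, and all parameter names are invented for this illustration.

```python
from enum import Enum


class Existence(Enum):
    KNOWN_TO_EXIST = "known-to-exist"
    KNOWN_NOT_TO_EXIST = "known-not-to-exist"


def determine_existence(discriminator=None, initiator_found=None,
                        dispatch_key_matches=None, asserts_pass=True,
                        representation="normal", value_converts=True):
    """Toy decision procedure for one occurrence of a schema component."""
    explicit = [d for d in (discriminator, initiator_found, dispatch_key_matches)
                if d is not None]
    if len(explicit) > 1:
        # Assumption of this sketch only: keep to one explicit determination.
        raise ValueError("sketch handles at most one explicit determination")
    if explicit:
        return (Existence.KNOWN_TO_EXIST if explicit[0]
                else Existence.KNOWN_NOT_TO_EXIST)
    # Fallback: none of the explicit determinations is applicable.
    if not asserts_pass:
        return Existence.KNOWN_NOT_TO_EXIST
    if representation not in ("nil", "empty", "normal"):
        return Existence.KNOWN_NOT_TO_EXIST      # occurrence is missing
    if representation == "normal" and not value_converts:
        return Existence.KNOWN_NOT_TO_EXIST      # conversion is a processing error
    return Existence.KNOWN_TO_EXIST
```

Validation errors and recoverable errors have no parameter here because, as noted above, they do not influence either determination.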
Note: based on the above, when processing a sequence for which a separator is defined, the presence of a match in the data for the separator is not sufficient to cause the parser to determine that an associated component is known-to-exist. See Section 14.2 Sequence Groups with Separators for details.
Unless an element occurrence is known-not-to-exist, the parsing algorithm establishes if it has the nil, empty, normal, or absent representation.
The first step is to see if the SimpleContent or ComplexContent region is of length zero as a first approximation. This is dfdl:lengthKind dependent.
- explicit => length is zero (either fixed or from expression evaluation)
- prefixed => length given by the prefix is zero
- implicit (simple) => length is zero[15]
- implicit (complex) => not possible.
- delimited => length is zero (in scope delimiter is immediately encountered)
- pattern => pattern returns zero length match
- endOfParent => already positioned at parent's end so length is zero
[15] This is a corner case that only happens when type is xs:string or xs:hexBinary and the maxLength facet is 0. Such an element can only be of length 0.
9.3.2.1 Simple element
If the result is length zero as described above, the representation is then established by checking, in order, for:
- nil representation (if %ES; or %WSP*; on its own is a literal nil value).
- empty representation.
- normal representation (xs:string or xs:hexBinary only)
- absent representation (if none of the prior representations apply).
If the result is not length zero, the representation is then established by checking, in order, for:
- nil representation (as a literal nil value)
- nil representation (as a logical nil value)
- normal representation
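The two ordered checks for a simple element can be written out as follows. This Python sketch models the content region as a string and reduces the property-dependent tests to plain arguments; `establish_simple_representation`, `nil_literals`, `logical_nils`, and `empty_conforms` are hypothetical names, and an empty string in `nil_literals` stands in for %ES; being a literal nil value.

```python
STRING_LIKE = {"xs:string", "xs:hexBinary"}


def establish_simple_representation(content, type_name, nil_literals=(),
                                    logical_nils=(), empty_conforms=True):
    """Return 'nil', 'empty', 'normal' or 'absent' for a simple element (toy)."""
    if content == "":
        # Zero-length content: check nil, empty, normal, absent in that order.
        if "" in nil_literals:
            return "nil"
        if empty_conforms:          # delimiter policy satisfied for empty
            return "empty"
        if type_name in STRING_LIKE:
            return "normal"
        return "absent"
    # Non-zero-length content: literal nil, then logical nil, then normal.
    if content in nil_literals:
        return "nil"
    if content in logical_nils:
        return "nil"
    return "normal"
```

For instance, zero-length content for an xs:int element whose empty representation is ruled out by its delimiter policy ends up absent, because the normal representation of an xs:int cannot be zero length.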
9.3.2.2 Complex element
If the result is length zero as described above, the representation is then established by checking for:
- nil representation (if %ES; is a literal nil value).[16]
To establish any other representations requires that the parser descends into the complex type for the element, and returns successfully (that is, no unsuppressed processing error occurs). If the result is zero bits consumed, the representation is then established by checking, in order, for:
- empty representation.
- absent representation (if none of the prior representations apply).
Otherwise the element has normal representation.
Note: The DFDL parser SHALL NOT recursively parse the schema components inside a complex element when it has already established that the element occurrence is missing[17].
[16] It is a Schema Definition Error if a complex element has XSD nillable ‘true’ and dfdl:lengthKind ‘implicit’.
[17] The rationale for this is that otherwise this could give rise to misleading error messages where the parser reported that required child elements were missing required occurrences. (This is consistent with XML Schema validation, where if a required element is missing, it gets reported as such, and there is nothing reported about its children).
A point of uncertainty occurs in the data stream when there is more than one schema component that might occur at that point. Points of uncertainty can be nested.
Any one of the following constructs is a potential point of uncertainty:
- An xs:choice
- All xs:elements in an unordered xs:sequence (dfdl:sequenceKind[18] is 'unordered')
- An optional[19] xs:element
- An array xs:element.
- All xs:elements in an xs:sequence containing one or more dfdl:floating[20] xs:elements.
The parser resolves these points of uncertainty by way of a set of construct-specific rules given below along with determining whether schema components are known-to-exist or known-not-to-exist. For some of these constructs, whether there is an actual point of uncertainty depends on the representation of the constructs in the data.
An xs:choice is always a point of uncertainty. It is resolved sequentially, or by direct dispatch. Sequential choice resolution occurs by parsing each choice branch in schema definition order until one is known-to-exist. It is a processing error if none of the choice branches are known-to-exist. Direct-dispatch choice resolution occurs by matching the value of the dfdl:choiceDispatchKey property to the value of one of the dfdl:choiceBranchKey property values of one of the choice branches. It is a processing error if none of the choice branches have a matching value in their dfdl:choiceBranchKey property.
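The two ways of resolving a choice can be contrasted in a short Python sketch. Branches are modeled as (name, parser) pairs and a branch counts as known-to-exist when its parser returns without raising; `ProcessingError`, the two resolver functions, and the toy branch parsers are hypothetical names, not part of any DFDL API.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def resolve_choice_sequential(branches, data):
    """Try branches in schema definition order until one parses."""
    for name, parse in branches:
        try:
            return name, parse(data)
        except ProcessingError:
            continue  # branch is known-not-to-exist; try the next one
    raise ProcessingError("no choice branch is known-to-exist")


def resolve_choice_by_dispatch(dispatch_key, branches_by_key, data):
    """Pick the branch whose branch key equals the dispatch key; no search."""
    if dispatch_key not in branches_by_key:
        raise ProcessingError(f"no branch key matches {dispatch_key!r}")
    name, parse = branches_by_key[dispatch_key]
    return name, parse(data)


def parse_number(data):
    if not data.isdigit():
        raise ProcessingError(f"not a number: {data!r}")
    return int(data)


def parse_text(data):
    return data


BRANCHES = [("number", parse_number), ("text", parse_text)]
BY_KEY = {"N": ("number", parse_number), "T": ("text", parse_text)}
```

With sequential resolution, "abc" fails the first branch and is accepted by the second. With direct dispatch, key "N" commits to the number branch, so "abc" yields a processing error rather than falling back to the text branch.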
An element in an unordered xs:sequence is always a point of uncertainty. It is resolved by parsing for the child components of the sequence in schema definition order at each point in the data stream where a component can exist until the required number of occurrences of each child component is known-to-exist or the sequence is terminated by delimiters or specified length.
An element in a sequence with one or more floating elements is always a point of uncertainty. It is resolved by parsing for the expected element at that point in the data stream. If the expected element is known-not-to-exist then an occurrence of each floating element is parsed in schema definition order.
When parsing an array, points of uncertainty only occur for certain values of dfdl:occursCountKind[21], as follows:
occursCountKind | Details of Point of Uncertainty |
---|---|
fixed | No point of uncertainty (maxOccurs occurrences expected). |
implicit | A point of uncertainty exists after XSD minOccurs occurrences are found and until XSD maxOccurs occurrences are found. |
parsed | A point of uncertainty exists for all occurrences |
expression | No point of uncertainty (The number of occurrences equal to the dfdl:occursCount[22] value is expected) |
stopValue | No point of uncertainty (The stop value must always be present, even when XSD minOccurs is 0). |
Table 12: Points of Uncertainty and dfdl:occursCountKind
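Table 12 can be restated as a predicate that says whether the next occurrence of an array element is a point of uncertainty. The Python below is a sketch; the function name, the `found` counter, and the use of `None` to model an unbounded XSD maxOccurs are assumptions of this example.

```python
def has_point_of_uncertainty(occurs_count_kind, found=0, min_occurs=0,
                             max_occurs=None):
    """Is the next occurrence a point of uncertainty? (None = unbounded max)."""
    if occurs_count_kind in ("fixed", "expression", "stopValue"):
        return False  # the number of occurrences is determined another way
    if occurs_count_kind == "parsed":
        return True   # every occurrence is speculative
    if occurs_count_kind == "implicit":
        below_max = max_occurs is None or found < max_occurs
        return found >= min_occurs and below_max
    raise ValueError(f"unknown occursCountKind: {occurs_count_kind!r}")
```

For example, with 'implicit', XSD minOccurs 2 and XSD maxOccurs 4, the first two occurrences are required, the third and fourth are points of uncertainty, and no fifth occurrence is attempted.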
An optional element point of uncertainty is resolved by parsing the element until it is either known-to-exist or known-not-to-exist. Whether an optional element is an actual point of uncertainty depends on property dfdl:occursCountKind as described above.
For an array element, the point of uncertainty is resolved for each occurrence separately by parsing the occurrence until it is either known-to-exist or known-not-to-exist.
[18] For dfdl:sequenceKind, see Section 14 Sequence Groups.
[19] For optional and array elements, see Section 16 Properties for Array Elements and Optional Elements.
[20] For dfdl:floating elements, see Section 14.4 Floating Elements.
[21] Property dfdl:occursCountKind is defined in Section 16.1 dfdl:occursCountKind property.
[22] Property dfdl:occursCount is defined in Section 16 Properties for Array Elements and Optional Elements.
9.3.3.1 Nested Points of Uncertainty
A point of uncertainty can be resolved because a schema component has been determined to be known-to-exist due to positive discrimination. In that case, if a subsequent processing error occurs when completing the parsing of that schema component, this will cause the next enclosing schema component surrounding this point of uncertainty to be determined to be known-not-to-exist.
For example, when parsing an element occurrence for an array with a variable number of occurrences, a positive discrimination tells the parser that the currently-being-parsed occurrence is known-to-exist. If a subsequent processing error occurs while completing the parsing of this occurrence, then the entire array is known-not-to-exist.
Another example is a choice. If a discriminator resolves the choice point of uncertainty to the first of the choice's alternatives, a subsequent processing error causes the entire choice construct to be determined to be known-not-to-exist.
This will cause the next enclosing point of uncertainty to try the next possible alternative, or if there isn't one, will cause an unsuppressed processing error.
The behavior of a DFDL processor on an unsuppressed processing error is not specified, but it is allowable for implementations to abort further parsing. Any other behavior is implementation-defined.
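The interaction between discriminators and nested points of uncertainty can be sketched as follows. This is an illustrative model only, assuming a choice whose branches are plain Python functions; the names `PointOfUncertainty`, `parse_choice`, `header_record`, `any_record`, `inner_choice` and `fallback` are hypothetical and do not come from the specification.

```python
class ProcessingError(Exception):
    """A parse failure that an enclosing point of uncertainty may suppress."""


class PointOfUncertainty:
    """Tracks whether a discriminator has resolved this point (hypothetical)."""

    def __init__(self):
        self.resolved = False

    def discriminate(self):
        self.resolved = True


def parse_choice(branches, data):
    """Try each branch in schema order, honoring positive discrimination."""
    for branch in branches:
        pou = PointOfUncertainty()
        try:
            return branch(data, pou)
        except ProcessingError:
            if pou.resolved:
                # The branch was known-to-exist, so the whole choice is
                # known-not-to-exist: do not try the remaining branches.
                raise
            # Otherwise suppress the error and try the next alternative.
    raise ProcessingError("no choice branch matched")


def header_record(data, pou):
    if not data.startswith("H"):
        raise ProcessingError("not a header")
    pou.discriminate()  # positive discrimination: this branch is selected
    if len(data) != 4:
        raise ProcessingError("header must be 4 characters")
    return ("header", data)


def any_record(data, pou):
    return ("other", data)


def inner_choice(data, pou):
    # A nested choice; its failure surfaces at the enclosing choice.
    return parse_choice([header_record, any_record], data)


def fallback(data, pou):
    return ("fallback", data)
```

In this sketch, parsing "H1" with `[header_record, any_record]` fails outright even though `any_record` would accept the data, because the discriminator had already selected the header branch. Wrapping that choice in an outer one, `parse_choice([inner_choice, fallback], "H1")`, lets the enclosing point of uncertainty move on to its next alternative.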
A DFDL processor can create element defaults in the Infoset for both simple and complex elements. This happens quite differently for parsing and unparsing as will be explained in this section.
A simple element has a default value if any of these are true:
- The XSD default property exists. The default value is the XSD default property's value.
- The XSD fixed[23] property exists. The default value is the XSD fixed property's value.
- The element has XSD nillable 'true' and dfdl:useNilForDefault[24] is 'yes'. The corresponding Infoset item will have the [nilled] member true, and the [dataValue] member will have no value.
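The three conditions above can be sketched as a lookup. This is a minimal illustration, assuming a hypothetical `SimpleElementDecl` record; the order in which the conditions are checked is an assumption of the sketch, since the text only requires that any one of them holds.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SimpleElementDecl:
    """Hypothetical holder for the properties relevant to defaulting."""
    default: Optional[str] = None      # XSD default
    fixed: Optional[str] = None        # XSD fixed
    nillable: bool = False             # XSD nillable
    use_nil_for_default: str = "no"    # dfdl:useNilForDefault


def default_of(decl: SimpleElementDecl) -> Optional[Tuple[str, Optional[str]]]:
    """Return ("value", v), ("nilled", None), or None if there is no default.

    The check order below is an assumption made for this sketch.
    """
    if decl.default is not None:
        return ("value", decl.default)
    if decl.fixed is not None:
        return ("value", decl.fixed)
    if decl.nillable and decl.use_nil_for_default == "yes":
        # [nilled] is true and [dataValue] has no value.
        return ("nilled", None)
    return None
```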
If empty representation is established when parsing, the possibility of applying an element default arises. Essentially, if a required occurrence of an element has empty representation, then an element default will be applied if present, though there are a couple of variations on this rule. Remember that in order to have established empty representation, the occurrence must be compliant with the dfdl:emptyValueDelimiterPolicy for the element, and for a complex element the parser must have descended into the type and returned with no unsuppressed processing error.
The rules for applying element defaults are not dependent on dfdl:occursCountKind. However, if a required occurrence does not produce an item in the Infoset after the rules have been applied, then whether it is a processing error or a validation error (if validation is enabled) does depend on dfdl:occursCountKind (see Section 16.1 dfdl:occursCountKind property).
The sections below indicate when an item is added to the Infoset, and whether it has a default or other value. If there is no processing error then regardless of whether an item is added to the Infoset or not, any side-effects due to dfdl:discriminator statements evaluating to true, or dfdl:setVariable statements, are retained.
Assuming the empty representation has been established, there are three main cases to consider:
- Simple element (not type xs:string or xs:hexBinary)
- Simple element (type xs:string or xs:hexBinary)
- Complex element
Each is described in a section below.
[23] The XSD fixed property is like the XSD default property, with the further stipulation that if a value is present, it must be equal to the XSD fixed property value.
[24] For dfdl:useNilForDefault see Section 13.16 Properties for Nillable Elements.
9.4.2.1 Simple element (not xs:string and not xs:hexBinary)
Required occurrence: If the element has a default value then an item is added to the Infoset using the default value, otherwise nothing is added to the Infoset.
Optional occurrence: Nothing is added to the Infoset.
9.4.2.2 Simple element (xs:string or xs:hexBinary)
Required occurrence: If the element has a default value then an item is added to the Infoset using the default value, otherwise an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value.
Optional occurrence: if dfdl:emptyValueDelimiterPolicy is applicable and is not 'none'[25], then an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value, otherwise nothing is added to the Infoset.
Note: To prevent unwanted empty strings or empty hexBinary values from being added to the Infoset, use XSD minLength > '0' and a dfdl:assert that uses the dfdl:checkConstraints() function, to raise a processing error.
[25] If other than ‘none’, either an initiator, terminator or both must have been found in the data stream.
9.4.2.3 Complex element
Required occurrence: An item is added to the Infoset.
Optional occurrence: if dfdl:emptyValueDelimiterPolicy is applicable and is not 'none'[26], then an item is added to the Infoset, otherwise nothing is added to the Infoset.
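The rules of the three cases above can be collected into one decision function. The sketch below is illustrative only: the function name, the `kind` labels, and the `evdp_not_none` flag (standing for "dfdl:emptyValueDelimiterPolicy is applicable and is not 'none'") are hypothetical, and an empty dict is used as a stand-in for a complex Infoset item. It assumes the empty representation has already been established.

```python
def item_for_empty(kind, required, default=None, evdp_not_none=False):
    """Return the value added to the Infoset, or None if nothing is added.

    kind: "simple" (not string/hexBinary), "string", "hexBinary", "complex".
    """
    empty_values = {"string": "", "hexBinary": b""}
    if kind == "complex":
        # Required: always added. Optional: added only if delimiters apply.
        return {} if (required or evdp_not_none) else None
    if kind in empty_values:
        if required:
            return default if default is not None else empty_values[kind]
        return empty_values[kind] if evdp_not_none else None
    if kind == "simple":
        # Only a required occurrence with a default yields an item.
        return default if required else None
    raise ValueError(f"unknown kind: {kind}")
```

Note that a required non-string simple element without a default yields nothing here; whether that absence is then a processing error or a validation error depends on dfdl:occursCountKind, which this sketch does not model.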
When a complex element is parsed by recursive descent, the parser can construct a complex element in the Infoset that contains a single child element. This can occur in either of these cases:
- The first child element of the complex type is a required simple element. An empty string (type xs:string), empty hexBinary (type xs:hexBinary), or default value is then also added to the Infoset.
- The first child element of the complex type is a required complex element. An item is then added to the Infoset (which may itself have a child via the first case).
As an example, consider the following:
    <xs:sequence dfdl:separator="|"> <!-- sequence S0 -->
      ...prior schema components ...
      <xs:element name="E1" minOccurs="0" dfdl:lengthKind="delimited"
                  dfdl:occursCountKind="implicit">
        <xs:complexType>
          <xs:sequence dfdl:separator=";"> <!-- sequence S1 -->
            <xs:element name="E2" type="xs:string" dfdl:lengthKind="delimited"/>
            ... other optional content ...
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      ...
    </xs:sequence>
In the above we have a sequence S0 with a separator that contains among other content an optional non-nillable non-initiated element E1 of complex type. The content of the type is a sequence S1 with a different separator and the first child is a required non-initiated element E2 of type xs:string. The dfdl:lengthKind of both E1 and E2 is 'delimited'.
Now consider a data stream '...||...' that is, where we have two adjacent S0 separators, and where we have successfully parsed the schema components prior to E1 within S0, which is what the "..." prior to the two separators represents. That prior parse is delimited by the first S0 "|" separator, and E1's representation begins immediately after that first S0 separator.
The representation of E1 has zero length because of these two adjacent S0 separators. On processing E1, the parser will establish a point of uncertainty with the data stream positioned after the first S0 separator. The parser will then descend into E1's complex type to process E2. It scans for in-scope delimiters and immediately encounters the second S0 separator. E2 has the empty representation, so E1 is added to the Infoset along with a value of empty string for E2. All other content of S1 is missing, so the parser returns from the descent into E1 with this temporary Infoset (illustrated as XML):
<E1> <E2></E2> </E1>
Upon this successful parse, E1 is known-to-exist. However, because the position in the data stream has not changed, E1 has the empty representation. Because E1 is empty and optional (it has XSD minOccurs='0') it is not added to the Infoset, and the temporary Infoset item for E1 containing E2 is discarded.
[26] If other than ‘none’, either an initiator, terminator or both must have been found in the data stream.
If an element is missing from the Infoset when unparsing, the possibility of applying an element default arises. Essentially if a required occurrence of an element is missing, then an element default will be applied if present, and the resulting item is added to the augmented Infoset.
The rules for applying element defaults are not dependent on dfdl:occursCountKind. However if a required occurrence does not produce an item in the augmented Infoset after the rules have been applied then whether it is a processing error or a validation error (if enabled) is dependent on dfdl:occursCountKind (see Section 16.1 dfdl:occursCountKind property).
There are two main cases to consider.
9.4.3.1 Simple element
Required occurrence: If an element has a default value then an item is added to the augmented Infoset using the default value, otherwise nothing is added.
Optional occurrence: Nothing is added to the augmented Infoset.
9.4.3.2 Complex element
Required occurrence: An item is added to the augmented Infoset as specified below.
Optional occurrence: Nothing is added to the augmented Infoset.
For a required occurrence, the unparser descends into the complex type:
For a sequence, each child element is examined in schema order and the rules for simple and complex elements applied (recursively). The lack of a default may give rise to a processing error, as described above.
For a choice, each branch is examined in schema order and the above rules applied recursively to the branch. The lack of a default may give rise to a processing error, as described above, and if so the error is suppressed and the next branch is tried, otherwise that branch is selected. It is a processing error if no choice branch is ultimately selected. If no choice branch is selected, then there must be a choice branch with no required elements, and the first such branch would be selected.
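The recursive descent described above can be sketched as follows. This is a simplified model, not a definitive implementation: the `Simple` and `Complex` classes and the `default_item` function are hypothetical, each choice branch is assumed to be a single element, and a required simple element without a default is always treated as a processing error (the text says only that it may be one, depending on dfdl:occursCountKind).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


class ProcessingError(Exception):
    pass


@dataclass
class Simple:
    name: str
    required: bool = True
    default: Optional[str] = None


@dataclass
class Complex:
    name: str
    required: bool = True
    kind: str = "sequence"  # "sequence" or "choice"
    children: List[Union["Simple", "Complex"]] = field(default_factory=list)


def default_item(decl):
    """Build the augmented-Infoset item for a missing occurrence.

    Returns None when nothing is added to the augmented Infoset.
    """
    if not decl.required:
        return None
    if isinstance(decl, Simple):
        if decl.default is None:
            # Simplifying assumption: always a processing error here.
            raise ProcessingError(f"required element {decl.name} has no default")
        return decl.default
    if decl.kind == "sequence":
        item = {}
        for child in decl.children:  # schema order
            value = default_item(child)
            if value is not None:
                item[child.name] = value
        return item
    # Choice: the first branch that does not fail is selected.
    for branch in decl.children:
        try:
            value = default_item(branch)
        except ProcessingError:
            continue  # suppress the error and try the next branch
        return {} if value is None else {branch.name: value}
    raise ProcessingError(f"no branch of choice in {decl.name} could be selected")
```

In this model a branch with no required content never raises, so it is selected (producing an empty item) when every earlier branch fails.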
Given a component of a DFDL schema, there is a resolved set of annotations for it.
Of these, some are statement annotations and the order of their evaluation relative to the actual processing of the schema component itself (parsing or unparsing via its format annotations) is as defined in the ordered lists below.
For elements and element references:
- dfdl:discriminator or dfdl:assert(s) with testKind 'pattern' (parsing only)
- dfdl:element following property scoping rules, which includes establishing representation as described in Section 9.3.2 and conversion to the element type for simple types
- dfdl:setVariable(s) - in lexical order, innermost schema component first
- dfdl:discriminator or dfdl:assert(s) with testKind 'expression' (parsing only)
For sequences, choices and group references:
- dfdl:discriminator or dfdl:assert(s) with testKind 'pattern' (parsing only)
- dfdl:newVariableInstance(s) - in lexical order, innermost schema component first
- dfdl:setVariable(s) - in lexical order, innermost schema component first
- dfdl:sequence or dfdl:choice or dfdl:group following property scoping rules and evaluating any property expressions (corresponds to ComplexContent grammar region)
- dfdl:discriminator or dfdl:assert(s) with testKind 'expression' (parsing only)
The dfdl:setVariable annotations at any one annotation point of the schema are always executed in lexical order. However, dfdl:setVariable annotations can also be found in different annotation points that are combined into the resolved set of annotations for one schema component. In this case, the order of execution of the dfdl:setVariable statements from any one annotation point remains lexical. The order of execution of the dfdl:setVariable annotations from different annotation points follows the principle of innermost first, meaning that a schema component that references another schema component has its dfdl:setVariable statements executed after those of the referenced schema component. For example, if an element reference and an element declaration both have dfdl:setVariable statements, then those on the element declaration will execute before those on the element reference. Similarly, dfdl:setVariable statements on a base simple type execute before those of a simple type derived from it. The dfdl:setVariable statements on a simple type execute before those on an element having that simple type (whether that type is by reference, or when the simple type is lexically nested within the element declaration). The dfdl:setVariable statements on the sequence or choice within a global group definition execute before those on a group reference.
The dfdl:newVariableInstance annotations at any one annotation point of the schema are always executed in lexical order. However, dfdl:newVariableInstance annotations can also be found in different annotation points that are combined into the resolved set of annotations for one schema component. In this case, the order of execution of the dfdl:newVariableInstance statements from any one annotation point remains lexical. The order of execution of the dfdl:newVariableInstance annotations from different annotation points follows the principle of innermost first, meaning that a schema component that contains or references another schema component has its dfdl:newVariableInstance statements executed after those of the contained or referenced schema component. For example, if a group reference and the sequence or choice group of a group definition both have dfdl:newVariableInstance statements, then those on the global group definition will execute before those on the group reference.
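The "lexical within an annotation point, innermost point first" ordering can be sketched in a few lines. The function name and the list-of-lists input are assumptions made for illustration; the statement labels in the usage note are placeholders, not real DFDL syntax.

```python
def execution_order(annotation_points):
    """Flatten statements into execution order.

    `annotation_points` is ordered from the outermost (referencing)
    component to the innermost (referenced) one; each entry lists that
    point's statements in lexical order.
    """
    return [stmt for point in reversed(annotation_points) for stmt in point]
```

For an element reference, its element declaration, and the declaration's simple type, calling `execution_order([["ref1", "ref2"], ["decl1"], ["type1", "type2"]])` yields the simple type's statements first, then the declaration's, then the reference's, with lexical order preserved inside each point.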
Implementations are free to optimize by recognizing and executing discriminators or asserts with testKind 'expression' earlier so long as the resulting behavior is consistent with what results from the description above.
When parsing, an attempt to evaluate a discriminator MUST be made even if preceding statements or the parse of the schema component ended in a processing error.
This is because a discriminator's expression could evaluate to true thereby resolving a point of uncertainty even if the complete parsing of the construct ultimately caused a processing error.
Such discriminator evaluation has access to the DFDL Infoset of the attempted parse as it existed immediately before detecting the parse failure. Attempts to reference parts of the DFDL Infoset that do not exist are processing errors.
The resolved set of dfdl:setVariable statements for an element are executed after the parsing of the element. This contrasts with the resolved set of dfdl:setVariable statements for a group which are executed before the parsing of the group.
For elements, this implies that these variables are set after the evaluation of expressions corresponding to any computed DFDL properties for that element, and so the variables may not be referenced from expressions that compute these DFDL properties.
That is, if an expression is used to provide the value of a property (such as dfdl:terminator or dfdl:byteOrder), the evaluation of that property expression occurs before any dfdl:setVariable annotation from the resolved set of annotations for that element are executed; hence, the expression providing the value of the property may not reference the variable. Schema authors can insert sequences to provide more precise control over when variables are set.
Logical validation checks are constraints expressed in XSD, and they apply to the logical values of the Infoset. Hence, parsing MUST successfully construct the Infoset before validation checks can be performed. This implies that validation errors cannot affect the parsing or unparsing of data.
DFDL processors MAY provide both validating and non-validating behaviors on either or both of parse and unparse. (A DFDL implementation could support validate on parse, but not support it on unparse and still be considered conforming.)
Validation on unparsing takes place on the augmented Infoset that is created by the unparser as a side-effect of creating the output data stream.
When resolving points of uncertainty (during parsing), validation errors are ignored.
The way a validation error is presented to the execution context of a DFDL processor is not specified by the DFDL language. The validity of an element is recorded in the DFDL Infoset, see Section 4 The DFDL Information Set (Infoset).
The following DFDL schema constructs are allowed in DFDL and are checked when validating:
- XSD pattern facet - (for xs:string type elements only)
- XSD minLength, maxLength
- XSD minInclusive, minExclusive, maxInclusive, maxExclusive
- XSD enumeration
- XSD maxOccurs
Note that validation is distinct from the checking of DFDL assert or discriminator predicates. When a DFDL discriminator or assert is used to discriminate a choice or other point of uncertainty when parsing, then that dfdl:assert or dfdl:discriminator is essential to parsing and it is evaluated irrespective of whether validation is enabled or disabled.
There is also a function dfdl:checkConstraints available in the DFDL Expression language. This can be used to explicitly include checking of the XSD facet constraints as part of parsing a specific element. Such checking is part of parsing and does not create validation errors. See Section 18.5.3 DFDL Functions for details.
The unparsing algorithm starts from a DFDL Infoset, and it begins by augmenting the Infoset by filling in default values for required elements that are not present, and for calculated elements by use of the dfdl:outputValueCalc property (see Section 17 Calculated Value Properties).
An element declaration in the schema describes a potentially represented item if that element declaration does not have a dfdl:inputValueCalc property (see Section 17 Calculated Value Properties). Whether the element declaration describes an item that is actually represented or not depends on whether the element declaration is for an optional element and whether the element has a corresponding value in the augmented Infoset.
In expressions, the functions dfdl:contentLength() and dfdl:valueLength() can be called to determine the length of an item. If an element declaration is not potentially represented, then these functions are defined to return 0.
When unparsing, an element declaration and the Infoset are considered as follows. An implementation MAY use any technique consistent with this algorithm:
a) If the element declaration has a dfdl:outputValueCalc property, then the expression which is the dfdl:outputValueCalc property value is evaluated and the resulting value becomes the value of the element item in the augmented Infoset. Any pre-existing value for the Infoset item is superseded by this new value.
b) If the element declaration has no corresponding value in the augmented Infoset, and the element declaration is for a required occurrence, and it has a default value specified, then an element item having the default value is created in the augmented Infoset.
c) If any Infoset item's value is requested recursively as a part of (a) above and (a) does not apply, and the corresponding value is not present, and (b) does not apply then it is a processing error.
Given this augmented Infoset, then if the potentially represented element declaration has a corresponding Infoset item then that item is converted to its representation according to its DFDL properties. If the element declaration is for a required occurrence, and there is no value in the augmented Infoset then it is a processing error.
Because rule (a) above is used even if the augmented Infoset item already exists and has a value, it is possible for a dfdl:outputValueCalc expression to be evaluated multiple times. DFDL implementations are free to cache values and avoid this repeated evaluation for efficiency, as the semantics of DFDL require that the dfdl:outputValueCalc expression return the same value every time it is evaluated.
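Rules (a) to (c) can be sketched for a flat list of simple elements. This is an illustrative model under stated assumptions: `Decl`, `augment` and `value_of` are hypothetical names, a dfdl:outputValueCalc expression is modeled as a Python callable that receives a lookup function, and no caching or cycle detection is attempted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ProcessingError(Exception):
    pass


@dataclass
class Decl:
    name: str
    required: bool = True
    default: Optional[object] = None
    # Stand-in for a dfdl:outputValueCalc expression (hypothetical shape).
    output_value_calc: Optional[Callable[[Callable[[str], object]], object]] = None


def augment(decls, infoset):
    """Return the augmented Infoset as a dict of name -> value."""
    by_name = {d.name: d for d in decls}
    out = dict(infoset)

    def value_of(name):
        decl = by_name[name]
        if decl.output_value_calc is not None:
            # Rule (a): the calculated value supersedes any existing value.
            out[name] = decl.output_value_calc(value_of)
        elif name not in out:
            if decl.required and decl.default is not None:
                # Rule (b): apply the default for a required occurrence.
                out[name] = decl.default
            else:
                # Rule (c), and a required occurrence with no value.
                raise ProcessingError(f"no value available for {name}")
        return out[name]

    for decl in decls:
        # A missing optional element without a calculation is simply absent.
        if decl.output_value_calc is not None or decl.name in out or decl.required:
            value_of(decl.name)
    return out
```

As the text notes, a calculation may be evaluated more than once in such a scheme (once when requested by another expression and once for its own element); this is harmless because the expression must return the same value each time.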