Skip to content

Latest commit

 

History

History
337 lines (227 loc) · 26.2 KB

GLOSSARY.md

File metadata and controls

337 lines (227 loc) · 26.2 KB

Adjacent

Two addressable units of the input/output stream are adjacent if they are at consecutive positions.

Addressable Unit

This is the unit of storage that makes up the input or output stream holding the representation of the data. The units are bits, bytes, or characters.

Annotation point

A location within a DFDL schema where DFDL annotation elements are allowed to appear.

Applicable Properties

All the DFDL properties that apply to a given type of schema construct. For example, all the DFDL properties that apply to an xs:simpleType.

Array

A set of adjacent elements whose XSD element declaration specifies the potential for it to have more than one occurrence (XSD maxOccurs > '1' or 'unbounded'). Of course, any given array can have any number of element occurrences, including zero elements or exactly 1 element as long as the occurrence constraints are met. If XSD maxOccurs is 'unbounded' then there is no constraint to the maximum number of occurrences, though implementations may have implementation-defined maximum capabilities. An optional element (where XSD maxOccurs is '1', and XSD minOccurs is '0') is not considered to be an array as described in this document. Note that a sequence is not to be confused with an array. A sequence is a complex tuple type for an element; the children of a sequence can be of different types. All elements of an array have the same type and if the type is simple, then they have the same information item members except for the value member.

Array Element

An element declaration or reference with XSD property maxOccurs > '1' or 'unbounded'.

Augmented Infoset

When unparsing one begins with the DFDL schema and conceptually with the logical DFDL Infoset. As the values of items are filled in by defaulting, and by use of the DFDL outputValueCalc property (including on hidden items), these new item values augment the Infoset. The resulting Infoset is called the augmented Infoset.

Binary

There are two meanings for this word depending on context.

  1. Data is divided into two broad categories of representations, which are text and binary. Hence, binary representation includes any kind of non-text representation.
  2. Within binary (not text) data, we distinguish base-10 representations which are called packed decimal, from base-2 representations which are called binary. The common twos-complement representation used for signed integers is a base-2 binary representation.

Binary Representation

Of type xs:hexBinary, or of other type with property dfdl:representation 'binary'. Note that type xs:string can never have binary representation.

Bit Order

Within a binary integer, if the most-significant bit is assigned bit position 1, then the bit order is most-significant-bit first. If the least-significant bit is assigned bit position 1, then the bit order is least-significant-bit first. When the bit order is most-significant-bit first, then the least-significant bit of byte N is considered to be adjacent to the most-significant bit of byte N+1. When the bit order is least-significant-bit first, then the most-significant bit of byte N is considered to be adjacent to the least-significant bit of byte N+1.

Bit Position

The data stream is assumed to be a collection of consecutively numbered unsigned bytes. Each byte is a numeric value from 0 to 255. The bits of a byte are referred to by their numerical significance as the 2_n_ bit, for n from 0 to 7. Hence, the byte value 255 = 27 + 26 + 25 + 24 + 23 + 22 + 21 + 20. The 27-bit is the most-significant bit, and the 20-bit is the least significant bit. The bits within each byte are assigned numbered bit positions 1 to 8 according to the bit order. Given a bit-order, every bit in the data stream has a unique bit position.

Bit String

The ordered set of bits from a first bit with bit position N, to bit position N+M is a bit string of length M bits.

Byte

The term "byte" refers to an 8-bit octet. Can also refer to an integer with value from 0 to 255 inclusive. Hexadecimal digit pairs are commonly used to illustrate byte values.

CCSID

see Coded Character Set Identifier.

Character

An ISO10646 [ISO10646] character having a unique character code as its identifier. This concept is independent of font, typeface, size, and style, so 'F', 'F', 'F', are all the same character 'F'.

Character Code

The canonical integer used to identify a character in the ISO10646 [ISO10646] standards. This number uniquely identifies the character independently of the various ways it is represented by different character set encodings of the character. For example: The '{' character known in Unicode [Unicode] as LEFT CURLY BRACKET has character code U+007B. In both ASCII and UTF-8 character set encodings the representation of this character is as a single byte code point 0x7B. However, in EBCDIC-based character set encodings the representation of this same character code is the single byte code point 0xC0.

Character Set

An abstract set of characters that are assigned (or mapped to) a representation by a particular character set encoding. For most character set encodings their character set is a subset of the Unicode character set.

Character Set Encoding

Often abbreviated to just 'encoding'. A specific representation of a character set as bytes or bits of data. A character set encoding is usually identified by a standard character set encoding name or a recognized alias name, or by a coded character set identifier or CCSID [CCSID]. These identifiers are standardized. The names and aliases are standardized by the IANA [IANA] (where unfortunately, they are called character set names). CCSIDs are an industry standard. Examples of character set encoding names are UTF-8, USASCII, GB2312, ebcdic-cp-it, ISO-8859-5, UTF-16BE, Shift_JIS. There are also additional DFDL standard character set encodings, see DFDL Standard Encoding. The DFDL standard also allows for implementation-defined character set encodings to be supported.

Character Width

The number of code units or alternatively the number of bytes or bits used to represent a character in a specific character set encoding is called the character width. Encodings are either fixed width (all characters encoded using the same width), or variable-width (different characters are encoded using different widths). For example, the UTF-32 character set encoding has 4-byte character width, whereas USASCII has a 1-byte character width. UTF-8 is variable width, and any specific character in that encoding has width 1, 2, 3, or 4 bytes. See also Fixed-Width Character Encoding and Variable-Width Character Encoding.

Code Point

The integer that identifies a character within a character set encoding. A code point is represented by one or more code units. When a character set is fixed width, then there is no distinction between a code unit and a code point. For Unicode character set encodings, there is no distinction between a character code and a code point. Examples:

  • € - character code U+20AC
    • IBM01148 encoding - the code point is 0x9F, and this encoding is fixed width so there is no distinction between the code point 0x9F and the code unit 0x9F that represents the encoded character.
    • UTF-8 encoding - there is no distinction between the character code 0x20AC and the code point 0x20AC. However, it is represented by 3 code units 0xE2 0x82 0xAC

Code Unit

When a character set encoding uses differing variable width representations for characters, the units making up these variable width representations are called code units. For example, the UTF-8 encoding uses between 1 and 4 code units to represent characters, and for UTF-8, the individual code units are single bytes. DFDL's interpretation of the UTF-16 encoding is either fixed or variable width. When format property dfdl:utf16Width 'variable' then UTF-16 is variable width and this encoding uses either one or two code units per character, but in this case each individual code unit is a 16-bit value. When a character set is fixed width, then there is no distinction between a code unit and a code point.

Coded Character Set Identifier

CCSID) - An alternate identifier of a character set encoding. Originally created by IBM, CCSIDs are a broadly used industry standard. See [CCSID].

Component

A construct within a DFDL schema that may contain a DFDL annotation. These constructs include XSD element declarations, type definitions, group definitions, sequence definitions, choice definitions, element references, and group references. DFDL schema annotations are not components of the schema, rather they appear on components of the schema or on the top-level xs:schema element of a schema document.

Content

The bits of the data stream data that are interpreted when parsing to compute the logical value of a simple type, and when unparsing are computed from the logical value for incorporation into the data stream.

Content Model

One of 3 kinds of syntactic structure of XSD element declarations. The DFDL subset of XSD includes only empty, simple, and element-only content models. An XSD element declaration for an element of complex type containing a xs:sequence element is said to have a sequence in its content model. (DFDL’s usage is derived from Section 13.3 of [Walmsley]).

Contiguous

An element has a contiguous representation if all parts of its representation are adjacent in the input/output stream. Most simple types have contiguous representations naturally. Groups containing elements that are themselves contiguous are also considered to have contiguous representations irrespective of alignment fill or padding of any kind that exists within the group. Similarly, arrays of elements that are themselves contiguous are also contiguous. An example of a non-contiguous representation would be a nillable element, where a flag is used to determine whether the element is nil, and the location of that flag is not adjacent to the value representation.

Count

The number of occurrences of an element.

Data Stream

Data where the format is being described by a DFDL schema. Often abbreviated to just “data” for short. This use of 'stream' implies only that there is a numbering scheme that specifies a unique bit position for every bit within the data. This use of 'stream' does not imply anything about whether the data is persistently stored or not, nor does it imply anything about whether there are sequential or random-access capabilities for access to the data. When parsing, the data stream may be referred to as the input stream, and when unparsing the output stream.

DBCS

See Double-Byte Character Set

Decimal

This term is used several different ways distinguished by context:

  1. Base 10. When data has text representation, a decimal number has base-10 digits.
  2. Type xs:decimal - a logical type of number that has an integer component and an optional base-10 fractional component. This type subsumes all integer types, as they are of type xs:decimal but with the further restriction that the fractional part doesn't exist. Note that a base-10 fraction has different rounding properties than a base-2 or floating point numeric fraction; hence, xs:decimal is the type commonly used to represent currency/money in data.
  3. Packed Decimal - A binary data representation. See separate glossary entry below.

Defining Annotations

The annotation elements dfdl:defineFormat, dfdl:defineVariable, and dfdl:defineEscapeScheme

Delimiter

A character or string used to separate, or mark the start and end of, items of data. In DFDL, dfdl:lengthKind 'delimited' scans the data for initiators, separators, and terminators.

Delimiter scanning

When parsing, the process of scanning for a specific item in the input data which either marks the end of an item or the beginning of a subsequent item. Delimiter scanning also takes into account escape schemes so as to allow the delimiters to appear within data if properly escaped.

DFDL

Data Format Description Language

DFDL Infoset

The abstract data structure that must be provided:

  • To an invoking application by a DFDL parser when parsing DFDL-described data using a DFDL Schema;
  • To a DFDL unparser by an invoking application when generating DFDL-described data using a DFDL Schema

DFDL Processor

A program that uses DFDL schemas in order to process data described by them.

DFDL Schema

An XML schema containing DFDL annotations to describe data format and using only the DFDL subset of the XSD language. This includes all included and imported schemas taken together. This also includes both the XSD declarations and definitions and the DFDL definitions provided in the top-level DFDL annotations.

DFDL Standard Encoding

A character set for which there is no IANA name or CCSID but the name and definition of which DFDL implementations must agree on. See Section 33 Appendix D: DFDL Standard Encodings.

Double-Byte Character Set

DBCS) - A character set encoding where each character code consists of one code unit which uses exactly 2 bytes.

Dynamic extent

This is a characteristic of the data stream. When parsing data corresponding to a schema component, the collection of bits within the data stream that contain any aspect of the representation of that schema component make up the component's dynamic extent.

Dynamic scope

This is a characteristic of parts of the DFDL schema. When a definition or declaration contains or references another declaration or definition, then the contained definition or declaration is said to be in the dynamic scope of the enclosing one. The important characteristic of dynamic scoping is that it traverses references. When parsing, the dynamic scope of an element declaration includes all definitions and declarations used as part of parsing that element.

Element

A part of the data described by an element declaration in the schema and presented as an element information item in the Infoset.

Encoding

See Character Set Encoding.

Explicit Properties

The explicit properties are the combination of any defined locally on the annotation and any defined by a dfdl:defineFormat annotation referenced by a local dfdl:ref property.

Fixed-Length Element

An element of specified length where dfdl:lengthKind is 'explicit' but dfdl:length is not an expression, or dfdl:lengthKind is 'implicit' (of simple type only). Note that choice branches where dfdl:choiceLengthKind is 'explicit' are also referred to as ‘fixed-length’ but are not necessarily elements.

Fixed-Width Character Encoding -

character set encoding where all characters are encoded using a single code unit for their representation. Note that a code unit is not necessarily a single byte. It may be more than one byte, or some number of bits less than a byte. Examples of different fixed widths are:

  • 1-byte wide: ASCII, ebcdic-cp-us, iso-8859-1. See also SBCS (Single-Byte Character Set)
  • 2-bytes wide: UTF-16 when dfdl:utf16Width is 'fixed'. See also DBCS (Double-Byte Character Set)
  • 4-bytes wide: UTF-32.
  • 7-bits wide: X-DFDL-US-ASCII-7-BIT-PACKED[51].

Fixed Array Element

An array element where XSD minOccurs is equal to XSD maxOccurs.

Format Annotations

The annotation elements dfdl:format, dfdl:element, dfdl:simpleType, dfdl:group, dfdl:sequence, dfdl:choice, and dfdl:escapeScheme.

Format Property

A DFDL property carried on a DFDL format annotation.

Framing

The term used to describe the delimiters, length fields, and other parts of the data stream which are present and may be necessary to determine the length or position of the content of an element.

Implementation-Defined

Feature*_ - A feature where the implementation has discretion in how it is performed, and the implementation MUST document how it is performed.

Implementation-Dependent

Feature*_ - A feature where the implementation has discretion in how it is performed, but the implementation is not required to document how the feature is performed.

Index

The position of an occurrence in a count, starting at 1.

Infoset

See DFDL Infoset

Item

A DFDL information set consists of a number of # information items r just items for short.

Least-Significant Bit

Often abbreviated to # LSB n a binary integer the least significant bit is the bit having the least place value. Within an 8-bit unsigned byte, the bit with place value 20 (or 1) is the least significant bit.

Length

When discussing data items and their representations, the term 'length' is used to refer to the measure of the size of the representation of an item in units of bits, bytes, or characters. The length of an array is the number of bits, bytes, or characters making up its representation, and has nothing to do with the number of occurrences of the array. Any element occurrence has length. Only array elements and optional elements have numbers of occurrences other than 1.

Lexical scope

In a DFDL Schema document, the lexical scope of any element is the collection of schema declarations, definitions, and annotations contained within the element textually.

Local properties

Local properties are the properties defined on an annotation in either short, attribute or element form

Logical layer

A DFDL Schema with all the DFDL annotations ignored is an ordinary XSD schema. The logical structure described by this XSD is called the DFDL logical layer. The logical layer of a DFDL schema describes the DFDL Infoset of the data format.

Most-Significant Bit

Often abbreviated to # MSB n a binary integer the most significant bit is the bit having the greatest place value. Within an 8-bit unsigned byte, the bit with place value 27 is the most significant bit.

Nibble

4 bits. A single hexadecimal digit (0 to F) is often referred to as a nibble as it can be represented in exactly 4 bits.

Node

The term Node is a shorter equivalent to Element Information Item of the DFDL Infoset described in Section 4.2.2 Element Information Items.

Non-Representation Property

A format property that is not a representation property, specifically dfdl:ref, dfdl:hiddenGroupRef, dfdl:choiceBranchKey, dfdl:choiceDispatchKey, dfdl:inputValueCalc, dfdl:outputValueCalc. See also representation property.

Occurrence

An instance of an element in the data, or an item in the DFDL Infoset.

Optional Element

An element declaration or reference where XSD minOccurs is equal to zero.

Optional Occurrence

An occurrence with an index greater than XSD minOccurs.

Packed Decimal

A physical representation of a decimal and integer numbers where each digit is packed into one nibble (4 bits) of a byte. There are several variants, some also include a sign nibble, and some include a padding nibble. The term covers all the following enums of the dfdl:binaryNumberRep and dfdl:binaryCalendarRep properties – 'packed' (IBM 390 packed), 'bcd' (standard binary coded decimals or BCDs) and 'ibm4690Packed' (IBM 4690 packed).

Parse

To construct an Infoset from the data stream representation of the data, based on its DFDL format description.

Potentially Represented

An element declaration in the schema describes a potentially represented item if that element declaration does not have a dfdl:inputValueCalc property. Whether the element declaration describes an occurrence that is actually represented or not depends on whether the element declaration is for an optional element, and whether the element has a corresponding value in the augmented Infoset.

Physical Layer

A DFDL Schema adds DFDL annotations onto an XSD language schema. The annotations describe the physical representation or physical layer of the data. The physical layer of a DFDL schema describes the representation in the data stream.

Point of Uncertainty

A point of uncertainty occurs in the data stream when there is more than one schema component that might occur based on parsing up to that point. These arise from the xs:choice model group, use of optional and array elements with varying numbers of occurrences, use of unordered sequences, and use of sequences with floating elements.

Representation Property

A format property that is used to describe a physical characteristic of a component. Such a property will apply to one or more grammar regions of the component. See also non-representation property.

Required Element

_ An element declaration or reference where XSD minOccurs is greater than zero.

Required Occurrence

An occurrence with an index less than or equal to XSD minOccurs.

Required Property

A DFDL property that must have a value. The required properties for each xs:schema component are listed in the Property Precedence tables in section 23.

Resolved Set of Annotations

When DFDL annotations appear on

  • a group reference and the global group definition it references
  • an element reference and the global element declaration it references, and any type definition it references.
  • an element declaration and the type definition it references.
  • a simple type definition and the base simple type it references (recursively, if the base simple type also references another base simple type)

then they are combined, and the resulting set of annotations is referred to as the resolved set of annotations for the schema component.

SBCS

See Single Byte Character Set

Scan

Examine the input data looking for delimiters such as separators and terminators or matches to regular expressions.

Single-Byte Character Set

SBCS) - A character set encoding where each character code consists of one code unit which is exactly a single byte (8 bits).

Schema

see DFDL Schema.

Schema Component Designator

SCD) - A notation for referring to one of the components of a DFDL Schema. This is being standardized by W3C. See http://www.w3.org/TR/xmlschema-ref.

Schema Definition Order

The order that the schema components are defined in a schema document.

Specified Length

An item has specified length when dfdl:lengthKind is "implicit" (simple type only), "explicit", or "prefixed".

Speculative Parsing

When the parser reaches a point of uncertainty it attempts to parse each option in turn until one is known-to-exist or known-not-to-exist.

Statement Annotations

The annotation elements dfdl:assert, dfdl:discriminator, dfdl:setVariable, and dfdl:newVariableInstance. Also called DFDL Statements.

Static Analysis

A DFDL Implementation can analyze a DFDL schema and determine the presence of many kinds of errors. This is called static analysis, compilation of the schema, or determining the presence of the error statically.

Surrogate Pair

A Unicode character whose character code value is greater than 0xFFFF can be encoded into variable-width UTF-16BE or UTF-16LE (which are variable-width encodings when the DFDL property dfdl:utf16Width is 'variable'). In this case the representation uses two adjacent code units each of which is called a surrogate, and the pair of which is called a surrogate pair.

Target Length

When unparsing, the length (in dfdl:lengthUnits) of an item's representation is the target length. The length of the content corresponding to a logical data value in the Infoset may be shorter or longer than the target length, in which case padding or truncation may be necessary to make the logical data content conform to the target length. Rules for when padding and truncation occur, and how they are applied are specific to simple data types and are controlled by a number of DFDL format properties.

Text

Consisting of characters in some character set encoding. Normally we think of text data as being human-readable, but many character set encodings contain special control characters that are not human-readable, but we call data containing these text anyway. The dfdl:encoding property is required in order to decode/encode the text.

Text Representation

Of type xs:string, or of other types (except xs:hexBinary) with property dfdl:representation 'text'. Note that type xs:hexBinary never has text representation. This term specifically refers to the representation of the SimpleContent region being textual.

Textual

Of type Text.

Twos-Complement

A very common scheme for representing binary integers within data. A positive integer consisting of N bits is represented as its base-2 absolute value. A negative integer is represented as the complement (all bits inverted) of its absolute value plus 1.

Unicode

A character set defined by the Unicode Consortium, and standardized at the International Standards Organization (ISO) as ISO10646.

Unit

See Addressable Unit.

Unpadded Length

This is the length of the content of an item of the Infoset, prior to any filling or padding which might be introduced due to dfdl:lengthKind "prefixed" or dfdl:lengthKind "explicit". It is equal to or smaller than the target length.

Unparse

e process of recreating the data represntation in a data stream of the Infoset according to its DFDL format description. The terms marshalling, and data serialization are sometimes used, but they connote a sequentiality that is not necessarily the case when using DFDL.

Validity

A DFDL Infoset is said to be valid with respect to a DFDL schema if each Infoset item is valid with respect to its corresponding DFDL schema component. Validity is about the Infoset and the values it holds. It is independent of the data representation when parsing or unparsing. See Section 1.1 , for a list of the specific value checks that are performed when validating a DFDL Infoset against a DFDL schema.

Variable-Width Character Encoding

A character set encoding where characters are encoded using one or more code units for their representation depending on which specific character is being encoded. Examples with their ranges of varying width:

  • 1 to 4 bytes: UTF-8
  • 1 or 2 16-bit code units: UTF-16 when property dfdl:utf16Width is 'variable'
  • 1 or 2 bytes: Shift-JIS

Well-Formed

A data stream is said to be well-formed with respect to a DFDL schema if a DFDL processor can parse the data into a DFDL Infoset, or there exists a DFDL Infoset such that a DFDL processor can unparse to that data stream. The validity of values in the Infoset is not necessary for data to be well-formed.

Width

See Character Width. [51] X-DFDL-US-ASCII-7-BIT-PACKED is a DFDL standard encoding, which uses the US-ASCII characters, but each code unit is stored occupying only 7 bits, not a whole 8-bit-byte. DFDL standard encodings are defined in a separate specification. See Section 33 Appendix D: DFDL Standard Encodings.