Edward Slavich edited this page Jun 23, 2020 · 3 revisions

This page is a work-in-progress plan for updates to the ASDF Standard that will be included in version 2.0.0.

Background

A little bit of history of ASDF schema wanderings

In the beginning all schemas were in asdf-standard, including schemas currently in astropy and gwcs.

astropy.coordinates was changing rapidly and we could not keep up with those changes, which led to many bug reports related to coordinate frames. So we decided to move the "transform" and "coordinate" tag code to astropy. At some point some of the schemas were moved there too because, as far as we recall, they were considered "astropy" or "astronomy" specific (e.g., the coordinate schemas, but not the transform schemas). At that time the WCS schemas were moved to gwcs.

Most problems came from adding or changing attributes in astropy classes. The idea was that by moving the tags to astropy they would be easier to maintain, because a failing tag test would alert us to problems. This sort of worked. Alternatively, we could have improved testing against astropy dev in asdf to keep up with changes in astropy. Another perceived advantage was that an astropy release would be self-contained, since it would ship the versions of supported tags and schemas with the code. Essentially this was a versioning problem, and how to properly handle versioning was poorly understood at the time. Support has improved since then.

In retrospect, was the decision to move tags and schemas the right one? In hindsight, moving the tags was the right solution; moving the schemas to astropy probably was not. Both problems were, in a way, management issues: a lack of resources to support the development. The first could also have been solved or avoided with better testing. There was also a feeling that, until another language was supported, this arrangement would simplify management, and that once more languages were supported we would have to move the schemas again; we did not think carefully about the mechanics of how that would happen.

Discussion of moving schemas

Reasons for moving schemas out of asdf-standard

  • Perception of a more stable ASDF Standard
    • Users may be skeptical of a file format that changes so often
    • We may be discouraging other implementations by appearing unstable
  • Keep very astronomy-specific material out of a more general standard for scientific data
    • Much of the modeling stuff has potential users well outside of astronomy, but it really shouldn't be a main requirement for supporting ASDF
    • An ASDF implementation that does not implement transforms would still be useful
  • Developer convenience
    • Updates to schemas will no longer require release of asdf and asdf-standard before they can be used in other packages
    • The packages that actually implement the tags can include the schemas directly or have them as a dependency, so updates to the schemas can move in lockstep with the objects and tags that serialize them. No updates or new releases of asdf-standard or asdf are needed to achieve this. Think of this as something like the pytest and pytest-plugin ecosystem.
      • Does this raise a policy issue? When someone binds tags to an implementation, do we outline how that binding can be dissociated when someone else wants support in a different language? This is essentially what we are facing with transforms and LSST.

Consequences

  • A user will have to install more dependencies
    • Can this inconvenience be solved by collecting the dependencies into a single meta-package that users can install?
      • e.g. instead of "pip install asdf asdf-transform-schemas asdf-fits-schemas asdf-coordinate-schemas" users can "pip install asdf-astropy" and get everything they might need
        • That's a Python solution, what about other languages? We ought to suggest solutions for common cases.
        • Meta packages in Python can be a real pain
    • Installation of more dependencies should be handled via standard package dependency management (hopefully)
  • The ecosystem of ASDF repositories and packages will become more fragmented
    • Increased maintenance overhead, more combinations of package versions to debug
      • We may need some enhanced schema versioning management, e.g. WCS schemas X support transform schemas (or the equivalent of version-map) Y, coordinates schemas Z and unit schemas U.
  • Astropy will not install these packages by default
    • Ultimately we would like that to happen. But in the meantime, we should have a simple pip mechanism to install all that is needed in one command.
      • Would be kind of neat for the code to do that when trying to use an astropy tool that needs it. XXX is not installed, type i to install it...
  • Perhaps a summary for explicit use cases in a table is worthwhile

Implementation

Options (using transform schemas as an example)

  • Create repository with Python package that installs transform schemas.
    • Astropy lists the new package as an additional optional dependency
      • Users must install "all", or manually install both packages, or maybe astropy would be open to adding an "asdf" extras category
    • Options for non-Python software
      • Include as submodule and package with the software itself somehow
  • Create repository without Python package
    • Astropy incorporates the schemas as a submodule and the files are installed with astropy on the user's system
    • Non-Python software also "vendorizes" the schemas, potential for multiple copies of the same schema to exist on a user's system
  • Schemas not included with software but downloaded over http from some centralized "schema repository"
    • Not desirable to require network connection to open files
      • But software could still "vendorize" the schemas and only hit the http service as a backup
      • Caching would also help
  • Create schema package but also move astropy tag classes to a new package, asdf-astropy
    • astropy users would only need to explicitly install a single dependency
    • other packages (transform schemas, asdf itself) would be hard dependencies of asdf-astropy and would be installed automatically
    • potential advantage to be able to work on schemas and tag code without waiting for astropy releases
    • would need tests in astropy to encourage developers to maintain the tag code even if it lives in a separate package
  • Keep schemas in the ASDF Standard
    • Continue to endure pain around releasing both asdf-standard and asdf to make new schema material available to astropy
    • dependency tree remains simple
  • Consider moving all ASDF related packages to a dedicated github organization.
  • Consider using pipfile and pipenv (although this may be some way in the future)

Open questions

  • How is the new package going to be installed in different environments? How is asdf-standard installed by asdf-cpp, or in a C-only environment?
    • Include the schema repo(s) as submodules
      • Auto-generate header files that contain the schema content as a string?
      • Alternatively install the schemas at some path in the user's system that the executable knows how to find

Potential changes for 2.0.0

Remove transform schemas from the standard

Remove the transform schemas from version_map-2.0.0.yaml. Transform schemas not included in ASDF Standard 1.x will only be available via a new dedicated repository, which will be installable as a Python package.

Consider adding additional transform attributes, serialized in the basic Transform tag, to the base Transform schema.

Things like bounding_box, equivalencies, .... One reason to have them in the schemas is that libraries implementing the standard in other languages are otherwise not aware that these attributes exist, even though they can change the behaviour of the deserialized object. One example is the bounding_box and its use in WCS: currently it is written to file but is not in the schema, which means the LSST asdf conversion code will not take it into account. That particular case may be OK, but the general problem exists.

Support YAML arrays as alternative to numpy arrays in schema fields.

Some schemas already support this, for example affine.

Astropy normally writes these out as numpy arrays. However, a different library may write them out as YAML arrays.

Astropy should be able to read asdf files where values are written as YAML arrays instead of numpy arrays, if the schema validates the file.

For example, an affine transform written as a YAML array, while valid against the affine schema, cannot currently be read by astropy.
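As a sketch of what reader-side support could look like (the helper name is hypothetical, not existing astropy or asdf API), a tag implementation can normalize both representations before constructing the model:

```python
def as_nested_lists(value):
    """Return matrix data as plain nested lists, whether the file stored it
    as an inline YAML array or as an ndarray-backed object.

    An ndarray-like value (e.g. numpy.ndarray or asdf's NDArrayType)
    exposes tolist(); an inline YAML array is already a list.
    """
    if hasattr(value, "tolist"):
        return value.tolist()
    return list(value)
```

With a normalization step like this, the tag code no longer cares which valid representation the writing library chose.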

Move other schemas out of the standard

These are the other (non-transform) ASDF Standard schemas, grouped by URI prefix:

core - schemas for essential asdf objects like ndarray, the top-level node, etc

fits - schema supporting nesting a FITS file inside of an ASDF file

unit - support for units and quantities

time - schema supporting time objects, with "special emphasis ... on supporting time scales that are used in astronomy"

Some of these (e.g., fits) may be candidates for moving out of the ASDF Standard and into their own satellite repositories.

Consider also moving schemas from the following packages into separate repositories:

  • gwcs
  • astropy.coordinates

Upgrade to JSON Schema draft-07

Users have expressed some interest in the new features of JSON Schema draft-07, so we might take this opportunity to designate draft-07 as the ASDF Standard 2.0.0 schema format.

One downside of this change is that draft-07 and draft-04 schemas are mutually incompatible, so all current schemas would need to be updated before they could be used with an ASDF Standard 2.0.0 file.

A potentially troublesome JSON Schema change introduced in draft-06 is that the "integer" type now validates any number with a zero fractional part, so floats like 1.0 will begin validating where they did not before.

Is this a problem since it is a relaxation and presumably won't break reading old files? It is a reasonable interpretation; but does it make it difficult to support in our or other libraries?

Actually it will help with at least one issue we know of in the jwst pipeline, where a spectral order needs to be validated as an integer but comes out of a model as a float, because modeling turns everything into floats.
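The difference between the two drafts can be illustrated with a small pure-Python sketch of the type rules (not actual validator code from any library):

```python
def is_integer_draft04(value):
    """draft-04 rule: only actual integers validate against {"type": "integer"}."""
    return isinstance(value, int) and not isinstance(value, bool)

def is_integer_draft06(value):
    """draft-06+ rule: any number with a zero fractional part also validates."""
    if isinstance(value, bool):
        return False
    return isinstance(value, int) or (isinstance(value, float) and value.is_integer())
```

Under draft-06 and later, 1.0 validates as an integer while 1.5 still does not.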

Drop the tag: URI scheme and use http:// URIs for YAML tagging

We are currently maintaining two parallel URI schemes: http:// URIs that refer to schemas, and tag: URIs that refer to tagged YAML objects. There is a 1:1 mapping between the two sets of URIs. Since YAML supports http:// URIs as tags, we have the option to drop the tag: URIs entirely and just use http:// URIs everywhere. This would remove a source of confusion and mistakes.

The version_map-x.y.z.yaml files in the ASDF Standard would need to be changed in some way, as they currently refer to schemas by tag.
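Because the mapping between the two URI forms is mechanical, an implementation that still needs to read old files could convert between them; a sketch assuming the stsci.edu convention used by the current schemas:

```python
def tag_to_url(tag_uri):
    """Convert a tag: URI to the corresponding http:// schema URI.

    Follows the existing 1:1 convention, e.g.
    tag:stsci.edu:asdf/core/ndarray-1.0.0
      -> http://stsci.edu/schemas/asdf/core/ndarray-1.0.0
    """
    authority, _, path = tag_uri[len("tag:"):].partition(":")
    return "http://{0}/schemas/{1}".format(authority, path)
```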

Incorporate the concept of a "schema collection" into the standard

It is useful to have an overarching version that ties together a group of related schemas – for example, software can read a list of schemas associated with that version and write objects to an ASDF file that validate against schemas in that particular set. For the ASDF core schemas, the ASDF Standard version provides that overarching version. For user-defined schemas, there is currently no solution, and no library support for selecting a particular version of a user-defined schema on write.

Define a format for a schema collection manifest

This will be a YAML file, analogous to the existing version_map-x.y.z.yaml files, that defines a schema collection version. The file will need the following features:

  • Unique id that defines the name and version
    • Could be a similar URI to the schema ids, for example http://stsci.edu/schema_collections/core-1.0.0.yaml
      • Would be used by implementations to allow users to select a particular version of a schema collection
  • List of schemas in the "collection"
    • A list of schema id URIs

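A manifest along these lines might look like the following (the id, URIs, and version numbers are all illustrative, not final):

```yaml
# Hypothetical schema collection manifest; names and URIs are illustrative
id: http://stsci.edu/schema_collections/core-2.0.0
schemas:
  - http://stsci.edu/schemas/asdf/core/asdf-2.0.0
  - http://stsci.edu/schemas/asdf/core/ndarray-2.0.0
```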
Create manifest files for the new 2.0.0 version

If all we do is drop the transform schemas, this will simply contain all of the schema ids of non-transform schemas currently listed in version_map-1.5.0.yaml.

Consider replacing existing version_map files with new-style manifests

Since the version_map files aren't described by the spec, we are free to replace them with the new-style manifest files. This would simplify implementation. We have the choice of creating one manifest for all schemas in the ASDF Standard, or multiple manifests, one per URI prefix like "core", "unit", etc.

Create new metadata section that lists schema collection ids used

Similar to the existing extension metadata section, this would be a list of schema collections used when writing the file. Useful for debugging and providing warnings when support for a given collection is missing.

Drop subclass_metadata from the standard

This is an experimental feature that sought to make serialization of subclasses more convenient by reusing the superclass's schema, with some additional metadata appended to inform the library of which subclass to instantiate. This feature has some drawbacks. For one, the name of a class or subclass is an implementation detail that is meaningless to other ASDF implementations. Another drawback is that by using a generic schema for multiple subclasses, we are not able to validate as strictly as we could with separate schemas – for example, if subclass A requires property "foo", but subclass B does not, we can't make the property required because both objects must validate against the same schema.

These drawbacks may be reason enough to remove subclass_metadata from the standard.

Replace the "extensions" section of the file history with a section for implementation-specific metadata

The "extensions" section of the history object contains a list of AsdfExtension class names used by the Python library when writing the file. This is useful when debugging issues with a file, and enables the Python library to issue warnings when an extension that was used to write the file is missing on read. Since the concept of an "extension" is not defined by the ASDF Standard and is an implementation detail of the Python library, it may not be reasonable to require that other implementations store their metadata in the same structure.

An alternative is to replace "extensions" with a new section for freeform implementation-specific metadata.

Perhaps use a standard convention for library-specific metadata; e.g., some sort of standard prefix?

Support for blosc-style byte transposition and additional compressors

There's been discussion around supporting additional compression modes offered by the blosc library, particularly zstd with blosc's byte transposition filter. Supporting the transposition would require new field(s) in the ASDF block header that describes the compression block size and the fact that the bytes were transposed. We would also need to add a new 4-byte compression code for zstd.

Could we just create a block prefix area for extra metadata that is implicit for that compression scheme? Does it have to be explicit in the block fields? This would allow much more flexible additions in the future without having to keep changing the definition of the block structure.
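For reference, the byte transposition itself is simple; a pure-Python sketch of the idea (real implementations such as blosc operate on fixed-size blocks in optimized C):

```python
def shuffle(data, itemsize):
    """Blosc-style byte transposition: group byte 0 of every element,
    then byte 1, etc.  For arrays of similar values this places similar
    bytes together, which compresses better with codecs like zstd."""
    n = len(data) // itemsize
    return bytes(data[j * itemsize + i] for i in range(itemsize) for j in range(n))

def unshuffle(data, itemsize):
    """Invert shuffle(), restoring the original element-ordered bytes."""
    n = len(data) // itemsize
    return bytes(data[i * n + j] for j in range(n) for i in range(itemsize))
```

This is why the block header would need to record both the fact that bytes were transposed and the itemsize/block size used, or decompression cannot invert the filter.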

Be explicit about behavior of null values in the YAML

The ASDF Standard doesn't specify behavior around null values, but the Python library currently strips out any object key whose value is null. Some users would prefer that keys with null values be preserved. Regardless of which behavior we settle on, we should consider adding language to the ASDF Standard that defines how nulls are to be treated.

This needs some careful thought. There are cases where the absence of something should be taken to imply a certain mode, and we have probably been misusing None. Getting rid of defaults probably makes this easier (e.g., a distinction between a missing attribute, which is handled by the tag code, and a None value, which is preserved in the tree). Yet this raises the question of how extensions document the handling of missing attributes and their defaults, since we cannot use the schema directly (unless we have a special field that describes the behavior the library should have without actually enforcing it through schema validation tools). This is because extensions ought to be language neutral in principle, and someone implementing an extension in a different language needs to know how to handle these issues without being an expert in the original implementation.
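The missing-vs-null distinction in the tag code can be made concrete with a sentinel; a minimal sketch (names hypothetical, not existing asdf API):

```python
_MISSING = object()  # sentinel: "attribute absent" rather than "present but null"

def get_attr(node, key, default):
    """Missing attribute: fall back to the tag code's documented default.
    Explicit YAML null (None): preserve it as a real value in the tree."""
    value = node.get(key, _MISSING)
    return default if value is _MISSING else value
```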

Be explicit about behavior of implementations with regard to default values in the schema

The ASDF Standard doesn't provide explicit guidance on how the "default" annotation in the schemas is to be used. The Python library currently adds default values to the tree where missing on read, and removes values that match the default on write. This feature seems intended to reduce file size when many objects with default values are present. There are some downsides: the files, when viewed independently of the schemas, appear to be missing some of their data (including required fields), and it's not always possible to identify a single default value for objects that are validated against multiple schemas using combiners.

Regardless of which behavior we settle on, we should consider adding language to the ASDF Standard that defines how default values are to be treated.

We talked about the option of removing defaults from the ASDF standard. Can we give that serious consideration? This is linked a bit to the previous item regarding null values.
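The current Python-library behavior described above can be sketched as follows (greatly simplified: real schemas nest and use combiners, which is exactly where identifying a single default breaks down):

```python
def apply_defaults(node, schema):
    """Read side: fill in properties missing from the tree using the
    schema's "default" annotations."""
    for name, subschema in schema.get("properties", {}).items():
        if name not in node and "default" in subschema:
            node[name] = subschema["default"]
    return node

def strip_defaults(node, schema):
    """Write side: drop properties whose value equals the schema default,
    to reduce file size."""
    for name, subschema in schema.get("properties", {}).items():
        if name in node and "default" in subschema and node[name] == subschema["default"]:
            del node[name]
    return node
```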

Be explicit about support for complex YAML keys

The YAML 1.1 spec permits object keys to themselves be objects or arrays, which isn't well supported by Python (since dicts and lists are not hashable). A more serious issue is that complex keys are not at all covered by JSON Schema, since JSON only supports string keys. Consider declaring in the ASDF Standard that object or array keys are not permitted.

Restricting our tree to a subset of YAML would also offer the benefit of a simplified implementation if we ever decide to write our own fast YAML parser.

One option is to require that complex keys be encoded as strings, perhaps with some specification of what is legal. Again, this would require some thought; we would like to stay away from an "anything goes" policy for keys.
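One possible encoding (illustrative only, not a proposal the standard has adopted) is a canonical JSON string, which is deterministic and round-trippable:

```python
import json

def encode_key(key):
    """Encode a complex key (list or dict) as a canonical JSON string so it
    can serve as a plain string object key; leave strings untouched."""
    if isinstance(key, (list, dict)):
        return json.dumps(key, sort_keys=True, separators=(",", ":"))
    return key
```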

Review schema versioning policy

Unlike semver for software, there isn't a clear winner as far as versioning strategies for schemas. https://github.com/snowplow/iglu/wiki/SchemaVer is one option. Review and consider revising the section in the ASDF Standard on schema versioning.

Revise section on "Handling version mismatches"

The ASDF Standard documentation recommends that libraries read later versions of objects than they actually support, for "future-proofing". This may be dangerous, because new data added in later versions of a schema might be discarded by the library if unrecognized, thus corrupting the file when written back out.

We may wish to revise this section to instead recommend against attempting to handle unrecognized versions.

This is where a real URL to refer to might be handy. Your version of the library is trying to read a later version, which might work, or might not. If the version information is accessible, the library could retrieve details about the newer version to see which older versions it is compatible with and decide whether to fail; if not accessible, it fails. Maybe this is a bit too fancy...

Describe the ASDF_STANDARD comment in the ASDF Standard

Our Python library always writes an ASDF_STANDARD comment near the top of each file with the version of the standard that was used. This comment, however, is not described in the ASDF Standard documentation. It is useful to inform implementations of the anticipated structure of the tree, particularly with regard to metadata.

We may wish to merge the ASDF_STANDARD version and the ASDF file format version – it may not be useful to maintain two separate version numbers. In that case this comment can be replaced with ASDF 2.0.0 for new files.
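For context, the first lines of a file written by the Python library currently look like this (version numbers vary by release):

```
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
```

Under the merged-version proposal, the two comment lines would collapse into a single ASDF 2.0.0 line.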

Create a schema for a "number or quantity" object

Nadia pointed out that a schema that anyOf-combines number and quantity would be useful, since this is a common case, particularly in the transform schemas.
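A sketch of such a schema fragment (the quantity tag version shown is illustrative; the real reference would be pinned when the schema is written):

```yaml
# Hypothetical "number or quantity" schema fragment
anyOf:
  - type: number
  - tag: "tag:stsci.edu:asdf/unit/quantity-1.1.0"
```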

Drop style-related properties from the metaschema

The current draft-01 yaml-schema metaschema includes three properties related to the style of the serialized YAML:

propertyOrder – specify the order in which object properties should be written

flowStyle – specify the YAML style for an array or an object

style – specify the YAML style for a string

Detailed review and cleanup of all schemas that are to remain in ASDF Standard

Some of these may represent early ideas that did not turn out to be useful.

core/asdf

  • We can probably drop support for the old history format.

core/column

core/complex

  • It seems odd to store these as strings instead of two numeric fields.

core/constant

Used as a utility to indicate that a value is a literal constant.

  • ???
  • Don't see any evidence of use.

core/extension_metadata

  • Propose to drop this schema for reasons described above

core/externalarray

Allow referencing of array-like objects in external files. These files can be of any type and in any absolute or relative location with respect to the asdf file. Loading of these files into arrays is not handled by asdf.

  • Is this useful?
    • By definition the asdf library won't handle loading the external array, so custom code is always required.
      • Why not keep the schema alongside that custom code?

core/history_entry

core/integer

core/ndarray

core/software

core/subclass_metadata

  • Propose to drop this schema for reasons described above

core/table

unit/defunit

Defines a new unit... The new unit must be defined before any unit tags that use it.

  • The tag class for this schema was never implemented
  • What does it mean to "define before" in a tree?

unit/quantity

  • The current quantity schema permits either a single number quantity or an array. In some cases (maybe most?) users are going to be expecting one or the other and will not actually want both. It may be helpful to provide support for an array of quantities in a separate schema.

unit/unit
