Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed new API for serializing custom types #815

Closed
eslavich opened this issue Jun 26, 2020 · 10 comments
Closed

Proposed new API for serializing custom types #815

eslavich opened this issue Jun 26, 2020 · 10 comments

Comments

@eslavich
Copy link
Contributor

We've been considering a new API for serializing custom types and would like to solicit community input on our proposal. The original motivation for this was a desire to support multiple URI schemes in the same serializer. For example, as we move the transform schemas out of the ASDF standard and into their own package, we would like to change the current tag URIs:

tag:stsci.edu:asdf/transform/add-1.2.0
tag:stsci.edu:asdf/transform/affine-1.3.0
...

to something like this:

http://asdf-format.org/schemas/transform/add-2.0.0
http://asdf-format.org/schemas/transform/affine-2.0.0
...

The current API, ExtensionType, assumes the tag: URI scheme and forces each tag to share the same "organization", "standard", and "name". We thought we'd simplify by replacing those individual class attributes with a single list of full tags. Here's the proposed new API, called AsdfConverter:

https://github.com/asdf-format/asdf/blob/asdf-standard-2.0.0-epic/asdf/converter.py#L28

I would like to point out several changes and consequences of this new design:

  • The name, organization, standard, version, supported_versions, and yaml_tag class attributes have all been collapsed into one attribute, tags, which in the case of the "add" transform schema we'd set like this:
tags = {
  "tag:stsci.edu:asdf/transform/add-1.0.0",
  "tag:stsci.edu:asdf/transform/add-1.1.0",
  "tag:stsci.edu:asdf/transform/add-1.2.0",
  "http://asdf-format.org/schemas/transform/add-2.0.0",
}

This gives developers the flexibility to use any URI scheme they choose, and allows for diverse tags to be handled by the same class.

  • Drop support for tags that aren't explicitly listed in the class attribute. The ExtensionType behavior is to use the serializer with the closest higher version if an exact match is unavailable. Instead, the library would refuse to handle the object if explicit support for the tag is unavailable, since to do otherwise is to risk losing data when the file is written back out.

  • Drop the handle_dynamic_subclasses attribute and its associated feature. This is an experimental feature that writes a Python class name into the YAML so that the correct subclass can be automatically chosen when reconstructing the object on read. This is by definition implementation specific and we would prefer to encourage use of separate tags for these cases.

  • The ExtensionType classes are capable of operating on themselves, that is to say, serializing and deserializing instances of the same class that defines the from_tree and to_tree behavior. In order to handle multiple tag versions, the ExtensionType metaclass would dynamically generate additional class definitions for each version. All this made for a system that was difficult to understand and troubleshoot, so with AsdfConverter we're not permitting self-serialization and are creating instances of the class to handle each tag.

  • Rename from_tree and to_tree to from_yaml_tree and to_yaml_tree. The renaming is to help remind us what each method is intended to do, since in both cases the input and output are both trees.

  • Drop the from_tree_tagged and to_tree_tagged methods. These existed to allow developers to provide custom logic for choosing a tag. Now, to_tree_yaml is permitted to return a tagged object directly.

  • Drop the ctx variable from the from_yaml_tree and to_yaml_tree methods. In most cases, use of the ctx is not necessary -- only converters that deal directly with binary blocks need access to it. We're planning to introduce an additional interface for that type of converter.

  • Drop the requires class attribute. Most implementations will reside in a package that lists required dependencies in the package metadata. Those that don't are well served by raising ImportError from within the from_yaml_tree method.

  • The validators class attribute is not yet implemented for AsdfConverter, but I suspect we'll need it. That feature is used internally to support custom validations on the shape and datatype of the ndarray objects.

@Cadair @CagtayFabry I know you've both implemented support for custom types using the old API -- we'd appreciate hearing any thoughts you have on the new one.

@eslavich eslavich added the ASDF Standard 2.0.0 Label identifying PRs that are part of the ASDF Standard 2.0.0 epic label Jun 26, 2020
@CagtayFabry
Copy link
Contributor

So far we are more or less still prototyping with asdf so changes to the tag URI format are not really an issue up to this point. I like the idea of having more flexibility with handling different tags outside of the current "organization", "standard", and "name" structure.

Some thoughts regarding the proposed changes off the top of my head:

  • I like the idea of moving the to_tree_tagged functionality into the new to_yaml_tree method
  • Using and implementing custom validators is a very powerfull feature of asdf in my eyes so I would really like the functionality to carry over to the new API. Are you considering dropping the feature or simply moving it away from ExtensionType / AsdfConverter ? Not having the validator feature would actually break some of our implementation ideas.

I will have a closer look at the other changes as well.

@eslavich
Copy link
Contributor Author

Using and implementing custom validators is a very powerfull feature of asdf in my eyes so I would really like the functionality to carry over to the new API. Are you considering dropping the feature or simply moving it away from ExtensionType / AsdfConverter ? Not having the validator feature would actually break some of our implementation ideas.

We definitely want keep the feature, dropping it from AsdfConverter was an oversight on my part. One thing that puzzles me though -- the custom validators are applied globally to all tagged objects, so why are they defined as an attribute of ExtensionType? Wouldn't it make more sense to define them on AsdfExtension directly?

On the other hand, maybe it doesn't make sense to apply the custom validators to every object. It's possible that two extension developers will provide custom validators for the same property name with conflicting behavior. If we scoped the custom validator to only the AsdfConverter classes that it was assigned to, then we'd prevent any such conflicts.

@CagtayFabry
Copy link
Contributor

I was about to ask the same question. I noticed that validators get registered globally regardless of which class I assign them to. So defining them on AsdfExtension would be a more natural fit imo.

Regarding different validators per class I also thought about that and it does sound interesting. I am not sure about the current behavior but I doubt this is working with the current implementation.

I must say I really like the current "global" behaviour since I don't have to worry about adding all my validators to all classes individually. Defining them together with the AsdfExtension would feel even more intuitive for me. (also easier to look for)

I guess the most flexible option could look somewhat like the following (no idea if this is possible to implement):

  • validators defined on AsdfExtension trigger on all schemas
  • validators only defined on AsdfConverter will trigger only on the specific classes
  • validators defined on AsdfConverter will override globally defined validators in AsdfExtension for their classes

@CagtayFabry
Copy link
Contributor

One more question regarding the following note @eslavich

  • Drop the ctx variable from the from_yaml_tree and to_yaml_tree methods. In most cases, use of the ctx is not necessary -- only converters that deal directly with binary blocks need access to it. We're planning to introduce an additional interface for that type of converter.

What functionality would the additional interface provide?

One current use case for ctx in to_tree is to switch ctx.set_array_storage(data, "inline") depending on some related class properties. Would that still be possible from within to_yaml_tree?

Also, when dropping ctx would there still be any way to access the type mapping of the AsdfFile (including extensions?)
Those might be of interest if the tag _tag property should be set manually

@eslavich
Copy link
Contributor Author

eslavich commented Jul 6, 2020

What functionality would the additional interface provide?

I haven't given this a lot of thought yet, but I find the current interface unsatisfying. There are potentially multiple ndarray objects that map to the same block, yet the set_array_storage method operates on an ndarray. And how does the user control storage for other types of block-backed objects? I also find the lack of symmetry in serialization/deserialization troubling -- after writing an numpy.ndarray to the tree, you get back asdf.tags.core.ndarray.NDArrayType, which is a proxy wrapper around the ndarray that fails the isinstance(obj, numpy.ndarray) test.

So I'm vaguely hoping that we can devise a new interface that generalizes to other objects with blocks and returns an object of the correct type when the tree is read back in.

One current use case for ctx in to_tree is to switch ctx.set_array_storage(data, "inline") depending on some related class properties. Would that still be possible from within to_yaml_tree?

We definitely want to support that feature. At the very least it'll be possible using the advanced/extended interface (call it AsdfExtendedConverter for the sake of discussion).

Also, when dropping ctx would there still be any way to access the type mapping of the AsdfFile (including extensions?)
Those might be of interest if the tag _tag property should be set manually

I think I see the issue -- we might know that we want to manually set the tag to http://asdf-format.org/foo-x.y.z, but we don't know how to choose the correct version x.y.z without information from the type mapping. Do I understand correctly?

Removing ctx may not be worth the trouble. I thought it'd be useful to be able to easily distinguish straightforward converters from those that may be engaged in funny business like manipulating the block index.

@CagtayFabry
Copy link
Contributor

I think I see the issue -- we might know that we want to manually set the tag to http://asdf-format.org/foo-x.y.z, but we don't know how to choose the correct version x.y.z without information from the type mapping. Do I understand correctly?

yes, deciding on the correct version number was pretty much the same problematic use case I was thinking of as well

Thank you for your explanations

@eslavich eslavich removed the ASDF Standard 2.0.0 Label identifying PRs that are part of the ASDF Standard 2.0.0 epic label Jul 13, 2020
@eslavich
Copy link
Contributor Author

I think I have a good solution for the ctx variable: allow to_yaml_tree and from_yaml_tree implementations to accept either 1 or 2 arguments. I'm also planning to add back in the validators property.

I'll link back to this issue when I open a PR.

@CagtayFabry
Copy link
Contributor

CagtayFabry commented Aug 10, 2020

Just to have this here:

I think some way to integrate the ideas behind asdf-format/asdf-standard#271 / #858 into the tags attribute somewhere down the line could be really nice

@eslavich
Copy link
Contributor Author

I think some way to integrate the ideas behind asdf-format/asdf-standard#271 / #858 into the tags attribute somewhere down the line could be really nice

That feature would help us a lot, I think. We did a test drive of the proposed API and one class in particular ended up with a long clumsy list of tags:

https://github.com/astropy/asdf-astropy/blob/master/asdf_astropy/tags/transform/compound.py#L38

That list would be lot nicer with wildcards:

tags = {
    "asdf://asdf-format.org/extensions/transform/tags/add-*",
    "asdf://asdf-format.org/extensions/transform/tags/subtract-*",
    "asdf://asdf-format.org/extensions/transform/tags/multiply-*",
    "asdf://asdf-format.org/extensions/transform/tags/divide-*",
    "asdf://asdf-format.org/extensions/transform/tags/power-*",
    "asdf://asdf-format.org/extensions/transform/tags/compose-*",
    "asdf://asdf-format.org/extensions/transform/tags/concatenate-*",
    "asdf://asdf-format.org/extensions/transform/tags/fix_inputs-*",

    "tag:stsci.edu:asdf/transform/add-*",
    "tag:stsci.edu:asdf/transform/subtract-*",
    "tag:stsci.edu:asdf/transform/multiply-*",
    "tag:stsci.edu:asdf/transform/divide-*",
    "tag:stsci.edu:asdf/transform/power-*",
    "tag:stsci.edu:asdf/transform/compose-*",
    "tag:stsci.edu:asdf/transform/concatenate-*",
    "tag:stsci.edu:asdf/transform/fix_inputs-*",
}

@eslavich
Copy link
Contributor Author

Here's another iteration on the converter API:

class Converter(abc.ABC):

I ended up bringing back the ctx argument, but assigning it an instance of a new class called SerializationContext instead of the AsdfFile itself. The block manager isn't available yet (so it's not yet possible to call set_array_storage, but it will be eventually). I'm trying to avoid passing around the AsdfFile because AsdfFile.write_to operations aren't supposed to modify the file, so I don't want code in the new converters to be able to manipulate it directly.

Tag patterns are permitted in the converter tags property.

I've been thinking more about the validators and my latest idea is to define them on the extension itself, but optionally allow a list of tag URI patterns in case developers want to restrict their use to certain tags. The interface might look something like this:

class Validator:
  @property
  def name(self):
    # Return the name of the schema property that triggers the validation.

  @property
  def tags(self):
    # Optional list of tag URIs or URI patterns that the validator should
    # be used with.

  def validate(self, schema_value, instance, tag, ctx):
    # schema_value: the value assigned to the schema property
    # instance: the YAML object to be validated
    # tag: the YAML object's tag
    # ctx: the same SerializationContext object as passed to the converters
    #
    # Implementation is expected to raise asdf.exceptions.ValidationError
    # if the instance fails validation.

My goal is to abstract away the jsonschema library so that we can swap in a faster implementation someday. Validation is currently the slowest part of opening an ASDF file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants