[not for merge] Parsl serializer plugins development #2718

Draft: wants to merge 87 commits into master
Conversation

@benclifford (Collaborator) commented May 19, 2023

Description

I'm using this as an excuse to explore a serialiser plugin API for external serialisers - that sort of behaviour is half-implemented in parsl already, in so much as new serialiser subclasses can be implemented, but it is not well exposed to users at all. For example, identifiers are forced to be three bytes and are not amenable to shared namespaces, and the order in which serializers are registered (when defined across multiple files) depends on the order in which imports are executed, which in Python can be very subtle.

The tests all pass for this.

Memoization needed some changes, as it uses parsl.serialize to serialize values into memo identifiers - which is not a suitable use for proxystore's proxy objects. This PR switches to using one of the usual serializers for that, which is probably a better thing to do in the presence of pluggable serializers; there is still a pluggable API for adding new id_for_memo implementations...
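A minimal sketch of that idea, using hypothetical names rather than parsl's actual memoization internals: memo keys are derived from a fixed, boring serializer (plain pickle here), so changing the pluggable task serializers never changes memoization results.

```python
import hashlib
import pickle
from functools import singledispatch


@singledispatch
def id_for_memo(obj):
    # Default: identify an object by a plain-pickle byte serialization,
    # independent of whichever pluggable serializers are configured for tasks.
    return pickle.dumps(obj)


@id_for_memo.register(list)
def _(lst):
    # Identify a list by the identities of its members, so container contents
    # are handled recursively rather than pickled wholesale.
    return pickle.dumps([id_for_memo(x) for x in lst])


def memo_key(func_name: str, args: tuple, kwargs: dict) -> str:
    # Hash the deterministic byte identities to form the memo table key.
    h = hashlib.sha256()
    h.update(func_name.encode())
    h.update(id_for_memo(list(args)))
    h.update(id_for_memo(sorted(kwargs.items())))
    return h.hexdigest()
```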

benclifford and others added 25 commits May 19, 2023 11:22
…uired to return consistent results ii) changing serialiser order/configuration between runs would mean memoisation gets different values between runs for the same input structure
…d_for_memo.py that is still failing, for reasons i do not understand
This mirrors a removal in Globus Compute, which originally contributed this
code as a fork of the funcX serializer:

PR globus/globus-compute#1153
commit (in funcX) 955e666fe6279571bf6024c239fc789595d8e206
Prior to this PR, the serialization header was parsed twice during
deserialization, and in two different ways, leading to an awkward header
format: the header had to be exactly three bytes long for one (length-based)
parser, and the third byte had to be a \n for the other (newline-based)
parser.

This PR removes the length based parser, instead allowing an arbitrary
length \n-terminated header.
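A sketch of the two parsing styles, with made-up identifier bytes (the real registry identifiers and payload layout live in parsl.serialize and may differ in detail):

```python
# Each serialized payload carries a header identifying its deserializer,
# followed by the serialized body.
PAYLOAD_OLD = b"01\n" + b"pickled-bytes"                        # 3-byte header
PAYLOAD_NEW = b"org.example.myserializer\n" + b"pickled-bytes"  # arbitrary-length header


def parse_header_length_based(payload: bytes):
    # Old style: the header is assumed to be exactly three bytes, the last of
    # which must be a newline.
    header, body = payload[:3], payload[3:]
    assert header.endswith(b"\n")
    return header[:2], body


def parse_header_newline_based(payload: bytes):
    # New style: the identifier is whatever precedes the first newline.
    identifier, _, body = payload.partition(b"\n")
    return identifier, body


assert parse_header_length_based(PAYLOAD_OLD) == (b"01", b"pickled-bytes")
assert parse_header_newline_based(PAYLOAD_NEW) == (b"org.example.myserializer", b"pickled-bytes")
```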

Backwards/forwards compatibility:
The parsl serialization wire format is not a user exposed interface, as parsl
requires the same version of parsl to be installed at all locations in a
parsl deployment.

This PR does not change any on-the-wire byte sequences when used with the two
existing de*serializers, but opens the opportunity for future implementations
to use more than two bytes for identifiers.

Performance:
I made a basic benchmark of serializing and deserializing a few thousand
integers (deliberately using a simple object like `int` so that the object
processing would be quite small, allowing changes in header time more chance
to show up). My main concern with this PR was not making things
noticeably worse, rather than improving performance.
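For concreteness, a round trip of that flavour could look like the following minimal sketch, which assumes parsl.serialize's serialize/deserialize functions; the benchmark actually used for the numbers below may have differed:

```python
import timeit

from parsl.serialize import serialize, deserialize

N = 10_000
total = timeit.timeit(lambda: deserialize(serialize(7)), number=N)
print(f"serialize->deserialize round trip: {total / N * 1e9:.0f} ns")
```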

After this PR, a serialization->deserialization round trip is about 200ns
faster (1165ns before vs 965ns after), so I am happy that this PR does not
damage performance.

Future:
This is part of ongoing work to introduce user pluggable de*serializers.
… benc-prototype-proxystore

 Conflicts:
	parsl/serialize/facade.py
benclifford added a commit that referenced this pull request Jul 5, 2023
TODO: (broken) make a test that runs using a custom serializer to an htex worker... to check that remote processes can load a custom serializer.

prior to this PR, serializer registration happened in order of class definition, which in the case of a single concretes file is the order of the serializers inside that file.

this does not work well when trying to define other serializers in other files, because the order in which files are imported can change subtly and as a result of distant effects.

so doing serialiser registration in a subclass hook is not a very nice thing to do here...

this PR should move to a model which defines the default de*serializers explicitly, and allows the user to modify that list explicitly too, to add in new de*serializers.
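A minimal sketch of that explicit-list model, with hypothetical class and function names rather than parsl's current API:

```python
import pickle

import dill


class PickleSerializer:
    identifier = b"01"  # made-up identifier bytes
    def serialize(self, obj) -> bytes:
        return pickle.dumps(obj)


class DillSerializer:
    identifier = b"02"
    def serialize(self, obj) -> bytes:
        return dill.dumps(obj)


# Defaults are constructed explicitly, in an explicit order, rather than being
# registered as a side effect of class definition / import order.
methods_for_data = [PickleSerializer(), DillSerializer()]


def register_serializer(serializer, index: int = 0) -> None:
    """Let a workflow insert a plugged-in serializer, typically at the front
    so it is tried before the defaults."""
    methods_for_data.insert(index, serializer)
```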

limitations:

this is part of work needed to allow serializer plugins, but not all of it, as it does not define how a remote worker will discover plugged-in serializers - for that, perhaps, the header should become something importable, instead of a registry-based ID byte string?
contrasting with that: proxystore doesn't need its own deserializer! it can use the existing pickle/picklecode deserialisers, because what it generates is a pickled object... and that's a legitimate use case... which suggests that serializers don't need to be indexed by "serializer" ID at all on the serializing side, but only on the deserializing side...

this API also will not easily allow serializer methods to be defined for values returned from workers to the submit side: the expected place of registration, near the start of the user workflow, does not have a corresponding place on the worker side.

this PR separates out serialization behaviour (object -> bytestream) from deserialization
behaviour ((identifier, bytestream) -> object), because a serializer might not (and in this
prototype, *cannot*) implement custom deserialization, due to non-registration on the
remote side, and can instead only generate dill/pickle-targeted bytestreams
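A sketch of that split, with hypothetical names (the real facade lives in parsl/serialize/facade.py): a plugged-in serializer controls object -> bytestream on the submit side, but tags its output with an existing deserializer's identifier, so the remote side needs no plugin at all.

```python
import pickle

PICKLE_ID = b"01"  # stands in for the identifier of the stock pickle deserializer


class ProxyingSerializer:
    """Example plugin: wraps objects before pickling them. Deserialization is
    plain pickle, so only the submitting process needs this class."""
    def serialize(self, obj) -> bytes:
        wrapped = ("wrapped", obj)  # stand-in for e.g. a proxystore proxy
        return PICKLE_ID + b"\n" + pickle.dumps(wrapped)


def deserialize(payload: bytes):
    # Deserializer lookup happens only by identifier; the serializing-side
    # plugin never needs to be importable here.
    identifier, _, body = payload.partition(b"\n")
    if identifier == PICKLE_ID:
        return pickle.loads(body)
    raise ValueError(f"no deserializer registered for {identifier!r}")


assert deserialize(ProxyingSerializer().serialize([1, 2, 3])) == ("wrapped", [1, 2, 3])
```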


the motivating use case for this basic serializer plugin API is supporting proxystore, prototyped in
#2718
https://labs.globus.org/projects/proxystore.html

this PR could also contain an optional proxystore plugin that would need to be activated by a user

loading remote serialisers via importlib is a lot of stuff... perhaps i should... just pickle the deserializer and send it (and let pickle deal with that?)
- i guess there are two paths here: one is loading remote deserialisers, and the other is focusing on the proxystore use case, which, differently, wants access to the pickle deserializer remotely - and so that drives clearer separation of serializers vs deserialisers... that 2nd case is probably what i should concentrate on...

right now this does module loading in order to work with the class based serializer design inherited from before... but maybe that's not how it should end up...

perhaps just serializer callables and deserializer callables? or based around modules?

it seems nice to be able to configure the serializer (aka make a class or a partially applied callable) for configuring proxystore... but unclear how that would work for remote end configuration.
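As a sketch of the partially-applied-callable option (hypothetical names; a real plugin would offload to proxystore rather than the placeholder branch below):

```python
import functools
import pickle


def threshold_serializer(obj, *, min_size: int) -> bytes:
    """Serialize with pickle, but flag objects whose pickled form exceeds
    min_size; a real plugin would proxy/offload those instead."""
    data = pickle.dumps(obj)
    if len(data) >= min_size:
        pass  # placeholder: offload to an object store and return a reference
    return data


# Configuration happens by partial application on the submit side...
my_serializer = functools.partial(threshold_serializer, min_size=1_000_000)

# ...and the resulting callable is what would be handed to a (hypothetical)
# register_serializer() call; how the remote end gets configured is the open
# question discussed above.
payload = my_serializer({"x": 1})
```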

TODO: as an example plug in, also show 'serpent' as a serialiser/deserialiser pair
this PR also contains an optional proxystore plugin that would need to be activated by a user,
and an attempt at an in-pickle (recursive style) proxystore plugin that is policy-aware and can
proxy objects deeper inside lists... to deal with proxying only the interesting objects deep inside
object graphs - which is what actually happens with regular argument lists when they're non-trivial.
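A rough sketch of that recursive, policy-aware idea (entirely hypothetical; a real implementation would create proxystore proxies rather than the placeholder tuple used here):

```python
from typing import Any, Callable


def make_proxy(obj: Any) -> Any:
    # Placeholder for something like a proxystore proxy of obj.
    return ("proxy-placeholder", id(obj))


def proxy_interesting(obj: Any, policy: Callable[[Any], bool]) -> Any:
    """Walk an argument structure and replace only objects the policy selects
    with proxies, leaving the surrounding lists/dicts intact for normal pickling."""
    if policy(obj):
        return make_proxy(obj)
    if isinstance(obj, list):
        return [proxy_interesting(x, policy) for x in obj]
    if isinstance(obj, dict):
        return {k: proxy_interesting(v, policy) for k, v in obj.items()}
    return obj


# e.g. only proxy large byte buffers buried deep inside an argument list:
args = [1, "small", [b"x" * 10_000_000, {"k": b"y" * 5}]]
proxied = proxy_interesting(args, policy=lambda o: isinstance(o, bytes) and len(o) > 1024)
```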

…ystore' into benc-prototype-proxystore

Conflicts:
	parsl/serialize/facade.py
…ystore

Conflicts:
	parsl/dataflow/memoization.py
	parsl/serialize/facade.py