[not for merge] Parsl serializer plugins development #2718
Draft: benclifford wants to merge 87 commits into master from benc-prototype-proxystore
Conversation
…uired to return consistent results; ii) changing serialiser order/configuration between runs would mean memoisation gets different values between runs for the same input structure
…d_for_memo.py that is still failing, for reasons i do not understand
This mirrors a removal in Globus Compute, which originally contributed this code as a fork of the funcX serializer: see PR globus/globus-compute#1153 and funcX commit 955e666fe6279571bf6024c239fc789595d8e206.
Prior to this PR, the serialization header was parsed twice during deserialization, and in two different ways, leading to an awkward header format: the header had to be exactly three bytes long for one (length-based) parser, and the third byte had to be a \n for the other (newline-based) parser. This PR removes the length-based parser, instead allowing an arbitrary-length \n-terminated header.

Backwards/forwards compatibility: the parsl serialization wire format is not a user-exposed interface, as parsl requires the same version of parsl to be installed at all locations in a parsl deployment. This PR does not change any on-the-wire byte sequences when used with the two existing de*serializers, but opens the opportunity for future implementations to use more than two bytes for identifiers.

Performance: I made a basic benchmark of serializing and deserializing a few thousand integers (deliberately using a simple object like `int` so that the object processing would be quite small, giving changes in header handling time more chance to show up). My main concern with this PR was to avoid making things noticeably worse, not to improve performance. After this PR, a serialization->deserialization round trip is about 200ns faster (1165ns before vs 965ns after), so I am happy that this PR does not damage performance.

Future: this is part of ongoing work to introduce user-pluggable de*serializers.
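To make the header change concrete, here is a minimal sketch, using hypothetical helper names rather than Parsl's actual parsing code, of the old fixed three-byte header versus an arbitrary-length \n-terminated header:

```python
# Illustrative sketch only; function names and framing are assumptions,
# not Parsl's real implementation.

def split_fixed_header(payload: bytes) -> tuple[bytes, bytes]:
    # Old style: the header is exactly three bytes, and the third byte
    # must be b"\n", limiting identifiers to two bytes.
    header, body = payload[:3], payload[3:]
    assert header.endswith(b"\n"), "third byte must be a newline"
    return header[:2], body

def split_newline_header(payload: bytes) -> tuple[bytes, bytes]:
    # New style: an arbitrary-length identifier terminated by b"\n",
    # leaving room for longer identifiers in future serializers.
    identifier, _, body = payload.partition(b"\n")
    return identifier, body

# Both parsers agree on the existing two-byte identifiers, so the
# on-the-wire format is unchanged for the current de*serializers.
assert split_fixed_header(b"01\nDATA") == split_newline_header(b"01\nDATA")
```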
… benc-prototype-proxystore Conflicts: parsl/serialize/facade.py
…hods' into benc-prototype-proxystore
…-prototype-proxystore
benclifford added a commit that referenced this pull request on Jul 5, 2023
TODO: (broken) make a test that runs using a custom serializer to an htex worker, to check that remote processes can load a custom serializer.

Prior to this PR, serializer registration happened in order of class definition, which in the case of a single concretes file is the order of the serializers inside that file. This does not work well when trying to define other serializers in other files, because the order of import of files can change subtly and because of distant effects. So doing serializer registration in a subclass hook is not a very nice thing to do here: this PR should move to a model which defines the default de*serializers explicitly, and allows the user to modify that list explicitly too, to add in new de*serializers.

Limitations: this is part of the work needed to allow serializer plugins, but not all of it, as it does not define how a remote worker will discover plugged-in serializers. For that, perhaps the header should become something importable, instead of a registry-based ID byte string. Contrasting with that: proxystore doesn't need its own deserializer! It can use the existing pickle/picklecode deserializers, because what it generates is a pickled object, and that's a legitimate use case, which suggests that serializers don't need to be indexed by "serializer" ID at all on the serializing side, but only on the deserializing side. This API also will not easily allow serializer methods to be defined for values returned from workers to the submit side: the expected place of registration, near the start of the user workflow, does not have a corresponding place on the worker side.

This PR separates out serialization behaviour (object -> bytestream) from deserialization behaviour ((identifier, bytestream) -> object), because a serializer might not (and in this prototype, *cannot*) implement custom deserialization, due to non-registration on the remote side, and instead can only generate dill/pickle-targeted bytestreams.

The motivating use case for this basic serializer plugin API is supporting proxystore, prototyped in #2718 https://labs.globus.org/projects/proxystore.html This PR also contains an optional proxystore plugin that would need to be activated by a user, and an attempt at an in-pickle (recursive style) proxystore plugin that is policy-aware and can proxy objects deeper inside lists, to deal with proxying only interesting objects deep inside graphs, which is actually what happens with regular argument lists when they're non-trivial.

Loading remote serializers via importlib is a lot of stuff; perhaps I should just pickle the deserializer and send it (and let pickle deal with that?). I guess there are two paths here: one is loading remote deserializers, and the other is focusing on the proxystore use case, which, differently, wants access to the pickle deserializer remotely, and so that drives clearer separation of serializers vs deserializers. That second case is probably what I should concentrate on. Right now this does module loading in order to work with the class-based serializer design inherited from before, but maybe that's not how it should end up: perhaps just serializer callables and deserializer callables? Or based around modules? It seems nice to be able to configure the serializer (aka make a class or a partially applied callable) for configuring proxystore, but it is unclear how that would work for remote end configuration.

TODO: as an example plug-in, also show 'serpent' as a serializer/deserializer pair.
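As a rough illustration of the model described above, here is a hypothetical sketch, not Parsl's actual facade API, of an explicit, user-modifiable serializer list on the submitting side and an identifier-keyed deserializer table on the receiving side:

```python
# Hypothetical sketch of the "explicit list, split serialize/deserialize"
# model; class and function names are assumptions, not Parsl's real code.
import pickle
from typing import Any, Callable

class SerializerPlugin:
    identifier: bytes  # header written on the wire, e.g. b"01"

    def serialize(self, obj: Any) -> bytes:
        raise NotImplementedError

class PickleSerializer(SerializerPlugin):
    identifier = b"01"

    def serialize(self, obj: Any) -> bytes:
        return pickle.dumps(obj)

# Serializers are tried in an explicit order on the submitting side;
# deserializers are looked up by identifier on the remote side, so a
# plugin such as proxystore can emit pickle-targeted bytestreams without
# registering anything remotely.
serializers: list[SerializerPlugin] = [PickleSerializer()]
deserializers: dict[bytes, Callable[[bytes], Any]] = {b"01": pickle.loads}

def serialize(obj: Any) -> bytes:
    for s in serializers:
        try:
            return s.identifier + b"\n" + s.serialize(obj)
        except Exception:
            continue  # fall through to the next registered serializer
    raise TypeError(f"no serializer accepted {type(obj)}")

def deserialize(payload: bytes) -> Any:
    identifier, _, body = payload.partition(b"\n")
    return deserializers[identifier](body)

assert deserialize(serialize([1, 2, 3])) == [1, 2, 3]
```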
…ystore' into benc-prototype-proxystore Conflicts: parsl/serialize/facade.py
…ystore Conflicts: parsl/dataflow/memoization.py parsl/serialize/facade.py
…ystore Conflicts: Makefile
Description
I'm using this as an excuse to explore a serializer plugin API for external serializers. That sort of behaviour is half-implemented in parsl already, insofar as new serializer subclasses can be implemented, but it is not well exposed to users at all: for example, identifiers are forced to be three bytes and are not amenable to shared namespaces, and the order in which serializers are registered (when defined in multiple files) depends on the order in which imports are executed (which in Python can be very subtle), as the sketch below illustrates.
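For illustration, a minimal sketch of the fragile registration pattern being criticised here, using a hypothetical registry rather than Parsl's actual classes:

```python
# Sketch of import-order-dependent registration (not Parsl's real code):
# registering serializers as a side effect of class definition means the
# registry's order is whatever order the defining modules happen to be
# imported in.

REGISTRY = []

class AutoRegistered:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY.append(cls)  # order depends on module import order

class PickleSerializer(AutoRegistered): ...
class DillSerializer(AutoRegistered): ...

# If a third-party module defining another subclass is imported earlier or
# later (perhaps indirectly, via an unrelated import), REGISTRY changes
# order, and with it which serializer gets first chance at each object.
print([c.__name__ for c in REGISTRY])
```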
The tests all pass for this.
Memoization needed some changes, as it uses parsl.serialize to serialize values for memo identifiers, which is not a suitable use for proxystore's proxy objects. This PR switches to using one of the usual serializers for that purpose, which is probably a better thing to do in the presence of pluggable serializers; there is still a pluggable API to put in new id_for_memo implementations (see the sketch below).
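A minimal sketch of that idea, with a hypothetical id_for_memo that always uses a fixed pickle-based serializer for the memo key; this is an illustration of the reasoning, not Parsl's actual id_for_memo implementation:

```python
# Illustrative only: the memo key must be stable across runs, so it is
# computed with one fixed serializer rather than whichever pluggable
# serializer (e.g. a proxystore proxy) handled the task's arguments.
import hashlib
import pickle
from typing import Any

def id_for_memo(value: Any) -> bytes:
    # Always pickle for the memo key: a proxy object's serialized form
    # could differ between runs even when the underlying value is the same.
    return hashlib.sha256(pickle.dumps(value)).digest()

assert id_for_memo({"x": 1}) == id_for_memo({"x": 1})
```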