Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad decoding of exception in radical error handling path #3328

Closed
benclifford opened this issue Apr 9, 2024 · 10 comments
Closed

Bad decoding of exception in radical error handling path #3328

benclifford opened this issue Apr 9, 2024 · 10 comments
Labels

Comments

@benclifford
Copy link
Collaborator

Describe the bug

Here's an error I see in CI (when working on some stderr behaviour elsewhere). I think it looks like one piece of code is serializing an exception as a bytestring of the repr of the exception, but another piece of the code is expecting to unserialize it as base64.


================================================================= FAILURES ==================================================================
___________________________________________________________ test_bad_stderr_file ____________________________________________________________

self = RadicalPilotExecutor(
    'local.localhost',
    bulk_mode=False, 
    label='RPEXBulk', 
    rpex_cfg='/home/benc/par...est-current/runinfo/000/rpex.cfg', 
    working_dir=PosixPath('/home/benc/parsl/src/parsl/.pytest/parsltest-current')
)
parsl_task = <Future at 0x7ba1f02736e0 state=finished raised DeserializationError>, exception = "PermissionError(13, 'Permission denied')"

    def _unpack_and_set_parsl_exception(self, parsl_task, exception):
        try:
>           s = rp.utils.deserialize_bson(exception)

parsl/executors/radical/executor.py:416: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../virtualenv-3.12/lib/python3.12/site-packages/radical/pilot/utils/serializer.py:101: in deserialize_bson
    return pickle.loads(codecs.decode(obj.encode(), "base64"))
/usr/local/lib/python3.12/encodings/base64_codec.py:19: in base64_decode
    return (base64.decodebytes(input), len(input))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

s = b"PermissionError(13, 'Permission denied')"

    def decodebytes(s):
        """Decode a bytestring of base-64 data into a bytes object."""
        _input_type_check(s)
>       return binascii.a2b_base64(s)
E       binascii.Error: Invalid base64-encoded string: number of data characters (33) cannot be 1 more than a multiple of 4
E       decoding with 'base64' codec failed

cc @AymenFJA

To Reproduce

pytest parsl/tests/test_bash_apps/test_stdout.py::test_bad_stderr_file -k "not cleannet" --config parsl/tests/configs/local_radical.py

Expected behavior
This should be raising the relevant exception (the PermissionError) up to the user program, rather than raising a DeserializationError.

Environment
my laptop
parsl master 13c3147

$ pip list |grep radic
radical.gtod                  1.47.0
radical.pilot                 1.47.0
radical.saga                  1.47.0
radical.utils                 1.48.0
@benclifford benclifford added the bug label Apr 9, 2024
benclifford added a commit that referenced this issue Apr 9, 2024
This marker originally referred to issue #363 for tests which should not be
expected to pass if running in an environment where stderr and stdout are not
staged back to the submitting system by the executor.

Since it was created, it then expanded in use to more generally refer to any
test which had a problem in environments which did not properly make either
output files or stdout/stderr available on the submitting system, as a
separate behaviour from the core of parsl behaving properly. An example of
that would be running in a multi-site environment without working file
staging.

However, since all that happened, the feel in the Parsl community is that
running with a shared filesystem is the core supported usage mode of Parsl;
and also since then, notions of file staging have become much stronger,
both in Parsl core and in individual executors, making it a much more
reasonable test requirement that if you make a file, it can be expected to
be available for the test to inspect - and that if you are in an environment
that does not support that, you should not expect the test suite (or Parsl)
to work.

Because of all of that, this PR removes the issue363 test marker entirely.
It was removed from Work Queue and Task Vine test environments in PR #3327,
leaving only the Radical-PILOT test environment setting that.

It turns out only two tests don't work in the Radical-PILOT environment
when issue363 is removed:

One requires support for the slightly awkward "stdout/err opening modes,
specified by tuples" feature, and this PR labels and disables that test
using a new 'executor_supports_std_stream_tuples' label. (In general, an
executor which does staging of stdout/err streams is likely to not support
the features offered by std stream tuples, such as appending to stderr
files from multiple task - so this label is not Radical-PILOT specific)

Another test looks like it is broken exception handling in the Radical-PILOT
executor, and so it is labelled with a new issue specific label, relating to
the bug report for that issue - issue #3328, issue3328.
benclifford added a commit that referenced this issue Apr 9, 2024
This marker originally referred to issue #363 for tests which should not be
expected to pass if running in an environment where stderr and stdout are not
staged back to the submitting system by the executor.

Since it was created, it then expanded in use to more generally refer to any
test which had a problem in environments which did not properly make either
output files or stdout/stderr available on the submitting system, as a
separate behaviour from the core of parsl behaving properly. An example of
that would be running in a multi-site environment without working file
staging.

However, since all that happened, the feel in the Parsl community is that
running with a shared filesystem is the core supported usage mode of Parsl;
and also since then, notions of file staging have become much stronger,
both in Parsl core and in individual executors, making it a much more
reasonable test requirement that if you make a file, it can be expected to
be available for the test to inspect - and that if you are in an environment
that does not support that, you should not expect the test suite (or Parsl)
to work.

Because of all of that, this PR removes the issue363 test marker entirely.
It was removed from Work Queue and Task Vine test environments in PR #3327,
leaving only the Radical-PILOT test environment setting that.

It turns out only two tests don't work in the Radical-PILOT environment
when issue363 is removed:

One requires support for the slightly awkward "stdout/err opening modes,
specified by tuples" feature, and this PR labels and disables that test
using a new 'executor_supports_std_stream_tuples' label. (In general, an
executor which does staging of stdout/err streams is likely to not support
the features offered by std stream tuples, such as appending to stderr
files from multiple task - so this label is not Radical-PILOT specific)

Another test looks like it is broken exception handling in the Radical-PILOT
executor, and so it is labelled with a new issue specific label, relating to
the bug report for that issue - issue #3328, issue3328.
@andre-merzky
Copy link

andre-merzky commented Apr 29, 2024

@AymenFJA
The task's exception field is indeed just repr(exception), there is no bson involved (unless the rp.pytask class does some magic - but I could not find anything there). So I would suggest to remove the deserialize call and add

            elif isinstance(exception, str):
                parsl_task.set_exception(eval(exception))

to the checks.

@benclifford
With that change the test still fails though since line 67 of parsl/tests/test_bash_apps/test_stdout.py expects an perror.BadStdStreamFile, not a PermissionDenied as you mentioned above. Am I missing something? This is on parsl master.

PS.: I should not condone the casual use of eval really. Long story short: RP does not bother to really serialize exceptions as that does not (easily) work for custom exception types...

@benclifford
Copy link
Collaborator Author

@andre-merzky @AymenFJA thanks for looking into this - I'll have a poke at that test with your changes above. (Without digging in, I expect the difference in exceptions comes from the different code paths that the radical executor uses for bash apps - but let me spend some time investigating before believing me).

Serialization of exceptions has given us problems in the past, but I haven't personally been hitting problems recently - it might just be that we dealt with all the common cases that we were encountering - we have a bit of special logic in parsl.app.errors.RemoteExceptionWrapper, and I can see that at least some part of the radical executor is using that:

parsl/executors/radical/rpex_worker.py:            exc = (rp.utils.serialize_bson(pe.RemoteExceptionWrapper(*sys.exc_info())), None)

@benclifford
Copy link
Collaborator Author

If you want to avoid this eval(repr( serialization, it's perhaps OK to create a new RuntimeError(exception) - uglier, but if you're avoiding sending the exception object and just sending some human-oriented string, then this might be the right way to go.

In that case, tests would have to be loosened up to expect different exception types - I'm not in principle against that.

@andre-merzky
Copy link

Yes, I would prefer that option. It's not ideal for sure, but very portable...

@AymenFJA
Copy link
Contributor

AymenFJA commented May 3, 2024

@benclifford @andre-merzky, sorry for the late response. I believe this is not an issue of serialization. It seems like we set the sandbox of the task to the path of the stdout/stderr assigned by Parsl task kwargs. Specifically for the test shown in this ticket, it's set for a path that leads to a permission exception as expected. The problem is that the error is coming from the RP when it tries to create the sandbox of the task (way before it executes the task) and it ends up in the same error of Permission denied when it does: os.mkdir. The only difference is that we can not serialize that exception as it comes from a different component than the worker (which serializes the exceptions). When we unpack the exception we assume it is serialized and there we fail (it received a raw string rather than a serialized object note the obj=b"PermissionError(13, 'Permission denied')"). I have pushed a fix with this commit hash here: 61f9b92.

Interestingly, this test did not fail when we ran the entire test suite. But running it individually showed this issue. Regardless I tested the changes and it passed on my local machine with the release version of RP and Parsl.

@benclifford
Copy link
Collaborator Author

so I guess there's two things going on here - 1) one the underlying bug in choice of directories that I think @AymenFJA fixed in 61f9b92, and 2) the problem reported at the start of this issue that some errors happening around that code path are not properly reported to the end user, which is I think that Andre is addressing here above: #3328 (comment)

That 2nd problem still exists, I guess, in the case of any other errors that are not the fixed directory choice problem, right?

@AymenFJA
Copy link
Contributor

so I guess there's two things going on here - 1) one the underlying bug in choice of directories that I think @AymenFJA fixed in 61f9b92, and 2) the problem reported at the start of this issue that some errors happening around that code path are not properly reported to the end user, which is I think that Andre is addressing here above: #3328 (comment)

That 2nd problem still exists, I guess, in the case of any other errors that are not the fixed directory choice problem, right?

Hey @benclifford, I just pushed a commit that addresses the 2nd problem here: 0903330 .

@AymenFJA
Copy link
Contributor

@benclifford is there anything else that needs to be done in this ticket? Thank you.

@benclifford
Copy link
Collaborator Author

@AymenFJA the second fix you made (0903330) is part of PR #3060, which is not merged. I'll make some comments there on advancing that PR.

@AymenFJA
Copy link
Contributor

AymenFJA commented Jul 5, 2024

@benclifford should we close this ticket? as the related PR with the fixes is merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants