Bad decoding of exception in radical error handling path #3328

benclifford · 2024-04-09T13:39:42Z

Describe the bug

Here's an error I see in CI (when working on some stderr behaviour elsewhere). I think it looks like one piece of code is serializing an exception as a bytestring of the repr of the exception, but another piece of the code is expecting to unserialize it as base64.


================================================================= FAILURES ==================================================================
___________________________________________________________ test_bad_stderr_file ____________________________________________________________

self = RadicalPilotExecutor(
    'local.localhost',
    bulk_mode=False, 
    label='RPEXBulk', 
    rpex_cfg='/home/benc/par...est-current/runinfo/000/rpex.cfg', 
    working_dir=PosixPath('/home/benc/parsl/src/parsl/.pytest/parsltest-current')
)
parsl_task = <Future at 0x7ba1f02736e0 state=finished raised DeserializationError>, exception = "PermissionError(13, 'Permission denied')"

    def _unpack_and_set_parsl_exception(self, parsl_task, exception):
        try:
>           s = rp.utils.deserialize_bson(exception)

parsl/executors/radical/executor.py:416: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../virtualenv-3.12/lib/python3.12/site-packages/radical/pilot/utils/serializer.py:101: in deserialize_bson
    return pickle.loads(codecs.decode(obj.encode(), "base64"))
/usr/local/lib/python3.12/encodings/base64_codec.py:19: in base64_decode
    return (base64.decodebytes(input), len(input))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

s = b"PermissionError(13, 'Permission denied')"

    def decodebytes(s):
        """Decode a bytestring of base-64 data into a bytes object."""
        _input_type_check(s)
>       return binascii.a2b_base64(s)
E       binascii.Error: Invalid base64-encoded string: number of data characters (33) cannot be 1 more than a multiple of 4
E       decoding with 'base64' codec failed

cc @AymenFJA

To Reproduce

pytest parsl/tests/test_bash_apps/test_stdout.py::test_bad_stderr_file -k "not cleannet" --config parsl/tests/configs/local_radical.py

Expected behavior
This should be raising the relevant exception (the PermissionError) up to the user program, rather than raising a DeserializationError.

Environment
my laptop
parsl master 13c3147

$ pip list |grep radic
radical.gtod                  1.47.0
radical.pilot                 1.47.0
radical.saga                  1.47.0
radical.utils                 1.48.0

The text was updated successfully, but these errors were encountered:

This marker originally referred to issue #363 for tests which should not be expected to pass if running in an environment where stderr and stdout are not staged back to the submitting system by the executor. Since it was created, it then expanded in use to more generally refer to any test which had a problem in environments which did not properly make either output files or stdout/stderr available on the submitting system, as a separate behaviour from the core of parsl behaving properly. An example of that would be running in a multi-site environment without working file staging. However, since all that happened, the feel in the Parsl community is that running with a shared filesystem is the core supported usage mode of Parsl; and also since then, notions of file staging have become much stronger, both in Parsl core and in individual executors, making it a much more reasonable test requirement that if you make a file, it can be expected to be available for the test to inspect - and that if you are in an environment that does not support that, you should not expect the test suite (or Parsl) to work. Because of all of that, this PR removes the issue363 test marker entirely. It was removed from Work Queue and Task Vine test environments in PR #3327, leaving only the Radical-PILOT test environment setting that. It turns out only two tests don't work in the Radical-PILOT environment when issue363 is removed: One requires support for the slightly awkward "stdout/err opening modes, specified by tuples" feature, and this PR labels and disables that test using a new 'executor_supports_std_stream_tuples' label. (In general, an executor which does staging of stdout/err streams is likely to not support the features offered by std stream tuples, such as appending to stderr files from multiple task - so this label is not Radical-PILOT specific) Another test looks like it is broken exception handling in the Radical-PILOT executor, and so it is labelled with a new issue specific label, relating to the bug report for that issue - issue #3328, issue3328.

andre-merzky · 2024-04-29T07:42:40Z

@AymenFJA
The task's exception field is indeed just repr(exception), there is no bson involved (unless the rp.pytask class does some magic - but I could not find anything there). So I would suggest to remove the deserialize call and add

            elif isinstance(exception, str):
                parsl_task.set_exception(eval(exception))

to the checks.

@benclifford
With that change the test still fails though since line 67 of parsl/tests/test_bash_apps/test_stdout.py expects an perror.BadStdStreamFile, not a PermissionDenied as you mentioned above. Am I missing something? This is on parsl master.

PS.: I should not condone the casual use of eval really. Long story short: RP does not bother to really serialize exceptions as that does not (easily) work for custom exception types...

benclifford · 2024-04-29T07:50:31Z

@andre-merzky @AymenFJA thanks for looking into this - I'll have a poke at that test with your changes above. (Without digging in, I expect the difference in exceptions comes from the different code paths that the radical executor uses for bash apps - but let me spend some time investigating before believing me).

Serialization of exceptions has given us problems in the past, but I haven't personally been hitting problems recently - it might just be that we dealt with all the common cases that we were encountering - we have a bit of special logic in parsl.app.errors.RemoteExceptionWrapper, and I can see that at least some part of the radical executor is using that:

parsl/executors/radical/rpex_worker.py:            exc = (rp.utils.serialize_bson(pe.RemoteExceptionWrapper(*sys.exc_info())), None)

benclifford · 2024-04-29T07:54:20Z

If you want to avoid this eval(repr( serialization, it's perhaps OK to create a new RuntimeError(exception) - uglier, but if you're avoiding sending the exception object and just sending some human-oriented string, then this might be the right way to go.

In that case, tests would have to be loosened up to expect different exception types - I'm not in principle against that.

andre-merzky · 2024-04-29T09:12:16Z

Yes, I would prefer that option. It's not ideal for sure, but very portable...

AymenFJA · 2024-05-03T18:13:28Z

@benclifford @andre-merzky, sorry for the late response. I believe this is not an issue of serialization. It seems like we set the sandbox of the task to the path of the stdout/stderr assigned by Parsl task kwargs. Specifically for the test shown in this ticket, it's set for a path that leads to a permission exception as expected. The problem is that the error is coming from the RP when it tries to create the sandbox of the task (way before it executes the task) and it ends up in the same error of Permission denied when it does: os.mkdir. The only difference is that we can not serialize that exception as it comes from a different component than the worker (which serializes the exceptions). When we unpack the exception we assume it is serialized and there we fail (it received a raw string rather than a serialized object note the obj=b"PermissionError(13, 'Permission denied')"). I have pushed a fix with this commit hash here: 61f9b92.

Interestingly, this test did not fail when we ran the entire test suite. But running it individually showed this issue. Regardless I tested the changes and it passed on my local machine with the release version of RP and Parsl.

benclifford · 2024-05-04T10:27:21Z

so I guess there's two things going on here - 1) one the underlying bug in choice of directories that I think @AymenFJA fixed in 61f9b92, and 2) the problem reported at the start of this issue that some errors happening around that code path are not properly reported to the end user, which is I think that Andre is addressing here above: #3328 (comment)

That 2nd problem still exists, I guess, in the case of any other errors that are not the fixed directory choice problem, right?

AymenFJA · 2024-05-10T00:01:53Z

so I guess there's two things going on here - 1) one the underlying bug in choice of directories that I think @AymenFJA fixed in 61f9b92, and 2) the problem reported at the start of this issue that some errors happening around that code path are not properly reported to the end user, which is I think that Andre is addressing here above: #3328 (comment)

That 2nd problem still exists, I guess, in the case of any other errors that are not the fixed directory choice problem, right?

Hey @benclifford, I just pushed a commit that addresses the 2nd problem here: 0903330 .

AymenFJA · 2024-06-21T12:14:12Z

@benclifford is there anything else that needs to be done in this ticket? Thank you.

benclifford · 2024-07-01T13:17:21Z

@AymenFJA the second fix you made (0903330) is part of PR #3060, which is not merged. I'll make some comments there on advancing that PR.

AymenFJA · 2024-07-05T12:07:11Z

@benclifford should we close this ticket? as the related PR with the fixes is merged?

benclifford added the bug label Apr 9, 2024

benclifford mentioned this issue Apr 9, 2024

Remove issue363 test marker #3329

Merged

AymenFJA mentioned this issue Apr 26, 2024

RPEX decoding failure radical-cybertools/radical.pilot#3175

Closed

benclifford closed this as completed Jul 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bad decoding of exception in radical error handling path #3328

Bad decoding of exception in radical error handling path #3328

benclifford commented Apr 9, 2024

andre-merzky commented Apr 29, 2024 •

edited

Loading

benclifford commented Apr 29, 2024

benclifford commented Apr 29, 2024

andre-merzky commented Apr 29, 2024

AymenFJA commented May 3, 2024

benclifford commented May 4, 2024

AymenFJA commented May 10, 2024

AymenFJA commented Jun 21, 2024

benclifford commented Jul 1, 2024

AymenFJA commented Jul 5, 2024

Bad decoding of exception in radical error handling path #3328

Bad decoding of exception in radical error handling path #3328

Comments

benclifford commented Apr 9, 2024

andre-merzky commented Apr 29, 2024 • edited Loading

benclifford commented Apr 29, 2024

benclifford commented Apr 29, 2024

andre-merzky commented Apr 29, 2024

AymenFJA commented May 3, 2024

benclifford commented May 4, 2024

AymenFJA commented May 10, 2024

AymenFJA commented Jun 21, 2024

benclifford commented Jul 1, 2024

AymenFJA commented Jul 5, 2024

andre-merzky commented Apr 29, 2024 •

edited

Loading