Test failures from Radical pilot executor #3029
Comments
Hey @yadudoc, thanks for raising this. It seems like the error source is from the … I am on it and I will try to reproduce it if possible.
Since some recent changes that hopefully fixed other CI failures, it looks like this error is now the dominant cause of the CI test failures that I am experiencing.
(This has affected every PR I've tried to merge or review today.)
@benclifford, I am preparing a PR that fixes this, in addition to several other fixes, by today.
@yadudoc, @benclifford, a PR to fix this issue and others is now open: #3060.
@AymenFJA's fix PR #3060 didn't satisfy my "why is this broken / how does this fix it?" curiosity, so I dug in a bit today. I've identified these issues which interact here, I think, in no particular order:
... and then running
I did some more hacking on the hang described above as "some part of the pytest stack hangs at exit": the stack is waiting for a process launched in radical.pilot/proxy.py to exit before the whole Python process will exit (as that is how multiprocessing processes work). However, it looks like that process does not exit in this situation. I modified my Python runtime so that it reports the name of processes that it waits for at exit, and I modified radical.pilot to add a …
and then:
If I set …
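For anyone unfamiliar with this multiprocessing behaviour, here is a minimal sketch (plain Python, not radical.pilot or Parsl code; the never-returning child is only a stand-in for the proxy process) of why a stuck child process blocks interpreter exit:

```python
# Minimal sketch: multiprocessing registers an atexit handler that joins
# every non-daemon child process, so a child that never returns blocks
# interpreter shutdown. Running this as-is hangs after the final print,
# which is exactly the behaviour being demonstrated.
import multiprocessing
import time


def child_that_never_exits():
    # Stand-in for a proxy-like helper process that fails to shut down.
    while True:
        time.sleep(1)


if __name__ == "__main__":
    p = multiprocessing.Process(target=child_that_never_exits)
    # Uncommenting the next line (or explicitly terminating/joining the
    # child before exit) lets the parent exit: at shutdown, daemon children
    # are terminated while non-daemon children are joined.
    # p.daemon = True
    p.start()
    print("main code finished; interpreter now waits for the child to exit")
```

Under that reading, either shutting the proxy process down explicitly at exit or marking it as a daemon process would avoid the hang, though which of those (if either) is appropriate for radical.pilot is for its maintainers to judge.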
None of the above gives any explanation for what works (or does not work) about Aymen's PR #3060 though... so I'll try to understand that.
Adding the safe-exit tag because this hang is also going to occur for real users in some situations.
See radical-cybertools/radical.saga#885, which repeatedly recommends avoiding the radical SAGA layer, and the corresponding Parsl issue #3029. This PR updates the comment for psi-j-parsl in setup.py to reflect the difference between psi-j-parsl and psij-python.
Hypothesis on why this failure has become worse recently:
- This test failure is happening when around a thousand file descriptors are open at the start of the radical test.
- This test failure happens in …
- This test failure happens non-deterministically because the tests are deliberately run in random order to flush out inter-test dependencies (introduced in the massive test rearrangement PR #1210): if this test runs "early" in the test run, the fd count will be "low" and the test will not fail; if it runs "late" in the test run, the fd count will be "high" and the test will fail.
- Recently a bunch more tests have been added to the … Adding more tests means that it's more likely that this radical test will run beyond the 1000-ish tests.

So while this bug has always existed, it's not a change in the radical stack that has caused it to happen more often, but instead a change in the test environment to be more fd-intensive.
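A minimal diagnostic sketch for checking this hypothesis (my own illustration, not part of the Parsl test suite; Linux-specific because it reads /proc/self/fd): report how many file descriptors are open when the radical test starts and compare against the soft limit.

```python
# Diagnostic sketch: print the current open-fd count and the soft fd limit
# (commonly 1024 on CI runners). Linux-specific: /proc/self/fd lists the
# calling process's open descriptors.
import os
import resource


def open_fd_count() -> int:
    # Listing /proc/self/fd briefly opens one extra fd for the directory
    # itself, so the count may be off by one.
    return len(os.listdir("/proc/self/fd"))


def fd_soft_limit() -> int:
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


if __name__ == "__main__":
    print(f"open fds: {open_fd_count()} (soft limit: {fd_soft_limit()})")
```

Logging this number at the start of the radical test in CI would show whether the failures correlate with runs where the count is already near the limit.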
Hi Ben, thanks for deep-diving into this! As Aymen said earlier, SAGA is on its way out of RP and we'll likely not attempt to fix this on the SAGA layer. We still use it in some places as PSI/J does not yet support data staging. The PARSL tests triggering these kinds of errors only add to the motivation to complete the transition though ;-) Let me have a look over the next few days to see if I can do a hackish SAGA replacement for the code path relevant to this. PS: Setting …
Quick note that this should be resolved with the next RPC release: the SAGA removal is complete, so the respective code path simply does not exist anymore. Sorry it took so long, but it turned out that some refactoring and code removal ultimately was simpler (and more sensible) than fixing this.
The fix to this issue comes from a change in Radical Cybertools, not a change in the Parsl codebase. See radical-cybertools/radical.saga#885
Describe the bug
I noticed some instability in the radical pilot executor that caused failures in the CI tests for the mpi_experimental_3 branch. These failures are intermittent, so rerunning failed tasks gets past these errors, but it would be good to have these tests or the cause of the instability fixed. Here's a small snippet of the failure from GitHub Actions (https://github.com/Parsl/parsl/actions/runs/7479805965/job/20357787240#step:8:2239):
…
To Reproduce
Steps to reproduce the behavior, e.g.:
Expected behavior
Tests passing without instability
Environment
It would be nice if someone from the radical team, perhaps @AymenFJA, could take a quick look.