Intermittent RPEX failure (PSI-J/SAGA layer) #885
The problem seems to happen when trying to obtain the full path to the pilot sandbox on the target machine, well before pilot submission even hits the SAGA layer. That failure can mean we can't open a shell in SAGA, and my guess is that we are hitting the system too hard. |
@AymenFJA : ping |
@andre-merzky : I am passing a
self.session = rp.Session(cfg={'base': self.run_dir},
                          uid=ru.generate_id('rpex.session',
                                             mode=ru.ID_PRIVATE))
Is this problematic? Also, I am using Important things to mention:
This is critical, as I just saw another test stumble on this error: Parsl/parsl#3053. |
Wanted to add a quick paste from a recent failure in CI:
Full logs are here: https://github.com/Parsl/parsl/actions/runs/7818279731/job/21330235701#step:8:2525 |
Hi @yadudoc - can you please add |
A corresponding PR is open now to fix this issue and others: Parsl/parsl#3060 |
I dug into this issue quite a bit to try to understand why this was specifically being triggered in our CI. It looks like in the radical.saga pty code, fork() can return file descriptors that can't be used with |
See radical-cybertools/radical.saga#885, which repeatedly recommends avoiding the radical SAGA layer, and the corresponding Parsl issue #3029. This PR updates the comment for psi-j-parsl in setup.py to reflect the difference between psi-j-parsl and psij-python.
I opened a PR to install psij-python when installing |
@andre-merzky I added psij-python to the environment in parsl PR #3079 but I still see the pty select code running and failing: for example scroll down near the end of https://github.com/Parsl/parsl/actions/runs/7928758618/job/21647641006 Is there something else that needs to happen to make this happen: > If psij is installed it should be picked up and be used as default launcher ? |
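For readers unfamiliar with why "pty select code" can fail on otherwise valid file descriptors: one well-known limitation of select() is that it only accepts descriptors below FD_SETSIZE (typically 1024 on Linux). Whether this is the exact failure seen in the CI runs above is an assumption; the thread does not show the error text. The helper name below is hypothetical and is not part of radical.saga; this is only a minimal sketch of the limitation using the standard library.

```python
import os
import select


def select_accepts_fd(fd: int) -> bool:
    """Return True if select() accepts this descriptor number.

    CPython raises ValueError for descriptors that are negative or at or
    above FD_SETSIZE before it ever calls the underlying select().
    (Hypothetical helper for illustration only.)
    """
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        # Descriptor number is outside the range select() can represent.
        return False
    except OSError:
        # Number is in range, but the descriptor is not open (EBADF).
        return True
    return True


# A freshly created pipe normally gets a low descriptor number.
read_end, write_end = os.pipe()
low_fd_ok = select_accepts_fd(read_end)
os.close(read_end)
os.close(write_end)

# A number far above FD_SETSIZE is rejected even though a long-running
# process with many open files could legitimately hold such a descriptor.
high_fd_ok = select_accepts_fd(100000)
```

If this is indeed the failure mode, a process that already holds many open descriptors could receive a pty descriptor above the limit from fork(), which would explain why the problem shows up only intermittently.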
Fix has been released, SAGA is not an RP dependency anymore. |
The fix to this issue comes from a change in Radical Cybertools, not a change in the Parsl codebase. See radical-cybertools/radical.saga#885
@andre-merzky what should I be seeing as a newly released component? Our CI still installs radical-pilot 1.47, and PyPI reports 1.47.0 on Feb 8th as the latest release. |
@benclifford : pypi should deliver 1.48 by now. Could you please re-trigger the test pipeline? Thanks! |
@andre-merzky that new version seems to completely break radical-pilot+parsl for the usual test run that was working with 1.47: see this run https://github.com/Parsl/parsl/actions/runs/8418695922/job/23054193010?pr=3286 where at the end you can see something (I think radical pilot) sending a ctrl-C to the test process because it is so upset about something. I haven't dug into what's going on there. |
Hi @benclifford : the
$ grep -B 4 Error pmgr_launching.0000.log
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/__init__.py", line 13, in <module>
    from .job_executor import JobExecutor
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/job_executor.py", line 3, in <module>
    from distutils.version import Version
ModuleNotFoundError: No module named 'distutils'
--
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 35, in __init__
    raise rs_ex
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 15, in <module>
    import radical.saga
ModuleNotFoundError: No module named 'radical.saga'
--
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 306, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 524, in _start_pilot_bulk
    raise RuntimeError('no launcher found for %s' % pilot['uid'])
RuntimeError: no launcher found for pilot.0000
SAGA not being found is intentional - but the
I am afraid I will be offline for the rest of the week (I am not online today either tbh :-P). This ticket is re-opened now, I'll tend to it first thing on Monday. Also, we should add a parsl integration test to our own test suite so that things don't fall over on your end, but ideally before we push new releases... |
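The log above reads as a chain: the psij import fails (distutils is no longer in the standard library on Python 3.12), the radical.saga import fails (it is no longer installed), and so no launcher is left. A minimal sketch of that kind of import-based fallback follows. The function name and the candidate list are hypothetical and are not taken from radical.pilot; the real selection logic lives in pmgr/launching/base.py and may differ.

```python
import importlib


def pick_launcher(candidates):
    """Return (name, module) for the first candidate that imports cleanly.

    Hypothetical illustration of a fallback chain, not radical.pilot code.
    """
    errors = {}
    for name in candidates:
        try:
            return name, importlib.import_module(name)
        except ImportError as exc:
            # ModuleNotFoundError is a subclass of ImportError, so this also
            # catches a failure raised inside the candidate itself, such as
            # a missing 'distutils' on Python 3.12.
            errors[name] = exc
    tried = ', '.join('%s (%s)' % (n, e) for n, e in errors.items())
    raise RuntimeError('no launcher found; tried: %s' % tried)


# Usage: the first candidate is missing, so the second one is selected.
name, module = pick_launcher(['no_such_launcher_module', 'json'])
```

Recording the reason each candidate was skipped makes the final error easier to diagnose than a bare "no launcher found" message, since the root cause here was two steps removed from the error that was reported.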
PS.: Oh, that was fixed in |
A recently released version, 1.48, doesn't work with this executor, so this PR aggressively constrains the version to what was passing in Parsl GitHub Actions over the last few weeks. For further context, see radical-cybertools/radical.saga#885 (comment)
ok. For now parsl PR 3290 pins radical.pilot to 1.47 which has been passing our testing for the last months - aside from the problem that this issue #885 is about. We can loosen those constraints later. |
I just made a 0.9.5 release. |
@benclifford, can we consider this addressed and close this ticket? |
I guess so, if you think it's fixed? I haven't validated it in Parsl because I haven't got back to trying it again, but there's a parsl issue Parsl/parsl#3029 to remind me about that... |
@benclifford yes, I think it should be fixed and released now. |
got some problems integrating this with parsl on Python 3.12 with missing |
Actually, I think this looks the same as what @andre-merzky pasted above, #885 (comment). I can see that this version of psi-j-python is being installed in our build logs: psij-python-0.9.5 |
@benclifford Ben, seems that there is a leftover of |
@benclifford, our colleague just released a fixed version of PSI/J (0.9.6). Would you mind restarting those tests? Thanks! |
I consider it fixed based on @AymenFJA's feedback. If this is still a problem, please reopen the ticket. |
This is reported here: Parsl/parsl#3029