
Intermittent RPEX failure (PSI-J/SAGA layer) #885

Closed
AymenFJA opened this issue Jan 19, 2024 · 25 comments

@AymenFJA
Contributor

This is reported here: Parsl/parsl#3029

@andre-merzky
Member

The problem seems to happen when trying to obtain the full path to the pilot sandbox on the target machine, well before pilot submission even hits the SAGA layer. That failure means we cannot open a shell in SAGA, and I would guess we are hitting the system too hard.
A quick workaround is to not use any env variables in default_remote_workdir for the test configuration (the failing code path attempts to resolve that env var). Also, PSIJ should be installed as a dependency so that SAGA is skipped on job submission (which would otherwise likely hit the same error later on).
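As a minimal sketch of that workaround (the resource label, runtime, and sandbox path here are illustrative assumptions, not the actual Parsl test configuration), a pilot description can pin the sandbox to a literal path so that no env var needs to be resolved on the target machine:

import radical.pilot as rp

# Hypothetical pilot description: a literal sandbox path instead of
# one containing $HOME avoids the remote env-var resolution that
# fails here.
pd = rp.PilotDescription({'resource': 'local.localhost',
                          'runtime' : 10,   # minutes
                          'cores'   : 1,
                          'sandbox' : '/tmp/rp_test_sandbox'})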

@andre-merzky
Member

@AymenFJA : ping

@AymenFJA
Contributor Author

AymenFJA commented Feb 7, 2024

@andre-merzky : I am passing a session directory as follows:

self.session = rp.Session(cfg={'base': self.run_dir},
                          uid=ru.generate_id('rpex.session',
                                             mode=ru.ID_PRIVATE))

Is this problematic?

Also, I am using local.localhost, which by default sets default_remote_workdir to $HOME.
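(For reference, the relevant entry in RP's bundled resource configuration looks roughly like the excerpt below; this is a paraphrase to show where the value comes from, not the verbatim file.)

{
    "localhost": {
        "default_remote_workdir": "$HOME"
    }
}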

Important things to mention:

  • I need to set the path of the session on the client side (as shown above).
  • I do not know the full home path on the test machine, and it can change.
  • I do not want to set default_remote_workdir to self.run_dir, for two reasons:
    • I do not have access to self.run_dir from where we run the test suite.
    • I do not want to include the agent sandbox in the test suite (even if it were possible to set self.run_dir as the path), as it is very large and cannot be zipped due to Parsl constraints.

This is critical, as I just saw another test stumble over this error: Parsl/parsl#3053.

@yadudoc

yadudoc commented Feb 7, 2024

I wanted to add a quick paste from a recent failure in CI:

self = <radical.saga.utils.pty_process.PTYProcess object at 0x7fe2a4a5b410>
data = " export PROMPT_COMMAND='' PS1='$'; set prompt='$'\n", nolog = False

    def write (self, data, nolog=False) :
        """
        This method will repeatedly attempt to push the given data into the
        child's stdin pipe, until it succeeds to write all data.
        """
    
        with self.rlock :
    
            if not self.alive (recover=False) :
                raise ptye.translate_exception(
                        se.NoSuccess("cannot write to dead process (%s) [%5d]"
                                     % (self.cache[-256:], self.parent_in)))
    
            try :
                log = self._hide_data (data, nolog)
                log =  log.replace ('\n', '\\n')
                log =  log.replace ('\r', '')
                if  len(log) > _DEBUG_MAX :
                    self.logger.debug ("write: [%5d] [%5d] (%s ... %s)"
                                    % (self.parent_in, len(data),
                                       log[:30], log[-30:]))
                else :
                    self.logger.debug ("write: [%5d] [%5d] (%s)"
                                    % (self.parent_in, len(data), log))
    
                # attempt to write forever -- until we succeeed
                while data :
    
                    # check if the pty pipe is ready for data
>                   _, wlist, _ = select.select ([], [self.parent_in], [],
                                                 _POLLDELAY)
E                                                ValueError: filedescriptor out of range in select()

Full logs are over here: https://github.com/Parsl/parsl/actions/runs/7818279731/job/21330235701#step:8:2525

@andre-merzky
Member

Hi @yadudoc - can you please add psij-python as a dependency for the test environment? If psij is installed, it should be picked up and used as the default launcher, thus avoiding problems with the SAGA PTY layer, which is known to be rather fickle...

@AymenFJA
Contributor Author

AymenFJA commented Feb 9, 2024

A corresponding PR is open now to fix this issue and others: Parsl/parsl#3060

@benclifford

I dug into this issue quite a bit to try to understand why it was specifically being triggered in our CI.

It looks like the pty fork() in the radical.saga code can sometimes return file descriptors that cannot be used with select(), because they are greater than or equal to glibc's FD_SETSIZE. In this Parsl test case, that happens because a bunch of earlier tests have already run (in randomized order, which is why this does not always happen), leaving a lot of fds open in the main test process: not enough to exhaust the available fds, but enough to push fd allocation into the range where select() does not work.
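That limit is easy to reproduce standalone. The sketch below is my own illustration, not Parsl's test code; it assumes the process may raise its fd limit above glibc's FD_SETSIZE (1024):

import os
import resource
import select

# Raise the soft fd limit to the hard limit so we can allocate file
# descriptors numerically above FD_SETSIZE.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

# Hold open enough fds that the last one is >= 1024.
held = [os.open('/dev/null', os.O_RDONLY) for _ in range(1025)]
high_fd = held[-1]

try:
    # select() validates fds against FD_SETSIZE and rejects this one,
    # even though the process is nowhere near out of fds.
    select.select([high_fd], [], [], 0)
except ValueError as exc:
    print(exc)   # filedescriptor out of range in select()
finally:
    for fd in held:
        os.close(fd)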

benclifford added a commit to Parsl/parsl that referenced this issue Feb 16, 2024
See radical-cybertools/radical.saga#885
which repeatedly recommends avoiding the radical SAGA layer, and
the corresponding Parsl issue #3029
benclifford added a commit to Parsl/parsl that referenced this issue Feb 16, 2024
See radical-cybertools/radical.saga#885
which repeatedly recommends avoiding the radical SAGA layer, and
the corresponding Parsl issue #3029

This PR updates the comment for psi-j-parsl in setup.py to reflect
the difference between psi-j-parsl and psij-python.
@benclifford

I opened a PR to install psij-python when installing parsl[radical-pilot]: parsl PR #3079. I'll give it a few runs, but if it avoids the above SAGA file descriptor limits and does not bring in new problems, this feels like a more principled approach than PR #3060.
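For illustration, the mechanism is a setuptools extra, roughly like the hypothetical sketch below (the real dependency list in Parsl's setup.py differs):

from setuptools import setup

# Hypothetical extras entry: `pip install 'parsl[radical-pilot]'`
# would then pull in psij-python alongside radical.pilot.
setup(name='parsl',
      extras_require={'radical-pilot': ['radical.pilot',
                                        'psij-python']})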

@benclifford

@andre-merzky I added psij-python to the environment in parsl PR #3079 but I still see the pty select code running and failing: for example scroll down near the end of https://github.com/Parsl/parsl/actions/runs/7928758618/job/21647641006

Is there something else that needs to happen to make this true: "If psij is installed it should be picked up and used as the default launcher"?

@andre-merzky
Member

The fix has been released; SAGA is no longer an RP dependency.

benclifford added a commit to Parsl/parsl that referenced this issue Mar 24, 2024
The fix to this issue comes from a change in Radical Cybertools, not a
change in the Parsl codebase.
See radical-cybertools/radical.saga#885
@benclifford

@andre-merzky what should I be seeing as a newly released component? Our CI still installs radical-pilot 1.47, and PyPI reports 1.47.0 on Feb 8th as the latest release.

@andre-merzky
Member

@benclifford : PyPI should deliver 1.48 by now. Could you please re-trigger the test pipeline? Thanks!

@benclifford

@andre-merzky that new version seems to completely break radical-pilot + parsl for the usual test run that was working with 1.47: see this run https://github.com/Parsl/parsl/actions/runs/8418695922/job/23054193010?pr=3286 where, at the end, you can see something (I think radical.pilot) sending a Ctrl-C to the test process because it is upset about something. I haven't dug into what's going on there.

@andre-merzky andre-merzky reopened this Mar 25, 2024
@andre-merzky
Member

andre-merzky commented Mar 25, 2024

Hi @benclifford : the KeyboardInterrupt is issued when a pilot fails. In the logs, I find this:

$ grep -B 4 Error pmgr_launching.0000.log
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/__init__.py", line 13, in <module>
    from .job_executor import JobExecutor
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/psij/job_executor.py", line 3, in <module>
    from distutils.version import Version
ModuleNotFoundError: No module named 'distutils'
--
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 35, in __init__
    raise rs_ex
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/saga.py", line 15, in <module>
    import radical.saga
ModuleNotFoundError: No module named 'radical.saga'
--
Traceback (most recent call last):
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 306, in work
    self._start_pilot_bulk(resource, schema, pilots)
  File "/home/runner/work/parsl/parsl/.venv/lib/python3.12/site-packages/radical/pilot/pmgr/launching/base.py", line 524, in _start_pilot_bulk
    raise RuntimeError('no launcher found for %s' % pilot['uid'])
RuntimeError: no launcher found for pilot.0000

SAGA not being found is intentional, but the distutils problem in PSIJ is what we stumble over (Python 3.12 removed distutils; see PEP 632).

I am afraid I will be offline for the rest of the week (I am not online today either, tbh :-P). This ticket is re-opened now; I'll tend to it first thing on Monday. Also, we should add a Parsl integration test to our own test suite so that things don't fall over on your end, ideally running it before we push new releases...

@andre-merzky
Member

PS.: Oh, that was fixed in psij-python, but did not make it into that release yet. pinging @hategan
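For context, Python 3.12 removed distutils (PEP 632). A typical migration, sketched below, swaps distutils.version for the packaging library; this is an assumption about the shape of the psij-python fix, not a quote of it:

# Before (breaks on Python 3.12, where distutils no longer exists):
#     from distutils.version import Version
# After, using the third-party 'packaging' library:
from packaging.version import Version

assert Version('0.9.6') > Version('0.9.5')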

benclifford added a commit to Parsl/parsl that referenced this issue Mar 25, 2024
A recently released version, 1.48, doesn't work with this executor, so this PR
aggressively constrains the version to what was passing in Parsl GitHub Actions
over the last few weeks.

For further context, see
radical-cybertools/radical.saga#885 (comment)
@benclifford

OK. For now, parsl PR 3290 pins radical.pilot to 1.47, which has been passing our testing for the last few months, aside from the problem that this issue #885 is about. We can loosen those constraints later.

@hategan

hategan commented Mar 25, 2024

PS.: Oh, that was fixed in psij-python, but did not make it into that release yet. pinging @hategan

I just made a 0.9.5 release.

@mturilli
Contributor

@benclifford, can we consider this addressed and close this ticket?

@benclifford

I guess so, if you think it's fixed? I haven't validated it in Parsl because I haven't got back to trying it again, but there's a Parsl issue, Parsl/parsl#3029, to remind me about that...

@mturilli
Contributor

mturilli commented Apr 1, 2024

@benclifford yes, I think it should be fixed and released now.

@benclifford

I got some problems integrating this with Parsl on Python 3.12 due to missing distutils; more info on the relevant Parsl PR: Parsl/parsl#3286 (comment)

@benclifford

Actually, I think this looks the same as what @andre-merzky pasted above: #885 (comment).

I can see that this version of psij-python is being installed in our build logs: psij-python-0.9.5

@mtitov
Collaborator

mtitov commented Apr 8, 2024

@benclifford Ben, it seems that there is a leftover use of distutils in PSI/J; I've created a ticket regarding it: ExaWorks/psij-python#462

@mtitov
Collaborator

mtitov commented Apr 8, 2024

@benclifford, our colleague just released a fixed version of PSI/J (0.9.6); would you mind restarting those tests? Thanks!

@mturilli
Contributor

I consider it fixed based on @AymenFJA's feedback. If this is still a problem, please reopen the ticket.

benclifford added a commit to Parsl/parsl that referenced this issue Jun 7, 2024
The fix to this issue comes from a change in Radical Cybertools, not a
change in the Parsl codebase.
See radical-cybertools/radical.saga#885