Fix testing of suicide for daemons
We don't support a command-line option for this, as it isn't
something a user should ever do. Instead, we use two
MCA params to specify it:
  prte_daemon_fail - specifies the daemon rank that
    should commit suicide
  prte_daemon_fail_delay - time in seconds the target
    rank should wait before dying. A value of zero means
    no delay, just die after calling init. This is the
    default value.
Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 618dd0a)
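As a rough illustration of how the two parameters described in this commit interact, here is a minimal C sketch. It is not the PRRTE implementation: the `daemon_fail_cfg_t` struct, both helper functions, and the convention that a negative `fail_rank` means "disabled" are assumptions made for the example.

```c
#include <stdbool.h>

/* Hypothetical holder for the two MCA parameter values. */
typedef struct {
    int fail_rank;   /* prte_daemon_fail: rank that should die;
                        negative = disabled (assumed convention) */
    int fail_delay;  /* prte_daemon_fail_delay: seconds to wait;
                        0 = die right after init (the default) */
} daemon_fail_cfg_t;

/* True only for the daemon whose rank matches the configured target. */
static bool should_fail(const daemon_fail_cfg_t *cfg, int my_rank)
{
    return cfg->fail_rank >= 0 && cfg->fail_rank == my_rank;
}

/* Seconds the daemon should wait before dying,
 * or -1 if this daemon is not the target. */
static int fail_wait_seconds(const daemon_fail_cfg_t *cfg, int my_rank)
{
    if (!should_fail(cfg, my_rank)) {
        return -1;
    }
    return cfg->fail_delay > 0 ? cfg->fail_delay : 0;
}
```

A daemon would presumably evaluate something like this once after init and either exit immediately (delay of zero) or arm a timer for the requested number of seconds.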
Fix daemon suicide and preserve output files
Correctly set the parent rank so that the OOB can
correctly identify its lifeline and cause the
daemon to abort when the lifeline dies. Fix the
--debug-daemons-file flag so it works, and preserve
the resulting output file from cleanup.
Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit a87d172)
Remove unused MCA param
Session directories now always include the PID of the daemon
Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit c4d5f81)
Only trigger job failed to start once
Trigger the "job failed to start" state only when the
first process reports a failure to start. This avoids a
"bounce" effect that causes the job object to be released
multiple times.
Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit a386514)
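The "trigger once" behavior described in this commit can be sketched with a simple guard flag. This is an illustrative sketch only: `job_sketch_t`, its fields, and `report_failed_to_start` are hypothetical names, not PRRTE's actual job object or state machine API.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for a job object. */
typedef struct {
    bool failed_to_start_reported;  /* set by the first failure report */
    int  activations;               /* times the state was triggered */
} job_sketch_t;

/* Called each time a process reports that it failed to start.
 * Returns true only for the first report, which is the only one
 * allowed to trigger the "job failed to start" state. */
static bool report_failed_to_start(job_sketch_t *job)
{
    if (job->failed_to_start_reported) {
        return false;  /* later reports are ignored: no "bounce" */
    }
    job->failed_to_start_reported = true;
    job->activations++;  /* stands in for activating the job state */
    return true;
}
```

Because only the first report activates the state, any cleanup attached to that state (including releasing the job object) runs once rather than once per failing process.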
Add "close stale issues" actions
Ported from open-mpi/ompi#12329
Thanks to @jsquyres!
Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 31c948f)
oac: strengthen Sphinx check
Update oac submodule pointer to pick up a stronger test for
Sphinx. Also add (new) optional 3rd param to OAC_SETUP_SPHINX.
Signed-off-by: Jeff Squyres [email protected]
(cherry picked from commit d3171cc)
Revamp the session directory system
We now have multiple tools (e.g., psched, prte, and even
multiple prte instances) running on the same node. Keeping
all those session directory trees under a single root is
problematic and leads to inadvertent deletion of contact
files. So simplify things and put each instance under its
own session directory tree root.
Add the pid and uid to the session directory root name. Prefix
the root name with the argv[0] of the tool so we know what
generated it.
Fix an error in PRRTE that assumed the job-level session was
a global name. It is not - it is different for each job, so
we need to track it by job. Have the prte_job_t destructor
call the session_dir_destroy function to remove it when
the job is complete.
Fix refcounts so the job object destructor gets called upon
job completion.
Signed-off-by: Ralph Castain [email protected]
(cherry picked from commit 14dd818)
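To make the naming scheme from this commit concrete, the following C sketch builds a per-instance session directory root name from the tool's argv[0], the uid, and the pid. The exact format PRRTE uses is not stated in the commit message, so the `tool.uid.pid` layout and the `session_root_name` helper are assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Build a root name of the (assumed) form "<tool>.<uid>.<pid>",
 * where <tool> is the final path component of argv0.
 * Returns 0 on success, -1 if the name does not fit in buf. */
static int session_root_name(char *buf, size_t len,
                             const char *argv0, uid_t uid, pid_t pid)
{
    /* Strip any leading directory components from argv[0]. */
    const char *slash = strrchr(argv0, '/');
    const char *tool = (slash != NULL) ? slash + 1 : argv0;

    int n = snprintf(buf, len, "%s.%lu.%ld",
                     tool, (unsigned long) uid, (long) pid);
    return (n >= 0 && (size_t) n < len) ? 0 : -1;
}
```

Since the pid is part of the root name, two prte instances (or prte and psched) on the same node get distinct trees, so one instance's cleanup cannot remove another's contact files.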
guard against possible segfault in prted
as it exits by removing unneeded activity
Signed-off-by: Howard Pritchard [email protected]
pr feedback
Signed-off-by: Howard Pritchard [email protected]
(cherry picked from commit 025d5ab)