Fix two bugs and add support for N masters, N workers with load balancing in RadicalPilotExecutor #3060
Conversation
addressing comments and refining
ok thanks. There are a bunch of changes in this and I'm not super familiar with this bit of the code, so I'll try to use the review as a learning experience. If you're making several changes at once, it can be easier, both for review and for tracking down later bugs, to make several smaller, single-focused PRs (a rule of thumb for me: if you start listing out multiple changes in the PR description, consider splitting it...).
@benclifford totally understandable, and I agree. In the future, the PRs will be more focused. During this review, please let me know if I can help with anything vague or any questions. Thank you, and sorry for the late response.
Makefile
Outdated
@@ -10,6 +10,7 @@ export PATH := $(CCTOOLS_INSTALL)/bin/:$(PATH)
 export CCTOOLS_VERSION=7.7.2
 export HYDRA_LAUNCHER=fork
 export OMPI_MCA_rmaps_base_oversubscribe=yes
+export RADICAL_TEMP_SANDBOX=True
I think this variable is not interpreted as a Python bool, so it's maybe misleading to use `True` here, in case someone thinks `False` would do something opposite.
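For illustration, a minimal sketch of that pitfall (not code from this PR): environment variables are always strings, so any non-empty value, including the string `"False"`, is truthy.

```python
import os

os.environ['RADICAL_TEMP_SANDBOX'] = 'False'

value = os.environ.get('RADICAL_TEMP_SANDBOX')
print(type(value))  # <class 'str'>, never a Python bool
print(bool(value))  # True, because "False" is a non-empty string

if os.environ.get('RADICAL_TEMP_SANDBOX'):
    # This branch runs even though the value reads as "False".
    print('temp sandbox "enabled"')
```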
I agree with that. Maybe `RADICAL_TEMP_SANDBOX=yes` or `RADICAL_TEMP_SANDBOX=1`?
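Either spelling would still hit the same string-truthiness issue if checked with a bare `os.environ.get(...)`; a hypothetical helper (not part of this PR) could normalise the common truthy spellings:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret RADICAL_TEMP_SANDBOX=yes / 1 / true / on as True."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ('1', 'yes', 'true', 'on')

os.environ['RADICAL_TEMP_SANDBOX'] = 'yes'
print(env_flag('RADICAL_TEMP_SANDBOX'))  # True
```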
I think everything to do with `RADICAL_TEMP_SANDBOX` can be removed from this PR now, because I think the problem it was solving was fixed in a different way by moving away from radical SAGA?
parsl/executors/radical/executor.py
Outdated
# to include the agent sandbox with the ci artifacts.
if os.environ.get("LOCAL_SANDBOX"):
    pd_init['sandbox'] = self.run_dir
    # set the agent dir to /tmp as a fix for #3029
Is this changed sandbox behaviour the only part that fixes #3029? If so:

- it's moving things into a temporary directory only when the CI tests are run (or in other situations where the `RADICAL_TEMP_SANDBOX` environment variable is set).
- in non-test situations, this behaviour is otherwise unchanged, right? If so, does this PR stop that situation occurring for normal users who are running real applications, not our CI?
While this is a very rare case that can happen when the filesystem is exhausted for some reason (maybe too many concurrent I/O operations on the CI test machine), this behavior should not happen on a regular basis.
I do understand your concern that this might also happen on a `local` non-test setup (as the agent sandbox is located in `$HOME`); we have never observed this behavior on an HPC machine, as we use the shared filesystem to store our agent sandbox.
To address your concern, we could make the default location for the sandbox on a `local` setup the `/tmp` folder in general? I am open to suggestions as well.
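A hedged sketch of that suggestion; `default_sandbox` and the `'local.localhost'` label are illustrative, and the real `pd_init` keys in RADICAL-Pilot may differ:

```python
import tempfile

def default_sandbox(resource: str, run_dir: str) -> str:
    """Pick an agent sandbox location; hypothetical helper, not PR code."""
    if resource.startswith('local'):
        # On a local setup, keep the sandbox under /tmp rather than $HOME,
        # so an exhausted home filesystem cannot break the run.
        return tempfile.mkdtemp(prefix='radical_sandbox_', dir='/tmp')
    # On HPC machines, keep using the shared filesystem (the run dir).
    return run_dir

pd_init = {'sandbox': default_sandbox('local.localhost', '/path/to/run_dir')}
```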
I investigated this a bunch more (and have commented on what I see in other issues) to try to understand what was actually happening in the CI environment, because I like to at least have some understanding of how that environment behaves.
I can see why this is happening in the CI (due to entering a strange zone of fd number-space which radical.saga cannot deal with) and therefore why this would happen in a user workflow (when a user's workflow-submitting process has opened many, around 1000, files).
This PR avoids the misbehaving radical.saga code path by having a special non-default option to turn on a local sandbox.
But in radical-cybertools/radical.saga#885 Andrew mentions, effectively, that radical.saga would be better replaced by psij-python; that would have the same effect of removing the misbehaving codepath, in a different way.
So I tried that out in this draft PR: #3079, which does actually avoid the misbehaving codepath, as far as I can tell.
But maybe this is a better approach longer term, if radical.saga is effectively deprecated in favour of psij-python?
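For context, a minimal sketch of how that fd number-space zone can be reproduced, assuming the limit in question is the classic `select()` `FD_SETSIZE` boundary at 1024 (Linux, CPython):

```python
# Illustrative reproduction only, not code from this PR. Assumption: the
# "strange zone" is fd numbers >= FD_SETSIZE (1024), which select() rejects.
import os
import resource
import select

# Raise the soft fd limit so more than 1024 descriptors can be opened
# (the default soft limit on many Linux systems is exactly 1024).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
if hard == resource.RLIM_INFINITY or hard >= 4096:
    resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))

# Hold ~1100 descriptors so the next fd allocated is numbered above 1024.
held = [os.open(os.devnull, os.O_RDONLY) for _ in range(1100)]
fd = os.open(os.devnull, os.O_RDONLY)
print("new fd:", fd)  # typically >= 1024 at this point

try:
    select.select([fd], [], [], 0)
except ValueError as exc:
    # CPython raises "filedescriptor out of range in select()" here.
    print("select() cannot handle this fd:", exc)

os.close(fd)
for h in held:
    os.close(h)
```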
@AymenFJA I think that this PR doesn't need the `RADICAL_TEMP_SANDBOX` code in it any more? I think the problem it was trying to resolve was resolved in a different way, by moving radical pilot away from using radical SAGA.
I just merged the latest parsl
Description
This PR fixes two bugs and adds support for N masters and N workers with load balancing in RadicalPilotExecutor.
Fixes
Fixes #3029, #3070
Type of change
Choose which options apply, and delete the ones which do not apply.