-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add job avoidance #22
Conversation
Google artificially scales disk performance by size; a 2 TB disk seems like a sweet spot
If a file specified as an absolute path resides on the same NFS as the workspace, we shouldn't symlink it into the workspace -- chances are the user meant to supply it as an absolute path. If the user truly wants to symlink, they can specify this manually with an override.
This makes it more intuitive for a user override the NFS backend's new default behavior.
If a batch job of length 0 makes it to here, it was definitely completely avoided -- prior steps ensure that batch jobs of length 0 cannot proceed unless avoided.
- Remove entire output directory if: - dataframe is empty (run likely crashed) - dataframe doesn't exist (run never started?) - Allow for all output to be removed, regardless of status (overwrite = True)
This is in the case of job avoidance -- if job_id 0 gets avoided, but job_id 1 gets re-run, then Slurm task array ID 0 -> job_id 1 We do, however, assume that ordering is conserved between the two. This is due to the fact that Python dictionaries are guaranteed to be insertion ordered.
In 28de503, we assume that Orchestrator.job_spec's elements will be insertion ordered. This is only guaranteed in Python >= 3.7. In the future, we might want to be less lazy (and explicitly store the insert order ourselves), to relax this version requirement.
Because this is created by the localizer constructor, many localizer methods assume that it already exists.
Otherwise, this will just fail
Rather than cancelling by state, cancel based on user - For some reason, cancelling by state was occasionally failing Wait up to 60 seconds for all jobs to have stopped -- scancel does not block
These are now included in the output DF, so there's no reason to exclude them from field comparison.
If NFS got preempted before workflow exits, this could raise an error.
Sometimes, shutil.rmtree will hang due to NFS lag. Retry five times, then abort, which causes the entire task directory to be removed.
Otherwise, the localizer will try (and fail) to save common inputs to the common directory. If job avoidance occurs, this directory will already be present and thus crash the localizer. Removing it wholesale is somewhat hacky, but it's a lot more straightforward than modifying the orchestrator to coordinate this with the localizer.
In the future, we need to figure out how to make the Docker backend less gcloud-specific.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great, but it's too reliant on the NFSLocalizer right now. I would prefer if job avoidance worked on other backends/localizers
canine/backends/dockerTransient.py
Outdated
self, nfs_compute_script = "/usr/local/share/cga_pipeline/src/provision_storage_container_host.sh", | ||
compute_script = "/usr/local/share/cga_pipeline/src/provision_worker_container_host.sh", | ||
self, nfs_compute_script = "/usr/local/share/slurm_gcp_docker/src/provision_storage_container_host.sh", | ||
compute_script = "/usr/local/share/slurm_gcp_docker/src/provision_worker_container_host.sh", | ||
nfs_disk_size = 2000, nfs_disk_type = "pd-standard", nfs_action_on_stop = "stop", nfs_image = "", | ||
action_on_stop = "delete", image_family = "pydpiper", image = None, | ||
cluster_name = None, clust_frac = 0.01, user = os.environ["USER"], **kwargs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason that cluster_name
isn't a positional? Enforcing it as a required optional seems weird
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want there to be any positional arguments. I should really make this constructor signature
def __init__(self, *, nfs_compute_script, ...
to clarify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All arguments are positional in python. They can be put in arbitrary order using names regardless of if you set a default value. Not setting a default value simply makes an argument required.
Unless you're creating a complex object hierarchy where the available arguments change based on selected arguments or mixins, I am generally of the opinion that always-available arguments should be explicit in the constructor signature for the sake of clarity and readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have functions which do not support positional arguments:
In [1]: def foo(*, bah):
...: pass
...:
In [2]: foo("a")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-1df0092b7087> in <module>
----> 1 foo("a")
TypeError: foo() takes 0 positional arguments but 1 was given
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, that's a sneaky new python 3 trick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most code I develop has been 2 and 3 compatible, so I haven't come across that pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's Python 3-only, though I wouldn't exactly call it "new" :-)
Added to the standard in 2006: https://www.python.org/dev/peps/pep-3102/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I guess I should clarify. I think that setting an argument to have a default kind of signifies that it's optional. I have required parameters in the NFSLocalizer that have a default and appear in the middle of the argument list for compliance with the base class API. Is there a reason not to move the required parameters to the front of the list and remove their defaults so it's more obvious in docs/inspection that they are required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I was being real dumb. Let's make this a positional and remove the check in the constructor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(It's not like the exception message gives the user any additional information.)
Minor fixes and improvements to job avoidance
Rather than defaulting it to None and raising an exception if it's None in the constructor, just make it a required positional argument.
The main contribution of this PR is (basic) job avoidance. We pickle the output dataframe when a job completes; when another job is run with the same output directory, it searches for this pickle, which, if found, is compared against to see if the job specifications match.
This PR also adds the following miscellaneous features:
/mnt/nfs/foo/bar/bah
, and the output directory resides somewhere in/mnt/nfs
, this input will not be symlinked into the inputs directory, since we assume that the user meant the absolute path as a string literal.OrderedDict
when necessary.