Add job avoidance #22

Merged: 73 commits into getzlab:master on Feb 27, 2020

Conversation

@julianhess (Collaborator) commented Feb 3, 2020

The main contribution of this PR is (basic) job avoidance. We pickle the output dataframe when a job completes; when another job is run with the same output directory, it searches for this pickle and, if found, compares it against the current run to see whether the job specifications match.
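
A minimal sketch of this check, with a made-up pickle filename, dataframe column, and helper names; Canine's actual internals may differ:

import os
from typing import Optional

import pandas as pd

# Hypothetical filename for the pickled output dataframe.
RESULTS_PICKLE = "output_dataframe.pickle"

def load_prior_results(output_dir: str) -> Optional[pd.DataFrame]:
    """Load the output dataframe pickled by a previous run, if one exists."""
    path = os.path.join(output_dir, RESULTS_PICKLE)
    return pd.read_pickle(path) if os.path.exists(path) else None

def jobs_to_avoid(output_dir: str, job_spec: dict) -> set:
    """Return the job_ids whose specification matches the previous run's record."""
    prior = load_prior_results(output_dir)
    if prior is None:
        return set()
    avoidable = set()
    for job_id, inputs in job_spec.items():
        # Assume the prior dataframe recorded each job's inputs in an "inputs" column.
        if job_id in prior.index and prior.loc[job_id, "inputs"] == inputs:
            avoidable.add(job_id)
    return avoidable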

This PR also adds the following miscellaneous features:

  • Preemptible NFS automatic restarting (Docker backend)
    • Launches a background thread, active for the duration of the backend context manager, that checks every minute whether the NFS server has been preempted and restarts it if so (see the first sketch after this list)
  • Automatically override to null any inputs that are absolute paths residing on the same NFS share and are not Canine outputs (see the second sketch after this list)
    • E.g. if a user specifies an input with the absolute path /mnt/nfs/foo/bar/bah, and the output directory resides somewhere in /mnt/nfs, this input will not be symlinked into the inputs directory, since we assume that the user meant the absolute path as a string literal.
    • This is especially useful for running things like aligners, which often take a path to a reference FASTA and implicitly assume that ancillary files (e.g. indices) are present in the same directory, with the same basename as the reference FASTA.
    • Note that this only applies to absolute paths outside of a Canine directory structure. If we detect that the input path is a Canine output directory (via a simple regex), we will symlink it, since this is likely being run as part of a wolF workflow.
  • Track task runtime in seconds (not hours)
  • Bump required Python version to 3.7 (we are implicitly assuming ordered dicts)
    • If this is a problem, we can explicitly use OrderedDict when necessary.
  • Kill any jobs still running on the cluster when we stop the cluster
    • I thought the Slurm worker daemons did this automatically, but I guess not!
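
A rough sketch of the NFS preemption watcher described in the list above; nfs_is_preempted and restart_nfs are hypothetical stand-ins for the Docker backend's actual gcloud-based checks:

import threading

def start_nfs_watcher(nfs_is_preempted, restart_nfs, interval: float = 60.0) -> threading.Event:
    """Start a daemon thread that restarts the NFS server if it gets preempted.

    Returns an Event; the backend's __exit__ would set it so the watcher stops
    when the context manager exits.
    """
    stop = threading.Event()

    def watch():
        # Event.wait() returns False on timeout, so this polls once per interval
        # until stop is set.
        while not stop.wait(interval):
            if nfs_is_preempted():
                restart_nfs()

    threading.Thread(target=watch, daemon=True).start()
    return stop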
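
And a sketch of the absolute-path override, also described above; the regex for recognizing a Canine output path is a placeholder, not the pattern Canine actually uses:

import os
import re

# Placeholder pattern for a Canine output path (e.g. .../outputs/<job_id>/<output_name>/...).
CANINE_OUTPUT_RE = re.compile(r"/outputs/\d+/[^/]+(/|$)")

def should_symlink(input_path: str, nfs_mount: str) -> bool:
    """Decide whether an input should be symlinked into the inputs directory.

    Absolute paths that already live on the NFS share are passed through as
    string literals (e.g. a reference FASTA whose index files sit beside it),
    unless they look like Canine outputs from an upstream wolF task.
    """
    if not os.path.isabs(input_path):
        return True
    nfs_mount = os.path.abspath(nfs_mount)
    on_same_share = os.path.commonpath([input_path, nfs_mount]) == nfs_mount
    if on_same_share and not CANINE_OUTPUT_RE.search(input_path):
        return False  # leave the path as-is; the localizer treats it as a string literal
    return True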

Google artificially scales disk performance by size; a 2 TB disk seems like
a sweet spot
If a file specified as an absolute path resides on the same NFS as the workspace,
we shouldn't symlink it into the workspace -- chances are the user meant to
supply it as an absolute path.

If the user truly wants to symlink, they can specify this manually with an override.
This makes it more intuitive for a user to override the NFS backend's new default behavior.
If a batch job of length 0 makes it to here, it was definitely completely
avoided -- prior steps ensure that batch jobs of length 0 cannot proceed unless
avoided.
- Remove entire output directory if:
  - dataframe is empty (run likely crashed)
  - dataframe doesn't exist (run never started?)
- Allow for all output to be removed, regardless of status (overwrite = True)
This is in the case of job avoidance -- if job_id 0 gets avoided, but job_id 1
gets re-run, then Slurm task array ID 0 -> job_id 1

We do, however, assume that ordering is preserved between the two, because Python dictionaries are guaranteed to be insertion ordered.
In 28de503, we assume that Orchestrator.job_spec's elements will be insertion
ordered. This is only guaranteed in Python >= 3.7.

In the future, we might want to be less lazy (and explicitly store the insertion order ourselves) to relax this version requirement.
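
A minimal sketch of the array-ID remapping described above and the insertion-order assumption it relies on (all names here are illustrative):

def task_array_to_job_id(job_spec: dict, avoided: set) -> dict:
    """Map Slurm task array indices to original job_ids after job avoidance.

    Relies on job_spec preserving insertion order, which plain dicts guarantee
    in Python >= 3.7 (collections.OrderedDict would make this explicit).
    """
    remaining = [job_id for job_id in job_spec if job_id not in avoided]
    return dict(enumerate(remaining))

# e.g. task_array_to_job_id({"0": "spec_a", "1": "spec_b"}, avoided={"0"})  ->  {0: "1"}
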
Because this is created by the localizer constructor, many localizer methods
assume that it already exists.
Otherwise, this will just fail.
Rather than cancelling by state, cancel based on user
- For some reason, cancelling by state was occasionally failing

Wait up to 60 seconds for all jobs to have stopped -- scancel does not block
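
A sketch of that shutdown sequence: scancel and squeue are standard Slurm commands; the 60-second timeout comes from the commit message above, while the polling loop and its 5-second interval are assumptions for this sketch.

import subprocess
import time

def cancel_and_wait(user: str, timeout: float = 60.0, poll: float = 5.0) -> bool:
    """Cancel all of a user's jobs and wait up to `timeout` seconds for them to stop.

    scancel returns immediately, so we poll squeue until the user's queue is empty.
    """
    subprocess.run(["scancel", "-u", user], check=True)
    deadline = time.time() + timeout
    while time.time() < deadline:
        queue = subprocess.run(
            ["squeue", "-h", "-u", user],  # -h suppresses the header line
            capture_output=True, text=True, check=True
        ).stdout
        if not queue.strip():
            return True
        time.sleep(poll)
    return False
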
These are now included in the output DF, so there's no reason to exclude them
from field comparison.
If NFS got preempted before the workflow exits, this could raise an error.
Sometimes, shutil.rmtree will hang due to NFS lag. Retry five times,
then abort, which causes the entire task directory to be removed.
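
A sketch of the retry loop described here; the five attempts come from the commit message, while the one-second delay between attempts is an assumption:

import shutil
import time

def rmtree_with_retry(path: str, attempts: int = 5, delay: float = 1.0) -> None:
    """Remove a directory tree, retrying a few times to ride out NFS lag."""
    for attempt in range(attempts):
        try:
            shutil.rmtree(path)
            return
        except OSError:
            if attempt == attempts - 1:
                raise  # give up; per the commit message, the whole task directory then gets removed
            time.sleep(delay)
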
Otherwise, the localizer will try (and fail) to save common inputs to the
common directory. If job avoidance occurs, this directory will already be
present and thus crash the localizer.

Removing it wholesale is somewhat hacky, but it's a lot more straightforward
than modifying the orchestrator to coordinate this with the localizer.
In the future, we need to figure out how to make the Docker backend
less gcloud-specific.
@agraubert (Collaborator) left a comment:

I think this looks great, but it's too reliant on the NFSLocalizer right now. I would prefer if job avoidance worked on other backends/localizers.

-    self, nfs_compute_script = "/usr/local/share/cga_pipeline/src/provision_storage_container_host.sh",
-    compute_script = "/usr/local/share/cga_pipeline/src/provision_worker_container_host.sh",
+    self, nfs_compute_script = "/usr/local/share/slurm_gcp_docker/src/provision_storage_container_host.sh",
+    compute_script = "/usr/local/share/slurm_gcp_docker/src/provision_worker_container_host.sh",
     nfs_disk_size = 2000, nfs_disk_type = "pd-standard", nfs_action_on_stop = "stop", nfs_image = "",
     action_on_stop = "delete", image_family = "pydpiper", image = None,
     cluster_name = None, clust_frac = 0.01, user = os.environ["USER"], **kwargs
Collaborator:

Is there a reason that cluster_name isn't a positional? Enforcing it as a required optional seems weird.

Collaborator (Author):

I don't want there to be any positional arguments. I should really make this constructor signature

def __init__(self, *, nfs_compute_script, ...

to clarify this.

Contributor:

All arguments are positional in Python. They can be put in arbitrary order using names regardless of whether you set a default value. Not setting a default value simply makes an argument required.

Unless you're creating a complex object hierarchy where the available arguments change based on selected arguments or mixins, I am generally of the opinion that always-available arguments should be explicit in the constructor signature for the sake of clarity and readability.

Collaborator (Author):

You can have functions which do not support positional arguments:

In [1]: def foo(*, bah):
   ...:     pass
   ...:

In [2]: foo("a")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-2-1df0092b7087> in <module>
----> 1 foo("a")

TypeError: foo() takes 0 positional arguments but 1 was given

Contributor:

Ahh, that's a sneaky new Python 3 trick.

Contributor:

Most code I develop has been 2 and 3 compatible, so I haven't come across that pattern.

Collaborator (Author):

It's Python 3-only, though I wouldn't exactly call it "new" :-)

Added to the standard in 2006: https://www.python.org/dev/peps/pep-3102/

Collaborator:

Sorry, I guess I should clarify. I think that setting an argument to have a default kind of signifies that it's optional. I have required parameters in the NFSLocalizer that have a default and appear in the middle of the argument list for compliance with the base class API. Is there a reason not to move the required parameters to the front of the list and remove their defaults so it's more obvious in docs/inspection that they are required?

Collaborator (Author):

You're right, I was being real dumb. Let's make this a positional and remove the check in the constructor.

https://github.com/broadinstitute/canine/blob/e02d6a73ffb93442ca5c70d9cd4345f780b7a2c6/canine/backends/dockerTransient.py#L30-L31

Collaborator (Author):

(It's not like the exception message gives the user any additional information.)

Minor fixes and improvements to job avoidance
@agraubert (Collaborator), re: @julianhess #22 (comment)

Rather than defaulting it to None and raising an exception if it's None in the
constructor, just make it a required positional argument.
@agraubert merged commit a053e29 into getzlab:master on Feb 27, 2020