This is a patch for NNI that builds on version v2.10. It adds a new training service: `slurm`!
- Do you use NNI to run your machine learning experiments?
- Do your ML experiments run on compute nodes without internet access (for example, under a batch system)?
- Do your compute nodes and your head/login node (which has internet access) share a file system?

Then this package might be useful. If you use Weights & Biases, you might also be interested in this alternative.
Currently, this patch only supports SLURM, but it's quite simple to extend to other workload managers (e.g. PBS).
This package is built upon NNI. If you are new to NNI, please refer to the official documents.
If you are an experienced NNI user, simply change your training service to `slurm` (other types are not affected; change it back if needed):
```yaml
trainingService:
  platform: slurm
  resource:
    gres: gpu:NVIDIAGeForceRTX2080Ti:1 # request 1 RTX 2080 Ti for each trial
    time: 1000                         # wall time for each trial
    partition: critical                # request partition critical for resource allocation
  useSbatch: false
  useWandb: true
```
or if you use a python script:
```python
experiment = Experiment('slurm')
experiment.config.training_service.resource = {
    'gres': 'gpu:NVIDIAGeForceRTX2080Ti:1',  # request 1 RTX 2080 Ti for each trial
    'time': 1000,                            # wall time for each trial
    'partition': 'critical',                 # request partition critical for resource allocation
}
experiment.config.training_service.useSbatch = False
experiment.config.training_service.useWandb = True
```
Then run `nnictl create --config config.yaml` or execute the python script on the login node. This starts the NNI server on the login node and submits slurm jobs.
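Putting the Python pieces together, a minimal launch script might look like this. The trial command, search space, tuner, and port are placeholders borrowed from the standard NNI quickstart, not part of this patch; only the `slurm`-specific fields below come from this training service:

```python
from nni.experiment import Experiment

experiment = Experiment('slurm')

# Placeholder trial setup -- substitute your own script and search space.
experiment.config.trial_command = 'python train.py'
experiment.config.trial_code_directory = '.'
experiment.config.search_space = {
    'lr': {'_type': 'loguniform', '_value': [1e-4, 1e-1]},
}
experiment.config.tuner.name = 'TPE'
experiment.config.max_trial_number = 10
experiment.config.trial_concurrency = 2

# slurm-specific settings from this patch
experiment.config.training_service.resource = {
    'gres': 'gpu:NVIDIAGeForceRTX2080Ti:1',
    'time': 1000,
    'partition': 'critical',
}
experiment.config.training_service.useSbatch = False
experiment.config.training_service.useWandb = True

# Run on the login node; NNI submits slurm jobs for the trials.
experiment.run(8080)
```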
There are only 4 parameters in the `slurm` training service:

- `platform`: `str`. Must be `slurm`.
- `resource`: `Dict[str, str]`. Arguments for submitting a single job to the SLURM system. Do not add hyphens (`-` or `--`) at the front. Depending on how the job is submitted (`srun` or `sbatch`), the available options differ slightly. See the SLURM docs (srun, sbatch) for more details. Feel free to use numbers -- they will automatically be converted to strings when the config is read.
- `useSbatch`: `Optional[bool]`. Use `sbatch` to submit jobs instead of `srun`. The upside: if the login node crashes accidentally, your jobs will not be affected. The downside: output is buffered, so it arrives with a delay (metrics are not affected). Default: `False`.
- `useWandb`: `Optional[bool]`. Submit the trial logs to W&B. If you have logged in to W&B on your system before, it will use your account; otherwise, it will automatically create an anonymous account for you. Default: `True`.
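To illustrate the two rules above (no leading hyphens in `resource` keys, numbers converted to strings), here is a hypothetical sketch of how resource options could be turned into `srun`/`sbatch` flags. The function names are mine for illustration, not the patch's actual internals:

```python
def normalize_resource(resource):
    """Numeric values are accepted and converted to strings when the config is read."""
    return {key: str(value) for key, value in resource.items()}

def to_cli_flags(resource):
    """Each key becomes a long option, which is why keys must not already carry hyphens."""
    return [f'--{key}={value}' for key, value in normalize_resource(resource).items()]

resource = {'gres': 'gpu:NVIDIAGeForceRTX2080Ti:1', 'time': 1000, 'partition': 'critical'}
print(to_cli_flags(resource))
# -> ['--gres=gpu:NVIDIAGeForceRTX2080Ti:1', '--time=1000', '--partition=critical']
```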
You can find a complete project example here. It is modified from the NNI official tutorial HPO Quickstart with PyTorch.
W&B provides more detailed experiment analysis (e.g. parameter importance, machine status, etc.).
If you enabled `useWandb` (the default), you should see a new tab on the navigation bar:
This is the web link to the W&B project of this experiment. After a trial succeeds, you could also see a link to this trial:
Caution: the W&B link is only valid if at least one trial succeeds. Only successful trials are recorded by W&B.
By default, the W&B link remains available for 7 days. If you want to keep the data for future analysis, you may claim the experiment to your account. If you are logged in to a W&B account on the login node, the experiment will be saved to your account automatically.
Just download the patch wheel and install it with `pip`:

```sh
wget https://github.com/whyNLP/nni-slurm/releases/download/v2.11.1/nni-2.11.1-py3-none-manylinux1_x86_64.whl
pip install nni-2.11.1-py3-none-manylinux1_x86_64.whl
```
Simply running `pip uninstall nni` will completely remove this patch from your system.
It might be caused by a low gcc version. You might want to install gcc through spack, which only requires a few commands. See microsoft#4914 for more details.
This problem might have been fixed in the upcoming NNI 3.0. This patch uses a temporary fix: give it more retries. See microsoft#3496 for more details.
It has been fixed in this patch. See microsoft#5468 for more details.
Error: `tuner_command_channel: Tuner did not connect in 10 seconds. Please check tuner (dispatcher) log.`
In most cases, this error means that the login node is too slow (heavy workload on CPU and memory). This patch has extended the connection time to 120 seconds.
This is a bug in wandb. Changing the tmp dir to your home directory can be a quick fix:
```sh
mkdir ~/tmp
export TMPDIR=~/tmp
```
See wandb#2845 for more details.
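If you launch NNI from a Python script, the same workaround can be applied in-process, provided it runs before wandb creates any temp files. `~/tmp` is just a convention; any writable directory under your home works:

```python
import os
import pathlib
import tempfile

# Point TMPDIR at a directory under $HOME before wandb (or tempfile) uses it.
tmp_dir = pathlib.Path.home() / 'tmp'
tmp_dir.mkdir(exist_ok=True)
os.environ['TMPDIR'] = str(tmp_dir)
tempfile.tempdir = None  # reset tempfile's cached default so the new TMPDIR takes effect

print(tempfile.gettempdir())
```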
It depends. There are some solutions:

- Run in `local` mode with the srun command. Potential problem: the login node cannot use tail-stream, so listening for file changes will fail. The symptom is that no metrics get updated.
- Run in `remote` mode with the srun command, but connect to `localhost`. Potential problem: the tmp folder does not sync to the compute node. Also, you might not be able to reach the login node from the compute node.

See microsoft#1939, microsoft#3717 for more details.
An advantage of this patch is that it supports safe resume. Thanks to the slurm system, your trials won't be affected by a failure of the NNI manager. If NNI stops due to an error, your existing trials will continue to run. You may resume the experiment later using `nnictl resume <experiment_id>` (docs) or a python script (docs). This patch will read the trial logs and update their status, instead of ignoring the running trials. Below is an example NNI timeline with a trial concurrency of 2.
```text
Time line ---+--------------+-----------+---------+-----------+----------------+------------------------------
User Start NNI -- Go to sleep Wake up, Resume ----------------------
NNI +--------------------------+------ Error Resume -----------------+---------
Trial 1 +---------------------- Finish
Trial 2 +----------------------------------------------Finish Register Result
Trial 3 +----------------------------- Register Progress -------- Finish
Trial 4 +------------------------------
Trial 5 +---------
```
Notice that if you stop the experiment manually (e.g. with the command `nnictl stop ...`), NNI will cancel all running trials in order to release the resources.
I have no plans to create a pull request. This patch is not fully tested, and the code style is not fully consistent with NNI requirements. I developed this patch for personal use only.