# From single-GPU to multi-GPU training of PyTorch applications at NERSC

This repo covers material from the Grads@NERSC event. It includes minimal example scripts that show how to move from Jupyter notebooks to scripts that can run on multiple GPUs (and multiple nodes) on the Perlmutter supercomputer at NERSC.

## Jupyter notebooks

We recommend running machine learning workloads for testing or small-scale runs in a Jupyter notebook, which gives you full interactivity. At NERSC, this can be done easily through JupyterHub with the following steps:

  1. Visit JupyterHub
  2. Select the required resource on Perlmutter and navigate to your local notebook. A minimal notebook is shared here as an example -- most workflows build on the same basic building blocks. The notebook implements a common PyTorch workflow (a rough sketch is given after this list) that includes
    • defining the neural network architecture with torch.nn and PyTorch data loaders for processing the data
    • a general optimization harness with training and validation loops over the respective datasets.
  3. There are built-in Jupyter kernels for machine learning that are based on the PyTorch modules (similar kernels exist for the other ML frameworks as well). A quick recommendation is to use the pytorch-2.0.1 kernel to get PyTorch 2.0 features. To build custom kernels based on your own conda environment (or other means), please see the NERSC docs. The docs are also a great resource on how to use Jupyter on Perlmutter in general. You may also review the best practices for JupyterHub.
  4. Another quick way to install your own libraries on top of what is included in the modules is to simply do pip install --user <library_name>. The --user flag will always install libraries into the path defined by the environment variable PYTHONUSERBASE.
  5. Quick note on libraries: while you can build your own conda environment, we generally recommend using the provided modules or containers. In either case, if you need libraries beyond what is already included, use the --user flag so that they are installed into PYTHONUSERBASE. For modules, this variable is defined by default; for containers, we recommend setting it to a local path so that user-installed libraries do not interfere with the default environment.
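The notebook itself is the reference, but as a rough illustration of the building blocks it combines, a minimal PyTorch training/validation workflow looks something like the sketch below (the tiny synthetic dataset and two-layer network here are purely illustrative stand-ins, not the notebook's actual code):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-ins for a real dataset and architecture
train_ds = TensorDataset(torch.randn(512, 16), torch.randn(512, 1))
val_ds = TensorDataset(torch.randn(128, 16), torch.randn(128, 1))
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=32)

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    model.train()
    for x, y in train_loader:            # training loop
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                # validation loop
        val_loss = sum(loss_fn(model(x.cuda()), y.cuda()).item()
                       for x, y in val_loader) / len(val_loader)
    print(f"epoch {epoch}: val_loss = {val_loss:.4f}")
```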

## Notebooks to scripts

As you move to larger workloads, the general recommendation is to use scripts -- this is especially true for multi-GPU workloads, which are tricky to run from Jupyter notebooks. We also recommend organizing parts of your code, such as data loaders and neural network definitions, into separate subdirectories for a more modular codebase. This makes it easier to add and modify features as the code evolves.

  • The example notebook has been converted into the train_single_gpu script with a class structure that allows for easier extension into custom workflows.
  • We have also added two routines that implement checkpoint-restart (the general pattern is sketched after this list). This allows you to resume training from where a previous run ended by loading a saved model checkpoint (along with the optimizer and learning rate scheduler states). We highly recommend using checkpoint-restart when submitting jobs, since the Perlmutter regular GPU queue has a maximum job time limit of 24 hours; training for longer than that requires multiple jobs with checkpoint-restart.
  • To run a quick single-GPU script, follow these steps:
    • Request an interactive node with salloc --nodes 1 --qos interactive -t 30 -C gpu -A <your_account>
    • Load a default PyTorch module environment with module load pytorch/2.0.1. You may also use your own conda environment with module load conda; conda activate <your_env> (plus loading any background modules, such as cudnn, needed by your conda environment).
    • Run python train_single_gpu.py
  • The script will save a model checkpoint every epoch to the outputs directory, as well as the best model checkpoint, i.e., the one with the lowest validation loss so far (a general strategy for choosing a model that avoids overfitting).
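The exact routines are in the script; the general checkpoint-restart pattern they implement looks roughly like the sketch below (the dictionary keys and file path are illustrative and may not match the repo's code exactly):

```python
import os
import torch

def save_checkpoint(path, model, optimizer, scheduler, epoch):
    # Save everything needed to resume: weights, optimizer and LR scheduler
    # state, and the epoch counter.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "scheduler_state": scheduler.state_dict(),
    }, path)

def restore_checkpoint(path, model, optimizer, scheduler):
    # Resume from a previous run if a checkpoint exists; otherwise start at epoch 0.
    if not os.path.isfile(path):
        return 0
    checkpoint = torch.load(path, map_location="cuda")
    model.load_state_dict(checkpoint["model_state"])
    optimizer.load_state_dict(checkpoint["optimizer_state"])
    scheduler.load_state_dict(checkpoint["scheduler_state"])
    return checkpoint["epoch"] + 1  # first epoch of the resumed run
```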

## Transforming single-GPU scripts for running on multiple GPUs (and nodes)

To speed up training on large datasets, the most common strategy is data parallelism. The easiest framework here is PyTorch DistributedDataParallel (DDP), which comes with comprehensive tutorials on how to use it. Following these, we can convert our single-GPU script to multi-GPU with a few simple steps (a consolidated sketch follows the list):

  1. Initialize torch.distributed using
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    
    This will pick up the world_size (total number of GPUs used) and rank (rank of the current GPU) from environment variables that you need to set before running -- the launch scripts below show how to do that.
  2. Set the local GPU device using the LOCAL_RANK environment variable (defined in the same way as above) with
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    
  3. Wrap the model with DistributedDataParallel using
    model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)
  4. Proceed with training as you would on a single GPU. DDP will automatically sync gradients across devices when you call loss.backward() during backpropagation.
  5. Clean up the process groups after training with dist.destroy_process_group()
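Putting the numbered steps together, a minimal DDP training script looks roughly like this (a sketch: the toy model and data stand in for your own code, and it assumes the environment variables described in the next section have already been set by the launcher):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Step 1: initialize the process group; world_size and rank are read from env vars
dist.init_process_group(backend="nccl", init_method="env://")

# Step 2: pin this process to its local GPU
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model and data standing in for your own code
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
dataset = TensorDataset(torch.randn(512, 16), torch.randn(512, 1))
sampler = DistributedSampler(dataset)     # each rank gets a different shard of the data
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Step 3: wrap the model with DDP
model = DistributedDataParallel(model, device_ids=[local_rank], output_device=local_rank)

# Step 4: train as on a single GPU; DDP syncs gradients during loss.backward()
for epoch in range(2):
    sampler.set_epoch(epoch)              # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Step 5: clean up the process group
dist.destroy_process_group()
```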

To launch the job:

  1. Define the environment variables that allow torch.distributed to set up the distributed environment (world_size and rank). We implement that in this bash script that can be sourced when you allocate multiple GPUs using srun.
  2. Set $MASTER_ADDR with export MASTER_ADDR=$(hostname) (see the sketch below for how these variables fit together)
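The bash script itself is the reference for what gets exported; as a rough guide, an equivalent (hypothetical) way to do the same mapping inside Python, before calling init_process_group, is to translate Slurm's per-task variables into the names that torch.distributed expects:

```python
import os

# Hedged sketch: map Slurm's per-task variables to the names that
# init_method='env://' reads. The repo's export_DDP_vars.sh does this in bash
# and may differ in details; MASTER_PORT here is an arbitrary free port.
os.environ["RANK"] = os.environ["SLURM_PROCID"]          # global rank of this task
os.environ["WORLD_SIZE"] = os.environ["SLURM_NTASKS"]    # total number of tasks (GPUs)
os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]   # rank within this node
os.environ.setdefault("MASTER_PORT", "29500")            # any free port
# MASTER_ADDR is exported in the batch script: export MASTER_ADDR=$(hostname)
```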

We implement these in our submit scripts that you can launch:

  • If you are using the PyTorch modules (or your own conda env), submit submit_batch_modules.sh with sbatch submit_batch_modules.sh. Note that in the script we have:
    • Loaded the environment with module load pytorch/2.0.1
    • Set up the $MASTER_ADDR with export MASTER_ADDR=$(hostname)
    • Sourced the environment variables within the srun command with source export_DDP_vars.sh: these set the necessary variables for torch.distributed based on the resources allocated by srun.
  • For running with shifter (containers), see submit_batch_shifter.sh. The commands are mostly the same except we use a containerized environment. Both scripts currently submit a 2 node job, but this can be changed to any number of nodes.

## Other best practices

  • We recommend using containers for more optimized libraries and better performance. NERSC provides PyTorch containers based on the NVIDIA GPU cloud containers. To query a list of PyTorch containers on Perlmutter, you can use shifterimg images | grep pytorch. The example shifter submit scripts use the container nersc/pytorch:ngc-23.07-v0.
  • Quick DDP note: checkpoint-restart needs a slight modification if you load a model that was trained on multiple GPUs (i.e., with the DistributedDataParallel wrapper) into a model without the wrapper (for example, when running inference on a single GPU):
    from collections import OrderedDict

    try:
        self.model.load_state_dict(checkpoint['model_state'])
    except Exception:
        # The checkpoint was saved from a DDP-wrapped model, so every key is
        # prefixed with "module."; strip that prefix and retry.
        new_state_dict = OrderedDict()
        for key, val in checkpoint['model_state'].items():
            name = key[7:]  # drop the "module." prefix (7 characters)
            new_state_dict[name] = val
        self.model.load_state_dict(new_state_dict)

Models wrapped with DDP prefix every key in the state dict with "module.", which needs to be removed when loading into an unwrapped model. The lines above (from the repo scripts) take care of this automatically.
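An alternative (a sketch of a common pattern, not necessarily what the repo's scripts do) is to strip the wrapper at save time, so the checkpoint loads cleanly with or without DDP:

```python
import torch
from torch.nn.parallel import DistributedDataParallel

# Save the underlying module's weights so the checkpoint has no "module." prefix.
# (model and checkpoint_path come from the surrounding training code.)
to_save = model.module if isinstance(model, DistributedDataParallel) else model
torch.save({"model_state": to_save.state_dict()}, checkpoint_path)
```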

  • For logging application-specific metrics/visualizations and automatic hyperparameter optimization (HPO), we recommend Weights & Biases. See this tutorial that extends the above scripts to include Weights & Biases logging and automatic HPO on multi-GPU tests.
  • Before moving to data parallelism, we recommend first optimizing your code to run efficiently on a single GPU (one common optimization is sketched below). Check out this in-depth tutorial that takes you step by step through developing a large-scale AI for Science application: it covers single-GPU optimizations and profiling, data parallelism, and, for very large models that do not fit on a single GPU, model parallelism.
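As one concrete example of a single-GPU optimization (our illustration, not taken from the linked tutorial), automatic mixed precision can give a substantial speedup on Perlmutter's A100 GPUs with only a few extra lines in the training loop:

```python
import torch

# model, optimizer, loss_fn and train_loader as in the earlier sketches
scaler = torch.cuda.amp.GradScaler()

for x, y in train_loader:
    x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()     # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)            # unscale gradients, then take the optimizer step
    scaler.update()
```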