-
Notifications
You must be signed in to change notification settings - Fork 0
/
.env.example
59 lines (49 loc) · 2.92 KB
/
.env.example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
# Edit these variables according to your need
# choose an Enroot Image
# IMPORTANT: For the scripts provided here to function correctly, there needs to be conda installed in that image!
CONTAINER_IMAGE=/netscratch/enroot/nvcr.io_nvidia_pytorch_23.07-py3.sqsh
# chooses a Slurm partition
PARTITION=batch
#PARTITION=RTXA6000-SLT
# set the max time limit for the job
# the general format to be used is d-hh:mm:ss
# you can ignore the mm:ss to only set the number of days and hours
MAX_TIME_LIMIT=3-0
# resource allocation (RESOURCE_ALLOCATION):
N_TASK=1
GPUS_PER_TASK=1
CPUS_PER_GPU=16
MEMORY_PER_CPU=4G
# assemble everything (feel free to modify this):
RESOURCE_ALLOCATION="--ntasks=$N_TASK --gpus-per-task=$GPUS_PER_TASK --cpus-per-gpu=$CPUS_PER_GPU --mem-per-cpu=$MEMORY_PER_CPU"
# use the current working directory as the working directory inside the container
CONTAINER_WORKDIR=`pwd`
# We mount the following (CONTAINER_MOUNTS):
# * the whole /netscratch, e.g. to keep previously created symlinks into that alive,
# * datasets /ds as readonly,
# * the working directory, $HOST_WORKDIR, which is usually the current working directory, is mapped to the $CONTAINER_WORKDIR,
# * existing conda environments, and
# * we persist any cache locations to $HOST_CACHEDIR.
HOST_WORKDIR=`pwd`
# ensure that this path exists at the host!
HOST_CACHEDIR=/netscratch/$USER/.cache_slurm
# change this if you have installed miniconda to another location
HOST_CONDA_ENVS_DIR=/netscratch/$USER/miniconda3
# change this if you use another Enroot image with a different conda location
CONTAINER_CONDA_ENVS_DIR=/opt/conda
# assemble everything:
CONTAINER_MOUNTS=/netscratch:/netscratch,/ds:/ds:ro,"$HOST_WORKDIR":"$CONTAINER_WORKDIR","$HOST_CONDA_ENVS_DIR":"$CONTAINER_CONDA_ENVS_DIR","$HOST_CACHEDIR":"/home/$USER/.cache","$HOST_CACHEDIR":/root/.cache,"$HOST_CACHEDIR":/home/root/.cache
# Define which conda environment should be activated. Set to $CONDA_DEFAULT_ENV, to use the one that is activated at
# the host. If not defined, no conda environment will be activated and the base python interpreter will be used.
CONDA_ENV=$CONDA_DEFAULT_ENV
# default job name is name of conda env
JOB_NAME=$CONDA_ENV
# Define which pip requirements file should be used to install additional packages. If not defined, no additional
# packages will be installed.
# PIP_REQUIREMENTS_FILE=requirements.txt
# Define additional environment variables that should be set inside the container.
# - NCCL_SOCKET_IFNAME and NCCL_IB_HCA are required for multi-node training with NCCL.
# - PIP_INDEX_URL and PIP_TRUSTED_HOST are set to use the local PyPI cache of the DFKI cluster.
# - PIP_NO_CACHE=true disables pip's default cache (we already use the fast local cache)
# - PIP_REQUIREMENTS_FILE and CONDA_ENV are defined above.
EXPORT="NCCL_SOCKET_IFNAME=bond,NCCL_IB_HCA=mlx5,PIP_INDEX_URL=http://pypi-cache/index,PIP_TRUSTED_HOST=pypi-cache,PIP_NO_CACHE=true,PIP_REQUIREMENTS_FILE=$PIP_REQUIREMENTS_FILE,CONDA_ENV=$CONDA_ENV"