Useful tips and tricks for high-performance computing on the IPMU cluster.
There are two main machines at IPMU:
idark
is the main computer cluster.gpgpu
is a box with 8 GPUs.
These machines are managed by IPMU's IT team, who can be reached at it 'at' ipmu 'dot' jp
. See the internal webpage (ask the IT team for access) for technical details and specifications.
The machines can only be accessed from within the campus intranet. To use them from home, ask the IT team for VPN connection information.
You will need an account on the machines you wish to access. Follow the IT team's instructions, which will involve sending them your ssh public key.
Once you're all set up you can connect to the servers with
$ ssh [username]@idark.ipmu.jp # for idark
$ ssh [username]@192.168.156.71 # for gpgpu
Once you ssh onto the cluster, you'll want to see what everyone else is doing.
The job manager on idark is PBS. To see what jobs are running, run
[username@idark ~]$ qstat
On gpgpu, the job manager is slurm. The equivalent command is
[username@gpgpu ~]$ squeue
It's important to know that there is no central system to allocate the GPUs on gpgpu, so you need to check which are available using the command
[username@gpgpu ~]$ nvidia-smi
Before running your own jobs, you may wish to set up your Python environment. Python 3 is already installed on the cluster, and there are two main approaches to managing your python installation: conda
and pyenv
. Choose whichever you prefer.
Conda is a python package and environment manager. Suppose you start a new project and want to use all of the up-to-date versions of your favorite python modules -- but want to keep older versions available for compatibility with previous projects. This is the rationale of conda. With conda you can create a library of separate python environments that you can activate anywhere on the cluster -- from the login node or compute nodes, either in interactive mode or directly inside your job scripts.
The first time you log on to the cluster, run
[username@idark ~]$ conda init
This will add an initialization script to your ~/.bashrc
that is executed whenever you login to the cluster or allocate to a compute node. The next time you login, you will see (base)
next to your credentials on the command line, indicating that your base
conda environment is active. When you run python,
(base) [username@idark ~]$ python
and you will see that the version that has been activated is Python 3.8.10.
If you prefer, you can always turn off automatic activation of the environment by doing:
[username@idark ~]$ conda config --set auto_activate_base false
and instead manually activate the environment using
[username@idark ~]$ source /home/anaconda3/bin/activate
or
[username@idark ~]$ conda activate base
To deactivate a conda environment, just run conda deactivate
.
You can also create your own environments with conda with any specified python version:
[username@idark ~]$ conda create -n project-env python=3.9
This command creates a new python environment called project-env
which runs the latest stable version of Python 3.9. Once you have created the new environment, you can view it and all other existing environments (including base
) with conda env list
. You can activate any of these environments as follows:
[username@idark ~]$ conda activate project-env
replacing project-env
with the name of your environment.
The default installation will be fairly bare-bones, but you can now start installing packages. To install numpy, for example, you can do:
(project-env) [username@idark ~]$ conda install numpy
If a package you want is unavailable from the conda package manager, you can use pip
within a conda environment instead. Type which pip
and you will see that it is a specific pip installation for your python environment. Note, however, that this can sometimes lead to conflicts; see here for details.
numpy
is one of many useful Python packages. Wouldn't it be nice if there is a stack of all the useful scientific packages so that you wouldn't have to install them all separately and think about dependencies? Oh yeah:
(project-env) [username@idark ~]$ pip install scipy-stack
Later, we will see how to use this python environment for jobs. For more documentation on conda, including more management options and deletion of environments, check out the conda docs.
Unlike conda, which is a full package manager, pyenv is a simpler tool that just manages your different python versions. The idea is to use pip as the package manager and python's built in virtual environment feature to separate dependencies for different projects.
pyenv does not come pre-installed on the cluster machines. To install it, we use the pyenv-installer. First run
[username@idark ~]$ curl https://pyenv.run | bash
and then add the following lines to your bash profile ~/.bashrc
:
export PYENV_ROOT="$HOME/.pyenv"
command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
After restarting your terminal, pyenv should now be in your path and you can see the list of installable python versions:
[username@idark ~]$ pyenv install --list # see a list of installable versions
and then install one using
[username@idark ~]$
CPPFLAGS=-I/usr/include/openssl11 \
LDFLAGS=-L/usr/lib64/openssl11 \
PYTHON_CONFIGURE_OPTS="--enable-shared" \
pyenv install -v 3.11.3 # or something different
The new version will now show up if you run pyenv versions
, and you can set it as your global python version using
[username@idark ~]$ pyenv global 3.11.3
[username@idark ~]$ python --version
# Python 3.11.3
[username@idark ~]$ which python
# ~/.pyenv/shims/python
[username@idark ~]$ which pip
# ~/.pyenv/shims/pip
You can also set local python versions for different projects using pyenv local
. Then to manage dependencies for different projects, you use python's built in venv
feature. Here is a tutorial, and the tldr is as follows. Starting from your project folder, initialize a new virtual environment and activate it with
[username@idark project]$ python -m venv project-venv
[username@idark project]$ source project-venv/bin/activate
At which point you will see (project-venv) in the command prompt. If you now install dependencies or projects with pip, they will be installed in the local virtual environment.
Anytime you want to run code on the cluster, you should do so in a job. There are a lot of different types of jobs.
The basic pattern for a PBS job on idark is
#!/bin/bash
#PBS -N demo
#PBS -o /home/username/PBS/
#PBS -e /home/username/PBS/
#PBS -l select=1:ncpus=1:mem=4gb
#PBS -l walltime=0:0:30
#PBS -u username
#PBS -M [email protected]
#PBS -m ae
#PBS -q tiny
# activate python environment
source ~/.bashrc
conda activate project-env
# for pyenv/virtualenv instead use
# source /home/username/project/project-venv/bin/activate
python my_program.py
For a job on gpgpu, make sure to first run nvidia-smi
and identify GPUs which are not being used. Then a typical slurm file looks like
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --account=username
#SBATCH --output=/home/username/log/%j.out
#SBATCH --error=/home/username/log/%j.err
#SBATCH --time==0+00:01:00
#SBATCH --gres=gpu:1
#SBATCH --mem=32G
#SBATCH --cpus-per-gpu=6
#SBATCH [email protected]
#SBATCH --mail-type=END,FAIL
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=X # where X is the GPU id of an available GPU
# activate python environment
conda activate project-env
# for pyenv/virtualenv instead use
# source /home/username/project/project-venv/bin/activate
python my_program.py
Please use only one GPU at a time, to prevent congestion.
On idark, you can directly access the compute node that is running your job. First identify the node by running
[username@idark] $ qstat -nrt -u username
The node will have a name of the format ansysNN
where NN
in a 2-digit number between 01 and 40.
Then use rsh
to connect to it
[username@idark ~]$ rsh ansysNN
[username@ansysNN ~]$
Using jupyter is a great way to make working remotely a little easier and more seamless.
You should always run jupyter in a job. Make a job script which runs the command
jupyter lab --no-browser --ip=0.0.0.0 --port=X
where X is some port you choose, e.g 1337. Each port can only be used by one application at a time so make sure your jupyter session chooses a unique port number.
Now you just need to forward some port Y on your local computer to the remote port X. On gpgpu, this is as simple as running
$ ssh -NfL Y:localhost:X [email protected] #for gpgpu
Now opening your browser to localhost:Y will give you direct access to jupyter on gpgpu. This same kind of port forwarding is used if you want to use MLFlow or tensorboard to track your ML experiments.
On idark, you first need to find node your jupyter is running on by running qstat -nrt -u username
.
Then on your local computer just run
$ ssh -NfL Y:ansysNN:X [email protected] #for idark
where ansysNN is the node name. You'll now be able to access jupyter by pointing your browser to localhost:Y . The first time you connect to a jupyter session, it may prompt you for a connection token. You can get the token by connecting to the compute node using rsh ansysNN
from the head node, and then running jupyter server list
. You can use that token on the connection page to set a jupyter password which you can use to connect to future jupyter sessions on any node.
To register a python environment with jupyter, just activate the environment and then run
python -m ipykernel install --name project-venv --user
and you'll see a kernel with name project-venv next time you launch jupyter.
Most IDEs you run on your local computer can be connected to the cluster to allow you to edit your files there seamlessly. For vscode, for example, follow the instructions here.
The /home
file system has very limited space and may rapidly get congested if too many users are working and storing files there. As with previous clusters, iDark has a designated file system for work:
/lustre/work/username
The snag is that the output files from job scripts will not save to this file system -- could be quite inconvenient. This issue has been raised with IT, but for now you should create a directory in the /home
file system (e.g. /home/username/tmp/pbs/
where those files can be directed in the #PBS -o
and #PBS -e
lines of your job scripts.
If you have any comments or suggestions for this goodie bag of tips and tricks, please send them to connor.bottrell 'at' ipmu 'dot' jp
with a description or make a pull request directly to this repo.