Skip to content

Commit

Permalink
Merge pull request #125 from EPCCed/aaron-docs
Browse files Browse the repository at this point in the history
Merging CS2 Instructions
  • Loading branch information
rkm authored Dec 15, 2023
2 parents f0fa69a + 599a995 commit e9142f4
Showing 1 changed file with 84 additions and 22 deletions.
106 changes: 84 additions & 22 deletions docs/services/cs2/run.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,48 +2,110 @@

## Introduction

The Cerebras CS-2 system is attached to the Ultra2 system which serves as a host, provides access to files, the SLURM batch system etc.
The Cerebras CS-2 Wafer-scale cluster (WSC) uses the Ultra2 system which serves as a host, provides access to files, the SLURM batch system etc.

## Connecting to the CS-2
## Connecting to the cluster

To gain access to the CS-2 you need to login to the host system, Ultra2 (also called SDF-CS1). See the [documentation for Ultra2](../ultra2/run.md#login).
To gain access to the CS-2 WSC you need to login to the host system, Ultra2 (also called SDF-CS1). See the [documentation for Ultra2](../ultra2/run.md#login).

## Running Jobs

All jobs must be run via SLURM to avoid inconveniencing other users of the system. The `csrun_cpu` and `csrun_wse` scripts themselves contain calls to `srun` to work with the SLURM system, so note the omission of `srun` in the below examples.
Users can either copy these files from `/home/y26/shared/bin` to their own home directory should they wish, or use the centrally supplied version. In either case, ensure they are in your `PATH` before execution, eg:
All jobs must be run via SLURM to avoid inconveniencing other users of the system. An example job is shown below.

```bash
export PATH=$PATH:/home/y26/shared/bin
```

### Run on the host
### SLURM example

Jobs can be run on the host system (eg simulations, test scripts) using the `csrun_cpu` wrapper. Here is the example from the Cerebras documentation on PyTorch. Note that this assumes csrun_cpu is in your path.
This is based on the sample job from the Cerebras documentation [Cerebras documentation - Execute your job](https://docs.cerebras.net/en/latest/wsc/getting-started/cs-appliance.html#execute-your-job)

```bash
#!/bin/bash
#SBATCH --job-name=Example # Job name
#SBATCH --cpus-per-task=2 # Request 2 cores
#SBATCH --output=example_%j.log # Standard output and error log
#SBATCH --time=01:00:00 # Set time limit for this job to 1 hour
#SBATCH --gres=cs:1 # Request CS-2 system

csrun_cpu python-pt run.py --mode train --compile_only --params configs/<name-of-the-params-file.yaml>
source venv_cerebras_pt/bin/activate
python run.py \
CSX \
--params params.yaml \
--num_csx=1 \
--model_dir model_dir \
--mode {train,eval,eval_all,train_and_eval} \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used}
```
See the 'Troubleshooting' section below for known issues.
## Creating an environment
### Run on the CS-2
To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this [Cerebras setup environment docs](https://docs.cerebras.net/en/latest/wsc/getting-started/setup-environment.html) however our host system is slightly different so we recommend the following:
The following will run the above PyTorch example on the CS-2 - note the `--cs_ip` argument with port number passed in via the command line, and the inclusion of the `--gres` option to request use of the CS-2 via SLURM.
1. Create the venv
```bash
#!/bin/bash
#SBATCH --job-name=Example # Job name
#SBATCH --tasks-per-node=8 # There is only one node on SDF-CS1
#SBATCH --cpus-per-task=16 # Each cpu is a core
#SBATCH --gres=cs:1 # Request CS-2 system
#SBATCH --output=example_%j.log # Standard output and error log
#SBATCH --time=01:00:00 # Set time limit for this job to 1 hour
/opt/python3.8/bin/python3.8 -m venv venv_cerebras_pt
```
1. Install the dependencies
```bash
source venv_cerebras_pt/bin/activate
pip install --upgrade pip
pip install cerebras_pytorch==2.0.2
```
1. Validate the setup
csrun_wse python-pt run.py --mode train --cs_ip 172.24.102.121:9000 --params configs/<name-of-the-params-file.yaml>
```bash
source venv_cerebras_pt/bin/activate
cerebras_install_check
```
## Troubleshooting
### "Failed to transfer X out of 1943 weight tensors with modelzoo"
Sometimes jobs receive an error during the 'Transferring weights to server' like below:
```
2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered.
2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress.
```
Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:
1. From within your python venv, edit the <venv>/lib64/python3.8/site-packages/cerebras_pytorch/storage.py file
```bash
vi <venv>/lib64/python3.8/site-packages/cerebras_pytorch/storage.py
```
1. Navigate to line 672
```bash
:672
```
The section should look like this:
```
if modified_time > self._last_modified:
raise RuntimeError(
f"Attempting to materialize deferred tensor with key "
f"\"{self._key}\" from file {self._filepath}, but the file has "
f"since been modified. The loaded tensor value may be "
f"different from originally loaded tensor. Please refrain "
f"from modifying the file while the run is in progress."
)
```
1. Comment out the whole section
```
#if modified_time > self._last_modified:
# raise RuntimeError(
# f"Attempting to materialize deferred tensor with key "
# f"\"{self._key}\" from file {self._filepath}, but the file has "
# f"since been modified. The loaded tensor value may be "
# f"different from originally loaded tensor. Please refrain "
# f"from modifying the file while the run is in progress."
# )
```
1. Save the file
1. Re-run the job

0 comments on commit e9142f4

Please sign in to comment.