Commit

Merge pull request #128 from nickaj/main
Minor update to CS2 WSC docs/Troubleshooting.
nickaj authored Feb 6, 2024
2 parents dd7d43c + 1edbb2e commit ef3889f
Showing 2 changed files with 64 additions and 29 deletions.
80 changes: 57 additions & 23 deletions docs/services/cs2/run.md
@@ -34,56 +34,53 @@ python run.py \
--mount_dirs {paths to modelzoo and to data} \
--python_paths {paths to modelzoo and other python code if used}
```
See the 'Troubleshooting' section below for known issues.
## Creating an environment
To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The [Cerebras setup environment docs](https://docs.cerebras.net/en/latest/wsc/getting-started/setup-environment.html) give generic instructions for this, but our host system is slightly different, so we recommend the following steps:
1. Create the venv
### Create the venv
```bash
python3.8 -m venv venv_cerebras_pt
```
1. Install the dependencies
### Install the dependencies
```bash
source venv_cerebras_pt/bin/activate
pip install --upgrade pip
pip install cerebras_pytorch==2.0.2
pip install cerebras_pytorch==2.1.1
```
1. Validate the setup
### Validate the setup
```bash
source venv_cerebras_pt/bin/activate
cerebras_install_check
```
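Alongside `cerebras_install_check`, it can be useful to confirm which version of the package the venv actually contains. The following is a minimal sketch using only the Python standard library. The helper name `installed_version` is hypothetical, and it assumes the pip distribution is named `cerebras_pytorch`, as in the install command above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: the distribution name matches the one used with pip above.
    version = installed_version("cerebras_pytorch")
    if version is None:
        print("cerebras_pytorch is not installed in this environment")
    else:
        print(f"cerebras_pytorch {version} is installed")
```

Run it with the venv activated so that it inspects the venv's site-packages rather than the system Python.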
## Troubleshooting
### "Failed to transfer X out of 1943 weight tensors with modelzoo"
Jobs sometimes fail during the 'Transferring weights to server' stage with an error like the one below:
```
2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered.
2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress.
```
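The error text points at a modification-time comparison on `initial_state.hdf5`. The sketch below illustrates that kind of check in isolation. It is not the `cerebras_pytorch` implementation: `TrackedFile` and `demo` are hypothetical names, and the idea that a clock disagreement between hosts can make an untouched file look newer is an assumption made for illustration.

```python
import os
import tempfile


class TrackedFile:
    """Remember a file's modification time and detect later changes."""

    def __init__(self, filepath):
        self._filepath = filepath
        self._stat = os.stat(filepath)

    def modified_since_load(self):
        # A newer nanosecond mtime is treated as "the file has been modified".
        return os.stat(self._filepath).st_mtime_ns > self._stat.st_mtime_ns


def demo():
    """Return the check result before and after the mtime is pushed forward."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "initial_state.hdf5")
        with open(path, "wb") as handle:
            handle.write(b"placeholder")
        tracked = TrackedFile(path)
        unchanged = tracked.modified_since_load()
        # Move the mtime one second forward without touching the contents.
        info = os.stat(path)
        os.utime(path, ns=(info.st_atime_ns, info.st_mtime_ns + 1_000_000_000))
        changed = tracked.modified_since_load()
        return unchanged, changed
```

In this sketch the file contents never change, yet the second check reports a modification, which is why a timestamp-only comparison can produce false alarms.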
### Modify venv files to remove clock sync check on EPCC system.
Cerebras are aware of this issue and are working on a fix. In the meantime, use the following workaround:
1. From within your python venv, edit the <venv>/lib64/python3.8/site-packages/cerebras_pytorch/storage.py file
### From within your python venv, edit the <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
```bash
vi <venv>/lib64/python3.8/site-packages/cerebras_pytorch/storage.py
```
vi <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py
```
### Navigate to line 530
1. Navigate to line 672
```bash
:672
:530
```
The section should look like this:
```
```python
if modified_time > self._last_modified:
raise RuntimeError(
f"Attempting to materialize deferred tensor with key "
@@ -94,8 +91,9 @@ if modified_time > self._last_modified:
)
```
1. Comment out the whole section
```
### Comment out the whole section
```python
#if modified_time > self._last_modified:
# raise RuntimeError(
# f"Attempting to materialize deferred tensor with key "
@@ -106,6 +104,42 @@ if modified_time > self._last_modified:
# )
```
1. Save the file
### Navigate to line 774
```bash
:774
```
The section should look like this:
```python
if stat.st_mtime_ns > self._stat.st_mtime_ns:
raise RuntimeError(
f"Attempting to {msg} deferred tensor with key "
f"\"{self._key}\" from file {self._filepath}, but the file has "
f"since been modified. The loaded tensor value may be "
f"different from originally loaded tensor. Please refrain "
f"from modifying the file while the run is in progress."
)
```
### Comment out the whole section
```python
#if stat.st_mtime_ns > self._stat.st_mtime_ns:
# raise RuntimeError(
# f"Attempting to {msg} deferred tensor with key "
# f"\"{self._key}\" from file {self._filepath}, but the file has "
# f"since been modified. The loaded tensor value may be "
# f"different from originally loaded tensor. Please refrain "
# f"from modifying the file while the run is in progress."
# )
```
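Because line numbers can shift between `cerebras_pytorch` releases, the same edit can be sketched by searching for the `if` line instead of jumping to a fixed line. The code below is a hedged illustration, not a supported tool: `comment_out_block`, `patch_file`, and `CHECKS` are hypothetical names, and it assumes each check is an `if` statement followed by a more deeply indented body, as in the snippets above.

```python
import shutil

# The two `if` lines shown in the snippets above.
CHECKS = (
    "if modified_time > self._last_modified:",
    "if stat.st_mtime_ns > self._stat.st_mtime_ns:",
)


def comment_out_block(source, first_line):
    """Comment out the line starting with `first_line` and its indented body."""
    result = []
    block_indent = None  # indentation of the `if` line while inside its block
    for line in source.splitlines():
        stripped = line.lstrip()
        indent = len(line) - len(stripped)
        if block_indent is not None and stripped and indent <= block_indent:
            block_indent = None  # a non-blank line at the outer level ends the block
        if block_indent is None and stripped.startswith(first_line):
            block_indent = indent
            result.append("#" + line)
        elif block_indent is not None and stripped:
            result.append("#" + line)
        else:
            result.append(line)
    return "\n".join(result) + "\n"


def patch_file(path):
    """Back up `path`, then comment out every block listed in CHECKS."""
    shutil.copy2(path, path + ".bak")  # keep the original next to the patched file
    with open(path) as handle:
        source = handle.read()
    for first_line in CHECKS:
        source = comment_out_block(source, first_line)
    with open(path, "w") as handle:
        handle.write(source)
```

Calling `patch_file` on the `storage.py` path given above would apply both edits and leave a `.bak` copy beside it. Running `python -m py_compile` on the patched file afterwards is a quick way to confirm it still parses.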
### Save the file
### Run jobs as per existing documentation.
## Paths, PYTHONPATH and mount_dirs
1. Re-run the job
The parameters passed to the run.py script can be confusing. Cerebras provides a helpful page explaining these parameters and how to use them: [Python, paths and mount directories](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories).
13 changes: 7 additions & 6 deletions docs/services/ultra2/run.md
@@ -68,18 +68,19 @@ Once you have done this, your SSH key will be added to your Ultra2 account.
Remember, you will need both an SSH key and a time-based one-time password (TOTP) to log into Ultra2, so you must also [set up your TOTP](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) before you can log in.
!!! Note
---
!!! note "First Login"
When you **first** log into Ultra2, you will be prompted to change your initial password. This is a three-step process:
1. When prompted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
1. When prompted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
1. When prompted to enter your new password: type in a new password
1. When prompted to re-enter the new password: re-enter the new password
2. When prompted to enter your new password: type in a new password
Your password has now been changed
3. When prompted to re-enter the new password: re-enter the new password
Your password has now been changed<br>
You will **not** use your password when logging on to Ultra2 after the initial logon.
---
### SSH Login