From 15e6e6aa032c7f0386836b19a6cae98232c958cb Mon Sep 17 00:00:00 2001
From: Nick Johnson
Date: Wed, 3 Jan 2024 14:46:30 +0000
Subject: [PATCH 1/2] Minor update to CS2 WSC docs/Troubleshooting.

---
 docs/services/cs2/run.md    | 27 +++++++++++++++++++--------
 docs/services/ultra2/run.md | 13 +++++++------
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md
index 79461e616..d5e53c9d3 100644
--- a/docs/services/cs2/run.md
+++ b/docs/services/cs2/run.md
@@ -34,6 +34,7 @@ python run.py \
   --mount_dirs {paths to modelzoo and to data} \
   --python_paths {paths to modelzoo and other python code if used}
 ```
+
 See the 'Troubleshooting' section below for known issues.

 ## Creating an environment
@@ -61,29 +62,34 @@ source venv_cerebras_pt/bin/activate
 cerebras_install_check
 ```

-
 ## Troubleshooting

 ### "Failed to transfer X out of 1943 weight tensors with modelzoo"
+
 Sometimes jobs receive an error during the 'Transferring weights to server' like below:
-```
+
+```bash
 2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered.
 2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress.
-```
+```

 Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:

-1. From within your python venv, edit the /lib64/python3.8/site-packages/cerebras_pytorch/storage.py file
+1. From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
+
 ```bash
-vi /lib64/python3.8/site-packages/cerebras_pytorch/storage.py
-```
+vi /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py
+```

 1. Navigate to line 672
+
 ```bash
 :672
 ```
+
 The section should look like this:
-```
+
+```python
 if modified_time > self._last_modified:
     raise RuntimeError(
         f"Attempting to materialize deferred tensor with key "
@@ -95,7 +101,8 @@ if modified_time > self._last_modified:
 ```

 1. Comment out the whole section
-```
+
+```python
 #if modified_time > self._last_modified:
 #    raise RuntimeError(
 #        f"Attempting to materialize deferred tensor with key "
@@ -109,3 +116,7 @@ if modified_time > self._last_modified:
 1. Save the file

 1. Re-run the job
+
+### Paths, PYTHONPATH and mount_dirs
+
+There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. [Python, paths and mount directories.](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories)
diff --git a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md
index 18c1b5f98..6374cdc67 100644
--- a/docs/services/ultra2/run.md
+++ b/docs/services/ultra2/run.md
@@ -68,18 +68,19 @@ Once you have done this, your SSH key will be added to your Ultra2 account.

 Remember, you will need to use both an SSH key and Time-based one-time password to log into Ultra2 so you will also need to [set up your TOTP](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) before you can log into Ultra2.

-!!! Note
+---
+!!! note "First Login"
     When you **first** log into Ultra2, you will be prompted to change your initial password. This is a three step process:

-    1. When promoted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+    1. When prompted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+    1. When prompted to enter your new password: type in a new password
+    1. When prompted to re-enter the new password: re-enter the new password

-    2. When prompted to enter your new password: type in a new password
+    Your password has now been changed

-    3. When prompted to re-enter the new password: re-enter the new password
-
-    Your password has now been changed
     You will **not** use your password when logging on to Ultra2 after the initial logon.
+---

 ### SSH Login

From 1edbb2e8bf677191e726af37c520d69d3c3b173a Mon Sep 17 00:00:00 2001
From: Nick Johnson
Date: Tue, 6 Feb 2024 14:02:34 +0000
Subject: [PATCH 2/2] Post r2.1.1 update

---
 docs/services/cs2/run.md | 65 +++++++++++++++++++++++++++-------------
 1 file changed, 44 insertions(+), 21 deletions(-)

diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md
index d5e53c9d3..e6c00a791 100644
--- a/docs/services/cs2/run.md
+++ b/docs/services/cs2/run.md
@@ -41,50 +41,41 @@ See the 'Troubleshooting' section below for known issues.

 To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this [Cerebras setup environment docs](https://docs.cerebras.net/en/latest/wsc/getting-started/setup-environment.html) however our host system is slightly different so we recommend the following:

-1. Create the venv
+### Create the venv

 ```bash
 python3.8 -m venv venv_cerebras_pt
 ```

-1. Install the dependencies
+### Install the dependencies

 ```bash
 source venv_cerebras_pt/bin/activate
 pip install --upgrade pip
-pip install cerebras_pytorch==2.0.2
+pip install cerebras_pytorch==2.1.1
 ```

-1. Validate the setup
+### Validate the setup

 ```bash
 source venv_cerebras_pt/bin/activate
 cerebras_install_check
 ```

-## Troubleshooting
-
-### "Failed to transfer X out of 1943 weight tensors with modelzoo"
-
-Sometimes jobs receive an error during the 'Transferring weights to server' like below:
-
-```bash
-2023-12-14 16:00:19,066 ERROR: Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered.
-2023-12-14 16:00:19,118 ERROR: Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress.
-```
+### Modify venv files to remove clock sync check on EPCC system.

 Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:

-1. From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
+### From within your python venv, edit the /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file

 ```bash
 vi /lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py
 ```

-1. Navigate to line 672
+### Navigate to line 530

 ```bash
-:672
+:530
 ```

 The section should look like this:
@@ -100,7 +91,7 @@ if modified_time > self._last_modified:
     )
 ```

-1. Comment out the whole section
+### Comment out the whole section

 ```python
 #if modified_time > self._last_modified:
@@ -113,10 +104,42 @@ if modified_time > self._last_modified:
 #    )
 ```

-1. Save the file
+### Navigate to line 774
+
+```bash
+:774
+```
+
+The section should look like this:
+
+```python
+    if stat.st_mtime_ns > self._stat.st_mtime_ns:
+        raise RuntimeError(
+            f"Attempting to {msg} deferred tensor with key "
+            f"\"{self._key}\" from file {self._filepath}, but the file has "
+            f"since been modified. The loaded tensor value may be "
+            f"different from originally loaded tensor. Please refrain "
+            f"from modifying the file while the run is in progress."
+        )
+```
+
+### Comment out the whole section
+
+```python
+    #if stat.st_mtime_ns > self._stat.st_mtime_ns:
+    #    raise RuntimeError(
+    #        f"Attempting to {msg} deferred tensor with key "
+    #        f"\"{self._key}\" from file {self._filepath}, but the file has "
+    #        f"since been modified. The loaded tensor value may be "
+    #        f"different from originally loaded tensor. Please refrain "
+    #        f"from modifying the file while the run is in progress."
+    #    )
+```
+
+### Save the file

-1. Re-run the job
+### Run jobs as per existing documentation.

-### Paths, PYTHONPATH and mount_dirs
+## Paths, PYTHONPATH and mount_dirs

 There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. [Python, paths and mount directories.](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories)