From 15e6e6aa032c7f0386836b19a6cae98232c958cb Mon Sep 17 00:00:00 2001
From: Nick Johnson <n.johnson@epcc.ed.ac.uk>
Date: Wed, 3 Jan 2024 14:46:30 +0000
Subject: [PATCH 1/2] Minor update to CS2 WSC docs/Troubleshooting.

---
 docs/services/cs2/run.md    | 27 +++++++++++++++++++--------
 docs/services/ultra2/run.md | 13 +++++++------
 2 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md
index 79461e616..d5e53c9d3 100644
--- a/docs/services/cs2/run.md
+++ b/docs/services/cs2/run.md
@@ -34,6 +34,7 @@ python run.py \
        --mount_dirs {paths to modelzoo and to data} \
        --python_paths {paths to modelzoo and other python code if used}
 ```
+
 See the 'Troubleshooting' section below for known issues.
 
 ## Creating an environment
@@ -61,29 +62,34 @@ source venv_cerebras_pt/bin/activate
 cerebras_install_check
 ```
 
-
 ## Troubleshooting
 
 ### "Failed to transfer X out of 1943 weight tensors with modelzoo"
+
 Sometimes jobs receive an error during the 'Transferring weights to server' like below:
-```
+
+```bash
 2023-12-14 16:00:19,066 ERROR:   Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered.
 2023-12-14 16:00:19,118 ERROR:   Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress.
-``` 
+```
 
 Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:
 
-1. From within your python venv, edit the <venv>/lib64/python3.8/site-packages/cerebras_pytorch/storage.py file
+1. From within your python venv, edit the <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
+
 ```bash
-vi <venv>/lib64/python3.8/site-packages/cerebras_pytorch/storage.py
-``` 
+vi <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py
+```
 
 1. Navigate to line 672
+
 ```bash
 :672
 ```
+
 The section should look like this:
-```
+
+```python
 if modified_time > self._last_modified:
     raise RuntimeError(
         f"Attempting to materialize deferred tensor with key "
@@ -95,7 +101,8 @@ if modified_time > self._last_modified:
 ```
 
 1. Comment out the whole section
-```
+
+```python
  #if modified_time > self._last_modified:
  #    raise RuntimeError(
  #        f"Attempting to materialize deferred tensor with key "
@@ -109,3 +116,7 @@ if modified_time > self._last_modified:
 1. Save the file
 
 1. Re-run the job
+
+### Paths, PYTHONPATH and mount_dirs
+
+There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. [Python, paths and mount directories.](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories)
diff --git a/docs/services/ultra2/run.md b/docs/services/ultra2/run.md
index 18c1b5f98..6374cdc67 100644
--- a/docs/services/ultra2/run.md
+++ b/docs/services/ultra2/run.md
@@ -68,18 +68,19 @@ Once you have done this, your SSH key will be added to your Ultra2 account.
 
 Remember, you will need to use both an SSH key and Time-based one-time password to log into Ultra2 so you will also need to [set up your TOTP](https://epcced.github.io/safe-docs/safe-for-users/#how-to-turn-on-mfa-on-your-machine-account) before you can log into Ultra2.
 
-!!! Note
+---
+!!! note "First Login"
 
     When you **first** log into Ultra2, you will be prompted to change your initial password. This is a three step process:
 
-    1.  When promoted to enter your *password*: Enter the password  which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+     1. When prompted to enter your *password*: Enter the password which you [retrieve from SAFE](https://epcced.github.io/safe-docs/safe-for-users/#how-can-i-pick-up-my-password-for-the-service-machine)
+     1. When prompted to enter your new password: type in a new password
+     1. When prompted to re-enter the new password: re-enter the new password
 
-    2.  When prompted to enter your new password: type in a new password
+    Your password has now been changed.
 
-    3.  When prompted to re-enter the new password: re-enter the new password
-
-    Your password has now been changed<br>
     You will **not** use your password when logging on to Ultra2 after the initial logon.
+---
 
 ### SSH Login
 

From 1edbb2e8bf677191e726af37c520d69d3c3b173a Mon Sep 17 00:00:00 2001
From: Nick Johnson <n.johnson@epcc.ed.ac.uk>
Date: Tue, 6 Feb 2024 14:02:34 +0000
Subject: [PATCH 2/2] Post r2.1.1 update

---
 docs/services/cs2/run.md | 65 +++++++++++++++++++++++++++-------------
 1 file changed, 44 insertions(+), 21 deletions(-)

diff --git a/docs/services/cs2/run.md b/docs/services/cs2/run.md
index d5e53c9d3..e6c00a791 100644
--- a/docs/services/cs2/run.md
+++ b/docs/services/cs2/run.md
@@ -41,50 +41,41 @@ See the 'Troubleshooting' section below for known issues.
 
 To run a job on the cluster, you must create a Python virtual environment (venv) and install the dependencies. The Cerebras documentation contains generic instructions to do this [Cerebras setup environment docs](https://docs.cerebras.net/en/latest/wsc/getting-started/setup-environment.html) however our host system is slightly different so we recommend the following:
 
-1. Create the venv
+### Create the venv
 
 ```bash
 python3.8 -m venv venv_cerebras_pt
 ```
 
-1. Install the dependencies
+### Install the dependencies
 
 ```bash
 source venv_cerebras_pt/bin/activate
 pip install --upgrade pip
-pip install cerebras_pytorch==2.0.2
+pip install cerebras_pytorch==2.1.1
 ```
 
-1. Validate the setup
+### Validate the setup
 
 ```bash
 source venv_cerebras_pt/bin/activate
 cerebras_install_check
 ```
 
-## Troubleshooting
-
-### "Failed to transfer X out of 1943 weight tensors with modelzoo"
-
-Sometimes jobs receive an error during the 'Transferring weights to server' like below:
-
-```bash
-2023-12-14 16:00:19,066 ERROR:   Failed to transfer 5 out of 1943 weight tensors. Raising the first error encountered.
-2023-12-14 16:00:19,118 ERROR:   Initiating shutdown sequence due to error: Attempting to materialize deferred tensor with key “state.optimizer.state.214.beta1_power” from file model_dir/cerebras_logs/device_data_jxsi5hub/initial_state.hdf5, but the file has since been modified. The loaded tensor value may be different from originally loaded tensor. Please refrain from modifying the file while the run is in progress.
-```
+### Modify venv files to remove the clock sync check on the EPCC system
 
 Cerebras are aware of this issue and are working on a fix, however in the mean time follow the below workaround:
 
-1. From within your python venv, edit the <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
+### From within your Python venv, edit the <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py file
 
 ```bash
 vi <venv>/lib/python3.8/site-packages/cerebras_pytorch/saver/storage.py
 ```
 
-1. Navigate to line 672
+### Navigate to line 530
 
 ```bash
-:672
+:530
 ```
 
 The section should look like this:
@@ -100,7 +91,7 @@ if modified_time > self._last_modified:
     )
 ```
 
-1. Comment out the whole section
+### Comment out the whole section
 
 ```python
  #if modified_time > self._last_modified:
@@ -113,10 +104,42 @@ if modified_time > self._last_modified:
         #    )
 ```
 
-1. Save the file
+### Navigate to line 774
+
+```bash
+:774
+```
+
+The section should look like this:
+
+```python
+if stat.st_mtime_ns > self._stat.st_mtime_ns:
+    raise RuntimeError(
+        f"Attempting to {msg} deferred tensor with key "
+        f"\"{self._key}\" from file {self._filepath}, but the file has "
+        f"since been modified. The loaded tensor value may be "
+        f"different from originally loaded tensor. Please refrain "
+        f"from modifying the file while the run is in progress."
+    )
+```
+
+### Comment out the whole section
+
+```python
+#if stat.st_mtime_ns > self._stat.st_mtime_ns:
+#    raise RuntimeError(
+#        f"Attempting to {msg} deferred tensor with key "
+#        f"\"{self._key}\" from file {self._filepath}, but the file has "
+#        f"since been modified. The loaded tensor value may be "
+#        f"different from originally loaded tensor. Please refrain "
+#        f"from modifying the file while the run is in progress."
+#    )
+```
+
+### Save the file
 
-1. Re-run the job
+### Run jobs as per the existing documentation
 
-### Paths, PYTHONPATH and mount_dirs
+## Paths, PYTHONPATH and mount_dirs
 
 There can be some confusion over the correct use of the parameters supplied to the run.py script. There is a helpful explanation page from Cerebras which explains these parameters and how they should be used. [Python, paths and mount directories.](https://docs.cerebras.net/en/latest/wsc/getting-started/mount_dir.html?highlight=mount#python-paths-and-mount-directories)