why is unsloth thinking I'm doing multi gpu optimization when I'm not? #1240
Comments
Hm that is very weird - is this like a machine with multiple cards - could you try nvidia-smi?
(beyond_scale_2) ***@***.***~/beyond-scale-2-alignment-coeff $ nvidia-smi
Tue Nov 5 08:57:13 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 54C P0 223W / 400W | 75448MiB / 81920MiB | 97% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:0A:00.0 Off | 0 |
| N/A 43C P0 89W / 400W | 31490MiB / 81920MiB | 88% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:44:00.0 Off | 0 |
| N/A 31C P0 68W / 400W | 1031MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:4A:00.0 Off | 0 |
| N/A 60C P0 297W / 400W | 31514MiB / 81920MiB | 84% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100-SXM4-80GB On | 00000000:84:00.0 Off | 0 |
| N/A 38C P0 97W / 400W | 23790MiB / 81920MiB | 31% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA A100-SXM4-80GB On | 00000000:8A:00.0 Off | 0 |
| N/A 37C P0 105W / 400W | 71724MiB / 81920MiB | 96% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA A100-SXM4-80GB On | 00000000:C0:00.0 Off | 0 |
| N/A 52C P0 269W / 400W | 31518MiB / 81920MiB | 85% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100-SXM4-80GB On | 00000000:C3:00.0 Off | 0 |
| N/A 55C P0 237W / 400W | 60673MiB / 81920MiB | 88% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 412907 C python 1018MiB |
| 0 N/A N/A 2531149 C python 74416MiB |
| 1 N/A N/A 3611 C ...nqduc/miniconda3/envs/lf/bin/python 30540MiB |
| 1 N/A N/A 1534976 C python 908MiB |
| 2 N/A N/A 4165148 C python 2482MiB |
| 3 N/A N/A 2201035 C python 848MiB |
| 3 N/A N/A 4140397 C ...nqduc/miniconda3/envs/lf/bin/python 30624MiB |
| 4 N/A N/A 2174832 C ...iconda3/envs/ampere1-env/bin/python 9328MiB |
| 4 N/A N/A 2737509 C python 14412MiB |
| 5 N/A N/A 119688 C python 43242MiB |
| 5 N/A N/A 124733 C python 28468MiB |
| 6 N/A N/A 111759 C ...nqduc/miniconda3/envs/lf/bin/python 30548MiB |
| 6 N/A N/A 1488814 C python 928MiB |
| 7 N/A N/A 3185003 C python 60650MiB |
+-----------------------------------------------------------------------------------------+
The error was also non-deterministic: I changed nothing in my code and it went away (at least for one run). I didn't try again afterwards since lm_head wasn't LoRA-able, but it's definitely non-deterministic. Let me know how I can help; I think I attached the code.
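For reference, a minimal sketch of the workaround I'd try (an assumption on my part, not a confirmed fix): hide every card except the one you want before torch or unsloth is imported, so any multi-GPU detection only ever sees a single device.

# Sketch: pin this process to one GPU on a shared multi-GPU node.
# Assumes GPU 0 is the target card; must run before the first torch/unsloth import.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
assert torch.cuda.device_count() == 1, "more than one GPU is still visible"

The same effect can usually be had from the shell, e.g. CUDA_VISIBLE_DEVICES=0 python your_script.py (your_script.py being whatever entry point you run), as long as nothing overrides the variable later.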
I encountered the same issue on a single machine with multiple GPUs. I used:
code
Without changing any code, rerunning it sometimes succeeds and sometimes fails.
But I'm only using a single A100 GPU...
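A quick startup check (hypothetical logging I'd add, just to correlate the failing runs with what the process actually sees) might help narrow down the non-determinism:

# Sketch: log GPU visibility at startup; on a shared 8-GPU node, "unset" means all cards are visible.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "unset"))
print("visible device count =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))

If the runs that trigger the multi-GPU warning are the ones where the count is greater than 1, that would point at the environment rather than the training code.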