Describe the bug
I used the W8A16 example code to quantize THUDM/glm-4-9b-chat-hf on an NVIDIA RTX 4090. The whole process was very slow (nearly 24 hours) and memory usage was extremely high, to the point that the process was killed out of memory (OOM) at the final step. When the OOM occurred there was no traceback or error message; the only output was:

[1] 216936 killed python3 test_ct.py

Using the same example code in the same environment, but with the versions downgraded to compressed-tensors 0.7.0 and llmcompressor 0.2.0, the quantization completed smoothly and took only about 2 hours.

Expected behavior
The newer versions should perform comparably to the older ones: no OOM and a similar runtime.

Environment
OS: Ubuntu 22.04 (WSL)
Python version: 3.11.9
CUDA version: 12.4.1

Errors
OOM; the process is killed by the OS with no traceback:
[1] 216936 killed python3 test_ct.py
WSL environment (versions where the OOM occurs):
compressed-tensors 0.8.0
llmcompressor 0.3.0
Memory: 47 GB
Swap: 40 GB
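The downgrade that avoided the OOM can be pinned explicitly. A minimal install fragment, assuming the reporter installed the packages from PyPI under these exact names (the version numbers are taken from the report above):

```shell
# Pin the last known-good combination from the report:
# compressed-tensors 0.7.0 + llmcompressor 0.2.0 completed in ~2 hours,
# while 0.8.0 + 0.3.0 ran ~24 hours and was OOM-killed.
pip install "compressed-tensors==0.7.0" "llmcompressor==0.2.0"
```

This is only a workaround to keep quantization running while the regression in the newer versions is investigated.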