
The new version 0.3.0 takes a long time for quantization and eventually fails due to OOM #965

okwinds opened this issue Dec 10, 2024
Labels: bug
Describe the bug
I used the sample code (W8A16, sketched below) to quantize THUDM/glm-4-9b-chat-hf on an NVIDIA RTX 4090. The whole process was very slow (nearly 24 hours), with extremely high memory usage, and an out-of-memory (OOM) error finally killed the run at the last step. When the OOM occurred there was no obvious error message; the only output was:

```
[1] 216936 killed python3 test_ct.py
```
WSL environment:
compressed-tensors 0.8.0
llmcompressor 0.3.0
Memory: 47 GB
Swap: 40 GB

Using the same example code in the same environment, I downgraded to compressed-tensors 0.7.0 and llmcompressor 0.2.0. The quantization then completed smoothly in only about 2 hours.
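
For reference, a minimal sketch of what test_ct.py does, based on the W8A16 one-shot example from the llmcompressor README (the calibration dataset, sample count, and output path here are illustrative, not the exact values from my script):

```python
# Minimal sketch of the W8A16 quantization flow, following the
# llmcompressor one-shot example; dataset, sample count, and output
# path are illustrative placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 8-bit weights with 16-bit activations,
# keeping the output head (lm_head) in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(
    model="THUDM/glm-4-9b-chat-hf",
    dataset="open_platypus",        # built-in calibration dataset (assumed)
    recipe=recipe,
    output_dir="glm-4-9b-chat-hf-W8A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```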

Expected behavior
Quantization with llmcompressor 0.3.0 should complete in roughly the same time and memory footprint as 0.2.0 (about 2 hours on this hardware), instead of running for nearly 24 hours and being OOM-killed.

Environment

  1. OS: Ubuntu 22.04
  2. Python version: 3.11.9
  3. CUDA version: 12.4.1

Errors
OOM; the process was killed by the kernel:

```
[1] 216936 killed python3 test_ct.py
```
