
The new version 0.3.0 takes a long time for quantization and eventually fails due to OOM #965

okwinds opened this issue Dec 10, 2024
Labels: bug
Describe the bug
I used the sample code (W8A16, sketched below) to quantize THUDM/glm-4-9b-chat-hf on an NVIDIA RTX 4090. The whole process was very slow (nearly 24 hours), with extremely high memory usage, and an out-of-memory (OOM) error finally killed the run at the last step. When the OOM occurred there was no obvious error message; the only output was:

```
[1] 216936 killed python3 test_ct.py
```
WSL environment:
compressed-tensors 0.8.0
llmcompressor 0.3.0
Memory: 47 GB
Swap: 40 GB

Using the same example code in the same environment, I downgraded to compressed-tensors 0.7.0 and llmcompressor 0.2.0. The quantization then completed smoothly in only about 2 hours.
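
For reference, a minimal sketch of what test_ct.py does, based on the W8A16 one-shot example from the llmcompressor README (the calibration dataset, sample count, and output path here are illustrative, not the exact values from my script):

```python
# Minimal sketch of the W8A16 quantization flow, following the
# llmcompressor one-shot example; dataset, sample count, and output
# path are illustrative placeholders.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to 8-bit weights with 16-bit activations,
# keeping the output head (lm_head) in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(
    model="THUDM/glm-4-9b-chat-hf",
    dataset="open_platypus",        # built-in calibration dataset (assumed)
    recipe=recipe,
    output_dir="glm-4-9b-chat-hf-W8A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```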

Expected behavior
Quantization with llmcompressor 0.3.0 should complete in roughly the same time and memory footprint as 0.2.0 (about 2 hours on this hardware), instead of running for nearly 24 hours and being OOM-killed.

Environment

  1. OS: Ubuntu 22.04
  2. Python version: 3.11.9
  3. CUDA version: 12.4.1

Errors
OOM; the process was killed by the kernel:

```
[1] 216936 killed python3 test_ct.py
```
