I'm encountering an issue while running the VideoQuery example on my Jetson Orin NX (8GB RAM, 500GB SSD). The process fails during the quantization step. I'm running inside the Docker container that jetson-containers selects automatically via autotag.
Setup:
Jetson Orin NX 8GB with 500GB SSD
JetPack 6 and L4T 36.4.0 (verified as shown below)
Running in a Docker container using jetson-containers
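For reference, the L4T version above was read from the standard release file on the host:

head -n1 /etc/nv_tegra_release

Command (full invocation via jetson-containers):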
jetson-containers run $(autotag nano_llm) \
  python3 -m nano_llm.agents.video_query --api=mlc \
--model Efficient-Large-Model/VILA1.5-3b \
--max-context-len 256 \
--max-new-tokens 32 \
--video-input /dev/video0 \
--video-output webrtc://@:8554/output \
--nanodb /data/nanodb/coco/2017
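Since NanoLLM launches the quantization as a subprocess, the failing step can also be reproduced standalone inside the container; this is the exact command copied from the log below:

python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b/ctx256 --use-safetensors

Full log: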
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
warnings.warn(
Fetching 13 files:   0%|          | 0/13 [00:00<?, ?it/s]
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Fetching 13 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 13/13 [00:00<00:00, 35429.47it/s]
Fetching 17 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 17/17 [00:00<00:00, 110035.75it/s]
16:01:44 | INFO | loading /data/models/huggingface/models--Efficient-Large-Model--VILA1.5-3b/snapshots/42d1dda6807cc521ef27674ca2ae157539d17026 with MLC
16:01:48 | INFO | NumExpr defaulting to 6 threads.
16:01:48 | WARNING | AWQ not installed (requires JetPack 6 / L4T R36) - AWQ models will fail to initialize
['/data/models/mlc/dist/VILA1.5-3b/ctx256/VILA1.5-3b-q4f16_ft/mlc-chat-config.json', '/data/models/mlc/dist/VILA1.5-3b/ctx256/VILA1.5-3b-q4f16_ft/params/mlc-chat-config.json']
16:01:50 | INFO | running MLC quantization:
python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b/ctx256 --use-safetensors
Using path "/data/models/mlc/dist/models/VILA1.5-3b" for model "VILA1.5-3b"
Target configured: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Automatically using target for weight quantization: cuda -keys=cuda,gpu -arch=sm_87 -max_num_threads=1024 -max_shared_memory_per_block=49152 -max_threads_per_block=1024 -registers_per_block=65536 -thread_warp_size=32
Start computing and quantizing weights... This may take a while.
Get old param:   0%|          | 0/197 [00:00<?, ?tensors/s]
Set new param:   0%|          | 0/327 [00:00<?, ?tensors/s]
Get old param:   1%|▉         | 2/197 [00:02<03:22, 1.04s/tensors]
Set new param:   0%|          | 1/327 [00:02<13:23, 2.47s/tensors]
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 357, in<module>
agent = VideoQuery(**vars(args)).run()
File "/opt/NanoLLM/nano_llm/agents/video_query.py", line 44, in __init__
self.llm = ChatQuery(model=model, drop_inputs=True, vision_scaling=vision_scaling, warmup=True, **kwargs) #ProcessProxy('ChatQuery', model=model, drop_inputs=True, vision_scaling=vision_scaling, warmup=True, **kwargs)
File "/opt/NanoLLM/nano_llm/plugins/chat_query.py", line 78, in __init__
self.model = NanoLLM.from_pretrained(model, **kwargs)
File "/opt/NanoLLM/nano_llm/nano_llm.py", line 91, in from_pretrained
model = MLCModel(model_path, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 60, in __init__
quant = MLCModel.quantize(self.model_path, self.config, method=quantization, max_context_len=max_context_len, **kwargs)
File "/opt/NanoLLM/nano_llm/models/mlc.py", line 276, in quantize
subprocess.run(cmd, executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'python3 -m mlc_llm.build --model /data/models/mlc/dist/models/VILA1.5-3b --quantization q4f16_ft --target cuda --use-cuda-graph --use-flash-attn-mqa --sep-embed --max-seq-len 256 --artifact-path /data/models/mlc/dist/VILA1.5-3b/ctx256 --use-safetensors ' died with <Signals.SIGKILL: 9>.
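As far as I can tell, dying with SIGKILL during quantization usually means the kernel OOM killer stopped the process, which seems plausible with only 8GB of memory shared between CPU and GPU. Would mounting swap on the SSD be a valid workaround? A sketch of what I have in mind, following the jetson-containers setup docs (the 16G size and /ssd/16GB.swap path are placeholders for my setup):

# disable zram so paging goes to the SSD swap file instead
sudo systemctl disable nvzramconfig
# allocate, format, and enable the swap file
sudo fallocate -l 16G /ssd/16GB.swap
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap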
Any help or suggestions would be appreciated!