Code corrupted silently when import decord before torch #293

Open
zhangbw17 opened this issue Mar 13, 2024 · 8 comments

@zhangbw17

This issue occurs when I import decord before torch, and then place nn.Module on the GPU.

import decord
import torch

torch.nn.Linear(3, 3).cuda()

It crashes silently when run as python3 debug.py, and reports Segmentation fault (core dumped) when run in a terminal.
In contrast, the following code runs fine:

import torch
import decord

torch.nn.Linear(3, 3).cuda()
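
To compare the two import orders without taking down the calling shell, one option is to run each order in a fresh interpreter and inspect the exit status. The sketch below is not from the original report: the helper names run_import_order and describe are hypothetical, and it assumes a POSIX system, where a negative return code means the child was killed by a signal.

```python
import signal
import subprocess
import sys


def run_import_order(modules, statement="", python=sys.executable):
    """Import `modules` in order in a fresh interpreter, then run `statement`.

    Returns the child's return code. On POSIX, a negative value -N means
    the child was killed by signal N (for example -11 for SIGSEGV).
    """
    lines = [f"import {name}" for name in modules]
    if statement:
        lines.append(statement)
    result = subprocess.run(
        [python, "-c", "\n".join(lines)],
        capture_output=True,
        text=True,
    )
    return result.returncode


def describe(returncode):
    """Turn a return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"


if __name__ == "__main__":
    # Same statement as in the report; requires decord, torch and a GPU.
    stmt = "torch.nn.Linear(3, 3).cuda()"
    for order in (["decord", "torch"], ["torch", "decord"]):
        rc = run_import_order(order, stmt)
        print(" -> ".join(order), ":", describe(rc))
```

Running each order in its own process keeps a crash in one order from hiding the result of the other.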
@YinAoXiong

same problem

@tongda

tongda commented May 13, 2024

Same here. Versions:

python: 3.10.14
decord: 0.6.0
pytorch: 2.3.0

CUDA-related libraries installed via pip:
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.20.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.1.105

@Leojc

Leojc commented Jul 3, 2024

same problem.

python: 3.10.13
pytorch: 2.3.0
decord: 0.6.0

@losehu

losehu commented Jul 25, 2024

me too

  • python=3.10
  • pytorch=1.13.1
  • decord==0.6.0

@ellemcfarlane

ellemcfarlane commented Sep 9, 2024

This also happens to me, specifically when:

  1. importing the CPU version of decord before torch via import decord (I have not tested the GPU version)
  2. moving my model to the GPU, e.g. model.to(torch.device("cuda"), dtype=torch.float16)

Note 1: this occurs even when decord is only imported and never actually used.
Note 2: it does not occur when moving the model to the CPU, e.g. model.to(torch.device("cpu"), dtype=torch.float16). Since I'm using the CPU version of decord, there might be a connection there, but regardless, this should not happen.

Issue fixed when: simply importing torch before decord

Specific log when running with huggingface accelerate:
subprocess.CalledProcessError: Command '['python3', 'train.py', <placeholder-args>]' died with <Signals.SIGSEGV: 11>.

without accelerate:
train_script.sh: line 3: 2131749 Segmentation fault (core dumped) python3 train.py <placeholder-args>

versions:

python 3.10.14
cuda 12.1
decord 0.6.0
torch 2.4.0
torchvision 0.19.0
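
The logs above only show that the process died with SIGSEGV. To see which Python line was executing at that moment, the standard-library faulthandler module can be enabled at the very top of the script, before any other import. This is a minimal sketch: the helper name enable_crash_traces is hypothetical, and the dump covers Python frames only, not the native stack inside decord or torch.

```python
import faulthandler
import sys


def enable_crash_traces(stream=None):
    """Ask CPython to dump Python tracebacks on fatal signals such as SIGSEGV.

    `stream` must be an open file with a real file descriptor; it defaults
    to the interpreter's original stderr.
    """
    target = stream if stream is not None else sys.__stderr__
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


if __name__ == "__main__":
    enable_crash_traces()
    # The imports and model code under investigation would follow here.
```

The same effect is available without code changes by starting the interpreter with python3 -X faulthandler train.py.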

@a-r-r-o-w

Issue fixed when: simply importing torch before decord

Thank you for saving my time @ellemcfarlane! I was stuck on this for quite a bit - very weird/unexpected how this works
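
Since the workaround depends on import order, it is easy to undo by accident, for example when an import sorter reorders the lines. One hedged way to make the order explicit is a small helper that imports the two modules in a fixed sequence and warns when the second one is already loaded without the first. The name import_in_order is hypothetical and not part of decord or torch.

```python
import importlib
import sys
import warnings


def import_in_order(first, second):
    """Import `first`, then `second`, and return both module objects.

    Emits a RuntimeWarning if `second` is already loaded while `first`
    is not, because the required order can then no longer be enforced.
    """
    if second in sys.modules and first not in sys.modules:
        warnings.warn(
            f"{second} was imported before {first}; "
            f"the intended order is {first} first",
            RuntimeWarning,
            stacklevel=2,
        )
    first_module = importlib.import_module(first)
    second_module = importlib.import_module(second)
    return first_module, second_module


if __name__ == "__main__":
    # Requires both packages to be installed.
    torch, decord = import_in_order("torch", "decord")
```

Calling this from one place, such as a package's __init__.py, keeps the ordering in a single spot instead of relying on every script to remember it.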

@lhoestq

lhoestq commented Oct 28, 2024

It also happened to me with duckdb, which needs to be imported before decord or it crashes on import.

@korkinof

korkinof commented Jan 3, 2025

Hey folks,
I've also had the same problem and just got the chance to look a bit into it.
There are multiple related open issues. The ones I could find are #329, #324, and, as mentioned above, vllm-project/vllm#9993.
There may well be more, like #201 and #318, but I haven't been able to verify that they are related.
Recompiling locally from source resolves the issue for me and I verified that the segfault reappears if I reinstall the package from pip.
This may be due to a version mismatch in the dependencies the pip binaries have been built against, as suggested by @russellb in vllm-project/vllm#9993.
Could someone independently verify that compiling locally from source resolves the problem? I'm hoping we can finally close all these issues.
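
For anyone comparing the pip build against a local source build, it may help to list which shared libraries each import pulls into the process, so the two builds can be diffed. The sketch below is an assumption-laden diagnostic, not a confirmed explanation of the crash: it is Linux-only (it reads /proc/self/maps), and the helper names parse_maps, loaded_shared_libraries and new_libs_after_import are hypothetical.

```python
import importlib
import os


def parse_maps(text):
    """Return the sorted, de-duplicated .so paths found in /proc/<pid>/maps text."""
    libs = set()
    for line in text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        parts = line.split(None, 5)
        if len(parts) == 6:
            path = parts[5].strip()
            if ".so" in os.path.basename(path):
                libs.add(path)
    return sorted(libs)


def loaded_shared_libraries():
    """List shared libraries currently mapped into this process (Linux only)."""
    with open("/proc/self/maps") as maps_file:
        return parse_maps(maps_file.read())


def new_libs_after_import(module_name):
    """Import `module_name` and return the libraries that import added."""
    before = set(loaded_shared_libraries())
    importlib.import_module(module_name)
    return sorted(set(loaded_shared_libraries()) - before)


if __name__ == "__main__":
    # Requires decord to be installed; run once per build and diff the output.
    for path in new_libs_after_import("decord"):
        print(path)
```

Running this once with the pip wheel and once with the source build should show whether the two load different copies or versions of the same library.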
