This repository has been archived by the owner on May 15, 2023. It is now read-only.

RuntimeError: CUDA error: invalid device ordinal #1

Open
kazuma0606 opened this issue Aug 19, 2021 · 4 comments

kazuma0606 commented Aug 19, 2021

Hi,

In Google Colaboratory, the following error appears and train.py does not run.
----------------- End -------------------
Traceback (most recent call last):
  File "train.py", line 15, in <module>
    opt = TrainOptions().parse()
  File "/content/drive/My Drive/Colab Notebooks/3D_CycleGAN_Project/3D-CycleGan-Pytorch-MedImaging/options/base_options.py", line 109, in parse
    torch.cuda.set_device(opt.gpu_ids[0])
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 264, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Do you know of a better way?

Environment:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0

Cheers,
Hisanori Yoshimura

@davidiommi
Owner

You have to change the GPU ids in base_options.py.

I was selecting '2,3' (I have 4 GPUs).

You probably have fewer GPUs available. Write '0,1' if you have 2 GPUs available.
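
The advice above amounts to requesting only device indices that exist on the machine. A minimal sketch of that check follows. The helper name `usable_gpu_ids` is hypothetical and not part of the repository, and the device count is passed in explicitly; in practice it would come from `torch.cuda.device_count()`.

```python
def usable_gpu_ids(requested, device_count):
    """Split requested GPU ids into those that exist and those that do not.

    requested    -- iterable of ints, e.g. [2, 3]
    device_count -- number of visible CUDA devices, e.g. the value
                    returned by torch.cuda.device_count()
    """
    valid = [i for i in requested if 0 <= i < device_count]
    invalid = [i for i in requested if i < 0 or i >= device_count]
    return valid, invalid


# A machine with a single visible GPU only has ordinal 0.
print(usable_gpu_ids([2, 3], device_count=1))  # ([], [2, 3])
print(usable_gpu_ids([0, 1], device_count=1))  # ([0], [1])
print(usable_gpu_ids([0], device_count=1))     # ([0], [])
```

If the runtime exposes a single GPU, only id 0 passes this check, which would be consistent with the "invalid device ordinal" error reported above for '2,3'.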

@kazuma0606
Author

Sorry,

I passed "--gpu_ids '0,1'", but now
the following error appears and train.py still does not run.
----------------- End -------------------
lenght train list: 3
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
cpuset_checked))
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    model = create_model(opt) # creation of the model
  File "/content/drive/My Drive/Colab Notebooks/3D_CycleGAN_Project/3D-CycleGan-Pytorch-MedImaging/models/__init__.py", line 42, in create_model
    instance.initialize(opt)
  File "/content/drive/My Drive/Colab Notebooks/3D_CycleGAN_Project/3D-CycleGan-Pytorch-MedImaging/models/cycle_gan_model.py", line 88, in initialize
    not opt.no_dropout, opt.init_type, opt.init_gain, self.gpu_ids)
  File "/content/drive/My Drive/Colab Notebooks/3D_CycleGAN_Project/3D-CycleGan-Pytorch-MedImaging/models/networks3D.py", line 93, in define_G
    return init_net(net, init_type, init_gain, gpu_ids)
  File "/content/drive/My Drive/Colab Notebooks/3D_CycleGAN_Project/3D-CycleGan-Pytorch-MedImaging/models/networks3D.py", line 70, in init_net
    net = torch.nn.DataParallel(net, gpu_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 142, in __init__
    _check_balance(self.device_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 23, in _check_balance
    dev_props = _get_devices_properties(device_ids)
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 455, in _get_devices_properties
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 455, in <listcomp>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 438, in _get_device_attr
    return get_member(torch.cuda)
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 455, in <lambda>
    return [_get_device_attr(lambda m: m.get_device_properties(i)) for i in device_ids]
  File "/usr/local/lib/python3.7/dist-packages/torch/cuda/__init__.py", line 312, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id

Do you know of a better way?
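
The UserWarning in the log above is separate from the crash: the DataLoader was asked for 8 worker processes while the system suggests at most 4. One way to avoid it is to cap the requested count before passing it as `num_workers` to `torch.utils.data.DataLoader`. The sketch below is only an illustration; the helper name `cap_workers` is hypothetical, and how the repository exposes its worker setting is not shown in this thread.

```python
import os


def cap_workers(requested, available=None):
    """Return a worker count no larger than the CPUs available to this process."""
    if available is None:
        if hasattr(os, "sched_getaffinity"):
            # Linux: count only the CPUs this process is allowed to run on.
            available = len(os.sched_getaffinity(0))
        else:
            # Other platforms: fall back to the total CPU count.
            available = os.cpu_count() or 1
    return max(0, min(requested, available))


print(cap_workers(8, available=4))  # 4
print(cap_workers(2, available=4))  # 2
```

The returned value would then be used in place of the hard-coded 8, for example `DataLoader(dataset, num_workers=cap_workers(8))`.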

@kazuma0606
Author

I was mistaken.
Training now works after changing the parameters to:
--gpu_ids 0, --patch_size 64,64,32
However, test.py does not run and I get the following message:

0% 0/123 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "test.py", line 218, in <module>
  File "test.py", line 158, in inference
    model.set_input(batch)
  File "/content/drive/My Drive/Colab Notebooks/3D_CycleGAN_Project/3D-CycleGan-Pytorch-MedImaging/models/cycle_gan_model.py", line 118, in set_input
    self.real_B = input[1 if AtoB else 0].to(self.device)
IndexError: index 1 is out of bounds for dimension 0 with size 1

Which parameters should I change?
Please let me know. Thank you very much for your help.

@s-jafarpoor

s-jafarpoor commented Sep 23, 2022

Hello,
Thanks for sharing the code.
I want to run on the CPU, so I set gpu_ids to -1:

 python .\train.py  --gpu_ids -1

but I get the following error:

Traceback (most recent call last):
  File ".\train.py", line 15, in <module>
    opt = TrainOptions().parse()
  File "E:\cyclegan3d\3D-CycleGan-Pytorch-MedImaging\options\base_options.py", line 102, in parse
    str_ids.remove(',')
ValueError: list.remove(x): x not in list
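
The traceback shows only the line `str_ids.remove(',')`; the surrounding parsing code is not shown in this thread. One plausible reading, stated here as an assumption, is that the option string is turned into a list of characters and the comma is then removed, which fails whenever the string contains no comma. The sketch below reproduces that failure and shows a more tolerant parser. The function name `parse_gpu_ids` is hypothetical and not part of the repository.

```python
def parse_gpu_ids(spec):
    """Turn a string such as '0,1', '0,' or '-1' into a list of GPU ids.

    Negative ids are dropped, so '-1' yields an empty list (CPU only).
    """
    ids = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate trailing commas such as '0,'
        value = int(token)
        if value >= 0:
            ids.append(value)
    return ids


# Assumed reproduction of the failure: a string without a comma becomes
# a list of characters, so there is no ',' to remove.
chars = list("-1")  # ['-', '1']
try:
    chars.remove(",")
except ValueError as err:
    print(err)  # list.remove(x): x not in list

print(parse_gpu_ids("-1"))   # []
print(parse_gpu_ids("0,"))   # [0]
print(parse_gpu_ids("0,1"))  # [0, 1]
```

Under the same assumption, a value with a trailing comma such as the `--gpu_ids 0,` used earlier in this thread would get past the `remove(',')` call, because the string does contain a comma.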
