
error using nvidia-container-cli for enroot (bug) #125

Open
KapilS25 opened this issue Jan 18, 2021 · 10 comments

@KapilS25

As reported by the enroot developer, kindly look into this: NVIDIA/enroot#54 (comment)

@klueska
Contributor

klueska commented Jan 18, 2021

Thanks for the report. There is a long thread on that link (some of which is relevant, some of which is not). Can you summarize the exact bug you are seeing with libnvidia-container here?

@KapilS25
Author

KapilS25 commented Jan 18, 2021

With cgroups enabled, nvidia-container-cli is unable to mount /dev from the host into the container's /dev.
We need to use the --no-devbind flag with nvidia-container-cli, which should not be necessary, as mentioned by the enroot developer:
NVIDIA/enroot#54 (comment)

@klueska
Contributor

klueska commented Jan 18, 2021

I don't know anything about enroot. Do you have a simple reproducer with nvidia-container-cli directly that I can use to see what your issue is?

@KapilS25
Author

Adding @3XX0 (the enroot developer) to the conversation. @3XX0, can you please explain the issue to @klueska, as I don't know exactly how enroot start uses nvidia-container-cli.

@3XX0
Member

3XX0 commented Jan 19, 2021

Basically it looks like the device mount fails if the device already exists at the destination.
I've never seen this before, so this might be RHEL specific:

mount error: file creation failed: /scratch/pbs/enroot-data/user-613.chas052/lammps/dev/nvidia-uvm-tools: operation not permitted

/dev/nvidia-uvm-tools already exists because /dev is bind mounted in the container, so mount shouldn't try to create it.

@KapilS25 Can you try adding strace to nvidia-container-cli in the nvidia hook so we can see the exact failure on open?
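
For context on the point above: a bind mount onto an existing destination does not require creating or re-permissioning the destination file, which is why the failure is surprising. The sketch below is a minimal illustration of the relevant mount(2) call, with placeholder paths rather than the actual enroot rootfs path; it is not the enroot or libnvidia-container code.

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Illustrative paths: the host device node and the same node inside the
     * container rootfs, which already exists because /dev is bind-mounted. */
    const char *src = "/dev/nvidia-uvm-tools";
    const char *dst = "/path/to/container/rootfs/dev/nvidia-uvm-tools";

    /* MS_BIND mounts src over whatever is already at dst; the destination only
     * has to exist, and its current owner and mode are irrelevant. Requires
     * CAP_SYS_ADMIN in the relevant mount namespace. */
    if (mount(src, dst, NULL, MS_BIND, NULL) < 0) {
        perror("mount");
        return 1;
    }
    return 0;
}
```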

@KapilS25
Author

KapilS25 commented Jan 19, 2021

Please find attached the output file for nvidia-container-cli with strace.
dev_mount_issue_nvidia-container-cli.strace.txt

@3XX0
Member

3XX0 commented Jan 19, 2021

Thanks, this makes sense now: the umask makes the open fail as it tries to adjust permissions.

@klueska
Contributor

klueska commented Jan 19, 2021

So it sounds like this is not actually a bug in libnvidia-container then, but rather expected behaviour given the umask set on /dev/nvidia-uvm-tools.

@3XX0
Member

3XX0 commented Jan 19, 2021

It is a bug: the file exists and can just be mounted over.
libnvidia-container shouldn't try to adjust the permissions of a device file to reflect the system umask; the permissions of the underlying file don't really matter.
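
To illustrate the failure mode described above: once the node already exists (because /dev is bind-mounted into the rootfs) and is owned by root, an unprivileged process that tries to adjust its mode, e.g. to compensate for the umask, gets EPERM, even though a plain bind mount over the same path would succeed without touching the mode at all. A minimal sketch under those assumptions; the path and mode are illustrative, and this is not the actual libnvidia-container code.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    /* Illustrative path: the node already exists because /dev is bind-mounted
     * into the container root, and it is owned by root. */
    const char *path = "/dev/nvidia-uvm-tools";

    /* Forcing the mode to 0666 (e.g. to compensate for the process umask) fails
     * for an unprivileged caller: chmod(2) requires file ownership or CAP_FOWNER,
     * so the result is EPERM ("Operation not permitted"). */
    if (chmod(path, 0666) < 0) {
        fprintf(stderr, "chmod(%s): %s\n", path, strerror(errno));
        return 1;
    }

    /* A bind mount over the same path, by contrast, never consults these permissions. */
    return 0;
}
```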

@ydm-amazon

I've also encountered the same bug.
