
Fix for multi-GPU setup: allow training with a single GPU. #974

Open
wants to merge 3 commits into main
Conversation

@Sehyo Sehyo commented Aug 30, 2024

check_nvidia() originally spawns a new process to run nvidia-smi. This bypasses any GPU-count restriction set via an OS environment variable, since the restriction is not reflected in the new process.

Added a check for whether the GPU set is limited by the OS environment: if multiple GPUs are still enabled, it raises an error like the original behaviour.

If only one GPU is enabled, only the output for that GPU is returned.
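The fix described above can be sketched as a small helper. Everything here is illustrative: the name `filter_visible_gpus` and its signature are hypothetical, and the PR's actual change lives in Unsloth's `check_nvidia()` and trainer patcher.

```python
import os

def filter_visible_gpus(nvidia_smi_rows, env=None):
    """Hypothetical sketch: nvidia-smi runs in a fresh subprocess and reports
    every physical GPU, so apply this process's CUDA_VISIBLE_DEVICES
    restriction to its output after the fact."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # No restriction set: keep the original behaviour.
        return list(nvidia_smi_rows)
    indices = [int(i) for i in visible.split(",") if i.strip()]
    if len(indices) > 1:
        # Multiple GPUs still enabled: raise like the original code.
        raise RuntimeError("Multi-GPU training is not supported.")
    # Only one GPU enabled: return output for that GPU alone.
    return [nvidia_smi_rows[i] for i in indices]
```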

Sehyo added 3 commits August 31, 2024 03:36
Add fixed code to the trainer patcher.
Fixed variable misname in trainer patcher.

@Sehyo Sehyo left a comment


Ready to merge; tested in my local multi-GPU setup. Single-GPU training is now possible when limited by the OS environment.


@RahulVadisetty91 RahulVadisetty91 left a comment



There are several instances where potential failures (like in convert_to_fast_tokenizer, try_fix_tokenizer, and assert_same_tokenization) are not properly logged or handled. Instead of silently returning the slow tokenizer, there should be logs or warnings for failed conversions or mismatches.

```python
if not check_vocab or not check_special:
    logger.warning("Vocab or special tokens do not match between slow and fast tokenizer.")
    return slow_tokenizer
```

@Sehyo

Sehyo commented Oct 13, 2024

@RahulVadisetty91 Hello, your review does not seem relevant to my PR. Can you elaborate on its relevance?

Comment on lines +1094 to +1096
```python
index_for_cuda = -1
if "CUDA_VISIBLE_DEVICES" in os.environ:
    index_for_cuda = os.environ["CUDA_VISIBLE_DEVICES"]
```


Suggested change:

```diff
-index_for_cuda = -1
-if "CUDA_VISIBLE_DEVICES" in os.environ:
-    index_for_cuda = os.environ["CUDA_VISIBLE_DEVICES"]
+index_for_cuda = os.environ.get("CUDA_VISIBLE_DEVICES", -1)
```



What if CUDA_VISIBLE_DEVICES="0,1,2"?



The next few lines would take care of that, I suppose.



Right, sorry
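For context on the thread above: `os.environ.get` returns the raw string (e.g. `"0,1,2"`, not an int), so the value still needs to be split before devices can be counted. A minimal illustration; `count_visible_devices` is a made-up name, not code from the PR:

```python
def count_visible_devices(value):
    # value is the raw CUDA_VISIBLE_DEVICES string, e.g. "0,1,2",
    # or None when the variable is unset.
    if value is None:
        return None  # no restriction in effect
    return len([i for i in value.split(",") if i.strip()])
```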

@hife-ai

hife-ai commented Oct 31, 2024

Hi @Sehyo!
Thank you for your PR. I see that this is not the only place where this happens.
For example in llama.py and throughout this file.

Could you please fix those places as well?

@udaygirish

@Sehyo @hife-ai While development is ongoing, wouldn't it be better for someone to update the README about using os.environ? This is a problem for the notebooks too. I also think we should use a device argument, which could be a list or an int, and rely on that everywhere rather than checking like this. If someone specifies more devices, the code can exit at the start, saying multi-GPU is not supported. What do you say?
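The device-argument idea above could be sketched like this; `validate_devices` is entirely hypothetical and not an existing Unsloth API:

```python
def validate_devices(devices):
    # Accept an int or a list of ints and fail fast at startup
    # if more than one GPU is requested.
    if isinstance(devices, int):
        devices = [devices]
    if len(devices) > 1:
        raise SystemExit("Multi-GPU training is not supported; pass a single device.")
    return devices
```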

@giuliabaldini

@hife-ai @Datta0 FYI, I am working on another PR for this. From my understanding, the only changes needed are the ones to check_nvidia(), because that is also the function called in other parts of the code.

@Datta0

Datta0 commented Nov 13, 2024

@giuliabaldini in that case can you please link that here?

@giuliabaldini

@Datta0 I am still testing that it works, I will open the PR as soon as I am sure and link it here!

@Sehyo

Sehyo commented Nov 13, 2024

> @hife-ai @Datta0 FYI, I am working on another PR for this. From my understanding the only changes that should be needed are the ones to check_nvidia(), because that is also the function that is called on other parts of the code.

I am not sure why you would think that is the case. The trainer patcher clearly contains code that will break multi-GPU functionality without my PR. This is the appropriate solution.

@giuliabaldini

Hey @Sehyo, do you intend to finish the PR and implement the other changes requested by @hife-ai and @Datta0 ?

@Sehyo

Sehyo commented Nov 13, 2024

Sure, I could do it tomorrow if necessary.

@giuliabaldini

Let me know if you can't and I will do it!

6 participants