[BUG] cudaErrorIllegalAddress: an illegal memory access was encountered Thread #441
Comments
Hi @kangna-qi, thank you for using SOK. This looks like an out-of-bounds GPU memory access. Could you share the code showing how you use SOK so that I can reproduce the problem? Also, does the problem occur at the beginning of training? If not, I recommend using HKV as the backend of the dynamic embedding variable. Here is an example: https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/sparse_operation_kit/sparse_operation_kit/examples/lookup_example_tf1/lookup_sparse_distributed_hkv_test.py
@kanghui0204 Thanks for your reply. I've already solved this problem. I can train the model when TF runs single-threaded. When TF trains the model with multiple threads, access to cuco has to be protected by a lock for the computation to be correct.
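For readers who hit the same error, the locking idea mentioned above can be sketched in plain Python. This is only an illustration under assumptions: `LockedTable` and `lookup_or_insert` are hypothetical names, the dictionary stands in for a table that is not safe for concurrent use, and none of it is the SOK or cuco API or the fix that was actually applied.

```python
import threading


class LockedTable:
    """Hypothetical stand-in for a table that is not safe for concurrent use."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def lookup_or_insert(self, key):
        # Serialize the check-then-insert so two threads can never
        # assign different slots to the same key.
        with self._lock:
            if key not in self._data:
                self._data[key] = len(self._data)
            return self._data[key]


def worker(table, keys, out):
    # Each thread looks up the same keys and records the slots it was given.
    out.extend(table.lookup_or_insert(k) for k in keys)


table = LockedTable()
results = [[] for _ in range(4)]
threads = [
    threading.Thread(target=worker, args=(table, range(1000), results[i]))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, every thread sees the same slot for a given key. The trade-off is that lookups are serialized, so this favors correctness over throughput.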
Hi @kangna-qi, could you clarify whether the multi-threaded training you mentioned refers to MirroredStrategy, or to concurrency between different ops? If it is MirroredStrategy, I would like to know whether anything in your setup requires it. If not, we recommend using Horovod for multi-GPU training.
Hi @minseokl, since @kangna-qi hasn't responded for 2 weeks, I've decided to close this issue. FYI.
Describe the bug
I use TF1 + SOK to train my own model, and I encounter this error.
Environment (please complete the following information):