[BUG] cudaErrorIllegalAddress: an illegal memory access was encounteredThread #441

Closed
kangna-qi opened this issue Feb 6, 2024 · 4 comments

Describe the bug
I am using TF1 + SOK to train my own model, and I hit the following error.

[1,2]<stderr>:terminate called after throwing an instance of 'thrust::system::system_error'
[1,2]<stderr>:terminate called recursively
[1,2]<stderr>:  what():  [1,2]<stderr>:terminate called recursively
[1,2]<stderr>:Fatal Python error: Aborted
[1,2]<stderr>:
[1,2]<stderr>:parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encounteredThread 0x
[1,2]<stderr>:00007f92f7c7b700 (most recent call first):
[1,2]<stderr>:  File [1,0]<stderr>:terminate called after throwing an instance of 'cuco::cuda_error'
[1,0]<stderr>:terminate called recursively
[1,0]<stderr>:  what():  CUDA error at: /workspace/HugeCTR/sparse_operation_kit/kit_src/variable/impl/dynamic_embedding_table/cuCollections/include/cuco/detail/dynamic_map.inl125: cudaErrorIllegalAddress an illegal memory access was encountered
[1,0]<stderr>:terminate called recursively
[1,0]<stderr>:Fatal Python error: Fatal Python error: AbortedAborted

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • Graphics card: NVIDIA DGX A100
  • CUDA version: CUDA 12
  • Docker image: nvcr.io/nvidia/tensorflow:23.03-tf1-py3
Collaborator

kanghui0204 commented Feb 6, 2024

Hi @kangna-qi, thank you for using SOK. This looks like a GPU out-of-bounds memory access.

Could you share the code that shows how you use SOK so that I can reproduce the problem?

Also, does the problem happen at the beginning of training? If not, I recommend using HKV as the backend of the dynamic embedding variable. Here is an example: https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/sparse_operation_kit/sparse_operation_kit/examples/lookup_example_tf1/lookup_sparse_distributed_hkv_test.py

kanghui0204 self-assigned this Feb 6, 2024
Author

kangna-qi commented Feb 18, 2024

@kanghui0204 Thanks for your reply. I've already solved this problem. I can train the model when TF runs single-threaded. When TF trains the model with multiple threads, access to cuco has to be protected by a lock to get correct results.
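The actual fix was not posted in this thread, so the following is only a minimal sketch of the idea of serializing table access with a lock. `ToyEmbeddingTable` and `run_workers` are hypothetical names used for illustration; this is plain Python with a dict standing in for the hash table, not SOK or cuco code.

```python
import threading


class ToyEmbeddingTable:
    """Hypothetical stand-in for a hash-table-backed embedding variable."""

    def __init__(self, dim):
        self._dim = dim
        self._rows = {}
        # One lock guards every read-modify-write on the table.
        self._lock = threading.Lock()

    def apply_update(self, key, delta):
        # Insert the row if it is missing, then accumulate the update.
        # Holding the lock keeps two threads from interleaving these steps.
        with self._lock:
            row = self._rows.setdefault(key, [0.0] * self._dim)
            for i, d in enumerate(delta):
                row[i] += d

    def snapshot(self):
        # Return a copy so callers never touch the live rows without the lock.
        with self._lock:
            return {k: list(v) for k, v in self._rows.items()}


def run_workers(num_threads=4, steps=1000, dim=4):
    """Run several threads that all update the same small set of keys."""
    table = ToyEmbeddingTable(dim)

    def worker():
        for step in range(steps):
            table.apply_update(step % 8, [1.0] * dim)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table
```

With the lock in place, the accumulated total across all rows equals `num_threads * steps` regardless of how the threads interleave. The trade-off is that all table operations are serialized, which is the same cost as the single-threaded workaround for that part of the step.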

Collaborator

kanghui0204 commented Feb 22, 2024

> @kanghui0204 Thanks for your reply. I've already solved this problem. I can train the model with TF single threading. When using TF for multi-threaded model training, cuco requires locking to ensure correct calculations.

Hi @kangna-qi, is the multi-threaded model training you mentioned MirroredStrategy, or is it concurrency between different ops? If it is MirroredStrategy, is there a requirement that makes MirroredStrategy necessary? If not, we recommend using Horovod for multi-GPU training.

Collaborator

Hi @minseokl, because @kangna-qi has not responded for 2 weeks, I have decided to close this issue. FYI.
