[BUG] cudaErrorIllegalAddress: an illegal memory access was encounteredThread #441

kangna-qi · 2024-02-06T08:31:27Z

Describe the bug
I am using TF1 + SOK to train my own model and ran into this error.

[1,2]<stderr>:terminate called after throwing an instance of 'thrust::system::system_error'
[1,2]<stderr>:terminate called recursively
[1,2]<stderr>:  what():  [1,2]<stderr>:terminate called recursively
[1,2]<stderr>:Fatal Python error: Aborted
[1,2]<stderr>:
[1,2]<stderr>:parallel_for: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encounteredThread 0x
[1,2]<stderr>:00007f92f7c7b700 (most recent call first):
[1,2]<stderr>:  File [1,0]<stderr>:terminate called after throwing an instance of 'cuco::cuda_error'
[1,0]<stderr>:terminate called recursively
[1,0]<stderr>:  what():  CUDA error at: /workspace/HugeCTR/sparse_operation_kit/kit_src/variable/impl/dynamic_embedding_table/cuCollections/include/cuco/detail/dynamic_map.inl125: cudaErrorIllegalAddress an illegal memory access was encountered
[1,0]<stderr>:terminate called recursively
[1,0]<stderr>:Fatal Python error: Fatal Python error: AbortedAborted

Environment (please complete the following information):

OS: Ubuntu 20.04
Graphics card: NVIDIA DGX A100
CUDA version: CUDA 12
Docker image: nvcr.io/nvidia/tensorflow:23.03-tf1-py3

kanghui0204 · 2024-02-06T09:38:54Z

Hi @kangna-qi, thank you for using SOK. This looks like an out-of-bounds GPU memory access.

Could you share the code showing how you use SOK so that I can reproduce the problem?

Also, does the problem happen at the beginning of your training? If not, I recommend using HKV as the backend of the dynamic embedding variable. Here is an example: https://github.com/NVIDIA-Merlin/HugeCTR/blob/main/sparse_operation_kit/sparse_operation_kit/examples/lookup_example_tf1/lookup_sparse_distributed_hkv_test.py

kangna-qi · 2024-02-18T09:06:38Z

@kanghui0204 Thanks for your reply. I've already solved this problem. I can train the model when TF runs single-threaded. When TF trains the model with multiple threads, cuco requires locking to ensure correct calculations.
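
The thread does not show the actual fix, so the following is only a minimal sketch of the general locking pattern described above: every access to a table that is not safe for concurrent use goes through one lock. `UnsafeTable`, `LockedTable`, and `run_workers` are hypothetical names used for illustration; a plain Python dict stands in for the GPU hash map, and nothing here is SOK or cuco API.

```python
import threading


class UnsafeTable:
    """Hypothetical stand-in for a table that must not be mutated concurrently."""

    def __init__(self):
        self._slots = {}

    def insert(self, key, value):
        self._slots[key] = value

    def size(self):
        return len(self._slots)


class LockedTable:
    """Wraps a table so that only one thread touches it at a time."""

    def __init__(self, table):
        self._table = table
        self._lock = threading.Lock()

    def insert(self, key, value):
        # Serialize writers: the lock is held for the whole insert.
        with self._lock:
            self._table.insert(key, value)

    def size(self):
        with self._lock:
            return self._table.size()


def run_workers(table, num_threads=4, keys_per_thread=1000):
    """Insert disjoint key ranges from several threads and return the final size."""

    def worker(thread_id):
        base = thread_id * keys_per_thread
        for i in range(keys_per_thread):
            table.insert(base + i, float(i))

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table.size()


if __name__ == "__main__":
    print(run_workers(LockedTable(UnsafeTable())))
```

A single coarse lock is the simplest choice and trades throughput for safety; whether that cost is acceptable depends on how often the table is on the critical path.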

kanghui0204 · 2024-02-22T00:41:17Z

Hi @kangna-qi, is the multi-threaded model training you mentioned MirroredStrategy, or is it concurrency between different ops? If it is MirroredStrategy, I would like to know whether anything requires you to use it. If not, we recommend using Horovod for multi-GPU training.

kanghui0204 · 2024-03-04T06:36:12Z

Hi @minseokl, since @kangna-qi has not responded for 2 weeks, I am closing this issue. FYI.

kanghui0204 self-assigned this Feb 6, 2024

kanghui0204 closed this as completed Mar 4, 2024
