Describe the bug
I compiled SOK in the TensorFlow Docker container, and running the SOK unit test examples fails with the following error:
device:GPU:0 with 38294 MB memory) -> physical GPU (device: 0, name: NVIDIA A100-PCIE-40GB, pci bus id: 0000:35:00.0, compute capability: 8.0)
[SOK INFO] Initialize finished, communication tool: horovod
WARNING:tensorflow:From lookup_sparse_distributed_test.py:59: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
what(): Runtime error: invalid argument
cudaMemcpyAsync(tf_model_key.data(), model_key.data(), tf_model_key.num_bytes(), cudaMemcpyDeviceToDevice, core->get_local_gpu()->get_stream()) (copy_model_keys_and_offsets @ /workspace/HugeCTR/HugeCTR/embedding/all2all_embedding_collection.cu:538)
[notebook-tf1-py3-1ttrfki-notebook-0:24301] *** Process received signal ***
[notebook-tf1-py3-1ttrfki-notebook-0:24301] Signal: Aborted (6)
[notebook-tf1-py3-1ttrfki-notebook-0:24301] Signal code: (-6)
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fa7ea968090]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fa7ea96800b]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fa7ea947859]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fa7e69d4911]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fa7e69e038c]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa3f7)[0x7fa7e69e03f7]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa6a9)[0x7fa7e69e06a9]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 7] /usr/local/lib/libembedding.so(+0x12a41e)[0x7fa6442e441e]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 8] /usr/local/lib/libsparse_operation_kit.so(_ZN10tensorflow16LookupFowardBaseIllfN3sok9TFAdapterIllfEEE7forwardEPNS_15OpKernelContextERSt10shared_ptrIN4core19CoreResourceManagerEEP11CUstream_st+0x9f1)[0x7fa643c285f1]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [ 9] /usr/local/lib/libsparse_operation_kit.so(_ZN10tensorflow15LookupForwardOpIllfNS_3VarEN3sok9TFAdapterIllfEEE7ComputeEPNS_15OpKernelContextE+0x43b)[0x7fa643c44e6b]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [10] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0x3cb)[0x7fa6f08ae21b]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [11] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x113dab7)[0x7fa6f090bab7]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [12] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x113e11f)[0x7fa6f090c11f]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [13] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN5Eigen15ThreadPoolTemplIN10tensorflow6thread16EigenEnvironmentEE10WorkerLoopEi+0x285)[0x7fa6f09c0735]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [14] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZNSt17_Function_handlerIFvvEZN10tensorflow6thread16EigenEnvironment12CreateThreadESt8functionIS0_EEUlvE_E9_M_invokeERKSt9_Any_data+0x48)[0x7fa6f09bd278]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [15] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(+0x18d9da0)[0x7fa6f10a7da0]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7fa7ea90a609]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fa7eaa44133]
[notebook-tf1-py3-1ttrfki-notebook-0:24301] *** End of error message ***
Aborted (core dumped)
To Reproduce
Steps to reproduce the behavior:
Docker image: nvcr.io/nvidia/tensorflow:23.03-tf1-py3
1. Compile SOK.
2. cd HugeCTR/sparse_operation_kit/sparse_operation_kit/test/function_test/tf1/lookup
3. python lookup_sparse_distributed_test.py
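For anyone trying to reproduce this, the steps above could be wrapped roughly as follows. This is only a sketch: the image name, test directory, and test script come from this report, while the `run_repro` helper and the `sok_repro.log` file name are hypothetical names introduced here. The SOK build commands are not listed in this report, so they are left as a placeholder comment.

```shell
#!/usr/bin/env bash
# Values taken from the report.
IMAGE="nvcr.io/nvidia/tensorflow:23.03-tf1-py3"
TEST_DIR="HugeCTR/sparse_operation_kit/sparse_operation_kit/test/function_test/tf1/lookup"
TEST_SCRIPT="lookup_sparse_distributed_test.py"

# Step 1 (compile SOK) must be done before calling run_repro;
# the exact build commands are not stated in this report.

# Hypothetical helper: runs the failing test in a subshell (so the caller's
# working directory is unchanged) and saves the full output to sok_repro.log
# inside the test directory.
run_repro() {
  (
    cd "$TEST_DIR" || exit 1
    python "$TEST_SCRIPT" 2>&1 | tee sok_repro.log
  )
}
```

Assuming the container is started with something like `docker run --gpus all -it --rm "$IMAGE"` and SOK has been built inside it, calling `run_repro` from the directory that contains the `HugeCTR` checkout should reach the crash shown above.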
Environment (please complete the following information):
Additional context
Other unit test examples fail with errors as well.