[BUG]: Segfault when using Triton 24.09 #2028

Open
efajardo-nv opened this issue Oct 31, 2024 · 0 comments · May be fixed by #2079
Labels: bug (Something isn't working)

efajardo-nv commented Oct 31, 2024

Version

24.10

Which installation method(s) does this occur on?

No response

Describe the bug.

After switching from Triton 23.06 to 24.09, I now get a segfault when running the Sensitive Information Detection (SID) example.
https://github.com/nv-morpheus/Morpheus/tree/branch-24.10/examples/nlp_si_detection
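For context, Triton is launched per the example README, with only the image tag bumped from 23.06 to 24.09. The flags, paths, and model name below follow that README and should be treated as assumptions, not an exact transcript of my setup:

docker run --rm -ti --gpus=all -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $PWD/models:/models \
    nvcr.io/nvidia/tritonserver:24.09-py3 \
    tritonserver --model-repository=/models/triton-model-repo \
        --exit-on-error=false --model-control-mode=explicit \
        --load-model sid-minibert-onnx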

I'm also seeing the following in the Triton logs; the repeated CUDA error 400 corresponds to cudaErrorInvalidResourceHandle:

Triton server log:
I1031 19:52:26.979265 1 grpc_server.cc:2558] "Started GRPCInferenceService at 0.0.0.0:8001"
I1031 19:52:26.979522 1 http_server.cc:4704] "Started HTTPService at 0.0.0.0:8000"
I1031 19:52:27.020947 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
2024-10-31 19:53:20.292330528 [W:onnxruntime:log, tensorrt_execution_provider.h:90 log] [2024-10-31 19:53:20 WARNING] Detected layernorm nodes in FP16.
2024-10-31 19:53:20.292396299 [W:onnxruntime:log, tensorrt_execution_provider.h:90 log] [2024-10-31 19:53:20 WARNING] Running layernorm after self-attention with FP16 Reduce or Pow may cause overflow. Forcing Reduce or Pow Layers in FP32 precision, or exporting the model to use INormalizationLayer (available with ONNX opset >= 17) can help preserving accuracy.
2024-10-31 19:53:56.917212111 [W:onnxruntime:log, tensorrt_execution_provider.h:90 log] [2024-10-31 19:53:56 WARNING] Detected layernorm nodes in FP16.
2024-10-31 19:53:56.917239452 [W:onnxruntime:log, tensorrt_execution_provider.h:90 log] [2024-10-31 19:53:56 WARNING] Running layernorm after self-attention with FP16 Reduce or Pow may cause overflow. Forcing Reduce or Pow Layers in FP32 precision, or exporting the model to use INormalizationLayer (available with ONNX opset >= 17) can help preserving accuracy.
2024-10-31 19:53:56.966697260 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:56   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:56.966738118 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:56.992547855 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:56   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:56.992578951 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.014066591 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.014098015 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.030573573 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.030602777 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.058039747 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.058074218 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.077752653 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.077786224 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.102153869 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.102185109 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.122471509 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.122501052 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.137861868 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.137897524 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.153740337 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.153768012 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
2024-10-31 19:53:57.180863558 [E:onnxruntime:log, tensorrt_execution_provider.h:88 log] [2024-10-31 19:53:57   ERROR] IExecutionContext::enqueueV3: Error Code 1: Myelin ([exec_instruction.cpp:exec:905] CUDA error 400 launching __myl_RepGatGatGatResResAddResAddResMeaSubMulMea kernel.)
2024-10-31 19:53:57.180897473 [E:onnxruntime:, sequential_executor.cc:516 ExecuteKernel] Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.

Minimum reproducible example

Follow the steps in the example README. The error is seen when running the pipeline with the CLI, sketched below.
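The CLI invocation, reconstructed from the example README (exact flags may differ between releases, so treat this as a sketch rather than a verified transcript):

morpheus --log_level=INFO \
    run --num_threads=8 --pipeline_batch_size=1024 --model_max_batch_size=32 \
    pipeline-nlp --model_seq_length=256 \
    from-file --filename=examples/data/pcap_dump.jsonlines \
    deserialize \
    preprocess --vocab_hash_file=data/bert-base-uncased-hash.txt --truncation=True --do_lower_case=True --add_special_tokens=False \
    inf-triton --model_name=sid-minibert-onnx --server_url=localhost:8001 --force_convert_inputs=True \
    monitor --description "Inference Rate" --smoothing=0.001 --unit inf \
    add-class \
    serialize \
    to-file --filename=detections.jsonlines --overwrite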

Relevant log output

Pipeline log:

====Building Segment Complete!====
Inference Rate: 0 inf [00:40, ? inf/s]E20241031 19:39:15.559962 140437595592256 triton_inference.cpp:75] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
W20241031 19:39:15.560433 140437595592256 inference_client_stage.cpp:284] Exception while processing message for InferenceClientStage, attempting retry. ex.what(): Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
E20241031 19:39:15.560933 140437595592256 triton_inference.cpp:75] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(114)
*** Aborted at 1730403555 (unix time) try "date -d @1730403555" if you are using GNU date ***
PC: @ 0x7fbada196af5 morpheus::TritonInferenceClientSession::infer(morpheus::TritonInferenceClientSession::infer(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, morpheus::TensorObject, std::less<std::__cxx11::basic_string<char, std::ch…
*** SIGSEGV (@0x8) received by PID 5182 (TID 0x7fba2cff9640) from PID 8; stack trace: ***
@ 0x7fbb54dd8ee8 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x99ee7)
@ 0x7fbae6fe4917 google::(anonymous namespace)::FailureSignalHandler(int, siginfo*, void*)
E20241031 19:39:15.596625 140437553628736 triton_inference.cpp:75] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
W20241031 19:39:15.596770 140437553628736 inference_client_stage.cpp:284] Exception while processing message for InferenceClientStage, attempting retry. ex.what(): Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
E20241031 19:39:15.596795 140437553628736 triton_inference.cpp:75] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(114)
@ 0x7fbb54d81520 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x4251f)
@ 0x7fbada196af5 morpheus::TritonInferenceClientSession::infer(morpheus::TritonInferenceClientSession::infer(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, morpheus::TensorObject, std::less<std::__cxx11::basic_string<char, std::ch…
@ 0x7fbada306410 pybind11::cpp_function::initialize<mrc::pymrc::AsyncioScheduler::resume(mrc::pymrc::PyObjectHolder, std::__n4861::coroutine_handle<void>)::{lambda()#1}, void>(mrc::pymrc::AsyncioScheduler::resume(mrc::pymrc::PyObjectHolder, std::__n4861::coroutine_handle<v…
@ 0x7fbada1c86f6 pybind11::cpp_function::dispatcher(_object*, _object*, _object*)
@ 0x55b0727a7c46 cfunction_call
@ 0x55b0727a0f73 _PyObject_MakeTpCall.localalias
@ 0x55b072763850 context_run
@ 0x55b07279f703 cfunction_vectorcall_FASTCALL_KEYWORDS
@ 0x55b07279d8ec _PyEval_EvalFrameDefault
@ 0x55b0727a80cc _PyFunction_Vectorcall
@ 0x55b072798680 _PyEval_EvalFrameDefault
@ 0x55b0727a80cc _PyFunction_Vectorcall
@ 0x55b072798680 _PyEval_EvalFrameDefault
@ 0x55b0727a80cc _PyFunction_Vectorcall
@ 0x55b072798680 _PyEval_EvalFrameDefault
@ 0x55b0727b3638 method_vectorcall
@ 0x7fbada1ee0f5 pybind11::detail::simple_collector<(pybind11::return_value_policy)1>::call(_object*) const
@ 0x7fbada30a1a8 mrc::pymrc::AsyncioRunnable<std::shared_ptr<morpheus::ControlMessage>, std::shared_ptr<morpheus::ControlMessage> >::run(mrc::runnable::Context&)
@ 0x7fbada1b00ef mrc::runnable::RunnableWithContext<mrc::runnable::Context>::main(mrc::runnable::Context&)
E20241031 19:39:15.649539 140437587199552 triton_inference.cpp:75] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
W20241031 19:39:15.649679 140437587199552 inference_client_stage.cpp:284] Exception while processing message for InferenceClientStage, attempting retry. ex.what(): Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
E20241031 19:39:15.649712 140437587199552 triton_inference.cpp:75] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(114)
@ 0x7fbae13e1fae std::_Function_handler<void (), mrc::runnable::Runner::enqueue(std::shared_ptr<mrc::runnable::IEngines>, std::vector<std::shared_ptr<mrc::runnable::Context>, std::allocator<std::shared_ptr<mrc::runnable::Context> > >&&)::{lambda()#1}>::_M_invoke(std::_Any_…
@ 0x7fbae12f5fc8 std::thread::_State_impl<std::thread::_Invoker<std::tuple<mrc::system::ThreadResources::make_thread<boost::fibers::packaged_task<void ()> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, mrc::CpuSet, boost::fibers::package…
@ 0x7fbae7de7b65 execute_native_thread_routine
@ 0x7fbb54dd3ac3 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x94ac2)
E20241031 19:39:15.690391 140437637555776 triton_inference.cpp:75] Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
@ 0x7fbb54e64a04 clone
W20241031 19:39:15.690531 140437637555776 inference_client_stage.cpp:284] Exception while processing message for InferenceClientStage, attempting retry. ex.what(): Triton Error while executing 'results->Shape(model_output.name, &output_shape)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(447)
E20241031 19:39:15.690560 140437637555776 triton_inference.cpp:75] Triton Error while executing 'm_client.async_infer( [this, handle](triton::client::InferResult* result) { m_result.reset(result); handle(); }, m_options, m_inputs, m_outputs)'. Error: onnx runtime error 1: Non-zero status code returned while running TRTKernel_graph_torch_jit_3139280210422962738_0 node. Name:'TensorrtExecutionProvider_TRTKernel_graph_torch_jit_3139280210422962738_0_0' Status Message: TensorRT EP execution context enqueue failed.
../python/morpheus/morpheus/_lib/src/stages/triton_inference.cpp(114)
Segmentation fault (core dumped)

Full env printout

Environment details:
 ***git***
 commit de72aafc98ffb298195224fc77c2a31eac9efda2 (HEAD -> branch-24.10, origin/pull-request/1965, origin/branch-24.10)
 Author: Yuchen Zhang <[email protected]>
 Date:   Thu Oct 31 11:14:58 2024 -0700

 Fix `log_parsing` example pipeline null output issue (#2024)

 This bug is caused by the transition from `MultiMessage` to `ControlMessage`.

 `inference_stage.py::InferenceStage::_build_single` calls `_convert_one_response` in a loop for a batch, and the argument it passes is the same for the whole batch, but inside `_convert_one_response` we grab the tensors and start assigning at index zero, so the tensors overwrite each other and cause the issue.

 Added a `batch_offset` variable to keep track of where the next incoming tensor should start writing to the output message.

 Closes #2019


 ## By Submitting this PR I confirm:
 - I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
 - When the PR is ready for review, new or existing tests cover these changes.
 - When the PR is ready for review, the documentation is up to date with these changes.

 Authors:
 - Yuchen Zhang (https://github.com/yczhang-nv)
 - David Gardner (https://github.com/dagardner-nv)

 Approvers:
 - David Gardner (https://github.com/dagardner-nv)
 - Christopher Harris (https://github.com/cwharris)

 URL: https://github.com/nv-morpheus/Morpheus/pull/2024
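 The overwrite-then-offset behavior described in this commit message reduces to the following pattern (an illustrative Python sketch; the function and variable names are hypothetical, not the actual Morpheus code):

 import numpy as np

 def convert_responses(output: np.ndarray, responses: list[np.ndarray]) -> np.ndarray:
     batch_offset = 0  # rows already written into the output message
     for resp in responses:
         n = resp.shape[0]
         # The bug: writing every response at row 0 (output[0:n] = resp)
         # overwrites the previous response. Advancing by batch_offset
         # places each response after the rows already written.
         output[batch_offset:batch_offset + n] = resp
         batch_offset += n
     return output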
 ***git submodules***
 f69a1fa8f5977b02a70436d92febfd4db1e0ad4d external/morpheus-visualizations (v24.10.00a-1-gf69a1fa)
 87b33dd0b7fd3d7460742bc5ad13d77e0d722c3c external/utilities (v24.10.00a-10-g87b33dd)

 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=22.04
 DISTRIB_CODENAME=jammy
 DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"
 PRETTY_NAME="Ubuntu 22.04.5 LTS"
 NAME="Ubuntu"
 VERSION_ID="22.04"
 VERSION="22.04.5 LTS (Jammy Jellyfish)"
 VERSION_CODENAME=jammy
 ID=ubuntu
 ID_LIKE=debian
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 UBUNTU_CODENAME=jammy
 Linux EFAJARDO-DT 5.15.0-91-generic #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

 ***GPU Information***
 Thu Oct 31 23:03:56 2024
 +-----------------------------------------------------------------------------------------+
 | NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
 |-----------------------------------------+------------------------+----------------------+
 | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
 |                                         |                        |               MIG M. |
 |=========================================+========================+======================|
 |   0  Quadro RTX 8000                Off |   00000000:15:00.0 Off |                  Off |
 | 33%   38C    P8             11W /  260W |     547MiB /  49152MiB |      0%      Default |
 |                                         |                        |                  N/A |
 +-----------------------------------------+------------------------+----------------------+

 +-----------------------------------------------------------------------------------------+
 | Processes:                                                                              |
 |  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
 |        ID   ID                                                               Usage      |
 |=========================================================================================|
 +-----------------------------------------------------------------------------------------+

 ***CPU***
 Architecture:                       x86_64
 CPU op-mode(s):                     32-bit, 64-bit
 Address sizes:                      46 bits physical, 48 bits virtual
 Byte Order:                         Little Endian
 CPU(s):                             12
 On-line CPU(s) list:                0-11
 Vendor ID:                          GenuineIntel
 Model name:                         Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz
 CPU family:                         6
 Model:                              85
 Thread(s) per core:                 2
 Core(s) per socket:                 6
 Socket(s):                          1
 Stepping:                           4
 CPU max MHz:                        3700.0000
 CPU min MHz:                        1200.0000
 BogoMIPS:                           6800.00
 Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities
 L1d cache:                          192 KiB (6 instances)
 L1i cache:                          192 KiB (6 instances)
 L2 cache:                           6 MiB (6 instances)
 L3 cache:                           19.3 MiB (1 instance)
 NUMA node(s):                       1
 NUMA node0 CPU(s):                  0-11
 Vulnerability Gather data sampling: Mitigation; Microcode
 Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
 Vulnerability L1tf:                 Mitigation; PTE Inversion
 Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT vulnerable
 Vulnerability Meltdown:             Mitigation; PTI
 Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
 Vulnerability Retbleed:             Mitigation; IBRS
 Vulnerability Spec rstack overflow: Not affected
 Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
 Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
 Vulnerability Srbds:                Not affected
 Vulnerability Tsx async abort:      Mitigation; Clear CPU buffers; SMT vulnerable

 ***CMake***
 /opt/conda/envs/morpheus/bin/cmake
 cmake version 3.27.9

 CMake suite maintained and supported by Kitware (kitware.com/cmake).

 ***g++***
 /opt/conda/envs/morpheus/bin/g++
 g++ (conda-forge gcc 12.1.0-17) 12.1.0
 Copyright (C) 2022 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


 ***nvcc***
 /opt/conda/envs/morpheus/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2024 NVIDIA Corporation
 Built on Thu_Jun__6_02:18:23_PDT_2024
 Cuda compilation tools, release 12.5, V12.5.82
 Build cuda_12.5.r12.5/compiler.34385749_0

 ***Python***
 /opt/conda/envs/morpheus/bin/python
 Python 3.10.15

 ***Environment Variables***
 PATH                            : /opt/conda/envs/morpheus/bin:/opt/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/conda/bin
 LD_LIBRARY_PATH                 : /usr/local/nvidia/lib:/usr/local/nvidia/lib64
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    : /opt/conda/envs/morpheus
 PYTHON_PATH                     :

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report