
Custom Plugin Debugging #3105

Closed
jeethesh-pai opened this issue Jul 3, 2023 · 4 comments
Assignees: zerollzeng
Labels: triaged (Issue has been triaged by maintainers)

Comments

@jeethesh-pai

How to debug the enqueue function of custom Plugins

I am trying to convert the Deformable DETR model to TensorRT. All the models are publicly available; I tried with this model. The model has a custom operator called MultiScaleDeformableAttention, which can be seen here.

In order to convert this model to TensorRT, I had to write a custom plugin for this operation, which I named MSDeformAttentionPlugin. I used the NVIDIA-AI-IOT tensorrt_plugin_generator repo to build a standalone custom plugin, independent of the TensorRT repo.

I am converting this model using

trtexec --onnx=<path> --saveEngine=<path> --plugins=<path to compiled library> --verbose --workspace=16000

trtexec successfully parses my plugin and generates an engine, but during the inference phase the plugin fails with an illegal memory access. Is there any way to know where exactly in my kernel this illegal memory access occurred?

The error is like this.

[07/03/2023-14:32:48] [TRT] [E] 1: [runner.cpp::execute::386] Error Code 1: Myelin (Final synchronize failed (700))
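
For reference, a minimal sketch of one way to localize such an error: rerun the same command under compute-sanitizer (shipped with the CUDA toolkit), whose memcheck tool reports the faulting kernel and, if the plugin is built with nvcc -lineinfo, the offending source line:

compute-sanitizer --tool memcheck trtexec --onnx=<path> --saveEngine=<path> --plugins=<path to compiled library> --verbose --workspace=16000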

Can somebody help me with debugging my plugin's enqueue function? Since the enqueue arguments are all pointers, I am not able to get the sizes of the variables being passed to this function.
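
A minimal sketch (not the actual plugin code) of how those sizes can be recovered: for an IPluginV2DynamicExt plugin, the PluginTensorDesc arguments that TensorRT passes to enqueue carry the runtime dims of every input and output, so the shapes can be dumped before launching the kernel.

#include <cstdint>
#include <cstdio>
#include <NvInfer.h>

// Print the runtime shape and element count of each enqueue input so kernel
// launch bounds can be checked against what TensorRT actually passes in.
static void dumpEnqueueInputs(nvinfer1::PluginTensorDesc const* inputDesc, int32_t nbInputs)
{
    for (int32_t i = 0; i < nbInputs; ++i)
    {
        int64_t elems = 1;
        std::printf("input %d dims:", i);
        for (int32_t d = 0; d < inputDesc[i].dims.nbDims; ++d)
        {
            std::printf(" %d", inputDesc[i].dims.d[d]);
            elems *= inputDesc[i].dims.d[d];
        }
        std::printf(" (%lld elements)\n", static_cast<long long>(elems));
    }
}

Calling dumpEnqueueInputs(inputDesc, 5) at the top of enqueue (this plugin has five inputs) prints the concrete shapes for each execution.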

Thanks for the help!

Environment

TensorRT Version: 8.2

NVIDIA GPU: GeForce RTX 3080 Laptop

NVIDIA Driver Version: 535.54.03

CUDA Version: 12.2

Operating System:

Python Version (if applicable): 3.8.12

Tensorflow Version (if applicable): NA

PyTorch Version (if applicable): 1.11.0a0+17540c5

Baremetal or Container (if so, version): Container - nvcr.io/nvidia/pytorch:22.02-py3

@zerollzeng
Collaborator

Hi, I am the author of https://github.com/NVIDIA-AI-IOT/tensorrt_plugin_generator, thanks for trying it and giving feedback :-)

Could you please provide a reproduction for this issue? I can help check it further; an ONNX and the tensorrt_plugin_generator yaml config should be enough. Thanks!

@zerollzeng zerollzeng self-assigned this Jul 4, 2023
@zerollzeng zerollzeng added the triaged Issue has been triaged by maintainers label Jul 4, 2023
@jeethesh-pai
Author

Hi @zerollzeng, thanks for the repository. I loved the abstraction and how easy it is to use for generating a custom plugin. It also seems like a recommended way of writing a custom plugin, since it is completely independent of the TensorRT repository, which can take a long time to compile when something is changed. I would suggest adding NVIDIA-AI-IOT/tensorrt_plugin_generator to the official TensorRT documentation so that it reaches the whole TensorRT audience.

About my problem: unfortunately, I cannot upload anything from my device because of confidentiality. But as described in my question, you can download the publicly available repo and the weights mentioned there and convert the model to ONNX within minutes. The plugin yaml is something like this:

MSDeformAttentionPlugin:
  attributes:
    im2col_step:
      datatype: int32
  inputs:
    tpg_input_0: # value
      shape: -1x18259x8x32
    tpg_input_1: # value_spatial_shapes
      shape: 4x2
    tpg_input_2: # value_level_start_index
      shape: 4
    tpg_input_3: # sampling_locations
      shape: -1x-1x8x4x4x2
    tpg_input_4: # attention_weights
      shape: -1x-1x8x4x4
  outputs:
    tpg_output_0:
      shape: -1x-1x256
  plugin_type: IPluginV2DynamicExt
  support_format_combination:
    - "float32+int32+int32+float32+float32+float32"

@zerollzeng
Collaborator

I wonder whether the error is caused by the plugin or not. Could you please try splitting the network into prev_network + plugin_only, then run each model separately?
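
A hedged sketch of one way to do that split with Polygraphy (assuming it is installed, e.g. via pip install polygraphy); the tensor names below are placeholders for whatever actually feeds the MultiScaleDeformableAttention node in the real ONNX:

polygraphy surgeon extract model.onnx -o plugin_only.onnx \
    --inputs value:auto:auto value_spatial_shapes:auto:auto value_level_start_index:auto:auto \
             sampling_locations:auto:auto attention_weights:auto:auto \
    --outputs msda_output:auto

The extracted plugin_only.onnx can then be run through trtexec with --plugins as above, which tells whether the illegal access reproduces with the plugin in isolation.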

@jeethesh-pai
Author

jeethesh-pai commented Jul 6, 2023

I will close this issue since it is being tracked in your other repo. The bug is in my plugin, and I wanted to debug that plugin.
