
Related issues about using nvJPEG to implement image encoding and decoding on RTX 3060 #194

Open
OroChippw opened this issue Jun 28, 2024 · 2 comments
OroChippw commented Jun 28, 2024

Thanks for the contribution of this repository. I am a beginner with nvJPEG, trying to use an RTX 3060 to compress PNG or BMP images, and I have a few questions:

Resolution of input image: 8432 × 40000
Experimental version: CUDA 11.6

[screenshot attached by the author]

1. It takes about 334 ms to encode and compress an image of this size on the 3060 GPU, and about 12xx ms to decode. Is this time consumption normal?
2. After I finish decoding, I try to convert the image from nvjpeg_image to cv::Mat format. In getCVImage, shown below, I use a nested loop for this conversion. Is there a faster way?
cv::Mat NvjpegCompressRunnerImpl::getCVImage(const unsigned char *d_chanB, int pitchB,
                                             const unsigned char *d_chanG, int pitchG,
                                             const unsigned char *d_chanR, int pitchR,
                                             int width, int height)
{
    cudaEvent_t start, end;
    float milliseconds = 0.0;
    CHECK_CUDA(cudaEventCreate(&start));
    CHECK_CUDA(cudaEventCreate(&end));

    CHECK_CUDA(cudaEventRecord(start));

    cv::Mat cvImage(height, width, CV_8UC3); // BGR
    std::vector<unsigned char> vchanR(height * width);
    std::vector<unsigned char> vchanG(height * width);
    std::vector<unsigned char> vchanB(height * width);
    unsigned char *chanR = vchanR.data();
    unsigned char *chanG = vchanG.data();
    unsigned char *chanB = vchanB.data();

    // Copy each planar channel from device to host; each plane uses its own pitch.
    CHECK_CUDA(cudaMemcpy2D(chanR, (size_t)width, d_chanR, (size_t)pitchR,
                            width, height, cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaMemcpy2D(chanG, (size_t)width, d_chanG, (size_t)pitchG,
                            width, height, cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaMemcpy2D(chanB, (size_t)width, d_chanB, (size_t)pitchB,
                            width, height, cudaMemcpyDeviceToHost));

    // Interleave the three planes into BGR one pixel at a time (slow on large images).
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            cvImage.at<cv::Vec3b>(y, x) = cv::Vec3b(chanB[y * width + x], chanG[y * width + x], chanR[y * width + x]);
        }
    }

    CHECK_CUDA(cudaEventRecord(end));
    CHECK_CUDA(cudaEventSynchronize(end));

    CHECK_CUDA(cudaEventElapsedTime(&milliseconds, start, end));

    CHECK_CUDA(cudaEventDestroy(start));
    CHECK_CUDA(cudaEventDestroy(end));

    std::cout << "=> getCVImage execution time: " << milliseconds << " ms" << std::endl;

    return cvImage;
}
3. If I want to encode and decode the same picture on a GT 1030 GPU, it crashes directly. Should the large picture be divided into smaller pictures? Can multiple small pictures be compressed asynchronously?

Thank you again for your contribution. I look forward to your reply.
JanuszL added the nvJPEG label Jun 28, 2024
zohebk-nv (Collaborator) commented Jul 9, 2024

> It takes about 334 ms to encode and compress an image of this size on the 3060 GPU, and about 12xx ms to decode. Is this time consumption normal?

The number is plausible given the size of your image. If possible, please use the nsys (Nsight Systems) tool to generate a profile; this can help confirm that there are no other bottlenecks.
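
As a minimal illustration (the binary name here is hypothetical), a timeline profile can be captured with:

    nsys profile -o nvjpeg_report ./your_app

and the resulting report opened in the Nsight Systems GUI to see where the encode/decode time is actually spent (kernels vs. memcpy vs. host work).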

> After I finish decoding, I try to convert the image from nvjpeg_image to cv::Mat format. In getCVImage, I use a nested loop for this conversion. Is there a faster way?

I'm not too familiar with cv::Mat, so I won't be able to answer your question definitively. However, I did find this link (https://answers.opencv.org/question/134322/initialize-mat-from-pointer-help/) on opencv.org, which seems similar to your question. Hope this helps.
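
For illustration, here is a minimal sketch (not from this thread; the function name is made up) of one common way to avoid the per-pixel loop: wrap each host channel buffer from getCVImage in a no-copy cv::Mat header and let cv::merge interleave the planes in a single pass.

#include <opencv2/opencv.hpp>

// Sketch: interleave three host channel planes into a BGR cv::Mat without a
// per-pixel loop. Assumes the same host buffers produced in getCVImage above.
cv::Mat mergePlanesToBGR(unsigned char *chanB, unsigned char *chanG,
                         unsigned char *chanR, int width, int height)
{
    // Wrap each buffer in a single-channel Mat header; no data is copied here.
    cv::Mat planes[3] = {
        cv::Mat(height, width, CV_8UC1, chanB),
        cv::Mat(height, width, CV_8UC1, chanG),
        cv::Mat(height, width, CV_8UC1, chanR)
    };
    cv::Mat bgr;
    cv::merge(planes, 3, bgr); // one vectorized interleave into CV_8UC3
    return bgr;
}

The returned Mat owns its own buffer (cv::merge allocates it), so the temporary channel vectors can be freed afterwards.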

> If I want to encode and decode the same picture on a GT 1030 GPU, it crashes directly.

Would it be possible for you to try a recent CUDA toolkit (12.5) to see if the crash can be reproduced? We've made a lot of fixes since CUDA 11.6. If you still see the crash, it would be helpful if you could share self-contained reproducer code so that we can root-cause this at our end.

> Should the large picture be divided into smaller pictures? Can multiple small pictures be compressed asynchronously?

If this is on a GT 1030, dividing the image into smaller pictures will help, since the GT 1030 only has 2 GB of memory. Small images can be compressed asynchronously to an extent; synchronization is still required when retrieving each compressed bitstream to host memory. You will have to use multiple instances of the nvJPEG encoder to achieve asynchronous compression.
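
To make that last point concrete, below is a minimal sketch (not an official NVIDIA sample; encodeTilesAsync and the tile preparation are assumptions) of overlapping tile encodes by pairing one nvjpegEncoderState_t with one CUDA stream per tile, and synchronizing only at bitstream retrieval. Error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <nvjpeg.h>
#include <vector>

// Sketch: encode N device-resident BGR tiles concurrently, one encoder state
// and one CUDA stream per tile, then retrieve each bitstream after a sync.
void encodeTilesAsync(nvjpegHandle_t handle,
                      const std::vector<nvjpegImage_t> &tiles, // device planes
                      int tileW, int tileH,
                      std::vector<std::vector<unsigned char>> &jpegs)
{
    const int n = static_cast<int>(tiles.size());
    std::vector<cudaStream_t> streams(n);
    std::vector<nvjpegEncoderState_t> states(n);
    nvjpegEncoderParams_t params;
    nvjpegEncoderParamsCreate(handle, &params, nullptr);
    nvjpegEncoderParamsSetSamplingFactors(params, NVJPEG_CSS_420, nullptr);

    // Launch all encodes; each (stream, encoder state) pair is independent,
    // so the GPU work can overlap.
    for (int i = 0; i < n; ++i) {
        cudaStreamCreate(&streams[i]);
        nvjpegEncoderStateCreate(handle, &states[i], streams[i]);
        nvjpegEncodeImage(handle, states[i], params, &tiles[i],
                          NVJPEG_INPUT_BGR, tileW, tileH, streams[i]);
    }

    // Retrieval is the synchronization point mentioned above: wait for each
    // stream, query the bitstream size, then copy it to host memory.
    jpegs.resize(n);
    for (int i = 0; i < n; ++i) {
        cudaStreamSynchronize(streams[i]);
        size_t length = 0;
        nvjpegEncodeRetrieveBitstream(handle, states[i], nullptr, &length, streams[i]);
        jpegs[i].resize(length);
        nvjpegEncodeRetrieveBitstream(handle, states[i], jpegs[i].data(),
                                      &length, streams[i]);
        nvjpegEncoderStateDestroy(states[i]);
        cudaStreamDestroy(streams[i]);
    }
    nvjpegEncoderParamsDestroy(params);
}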

zohebk-nv self-assigned this Jul 9, 2024
OroChippw (Author) commented
Thank you very much for your reply. I used OpenCV's pointer-based Mat constructor, which improved the speed a lot. Is there any relevant sample of asynchronous stream compression with CUDA's nvJPEG that I can reference? Thank you.
