
Code not optimized for GPU #4

Open
RahulBhalley opened this issue Feb 16, 2023 · 15 comments

Comments

@RahulBhalley

Hi @JCBrouwer, I've been playing with your code. It's really good!

But the only issue is that it doesn't seem optimized for the GPU: average GPU utilization is only ~40%.

Do you have any suggestions regarding optimizing it for running on GPU?

Regards,
Rahul Bhalley

@JCBrouwer
Owner

Hi Rahul, thanks for your interest in the code! I've updated the PyTorch version and swapped over to torch.linalg.eigh like you suggested in #3.

In terms of the performance issues, I believe it's mainly due to the fact that the data is relatively small and so can't saturate the GPU. When using the multi-scale mode the image is first optimized at a smaller resolution and progressively upscaled to the final desired size.
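
For reference, the multi-scale schedule is roughly this shape (a minimal sketch, not the repo's actual code; sizes and optimize_at_scale are placeholders), which is why the early passes run on quite small tensors:

```python
import torch
import torch.nn.functional as F

def optimize_at_scale(img: torch.Tensor, size: int) -> torch.Tensor:
    # Stand-in for the per-scale optimization loop (hypothetical, not the repo's code).
    return img

# Optimize at a small resolution first, then upscale and continue at the next size.
sizes = [256, 512, 768, 1024]
pastiche = torch.rand(1, 3, sizes[0], sizes[0], device="cuda")
for size in sizes:
    pastiche = F.interpolate(pastiche, size=(size, size), mode="bicubic", align_corners=False)
    pastiche = optimize_at_scale(pastiche, size)
```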

This leads me to believe that the low utilization is primarily due to the overhead of repeatedly launching many small CUDA kernels. To me this sounds like an ideal setting for torch's CUDA graphs API.
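
A rough sketch of what capturing one iteration with torch's CUDA graphs API could look like (some_step here is a hypothetical stand-in, not a function from this repo):

```python
import torch

def some_step(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for one optimization iteration (hypothetical, not the repo's code).
    return torch.relu(x * 2.0 - 1.0)

static_input = torch.randn(1, 3, 256, 256, device="cuda")

# Warm up on a side stream before capture, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = some_step(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one iteration into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = some_step(static_input)

# Replay: copy fresh data into the captured input tensor, then launch everything at once.
static_input.copy_(torch.randn_like(static_input))
g.replay()
```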

It might require a bit more detailed profiling to be sure that this is the issue though.

@JCBrouwer
Owner

JCBrouwer commented Feb 16, 2023

Alright, I did some quick profiling; it looks like it's not the kernel launch overhead, but just host-side operations in general...

[Plots: temporal_breakdown, idle_time_breakdown]

@RahulBhalley
Author

RahulBhalley commented Feb 16, 2023

Woah! A ~90% speedup will make it really fast! I have a few questions:

  • What does 'host_wait' mean? Is it the GPU waiting for the CPU to complete its task?
  • If so, any guidance on how to track this down?
  • What's the name of this profiling tool?

@JCBrouwer
Owner

JCBrouwer commented Feb 16, 2023

The plots are from Holistic Trace Analysis. 'host_wait' is indeed the GPU waiting for the CPU to give it work.
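
For reference, collecting a trace with torch.profiler and loading it into Holistic Trace Analysis looks roughly like this (just a sketch; the paths and run_one_pass are placeholders, not the repo's code):

```python
import os
import torch
from torch.profiler import ProfilerActivity, profile

def run_one_pass() -> None:
    # Stand-in for one synthesis pass (hypothetical, not the repo's code).
    x = torch.randn(4096, 512, device="cuda")
    for _ in range(100):
        x = torch.relu(x @ torch.randn(512, 512, device="cuda"))

# 1. Record a trace with the PyTorch profiler.
os.makedirs("traces", exist_ok=True)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_one_pass()
prof.export_chrome_trace("traces/rank0.json")

# 2. Analyze the trace directory with Holistic Trace Analysis
#    (pip install HolisticTraceAnalysis).
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="traces")
analyzer.get_temporal_breakdown()   # compute vs. non-compute vs. idle time
analyzer.get_idle_time_breakdown()  # host_wait / kernel_wait / other per stream
```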

Looking a bit closer at the actual traces shows that drawing the random rotation dominates the time of each histogram matching iteration. Just replacing the .item() call in there with a .clone() helps a little, as it saves a round-trip to host memory, but overall utilization still isn't great. I also tried decorating the function with @torch.jit.script, but that didn't help much either. The trace of this function is still predominantly CPU operations even though the device is correctly specified as 'cuda', as far as I can tell. I wonder if there's some way to vectorize this operation?
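
One possible way to keep the rotation entirely on the GPU would be a QR-based draw (just a sketch, not the repo's random_rotation, and the signature is hypothetical):

```python
import torch

def random_rotation(dim: int, device: str = "cuda") -> torch.Tensor:
    # Draw a random orthogonal matrix via QR of a Gaussian matrix, fully on-device,
    # so there are no .item() calls or other host round-trips.
    a = torch.randn(dim, dim, device=device)
    q, r = torch.linalg.qr(a)
    # Sign correction so the result is uniformly distributed (Haar measure).
    return q * torch.sign(torch.diagonal(r))
```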

Another small improvement is using the 'chol' histogram matching method instead of 'pca'. Doing a Cholesky decomposition is quite a bit faster than running the eigenvalue solver.
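
The idea behind the 'chol' mode, in a minimal standalone sketch (not the repo's hist_match; the eps here is diagonal regularization I'm adding for stability):

```python
import torch

def chol_match(source: torch.Tensor, target: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Match the mean and covariance of `source` to `target` using Cholesky factors
    # instead of an eigendecomposition: whiten with inv(L_s), color with L_t.
    mu_s, mu_t = source.mean(0), target.mean(0)
    xs, xt = source - mu_s, target - mu_t
    eye = torch.eye(xs.shape[1], device=xs.device)
    cov_s = xs.T @ xs / (xs.shape[0] - 1) + eps * eye
    cov_t = xt.T @ xt / (xt.shape[0] - 1) + eps * eye
    L_s = torch.linalg.cholesky(cov_s)
    L_t = torch.linalg.cholesky(cov_t)
    # (x - mu_s) @ inv(L_s).T @ L_t.T has covariance L_t @ L_t.T == cov_t.
    return xs @ torch.linalg.inv(L_s).T @ L_t.T + mu_t

# Example: match 10k 64-dim pastiche features to style features.
matched = chol_match(torch.randn(10000, 64), torch.randn(10000, 64) * 2 + 1)
```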

One last thing that helped quite a bit for me is to set torch.backends.cudnn.benchmark = False. This is because the implementation repeatedly cycles through forward passes at different resolutions, which requires the cuDNN autotuner to re-run every time for just a single forward pass.

I also tried cutting out some of the encode/decode steps which are happening at the beginning and end of each pass, but it seems like the feature inverters are actually separately trained for each depth they invert from, so this ruins the quality of results.

You can see some of the things I tried in this branch.

@JCBrouwer
Owner

Alright, I've just merged a refactor which makes a few changes for better performance. I've got a few more ideas, but give this version a try and please let me know how it compares on your machine.

@RahulBhalley
Author

> Alright, I've just merged a refactor which makes a few changes for better performance. I've got a few more ideas, but give this version a try and please let me know how it compares on your machine.

I deeply apologize for not replying. I got a little sick right after opening this issue. I'll surely test it out & let you know. Thank you for doing all this. :)

@RahulBhalley
Author

Not sure how much you changed the code, but my first script run fails to converge. I used the same arguments as before and also tried changing the seed. Now I'll just start from where you started (profiling the previous code) and then slowly make changes to the code.

Pass 0, size 256
Layer: relu5_1
Layer: relu4_1
Traceback (most recent call last):
  File "/workspace/OptimalTextures/optex.py", line 283, in <module>
    pastiche = texturizer.forward(pastiche, styles, content, verbose=True)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/workspace/OptimalTextures/optex.py", line 112, in forward
    
                for _ in range(self.iters_per_pass_and_layer[p][l - 1]):
                    pastiche_feature = optimal_transport(pastiche_feature, style_features[l], self.hist_mode)
                                       ~~~~~~~~~~~~~~~~~ <--- HERE
    
                    if len(content_features) > 0 and l >= 2:  # apply content matching step
  File "/workspace/OptimalTextures/optex.py", line 168, in optimal_transport
    rotated_style = style_feature @ rotation

    matched_pastiche = hist_match(rotated_pastiche, rotated_style, mode=hist_mode)
                       ~~~~~~~~~~ <--- HERE

    pastiche_feature = matched_pastiche @ rotation.T  # rotate back to normal
  File "/workspace/OptimalTextures/histmatch.py", line 37, in hist_match

        else:  # mode == "sym"
            eva_t, eve_t = torch.linalg.eigh(cov_t, UPLO="U")
                           ~~~~~~~~~~~~~~~~~ <--- HERE
            Qt = eve_t @ torch.sqrt(torch.diag(eva_t)) @ eve_t.T
            Qt_Cs_Qt = Qt @ cov_s @ Qt
RuntimeError: linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 301).

@JCBrouwer
Owner

JCBrouwer commented Feb 18, 2023

Ahh I see you're using the 'sym' hist_mode. I've been using 'chol' as it's quite a bit faster (and apparently doesn't have these convergence issues?)

One thing you can do to help with convergence is to increase the eps argument of hist_match(). I've just pushed another small update which is even a little faster on my machine (with eps bumped up quite a bit).
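
In case it helps, the eps is essentially diagonal jitter on the covariance before the decomposition, something along these lines (a sketch of the general idea, assuming that's how hist_match applies it):

```python
import torch

def regularize(cov: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    # Add eps to the diagonal so an ill-conditioned covariance still decomposes.
    return cov + eps * torch.eye(cov.shape[0], device=cov.device)

# Example: a nearly rank-deficient covariance that eigh can struggle with.
x = torch.randn(1000, 8)
x[:, -1] = x[:, 0]  # perfectly correlated columns
cov = x.T @ x / (x.shape[0] - 1)
eva, eve = torch.linalg.eigh(regularize(cov), UPLO="U")
```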

The profiler is still showing random_rotation() as the bottleneck, but I'm just not sure how to make it more efficient.

@RahulBhalley
Author

RahulBhalley commented Feb 18, 2023

> Ahh I see you're using the 'sym' hist_mode. I've been using 'chol' as it's quite a bit faster (and apparently doesn't have these convergence issues?)

Okay, I did try that, but the results are now inferior to those from your previous code (before I pinged you).

I used the same command for style transfer: python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.2 --hist chol --seed 0.

Synthesis with the previous source code:
[Image: lava-small_rocket_strength0.2_cholhist_512]

Synthesis with the current modification you made:
[Image: lava-small_rocket_strength0.2_cholhist_512]

I am still reading the paper (I started today), so I am far from understanding the code. But I will, soon.

@RahulBhalley
Author

RahulBhalley commented Feb 18, 2023

> One thing you can do to help with convergence is to increase the eps argument of hist_match(). I've just pushed another small update which is even a little faster on my machine (with eps bumped up quite a bit).

How much time does it take, and at what resolution? For me, these runs took 36s (previous code) and 34.5s (current code). I didn't try multiple runs, so these aren't averaged times.

@JCBrouwer
Owner

My bad, I missed swapping the if statement's condition when I reversed the for loop's direction.

For me the original code was taking about 30 seconds for the simple texture synthesis case and now is around 11 seconds on a 1080 ti.

I haven't been testing the style transfer case though (as is apparent by the error you just encountered). I guess I should write a little test suite...

@RahulBhalley
Author

RahulBhalley commented Feb 18, 2023

> I haven't been testing the style transfer case though (as is apparent by the error you just encountered). I guess I should write a little test suite...

Interesting, on my side the texture was also synthesized correctly.

OMG, the texture is very heavy and large-scale now.

[Image: lava-small_rocket_strength0.2_cholhist_512]

@RahulBhalley
Author

Now, I'm also unable to push the resolution above 1024.

Pass 0, size 256
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 1, size 512
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 2, size 768
Layer: relu5_1
Layer: relu4_1
Layer: relu3_1
Layer: relu2_1
Layer: relu1_1
Pass 3, size 1024
Layer: relu5_1
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│                                                                                                  │
│ /workspace/OptimalTextures/optex.py:283  │
│ in <module>                                                                                      │
│                                                                                                  │
│   280 │   │   from time import time                                                              │
│   281 │   │                                                                                      │
│   282 │   │   t = time()                                                                         │
│ ❱ 283 │   │   pastiche = texturizer.forward(pastiche, styles, content, verbose=True)             │
│   284 │   │   print("Took:", time() - t)                                                         │
│   285 │                                                                                          │
│   286 │   save_image(pastiche, args)                                                             │
│ /workspace/OptimalTextures/optex.py:116  │
│ in forward                                                                                       │
│                                                                                                  │
│   113 │   │   │   │   │                                                                          │
│   114 │   │   │   │   │   if len(content_features) > 0 and l <= 2:  # apply content matching s   │
│   115 │   │   │   │   │   │   strength = self.content_strength / 2 ** (4 - l)  # 1, 2, or 4 de   │
│ ❱ 116 │   │   │   │   │   │   pastiche_feature += strength * (content_features[l] - pastiche_f   │
│   117 │   │   │   │                                                                              │
│   118 │   │   │   │   if self.use_pca:                                                           │
│   119 │   │   │   │   │   pastiche_feature = pastiche_feature @ style_eigvs[l].T  # reverse pr   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (80) must match the size of tensor b (64) at non-singleton dimension 2

@JCBrouwer
Owner

JCBrouwer commented Feb 19, 2023

Hmmm, could you give the exact command you ran here? If I had to guess I'd say it's related to rounding errors in the multi-resolution resizing code. Are you using a non-square image?
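
If it does turn out to be a rounding issue, one hypothetical fix is to snap every intermediate size to a common multiple so content and pastiche features always agree (purely illustrative, not code from this repo):

```python
def scaled_size(base: int, scale: float, multiple: int = 16) -> int:
    # Round the scaled size to the nearest multiple so features computed from
    # separately resized images end up with identical spatial dimensions.
    return max(multiple, int(round(base * scale / multiple)) * multiple)

print(scaled_size(1024, 0.75))  # 768
```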

For me the following is working fine on the current main branch.

python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.5 --size 1448

[Image: lava-small_rocket_strength0.5_cholhist_1448]

@RahulBhalley
Author

RahulBhalley commented Feb 19, 2023

> Hmmm, could you give the exact command you ran here? If I had to guess I'd say it's related to rounding errors in the multi-resolution resizing code. Are you using a non-square image?
>
> For me the following is working fine on the current main branch.
>
> python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.5 --size 1448
>
> [Image: lava-small_rocket_strength0.5_cholhist_1448]

Could be something wrong on my end if yours is working fine. I won't ping you again until I understand the whole paper and your code; I don't want to take up your time, and you might be busy elsewhere. :) Thanks for your help, by the way.
