Code not optimized for GPU #4
Hi Rahul, thanks for your interest in the code! I've updated the PyTorch version and swapped over to torch.linalg.eigh like you suggested in #3.

In terms of the performance issues, I believe it's mainly because the data is relatively small and so can't saturate the GPU. When using the multi-scale mode the image is first optimized at a smaller resolution and progressively upscaled to the final desired size. This leads me to believe that the low utilization is primarily due to the overhead of repeatedly launching many small CUDA kernels. To me this sounds like an ideal setting for torch's CUDA graphs API. It might require a bit more detailed profiling to be sure that this is the issue, though.
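For reference, here is a minimal sketch of the CUDA graphs pattern mentioned above. The step function, tensor shapes, and iteration counts are placeholders, not code from this repository:

```python
import torch

def step(x):
    # Stand-in for one small-kernel-heavy iteration (e.g. a histogram
    # matching step); replace with the real workload.
    return (x @ x.t()).relu()

x = torch.randn(512, 512, device="cuda")

# Warm up on a side stream so lazily-allocated workspaces exist before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        y = step(x)
torch.cuda.current_stream().wait_stream(s)

# Capture one iteration into a graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = step(x)

# Replay: refresh the captured input tensor in place, then launch the whole
# recorded kernel sequence with a single call.
for _ in range(100):
    x.copy_(torch.randn(512, 512, device="cuda"))
    g.replay()
# `y` now holds the result of the last replay.
```

The win comes from collapsing many tiny kernel launches into a single graph launch, which is exactly the overhead described above.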
Woah! A ~90% speedup would make it really fast! I have a few questions:
The plots are from Holistic Trace Analysis. 'host_wait' is indeed the GPU waiting for the CPU to give it work. Looking a bit closer at the actual traces shows that drawing the random rotation dominates the time of each histogram matching iteration, so just replacing how that rotation is generated already helps a lot.

Another small improvement is using the 'chol' histogram matching method instead of 'pca'. Doing a Cholesky decomposition is quite a bit faster than running the eigenvalue solver. One last settings tweak also helped quite a bit for me.

I also tried cutting out some of the encode/decode steps which happen at the beginning and end of each pass, but it seems the feature inverters are actually trained separately for each depth they invert from, so this ruins the quality of the results. You can see some of the things I tried in this branch.
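As a rough illustration of why the Cholesky route is cheaper than the eigendecomposition route for covariance matching, here is a sketch of both transforms. The shapes, epsilon, and the exact whitening/coloring recipe are assumptions for the example, not the actual optex.py implementation:

```python
import torch

def covariance(x):
    # x: (N, C) feature matrix with N samples and C channels
    x = x - x.mean(dim=0, keepdim=True)
    return x.t() @ x / (x.shape[0] - 1)

def match_chol(source, target, eps=1e-5):
    # Map source covariance onto target covariance via Cholesky factors:
    # A = L_t @ inv(L_s) satisfies A cov_s A^T = cov_t.
    eye = eps * torch.eye(source.shape[1], device=source.device)
    L_s = torch.linalg.cholesky(covariance(source) + eye)
    L_t = torch.linalg.cholesky(covariance(target) + eye)
    A = L_t @ torch.linalg.inv(L_s)
    return (source - source.mean(0)) @ A.t() + target.mean(0)

def match_eigh(source, target, eps=1e-5):
    # Same mapping built from symmetric eigendecompositions (whiten, then color);
    # equally valid, but the iterative eigenvalue solver is the slower part.
    eva_s, eve_s = torch.linalg.eigh(covariance(source))
    eva_t, eve_t = torch.linalg.eigh(covariance(target))
    whiten = eve_s @ torch.diag((eva_s + eps).rsqrt()) @ eve_s.t()
    color = eve_t @ torch.diag((eva_t + eps).sqrt()) @ eve_t.t()
    return (source - source.mean(0)) @ (color @ whiten).t() + target.mean(0)
```

Both functions produce output with the target's mean and covariance; the Cholesky path just gets there without the eigenvalue solve.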
Alright, I've just merged a refactor which makes a few changes for better performance. I've got a few more ideas, but give this version a try and please let me know how it compares on your machine.
I deeply apologize for not replying. I got a little sick right after opening this issue. I'll surely test it out & let you know. Thank you for doing all this. :)
Not sure how much you changed the code, but my first script run fails to converge. I used the same arguments as before and also tried changing the seed. Now I'll just start from where you started (profiling the previous code) and then make changes to the code slowly.
Ahh, I see you're using the 'sym' hist_mode. I've been using 'chol' as it's quite a bit faster (and apparently doesn't have these convergence issues?). One thing you can do to help with convergence is to increase the number of passes. The profiler is still showing a fair amount of host_wait for me, though, so there's likely more to be gained.
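If it helps to double-check where the remaining time goes, here is a minimal sketch of collecting a trace with torch.profiler; the profiled function and output path are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_one_pass():
    # Placeholder workload standing in for one optimization pass.
    x = torch.randn(1024, 1024, device="cuda")
    for _ in range(50):
        x = (x @ x).clamp_(-1, 1)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_one_pass()

# Summarize hot ops and export a trace that Holistic Trace Analysis can ingest.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
prof.export_chrome_trace("trace.json")
```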
How much time does it take, and at what resolution? For me, these took 36 s (previous code) and 34.5 s (current code). I didn't try multiple runs, so those aren't averaged times.
My bad, I missed swapping the if statement's condition when I reversed the for loop's direction. For me the original code was taking about 30 seconds for the simple texture synthesis case and is now around 11 seconds on a 1080 Ti. I haven't been testing the style transfer case, though (as is apparent from the error you just encountered). I guess I should write a little test suite...
Now, I'm also unable to push the resolution above 1024.
Hmmm, could you give the exact command you ran here? If I had to guess, I'd say it's related to rounding errors in the multi-resolution resizing code. Are you using a non-square image? For me, the following works fine on the current version:

python optex.py --style style/lava-small.jpg --content content/rocket.jpg --content_strength 0.5 --size 1448
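As a toy illustration of the kind of rounding issue suspected above (the scale schedule, image size, and helper here are made up, not taken from optex.py): independently rounding each level's size and upscaling an intermediate level by a fixed factor can disagree by a pixel for non-square inputs.

```python
def pyramid_sizes(height, width, levels=5):
    # Recompute each level directly from the target size (coarse to fine).
    sizes = []
    for i in range(levels):
        scale = 0.5 ** (levels - 1 - i)
        sizes.append((round(height * scale), round(width * scale)))
    return sizes

print(pyramid_sizes(966, 1448))

# If instead an intermediate result is upscaled by a fixed factor of 2, rounding
# can drift: 2 * round(966 * 0.25) == 484, but round(966 * 0.5) == 483, so later
# tensors end up one pixel off and shape checks or concatenations fail.
```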
Could be something wrong on my end if yours is working fine. I won't ping you again until I understand the whole paper and your code; I don't want to take up your time, and you might be busy elsewhere. :) Thanks for your help, btw.
Hi @JCBrouwer, I've been playing with your code. It's really good!
The only issue is that it doesn't seem to be optimized for the GPU: average GPU utilization is only around 40%.
Do you have any suggestions for optimizing it to run better on the GPU?
Regards,
Rahul Bhalley
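Regarding the ~40% utilization figure quoted above, here is one rough way such a number can be sampled while the script runs (this assumes the pynvml package is installed, and is not necessarily how the figure in this issue was measured):

```python
import time
import torch

# Sample instantaneous GPU utilization once per second; torch.cuda.utilization()
# queries NVML under the hood, so pynvml must be installed.
samples = []
for _ in range(30):
    samples.append(torch.cuda.utilization())  # percent, 0-100
    time.sleep(1.0)

print(f"average GPU utilization: {sum(samples) / len(samples):.1f}%")
```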