Shader Execution Reordering
1.1. Shader Execution Reordering: Nvidia Tackles Divergence
1.2. Improve Shader Performance and In-Game Frame Rates with Shader Execution Reordering
1.3. Whitepaper
Subchannel switch
2.1. Advanced API Performance: Async Compute and Overlap
2.2. New Releases of NVIDIA Nsight Systems and Nsight Graphics Debut at SIGGRAPH 2022
Ray Tracing
3.2. Tips and Tricks: Ray Tracing Best Practices (2019)
3.3. Best Practices: Using NVIDIA RTX Ray Tracing (2020)
3.4. Tips: Acceleration Structure Compaction (2021)
Mesh Shader
4.1. Advanced API Performance: Mesh Shaders (2021)
Best Practices
5.1. Tips and Tricks: Vulkan Dos and Don’ts (2019)
5.2. Measuring the GPU Occupancy of Multi-stream Workloads (2024)
Advanced API Performance
6.1. Memory and Resources (2021)
6.2. Async Copy (2021)
6.3. Barriers (2021)
6.4. Command Buffers (2021)
6.5. Async Compute and Overlap (2021)
6.6. Vulkan Clearing and Presenting (2022)
6.7. Clears (2022)
6.8. Variable Rate Shading (2022)
6.9. CPUs (2023)
6.10. Pipeline State Objects (2023)
6.11. Shaders (2023)
6.12. Synchronization (2023)
6.13. Sampler Feedback (2023)
6.14. Debugging (2023)
6.15. Descriptors (2023)
6.16. Intrinsics (2023)
6.17. Swap Chains (2023)
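- Prefer signed integers where possible, since this can be faster. In C-like languages such as CUDA C++, signed overflow is undefined behavior, so the compiler can optimize more aggressively with signed arithmetic than with unsigned arithmetic, whose wraparound it must preserve. [3]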
- Constant/vertex/index buffers are faster with direct binds (A4Engine)
- In some cases the compiler uses the FP16 units to implement a MOV, e.g. it moves a number into a register by multiplying with zero and adding the value or constant it wants to place there. [nv forum]
- A warp scheduler in a modern GPU can schedule 2 instructions per cycle (using different pipelines). [3?]
- NVIDIA uses separate FP32 and FP64 units in its Streaming Multiprocessors (Pascal, Turing, Ampere). [?]
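
The signed-integer note above can be illustrated with a small sketch. It is plain host-side C++ rather than shader code, on the assumption that the cited advice refers to C-like semantics where signed overflow is undefined and unsigned arithmetic wraps. The function names (`sum_strided_signed`, `sum_strided_unsigned`, `wrapped_minus_one`) are hypothetical and only for illustration. Whether the signed version actually compiles to faster code depends on the compiler and the target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Signed counter: overflow of `i * stride` would be undefined behavior, so the
// compiler may assume it never happens and is free to, for example, replace
// the multiply with a running offset (strength reduction).
float sum_strided_signed(const std::vector<float>& data, int count, int stride) {
    assert(count >= 0 && stride > 0);
    float sum = 0.0f;
    for (int i = 0; i < count; ++i) {
        sum += data[static_cast<std::size_t>(i * stride)];
    }
    return sum;
}

// Unsigned counter: `i * stride` is defined to wrap modulo 2^N, so any loop
// transformation has to keep that wraparound behavior intact.
float sum_strided_unsigned(const std::vector<float>& data, unsigned count, unsigned stride) {
    assert(stride > 0u);
    float sum = 0.0f;
    for (unsigned i = 0; i < count; ++i) {
        sum += data[i * stride];
    }
    return sum;
}

// Unsigned wraparound is well defined: 0 - 1 wraps to the largest unsigned value.
constexpr unsigned wrapped_minus_one() {
    return 0u - 1u;
}
```

For in-range inputs both functions return the same sum. The difference is only in what the compiler is allowed to assume about the index expression, which is where the potential speedup comes from.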