References

Subchannel switch
2.1. Advanced API Performance: Async Compute and Overlap
2.2. New Releases of NVIDIA Nsight Systems and Nsight Graphics Debut at SIGGRAPH 2022

Ray Tracing
3.2. Tips and Tricks: Ray Tracing Best Practices (2019)
3.3. Best Practices: Using NVIDIA RTX Ray Tracing (2020)
3.4. Tips: Acceleration Structure Compaction (2021)

Mesh Shader
4.1. Advanced API Performance: Mesh Shaders (2021)

Best Practices
5.1. Tips and Tricks: Vulkan Dos and Don’ts (2019)
5.2. Measuring the GPU Occupancy of Multi-stream Workloads (2024)

Notes

Prefer signed integers where possible; this can be faster, because the compiler can optimize signed arithmetic more aggressively than unsigned arithmetic. [3]
Constant, vertex, and index buffers are faster with direct binds. (A4Engine)
In some cases the compiler uses the FP16 units to implement MOV (e.g. moving a number into a register by multiplying by zero and adding the value or constant it wants to move there). [nv forum]
A warp scheduler in a modern GPU can schedule 2 instructions per cycle (using different pipelines). [3?]
NVIDIA has been using separate FP32 and FP64 units in its Streaming Multiprocessors (Pascal, Turing, Ampere). [?]
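
The signed-integer note above can be illustrated with a small sketch. It is written in plain C++ rather than shader code, and it assumes the note refers to the C++ and CUDA C++ rule that signed overflow is undefined while unsigned overflow wraps. The function names (gather_signed, gather_unsigned, gather_variants_agree) are hypothetical and exist only for this illustration. Whether the signed version is actually faster depends on the compiler and target, so treat this as the reasoning behind the note, not a measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Signed counter: overflow of `offset + stride * i` is undefined behavior,
// so the compiler may assume it never happens and is free to replace the
// per-iteration multiply with a running index (strength reduction).
void gather_signed(const float* in, float* out, int n, int offset, int stride) {
    for (int i = 0; i < n; ++i) {
        out[i] = in[offset + stride * i];
    }
}

// Unsigned counter: wraparound is well defined (modulo 2^32 for a 32-bit
// unsigned), so the compiler has to preserve that behavior, which can
// block the same optimization.
void gather_unsigned(const float* in, float* out,
                     unsigned n, unsigned offset, unsigned stride) {
    for (unsigned i = 0; i < n; ++i) {
        out[i] = in[offset + stride * i];
    }
}

// Hypothetical helper: runs both variants on in-range inputs and reports
// whether they produce identical output (they should; only the freedom
// given to the optimizer differs).
bool gather_variants_agree(int n, int offset, int stride) {
    assert(n > 0 && offset >= 0 && stride >= 0);
    const std::size_t in_size =
        static_cast<std::size_t>(offset) +
        static_cast<std::size_t>(stride) * static_cast<std::size_t>(n - 1) + 1;
    std::vector<float> in(in_size);
    for (std::size_t k = 0; k < in_size; ++k) {
        in[k] = static_cast<float>(k);
    }
    std::vector<float> a(static_cast<std::size_t>(n));
    std::vector<float> b(static_cast<std::size_t>(n));
    gather_signed(in.data(), a.data(), n, offset, stride);
    gather_unsigned(in.data(), b.data(), static_cast<unsigned>(n),
                    static_cast<unsigned>(offset), static_cast<unsigned>(stride));
    return a == b;
}
```

Both variants return the same values for in-range indices; the difference is only in what the optimizer is allowed to assume. Comparing the generated assembly for the two loops is the practical way to check whether a given compiler takes advantage of it.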