gpu arch
Katzeee committed Aug 27, 2024
1 parent 8d27115 commit ff93150
Showing 8 changed files with 169 additions and 11 deletions.
164 changes: 156 additions & 8 deletions _posts/Cg/2024-03-14-gpu-architechture.adoc
Expand Up @@ -3,11 +3,11 @@
:page-category: Cg
:page-tags: [cg, gpu]

References:
== Desktop GPU

> https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline
NOTE: See NVIDIA's official article for more detail.footnote:1[https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline]

== GPU Composition
=== GPU Composition

* TPC: Texture Processor Cluster
* SM: Streaming Multiprocessor
Expand All @@ -23,7 +23,7 @@ References:
* RT Core
* Tensor Core: for tensor, matrix computation

== How shader code runs on the `Fermi` architecture
=== How shader code runs on the `Fermi` architecture

*1. From CPU to GPU*

Expand All @@ -47,6 +47,10 @@ One warp may excute many times to finish(store the intermediate result then load
[.text-center]
image::/assets/images/2024-03-15-vs-step.png[]

The SFU (Special Function Unit) handles operations such as sin, cos, and sqrt.

The LD/ST (load/store) units load data such as uniform variables.

*3. Rasterizer*

Triangles are clipped and transformed in the `Viewport Transform` module, then rasterized into pixels.
Expand Down Expand Up @@ -78,19 +82,163 @@ Finally, the data will passed to `ROP` for depth test, blend or something.

NOTE: Operations on the depth and color data must be atomic.

== GPU Context in Core
=== Vector vs Scalar

Scalar: perform one scalar operation on behalf of all threads in a work group, to save cycles.

One GPU core can be abstracted as a `Fetch/Decode Module`, some ``ALU``s, and some ``Context``s. The ``ALU``s are responsible for executing instructions, and the ``Context``s hold the execution state used by the ``ALU``s.
image::/assets/images/2024-08-27-vector-vs-scalar.png[]

=== Context

One GPU core can be abstracted as a `Fetch/Decode Module`, some ``ALUs``, and some ``Contexts``. The ``ALUs`` are responsible for executing instructions, and the ``Contexts`` hold the execution state used by the ``ALUs``.

One instruction is executed by one `ALU` in one `Context`.

If an instruction is time-consuming, the scheduler can let the `ALU` execute in another context to avoid blocking.
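
The effect of switching contexts can be sketched with a toy model. This is only an illustration under simplifying assumptions: every context alternates between a fixed number of ALU cycles and a fixed stall (for example a texture fetch), and switching is free. The function name and the cycle counts are hypothetical and do not describe any specific GPU.

```python
def alu_utilization(num_contexts, compute_cycles, stall_cycles):
    """Fraction of time the ALU is busy in a toy latency-hiding model.

    Assumption: each context runs `compute_cycles` of ALU work, then waits
    `stall_cycles` on a slow instruction. While one context waits, the
    scheduler runs another one at no switching cost.
    """
    period = compute_cycles + stall_cycles
    # The ALU cannot be more than 100% busy, so clamp at 1.0.
    return min(1.0, num_contexts * compute_cycles / period)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "context(s):", alu_utilization(n, compute_cycles=4, stall_cycles=12))
```

In this model a single context keeps the ALU busy only a quarter of the time, while four contexts are enough to cover the stall completely.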

== Optimization
*1. What is in a Context*

* General Purpose Registers(GPR)

* Local variables, Varyings

*2. What we should do*

* Use fewer GPRs per shader so that more warps can reside on one SM, which hides more latency when fetching textures.

* Divide one complex pass into several simpler passes.

* Sampling textures may use more GPRs.
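
The relationship between GPR usage and resident warps can be sketched as below. The register file size, warp size, and warp limit are hypothetical defaults chosen for illustration, not figures taken from this post or from a specific architecture.

```python
def resident_warps(gprs_per_thread, regfile_size=65536, warp_size=32, max_warps=64):
    """Estimate how many warps fit on one SM when limited by registers.

    All default values are illustrative assumptions:
    - regfile_size: number of registers available on the SM
    - warp_size: threads per warp
    - max_warps: scheduler limit on resident warps
    """
    regs_per_warp = gprs_per_thread * warp_size
    by_registers = regfile_size // regs_per_warp
    return min(max_warps, by_registers)


if __name__ == "__main__":
    for gprs in (32, 64, 128):
        print(gprs, "GPRs per thread ->", resident_warps(gprs), "warps")
```

Under these assumptions, doubling the GPRs used by a shader halves the number of warps available to cover a texture fetch.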

=== Optimization

* Use custom geometry instancing in place of Unity's static batching and dynamic batching: static batching merges meshes and increases VBO memory, while dynamic batching causes heavy CPU load.

* Decrease the number of vertices and triangles to reduce the cost of the VS, the PS, and data storage. 3D objects should use LOD.

* Avoid transferring data to the GPU every frame. In Unity, use GPU particles instead of CPU particles. Avoid large numbers of transparent particles, which cause overdraw.

* Avoid setting and querying render state, such as setting shader properties in `Update()`, because the CPU communicates with the GPU through `MMIO`.

* *Enable mipmaps to reduce texture cache misses.*
* Avoid excessively small triangles, which can cause overdraw. Imagine a small triangle at the center of four `2x2` pixel tiles that covers only the center 4 pixels: all four tiles must be shaded for this triangle, and the results for the uncovered pixels are masked out.

* Avoid excessively small triangles, which can cause overdraw. Imagine a small triangle at the center of four `2x2` pixel tiles that covers only the center 4 pixels: all four tiles must be shaded for this triangle, and the results for the uncovered pixels are masked out.
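
The small-triangle case can be counted with a short sketch. It assumes that pixel shading is issued per `2x2` tile and that a tile is fully shaded as soon as any of its pixels is covered; the helper name is hypothetical.

```python
def quad_shading_cost(covered_pixels):
    """Return (covered, shaded) pixel counts under 2x2 tile shading.

    Assumption: every 2x2 tile touched by at least one covered pixel
    shades all four of its pixels, and the extra results are masked out.
    """
    tiles = {(x // 2, y // 2) for x, y in covered_pixels}
    return len(covered_pixels), 4 * len(tiles)


if __name__ == "__main__":
    # Center 4 pixels of a 4x4 block: they straddle four 2x2 tiles.
    centered = [(1, 1), (2, 1), (1, 2), (2, 2)]
    # The same 4 pixels aligned to a single 2x2 tile.
    aligned = [(0, 0), (1, 0), (0, 1), (1, 1)]
    print(quad_shading_cost(centered))
    print(quad_shading_cost(aligned))
```

In the centered case, 16 pixels are shaded to produce 4 visible ones, which is the waste the bullet above describes.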

== Mobile GPU

=== Types

* Mali

* Adreno (from AMD Imageon)

* PowerVR

NOTE: High power consumption = heat = low FPS. The GPU and memory are the most power-hungry parts.

=== Difference from Desktop

Low clock frequency, large number of ALUs

Bandwidth optimization: tile-based rendering

* Desktop GPUs use Immediate Mode Rendering (IMR)
+
--
image::/assets/images/2024-08-27-GPU-IMR.png[]
--

* Mali: tile-based rendering (TBR)
+
--
Vertex shader -> Store positions and varyings to memory -> Load from memory into local tile memory -> Fragment shader

image::/assets/images/2024-08-27-mali-TBR.png[]

16 x 16 Tile
--

* PowerVR: tile-based deferred rendering (TBDR)
+
--
Vertex shader -> Store positions and varyings to memory -> Tile-based hidden surface removal (HSR), so only visible pixels are drawn (zero overdraw) -> Fragment shader

image::/assets/images/2024-08-27-powervr-tbdr.png[]

32 x 32 Tile
--

* Adreno
+
--
**Only transform position** -> For every tile get triangle visibility list -> Vertex shader -> Fragment shader

image::/assets/images/2024-08-27-adreno-tbr.png[]

Large tiles, held in `GMEM`

--
** Optimization:
+
--
Store position data in a separate buffer; if a vertex is culled, the GPU will not fetch the rest of its vertex data.
--
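
The per-tile triangle lists used by the tile-based pipelines above can be sketched as a simple binning pass. This is a conservative bounding-box version written for illustration only; real hardware binning is more precise and is not described in this post. The function name and screen dimensions are hypothetical.

```python
import math


def bin_triangles(triangles, tile_size, width, height):
    """Map each tile (tx, ty) to the indices of triangles that may touch it.

    Assumption: a triangle is assigned to every tile overlapped by its
    screen-space bounding box, which may include tiles it does not cover.
    """
    max_tx = (width - 1) // tile_size
    max_ty = (height - 1) // tile_size
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the bounding box to the screen before converting to tiles.
        tx0 = max(math.floor(min(xs)) // tile_size, 0)
        ty0 = max(math.floor(min(ys)) // tile_size, 0)
        tx1 = min(math.floor(max(xs)) // tile_size, max_tx)
        ty1 = min(math.floor(max(ys)) // tile_size, max_ty)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


if __name__ == "__main__":
    tris = [((1, 1), (5, 1), (1, 5)), ((10, 10), (20, 10), (10, 20))]
    for tile, indices in sorted(bin_triangles(tris, 16, 32, 32).items()):
        print(tile, indices)
```

Because only positions are needed to build these lists, the sketch also shows why a position-only first pass is sufficient before the full vertex shader runs.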

=== TBR vs TBDR

* Mali and Adreno
+
--
Blending is cheap; these GPUs behave like a tiled IMR.
--

* PowerVR
+
--
Blending is expensive, because a blend flushes HSR.

`discard` needs to write z to HSR, which is costly.
--

=== How to use on-chip memory

* OpenGL
+
--
EXT_shader_pixel_local_storage
ARM_shader_framebuffer_fetch
ARM_shader_framebuffer_fetch_depth_stencil
--

* Metal
+
--
Memoryless
Imageblocks (A11 and later)
--

* Vulkan
+
--
Subpass
--

* Use
+
--
Depth
Tonemapping
Programmable blending (physically correct colored glass)
Deferred rendering
--

=== Tools

* Snapdragon Profiler
* Xcode Instruments
* Mali Offline Compiler
* PowerVR Shader Editor

=== Optimization

*
12 changes: 11 additions & 1 deletion _posts/Cg/2024-08-06-advanced-shadow.adoc
Expand Up @@ -11,14 +11,18 @@

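=== Poisson-Sampled PCF

Besides sampling over the 3x3 region described above, a Poisson-disk distribution can also be used for sampling to obtain better soft shadows.
Besides sampling over the 3x3 region described above, a Poisson-disk distribution can also be used for sampling to obtain better soft shadows. However, it introduces some artifacts.

=== Hardware PCF

Hardware PCF works on a 2x2 region: with the texture sampler's compare func set and linear sampling enabled, the hardware performs bilinearly interpolated sampling and directly computes the interpolated shadow weight for the current pixel.

Four hardware PCF fetches can also be blended manually to obtain a 3x3 (the four sample centers offset by 0.5 texel) or 4x4 (the four sample centers offset by 1.0 texel) hardware PCF.footnote:4[阴影的PCF采样优化算法 (PCF sampling optimization for shadows) https://zhuanlan.zhihu.com/p/369761748]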
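
What a single hardware PCF fetch computes can be emulated on the CPU as a rough sketch. It assumes texel centers at half-integer coordinates, clamp-to-edge addressing, and a "less or equal" compare func; the function name and the list-of-lists shadow map are hypothetical stand-ins for a real sampler and texture.

```python
import math


def hardware_pcf(shadow_map, u, v, ref_depth):
    """Emulate one compare-and-filter fetch at texel-space position (u, v).

    The four nearest texels are each compared with `ref_depth` first, and the
    0/1 results are then blended with bilinear weights.
    """
    height = len(shadow_map)
    width = len(shadow_map[0])
    # Shift so that integer coordinates land on texel centers.
    x = u - 0.5
    y = v - 0.5
    x0 = math.floor(x)
    y0 = math.floor(y)
    fx = x - x0
    fy = y - y0

    def lit(ix, iy):
        # Clamp-to-edge addressing (an assumption of this sketch).
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return 1.0 if ref_depth <= shadow_map[iy][ix] else 0.0

    top = lit(x0, y0) * (1 - fx) + lit(x0 + 1, y0) * fx
    bottom = lit(x0, y0 + 1) * (1 - fx) + lit(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy


if __name__ == "__main__":
    depth = [[0.2, 0.8],
             [0.2, 0.8]]
    print(hardware_pcf(depth, 1.0, 1.0, 0.5))
```

The key point of the sketch is the order of operations: the comparison happens per texel before filtering, so the result is a shadow weight rather than an interpolated depth.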

=== The Witness

The Witness provides a faster and simpler filtering method for obtaining PCF weights.footnote:9[Shadow Mapping Summary http://the-witness.net/news/2013/09/shadow-mapping-summary-part-1/]

=== Continuous-Sampling PCF

==== Theory
Expand Down Expand Up @@ -275,3 +279,9 @@ image::/assets/images/2024-08-12-csm-camera-rotation-shimmering.gif[]
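WARNING: Always handle the shimmering caused by translation first. If you handle the rotation shimmering first (by fixing the bounding-box size), the shimmering will still not be eliminated because of fractional precision issues.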
--
== VSM
Problems with VSM:
Rasterized writes to UAV-format textures perform poorly and depend on the cache.
4 changes: 2 additions & 2 deletions _posts/Game/2024-05-06-unity-UGUI.adoc
Expand Up @@ -94,11 +94,11 @@ If you want to write a component dealing with click event, implement a class der

=== Script Part

. When `Graphic` itself find it need to be rebuilded, they will register itself to `CanvasUpdateRegistry`. Then every frame when `Canvas` rendering UI, it calls `Canvas::willRenderCanvases()` which is an event calling `CanvasUpdateRegistry::PerformUpdate()`.
. When a `Graphic` finds that it needs to be rebuilt, it registers itself with `CanvasUpdateRegistry`. Then, every frame, when the `Canvas` renders the UI, it raises `Canvas::willRenderCanvases()`, an event that calls `CanvasUpdateRegistry::PerformUpdate()`.

. All renderable components (Image, Text, RawImage, etc.) are derived from `Graphic`. `Graphic` implements the `Rebuild()` function used to draw the UI elements, which is called by `CanvasUpdateRegistry::PerformUpdate()`.

. `Graphic` deals with mesh(`UpdateGeometry()`) and material(`UpdateMaterial()`) data then passes it to `CanvasRenderer` which finally become instructions for real rendering. `Canvas` handles ``CanvasRenderer``s, batches them and generates render instructions.
. `Graphic` processes mesh (`UpdateGeometry()`) and material (`UpdateMaterial()`) data, then passes it to `CanvasRenderer`, where it finally becomes instructions for actual rendering. `Canvas` manages the ``CanvasRenderers``, batches them, and generates render instructions.
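
The register-then-rebuild flow described in these steps can be sketched as a minimal dirty-queue pattern. This is not Unity's implementation: the classes below only borrow the names `Graphic` and `CanvasUpdateRegistry` from the text, and the method names `set_dirty`, `register_for_rebuild`, and `perform_update` are hypothetical.

```python
class CanvasUpdateRegistry:
    """Collects graphics that asked for a rebuild and rebuilds them once."""

    def __init__(self):
        self._queue = []

    def register_for_rebuild(self, graphic):
        # A graphic that is already queued is not queued twice.
        if graphic not in self._queue:
            self._queue.append(graphic)

    def perform_update(self):
        # Swap the queue out first so rebuilds can safely re-register.
        pending, self._queue = self._queue, []
        for graphic in pending:
            graphic.rebuild()


class Graphic:
    """Stand-in for a renderable UI element."""

    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.rebuild_count = 0

    def set_dirty(self):
        self.registry.register_for_rebuild(self)

    def rebuild(self):
        # A real element would update geometry and material here.
        self.rebuild_count += 1


if __name__ == "__main__":
    registry = CanvasUpdateRegistry()
    image = Graphic("image", registry)
    image.set_dirty()
    image.set_dirty()          # second request in the same frame is ignored
    registry.perform_update()  # stands in for the per-frame canvas event
    print(image.rebuild_count)
```

The design point is that marking an element dirty is cheap and can happen many times per frame, while the expensive rebuild runs at most once per element when the per-frame event fires.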

=== Engine Part(based on Unity 4.x)

Expand Down
Binary file added assets/images/2024-08-27-GPU-IMR.png
Binary file added assets/images/2024-08-27-adreno-tbr.png
Binary file added assets/images/2024-08-27-mali-TBR.png
Binary file added assets/images/2024-08-27-powervr-tbdr.png
Binary file added assets/images/2024-08-27-vector-vs-scalar.png
