ARM-Mali-Valhall.md

Valhall is the fourth generation of the Mali GPU architecture.

Content:

Valhall Gen1

Examples

  • Mali-G57
  • Mali-G77

References

1.1. Arm's New Mali-G77 & Valhall GPU Architecture
1.2. Arm Mali-G77 Performance Counters Reference Guide, [backup]

Notes

  • Static branching or early termination is not supported. [az]
  • 2 work queues: non-fragment, fragment. [1.2]

Valhall Gen2

Examples

  • Mali-G68
  • Mali-G78

References

2.1. Arm Announces The Mali-G78 GPU
2.2. Mali-G78 Performance Counters Reference Guide, [backup]

Valhall Gen3

Examples

  • Mali-G310
  • Mali-G510
  • Mali-G610
  • Mali-G710

References

3.1. Arm Mali-G610 Performance Counters Reference Guide, [backup]
3.2. Arm Announces New Mali-G710, G610, G510 & G310 Mobile GPU Families
3.3. Mali-G510

Notes

  • G610, G710 L2 cache: configurable from 512 KB to 2 MB, as 2 or 4 slices of 256 KB or 512 KB.

  • Scalability: 7 to 16 cores

  • 3 hardware work queues: compute, vertex, fragment.

  • Arm Fixed Rate Compression (AFRC), 4x4 block lossy compression for textures and framebuffer. [3.3]
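
The L2 size range in the notes above follows from the slice options. A minimal sketch of that arithmetic, assuming every slice-count and slice-size pairing is allowed (the notes do not say whether all four combinations appear in real designs); the function name is illustrative only:

```python
from itertools import product

SLICE_COUNTS = (2, 4)        # slice counts listed in the notes
SLICE_SIZES_KB = (256, 512)  # slice sizes listed in the notes


def l2_configs_kb():
    """Return {(slices, slice_kb): total_kb} for every pairing.

    Assumption: all pairings are valid; this is not confirmed by the notes.
    """
    return {
        (count, size): count * size
        for count, size in product(SLICE_COUNTS, SLICE_SIZES_KB)
    }


configs = l2_configs_kb()
for (count, size), total in sorted(configs.items()):
    print(f"{count} x {size} KB = {total} KB")
```

The smallest and largest totals are 512 KB and 2048 KB, which matches the quoted 512 KB to 2 MB range.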

Valhall Gen4

Examples

  • Mali-G615
  • Mali-G715
  • Immortalis-G715

References

4.1. Arm Mali-G615 Performance Counters Reference Guide, [backup]
4.2. The Valhall Shader Core, [backup]

Notes

  • The FMA and CVT pipelines are 16-wide; the SFU pipeline is 4-wide and runs at one quarter of the throughput of the other two. [4.2]

  • Valhall maintains native support for int8, int16, and fp16 data types. These data types can be packed using SIMD instructions to fill each 32-bit data processing lane. This arrangement maintains the power efficiency and performance that is provided by the types that are narrower than 32-bits. [4.2]

  • A single 16-wide warp maths unit can therefore perform 32x fp16/int16 operations per clock cycle, or 64x int8 operations per clock cycle. [4.2]

  • LSU: [4.2]

    • 64-byte cache line.
    • 16KB L1 data cache per core
    • Warp unit accesses are optimized to reduce unique cache access requests. Data can be returned in a single cycle if all threads access data inside the same cache line.
  • The varying unit can interpolate 32 bits for every thread in a warp. [4.2]

  • Variable rate shading (VRS).
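
The packed-type throughput figures and the LSU cache-line note above can be checked with some simple arithmetic. This is a rough sketch, not a model of the hardware: it assumes narrow types pack evenly into 32-bit lanes and that the only cost that matters for a warp load is the number of distinct 64-byte lines touched. The function names and the example address patterns are made up for illustration.

```python
LANES = 16             # width of the FMA and CVT pipelines (one warp)
LANE_BITS = 32         # each data processing lane is 32 bits wide
CACHE_LINE_BYTES = 64  # LSU cache line size from the notes


def ops_per_clock(type_bits, lanes=LANES):
    """Operations per clock when values of type_bits are packed into 32-bit lanes."""
    return lanes * (LANE_BITS // type_bits)


def cache_lines_touched(byte_addresses):
    """Count the distinct 64-byte cache lines covered by a warp's accesses."""
    return len({addr // CACHE_LINE_BYTES for addr in byte_addresses})


print(ops_per_clock(32))  # fp32: one value per lane
print(ops_per_clock(16))  # fp16/int16: two values per lane
print(ops_per_clock(8))   # int8: four values per lane

# Hypothetical access patterns for the 16 threads of one warp.
coalesced = [4 * t for t in range(LANES)]   # consecutive 4-byte words
strided = [64 * t for t in range(LANES)]    # one 64-byte line per thread
print(cache_lines_touched(coalesced), cache_lines_touched(strided))
```

The consecutive pattern stays inside one cache line, which is the case the notes describe as returnable in a single cycle; the strided pattern touches 16 separate lines.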

Valhall (all gens)

References

  1. Instruction Set Architecture, [backup]
  2. Mesa driver details

Notes

  • Scalar instruction set.

  • 16 threads per warp. [1.2, 4.2]

  • Fragment Task with 32x32 pixels region. [1.2]

  • AFBC (v1.3) with 4x4 block.

  • MSAA: 4x, 8x, 16x

  • Transaction Elimination with 16x16 pixel block size.

  • All Valhall GPU cores implement a 4 texel-per-clock and 2 pixel-per-clock shader core.

  • Mali Valhall GPU shader cores allow variable numbers of threads to be created, depending on the number of work registers that are used by the in-flight shader programs.

    • 0-32 registers - Maximum thread capacity
    • 33-64 registers - Half thread capacity
  • A Valhall core can perform 32 FP32 FMAs, read 4 bilinear filtered texture samples, blend 2 fragments, and write 2 pixels per clock. [4.2]

  • Each Processing Engine (PE) executes the programmable shader instructions. [4.2]

  • Each PE includes 3 arithmetic processing pipelines: [4.2]

    • FMA pipeline which is used for complex maths operations
    • CVT pipeline which is used for simple maths operations
    • SFU pipeline which is used for special functions
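
The register-count rule and the per-core FMA rate in the notes above lend themselves to a quick back-of-the-envelope calculation. A minimal sketch, assuming an FMA counts as two floating-point operations; the core count and clock in the example are hypothetical values, not figures for any specific product, and the function names are illustrative:

```python
def thread_capacity_fraction(work_registers):
    """Fraction of maximum thread capacity for a given work register count.

    Follows the two tiers in the notes: 0-32 registers give full capacity,
    33-64 registers give half capacity.
    """
    if not 0 <= work_registers <= 64:
        raise ValueError("work register count must be between 0 and 64")
    return 1.0 if work_registers <= 32 else 0.5


def peak_fp32_gflops(cores, clock_mhz, fmas_per_clock=32):
    """Theoretical peak FP32 GFLOPS, counting one FMA as 2 operations."""
    return cores * fmas_per_clock * 2 * clock_mhz / 1000.0


print(thread_capacity_fraction(32))  # full capacity
print(thread_capacity_fraction(33))  # half capacity

# Hypothetical configuration: 10 cores at 850 MHz.
print(peak_fp32_gflops(cores=10, clock_mhz=850))
```

A shader that needs just one register more than 32 halves the number of threads a core can keep in flight, so trimming register pressure near that boundary is often worth checking.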

Subgroup threads order

Result of Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize ) in a fragment shader; gl_SubgroupSize: 16, tile size: 32x32.

Unique subgroups; image size: 32x32, gl_SubgroupSize: 16. Each subgroup in a tile is scheduled as quads (2x2 pixels). A quad may be placed anywhere inside the 32x32 pixel tile, but quads are often placed inside an 8x8 region.

Result of Rainbow( gl_SubgroupInvocationID / gl_SubgroupSize ) in a compute shader; gl_SubgroupSize: 16, workgroup size: 8x8.
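
The definition of Rainbow() is not given here. As a rough illustration of how such a visualization can assign one color per invocation, here is a Python sketch that sweeps hue over the normalized invocation index. The `rainbow` function below is an assumed stand-in, not the shader helper used for the images.

```python
import colorsys

SUBGROUP_SIZE = 16  # gl_SubgroupSize reported for Valhall in these notes


def rainbow(t):
    """Map t in [0, 1] to an RGB triple by sweeping hue (assumed mapping)."""
    t = min(max(t, 0.0), 1.0)
    # Stop the sweep at hue 5/6 (magenta) so the top of the range
    # does not wrap back to red.
    return colorsys.hsv_to_rgb(t * 5.0 / 6.0, 1.0, 1.0)


# One color per invocation: gl_SubgroupInvocationID / gl_SubgroupSize.
colors = [rainbow(i / SUBGROUP_SIZE) for i in range(SUBGROUP_SIZE)]
for i, (r, g, b) in enumerate(colors):
    print(f"invocation {i:2d}: ({r:.2f}, {g:.2f}, {b:.2f})")
```

With a mapping like this, each of the 16 invocations gets a distinct color, so the layout of threads inside a tile or workgroup can be read directly from the image.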