diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
index 05b5b8c01..56098b9c7 100644
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -1,7 +1,7 @@
 * @koomie @coleramos425
 
 # Documentation files
-docs/* @ROCm/rocm-documentation
+docs/ @ROCm/rocm-documentation
 *.md @ROCm/rocm-documentation
 *.rst @ROCm/rocm-documentation
 .readthedocs.yaml @ROCm/rocm-documentation
diff --git a/.gitignore b/.gitignore
index 38a7fd3d8..3b3a34d40 100644
--- a/.gitignore
+++ b/.gitignore
@@ -23,3 +23,4 @@ VERSION.sha
 # documentation artifacts
 /_build
 _toc.yml
+
diff --git a/docs/concept/command-processor.rst b/docs/concept/command-processor.rst
index 6759832fa..1ec6373ae 100644
--- a/docs/concept/command-processor.rst
+++ b/docs/concept/command-processor.rst
@@ -2,95 +2,148 @@
 Command processor (CP)
 **********************
 
-The command processor (CP) is responsible for interacting with the AMDGPU Kernel
-Driver (a.k.a., the Linux Kernel) on the CPU and
-for interacting with user-space HSA clients when they submit commands to
-HSA queues. Basic tasks of the CP include reading commands (e.g.,
-corresponding to a kernel launch) out of `HSA
-Queues `__
-(Sec. 2.5), scheduling work to subsequent parts of the scheduler
-pipeline, and marking kernels complete for synchronization events on the
-host.
-
-The command processor is composed of two sub-components:
-
-- Fetcher (CPF): Fetches commands out of memory to hand them over to
-  the CPC for processing
-- Packet Processor (CPC): The micro-controller running the command
-  processing firmware that decodes the fetched commands, and (for
-  kernels) passes them to the `Workgroup Processors `__ for
-  scheduling
-
-Before scheduling work to the accelerator, the command-processor can
-first acquire a memory fence to ensure system consistency `(Sec
-2.6.4) `__.
-After the work is complete, the command-processor can apply a
-memory-release fence. Depending on the AMD CDNA accelerator under
-question, either of these operations *may* initiate a cache write-back
-or invalidation.
+The command processor (CP) is responsible for interacting with the AMDGPU kernel
+driver -- part of the Linux kernel -- on the CPU and for interacting with
+user-space HSA clients when they submit commands to HSA queues. Basic tasks of
+the CP include reading commands (such as those corresponding to a kernel launch)
+out of :hsa-runtime-pdf:`HSA queues <68>`, scheduling work to subsequent parts
+of the scheduler pipeline, and marking kernels complete for synchronization
+events on the host.
+
+The command processor consists of two sub-components:
+
+* :ref:`Fetcher ` (CPF): Fetches commands out of memory to hand
+  them over to the CPC for processing.
+
+* :ref:`Packet processor ` (CPC): Micro-controller running the
+  command processing firmware that decodes the fetched commands and (for
+  kernels) passes them to the :ref:`workgroup processors ` for
+  scheduling.
+
+Before scheduling work to the accelerator, the command processor can
+first acquire a memory fence to ensure system consistency
+(:hsa-runtime-pdf:`Section 2.6.4 <91>`). After the work is complete, the
+command processor can apply a memory-release fence. Depending on the AMD CDNA
+accelerator in question, either of these operations *might* initiate a cache
+write-back or invalidation.
 
 Analyzing command processor performance is most interesting for kernels
-that the user suspects to be scheduling/launch-rate limited. The command
-processor’s metrics therefore are focused on reporting, e.g.:
+that you suspect to be limited by scheduling or launch rate.
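+For example, a program that submits a long stream of very small kernels can
+become launch-rate limited, because the CP must fetch, decode, and retire a
+packet for every dispatch. The following minimal HIP sketch -- with a
+hypothetical ``tiny_kernel`` -- is one such pattern you might profile:
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   // A kernel that does almost no work, so the cost of each dispatch is
+   // dominated by packet handling on the command processor.
+   __global__ void tiny_kernel(float* x) {
+     x[threadIdx.x] += 1.0f;
+   }
+
+   int main() {
+     float* x = nullptr;
+     (void)hipMalloc(reinterpret_cast<void**>(&x), 64 * sizeof(float));
+     // Thousands of back-to-back, single-workgroup launches: the CPF/CPC
+     // packet-handling rate, not the compute units, limits throughput here.
+     for (int i = 0; i < 10000; ++i) {
+       tiny_kernel<<<dim3(1), dim3(64), 0, 0>>>(x);
+     }
+     (void)hipDeviceSynchronize();
+     (void)hipFree(x);
+     return 0;
+   }
+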
+The command processor’s metrics therefore focus on reporting, for example:
+
+* Utilization of the fetcher
+
+* Utilization of the packet processor, including packet decoding and processing
+
+* Stalls in fetching and processing
 
-- Utilization of the fetcher
-- Utilization of the packet processor, and decoding processing packets
-- Fetch/processing stalls
+.. _cpf-metrics:
 
 Command Processor Fetcher (CPF) metrics
 =======================================
 
 .. list-table::
    :header-rows: 1
-   :widths: 20 65 15
 
    * - Metric
+
      - Description
+
      - Unit
+
    * - CPF Utilization
-     - Percent of total cycles where the CPF was busy actively doing any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
+
+     - Percent of total cycles where the CPF was busy actively doing any work.
+       The ratio of CPF busy cycles over total cycles counted by the CPF.
+
+     - Percent
+
    * - CPF Stall
+
+     - Percent of CPF busy cycles where the CPF was stalled for any reason.
+
+     - Percent
+
    * - CPF-L2 Utilization
-     - Percent of total cycles counted by the CPF-[L2](L2) interface where the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles over total cycles counted by the CPF-L2.
+
+     - Percent of total cycles counted by the CPF-:doc:`L2 ` interface
+       where the CPF-L2 interface was active doing any work. The ratio of CPF-L2
+       busy cycles over total cycles counted by the CPF-L2.
+
+     - Percent
+
    * - CPF-L2 Stall
+
+     - Percent of CPF-L2 busy cycles where the CPF-:doc:`L2 `
+       interface was stalled for any reason.
+
+     - Percent
+
    * - CPF-UTCL1 Stall
+
+     - Percent of CPF busy cycles where the CPF was stalled by address
+       translation.
+
+     - Percent
+
+.. _cpc-metrics:
+
 Command Processor Packet Processor (CPC) metrics
 ================================================
 
 .. list-table::
    :header-rows: 1
-   :widths: 20 65 15
 
    * - Metric
+
      - Description
+
      - Unit
+
    * - CPC Utilization
-     - Percent of total cycles where the CPC was busy actively doing any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
+
+     - Percent of total cycles where the CPC was busy actively doing any work.
+       The ratio of CPC busy cycles over total cycles counted by the CPC.
+
+     - Percent
+
    * - CPC Stall
+
+     - Percent of CPC busy cycles where the CPC was stalled for any reason.
+
+     - Percent
+
    * - CPC Packet Decoding Utilization
+
+     - Percent of CPC busy cycles spent decoding commands for processing.
+
+     - Percent
+
    * - CPC-Workgroup Manager Utilization
-     - Percent of CPC busy cycles spent dispatching workgroups to the [Workgroup Manager](SPI).
+
+     - Percent of CPC busy cycles spent dispatching workgroups to the
+       :ref:`workgroup manager `.
+
+     - Percent
+
    * - CPC-L2 Utilization
-     - Percent of total cycles counted by the CPC-[L2](L2) interface where the CPC-L2 interface was active doing any work.
+
+     - Percent of total cycles counted by the CPC-:doc:`L2 ` interface
+       where the CPC-L2 interface was active doing any work.
+
+     - Percent
+
    * - CPC-UTCL1 Stall
-     - Percent of CPC busy cycles where the CPC was stalled by address translation.
+
+     - Percent of CPC busy cycles where the CPC was stalled by address
+       translation.
+
+     - Percent
+
    * - CPC-UTCL2 Utilization
-     - Percent of total cycles counted by the CPC's L2 address translation interface where the CPC was busy doing address translation work.
+
+     - Percent of total cycles counted by the CPC's L2 address translation
+       interface where the CPC was busy doing address translation work.
+
+     - Percent
+
diff --git a/docs/concept/compute-unit.rst b/docs/concept/compute-unit.rst
index 2ab859a1b..db1666d33 100644
--- a/docs/concept/compute-unit.rst
+++ b/docs/concept/compute-unit.rst
@@ -3,8 +3,8 @@ Compute unit (CU)
 *****************
 
 The compute unit (CU) is responsible for executing a user's kernels on
-CDNA-based accelerators. All :ref:`wavefronts` of a :ref:`workgroup` are
-scheduled on the same CU.
+CDNA-based accelerators. All :ref:`wavefronts ` of a
+:ref:`workgroup ` are scheduled on the same CU.
 
 .. image:: ../data/performance-model/gcn_compute_unit.png
    :alt: AMD CDNA accelerator compute unit diagram
@@ -17,35 +17,38 @@ The CU consists of several independent execution pipelines and functional units.
    executing much of the computational work on CDNA accelerators, including
    but not limited to floating-point operations (FLOPs) and integer operations
    (IOPs).
+
 * The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
   atomic operations that interact with the memory system.
+
 * The :ref:`desc-salu` is shared by all threads in a
-  [wavefront](wavefront), and is responsible for executing instructions that are
-  known to be uniform across the wavefront at compile-time. The SALU has a
-  memory unit (SMEM) for interacting with memory, but it cannot issue separately
-  from the SALU.
-* The :ref:`desc-lds` is an on-CU software-managed scratchpad memory
+  :ref:`wavefront `, and is responsible for executing
+  instructions that are known to be uniform across the wavefront at compile
+  time. The SALU has a memory unit (SMEM) for interacting with memory, but it
+  cannot issue separately from the SALU.
+
+* The :doc:`local-data-share` is an on-CU software-managed scratchpad memory
   that can be used to efficiently share data between all threads in a
-  [workgroup](workgroup).
-* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions for all
-  the [wavefronts](wavefront) on the compute unit.
-* The *vector L1 data cache (vL1D)* is the first level cache local to the
-  compute unit. On current CDNA accelerators, the vL1D is write-through. The
-  vL1D caches from multiple compute units are kept coherent with one another
-  through software instructions.
+  :ref:`workgroup `.
+
+* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions
+  for all the :ref:`wavefronts ` on the compute unit.
+
+* The :doc:`vector L1 data cache (vL1D) ` is the first level
+  cache local to the compute unit. On current CDNA accelerators, the vL1D is
+  write-through. The vL1D caches from multiple compute units are kept coherent
+  with one another through software instructions.
+
 * CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
   specialized matrix-multiplication accelerator pipelines known as the
   :ref:`desc-mfma`.
 
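+As a rough illustration of how these pipelines share a kernel's work, the
+hedged HIP sketch below (a hypothetical ``scale_row`` kernel, assumed to be
+launched with 64 work-item workgroups) annotates the unit that typically
+executes each statement; the precise mapping is chosen by the compiler:
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   __global__ void scale_row(const float* in, float* out, float alpha) {
+     // Uniform across the wavefront: typically evaluated on the SALU.
+     const int row_base = blockIdx.x * blockDim.x;
+
+     // Workgroup-shared staging buffer: allocated in the LDS.
+     __shared__ float stage[64];
+
+     // Per-work-item address arithmetic executes on the VALU; the load and
+     // store themselves are VMEM operations that pass through the vL1D.
+     const int i = row_base + threadIdx.x;
+     stage[threadIdx.x] = in[i];           // VMEM load, then LDS write
+     __syncthreads();                      // synchronization via a barrier
+     out[i] = alpha * stage[threadIdx.x];  // LDS read, VALU multiply, VMEM store
+   }
+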
 For a more in-depth description of a compute unit on a CDNA accelerator, see
-:hip-training-2019:`22` and :gcn-crash-course:`27`.
+:hip-training-pdf:`22` and :gcn-crash-course:`27`.
 
 :ref:`pipeline-desc` details the various
-execution pipelines (VALU, SALU, LDS, Scheduler, etc.). The metrics
+execution pipelines -- VALU, SALU, LDS, scheduler, and so forth. The metrics
 presented by Omniperf for these pipelines are described in
-:ref:`pipeline-metrics`. Finally, the `vL1D `__ cache and
-:ref:`LDS ` will be described their own sections.
-
-.. include:: ./includes/pipeline-descriptions.rst
+:doc:`pipeline-metrics`. The :doc:`vL1D ` cache and
+:doc:`LDS ` are described in their own chapters.
-.. include:: ./includes/pipeline-metrics.rst
diff --git a/docs/concept/definitions.rst b/docs/concept/definitions.rst
new file mode 100644
index 000000000..127ef19f1
--- /dev/null
+++ b/docs/concept/definitions.rst
@@ -0,0 +1,109 @@
+.. meta::
+   :description: Omniperf terminology and definitions
+   :keywords: Omniperf, ROCm, glossary, definitions, terms, profiler, tool,
+              Instinct, accelerator, AMD
+
+***********
+Definitions
+***********
+
+The following table briefly defines some terminology used in Omniperf interfaces
+and in this documentation.
+
+.. include:: ./includes/terms.rst
+
+.. include:: ./includes/normalization-units.rst
+
+.. _memory-spaces:
+
+Memory spaces
+=============
+
+AMD Instinct MI accelerators can access memory through multiple address spaces
+which may map to different physical memory locations on the system. The
+following table provides a view into how various types of memory used
+in HIP map onto these constructs:
+
+.. list-table::
+   :header-rows: 1
+
+   * - LLVM Address Space
+     - Hardware Memory Space
+     - HIP Terminology
+
+   * - Generic
+     - Flat
+     - N/A
+
+   * - Global
+     - Global
+     - Global
+
+   * - Local
+     - LDS
+     - LDS/Shared
+
+   * - Private
+     - Scratch
+     - Private
+
+   * - Constant
+     - Same as global
+     - Constant
+
+The following is a high-level description of the address spaces in the AMDGPU
+backend of LLVM:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Address space
+     - Description
+
+   * - Global
+     - Memory that can be seen by all threads in a process, and may be backed by
+       the local accelerator's HBM, a remote accelerator's HBM, or the CPU's
+       DRAM.
+
+   * - Local
+     - Memory that is only visible to a particular workgroup. On AMD's Instinct
+       accelerator hardware, this is stored in :doc:`LDS `
+       memory.
+
+   * - Private
+     - Memory that is only visible to a particular work-item (thread), stored
+       in the scratch space on AMD's Instinct accelerators.
+
+   * - Constant
+     - Read-only memory that is in the global address space and stored on the
+       local accelerator's HBM.
+
+   * - Generic
+     - Used when the compiler cannot statically prove that a pointer is
+       addressing memory in a single (non-generic) address space. Mapped to Flat
+       on AMD's Instinct accelerators, the pointer could dynamically address
+       global, local, private or constant memory.
+
+`LLVM's documentation for AMDGPU Backend <https://llvm.org/docs/AMDGPUUsage.html>`_
+has the most up-to-date information. Refer to this source for a more complete
+explanation.
+
+.. _memory-type:
+
+Memory type
+===========
+
+AMD Instinct accelerators contain a number of different memory allocation
+types to enable the HIP language's
+:doc:`memory coherency model `.
+These memory types are broadly similar between AMD Instinct accelerator
+generations, but may differ in exact implementation.
+
+In addition, these memory types *might* differ between accelerators on the same
+system, even when accessing the same memory allocation.
+
+For example, an :ref:`MI2XX ` accelerator accessing *fine-grained*
+memory allocated local to that device may see the allocation as coherently
+cacheable, while a remote accelerator might see the same allocation as
+*uncached*.
+
diff --git a/docs/concept/glossary.rst b/docs/concept/glossary.rst
deleted file mode 100644
index 39f5680a9..000000000
--- a/docs/concept/glossary.rst
+++ /dev/null
@@ -1,225 +0,0 @@
-..
meta:: - :description: Omniperf documentation and reference - :keywords: Omniperf, ROCm, glossary, definitions, terms, profiler, tool, - Instinct, accelerator, AMD - -******** -Glossary -******** - -The following table briefly defines some terminology used in Omniperf interfaces -and in this documentation. - -.. list-table:: - :header-rows: 1 - - * - Name - - Description - - Unit - - * - Kernel time - - The number of seconds the accelerator was executing a kernel, from the - :ref:`command processor `'s (CP) start-of-kernel - timestamp (a number of cycles after the CP beings processing the packet) - to the CP's end-of-kernel timestamp (a number of cycles before the CP - stops processing the packet). - - Seconds - - * - Kernel cycles - - The number of cycles the accelerator was active doing *any* work, as - measured by the :ref:`command processor ` (CP). - - Cycles - - * - Total CU cycles - - The number of cycles the accelerator was active doing *any* work - (that is, kernel cycles), multiplied by the number of - :doc:`compute units ` on the accelerator. A - measure of the total possible active cycles the compute units could be - doing work, useful for the normalization of metrics inside the CU. - - Cycles - - * - Total active CU cycles - - The number of cycles a CU on the accelerator was active doing *any* - work, summed over all :ref:`compute units ` on the - accelerator. - - Cycles - - * - Total SIMD cycles - - The number of cycles the accelerator was active doing *any* work (that - is, kernel cycles), multiplied by the number of - :ref:`SIMDs ` on the accelerator. A measure of the - total possible active cycles the SIMDs could be doing work, useful for - the normalization of metrics inside the CU. - - Cycles - - * - Total L2 cycles - - The number of cycles the accelerator was active doing *any* work (that - is, kernel cycles), multiplied by the number of :ref:`L2 ` - channels on the accelerator. A measure of the total possible active - cycles the L2 channels could be doing work, useful for the normalization - of metrics inside the L2. - - Cycles - - * - Total active L2 cycles - - The number of cycles a channel of the L2 cache was active doing *any* - work, summed over all :ref:`L2 ` channels on the accelerator. - - Cycles - - * - Total sL1D cycles - - The number of cycles the accelerator was active doing *any* work (that - is, kernel cycles), multiplied by the number of - :ref:`scalar L1 data caches ` on the accelerator. A measure of - the total possible active cycles the sL1Ds could be doing work, useful - for the normalization of metrics inside the sL1D. - - Cycles - - * - Total L1I cycles - - The number of cycles the accelerator was active doing *any* work (that - is, kernel cycles), multiplied by the number of - :ref:`L1 instruction caches ` (L1I) on the accelerator. A - measure of the total possible active cycles the L1Is could be doing - work, useful for the normalization of metrics inside the L1I. - - Cycles - - * - Total scheduler-pipe cycles - - The number of cycles the accelerator was active doing *any* work (that - is, kernel cycles), multiplied by the number of - :ref:`scheduler pipes ` on the accelerator. A measure of the - total possible active cycles the scheduler-pipes could be doing work, - useful for the normalization of metrics inside the - :ref:`workgroup manager ` and :ref:`command processor `. 
- - Cycles - - * - Total shader-engine cycles - - The total number of cycles the accelerator was active doing *any* work, - multiplied by the number of :ref:`shader engines ` on the - accelerator. A measure of the total possible active cycles the shader - engines could be doing work, useful for the normalization of - metrics inside the :ref:`workgroup manager `. - - Cycles - - * - Thread-requests - - The number of unique memory addresses accessed by a single memory - instruction. On AMD Instinct accelerators, this has a maximum of 64 - (that is, the size of the :ref:`wavefront `). - - Addresses - - * - Work-item - - A single *thread*, or lane, of execution that executes in lockstep with - the rest of the work-items comprising a :ref:`wavefront ` - of execution. - - N/A - - * - Wavefront - - A group of work-items, or threads, that execute in lockstep on the - :ref:`compute unit `. On AMD Instinct accelerators, the - wavefront size is always 64 work-items. - - N/A - - * - Workgroup - - A group of wavefronts that execute on the same - :ref:`compute unit `, and can cooperatively execute and share - data via the use of synchronization primitives, :ref:`LDS `, - atomics, and others. - - N/A - - * - Divergence - - Divergence within a wavefront occurs when not all work-items are active - when executing an instruction, that is, due to non-uniform control flow - within a wavefront. Can reduce execution efficiency by causing, - for instance, the :ref:`VALU ` to need to execute both - branches of a conditional with different sets of work-items active. - - N/A - -.. include:: ./includes/normalization-units.rst - -.. _memory-spaces: - -Memory spaces -============= - -AMD Instinct MI accelerators can access memory through multiple address spaces -which may map to different physical memory locations on the system. The -[table below](mspace-table) provides a view of how various types of memory used -in HIP map onto these constructs: - -.. list-table:: - :header-rows: 1 - - * - LLVM Address Space - - Hardware Memory Space - - HIP Terminology - - * - Generic - - Flat - - N/A - - * - Global - - Global - - Global - - * - Local - - LDS - - LDS/Shared - - * - Private - - Scratch - - Private - - * - Constant - - Same as global - - Constant - -Below is a high-level description of the address spaces in the AMDGPU backend -of LLVM: - -.. list-table:: - :header-rows: 1 - - * - Address space - - Description - - * - Global - - Memory that can be seen by all threads in a process, and may be backed by - the local accelerator's HBM, a remote accelerator's HBM, or the CPU's - DRAM. - - * - Local - - Memory that is only visible to a particular workgroup. On AMD's Instinct - accelerator hardware, this is stored in [LDS](LDS) memory. - - * - Private - - Memory that is only visible to a particular [work-item](workitem) - (thread), stored in the scratch space on AMD's Instinct(tm) accelerators. - - * - Constant - - Read-only memory that is in the global address space and stored on the - local accelerator's HBM. - - * - Generic - - Used when the compiler cannot statically prove that a pointer is - addressing memory in a single (non-generic) address space. Mapped to Flat - on AMD's Instinct(tm) accelerators, the pointer could dynamically address - global, local, private or constant memory. - -`LLVM's documentation for AMDGPU Backend ` -will always have the most up-to-date information, and the interested reader is -referred to this source for a more complete explanation. - -.. 
_memory-type: - -Memory type -=========== - -AMD Instinct accelerators contain a number of different memory allocation -types to enable the HIP language's -:doc:`memory coherency model `. -These memory types are broadly similar between AMD Instinct accelerator -generations, but may differ in exact implementation. - -In addition, these memory types *might* differ between accelerators on the same -system, even when accessing the same memory allocation. - -For example, an :ref:`MI2XX ` accelerator accessing "fine-grained" -memory allocated local to that device may see the allocation as coherently -cacheable, while a remote accelerator might see the same allocation as uncached. diff --git a/docs/concept/includes/compute-unit.rst b/docs/concept/includes/compute-unit.rst deleted file mode 100644 index 8b1378917..000000000 --- a/docs/concept/includes/compute-unit.rst +++ /dev/null @@ -1 +0,0 @@ - diff --git a/docs/concept/includes/l2-cache.rst b/docs/concept/includes/l2-cache.rst deleted file mode 100644 index e69de29bb..000000000 diff --git a/docs/concept/includes/normalization-units.rst b/docs/concept/includes/normalization-units.rst index 396248214..57b7abfe0 100644 --- a/docs/concept/includes/normalization-units.rst +++ b/docs/concept/includes/normalization-units.rst @@ -12,17 +12,17 @@ include: * - Name - Description + * - ``per_wave`` + - The total value of the measured counter or metric that occurred per + kernel invocation divided by the total number of + :ref:`wavefronts ` launched in the kernel. + * - ``per_cycle`` - The total value of the measured counter or metric that occurred per kernel invocation divided by the - :ref:`kernel cycles `, that is, the total number of + :ref:`kernel cycles `, that is, the total number of cycles the kernel executed as measured by the - :ref:`command processor `. - - * - ``per_wave`` - - The total value of the measured counter or metric that occurred per - kernel invocation divided by the total number of - :ref:`wavefronts ` launched in the kernel. + :doc:`command processor `. * - ``per_kernel`` - The total value of the measured counter or metric that occurred per @@ -30,15 +30,16 @@ include: * - ``per_second`` - The total value of the measured counter or metric that occurred per - kernel invocation divided by the :ref:`kernel time `, + kernel invocation divided by the :ref:`kernel time `, that is, the total runtime of the kernel in seconds, as measured by the - :ref:`command processor `. + :doc:`command processor `. By default, Omniperf uses the ``per_wave`` normalization. The ideal normalization varies depending on your use case. For instance, a ``per_second`` normalization might be useful for FLOP or bandwidth comparisons, while a ``per_wave`` normalization could be useful to see how many -(and what types) of instructions are used per wavefront; a ``per_kernel`` -normalization may be useful to get the total aggregate values of metrics for +(and what types) of instructions are used per wavefront. A ``per_kernel`` +normalization can be useful to get the total aggregate values of metrics for comparison between different configurations. + diff --git a/docs/concept/includes/pipeline-descriptions.rst b/docs/concept/includes/pipeline-descriptions.rst deleted file mode 100644 index 89367c1bb..000000000 --- a/docs/concept/includes/pipeline-descriptions.rst +++ /dev/null @@ -1,232 +0,0 @@ -.. _pipeline-desc: - -Pipeline descriptions -===================== - -.. 
_desc-valu: - -Vector Arithmetic Logic Unit (VALU) ------------------------------------ - -The vector arithmetic logic unit (VALU) executes vector instructions -over an entire wavefront, each `work-item `__ (or, -vector-lane) potentially operating on distinct data. The VALU of a CDNA -accelerator or GCN GPU typically consists of: - -* Four 16-wide SIMD processors (see :hip-training-2019:`24` for more details). -* Four 64 or 128 KiB VGPR files (yielding a total of 256-512 KiB total - per CU), see :ref:`AGPRs ` for more detail. -* An instruction buffer (per-SIMD) that contains execution slots for up - to 8 wavefronts (for 32 total wavefront slots on each CU). -* A vector memory (VMEM) unit which transfers data between VGPRs and - memory; each work-item supplies its own memory address and supplies - or receives unique data. -* CDNA accelerators, such as the MI100 and MI2XX [#mi2xx]_, contain - additional - :amd-lab-note:`Matrix Fused Multiply-Add (MFMA) ` - unites. - -In order to support branching / conditionals, each wavefront in the VALU -has a distinct execution mask which determines which work-items in the -wavefront are active for the currently executing instruction. When -executing a VALU instruction, inactive work-items (according to the -current execution mask of the wavefront) do not execute the instruction -and are treated as no-ops. - -.. note:: - - On GCN GPUs and the CDNA MI100 accelerator, there are slots for up to 10 - wavefronts in the instruction buffer, but generally occupancy is limited by - other factors to 32 waves per :doc:`compute unit `. - On the CDNA2 MI2XX [#mi2xx]_ series accelerators, there are only 8 waveslots - per-SIMD. - -.. _desc-salu: - -Scalar Arithmetic Logic Unit (SALU) ------------------------------------ - -The scalar arithmetic logic unit (SALU) executes instructions that are -shared between all work-items in a wavefront. This includes control-flow -– such as if/else conditionals, branches and looping -– pointer arithmetic, loading common values, etc. - -The SALU consists of: - -- A scalar processor capable of various arithmetic, conditional, and - comparison (etc.) operations. See :mi200-isa-pdf:`Chapter 5. Scalar ALU Operations <35>` - of the CDNA2 Instruction Set Architecture (ISA) Reference Guide for more - detail. -- A 12.5 KiB Scalar General Purpose Register (SGPR) file -- A scalar memory (SMEM) unit which transfers data between SGPRs and - memory - -Data loaded by the SMEM can be cached in the `scalar L1 data -cache `__, and is typically only used for read-only, uniform -accesses such as kernel arguments, or HIP’s ``__constant__`` memory. - -.. _desc-lds: - -Local data share (LDS) ----------------------- - -.. _perf-model-branch: - -The local data share (LDS, a.k.a., "shared memory") is fast on-CU scratchpad -that can be explicitly managed by software to effectively share data and to -coordinate between wavefronts in a workgroup. - -.. figure:: ../data/performance-model/lds.* - :align: center - :alt: Performance model of the local data share (LDS) on AMD Instinct - accelerators - - Performance model of the local data share (LDS) on AMD Instinct MI-series - accelerators. - -Above is Omniperf's performance model of the LDS on CDNA accelerators (adapted from [GCN Architecture, by Mike Mantor](https://old.hotchips.org/wp-content/uploads/hc_archives/hc24/HC24-3-ManyCore/HC24.28.315-AMD.GCN.mantor_v1.pdf), slide 20). -The SIMDs in the [VALU](valu) are connected to the LDS in pairs (see above). 
-Only one SIMD per pair may issue an LDS instruction at a time, but both pairs may issue concurrently. - -On CDNA accelerators, the LDS contains 32 banks and each bank is 4B wide. -The LDS is designed such that each bank can be read from/written to/atomically updated every cycle, for a total throughput of 128B/clock :gcn-crash-course:`40`. - -On each of the two ports to the SIMDs, 64B can be sent in each direction per cycle. So, a single wavefront, coming from one of the 2 SIMDs in a pair, can only get back 64B/cycle (16 lanes per cycle). The input port is shared between data and address and this can affect achieved bandwidth for different data sizes. For example, a 64-wide store where each lane is sending a 4B value takes 8 cycles (50% peak bandwidth) while a 64-wide store where each lane is sending a 16B value takes 20 cycles (80% peak bandwidth). - -In addition, the LDS contains conflict-resolution hardware to detect and handle bank conflicts. -A bank conflict occurs when two (or more) work-items in a wavefront want to read, write, or atomically update different addresses that map to the same bank in the same cycle. -In this case, the conflict detection hardware will determine a new schedule such that the access is split into multiple cycles with no conflicts in any single cycle. - -When multiple work-items want to read from the same address within a bank, the result can be efficiently broadcasted -:gcn-crash-course:`41`. -Multiple work-items writing to the same address within a bank typically results undefined behavior in HIP and other languages, as the LDS will write the value from the last work-item as determined by the hardware scheduler -:gcn-crash-course:`41`. This behavior may be useful in the very specific case of storing a uniform value. - -Relatedly, an address conflict is defined as occurring when two (or more) work-items in a wavefront want to atomically update the same address on the same cycle. -As in a bank-conflict, this may cause additional cycles of work for the LDS operation to complete. - -.. _desc-branch: - -Branch ------- - -The branch unit is responsible for executing jumps and branches to execute -control flow operations. -Note that Branch operations are not used for execution mask updates, but only -for “whole wavefront” control-flow changes. - -.. _desc-scheduler: - -Scheduler ---------- - -The scheduler is responsible for arbitration and issue of instructions for all -the wavefronts currently executing on the :doc:`CU `. On every -clock cycle, the scheduler: - -* Considers waves from one of the SIMD units for execution, selected in a - round-robin fashion between the SIMDs in the compute unit -* Issues up to one instruction per wavefront on the selected SIMD -* Issues up to one instruction per each of the instruction categories among the waves on the selected SIMD: - * :ref:`VALU ` / :ref:`VMEM ` operations - * :ref:`SALU ` / SMEM operations - * :ref:`LDS ` - * :ref:`Branch ` operations - -This gives a maximum of five issued Instructions Per Cycle (IPC), per-SIMD, -per-CU ([AMD GPU HIP Training](https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf), [GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah)). - -On CDNA accelerators with [MFMA](mfma) instructions, these are issued via the -[VALU](valu). 
Some of them will execute on a separate functional unit and typically allow other [VALU](valu) operations to execute in their shadow (see the [MFMA](mfma) section for more detail). - -.. note:: - - The IPC model used by Omniperf omits the following two complications for clarity. - First, CDNA accelerators contain other execution units on the CU that are unused for compute applications. - Second, so-called "internal" instructions (see [Layla Mah's GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 29) are not issued to a functional unit, and can technically cause the maximum IPC to _exceed_ 5 instructions per-cycle in special (largely unrealistic) cases. - The latter issue is discussed in more detail in our ['internal' IPC](Internal_ipc) example. - -.. _desc-mfma: - -Matrix fused multiply-add (MFMA) --------------------------------- - -CDNA accelerators, such as the MI100 and `MI2XX <2xxnote>`__, contain -specialized hardware to accelerate matrix-matrix multiplications, also -known as Matrix Fused Multiply-Add (MFMA) operations. The exact -operation types and supported formats may vary by accelerator. The -reader is referred to the `AMD matrix -cores `__ -blog post on GPUOpen for a general discussion of these hardware units. -In addition, to explore the available MFMA instructions in-depth on -various AMD accelerators (including the CDNA line), we recommend the -`AMD Matrix Instruction -Calculator `__. - -.. code-block:: shell - :caption: Partial snapshot of the AMD Matrix Instruction Calculatoor Tool - - $ ./matrix_calculator.py –architecture cdna2 –instruction - v_mfma_f32_4x4x1f32 –detail-instruction Architecture: CDNA2 Instruction: - V_MFMA_F32_4X4X1F32 Encoding: VOP3P-MAI VOP3P Opcode: 0x42 VOP3P-MAI - Opcode: 0x2 Matrix Dimensions: M: 4 N: 4 K: 1 blocks: 16 Execution - statistics: FLOPs: 512 Execution cycles: 8 FLOPs/CU/cycle: 256 Can - co-execute with VALU: True VALU co-execution cycles possible: 4 Register - usage: GPRs required for A: 1 GPRs required for B: 1 GPRs required for - C: 4 GPRs required for D: 4 GPR alignment requirement: 8 bytes - -For the purposes of Omniperf, the MFMA unit is typically treated as a separate pipeline from the [VALU](valu), as other VALU instructions (along with other execution pipelines such as the SALU) can be issued during a portion of the total duration of an MFMA operation. - -.. note:: - -The exact details of VALU and MFMA operation co-execution vary by instruction, and can be explored in more detail via the: - -* 'Can co-execute with VALU' -* 'VALU co-execution cycles possible' - - fields in the [AMD Matrix Instruction Calculator](https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator#example-of-querying-instruction-information)'s detailed instruction information. - -Non-pipeline resources ----------------------- - -In this section, we describe a few resources that are not standalone -pipelines but are important for understanding performance optimization -on CDNA accelerators. - -.. _desc-barrier: - -Barrier -^^^^^^^ - -Barriers are resources on the compute-unit of a CDNA accelerator that -are used to implement synchronization primitives (e.g., HIP’s -``__syncthreads``). Barriers are allocated to any workgroup that -consists of more than a single wavefront. - -.. 
_desc-agprs - -Accumulation vector General-Purpose Registers (AGPRs) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -Accumulation vector General-Purpose Registers, or AGPRs, are special -resources that are accessible to a subset of instructions focused on -`MFMA `__ operations. These registers allow the `MFMA `__ -unit to access more than the normal maximum of 256 `architected Vector -General-Purpose Registers (i.e., VGPRs) `__ by having up to 256 in -the architected space and up to 256 in the accumulation space. -Traditional VALU instructions can only use VGPRs in the architected -space, and data can be moved to/from VGPRs↔AGPRs using specialized -instructions (``v_accvgpr_*``). These data movement instructions may be -used by the compiler to implement lower-cost register-spill/fills on -architectures with AGPRs. - -AGPRs are not available on all AMD Instinct(tm) accelerators. GCN GPUs, -such as the AMD Instinct(tm) MI50 had a 256 KiB VGPR file. The AMD -Instinct(tm) MI100 (CDNA) has a 2x256 KiB register file, where one half -is available as general-purpose VGPRs, and the other half is for matrix -math accumulation VGPRs (AGPRs). The AMD Instinct(tm) -`MI2XX <2xxnote>`__ (CDNA2) has a 512 KiB VGPR file per CU, where each -wave can dynamically request up to 256 KiB of VGPRs and an additional -256 KiB of AGPRs. For more detail, the reader is referred to the -`following -comment `__. diff --git a/docs/concept/includes/pipeline-metrics.rst b/docs/concept/includes/pipeline-metrics.rst deleted file mode 100644 index 2959895cd..000000000 --- a/docs/concept/includes/pipeline-metrics.rst +++ /dev/null @@ -1,1329 +0,0 @@ -.. _pipeline-metrics: - -Pipeline metrics -================ - -In this section, we describe the metrics available in Omniperf to analyze the -pipelines discussed in the :ref:`pipeline-desc`. - -.. _wavefront: - -Wavefront ---------- - -.. _wavefront-launch-stats: - -Wavefront launch stats -^^^^^^^^^^^^^^^^^^^^^^ - -The wavefront launch stats panel gives general information about the -kernel launch: - -.. list-table:: - :header-rows: 1 - :widths: 20 65 15 - - * - Metric - - Description - - Unit - * - Grid Size - - The total number of work-items (a.k.a "threads") launched as a part of the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied by the total workgroup (a.k.a "block") size. - - [Work-items](Workitem) - * - Workgroup Size - - The total number of work-items (a.k.a "threads") in each workgroup (a.k.a "block") launched as part of the kernel dispatch. In HIP, this is equivalent to the total block size. - - [Work-items](Workitem) - * - Total Wavefronts - - The total number of wavefronts launched as part of the kernel dispatch. On AMD Instinct(tm) CDNA accelerators and GCN GPUs, the wavefront size is always 64 work-items. Thus, the total number of wavefronts should be equivalent to the ceiling of Grid Size divided by 64. - - [Wavefronts](Wavefront) - * - Saved Wavefronts - - The total number of wavefronts saved at a context-save, see [cwsr_enable](https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr). - - [Wavefronts](Wavefront) - * - Restored Wavefronts - - The total number of wavefronts restored from a context-save, see [cwsr_enable](https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr). - - [Wavefronts](Wavefront) - * - VGPRs - - The number of architected vector general-purpose registers allocated for the kernel, see [VALU](valu). 
Note: this may not exactly match the number of VGPRs requested by the compiler due to allocation granularity. - - [VGPRs](valu) - * - AGPRs - - The number of accumulation vector general-purpose registers allocated for the kernel, see [AGPRs](agprs). Note: this may not exactly match the number of AGPRs requested by the compiler due to allocation granularity. - - [AGPRs](agprs) - * - SGPRs - - The number of scalar general-purpose registers allocated for the kernel, see [SALU](salu). Note: this may not exactly match the number of SGPRs requested by the compiler due to allocation granularity. - - [SGPRs](salu) - * - LDS Allocation - - The number of bytes of [LDS](lds) memory (a.k.a., "Shared" memory) allocated for this kernel. Note: This may also be larger than what was requested at compile-time due to both allocation granularity and dynamic per-dispatch LDS allocations. - - Bytes per [workgroup](workgroup) - * - Scratch Allocation - - The number of bytes of [scratch-memory](Mspace) requested _per_ work-item for this kernel. Scratch memory is used for stack memory on the accelerator, as well as for register spills/restores. - - Bytes per [work-item](workitem) - -.. _wavefront-runtime-stats: - -Wavefront Runtime Stats -^^^^^^^^^^^^^^^^^^^^^^^ - -The wavefront runtime statistics gives a high-level overview of the -execution of wavefronts in a kernel: - -.. list-table:: - :header-rows: 1 - :widths: 18 65 17 - - * - Metric - - Description - - Unit - * - [Kernel Time](KernelTime) - - The total duration of the executed kernel. Note: this should not be directly compared to the wavefront cycles / timings below. - - Nanoseconds - * - [Kernel Cycles](KernelCycles) - - The total duration of the executed kernel in cycles. Note: this should not be directly compared to the wavefront cycles / timings below. - - Cycles - * - Instructions per wavefront - - The average number of instructions (of all types) executed per wavefront. This is averaged over all wavefronts in a kernel dispatch. - - Instructions / wavefront - * - Wave Cycles - - The number of cycles a wavefront in the kernel dispatch spent resident on a compute unit per [normalization-unit](normunit). This is averaged over all wavefronts in a kernel dispatch. Note: this should not be directly compared to the kernel cycles above. - - Cycles per [normalization-unit](normunit) - * - Dependency Wait Cycles - - The number of cycles a wavefront in the kernel dispatch stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar memory, etc.) per [normalization-unit](normunit). This counter is incremented at every cycle by _all_ wavefronts on a CU stalled at a memory operation. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter because another wave could be actively executing while a wave is stalled. The sum of this metric, Issue Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. - - Cycles per [normalization-unit](normunit) - * - Issue Wait Cycles - - The number of cycles a wavefront in the kernel dispatch was unable to issue an instruction for any reason (e.g., execution pipe back-pressure, arbitration loss, etc.) per [normalization-unit](normunit). This counter is incremented at every cycle by _all_ wavefronts on a CU unable to issue an instruction. 
As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter because another wave could be actively executing while a wave is issue stalled. The sum of this metric, Dependency Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric. - - Cycles per [normalization-unit](normunit) - * - Active Cycles - - The average number of cycles a wavefront in the kernel dispatch was actively executing instructions per [normalization-unit](normunit). This measurement is made on a per-wavefront basis, and may include (e.g.,) cycles that another wavefront spent actively executing (e.g., on another execution unit) or was stalled. As such, it is most useful to get a sense of how waves were spending their time, rather than identification of a precise limiter. The sum of this metric, Issue Wait Cycles and Active Wait Cycles should be equal to the total Wave Cycles metric. - - Cycles per [normalization-unit](normunit) - * - Wavefront Occupancy - - The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (<< 1ms). - - Wavefronts - -.. code:: {seealso} - - As mentioned above, the measurement of kernel cycles and time typically cannot directly be compared to e.g., Wave Cycles. - This is due to two factors: first, the kernel cycles/timings are measured using a counter that is impacted by scheduling overhead, this is particularly noticeable for "short-running" kernels (typically << 1ms) where scheduling overhead forms a significant portion of the overall kernel runtime. - Secondly, the Wave Cycles metric is incremented per-wavefront scheduled to a SIMD every cycle whereas the kernel cycles counter is incremented only once per-cycle when _any_ wavefront is scheduled. - -.. _instruction-mix: - -Instruction Mix ---------------- - -The instruction mix panel shows a breakdown of the various types of -instructions executed by the user’s kernel, and which pipelines on the -`CU `__ they were executed on. In addition, Omniperf reports further -information about the breakdown of operation types for the -`VALU `__, vector-memory, and `MFMA `__ instructions. - -.. code:: {note} - - All metrics in this section count _instructions issued_, and _not_ the total number of operations executed. - The values reported by these metrics will not change regardless of the execution mask of the wavefront. - We note that even if the execution mask is identically zero (i.e., _no lanes are active_) the instruction will still be counted, as CDNA accelerators still consider these instructions 'issued' see, e.g., [EXECute Mask, Section 3.3 of the CDNA2 ISA Guide](https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf) for more details. - -Overall Instruction Mix -^^^^^^^^^^^^^^^^^^^^^^^ - -This panel shows the total number of each type of instruction issued to -the :ref:`various compute pipelines ` on the -:ref:`CU `. These are: - -.. list-table:: - :header-rows: 1 - :widths: 20 65 15 - - * - Metric - - Description - - Unit - * - [VALU](valu) Instructions - - The total number of vector arithmetic logic unit (VALU) operations issued. These are the workhorses of the compute-unit, and are used to execute wide range of instruction types including floating point operations, non-uniform address calculations, transcendental operations, integer operations, shifts, conditional evaluation, etc. 
- - Instructions - * - VMEM Instructions - - The total number of vector memory operations issued. These include most loads, stores and atomic operations and all accesses to [generic, global, private and texture](Mspace) memory. - - Instructions - * - [LDS](lds) Instructions - - The total number of LDS (a.k.a., "shared memory") operations issued. These include (e.g.,) loads, stores, atomics, and HIP's `__shfl` operations. - - Instructions - * - [MFMA](mfma) Instructions - - The total number of matrix fused multiply-add instructions issued. - - Instructions - * - [SALU](salu) Instructions - - The total number of scalar arithmetic logic unit (SALU) operations issued. Typically these are used for (e.g.,) address calculations, literal constants, and other operations that are _provably_ uniform across a wavefront. Although scalar memory (SMEM) operations are issued by the SALU, they are counted separately in this section. - - Instructions - * - SMEM Instructions - - The total number of scalar memory (SMEM) operations issued. These are typically used for loading kernel arguments, base-pointers and loads from HIP's `__constant__` memory. - - Instructions - * - [Branch](branch) Instructions - - The total number of branch operations issued. These typically consist of jump / branch operations and are used to implement control flow. - - Instructions - -.. code:: {note} - - Note, as mentioned in the [Branch](branch) section: branch operations are not used for execution mask updates, but only for "whole wavefront" control-flow changes. - -VALU Arithmetic Instruction Mix -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -.. code:: {warning} - - Not all metrics in this section (e.g., the floating-point instruction breakdowns) are available on CDNA accelerators older than the [MI2XX](2xxnote) series. - -This panel details the various types of vector instructions that were -issued to the `VALU `__. The metrics in this section do *not* -include `MFMA `__ instructions using the same precision, e.g. the -“F16-ADD” metric does not include any 16-bit floating point additions -executed as part of an MFMA instruction using the same precision. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 15 65 20 - :class: noscroll-table - * - Metric - - Description - - Unit - * - INT32 - - The total number of instructions operating on 32-bit integer operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - INT64 - - The total number of instructions operating on 64-bit integer operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F16-ADD - - The total number of addition instructions operating on 16-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F16-MUL - - The total number of multiplication instructions operating on 16-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F16-FMA - - The total number of fused multiply-add instructions operating on 16-bit floating-point operands issued to the VALU per [normalization-unit](normunit). 
- - Instructions per [normalization-unit](normunit) - * - F16-TRANS - - The total number of transcendental instructions (e.g., `sqrt`) operating on 16-bit floating-point operands issued to the VALU per [normalization-unit](normunit) - - Instructions per [normalization-unit](normunit) - * - F32-ADD - - The total number of addition instructions operating on 32-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F32-MUL - - The total number of multiplication instructions operating on 32-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F32-FMA - - The total number of fused multiply-add instructions operating on 32-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F32-TRANS - - The total number of transcendental instructions (e.g., `sqrt`) operating on 32-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F64-ADD - - The total number of addition instructions operating on 64-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F64-MUL - - The total number of multiplication instructions operating on 64-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F64-FMA - - The total number of fused multiply-add instructions operating on 64-bit floating-point operands issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - F64-TRANS - - The total number of transcendental instructions (e.g., `sqrt`) operating on 64-bit floating-point operands issued to the VALUper [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - Conversion - - The total number of type conversion instructions (e.g., converting data to/from F32↔F64) issued to the VALU per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - -For an example of these counters in action, the reader is referred to -the `VALU Arithmetic Instruction Mix example `__. - -VMEM Instruction Mix -^^^^^^^^^^^^^^^^^^^^ - -This section breaks down the types of vector memory (VMEM) instructions -that were issued. Refer to the `Instruction Counts metrics -section `__ of address-processor frontend of the vL1D cache for -a description of these VMEM instructions. - -(MFMA_Inst_mix)= ##### MFMA Instruction Mix - -.. code:: {warning} - - The metrics in this section are only available on CDNA2 ([MI2XX](2xxnote)) accelerators and newer. - -This section details the types of Matrix Fused Multiply-Add -(`MFMA `__) instructions that were issued. Note that -`MFMA `__ instructions are classified by the type of input data -they operate on, and *not* the data-type the result is accumulated to. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 25 60 17 - :class: noscroll-table - * - Metric - - Description - - Unit - * - MFMA-I8 Instructions - - The total number of 8-bit integer [MFMA](mfma) instructions issued per [normalization-unit](normunit). 
- - Instructions per [normalization-unit](normunit) - * - MFMA-F16 Instructions - - The total number of 16-bit floating point [MFMA](mfma) instructions issued per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - MFMA-BF16 Instructions - - The total number of 16-bit brain floating point [MFMA](mfma) instructions issued per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - MFMA-F32 Instructions - - The total number of 32-bit floating-point [MFMA](mfma) instructions issued per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - * - MFMA-F64 Instructions - - The total number of 64-bit floating-point [MFMA](mfma) instructions issued per [normalization-unit](normunit). - - Instructions per [normalization-unit](normunit) - -Compute Pipeline ----------------- - -FLOP counting conventions -^^^^^^^^^^^^^^^^^^^^^^^^^ - -Omniperf’s conventions for VALU FLOP counting are as follows: - Addition -or Multiplication: 1 operation - Transcendentals: 1 operation - Fused -Multiply-Add (FMA): 2 operations - -Integer operations (IOPs) do not use this convention. They are counted -as a single operation regardless of the instruction type. - -.. code:: {note} - - Packed operations which operate on multiple operands in the same instruction are counted identically to the underlying instruction type. - For example, the `v_pk_add_f32` instruction on [MI2XX](2xxnote), which performs an add operation on two pairs of aligned 32-bit floating-point operands is counted only as a single addition (i.e., 1 operation). - -As discussed in the `Instruction Mix `__ section, the FLOP/IOP -metrics in this section do not take into account the execution mask of -the operation, and will report the same value even if the execution mask -is identically zero. - -For example, a FMA instruction operating on 32-bit floating-point -operands (e.g., ``v_fma_f32`` on a `MI2XX <2xxnote>`__ accelerator) -would be counted as 128 total FLOPs: 2 operations (due to the -instruction type) multiplied by 64 operations (because the wavefront is -composed of 64 work-items). - -(Compute_SOL)= ##### Compute Speed-of-Light - -.. code:: {warning} - - The theoretical maximum throughput for some metrics in this section are currently computed with the maximum achievable clock frequency, as reported by `rocminfo`, for an accelerator. This may not be realistic for all workloads. - -This section reports the number of floating-point and integer operations -executed on the `VALU `__ and `MFMA `__ units in various -precisions. We note that unlike the `VALU instruction -mix `__ and `MFMA instruction mix `__ -sections, the metrics here are reported as FLOPs and IOPs, i.e., the -total number of operations executed. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 20 65 15 - :class: noscroll-table - * - Metric - - Description - - Unit - * - VALU FLOPs - - The total floating-point operations executed per second on the [VALU](valu). This is also presented as a percent of the peak theoretical FLOPs achievable on the specific accelerator. Note: this does not include any floating-point operations from [MFMA](mfma) instructions. - - GFLOPs - * - VALU IOPs - - The total integer operations executed per second on the [VALU](valu). This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from [MFMA](mfma) instructions. 
- - GIOPs - * - MFMA FLOPs (BF16) - - The total number of 16-bit brain floating point [MFMA](mfma) operations executed per second. Note: this does not include any 16-bit brain floating point operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator. - - GFLOPs - * - MFMA FLOPs (F16) - - The total number of 16-bit floating point [MFMA](mfma) operations executed per second. Note: this does not include any 16-bit floating point operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical F16 MFMA operations achievable on the specific accelerator. - - GFLOPs - * - MFMA FLOPs (F32) - - The total number of 32-bit floating point [MFMA](mfma) operations executed per second. Note: this does not include any 32-bit floating point operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical F32 MFMA operations achievable on the specific accelerator. - - GFLOPs - * - MFMA FLOPs (F64) - - The total number of 64-bit floating point [MFMA](mfma) operations executed per second. Note: this does not include any 64-bit floating point operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator. - - GFLOPs - * - MFMA IOPs (INT8) - - The total number of 8-bit integer [MFMA](mfma) operations executed per second. Note: this does not include any 8-bit integer operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator. - - GIOPs - -(Pipeline_stats)= ##### Pipeline Statistics - -This section reports a number of key performance characteristics of -various execution units on the `CU `__. The reader is referred to -the `Instructions per-cycle and Utilizations `__ example -for a detailed dive into these metrics, and the -`scheduler `__ for a high-level overview of execution units -and instruction issue. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 20 65 15 - :class: noscroll-table - * - Metric - - Description - - Unit - * - IPC - - The ratio of the total number of instructions executed on the [CU](cu) over the [total active CU cycles](TotalActiveCUCycles). - - Instructions per-cycle - * - IPC (Issued) - - The ratio of the total number of (non-[internal](Internal_ipc)) instructions issued over the number of cycles where the [scheduler](scheduler) was actively working on issuing instructions. The reader is recommended the [Issued IPC](Issued_ipc) example for further detail. - - Instructions per-cycle - * - SALU Utilization - - Indicates what percent of the kernel's duration the [SALU](salu) was busy executing instructions. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [SALU](salu) / [SMEM](salu) instructions over the [total CU cycles](TotalCUCycles). - - Percent - * - VALU Utilization - - Indicates what percent of the kernel's duration the [VALU](valu) was busy executing instructions. Does not include [VMEM](valu) operations. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [VALU](valu) instructions over the [total CU cycles](TotalCUCycles). 
-   * - VMEM Utilization
-     - Indicates what percent of the kernel's duration the [VMEM](valu) unit was busy executing instructions, including both global/generic and spill/scratch operations (see the [VMEM instruction count metrics](TA_inst) for more detail). Does not include [VALU](valu) operations. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [VMEM](valu) instructions over the [total CU cycles](TotalCUCycles).
-     - Percent
-   * - Branch Utilization
-     - Indicates what percent of the kernel's duration the [Branch](branch) unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [Branch](branch) instructions over the [total CU cycles](TotalCUCycles).
-     - Percent
-   * - VALU Active Threads
-     - Indicates the average level of [divergence](Divergence) within a wavefront over the lifetime of the kernel. The number of work-items that were active in a wavefront during execution of each [VALU](valu) instruction, time-averaged over all VALU instructions run on all wavefronts in the kernel.
-     - Work-items
-   * - MFMA Utilization
-     - Indicates what percent of the kernel's duration the [MFMA](mfma) unit was busy executing instructions. Computed as the ratio of the total number of cycles the [MFMA](mfma) unit was busy over the [total CU cycles](TotalCUCycles).
-     - Percent
-   * - MFMA Instruction Cycles
-     - The average duration of [MFMA](mfma) instructions in this kernel in cycles. Computed as the ratio of the total number of cycles the [MFMA](mfma) unit was busy over the total number of [MFMA](mfma) instructions. Compare to e.g., the [AMD Matrix Instruction Calculator](https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator).
-     - Cycles per instruction
-   * - VMEM Latency
-     - The average number of round-trip cycles (i.e., from issue to data-return / acknowledgment) required for a VMEM instruction to complete.
-     - Cycles
-   * - SMEM Latency
-     - The average number of round-trip cycles (i.e., from issue to data-return / acknowledgment) required for a SMEM instruction to complete.
-     - Cycles
-
-.. note::
-
-   The Branch utilization reported in this section also includes time spent in other instruction types (namely: `s_endpgm`) that are _typically_ a very small percentage of the overall kernel execution. This complication is omitted for simplicity, but may result in small amounts of "branch" utilization (<<1%) for otherwise branch-less kernels.
-
-(FLOPS)=
-
-Arithmetic Operations
-^^^^^^^^^^^^^^^^^^^^^
-
-This section reports the total number of floating-point and integer
-operations executed in various precisions. Unlike the `Compute
-speed-of-light `__ panel, this section reports both
-`VALU `__ and `MFMA `__ operations of the same precision
-(e.g., F32) in the same metric. Additionally, this panel lets the user
-control how the data is normalized (i.e., control the
-`normalization-unit `__), while the speed-of-light panel does
-not. For more detail on how operations are counted see the `FLOP
-counting convention `__ section.
-
-.. warning::
-
-   As discussed in the [Instruction Mix](Inst_Mix) section, the metrics in this section do not take into account the execution mask of the operation, and will report the same value even if EXEC is identically zero.
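-As a worked example of the counting convention referenced above, consider
-the following minimal HIP sketch (our own illustration; the kernel and its
-names are hypothetical and not part of Omniperf):
-
-.. code-block:: cpp
-
-   #include <hip/hip_runtime.h>
-
-   // Hypothetical kernel used only to illustrate FLOP counting.
-   __global__ void axpy(float a, const float* x, float* y, int n) {
-     int i = blockIdx.x * blockDim.x + threadIdx.x;
-     if (i < n) {
-       // One fused multiply-add per work-item: counted as 2 FLOPs.
-       // A 64-wide wavefront executing the resulting v_fma_f32 therefore
-       // contributes 2 x 64 = 128 FLOPs, regardless of the EXEC mask.
-       y[i] = fmaf(a, x[i], y[i]);
-     }
-   }
-
-Launched as a single 64-work-item wavefront, this kernel would be counted as
-128 VALU FLOPs under the convention above.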
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - FLOPs (Total)
-     - The total number of floating-point operations executed on either the [VALU](valu) or [MFMA](mfma) units, per [normalization-unit](normunit).
-     - FLOP per [normalization-unit](normunit)
-   * - IOPs (Total)
-     - The total number of integer operations executed on either the [VALU](valu) or [MFMA](mfma) units, per [normalization-unit](normunit).
-     - IOP per [normalization-unit](normunit)
-   * - F16 OPs
-     - The total number of 16-bit floating-point operations executed on either the [VALU](valu) or [MFMA](mfma) units, per [normalization-unit](normunit).
-     - FLOP per [normalization-unit](normunit)
-   * - BF16 OPs
-     - The total number of 16-bit brain floating-point operations executed on either the [VALU](valu) or [MFMA](mfma) units, per [normalization-unit](normunit). Note: on current CDNA accelerators, the [VALU](valu) has no native BF16 instructions.
-     - FLOP per [normalization-unit](normunit)
-   * - F32 OPs
-     - The total number of 32-bit floating-point operations executed on either the [VALU](valu) or [MFMA](mfma) units, per [normalization-unit](normunit).
-     - FLOP per [normalization-unit](normunit)
-   * - F64 OPs
-     - The total number of 64-bit floating-point operations executed on either the [VALU](valu) or [MFMA](mfma) units, per [normalization-unit](normunit).
-     - FLOP per [normalization-unit](normunit)
-   * - INT8 OPs
-     - The total number of 8-bit integer operations executed on either the [VALU](valu) or [MFMA](mfma) units, per [normalization-unit](normunit). Note: on current CDNA accelerators, the [VALU](valu) has no native INT8 instructions.
-     - IOPs per [normalization-unit](normunit)
-
-(LDS_metrics)=
-
-Local Data Share (LDS)
-----------------------
-
-LDS Speed-of-Light
-^^^^^^^^^^^^^^^^^^
-
-.. warning::
-
-   The theoretical maximum throughput for some metrics in this section is currently computed with the maximum achievable clock frequency, as reported by `rocminfo`, for an accelerator. This may not be realistic for all workloads.
-
-The LDS speed-of-light chart shows a number of key metrics for the
-`LDS `__ as a comparison with the peak achievable values of those
-metrics. The reader is referred to our previous `LDS `__
-description for a more in-depth view of the hardware.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Utilization
-     - Indicates what percent of the kernel's duration the [LDS](lds) was actively executing instructions (including, but not limited to, load, store, atomic and HIP's `__shfl` operations). Calculated as the ratio of the total number of cycles LDS was active over the [total CU cycles](TotalCUCycles).
-     - Percent
-   * - Access Rate
-     - Indicates the percentage of SIMDs in the [VALU](valu){sup}`1` actively issuing LDS instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [LDS](lds) instructions over the [total CU cycles](TotalCUCycles).
-     - Percent
-   * - Theoretical Bandwidth (% of Peak)
-     - Indicates the maximum amount of bytes that _could_ have been loaded from/stored to/atomically updated in the LDS in this kernel, as a percent of the peak LDS bandwidth achievable. See the [LDS Bandwidth example](lds_bandwidth) for more detail.
-     - Percent
-   * - Bank Conflict Rate
-     - Indicates the percentage of active LDS cycles that were spent servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts over the number of LDS cycles that would have been required to move the same amount of data in an uncontended access.{sup}`2`
-     - Percent
-.. note::
-
-   {sup}`1` Here we assume the typical case where the workload evenly distributes LDS operations over all SIMDs in a CU (that is, waves on different SIMDs are executing similar code). For highly unbalanced workloads, where e.g., one SIMD pair in the CU does not issue LDS instructions at all, this metric is better interpreted as the percentage of SIMDs issuing LDS instructions on [SIMD pairs](lds) that are actively using the LDS, averaged over the lifetime of the kernel.
-
-   {sup}`2` The maximum value of the bank conflict rate is less than 100% (specifically: 96.875%), as the first cycle in the [LDS scheduler](lds) is never considered contended. A short kernel sketch following the statistics table below illustrates a common conflict pattern and its fix.
-
-Statistics
-^^^^^^^^^^
-
-The `LDS `__ statistics panel gives a more detailed view of the
-hardware:
-
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - LDS Instructions
-     - The total number of LDS instructions (including, but not limited to, read/write/atomics, and e.g., HIP's `__shfl` instructions) executed per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Theoretical Bandwidth
-     - Indicates the maximum amount of bytes that could have been loaded from/stored to/atomically updated in the LDS per [normalization-unit](normunit). Does _not_ take into account the execution mask of the wavefront when the instruction was executed (see the [LDS Bandwidth](lds_bandwidth) example for more detail).
-     - Bytes per [normalization-unit](normunit)
-   * - LDS Latency
-     - The average number of round-trip cycles (i.e., from issue to data-return / acknowledgment) required for an LDS instruction to complete.
-     - Cycles
-   * - Bank Conflicts/Access
-     - The ratio of the number of cycles spent in the [LDS scheduler](lds) due to bank conflicts (as determined by the conflict resolution hardware) to the base number of cycles that would be spent in the LDS scheduler in a completely uncontended case. This is the unnormalized form of the Bank Conflict Rate.
-     - Conflicts/Access
-   * - Index Accesses
-     - The total number of cycles spent in the [LDS scheduler](lds) over all operations per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-   * - Atomic Return Cycles
-     - The total number of cycles spent on LDS atomics with return per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-   * - Bank Conflicts
-     - The total number of cycles spent in the [LDS scheduler](lds) due to bank conflicts (as determined by the conflict resolution hardware) per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-   * - Address Conflicts
-     - The total number of cycles spent in the [LDS scheduler](lds) due to address conflicts (as determined by the conflict resolution hardware) per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-   * - Unaligned Stall
-     - The total number of cycles spent in the [LDS scheduler](lds) due to stalls from non-dword aligned addresses per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-   * - Memory Violations
-     - The total number of out-of-bounds accesses made to the LDS, per [normalization-unit](normunit). This is unused and expected to be zero in most configurations for modern CDNA accelerators.
-     - Accesses per [normalization-unit](normunit)
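-As a concrete illustration of bank conflicts (a sketch of our own, assuming
-a 32-bank, 4B-wide LDS and a 32×32 thread block; not Omniperf code):
-
-.. code-block:: cpp
-
-   #include <hip/hip_runtime.h>
-
-   // 32x32 transpose tile. With a row stride of exactly 32 floats, the
-   // column-wise read below would map every lane of a wavefront-quarter to
-   // the same bank (index % 32 is constant across lanes) and show up in the
-   // Bank Conflict Rate metric; the +1 pad gives each lane a distinct bank.
-   __global__ void transpose_tile(const float* in, float* out) {
-     __shared__ float tile[32][32 + 1];
-     int x = threadIdx.x;  // 0..31
-     int y = threadIdx.y;  // 0..31
-     tile[y][x] = in[y * 32 + x];   // row-wise write: conflict-free
-     __syncthreads();
-     out[x * 32 + y] = tile[x][y];  // column-wise read: padded, conflict-free
-   }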
-(vL1D)=
-
-Vector L1 Cache (vL1D)
-----------------------
-
-The vector L1 data (vL1D) cache is local to each `compute unit `__
-on the accelerator, and handles vector memory operations issued by a
-wavefront. The vL1D cache consists of several components:
-
-- an address processing unit, also known as the `texture addresser (TA) `__, which receives commands (e.g., instructions) and write/atomic data from the `Compute Unit `__, and coalesces them into fewer requests for the cache to process.
-- an address translation unit, also known as the L1 Unified Translation Cache (UTCL1), that translates requests from virtual to physical addresses for lookup in the cache. The translation unit has an L1 translation lookaside buffer (L1TLB) to reduce the cost of repeated translations.
-- a Tag RAM that looks up whether a requested cache line is already present in the `cache `__.
-- the result of the Tag RAM lookup is placed in the L1 cache controller for routing to the correct location, e.g., the `L2 Memory Interface `__ for misses or the `Cache RAM `__ for hits.
-- the Cache RAM, also known as the `texture cache (TC) `__, stores requested data for potential reuse. Data returned from the `L2 cache `__ is placed into the Cache RAM before going down the `data-return path `__.
-- a backend data processing unit, also known as the `texture data (TD) `__ that routes data back to the requesting `Compute Unit `__.
-
-Together, this complex is known as the vL1D, or Texture Cache per Pipe
-(TCP). A simplified diagram of the vL1D is presented below:
-
-.. figure:: images/l1perf_model.*
-   :scale: 150 %
-   :alt: Performance model of the vL1D Cache on AMD Instinct(tm) MI accelerators.
-   :align: center
-
-   Performance model of the vL1D Cache on AMD Instinct(tm) MI accelerators.
-
-(L1_SOL)=
-
-vL1D Speed-of-Light
-^^^^^^^^^^^^^^^^^^^
-
-.. warning::
-
-   The theoretical maximum throughput for some metrics in this section is currently computed with the maximum achievable clock frequency, as reported by `rocminfo`, for an accelerator. This may not be realistic for all workloads.
-
-The vL1D's speed-of-light chart shows several key metrics for the vL1D
-as a comparison with the peak achievable values of those metrics.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Hit Rate
-     - The ratio of the number of vL1D cache line requests that hit{sup}`1` in vL1D cache over the total number of cache line requests to the [vL1D Cache RAM](TC).
-     - Percent
-   * - Bandwidth
-     - The number of bytes looked up in the vL1D cache as a result of [VMEM](VALU) instructions, as a percent of the peak theoretical bandwidth achievable on the specific accelerator. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so e.g., if only a single value is requested in a cache line, the data movement will still be counted as a full cache line.
-     - Percent
-   * - Utilization
-     - Indicates how busy the [vL1D Cache RAM](TC) was during the kernel execution. The number of cycles where the [vL1D Cache RAM](TC) is actively processing any request divided by the number of cycles where the [vL1D is active](vL1d_activity).{sup}`2`
-     - Percent
-   * - Coalescing
-     - Indicates how well memory instructions were coalesced by the [address processing unit](TA), ranging from uncoalesced (25%) to fully coalesced (100%). The average number of [thread-requests](ThreadRequests) generated per instruction divided by the ideal number of [thread-requests](ThreadRequests) per instruction.
-     - Percent
-
-(vL1d_activity)=
-
-.. note::
-
-   {sup}`1` The vL1D cache on AMD Instinct(tm) MI CDNA accelerators uses a "hit-on-miss" approach to reporting cache hits. That is, if while satisfying a miss, another request comes in that would hit on the same pending cache line, the subsequent request will be counted as a 'hit'. Therefore, it is also important to consider the Access Latency metric in the [Cache access metrics](TCP_cache_access_metrics) section when evaluating the vL1D hit rate.
-
-   {sup}`2` Omniperf considers the vL1D to be active when any part of the vL1D (excluding the [address-processor](TA) and [data-return](TD) units) is active, e.g., performing a translation, waiting for data, accessing the Tag or Cache RAMs, etc.
-
-(TA)=
-
-Address Processing Unit or Texture Addresser (TA)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The `vL1D `__'s address processing unit receives vector memory
-instructions (commands) along with write/atomic data from a `Compute
-Unit `__ and is responsible for coalescing these into requests for
-lookup in the `vL1D RAM `__. The address processor passes
-information about the commands (coalescing state, destination SIMD,
-etc.) to the `data processing unit `__ for use after the requested
-data has been retrieved.
-
-Omniperf reports several metrics to indicate performance bottlenecks in
-the address processing unit, which are broken down into a few
-categories:
-
-- Busy / stall metrics
-- Instruction counts
-- Spill / Stack metrics
-
-Busy / Stall metrics
-''''''''''''''''''''
-
-When executing vector memory instructions, the compute unit must send an
-address (and in the case of writes/atomics, data) to the address
-processing unit. When the frontend cannot accept any more addresses, it
-must backpressure the wave-issue logic for the VMEM pipe and prevent the
-issue of a vector memory instruction until a previously issued memory
-operation has been processed.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Busy
-     - Percent of the [total CU cycles](TotalCUCycles) the address processor was busy.
-     - Percent
-   * - Address Stall
-     - Percent of the [total CU cycles](TotalCUCycles) the address processor was stalled from sending address requests further into the vL1D pipeline.
-     - Percent
-   * - Data Stall
-     - Percent of the [total CU cycles](TotalCUCycles) the address processor was stalled from sending write/atomic data further into the vL1D pipeline.
-     - Percent
-   * - Data-Processor → Address Stall
-     - Percent of [total CU cycles](TotalCUCycles) the address processor was stalled waiting to send command data to the [data processor](TD).
-     - Percent
-
-(TA_inst)=
-
-Instruction counts
-''''''''''''''''''
-
-The address processor also counts instruction types to give the user
-information on what sorts of memory instructions were executed by the
-kernel. These are broken down into a few major categories:
-.. list-table::
-   :header-rows: 1
-   :widths: 20 20 60
-   :class: noscroll-table
-
-   * - Memory type
-     - Usage
-     - Description
-   * - Global
-     - Global memory
-     - Global memory can be seen by all threads from a process. This includes the local accelerator's DRAM, remote accelerator's DRAM, and the host's DRAM.
-   * - Generic
-     - Dynamic address spaces
-     - Generic memory, a.k.a. "flat" memory, is used when the compiler cannot statically prove that a pointer is to memory in one or the other address spaces. The pointer could dynamically point into global, local, constant, or private memory.
-   * - Private Memory
-     - Register spills / Stack memory
-     - Private memory, a.k.a. "scratch" memory, is only visible to a particular [work-item](workitem) in a particular [workgroup](workgroup). On AMD Instinct(tm) MI accelerators, private memory is used to implement both register spills and stack memory accesses.
-
-The address processor counts these instruction types as follows:
-
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Type
-     - Description
-     - Unit
-   * - Global/Generic
-     - The total number of global & generic memory instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Global/Generic Read
-     - The total number of global & generic memory read instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Global/Generic Write
-     - The total number of global & generic memory write instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Global/Generic Atomic
-     - The total number of global & generic memory atomic (with and without return) instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Spill/Stack
-     - The total number of spill/stack memory instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Spill/Stack Read
-     - The total number of spill/stack memory read instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Spill/Stack Write
-     - The total number of spill/stack memory write instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Spill/Stack Atomic
-     - The total number of spill/stack memory atomic (with and without return) instructions executed on all [compute units](CU) on the accelerator, per [normalization-unit](normunit). Typically unused, as these memory operations are generally used to implement thread-local storage.
-     - Instructions per [normalization-unit](normunit)
-
-.. note::
-
-   The above is a simplified model specifically for the HIP programming language that does not consider (e.g.,) inline assembly usage, constant memory usage or texture memory. These categories correspond to:
-
-   - Global/Generic: global and flat memory operations, which are used for global and generic memory access.
-   - Spill/Stack: buffer instructions which are used on the MI50, MI100, and [MI2XX](2xxnote) accelerators for register spills / stack memory.
-
-   These concepts are described in more detail in the [memory space section](Mspace) below, while generic memory access is explored in the [generic memory benchmark](flatmembench) section.
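-The following HIP sketch (our own; the kernel and its names are
-hypothetical) shows code that would typically generate each of the three
-instruction categories above:
-
-.. code-block:: cpp
-
-   #include <hip/hip_runtime.h>
-
-   __global__ void spaces(float* g, int n) {
-     // Global: 'g' provably points into global memory.
-     float v = g[threadIdx.x];
-
-     // Private (spill/stack): a dynamically indexed local array is placed
-     // in scratch memory, so these accesses count as spill/stack.
-     float stack[64];
-     for (int i = 0; i < 64; ++i) stack[i] = v + i;
-
-     // Generic ("flat"): 'p' may point to either space depending on 'n',
-     // so the compiler cannot statically resolve the address space and may
-     // emit flat (generic) memory operations.
-     float* p = (n & 1) ? g : stack;
-     g[threadIdx.x] = stack[n % 64] + p[0];
-   }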
-Spill/Stack metrics
-'''''''''''''''''''
-
-Finally, the address processing unit contains a separate coalescing
-stage for spill/stack memory, and thus reports:
-
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Spill/Stack Total Cycles
-     - The number of cycles the address processing unit spent working on spill/stack instructions, per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-   * - Spill/Stack Coalesced Read Cycles
-     - The number of cycles the address processing unit spent working on coalesced spill/stack read instructions, per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-   * - Spill/Stack Coalesced Write Cycles
-     - The number of cycles the address processing unit spent working on coalesced spill/stack write instructions, per [normalization-unit](normunit).
-     - Cycles per [normalization-unit](normunit)
-
-(UTCL1)=
-
-L1 Unified Translation Cache (UTCL1)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-After a vector memory instruction has been processed/coalesced by the
-address processing unit of the vL1D, it must be translated from a
-virtual to physical address. This process is handled by the L1 Unified
-Translation Cache (UTCL1). This cache contains an L1 Translation
-Lookaside Buffer (TLB) which stores recently translated addresses to
-reduce the cost of subsequent re-translations.
-
-Omniperf reports the following L1 TLB metrics:
-
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Requests
-     - The number of translation requests made to the UTCL1 per [normalization-unit](normunit).
-     - Requests per [normalization-unit](normunit)
-   * - Hits
-     - The number of translation requests that hit in the UTCL1, and could be reused, per [normalization-unit](normunit).
-     - Requests per [normalization-unit](normunit)
-   * - Hit Ratio
-     - The ratio of the number of translation requests that hit in the UTCL1 divided by the total number of translation requests made to the UTCL1.
-     - Percent
-   * - Translation Misses
-     - The total number of translation requests that missed in the UTCL1 due to the translation not being present in the cache, per [normalization-unit](normunit).
-     - Requests per [normalization-unit](normunit)
-   * - Permission Misses
-     - The total number of translation requests that missed in the UTCL1 due to a permission error, per [normalization-unit](normunit). This is unused and expected to be zero in most configurations for modern CDNA accelerators.
-     - Requests per [normalization-unit](normunit)
-
-.. note::
-
-   On current CDNA accelerators, such as the [MI2XX](2xxnote), the UTCL1 does _not_ count hit-on-miss requests.
-
-(TC)=
-
-Vector L1 Cache RAM (TC)
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-After coalescing in the `address processing unit `__ of the vL1D,
-and address translation in the `L1 TLB `__, the request proceeds
-to the Cache RAM stage of the pipeline. Incoming requests are looked up
-in the cache RAMs using parts of the physical address as a tag. Hits
-will be returned through the `data-return path `__, while misses
-will be routed out to the `L2 Cache `__ for servicing.
-The metrics tracked by the vL1D RAM include:
-
-- Stall metrics
-- Cache access metrics
-- vL1D-L2 transaction detail metrics
-
-(TCP_cache_stall_metrics)=
-
-vL1D cache stall metrics
-''''''''''''''''''''''''
-
-The vL1D also reports where it is stalled in the pipeline, which may
-indicate performance limiters of the cache. A stall in the pipeline may
-result in backpressuring earlier parts of the pipeline, e.g., a stall on
-L2 requests may backpressure the wave-issue logic of the `VMEM `__
-pipe and prevent it from issuing more vector memory instructions until
-the vL1D's outstanding requests are completed.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Stalled on L2 Data
-     - The ratio of the number of cycles where the vL1D is stalled waiting for requested data to return from the [L2 cache](L2) divided by the number of cycles where the [vL1D is active](vL1d_activity).
-     - Percent
-   * - Stalled on L2 Requests
-     - The ratio of the number of cycles where the vL1D is stalled waiting to issue a request for data to the [L2 cache](L2) divided by the number of cycles where the [vL1D is active](vL1d_activity).
-     - Percent
-   * - Tag RAM Stall (Read/Write/Atomic)
-     - The ratio of the number of cycles where the vL1D is stalled due to Read/Write/Atomic requests with conflicting tags being looked up concurrently, divided by the number of cycles where the [vL1D is active](vL1d_activity).
-     - Percent
-
-(TCP_cache_access_metrics)=
-
-vL1D cache access metrics
-'''''''''''''''''''''''''
-
-The vL1D cache access metrics broadly indicate the type of requests
-incoming from the `cache frontend `__, the number of requests that
-were serviced by the vL1D, and the number & type of outgoing requests to
-the `L2 cache `__. In addition, this section includes the
-approximate latencies of accesses to the cache itself, along with
-latencies of read/write memory operations to the `L2 cache `__.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Total Requests
-     - The total number of incoming requests from the [address processing unit](TA) after coalescing.
-     - Requests
-   * - Total read/write/atomic requests
-     - The total number of incoming read/write/atomic requests from the [address processing unit](TA) after coalescing per [normalization-unit](normunit).
-     - Requests per [normalization-unit](normunit)
-   * - Cache Bandwidth
-     - The number of bytes looked up in the vL1D cache as a result of [VMEM](VALU) instructions per [normalization-unit](normunit). The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so e.g., if only a single value is requested in a cache line, the data movement will still be counted as a full cache line.
-     - Bytes per [normalization-unit](normunit)
-   * - Cache Hit Rate
-     - The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the [vL1D Cache RAM](TC).
-     - Percent
-   * - Cache Accesses
-     - The total number of cache line lookups in the vL1D.
-     - Cache lines
-   * - Cache Hits
-     - The number of cache accesses minus the number of outgoing requests to the [L2 cache](L2), i.e., the number of cache line requests serviced by the [vL1D Cache RAM](TC) per [normalization-unit](normunit).
-     - Cache lines per [normalization-unit](normunit)
-   * - Invalidations
-     - The number of times the vL1D was issued a write-back invalidate command during the kernel's execution per [normalization-unit](normunit). This may be triggered by, e.g., the `buffer_wbinvl1` instruction.
-     - Invalidations per [normalization-unit](normunit)
-   * - L1-L2 Bandwidth
-     - The number of bytes transferred across the vL1D-L2 interface as a result of [VMEM](VALU) instructions, per [normalization-unit](normunit). The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so e.g., if only a single value is requested in a cache line, the data movement will still be counted as a full cache line.
-     - Bytes per [normalization-unit](normunit)
-   * - L1-L2 Reads
-     - The number of read requests for a vL1D cache line that were not satisfied by the vL1D and must be retrieved from the [L2 Cache](L2) per [normalization-unit](normunit).
-     - Requests per [normalization-unit](normunit)
-   * - L1-L2 Writes
-     - The number of post-coalescing write requests that are sent through the vL1D to the [L2 cache](L2), per [normalization-unit](normunit).
-     - Requests per [normalization-unit](normunit)
-   * - L1-L2 Atomics
-     - The number of atomic requests that are sent through the vL1D to the [L2 cache](L2), per [normalization-unit](normunit). This includes requests for atomics with and without return.
-     - Requests per [normalization-unit](normunit)
-   * - L1 Access Latency
-     - The average number of cycles that a vL1D cache line request spent in the vL1D cache pipeline.
-     - Cycles
-   * - L1-L2 Read Access Latency
-     - The average number of cycles that the vL1D cache took to issue and receive read requests from the [L2 Cache](L2). This number also includes requests for atomics with return values.
-     - Cycles
-   * - L1-L2 Write Access Latency
-     - The average number of cycles that the vL1D cache took to issue and receive acknowledgement of a write request to the [L2 Cache](L2). This number also includes requests for atomics without return values.
-     - Cycles
-
-.. note::
-
-   All cache accesses in the vL1D are for a single cache line's worth of data. The size of a cache line may vary, however on current AMD Instinct(tm) MI CDNA accelerators and GCN GPUs the L1 cache line size is 64B.
-
-(TCP_TCC_Transactions_Detail)=
-
-vL1D - L2 Transaction Detail
-''''''''''''''''''''''''''''
-
-This section provides a more granular look at the types of requests made
-to the `L2 cache `__. These are broken down by the operation type
-(read / write / atomic, with, or without return), and the `memory
-type `__. For more detail, the reader is referred to the `Memory
-Types `__ section.
-
-(TD)=
-
-Vector L1 Data-Return Path or Texture Data (TD)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The data-return path of the vL1D cache, also known as the Texture Data
-(TD) unit, is responsible for routing data returned from the `vL1D cache
-RAM `__ back to a wavefront on a SIMD. As described in the `vL1D
-cache front-end `__ section, the data-return path is passed
-information about the space requirements and routing for data requests
-from the `VALU `__. When data is returned from the `vL1D cache
-RAM `__, it is matched to this previously stored request data, and
-returned to the appropriate SIMD.
-
-Omniperf reports the following vL1D data-return path metrics:
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Data-return Busy
-     - Percent of the [total CU cycles](TotalCUCycles) the data-return unit was busy processing or waiting on data to return to the [CU](CU).
-     - Percent
-   * - Cache RAM → Data-return Stall
-     - Percent of the [total CU cycles](TotalCUCycles) the data-return unit was stalled on data to be returned from the [vL1D Cache RAM](TC).
-     - Percent
-   * - Workgroup manager → Data-return Stall
-     - Percent of the [total CU cycles](TotalCUCycles) the data-return unit was stalled by the [workgroup manager](SPI) due to initialization of registers as a part of launching new workgroups.
-     - Percent
-   * - Coalescable Instructions
-     - The number of instructions submitted to the [data-return unit](TD) by the [address-processor](TA) that were found to be coalescable, per [normalization-unit](normunit).
-     - Instructions per [normalization-unit](normunit)
-   * - Read Instructions
-     - The number of read instructions submitted to the [data-return unit](TD) by the [address-processor](TA) summed over all [compute units](CU) on the accelerator, per [normalization-unit](normunit). This is expected to be the sum of global/generic and spill/stack reads in the [address processor](TA_inst).
-     - Instructions per [normalization-unit](normunit)
-   * - Write Instructions
-     - The number of store instructions submitted to the [data-return unit](TD) by the [address-processor](TA) summed over all [compute units](CU) on the accelerator, per [normalization-unit](normunit). This is expected to be the sum of global/generic and spill/stack stores counted by the [vL1D cache-frontend](TA_inst).
-     - Instructions per [normalization-unit](normunit)
-   * - Atomic Instructions
-     - The number of atomic instructions submitted to the [data-return unit](TD) by the [address-processor](TA) summed over all [compute units](CU) on the accelerator, per [normalization-unit](normunit). This is expected to be the sum of global/generic and spill/stack atomics in the [address processor](TA_inst).
-     - Instructions per [normalization-unit](normunit)
-
-(L2)=
-
-L2 Cache (TCC)
-==============
-
-The L2 cache is the coherence point for current AMD Instinct(tm) MI GCN
-GPUs and CDNA accelerators, and is shared by all `compute units `__
-on the device. Besides serving requests from the `vector L1 data
-caches `__, the L2 cache is also responsible for servicing
-requests from the `L1 instruction caches `__, the `scalar L1 data
-caches `__ and the `command-processor `__. The L2 cache is
-composed of a number of distinct channels (32 on
-MI100/`MI2XX <2xxnote>`__ series CDNA accelerators at 256B address
-interleaving) which can largely operate independently. Mapping of
-incoming requests to a specific L2 channel is determined by a hashing
-mechanism that attempts to evenly distribute requests across the L2
-channels; a simplified sketch of this mapping follows the list below.
-Requests that miss in the L2 cache are passed out to `Infinity
-Fabric(tm) `__ to be routed to the appropriate memory
-location.
-
-The L2 cache metrics reported by Omniperf are broken down into four
-categories:
-
-- L2 Speed-of-Light
-- L2 Cache Accesses
-- L2-Fabric Transactions
-- L2-Fabric Stalls
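-The channel interleaving described above can be approximated with a toy
-model (a simplification of our own; the real mapping is an undocumented
-hash and varies by chip):
-
-.. code-block:: cpp
-
-   #include <cstdint>
-   #include <cstdio>
-
-   // Simplified model: 32 channels, 256B interleaving, as on MI100/MI2XX.
-   // This only conveys the idea that consecutive 256B blocks land on
-   // different, largely independent channels.
-   constexpr uint64_t kChannels = 32;
-   constexpr uint64_t kInterleaveBytes = 256;
-
-   uint64_t approx_l2_channel(uint64_t phys_addr) {
-     return (phys_addr / kInterleaveBytes) % kChannels;
-   }
-
-   int main() {
-     for (uint64_t a = 0; a < 8 * kInterleaveBytes; a += kInterleaveBytes)
-       std::printf("addr 0x%llx -> channel %llu\n",
-                   (unsigned long long)a,
-                   (unsigned long long)approx_l2_channel(a));
-   }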
-(L2SoL)=
-
-L2 Speed-of-Light
------------------
-
-.. warning::
-
-   The theoretical maximum throughput for some metrics in this section is currently computed with the maximum achievable clock frequency, as reported by `rocminfo`, for an accelerator. This may not be realistic for all workloads.
-
-The L2 cache's speed-of-light table contains a few key metrics about the
-performance of the L2 cache, aggregated over all the L2 channels, as a
-comparison with the peak achievable values of those metrics:
-
-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Utilization
-     - The ratio of the [number of cycles an L2 channel was active, summed over all L2 channels on the accelerator](TotalActiveL2Cycles) over the [total L2 cycles](TotalL2Cycles).
-     - Percent
-   * - Bandwidth
-     - The number of bytes looked up in the L2 cache, as a percent of the peak theoretical bandwidth achievable on the specific accelerator. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so e.g., if only a single value is requested in a cache line, the data movement will still be counted as a full cache line.
-     - Percent
-   * - Hit Rate
-     - The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache.
-     - Percent
-   * - L2-Fabric Read BW
-     - The number of bytes read by the L2 over the [Infinity Fabric(tm) interface](l2fabric) per unit time.
-     - GB/s
-   * - L2-Fabric Write and Atomic BW
-     - The number of bytes sent by the L2 over the [Infinity Fabric(tm) interface](l2fabric) by write and atomic operations per unit time.
-     - GB/s
-
-.. note::
-
-   The L2 cache on AMD Instinct(tm) MI CDNA accelerators uses a "hit-on-miss" approach to reporting cache hits. That is, if while satisfying a miss, another request comes in that would hit on the same pending cache line, the subsequent request will be counted as a 'hit'. Therefore, it is also important to consider the latency metric in the [L2-Fabric](l2fabric) section when evaluating the L2 hit rate.
-
-(L2_cache_metrics)=
-
-L2 Cache Accesses
------------------
-
-This section details the incoming requests to the L2 cache from the
-`vL1D `__ and other clients (e.g., the `sL1D `__ and
-`L1I `__ caches).
-
-.. list-table::
-   :header-rows: 1
-   :widths: 13 70 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Bandwidth
-     - The number of bytes looked up in the L2 cache, per [normalization-unit](normunit). The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so e.g., if only a single value is requested in a cache line, the data movement will still be counted as a full cache line.
-     - Bytes per [normalization-unit](normunit)
-   * - Requests
-     - The total number of incoming requests to the L2 from all clients for all request types, per [normalization-unit](normunit).
-     - Requests per [normalization-unit](normunit)
-   * - Read Requests
-     - The total number of read requests to the L2 from all clients.
-     - Requests per [normalization-unit](normunit)
-   * - Write Requests
-     - The total number of write requests to the L2 from all clients.
-     - Requests per [normalization-unit](normunit)
-   * - Atomic Requests
-     - The total number of atomic requests (with and without return) to the L2 from all clients.
-     - Requests per [normalization-unit](normunit)
-   * - Streaming Requests
-     - The total number of incoming requests to the L2 that are marked as 'streaming'. The exact meaning of this may differ depending on the targeted accelerator, however on an [MI2XX](2xxnote) this corresponds to [non-temporal load or stores](https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins) (see the sketch following this table). The L2 cache attempts to evict 'streaming' requests before normal requests when the L2 is at capacity.
-     - Requests per [normalization-unit](normunit)
-   * - Probe Requests
-     - The number of coherence probe requests made to the L2 cache from outside the accelerator. On an [MI2XX](2xxnote), probe requests may be generated by e.g., writes to [fine-grained device](MType) memory or by writes to [coarse-grained](MType) device memory.
-     - Requests per [normalization-unit](normunit)
-   * - Hit Rate
-     - The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache.
-     - Percent
-   * - Hits
-     - The total number of requests to the L2 from all clients that hit in the cache. As noted in the [speed-of-light](L2SoL) section, this includes hit-on-miss requests.
-     - Requests per [normalization-unit](normunit)
-   * - Misses
-     - The total number of requests to the L2 from all clients that miss in the cache. As noted in the [speed-of-light](L2SoL) section, these do not include hit-on-miss requests.
-     - Requests per [normalization-unit](normunit)
-   * - Writebacks
-     - The total number of L2 cache lines written back to memory for any reason. Write-backs may occur due to e.g., user code (e.g., HIP kernel calls to `__threadfence_system`, or atomic built-ins), the [command-processor](CP)'s memory acquire/release fences, or other internal hardware reasons.
-     - Cache lines per [normalization-unit](normunit)
-   * - Writebacks (Internal)
-     - The total number of L2 cache lines written back to memory for internal hardware reasons, per [normalization-unit](normunit).
-     - Cache lines per [normalization-unit](normunit)
-   * - Writebacks (vL1D Req)
-     - The total number of L2 cache lines written back to memory due to requests initiated by the [vL1D cache](vL1D), per [normalization-unit](normunit).
-     - Cache lines per [normalization-unit](normunit)
-   * - Evictions (Normal)
-     - The total number of L2 cache lines evicted from the cache due to capacity limits, per [normalization-unit](normunit).
-     - Cache lines per [normalization-unit](normunit)
-   * - Evictions (vL1D Req)
-     - The total number of L2 cache lines evicted from the cache due to invalidation requests initiated by the [vL1D cache](vL1D), per [normalization-unit](normunit).
-     - Cache lines per [normalization-unit](normunit)
-   * - Non-hardware-Coherent Requests
-     - The total number of requests to the L2 made to non-hardware-coherent (NC) memory allocations, per [normalization-unit](normunit). See the [Memory Types section](Mtype) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - Uncached Requests
-     - The total number of requests to the L2 made to uncached (UC) memory allocations. See the [Memory Types section](Mtype) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - Coherently Cached Requests
-     - The total number of requests to the L2 made to coherently cacheable (CC) memory allocations. See the [Memory Types section](Mtype) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - Read/Write Coherent Requests
-     - The total number of requests to the L2 made to read-write coherent (RW) memory allocations. See the [Memory Types section](Mtype) for more detail.
-     - Requests per [normalization-unit](normunit)
-
-.. note::
-
-   All requests to the L2 are for a single cache line's worth of data. The size of a cache line may vary depending on the accelerator, however on an AMD Instinct(tm) CDNA2 [MI2XX](2xxnote) accelerator, it is 128B, while on an MI100, it is 64B.
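-For example, the streaming requests described in the table above can be
-generated from HIP through the clang non-temporal builtins referenced
-earlier (a sketch of our own):
-
-.. code-block:: cpp
-
-   #include <hip/hip_runtime.h>
-
-   // A streaming copy marked non-temporal. On an MI2XX, these builtins emit
-   // loads/stores that the L2 counts as 'streaming' requests and prefers to
-   // evict first when the cache is at capacity.
-   __global__ void streaming_copy(const float* __restrict__ in,
-                                  float* __restrict__ out, size_t n) {
-     size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
-     if (i < n) {
-       float v = __builtin_nontemporal_load(&in[i]);
-       __builtin_nontemporal_store(v, &out[i]);
-     }
-   }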
-(l2fabric)=
-
-L2-Fabric transactions
-----------------------
-
-Requests/data that miss in the L2 must be routed to memory in order to
-service them. The backing memory for a request may be local to this
-accelerator (i.e., in the local high-bandwidth memory), in a remote
-accelerator's memory, or even in the CPU's memory. Infinity Fabric(tm)
-is responsible for routing these memory requests/data to the correct
-location and returning any fetched data to the L2 cache. The `following
-section `__ describes the flow of these requests through
-Infinity Fabric(tm) in more detail, as described by Omniperf metrics,
-while `later sections `__ give detailed definitions of
-individual metrics.
-
-(L2_req_flow)=
-
-Request flow
-^^^^^^^^^^^^
-
-Below is a diagram that illustrates how L2↔Fabric requests are reported
-by Omniperf:
-
-.. figure:: images/fabric.png
-   :alt: L2↔Fabric transaction flow on AMD Instinct(tm) MI accelerators.
-   :align: center
-   :name: fabric-fig
-
-   L2↔Fabric transaction flow on AMD Instinct(tm) MI accelerators.
-
-Requests from the L2 Cache are broken down into two major categories,
-read requests and write requests (at this granularity, atomic requests
-are treated as writes).
-
-From there, these requests can additionally be subdivided in a number of
-ways. First, these requests may be sent across Infinity Fabric(tm) as
-different transaction sizes, 32B or 64B on current CDNA accelerators.
-
-.. note::
-
-   On current CDNA accelerators, the 32B read request path is expected to be unused (hence: is disconnected in the flow diagram).
-
-In addition, the read and write requests can be further categorized as:
-
-- uncached read/write requests, e.g., for accesses to `fine-grained memory `__
-- atomic requests, e.g., for atomic updates to `fine-grained memory `__
-- HBM read/write requests OR remote read/write requests, i.e., for requests to the accelerator's local HBM OR requests to a remote accelerator's HBM / the CPU's DRAM.
-
-These classifications are not necessarily *exclusive*, for example, a
-write request can be classified as an atomic request to the
-accelerator's local HBM, and an uncached write request. The request-flow
-diagram marks *exclusive* classifications as a splitting of the flow,
-while *non-exclusive* requests do not split the flow line. For example,
-a request is either a 32B Write Request OR a 64B Write Request, as the
-flow splits at this point:
-
-.. figure:: images/split.*
-   :scale: 50 %
-   :alt: Request flow splitting
-   :align: center
-   :name: split-request-flow-fig
-
-   Splitting request flow
-
-However, continuing along, the same request might be an Atomic request
-and an Uncached Write request, as reflected by a non-split flow:
-
-.. figure:: images/nosplit.*
-   :scale: 50 %
-   :alt: Request flow splitting
-   :align: center
-   :name: nosplit-request-flow-fig
-
-   Non-splitting request flow
-
-Finally, we note that `uncached `__ read requests (e.g., to
-`fine-grained memory `__) are handled specially on CDNA
-accelerators, as indicated in the request flow diagram.
-These are expected to be counted as a 64B Read Request, and *if* they
-are requests to uncached memory (denoted by the dashed line), they will
-also be counted as *two* uncached read requests (i.e., the request is
-split):
-
-.. figure:: images/uncached.*
-   :scale: 50 %
-   :alt: Uncached read-request splitting
-   :align: center
-   :name: uncached-read-request-flow-fig
-
-   Uncached read-request splitting.
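-For reference, fine-grained allocations of the kind referenced in this
-flow can be obtained in HIP roughly as follows (a sketch of our own; the
-exact flags and their availability may vary across ROCm releases):
-
-.. code-block:: cpp
-
-   #include <hip/hip_runtime.h>
-
-   int main() {
-     // Fine-grained, host-coherent memory: device reads of this allocation
-     // are expected to travel the uncached path in the diagram above.
-     float* host_fine = nullptr;
-     (void)hipHostMalloc(&host_fine, 1 << 20, hipHostMallocCoherent);
-
-     // Fine-grained device memory (assumed flag; exposed through
-     // hipExtMallocWithFlags on MI200-class accelerators).
-     float* dev_fine = nullptr;
-     (void)hipExtMallocWithFlags((void**)&dev_fine, 1 << 20,
-                                 hipDeviceMallocFinegrained);
-
-     (void)hipHostFree(host_fine);
-     (void)hipFree(dev_fine);
-     return 0;
-   }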
-(L2_req_metrics)=
-
-Metrics
-^^^^^^^
-
-The following metrics are reported for the L2-Fabric interface:
-
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - L2-Fabric Read Bandwidth
-     - The total number of bytes read by the L2 cache from Infinity Fabric(tm) per [normalization-unit](normunit).
-     - Bytes per [normalization-unit](normunit)
-   * - HBM Read Traffic
-     - The percent of read requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the _size_ of the request (i.e., 32B and 64B requests are both counted as a single request), so this metric only _approximates_ the percent of the L2-Fabric Read bandwidth directed to the local HBM.
-     - Percent
-   * - Remote Read Traffic
-     - The percent of read requests generated by the L2 cache that are routed to any memory location other than the accelerator's local high-bandwidth memory (HBM) --- e.g., the CPU's DRAM, a remote accelerator's HBM, etc. This breakdown does not consider the _size_ of the request (i.e., 32B and 64B requests are both counted as a single request), so this metric only _approximates_ the percent of the L2-Fabric Read bandwidth directed to a remote location.
-     - Percent
-   * - Uncached Read Traffic
-     - The percent of read requests generated by the L2 cache that are reading from an [uncached memory allocation](Mtype). Note, as described in the [request-flow](L2_req_flow) section, a single 64B read request is typically counted as two uncached read requests, hence it is possible for the Uncached Read Traffic to reach up to 200% of the total number of read requests. This breakdown does not consider the _size_ of the request (i.e., 32B and 64B requests are both counted as a single request), so this metric only _approximates_ the percent of the L2-Fabric read bandwidth directed to an uncached memory location.
-     - Percent
-   * - L2-Fabric Write and Atomic Bandwidth
-     - The total number of bytes written by the L2 over Infinity Fabric(tm) by write and atomic operations per [normalization-unit](normunit). Note that on current CDNA accelerators, such as the [MI2XX](2xxnote), requests are only considered 'atomic' by Infinity Fabric(tm) if they are targeted at non-write-cacheable memory, e.g., [fine-grained memory](Mtype) allocations or [uncached memory](Mtype) allocations on the [MI2XX](2xxnote).
-     - Bytes per [normalization-unit](normunit)
-   * - HBM Write and Atomic Traffic
-     - The percent of write and atomic requests generated by the L2 cache that are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown does not consider the _size_ of the request (i.e., 32B and 64B requests are both counted as a single request), so this metric only _approximates_ the percent of the L2-Fabric Write and Atomic bandwidth directed to the local HBM. Note that on current CDNA accelerators, such as the [MI2XX](2xxnote), requests are only considered 'atomic' by Infinity Fabric(tm) if they are targeted at [fine-grained memory](Mtype) allocations or [uncached memory](Mtype) allocations.
-     - Percent
-   * - Remote Write and Atomic Traffic
-     - The percent of write and atomic requests generated by the L2 cache that are routed to any memory location other than the accelerator's local high-bandwidth memory (HBM) --- e.g., the CPU's DRAM, a remote accelerator's HBM, etc. This breakdown does not consider the _size_ of the request (i.e., 32B and 64B requests are both counted as a single request), so this metric only _approximates_ the percent of the L2-Fabric Write and Atomic bandwidth directed to a remote location. Note that on current CDNA accelerators, such as the [MI2XX](2xxnote), requests are only considered 'atomic' by Infinity Fabric(tm) if they are targeted at non-write-cacheable memory, e.g., [fine-grained memory](Mtype) allocations or [uncached memory](Mtype) allocations on the [MI2XX](2xxnote).
-     - Percent
-   * - Atomic Traffic
-     - The percent of write requests generated by the L2 cache that are atomic requests to _any_ memory location. This breakdown does not consider the _size_ of the request (i.e., 32B and 64B requests are both counted as a single request), so this metric only _approximates_ the percent of the L2-Fabric Write and Atomic bandwidth that is due to use of atomics. Note that on current CDNA accelerators, such as the [MI2XX](2xxnote), requests are only considered 'atomic' by Infinity Fabric(tm) if they are targeted at [fine-grained memory](Mtype) allocations or [uncached memory](Mtype) allocations.
-     - Percent
-   * - Uncached Write and Atomic Traffic
-     - The percent of write and atomic requests generated by the L2 cache that are targeting [uncached memory allocations](Mtype). This breakdown does not consider the _size_ of the request (i.e., 32B and 64B requests are both counted as a single request), so this metric only _approximates_ the percent of the L2-Fabric write bandwidth directed to uncached memory allocations.
-     - Percent
-   * - Read Latency
-     - The time-averaged number of cycles read requests spent in Infinity Fabric(tm) before data was returned to the L2.
-     - Cycles
-   * - Write Latency
-     - The time-averaged number of cycles write requests spent in Infinity Fabric(tm) before a completion acknowledgement was returned to the L2.
-     - Cycles
-   * - Atomic Latency
-     - The time-averaged number of cycles atomic requests spent in Infinity Fabric(tm) before a completion acknowledgement (atomic without return value) or data (atomic with return value) was returned to the L2.
-     - Cycles
-   * - Read Stall
-     - The ratio of the total number of cycles the L2-Fabric interface was stalled on a read request to any destination (local HBM, remote PCIe(r) connected accelerator / CPU, or remote Infinity Fabric(tm) connected accelerator{sup}`1` / CPU) over the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-   * - Write Stall
-     - The ratio of the total number of cycles the L2-Fabric interface was stalled on a write or atomic request to any destination (local HBM, remote PCIe(r) connected accelerator / CPU, or remote Infinity Fabric(tm) connected accelerator{sup}`1` / CPU) over the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-
-(L2_req_metric_details)=
-
-Detailed Transaction Metrics
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The following metrics are available in the detailed L2-Fabric
-transaction breakdown table:
-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - 32B Read Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to read 32B of data from any memory location, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail. Typically unused on CDNA accelerators.
-     - Requests per [normalization-unit](normunit)
-   * - Uncached Read Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to read [uncached data](Mtype) from any memory location, per [normalization-unit](normunit). 64B requests for uncached data are counted as two 32B uncached data requests. See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - 64B Read Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to read 64B of data from any memory location, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - HBM Read Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to read 32B or 64B of data from the accelerator's local HBM, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - Remote Read Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to read 32B or 64B of data from any source other than the accelerator's local HBM, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - 32B Write and Atomic Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to write or atomically update 32B of data to any memory location, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - Uncached Write and Atomic Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to write or atomically update 32B or 64B of [uncached data](Mtype), per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - 64B Write and Atomic Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to write or atomically update 64B of data in any memory location, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - HBM Write and Atomic Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to write or atomically update 32B or 64B of data in the accelerator's local HBM, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - Remote Write and Atomic Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to write or atomically update 32B or 64B of data in any memory location other than the accelerator's local HBM, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail.
-     - Requests per [normalization-unit](normunit)
-   * - Atomic Requests
-     - The total number of L2 requests to Infinity Fabric(tm) to atomically update 32B or 64B of data in any memory location, per [normalization-unit](normunit). See [request-flow](L2_req_flow) for more detail. Note that on current CDNA accelerators, such as the [MI2XX](2xxnote), requests are only considered 'atomic' by Infinity Fabric(tm) if they are targeted at non-write-cacheable memory, e.g., [fine-grained memory](Mtype) allocations or [uncached memory](Mtype) allocations on the [MI2XX](2xxnote).
-     - Requests per [normalization-unit](normunit)
-L2-Fabric Interface Stalls
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-When the interface between the L2 cache and Infinity Fabric(tm) becomes
-backed up by requests, it may stall, preventing the L2 from issuing
-additional requests to Infinity Fabric(tm) until prior requests
-complete. This section gives a breakdown of what types of requests in a
-kernel caused a stall (e.g., read vs write), and to which locations
-(e.g., to the accelerator's local memory, or to remote
-accelerators/CPUs).
-
-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-   :class: noscroll-table
-
-   * - Metric
-     - Description
-     - Unit
-   * - Read - PCIe(r) Stall
-     - The number of cycles the L2-Fabric interface was stalled on read requests to remote PCIe(r) connected accelerators{sup}`1` or CPUs as a percent of the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-   * - Read - Infinity Fabric(tm) Stall
-     - The number of cycles the L2-Fabric interface was stalled on read requests to remote Infinity Fabric(tm) connected accelerators{sup}`1` or CPUs as a percent of the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-   * - Read - HBM Stall
-     - The number of cycles the L2-Fabric interface was stalled on read requests to the accelerator's local HBM as a percent of the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-   * - Write - PCIe(r) Stall
-     - The number of cycles the L2-Fabric interface was stalled on write or atomic requests to remote PCIe(r) connected accelerators{sup}`1` or CPUs as a percent of the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-   * - Write - Infinity Fabric(tm) Stall
-     - The number of cycles the L2-Fabric interface was stalled on write or atomic requests to remote Infinity Fabric(tm) connected accelerators{sup}`1` or CPUs as a percent of the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-   * - Write - HBM Stall
-     - The number of cycles the L2-Fabric interface was stalled on write or atomic requests to the accelerator's local HBM as a percent of the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-   * - Write - Credit Starvation
-     - The number of cycles the L2-Fabric interface was stalled on write or atomic requests to any memory location because too many write/atomic requests were currently in flight, as a percent of the [total active L2 cycles](TotalActiveL2Cycles).
-     - Percent
-
-.. note::
-
-   {sup}`1` In addition to being used for on-accelerator data-traffic, AMD [Infinity Fabric](https://www.amd.com/en/technologies/infinity-architecture)(tm) technology can be used to connect multiple accelerators to achieve advanced peer-to-peer connectivity and enhanced bandwidths over traditional PCIe(r) connections. Some AMD Instinct(tm) MI accelerators, e.g., the MI250X, [feature coherent CPU↔accelerator connections built using AMD Infinity Fabric(tm)](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf).
-
-.. warning::
-
-   On current CDNA accelerators and GCN GPUs, these L2↔Fabric stalls can be undercounted in some circumstances.
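-Remote read/write traffic and the associated stalls tabulated above can
-be produced, for example, by peer-to-peer access between two
-accelerators; a minimal sketch (our own, with error handling elided):
-
-.. code-block:: cpp
-
-   #include <hip/hip_runtime.h>
-
-   // GPU 0 writes directly into GPU 1's HBM. On the writing device, these
-   // stores surface as remote write requests (and possibly remote write
-   // stalls) at the L2-Fabric interface rather than as local HBM traffic.
-   __global__ void fill(float* p, size_t n) {
-     size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
-     if (i < n) p[i] = 1.0f;
-   }
-
-   int main() {
-     size_t n = 1 << 24;
-     (void)hipSetDevice(1);
-     float* remote = nullptr;
-     (void)hipMalloc(&remote, n * sizeof(float));
-
-     (void)hipSetDevice(0);
-     (void)hipDeviceEnablePeerAccess(1, 0);  // let device 0 map device 1
-     fill<<<dim3((n + 255) / 256), dim3(256)>>>(remote, n);
-     (void)hipDeviceSynchronize();
-
-     (void)hipSetDevice(1);
-     (void)hipFree(remote);
-     return 0;
-   }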
diff --git a/docs/concept/includes/shader-engine.rst b/docs/concept/includes/shader-engine.rst deleted file mode 100644 index 35f31bc28..000000000 --- a/docs/concept/includes/shader-engine.rst +++ /dev/null @@ -1,392 +0,0 @@ -(SE)= ## Shader Engine (SE) - -The `CUs `__ on a CDNA accelerator are grouped together into a -higher-level organizational unit called a Shader Engine (SE): - -\```{figure} images/selayout.png :alt: Example of CU-grouping into -shader-engines on AMD Instinct(tm) MI accelerators. :align: center -:name: selayout-fig - -Example of CU-grouping into shader-engines on AMD Instinct(tm) MI -accelerators. - -:: - - - The number of CUs on a SE varies from chip-to-chip (see, for example [AMD GPU HIP Training](https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf), slide 20). - In addition, newer accelerators such as the AMD Instinct(tm) MI 250X have 8 SEs per accelerator. - - For the purposes of Omniperf, we consider resources that are shared between multiple CUs on a single SE as part of the SE's metrics. - These include: - - the [scalar L1 data cache](sL1D) - - the [L1 instruction cache](L1I) - - the [workgroup manager](SPI) - - (sL1D)= - ### Scalar L1 Data Cache (sL1D) - - The Scalar L1 Data cache (sL1D) can cache data accessed from scalar load instructions (and scalar store instructions on architectures where they exist) from wavefronts in the [CUs](CU). - The sL1D is shared between multiple CUs ([GCN Crash Course](https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah), slide 36) --- the exact number of CUs depends on the architecture in question (3 CUs in GCN GPUs and MI100, 2 CUs in [MI2XX](2xxnote)) --- and is backed by the [L2](L2) cache. - - In typical usage, the data in the sL1D is comprised of (e.g.,): - - Kernel arguments, e.g., pointers, [non-populated](https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-sgpr-register-set-up-order-table) grid/block dimensions, etc. - - HIP's `__constant__` memory, when accessed in a provably uniform{sup}`1` manner - - Other memory, when accessed in a provably uniform manner, *and* the backing memory is provably constant{sup}`1` - - ```{note} - {sup}`1` - The scalar data cache is used when the compiler emits scalar loads to access data. - This requires that the data be _provably_ uniformly accessed (i.e., the compiler can verify that all work-items in a wavefront access the same data), _and_ that the data can be proven to be read-only (e.g., HIP's `__constant__` memory, or properly `__restrict__`'ed pointers to avoid write-aliasing). - Access of e.g., `__constant__` memory is not guaranteed to go through the sL1D if, e.g., the wavefront loads a non-uniform value. - -(sL1D_SOL)= #### Scalar L1D Speed-of-Light - -.. code:: {warning} - - The theoretical maximum throughput for some metrics in this section are currently computed with the maximum achievable clock frequency, as reported by `rocminfo`, for an accelerator. This may not be realistic for all workloads. - -The Scalar L1D speed-of-light chart shows some key metrics of the sL1D -cache as a comparison with the peak achievable values of those metrics: - -.. code:: {list-table} - - :header-rows: 1 - :widths: 20 65 15 - :class: noscroll-table - * - Metric - - Description - - Unit - * - Bandwidth - - The number of bytes looked up in the sL1D cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the [total sL1D cycles](TotalSL1DCycles). 
- - Percent - * - Cache Hit Rate - - The percent of sL1D requests that hit{sup}`1` on a previously loaded line in the cache. Calculated as the ratio of the number of sL1D requests that hit over the number of all sL1D requests. - - Percent - * - sL1D-L2 BW - - The number of bytes requested by the sL1D from the L2 cache, as a percent of the peak theoretical sL1D → L2 cache bandwidth. Calculated as the ratio of the total number of requests from the sL1D to the L2 cache over the [total sL1D-L2 interface cycles](TotalSL1DCycles). - - Percent - -.. code:: {note} - - {sup}`1` Unlike the [vL1D](vL1D) and [L2](L2) caches, the sL1D cache on AMD Instinct(tm) MI CDNA accelerators does _not_ use "hit-on-miss" approach to reporting cache hits. - That is, if while satisfying a miss, another request comes in that would hit on the same pending cache line, the subsequent request will be counted as a 'duplicated miss' (see below). - -Scalar L1D Cache Accesses -^^^^^^^^^^^^^^^^^^^^^^^^^ - -This panel gives more detail on the types of accesses made to the sL1D, -and the hit/miss statistics. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 18 65 17 - :class: noscroll-table - * - Metric - - Description - - Unit - * - Requests - - The total number of requests, of any size or type, made to the sL1D per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Hits - - The total number of sL1D requests that hit on a previously loaded cache line, per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Misses - Non Duplicated - - The total number of sL1D requests that missed on a cache line that *was not* already pending due to another request, per [normalization-unit](normunit). See note in [speed-of-light section](sL1D_SOL) for more detail. - - Requests per [normalization-unit](normunit) - * - Misses - Duplicated - - The total number of sL1D requests that missed on a cache line that *was* already pending due to another request, per [normalization-unit](normunit). See note in [speed-of-light section](sL1D_SOL) for more detail. - - Requests per [normalization-unit](normunit) - * - Cache Hit Rate - - Indicates the percent of sL1D requests that hit on a previously loaded line the cache. The ratio of the number of sL1D requests that hit{sup}`1` over the number of all sL1D requests. - - Percent - * - Read Requests (Total) - - The total number of sL1D read requests of any size, per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Atomic Requests - - The total number of sL1D atomic requests of any size, per [normalization-unit](normunit). Typically unused on CDNA accelerators. - - Requests per [normalization-unit](normunit) - * - Read Requests (1 DWord) - - The total number of sL1D read requests made for a single dword of data (4B), per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Read Requests (2 DWord) - - The total number of sL1D read requests made for a two dwords of data (8B), per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Read Requests (4 DWord) - - The total number of sL1D read requests made for a four dwords of data (16B), per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Read Requests (8 DWord) - - The total number of sL1D read requests made for a eight dwords of data (32B), per [normalization-unit](normunit). 
- - Requests per [normalization-unit](normunit) - * - Read Requests (16 DWord) - - The total number of sL1D read requests made for a sixteen dwords of data (64B), per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - -.. code:: {note} - - {sup}`1`Unlike the [vL1D](vL1D) and [L2](L2) caches, the sL1D cache on AMD Instinct(tm) MI CDNA accelerators does _not_ use "hit-on-miss" approach to reporting cache hits. - That is, if while satisfying a miss, another request comes in that would hit on the same pending cache line, the subsequent request will be counted as a 'duplicated miss' (see below). - -sL1D ↔ L2 Interface -^^^^^^^^^^^^^^^^^^^ - -This panel gives more detail on the data requested across the -sL1D↔\ `L2 `__ interface. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 18 65 17 - :class: noscroll-table - * - Metric - - Description - - Unit - * - sL1D-L2 BW - - The total number of bytes read from/written to/atomically updated across the sL1D↔[L2](L2) interface, per [normalization-unit](normunit). Note that sL1D writes and atomics are typically unused on current CDNA accelerators, so in the majority of cases this can be interpreted as an sL1D→L2 read bandwidth. - - Bytes per [normalization-unit](normunit) - * - Read Requests - - The total number of read requests from sL1D to the [L2](L2), per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Write Requests - - The total number of write requests from sL1D to the [L2](L2), per [normalization-unit](normunit). Typically unused on current CDNA accelerators. - - Requests per [normalization-unit](normunit) - * - Atomic Requests - - The total number of atomic requests from sL1D to the [L2](L2), per [normalization-unit](normunit). Typically unused on current CDNA accelerators. - - Requests per [normalization-unit](normunit) - * - Stall Cycles - - The total number of cycles the sL1D↔[L2](L2) interface was stalled, per [normalization-unit](normunit). - - Cycles per [normalization-unit](normunit) - -(L1I)= ### L1 Instruction Cache (L1I) - -As with the `sL1D `__, the L1 Instruction (L1I) cache is shared -between multiple CUs on a shader-engine, where the precise number of CUs -sharing a L1I depends on the architecture in question (`GCN Crash -Course `__, -slide 36) and is backed by the `L2 `__ cache. Unlike the sL1D, the -instruction cache is read-only. - -(L1I_SOL)= #### L1I Speed-of-Light - -.. code:: {warning} - - The theoretical maximum throughput for some metrics in this section are currently computed with the maximum achievable clock frequency, as reported by `rocminfo`, for an accelerator. This may not be realistic for all workloads. - -The L1 Instruction Cache speed-of-light chart shows some key metrics of -the L1I cache as a comparison with the peak achievable values of those -metrics: - -.. code:: {list-table} - - :header-rows: 1 - :widths: 15 70 15 - :class: noscroll-table - * - Metric - - Description - - Unit - * - Bandwidth - - The number of bytes looked up in the L1I cache, as a percent of the peak theoretical bandwidth. Calculated as the ratio of L1I requests over the [total L1I cycles](TotalL1ICycles). - - Percent - * - Cache Hit Rate - - The percent of L1I requests that hit on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit{sup}`1` over the number of all L1I requests. - - Percent - * - L1I-L2 BW - - The percent of the peak theoretical L1I → L2 cache request bandwidth achieved. 
Calculated as the ratio of the total number of requests from the L1I to the L2 cache over the [total L1I-L2 interface cycles](TotalL1ICycles). - - Percent - * - Instruction Fetch Latency - - The average number of cycles spent to fetch instructions to a [CU](cu). - - Cycles - -.. code:: {note} - - {sup}`1`Unlike the [vL1D](vL1D) and [L2](L2) caches, the L1I cache on AMD Instinct(tm) MI CDNA accelerators does _not_ use "hit-on-miss" approach to reporting cache hits. - That is, if while satisfying a miss, another request comes in that would hit on the same pending cache line, the subsequent request will be counted as a 'duplicated miss' (see below). - -L1I Cache Accesses -^^^^^^^^^^^^^^^^^^ - -This panel gives more detail on the hit/miss statistics of the L1I: - -.. code:: {list-table} - - :header-rows: 1 - :widths: 18 65 17 - :class: noscroll-table - * - Metric - - Description - - Unit - * - Requests - - The total number of requests made to the L1I per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Hits - - The total number of L1I requests that hit on a previously loaded cache line, per [normalization-unit](normunit). - - Requests per [normalization-unit](normunit) - * - Misses - Non Duplicated - - The total number of L1I requests that missed on a cache line that *was not* already pending due to another request, per [normalization-unit](normunit). See note in [speed-of-light section](L1I_SOL) for more detail. - - Requests per [normalization-unit](normunit) - * - Misses - Duplicated - - The total number of L1I requests that missed on a cache line that *was* already pending due to another request, per [normalization-unit](normunit). See note in [speed-of-light section](L1I_SOL) for more detail. - - Requests per [normalization-unit](normunit) - * - Cache Hit Rate - - The percent of L1I requests that hit{sup}`1` on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the the number of all L1I requests. - - Percent - -.. code:: {note} - - {sup}`1`Unlike the [vL1D](vL1D) and [L2](L2) caches, the L1I cache on AMD Instinct(tm) MI CDNA accelerators does _not_ use "hit-on-miss" approach to reporting cache hits. - That is, if while satisfying a miss, another request comes in that would hit on the same pending cache line, the subsequent request will be counted as a 'duplicated miss' (see below). - -L1I - L2 Interface -^^^^^^^^^^^^^^^^^^ - -This panel gives more detail on the data requested across the -L1I-`L2 `__ interface. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 18 65 17 - :class: noscroll-table - * - Metric - - Description - - Unit - * - L1I-L2 BW - - The total number of bytes read across the L1I-[L2](L2) interface, per [normalization-unit](normunit). - - Bytes per [normalization-unit](normunit) - -(SPI)= ### Workgroup manager (SPI) - -The workgroup manager (SPI) is the bridge between the `command -processor `__ and the `compute units `__. After the `command -processor `__ processes a kernel dispatch, it will then pass the -dispatch off to the workgroup manager, which then schedules -`workgroups `__ onto the `compute units `__. As -workgroups complete execution and resources become available, the -workgroup manager will schedule new workgroups onto `compute -units `__. 
The workgroup manager’s metrics therefore are focused on -reporting, e.g.: - -- Utilizations of various parts of the accelerator that the workgroup - manager interacts with (and the workgroup manager itself) -- How many workgroups were dispatched, their size, and how many - resources they used -- Percent of scheduler opportunities (cycles) where workgroups failed - to dispatch, and -- Percent of scheduler opportunities (cycles) where workgroups failed - to dispatch due to lack of a specific resource on the CUs (e.g., too - many VGPRs allocated) - -This gives the user an idea of why the workgroup manager couldn’t -schedule more wavefronts onto the device, and is most useful for -workloads that the user suspects to be scheduling/launch-rate limited. - -As discussed in the `command processor `__ description, the command -processor on AMD Instinct(tm) MI architectures contains four hardware -scheduler-pipes, each with eight software threads (`“Vega10” - -Mantor `__, -slide 19). Each scheduler-pipe can issue a kernel dispatch to the -workgroup manager to schedule concurrently. Therefore, some workgroup -manager metrics are presented relative to the utilization of these -scheduler-pipes (e.g., whether all four are issuing concurrently). - -.. code:: {note} - - Current versions of the profiling libraries underlying Omniperf attempt to serialize concurrent kernels running on the accelerator, as the performance counters on the device are global (i.e., shared between concurrent kernels). - This means that these scheduler-pipe utilization metrics are expected to reach e.g., a maximum of one pipe active, i.e., only 25\%. - -Workgroup Manager Utilizations -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This section describes the utilization of the workgroup manager, and the -hardware components it interacts with. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 20 65 15 - :class: noscroll-table - * - Metric - - Description - - Unit - * - Accelerator Utilization - - The percent of cycles in the kernel where the accelerator was actively doing any work. - - Percent - * - Scheduler-Pipe Utilization - - The percent of [total scheduler-pipe cycles](TotalPipeCycles) in the kernel where the scheduler-pipes were actively doing any work. Note: this value is expected to range between 0-25%, see note in [workgroup-manager](SPI) description. - - Percent - * - Workgroup Manager Utilization - - The percent of cycles in the kernel where the Workgroup Manager was actively doing any work. - - Percent - * - Shader Engine Utilization - - The percent of [total shader-engine cycles](TotalSECycles) in the kernel where any CU in a shader-engine was actively doing any work, normalized over all shader-engines. Low values (e.g., << 100%) indicate that the accelerator was not fully saturated by the kernel, or a potential load-imbalance issue. - - Percent - * - SIMD Utilization - - The percent of [total SIMD cycles](TotalSIMDCycles) in the kernel where any [SIMD](VALU) on a CU was actively doing any work, summed over all CUs. Low values (e.g., << 100%) indicate that the accelerator was not fully saturated by the kernel, or a potential load-imbalance issue. - - Percent - * - Dispatched Workgroups - - The total number of workgroups forming this kernel launch. - - Workgroups - * - Dispatched Wavefronts - - The total number of wavefronts, summed over all workgroups, forming this kernel launch. - - Wavefronts - * - VGPR Writes - - The average number of cycles spent initializing [VGPRs](valu) at wave creation. 
- - Cycles/wave - * - SGPR Writes - - The average number of cycles spent initializing [SGPRs](salu) at wave creation. - - Cycles/wave - -Workgroup Manager - Resource Allocation -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -This panel gives more detail on how workgroups/wavefronts were scheduled -onto compute units, and what occupancy limiters they hit (if any). When -analyzing these metrics, the user should also take into account their -achieved occupancy (i.e., `Wavefront -occupancy `__). A kernel may be occupancy -limited by e.g., LDS usage, but may still achieve high occupancy levels -such that improving occupancy further may not improve performance. See -the `Workgroup Manager - Occupancy Limiters `__ -example for more details. - -.. code:: {list-table} - - :header-rows: 1 - :widths: 20 65 15 - :class: noscroll-table - * - Metric - - Description - - Unit - * - Not-scheduled Rate (Workgroup Manager) - - The percent of [total scheduler-pipe cycles](TotalPipeCycles) in the kernel where a workgroup could not be scheduled to a [CU](CU) due to a bottleneck within the workgroup manager rather than a lack of a [CU](CU)/[SIMD](VALU) with sufficient resources. Note: this value is expected to range between 0-25%, see note in [workgroup-manager](SPI) description. - - Percent - * - Not-scheduled Rate (Scheduler-Pipe) - - The percent of [total scheduler-pipe cycles](TotalPipeCycles) in the kernel where a workgroup could not be scheduled to a [CU](CU) due to a bottleneck within the scheduler-pipes rather than a lack of a [CU](CU)/[SIMD](VALU) with sufficient resources. Note: this value is expected to range between 0-25%, see note in [workgroup-manager](SPI) description. - - Percent - * - Scheduler-Pipe Stall Rate - - The percent of [total scheduler-pipe cycles](TotalPipeCycles) in the kernel where a workgroup could not be scheduled to a [CU](CU) due to occupancy limitations (i.e., a lack of a [CU](CU)/[SIMD](VALU) with sufficient resources). Note: this value is expected to range between 0-25%, see note in [workgroup-manager](SPI) description. - - Percent - * - Scratch Stall Rate - - The percent of [total shader-engine cycles](TotalSECycles) in the kernel where a workgroup could not be scheduled to a [CU](CU) due to lack of [private (a.k.a., scratch) memory](Mtype) slots. While this can reach up to 100\%, we note that the actual occupancy limitations on a kernel using private memory are typically quite small (e.g., <1\% of the total number of waves that can be scheduled to an accelerator). - - Percent - * - Insufficient SIMD Waveslots - - The percent of [total SIMD cycles](TotalSIMDCycles) in the kernel where a workgroup could not be scheduled to a [SIMD](valu) due to lack of available [waveslots](valu). - - Percent - * - Insufficient SIMD VGPRs - - The percent of [total SIMD cycles](TotalSIMDCycles) in the kernel where a workgroup could not be scheduled to a [SIMD](valu) due to lack of available [VGPRs](valu). - - Percent - * - Insufficient SIMD SGPRs - - The percent of [total SIMD cycles](TotalSIMDCycles) in the kernel where a workgroup could not be scheduled to a [SIMD](valu) due to lack of available [SGPRs](salu). - - Percent - * - Insufficient CU LDS - - The percent of [total CU cycles](TotalCUCycles) in the kernel where a workgroup could not be scheduled to a [CU](cu) due to lack of available [LDS](lds). 
- - Percent - * - Insufficient CU Barriers - - The percent of [total CU cycles](TotalCUCycles) in the kernel where a workgroup could not be scheduled to a [CU](cu) due to lack of available [barriers](barrier). - - Percent - * - Reached CU Workgroup Limit - - The percent of [total CU cycles](TotalCUCycles) in the kernel where a workgroup could not be scheduled to a [CU](cu) due to limits within the workgroup manager. This is expected to be always be zero on CDNA2 or newer accelerators (and small for previous accelerators). - - Percent - * - Reached CU Wavefront Limit - - The percent of [total CU cycles](TotalCUCycles) in the kernel where a wavefront could not be scheduled to a [CU](cu) due to limits within the workgroup manager. This is expected to be always be zero on CDNA2 or newer accelerators (and small for previous accelerators). - - Percent diff --git a/docs/concept/includes/terms.rst b/docs/concept/includes/terms.rst index e104515bc..d170e99d1 100644 --- a/docs/concept/includes/terms.rst +++ b/docs/concept/includes/terms.rst @@ -1,120 +1,186 @@ +.. _desc-workgroup: + +.. _desc-work-item: + +.. _desc-wavefront: + +.. _desc-divergence: + +.. _kernel-time: + +.. _kernel-cycles: + +.. _total-active-cu-cycles: + +.. _total-cu-cycles: + +.. _total-se-cycles: + +.. _total-simd-cycles: + +.. _total-pipe-cycles: + +.. _total-l1i-cycles: + +.. _total-active-l2-cycles: + +.. _total-l2-cycles: + +.. _total-sl1d-cycles: + .. list-table:: :header-rows: 1 * - Name + - Description + - Unit * - Kernel time + - The number of seconds the accelerator was executing a kernel, from the - :ref:`command processor `'s (CP) start-of-kernel + :doc:`command processor `'s (CP) start-of-kernel timestamp (a number of cycles after the CP begins processing the packet) to the CP's end-of-kernel timestamp (a number of cycles before the CP stops processing the packet). + - Seconds * - Kernel cycles + - The number of cycles the accelerator was active doing *any* work, as - measured by the :ref:`command processor ` (CP). + measured by the :doc:`command processor ` (CP). + - Cycles * - Total CU cycles + - The number of cycles the accelerator was active doing *any* work (that is, kernel cycles), multiplied by the number of - :ref:`compute units ` on the accelerator. A + :doc:`compute units ` on the accelerator. A measure of the total possible active cycles the compute units could be doing work, useful for the normalization of metrics inside the CU. + - Cycles * - Total active CU cycles + - The number of cycles a CU on the accelerator was active doing *any* - work, summed over all :ref:`compute units ` on the + work, summed over all :doc:`compute units ` on the accelerator. + - Cycles * - Total SIMD cycles + - The number of cycles the accelerator was active doing *any* work (that is, kernel cycles), multiplied by the number of - :ref:`SIMDs ` on the accelerator. A measure of the + :doc:`SIMDs ` on the accelerator. A measure of the total possible active cycles the SIMDs could be doing work, useful for the normalization of metrics inside the CU. + - Cycles * - Total L2 cycles + - The number of cycles the accelerator was active doing *any* work (that - is, kernel cycles), multiplied by the number of :ref:`L2 ` + is, kernel cycles), multiplied by the number of :doc:`L2 ` channels on the accelerator. A measure of the total possible active cycles the L2 channels could be doing work, useful for the normalization of metrics inside the L2. 
+ - Cycles * - Total active L2 cycles + - The number of cycles a channel of the L2 cache was active doing *any* - work, summed over all :ref:`L2 ` channels on the accelerator. + work, summed over all :doc:`L2 ` channels on the accelerator. + - Cycles * - Total sL1D cycles + - The number of cycles the accelerator was active doing *any* work (that is, kernel cycles), multiplied by the number of - :ref:`scalar L1 data caches ` on the accelerator. A measure of + :ref:`scalar L1 data caches ` on the accelerator. A measure of the total possible active cycles the sL1Ds could be doing work, useful for the normalization of metrics inside the sL1D. + - Cycles * - Total L1I cycles + - The number of cycles the accelerator was active doing *any* work (that is, kernel cycles), multiplied by the number of - :ref:`L1 instruction caches ` (L1I) on the accelerator. A + :ref:`L1 instruction caches ` (L1I) on the accelerator. A measure of the total possible active cycles the L1Is could be doing work, useful for the normalization of metrics inside the L1I. + - Cycles * - Total scheduler-pipe cycles + - The number of cycles the accelerator was active doing *any* work (that is, kernel cycles), multiplied by the number of - :ref:`scheduler pipes ` on the accelerator. A measure of the - total possible active cycles the scheduler-pipes could be doing work, - useful for the normalization of metrics inside the - :ref:`workgroup manager ` and :ref:`command processor `. + :doc:`scheduler pipes ` on the accelerator. A measure + of the total possible active cycles the scheduler-pipes could be doing + work, useful for the normalization of metrics inside the + :ref:`workgroup manager ` and + :doc:`command processor `. + - Cycles * - Total shader-engine cycles + - The total number of cycles the accelerator was active doing *any* work, - multiplied by the number of :ref:`shader engines ` on the + multiplied by the number of :doc:`shader engines ` on the accelerator. A measure of the total possible active cycles the shader engines could be doing work, useful for the normalization of - metrics inside the :ref:`workgroup manager `. + metrics inside the :ref:`workgroup manager `. + - Cycles * - Thread-requests + - The number of unique memory addresses accessed by a single memory instruction. On AMD Instinct accelerators, this has a maximum of 64 - (that is, the size of the :ref:`wavefront `). + (that is, the size of the :ref:`wavefront `). + - Addresses * - Work-item + - A single *thread*, or lane, of execution that executes in lockstep with - the rest of the work-items comprising a :ref:`wavefront ` + the rest of the work-items comprising a :ref:`wavefront ` of execution. + - N/A * - Wavefront + - A group of work-items, or threads, that execute in lockstep on the - :ref:`compute unit `. On AMD Instinct accelerators, the + :doc:`compute unit `. On AMD Instinct accelerators, the wavefront size is always 64 work-items. + - N/A * - Workgroup + - A group of wavefronts that execute on the same - :ref:`compute unit `, and can cooperatively execute and share - data via the use of synchronization primitives, :ref:`LDS `, - atomics, and others. + :doc:`compute unit `, and can cooperatively execute and + share data via the use of synchronization primitives, + :doc:`LDS `, atomics, and others. + - N/A * - Divergence + - Divergence within a wavefront occurs when not all work-items are active when executing an instruction, that is, due to non-uniform control flow within a wavefront. 
Can reduce execution efficiency by causing, - for instance, the :ref:`VALU ` to need to execute both + for instance, the :ref:`VALU ` to need to execute both branches of a conditional with different sets of work-items active. + - N/A + + diff --git a/docs/concept/l2-cache.rst b/docs/concept/l2-cache.rst new file mode 100644 index 000000000..436ae12a5 --- /dev/null +++ b/docs/concept/l2-cache.rst @@ -0,0 +1,763 @@ +************** +L2 cache (TCC) +************** + +The L2 cache is the coherence point for current AMD Instinct™ MI-series GCN™ +GPUs and CDNA™ accelerators, and is shared by all :doc:`CUs ` +on the device. Besides serving requests from the +:doc:`vector L1 data caches `, the L2 cache is also responsible +for servicing requests from the :ref:`L1 instruction caches `, the +:ref:`scalar L1 data caches ` and the +:doc:`command processor `. The L2 cache is composed of a +number of distinct channels (32 on MI100/:ref:`MI2XX ` series CDNA +accelerators at 256B address interleaving) which can largely operate +independently. Mapping of incoming requests to a specific L2 channel is +determined by a hashing mechanism that attempts to evenly distribute requests +across the L2 channels. Requests that miss in the L2 cache are passed out to +:ref:`Infinity Fabric™ ` to be routed to the appropriate memory +location. + +The L2 cache metrics reported by Omniperf are broken down into four +categories: + +* :ref:`L2 Speed-of-Light ` + +* :ref:`L2 cache accesses ` + +* :ref:`L2-Fabric transactions ` + +* :ref:`L2-Fabric stalls ` + +.. _l2-sol: + +L2 Speed-of-Light +================= + +.. warning:: + + The theoretical maximum throughput for some metrics in this section + is currently computed with the maximum achievable clock frequency, as + reported by ``rocminfo``, for an accelerator. This may not be realistic for + all workloads. + +The L2 cache’s speed-of-light table contains a few key metrics about the +performance of the L2 cache, aggregated over all the L2 channels, as a +comparison with the peak achievable values of those metrics: + +.. list-table:: + :header-rows: 1 + + * - Metric + + - Description + + - Unit + + * - Utilization + + - The ratio of the + :ref:`number of cycles an L2 channel was active, summed over all L2 channels on the accelerator ` + over the :ref:`total L2 cycles `. + + - Percent + + * - Bandwidth + + - The number of bytes looked up in the L2 cache, as a percent of the peak + theoretical bandwidth achievable on the specific accelerator. The number + of bytes is calculated as the number of cache lines requested multiplied + by the cache line size. This value does not consider partial requests, so, + for example, if only a single value is requested in a cache line, the data + movement will still be counted as a full cache line. + + - Percent + + * - Hit Rate + + - The ratio of the number of L2 cache line requests that hit in the L2 + cache over the total number of incoming cache line requests to the L2 + cache. + + - Percent + + * - L2-Fabric Read BW + + - The number of bytes read by the L2 over the + :ref:`Infinity Fabric interface ` per unit time. + + - GB/s + + * - L2-Fabric Write and Atomic BW + + - The number of bytes sent by the L2 over the + :ref:`Infinity Fabric interface ` by write and atomic + operations per unit time. + + - GB/s + +.. note:: + + The L2 cache on AMD Instinct MI CDNA accelerators uses a "hit-on-miss" + approach to reporting cache hits. 
That is, if while satisfying a miss, + another request comes in that would hit on the same pending cache line, the + subsequent request will be counted as a 'hit'. Therefore, it is also + important to consider the latency metric in the :ref:`L2-Fabric ` + section when evaluating the L2 hit rate. + +.. _l2-cache-accesses: + +L2 cache accesses +================= + +This section details the incoming requests to the L2 cache from the +:doc:`vL1D ` and other clients -- for instance, the +:ref:`sL1D ` and :ref:`L1I ` caches. + +.. list-table:: + :header-rows: 1 + :widths: 13 70 17 + + * - Metric + + - Description + + - Unit + + * - Bandwidth + + - The number of bytes looked up in the L2 cache, per + :ref:`normalization unit `. The number of bytes is + calculated as the number of cache lines requested multiplied by the cache + line size. This value does not consider partial requests, so for example, + if only a single value is requested in a cache line, the data movement + will still be counted as a full cache line. + + - Bytes per normalization unit + + * - Requests + + - The total number of incoming requests to the L2 from all clients for all + request types, per :ref:`normalization unit `. + + - Requests per normalization unit + + * - Read Requests + + - The total number of read requests to the L2 from all clients. + + - Requests per :ref:`normalization unit ` + + * - Write Requests + + - The total number of write requests to the L2 from all clients. + + - Requests per :ref:`normalization unit ` + + * - Atomic Requests + + - The total number of atomic requests (with and without return) to the L2 + from all clients. + + - Requests per :ref:`normalization unit ` + + * - Streaming Requests + + - The total number of incoming requests to the L2 that are marked as + *streaming*. The exact meaning of this may differ depending on the + targeted accelerator; however, on an :ref:`MI2XX ` this + corresponds to + `non-temporal loads or stores `_. + The L2 cache attempts to evict *streaming* requests before normal + requests when the L2 is at capacity. + + - Requests per :ref:`normalization unit ` + + * - Probe Requests + + - The number of coherence probe requests made to the L2 cache from outside + the accelerator. On an :ref:`MI2XX `, probe requests may be + generated by, for example, writes to + :ref:`fine-grained device ` memory or by writes to + :ref:`coarse-grained ` device memory. + + - Requests per :ref:`normalization unit ` + + * - Hit Rate + + - The ratio of the number of L2 cache line requests that hit in the L2 + cache over the total number of incoming cache line requests to the L2 + cache. + + - Percent + + * - Hits + + - The total number of requests to the L2 from all clients that hit in the + cache. As noted in the :ref:`Speed-of-Light ` section, this + includes hit-on-miss requests. + + - Requests per :ref:`normalization unit ` + + * - Misses + + - The total number of requests to the L2 from all clients that miss in the + cache. As noted in the :ref:`Speed-of-Light ` section, these do + not include hit-on-miss requests. + + - Requests per :ref:`normalization unit ` + + * - Writebacks + + - The total number of L2 cache lines written back to memory for any reason. + Write-backs may occur due to user code (such as HIP kernel calls to + ``__threadfence_system`` or atomic built-ins), by the + :doc:`command processor `'s memory acquire/release + fences, or for other internal hardware reasons. 
+ + - Cache lines per :ref:`normalization unit ` + + * - Writebacks (Internal) + + - The total number of L2 cache lines written back to memory for internal + hardware reasons, per :ref:`normalization unit `. + + - Cache lines per normalization unit + + * - Writebacks (vL1D Req) + + - The total number of L2 cache lines written back to memory due to requests + initiated by the :doc:`vL1D cache `, per + :ref:`normalization unit `. + + - Cache lines per normalization unit + + * - Evictions (Normal) + + - The total number of L2 cache lines evicted from the cache due to capacity + limits, per :ref:`normalization unit `. + + - Cache lines per normalization unit + + * - Evictions (vL1D Req) + + - The total number of L2 cache lines evicted from the cache due to + invalidation requests initiated by the + :doc:`vL1D cache `, per + :ref:`normalization unit `. + + - Cache lines per normalization unit + + * - Non-hardware-Coherent Requests + + - The total number of requests to the L2 targeting Not-hardware-Coherent (NC) + memory allocations, per :ref:`normalization unit `. + See the :ref:`memory-type` for more information. + + - Requests per normalization unit + + * - Uncached Requests + + - The total number of requests to the L2 that target uncached (UC) memory + allocations. See the :ref:`memory-type` for more information. + + - Requests per :ref:`normalization unit `. + + * - Coherently Cached Requests + + - The total number of requests to the L2 that target coherently cacheable (CC) + memory allocations. See the :ref:`memory-type` for more information. + + - Requests per :ref:`normalization unit `. + + * - Read/Write Coherent Requests + + - The total number of requests to the L2 that target Read-Write coherent memory + (RW) allocations. See the :ref:`memory-type` for more information. + + - Requests per :ref:`normalization unit `. + +.. note:: + + All requests to the L2 are for a single cache line's worth of data. The size + of a cache line may vary depending on the accelerator; however, on an AMD + Instinct CDNA2 :ref:`MI2XX ` accelerator, it is 128B, while on + an MI100, it is 64B. + +.. _l2-fabric: + +L2-Fabric transactions +====================== + +Requests/data that miss in the L2 must be routed to memory in order to +service them. The backing memory for a request may be local to this +accelerator (i.e., in the local high-bandwidth memory), in a remote +accelerator’s memory, or even in the CPU’s memory. Infinity Fabric +is responsible for routing these memory requests/data to the correct +location and returning any fetched data to the L2 cache. The +:ref:`l2-request-flow` describes the flow of these requests through +Infinity Fabric in more detail, as reported by Omniperf metrics, +while :ref:`l2-request-metrics` gives detailed definitions of +individual metrics. + +.. _l2-request-flow: + +Request flow +------------ + +The following is a diagram that illustrates how L2↔Fabric requests are reported +by Omniperf: + +.. figure:: ../data/performance-model/fabric.* + :align: center + :alt: L2-Fabric transaction flow on AMD Instinct MI-series accelerators + + L2↔Fabric transaction flow on AMD Instinct MI-series accelerators. + + +Requests from the L2 cache are broken down into two major categories, read +requests and write requests (at this granularity, atomic requests are treated +as writes). + +From there, these requests can additionally be subdivided in a number of ways. +First, these requests may be sent across Infinity Fabric as different +transaction sizes, 32B or 64B on current CDNA accelerators. + +.. 
note:: + + On current CDNA accelerators, the 32B read request path is expected to be + unused and so is disconnected in the flow diagram. + +In addition, the read and write requests can be further categorized as: + +* Uncached read/write requests, for instance: for access to + :ref:`fine-grained memory ` + +* Atomic requests, for instance: for atomic updates to + :ref:`fine-grained memory ` + +* HBM read/write requests OR remote read/write requests, for instance: for + requests to the accelerator’s local HBM OR requests to a remote accelerator’s + HBM or the CPU’s DRAM + +These classifications are not necessarily *exclusive*. For example, a +write request can be classified as an atomic request to the +accelerator’s local HBM, and an uncached write request. The request-flow +diagram marks *exclusive* classifications as a splitting of the flow, +while *non-exclusive* requests do not split the flow line. For example, +a request is either a 32B Write Request OR a 64B Write request, as the +flow splits at this point: + +.. figure:: ../data/performance-model/split.* + :align: center + :alt: Splitting request flow + + Splitting request flow + +However, continuing along, the same request might be an atomic request and an +uncached write request, as reflected by a non-split flow: + +.. figure:: ../data/performance-model/nosplit.* + :align: center + :alt: Non-splitting request flow + + Non-splitting request flow + +Finally, we note that :ref:`uncached ` read requests (e.g., to +:ref:`fine-grained memory `) are handled specially on CDNA +accelerators, as indicated in the request flow diagram. These are +expected to be counted as a 64B Read Request, and *if* they are requests +to uncached memory (denoted by the dashed line), they will also be +counted as *two* uncached read requests (that is, the request is split): + +.. figure:: ../data/performance-model/uncached.* + :align: center + :alt: Uncached read-request splitting + + Uncached read-request splitting. + +.. _l2-request-metrics: + +Metrics +------- + + The following metrics are reported for the L2-Fabric interface: + +.. list-table:: + :header-rows: 1 + + * - Metric + + - Description + + - Unit + + * - L2-Fabric Read Bandwidth + + - The total number of bytes read by the L2 cache from Infinity Fabric per + :ref:`normalization unit `. + + - Bytes per normalization unit + + * - HBM Read Traffic + + - The percent of read requests generated by the L2 cache that are routed to + the accelerator's local high-bandwidth memory (HBM). This breakdown does + not consider the *size* of the request (meaning that 32B and 64B requests + are both counted as a single request), so this metric only *approximates* + the percent of the L2-Fabric Read bandwidth directed to the local HBM. + + - Percent + + * - Remote Read Traffic + + - The percent of read requests generated by the L2 cache that are routed to + any memory location other than the accelerator's local high-bandwidth + memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's + HBM. This breakdown does not consider the *size* of the request (meaning + that 32B and 64B requests are both counted as a single request), so this + metric only *approximates* the percent of the L2-Fabric Read bandwidth + directed to a remote location. + + - Percent + + * - Uncached Read Traffic + + - The percent of read requests generated by the L2 cache that are reading + from an :ref:`uncached memory allocation `. 
Note, as + described in the :ref:`request flow ` section, a single + 64B read request is typically counted as two uncached read requests. So, + it is possible for the Uncached Read Traffic to reach up to 200% of the + total number of read requests. This breakdown does not consider the + *size* of the request (i.e., 32B and 64B requests are both counted as a + single request), so this metric only *approximates* the percent of the + L2-Fabric read bandwidth directed to an uncached memory location. + + - Percent + + * - L2-Fabric Write and Atomic Bandwidth + + - The total number of bytes written by the L2 over Infinity Fabric by write + and atomic operations per + :ref:`normalization unit `. Note that on current + CDNA accelerators, such as the :ref:`MI2XX `, requests are + only considered *atomic* by Infinity Fabric if they are targeted at + non-write-cacheable memory, for example, + :ref:`fine-grained memory ` allocations or + :ref:`uncached memory ` allocations on the + MI2XX. + + - Bytes per normalization unit + + * - HBM Write and Atomic Traffic + + - The percent of write and atomic requests generated by the L2 cache that + are routed to the accelerator's local high-bandwidth memory (HBM). This + breakdown does not consider the *size* of the request (meaning that 32B + and 64B requests are both counted as a single request), so this metric + only *approximates* the percent of the L2-Fabric Write and Atomic + bandwidth directed to the local HBM. Note that on current CDNA + accelerators, such as the :ref:`MI2XX `, requests are only + considered *atomic* by Infinity Fabric if they are targeted at + :ref:`fine-grained memory ` allocations or + :ref:`uncached memory ` allocations. + + - Percent + + * - Remote Write and Atomic Traffic + + - The percent of write and atomic requests generated by the L2 cache that are routed to + any memory location other than the accelerator's local high-bandwidth + memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's + HBM. This breakdown does not consider the *size* of the request (meaning + that 32B and 64B requests are both counted as a single request), so this + metric only *approximates* the percent of the L2-Fabric Write and Atomic bandwidth + directed to a remote location. Note that on current CDNA + accelerators, such as the :ref:`MI2XX `, requests are only + considered *atomic* by Infinity Fabric if they are targeted at + :ref:`fine-grained memory ` allocations or + :ref:`uncached memory ` allocations. + + - Percent + + * - Atomic Traffic + + - The percent of write requests generated by the L2 cache that are atomic + requests to *any* memory location. This breakdown does not consider the + *size* of the request (meaning that 32B and 64B requests are both counted + as a single request), so this metric only *approximates* the percent of + the L2-Fabric Write and Atomic bandwidth that is atomic. Note that on + current CDNA accelerators, such as the :ref:`MI2XX `, + requests are only considered *atomic* by Infinity Fabric if they are + targeted at :ref:`fine-grained memory ` allocations or + :ref:`uncached memory ` allocations. + + - Percent + + * - Uncached Write and Atomic Traffic + + - The percent of write and atomic requests generated by the L2 cache that + are targeting :ref:`uncached memory allocations `. 
This + breakdown does not consider the *size* of the request (meaning that 32B + and 64B requests are both counted as a single request), so this metric + only *approximates* the percent of the L2-Fabric Write and Atomic bandwidth directed + to uncached memory allocations. + + - Percent + + * - Read Latency + + - The time-averaged number of cycles read requests spent in Infinity Fabric + before data was returned to the L2. + + - Cycles + + * - Write Latency + + - The time-averaged number of cycles write requests spent in Infinity + Fabric before a completion acknowledgement was returned to the L2. + + - Cycles + + * - Atomic Latency + + - The time-averaged number of cycles atomic requests spent in Infinity + Fabric before a completion acknowledgement (atomic without return value) + or data (atomic with return value) was returned to the L2. + + - Cycles + + * - Read Stall + + - The ratio of the total number of cycles the L2-Fabric interface was + stalled on a read request to any destination (local HBM, remote PCIe + connected accelerator or CPU, or remote Infinity Fabric connected + accelerator [#inf]_ or CPU) over the + :ref:`total active L2 cycles `. + + - Percent + + * - Write Stall + + - The ratio of the total number of cycles the L2-Fabric interface was + stalled on a write or atomic request to any destination (local HBM, + remote PCIe connected accelerator or CPU, or remote + Infinity Fabric connected accelerator [#inf]_ or CPU) over the + :ref:`total active L2 cycles `. + + - Percent + +.. _l2-detailed-metrics: + +Detailed transaction metrics +---------------------------- + +The following metrics are available in the detailed L2-Fabric +transaction breakdown table: + +.. list-table:: + :header-rows: 1 + + * - Metric + + - Description + + - Unit + + * - 32B Read Requests + + - The total number of L2 requests to Infinity Fabric to read 32B of data + from any memory location, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. Typically unused on CDNA + accelerators. + + - Requests per normalization unit + + * - Uncached Read Requests + + - The total number of L2 requests to Infinity Fabric to read + :ref:`uncached data ` from any memory location, per + :ref:`normalization unit `. 64B requests for + uncached data are counted as two 32B uncached data requests. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - 64B Read Requests + + - The total number of L2 requests to Infinity Fabric to read 64B of data + from any memory location, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - HBM Read Requests + + - The total number of L2 requests to Infinity Fabric to read 32B or 64B of + data from the accelerator's local HBM, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - Remote Read Requests + + - The total number of L2 requests to Infinity Fabric to read 32B or 64B of + data from any source other than the accelerator's local HBM, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - 32B Write and Atomic Requests + + - The total number of L2 requests to Infinity Fabric to write or atomically + update 32B of data to any memory location, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. 
+ + - Requests per normalization unit + + * - Uncached Write and Atomic Requests + + - The total number of L2 requests to Infinity Fabric to write or atomically + update 32B or 64B of :ref:`uncached data `, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - 64B Write and Atomic Requests + + - The total number of L2 requests to Infinity Fabric to write or atomically + update 64B of data in any memory location, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - HBM Write and Atomic Requests + + - The total number of L2 requests to Infinity Fabric to write or atomically + update 32B or 64B of data in the accelerator's local HBM, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - Remote Write and Atomic Requests + + - The total number of L2 requests to Infinity Fabric to write or atomically + update 32B or 64B of data in any memory location other than the + accelerator's local HBM, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. + + - Requests per normalization unit + + * - Atomic Requests + + - The total number of L2 requests to Infinity Fabric to atomically update + 32B or 64B of data in any memory location, per + :ref:`normalization unit `. See + :ref:`l2-request-flow` for more detail. Note that on current CDNA + accelerators, such as the :ref:`MI2XX `, requests are only + considered *atomic* by Infinity Fabric if they are targeted at + non-write-cacheable memory, such as + :ref:`fine-grained memory ` allocations or + :ref:`uncached memory ` allocations on the MI2XX. + + - Requests per normalization unit + +.. _l2-fabric-stalls: + +L2-Fabric interface stalls +========================== + +When the interface between the L2 cache and Infinity Fabric becomes backed up by +requests, it may stall, preventing the L2 from issuing additional requests to +Infinity Fabric until prior requests complete. This section gives a breakdown of +what types of requests in a kernel caused a stall (like read versus write), and +to which locations -- for instance, to the accelerator’s local memory, or to +remote accelerators or CPUs. + +.. list-table:: + :header-rows: 1 + + * - Metric + + - Description + + - Unit + + * - Read - PCIe Stall + + - The number of cycles the L2-Fabric interface was stalled on read requests + to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the + :ref:`total active L2 cycles `. + + - Percent + + * - Read - Infinity Fabric Stall + + - The number of cycles the L2-Fabric interface was stalled on read requests + to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a + percent of the :ref:`total active L2 cycles `. + + - Percent + + * - Read - HBM Stall + + - The number of cycles the L2-Fabric interface was stalled on read requests + to the accelerator's local HBM as a percent of the + :ref:`total active L2 cycles `. + + - Percent + + * - Write - PCIe Stall + + - The number of cycles the L2-Fabric interface was stalled on write or + atomic requests to remote PCIe connected accelerators [#inf]_ or CPUs as + a percent of the :ref:`total active L2 cycles `. 
+ + - Percent + + * - Write - Infinity Fabric Stall + + - The number of cycles the L2-Fabric interface was stalled on write or + atomic requests to remote Infinity Fabric connected accelerators [#inf]_ + or CPUs as a percent of the + :ref:`total active L2 cycles `. + + - Percent + + * - Write - HBM Stall + + - The number of cycles the L2-Fabric interface was stalled on write or + atomic requests to the accelerator's local HBM as a percent of the + :ref:`total active L2 cycles `. + + - Percent + + * - Write - Credit Starvation + + - The number of cycles the L2-Fabric interface was stalled on write or + atomic requests to any memory location because too many write/atomic + requests were currently in flight, as a percent of the + :ref:`total active L2 cycles `. + + - Percent + +.. warning:: + + On current CDNA accelerators and GCN GPUs, these L2↔Fabric stalls can be undercounted in some circumstances. + +.. rubric:: Footnotes + +.. [#inf] In addition to being used for on-accelerator data-traffic, AMD + `Infinity Fabric `_ + technology can be used to connect multiple accelerators to achieve advanced + peer-to-peer connectivity and enhanced bandwidths over traditional PCIe + connections. Some AMD Instinct MI accelerators, like the MI250X, + `feature coherent CPU↔accelerator connections built using AMD Infinity Fabric `_. diff --git a/docs/concept/local-data-share.rst b/docs/concept/local-data-share.rst new file mode 100644 index 000000000..9ed8069ab --- /dev/null +++ b/docs/concept/local-data-share.rst @@ -0,0 +1,174 @@ +********************** +Local data share (LDS) +********************** + +LDS Speed-of-Light +================== + +.. warning:: + + The theoretical maximum throughput for some metrics in this section is + currently computed with the maximum achievable clock frequency, as reported + by ``rocminfo``, for an accelerator. This may not be realistic for all + workloads. + +The :ref:`LDS ` speed-of-light chart shows a number of key metrics for +the LDS as a comparison with the peak achievable values of those metrics. + +.. list-table:: + :header-rows: 1 + + * - Metric + + - Description + + - Unit + + * - Utilization + + - Indicates what percent of the kernel's duration the :ref:`LDS ` + was actively executing instructions (including, but not limited to, load, + store, atomic and HIP's ``__shfl`` operations). Calculated as the ratio + of the total number of cycles LDS was active over the + :ref:`total CU cycles `. + + - Percent + + * - Access Rate + + - Indicates the percentage of SIMDs in the :ref:`VALU ` [#1]_ + actively issuing LDS instructions, averaged over the lifetime of the + kernel. Calculated as the ratio of the total number of cycles spent by + the :ref:`scheduler ` issuing :ref:`LDS ` + instructions over the + :ref:`total CU cycles `. + + - Percent + + * - Theoretical Bandwidth (% of Peak) + + - Indicates the maximum amount of bytes that *could* have been loaded from, + stored to, or atomically updated in the LDS in this kernel, as a percent + of the peak LDS bandwidth achievable. See the + :ref:`LDS bandwidth example ` for more detail. + + - Percent + + * - Bank Conflict Rate + + - Indicates the percentage of active LDS cycles that were spent servicing + bank conflicts. Calculated as the ratio of LDS cycles spent servicing + bank conflicts over the number of LDS cycles that would have been + required to move the same amount of data in an uncontended access. [#2]_ + + - Percent + +.. rubric:: Footnotes + +.. 
[#1] Here we assume the typical case where the workload evenly distributes + LDS operations over all SIMDs in a CU (that is, waves on different SIMDs are + executing similar code). For highly unbalanced workloads, where e.g., one + SIMD pair in the CU does not issue LDS instructions at all, this metric is + better interpreted as the percentage of SIMDs issuing LDS instructions on + :ref:`SIMD pairs ` that are actively using the LDS, averaged over + the lifetime of the kernel. + +.. [#2] The maximum value of the bank conflict rate is less than 100% + (specifically: 96.875%), as the first cycle in the + :ref:`LDS scheduler ` is never considered contended. + +Statistics +========== + +The LDS statistics panel gives a more detailed view of the hardware: + +.. list-table:: + :header-rows: 1 + + * - Metric + + - Description + + - Unit + + * - LDS Instructions + + - The total number of LDS instructions (including, but not limited to, + read/write/atomics and HIP's ``__shfl`` instructions) executed per + :ref:`normalization unit `. + + - Instructions per normalization unit + + * - Theoretical Bandwidth + + - Indicates the maximum amount of bytes that could have been loaded from, + stored to, or atomically updated in the LDS per + :ref:`normalization unit `. Does *not* take into + account the execution mask of the wavefront when the instruction was + executed. See the + :ref:`LDS bandwidth example ` for more detail. + + - Bytes per normalization unit + + * - LDS Latency + + - The average number of round-trip cycles (i.e., from issue to data-return + / acknowledgment) required for an LDS instruction to complete. + + - Cycles + + * - Bank Conflicts/Access + + - The ratio of the number of cycles spent in the + :ref:`LDS scheduler ` due to bank conflicts (as determined by + the conflict resolution hardware) to the base number of cycles that would + be spent in the LDS scheduler in a completely uncontended case. This is + the unnormalized form of the Bank Conflict Rate. + + - Conflicts/Access + + * - Index Accesses + + - The total number of cycles spent in the :ref:`LDS scheduler ` + over all operations per :ref:`normalization unit `. + + - Cycles per normalization unit + + * - Atomic Return Cycles + + - The total number of cycles spent on LDS atomics with return per + :ref:`normalization unit `. + + - Cycles per normalization unit + + * - Bank Conflicts + + - The total number of cycles spent in the :ref:`LDS scheduler ` + due to bank conflicts (as determined by the conflict resolution hardware) + per :ref:`normalization unit `. + + - Cycles per normalization unit + + * - Address Conflicts + + - The total number of cycles spent in the :ref:`LDS scheduler ` + due to address conflicts (as determined by the conflict resolution + hardware) per :ref:`normalization unit `. + + - Cycles per normalization unit + + * - Unaligned Stall + + - The total number of cycles spent in the :ref:`LDS scheduler ` + due to stalls from non-dword aligned addresses per :ref:`normalization unit `. + + - Cycles per normalization unit + + * - Memory Violations + + - The total number of out-of-bounds accesses made to the LDS, per + :ref:`normalization unit `. This is unused and + expected to be zero in most configurations for modern CDNA accelerators. 
+
+     - Accesses per normalization unit
+
diff --git a/docs/concept/performance-model.rst b/docs/concept/performance-model.rst
index f43625616..f49de6df0 100644
--- a/docs/concept/performance-model.rst
+++ b/docs/concept/performance-model.rst
@@ -17,8 +17,8 @@ To best use profiling data, it's important to understand the role of various
 hardware blocks of AMD Instinct accelerators. This section describes each
 hardware block on the accelerator as interacted with by a software developer to
 give a deeper understanding of the metrics reported by profiling data. Refer to
-:doc:`` for more practical examples and details on how to
-use Omniperf to optimize your code.
+:doc:`<../how-to/profile/mode>` for more practical examples and details on how
+to use Omniperf to optimize your code.
 
 .. _mixxx-note:
 
@@ -45,8 +45,8 @@ use Omniperf to optimize your code.
 References
 ==========
 
-Some sections in the following materials might refer the following
-publicly available documentation.
+Some sections in this chapter cite the following publicly available
+documentation.
 
 * :hip-training-pdf:`Introduction to AMD GPU Programming with HIP <>`
 
diff --git a/docs/concept/pipeline-descriptions.rst b/docs/concept/pipeline-descriptions.rst
new file mode 100644
index 000000000..7d58a092d
--- /dev/null
+++ b/docs/concept/pipeline-descriptions.rst
@@ -0,0 +1,290 @@
+*********************
+Pipeline descriptions
+*********************
+
+.. _desc-valu:
+
+.. _desc-vmem:
+
+Vector arithmetic logic unit (VALU)
+-----------------------------------
+
+The vector arithmetic logic unit (VALU) executes vector instructions
+over an entire wavefront, each `work-item `__ (or,
+vector-lane) potentially operating on distinct data. The VALU of a CDNA
+accelerator or GCN GPU typically consists of:
+
+* Four 16-wide SIMD processors (see :hip-training-2019:`24` for more details).
+
+* Four 64 or 128 KiB VGPR files (yielding 256-512 KiB total per CU); see
+  :ref:`AGPRs ` for more detail.
+
+* An instruction buffer (per-SIMD) that contains execution slots for up
+  to 8 wavefronts (for 32 total wavefront slots on each CU).
+
+* A vector memory (VMEM) unit which transfers data between VGPRs and
+  memory; each work-item supplies its own memory address and supplies
+  or receives unique data.
+
+* CDNA accelerators, such as the MI100 and :ref:`MI2XX `, contain
+  additional
+  :amd-lab-note:`Matrix Fused Multiply-Add (MFMA) `
+  units.
+
+To support branching and conditionals, each wavefront in the VALU
+has a distinct execution mask which determines which work-items in the
+wavefront are active for the currently executing instruction. When
+executing a VALU instruction, inactive work-items (according to the
+current execution mask of the wavefront) do not execute the instruction
+and are treated as no-ops.
+
+.. note::
+
+   On GCN GPUs and the CDNA MI100 accelerator, there are slots for up to 10
+   wavefronts in the instruction buffer, but generally occupancy is limited by
+   other factors to 32 waves per :doc:`compute unit `.
+   On the CDNA2 :ref:`MI2XX ` series accelerators, there are only 8
+   waveslots per-SIMD.
+
+.. _desc-salu:
+
+.. _desc-smem:
+
+Scalar arithmetic logic unit (SALU)
+-----------------------------------
+
+The scalar arithmetic logic unit (SALU) executes instructions that are
+shared between all work-items in a wavefront. This includes control flow --
+such as if/else conditionals, branches, and looping -- pointer arithmetic,
+loading common values, and more.
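+As a rough illustration of the split between the two units, consider the
+following HIP kernel (a sketch; names are illustrative, and whether a given
+value actually lands in SGPRs is ultimately a compiler decision). Values that
+are provably uniform across a wavefront are candidates for SGPRs and the SALU,
+while per-work-item values live in VGPRs and execute on the VALU.
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   __global__ void axpy(const float* __restrict__ x, float* __restrict__ y,
+                        float alpha, int n) {
+     // blockIdx.x and blockDim.x are uniform across the wavefront, so the
+     // base offset can be computed once on the SALU and held in SGPRs.
+     const int base = blockIdx.x * blockDim.x;
+
+     // threadIdx.x differs per work-item: this address computation and the
+     // multiply-add below are vector work, executed on the VALU.
+     const int i = base + threadIdx.x;
+     if (i < n) {
+       y[i] = alpha * x[i] + y[i];
+     }
+   }
+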
+
+The SALU consists of:
+
+* A scalar processor capable of various arithmetic, conditional, and
+  comparison (etc.) operations. See
+  :mi200-isa-pdf:`Chapter 5. Scalar ALU Operations <35>`
+  of the CDNA2 Instruction Set Architecture (ISA) Reference Guide for more
+  detail.
+
+* A 12.5 KiB Scalar General Purpose Register (SGPR) file
+
+* A scalar memory (SMEM) unit which transfers data between SGPRs and
+  memory
+
+Data loaded by the SMEM can be cached in the
+:ref:`scalar L1 data cache `, and is typically only used for read-only,
+uniform accesses such as kernel arguments, or HIP's ``__constant__`` memory.
+
+.. _desc-lds:
+
+Local data share (LDS)
+----------------------
+
+The local data share (LDS, also known as shared memory) is a fast on-CU
+scratchpad that can be explicitly managed by software to effectively share
+data and to coordinate between wavefronts in a workgroup.
+
+.. figure:: ../data/performance-model/lds.*
+   :align: center
+   :alt: Performance model of the local data share (LDS) on AMD Instinct
+         accelerators
+
+   Performance model of the local data share (LDS) on AMD Instinct MI-series
+   accelerators.
+
+Above is Omniperf's performance model of the LDS on CDNA accelerators (adapted
+from :mantor-gcn-pdf:`20`). The SIMDs in the :ref:`VALU ` are
+connected to the LDS in pairs (see above). Only one SIMD per pair may issue an
+LDS instruction at a time, but both pairs may issue concurrently.
+
+On CDNA accelerators, the LDS contains 32 banks and each bank is 4B wide.
+The LDS is designed such that each bank can be read from, written to, or
+atomically updated every cycle, for a total throughput of 128B/clock
+:gcn-crash-course:`40`.
+
+On each of the two ports to the SIMDs, 64B can be sent in each direction per
+cycle. So, a single wavefront, coming from one of the 2 SIMDs in a pair, can
+only get back 64B/cycle (16 lanes per cycle). The input port is shared between
+data and address and this can affect achieved bandwidth for different data
+sizes. For example, a 64-wide store where each lane is sending a 4B value takes
+8 cycles (50% peak bandwidth) while a 64-wide store where each lane is sending
+a 16B value takes 20 cycles (80% peak bandwidth).
+
+In addition, the LDS contains conflict-resolution hardware to detect and handle
+bank conflicts. A bank conflict occurs when two (or more)
+:ref:`work-items ` in a :ref:`wavefront ` want
+to read, write, or atomically update different addresses that map to the same
+bank in the same cycle. In this case, the conflict detection hardware
+determines a new schedule such that the access is split into multiple cycles
+with no conflicts in any single cycle.
+
+When multiple work-items want to read from the same address within a bank, the
+result can be efficiently broadcasted :gcn-crash-course:`41`. Multiple
+work-items writing to the same address within a bank typically results in
+undefined behavior in HIP and other languages, as the LDS will write the value
+from the last work-item as determined by the hardware scheduler
+:gcn-crash-course:`41`. This behavior may be useful in the very specific case
+of storing a uniform value.
+
+Relatedly, an address conflict is defined as occurring when two (or more)
+work-items in a wavefront want to atomically update the same address on the
+same cycle. As with a bank conflict, this may cause additional cycles of work
+for the LDS operation to complete.
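+
+The common way to avoid bank conflicts on a 32-bank, 4B-wide LDS is to pad
+shared arrays so that strided (column-wise) accesses spread across banks. A
+minimal HIP transpose sketch (illustrative; it assumes a 32×32 workgroup):
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   constexpr int TILE = 32;
+
+   __global__ void transpose_lds(const float* in, float* out, int n) {
+     // Padding the row length by one float (4B) shifts each row by one
+     // bank, so the column-wise read below touches 32 distinct banks.
+     __shared__ float tile[TILE][TILE + 1];
+
+     int x = blockIdx.x * TILE + threadIdx.x;
+     int y = blockIdx.y * TILE + threadIdx.y;
+     if (x < n && y < n) {
+       tile[threadIdx.y][threadIdx.x] = in[y * n + x];
+     }
+     __syncthreads();
+
+     x = blockIdx.y * TILE + threadIdx.x;
+     y = blockIdx.x * TILE + threadIdx.y;
+     if (x < n && y < n) {
+       // Without the +1 padding, this read would map every work-item in a
+       // row of the wavefront to the same bank (a 32-way bank conflict).
+       out[y * n + x] = tile[threadIdx.x][threadIdx.y];
+     }
+   }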
+
+.. _desc-branch:
+
+Branch
+------
+
+The branch unit is responsible for executing jumps and branches to implement
+control flow operations. Note that branch operations are not used for
+execution mask updates, but only for “whole wavefront” control-flow changes.
+
+.. _desc-scheduler:
+
+Scheduler
+---------
+
+The scheduler is responsible for arbitration and issue of instructions for all
+the wavefronts currently executing on the :doc:`CU `. On every
+clock cycle, the scheduler:
+
+* Considers waves from one of the SIMD units for execution, selected in a
+  round-robin fashion between the SIMDs in the compute unit
+
+* Issues up to one instruction per wavefront on the selected SIMD
+
+* Issues up to one instruction from each of the instruction categories among
+  the waves on the selected SIMD:
+
+  * :ref:`VALU ` / :ref:`VMEM ` operations
+
+  * :ref:`SALU ` / SMEM operations
+
+  * :ref:`LDS ` operations
+
+  * :ref:`Branch ` operations
+
+This gives a maximum of five issued Instructions Per Cycle (IPC), per-SIMD,
+per-CU (:hip-training-pdf:`Introduction to AMD GPU Programming with HIP <>`,
+:gcn-crash-course:`The AMD GCN Architecture - A Crash Course <>`). On CDNA
+accelerators with :ref:`MFMA ` instructions, these are issued via the
+:ref:`VALU `. Some of them will execute on a separate functional unit
+and typically allow other :ref:`VALU ` operations to execute in their
+shadow (see the :ref:`MFMA ` section for more detail).
+
+.. note::
+
+   The IPC model used by Omniperf omits the following two complications for
+   clarity. First, CDNA accelerators contain other execution units on the CU
+   that are unused for compute applications. Second, so-called "internal"
+   instructions (see :gcn-crash-course:`29`) are not issued to a functional
+   unit, and can technically cause the maximum IPC to *exceed* 5 instructions
+   per cycle in special (largely unrealistic) cases. The latter issue is
+   discussed in more detail in the
+   :ref:`'internal' IPC ` example.
+
+.. _desc-mfma:
+
+Matrix fused multiply-add (MFMA)
+--------------------------------
+
+CDNA accelerators, such as the MI100 and :ref:`MI2XX `, contain
+specialized hardware to accelerate matrix-matrix multiplications, also
+known as Matrix Fused Multiply-Add (MFMA) operations. The exact
+operation types and supported formats may vary by accelerator. Refer to the
+:amd-lab-note:`AMD matrix cores `
+blog post on GPUOpen for a general discussion of these hardware units.
+In addition, to explore the available MFMA instructions in-depth on
+various AMD accelerators (including the CDNA line), we recommend the
+`AMD Matrix Instruction Calculator `_.
+
+.. code-block:: shell
+   :caption: Partial snapshot of the AMD Matrix Instruction Calculator tool
+
+   $ ./matrix_calculator.py --architecture cdna2 --instruction v_mfma_f32_4x4x1f32 --detail-instruction
+   Architecture: CDNA2
+   Instruction: V_MFMA_F32_4X4X1F32
+       Encoding: VOP3P-MAI
+       VOP3P Opcode: 0x42
+       VOP3P-MAI Opcode: 0x2
+       Matrix Dimensions:
+           M: 4
+           N: 4
+           K: 1
+           blocks: 16
+       Execution statistics:
+           FLOPs: 512
+           Execution cycles: 8
+           FLOPs/CU/cycle: 256
+           Can co-execute with VALU: True
+           VALU co-execution cycles possible: 4
+       Register usage:
+           GPRs required for A: 1
+           GPRs required for B: 1
+           GPRs required for C: 4
+           GPRs required for D: 4
+           GPR alignment requirement: 8 bytes
+
+For the purposes of Omniperf, the MFMA unit is typically treated as a separate
+pipeline from the :ref:`VALU `, as other VALU instructions (along
+with other execution pipelines such as the :ref:`SALU `) can be
+issued during a portion of the total duration of an MFMA operation.
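+
+In HIP code, MFMA operations are typically reached through compiler builtins
+(or libraries such as rocBLAS built on top of them). The following sketch is
+an assumption-laden illustration -- it targets CDNA2-class hardware (for
+example, compiled with ``--offload-arch=gfx90a``), the output lane mapping is
+simplified, and names are illustrative -- issuing the same
+``v_mfma_f32_4x4x1f32`` instruction described above:
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   // A 4-wide float vector type for the accumulator, as the builtin expects.
+   typedef float float4_t __attribute__((ext_vector_type(4)));
+
+   __global__ void mfma_4x4x1(const float* a, const float* b, float* d) {
+     float4_t acc = {0.f, 0.f, 0.f, 0.f};
+
+     // Each lane contributes one element of A and one element of B; the
+     // hardware performs 16 blocks of a 4x4x1 matrix multiply-accumulate.
+     // The trailing arguments (cbsz, abid, blgp) are left at 0 here.
+     acc = __builtin_amdgcn_mfma_f32_4x4x1f32(a[threadIdx.x], b[threadIdx.x],
+                                              acc, 0, 0, 0);
+
+     // Simplified write-out; consult the Matrix Instruction Calculator for
+     // the exact per-lane result layout of this instruction.
+     for (int i = 0; i < 4; ++i)
+       d[threadIdx.x * 4 + i] = acc[i];
+   }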
+
+.. note::
+
+   The exact details of VALU and MFMA operation co-execution vary by
+   instruction, and can be explored in more detail via the following fields in
+   the
+   `AMD Matrix Instruction Calculator `_'s
+   detailed instruction information.
+
+   * ``Can co-execute with VALU``
+
+   * ``VALU co-execution cycles possible``
+
+Non-pipeline resources
+----------------------
+
+In this section, we describe a few resources that are not standalone
+pipelines but are important for understanding performance optimization
+on CDNA accelerators.
+
+.. _desc-barrier:
+
+Barrier
+^^^^^^^
+
+Barriers are resources on the compute unit of a CDNA accelerator that
+are used to implement synchronization primitives (for example, HIP's
+``__syncthreads``). Barriers are allocated to any workgroup that
+consists of more than a single wavefront.
+
+.. _desc-agprs:
+
+Accumulation vector general-purpose registers (AGPRs)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Accumulation vector general-purpose registers, or AGPRs, are special
+resources that are accessible to a subset of instructions focused on
+:ref:`MFMA ` operations. These registers allow the MFMA
+unit to access more than the normal maximum of 256 architected
+:ref:`vector general-purpose registers (VGPRs) ` by having up to 256
+in the architected space and up to 256 in the accumulation space.
+Traditional VALU instructions can only use VGPRs in the architected
+space, and data can be moved between VGPRs and AGPRs using specialized
+instructions (``v_accvgpr_*``). These data movement instructions may be
+used by the compiler to implement lower-cost register spills and fills on
+architectures with AGPRs.
+
+AGPRs are not available on all AMD Instinct accelerators. GCN GPUs,
+such as the AMD Instinct MI50, had a 256 KiB VGPR file. The AMD
+Instinct MI100 (CDNA) has a 2x256 KiB register file, where one half
+is available as general-purpose VGPRs, and the other half is for matrix
+math accumulation VGPRs (AGPRs). The AMD Instinct :ref:`MI2XX `
+(CDNA2) has a 512 KiB VGPR file per CU, where each wave can dynamically request
+up to 256 KiB of VGPRs and an additional 256 KiB of AGPRs. For more information,
+refer to
+``__.
diff --git a/docs/concept/pipeline-metrics.rst b/docs/concept/pipeline-metrics.rst
new file mode 100644
index 000000000..e70f39192
--- /dev/null
+++ b/docs/concept/pipeline-metrics.rst
@@ -0,0 +1,830 @@
+****************
+Pipeline metrics
+****************
+
+In this section, we describe the metrics available in Omniperf to analyze the
+pipelines discussed in the :doc:`pipeline-descriptions`.
+
+.. _wavefront:
+
+Wavefront
+=========
+
+.. _wavefront-launch-stats:
+
+Wavefront launch stats
+----------------------
+
+The wavefront launch stats panel gives general information about the
+kernel launch.
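+
+For example, consider this hypothetical HIP launch (a sketch; the sizes are
+illustrative):
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   __global__ void kernel(float* out) {
+     out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
+   }
+
+   int main() {
+     float* out;
+     (void)hipMalloc(&out, 1048576 * sizeof(float));
+
+     dim3 grid(4096);   // 4096 workgroups (blocks)
+     dim3 block(256);   // 256 work-items (threads) per workgroup
+     hipLaunchKernelGGL(kernel, grid, block, 0, 0, out);
+     (void)hipDeviceSynchronize();
+     return 0;
+   }
+
+For this launch, the panel below would report a Grid Size of 1,048,576
+work-items (4096 × 256), a Workgroup Size of 256 work-items, and -- on an
+accelerator with 64-wide wavefronts -- 16,384 Total Wavefronts.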
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Grid Size
+
+     - The total number of work-items (or, threads) launched as a part of
+       the kernel dispatch. In HIP, this is equivalent to the total grid size
+       multiplied by the total workgroup (or, block) size.
+
+     - :ref:`Work-items `
+
+   * - Workgroup Size
+
+     - The total number of work-items (or, threads) in each workgroup
+       (or, block) launched as part of the kernel dispatch. In HIP, this is
+       equivalent to the total block size.
+
+     - :ref:`Work-items `
+
+   * - Total Wavefronts
+
+     - The total number of wavefronts launched as part of the kernel dispatch.
+       On AMD Instinct CDNA accelerators and GCN GPUs, the wavefront size is
+       always 64 work-items. Thus, the total number of wavefronts should be
+       equivalent to the ceiling of grid size divided by 64.
+
+     - :ref:`Wavefronts `
+
+   * - Saved Wavefronts
+
+     - The total number of wavefronts saved at a context-save. See
+       `cwsr_enable `_.
+
+     - :ref:`Wavefronts `
+
+   * - Restored Wavefronts
+
+     - The total number of wavefronts restored from a context-save. See
+       `cwsr_enable `_.
+
+     - :ref:`Wavefronts `
+
+   * - VGPRs
+
+     - The number of architected vector general-purpose registers allocated
+       for the kernel, see :ref:`VALU `. Note: this may not exactly
+       match the number of VGPRs requested by the compiler due to allocation
+       granularity.
+
+     - :ref:`VGPRs `
+
+   * - AGPRs
+
+     - The number of accumulation vector general-purpose registers allocated
+       for the kernel, see :ref:`AGPRs `. Note: this may not exactly
+       match the number of AGPRs requested by the compiler due to allocation
+       granularity.
+
+     - :ref:`AGPRs `
+
+   * - SGPRs
+
+     - The number of scalar general-purpose registers allocated for the kernel,
+       see :ref:`SALU `. Note: this may not exactly match the number
+       of SGPRs requested by the compiler due to allocation granularity.
+
+     - :ref:`SGPRs `
+
+   * - LDS Allocation
+
+     - The number of bytes of :doc:`LDS ` memory (or, shared
+       memory) allocated for this kernel. Note: this may also be larger than
+       what was requested at compile time due to both allocation granularity
+       and dynamic per-dispatch LDS allocations.
+
+     - Bytes per :ref:`workgroup `
+
+   * - Scratch Allocation
+
+     - The number of bytes of :ref:`scratch memory ` requested
+       per work-item for this kernel. Scratch memory is used for stack memory
+       on the accelerator, as well as for register spills and restores.
+
+     - Bytes per :ref:`work-item `
+
+.. _wavefront-runtime-stats:
+
+Wavefront runtime stats
+-----------------------
+
+The wavefront runtime stats panel gives a high-level overview of the
+execution of wavefronts in a kernel:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 18 65 17
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - :ref:`Kernel time `
+
+     - The total duration of the executed kernel. Note: this should not be
+       directly compared to the wavefront cycles / timings below.
+
+     - Nanoseconds
+
+   * - :ref:`Kernel cycles `
+
+     - The total duration of the executed kernel in cycles. Note: this should
+       not be directly compared to the wavefront cycles / timings below.
+
+     - Cycles
+
+   * - Instructions per wavefront
+
+     - The average number of instructions (of all types) executed per
+       wavefront. This is averaged over all wavefronts in a kernel dispatch.
+
+     - Instructions / wavefront
+
+   * - Wave cycles
+
+     - The number of cycles a wavefront in the kernel dispatch spent resident
+       on a compute unit per :ref:`normalization unit `. This
+       is averaged over all wavefronts in a kernel dispatch. Note: this should
+       not be directly compared to the kernel cycles above.
+
+     - Cycles per :ref:`normalization unit `
+
+   * - Dependency wait cycles
+
+     - The number of cycles a wavefront in the kernel dispatch stalled waiting
+       on memory of any kind (such as instruction fetch, or vector or scalar
+       memory) per :ref:`normalization unit `. This counter
+       is incremented at every cycle by *all* wavefronts on a CU stalled at a
+       memory operation.
+       As such, it is most useful to get a sense of how waves were spending
+       their time, rather than identification of a precise limiter, because
+       another wave could be actively executing while a wave is stalled. The
+       sum of this metric, Issue Wait Cycles and Active Cycles should be
+       equal to the total Wave Cycles metric.
+
+     - Cycles per :ref:`normalization unit `
+
+   * - Issue Wait Cycles
+
+     - The number of cycles a wavefront in the kernel dispatch was unable to
+       issue an instruction for any reason (such as execution pipe
+       back-pressure or arbitration loss) per
+       :ref:`normalization unit `. This counter is
+       incremented at every cycle by *all* wavefronts on a CU unable to issue
+       an instruction. As such, it is most useful to get a sense of how waves
+       were spending their time, rather than identification of a precise
+       limiter, because another wave could be actively executing while a wave
+       is issue stalled. The sum of this metric, Dependency Wait Cycles and
+       Active Cycles should be equal to the total Wave Cycles metric.
+
+     - Cycles per :ref:`normalization unit `
+
+   * - Active Cycles
+
+     - The average number of cycles a wavefront in the kernel dispatch was
+       actively executing instructions per
+       :ref:`normalization unit `. This measurement is made
+       on a per-wavefront basis, and may include cycles that another wavefront
+       spent actively executing (on another execution unit, for example) or
+       was stalled. As such, it is most useful to get a sense of how waves
+       were spending their time, rather than identification of a precise
+       limiter. The sum of this metric, Issue Wait Cycles and Dependency Wait
+       Cycles should be equal to the total Wave Cycles metric.
+
+     - Cycles per :ref:`normalization unit `
+
+   * - Wavefront Occupancy
+
+     - The time-averaged number of wavefronts resident on the accelerator over
+       the lifetime of the kernel. Note: this metric may be inaccurate for
+       short-running kernels (less than 1ms).
+
+     - :ref:`Wavefronts `
+
+.. note::
+
+   As mentioned earlier, the measurement of kernel cycles and time typically
+   cannot directly be compared to, for example, Wave Cycles. This is due to
+   two factors: first, the kernel cycles/timings are measured using a counter
+   that is impacted by scheduling overhead; this is particularly noticeable
+   for "short-running" kernels (less than 1ms) where scheduling overhead
+   forms a significant portion of the overall kernel runtime. Second, the
+   Wave Cycles metric is incremented per-wavefront scheduled to a SIMD every
+   cycle whereas the kernel cycles counter is incremented only once per-cycle
+   when *any* wavefront is scheduled.
+
+.. _instruction-mix:
+
+Instruction mix
+===============
+
+The instruction mix panel shows a breakdown of the various types of
+instructions executed by your kernel, and which pipelines on the
+:doc:`CU ` they were executed on. In addition, Omniperf reports
+further information about the breakdown of operation types for the
+:ref:`VALU `, vector-memory, and :ref:`MFMA `
+instructions.
+
+.. note::
+
+   All metrics in this section count *instructions issued*, and *not* the
+   total number of operations executed. The values reported by these metrics
+   will not change regardless of the execution mask of the wavefront. Note
+   that even if the execution mask is identically zero (meaning that *no
+   lanes are active*) the instruction will still be counted, as CDNA
+   accelerators still consider these instructions *issued*. See, for example,
+   :mi200-isa-pdf:`EXECute Mask, section 3.3 of the CDNA2 ISA guide <19>` for
+   further details.
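+
+As a concrete sketch of this issued-versus-executed distinction, consider the
+following HIP fragment (illustrative; the lane counts assume a 64-wide
+wavefront, and the compiler may choose predication or a real branch here):
+
+.. code-block:: cpp
+
+   __global__ void masked_add(float* x) {
+     // Only 16 of the 64 lanes in each wavefront satisfy this condition,
+     // so the body may execute under an execution mask with 16 active
+     // lanes rather than a wavefront-wide branch.
+     if (threadIdx.x % 64 < 16) {
+       x[threadIdx.x] += 1.0f;
+     }
+     // The instruction-mix panel still counts the issued VALU add (and the
+     // associated VMEM operations) once per wavefront, exactly as if all
+     // 64 lanes had been active.
+   }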
+
+Overall instruction mix
+-----------------------
+
+This panel shows the total number of each type of instruction issued to
+the :ref:`various compute pipelines ` on the
+:ref:`CU `. These are:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - :ref:`VALU ` instructions
+
+     - The total number of vector arithmetic logic unit (VALU) operations
+       issued. These are the workhorses of the
+       :doc:`compute unit `, and are used to execute a wide range
+       of instruction types including floating point operations, non-uniform
+       address calculations, transcendental operations, integer operations,
+       shifts, and conditional evaluation.
+
+     - Instructions
+
+   * - VMEM instructions
+
+     - The total number of vector memory operations issued. These include most
+       loads, stores and atomic operations and all accesses to
+       :ref:`generic, global, private and texture ` memory.
+
+     - Instructions
+
+   * - :doc:`LDS ` instructions
+
+     - The total number of LDS (also known as shared memory) operations
+       issued. These include loads, stores, atomics, and HIP's ``__shfl``
+       operations.
+
+     - Instructions
+
+   * - :ref:`MFMA ` instructions
+
+     - The total number of matrix fused multiply-add instructions issued.
+
+     - Instructions
+
+   * - :ref:`SALU ` instructions
+
+     - The total number of scalar arithmetic logic unit (SALU) operations
+       issued. Typically these are used for address calculations, literal
+       constants, and other operations that are *provably* uniform across a
+       wavefront. Although scalar memory (SMEM) operations are issued by the
+       SALU, they are counted separately in this section.
+
+     - Instructions
+
+   * - SMEM instructions
+
+     - The total number of scalar memory (SMEM) operations issued. These are
+       typically used for loading kernel arguments, base-pointers and loads
+       from HIP's ``__constant__`` memory.
+
+     - Instructions
+
+   * - :ref:`Branch ` instructions
+
+     - The total number of branch operations issued. These typically consist
+       of jump or branch operations and are used to implement control flow.
+
+     - Instructions
+
+.. note::
+
+   As mentioned in the :ref:`desc-branch` section: branch operations are not
+   used for execution mask updates, but only for "whole wavefront" control
+   flow changes.
+
+VALU arithmetic instruction mix
+-------------------------------
+
+.. warning::
+
+   Not all metrics in this section (for instance, the floating-point
+   instruction breakdowns) are available on CDNA accelerators older than the
+   :ref:`MI2XX ` series.
+
+This panel details the various types of vector instructions that were
+issued to the :ref:`VALU `. The metrics in this section do *not*
+include :ref:`MFMA ` instructions using the same precision; for
+instance, the “F16-ADD” metric does not include any 16-bit floating point
+additions executed as part of an MFMA instruction using the same precision.
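+
+One subtlety worth knowing when reading these breakdowns: compilers typically
+contract a multiply followed by a dependent add into a single fused
+multiply-add. A sketch (illustrative; the exact code generation depends on
+optimization level and floating-point contraction flags such as
+``-ffp-contract``):
+
+.. code-block:: cpp
+
+   __global__ void mix_example(const float* a, const float* b, float* c) {
+     int i = threadIdx.x;
+     // Typically compiled to a single v_fma_f32, so this counts toward
+     // F32-FMA rather than one F32-MUL plus one F32-ADD.
+     c[i] = a[i] * b[i] + c[i];
+   }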
+
+.. list-table::
+   :header-rows: 1
+   :widths: 15 65 20
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - INT32
+
+     - The total number of instructions operating on 32-bit integer operands
+       issued to the VALU per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - INT64
+
+     - The total number of instructions operating on 64-bit integer operands
+       issued to the VALU per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F16-ADD
+
+     - The total number of addition instructions operating on 16-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F16-MUL
+
+     - The total number of multiplication instructions operating on 16-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F16-FMA
+
+     - The total number of fused multiply-add instructions operating on 16-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F16-TRANS
+
+     - The total number of transcendental instructions (such as ``sqrt``)
+       operating on 16-bit floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F32-ADD
+
+     - The total number of addition instructions operating on 32-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F32-MUL
+
+     - The total number of multiplication instructions operating on 32-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F32-FMA
+
+     - The total number of fused multiply-add instructions operating on 32-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F32-TRANS
+
+     - The total number of transcendental instructions (such as ``sqrt``)
+       operating on 32-bit floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F64-ADD
+
+     - The total number of addition instructions operating on 64-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F64-MUL
+
+     - The total number of multiplication instructions operating on 64-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F64-FMA
+
+     - The total number of fused multiply-add instructions operating on 64-bit
+       floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - F64-TRANS
+
+     - The total number of transcendental instructions (such as ``sqrt``)
+       operating on 64-bit floating-point operands issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Conversion
+
+     - The total number of type conversion instructions (such as converting
+       data between F32 and F64) issued to the VALU per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+For an example of these counters in action, refer to
+:ref:``.
+
+VMEM instruction mix
+--------------------
+
+This section breaks down the types of vector memory (VMEM) instructions
+that were issued. Refer to the `Instruction Counts metrics
+section `__ of the address-processor frontend of the vL1D cache for
+a description of these VMEM instructions.
+
+MFMA instruction mix
+--------------------
+
+.. warning::
+
+   The metrics in this section are only available on CDNA2
+   (:ref:`MI2XX `) accelerators and newer.
+
+This section details the types of Matrix Fused Multiply-Add
+(:ref:`MFMA `) instructions that were issued.
Note that MFMA
+instructions are classified by the type of input data they operate on, and
+*not* the data type the result is accumulated to.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 60 17
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - MFMA-I8 Instructions
+
+     - The total number of 8-bit integer :ref:`MFMA ` instructions
+       issued per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - MFMA-F16 Instructions
+
+     - The total number of 16-bit floating point :ref:`MFMA `
+       instructions issued per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - MFMA-BF16 Instructions
+
+     - The total number of 16-bit brain floating point :ref:`MFMA `
+       instructions issued per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - MFMA-F32 Instructions
+
+     - The total number of 32-bit floating-point :ref:`MFMA `
+       instructions issued per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - MFMA-F64 Instructions
+
+     - The total number of 64-bit floating-point :ref:`MFMA `
+       instructions issued per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+Compute pipeline
+================
+
+.. _metrics-flop-count:
+
+FLOP counting conventions
+-------------------------
+
+Omniperf's conventions for VALU FLOP counting are as follows:
+
+* Addition or multiplication: 1 operation
+
+* Transcendentals: 1 operation
+
+* Fused multiply-add (FMA): 2 operations
+
+Integer operations (IOPs) do not use this convention. They are counted
+as a single operation regardless of the instruction type.
+
+.. note::
+
+   Packed operations which operate on multiple operands in the same
+   instruction are counted identically to the underlying instruction type.
+   For example, the ``v_pk_add_f32`` instruction on :ref:`MI2XX `,
+   which performs an add operation on two pairs of aligned 32-bit
+   floating-point operands, is counted only as a single addition -- that is,
+   1 operation.
+
+As discussed in the :ref:`instruction-mix` section, the FLOP/IOP
+metrics in this section do not take into account the execution mask of
+the operation, and will report the same value even if the execution mask
+is identically zero.
+
+For example, an FMA instruction operating on 32-bit floating-point
+operands (such as ``v_fma_f32`` on a :ref:`MI2XX ` accelerator)
+would be counted as 128 total FLOPs: 2 operations (due to the
+instruction type) multiplied by 64 (the number of work-items in a
+wavefront).
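+
+Expressed as a formula, the FLOP count of a single VALU instruction under
+these conventions is:
+
+.. math::
+
+   \text{FLOPs} = \text{ops per instruction} \times \text{wavefront size}
+
+where the ops per instruction is 2 for FMA and 1 otherwise; for
+``v_fma_f32`` this gives 2 × 64 = 128 FLOPs.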
+
+Compute Speed-of-Light
+----------------------
+
+.. warning::
+
+   The theoretical maximum throughput for some metrics in this section is
+   currently computed with the maximum achievable clock frequency, as
+   reported by ``rocminfo``, for an accelerator. This may not be realistic
+   for all workloads.
+
+This section reports the number of floating-point and integer operations
+executed on the :ref:`VALU ` and :ref:`MFMA ` units in
+various precisions. Note that unlike the :ref:`VALU instruction mix ` and
+:ref:`MFMA instruction mix ` sections, the metrics here are reported as
+FLOPs and IOPs, that is, the total number of operations executed.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - VALU FLOPs
+
+     - The total floating-point operations executed per second on the
+       :ref:`VALU `. This is also presented as a percent of the peak
+       theoretical FLOPs achievable on the specific accelerator. Note: this
+       does not include any floating-point operations from
+       :ref:`MFMA ` instructions.
+
+     - GFLOPs
+
+   * - VALU IOPs
+
+     - The total integer operations executed per second on the
+       :ref:`VALU `. This is also presented as a percent of the peak
+       theoretical IOPs achievable on the specific accelerator. Note: this
+       does not include any integer operations from :ref:`MFMA `
+       instructions.
+
+     - GIOPs
+
+   * - MFMA FLOPs (BF16)
+
+     - The total number of 16-bit brain floating point :ref:`MFMA `
+       operations executed per second. Note: this does not include any 16-bit
+       brain floating point operations from :ref:`VALU `
+       instructions. This is also presented as a percent of the peak
+       theoretical BF16 MFMA operations achievable on the specific
+       accelerator.
+
+     - GFLOPs
+
+   * - MFMA FLOPs (F16)
+
+     - The total number of 16-bit floating point :ref:`MFMA `
+       operations executed per second. Note: this does not include any 16-bit
+       floating point operations from :ref:`VALU ` instructions.
+       This is also presented as a percent of the peak theoretical F16 MFMA
+       operations achievable on the specific accelerator.
+
+     - GFLOPs
+
+   * - MFMA FLOPs (F32)
+
+     - The total number of 32-bit floating point :ref:`MFMA `
+       operations executed per second. Note: this does not include any 32-bit
+       floating point operations from :ref:`VALU ` instructions.
+       This is also presented as a percent of the peak theoretical F32 MFMA
+       operations achievable on the specific accelerator.
+
+     - GFLOPs
+
+   * - MFMA FLOPs (F64)
+
+     - The total number of 64-bit floating point :ref:`MFMA `
+       operations executed per second. Note: this does not include any 64-bit
+       floating point operations from :ref:`VALU ` instructions.
+       This is also presented as a percent of the peak theoretical F64 MFMA
+       operations achievable on the specific accelerator.
+
+     - GFLOPs
+
+   * - MFMA IOPs (INT8)
+
+     - The total number of 8-bit integer :ref:`MFMA ` operations
+       executed per second. Note: this does not include any 8-bit integer
+       operations from :ref:`VALU ` instructions. This is also
+       presented as a percent of the peak theoretical INT8 MFMA operations
+       achievable on the specific accelerator.
+
+     - GIOPs
+
+Pipeline statistics
+-------------------
+
+This section reports a number of key performance characteristics of
+various execution units on the :doc:`CU `. Refer to
+:ref:`ipc-example` for a detailed dive into these metrics, and
+:ref:`scheduler ` for a high-level overview of execution units
+and instruction issue.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - IPC
+
+     - The ratio of the total number of instructions executed on the
+       :doc:`CU ` over the
+       :ref:`total active CU cycles `.
+
+     - Instructions per-cycle
+
+   * - IPC (Issued)
+
+     - The ratio of the total number of (non-:ref:`internal `)
+       instructions issued over the number of cycles where the
+       :ref:`scheduler ` was actively working on issuing
+       instructions. Refer to the :ref:`Issued IPC ` example for further
+       detail.
+
+     - Instructions per-cycle
+
+   * - SALU utilization
+
+     - Indicates what percent of the kernel's duration the
+       :ref:`SALU ` was busy executing instructions. Computed as
+       the ratio of the total number of cycles spent by the
+       :ref:`scheduler ` issuing :ref:`SALU ` /
+       :ref:`SMEM ` instructions over the
+       :ref:`total CU cycles `.
+
+     - Percent
+
+   * - VALU utilization
+
+     - Indicates what percent of the kernel's duration the
+       :ref:`VALU ` was busy executing instructions. Does not
+       include :ref:`VMEM ` operations. Computed as the ratio of
+       the total number of cycles spent by the :ref:`scheduler `
+       issuing :ref:`VALU ` instructions over the
+       :ref:`total CU cycles `.
+
+     - Percent
+
+   * - VMEM utilization
+
+     - Indicates what percent of the kernel's duration the
+       :ref:`VMEM ` unit was busy executing instructions, including
+       both global/generic and spill/scratch operations (see the
+       :ref:`VMEM instruction count metrics ` for more detail). Does not
+       include :ref:`VALU ` operations. Computed as the ratio of
+       the total number of cycles spent by the :ref:`scheduler `
+       issuing :ref:`VMEM ` instructions over the
+       :ref:`total CU cycles `.
+
+     - Percent
+
+   * - Branch utilization
+
+     - Indicates what percent of the kernel's duration the
+       :ref:`branch ` unit was busy executing instructions.
+       Computed as the ratio of the total number of cycles spent by the
+       :ref:`scheduler ` issuing :ref:`branch `
+       instructions over the :ref:`total CU cycles `.
+
+     - Percent
+
+   * - VALU Active Threads
+
+     - Indicates the average level of :ref:`divergence ` within a wavefront
+       over the lifetime of the kernel. The number of work-items that were
+       active in a wavefront during execution of each :ref:`VALU `
+       instruction, time-averaged over all VALU instructions run on all
+       wavefronts in the kernel.
+
+     - Work-items
+
+   * - MFMA Utilization
+
+     - Indicates what percent of the kernel's duration the
+       :ref:`MFMA ` unit was busy executing instructions. Computed
+       as the ratio of the total number of cycles the :ref:`MFMA `
+       unit was busy over the :ref:`total CU cycles `.
+
+     - Percent
+
+   * - MFMA Instruction Cycles
+
+     - The average duration of :ref:`MFMA ` instructions in this
+       kernel in cycles. Computed as the ratio of the total number of cycles
+       the :ref:`MFMA ` unit was busy over the total number of
+       :ref:`MFMA ` instructions. Compare to, for example, the
+       `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
+
+     - Cycles per instruction
+
+   * - VMEM Latency
+
+     - The average number of round-trip cycles (that is, from issue to
+       data-return / acknowledgment) required for a VMEM instruction to
+       complete.
+
+     - Cycles
+
+   * - SMEM Latency
+
+     - The average number of round-trip cycles (that is, from issue to
+       data-return / acknowledgment) required for a SMEM instruction to
+       complete.
+
+     - Cycles
+
+.. note::
+
+   The branch utilization reported in this section also includes time spent
+   in other instruction types (namely, ``s_endpgm``) that are *typically* a
+   very small percentage of the overall kernel execution. This complication
+   is omitted for simplicity, but may result in small amounts of "branch"
+   utilization (much less than 1%) for otherwise branch-less kernels.
+
+Arithmetic operations
+---------------------
+
+This section reports the total number of floating-point and integer
+operations executed in various precisions. Unlike the
+:ref:`Compute speed-of-light ` panel, this section reports both
+:ref:`VALU ` and :ref:`MFMA ` operations of the same
+precision (such as F32) in the same metric. Additionally, this panel lets you
+control how the data is normalized (that is, control the
+:ref:`normalization unit `), while the speed-of-light panel does not. For
+more detail on how operations are counted see the
+:ref:`metrics-flop-count` section.
+
+.. warning::
+
+   As discussed in the :ref:`instruction-mix` section, the metrics in this
+   section do not take into account the execution mask of the operation, and
+   will report the same value even if the execution mask is identically zero.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 18 65 17
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - FLOPs (Total)
+
+     - The total number of floating-point operations executed on either the
+       :ref:`VALU ` or :ref:`MFMA ` units, per
+       :ref:`normalization unit `.
+
+     - FLOP per normalization unit
+
+   * - IOPs (Total)
+
+     - The total number of integer operations executed on either the
+       :ref:`VALU ` or :ref:`MFMA ` units, per
+       :ref:`normalization unit `.
+
+     - IOP per normalization unit
+
+   * - F16 OPs
+
+     - The total number of 16-bit floating-point operations executed on
+       either the :ref:`VALU ` or :ref:`MFMA ` units, per
+       :ref:`normalization unit `.
+
+     - FLOP per normalization unit
+
+   * - BF16 OPs
+
+     - The total number of 16-bit brain floating-point operations executed on
+       either the :ref:`VALU ` or :ref:`MFMA ` units, per
+       :ref:`normalization unit `. Note: on current CDNA accelerators, the
+       :ref:`VALU ` has no native BF16 instructions.
+
+     - FLOP per normalization unit
+
+   * - F32 OPs
+
+     - The total number of 32-bit floating-point operations executed on
+       either the :ref:`VALU ` or :ref:`MFMA ` units, per
+       :ref:`normalization unit `.
+
+     - FLOP per normalization unit
+
+   * - F64 OPs
+
+     - The total number of 64-bit floating-point operations executed on
+       either the :ref:`VALU ` or :ref:`MFMA ` units, per
+       :ref:`normalization unit `.
+
+     - FLOP per normalization unit
+
+   * - INT8 OPs
+
+     - The total number of 8-bit integer operations executed on either the
+       :ref:`VALU ` or :ref:`MFMA ` units, per
+       :ref:`normalization unit `. Note: on current CDNA accelerators, the
+       :ref:`VALU ` has no native INT8 instructions.
+
+     - IOP per normalization unit
diff --git a/docs/concept/shader-engine.rst b/docs/concept/shader-engine.rst
new file mode 100644
index 000000000..97282e31c
--- /dev/null
+++ b/docs/concept/shader-engine.rst
@@ -0,0 +1,695 @@
+******************
+Shader engine (SE)
+******************
+
+The :doc:`compute units ` on a CDNA accelerator are grouped
+together into a higher-level organizational unit called a shader engine (SE):
+
+.. figure:: ../data/performance-model/selayout.png
+   :align: center
+   :alt: Example of CU-grouping into shader engines
+
+   Example of CU-grouping into shader engines on AMD Instinct MI-series
+   accelerators.
+
+The number of CUs on an SE varies from chip to chip -- see for example
+:hip-training-pdf:`20`. In addition, newer accelerators such as the AMD
+Instinct MI250X have 8 SEs per accelerator.
+
+For the purposes of Omniperf, we consider resources that are shared between
+multiple CUs on a single SE as part of the SE's metrics.
+
+These include:
+
+* The :ref:`scalar L1 data cache `
+
+* The :ref:`L1 instruction cache `
+
+* The :ref:`workgroup manager `
+
+.. _desc-sl1d:
+
+Scalar L1 data cache (sL1D)
+===========================
+
+The Scalar L1 Data cache (sL1D) can cache data accessed from scalar load
+instructions (and scalar store instructions on architectures where they exist)
+from wavefronts in the :doc:`CUs `. The sL1D is shared between
+multiple CUs (:gcn-crash-course:`36`) -- the exact number of CUs depends on
+the architecture in question (3 CUs in GCN GPUs and MI100, 2 CUs in
+:ref:`MI2XX `) -- and is backed by the :doc:`L2 cache `.
+
+In typical usage, the data in the sL1D comprises:
+
+* Kernel arguments, such as pointers,
+  `non-populated `_
+  grid and block dimensions, and others
+
+* HIP's ``__constant__`` memory, when accessed in a provably uniform manner
+  [#uniform-access]_
+
+* Other memory, when accessed in a provably uniform manner, *and* the backing
+  memory is provably constant [#uniform-access]_
+
+.. [#uniform-access] The scalar data cache is used when the compiler emits
+   scalar loads to access data. This requires that the data be *provably*
+   uniformly accessed (that is, the compiler can verify that all work-items
+   in a wavefront access the same data), *and* that the data can be proven to
+   be read-only (for instance, HIP's ``__constant__`` memory, or properly
+   ``__restrict__``\ed pointers to avoid write-aliasing). Access of
+   ``__constant__`` memory for example is not guaranteed to go through the
+   sL1D if the wavefront loads a non-uniform value.
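+
+As an illustration, kernel arguments and provably uniform, read-only loads
+are where scalar loads typically appear. A HIP sketch (illustrative; whether
+the compiler actually emits scalar ``s_load`` instructions is a code
+generation decision):
+
+.. code-block:: cpp
+
+   __global__ void scale(const float* __restrict__ in,
+                         float* __restrict__ out,
+                         float factor, int n) {
+     // The kernel arguments (pointers, factor, n) are uniform across the
+     // wavefront and are natural candidates for scalar loads cached in
+     // the sL1D.
+     int i = blockIdx.x * blockDim.x + threadIdx.x;
+     if (i < n) {
+       // The per-work-item load of in[i] is *not* uniform, so it goes
+       // through the vector memory path (vL1D), not the sL1D.
+       out[i] = factor * in[i];
+     }
+   }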
+
+.. _desc-sl1d-sol:
+
+Scalar L1D Speed-of-Light
+-------------------------
+
+.. warning::
+
+   The theoretical maximum throughput for some metrics in this section is
+   currently computed with the maximum achievable clock frequency, as
+   reported by ``rocminfo``, for an accelerator. This may not be realistic
+   for all workloads.
+
+The Scalar L1D speed-of-light chart shows some key metrics of the sL1D
+cache as a comparison with the peak achievable values of those metrics:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Bandwidth
+
+     - The number of bytes looked up in the sL1D cache, as a percent of the
+       peak theoretical bandwidth. Calculated as the ratio of sL1D requests
+       over the :ref:`total sL1D cycles `.
+
+     - Percent
+
+   * - Cache Hit Rate
+
+     - The percent of sL1D requests that hit [#sl1d-cache]_ on a previously
+       loaded line in the cache. Calculated as the ratio of the number of
+       sL1D requests that hit over the number of all sL1D requests.
+
+     - Percent
+
+   * - sL1D-L2 BW
+
+     - The number of bytes requested by the sL1D from the L2 cache, as a
+       percent of the peak theoretical sL1D → L2 cache bandwidth. Calculated
+       as the ratio of the total number of requests from the sL1D to the L2
+       cache over the :ref:`total sL1D-L2 interface cycles `.
+
+     - Percent
+
+Scalar L1D cache accesses
+-------------------------
+
+This panel gives more detail on the types of accesses made to the sL1D,
+and the hit/miss statistics.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Requests
+
+     - The total number of requests, of any size or type, made to the sL1D
+       per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Hits
+
+     - The total number of sL1D requests that hit on a previously loaded
+       cache line, per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Misses - Non Duplicated
+
+     - The total number of sL1D requests that missed on a cache line that
+       *was not* already pending due to another request, per
+       :ref:`normalization unit `. See
+       :ref:`desc-sl1d-sol` for more detail.
+
+     - Requests per normalization unit
+
+   * - Misses - Duplicated
+
+     - The total number of sL1D requests that missed on a cache line that
+       *was* already pending due to another request, per
+       :ref:`normalization unit `. See
+       :ref:`desc-sl1d-sol` for more detail.
+
+     - Requests per normalization unit
+
+   * - Cache Hit Rate
+
+     - Indicates the percent of sL1D requests that hit on a previously loaded
+       line in the cache. The ratio of the number of sL1D requests that hit
+       [#sl1d-cache]_ over the number of all sL1D requests.
+
+     - Percent
+
+   * - Read Requests (Total)
+
+     - The total number of sL1D read requests of any size, per
+       :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Atomic Requests
+
+     - The total number of sL1D atomic requests of any size, per
+       :ref:`normalization unit `. Typically unused on CDNA
+       accelerators.
+
+     - Requests per normalization unit
+
+   * - Read Requests (1 DWord)
+
+     - The total number of sL1D read requests made for a single dword of data
+       (4B), per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Read Requests (2 DWord)
+
+     - The total number of sL1D read requests made for two dwords of data
+       (8B), per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Read Requests (4 DWord)
+
+     - The total number of sL1D read requests made for four dwords of data
+       (16B), per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Read Requests (8 DWord)
+
+     - The total number of sL1D read requests made for eight dwords of data
+       (32B), per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Read Requests (16 DWord)
+
+     - The total number of sL1D read requests made for sixteen dwords of data
+       (64B), per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+.. rubric:: Footnotes
+
+.. [#sl1d-cache] Unlike the :doc:`vL1D ` and
+   :doc:`L2 ` caches, the sL1D cache on AMD Instinct MI-series CDNA
+   accelerators does *not* use the "hit-on-miss" approach to reporting cache
+   hits. That is, if while satisfying a miss, another request comes in that
+   would hit on the same pending cache line, the subsequent request will be
+   counted as a *duplicated miss*.
+
+sL1D ↔ L2 Interface
+-------------------
+
+This panel gives more detail on the data requested across the
+sL1D↔:doc:`L2 ` interface.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - sL1D-L2 BW
+
+     - The total number of bytes read from, written to, or atomically updated
+       across the sL1D↔:doc:`L2 ` interface, per
+       :ref:`normalization unit `. Note that sL1D writes
+       and atomics are typically unused on current CDNA accelerators, so in
+       the majority of cases this can be interpreted as an sL1D→L2 read
+       bandwidth.
+
+     - Bytes per normalization unit
+
+   * - Read Requests
+
+     - The total number of read requests from the sL1D to the
+       :doc:`L2 `, per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Write Requests
+
+     - The total number of write requests from the sL1D to the
+       :doc:`L2 `, per :ref:`normalization unit `.
+       Typically unused on current CDNA accelerators.
+
+     - Requests per normalization unit
+
+   * - Atomic Requests
+
+     - The total number of atomic requests from the sL1D to the
+       :doc:`L2 `, per
+       :ref:`normalization unit `. Typically unused on
+       current CDNA accelerators.
+
+     - Requests per normalization unit
+
+   * - Stall Cycles
+
+     - The total number of cycles the sL1D↔:doc:`L2 ` interface was
+       stalled, per :ref:`normalization unit `.
+
+     - Cycles per normalization unit
+
+.. _desc-l1i:
+
+L1 Instruction Cache (L1I)
+==========================
+
+As with the :ref:`sL1D `, the L1 Instruction (L1I) cache is shared
+between multiple CUs on a shader engine, where the precise number of CUs
+sharing an L1I depends on the architecture in question
+(:gcn-crash-course:`36`) and is backed by the :doc:`L2 cache `.
+Unlike the sL1D, the instruction cache is read-only.
+
+.. _desc-l1i-sol:
+
+L1I Speed-of-Light
+------------------
+
+.. warning::
+
+   The theoretical maximum throughput for some metrics in this section is
+   currently computed with the maximum achievable clock frequency, as
+   reported by ``rocminfo``, for an accelerator. This may not be realistic
+   for all workloads.
+
+The L1 Instruction Cache speed-of-light chart shows some key metrics of
+the L1I cache as a comparison with the peak achievable values of those
+metrics:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Bandwidth
+
+     - The number of bytes looked up in the L1I cache, as a percent of the
+       peak theoretical bandwidth. Calculated as the ratio of L1I requests
+       over the :ref:`total L1I cycles `.
+
+     - Percent
+
+   * - Cache Hit Rate
+
+     - The percent of L1I requests that hit on a previously loaded line in
+       the cache. Calculated as the ratio of the number of L1I requests that
+       hit [#l1i-cache]_ over the number of all L1I requests.
+
+     - Percent
+
+   * - L1I-L2 BW
+
+     - The percent of the peak theoretical L1I → L2 cache request bandwidth
+       achieved. Calculated as the ratio of the total number of requests from
+       the L1I to the L2 cache over the
+       :ref:`total L1I-L2 interface cycles `.
+
+     - Percent
+
+   * - Instruction Fetch Latency
+
+     - The average number of cycles spent to fetch instructions to a
+       :doc:`CU `.
+
+     - Cycles
+
+L1I cache accesses
+------------------
+
+This panel gives more detail on the hit/miss statistics of the L1I:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Requests
+
+     - The total number of requests made to the L1I per
+       :ref:`normalization-unit `.
+
+     - Requests per normalization unit
+
+   * - Hits
+
+     - The total number of L1I requests that hit on a previously loaded cache
+       line, per :ref:`normalization-unit `.
+
+     - Requests per normalization unit
+
+   * - Misses - Non Duplicated
+
+     - The total number of L1I requests that missed on a cache line that
+       *was not* already pending due to another request, per
+       :ref:`normalization-unit `. See note in
+       :ref:`desc-l1i-sol` for more detail.
+
+     - Requests per normalization unit
+
+   * - Misses - Duplicated
+
+     - The total number of L1I requests that missed on a cache line that
+       *was* already pending due to another request, per
+       :ref:`normalization-unit `. See note in
+       :ref:`desc-l1i-sol` for more detail.
+
+     - Requests per normalization unit
+
+   * - Cache Hit Rate
+
+     - The percent of L1I requests that hit [#l1i-cache]_ on a previously
+       loaded line in the cache. Calculated as the ratio of the number of L1I
+       requests that hit over the number of all L1I requests.
+
+     - Percent
+
+.. rubric:: Footnotes
+
+.. [#l1i-cache] Unlike the :doc:`vL1D ` and
+   :doc:`L2 ` caches, the L1I cache on AMD Instinct MI-series CDNA
+   accelerators does *not* use the "hit-on-miss" approach to reporting cache
+   hits. That is, if while satisfying a miss, another request comes in that
+   would hit on the same pending cache line, the subsequent request will be
+   counted as a *duplicated miss*.
+ +L1I - L2 interface +------------------ + +This panel gives more detail on the data requested across the +L1I-:doc:`L2 ` interface. + +.. list-table:: + :header-rows: 1 + + * - Metric + + - Description + + - Unit + + * - L1I-L2 BW + + - The total number of bytes read across the L1I-:doc:`L2 ` + interface, per :ref:`normalization unit `. + + - Bytes per normalization unit + +.. _desc-spi: + +Workgroup manager (SPI) +======================= + +The workgroup manager (SPI) is the bridge between the +:doc:`command processor ` and the +:doc:`compute units `. After the command processor processes a +kernel dispatch, it will then pass the dispatch off to the workgroup manager, +which then schedules :ref:`workgroups ` onto the compute units. +As workgroups complete execution and resources become available, the +workgroup manager will schedule new workgroups onto compute units. The workgroup +manager’s metrics therefore are focused on reporting the following: + +* Utilizations of various parts of the accelerator that the workgroup + manager interacts with (and the workgroup manager itself) + +* How many workgroups were dispatched, their size, and how many + resources they used + +* Percent of scheduler opportunities (cycles) where workgroups failed + to dispatch, and + +* Percent of scheduler opportunities (cycles) where workgroups failed + to dispatch due to lack of a specific resource on the CUs (for instance, too + many VGPRs allocated) + +This gives you an idea of why the workgroup manager couldn’t schedule more +wavefronts onto the device, and is most useful for workloads that you suspect to +be limited by scheduling or launch rate. + +As discussed in :doc:`Command processor `, the command +processor on AMD Instinct MI-series architectures contains four hardware +scheduler-pipes, each with eight software threads (:mantor-vega-10:`19`). Each +scheduler-pipe can issue a kernel dispatch to the workgroup manager to schedule +concurrently. Therefore, some workgroup manager metrics are presented relative +to the utilization of these scheduler-pipes (for instance, whether all four are +issuing concurrently). + +.. note:: + + Current versions of the profiling libraries underlying Omniperf attempt to + serialize concurrent kernels running on the accelerator, as the performance + counters on the device are global (that is, shared between concurrent + kernels). This means that these scheduler-pipe utilization metrics are + expected to reach (for example) a maximum of one pipe active -- only 25%. + +Workgroup manager utilizations +------------------------------ + +This section describes the utilization of the workgroup manager, and the +hardware components it interacts with. + +.. list-table:: + :header-rows: 1 + :widths: 20 65 15 + + * - Metric + + - Description + + - Unit + + * - Accelerator utilization + + - The percent of cycles in the kernel where the accelerator was actively + doing any work. + + - Percent + + * - Scheduler-pipe utilization + + - The percent of :ref:`total scheduler-pipe cycles ` in + the kernel where the scheduler-pipes were actively doing any work. Note: + this value is expected to range between 0% and 25%. See :ref:`desc-spi`. + + - Percent + + * - Workgroup manager utilization + + - The percent of cycles in the kernel where the workgroup manager was + actively doing any work. 
+
+     - Percent
+
+   * - Shader engine utilization
+
+     - The percent of :ref:`total shader engine cycles ` in the
+       kernel where any CU in a shader engine was actively doing any work,
+       normalized over all shader engines. Low values (much less than 100%)
+       indicate that the accelerator was not fully saturated by the kernel,
+       or a potential load-imbalance issue.
+
+     - Percent
+
+   * - SIMD utilization
+
+     - The percent of :ref:`total SIMD cycles ` in the kernel
+       where any :ref:`SIMD ` on a CU was actively doing any work,
+       summed over all CUs. Low values (less than 100%) indicate that the
+       accelerator was not fully saturated by the kernel, or a potential
+       load-imbalance issue.
+
+     - Percent
+
+   * - Dispatched workgroups
+
+     - The total number of workgroups forming this kernel launch.
+
+     - Workgroups
+
+   * - Dispatched wavefronts
+
+     - The total number of wavefronts, summed over all workgroups, forming
+       this kernel launch.
+
+     - Wavefronts
+
+   * - VGPR writes
+
+     - The average number of cycles spent initializing :ref:`VGPRs `
+       at wave creation.
+
+     - Cycles/wave
+
+   * - SGPR writes
+
+     - The average number of cycles spent initializing :ref:`SGPRs `
+       at wave creation.
+
+     - Cycles/wave
+
+Resource allocation
+-------------------
+
+This panel gives more detail on how workgroups and wavefronts were scheduled
+onto compute units, and what occupancy limiters they hit -- if any. When
+analyzing these metrics, you should also take into account the achieved
+occupancy -- such as
+:ref:`wavefront occupancy `. A kernel may be occupancy
+limited by LDS usage, for example, but may still achieve high occupancy levels
+such that improving occupancy further may not improve performance. See
+:ref:`occupancy-example` for details.
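+
+When one of the limiters below dominates (for example, Insufficient SIMD
+VGPRs), one common mitigation is to bound the workgroup size so the compiler
+can target a smaller register budget. A hedged HIP sketch (the specific
+bounds are illustrative, not a recommendation):
+
+.. code-block:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   // __launch_bounds__ tells the compiler the maximum workgroup size this
+   // kernel will be launched with and, optionally, a desired minimum number
+   // of waves per execution unit, allowing it to cap VGPR usage so more
+   // wavefronts can be resident concurrently.
+   __global__ void __launch_bounds__(256, 4)
+   bounded_kernel(float* data, int n) {
+     int i = blockIdx.x * blockDim.x + threadIdx.x;
+     if (i < n) {
+       data[i] *= 2.0f;
+     }
+   }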
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Not-scheduled rate (Workgroup Manager)
+
+     - The percent of :ref:`total scheduler-pipe cycles ` in
+       the kernel where a workgroup could not be scheduled to a
+       :doc:`CU ` due to a bottleneck within the workgroup manager
+       rather than a lack of a CU or :ref:`SIMD ` with sufficient
+       resources. Note: this value is expected to range between 0% and 25%.
+       See the note in the :ref:`workgroup manager ` description.
+
+     - Percent
+
+   * - Not-scheduled rate (Scheduler-Pipe)
+
+     - The percent of :ref:`total scheduler-pipe cycles ` in
+       the kernel where a workgroup could not be scheduled to a
+       :doc:`CU ` due to a bottleneck within the scheduler-pipes
+       rather than a lack of a CU or :ref:`SIMD ` with sufficient
+       resources. Note: this value is expected to range between 0% and 25%.
+       See the note in the :ref:`workgroup manager ` description.
+
+     - Percent
+
+   * - Scheduler-Pipe Stall Rate
+
+     - The percent of :ref:`total scheduler-pipe cycles ` in
+       the kernel where a workgroup could not be scheduled to a
+       :doc:`CU ` due to occupancy limitations (such as a lack of a
+       CU or :ref:`SIMD ` with sufficient resources). Note: this
+       value is expected to range between 0% and 25%. See the note in the
+       :ref:`workgroup manager ` description.
+
+     - Percent
+
+   * - Scratch Stall Rate
+
+     - The percent of :ref:`total shader-engine cycles ` in the
+       kernel where a workgroup could not be scheduled to a
+       :doc:`CU ` due to lack of
+       :ref:`private (a.k.a. scratch) memory ` slots. While this
+       can reach up to 100%, note that the actual occupancy limitations on a
+       kernel using private memory are typically quite small (for example, less
+       than 1% of the total number of waves that can be scheduled to an
+       accelerator).
+
+     - Percent
+
+   * - Insufficient SIMD Waveslots
+
+     - The percent of :ref:`total SIMD cycles ` in the kernel
+       where a workgroup could not be scheduled to a :ref:`SIMD `
+       due to lack of available :ref:`waveslots `.
+
+     - Percent
+
+   * - Insufficient SIMD VGPRs
+
+     - The percent of :ref:`total SIMD cycles ` in the kernel
+       where a workgroup could not be scheduled to a :ref:`SIMD `
+       due to lack of available :ref:`VGPRs `.
+
+     - Percent
+
+   * - Insufficient SIMD SGPRs
+
+     - The percent of :ref:`total SIMD cycles ` in the kernel
+       where a workgroup could not be scheduled to a :ref:`SIMD `
+       due to lack of available :ref:`SGPRs `.
+
+     - Percent
+
+   * - Insufficient CU LDS
+
+     - The percent of :ref:`total CU cycles ` in the kernel
+       where a workgroup could not be scheduled to a :doc:`CU `
+       due to lack of available :doc:`LDS `.
+
+     - Percent
+
+   * - Insufficient CU Barriers
+
+     - The percent of :ref:`total CU cycles ` in the kernel
+       where a workgroup could not be scheduled to a :doc:`CU `
+       due to lack of available :ref:`barriers `.
+
+     - Percent
+
+   * - Reached CU Workgroup Limit
+
+     - The percent of :ref:`total CU cycles ` in the kernel
+       where a workgroup could not be scheduled to a :doc:`CU `
+       due to limits within the workgroup manager. This is expected to
+       always be zero on CDNA2 or newer accelerators (and small for previous
+       accelerators).
+
+     - Percent
+
+   * - Reached CU Wavefront Limit
+
+     - The percent of :ref:`total CU cycles ` in the kernel
+       where a wavefront could not be scheduled to a :doc:`CU `
+       due to limits within the workgroup manager. This is expected to
+       always be zero on CDNA2 or newer accelerators (and small for previous
+       accelerators).
+
+     - Percent
+
diff --git a/docs/concept/system-speed-of-light.rst b/docs/concept/system-speed-of-light.rst
index b59a81d82..5a2b547a2 100644
--- a/docs/concept/system-speed-of-light.rst
+++ b/docs/concept/system-speed-of-light.rst
@@ -1,111 +1,315 @@
+*********************
 System Speed-of-Light
----------------------
+*********************
 
-.. code:: {warning}
+System Speed-of-Light summarizes some of the key metrics from various sections
+of Omniperf’s profiling report.
 
-   The theoretical maximum throughput for some metrics in this section are currently computed with the maximum achievable clock frequency, as reported by `rocminfo`, for an accelerator. This may not be realistic for all workloads.
+.. warning::
 
-   In addition, not all metrics (e.g., FLOP counters) are available on all AMD Instinct(tm) MI accelerators.
-   For more detail on how operations are counted, see the [FLOP counting convention](FLOP_count) section.
+   The theoretical maximum throughput for some metrics in this section is
+   currently computed with the maximum achievable clock frequency, as reported
+   by :doc:`rocminfo:index`, for an accelerator. This may not be realistic for
+   all workloads.
 
-Finally, the system speed-of-light summarizes some of the key metrics
-from various sections of Omniperf’s profiling report.
-
-.. code:: {list-table}
+   Also, not all metrics -- such as FLOP counters -- are available on all AMD
+   Instinct(tm) MI accelerators. For more detail on how operations are counted,
+   see the :ref:`metrics-flop-count` section.
 
+.. list-table::
    :header-rows: 1
   :widths: 20 65 15
-   :class: noscroll-table
+
   * - Metric
   - Description
   - Unit
-   * - [VALU](valu) FLOPs
-     - The total floating-point operations executed per second on the [VALU](valu).
This is also presented as a percent of the peak theoretical FLOPs achievable on the specific accelerator. Note: this does not include any floating-point operations from [MFMA](mfma) instructions. + + * - :ref:`VALU ` FLOPs + + - The total floating-point operations executed per second on the + :ref:`VALU `. This is also presented as a percent of the peak + theoretical FLOPs achievable on the specific accelerator. Note: this does + not include any floating-point operations from :ref:`MFMA ` + instructions. + - GFLOPs - * - [VALU](valu) IOPs - - The total integer operations executed per second on the [VALU](valu). This is also presented as a percent of the peak theoretical IOPs achievable on the specific accelerator. Note: this does not include any integer operations from [MFMA](mfma) instructions. + + * - :ref:`VALU ` IOPs + + - The total integer operations executed per second on the + :ref:`VALU `. This is also presented as a percent of the peak + theoretical IOPs achievable on the specific accelerator. Note: this does + not include any integer operations from :ref:`MFMA ` + instructions. + - GIOPs - * - [MFMA](mfma) FLOPs (BF16) - - The total number of 16-bit brain floating point [MFMA](mfma) operations executed per second. Note: this does not include any 16-bit brain floating point operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical BF16 MFMA operations achievable on the specific accelerator. + + * - :ref:`MFMA ` FLOPs (BF16) + + - The total number of 16-bit brain floating point :ref:`MFMA ` + operations executed per second. Note: this does not include any 16-bit + brain floating point operations from :ref:`VALU ` + instructions. This is also presented as a percent of the peak theoretical + BF16 MFMA operations achievable on the specific accelerator. + - GFLOPs - * - [MFMA](mfma) FLOPs (F16) - - The total number of 16-bit floating point [MFMA](mfma) operations executed per second. Note: this does not include any 16-bit floating point operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical F16 MFMA operations achievable on the specific accelerator. + + * - :ref:`MFMA ` FLOPs (F16) + + - The total number of 16-bit floating point :ref:`MFMA ` + operations executed per second. Note: this does not include any 16-bit + floating point operations from :ref:`VALU ` instructions. This + is also presented as a percent of the peak theoretical F16 MFMA + operations achievable on the specific accelerator. + - GFLOPs - * - [MFMA](mfma) FLOPs (F32) - - The total number of 32-bit floating point [MFMA](mfma) operations executed per second. Note: this does not include any 32-bit floating point operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical F32 MFMA operations achievable on the specific accelerator. + + * - :ref:`MFMA ` FLOPs (F32) + + - The total number of 32-bit floating point :ref:`MFMA ` + operations executed per second. Note: this does not include any 32-bit + floating point operations from :ref:`VALU ` instructions. This + is also presented as a percent of the peak theoretical F32 MFMA + operations achievable on the specific accelerator. + - GFLOPs - * - [MFMA](mfma) FLOPs (F64) - - The total number of 64-bit floating point [MFMA](mfma) operations executed per second. Note: this does not include any 64-bit floating point operations from [VALU](valu) instructions. 
This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator. + + * - :ref:`MFMA ` FLOPs (F64) + + - The total number of 64-bit floating point :ref:`MFMA ` + operations executed per second. Note: this does not include any 64-bit + floating point operations from :ref:`VALU ` instructions. This + is also presented as a percent of the peak theoretical F64 MFMA + operations achievable on the specific accelerator. + - GFLOPs - * - [MFMA](mfma) IOPs (INT8) - - The total number of 8-bit integer [MFMA](mfma) operations executed per second. Note: this does not include any 8-bit integer operations from [VALU](valu) instructions. This is also presented as a percent of the peak theoretical INT8 MFMA operations achievable on the specific accelerator. + + * - :ref:`MFMA ` IOPs (INT8) + + - The total number of 8-bit integer :ref:`MFMA ` operations + executed per second. Note: this does not include any 8-bit integer + operations from :ref:`VALU ` instructions. This is also + presented as a percent of the peak theoretical INT8 MFMA operations + achievable on the specific accelerator. + - GIOPs - * - [SALU](salu) Utilization - - Indicates what percent of the kernel's duration the [SALU](salu) was busy executing instructions. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [SALU](salu) / [SMEM](salu) instructions over the [total CU cycles](TotalCUCycles). + + * - :ref:`SALU ` utilization + + - Indicates what percent of the kernel's duration the + :ref:`SALU ` was busy executing instructions. Computed as the + ratio of the total number of cycles spent by the + :ref:`scheduler ` issuing :ref:`SALU ` or + :ref:`SMEM ` instructions over the + :ref:`total CU cycles `. + - Percent - * - [VALU](valu) Utilization - - Indicates what percent of the kernel's duration the [VALU](valu) was busy executing instructions. Does not include [VMEM](valu) operations. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [VALU](valu) instructions over the [total CU cycles](TotalCUCycles). + + * - :ref:`VALU ` utilization + + - Indicates what percent of the kernel's duration the + :ref:`VALU ` was busy executing instructions. Does not include + :ref:`VMEM ` operations. Computed as the ratio of the total + number of cycles spent by the :ref:`scheduler ` issuing + :ref:`VALU ` instructions over the + :ref:`total CU cycles `. + - Percent - * - [MFMA](mfma) Utilization - - Indicates what percent of the kernel's duration the [MFMA](mfma) unit was busy executing instructions. Computed as the ratio of the total number of cycles the [MFMA](mfma) was busy over the [total CU cycles](TotalCUCycles). + + * - :ref:`MFMA ` utilization + + - Indicates what percent of the kernel's duration the + :ref:`MFMA ` unit was busy executing instructions. Computed as + the ratio of the total number of cycles the MFMA was busy over the + :ref:`total CU cycles `. + - Percent - * - [VMEM](valu) Utilization - - Indicates what percent of the kernel's duration the [VMEM](valu) unit was busy executing instructions, including both global/generic and spill/scratch operations (see the [VMEM instruction count metrics](TA_inst) for more detail). Does not include [VALU](valu) operations. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [VMEM](valu) instructions over the [total CU cycles](TotalCUCycles). 
+
+   * - :ref:`VMEM ` utilization
+
+     - Indicates what percent of the kernel's duration the
+       :ref:`VMEM ` unit was busy executing instructions, including
+       both global/generic and spill/scratch operations (see the
+       :ref:`VMEM instruction count metrics ` for more
+       detail). Does not include :ref:`VALU ` operations. Computed as
+       the ratio of the total number of cycles spent by the
+       :ref:`scheduler ` issuing VMEM instructions over the
+       :ref:`total CU cycles `.
+
+     - Percent
-   * - [Branch](branch) Utilization
-     - Indicates what percent of the kernel's duration the [Branch](branch) unit was busy executing instructions. Computed as the ratio of the total number of cycles spent by the [scheduler](scheduler) issuing [Branch](branch) instructions over the [total CU cycles](TotalCUCycles).
+
+   * - :ref:`Branch ` utilization
+
+     - Indicates what percent of the kernel's duration the
+       :ref:`branch ` unit was busy executing instructions.
+       Computed as the ratio of the total number of cycles spent by the
+       :ref:`scheduler ` issuing :ref:`branch `
+       instructions over the :ref:`total CU cycles `.
+
+     - Percent
-   * - [VALU](valu) Active Threads
-     - Indicates the average level of [divergence](Divergence) within a wavefront over the lifetime of the kernel. The number of work-items that were active in a wavefront during execution of each [VALU](valu) instruction, time-averaged over all VALU instructions run on all wavefronts in the kernel.
+
+   * - :ref:`VALU ` active threads
+
+     - Indicates the average level of :ref:`divergence ` within
+       a wavefront over the lifetime of the kernel. The number of work-items
+       that were active in a wavefront during execution of each
+       :ref:`VALU ` instruction, time-averaged over all VALU
+       instructions run on all wavefronts in the kernel.
+
+     - Work-items
+
+   * - IPC
+
+     - The ratio of the total number of instructions executed on the
+       :doc:`CU ` over the
+       :ref:`total active CU cycles `. This is also
+       presented as a percent of the peak theoretical IPC achievable on
+       the specific accelerator.
+
+     - Instructions per cycle
-   * - Wavefront Occupancy
-     - The time-averaged number of wavefronts resident on the accelerator over the lifetime of the kernel. Note: this metric may be inaccurate for short-running kernels (<< 1ms). This is also presented as a percent of the peak theoretical occupancy achievable on the specific accelerator.
+
+   * - Wavefront occupancy
+
+     - The time-averaged number of wavefronts resident on the accelerator over
+       the lifetime of the kernel. Note: this metric may be inaccurate for
+       short-running kernels (less than 1 ms). This is also presented as a
+       percent of the peak theoretical occupancy achievable on the specific
+       accelerator.
+
+     - Wavefronts
-   * - [LDS](lds) Theoretical Bandwidth
-     - Indicates the maximum amount of bytes that could have been loaded from/stored to/atomically updated in the LDS per unit time (see [LDS Bandwidth](lds_bandwidth) example for more detail). This is also presented as a percent of the peak theoretical F64 MFMA operations achievable on the specific accelerator.
+
+   * - :doc:`LDS ` theoretical bandwidth
+
+     - Indicates the maximum amount of bytes that could have been loaded from,
+       stored to, or atomically updated in the LDS per unit time (see the
+       :ref:`LDS Bandwidth ` example for more detail). This is
+       also presented as a percent of the peak theoretical LDS bandwidth
+       achievable on the specific accelerator.
+
+     - GB/s
-   * - [LDS](lds) Bank Conflicts/Access
-     - The ratio of the number of cycles spent in the [LDS scheduler](lds) due to bank conflicts (as determined by the conflict resolution hardware) to the base number of cycles that would be spent in the LDS scheduler in a completely uncontended case. This is also presented in normalized form (i.e., the Bank Conflict Rate).
+
+   * - :doc:`LDS ` bank conflicts/access
+
+     - The ratio of the number of cycles spent in the
+       :doc:`LDS scheduler ` due to bank conflicts (as
+       determined by the conflict resolution hardware) to the base number of
+       cycles that would be spent in the LDS scheduler in a completely
+       uncontended case. This is also presented in normalized form (that is,
+       the bank conflict rate).
+
+     - Conflicts/Access
-   * - [vL1D](vL1D) Cache Hit Rate
-     - The ratio of the number of vL1D cache line requests that hit in vL1D cache over the total number of cache line requests to the [vL1D Cache RAM](TC).
+
+   * - :doc:`vL1D ` cache hit rate
+
+     - The ratio of the number of vL1D cache line requests that hit in vL1D
+       cache over the total number of cache line requests to the
+       :ref:`vL1D cache RAM `.
+
+     - Percent
-   * - [vL1D](vL1D) Cache Bandwidth
-     - The number of bytes looked up in the vL1D cache as a result of [VMEM](VALU) instructions per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so e.g., if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+
+   * - :doc:`vL1D ` cache bandwidth
+
+     - The number of bytes looked up in the vL1D cache as a result of
+       :ref:`VMEM ` instructions per unit time. The number of bytes is
+       calculated as the number of cache lines requested multiplied by the
+       cache line size. This value does not consider partial requests, so, for
+       example, if only a single value is requested in a cache line, the data
+       movement will still be counted as a full cache line. This is also
+       presented as a percent of the peak theoretical bandwidth achievable on
+       the specific accelerator.
+
+     - GB/s
-   * - [L2](L2) Cache Hit Rate
-     - The ratio of the number of L2 cache line requests that hit in the L2 cache over the total number of incoming cache line requests to the L2 cache.
+
+   * - :doc:`L2 ` cache hit rate
+
+     - The ratio of the number of L2 cache line requests that hit in the L2
+       cache over the total number of incoming cache line requests to the L2
+       cache.
+
+     - Percent
-   * - [L2](L2) Cache Bandwidth
-     - The number of bytes looked up in the L2 cache per unit time. The number of bytes is calculated as the number of cache lines requested multiplied by the cache line size. This value does not consider partial requests, so e.g., if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+
+   * - :doc:`L2 ` cache bandwidth
+
+     - The number of bytes looked up in the L2 cache per unit time. The number
+       of bytes is calculated as the number of cache lines requested multiplied
+       by the cache line size. This value does not consider partial requests,
+       so, for example, if only a single value is requested in a cache line,
+       the data movement will still be counted as a full cache line. This is
+       also presented as a percent of the peak theoretical bandwidth achievable
+       on the specific accelerator.
+
+     - GB/s
-   * - [L2](L2)-Fabric Read BW
-     - The number of bytes read by the L2 over the [Infinity Fabric(tm) interface](l2fabric) per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+
+   * - :doc:`L2 `-fabric read BW
+
+     - The number of bytes read by the L2 over the
+       :ref:`Infinity Fabric™ interface ` per unit time. This is also
+       presented as a percent of the peak theoretical bandwidth achievable on
+       the specific accelerator.
+
+     - GB/s
-   * - [L2](L2)-Fabric Write and Atomic BW
-     - The number of bytes sent by the L2 over the [Infinity Fabric(tm) interface](l2fabric) by write and atomic operations per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+
+   * - :doc:`L2 `-fabric write and atomic BW
+
+     - The number of bytes sent by the L2 over the
+       :ref:`Infinity Fabric interface ` by write and atomic
+       operations per unit time. This is also presented as a percent of the peak
+       theoretical bandwidth achievable on the specific accelerator.
+
+     - GB/s
-   * - [L2](L2)-Fabric Read Latency
-     - The time-averaged number of cycles read requests spent in Infinity Fabric(tm) before data was returned to the L2.
+
+   * - :doc:`L2 `-fabric read latency
+
+     - The time-averaged number of cycles read requests spent in Infinity Fabric
+       before data was returned to the L2.
+
+     - Cycles
-   * - [L2](L2)-Fabric Write Latency
-     - The time-averaged number of cycles write requests spent in Infinity Fabric(tm) before a completion acknowledgement was returned to the L2.
+
+   * - :doc:`L2 `-fabric write latency
+
+     - The time-averaged number of cycles write requests spent in Infinity
+       Fabric before a completion acknowledgement was returned to the L2.
+
+     - Cycles
-   * - [sL1D](sL1D) Cache Hit Rate
-     - The percent of sL1D requests that hit on a previously loaded line the cache. Calculated as the ratio of the number of sL1D requests that hit over the number of all sL1D requests.
+
+   * - :ref:`sL1D ` cache hit rate
+
+     - The percent of sL1D requests that hit on a previously loaded line in the
+       cache. Calculated as the ratio of the number of sL1D requests that hit
+       over the number of all sL1D requests.
+
+     - Percent
-   * - [sL1D](sL1D) Bandwidth
-     - The number of bytes looked up in the sL1D cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+
+   * - :ref:`sL1D ` bandwidth
+
+     - The number of bytes looked up in the sL1D cache per unit time. This is
+       also presented as a percent of the peak theoretical bandwidth achievable
+       on the specific accelerator.
+
+     - GB/s
-   * - [L1I](L1I) Bandwidth
-     - The number of bytes looked up in the L1I cache per unit time. This is also presented as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
+
+   * - :ref:`L1I ` bandwidth
+
+     - The number of bytes looked up in the L1I cache per unit time. This is
+       also presented as a percent of the peak theoretical bandwidth achievable
+       on the specific accelerator.
+
+     - GB/s
-   * - [L1I](L1I) Cache Hit Rate
-     - The percent of L1I requests that hit on a previously loaded line the cache. Calculated as the ratio of the number of L1I requests that hit over the number of all L1I requests.
+
+   * - :ref:`L1I ` cache hit rate
+
+     - The percent of L1I requests that hit on a previously loaded line in the
+       cache. Calculated as the ratio of the number of L1I requests that hit
+       over the number of all L1I requests.
+
+     - Percent
-   * - [L1I](L1I) Fetch Latency
-     - The average number of cycles spent to fetch instructions to a [CU](cu).
+
+   * - :ref:`L1I ` fetch latency
+
+     - The average number of cycles spent to fetch instructions to a
+       :doc:`CU `.
+
+     - Cycles
+
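The LDS bank conflicts/access metric above lends itself to a short
illustration. The following HIP sketch is illustrative only and assumes the
common GCN/CDNA LDS organization in which consecutive 32-bit words map to
consecutive banks:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Illustrative 32x32 tile transpose, launched with a 32x32 workgroup.
   // With an unpadded tile[32][32], the column-wise read below has a stride
   // of 32 words between adjacent lanes, so (assuming 32 banks of 4 bytes)
   // every lane hits the same LDS bank. Padding each row by one element
   // spreads the column across all banks and removes the conflicts.
   __global__ void transposeTile(const float* in, float* out) {
     __shared__ float tile[32][33];  // 33, not 32: one padding element per row

     const int x = threadIdx.x;
     const int y = threadIdx.y;
     tile[y][x] = in[y * 32 + x];   // row-wise write: conflict-free
     __syncthreads();
     out[x * 32 + y] = tile[x][y];  // column-wise read: conflict-free only
                                    // because of the padding
   }

Under these assumptions, the unpadded variant would be expected to show a
markedly higher bank conflicts/access value than the padded one.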
diff --git a/docs/concept/vector-l1-cache.rst b/docs/concept/vector-l1-cache.rst
new file mode 100644
index 000000000..46cec0965
--- /dev/null
+++ b/docs/concept/vector-l1-cache.rst
@@ -0,0 +1,553 @@
+**********************
+Vector L1 cache (vL1D)
+**********************
+
+The vector L1 data (vL1D) cache is local to each
+:doc:`compute unit ` on the accelerator, and handles vector memory
+operations issued by a wavefront. The vL1D cache consists of several
+components:
+
+* An address processing unit, also known as the
+  :ref:`texture addresser `, which receives commands (instructions) and
+  write/atomic data from the :doc:`compute unit `, and coalesces
+  them into fewer requests for the cache to process.
+
+* An address translation unit, also known as the L1 Unified Translation
+  Cache (UTCL1), that translates requests from virtual to physical
+  addresses for lookup in the cache. The translation unit has an L1
+  translation lookaside buffer (L1TLB) to reduce the cost of repeated
+  translations.
+
+* A Tag RAM that looks up whether a requested cache line is already
+  present in the :ref:`cache `.
+
+* The result of the Tag RAM lookup is placed in the L1 cache controller
+  for routing to the correct location; for example, the
+  :ref:`L2 memory interface ` for misses or the
+  :ref:`cache RAM ` for hits.
+
+* The cache RAM, also known as the :ref:`texture cache (TC) `, stores
+  requested data for potential reuse. Data returned from the
+  :doc:`L2 cache ` is placed into the cache RAM before going down the
+  :ref:`data-return path `.
+
+* A backend data processing unit, also known as the
+  :ref:`texture data (TD) ` unit, which routes data back to the
+  requesting :doc:`compute unit `.
+
+Together, this complex is known as the vL1D, or Texture Cache per Pipe
+(TCP). A simplified diagram of the vL1D is presented below:
+
+.. figure:: ../data/performance-model/l1perf_model.*
+   :align: center
+   :alt: Performance model of the vL1D Cache on AMD Instinct
+
+   Performance model of the vL1D Cache on AMD Instinct MI-series accelerators.
+
+.. _vl1d-sol:
+
+vL1D Speed-of-Light
+===================
+
+.. warning::
+
+   The theoretical maximum throughput for some metrics in this section is
+   currently computed with the maximum achievable clock frequency, as reported
+   by ``rocminfo``, for an accelerator. This may not be realistic for all
+   workloads.
+
+The vL1D’s speed-of-light chart shows several key metrics for the vL1D
+as a comparison with the peak achievable values of those metrics.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Hit Rate
+
+     - The ratio of the number of vL1D cache line requests that hit\ :sup:`1`
+       in the vL1D cache over the total number of cache line requests to the
+       :ref:`vL1D cache RAM `.
+
+     - Percent
+
+   * - Bandwidth
+
+     - The number of bytes looked up in the vL1D cache as a result of
+       :ref:`VMEM ` instructions, as a percent of the peak theoretical
+       bandwidth achievable on the specific accelerator. The number of bytes is
+       calculated as the number of cache lines requested multiplied by the
+       cache line size. This value does not consider partial requests, so, for
+       example, if only a single value is requested in a cache line, the data
+       movement will still be counted as a full cache line.
+
+     - Percent
+
+   * - Utilization
+
+     - Indicates how busy the :ref:`vL1D cache RAM ` was during the
+       kernel execution. The number of cycles where the
+       :ref:`vL1D cache RAM ` is actively processing any request
+       divided by the number of cycles where the
+       :ref:`vL1D is active `\ :sup:`2`.
+
+     - Percent
+
+   * - Coalescing
+
+     - Indicates how well memory instructions were coalesced by the
+       :ref:`address processing unit `, ranging from uncoalesced (25%)
+       to fully coalesced (100%). The average number of
+       :ref:`thread-requests ` generated per instruction divided by
+       the ideal number of :ref:`thread-requests ` per instruction.
+
+     - Percent
+
+.. _vl1d-activity:
+
+.. note::
+
+   :sup:`1` The vL1D cache on AMD Instinct(tm) MI CDNA accelerators uses a
+   "hit-on-miss" approach to reporting cache hits. That is, if while
+   satisfying a miss, another request comes in that would hit on the same
+   pending cache line, the subsequent request will be counted as a 'hit'.
+   Therefore, it is also important to consider the Access Latency metric in
+   the :ref:`cache access metrics ` section when evaluating the vL1D hit
+   rate.
+
+   :sup:`2` Omniperf considers the vL1D to be active when any part of the vL1D
+   (excluding the :ref:`address processor ` and :ref:`data-return `
+   units) is active; for example, performing a translation, waiting for data,
+   or accessing the Tag or Cache RAMs.
+
+.. _desc-ta:
+
+Address processing unit or Texture Addresser (TA)
+=================================================
+
+The :ref:`vL1D `\ ’s address processing unit receives vector memory
+instructions (commands) along with write/atomic data from a
+:doc:`compute unit ` and is responsible for coalescing these into
+requests for lookup in the :ref:`vL1D RAM `. The address processor passes
+information about the commands (coalescing state, destination SIMD, and so on)
+to the :ref:`data processing unit ` for use after the requested data has
+been retrieved.
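As a hedged illustration of what this coalescing stage sees, compare the two
hypothetical kernels below; the 64-byte cache-line figure matches the note in
the cache access metrics later in this section:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Adjacent lanes read adjacent floats: a wavefront's loads fall into a
   // handful of 64-byte cache lines, so the address processor can coalesce
   // one instruction into few vL1D requests.
   __global__ void coalescedRead(const float* in, float* out) {
     const int i = blockIdx.x * blockDim.x + threadIdx.x;
     out[i] = in[i];
   }

   // With a stride of 16 floats (64 bytes), every lane touches its own cache
   // line: nothing can be coalesced, and one instruction fans out into one
   // request per lane.
   __global__ void stridedRead(const float* in, float* out) {
     const int i = blockIdx.x * blockDim.x + threadIdx.x;
     out[i] = in[i * 16];
   }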
+
+Omniperf reports several metrics to indicate performance bottlenecks in
+the address processing unit, which are broken down into a few
+categories:
+
+* Busy / stall metrics
+
+* Instruction counts
+
+* Spill / stack metrics
+
+Busy / Stall metrics
+''''''''''''''''''''
+
+When executing vector memory instructions, the compute unit must send an
+address (and in the case of writes/atomics, data) to the address
+processing unit. When the front-end cannot accept any more addresses, it
+must backpressure the wave-issue logic for the VMEM pipe and prevent the
+issue of a vector memory instruction until a previously issued memory
+operation has been processed.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Busy
+
+     - Percent of the :ref:`total CU cycles ` the address processor was
+       busy.
+
+     - Percent
+
+   * - Address Stall
+
+     - Percent of the :ref:`total CU cycles ` the address processor was
+       stalled from sending address requests further into the vL1D pipeline.
+
+     - Percent
+
+   * - Data Stall
+
+     - Percent of the :ref:`total CU cycles ` the address processor was
+       stalled from sending write/atomic data further into the vL1D pipeline.
+
+     - Percent
+
+   * - Data-Processor → Address Stall
+
+     - Percent of :ref:`total CU cycles ` the address processor was
+       stalled waiting to send command data to the :ref:`data processor `.
+
+     - Percent
+
+.. _ta-instruction-counts:
+
+Instruction counts
+''''''''''''''''''
+
+The address processor also counts instruction types to give you
+information about what sorts of memory instructions were executed by the
+kernel. These are broken down into a few major categories:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 20 60
+
+   * - Memory type
+     - Usage
+     - Description
+   * - Global
+     - Global memory
+     - Global memory can be seen by all threads from a process. This includes
+       the local accelerator's DRAM, remote accelerator's DRAM, and the host's
+       DRAM.
+   * - Generic
+     - Dynamic address spaces
+     - Generic memory, also known as "flat" memory, is used when the compiler
+       cannot statically prove that a pointer points to memory in one or the
+       other address space. The pointer could dynamically point into global,
+       local, constant, or private memory.
+   * - Private memory
+     - Register spills / stack memory
+     - Private memory, also known as "scratch" memory, is only visible to a
+       particular :ref:`work-item ` in a particular
+       :ref:`workgroup `. On AMD Instinct(tm) MI accelerators, private
+       memory is used to implement both register spills and stack memory
+       accesses.
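Of these three, the Generic category is the least intuitive. The following
minimal, hypothetical sketch shows the kind of pointer ambiguity that forces
flat addressing; whether a given compiler version actually emits flat
instructions here is not guaranteed:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // 'p' may point to LDS or to global memory depending on a runtime value,
   // so the compiler cannot prove a single address space for it and must
   // emit generic ("flat") loads.
   __device__ float firstTwo(const float* p) { return p[0] + p[1]; }

   __global__ void genericAccess(const float* g, float* out, int useLds) {
     __shared__ float lds[2];
     if (threadIdx.x < 2) {
       lds[threadIdx.x] = 1.0f;
     }
     __syncthreads();
     const float* p = useLds ? static_cast<const float*>(lds) : g;
     if (threadIdx.x == 0) {
       out[blockIdx.x] = firstTwo(p);
     }
   }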
+
+The address processor counts these instruction types as follows:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Type
+
+     - Description
+
+     - Unit
+
+   * - Global/Generic
+
+     - The total number of global and generic memory instructions executed on
+       all :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Global/Generic Read
+
+     - The total number of global and generic memory read instructions executed
+       on all :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Global/Generic Write
+
+     - The total number of global and generic memory write instructions
+       executed on all :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Global/Generic Atomic
+
+     - The total number of global and generic memory atomic (with and without
+       return) instructions executed on all :doc:`compute units ` on the
+       accelerator, per :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Spill/Stack
+
+     - The total number of spill/stack memory instructions executed on all
+       :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Spill/Stack Read
+
+     - The total number of spill/stack memory read instructions executed on all
+       :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Spill/Stack Write
+
+     - The total number of spill/stack memory write instructions executed on
+       all :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Spill/Stack Atomic
+
+     - The total number of spill/stack memory atomic (with and without return)
+       instructions executed on all :doc:`compute units ` on the
+       accelerator, per :ref:`normalization unit `. Typically unused, as
+       these memory operations are used to implement thread-local storage.
+
+     - Instructions per normalization unit
+
+.. note::
+
+   The above is a simplified model specifically for the HIP programming
+   language that does not consider, for example, inline assembly usage,
+   constant memory usage, or texture memory.
+
+   These categories correspond to:
+
+   * Global/Generic: global and flat memory operations, which are used for
+     global and generic memory access.
+
+   * Spill/Stack: buffer instructions, which are used on the MI50, MI100, and
+     :ref:`MI2XX ` accelerators for register spills / stack memory.
+
+   These concepts are described in more detail in the
+   :ref:`memory space section ` below, while generic memory access is
+   explored in the :ref:`generic memory benchmark ` section.
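A hedged example of code that typically produces the Spill/Stack traffic
counted above (whether spills actually occur depends on the compiler, its
optimization level, and the target architecture):

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // A large, dynamically indexed per-thread array usually cannot be kept
   // entirely in VGPRs, so the compiler places it in private ("scratch")
   // memory; the resulting loads and stores show up as Spill/Stack
   // instructions.
   __global__ void scratchUser(const int* idx, float* out) {
     float local[256];  // likely allocated in scratch rather than registers
     for (int i = 0; i < 256; ++i) {
       local[i] = static_cast<float>(i);
     }
     const int tid = blockIdx.x * blockDim.x + threadIdx.x;
     out[tid] = local[idx[tid] & 255];  // dynamic index defeats registerization
   }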
+
+Spill/Stack metrics
+'''''''''''''''''''
+
+Finally, the address processing unit contains a separate coalescing
+stage for spill/stack memory, and thus reports:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 18 65 17
+
+   * - Metric
+     - Description
+     - Unit
+
+   * - Spill/Stack Total Cycles
+     - The number of cycles the address processing unit spent working on
+       spill/stack instructions, per :ref:`normalization unit `.
+     - Cycles per normalization unit
+
+   * - Spill/Stack Coalesced Read Cycles
+     - The number of cycles the address processing unit spent working on
+       coalesced spill/stack read instructions, per
+       :ref:`normalization unit `.
+     - Cycles per normalization unit
+
+   * - Spill/Stack Coalesced Write Cycles
+     - The number of cycles the address processing unit spent working on
+       coalesced spill/stack write instructions, per
+       :ref:`normalization unit `.
+     - Cycles per normalization unit
+
+.. _utcl1:
+
+L1 Unified Translation Cache (UTCL1)
+====================================
+
+After a vector memory instruction has been processed/coalesced by the
+address processing unit of the vL1D, it must be translated from a
+virtual to physical address. This process is handled by the L1 Unified
+Translation Cache (UTCL1). This cache contains an L1 Translation
+Lookaside Buffer (TLB) which stores recently translated addresses to
+reduce the cost of subsequent re-translations.
+
+Omniperf reports the following L1 TLB metrics:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 18 65 17
+
+   * - Metric
+     - Description
+     - Unit
+   * - Requests
+     - The number of translation requests made to the UTCL1 per
+       :ref:`normalization unit `.
+     - Requests per normalization unit
+   * - Hits
+     - The number of translation requests that hit in the UTCL1, and could be
+       reused, per :ref:`normalization unit `.
+     - Requests per normalization unit
+   * - Hit Ratio
+     - The ratio of the number of translation requests that hit in the UTCL1
+       divided by the total number of translation requests made to the UTCL1.
+     - Percent
+   * - Translation Misses
+     - The total number of translation requests that missed in the UTCL1 due
+       to the translation not being present in the cache, per
+       :ref:`normalization unit `.
+     - Requests per normalization unit
+   * - Permission Misses
+     - The total number of translation requests that missed in the UTCL1 due
+       to a permission error, per :ref:`normalization unit `. This is unused
+       and expected to be zero in most configurations for modern CDNA
+       accelerators.
+     - Requests per normalization unit
+
+.. note::
+
+   On current CDNA accelerators, such as the :ref:`MI2XX `, the UTCL1 does
+   *not* count hit-on-miss requests.
+
+.. _desc-tc:
+
+Vector L1 Cache RAM or Texture Cache (TC)
+=========================================
+
+After coalescing in the :ref:`address processing unit ` of the vL1D,
+and address translation in the :ref:`L1 TLB `, the request proceeds
+to the Cache RAM stage of the pipeline. Incoming requests are looked up
+in the cache RAMs using parts of the physical address as a tag. Hits
+will be returned through the :ref:`data-return path `, while misses
+will be routed out to the :doc:`L2 cache ` for servicing.
+
+The metrics tracked by the vL1D RAM include:
+
+* Stall metrics
+
+* Cache access metrics
+
+* vL1D-L2 transaction detail metrics
+
+.. _vl1d-cache-stall-metrics:
+
+vL1D cache stall metrics
+''''''''''''''''''''''''
+
+The vL1D also reports where it is stalled in the pipeline, which may
+indicate performance limiters of the cache. A stall in the pipeline may
+result in backpressuring earlier parts of the pipeline; for example, a stall
+on L2 requests may backpressure the wave-issue logic of the :ref:`VMEM `
+pipe and prevent it from issuing more vector memory instructions until
+the vL1D’s outstanding requests are completed.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+     - Description
+     - Unit
+   * - Stalled on L2 Data
+     - The ratio of the number of cycles where the vL1D is stalled waiting for
+       requested data to return from the :doc:`L2 cache ` divided by the
+       number of cycles where the :ref:`vL1D is active `.
+     - Percent
+   * - Stalled on L2 Requests
+     - The ratio of the number of cycles where the vL1D is stalled waiting to
+       issue a request for data to the :doc:`L2 cache ` divided by the
+       number of cycles where the :ref:`vL1D is active `.
+     - Percent
+   * - Tag RAM Stall (Read/Write/Atomic)
+     - The ratio of the number of cycles where the vL1D is stalled due to
+       Read/Write/Atomic requests with conflicting tags being looked up
+       concurrently, divided by the number of cycles where the
+       :ref:`vL1D is active `.
+     - Percent
+
+.. _vl1d-cache-access-metrics:
+
+vL1D cache access metrics
+'''''''''''''''''''''''''
+
+The vL1D cache access metrics broadly indicate the type of requests
+incoming from the :ref:`cache front-end `, the number of requests that
+were serviced by the vL1D, and the number and type of outgoing requests to
+the :doc:`L2 cache `. In addition, this section includes the
+approximate latencies of accesses to the cache itself, along with
+latencies of read/write memory operations to the :doc:`L2 cache `.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 18 65 17
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Total Requests
+
+     - The total number of incoming requests from the
+       :ref:`address processing unit ` after coalescing.
+
+     - Requests
+
+   * - Total read/write/atomic requests
+
+     - The total number of incoming read/write/atomic requests from the
+       :ref:`address processing unit ` after coalescing, per
+       :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - Cache Bandwidth
+
+     - The number of bytes looked up in the vL1D cache as a result of
+       :ref:`VMEM ` instructions, per :ref:`normalization unit `. The
+       number of bytes is calculated as the number of cache lines requested
+       multiplied by the cache line size. This value does not consider partial
+       requests, so, for example, if only a single value is requested in a
+       cache line, the data movement will still be counted as a full cache
+       line.
+
+     - Bytes per normalization unit
+
+   * - Cache Hit Rate
+
+     - The ratio of the number of vL1D cache line requests that hit in the
+       vL1D cache over the total number of cache line requests to the
+       :ref:`vL1D cache RAM `.
+
+     - Percent
+
+   * - Cache Accesses
+
+     - The total number of cache line lookups in the vL1D.
+
+     - Cache lines
+
+   * - Cache Hits
+
+     - The number of cache accesses minus the number of outgoing requests to
+       the :doc:`L2 cache `, that is, the number of cache line requests
+       serviced by the :ref:`vL1D cache RAM ` per
+       :ref:`normalization unit `.
+
+     - Cache lines per normalization unit
+
+   * - Invalidations
+
+     - The number of times the vL1D was issued a write-back invalidate command
+       during the kernel's execution, per :ref:`normalization unit `. This
+       may be triggered, for example, by the ``buffer_wbinvl1`` instruction.
+
+     - Invalidations per normalization unit
+
+   * - L1-L2 Bandwidth
+
+     - The number of bytes transferred across the vL1D-L2 interface as a
+       result of :ref:`VMEM ` instructions, per
+       :ref:`normalization unit `. The number of bytes is calculated as the
+       number of cache lines requested multiplied by the cache line size. This
+       value does not consider partial requests, so, for example, if only a
+       single value is requested in a cache line, the data movement will still
+       be counted as a full cache line.
+
+     - Bytes per normalization unit
+
+   * - L1-L2 Reads
+
+     - The number of read requests for a vL1D cache line that were not
+       satisfied by the vL1D and must be retrieved from the
+       :doc:`L2 cache ` per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - L1-L2 Writes
+
+     - The number of post-coalescing write requests that are sent through the
+       vL1D to the :doc:`L2 cache `, per :ref:`normalization unit `.
+
+     - Requests per normalization unit
+
+   * - L1-L2 Atomics
+
+     - The number of atomic requests that are sent through the vL1D to the
+       :doc:`L2 cache `, per :ref:`normalization unit `. This includes
+       requests for atomics with and without return.
+
+     - Requests per normalization unit
+
+   * - L1 Access Latency
+
+     - The average number of cycles that a vL1D cache line request spent in
+       the vL1D cache pipeline.
+
+     - Cycles
+
+   * - L1-L2 Read Access Latency
+
+     - The average number of cycles that the vL1D cache took to issue and
+       receive read requests from the :doc:`L2 cache `. This number also
+       includes requests for atomics with return values.
+
+     - Cycles
+
+   * - L1-L2 Write Access Latency
+
+     - The average number of cycles that the vL1D cache took to issue and
+       receive acknowledgement of a write request to the
+       :doc:`L2 cache `. This number also includes requests for atomics
+       without return values.
+
+     - Cycles
+
+.. note::
+
+   All cache accesses in vL1D are for a single cache line's worth of data.
+   The size of a cache line may vary; however, on current AMD Instinct MI CDNA
+   accelerators and GCN GPUs, the L1 cache line size is 64B.
+
+.. _vl1d-l2-transaction-detail:
+
+vL1D - L2 transaction detail
+''''''''''''''''''''''''''''
+
+This section provides a more granular look at the types of requests made
+to the :doc:`L2 cache `. These are broken down by the operation type
+(read, write, or atomic, with or without return), and the
+:ref:`memory type `.
+For more detail, see the :ref:`memory types ` section.
+
+.. _desc-td:
+
+Vector L1 data-return path or Texture Data (TD)
+===============================================
+
+The data-return path of the vL1D cache, also known as the Texture Data
+(TD) unit, is responsible for routing data returned from the
+:ref:`vL1D cache RAM ` back to a wavefront on a SIMD. As described in the
+:ref:`vL1D cache front-end ` section, the data-return path is passed
+information about the space requirements and routing for data requests
+from the :ref:`VALU `. When data is returned from the
+:ref:`vL1D cache RAM `, it is matched to this previously stored request
+data, and returned to the appropriate SIMD.
+
+Omniperf reports the following vL1D data-return path metrics:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - Data-return Busy
+
+     - Percent of the :ref:`total CU cycles ` the data-return unit was busy
+       processing or waiting on data to return to the :doc:`CU `.
+
+     - Percent
+
+   * - Cache RAM → Data-return Stall
+
+     - Percent of the :ref:`total CU cycles ` the data-return unit was
+       stalled on data to be returned from the :ref:`vL1D cache RAM `.
+
+     - Percent
+
+   * - Workgroup manager → Data-return Stall
+
+     - Percent of the :ref:`total CU cycles ` the data-return unit was
+       stalled by the :ref:`workgroup manager ` due to initialization of
+       registers as a part of launching new workgroups.
+
+     - Percent
+
+   * - Coalescable Instructions
+
+     - The number of instructions submitted to the :ref:`data-return unit ` by
+       the :ref:`address processor ` that were found to be coalescable, per
+       :ref:`normalization unit `.
+
+     - Instructions per normalization unit
+
+   * - Read Instructions
+
+     - The number of read instructions submitted to the
+       :ref:`data-return unit ` by the :ref:`address processor `, summed
+       over all :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `. This is expected to be the sum of
+       global/generic and spill/stack reads in the
+       :ref:`address processor `.
+
+     - Instructions per normalization unit
+
+   * - Write Instructions
+
+     - The number of store instructions submitted to the
+       :ref:`data-return unit ` by the :ref:`address processor `, summed
+       over all :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `. This is expected to be the sum of
+       global/generic and spill/stack stores counted by the
+       :ref:`vL1D cache front-end `.
+
+     - Instructions per normalization unit
+
+   * - Atomic Instructions
+
+     - The number of atomic instructions submitted to the
+       :ref:`data-return unit ` by the :ref:`address processor `, summed
+       over all :doc:`compute units ` on the accelerator, per
+       :ref:`normalization unit `. This is expected to be the sum of
+       global/generic and spill/stack atomics in the
+       :ref:`address processor `.
+
+     - Instructions per normalization unit
+
diff --git a/docs/data/unused/omniperf_server_vs_client_install.png b/docs/data/install/omniperf_server_vs_client_install.png
similarity index 100%
rename from docs/data/unused/omniperf_server_vs_client_install.png
rename to docs/data/install/omniperf_server_vs_client_install.png
diff --git a/docs/data/unused/fabric.png b/docs/data/performance-model/fabric.png
similarity index 100%
rename from docs/data/unused/fabric.png
rename to docs/data/performance-model/fabric.png
diff --git a/docs/data/unused/fabric.svg b/docs/data/performance-model/fabric.svg
similarity index 100%
rename from docs/data/unused/fabric.svg
rename to docs/data/performance-model/fabric.svg
diff --git a/docs/data/unused/l1perf_model.png b/docs/data/performance-model/l1perf_model.png
similarity index 100%
rename from docs/data/unused/l1perf_model.png
rename to docs/data/performance-model/l1perf_model.png
diff --git a/docs/data/unused/l1perf_model.svg b/docs/data/performance-model/l1perf_model.svg
similarity index 100%
rename from docs/data/unused/l1perf_model.svg
rename to docs/data/performance-model/l1perf_model.svg
diff --git a/docs/data/unused/nosplit.png b/docs/data/performance-model/nosplit.png
similarity index 100%
rename from docs/data/unused/nosplit.png
rename to docs/data/performance-model/nosplit.png
diff --git a/docs/data/unused/nosplit.svg b/docs/data/performance-model/nosplit.svg
similarity index 100%
rename from docs/data/unused/nosplit.svg
rename to docs/data/performance-model/nosplit.svg
diff --git a/docs/data/unused/selayout.png b/docs/data/performance-model/selayout.png
similarity index 100%
rename from docs/data/unused/selayout.png
rename to docs/data/performance-model/selayout.png
diff --git a/docs/data/unused/split.png b/docs/data/performance-model/split.png
similarity index 100%
rename from docs/data/unused/split.png
rename to docs/data/performance-model/split.png
diff --git a/docs/data/unused/split.svg b/docs/data/performance-model/split.svg
similarity index 100%
rename from docs/data/unused/split.svg
rename to docs/data/performance-model/split.svg
diff --git a/docs/data/unused/uncached.png b/docs/data/performance-model/uncached.png
similarity index 100%
rename from docs/data/unused/uncached.png
rename to docs/data/performance-model/uncached.png
diff --git a/docs/data/unused/uncached.svg b/docs/data/performance-model/uncached.svg
similarity index 100%
rename from docs/data/unused/uncached.svg
rename to docs/data/performance-model/uncached.svg
diff --git a/docs/how-to/analyze/cli.rst b/docs/how-to/analyze/cli.rst
index 2488c9453..f91120fb7 100644
--- a/docs/how-to/analyze/cli.rst
+++ b/docs/how-to/analyze/cli.rst
@@ -3,8 +3,11 @@
 CLI analysis
 ************
 
 * **Derived metrics**: All of Omniperf's built-in metrics.
+
 * **Baseline comparison**: Compare multiple runs in a side-by-side manner.
+
 * **Metric customization**: Isolate a subset of built-in metrics or build your own profiling configuration.
+
 * **Filtering**: Hone in on a particular kernel, gpu-id, and/or dispatch-id via post-process filtering.
 
 Run ``omniperf analyze -h`` for more details.
@@ -343,10 +346,8 @@ Filter kernels
 ╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╧═════╛
 ...
 
-   .. note::
-
-      You should see your filtered kernels indicated by an asterisk in the **Top
-      Stats** table.
+   You should see your filtered kernels indicated by an asterisk in the **Top
+   Stats** table.
Baseline comparison
diff --git a/docs/how-to/analyze/grafana-gui.rst b/docs/how-to/analyze/grafana-gui.rst
index 839f7c5d7..852463cc7 100644
--- a/docs/how-to/analyze/grafana-gui.rst
+++ b/docs/how-to/analyze/grafana-gui.rst
@@ -8,20 +8,33 @@
 The Omniperf Grafana analysis dashboard GUI supports the following features to
 facilitate MI accelerator performance profiling and analysis:
 
 * System and Hardware Component (Hardware Block)
+
 * Speed-of-Light (SOL)
+
 * :ref:`Multiple normalization options `
+
 * Baseline comparisons
+
 * Regex based Dispatch ID filtering
+
 * Roofline Analysis
-* Detailed performance counters and metrics per hardware component, e.g.,
-  * Command Processor - Fetch (CPF) / Command Processor - Controller (CPC)
-  * Workgroup Manager (SPI)
-  * Shader Sequencer (SQ)
-  * Shader Sequencer Controller (SQC)
-  * L1 Address Processing Unit, a.k.a. Texture Addresser (TA) / L1 Backend Data
-    Processing Unit, a.k.a. Texture Data (TD)
-  * L1 Cache (TCP)
-  * L2 Cache (TCC) (both aggregated and per-channel perf info)
+
+* Detailed performance counters and metrics per hardware component, such as:
+
+  * Command Processor - Fetch (CPF) / Command Processor - Controller (CPC)
+
+  * Workgroup Manager (SPI)
+
+  * Shader Sequencer (SQ)
+
+  * Shader Sequencer Controller (SQC)
+
+  * L1 Address Processing Unit, a.k.a. Texture Addresser (TA) / L1 Backend Data
+    Processing Unit, a.k.a. Texture Data (TD)
+
+  * L1 Cache (TCP)
+
+  * L2 Cache (TCC) (both aggregated and per-channel perf info)
 
 .. _analysis-sol:
 
@@ -43,9 +56,12 @@
 Multiple performance number normalizations are provided to allow performance
 inspection within both hardware and software context. The following
 normalizations are available.
 
-* ``per_cycle``
 * ``per_wave``
+
+* ``per_cycle``
+
+* ``per_kernel``
+
+* ``per_second``
 
 See :ref:`normalization-units` to learn more about Omniperf normalizations.
 
@@ -117,7 +133,7 @@
 over 50%, while *red* means over 90% percent.
 
 Global variables and configurations
 -----------------------------------
 
-.. image:: ../data/analyze/global_variables.png
+.. image:: ../../data/analyze/global_variables.png
    :align: center
 
 .. _grafana-gui-import:
 
@@ -129,18 +145,22 @@
 The Omniperf database ``--import`` option imports the raw profiling data to
 Grafana's backend MongoDB database. This step is only required for Grafana
 GUI-based performance analysis.
 
-Default username and password for MongoDB (to be used in database mode) are as follows:
+Default username and password for MongoDB (to be used in database mode) are as
+follows:
 
 * **Username**: ``temp``
-* **Password**: ``temp123``
 
-Each workload is imported to a separate database with the following naming convention:
+* **Password**: ``temp123``
 
-    omniperf___
+Each workload is imported to a separate database with the following naming
+convention:
 
-e.g., omniperf_asw_vcopy_mi200.
+   ``omniperf___`` For example, ``omniperf_asw_vcopy_mi200``.
 
-When using database mode, be sure to tailor the connection options to the machine hosting your [sever-side instance](./installation.md). Below is the sample command to import the *vcopy* profiling data, lets assuming our host machine is called "dummybox".
+When using database mode, be sure to tailor the connection options to the
+machine hosting your :doc:`server-side instance `. Below is a
+sample command to import the *vcopy* profiling data, assuming your host
+machine is called ``dummybox``.
 
 ..
code-block:: shell @@ -184,7 +204,7 @@ When using database mode, be sure to tailor the connection options to the machin -w , --workload Specify name of workload (to remove) or path to workload (to import) --kernel-verbose Specify Kernel Name verbose level 1-5. Lower the level, shorter the kernel name. (DEFAULT: 5) (DISABLE: 5) -omniperf import for vcopy: +Omniperf import for vcopy: ^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -232,80 +252,154 @@ workload performance. Each category contains several panels for close inspection of the system performance. - Kernel Statistics + - Kernel time histogram + - Top Ten bottleneck kernels + - System Speed-of-Light + - Speed-of-Light + - System Info table + - Memory Chart Analysis + - Roofline Analysis + - FP32/FP64 + - FP16/INT8 + - Command Processor + - Command Processor - Fetch (CPF) + - Command Processor - Controller (CPC) + - Workgroup Manager or Shader Processor Input (SPI) + - SPI Stats + - SPI Resource Allocations + - Wavefront Launch + - Wavefront Launch Stats + - Wavefront runtime stats + - per-SE Wavefront Scheduling performance + - Wavefront Lifetime + - Wavefront lifetime breakdown + - per-SE wavefront life (average) + - per-SE wavefront life (histogram) + - Wavefront Occupancy + - per-SE wavefront occupancy + - per-CU wavefront occupancy + - Compute Unit - Instruction Mix + - per-wave Instruction mix + - per-wave VALU Arithmetic instruction mix + - per-wave MFMA Arithmetic instruction mix + - Compute Unit - Compute Pipeline + - Speed-of-Light: Compute Pipeline + - Arithmetic OPs count + - Compute pipeline stats + - Memory latencies + - Local Data Share (LDS) + - Speed-of-Light: LDS + - LDS stats + - Instruction Cache + - Speed-of-Light: Instruction Cache + - Instruction Cache Accesses + - Constant Cache + - Speed-of-Light: Constant Cache + - Constant Cache Accesses + - Constant Cache - L2 Interface stats + - Texture Address and Texture Data + - Texture Address (TA) + - Texture Data (TD) + - L1 Cache + - Speed-of-Light: L1 Cache + - L1 Cache Accesses + - L1 Cache Stalls + - L1 - L2 Transactions + - L1 - UTCL1 Interface stats + - L2 Cache + - Speed-of-Light: L2 Cache + - L2 Cache Accesses + - L2 - EA Transactions + - L2 - EA Stalls + - L2 Cache Per Channel Performance + - Per-channel L2 Hit rate + - Per-channel L1-L2 Read requests + - Per-channel L1-L2 Write Requests + - Per-channel L1-L2 Atomic Requests + - Per-channel L2-EA Read requests + - Per-channel L2-EA Write requests + - Per-channel L2-EA Atomic requests + - Per-channel L2-EA Read latency + - Per-channel L2-EA Write latency + - Per-channel L2-EA Atomic latency + - Per-channel L2-EA Read stall (I/O, GMI, HBM) + - Per-channel L2-EA Write stall (I/O, GMI, HBM, Starve) -Most panels are designed around a specific hardware component block to thoroughly understand its behavior. Additional panels, including custom panels, could also be added to aid the performance analysis. +Most panels are designed around a specific hardware component block to +thoroughly understand its behavior. Additional panels, including custom panels, +could also be added to aid the performance analysis. System Info Panel ^^^^^^^^^^^^^^^^^ @@ -412,6 +506,8 @@ Command Processor Compute fetched commands, and (for kernels) passes them to the Workgroup Managers (SPIs) for scheduling. +See :doc:`concept/command-processor` to learn about reported metrics. 
+ Shader Processor Input (SPI) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -561,7 +657,8 @@ Speed-of-Light :align: center :alt: Speed-of-Light (instruction cache) panel in Omniperf Grafana - Key metrics of the L1 Instruction (L1I) cache as a comparison with the peak achievable values of those metrics. + Key metrics of the L1 Instruction (L1I) cache as a comparison with the peak + achievable values of those metrics. Instruction Cache Stats +++++++++++++++++++++++ @@ -659,7 +756,7 @@ L1D Cache Accesses :align: center :alt: L1D Cache Accesses - The type of requests incoming from the cache frontend, the number of requests + The type of requests incoming from the cache front-end, the number of requests that were serviced by the vector L1 data (vL1D) cache, and the number & type of outgoing requests to the L2 cache. @@ -702,7 +799,7 @@ L2 Cache Accesses +++++++++++++++++ .. figure:: ../../data/analyze/grafana/l2-accesses_panel.png - ign: center + :align: center :alt: L2 Cache Accesses panel in Omniperf Grafana Incoming requests to the L2 cache from the vector L1 data (vL1D) cache and diff --git a/docs/how-to/profile/mode.rst b/docs/how-to/profile/mode.rst index 5a10cfe48..730a77959 100644 --- a/docs/how-to/profile/mode.rst +++ b/docs/how-to/profile/mode.rst @@ -22,7 +22,7 @@ data through analysis of compute workloads. Profiling with Omniperf yields the following benefits. * **Automate counter collection**: Omniperf handles all of your profiling via - preconfigured input files. + pre-configured input files. * **Filtering**: Apply runtime filters to speed up the profiling process. * **Standalone roofline**: Isolate a subset of built-in metrics or build your own profiling configuration. @@ -223,17 +223,17 @@ detailed description of profiling filters available when using Omniperf with Filtering options ----------------- +``-b``, ``--block `` + Allows system profiling on one or more selected hardware components to speed + up the profiling process. See :ref:`profiling-hw-component-filtering`. + ``-k``, ``--kernel `` Allows for kernel filtering. Usage is equivalent with the current ``rocprof`` - utility. See :ref:``. + utility. See :ref:`profiling-kernel-filtering`. ``-d``, ``--dispatch `` Allows for dispatch ID filtering. Usage is equivalent with the current - rocProf utility. See :ref:`profiling-kernel-filtering`. - -``-b``, ``--block `` - Allows system profiling on one or more selected hardware components to speed - up the profiling process. See :ref:`profiling-hw-component-filtering`. + ``rocprof`` utility. See :ref:`profiling-dispatch-filtering`. .. tip:: @@ -333,6 +333,8 @@ substring ``vecCopy``. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ... +.. _profiling-dispatch-filtering: + Dispatch filtering ^^^^^^^^^^^^^^^^^^ diff --git a/docs/how-to/use.rst b/docs/how-to/use.rst index 45d75962a..1766d19fa 100644 --- a/docs/how-to/use.rst +++ b/docs/how-to/use.rst @@ -89,7 +89,7 @@ workload path. ``-p``, ``--path`` Enables you to analyze existing profiling data in the Omniperf CLI. -See :doc:`cli` for more detailed information. +See :doc:`analyze/cli` for more detailed information. .. _basic-analyze-grafana: @@ -99,7 +99,7 @@ Analyze in the Grafana GUI To conduct a more in-depth analysis of profiling results, it's suggested to use a Grafana GUI with Omniperf. To interact with profiling results, import your data to the MongoDB instance included in the Omniperf Dockerfile. See -:doc:`grafana-setup`. +:doc:`install/grafana-setup`. 
To interact with Grafana data, stored in the Omniperf database, enter ``database`` :ref:`mode `; for example: @@ -108,7 +108,7 @@ To interact with Grafana data, stored in the Omniperf database, enter $ omniperf database --import [CONNECTION OPTIONS] -See :ref:`grafana-analysis` for more detailed information. +See :doc:`grafana-gui` for more detailed information. .. _modes: @@ -173,15 +173,15 @@ Database mode profiling results to the DB to interact with the workload in Grafana or ``--remove`` the workload from the DB. - Connection options need to be specified. See :ref:`grafana-analysis` for + Connection options need to be specified. See :doc:`grafana-gui` for more details. .. code-block:: shell $ omniperf database --help -See :doc:`grafana-setup` to learn about setting up a Grafana server and database -instance to make your profiling data more digestible and shareable. +See :doc:`install/grafana-setup` to learn about setting up a Grafana server and +database instance to make your profiling data more digestible and shareable. .. _global-options: diff --git a/docs/index.rst b/docs/index.rst index 874fe3c1e..0cd083b56 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -13,7 +13,7 @@ its components. Omniperf is open source and hosted at ``__. -If you're new to Omniperf, acquaint yourself with the tool by reviewing the +If you're new to Omniperf, familiarize yourself with the tool by reviewing the chapters that follow and gradually learn its more advanced features. To get started, see :doc:`What is Omniperf? `. @@ -37,31 +37,50 @@ in practice. .. grid-item-card:: How to * :doc:`how-to/use` + * :doc:`how-to/profile/mode` + * :doc:`how-to/analyze/mode` - * :doc:`how-to/analyze/grafana-gui` - * :doc:`how-to/analyze/standalone-gui` - .. grid-item-card:: Tutorials + * :doc:`how-to/analyze/cli` - * :doc:`tutorial/profiling-by-example` - * `Omniperf example exercises `_ - * `Introduction to Omniperf - AMD profiling workshop `_ + * :doc:`how-to/analyze/grafana-gui` + + * :doc:`how-to/analyze/standalone-gui` .. grid-item-card:: Conceptual - * :doc:`conceptual/performance-model` - * :doc:`conceptual/glossary` + * :doc:`concept/performance-model` + + * :doc:`concept/compute-unit` + + * :doc:`concept/l2-cache` + + * :doc:`concept/shader-engine` + + * :doc:`concept/command-processor` + + * :doc:`concept/system-speed-of-light` + + * :doc:`concept/definitions` + + * :ref:`normalization-units` + + .. grid-item-card:: Tutorials + + * :doc:`tutorial/profiling-by-example` + + * :doc:`Learning resources ` .. grid-item-card:: Reference * :doc:`reference/compatible-accelerators` + * :doc:`reference/faq` - * :doc:`license` -This project is proudly open source; all feedback is welcome. For more details -on how to contribute, refer to +This project is proudly open source. For more details on how to contribute, +refer to `Contributing to ROCm `_. Find ROCm licensing information on the diff --git a/docs/install/grafana-setup.rst b/docs/install/grafana-setup.rst index ab91f00c4..8d0e1c34f 100644 --- a/docs/install/grafana-setup.rst +++ b/docs/install/grafana-setup.rst @@ -56,7 +56,7 @@ Set up persistent storage ------------------------- Bind MongoDB to a directory on the host OS to create a local backup in case of a -crash or reset. +crash or reset. This is called *creating a persistent volume*. .. code-block:: bash @@ -65,10 +65,6 @@ crash or reset. 
$ sudo docker volume create --driver local --opt type=none --opt device=/usr/local/persist/grafana-storage --opt o=bind grafana-storage $ sudo docker volume create --driver local --opt type=none --opt device=/usr/local/persist/mongodb --opt o=bind grafana-mongo-db -.. tip:: - - In Docker lingo, this is called *creating a persistent volume*. - .. _grafana-docker-container: Build and launch the Docker container @@ -146,17 +142,17 @@ Configure the following fields in the data source settings. After configuring these fields, click **Save & test** to make sure your connection is successful. -.. note:: - - To avoid potential DNS issues, you might need to use the actual IP address - for the host node in the MongoDB URL. - .. figure:: ../data/install/datasource_settings.jpg :align: center :alt: Grafana data source settings Grafana data source settings. +.. note:: + + To avoid potential DNS issues, you might need to use the actual IP address + for the host node in the MongoDB URL. + .. _grafana-import-dashboard-file: Import the Omniperf dashboard file @@ -203,5 +199,5 @@ from the workload dropdown located at the top of your Grafana dashboard. Selecting your Omniperf workload in Grafana. For more information on how to use the Grafana interface for analysis see -:ref:`grafana-analysis`. +:doc:`how-to/analyze/grafana-gui`. diff --git a/docs/reference/compatible-accelerators.rst b/docs/reference/compatible-accelerators.rst index a0cbd7481..30eaf6f6e 100644 --- a/docs/reference/compatible-accelerators.rst +++ b/docs/reference/compatible-accelerators.rst @@ -23,14 +23,14 @@ GPU specifications. * - Platform - Status - * - MI50, MI60 (Vega 20) - - No support + * - AMD Instinct MI300 + - Supported ✅ - * - MI100 - - Supported + * - AMD Instinct MI200 + - Supported ✅ - * - MI200 - - Supported + * - AMD Instinct MI100 + - Supported ✅ - * - MI300 - - Supported + * - AMD Instinct MI50, MI60 (Vega 20) + - No support ❌ diff --git a/docs/reference/faq.rst b/docs/reference/faq.rst index 9916fb522..e847c8a35 100644 --- a/docs/reference/faq.rst +++ b/docs/reference/faq.rst @@ -1,6 +1,7 @@ .. 
meta::
   :description: Omniperf FAQ and troubleshooting
-   :keywords: Omniperf, FAQ, troubleshooting, ROCm, profiler, tool, Instinct, accelerator, AMD
+   :keywords: Omniperf, FAQ, troubleshooting, ROCm, profiler, tool, Instinct,
+              accelerator, AMD, SSH, error, version, workaround, help

***
FAQ
***
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 2c4853e0e..99f07d1a7 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -25,22 +25,34 @@ subtrees:
    - file: how-to/analyze/grafana-gui.rst
    - file: how-to/analyze/standalone-gui.rst

-  - caption: Tutorials
-    entries:
-    - file: tutorial/profiling-by-example.rst
-    - file: tutorial/training-resources.rst
-
  - caption: Conceptual
    entries:
    - file: concept/performance-model.rst
      entries:
      - file: concept/compute-unit.rst
        title: Compute unit
+        entries:
+        - file: concept/pipeline-descriptions.rst
+        - file: concept/pipeline-metrics.rst
+        - file: concept/local-data-share.rst
+          title: Local data share
+        - file: concept/vector-l1-cache.rst
+          title: Vector L1 cache
+      - file: concept/l2-cache.rst
+        title: L2 cache
+      - file: concept/shader-engine.rst
+        title: Shader engine
      - file: concept/command-processor.rst
        title: Command processor
      - file: concept/system-speed-of-light.rst
        title: System Speed-of-Light
-      - file: concept/glossary.rst
+      - file: concept/definitions.rst
+
+  - caption: Tutorials
+    entries:
+    - file: tutorial/profiling-by-example.rst
+    - file: tutorial/external-resources.rst
+      title: Learning resources

  - caption: Reference
    entries:
diff --git a/docs/tutorial/external-resources.rst b/docs/tutorial/external-resources.rst
new file mode 100644
index 000000000..44964881c
--- /dev/null
+++ b/docs/tutorial/external-resources.rst
@@ -0,0 +1,19 @@
+.. meta::
+   :description: Omniperf external training resources
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD,
+              training, examples
+
+******************
+Learning resources
+******************
+
+This section is a catalog of external resources and third-party content that
+can help you learn Omniperf. Some areas of the following content might be
+outdated.
+
+Introduction to Omniperf
+   :fab:`youtube` `AMD profiling workshop (Pawsey Supercomputing Research Centre) `_
+
+Omniperf example exercises
+   ``__
+
diff --git a/docs/tutorial/includes/infinity-fabric-transactions.rst b/docs/tutorial/includes/infinity-fabric-transactions.rst
index 6eef66a91..3534a973c 100644
--- a/docs/tutorial/includes/infinity-fabric-transactions.rst
+++ b/docs/tutorial/includes/infinity-fabric-transactions.rst
@@ -3,11 +3,11 @@
Infinity Fabric transactions
============================

-For this example, consider the
-:dev-sample:`Infinity Fabric™ sample ` distributed as a part of
-Omniperf.
+For this example, consider the
+:dev-sample:`Infinity Fabric™ sample ` distributed as part of
+Omniperf.

-This code launches a simple read-only kernel:
+The following code snippet launches a simple read-only kernel.

.. code:: cpp

      }
   }

-This happens twice; once as a warmup, and once for analysis. We note that the buffer
-``x`` is initialized to all zeros via a call to ``hipMemcpy`` on the
-host before the kernel is ever launched, therefore the conditional:
+This happens twice -- once as a warmup and once for analysis. Note that the
+buffer ``x`` is initialized to all zeros via a call to ``hipMemcpy`` on the
+host before the kernel is ever launched.
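+
+As a sketch, that host-side initialization might look like the following --
+the function, buffer name, and element count are illustrative assumptions,
+not the sample's actual code.
+
+.. code:: cpp
+
+   #include <hip/hip_runtime.h>
+   #include <vector>
+
+   // Zero-fill a device buffer of n ints so that every element the kernel
+   // reads is zero, making its (sum != 0) branch always false.
+   void zero_fill(int* x, size_t n) {
+     std::vector<int> zeros(n, 0);
+     hipMemcpy(x, zeros.data(), n * sizeof(int), hipMemcpyHostToDevice);
+   }
+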
+Therefore, the following conditional is identically false -- and thus we
+expect no writes.

.. code:: cpp

      if (sum != 0) {
      ...

-is identically false (and thus: we expect no writes).
-
-.. code:: {note}
+.. note::

   The actual sample included with Omniperf also includes the ability to
   select different operation types (such as atomics, writes). This
   abbreviated version is presented here for reference only.

-Finally, this sample code lets the user control:
-
-* The :ref:`granularity of an allocation `,
-
-* The owner of an allocation (local HBM, CPU DRAM or remote HBM), and
-
-* The size of an allocation (the default is :math:`\sim4`\ GiB)
-
-via command line arguments. In doing so, we can explore the impact of
-these parameters on the L2-Fabric metrics reported by Omniperf to
+Finally, this sample code lets the user control the
+:ref:`granularity of an allocation `, the owner of an allocation
+(local HBM, CPU DRAM, or remote HBM), and the size of an allocation (the
+default is :math:`\sim4`\ GiB) via command line arguments. In doing so, we can
+explore the impact of these parameters on the L2-Fabric metrics reported by
+Omniperf to
further understand their meaning.

.. note::

@@ -117,18 +111,27 @@ In our first experiment, we consider the simplest possible case, a

   │ 17.5.4 │ Remote Read     │         8.00 │         8.00 │         8.00 │ Req per kernel │
   ╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛

-Here, we see: - The vast majority of L2-Fabric requests (>99%) are 64B
-read requests (17.5.2) - Nearly 100% of the read requests (17.2.1) are
-homed in on the accelerator-local HBM (17.5.3), while some small
-fraction of these reads are routed to a “remote” device (17.5.4) - These
-drive a :math:`\sim40`\ GiB per kernel read-bandwidth (17.2.0)
+Here, you can make the following observations.
+
+- The vast majority of L2-Fabric requests (>99%) are 64B read requests
+  (17.5.2).
+
+- Nearly 100% of the read requests (17.2.1) are homed in on the
+  accelerator-local HBM (17.5.3), while some small fraction of these reads are
+  routed to a “remote” device (17.5.4).
+
+- These drive a :math:`\sim40`\ GiB per kernel read bandwidth (17.2.0).

-In addition, we see a small amount of `uncached `__ reads
-(17.5.1), these correspond to things like: - the assembly code to
-execute the kernel - kernel arguments - coordinate parameters (e.g.,
-blockDim.z) that were not initialized by the hardware, etc. and may
-account for some of our ‘remote’ read requests (17.5.4), e.g., reading
-from CPU DRAM.
+In addition, we see a small amount of :ref:`uncached ` reads
+(17.5.1). These correspond to things like:
+
+* The assembly code to execute the kernel
+
+* Kernel arguments
+
+* Coordinate parameters (for example, ``blockDim.z``) that were not
+  initialized by the hardware
+
+Such reads may also account for some of our “remote” read requests (17.5.4),
+for example, reading from CPU DRAM.

The above list is not exhaustive, nor are all of these guaranteed to be
‘uncached’ – the exact implementation depends on the accelerator and
@@ -136,9 +139,11 @@ ROCm versions used.
These read requests could be interrogated further in the
`Scalar L1 Data Cache `__ and `Instruction Cache `__ metric sections.

-.. code:: {note}
+.. note::

-   The Traffic metrics in Sec 17.2 are presented as a percentage of the total number of requests, e.g. 'HBM Read Traffic' is the percent of read requests (17.5.0-17.5.2) that were directed to the accelerators' local HBM (17.5.3).
+   The Traffic metrics in Sec 17.2 are presented as a percentage of the total
+   number of requests. For example, 'HBM Read Traffic' is the percent of read
+   requests (17.5.0-17.5.2) that were directed to the accelerator's local HBM
+   (17.5.3).

.. _infinity-fabric-ex2:

@@ -150,7 +155,7 @@ device-allocation to be fine-grained device memory, local to the
current accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
``hipDeviceMallocFinegrained`` flag to accomplish this.

-.. code:: {note}
+.. note::

   On some systems (e.g., those with only PCIe(r) connected accelerators), you need to set the environment variable `HSA_FORCE_FINE_GRAIN_PCIE=1` to enable this memory type.

@@ -214,7 +219,7 @@ In addition, we now see a small percentage of HBM Read Stalls (17.4.2), as
streaming fine-grained memory is putting more stress on Infinity
Fabric(tm).

-.. code:: {note}
+.. note::

   The stalls in Sec 17.4 are presented as a percentage of the total number of active L2 cycles, summed over [all L2 channels](L2).

@@ -304,11 +309,11 @@ number of HBM Read Requests (17.2.1, 17.5.3), nor HBM Read Stalls
considered “remote” (17.2.2, 17.5.4) are being routed to another
accelerator, or the CPU — in this case HIP Device 1 — and - we observe a
significantly larger percentage of AMD Infinity Fabric(tm) Read Stalls
-(17.4.1) as compared to the HBM Read Stalls in the `previous
-example `__
+(17.4.1) as compared to the HBM Read Stalls in the
+:ref:`previous example `.

These stalls correspond to reads that are going out over the AMD
-Infinity Fabric(tm) connection to another MI250 accelerator. In
+Infinity Fabric connection to another MI250 accelerator. In
addition, because these are crossing between accelerators, we expect
significantly lower achievable bandwidths as compared to the local
accelerator’s HBM – this is reflected (indirectly) in the magnitude of
@@ -320,7 +325,7 @@ only PCIe connected accelerators, these observations will differ.

Experiment #4 - Fine-grained, CPU-DRAM reads
--------------------------------------------

-In this experiment, we move our `fine-grained `__ allocation to
+In this experiment, we move our :ref:`fine-grained ` allocation to
be owned by the CPU’s DRAM. We accomplish this by allocating host-pinned
fine-grained memory using the ``hipHostMalloc`` API:

@@ -375,17 +380,21 @@ fine-grained memory using the ``hipHostMalloc`` API:

   │ 17.5.4 │ Remote Read │ 671088642.00 │ 671088642.00 │ 671088642.00 │ Req per kernel │
   ╘═════════╧═════════════════╧═══════════════╧═══════════════╧═══════════════╧════════════════╛

-Here we see *almost* the same results as in the `previous
-experiment `__, however now as we are crossing a PCIe(r)
-bus to the CPU, we see that the Infinity Fabric(tm) Read stalls (17.4.1)
-have shifted to be a PCIe(r) stall (17.4.2). In addition, as (on this
-system) the PCIe(r) bus has a lower peak bandwidth than the AMD Infinity
-Fabric(TM) connection between two accelerators, we once again observe an
+Here we see *almost* the same results as in the
+:ref:`previous experiment `. However, now that we are
+crossing a PCIe bus to the CPU, the Infinity Fabric Read stalls (17.4.1)
+have shifted to PCIe stalls (17.4.2). In addition, because (on this
+system) the PCIe bus has a lower peak bandwidth than the AMD Infinity
+Fabric connection between two accelerators, we once again observe an
increase in the percentage of stalls on this interface.
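+
+As a reference sketch, the host-pinned, fine-grained allocation described at
+the start of this experiment might look like the following. The function name
+and the explicit ``hipHostMallocCoherent`` flag are assumptions for
+illustration; the sample's exact code is not shown in this diff.
+
+.. code:: cpp
+
+   #include <hip/hip_runtime.h>
+
+   // Allocate host-pinned memory that is coherent (fine-grained), so the
+   // accelerator reads it over PCIe or Infinity Fabric rather than from HBM.
+   int* alloc_pinned_finegrained(size_t bytes) {
+     void* ptr = nullptr;
+     if (hipHostMalloc(&ptr, bytes, hipHostMallocCoherent) != hipSuccess) {
+       return nullptr;
+     }
+     return static_cast<int*>(ptr);
+   }
+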
-.. code:: {note}
+.. note::

-   Had we performed this same experiment on a [MI250X system](https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf), these transactions would again have been marked as Infinity Fabric(tm) Read stalls (17.4.1), as the CPU is connected to the accelerator via AMD Infinity Fabric.
+   Had we performed this same experiment on an
+   `MI250X system `_,
+   these transactions would again have been marked as Infinity Fabric Read
+   stalls (17.4.1), as the CPU is connected to the accelerator via AMD Infinity
+   Fabric.

.. _infinity-fabric-ex5:

diff --git a/docs/tutorial/includes/instructions-per-cycle-and-utilizations.rst b/docs/tutorial/includes/instructions-per-cycle-and-utilizations.rst
index 7e6cc398f..91f8389e6 100644
--- a/docs/tutorial/includes/instructions-per-cycle-and-utilizations.rst
+++ b/docs/tutorial/includes/instructions-per-cycle-and-utilizations.rst
@@ -243,6 +243,8 @@ cycles `__. We will explore that in the
[subsequent section](Internal_ipc). For these reasons, Omniperf typically
promotes use of the regular IPC metric (11.2.0), e.g., in the top-level
Speed-of-Light chart.

+.. _ipc-internal-instructions:
+
(Internal_ipc)=
### “Internal” instructions and IPC
Next, we explore the concept of an “internal” instruction. From `Layla
diff --git a/docs/tutorial/includes/lds-examples.rst b/docs/tutorial/includes/lds-examples.rst
index ae499f1fd..85586fb58 100644
--- a/docs/tutorial/includes/lds-examples.rst
+++ b/docs/tutorial/includes/lds-examples.rst
@@ -84,7 +84,7 @@ Recall our definition of this metric:

   Indicates the maximum amount of bytes that could have been loaded
   from/stored to/atomically updated in the LDS per
-   `normalization-unit `__.
+   :ref:`normalization unit `.

Here we see that this instruction *could* have loaded up to 256 bytes of
data (4 bytes for each work-item in the wavefront), and therefore this
diff --git a/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst b/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
index 85dbfecb6..59246c0db 100644
--- a/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
+++ b/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
@@ -16,7 +16,7 @@ optimization / dead-code elimination by the compiler. While inline
assembly is inherently unportable, this example is expected to work on
all GCN GPUs and CDNA accelerators.

-We reproduce a sample of the kernel below:
+We reproduce a sample of the kernel as follows:

.. code:: cpp

         : "v"(f2));

These instructions correspond to:
-- A 32-bit floating point addition,
-- A 32-bit floating point multiplication,
-- A 32-bit floating point square-root transcendental operation, and
-- A 32-bit floating point fused multiply-add operation.
+
+* A 32-bit floating point addition,
+
+* A 32-bit floating point multiplication,
+
+* A 32-bit floating point square-root transcendental operation, and
+
+* A 32-bit floating point fused multiply-add operation.

For more detail, refer to, for example, the `CDNA2 ISA
Guide `__.
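+
+As context for the snippet above, a self-contained kernel in this style might
+look like the following sketch. This is a hypothetical reconstruction, not
+the actual Omniperf sample; each operation sits in its own ``asm volatile``
+statement so the compiler cannot remove it.
+
+.. code:: cpp
+
+   // Hypothetical sketch: issue one VALU instruction per asm statement.
+   __global__ void valu_mix(float f1, float f2) {
+     float result;
+     asm volatile("v_add_f32_e32 %0, %1, %2" : "=v"(result) : "v"(f1), "v"(f2));
+     asm volatile("v_mul_f32_e32 %0, %1, %2" : "=v"(result) : "v"(f1), "v"(f2));
+     asm volatile("v_sqrt_f32 %0, %1" : "=v"(result) : "v"(f1));
+     asm volatile("v_fma_f32 %0, %1, %2, %0" : "+v"(result) : "v"(f1), "v"(f2));
+   }
+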
@@ -42,20 +46,20 @@ Guide `_ - -Omniperf example exercises - ``__ - diff --git a/docs/what-is-omniperf.rst b/docs/what-is-omniperf.rst index 5539ce8b8..41a39182d 100644 --- a/docs/what-is-omniperf.rst +++ b/docs/what-is-omniperf.rst @@ -35,20 +35,20 @@ Core Omniperf profiler Grafana server for Omniperf * **Grafana database import**: All raw performance counters are imported into - the a :ref:`backend MongoDB database ` to support + a :ref:`backend MongoDB database ` to support analysis and visualization in the Grafana GUI. Compatibility with previously generated data using older Omniperf versions is not guaranteed. * **Grafana analysis dashboard GUI**: The - :doc:`Grafana dashboard ` retrieves the raw counters - information from the backend database. It displays the relevant + :doc:`Grafana dashboard ` retrieves the raw + counters information from the backend database. It displays the relevant performance metrics and visualization. Omniperf standalone GUI analyzer - Omniperf provides a :doc:`standalone GUI ` to enable - basic performance analysis without the need to import data into a database - instance. + Omniperf provides a :doc:`standalone GUI ` to + enable basic performance analysis without the need to import data into a + database instance. -.. figure:: ./data/omniperf_server_vs_client_install.png +.. figure:: ./data/install/omniperf_server_vs_client_install.png :align: center :alt: Architectural design of Omniperf @@ -116,4 +116,4 @@ high level. * Baseline comparisons -* Multiple normalizations +* :ref:`Multiple normalizations `
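+
+For instance, both of these features are exposed through ``omniperf analyze``
+flags. The workload paths below are hypothetical; check
+``omniperf analyze --help`` for the options supported by your version.
+
+.. code-block:: shell
+
+   $ omniperf analyze -p workloads/vcopy/MI200 -p workloads/vcopy-opt/MI200 --normal-unit per_kernel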