more reorg

Signed-off-by: Peter Jun Park <[email protected]> clean up Signed-off-by: Peter Jun Park <[email protected]> reorg images move profile mode reorg reorg reorg more fix formatting fix headings ref anchor mi2xx note add extlinks add extlinks Signed-off-by: Peter Jun Park <[email protected]> black format fix formatting, anchors Signed-off-by: Peter Jun Park <[email protected]> reorg
ROCm · Jun 28, 2024 · 133cc9d · 133cc9d
1 parent 18effce
commit 133cc9d

Showing 113 changed files with 6,394 additions and 1,362 deletions.
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -1,4 +1,4 @@
-name: Lint Documentation
+name: Linting
 
 on:
   push:

diff --git a/docs/concept/command-processor.rst b/docs/concept/command-processor.rst
@@ -0,0 +1,96 @@
+**********************
+Command processor (CP)
+**********************
+
+The command processor (CP) is responsible for interacting with the AMDGPU kernel
+driver (part of the Linux kernel) on the CPU and
+for interacting with user-space HSA clients when they submit commands to
+HSA queues. Basic tasks of the CP include reading commands (e.g.,
+corresponding to a kernel launch) out of `HSA
+Queues <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__
+(Sec. 2.5), scheduling work to subsequent parts of the scheduler
+pipeline, and marking kernels complete for synchronization events on the
+host.
+
+The command processor is composed of two sub-components:
+
+-  Fetcher (CPF): Fetches commands out of memory to hand them over to
+   the CPC for processing
+-  Packet Processor (CPC): The micro-controller running the command
+   processing firmware that decodes the fetched commands, and (for
+   kernels) passes them to the :ref:`workgroup manager <def-spi>` for
+   scheduling
+
+Before scheduling work to the accelerator, the command processor can
+first acquire a memory fence to ensure system consistency `(Sec
+2.6.4) <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__.
+After the work is complete, the command processor can apply a
+memory-release fence. Depending on the AMD CDNA accelerator in
+question, either of these operations *may* initiate a cache write-back
+or invalidation.
+
+Analyzing command processor performance is most interesting for kernels
+that the user suspects to be scheduling/launch-rate limited. The command
+processor’s metrics therefore focus on reporting, for example:
+
+-  Utilization of the fetcher
+-  Utilization of the packet processor, and time spent decoding and processing packets
+-  Fetch/processing stalls
+
+Command Processor Fetcher (CPF) metrics
+=======================================
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+     - Description
+     - Unit
+   * - CPF Utilization
+     - Percent of total cycles where the CPF was busy actively doing any work.  The ratio of CPF busy cycles over total cycles counted by the CPF.
+     - Percent
+   * - CPF Stall
+     - Percent of CPF busy cycles where the CPF was stalled for any reason.
+     - Percent
+   * - CPF-L2 Utilization
+     - Percent of total cycles counted by the CPF-:ref:`L2 <def-l2>` interface where the CPF-L2 interface was active doing any work.  The ratio of CPF-L2 busy cycles over total cycles counted by the CPF-L2.
+     - Percent
+   * - CPF-L2 Stall
+     - Percent of CPF-L2 busy cycles where the CPF-:ref:`L2 <def-l2>` interface was stalled for any reason.
+     - Percent
+   * - CPF-UTCL1 Stall
+     - Percent of CPF busy cycles where the CPF was stalled by address translation. 
+     - Percent
+
+Command Processor Packet Processor (CPC) metrics
+================================================
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 65 15
+
+   * - Metric
+     - Description
+     - Unit
+   * - CPC Utilization
+     - Percent of total cycles where the CPC was busy actively doing any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
+     - Percent
+   * - CPC Stall
+     - Percent of CPC busy cycles where the CPC was stalled for any reason.
+     - Percent
+   * - CPC Packet Decoding Utilization
+     - Percent of CPC busy cycles spent decoding commands for processing.
+     - Percent
+   * - CPC-Workgroup Manager Utilization
+     - Percent of CPC busy cycles spent dispatching workgroups to the :ref:`workgroup manager <def-spi>`.
+     - Percent
+   * - CPC-L2 Utilization
+     - Percent of total cycles counted by the CPC-:ref:`L2 <def-l2>` interface where the CPC-L2 interface was active doing any work.
+     - Percent
+   * - CPC-UTCL1 Stall
+     - Percent of CPC busy cycles where the CPC was stalled by address translation.
+     - Percent
+   * - CPC-UTCL2 Utilization
+     - Percent of total cycles counted by the CPC's L2 address translation interface where the CPC was busy doing address translation work.
+     - Percent
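As a reading aid for the two metric tables above, the sketch below shows how the utilization and stall percentages relate to raw cycle counts: utilization divides busy cycles by total cycles, while stall divides stalled cycles by busy cycles. The counter values and variable names are hypothetical and chosen for illustration only; they are not Omniperf counter names or measured results.

```python
def percent(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage (0.0 if denominator is 0)."""
    if denominator == 0:
        return 0.0
    return 100.0 * numerator / denominator


# Hypothetical raw cycle counts for the fetcher (CPF); illustrative values only.
cpf_total_cycles = 1_000_000  # all cycles counted by the CPF
cpf_busy_cycles = 250_000     # cycles where the CPF was doing any work
cpf_stall_cycles = 50_000     # busy cycles where the CPF was stalled

# CPF Utilization: busy cycles over total cycles.
cpf_utilization = percent(cpf_busy_cycles, cpf_total_cycles)

# CPF Stall: stalled cycles over *busy* cycles, not total cycles.
cpf_stall = percent(cpf_stall_cycles, cpf_busy_cycles)

print(f"CPF Utilization: {cpf_utilization:.1f}%")
print(f"CPF Stall:       {cpf_stall:.1f}%")
```

Note the different denominators: a high stall percentage on a mostly idle fetcher may matter less than a moderate one on a fetcher that is busy for most of the kernel.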
diff --git a/docs/conceptual/includes/compute-unit.rst → docs/concept/compute-unit.rst b/docs/conceptual/includes/compute-unit.rst → docs/concept/compute-unit.rst
@@ -1,7 +1,6 @@
-.. _def-compute-unit:
-
-Compute unit
-============
+*****************
+Compute unit (CU)
+*****************
 
 The compute unit (CU) is responsible for executing a user's kernels on
 CDNA-based accelerators. All :ref:`wavefronts` of a :ref:`workgroup` are
@@ -10,36 +9,43 @@ scheduled on the same CU.
 .. image:: ../data/performance-model/gcn_compute_unit.png
     :alt: AMD CDNA accelerator compute unit diagram
 
-The CU consists of several independent pipelines and functional units.
+The CU consists of several independent execution pipelines and functional units.
 
-* The *vector arithmetic logic unit (VALU)* is composed of multiple SIMD (single
+* The :ref:`desc-valu` is composed of multiple SIMD (single
   instruction, multiple data) vector processors, vector general purpose
   registers (VGPRs) and instruction buffers. The VALU is responsible for
   executing much of the computational work on CDNA accelerators, including but
   not limited to floating-point operations (FLOPs) and integer operations
   (IOPs).
 * The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
   atomic operations that interact with the memory system.
-* The *scalar arithmetic logic unit (SALU)* is shared by all threads in a
+* The :ref:`desc-salu` is shared by all threads in a
   [wavefront](wavefront), and is responsible for executing instructions that are
   known to be uniform across the wavefront at compile-time. The SALU has a
   memory unit (SMEM) for interacting with memory, but it cannot issue separately
   from the SALU.
-* The *local data share (LDS)* is an on-CU software-managed scratchpad memory
+* The :ref:`desc-lds` is an on-CU software-managed scratchpad memory
   that can be used to efficiently share data between all threads in a
   [workgroup](workgroup).
-* The *scheduler* is responsible for issuing and decoding instructions for all
+* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions for all
   the [wavefronts](wavefront) on the compute unit.
 * The *vector L1 data cache (vL1D)* is the first level cache local to the
   compute unit. On current CDNA accelerators, the vL1D is write-through. The
   vL1D caches from multiple compute units are kept coherent with one another
   through software instructions.
 * CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
   specialized matrix-multiplication accelerator pipelines known as the
-  [matrix fused multiply-add (MFMA)](mfma).
+  :ref:`desc-mfma`.
 
 For a more in-depth description of a compute unit on a CDNA accelerator, see
-slides 22 to 28 in
-`An introduction to AMD GPU Programming with HIP <https://www.olcf.ornl.gov/wp-content/uploads/2019/09/AMD_GPU_HIP_training_20190906.pdf>`_
-and slide 27 in
-`The AMD GCN Architecture - A Crash Course (Layla Mah) <https://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah>`_.
+:hip-training-2019:`22` and :gcn-crash-course:`27`.
+
+:ref:`pipeline-desc` details the various
+execution pipelines (VALU, SALU, LDS, Scheduler, etc.). The metrics
+presented by Omniperf for these pipelines are described in
+:ref:`pipeline-metrics`. Finally, the `vL1D <vL1D>`__ cache and
+:ref:`LDS <desc-lds>` are described in their own sections.
+
+.. include:: ./includes/pipeline-descriptions.rst
+
+.. include:: ./includes/pipeline-metrics.rst
diff --git a/docs/concept/glossary.rst b/docs/concept/glossary.rst
@@ -0,0 +1,225 @@
+.. meta::
+   :description: Omniperf documentation and reference
+   :keywords: Omniperf, ROCm, glossary, definitions, terms, profiler, tool,
+              Instinct, accelerator, AMD
+
+********
+Glossary
+********
+
+The following table briefly defines some terminology used in Omniperf interfaces
+and in this documentation.
+
+.. list-table::
+   :header-rows: 1
+
+   * - Name
+     - Description
+     - Unit
+
+   * - Kernel time
+     - The number of seconds the accelerator was executing a kernel, from the
+       :ref:`command processor <def-cp>`'s (CP) start-of-kernel
+       timestamp (a number of cycles after the CP begins processing the packet)
+       to the CP's end-of-kernel timestamp (a number of cycles before the CP
+       stops processing the packet).
+     - Seconds
+
+   * - Kernel cycles
+     - The number of cycles the accelerator was active doing *any* work, as
+       measured by the :ref:`command processor <def-cp>` (CP).
+     - Cycles
+
+   * - Total CU cycles
+     - The number of cycles the accelerator was active doing *any* work
+       (that is, kernel cycles), multiplied by the number of
+       :doc:`compute units <compute-unit>` on the accelerator. A
+       measure of the total possible active cycles the compute units could be
+       doing work, useful for the normalization of metrics inside the CU.
+     - Cycles
+
+   * - Total active CU cycles
+     - The number of cycles a CU on the accelerator was active doing *any*
+       work, summed over all :ref:`compute units <def-cu>` on the
+       accelerator.
+     - Cycles
+
+   * - Total SIMD cycles
+     - The number of cycles the accelerator was active doing *any* work (that
+       is, kernel cycles), multiplied by the number of
+       :ref:`SIMDs <def-cu>` on the accelerator. A measure of the
+       total possible active cycles the SIMDs could be doing work, useful for
+       the normalization of metrics inside the CU.
+     - Cycles
+
+   * - Total L2 cycles
+     - The number of cycles the accelerator was active doing *any* work (that
+       is, kernel cycles), multiplied by the number of :ref:`L2 <def-l2>`
+       channels on the accelerator. A measure of the total possible active
+       cycles the L2 channels could be doing work, useful for the normalization
+       of metrics inside the L2.
+     - Cycles
+
+   * - Total active L2 cycles
+     - The number of cycles a channel of the L2 cache was active doing *any*
+       work, summed over all :ref:`L2 <def-l2>` channels on the accelerator.
+     - Cycles
+
+   * - Total sL1D cycles
+     - The number of cycles the accelerator was active doing *any* work (that
+       is, kernel cycles), multiplied by the number of
+       :ref:`scalar L1 data caches <def-sl1d>` on the accelerator. A measure of
+       the total possible active cycles the sL1Ds could be doing work, useful
+       for the normalization of metrics inside the sL1D.
+     - Cycles
+
+   * - Total L1I cycles
+     - The number of cycles the accelerator was active doing *any* work (that
+       is, kernel cycles), multiplied by the number of
+       :ref:`L1 instruction caches <def-l1i>` (L1I) on the accelerator. A
+       measure of the total possible active cycles the L1Is could be doing
+       work, useful for the normalization of metrics inside the L1I.
+     - Cycles
+
+   * - Total scheduler-pipe cycles
+     - The number of cycles the accelerator was active doing *any* work (that
+       is, kernel cycles), multiplied by the number of
+       :ref:`scheduler pipes <def-cp>` on the accelerator. A measure of the
+       total possible active cycles the scheduler-pipes could be doing work,
+       useful for the normalization of metrics inside the
+       :ref:`workgroup manager <def-spi>` and :ref:`command processor <def-cp>`.
+     - Cycles
+
+   * - Total shader-engine cycles
+     - The total number of cycles the accelerator was active doing *any* work,
+       multiplied by the number of :ref:`shader engines <def-se>` on the
+       accelerator. A measure of the total possible active cycles the shader
+       engines could be doing work, useful for the normalization of
+       metrics inside the :ref:`workgroup manager <def-spi>`.
+     - Cycles
+
+   * - Thread-requests
+     - The number of unique memory addresses accessed by a single memory
+       instruction. On AMD Instinct accelerators, this has a maximum of 64
+       (that is, the size of the :ref:`wavefront <def-wavefront>`).
+     - Addresses
+
+   * - Work-item
+     - A single *thread*, or lane, of execution that executes in lockstep with
+       the rest of the work-items comprising a :ref:`wavefront <def-wavefront>`
+       of execution.
+     - N/A
+
+   * - Wavefront
+     - A group of work-items, or threads, that execute in lockstep on the
+       :ref:`compute unit <def-cu>`. On AMD Instinct accelerators, the
+       wavefront size is always 64 work-items.
+     - N/A
+
+   * - Workgroup
+     - A group of wavefronts that execute on the same
+       :ref:`compute unit <def-cu>`, and can cooperatively execute and share
+       data via the use of synchronization primitives, :ref:`LDS <def-lds>`,
+       atomics, and others.
+     - N/A
+
+   * - Divergence
+     - Divergence within a wavefront occurs when not all work-items are active
+       when executing an instruction, that is, due to non-uniform control flow
+       within a wavefront. Can reduce execution efficiency by causing,
+       for instance, the :ref:`VALU <def-valu>` to need to execute both
+       branches of a conditional with different sets of work-items active.
+     - N/A
+
+.. include:: ./includes/normalization-units.rst
+
+.. _memory-spaces:
+
+Memory spaces
+=============
+
+AMD Instinct MI accelerators can access memory through multiple address spaces
+which may map to different physical memory locations on the system. The
+following table provides a view of how various types of memory used
+in HIP map onto these constructs:
+
+.. list-table::
+   :header-rows: 1
+
+   * - LLVM Address Space
+     - Hardware Memory Space
+     - HIP Terminology
+
+   * - Generic
+     - Flat
+     - N/A
+
+   * - Global
+     - Global
+     - Global
+
+   * - Local
+     - LDS
+     - LDS/Shared
+
+   * - Private
+     - Scratch
+     - Private
+
+   * - Constant
+     - Same as global
+     - Constant
+
+Below is a high-level description of the address spaces in the AMDGPU backend
+of LLVM:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Address space
+     - Description
+
+   * - Global
+     - Memory that can be seen by all threads in a process, and may be backed by
+       the local accelerator's HBM, a remote accelerator's HBM, or the CPU's
+       DRAM.
+
+   * - Local
+     - Memory that is only visible to a particular workgroup. On AMD's Instinct
+       accelerator hardware, this is stored in :ref:`LDS <def-lds>` memory.
+
+   * - Private
+     - Memory that is only visible to a particular work-item
+       (thread), stored in the scratch space on AMD's Instinct(tm) accelerators.
+
+   * - Constant
+     - Read-only memory that is in the global address space and stored on the
+       local accelerator's HBM.
+
+   * - Generic
+     - Used when the compiler cannot statically prove that a pointer is
+       addressing memory in a single (non-generic) address space. Mapped to Flat
+       on AMD's Instinct(tm) accelerators, the pointer could dynamically address
+       global, local, private or constant memory.
+
+`LLVM's documentation for AMDGPU Backend <https://llvm.org/docs/AMDGPUUsage.html#address-spaces>`__
+will always have the most up-to-date information, and the interested reader is
+referred to this source for a more complete explanation.
+
+.. _memory-type:
+
+Memory type
+===========
+
+AMD Instinct accelerators contain a number of different memory allocation
+types to enable the HIP language's
+:doc:`memory coherency model <hip:how-to/programming_manual>`.
+These memory types are broadly similar between AMD Instinct accelerator
+generations, but may differ in exact implementation.
+
+In addition, these memory types *might* differ between accelerators on the same
+system, even when accessing the same memory allocation.
+
+For example, an :ref:`MI2XX <mixxx-note>` accelerator accessing "fine-grained"
+memory allocated local to that device may see the allocation as coherently
+cacheable, while a remote accelerator might see the same allocation as uncached.
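The "Total ... cycles" entries in the glossary table all follow one pattern: kernel cycles multiplied by the number of instances of a hardware unit. The sketch below illustrates that arithmetic. The `AcceleratorConfig` class and its unit counts are hypothetical, introduced only for this example, and do not describe any specific AMD Instinct product or Omniperf data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceleratorConfig:
    """Hypothetical unit counts for an accelerator (illustrative only)."""
    num_cus: int          # compute units on the accelerator
    num_simds: int        # SIMDs on the accelerator
    num_l2_channels: int  # L2 channels on the accelerator


def total_cycles(kernel_cycles: int, num_units: int) -> int:
    """Upper bound on active cycles across all instances of a unit."""
    return kernel_cycles * num_units


# Assumed values for illustration; not taken from a real device or profile.
cfg = AcceleratorConfig(num_cus=100, num_simds=400, num_l2_channels=16)
kernel_cycles = 2_000

total_cu_cycles = total_cycles(kernel_cycles, cfg.num_cus)
total_simd_cycles = total_cycles(kernel_cycles, cfg.num_simds)
total_l2_cycles = total_cycles(kernel_cycles, cfg.num_l2_channels)

print(total_cu_cycles, total_simd_cycles, total_l2_cycles)
```

These totals serve as denominators when normalizing a metric: an event count divided by the matching total gives a per-unit, per-cycle rate that can be compared across kernels of different durations.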
diff --git a/docs/concept/includes/compute-unit.rst b/docs/concept/includes/compute-unit.rst
@@ -0,0 +1 @@
+
diff --git a/docs/concept/includes/l2-cache.rst b/docs/concept/includes/l2-cache.rst
diff --git a/docs/concept/includes/memory-spaces.rst b/docs/concept/includes/memory-spaces.rst
diff --git a/docs/concept/includes/memory-types.rst b/docs/concept/includes/memory-types.rst