Commit
fix words and formatting
Signed-off-by: Peter Jun Park <[email protected]>

peterjunpark committed Jul 9, 2024
1 parent 184f18a commit dee9b62
Showing 45 changed files with 4,044 additions and 2,483 deletions.
2 changes: 1 addition & 1 deletion .github/CODEOWNERS
@@ -1,7 +1,7 @@
* @koomie @coleramos425

# Documentation files
docs/* @ROCm/rocm-documentation
docs/ @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
.readthedocs.yaml @ROCm/rocm-documentation
137 changes: 95 additions & 42 deletions docs/concept/command-processor.rst
@@ -2,95 +2,148 @@
Command processor (CP)
**********************

The command processor (CP) is responsible for interacting with the AMDGPU Kernel
Driver (a.k.a., the Linux Kernel) on the CPU and
for interacting with user-space HSA clients when they submit commands to
HSA queues. Basic tasks of the CP include reading commands (e.g.,
corresponding to a kernel launch) out of `HSA
Queues <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__
(Sec. 2.5), scheduling work to subsequent parts of the scheduler
pipeline, and marking kernels complete for synchronization events on the
host.

The command processor is composed of two sub-components:

- Fetcher (CPF): Fetches commands out of memory to hand them over to
the CPC for processing
- Packet Processor (CPC): The micro-controller running the command
processing firmware that decodes the fetched commands, and (for
kernels) passes them to the `Workgroup Processors <SPI>`__ for
scheduling

Before scheduling work to the accelerator, the command-processor can
first acquire a memory fence to ensure system consistency `(Sec
2.6.4) <http://hsafoundation.com/wp-content/uploads/2021/02/HSA-Runtime-1.2.pdf>`__.
After the work is complete, the command-processor can apply a
memory-release fence. Depending on the AMD CDNA accelerator under
question, either of these operations *may* initiate a cache write-back
or invalidation.
The command processor (CP) is responsible for interacting with the AMDGPU kernel
driver -- the Linux kernel -- on the CPU and for interacting with user-space
HSA clients when they submit commands to HSA queues. Basic tasks of the CP
include reading commands (such as those corresponding to a kernel launch) out of
:hsa-runtime-pdf:`HSA queues <68>`, scheduling work to subsequent parts of the
scheduler pipeline, and marking kernels complete for synchronization events on
the host.

The command processor consists of two sub-components:

* :ref:`Fetcher <cpf-metrics>` (CPF): Fetches commands out of memory to hand
them over to the CPC for processing.

* :ref:`Packet processor <cpc-metrics>` (CPC): Micro-controller running the
command processing firmware that decodes the fetched commands and (for
kernels) passes them to the :ref:`workgroup processors <desc-spi>` for
scheduling.

Before scheduling work to the accelerator, the command processor can
first acquire a memory fence to ensure system consistency
:hsa-runtime-pdf:`Section 2.6.4 <91>`. After the work is complete, the
command processor can apply a memory-release fence. Depending on the AMD CDNA
accelerator in question, either of these operations *might* initiate a cache
write-back or invalidation.
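For orientation, the commands the CPF reads out of an HSA queue are fixed-size
64-byte AQL packets. The following is a minimal sketch of the kernel-dispatch
packet layout, with field names following ``hsa_kernel_dispatch_packet_t`` from
the HSA runtime specification; it is illustrative only, not an Omniperf or ROCm
API:

```python
import struct

# Field layout of an HSA kernel-dispatch (AQL) packet: this is the kind of
# command the CPF fetches from an HSA queue and the CPC decodes.
# Little-endian; names follow hsa_kernel_dispatch_packet_t.
AQL_DISPATCH_FORMAT = "<" + "".join((
    "HH",    # header, setup (dimensions)
    "HHHH",  # workgroup_size_x/y/z, reserved0
    "III",   # grid_size_x/y/z
    "II",    # private_segment_size, group_segment_size
    "Q",     # kernel_object (handle to the kernel code descriptor)
    "QQ",    # kernarg_address, reserved2
    "Q",     # completion_signal (used to mark the kernel complete on the host)
))

# Every AQL packet is exactly 64 bytes.
assert struct.calcsize(AQL_DISPATCH_FORMAT) == 64
```

The fixed 64-byte size is what makes the fetch stage simple: the CPF can stream
packets out of the queue's ring buffer without parsing them first.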

Analyzing command processor performance is most interesting for kernels
that the user suspects to be scheduling/launch-rate limited. The command
processor’s metrics therefore are focused on reporting, e.g.:
that you suspect to be limited by scheduling or launch rate. The command
processor’s metrics are therefore focused on reporting, for example:

* Utilization of the fetcher

* Utilization of the packet processor, and of its packet decoding

* Stalls in fetching and processing

- Utilization of the fetcher
- Utilization of the packet processor, and decoding processing packets
- Fetch/processing stalls
.. _cpf-metrics:

Command Processor Fetcher (CPF) metrics
=======================================

.. list-table::
:header-rows: 1
:widths: 20 65 15

* - Metric

- Description

- Unit

* - CPF Utilization
- Percent of total cycles where the CPF was busy actively doing any work. The ratio of CPF busy cycles over total cycles counted by the CPF.

- Percent of total cycles where the CPF was busy actively doing any work.
The ratio of CPF busy cycles over total cycles counted by the CPF.

- Percent

* - CPF Stall

- Percent of CPF busy cycles where the CPF was stalled for any reason.

- Percent

* - CPF-L2 Utilization
- Percent of total cycles counted by the CPF-[L2](L2) interface where the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles over total cycles counted by the CPF-L2.

- Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
where the CPF-L2 interface was active doing any work. The ratio of CPF-L2
busy cycles over total cycles counted by the CPF-L2.

- Percent

* - CPF-L2 Stall
- Percent of CPF-L2 busy cycles where the CPF-[L2](L2) interface was stalled for any reason.

- Percent of CPF-L2 busy cycles where the CPF-:doc:`L2 <l2-cache>`
interface was stalled for any reason.

- Percent

* - CPF-UTCL1 Stall
- Percent of CPF busy cycles where the CPF was stalled by address translation.

- Percent of CPF busy cycles where the CPF was stalled by address
translation.

- Percent
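As a sanity check on how these ratios compose, here is a small sketch using
hypothetical counter values (not Omniperf output); note that CPF Stall is
reported relative to *busy* cycles, not total cycles:

```python
def utilization_pct(busy_cycles: int, total_cycles: int) -> float:
    """Percent of a cycle total during which a unit was busy (or stalled)."""
    return 100.0 * busy_cycles / total_cycles

# Hypothetical raw counter values, for illustration only.
cpf_total, cpf_busy, cpf_stalled = 10_000, 6_000, 1_500

cpf_utilization = utilization_pct(cpf_busy, cpf_total)   # busy / total -> 60.0
cpf_stall = utilization_pct(cpf_stalled, cpf_busy)       # stalled / busy -> 25.0
```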

.. _cpc-metrics:

Command Processor Packet Processor (CPC) metrics
================================================

.. list-table::
:header-rows: 1
:widths: 20 65 15

* - Metric

- Description

- Unit

* - CPC Utilization
- Percent of total cycles where the CPC was busy actively doing any work. The ratio of CPC busy cycles over total cycles counted by the CPC.

- Percent of total cycles where the CPC was busy actively doing any work.
The ratio of CPC busy cycles over total cycles counted by the CPC.

- Percent

* - CPC Stall

- Percent of CPC busy cycles where the CPC was stalled for any reason.

- Percent

* - CPC Packet Decoding Utilization

- Percent of CPC busy cycles spent decoding commands for processing.

- Percent

* - CPC-Workgroup Manager Utilization
- Percent of CPC busy cycles spent dispatching workgroups to the [Workgroup Manager](SPI).

- Percent of CPC busy cycles spent dispatching workgroups to the
:ref:`workgroup manager <desc-spi>`.

- Percent

* - CPC-L2 Utilization
- Percent of total cycles counted by the CPC-[L2](L2) interface where the CPC-L2 interface was active doing any work.

- Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
where the CPC-L2 interface was active doing any work.

- Percent

* - CPC-UTCL1 Stall
- Percent of CPC busy cycles where the CPC was stalled by address translation.

- Percent of CPC busy cycles where the CPC was stalled by address
translation.

- Percent

* - CPC-UTCL2 Utilization
- Percent of total cycles counted by the CPC's L2 address translation interface where the CPC was busy doing address translation work.

- Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.

- Percent
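As with the CPF, the stall and decoding metrics above are expressed relative to
CPC *busy* cycles. To estimate such a metric as a share of all cycles instead,
scale it by the utilization; a sketch with hypothetical numbers:

```python
cpc_utilization_pct = 40.0    # CPC busy cycles as a percent of total cycles
cpc_stall_pct_of_busy = 30.0  # stalled cycles as a percent of busy cycles

# Share of *all* CPC cycles spent stalled.
cpc_stall_pct_of_total = cpc_utilization_pct * cpc_stall_pct_of_busy / 100.0
assert cpc_stall_pct_of_total == 12.0
```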

27 changes: 15 additions & 12 deletions docs/concept/compute-unit.rst
@@ -17,35 +17,38 @@ The CU consists of several independent execution pipelines and functional units.
executing much of the computational work on CDNA accelerators, including but
not limited to floating-point operations (FLOPs) and integer operations
(IOPs).

* The *vector memory (VMEM)* unit is responsible for issuing loads, stores and
atomic operations that interact with the memory system.

* The :ref:`desc-salu` is shared by all threads in a
:ref:`wavefront <desc-wavefront>`, and is responsible for executing instructions that are
known to be uniform across the wavefront at compile-time. The SALU has a
memory unit (SMEM) for interacting with memory, but it cannot issue separately
from the SALU.

* The :ref:`desc-lds` is an on-CU software-managed scratchpad memory
that can be used to efficiently share data between all threads in a
[workgroup](workgroup).
:ref:`workgroup <desc-workgroup>`.

* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions for all
the [wavefronts](wavefront) on the compute unit.
* The *vector L1 data cache (vL1D)* is the first level cache local to the
compute unit. On current CDNA accelerators, the vL1D is write-through. The
vL1D caches from multiple compute units are kept coherent with one another
through software instructions.
the :ref:`wavefronts <desc-wavefront>` on the compute unit.

* The :doc:`vector L1 data cache (vL1D) <vector-l1-cache>` is the first level
cache local to the compute unit. On current CDNA accelerators, the vL1D is
write-through. The vL1D caches from multiple compute units are kept coherent
with one another through software instructions.

* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
specialized matrix-multiplication accelerator pipelines known as the
:ref:`desc-mfma`.

For a more in-depth description of a compute unit on a CDNA accelerator, see
:hip-training-2019:`22` and :gcn-crash-course:`27`.
:hip-training-pdf:`22` and :gcn-crash-course:`27`.

:ref:`pipeline-desc` details the various
execution pipelines (VALU, SALU, LDS, Scheduler, etc.). The metrics
presented by Omniperf for these pipelines are described in
:ref:`pipeline-metrics`. Finally, the `vL1D <vL1D>`__ cache and
:ref:`LDS <desc-lds>` will be described their own sections.

.. include:: ./includes/pipeline-descriptions.rst
:ref:`pipeline-metrics`. The :doc:`vL1D <vector-l1-cache>` cache and
:doc:`LDS <local-data-share>` are described in their own sections.

.. include:: ./includes/pipeline-metrics.rst
109 changes: 109 additions & 0 deletions docs/concept/definitions.rst
@@ -0,0 +1,109 @@
.. meta::
:description: Omniperf terminology and definitions
:keywords: Omniperf, ROCm, glossary, definitions, terms, profiler, tool,
Instinct, accelerator, AMD

***********
Definitions
***********

The following table briefly defines some terminology used in Omniperf interfaces
and in this documentation.

.. include:: ./includes/terms.rst

.. include:: ./includes/normalization-units.rst

.. _memory-spaces:

Memory spaces
=============

AMD Instinct MI accelerators can access memory through multiple address spaces
which may map to different physical memory locations on the system. The
following table provides a view into how various types of memory used
in HIP map onto these constructs:

.. list-table::
:header-rows: 1

* - LLVM Address Space
- Hardware Memory Space
- HIP Terminology

* - Generic
- Flat
- N/A

* - Global
- Global
- Global

* - Local
- LDS
- LDS/Shared

* - Private
- Scratch
- Private

* - Constant
- Same as global
- Constant

The following is a high-level description of the address spaces in the AMDGPU
backend of LLVM:

.. list-table::
:header-rows: 1

* - Address space
- Description

* - Global
- Memory that can be seen by all threads in a process, and may be backed by
the local accelerator's HBM, a remote accelerator's HBM, or the CPU's
DRAM.

* - Local
- Memory that is only visible to a particular workgroup. On AMD's Instinct
accelerator hardware, this is stored in :ref:`LDS <local-data-share>`
memory.

* - Private
- Memory that is only visible to a particular work-item
(thread), stored in the scratch space on AMD's Instinct accelerators.

* - Constant
- Read-only memory that is in the global address space and stored on the
local accelerator's HBM.

* - Generic
- Used when the compiler cannot statically prove that a pointer is
addressing memory in a single (non-generic) address space. Mapped to Flat
on AMD's Instinct accelerators, the pointer could dynamically address
global, local, private or constant memory.

`LLVM's documentation for AMDGPU Backend <https://llvm.org/docs/AMDGPUUsage.html#address-spaces>`_
has the most up-to-date information. Refer to this source for a more complete
explanation.
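The first table above amounts to a direct mapping between the three naming
schemes. A toy lookup, transcribed from the table (illustrative only):

```python
# LLVM address space -> (hardware memory space, HIP terminology),
# per the table above. None marks entries with no HIP-level name.
ADDRESS_SPACES = {
    "Generic":  ("Flat",           None),
    "Global":   ("Global",         "Global"),
    "Local":    ("LDS",            "LDS/Shared"),
    "Private":  ("Scratch",        "Private"),
    "Constant": ("Same as global", "Constant"),
}

hw_space, hip_name = ADDRESS_SPACES["Local"]
assert hw_space == "LDS" and hip_name == "LDS/Shared"
```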

.. _memory-type:

Memory type
===========

AMD Instinct accelerators contain a number of different memory allocation
types to enable the HIP language's
:doc:`memory coherency model <hip:how-to/programming_manual>`.
These memory types are broadly similar between AMD Instinct accelerator
generations, but may differ in exact implementation.

In addition, these memory types *might* differ between accelerators on the same
system, even when accessing the same memory allocation.

For example, an :ref:`MI2XX <mixxx-note>` accelerator accessing *fine-grained*
memory allocated local to that device may see the allocation as coherently
cacheable, while a remote accelerator might see the same allocation as
*uncached*.
