ROCm · peterjunpark · Jul 31, 2024 · May 6, 2024 · May 9, 2024 · May 24, 2024
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -1,6 +1,7 @@
 * @koomie @coleramos425
 
 # Documentation files
-docs/* @ROCm/rocm-documentation
+docs/ @ROCm/rocm-documentation
 *.md @ROCm/rocm-documentation
 *.rst @ROCm/rocm-documentation
+.readthedocs.yaml @ROCm/rocm-documentation
diff --git a/.github/workflows/dependabot.yml b/.github/workflows/dependabot.yml
@@ -0,0 +1,18 @@
+# To get started with Dependabot version updates, you'll need to specify which
+# package ecosystems to update and where the package manifests are located.
+# Please see the documentation for all configuration options:
+# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates
+
+version: 2
+updates:
+  - package-ecosystem: "pip" # See documentation for possible values
+    directory: "/docs/sphinx" # Location of package manifests
+    open-pull-requests-limit: 10
+    schedule:
+      interval: "daily"
+    target-branch: "dev"
+    labels:
+      - "documentation"
+      - "dependencies"
+    reviewers:
+      - "samjwu"
diff --git a/.github/workflows/docs-linting.yml b/.github/workflows/docs-linting.yml
@@ -0,0 +1,16 @@
+name: Documentation
+
+on:
+  push:
+    branches:
+    - dev
+    - 'docs/*'
+  pull_request:
+    branches:
+    - dev
+    - 'docs/*'
+
+jobs:
+  call-workflow-passing-data:
+    name: Linting
+    uses: ROCm/rocm-docs-core/.github/workflows/linting.yml@develop
diff --git a/.gitignore b/.gitignore
@@ -19,3 +19,8 @@ VERSION.sha
 
 # temp files
 /tests/Testing
+
+# documentation artifacts
+/_build
+_toc.yml
+
diff --git a/.readthedocs.yaml b/.readthedocs.yaml
@@ -0,0 +1,13 @@
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+version: 2
+
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.10"
+
+python:
+  install:
+  - requirements: docs/sphinx/requirements.txt
diff --git a/.wordlist.txt b/.wordlist.txt
@@ -0,0 +1,60 @@
+Addresser
+AGPRs
+FLOPs
+GPUOpen
+GiB
+Grafana
+Grafana's
+HIP's
+IOPs
+IPC
+KiB
+LD
+Lmod
+MobaXterm
+Normalizations
+Omniperf's
+OPs
+Relatedly
+SEs
+SIG
+SPIs
+SQC
+SoC
+SoCs
+TLB
+TODO
+Transcendentals
+UID
+Utilizations
+VPGRs
+addresser
+amd
+architected
+ast
+atomicAdd
+backpressure
+backpressuring
+benchmarked
+broadcasted
+cdna
+conf
+gcn
+isa
+latencies
+lookaside
+mantor
+modulefile
+modulefiles
+normalizations
+pdf
+perf
+roofline
+sl
+substring
+typename
+untar
+utilizations
+vcopy
+vega
+vl
diff --git a/docs/conceptual/command-processor.rst b/docs/conceptual/command-processor.rst
@@ -0,0 +1,149 @@
+**********************
+Command processor (CP)
+**********************
+
+The command processor (CP) is responsible for interacting with the AMDGPU kernel
+driver (in the Linux kernel) on the CPU and for interacting with user-space
+HSA clients when they submit commands to HSA queues. Basic tasks of the CP
+include reading commands (for example, those corresponding to a kernel launch)
+out of :hsa-runtime-pdf:`HSA queues <68>`, scheduling work to subsequent parts
+of the scheduler pipeline, and marking kernels complete for synchronization
+events on the host.
+
+The command processor consists of two sub-components:
+
+* :ref:`Fetcher <cpf-metrics>` (CPF): Fetches commands out of memory to hand
+  them over to the CPC for processing.
+
+* :ref:`Packet processor <cpc-metrics>` (CPC): Micro-controller running the
+  command processing firmware that decodes the fetched commands and (for
+  kernels) passes them to the :ref:`workgroup manager <desc-spi>` for
+  scheduling.
+
+Before scheduling work to the accelerator, the command processor can
+first acquire a memory fence to ensure system consistency (see
+:hsa-runtime-pdf:`Section 2.6.4 <91>`). After the work is complete, the
+command processor can apply a memory-release fence. Depending on the AMD CDNA
+accelerator in question, either of these operations *might* initiate a cache
+write-back or invalidation.
+
+Analyzing command processor performance is most interesting for kernels
+that you suspect to be limited by scheduling or launch rate. The command
+processor’s metrics therefore focus on reporting, for example:
+
+*  Utilization of the fetcher
+
+*  Utilization of the packet processor, including time spent decoding packets
+
+*  Stalls in fetching and processing
+
+.. _cpf-metrics:
+
+Command Processor Fetcher (CPF) metrics
+=======================================
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - CPF Utilization
+
+     - Percent of total cycles where the CPF was busy doing any work.
+       The ratio of CPF busy cycles over total cycles counted by the CPF.
+
+     - Percent
+
+   * - CPF Stall
+
+     - Percent of CPF busy cycles where the CPF was stalled for any reason.
+
+     - Percent
+
+   * - CPF-L2 Utilization
+
+     - Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
+       where the CPF-L2 interface was active doing any work. The ratio of CPF-L2
+       busy cycles over total cycles counted by the CPF-L2.
+
+     - Percent
+
+   * - CPF-L2 Stall
+
+     - Percent of CPF-L2 busy cycles where the CPF-:doc:`L2 <l2-cache>`
+       interface was stalled for any reason.
+
+     - Percent
+
+   * - CPF-UTCL1 Stall
+
+     - Percent of CPF busy cycles where the CPF was stalled by address
+       translation. 
+
+     - Percent
+
+.. _cpc-metrics:
+
+Command Processor Packet Processor (CPC) metrics
+================================================
+
+.. list-table::
+   :header-rows: 1
+
+   * - Metric
+
+     - Description
+
+     - Unit
+
+   * - CPC Utilization
+
+     - Percent of total cycles where the CPC was busy doing any work.
+       The ratio of CPC busy cycles over total cycles counted by the CPC.
+
+     - Percent
+
+   * - CPC Stall
+
+     - Percent of CPC busy cycles where the CPC was stalled for any reason.
+
+     - Percent
+
+   * - CPC Packet Decoding Utilization
+
+     - Percent of CPC busy cycles spent decoding commands for processing.
+
+     - Percent
+
+   * - CPC-Workgroup Manager Utilization
+
+     - Percent of CPC busy cycles spent dispatching workgroups to the
+       :ref:`workgroup manager <desc-spi>`.
+
+     - Percent
+
+   * - CPC-L2 Utilization
+
+     - Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
+       where the CPC-L2 interface was active doing any work.
+
+     - Percent
+
+   * - CPC-UTCL1 Stall
+
+     - Percent of CPC busy cycles where the CPC was stalled by address
+       translation.
+
+     - Percent
+
+   * - CPC-UTCL2 Utilization
+
+     - Percent of total cycles counted by the CPC's L2 address translation
+       interface where the CPC was busy doing address translation work.
+
+     - Percent
+
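A minimal sketch of how the percentages in the CPF and CPC tables above are formed, written in Python for illustration only. The counter names and values (`cpf_busy`, `cpf_total`, `cpf_stall`) are hypothetical and are not Omniperf or hardware counter identifiers. Only the ratios come from the metric descriptions: busy cycles over total cycles for utilization, and stalled cycles over busy cycles for stall.

```python
def percent(numerator: float, denominator: float) -> float:
    """Return numerator / denominator as a percentage.

    Returns 0.0 when the denominator is zero, so an idle block
    does not raise a division error.
    """
    if denominator == 0:
        return 0.0
    return 100.0 * numerator / denominator


# Hypothetical raw cycle counts for one profiled kernel (assumed values).
counters = {
    "cpf_total": 1000,  # total cycles counted by the CPF
    "cpf_busy": 750,    # cycles in which the CPF was doing any work
    "cpf_stall": 150,   # busy cycles in which the CPF was stalled
}

# CPF Utilization: busy cycles over total cycles.
cpf_utilization = percent(counters["cpf_busy"], counters["cpf_total"])

# CPF Stall: stalled cycles over busy cycles (not over total cycles).
cpf_stall = percent(counters["cpf_stall"], counters["cpf_busy"])

print(f"CPF Utilization: {cpf_utilization:.1f}%")
print(f"CPF Stall: {cpf_stall:.1f}%")
```

Note the differing denominators: the stall metrics are normalized by busy cycles, so a block that is rarely busy can still report a high stall percentage.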
diff --git a/docs/conceptual/compute-unit.rst b/docs/conceptual/compute-unit.rst
@@ -0,0 +1,53 @@
+*****************
+Compute unit (CU)
+*****************
+
+The compute unit (CU) is responsible for executing a user's kernels on
+CDNA-based accelerators. All :ref:`wavefronts <desc-wavefront>` of a
+:ref:`workgroup <desc-workgroup>` are scheduled on the same CU.
+
+.. image:: ../data/performance-model/gcn_compute_unit.png
+    :alt: AMD CDNA accelerator compute unit diagram
+
+The CU consists of several independent execution pipelines and functional units.
+The :doc:`/conceptual/pipeline-descriptions` section details the various
+execution pipelines -- VALU, SALU, LDS, scheduler, and so forth. The metrics
+presented by Omniperf for these pipelines are described in
+:doc:`pipeline-metrics`. The :doc:`vL1D <vector-l1-cache>` cache and
+:doc:`LDS <local-data-share>` are described in their own chapters.
+
+* The :ref:`desc-valu` is composed of multiple SIMD (single
+  instruction, multiple data) vector processors, vector general purpose
+  registers (VGPRs) and instruction buffers. The VALU is responsible for
+  executing much of the computational work on CDNA accelerators, including but
+  not limited to floating-point operations (FLOPs) and integer operations
+  (IOPs).
+
+* The vector memory (VMEM) unit is responsible for issuing loads, stores and
+  atomic operations that interact with the memory system.
+
+* The :ref:`desc-salu` is shared by all threads in a
+  :ref:`wavefront <desc-wavefront>`, and is responsible for executing
+  instructions that are known to be uniform across the wavefront at compile
+  time. The SALU has a memory unit (SMEM) for interacting with memory, but it
+  cannot issue separately from the SALU.
+
+* The :doc:`local-data-share` is an on-CU software-managed scratchpad memory
+  that can be used to efficiently share data between all threads in a
+  :ref:`workgroup <desc-workgroup>`.
+
+* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions
+  for all the :ref:`wavefronts <desc-wavefront>` on the compute unit.
+
+* The :doc:`vector L1 data cache (vL1D) <vector-l1-cache>` is the first level
+  cache local to the compute unit. On current CDNA accelerators, the vL1D is
+  write-through. The vL1D caches from multiple compute units are kept coherent
+  with one another through software instructions.
+
+* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
+  specialized matrix-multiplication accelerator pipelines known as the
+  :ref:`desc-mfma`.
+
+For a more in-depth description of a compute unit on a CDNA accelerator, see
+:hip-training-pdf:`22` and :gcn-crash-course:`27`.
+
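As a rough illustration of the statement that all wavefronts of a workgroup are scheduled on the same CU, the Python sketch below counts how many wavefronts a workgroup occupies on its CU. It assumes a wavefront size of 64 work-items, which is not stated in this file; the function name `wavefronts_per_workgroup` is hypothetical.

```python
WAVEFRONT_SIZE = 64  # assumed work-items per wavefront on CDNA accelerators


def wavefronts_per_workgroup(workgroup_size: int,
                             wavefront_size: int = WAVEFRONT_SIZE) -> int:
    """Number of wavefronts needed to cover a workgroup.

    A partially filled final wavefront still counts as a whole wavefront.
    """
    if workgroup_size <= 0 or wavefront_size <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division.
    return (workgroup_size + wavefront_size - 1) // wavefront_size


# A 256-work-item workgroup maps to 4 wavefronts, all resident on one CU.
full = wavefronts_per_workgroup(256)

# A 100-work-item workgroup still needs 2 wavefronts; the second is partly idle.
partial = wavefronts_per_workgroup(100)

print(full, partial)
```

Under this assumption, workgroup sizes that are not a multiple of the wavefront size leave lanes unused in the last wavefront.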