
Docs: refactor and integrate into ROCm docs portal #362

Merged · 24 commits · Jul 31, 2024
a0175e1
pip-compile docs/requirements.txt
peterjunpark May 6, 2024
f282605
style(conf.py): Apply black formatting to docs/conf.py
samjwu May 9, 2024
345e5a5
Update docs requirements
peterjunpark May 24, 2024
fb794f5
Add dependabot.yml and update CODEOWNERS
peterjunpark Jun 17, 2024
3c80be0
Port docs to rocm-docs standard
peterjunpark May 8, 2024
29eee16
impr internal linking and fix sphinx warnings
peterjunpark Jul 18, 2024
80b1cf6
add spellcheck/linting from rocm-docs-core
peterjunpark Jul 18, 2024
8821660
Merge branch 'dev' into docs/refactor
peterjunpark Jul 18, 2024
7980941
bump rocm-docs-core to 1.6.0
peterjunpark Jul 24, 2024
1a94f72
add fixes from @skyreflectedinmirrors and @lpaoletti
peterjunpark Jul 25, 2024
bcb858e
add package manager install section
peterjunpark Jul 25, 2024
6ec9958
add fixes
peterjunpark Jul 26, 2024
c821863
add custom css
peterjunpark Jul 29, 2024
47fa8f7
make images/figs click-to-expand
peterjunpark Jul 29, 2024
e916f9d
update documentation link in README
peterjunpark Jul 29, 2024
0eed6a0
formatting fixes
peterjunpark Jul 30, 2024
afa4abc
Merge branch 'dev' into docs/refactor
peterjunpark Jul 30, 2024
7912d52
fix heading
peterjunpark Jul 30, 2024
8480640
move archived docs
peterjunpark Jul 30, 2024
85c27a8
exclude archived docs from docs build
peterjunpark Jul 30, 2024
13c64b2
update archived docs workflow
peterjunpark Jul 30, 2024
77f3a2e
rm docs linting
peterjunpark Jul 30, 2024
426d632
Apply cmake-format suggested changes
samjwu Jul 30, 2024
5ff0963
Apply cmake-format
samjwu Jul 30, 2024
3 changes: 2 additions & 1 deletion .github/CODEOWNERS
@@ -1,6 +1,7 @@
* @koomie @coleramos425

# Documentation files
docs/* @ROCm/rocm-documentation
docs/ @ROCm/rocm-documentation
*.md @ROCm/rocm-documentation
*.rst @ROCm/rocm-documentation
.readthedocs.yaml @ROCm/rocm-documentation
18 changes: 18 additions & 0 deletions .github/workflows/dependabot.yml
@@ -0,0 +1,18 @@
# To get started with Dependabot version updates, you'll need to specify which
# package ecosystems to update and where the package manifests are located.
# Please see the documentation for all configuration options:
# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates

version: 2
updates:
- package-ecosystem: "pip" # See documentation for possible values
directory: "/docs/sphinx" # Location of package manifests
open-pull-requests-limit: 10
schedule:
interval: "daily"
target-branch: "dev"
labels:
- "documentation"
- "dependencies"
reviewers:
- "samjwu"
16 changes: 16 additions & 0 deletions .github/workflows/docs-linting.yml
@@ -0,0 +1,16 @@
name: Documentation

on:
push:
branches:
- dev
- 'docs/*'
pull_request:
branches:
- dev
- 'docs/*'

jobs:
call-workflow-passing-data:
name: Linting
uses: ROCm/rocm-docs-core/.github/workflows/linting.yml@develop
5 changes: 5 additions & 0 deletions .gitignore
@@ -19,3 +19,8 @@ VERSION.sha

# temp files
/tests/Testing

# documentation artifacts
/_build
_toc.yml

13 changes: 13 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,13 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

version: 2

build:
os: ubuntu-22.04
tools:
python: "3.10"

python:
install:
- requirements: docs/sphinx/requirements.txt
61 changes: 61 additions & 0 deletions .wordlist.txt
@@ -0,0 +1,61 @@
Addresser
AGPRs
FLOPs
GPUOpen
GiB
Grafana
Grafana's
HIP's
IOPs
IPC
KiB
LD
Lmod
MobaXterm
Normalizations
Omniperf's
OPs
Relatedly
SEs
SIG
SPIs
SQC
SoC
SoCs
TLB
TODO
Transcendentals
UID
Utilizations
VPGRs
addresser
amd
architected
ast
atomicAdd
backpressure
backpressuring
benchmarked
broadcasted
cdna
conf
gcn
isa
latencies
lds
lookaside
mantor
modulefile
modulefiles
normalizations
pdf
perf
roofline
sl
substring
typename
untar
utilizations
vcopy
vega
vl
4 changes: 2 additions & 2 deletions README.md
@@ -4,17 +4,17 @@
[![Docs](https://github.com/ROCm/omniperf/actions/workflows/docs.yml/badge.svg)](https://rocm.github.io/omniperf/)
[![DOI](https://zenodo.org/badge/561919887.svg)](https://zenodo.org/badge/latestdoi/561919887)


# Omniperf

## General

Omniperf is a system performance profiling tool for machine
learning/HPC workloads running on AMD MI GPUs. The tool presently
targets usage on MI100, MI200, and MI300 accelerators.

* For more information on available features, installation steps, and
workload profiling and analysis, please refer to the online
[documentation](https://rocm.github.io/omniperf).
[documentation](https://rocm.docs.amd.com/projects/omniperf/en/latest/).

* Omniperf is an AMD open source research project and is not supported
as part of the ROCm software stack. We welcome contributions and
153 changes: 153 additions & 0 deletions docs/conceptual/command-processor.rst
@@ -0,0 +1,153 @@
.. meta::
:description: Omniperf performance model: Command processor (CP)
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, command, processor, fetcher, packet processor, CPF, CPC

**********************
Command processor (CP)
**********************

The command processor (CP) is responsible for interacting with the AMDGPU kernel
driver -- part of the Linux kernel -- on the CPU and for interacting with
user-space HSA clients when they submit commands to HSA queues. Basic tasks of
the CP include reading commands (such as those corresponding to a kernel launch)
out of :hsa-runtime-pdf:`HSA queues <68>`, scheduling work to subsequent parts
of the scheduler pipeline, and marking kernels complete for synchronization
events on the host.

The command processor consists of two sub-components:

* :ref:`Fetcher <cpf-metrics>` (CPF): Fetches commands out of memory to hand
them over to the CPC for processing.

* :ref:`Packet processor <cpc-metrics>` (CPC): Micro-controller running the
command processing firmware that decodes the fetched commands and (for
kernels) passes them to the :ref:`workgroup processors <desc-spi>` for
scheduling.

Before scheduling work to the accelerator, the command processor can
first acquire a memory fence to ensure system consistency
(:hsa-runtime-pdf:`Section 2.6.4 <91>`). After the work is complete, the
command processor can apply a memory-release fence. Depending on the AMD CDNA™
accelerator in question, either of these operations *might* initiate a cache
write-back or invalidation.

Analyzing command processor performance is most interesting for kernels
that you suspect to be limited by scheduling or launch rate. The command
processor's metrics are therefore focused on reporting, for example:

* Utilization of the fetcher

* Utilization of the packet processor, including the cycles it spends decoding
  packets

* Stalls in fetching and processing
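
To see why launch rate matters, consider a toy model (the numbers below are
illustrative placeholders, not measured overheads): when the per-launch
scheduling overhead is comparable to the kernel duration, most of the wall time
goes to scheduling rather than compute, and the CP metrics become the
interesting signal.

```python
# Back-of-the-envelope model of a launch-rate-limited workload.
# All numbers are hypothetical, for illustration only.

def total_time_us(n_launches: int, launch_overhead_us: float, kernel_us: float) -> float:
    """Total wall time when each launch pays a fixed scheduling overhead."""
    return n_launches * (launch_overhead_us + kernel_us)

def launch_overhead_fraction(launch_overhead_us: float, kernel_us: float) -> float:
    """Fraction of wall time spent launching/scheduling rather than computing."""
    return launch_overhead_us / (launch_overhead_us + kernel_us)

# 10,000 launches of a 5 us kernel with a 10 us launch overhead:
# roughly two thirds of the time is scheduling, not compute.
print(launch_overhead_fraction(10.0, 5.0))  # ~0.667
print(total_time_us(10_000, 10.0, 5.0))     # 150000.0 us
```

Fusing kernels or batching launches shrinks ``n_launches`` and the overhead
fraction; the CP utilization and stall metrics below help confirm whether a
workload is actually in this regime.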

.. _cpf-metrics:

Command processor fetcher (CPF)
===============================

.. list-table::
:header-rows: 1

* - Metric

- Description

- Unit

* - CPF Utilization

- Percent of total cycles where the CPF was busy actively doing any work.
The ratio of CPF busy cycles over total cycles counted by the CPF.

- Percent

* - CPF Stall

- Percent of CPF busy cycles where the CPF was stalled for any reason.

- Percent

* - CPF-L2 Utilization

- Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
where the CPF-L2 interface was active doing any work. The ratio of CPF-L2
busy cycles over total cycles counted by the CPF-L2.

- Percent

* - CPF-L2 Stall

- Percent of CPF-L2 busy cycles where the CPF-:doc:`L2 <l2-cache>`
interface was stalled for any reason.

- Percent

* - CPF-UTCL1 Stall

- Percent of CPF busy cycles where the CPF was stalled by address
translation.

- Percent

.. _cpc-metrics:

Command processor packet processor (CPC)
========================================

.. list-table::
:header-rows: 1

* - Metric

- Description

- Unit

* - CPC Utilization

- Percent of total cycles where the CPC was busy actively doing any work.
The ratio of CPC busy cycles over total cycles counted by the CPC.

- Percent

* - CPC Stall

- Percent of CPC busy cycles where the CPC was stalled for any reason.

- Percent

* - CPC Packet Decoding Utilization

- Percent of CPC busy cycles spent decoding commands for processing.

- Percent

* - CPC-Workgroup Manager Utilization

- Percent of CPC busy cycles spent dispatching workgroups to the
:ref:`workgroup manager <desc-spi>`.

- Percent

* - CPC-L2 Utilization

- Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
where the CPC-L2 interface was active doing any work.

- Percent

* - CPC-UTCL1 Stall

- Percent of CPC busy cycles where the CPC was stalled by address
translation.

- Percent

* - CPC-UTCL2 Utilization

- Percent of total cycles counted by the CPC's L2 address translation
interface where the CPC was busy doing address translation work.

- Percent
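
The utilization and stall metrics in the two tables above share a common shape:
utilizations are a percentage of *total* cycles, while stalls are a percentage
of *busy* cycles. A minimal sketch of those definitions (the counter names and
values here are hypothetical placeholders, not Omniperf's actual counter
identifiers):

```python
# Sketch of the metric definitions used in the CPF/CPC tables above.
# Counter names and values are made up for illustration.

def utilization_pct(busy_cycles: int, total_cycles: int) -> float:
    """Percent of total cycles the unit was busy doing any work."""
    return 100.0 * busy_cycles / total_cycles if total_cycles else 0.0

def stall_pct(stall_cycles: int, busy_cycles: int) -> float:
    """Percent of *busy* cycles the unit was stalled.
    Note the denominator is busy cycles, not total cycles."""
    return 100.0 * stall_cycles / busy_cycles if busy_cycles else 0.0

cpf = {"busy": 8_000, "stalled": 2_000, "total": 40_000}
print(f"CPF Utilization: {utilization_pct(cpf['busy'], cpf['total']):.1f}%")  # 20.0%
print(f"CPF Stall:       {stall_pct(cpf['stalled'], cpf['busy']):.1f}%")      # 25.0%
```

Because stalls are normalized to busy cycles, a unit can simultaneously show
low utilization and a high stall percentage for the little work it did receive.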

60 changes: 60 additions & 0 deletions docs/conceptual/compute-unit.rst
@@ -0,0 +1,60 @@
.. meta::
:description: Omniperf performance model: Compute unit (CU)
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, GCN, compute, unit, pipeline, workgroup, wavefront,
CDNA

*****************
Compute unit (CU)
*****************

The compute unit (CU) is responsible for executing a user's kernels on
CDNA™-based accelerators. All :ref:`wavefronts <desc-wavefront>` of a
:ref:`workgroup <desc-workgroup>` are scheduled on the same CU.

.. image:: ../data/performance-model/gcn_compute_unit.png
:align: center
:alt: AMD CDNA accelerator compute unit diagram
:width: 800

The CU consists of several independent execution pipelines and functional units.
The :doc:`/conceptual/pipeline-descriptions` section details the various
execution pipelines -- VALU, SALU, LDS, scheduler, and so forth. The metrics
presented by Omniperf for these pipelines are described in
:doc:`pipeline-metrics`. The :doc:`vL1D <vector-l1-cache>` cache and
:doc:`LDS <local-data-share>` are described in their own sections.

* The :ref:`desc-valu` is composed of multiple SIMD (single
instruction, multiple data) vector processors, vector general purpose
registers (VGPRs) and instruction buffers. The VALU is responsible for
executing much of the computational work on CDNA accelerators, including but
not limited to floating-point operations (FLOPs) and integer operations
(IOPs).

* The vector memory (VMEM) unit is responsible for issuing loads, stores and
atomic operations that interact with the memory system.

* The :ref:`desc-salu` is shared by all threads in a
:ref:`wavefront <desc-wavefront>`, and is responsible for executing
instructions that are known to be uniform across the wavefront at compile
time. The SALU has a memory unit (SMEM) for interacting with memory, but it
cannot issue separately from the SALU.

* The :doc:`local-data-share` is an on-CU software-managed scratchpad memory
that can be used to efficiently share data between all threads in a
:ref:`workgroup <desc-workgroup>`.

* The :ref:`desc-scheduler` is responsible for issuing and decoding instructions
for all the :ref:`wavefronts <desc-wavefront>` on the compute unit.

* The :doc:`vector L1 data cache (vL1D) <vector-l1-cache>` is the first level
cache local to the compute unit. On current CDNA accelerators, the vL1D is
write-through. The vL1D caches from multiple compute units are kept coherent
with one another through software instructions.

* CDNA accelerators -- that is, AMD Instinct™ MI100 and newer -- contain
specialized matrix-multiplication accelerator pipelines known as the
:ref:`desc-mfma`.

For a more in-depth description of a compute unit on a CDNA accelerator, see
:hip-training-pdf:`22` and :gcn-crash-course:`27`.
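
Since all wavefronts of a workgroup are scheduled on the same CU, the workgroup
size directly determines how many wavefronts that one CU must host. A small
illustrative sketch, assuming the CDNA wavefront size of 64 threads:

```python
import math

WAVEFRONT_SIZE = 64  # wavefront size on CDNA accelerators

def wavefronts_per_workgroup(workgroup_size: int) -> int:
    """Number of wavefronts a workgroup is split into.
    All of them are scheduled on the same compute unit."""
    return math.ceil(workgroup_size / WAVEFRONT_SIZE)

print(wavefronts_per_workgroup(256))  # -> 4
print(wavefronts_per_workgroup(100))  # -> 2 (a partial wavefront still occupies a full slot)
```

A workgroup size that is not a multiple of 64 leaves the last wavefront
partially populated, wasting VALU lanes on every instruction it issues.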
