diff --git a/docs/conceptual/glossary.rst b/docs/conceptual/glossary.rst
index 9a7dd430b..39f5680a9 100644
--- a/docs/conceptual/glossary.rst
+++ b/docs/conceptual/glossary.rst
@@ -1,6 +1,7 @@
 .. meta::
    :description: Omniperf documentation and reference
-   :keywords: Omniperf, ROCm, glossary, definitions, terms, profiler, tool, Instinct, accelerator, AMD
+   :keywords: Omniperf, ROCm, glossary, definitions, terms, profiler, tool,
+              Instinct, accelerator, AMD
 
 ********
 Glossary
@@ -132,6 +133,8 @@ and in this documentation.
 
 .. include:: ./includes/normalization-units.rst
 
+.. _memory-spaces:
+
 Memory spaces
 =============
 
@@ -203,6 +206,8 @@ of LLVM:
    will always have the most up-to-date information, and the interested reader
    is referred to this source for a more complete explanation.
 
+.. _memory-type:
+
 Memory type
 ===========
 
diff --git a/docs/conceptual/performance-model.rst b/docs/conceptual/performance-model.rst
index 47f0ff386..f43625616 100644
--- a/docs/conceptual/performance-model.rst
+++ b/docs/conceptual/performance-model.rst
@@ -45,7 +45,7 @@ use Omniperf to optimize your code.
 References
 ==========
 
-Some sections in the materials in the sections might refer the following
+Some sections in the following materials might refer to the following
 publicly available documentation.
 
 * :hip-training-pdf:`Introduction to AMD GPU Programming with HIP <>`
diff --git a/docs/tutorial/includes/infinity-fabric-transactions.rst b/docs/tutorial/includes/infinity-fabric-transactions.rst
index 06fa6063f..cf82bc65f 100644
--- a/docs/tutorial/includes/infinity-fabric-transactions.rst
+++ b/docs/tutorial/includes/infinity-fabric-transactions.rst
@@ -41,8 +41,8 @@ is identically false (and thus: we expect no writes).
    different operation types (such as atomics, writes). This abbreviated version
    is presented here for reference only.
 
-Finally, this sample code lets the user control: - The `granularity of
-an allocation `__, - The owner of an allocation (local HBM, CPU
+Finally, this sample code lets the user control: - The :ref:`granularity of
+an allocation <memory-type>`, - The owner of an allocation (local HBM, CPU
 DRAM or remote HBM), and - The size of an allocation (the default is
 :math:`\sim4`\ GiB)
 
@@ -50,11 +50,12 @@ via command line arguments. In doing so, we can explore the impact of
 these parameters on the L2-Fabric metrics reported by Omniperf to
 further understand their meaning.
 
-All results in this section were generated an a node of Infinity
-Fabric(tm) connected MI250 accelerators using ROCm v5.6.0, and Omniperf
-v2.0.0. Although results may vary with ROCm versions and accelerator
-connectivity, we expect the lessons learned here to be broadly
-applicable.
+.. note::
+
+   All results in this section were generated on a node of Infinity
+   Fabric connected MI250 accelerators using ROCm version 5.6.0 and Omniperf
+   version 2.0.0. Although results may vary with ROCm versions and accelerator
+   connectivity, we expect the lessons learned here to be broadly applicable.
 
 .. _infinity-fabric-ex1:
 
@@ -201,7 +202,7 @@ accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
 │ 17.5.4  │ Remote Read     │         6.00 │         6.00 │         6.00 │ Req per kernel │
 ╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
 
-Comparing with our `previous example `__, we see a
+Comparing with our :ref:`previous example <infinity-fabric-ex1>`, we see a
 relatively similar result, namely: - The vast majority of L2-Fabric
 requests are 64B read requests (17.5.2) - Nearly all these read requests
 are directed to the accelerator-local HBM (17.2.1)
@@ -212,17 +213,18 @@ Fabric(tm).
 
 .. code:: {note}
 
-    The stalls in Sec 17.4 are presented as a percentage of the total number active L2 cycles, summed over [all L2 channels](L2).
+    The stalls in Sec 17.4 are presented as a percentage of the total number of
+    active L2 cycles, summed over [all L2 channels](L2).
 
 .. _infinity-fabric-ex3:
 
 Experiment #3 - Fine-grained, remote-accelerator HBM reads
 ----------------------------------------------------------
 
-In this experiment, we move our `fine-grained `__ allocation to
+In this experiment, we move our :ref:`fine-grained <memory-type>` allocation to
 be owned by a remote accelerator. We accomplish this by first changing
 the HIP device using e.g., ``hipSetDevice(1)`` API, then allocating
-fine-grained memory (as described `previously `__), and
+fine-grained memory (as described :ref:`previously <memory-type>`), and
 finally resetting the device back to the default, e.g.,
 ``hipSetDevice(0)``.
 
@@ -308,7 +310,7 @@ addition, because these are crossing between accelerators, we expect
 significantly lower achievable bandwidths as compared to the local
 accelerator’s HBM – this is reflected (indirectly) in the magnitude of
 the stall metric (17.4.1). Finally, we note that if our system contained
-only PCIe(r) connected accelerators, these observations will differ.
+only PCIe connected accelerators, these observations would differ.
 
 .. _infinity-fabric-ex4: