WIP
neon60 committed May 24, 2024
1 parent bf1c13b commit d426b03
Showing 3 changed files with 31 additions and 27 deletions.
4 changes: 4 additions & 0 deletions .wordlist.txt
@@ -7,10 +7,14 @@ Builtins
clr
enqueue
enqueues
embedded
fatbinary
GPGPU
Malloc
multicore
NDRange
Numa
PTX
RTC
SIMT
SYCL
46 changes: 23 additions & 23 deletions docs/how-to/hip_porting_driver_api.md
@@ -9,9 +9,9 @@ CUDA provides a separate CUDA Driver and Runtime APIs. The two APIs have signifi
* Driver API calls begin with the prefix `cu` while Runtime API calls begin with the prefix `cuda`. For example, the Driver API contains `cuEventCreate` while the Runtime API contains `cudaEventCreate`, with similar functionality.
* The Driver API defines a different but largely overlapping error code space than the Runtime API, and uses a different coding convention. For example, the Driver API defines `CUDA_ERROR_INVALID_VALUE` while the Runtime API defines `cudaErrorInvalidValue`.

The Driver API offers two additional pieces of functionality not provided by the Runtime API: `cuModule` and `cuCtx` APIs.

### `cuModule` API

The Module section of the Driver API provides additional control over how and when accelerator code objects are loaded.
For example, the Driver API allows code objects to be loaded from files or memory pointers.
@@ -30,7 +30,7 @@ Other environments have many kernels and do not want them to be all loaded automatically.
The Module functions can be used to load the generated code objects and launch kernels.
As we will see below, HIP defines a Module API which provides similar explicit control over code object management.
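A minimal sketch of this explicit-loading flow, assuming a precompiled code object file named `kernel.hsaco` containing a kernel named `vector_add` (both names are illustrative, not from the original document):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipModule_t module;
    hipFunction_t kernel;

    // Load a code object from a file instead of relying on automatic loading.
    if (hipModuleLoad(&module, "kernel.hsaco") != hipSuccess) {
        std::fprintf(stderr, "failed to load code object\n");
        return 1;
    }
    // Look up a kernel by name inside the loaded module.
    if (hipModuleGetFunction(&kernel, module, "vector_add") != hipSuccess) {
        std::fprintf(stderr, "kernel not found in module\n");
        return 1;
    }
    // ... launch with hipModuleLaunchKernel, then release the module.
    hipModuleUnload(module);
    return 0;
}
```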

### `cuCtx` API

The Driver API defines "Context" and "Devices" as separate entities.
Contexts contain a single device, and a device can theoretically have multiple contexts.
@@ -41,41 +41,41 @@ HIP as well as recent versions of the CUDA Runtime provide other mechanisms to acc

The CUDA Runtime API unifies the Context API with the Device API. This simplifies the APIs with little loss of functionality, since each context contains a single device and the benefits of multiple contexts have been replaced with other interfaces.
HIP provides a context API to facilitate easy porting from existing Driver codes.
In HIP, the `Ctx` functions largely provide an alternate syntax for changing the active device.

Most new applications will prefer to use `hipSetDevice` or the stream APIs; therefore HIP has marked the `hipCtx` APIs as **deprecated**. Support for these APIs may not be available in future releases. For more details on deprecated APIs, refer to [HIP deprecated APIs](https://github.com/ROCm/HIP/blob/develop/docs/reference/deprecated_api_list.md).

## HIP Module and `Ctx` APIs

Rather than present two separate APIs, HIP extends the HIP API with new APIs for Modules and `Ctx` control.

### `hipModule` API

Like the CUDA Driver API, the Module API provides additional control over how code is loaded, including options to load code from files or from in-memory pointers.
NVCC and HIP-Clang target different architectures and use different code object formats: NVCC uses `cubin` or `ptx` files, while the HIP-Clang path uses the `hsaco` format.
The external compilers which generate these code objects are responsible for generating and loading the correct code object for each platform.
Notably, there is no fat binary format that can contain code for both NVCC and HIP-Clang platforms. The following table summarizes the formats used on each platform:

| Format | APIs | NVCC | HIP-CLANG |
| --- | --- | --- | --- |
| Code Object | `hipModuleLoad`, `hipModuleLoadData` | `.cubin` or PTX text | `.hsaco` |
| Fat Binary | `hipModuleLoadFatBin` | `.fatbin` | `.hip_fatbin` |

`hipcc` uses HIP-Clang or NVCC to compile host codes. Both of these may embed code objects into the final executable, and these code objects will be automatically loaded when the application starts.
The `hipModule` API can be used to load additional code objects, and in this way provides an extended capability to the automatically loaded code objects.
HIP-Clang allows both of these capabilities to be used together, if desired. Of course it is possible to create a program with no kernels and thus no automatic loading.

### `hipCtx` API

HIP provides a `Ctx` API as a thin layer over the existing Device functions. This `Ctx` API can be used to set the current context, or to query properties of the device associated with the context.
The current context is implicitly used by other APIs such as `hipStreamCreate`.
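For example, a sketch of selecting a device with `hipSetDevice` rather than the deprecated `Ctx` calls (the device numbering is illustrative):

```cpp
#include <hip/hip_runtime.h>

// Sketch: preferring the device API over the deprecated hipCtx API.
void use_second_gpu() {
    int count = 0;
    hipGetDeviceCount(&count);
    if (count > 1) {
        // Replaces hipCtxSetCurrent-style code: make device 1 current
        // for subsequent allocations, kernel launches, and stream creation.
        hipSetDevice(1);
    }
    hipStream_t stream;
    hipStreamCreate(&stream);  // created on the currently active device
    // ...
    hipStreamDestroy(stream);
}
```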

### HIPIFY translation of CUDA Driver API

The HIPIFY tools convert CUDA Driver APIs for streams, events, modules, devices, memory management, context, and the profiler to the equivalent HIP driver calls. For example, `cuEventCreate` will be translated to `hipEventCreate`.
HIPIFY tools also convert error codes from the Driver namespace and coding convention to the equivalent HIP error code. Thus, HIP unifies the APIs for these common functions.

The memory copy API requires additional explanation. The CUDA Driver API includes the memory direction in the name of the API (`cuMemcpyH2D`), while the CUDA Runtime API provides a single memory copy API with a parameter that specifies the direction, and additionally supports a "default" direction where the runtime determines the direction automatically.
HIP provides APIs with both styles: for example, `hipMemcpyH2D` as well as `hipMemcpy`.
The first flavor may be faster in some cases since it avoids host overhead to detect the memory direction.
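A sketch of the two flavors side by side (note that in current HIP headers the direction-named, driver-style entry point is spelled `hipMemcpyHtoD`):

```cpp
#include <hip/hip_runtime.h>

void copy_both_styles(float* dst_device, const float* src_host, size_t bytes) {
    // Direction-named flavor (driver style): the direction is fixed by the
    // API name, so the runtime does not need to inspect the pointers.
    hipMemcpyHtoD(reinterpret_cast<hipDeviceptr_t>(dst_device),
                  const_cast<float*>(src_host), bytes);

    // Single-entry-point flavor: the direction is passed as a parameter,
    // or hipMemcpyDefault lets the runtime detect it from the pointers.
    hipMemcpy(dst_device, src_host, bytes, hipMemcpyHostToDevice);
    hipMemcpy(dst_device, src_host, bytes, hipMemcpyDefault);
}
```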

@@ -86,7 +86,7 @@ HIP defines a single error space, and uses camel-case for all errors (i.e. `hipE
HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool.
Thus addresses may be shared between contexts, and unlike the original CUDA definition a new context does not create a new address space for the device.

#### Using `hipModuleLaunchKernel`

`hipModuleLaunchKernel` is the HIP equivalent of `cuLaunchKernel`. It takes the same arguments as `cuLaunchKernel`.
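A hedged sketch of such a launch (the kernel handle and its parameter list are assumptions for illustration):

```cpp
#include <hip/hip_runtime.h>

// Launch a previously loaded hipFunction_t with the same argument list
// that cuLaunchKernel takes.
void launch(hipFunction_t kernel, void* dev_buf, int n) {
    void* args[] = {&dev_buf, &n};  // addresses of each kernel argument
    hipModuleLaunchKernel(kernel,
                          256, 1, 1,   // grid dimensions
                          64, 1, 1,    // block dimensions
                          0,           // dynamic shared memory bytes
                          nullptr,     // stream (null = default stream)
                          args,        // kernel argument pointers
                          nullptr);    // extra options (unused here)
}
```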

@@ -95,26 +95,26 @@ Thus addresses may be shared between contexts, and unlike the original CUDA defi
* HIP-Clang creates a primary context when the HIP API is called. So in pure driver API code, HIP-Clang will create a primary context while HIP/NVCC will have an empty context stack.
HIP-Clang will push the primary context onto the context stack when it is empty. This can cause subtle differences in applications which mix the runtime and driver APIs.

### `hip-clang` Implementation Notes

#### `.hip_fatbin`

hip-clang links device code from different translation units together. For each device target, a code object is generated. Code objects for different device targets are bundled by `clang-offload-bundler` as one fatbinary, which is embedded as a global symbol `__hip_fatbin` in the `.hip_fatbin` section of the ELF file of the executable or shared object.

#### Initialization and Termination Functions

hip-clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call `__hipRegisterFatBinary` to register the fatbinary embedded in the ELF file. They also call `__hipRegisterFunction` and `__hipRegisterVar` to register kernel functions and device-side global variables. The termination functions call `__hipUnregisterFatBinary`.
hip-clang emits a global variable `__hip_gpubin_handle` of type `void**` with linkonce linkage and initial value 0 for each host translation unit. Each initialization function checks `__hip_gpubin_handle`, registers the fatbinary only if `__hip_gpubin_handle` is 0, and saves the value returned by `__hipRegisterFatBinary` to `__hip_gpubin_handle`. This guarantees that the fatbinary is registered only once. A similar check is done in the termination functions.

#### Kernel Launching

hip-clang supports kernel launching with the CUDA `<<<>>>` syntax and with `hipLaunchKernelGGL`. The latter is a macro which expands to the CUDA `<<<>>>` syntax.

When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, when `__hipRegisterFatBinary` is called, the code objects containing all kernels are loaded; when `__hipRegisterFunction` is called, the stub functions are associated with the corresponding kernels in code objects.

hip-clang implements two sets of kernel launching APIs.

By default, in the host code, for the `<<<>>>` statement, hip-clang first emits call of `hipConfigureCall` to set up the threads and grids, then emits call of the stub function with the given arguments. In the stub function, `hipSetupArgument` is called for each kernel argument, then `hipLaunchByPtr` is called with a function pointer to the stub function. In `hipLaunchByPtr`, the real kernel associated with the stub function is launched.
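A short sketch showing both launch styles side by side (the `scale` kernel and its launch geometry are illustrative):

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void launch_both_ways(float* dev_data, int n) {
    dim3 grid((n + 255) / 256), block(256);

    // CUDA-style triple-chevron launch, supported directly by hip-clang.
    scale<<<grid, block, 0, nullptr>>>(dev_data, 2.0f, n);

    // Equivalent macro form; hipLaunchKernelGGL expands to the <<<>>> syntax.
    hipLaunchKernelGGL(scale, grid, block, 0, nullptr, dev_data, 2.0f, n);
}
```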

### NVCC Implementation Notes

8 changes: 4 additions & 4 deletions docs/how-to/programming_manual.md
@@ -27,7 +27,7 @@ Numa policy determines how memory is allocated.
The target of the Numa policy is to select the CPU closest to each GPU.
Numa distance is a measurement of how far apart a GPU and a CPU device are.

By default, each GPU selects the Numa CPU node with the least Numa distance to it; that is, host memory is automatically allocated on the memory pool of the Numa node closest to the current GPU device. Using the `hipSetDevice` API to switch to a different GPU will still allow access to the host allocation, but it may have a longer Numa distance.
Note that the Numa policy is so far implemented only on Linux and is under development on Windows.
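A sketch of the allocation pattern this policy affects, assuming a pinned host allocation via `hipHostMalloc`:

```cpp
#include <hip/hip_runtime.h>

// Under the default Numa policy described above, this host allocation is
// placed on the Numa node closest to the selected GPU.
void allocate_near_gpu(int gpu_id, size_t bytes) {
    hipSetDevice(gpu_id);            // make gpu_id the current device
    void* host_buf = nullptr;
    hipHostMalloc(&host_buf, bytes); // pinned host memory near gpu_id's node
    // Other GPUs can still access host_buf, possibly at a longer Numa distance.
    hipHostFree(host_buf);
}
```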

### Coherency Controls
@@ -65,7 +65,7 @@ Non-coherent

### `hipEventSynchronize`

Developers can control the release scope for `hipEvents`:

* By default, the GPU performs a device-scope acquire and release operation with each recorded event. This will make host and device memory visible to other commands executing on the same device.

@@ -118,7 +118,7 @@ HIP supports Stream Memory Operations to enable direct synchronization between N
Note that CPU access to the semaphore's memory requires the `volatile` keyword to disable CPU compiler optimizations on memory access.
For more details, please check the HIP-API.pdf documentation.
Please note that HIP streams do not guarantee concurrency on AMD hardware in the case of multiple (at least 6) long-running streams executing concurrently and using `hipStreamSynchronize(nullptr)` for synchronization.
## Direct Dispatch
@@ -160,7 +160,7 @@ The per-thread default stream is a blocking stream and will synchronize with the
The per-thread default stream can be enabled by adding the compilation option
`-fgpu-default-stream=per-thread`.
Users can explicitly use `hipStreamPerThread` as the per-thread default stream handle as input in API commands. There are test codes as examples in the [link](https://github.com/ROCm/hip-tests/tree/develop/catch/unit/streamperthread).
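For example, a sketch of passing `hipStreamPerThread` explicitly:

```cpp
#include <hip/hip_runtime.h>

// Work is issued to the calling thread's default stream rather than the
// process-wide null stream.
void async_copy_per_thread(void* dst, const void* src, size_t bytes) {
    hipMemcpyAsync(dst, src, bytes, hipMemcpyDefault, hipStreamPerThread);
    hipStreamSynchronize(hipStreamPerThread);  // waits only on this thread's stream
}
```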
## Use of Long Double Type
