WIP
neon60 committed May 24, 2024
1 parent bf1c13b commit d426b03
Showing 3 changed files with 31 additions and 27 deletions.
4 changes: 4 additions & 0 deletions .wordlist.txt
@@ -7,10 +7,14 @@ Builtins
clr
enqueue
enqueues
embedded
fatbinary
GPGPU
Malloc
multicore
NDRange
Numa
PTX
RTC
SIMT
SYCL
46 changes: 23 additions & 23 deletions docs/how-to/hip_porting_driver_api.md
@@ -9,9 +9,9 @@ CUDA provides a separate CUDA Driver and Runtime APIs. The two APIs have signifi
* Driver API calls begin with the prefix `cu` while Runtime API calls begin with the prefix `cuda`. For example, the Driver API contains `cuEventCreate` while the Runtime API contains `cudaEventCreate`, with similar functionality.
* The Driver API defines a different but largely overlapping error code space than the Runtime API, and uses a different coding convention. For example, the Driver API defines `CUDA_ERROR_INVALID_VALUE` while the Runtime API defines `cudaErrorInvalidValue`.

The Driver API offers two additional pieces of functionality not provided by the Runtime API: `cuModule` and `cuCtx` APIs.

### `cuModule` API

The Module section of the Driver API provides additional control over how and when accelerator code objects are loaded.
For example, the Driver API allows code objects to be loaded from files or memory pointers.
@@ -30,7 +30,7 @@ Other environments have many kernels and do not want them to be all loaded automatically.
The Module functions can be used to load the generated code objects and launch kernels.
As we will see below, HIP defines a Module API which provides similar explicit control over code object management.
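A minimal sketch of this explicit-loading flow, assuming a precompiled code object file named `kernel.hsaco` containing a kernel named `vector_add` (both names are illustrative, not from the original document):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    hipModule_t module;
    hipFunction_t kernel;

    // Load a code object from a file instead of relying on automatic loading.
    if (hipModuleLoad(&module, "kernel.hsaco") != hipSuccess) {
        std::fprintf(stderr, "failed to load code object\n");
        return 1;
    }
    // Look up a kernel by name inside the loaded module.
    if (hipModuleGetFunction(&kernel, module, "vector_add") != hipSuccess) {
        std::fprintf(stderr, "kernel not found in module\n");
        return 1;
    }
    // ... launch with hipModuleLaunchKernel, then release the module.
    hipModuleUnload(module);
    return 0;
}
```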

### `cuCtx` API

The Driver API defines "Context" and "Devices" as separate entities.
Contexts contain a single device, and a device can theoretically have multiple contexts.
@@ -41,41 +41,41 @@ HIP as well as recent versions of the CUDA Runtime provide other mechanisms to acc

The CUDA Runtime API unifies the Context API with the Device API. This simplifies the APIs with little loss of functionality, since each context contains a single device and the benefits of multiple contexts have been replaced with other interfaces.
HIP provides a context API to facilitate easy porting from existing Driver codes.
In HIP, the `Ctx` functions largely provide an alternate syntax for changing the active device.

Most new applications will prefer to use `hipSetDevice` or the stream APIs; therefore HIP has marked the `hipCtx` APIs as **deprecated**. Support for these APIs may not be available in future releases. For more details on deprecated APIs, refer to [HIP deprecated APIs](https://github.com/ROCm/HIP/blob/develop/docs/reference/deprecated_api_list.md).

## HIP Module and `Ctx` APIs

Rather than present two separate APIs, HIP extends the HIP API with new APIs for Modules and `Ctx` control.

### `hipModule` API

Like the CUDA Driver API, the Module API provides additional control over how code is loaded, including options to load code from files or from in-memory pointers.
NVCC and HIP-Clang target different architectures and use different code object formats: NVCC uses `cubin` or `ptx` files, while the HIP-Clang path uses the `hsaco` format.
The external compilers which generate these code objects are responsible for generating and loading the correct code object for each platform.
Notably, there is no fat binary format that can contain code for both NVCC and HIP-Clang platforms. The following table summarizes the formats used on each platform:

| Format | APIs | NVCC | HIP-CLANG |
| --- | --- | --- | --- |
| Code Object | `hipModuleLoad`, `hipModuleLoadData` | `.cubin` or PTX text | `.hsaco` |
| Fat Binary | `hipModuleLoadFatBin` | `.fatbin` | `.hip_fatbin` |

`hipcc` uses HIP-Clang or NVCC to compile host codes. Both of these may embed code objects into the final executable, and these code objects will be automatically loaded when the application starts.
The `hipModule` API can be used to load additional code objects, and in this way provides an extended capability to the automatically loaded code objects.
HIP-Clang allows both of these capabilities to be used together, if desired. Of course it is possible to create a program with no kernels and thus no automatic loading.

### `hipCtx` API

HIP provides a `Ctx` API as a thin layer over the existing Device functions. This `Ctx` API can be used to set the current context, or to query properties of the device associated with the context.
The current context is implicitly used by other APIs such as `hipStreamCreate`.
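For example, a sketch of selecting a device with `hipSetDevice` rather than the deprecated `Ctx` calls (the device numbering is illustrative):

```cpp
#include <hip/hip_runtime.h>

// Sketch: preferring the device API over the deprecated hipCtx API.
void use_second_gpu() {
    int count = 0;
    hipGetDeviceCount(&count);
    if (count > 1) {
        // Replaces hipCtxSetCurrent-style code: make device 1 current
        // for subsequent allocations, kernel launches, and stream creation.
        hipSetDevice(1);
    }
    hipStream_t stream;
    hipStreamCreate(&stream);  // created on the currently active device
    // ...
    hipStreamDestroy(stream);
}
```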

### HIPIFY translation of CUDA Driver API

The HIPIFY tools convert CUDA Driver APIs for streams, events, modules, devices, memory management, context, and the profiler to the equivalent HIP driver calls. For example, `cuEventCreate` will be translated to `hipEventCreate`.
HIPIFY tools also convert error codes from the Driver namespace and coding convention to the equivalent HIP error code. Thus, HIP unifies the APIs for these common functions.

The memory copy API requires additional explanation. The CUDA Driver API includes the memory direction in the name of the API (`cuMemcpyH2D`), while the CUDA Runtime API provides a single memory copy API with a parameter that specifies the direction, and additionally supports a "default" direction where the runtime determines the direction automatically.
HIP provides APIs with both styles: for example, `hipMemcpyH2D` as well as `hipMemcpy`.
The first flavor may be faster in some cases since it avoids host overhead to detect the memory direction.
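A sketch of the two flavors side by side (note that in current HIP headers the direction-named, driver-style entry point is spelled `hipMemcpyHtoD`):

```cpp
#include <hip/hip_runtime.h>

void copy_both_styles(float* dst_device, const float* src_host, size_t bytes) {
    // Direction-named flavor (driver style): the direction is fixed by the
    // API name, so the runtime does not need to inspect the pointers.
    hipMemcpyHtoD(reinterpret_cast<hipDeviceptr_t>(dst_device),
                  const_cast<float*>(src_host), bytes);

    // Single-entry-point flavor: the direction is passed as a parameter,
    // or hipMemcpyDefault lets the runtime detect it from the pointers.
    hipMemcpy(dst_device, src_host, bytes, hipMemcpyHostToDevice);
    hipMemcpy(dst_device, src_host, bytes, hipMemcpyDefault);
}
```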

@@ -86,7 +86,7 @@ HIP defines a single error space, and uses camel-case for all errors (i.e. `hipE
HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool.
Thus addresses may be shared between contexts, and unlike the original CUDA definition a new context does not create a new address space for the device.

#### Using `hipModuleLaunchKernel`

`hipModuleLaunchKernel` is the HIP equivalent of `cuLaunchKernel`. It takes the same arguments as `cuLaunchKernel`.
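A hedged sketch of such a launch (the kernel handle and its parameter list are assumptions for illustration):

```cpp
#include <hip/hip_runtime.h>

// Launch a previously loaded hipFunction_t with the same argument list
// that cuLaunchKernel takes.
void launch(hipFunction_t kernel, void* dev_buf, int n) {
    void* args[] = {&dev_buf, &n};  // addresses of each kernel argument
    hipModuleLaunchKernel(kernel,
                          256, 1, 1,   // grid dimensions
                          64, 1, 1,    // block dimensions
                          0,           // dynamic shared memory bytes
                          nullptr,     // stream (null = default stream)
                          args,        // kernel argument pointers
                          nullptr);    // extra options (unused here)
}
```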

@@ -95,26 +95,26 @@ Thus addresses may be shared between contexts, and unlike the original CUDA defi
* HIP-Clang creates a primary context when the HIP API is called. So in pure driver API code, HIP-Clang will create a primary context while HIP/NVCC will have an empty context stack.
HIP-Clang will push the primary context onto the context stack when it is empty. This can cause subtle differences in applications which mix the runtime and driver APIs.

### `hip-clang` Implementation Notes

#### `.hip_fatbin`

hip-clang links device code from different translation units together. For each device target, a code object is generated. Code objects for different device targets are bundled by `clang-offload-bundler` as one fatbinary, which is embedded as a global symbol `__hip_fatbin` in the `.hip_fatbin` section of the ELF file of the executable or shared object.

#### Initialization and Termination Functions

hip-clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call `__hipRegisterFatBinary` to register the fatbinary embedded in the ELF file. They also call `__hipRegisterFunction` and `__hipRegisterVar` to register kernel functions and device-side global variables. The termination functions call `__hipUnregisterFatBinary`.
hip-clang emits a global variable `__hip_gpubin_handle` of type `void**` with linkonce linkage and initial value 0 for each host translation unit. Each initialization function checks `__hip_gpubin_handle`, registers the fatbinary only if `__hip_gpubin_handle` is 0, and saves the value returned by `__hipRegisterFatBinary` to `__hip_gpubin_handle`. This guarantees that the fatbinary is registered only once. A similar check is done in the termination functions.

#### Kernel Launching

hip-clang supports kernel launching with the CUDA `<<<>>>` syntax and with `hipLaunchKernelGGL`. The latter is a macro which expands to the CUDA `<<<>>>` syntax.

When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, when `__hipRegisterFatBinary` is called, the code objects containing all kernels are loaded; when `__hipRegisterFunction` is called, the stub functions are associated with the corresponding kernels in code objects.

hip-clang implements two sets of kernel launching APIs.

By default, in the host code, for the `<<<>>>` statement, hip-clang first emits call of `hipConfigureCall` to set up the threads and grids, then emits call of the stub function with the given arguments. In the stub function, `hipSetupArgument` is called for each kernel argument, then `hipLaunchByPtr` is called with a function pointer to the stub function. In `hipLaunchByPtr`, the real kernel associated with the stub function is launched.
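A short sketch showing both launch styles side by side (the `scale` kernel and its launch geometry are illustrative):

```cpp
#include <hip/hip_runtime.h>

__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

void launch_both_ways(float* dev_data, int n) {
    dim3 grid((n + 255) / 256), block(256);

    // CUDA-style triple-chevron launch, supported directly by hip-clang.
    scale<<<grid, block, 0, nullptr>>>(dev_data, 2.0f, n);

    // Equivalent macro form; hipLaunchKernelGGL expands to the <<<>>> syntax.
    hipLaunchKernelGGL(scale, grid, block, 0, nullptr, dev_data, 2.0f, n);
}
```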

### NVCC Implementation Notes

8 changes: 4 additions & 4 deletions docs/how-to/programming_manual.md
@@ -27,7 +27,7 @@ Numa policy determines how memory is allocated.
The target of the Numa policy is to select the CPU closest to each GPU.
Numa distance is a measurement of how far apart a GPU and a CPU device are.

By default, each GPU selects the Numa CPU node with the least Numa distance to it; that is, host memory is automatically allocated on the memory pool of the Numa node closest to the current GPU device. Using the `hipSetDevice` API to switch to a different GPU will still allow access to the host allocation, but it may have a longer Numa distance.
Note that the Numa policy is so far implemented only on Linux and is under development on Windows.
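A sketch of the allocation pattern this policy affects, assuming a pinned host allocation via `hipHostMalloc`:

```cpp
#include <hip/hip_runtime.h>

// Under the default Numa policy described above, this host allocation is
// placed on the Numa node closest to the selected GPU.
void allocate_near_gpu(int gpu_id, size_t bytes) {
    hipSetDevice(gpu_id);            // make gpu_id the current device
    void* host_buf = nullptr;
    hipHostMalloc(&host_buf, bytes); // pinned host memory near gpu_id's node
    // Other GPUs can still access host_buf, possibly at a longer Numa distance.
    hipHostFree(host_buf);
}
```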

### Coherency Controls
@@ -65,7 +65,7 @@ Non-coherent

### `hipEventSynchronize`

Developers can control the release scope for `hipEvents`:

* By default, the GPU performs a device-scope acquire and release operation with each recorded event. This will make host and device memory visible to other commands executing on the same device.

@@ -118,7 +118,7 @@ HIP supports Stream Memory Operations to enable direct synchronization between N
Note that CPU access to the semaphore's memory requires the `volatile` keyword to disable CPU compiler optimizations on memory access.
For more details, please check the HIP-API.pdf documentation.
Please note that HIP streams do not guarantee concurrency on AMD hardware in the case of multiple (at least 6) long-running streams executing concurrently and using `hipStreamSynchronize(nullptr)` for synchronization.
## Direct Dispatch
@@ -160,7 +160,7 @@ The per-thread default stream is a blocking stream and will synchronize with the
The per-thread default stream can be enabled by adding the compilation option
`-fgpu-default-stream=per-thread`.
Users can explicitly use `hipStreamPerThread` as the per-thread default stream handle as input in API commands. There are test codes as examples in the [link](https://github.com/ROCm/hip-tests/tree/develop/catch/unit/streamperthread).
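For example, a sketch of passing `hipStreamPerThread` explicitly:

```cpp
#include <hip/hip_runtime.h>

// Work is issued to the calling thread's default stream rather than the
// process-wide null stream.
void async_copy_per_thread(void* dst, const void* src, size_t bytes) {
    hipMemcpyAsync(dst, src, bytes, hipMemcpyDefault, hipStreamPerThread);
    hipStreamSynchronize(hipStreamPerThread);  // waits only on this thread's stream
}
```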
## Use of Long Double Type
