Skip to content

Commit

Permalink
Update based on reviews
Browse files Browse the repository at this point in the history
  • Loading branch information
matyas-streamhpc committed Nov 6, 2024
1 parent 855bfc5 commit 814f61c
Show file tree
Hide file tree
Showing 2 changed files with 46 additions and 41 deletions.
86 changes: 45 additions & 41 deletions docs/how-to/hip_runtime_api/call_stack.rst
Original file line number Diff line number Diff line change
@@ -1,51 +1,49 @@
.. meta::
:description: This page lists graph safe ROCm libraries.
:keywords: AMD, ROCm, HIP, hipGRAPH
:description: This page describes call stack concept in HIP
:keywords: AMD, ROCm, HIP, call stack

********************************************************************************
*******************************************************************************
Call stack
********************************************************************************
*******************************************************************************

The call stack and the program counter work together to manage the flow of
control in a program, especially in the context of function calls, loops, and
branching.

The call stack is a concept in computer science, representing the stack data
structure that stores information about the active subroutines or functions.
Each time a function is called, a new call frame is added to the top of the stack,
containing information such as local variables, return addresses and function
parameters. When the function execution completes, the frame is removed from the
stack. This concept allows the program to return to the calling function and
continue execution from where it left off.

The program counter keeps track of the current or next instruction to
be executed, while the call stack maintains the history of function calls,
return addresses, and local variables. Together, they allow a program to execute
complex control flow structures, handle function calls and return to the
appropriate location in the code after a function completes.

The call stack in GPU programming has extra complexity due to parallel execution
also NVIDIA and AMD GPUs are using different approaches. NVIDIA GPUs have the
independent thread scheduling feature since Volta GPU, where every thread has
its own call stack and effective program counter. At AMD GPUs every
warp/wavefront has its own call stack and program counter.
structure that used for managing function calls, by saving the state of the
current function. Each time a function is called, a new call frame is added to
the top of the stack, containing information such as local variables, return
addresses and function parameters. When the function execution completes, the
frame is removed from the stack and loaded back into the corresponding
registers. This concept allows the program to return to the calling function
and continue execution from where it left off.

The program counter keeps track of the current or next instruction to be
executed. The program counter and the call stack concept together allow a
program to handle function calls and return to the appropriate location in the
code after a function completes.

The call stack for each thread must track its function calls, local variables,
and return addresses.
and return addresses. However, in GPU programming, the memory required to store
the call stack increases due to the parallelism inherent to the GPUs. NVIDIA
and AMD GPUs are using different approaches. NVIDIA GPUs have the independent
thread scheduling feature since Volta GPU, where every thread has its own call
stack and effective program counter. At AMD GPUs every warp/wavefront has its
own call stack and program counter.

If a thread or warp exceeds its stack size, a stack overflow occurs, causing
kernel failure. This can be detected and handled using debuggers or with the
:doc:`error handling functions <../how-to/hip_runtime_api/error_handling>`.
kernel failure. This can be detected using debuggers.

Call stack management with HIP
================================================================================

Developers can adjust the call stack size for every thread, allowing fine-tuning
based on specific kernel requirements. This helps prevent stack overflow errors
by ensuring sufficient stack memory is allocated.
Developers can adjust the call stack size, allowing fine-tuning based on
specific kernel requirements. This helps prevent stack overflow errors by
ensuring sufficient stack memory is allocated.

.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
Expand Down Expand Up @@ -78,18 +76,23 @@ by ensuring sufficient stack memory is allocated.
return 0;
}
Depending on the GPU model, at full occupancy, it can consume a significant
amount of memory. For instance, an MI300X with 304 computing units (CU) and up
to 2048 threads per CU would use 304 · 2048 · 8192 bytes = 4.75 GiB.

Handling recursion and deep function calls
--------------------------------------------------------------------------------

Similar to CPU programming, recursive functions and deeply nested function calls
are supported. However, developers must ensure that these functions do not
exceed the available stack memory, considering the limited resources on GPUs.
exceed the available stack memory, considering a huge amount of memory needed
for the call stack due to the GPUs inherent parallelism.

.. code-block:: cpp
#include <hip/hip_runtime.h>
#include <iostream>
#define HIP_CHECK(expression) \
{ \
const hipError_t status = expression; \
Expand All @@ -101,32 +104,33 @@ exceed the available stack memory, considering the limited resources on GPUs.
<< __LINE__ << std::endl; \
} \
}
__device__ unsigned long long factorial(unsigned long long n)
{
if (n == 0 || n == 1) {
if (n == 0 || n == 1)
{
return 1;
}
return n * factorial(n - 1);
}
__global__ void kernel(unsigned long long n)
{
unsigned long long result = factorial(n);
const size_t x = threadIdx.x + blockDim.x * blockIdx.x;
if(x == 0)
printf("%llu! = %llu \n", n, result);
if (x == 0)
printf("%llu! = %llu \n", n, result);
}
int main()
{
kernel<<<1, 1>>>(10);
HIP_CHECK(hipDeviceSynchronize());
// With -O0 optimization option hit the stack limit
// kernel<<<1, 256>>>(2048);
// This hits the stack limit due to excessive recursion
// kernel<<<1, 1>>>(2048);
// HIP_CHECK(hipDeviceSynchronize());
return 0;
}
1 change: 1 addition & 0 deletions docs/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ The HIP documentation is organized into the following categories:
* {doc}`./how-to/hip_runtime_api/memory_management`
* {doc}`./how-to/hip_runtime_api/cooperative_groups`
* {doc}`./how-to/hip_runtime_api/hipgraph`
* {doc}`./how-to/hip_runtime_api/call_stack`
* [HIP porting guide](./how-to/hip_porting_guide)
* [HIP porting: driver API guide](./how-to/hip_porting_driver_api)
* {doc}`./how-to/hip_rtc`
Expand Down

0 comments on commit 814f61c

Please sign in to comment.