Using APEX with OpenMP

Kevin Huck edited this page Jan 9, 2017 · 9 revisions

This page shows how to build APEX, including all of the dependencies the example needs, and then how to run the example.

APEX supports the OpenMP Tools interface (OMPT), which has been proposed as part of the OpenMP 5.0 specification (under review as of January 2017). Through OMPT, APEX receives a callback each time a registered event occurs in a supported OpenMP runtime. APEX includes a prototype OMPT implementation based on the Intel OpenMP runtime, which was open-sourced in April of 2013. This runtime supports the OpenMP API calls generated by the Intel, GCC and Clang compilers.

APEX also uses Active Harmony and Binutils support in this example. Like OMPT, they are automatically downloaded, configured and built to work with APEX.

This example also uses the gperftools library provided by Google. gperftools provides tcmalloc, a drop-in replacement for the system malloc that is much faster in multithreaded applications. Similar libraries include Intel TBB and jemalloc. APEX currently supports tcmalloc and jemalloc; TBB support is planned for the future.

Step 1: Download APEX (to get latest code)

git clone https://github.com/khuck/xpress-apex
cd xpress-apex

Step 2: Configure APEX

mkdir build
cd build
cmake \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_TESTS=TRUE \
-DBUILD_EXAMPLES=TRUE \
-DBUILD_ACTIVEHARMONY=TRUE \
-DBUILD_BFD=TRUE \
-DBUILD_OMPT=TRUE \
-DGPERFTOOLS_ROOT=/usr/local/gperftools/2.5 \
-DUSE_PLUGINS=TRUE \
-DCMAKE_INSTALL_PREFIX=../install ..

If you don't have gperftools, omit the "GPERFTOOLS_ROOT" argument to cmake.
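
One way to avoid editing the cmake command by hand is to add the argument only when the directory exists. The following is a minimal sketch, not part of the APEX build system: the variable GPERFTOOLS_DIR and the helper gperftools_arg are hypothetical names, and the default path is simply the one used in the command above.

```shell
# Assumed install location, copied from the cmake command above; adjust for your system.
GPERFTOOLS_DIR="${GPERFTOOLS_DIR:-/usr/local/gperftools/2.5}"

# Print the cmake argument only when the given directory exists (hypothetical helper).
gperftools_arg() {
    if [ -d "$1" ]; then
        printf '%s\n' "-DGPERFTOOLS_ROOT=$1"
    fi
}

cmake_extra="$(gperftools_arg "$GPERFTOOLS_DIR")"
echo "extra cmake argument: ${cmake_extra:-<none>}"

# Pass $cmake_extra unquoted in the cmake command from Step 2, so that an
# empty value adds no argument (this assumes the path contains no spaces).
```

If the echo line reports `<none>`, the build falls back to the system malloc, as described above.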

Step 3: Build APEX

make -j16
make install
cd ../install

As mentioned in the introduction, this downloads, configures, builds and installs Active Harmony, the OpenMP runtime with OMPT support, and Binutils. The APEX code is then built and installed.

Step 4: Run the example

APEX includes a slightly modified version of the LULESH benchmark kernel. The modifications include changes to the OpenMP pragmas and a modified memory allocation scheme that prevents crashes when the number of active threads changes. The OpenMP modifications are:

  • splitting "#pragma omp parallel for" into two statements, "#pragma omp parallel" and "#pragma omp for"
  • removing the "nowait" argument from OpenMP pragmas
  • adding "schedule(runtime)" to all OpenMP parallel loops

These changes are necessary so that APEX can dynamically adapt the number of active threads, schedule and chunk size for each parallel region independently. For more details, see the IEEE Cluster publication for ARCS, which includes APEX.
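
Because the loops use schedule(runtime), the schedule is read at run time instead of being fixed at compile time. As a rough illustration of what that enables, and not part of the APEX workflow, the standard OpenMP environment variables OMP_SCHEDULE and OMP_NUM_THREADS can apply one configuration by hand to every region. The helper name omp_schedule_value is hypothetical, the values are arbitrary examples, and the commented run line assumes the install directory from Step 3 with the policy plugin variables unset. APEX, by contrast, chooses these values separately for each region.

```shell
# Build an OMP_SCHEDULE value such as "dynamic,128" (standard OpenMP syntax: kind[,chunk]).
omp_schedule_value() {
    kind="$1"
    chunk="$2"
    case "$kind" in
        static|dynamic|guided|auto) ;;
        *) echo "unknown schedule kind: $kind" >&2; return 1 ;;
    esac
    if [ -n "$chunk" ]; then
        printf '%s,%s\n' "$kind" "$chunk"
    else
        printf '%s\n' "$kind"
    fi
}

# Example values only; they apply to every schedule(runtime) loop in the program.
export OMP_SCHEDULE="$(omp_schedule_value dynamic 128)"
export OMP_NUM_THREADS=8
echo "OMP_SCHEDULE=$OMP_SCHEDULE OMP_NUM_THREADS=$OMP_NUM_THREADS"

# ./bin/lulesh_OpenMP_2.0 -s 64 -i 25 -p
```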

Copy the tuning space JSON file from the example source code, and edit it to not use more threads than you have cores:

cp ../src/examples/OpenMP_Policy/space.json .
vim ./space.json
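
To decide on the upper bound while editing, it helps to know how many CPUs the machine reports. The sketch below assumes a system where getconf supports _NPROCESSORS_ONLN (glibc-based Linux and macOS do); it counts logical CPUs, which can exceed the number of physical cores when hyperthreading is enabled. The helper clamp_threads is a hypothetical name used only for illustration.

```shell
# Number of logical CPUs currently online (hardware threads, not necessarily physical cores).
logical_cpus="$(getconf _NPROCESSORS_ONLN)"
echo "logical CPUs online: $logical_cpus"

# Return the requested thread count, capped at the given limit (hypothetical helper).
clamp_threads() {
    requested="$1"
    limit="$2"
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

# 32 is the thread count used later on this page.
echo "suggested maximum thread count: $(clamp_threads 32 "$logical_cpus")"
```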

Set the relevant policy plugin options, and run the example.

export OMP_NUM_THREADS=32  # or max cores
export APEX_PLUGINS_PATH=`pwd`/lib
export APEX_PLUGINS=libapex_openmp_policy
export APEX_OPENMP_SPACE=./space.json
export APEX_SCREEN_OUTPUT=1
./bin/lulesh_OpenMP_2.0 -s 64 -i 25 -p

The output should look something like this:

[khuck@godzilla install]$ ./bin/lulesh_OpenMP_2.0 -s 64 -i 25 -p
apex_plugin_init
apex_openmp_policy init
Using default tuning strategy (NELDER_MEAD)
Running problem size 64^3 per domain until completion
Num processors: 1
Registering OMPT events...done.
Num threads: 32
Total number of elements: 262144

To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options

cycle = 1, time = 5.831050e-07, dt=5.831050e-07
Tuning has converged for session 21.
Tuning has converged for session 20.
cycle = 2, time = 1.282831e-06, dt=6.997260e-07
cycle = 3, time = 1.522670e-06, dt=2.398394e-07
ERROR: calls = 0 for OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x404730
Tuning has converged for session 19.
Tuning has converged for session 23.
cycle = 4, time = 1.722955e-06, dt=2.002845e-07
Tuning has converged for session 24.
cycle = 5, time = 1.901501e-06, dt=1.785459e-07
Tuning has converged for session 25.
Tuning has converged for session 22.
cycle = 6, time = 2.067289e-06, dt=1.657880e-07
Tuning has converged for session 18.
cycle = 7, time = 2.224910e-06, dt=1.576217e-07
cycle = 8, time = 2.377156e-06, dt=1.522454e-07
Tuning has converged for session 16.
cycle = 9, time = 2.525914e-06, dt=1.487584e-07
Tuning has converged for session 26.
cycle = 10, time = 2.672575e-06, dt=1.466606e-07
Tuning has converged for session 27.
Tuning has converged for session 30.
cycle = 11, time = 2.848568e-06, dt=1.759927e-07
cycle = 12, time = 3.047272e-06, dt=1.987049e-07
cycle = 13, time = 3.235112e-06, dt=1.878400e-07
cycle = 14, time = 3.412132e-06, dt=1.770197e-07
cycle = 15, time = 3.578321e-06, dt=1.661889e-07
cycle = 16, time = 3.736980e-06, dt=1.586588e-07
Tuning has converged for session 29.
cycle = 17, time = 3.890910e-06, dt=1.539304e-07
cycle = 18, time = 4.042518e-06, dt=1.516081e-07
cycle = 19, time = 4.194103e-06, dt=1.515843e-07
cycle = 20, time = 4.345687e-06, dt=1.515843e-07
cycle = 21, time = 4.497271e-06, dt=1.515843e-07
cycle = 22, time = 4.648855e-06, dt=1.515843e-07
cycle = 23, time = 4.800440e-06, dt=1.515843e-07
cycle = 24, time = 4.968220e-06, dt=1.677805e-07
cycle = 25, time = 5.136001e-06, dt=1.677805e-07
Run completed:  
   Problem size        =  64 
   MPI tasks           =  1 
   Iteration count     =  25 
   Final Origin Energy = 3.610655e+07 
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 2.793968e-09
        TotalAbsDiff = 3.034312e-09
        MaxRelDiff   = 2.106546e-14


Elapsed time         =       4.90 (s)
Grind time (us/z/c)  = 0.74820572 (per dom)  (0.74820572 overall)
FOM                  =  1336.5308 (z/s)

apex_plugin_finalize
apex_openmp_policy finalize

The results file should look something like this:

[khuck@godzilla install]$ cat results-2017-01-09-15-07-24.csv 
"name","num_threads","schedule","chunk_size","converged"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4068c0"$8$"dynamic"$128$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4067c0"$8$"guided"$64$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x406630"$4$"static"$256$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4057d0"$8$"dynamic"$256$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x406160"$8$"dynamic"$256$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x405cc0"$8$"dynamic"$256$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x403480"$8$"static"$64$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x401fc0"$8$"guided"$128$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x403640"$16$"dynamic"$512$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x402140"$24$"guided"$128$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x405df0"$8$"guided"$128$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x403380"$8$"dynamic"$512$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x403800"$16$"dynamic"$512$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x403910"$16$"dynamic"$512$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x408230"$16$"dynamic"$64$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x401c30"$16$"dynamic"$512$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x401d80"$8$"guided"$128$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x401ce0"$8$"guided"$64$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x406d20"$8$"dynamic"$128$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x401e60"$16$"dynamic"$32$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4058c0"$4$"guided"$64$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x406b60"$4$"guided"$64$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x405320"$8$"guided"$256$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4099f0"$24$"dynamic"$128$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4063c0"$8$"guided"$256$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x403a40"$16$"guided"$64$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x404730"$16$"dynamic"$128$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4050c0"$8$"dynamic"$256$"NOT CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4064a0"$8$"guided"$256$"CONVERGED"
"OpenMP_PARALLEL_REGION: UNRESOLVED ADDR 0x4059e0"$8$"guided"$64$"CONVERGED"

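The listing above separates the data fields with `$` characters, although the header line uses commas. Assuming that layout and the results-<timestamp>.csv naming shown above, the following sketch finds the newest results file in a directory and lists the regions that converged. The function names latest_results and summarize_results are hypothetical and not part of APEX.

```shell
# Print the newest results-*.csv in a directory (default: current directory).
latest_results() {
    dir="${1:-.}"
    latest=""
    for f in "$dir"/results-*.csv; do
        [ -e "$f" ] || continue   # the glob matched nothing
        latest="$f"               # globs expand in sorted order, so the last match has the latest timestamp
    done
    if [ -n "$latest" ]; then
        printf '%s\n' "$latest"
    fi
}

# List the converged regions in one results file, then print a count.
summarize_results() {
    awk -F'[$]' '
        NR == 1 { next }          # skip the comma-separated header line
        NF < 5  { next }          # skip blank or malformed lines
        {
            gsub(/"/, "")         # strip the double quotes; awk re-splits the fields
            total++
            if ($5 == "CONVERGED") {
                converged++
                printf "%s threads=%s schedule=%s chunk=%s\n", $1, $2, $3, $4
            }
        }
        END { printf "%d of %d regions converged\n", converged, total }
    ' "$1"
}

results_file="$(latest_results .)"
if [ -n "$results_file" ]; then
    summarize_results "$results_file"
else
    echo "no results-*.csv file found in the current directory" >&2
fi
```

For the listing above, the final line of the summary would read "13 of 30 regions converged", which matches the 13 "Tuning has converged" messages in the screen output.
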
To see comments from the policy as it executes, also set the environment variable APEX_OPENMP_VERBOSE=1. After execution, the policy dumps the converged values to a results file, which can be used in subsequent runs instead of tuning again. To do that, set the environment variable APEX_OPENMP_HISTORY to the path of the results file, like this:

export APEX_OPENMP_HISTORY=`pwd`/results-2017-01-09-15-07-24.csv

Running with the converged values gives output like this (note the shorter execution time, because the tuning/search overhead is eliminated):

apex_plugin_init
apex_openmp_policy init
Using default tuning strategy (NELDER_MEAD)
Using tuning history file: /home/users/khuck/src/xpress-apex/install/results-2017-01-09-15-07-24.csv
Running problem size 64^3 per domain until completion
Num processors: 1
Registering OMPT events...done.
Num threads: 32
Total number of elements: 262144

To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options

cycle = 1, time = 5.831050e-07, dt=5.831050e-07
cycle = 2, time = 1.282831e-06, dt=6.997260e-07
cycle = 3, time = 1.522670e-06, dt=2.398394e-07
cycle = 4, time = 1.722955e-06, dt=2.002845e-07
cycle = 5, time = 1.901501e-06, dt=1.785459e-07
cycle = 6, time = 2.067289e-06, dt=1.657880e-07
cycle = 7, time = 2.224910e-06, dt=1.576217e-07
cycle = 8, time = 2.377156e-06, dt=1.522454e-07
cycle = 9, time = 2.525914e-06, dt=1.487584e-07
cycle = 10, time = 2.672575e-06, dt=1.466606e-07
cycle = 11, time = 2.848568e-06, dt=1.759927e-07
cycle = 12, time = 3.047272e-06, dt=1.987049e-07
cycle = 13, time = 3.235112e-06, dt=1.878400e-07
cycle = 14, time = 3.412132e-06, dt=1.770197e-07
cycle = 15, time = 3.578321e-06, dt=1.661889e-07
cycle = 16, time = 3.736980e-06, dt=1.586588e-07
cycle = 17, time = 3.890910e-06, dt=1.539304e-07
cycle = 18, time = 4.042518e-06, dt=1.516081e-07
cycle = 19, time = 4.194103e-06, dt=1.515843e-07
cycle = 20, time = 4.345687e-06, dt=1.515843e-07
cycle = 21, time = 4.497271e-06, dt=1.515843e-07
cycle = 22, time = 4.648855e-06, dt=1.515843e-07
cycle = 23, time = 4.800440e-06, dt=1.515843e-07
cycle = 24, time = 4.968220e-06, dt=1.677805e-07
cycle = 25, time = 5.136001e-06, dt=1.677805e-07
Run completed:  
   Problem size        =  64 
   MPI tasks           =  1 
   Iteration count     =  25 
   Final Origin Energy = 3.610655e+07 
   Testing Plane 0 of Energy Array on rank 0:
        MaxAbsDiff   = 2.793968e-09
        TotalAbsDiff = 3.034312e-09
        MaxRelDiff   = 2.106546e-14


Elapsed time         =       3.86 (s)
Grind time (us/z/c)  = 0.58937607 (per dom)  (0.58937607 overall)
FOM                  =  1696.7095 (z/s)

apex_plugin_finalize
apex_openmp_policy finalize
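
To put the difference between the two runs in numbers, the elapsed times reported above (4.90 s while tuning, 3.86 s with the history file) can be compared directly. This is a small worked calculation using awk for the floating-point arithmetic; the function name compare_runs is hypothetical.

```shell
# Elapsed times in seconds, copied from the two runs shown on this page.
tuning_run=4.90
history_run=3.86

# Print the speedup factor and the percentage of time saved (hypothetical helper).
compare_runs() {
    awk -v t="$1" -v h="$2" 'BEGIN {
        printf "speedup: %.2fx\n", t / h
        printf "time saved: %.1f%%\n", 100 * (t - h) / t
    }'
}

compare_runs "$tuning_run" "$history_run"
# prints:
#   speedup: 1.27x
#   time saved: 21.2%
```

These figures come from a single pair of runs on one machine, so they indicate the size of the tuning overhead in this example and are not a general measurement.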