The timemory compiler instrumentation library provides the symbols required by the -finstrument-functions
compiler flag supported by Clang and GCC.
This compiler flag enables the insertion of a callback at the entry and exit of each function currently being compiled -- it has no effect whatsoever
on the functions which are defined in an external library and linked into your executable or library.
The -finstrument-functions
compiler flag inserts functions at the beginning and end of each compiled function (does not apply to externally defined functions), respectively:
void __cyg_profile_func_enter(void* this_fn, void* call_site);
void __cyg_profile_func_exit(void* this_fn, void* call_site);
The timemory-compiler-instrument
library invokes the timemory analogues to the above functions
from the timemory-compiler-instrument-base
library:
void timemory_profile_func_enter(void* this_fn, void* call_site);
void timemory_profile_func_exit(void* this_fn, void* call_site);
By default, this library instruments a wall-clock timer but can be customized exclusively
via the TIMEMORY_COMPILER_COMPONENTS
environment variable or synchronized with timemory
instrumentation present in the target code via TIMEMORY_GLOBAL_COMPONENTS
.
Timemory compiler instrumentation is fully compatible with codes that are instrumented with timemory
but, in order to prevent expensive checks for recursion, the data is collected separately --
resulting in an entirely separate set of outputs.
All traditional settings are available for the compiler instrumentation (see timemory-avail -S
), however,
the settings are only configurable via environment variables. Additionally, the compiler instrumentation
uses "TIMEMORY_COMPILER"
as the prefix for the environment variables instead of "TIMEMORY"
,
with the sole exception of "TIMEMORY_GLOBAL_COMPONENTS"
.
In other words, "TIMEMORY_FLAT_PROFILE=ON"
will not be applied to the compiler instrumentation; in order
to enable flat profiling for the compiler instrumentation, set "TIMEMORY_COMPILER_FLAT_PROFILE=ON"
,
and so on for "TIMEMORY_COMPILER_OUTPUT_PATH=..."
, etc.
NOTE: Environment variables
TIMEMORY_COMPILER_MAX_DEPTH
,TIMEMORY_COMPILER_THROTTLE_COUNT
, andTIMEMORY_COMPILER_THROTTLE_VALUE
can be very useful for reducing the overhead of the instrumentation. For more information, see the descriptions provided viatimemory-avail -Sd -r 'THROTTLE|MAX_DEPTH'
.
Timemory provides a timemory::timemory-compiler-instrument
target in CMake which provides the necessary
libraries and the compiler flags for the best resolution of the instrumented function names.
When compiling with
-finstrument-functions
, it is highly recommended to also compile with:-g -rdynamic -fno-omit-frame-pointer -fno-optimize-sibling-calls
. These flags are automatically provided when using CMake.
The Clang compiler supports -finstrument-functions-after-inlining
. This flag causes clang to only instrument non-inlined function calls, which
can significantly lower the size of the results and the overhead of the compiler instrumentation.
If this flag is available, timemory will propagate this flag to downstream projects by default instead of -finstrument-functions
unless TIMEMORY_INLINE_COMPILER_INSTRUMENTATION=ON
during the timemory CMake configuration (which sets the default for the timemory
installation) or the default for the installation can be changed by setting TIMEMORY_INLINE_COMPILER_INSTRUMENTATION=OFF
within the downstream CMake project.
In other words:
- Setting
TIMEMORY_INLINE_COMPILER_INSTRUMENTATION=ON
when installing timemory with the Clang compiler will setTIMEMORY_INLINE_COMPILER_INSTRUMENTATION
toON
by default in the installation. - Setting
TIMEMORY_INLINE_COMPILER_INSTRUMENTATION=OFF
when installing timemory with the Clang compiler will setTIMEMORY_INLINE_COMPILER_INSTRUMENTATION
toOFF
by default in the installation. - Using any non-Clang compiler to install timemory will set
TIMEMORY_INLINE_COMPILER_INSTRUMENTATION
toOFF
by default in the installation. - Using any non-Clang compiler to when building your project and linking against the
timemory::timemory-compiler-instrument
CMake target will result in-finstrument-functions
, regardless of whether Clang was used to install timemory.
For reference, the check (provided below in non-generator-syntax form) is:
if(NOT TIMEMORY_INLINE_COMPILER_INSTRUMENTATION AND CMAKE_<LANG>_COMPILER_ID MATCHES "Clang")
# if options is enabled and the C or CXX compiler ID for your compiler is Clang
target_compile_options(<mylib> PUBLIC -finstrument-functions-after-inlining)
else()
# otherwise, use flag which includes inlined function calls
target_compile_options(<mylib> PUBLIC -finstrument-functions)
endif()
#########################
#
# C or C++ codes
#
#########################
# Source is not instrumented with timemory
find_package(timemory COMPONENTS compiler-instrumentation)
# Source is instrumented with the timemory library interface
find_package(timemory COMPONENTS c cxx compiler-instrumentation)
#########################
#
# C++ codes
#
#########################
# Source is instrumented with timemory library and/or template interface
find_package(timemory COMPONENTS cxx compiler-instrumentation)
# Source is instrumented with timemory template interface in header-only mode
find_package(timemory COMPONENTS headers compiler-instrumentation)
# Source is instrumented with timemory macros and wants to disable all macros
find_package(timemory COMPONENTS cxx disable compiler-instrumentation)
# Source is instrumented with timemory and wants (in-source, not compiler)
# instrumentation disabled by default at runtime
find_package(timemory COMPONENTS cxx default-disabled compiler-instrumentation)
#########################
#
# Using one of above
#
#########################
# some library
add_library(foo SHARED foo.c foo.cpp)
# add in instrumentation, PUBLIC so it propagates to downstream targets
target_link_libraries(foo PUBLIC timemory::timemory)
# some executable
add_executable(bar bar.c)
# add in instrumentation to executable via timemory::timemory publicly linked to foo
target_link_libraries(bar PRIVATE foo)
CC = gcc
CFLAGS += -finstrument-functions -g -fno-omit-frame-pointer -fno-optimize-sibling-calls -rdynamic
LIBS += timemory-compiler-instrument dl rt
OBJ = foo.o
%.o: %.c $(DEPS)
$(CC) -c -o $@ $< $(CFLAGS)
bar: $(OBJ) bar.c
$(CC) -o $@ $^ $(CFLAGS) $(LIBS)
Compiler instrumentation will generate alot of profiling information -- instrumentation is also done for functions expanded inline in other functions.
In order to reduce this with GCC, refer to the compiler documentation for the additional flags: -finstrument-functions-exclude-file-list=file,file,... and
-finstrument-functions-exclude-function-list=sym,sym,....
In order to reduce the profiling information with Clang, set the TIMEMORY_INLINE_COMPILER_INSTRUMENTATION=OFF
#include <chrono>
#include <iostream>
#include <mutex>
inline long
fibonacci(long n)
{
return (n < 2) ? n : (fibonacci(n - 1) + fibonacci(n - 2));
}
inline void
consume(long n)
{
using mutex_t = std::mutex;
using lock_t = std::unique_lock<mutex_t>;
// a mutex held by one lock
mutex_t mutex;
// acquire lock
lock_t hold_lk(mutex);
// associate but defer
lock_t try_lk(mutex, std::defer_lock);
// get current time
auto now = std::chrono::steady_clock::now();
// try until time point
while(std::chrono::steady_clock::now() < (now + std::chrono::milliseconds(n)))
try_lk.try_lock();
}
int
main(int argc, char** argv)
{
long nwait = (argc > 1) ? atol(argv[1]) : 12;
int nitr = (argc > 2) ? atoi(argv[2]) : 10;
for(int i = 0; i < nitr; ++i)
{
consume(fibonacci(nwait));
}
return 0;
}
git clone https://github.com/NERSC/timemory.git timemory-compiler-instrument-example
cd timemory-compiler-instrument-example
mkdir build
cd build
cmake -DTIMEMORY_BUILD_EXAMPLES=ON -DTIMEMORY_BUILD_TOOLS=ON -DTIMEMORY_BUILD_COMPILER_INSTRUMENTATION=ON ..
cmake --build . --target ex_compiler_instrument
$ TIMEMORY_COMPILER_MAX_WIDTH=60 ./ex_compiler_instrument
[12301]> timemory-compiler-instrument will close after 'main' returns
[12301]> timemory-compiler-instrument: 47 results
[wall]|0> Outputting 'timemory-compiler-instrumentation-output/wall.flamegraph.json'...
[wall]|0> Outputting 'timemory-compiler-instrumentation-output/wall.json'...
[wall]|0> Outputting 'timemory-compiler-instrumentation-output/wall.tree.json'...
[wall]|0> Outputting 'timemory-compiler-instrumentation-output/wall.txt'...
[12301]> timemory-compiler-instrument: finalizing...
$ cat timemory-compiler-instrumentation-output/wall.txt
|--------------------------------------------------------------------------------------------------------------------------------------------------------|
| REAL-CLOCK TIMER (I.E. WALL-CLOCK TIMER) |
|--------------------------------------------------------------------------------------------------------------------------------------------------------|
| LABEL | COUNT | DEPTH | METRIC | UNITS | SUM | MEAN | MIN | MAX | STDDEV | % SELF |
|--------------------------------------------------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| >>> main | 1 | 0 | wall | sec | 1.544 | 1.544 | 1.544 | 1.544 | 0.000 | 0.0 |
| >>> |_fibonacci(long) | 10 | 1 | wall | sec | 0.092 | 0.009 | 0.008 | 0.011 | 0.001 | 0.4 |
| >>> |_fibonacci(long) | 20 | 2 | wall | sec | 0.092 | 0.005 | 0.003 | 0.007 | 0.001 | 0.8 |
| >>> |_fibonacci(long) | 40 | 3 | wall | sec | 0.091 | 0.002 | 0.001 | 0.004 | 0.001 | 1.6 |
| >>> |_fibonacci(long) | 80 | 4 | wall | sec | 0.089 | 0.001 | 0.000 | 0.002 | 0.001 | 3.2 |
| >>> |_fibonacci(long) | 160 | 5 | wall | sec | 0.086 | 0.001 | 0.000 | 0.001 | 0.000 | 6.7 |
| >>> |_fibonacci(long) | 320 | 6 | wall | sec | 0.081 | 0.000 | 0.000 | 0.001 | 0.000 | 14.1 |
| >>> |_fibonacci(long) | 640 | 7 | wall | sec | 0.069 | 0.000 | 0.000 | 0.001 | 0.000 | 30.5 |
| >>> |_fibonacci(long) | 1140 | 8 | wall | sec | 0.048 | 0.000 | 0.000 | 0.000 | 0.000 | 53.0 |
| >>> |_fibonacci(long) | 1280 | 9 | wall | sec | 0.023 | 0.000 | 0.000 | 0.000 | 0.000 | 71.8 |
| >>> |_fibonacci(long) | 740 | 10 | wall | sec | 0.006 | 0.000 | 0.000 | 0.000 | 0.000 | 84.5 |
| >>> |_fibonacci(long) | 200 | 11 | wall | sec | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 93.4 |
| >>> |_fibonacci(long) | 20 | 12 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_consume(long) | 10 | 1 | wall | sec | 1.452 | 0.145 | 0.145 | 0.146 | 0.000 | 13.1 |
| >>> |_std::__1::mutex::mutex() | 10 | 2 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 84.7 |
| >>> |_std::__1::mutex::mutex() | 10 | 3 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::unique_lock<std::__1::mutex>::unique_lo... | 10 | 2 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 45.2 |
| >>> |_std::__1::unique_lock<std::__1::mutex>::unique_... | 10 | 3 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 86.9 |
| >>> |_std::__1::mutex* std::__1::addressof<std::__1... | 10 | 4 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::unique_lock<std::__1::mutex>::unique_lo... | 10 | 2 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 45.5 |
| >>> |_std::__1::unique_lock<std::__1::mutex>::unique_... | 10 | 3 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 85.7 |
| >>> |_std::__1::mutex* std::__1::addressof<std::__1... | 10 | 4 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::chrono::duration<long long, std::__1::r... | 2610 | 2 | wall | sec | 0.061 | 0.000 | 0.000 | 0.000 | 0.000 | 87.1 |
| >>> |_std::__1::chrono::duration<long long, std::__1:... | 2610 | 3 | wall | sec | 0.008 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::chrono::time_point<std::__1::chrono::st... | 2610 | 2 | wall | sec | 0.864 | 0.000 | 0.000 | 0.001 | 0.000 | 16.9 |
| >>> |_std::__1::chrono::time_point<std::__1::chrono::... | 2610 | 3 | wall | sec | 0.008 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::common_type<std::__1::chrono::duratio... | 2610 | 3 | wall | sec | 0.650 | 0.000 | 0.000 | 0.001 | 0.000 | 30.0 |
| >>> |_std::__1::chrono::duration<long long, std::__... | 5220 | 4 | wall | sec | 0.017 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::chrono::duration<long long, std::__... | 2610 | 4 | wall | sec | 0.375 | 0.000 | 0.000 | 0.000 | 0.000 | 13.3 |
| >>> |_std::__1::chrono::duration<long long, std::... | 2610 | 5 | wall | sec | 0.325 | 0.000 | 0.000 | 0.000 | 0.000 | 30.5 |
| >>> |_std::__1::enable_if<__is_duration<std::__... | 2610 | 6 | wall | sec | 0.218 | 0.000 | 0.000 | 0.000 | 0.000 | 22.8 |
| >>> |_std::__1::chrono::__duration_cast<std::... | 2610 | 7 | wall | sec | 0.168 | 0.000 | 0.000 | 0.000 | 0.000 | 58.5 |
| >>> |_std::__1::chrono::duration<long long,... | 2610 | 8 | wall | sec | 0.008 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::chrono::duration<long long,... | 2610 | 8 | wall | sec | 0.062 | 0.000 | 0.000 | 0.000 | 0.000 | 86.4 |
| >>> |_std::__1::chrono::duration<long lon... | 2610 | 9 | wall | sec | 0.008 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::chrono::duration<long long, std... | 2610 | 6 | wall | sec | 0.008 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::chrono::duration<long long, std::__... | 2610 | 4 | wall | sec | 0.063 | 0.000 | 0.000 | 0.000 | 0.000 | 86.4 |
| >>> |_std::__1::chrono::duration<long long, std::... | 2610 | 5 | wall | sec | 0.009 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::chrono::time_point<std::__1::chrono::... | 2610 | 3 | wall | sec | 0.060 | 0.000 | 0.000 | 0.000 | 0.000 | 86.9 |
| >>> |_std::__1::chrono::time_point<std::__1::chrono... | 2610 | 4 | wall | sec | 0.008 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_bool std::__1::chrono::operator<<std::__1::chrono... | 2610 | 2 | wall | sec | 0.327 | 0.000 | 0.000 | 0.000 | 0.000 | 44.8 |
| >>> |_std::__1::chrono::time_point<std::__1::chrono::... | 5220 | 3 | wall | sec | 0.016 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_bool std::__1::chrono::operator<<long long, std... | 2610 | 3 | wall | sec | 0.165 | 0.000 | 0.000 | 0.000 | 0.000 | 29.7 |
| >>> |_std::__1::chrono::__duration_lt<std::__1::chr... | 2610 | 4 | wall | sec | 0.116 | 0.000 | 0.000 | 0.000 | 0.000 | 86.5 |
| >>> |_std::__1::chrono::duration<long long, std::... | 5220 | 5 | wall | sec | 0.016 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::unique_lock<std::__1::mutex>::try_lock() | 2600 | 2 | wall | sec | 0.008 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
| >>> |_std::__1::unique_lock<std::__1::mutex>::~unique_l... | 20 | 2 | wall | sec | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 86.8 |
| >>> |_std::__1::unique_lock<std::__1::mutex>::~unique... | 20 | 3 | wall | sec | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 100.0 |
|--------------------------------------------------------------------------------------------------------------------------------------------------------|