Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when Jemalloc pool is used with BA_PROVIDER #903

Open
vinser52 opened this issue Nov 17, 2024 · 9 comments
Open

Crash when Jemalloc pool is used with BA_PROVIDER #903

vinser52 opened this issue Nov 17, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@vinser52
Copy link
Contributor

vinser52 commented Nov 17, 2024

Environment Information

  • UMF version (hash commit or a tag): main
  • OS(es) version(s): Ubuntu 22.04.5 LTS

Please provide a reproduction of the bug:

I have created PR #902 as a reproducer for this bug.

How often bug is revealed:

The bug is always reproduced on Ubuntu 22.04.4, but works fine on Ubuntu 24.04.1. Probably it is caused by different versions of libjemalloc library.

Actual behavior:

Tests are crashed with SEGFAULT. The reason is that the jemalloc pool calls memory provider with an invalid pointer and an internal check of the base allocator caught this.

Here is an output of the test enabled in the PR #902 with UMF logs.

Run on the Ubuntu22.04.4:

svinogra@gkdse-dnp-01:~/unified-memory-framework/build$ UMF_LOG="level:debug;flush:debug;output:stdout" LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH ./test/umf_test-jemalloc_pool --gtest_filter=jemallocPoolTest/umfPoolTest.allocFree/1
[INFO  UMF] utils_log_init: Logger enabled (UMF version: 0.10.0-dev.git311.ga3050d4, level: DEBUG, flush: DEBUG, pid: no, timestamp: no)
[DEBUG UMF] umfMemoryTrackerCreate: tracker created, handle=0x7ff9754ad068, segment map=0x7ff9754a6008
Running main() from /user/svinogra/unified-memory-framework/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = jemallocPoolTest/umfPoolTest.allocFree/1
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from jemallocPoolTest/umfPoolTest
[ RUN      ] jemallocPoolTest/umfPoolTest.allocFree/1
[DEBUG UMF] umfTrackingMemoryProviderCreate: upstream=0x7ff9741f7068, tracker=0x7ff9754ad068, pool=0x7ff9754a80e8, ipcCache=0x7ff974c5e008, hIpcMappedCache=0x7ff9754b0068
[INFO  UMF] umfPoolCreateInternal: Memory pool created: 0x7ff9754a80e8
[DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7ff9754ad068, ptr=0x7ff9717ff000, pool=0x7ff9754a80e8, size=2097152
[ERROR UMF] trackingAllocationSplit: upstream provider failed to split the region
[DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7ff9754ad068, ptr=0x7ff9741e7000, pool=0x7ff9754a80e8, size=4096
[DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7ff9754ad068, ptr=0x7ff9741e7000, size=4096
[ERROR UMF] umfMemoryTrackerRemove: pointer 0x7ff971a00000 not found in the map
[ERROR UMF] trackingFree: failed to remove the region from the tracker, ptr=0x7ff971a00000, size=2097152, ret = 2147483646
umf_test-jemalloc_pool: /user/svinogra/unified-memory-framework/src/base_alloc/base_alloc.c:281: umf_ba_free: Assertion `pool_contains_pointer(pool, ptr)' failed.
Aborted (core dumped)

The jemalloc pool does two allocations from memory provider (see umfMemoryTrackerAdd in the log). AS a result it gets two buffers: ptr=0x7ff9717ff000, size=2097152 and ptr=0x7ff9741e7000, size=4096. At the end of the test, the jemalloc pool deallocates these buffers (see umfMemoryTrackerRemove), but one of the pointers is incorrect: [ERROR UMF] umfMemoryTrackerRemove: pointer 0x7ff971a00000 not found in the map.

Run on the Ubuntu 24.04.01:

❯ UMF_LOG="level:debug;flush:debug;output:stdout" LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH ./test/umf_test-jemalloc_pool --gtest_filter=jemallocPoolTest/umfPoolTest.allocFree/1
[INFO  UMF] utils_log_init: Logger enabled (UMF version: 0.10.0-dev.git308.g64456b7, level: DEBUG, flush: DEBUG, pid: no, timestamp: no)
[DEBUG UMF] umfMemoryTrackerCreate: tracker created, handle=0x7d2c6903e068, segment map=0x7d2c69037008
Running main() from /home/vinser52/repos/unified-memory-framework/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = jemallocPoolTest/umfPoolTest.allocFree/1
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from jemallocPoolTest/umfPoolTest
[ RUN      ] jemallocPoolTest/umfPoolTest.allocFree/1
[DEBUG UMF] umfTrackingMemoryProviderCreate: upstream=0x7d2c68f08068, tracker=0x7d2c6903e068, pool=0x7d2c690390e8, ipcCache=0x7d2c69036008, hIpcMappedCache=0x7d2c69041068
[INFO  UMF] umfPoolCreateInternal: Memory pool created: 0x7d2c690390e8
[DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7d2c6903e068, ptr=0x7d2c67001000, pool=0x7d2c690390e8, size=2097152
[ERROR UMF] trackingAllocationSplit: upstream provider failed to split the region
[DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7d2c6903e068, ptr=0x7d2c68eef000, pool=0x7d2c690390e8, size=4096
[DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7d2c6903e068, ptr=0x7d2c68eef000, size=4096
[ERROR UMF] clear_tracker_for_the_pool: tracking provider of pool 0x7d2c690390e8 is not empty! (1 items left)
[INFO  UMF] umfPoolDestroy: Memory pool destroyed: 0x7d2c690390e8
[       OK ] jemallocPoolTest/umfPoolTest.allocFree/1 (0 ms)
[----------] 1 test from jemallocPoolTest/umfPoolTest (0 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (0 ms total)
[  PASSED  ] 1 test.

As you can see on the Ubuntu 24.04.01 tests passed. But there is only one deallocation (see umfMemoryTrackerRemove) while jemalloc still allocates two buffers (see umfMemoryTrackerAdd).

Expected behavior:

Test with BA_PROVIDER should pass

Details

Additional information about Priority and Help Requested:

I think there is an issue with libjemalloc. Need to confirm. If so, we need to fix our CI to use appropriate version of libjemalloc.

Requested priority: High. It blocks PR #901

@vinser52 vinser52 added the bug Something isn't working label Nov 17, 2024
ldorau added a commit to ldorau/unified-memory-framework that referenced this issue Nov 18, 2024
@vinser52
Copy link
Contributor Author

I just checked how it works with the OS provider on Ubuntu 22.04:

svinogra@gkdse-dnp-01:~/unified-memory-framework/build$ UMF_LOG="level:debug;flush:debug;output:stdout" LD_LIBRARY_PATH=./lib:$LD_LIBRARY_PATH ./test/umf_test-jemalloc_pool --gtest_filter=jemallocPoolTest/umfPoolTest.allocFree/0
[INFO  UMF] utils_log_init: Logger enabled (UMF version: 0.10.0-dev.git311.ga3050d4, level: DEBUG, flush: DEBUG, pid: no, timestamp: no)
[DEBUG UMF] umfMemoryTrackerCreate: tracker created, handle=0x7f53bc1fc068, segment map=0x7f53bc1f5008
Running main() from /user/svinogra/unified-memory-framework/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = jemallocPoolTest/umfPoolTest.allocFree/0
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from jemallocPoolTest/umfPoolTest
[ RUN      ] jemallocPoolTest/umfPoolTest.allocFree/0
[INFO  UMF] translate_params: established HWLOC NUMA policy: 0
[INFO  UMF] os_initialize: OS provider initialized with NUMA nodes:
[INFO  UMF] os_initialize: 0-1
[DEBUG UMF] umfTrackingMemoryProviderCreate: upstream=0x7f53bb9a5068, tracker=0x7f53bc1fc068, pool=0x7f53bc1f70e8, ipcCache=0x7f53bb9a1008, hIpcMappedCache=0x7f53bc1ff068
[INFO  UMF] umfPoolCreateInternal: Memory pool created: 0x7f53bc1f70e8
[DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f53bc1fc068, ptr=0x7f53b8600000, pool=0x7f53bc1f70e8, size=2097152
[DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f53bc1fc068, ptr=0x7f53b8601000, pool=0x7f53bc1f70e8, size=2093056
[DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f53bc1fc068, ptr=0x7f53b8600000, size=4096
[DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f53bc1fc068, ptr=0x7f53b8601000, size=2093056
[ERROR UMF] umfMemoryTrackerRemove: pointer 0x7f53b8800000 not found in the map
[ERROR UMF] trackingFree: failed to remove the region from the tracker, ptr=0x7f53b8800000, size=2097152, ret = 2147483646
[INFO  UMF] umfPoolDestroy: Memory pool destroyed: 0x7f53bc1f70e8
[       OK ] jemallocPoolTest/umfPoolTest.allocFree/0 (192 ms)
[----------] 1 test from jemallocPoolTest/umfPoolTest (192 ms total)

As you can see there are two umfMemoryTrackerAdd calls and three umfMemoryTrackerRemove calls - the last one has invalid pointer again. So the issue is not memory provider-specific and is not related to split/merge functionality (OS provider supports it, right?).

@ldorau
Copy link
Contributor

ldorau commented Nov 18, 2024

With jemalloc v5.3.0:

commit 54eaed1d8b56b1aa528be3bdd1877e59c56fa90c (HEAD -> master, tag: 5.3.0, upstream/master, origin/master)
Merge: ea6b3e97 304c9198
Author: Qi Wang <[email protected]>
Date:   Fri May 6 11:28:25 2022 -0700

    Merge branch 'dev'

it does not crash:

$ LD_LIBRARY_PATH=/home/ldorau/work/jemalloc/lib/ ./test/umf_test-jemalloc_pool --gtest_filter=jemallocPoolTest/umfPoolTest.allocFree/1
[PID:19805  TID:19805  INFO  UMF] utils_log_init: Logger enabled (UMF version: 0.10.0-dev.git311.g4fa88ecd, level: DEBUG, flush: DEBUG, pid: yes, timestamp: no)
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerCreate: tracker created, handle=0x7f2ba7d35068, segment map=0x7f2ba7739008
Running main() from /home/ldorau/work/unified-memory-framework/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = jemallocPoolTest/umfPoolTest.allocFree/1
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from jemallocPoolTest/umfPoolTest
[ RUN      ] jemallocPoolTest/umfPoolTest.allocFree/1
[PID:19805  TID:19805  DEBUG UMF] umfTrackingMemoryProviderCreate: upstream=0x7f2ba773b068, tracker=0x7f2ba7d35068, pool=0x7f2ba7d300e8, ipcCache=0x7f2ba7732008, hIpcMappedCache=0x7f2ba7d38068
[PID:19805  TID:19805  INFO  UMF] umfPoolCreateInternal: Memory pool created: 0x7f2ba7d300e8
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f2ba7d35068, ptr=0x7f2ba69ff000, pool=0x7f2ba7d300e8, size=2097152
[PID:19805  TID:19805  ERROR UMF] trackingAllocationSplit: upstream provider failed to split the region
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f2ba7d35068, ptr=0x7f2ba771c000, pool=0x7f2ba7d300e8, size=4096
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f2ba7d35068, ptr=0x7f2ba771c000, size=4096
[PID:19805  TID:19805  ERROR UMF] clear_tracker_for_the_pool: tracking provider of pool 0x7f2ba7d300e8 is not empty! (1 items left)
[PID:19805  TID:19805  INFO  UMF] umfPoolDestroy: Memory pool destroyed: 0x7f2ba7d300e8
[       OK ] jemallocPoolTest/umfPoolTest.allocFree/1 (2 ms)
[----------] 1 test from jemallocPoolTest/umfPoolTest (2 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2 ms total)
[  PASSED  ] 1 test.

@vinser52
Copy link
Contributor Author

With jemalloc v5.3.0:

commit 54eaed1d8b56b1aa528be3bdd1877e59c56fa90c (HEAD -> master, tag: 5.3.0, upstream/master, origin/master)
Merge: ea6b3e97 304c9198
Author: Qi Wang <[email protected]>
Date:   Fri May 6 11:28:25 2022 -0700

    Merge branch 'dev'

it does not crash:

$ LD_LIBRARY_PATH=/home/ldorau/work/jemalloc/lib/ ./test/umf_test-jemalloc_pool --gtest_filter=jemallocPoolTest/umfPoolTest.allocFree/1
[PID:19805  TID:19805  INFO  UMF] utils_log_init: Logger enabled (UMF version: 0.10.0-dev.git311.g4fa88ecd, level: DEBUG, flush: DEBUG, pid: yes, timestamp: no)
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerCreate: tracker created, handle=0x7f2ba7d35068, segment map=0x7f2ba7739008
Running main() from /home/ldorau/work/unified-memory-framework/build/_deps/googletest-src/googletest/src/gtest_main.cc
Note: Google Test filter = jemallocPoolTest/umfPoolTest.allocFree/1
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from jemallocPoolTest/umfPoolTest
[ RUN      ] jemallocPoolTest/umfPoolTest.allocFree/1
[PID:19805  TID:19805  DEBUG UMF] umfTrackingMemoryProviderCreate: upstream=0x7f2ba773b068, tracker=0x7f2ba7d35068, pool=0x7f2ba7d300e8, ipcCache=0x7f2ba7732008, hIpcMappedCache=0x7f2ba7d38068
[PID:19805  TID:19805  INFO  UMF] umfPoolCreateInternal: Memory pool created: 0x7f2ba7d300e8
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f2ba7d35068, ptr=0x7f2ba69ff000, pool=0x7f2ba7d300e8, size=2097152
[PID:19805  TID:19805  ERROR UMF] trackingAllocationSplit: upstream provider failed to split the region
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerAdd: memory region is added, tracker=0x7f2ba7d35068, ptr=0x7f2ba771c000, pool=0x7f2ba7d300e8, size=4096
[PID:19805  TID:19805  DEBUG UMF] umfMemoryTrackerRemove: memory region removed: tracker=0x7f2ba7d35068, ptr=0x7f2ba771c000, size=4096
[PID:19805  TID:19805  ERROR UMF] clear_tracker_for_the_pool: tracking provider of pool 0x7f2ba7d300e8 is not empty! (1 items left)
[PID:19805  TID:19805  INFO  UMF] umfPoolDestroy: Memory pool destroyed: 0x7f2ba7d300e8
[       OK ] jemallocPoolTest/umfPoolTest.allocFree/1 (2 ms)
[----------] 1 test from jemallocPoolTest/umfPoolTest (2 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (2 ms total)
[  PASSED  ] 1 test.

It is the same what I see on the Ubuntu 24.04. But still there is an issue. Even the test did not crash the jemalloc makes 2 allocations from the provider but return back only one. As a result we have the following error in the log:

[PID:19805  TID:19805  ERROR UMF] clear_tracker_for_the_pool: tracking provider of pool 0x7f2ba7d300e8 is not empty! (1 items left)

If I remember in the past we had an assert if tracker is not empty for the pool. @ldorau did we change the logic of this check?

@ldorau
Copy link
Contributor

ldorau commented Nov 18, 2024

If I remember in the past we had an assert if tracker is not empty for the pool. @ldorau did we change the logic of this check?

Yes, there was an assert and it was removed:

$ git log -1 -p 29aa3c8a1c3474dd2990cd5b94a0b0180a34b560
commit 29aa3c8a1c3474dd2990cd5b94a0b0180a34b560
Author: Igor Chorążewicz <[email protected]>
Date:   Fri May 10 01:20:33 2024 +0000

    Remove assert from tracking provider

    This assert would be triggered every time a memory
    is leaked by the application. Currently this happens in
    certain SYCL tests.

diff --git a/src/provider/provider_tracking.c b/src/provider/provider_tracking.c
index 12af60db..cd9b072e 100644
--- a/src/provider/provider_tracking.c
+++ b/src/provider/provider_tracking.c
@@ -405,7 +405,6 @@ static void check_if_tracker_is_empty(umf_memory_tracker_handle_t hTracker,
                         "left)",
                         n_items);
             }
-            assert(n_items == 0);
         }
     }
 }

@vinser52
Copy link
Contributor Author

If I remember in the past we had an assert if tracker is not empty for the pool. @ldorau did we change the logic of this check?

Yes, there was an assert and it was removed:

$ git log -1 -p 29aa3c8a1c3474dd2990cd5b94a0b0180a34b560
commit 29aa3c8a1c3474dd2990cd5b94a0b0180a34b560
Author: Igor Chorążewicz <[email protected]>
Date:   Fri May 10 01:20:33 2024 +0000

    Remove assert from tracking provider

    This assert would be triggered every time a memory
    is leaked by the application. Currently this happens in
    certain SYCL tests.

diff --git a/src/provider/provider_tracking.c b/src/provider/provider_tracking.c
index 12af60db..cd9b072e 100644
--- a/src/provider/provider_tracking.c
+++ b/src/provider/provider_tracking.c
@@ -405,7 +405,6 @@ static void check_if_tracker_is_empty(umf_memory_tracker_handle_t hTracker,
                         "left)",
                         n_items);
             }
-            assert(n_items == 0);
         }
     }
 }

But again, it is a memory leak and we just deleted an assert that detected this leak instead of trying to fix the issue itself.

@igchor
Copy link
Member

igchor commented Nov 18, 2024

Just a note: having the assertion there was wrong, in my opinion. In general, we shouldn't force the application to not have any leaks. We can produce some logs or enable checking for leaks under some option (or, ideally, just properly support memcheck and asan), but by default, we should not assert/crash.

@vinser52
Copy link
Contributor Author

Just a note: having the assertion there was wrong, in my opinion. In general, we shouldn't force the application to not have any leaks. We can produce some logs or enable checking for leaks under some option (or, ideally, just properly support memcheck and asan), but by default, we should not assert/crash.

In general, I agree that we should not force applications to not have memory leaks.

But two thoughts from my side regarding this particular case:

  1. It is an assert that works only in debug mode. If I remember correctly Microsoft had similar asserts in debug mode to identify memory leaks.
  2. This assert identifies an issue between the pool manager and the memory provider, not between the user code and UMF. Our pool manager (not the user code) grabs memory from the memory provider and did not free this memory. And there is no such issue with other pool managers. Of course, in that particular case, it is a jemalloc that we do not control.

BTW, do we understand why jemalloc behaves in such way?

@igchor
Copy link
Member

igchor commented Nov 18, 2024

From what I remember, this assert will be triggered if an application does not free all allocations prior to calling pool/provider destroy, so this is not only a check for internal consistency.

@vinser52
Copy link
Contributor Author

From what I remember, this assert will be triggered if an application does not free all allocations prior to calling pool/provider destroy, so this is not only a check for internal consistency.

In that case I agree.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants