Unit test "test-backend-ops" crashed on macOS #4672

nguoithichkhampha · 2023-12-28T17:16:42Z

I'm using macOS 13.6 (Intel chip). Here is the stack trace.

slaren · 2023-12-29T12:45:17Z

There is a lot of missing output that would hint at the issue. I assume this is because the buffer allocation failed. I will add more checks so that these cases are detected and reported instead of crashing, but actually fixing this would require someone with an Intel Mac to figure out what the issue is.

nguoithichkhampha · 2024-01-02T16:52:50Z

LastTest.log
@slaren, I have uploaded the log file from running ctest in verbose mode.
You are right, there seems to be an issue when allocating the buffer:
MOE(n_experts=8,n_experts_per_tok=2,n_tokens=1,n_embd=4096,n_ff=8192): ggml_backend_metal_buffer_type_alloc_buffer: error: failed to allocate buffer, size = 3072.58 MiB

slaren · 2024-01-02T17:21:29Z

Thank you. The out-of-memory issue in the MoE test is not really a concern: it requires a larger buffer than can be allocated on your system. The log also shows that many MUL_MAT and MUL_MAT_ID tests are failing, and that's a problem since it may cause the Metal backend to silently produce wrong results. I think there are already some checks for the GPU family in the Metal matrix multiplication, but they may not be enough.

nguoithichkhampha · 2024-01-02T17:37:41Z

So, to debug this issue, should I look at the first MUL_MAT failure?
MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.04 MiB, ( 17.11 / 1536.00) 19: [MUL_MAT] NMSE = 3.110048 FAIL

slaren · 2024-01-02T17:48:44Z

All the failed MUL_MAT tests are important, not just the first one.
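
For readers following the FAIL lines above: the NMSE figure is a normalized mean squared error between a reference result and the result under test. A minimal sketch of that metric, assuming the usual definition (sum of squared differences divided by the sum of squared reference values). The function name and signature are illustrative and not copied from test-backend-ops.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error between a reference output `ref` and the
// output under test `out`: sum((ref - out)^2) / sum(ref^2).
// Assumption: this only illustrates the general idea behind the NMSE value
// in the log; the test program's own implementation may differ in details.
double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size() && !ref.empty());
    double err  = 0.0; // accumulated squared difference
    double norm = 0.0; // accumulated squared reference value
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = double(ref[i]) - double(out[i]);
        err  += d * d;
        norm += double(ref[i]) * double(ref[i]);
    }
    return err / norm; // not finite if the reference is all zeros
}
```

Under this definition, an NMSE of 3.110048 would mean the error carries about three times the energy of the reference itself, so the output is essentially unrelated to the expected result rather than off by rounding.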

ggerganov · 2024-01-06T13:47:36Z

@nguoithichkhampha Please check out #4794 and try again:

make clean
make -j tests && ./tests/test-backend-ops -b Metal

If the matrix multiplication tests continue to fail, please run the following and post the output:

MTL_DEBUG_LAYER=1 ./tests/test-backend-ops -b Metal

nguoithichkhampha · 2024-01-07T10:35:44Z

test-metal-backend.txt
No tests fail anymore, but I'm still getting the crash.
I also got a clearer stack trace:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libsystem_kernel.dylib 0x7ff80338b1e2 __pthread_kill + 10
1 libsystem_pthread.dylib 0x7ff8033c2ee6 pthread_kill + 263
2 libsystem_c.dylib 0x7ff8032e9b45 abort + 123
3 libsystem_c.dylib 0x7ff8032e8e5e __assert_rtn + 314
4 Metal 0x7ff80cbbd182 MTLReportFailure.cold.1 + 43
5 Metal 0x7ff80cb98bef MTLReportFailure + 529
6 Metal 0x7ff80cb8d4e0 _MTLMessageContextEnd + 1282
7 MetalTools 0x7ff803bcbefd -[MTLDebugDevice newBufferWithBytesNoCopy:length:options:deallocator:] + 237
8 test-backend-ops 0x107ee691c ggml_backend_metal_buffer_type_alloc_buffer + 252
9 test-backend-ops 0x107ecc44e ggml_backend_alloc_ctx_tensors_from_buft + 158
10 test-backend-ops 0x107e61155 test_case::eval(ggml_backend*, ggml_backend*, char const*) + 549
11 test-backend-ops 0x107e609d9 test_backend(ggml_backend*, test_mode, char const*) + 28505
12 test-backend-ops 0x107e598b1 main + 465
13 dyld 0x7ff80306941f start + 1903

It seems there is an assertion from the OS that prevents allocating a buffer larger than 2048 MB.

nguoithichkhampha · 2024-01-07T12:38:46Z

I think this makes sense, since my GPU has only 1536 MB of VRAM.
So we should check the max buffer length before calling
ctx->buffers[0].metal = [device newBufferWithBytesNoCopy:ctx->all_data length:size_aligned options:MTLResourceStorageModeShared deallocator:nil];
in the function ggml_backend_metal_buffer_type_alloc_buffer.
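
A rough sketch of the check proposed above, written as plain C++ so it can be read without the Objective-C context. The helper names `align_up` and `can_alloc_buffer` and their parameters are hypothetical. In the real backend the limit would presumably come from the Metal device (for example its `maxBufferLength` property), which is an assumption here and not something confirmed in this thread:

```cpp
#include <cassert>
#include <cstdint>

// Round `size` up to the next multiple of `alignment` (e.g. the page size).
std::uint64_t align_up(std::uint64_t size, std::uint64_t alignment) {
    assert(alignment > 0);
    return (size + alignment - 1) / alignment * alignment;
}

// Hypothetical guard: return false instead of calling into Metal when the
// aligned request exceeds the largest buffer the device says it can create.
bool can_alloc_buffer(std::uint64_t size,
                      std::uint64_t alignment,
                      std::uint64_t max_buffer_length) {
    if (size == 0) {
        return false; // nothing to allocate
    }
    return align_up(size, alignment) <= max_buffer_length;
}
```

With numbers like the ones in this thread, a request of roughly 3072 MiB against a 2048 MB limit would be rejected and could be reported as a normal allocation failure instead of tripping the assertion in the Metal debug layer.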

ggerganov · 2024-01-07T13:02:28Z

Yes, the MOE test is expected to fail due to out of memory - that's not a big concern.
The main problem is that your GPU should support the Metal3 feature set as defined by Apple's documentation:

https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf

However, we currently fail to detect that:

Backend 2/2 (Metal)
ggml_metal_init: allocating
2024-01-07 17:36:34.077 test-backend-ops[2294:105408] Metal API Validation Enabled
ggml_metal_init: found device: Intel(R) Iris(TM) Plus Graphics 650
ggml_metal_init: picking default device: Intel(R) Iris(TM) Plus Graphics 650
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/Emotiv/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: GPU name:   Intel(R) Iris(TM) Plus Graphics 650
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: simdgroup reduction support   = false
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  1610.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU

There should be a log message stating:

ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)

I just pushed another change to #4794 that would hopefully fix this.

nguoithichkhampha · 2024-01-07T14:07:53Z

Tried with the latest commit. I see the message GPU family: MTLGPUFamilyMetal3 as expected, but now I get another error followed by an assertion:

ggml_metal_init: found device: Intel(R) Iris(TM) Plus Graphics 650
ggml_metal_init: picking default device: Intel(R) Iris(TM) Plus Graphics 650
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/Emotiv/llama.cpp/build/bin/ggml-metal.metal'
ggml_metal_init: GPU name:   Intel(R) Iris(TM) Plus Graphics 650
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  =  1610.61 MB
ggml_metal_init: maxTransferRate               = built-in GPU
ggml_metal_init: error: load pipeline error: Error Domain=CompilerError Code=2 "AIR builtin function was called but no definition was found." UserInfo={NSLocalizedDescription=AIR builtin function was called but no definition was found.}
GGML_ASSERT: /Users/Emotiv/llama.cpp/tests/test-backend-ops.cpp:1703: backend != NULL

ggerganov · 2024-01-07T15:48:48Z

Thanks! I think it should work now. When you get the chance, please give it another try with the latest version and, if it fails, post the output again. It will be more verbose now.

nguoithichkhampha · 2024-01-07T16:55:47Z

OK, I read your code change. It seems that my GPU does not support mul_mat.
But it is still crashing, and it seems to be back to the first issue.
output.txt

ggerganov · 2024-01-08T09:00:37Z

Yup, it is unexpected that the MUL_MAT tests fail. Even though SIMD matrix multiplications are not available, it falls back to the other kernels, which only use SIMD reductions. These should be supported and should work correctly. Not sure what the issue is in this case.
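
To make the fallback described above concrete, here is a small sketch of capability-based kernel selection. All names (`GpuCaps`, `MulMatPath`, `pick_mul_mat_path`) are hypothetical and only illustrate the idea; they are not the identifiers used in ggml-metal, and the actual selection logic may look quite different:

```cpp
// Hypothetical capability flags, loosely modeled on the
// "simdgroup reduction support" line in the init log.
struct GpuCaps {
    bool simdgroup_reduction; // SIMD-group reductions available
    bool simdgroup_mm;        // SIMD-group matrix multiplication available
};

enum class MulMatPath {
    SimdgroupMatrix,    // fastest path, needs SIMD-group matrix multiplication
    SimdgroupReduction, // fallback kernels that only need SIMD reductions
    Unsupported         // no usable kernel: report it, do not run the op
};

// Pick the most capable path the device supports.
MulMatPath pick_mul_mat_path(const GpuCaps & caps) {
    if (caps.simdgroup_mm) {
        return MulMatPath::SimdgroupMatrix;
    }
    if (caps.simdgroup_reduction) {
        return MulMatPath::SimdgroupReduction;
    }
    return MulMatPath::Unsupported;
}
```

In a scheme like this, a device that reports reductions but not matrix multiplication would take the fallback path, which is why failing MUL_MAT results on that path are surprising.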

nguoithichkhampha added the bug-unconfirmed label Dec 28, 2023

ggerganov mentioned this issue Jan 6, 2024

metal : refactor kernel loading code #4794

Merged

ggerganov closed this as completed in #4794 Jan 13, 2024
