Failed "assert(buffer->is_shared)" in comp_buffer_connect() #9343

marc-hb · 2024-07-31T18:09:53Z

This is failing when fuzzing, see below.

This code came with commit 4a03699, @marcinszkudlinski can you please comment?

https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338

==5054==ERROR: AddressSanitizer: SEGV on unknown address 0x00000000 (pc 0x081aecfa bp 0xdbffadb8 sp 0xdbffacd0 T7)
==5054==The signal is caused by a WRITE memory access.
==5054==Hint: address points to the zero page.
    #0 0x81aecfa in k_sys_fatal_error_handler /home/runner/work/sof/sof/workspace/sof/zephyr/wrapper.c:352:19
    #1 0x815e3c9 in assert_post_action /home/runner/work/sof/sof/workspace/zephyr/lib/os/assert.c:43:2
    #2 0x8188b30 in comp_buffer_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc-helper.c:188:3
    #3 0x81819b6 in ipc_buffer_to_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/helper.c:594:9
    #4 0x8181404 in ipc_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/helper.c:[622](https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338#step:7:623):10
    #5 0x8176f3e in ipc_glb_tplg_comp_connect /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/handler.c:1375:9
    #6 0x8175bd4 in ipc_cmd /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc3/handler.c:1654:9
    #7 0x818e639 in ipc_platform_do_cmd /home/runner/work/sof/sof/workspace/sof/src/platform/posix/ipc.c:160:2
    #8 0x818750c in ipc_do_cmd /home/runner/work/sof/sof/workspace/sof/src/ipc/ipc-common.c:326:9
    #9 0x81af56b in task_run /home/runner/work/sof/sof/workspace/sof/zephyr/include/rtos/task.h:94:9
    #10 0x81af0de in edf_work_handler /home/runner/work/sof/sof/workspace/sof/zephyr/edf_schedule.c:32:16

Originally posted by @tmleman in #9338 (comment)

EDIT: somehow this was run again and is passing now: https://github.com/thesofproject/sof/actions/runs/10144692547/job/28093942074?pr=9338

Fuzzing is typically non-deterministic.

I have another PR with the same update and there was no failure there. Locally, I initially couldn't hit it either. That's why I did a rerun of fuzzing for this PR.
After some longer runs, I now have the same failure and I must admit that I don't quite understand the point of this assert. The assumption seems to be that when creating components we know they will be on separate cores and that they will be connected. That's why we set the buffer as shared, but I don't know how to verify this at an earlier stage so that here we can be sure that the buffer is shared.
I will also add that the same case reproduces for me on the current main.

The text was updated successfully, but these errors were encountered:

marc-hb · 2024-07-31T18:14:59Z

If this happens due to some invalid IPC sequence then comp_buffer_connect() should return an error code, not assert()

asserts can be disabled in production and they are disabled in production right now, example: #9308 (comment)

Asserts are only a debugging aid for "impossible"/buggy conditions.

Invalid IPC sequences are not considered "impossible": IPC fuzzing would be entirely pointless if they were.

marcinszkudlinski · 2024-08-05T09:04:21Z

Question: does IPC3 allow binding components located on separate cores? As I see in the code - no.

If its true (cross core bind is not allowed) - that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect

if its false (cross core bind is allowed) - assert should be replaced with an error exit code

in both situations - a fix is needed

lgirdwood · 2024-08-05T12:57:57Z

Question: does IPC3 allow binding components located on separate cores? As I see in the code - no.

If its true (cross core bind is not allowed) - that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect

if its false (cross core bind is allowed) - assert should be replaced with an error exit code

in both situations - a fix is needed

Its not allowed and the assert is ok as there are no IPC3 multicore users today. If someone does add multicore for IPC3 in the future they can address.

marc-hb · 2024-08-05T17:17:21Z

that means the assert is at right place and protection against the illegal situation should be in ipc_comp_to_buffer_connect()

Its not allowed and the assert is ok as there are no IPC3 multicore users today.

@marcinszkudlinski could you please add an error code in ipc_comp_to_buffer_connect()? Then probably cherry-pick to stable-v2.2.

marcinszkudlinski · 2024-08-06T13:35:59Z

well, it is a bit more complicated. As I see the check is already in place:

static int ipc_buffer_to_comp_connect(struct ipc_comp_dev *buffer,
				      struct ipc_comp_dev *comp)
{
	tr_dbg(&ipc_tr, "ipc: comp sink %d, source %d -> connect", comp->id,
	       buffer->id);

#if CONFIG_INCOHERENT
	if (comp->core != buffer->cb->core) {
		tr_err(&ipc_tr, "ipc: shared buffers are not supported for IPC3 incoherent architectures");
		return -ENOTSUP;
	}
#endif
	return comp_buffer_connect(comp->cd, comp->core, buffer->cb,
				   PPL_CONN_DIR_BUFFER_TO_COMP);
}

its checked for incoherent arch only, but the assert is there in every case

@marc-hb @lgirdwood Looks like fuzzing is performed with CONFIG_INCOHERENT = 0 What coherent architectures we support? Is fuzzing also performed with CONFIG_INCOHERENT = 1, especially for IPC4?

The quick fix would be to put the assert in question under conditional compiling, but there may be more places where checks or even "is_shared" flags are not necessary, checking in progress

marc-hb · 2024-08-06T16:07:58Z

The mystery deepens: I tried to reproduce but I couldn't.

I downloaded the unit "crash" file from https://github.com/thesofproject/sof/actions/runs/10144692547/attempts/1?pr=9338 (attached here before that log expires, should have done that sooner). IPC3 fuzz crash 9343.zip
git fetch origin 1067e5feb4232c1e115995aaa4d1ac08f34f1dda # merge commit in https://github.com/thesofproject/sof/actions/runs/10144692547/job/28048711573?pr=9338
west update
scripts/fuzz.sh -b -p -- -DCONFIG_IPC_MAJOR_3=y [ -DEXTRA_CFLAGS="-O0 -g3 ]
gdb -tui ./build-fuzz/zephyr/zephyr.exe
break ipc_comp_connect
run ./crash-117ab2d04ed5afa9417f47d35e1ffd7182663826

But SOF never even reaches ipc_buffer_to_comp_connect()! It returns an -EINVAL from ipc_comp_connect() first because icd_source is NULL:

int ipc_comp_connect(struct ipc *ipc, ipc_pipe_comp_connect *_connect)
{
	struct sof_ipc_pipe_comp_connect *connect = ipc_from_pipe_connect(_connect);
	struct ipc_comp_dev *icd_source;
	struct ipc_comp_dev *icd_sink;

	/* check whether the components already exist */
	icd_source = ipc_get_comp_dev(ipc, COMP_TYPE_ANY, connect->source_id);
	if (!icd_source) {
		tr_err(&ipc_tr, "ipc_comp_connect(): source component does not exist, source_id = %u sink_id = %u",
		       connect->source_id, connect->sink_id);
		return -EINVAL;
	}

What did I miss?

I have a newer Clang version (18), could that ~~make a difference~~? EDIT: tried on Ubuntu 22.04 just like Github and no difference.

andyross · 2024-08-06T17:56:23Z

Its not allowed and the assert is ok as there are no IPC3 multicore users today. If someone does add multicore for IPC3 in the future they can address.

Right, but the less proximate problem here is that the cross-core bind is accessible via the IPC3 protocol even if it's "not allowed" as a matter of policy and device configuration, and the case isn't being handled as a protocol error, only as a an assertion failure downward in the stack when it hits an unexpected situation.

As far as determinism: I've had mixed success with the reproducer files generated by the fuzzer too. It seems like too many things can affect the flow. Presumably what's happened is that the previously-fuzzed IPC commands left SOF in a weird state, but that may not be reachable in a reliable way from a developer system. Generally it's OK just to look at a failure and understand how we "could" have reached it and then fix via analysis instead.

If we're really sure that this "shouldn't be able to happen" then I guess we have work to do.

marc-hb · 2024-08-06T19:26:57Z

Presumably what's happened is that the previously-fuzzed IPC commands left SOF in a weird state, but that may not be reachable in a reliable way from a developer system.

Thanks! This means the crash-... file is technically incomplete. This is disappointing - and confusing/time-consuming.

@marc-hb @lgirdwood Looks like fuzzing is performed with CONFIG_INCOHERENT = 0 What coherent architectures we support? Is fuzzing also performed with CONFIG_INCOHERENT = 1, especially for IPC4?

Thanks, I took a quick look at this and I'm afraid we have some serious Kconfig problems here (what's new?). CONFIG_INCOHERENT is guarded by if (CONFIG_)XTENSA in src/platform/Kconfig. But:

(CONFIG_)XTENSA is not defined when fuzzing: neither fuzzing IPC3 nor fuzzing IPC4
More surprisingly, (CONFIG_)XTENSA is not defined either when compiling IPC3 on the main branch?! I tried ./scripts/docker-run.sh ./scripts/xtensa-build-all.sh rn.

So, in all of these cases CONFIG_INCOHERENT is not even reachable.

It would help if Kconfig+cpp could make a difference between "undefined" and zero but I'm afraid that ship has sailed a long time ago; sorry for the digression.

Long story short I think the very first step should be "top-down": first making sense of and clarifying the CONFIG_INCOHERENT situation. Only then we can start looking at the C code.

marc-hb · 2024-08-06T19:38:43Z

Thanks! This means the crash-... file is technically incomplete. This is disappointing - and confusing/time-consuming.

To be clear: some of these files have worked for me in the past. When they do, they save an ENORMOUS amount of testing time. So don't just cast them aside. Just disappointing to learn they don't always work.

marcinszkudlinski · 2024-08-08T10:44:04Z

I think the best solution would be:

int comp_buffer_connect(struct comp_dev *comp, uint32_t comp_core,
			struct comp_buffer *buffer, uint32_t dir)
{
	/* check if it's a connection between cores */
	if (buffer->core != comp_core) {
#if CONFIG_INCOHERENT
		/* buffer must be shared */
		assert(buffer->is_shared);
#else
		buffer->is_shared = true;
#endif
		if (!comp->is_shared)
			comp_make_shared(comp);
	}

	return pipeline_connect(comp, buffer, dir);
}

for incoherent archs the buffer must be allocated as shared/not shared, for coherent it may be set as shared at any moment

@lgirdwood ?

marcinszkudlinski · 2024-08-09T08:02:52Z

#9356 please comment

marc-hb · 2024-08-09T17:13:35Z

Even if 9356 is perfect, it does not change anything to the apparently messy Kconfig situation... @andyross could you help there? Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA? Can we align build-fuzz/.config with build/.config more? Some differences are unavoidable but they should be as small as possible: fuzzing something different from the product is both dangerous and a bit of a waste.

andyross · 2024-08-09T20:45:25Z

Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA?

Likely yes? The latter is a Zephyr kconfig indicating the architecture, which always has an incoherent L1 cache. The former is a SOF tunable I'm less familiar with. From the perspective of software[1], the incoherence is irrelevant when there's only one CPU. So my guess is that SOF allows this to be =n on single CPU builds? The equivalent (but logically converse) Zephyr abstraction is something called CONFIG_KERNEL_COHERENCE, which forces mutable .data/.bss access to be uncached and does some trickery to allow stacks to be cached, while leaving everything else (.rodata and instruction literals in .text) cached.

But regardless, the only architecture in this whole problem space with a truly incoherent cache is Xtensa, and that's very unlikely to change.

[1] And software only! Things like host-shared memory and DMA buffers still care about coherence even with a single CPU, but that's generally handled at the driver layer and not global firmware behavior, AFAIK.

marc-hb · 2024-08-10T21:35:37Z

Should CONFIG_INCOHERENT really be gated by CONFIG_XTENSA?

Likely yes? The latter is a Zephyr kconfig indicating the architecture, which always has an incoherent L1 cache

So does this mean the "INCOHERENT" code cannot be fuzzed? (This is why I asked in the first place)

marcinszkudlinski · 2024-08-12T06:37:11Z

@marc-hb even if you enable the flag for a cohernt arch, it won't change much (well, except some asserts like the above) - but the main problem, the infamous cache incoherency, won't be tested with fuzzing.

we can think of some backdoors - fuzzing is not any of productions' compilations anyway, but better would be to run fuzzing using an emulator.

marc-hb · 2024-08-12T19:07:11Z

but the main problem, the infamous cache incoherency, won't be tested with fuzzing.

AFAIK fuzzing is single threaded so I was never expecting fuzzing to catch memory coherency issues, that's not the point I was trying to make.

My main point is: we should always minimize #ifdef differences between fuzzed code and production code as much as possible so GitHub issues like this one don't even exist and we save a lot of time. Also, we have a greater chance to catch actual input sanitization issues. It's not about fuzzing finding very complex bugs but about minimizing #ifdef differences to have fewer false positives and fewer false negatives.

(well, except some asserts like the above)

Maybe it's not very "ambitious" but that's more like my main point.

but better would be to run fuzzing using an emulator.

That does not sound mutually exclusive. You can never have too much test coverage. You can only have limited validation time and resources and bad validation priorities.

marcinszkudlinski · 2024-08-13T08:02:31Z

AFAIK fuzzing is single threaded so I was never expecting fuzzing to catch memory coherency issues, that's not the point I was trying to make.

You're right

My main point is: we should always minimize #ifdef differences between fuzzed code and production code as much as possible

Well, I believe a backdoor than is not a bad idea,

kv2019i · 2024-08-20T07:34:35Z

@marc-hb @marcinszkudlinski I believe this can be closed now with #9356 merged ?

marcinszkudlinski · 2024-08-20T10:15:14Z

@kv2019i the assert won't go off anymore.
@marc-hb maybe open another issue for fuzzing compilation params?

marc-hb · 2024-08-22T00:37:09Z

@marc-hb maybe open another issue for fuzzing compilation params?

I just filed:

General lack of .config control exposed by fuzzer .config divergence #9386

Fuzzing is just the messenger.

marc-hb assigned marcinszkudlinski Jul 31, 2024

marc-hb added the bug Something isn't working as expected label Jul 31, 2024

marc-hb mentioned this issue Jul 31, 2024

west.yml: update zephyr to 3bcaa6f8d63 #9338

Merged

marc-hb assigned marc-hb and unassigned marcinszkudlinski Aug 21, 2024

marc-hb mentioned this issue Aug 22, 2024

General lack of .config control exposed by fuzzer .config divergence #9386

Open

marc-hb mentioned this issue Aug 22, 2024

scripts: add debug build to fuzz.sh #9375

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Failed "assert(buffer->is_shared)" in comp_buffer_connect() #9343

Failed "assert(buffer->is_shared)" in comp_buffer_connect() #9343

marc-hb commented Jul 31, 2024 •

edited

Loading

marc-hb commented Jul 31, 2024 •

edited

Loading

marcinszkudlinski commented Aug 5, 2024

lgirdwood commented Aug 5, 2024

marc-hb commented Aug 5, 2024 •

edited

Loading

marcinszkudlinski commented Aug 6, 2024

marc-hb commented Aug 6, 2024 •

edited

Loading

andyross commented Aug 6, 2024

marc-hb commented Aug 6, 2024

marc-hb commented Aug 6, 2024

marcinszkudlinski commented Aug 8, 2024 •

edited

Loading

marcinszkudlinski commented Aug 9, 2024

marc-hb commented Aug 9, 2024 •

edited

Loading

andyross commented Aug 9, 2024

marc-hb commented Aug 10, 2024

marcinszkudlinski commented Aug 12, 2024

marc-hb commented Aug 12, 2024 •

edited

Loading

marcinszkudlinski commented Aug 13, 2024

kv2019i commented Aug 20, 2024

marcinszkudlinski commented Aug 20, 2024 •

edited

Loading

marc-hb commented Aug 22, 2024

Failed "assert(buffer->is_shared)" in comp_buffer_connect() #9343

Failed "assert(buffer->is_shared)" in comp_buffer_connect() #9343

Comments

marc-hb commented Jul 31, 2024 • edited Loading

marc-hb commented Jul 31, 2024 • edited Loading

marcinszkudlinski commented Aug 5, 2024

lgirdwood commented Aug 5, 2024

marc-hb commented Aug 5, 2024 • edited Loading

marcinszkudlinski commented Aug 6, 2024

marc-hb commented Aug 6, 2024 • edited Loading

andyross commented Aug 6, 2024

marc-hb commented Aug 6, 2024

marc-hb commented Aug 6, 2024

marcinszkudlinski commented Aug 8, 2024 • edited Loading

marcinszkudlinski commented Aug 9, 2024

marc-hb commented Aug 9, 2024 • edited Loading

andyross commented Aug 9, 2024

marc-hb commented Aug 10, 2024

marcinszkudlinski commented Aug 12, 2024

marc-hb commented Aug 12, 2024 • edited Loading

marcinszkudlinski commented Aug 13, 2024

kv2019i commented Aug 20, 2024

marcinszkudlinski commented Aug 20, 2024 • edited Loading

marc-hb commented Aug 22, 2024

marc-hb commented Jul 31, 2024 •

edited

Loading

marc-hb commented Jul 31, 2024 •

edited

Loading

marc-hb commented Aug 5, 2024 •

edited

Loading

marc-hb commented Aug 6, 2024 •

edited

Loading

marcinszkudlinski commented Aug 8, 2024 •

edited

Loading

marc-hb commented Aug 9, 2024 •

edited

Loading

marc-hb commented Aug 12, 2024 •

edited

Loading

marcinszkudlinski commented Aug 20, 2024 •

edited

Loading