-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] MTL firmware does not resume any more #9308
Comments
Last working daily test run 43924 Start Time: 2024-07-16 13:08:34 UTC First failing daily test run 43972 Start Time: 2024-07-17 13:21:12 UTC d629e52 [email protected] // (origin/main) ipc4: copier: Add IPC4 channel map handler _
I checked all the corresponding Pull Requests and they all passed separately. |
can it be the 5a000e8 Zephyr upgrade? |
I only reverted LLEXT PR while keeping zephyr upgrade. |
@fredoh9 thanks for checking that! And that isn't surprising, since every PR, merged after LLEXT passed its testing too - on a base that didn't include LLEXT. But apparently one of those PRs conflicts with LLEXT. |
Temporarily disable CONFIG_MODULES to get MTL working again. Avoids crash thesofproject#9308 Signed-off-by: Marc Herbert <[email protected]>
Partial quote from #9291 (comment)
|
Temporarily disable CONFIG_MODULES to get MTL working again. Avoids crash #9308 Signed-off-by: Marc Herbert <[email protected]>
Test if this fixes thesofproject#9308 Signed-off-by: Guennadi Liakhovetski <[email protected]>
This reverts commit 5a000e8. This should fix thesofproject#9308 Signed-off-by: Guennadi Liakhovetski <[email protected]>
Wild guess: would NOT resuming from IMR temporarily but reliably avoid these SRAM power-off/cache alignment issues? Assuming the crash happens at resume time. This instead of reverting recent, unrelated commits that just happened to be in the wrong place at the wrong time:
The latter feels like a timebomb: which other, unrelated commits will just break it and crash again? |
Unfortunately, I have not yet managed to get to the bottom of these problems. However, I have determined the following:
My suspicion is that someone is writing to memory locations they shouldn't. |
@tmleman what do you mean by "IMR context save has been disabled for the MTL"
|
I just tested this and it works. In other words, adding the BIT(7) 0x80 to https://github.com/thesofproject/linux/blob/topic/sof-dev/sound/soc/sof/sof-priv.h Note IMR resume is being especially "evil" here because you don't need to reboot to drop the bit 0x80 but you need to reboot to add it. Maybe that's an error handling issue in the Linux kernel? cc: @jxstelter |
"Someone" in audio firmware or could it be "someone" elsewhere? |
@tmleman has explained to me that he had traced the problem down to IMR overwriting by the use of the IMR heap. That explains well why using (LLEXT) loadable modules triggers these problems. Indeed, "randomly" increasing some of the values in https://github.com/zephyrproject-rtos/zephyr/blob/f2b6490dee30b38dfb0ee31902177491c35ebacb/soc/intel/intel_adsp/ace/include/adsp_memory.h#L44-L68 fixes the problem. @tmleman could you provide a fix with some "meaningful" constants? And we need a strategy for keeping those up to date, or maybe generating them automatically. |
Note Zephyr commit zephyrproject-rtos/zephyr@6069f946be1bd502 has de-duplicated the |
The IMR is used by the firmware to hold its own copy for hot-booting and for an "L3 heap," used for slow large allocations like loadable libraries. The beginning of the L3 heap is currently hard-coded and now the firmware has grown too large to fit into the dedicated area so that it gets overwritten by heap allocations. This is a critical bug that needs an urgent solution, for which we increase the offset, but a real fix must calculate the L3 heap offset automatically. BugLink: thesofproject/sof#9308 Signed-off-by: Guennadi Liakhovetski <[email protected]>
@lyakh thank you for the fix. I had a problem with decoding all these macros and I wasn't sure if it would be enough to shift the IMR stack. To better outline what the problem was, here is a small map of the IMR along with the addresses from which the FW was copied: L3_MEM_BASE_ADDR = 0xa1000000
IMR_BOOT_LDR_MANIFEST_BASE = 0xa1042000
IMR_BOOT_LDR_TEXT_ENTRY_BASE = 0xa1048000
IMR_BOOT_LDR_LIT_BASE = 0xa1048180
IMR_BOOT_LDR_TEXT_BASE = 0xa10481c0
IMR_BOOT_LDR_DATA_BASE = 0xa1049000 # The beginning of the memory copied to the hpsram.
0xa1049000 -> 0xa0030000
IMR_BOOT_LDR_BSS_BASE = 0xa1110000
IMR_BOOT_LDR_STACK_BASE = 0xa1120000
IMR_L3_HEAP_BASE = 0xa1121000
The last copied address = 0xa1133000 -> 0xa011a000 The proposed fix moves |
@tmleman thanks for the break-down. Yes, the important thing is to move the L3 heap base further down, because it's that which overlaps with the firmware and where we overwrite its parts? And yes, we need to automate extracting these addresses, @lgirdwood proposes using a linker script. |
We need short term fix today for Zephyr, but long term we should use Zephyr methods for memory reservations |
I think as a hotfix that should be fine. |
I just filed new issue zephyrproject-rtos/zephyr#76247 to discuss and track a longer term solution. |
The IMR is used by the firmware to hold its own copy for hot-booting and for an "L3 heap," used for slow large allocations like loadable libraries. The beginning of the L3 heap is currently hard-coded and now the firmware has grown too large to fit into the dedicated area so that it gets overwritten by heap allocations. This is a critical bug that needs an urgent solution, for which we increase the offset, but a real fix must calculate the L3 heap offset automatically. BugLink: thesofproject/sof#9308 Signed-off-by: Guennadi Liakhovetski <[email protected]> (cherry picked from commit 7620208)
The IMR is used by the firmware to hold its own copy for hot-booting and for an "L3 heap," used for slow large allocations like loadable libraries. The beginning of the L3 heap is currently hard-coded and now the firmware has grown too large to fit into the dedicated area so that it gets overwritten by heap allocations. This is a critical bug that needs an urgent solution, for which we increase the offset, but a real fix must calculate the L3 heap offset automatically. BugLink: thesofproject/sof#9308 Signed-off-by: Guennadi Liakhovetski <[email protected]>
The IMR is used by the firmware to hold its own copy for hot-booting and for an "L3 heap," used for slow large allocations like loadable libraries. The beginning of the L3 heap is currently hard-coded and now the firmware has grown too large to fit into the dedicated area so that it gets overwritten by heap allocations. This is a critical bug that needs an urgent solution, for which we increase the offset, but a real fix must calculate the L3 heap offset automatically. (cherry picked from commit 293fa11) Original-BugLink: thesofproject/sof#9308 Original-Signed-off-by: Guennadi Liakhovetski <[email protected]> GitOrigin-RevId: 293fa11 Change-Id: If6dfc2283d19bd72cfa9bc627fdccefdeb6b8eb7 Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/5745034 Tested-by: ChromeOS Prod (Robot) <[email protected]> Reviewed-by: Fabio Baltieri <[email protected]> Commit-Queue: Fabio Baltieri <[email protected]>
Tentative Zephyr upgrade in |
@marc-hb opened a separate issue on the Zephyr side zephyrproject-rtos/zephyr#76247. I think this one can be closed unless the FW grows faster than someone manages to fix it. |
Once #9347 is passing and merged. |
The IMR is used by the firmware to hold its own copy for hot-booting and for an "L3 heap," used for slow large allocations like loadable libraries. The beginning of the L3 heap is currently hard-coded and now the firmware has grown too large to fit into the dedicated area so that it gets overwritten by heap allocations. This is a critical bug that needs an urgent solution, for which we increase the offset, but a real fix must calculate the L3 heap offset automatically. BugLink: thesofproject/sof#9308 Signed-off-by: Guennadi Liakhovetski <[email protected]>
As of July 22nd, the current status is: something (IMR heap?) overwrites the firmware code stored in the IMR and used for resuming. That's why not booting from IMR with
sof_debug=0x80
avoids the crash.To Reproduce
Use audio. Wait. Try to use it again.
Reproduction Rate
100%
Impact
Show stopper
cc:
Screenshots or console output
https://sof-ci.01.org/sofpr/PR9305/build6552/devicetest/index.html
https://sof-ci.01.org/softestpr/PR1220/build662/devicetest/index.html
The text was updated successfully, but these errors were encountered: