Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] MTL firmware does not resume any more #9308

Closed
marc-hb opened this issue Jul 18, 2024 · 25 comments
Closed

[BUG] MTL firmware does not resume any more #9308

marc-hb opened this issue Jul 18, 2024 · 25 comments
Assignees
Labels
bug Something isn't working as expected MTL Applies to Meteor Lake platform P1 Blocker bugs or important features

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 18, 2024

As of July 22nd, the current status is: something (IMR heap?) overwrites the firmware code stored in the IMR and used for resuming. That's why not booting from IMR with sof_debug=0x80 avoids the crash.
 


To Reproduce

Use audio. Wait. Try to use it again.

Reproduction Rate

100%

Impact

Show stopper

cc:

Screenshots or console output

Screenshot 2024-07-17 at 17 45 08

https://sof-ci.01.org/sofpr/PR9305/build6552/devicetest/index.html
https://sof-ci.01.org/softestpr/PR1220/build662/devicetest/index.html

[  161.664059] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIR: 00000000 at  00000044
[  161.664078] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIR: 00000000 at  00000048
[  161.667119] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIR: 00000000 at  00000048
[  161.667125] kernel: snd_sof:snd_sof_pci_update_bits_unlocked: sof-audio-pci-intel-mtl 0000:00:1f.3: Debug PCIW: 00000040 at  00000048
[  161.667133] kernel: snd_sof_intel_hda_common:hda_dsp_state_log: sof-audio-pci-intel-mtl 0000:00:1f.3: Current DSP power state: D0I0
[  161.667138] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 0 -> 2
[  161.667143] kernel: snd_sof:snd_sof_load_firmware: sof-audio-pci-intel-mtl 0000:00:1f.3: loading firmware
[  161.667146] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 2 -> 3
[  161.668181] kernel: snd_sof_pci_intel_mtl:mtl_dsp_pre_fw_run: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1000]=0x1010000 successful
[  161.669221] kernel: snd_sof_pci_intel_mtl:mtl_dsp_pre_fw_run: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1d1c]=0x8151 successful
[  161.669227] kernel: snd_sof:snd_sof_run_firmware: sof-audio-pci-intel-mtl 0000:00:1f.3: booting DSP firmware
[  161.669233] kernel: snd_sof_intel_hda_common:hda_dsp_cl_boot_firmware: sof-audio-pci-intel-mtl 0000:00:1f.3: IMR restore supported, booting from IMR directly
[  161.670274] kernel: snd_sof_pci_intel_mtl:mtl_dsp_core_power_up: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x178d04]=0x2000101 successful
[  161.670281] kernel: snd_sof_pci_intel_mtl:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: Primary core power up successful
[  161.670290] kernel: snd_sof_pci_intel_mtl:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x73214]=0x80000000 successful
[  161.670301] kernel: snd_sof_pci_intel_mtl:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1800]=0x41 successful
[  161.670310] kernel: snd_sof_pci_intel_mtl:mtl_enable_interrupts: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x1140]=0x1 successful
[  161.670320] kernel: snd_sof_pci_intel_mtl:mtl_dsp_cl_init: sof-audio-pci-intel-mtl 0000:00:1f.3: FW Poll Status: reg[0x163200]=0x50000005 successful
[  163.714303] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  163.714323] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware boot failure due to timeout
[  163.714330] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (3)
[  163.714343] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x50000005: module: ROM_EXT, state: FW_ENTERED, running
[  163.714356] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Firmware state: 0x5, status/error code: 0x0
[  163.714438] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: Unknown toolchain is used
[  163.714446] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: DSP Firmware Oops
[  163.714451] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: Exception Cause: AllocaCause, MOVSP instruction, if caller’s registers are not in the register file
[  163.714458] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EXCCAUSE 0x00000005 EXCVADDR 0x00000004 PS       0x00060020 SAR     0x00000004
[  163.714466] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPC1     0xa0088553 EPC2     0x00000000 EPC3     0x00000000 EPC4    0x00000000
[  163.714474] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPC5     0x00000000 EPC6     0x00000000 EPC7     0x00000000 DEPC    0x00000000
[  163.714480] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPS2     0x00000000 EPS3     0x00000000 EPS4     0x00000000 EPS5    0x00000000
[  163.714486] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: EPS6     0x00000000 EPS7     0x00000000 INTENABL 0x00000000 INTERRU 0x00000000
[  163.714492] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: stack dump from 0x00000000
[  163.714498] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: AR registers:
[  163.714506] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x0: a0041e68 a0102688 00000000 a0065c84
[  163.714512] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x10: a0066370 00000001 0000000a a01025c8
[  163.714519] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x20: a0065d65 a0102628 00000001 40150478
[  163.714525] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: 0x30: a0065d65 a0102628 00000001 40150478
[  163.714532] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ------------[ DSP dump end ]------------
[  163.714538] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: failed to boot DSP firmware after resume -5
[  163.714547] kernel: snd_sof:sof_set_fw_state: sof-audio-pci-intel-mtl 0000:00:1f.3: fw_state change: 3 -> 4
[  163.714558] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_pm_runtime_get on 0000:00:1f.3: -5
[  163.714569] kernel:  NoCodec-0: ASoC: error at __soc_pcm_open on NoCodec-0: -5
[  163.714578] kernel:  Port0: ASoC: error at dpcm_be_dai_startup on Port0: -5
[  163.714588] kernel:  Port0: ASoC: error at dpcm_fe_dai_startup on Port0: -5
@marc-hb marc-hb added bug Something isn't working as expected urgent labels Jul 18, 2024
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 18, 2024

Last working daily test run 43924

Start Time: 2024-07-16 13:08:34 UTC
Linux Commit: 1998ade4783a
KConfig Commit: 8189104a4f38
SOF Commit: 3051607efb4f
Zephyr Commit: 650227d8c47f

First failing daily test run 43972

Start Time: 2024-07-17 13:21:12 UTC
Linux Commit: c3ecd35f66c1
KConfig Commit: 8189104a4f38
SOF Commit: d629e521020c
Zephyr Commit: 740d7f735e23

d629e52 [email protected] // (origin/main) ipc4: copier: Add IPC4 channel map handler
ffce2cb [email protected] // ipc4: copier: Extend get_convertion_func() to support remapping
eda6029 [email protected] // dai-zephyr: Prioritize HW params channels over base config params
8082006 [email protected] // ipc4: pcm_converter: Add channel remapping conversion functions
3199786 [email protected] // pcm_converter: Add channel map parameter
8b927ad [email protected] // dai-zephyr: Use frames, not samples, for DMA copy bytes calculation
6e183cb [email protected] // dai-zephyr: Fix to avoid using buffers with uninitialized stream params
ee66620 [email protected] // ipc4: dai-zephyr: Do not reuse process func as multi-gateway channel copy
5a000e8 [email protected] // west.yml: update zephyr to 740d7f735e2
3051607 [email protected] // kcps: fix 0 module CPC case

_

c3ecd35f66c1 [email protected] // !fixup ASoC: amd: acp: fix for unused-but-set-variable warning for amp_num
1998ade4783a [email protected] // ASoC: SOF: sof-audio.h: optimize snd_sof_pcm_stream_pipeline_list

I checked all the corresponding Pull Requests and they all passed separately.

cc: @serhiy-katsyuba-intel

@lyakh
Copy link
Collaborator

lyakh commented Jul 18, 2024

@marc-hb the last passing test of 16.07 already includes LLEXT patches as you can also see by following 3051607 and by checking that daily test's logs

@lyakh
Copy link
Collaborator

lyakh commented Jul 18, 2024

can it be the 5a000e8 Zephyr upgrade?

@fredoh9
Copy link
Contributor

fredoh9 commented Jul 18, 2024

I only reverted LLEXT PR while keeping zephyr upgrade.
FW re-loading worked fine. This is my revert branch, https://github.com/fredoh9/sof/tree/fix/revert

@lyakh
Copy link
Collaborator

lyakh commented Jul 18, 2024

I only reverted LLEXT PR while keeping zephyr upgrade. FW re-loading worked fine. This is my revert branch, https://github.com/fredoh9/sof/tree/fix/revert

@fredoh9 thanks for checking that! And that isn't surprising, since every PR, merged after LLEXT passed its testing too - on a base that didn't include LLEXT. But apparently one of those PRs conflicts with LLEXT.

@kv2019i
Copy link
Collaborator

kv2019i commented Jul 18, 2024

@lyakh @tmleman can you check this, it's seems it's a combination of LLEXT and Zephyr update. Given where it fails, this could be a new side-effect of the power_down.S changes.

@kv2019i kv2019i added P1 Blocker bugs or important features MTL Applies to Meteor Lake platform labels Jul 18, 2024
marc-hb added a commit to marc-hb/sof that referenced this issue Jul 18, 2024
Temporarily disable CONFIG_MODULES to get MTL working again.

Avoids crash thesofproject#9308

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 18, 2024

Partial quote from #9291 (comment)

We can have cases where two PRs pass the tests independetly, but when they are both in the tree, the test fails

This is a well-known CI issue and Gitlab and Github have a solution for it: ...

kv2019i pushed a commit that referenced this issue Jul 18, 2024
Temporarily disable CONFIG_MODULES to get MTL working again.

Avoids crash #9308

Signed-off-by: Marc Herbert <[email protected]>
@marc-hb marc-hb removed the urgent label Jul 18, 2024
lyakh added a commit to lyakh/sof that referenced this issue Jul 19, 2024
Test if this fixes thesofproject#9308

Signed-off-by: Guennadi Liakhovetski <[email protected]>
lyakh added a commit to lyakh/sof that referenced this issue Jul 19, 2024
This reverts commit 5a000e8.
This should fix thesofproject#9308

Signed-off-by: Guennadi Liakhovetski <[email protected]>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 19, 2024

Wild guess: would NOT resuming from IMR temporarily but reliably avoid these SRAM power-off/cache alignment issues? Assuming the crash happens at resume time.

This instead of reverting recent, unrelated commits that just happened to be in the wrong place at the wrong time:

The latter feels like a timebomb: which other, unrelated commits will just break it and crash again?

@tmleman
Copy link
Contributor

tmleman commented Jul 19, 2024

Unfortunately, I have not yet managed to get to the bottom of these problems. However, I have determined the following:

  • When the firmware (FW) exits D3 state, we encounter a LoadStoreAlignmentCause exception (9). This occurs because the FW attempts to load a value into the register from address 0xD.
  • This issue arises during the initialization of the io3_domain power domain (__device_dts_ord_94). It should be noted that this initialization only sets the state of the internal structures of the power manager and does not interact with the hardware registers.
  • Given that the IMR context save has been disabled for the MTL, the FW boot process should be consistent each time. The code that attempts to read from the incorrect address is executed only after exiting D3 state.

My suspicion is that someone is writing to memory locations they shouldn't.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 19, 2024

Given that the IMR context save has been disabled for the MTL, the FW boot process should be consistent each time.

@tmleman what do you mean by "IMR context save has been disabled for the MTL"

I see this: https://sof-ci.01.org/sofpr/PR9305/build6552/devicetest/index.html?model=MTLP_RVP_NOCODEC&testcase=check-playback-10sec

[  161.669233] kernel: snd_sof_intel_hda_common:hda_dsp_cl_boot_firmware: sof-audio-pci-intel-mtl
               0000:00:1f.3: IMR restore supported, booting from IMR directly
.
.
.
[  163.714446] kernel: sof-audio-pci-intel-mtl 0000:00:1f.3: error: DSP Firmware Oops

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 19, 2024

Wild guess: would NOT resuming from IMR temporarily but reliably avoid these SRAM power-off/cache alignment issues? Assuming the crash happens at resume time.

I just tested this and it works. In other words, adding the BIT(7) 0x80 to options snd-sof sof_debug=... completely avoids this crash.

https://github.com/thesofproject/linux/blob/topic/sof-dev/sound/soc/sof/sof-priv.h

Note IMR resume is being especially "evil" here because you don't need to reboot to drop the bit 0x80 but you need to reboot to add it. Maybe that's an error handling issue in the Linux kernel?

cc: @jxstelter

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 20, 2024

@tmleman what do you mean by "IMR context save has been disabled for the MTL"

Answering myself: 69ad96a set CONFIG_ADSP_IMR_CONTEXT_SAVE=n (It's y for LNL)

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 22, 2024

My suspicion is that someone is writing to memory locations they shouldn't.

"Someone" in audio firmware or could it be "someone" elsewhere?

@lyakh
Copy link
Collaborator

lyakh commented Jul 22, 2024

@tmleman has explained to me that he had traced the problem down to IMR overwriting by the use of the IMR heap. That explains well why using (LLEXT) loadable modules triggers these problems. Indeed, "randomly" increasing some of the values in https://github.com/zephyrproject-rtos/zephyr/blob/f2b6490dee30b38dfb0ee31902177491c35ebacb/soc/intel/intel_adsp/ace/include/adsp_memory.h#L44-L68 fixes the problem. @tmleman could you provide a fix with some "meaningful" constants? And we need a strategy for keeping those up to date, or maybe generating them automatically.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 22, 2024

Note Zephyr commit zephyrproject-rtos/zephyr@6069f946be1bd502 has de-duplicated the adsp_memory.h file across ace15, ace20 and ace30 and moved some constants to device tree.

lyakh added a commit to lyakh/zephyr that referenced this issue Jul 23, 2024
The IMR is used by the firmware to hold its own copy for hot-booting
and for an "L3 heap," used for slow large allocations like loadable
libraries. The beginning of the L3 heap is currently hard-coded and
now the firmware has grown too large to fit into the dedicated area
so that it gets overwritten by heap allocations. This is a critical
bug that needs an urgent solution, for which we increase the offset,
but a real fix must calculate the L3 heap offset automatically.

BugLink: thesofproject/sof#9308
Signed-off-by: Guennadi Liakhovetski <[email protected]>
@tmleman
Copy link
Contributor

tmleman commented Jul 23, 2024

@lyakh thank you for the fix. I had a problem with decoding all these macros and I wasn't sure if it would be enough to shift the IMR stack.

To better outline what the problem was, here is a small map of the IMR along with the addresses from which the FW was copied:

L3_MEM_BASE_ADDR             = 0xa1000000
IMR_BOOT_LDR_MANIFEST_BASE   = 0xa1042000
IMR_BOOT_LDR_TEXT_ENTRY_BASE = 0xa1048000
IMR_BOOT_LDR_LIT_BASE        = 0xa1048180
IMR_BOOT_LDR_TEXT_BASE       = 0xa10481c0
IMR_BOOT_LDR_DATA_BASE       = 0xa1049000 # The beginning of the memory copied to the hpsram.
                               0xa1049000 -> 0xa0030000
IMR_BOOT_LDR_BSS_BASE        = 0xa1110000
IMR_BOOT_LDR_STACK_BASE	     = 0xa1120000
IMR_L3_HEAP_BASE             = 0xa1121000
The last copied address      = 0xa1133000 -> 0xa011a000

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively. CI caught the problem only in builds with assertions enabled, but also in a normal build, the heap initialization was overwriting our FW, it just didn't cause crashes.

@lyakh
Copy link
Collaborator

lyakh commented Jul 23, 2024

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively.

@tmleman thanks for the break-down. Yes, the important thing is to move the L3 heap base further down, because it's that which overlaps with the firmware and where we overwrite its parts? And yes, we need to automate extracting these addresses, @lgirdwood proposes using a linker script.

@lgirdwood
Copy link
Member

The proposed fix moves IMR_BOOT_LDR_STACK_BASE and IMR_L3_HEAP_BASE to addresses 0xa1150000 and 0xa1151000, respectively.

@tmleman thanks for the break-down. Yes, the important thing is to move the L3 heap base further down, because it's that which overlaps with the firmware and where we overwrite its parts? And yes, we need to automate extracting these addresses, @lgirdwood proposes using a linker script.

We need short term fix today for Zephyr, but long term we should use Zephyr methods for memory reservations
@dcpleung will @lyakh fix zephyrproject-rtos/zephyr#76196 be ok for Zephyr today ?

@dcpleung
Copy link
Contributor

I think as a hotfix that should be fine.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 23, 2024

but long term we should use Zephyr methods for memory reservations

I just filed new issue zephyrproject-rtos/zephyr#76247 to discuss and track a longer term solution.

tmleman pushed a commit to tmleman/zephyr that referenced this issue Jul 26, 2024
The IMR is used by the firmware to hold its own copy for hot-booting
and for an "L3 heap," used for slow large allocations like loadable
libraries. The beginning of the L3 heap is currently hard-coded and
now the firmware has grown too large to fit into the dedicated area
so that it gets overwritten by heap allocations. This is a critical
bug that needs an urgent solution, for which we increase the offset,
but a real fix must calculate the L3 heap offset automatically.

BugLink: thesofproject/sof#9308
Signed-off-by: Guennadi Liakhovetski <[email protected]>
(cherry picked from commit 7620208)
nashif pushed a commit to zephyrproject-rtos/zephyr that referenced this issue Jul 27, 2024
The IMR is used by the firmware to hold its own copy for hot-booting
and for an "L3 heap," used for slow large allocations like loadable
libraries. The beginning of the L3 heap is currently hard-coded and
now the firmware has grown too large to fit into the dedicated area
so that it gets overwritten by heap allocations. This is a critical
bug that needs an urgent solution, for which we increase the offset,
but a real fix must calculate the L3 heap offset automatically.

BugLink: thesofproject/sof#9308
Signed-off-by: Guennadi Liakhovetski <[email protected]>
coreboot-org-bot pushed a commit to coreboot/zephyr-cros that referenced this issue Jul 27, 2024
The IMR is used by the firmware to hold its own copy for hot-booting
and for an "L3 heap," used for slow large allocations like loadable
libraries. The beginning of the L3 heap is currently hard-coded and
now the firmware has grown too large to fit into the dedicated area
so that it gets overwritten by heap allocations. This is a critical
bug that needs an urgent solution, for which we increase the offset,
but a real fix must calculate the L3 heap offset automatically.

(cherry picked from commit 293fa11)

Original-BugLink: thesofproject/sof#9308
Original-Signed-off-by: Guennadi Liakhovetski <[email protected]>
GitOrigin-RevId: 293fa11
Change-Id: If6dfc2283d19bd72cfa9bc627fdccefdeb6b8eb7
Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/5745034
Tested-by: ChromeOS Prod (Robot) <[email protected]>
Reviewed-by: Fabio Baltieri <[email protected]>
Commit-Queue: Fabio Baltieri <[email protected]>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 29, 2024

Tentative Zephyr upgrade in sof/west.yml with the temporary fix:

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 5, 2024

With #9338 merged, @marc-hb @tmleman can we close this?

@tmleman
Copy link
Contributor

tmleman commented Aug 5, 2024

With #9338 merged, @marc-hb @tmleman can we close this?

@marc-hb opened a separate issue on the Zephyr side zephyrproject-rtos/zephyr#76247. I think this one can be closed unless the FW grows faster than someone manages to fix it.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 5, 2024

can we close this?

Once #9347 is passing and merged.

@kv2019i
Copy link
Collaborator

kv2019i commented Aug 6, 2024

@marc-hb wrote:

Once #9347 is passing and merged.

Ack, now done, closing this one. Thanks all!

@kv2019i kv2019i closed this as completed Aug 6, 2024
thedjnK pushed a commit to thedjnK/zephyr that referenced this issue Oct 27, 2024
The IMR is used by the firmware to hold its own copy for hot-booting
and for an "L3 heap," used for slow large allocations like loadable
libraries. The beginning of the L3 heap is currently hard-coded and
now the firmware has grown too large to fit into the dedicated area
so that it gets overwritten by heap allocations. This is a critical
bug that needs an urgent solution, for which we increase the offset,
but a real fix must calculate the L3 heap offset automatically.

BugLink: thesofproject/sof#9308
Signed-off-by: Guennadi Liakhovetski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected MTL Applies to Meteor Lake platform P1 Blocker bugs or important features
Projects
None yet
Development

No branches or pull requests

7 participants