Lillygo T5-4.7: task_wdt: Task watchdog got triggered. #387

rmk92 · 2024-12-24T00:05:27Z

When using epdiy (master branch) with esp-idf 5.4, I'm running into this after the platform has been up for a while:

E (306796) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (306796) task_wdt:  - IDLE1 (CPU 1)
E (306796) task_wdt: Tasks currently running:
E (306796) task_wdt: CPU 0: epd_prep
E (306796) task_wdt: CPU 1: epd_prep
E (306796) task_wdt: Print CPU 1 backtrace

Backtrace: 0x4008F356:0x3FFBF9C0 0x4008DB35:0x3FFBF9E0 0x40085488:0x3FFE5D40 0x40084EF5:0x3FFE5D60 0x400850A1:0x3FFE5D80 0x40084C30:0x3FFE5DB0 0x40094D95:0x3FFE5DD0
0x4008f356: esp_crosscore_isr at esp-idf/components/esp_system/crosscore_int.c:72
0x4008db35: _xt_lowint1 at esp-idf/components/xtensa/xtensa_vectors.S:1240
0x40085488: pulse_ckv_ticks at epdiy/src/output_i2s/rmt_pulse.c:72 (discriminator 1)
0x40084ef5: i2s_skip_row at epdiy/src/output_i2s/render_i2s.c:83
0x400850a1: i2s_output_frame at epdiy/src/output_i2s/render_i2s.c:242
0x40084c30: render_thread at epdiy/src/render.c:210
0x40094d95: vPortTaskWrapper at esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

The display gets updated about once a minute, and the above seems to trigger in about 20 minutes.

The problem appears to be that rmt_tx_busy is false on entry to pulse_ckv_ticks().

Building with xtensa-esp-elf-cc (crosstool-NG esp-13.2.0_20230928) 13.2.0

The text was updated successfully, but these errors were encountered:

martinberlin · 2024-12-24T15:53:17Z

Try putting a:

  vTaskDelay(1);

inside that while loop just to see if it makes any difference.

rmk92 · 2024-12-24T18:56:10Z

Adding vTaskDelay(1) makes the thing grind so much that clearing the display during initialisation causes the watchdog to trigger. I tried vTaskDelay(0) instead, which avoided the watchdog during clearing, but otherwise made little difference to the reported problem.

I subsequently added the following to aid additional debugging:

+    int i = 1000;
     while (!rmt_tx_done) {
+        if (i && --i == 0) {
+           ESP_LOGE("rmt", "timeout: rmt_tx_done=%x conf1=%08x int_st=%08x int_e+n=%08x",
+                    rmt_tx_done, RMT.conf_ch[row_rmt_config.channel].conf1.val,
+                    RMT.int_st.val, RMT.int_ena.val);
+        }
+        vTaskDelay(0);
     };

and with that, just before the watchdog, I get:

E (422098) rmt: timeout: rmt_tx_done=0 conf1=000a0f00 int_st=00000000 int_en=00000008

The tx_start bit is clear, interrupt is enabled, and no interrupt in the status bit. Reading the ESP32 documentation, it isn't clear under what circumstances the tx_start bit clears, but I'm guessing it clears when the RMT returns to idle state. If that is a correct assumption, then it looks like there could be a problem with interrupt delivery.

With a bit of further investigation, I noticed that esp_intr_alloc() does not have the ESP_INTR_FLAG_IRAM flag passed to it, yet the interrupt handler is marked with IRAM_ATTR. I thought I'll add that flag, and so far it's been running for 32 minutes without watchdogging, which is the longest so far.

rmk92 · 2024-12-24T19:03:37Z

... and unfortunately it still failed at 37 minutes.

martinberlin · 2024-12-25T00:17:51Z

Add also in the debugging a int showing how much RAM you’ve left just to discard the program has some leak and you are at the limit of available RAM

rmk92 · 2024-12-26T20:08:29Z

Printing the esp_get_free_heap_size() value reports just under 3MiB.

rmk92 · 2024-12-26T23:22:03Z

I think I've found the issue:

static void IRAM_ATTR rmt_interrupt_handler(void* arg) {
    rmt_tx_done = true;
    RMT.int_clr.val = RMT.int_st.val;
}

If we set rmt_tx_done true before clearing the status, and this occurs on CPU 0, surely we are racing with CPU 1 executing pulse_ckv_ticks(), potentially triggering the next RMT pulse before clearing the interrupt has taken effect - especially as the above gets assembled to:

400854a1:       ed5381          l32r    a8, 400809f0 <_iram_text_start+0x5ec> (3ff56000 <RMT>)
400854a4:       0020c0          memw
400854a7:       292892          l32i    a9, a8, 164
400854aa:       0020c0          memw
400854ad:       2b6892          s32i    a9, a8, 172
400854b0:       f01d            retw.n

Note that the write to int_clr has no memw after it. If the xtensa CPUs use weak ordering (which I guess they do as they have the memw instruction to ensure that preceeding writes don't pass following writes) and this is an interrupt handler, it means that it's indeterminate when that write is going to take effect - and I believe that in the failing case, we get the following order:
CPU 1 waits for rmt_tx_done = true
CPU 0 sets rmt_tx_done true
CPU 0 issues the write to int_clr, but because of the lack of barrier, this does not make it to the hardware
CPU 1 observes rmt_tx_done true, and programs the next RMT pulse, sets rmt_tx_done false, and then sets tx_start=1
CPU 0 eventually retires the write to int_clr, coincidentally clearing the interrupt from this RMT pulse
CPU 1 enters pulse_ckv_ticks() again, and waits for the rmt_tx_done to be set true
As the interrupt has been cleared by the late-retired store, rmt_interrupt_handler() is not called, and thus rmt_tx_done never gets set true.

With the order in rmt_interrupt_handler() reversed (writing int_clr then rmt_tx_done) but with some additional debug code, the platform at this exact moment has been running without this issue for more than 1h40.

The additional debugging that adds an additional memw after the store to rmt_tx_done in the interrupt handler (incrementing a count of the number of RMT interrupts received), and without that it could be that CPU 1 still watchdogs because of the write to rmt_tx_done not being observable by CPU 1 quickly enough. I'll update when I have a chance to do further testing.

Clearly, "volatile" doesn't have the desired effect here - it doesn't guarantee that the write to the volatile variable (whether that's RMT.int_clr or rmt_tx_done, which-ever is last in rmt_interrupt_handler()) will be observable by the other CPU or hardware in a timely manner.

I'm now testing this fix (annoyingly, GitHub doesn't support attaching patches/diffs to comments, so this is probably white space damaged!)

diff --git a/src/output_i2s/rmt_pulse.c b/src/output_i2s/rmt_pulse.c
index 383aab6d87d8..b89c8826c210 100644
--- a/src/output_i2s/rmt_pulse.c
+++ b/src/output_i2s/rmt_pulse.c
@@ -21,8 +21,13 @@ volatile bool rmt_tx_done = true;
  * Remote peripheral interrupt. Used to signal when transmission is done.
  */
 static void IRAM_ATTR rmt_interrupt_handler(void* arg) {
-    rmt_tx_done = true;
+    // Clear the interrupt *before* marking tx_done to ensure pulse_ckv_ticks()
+    // waits for this write to complete.
     RMT.int_clr.val = RMT.int_st.val;
+    rmt_tx_done = true;
+    // read back rmt_tx_done to ensure that the above write is visible to
+    // pulse_ckv_ticks() in a timely manner.
+    rmt_tx_done;
 }

 #if ESP_IDF_VERSION >= ESP_IDF_VERSION_VAL(5, 0, 0)
@@ -56,7 +61,7 @@ void rmt_pulse_init(gpio_num_t pin) {
 #error "This driver is not compatible with IDF version 4.1.\nPlease use 4.0 or >= 4.2!"
 #endif
     esp_intr_alloc(
-        ETS_RMT_INTR_SOURCE, ESP_INTR_FLAG_LEVEL3, rmt_interrupt_handler, 0, &gRMT_intr_handle
+        ETS_RMT_INTR_SOURCE, ESP_INTR_FLAG_IRAM | ESP_INTR_FLAG_LEVEL3, rmt_interrupt_handler, 0, &gRMT_intr_handle
     );

     rmt_config(&row_rmt_config);

With this, rmt_interrupt_handler() now assembles to:

4008545c <rmt_interrupt_handler>:
4008545c:       004136          entry   a1, 32
4008545f:       ed6281          l32r    a8, 400809e8 <_iram_text_start+0x5e4> (3ff56000 <RMT>)
40085462:       0020c0          memw
40085465:       292892          l32i    a9, a8, 164  - read RMT.int_st.val
40085468:       0020c0          memw
4008546b:       2b6892          s32i    a9, a8, 172 - write RMT.int_clr.val
4008546e:       ed5f81          l32r    a8, 400809ec <_iram_text_start+0x5e8> (3ffbdff5 <rmt_tx_done>)
40085471:       190c            movi.n  a9, 1
40085473:       0020c0          memw             - this ensures that the write to RMT.int_clr.val is visible before rmt_tx_done is written
40085476:       004892          s8i     a9, a8, 0 - set rmt_tx_done = true
40085479:       0020c0          memw              - this ensures rmt_tx_done becomes visible in a timely manner
4008547c:       000882          l8ui    a8, a8, 0 - dummy read (part of the above memw)
4008547f:       f01d            retw.n
40085481:       000000          ill

martinberlin · 2024-12-27T08:58:27Z

It looks like it fixed the issue. When you are completely sure it does not trigger back in, can you make a Pull Request to suggest this change?
Then others will have it fixed too. Thanks for your research!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lillygo T5-4.7: task_wdt: Task watchdog got triggered. #387

Lillygo T5-4.7: task_wdt: Task watchdog got triggered. #387

rmk92 commented Dec 24, 2024

martinberlin commented Dec 24, 2024

rmk92 commented Dec 24, 2024 •

edited

Loading

rmk92 commented Dec 24, 2024

martinberlin commented Dec 25, 2024

rmk92 commented Dec 26, 2024

rmk92 commented Dec 26, 2024

martinberlin commented Dec 27, 2024

Lillygo T5-4.7: task_wdt: Task watchdog got triggered. #387

Lillygo T5-4.7: task_wdt: Task watchdog got triggered. #387

Comments

rmk92 commented Dec 24, 2024

martinberlin commented Dec 24, 2024

rmk92 commented Dec 24, 2024 • edited Loading

rmk92 commented Dec 24, 2024

martinberlin commented Dec 25, 2024

rmk92 commented Dec 26, 2024

rmk92 commented Dec 26, 2024

martinberlin commented Dec 27, 2024

rmk92 commented Dec 24, 2024 •

edited

Loading