BeagleBone DMTimer2 unexpected stop after one or more days #203

Open
mgkiller7 opened this issue Apr 26, 2019 · 6 comments

mgkiller7 commented Apr 26, 2019

We are seeing DMTimer2 stop unexpectedly on an AM335x after running for one or more days; the gp_timer count in /proc/interrupts stops increasing.
We have tried the BeagleBoard GitHub kernel versions 4.4.113 and 4.4.155 with our own rootfs, on both a BeagleBone Black and our custom board. The behavior is the same, even though I made no changes to the kernel source.

This timer is initialized as the clockevent device in omap2_gp_clockevent_init(clkev_nr, clkev_src, clkev_prop); //arch/arm/mach-omap2/timer.c

Below is the related call chain:

omap3_gptimer_timer_init(void) =>

__omap_sync32k_timer_init(2, "timer_sys_ck", NULL,
1, "timer_sys_ck", "ti,timer-alwon", true); =>

omap2_gp_clockevent_init(clkev_nr, clkev_src, clkev_prop);

After DMTimer2 stops unexpectedly, the following happens:

1. The gp_timer count in /proc/interrupts never increases.

2. The time reported by the date command may jump back by some seconds or minutes.

3. User applications no longer print debug logs to the console; it seems the kernel scheduler is not working correctly.

   The console shell still works fine, and network ping also works.

4. The CPU load of all threads shown by top is 0%.
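
To check symptom 1 without eyeballing the numbers, two saved snapshots of /proc/interrupts can be compared. The following is only a minimal sketch: the helper names (`irq_count`, `timer_stalled`) and the script name are hypothetical, and it assumes the row of interest ends with the name `gp_timer`, as in the report. It deliberately compares snapshots taken by hand from the shell instead of sleeping between two reads, because a sleep would likely depend on the same stalled clockevent and never return.

```python
import sys


def irq_count(interrupts_text, name="gp_timer"):
    """Sum the per-CPU counters of every row whose last field equals `name`.

    Returns 0 if no such row exists.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return 0
    ncpu = len(lines[0].split())  # header row holds one "CPUn" label per CPU
    total = 0
    for line in lines[1:]:
        fields = line.split()
        # A full row is "<irq>:", ncpu counters, then controller/type/name fields.
        if len(fields) <= 1 + ncpu or fields[-1] != name:
            continue
        counters = fields[1:1 + ncpu]
        if all(c.isdigit() for c in counters):
            total += sum(int(c) for c in counters)
    return total


def timer_stalled(before_text, after_text, name="gp_timer"):
    """True if the named interrupt did not advance between two snapshots."""
    return irq_count(after_text, name) <= irq_count(before_text, name)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
            stalled = timer_stalled(f_before.read(), f_after.read())
        print("gp_timer stalled" if stalled else "gp_timer still counting")
    else:
        print("usage: check_gp_timer.py BEFORE_SNAPSHOT AFTER_SNAPSHOT")
```

Usage, assuming the script is saved as check_gp_timer.py: run `cat /proc/interrupts > /tmp/a`, wait a few seconds, run `cat /proc/interrupts > /tmp/b`, then `python3 check_gp_timer.py /tmp/a /tmp/b`. Note that a missing gp_timer row is also reported as stalled, since both counts are then 0.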

By the way, I checked after the problem occurred: the ST bit of DMTimer2's TCLR is still 1 (that is, the timer is started).

If I stop DMTimer2 manually from the console shell with the command: devmem 0x48040038 32 0x0

then I can reproduce symptoms 1, 2 and 3 above, but the shell hangs when I run top.
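
For anyone repeating this register check, the sketch below builds the devmem read commands and decodes a TCLR value. It is a hedged illustration, not taken from the original report: the base address 0x48040000 is inferred from the 0x48040038 address above, and the register offsets and bit positions (ST = bit 0, AR = bit 1, CE = bit 6) are my reading of the AM335x TRM and should be verified there. The helper names are hypothetical.

```python
# Assumption: DMTimer2 base, inferred from TCLR being at 0x48040038.
DMTIMER2_BASE = 0x48040000

# Assumption: offsets per the AM335x TRM DMTimer register map; verify before use.
REG_OFFSETS = {
    "IRQSTATUS": 0x28,      # pending interrupt flags
    "IRQENABLE_SET": 0x2C,  # which interrupt sources are enabled
    "TCLR": 0x38,           # control register (start/stop, auto-reload, ...)
    "TCRR": 0x3C,           # free-running counter value
}

# Assumption: TCLR bit positions (ST = start, AR = auto-reload, CE = compare enable).
TCLR_BITS = {"ST": 0, "AR": 1, "CE": 6}


def reg_addr(name):
    """Physical address of a DMTimer2 register."""
    return DMTIMER2_BASE + REG_OFFSETS[name]


def decode_tclr(value):
    """Map each known TCLR flag name to True/False for a raw register value."""
    return {bit: bool((value >> pos) & 1) for bit, pos in TCLR_BITS.items()}


def devmem_read_cmd(name):
    """Shell command that reads the register as a 32-bit word with devmem."""
    return "devmem 0x%08X 32" % reg_addr(name)


if __name__ == "__main__":
    for reg in REG_OFFSETS:
        print(devmem_read_cmd(reg))
    print(decode_tclr(0x1))
```

Reading TCRR twice a moment apart would show whether the counter itself is still advancing while ST is 1, which helps separate "timer not counting" from "timer counting but no interrupt delivered".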

So I think DMTimer2 on my AM335x stops working correctly after running for one or more days.

We also tried commenting out __omap_dm_timer_override_errata() in omap2_gp_clockevent_init(), which forces OMAP_TIMER_ERRATA_I103_I767 to stay enabled, but then the kernel does not boot at all.

We also posted this problem in the TI community at https://e2e.ti.com/support/processors/f/791/t/796508

Contributor

pdp7 commented Jun 9, 2020

@mgkiller7 Did you find a resolution?

I see the last post is:
https://e2e.ti.com/support/processors/f/791/p/796508/2978764#2978764

Recently I found out that this problem is related to the PMIC not being initialized by U-Boot on my custom board. At the U-Boot stage, the CORE and MPU supplies from the PMIC are therefore left at the 1.1 V default. When I configure the PMIC to supply 1.120 V to CORE and 1.270 V to MPU in am33xx_spl_board_init of U-Boot (board.c), the problem disappears.

The DMTimer2 unexpected-stop problem can also be reproduced on a BeagleBone Black when I remove the PMIC voltage change from U-Boot.

@pdp7 pdp7 self-assigned this Jun 9, 2020
Contributor

pdp7 commented Jun 10, 2020

@mgkiller7 Please re-open if this is still an issue.

You may also be interested in the Debian images and kernel builds that we are currently testing for the next release:
https://elinux.org/Beagleboard:Latest-images-testing

@pdp7 pdp7 closed this as completed Jun 10, 2020
RobertCNelson pushed a commit that referenced this issue Mar 31, 2021
…before setting skb ownership

commit e940e08 upstream.

There are two ref count variables controlling the free()ing of a socket:
- struct sock::sk_refcnt - which is changed by sock_hold()/sock_put()
- struct sock::sk_wmem_alloc - which accounts the memory allocated by
  the skbs in the send path.

In case there are still TX skbs on the fly and the socket() is closed,
the struct sock::sk_refcnt reaches 0. In the TX-path the CAN stack
clones an "echo" skb, calls sock_hold() on the original socket and
references it. This produces the following back trace:

| WARNING: CPU: 0 PID: 280 at lib/refcount.c:25 refcount_warn_saturate+0x114/0x134
| refcount_t: addition on 0; use-after-free.
| Modules linked in: coda_vpu(E) v4l2_jpeg(E) videobuf2_vmalloc(E) imx_vdoa(E)
| CPU: 0 PID: 280 Comm: test_can.sh Tainted: G            E     5.11.0-04577-gf8ff6603c617 #203
| Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
| Backtrace:
| [<80bafea4>] (dump_backtrace) from [<80bb0280>] (show_stack+0x20/0x24) r7:00000000 r6:600f0113 r5:00000000 r4:81441220
| [<80bb0260>] (show_stack) from [<80bb593c>] (dump_stack+0xa0/0xc8)
| [<80bb589c>] (dump_stack) from [<8012b268>] (__warn+0xd4/0x114) r9:00000019 r8:80f4a8c2 r7:83e4150c r6:00000000 r5:00000009 r4:80528f90
| [<8012b194>] (__warn) from [<80bb09c4>] (warn_slowpath_fmt+0x88/0xc8) r9:83f26400 r8:80f4a8d1 r7:00000009 r6:80528f90 r5:00000019 r4:80f4a8c2
| [<80bb0940>] (warn_slowpath_fmt) from [<80528f90>] (refcount_warn_saturate+0x114/0x134) r8:00000000 r7:00000000 r6:82b44000 r5:834e5600 r4:83f4d540
| [<80528e7c>] (refcount_warn_saturate) from [<8079a4c8>] (__refcount_add.constprop.0+0x4c/0x50)
| [<8079a47c>] (__refcount_add.constprop.0) from [<8079a57c>] (can_put_echo_skb+0xb0/0x13c)
| [<8079a4cc>] (can_put_echo_skb) from [<8079ba98>] (flexcan_start_xmit+0x1c4/0x230) r9:00000010 r8:83f48610 r7:0fdc0000 r6:0c080000 r5:82b44000 r4:834e5600
| [<8079b8d4>] (flexcan_start_xmit) from [<80969078>] (netdev_start_xmit+0x44/0x70) r9:814c0ba0 r8:80c8790c r7:00000000 r6:834e5600 r5:82b44000 r4:82ab1f00
| [<80969034>] (netdev_start_xmit) from [<809725a4>] (dev_hard_start_xmit+0x19c/0x318) r9:814c0ba0 r8:00000000 r7:82ab1f00 r6:82b44000 r5:00000000 r4:834e5600
| [<80972408>] (dev_hard_start_xmit) from [<809c6584>] (sch_direct_xmit+0xcc/0x264) r10:834e5600 r9:00000000 r8:00000000 r7:82b44000 r6:82ab1f00 r5:834e5600 r4:83f27400
| [<809c64b8>] (sch_direct_xmit) from [<809c6c0c>] (__qdisc_run+0x4f0/0x534)

To fix this problem, only set skb ownership to sockets which have still
a ref count > 0.

Fixes: 0ae89be ("can: add destructor for self generated skbs")
Cc: Oliver Hartkopp <[email protected]>
Cc: Andre Naujoks <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Oleksij Rempel <[email protected]>
Reviewed-by: Oliver Hartkopp <[email protected]>
Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
@wiltshiretom

Would love to know if you found a solution to this problem; we are seeing a very similar problem.

Contributor

pdp7 commented Apr 2, 2021

@wiltshiretom Please run

sudo /opt/scripts/tools/version.sh

which will show the U-Boot and Linux versions and which device tree overlays are present.

@pdp7 pdp7 reopened this Apr 2, 2021
@wiltshiretom

Apologies, my post lacked some detail! I'm seeing an identical issue but we are using a custom board (not beagle) also using the AM335x part. I found this thread and was wondering if you had identified a workaround. Sorry to resurrect something that is already closed on your platform but I am looking for inspiration. Also identical symptoms here: https://e2e.ti.com/support/processors/f/processors-forum/237808/am335x-system-time-looping

RobertCNelson pushed a commit that referenced this issue May 14, 2021
…before setting skb ownership

Author

mgkiller7 commented Jul 19, 2021 via email

RobertCNelson pushed a commit that referenced this issue Feb 4, 2023
The set channel operation "ethtool -L tx <n>" broke with
the recent suspend/resume changes.

Revert back to original driver behaviour of not freeing
the TX/RX IRQs at am65_cpsw_nuss_common_stop(). We will
now free them only on .suspend() as we need to release
the DMA channels (as DMA loses context) and re-acquiring
them on .resume() may not necessarily give us the same
IRQs.

Introduce am65_cpsw_nuss_remove_rx_chns() which is similar
to am65_cpsw_nuss_remove_tx_chns() and invoke them both in
.suspend().

At .resume() call am65_cpsw_nuss_init_rx/tx_chns() to
acquire the DMA channels.

As IRQs need to be requested after knowing the IRQ
numbers, move am65_cpsw_nuss_ndev_add_tx_napi() call to
am65_cpsw_nuss_init_tx_chns().

Also fixes the below warning during suspend/resume on multi
CPU system.

[   67.347684] ------------[ cut here ]------------
[   67.347700] Unbalanced enable for IRQ 119
[   67.347726] WARNING: CPU: 0 PID: 1080 at kernel/irq/manage.c:781 __enable_irq+0x4c/0x80
[   67.347754] Modules linked in: wlcore_sdio wl18xx wlcore mac80211 libarc4 cfg80211 rfkill crct10dif_ce sch_fq_codel ipv6
[   67.347803] CPU: 0 PID: 1080 Comm: rtcwake Not tainted 6.1.0-rc4-00023-gc826e5480732-dirty #203
[   67.347812] Hardware name: Texas Instruments AM625 (DT)
[   67.347818] pstate: 400000c5 (nZcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   67.347829] pc : __enable_irq+0x4c/0x80
[   67.347838] lr : __enable_irq+0x4c/0x80
[   67.347846] sp : ffff80000999ba00
[   67.347850] x29: ffff80000999ba00 x28: ffff0000011c1c80 x27: 0000000000000000
[   67.347863] x26: 00000000000001f4 x25: ffff000001058358 x24: ffff000001059080
[   67.347876] x23: ffff000001058080 x22: ffff000001060000 x21: 0000000000000077
[   67.347888] x20: ffff0000011c1c80 x19: ffff000001429600 x18: 0000000000000001
[   67.347900] x17: 0000000000000080 x16: fffffc000176e008 x15: ffff0000011c21b0
[   67.347913] x14: 0000000000000000 x13: 3931312051524920 x12: 726f6620656c6261
[   67.347925] x11: 656820747563205b x10: 000000000000000a x9 : ffff80000999ba00
[   67.347938] x8 : ffff800009121068 x7 : ffff80000999b810 x6 : 00000000fffff17f
[   67.347950] x5 : ffff00007fb99b18 x4 : 0000000000000000 x3 : 0000000000000027
[   67.347962] x2 : ffff00007fb99b20 x1 : 50dd48f7f19deb00 x0 : 0000000000000000
[   67.347975] Call trace:
[   67.347980]  __enable_irq+0x4c/0x80
[   67.347989]  enable_irq+0x4c/0xa0
[   67.347999]  am65_cpsw_nuss_ndo_slave_open+0x4b0/0x568
[   67.348015]  am65_cpsw_nuss_resume+0x68/0x160
[   67.348025]  dpm_run_callback.isra.0+0x28/0x88
[   67.348040]  device_resume+0x78/0x160
[   67.348050]  dpm_resume+0xc0/0x1f8
[   67.348057]  dpm_resume_end+0x18/0x30
[   67.348063]  suspend_devices_and_enter+0x1cc/0x4e0
[   67.348075]  pm_suspend+0x1f8/0x268
[   67.348084]  state_store+0x8c/0x118
[   67.348092]  kobj_attr_store+0x18/0x30
[   67.348104]  sysfs_kf_write+0x44/0x58
[   67.348117]  kernfs_fop_write_iter+0x118/0x1a8
[   67.348127]  vfs_write+0x31c/0x418
[   67.348140]  ksys_write+0x6c/0xf8
[   67.348150]  __arm64_sys_write+0x1c/0x28
[   67.348160]  invoke_syscall+0x44/0x108
[   67.348172]  el0_svc_common.constprop.0+0x44/0xf0
[   67.348182]  do_el0_svc+0x2c/0xc8
[   67.348191]  el0_svc+0x2c/0x88
[   67.348201]  el0t_64_sync_handler+0xb8/0xc0
[   67.348209]  el0t_64_sync+0x18c/0x190
[   67.348218] ---[ end trace 0000000000000000 ]---

Fixes: cbdde66 ("net: ethernet: ti: am65-cpsw: Add suspend/resume support")
Signed-off-by: Roger Quadros <[email protected]>
Signed-off-by: Vignesh Raghavendra <[email protected]>