mbox series

[next,RFC,0/5] rockchip: panic() when DRAM init fails

Message ID 20241105-rk3399-dram-init-v1-0-1e29acdf0966@cherry.de
Headers show
Series rockchip: panic() when DRAM init fails | expand

Message

Quentin Schulz Nov. 5, 2024, 5:21 p.m. UTC
I am the unfortunate owner of an RK3399 Puma that is failing DRAM init
every now and then with the following cryptic error message:
"""
pctl_start: Failed to init pctl channel 0
"""

When it fails, the current logic is implemented in such a way that the
board enters an infinite while loop. This is not ideal for embedded
systems that are not easily accessible and also an issue when having
devices in a CI lab for example.

Therefore, let's simply panic if the DRAM fails to init. This was tested
only on RK3399 Puma, similar changes may be required for other SoCs to
properly propagate the errors.

While this is all but a work-around instead of fixing the DRAM init
sequence, I believe it is important for resilience of devices in the
field.

Marking this as RFC because:
1) panic()s for all Rockchip SoCs but only tested on RK3399
2) unsure about error return values (ENODEV and the likes), not sure
   what would be better than that though :/

Note that one can still hang() if they want by setting PANIC_HANG
symbol.

Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de>
---
Quentin Schulz (5):
      ram: rk3399: allow to fail DRAM init if pctl_start fails
      ram: rk3399: fail probe if DRAM init failed
      rockchip: spl/tpl: panic when DRAM init failed
      ram: rk3399: merge two consecutive ifs with same condition
      ram: rk3399: fail DRAM init when pctl channel init fails instead of hanging

 arch/arm/mach-rockchip/spl.c        |  7 +++----
 arch/arm/mach-rockchip/tpl.c        |  6 ++----
 drivers/ram/rockchip/sdram_rk3399.c | 24 +++++++++++-------------
 3 files changed, 16 insertions(+), 21 deletions(-)
---
base-commit: 56accc56b9aab87ef4809ccc588e1257969cd271
change-id: 20241105-rk3399-dram-init-207325c2d9ca

Best regards,

Comments

Kever Yang Nov. 7, 2024, 3:35 a.m. UTC | #1
Hi Quentin,

On 2024/11/6 01:21, Quentin Schulz wrote:
> I am the unfortunate owner of an RK3399 Puma that is failing DRAM init
> every now and then with the following cryptic error message:
> """
> pctl_start: Failed to init pctl channel 0
> """
>
> When it fails, the current logic is implemented in such a way that the
> board enters an infinite while loop. This is not ideal for embedded
> systems that are not easily accessible and also an issue when having
> devices in a CI lab for example.

It's fine for this patch to make the choice in panic to hang or to reset 
the system.

But I don't understand how this help you, does it may success to init 
the dram after

the reset in your case? I think in most case the board will keep reset 
because this is

more likely hardware issue and not able to became success after reset.


Thanks,

- Kever

>
> Therefore, let's simply panic if the DRAM fails to init. This was tested
> only on RK3399 Puma, similar changes may be required for other SoCs to
> properly propagate the errors.
>
> While this is all but a work-around instead of fixing the DRAM init
> sequence, I believe it is important for resilience of devices in the
> field.
>
> Marking this as RFC because:
> 1) panic()s for all Rockchip SoCs but only tested on RK3399
> 2) unsure about error return values (ENODEV and the likes), not sure
>     what would be better than that though :/
>
> Note that one can still hang() if they want by setting PANIC_HANG
> symbol.
>
> Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de>
> ---
> Quentin Schulz (5):
>        ram: rk3399: allow to fail DRAM init if pctl_start fails
>        ram: rk3399: fail probe if DRAM init failed
>        rockchip: spl/tpl: panic when DRAM init failed
>        ram: rk3399: merge two consecutive ifs with same condition
>        ram: rk3399: fail DRAM init when pctl channel init fails instead of hanging
>
>   arch/arm/mach-rockchip/spl.c        |  7 +++----
>   arch/arm/mach-rockchip/tpl.c        |  6 ++----
>   drivers/ram/rockchip/sdram_rk3399.c | 24 +++++++++++-------------
>   3 files changed, 16 insertions(+), 21 deletions(-)
> ---
> base-commit: 56accc56b9aab87ef4809ccc588e1257969cd271
> change-id: 20241105-rk3399-dram-init-207325c2d9ca
>
> Best regards,
Quentin Schulz Nov. 7, 2024, 10:55 a.m. UTC | #2
Hi Kever,

On 11/7/24 4:35 AM, Kever Yang wrote:
> Hi Quentin,
> 
> On 2024/11/6 01:21, Quentin Schulz wrote:
>> I am the unfortunate owner of an RK3399 Puma that is failing DRAM init
>> every now and then with the following cryptic error message:
>> """
>> pctl_start: Failed to init pctl channel 0
>> """
>>
>> When it fails, the current logic is implemented in such a way that the
>> board enters an infinite while loop. This is not ideal for embedded
>> systems that are not easily accessible and also an issue when having
>> devices in a CI lab for example.
> 
> It's fine for this patch to make the choice in panic to hang or to reset 
> the system.
> 
> But I don't understand how this help you, does it may success to init 
> the dram after
> 
> the reset in your case? I think in most case the board will keep reset 
> because this is
> 
> more likely hardware issue and not able to became success after reset.
> 

Exactly. After a reset (manual by pressing the reset button, or by 
panic()) the board works again. I don't have exact numbers for the 
probability for this to happen, but I had a board doing boot-into-Linux 
tests in a loop and it failed multiple times in the same day. Gut 
feeling is around ~10-20 times for ~2200 reboots.

I am NOT saying this is not related to a HW issue though, but I've been 
using that RK3399 Puma for a while, and sometimes I have this issue. The 
rest of the time it seems to "work fine" (haven't run memtest or 
extensive testing on that unit though).

I am NOT sure BUT I think applying 
https://github.com/rockchip-linux/u-boot/commit/e1652d3918b182e107addd5e6f451655f239efbc 
on upstream U-Boot helped lower the probability (it still happened 
though!). Could be totally unrelated though and just flukes.

Cheers,
Quentin
Quentin Schulz Jan. 17, 2025, 3:45 p.m. UTC | #3
Hi Kever,

On 11/5/24 6:21 PM, Quentin Schulz wrote:
> I am the unfortunate owner of an RK3399 Puma that is failing DRAM init
> every now and then with the following cryptic error message:
> """
> pctl_start: Failed to init pctl channel 0
> """
> 
> When it fails, the current logic is implemented in such a way that the
> board enters an infinite while loop. This is not ideal for embedded
> systems that are not easily accessible and also an issue when having
> devices in a CI lab for example.
> 
> Therefore, let's simply panic if the DRAM fails to init. This was tested
> only on RK3399 Puma, similar changes may be required for other SoCs to
> properly propagate the errors.
> 
> While this is all but a work-around instead of fixing the DRAM init
> sequence, I believe it is important for resilience of devices in the
> field.
> 
> Marking this as RFC because:
> 1) panic()s for all Rockchip SoCs but only tested on RK3399
> 2) unsure about error return values (ENODEV and the likes), not sure
>     what would be better than that though :/
> 
> Note that one can still hang() if they want by setting PANIC_HANG
> symbol.
> 
> Signed-off-by: Quentin Schulz <quentin.schulz@cherry.de>

What should we do with those patches?

Additionally, is there something I could do to help fix the DDR init on 
RK3399? Increasing the timeout didn't help sadly :/

Cheers,
Quentin