diff mbox series

[RFC] watchdog: base rate-limiting on get_ticks() rather than get_timer()

Message ID 20200605111657.28773-1-rasmus.villemoes@prevas.dk
State RFC
Delegated to: Tom Rini
Headers show
Series [RFC] watchdog: base rate-limiting on get_ticks() rather than get_timer() | expand

Commit Message

Rasmus Villemoes June 5, 2020, 11:16 a.m. UTC
On powerpc, get_timer() is implemented using a volatile variable that
gets incremented from the decrementer interrupt handler. Hence, when
interrupts are disabled, time as measured by get_timer() ceases to
pass.

Interrupts are necessarily disabled during bootm (see the comment in
bootm_disable_interrupts() for why). But after interrupts are
disabled, there's still lots of work to do - e.g. decompressing the
kernel image to the right load address. Unfortunately, the
rate-limiting logic in wdt_uclass.c's watchdog_reset function means
that WATCHDOG_RESET() becomes a no-op, since get_timer() never
progresses past the next_reset.

This, combined with an external gpio-triggered watchdog that must be
petted at least every 800ms, means our board gets reset before booting
into linux.

Now, at least on powerpc, get_ticks() continues to return sensible
results whether or not interrupts are enabled. So this fixes the above
problem for our board - but I don't know if get_ticks() can be assumed
to always work.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
---
This is what I had in mind. I also considered making it a config knob
(possibly just auto-selected based on $ARCH) whether to use
get_timer() or get_ticks(), but that becomes quite ugly.

 drivers/watchdog/wdt-uclass.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

Comments

Stefan Roese June 5, 2020, 11:48 a.m. UTC | #1
On 05.06.20 13:16, Rasmus Villemoes wrote:
> On powerpc, get_timer() is implemented using a volatile variable that
> gets incremented from the decrementer interrupt handler. Hence, when
> interrupts are disabled, time as measured by get_timer() ceases to
> pass.
> 
> Interrupts are necessarily disabled during bootm (see the comment in
> bootm_disable_interrupts() for why). But after interrupts are
> disabled, there's still lots of work to do - e.g. decompressing the
> kernel image to the right load address. Unfortunately, the
> rate-limiting logic in wdt_uclass.c's watchdog_reset function means
> that WATCHDOG_RESET() becomes a no-op, since get_timer() never
> progresses past the next_reset.
> 
> This, combined with an external gpio-triggered watchdog that must be
> petted at least every 800ms, means our board gets reset before booting
> into linux.
> 
> Now, at least on powerpc, get_ticks() continues to return sensible
> results whether or not interrupts are enabled. So this fixes the above
> problem for our board - but I don't know if get_ticks() can be assumed
> to always work.
> 
> Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
> ---
> This is what I had in mind. I also considered making it a config knob
> (possibly just auto-selected based on $ARCH) whether to use
> get_timer() or get_ticks(), but that becomes quite ugly.

I hesitate a bit, moving with this generic code from get_timer()
to get_ticks() for all boards. Did you test this patch on other
platforms, like some ARM boards?

Please note that I don't reject it - just asking.

Thanks,
Stefan

>   drivers/watchdog/wdt-uclass.c | 22 ++++++++++++----------
>   1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/watchdog/wdt-uclass.c b/drivers/watchdog/wdt-uclass.c
> index 4cdb7bd64c..7be4e9b5bc 100644
> --- a/drivers/watchdog/wdt-uclass.c
> +++ b/drivers/watchdog/wdt-uclass.c
> @@ -16,14 +16,15 @@ DECLARE_GLOBAL_DATA_PTR;
>   
>   #define WATCHDOG_TIMEOUT_SECS	(CONFIG_WATCHDOG_TIMEOUT_MSECS / 1000)
>   
> -/*
> - * Reset every 1000ms, or however often is required as indicated by a
> - * hw_margin_ms property.
> - */
> -static ulong reset_period = 1000;
> +static u64 ratelimit_ticks;
>   
>   int initr_watchdog(void)
>   {
> +	/*
> +	 * Reset every 1000ms, or however often is required as indicated by a
> +	 * hw_margin_ms property.
> +	 */
> +	u32 reset_period = 1000;
>   	u32 timeout = WATCHDOG_TIMEOUT_SECS;
>   
>   	/*
> @@ -48,6 +49,7 @@ int initr_watchdog(void)
>   						    4 * reset_period) / 4;
>   	}
>   
> +	ratelimit_ticks = usec_to_tick(reset_period * 1000);
>   	wdt_start(gd->watchdog_dev, timeout * 1000, 0);
>   	gd->flags |= GD_FLG_WDT_READY;
>   	printf("WDT:   Started with%s servicing (%ds timeout)\n",
> @@ -117,17 +119,17 @@ int wdt_expire_now(struct udevice *dev, ulong flags)
>    */
>   void watchdog_reset(void)
>   {
> -	static ulong next_reset;
> -	ulong now;
> +	static u64 last_reset;
> +	u64 now;
>   
>   	/* Exit if GD is not ready or watchdog is not initialized yet */
>   	if (!gd || !(gd->flags & GD_FLG_WDT_READY))
>   		return;
>   
>   	/* Do not reset the watchdog too often */
> -	now = get_timer(0);
> -	if (time_after(now, next_reset)) {
> -		next_reset = now + reset_period;
> +	now = get_ticks();
> +	if (now - last_reset >= ratelimit_ticks) {
> +		last_reset = now;
>   		wdt_reset(gd->watchdog_dev);
>   	}
>   }
> 


Viele Grüße,
Stefan
Rasmus Villemoes June 5, 2020, 12:11 p.m. UTC | #2
On 05/06/2020 13.48, Stefan Roese wrote:
> On 05.06.20 13:16, Rasmus Villemoes wrote:
>> This is what I had in mind. I also considered making it a config knob
>> (possibly just auto-selected based on $ARCH) whether to use
>> get_timer() or get_ticks(), but that becomes quite ugly.
> 
> I hesitate a bit, moving with this generic code from get_timer()
> to get_ticks() for all boards. Did you test this patch on other
> platforms, like some ARM boards?
> 
> Please note that I don't reject it - just asking.

Yeah, I'm not really too happy about it myself, exactly because it
affects all arches/platforms. And no, I don't have other hardware handy
unfortunately. So it's very much an RFC where I hope someone with
knowledge of the various arches can say whether one can expect
get_ticks() to be at least as widely available as get_timer().

Rasmus
Stefan Roese June 5, 2020, 12:13 p.m. UTC | #3
On 05.06.20 14:11, Rasmus Villemoes wrote:
> On 05/06/2020 13.48, Stefan Roese wrote:
>> On 05.06.20 13:16, Rasmus Villemoes wrote:
>>> This is what I had in mind. I also considered making it a config knob
>>> (possibly just auto-selected based on $ARCH) whether to use
>>> get_timer() or get_ticks(), but that becomes quite ugly.
>>
>> I hesitate a bit, moving with this generic code from get_timer()
>> to get_ticks() for all boards. Did you test this patch on other
>> platforms, like some ARM boards?
>>
>> Please note that I don't reject it - just asking.
> 
> Yeah, I'm not really too happy about it myself, exactly because it
> affects all arches/platforms. And no, I don't have other hardware handy
> unfortunately. So it's very much an RFC where I hope someone with
> knowledge of the various arches can say whether one can expect
> get_ticks() to be at least as widely available as get_timer().

I'll try to test it on a non-powerpc platform soon'ish.

Thanks,
Stefan
Stefan Roese June 5, 2020, 1:37 p.m. UTC | #4
On 05.06.20 14:13, Stefan Roese wrote:
> On 05.06.20 14:11, Rasmus Villemoes wrote:
>> On 05/06/2020 13.48, Stefan Roese wrote:
>>> On 05.06.20 13:16, Rasmus Villemoes wrote:
>>>> This is what I had in mind. I also considered making it a config knob
>>>> (possibly just auto-selected based on $ARCH) whether to use
>>>> get_timer() or get_ticks(), but that becomes quite ugly.
>>>
>>> I hesitate a bit, moving with this generic code from get_timer()
>>> to get_ticks() for all boards. Did you test this patch on other
>>> platforms, like some ARM boards?
>>>
>>> Please note that I don't reject it - just asking.
>>
>> Yeah, I'm not really too happy about it myself, exactly because it
>> affects all arches/platforms. And no, I don't have other hardware handy
>> unfortunately. So it's very much an RFC where I hope someone with
>> knowledge of the various arches can say whether one can expect
>> get_ticks() to be at least as widely available as get_timer().
> 
> I'll try to test it on a non-powerpc platform soon'ish.

I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688)
and it works there without any issues AFAICT.

Two more remarks:

Could you please run a build test with this patch applied for all
boards (Travis, Azure...)?

And do you see an issue, that the timer-wrap fixed by Chris with this
patch [1] will not re-occur with your "ticks" approach?

Thanks,
Stefan

[1]
Author: Chris Packham <judge.packham@gmail.com>  2020-02-24 01:20:33
Committer: Stefan Roese <sr@denx.de>  2020-03-16 11:25:12
Parent: db41d65a97f335167e1fbc0400a83333b5157703 (common: Move hang() to 
the same header as panic())
Child:  b4d9452c442769e6dc25649ac02db2d5ed5ea0c8 (watchdog: move 
initr_watchdog() to wdt-uclass.c)
Branches: master, watchdog-ratelimit-2020-06-05-rfc and many more (84)
Follows: v2020.04-rc3
Precedes: v2020.04-rc4

     watchdog: Handle timer wrap around

     On some platforms/architectures the value from get_timer() can wrap.
     This is particularly problematic when long-running code needs to 
measure
     a time difference as is the case with watchdog_reset() which tries to
     avoid tickling the watchdog too frequently.

     Use time_after() from time.h instead of a plain > comparison to avoid
     any issues with the time wrapping on a system that has been sitting in
     u-boot for a long time.

     Signed-off-by: Chris Packham <judge.packham@gmail.com>
     Reviewed-by: Stefan Roese <sr@denx.de>
Rasmus Villemoes June 5, 2020, 2:18 p.m. UTC | #5
On 05/06/2020 15.37, Stefan Roese wrote:
> On 05.06.20 14:13, Stefan Roese wrote:
>> On 05.06.20 14:11, Rasmus Villemoes wrote:
>>> On 05/06/2020 13.48, Stefan Roese wrote:
>>>> On 05.06.20 13:16, Rasmus Villemoes wrote:
>>>>> This is what I had in mind. I also considered making it a config knob
>>>>> (possibly just auto-selected based on $ARCH) whether to use
>>>>> get_timer() or get_ticks(), but that becomes quite ugly.
>>>>
>>>> I hesitate a bit, moving with this generic code from get_timer()
>>>> to get_ticks() for all boards. Did you test this patch on other
>>>> platforms, like some ARM boards?
>>>>
>>>> Please note that I don't reject it - just asking.
>>>
>>> Yeah, I'm not really too happy about it myself, exactly because it
>>> affects all arches/platforms. And no, I don't have other hardware handy
>>> unfortunately. So it's very much an RFC where I hope someone with
>>> knowledge of the various arches can say whether one can expect
>>> get_ticks() to be at least as widely available as get_timer().
>>
>> I'll try to test it on a non-powerpc platform soon'ish.
> 
> I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688)
> and it works there without any issues AFAICT.

Thanks.

> Two more remarks:
> 
> Could you please run a build test with this patch applied for all
> boards (Travis, Azure...)?

I may need a few more pointers than that. What am I supposed to do exactly?

> And do you see an issue, that the timer-wrap fixed by Chris with this
> patch [1] will not re-occur with your "ticks" approach?

No, it will not re-occur, because this does the comparison the right
way, namely by subtracting the absolute times and comparing that delta
to the threshold (that's also what the time_after does under the hood).
The wrong way is saying now >= last_reset + threshold (or as I think the
code used to do, computing next_reset = now + threshold and then asking
now >= next_reset).

Rasmus
Stefan Roese June 5, 2020, 2:23 p.m. UTC | #6
On 05.06.20 16:18, Rasmus Villemoes wrote:
> On 05/06/2020 15.37, Stefan Roese wrote:
>> On 05.06.20 14:13, Stefan Roese wrote:
>>> On 05.06.20 14:11, Rasmus Villemoes wrote:
>>>> On 05/06/2020 13.48, Stefan Roese wrote:
>>>>> On 05.06.20 13:16, Rasmus Villemoes wrote:
>>>>>> This is what I had in mind. I also considered making it a config knob
>>>>>> (possibly just auto-selected based on $ARCH) whether to use
>>>>>> get_timer() or get_ticks(), but that becomes quite ugly.
>>>>>
>>>>> I hesitate a bit, moving with this generic code from get_timer()
>>>>> to get_ticks() for all boards. Did you test this patch on other
>>>>> platforms, like some ARM boards?
>>>>>
>>>>> Please note that I don't reject it - just asking.
>>>>
>>>> Yeah, I'm not really too happy about it myself, exactly because it
>>>> affects all arches/platforms. And no, I don't have other hardware handy
>>>> unfortunately. So it's very much an RFC where I hope someone with
>>>> knowledge of the various arches can say whether one can expect
>>>> get_ticks() to be at least as widely available as get_timer().
>>>
>>> I'll try to test it on a non-powerpc platform soon'ish.
>>
>> I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688)
>> and it works there without any issues AFAICT.
> 
> Thanks.
> 
>> Two more remarks:
>>
>> Could you please run a build test with this patch applied for all
>> boards (Travis, Azure...)?
> 
> I may need a few more pointers than that. What am I supposed to do exactly?

Not sure if there is some documentation on how to use the Travis,
Gitlab or Azure build. It mainly uses buildman [1] internally, so you
can use buildman locally if you don't want to push the build into
the cloud.

>> And do you see an issue, that the timer-wrap fixed by Chris with this
>> patch [1] will not re-occur with your "ticks" approach?
> 
> No, it will not re-occur, because this does the comparison the right
> way, namely by subtracting the absolute times and comparing that delta
> to the threshold (that's also what the time_after does under the hood).
> The wrong way is saying now >= last_reset + threshold (or as I think the
> code used to do, computing next_reset = now + threshold and then asking
> now >= next_reset).

Ah, good to know. Thanks for confirming.

Thanks,
Stefan

[1] tools/buildman/README
Tom Rini June 5, 2020, 2:34 p.m. UTC | #7
On Fri, Jun 05, 2020 at 04:23:21PM +0200, Stefan Roese wrote:
> On 05.06.20 16:18, Rasmus Villemoes wrote:
> > On 05/06/2020 15.37, Stefan Roese wrote:
> > > On 05.06.20 14:13, Stefan Roese wrote:
> > > > On 05.06.20 14:11, Rasmus Villemoes wrote:
> > > > > On 05/06/2020 13.48, Stefan Roese wrote:
> > > > > > On 05.06.20 13:16, Rasmus Villemoes wrote:
> > > > > > > This is what I had in mind. I also considered making it a config knob
> > > > > > > (possibly just auto-selected based on $ARCH) whether to use
> > > > > > > get_timer() or get_ticks(), but that becomes quite ugly.
> > > > > > 
> > > > > > I hesitate a bit, moving with this generic code from get_timer()
> > > > > > to get_ticks() for all boards. Did you test this patch on other
> > > > > > platforms, like some ARM boards?
> > > > > > 
> > > > > > Please note that I don't reject it - just asking.
> > > > > 
> > > > > Yeah, I'm not really too happy about it myself, exactly because it
> > > > > affects all arches/platforms. And no, I don't have other hardware handy
> > > > > unfortunately. So it's very much an RFC where I hope someone with
> > > > > knowledge of the various arches can say whether one can expect
> > > > > get_ticks() to be at least as widely available as get_timer().
> > > > 
> > > > I'll try to test it on a non-powerpc platform soon'ish.
> > > 
> > > I've tested it on an MIPS based platform (gardena-smart-gateway-mt7688)
> > > and it works there without any issues AFAICT.
> > 
> > Thanks.
> > 
> > > Two more remarks:
> > > 
> > > Could you please run a build test with this patch applied for all
> > > boards (Travis, Azure...)?
> > 
> > I may need a few more pointers than that. What am I supposed to do exactly?
> 
> Not sure if there is some documentation on how to use the Travis,
> Gitlab or Azure build. It mainly uses buildman [1] internally, so you
> can use buildman locally if you don't want to push the build into
> the cloud.

So, here's a recently-I-learned.  If you submit a pull request on GitHub
to https://github.com/u-boot/u-boot that will trigger Travis and Azure
runs:
https://travis-ci.org/github/u-boot/u-boot/pull_requests
and somewhere under:
https://dev.azure.com/u-boot/u-boot/_build?definitionId=2&_a=summary

Note that sometimes Travis will have connectivity failures and you have
to be me to tell it to re-run the job.  Azure only has failures when we
get those to-be-debugged random failures of sandbox/qemu targets.
Rasmus Villemoes June 5, 2020, 7:56 p.m. UTC | #8
On 05/06/2020 16.34, Tom Rini wrote:
> On Fri, Jun 05, 2020 at 04:23:21PM +0200, Stefan Roese wrote:
>> On 05.06.20 16:18, Rasmus Villemoes wrote:
>>> On 05/06/2020 15.37, Stefan Roese wrote:
>>>> Could you please run a build test with this patch applied for all
>>>> boards (Travis, Azure...)?
>>>
>>> I may need a few more pointers than that. What am I supposed to do exactly?
>>
>> Not sure if there is some documentation on how to use the Travis,
>> Gitlab or Azure build. It mainly uses buildman [1] internally, so you
>> can use buildman locally if you don't want to push the build into
>> the cloud.
> 
> So, here's a recently-I-learned.  If you submit a pull request on GitHub
> to https://github.com/u-boot/u-boot that will trigger Travis and Azure
> runs:
> https://travis-ci.org/github/u-boot/u-boot/pull_requests
> and somewhere under:
> https://dev.azure.com/u-boot/u-boot/_build?definitionId=2&_a=summary
> 
> Note that sometimes Travis will have connectivity failures and you have
> to be me to tell it to re-run the job.  Azure only has failures when we
> get those to-be-debugged random failures of sandbox/qemu targets.
> 

Thanks! FWIW,
https://travis-ci.org/github/u-boot/u-boot/builds/695074829 and
https://dev.azure.com/u-boot/u-boot/_build/results?buildId=713&view=results
are all green - but how many of those actually build wdt-uclass.c I
don't know.

Rasmus
diff mbox series

Patch

diff --git a/drivers/watchdog/wdt-uclass.c b/drivers/watchdog/wdt-uclass.c
index 4cdb7bd64c..7be4e9b5bc 100644
--- a/drivers/watchdog/wdt-uclass.c
+++ b/drivers/watchdog/wdt-uclass.c
@@ -16,14 +16,15 @@  DECLARE_GLOBAL_DATA_PTR;
 
 #define WATCHDOG_TIMEOUT_SECS	(CONFIG_WATCHDOG_TIMEOUT_MSECS / 1000)
 
-/*
- * Reset every 1000ms, or however often is required as indicated by a
- * hw_margin_ms property.
- */
-static ulong reset_period = 1000;
+static u64 ratelimit_ticks;
 
 int initr_watchdog(void)
 {
+	/*
+	 * Reset every 1000ms, or however often is required as indicated by a
+	 * hw_margin_ms property.
+	 */
+	u32 reset_period = 1000;
 	u32 timeout = WATCHDOG_TIMEOUT_SECS;
 
 	/*
@@ -48,6 +49,7 @@  int initr_watchdog(void)
 						    4 * reset_period) / 4;
 	}
 
+	ratelimit_ticks = usec_to_tick(reset_period * 1000);
 	wdt_start(gd->watchdog_dev, timeout * 1000, 0);
 	gd->flags |= GD_FLG_WDT_READY;
 	printf("WDT:   Started with%s servicing (%ds timeout)\n",
@@ -117,17 +119,17 @@  int wdt_expire_now(struct udevice *dev, ulong flags)
  */
 void watchdog_reset(void)
 {
-	static ulong next_reset;
-	ulong now;
+	static u64 last_reset;
+	u64 now;
 
 	/* Exit if GD is not ready or watchdog is not initialized yet */
 	if (!gd || !(gd->flags & GD_FLG_WDT_READY))
 		return;
 
 	/* Do not reset the watchdog too often */
-	now = get_timer(0);
-	if (time_after(now, next_reset)) {
-		next_reset = now + reset_period;
+	now = get_ticks();
+	if (now - last_reset >= ratelimit_ticks) {
+		last_reset = now;
 		wdt_reset(gd->watchdog_dev);
 	}
 }