diff mbox series

[v3,1/4] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race

Message ID 20211110025056.2084347-2-npiggin@gmail.com (mailing list archive)
State Accepted
Headers show
Series powerpc: watchdog fixes | expand

Commit Message

Nicholas Piggin Nov. 10, 2021, 2:50 a.m. UTC
It is possible for all CPUs to miss the pending cpumask becoming clear,
and then nobody resetting it, which will cause the lockup detector to
stop working. It will eventually expire, but watchdog_smp_panic will
avoid doing anything if the pending mask is clear and it will never be
reset.

Order the cpumask clear vs the subsequent test to close this race.

Add an extra check for an empty pending mask when the watchdog fires and
finds its bit still clear, to try to catch any other possible races or
bugs here and keep the watchdog working. The extra test in
arch_touch_nmi_watchdog is required to prevent the new warning from
firing off.

Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/watchdog.c | 41 +++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

Comments

Laurent Dufour Nov. 15, 2021, 3:09 p.m. UTC | #1
Le 10/11/2021 à 03:50, Nicholas Piggin a écrit :
> It is possible for all CPUs to miss the pending cpumask becoming clear,
> and then nobody resetting it, which will cause the lockup detector to
> stop working. It will eventually expire, but watchdog_smp_panic will
> avoid doing anything if the pending mask is clear and it will never be
> reset.
> 
> Order the cpumask clear vs the subsequent test to close this race.
> 
> Add an extra check for an empty pending mask when the watchdog fires and
> finds its bit still clear, to try to catch any other possible races or
> bugs here and keep the watchdog working. The extra test in
> arch_touch_nmi_watchdog is required to prevent the new warning from
> firing off.
> 
> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   arch/powerpc/kernel/watchdog.c | 41 +++++++++++++++++++++++++++++++++-
>   1 file changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index f9ea0e5357f9..3c60872b6a2c 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -135,6 +135,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
>   {
>   	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
>   	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
> +	/*
> +	 * See wd_smp_clear_cpu_pending()
> +	 */
> +	smp_mb();
>   	if (cpumask_empty(&wd_smp_cpus_pending)) {
>   		wd_smp_last_reset_tb = tb;
>   		cpumask_andnot(&wd_smp_cpus_pending,
> @@ -215,13 +219,44 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>   
>   			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
>   			wd_smp_unlock(&flags);
> +		} else {
> +			/*
> +			 * The last CPU to clear pending should have reset the
> +			 * watchdog so we generally should not find it empty
> +			 * here if our CPU was clear. However it could happen
> +			 * due to a rare race with another CPU taking the
> +			 * last CPU out of the mask concurrently.
> +			 *
> +			 * We can't add a warning for it. But just in case
> +			 * there is a problem with the watchdog that is causing
> +			 * the mask to not be reset, try to kick it along here.
> +			 */
> +			if (unlikely(cpumask_empty(&wd_smp_cpus_pending)))
> +				goto none_pending;

If I understand correctly, that branch is a security in case the code is not 
working as expected. But I'm really wondering if that's really needed, and we 
will end up with a contention on the watchdog lock while this path should be 
lockless, and I'd say that in most of the case there is nothing to do after 
grabbing that lock. Am I missing something risky here?

>   		}
>   		return;
>   	}
> +
>   	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
> +
> +	/*
> +	 * Order the store to clear pending with the load(s) to check all
> +	 * words in the pending mask to check they are all empty. This orders
> +	 * with the same barrier on another CPU. This prevents two CPUs
> +	 * clearing the last 2 pending bits, but neither seeing the other's
> +	 * store when checking if the mask is empty, and missing an empty
> +	 * mask, which ends with a false positive.
> +	 */
> +	smp_mb();
>   	if (cpumask_empty(&wd_smp_cpus_pending)) {
>   		unsigned long flags;
>   
> +none_pending:
> +		/*
> +		 * Double check under lock because more than one CPU could see
> +		 * a clear mask with the lockless check after clearing their
> +		 * pending bits.
> +		 */
>   		wd_smp_lock(&flags);
>   		if (cpumask_empty(&wd_smp_cpus_pending)) {
>   			wd_smp_last_reset_tb = tb;
> @@ -312,8 +347,12 @@ void arch_touch_nmi_watchdog(void)
>   {
>   	unsigned long ticks = tb_ticks_per_usec * wd_timer_period_ms * 1000;
>   	int cpu = smp_processor_id();
> -	u64 tb = get_tb();
> +	u64 tb;
>   
> +	if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
> +		return;
> +
> +	tb = get_tb();
>   	if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
>   		per_cpu(wd_timer_tb, cpu) = tb;
>   		wd_smp_clear_cpu_pending(cpu, tb);
>
Nicholas Piggin Nov. 19, 2021, 9:05 a.m. UTC | #2
Excerpts from Laurent Dufour's message of November 16, 2021 1:09 am:
> Le 10/11/2021 à 03:50, Nicholas Piggin a écrit :
>> It is possible for all CPUs to miss the pending cpumask becoming clear,
>> and then nobody resetting it, which will cause the lockup detector to
>> stop working. It will eventually expire, but watchdog_smp_panic will
>> avoid doing anything if the pending mask is clear and it will never be
>> reset.
>> 
>> Order the cpumask clear vs the subsequent test to close this race.
>> 
>> Add an extra check for an empty pending mask when the watchdog fires and
>> finds its bit still clear, to try to catch any other possible races or
>> bugs here and keep the watchdog working. The extra test in
>> arch_touch_nmi_watchdog is required to prevent the new warning from
>> firing off.
>> 
>> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>   arch/powerpc/kernel/watchdog.c | 41 +++++++++++++++++++++++++++++++++-
>>   1 file changed, 40 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
>> index f9ea0e5357f9..3c60872b6a2c 100644
>> --- a/arch/powerpc/kernel/watchdog.c
>> +++ b/arch/powerpc/kernel/watchdog.c
>> @@ -135,6 +135,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
>>   {
>>   	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
>>   	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
>> +	/*
>> +	 * See wd_smp_clear_cpu_pending()
>> +	 */
>> +	smp_mb();
>>   	if (cpumask_empty(&wd_smp_cpus_pending)) {
>>   		wd_smp_last_reset_tb = tb;
>>   		cpumask_andnot(&wd_smp_cpus_pending,
>> @@ -215,13 +219,44 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>>   
>>   			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
>>   			wd_smp_unlock(&flags);
>> +		} else {
>> +			/*
>> +			 * The last CPU to clear pending should have reset the
>> +			 * watchdog so we generally should not find it empty
>> +			 * here if our CPU was clear. However it could happen
>> +			 * due to a rare race with another CPU taking the
>> +			 * last CPU out of the mask concurrently.
>> +			 *
>> +			 * We can't add a warning for it. But just in case
>> +			 * there is a problem with the watchdog that is causing
>> +			 * the mask to not be reset, try to kick it along here.
>> +			 */
>> +			if (unlikely(cpumask_empty(&wd_smp_cpus_pending)))
>> +				goto none_pending;
> 
> If I understand correctly, that branch is a security in case the code is not 
> working as expected. But I'm really wondering if that's really needed, and we 
> will end up with a contention on the watchdog lock while this path should be 
> lockless, and I'd say that in most of the case there is nothing to do after 
> grabbing that lock. Am I missing something risky here?

I'm thinking it should not hit very much because that first test

    if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {

I think it should not be true too often, it would mean a CPU has taken 
two timer interrupts while another one has not taken any, so hopefully
that's pretty rare in normal operation.

Thanks,
Nick
Laurent Dufour Nov. 19, 2021, 9:25 a.m. UTC | #3
Le 19/11/2021 à 10:05, Nicholas Piggin a écrit :
> Excerpts from Laurent Dufour's message of November 16, 2021 1:09 am:
>> Le 10/11/2021 à 03:50, Nicholas Piggin a écrit :
>>> It is possible for all CPUs to miss the pending cpumask becoming clear,
>>> and then nobody resetting it, which will cause the lockup detector to
>>> stop working. It will eventually expire, but watchdog_smp_panic will
>>> avoid doing anything if the pending mask is clear and it will never be
>>> reset.
>>>
>>> Order the cpumask clear vs the subsequent test to close this race.
>>>
>>> Add an extra check for an empty pending mask when the watchdog fires and
>>> finds its bit still clear, to try to catch any other possible races or
>>> bugs here and keep the watchdog working. The extra test in
>>> arch_touch_nmi_watchdog is required to prevent the new warning from
>>> firing off.
>>>
>>> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>>    arch/powerpc/kernel/watchdog.c | 41 +++++++++++++++++++++++++++++++++-
>>>    1 file changed, 40 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
>>> index f9ea0e5357f9..3c60872b6a2c 100644
>>> --- a/arch/powerpc/kernel/watchdog.c
>>> +++ b/arch/powerpc/kernel/watchdog.c
>>> @@ -135,6 +135,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
>>>    {
>>>    	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
>>>    	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
>>> +	/*
>>> +	 * See wd_smp_clear_cpu_pending()
>>> +	 */
>>> +	smp_mb();
>>>    	if (cpumask_empty(&wd_smp_cpus_pending)) {
>>>    		wd_smp_last_reset_tb = tb;
>>>    		cpumask_andnot(&wd_smp_cpus_pending,
>>> @@ -215,13 +219,44 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>>>    
>>>    			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
>>>    			wd_smp_unlock(&flags);
>>> +		} else {
>>> +			/*
>>> +			 * The last CPU to clear pending should have reset the
>>> +			 * watchdog so we generally should not find it empty
>>> +			 * here if our CPU was clear. However it could happen
>>> +			 * due to a rare race with another CPU taking the
>>> +			 * last CPU out of the mask concurrently.
>>> +			 *
>>> +			 * We can't add a warning for it. But just in case
>>> +			 * there is a problem with the watchdog that is causing
>>> +			 * the mask to not be reset, try to kick it along here.
>>> +			 */
>>> +			if (unlikely(cpumask_empty(&wd_smp_cpus_pending)))
>>> +				goto none_pending;
>>
>> If I understand correctly, that branch is a security in case the code is not
>> working as expected. But I'm really wondering if that's really needed, and we
>> will end up with a contention on the watchdog lock while this path should be
>> lockless, and I'd say that in most of the case there is nothing to do after
>> grabbing that lock. Am I missing something risky here?
> 
> I'm thinking it should not hit very much because that first test
> 
>      if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {
> 
> I think it should not be true too often, it would mean a CPU has taken
> two timer interrupts while another one has not taken any, so hopefully
> that's pretty rare in normal operation.

Thanks, Nick, for the clarification.

Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index f9ea0e5357f9..3c60872b6a2c 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -135,6 +135,10 @@  static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
 {
 	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
 	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
+	/*
+	 * See wd_smp_clear_cpu_pending()
+	 */
+	smp_mb();
 	if (cpumask_empty(&wd_smp_cpus_pending)) {
 		wd_smp_last_reset_tb = tb;
 		cpumask_andnot(&wd_smp_cpus_pending,
@@ -215,13 +219,44 @@  static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
 
 			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
 			wd_smp_unlock(&flags);
+		} else {
+			/*
+			 * The last CPU to clear pending should have reset the
+			 * watchdog so we generally should not find it empty
+			 * here if our CPU was clear. However it could happen
+			 * due to a rare race with another CPU taking the
+			 * last CPU out of the mask concurrently.
+			 *
+			 * We can't add a warning for it. But just in case
+			 * there is a problem with the watchdog that is causing
+			 * the mask to not be reset, try to kick it along here.
+			 */
+			if (unlikely(cpumask_empty(&wd_smp_cpus_pending)))
+				goto none_pending;
 		}
 		return;
 	}
+
 	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
+
+	/*
+	 * Order the store to clear pending with the load(s) to check all
+	 * words in the pending mask to check they are all empty. This orders
+	 * with the same barrier on another CPU. This prevents two CPUs
+	 * clearing the last 2 pending bits, but neither seeing the other's
+	 * store when checking if the mask is empty, and missing an empty
+	 * mask, which ends with a false positive.
+	 */
+	smp_mb();
 	if (cpumask_empty(&wd_smp_cpus_pending)) {
 		unsigned long flags;
 
+none_pending:
+		/*
+		 * Double check under lock because more than one CPU could see
+		 * a clear mask with the lockless check after clearing their
+		 * pending bits.
+		 */
 		wd_smp_lock(&flags);
 		if (cpumask_empty(&wd_smp_cpus_pending)) {
 			wd_smp_last_reset_tb = tb;
@@ -312,8 +347,12 @@  void arch_touch_nmi_watchdog(void)
 {
 	unsigned long ticks = tb_ticks_per_usec * wd_timer_period_ms * 1000;
 	int cpu = smp_processor_id();
-	u64 tb = get_tb();
+	u64 tb;
 
+	if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
+		return;
+
+	tb = get_tb();
 	if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
 		per_cpu(wd_timer_tb, cpu) = tb;
 		wd_smp_clear_cpu_pending(cpu, tb);