
[v3,0/2] Copy-on-write poison recovery

Message ID 20221021200120.175753-1-tony.luck@intel.com (mailing list archive)

Message

Tony Luck Oct. 21, 2022, 8:01 p.m. UTC
Part 1 deals with the process that triggered the copy-on-write
fault with a store to a shared read-only page. That process is
sent a SIGBUS with the usual machine check decoration to specify
the virtual address of the lost page, together with the scope.
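
As an illustration only (not part of the patches), here is a minimal
userspace sketch of how a process might decode such a SIGBUS. It assumes
that the "decoration" is the usual Linux siginfo_t content: si_code set to
BUS_MCEERR_AR or BUS_MCEERR_AO, si_addr holding the virtual address, and
si_addr_lsb giving the scope as the least significant valid address bit.
The struct poison_report type and the decode_sigbus() helper are
hypothetical names introduced for this sketch.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Linux ABI values, in case the libc headers do not expose them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical helper type: the lost range described by one SIGBUS. */
struct poison_report {
	int is_mce;          /* 1 if this SIGBUS reports a memory error */
	int action_required; /* 1 for BUS_MCEERR_AR, 0 for BUS_MCEERR_AO */
	uintptr_t start;     /* first byte of the lost range */
	size_t len;          /* size of the lost range in bytes */
};

/* Decode the machine check fields of a siginfo_t (sketch, see lead-in). */
struct poison_report decode_sigbus(const siginfo_t *si)
{
	struct poison_report r;

	memset(&r, 0, sizeof(r));
	if (si->si_signo != SIGBUS)
		return r;
	if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
		return r;

	r.is_mce = 1;
	r.action_required = (si->si_code == BUS_MCEERR_AR);
	/* si_addr_lsb is the scope: 12 means one 4 KiB page was lost. */
	r.len = (size_t)1 << si->si_addr_lsb;
	/* Round the reported address down to the start of the lost range. */
	r.start = (uintptr_t)si->si_addr & ~((uintptr_t)r.len - 1);
	return r;
}
```

A handler installed with sigaction() and SA_SIGINFO would pass its
siginfo_t pointer to decode_sigbus() and then decide whether the lost
range can be rebuilt or the process has to exit.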

Part 2 sets up to asynchronously take the page with the uncorrected
error offline to prevent additional machine check faults. H/t to
Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com>
for pointing me to the existing function to queue a call to
memory_failure().
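
To make the "queue now, handle later" idea concrete, here is a toy
userspace model. It is not kernel code and does not use the kernel API:
every toy_* name is hypothetical. The fault path only records the page
frame number; a worker later drains the queue and runs the handler, which
treats a second report for the same page as a harmless no-op.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_QUEUE_LEN 8
#define TOY_MAX_POISONED 16

enum toy_result { TOY_RECOVERED, TOY_ALREADY_POISONED };

struct toy_state {
	unsigned long queue[TOY_QUEUE_LEN];
	size_t head, tail;  /* head: next to pop, tail: next free slot */
	unsigned long poisoned[TOY_MAX_POISONED];
	size_t npoisoned;
};

/* "Fault path": only record the pfn, do no recovery work here. */
bool toy_queue_failure(struct toy_state *s, unsigned long pfn)
{
	if (s->tail - s->head == TOY_QUEUE_LEN)
		return false;  /* queue full, report is dropped */
	s->queue[s->tail++ % TOY_QUEUE_LEN] = pfn;
	return true;
}

/* Stand-in for the real handler: the first report poisons the page. */
enum toy_result toy_handle_failure(struct toy_state *s, unsigned long pfn)
{
	for (size_t i = 0; i < s->npoisoned; i++)
		if (s->poisoned[i] == pfn)
			return TOY_ALREADY_POISONED;
	/* Toy limitation: once the table is full, new pfns are not recorded. */
	if (s->npoisoned < TOY_MAX_POISONED)
		s->poisoned[s->npoisoned++] = pfn;
	return TOY_RECOVERED;
}

/* "Worker": handle one queued report; returns false if the queue is empty. */
bool toy_work_once(struct toy_state *s, enum toy_result *out)
{
	if (s->head == s->tail)
		return false;
	*out = toy_handle_failure(s, s->queue[s->head++ % TOY_QUEUE_LEN]);
	return true;
}
```

Queueing the same pfn twice and draining the queue yields TOY_RECOVERED
followed by TOY_ALREADY_POISONED, which is the shape of duplicate report
one would expect when two sources flag the same page.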

On x86 there is some duplicate reporting (because the error is
signalled by the memory controller as well as by the core
that triggered the machine check). Console logs look like this:

[ 1647.723403] mce: [Hardware Error]: Machine check events logged
	Machine check from kernel copy routine

[ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
	x86 fault handler sends SIGBUS to child process

[ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
	Async call to memory_failure() from copy on write path

[ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
	uc_decode_notifier() processes memory controller report

[ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
	Parent process tries to read poisoned page. Page has been unmapped, so
	#PF handler sends SIGBUS
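
For readers matching up these lines: the hex number after "Memory
failure:" is a page frame number, while the SIGBUS lines print a virtual
address. The sketch below pulls the pfn out of such a console line and
converts it to a physical address. It assumes 4 KiB pages (a shift of 12);
parse_memory_failure_pfn() and pfn_to_phys() are hypothetical helper
names, not kernel or libc functions.

```c
#include <stdio.h>
#include <string.h>

#define ASSUMED_PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/*
 * Extract the pfn from a "Memory failure: 0x<pfn>: ..." console line.
 * Returns 1 on success, 0 if the line is not a memory failure report.
 */
int parse_memory_failure_pfn(const char *line, unsigned long long *pfn)
{
	static const char tag[] = "Memory failure: ";
	const char *p = strstr(line, tag);

	if (!p)
		return 0;
	/* %llx accepts the optional 0x prefix and stops at the ':' */
	return sscanf(p + sizeof(tag) - 1, "%llx", pfn) == 1;
}

/* Physical address of the first byte of the page frame. */
unsigned long long pfn_to_phys(unsigned long long pfn)
{
	return pfn << ASSUMED_PAGE_SHIFT;
}
```

With the log above, pfn 0x905b92d corresponds to physical address
0x905b92d000 under the 4 KiB assumption, and both "Memory failure" lines
resolve to that same page.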


Tony Luck (2):
  mm, hwpoison: Try to recover from copy-on write faults
  mm, hwpoison: When copy-on-write hits poison, take page offline

 include/linux/highmem.h | 24 ++++++++++++++++++++++++
 include/linux/mm.h      |  5 ++++-
 mm/memory.c             | 32 ++++++++++++++++++++++----------
 3 files changed, 50 insertions(+), 11 deletions(-)

Comments

Shuai Xue Oct. 23, 2022, 3:52 p.m. UTC | #1
On 2022/10/22 4:01 AM, Tony Luck wrote:
> Part 1 deals with the process that triggered the copy-on-write
> fault with a store to a shared read-only page. That process is
> sent a SIGBUS with the usual machine check decoration to specify
> the virtual address of the lost page, together with the scope.
> 
> Part 2 sets up to asynchronously take the page with the uncorrected
> error offline to prevent additional machine check faults. H/t to
> Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com>
> for pointing me to the existing function to queue a call to
> memory_failure().
> 
> On x86 there is some duplicate reporting (because the error is
> signalled by the memory controller as well as by the core
> that triggered the machine check). Console logs look like this:
> 
> [ 1647.723403] mce: [Hardware Error]: Machine check events logged
> 	Machine check from kernel copy routine
> 
> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
> 	x86 fault handler sends SIGBUS to child process
> 
> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
> 	Async call to memory_failure() from copy on write path

The recovery action might also be handled asynchronously by the CMCI
uc_decode_notifier() handler, signaled by the memory controller, right?

I have one more memory failure log line than you do.

[ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory corruption fault at 7fc4bf7cf400
[ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: Recovered
	uc_decode_notifier() processes memory controller report

[ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned
	Workqueue: events memory_failure_work_func // queued by ghes_do_memory_failure

[ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned
	Workqueue: events memory_failure_work_func // queued by __wp_page_copy_user

[ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory corruption fault at 7fc4bf7cf400

Best Regards,
Shuai

Shuai Xue Oct. 26, 2022, 5:19 a.m. UTC | #2
On 2022/10/23 11:52 PM, Shuai Xue wrote:
> 
> 
> On 2022/10/22 4:01 AM, Tony Luck wrote:
>> Part 1 deals with the process that triggered the copy-on-write
>> fault with a store to a shared read-only page. That process is
>> sent a SIGBUS with the usual machine check decoration to specify
>> the virtual address of the lost page, together with the scope.
>>
>> Part 2 sets up to asynchronously take the page with the uncorrected
>> error offline to prevent additional machine check faults. H/t to
>> Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com>
>> for pointing me to the existing function to queue a call to
>> memory_failure().
>>
>> On x86 there is some duplicate reporting (because the error is
>> signalled by the memory controller as well as by the core
>> that triggered the machine check). Console logs look like this:
>>
>> [ 1647.723403] mce: [Hardware Error]: Machine check events logged
>> 	Machine check from kernel copy routine
>>
>> [ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
>> 	x86 fault handler sends SIGBUS to child process
>>
>> [ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
>> 	Async call to memory_failure() from copy on write path
> 
> The recovery action might also be handled asynchronously by the CMCI
> uc_decode_notifier() handler, signaled by the memory controller, right?
> 
> I have one more memory failure log line than you do.
> 
> [ 3187.485742] MCE: Killing einj_mem_uc:31746 due to hardware memory corruption fault at 7fc4bf7cf400
> [ 3187.740620] Memory failure: 0x1a3b80: recovery action for dirty LRU page: Recovered
> 	uc_decode_notifier() processes memory controller report
> 
> [ 3187.748272] Memory failure: 0x1a3b80: already hardware poisoned
> 	Workqueue: events memory_failure_work_func // queued by ghes_do_memory_failure
> 
> [ 3187.754194] Memory failure: 0x1a3b80: already hardware poisoned
> 	Workqueue: events memory_failure_work_func // queued by __wp_page_copy_user
> 
> [ 3188.615920] MCE: Killing einj_mem_uc:31745 due to hardware memory corruption fault at 7fc4bf7cf400
> 
> Best Regards,
> Shuai

Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thank you.
Shuai
