[0/4] KVM: PPC: Book3S: Fix potential deadlocks

Message ID 20190523063424.GB19655@blackberry

Message

Paul Mackerras May 23, 2019, 6:34 a.m. UTC
Recent reports of lockdep splats in the HV KVM code revealed that it
was taking the kvm->lock mutex in several contexts where a vcpu mutex
was already held.  Lockdep has only started warning since I added code
to take the vcpu mutexes in the XIVE device release functions, but
since Documentation/virtual/kvm/locking.txt specifies that the vcpu
mutexes nest inside kvm->lock, it seems that the new code is correct
and it is most of the old uses of kvm->lock that are wrong.

This series should fix the problems, by adding new mutexes that nest
inside the vcpu mutexes and using them instead of kvm->lock.
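
To make the intended nesting concrete, here is a minimal sketch
(illustrative only, not code from the patches; the helper name is made
up, and rtas_token_lock is just one of the new inner mutexes):

	/*
	 * Documented order: the vcpu mutexes nest inside kvm->lock, so code
	 * that can run with a vcpu mutex already held must not take
	 * kvm->lock.  It takes one of the new, finer-grained mutexes
	 * instead, which nest inside the vcpu mutexes.
	 */
	static void hv_op_with_vcpu_mutex_held(struct kvm_vcpu *vcpu)
	{
		struct kvm *kvm = vcpu->kvm;

		mutex_lock(&vcpu->mutex);
		/* old code: mutex_lock(&kvm->lock) here inverts the documented order */
		mutex_lock(&kvm->arch.rtas_token_lock);
		/* ... touch the state this mutex now protects ... */
		mutex_unlock(&kvm->arch.rtas_token_lock);
		mutex_unlock(&vcpu->mutex);
	}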

Paul.

Comments

Alexey Kardashevskiy May 23, 2019, 7:21 a.m. UTC | #1
On 23/05/2019 16:34, Paul Mackerras wrote:
> Recent reports of lockdep splats in the HV KVM code revealed that it
> was taking the kvm->lock mutex in several contexts where a vcpu mutex
> was already held.  Lockdep has only started warning since I added code
> to take the vcpu mutexes in the XIVE device release functions, but
> since Documentation/virtual/kvm/locking.txt specifies that the vcpu
> mutexes nest inside kvm->lock, it seems that the new code is correct
> and it is most of the old uses of kvm->lock that are wrong.
> 
> This series should fix the problems, by adding new mutexes that nest
> inside the vcpu mutexes and using them instead of kvm->lock.


I applied these 4, compiled, installed, rebooted, tried running a guest
(which failed because I also updated QEMU and its cli has changed), got
this. So VM was created and then destroyed without executing a single
instruction, if that matters.


systemd-journald[1278]: Successfully sent stream file descriptor to service manager.
systemd-journald[1278]: Successfully sent stream file descriptor to service manager.
WARNING: CPU: 3 PID: 7697 at arch/powerpc/kvm/book3s_rtas.c:285 kvmppc_rtas_tokens_free+0x100/0x108 [kvm]
Modules linked in: bridge stp llc kvm_hv kvm rpcrdma ib_iser ib_srp rdma_ucm ib_umad sunrpc rdma_cm ib_ipoib iw_cm libiscsi ib_cm scsi_transport_iscsi mlx5_ib ib_uverbs ib_core vmx_crypto crct10dif_vpmsum crct10dif_common at24 sg xfs libcrc32c crc32c_vpmsum mlx5_core mlxfw autofs4
CPU: 3 PID: 7697 Comm: qemu-kvm Not tainted 5.2.0-rc1-le_nv2_aikATfstn1-p1 #496
NIP:  c00800000f3ab678 LR: c00800000f3ab66c CTR: c000000000198210
REGS: c000003fdf873680 TRAP: 0700   Not tainted  (5.2.0-rc1-le_nv2_aikATfstn1-p1)
MSR:  900000000282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 24008882  XER: 20040000
CFAR: c000000000198364 IRQMASK: 0
GPR00: c00800000f3ab66c c000003fdf873910 c00800000f3d8f00 0000000000000000
GPR04: c000003f4c984630 0000000000000000 00000000d5952dee 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000000 c00800000f3b95c0
GPR12: c000000000198210 c000003fffffbb80 00007ffff20d57a8 0000000000000000
GPR16: 0000000000000000 0000000000000008 0000000119dcd0c0 0000000000000001
GPR20: c000003f4c98f530 c000003f4c980098 c000003f4c984188 0000000000000000
GPR24: c000003f4c98a740 0000000000000000 c000003fdf8dd200 c00800000f3d1c18
GPR28: c000003f4c98a858 c000003f4c980000 c000003f4c9840c8 c000003f4c980000
NIP [c00800000f3ab678] kvmppc_rtas_tokens_free+0x100/0x108 [kvm]
LR [c00800000f3ab66c] kvmppc_rtas_tokens_free+0xf4/0x108 [kvm]
Call Trace:
[c000003fdf873910] [c00800000f3ab66c] kvmppc_rtas_tokens_free+0xf4/0x108 [kvm] (unreliable)
[c000003fdf873960] [c00800000f3aa640] kvmppc_core_destroy_vm+0x48/0xa8 [kvm]
[c000003fdf873990] [c00800000f3a4b08] kvm_arch_destroy_vm+0x130/0x190 [kvm]
[c000003fdf8739d0] [c00800000f3985dc] kvm_put_kvm+0x204/0x500 [kvm]
[c000003fdf873a60] [c00800000f398910] kvm_vm_release+0x38/0x60 [kvm]
[c000003fdf873a90] [c0000000004345fc] __fput+0xcc/0x2f0
[c000003fdf873af0] [c000000000139318] task_work_run+0x108/0x150
[c000003fdf873b30] [c000000000109408] do_exit+0x438/0xe10
[c000003fdf873c00] [c000000000109eb0] do_group_exit+0x60/0xe0
[c000003fdf873c40] [c00000000011ea24] get_signal+0x1b4/0xce0
[c000003fdf873d30] [c000000000025ea8] do_notify_resume+0x1a8/0x430
[c000003fdf873e20] [c00000000000e444] ret_from_except_lite+0x70/0x74
Instruction dump:
ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 3880ffff 38634630
4800df59 e8410018 2fa30000 409eff44 <0fe00000> 4bffff3c 7c0802a6 60000000
irq event stamp: 114938
hardirqs last  enabled at (114937): [<c0000000003ab8f4>] free_unref_page+0xd4/0x100
hardirqs last disabled at (114938): [<c000000000009060>] program_check_common+0x170/0x180
softirqs last  enabled at (114424): [<c000000000a06bdc>] peernet2id+0x6c/0xb0
softirqs last disabled at (114422): [<c000000000a06bb4>] peernet2id+0x44/0xb0
---[ end trace c33c9599a1a73dd2 ]---

systemd-journald[1278]: Compressed data object 650 -> 280 using XZ

tun: Universal TUN/TAP device driver, 1.6

virbr0: port 1(virbr0-nic) entered blocking state
Cédric Le Goater May 23, 2019, 7:41 a.m. UTC | #2
On 5/23/19 9:21 AM, Alexey Kardashevskiy wrote:
> 
> 
> On 23/05/2019 16:34, Paul Mackerras wrote:
>> Recent reports of lockdep splats in the HV KVM code revealed that it
>> was taking the kvm->lock mutex in several contexts where a vcpu mutex
>> was already held.  Lockdep has only started warning since I added code
>> to take the vcpu mutexes in the XIVE device release functions, but
>> since Documentation/virtual/kvm/locking.txt specifies that the vcpu
>> mutexes nest inside kvm->lock, it seems that the new code is correct
>> and it is most of the old uses of kvm->lock that are wrong.
>>
>> This series should fix the problems, by adding new mutexes that nest
>> inside the vcpu mutexes and using them instead of kvm->lock.
> 
> 
> I applied these 4, compiled, installed, rebooted, tried running a guest
> (which failed because I also updated QEMU and its cli has changed), got
> this. So VM was created and then destroyed without executing a single
> instruction, if that matters.


kvm->lock is still held when kvmppc_rtas_tokens_free() is called,
but kvm->arch.rtas_token_lock isn't.

Maybe we should change the lockdep annotation back to what it was
before?
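
In code terms the choice would be roughly (sketch only, assuming the
assert at book3s_rtas.c:285 is the one this series added):

	/* what the series asserts now -- not taken on the destroy path above */
	lockdep_assert_held(&kvm->arch.rtas_token_lock);

	/* the pre-series annotation -- kvm->lock is held by kvm_arch_destroy_vm() */
	lockdep_assert_held(&kvm->lock);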

C. 


Paul Mackerras May 23, 2019, 8:31 a.m. UTC | #3
On Thu, May 23, 2019 at 05:21:00PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 23/05/2019 16:34, Paul Mackerras wrote:
> > Recent reports of lockdep splats in the HV KVM code revealed that it
> > was taking the kvm->lock mutex in several contexts where a vcpu mutex
> > was already held.  Lockdep has only started warning since I added code
> > to take the vcpu mutexes in the XIVE device release functions, but
> > since Documentation/virtual/kvm/locking.txt specifies that the vcpu
> > mutexes nest inside kvm->lock, it seems that the new code is correct
> > and it is most of the old uses of kvm->lock that are wrong.
> > 
> > This series should fix the problems, by adding new mutexes that nest
> > inside the vcpu mutexes and using them instead of kvm->lock.
> 
> 
> I applied these 4, compiled, installed, rebooted, tried running a guest
> (which failed because I also updated QEMU and its cli has changed), got
> this. So VM was created and then destroyed without executing a single
> instruction, if that matters.

Looks like I need to remove the
	lockdep_assert_held(&kvm->arch.rtas_token_lock);

in kvmppc_rtas_tokens_free().  We don't have the rtas_token_lock, but
it doesn't matter because we are destroying the VM and nothing else
has a reference to it by now.
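
A minimal sketch of what that ends up looking like (illustrative, not
the actual v2 patch; it assumes the function just walks and frees the
token list):

	void kvmppc_rtas_tokens_free(struct kvm *kvm)
	{
		struct rtas_token_definition *d, *tmp;

		/*
		 * No lock needed: this is only reached from VM destruction,
		 * when nothing else holds a reference to the VM, so the
		 * lockdep_assert_held(&kvm->arch.rtas_token_lock) that fired
		 * in Alexey's trace is simply dropped.
		 */
		list_for_each_entry_safe(d, tmp, &kvm->arch.rtas_tokens, list) {
			list_del(&d->list);
			kfree(d);
		}
	}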

Paul.
Cédric Le Goater May 24, 2019, 9:17 a.m. UTC | #4
On 5/23/19 8:34 AM, Paul Mackerras wrote:
> Recent reports of lockdep splats in the HV KVM code revealed that it
> was taking the kvm->lock mutex in several contexts where a vcpu mutex
> was already held.  Lockdep has only started warning since I added code
> to take the vcpu mutexes in the XIVE device release functions, but
> since Documentation/virtual/kvm/locking.txt specifies that the vcpu
> mutexes nest inside kvm->lock, it seems that the new code is correct
> and it is most of the old uses of kvm->lock that are wrong.
> 
> This series should fix the problems, by adding new mutexes that nest
> inside the vcpu mutexes and using them instead of kvm->lock.

Hello,

I get this warning when running a guest with this patchset:

[  228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
[  228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270
[  228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core
[  228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42
[  228.686911] NIP:  c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20
[  228.686963] REGS: c000200cdb50f570 TRAP: 0700   Not tainted  (5.2.0-rc1-xive+)
[  228.687001] MSR:  9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 48222222  XER: 20040000
[  228.687060] CFAR: c000000000116db0 IRQMASK: 1 
[  228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e 
[  228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f 
[  228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000 
[  228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000 
[  228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900 
[  228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074 
[  228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000 
[  228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0 
[  228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270
[  228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270
[  228.687466] Call Trace:
[  228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable)
[  228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0
[  228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100
[  228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340
[  228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv]
[  228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv]
[  228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[  228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm]
[  228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm]
[  228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0
[  228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120
[  228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80
[  228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70
[  228.688109] Instruction dump:
[  228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87 
[  228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87 
[  228.688251] irq event stamp: 205
[  228.688287] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
[  228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
[  228.688412] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
[  228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210
[  228.688513] ---[ end trace eb16f6260022a812 ]---
[  228.688548] possible reason: unannotated irqs-off.
[  228.688571] irq event stamp: 205
[  228.688607] hardirqs last  enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv]
[  228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv]
[  228.688719] softirqs last  enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4
[  228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210



C.
Paul Mackerras May 28, 2019, 4:54 a.m. UTC | #5
On Fri, May 24, 2019 at 11:17:16AM +0200, Cédric Le Goater wrote:
> On 5/23/19 8:34 AM, Paul Mackerras wrote:
> > Recent reports of lockdep splats in the HV KVM code revealed that it
> > was taking the kvm->lock mutex in several contexts where a vcpu mutex
> > was already held.  Lockdep has only started warning since I added code
> > to take the vcpu mutexes in the XIVE device release functions, but
> > since Documentation/virtual/kvm/locking.txt specifies that the vcpu
> > mutexes nest inside kvm->lock, it seems that the new code is correct
> > and it is most of the old uses of kvm->lock that are wrong.
> > 
> > This series should fix the problems, by adding new mutexes that nest
> > inside the vcpu mutexes and using them instead of kvm->lock.
> 
> Hello,
> 
> I get this warning when running a guest with this patchset:

Looks like we need the equivalent of 3309bec85e60 applied to the
p9/radix streamlined entry path.
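
Roughly the same kind of annotation as that commit, sketched here for
reference (not the actual fix for kvmhv_run_single_vcpu()): lockdep
needs to be told that interrupts are effectively on while the guest
runs, otherwise it later trips over an 'unannotated irqs-off'
transition, which is what the splat above reports.

	/*
	 * Sketch only: hard interrupts are really off here, but the guest
	 * will run with interrupts on, so keep lockdep's view consistent
	 * around the low-level guest entry/exit.
	 */
	trace_hardirqs_on();
	/* ... low-level entry to and exit from the guest ... */
	trace_hardirqs_off();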

Paul.
Paul Mackerras May 31, 2019, 6:32 a.m. UTC | #6
On Thu, May 23, 2019 at 04:34:24PM +1000, Paul Mackerras wrote:
> Recent reports of lockdep splats in the HV KVM code revealed that it
> was taking the kvm->lock mutex in several contexts where a vcpu mutex
> was already held.  Lockdep has only started warning since I added code
> to take the vcpu mutexes in the XIVE device release functions, but
> since Documentation/virtual/kvm/locking.txt specifies that the vcpu
> mutexes nest inside kvm->lock, it seems that the new code is correct
> and it is most of the old uses of kvm->lock that are wrong.
> 
> This series should fix the problems, by adding new mutexes that nest
> inside the vcpu mutexes and using them instead of kvm->lock.

Series applied to my kvm-ppc-fixes branch (with v2 of 3/4).

Paul.