Message ID | 20240229062201.49500-1-kai.heng.feng@canonical.com |
---|---|
State | New |
Headers | show |
Series | [v3] driver core: Cancel scheduled pm_runtime_idle() on device removal | expand |
On Thu, Feb 29, 2024 at 02:22:00PM +0800, Kai-Heng Feng wrote: > When inserting an SD7.0 card to Realtek card reader, the card reader > unplugs itself and morph into a NVMe device. The slot Link down on hot > unplugged can cause the following error: > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > BUG: unable to handle page fault for address: ffffb24d403e5010 > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > Oops: 0000 [#1] PREEMPT SMP PTI > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > Workqueue: pm pm_runtime_work > RIP: 0010:ioread32+0x2e/0x70 > Code: ff 03 00 77 25 48 81 ff 00 00 01 00 77 14 8b 15 08 d9 54 01 b8 ff ff ff ff 85 d2 75 14 c3 cc cc cc cc 89 fa ed c3 cc cc cc cc <8b> 07 c3 cc cc cc cc 55 83 ea 01 48 89 fe 48 c7 c7 98 6f 15 99 48 > RSP: 0018:ffffb24d40a5bd78 EFLAGS: 00010296 > RAX: ffffb24d403e5000 RBX: 0000000000000152 RCX: 000000000000007f > RDX: 000000000000ff00 RSI: ffffb24d403e5010 RDI: ffffb24d403e5010 > RBP: ffffb24d40a5bd98 R08: ffffb24d403e5010 R09: 0000000000000000 > R10: ffff9074cd95e7f4 R11: 0000000000000003 R12: 000000000000007f > R13: ffff9074e1a68c00 R14: ffff9074e1a68d00 R15: 0000000000009003 > FS: 0000000000000000(0000) GS:ffff90752a180000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffb24d403e5010 CR3: 0000000152832006 CR4: 00000000003706e0 > Call Trace: > <TASK> > ? show_regs+0x68/0x70 > ? __die_body+0x20/0x70 > ? __die+0x2b/0x40 > ? page_fault_oops+0x160/0x480 > ? search_bpf_extables+0x63/0x90 > ? ioread32+0x2e/0x70 > ? search_exception_tables+0x5f/0x70 > ? kernelmode_fixup_or_oops+0xa2/0x120 > ? __bad_area_nosemaphore+0x179/0x230 > ? bad_area_nosemaphore+0x16/0x20 > ? do_kern_addr_fault+0x8b/0xa0 > ? exc_page_fault+0xe5/0x180 > ? asm_exc_page_fault+0x27/0x30 > ? ioread32+0x2e/0x70 > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > pci_pm_runtime_idle+0x34/0x70 > rpm_idle+0xc4/0x2b0 > pm_runtime_work+0x93/0xc0 > process_one_work+0x21a/0x430 > worker_thread+0x4a/0x3c0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x106/0x140 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x29/0x50 > </TASK> > Modules linked in: nvme nvme_core snd_hda_codec_hdmi snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_hda_codec_realtek snd_hda_codec_generic snd_soc_hdac_hda soundwire_intel ledtrig_audio nls_iso8859_1 soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel i915 snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr snd_hda_codec intel_rapl_common snd_hda_core x86_pkg_temp_thermal intel_powerclamp snd_hwdep coretemp snd_pcm kvm_intel drm_buddy ttm mei_hdcp kvm drm_display_helper snd_seq_midi snd_seq_midi_event cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd rc_core cryptd rapl snd_rawmidi drm_kms_helper binfmt_misc intel_cstate i2c_algo_bit joydev snd_seq snd_seq_device syscopyarea wmi_bmof snd_timer sysfillrect input_leds snd ee1004 sysimgblt mei_me soundcore > mei intel_pch_thermal mac_hid acpi_tad acpi_pad sch_fq_codel msr parport_pc ppdev lp ramoops drm parport reed_solomon efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid rtsx_pci_sdmmc crc32_pclmul ahci e1000e i2c_i801 i2c_smbus rtsx_pci xhci_pci libahci xhci_pci_renesas video wmi > CR2: ffffb24d403e5010 > ---[ end trace 0000000000000000 ]--- > > This happens because scheduled pm_runtime_idle() is not cancelled. > > So before releasing the device, stop all runtime power managements by > using pm_runtime_barrier() to fix the issue. > > Link: https://lore.kernel.org/all/2ce258f371234b1f8a1a470d5488d00e@realtek.com/ > Cc: Ricky Wu <ricky_wu@realtek.com> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> What commit id does this fix? Should it also go to stable kernels? > --- > v3: > Move the change the device driver core. > > v2: > Cover more cases than just pciehp. > > drivers/base/dd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index 85152537dbf1..38c815e2b3a2 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -1244,6 +1244,7 @@ static void __device_release_driver(struct device *dev, struct device *parent) > > drv = dev->driver; > if (drv) { > + pm_runtime_barrier(dev); > pm_runtime_get_sync(dev); This is going to affect every device, are you sure about that? And shouldn't we be checking the return value? Wait, why does EVERYONE ignore the return value of this call? thanks, greg k-h
> When inserting an SD7.0 card to Realtek card reader, the card reader > unplugs itself and morph into a NVMe device. The slot Link down on hot > unplugged can cause the following error: > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > BUG: unable to handle page fault for address: ffffb24d403e5010 > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > Oops: 0000 [#1] PREEMPT SMP PTI > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, > BIOS P3.40 10/25/2018 > Workqueue: pm pm_runtime_work > RIP: 0010:ioread32+0x2e/0x70 > Code: ff 03 00 77 25 48 81 ff 00 00 01 00 77 14 8b 15 08 d9 54 01 b8 ff ff ff ff > 85 d2 75 14 c3 cc cc cc cc 89 fa ed c3 cc cc cc cc <8b> 07 c3 cc cc cc cc 55 83 > ea 01 48 89 fe 48 c7 c7 98 6f 15 99 48 > RSP: 0018:ffffb24d40a5bd78 EFLAGS: 00010296 > RAX: ffffb24d403e5000 RBX: 0000000000000152 RCX: 000000000000007f > RDX: 000000000000ff00 RSI: ffffb24d403e5010 RDI: ffffb24d403e5010 > RBP: ffffb24d40a5bd98 R08: ffffb24d403e5010 R09: 0000000000000000 > R10: ffff9074cd95e7f4 R11: 0000000000000003 R12: 000000000000007f > R13: ffff9074e1a68c00 R14: ffff9074e1a68d00 R15: 0000000000009003 > FS: 0000000000000000(0000) GS:ffff90752a180000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffb24d403e5010 CR3: 0000000152832006 CR4: 00000000003706e0 > Call Trace: > <TASK> > ? show_regs+0x68/0x70 > ? __die_body+0x20/0x70 > ? __die+0x2b/0x40 > ? page_fault_oops+0x160/0x480 > ? search_bpf_extables+0x63/0x90 > ? ioread32+0x2e/0x70 > ? search_exception_tables+0x5f/0x70 > ? kernelmode_fixup_or_oops+0xa2/0x120 > ? __bad_area_nosemaphore+0x179/0x230 > ? bad_area_nosemaphore+0x16/0x20 > ? do_kern_addr_fault+0x8b/0xa0 > ? exc_page_fault+0xe5/0x180 > ? asm_exc_page_fault+0x27/0x30 > ? ioread32+0x2e/0x70 > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > pci_pm_runtime_idle+0x34/0x70 > rpm_idle+0xc4/0x2b0 > pm_runtime_work+0x93/0xc0 > process_one_work+0x21a/0x430 > worker_thread+0x4a/0x3c0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x106/0x140 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x29/0x50 > </TASK> > Modules linked in: nvme nvme_core snd_hda_codec_hdmi > snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_hda_codec_realtek > snd_hda_codec_generic snd_soc_hdac_hda soundwire_intel ledtrig_audio > nls_iso8859_1 soundwire_generic_allocation soundwire_cadence > snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp > snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match > snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus > snd_pcm_dmaengine snd_hda_intel i915 snd_intel_dspcfg snd_intel_sdw_acpi > intel_rapl_msr snd_hda_codec intel_rapl_common snd_hda_core > x86_pkg_temp_thermal intel_powerclamp snd_hwdep coretemp snd_pcm > kvm_intel drm_buddy ttm mei_hdcp kvm drm_display_helper snd_seq_midi > snd_seq_midi_event cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 > aesni_intel crypto_simd rc_core cryptd rapl snd_rawmidi drm_kms_helper > binfmt_misc intel_cstate i2c_algo_bit joydev snd_seq snd_seq_device > syscopyarea wmi_bmof snd_timer sysfillrect input_leds snd ee1004 sysimgblt > mei_me soundcore > mei intel_pch_thermal mac_hid acpi_tad acpi_pad sch_fq_codel msr > parport_pc ppdev lp ramoops drm parport reed_solomon efi_pstore ip_tables > x_tables autofs4 hid_generic usbhid hid rtsx_pci_sdmmc crc32_pclmul ahci > e1000e i2c_i801 i2c_smbus rtsx_pci xhci_pci libahci xhci_pci_renesas video > wmi > CR2: ffffb24d403e5010 > ---[ end trace 0000000000000000 ]--- > > This happens because scheduled pm_runtime_idle() is not cancelled. > > So before releasing the device, stop all runtime power managements by > using pm_runtime_barrier() to fix the issue. > > Link: > https://lore.kernel.org/all/2ce258f371234b1f8a1a470d5488d00e@realtek.com > / > Cc: Ricky Wu <ricky_wu@realtek.com> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Ricky Wu <ricky_wu@realtek.com> I tested this patch, the result is GOOD. This issue is not happening with multiple plugging and unplugging > --- > v3: > Move the change the device driver core. > > v2: > Cover more cases than just pciehp. > > drivers/base/dd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index 85152537dbf1..38c815e2b3a2 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -1244,6 +1244,7 @@ static void __device_release_driver(struct device > *dev, struct device *parent) > > drv = dev->driver; > if (drv) { > + pm_runtime_barrier(dev); > pm_runtime_get_sync(dev); > > while (device_links_busy(dev)) { > -- > 2.34.1
[+to Rafael, can you comment on whether this is the right fix for the .remove() vs .runtime_idle() race?] On Thu, Feb 29, 2024 at 02:22:00PM +0800, Kai-Heng Feng wrote: > When inserting an SD7.0 card to Realtek card reader, the card reader > unplugs itself and morph into a NVMe device. The slot Link down on hot > unplugged can cause the following error: > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > BUG: unable to handle page fault for address: ffffb24d403e5010 > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > Oops: 0000 [#1] PREEMPT SMP PTI > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > Workqueue: pm pm_runtime_work > RIP: 0010:ioread32+0x2e/0x70 > Code: ff 03 00 77 25 48 81 ff 00 00 01 00 77 14 8b 15 08 d9 54 01 b8 ff ff ff ff 85 d2 75 14 c3 cc cc cc cc 89 fa ed c3 cc cc cc cc <8b> 07 c3 cc cc cc cc 55 83 ea 01 48 89 fe 48 c7 c7 98 6f 15 99 48 > RSP: 0018:ffffb24d40a5bd78 EFLAGS: 00010296 > RAX: ffffb24d403e5000 RBX: 0000000000000152 RCX: 000000000000007f > RDX: 000000000000ff00 RSI: ffffb24d403e5010 RDI: ffffb24d403e5010 > RBP: ffffb24d40a5bd98 R08: ffffb24d403e5010 R09: 0000000000000000 > R10: ffff9074cd95e7f4 R11: 0000000000000003 R12: 000000000000007f > R13: ffff9074e1a68c00 R14: ffff9074e1a68d00 R15: 0000000000009003 > FS: 0000000000000000(0000) GS:ffff90752a180000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffb24d403e5010 CR3: 0000000152832006 CR4: 00000000003706e0 > Call Trace: > <TASK> > ? show_regs+0x68/0x70 > ? __die_body+0x20/0x70 > ? __die+0x2b/0x40 > ? page_fault_oops+0x160/0x480 > ? search_bpf_extables+0x63/0x90 > ? ioread32+0x2e/0x70 > ? search_exception_tables+0x5f/0x70 > ? kernelmode_fixup_or_oops+0xa2/0x120 > ? __bad_area_nosemaphore+0x179/0x230 > ? bad_area_nosemaphore+0x16/0x20 > ? do_kern_addr_fault+0x8b/0xa0 > ? exc_page_fault+0xe5/0x180 > ? asm_exc_page_fault+0x27/0x30 > ? ioread32+0x2e/0x70 > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > pci_pm_runtime_idle+0x34/0x70 > rpm_idle+0xc4/0x2b0 > pm_runtime_work+0x93/0xc0 > process_one_work+0x21a/0x430 > worker_thread+0x4a/0x3c0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x106/0x140 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x29/0x50 > </TASK> > Modules linked in: nvme nvme_core snd_hda_codec_hdmi snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_hda_codec_realtek snd_hda_codec_generic snd_soc_hdac_hda soundwire_intel ledtrig_audio nls_iso8859_1 soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel i915 snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr snd_hda_codec intel_rapl_common snd_hda_core x86_pkg_temp_thermal intel_powerclamp snd_hwdep coretemp snd_pcm kvm_intel drm_buddy ttm mei_hdcp kvm drm_display_helper snd_seq_midi snd_seq_midi_event cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd rc_core cryptd rapl snd_rawmidi drm_kms_helper binfmt_misc intel_cstate i2c_algo_bit joydev snd_seq snd_seq_device syscopyarea wmi_bmof snd_timer sysfillrect input_leds snd ee1004 sysimgblt mei_me soundcore > mei intel_pch_thermal mac_hid acpi_tad acpi_pad sch_fq_codel msr parport_pc ppdev lp ramoops drm parport reed_solomon efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid rtsx_pci_sdmmc crc32_pclmul ahci e1000e i2c_i801 i2c_smbus rtsx_pci xhci_pci libahci xhci_pci_renesas video wmi The module list is a big distraction and isn't relevant to this issue. > CR2: ffffb24d403e5010 > ---[ end trace 0000000000000000 ]--- > > This happens because scheduled pm_runtime_idle() is not cancelled. I think it would be useful to include a little more background about how we got here. In particular, I think we got here because .runtime_idle() raced with .remove(): - rtsx_pci_runtime_idle() did iowrite32(pcr->remap_addr + RTSX_HAIMR) in rtsx_pci_write_register() successfully - rtsx_pci_remove() iounmapped pcr->remap_addr - rtsx_pci_runtime_idle() did ioread32(pcr->remap_addr + RTSX_HAIMR) in rtsx_pci_write_register(), which faulted The write and the read access the same register, but the write was successful and we faulted on the *read* (see [1]), which means rtsx_pci_runtime_idle() started execution first, and rtsx_pci_remove() raced with it and happened to unmap pcr->remap_addr (see [2]) between the write and the read. It looks like this kind of race between .runtime_idle() and .remove() could happen with any driver. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/misc/cardreader/rtsx_pcr.c?id=v6.7#n164 [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/misc/cardreader/rtsx_pcr.c?id=v6.7#n1633 > So before releasing the device, stop all runtime power managements by > using pm_runtime_barrier() to fix the issue. > > Link: https://lore.kernel.org/all/2ce258f371234b1f8a1a470d5488d00e@realtek.com/ > Cc: Ricky Wu <ricky_wu@realtek.com> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> > --- > v3: > Move the change the device driver core. > > v2: > Cover more cases than just pciehp. > > drivers/base/dd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index 85152537dbf1..38c815e2b3a2 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -1244,6 +1244,7 @@ static void __device_release_driver(struct device *dev, struct device *parent) > > drv = dev->driver; > if (drv) { > + pm_runtime_barrier(dev); > pm_runtime_get_sync(dev); > > while (device_links_busy(dev)) { > -- > 2.34.1 >
On Thu, Feb 29, 2024 at 8:21 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > [+to Rafael, can you comment on whether this is the right fix for the > .remove() vs .runtime_idle() race?] It doesn't seem so. pm_runtime_get_sync() is expected to cancel pending pm_runtime_idle() in all cases, so this looks like PM-runtime core issue. Let me have a look at it. > On Thu, Feb 29, 2024 at 02:22:00PM +0800, Kai-Heng Feng wrote: > > When inserting an SD7.0 card to Realtek card reader, the card reader > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > unplugged can cause the following error: > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > Oops: 0000 [#1] PREEMPT SMP PTI > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > Workqueue: pm pm_runtime_work > > RIP: 0010:ioread32+0x2e/0x70 > > Code: ff 03 00 77 25 48 81 ff 00 00 01 00 77 14 8b 15 08 d9 54 01 b8 ff ff ff ff 85 d2 75 14 c3 cc cc cc cc 89 fa ed c3 cc cc cc cc <8b> 07 c3 cc cc cc cc 55 83 ea 01 48 89 fe 48 c7 c7 98 6f 15 99 48 > > RSP: 0018:ffffb24d40a5bd78 EFLAGS: 00010296 > > RAX: ffffb24d403e5000 RBX: 0000000000000152 RCX: 000000000000007f > > RDX: 000000000000ff00 RSI: ffffb24d403e5010 RDI: ffffb24d403e5010 > > RBP: ffffb24d40a5bd98 R08: ffffb24d403e5010 R09: 0000000000000000 > > R10: ffff9074cd95e7f4 R11: 0000000000000003 R12: 000000000000007f > > R13: ffff9074e1a68c00 R14: ffff9074e1a68d00 R15: 0000000000009003 > > FS: 0000000000000000(0000) GS:ffff90752a180000(0000) knlGS:0000000000000000 > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > CR2: ffffb24d403e5010 CR3: 0000000152832006 CR4: 00000000003706e0 > > Call Trace: > > <TASK> > > ? show_regs+0x68/0x70 > > ? __die_body+0x20/0x70 > > ? __die+0x2b/0x40 > > ? page_fault_oops+0x160/0x480 > > ? search_bpf_extables+0x63/0x90 > > ? ioread32+0x2e/0x70 > > ? search_exception_tables+0x5f/0x70 > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > ? __bad_area_nosemaphore+0x179/0x230 > > ? bad_area_nosemaphore+0x16/0x20 > > ? do_kern_addr_fault+0x8b/0xa0 > > ? exc_page_fault+0xe5/0x180 > > ? asm_exc_page_fault+0x27/0x30 > > ? ioread32+0x2e/0x70 > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > pci_pm_runtime_idle+0x34/0x70 > > rpm_idle+0xc4/0x2b0 > > pm_runtime_work+0x93/0xc0 > > process_one_work+0x21a/0x430 > > worker_thread+0x4a/0x3c0 > > ? __pfx_worker_thread+0x10/0x10 > > kthread+0x106/0x140 > > ? __pfx_kthread+0x10/0x10 > > ret_from_fork+0x29/0x50 > > </TASK> > > Modules linked in: nvme nvme_core snd_hda_codec_hdmi snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_hda_codec_realtek snd_hda_codec_generic snd_soc_hdac_hda soundwire_intel ledtrig_audio nls_iso8859_1 soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel i915 snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr snd_hda_codec intel_rapl_common snd_hda_core x86_pkg_temp_thermal intel_powerclamp snd_hwdep coretemp snd_pcm kvm_intel drm_buddy ttm mei_hdcp kvm drm_display_helper snd_seq_midi snd_seq_midi_event cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd rc_core cryptd rapl snd_rawmidi drm_kms_helper binfmt_misc intel_cstate i2c_algo_bit joydev snd_seq snd_seq_device syscopyarea wmi_bmof snd_timer sysfillrect input_leds snd ee1004 sysimgblt mei_me soundcore > > mei intel_pch_thermal mac_hid acpi_tad acpi_pad sch_fq_codel msr parport_pc ppdev lp ramoops drm parport reed_solomon efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid rtsx_pci_sdmmc crc32_pclmul ahci e1000e i2c_i801 i2c_smbus rtsx_pci xhci_pci libahci xhci_pci_renesas video wmi > > The module list is a big distraction and isn't relevant to this issue. > > > CR2: ffffb24d403e5010 > > ---[ end trace 0000000000000000 ]--- > > > > This happens because scheduled pm_runtime_idle() is not cancelled. > > I think it would be useful to include a little more background about > how we got here. In particular, I think we got here because > .runtime_idle() raced with .remove(): > > - rtsx_pci_runtime_idle() did iowrite32(pcr->remap_addr + RTSX_HAIMR) > in rtsx_pci_write_register() successfully > > - rtsx_pci_remove() iounmapped pcr->remap_addr > > - rtsx_pci_runtime_idle() did ioread32(pcr->remap_addr + RTSX_HAIMR) > in rtsx_pci_write_register(), which faulted > > The write and the read access the same register, but the write was > successful and we faulted on the *read* (see [1]), which means > rtsx_pci_runtime_idle() started execution first, and rtsx_pci_remove() > raced with it and happened to unmap pcr->remap_addr (see [2]) between > the write and the read. > > It looks like this kind of race between .runtime_idle() and .remove() > could happen with any driver. > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/misc/cardreader/rtsx_pcr.c?id=v6.7#n164 > [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/misc/cardreader/rtsx_pcr.c?id=v6.7#n1633 > > > So before releasing the device, stop all runtime power managements by > > using pm_runtime_barrier() to fix the issue. > > > > Link: https://lore.kernel.org/all/2ce258f371234b1f8a1a470d5488d00e@realtek.com/ > > Cc: Ricky Wu <ricky_wu@realtek.com> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> > > --- > > v3: > > Move the change the device driver core. > > > > v2: > > Cover more cases than just pciehp. > > > > drivers/base/dd.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > > index 85152537dbf1..38c815e2b3a2 100644 > > --- a/drivers/base/dd.c > > +++ b/drivers/base/dd.c > > @@ -1244,6 +1244,7 @@ static void __device_release_driver(struct device *dev, struct device *parent) > > > > drv = dev->driver; > > if (drv) { > > + pm_runtime_barrier(dev); > > pm_runtime_get_sync(dev); > > > > while (device_links_busy(dev)) { > > -- > > 2.34.1 > >
On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > When inserting an SD7.0 card to Realtek card reader, the card reader > unplugs itself and morph into a NVMe device. The slot Link down on hot > unplugged can cause the following error: > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > BUG: unable to handle page fault for address: ffffb24d403e5010 > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > Oops: 0000 [#1] PREEMPT SMP PTI > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > Workqueue: pm pm_runtime_work > RIP: 0010:ioread32+0x2e/0x70 > Code: ff 03 00 77 25 48 81 ff 00 00 01 00 77 14 8b 15 08 d9 54 01 b8 ff ff ff ff 85 d2 75 14 c3 cc cc cc cc 89 fa ed c3 cc cc cc cc <8b> 07 c3 cc cc cc cc 55 83 ea 01 48 89 fe 48 c7 c7 98 6f 15 99 48 > RSP: 0018:ffffb24d40a5bd78 EFLAGS: 00010296 > RAX: ffffb24d403e5000 RBX: 0000000000000152 RCX: 000000000000007f > RDX: 000000000000ff00 RSI: ffffb24d403e5010 RDI: ffffb24d403e5010 > RBP: ffffb24d40a5bd98 R08: ffffb24d403e5010 R09: 0000000000000000 > R10: ffff9074cd95e7f4 R11: 0000000000000003 R12: 000000000000007f > R13: ffff9074e1a68c00 R14: ffff9074e1a68d00 R15: 0000000000009003 > FS: 0000000000000000(0000) GS:ffff90752a180000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffb24d403e5010 CR3: 0000000152832006 CR4: 00000000003706e0 > Call Trace: > <TASK> > ? show_regs+0x68/0x70 > ? __die_body+0x20/0x70 > ? __die+0x2b/0x40 > ? page_fault_oops+0x160/0x480 > ? search_bpf_extables+0x63/0x90 > ? ioread32+0x2e/0x70 > ? search_exception_tables+0x5f/0x70 > ? kernelmode_fixup_or_oops+0xa2/0x120 > ? __bad_area_nosemaphore+0x179/0x230 > ? bad_area_nosemaphore+0x16/0x20 > ? do_kern_addr_fault+0x8b/0xa0 > ? exc_page_fault+0xe5/0x180 > ? asm_exc_page_fault+0x27/0x30 > ? ioread32+0x2e/0x70 > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > pci_pm_runtime_idle+0x34/0x70 > rpm_idle+0xc4/0x2b0 > pm_runtime_work+0x93/0xc0 > process_one_work+0x21a/0x430 > worker_thread+0x4a/0x3c0 > ? __pfx_worker_thread+0x10/0x10 > kthread+0x106/0x140 > ? __pfx_kthread+0x10/0x10 > ret_from_fork+0x29/0x50 > </TASK> > Modules linked in: nvme nvme_core snd_hda_codec_hdmi snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_hda_codec_realtek snd_hda_codec_generic snd_soc_hdac_hda soundwire_intel ledtrig_audio nls_iso8859_1 soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel i915 snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr snd_hda_codec intel_rapl_common snd_hda_core x86_pkg_temp_thermal intel_powerclamp snd_hwdep coretemp snd_pcm kvm_intel drm_buddy ttm mei_hdcp kvm drm_display_helper snd_seq_midi snd_seq_midi_event cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd rc_core cryptd rapl snd_rawmidi drm_kms_helper binfmt_misc intel_cstate i2c_algo_bit joydev snd_seq snd_seq_device syscopyarea wmi_bmof snd_timer sysfillrect input_leds snd ee1004 sysimgblt mei_me soundcore > mei intel_pch_thermal mac_hid acpi_tad acpi_pad sch_fq_codel msr parport_pc ppdev lp ramoops drm parport reed_solomon efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid rtsx_pci_sdmmc crc32_pclmul ahci e1000e i2c_i801 i2c_smbus rtsx_pci xhci_pci libahci xhci_pci_renesas video wmi > CR2: ffffb24d403e5010 > ---[ end trace 0000000000000000 ]--- > > This happens because scheduled pm_runtime_idle() is not cancelled. But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if pm_runtime_work() sees this, it will not run rpm_idle(). However, rpm_resume() doesn't deactivate the autosuspend timer if it is running (see the comment in rpm_resume() regarding this), so it may queue up a runtime PM work later. If this is not desirable, you need to stop the autosuspend timer explicitly in addition to calling pm_runtime_get_sync(). > So before releasing the device, stop all runtime power managements by > using pm_runtime_barrier() to fix the issue. > > Link: https://lore.kernel.org/all/2ce258f371234b1f8a1a470d5488d00e@realtek.com/ > Cc: Ricky Wu <ricky_wu@realtek.com> > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> > --- > v3: > Move the change the device driver core. > > v2: > Cover more cases than just pciehp. > > drivers/base/dd.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > index 85152537dbf1..38c815e2b3a2 100644 > --- a/drivers/base/dd.c > +++ b/drivers/base/dd.c > @@ -1244,6 +1244,7 @@ static void __device_release_driver(struct device *dev, struct device *parent) > > drv = dev->driver; > if (drv) { > + pm_runtime_barrier(dev); This prevents the crash from occurring because pm_runtime_barrier() calls pm_runtime_deactivate_timer() unconditionally AFAICS. > > while (device_links_busy(dev)) { > -- Overall, the issue appears to be in the driver that forgets to deactivate the autosuspend timer in its remove callback.
On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > <kai.heng.feng@canonical.com> wrote: > > > > When inserting an SD7.0 card to Realtek card reader, the card reader > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > unplugged can cause the following error: > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > Oops: 0000 [#1] PREEMPT SMP PTI > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > Workqueue: pm pm_runtime_work > > RIP: 0010:ioread32+0x2e/0x70 > ... > > Call Trace: > > <TASK> > > ? show_regs+0x68/0x70 > > ? __die_body+0x20/0x70 > > ? __die+0x2b/0x40 > > ? page_fault_oops+0x160/0x480 > > ? search_bpf_extables+0x63/0x90 > > ? ioread32+0x2e/0x70 > > ? search_exception_tables+0x5f/0x70 > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > ? __bad_area_nosemaphore+0x179/0x230 > > ? bad_area_nosemaphore+0x16/0x20 > > ? do_kern_addr_fault+0x8b/0xa0 > > ? exc_page_fault+0xe5/0x180 > > ? asm_exc_page_fault+0x27/0x30 > > ? ioread32+0x2e/0x70 > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > pci_pm_runtime_idle+0x34/0x70 > > rpm_idle+0xc4/0x2b0 > > pm_runtime_work+0x93/0xc0 > > process_one_work+0x21a/0x430 > > worker_thread+0x4a/0x3c0 > ... > > This happens because scheduled pm_runtime_idle() is not cancelled. > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if > pm_runtime_work() sees this, it will not run rpm_idle(). > > However, rpm_resume() doesn't deactivate the autosuspend timer if it > is running (see the comment in rpm_resume() regarding this), so it may > queue up a runtime PM work later. > > If this is not desirable, you need to stop the autosuspend timer > explicitly in addition to calling pm_runtime_get_sync(). I don't quite follow all this. I think the race is between rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). rtsx_pci_remove() { pm_runtime_get_sync() pm_runtime_forbid() ... If this is an rtsx bug, what exactly should be added to rtsx_pci_remove()? Is there ever a case where we want any runtime PM work to happen during or after a driver .remove()? If not, maybe the driver core should prevent that, which I think is basically what this patch does. If this is an rtsx driver bug, I'm concerned there may be many other drivers with a similar issue. rtsx exercises this path more than most because the device switches between card reader and NVMe SSD using hotplug add/remove based on whether an SD card is inserted (see [1]). [1] https://lore.kernel.org/r/20231019143504.GA25140@wunner.de > > So before releasing the device, stop all runtime power managements by > > using pm_runtime_barrier() to fix the issue. > > > > Link: https://lore.kernel.org/all/2ce258f371234b1f8a1a470d5488d00e@realtek.com/ > > Cc: Ricky Wu <ricky_wu@realtek.com> > > Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> > > --- > > v3: > > Move the change the device driver core. > > > > v2: > > Cover more cases than just pciehp. > > > > drivers/base/dd.c | 1 + > > 1 file changed, 1 insertion(+) > > > > diff --git a/drivers/base/dd.c b/drivers/base/dd.c > > index 85152537dbf1..38c815e2b3a2 100644 > > --- a/drivers/base/dd.c > > +++ b/drivers/base/dd.c > > @@ -1244,6 +1244,7 @@ static void __device_release_driver(struct device *dev, struct device *parent) > > > > drv = dev->driver; > > if (drv) { > > + pm_runtime_barrier(dev); > > This prevents the crash from occurring because pm_runtime_barrier() > calls pm_runtime_deactivate_timer() unconditionally AFAICS.
On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > <kai.heng.feng@canonical.com> wrote: > > > > > > When inserting an SD7.0 card to Realtek card reader, the card reader > > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > > unplugged can cause the following error: > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > Workqueue: pm pm_runtime_work > > > RIP: 0010:ioread32+0x2e/0x70 > > ... > > > Call Trace: > > > <TASK> > > > ? show_regs+0x68/0x70 > > > ? __die_body+0x20/0x70 > > > ? __die+0x2b/0x40 > > > ? page_fault_oops+0x160/0x480 > > > ? search_bpf_extables+0x63/0x90 > > > ? ioread32+0x2e/0x70 > > > ? search_exception_tables+0x5f/0x70 > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > ? __bad_area_nosemaphore+0x179/0x230 > > > ? bad_area_nosemaphore+0x16/0x20 > > > ? do_kern_addr_fault+0x8b/0xa0 > > > ? exc_page_fault+0xe5/0x180 > > > ? asm_exc_page_fault+0x27/0x30 > > > ? ioread32+0x2e/0x70 > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > pci_pm_runtime_idle+0x34/0x70 > > > rpm_idle+0xc4/0x2b0 > > > pm_runtime_work+0x93/0xc0 > > > process_one_work+0x21a/0x430 > > > worker_thread+0x4a/0x3c0 > > ... > > > > This happens because scheduled pm_runtime_idle() is not cancelled. > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if it > > is running (see the comment in rpm_resume() regarding this), so it may > > queue up a runtime PM work later. > > > > If this is not desirable, you need to stop the autosuspend timer > > explicitly in addition to calling pm_runtime_get_sync(). > > I don't quite follow all this. I think the race is between > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). I think so too and the latter is not expected to run. > rtsx_pci_remove() > { > pm_runtime_get_sync() > pm_runtime_forbid() > ... > > If this is an rtsx bug, what exactly should be added to > rtsx_pci_remove()? > > Is there ever a case where we want any runtime PM work to happen > during or after a driver .remove()? If not, maybe the driver core > should prevent that, which I think is basically what this patch does. No, it is not, because it doesn't actually prevent the race from occurring, it just narrows the window quite a bit. It would be better to call pm_runtime_dont_use_autosuspend() instead of pm_runtime_barrier(). > If this is an rtsx driver bug, I'm concerned there may be many other > drivers with a similar issue. rtsx exercises this path more than most > because the device switches between card reader and NVMe SSD using > hotplug add/remove based on whether an SD card is inserted (see [1]). This is a valid concern, so it is mostly a matter of where to disable autosuspend. It may be the driver core in principle, but note that it calls ->remove() after invoking pm_runtime_put_sync(), so why would it disable autosuspend when it allows runtime PM to race with device removal in general? Another way might be to add a pm_runtime_dont_use_autosuspend() call at the beginning of pci_device_remove(). Or just remove the optimization in question from rpm_resume() which is quite confusing and causes people to make assumptions that lead to incorrect behavior in this particular case. So this (modulo GMail-induced whitespace breakage): --- drivers/base/power/runtime.c | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) Index: linux-pm/drivers/base/power/runtime.c =================================================================== --- linux-pm.orig/drivers/base/power/runtime.c +++ linux-pm/drivers/base/power/runtime.c @@ -782,15 +782,8 @@ static int rpm_resume(struct device *dev if (retval) goto out; - /* - * Other scheduled or pending requests need to be canceled. Small - * optimization: If an autosuspend timer is running, leave it running - * rather than cancelling it now only to restart it again in the near - * future. - */ dev->power.request = RPM_REQ_NONE; - if (!dev->power.timer_autosuspends) - pm_runtime_deactivate_timer(dev); + pm_runtime_deactivate_timer(dev); if (dev->power.runtime_status == RPM_ACTIVE) { retval = 1;
On Mon, Mar 4, 2024 at 5:41 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > > <kai.heng.feng@canonical.com> wrote: > > > > > > > > When inserting an SD7.0 card to Realtek card reader, the card reader > > > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > > > unplugged can cause the following error: > > > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > > Workqueue: pm pm_runtime_work > > > > RIP: 0010:ioread32+0x2e/0x70 > > > ... > > > > Call Trace: > > > > <TASK> > > > > ? show_regs+0x68/0x70 > > > > ? __die_body+0x20/0x70 > > > > ? __die+0x2b/0x40 > > > > ? page_fault_oops+0x160/0x480 > > > > ? search_bpf_extables+0x63/0x90 > > > > ? ioread32+0x2e/0x70 > > > > ? search_exception_tables+0x5f/0x70 > > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > > ? __bad_area_nosemaphore+0x179/0x230 > > > > ? bad_area_nosemaphore+0x16/0x20 > > > > ? do_kern_addr_fault+0x8b/0xa0 > > > > ? exc_page_fault+0xe5/0x180 > > > > ? asm_exc_page_fault+0x27/0x30 > > > > ? ioread32+0x2e/0x70 > > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > > pci_pm_runtime_idle+0x34/0x70 > > > > rpm_idle+0xc4/0x2b0 > > > > pm_runtime_work+0x93/0xc0 > > > > process_one_work+0x21a/0x430 > > > > worker_thread+0x4a/0x3c0 > > > ... > > > > > > This happens because scheduled pm_runtime_idle() is not cancelled. > > > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if > > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if it > > > is running (see the comment in rpm_resume() regarding this), so it may > > > queue up a runtime PM work later. > > > > > > If this is not desirable, you need to stop the autosuspend timer > > > explicitly in addition to calling pm_runtime_get_sync(). > > > > I don't quite follow all this. I think the race is between > > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). > > I think so too and the latter is not expected to run. > > > rtsx_pci_remove() > > { > > pm_runtime_get_sync() > > pm_runtime_forbid() > > ... > > > > If this is an rtsx bug, what exactly should be added to > > rtsx_pci_remove()? > > > > Is there ever a case where we want any runtime PM work to happen > > during or after a driver .remove()? If not, maybe the driver core > > should prevent that, which I think is basically what this patch does. > > No, it is not, because it doesn't actually prevent the race from > occurring, it just narrows the window quite a bit. > > It would be better to call pm_runtime_dont_use_autosuspend() instead > of pm_runtime_barrier(). > > > If this is an rtsx driver bug, I'm concerned there may be many other > > drivers with a similar issue. rtsx exercises this path more than most > > because the device switches between card reader and NVMe SSD using > > hotplug add/remove based on whether an SD card is inserted (see [1]). > > This is a valid concern, so it is mostly a matter of where to disable > autosuspend. > > It may be the driver core in principle, but note that it calls > ->remove() after invoking pm_runtime_put_sync(), so why would it > disable autosuspend when it allows runtime PM to race with device > removal in general? > > Another way might be to add a pm_runtime_dont_use_autosuspend() call > at the beginning of pci_device_remove(). > > Or just remove the optimization in question from rpm_resume() which is > quite confusing and causes people to make assumptions that lead to > incorrect behavior in this particular case. Well, scratch this. If rpm_idle() is already running at the time rpm_resume() is called, the latter may return right away without waiting, which is incorrect. rpm_resume() needs to wait for the "idle" callback to complete, so this (again, modulo GMail-induced whitespace mangling) should help: --- drivers/base/power/runtime.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) Index: linux-pm/drivers/base/power/runtime.c =================================================================== --- linux-pm.orig/drivers/base/power/runtime.c +++ linux-pm/drivers/base/power/runtime.c @@ -798,7 +798,8 @@ static int rpm_resume(struct device *dev } if (dev->power.runtime_status == RPM_RESUMING || - dev->power.runtime_status == RPM_SUSPENDING) { + dev->power.runtime_status == RPM_SUSPENDING || + dev->power.idle_notification) { DEFINE_WAIT(wait); if (rpmflags & (RPM_ASYNC | RPM_NOWAIT)) { @@ -826,7 +827,8 @@ static int rpm_resume(struct device *dev prepare_to_wait(&dev->power.wait_queue, &wait, TASK_UNINTERRUPTIBLE); if (dev->power.runtime_status != RPM_RESUMING && - dev->power.runtime_status != RPM_SUSPENDING) + dev->power.runtime_status != RPM_SUSPENDING && + !dev->power.idle_notification) break; spin_unlock_irq(&dev->power.lock);
On Mon, Mar 4, 2024 at 6:00 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Mar 4, 2024 at 5:41 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > > > <kai.heng.feng@canonical.com> wrote: > > > > > > > > > > When inserting an SD7.0 card to Realtek card reader, the card reader > > > > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > > > > unplugged can cause the following error: > > > > > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > > > Workqueue: pm pm_runtime_work > > > > > RIP: 0010:ioread32+0x2e/0x70 > > > > ... > > > > > Call Trace: > > > > > <TASK> > > > > > ? show_regs+0x68/0x70 > > > > > ? __die_body+0x20/0x70 > > > > > ? __die+0x2b/0x40 > > > > > ? page_fault_oops+0x160/0x480 > > > > > ? search_bpf_extables+0x63/0x90 > > > > > ? ioread32+0x2e/0x70 > > > > > ? search_exception_tables+0x5f/0x70 > > > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > > > ? __bad_area_nosemaphore+0x179/0x230 > > > > > ? bad_area_nosemaphore+0x16/0x20 > > > > > ? do_kern_addr_fault+0x8b/0xa0 > > > > > ? exc_page_fault+0xe5/0x180 > > > > > ? asm_exc_page_fault+0x27/0x30 > > > > > ? ioread32+0x2e/0x70 > > > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > > > pci_pm_runtime_idle+0x34/0x70 > > > > > rpm_idle+0xc4/0x2b0 > > > > > pm_runtime_work+0x93/0xc0 > > > > > process_one_work+0x21a/0x430 > > > > > worker_thread+0x4a/0x3c0 > > > > ... > > > > > > > > This happens because scheduled pm_runtime_idle() is not cancelled. > > > > > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if > > > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if it > > > > is running (see the comment in rpm_resume() regarding this), so it may > > > > queue up a runtime PM work later. > > > > > > > > If this is not desirable, you need to stop the autosuspend timer > > > > explicitly in addition to calling pm_runtime_get_sync(). > > > > > > I don't quite follow all this. I think the race is between > > > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). > > > > I think so too and the latter is not expected to run. > > > > > rtsx_pci_remove() > > > { > > > pm_runtime_get_sync() > > > pm_runtime_forbid() > > > ... > > > > > > If this is an rtsx bug, what exactly should be added to > > > rtsx_pci_remove()? > > > > > > Is there ever a case where we want any runtime PM work to happen > > > during or after a driver .remove()? If not, maybe the driver core > > > should prevent that, which I think is basically what this patch does. > > > > No, it is not, because it doesn't actually prevent the race from > > occurring, it just narrows the window quite a bit. > > > > It would be better to call pm_runtime_dont_use_autosuspend() instead > > of pm_runtime_barrier(). > > > > > If this is an rtsx driver bug, I'm concerned there may be many other > > > drivers with a similar issue. rtsx exercises this path more than most > > > because the device switches between card reader and NVMe SSD using > > > hotplug add/remove based on whether an SD card is inserted (see [1]). > > > > This is a valid concern, so it is mostly a matter of where to disable > > autosuspend. > > > > It may be the driver core in principle, but note that it calls > > ->remove() after invoking pm_runtime_put_sync(), so why would it > > disable autosuspend when it allows runtime PM to race with device > > removal in general? > > > > Another way might be to add a pm_runtime_dont_use_autosuspend() call > > at the beginning of pci_device_remove(). > > > > Or just remove the optimization in question from rpm_resume() which is > > quite confusing and causes people to make assumptions that lead to > > incorrect behavior in this particular case. > > Well, scratch this. > > If rpm_idle() is already running at the time rpm_resume() is called, > the latter may return right away without waiting, which is incorrect. > > rpm_resume() needs to wait for the "idle" callback to complete, so > this (again, modulo GMail-induced whitespace mangling) should help: > > --- > drivers/base/power/runtime.c | 6 ++++-- > 1 file changed, 4 insertions(+), 2 deletions(-) > > Index: linux-pm/drivers/base/power/runtime.c > =================================================================== > --- linux-pm.orig/drivers/base/power/runtime.c > +++ linux-pm/drivers/base/power/runtime.c > @@ -798,7 +798,8 @@ static int rpm_resume(struct device *dev > } > > if (dev->power.runtime_status == RPM_RESUMING || > - dev->power.runtime_status == RPM_SUSPENDING) { > + dev->power.runtime_status == RPM_SUSPENDING || > + dev->power.idle_notification) { > DEFINE_WAIT(wait); > > if (rpmflags & (RPM_ASYNC | RPM_NOWAIT)) { > @@ -826,7 +827,8 @@ static int rpm_resume(struct device *dev > prepare_to_wait(&dev->power.wait_queue, &wait, > TASK_UNINTERRUPTIBLE); > if (dev->power.runtime_status != RPM_RESUMING && > - dev->power.runtime_status != RPM_SUSPENDING) > + dev->power.runtime_status != RPM_SUSPENDING && > + !dev->power.idle_notification) > break; > > spin_unlock_irq(&dev->power.lock); Well, not really. The problem is that rtsx_pci_runtime_idle() is not expected to be running after pm_runtime_get_sync(), but the latter doesn't really guarantee that. It only guarantees that the suspend/resume callbacks will not be running after it returns. As I said above, if the ->runtime_idle() callback is already running when pm_runtime_get_sync() runs, the latter will notice that the status is RPM_ACTIVE and will return right away without waiting for the former to complete. In fact, it cannot wait for it to complete, because it may be called from a ->runtime_idle() callback itself (it arguably does not make much sense to do that, but it is not strictly forbidden). So whoever is providing a ->runtime_idle() callback, they need to protect it from running in parallel with whatever code runs after pm_runtime_get_sync(). Note that ->runtime_idle() will not start after pm_runtime_get_sync(), but it may continue running then if it has started earlier already. Calling pm_runtime_barrier() after pm_runtime_get_sync() (not before it) should suffice, but once the runtime PM usage counter is dropped, rpm_idle() may run again, so this is only effective until the usage counter is greater than 1. This means that __device_release_driver(() is not the right place to call it, because the usage counter is dropped before calling device_remove() in that case. The PCI bus type can prevent the race between driver-provided ->runtime_idle() and ->remove() from occurring by adding a pm_runtime_probe() call in the following way: --- drivers/pci/pci-driver.c | 7 +++++++ 1 file changed, 7 insertions(+) Index: linux-pm/drivers/pci/pci-driver.c =================================================================== --- linux-pm.orig/drivers/pci/pci-driver.c +++ linux-pm/drivers/pci/pci-driver.c @@ -473,6 +473,13 @@ static void pci_device_remove(struct dev if (drv->remove) { pm_runtime_get_sync(dev); + /* + * If the driver provides a .runtime_idle() callback and it has + * started to run already, it may continue to run in parallel + * with the code below, so wait until all of the runtime PM + * activity has completed. + */ + pm_runtime_barrier(dev); drv->remove(pci_dev); pm_runtime_put_noidle(dev); }
On Mon, Mar 4, 2024 at 7:10 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Mar 4, 2024 at 6:00 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Mon, Mar 4, 2024 at 5:41 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > > > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > > > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > > > > <kai.heng.feng@canonical.com> wrote: > > > > > > > > > > > > When inserting an SD7.0 card to Realtek card reader, the card reader > > > > > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > > > > > unplugged can cause the following error: > > > > > > > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > > > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > > > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > > > > Workqueue: pm pm_runtime_work > > > > > > RIP: 0010:ioread32+0x2e/0x70 > > > > > ... > > > > > > Call Trace: > > > > > > <TASK> > > > > > > ? show_regs+0x68/0x70 > > > > > > ? __die_body+0x20/0x70 > > > > > > ? __die+0x2b/0x40 > > > > > > ? page_fault_oops+0x160/0x480 > > > > > > ? search_bpf_extables+0x63/0x90 > > > > > > ? ioread32+0x2e/0x70 > > > > > > ? search_exception_tables+0x5f/0x70 > > > > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > > > > ? __bad_area_nosemaphore+0x179/0x230 > > > > > > ? bad_area_nosemaphore+0x16/0x20 > > > > > > ? do_kern_addr_fault+0x8b/0xa0 > > > > > > ? exc_page_fault+0xe5/0x180 > > > > > > ? asm_exc_page_fault+0x27/0x30 > > > > > > ? ioread32+0x2e/0x70 > > > > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > > > > pci_pm_runtime_idle+0x34/0x70 > > > > > > rpm_idle+0xc4/0x2b0 > > > > > > pm_runtime_work+0x93/0xc0 > > > > > > process_one_work+0x21a/0x430 > > > > > > worker_thread+0x4a/0x3c0 > > > > > ... > > > > > > > > > > This happens because scheduled pm_runtime_idle() is not cancelled. > > > > > > > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if > > > > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > > > > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if it > > > > > is running (see the comment in rpm_resume() regarding this), so it may > > > > > queue up a runtime PM work later. > > > > > > > > > > If this is not desirable, you need to stop the autosuspend timer > > > > > explicitly in addition to calling pm_runtime_get_sync(). > > > > > > > > I don't quite follow all this. I think the race is between > > > > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). > > > > > > I think so too and the latter is not expected to run. > > > > > > > rtsx_pci_remove() > > > > { > > > > pm_runtime_get_sync() > > > > pm_runtime_forbid() > > > > ... > > > > > > > > If this is an rtsx bug, what exactly should be added to > > > > rtsx_pci_remove()? > > > > > > > > Is there ever a case where we want any runtime PM work to happen > > > > during or after a driver .remove()? If not, maybe the driver core > > > > should prevent that, which I think is basically what this patch does. > > > > > > No, it is not, because it doesn't actually prevent the race from > > > occurring, it just narrows the window quite a bit. > > > > > > It would be better to call pm_runtime_dont_use_autosuspend() instead > > > of pm_runtime_barrier(). > > > > > > > If this is an rtsx driver bug, I'm concerned there may be many other > > > > drivers with a similar issue. rtsx exercises this path more than most > > > > because the device switches between card reader and NVMe SSD using > > > > hotplug add/remove based on whether an SD card is inserted (see [1]). > > > > > > This is a valid concern, so it is mostly a matter of where to disable > > > autosuspend. > > > > > > It may be the driver core in principle, but note that it calls > > > ->remove() after invoking pm_runtime_put_sync(), so why would it > > > disable autosuspend when it allows runtime PM to race with device > > > removal in general? > > > > > > Another way might be to add a pm_runtime_dont_use_autosuspend() call > > > at the beginning of pci_device_remove(). > > > > > > Or just remove the optimization in question from rpm_resume() which is > > > quite confusing and causes people to make assumptions that lead to > > > incorrect behavior in this particular case. > > > > Well, scratch this. > > > > If rpm_idle() is already running at the time rpm_resume() is called, > > the latter may return right away without waiting, which is incorrect. > > > > rpm_resume() needs to wait for the "idle" callback to complete, so > > this (again, modulo GMail-induced whitespace mangling) should help: > > > > --- > > drivers/base/power/runtime.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > Index: linux-pm/drivers/base/power/runtime.c > > =================================================================== > > --- linux-pm.orig/drivers/base/power/runtime.c > > +++ linux-pm/drivers/base/power/runtime.c > > @@ -798,7 +798,8 @@ static int rpm_resume(struct device *dev > > } > > > > if (dev->power.runtime_status == RPM_RESUMING || > > - dev->power.runtime_status == RPM_SUSPENDING) { > > + dev->power.runtime_status == RPM_SUSPENDING || > > + dev->power.idle_notification) { > > DEFINE_WAIT(wait); > > > > if (rpmflags & (RPM_ASYNC | RPM_NOWAIT)) { > > @@ -826,7 +827,8 @@ static int rpm_resume(struct device *dev > > prepare_to_wait(&dev->power.wait_queue, &wait, > > TASK_UNINTERRUPTIBLE); > > if (dev->power.runtime_status != RPM_RESUMING && > > - dev->power.runtime_status != RPM_SUSPENDING) > > + dev->power.runtime_status != RPM_SUSPENDING && > > + !dev->power.idle_notification) > > break; > > > > spin_unlock_irq(&dev->power.lock); > > Well, not really. > > The problem is that rtsx_pci_runtime_idle() is not expected to be > running after pm_runtime_get_sync(), but the latter doesn't really > guarantee that. It only guarantees that the suspend/resume callbacks > will not be running after it returns. > > As I said above, if the ->runtime_idle() callback is already running > when pm_runtime_get_sync() runs, the latter will notice that the > status is RPM_ACTIVE and will return right away without waiting for > the former to complete. In fact, it cannot wait for it to complete, > because it may be called from a ->runtime_idle() callback itself (it > arguably does not make much sense to do that, but it is not strictly > forbidden). > > So whoever is providing a ->runtime_idle() callback, they need to > protect it from running in parallel with whatever code runs after > pm_runtime_get_sync(). Note that ->runtime_idle() will not start > after pm_runtime_get_sync(), but it may continue running then if it > has started earlier already. > > Calling pm_runtime_barrier() after pm_runtime_get_sync() (not before > it) should suffice, but once the runtime PM usage counter is dropped, > rpm_idle() may run again, so this is only effective until the usage > counter is greater than 1. This means that > __device_release_driver(() is not the right place to call it, because > the usage counter is dropped before calling device_remove() in that > case. > > The PCI bus type can prevent the race between driver-provided > ->runtime_idle() and ->remove() from occurring by adding a > pm_runtime_probe() call in the following way: s/pm_runtime_probe/pm_runtime_barrier/ (sorry) The patchlet below is correct, though. > --- > drivers/pci/pci-driver.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > Index: linux-pm/drivers/pci/pci-driver.c > =================================================================== > --- linux-pm.orig/drivers/pci/pci-driver.c > +++ linux-pm/drivers/pci/pci-driver.c > @@ -473,6 +473,13 @@ static void pci_device_remove(struct dev > > if (drv->remove) { > pm_runtime_get_sync(dev); > + /* > + * If the driver provides a .runtime_idle() callback and it has > + * started to run already, it may continue to run in parallel > + * with the code below, so wait until all of the runtime PM > + * activity has completed. > + */ > + pm_runtime_barrier(dev); > drv->remove(pci_dev); > pm_runtime_put_noidle(dev); > }
On Tue, Mar 5, 2024 at 2:10 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > On Mon, Mar 4, 2024 at 6:00 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Mon, Mar 4, 2024 at 5:41 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > > > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > > > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > > > > <kai.heng.feng@canonical.com> wrote: > > > > > > > > > > > > When inserting an SD7.0 card to Realtek card reader, the card reader > > > > > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > > > > > unplugged can cause the following error: > > > > > > > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > > > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > > > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > > > > Workqueue: pm pm_runtime_work > > > > > > RIP: 0010:ioread32+0x2e/0x70 > > > > > ... > > > > > > Call Trace: > > > > > > <TASK> > > > > > > ? show_regs+0x68/0x70 > > > > > > ? __die_body+0x20/0x70 > > > > > > ? __die+0x2b/0x40 > > > > > > ? page_fault_oops+0x160/0x480 > > > > > > ? search_bpf_extables+0x63/0x90 > > > > > > ? ioread32+0x2e/0x70 > > > > > > ? search_exception_tables+0x5f/0x70 > > > > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > > > > ? __bad_area_nosemaphore+0x179/0x230 > > > > > > ? bad_area_nosemaphore+0x16/0x20 > > > > > > ? do_kern_addr_fault+0x8b/0xa0 > > > > > > ? exc_page_fault+0xe5/0x180 > > > > > > ? asm_exc_page_fault+0x27/0x30 > > > > > > ? ioread32+0x2e/0x70 > > > > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > > > > pci_pm_runtime_idle+0x34/0x70 > > > > > > rpm_idle+0xc4/0x2b0 > > > > > > pm_runtime_work+0x93/0xc0 > > > > > > process_one_work+0x21a/0x430 > > > > > > worker_thread+0x4a/0x3c0 > > > > > ... > > > > > > > > > > This happens because scheduled pm_runtime_idle() is not cancelled. > > > > > > > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if > > > > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > > > > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if it > > > > > is running (see the comment in rpm_resume() regarding this), so it may > > > > > queue up a runtime PM work later. > > > > > > > > > > If this is not desirable, you need to stop the autosuspend timer > > > > > explicitly in addition to calling pm_runtime_get_sync(). > > > > > > > > I don't quite follow all this. I think the race is between > > > > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). > > > > > > I think so too and the latter is not expected to run. > > > > > > > rtsx_pci_remove() > > > > { > > > > pm_runtime_get_sync() > > > > pm_runtime_forbid() > > > > ... > > > > > > > > If this is an rtsx bug, what exactly should be added to > > > > rtsx_pci_remove()? > > > > > > > > Is there ever a case where we want any runtime PM work to happen > > > > during or after a driver .remove()? If not, maybe the driver core > > > > should prevent that, which I think is basically what this patch does. > > > > > > No, it is not, because it doesn't actually prevent the race from > > > occurring, it just narrows the window quite a bit. > > > > > > It would be better to call pm_runtime_dont_use_autosuspend() instead > > > of pm_runtime_barrier(). > > > > > > > If this is an rtsx driver bug, I'm concerned there may be many other > > > > drivers with a similar issue. rtsx exercises this path more than most > > > > because the device switches between card reader and NVMe SSD using > > > > hotplug add/remove based on whether an SD card is inserted (see [1]). > > > > > > This is a valid concern, so it is mostly a matter of where to disable > > > autosuspend. > > > > > > It may be the driver core in principle, but note that it calls > > > ->remove() after invoking pm_runtime_put_sync(), so why would it > > > disable autosuspend when it allows runtime PM to race with device > > > removal in general? > > > > > > Another way might be to add a pm_runtime_dont_use_autosuspend() call > > > at the beginning of pci_device_remove(). > > > > > > Or just remove the optimization in question from rpm_resume() which is > > > quite confusing and causes people to make assumptions that lead to > > > incorrect behavior in this particular case. > > > > Well, scratch this. > > > > If rpm_idle() is already running at the time rpm_resume() is called, > > the latter may return right away without waiting, which is incorrect. > > > > rpm_resume() needs to wait for the "idle" callback to complete, so > > this (again, modulo GMail-induced whitespace mangling) should help: > > > > --- > > drivers/base/power/runtime.c | 6 ++++-- > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > Index: linux-pm/drivers/base/power/runtime.c > > =================================================================== > > --- linux-pm.orig/drivers/base/power/runtime.c > > +++ linux-pm/drivers/base/power/runtime.c > > @@ -798,7 +798,8 @@ static int rpm_resume(struct device *dev > > } > > > > if (dev->power.runtime_status == RPM_RESUMING || > > - dev->power.runtime_status == RPM_SUSPENDING) { > > + dev->power.runtime_status == RPM_SUSPENDING || > > + dev->power.idle_notification) { > > DEFINE_WAIT(wait); > > > > if (rpmflags & (RPM_ASYNC | RPM_NOWAIT)) { > > @@ -826,7 +827,8 @@ static int rpm_resume(struct device *dev > > prepare_to_wait(&dev->power.wait_queue, &wait, > > TASK_UNINTERRUPTIBLE); > > if (dev->power.runtime_status != RPM_RESUMING && > > - dev->power.runtime_status != RPM_SUSPENDING) > > + dev->power.runtime_status != RPM_SUSPENDING && > > + !dev->power.idle_notification) > > break; > > > > spin_unlock_irq(&dev->power.lock); > > Well, not really. > > The problem is that rtsx_pci_runtime_idle() is not expected to be > running after pm_runtime_get_sync(), but the latter doesn't really > guarantee that. It only guarantees that the suspend/resume callbacks > will not be running after it returns. > > As I said above, if the ->runtime_idle() callback is already running > when pm_runtime_get_sync() runs, the latter will notice that the > status is RPM_ACTIVE and will return right away without waiting for > the former to complete. In fact, it cannot wait for it to complete, > because it may be called from a ->runtime_idle() callback itself (it > arguably does not make much sense to do that, but it is not strictly > forbidden). > > So whoever is providing a ->runtime_idle() callback, they need to > protect it from running in parallel with whatever code runs after > pm_runtime_get_sync(). Note that ->runtime_idle() will not start > after pm_runtime_get_sync(), but it may continue running then if it > has started earlier already. > > Calling pm_runtime_barrier() after pm_runtime_get_sync() (not before > it) should suffice, but once the runtime PM usage counter is dropped, > rpm_idle() may run again, so this is only effective until the usage > counter is greater than 1. This means that > __device_release_driver(() is not the right place to call it, because > the usage counter is dropped before calling device_remove() in that > case. > > The PCI bus type can prevent the race between driver-provided > ->runtime_idle() and ->remove() from occurring by adding a > pm_runtime_probe() call in the following way: Thanks for the detailed explanation. Does this mean only PCI bus needs this fix because other subsystems are not affected, or this needs to be implemented for each different bus types? Kai-Heng > > --- > drivers/pci/pci-driver.c | 7 +++++++ > 1 file changed, 7 insertions(+) > > Index: linux-pm/drivers/pci/pci-driver.c > =================================================================== > --- linux-pm.orig/drivers/pci/pci-driver.c > +++ linux-pm/drivers/pci/pci-driver.c > @@ -473,6 +473,13 @@ static void pci_device_remove(struct dev > > if (drv->remove) { > pm_runtime_get_sync(dev); > + /* > + * If the driver provides a .runtime_idle() callback and it has > + * started to run already, it may continue to run in parallel > + * with the code below, so wait until all of the runtime PM > + * activity has completed. > + */ > + pm_runtime_barrier(dev); > drv->remove(pci_dev); > pm_runtime_put_noidle(dev); > }
> On Mon, Mar 4, 2024 at 7:10 PM Rafael J. Wysocki <rafael@kernel.org> > wrote: > > > > On Mon, Mar 4, 2024 at 6:00 PM Rafael J. Wysocki <rafael@kernel.org> > wrote: > > > > > > On Mon, Mar 4, 2024 at 5:41 PM Rafael J. Wysocki <rafael@kernel.org> > wrote: > > > > > > > > On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> > wrote: > > > > > > > > > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > > > > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > > > > > <kai.heng.feng@canonical.com> wrote: > > > > > > > > > > > > > > When inserting an SD7.0 card to Realtek card reader, the card > reader > > > > > > > unplugs itself and morph into a NVMe device. The slot Link down on > hot > > > > > > > unplugged can cause the following error: > > > > > > > > > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > > > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > > > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 > PTE 0 > > > > > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > > > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > > > > > Hardware name: To Be Filled By O.E.M. To Be Filled By > O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > > > > > Workqueue: pm pm_runtime_work > > > > > > > RIP: 0010:ioread32+0x2e/0x70 > > > > > > ... > > > > > > > Call Trace: > > > > > > > <TASK> > > > > > > > ? show_regs+0x68/0x70 > > > > > > > ? __die_body+0x20/0x70 > > > > > > > ? __die+0x2b/0x40 > > > > > > > ? page_fault_oops+0x160/0x480 > > > > > > > ? search_bpf_extables+0x63/0x90 > > > > > > > ? ioread32+0x2e/0x70 > > > > > > > ? search_exception_tables+0x5f/0x70 > > > > > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > > > > > ? __bad_area_nosemaphore+0x179/0x230 > > > > > > > ? bad_area_nosemaphore+0x16/0x20 > > > > > > > ? do_kern_addr_fault+0x8b/0xa0 > > > > > > > ? exc_page_fault+0xe5/0x180 > > > > > > > ? asm_exc_page_fault+0x27/0x30 > > > > > > > ? ioread32+0x2e/0x70 > > > > > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > > > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > > > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > > > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > > > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > > > > > pci_pm_runtime_idle+0x34/0x70 > > > > > > > rpm_idle+0xc4/0x2b0 > > > > > > > pm_runtime_work+0x93/0xc0 > > > > > > > process_one_work+0x21a/0x430 > > > > > > > worker_thread+0x4a/0x3c0 > > > > > > ... > > > > > > > > > > > > This happens because scheduled pm_runtime_idle() is not > cancelled. > > > > > > > > > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE > and if > > > > > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > > > > > > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if > it > > > > > > is running (see the comment in rpm_resume() regarding this), so it > may > > > > > > queue up a runtime PM work later. > > > > > > > > > > > > If this is not desirable, you need to stop the autosuspend timer > > > > > > explicitly in addition to calling pm_runtime_get_sync(). > > > > > > > > > > I don't quite follow all this. I think the race is between > > > > > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). > > > > > > > > I think so too and the latter is not expected to run. > > > > > > > > > rtsx_pci_remove() > > > > > { > > > > > pm_runtime_get_sync() > > > > > pm_runtime_forbid() > > > > > ... > > > > > > > > > > If this is an rtsx bug, what exactly should be added to > > > > > rtsx_pci_remove()? > > > > > > > > > > Is there ever a case where we want any runtime PM work to happen > > > > > during or after a driver .remove()? If not, maybe the driver core > > > > > should prevent that, which I think is basically what this patch does. > > > > > > > > No, it is not, because it doesn't actually prevent the race from > > > > occurring, it just narrows the window quite a bit. > > > > > > > > It would be better to call pm_runtime_dont_use_autosuspend() instead > > > > of pm_runtime_barrier(). > > > > > > > > > If this is an rtsx driver bug, I'm concerned there may be many other > > > > > drivers with a similar issue. rtsx exercises this path more than most > > > > > because the device switches between card reader and NVMe SSD using > > > > > hotplug add/remove based on whether an SD card is inserted (see [1]). > > > > > > > > This is a valid concern, so it is mostly a matter of where to disable > > > > autosuspend. > > > > > > > > It may be the driver core in principle, but note that it calls > > > > ->remove() after invoking pm_runtime_put_sync(), so why would it > > > > disable autosuspend when it allows runtime PM to race with device > > > > removal in general? > > > > > > > > Another way might be to add a pm_runtime_dont_use_autosuspend() > call > > > > at the beginning of pci_device_remove(). > > > > > > > > Or just remove the optimization in question from rpm_resume() which is > > > > quite confusing and causes people to make assumptions that lead to > > > > incorrect behavior in this particular case. > > > > > > Well, scratch this. > > > > > > If rpm_idle() is already running at the time rpm_resume() is called, > > > the latter may return right away without waiting, which is incorrect. > > > > > > rpm_resume() needs to wait for the "idle" callback to complete, so > > > this (again, modulo GMail-induced whitespace mangling) should help: > > > > > > --- > > > drivers/base/power/runtime.c | 6 ++++-- > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > Index: linux-pm/drivers/base/power/runtime.c > > > > ================================================================ > === > > > --- linux-pm.orig/drivers/base/power/runtime.c > > > +++ linux-pm/drivers/base/power/runtime.c > > > @@ -798,7 +798,8 @@ static int rpm_resume(struct device *dev > > > } > > > > > > if (dev->power.runtime_status == RPM_RESUMING || > > > - dev->power.runtime_status == RPM_SUSPENDING) { > > > + dev->power.runtime_status == RPM_SUSPENDING || > > > + dev->power.idle_notification) { > > > DEFINE_WAIT(wait); > > > > > > if (rpmflags & (RPM_ASYNC | RPM_NOWAIT)) { > > > @@ -826,7 +827,8 @@ static int rpm_resume(struct device *dev > > > prepare_to_wait(&dev->power.wait_queue, &wait, > > > TASK_UNINTERRUPTIBLE); > > > if (dev->power.runtime_status != RPM_RESUMING && > > > - dev->power.runtime_status != RPM_SUSPENDING) > > > + dev->power.runtime_status != RPM_SUSPENDING && > > > + !dev->power.idle_notification) > > > break; > > > > > > spin_unlock_irq(&dev->power.lock); > > > > Well, not really. > > > > The problem is that rtsx_pci_runtime_idle() is not expected to be > > running after pm_runtime_get_sync(), but the latter doesn't really > > guarantee that. It only guarantees that the suspend/resume callbacks > > will not be running after it returns. > > > > As I said above, if the ->runtime_idle() callback is already running > > when pm_runtime_get_sync() runs, the latter will notice that the > > status is RPM_ACTIVE and will return right away without waiting for > > the former to complete. In fact, it cannot wait for it to complete, > > because it may be called from a ->runtime_idle() callback itself (it > > arguably does not make much sense to do that, but it is not strictly > > forbidden). > > > > So whoever is providing a ->runtime_idle() callback, they need to > > protect it from running in parallel with whatever code runs after > > pm_runtime_get_sync(). Note that ->runtime_idle() will not start > > after pm_runtime_get_sync(), but it may continue running then if it > > has started earlier already. > > > > Calling pm_runtime_barrier() after pm_runtime_get_sync() (not before > > it) should suffice, but once the runtime PM usage counter is dropped, > > rpm_idle() may run again, so this is only effective until the usage > > counter is greater than 1. This means that > > __device_release_driver(() is not the right place to call it, because > > the usage counter is dropped before calling device_remove() in that > > case. > > > > The PCI bus type can prevent the race between driver-provided > > ->runtime_idle() and ->remove() from occurring by adding a > > pm_runtime_probe() call in the following way: > > s/pm_runtime_probe/pm_runtime_barrier/ (sorry) > > The patchlet below is correct, though. > I tested this patch it work and well... Tested-by: Ricky Wu <ricky_wu@realtek.com> > > --- > > drivers/pci/pci-driver.c | 7 +++++++ > > 1 file changed, 7 insertions(+) > > > > Index: linux-pm/drivers/pci/pci-driver.c > > > ================================================================ > === > > --- linux-pm.orig/drivers/pci/pci-driver.c > > +++ linux-pm/drivers/pci/pci-driver.c > > @@ -473,6 +473,13 @@ static void pci_device_remove(struct dev > > > > if (drv->remove) { > > pm_runtime_get_sync(dev); > > + /* > > + * If the driver provides a .runtime_idle() callback and it has > > + * started to run already, it may continue to run in parallel > > + * with the code below, so wait until all of the runtime PM > > + * activity has completed. > > + */ > > + pm_runtime_barrier(dev); > > drv->remove(pci_dev); > > pm_runtime_put_noidle(dev); > > }
On Tue, Mar 5, 2024 at 8:20 AM Kai-Heng Feng <kai.heng.feng@canonical.com> wrote: > > On Tue, Mar 5, 2024 at 2:10 AM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > On Mon, Mar 4, 2024 at 6:00 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > On Mon, Mar 4, 2024 at 5:41 PM Rafael J. Wysocki <rafael@kernel.org> wrote: > > > > > > > > On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> wrote: > > > > > > > > > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > > > > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > > > > > <kai.heng.feng@canonical.com> wrote: > > > > > > > > > > > > > > When inserting an SD7.0 card to Realtek card reader, the card reader > > > > > > > unplugs itself and morph into a NVMe device. The slot Link down on hot > > > > > > > unplugged can cause the following error: > > > > > > > > > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > > > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > > > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 > > > > > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > > > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > > > > > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > > > > > Workqueue: pm pm_runtime_work > > > > > > > RIP: 0010:ioread32+0x2e/0x70 > > > > > > ... > > > > > > > Call Trace: > > > > > > > <TASK> > > > > > > > ? show_regs+0x68/0x70 > > > > > > > ? __die_body+0x20/0x70 > > > > > > > ? __die+0x2b/0x40 > > > > > > > ? page_fault_oops+0x160/0x480 > > > > > > > ? search_bpf_extables+0x63/0x90 > > > > > > > ? ioread32+0x2e/0x70 > > > > > > > ? search_exception_tables+0x5f/0x70 > > > > > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > > > > > ? __bad_area_nosemaphore+0x179/0x230 > > > > > > > ? bad_area_nosemaphore+0x16/0x20 > > > > > > > ? do_kern_addr_fault+0x8b/0xa0 > > > > > > > ? exc_page_fault+0xe5/0x180 > > > > > > > ? asm_exc_page_fault+0x27/0x30 > > > > > > > ? ioread32+0x2e/0x70 > > > > > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > > > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > > > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > > > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > > > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > > > > > pci_pm_runtime_idle+0x34/0x70 > > > > > > > rpm_idle+0xc4/0x2b0 > > > > > > > pm_runtime_work+0x93/0xc0 > > > > > > > process_one_work+0x21a/0x430 > > > > > > > worker_thread+0x4a/0x3c0 > > > > > > ... > > > > > > > > > > > > This happens because scheduled pm_runtime_idle() is not cancelled. > > > > > > > > > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE and if > > > > > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > > > > > > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if it > > > > > > is running (see the comment in rpm_resume() regarding this), so it may > > > > > > queue up a runtime PM work later. > > > > > > > > > > > > If this is not desirable, you need to stop the autosuspend timer > > > > > > explicitly in addition to calling pm_runtime_get_sync(). > > > > > > > > > > I don't quite follow all this. I think the race is between > > > > > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). > > > > > > > > I think so too and the latter is not expected to run. > > > > > > > > > rtsx_pci_remove() > > > > > { > > > > > pm_runtime_get_sync() > > > > > pm_runtime_forbid() > > > > > ... > > > > > > > > > > If this is an rtsx bug, what exactly should be added to > > > > > rtsx_pci_remove()? > > > > > > > > > > Is there ever a case where we want any runtime PM work to happen > > > > > during or after a driver .remove()? If not, maybe the driver core > > > > > should prevent that, which I think is basically what this patch does. > > > > > > > > No, it is not, because it doesn't actually prevent the race from > > > > occurring, it just narrows the window quite a bit. > > > > > > > > It would be better to call pm_runtime_dont_use_autosuspend() instead > > > > of pm_runtime_barrier(). > > > > > > > > > If this is an rtsx driver bug, I'm concerned there may be many other > > > > > drivers with a similar issue. rtsx exercises this path more than most > > > > > because the device switches between card reader and NVMe SSD using > > > > > hotplug add/remove based on whether an SD card is inserted (see [1]). > > > > > > > > This is a valid concern, so it is mostly a matter of where to disable > > > > autosuspend. > > > > > > > > It may be the driver core in principle, but note that it calls > > > > ->remove() after invoking pm_runtime_put_sync(), so why would it > > > > disable autosuspend when it allows runtime PM to race with device > > > > removal in general? > > > > > > > > Another way might be to add a pm_runtime_dont_use_autosuspend() call > > > > at the beginning of pci_device_remove(). > > > > > > > > Or just remove the optimization in question from rpm_resume() which is > > > > quite confusing and causes people to make assumptions that lead to > > > > incorrect behavior in this particular case. > > > > > > Well, scratch this. > > > > > > If rpm_idle() is already running at the time rpm_resume() is called, > > > the latter may return right away without waiting, which is incorrect. > > > > > > rpm_resume() needs to wait for the "idle" callback to complete, so > > > this (again, modulo GMail-induced whitespace mangling) should help: > > > > > > --- > > > drivers/base/power/runtime.c | 6 ++++-- > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > Index: linux-pm/drivers/base/power/runtime.c > > > =================================================================== > > > --- linux-pm.orig/drivers/base/power/runtime.c > > > +++ linux-pm/drivers/base/power/runtime.c > > > @@ -798,7 +798,8 @@ static int rpm_resume(struct device *dev > > > } > > > > > > if (dev->power.runtime_status == RPM_RESUMING || > > > - dev->power.runtime_status == RPM_SUSPENDING) { > > > + dev->power.runtime_status == RPM_SUSPENDING || > > > + dev->power.idle_notification) { > > > DEFINE_WAIT(wait); > > > > > > if (rpmflags & (RPM_ASYNC | RPM_NOWAIT)) { > > > @@ -826,7 +827,8 @@ static int rpm_resume(struct device *dev > > > prepare_to_wait(&dev->power.wait_queue, &wait, > > > TASK_UNINTERRUPTIBLE); > > > if (dev->power.runtime_status != RPM_RESUMING && > > > - dev->power.runtime_status != RPM_SUSPENDING) > > > + dev->power.runtime_status != RPM_SUSPENDING && > > > + !dev->power.idle_notification) > > > break; > > > > > > spin_unlock_irq(&dev->power.lock); > > > > Well, not really. > > > > The problem is that rtsx_pci_runtime_idle() is not expected to be > > running after pm_runtime_get_sync(), but the latter doesn't really > > guarantee that. It only guarantees that the suspend/resume callbacks > > will not be running after it returns. > > > > As I said above, if the ->runtime_idle() callback is already running > > when pm_runtime_get_sync() runs, the latter will notice that the > > status is RPM_ACTIVE and will return right away without waiting for > > the former to complete. In fact, it cannot wait for it to complete, > > because it may be called from a ->runtime_idle() callback itself (it > > arguably does not make much sense to do that, but it is not strictly > > forbidden). > > > > So whoever is providing a ->runtime_idle() callback, they need to > > protect it from running in parallel with whatever code runs after > > pm_runtime_get_sync(). Note that ->runtime_idle() will not start > > after pm_runtime_get_sync(), but it may continue running then if it > > has started earlier already. > > > > Calling pm_runtime_barrier() after pm_runtime_get_sync() (not before > > it) should suffice, but once the runtime PM usage counter is dropped, > > rpm_idle() may run again, so this is only effective until the usage > > counter is greater than 1. This means that > > __device_release_driver(() is not the right place to call it, because > > the usage counter is dropped before calling device_remove() in that > > case. > > > > The PCI bus type can prevent the race between driver-provided > > ->runtime_idle() and ->remove() from occurring by adding a > > pm_runtime_probe() call in the following way: > > Thanks for the detailed explanation. Does this mean only PCI bus needs > this fix because other subsystems are not affected, or this needs to > be implemented for each different bus types? I don't think that it needs to be implemented in this form for any other bus types. There are not many cases in which non-trivial .runtime_idle() callbacks are used and the majority of them are PCI drivers anyway AFAICS. Also the drivers using .runtime_idle() can prevent it from racing with other code paths by themselves, this way or another (for example, with the help of locking), and it is generally better to address the races in question on a per-driver basis IMV. The PCI bus type is kind of exceptional, because it forces devices to resume and prevents them from suspending before removing drivers, which is necessary for the fix to be really effective.
On Tue, Mar 5, 2024 at 10:20 AM Ricky WU <ricky_wu@realtek.com> wrote: > > > On Mon, Mar 4, 2024 at 7:10 PM Rafael J. Wysocki <rafael@kernel.org> > > wrote: > > > > > > On Mon, Mar 4, 2024 at 6:00 PM Rafael J. Wysocki <rafael@kernel.org> > > wrote: > > > > > > > > On Mon, Mar 4, 2024 at 5:41 PM Rafael J. Wysocki <rafael@kernel.org> > > wrote: > > > > > > > > > > On Mon, Mar 4, 2024 at 4:51 PM Bjorn Helgaas <helgaas@kernel.org> > > wrote: > > > > > > > > > > > > On Mon, Mar 04, 2024 at 03:38:38PM +0100, Rafael J. Wysocki wrote: > > > > > > > On Thu, Feb 29, 2024 at 7:23 AM Kai-Heng Feng > > > > > > > <kai.heng.feng@canonical.com> wrote: > > > > > > > > > > > > > > > > When inserting an SD7.0 card to Realtek card reader, the card > > reader > > > > > > > > unplugs itself and morph into a NVMe device. The slot Link down on > > hot > > > > > > > > unplugged can cause the following error: > > > > > > > > > > > > > > > > pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down > > > > > > > > BUG: unable to handle page fault for address: ffffb24d403e5010 > > > > > > > > PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 > > PTE 0 > > > > > > > > Oops: 0000 [#1] PREEMPT SMP PTI > > > > > > > > CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 > > > > > > > > Hardware name: To Be Filled By O.E.M. To Be Filled By > > O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 > > > > > > > > Workqueue: pm pm_runtime_work > > > > > > > > RIP: 0010:ioread32+0x2e/0x70 > > > > > > > ... > > > > > > > > Call Trace: > > > > > > > > <TASK> > > > > > > > > ? show_regs+0x68/0x70 > > > > > > > > ? __die_body+0x20/0x70 > > > > > > > > ? __die+0x2b/0x40 > > > > > > > > ? page_fault_oops+0x160/0x480 > > > > > > > > ? search_bpf_extables+0x63/0x90 > > > > > > > > ? ioread32+0x2e/0x70 > > > > > > > > ? search_exception_tables+0x5f/0x70 > > > > > > > > ? kernelmode_fixup_or_oops+0xa2/0x120 > > > > > > > > ? __bad_area_nosemaphore+0x179/0x230 > > > > > > > > ? bad_area_nosemaphore+0x16/0x20 > > > > > > > > ? do_kern_addr_fault+0x8b/0xa0 > > > > > > > > ? exc_page_fault+0xe5/0x180 > > > > > > > > ? asm_exc_page_fault+0x27/0x30 > > > > > > > > ? ioread32+0x2e/0x70 > > > > > > > > ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] > > > > > > > > rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] > > > > > > > > rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] > > > > > > > > rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] > > > > > > > > ? __pfx_pci_pm_runtime_idle+0x10/0x10 > > > > > > > > pci_pm_runtime_idle+0x34/0x70 > > > > > > > > rpm_idle+0xc4/0x2b0 > > > > > > > > pm_runtime_work+0x93/0xc0 > > > > > > > > process_one_work+0x21a/0x430 > > > > > > > > worker_thread+0x4a/0x3c0 > > > > > > > ... > > > > > > > > > > > > > > This happens because scheduled pm_runtime_idle() is not > > cancelled. > > > > > > > > > > > > > > But rpm_resume() changes dev->power.request to RPM_REQ_NONE > > and if > > > > > > > pm_runtime_work() sees this, it will not run rpm_idle(). > > > > > > > > > > > > > > However, rpm_resume() doesn't deactivate the autosuspend timer if > > it > > > > > > > is running (see the comment in rpm_resume() regarding this), so it > > may > > > > > > > queue up a runtime PM work later. > > > > > > > > > > > > > > If this is not desirable, you need to stop the autosuspend timer > > > > > > > explicitly in addition to calling pm_runtime_get_sync(). > > > > > > > > > > > > I don't quite follow all this. I think the race is between > > > > > > rtsx_pci_remove() (not resume) and rtsx_pci_runtime_idle(). > > > > > > > > > > I think so too and the latter is not expected to run. > > > > > > > > > > > rtsx_pci_remove() > > > > > > { > > > > > > pm_runtime_get_sync() > > > > > > pm_runtime_forbid() > > > > > > ... > > > > > > > > > > > > If this is an rtsx bug, what exactly should be added to > > > > > > rtsx_pci_remove()? > > > > > > > > > > > > Is there ever a case where we want any runtime PM work to happen > > > > > > during or after a driver .remove()? If not, maybe the driver core > > > > > > should prevent that, which I think is basically what this patch does. > > > > > > > > > > No, it is not, because it doesn't actually prevent the race from > > > > > occurring, it just narrows the window quite a bit. > > > > > > > > > > It would be better to call pm_runtime_dont_use_autosuspend() instead > > > > > of pm_runtime_barrier(). > > > > > > > > > > > If this is an rtsx driver bug, I'm concerned there may be many other > > > > > > drivers with a similar issue. rtsx exercises this path more than most > > > > > > because the device switches between card reader and NVMe SSD using > > > > > > hotplug add/remove based on whether an SD card is inserted (see [1]). > > > > > > > > > > This is a valid concern, so it is mostly a matter of where to disable > > > > > autosuspend. > > > > > > > > > > It may be the driver core in principle, but note that it calls > > > > > ->remove() after invoking pm_runtime_put_sync(), so why would it > > > > > disable autosuspend when it allows runtime PM to race with device > > > > > removal in general? > > > > > > > > > > Another way might be to add a pm_runtime_dont_use_autosuspend() > > call > > > > > at the beginning of pci_device_remove(). > > > > > > > > > > Or just remove the optimization in question from rpm_resume() which is > > > > > quite confusing and causes people to make assumptions that lead to > > > > > incorrect behavior in this particular case. > > > > > > > > Well, scratch this. > > > > > > > > If rpm_idle() is already running at the time rpm_resume() is called, > > > > the latter may return right away without waiting, which is incorrect. > > > > > > > > rpm_resume() needs to wait for the "idle" callback to complete, so > > > > this (again, modulo GMail-induced whitespace mangling) should help: > > > > > > > > --- > > > > drivers/base/power/runtime.c | 6 ++++-- > > > > 1 file changed, 4 insertions(+), 2 deletions(-) > > > > > > > > Index: linux-pm/drivers/base/power/runtime.c > > > > > > ================================================================ > > === > > > > --- linux-pm.orig/drivers/base/power/runtime.c > > > > +++ linux-pm/drivers/base/power/runtime.c > > > > @@ -798,7 +798,8 @@ static int rpm_resume(struct device *dev > > > > } > > > > > > > > if (dev->power.runtime_status == RPM_RESUMING || > > > > - dev->power.runtime_status == RPM_SUSPENDING) { > > > > + dev->power.runtime_status == RPM_SUSPENDING || > > > > + dev->power.idle_notification) { > > > > DEFINE_WAIT(wait); > > > > > > > > if (rpmflags & (RPM_ASYNC | RPM_NOWAIT)) { > > > > @@ -826,7 +827,8 @@ static int rpm_resume(struct device *dev > > > > prepare_to_wait(&dev->power.wait_queue, &wait, > > > > TASK_UNINTERRUPTIBLE); > > > > if (dev->power.runtime_status != RPM_RESUMING && > > > > - dev->power.runtime_status != RPM_SUSPENDING) > > > > + dev->power.runtime_status != RPM_SUSPENDING && > > > > + !dev->power.idle_notification) > > > > break; > > > > > > > > spin_unlock_irq(&dev->power.lock); > > > > > > Well, not really. > > > > > > The problem is that rtsx_pci_runtime_idle() is not expected to be > > > running after pm_runtime_get_sync(), but the latter doesn't really > > > guarantee that. It only guarantees that the suspend/resume callbacks > > > will not be running after it returns. > > > > > > As I said above, if the ->runtime_idle() callback is already running > > > when pm_runtime_get_sync() runs, the latter will notice that the > > > status is RPM_ACTIVE and will return right away without waiting for > > > the former to complete. In fact, it cannot wait for it to complete, > > > because it may be called from a ->runtime_idle() callback itself (it > > > arguably does not make much sense to do that, but it is not strictly > > > forbidden). > > > > > > So whoever is providing a ->runtime_idle() callback, they need to > > > protect it from running in parallel with whatever code runs after > > > pm_runtime_get_sync(). Note that ->runtime_idle() will not start > > > after pm_runtime_get_sync(), but it may continue running then if it > > > has started earlier already. > > > > > > Calling pm_runtime_barrier() after pm_runtime_get_sync() (not before > > > it) should suffice, but once the runtime PM usage counter is dropped, > > > rpm_idle() may run again, so this is only effective until the usage > > > counter is greater than 1. This means that > > > __device_release_driver(() is not the right place to call it, because > > > the usage counter is dropped before calling device_remove() in that > > > case. > > > > > > The PCI bus type can prevent the race between driver-provided > > > ->runtime_idle() and ->remove() from occurring by adding a > > > pm_runtime_probe() call in the following way: > > > > s/pm_runtime_probe/pm_runtime_barrier/ (sorry) > > > > The patchlet below is correct, though. > > > > I tested this patch it work and well... > Tested-by: Ricky Wu <ricky_wu@realtek.com> Thank you! I will resend it with a proper changelog.
diff --git a/drivers/base/dd.c b/drivers/base/dd.c index 85152537dbf1..38c815e2b3a2 100644 --- a/drivers/base/dd.c +++ b/drivers/base/dd.c @@ -1244,6 +1244,7 @@ static void __device_release_driver(struct device *dev, struct device *parent) drv = dev->driver; if (drv) { + pm_runtime_barrier(dev); pm_runtime_get_sync(dev); while (device_links_busy(dev)) {
When inserting an SD7.0 card to Realtek card reader, the card reader unplugs itself and morph into a NVMe device. The slot Link down on hot unplugged can cause the following error: pcieport 0000:00:1c.0: pciehp: Slot(8): Link Down BUG: unable to handle page fault for address: ffffb24d403e5010 PGD 100000067 P4D 100000067 PUD 1001fe067 PMD 100d97067 PTE 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 534 Comm: kworker/3:10 Not tainted 6.4.0 #6 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H370M Pro4, BIOS P3.40 10/25/2018 Workqueue: pm pm_runtime_work RIP: 0010:ioread32+0x2e/0x70 Code: ff 03 00 77 25 48 81 ff 00 00 01 00 77 14 8b 15 08 d9 54 01 b8 ff ff ff ff 85 d2 75 14 c3 cc cc cc cc 89 fa ed c3 cc cc cc cc <8b> 07 c3 cc cc cc cc 55 83 ea 01 48 89 fe 48 c7 c7 98 6f 15 99 48 RSP: 0018:ffffb24d40a5bd78 EFLAGS: 00010296 RAX: ffffb24d403e5000 RBX: 0000000000000152 RCX: 000000000000007f RDX: 000000000000ff00 RSI: ffffb24d403e5010 RDI: ffffb24d403e5010 RBP: ffffb24d40a5bd98 R08: ffffb24d403e5010 R09: 0000000000000000 R10: ffff9074cd95e7f4 R11: 0000000000000003 R12: 000000000000007f R13: ffff9074e1a68c00 R14: ffff9074e1a68d00 R15: 0000000000009003 FS: 0000000000000000(0000) GS:ffff90752a180000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffb24d403e5010 CR3: 0000000152832006 CR4: 00000000003706e0 Call Trace: <TASK> ? show_regs+0x68/0x70 ? __die_body+0x20/0x70 ? __die+0x2b/0x40 ? page_fault_oops+0x160/0x480 ? search_bpf_extables+0x63/0x90 ? ioread32+0x2e/0x70 ? search_exception_tables+0x5f/0x70 ? kernelmode_fixup_or_oops+0xa2/0x120 ? __bad_area_nosemaphore+0x179/0x230 ? bad_area_nosemaphore+0x16/0x20 ? do_kern_addr_fault+0x8b/0xa0 ? exc_page_fault+0xe5/0x180 ? asm_exc_page_fault+0x27/0x30 ? ioread32+0x2e/0x70 ? rtsx_pci_write_register+0x5b/0x90 [rtsx_pci] rtsx_set_l1off_sub+0x1c/0x30 [rtsx_pci] rts5261_set_l1off_cfg_sub_d0+0x36/0x40 [rtsx_pci] rtsx_pci_runtime_idle+0xc7/0x160 [rtsx_pci] ? __pfx_pci_pm_runtime_idle+0x10/0x10 pci_pm_runtime_idle+0x34/0x70 rpm_idle+0xc4/0x2b0 pm_runtime_work+0x93/0xc0 process_one_work+0x21a/0x430 worker_thread+0x4a/0x3c0 ? __pfx_worker_thread+0x10/0x10 kthread+0x106/0x140 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x29/0x50 </TASK> Modules linked in: nvme nvme_core snd_hda_codec_hdmi snd_sof_pci_intel_cnl snd_sof_intel_hda_common snd_hda_codec_realtek snd_hda_codec_generic snd_soc_hdac_hda soundwire_intel ledtrig_audio nls_iso8859_1 soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda_mlink snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_sof_utils snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel i915 snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr snd_hda_codec intel_rapl_common snd_hda_core x86_pkg_temp_thermal intel_powerclamp snd_hwdep coretemp snd_pcm kvm_intel drm_buddy ttm mei_hdcp kvm drm_display_helper snd_seq_midi snd_seq_midi_event cec crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd rc_core cryptd rapl snd_rawmidi drm_kms_helper binfmt_misc intel_cstate i2c_algo_bit joydev snd_seq snd_seq_device syscopyarea wmi_bmof snd_timer sysfillrect input_leds snd ee1004 sysimgblt mei_me soundcore mei intel_pch_thermal mac_hid acpi_tad acpi_pad sch_fq_codel msr parport_pc ppdev lp ramoops drm parport reed_solomon efi_pstore ip_tables x_tables autofs4 hid_generic usbhid hid rtsx_pci_sdmmc crc32_pclmul ahci e1000e i2c_i801 i2c_smbus rtsx_pci xhci_pci libahci xhci_pci_renesas video wmi CR2: ffffb24d403e5010 ---[ end trace 0000000000000000 ]--- This happens because scheduled pm_runtime_idle() is not cancelled. So before releasing the device, stop all runtime power managements by using pm_runtime_barrier() to fix the issue. Link: https://lore.kernel.org/all/2ce258f371234b1f8a1a470d5488d00e@realtek.com/ Cc: Ricky Wu <ricky_wu@realtek.com> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com> --- v3: Move the change the device driver core. v2: Cover more cases than just pciehp. drivers/base/dd.c | 1 + 1 file changed, 1 insertion(+)