Message ID | 20241204114229.21452-1-wander@redhat.com |
---|---|
Headers | show |
Series | igb: fix igb_msix_other() handling for PREEMPT_RT | expand |
On Wed, Dec 4, 2024 at 8:43 AM Wander Lairson Costa <wander@redhat.com> wrote: > > This is the second attempt at fixing the behavior of igb_msix_other() > for PREEMPT_RT. The previous attempt [1] was reverted [2] following > concerns raised by Sebastian [3]. > > The initial approach proposed converting vfs_lock to a raw_spinlock, > a minor change intended to make it safe. However, it became evident > that igb_rcv_msg_from_vf() invokes kcalloc with GFP_ATOMIC, > which is unsafe in interrupt context on PREEMPT_RT systems. > > To address this, the solution involves splitting igb_msg_task() > into two parts: > > * One part invoked from the IRQ context. > * Another part called from the threaded interrupt handler. > > To accommodate this, vfs_lock has been restructured into a double > lock: a spinlock_t and a raw_spinlock_t. In the revised design: > > * igb_disable_sriov() locks both spinlocks. > * Each part of igb_msg_task() locks the appropriate spinlock for > its execution context. > > It is worth noting that the double lock mechanism is only active under > PREEMPT_RT. For non-PREEMPT_RT builds, the additional raw_spinlock_t > field is ommited. > > If the extra raw_spinlock_t field can be tolerated under > !PREEMPT_RT (even though it remains unused), we can eliminate the > need for #ifdefs and simplify the code structure. > > I will be on vacation from December 7th to Christmas and will address > review comments upon my return. > > If possible, I kindly request the Intel team to perform smoke tests > on both stock and realtime kernels to catch any potential issues with > this patch series. > > Cheers, > Wander > > [1] https://lore.kernel.org/all/20240920185918.616302-2-wander@redhat.com/ > [2] https://lore.kernel.org/all/20241104124050.22290-1-wander@redhat.com/ > [3] https://lore.kernel.org/all/20241104110708.gFyxRFlC@linutronix.de/ > > > Wander Lairson Costa (4): > igb: narrow scope of vfs_lock in SR-IOV cleanup > igb: introduce raw vfs_lock to igb_adapter > igb: split igb_msg_task() > igb: fix igb_msix_other() handling for PREEMPT_RT > > drivers/net/ethernet/intel/igb/igb.h | 4 + > drivers/net/ethernet/intel/igb/igb_main.c | 160 +++++++++++++++++++--- > 2 files changed, 148 insertions(+), 16 deletions(-) > > -- > 2.47.0 > I had requested Red Hat Network QA to run regression tests on this, and they recently reported that no issues were found.
On 2024-12-04 08:42:23 [-0300], Wander Lairson Costa wrote: > This is the second attempt at fixing the behavior of igb_msix_other() > for PREEMPT_RT. The previous attempt [1] was reverted [2] following > concerns raised by Sebastian [3]. > > The initial approach proposed converting vfs_lock to a raw_spinlock, > a minor change intended to make it safe. However, it became evident > that igb_rcv_msg_from_vf() invokes kcalloc with GFP_ATOMIC, > which is unsafe in interrupt context on PREEMPT_RT systems. > > To address this, the solution involves splitting igb_msg_task() > into two parts: > > * One part invoked from the IRQ context. > * Another part called from the threaded interrupt handler. > > To accommodate this, vfs_lock has been restructured into a double > lock: a spinlock_t and a raw_spinlock_t. In the revised design: > > * igb_disable_sriov() locks both spinlocks. > * Each part of igb_msg_task() locks the appropriate spinlock for > its execution context. - Is this limited to PREEMPT_RT or does it also occur on PREEMPT systems with threadirqs? And if this is PREEMPT_RT only, why? - What causes the failure? I see you reworked into two parts to behave similar to what happens without threaded interrupts. There is still no explanation for it. Is there a timing limit or was there another register operation which removed the mailbox message? > Cheers, > Wander Sebastian
On Tue, Jan 07, 2025 at 02:51:06PM +0100, Sebastian Andrzej Siewior wrote: > On 2024-12-04 08:42:23 [-0300], Wander Lairson Costa wrote: > > This is the second attempt at fixing the behavior of igb_msix_other() > > for PREEMPT_RT. The previous attempt [1] was reverted [2] following > > concerns raised by Sebastian [3]. > > > > The initial approach proposed converting vfs_lock to a raw_spinlock, > > a minor change intended to make it safe. However, it became evident > > that igb_rcv_msg_from_vf() invokes kcalloc with GFP_ATOMIC, > > which is unsafe in interrupt context on PREEMPT_RT systems. > > > > To address this, the solution involves splitting igb_msg_task() > > into two parts: > > > > * One part invoked from the IRQ context. > > * Another part called from the threaded interrupt handler. > > > > To accommodate this, vfs_lock has been restructured into a double > > lock: a spinlock_t and a raw_spinlock_t. In the revised design: > > > > * igb_disable_sriov() locks both spinlocks. > > * Each part of igb_msg_task() locks the appropriate spinlock for > > its execution context. > > - Is this limited to PREEMPT_RT or does it also occur on PREEMPT systems > with threadirqs? And if this is PREEMPT_RT only, why? PREEMPT systems configured to use threadirqs should be affected as well, although I never tested with this configuration. Honestly, until now I wasn't aware of the possibility of a non PREEMPT_RT kernel with threaded IRQs by default. > > - What causes the failure? I see you reworked into two parts to behave > similar to what happens without threaded interrupts. There is still no > explanation for it. Is there a timing limit or was there another > register operation which removed the mailbox message? > I explained the root cause of the issue in the last commit. Maybe I should have added the explanation to the cover letter as well. Anyway, here is a partial verbatim copy of it: "During testing of SR-IOV, Red Hat QE encountered an issue where the ip link up command intermittently fails for the igbvf interfaces when using the PREEMPT_RT variant. Investigation revealed that e1000_write_posted_mbx returns an error due to the lack of an ACK from e1000_poll_for_ack. The underlying issue arises from the fact that IRQs are threaded by default under PREEMPT_RT. While the exact hardware details are not available, it appears that the IRQ handled by igb_msix_other must be processed before e1000_poll_for_ack times out. However, e1000_write_posted_mbx is called with preemption disabled, leading to a scenario where the IRQ is serviced only after the failure of e1000_write_posted_mbx." The call chain from igb_msg_task(): igb_msg_task igb_rcv_msg_from_vf igb_set_vf_multicasts igb_set_rx_mode igb_write_mc_addr_list kmalloc Cannot happen from interrupt context under PREEMPT_RT. So this part of the interrupt handler is deferred to a threaded IRQ handler. > > Cheers, > > Wander > > Sebastian >
On 2025-01-07 15:52:47 [-0300], Wander Lairson Costa wrote: > On Tue, Jan 07, 2025 at 02:51:06PM +0100, Sebastian Andrzej Siewior wrote: > > On 2024-12-04 08:42:23 [-0300], Wander Lairson Costa wrote: > > > This is the second attempt at fixing the behavior of igb_msix_other() > > > for PREEMPT_RT. The previous attempt [1] was reverted [2] following > > > concerns raised by Sebastian [3]. > > > > > > The initial approach proposed converting vfs_lock to a raw_spinlock, > > > a minor change intended to make it safe. However, it became evident > > > that igb_rcv_msg_from_vf() invokes kcalloc with GFP_ATOMIC, > > > which is unsafe in interrupt context on PREEMPT_RT systems. > > > > > > To address this, the solution involves splitting igb_msg_task() > > > into two parts: > > > > > > * One part invoked from the IRQ context. > > > * Another part called from the threaded interrupt handler. > > > > > > To accommodate this, vfs_lock has been restructured into a double > > > lock: a spinlock_t and a raw_spinlock_t. In the revised design: > > > > > > * igb_disable_sriov() locks both spinlocks. > > > * Each part of igb_msg_task() locks the appropriate spinlock for > > > its execution context. > > > > - Is this limited to PREEMPT_RT or does it also occur on PREEMPT systems > > with threadirqs? And if this is PREEMPT_RT only, why? > > PREEMPT systems configured to use threadirqs should be affected as well, > although I never tested with this configuration. Honestly, until now I wasn't > aware of the possibility of a non PREEMPT_RT kernel with threaded IRQs by default. If the issue is indeed the use of threaded interrupts then the fix should not be limited to be PREEMPT_RT only. > > - What causes the failure? I see you reworked into two parts to behave > > similar to what happens without threaded interrupts. There is still no > > explanation for it. Is there a timing limit or was there another > > register operation which removed the mailbox message? > > > > I explained the root cause of the issue in the last commit. Maybe I should > have added the explanation to the cover letter as well. Anyway, here is a > partial verbatim copy of it: > > "During testing of SR-IOV, Red Hat QE encountered an issue where the > ip link up command intermittently fails for the igbvf interfaces when > using the PREEMPT_RT variant. Investigation revealed that > e1000_write_posted_mbx returns an error due to the lack of an ACK > from e1000_poll_for_ack. That ACK would have come if it would poll longer? > The underlying issue arises from the fact that IRQs are threaded by > default under PREEMPT_RT. While the exact hardware details are not > available, it appears that the IRQ handled by igb_msix_other must > be processed before e1000_poll_for_ack times out. However, > e1000_write_posted_mbx is called with preemption disabled, leading > to a scenario where the IRQ is serviced only after the failure of > e1000_write_posted_mbx." Where is this disabled preemption coming from? This should be one of the ops.write_posted() calls, right? I've been looking around and don't see anything obvious. Couldn't you wait for an event instead of polling? > The call chain from igb_msg_task(): > > igb_msg_task > igb_rcv_msg_from_vf > igb_set_vf_multicasts > igb_set_rx_mode > igb_write_mc_addr_list > kmalloc > > Cannot happen from interrupt context under PREEMPT_RT. So this part of > the interrupt handler is deferred to a threaded IRQ handler. > > > > Cheers, > > > Wander Sebastian
On Wed, Jan 8, 2025 at 7:25 AM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > > On 2025-01-07 15:52:47 [-0300], Wander Lairson Costa wrote: > > On Tue, Jan 07, 2025 at 02:51:06PM +0100, Sebastian Andrzej Siewior wrote: > > > On 2024-12-04 08:42:23 [-0300], Wander Lairson Costa wrote: > > > > This is the second attempt at fixing the behavior of igb_msix_other() > > > > for PREEMPT_RT. The previous attempt [1] was reverted [2] following > > > > concerns raised by Sebastian [3]. > > > > > > > > The initial approach proposed converting vfs_lock to a raw_spinlock, > > > > a minor change intended to make it safe. However, it became evident > > > > that igb_rcv_msg_from_vf() invokes kcalloc with GFP_ATOMIC, > > > > which is unsafe in interrupt context on PREEMPT_RT systems. > > > > > > > > To address this, the solution involves splitting igb_msg_task() > > > > into two parts: > > > > > > > > * One part invoked from the IRQ context. > > > > * Another part called from the threaded interrupt handler. > > > > > > > > To accommodate this, vfs_lock has been restructured into a double > > > > lock: a spinlock_t and a raw_spinlock_t. In the revised design: > > > > > > > > * igb_disable_sriov() locks both spinlocks. > > > > * Each part of igb_msg_task() locks the appropriate spinlock for > > > > its execution context. > > > > > > - Is this limited to PREEMPT_RT or does it also occur on PREEMPT systems > > > with threadirqs? And if this is PREEMPT_RT only, why? > > > > PREEMPT systems configured to use threadirqs should be affected as well, > > although I never tested with this configuration. Honestly, until now I wasn't > > aware of the possibility of a non PREEMPT_RT kernel with threaded IRQs by default. > > If the issue is indeed the use of threaded interrupts then the fix > should not be limited to be PREEMPT_RT only. > Although I was not aware of this scenario, the patch should work for it as well, as I am forcing it to run in interrupt context. I will test it to confirm. > > > - What causes the failure? I see you reworked into two parts to behave > > > similar to what happens without threaded interrupts. There is still no > > > explanation for it. Is there a timing limit or was there another > > > register operation which removed the mailbox message? > > > > > > > I explained the root cause of the issue in the last commit. Maybe I should > > have added the explanation to the cover letter as well. Anyway, here is a > > partial verbatim copy of it: > > > > "During testing of SR-IOV, Red Hat QE encountered an issue where the > > ip link up command intermittently fails for the igbvf interfaces when > > using the PREEMPT_RT variant. Investigation revealed that > > e1000_write_posted_mbx returns an error due to the lack of an ACK > > from e1000_poll_for_ack. > > That ACK would have come if it would poll longer? > No, the service wouldn't be serviced while polling. > > The underlying issue arises from the fact that IRQs are threaded by > > default under PREEMPT_RT. While the exact hardware details are not > > available, it appears that the IRQ handled by igb_msix_other must > > be processed before e1000_poll_for_ack times out. However, > > e1000_write_posted_mbx is called with preemption disabled, leading > > to a scenario where the IRQ is serviced only after the failure of > > e1000_write_posted_mbx." > > Where is this disabled preemption coming from? This should be one of the > ops.write_posted() calls, right? I've been looking around and don't see > anything obvious. I don't remember if I found the answer by looking at the code or by looking at the ftrace flags. I am currently on sick leave with covid. I can check it when I come back. > Couldn't you wait for an event instead of polling? > > > The call chain from igb_msg_task(): > > > > igb_msg_task > > igb_rcv_msg_from_vf > > igb_set_vf_multicasts > > igb_set_rx_mode > > igb_write_mc_addr_list > > kmalloc > > > > Cannot happen from interrupt context under PREEMPT_RT. So this part of > > the interrupt handler is deferred to a threaded IRQ handler. > > > > > > Cheers, > > > > Wander > > Sebastian >
On 2025-01-09 13:46:47 [-0300], Wander Lairson Costa wrote: > > If the issue is indeed the use of threaded interrupts then the fix > > should not be limited to be PREEMPT_RT only. > > > Although I was not aware of this scenario, the patch should work for it as well, > as I am forcing it to run in interrupt context. I will test it to confirm. If I remember correctly there were "ifdef preempt_rt" things in it. > > > > - What causes the failure? I see you reworked into two parts to behave > > > > similar to what happens without threaded interrupts. There is still no > > > > explanation for it. Is there a timing limit or was there another > > > > register operation which removed the mailbox message? > > > > > > > > > > I explained the root cause of the issue in the last commit. Maybe I should > > > have added the explanation to the cover letter as well. Anyway, here is a > > > partial verbatim copy of it: > > > > > > "During testing of SR-IOV, Red Hat QE encountered an issue where the > > > ip link up command intermittently fails for the igbvf interfaces when > > > using the PREEMPT_RT variant. Investigation revealed that > > > e1000_write_posted_mbx returns an error due to the lack of an ACK > > > from e1000_poll_for_ack. > > > > That ACK would have come if it would poll longer? > > > No, the service wouldn't be serviced while polling. Hmm. > > > The underlying issue arises from the fact that IRQs are threaded by > > > default under PREEMPT_RT. While the exact hardware details are not > > > available, it appears that the IRQ handled by igb_msix_other must > > > be processed before e1000_poll_for_ack times out. However, > > > e1000_write_posted_mbx is called with preemption disabled, leading > > > to a scenario where the IRQ is serviced only after the failure of > > > e1000_write_posted_mbx." > > > > Where is this disabled preemption coming from? This should be one of the > > ops.write_posted() calls, right? I've been looking around and don't see > > anything obvious. > > I don't remember if I found the answer by looking at the code or by > looking at the ftrace flags. > I am currently on sick leave with covid. I can check it when I come back. Don't worry, get better first. I'm kind of off myself. I'm not sure if I have the hardware needed to setup so I can look at it… Sebastian
On Thu, Jan 09, 2025 at 06:45:12PM +0100, Sebastian Andrzej Siewior wrote: > On 2025-01-09 13:46:47 [-0300], Wander Lairson Costa wrote: > > > If the issue is indeed the use of threaded interrupts then the fix > > > should not be limited to be PREEMPT_RT only. > > > > > Although I was not aware of this scenario, the patch should work for it as well, > > as I am forcing it to run in interrupt context. I will test it to confirm. I tested with the stock kernel with threadirqs and the problem does show up. Applying the patches the issue is gone. > > If I remember correctly there were "ifdef preempt_rt" things in it. That exists only to handle the case in which part in which the ISR needs to partially run in thread context (because the piece of code calling kmalloc), so I need an sleeping lock for that. For non-PREEMPT_RT, we don't have this constrain. > > > > > > - What causes the failure? I see you reworked into two parts to behave > > > > > similar to what happens without threaded interrupts. There is still no > > > > > explanation for it. Is there a timing limit or was there another > > > > > register operation which removed the mailbox message? > > > > > > > > > > > > > I explained the root cause of the issue in the last commit. Maybe I should > > > > have added the explanation to the cover letter as well. Anyway, here is a > > > > partial verbatim copy of it: > > > > > > > > "During testing of SR-IOV, Red Hat QE encountered an issue where the > > > > ip link up command intermittently fails for the igbvf interfaces when > > > > using the PREEMPT_RT variant. Investigation revealed that > > > > e1000_write_posted_mbx returns an error due to the lack of an ACK > > > > from e1000_poll_for_ack. > > > > > > That ACK would have come if it would poll longer? No, the poll happens while preemption is disabled. > > > > > No, the service wouldn't be serviced while polling. s/service/interrupt/. Since we can't sleep at this context, there is no way to wait for an event. > > Hmm. > > > > > The underlying issue arises from the fact that IRQs are threaded by > > > > default under PREEMPT_RT. While the exact hardware details are not > > > > available, it appears that the IRQ handled by igb_msix_other must > > > > be processed before e1000_poll_for_ack times out. However, > > > > e1000_write_posted_mbx is called with preemption disabled, leading > > > > to a scenario where the IRQ is serviced only after the failure of > > > > e1000_write_posted_mbx." > > > > > > Where is this disabled preemption coming from? This should be one of the > > > ops.write_posted() calls, right? I've been looking around and don't see > > > anything obvious. > > > > I don't remember if I found the answer by looking at the code or by > > looking at the ftrace flags. > > I am currently on sick leave with covid. I can check it when I come back. > > Don't worry, get better first. I'm kind of off myself. I'm not sure if I > have the hardware needed to setup so I can look at it… > The reason of why you didn't find is because the interrupt in the igb driver is triggered inside the igbvf code. igbvf_reset() calls spin_lock_bh() [1], although in the cases I found it was already called with preemption disabled from process_one_work() (workqueue) and netlink_sendmsg(). Here is an ftrace log for the failure case: kworker/-86 0...1 85.381866: function: igbvf_reset kworker/-86 0...2 85.381866: function: e1000_reset_hw_vf kworker/-86 0...2 85.381867: function: e1000_check_for_rst_vf kworker/-86 0...2 85.381868: function: e1000_write_posted_mbx kworker/-86 0...2 85.381868: function: e1000_write_mbx_vf kworker/-86 0...2 85.381870: function: e1000_check_for_ack_vf // repeats for 2000 lines ... kworker/-86 0.N.2 86.393782: function: e1000_read_posted_mbx kworker/-86 0.N.2 86.398606: function: e1000_init_hw_vf kworker/-86 0.N.2 86.398606: function: e1000_rar_set_vf kworker/-86 0.N.2 86.398606: function: e1000_write_posted_mbx irq/65-e-1287 0d..1 86.398609: function: igb_msix_other irq/65-e-1287 0d..1 86.398609: function: igb_rd32 irq/65-e-1287 0d..2 86.398610: function: igb_check_for_rst irq/65-e-1287 0d..2 86.398610: function: igb_check_for_rst_pf irq/65-e-1287 0d..2 86.398610: function: igb_rd32 irq/65-e-1287 0d..2 86.398611: function: igb_check_for_msg irq/65-e-1287 0d..2 86.398611: function: igb_check_for_msg_pf irq/65-e-1287 0d..2 86.398611: function: igb_rd32 irq/65-e-1287 0d..2 86.398612: function: igb_rcv_msg_from_vf irq/65-e-1287 0d..2 86.398612: function: igb_read_mbx irq/65-e-1287 0d..2 86.398612: function: igb_read_mbx_pf irq/65-e-1287 0d..2 86.398612: function: igb_obtain_mbx_lock_pf irq/65-e-1287 0d..2 86.398612: function: igb_rd32 Notice the interrupt handler only executes after e1000_write_posted() returns. And here it is for the sucessful case: ip-5603 0...1 1884.710747: function: igbvf_reset ip-5603 0...2 1884.710754: function: e1000_reset_hw_vf ip-5603 0...2 1884.710755: function: e1000_check_for_rst_vf ip-5603 0...2 1884.710756: function: e1000_write_posted_mbx ip-5603 0...2 1884.710756: function: e1000_write_mbx_vf ip-5603 0...2 1884.710758: function: e1000_check_for_ack_vf ip-5603 0d.h2 1884.710760: function: igb_msix_other ip-5603 0d.h2 1884.710760: function: igb_rd32 ip-5603 0d.h3 1884.710761: function: igb_check_for_rst ip-5603 0d.h3 1884.710761: function: igb_check_for_rst_pf ip-5603 0d.h3 1884.710761: function: igb_rd32 ip-5603 0d.h3 1884.710762: function: igb_check_for_msg ip-5603 0d.h3 1884.710762: function: igb_check_for_msg_pf ip-5603 0d.h3 1884.710762: function: igb_rd32 ip-5603 0d.h3 1884.710763: function: igb_rcv_msg_from_vf ip-5603 0d.h3 1884.710763: function: igb_read_mbx ip-5603 0d.h3 1884.710763: function: igb_read_mbx_pf ip-5603 0d.h3 1884.710763: function: igb_obtain_mbx_lock_pf ip-5603 0d.h3 1884.710763: function: igb_rd32 The ISR executes immediately fater e1000_write_mbx_vf(). [1] https://elixir.bootlin.com/linux/v6.12.6/source/drivers/net/ethernet/intel/igbvf/netdev.c#L1522 > Sebastian >