Message ID: 20190619202533.4856-1-nhorman@tuxdriver.com
State: Changes Requested
Delegated to: David Miller
Series: [net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET
On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> When an application is run that:
> a) Sets its scheduler to be SCHED_FIFO
> and
> b) Opens a memory mapped AF_PACKET socket, and sends frames with the
> MSG_DONTWAIT flag cleared, it's possible for the application to hang
> forever in the kernel. This occurs because when waiting, the code in
> tpacket_snd calls schedule, which under normal circumstances allows
> other tasks to run, including ksoftirqd, which in some cases is
> responsible for freeing the transmitted skb (which in AF_PACKET calls a
> destructor that flips the status bit of the transmitted frame back to
> available, allowing the transmitting task to complete).
>
> However, when the calling application is SCHED_FIFO, its priority is
> such that the schedule call immediately places the task back on the cpu,
> preventing ksoftirqd from freeing the skb, which in turn prevents the
> transmitting task from detecting that the transmission is complete.
>
> We can fix this by converting the schedule call to a completion
> mechanism. By using a completion queue, we force the calling task, when
> it detects there are no more frames to send, to schedule itself off the
> cpu until such time as the last transmitted skb is freed, allowing
> forward progress to be made.
>
> Tested by myself and the reporter, with good results
>
> Applies to the net tree
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Reported-by: Matteo Croce <mcroce@redhat.com>
> CC: "David S. Miller" <davem@davemloft.net>
> ---

This is a complex change for a narrow configuration. Isn't a SCHED_FIFO
process preempting ksoftirqd a potential problem for other networking
workloads as well? And wouldn't the right configuration be to always
increase ksoftirqd priority when increasing another process's priority?
Also, even when ksoftirqd kicks in, isn't some progress still made on the
local_bh_enable reached from schedule()?
On Thu, Jun 20, 2019 at 3:42 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> [patch description snipped]
>
> This is a complex change for a narrow configuration. Isn't a
> SCHED_FIFO process preempting ksoftirqd a potential problem for other
> networking workloads as well? And the right configuration to always
> increase ksoftirqd priority when increasing another process's
> priority?
> Also, even when ksoftirqd kicks in, isn't some progress
> still made on the local_bh_enable reached from schedule()?

Hi Willem,

From my tests, it seems that when the application hangs, raising the
ksoftirqd priority doesn't help. If the application priority is the same as
ksoftirqd's, it loops forever; the only workaround is to lower the
application priority or set it to SCHED_NORMAL.

Regards,
On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> [patch description snipped]
>
> This is a complex change for a narrow configuration. Isn't a
> SCHED_FIFO process preempting ksoftirqd a potential problem for other
> networking workloads as well? And the right configuration to always
> increase ksoftirqd priority when increasing another process's
> priority?
> Also, even when ksoftirqd kicks in, isn't some progress
> still made on the local_bh_enable reached from schedule()?

A few questions here to answer:

Regarding other protocols having this problem, that's not the case, because
non-packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a
period of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set).
We could certainly do that, but the current implementation doesn't (opting
instead to wait indefinitely until the respective packet(s) have transmitted
or errored out), and I wanted to maintain that behavior. If there is
consensus that packet sockets should honor SNDTIMEO, then I can certainly do
that.

As for progress made by calling local_bh_enable: my read of the code doesn't
have the scheduler calling local_bh_enable at all. Instead, schedule uses
preempt_disable/preempt_enable_no_resched() to gain exclusive access to the
cpu, which ignores pending softirqs on re-enablement. Perhaps that needs to
change, but I'm averse to making scheduler changes for this (the
aforementioned concern about complex changes for a narrow use case).

Regarding raising the priority of ksoftirqd: that could be a solution, but
the priority would need to be raised to a high-priority SCHED_FIFO
parameter, and that gets back to making complex changes for a narrow problem
domain.

As for the complexity of the solution: given your comments, I think this is
the least complex and intrusive change that solves the given problem. We
need to find a way to force the calling task off the cpu while the
asynchronous operations in the transmit path complete, and we can do that
this way, or by honoring SK_SNDTIMEO. I'm fine with doing the latter, but I
didn't want to alter the current protocol behavior without consensus on
that.

Regards,
Neil
On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > [patch description snipped]
> >
> > This is a complex change for a narrow configuration.
> > Isn't a
> > SCHED_FIFO process preempting ksoftirqd a potential problem for other
> > networking workloads as well? And the right configuration to always
> > increase ksoftirqd priority when increasing another process's
> > priority? Also, even when ksoftirqd kicks in, isn't some progress
> > still made on the local_bh_enable reached from schedule()?
>
> A few questions here to answer:

Thanks for the detailed explanation.

> Regarding other protocols having this problem, that's not the case, because
> non-packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a
> period of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set).
> We could certainly do that, but the current implementation doesn't (opting
> instead to wait indefinitely until the respective packet(s) have transmitted
> or errored out), and I wanted to maintain that behavior. If there is
> consensus that packet sockets should honor SNDTIMEO, then I can certainly
> do that.
>
> As for progress made by calling local_bh_enable, my read of the code doesn't
> have the scheduler calling local_bh_enable at all. Instead schedule uses
> preempt_disable/preempt_enable_no_resched() to gain exclusive access to the
> cpu, which ignores pending softirqs on re-enablement.

Ah, I'm mistaken there, then.

> Perhaps that needs to change, but I'm averse to making scheduler changes
> for this (the aforementioned concern about complex changes for a narrow
> use case)
>
> Regarding raising the priority of ksoftirqd, that could be a solution, but
> the priority would need to be raised to a high priority SCHED_FIFO
> parameter, and that gets back to making complex changes for a narrow
> problem domain
>
> As for the complexity of the solution, I think this is, given your
> comments, the least complex and intrusive change to solve the given problem.

Could it be simpler to ensure do_softirq() gets run here? That would
allow progress for this case.
> We need to find a way to force the calling task off the cpu while the
> asynchronous operations in the transmit path complete, and we can do that
> this way, or by honoring SK_SNDTIMEO. I'm fine with doing the latter, but I
> didn't want to alter the current protocol behavior without consensus on
> that.

In general SCHED_FIFO is dangerous with regard to stalling other
progress, incl. ksoftirqd. But it does appear that this packet socket
case is special inside networking in calling schedule() directly here.

If converting that, should it convert to logic more akin to other
sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
the pros and cons of completion here yet, sorry. Didn't want to delay
responding until after I get a chance.
On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> [patch description and earlier discussion snipped]
>
> Thanks for the detailed explanation.

Gladly.

> > As for the complexity of the solution, I think this is, given your
> > comments, the least complex and intrusive change to solve the given
> > problem.
>
> Could it be simpler to ensure do_softirq() gets run here? That would
> allow progress for this case.

I'm not sure. On the surface, we certainly could do it, but inserting a call
to do_softirq, either directly or indirectly through some other mechanism,
seems like a non-obvious fix, and may lead to confusion down the road. I'm
hesitant to pursue such a solution without some evidence it would make a
better solution.

> > We need to find a way to force the calling task off the cpu while the
> > asynchronous operations in the transmit path complete, and we can do that
> > this way, or by honoring SK_SNDTIMEO.
>
> In general SCHED_FIFO is dangerous with regard to stalling other
> progress, incl. ksoftirqd. But it does appear that this packet socket
> case is special inside networking in calling schedule() directly here.
>
> If converting that, should it convert to logic more akin to other
> sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
> the pros and cons of completion here yet, sorry. Didn't want to delay
> responding until after I get a chance.

That would be the solution described above (i.e. honoring SK_SNDTIMEO).
Basically you call sock_send_waittimeo, which returns a timeout value (or 0
if MSG_DONTWAIT is set), then you block for that period of time waiting for
transmit completion. I'm happy to implement that solution, but I'd like to
get some clarity as to whether there is a reason we don't currently honor
that socket option before I change the behavior that way.

Dave, do you have any insight into AF_PACKET history as to why we would
ignore the send timeout socket option here?

Best,
Neil
On Thu, Jun 20, 2019 at 12:14 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> [patch description and earlier discussion snipped]
>
> That would be the solution described above (i.e. honoring SK_SNDTIMEO.
> Basically you call sock_send_waittimeo, which returns a timeout value, or 0 if
> MSG_DONTWAIT is set), then you block for that period of time waiting for
> transmit completion.

From an ABI point of view, starting to support SK_SNDTIMEO where it
currently is not implemented certainly seems fine.

> I'm happy to implement that solution, but I'd like to get
> some clarity as to if there is a reason we don't currently honor that socket
> option now before I change the behavior that way.
>
> Dave, do you have any insight into AF_PACKET history as to why we would ignore
> the send timeout socket option here?

On the point of calling schedule(): even if that is rare, there are a lot of
other cond_resched() calls that may have the same starvation issue
with SCHED_FIFO and ksoftirqd.
On Thu, Jun 20, 2019 at 12:18:41PM -0400, Willem de Bruijn wrote:
> [patch description and earlier discussion snipped]
>
> From an ABI point of view, starting to support SK_SNDTIMEO where it
> currently is not implemented certainly seems fine.

I would agree. I was just thinking that since AF_PACKET is something of a
unique protocol (i.e. no real protocol), there may have been a reason to
ignore that option, but I agree, it probably makes sense, barring any
counter reasoning.

> > I'm happy to implement that solution, but I'd like to get
> > some clarity as to if there is a reason we don't currently honor that
> > socket option now before I change the behavior that way.
> >
> > Dave, do you have any insight into AF_PACKET history as to why we would
> > ignore the send timeout socket option here?
>
> On the point of calling schedule(): even if that is rare, there are a lot
> of other cond_resched() calls that may have the same starvation issue
> with SCHED_FIFO and ksoftirqd.

Agreed, but I've not found any yet.

Neil
On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote: > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote: > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > When an application is run that: > > > > a) Sets its scheduler to be SCHED_FIFO > > > > and > > > > b) Opens a memory mapped AF_PACKET socket, and sends frames with the > > > > MSG_DONTWAIT flag cleared, its possible for the application to hang > > > > forever in the kernel. This occurs because when waiting, the code in > > > > tpacket_snd calls schedule, which under normal circumstances allows > > > > other tasks to run, including ksoftirqd, which in some cases is > > > > responsible for freeing the transmitted skb (which in AF_PACKET calls a > > > > destructor that flips the status bit of the transmitted frame back to > > > > available, allowing the transmitting task to complete). > > > > > > > > However, when the calling application is SCHED_FIFO, its priority is > > > > such that the schedule call immediately places the task back on the cpu, > > > > preventing ksoftirqd from freeing the skb, which in turn prevents the > > > > transmitting task from detecting that the transmission is complete. > > > > > > > > We can fix this by converting the schedule call to a completion > > > > mechanism. By using a completion queue, we force the calling task, when > > > > it detects there are no more frames to send, to schedule itself off the > > > > cpu until such time as the last transmitted skb is freed, allowing > > > > forward progress to be made. > > > > > > > > Tested by myself and the reporter, with good results > > > > > > > > Appies to the net tree > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > Reported-by: Matteo Croce <mcroce@redhat.com> > > > > CC: "David S. 
Miller" <davem@davemloft.net> > > > > --- > > > > > > This is a complex change for a narrow configuration. Isn't a > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other > > > networking workloads as well? And the right configuration to always > > > increase ksoftirqd priority when increasing another process's > > > priority? Also, even when ksoftirqd kicks in, isn't some progress > > > still made on the local_bh_enable reached from schedule()? > > > > > > > A few questions here to answer: > > Thanks for the detailed explanation. > > > Regarding other protocols having this problem, that's not the case, because non > > packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a period > > of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set). We could > > certainly do that, but the current implementation doesn't (opting instead to > > wait indefinitely until the respective packet(s) have transmitted or errored > > out), and I wanted to maintain that behavior. If there is consensus that packet > > sockets should honor SNDTIMEO, then I can certainly do that. > > > > As for progress made by calling local_bh_enable, my read of the code doesn't > > have the scheduler calling local_bh_enable at all. Instead schedule uses > > preempt_disable/preempt_enable_no_resched() to gain exclusive access to the cpu, > > which ignores pending softirqs on re-enablement. > > Ah, I'm mistaken there, then. 
> > > Perhaps that needs to change, > > but I'm averse to making scheduler changes for this (the aforementioned concern > > about complex changes for a narrow use case) > > > > Regarding raising the priority of ksoftirqd, that could be a solution, but the > > priority would need to be raised to a high priority SCHED_FIFO parameter, and > > that gets back to making complex changes for a narrow problem domain > > > > As for the complexity of the solution, I think this is, given your > > comments, the least complex and intrusive change to solve the given problem. > > Could it be simpler to ensure do_softirq() gets run here? That would > allow progress for this case. > > > We > > need to find a way to force the calling task off the cpu while the asynchronous > > operations in the transmit path complete, and we can do that this way, or by > > honoring SK_SNDTIMEO. I'm fine with doing the latter, but I didn't want to > > alter the current protocol behavior without consensus on that. > > In general SCHED_FIFO is dangerous with regard to stalling other > progress, incl. ksoftirqd. But it does appear that this packet socket > case is special inside networking in calling schedule() directly here. > > If converting that, should it convert to logic more akin to other > sockets, like sock_wait_for_wmem? I haven't had a chance to read up on > the pros and cons of completion here yet, sorry. Didn't want to delay > responding until after I get a chance. > So, I started looking at implementing SOCK_SNDTIMEO for this patch, and something occurred to me... We still need a mechanism to block in tpacket_snd. That is to say, other protocols use SK_SNDTIMEO to wait for socket memory to become available, and that requirement doesn't exist for memory-mapped sockets in AF_PACKET (which tpacket_snd implements the kernel side for). We have memory-mapped frame buffers, which we marshal with an otherwise empty skb, and just send that (i.e. no waiting on socket memory, we just produce an error if we don't have enough RAM to allocate an sk_buff). Given that, we only ever need to wait for a frame to complete transmission, or get freed in an error path further down the stack. This probably explains why SK_SNDTIMEO doesn't exist for AF_PACKET. Given that, is it worth implementing (as it will just further complicate this patch, for no additional value), or can we move forward with it as it is? Neil
On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote: > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote: > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > > > When an application is run that: > > > > > a) Sets its scheduler to be SCHED_FIFO > > > > > and > > > > > b) Opens a memory mapped AF_PACKET socket, and sends frames with the > > > > > MSG_DONTWAIT flag cleared, its possible for the application to hang > > > > > forever in the kernel. This occurs because when waiting, the code in > > > > > tpacket_snd calls schedule, which under normal circumstances allows > > > > > other tasks to run, including ksoftirqd, which in some cases is > > > > > responsible for freeing the transmitted skb (which in AF_PACKET calls a > > > > > destructor that flips the status bit of the transmitted frame back to > > > > > available, allowing the transmitting task to complete). > > > > > > > > > > However, when the calling application is SCHED_FIFO, its priority is > > > > > such that the schedule call immediately places the task back on the cpu, > > > > > preventing ksoftirqd from freeing the skb, which in turn prevents the > > > > > transmitting task from detecting that the transmission is complete. > > > > > > > > > > We can fix this by converting the schedule call to a completion > > > > > mechanism. By using a completion queue, we force the calling task, when > > > > > it detects there are no more frames to send, to schedule itself off the > > > > > cpu until such time as the last transmitted skb is freed, allowing > > > > > forward progress to be made. 
> > > > > > > > > > Tested by myself and the reporter, with good results > > > > > > > > > > Appies to the net tree > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > Reported-by: Matteo Croce <mcroce@redhat.com> > > > > > CC: "David S. Miller" <davem@davemloft.net> > > > > > --- > > > > > > > > This is a complex change for a narrow configuration. Isn't a > > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other > > > > networking workloads as well? And the right configuration to always > > > > increase ksoftirqd priority when increasing another process's > > > > priority? Also, even when ksoftirqd kicks in, isn't some progress > > > > still made on the local_bh_enable reached from schedule()? > > > > > > > > > > A few questions here to answer: > > > > Thanks for the detailed explanation. > > > > > Regarding other protocols having this problem, thats not the case, because non > > > packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a period > > > of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set. We could > > > certainly do that, but the current implementation doesn't (opting instead to > > > wait indefinately until the respective packet(s) have transmitted or errored > > > out), and I wanted to maintain that behavior. If there is consensus that packet > > > sockets should honor SNDTIMEO, then I can certainly do that. > > > > > > As for progress made by calling local_bh_enable, My read of the code doesn't > > > have the scheduler calling local_bh_enable at all. Instead schedule uses > > > preempt_disable/preempt_enable_no_resched() to gain exlcusive access to the cpu, > > > which ignores pending softirqs on re-enablement. > > > > Ah, I'm mistaken there, then. 
> > > > > Perhaps that needs to change, > > > but I'm averse to making scheduler changes for this (the aforementioned concern > > > about complex changes for a narrow use case) > > > > > > Regarding raising the priority of ksoftirqd, that could be a solution, but the > > > priority would need to be raised to a high priority SCHED_FIFO parameter, and > > > that gets back to making complex changes for a narrow problem domain > > > > > > As for the comlexity of the of the solution, I think this is, given your > > > comments the least complex and intrusive change to solve the given problem. > > > > Could it be simpler to ensure do_softirq() gets run here? That would > > allow progress for this case. > > > > > We > > > need to find a way to force the calling task off the cpu while the asynchronous > > > operations in the transmit path complete, and we can do that this way, or by > > > honoring SK_SNDTIMEO. I'm fine with doing the latter, but I didn't want to > > > alter the current protocol behavior without consensus on that. > > > > In general SCHED_FIFO is dangerous with regard to stalling other > > progress, incl. ksoftirqd. But it does appear that this packet socket > > case is special inside networking in calling schedule() directly here. > > > > If converting that, should it convert to logic more akin to other > > sockets, like sock_wait_for_wmem? I haven't had a chance to read up on > > the pros and cons of completion here yet, sorry. Didn't want to delay > > responding until after I get a chance. > > > So, I started looking at implementing SOCK_SNDTIMEO for this patch, and > something occured to me....We still need a mechanism to block in tpacket_snd. > That is to say, other protocol use SK_SNDTIMEO to wait for socket memory to > become available, and that requirement doesn't exist for memory mapped sockets > in AF_PACKET (which tpacket_snd implements the kernel side for). 
We have memory > > mapped frame buffers, which we marshall with an otherwise empty skb, and just > > send that (i.e. no waiting on socket memory, we just product an error if we > > don't have enough ram to allocate an sk_buff). Given that, we only ever need to > > wait for a frame to complete transmission, or get freed in an error path further > > down the stack. This probably explains why SK_SNDTIMEO doesn't exist for > > AF_PACKET. SNDTIMEO behavior would still be useful: to wait for a frame slot to become available, but only up to a timeout? To be clear, adding that is not a prerequisite for fixing this specific issue, of course. It would just be nice if the one happens to be fixed by adding the other. My main question is wrt the implementation details of struct completion. Without dynamic memory allocation, sock_wait_for_wmem/sk_stream_write_space obviously does not make sense. But should we still use sk_wq, and more importantly, does this need the same wait semantics (TASK_INTERRUPTIBLE), and does struct completion give those? > > Given that, is it worth implementing (as it will just further complicate this > patch, for no additional value), or can we move forward with it as it is? > > Neil >
On Fri, Jun 21, 2019 at 02:31:17PM -0400, Willem de Bruijn wrote: > On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote: > > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote: > > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > > > > > When an application is run that: > > > > > > a) Sets its scheduler to be SCHED_FIFO > > > > > > and > > > > > > b) Opens a memory mapped AF_PACKET socket, and sends frames with the > > > > > > MSG_DONTWAIT flag cleared, its possible for the application to hang > > > > > > forever in the kernel. This occurs because when waiting, the code in > > > > > > tpacket_snd calls schedule, which under normal circumstances allows > > > > > > other tasks to run, including ksoftirqd, which in some cases is > > > > > > responsible for freeing the transmitted skb (which in AF_PACKET calls a > > > > > > destructor that flips the status bit of the transmitted frame back to > > > > > > available, allowing the transmitting task to complete). > > > > > > > > > > > > However, when the calling application is SCHED_FIFO, its priority is > > > > > > such that the schedule call immediately places the task back on the cpu, > > > > > > preventing ksoftirqd from freeing the skb, which in turn prevents the > > > > > > transmitting task from detecting that the transmission is complete. > > > > > > > > > > > > We can fix this by converting the schedule call to a completion > > > > > > mechanism. By using a completion queue, we force the calling task, when > > > > > > it detects there are no more frames to send, to schedule itself off the > > > > > > cpu until such time as the last transmitted skb is freed, allowing > > > > > > forward progress to be made. 
> > > > > > > > > > > > Tested by myself and the reporter, with good results > > > > > > > > > > > > Appies to the net tree > > > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > Reported-by: Matteo Croce <mcroce@redhat.com> > > > > > > CC: "David S. Miller" <davem@davemloft.net> > > > > > > --- > > > > > > > > > > This is a complex change for a narrow configuration. Isn't a > > > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other > > > > > networking workloads as well? And the right configuration to always > > > > > increase ksoftirqd priority when increasing another process's > > > > > priority? Also, even when ksoftirqd kicks in, isn't some progress > > > > > still made on the local_bh_enable reached from schedule()? > > > > > > > > > > > > > A few questions here to answer: > > > > > > Thanks for the detailed explanation. > > > > > > > Regarding other protocols having this problem, thats not the case, because non > > > > packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a period > > > > of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set. We could > > > > certainly do that, but the current implementation doesn't (opting instead to > > > > wait indefinately until the respective packet(s) have transmitted or errored > > > > out), and I wanted to maintain that behavior. If there is consensus that packet > > > > sockets should honor SNDTIMEO, then I can certainly do that. > > > > > > > > As for progress made by calling local_bh_enable, My read of the code doesn't > > > > have the scheduler calling local_bh_enable at all. Instead schedule uses > > > > preempt_disable/preempt_enable_no_resched() to gain exlcusive access to the cpu, > > > > which ignores pending softirqs on re-enablement. > > > > > > Ah, I'm mistaken there, then. 
> > > > > > > Perhaps that needs to change, > > > > but I'm averse to making scheduler changes for this (the aforementioned concern > > > > about complex changes for a narrow use case) > > > > > > > > Regarding raising the priority of ksoftirqd, that could be a solution, but the > > > > priority would need to be raised to a high priority SCHED_FIFO parameter, and > > > > that gets back to making complex changes for a narrow problem domain > > > > > > > > As for the comlexity of the of the solution, I think this is, given your > > > > comments the least complex and intrusive change to solve the given problem. > > > > > > Could it be simpler to ensure do_softirq() gets run here? That would > > > allow progress for this case. > > > > > > > We > > > > need to find a way to force the calling task off the cpu while the asynchronous > > > > operations in the transmit path complete, and we can do that this way, or by > > > > honoring SK_SNDTIMEO. I'm fine with doing the latter, but I didn't want to > > > > alter the current protocol behavior without consensus on that. > > > > > > In general SCHED_FIFO is dangerous with regard to stalling other > > > progress, incl. ksoftirqd. But it does appear that this packet socket > > > case is special inside networking in calling schedule() directly here. > > > > > > If converting that, should it convert to logic more akin to other > > > sockets, like sock_wait_for_wmem? I haven't had a chance to read up on > > > the pros and cons of completion here yet, sorry. Didn't want to delay > > > responding until after I get a chance. > > > > > So, I started looking at implementing SOCK_SNDTIMEO for this patch, and > > something occured to me....We still need a mechanism to block in tpacket_snd. > > That is to say, other protocol use SK_SNDTIMEO to wait for socket memory to > > become available, and that requirement doesn't exist for memory mapped sockets > > in AF_PACKET (which tpacket_snd implements the kernel side for). 
We have memory > > mapped frame buffers, which we marshall with an otherwise empty skb, and just > > send that (i.e. no waiting on socket memory, we just product an error if we > > don't have enough ram to allocate an sk_buff). Given that, we only ever need to > > wait for a frame to complete transmission, or get freed in an error path further > > down the stack. This probably explains why SK_SNDTIMEO doesn't exist for > > AF_PACKET. > > SNDTIMEO behavior would still be useful: to wait for frame slot to > become available, but only up to a timeout? > Ok, that's fair. To be clear, memory-mapped packets aren't waiting here for a > frame to become available for sending (that's done in userspace, where the > application checks a specific memory location for the TP_AVAILABLE status to be > set, so a new frame can be written). tpacket_snd is waiting for the transmission > of a specific frame to complete the transmit action, which is a different thing. Still, I suppose there's no reason we couldn't constrain that wait on a timeout > set by SK_SNDTIMEO > > > To be clear, adding that is not a prerequisite for fixing this > > specific issue, of course. It would just be nice if the one happens to > > be fixed by adding the other. > > > > My main question is wrt the implementation details of struct > > completion. Without dynamic memory allocation, > > sock_wait_for_wmem/sk_stream_write_space obviously does not make > > sense. But should we still use sk_wq and more importantly does this > > need the same wait semantics (TASK_INTERRUPTIBLE) and does struct > > completion give those? > > I suppose we could overload its use here, but I would be worried that such an > overload would be confusing. Nominally, sk_wq is used, as you note, to block > sockets whose allocated send space is full, until such time as enough frames > have been sent to make space for the next write. 
In this scenario, we already know we have space to send a frame (by virtue of the fact that we are executing in tpacket_snd, which is only called after userspace has written a frame to the memory-mapped buffer already allocated for frame transmission). In this case we are simply waiting for the last frame that we have sent to complete transmission, at which point we can look to see if more frames are available to send, or return from the system call. I'm happy to take an alternate consensus into account, but for the sake of clarity I think I would rather use the completion queue, as it makes clear the correlation between the waiter and the event we are waiting on. > > > > Given that, is it worth implementing (as it will just further complicate this > > patch, for no additional value), or can we move forward with it as it is? > > > > Neil > > >
On Fri, Jun 21, 2019 at 3:18 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > On Fri, Jun 21, 2019 at 02:31:17PM -0400, Willem de Bruijn wrote: > > On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote: > > > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote: > > > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > > > > > > > When an application is run that: > > > > > > > a) Sets its scheduler to be SCHED_FIFO > > > > > > > and > > > > > > > b) Opens a memory mapped AF_PACKET socket, and sends frames with the > > > > > > > MSG_DONTWAIT flag cleared, its possible for the application to hang > > > > > > > forever in the kernel. This occurs because when waiting, the code in > > > > > > > tpacket_snd calls schedule, which under normal circumstances allows > > > > > > > other tasks to run, including ksoftirqd, which in some cases is > > > > > > > responsible for freeing the transmitted skb (which in AF_PACKET calls a > > > > > > > destructor that flips the status bit of the transmitted frame back to > > > > > > > available, allowing the transmitting task to complete). > > > > > > > > > > > > > > However, when the calling application is SCHED_FIFO, its priority is > > > > > > > such that the schedule call immediately places the task back on the cpu, > > > > > > > preventing ksoftirqd from freeing the skb, which in turn prevents the > > > > > > > transmitting task from detecting that the transmission is complete. > > > > > > > > > > > > > > We can fix this by converting the schedule call to a completion > > > > > > > mechanism. 
By using a completion queue, we force the calling task, when > > > > > > > it detects there are no more frames to send, to schedule itself off the > > > > > > > cpu until such time as the last transmitted skb is freed, allowing > > > > > > > forward progress to be made. > > > > > > > > > > > > > > Tested by myself and the reporter, with good results > > > > > > > > > > > > > > Appies to the net tree > > > > > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > > Reported-by: Matteo Croce <mcroce@redhat.com> > > > > > > > CC: "David S. Miller" <davem@davemloft.net> > > > > > > > --- > > > > > > > > > > > > This is a complex change for a narrow configuration. Isn't a > > > > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other > > > > > > networking workloads as well? And the right configuration to always > > > > > > increase ksoftirqd priority when increasing another process's > > > > > > priority? Also, even when ksoftirqd kicks in, isn't some progress > > > > > > still made on the local_bh_enable reached from schedule()? > > > > > > > > > > > > > > > > A few questions here to answer: > > > > > > > > Thanks for the detailed explanation. > > > > > > > > > Regarding other protocols having this problem, thats not the case, because non > > > > > packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a period > > > > > of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set. We could > > > > > certainly do that, but the current implementation doesn't (opting instead to > > > > > wait indefinately until the respective packet(s) have transmitted or errored > > > > > out), and I wanted to maintain that behavior. If there is consensus that packet > > > > > sockets should honor SNDTIMEO, then I can certainly do that. > > > > > > > > > > As for progress made by calling local_bh_enable, My read of the code doesn't > > > > > have the scheduler calling local_bh_enable at all. 
Instead schedule uses > > > > > preempt_disable/preempt_enable_no_resched() to gain exlcusive access to the cpu, > > > > > which ignores pending softirqs on re-enablement. > > > > > > > > Ah, I'm mistaken there, then. > > > > > > > > > Perhaps that needs to change, > > > > > but I'm averse to making scheduler changes for this (the aforementioned concern > > > > > about complex changes for a narrow use case) > > > > > > > > > > Regarding raising the priority of ksoftirqd, that could be a solution, but the > > > > > priority would need to be raised to a high priority SCHED_FIFO parameter, and > > > > > that gets back to making complex changes for a narrow problem domain > > > > > > > > > > As for the comlexity of the of the solution, I think this is, given your > > > > > comments the least complex and intrusive change to solve the given problem. > > > > > > > > Could it be simpler to ensure do_softirq() gets run here? That would > > > > allow progress for this case. > > > > > > > > > We > > > > > need to find a way to force the calling task off the cpu while the asynchronous > > > > > operations in the transmit path complete, and we can do that this way, or by > > > > > honoring SK_SNDTIMEO. I'm fine with doing the latter, but I didn't want to > > > > > alter the current protocol behavior without consensus on that. > > > > > > > > In general SCHED_FIFO is dangerous with regard to stalling other > > > > progress, incl. ksoftirqd. But it does appear that this packet socket > > > > case is special inside networking in calling schedule() directly here. > > > > > > > > If converting that, should it convert to logic more akin to other > > > > sockets, like sock_wait_for_wmem? I haven't had a chance to read up on > > > > the pros and cons of completion here yet, sorry. Didn't want to delay > > > > responding until after I get a chance. 
> > > > > > > So, I started looking at implementing SOCK_SNDTIMEO for this patch, and > > > something occured to me....We still need a mechanism to block in tpacket_snd. > > > That is to say, other protocol use SK_SNDTIMEO to wait for socket memory to > > > become available, and that requirement doesn't exist for memory mapped sockets > > > in AF_PACKET (which tpacket_snd implements the kernel side for). We have memory > > > mapped frame buffers, which we marshall with an otherwise empty skb, and just > > > send that (i.e. no waiting on socket memory, we just product an error if we > > > don't have enough ram to allocate an sk_buff). Given that, we only ever need to > > > wait for a frame to complete transmission, or get freed in an error path further > > > down the stack. This probably explains why SK_SNDTIMEO doesn't exist for > > > AF_PACKET. > > > > SNDTIMEO behavior would still be useful: to wait for frame slot to > > become available, but only up to a timeout? > > > Ok, thats fair. To be clear, memory_mapped packets aren't waiting here for a > frame to become available for sending (thats done in userspace, where the > application checks a specific memory location for the TP_AVAILABLE status to be > set, so a new frame can be written). tpacket_snd is wating for the transmission > of a specific frame to complete the transmit action, which is a different thing. Right. Until this report I was actually not even aware of this behavior without MSG_DONTWAIT. Though the wait is not for a specific frame, right? Wait as long as the pending_refcnt, which is incremented on every loop. > Still, I suppose theres no reason we couldn't contrain that wait on a timeout > set by SK_SNDTIMEO > > > To be clear, adding that is not a prerequisite for fixing this > > specific issue, of course. It would just be nice if the one happens to > > be fixed by adding the other. > > > > My main question is wrt the implementation details of struct > > completion. 
Without dynamic memory allocation, > > sock_wait_for_wmem/sk_stream_write_space obviously does not make > > sense. But should we still use sk_wq and more importantly does this > > need the same wait semantics (TASK_INTERRUPTIBLE) and does struct > > completion give those? > > > I suppose we could overload its use here, but I would be worried that such an > overload would be confusing. Nominally, sk_wq is used, as you note, to block > sockets whose allocated send space is full, until such time as enough frames > have been sent to make space for the next write. In this scenario, we already > know we have space to send a frame (by virtue of the fact that we are executing > in tpakcet_snd, which is only called after userspace has written a frame to the > memory mapped buffer already allocated for frame transmission). In this case we > are simply waiting for the last frame that we have sent to complete > transmission, at which point we can look to see if more frames are available to > send, or return from the system call. I'm happy to take an alternate consensus > into account, but for the sake of clarity I think I would rather use the > completion queue, as it makes clear the correlation between the waiter and the > event we are waiting on. That will be less likely to have unexpected side-effects. Agreed. Thanks for the explanation. Only last question, does it have the right wait behavior? Should this be wait_for_completion_interruptible?
On Fri, Jun 21, 2019 at 04:06:09PM -0400, Willem de Bruijn wrote: > On Fri, Jun 21, 2019 at 3:18 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > On Fri, Jun 21, 2019 at 02:31:17PM -0400, Willem de Bruijn wrote: > > > On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote: > > > > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > > > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote: > > > > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote: > > > > > > > > > > > > > > > > When an application is run that: > > > > > > > > a) Sets its scheduler to be SCHED_FIFO > > > > > > > > and > > > > > > > > b) Opens a memory mapped AF_PACKET socket, and sends frames with the > > > > > > > > MSG_DONTWAIT flag cleared, its possible for the application to hang > > > > > > > > forever in the kernel. This occurs because when waiting, the code in > > > > > > > > tpacket_snd calls schedule, which under normal circumstances allows > > > > > > > > other tasks to run, including ksoftirqd, which in some cases is > > > > > > > > responsible for freeing the transmitted skb (which in AF_PACKET calls a > > > > > > > > destructor that flips the status bit of the transmitted frame back to > > > > > > > > available, allowing the transmitting task to complete). > > > > > > > > > > > > > > > > However, when the calling application is SCHED_FIFO, its priority is > > > > > > > > such that the schedule call immediately places the task back on the cpu, > > > > > > > > preventing ksoftirqd from freeing the skb, which in turn prevents the > > > > > > > > transmitting task from detecting that the transmission is complete. > > > > > > > > > > > > > > > > We can fix this by converting the schedule call to a completion > > > > > > > > mechanism. 
By using a completion queue, we force the calling task, when > > > > > > > > it detects there are no more frames to send, to schedule itself off the > > > > > > > > cpu until such time as the last transmitted skb is freed, allowing > > > > > > > > forward progress to be made. > > > > > > > > > > > > > > > > Tested by myself and the reporter, with good results > > > > > > > > > > > > > > > > Appies to the net tree > > > > > > > > > > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com> > > > > > > > > Reported-by: Matteo Croce <mcroce@redhat.com> > > > > > > > > CC: "David S. Miller" <davem@davemloft.net> > > > > > > > > --- > > > > > > > > > > > > > > This is a complex change for a narrow configuration. Isn't a > > > > > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other > > > > > > > networking workloads as well? And the right configuration to always > > > > > > > increase ksoftirqd priority when increasing another process's > > > > > > > priority? Also, even when ksoftirqd kicks in, isn't some progress > > > > > > > still made on the local_bh_enable reached from schedule()? > > > > > > > > > > > > > > > > > > > A few questions here to answer: > > > > > > > > > > Thanks for the detailed explanation. > > > > > > > > > > > Regarding other protocols having this problem, thats not the case, because non > > > > > > packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a period > > > > > > of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set. We could > > > > > > certainly do that, but the current implementation doesn't (opting instead to > > > > > > wait indefinately until the respective packet(s) have transmitted or errored > > > > > > out), and I wanted to maintain that behavior. If there is consensus that packet > > > > > > sockets should honor SNDTIMEO, then I can certainly do that. 
> > > > > > > > > > > > As for progress made by calling local_bh_enable, My read of the code doesn't > > > > > > have the scheduler calling local_bh_enable at all. Instead schedule uses > > > > > > preempt_disable/preempt_enable_no_resched() to gain exlcusive access to the cpu, > > > > > > which ignores pending softirqs on re-enablement. > > > > > > > > > > Ah, I'm mistaken there, then. > > > > > > > > > > > Perhaps that needs to change, > > > > > > but I'm averse to making scheduler changes for this (the aforementioned concern > > > > > > about complex changes for a narrow use case) > > > > > > > > > > > > Regarding raising the priority of ksoftirqd, that could be a solution, but the > > > > > > priority would need to be raised to a high priority SCHED_FIFO parameter, and > > > > > > that gets back to making complex changes for a narrow problem domain > > > > > > > > > > > > As for the comlexity of the of the solution, I think this is, given your > > > > > > comments the least complex and intrusive change to solve the given problem. > > > > > > > > > > Could it be simpler to ensure do_softirq() gets run here? That would > > > > > allow progress for this case. > > > > > > > > > > > We > > > > > > need to find a way to force the calling task off the cpu while the asynchronous > > > > > > operations in the transmit path complete, and we can do that this way, or by > > > > > > honoring SK_SNDTIMEO. I'm fine with doing the latter, but I didn't want to > > > > > > alter the current protocol behavior without consensus on that. > > > > > > > > > > In general SCHED_FIFO is dangerous with regard to stalling other > > > > > progress, incl. ksoftirqd. But it does appear that this packet socket > > > > > case is special inside networking in calling schedule() directly here. > > > > > > > > > > If converting that, should it convert to logic more akin to other > > > > > sockets, like sock_wait_for_wmem? 
> > > > > I haven't had a chance to read up on
> > > > > the pros and cons of completion here yet, sorry. Didn't want to delay
> > > > > responding until after I get a chance.
> > > > >
> > > > So, I started looking at implementing SOCK_SNDTIMEO for this patch, and
> > > > something occurred to me: we still need a mechanism to block in tpacket_snd.
> > > > That is to say, other protocols use SK_SNDTIMEO to wait for socket memory to
> > > > become available, and that requirement doesn't exist for memory mapped sockets
> > > > in AF_PACKET (which tpacket_snd implements the kernel side for). We have memory
> > > > mapped frame buffers, which we marshal with an otherwise empty skb, and just
> > > > send that (i.e. no waiting on socket memory; we just produce an error if we
> > > > don't have enough ram to allocate an sk_buff). Given that, we only ever need to
> > > > wait for a frame to complete transmission, or get freed in an error path further
> > > > down the stack. This probably explains why SK_SNDTIMEO doesn't exist for
> > > > AF_PACKET.
> > >
> > > SNDTIMEO behavior would still be useful: to wait for a frame slot to
> > > become available, but only up to a timeout?
> > >
> > Ok, that's fair. To be clear, memory-mapped packets aren't waiting here for a
> > frame to become available for sending (that's done in userspace, where the
> > application checks a specific memory location for the TP_AVAILABLE status to be
> > set, so a new frame can be written). tpacket_snd is waiting for the transmission
> > of a specific frame to complete the transmit action, which is a different thing.
>
> Right. Until this report I was actually not even aware of this
> behavior without MSG_DONTWAIT.
>
> Though the wait is not for a specific frame, right? Wait as long as
> the pending_refcnt, which is incremented on every loop.
>
> > Still, I suppose there's no reason we couldn't constrain that wait on a timeout
> > set by SK_SNDTIMEO.
> >
> > > To be clear, adding that is not a prerequisite for fixing this
> > > specific issue, of course. It would just be nice if the one happens to
> > > be fixed by adding the other.
> > >
> > > My main question is wrt the implementation details of struct
> > > completion. Without dynamic memory allocation,
> > > sock_wait_for_wmem/sk_stream_write_space obviously does not make
> > > sense. But should we still use sk_wq, and more importantly, does this
> > > need the same wait semantics (TASK_INTERRUPTIBLE), and does struct
> > > completion give those?
> > >
> > I suppose we could overload its use here, but I would be worried that such an
> > overload would be confusing. Nominally, sk_wq is used, as you note, to block
> > sockets whose allocated send space is full, until such time as enough frames
> > have been sent to make space for the next write. In this scenario, we already
> > know we have space to send a frame (by virtue of the fact that we are executing
> > in tpacket_snd, which is only called after userspace has written a frame to the
> > memory mapped buffer already allocated for frame transmission). In this case we
> > are simply waiting for the last frame that we have sent to complete
> > transmission, at which point we can look to see if more frames are available to
> > send, or return from the system call. I'm happy to take an alternate consensus
> > into account, but for the sake of clarity I think I would rather use the
> > completion queue, as it makes clear the correlation between the waiter and the
> > event we are waiting on.
>
> That will be less likely to have unexpected side-effects. Agreed.
> Thanks for the explanation.
>
> Only last question: does it have the right wait behavior? Should this
> be wait_for_completion_interruptible?
>
That's a good point.
Other protocols use the socket timeout along with sk_wait_event, which sets the
task state to TASK_INTERRUPTIBLE, so we should probably use
wait_for_completion_interruptible_timeout.

Thanks!
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a29d66da7394..e65f4ef18a06 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -358,7 +358,8 @@ static inline struct page * __pure pgv_to_page(void *addr)
 	return virt_to_page(addr);
 }
 
-static void __packet_set_status(struct packet_sock *po, void *frame, int status)
+static void __packet_set_status(struct packet_sock *po, void *frame, int status,
+				bool call_complete)
 {
 	union tpacket_uhdr h;
 
@@ -381,6 +382,8 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		BUG();
 	}
 
+	if (po->wait_on_complete && call_complete)
+		complete(&po->skb_completion);
 	smp_wmb();
 }
 
@@ -1148,6 +1151,14 @@ static void *packet_previous_frame(struct packet_sock *po,
 	return packet_lookup_frame(po, rb, previous, status);
 }
 
+static void *packet_next_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status)
+{
+	unsigned int next = rb->head != rb->frame_max ? rb->head+1 : 0;
+	return packet_lookup_frame(po, rb, next, status);
+}
+
 static void packet_increment_head(struct packet_ring_buffer *buff)
 {
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
@@ -2360,7 +2371,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 #endif
 
 	if (po->tp_version <= TPACKET_V2) {
-		__packet_set_status(po, h.raw, status);
+		__packet_set_status(po, h.raw, status, false);
 		sk->sk_data_ready(sk);
 	} else {
 		prb_clear_blk_fill_status(&po->rx_ring);
@@ -2400,7 +2411,7 @@ static void tpacket_destruct_skb(struct sk_buff *skb)
 		packet_dec_pending(&po->tx_ring);
 
 		ts = __packet_set_timestamp(po, ph, skb);
-		__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
+		__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts, true);
 	}
 
 	sock_wfree(skb);
@@ -2600,6 +2611,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int len_sum = 0;
 	int status = TP_STATUS_AVAILABLE;
 	int hlen, tlen, copylen = 0;
+	void *tmp;
 
 	mutex_lock(&po->pg_vec_lock);
 
@@ -2647,16 +2659,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		size_max = dev->mtu + reserve + VLAN_HLEN;
 
 	do {
+
+		if (po->wait_on_complete && need_wait) {
+			wait_for_completion(&po->skb_completion);
+			po->wait_on_complete = 0;
+		}
+
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
-		if (unlikely(ph == NULL)) {
-			if (need_wait && need_resched())
-				schedule();
-			continue;
-		}
+
+		if (likely(ph == NULL))
+			break;
 
 		skb = NULL;
 		tp_len = tpacket_parse_header(po, ph, size_max, &data);
+
 		if (tp_len < 0)
 			goto tpacket_error;
 
@@ -2699,7 +2716,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 tpacket_error:
 			if (po->tp_loss) {
 				__packet_set_status(po, ph,
-						TP_STATUS_AVAILABLE);
+						TP_STATUS_AVAILABLE, false);
 				packet_increment_head(&po->tx_ring);
 				kfree_skb(skb);
 				continue;
@@ -2719,7 +2736,9 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 
 		skb->destructor = tpacket_destruct_skb;
-		__packet_set_status(po, ph, TP_STATUS_SENDING);
+		__packet_set_status(po, ph, TP_STATUS_SENDING, false);
+		if (!packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST))
+			po->wait_on_complete = 1;
 		packet_inc_pending(&po->tx_ring);
 
 		status = TP_STATUS_SEND_REQUEST;
@@ -2753,7 +2772,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		goto out_put;
 
 out_status:
-	__packet_set_status(po, ph, status);
+	__packet_set_status(po, ph, status, false);
 	kfree_skb(skb);
 out_put:
 	dev_put(dev);
@@ -3207,6 +3226,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	sock_init_data(sock, sk);
 
 	po = pkt_sk(sk);
+	init_completion(&po->skb_completion);
 	sk->sk_family = PF_PACKET;
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 3bb7c5fb3bff..bbb4be2c18e7 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -128,6 +128,8 @@ struct packet_sock {
 	unsigned int		tp_hdrlen;
 	unsigned int		tp_reserve;
 	unsigned int		tp_tstamp;
+	struct completion	skb_completion;
+	unsigned int		wait_on_complete;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
When an application is run that:
a) Sets its scheduler to be SCHED_FIFO
and
b) Opens a memory mapped AF_PACKET socket, and sends frames with the
MSG_DONTWAIT flag cleared, it's possible for the application to hang
forever in the kernel. This occurs because when waiting, the code in
tpacket_snd calls schedule, which under normal circumstances allows
other tasks to run, including ksoftirqd, which in some cases is
responsible for freeing the transmitted skb (which in AF_PACKET calls a
destructor that flips the status bit of the transmitted frame back to
available, allowing the transmitting task to complete).

However, when the calling application is SCHED_FIFO, its priority is
such that the schedule call immediately places the task back on the cpu,
preventing ksoftirqd from freeing the skb, which in turn prevents the
transmitting task from detecting that the transmission is complete.

We can fix this by converting the schedule call to a completion
mechanism. By using a completion queue, we force the calling task, when
it detects there are no more frames to send, to schedule itself off the
cpu until such time as the last transmitted skb is freed, allowing
forward progress to be made.

Tested by myself and the reporter, with good results

Applies to the net tree

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Matteo Croce <mcroce@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 net/packet/af_packet.c | 42 +++++++++++++++++++++++++++++++-----------
 net/packet/internal.h  |  2 ++
 2 files changed, 33 insertions(+), 11 deletions(-)