
[net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET

Message ID 20190619202533.4856-1-nhorman@tuxdriver.com
State Changes Requested
Delegated to: David Miller
Series [net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET

Commit Message

Neil Horman June 19, 2019, 8:25 p.m. UTC
When an application is run that:
a) Sets its scheduler to be SCHED_FIFO
and
b) Opens a memory mapped AF_PACKET socket, and sends frames with the
MSG_DONTWAIT flag cleared, it's possible for the application to hang
forever in the kernel.  This occurs because when waiting, the code in
tpacket_snd calls schedule, which under normal circumstances allows
other tasks to run, including ksoftirqd, which in some cases is
responsible for freeing the transmitted skb (which in AF_PACKET calls a
destructor that flips the status bit of the transmitted frame back to
available, allowing the transmitting task to complete).

However, when the calling application is SCHED_FIFO, its priority is
such that the schedule call immediately places the task back on the cpu,
preventing ksoftirqd from freeing the skb, which in turn prevents the
transmitting task from detecting that the transmission is complete.

We can fix this by converting the schedule call to a completion
mechanism.  By using a completion queue, we force the calling task, when
it detects there are no more frames to send, to schedule itself off the
cpu until such time as the last transmitted skb is freed, allowing
forward progress to be made.

Tested by myself and the reporter, with good results.

Applies to the net tree.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Matteo Croce <mcroce@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 net/packet/af_packet.c | 42 +++++++++++++++++++++++++++++++-----------
 net/packet/internal.h  |  2 ++
 2 files changed, 33 insertions(+), 11 deletions(-)
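
The diff body is not reproduced on this page. As a rough illustration of the
change the message describes (a paraphrased sketch; identifiers such as
skb_completion are assumed, not quoted from the actual patch):

    /* Before: when no ring frame is ready, tpacket_snd yields via
     * schedule(), which a SCHED_FIFO task immediately wins back,
     * starving ksoftirqd of the cpu time it needs to free the skb:
     */
    if (unlikely(ph == NULL)) {
            if (need_wait && need_resched())
                    schedule();
            continue;
    }

    /* After: sleep on a completion that the skb destructor signals
     * when the last transmitted frame is freed, guaranteeing the task
     * leaves the cpu until forward progress is possible:
     */
    if (unlikely(ph == NULL)) {
            if (need_wait && skb)
                    wait_for_completion(&po->skb_completion);
            continue;
    }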

Comments

Willem de Bruijn June 20, 2019, 1:41 p.m. UTC | #1
On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> [snip commit message]

This is a complex change for a narrow configuration. Isn't a
SCHED_FIFO process preempting ksoftirqd a potential problem for other
networking workloads as well? And isn't the right configuration to always
increase ksoftirqd priority when increasing another process's
priority? Also, even when ksoftirqd kicks in, isn't some progress
still made on the local_bh_enable reached from schedule()?
Matteo Croce June 20, 2019, 2:01 p.m. UTC | #2
On Thu, Jun 20, 2019 at 3:42 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > [snip commit message]
>
> This is a complex change for a narrow configuration. Isn't a
> SCHED_FIFO process preempting ksoftirqd a potential problem for other
> networking workloads as well? And isn't the right configuration to always
> increase ksoftirqd priority when increasing another process's
> priority? Also, even when ksoftirqd kicks in, isn't some progress
> still made on the local_bh_enable reached from schedule()?

Hi Willem,

from my tests, it seems that when the application hangs, raising the
ksoftirqd priority doesn't help.
If the application priority is the same as ksoftirqd's, it will loop
forever; the only workaround is to lower the application priority or
set it to SCHED_NORMAL.

Regards,
Neil Horman June 20, 2019, 2:23 p.m. UTC | #3
On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > [snip commit message]
> 
> This is a complex change for a narrow configuration. Isn't a
> SCHED_FIFO process preempting ksoftirqd a potential problem for other
> networking workloads as well? And isn't the right configuration to always
> increase ksoftirqd priority when increasing another process's
> priority? Also, even when ksoftirqd kicks in, isn't some progress
> still made on the local_bh_enable reached from schedule()?
> 

A few questions here to answer:

Regarding other protocols having this problem, that's not the case, because
non-packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a
period of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set).
We could certainly do that, but the current implementation doesn't (opting
instead to wait indefinitely until the respective packet(s) have transmitted
or errored out), and I wanted to maintain that behavior.  If there is
consensus that packet sockets should honor SNDTIMEO, then I can certainly do
that.
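
For reference, a minimal sketch of how the SNDTIMEO-honoring paths derive
their wait budget; sock_sndtimeo() is the in-tree helper, the surrounding
context is assumed:

    /* Returns 0 when noblock (MSG_DONTWAIT) is set, otherwise
     * sk->sk_sndtimeo -- MAX_SCHEDULE_TIMEOUT by default, i.e. wait
     * "forever" unless SO_SNDTIMEO was configured on the socket.
     */
    long timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);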

As for progress made by calling local_bh_enable, my read of the code doesn't
have the scheduler calling local_bh_enable at all.  Instead, schedule uses
preempt_disable/preempt_enable_no_resched() to gain exclusive access to the
cpu, which ignores pending softirqs on re-enablement.  Perhaps that needs to
change, but I'm averse to making scheduler changes for this (the
aforementioned concern about complex changes for a narrow use case).

Regarding raising the priority of ksoftirqd, that could be a solution, but
the priority would need to be raised to a high SCHED_FIFO priority, and that
gets back to making complex changes for a narrow problem domain.

As for the complexity of the solution, I think this is, given your comments,
the least complex and intrusive change to solve the given problem.  We need
to find a way to force the calling task off the cpu while the asynchronous
operations in the transmit path complete, and we can do that this way, or by
honoring SK_SNDTIMEO.  I'm fine with doing the latter, but I didn't want to
alter the current protocol behavior without consensus on that.

Regards
Neil
Willem de Bruijn June 20, 2019, 3:16 p.m. UTC | #4
On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > [snip commit message]
> >
> > This is a complex change for a narrow configuration. Isn't a
> > SCHED_FIFO process preempting ksoftirqd a potential problem for other
> > networking workloads as well? And isn't the right configuration to always
> > increase ksoftirqd priority when increasing another process's
> > priority? Also, even when ksoftirqd kicks in, isn't some progress
> > still made on the local_bh_enable reached from schedule()?
> >
>
> A few questions here to answer:

Thanks for the detailed explanation.

> Regarding other protocols having this problem, that's not the case, because
> non-packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a
> period of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set).
> We could certainly do that, but the current implementation doesn't (opting
> instead to wait indefinitely until the respective packet(s) have transmitted
> or errored out), and I wanted to maintain that behavior.  If there is
> consensus that packet sockets should honor SNDTIMEO, then I can certainly
> do that.
>
> As for progress made by calling local_bh_enable, my read of the code doesn't
> have the scheduler calling local_bh_enable at all.  Instead, schedule uses
> preempt_disable/preempt_enable_no_resched() to gain exclusive access to the
> cpu, which ignores pending softirqs on re-enablement.

Ah, I'm mistaken there, then.

>  Perhaps that needs to change,
> but I'm averse to making scheduler changes for this (the aforementioned concern
> about complex changes for a narrow use case)
>
> Regarding raising the priority of ksoftirqd, that could be a solution, but the
> priority would need to be raised to a high priority SCHED_FIFO parameter, and
> that gets back to making complex changes for a narrow problem domain
>
> As for the complexity of the solution, I think this is, given your
> comments, the least complex and intrusive change to solve the given problem.

Could it be simpler to ensure do_softirq() gets run here? That would
allow progress for this case.

>  We
> need to find a way to force the calling task off the cpu while the asynchronous
> operations in the transmit path complete, and we can do that this way, or by
> honoring SK_SNDTIMEO.  I'm fine with doing the latter, but I didn't want to
> alter the current protocol behavior without consensus on that.

In general SCHED_FIFO is dangerous with regard to stalling other
progress, incl. ksoftirqd. But it does appear that this packet socket
case is special inside networking in calling schedule() directly here.

If converting that, should it convert to logic more akin to other
sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
the pros and cons of completion here yet, sorry. Didn't want to delay
responding until after I get a chance.
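
One way to read the do_softirq() suggestion above (a sketch only; whether it
is appropriate here is exactly what is being debated): from process context,
a bottom-half disable/enable pair runs any pending softirqs on the enable:

    /* Sketch: let pending softirqs (e.g. the NET_TX/NET_RX work that
     * frees the transmitted skb) run from the tpacket_snd wait loop.
     * The placement is hypothetical; local_bh_enable() invokes
     * do_softirq() when softirqs are pending.
     */
    local_bh_disable();
    local_bh_enable();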
Neil Horman June 20, 2019, 4:14 p.m. UTC | #5
On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > [snip commit message]
> > > [snip]
> >
> > A few questions here to answer:
> 
> Thanks for the detailed explanation.
> 
Gladly.

> > [snip]
> 
> Could it be simpler to ensure do_softirq() gets run here? That would
> allow progress for this case.
> 
I'm not sure.  On the surface, we certainly could do it, but inserting a call
to do_softirq, either directly or indirectly through some other mechanism,
seems like a non-obvious fix, and may lead to confusion down the road.  I'm
hesitant to pursue such a solution without some evidence that it would be
better.

> > [snip]
> 
> In general SCHED_FIFO is dangerous with regard to stalling other
> progress, incl. ksoftirqd. But it does appear that this packet socket
> case is special inside networking in calling schedule() directly here.
> 
> If converting that, should it convert to logic more akin to other
> sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
> the pros and cons of completion here yet, sorry. Didn't want to delay
> responding until after I get a chance.
> 
That would be the solution described above (i.e. honoring SK_SNDTIMEO).
Basically you call sock_sndtimeo(), which returns a timeout value (or 0 if
MSG_DONTWAIT is set), then you block for that period of time waiting for
transmit completion.  I'm happy to implement that solution, but I'd like to
get some clarity as to whether there is a reason we don't currently honor
that socket option before I change the behavior that way.
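
A sketch of what that could look like in tpacket_snd (a direction under
discussion, not the actual patch; skb_completion is an assumed field):

    long timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);

    /* Wait, bounded by the send timeout, for the skb destructor to
     * signal that the last transmitted frame was freed.
     */
    timeo = wait_for_completion_interruptible_timeout(&po->skb_completion,
                                                      timeo);
    if (timeo <= 0)
        return timeo ? timeo : -ETIMEDOUT;  /* -ERESTARTSYS or timeout */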

Dave, do you have any insight into AF_PACKET history as to why we would ignore
the send timeout socket option here?

Best
Neil
Willem de Bruijn June 20, 2019, 4:18 p.m. UTC | #6
On Thu, Jun 20, 2019 at 12:14 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > >
> > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > [snip commit message]
> > > [snip]
> >
> > That would be the solution described above (i.e. honoring SK_SNDTIMEO).
> > Basically you call sock_sndtimeo(), which returns a timeout value (or 0 if
> > MSG_DONTWAIT is set), then you block for that period of time waiting for
> > transmit completion.

From an ABI point of view, starting to support SK_SNDTIMEO where it
currently is not implemented certainly seems fine.

>  I'm happy to implement that solution, but I'd like to get
> some clarity as to whether there is a reason we don't currently honor that
> socket option before I change the behavior that way.
>
> Dave, do you have any insight into AF_PACKET history as to why we would ignore
> the send timeout socket option here?

On the point of calling schedule(): even if that is rare, there are a lot of
other cond_resched() calls that may have the same starvation issue
with SCHED_FIFO and ksoftirqd.
Neil Horman June 20, 2019, 5:31 p.m. UTC | #7
On Thu, Jun 20, 2019 at 12:18:41PM -0400, Willem de Bruijn wrote:
> On Thu, Jun 20, 2019 at 12:14 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > >
> > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > > [snip commit message]
> > > [snip]
> >
> > That would be the solution described above (i.e. honoring SK_SNDTIMEO).
> > Basically you call sock_sndtimeo(), which returns a timeout value (or 0 if
> > MSG_DONTWAIT is set), then you block for that period of time waiting for
> > transmit completion.
> 
> From an ABI point of view, starting to support SK_SNDTIMEO where it
> currently is not implemented certainly seems fine.
> 
I would agree.  I was just thinking that, since AF_PACKET is something of a
unique protocol (i.e. no real protocol), there may have been a reason to
ignore that option; but I agree, it probably makes sense, barring any
counter-reasoning.

> >  I'm happy to implement that solution, but I'd like to get
> > some clarity as to whether there is a reason we don't currently honor
> > that socket option before I change the behavior that way.
> >
> > Dave, do you have any insight into AF_PACKET history as to why we would ignore
> > the send timeout socket option here?
> 
> On the point of calling schedule(): even if that is rare, there are a lot of
> other cond_resched() calls that may have the same starvation issue
> with SCHED_FIFO and ksoftirqd.
> 
Agreed, but I've not found any yet.
Neil
Neil Horman June 21, 2019, 4:41 p.m. UTC | #8
On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > [snip commit message]
> > [snip]
>
> In general SCHED_FIFO is dangerous with regard to stalling other
> progress, incl. ksoftirqd. But it does appear that this packet socket
> case is special inside networking in calling schedule() directly here.
> 
> If converting that, should it convert to logic more akin to other
> sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
> the pros and cons of completion here yet, sorry. Didn't want to delay
> responding until after I get a chance.
> 
So, I started looking at implementing SOCK_SNDTIMEO for this patch, and
something occurred to me... we still need a mechanism to block in
tpacket_snd.  That is to say, other protocols use SK_SNDTIMEO to wait for
socket memory to become available, and that requirement doesn't exist for
memory mapped sockets in AF_PACKET (which tpacket_snd implements the kernel
side for).  We have memory mapped frame buffers, which we marshal with an
otherwise empty skb, and just send that (i.e. no waiting on socket memory; we
just produce an error if we don't have enough ram to allocate an sk_buff).
Given that, we only ever need to wait for a frame to complete transmission,
or get freed in an error path further down the stack.  This probably explains
why SK_SNDTIMEO doesn't exist for AF_PACKET.

Given that, is it worth implementing (as it will just further complicate this
patch, for no additional value), or can we move forward with it as it is?

Neil
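
For readers unfamiliar with the handshake being described, a rough sketch of
the userspace side of a PACKET_TX_RING send (TPACKET_V2 layout details
omitted; ring_slot() and fill_frame() are hypothetical helpers):

    /* The application flips a slot's status to request transmission;
     * the kernel's skb destructor flips it back to TP_STATUS_AVAILABLE
     * when the skb is freed -- the event tpacket_snd waits on.
     */
    struct tpacket2_hdr *hdr = ring_slot(ring, i);  /* hypothetical */

    if (hdr->tp_status == TP_STATUS_AVAILABLE) {
            fill_frame(hdr, data, len);             /* hypothetical */
            hdr->tp_len = len;
            hdr->tp_status = TP_STATUS_SEND_REQUEST;
            send(fd, NULL, 0, 0);  /* kick the kernel; blocks in
                                    * tpacket_snd without MSG_DONTWAIT */
    }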
Willem de Bruijn June 21, 2019, 6:31 p.m. UTC | #9
On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > >
> > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > [snip commit message]
> > > [snip]
>
> So, I started looking at implementing SOCK_SNDTIMEO for this patch, and
> something occurred to me... we still need a mechanism to block in
> tpacket_snd.  That is to say, other protocols use SK_SNDTIMEO to wait for
> socket memory to become available, and that requirement doesn't exist for
> memory mapped sockets in AF_PACKET (which tpacket_snd implements the kernel
> side for).  We have memory mapped frame buffers, which we marshal with an
> otherwise empty skb, and just send that (i.e. no waiting on socket memory;
> we just produce an error if we don't have enough ram to allocate an
> sk_buff).  Given that, we only ever need to wait for a frame to complete
> transmission, or get freed in an error path further down the stack.  This
> probably explains why SK_SNDTIMEO doesn't exist for AF_PACKET.

SNDTIMEO behavior would still be useful: to wait for a frame slot to
become available, but only up to a timeout?

To be clear, adding that is not a prerequisite for fixing this
specific issue, of course. It would just be nice if the one happens to
be fixed by adding the other.

My main question is wrt the implementation details of struct
completion. Without dynamic memory allocation,
sock_wait_for_wmem/sk_stream_write_space obviously does not make
sense. But should we still use sk_wq and more importantly does this
need the same wait semantics (TASK_INTERRUPTIBLE) and does struct
completion give those?

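On the TASK_INTERRUPTIBLE question: the completion API does provide
interruptible and timeout-bounded variants; a sketch of the relevant calls
(not a proposed patch):

    struct completion done;

    init_completion(&done);

    /* Interruptible, timeout-bounded wait: returns -ERESTARTSYS if
     * interrupted by a signal, 0 on timeout, or the remaining jiffies
     * on completion.
     */
    long rc = wait_for_completion_interruptible_timeout(&done, timeo);

    /* Interruptible wait without a timeout: returns -ERESTARTSYS or 0. */
    int err = wait_for_completion_interruptible(&done);
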
Neil Horman June 21, 2019, 7:18 p.m. UTC | #10
On Fri, Jun 21, 2019 at 02:31:17PM -0400, Willem de Bruijn wrote:
> On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > >
> > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > > [snip commit message]
> [snip]
>
> SNDTIMEO behavior would still be useful: to wait for a frame slot to
> become available, but only up to a timeout?
> 
Ok, that's fair.  To be clear, memory-mapped packets aren't waiting here for
a frame to become available for sending (that's done in userspace, where the
application checks a specific memory location for the TP_STATUS_AVAILABLE
status to be set, so a new frame can be written).  tpacket_snd is waiting for
the transmission of a specific frame to complete the transmit action, which
is a different thing.  Still, I suppose there's no reason we couldn't
constrain that wait with a timeout set by SK_SNDTIMEO.

> To be clear, adding that is not a prerequisite for fixing this
> specific issue, of course. It would just be nice if the one happens to
> be fixed by adding the other.
> 
> My main question is wrt the implementation details of struct
> completion. Without dynamic memory allocation,
> sock_wait_for_wmem/sk_stream_write_space obviously does not make
> sense. But should we still use sk_wq and more importantly does this
> need the same wait semantics (TASK_INTERRUPTIBLE) and does struct
> completion give those?
> 
I suppose we could overload its use here, but I would be worried that such an
overload would be confusing.  Nominally, sk_wq is used, as you note, to block
sockets whose allocated send space is full, until such time as enough frames
have been sent to make space for the next write.  In this scenario, we already
know we have space to send a frame (by virtue of the fact that we are executing
in tpacket_snd, which is only called after userspace has written a frame to the
memory mapped buffer already allocated for frame transmission).  In this case we
are simply waiting for the last frame that we have sent to complete
transmission, at which point we can look to see if more frames are available to
send, or return from the system call.  I'm happy to take an alternate consensus
into account, but for the sake of clarity I think I would rather use the
completion queue, as it makes clear the correlation between the waiter and the
event we are waiting on.
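
A sketch of the pairing described above (field name and placement assumed,
not taken from the patch):

    /* In struct packet_sock (assumed field): */
    struct completion skb_completion;

    /* In tpacket_snd(), when no frame is ready but one is in flight: */
    wait_for_completion_interruptible(&po->skb_completion);

    /* In tpacket_destruct_skb(), after flipping the frame status back
     * to TP_STATUS_AVAILABLE: */
    complete(&po->skb_completion);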


Willem de Bruijn June 21, 2019, 8:06 p.m. UTC | #11
On Fri, Jun 21, 2019 at 3:18 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Fri, Jun 21, 2019 at 02:31:17PM -0400, Willem de Bruijn wrote:
> > On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > >
> > > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> > > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > >
> > > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > > > [snip commit message]
> > > > > >
> > > > > > This is a complex change for a narrow configuration. Isn't a
> > > > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other
> > > > > > networking workloads as well? And the right configuration to always
> > > > > > increase ksoftirqd priority when increasing another process's
> > > > > > priority? Also, even when ksoftirqd kicks in, isn't some progress
> > > > > > still made on the local_bh_enable reached from schedule()?
> > > > > >
> > > > >
> > > > > A few questions here to answer:
> > > >
> > > > Thanks for the detailed explanation.
> > > >
> > > > > Regarding other protocols having this problem, that's not the case, because non
> > > > > packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a period
> > > > > of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set).  We could
> > > > > certainly do that, but the current implementation doesn't (opting instead to
> > > > > wait indefinitely until the respective packet(s) have transmitted or errored
> > > > > out), and I wanted to maintain that behavior.  If there is consensus that packet
> > > > > sockets should honor SNDTIMEO, then I can certainly do that.
> > > > >
> > > > > As for progress made by calling local_bh_enable, my read of the code doesn't
> > > > > have the scheduler calling local_bh_enable at all.  Instead schedule uses
> > > > > preempt_disable/preempt_enable_no_resched() to gain exclusive access to the cpu,
> > > > > which ignores pending softirqs on re-enablement.
> > > >
> > > > Ah, I'm mistaken there, then.
> > > >
> > > > >  Perhaps that needs to change,
> > > > > but I'm averse to making scheduler changes for this (the aforementioned concern
> > > > > about complex changes for a narrow use case)
> > > > >
> > > > > Regarding raising the priority of ksoftirqd, that could be a solution, but the
> > > > > priority would need to be raised to a high priority SCHED_FIFO parameter, and
> > > > > that gets back to making complex changes for a narrow problem domain
> > > > >
> > > > > As for the complexity of the solution, I think this is, given your
> > > > > comments, the least complex and intrusive change to solve the given problem.
> > > >
> > > > Could it be simpler to ensure do_softirq() gets run here? That would
> > > > allow progress for this case.
> > > >
> > > > >  We
> > > > > need to find a way to force the calling task off the cpu while the asynchronous
> > > > > operations in the transmit path complete, and we can do that this way, or by
> > > > > honoring SK_SNDTIMEO.  I'm fine with doing the latter, but I didn't want to
> > > > > alter the current protocol behavior without consensus on that.
> > > >
> > > > In general SCHED_FIFO is dangerous with regard to stalling other
> > > > progress, incl. ksoftirqd. But it does appear that this packet socket
> > > > case is special inside networking in calling schedule() directly here.
> > > >
> > > > If converting that, should it convert to logic more akin to other
> > > > sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
> > > > the pros and cons of completion here yet, sorry. Didn't want to delay
> > > > responding until after I get a chance.
> > > >
> > > So, I started looking at implementing SOCK_SNDTIMEO for this patch, and
> > > something occurred to me... We still need a mechanism to block in tpacket_snd.
> > > That is to say, other protocols use SK_SNDTIMEO to wait for socket memory to
> > > become available, and that requirement doesn't exist for memory mapped sockets
> > > in AF_PACKET (which tpacket_snd implements the kernel side for).  We have memory
> > > mapped frame buffers, which we marshall with an otherwise empty skb, and just
> > > send that (i.e. no waiting on socket memory, we just produce an error if we
> > > don't have enough RAM to allocate an sk_buff).  Given that, we only ever need to
> > > wait for a frame to complete transmission, or get freed in an error path further
> > > down the stack.  This probably explains why SK_SNDTIMEO doesn't exist for
> > > AF_PACKET.
> >
> > SNDTIMEO behavior would still be useful: to wait for a frame slot to
> > become available, but only up to a timeout?
> >
> OK, that's fair.  To be clear, memory-mapped packets aren't waiting here for a
> frame to become available for sending (that's done in userspace, where the
> application checks a specific memory location for the TP_AVAILABLE status to be
> set, so a new frame can be written).  tpacket_snd is waiting for the transmission
> of a specific frame to complete the transmit action, which is a different thing.

Right. Until this report I was actually not even aware of this
behavior without MSG_DONTWAIT.

Though the wait is not for a specific frame, right? The task waits as long as
pending_refcnt is nonzero, and the count is incremented on every loop iteration.
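
For reference, the accounting being waited on looks roughly like this
(abridged from net/packet/af_packet.c; the transmit loop keeps iterating
while need_wait is set and packet_read_pending() is still nonzero):

static void packet_inc_pending(struct packet_ring_buffer *rb)
{
	this_cpu_inc(*rb->pending_refcnt);
}

static unsigned int packet_read_pending(const struct packet_ring_buffer *rb)
{
	unsigned int refcnt = 0;
	int cpu;

	/* only the tx ring uses the pending refcount */
	if (rb->pending_refcnt == NULL)
		return 0;

	for_each_possible_cpu(cpu)
		refcnt += *per_cpu_ptr(rb->pending_refcnt, cpu);

	return refcnt;
}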

> Still, I suppose there's no reason we couldn't constrain that wait with a
> timeout set by SK_SNDTIMEO.
>
> > To be clear, adding that is not a prerequisite for fixing this
> > specific issue, of course. It would just be nice if the one happens to
> > be fixed by adding the other.
> >
> > My main question is wrt the implementation details of struct
> > completion. Without dynamic memory allocation,
> > sock_wait_for_wmem/sk_stream_write_space obviously does not make
> > sense. But should we still use sk_wq and more importantly does this
> > need the same wait semantics (TASK_INTERRUPTIBLE) and does struct
> > completion give those?
> >
> I suppose we could overload its use here, but I would be worried that such an
> overload would be confusing.  Nominally, sk_wq is used, as you note, to block
> sockets whose allocated send space is full, until such time as enough frames
> have been sent to make space for the next write.  In this scenario, we already
> know we have space to send a frame (by virtue of the fact that we are executing
> in tpacket_snd, which is only called after userspace has written a frame to the
> memory mapped buffer already allocated for frame transmission).  In this case we
> are simply waiting for the last frame that we have sent to complete
> transmission, at which point we can look to see if more frames are available to
> send, or return from the system call.  I'm happy to take an alternate consensus
> into account, but for the sake of clarity I think I would rather use the
> completion queue, as it makes clear the correlation between the waiter and the
> event we are waiting on.

That will be less likely to have unexpected side-effects. Agreed.
Thanks for the explanation.

One last question: does it have the right wait behavior? Should this
be wait_for_completion_interruptible?
Neil Horman June 22, 2019, 11:08 a.m. UTC | #12
On Fri, Jun 21, 2019 at 04:06:09PM -0400, Willem de Bruijn wrote:
> On Fri, Jun 21, 2019 at 3:18 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > On Fri, Jun 21, 2019 at 02:31:17PM -0400, Willem de Bruijn wrote:
> > > On Fri, Jun 21, 2019 at 12:42 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > >
> > > > On Thu, Jun 20, 2019 at 11:16:13AM -0400, Willem de Bruijn wrote:
> > > > > On Thu, Jun 20, 2019 at 10:24 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 20, 2019 at 09:41:30AM -0400, Willem de Bruijn wrote:
> > > > > > > On Wed, Jun 19, 2019 at 4:26 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > > > > > >
> > > > > > > > When an application is run that:
> > > > > > > > a) Sets its scheduler to be SCHED_FIFO
> > > > > > > > and
> > > > > > > > b) Opens a memory mapped AF_PACKET socket, and sends frames with the
> > > > > > > > MSG_DONTWAIT flag cleared, its possible for the application to hang
> > > > > > > > forever in the kernel.  This occurs because when waiting, the code in
> > > > > > > > tpacket_snd calls schedule, which under normal circumstances allows
> > > > > > > > other tasks to run, including ksoftirqd, which in some cases is
> > > > > > > > responsible for freeing the transmitted skb (which in AF_PACKET calls a
> > > > > > > > destructor that flips the status bit of the transmitted frame back to
> > > > > > > > available, allowing the transmitting task to complete).
> > > > > > > >
> > > > > > > > However, when the calling application is SCHED_FIFO, its priority is
> > > > > > > > such that the schedule call immediately places the task back on the cpu,
> > > > > > > > preventing ksoftirqd from freeing the skb, which in turn prevents the
> > > > > > > > transmitting task from detecting that the transmission is complete.
> > > > > > > >
> > > > > > > > We can fix this by converting the schedule call to a completion
> > > > > > > > mechanism.  By using a completion queue, we force the calling task, when
> > > > > > > > it detects there are no more frames to send, to schedule itself off the
> > > > > > > > cpu until such time as the last transmitted skb is freed, allowing
> > > > > > > > forward progress to be made.
> > > > > > > >
> > > > > > > > Tested by myself and the reporter, with good results
> > > > > > > >
> > > > > > > > Appies to the net tree
> > > > > > > >
> > > > > > > > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > > > > > Reported-by: Matteo Croce <mcroce@redhat.com>
> > > > > > > > CC: "David S. Miller" <davem@davemloft.net>
> > > > > > > > ---
> > > > > > >
> > > > > > > This is a complex change for a narrow configuration. Isn't a
> > > > > > > SCHED_FIFO process preempting ksoftirqd a potential problem for other
> > > > > > > networking workloads as well? And the right configuration to always
> > > > > > > increase ksoftirqd priority when increasing another process's
> > > > > > > priority? Also, even when ksoftirqd kicks in, isn't some progress
> > > > > > > still made on the local_bh_enable reached from schedule()?
> > > > > > >
> > > > > >
> > > > > > A few questions here to answer:
> > > > >
> > > > > Thanks for the detailed explanation.
> > > > >
> > > > > > Regarding other protocols having this problem, that's not the case, because non
> > > > > > packet sockets honor the SK_SNDTIMEO option here (i.e. they sleep for a period
> > > > > > of time specified by the SNDTIMEO option if MSG_DONTWAIT isn't set).  We could
> > > > > > certainly do that, but the current implementation doesn't (opting instead to
> > > > > > wait indefinitely until the respective packet(s) have transmitted or errored
> > > > > > out), and I wanted to maintain that behavior.  If there is consensus that packet
> > > > > > sockets should honor SNDTIMEO, then I can certainly do that.
> > > > > >
> > > > > > As for progress made by calling local_bh_enable, my read of the code doesn't
> > > > > > have the scheduler calling local_bh_enable at all.  Instead schedule uses
> > > > > > preempt_disable/preempt_enable_no_resched() to gain exclusive access to the cpu,
> > > > > > which ignores pending softirqs on re-enablement.
> > > > >
> > > > > Ah, I'm mistaken there, then.
> > > > >
> > > > > >  Perhaps that needs to change,
> > > > > > but I'm averse to making scheduler changes for this (the aforementioned concern
> > > > > > about complex changes for a narrow use case)
> > > > > >
> > > > > > Regarding raising the priority of ksoftirqd, that could be a solution, but the
> > > > > > priority would need to be raised to a high priority SCHED_FIFO parameter, and
> > > > > > that gets back to making complex changes for a narrow problem domain
> > > > > >
> > > > > > As for the complexity of the solution, I think this is, given your
> > > > > > comments, the least complex and intrusive change to solve the given problem.
> > > > >
> > > > > Could it be simpler to ensure do_softirq() gets run here? That would
> > > > > allow progress for this case.
> > > > >
> > > > > >  We
> > > > > > need to find a way to force the calling task off the cpu while the asynchronous
> > > > > > operations in the transmit path complete, and we can do that this way, or by
> > > > > > honoring SK_SNDTIMEO.  I'm fine with doing the latter, but I didn't want to
> > > > > > alter the current protocol behavior without consensus on that.
> > > > >
> > > > > In general SCHED_FIFO is dangerous with regard to stalling other
> > > > > progress, incl. ksoftirqd. But it does appear that this packet socket
> > > > > case is special inside networking in calling schedule() directly here.
> > > > >
> > > > > If converting that, should it convert to logic more akin to other
> > > > > sockets, like sock_wait_for_wmem? I haven't had a chance to read up on
> > > > > the pros and cons of completion here yet, sorry. Didn't want to delay
> > > > > responding until after I get a chance.
> > > > >
> > > > So, I started looking at implementing SOCK_SNDTIMEO for this patch, and
> > > > something occurred to me... We still need a mechanism to block in tpacket_snd.
> > > > That is to say, other protocols use SK_SNDTIMEO to wait for socket memory to
> > > > become available, and that requirement doesn't exist for memory mapped sockets
> > > > in AF_PACKET (which tpacket_snd implements the kernel side for).  We have memory
> > > > mapped frame buffers, which we marshall with an otherwise empty skb, and just
> > > > send that (i.e. no waiting on socket memory, we just produce an error if we
> > > > don't have enough RAM to allocate an sk_buff).  Given that, we only ever need to
> > > > wait for a frame to complete transmission, or get freed in an error path further
> > > > down the stack.  This probably explains why SK_SNDTIMEO doesn't exist for
> > > > AF_PACKET.
> > >
> > > SNDTIMEO behavior would still be useful: to wait for a frame slot to
> > > become available, but only up to a timeout?
> > >
> > OK, that's fair.  To be clear, memory-mapped packets aren't waiting here for a
> > frame to become available for sending (that's done in userspace, where the
> > application checks a specific memory location for the TP_AVAILABLE status to be
> > set, so a new frame can be written).  tpacket_snd is waiting for the transmission
> > of a specific frame to complete the transmit action, which is a different thing.
> 
> Right. Until this report I was actually not even aware of this
> behavior without MSG_DONTWAIT.
> 
> Though the wait is not for a specific frame, right? The task waits as long as
> pending_refcnt is nonzero, and the count is incremented on every loop iteration.
> 
> > Still, I suppose there's no reason we couldn't constrain that wait with a
> > timeout set by SK_SNDTIMEO.
> >
> > > To be clear, adding that is not a prerequisite for fixing this
> > > specific issue, of course. It would just be nice if the one happens to
> > > be fixed by adding the other.
> > >
> > > My main question is wrt the implementation details of struct
> > > completion. Without dynamic memory allocation,
> > > sock_wait_for_wmem/sk_stream_write_space obviously does not make
> > > sense. But should we still use sk_wq and more importantly does this
> > > need the same wait semantics (TASK_INTERRUPTIBLE) and does struct
> > > completion give those?
> > >
> > I suppose we could overload its use here, but I would be worried that such an
> > overload would be confusing.  Nominally, sk_wq is used, as you note, to block
> > sockets whose allocated send space is full, until such time as enough frames
> > have been sent to make space for the next write.  In this scenario, we already
> > know we have space to send a frame (by virtue of the fact that we are executing
> > in tpacket_snd, which is only called after userspace has written a frame to the
> > memory mapped buffer already allocated for frame transmission).  In this case we
> > are simply waiting for the last frame that we have sent to complete
> > transmission, at which point we can look to see if more frames are available to
> > send, or return from the system call.  I'm happy to take an alternate consensus
> > into account, but for the sake of clarity I think I would rather use the
> > completion queue, as it makes clear the correlation between the waiter and the
> > event we are waiting on.
> 
> That will be less likely to have unexpected side-effects. Agreed.
> Thanks for the explanation.
> 
> One last question: does it have the right wait behavior? Should this
> be wait_for_completion_interruptible?
> 
That's a good point.  Other protocols use the socket timeout along with
sk_wait_event, which sets the task state to TASK_INTERRUPTIBLE, so we should
probably use wait_for_completion_interruptible_timeout.
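
Roughly, something like this in tpacket_snd (untested sketch; timeo would be
initialized from sock_sndtimeo(), so an unset SO_SNDTIMEO still means wait
forever):

if (po->wait_on_complete && need_wait) {
	timeo = wait_for_completion_interruptible_timeout(&po->skb_completion,
							  timeo);
	if (timeo <= 0) {
		err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
		goto out_put;
	}
	po->wait_on_complete = 0;
}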

Thanks!
diff mbox series

Patch

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a29d66da7394..e65f4ef18a06 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -358,7 +358,8 @@  static inline struct page * __pure pgv_to_page(void *addr)
 	return virt_to_page(addr);
 }
 
-static void __packet_set_status(struct packet_sock *po, void *frame, int status)
+static void __packet_set_status(struct packet_sock *po, void *frame, int status,
+				bool call_complete)
 {
 	union tpacket_uhdr h;
 
@@ -381,6 +382,8 @@  static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		BUG();
 	}
 
+	if (po->wait_on_complete && call_complete)
+		complete(&po->skb_completion);
 	smp_wmb();
 }
 
@@ -1148,6 +1151,14 @@  static void *packet_previous_frame(struct packet_sock *po,
 	return packet_lookup_frame(po, rb, previous, status);
 }
 
+static void *packet_next_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status)
+{
+	unsigned int next = rb->head != rb->frame_max ? rb->head+1 : 0;
+	return packet_lookup_frame(po, rb, next, status);
+}
+
 static void packet_increment_head(struct packet_ring_buffer *buff)
 {
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
@@ -2360,7 +2371,7 @@  static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
 #endif
 
 	if (po->tp_version <= TPACKET_V2) {
-		__packet_set_status(po, h.raw, status);
+		__packet_set_status(po, h.raw, status, false);
 		sk->sk_data_ready(sk);
 	} else {
 		prb_clear_blk_fill_status(&po->rx_ring);
@@ -2400,7 +2411,7 @@  static void tpacket_destruct_skb(struct sk_buff *skb)
 		packet_dec_pending(&po->tx_ring);
 
 		ts = __packet_set_timestamp(po, ph, skb);
-		__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
+		__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts, true);
 	}
 
 	sock_wfree(skb);
@@ -2600,6 +2611,7 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int len_sum = 0;
 	int status = TP_STATUS_AVAILABLE;
 	int hlen, tlen, copylen = 0;
+	void *tmp;
 
 	mutex_lock(&po->pg_vec_lock);
 
@@ -2647,16 +2659,21 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		size_max = dev->mtu + reserve + VLAN_HLEN;
 
 	do {
+
+		if (po->wait_on_complete && need_wait) {
+			wait_for_completion(&po->skb_completion);
+			po->wait_on_complete = 0;
+		}
+
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
-		if (unlikely(ph == NULL)) {
-			if (need_wait && need_resched())
-				schedule();
-			continue;
-		}
+
+		if (likely(ph == NULL))
+			break;
 
 		skb = NULL;
 		tp_len = tpacket_parse_header(po, ph, size_max, &data);
+
 		if (tp_len < 0)
 			goto tpacket_error;
 
@@ -2699,7 +2716,7 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 tpacket_error:
 			if (po->tp_loss) {
 				__packet_set_status(po, ph,
-						TP_STATUS_AVAILABLE);
+						TP_STATUS_AVAILABLE, false);
 				packet_increment_head(&po->tx_ring);
 				kfree_skb(skb);
 				continue;
@@ -2719,7 +2736,9 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 
 		skb->destructor = tpacket_destruct_skb;
-		__packet_set_status(po, ph, TP_STATUS_SENDING);
+		__packet_set_status(po, ph, TP_STATUS_SENDING, false);
+		if (!packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST))
+			po->wait_on_complete = 1;
 		packet_inc_pending(&po->tx_ring);
 
 		status = TP_STATUS_SEND_REQUEST;
@@ -2753,7 +2772,7 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	goto out_put;
 
 out_status:
-	__packet_set_status(po, ph, status);
+	__packet_set_status(po, ph, status, false);
 	kfree_skb(skb);
 out_put:
 	dev_put(dev);
@@ -3207,6 +3226,7 @@  static int packet_create(struct net *net, struct socket *sock, int protocol,
 	sock_init_data(sock, sk);
 
 	po = pkt_sk(sk);
+	init_completion(&po->skb_completion);
 	sk->sk_family = PF_PACKET;
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 3bb7c5fb3bff..bbb4be2c18e7 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -128,6 +128,8 @@  struct packet_sock {
 	unsigned int		tp_hdrlen;
 	unsigned int		tp_reserve;
 	unsigned int		tp_tstamp;
+	struct completion	skb_completion;
+	unsigned int		wait_on_complete;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;