diff mbox series

[v3,net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET

Message ID 20190624004604.25607-1-nhorman@tuxdriver.com
State Superseded
Delegated to: David Miller
Headers show
Series [v3,net] af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET | expand

Commit Message

Neil Horman June 24, 2019, 12:46 a.m. UTC
When an application is run that:
a) Sets its scheduler to be SCHED_FIFO
and
b) Opens a memory mapped AF_PACKET socket, and sends frames with the
MSG_DONTWAIT flag cleared, its possible for the application to hang
forever in the kernel.  This occurs because when waiting, the code in
tpacket_snd calls schedule, which under normal circumstances allows
other tasks to run, including ksoftirqd, which in some cases is
responsible for freeing the transmitted skb (which in AF_PACKET calls a
destructor that flips the status bit of the transmitted frame back to
available, allowing the transmitting task to complete).

However, when the calling application is SCHED_FIFO, its priority is
such that the schedule call immediately places the task back on the cpu,
preventing ksoftirqd from freeing the skb, which in turn prevents the
transmitting task from detecting that the transmission is complete.

We can fix this by converting the schedule call to a completion
mechanism.  By using a completion queue, we force the calling task, when
it detects there are no more frames to send, to schedule itself off the
cpu until such time as the last transmitted skb is freed, allowing
forward progress to be made.

Tested by myself and the reporter, with good results

Appies to the net tree

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Matteo Croce <mcroce@redhat.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Willem de Bruijn <willemdebruijn.kernel@gmail.com>

Change Notes:

V1->V2:
	Enhance the sleep logic to support being interruptible and
allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)

V2->V3:
	Rearrage the point at which we wait for the completion queue, to
avoid needing to check for ph/skb being null at the end of the loop.
Also move the complete call to the skb destructor to avoid needing to
modify __packet_set_status.  Also gate calling complete on
packet_read_pending returning zero to avoid multiple calls to complete.
(Willem de Bruijn)

	Move timeo computation within loop, to re-fetch the socket
timeout since we also use the timeo variable to record the return code
from the wait_for_complete call (Neil Horman)
---
 net/packet/af_packet.c | 59 +++++++++++++++++++++++++++++++++++++-----
 net/packet/internal.h  |  2 ++
 2 files changed, 55 insertions(+), 6 deletions(-)

Comments

Willem de Bruijn June 24, 2019, 6:08 p.m. UTC | #1
On Sun, Jun 23, 2019 at 8:46 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> When an application is run that:
> a) Sets its scheduler to be SCHED_FIFO
> and
> b) Opens a memory mapped AF_PACKET socket, and sends frames with the
> MSG_DONTWAIT flag cleared, its possible for the application to hang
> forever in the kernel.  This occurs because when waiting, the code in
> tpacket_snd calls schedule, which under normal circumstances allows
> other tasks to run, including ksoftirqd, which in some cases is
> responsible for freeing the transmitted skb (which in AF_PACKET calls a
> destructor that flips the status bit of the transmitted frame back to
> available, allowing the transmitting task to complete).
>
> However, when the calling application is SCHED_FIFO, its priority is
> such that the schedule call immediately places the task back on the cpu,
> preventing ksoftirqd from freeing the skb, which in turn prevents the
> transmitting task from detecting that the transmission is complete.
>
> We can fix this by converting the schedule call to a completion
> mechanism.  By using a completion queue, we force the calling task, when
> it detects there are no more frames to send, to schedule itself off the
> cpu until such time as the last transmitted skb is freed, allowing
> forward progress to be made.
>
> Tested by myself and the reporter, with good results
>
> Appies to the net tree
>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> Reported-by: Matteo Croce <mcroce@redhat.com>
> CC: "David S. Miller" <davem@davemloft.net>
> CC: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
>
> Change Notes:
>
> V1->V2:
>         Enhance the sleep logic to support being interruptible and
> allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)
>
> V2->V3:
>         Rearrage the point at which we wait for the completion queue, to
> avoid needing to check for ph/skb being null at the end of the loop.
> Also move the complete call to the skb destructor to avoid needing to
> modify __packet_set_status.  Also gate calling complete on
> packet_read_pending returning zero to avoid multiple calls to complete.
> (Willem de Bruijn)
>
>         Move timeo computation within loop, to re-fetch the socket
> timeout since we also use the timeo variable to record the return code
> from the wait_for_complete call (Neil Horman)
> ---
>  net/packet/af_packet.c | 59 +++++++++++++++++++++++++++++++++++++-----
>  net/packet/internal.h  |  2 ++
>  2 files changed, 55 insertions(+), 6 deletions(-)
>
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index a29d66da7394..5c48bb7a4fa5 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -380,7 +380,6 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
>                 WARN(1, "TPACKET version not supported.\n");
>                 BUG();
>         }
> -

Unrelated to this feature

>         smp_wmb();
>  }
>
> @@ -1148,6 +1147,14 @@ static void *packet_previous_frame(struct packet_sock *po,
>         return packet_lookup_frame(po, rb, previous, status);
>  }
>
> +static void *packet_next_frame(struct packet_sock *po,
> +               struct packet_ring_buffer *rb,
> +               int status)
> +{
> +       unsigned int next = rb->head != rb->frame_max ? rb->head+1 : 0;
> +       return packet_lookup_frame(po, rb, next, status);
> +}
> +
>  static void packet_increment_head(struct packet_ring_buffer *buff)
>  {
>         buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
> @@ -2401,6 +2408,9 @@ static void tpacket_destruct_skb(struct sk_buff *skb)
>
>                 ts = __packet_set_timestamp(po, ph, skb);
>                 __packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
> +
> +               if (po->wait_on_complete && !packet_read_pending(&po->tx_ring))
> +                       complete(&po->skb_completion);
>         }
>
>         sock_wfree(skb);
> @@ -2600,9 +2610,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>         int len_sum = 0;
>         int status = TP_STATUS_AVAILABLE;
>         int hlen, tlen, copylen = 0;
> +       long timeo = 0;
>
>         mutex_lock(&po->pg_vec_lock);
>
> +

Same

>         if (likely(saddr == NULL)) {
>                 dev     = packet_cached_dev_get(po);
>                 proto   = po->num;
> @@ -2647,16 +2659,16 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>                 size_max = dev->mtu + reserve + VLAN_HLEN;
>
>         do {
> +

Same

>                 ph = packet_current_frame(po, &po->tx_ring,
>                                           TP_STATUS_SEND_REQUEST);
> -               if (unlikely(ph == NULL)) {
> -                       if (need_wait && need_resched())
> -                               schedule();
> -                       continue;

Why not keep the test whether the process needs to wait exactly here (A)?

Then no need for packet_next_frame.

> -               }
> +
> +               if (unlikely(ph == NULL))
> +                       break;
>
>                 skb = NULL;
>                 tp_len = tpacket_parse_header(po, ph, size_max, &data);
> +

Again

>                 if (tp_len < 0)
>                         goto tpacket_error;
>
> @@ -2720,6 +2732,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>
>                 skb->destructor = tpacket_destruct_skb;
>                 __packet_set_status(po, ph, TP_STATUS_SENDING);
> +
> +               /*
> +                * If we need to wait and we've sent the last frame pending
> +                * transmission in the mmaped buffer, flag that we need to wait
> +                * on those frames to get freed via tpacket_destruct_skb.  This
> +                * flag indicates that tpacket_destruct_skb should call complete
> +                * when the packet_pending count reaches zero, and that we need
> +                * to call wait_on_complete_interruptible_timeout below, to make
> +                * sure we pick up the result of that completion
> +                */
> +               if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
> +                       po->wait_on_complete = 1;
> +                       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);

This resets timeout on every loop. should only set above the loop once.

Also, please limit the comments in the code (also below). If every
patch would add this many lines of comment, the file would be
enormous. OTOH, it's great to be this explanatory in the git commit,
which is easily reached for any line with git blame.

> +               }
> +
>                 packet_inc_pending(&po->tx_ring);
>
>                 status = TP_STATUS_SEND_REQUEST;
> @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>                         err = net_xmit_errno(err);
>                         if (err && __packet_get_status(po, ph) ==
>                                    TP_STATUS_AVAILABLE) {
> +                               /* re-init completion queue to avoid subsequent fallthrough
> +                                * on a future thread calling wait_on_complete_interruptible_timeout
> +                                */
> +                               po->wait_on_complete = 0;

If setting where sleeping, no need for resetting if a failure happens
between those blocks.

> +                               init_completion(&po->skb_completion);

no need to reinit between each use?

>                                 /* skb was destructed already */
>                                 skb = NULL;
>                                 goto out_status;
> @@ -2740,6 +2772,20 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
>                 }
>                 packet_increment_head(&po->tx_ring);
>                 len_sum += tp_len;
> +
> +               if (po->wait_on_complete) {
> +                       timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> +                       po->wait_on_complete = 0;

I was going to argue for clearing in tpacket_destruct_skb. But then we
would have to separate clear on timeout instead of signal, too.

  po->wait_on_complete = 1;
  timeo = wait_for_completion...
  po->wait_on_complete = 0;

as the previous version had is fine, as long as the compiler does not
"optimize" away an assignment. The function call will avoid reordering
by the cpu, at least. Probably requires WRITE_ONCE/READ_ONCE.

> +                       if (!timeo) {
> +                               /* We timed out, break out and notify userspace */
> +                               err = -ETIMEDOUT;
> +                               goto out_status;

goto out_put, there is no active ph or skb here

> +                       } else if (timeo == -ERESTARTSYS) {
> +                               err = -ERESTARTSYS;
> +                               goto out_status;
> +                       }
> +               }
> +
>         } while (likely((ph != NULL) ||
>                 /* Note: packet_read_pending() might be slow if we have
>                  * to call it as it's per_cpu variable, but in fast-path
> @@ -3207,6 +3253,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
>         sock_init_data(sock, sk);
>
>         po = pkt_sk(sk);
> +       init_completion(&po->skb_completion);
>         sk->sk_family = PF_PACKET;
>         po->num = proto;
>         po->xmit = dev_queue_xmit;

This is basically replacing a busy-wait with schedule() with sleeping
using wait_for_completion_interruptible_timeout. My main question is
does this really need to move control flow around and add
packet_next_frame? If not, especially for net, the shortest, simplest
change is preferable.
Neil Horman June 24, 2019, 9:51 p.m. UTC | #2
On Mon, Jun 24, 2019 at 02:08:43PM -0400, Willem de Bruijn wrote:
> On Sun, Jun 23, 2019 at 8:46 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > When an application is run that:
> > a) Sets its scheduler to be SCHED_FIFO
> > and
> > b) Opens a memory mapped AF_PACKET socket, and sends frames with the
> > MSG_DONTWAIT flag cleared, its possible for the application to hang
> > forever in the kernel.  This occurs because when waiting, the code in
> > tpacket_snd calls schedule, which under normal circumstances allows
> > other tasks to run, including ksoftirqd, which in some cases is
> > responsible for freeing the transmitted skb (which in AF_PACKET calls a
> > destructor that flips the status bit of the transmitted frame back to
> > available, allowing the transmitting task to complete).
> >
> > However, when the calling application is SCHED_FIFO, its priority is
> > such that the schedule call immediately places the task back on the cpu,
> > preventing ksoftirqd from freeing the skb, which in turn prevents the
> > transmitting task from detecting that the transmission is complete.
> >
> > We can fix this by converting the schedule call to a completion
> > mechanism.  By using a completion queue, we force the calling task, when
> > it detects there are no more frames to send, to schedule itself off the
> > cpu until such time as the last transmitted skb is freed, allowing
> > forward progress to be made.
> >
> > Tested by myself and the reporter, with good results
> >
> > Appies to the net tree
> >
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > Reported-by: Matteo Croce <mcroce@redhat.com>
> > CC: "David S. Miller" <davem@davemloft.net>
> > CC: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> >
> > Change Notes:
> >
> > V1->V2:
> >         Enhance the sleep logic to support being interruptible and
> > allowing for honoring to SK_SNDTIMEO (Willem de Bruijn)
> >
> > V2->V3:
> >         Rearrage the point at which we wait for the completion queue, to
> > avoid needing to check for ph/skb being null at the end of the loop.
> > Also move the complete call to the skb destructor to avoid needing to
> > modify __packet_set_status.  Also gate calling complete on
> > packet_read_pending returning zero to avoid multiple calls to complete.
> > (Willem de Bruijn)
> >
> >         Move timeo computation within loop, to re-fetch the socket
> > timeout since we also use the timeo variable to record the return code
> > from the wait_for_complete call (Neil Horman)
> > ---
> >  net/packet/af_packet.c | 59 +++++++++++++++++++++++++++++++++++++-----
> >  net/packet/internal.h  |  2 ++
> >  2 files changed, 55 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index a29d66da7394..5c48bb7a4fa5 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -380,7 +380,6 @@ static void __packet_set_status(struct packet_sock *po, void *frame, int status)
> >                 WARN(1, "TPACKET version not supported.\n");
> >                 BUG();
> >         }
> > -
> 
> Unrelated to this feature
> 
Agreed.

> >         smp_wmb();
><snip>

> 
> >                 ph = packet_current_frame(po, &po->tx_ring,
> >                                           TP_STATUS_SEND_REQUEST);
> > -               if (unlikely(ph == NULL)) {
> > -                       if (need_wait && need_resched())
> > -                               schedule();
> > -                       continue;
> 
> Why not keep the test whether the process needs to wait exactly here (A)?
> 
As I said in the changelog, I think it makes the code more readable, to
understand that you are waiting for an event to complete after you send the
frame.

> Then no need for packet_next_frame.
> 
Thats fair.  I still think waiting at the bottom of the loop is more clear, but
it does save a function, so I'll agree to this.

> > -               }
> > +
> > +               if (unlikely(ph == NULL))
> > +                       break;
> >
> >                 skb = NULL;
> >                 tp_len = tpacket_parse_header(po, ph, size_max, &data);
> > +
> 
> Again
> 
> >                 if (tp_len < 0)
> >                         goto tpacket_error;
> >
> > @@ -2720,6 +2732,21 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> >
> >                 skb->destructor = tpacket_destruct_skb;
> >                 __packet_set_status(po, ph, TP_STATUS_SENDING);
> > +
> > +               /*
> > +                * If we need to wait and we've sent the last frame pending
> > +                * transmission in the mmaped buffer, flag that we need to wait
> > +                * on those frames to get freed via tpacket_destruct_skb.  This
> > +                * flag indicates that tpacket_destruct_skb should call complete
> > +                * when the packet_pending count reaches zero, and that we need
> > +                * to call wait_on_complete_interruptible_timeout below, to make
> > +                * sure we pick up the result of that completion
> > +                */
> > +               if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
> > +                       po->wait_on_complete = 1;
> > +                       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> 
> This resets timeout on every loop. should only set above the loop once.
> 
I explained exactly why I did that in the change log.  Its because I reuse the
timeout variable to get the return value of the wait_for_complete call.
Otherwise I need to add additional data to the stack, which I don't want to do.
Sock_sndtimeo is an inline function and really doesn't add any overhead to this
path, so I see no reason not to reuse the variable.

> Also, please limit the comments in the code (also below). If every
> patch would add this many lines of comment, the file would be
> enormous. OTOH, it's great to be this explanatory in the git commit,
> which is easily reached for any line with git blame.
> 
> > +               }
> > +
> >                 packet_inc_pending(&po->tx_ring);
> >
> >                 status = TP_STATUS_SEND_REQUEST;
> > @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> >                         err = net_xmit_errno(err);
> >                         if (err && __packet_get_status(po, ph) ==
> >                                    TP_STATUS_AVAILABLE) {
> > +                               /* re-init completion queue to avoid subsequent fallthrough
> > +                                * on a future thread calling wait_on_complete_interruptible_timeout
> > +                                */
> > +                               po->wait_on_complete = 0;
> 
> If setting where sleeping, no need for resetting if a failure happens
> between those blocks.
> 
> > +                               init_completion(&po->skb_completion);
> 
> no need to reinit between each use?
> 
I explained exactly why I did this in the comment above.  We have to set
wait_for_complete prior to calling transmit, so as to ensure that we call
wait_for_completion before we exit the loop. However, in this error case, we
exit the loop prior to calling wait_for_complete, so we need to reset the
completion variable and the wait_for_complete flag.  Otherwise we will be in a
case where, on the next entrace to this loop we will have a completion variable
with completion->done > 0, meaning the next wait will be a fall through case,
which we don't want.

> >                                 /* skb was destructed already */
> >                                 skb = NULL;
> >                                 goto out_status;
> > @@ -2740,6 +2772,20 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> >                 }
> >                 packet_increment_head(&po->tx_ring);
> >                 len_sum += tp_len;
> > +
> > +               if (po->wait_on_complete) {
> > +                       timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> > +                       po->wait_on_complete = 0;
> 
> I was going to argue for clearing in tpacket_destruct_skb. But then we
> would have to separate clear on timeout instead of signal, too.
> 
>   po->wait_on_complete = 1;
>   timeo = wait_for_completion...
>   po->wait_on_complete = 0;
> 
Also, we would have a race condition, since the destructor may be called from
softirq context (the first cause of the bug I'm fixing here), and so if the
packet is freed prior to us checking wait_for_complete in tpacket_snd, we will
be in the above situation again, exiting the loop with a completion variable in
an improper state.

> as the previous version had is fine, as long as the compiler does not
> "optimize" away an assignment. The function call will avoid reordering
> by the cpu, at least. Probably requires WRITE_ONCE/READ_ONCE.
> 
> > +                       if (!timeo) {
> > +                               /* We timed out, break out and notify userspace */
> > +                               err = -ETIMEDOUT;
> > +                               goto out_status;
> 
> goto out_put, there is no active ph or skb here
> 
Yes, good catch.

> > +                       } else if (timeo == -ERESTARTSYS) {
> > +                               err = -ERESTARTSYS;
> > +                               goto out_status;
> > +                       }
> > +               }
> > +
> >         } while (likely((ph != NULL) ||
> >                 /* Note: packet_read_pending() might be slow if we have
> >                  * to call it as it's per_cpu variable, but in fast-path
> > @@ -3207,6 +3253,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
> >         sock_init_data(sock, sk);
> >
> >         po = pkt_sk(sk);
> > +       init_completion(&po->skb_completion);
> >         sk->sk_family = PF_PACKET;
> >         po->num = proto;
> >         po->xmit = dev_queue_xmit;
> 
> This is basically replacing a busy-wait with schedule() with sleeping
> using wait_for_completion_interruptible_timeout. My main question is
> does this really need to move control flow around and add
> packet_next_frame? If not, especially for net, the shortest, simplest
> change is preferable.
> 
Its not replacing a busy wait at all, its replacing a non-blocking schedule with
a blocking schedule (via completion queues).  As for control flow, Im not sure I
why you are bound to the existing control flow, and given that we already have
packet_previous_frame, I didn't see anything egregious about adding
packet_next_frame as well, but since you've seen a way to eliminate it, I'm ok
with it.

Neil
Willem de Bruijn June 24, 2019, 10:15 p.m. UTC | #3
> > > +               if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
> > > +                       po->wait_on_complete = 1;
> > > +                       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> >
> > This resets timeout on every loop. should only set above the loop once.
> >
> I explained exactly why I did that in the change log.  Its because I reuse the
> timeout variable to get the return value of the wait_for_complete call.
> Otherwise I need to add additional data to the stack, which I don't want to do.
> Sock_sndtimeo is an inline function and really doesn't add any overhead to this
> path, so I see no reason not to reuse the variable.

The issue isn't the reuse. It is that timeo is reset to sk_sndtimeo
each time. Whereas wait_for_common and its variants return the
number of jiffies left in case the loop needs to sleep again later.

Reading sock_sndtimeo once and passing it to wait_.. repeatedly is a
common pattern across the stack.

> > > @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > >                         err = net_xmit_errno(err);
> > >                         if (err && __packet_get_status(po, ph) ==
> > >                                    TP_STATUS_AVAILABLE) {
> > > +                               /* re-init completion queue to avoid subsequent fallthrough
> > > +                                * on a future thread calling wait_on_complete_interruptible_timeout
> > > +                                */
> > > +                               po->wait_on_complete = 0;
> >
> > If setting where sleeping, no need for resetting if a failure happens
> > between those blocks.
> >
> > > +                               init_completion(&po->skb_completion);
> >
> > no need to reinit between each use?
> >
> I explained exactly why I did this in the comment above.  We have to set
> wait_for_complete prior to calling transmit, so as to ensure that we call
> wait_for_completion before we exit the loop. However, in this error case, we
> exit the loop prior to calling wait_for_complete, so we need to reset the
> completion variable and the wait_for_complete flag.  Otherwise we will be in a
> case where, on the next entrace to this loop we will have a completion variable
> with completion->done > 0, meaning the next wait will be a fall through case,
> which we don't want.

By moving back to the point where schedule() is called, hopefully this
complexity automatically goes away. Same as my comment to the line
immediately above.

> > > @@ -2740,6 +2772,20 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > >                 }
> > >                 packet_increment_head(&po->tx_ring);
> > >                 len_sum += tp_len;
> > > +
> > > +               if (po->wait_on_complete) {
> > > +                       timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> > > +                       po->wait_on_complete = 0;
> >
> > I was going to argue for clearing in tpacket_destruct_skb. But then we
> > would have to separate clear on timeout instead of signal, too.
> >
> >   po->wait_on_complete = 1;
> >   timeo = wait_for_completion...
> >   po->wait_on_complete = 0;
> >
> Also, we would have a race condition, since the destructor may be called from
> softirq context (the first cause of the bug I'm fixing here), and so if the
> packet is freed prior to us checking wait_for_complete in tpacket_snd, we will
> be in the above situation again, exiting the loop with a completion variable in
> an improper state.

Good point.

The common pattern is to clear in tpacket_destruct_skb. Then
we do need to handle the case where the wait is interrupted or
times out and reset it in those cases.

> > This is basically replacing a busy-wait with schedule() with sleeping
> > using wait_for_completion_interruptible_timeout. My main question is
> > does this really need to move control flow around and add
> > packet_next_frame? If not, especially for net, the shortest, simplest
> > change is preferable.
> >
> Its not replacing a busy wait at all, its replacing a non-blocking schedule with
> a blocking schedule (via completion queues).  As for control flow, Im not sure I
> why you are bound to the existing control flow, and given that we already have
> packet_previous_frame, I didn't see anything egregious about adding
> packet_next_frame as well, but since you've seen a way to eliminate it, I'm ok
> with it.

The benefit of keeping to the existing control flow is that that is a
smaller change, so easier to verify.

I understand the benefit of moving the wait outside the loop. Before
this report, I was not even aware of that behavior on !MSG_DONTWAIT,
because it is so co-located.

But moving it elsewhere in the loop does not have the same benefit,
imho. Either way, I think we better leave any such code improvements
to net-next and focus on the minimal , least risky, patch for net.
Neil Horman June 25, 2019, 11:02 a.m. UTC | #4
On Mon, Jun 24, 2019 at 06:15:29PM -0400, Willem de Bruijn wrote:
> > > > +               if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
> > > > +                       po->wait_on_complete = 1;
> > > > +                       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> > >
> > > This resets timeout on every loop. should only set above the loop once.
> > >
> > I explained exactly why I did that in the change log.  Its because I reuse the
> > timeout variable to get the return value of the wait_for_complete call.
> > Otherwise I need to add additional data to the stack, which I don't want to do.
> > Sock_sndtimeo is an inline function and really doesn't add any overhead to this
> > path, so I see no reason not to reuse the variable.
> 
> The issue isn't the reuse. It is that timeo is reset to sk_sndtimeo
> each time. Whereas wait_for_common and its variants return the
> number of jiffies left in case the loop needs to sleep again later.
> 
> Reading sock_sndtimeo once and passing it to wait_.. repeatedly is a
> common pattern across the stack.
> 
But those patterns are unique to those situations.  For instance, in
tcp_sendmsg_locked, we aquire the value of the socket timeout, and use that to
wait for the entire message send operation to complete, which consists of
potentially several blocking operations (waiting for the tcp connection to be
established, waiting for socket memory, etc).  In that situation we want to wait
for all of those operations to complete to send a single message, and fail if
they exceed the timeout in aggregate.  The semantics are different with
AF_PACKET.  In this use case, the message is in effect empty, and just used to
pass some control information.  tpacket_snd, sends as many frames from the
memory mapped buffer as possible, and on each iteration we want to wait for the
specified timeout for those frames to complete sending.  I think resetting the
timeout on each wait instance is the right way to go here.

> > > > @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > >                         err = net_xmit_errno(err);
> > > >                         if (err && __packet_get_status(po, ph) ==
> > > >                                    TP_STATUS_AVAILABLE) {
> > > > +                               /* re-init completion queue to avoid subsequent fallthrough
> > > > +                                * on a future thread calling wait_on_complete_interruptible_timeout
> > > > +                                */
> > > > +                               po->wait_on_complete = 0;
> > >
> > > If setting where sleeping, no need for resetting if a failure happens
> > > between those blocks.
> > >
> > > > +                               init_completion(&po->skb_completion);
> > >
> > > no need to reinit between each use?
> > >
> > I explained exactly why I did this in the comment above.  We have to set
> > wait_for_complete prior to calling transmit, so as to ensure that we call
> > wait_for_completion before we exit the loop. However, in this error case, we
> > exit the loop prior to calling wait_for_complete, so we need to reset the
> > completion variable and the wait_for_complete flag.  Otherwise we will be in a
> > case where, on the next entrace to this loop we will have a completion variable
> > with completion->done > 0, meaning the next wait will be a fall through case,
> > which we don't want.
> 
> By moving back to the point where schedule() is called, hopefully this
> complexity automatically goes away. Same as my comment to the line
> immediately above.
> 
Its going to change what the complexity is, actually.  I was looking at this
last night, and I realized that your assertion that we could remove
packet_next_frame came at a cost.  This is because we need to determine if we
have to wait before we call po->xmit, but we need to actually do the wait after
po->xmit (to ensure that wait_on_complete is set properly when the desructor is
called).  By moving the wait to the top of the loop and getting rid of
packet_next_frame we now have a race condition in which we might call
tpacket_destruct_skb with wait_on_complete set to false, causing us to wait for
the maximum timeout erroneously.  So I'm going to have to find a new way to
signal the need to call complete, which I think will introduce a different level
of complexity.

> > > > @@ -2740,6 +2772,20 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > >                 }
> > > >                 packet_increment_head(&po->tx_ring);
> > > >                 len_sum += tp_len;
> > > > +
> > > > +               if (po->wait_on_complete) {
> > > > +                       timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
> > > > +                       po->wait_on_complete = 0;
> > >
> > > I was going to argue for clearing in tpacket_destruct_skb. But then we
> > > would have to separate clear on timeout instead of signal, too.
> > >
> > >   po->wait_on_complete = 1;
> > >   timeo = wait_for_completion...
> > >   po->wait_on_complete = 0;
> > >
> > Also, we would have a race condition, since the destructor may be called from
> > softirq context (the first cause of the bug I'm fixing here), and so if the
> > packet is freed prior to us checking wait_for_complete in tpacket_snd, we will
> > be in the above situation again, exiting the loop with a completion variable in
> > an improper state.
> 
> Good point.
> 
> The common pattern is to clear in tpacket_destruct_skb. Then
> we do need to handle the case where the wait is interrupted or
> times out and reset it in those cases.
> 
As noted above, if we restore the original control flow, the wait_on_complete
flag is not useable, as its state needs to be determined prior to actually
sending the frame to the driver via po->xmit.

I'm going to try some logic in which both tpacket_snd and tpacket_destruct_skb
key off of packet_read_pending.  I think this will require a re-initalization of
the completion queue on each entry to tpacket_snd, but perhaps thats more
pallatable.

> > > This is basically replacing a busy-wait with schedule() with sleeping
> > > using wait_for_completion_interruptible_timeout. My main question is
> > > does this really need to move control flow around and add
> > > packet_next_frame? If not, especially for net, the shortest, simplest
> > > change is preferable.
> > >
> > Its not replacing a busy wait at all, its replacing a non-blocking schedule with
> > a blocking schedule (via completion queues).  As for control flow, Im not sure I
> > why you are bound to the existing control flow, and given that we already have
> > packet_previous_frame, I didn't see anything egregious about adding
> > packet_next_frame as well, but since you've seen a way to eliminate it, I'm ok
> > with it.
> 
> The benefit of keeping to the existing control flow is that that is a
> smaller change, so easier to verify.
> 
I don't think it will really, because its going to introduce different
complexities, but we'll see for certain when I finish getting it rewritten
again.

> I understand the benefit of moving the wait outside the loop. Before
> this report, I was not even aware of that behavior on !MSG_DONTWAIT,
> because it is so co-located.
> 
> But moving it elsewhere in the loop does not have the same benefit,
> imho. Either way, I think we better leave any such code improvements
> to net-next and focus on the minimal , least risky, patch for net.
> 
Ok.
Willem de Bruijn June 25, 2019, 1:37 p.m. UTC | #5
On Tue, Jun 25, 2019 at 7:03 AM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Mon, Jun 24, 2019 at 06:15:29PM -0400, Willem de Bruijn wrote:
> > > > > +               if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
> > > > > +                       po->wait_on_complete = 1;
> > > > > +                       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> > > >
> > > > This resets timeout on every loop. should only set above the loop once.
> > > >
> > > I explained exactly why I did that in the change log.  Its because I reuse the
> > > timeout variable to get the return value of the wait_for_complete call.
> > > Otherwise I need to add additional data to the stack, which I don't want to do.
> > > Sock_sndtimeo is an inline function and really doesn't add any overhead to this
> > > path, so I see no reason not to reuse the variable.
> >
> > The issue isn't the reuse. It is that timeo is reset to sk_sndtimeo
> > each time. Whereas wait_for_common and its variants return the
> > number of jiffies left in case the loop needs to sleep again later.
> >
> > Reading sock_sndtimeo once and passing it to wait_.. repeatedly is a
> > common pattern across the stack.
> >
> But those patterns are unique to those situations.  For instance, in
> tcp_sendmsg_locked, we aquire the value of the socket timeout, and use that to
> wait for the entire message send operation to complete, which consists of
> potentially several blocking operations (waiting for the tcp connection to be
> established, waiting for socket memory, etc).  In that situation we want to wait
> for all of those operations to complete to send a single message, and fail if
> they exceed the timeout in aggregate.  The semantics are different with
> AF_PACKET.  In this use case, the message is in effect empty, and just used to
> pass some control information.  tpacket_snd, sends as many frames from the
> memory mapped buffer as possible, and on each iteration we want to wait for the
> specified timeout for those frames to complete sending.  I think resetting the
> timeout on each wait instance is the right way to go here.

I disagree. If a SO_SNDTIMEO of a given time is set, the thread should
not wait beyond that. Else what is the point of passing a specific
duration in the syscall?

Btw, we can always drop the timeo and go back to unconditional (bar
signals) waiting.

>
> > > > > @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > > >                         err = net_xmit_errno(err);
> > > > >                         if (err && __packet_get_status(po, ph) ==
> > > > >                                    TP_STATUS_AVAILABLE) {
> > > > > +                               /* re-init completion queue to avoid subsequent fallthrough
> > > > > +                                * on a future thread calling wait_on_complete_interruptible_timeout
> > > > > +                                */
> > > > > +                               po->wait_on_complete = 0;
> > > >
> > > > If setting where sleeping, no need for resetting if a failure happens
> > > > between those blocks.
> > > >
> > > > > +                               init_completion(&po->skb_completion);
> > > >
> > > > no need to reinit between each use?
> > > >
> > > I explained exactly why I did this in the comment above.  We have to set
> > > wait_for_complete prior to calling transmit, so as to ensure that we call
> > > wait_for_completion before we exit the loop. However, in this error case, we
> > > exit the loop prior to calling wait_for_complete, so we need to reset the
> > > completion variable and the wait_for_complete flag.  Otherwise we will be in a
> > > case where, on the next entrace to this loop we will have a completion variable
> > > with completion->done > 0, meaning the next wait will be a fall through case,
> > > which we don't want.
> >
> > By moving back to the point where schedule() is called, hopefully this
> > complexity automatically goes away. Same as my comment to the line
> > immediately above.
> >
> Its going to change what the complexity is, actually.  I was looking at this
> last night, and I realized that your assertion that we could remove
> packet_next_frame came at a cost.  This is because we need to determine if we
> have to wait before we call po->xmit, but we need to actually do the wait after
> po->xmit

Why? The existing method using schedule() doesn't.

Can we not just loop while sending and sleep immediately when
__packet_get_status returns !TP_STATUS_AVAILABLE?

I don't understand the need to probe the next packet to send instead
of the current.

This seems to be the crux of the disagreement. My guess is that it has
to do with setting wait_on_complete, but I don't see the problem. It
can be set immediately before going to sleep.

I don't meant to draw this out, btw, or add to your workload. If you
prefer, I can take a stab at my (admittedly hand-wavy) suggestion.

> (to ensure that wait_on_complete is set properly when the desructor is
> called).  By moving the wait to the top of the loop and getting rid of
> packet_next_frame we now have a race condition in which we might call
> tpacket_destruct_skb with wait_on_complete set to false, causing us to wait for
> the maximum timeout erroneously.  So I'm going to have to find a new way to
> signal the need to call complete, which I think will introduce a different level
> of complexity.
Neil Horman June 25, 2019, 4:20 p.m. UTC | #6
On Tue, Jun 25, 2019 at 09:37:17AM -0400, Willem de Bruijn wrote:
> On Tue, Jun 25, 2019 at 7:03 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> >
> > On Mon, Jun 24, 2019 at 06:15:29PM -0400, Willem de Bruijn wrote:
> > > > > > +               if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
> > > > > > +                       po->wait_on_complete = 1;
> > > > > > +                       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> > > > >
> > > > > This resets timeout on every loop. should only set above the loop once.
> > > > >
> > > > I explained exactly why I did that in the change log.  Its because I reuse the
> > > > timeout variable to get the return value of the wait_for_complete call.
> > > > Otherwise I need to add additional data to the stack, which I don't want to do.
> > > > Sock_sndtimeo is an inline function and really doesn't add any overhead to this
> > > > path, so I see no reason not to reuse the variable.
> > >
> > > The issue isn't the reuse. It is that timeo is reset to sk_sndtimeo
> > > each time. Whereas wait_for_common and its variants return the
> > > number of jiffies left in case the loop needs to sleep again later.
> > >
> > > Reading sock_sndtimeo once and passing it to wait_.. repeatedly is a
> > > common pattern across the stack.
> > >
> > But those patterns are unique to those situations.  For instance, in
> > tcp_sendmsg_locked, we aquire the value of the socket timeout, and use that to
> > wait for the entire message send operation to complete, which consists of
> > potentially several blocking operations (waiting for the tcp connection to be
> > established, waiting for socket memory, etc).  In that situation we want to wait
> > for all of those operations to complete to send a single message, and fail if
> > they exceed the timeout in aggregate.  The semantics are different with
> > AF_PACKET.  In this use case, the message is in effect empty, and just used to
> > pass some control information.  tpacket_snd, sends as many frames from the
> > memory mapped buffer as possible, and on each iteration we want to wait for the
> > specified timeout for those frames to complete sending.  I think resetting the
> > timeout on each wait instance is the right way to go here.
> 
> I disagree. If a SO_SNDTIMEO of a given time is set, the thread should
> not wait beyond that. Else what is the point of passing a specific
> duration in the syscall?
> 
I can appreciate that, but you said yourself that you wanted to minimize this
change.  The current behavior of AF_PACKET is to:
a) check for their being no more packets ready to send
b) if (a) is false we schedule() (nominally allowing other tasks to run)
c) when (b) is complete, check for additional frames to send, and exit if there
are not, otherwise, continue sending and waiting

In that model, a single task calling sendmsg can run in the kernel indefinately,
if userspace continues to fill the memory mapped buffer

Given that model, resetting the timeout on each call to wait_for_completion
keeps that behavior.  In fact, if we allow the timeout value to get continuously
decremented, it will be possible for a call to sendmsg in this protocol to
_always_ return -ETIMEDOUT (if we keep the buffer full in userspace long enough,
then the sending task will eventually decrement timeo to zero, and force an
-ETIMEDOUT call).

I'm convinced that resetting the timeout here is the right way to go.

> Btw, we can always drop the timeo and go back to unconditional (bar
> signals) waiting.
> 
We could, but it would be nice to implement the timeout feature here if
possible.

> >
> > > > > > @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > > > >                         err = net_xmit_errno(err);
> > > > > >                         if (err && __packet_get_status(po, ph) ==
> > > > > >                                    TP_STATUS_AVAILABLE) {
> > > > > > +                               /* re-init completion queue to avoid subsequent fallthrough
> > > > > > +                                * on a future thread calling wait_on_complete_interruptible_timeout
> > > > > > +                                */
> > > > > > +                               po->wait_on_complete = 0;
> > > > >
> > > > > If setting where sleeping, no need for resetting if a failure happens
> > > > > between those blocks.
> > > > >
> > > > > > +                               init_completion(&po->skb_completion);
> > > > >
> > > > > no need to reinit between each use?
> > > > >
> > > > I explained exactly why I did this in the comment above.  We have to set
> > > > wait_for_complete prior to calling transmit, so as to ensure that we call
> > > > wait_for_completion before we exit the loop. However, in this error case, we
> > > > exit the loop prior to calling wait_for_complete, so we need to reset the
> > > > completion variable and the wait_for_complete flag.  Otherwise we will be in a
> > > > case where, on the next entrace to this loop we will have a completion variable
> > > > with completion->done > 0, meaning the next wait will be a fall through case,
> > > > which we don't want.
> > >
> > > By moving back to the point where schedule() is called, hopefully this
> > > complexity automatically goes away. Same as my comment to the line
> > > immediately above.
> > >
> > Its going to change what the complexity is, actually.  I was looking at this
> > last night, and I realized that your assertion that we could remove
> > packet_next_frame came at a cost.  This is because we need to determine if we
> > have to wait before we call po->xmit, but we need to actually do the wait after
> > po->xmit
> 
> Why? The existing method using schedule() doesn't.
> 
Because the existing method using schedule doesn't have to rely on an
asynchronous event to wake the sending task back up.  Specifically, we need to
set some state that indicates tpacket_destruct_skb should call complete, and
since tpacket_destruct_skb is called asynchronously in a separate execution
context, we need to ensure there is no race condition whereby we execute
tpacket_destruct_skb before we set the state that we also use to determine if we
should call wait_for_complete.  ie:
1) tpacket_send_skb calls po->xmit
2) tpacket_send_skb loops around and checks to see if there are more packets to
send.  If not and if need_wait is set we call wait_for_complete

If tpacket_destruct_skb is called after (2), we're fine.  But if its called
between (1) and (2), then tpacket_destruct_skb won't call complete (because
wait_on_complete isn't set yet), causing a hang.

Because you wanted to remove packet_next_frame, we have no way to determine if
we're done sending frames prior to calling po->xmit, which is the point past
which tpacket_destruct_skb might be called, so now I have to find an alternate
state condition to key off of.
 

> Can we not just loop while sending and sleep immediately when
> __packet_get_status returns !TP_STATUS_AVAILABLE?
> 
> I don't understand the need to probe the next packet to send instead
> of the current.
> 
See above, we can definately check the return of ph at the top of the loop, and
sleep if its NULL, but in so doing we cant use po->wait_on_complete as a gating
variable because we've already called po->xmit.  Once we call that function, if
we haven't already set po->wait_on_complete to true, its possible
tpacket_destruct_skb will already have been called before we get to the point
where we check wait_for_completion.  That means we will wait on a completion
variable that never gets completed, so I need to find a new way to track weather
or not we are waiting.  Its entirely doable, I'm sure, its just not as straight
forward as your making it out to be.

> This seems to be the crux of the disagreement. My guess is that it has
> to do with setting wait_on_complete, but I don't see the problem. It
> can be set immediately before going to sleep.
> 
The reason has to do with the fact that
tpacket_destruct_skb can be called from ksoftirq context (i.e. it will be called
asynchrnously, not from the thread running in tpacket_snd).  Condsider this
flow, assuming we do as you suggest, and just set po->wait_on_complete=1
immediately prior to calling wait_for_complete_interruptible_timeout: 

1) tpacket_snd gets a frame, builds the skb for transmission
2) tpakcket_snd calls po->xmit(skb), sending the frame to the driver
3) tpacket_snd, iterates through back to the top of the loop, and determines
that we need to wait for the previous packet to complete
4) The driver gets a tx completion interrupt, interrupting the cpu on which we
are executing, and calls kfree_skb_irq on the skb we just transmitted
5) the ksoftirq runs on the cpu, calling kfree_skb on our skb
6) (5) calls skb->destruct (which is tpakcet_skb_destruct)
7) tpacket_skb_destruct executes, sees that po->wait_on_completion is zero, and
skips calling complete()
8) the thread running tpacket_snd gets scheduled back on the cpu and now sets
po->wait_on_complete, then calls wait_for_completion_interruptible_timeout, but
since the skb we are waiting to complete has already been freed, we will never
get the completion notification, and so we will wait for the maximal timeout,
which is absolutely not what we want.

Interestingly (Ironically), that control flow will never happen in the use case
I'm trying to fix here, because its SCHED_FIFO, and will run until such time as
we call wait_for_completion_interuptible_timeout, so in this case we're safe.
But in the nominal case, where the sending task is acturally SCHED_OTHER, the
above can aboslutely happen, and thats very broken.

> I don't meant to draw this out, btw, or add to your workload. If you
> prefer, I can take a stab at my (admittedly hand-wavy) suggestion.
> 
No, I have another method in flight that I'm testing now, it removes the
po->wait_on_complete variable entirely, checking instead to make sure that
we've:
a) sent at least one frame
and
b) that we have a positive pending count
and
c) that we need to wait
and
d) that we have no more frames to send

This is why I was saying that your idea can be done, but it trades one form of
complexity for another.

Neil

> > (to ensure that wait_on_complete is set properly when the desructor is
> > called).  By moving the wait to the top of the loop and getting rid of
> > packet_next_frame we now have a race condition in which we might call
> > tpacket_destruct_skb with wait_on_complete set to false, causing us to wait for
> > the maximum timeout erroneously.  So I'm going to have to find a new way to
> > signal the need to call complete, which I think will introduce a different level
> > of complexity.
>
Willem de Bruijn June 25, 2019, 9:59 p.m. UTC | #7
On Tue, Jun 25, 2019 at 12:20 PM Neil Horman <nhorman@tuxdriver.com> wrote:
>
> On Tue, Jun 25, 2019 at 09:37:17AM -0400, Willem de Bruijn wrote:
> > On Tue, Jun 25, 2019 at 7:03 AM Neil Horman <nhorman@tuxdriver.com> wrote:
> > >
> > > On Mon, Jun 24, 2019 at 06:15:29PM -0400, Willem de Bruijn wrote:
> > > > > > > +               if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
> > > > > > > +                       po->wait_on_complete = 1;
> > > > > > > +                       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
> > > > > >
> > > > > > This resets timeout on every loop. should only set above the loop once.
> > > > > >
> > > > > I explained exactly why I did that in the change log.  Its because I reuse the
> > > > > timeout variable to get the return value of the wait_for_complete call.
> > > > > Otherwise I need to add additional data to the stack, which I don't want to do.
> > > > > Sock_sndtimeo is an inline function and really doesn't add any overhead to this
> > > > > path, so I see no reason not to reuse the variable.
> > > >
> > > > The issue isn't the reuse. It is that timeo is reset to sk_sndtimeo
> > > > each time. Whereas wait_for_common and its variants return the
> > > > number of jiffies left in case the loop needs to sleep again later.
> > > >
> > > > Reading sock_sndtimeo once and passing it to wait_.. repeatedly is a
> > > > common pattern across the stack.
> > > >
> > > But those patterns are unique to those situations.  For instance, in
> > > tcp_sendmsg_locked, we aquire the value of the socket timeout, and use that to
> > > wait for the entire message send operation to complete, which consists of
> > > potentially several blocking operations (waiting for the tcp connection to be
> > > established, waiting for socket memory, etc).  In that situation we want to wait
> > > for all of those operations to complete to send a single message, and fail if
> > > they exceed the timeout in aggregate.  The semantics are different with
> > > AF_PACKET.  In this use case, the message is in effect empty, and just used to
> > > pass some control information.  tpacket_snd, sends as many frames from the
> > > memory mapped buffer as possible, and on each iteration we want to wait for the
> > > specified timeout for those frames to complete sending.  I think resetting the
> > > timeout on each wait instance is the right way to go here.
> >
> > I disagree. If a SO_SNDTIMEO of a given time is set, the thread should
> > not wait beyond that. Else what is the point of passing a specific
> > duration in the syscall?
> >
> I can appreciate that, but you said yourself that you wanted to minimize this
> change.  The current behavior of AF_PACKET is to:
> a) check for their being no more packets ready to send
> b) if (a) is false we schedule() (nominally allowing other tasks to run)
> c) when (b) is complete, check for additional frames to send, and exit if there
> are not, otherwise, continue sending and waiting
>
> In that model, a single task calling sendmsg can run in the kernel indefinately,
> if userspace continues to fill the memory mapped buffer
>
> Given that model, resetting the timeout on each call to wait_for_completion
> keeps that behavior.  In fact, if we allow the timeout value to get continuously
> decremented, it will be possible for a call to sendmsg in this protocol to
> _always_ return -ETIMEDOUT (if we keep the buffer full in userspace long enough,
> then the sending task will eventually decrement timeo to zero, and force an
> -ETIMEDOUT call).
>
> I'm convinced that resetting the timeout here is the right way to go.

It upholds the contract, but extends when userspace appends to the
ring. Okay, yes, that makes sense.

> > Btw, we can always drop the timeo and go back to unconditional (bar
> > signals) waiting.
> >
> We could, but it would be nice to implement the timeout feature here if
> possible.
>
> > >
> > > > > > > @@ -2728,6 +2755,11 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
> > > > > > >                         err = net_xmit_errno(err);
> > > > > > >                         if (err && __packet_get_status(po, ph) ==
> > > > > > >                                    TP_STATUS_AVAILABLE) {
> > > > > > > +                               /* re-init completion queue to avoid subsequent fallthrough
> > > > > > > +                                * on a future thread calling wait_on_complete_interruptible_timeout
> > > > > > > +                                */
> > > > > > > +                               po->wait_on_complete = 0;
> > > > > >
> > > > > > If setting where sleeping, no need for resetting if a failure happens
> > > > > > between those blocks.
> > > > > >
> > > > > > > +                               init_completion(&po->skb_completion);
> > > > > >
> > > > > > no need to reinit between each use?
> > > > > >
> > > > > I explained exactly why I did this in the comment above.  We have to set
> > > > > wait_for_complete prior to calling transmit, so as to ensure that we call
> > > > > wait_for_completion before we exit the loop. However, in this error case, we
> > > > > exit the loop prior to calling wait_for_complete, so we need to reset the
> > > > > completion variable and the wait_for_complete flag.  Otherwise we will be in a
> > > > > case where, on the next entrace to this loop we will have a completion variable
> > > > > with completion->done > 0, meaning the next wait will be a fall through case,
> > > > > which we don't want.
> > > >
> > > > By moving back to the point where schedule() is called, hopefully this
> > > > complexity automatically goes away. Same as my comment to the line
> > > > immediately above.
> > > >
> > > Its going to change what the complexity is, actually.  I was looking at this
> > > last night, and I realized that your assertion that we could remove
> > > packet_next_frame came at a cost.  This is because we need to determine if we
> > > have to wait before we call po->xmit, but we need to actually do the wait after
> > > po->xmit
> >
> > Why? The existing method using schedule() doesn't.
> >
> Because the existing method using schedule doesn't have to rely on an
> asynchronous event to wake the sending task back up.  Specifically, we need to
> set some state that indicates tpacket_destruct_skb should call complete, and
> since tpacket_destruct_skb is called asynchronously in a separate execution
> context, we need to ensure there is no race condition whereby we execute
> tpacket_destruct_skb before we set the state that we also use to determine if we
> should call wait_for_complete.  ie:
> 1) tpacket_send_skb calls po->xmit
> 2) tpacket_send_skb loops around and checks to see if there are more packets to
> send.  If not and if need_wait is set we call wait_for_complete
>
> If tpacket_destruct_skb is called after (2), we're fine.  But if its called
> between (1) and (2), then tpacket_destruct_skb won't call complete (because
> wait_on_complete isn't set yet), causing a hang.
>
> Because you wanted to remove packet_next_frame, we have no way to determine if
> we're done sending frames prior to calling po->xmit, which is the point past
> which tpacket_destruct_skb might be called, so now I have to find an alternate
> state condition to key off of.
>
>
> > Can we not just loop while sending and sleep immediately when
> > __packet_get_status returns !TP_STATUS_AVAILABLE?
> >
> > I don't understand the need to probe the next packet to send instead
> > of the current.
> >
> See above, we can definately check the return of ph at the top of the loop, and
> sleep if its NULL, but in so doing we cant use po->wait_on_complete as a gating
> variable because we've already called po->xmit.  Once we call that function, if
> we haven't already set po->wait_on_complete to true, its possible
> tpacket_destruct_skb will already have been called before we get to the point
> where we check wait_for_completion.  That means we will wait on a completion
> variable that never gets completed, so I need to find a new way to track weather
> or not we are waiting.  Its entirely doable, I'm sure, its just not as straight
> forward as your making it out to be.
>
> > This seems to be the crux of the disagreement. My guess is that it has
> > to do with setting wait_on_complete, but I don't see the problem. It
> > can be set immediately before going to sleep.
> >
> The reason has to do with the fact that
> tpacket_destruct_skb can be called from ksoftirq context (i.e. it will be called
> asynchrnously, not from the thread running in tpacket_snd).  Condsider this
> flow, assuming we do as you suggest, and just set po->wait_on_complete=1
> immediately prior to calling wait_for_complete_interruptible_timeout:
>
> 1) tpacket_snd gets a frame, builds the skb for transmission
> 2) tpakcket_snd calls po->xmit(skb), sending the frame to the driver
> 3) tpacket_snd, iterates through back to the top of the loop, and determines
> that we need to wait for the previous packet to complete
> 4) The driver gets a tx completion interrupt, interrupting the cpu on which we
> are executing, and calls kfree_skb_irq on the skb we just transmitted
> 5) the ksoftirq runs on the cpu, calling kfree_skb on our skb
> 6) (5) calls skb->destruct (which is tpakcet_skb_destruct)
> 7) tpacket_skb_destruct executes, sees that po->wait_on_completion is zero, and
> skips calling complete()
> 8) the thread running tpacket_snd gets scheduled back on the cpu and now sets
> po->wait_on_complete, then calls wait_for_completion_interruptible_timeout, but
> since the skb we are waiting to complete has already been freed, we will never
> get the completion notification, and so we will wait for the maximal timeout,
> which is absolutely not what we want.
>
> Interestingly (Ironically), that control flow will never happen in the use case
> I'm trying to fix here, because its SCHED_FIFO, and will run until such time as
> we call wait_for_completion_interuptible_timeout, so in this case we're safe.
> But in the nominal case, where the sending task is acturally SCHED_OTHER, the
> above can aboslutely happen, and thats very broken.
>
> > I don't meant to draw this out, btw, or add to your workload. If you
> > prefer, I can take a stab at my (admittedly hand-wavy) suggestion.
> >
> No, I have another method in flight that I'm testing now, it removes the
> po->wait_on_complete variable entirely, checking instead to make sure that
> we've:
> a) sent at least one frame
> and
> b) that we have a positive pending count
> and
> c) that we need to wait
> and
> d) that we have no more frames to send
>
> This is why I was saying that your idea can be done, but it trades one form of
> complexity for another.

Thanks for the detailed explanation of the races.

I was thinking this through, but just see V4. Will take a look at that.
diff mbox series

Patch

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index a29d66da7394..5c48bb7a4fa5 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -380,7 +380,6 @@  static void __packet_set_status(struct packet_sock *po, void *frame, int status)
 		WARN(1, "TPACKET version not supported.\n");
 		BUG();
 	}
-
 	smp_wmb();
 }
 
@@ -1148,6 +1147,14 @@  static void *packet_previous_frame(struct packet_sock *po,
 	return packet_lookup_frame(po, rb, previous, status);
 }
 
+static void *packet_next_frame(struct packet_sock *po,
+		struct packet_ring_buffer *rb,
+		int status)
+{
+	unsigned int next = rb->head != rb->frame_max ? rb->head+1 : 0;
+	return packet_lookup_frame(po, rb, next, status);
+}
+
 static void packet_increment_head(struct packet_ring_buffer *buff)
 {
 	buff->head = buff->head != buff->frame_max ? buff->head+1 : 0;
@@ -2401,6 +2408,9 @@  static void tpacket_destruct_skb(struct sk_buff *skb)
 
 		ts = __packet_set_timestamp(po, ph, skb);
 		__packet_set_status(po, ph, TP_STATUS_AVAILABLE | ts);
+
+		if (po->wait_on_complete && !packet_read_pending(&po->tx_ring))
+			complete(&po->skb_completion);
 	}
 
 	sock_wfree(skb);
@@ -2600,9 +2610,11 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int len_sum = 0;
 	int status = TP_STATUS_AVAILABLE;
 	int hlen, tlen, copylen = 0;
+	long timeo = 0;
 
 	mutex_lock(&po->pg_vec_lock);
 
+
 	if (likely(saddr == NULL)) {
 		dev	= packet_cached_dev_get(po);
 		proto	= po->num;
@@ -2647,16 +2659,16 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		size_max = dev->mtu + reserve + VLAN_HLEN;
 
 	do {
+
 		ph = packet_current_frame(po, &po->tx_ring,
 					  TP_STATUS_SEND_REQUEST);
-		if (unlikely(ph == NULL)) {
-			if (need_wait && need_resched())
-				schedule();
-			continue;
-		}
+
+		if (unlikely(ph == NULL))
+			break;
 
 		skb = NULL;
 		tp_len = tpacket_parse_header(po, ph, size_max, &data);
+
 		if (tp_len < 0)
 			goto tpacket_error;
 
@@ -2720,6 +2732,21 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 
 		skb->destructor = tpacket_destruct_skb;
 		__packet_set_status(po, ph, TP_STATUS_SENDING);
+
+		/*
+		 * If we need to wait and we've sent the last frame pending
+		 * transmission in the mmaped buffer, flag that we need to wait
+		 * on those frames to get freed via tpacket_destruct_skb.  This
+		 * flag indicates that tpacket_destruct_skb should call complete
+		 * when the packet_pending count reaches zero, and that we need
+		 * to call wait_on_complete_interruptible_timeout below, to make
+		 * sure we pick up the result of that completion
+		 */
+		if (need_wait && !packet_next_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST)) {
+			po->wait_on_complete = 1;
+			timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
+		}
+
 		packet_inc_pending(&po->tx_ring);
 
 		status = TP_STATUS_SEND_REQUEST;
@@ -2728,6 +2755,11 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 			err = net_xmit_errno(err);
 			if (err && __packet_get_status(po, ph) ==
 				   TP_STATUS_AVAILABLE) {
+				/* re-init completion queue to avoid subsequent fallthrough
+				 * on a future thread calling wait_on_complete_interruptible_timeout
+				 */
+				po->wait_on_complete = 0;
+				init_completion(&po->skb_completion);
 				/* skb was destructed already */
 				skb = NULL;
 				goto out_status;
@@ -2740,6 +2772,20 @@  static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 		}
 		packet_increment_head(&po->tx_ring);
 		len_sum += tp_len;
+
+		if (po->wait_on_complete) {
+			timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
+			po->wait_on_complete = 0;
+			if (!timeo) {
+				/* We timed out, break out and notify userspace */
+				err = -ETIMEDOUT;
+				goto out_status;
+			} else if (timeo == -ERESTARTSYS) {
+				err = -ERESTARTSYS;
+				goto out_status;
+			}
+		}
+
 	} while (likely((ph != NULL) ||
 		/* Note: packet_read_pending() might be slow if we have
 		 * to call it as it's per_cpu variable, but in fast-path
@@ -3207,6 +3253,7 @@  static int packet_create(struct net *net, struct socket *sock, int protocol,
 	sock_init_data(sock, sk);
 
 	po = pkt_sk(sk);
+	init_completion(&po->skb_completion);
 	sk->sk_family = PF_PACKET;
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 3bb7c5fb3bff..bbb4be2c18e7 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -128,6 +128,8 @@  struct packet_sock {
 	unsigned int		tp_hdrlen;
 	unsigned int		tp_reserve;
 	unsigned int		tp_tstamp;
+	struct completion	skb_completion;
+	unsigned int		wait_on_complete;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;