Message ID: 1558609008-2590-4-git-send-email-makita.toshiaki@lab.ntt.co.jp
State: Changes Requested
Delegated to: BPF Maintainers
Series: veth: Bulk XDP_TX
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:

> This improves XDP_TX performance by about 8%.
>
> [...]
>
> +static void veth_xdp_flush_bq(struct net_device *dev)
> +{
> +	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
> +	int sent, i, err = 0;
> +
> +	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);

Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
you're introducing an additional per-cpu bulk queue, only to avoid lock
contention around the existing pointer ring. But the pointer ring is
per-rq, so if you have lock contention, this means you must have
multiple CPUs servicing the same rq, no? So why not just fix that
instead?

-Toke
On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
> Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> you're introducing an additional per-cpu bulk queue, only to avoid lock
> contention around the existing pointer ring. But the pointer ring is
> per-rq, so if you have lock contention, this means you must have
> multiple CPUs servicing the same rq, no?

Yes, it's possible. Not recommended though.

> So why not just fix that instead?

The queues are shared with packets sent from the peer's network stack.
That's why I needed the lock. I have tried separating the queues, one
for redirect and one for the stack, but the receiver side got too
complicated and it ended up with worse performance.
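For context, the lock in question is the producer lock on the peer rq's
ptr_ring. Roughly, paraphrasing veth.c of this era (frame-length check
and stats elided; this is a sketch, not an exact copy), veth_xdp_xmit()
does the following, so bulking takes the lock once per flush instead of
once per frame:

static int veth_xdp_xmit_sketch(struct veth_rq *rq, int n,
				struct xdp_frame **frames)
{
	int i, drops = 0;

	/* one lock/unlock pair amortized over n frames */
	spin_lock(&rq->xdp_ring.producer_lock);
	for (i = 0; i < n; i++) {
		void *ptr = veth_xdp_to_ptr(frames[i]);

		/* ring full: return the frame to its allocator */
		if (unlikely(__ptr_ring_produce(&rq->xdp_ring, ptr))) {
			xdp_return_frame_rx_napi(frames[i]);
			drops++;
		}
	}
	spin_unlock(&rq->xdp_ring.producer_lock);

	return n - drops;
}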
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> writes:

> The queues are shared with packets sent from the peer's network stack.
> That's why I needed the lock. I have tried separating the queues, one
> for redirect and one for the stack, but the receiver side got too
> complicated and it ended up with worse performance.

I meant fix it with configuration. How many receive queues are you
running on the veth device in your benchmarks, and how have you
configured RPS?

-Toke
On Thu, 23 May 2019 20:35:50 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> On 2019/05/23 20:25, Toke Høiland-Jørgensen wrote:
> > Wait, veth_xdp_xmit() is just putting frames on a pointer ring. So
> > you're introducing an additional per-cpu bulk queue, only to avoid
> > lock contention around the existing pointer ring. But the pointer
> > ring is per-rq, so if you have lock contention, this means you must
> > have multiple CPUs servicing the same rq, no?
>
> Yes, it's possible. Not recommended though.

I think the general per-cpu TX bulk queue is overkill. There is a loop
over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
the caller veth_poll() will call veth_xdp_flush(rq->dev).

Why can't you store this "temp" bulk array in struct veth_rq?

You could even alloc/create it on the stack of veth_poll() and send it
along via a pointer to veth_xdp_rcv().
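A minimal sketch of the per-rq placement Jesper suggests, assuming the
xdp_tx_bulk_queue type from this series; the tx_bq field name is
hypothetical and the existing veth_rq members are abbreviated:

/* Sketch only: embed the bulk queue in the per-queue struct instead
 * of a global per-cpu variable.
 */
struct veth_rq {
	struct napi_struct	xdp_napi;
	struct net_device	*dev;
	/* ... other existing fields ... */
	struct xdp_tx_bulk_queue tx_bq;
};

static void veth_xdp_flush_bq(struct veth_rq *rq)
{
	struct xdp_tx_bulk_queue *bq = &rq->tx_bq;	/* no this_cpu_ptr() */
	int sent, i, err = 0;

	sent = veth_xdp_xmit(rq->dev, bq->count, bq->q, 0);
	if (sent < 0) {
		err = sent;
		sent = 0;
		for (i = 0; i < bq->count; i++)
			xdp_return_frame(bq->q[i]);
	}
	trace_xdp_bulk_tx(rq->dev, sent, bq->count - sent, err);

	bq->count = 0;
}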
On 19/05/23 (Thu) 21:18:25, Toke Høiland-Jørgensen wrote:
> I meant fix it with configuration. How many receive queues are you
> running on the veth device in your benchmarks, and how have you
> configured RPS?

As I wrote, this is a single-queue test and does not have any lock
contention. The per-packet lock has some overhead even in that
configuration.

Toshiaki Makita
On 19/05/23 (Thu) 22:29:27, Jesper Dangaard Brouer wrote:
> I think the general per-cpu TX bulk queue is overkill. There is a loop
> over packets in veth_xdp_rcv(struct veth_rq *rq, budget, *status), and
> the caller veth_poll() will call veth_xdp_flush(rq->dev).
>
> Why can't you store this "temp" bulk array in struct veth_rq?

Of course I can. But I thought tun has the same problem, and we can
decrease the memory footprint by sharing the same storage between
devices. Or if other devices want to reduce their queue counts so that
we can use XDP on many-CPU servers, and introduce locks, we can use
this storage for that case as well.

Still, do you prefer a veth-specific solution?

> You could even alloc/create it on the stack of veth_poll() and send it
> along via a pointer to veth_xdp_rcv().

Toshiaki Makita
On 2019/5/23 21:51, Toshiaki Makita wrote:
> Of course I can. But I thought tun has the same problem, and we can
> decrease the memory footprint by sharing the same storage between
> devices.

For TUN, and for its fast path where vhost passes a bulk of XDP frames
(through msg_control) to us, we probably just need a temporary bulk
array in tun_xdp_one() instead of a global one. I can post a patch, or
maybe you can if you're interested.

Thanks
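A rough sketch of that idea, with entirely hypothetical names (TUN's
actual batching path differs in detail): the vhost fast path would
collect XDP_TX frames into an on-stack batch and flush it once,
mirroring the error handling of the veth patch above.

/* Hypothetical sketch, not from any posted patch.  tun_xdp_xmit() is
 * TUN's real ndo_xdp_xmit implementation; everything else here is made
 * up for illustration.
 */
#define TUN_XDP_BATCH 16

struct tun_xdp_batch {
	struct xdp_frame *q[TUN_XDP_BATCH];
	unsigned int count;
};

static void tun_xdp_flush_batch(struct net_device *dev,
				struct tun_xdp_batch *b)
{
	int i, sent;

	sent = tun_xdp_xmit(dev, b->count, b->q, 0);
	if (sent < 0) {
		/* nothing was consumed; free the whole batch */
		for (i = 0; i < b->count; i++)
			xdp_return_frame(b->q[i]);
	}
	b->count = 0;
}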
On 2019/05/24 12:13, Jason Wang wrote:
> For TUN, and for its fast path where vhost passes a bulk of XDP frames
> (through msg_control) to us, we probably just need a temporary bulk
> array in tun_xdp_one() instead of a global one. I can post a patch, or
> maybe you can if you're interested.

Of course you/I can. What I'm concerned about is that it could waste
cache lines when softirq runs the veth napi handler and then the tun
napi handler.
On 2019/5/24 11:28, Toshiaki Makita wrote:
> Of course you/I can. What I'm concerned about is that it could waste
> cache lines when softirq runs the veth napi handler and then the tun
> napi handler.

Well, technically the bulk queue passed to TUN could be reused. I admit
it may save a cacheline in the ideal case, but I wonder how much we
could gain on a real workload. (Note TUN doesn't use a napi handler to
do XDP. It has a NAPI mode, but that was mainly added for hardening,
and XDP was not implemented there; maybe we should fix this.)

Thanks
On 2019/05/24 12:54, Jason Wang wrote:
> Well, technically the bulk queue passed to TUN could be reused. I admit
> it may save a cacheline in the ideal case, but I wonder how much we
> could gain on a real workload.

I see the veth_rq ptr_ring suffering from cacheline misses, which makes
me conservative about adding more buffers for xdp_frames. I'll wait for
some more feedback from others.

> (Note TUN doesn't use a napi handler to do XDP. It has a NAPI mode,
> but that was mainly added for hardening, and XDP was not implemented
> there; maybe we should fix this.)

Ah, that's true. Sorry for the confusion.

Toshiaki Makita
On Thu, 23 May 2019 22:51:34 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> Of course I can. But I thought tun has the same problem, and we can
> decrease the memory footprint by sharing the same storage between
> devices. Or if other devices want to reduce their queue counts so that
> we can use XDP on many-CPU servers, and introduce locks, we can use
> this storage for that case as well.
>
> Still, do you prefer a veth-specific solution?

Yes. Another reason is that with this shared/general per-cpu TX bulk
queue, I can easily see bugs resulting in xdp_frames getting
transmitted on a completely different NIC, which will be hard for
people to debug.

> > You could even alloc/create it on the stack of veth_poll() and send
> > it along via a pointer to veth_xdp_rcv().

IMHO it would be cleaner code-wise to place the "temp" bulk array in
struct veth_rq. But if you worry about performance and want a hot
cacheline for this, then you could just use the call stack of
veth_poll(), as I described. It should not be too ugly code-wise to do
this, I think.
On 2019/05/24 18:53, Jesper Dangaard Brouer wrote:
> Yes. Another reason is that with this shared/general per-cpu TX bulk
> queue, I can easily see bugs resulting in xdp_frames getting
> transmitted on a completely different NIC, which will be hard for
> people to debug.
>
> IMHO it would be cleaner code-wise to place the "temp" bulk array in
> struct veth_rq. But if you worry about performance and want a hot
> cacheline for this, then you could just use the call stack of
> veth_poll(), as I described. It should not be too ugly code-wise to do
> this, I think.

Rethinking this, I agree to using the stack instead of a global. For
performance you are right: the stack should be as hot as a global if
other drivers use the stack as well. I was a bit concerned about stack
size, but a 128-byte array is probably acceptable these days.

Wrt debugging, indeed the global solution is probably more difficult.
When we fail to flush the bq, the stack solution can be tracked by
something like kmemleak, but the global one cannot. Also, the global
solution has a risk of sending packets from unintended devices, which
leads to a security problem. With the stack solution, a missing flush
just causes packet loss and a memory leak.
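For reference, a minimal sketch of the stack-based variant being agreed
on here. The names veth_xdp_tx_bq and VETH_XDP_TX_BULK_SIZE are
placeholders, not from the posted series, and the surrounding
veth_poll() details are abbreviated: the 128-byte bulk queue lives on
veth_poll()'s stack, is threaded through veth_xdp_rcv() down to the
XDP_TX path, and is flushed once per NAPI poll.

/* 16 frame pointers = 128 bytes on 64-bit, matching the stack-size
 * estimate in the discussion.
 */
#define VETH_XDP_TX_BULK_SIZE 16

struct veth_xdp_tx_bq {
	struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE];
	unsigned int count;
};

static int veth_poll(struct napi_struct *napi, int budget)
{
	struct veth_rq *rq =
		container_of(napi, struct veth_rq, xdp_napi);
	struct veth_xdp_tx_bq bq;	/* lives on the NAPI poll stack */
	unsigned int xdp_xmit = 0;
	int done;

	bq.count = 0;

	/* rcv loop appends XDP_TX frames to bq via the passed pointer */
	done = veth_xdp_rcv(rq, budget, &xdp_xmit, &bq);

	/* ... napi_complete_done() handling elided ... */

	if (xdp_xmit & VETH_XDP_TX)
		veth_xdp_flush(rq->dev, &bq);	/* one flush per poll */
	if (xdp_xmit & VETH_XDP_REDIR)
		xdp_do_flush_map();

	return done;
}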
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 52110e5..4edc75f 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -442,6 +442,23 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return ret;
 }
 
+static void veth_xdp_flush_bq(struct net_device *dev)
+{
+	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
+	int sent, i, err = 0;
+
+	sent = veth_xdp_xmit(dev, bq->count, bq->q, 0);
+	if (sent < 0) {
+		err = sent;
+		sent = 0;
+		for (i = 0; i < bq->count; i++)
+			xdp_return_frame(bq->q[i]);
+	}
+	trace_xdp_bulk_tx(dev, sent, bq->count - sent, err);
+
+	bq->count = 0;
+}
+
 static void veth_xdp_flush(struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
@@ -449,6 +466,7 @@ static void veth_xdp_flush(struct net_device *dev)
 	struct veth_rq *rq;
 
 	rcu_read_lock();
+	veth_xdp_flush_bq(dev);
 	rcv = rcu_dereference(priv->peer);
 	if (unlikely(!rcv))
 		goto out;
@@ -466,12 +484,18 @@ static void veth_xdp_flush(struct net_device *dev)
 
 static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
 {
+	struct xdp_tx_bulk_queue *bq = this_cpu_ptr(&xdp_tx_bq);
 	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
 
 	if (unlikely(!frame))
 		return -EOVERFLOW;
 
-	return veth_xdp_xmit(dev, 1, &frame, 0);
+	if (unlikely(bq->count == XDP_TX_BULK_SIZE))
+		veth_xdp_flush_bq(dev);
+
+	bq->q[bq->count++] = frame;
+
+	return 0;
 }
 
 static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
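Not shown in this patch: the definitions of struct xdp_tx_bulk_queue,
xdp_tx_bq, and XDP_TX_BULK_SIZE, which come from an earlier patch in
the series. They presumably look roughly like the following sketch
(size and placement are assumptions, not quoted from the series):

/* Assumed shape of the per-cpu bulk queue this patch references; the
 * real definitions live in an earlier patch of the series and may
 * differ.
 */
#define XDP_TX_BULK_SIZE 16

struct xdp_tx_bulk_queue {
	struct xdp_frame *q[XDP_TX_BULK_SIZE];
	unsigned int count;
};

static DEFINE_PER_CPU(struct xdp_tx_bulk_queue, xdp_tx_bq);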
This improves XDP_TX performance by about 8%.

Here are single-core XDP_TX test results. CPU consumption numbers are
taken from "perf report --no-child".

- Before:

  7.26 Mpps

  _raw_spin_lock  7.83%
  veth_xdp_xmit  12.23%

- After:

  7.84 Mpps

  _raw_spin_lock  1.17%
  veth_xdp_xmit   6.45%

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)