diff mbox series

[net] virtio_net: Account for tx bytes and packets on sending xdp_frames

Message ID 1548934830-2389-1-git-send-email-makita.toshiaki@lab.ntt.co.jp
State Accepted
Delegated to: David Miller
Headers show
Series [net] virtio_net: Account for tx bytes and packets on sending xdp_frames | expand

Commit Message

Toshiaki Makita Jan. 31, 2019, 11:40 a.m. UTC
Previously virtnet_xdp_xmit() did not account for device tx counters,
which caused confusions.
To be consistent with SKBs, account them on freeing xdp_frames.

Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/virtio_net.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

Comments

Michael S. Tsirkin Jan. 31, 2019, 3:25 p.m. UTC | #1
On Thu, Jan 31, 2019 at 08:40:30PM +0900, Toshiaki Makita wrote:
> Previously virtnet_xdp_xmit() did not account for device tx counters,
> which caused confusions.
> To be consistent with SKBs, account them on freeing xdp_frames.
> 
> Reported-by: David Ahern <dsahern@gmail.com>
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Well we count them on receive so I guess it makes sense for consistency

Acked-by: Michael S. Tsirkin <mst@redhat.com>

however, I really wonder whether adding more and more standard net stack
things like this will end up costing most of XDP its speed.

Should we instead make sure *not* to account XDP packets
in any counters at all? XDP programs can use maps
to do their own counting...


> ---
>  drivers/net/virtio_net.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 2594481..4cfceb7 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -503,6 +503,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>  	struct bpf_prog *xdp_prog;
>  	struct send_queue *sq;
>  	unsigned int len;
> +	int packets = 0;
> +	int bytes = 0;
>  	int drops = 0;
>  	int kicks = 0;
>  	int ret, err;
> @@ -526,10 +528,18 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>  
>  	/* Free up any pending old buffers before queueing new ones. */
>  	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		if (likely(is_xdp_frame(ptr)))
> -			xdp_return_frame(ptr_to_xdp(ptr));
> -		else
> -			napi_consume_skb(ptr, false);
> +		if (likely(is_xdp_frame(ptr))) {
> +			struct xdp_frame *frame = ptr_to_xdp(ptr);
> +
> +			bytes += frame->len;
> +			xdp_return_frame(frame);
> +		} else {
> +			struct sk_buff *skb = ptr;
> +
> +			bytes += skb->len;
> +			napi_consume_skb(skb, false);
> +		}
> +		packets++;
>  	}
>  
>  	for (i = 0; i < n; i++) {
> @@ -549,6 +559,8 @@ static int virtnet_xdp_xmit(struct net_device *dev,
>  	}
>  out:
>  	u64_stats_update_begin(&sq->stats.syncp);
> +	sq->stats.bytes += bytes;
> +	sq->stats.packets += packets;
>  	sq->stats.xdp_tx += n;
>  	sq->stats.xdp_tx_drops += drops;
>  	sq->stats.kicks += kicks;
> -- 
> 1.8.3.1
>
David Miller Jan. 31, 2019, 5:45 p.m. UTC | #2
From: "Michael S. Tsirkin" <mst@redhat.com>
Date: Thu, 31 Jan 2019 10:25:17 -0500

> On Thu, Jan 31, 2019 at 08:40:30PM +0900, Toshiaki Makita wrote:
>> Previously virtnet_xdp_xmit() did not account for device tx counters,
>> which caused confusions.
>> To be consistent with SKBs, account them on freeing xdp_frames.
>> 
>> Reported-by: David Ahern <dsahern@gmail.com>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> Well we count them on receive so I guess it makes sense for consistency
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> however, I really wonder whether adding more and more standard net stack
> things like this will end up costing most of XDP its speed.
> 
> Should we instead make sure *not* to account XDP packets
> in any counters at all? XDP programs can use maps
> to do their own counting...

This has been definitely a discussion point, and something we should
develop a clear, strong, policy on.

David, Jesper, care to chime in where we ended up in that last thread
discussion this?
Jesper Dangaard Brouer Jan. 31, 2019, 8:15 p.m. UTC | #3
On Thu, 31 Jan 2019 09:45:23 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Thu, 31 Jan 2019 10:25:17 -0500
> 
> > On Thu, Jan 31, 2019 at 08:40:30PM +0900, Toshiaki Makita wrote:  
> >> Previously virtnet_xdp_xmit() did not account for device tx counters,
> >> which caused confusions.
> >> To be consistent with SKBs, account them on freeing xdp_frames.
> >> 
> >> Reported-by: David Ahern <dsahern@gmail.com>
> >> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>  
> > 
> > Well we count them on receive so I guess it makes sense for consistency
> > 
> > Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > however, I really wonder whether adding more and more standard net stack
> > things like this will end up costing most of XDP its speed.
> > 
> > Should we instead make sure *not* to account XDP packets
> > in any counters at all? XDP programs can use maps
> > to do their own counting...  
> 
> This has been definitely a discussion point, and something we should
> develop a clear, strong, policy on.
> 
> David, Jesper, care to chime in where we ended up in that last thread
> discussion this?

IHMO packets RX and TX on a device need to be accounted, in standard
counters, regardless of XDP.  For XDP RX the packet is counted as RX,
regardless if XDP choose to XDP_DROP.  On XDP TX which is via
XDP_REDIRECT or XDP_TX, the driver that transmit the packet need to
account the packet in a TX counter (this if often delayed to DMA TX
completion handling).  We cannot break the expectation that RX and TX
counter are visible to userspace stats tools. XDP should not make these
packets invisible.

Performance wise, I don't see an issue. As updating these counters
(packets and bytes) can be done as a bulk, either when driver NAPI RX
func ends, or in TX DMA completion, like most drivers already do today.
Further more, most drivers save this in per RX ring data-area, which
are only summed when userspace read these.


A separate question (and project) raised by David Ahern, was if we
should have more detailed stats on the different XDP action return
codes, as an easy means for sysadms to diagnose running XDP programs.
That is something that require more discussions, as it can impact
performance, and likely need to be opt-in.  My opinion is yes we should
do this for the sake of better User eXperience, BUT *only* if we can find
a technical solution that does not hurt performance.

--Jesper
Toshiaki Makita Feb. 1, 2019, 1:53 a.m. UTC | #4
On 2019/02/01 5:15, Jesper Dangaard Brouer wrote:
> On Thu, 31 Jan 2019 09:45:23 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
> 
>> From: "Michael S. Tsirkin" <mst@redhat.com>
>> Date: Thu, 31 Jan 2019 10:25:17 -0500
>>
>>> On Thu, Jan 31, 2019 at 08:40:30PM +0900, Toshiaki Makita wrote:  
>>>> Previously virtnet_xdp_xmit() did not account for device tx counters,
>>>> which caused confusions.
>>>> To be consistent with SKBs, account them on freeing xdp_frames.
>>>>
>>>> Reported-by: David Ahern <dsahern@gmail.com>
>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>  
>>>
>>> Well we count them on receive so I guess it makes sense for consistency
>>>
>>> Acked-by: Michael S. Tsirkin <mst@redhat.com>
>>>
>>> however, I really wonder whether adding more and more standard net stack
>>> things like this will end up costing most of XDP its speed.
>>>
>>> Should we instead make sure *not* to account XDP packets
>>> in any counters at all? XDP programs can use maps
>>> to do their own counting...  
>>
>> This has been definitely a discussion point, and something we should
>> develop a clear, strong, policy on.
>>
>> David, Jesper, care to chime in where we ended up in that last thread
>> discussion this?
> 
> IHMO packets RX and TX on a device need to be accounted, in standard
> counters, regardless of XDP.  For XDP RX the packet is counted as RX,
> regardless if XDP choose to XDP_DROP.  On XDP TX which is via
> XDP_REDIRECT or XDP_TX, the driver that transmit the packet need to
> account the packet in a TX counter (this if often delayed to DMA TX
> completion handling).  We cannot break the expectation that RX and TX
> counter are visible to userspace stats tools. XDP should not make these
> packets invisible.
> 
> Performance wise, I don't see an issue. As updating these counters
> (packets and bytes) can be done as a bulk, either when driver NAPI RX
> func ends, or in TX DMA completion, like most drivers already do today.
> Further more, most drivers save this in per RX ring data-area, which
> are only summed when userspace read these.

Agreed.

> A separate question (and project) raised by David Ahern, was if we
> should have more detailed stats on the different XDP action return
> codes, as an easy means for sysadms to diagnose running XDP programs.
> That is something that require more discussions, as it can impact
> performance, and likely need to be opt-in.  My opinion is yes we should
> do this for the sake of better User eXperience, BUT *only* if we can find
> a technical solution that does not hurt performance.

Basically the situation for the detailed stats is the same as standard
stats, at least in virtio_net. Stats are updated as a bulk, and the
counters reside in RX/TX ring structures.
Probably this way of implementation would be ok performance-wise?
But as other drivers may have different situations, if it is generally
difficult to avoid performance penalty I'm OK with making them opt-in as
a standard way.
David Ahern Feb. 2, 2019, 9:27 p.m. UTC | #5
On 1/31/19 1:15 PM, Jesper Dangaard Brouer wrote:
>>
>> David, Jesper, care to chime in where we ended up in that last thread
>> discussion this?
> 
> IHMO packets RX and TX on a device need to be accounted, in standard
> counters, regardless of XDP.  For XDP RX the packet is counted as RX,
> regardless if XDP choose to XDP_DROP.  On XDP TX which is via
> XDP_REDIRECT or XDP_TX, the driver that transmit the packet need to
> account the packet in a TX counter (this if often delayed to DMA TX
> completion handling).  We cannot break the expectation that RX and TX
> counter are visible to userspace stats tools. XDP should not make these
> packets invisible.

Agreed. What I was pushing on that last thread was Rx, Tx and dropped
are all accounted by the driver in standard stats. Basically if the
driver touched it, the driver's counters should indicate that.

The push back was on dropped packets and whether that counter should be
bumped on XDP_DROP.
David Miller Feb. 4, 2019, 4:18 a.m. UTC | #6
From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Date: Thu, 31 Jan 2019 20:40:30 +0900

> Previously virtnet_xdp_xmit() did not account for device tx counters,
> which caused confusions.
> To be consistent with SKBs, account them on freeing xdp_frames.
> 
> Reported-by: David Ahern <dsahern@gmail.com>
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Applied, thank you.
Jesper Dangaard Brouer Feb. 4, 2019, 11:53 a.m. UTC | #7
On Sat, 2 Feb 2019 14:27:26 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 1/31/19 1:15 PM, Jesper Dangaard Brouer wrote:
> >>
> >> David, Jesper, care to chime in where we ended up in that last thread
> >> discussion this?  
> > 
> > IHMO packets RX and TX on a device need to be accounted, in standard
> > counters, regardless of XDP.  For XDP RX the packet is counted as RX,
> > regardless if XDP choose to XDP_DROP.  On XDP TX which is via
> > XDP_REDIRECT or XDP_TX, the driver that transmit the packet need to
> > account the packet in a TX counter (this if often delayed to DMA TX
> > completion handling).  We cannot break the expectation that RX and TX
> > counter are visible to userspace stats tools. XDP should not make these
> > packets invisible.  
> 
> Agreed. What I was pushing on that last thread was Rx, Tx and dropped
> are all accounted by the driver in standard stats. Basically if the
> driver touched it, the driver's counters should indicate that.

Sound like we all agree (except with the dropped counter, see below).

Do notice that mlx5 driver doesn't do this.  It is actually rather
confusing to use XDP on mlx5, as when XDP "consume" which include
XDP_DROP, XDP_REDIRECT or XDP_TX, then the driver standard stats are
not incremented... the packet is invisible to "ifconfig" stat based
tools.


> The push back was on dropped packets and whether that counter should be
> bumped on XDP_DROP.

My opinion is the XDP_DROP action should NOT increment the drivers drop
counter.  First of all the "dropped" counter is also use for other
stuff, which will confuse that this counter express.  Second, choosing
XDP_DROP is a policy choice, it still means it was RX-ed at the driver
level.
David Ahern Feb. 5, 2019, 3:13 a.m. UTC | #8
On 2/4/19 3:53 AM, Jesper Dangaard Brouer wrote:
> On Sat, 2 Feb 2019 14:27:26 -0700
> David Ahern <dsahern@gmail.com> wrote:
> 
>> On 1/31/19 1:15 PM, Jesper Dangaard Brouer wrote:
>>>>
>>>> David, Jesper, care to chime in where we ended up in that last thread
>>>> discussion this?  
>>>
>>> IHMO packets RX and TX on a device need to be accounted, in standard
>>> counters, regardless of XDP.  For XDP RX the packet is counted as RX,
>>> regardless if XDP choose to XDP_DROP.  On XDP TX which is via
>>> XDP_REDIRECT or XDP_TX, the driver that transmit the packet need to
>>> account the packet in a TX counter (this if often delayed to DMA TX
>>> completion handling).  We cannot break the expectation that RX and TX
>>> counter are visible to userspace stats tools. XDP should not make these
>>> packets invisible.  
>>
>> Agreed. What I was pushing on that last thread was Rx, Tx and dropped
>> are all accounted by the driver in standard stats. Basically if the
>> driver touched it, the driver's counters should indicate that.
> 
> Sound like we all agree (except with the dropped counter, see below).
> 
> Do notice that mlx5 driver doesn't do this.  It is actually rather
> confusing to use XDP on mlx5, as when XDP "consume" which include
> XDP_DROP, XDP_REDIRECT or XDP_TX, then the driver standard stats are
> not incremented... the packet is invisible to "ifconfig" stat based
> tools.

mlx5 needs some work. As I recall it still has the bug/panic removing
xdp programs - at least I don't recall seeing a patch for it.

> 
> 
>> The push back was on dropped packets and whether that counter should be
>> bumped on XDP_DROP.
> 
> My opinion is the XDP_DROP action should NOT increment the drivers drop
> counter.  First of all the "dropped" counter is also use for other
> stuff, which will confuse that this counter express.  Second, choosing
> XDP_DROP is a policy choice, it still means it was RX-ed at the driver
> level.
> 

Understood. Hopefully in March I will get some time to come back to this
and propose an idea on what I would like to see - namely, the admin has
a config option at load time to enable driver counters versus custom map
counters. (meaning the operator of the node chooses standard stats over
strict performance.) But of course that means the drivers have the code
to collect those stats.
Saeed Mahameed Feb. 6, 2019, 12:06 a.m. UTC | #9
On Mon, 2019-02-04 at 19:13 -0800, David Ahern wrote:
> On 2/4/19 3:53 AM, Jesper Dangaard Brouer wrote:
> > On Sat, 2 Feb 2019 14:27:26 -0700
> > David Ahern <dsahern@gmail.com> wrote:
> > 
> > > On 1/31/19 1:15 PM, Jesper Dangaard Brouer wrote:
> > > > > David, Jesper, care to chime in where we ended up in that
> > > > > last thread
> > > > > discussion this?  
> > > > 
> > > > IHMO packets RX and TX on a device need to be accounted, in
> > > > standard
> > > > counters, regardless of XDP.  For XDP RX the packet is counted
> > > > as RX,
> > > > regardless if XDP choose to XDP_DROP.  On XDP TX which is via
> > > > XDP_REDIRECT or XDP_TX, the driver that transmit the packet
> > > > need to
> > > > account the packet in a TX counter (this if often delayed to
> > > > DMA TX
> > > > completion handling).  We cannot break the expectation that RX
> > > > and TX
> > > > counter are visible to userspace stats tools. XDP should not
> > > > make these
> > > > packets invisible.  
> > > 
> > > Agreed. What I was pushing on that last thread was Rx, Tx and
> > > dropped
> > > are all accounted by the driver in standard stats. Basically if
> > > the
> > > driver touched it, the driver's counters should indicate that.
> > 
> > Sound like we all agree (except with the dropped counter, see
> > below).
> > 
> > Do notice that mlx5 driver doesn't do this.  It is actually rather
> > confusing to use XDP on mlx5, as when XDP "consume" which include
> > XDP_DROP, XDP_REDIRECT or XDP_TX, then the driver standard stats
> > are
> > not incremented... the packet is invisible to "ifconfig" stat based
> > tools.
> 
> mlx5 needs some work. As I recall it still has the bug/panic removing
> xdp programs - at least I don't recall seeing a patch for it.

Only when xdp_redirect to mlx5, and removing the program while redirect
is happening, this is actually due to a lack of synchronization means
between different drivers, we have some ideas to overcome this using a
standard XDP API, or just use a hack in mlx5 driver which i don't like:

https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=topic/xdp-redirect-fix&id=a3652d03cc35fd3ad62744986c8ccaca74c9f20c

I will be working on this towards the end of this week.

> 
> > 
> > > The push back was on dropped packets and whether that counter
> > > should be
> > > bumped on XDP_DROP.
> > 
> > My opinion is the XDP_DROP action should NOT increment the drivers
> > drop
> > counter.  First of all the "dropped" counter is also use for other
> > stuff, which will confuse that this counter express.  Second,
> > choosing
> > XDP_DROP is a policy choice, it still means it was RX-ed at the
> > driver
> > level.
> > 
> 
> Understood. Hopefully in March I will get some time to come back to
> this
> and propose an idea on what I would like to see - namely, the admin
> has
> a config option at load time to enable driver counters versus custom
> map
> counters. (meaning the operator of the node chooses standard stats
> over
> strict performance.) But of course that means the drivers have the
> code
> to collect those stats.

So bottom line:
1) Driver will count rx packets as rx-ed packets regardless of XDP
decision.

2) Driver should keep track of XDP decisions statistics, report them in
ethtool and in the new API suggested by David. track even (XDP_PASS) ?

Maybe instead of having all drivers track the statistics on their own,
we should move the responsibility to upper layer.

Idea: since we already have rxq_info structure per XDP ring (no false
sharing) and available per xdp_buff we can do:

+++ b/include/linux/filter.h
@@ -651,7 +651,9 @@ static __always_inline u32 bpf_prog_run_xdp(const
struct bpf_prog *prog,
         * already takes rcu_read_lock() when fetching the program, so
         * it's not necessary here anymore.
         */
-       return BPF_PROG_RUN(prog, xdp);
+       u32 ret = BPF_PROG_RUN(prog, xdp);
+       xdp->xdp_rxq_info.stats[ret]++
+       return ret;
 }

still we need a way (API) to report the rxq_info to whoever needs to
read current XDP stats 

3) Unrelated, In non XDP case, if skb allocation fails or driver fails
to pass the skb up to the stack for somereason, should the driver
increase rx packets ? IMHO the answer should be yes if we want to have
similar behavior between XDP and non XDP cases.

But this could result in netdev->stats.rx_packets + netdev-
>stats.rx_dropped to be more than the actual rx-ed packets, is this
acceptable ?
Jesper Dangaard Brouer Feb. 6, 2019, 1:48 p.m. UTC | #10
On Wed, 6 Feb 2019 00:06:33 +0000
Saeed Mahameed <saeedm@mellanox.com> wrote:

> 3) Unrelated, In non XDP case, if skb allocation fails or driver fails
> to pass the skb up to the stack for somereason, should the driver
> increase rx packets ? IMHO the answer should be yes if we want to have
> similar behavior between XDP and non XDP cases.

I don't think "skb allocation fails" should increase rx packets
counter.  The difference is that these events are outside sysadm/users
control, and is an error detected inside the driver.  The XDP program
takes a policy choice to XDP_DROP a packet, which can be accounted
inside the XDP prog (as the samples show) or as we also discuss via a
more generic XDP-action counters.

That said, I took at quick look at driver code, and it seems this
behavior differs per driver... ixgbe and mlx5 does not count "skb
allocation fails" as RX-ed packets, while mlx4 seems to count them.


> But this could result in netdev->stats.rx_packets +
> netdev->stats.rx_dropped to be more than the actual rx-ed packets, is
> this acceptable ?

This is one reasons I think this is wrong.

--Jesper
Jakub Kicinski Feb. 6, 2019, 3:52 p.m. UTC | #11
On Wed, 6 Feb 2019 14:48:14 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 6 Feb 2019 00:06:33 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > 3) Unrelated, In non XDP case, if skb allocation fails or driver fails
> > to pass the skb up to the stack for somereason, should the driver
> > increase rx packets ? IMHO the answer should be yes if we want to have
> > similar behavior between XDP and non XDP cases.  
> 
> I don't think "skb allocation fails" should increase rx packets
> counter.  The difference is that these events are outside sysadm/users
> control, and is an error detected inside the driver.  The XDP program
> takes a policy choice to XDP_DROP a packet, which can be accounted
> inside the XDP prog (as the samples show) or as we also discuss via a
> more generic XDP-action counters.

FWIW that's my understanding as well.  My understanding of Linux stats
is that they are incrementing one counter per packet.  I.e. in RX
direction success packets are those given to the stack, and for TX
those given to the hardware.  Standards (IETF/IEEE) usually count stats
on the same layer boundary, but I think software generally counts when
it's done with the packet.

I haven't seen it documented anywhere, yet.  I have tried to document
it in the docs of the recent RFC:
https://patchwork.ozlabs.org/patch/1032332/

Incidentally XDP_DROP may have been better named XDP_DISCARD from stats
perspective ;)
Jesper Dangaard Brouer Feb. 7, 2019, 7:48 a.m. UTC | #12
On Wed, 6 Feb 2019 00:06:33 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote:

> On Mon, 2019-02-04 at 19:13 -0800, David Ahern wrote:
[...]
> > 
> > mlx5 needs some work. As I recall it still has the bug/panic
> > removing xdp programs - at least I don't recall seeing a patch for
> > it.  
> 
> Only when xdp_redirect to mlx5, and removing the program while
> redirect is happening, this is actually due to a lack of
> synchronization means between different drivers, we have some ideas
> to overcome this using a standard XDP API, or just use a hack in mlx5
> driver which i don't like:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=topic/xdp-redirect-fix&id=a3652d03cc35fd3ad62744986c8ccaca74c9f20c
> 
> I will be working on this towards the end of this week.

Toke and I have been discussing how to solve this.

The main idea for fixing this is to tie resource allocation to interface
insertion into interface maps (kernel/bpf/devmap.c). As the =devmap=
already have the needed synchronisation mechanisms and steps for safely
adding and removing =net_devices= (e.g. stopping RX side, flushing
remaining frames, waiting RCU period before freeing objects, etc.)

As described here:
 https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#better-ndo_xdp_xmit-resource-management

--Jesper
Saeed Mahameed Feb. 7, 2019, 7:08 p.m. UTC | #13
On Thu, 2019-02-07 at 08:48 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 6 Feb 2019 00:06:33 +0000 Saeed Mahameed <saeedm@mellanox.com
> > wrote:
> 
> > On Mon, 2019-02-04 at 19:13 -0800, David Ahern wrote:
> [...]
> > > mlx5 needs some work. As I recall it still has the bug/panic
> > > removing xdp programs - at least I don't recall seeing a patch
> > > for
> > > it.  
> > 
> > Only when xdp_redirect to mlx5, and removing the program while
> > redirect is happening, this is actually due to a lack of
> > synchronization means between different drivers, we have some ideas
> > to overcome this using a standard XDP API, or just use a hack in
> > mlx5
> > driver which i don't like:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=topic/xdp-redirect-fix&id=a3652d03cc35fd3ad62744986c8ccaca74c9f20c
> > 
> > I will be working on this towards the end of this week.
> 
> Toke and I have been discussing how to solve this.
> 
> The main idea for fixing this is to tie resource allocation to
> interface
> insertion into interface maps (kernel/bpf/devmap.c). As the =devmap=
> already have the needed synchronisation mechanisms and steps for
> safely
> adding and removing =net_devices= (e.g. stopping RX side, flushing
> remaining frames, waiting RCU period before freeing objects, etc.)
> 
> As described here:
>  
> https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#better-ndo_xdp_xmit-resource-management
> 
> --Jesper

Yes you already suggested this approach @LPC:

So 
1) on dev_map_update_elem() we will call
dev->dev->ndo_bpf() to notify the device on the intention to start/stop
redirect, and wait for it to create/destroy the HW resources
before/after actually updating the map

But:
2) this won't totally solve our problem, since sometimes the driver can
decide to recreate (change of configuration) hw resources on the fly
while redirect/devmap is already happening, so we need some kind of a
dev_map_notification or a flag with rcu synch, for when the driver want
to make the xdp redirect resources unavailable.

Thanks,
Saeed.
Jesper Dangaard Brouer Feb. 8, 2019, 4:02 p.m. UTC | #14
On Wed, 6 Feb 2019 00:06:33 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote:

> 2) Driver should keep track of XDP decisions statistics, report them in
> ethtool and in the new API suggested by David. track even (XDP_PASS) ?
> 
> Maybe instead of having all drivers track the statistics on their own,
> we should move the responsibility to upper layer.
> 
> Idea: since we already have rxq_info structure per XDP ring (no false
> sharing) and available per xdp_buff we can do:
> 
> +++ b/include/linux/filter.h
> @@ -651,7 +651,9 @@ static __always_inline u32 bpf_prog_run_xdp(const
> struct bpf_prog *prog,
>          * already takes rcu_read_lock() when fetching the program, so
>          * it's not necessary here anymore.
>          */
> -       return BPF_PROG_RUN(prog, xdp);
> +       u32 ret = BPF_PROG_RUN(prog, xdp);
> +       xdp->xdp_rxq_info.stats[ret]++
> +       return ret;
>  }
> 
> still we need a way (API) to report the rxq_info to whoever needs to
> read current XDP stats 

I'm capturing this as tasks under the XDP-project github page:
 https://github.com/xdp-project/xdp-project/pull/13/files
Toke Høiland-Jørgensen Feb. 8, 2019, 4:55 p.m. UTC | #15
Saeed Mahameed <saeedm@mellanox.com> writes:

> But:
> 2) this won't totally solve our problem, since sometimes the driver can
> decide to recreate (change of configuration) hw resources on the fly
> while redirect/devmap is already happening, so we need some kind of a
> dev_map_notification or a flag with rcu synch, for when the driver want
> to make the xdp redirect resources unavailable.

Good point, I'll make a note of this. Do you have a pointer to where the
mlx5 driver does this kind of change currently?

-Toke
Saeed Mahameed Feb. 8, 2019, 10:49 p.m. UTC | #16
On Fri, 2019-02-08 at 17:55 +0100, Toke Høiland-Jørgensen wrote:
> Saeed Mahameed <saeedm@mellanox.com> writes:
> 
> > But:
> > 2) this won't totally solve our problem, since sometimes the driver
> > can
> > decide to recreate (change of configuration) hw resources on the
> > fly
> > while redirect/devmap is already happening, so we need some kind of
> > a
> > dev_map_notification or a flag with rcu synch, for when the driver
> > want
> > to make the xdp redirect resources unavailable.
> 
> Good point, I'll make a note of this. Do you have a pointer to where
> the
> mlx5 driver does this kind of change currently?
> 

example:
ethtool -L to reduce/increase the number of rings
e.g. @mlx5e_ethtool_set_ringparam
or virtually anywhere mlx5e_switch_priv_channels is called when xdp
prog redirect is attached to mlx5.

> -Toke
Saeed Mahameed Feb. 8, 2019, 11:17 p.m. UTC | #17
On Thu, 2019-02-07 at 19:08 +0000, Saeed Mahameed wrote:
> On Thu, 2019-02-07 at 08:48 +0100, Jesper Dangaard Brouer wrote:
> > On Wed, 6 Feb 2019 00:06:33 +0000 Saeed Mahameed <
> > saeedm@mellanox.com
> > > wrote:
> > > On Mon, 2019-02-04 at 19:13 -0800, David Ahern wrote:
> > [...]
> > > > mlx5 needs some work. As I recall it still has the bug/panic
> > > > removing xdp programs - at least I don't recall seeing a patch
> > > > for
> > > > it.  
> > > 
> > > Only when xdp_redirect to mlx5, and removing the program while
> > > redirect is happening, this is actually due to a lack of
> > > synchronization means between different drivers, we have some
> > > ideas
> > > to overcome this using a standard XDP API, or just use a hack in
> > > mlx5
> > > driver which i don't like:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=topic/xdp-redirect-fix&id=a3652d03cc35fd3ad62744986c8ccaca74c9f20c
> > > 
> > > I will be working on this towards the end of this week.
> > 
> > Toke and I have been discussing how to solve this.
> > 
> > The main idea for fixing this is to tie resource allocation to
> > interface
> > insertion into interface maps (kernel/bpf/devmap.c). As the
> > =devmap=
> > already have the needed synchronisation mechanisms and steps for
> > safely
> > adding and removing =net_devices= (e.g. stopping RX side, flushing
> > remaining frames, waiting RCU period before freeing objects, etc.)
> > 
> > As described here:
> >  
> > https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#better-ndo_xdp_xmit-resource-management
> > 
> > --Jesper
> 
> Yes you already suggested this approach @LPC:
> 
> So 
> 1) on dev_map_update_elem() we will call
> dev->dev->ndo_bpf() to notify the device on the intention to
> start/stop
> redirect, and wait for it to create/destroy the HW resources
> before/after actually updating the map
> 

silly me, dev_map_update_elem must be atomic, we can't hook driver
resource allocation to it, it must come as a separate request (syscall)
from user space to request to create XDP redirect resources.


> But:
> 2) this won't totally solve our problem, since sometimes the driver
> can
> decide to recreate (change of configuration) hw resources on the fly
> while redirect/devmap is already happening, so we need some kind of a
> dev_map_notification or a flag with rcu synch, for when the driver
> want
> to make the xdp redirect resources unavailable.
> 

I will focus on this problem first, then figure out how to create XDP
redirect resources without actullay attaching a dummy xdp program.

> Thanks,
> Saeed.
Saeed Mahameed Feb. 9, 2019, 12:18 a.m. UTC | #18
On Fri, 2019-02-08 at 15:17 -0800, Saeed Mahameed wrote:
> On Thu, 2019-02-07 at 19:08 +0000, Saeed Mahameed wrote:
> > 
> > So 
> > 1) on dev_map_update_elem() we will call
> > dev->dev->ndo_bpf() to notify the device on the intention to
> > start/stop
> > redirect, and wait for it to create/destroy the HW resources
> > before/after actually updating the map
> > 
> 
> silly me, dev_map_update_elem must be atomic, we can't hook driver
> resource allocation to it, it must come as a separate request
> (syscall)
> from user space to request to create XDP redirect resources.
> 

Well, it is possible to render dev_map_update_elem non-atomic and fail
BPF programs who try to update it in the verifier
check_map_func_compatibility.

if you know of any case where devmap needs to be updated from the BPF
program please let me know.
Jakub Kicinski Feb. 9, 2019, 2:05 a.m. UTC | #19
On Sat, 9 Feb 2019 00:18:31 +0000, Saeed Mahameed wrote:
> On Fri, 2019-02-08 at 15:17 -0800, Saeed Mahameed wrote:
> > On Thu, 2019-02-07 at 19:08 +0000, Saeed Mahameed wrote:  
> > > 
> > > So 
> > > 1) on dev_map_update_elem() we will call
> > > dev->dev->ndo_bpf() to notify the device on the intention to
> > > start/stop
> > > redirect, and wait for it to create/destroy the HW resources
> > > before/after actually updating the map
> > >   
> > 
> > silly me, dev_map_update_elem must be atomic, we can't hook driver
> > resource allocation to it, it must come as a separate request
> > (syscall)
> > from user space to request to create XDP redirect resources.
> >   
> 
> Well, it is possible to render dev_map_update_elem non-atomic and fail
> BPF programs who try to update it in the verifier
> check_map_func_compatibility.
> 
> if you know of any case where devmap needs to be updated from the BPF
> program please let me know.

Did we find a solution to non-map redirect?

Sorry if I missed the discussion, I couldn't make the iovisor call this
week due to travel.
Toke Høiland-Jørgensen Feb. 9, 2019, 4:56 p.m. UTC | #20
Saeed Mahameed <saeedm@mellanox.com> writes:

> On Fri, 2019-02-08 at 15:17 -0800, Saeed Mahameed wrote:
>> On Thu, 2019-02-07 at 19:08 +0000, Saeed Mahameed wrote:
>> > 
>> > So 
>> > 1) on dev_map_update_elem() we will call
>> > dev->dev->ndo_bpf() to notify the device on the intention to
>> > start/stop
>> > redirect, and wait for it to create/destroy the HW resources
>> > before/after actually updating the map
>> > 
>> 
>> silly me, dev_map_update_elem must be atomic, we can't hook driver
>> resource allocation to it, it must come as a separate request
>> (syscall)
>> from user space to request to create XDP redirect resources.
>> 
>
> Well, it is possible to render dev_map_update_elem non-atomic and fail
> BPF programs who try to update it in the verifier
> check_map_func_compatibility.
>
> if you know of any case where devmap needs to be updated from the BPF
> program please let me know.

Well, maybe? My plan for dealing with the non-map redirect variant is
essentially to have a hidden map that does just-in-time insertion of
ifindexes if they are not already there; and rework XDP_REDIRECT to use
that.

However, this would essentially be an insert from eBPF; but I guess
maybe it could be deferred until the RX-side driver exits its NAPI loop
(as we're buffering frames in the map anyway)?

-Toke
Toke Høiland-Jørgensen Feb. 9, 2019, 4:56 p.m. UTC | #21
Jakub Kicinski <jakub.kicinski@netronome.com> writes:

> On Sat, 9 Feb 2019 00:18:31 +0000, Saeed Mahameed wrote:
>> On Fri, 2019-02-08 at 15:17 -0800, Saeed Mahameed wrote:
>> > On Thu, 2019-02-07 at 19:08 +0000, Saeed Mahameed wrote:  
>> > > 
>> > > So 
>> > > 1) on dev_map_update_elem() we will call
>> > > dev->dev->ndo_bpf() to notify the device on the intention to
>> > > start/stop
>> > > redirect, and wait for it to create/destroy the HW resources
>> > > before/after actually updating the map
>> > >   
>> > 
>> > silly me, dev_map_update_elem must be atomic, we can't hook driver
>> > resource allocation to it, it must come as a separate request
>> > (syscall)
>> > from user space to request to create XDP redirect resources.
>> >   
>> 
>> Well, it is possible to render dev_map_update_elem non-atomic and fail
>> BPF programs who try to update it in the verifier
>> check_map_func_compatibility.
>> 
>> if you know of any case where devmap needs to be updated from the BPF
>> program please let me know.
>
> Did we find a solution to non-map redirect?

See my other reply to Saeed :)

-Toke
Toke Høiland-Jørgensen April 18, 2019, 2:24 p.m. UTC | #22
David Ahern <dsahern@gmail.com> writes:

> On 2/4/19 3:53 AM, Jesper Dangaard Brouer wrote:
>> On Sat, 2 Feb 2019 14:27:26 -0700
>> David Ahern <dsahern@gmail.com> wrote:
>> 
>>> On 1/31/19 1:15 PM, Jesper Dangaard Brouer wrote:
>>>>>
>>>>> David, Jesper, care to chime in where we ended up in that last thread
>>>>> discussion this?  
>>>>
>>>> IHMO packets RX and TX on a device need to be accounted, in standard
>>>> counters, regardless of XDP.  For XDP RX the packet is counted as RX,
>>>> regardless if XDP choose to XDP_DROP.  On XDP TX which is via
>>>> XDP_REDIRECT or XDP_TX, the driver that transmit the packet need to
>>>> account the packet in a TX counter (this if often delayed to DMA TX
>>>> completion handling).  We cannot break the expectation that RX and TX
>>>> counter are visible to userspace stats tools. XDP should not make these
>>>> packets invisible.  
>>>
>>> Agreed. What I was pushing on that last thread was Rx, Tx and dropped
>>> are all accounted by the driver in standard stats. Basically if the
>>> driver touched it, the driver's counters should indicate that.
>> 
>> Sound like we all agree (except with the dropped counter, see below).
>> 
>> Do notice that mlx5 driver doesn't do this.  It is actually rather
>> confusing to use XDP on mlx5, as when XDP "consume" which include
>> XDP_DROP, XDP_REDIRECT or XDP_TX, then the driver standard stats are
>> not incremented... the packet is invisible to "ifconfig" stat based
>> tools.
>
> mlx5 needs some work. As I recall it still has the bug/panic removing
> xdp programs - at least I don't recall seeing a patch for it.
>
>> 
>> 
>>> The push back was on dropped packets and whether that counter should be
>>> bumped on XDP_DROP.
>> 
>> My opinion is the XDP_DROP action should NOT increment the drivers drop
>> counter.  First of all the "dropped" counter is also use for other
>> stuff, which will confuse that this counter express.  Second, choosing
>> XDP_DROP is a policy choice, it still means it was RX-ed at the driver
>> level.
>> 
>
> Understood. Hopefully in March I will get some time to come back to this
> and propose an idea on what I would like to see - namely, the admin has
> a config option at load time to enable driver counters versus custom map
> counters. (meaning the operator of the node chooses standard stats over
> strict performance.) But of course that means the drivers have the code
> to collect those stats.

Hi David

I don't recall seeing any follow-up on this. Did you have a chance to
formulate your ideas? :)

-Toke
David Ahern April 21, 2019, 12:16 a.m. UTC | #23
On 4/18/19 8:24 AM, Toke Høiland-Jørgensen wrote:
>>>
>>
>> Understood. Hopefully in March I will get some time to come back to this
>> and propose an idea on what I would like to see - namely, the admin has
>> a config option at load time to enable driver counters versus custom map
>> counters. (meaning the operator of the node chooses standard stats over
>> strict performance.) But of course that means the drivers have the code
>> to collect those stats.
> 
> Hi David
> 
> I don't recall seeing any follow-up on this. Did you have a chance to
> formulate your ideas? :)
> 

Not yet. Almost done with the nexthop changes. Once that is out of the
way I can come back to this.
Toke Høiland-Jørgensen June 20, 2019, 8:42 p.m. UTC | #24
David Ahern <dsahern@gmail.com> writes:

> On 4/18/19 8:24 AM, Toke Høiland-Jørgensen wrote:
>>>>
>>>
>>> Understood. Hopefully in March I will get some time to come back to this
>>> and propose an idea on what I would like to see - namely, the admin has
>>> a config option at load time to enable driver counters versus custom map
>>> counters. (meaning the operator of the node chooses standard stats over
>>> strict performance.) But of course that means the drivers have the code
>>> to collect those stats.
>> 
>> Hi David
>> 
>> I don't recall seeing any follow-up on this. Did you have a chance to
>> formulate your ideas? :)
>> 
>
> Not yet. Almost done with the nexthop changes. Once that is out of the
> way I can come back to this.

Ping? :)

-Toke
David Ahern June 21, 2019, 12:42 a.m. UTC | #25
On 6/20/19 2:42 PM, Toke Høiland-Jørgensen wrote:
>>> I don't recall seeing any follow-up on this. Did you have a chance to
>>> formulate your ideas? :)
>>>
>>
>> Not yet. Almost done with the nexthop changes. Once that is out of the
>> way I can come back to this.
> 
> Ping? :)
> 

Definitely back to this after the July 4th holiday.
Toke Høiland-Jørgensen June 21, 2019, 1:57 p.m. UTC | #26
David Ahern <dsahern@gmail.com> writes:

> On 6/20/19 2:42 PM, Toke Høiland-Jørgensen wrote:
>>>> I don't recall seeing any follow-up on this. Did you have a chance to
>>>> formulate your ideas? :)
>>>>
>>>
>>> Not yet. Almost done with the nexthop changes. Once that is out of the
>>> way I can come back to this.
>> 
>> Ping? :)
>> 
>
> Definitely back to this after the July 4th holiday.

Awesome! I'll wait until then before bugging you again... ;)

-Toke
diff mbox series

Patch

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2594481..4cfceb7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -503,6 +503,8 @@  static int virtnet_xdp_xmit(struct net_device *dev,
 	struct bpf_prog *xdp_prog;
 	struct send_queue *sq;
 	unsigned int len;
+	int packets = 0;
+	int bytes = 0;
 	int drops = 0;
 	int kicks = 0;
 	int ret, err;
@@ -526,10 +528,18 @@  static int virtnet_xdp_xmit(struct net_device *dev,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		if (likely(is_xdp_frame(ptr)))
-			xdp_return_frame(ptr_to_xdp(ptr));
-		else
-			napi_consume_skb(ptr, false);
+		if (likely(is_xdp_frame(ptr))) {
+			struct xdp_frame *frame = ptr_to_xdp(ptr);
+
+			bytes += frame->len;
+			xdp_return_frame(frame);
+		} else {
+			struct sk_buff *skb = ptr;
+
+			bytes += skb->len;
+			napi_consume_skb(skb, false);
+		}
+		packets++;
 	}
 
 	for (i = 0; i < n; i++) {
@@ -549,6 +559,8 @@  static int virtnet_xdp_xmit(struct net_device *dev,
 	}
 out:
 	u64_stats_update_begin(&sq->stats.syncp);
+	sq->stats.bytes += bytes;
+	sq->stats.packets += packets;
 	sq->stats.xdp_tx += n;
 	sq->stats.xdp_tx_drops += drops;
 	sq->stats.kicks += kicks;