
[net-next,v3,2/3] e1000: add initial XDP support

Message ID 20160912221351.5610.29043.stgit@john-Precision-Tower-5810
State Changes Requested
Delegated to: Jeff Kirsher

Commit Message

John Fastabend Sept. 12, 2016, 10:13 p.m. UTC
From: Alexei Starovoitov <ast@fb.com>

This patch adds initial support for XDP on the e1000 driver. Note that the
e1000 driver does not support page recycling in general, which could be
added as a further improvement. However, the XDP_DROP case will recycle;
XDP_TX and XDP_PASS do not support recycling.

e1000 only supports a single tx queue at this time, so the queue is
shared between the XDP program and the Linux stack. It is possible for
an XDP program to starve the stack in this model.

The XDP program will drop packets on XDP_TX errors. This can occur
when the tx descriptors are exhausted. This behavior is the same
for both shared queue models like e1000 and dedicated tx queue
models used in multiqueue devices. However if both the stack and
XDP are transmitting packets it is perhaps more likely to occur in
the shared queue model. Further refinement to the XDP model may be
possible in the future.

I tested this patch running e1000 in a VM using KVM over a tap
device.

CC: William Tu <u9012063@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000.h      |    2 
 drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
 2 files changed, 175 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Sept. 12, 2016, 10:46 p.m. UTC | #1
On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> From: Alexei Starovoitov <ast@fb.com>

> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +				 u32 len,
> +				 struct net_device *netdev,
> +				 struct e1000_adapter *adapter)
> +{
> +	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +	struct e1000_hw *hw = &adapter->hw;
> +	struct e1000_tx_ring *tx_ring;
> +
> +	if (len > E1000_MAX_DATA_PER_TXD)
> +		return;
> +
> +	/* e1000 only support a single txq at the moment so the queue is being
> +	 * shared with stack. To support this requires locking to ensure the
> +	 * stack and XDP are not running at the same time. Devices with
> +	 * multiple queues should allocate a separate queue space.
> +	 */
> +	HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +	tx_ring = adapter->tx_ring;
> +
> +	if (E1000_DESC_UNUSED(tx_ring) < 2) {
> +		HARD_TX_UNLOCK(netdev, txq);
> +		return;
> +	}
> +
> +	if (netif_xmit_frozen_or_stopped(txq))
> +		return;
> +
> +	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +	netdev_sent_queue(netdev, len);
> +	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> +	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> +	mmiowb();
> +
> +	HARD_TX_UNLOCK(netdev, txq);
> +}


e1000_tx_map() is full of workarounds.

Have a look at last_tx_tso for example.

               /* Workaround for Controller erratum --
                 * descriptor for non-tso packet in a linear SKB that follows a
                 * tso gets written back prematurely before the data is fully
                 * DMA'd to the controller
                 */
                if (!skb->data_len && tx_ring->last_tx_tso &&
                    !skb_is_gso(skb)) {
                        tx_ring->last_tx_tso = false;
                        size -= 4;
                }

Look, this XDP_TX thing is hard to properly implement and test on
various NIC revisions.

Without proper queue management, high prio packets in qdisc won't be sent
if the NIC is under an RX -> XDP_TX flood.

Sounds like a horrible feature to me.
Alexei Starovoitov Sept. 12, 2016, 11:07 p.m. UTC | #2
On Mon, Sep 12, 2016 at 03:46:39PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> > From: Alexei Starovoitov <ast@fb.com>
> 
> > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> > +				 u32 len,
> > +				 struct net_device *netdev,
> > +				 struct e1000_adapter *adapter)
> > +{
> > +	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> > +	struct e1000_hw *hw = &adapter->hw;
> > +	struct e1000_tx_ring *tx_ring;
> > +
> > +	if (len > E1000_MAX_DATA_PER_TXD)
> > +		return;
> > +
> > +	/* e1000 only support a single txq at the moment so the queue is being
> > +	 * shared with stack. To support this requires locking to ensure the
> > +	 * stack and XDP are not running at the same time. Devices with
> > +	 * multiple queues should allocate a separate queue space.
> > +	 */
> > +	HARD_TX_LOCK(netdev, txq, smp_processor_id());
> > +
> > +	tx_ring = adapter->tx_ring;
> > +
> > +	if (E1000_DESC_UNUSED(tx_ring) < 2) {
> > +		HARD_TX_UNLOCK(netdev, txq);
> > +		return;
> > +	}
> > +
> > +	if (netif_xmit_frozen_or_stopped(txq))
> > +		return;
> > +
> > +	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> > +	netdev_sent_queue(netdev, len);
> > +	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> > +
> > +	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> > +	mmiowb();
> > +
> > +	HARD_TX_UNLOCK(netdev, txq);
> > +}
> 
> 
> e1000_tx_map() is full of workarounds.
> 
> Have a look at last_tx_tso for example.
> 
>                /* Workaround for Controller erratum --
>                  * descriptor for non-tso packet in a linear SKB that follows a
>                  * tso gets written back prematurely before the data is fully
>                  * DMA'd to the controller
>                  */
>                 if (!skb->data_len && tx_ring->last_tx_tso &&
>                     !skb_is_gso(skb)) {
>                         tx_ring->last_tx_tso = false;
>                         size -= 4;
>                 }
> 
> Look, this XDP_TX thing is hard to properly implement and test on
> various NIC revisions.

My reading is that these e1k workarounds are for old versions of the nic that
don't even exist anymore. I believe none of them apply to qemu e1k.

> Without proper queue management, high prio packets in qdisc wont be sent
> if NIC is under RX -> XDP_TX flood.

yep. there are various ways to shoot yourself in the foot with xdp.
The simplest program that drops all the packets will make the box unpingable.

> Sounds a horrible feature to me.

yep. not pretty by any means.
This whole thing is only to test xdp programs under qemu, which
realistically emulates only e1k.
I simply don't see any other options to test xdp under kvm.
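For reference, the kind of xdp program being developed and tested this way is
tiny. A minimal sketch in restricted C, following the usual samples/bpf
conventions (the file name, section name, and build command are illustrative,
not part of this patch):

/* xdp_test_kern.c - minimal XDP program sketch for testing in a KVM guest.
 * Build (illustrative): clang -O2 -target bpf -c xdp_test_kern.c -o xdp_test_kern.o
 */
#include <linux/bpf.h>

#ifndef SEC
#define SEC(NAME) __attribute__((section(NAME), used))
#endif

SEC("xdp")
int xdp_test_prog(struct xdp_md *ctx)
{
	/* Return XDP_DROP to exercise the recycle path, or XDP_TX to
	 * bounce the frame back out of the shared e1000 tx queue.
	 */
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";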
Eric Dumazet Sept. 12, 2016, 11:46 p.m. UTC | #3
On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:

> yep. there are various ways to shoot yourself in the foot with xdp.
> The simplest program that drops all the packets will make the box unpingable.

Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
scooter on 101 highway ;)

This XDP_TX thing was one of the XDP marketing points, but there is
absolutely no documentation on it warning users about possible
limitations/outcomes.

BTW, I am not sure the mlx4 implementation even works vs BQL:

mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
but tx completion will call netdev_tx_completed_queue() -> crash

Do we have one test to validate that an XDP_TX implementation is actually
correct?
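For context, BQL expects the bytes reported at transmit time and at completion
time to stay balanced per tx queue. A minimal sketch of that pairing
(netdev_tx_sent_queue()/netdev_tx_completed_queue() are the real helpers; the
surrounding driver functions are illustrative):

#include <linux/netdevice.h>

/* Illustrative transmit path: report queued bytes to BQL. */
static void drv_xmit_one(struct net_device *dev, unsigned int bytes)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

	/* ... post descriptor(s) to hardware ... */
	netdev_tx_sent_queue(txq, bytes);
}

/* Illustrative tx completion: report completed packets/bytes to BQL.
 * If frames were queued without netdev_tx_sent_queue() (as in the mlx4
 * XDP_TX path mentioned above), this accounting goes out of sync for
 * the queue.
 */
static void drv_tx_clean(struct net_device *dev, unsigned int pkts,
			 unsigned int bytes)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);

	netdev_tx_completed_queue(txq, pkts, bytes);
}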
Tom Herbert Sept. 13, 2016, 12:03 a.m. UTC | #4
On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>
>> yep. there are various ways to shoot yourself in the foot with xdp.
>> The simplest program that drops all the packets will make the box unpingable.
>
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
>
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.
>
> BTW, I am not sure mlx4 implementation even works, vs BQL :
>
> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash
>
> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?
>
Obviously not for e1000 :-(. We really need some real test and
performance results and analysis on the interaction between the stack
data path and the XDP data path. The fact that these changes are being
passed off as something only needed for KVM is irrelevant; e1000 is a
well deployed NIC and there's no restriction that I see that would
prevent any users from enabling this feature on real devices. So these
patches need to be tested and justified. Eric's skepticism in all this
really doesn't surprise me...

Tom

>
>
Alexei Starovoitov Sept. 13, 2016, 1:21 a.m. UTC | #5
On Mon, Sep 12, 2016 at 04:46:08PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> 
> > yep. there are various ways to shoot yourself in the foot with xdp.
> > The simplest program that drops all the packets will make the box unpingable.
> 
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
> 
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.

that's a fair critique. there is no documentation for xdp in general.

> BTW, I am not sure mlx4 implementation even works, vs BQL :

it doesn't and it shouldn't. xdp and stack use different tx queues.

> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash

not quite. netdev_tx_completed_queue() is not called for xdp queues.

> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?

it's correct in the scope that it was defined for.
There is no objective to share the same tx queue with the stack in xdp.
The queues must be absolutely separate otherwise performance will tank.

This e1k patch is really special, because the e1k has one tx queue,
but this is for debugging of the programs, so it's ok to cut corners.
The e1k xdp support doesn't need to interact nicely with stack.
It's impossible in the first place. xdp is the layer before the stack.
Alexei Starovoitov Sept. 13, 2016, 1:28 a.m. UTC | #6
On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> >
> >> yep. there are various ways to shoot yourself in the foot with xdp.
> >> The simplest program that drops all the packets will make the box unpingable.
> >
> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> > scooter on 101 highway ;)
> >
> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> > absolutely no documentation on it, warning users about possible
> > limitations/outcomes.
> >
> > BTW, I am not sure mlx4 implementation even works, vs BQL :
> >
> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> > but tx completion will call netdev_tx_completed_queue() -> crash
> >
> > Do we have one test to validate that a XDP_TX implementation is actually
> > correct ?
> >
> Obviously not for e1000 :-(. We really need some real test and
> performance results and analysis on the interaction between the stack
> data path and XDP data path.

no. we don't need it for e1k and we cannot really do it.
<broken record mode on> this patch is for debugging of xdp programs only.

> The fact that these changes are being
> passed of as something only needed for KCM is irrelevant, e1000 is a
> well deployed a NIC and there's no restriction that I see that would
> prevent any users from enabling this feature on real devices.

e1k is not even manufactured any more. Probably the only place
where it can be found is the computer history museum.
e1000e fares slightly better, but it's a different nic and this
patch is not about it.
Alexander Duyck Sept. 13, 2016, 3:42 a.m. UTC | #7
On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> From: Alexei Starovoitov <ast@fb.com>
>
> This patch adds initial support for XDP on e1000 driver. Note e1000
> driver does not support page recycling in general which could be
> added as a further improvement. However XDP_DROP case will recycle.
> XDP_TX and XDP_PASS do not support recycling.
>
> e1000 only supports a single tx queue at this time so the queue
> is shared between xdp program and Linux stack. It is possible for
> an XDP program to starve the stack in this model.
>
> The XDP program will drop packets on XDP_TX errors. This can occur
> when the tx descriptors are exhausted. This behavior is the same
> for both shared queue models like e1000 and dedicated tx queue
> models used in multiqueue devices. However if both the stack and
> XDP are transmitting packets it is perhaps more likely to occur in
> the shared queue model. Further refinement to the XDP model may be
> possible in the future.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device.
>
> CC: William Tu <u9012063@gmail.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |    2
>  drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
>  2 files changed, 175 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index d7bdea7..5cf8a0a 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -150,6 +150,7 @@ struct e1000_adapter;
>   */
>  struct e1000_tx_buffer {
>         struct sk_buff *skb;
> +       struct page *page;
>         dma_addr_t dma;
>         unsigned long time_stamp;
>         u16 length;

I'm not really a huge fan of adding yet another member to this
structure.  Each e1000_tx_buffer is already pretty big at 40 bytes,
pushing it to 48 just means we lose that much more memory.  If nothing
else we may want to look at doing something like creating a union
between the skb, page, and an unsigned long.  Then you could use the
lowest bit of the address as a flag indicating if this is a skb or a
page.
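A minimal sketch of that union idea (the tag bit and helpers are hypothetical,
not existing e1000 code); both pointers are at least word aligned, so the low
bit is free to carry the skb-vs-page flag:

#include <linux/skbuff.h>

struct e1000_tx_buffer {
	union {
		struct sk_buff *skb;
		struct page *page;
		unsigned long data;	/* tagged pointer view */
	};
	dma_addr_t dma;
	unsigned long time_stamp;
	u16 length;
	/* ... remaining fields unchanged ... */
};

#define E1000_TX_FLAG_PAGE	0x1UL	/* hypothetical tag bit */

static inline bool e1000_tx_buffer_is_page(const struct e1000_tx_buffer *b)
{
	return b->data & E1000_TX_FLAG_PAGE;
}

static inline struct page *e1000_tx_buffer_page(const struct e1000_tx_buffer *b)
{
	return (struct page *)(b->data & ~E1000_TX_FLAG_PAGE);
}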

> @@ -279,6 +280,7 @@ struct e1000_adapter {
>                              struct e1000_rx_ring *rx_ring,
>                              int cleaned_count);
>         struct e1000_rx_ring *rx_ring;      /* One per active queue */
> +       struct bpf_prog *prog;
>         struct napi_struct napi;
>
>         int num_tx_queues;
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 62a7f8d..232b927 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -32,6 +32,7 @@
>  #include <linux/prefetch.h>
>  #include <linux/bitops.h>
>  #include <linux/if_vlan.h>
> +#include <linux/bpf.h>
>
>  char e1000_driver_name[] = "e1000";
>  static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
> @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
>         return 0;
>  }
>
> +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
> +{
> +       struct e1000_adapter *adapter = netdev_priv(netdev);
> +       struct bpf_prog *old_prog;
> +
> +       old_prog = xchg(&adapter->prog, prog);
> +       if (old_prog) {
> +               synchronize_net();
> +               bpf_prog_put(old_prog);
> +       }
> +
> +       if (netif_running(netdev))
> +               e1000_reinit_locked(adapter);
> +       else
> +               e1000_reset(adapter);

What is the point of the reset?  If the interface isn't running is
there anything in the hardware you actually need to clean up?

> +       return 0;
> +}
> +
> +static bool e1000_xdp_attached(struct net_device *dev)
> +{
> +       struct e1000_adapter *priv = netdev_priv(dev);
> +
> +       return !!priv->prog;
> +}
> +
> +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> +{
> +       switch (xdp->command) {
> +       case XDP_SETUP_PROG:
> +               return e1000_xdp_set(dev, xdp->prog);
> +       case XDP_QUERY_PROG:
> +               xdp->prog_attached = e1000_xdp_attached(dev);
> +               return 0;
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
>  static const struct net_device_ops e1000_netdev_ops = {
>         .ndo_open               = e1000_open,
>         .ndo_stop               = e1000_close,
> @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
>  #endif
>         .ndo_fix_features       = e1000_fix_features,
>         .ndo_set_features       = e1000_set_features,
> +       .ndo_xdp                = e1000_xdp,
>  };
>
>  /**
> @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
>         e1000_down_and_stop(adapter);
>         e1000_release_manageability(adapter);
>
> +       if (adapter->prog)
> +               bpf_prog_put(adapter->prog);
> +
>         unregister_netdev(netdev);
>
>         e1000_phy_hw_reset(hw);
> @@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
>         struct e1000_hw *hw = &adapter->hw;
>         u32 rdlen, rctl, rxcsum;
>
> -       if (adapter->netdev->mtu > ETH_DATA_LEN) {
> +       if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
>                 rdlen = adapter->rx_ring[0].count *
>                         sizeof(struct e1000_rx_desc);
>                 adapter->clean_rx = e1000_clean_jumbo_rx_irq;

If you are really serious about using the page-based Rx path we should
probably fix the fact that you take a pretty significant performance
penalty for turning this mode on.

> @@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
>                 dev_kfree_skb_any(buffer_info->skb);
>                 buffer_info->skb = NULL;
>         }
> +       if (buffer_info->page) {
> +               put_page(buffer_info->page);
> +               buffer_info->page = NULL;
> +       }
> +
>         buffer_info->time_stamp = 0;
>         /* buffer_info must be completely set up in the transmit path */
>  }
> @@ -3298,6 +3346,69 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
>         return NETDEV_TX_OK;
>  }
>
> +static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
> +                               struct e1000_rx_buffer *rx_buffer_info,
> +                               unsigned int len)
> +{
> +       struct e1000_tx_buffer *buffer_info;
> +       unsigned int i = tx_ring->next_to_use;
> +
> +       buffer_info = &tx_ring->buffer_info[i];
> +
> +       buffer_info->length = len;
> +       buffer_info->time_stamp = jiffies;
> +       buffer_info->mapped_as_page = false;
> +       buffer_info->dma = rx_buffer_info->dma;
> +       buffer_info->next_to_watch = i;
> +       buffer_info->page = rx_buffer_info->rxbuf.page;
> +
> +       tx_ring->buffer_info[i].skb = NULL;
> +       tx_ring->buffer_info[i].segs = 1;
> +       tx_ring->buffer_info[i].bytecount = len;
> +       tx_ring->buffer_info[i].next_to_watch = i;
> +
> +       rx_buffer_info->rxbuf.page = NULL;
> +}
> +
> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +                                u32 len,
> +                                struct net_device *netdev,
> +                                struct e1000_adapter *adapter)
> +{
> +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +       struct e1000_hw *hw = &adapter->hw;
> +       struct e1000_tx_ring *tx_ring;
> +
> +       if (len > E1000_MAX_DATA_PER_TXD)
> +               return;
> +
> +       /* e1000 only support a single txq at the moment so the queue is being
> +        * shared with stack. To support this requires locking to ensure the
> +        * stack and XDP are not running at the same time. Devices with
> +        * multiple queues should allocate a separate queue space.
> +        */
> +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +       tx_ring = adapter->tx_ring;
> +
> +       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> +               HARD_TX_UNLOCK(netdev, txq);
> +               return;
> +       }
> +
> +       if (netif_xmit_frozen_or_stopped(txq))
> +               return;
> +
> +       e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +       netdev_sent_queue(netdev, len);
> +       e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> +       writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> +       mmiowb();
> +
> +       HARD_TX_UNLOCK(netdev, txq);
> +}
> +
>  #define NUM_REGS 38 /* 1 based count */
>  static void e1000_regdump(struct e1000_adapter *adapter)
>  {
> @@ -4139,6 +4250,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
>         return skb;
>  }
>
> +static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
> +                                unsigned int length)
> +{
> +       struct xdp_buff xdp;
> +       int ret;
> +
> +       xdp.data = data;
> +       xdp.data_end = data + length;
> +       ret = BPF_PROG_RUN(prog, (void *)&xdp);
> +
> +       return ret;
> +}
> +
>  /**
>   * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
>   * @adapter: board private structure
> @@ -4157,12 +4281,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>         struct pci_dev *pdev = adapter->pdev;
>         struct e1000_rx_desc *rx_desc, *next_rxd;
>         struct e1000_rx_buffer *buffer_info, *next_buffer;
> +       struct bpf_prog *prog;
>         u32 length;
>         unsigned int i;
>         int cleaned_count = 0;
>         bool cleaned = false;
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>
> +       rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
> +       prog = READ_ONCE(adapter->prog);
>         i = rx_ring->next_to_clean;
>         rx_desc = E1000_RX_DESC(*rx_ring, i);
>         buffer_info = &rx_ring->buffer_info[i];
> @@ -4188,12 +4315,54 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>
>                 cleaned = true;
>                 cleaned_count++;
> +               length = le16_to_cpu(rx_desc->length);
> +
> +               if (prog) {
> +                       struct page *p = buffer_info->rxbuf.page;
> +                       dma_addr_t dma = buffer_info->dma;
> +                       int act;
> +
> +                       if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
> +                               /* attached bpf disallows larger than page
> +                                * packets, so this is hw error or corruption
> +                                */
> +                               pr_info_once("%s buggy !eop\n", netdev->name);
> +                               break;
> +                       }
> +                       if (unlikely(rx_ring->rx_skb_top)) {
> +                               pr_info_once("%s ring resizing bug\n",
> +                                            netdev->name);
> +                               break;
> +                       }
> +                       dma_sync_single_for_cpu(&pdev->dev, dma,
> +                                               length, DMA_FROM_DEVICE);
> +                       act = e1000_call_bpf(prog, page_address(p), length);
> +                       switch (act) {
> +                       case XDP_PASS:
> +                               break;
> +                       case XDP_TX:
> +                               dma_sync_single_for_device(&pdev->dev,
> +                                                          dma,
> +                                                          length,
> +                                                          DMA_TO_DEVICE);
> +                               e1000_xmit_raw_frame(buffer_info, length,
> +                                                    netdev, adapter);

Implementing a new xmit path and clean-up routines for just this is
going to be a pain.  I'd say if we are going to do something like this
then maybe we should look at coming up with a new ndo for the xmit and
maybe push more of this into some sort of inline hook.  Duplicating
this code in every driver is going to be really expensive.

Also I just noticed there is no break statement from the xmit code
above to the drop case below.  I'd think you could overwrite the frame
data in a case where the Rx exceeds the Tx due to things like flow
control generating back pressure.

> +                       case XDP_DROP:
> +                       default:
> +                               /* re-use mapped page. keep buffer_info->dma
> +                                * as-is, so that e1000_alloc_jumbo_rx_buffers
> +                                * only needs to put it back into rx ring
> +                                */
> +                               total_rx_bytes += length;
> +                               total_rx_packets++;
> +                               goto next_desc;
> +                       }
> +               }
> +
>                 dma_unmap_page(&pdev->dev, buffer_info->dma,
>                                adapter->rx_buffer_len, DMA_FROM_DEVICE);
>                 buffer_info->dma = 0;
>
> -               length = le16_to_cpu(rx_desc->length);
> -
>                 /* errors is only valid for DD + EOP descriptors */
>                 if (unlikely((status & E1000_RXD_STAT_EOP) &&
>                     (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
> @@ -4327,6 +4496,7 @@ next_desc:
>                 rx_desc = next_rxd;
>                 buffer_info = next_buffer;
>         }
> +       rcu_read_unlock();
>         rx_ring->next_to_clean = i;
>
>         cleaned_count = E1000_DESC_UNUSED(rx_ring);
>
Tom Herbert Sept. 13, 2016, 4:21 p.m. UTC | #8
On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
>> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>> >
>> >> yep. there are various ways to shoot yourself in the foot with xdp.
>> >> The simplest program that drops all the packets will make the box unpingable.
>> >
>> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
>> > scooter on 101 highway ;)
>> >
>> > This XDP_TX thing was one of the XDP marketing stuff, but there is
>> > absolutely no documentation on it, warning users about possible
>> > limitations/outcomes.
>> >
>> > BTW, I am not sure mlx4 implementation even works, vs BQL :
>> >
>> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
>> > but tx completion will call netdev_tx_completed_queue() -> crash
>> >
>> > Do we have one test to validate that a XDP_TX implementation is actually
>> > correct ?
>> >
>> Obviously not for e1000 :-(. We really need some real test and
>> performance results and analysis on the interaction between the stack
>> data path and XDP data path.
>
> no. we don't need it for e1k and we cannot really do it.
> <broken record mode on> this patch is for debugging of xdp programs only.
>
You can say this "only for debugging" a thousand times and that
still won't justify putting bad code into the kernel. Material issues
have been raised with these patches, I have proposed a fix for one
core issue, and we have requested a lot more testing. So, please, if
you really want to move these patches forward start addressing the
concerns being raised by reviewers.

Tom

>> The fact that these changes are being
>> passed of as something only needed for KCM is irrelevant, e1000 is a
>> well deployed a NIC and there's no restriction that I see that would
>> prevent any users from enabling this feature on real devices.
>
> e1k is not even manufactured any more. Probably the only place
> where it can be found is computer history museum.
> e1000e fairs slightly better, but it's a different nic and this
> patch is not about it.
>
Alexei Starovoitov Sept. 13, 2016, 4:56 p.m. UTC | #9
On Mon, Sep 12, 2016 at 08:42:41PM -0700, Alexander Duyck wrote:
> On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > From: Alexei Starovoitov <ast@fb.com>
> >
> > This patch adds initial support for XDP on e1000 driver. Note e1000
> > driver does not support page recycling in general which could be
> > added as a further improvement. However XDP_DROP case will recycle.
> > XDP_TX and XDP_PASS do not support recycling.
> >
> > e1000 only supports a single tx queue at this time so the queue
> > is shared between xdp program and Linux stack. It is possible for
> > an XDP program to starve the stack in this model.
> >
> > The XDP program will drop packets on XDP_TX errors. This can occur
> > when the tx descriptors are exhausted. This behavior is the same
> > for both shared queue models like e1000 and dedicated tx queue
> > models used in multiqueue devices. However if both the stack and
> > XDP are transmitting packets it is perhaps more likely to occur in
> > the shared queue model. Further refinement to the XDP model may be
> > possible in the future.
> >
> > I tested this patch running e1000 in a VM using KVM over a tap
> > device.
> >
> > CC: William Tu <u9012063@gmail.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ---
> >  drivers/net/ethernet/intel/e1000/e1000.h      |    2
> >  drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
> >  2 files changed, 175 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> > index d7bdea7..5cf8a0a 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000.h
> > +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> > @@ -150,6 +150,7 @@ struct e1000_adapter;
> >   */
> >  struct e1000_tx_buffer {
> >         struct sk_buff *skb;
> > +       struct page *page;
> >         dma_addr_t dma;
> >         unsigned long time_stamp;
> >         u16 length;
> 
> I'm not really a huge fan of adding yet another member to this
> structure.  Each e1000_tx_buffer is already pretty big at 40 bytes,
> pushing it to 48 just means we lose that much more memory.  If nothing
> else we may wan to look at doing something like creating a union
> between the skb, page, and an unsigned long.  Then you could use the
> lowest bit of the address as a flag indicating if this is a skb or a
> page.

That's exactly what we did for mlx4_en_tx_info, since it's a real nic
where performance matters. For e1k I didn't want to complicate
the logic for no reason, since I don't see how 8 extra bytes matter here.
We will take a look at whether a union is easy to do, though.

> > @@ -279,6 +280,7 @@ struct e1000_adapter {
> >                              struct e1000_rx_ring *rx_ring,
> >                              int cleaned_count);
> >         struct e1000_rx_ring *rx_ring;      /* One per active queue */
> > +       struct bpf_prog *prog;
> >         struct napi_struct napi;
> >
> >         int num_tx_queues;
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > index 62a7f8d..232b927 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/prefetch.h>
> >  #include <linux/bitops.h>
> >  #include <linux/if_vlan.h>
> > +#include <linux/bpf.h>
> >
> >  char e1000_driver_name[] = "e1000";
> >  static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
> > @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
> >         return 0;
> >  }
> >
> > +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
> > +{
> > +       struct e1000_adapter *adapter = netdev_priv(netdev);
> > +       struct bpf_prog *old_prog;
> > +
> > +       old_prog = xchg(&adapter->prog, prog);
> > +       if (old_prog) {
> > +               synchronize_net();
> > +               bpf_prog_put(old_prog);
> > +       }
> > +
> > +       if (netif_running(netdev))
> > +               e1000_reinit_locked(adapter);
> > +       else
> > +               e1000_reset(adapter);
> 
> What is the point of the reset?  If the interface isn't running is
> there anything in the hardware you actually need to cleanup?

The above is inspired by e1000_set_features().
I'm assuming it was done there for a reason and the same applies here.

> > +       return 0;
> > +}
> > +
> > +static bool e1000_xdp_attached(struct net_device *dev)
> > +{
> > +       struct e1000_adapter *priv = netdev_priv(dev);
> > +
> > +       return !!priv->prog;
> > +}
> > +
> > +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> > +{
> > +       switch (xdp->command) {
> > +       case XDP_SETUP_PROG:
> > +               return e1000_xdp_set(dev, xdp->prog);
> > +       case XDP_QUERY_PROG:
> > +               xdp->prog_attached = e1000_xdp_attached(dev);
> > +               return 0;
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> >  static const struct net_device_ops e1000_netdev_ops = {
> >         .ndo_open               = e1000_open,
> >         .ndo_stop               = e1000_close,
> > @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
> >  #endif
> >         .ndo_fix_features       = e1000_fix_features,
> >         .ndo_set_features       = e1000_set_features,
> > +       .ndo_xdp                = e1000_xdp,
> >  };
> >
> >  /**
> > @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
> >         e1000_down_and_stop(adapter);
> >         e1000_release_manageability(adapter);
> >
> > +       if (adapter->prog)
> > +               bpf_prog_put(adapter->prog);
> > +
> >         unregister_netdev(netdev);
> >
> >         e1000_phy_hw_reset(hw);
> > @@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
> >         struct e1000_hw *hw = &adapter->hw;
> >         u32 rdlen, rctl, rxcsum;
> >
> > -       if (adapter->netdev->mtu > ETH_DATA_LEN) {
> > +       if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
> >                 rdlen = adapter->rx_ring[0].count *
> >                         sizeof(struct e1000_rx_desc);
> >                 adapter->clean_rx = e1000_clean_jumbo_rx_irq;
> 
> If you are really serious about using the page based Rx path we should
> probably fix the fact that you take a pretty significant hit on
> performance penalty for turning this mode on.

Not sure I follow. KVM tests show that xdp_drop/tx is faster even with
full page alloc and no page recycling.
xdp is only operational in page-per-packet mode which dictates the above approach.

> > @@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
> >                 dev_kfree_skb_any(buffer_info->skb);
> >                 buffer_info->skb = NULL;
> >         }
> > +       if (buffer_info->page) {
> > +               put_page(buffer_info->page);
> > +               buffer_info->page = NULL;
> > +       }
> > +
> >         buffer_info->time_stamp = 0;
> >         /* buffer_info must be completely set up in the transmit path */
> >  }
> > @@ -3298,6 +3346,69 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
> >         return NETDEV_TX_OK;
> >  }
> >
> > +static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
> > +                               struct e1000_rx_buffer *rx_buffer_info,
> > +                               unsigned int len)
> > +{
> > +       struct e1000_tx_buffer *buffer_info;
> > +       unsigned int i = tx_ring->next_to_use;
> > +
> > +       buffer_info = &tx_ring->buffer_info[i];
> > +
> > +       buffer_info->length = len;
> > +       buffer_info->time_stamp = jiffies;
> > +       buffer_info->mapped_as_page = false;
> > +       buffer_info->dma = rx_buffer_info->dma;
> > +       buffer_info->next_to_watch = i;
> > +       buffer_info->page = rx_buffer_info->rxbuf.page;
> > +
> > +       tx_ring->buffer_info[i].skb = NULL;
> > +       tx_ring->buffer_info[i].segs = 1;
> > +       tx_ring->buffer_info[i].bytecount = len;
> > +       tx_ring->buffer_info[i].next_to_watch = i;
> > +
> > +       rx_buffer_info->rxbuf.page = NULL;
> > +}
> > +
> > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> > +                                u32 len,
> > +                                struct net_device *netdev,
> > +                                struct e1000_adapter *adapter)
> > +{
> > +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> > +       struct e1000_hw *hw = &adapter->hw;
> > +       struct e1000_tx_ring *tx_ring;
> > +
> > +       if (len > E1000_MAX_DATA_PER_TXD)
> > +               return;
> > +
> > +       /* e1000 only support a single txq at the moment so the queue is being
> > +        * shared with stack. To support this requires locking to ensure the
> > +        * stack and XDP are not running at the same time. Devices with
> > +        * multiple queues should allocate a separate queue space.
> > +        */
> > +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> > +
> > +       tx_ring = adapter->tx_ring;
> > +
> > +       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> > +               HARD_TX_UNLOCK(netdev, txq);
> > +               return;
> > +       }
> > +
> > +       if (netif_xmit_frozen_or_stopped(txq))
> > +               return;
> > +
> > +       e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> > +       netdev_sent_queue(netdev, len);
> > +       e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> > +
> > +       writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> > +       mmiowb();
> > +
> > +       HARD_TX_UNLOCK(netdev, txq);
> > +}
> > +
> >  #define NUM_REGS 38 /* 1 based count */
> >  static void e1000_regdump(struct e1000_adapter *adapter)
> >  {
> > @@ -4139,6 +4250,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
> >         return skb;
> >  }
> >
> > +static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
> > +                                unsigned int length)
> > +{
> > +       struct xdp_buff xdp;
> > +       int ret;
> > +
> > +       xdp.data = data;
> > +       xdp.data_end = data + length;
> > +       ret = BPF_PROG_RUN(prog, (void *)&xdp);
> > +
> > +       return ret;
> > +}
> > +
> >  /**
> >   * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
> >   * @adapter: board private structure
> > @@ -4157,12 +4281,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >         struct pci_dev *pdev = adapter->pdev;
> >         struct e1000_rx_desc *rx_desc, *next_rxd;
> >         struct e1000_rx_buffer *buffer_info, *next_buffer;
> > +       struct bpf_prog *prog;
> >         u32 length;
> >         unsigned int i;
> >         int cleaned_count = 0;
> >         bool cleaned = false;
> >         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> >
> > +       rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
> > +       prog = READ_ONCE(adapter->prog);
> >         i = rx_ring->next_to_clean;
> >         rx_desc = E1000_RX_DESC(*rx_ring, i);
> >         buffer_info = &rx_ring->buffer_info[i];
> > @@ -4188,12 +4315,54 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >
> >                 cleaned = true;
> >                 cleaned_count++;
> > +               length = le16_to_cpu(rx_desc->length);
> > +
> > +               if (prog) {
> > +                       struct page *p = buffer_info->rxbuf.page;
> > +                       dma_addr_t dma = buffer_info->dma;
> > +                       int act;
> > +
> > +                       if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
> > +                               /* attached bpf disallows larger than page
> > +                                * packets, so this is hw error or corruption
> > +                                */
> > +                               pr_info_once("%s buggy !eop\n", netdev->name);
> > +                               break;
> > +                       }
> > +                       if (unlikely(rx_ring->rx_skb_top)) {
> > +                               pr_info_once("%s ring resizing bug\n",
> > +                                            netdev->name);
> > +                               break;
> > +                       }
> > +                       dma_sync_single_for_cpu(&pdev->dev, dma,
> > +                                               length, DMA_FROM_DEVICE);
> > +                       act = e1000_call_bpf(prog, page_address(p), length);
> > +                       switch (act) {
> > +                       case XDP_PASS:
> > +                               break;
> > +                       case XDP_TX:
> > +                               dma_sync_single_for_device(&pdev->dev,
> > +                                                          dma,
> > +                                                          length,
> > +                                                          DMA_TO_DEVICE);
> > +                               e1000_xmit_raw_frame(buffer_info, length,
> > +                                                    netdev, adapter);
> 
> Implementing a new xmit path and clean-up routines for just this is
> going to be a pain.  I'd say if we are going to do something like this
> then maybe we should look at coming up with a new ndo for the xmit and
> maybe push more of this into some sort of inline hook.  Duplicating
> this code in every driver is going to be really expensive.

We will have common xdp routines when more drivers implement it.
I would expect several pieces of mlx4/mlx5 can be made common.
The ndo approach won't work here, since the stack doesn't call into this part.
The xdp logic stays within the driver, and dma/page things are driver specific too.
It's pretty much like trying to make a common struct between e1000_tx_buffer
and mlx4_en_tx_info, which is quite difficult.

> Also I just noticed there is no break statement from the xmit code
> above to the drop that below.  I'd think you could overwrite the frame
> data in a case where the Rx exceeds the Tx due to things like flow
> control generating back pressure.

You mean pause frames on the TX side?
Isn't the 'if (E1000_DESC_UNUSED(tx_ring) < 2) {' check enough in e1000_xmit_raw_frame()?

Thanks for the review!
Alexei Starovoitov Sept. 13, 2016, 5:13 p.m. UTC | #10
On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> >> >
> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
> >> >> The simplest program that drops all the packets will make the box unpingable.
> >> >
> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> >> > scooter on 101 highway ;)
> >> >
> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> >> > absolutely no documentation on it, warning users about possible
> >> > limitations/outcomes.
> >> >
> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
> >> >
> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> >> > but tx completion will call netdev_tx_completed_queue() -> crash
> >> >
> >> > Do we have one test to validate that a XDP_TX implementation is actually
> >> > correct ?
> >> >
> >> Obviously not for e1000 :-(. We really need some real test and
> >> performance results and analysis on the interaction between the stack
> >> data path and XDP data path.
> >
> > no. we don't need it for e1k and we cannot really do it.
> > <broken record mode on> this patch is for debugging of xdp programs only.
> >
> You can say this "only for a debugging" a thousand times and that
> still won't justify putting bad code into the kernel. Material issues
> have been raised with these patches, I have proposed a fix for one
> core issue, and we have requested a lot more testing. So, please, if
> you really want to move these patches forward start addressing the
> concerns being raised by reviewers.

I'm afraid the point 'only for debugging' still didn't make it across.
xdp+e1k is for development (and debugging) of xdp-type bpf
programs and _not_ for debugging of xdp itself, the kernel, or anything else.
The e1k provided interfaces and behavior need to match exactly
what real hw nics (like mlx4, mlx5, ixgbe, i40e) will do.
Doing special hacks is not acceptable. Therefore your
'proposed fix' misses the mark, since:
1. ignoring bql/qdisc is not a bug, but the requirement
2. such a 'fix' goes against the goal above since behaviors will be
different and an xdp developer won't be able to build something like
an xdp load balancer in kvm.

If you have other concerns please raise them, or if you have
suggestions on how to develop xdp programs without this e1k patch
I would love to hear them.
Alexander's review comments are discussed in a separate thread.
Eric Dumazet Sept. 13, 2016, 5:37 p.m. UTC | #11
On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:

> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement
> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
> 


Is e1k the only way a VM can receive and send packets?

Instead of adding more cruft to a legacy driver, risking breaking real
old machines, I am sure we can find a modern alternative.
Tom Herbert Sept. 13, 2016, 5:55 p.m. UTC | #12
On Tue, Sep 13, 2016 at 10:13 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
>> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
>> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>> >> >
>> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
>> >> >> The simplest program that drops all the packets will make the box unpingable.
>> >> >
>> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
>> >> > scooter on 101 highway ;)
>> >> >
>> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
>> >> > absolutely no documentation on it, warning users about possible
>> >> > limitations/outcomes.
>> >> >
>> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
>> >> >
>> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
>> >> > but tx completion will call netdev_tx_completed_queue() -> crash
>> >> >
>> >> > Do we have one test to validate that a XDP_TX implementation is actually
>> >> > correct ?
>> >> >
>> >> Obviously not for e1000 :-(. We really need some real test and
>> >> performance results and analysis on the interaction between the stack
>> >> data path and XDP data path.
>> >
>> > no. we don't need it for e1k and we cannot really do it.
>> > <broken record mode on> this patch is for debugging of xdp programs only.
>> >
>> You can say this "only for a debugging" a thousand times and that
>> still won't justify putting bad code into the kernel. Material issues
>> have been raised with these patches, I have proposed a fix for one
>> core issue, and we have requested a lot more testing. So, please, if
>> you really want to move these patches forward start addressing the
>> concerns being raised by reviewers.
>
> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement

You don't seem to understand the problem. In the shared queue scenario,
if one party (the stack) implements qdiscs, BQL, and such, and the
other (XDP) just throws packets onto the queue, then these are
incompatible behaviors and something will break. I suppose it's
possible that somehow this does not affect the stack path, but that
remains to be proven. In any case the patches under review look very
much like they break things; either a fix is needed or tests need to be
run to show it's not a problem. Until this is resolved I am going to
nack the patch.

Tom

> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
>
> If you have other concerns please raise them or if you have
> suggestions on how to develop xdp programs without this e1k patch
> I would love hear them.
> Alexander's review comments are discussed in separate thread.
>
Alexei Starovoitov Sept. 13, 2016, 5:59 p.m. UTC | #13
On Tue, Sep 13, 2016 at 10:37:32AM -0700, Eric Dumazet wrote:
> On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:
> 
> > I'm afraid the point 'only for debugging' still didn't make it across.
> > xdp+e1k is for development (and debugging) of xdp-type of bpf
> > programs and _not_ for debugging of xdp itself, kernel or anything else.
> > The e1k provided interfaces and behavior needs to match exactly
> > what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> > Doing special hacks are not acceptable. Therefore your
> > 'proposed fix' misses the mark, since:
> > 1. ignoring bql/qdisc is not a bug, but the requirement
> > 2. such 'fix' goes against the goal above since behaviors will be
> > different and xdp developer won't be able to build something like
> > xdp loadbalancer in the kvm.
> > 
> 
> 
> Is e1k the only way a VM can receive and send packets ?
> 
> Instead of adding more cruft to a legacy driver, risking breaking real
> old machines, 

agree that it is the concern.

> I am sure we can find modern alternative.

I've looked through qemu and it appears to only emulate e1k and tg3.
The latter is still used in the field, so the risk of touching
it is higher.
The other alternative is virtio, but it doesn't have dma and/or pages,
so it looks to me like an even messier hack.
The last alternative considered was to invent an xdp-only fake 'hw' nic,
but it's too much work to get it into qemu and then ask the world
to upgrade qemu.
At that point I ran out of ideas and settled on hacking e1k :(
Not proud of this hack at all.
Rustad, Mark D Sept. 13, 2016, 6:28 p.m. UTC | #14
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> I've looked through qemu and it appears only emulate e1k and tg3.
> The latter is still used in the field, so the risk of touching
> it is higher.

I have no idea what makes you think that e1k is *not* "used in the field".   
I grant you it probably is used more virtualized than not these days, but  
it certainly exists and is used. You can still buy them new at Newegg for  
goodness sakes!

Maybe I'll go home and plug in my old e100 into my machine that still has a  
PCI slot, just for old times sake. Oh darn, I have a SCSI card in that  
slot...

--
Mark Rustad, Networking Division, Intel Corporation
Alexei Starovoitov Sept. 13, 2016, 6:30 p.m. UTC | #15
On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> >I've looked through qemu and it appears only emulate e1k and tg3.
> >The latter is still used in the field, so the risk of touching
> >it is higher.
> 
> I have no idea what makes you think that e1k is *not* "used in the field".
> I grant you it probably is used more virtualized than not these days, but it
> certainly exists and is used. You can still buy them new at Newegg for
> goodness sakes!

The point is that it's only used virtualized, since PCI (not PCIe) has
long been dead.
Rustad, Mark D Sept. 13, 2016, 7:14 p.m. UTC | #16
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>>> I've looked through qemu and it appears only emulate e1k and tg3.
>>> The latter is still used in the field, so the risk of touching
>>> it is higher.
>>
>> I have no idea what makes you think that e1k is *not* "used in the field".
>> I grant you it probably is used more virtualized than not these days,  
>> but it
>> certainly exists and is used. You can still buy them new at Newegg for
>> goodness sakes!
>
> the point that it's only used virtualized, since PCI (not PCIE) have
> been long dead.

My point is precisely the opposite. It is a real device, it exists in real  
systems and it is used in those systems. I worked on embedded systems that  
ran Linux and used e1000 devices. I am sure they are still out there  
because customers are still paying for support of those systems.

Yes, PCI(-X) is absent from any current hardware and has been for some  
years now, but there is an installed base that continues. What part of that  
installed base updates software? I don't know, but I would not just assume  
that it is 0. I know that I updated the kernel on those embedded systems  
that I worked on when I was supporting them. Never at the bleeding edge,  
but generally hopping from one LTS kernel to another as needed.

The day is coming when all the motherboards with PCI(-X) will be gone, but  
I think it is still at least a few years off.

--
Mark Rustad, Networking Division, Intel Corporation
Alexei Starovoitov Sept. 13, 2016, 9:52 p.m. UTC | #17
On Tue, Sep 13, 2016 at 07:14:27PM +0000, Rustad, Mark D wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> >On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
> >>Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>
> >>>I've looked through qemu and it appears only emulate e1k and tg3.
> >>>The latter is still used in the field, so the risk of touching
> >>>it is higher.
> >>
> >>I have no idea what makes you think that e1k is *not* "used in the field".
> >>I grant you it probably is used more virtualized than not these days,
> >>but it
> >>certainly exists and is used. You can still buy them new at Newegg for
> >>goodness sakes!
> >
> >the point that it's only used virtualized, since PCI (not PCIE) have
> >been long dead.
> 
> My point is precisely the opposite. It is a real device, it exists in real
> systems and it is used in those systems. I worked on embedded systems that
> ran Linux and used e1000 devices. I am sure they are still out there because
> customers are still paying for support of those systems.
> 
> Yes, PCI(-X) is absent from any current hardware and has been for some years
> now, but there is an installed base that continues. What part of that
> installed base updates software? I don't know, but I would not just assume
> that it is 0. I know that I updated the kernel on those embedded systems
> that I worked on when I was supporting them. Never at the bleeding edge, but
> generally hopping from one LTS kernel to another as needed.

I suspect modern linux won't boot on such old pci-only systems for other
reasons not related to networking, since no one really cares to test kernels there.
So I think we mostly agree. There is a chance that this xdp e1k code will
find its way to such an old system. What are the chances those users will
be using xdp there? I think pretty close to zero.

The pci-e nics integrated into motherboards that pretend to be tg3
(though they're not at all built by broadcom) are significantly more common.
That's why I picked e1k instead of tg3.

Also note how this patch is not touching anything in the main e1k path
(because I don't have the hw to test and I suspect Intel's driver team
doesn't have it either), to make sure there is no breakage on those
old systems. I created a separate e1000_xmit_raw_frame() routine
instead of adding flags into e1000_xmit_frame() for the same reason:
to make sure there is no breakage.
Same reasoning for not doing a union of page/skb as Alexander suggested.
I wanted a minimal change to e1k that allows developing XDP programs in KVM
without affecting the e1k main path. If you see an actual bug in the patch,
please point out the line.
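For reference, the sort of XDP program meant by "developing XDP programs in KVM" above can be tiny. The following is only an illustrative sketch, not part of the patch or the thread; the file name and section name are made up, while struct xdp_md and the XDP_* return codes are the real uapi of this era:

/* xdp_drop_kvm.c: illustrative only.
 * Build with: clang -O2 -target bpf -c xdp_drop_kvm.c -o xdp_drop_kvm.o
 */
#include <linux/bpf.h>

#ifndef __section
#define __section(NAME) __attribute__((section(NAME), used))
#endif

/* Drop every frame. With this e1000 patch that exercises the XDP_DROP
 * path, which recycles the rx page; returning XDP_TX instead would
 * exercise e1000_xmit_raw_frame(), and XDP_PASS hands the frame to the
 * normal stack path.
 */
__section("prog")
int xdp_drop_all(struct xdp_md *ctx)
{
	return XDP_DROP;
}

/* License section is conventional for BPF objects, even with no helpers. */
char __license[] __section("license") = "GPL";

In a guest brought up the way the commit message describes (qemu/KVM with a tap backend and "-device e1000"), an iproute2 new enough to know about XDP could attach it with something like "ip link set dev eth0 xdp obj xdp_drop_kvm.o".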
Rustad, Mark D Sept. 13, 2016, 10:41 p.m. UTC | #18
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 13, 2016 at 07:14:27PM +0000, Rustad, Mark D wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>>> On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>
>>>>> I've looked through qemu and it appears only emulate e1k and tg3.
>>>>> The latter is still used in the field, so the risk of touching
>>>>> it is higher.
>>>>
>>>> I have no idea what makes you think that e1k is *not* "used in the  
>>>> field".
>>>> I grant you it probably is used more virtualized than not these days,
>>>> but it
>>>> certainly exists and is used. You can still buy them new at Newegg for
>>>> goodness sakes!
>>>
>>> the point that it's only used virtualized, since PCI (not PCIE) have
>>> been long dead.
>>
>> My point is precisely the opposite. It is a real device, it exists in real
>> systems and it is used in those systems. I worked on embedded systems that
>> ran Linux and used e1000 devices. I am sure they are still out there  
>> because
>> customers are still paying for support of those systems.
>>
>> Yes, PCI(-X) is absent from any current hardware and has been for some  
>> years
>> now, but there is an installed base that continues. What part of that
>> installed base updates software? I don't know, but I would not just assume
>> that it is 0. I know that I updated the kernel on those embedded systems
>> that I worked on when I was supporting them. Never at the bleeding edge,  
>> but
>> generally hopping from one LTS kernel to another as needed.
>
> I suspect modern linux won't boot on such old pci only systems for other
> reasons not related to networking, since no one really cares to test  
> kernels there.

Actually it does boot, because although the motherboard was PCIe, the slots  
and the adapters in them were PCI-X. So the core architecture was not so  
stale.

> So I think we mostly agree. There is chance that this xdp e1k code will
> find a way to that old system. What are the chances those users will
> be using xdp there? I think pretty close to zero.

For sure they wouldn't be using XDP, but they could suffer regressions in a  
changed driver that might find its way there. That is the risk.

> The pci-e nics integrated into motherboards that pretend to be tg3
> (though they're not at all build by broadcom) are significantly more  
> common.
> That's why I picked e1k instead of tg3.

That may be true (I really don't know anything about tg3 so I certainly  
can't dispute it), so the risk could be smaller with e1k, but there is  
still a regression risk for real existing hardware. That is my concern.

> Also note how this patch is not touching anything in the main e1k path
> (because I don't have a hw to test and I suspect Intel's driver team
> doesn't have it either) to make sure there is no breakage on those
> old systems. I created separate e1000_xmit_raw_frame() routine
> instead of adding flags into e1000_xmit_frame() for the same reasons:
> to make sure there is no breakage.
> Same reasoning for not doing an union of page/skb as Alexander suggested.
> I wanted minimal change to e1k that allows development xdp programs in kvm
> without affecting e1k main path. If you see the actual bug in the patch,
> please point out the line.

I can't say that I can, because I am not familiar with the internals of  
e1k. When I was using it, I never had cause to even look at the driver  
because it just worked. My attentions then were properly elsewhere.

My concern is with messing with a driver for which probably no one has an active  
testbed routinely running regression tests.

Maybe a new ID should be assigned and the driver forked for this purpose.  
At least then only the virtualization case would have to be tested. Of  
course the hosts would have to offer that ID, but if this is just for  
testing that should be ok, or at least possible.

If someone with more internal knowledge of this driver has enough  
confidence to bless such patches, that might be fine, but it is a fallacy  
to think that e1k is *only* a virtualization driver today. Not yet anyway.  
Maybe around 2020.

That said, I can see that you have tried to keep the original code path  
pretty much intact. I would note that you introduced rcu calls into the  
!bpf path that would never have been done before. While that should be ok,  
I would really like to see it tested, at least for the !bpf case, on real  
hardware to be sure. I really can't comment on the workaround issues brought  
up by Eric, because I just don't know about them. At least that risk seems  
to only be in the bpf case.

--
Mark Rustad, Networking Division, Intel Corporation
Francois Romieu Sept. 13, 2016, 11:17 p.m. UTC | #19
Rustad, Mark D <mark.d.rustad@intel.com> :
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
[...]
> > the point that it's only used virtualized, since PCI (not PCIE) have
> > been long dead.
> 
> My point is precisely the opposite. It is a real device, it exists in real
> systems and it is used in those systems. I worked on embedded systems that
> ran Linux and used e1000 devices. I am sure they are still out there because
> customers are still paying for support of those systems.

Old PCI is not the bulk of my professional hardware but it's still used here,
both for networking and video. No embedded systems. Mostly dumb file serving
ranging from quad core to i5 and xeon. Recent kernels. 

> The day is coming when all the motherboards with PCI(-X) will be gone, but I
> think it is still at least a few years off.

Add some time for the PCI <-> PCIe adapters to disappear as well. :o)
Alexei Starovoitov Sept. 13, 2016, 11:40 p.m. UTC | #20
On Tue, Sep 13, 2016 at 10:41:12PM +0000, Rustad, Mark D wrote:
> That said, I can see that you have tried to keep the original code path
> pretty much intact. I would note that you introduced rcu calls into the !bpf
> path that would never have been done before. While that should be ok, I
> would really like to see it tested, at least for the !bpf case, on real
> hardware to be sure. 

Please go ahead and test. rcu_read_lock() costs zero extra instructions
on everything but preempt or debug kernels.
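For context, a rough sketch of why that holds, paraphrased and simplified from include/linux/rcupdate.h and include/linux/preempt.h of that era (not the verbatim kernel source): on a non-preempt, non-debug build the read-side markers collapse to compiler barriers, so no machine instructions are emitted.

/* Simplified: the lockdep/sparse annotations, which only do work on
 * debug builds, are omitted here.
 */
static inline void rcu_read_lock(void)
{
	__rcu_read_lock();	/* !CONFIG_PREEMPT_RCU: just preempt_disable() */
}

static inline void rcu_read_unlock(void)
{
	__rcu_read_unlock();	/* !CONFIG_PREEMPT_RCU: just preempt_enable() */
}

/* ...and with CONFIG_PREEMPT_COUNT disabled: */
#define preempt_disable()	barrier()
#define preempt_enable()	barrier()

On CONFIG_PREEMPT or lock-debugging kernels these do expand to real work, which is the caveat in the sentence above.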
Rustad, Mark D Sept. 14, 2016, 12:13 a.m. UTC | #21
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 13, 2016 at 10:41:12PM +0000, Rustad, Mark D wrote:
>> That said, I can see that you have tried to keep the original code path
>> pretty much intact. I would note that you introduced rcu calls into the  
>> !bpf
>> path that would never have been done before. While that should be ok, I
>> would really like to see it tested, at least for the !bpf case, on real
>> hardware to be sure.
>
> please go ahead and test. rcu_read_lock is zero extra instructions
> for everything but preempt or debug kernels.

Well, I don't have any hardware in hand to test with, though my former  
employer would. I guess my current employer would too! :-) FWIW, the kernel  
used in that system I referred to before was a preempt kernel.

The test matrix is large, the tail is long and you can't just gloss these  
things over. I understand that it isn't the focus of your work, just as  
regression testing the e1000 is not the focus of any of our work any more.  
That is precisely why it is a sensitive area.

--
Mark Rustad, Networking Division, Intel Corporation
Brown, Aaron F Sept. 14, 2016, 11:42 p.m. UTC | #22
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On
> Behalf Of Rustad, Mark D
> Sent: Tuesday, September 13, 2016 3:41 PM
> To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>; Tom Herbert
> <tom@herbertland.com>; Brenden Blanco <bblanco@plumgrid.com>; Linux
> Kernel Network Developers <netdev@vger.kernel.org>; intel-wired-lan <intel-
> wired-lan@lists.osuosl.org>; William Tu <u9012063@gmail.com>; Jesper
> Dangaard Brouer <brouer@redhat.com>; Cong Wang
> <xiyou.wangcong@gmail.com>; David S. Miller <davem@davemloft.net>
> Subject: Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP
> support
> 
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Tue, Sep 13, 2016 at 07:14:27PM +0000, Rustad, Mark D wrote:
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>
> >>> On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
> >>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>
> >>>>> I've looked through qemu and it appears only emulate e1k and tg3.
> >>>>> The latter is still used in the field, so the risk of touching
> >>>>> it is higher.
> >>>>
> >>>> I have no idea what makes you think that e1k is *not* "used in the
> >>>> field".

I personally use a number of physical e1000 parts in my lab, primarily as out-of-band control adapters: the driver almost never changes, so it tends to be very stable, and it is rare that I have to test it, so I'm not fighting with a control adapter that I am also testing the driver on.

> >>>> I grant you it probably is used more virtualized than not these days,
> >>>> but it
> >>>> certainly exists and is used. You can still buy them new at Newegg for
> >>>> goodness sakes!
> >>>
> >>> the point that it's only used virtualized, since PCI (not PCIE) have
> >>> been long dead.

Maybe for new server installations.  But modern motherboards often still have PCI, university/lab environments are slow to upgrade specialty test equipment that uses PCI, and home users often don't want to get rid of their expensive high-quality sound card, video capture card, or whatever.  Plain old PCI is still in use.

> >>
> >> My point is precisely the opposite. It is a real device, it exists in real
> >> systems and it is used in those systems. I worked on embedded systems
> that
> >> ran Linux and used e1000 devices. I am sure they are still out there
> >> because
> >> customers are still paying for support of those systems.
> >>
> >> Yes, PCI(-X) is absent from any current hardware and has been for some
> >> years
> >> now, but there is an installed base that continues. What part of that
> >> installed base updates software? I don't know, but I would not just
> assume
> >> that it is 0. I know that I updated the kernel on those embedded systems
> >> that I worked on when I was supporting them. Never at the bleeding edge,
> >> but
> >> generally hopping from one LTS kernel to another as needed.
> >
> > I suspect modern linux won't boot on such old pci only systems for other
> > reasons not related to networking, since no one really cares to test
> > kernels there.
> 
> Actually it does boot, because although the motherboard was PCIe, the slots
> and the adapters in them were PCI-X. So the core architecture was not so
> stale.
> 
> > So I think we mostly agree. There is chance that this xdp e1k code will
> > find a way to that old system. What are the chances those users will
> > be using xdp there? I think pretty close to zero.
> 
> For sure they wouldn't be using XDP, but they could suffer regressions in a
> changed driver that might find its way there. That is the risk.
> 
> > The pci-e nics integrated into motherboards that pretend to be tg3
> > (though they're not at all build by broadcom) are significantly more
> > common.
> > That's why I picked e1k instead of tg3.
> 
> That may be true (I really don't know anything about tg3 so I certainly
> can't dispute it), so the risk could be smaller with e1k, but there is
> still a regression risk for real existing hardware. That is my concern.

And one of the supporting patches in this series seems to have done just that.  I'll reply to the patch itself with details.
> 
> > Also note how this patch is not touching anything in the main e1k path
> > (because I don't have a hw to test and I suspect Intel's driver team
> > doesn't have it either)

Your suspicion is incorrect.  My lab has multiple boxes out on benches and at least one cabinet 3/4 full of ancient hardware, including a decent variety of e1000 cards (and even a number of e100 ones).  I also keep a handful of test systems in my regression rack devoted to e1000, though those have dwindled (old hardware eventually dies) and the set is currently in need of a bit of work, which this patch set has encouraged me to get on to.

> >  to make sure there is no breakage on those
> > old systems. I created separate e1000_xmit_raw_frame() routine
> > instead of adding flags into e1000_xmit_frame() for the same reasons:
> > to make sure there is no breakage.
> > Same reasoning for not doing an union of page/skb as Alexander
> suggested.
> > I wanted minimal change to e1k that allows development xdp programs in
> kvm
> > without affecting e1k main path. If you see the actual bug in the patch,
> > please point out the line.
> 
> I can't say that I can, because I am not familiar with the internals of
> e1k. When I was using it, I never had cause to even look at the driver
> because it just worked. My attentions then were properly elsewhere.
> 
> My concern is with messing with a driver that probably no one has an active
> testbed routinely running regression tests.
> 
> Maybe a new ID should be assigned and the driver forked for this purpose.
> At least then only the virtualization case would have to be tested. Of
> course the hosts would have to offer that ID, but if this is just for
> testing that should be ok, or at least possible.
> 
> If someone with more internal knowledge of this driver has enough
> confidence to bless such patches, that might be fine, but it is a fallacy
> to think that e1k is *only* a virtualization driver today. Not yet anyway.
> Maybe around 2020.
> 
> That said, I can see that you have tried to keep the original code path
> pretty much intact. I would note that you introduced rcu calls into the
> !bpf path that would never have been done before. While that should be ok,
> I would really like to see it tested, at least for the !bpf case, on real
> hardware to be sure. I really can't comment on the workaround issue
> brought
> up by Eric, because I just don't know about them. At least that risk seems
> to only be in the bpf case.
> 
> --
> Mark Rustad, Networking Division, Intel Corporation
Jesper Dangaard Brouer Sept. 18, 2016, 5:25 p.m. UTC | #23
On Mon, 12 Sep 2016 16:46:08 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.

I will take care of documentation for the XDP project.
Tom Herbert Sept. 18, 2016, 6:06 p.m. UTC | #24
On Sep 18, 2016 10:25 AM, "Jesper Dangaard Brouer" <brouer@redhat.com>
wrote:
>
> On Mon, 12 Sep 2016 16:46:08 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> > absolutely no documentation on it, warning users about possible
> > limitations/outcomes.
>
> I will take care of documentation for the XDP project.

Thanks Jesper, that will help a lot!

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
diff mbox

Patch

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index d7bdea7..5cf8a0a 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -150,6 +150,7 @@  struct e1000_adapter;
  */
 struct e1000_tx_buffer {
 	struct sk_buff *skb;
+	struct page *page;
 	dma_addr_t dma;
 	unsigned long time_stamp;
 	u16 length;
@@ -279,6 +280,7 @@  struct e1000_adapter {
 			     struct e1000_rx_ring *rx_ring,
 			     int cleaned_count);
 	struct e1000_rx_ring *rx_ring;      /* One per active queue */
+	struct bpf_prog *prog;
 	struct napi_struct napi;
 
 	int num_tx_queues;
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 62a7f8d..232b927 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -32,6 +32,7 @@ 
 #include <linux/prefetch.h>
 #include <linux/bitops.h>
 #include <linux/if_vlan.h>
+#include <linux/bpf.h>
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -842,6 +843,44 @@  static int e1000_set_features(struct net_device *netdev,
 	return 0;
 }
 
+static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct bpf_prog *old_prog;
+
+	old_prog = xchg(&adapter->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	if (netif_running(netdev))
+		e1000_reinit_locked(adapter);
+	else
+		e1000_reset(adapter);
+	return 0;
+}
+
+static bool e1000_xdp_attached(struct net_device *dev)
+{
+	struct e1000_adapter *priv = netdev_priv(dev);
+
+	return !!priv->prog;
+}
+
+static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return e1000_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = e1000_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops e1000_netdev_ops = {
 	.ndo_open		= e1000_open,
 	.ndo_stop		= e1000_close,
@@ -860,6 +899,7 @@  static const struct net_device_ops e1000_netdev_ops = {
 #endif
 	.ndo_fix_features	= e1000_fix_features,
 	.ndo_set_features	= e1000_set_features,
+	.ndo_xdp		= e1000_xdp,
 };
 
 /**
@@ -1276,6 +1316,9 @@  static void e1000_remove(struct pci_dev *pdev)
 	e1000_down_and_stop(adapter);
 	e1000_release_manageability(adapter);
 
+	if (adapter->prog)
+		bpf_prog_put(adapter->prog);
+
 	unregister_netdev(netdev);
 
 	e1000_phy_hw_reset(hw);
@@ -1859,7 +1902,7 @@  static void e1000_configure_rx(struct e1000_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 	u32 rdlen, rctl, rxcsum;
 
-	if (adapter->netdev->mtu > ETH_DATA_LEN) {
+	if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
 		rdlen = adapter->rx_ring[0].count *
 			sizeof(struct e1000_rx_desc);
 		adapter->clean_rx = e1000_clean_jumbo_rx_irq;
@@ -1973,6 +2016,11 @@  e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
 		dev_kfree_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
+	if (buffer_info->page) {
+		put_page(buffer_info->page);
+		buffer_info->page = NULL;
+	}
+
 	buffer_info->time_stamp = 0;
 	/* buffer_info must be completely set up in the transmit path */
 }
@@ -3298,6 +3346,69 @@  static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
 	return NETDEV_TX_OK;
 }
 
+static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
+				struct e1000_rx_buffer *rx_buffer_info,
+				unsigned int len)
+{
+	struct e1000_tx_buffer *buffer_info;
+	unsigned int i = tx_ring->next_to_use;
+
+	buffer_info = &tx_ring->buffer_info[i];
+
+	buffer_info->length = len;
+	buffer_info->time_stamp = jiffies;
+	buffer_info->mapped_as_page = false;
+	buffer_info->dma = rx_buffer_info->dma;
+	buffer_info->next_to_watch = i;
+	buffer_info->page = rx_buffer_info->rxbuf.page;
+
+	tx_ring->buffer_info[i].skb = NULL;
+	tx_ring->buffer_info[i].segs = 1;
+	tx_ring->buffer_info[i].bytecount = len;
+	tx_ring->buffer_info[i].next_to_watch = i;
+
+	rx_buffer_info->rxbuf.page = NULL;
+}
+
+static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
+				 u32 len,
+				 struct net_device *netdev,
+				 struct e1000_adapter *adapter)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
+	struct e1000_hw *hw = &adapter->hw;
+	struct e1000_tx_ring *tx_ring;
+
+	if (len > E1000_MAX_DATA_PER_TXD)
+		return;
+
+	/* e1000 only support a single txq at the moment so the queue is being
+	 * shared with stack. To support this requires locking to ensure the
+	 * stack and XDP are not running at the same time. Devices with
+	 * multiple queues should allocate a separate queue space.
+	 */
+	HARD_TX_LOCK(netdev, txq, smp_processor_id());
+
+	tx_ring = adapter->tx_ring;
+
+	if (E1000_DESC_UNUSED(tx_ring) < 2) {
+		HARD_TX_UNLOCK(netdev, txq);
+		return;
+	}
+
+	if (netif_xmit_frozen_or_stopped(txq))
+		return;
+
+	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
+	netdev_sent_queue(netdev, len);
+	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
+
+	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
+	mmiowb();
+
+	HARD_TX_UNLOCK(netdev, txq);
+}
+
 #define NUM_REGS 38 /* 1 based count */
 static void e1000_regdump(struct e1000_adapter *adapter)
 {
@@ -4139,6 +4250,19 @@  static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
 	return skb;
 }
 
+static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
+				 unsigned int length)
+{
+	struct xdp_buff xdp;
+	int ret;
+
+	xdp.data = data;
+	xdp.data_end = data + length;
+	ret = BPF_PROG_RUN(prog, (void *)&xdp);
+
+	return ret;
+}
+
 /**
  * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
  * @adapter: board private structure
@@ -4157,12 +4281,15 @@  static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 	struct pci_dev *pdev = adapter->pdev;
 	struct e1000_rx_desc *rx_desc, *next_rxd;
 	struct e1000_rx_buffer *buffer_info, *next_buffer;
+	struct bpf_prog *prog;
 	u32 length;
 	unsigned int i;
 	int cleaned_count = 0;
 	bool cleaned = false;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 
+	rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
+	prog = READ_ONCE(adapter->prog);
 	i = rx_ring->next_to_clean;
 	rx_desc = E1000_RX_DESC(*rx_ring, i);
 	buffer_info = &rx_ring->buffer_info[i];
@@ -4188,12 +4315,54 @@  static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 
 		cleaned = true;
 		cleaned_count++;
+		length = le16_to_cpu(rx_desc->length);
+
+		if (prog) {
+			struct page *p = buffer_info->rxbuf.page;
+			dma_addr_t dma = buffer_info->dma;
+			int act;
+
+			if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
+				/* attached bpf disallows larger than page
+				 * packets, so this is hw error or corruption
+				 */
+				pr_info_once("%s buggy !eop\n", netdev->name);
+				break;
+			}
+			if (unlikely(rx_ring->rx_skb_top)) {
+				pr_info_once("%s ring resizing bug\n",
+					     netdev->name);
+				break;
+			}
+			dma_sync_single_for_cpu(&pdev->dev, dma,
+						length, DMA_FROM_DEVICE);
+			act = e1000_call_bpf(prog, page_address(p), length);
+			switch (act) {
+			case XDP_PASS:
+				break;
+			case XDP_TX:
+				dma_sync_single_for_device(&pdev->dev,
+							   dma,
+							   length,
+							   DMA_TO_DEVICE);
+				e1000_xmit_raw_frame(buffer_info, length,
+						     netdev, adapter);
+			case XDP_DROP:
+			default:
+				/* re-use mapped page. keep buffer_info->dma
+				 * as-is, so that e1000_alloc_jumbo_rx_buffers
+				 * only needs to put it back into rx ring
+				 */
+				total_rx_bytes += length;
+				total_rx_packets++;
+				goto next_desc;
+			}
+		}
+
 		dma_unmap_page(&pdev->dev, buffer_info->dma,
 			       adapter->rx_buffer_len, DMA_FROM_DEVICE);
 		buffer_info->dma = 0;
 
-		length = le16_to_cpu(rx_desc->length);
-
 		/* errors is only valid for DD + EOP descriptors */
 		if (unlikely((status & E1000_RXD_STAT_EOP) &&
 		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
@@ -4327,6 +4496,7 @@  next_desc:
 		rx_desc = next_rxd;
 		buffer_info = next_buffer;
 	}
+	rcu_read_unlock();
 	rx_ring->next_to_clean = i;
 
 	cleaned_count = E1000_DESC_UNUSED(rx_ring);