diff mbox

[ovs-dev,V2] netdev-dpdk: Add Jumbo Frame Support.

Message ID 1453889622-22972-1-git-send-email-mark.b.kavanagh@intel.com
State Changes Requested
Headers show

Commit Message

Mark Kavanagh Jan. 27, 2016, 10:13 a.m. UTC
Add support for Jumbo Frames to DPDK-enabled port types,
using single-segment-mbufs.

Using this approach, the amount of memory allocated for each mbuf
to store frame data is increased to a value greater than 1518B
(typical Ethernet maximum frame length). The increased space
available in the mbuf means that an entire Jumbo Frame can be carried
in a single mbuf, as opposed to partitioning it across multiple mbuf
segments.

The amount of space allocated to each mbuf to hold frame data is
defined dynamically by the user when adding a DPDK port to a bridge.
If an MTU value is not supplied, or the user-supplied value is invalid,
the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).

Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
---
 INSTALL.DPDK.md   |  59 ++++++++-
 lib/netdev-dpdk.c | 356 +++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 328 insertions(+), 87 deletions(-)

Comments

Aaron Conole Jan. 27, 2016, 4:01 p.m. UTC | #1
Hi Mark,

Thanks for the patch! I've not done a very thorough review of this, but
my first-blush comments are inline.

Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
> Add support for Jumbo Frames to DPDK-enabled port types,
> using single-segment-mbufs.
>
> Using this approach, the amount of memory allocated for each mbuf
> to store frame data is increased to a value greater than 1518B
> (typical Ethernet maximum frame length). The increased space
> available in the mbuf means that an entire Jumbo Frame can be carried
> in a single mbuf, as opposed to partitioning it across multiple mbuf
> segments.
>
> The amount of space allocated to each mbuf to hold frame data is
> defined dynamically by the user when adding a DPDK port to a bridge.
> If an MTU value is not supplied, or the user-supplied value is invalid,
> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>
> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
> ---

This is a new feature which has user-visible impact, so should be
announced in NEWS (plus, it's great news).

>  INSTALL.DPDK.md   |  59 ++++++++-
>  lib/netdev-dpdk.c | 356 +++++++++++++++++++++++++++++++++++++++++-------------
>  2 files changed, 328 insertions(+), 87 deletions(-)
>
> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
> index 96b686c..2f23e27 100644
> --- a/INSTALL.DPDK.md
> +++ b/INSTALL.DPDK.md
> @@ -859,10 +859,61 @@ by adding the following string:
>  to <interface> sections of all network devices used by DPDK. Parameter 'N'
>  determines how many queues can be used by the guest.
>  
> +
> +Jumbo Frames
> +------------
> +
> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
> +
> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the ovs-vsctl
> +'add-port' command-line, along with the required MTU for the port.
> +e.g.

This text and the example do not match. I think '--dpdk-mtu' is not valid.

> +
> +     ```
> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000
> +     ```
> +
> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
> +increased, such that a full Jumbo Frame may be accommodated inside a single
> +mbuf segment. Once set, the MTU for a DPDK port is immutable.

Why no support? DPDK supports changing the mtu.

> +Jumbo frame support has been validated against 13312B frames, using the
> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
> +theoretically be supported. Supported port types excludes vHost-Cuse ports, as
> +this feature is pending deprecation.
> +
> +
> +vHost Ports and Jumbo Frames
> +----------------------------
> +Jumbo frame support is available for DPDK vHost-User ports only. Some additional
> +configuration is needed to take advantage of this feature:
> +
> +  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
> +      the QEMU command line snippet below:
> +
> +      ```
> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
> +      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
> +      ```
> +
> +  2. Where virtio devices are bound to the Linux kernel driver in a guest
> +     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
> +     MTU of those logical network interfaces must also be increased. This
> +     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers
> +     to the length of the IP packet only, and not that of the entire frame.
> +
> +     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2
> +     header and CRC lengths (i.e. 18B) from the max supported frame size.
> +     So, to set the MTU for a 13312B Jumbo Frame:
> +
> +      ```
> +      ifconfig eth1 mtu 13294
> +      ```
> +
> +
>  Restrictions:
>  -------------
>  
> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>    - Currently DPDK port does not make use any offload functionality.
>    - DPDK-vHost support works with 1G huge pages.
>  
> @@ -903,6 +954,12 @@ Restrictions:
>      the next release of DPDK (which includes the above patch) is available and
>      integrated into OVS.
>  
> +  Jumbo Frames:
> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of
> +     this issue is currently being investigated.
> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
> +
> +
>  Bug Reporting:
>  --------------
>  
> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> index de7e488..5f23e28 100644
> --- a/lib/netdev-dpdk.c
> +++ b/lib/netdev-dpdk.c
> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>  #define OVS_VPORT_DPDK "ovs_dpdk"
>  
> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
> +
>  /*
>   * need to reserve tons of extra space in the mbufs so we can align the
>   * DMA addresses to 4KB.
>   * The minimum mbuf size is limited to avoid scatter behaviour and drop in
>   * performance for standard Ethernet MTU.
>   */
> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
> -                              + sizeof(struct dp_packet) \
> -                              + RTE_PKTMBUF_HEADROOM)
> -#define MBUF_SIZE_DRIVER     (2048                       \
> -                              + sizeof (struct rte_mbuf) \
> -                              + RTE_PKTMBUF_HEADROOM)
> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
> +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)
> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
> +                                    + sizeof(struct dp_packet)   \
> +                                    + RTE_PKTMBUF_HEADROOM)
> +
> +/* This value should be specified as a multiple of the DPDK NIC driver's
> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
> + */
> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>  
>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>  #define MIN_NB_MBUF          (4096 * 4)
>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>  
> +#define DPDK_VLAN_TAG_LEN    4
> +
>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>  BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>  
> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>          .header_split   = 0, /* Header Split disabled */
>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>          .hw_strip_crc   = 0,
>      },
>      .rx_adv_conf = {
> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>      return class->construct == netdev_dpdk_construct;
>  }
>  
> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).
> + * If 'frame_len' is not a multiple of this value, insufficient buffers are
> + * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio
> + * driver needs to ensure that there is also sufficient space in the Rx buffer
> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
> + * small, then the driver enables scatter RX behaviour, which reduces
> + * performance. To prevent this, use a buffer size that is closest to
> + * 'frame_len', but which satisfies the aforementioned criteria.
> + */
> +static uint32_t
> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
> +{
> +    struct rte_eth_dev_info info;
> +    uint32_t buf_size;
> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
> +
> +    if(netdev->type == DPDK_DEV_ETH) {
> +        rte_eth_dev_info_get(netdev->port_id, &info);
> +        buf_size = (info.min_rx_bufsize == 0) ?
> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
> +                   info.min_rx_bufsize;

Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?

> +    } else {
> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
> +    }
> +
> +    if(len % buf_size) {
> +        len = buf_size * ((len/buf_size) + 1);
> +    }
> +
> +    return len;
> +}
> +
>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>   * for all other segments data, bss and text. */
>  
> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>  }
>  
>  static void
> -__rte_pktmbuf_init(struct rte_mempool *mp,
> -                   void *opaque_arg OVS_UNUSED,
> -                   void *_m,
> -                   unsigned i OVS_UNUSED)
> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>  {
> -    struct rte_mbuf *m = _m;
> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
> +    struct rte_pktmbuf_pool_private default_mbp_priv;
> +    uint16_t roomsz;
>  
>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>  
> -    memset(m, 0, mp->elt_size);
> +    /* if no structure is provided, assume no mbuf private area */
> +
> +    user_mbp_priv = opaque_arg;
> +    if (user_mbp_priv == NULL) {
> +        default_mbp_priv.mbuf_priv_size = 0;
> +        if (mp->elt_size > sizeof(struct dp_packet)) {
> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
> +        } else {
> +            roomsz = 0;
> +        }
> +        default_mbp_priv.mbuf_data_room_size = roomsz;
> +        user_mbp_priv = &default_mbp_priv;
> +    }
> +
> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
> +        user_mbp_priv->mbuf_data_room_size +
> +        user_mbp_priv->mbuf_priv_size);
> +
> +    mbp_priv = rte_mempool_get_priv(mp);
> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
> +}
> +
> +/* Initialise some fields in the mbuf structure that are not modified by the
> + * user once created (origin pool, buffer start address, etc.*/
> +static void
> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
> +                       void *opaque_arg OVS_UNUSED,
> +                       void *_m,
> +                       unsigned i OVS_UNUSED)
> +{
> +	struct rte_mbuf *m = _m;
> +	uint32_t buf_size, buf_len, priv_size;
> +
> +	priv_size = rte_pktmbuf_priv_size(mp);
> +	buf_size = sizeof(struct dp_packet) + priv_size;
> +	buf_len = rte_pktmbuf_data_room_size(mp);
>  
> -    /* start of buffer is just after mbuf structure */
> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
> -                    sizeof(struct dp_packet);
> -    m->buf_len = (uint16_t)buf_len;
> +	RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>  
> -    /* keep some headroom between start of buffer and data */
> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
> +	memset(m, 0, mp->elt_size);
>  
> -    /* init some constant fields */
> -    m->pool = mp;
> -    m->nb_segs = 1;
> -    m->port = 0xff;
> +	/* start of buffer is after dp_packet structure and priv data */
> +	m->priv_size = priv_size;
> +	m->buf_addr = (char *)m + buf_size;
> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
> +	m->buf_len = (uint16_t)buf_len;
> +
> +	/* keep some headroom between start of buffer and data */
> +	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
> +
> +	/* init some constant fields */
> +	m->pool = mp;
> +	m->nb_segs = 1;
> +	m->port = 0xff;

Please don't mix tabs and spaces in the file.

>  }
>  
>  static void
> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>  {
>      struct rte_mbuf *m = _m;
>  
> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>  
>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>  }
> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>      struct dpdk_mp *dmp = NULL;
>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>      unsigned mp_size;
> +    struct rte_pktmbuf_pool_private mbp_priv;
>  
>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>      dmp->socket_id = socket_id;
>      dmp->mtu = mtu;
>      dmp->refcount = 1;
> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;
> +    mbp_priv.mbuf_priv_size = 0;
>  
>      mp_size = MAX_NB_MBUF;
>      do {
> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>              return NULL;
>          }
>  
> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
> +        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),
>                                       MP_CACHE_SZ,
>                                       sizeof(struct rte_pktmbuf_pool_private),
> -                                     rte_pktmbuf_pool_init, NULL,
> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
>                                       ovs_rte_pktmbuf_init, NULL,
>                                       socket_id, 0);
>      } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>  {
>      int diag = 0;
>      int i;
> +    struct rte_eth_conf conf = port_conf;
>  
>      /* A device may report more queues than it makes available (this has
>       * been observed for Intel xl710, which reserves some of them for
> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>              VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>          }
>  
> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
> +        }
> +
> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>          if (diag) {
>              break;
>          }
> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>      int sid;
>      int err = 0;
> +    uint32_t buf_size;
>  
>      ovs_mutex_init(&netdev->mutex);
>      ovs_mutex_lock(&netdev->mutex);
> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>      netdev->port_id = port_no;
>      netdev->type = type;
>      netdev->flags = 0;
> +
> +    /* Initialize port's MTU and frame len to the default Ethernet values.
> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
> +     * netdev_dpdk_set_config.
> +     */
> +    netdev->max_packet_len = ETHER_MAX_LEN;
>      netdev->mtu = ETHER_MTU;
> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>  
> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>      if (!netdev->dpdk_mp) {
>          err = ENOMEM;
>          goto unlock;
> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],
>      return 0;
>  }
>  
> +static void
> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
> +{
> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
> +    int local_mtu;
> +
> +    if(mtu_str) {
> +        local_mtu = atoi(mtu_str);

Please use strtol or strtoul here, and detect errors in the string by
checking endptr. That way, we can be sure that random garbage on the
stack won't accidentally become the mtu.

> +    }
> +    if(!mtu_str || local_mtu < ETHER_MTU ||
> +                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
> +        local_mtu = ETHER_MTU;
> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",
> +                   local_mtu);
> +    }
> +    *mtu = local_mtu;
> +}
> +
>  static int
>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>  {
> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, struct smap *args)
>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>      smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>      ovs_mutex_unlock(&dev->mutex);
>  
>      return 0;
>  }
>  
> +/* Set the mtu of DPDK_DEV_ETH ports */
> +static int
> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int old_mtu, err;
> +    uint32_t buf_size;
> +    int dpdk_mtu;
> +    struct dpdk_mp *old_mp;
> +    struct dpdk_mp *mp;
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->mtu == mtu) {
> +        err = 0;
> +        goto out;
> +    }
> +
> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
> +
> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
> +    if (!mp) {
> +        err = ENOMEM;
> +        goto out;
> +    }
> +
> +    rte_eth_dev_stop(dev->port_id);
> +
> +    old_mtu = dev->mtu;
> +    old_mp = dev->dpdk_mp;
> +    dev->dpdk_mp = mp;
> +    dev->mtu = mtu;
> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +
> +    err = dpdk_eth_dev_init(dev);

Why call the dpdk_eth_dev_init here? Since you are directly calling
rte_eth_dev_X functions here - wouldn't it make sense to use the
rte_eth_dev_start() instead?

> +    if (err) {
> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "
> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
> +        dpdk_mp_put(mp);
> +        dev->mtu = old_mtu;
> +        dev->dpdk_mp = old_mp;
> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +        err = dpdk_eth_dev_init(dev);

What happens if this returns a failure? By using dpdk_eth_dev_init, you
will cause a queue reconfiguration action, which is not needed, and can
error.

If you are intending to set MTU, I would have expected calls to
rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
recognize that much of this is legacy, but since you are touching it :)

> +        goto out;
> +    }
> +
> +    dpdk_mp_put(old_mp);
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    ovs_mutex_unlock(&dpdk_mutex);
> +    return err;
> +}
> +
> +static int
> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
> +{
> +    int mtu;
> +
> +    dpdk_dev_parse_mtu(args, &mtu);
> +
> +    return netdev_dpdk_set_mtu(netdev_, mtu);
> +}
> +
>  static int
>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>  {
> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
>  
>      return 0;
>  }
> -
> -static int
> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
> -{
> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> -    int old_mtu, err;
> -    struct dpdk_mp *old_mp;
> -    struct dpdk_mp *mp;
> -
> -    ovs_mutex_lock(&dpdk_mutex);
> -    ovs_mutex_lock(&dev->mutex);
> -    if (dev->mtu == mtu) {
> -        err = 0;
> -        goto out;
> -    }
> -
> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
> -    if (!mp) {
> -        err = ENOMEM;
> -        goto out;
> -    }
> -
> -    rte_eth_dev_stop(dev->port_id);
> -
> -    old_mtu = dev->mtu;
> -    old_mp = dev->dpdk_mp;
> -    dev->dpdk_mp = mp;
> -    dev->mtu = mtu;
> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> -
> -    err = dpdk_eth_dev_init(dev);
> -    if (err) {
> -        dpdk_mp_put(mp);
> -        dev->mtu = old_mtu;
> -        dev->dpdk_mp = old_mp;
> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
> -        dpdk_eth_dev_init(dev);
> -        goto out;
> -    }
> -
> -    dpdk_mp_put(old_mp);
> -    netdev_change_seq_changed(netdev);
> -out:
> -    ovs_mutex_unlock(&dev->mutex);
> -    ovs_mutex_unlock(&dpdk_mutex);
> -    return err;
> -}
> -
>  static int
>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>  
> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, struct smap *args)
>      smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>      smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>      smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>      smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>      smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>      return 0;
>  }
>  
> +/* Set the mtu of DPDK_DEV_VHOST ports */
> +static int
> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
> +{
> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
> +    int err = 0;
> +    struct dpdk_mp *old_mp;
> +    struct dpdk_mp *mp;
> +
> +    ovs_mutex_lock(&dpdk_mutex);
> +    ovs_mutex_lock(&dev->mutex);
> +    if (dev->mtu == mtu) {
> +        err = 0;
> +        goto out;
> +    }
> +
> +    mp = dpdk_mp_get(dev->socket_id, mtu);
> +    if (!mp) {
> +        err = ENOMEM;
> +        goto out;
> +    }
> +
> +    old_mp = dev->dpdk_mp;
> +    dev->dpdk_mp = mp;
> +    dev->mtu = mtu;
> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
> +
> +    dpdk_mp_put(old_mp);
> +    netdev_change_seq_changed(netdev);
> +out:
> +    ovs_mutex_unlock(&dev->mutex);
> +    ovs_mutex_unlock(&dpdk_mutex);
> +    return err;
> +}
> +
> +static int
> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)
> +{
> +    int mtu;
> +
> +    dpdk_dev_parse_mtu(args, &mtu);
> +
> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
> +}
> +
>  static void
>  dpdk_common_init(void)
>  {
> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>      return err;
>  }
>  
> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
> +        GET_STATUS, RXQ_RECV)                                          \
>  {                                                             \
>      NAME,                                                     \
>      INIT,                       /* init */                    \
> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>      DESTRUCT,                                                 \
>      netdev_dpdk_dealloc,                                      \
>      netdev_dpdk_get_config,                                   \
> -    NULL,                       /* netdev_dpdk_set_config */  \
> +    SET_CONFIG,                                               \
>      NULL,                       /* get_tunnel_config */       \
>      NULL,                       /* build header */            \
>      NULL,                       /* push header */             \
> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>      netdev_dpdk_set_etheraddr,                                \
>      netdev_dpdk_get_etheraddr,                                \
>      netdev_dpdk_get_mtu,                                      \
> -    netdev_dpdk_set_mtu,                                      \
> +    SET_MTU,                                                  \
>      netdev_dpdk_get_ifindex,                                  \
>      GET_CARRIER,                                              \
>      netdev_dpdk_get_carrier_resets,                           \
> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>          NULL,
>          netdev_dpdk_construct,
>          netdev_dpdk_destruct,
> +        netdev_dpdk_set_config,
>          netdev_dpdk_set_multiq,
>          netdev_dpdk_eth_send,
> +        netdev_dpdk_set_mtu,
>          netdev_dpdk_get_carrier,
>          netdev_dpdk_get_stats,
>          netdev_dpdk_get_features,
> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>          NULL,
>          netdev_dpdk_ring_construct,
>          netdev_dpdk_destruct,
> +        netdev_dpdk_set_config,
>          netdev_dpdk_set_multiq,
>          netdev_dpdk_ring_send,
> +        netdev_dpdk_set_mtu,
>          netdev_dpdk_get_carrier,
>          netdev_dpdk_get_stats,
>          netdev_dpdk_get_features,
> @@ -2241,8 +2421,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =
>          dpdk_vhost_cuse_class_init,
>          netdev_dpdk_vhost_cuse_construct,
>          netdev_dpdk_vhost_destruct,
> +        NULL,
>          netdev_dpdk_vhost_set_multiq,
>          netdev_dpdk_vhost_send,
> +        NULL,
>          netdev_dpdk_vhost_get_carrier,
>          netdev_dpdk_vhost_get_stats,
>          NULL,
> @@ -2255,8 +2437,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =
>          dpdk_vhost_user_class_init,
>          netdev_dpdk_vhost_user_construct,
>          netdev_dpdk_vhost_destruct,
> +        netdev_dpdk_vhost_set_config,
>          netdev_dpdk_vhost_set_multiq,
>          netdev_dpdk_vhost_send,
> +        netdev_dpdk_vhost_set_mtu,
>          netdev_dpdk_vhost_get_carrier,
>          netdev_dpdk_vhost_get_stats,
>          NULL,
Mark Kavanagh Jan. 28, 2016, 4:43 p.m. UTC | #2
>Hi Mark,
>
>Thanks for the patch! I've not done a very thorough review of this, but
>my first-blush comments are inline.

Thanks for the review Aaron!

Any additional comments are welcome - my initial responses are inline, below.

Thanks,
Mark

>
>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>> Add support for Jumbo Frames to DPDK-enabled port types,
>> using single-segment-mbufs.
>>
>> Using this approach, the amount of memory allocated for each mbuf
>> to store frame data is increased to a value greater than 1518B
>> (typical Ethernet maximum frame length). The increased space
>> available in the mbuf means that an entire Jumbo Frame can be carried
>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>> segments.
>>
>> The amount of space allocated to each mbuf to hold frame data is
>> defined dynamically by the user when adding a DPDK port to a bridge.
>> If an MTU value is not supplied, or the user-supplied value is invalid,
>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>
>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>> ---
>
>This is a new feature which has user-visible impact, so should be
>announced in NEWS (plus, it's great news).

No problem - I'll add an entry.

>
>>  INSTALL.DPDK.md   |  59 ++++++++-
>>  lib/netdev-dpdk.c | 356 +++++++++++++++++++++++++++++++++++++++++-------------
>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>
>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>> index 96b686c..2f23e27 100644
>> --- a/INSTALL.DPDK.md
>> +++ b/INSTALL.DPDK.md
>> @@ -859,10 +859,61 @@ by adding the following string:
>>  to <interface> sections of all network devices used by DPDK. Parameter 'N'
>>  determines how many queues can be used by the guest.
>>
>> +
>> +Jumbo Frames
>> +------------
>> +
>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>> +
>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the ovs-vsctl
>> +'add-port' command-line, along with the required MTU for the port.
>> +e.g.
>
>This text and the example do not match. I think '--dpdk-mtu' is not valid.

Good catch, thanks.

>
>> +
>> +     ```
>> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000
>> +     ```
>> +
>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
>> +increased, such that a full Jumbo Frame may be accommodated inside a single
>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>
>Why no support? DPDK supports changing the mtu.

I guess my rationale here is that an MTU change can't be triggered via OVS command-line, nor can it be triggered programmatically via DPDK (apart from an explicit call to rte_eth_dev_set_mtu).
So, while technically it's possibly, from a user's point of view, there's no way to configure it, outside of modifying the code directly. If I've missed something here, please let me know. 

>
>> +Jumbo frame support has been validated against 13312B frames, using the
>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>> +theoretically be supported. Supported port types excludes vHost-Cuse ports, as
>> +this feature is pending deprecation.
>> +
>> +
>> +vHost Ports and Jumbo Frames
>> +----------------------------
>> +Jumbo frame support is available for DPDK vHost-User ports only. Some additional
>> +configuration is needed to take advantage of this feature:
>> +
>> +  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
>> +      the QEMU command line snippet below:
>> +
>> +      ```
>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>> +      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>> +      ```
>> +
>> +  2. Where virtio devices are bound to the Linux kernel driver in a guest
>> +     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
>> +     MTU of those logical network interfaces must also be increased. This
>> +     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers
>> +     to the length of the IP packet only, and not that of the entire frame.
>> +
>> +     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2
>> +     header and CRC lengths (i.e. 18B) from the max supported frame size.
>> +     So, to set the MTU for a 13312B Jumbo Frame:
>> +
>> +      ```
>> +      ifconfig eth1 mtu 13294
>> +      ```
>> +
>> +
>>  Restrictions:
>>  -------------
>>
>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>    - Currently DPDK port does not make use any offload functionality.
>>    - DPDK-vHost support works with 1G huge pages.
>>
>> @@ -903,6 +954,12 @@ Restrictions:
>>      the next release of DPDK (which includes the above patch) is available and
>>      integrated into OVS.
>>
>> +  Jumbo Frames:
>> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of
>> +     this issue is currently being investigated.
>> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
>> +
>> +
>>  Bug Reporting:
>>  --------------
>>
>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>> index de7e488..5f23e28 100644
>> --- a/lib/netdev-dpdk.c
>> +++ b/lib/netdev-dpdk.c
>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>
>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>> +
>>  /*
>>   * need to reserve tons of extra space in the mbufs so we can align the
>>   * DMA addresses to 4KB.
>>   * The minimum mbuf size is limited to avoid scatter behaviour and drop in
>>   * performance for standard Ethernet MTU.
>>   */
>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>> -                              + sizeof(struct dp_packet) \
>> -                              + RTE_PKTMBUF_HEADROOM)
>> -#define MBUF_SIZE_DRIVER     (2048                       \
>> -                              + sizeof (struct rte_mbuf) \
>> -                              + RTE_PKTMBUF_HEADROOM)
>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>> +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)
>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>> +                                    + sizeof(struct dp_packet)   \
>> +                                    + RTE_PKTMBUF_HEADROOM)
>> +
>> +/* This value should be specified as a multiple of the DPDK NIC driver's
>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>> + */
>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>
>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
>>  #define MIN_NB_MBUF          (4096 * 4)
>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>
>> +#define DPDK_VLAN_TAG_LEN    4
>> +
>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>  BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>
>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>          .header_split   = 0, /* Header Split disabled */
>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>          .hw_strip_crc   = 0,
>>      },
>>      .rx_adv_conf = {
>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>      return class->construct == netdev_dpdk_construct;
>>  }
>>
>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).
>> + * If 'frame_len' is not a multiple of this value, insufficient buffers are
>> + * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio
>> + * driver needs to ensure that there is also sufficient space in the Rx buffer
>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
>> + * small, then the driver enables scatter RX behaviour, which reduces
>> + * performance. To prevent this, use a buffer size that is closest to
>> + * 'frame_len', but which satisfies the aforementioned criteria.
>> + */
>> +static uint32_t
>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>> +{
>> +    struct rte_eth_dev_info info;
>> +    uint32_t buf_size;
>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>> +
>> +    if(netdev->type == DPDK_DEV_ETH) {
>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>> +        buf_size = (info.min_rx_bufsize == 0) ?
>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>> +                   info.min_rx_bufsize;
>
>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?

I'm not sure if I follow  - the MTU isn't the issue here, but rather the buffer size used when creating the rte_mempool, which must be a multiple of the driver's min_rx_bufsize.

>
>> +    } else {
>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>> +    }
>> +
>> +    if(len % buf_size) {
>> +        len = buf_size * ((len/buf_size) + 1);
>> +    }
>> +
>> +    return len;
>> +}
>> +
>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>   * for all other segments data, bss and text. */
>>
>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>  }
>>
>>  static void
>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>> -                   void *opaque_arg OVS_UNUSED,
>> -                   void *_m,
>> -                   unsigned i OVS_UNUSED)
>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>  {
>> -    struct rte_mbuf *m = _m;
>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>> +    uint16_t roomsz;
>>
>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>
>> -    memset(m, 0, mp->elt_size);
>> +    /* if no structure is provided, assume no mbuf private area */
>> +
>> +    user_mbp_priv = opaque_arg;
>> +    if (user_mbp_priv == NULL) {
>> +        default_mbp_priv.mbuf_priv_size = 0;
>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>> +        } else {
>> +            roomsz = 0;
>> +        }
>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>> +        user_mbp_priv = &default_mbp_priv;
>> +    }
>> +
>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>> +        user_mbp_priv->mbuf_data_room_size +
>> +        user_mbp_priv->mbuf_priv_size);
>> +
>> +    mbp_priv = rte_mempool_get_priv(mp);
>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>> +}
>> +
>> +/* Initialise some fields in the mbuf structure that are not modified by the
>> + * user once created (origin pool, buffer start address, etc.*/
>> +static void
>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>> +                       void *opaque_arg OVS_UNUSED,
>> +                       void *_m,
>> +                       unsigned i OVS_UNUSED)
>> +{
>> +	struct rte_mbuf *m = _m;
>> +	uint32_t buf_size, buf_len, priv_size;
>> +
>> +	priv_size = rte_pktmbuf_priv_size(mp);
>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>
>> -    /* start of buffer is just after mbuf structure */
>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>> -                    sizeof(struct dp_packet);
>> -    m->buf_len = (uint16_t)buf_len;
>> +	RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>
>> -    /* keep some headroom between start of buffer and data */
>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>> +	memset(m, 0, mp->elt_size);
>>
>> -    /* init some constant fields */
>> -    m->pool = mp;
>> -    m->nb_segs = 1;
>> -    m->port = 0xff;
>> +	/* start of buffer is after dp_packet structure and priv data */
>> +	m->priv_size = priv_size;
>> +	m->buf_addr = (char *)m + buf_size;
>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>> +	m->buf_len = (uint16_t)buf_len;
>> +
>> +	/* keep some headroom between start of buffer and data */
>> +	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>> +
>> +	/* init some constant fields */
>> +	m->pool = mp;
>> +	m->nb_segs = 1;
>> +	m->port = 0xff;
>
>Please don't mix tabs and spaces in the file.

Apologies - this was a copy/paste from a DPDK file, which uses tabs.

>
>>  }
>>
>>  static void
>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>  {
>>      struct rte_mbuf *m = _m;
>>
>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>
>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>  }
>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>      struct dpdk_mp *dmp = NULL;
>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>      unsigned mp_size;
>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>
>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>      dmp->socket_id = socket_id;
>>      dmp->mtu = mtu;
>>      dmp->refcount = 1;
>> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;
>> +    mbp_priv.mbuf_priv_size = 0;
>>
>>      mp_size = MAX_NB_MBUF;
>>      do {
>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
>>              return NULL;
>>          }
>>
>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>> +        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),
>>                                       MP_CACHE_SZ,
>>                                       sizeof(struct rte_pktmbuf_pool_private),
>> -                                     rte_pktmbuf_pool_init, NULL,
>> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>                                       ovs_rte_pktmbuf_init, NULL,
>>                                       socket_id, 0);
>>      } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
>>  {
>>      int diag = 0;
>>      int i;
>> +    struct rte_eth_conf conf = port_conf;
>>
>>      /* A device may report more queues than it makes available (this has
>>       * been observed for Intel xl710, which reserves some of them for
>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int
>n_txq)
>>              VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>          }
>>
>> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +        }
>> +
>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>          if (diag) {
>>              break;
>>          }
>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>      int sid;
>>      int err = 0;
>> +    uint32_t buf_size;
>>
>>      ovs_mutex_init(&netdev->mutex);
>>      ovs_mutex_lock(&netdev->mutex);
>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
>>      netdev->port_id = port_no;
>>      netdev->type = type;
>>      netdev->flags = 0;
>> +
>> +    /* Initialize port's MTU and frame len to the default Ethernet values.
>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
>> +     * netdev_dpdk_set_config.
>> +     */
>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>      netdev->mtu = ETHER_MTU;
>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>
>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));
>>      if (!netdev->dpdk_mp) {
>>          err = ENOMEM;
>>          goto unlock;
>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],
>>      return 0;
>>  }
>>
>> +static void
>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>> +{
>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>> +    int local_mtu;
>> +
>> +    if(mtu_str) {
>> +        local_mtu = atoi(mtu_str);
>
>Please use strtol or strtoul here, and detect errors in the string by
>checking endptr. That way, we can be sure that random garbage on the
>stack won't accidentally become the mtu.

Will do, thanks.

>
>> +    }
>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>> +                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>> +        local_mtu = ETHER_MTU;
>> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",
>> +                   local_mtu);
>> +    }
>> +    *mtu = local_mtu;
>> +}
>> +
>>  static int
>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>  {
>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, struct smap
>*args)
>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>      smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>      ovs_mutex_unlock(&dev->mutex);
>>
>>      return 0;
>>  }
>>
>> +/* Set the mtu of DPDK_DEV_ETH ports */
>> +static int
>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    int old_mtu, err;
>> +    uint32_t buf_size;
>> +    int dpdk_mtu;
>> +    struct dpdk_mp *old_mp;
>> +    struct dpdk_mp *mp;
>> +
>> +    ovs_mutex_lock(&dpdk_mutex);
>> +    ovs_mutex_lock(&dev->mutex);
>> +    if (dev->mtu == mtu) {
>> +        err = 0;
>> +        goto out;
>> +    }
>> +
>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>> +
>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>> +    if (!mp) {
>> +        err = ENOMEM;
>> +        goto out;
>> +    }
>> +
>> +    rte_eth_dev_stop(dev->port_id);
>> +
>> +    old_mtu = dev->mtu;
>> +    old_mp = dev->dpdk_mp;
>> +    dev->dpdk_mp = mp;
>> +    dev->mtu = mtu;
>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +
>> +    err = dpdk_eth_dev_init(dev);
>
>Why call the dpdk_eth_dev_init here? Since you are directly calling
>rte_eth_dev_X functions here - wouldn't it make sense to use the
>rte_eth_dev_start() instead?

This is legacy code, but I can refactor if needs be.

>
>> +    if (err) {
>> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "
>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>> +        dpdk_mp_put(mp);
>> +        dev->mtu = old_mtu;
>> +        dev->dpdk_mp = old_mp;
>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +        err = dpdk_eth_dev_init(dev);
>
>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>will cause a queue reconfiguration action, which is not needed, and can
>error.
>

As above.

>If you are intending to set MTU, I would have expected calls to
>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>recognize that much of this is legacy, but since you are touching it :)

I actually had a number of issues with rte_eth_dev_set_mtu early on in the patch; I can take a second look now that the code is more stable though.  

>
>> +        goto out;
>> +    }
>> +
>> +    dpdk_mp_put(old_mp);
>> +    netdev_change_seq_changed(netdev);
>> +out:
>> +    ovs_mutex_unlock(&dev->mutex);
>> +    ovs_mutex_unlock(&dpdk_mutex);
>> +    return err;
>> +}
>> +
>> +static int
>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>> +{
>> +    int mtu;
>> +
>> +    dpdk_dev_parse_mtu(args, &mtu);
>> +
>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>> +}
>> +
>>  static int
>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>  {
>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
>>
>>      return 0;
>>  }
>> -
>> -static int
>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>> -{
>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> -    int old_mtu, err;
>> -    struct dpdk_mp *old_mp;
>> -    struct dpdk_mp *mp;
>> -
>> -    ovs_mutex_lock(&dpdk_mutex);
>> -    ovs_mutex_lock(&dev->mutex);
>> -    if (dev->mtu == mtu) {
>> -        err = 0;
>> -        goto out;
>> -    }
>> -
>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>> -    if (!mp) {
>> -        err = ENOMEM;
>> -        goto out;
>> -    }
>> -
>> -    rte_eth_dev_stop(dev->port_id);
>> -
>> -    old_mtu = dev->mtu;
>> -    old_mp = dev->dpdk_mp;
>> -    dev->dpdk_mp = mp;
>> -    dev->mtu = mtu;
>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>> -
>> -    err = dpdk_eth_dev_init(dev);
>> -    if (err) {
>> -        dpdk_mp_put(mp);
>> -        dev->mtu = old_mtu;
>> -        dev->dpdk_mp = old_mp;
>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>> -        dpdk_eth_dev_init(dev);
>> -        goto out;
>> -    }
>> -
>> -    dpdk_mp_put(old_mp);
>> -    netdev_change_seq_changed(netdev);
>> -out:
>> -    ovs_mutex_unlock(&dev->mutex);
>> -    ovs_mutex_unlock(&dpdk_mutex);
>> -    return err;
>> -}
>> -
>>  static int
>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>
>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, struct smap
>*args)
>>      smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>      smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
>> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>      smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>>      smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>>      smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>      return 0;
>>  }
>>
>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>> +static int
>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>> +{
>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>> +    int err = 0;
>> +    struct dpdk_mp *old_mp;
>> +    struct dpdk_mp *mp;
>> +
>> +    ovs_mutex_lock(&dpdk_mutex);
>> +    ovs_mutex_lock(&dev->mutex);
>> +    if (dev->mtu == mtu) {
>> +        err = 0;
>> +        goto out;
>> +    }
>> +
>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>> +    if (!mp) {
>> +        err = ENOMEM;
>> +        goto out;
>> +    }
>> +
>> +    old_mp = dev->dpdk_mp;
>> +    dev->dpdk_mp = mp;
>> +    dev->mtu = mtu;
>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>> +
>> +    dpdk_mp_put(old_mp);
>> +    netdev_change_seq_changed(netdev);
>> +out:
>> +    ovs_mutex_unlock(&dev->mutex);
>> +    ovs_mutex_unlock(&dpdk_mutex);
>> +    return err;
>> +}
>> +
>> +static int
>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)
>> +{
>> +    int mtu;
>> +
>> +    dpdk_dev_parse_mtu(args, &mtu);
>> +
>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>> +}
>> +
>>  static void
>>  dpdk_common_init(void)
>>  {
>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>      return err;
>>  }
>>
>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
>> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>> +        GET_STATUS, RXQ_RECV)                                          \
>>  {                                                             \
>>      NAME,                                                     \
>>      INIT,                       /* init */                    \
>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>      DESTRUCT,                                                 \
>>      netdev_dpdk_dealloc,                                      \
>>      netdev_dpdk_get_config,                                   \
>> -    NULL,                       /* netdev_dpdk_set_config */  \
>> +    SET_CONFIG,                                               \
>>      NULL,                       /* get_tunnel_config */       \
>>      NULL,                       /* build header */            \
>>      NULL,                       /* push header */             \
>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>      netdev_dpdk_set_etheraddr,                                \
>>      netdev_dpdk_get_etheraddr,                                \
>>      netdev_dpdk_get_mtu,                                      \
>> -    netdev_dpdk_set_mtu,                                      \
>> +    SET_MTU,                                                  \
>>      netdev_dpdk_get_ifindex,                                  \
>>      GET_CARRIER,                                              \
>>      netdev_dpdk_get_carrier_resets,                           \
>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>          NULL,
>>          netdev_dpdk_construct,
>>          netdev_dpdk_destruct,
>> +        netdev_dpdk_set_config,
>>          netdev_dpdk_set_multiq,
>>          netdev_dpdk_eth_send,
>> +        netdev_dpdk_set_mtu,
>>          netdev_dpdk_get_carrier,
>>          netdev_dpdk_get_stats,
>>          netdev_dpdk_get_features,
>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>>          NULL,
>>          netdev_dpdk_ring_construct,
>>          netdev_dpdk_destruct,
>> +        netdev_dpdk_set_config,
>>          netdev_dpdk_set_multiq,
>>          netdev_dpdk_ring_send,
>> +        netdev_dpdk_set_mtu,
>>          netdev_dpdk_get_carrier,
>>          netdev_dpdk_get_stats,
>>          netdev_dpdk_get_features,
>> @@ -2241,8 +2421,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =
>>          dpdk_vhost_cuse_class_init,
>>          netdev_dpdk_vhost_cuse_construct,
>>          netdev_dpdk_vhost_destruct,
>> +        NULL,
>>          netdev_dpdk_vhost_set_multiq,
>>          netdev_dpdk_vhost_send,
>> +        NULL,
>>          netdev_dpdk_vhost_get_carrier,
>>          netdev_dpdk_vhost_get_stats,
>>          NULL,
>> @@ -2255,8 +2437,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =
>>          dpdk_vhost_user_class_init,
>>          netdev_dpdk_vhost_user_construct,
>>          netdev_dpdk_vhost_destruct,
>> +        netdev_dpdk_vhost_set_config,
>>          netdev_dpdk_vhost_set_multiq,
>>          netdev_dpdk_vhost_send,
>> +        netdev_dpdk_vhost_set_mtu,
>>          netdev_dpdk_vhost_get_carrier,
>>          netdev_dpdk_vhost_get_stats,
>>          NULL,
Mark Kavanagh Jan. 28, 2016, 5:22 p.m. UTC | #3
One addendum below, Aaron.

>

>>Hi Mark,

>>

>>Thanks for the patch! I've not done a very thorough review of this, but

>>my first-blush comments are inline.

>

>Thanks for the review Aaron!

>

>Any additional comments are welcome - my initial responses are inline, below.

>

>Thanks,

>Mark

>

>>

>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:

>>> Add support for Jumbo Frames to DPDK-enabled port types,

>>> using single-segment-mbufs.

>>>

>>> Using this approach, the amount of memory allocated for each mbuf

>>> to store frame data is increased to a value greater than 1518B

>>> (typical Ethernet maximum frame length). The increased space

>>> available in the mbuf means that an entire Jumbo Frame can be carried

>>> in a single mbuf, as opposed to partitioning it across multiple mbuf

>>> segments.

>>>

>>> The amount of space allocated to each mbuf to hold frame data is

>>> defined dynamically by the user when adding a DPDK port to a bridge.

>>> If an MTU value is not supplied, or the user-supplied value is invalid,

>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).

>>>

>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>

>>> ---

>>

>>This is a new feature which has user-visible impact, so should be

>>announced in NEWS (plus, it's great news).

>

>No problem - I'll add an entry.

>

>>

>>>  INSTALL.DPDK.md   |  59 ++++++++-

>>>  lib/netdev-dpdk.c | 356 +++++++++++++++++++++++++++++++++++++++++-------------

>>>  2 files changed, 328 insertions(+), 87 deletions(-)

>>>

>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md

>>> index 96b686c..2f23e27 100644

>>> --- a/INSTALL.DPDK.md

>>> +++ b/INSTALL.DPDK.md

>>> @@ -859,10 +859,61 @@ by adding the following string:

>>>  to <interface> sections of all network devices used by DPDK. Parameter 'N'

>>>  determines how many queues can be used by the guest.

>>>

>>> +

>>> +Jumbo Frames

>>> +------------

>>> +

>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.

>>> +

>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the ovs-vsctl

>>> +'add-port' command-line, along with the required MTU for the port.

>>> +e.g.

>>

>>This text and the example do not match. I think '--dpdk-mtu' is not valid.

>

>Good catch, thanks.

>

>>

>>> +

>>> +     ```

>>> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000

>>> +     ```

>>> +

>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are

>>> +increased, such that a full Jumbo Frame may be accommodated inside a single

>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.

>>

>>Why no support? DPDK supports changing the mtu.

>

>I guess my rationale here is that an MTU change can't be triggered via OVS command-line, nor

>can it be triggered programmatically via DPDK (apart from an explicit call to

>rte_eth_dev_set_mtu).

>So, while technically it's possibly, from a user's point of view, there's no way to configure

>it, outside of modifying the code directly. If I've missed something here, please let me

>know.

>

>>

>>> +Jumbo frame support has been validated against 13312B frames, using the

>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may

>>> +theoretically be supported. Supported port types excludes vHost-Cuse ports, as

>>> +this feature is pending deprecation.

>>> +

>>> +

>>> +vHost Ports and Jumbo Frames

>>> +----------------------------

>>> +Jumbo frame support is available for DPDK vHost-User ports only. Some additional

>>> +configuration is needed to take advantage of this feature:

>>> +

>>> +  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in

>>> +      the QEMU command line snippet below:

>>> +

>>> +      ```

>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'

>>> +      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'

>>> +      ```

>>> +

>>> +  2. Where virtio devices are bound to the Linux kernel driver in a guest

>>> +     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the

>>> +     MTU of those logical network interfaces must also be increased. This

>>> +     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers

>>> +     to the length of the IP packet only, and not that of the entire frame.

>>> +

>>> +     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2

>>> +     header and CRC lengths (i.e. 18B) from the max supported frame size.

>>> +     So, to set the MTU for a 13312B Jumbo Frame:

>>> +

>>> +      ```

>>> +      ifconfig eth1 mtu 13294

>>> +      ```

>>> +

>>> +

>>>  Restrictions:

>>>  -------------

>>>

>>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.

>>>    - Currently DPDK port does not make use any offload functionality.

>>>    - DPDK-vHost support works with 1G huge pages.

>>>

>>> @@ -903,6 +954,12 @@ Restrictions:

>>>      the next release of DPDK (which includes the above patch) is available and

>>>      integrated into OVS.

>>>

>>> +  Jumbo Frames:

>>> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of

>>> +     this issue is currently being investigated.

>>> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.

>>> +

>>> +

>>>  Bug Reporting:

>>>  --------------

>>>

>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c

>>> index de7e488..5f23e28 100644

>>> --- a/lib/netdev-dpdk.c

>>> +++ b/lib/netdev-dpdk.c

>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);

>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE

>>>  #define OVS_VPORT_DPDK "ovs_dpdk"

>>>

>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1

>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024

>>> +

>>>  /*

>>>   * need to reserve tons of extra space in the mbufs so we can align the

>>>   * DMA addresses to 4KB.

>>>   * The minimum mbuf size is limited to avoid scatter behaviour and drop in

>>>   * performance for standard Ethernet MTU.

>>>   */

>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)

>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \

>>> -                              + sizeof(struct dp_packet) \

>>> -                              + RTE_PKTMBUF_HEADROOM)

>>> -#define MBUF_SIZE_DRIVER     (2048                       \

>>> -                              + sizeof (struct rte_mbuf) \

>>> -                              + RTE_PKTMBUF_HEADROOM)

>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)

>>> +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)

>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)

>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \

>>> +                                    + sizeof(struct dp_packet)   \

>>> +                                    + RTE_PKTMBUF_HEADROOM)

>>> +

>>> +/* This value should be specified as a multiple of the DPDK NIC driver's

>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').

>>> + */

>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312

>>>

>>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a

>>>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have

>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);

>>>  #define MIN_NB_MBUF          (4096 * 4)

>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE

>>>

>>> +#define DPDK_VLAN_TAG_LEN    4

>>> +

>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */

>>>  BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);

>>>

>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {

>>>          .header_split   = 0, /* Header Split disabled */

>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */

>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */

>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */

>>>          .hw_strip_crc   = 0,

>>>      },

>>>      .rx_adv_conf = {

>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)

>>>      return class->construct == netdev_dpdk_construct;

>>>  }

>>>

>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity

>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).

>>> + * If 'frame_len' is not a multiple of this value, insufficient buffers are

>>> + * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio

>>> + * driver needs to ensure that there is also sufficient space in the Rx buffer

>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too

>>> + * small, then the driver enables scatter RX behaviour, which reduces

>>> + * performance. To prevent this, use a buffer size that is closest to

>>> + * 'frame_len', but which satisfies the aforementioned criteria.

>>> + */

>>> +static uint32_t

>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)

>>> +{

>>> +    struct rte_eth_dev_info info;

>>> +    uint32_t buf_size;

>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);

>>> +

>>> +    if(netdev->type == DPDK_DEV_ETH) {

>>> +        rte_eth_dev_info_get(netdev->port_id, &info);

>>> +        buf_size = (info.min_rx_bufsize == 0) ?

>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :

>>> +                   info.min_rx_bufsize;

>>

>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?

>

>I'm not sure if I follow  - the MTU isn't the issue here, but rather the buffer size used

>when creating the rte_mempool, which must be a multiple of the driver's min_rx_bufsize.

>

>>

>>> +    } else {

>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;

>>> +    }

>>> +

>>> +    if(len % buf_size) {

>>> +        len = buf_size * ((len/buf_size) + 1);

>>> +    }

>>> +

>>> +    return len;

>>> +}

>>> +

>>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used

>>>   * for all other segments data, bss and text. */

>>>

>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)

>>>  }

>>>

>>>  static void

>>> -__rte_pktmbuf_init(struct rte_mempool *mp,

>>> -                   void *opaque_arg OVS_UNUSED,

>>> -                   void *_m,

>>> -                   unsigned i OVS_UNUSED)

>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)

>>>  {

>>> -    struct rte_mbuf *m = _m;

>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);

>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;

>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;

>>> +    uint16_t roomsz;

>>>

>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));

>>>

>>> -    memset(m, 0, mp->elt_size);

>>> +    /* if no structure is provided, assume no mbuf private area */

>>> +

>>> +    user_mbp_priv = opaque_arg;

>>> +    if (user_mbp_priv == NULL) {

>>> +        default_mbp_priv.mbuf_priv_size = 0;

>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {

>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);

>>> +        } else {

>>> +            roomsz = 0;

>>> +        }

>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;

>>> +        user_mbp_priv = &default_mbp_priv;

>>> +    }

>>> +

>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +

>>> +        user_mbp_priv->mbuf_data_room_size +

>>> +        user_mbp_priv->mbuf_priv_size);

>>> +

>>> +    mbp_priv = rte_mempool_get_priv(mp);

>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));

>>> +}

>>> +

>>> +/* Initialise some fields in the mbuf structure that are not modified by the

>>> + * user once created (origin pool, buffer start address, etc.*/

>>> +static void

>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,

>>> +                       void *opaque_arg OVS_UNUSED,

>>> +                       void *_m,

>>> +                       unsigned i OVS_UNUSED)

>>> +{

>>> +	struct rte_mbuf *m = _m;

>>> +	uint32_t buf_size, buf_len, priv_size;

>>> +

>>> +	priv_size = rte_pktmbuf_priv_size(mp);

>>> +	buf_size = sizeof(struct dp_packet) + priv_size;

>>> +	buf_len = rte_pktmbuf_data_room_size(mp);

>>>

>>> -    /* start of buffer is just after mbuf structure */

>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);

>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +

>>> -                    sizeof(struct dp_packet);

>>> -    m->buf_len = (uint16_t)buf_len;

>>> +	RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);

>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);

>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);

>>>

>>> -    /* keep some headroom between start of buffer and data */

>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);

>>> +	memset(m, 0, mp->elt_size);

>>>

>>> -    /* init some constant fields */

>>> -    m->pool = mp;

>>> -    m->nb_segs = 1;

>>> -    m->port = 0xff;

>>> +	/* start of buffer is after dp_packet structure and priv data */

>>> +	m->priv_size = priv_size;

>>> +	m->buf_addr = (char *)m + buf_size;

>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;

>>> +	m->buf_len = (uint16_t)buf_len;

>>> +

>>> +	/* keep some headroom between start of buffer and data */

>>> +	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);

>>> +

>>> +	/* init some constant fields */

>>> +	m->pool = mp;

>>> +	m->nb_segs = 1;

>>> +	m->port = 0xff;

>>

>>Please don't mix tabs and spaces in the file.

>

>Apologies - this was a copy/paste from a DPDK file, which uses tabs.

>

>>

>>>  }

>>>

>>>  static void

>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,

>>>  {

>>>      struct rte_mbuf *m = _m;

>>>

>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);

>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);

>>>

>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);

>>>  }

>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>>      struct dpdk_mp *dmp = NULL;

>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];

>>>      unsigned mp_size;

>>> +    struct rte_pktmbuf_pool_private mbp_priv;

>>>

>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {

>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {

>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>>      dmp->socket_id = socket_id;

>>>      dmp->mtu = mtu;

>>>      dmp->refcount = 1;

>>> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;

>>> +    mbp_priv.mbuf_priv_size = 0;

>>>

>>>      mp_size = MAX_NB_MBUF;

>>>      do {

>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>>              return NULL;

>>>          }

>>>

>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),

>>> +        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),

>>>                                       MP_CACHE_SZ,

>>>                                       sizeof(struct rte_pktmbuf_pool_private),

>>> -                                     rte_pktmbuf_pool_init, NULL,

>>> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,

>>>                                       ovs_rte_pktmbuf_init, NULL,

>>>                                       socket_id, 0);

>>>      } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);

>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int

>n_txq)

>>>  {

>>>      int diag = 0;

>>>      int i;

>>> +    struct rte_eth_conf conf = port_conf;

>>>

>>>      /* A device may report more queues than it makes available (this has

>>>       * been observed for Intel xl710, which reserves some of them for

>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int

>>n_txq)

>>>              VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);

>>>          }

>>>

>>> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);

>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {

>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;

>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);

>>> +        }

>>> +

>>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);

>>>          if (diag) {

>>>              break;

>>>          }

>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,

>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);

>>>      int sid;

>>>      int err = 0;

>>> +    uint32_t buf_size;

>>>

>>>      ovs_mutex_init(&netdev->mutex);

>>>      ovs_mutex_lock(&netdev->mutex);

>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,

>>>      netdev->port_id = port_no;

>>>      netdev->type = type;

>>>      netdev->flags = 0;

>>> +

>>> +    /* Initialize port's MTU and frame len to the default Ethernet values.

>>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in

>>> +     * netdev_dpdk_set_config.

>>> +     */

>>> +    netdev->max_packet_len = ETHER_MAX_LEN;

>>>      netdev->mtu = ETHER_MTU;

>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);

>>>

>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);

>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);

>>> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));

>>>      if (!netdev->dpdk_mp) {

>>>          err = ENOMEM;

>>>          goto unlock;

>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],

>>>      return 0;

>>>  }

>>>

>>> +static void

>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)

>>> +{

>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");

>>> +    int local_mtu;

>>> +

>>> +    if(mtu_str) {

>>> +        local_mtu = atoi(mtu_str);

>>

>>Please use strtol or strtoul here, and detect errors in the string by

>>checking endptr. That way, we can be sure that random garbage on the

>>stack won't accidentally become the mtu.

>

>Will do, thanks.

>

>>

>>> +    }

>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||

>>> +                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {

>>> +        local_mtu = ETHER_MTU;

>>> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",

>>> +                   local_mtu);

>>> +    }

>>> +    *mtu = local_mtu;

>>> +}

>>> +

>>>  static int

>>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)

>>>  {

>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, struct smap

>>*args)

>>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);

>>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);

>>>      smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);

>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);

>>>      ovs_mutex_unlock(&dev->mutex);

>>>

>>>      return 0;

>>>  }

>>>

>>> +/* Set the mtu of DPDK_DEV_ETH ports */

>>> +static int

>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)

>>> +{

>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>>> +    int old_mtu, err;

>>> +    uint32_t buf_size;

>>> +    int dpdk_mtu;

>>> +    struct dpdk_mp *old_mp;

>>> +    struct dpdk_mp *mp;

>>> +

>>> +    ovs_mutex_lock(&dpdk_mutex);

>>> +    ovs_mutex_lock(&dev->mutex);

>>> +    if (dev->mtu == mtu) {

>>> +        err = 0;

>>> +        goto out;

>>> +    }

>>> +

>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));

>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);

>>> +

>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);

>>> +    if (!mp) {

>>> +        err = ENOMEM;

>>> +        goto out;

>>> +    }

>>> +

>>> +    rte_eth_dev_stop(dev->port_id);

>>> +

>>> +    old_mtu = dev->mtu;

>>> +    old_mp = dev->dpdk_mp;

>>> +    dev->dpdk_mp = mp;

>>> +    dev->mtu = mtu;

>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>>> +

>>> +    err = dpdk_eth_dev_init(dev);

>>

>>Why call the dpdk_eth_dev_init here? Since you are directly calling

>>rte_eth_dev_X functions here - wouldn't it make sense to use the

>>rte_eth_dev_start() instead?

>

>This is legacy code, but I can refactor if needs be.

>

>>

>>> +    if (err) {

>>> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "

>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);

>>> +        dpdk_mp_put(mp);

>>> +        dev->mtu = old_mtu;

>>> +        dev->dpdk_mp = old_mp;

>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>>> +        err = dpdk_eth_dev_init(dev);

>>

>>What happens if this returns a failure? By using dpdk_eth_dev_init, you

>>will cause a queue reconfiguration action, which is not needed, and can

>>error.

>>

>

>As above.

>

>>If you are intending to set MTU, I would have expected calls to

>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I

>>recognize that much of this is legacy, but since you are touching it :)

>

>I actually had a number of issues with rte_eth_dev_set_mtu early on in the patch; I can take

>a second look now that the code is more stable though.


I remembered why I didn't use this. 

If you look at the IXGBE driver's mtu_set function, it refuses frame lengths which are larger than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
This might be avoided by configuring the device for scatter RX behavior, but this incurs a performance impact.

<snippet>
ixgbe_dev_mtu_set(...) {

...

    /* refuse mtu that requires the support of scattered packets when this
     * feature has not been enabled before. */
    if (!dev->data->scattered_rx &&
        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
        return -EINVAL;

...
}
</snippet>


>

>>

>>> +        goto out;

>>> +    }

>>> +

>>> +    dpdk_mp_put(old_mp);

>>> +    netdev_change_seq_changed(netdev);

>>> +out:

>>> +    ovs_mutex_unlock(&dev->mutex);

>>> +    ovs_mutex_unlock(&dpdk_mutex);

>>> +    return err;

>>> +}

>>> +

>>> +static int

>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)

>>> +{

>>> +    int mtu;

>>> +

>>> +    dpdk_dev_parse_mtu(args, &mtu);

>>> +

>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);

>>> +}

>>> +

>>>  static int

>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)

>>>  {

>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)

>>>

>>>      return 0;

>>>  }

>>> -

>>> -static int

>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)

>>> -{

>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>>> -    int old_mtu, err;

>>> -    struct dpdk_mp *old_mp;

>>> -    struct dpdk_mp *mp;

>>> -

>>> -    ovs_mutex_lock(&dpdk_mutex);

>>> -    ovs_mutex_lock(&dev->mutex);

>>> -    if (dev->mtu == mtu) {

>>> -        err = 0;

>>> -        goto out;

>>> -    }

>>> -

>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);

>>> -    if (!mp) {

>>> -        err = ENOMEM;

>>> -        goto out;

>>> -    }

>>> -

>>> -    rte_eth_dev_stop(dev->port_id);

>>> -

>>> -    old_mtu = dev->mtu;

>>> -    old_mp = dev->dpdk_mp;

>>> -    dev->dpdk_mp = mp;

>>> -    dev->mtu = mtu;

>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);

>>> -

>>> -    err = dpdk_eth_dev_init(dev);

>>> -    if (err) {

>>> -        dpdk_mp_put(mp);

>>> -        dev->mtu = old_mtu;

>>> -        dev->dpdk_mp = old_mp;

>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);

>>> -        dpdk_eth_dev_init(dev);

>>> -        goto out;

>>> -    }

>>> -

>>> -    dpdk_mp_put(old_mp);

>>> -    netdev_change_seq_changed(netdev);

>>> -out:

>>> -    ovs_mutex_unlock(&dev->mutex);

>>> -    ovs_mutex_unlock(&dpdk_mutex);

>>> -    return err;

>>> -}

>>> -

>>>  static int

>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);

>>>

>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, struct smap

>>*args)

>>>      smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));

>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);

>>>      smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);

>>> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);

>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);

>>>      smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);

>>>      smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);

>>>      smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);

>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)

>>>      return 0;

>>>  }

>>>

>>> +/* Set the mtu of DPDK_DEV_VHOST ports */

>>> +static int

>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)

>>> +{

>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>>> +    int err = 0;

>>> +    struct dpdk_mp *old_mp;

>>> +    struct dpdk_mp *mp;

>>> +

>>> +    ovs_mutex_lock(&dpdk_mutex);

>>> +    ovs_mutex_lock(&dev->mutex);

>>> +    if (dev->mtu == mtu) {

>>> +        err = 0;

>>> +        goto out;

>>> +    }

>>> +

>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);

>>> +    if (!mp) {

>>> +        err = ENOMEM;

>>> +        goto out;

>>> +    }

>>> +

>>> +    old_mp = dev->dpdk_mp;

>>> +    dev->dpdk_mp = mp;

>>> +    dev->mtu = mtu;

>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>>> +

>>> +    dpdk_mp_put(old_mp);

>>> +    netdev_change_seq_changed(netdev);

>>> +out:

>>> +    ovs_mutex_unlock(&dev->mutex);

>>> +    ovs_mutex_unlock(&dpdk_mutex);

>>> +    return err;

>>> +}

>>> +

>>> +static int

>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)

>>> +{

>>> +    int mtu;

>>> +

>>> +    dpdk_dev_parse_mtu(args, &mtu);

>>> +

>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);

>>> +}

>>> +

>>>  static void

>>>  dpdk_common_init(void)

>>>  {

>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:

>>>      return err;

>>>  }

>>>

>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \

>>> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \

>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \

>>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \

>>> +        GET_STATUS, RXQ_RECV)                                          \

>>>  {                                                             \

>>>      NAME,                                                     \

>>>      INIT,                       /* init */                    \

>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:

>>>      DESTRUCT,                                                 \

>>>      netdev_dpdk_dealloc,                                      \

>>>      netdev_dpdk_get_config,                                   \

>>> -    NULL,                       /* netdev_dpdk_set_config */  \

>>> +    SET_CONFIG,                                               \

>>>      NULL,                       /* get_tunnel_config */       \

>>>      NULL,                       /* build header */            \

>>>      NULL,                       /* push header */             \

>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:

>>>      netdev_dpdk_set_etheraddr,                                \

>>>      netdev_dpdk_get_etheraddr,                                \

>>>      netdev_dpdk_get_mtu,                                      \

>>> -    netdev_dpdk_set_mtu,                                      \

>>> +    SET_MTU,                                                  \

>>>      netdev_dpdk_get_ifindex,                                  \

>>>      GET_CARRIER,                                              \

>>>      netdev_dpdk_get_carrier_resets,                           \

>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =

>>>          NULL,

>>>          netdev_dpdk_construct,

>>>          netdev_dpdk_destruct,

>>> +        netdev_dpdk_set_config,

>>>          netdev_dpdk_set_multiq,

>>>          netdev_dpdk_eth_send,

>>> +        netdev_dpdk_set_mtu,

>>>          netdev_dpdk_get_carrier,

>>>          netdev_dpdk_get_stats,

>>>          netdev_dpdk_get_features,

>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =

>>>          NULL,

>>>          netdev_dpdk_ring_construct,

>>>          netdev_dpdk_destruct,

>>> +        netdev_dpdk_set_config,

>>>          netdev_dpdk_set_multiq,

>>>          netdev_dpdk_ring_send,

>>> +        netdev_dpdk_set_mtu,

>>>          netdev_dpdk_get_carrier,

>>>          netdev_dpdk_get_stats,

>>>          netdev_dpdk_get_features,

>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =

>>>          dpdk_vhost_cuse_class_init,

>>>          netdev_dpdk_vhost_cuse_construct,

>>>          netdev_dpdk_vhost_destruct,

>>> +        NULL,

>>>          netdev_dpdk_vhost_set_multiq,

>>>          netdev_dpdk_vhost_send,

>>> +        NULL,

>>>          netdev_dpdk_vhost_get_carrier,

>>>          netdev_dpdk_vhost_get_stats,

>>>          NULL,

>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =

>>>          dpdk_vhost_user_class_init,

>>>          netdev_dpdk_vhost_user_construct,

>>>          netdev_dpdk_vhost_destruct,

>>> +        netdev_dpdk_vhost_set_config,

>>>          netdev_dpdk_vhost_set_multiq,

>>>          netdev_dpdk_vhost_send,

>>> +        netdev_dpdk_vhost_set_mtu,

>>>          netdev_dpdk_vhost_get_carrier,

>>>          netdev_dpdk_vhost_get_stats,

>>>          NULL,

>_______________________________________________

>dev mailing list

>dev@openvswitch.org

>http://openvswitch.org/mailman/listinfo/dev
Mark Kavanagh Jan. 29, 2016, 11:56 a.m. UTC | #4
Further clarifications below Aaron - please let me know if you have any additional comments.

Thanks,
Mark

>

>One addendum below, Aaron.

>

>>

>>>Hi Mark,

>>>

>>>Thanks for the patch! I've not done a very thorough review of this, but

>>>my first-blush comments are inline.

>>

>>Thanks for the review Aaron!

>>

>>Any additional comments are welcome - my initial responses are inline, below.

>>

>>Thanks,

>>Mark

>>

>>>

>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:

>>>> Add support for Jumbo Frames to DPDK-enabled port types,

>>>> using single-segment-mbufs.

>>>>

>>>> Using this approach, the amount of memory allocated for each mbuf

>>>> to store frame data is increased to a value greater than 1518B

>>>> (typical Ethernet maximum frame length). The increased space

>>>> available in the mbuf means that an entire Jumbo Frame can be carried

>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf

>>>> segments.

>>>>

>>>> The amount of space allocated to each mbuf to hold frame data is

>>>> defined dynamically by the user when adding a DPDK port to a bridge.

>>>> If an MTU value is not supplied, or the user-supplied value is invalid,

>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).

>>>>

>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>

>>>> ---

>>>

>>>This is a new feature which has user-visible impact, so should be

>>>announced in NEWS (plus, it's great news).

>>

>>No problem - I'll add an entry.

>>

>>>

>>>>  INSTALL.DPDK.md   |  59 ++++++++-

>>>>  lib/netdev-dpdk.c | 356 +++++++++++++++++++++++++++++++++++++++++-------------

>>>>  2 files changed, 328 insertions(+), 87 deletions(-)

>>>>

>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md

>>>> index 96b686c..2f23e27 100644

>>>> --- a/INSTALL.DPDK.md

>>>> +++ b/INSTALL.DPDK.md

>>>> @@ -859,10 +859,61 @@ by adding the following string:

>>>>  to <interface> sections of all network devices used by DPDK. Parameter 'N'

>>>>  determines how many queues can be used by the guest.

>>>>

>>>> +

>>>> +Jumbo Frames

>>>> +------------

>>>> +

>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.

>>>> +

>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the ovs-vsctl

>>>> +'add-port' command-line, along with the required MTU for the port.

>>>> +e.g.

>>>

>>>This text and the example do not match. I think '--dpdk-mtu' is not valid.

>>

>>Good catch, thanks.

>>

>>>

>>>> +

>>>> +     ```

>>>> +     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000

>>>> +     ```

>>>> +

>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are

>>>> +increased, such that a full Jumbo Frame may be accommodated inside a single

>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.

>>>

>>>Why no support? DPDK supports changing the mtu.

>>

>>I guess my rationale here is that an MTU change can't be triggered via OVS command-line, nor

>>can it be triggered programmatically via DPDK (apart from an explicit call to

>>rte_eth_dev_set_mtu).

>>So, while technically it's possibly, from a user's point of view, there's no way to

>configure

>>it, outside of modifying the code directly. If I've missed something here, please let me

>>know.

>>

>>>

>>>> +Jumbo frame support has been validated against 13312B frames, using the

>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may

>>>> +theoretically be supported. Supported port types excludes vHost-Cuse ports, as

>>>> +this feature is pending deprecation.

>>>> +

>>>> +

>>>> +vHost Ports and Jumbo Frames

>>>> +----------------------------

>>>> +Jumbo frame support is available for DPDK vHost-User ports only. Some additional

>>>> +configuration is needed to take advantage of this feature:

>>>> +

>>>> +  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in

>>>> +      the QEMU command line snippet below:

>>>> +

>>>> +      ```

>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'

>>>> +      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'

>>>> +      ```

>>>> +

>>>> +  2. Where virtio devices are bound to the Linux kernel driver in a guest

>>>> +     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the

>>>> +     MTU of those logical network interfaces must also be increased. This

>>>> +     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers

>>>> +     to the length of the IP packet only, and not that of the entire frame.

>>>> +

>>>> +     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2

>>>> +     header and CRC lengths (i.e. 18B) from the max supported frame size.

>>>> +     So, to set the MTU for a 13312B Jumbo Frame:

>>>> +

>>>> +      ```

>>>> +      ifconfig eth1 mtu 13294

>>>> +      ```

>>>> +

>>>> +

>>>>  Restrictions:

>>>>  -------------

>>>>

>>>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.

>>>>    - Currently DPDK port does not make use any offload functionality.

>>>>    - DPDK-vHost support works with 1G huge pages.

>>>>

>>>> @@ -903,6 +954,12 @@ Restrictions:

>>>>      the next release of DPDK (which includes the above patch) is available and

>>>>      integrated into OVS.

>>>>

>>>> +  Jumbo Frames:

>>>> +  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of

>>>> +     this issue is currently being investigated.

>>>> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.

>>>> +

>>>> +

>>>>  Bug Reporting:

>>>>  --------------

>>>>

>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c

>>>> index de7e488..5f23e28 100644

>>>> --- a/lib/netdev-dpdk.c

>>>> +++ b/lib/netdev-dpdk.c

>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);

>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE

>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"

>>>>

>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1

>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024

>>>> +

>>>>  /*

>>>>   * need to reserve tons of extra space in the mbufs so we can align the

>>>>   * DMA addresses to 4KB.

>>>>   * The minimum mbuf size is limited to avoid scatter behaviour and drop in

>>>>   * performance for standard Ethernet MTU.

>>>>   */

>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)

>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \

>>>> -                              + sizeof(struct dp_packet) \

>>>> -                              + RTE_PKTMBUF_HEADROOM)

>>>> -#define MBUF_SIZE_DRIVER     (2048                       \

>>>> -                              + sizeof (struct rte_mbuf) \

>>>> -                              + RTE_PKTMBUF_HEADROOM)

>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)

>>>> +#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)

>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)

>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \

>>>> +                                    + sizeof(struct dp_packet)   \

>>>> +                                    + RTE_PKTMBUF_HEADROOM)

>>>> +

>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's

>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').

>>>> + */

>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312

>>>>

>>>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a

>>>>   * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have

>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);

>>>>  #define MIN_NB_MBUF          (4096 * 4)

>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE

>>>>

>>>> +#define DPDK_VLAN_TAG_LEN    4

>>>> +

>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */

>>>>  BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);

>>>>

>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {

>>>>          .header_split   = 0, /* Header Split disabled */

>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */

>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */

>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */

>>>>          .hw_strip_crc   = 0,

>>>>      },

>>>>      .rx_adv_conf = {

>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)

>>>>      return class->construct == netdev_dpdk_construct;

>>>>  }

>>>>

>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity

>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).

>>>> + * If 'frame_len' is not a multiple of this value, insufficient buffers are

>>>> + * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio

>>>> + * driver needs to ensure that there is also sufficient space in the Rx buffer

>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too

>>>> + * small, then the driver enables scatter RX behaviour, which reduces

>>>> + * performance. To prevent this, use a buffer size that is closest to

>>>> + * 'frame_len', but which satisfies the aforementioned criteria.

>>>> + */

>>>> +static uint32_t

>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)

>>>> +{

>>>> +    struct rte_eth_dev_info info;

>>>> +    uint32_t buf_size;

>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);

>>>> +

>>>> +    if(netdev->type == DPDK_DEV_ETH) {

>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);

>>>> +        buf_size = (info.min_rx_bufsize == 0) ?

>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :

>>>> +                   info.min_rx_bufsize;

>>>

>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?

>>

>>I'm not sure if I follow  - the MTU isn't the issue here, but rather the buffer size used

>>when creating the rte_mempool, which must be a multiple of the driver's min_rx_bufsize.

>>

>>>

>>>> +    } else {

>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;

>>>> +    }

>>>> +

>>>> +    if(len % buf_size) {

>>>> +        len = buf_size * ((len/buf_size) + 1);

>>>> +    }

>>>> +

>>>> +    return len;

>>>> +}

>>>> +

>>>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used

>>>>   * for all other segments data, bss and text. */

>>>>

>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)

>>>>  }

>>>>

>>>>  static void

>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,

>>>> -                   void *opaque_arg OVS_UNUSED,

>>>> -                   void *_m,

>>>> -                   unsigned i OVS_UNUSED)

>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)

>>>>  {

>>>> -    struct rte_mbuf *m = _m;

>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);

>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;

>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;

>>>> +    uint16_t roomsz;

>>>>

>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));

>>>>

>>>> -    memset(m, 0, mp->elt_size);

>>>> +    /* if no structure is provided, assume no mbuf private area */

>>>> +

>>>> +    user_mbp_priv = opaque_arg;

>>>> +    if (user_mbp_priv == NULL) {

>>>> +        default_mbp_priv.mbuf_priv_size = 0;

>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {

>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);

>>>> +        } else {

>>>> +            roomsz = 0;

>>>> +        }

>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;

>>>> +        user_mbp_priv = &default_mbp_priv;

>>>> +    }

>>>> +

>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +

>>>> +        user_mbp_priv->mbuf_data_room_size +

>>>> +        user_mbp_priv->mbuf_priv_size);

>>>> +

>>>> +    mbp_priv = rte_mempool_get_priv(mp);

>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));

>>>> +}

>>>> +

>>>> +/* Initialise some fields in the mbuf structure that are not modified by the

>>>> + * user once created (origin pool, buffer start address, etc.*/

>>>> +static void

>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,

>>>> +                       void *opaque_arg OVS_UNUSED,

>>>> +                       void *_m,

>>>> +                       unsigned i OVS_UNUSED)

>>>> +{

>>>> +	struct rte_mbuf *m = _m;

>>>> +	uint32_t buf_size, buf_len, priv_size;

>>>> +

>>>> +	priv_size = rte_pktmbuf_priv_size(mp);

>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;

>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);

>>>>

>>>> -    /* start of buffer is just after mbuf structure */

>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);

>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +

>>>> -                    sizeof(struct dp_packet);

>>>> -    m->buf_len = (uint16_t)buf_len;

>>>> +	RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);

>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);

>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);

>>>>

>>>> -    /* keep some headroom between start of buffer and data */

>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);

>>>> +	memset(m, 0, mp->elt_size);

>>>>

>>>> -    /* init some constant fields */

>>>> -    m->pool = mp;

>>>> -    m->nb_segs = 1;

>>>> -    m->port = 0xff;

>>>> +	/* start of buffer is after dp_packet structure and priv data */

>>>> +	m->priv_size = priv_size;

>>>> +	m->buf_addr = (char *)m + buf_size;

>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;

>>>> +	m->buf_len = (uint16_t)buf_len;

>>>> +

>>>> +	/* keep some headroom between start of buffer and data */

>>>> +	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);

>>>> +

>>>> +	/* init some constant fields */

>>>> +	m->pool = mp;

>>>> +	m->nb_segs = 1;

>>>> +	m->port = 0xff;

>>>

>>>Please don't mix tabs and spaces in the file.

>>

>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.

>>

>>>

>>>>  }

>>>>

>>>>  static void

>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,

>>>>  {

>>>>      struct rte_mbuf *m = _m;

>>>>

>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);

>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);

>>>>

>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);

>>>>  }

>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>>>      struct dpdk_mp *dmp = NULL;

>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];

>>>>      unsigned mp_size;

>>>> +    struct rte_pktmbuf_pool_private mbp_priv;

>>>>

>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {

>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {

>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>>>      dmp->socket_id = socket_id;

>>>>      dmp->mtu = mtu;

>>>>      dmp->refcount = 1;

>>>> +    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;

>>>> +    mbp_priv.mbuf_priv_size = 0;

>>>>

>>>>      mp_size = MAX_NB_MBUF;

>>>>      do {

>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)

>>>>              return NULL;

>>>>          }

>>>>

>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),

>>>> +        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),

>>>>                                       MP_CACHE_SZ,

>>>>                                       sizeof(struct rte_pktmbuf_pool_private),

>>>> -                                     rte_pktmbuf_pool_init, NULL,

>>>> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,

>>>>                                       ovs_rte_pktmbuf_init, NULL,

>>>>                                       socket_id, 0);

>>>>      } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);

>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int

>>n_txq)

>>>>  {

>>>>      int diag = 0;

>>>>      int i;

>>>> +    struct rte_eth_conf conf = port_conf;

>>>>

>>>>      /* A device may report more queues than it makes available (this has

>>>>       * been observed for Intel xl710, which reserves some of them for

>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int

>>>n_txq)

>>>>              VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);

>>>>          }

>>>>

>>>> -        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);

>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {

>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;

>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);

>>>> +        }

>>>> +

>>>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);

>>>>          if (diag) {

>>>>              break;

>>>>          }

>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,

>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);

>>>>      int sid;

>>>>      int err = 0;

>>>> +    uint32_t buf_size;

>>>>

>>>>      ovs_mutex_init(&netdev->mutex);

>>>>      ovs_mutex_lock(&netdev->mutex);

>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,

>>>>      netdev->port_id = port_no;

>>>>      netdev->type = type;

>>>>      netdev->flags = 0;

>>>> +

>>>> +    /* Initialize port's MTU and frame len to the default Ethernet values.

>>>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in

>>>> +     * netdev_dpdk_set_config.

>>>> +     */

>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;

>>>>      netdev->mtu = ETHER_MTU;

>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);

>>>>

>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);

>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);

>>>> +    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));

>>>>      if (!netdev->dpdk_mp) {

>>>>          err = ENOMEM;

>>>>          goto unlock;

>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], const char prefix[],

>>>>      return 0;

>>>>  }

>>>>

>>>> +static void

>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)

>>>> +{

>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");

>>>> +    int local_mtu;

>>>> +

>>>> +    if(mtu_str) {

>>>> +        local_mtu = atoi(mtu_str);

>>>

>>>Please use strtol or strtoul here, and detect errors in the string by

>>>checking endptr. That way, we can be sure that random garbage on the

>>>stack won't accidentally become the mtu.

>>

>>Will do, thanks.

>>

>>>

>>>> +    }

>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||

>>>> +                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {

>>>> +        local_mtu = ETHER_MTU;

>>>> +        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",

>>>> +                   local_mtu);

>>>> +    }

>>>> +    *mtu = local_mtu;

>>>> +}

>>>> +

>>>>  static int

>>>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)

>>>>  {

>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev *netdev_, struct smap

>>>*args)

>>>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);

>>>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);

>>>>      smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);

>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);

>>>>      ovs_mutex_unlock(&dev->mutex);

>>>>

>>>>      return 0;

>>>>  }

>>>>

>>>> +/* Set the mtu of DPDK_DEV_ETH ports */

>>>> +static int

>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)

>>>> +{

>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>>>> +    int old_mtu, err;

>>>> +    uint32_t buf_size;

>>>> +    int dpdk_mtu;

>>>> +    struct dpdk_mp *old_mp;

>>>> +    struct dpdk_mp *mp;

>>>> +

>>>> +    ovs_mutex_lock(&dpdk_mutex);

>>>> +    ovs_mutex_lock(&dev->mutex);

>>>> +    if (dev->mtu == mtu) {

>>>> +        err = 0;

>>>> +        goto out;

>>>> +    }

>>>> +

>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));

>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);

>>>> +

>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);

>>>> +    if (!mp) {

>>>> +        err = ENOMEM;

>>>> +        goto out;

>>>> +    }

>>>> +

>>>> +    rte_eth_dev_stop(dev->port_id);

>>>> +

>>>> +    old_mtu = dev->mtu;

>>>> +    old_mp = dev->dpdk_mp;

>>>> +    dev->dpdk_mp = mp;

>>>> +    dev->mtu = mtu;

>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>>>> +

>>>> +    err = dpdk_eth_dev_init(dev);

>>>

>>>Why call the dpdk_eth_dev_init here? Since you are directly calling

>>>rte_eth_dev_X functions here - wouldn't it make sense to use the

>>>rte_eth_dev_start() instead?


The rx queues for the device need to be assigned the mempool that was allocated for the jumbo frames; since dpdk_eth_dev_init handles the queue setup, rte_eth_dev configuration, etc. it makes sense to call it here. 

>>

>>This is legacy code, but I can refactor if needs be.

>>

>>>

>>>> +    if (err) {

>>>> +        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "

>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);

>>>> +        dpdk_mp_put(mp);

>>>> +        dev->mtu = old_mtu;

>>>> +        dev->dpdk_mp = old_mp;

>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>>>> +        err = dpdk_eth_dev_init(dev);

>>>

>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you

>>>will cause a queue reconfiguration action, which is not needed, and can

>>>error.


This call is needed, in case the eth_dev's queue configuration was changed by the previous call.
The error handling for this call is arguably outside the scope of this patch, and could be submitted as a separate patch - what do you think?

>>>

>>

>>As above.

>>

>>>If you are intending to set MTU, I would have expected calls to

>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I

>>>recognize that much of this is legacy, but since you are touching it :)

>>

>>I actually had a number of issues with rte_eth_dev_set_mtu early on in the patch; I can take

>>a second look now that the code is more stable though.

>

>I remembered why I didn't use this.

>

>If you look at the IXGBE driver's mtu_set function, it refuses frame lengths which are larger

>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).

>This might be avoided by configuring the device for scatter RX behavior, but this incurs a

>performance impact.

>

><snippet>

>ixgbe_dev_mtu_set(...) {

>

>...

>

>    /* refuse mtu that requires the support of scattered packets when this

>     * feature has not been enabled before. */

>    if (!dev->data->scattered_rx &&

>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >

>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))

>        return -EINVAL;

>

>...

>}

></snippet>

>

>

>>

>>>

>>>> +        goto out;

>>>> +    }

>>>> +

>>>> +    dpdk_mp_put(old_mp);

>>>> +    netdev_change_seq_changed(netdev);

>>>> +out:

>>>> +    ovs_mutex_unlock(&dev->mutex);

>>>> +    ovs_mutex_unlock(&dpdk_mutex);

>>>> +    return err;

>>>> +}

>>>> +

>>>> +static int

>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)

>>>> +{

>>>> +    int mtu;

>>>> +

>>>> +    dpdk_dev_parse_mtu(args, &mtu);

>>>> +

>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);

>>>> +}

>>>> +

>>>>  static int

>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)

>>>>  {

>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)

>>>>

>>>>      return 0;

>>>>  }

>>>> -

>>>> -static int

>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)

>>>> -{

>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>>>> -    int old_mtu, err;

>>>> -    struct dpdk_mp *old_mp;

>>>> -    struct dpdk_mp *mp;

>>>> -

>>>> -    ovs_mutex_lock(&dpdk_mutex);

>>>> -    ovs_mutex_lock(&dev->mutex);

>>>> -    if (dev->mtu == mtu) {

>>>> -        err = 0;

>>>> -        goto out;

>>>> -    }

>>>> -

>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);

>>>> -    if (!mp) {

>>>> -        err = ENOMEM;

>>>> -        goto out;

>>>> -    }

>>>> -

>>>> -    rte_eth_dev_stop(dev->port_id);

>>>> -

>>>> -    old_mtu = dev->mtu;

>>>> -    old_mp = dev->dpdk_mp;

>>>> -    dev->dpdk_mp = mp;

>>>> -    dev->mtu = mtu;

>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);

>>>> -

>>>> -    err = dpdk_eth_dev_init(dev);

>>>> -    if (err) {

>>>> -        dpdk_mp_put(mp);

>>>> -        dev->mtu = old_mtu;

>>>> -        dev->dpdk_mp = old_mp;

>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);

>>>> -        dpdk_eth_dev_init(dev);

>>>> -        goto out;

>>>> -    }

>>>> -

>>>> -    dpdk_mp_put(old_mp);

>>>> -    netdev_change_seq_changed(netdev);

>>>> -out:

>>>> -    ovs_mutex_unlock(&dev->mutex);

>>>> -    ovs_mutex_unlock(&dpdk_mutex);

>>>> -    return err;

>>>> -}

>>>> -

>>>>  static int

>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);

>>>>

>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev *netdev_, struct smap

>>>*args)

>>>>      smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));

>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);

>>>>      smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);

>>>> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);

>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);

>>>>      smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);

>>>>      smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);

>>>>      smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);

>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)

>>>>      return 0;

>>>>  }

>>>>

>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */

>>>> +static int

>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)

>>>> +{

>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);

>>>> +    int err = 0;

>>>> +    struct dpdk_mp *old_mp;

>>>> +    struct dpdk_mp *mp;

>>>> +

>>>> +    ovs_mutex_lock(&dpdk_mutex);

>>>> +    ovs_mutex_lock(&dev->mutex);

>>>> +    if (dev->mtu == mtu) {

>>>> +        err = 0;

>>>> +        goto out;

>>>> +    }

>>>> +

>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);

>>>> +    if (!mp) {

>>>> +        err = ENOMEM;

>>>> +        goto out;

>>>> +    }

>>>> +

>>>> +    old_mp = dev->dpdk_mp;

>>>> +    dev->dpdk_mp = mp;

>>>> +    dev->mtu = mtu;

>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);

>>>> +

>>>> +    dpdk_mp_put(old_mp);

>>>> +    netdev_change_seq_changed(netdev);

>>>> +out:

>>>> +    ovs_mutex_unlock(&dev->mutex);

>>>> +    ovs_mutex_unlock(&dpdk_mutex);

>>>> +    return err;

>>>> +}

>>>> +

>>>> +static int

>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)

>>>> +{

>>>> +    int mtu;

>>>> +

>>>> +    dpdk_dev_parse_mtu(args, &mtu);

>>>> +

>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);

>>>> +}

>>>> +

>>>>  static void

>>>>  dpdk_common_init(void)

>>>>  {

>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:

>>>>      return err;

>>>>  }

>>>>

>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \

>>>> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \

>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \

>>>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \

>>>> +        GET_STATUS, RXQ_RECV)                                          \

>>>>  {                                                             \

>>>>      NAME,                                                     \

>>>>      INIT,                       /* init */                    \

>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:

>>>>      DESTRUCT,                                                 \

>>>>      netdev_dpdk_dealloc,                                      \

>>>>      netdev_dpdk_get_config,                                   \

>>>> -    NULL,                       /* netdev_dpdk_set_config */  \

>>>> +    SET_CONFIG,                                               \

>>>>      NULL,                       /* get_tunnel_config */       \

>>>>      NULL,                       /* build header */            \

>>>>      NULL,                       /* push header */             \

>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:

>>>>      netdev_dpdk_set_etheraddr,                                \

>>>>      netdev_dpdk_get_etheraddr,                                \

>>>>      netdev_dpdk_get_mtu,                                      \

>>>> -    netdev_dpdk_set_mtu,                                      \

>>>> +    SET_MTU,                                                  \

>>>>      netdev_dpdk_get_ifindex,                                  \

>>>>      GET_CARRIER,                                              \

>>>>      netdev_dpdk_get_carrier_resets,                           \

>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =

>>>>          NULL,

>>>>          netdev_dpdk_construct,

>>>>          netdev_dpdk_destruct,

>>>> +        netdev_dpdk_set_config,

>>>>          netdev_dpdk_set_multiq,

>>>>          netdev_dpdk_eth_send,

>>>> +        netdev_dpdk_set_mtu,

>>>>          netdev_dpdk_get_carrier,

>>>>          netdev_dpdk_get_stats,

>>>>          netdev_dpdk_get_features,

>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =

>>>>          NULL,

>>>>          netdev_dpdk_ring_construct,

>>>>          netdev_dpdk_destruct,

>>>> +        netdev_dpdk_set_config,

>>>>          netdev_dpdk_set_multiq,

>>>>          netdev_dpdk_ring_send,

>>>> +        netdev_dpdk_set_mtu,

>>>>          netdev_dpdk_get_carrier,

>>>>          netdev_dpdk_get_stats,

>>>>          netdev_dpdk_get_features,

>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class

>=

>>>>          dpdk_vhost_cuse_class_init,

>>>>          netdev_dpdk_vhost_cuse_construct,

>>>>          netdev_dpdk_vhost_destruct,

>>>> +        NULL,

>>>>          netdev_dpdk_vhost_set_multiq,

>>>>          netdev_dpdk_vhost_send,

>>>> +        NULL,

>>>>          netdev_dpdk_vhost_get_carrier,

>>>>          netdev_dpdk_vhost_get_stats,

>>>>          NULL,

>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class

>=

>>>>          dpdk_vhost_user_class_init,

>>>>          netdev_dpdk_vhost_user_construct,

>>>>          netdev_dpdk_vhost_destruct,

>>>> +        netdev_dpdk_vhost_set_config,

>>>>          netdev_dpdk_vhost_set_multiq,

>>>>          netdev_dpdk_vhost_send,

>>>> +        netdev_dpdk_vhost_set_mtu,

>>>>          netdev_dpdk_vhost_get_carrier,

>>>>          netdev_dpdk_vhost_get_stats,

>>>>          NULL,

>>_______________________________________________

>>dev mailing list

>>dev@openvswitch.org

>>http://openvswitch.org/mailman/listinfo/dev

>_______________________________________________

>dev mailing list

>dev@openvswitch.org

>http://openvswitch.org/mailman/listinfo/dev
Aaron Conole Jan. 29, 2016, 1:12 p.m. UTC | #5
"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
> Further clarifications below Aaron - please let me know if you have
> any additional comments.
>
> Thanks,
> Mark

Thanks again for this work, Mark!

>>
>>One addendum below, Aaron.
>>
>>>
>>>>Hi Mark,
>>>>
>>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>>my first-blush comments are inline.
>>>
>>>Thanks for the review Aaron!
>>>
>>>Any additional comments are welcome - my initial responses are inline, below.
>>>
>>>Thanks,
>>>Mark
>>>
>>>>
>>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>>> using single-segment-mbufs.
>>>>>
>>>>> Using this approach, the amount of memory allocated for each mbuf
>>>>> to store frame data is increased to a value greater than 1518B
>>>>> (typical Ethernet maximum frame length). The increased space
>>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>> segments.
>>>>>
>>>>> The amount of space allocated to each mbuf to hold frame data is
>>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>>
>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>> ---
>>>>
>>>>This is a new feature which has user-visible impact, so should be
>>>>announced in NEWS (plus, it's great news).
>>>
>>>No problem - I'll add an entry.
>>>
>>>>
>>>>>  INSTALL.DPDK.md   |  59 ++++++++-
>>>>> lib/netdev-dpdk.c | 356
> +++++++++++++++++++++++++++++++++++++++++-------------
>>>>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>>>>
>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>> index 96b686c..2f23e27 100644
>>>>> --- a/INSTALL.DPDK.md
>>>>> +++ b/INSTALL.DPDK.md
>>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>>>  to <interface> sections of all network devices used by DPDK. Parameter 'N'
>>>>>  determines how many queues can be used by the guest.
>>>>>
>>>>> +
>>>>> +Jumbo Frames
>>>>> +------------
>>>>> +
>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>>>>> +
>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to
> the ovs-vsctl
>>>>> +'add-port' command-line, along with the required MTU for the port.
>>>>> +e.g.
>>>>
>>>>This text and the example do not match. I think '--dpdk-mtu' is not valid.
>>>
>>>Good catch, thanks.
>>>
>>>>
>>>>> +
>>>>> +     ```
>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
> options:dpdk-mtu=9000
>>>>> +     ```
>>>>> +
>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
>>>>> +increased, such that a full Jumbo Frame may be accommodated
> inside a single
>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>>
>>>>Why no support? DPDK supports changing the mtu.
>>>
>>>I guess my rationale here is that an MTU change can't be triggered
> via OVS command-line, nor
>>>can it be triggered programmatically via DPDK (apart from an explicit call to
>>>rte_eth_dev_set_mtu).
>>>So, while technically it's possibly, from a user's point of view,
> there's no way to
>>configure
>>>it, outside of modifying the code directly. If I've missed something
> here, please let me
>>>know.
>>>
>>>>
>>>>> +Jumbo frame support has been validated against 13312B frames, using the
>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>>> +theoretically be supported. Supported port types excludes
> vHost-Cuse ports, as
>>>>> +this feature is pending deprecation.
>>>>> +
>>>>> +
>>>>> +vHost Ports and Jumbo Frames
>>>>> +----------------------------
>>>>> +Jumbo frame support is available for DPDK vHost-User ports
> only. Some additional
>>>>> +configuration is needed to take advantage of this feature:
>>>>> +
>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as
> demonstrated in
>>>>> +      the QEMU command line snippet below:
>>>>> +
>>>>> +      ```
>>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>> + '-device
> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>> +      ```
>>>>> +
>>>>> +  2. Where virtio devices are bound to the Linux kernel driver in a guest
>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK
> driver), the
>>>>> +     MTU of those logical network interfaces must also be increased. This
>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that
> 'MTU' refers
>>>>> + to the length of the IP packet only, and not that of the entire
> frame.
>>>>> +
>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame,
> subtract the L2
>>>>> +     header and CRC lengths (i.e. 18B) from the max supported frame size.
>>>>> +     So, to set the MTU for a 13312B Jumbo Frame:
>>>>> +
>>>>> +      ```
>>>>> +      ifconfig eth1 mtu 13294
>>>>> +      ```
>>>>> +
>>>>> +
>>>>>  Restrictions:
>>>>>  -------------
>>>>>
>>>>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>>>>    - Currently DPDK port does not make use any offload functionality.
>>>>>    - DPDK-vHost support works with 1G huge pages.
>>>>>
>>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>>> the next release of DPDK (which includes the above patch) is
> available and
>>>>>      integrated into OVS.
>>>>>
>>>>> +  Jumbo Frames:
>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit
> gracefully. The source of
>>>>> +     this issue is currently being investigated.
>>>>> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
>>>>> +
>>>>> +
>>>>>  Bug Reporting:
>>>>>  --------------
>>>>>
>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>> index de7e488..5f23e28 100644
>>>>> --- a/lib/netdev-dpdk.c
>>>>> +++ b/lib/netdev-dpdk.c
>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>>
>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>> +
>>>>>  /*
>>>>>   * need to reserve tons of extra space in the mbufs so we can align the
>>>>>   * DMA addresses to 4KB.
>>>>>   * The minimum mbuf size is limited to avoid scatter behaviour and drop in
>>>>>   * performance for standard Ethernet MTU.
>>>>>   */
>>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>> -                              + sizeof(struct dp_packet) \
>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>> -#define MBUF_SIZE_DRIVER     (2048                       \
>>>>> -                              + sizeof (struct rte_mbuf) \
>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
> ETHER_CRC_LEN)
>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN
> - ETHER_CRC_LEN)
>>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>> +                                    + sizeof(struct dp_packet)   \
>>>>> +                                    + RTE_PKTMBUF_HEADROOM)
>>>>> +
>>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's
>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>> + */
>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>>
>>>>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system
> doesn't have
>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>  #define MIN_NB_MBUF          (4096 * 4)
>>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>>
>>>>> +#define DPDK_VLAN_TAG_LEN    4
>>>>> +
>>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>>
>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>>          .header_split   = 0, /* Header Split disabled */
>>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>>          .hw_strip_crc   = 0,
>>>>>      },
>>>>>      .rx_adv_conf = {
>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>>>      return class->construct == netdev_dpdk_construct;
>>>>>  }
>>>>>
>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K
> for igb_uio).
>>>>> + * If 'frame_len' is not a multiple of this value, insufficient
> buffers are
>>>>> + * allocated to accomodate the packet in its
> entirety. Furthermore, the igb_uio
>>>>> + * driver needs to ensure that there is also sufficient space in
> the Rx buffer
>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
>>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>>> + */
>>>>> +static uint32_t
>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>> +{
>>>>> +    struct rte_eth_dev_info info;
>>>>> +    uint32_t buf_size;
>>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>> +
>>>>> +    if(netdev->type == DPDK_DEV_ETH) {
>>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>> +        buf_size = (info.min_rx_bufsize == 0) ?
>>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>> +                   info.min_rx_bufsize;
>>>>
>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>>
>>>I'm not sure if I follow - the MTU isn't the issue here, but rather
> the buffer size used
>>>when creating the rte_mempool, which must be a multiple of the
> driver's min_rx_bufsize.
>>>
>>>>
>>>>> +    } else {
>>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>> +    }
>>>>> +
>>>>> +    if(len % buf_size) {
>>>>> +        len = buf_size * ((len/buf_size) + 1);
>>>>> +    }
>>>>> +
>>>>> +    return len;
>>>>> +}
>>>>> +
>>>>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>>>>   * for all other segments data, bss and text. */
>>>>>
>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>>>  }
>>>>>
>>>>>  static void
>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>> -                   void *opaque_arg OVS_UNUSED,
>>>>> -                   void *_m,
>>>>> -                   unsigned i OVS_UNUSED)
>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>>>  {
>>>>> -    struct rte_mbuf *m = _m;
>>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>> +    uint16_t roomsz;
>>>>>
>>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>>
>>>>> -    memset(m, 0, mp->elt_size);
>>>>> +    /* if no structure is provided, assume no mbuf private area */
>>>>> +
>>>>> +    user_mbp_priv = opaque_arg;
>>>>> +    if (user_mbp_priv == NULL) {
>>>>> +        default_mbp_priv.mbuf_priv_size = 0;
>>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>> +        } else {
>>>>> +            roomsz = 0;
>>>>> +        }
>>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>> +        user_mbp_priv = &default_mbp_priv;
>>>>> +    }
>>>>> +
>>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>> +        user_mbp_priv->mbuf_data_room_size +
>>>>> +        user_mbp_priv->mbuf_priv_size);
>>>>> +
>>>>> +    mbp_priv = rte_mempool_get_priv(mp);
>>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>> +}
>>>>> +
>>>>> +/* Initialise some fields in the mbuf structure that are not
> modified by the
>>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>>> +static void
>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>> +                       void *opaque_arg OVS_UNUSED,
>>>>> +                       void *_m,
>>>>> +                       unsigned i OVS_UNUSED)
>>>>> +{
>>>>> +	struct rte_mbuf *m = _m;
>>>>> +	uint32_t buf_size, buf_len, priv_size;
>>>>> +
>>>>> +	priv_size = rte_pktmbuf_priv_size(mp);
>>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>>>>
>>>>> -    /* start of buffer is just after mbuf structure */
>>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>> -                    sizeof(struct dp_packet);
>>>>> -    m->buf_len = (uint16_t)buf_len;
>>>>> +	RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
>>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>>
>>>>> -    /* keep some headroom between start of buffer and data */
>>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>> +	memset(m, 0, mp->elt_size);
>>>>>
>>>>> -    /* init some constant fields */
>>>>> -    m->pool = mp;
>>>>> -    m->nb_segs = 1;
>>>>> -    m->port = 0xff;
>>>>> +	/* start of buffer is after dp_packet structure and priv data */
>>>>> +	m->priv_size = priv_size;
>>>>> +	m->buf_addr = (char *)m + buf_size;
>>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>> +	m->buf_len = (uint16_t)buf_len;
>>>>> +
>>>>> +	/* keep some headroom between start of buffer and data */
>>>>> +	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>>>>> +
>>>>> +	/* init some constant fields */
>>>>> +	m->pool = mp;
>>>>> +	m->nb_segs = 1;
>>>>> +	m->port = 0xff;
>>>>
>>>>Please don't mix tabs and spaces in the file.
>>>
>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>>
>>>>
>>>>>  }
>>>>>
>>>>>  static void
>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>  {
>>>>>      struct rte_mbuf *m = _m;
>>>>>
>>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>>
>>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>>>  }
>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex)
>>>>>      struct dpdk_mp *dmp = NULL;
>>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>>      unsigned mp_size;
>>>>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>>>>
>>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex)
>>>>>      dmp->socket_id = socket_id;
>>>>>      dmp->mtu = mtu;
>>>>>      dmp->refcount = 1;
>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
> RTE_PKTMBUF_HEADROOM;
>>>>> +    mbp_priv.mbuf_priv_size = 0;
>>>>>
>>>>>      mp_size = MAX_NB_MBUF;
>>>>>      do {
>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu)
> OVS_REQUIRES(dpdk_mutex)
>>>>>              return NULL;
>>>>>          }
>>>>>
>>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size,
> MBUF_SEGMENT_SIZE(mtu),
>>>>>                                       MP_CACHE_SZ,
>>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>>> -                                     rte_pktmbuf_pool_init, NULL,
>>>>> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>>>                                       ovs_rte_pktmbuf_init, NULL,
>>>>>                                       socket_id, 0);
>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
> MIN_NB_MBUF);
>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
> *dev, int n_rxq, int
>>>n_txq)
>>>>>  {
>>>>>      int diag = 0;
>>>>>      int i;
>>>>> +    struct rte_eth_conf conf = port_conf;
>>>>>
>>>>>      /* A device may report more queues than it makes available (this has
>>>>>       * been observed for Intel xl710, which reserves some of them for
>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
> *dev, int n_rxq, int
>>>>n_txq)
>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>>>          }
>>>>>
>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
> &port_conf);
>>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>> +        }
>>>>> +
>>>>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>>>>          if (diag) {
>>>>>              break;
>>>>>          }
>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_,
> unsigned int port_no,
>>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>>      int sid;
>>>>>      int err = 0;
>>>>> +    uint32_t buf_size;
>>>>>
>>>>>      ovs_mutex_init(&netdev->mutex);
>>>>>      ovs_mutex_lock(&netdev->mutex);
>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_,
> unsigned int port_no,
>>>>>      netdev->port_id = port_no;
>>>>>      netdev->type = type;
>>>>>      netdev->flags = 0;
>>>>> +
>>>>> +    /* Initialize port's MTU and frame len to the default Ethernet values.
>>>>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>>> +     * netdev_dpdk_set_config.
>>>>> +     */
>>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>>      netdev->mtu = ETHER_MTU;
>>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>>
>>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
> FRAME_LEN_TO_MTU(buf_size));
>>>>>      if (!netdev->dpdk_mp) {
>>>>>          err = ENOMEM;
>>>>>          goto unlock;
>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[],
> const char prefix[],
>>>>>      return 0;
>>>>>  }
>>>>>
>>>>> +static void
>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>> +{
>>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>> +    int local_mtu;
>>>>> +
>>>>> +    if(mtu_str) {
>>>>> +        local_mtu = atoi(mtu_str);
>>>>
>>>>Please use strtol or strtoul here, and detect errors in the string by
>>>>checking endptr. That way, we can be sure that random garbage on the
>>>>stack won't accidentally become the mtu.
>>>
>>>Will do, thanks.
>>>
>>>>
>>>>> +    }
>>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>>> +        local_mtu = ETHER_MTU;
>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
> to %d.\n",
>>>>> +                   local_mtu);
>>>>> +    }
>>>>> +    *mtu = local_mtu;
>>>>> +}
>>>>> +
>>>>>  static int
>>>>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>>>  {
>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev
> *netdev_, struct smap
>>>>*args)
>>>>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>>>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>>>      smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
>>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>>      ovs_mutex_unlock(&dev->mutex);
>>>>>
>>>>>      return 0;
>>>>>  }
>>>>>
>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>>> +static int
>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>> +{
>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>> +    int old_mtu, err;
>>>>> +    uint32_t buf_size;
>>>>> +    int dpdk_mtu;
>>>>> +    struct dpdk_mp *old_mp;
>>>>> +    struct dpdk_mp *mp;
>>>>> +
>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>> +    if (dev->mtu == mtu) {
>>>>> +        err = 0;
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>> +
>>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>> +    if (!mp) {
>>>>> +        err = ENOMEM;
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    rte_eth_dev_stop(dev->port_id);
>>>>> +
>>>>> +    old_mtu = dev->mtu;
>>>>> +    old_mp = dev->dpdk_mp;
>>>>> +    dev->dpdk_mp = mp;
>>>>> +    dev->mtu = mtu;
>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>> +
>>>>> +    err = dpdk_eth_dev_init(dev);
>>>>
>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>>rte_eth_dev_start() instead?
>
> The rx queues for the device need to be assigned the mempool that was
> allocated for the jumbo frames; since dpdk_eth_dev_init handles the
> queue setup, rte_eth_dev configuration, etc. it makes sense to call it
> here.
>
>>>
>>>This is legacy code, but I can refactor if needs be.
>>>
>>>>
>>>>> +    if (err) {
>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
> known "
>>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>> +        dpdk_mp_put(mp);
>>>>> +        dev->mtu = old_mtu;
>>>>> +        dev->dpdk_mp = old_mp;
>>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>> +        err = dpdk_eth_dev_init(dev);
>>>>
>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>>will cause a queue reconfiguration action, which is not needed, and can
>>>>error.
>
> This call is needed, in case the eth_dev's queue configuration was
> changed by the previous call.
> The error handling for this call is arguably outside the scope of this
> patch, and could be submitted as a separate patch - what do you think?

That's probably a good idea.

>>>>
>>>
>>>As above.
>>>
>>>>If you are intending to set MTU, I would have expected calls to
>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>>recognize that much of this is legacy, but since you are touching it :)
>>>
>>>I actually had a number of issues with rte_eth_dev_set_mtu early on
> in the patch; I can take
>>>a second look now that the code is more stable though.
>>
>>I remembered why I didn't use this.
>>
>>If you look at the IXGBE driver's mtu_set function, it refuses frame
> lengths which are larger
>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>>This might be avoided by configuring the device for scatter RX
> behavior, but this incurs a
>>performance impact.

It seems strange to me that we need to ignore the library API for this
one case, right? What if the user is only doing vhostuser PMDs mixed
with a bnx2x card?

It would be better to get it fixed in the library, I think. After all,
they gave a generic setting to change MTU - why don't they make it
behave well?

>><snippet>
>>ixgbe_dev_mtu_set(...) {
>>
>>...
>>
>>    /* refuse mtu that requires the support of scattered packets when this
>>     * feature has not been enabled before. */
>>    if (!dev->data->scattered_rx &&
>>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
>>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
>>        return -EINVAL;
>>
>>...
>>}
>></snippet>
>>
>>
>>>
>>>>
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    dpdk_mp_put(old_mp);
>>>>> +    netdev_change_seq_changed(netdev);
>>>>> +out:
>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>> +    return err;
>>>>> +}
>>>>> +
>>>>> +static int
>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>>> +{
>>>>> +    int mtu;
>>>>> +
>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>> +
>>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>> +}
>>>>> +
>>>>>  static int
>>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>>>  {
>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev
> *netdev, int *mtup)
>>>>>
>>>>>      return 0;
>>>>>  }
>>>>> -
>>>>> -static int
>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>> -{
>>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>> -    int old_mtu, err;
>>>>> -    struct dpdk_mp *old_mp;
>>>>> -    struct dpdk_mp *mp;
>>>>> -
>>>>> -    ovs_mutex_lock(&dpdk_mutex);
>>>>> -    ovs_mutex_lock(&dev->mutex);
>>>>> -    if (dev->mtu == mtu) {
>>>>> -        err = 0;
>>>>> -        goto out;
>>>>> -    }
>>>>> -
>>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>> -    if (!mp) {
>>>>> -        err = ENOMEM;
>>>>> -        goto out;
>>>>> -    }
>>>>> -
>>>>> -    rte_eth_dev_stop(dev->port_id);
>>>>> -
>>>>> -    old_mtu = dev->mtu;
>>>>> -    old_mp = dev->dpdk_mp;
>>>>> -    dev->dpdk_mp = mp;
>>>>> -    dev->mtu = mtu;
>>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>> -
>>>>> -    err = dpdk_eth_dev_init(dev);
>>>>> -    if (err) {
>>>>> -        dpdk_mp_put(mp);
>>>>> -        dev->mtu = old_mtu;
>>>>> -        dev->dpdk_mp = old_mp;
>>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>> -        dpdk_eth_dev_init(dev);
>>>>> -        goto out;
>>>>> -    }
>>>>> -
>>>>> -    dpdk_mp_put(old_mp);
>>>>> -    netdev_change_seq_changed(netdev);
>>>>> -out:
>>>>> -    ovs_mutex_unlock(&dev->mutex);
>>>>> -    ovs_mutex_unlock(&dpdk_mutex);
>>>>> -    return err;
>>>>> -}
>>>>> -
>>>>>  static int
>>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>>
>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev
> *netdev_, struct smap
>>>>*args)
>>>>> smap_add_format(args, "numa_id", "%d",
> rte_eth_dev_socket_id(dev->port_id));
>>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>> smap_add_format(args, "min_rx_bufsize", "%u",
> dev_info.min_rx_bufsize);
>>>>> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
>>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>>      smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>>>>>      smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>>>>>      smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>>>      return 0;
>>>>>  }
>>>>>
>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>> +static int
>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>> +{
>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>> +    int err = 0;
>>>>> +    struct dpdk_mp *old_mp;
>>>>> +    struct dpdk_mp *mp;
>>>>> +
>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>> +    if (dev->mtu == mtu) {
>>>>> +        err = 0;
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>> +    if (!mp) {
>>>>> +        err = ENOMEM;
>>>>> +        goto out;
>>>>> +    }
>>>>> +
>>>>> +    old_mp = dev->dpdk_mp;
>>>>> +    dev->dpdk_mp = mp;
>>>>> +    dev->mtu = mtu;
>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>> +
>>>>> +    dpdk_mp_put(old_mp);
>>>>> +    netdev_change_seq_changed(netdev);
>>>>> +out:
>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>> +    return err;
>>>>> +}
>>>>> +
>>>>> +static int
>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const
> struct smap *args)
>>>>> +{
>>>>> +    int mtu;
>>>>> +
>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>> +
>>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>> +}
>>>>> +
>>>>>  static void
>>>>>  dpdk_common_init(void)
>>>>>  {
>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>>>      return err;
>>>>>  }
>>>>>
>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
>>>>> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>>>>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>>>>> +        GET_STATUS, RXQ_RECV)                                          \
>>>>>  {                                                             \
>>>>>      NAME,                                                     \
>>>>>      INIT,                       /* init */                    \
>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>>>      DESTRUCT,                                                 \
>>>>>      netdev_dpdk_dealloc,                                      \
>>>>>      netdev_dpdk_get_config,                                   \
>>>>> -    NULL,                       /* netdev_dpdk_set_config */  \
>>>>> +    SET_CONFIG,                                               \
>>>>>      NULL,                       /* get_tunnel_config */       \
>>>>>      NULL,                       /* build header */            \
>>>>>      NULL,                       /* push header */             \
>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>>>      netdev_dpdk_set_etheraddr,                                \
>>>>>      netdev_dpdk_get_etheraddr,                                \
>>>>>      netdev_dpdk_get_mtu,                                      \
>>>>> -    netdev_dpdk_set_mtu,                                      \
>>>>> +    SET_MTU,                                                  \
>>>>>      netdev_dpdk_get_ifindex,                                  \
>>>>>      GET_CARRIER,                                              \
>>>>>      netdev_dpdk_get_carrier_resets,                           \
>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>>>          NULL,
>>>>>          netdev_dpdk_construct,
>>>>>          netdev_dpdk_destruct,
>>>>> +        netdev_dpdk_set_config,
>>>>>          netdev_dpdk_set_multiq,
>>>>>          netdev_dpdk_eth_send,
>>>>> +        netdev_dpdk_set_mtu,
>>>>>          netdev_dpdk_get_carrier,
>>>>>          netdev_dpdk_get_stats,
>>>>>          netdev_dpdk_get_features,
>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>>>>>          NULL,
>>>>>          netdev_dpdk_ring_construct,
>>>>>          netdev_dpdk_destruct,
>>>>> +        netdev_dpdk_set_config,
>>>>>          netdev_dpdk_set_multiq,
>>>>>          netdev_dpdk_ring_send,
>>>>> +        netdev_dpdk_set_mtu,
>>>>>          netdev_dpdk_get_carrier,
>>>>>          netdev_dpdk_get_stats,
>>>>>          netdev_dpdk_get_features,
>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class
> OVS_UNUSED dpdk_vhost_cuse_class
>>=
>>>>>          dpdk_vhost_cuse_class_init,
>>>>>          netdev_dpdk_vhost_cuse_construct,
>>>>>          netdev_dpdk_vhost_destruct,
>>>>> +        NULL,
>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>          netdev_dpdk_vhost_send,
>>>>> +        NULL,
>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>          NULL,
>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class
> OVS_UNUSED dpdk_vhost_user_class
>>=
>>>>>          dpdk_vhost_user_class_init,
>>>>>          netdev_dpdk_vhost_user_construct,
>>>>>          netdev_dpdk_vhost_destruct,
>>>>> +        netdev_dpdk_vhost_set_config,
>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>          netdev_dpdk_vhost_send,
>>>>> +        netdev_dpdk_vhost_set_mtu,
>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>          NULL,
>>>_______________________________________________
>>>dev mailing list
>>>dev@openvswitch.org
>>>http://openvswitch.org/mailman/listinfo/dev
>>_______________________________________________
>>dev mailing list
>>dev@openvswitch.org
>>http://openvswitch.org/mailman/listinfo/dev
<#secure method=pgpmime mode=encrypt>
Mark Kavanagh Jan. 29, 2016, 4:03 p.m. UTC | #6
>
>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>> Further clarifications below Aaron - please let me know if you have
>> any additional comments.
>>
>> Thanks,
>> Mark
>
>Thanks again for this work, Mark!
>
>>>
>>>One addendum below, Aaron.
>>>
>>>>
>>>>>Hi Mark,
>>>>>
>>>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>>>my first-blush comments are inline.
>>>>
>>>>Thanks for the review Aaron!
>>>>
>>>>Any additional comments are welcome - my initial responses are inline, below.
>>>>
>>>>Thanks,
>>>>Mark
>>>>
>>>>>
>>>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>>>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>>>> using single-segment-mbufs.
>>>>>>
>>>>>> Using this approach, the amount of memory allocated for each mbuf
>>>>>> to store frame data is increased to a value greater than 1518B
>>>>>> (typical Ethernet maximum frame length). The increased space
>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>>> segments.
>>>>>>
>>>>>> The amount of space allocated to each mbuf to hold frame data is
>>>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>>>
>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>>> ---
>>>>>
>>>>>This is a new feature which has user-visible impact, so should be
>>>>>announced in NEWS (plus, it's great news).
>>>>
>>>>No problem - I'll add an entry.
>>>>
>>>>>
>>>>>>  INSTALL.DPDK.md   |  59 ++++++++-
>>>>>> lib/netdev-dpdk.c | 356
>> +++++++++++++++++++++++++++++++++++++++++-------------
>>>>>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>>>>>
>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>>> index 96b686c..2f23e27 100644
>>>>>> --- a/INSTALL.DPDK.md
>>>>>> +++ b/INSTALL.DPDK.md
>>>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>>>>  to <interface> sections of all network devices used by DPDK. Parameter 'N'
>>>>>>  determines how many queues can be used by the guest.
>>>>>>
>>>>>> +
>>>>>> +Jumbo Frames
>>>>>> +------------
>>>>>> +
>>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>>>>>> +
>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to
>> the ovs-vsctl
>>>>>> +'add-port' command-line, along with the required MTU for the port.
>>>>>> +e.g.
>>>>>
>>>>>This text and the example do not match. I think '--dpdk-mtu' is not valid.
>>>>
>>>>Good catch, thanks.
>>>>
>>>>>
>>>>>> +
>>>>>> +     ```
>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>> options:dpdk-mtu=9000
>>>>>> +     ```
>>>>>> +
>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
>>>>>> +increased, such that a full Jumbo Frame may be accommodated
>> inside a single
>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>>>
>>>>>Why no support? DPDK supports changing the mtu.
>>>>
>>>>I guess my rationale here is that an MTU change can't be triggered
>> via OVS command-line, nor
>>>>can it be triggered programmatically via DPDK (apart from an explicit call to
>>>>rte_eth_dev_set_mtu).
>>>>So, while technically it's possibly, from a user's point of view,
>> there's no way to
>>>configure
>>>>it, outside of modifying the code directly. If I've missed something
>> here, please let me
>>>>know.
>>>>
>>>>>
>>>>>> +Jumbo frame support has been validated against 13312B frames, using the
>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>>>> +theoretically be supported. Supported port types excludes
>> vHost-Cuse ports, as
>>>>>> +this feature is pending deprecation.
>>>>>> +
>>>>>> +
>>>>>> +vHost Ports and Jumbo Frames
>>>>>> +----------------------------
>>>>>> +Jumbo frame support is available for DPDK vHost-User ports
>> only. Some additional
>>>>>> +configuration is needed to take advantage of this feature:
>>>>>> +
>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as
>> demonstrated in
>>>>>> +      the QEMU command line snippet below:
>>>>>> +
>>>>>> +      ```
>>>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>>> + '-device
>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>>> +      ```
>>>>>> +
>>>>>> +  2. Where virtio devices are bound to the Linux kernel driver in a guest
>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK
>> driver), the
>>>>>> +     MTU of those logical network interfaces must also be increased. This
>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that
>> 'MTU' refers
>>>>>> + to the length of the IP packet only, and not that of the entire
>> frame.
>>>>>> +
>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame,
>> subtract the L2
>>>>>> +     header and CRC lengths (i.e. 18B) from the max supported frame size.
>>>>>> +     So, to set the MTU for a 13312B Jumbo Frame:
>>>>>> +
>>>>>> +      ```
>>>>>> +      ifconfig eth1 mtu 13294
>>>>>> +      ```
>>>>>> +
>>>>>> +
>>>>>>  Restrictions:
>>>>>>  -------------
>>>>>>
>>>>>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>>>>>    - Currently DPDK port does not make use any offload functionality.
>>>>>>    - DPDK-vHost support works with 1G huge pages.
>>>>>>
>>>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>>>> the next release of DPDK (which includes the above patch) is
>> available and
>>>>>>      integrated into OVS.
>>>>>>
>>>>>> +  Jumbo Frames:
>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit
>> gracefully. The source of
>>>>>> +     this issue is currently being investigated.
>>>>>> +  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
>>>>>> +
>>>>>> +
>>>>>>  Bug Reporting:
>>>>>>  --------------
>>>>>>
>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>>> index de7e488..5f23e28 100644
>>>>>> --- a/lib/netdev-dpdk.c
>>>>>> +++ b/lib/netdev-dpdk.c
>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>>>
>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>>> +
>>>>>>  /*
>>>>>>   * need to reserve tons of extra space in the mbufs so we can align the
>>>>>>   * DMA addresses to 4KB.
>>>>>>   * The minimum mbuf size is limited to avoid scatter behaviour and drop in
>>>>>>   * performance for standard Ethernet MTU.
>>>>>>   */
>>>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>>> -                              + sizeof(struct dp_packet) \
>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>> -#define MBUF_SIZE_DRIVER     (2048                       \
>>>>>> -                              + sizeof (struct rte_mbuf) \
>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
>> ETHER_CRC_LEN)
>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN
>> - ETHER_CRC_LEN)
>>>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>>> +                                    + sizeof(struct dp_packet)   \
>>>>>> +                                    + RTE_PKTMBUF_HEADROOM)
>>>>>> +
>>>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's
>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>>> + */
>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>>>
>>>>>>  /* Max and min number of packets in the mempool.  OVS tries to allocate a
>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system
>> doesn't have
>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>  #define MIN_NB_MBUF          (4096 * 4)
>>>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>>>
>>>>>> +#define DPDK_VLAN_TAG_LEN    4
>>>>>> +
>>>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>>>
>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>>>          .header_split   = 0, /* Header Split disabled */
>>>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>>>          .hw_strip_crc   = 0,
>>>>>>      },
>>>>>>      .rx_adv_conf = {
>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>>>>      return class->construct == netdev_dpdk_construct;
>>>>>>  }
>>>>>>
>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K
>> for igb_uio).
>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient
>> buffers are
>>>>>> + * allocated to accomodate the packet in its
>> entirety. Furthermore, the igb_uio
>>>>>> + * driver needs to ensure that there is also sufficient space in
>> the Rx buffer
>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>>>> + */
>>>>>> +static uint32_t
>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>>> +{
>>>>>> +    struct rte_eth_dev_info info;
>>>>>> +    uint32_t buf_size;
>>>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>>> +
>>>>>> +    if(netdev->type == DPDK_DEV_ETH) {
>>>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>>> +        buf_size = (info.min_rx_bufsize == 0) ?
>>>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>>> +                   info.min_rx_bufsize;
>>>>>
>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>>>
>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather
>> the buffer size used
>>>>when creating the rte_mempool, which must be a multiple of the
>> driver's min_rx_bufsize.
>>>>
>>>>>
>>>>>> +    } else {
>>>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>>> +    }
>>>>>> +
>>>>>> +    if(len % buf_size) {
>>>>>> +        len = buf_size * ((len/buf_size) + 1);
>>>>>> +    }
>>>>>> +
>>>>>> +    return len;
>>>>>> +}
>>>>>> +
>>>>>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>>>>>   * for all other segments data, bss and text. */
>>>>>>
>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>>>>  }
>>>>>>
>>>>>>  static void
>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>> -                   void *opaque_arg OVS_UNUSED,
>>>>>> -                   void *_m,
>>>>>> -                   unsigned i OVS_UNUSED)
>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>>>>  {
>>>>>> -    struct rte_mbuf *m = _m;
>>>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>>> +    uint16_t roomsz;
>>>>>>
>>>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>>>
>>>>>> -    memset(m, 0, mp->elt_size);
>>>>>> +    /* if no structure is provided, assume no mbuf private area */
>>>>>> +
>>>>>> +    user_mbp_priv = opaque_arg;
>>>>>> +    if (user_mbp_priv == NULL) {
>>>>>> +        default_mbp_priv.mbuf_priv_size = 0;
>>>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>>> +        } else {
>>>>>> +            roomsz = 0;
>>>>>> +        }
>>>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>>> +        user_mbp_priv = &default_mbp_priv;
>>>>>> +    }
>>>>>> +
>>>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>>> +        user_mbp_priv->mbuf_data_room_size +
>>>>>> +        user_mbp_priv->mbuf_priv_size);
>>>>>> +
>>>>>> +    mbp_priv = rte_mempool_get_priv(mp);
>>>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>>> +}
>>>>>> +
>>>>>> +/* Initialise some fields in the mbuf structure that are not
>> modified by the
>>>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>>>> +static void
>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>> +                       void *opaque_arg OVS_UNUSED,
>>>>>> +                       void *_m,
>>>>>> +                       unsigned i OVS_UNUSED)
>>>>>> +{
>>>>>> +	struct rte_mbuf *m = _m;
>>>>>> +	uint32_t buf_size, buf_len, priv_size;
>>>>>> +
>>>>>> +	priv_size = rte_pktmbuf_priv_size(mp);
>>>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>>>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>>>>>
>>>>>> -    /* start of buffer is just after mbuf structure */
>>>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>>> -                    sizeof(struct dp_packet);
>>>>>> -    m->buf_len = (uint16_t)buf_len;
>>>>>> +	RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
>>>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>>>
>>>>>> -    /* keep some headroom between start of buffer and data */
>>>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>>> +	memset(m, 0, mp->elt_size);
>>>>>>
>>>>>> -    /* init some constant fields */
>>>>>> -    m->pool = mp;
>>>>>> -    m->nb_segs = 1;
>>>>>> -    m->port = 0xff;
>>>>>> +	/* start of buffer is after dp_packet structure and priv data */
>>>>>> +	m->priv_size = priv_size;
>>>>>> +	m->buf_addr = (char *)m + buf_size;
>>>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>>> +	m->buf_len = (uint16_t)buf_len;
>>>>>> +
>>>>>> +	/* keep some headroom between start of buffer and data */
>>>>>> +	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
>>>>>> +
>>>>>> +	/* init some constant fields */
>>>>>> +	m->pool = mp;
>>>>>> +	m->nb_segs = 1;
>>>>>> +	m->port = 0xff;
>>>>>
>>>>>Please don't mix tabs and spaces in the file.
>>>>
>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>>>
>>>>>
>>>>>>  }
>>>>>>
>>>>>>  static void
>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>  {
>>>>>>      struct rte_mbuf *m = _m;
>>>>>>
>>>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>>>
>>>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>>>>  }
>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu)
>> OVS_REQUIRES(dpdk_mutex)
>>>>>>      struct dpdk_mp *dmp = NULL;
>>>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>>>      unsigned mp_size;
>>>>>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>>>>>
>>>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu)
>> OVS_REQUIRES(dpdk_mutex)
>>>>>>      dmp->socket_id = socket_id;
>>>>>>      dmp->mtu = mtu;
>>>>>>      dmp->refcount = 1;
>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>> RTE_PKTMBUF_HEADROOM;
>>>>>> +    mbp_priv.mbuf_priv_size = 0;
>>>>>>
>>>>>>      mp_size = MAX_NB_MBUF;
>>>>>>      do {
>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu)
>> OVS_REQUIRES(dpdk_mutex)
>>>>>>              return NULL;
>>>>>>          }
>>>>>>
>>>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size,
>> MBUF_SEGMENT_SIZE(mtu),
>>>>>>                                       MP_CACHE_SZ,
>>>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>>>> -                                     rte_pktmbuf_pool_init, NULL,
>>>>>> +                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>>>>                                       ovs_rte_pktmbuf_init, NULL,
>>>>>>                                       socket_id, 0);
>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>> MIN_NB_MBUF);
>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>> *dev, int n_rxq, int
>>>>n_txq)
>>>>>>  {
>>>>>>      int diag = 0;
>>>>>>      int i;
>>>>>> +    struct rte_eth_conf conf = port_conf;
>>>>>>
>>>>>>      /* A device may report more queues than it makes available (this has
>>>>>>       * been observed for Intel xl710, which reserves some of them for
>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>> *dev, int n_rxq, int
>>>>>n_txq)
>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>>>>          }
>>>>>>
>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>> &port_conf);
>>>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>> +        }
>>>>>> +
>>>>>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>>>>>          if (diag) {
>>>>>>              break;
>>>>>>          }
>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_,
>> unsigned int port_no,
>>>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>>>      int sid;
>>>>>>      int err = 0;
>>>>>> +    uint32_t buf_size;
>>>>>>
>>>>>>      ovs_mutex_init(&netdev->mutex);
>>>>>>      ovs_mutex_lock(&netdev->mutex);
>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_,
>> unsigned int port_no,
>>>>>>      netdev->port_id = port_no;
>>>>>>      netdev->type = type;
>>>>>>      netdev->flags = 0;
>>>>>> +
>>>>>> +    /* Initialize port's MTU and frame len to the default Ethernet values.
>>>>>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>>>> +     * netdev_dpdk_set_config.
>>>>>> +     */
>>>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>>>      netdev->mtu = ETHER_MTU;
>>>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>>>
>>>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>> FRAME_LEN_TO_MTU(buf_size));
>>>>>>      if (!netdev->dpdk_mp) {
>>>>>>          err = ENOMEM;
>>>>>>          goto unlock;
>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[],
>> const char prefix[],
>>>>>>      return 0;
>>>>>>  }
>>>>>>
>>>>>> +static void
>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>>> +{
>>>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>>> +    int local_mtu;
>>>>>> +
>>>>>> +    if(mtu_str) {
>>>>>> +        local_mtu = atoi(mtu_str);
>>>>>
>>>>>Please use strtol or strtoul here, and detect errors in the string by
>>>>>checking endptr. That way, we can be sure that random garbage on the
>>>>>stack won't accidentally become the mtu.
>>>>
>>>>Will do, thanks.
>>>>
>>>>>
>>>>>> +    }
>>>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>>>> +        local_mtu = ETHER_MTU;
>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>> to %d.\n",
>>>>>> +                   local_mtu);
>>>>>> +    }
>>>>>> +    *mtu = local_mtu;
>>>>>> +}
>>>>>> +
>>>>>>  static int
>>>>>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>>>>  {
>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev
>> *netdev_, struct smap
>>>>>*args)
>>>>>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>>>>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>>>>      smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
>>>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>>>      ovs_mutex_unlock(&dev->mutex);
>>>>>>
>>>>>>      return 0;
>>>>>>  }
>>>>>>
>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>>>> +static int
>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>> +{
>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>> +    int old_mtu, err;
>>>>>> +    uint32_t buf_size;
>>>>>> +    int dpdk_mtu;
>>>>>> +    struct dpdk_mp *old_mp;
>>>>>> +    struct dpdk_mp *mp;
>>>>>> +
>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>> +    if (dev->mtu == mtu) {
>>>>>> +        err = 0;
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>>> +
>>>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>>> +    if (!mp) {
>>>>>> +        err = ENOMEM;
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    rte_eth_dev_stop(dev->port_id);
>>>>>> +
>>>>>> +    old_mtu = dev->mtu;
>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>> +    dev->dpdk_mp = mp;
>>>>>> +    dev->mtu = mtu;
>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>> +
>>>>>> +    err = dpdk_eth_dev_init(dev);
>>>>>
>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>>>rte_eth_dev_start() instead?
>>
>> The rx queues for the device need to be assigned the mempool that was
>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the
>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it
>> here.
>>
>>>>
>>>>This is legacy code, but I can refactor if needs be.
>>>>
>>>>>
>>>>>> +    if (err) {
>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>> known "
>>>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>>> +        dpdk_mp_put(mp);
>>>>>> +        dev->mtu = old_mtu;
>>>>>> +        dev->dpdk_mp = old_mp;
>>>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>> +        err = dpdk_eth_dev_init(dev);
>>>>>
>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>>>will cause a queue reconfiguration action, which is not needed, and can
>>>>>error.
>>
>> This call is needed, in case the eth_dev's queue configuration was
>> changed by the previous call.
>> The error handling for this call is arguably outside the scope of this
>> patch, and could be submitted as a separate patch - what do you think?
>
>That's probably a good idea.
>
>>>>>
>>>>
>>>>As above.
>>>>
>>>>>If you are intending to set MTU, I would have expected calls to
>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>>>recognize that much of this is legacy, but since you are touching it :)
>>>>
>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on
>> in the patch; I can take
>>>>a second look now that the code is more stable though.
>>>
>>>I remembered why I didn't use this.
>>>
>>>If you look at the IXGBE driver's mtu_set function, it refuses frame
>> lengths which are larger
>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>>>This might be avoided by configuring the device for scatter RX
>> behavior, but this incurs a
>>>performance impact.
>
>It seems strange to me that we need to ignore the library API for this
>one case, right? What if the user is only doing vhostuser PMDs mixed
>with a bnx2x card?
>

I can't say that I've familiar enough with this setup to form an answer on that one Aaron.

>It would be better to get it fixed in the library, I think. After all,
>they gave a generic setting to change MTU - why don't they make it
>behave well?
>

Let me follow up on that with some local DPDK folks here. 

Would you consider this a showstopper for upstreaming of this feature? If we need to wait for a change to DPDK, Jumbo Frames will miss the 2.5 window.

>>><snippet>
>>>ixgbe_dev_mtu_set(...) {
>>>
>>>...
>>>
>>>    /* refuse mtu that requires the support of scattered packets when this
>>>     * feature has not been enabled before. */
>>>    if (!dev->data->scattered_rx &&
>>>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
>>>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
>>>        return -EINVAL;
>>>
>>>...
>>>}
>>></snippet>
>>>
>>>
>>>>
>>>>>
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    dpdk_mp_put(old_mp);
>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>> +out:
>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>> +    return err;
>>>>>> +}
>>>>>> +
>>>>>> +static int
>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>>>> +{
>>>>>> +    int mtu;
>>>>>> +
>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>> +
>>>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>>> +}
>>>>>> +
>>>>>>  static int
>>>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>>>>  {
>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev
>> *netdev, int *mtup)
>>>>>>
>>>>>>      return 0;
>>>>>>  }
>>>>>> -
>>>>>> -static int
>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>> -{
>>>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>> -    int old_mtu, err;
>>>>>> -    struct dpdk_mp *old_mp;
>>>>>> -    struct dpdk_mp *mp;
>>>>>> -
>>>>>> -    ovs_mutex_lock(&dpdk_mutex);
>>>>>> -    ovs_mutex_lock(&dev->mutex);
>>>>>> -    if (dev->mtu == mtu) {
>>>>>> -        err = 0;
>>>>>> -        goto out;
>>>>>> -    }
>>>>>> -
>>>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>>> -    if (!mp) {
>>>>>> -        err = ENOMEM;
>>>>>> -        goto out;
>>>>>> -    }
>>>>>> -
>>>>>> -    rte_eth_dev_stop(dev->port_id);
>>>>>> -
>>>>>> -    old_mtu = dev->mtu;
>>>>>> -    old_mp = dev->dpdk_mp;
>>>>>> -    dev->dpdk_mp = mp;
>>>>>> -    dev->mtu = mtu;
>>>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>> -
>>>>>> -    err = dpdk_eth_dev_init(dev);
>>>>>> -    if (err) {
>>>>>> -        dpdk_mp_put(mp);
>>>>>> -        dev->mtu = old_mtu;
>>>>>> -        dev->dpdk_mp = old_mp;
>>>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>> -        dpdk_eth_dev_init(dev);
>>>>>> -        goto out;
>>>>>> -    }
>>>>>> -
>>>>>> -    dpdk_mp_put(old_mp);
>>>>>> -    netdev_change_seq_changed(netdev);
>>>>>> -out:
>>>>>> -    ovs_mutex_unlock(&dev->mutex);
>>>>>> -    ovs_mutex_unlock(&dpdk_mutex);
>>>>>> -    return err;
>>>>>> -}
>>>>>> -
>>>>>>  static int
>>>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>>>
>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev
>> *netdev_, struct smap
>>>>>*args)
>>>>>> smap_add_format(args, "numa_id", "%d",
>> rte_eth_dev_socket_id(dev->port_id));
>>>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>>> smap_add_format(args, "min_rx_bufsize", "%u",
>> dev_info.min_rx_bufsize);
>>>>>> -    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
>>>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>>>      smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
>>>>>>      smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
>>>>>>      smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>>>>      return 0;
>>>>>>  }
>>>>>>
>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>>> +static int
>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>>> +{
>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>> +    int err = 0;
>>>>>> +    struct dpdk_mp *old_mp;
>>>>>> +    struct dpdk_mp *mp;
>>>>>> +
>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>> +    if (dev->mtu == mtu) {
>>>>>> +        err = 0;
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>>> +    if (!mp) {
>>>>>> +        err = ENOMEM;
>>>>>> +        goto out;
>>>>>> +    }
>>>>>> +
>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>> +    dev->dpdk_mp = mp;
>>>>>> +    dev->mtu = mtu;
>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>> +
>>>>>> +    dpdk_mp_put(old_mp);
>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>> +out:
>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>> +    return err;
>>>>>> +}
>>>>>> +
>>>>>> +static int
>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const
>> struct smap *args)
>>>>>> +{
>>>>>> +    int mtu;
>>>>>> +
>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>> +
>>>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>>> +}
>>>>>> +
>>>>>>  static void
>>>>>>  dpdk_common_init(void)
>>>>>>  {
>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>>>>      return err;
>>>>>>  }
>>>>>>
>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
>>>>>> -    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>>>>>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>>>>>> +        GET_STATUS, RXQ_RECV)                                          \
>>>>>>  {                                                             \
>>>>>>      NAME,                                                     \
>>>>>>      INIT,                       /* init */                    \
>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>>>>      DESTRUCT,                                                 \
>>>>>>      netdev_dpdk_dealloc,                                      \
>>>>>>      netdev_dpdk_get_config,                                   \
>>>>>> -    NULL,                       /* netdev_dpdk_set_config */  \
>>>>>> +    SET_CONFIG,                                               \
>>>>>>      NULL,                       /* get_tunnel_config */       \
>>>>>>      NULL,                       /* build header */            \
>>>>>>      NULL,                       /* push header */             \
>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>>>>      netdev_dpdk_set_etheraddr,                                \
>>>>>>      netdev_dpdk_get_etheraddr,                                \
>>>>>>      netdev_dpdk_get_mtu,                                      \
>>>>>> -    netdev_dpdk_set_mtu,                                      \
>>>>>> +    SET_MTU,                                                  \
>>>>>>      netdev_dpdk_get_ifindex,                                  \
>>>>>>      GET_CARRIER,                                              \
>>>>>>      netdev_dpdk_get_carrier_resets,                           \
>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>>>>          NULL,
>>>>>>          netdev_dpdk_construct,
>>>>>>          netdev_dpdk_destruct,
>>>>>> +        netdev_dpdk_set_config,
>>>>>>          netdev_dpdk_set_multiq,
>>>>>>          netdev_dpdk_eth_send,
>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>          netdev_dpdk_get_carrier,
>>>>>>          netdev_dpdk_get_stats,
>>>>>>          netdev_dpdk_get_features,
>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>>>>>>          NULL,
>>>>>>          netdev_dpdk_ring_construct,
>>>>>>          netdev_dpdk_destruct,
>>>>>> +        netdev_dpdk_set_config,
>>>>>>          netdev_dpdk_set_multiq,
>>>>>>          netdev_dpdk_ring_send,
>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>          netdev_dpdk_get_carrier,
>>>>>>          netdev_dpdk_get_stats,
>>>>>>          netdev_dpdk_get_features,
>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class
>> OVS_UNUSED dpdk_vhost_cuse_class
>>>=
>>>>>>          dpdk_vhost_cuse_class_init,
>>>>>>          netdev_dpdk_vhost_cuse_construct,
>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>> +        NULL,
>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>          netdev_dpdk_vhost_send,
>>>>>> +        NULL,
>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>          NULL,
>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class
>> OVS_UNUSED dpdk_vhost_user_class
>>>=
>>>>>>          dpdk_vhost_user_class_init,
>>>>>>          netdev_dpdk_vhost_user_construct,
>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>> +        netdev_dpdk_vhost_set_config,
>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>          netdev_dpdk_vhost_send,
>>>>>> +        netdev_dpdk_vhost_set_mtu,
>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>          NULL,
>>>>_______________________________________________
>>>>dev mailing list
>>>>dev@openvswitch.org
>>>>http://openvswitch.org/mailman/listinfo/dev
>>>_______________________________________________
>>>dev mailing list
>>>dev@openvswitch.org
>>>http://openvswitch.org/mailman/listinfo/dev
><#secure method=pgpmime mode=encrypt>
Aaron Conole Jan. 29, 2016, 4:16 p.m. UTC | #7
"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:

>>
>>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>>> Further clarifications below Aaron - please let me know if you have
>>> any additional comments.
>>>
>>> Thanks,
>>> Mark
>>
>>Thanks again for this work, Mark!
>>
>>>>
>>>>One addendum below, Aaron.
>>>>
>>>>>
>>>>>>Hi Mark,
>>>>>>
>>>>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>>>>my first-blush comments are inline.
>>>>>
>>>>>Thanks for the review Aaron!
>>>>>
>>>>>Any additional comments are welcome - my initial responses are
> inline, below.
>>>>>
>>>>>Thanks,
>>>>>Mark
>>>>>
>>>>>>
>>>>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>>>>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>>>>> using single-segment-mbufs.
>>>>>>>
>>>>>>> Using this approach, the amount of memory allocated for each mbuf
>>>>>>> to store frame data is increased to a value greater than 1518B
>>>>>>> (typical Ethernet maximum frame length). The increased space
>>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>>>> segments.
>>>>>>>
>>>>>>> The amount of space allocated to each mbuf to hold frame data is
>>>>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>>>>
>>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>>>> ---
>>>>>>
>>>>>>This is a new feature which has user-visible impact, so should be
>>>>>>announced in NEWS (plus, it's great news).
>>>>>
>>>>>No problem - I'll add an entry.
>>>>>
>>>>>>
>>>>>>>  INSTALL.DPDK.md   |  59 ++++++++-
>>>>>>> lib/netdev-dpdk.c | 356
>>> +++++++++++++++++++++++++++++++++++++++++-------------
>>>>>>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>>>>>>
>>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>>>> index 96b686c..2f23e27 100644
>>>>>>> --- a/INSTALL.DPDK.md
>>>>>>> +++ b/INSTALL.DPDK.md
>>>>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>>>>> to <interface> sections of all network devices used by
> DPDK. Parameter 'N'
>>>>>>>  determines how many queues can be used by the guest.
>>>>>>>
>>>>>>> +
>>>>>>> +Jumbo Frames
>>>>>>> +------------
>>>>>>> +
>>>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>>>>>>> +
>>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to
>>> the ovs-vsctl
>>>>>>> +'add-port' command-line, along with the required MTU for the port.
>>>>>>> +e.g.
>>>>>>
>>>>>>This text and the example do not match. I think '--dpdk-mtu' is not valid.
>>>>>
>>>>>Good catch, thanks.
>>>>>
>>>>>>
>>>>>>> +
>>>>>>> +     ```
>>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>> options:dpdk-mtu=9000
>>>>>>> +     ```
>>>>>>> +
>>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
> segments are
>>>>>>> +increased, such that a full Jumbo Frame may be accommodated
>>> inside a single
>>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>>>>
>>>>>>Why no support? DPDK supports changing the mtu.
>>>>>
>>>>>I guess my rationale here is that an MTU change can't be triggered
>>> via OVS command-line, nor
>>>>>can it be triggered programmatically via DPDK (apart from an
> explicit call to
>>>>>rte_eth_dev_set_mtu).
>>>>>So, while technically it's possibly, from a user's point of view,
>>> there's no way to
>>>>configure
>>>>>it, outside of modifying the code directly. If I've missed something
>>> here, please let me
>>>>>know.
>>>>>
>>>>>>
>>>>>>> +Jumbo frame support has been validated against 13312B frames, using the
>>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>>>>> +theoretically be supported. Supported port types excludes
>>> vHost-Cuse ports, as
>>>>>>> +this feature is pending deprecation.
>>>>>>> +
>>>>>>> +
>>>>>>> +vHost Ports and Jumbo Frames
>>>>>>> +----------------------------
>>>>>>> +Jumbo frame support is available for DPDK vHost-User ports
>>> only. Some additional
>>>>>>> +configuration is needed to take advantage of this feature:
>>>>>>> +
>>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as
>>> demonstrated in
>>>>>>> +      the QEMU command line snippet below:
>>>>>>> +
>>>>>>> +      ```
>>>>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>>>> + '-device
>>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>>>> +      ```
>>>>>>> +
>>>>>>> + 2. Where virtio devices are bound to the Linux kernel driver
> in a guest
>>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK
>>> driver), the
>>>>>>> + MTU of those logical network interfaces must also be
> increased. This
>>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that
>>> 'MTU' refers
>>>>>>> + to the length of the IP packet only, and not that of the entire
>>> frame.
>>>>>>> +
>>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame,
>>> subtract the L2
>>>>>>> + header and CRC lengths (i.e. 18B) from the max supported
> frame size.
>>>>>>> +     So, to set the MTU for a 13312B Jumbo Frame:
>>>>>>> +
>>>>>>> +      ```
>>>>>>> +      ifconfig eth1 mtu 13294
>>>>>>> +      ```
>>>>>>> +
>>>>>>> +
>>>>>>>  Restrictions:
>>>>>>>  -------------
>>>>>>>
>>>>>>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>>>>>>    - Currently DPDK port does not make use any offload functionality.
>>>>>>>    - DPDK-vHost support works with 1G huge pages.
>>>>>>>
>>>>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>>>>> the next release of DPDK (which includes the above patch) is
>>> available and
>>>>>>>      integrated into OVS.
>>>>>>>
>>>>>>> +  Jumbo Frames:
>>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit
>>> gracefully. The source of
>>>>>>> +     this issue is currently being investigated.
>>>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost
> Cuse ports.
>>>>>>> +
>>>>>>> +
>>>>>>>  Bug Reporting:
>>>>>>>  --------------
>>>>>>>
>>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>>>> index de7e488..5f23e28 100644
>>>>>>> --- a/lib/netdev-dpdk.c
>>>>>>> +++ b/lib/netdev-dpdk.c
>>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>>>>
>>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>>>> +
>>>>>>>  /*
>>>>>>>   * need to reserve tons of extra space in the mbufs so we can align the
>>>>>>>   * DMA addresses to 4KB.
>>>>>>> * The minimum mbuf size is limited to avoid scatter behaviour
> and drop in
>>>>>>>   * performance for standard Ethernet MTU.
>>>>>>>   */
>>>>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>>>> -                              + sizeof(struct dp_packet) \
>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>> -#define MBUF_SIZE_DRIVER     (2048                       \
>>>>>>> -                              + sizeof (struct rte_mbuf) \
>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
>>> ETHER_CRC_LEN)
>>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN
>>> - ETHER_CRC_LEN)
>>>>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>>>> +                                    + sizeof(struct dp_packet)   \
>>>>>>> +                                    + RTE_PKTMBUF_HEADROOM)
>>>>>>> +
>>>>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's
>>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>>>> + */
>>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>>>>
>>>>>>> /* Max and min number of packets in the mempool.  OVS tries to
> allocate a
>>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system
>>> doesn't have
>>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>  #define MIN_NB_MBUF          (4096 * 4)
>>>>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>>>>
>>>>>>> +#define DPDK_VLAN_TAG_LEN    4
>>>>>>> +
>>>>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>>>>
>>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>>>>          .header_split   = 0, /* Header Split disabled */
>>>>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>>>>          .hw_strip_crc   = 0,
>>>>>>>      },
>>>>>>>      .rx_adv_conf = {
>>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>>>>>      return class->construct == netdev_dpdk_construct;
>>>>>>>  }
>>>>>>>
>>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K
>>> for igb_uio).
>>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient
>>> buffers are
>>>>>>> + * allocated to accomodate the packet in its
>>> entirety. Furthermore, the igb_uio
>>>>>>> + * driver needs to ensure that there is also sufficient space in
>>> the Rx buffer
>>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX
> buffer is too
>>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>>>>> + */
>>>>>>> +static uint32_t
>>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>>>> +{
>>>>>>> +    struct rte_eth_dev_info info;
>>>>>>> +    uint32_t buf_size;
>>>>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>>>> +
>>>>>>> +    if(netdev->type == DPDK_DEV_ETH) {
>>>>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>>>> +        buf_size = (info.min_rx_bufsize == 0) ?
>>>>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>>>> +                   info.min_rx_bufsize;
>>>>>>
>>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>>>>
>>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather
>>> the buffer size used
>>>>>when creating the rte_mempool, which must be a multiple of the
>>> driver's min_rx_bufsize.
>>>>>
>>>>>>
>>>>>>> +    } else {
>>>>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if(len % buf_size) {
>>>>>>> +        len = buf_size * ((len/buf_size) + 1);
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    return len;
>>>>>>> +}
>>>>>>> +
>>>>>>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>>>>>>   * for all other segments data, bss and text. */
>>>>>>>
>>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>>>>>  }
>>>>>>>
>>>>>>>  static void
>>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>> -                   void *opaque_arg OVS_UNUSED,
>>>>>>> -                   void *_m,
>>>>>>> -                   unsigned i OVS_UNUSED)
>>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>>>>>  {
>>>>>>> -    struct rte_mbuf *m = _m;
>>>>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>>>> +    uint16_t roomsz;
>>>>>>>
>>>>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>>>>
>>>>>>> -    memset(m, 0, mp->elt_size);
>>>>>>> +    /* if no structure is provided, assume no mbuf private area */
>>>>>>> +
>>>>>>> +    user_mbp_priv = opaque_arg;
>>>>>>> +    if (user_mbp_priv == NULL) {
>>>>>>> +        default_mbp_priv.mbuf_priv_size = 0;
>>>>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>>>> +        } else {
>>>>>>> +            roomsz = 0;
>>>>>>> +        }
>>>>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>>>> +        user_mbp_priv = &default_mbp_priv;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>>>> +        user_mbp_priv->mbuf_data_room_size +
>>>>>>> +        user_mbp_priv->mbuf_priv_size);
>>>>>>> +
>>>>>>> +    mbp_priv = rte_mempool_get_priv(mp);
>>>>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>>>> +}
>>>>>>> +
>>>>>>> +/* Initialise some fields in the mbuf structure that are not
>>> modified by the
>>>>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>>>>> +static void
>>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>> +                       void *opaque_arg OVS_UNUSED,
>>>>>>> +                       void *_m,
>>>>>>> +                       unsigned i OVS_UNUSED)
>>>>>>> +{
>>>>>>> +	struct rte_mbuf *m = _m;
>>>>>>> +	uint32_t buf_size, buf_len, priv_size;
>>>>>>> +
>>>>>>> +	priv_size = rte_pktmbuf_priv_size(mp);
>>>>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>>>>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>>>>>>
>>>>>>> -    /* start of buffer is just after mbuf structure */
>>>>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>>>> -                    sizeof(struct dp_packet);
>>>>>>> -    m->buf_len = (uint16_t)buf_len;
>>>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
> priv_size);
>>>>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>>>>
>>>>>>> -    /* keep some headroom between start of buffer and data */
>>>>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>>>> +	memset(m, 0, mp->elt_size);
>>>>>>>
>>>>>>> -    /* init some constant fields */
>>>>>>> -    m->pool = mp;
>>>>>>> -    m->nb_segs = 1;
>>>>>>> -    m->port = 0xff;
>>>>>>> +	/* start of buffer is after dp_packet structure and priv data */
>>>>>>> +	m->priv_size = priv_size;
>>>>>>> +	m->buf_addr = (char *)m + buf_size;
>>>>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>>>> +	m->buf_len = (uint16_t)buf_len;
>>>>>>> +
>>>>>>> +	/* keep some headroom between start of buffer and data */
>>>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
> (uint16_t)m->buf_len);
>>>>>>> +
>>>>>>> +	/* init some constant fields */
>>>>>>> +	m->pool = mp;
>>>>>>> +	m->nb_segs = 1;
>>>>>>> +	m->port = 0xff;
>>>>>>
>>>>>>Please don't mix tabs and spaces in the file.
>>>>>
>>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>>>>
>>>>>>
>>>>>>>  }
>>>>>>>
>>>>>>>  static void
>>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>  {
>>>>>>>      struct rte_mbuf *m = _m;
>>>>>>>
>>>>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>>>>
>>>>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>>>>>  }
>>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>      struct dpdk_mp *dmp = NULL;
>>>>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>>>>      unsigned mp_size;
>>>>>>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>>>>>>
>>>>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>      dmp->socket_id = socket_id;
>>>>>>>      dmp->mtu = mtu;
>>>>>>>      dmp->refcount = 1;
>>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>> RTE_PKTMBUF_HEADROOM;
>>>>>>> +    mbp_priv.mbuf_priv_size = 0;
>>>>>>>
>>>>>>>      mp_size = MAX_NB_MBUF;
>>>>>>>      do {
>>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>              return NULL;
>>>>>>>          }
>>>>>>>
>>>>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size,
>>> MBUF_SEGMENT_SIZE(mtu),
>>>>>>>                                       MP_CACHE_SZ,
>>>>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>>>>> -                                     rte_pktmbuf_pool_init, NULL,
>>>>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>>>>>                                       ovs_rte_pktmbuf_init, NULL,
>>>>>>>                                       socket_id, 0);
>>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>> MIN_NB_MBUF);
>>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>> *dev, int n_rxq, int
>>>>>n_txq)
>>>>>>>  {
>>>>>>>      int diag = 0;
>>>>>>>      int i;
>>>>>>> +    struct rte_eth_conf conf = port_conf;
>>>>>>>
>>>>>>>      /* A device may report more queues than it makes available (this has
>>>>>>>       * been observed for Intel xl710, which reserves some of them for
>>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>> *dev, int n_rxq, int
>>>>>>n_txq)
>>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>>>>>          }
>>>>>>>
>>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>> &port_conf);
>>>>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>> +        }
>>>>>>> +
>>>>>>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>>>>>>          if (diag) {
>>>>>>>              break;
>>>>>>>          }
>>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_,
>>> unsigned int port_no,
>>>>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>>>>      int sid;
>>>>>>>      int err = 0;
>>>>>>> +    uint32_t buf_size;
>>>>>>>
>>>>>>>      ovs_mutex_init(&netdev->mutex);
>>>>>>>      ovs_mutex_lock(&netdev->mutex);
>>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_,
>>> unsigned int port_no,
>>>>>>>      netdev->port_id = port_no;
>>>>>>>      netdev->type = type;
>>>>>>>      netdev->flags = 0;
>>>>>>> +
>>>>>>> + /* Initialize port's MTU and frame len to the default
> Ethernet values.
>>>>>>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>>>>> +     * netdev_dpdk_set_config.
>>>>>>> +     */
>>>>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>>>>      netdev->mtu = ETHER_MTU;
>>>>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>>>>
>>>>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>> FRAME_LEN_TO_MTU(buf_size));
>>>>>>>      if (!netdev->dpdk_mp) {
>>>>>>>          err = ENOMEM;
>>>>>>>          goto unlock;
>>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[],
>>> const char prefix[],
>>>>>>>      return 0;
>>>>>>>  }
>>>>>>>
>>>>>>> +static void
>>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>>>> +{
>>>>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>>>> +    int local_mtu;
>>>>>>> +
>>>>>>> +    if(mtu_str) {
>>>>>>> +        local_mtu = atoi(mtu_str);
>>>>>>
>>>>>>Please use strtol or strtoul here, and detect errors in the string by
>>>>>>checking endptr. That way, we can be sure that random garbage on the
>>>>>>stack won't accidentally become the mtu.
>>>>>
>>>>>Will do, thanks.
>>>>>
>>>>>>
>>>>>>> +    }
>>>>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>>>>> +        local_mtu = ETHER_MTU;
>>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>> to %d.\n",
>>>>>>> +                   local_mtu);
>>>>>>> +    }
>>>>>>> +    *mtu = local_mtu;
>>>>>>> +}
>>>>>>> +
>>>>>>>  static int
>>>>>>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>>>>>  {
>>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev
>>> *netdev_, struct smap
>>>>>>*args)
>>>>>>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>>>>>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>>>>> smap_add_format(args, "configured_tx_queues", "%d",
> dev->real_n_txq);
>>>>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>>>>      ovs_mutex_unlock(&dev->mutex);
>>>>>>>
>>>>>>>      return 0;
>>>>>>>  }
>>>>>>>
>>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>>>>> +static int
>>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>> +{
>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>> +    int old_mtu, err;
>>>>>>> +    uint32_t buf_size;
>>>>>>> +    int dpdk_mtu;
>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>> +    struct dpdk_mp *mp;
>>>>>>> +
>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>> +        err = 0;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>>>> +
>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>>>> +    if (!mp) {
>>>>>>> +        err = ENOMEM;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    rte_eth_dev_stop(dev->port_id);
>>>>>>> +
>>>>>>> +    old_mtu = dev->mtu;
>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>> +    dev->mtu = mtu;
>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>> +
>>>>>>> +    err = dpdk_eth_dev_init(dev);
>>>>>>
>>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>>>>rte_eth_dev_start() instead?
>>>
>>> The rx queues for the device need to be assigned the mempool that was
>>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the
>>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it
>>> here.
>>>
>>>>>
>>>>>This is legacy code, but I can refactor if needs be.
>>>>>
>>>>>>
>>>>>>> +    if (err) {
>>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>> known "
>>>>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>>>> +        dpdk_mp_put(mp);
>>>>>>> +        dev->mtu = old_mtu;
>>>>>>> +        dev->dpdk_mp = old_mp;
>>>>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>> +        err = dpdk_eth_dev_init(dev);
>>>>>>
>>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>>>>will cause a queue reconfiguration action, which is not needed, and can
>>>>>>error.
>>>
>>> This call is needed, in case the eth_dev's queue configuration was
>>> changed by the previous call.
>>> The error handling for this call is arguably outside the scope of this
>>> patch, and could be submitted as a separate patch - what do you think?
>>
>>That's probably a good idea.
>>
>>>>>>
>>>>>
>>>>>As above.
>>>>>
>>>>>>If you are intending to set MTU, I would have expected calls to
>>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>>>>recognize that much of this is legacy, but since you are touching it :)
>>>>>
>>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on
>>> in the patch; I can take
>>>>>a second look now that the code is more stable though.
>>>>
>>>>I remembered why I didn't use this.
>>>>
>>>>If you look at the IXGBE driver's mtu_set function, it refuses frame
>>> lengths which are larger
>>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>>>>This might be avoided by configuring the device for scatter RX
>>> behavior, but this incurs a
>>>>performance impact.
>>
>>It seems strange to me that we need to ignore the library API for this
>>one case, right? What if the user is only doing vhostuser PMDs mixed
>>with a bnx2x card?
>>
>
> I can't say that I've familiar enough with this setup to form an
> answer on that one Aaron.
>
>>It would be better to get it fixed in the library, I think. After all,
>>they gave a generic setting to change MTU - why don't they make it
>>behave well?
>>
>
> Let me follow up on that with some local DPDK folks here. 
>
> Would you consider this a showstopper for upstreaming of this feature?
> If we need to wait for a change to DPDK, Jumbo Frames will miss the
> 2.5 window.

I don't think it would be a show stopper as long as other devices wouldn't
be negatively impacted, but it's a place where a /*XXX: This is a
workaround for dpdk*/ comment should be included so that we can fix it
when/if it is updated in the library. What do you think?

>>>><snippet>
>>>>ixgbe_dev_mtu_set(...) {
>>>>
>>>>...
>>>>
>>>>    /* refuse mtu that requires the support of scattered packets when this
>>>>     * feature has not been enabled before. */
>>>>    if (!dev->data->scattered_rx &&
>>>>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
>>>>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
>>>>        return -EINVAL;
>>>>
>>>>...
>>>>}
>>>></snippet>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>> +out:
>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>> +    return err;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int
>>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>>>>> +{
>>>>>>> +    int mtu;
>>>>>>> +
>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>> +
>>>>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>>>> +}
>>>>>>> +
>>>>>>>  static int
>>>>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>>>>>  {
>>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev
>>> *netdev, int *mtup)
>>>>>>>
>>>>>>>      return 0;
>>>>>>>  }
>>>>>>> -
>>>>>>> -static int
>>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>> -{
>>>>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>> -    int old_mtu, err;
>>>>>>> -    struct dpdk_mp *old_mp;
>>>>>>> -    struct dpdk_mp *mp;
>>>>>>> -
>>>>>>> -    ovs_mutex_lock(&dpdk_mutex);
>>>>>>> -    ovs_mutex_lock(&dev->mutex);
>>>>>>> -    if (dev->mtu == mtu) {
>>>>>>> -        err = 0;
>>>>>>> -        goto out;
>>>>>>> -    }
>>>>>>> -
>>>>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>>>> -    if (!mp) {
>>>>>>> -        err = ENOMEM;
>>>>>>> -        goto out;
>>>>>>> -    }
>>>>>>> -
>>>>>>> -    rte_eth_dev_stop(dev->port_id);
>>>>>>> -
>>>>>>> -    old_mtu = dev->mtu;
>>>>>>> -    old_mp = dev->dpdk_mp;
>>>>>>> -    dev->dpdk_mp = mp;
>>>>>>> -    dev->mtu = mtu;
>>>>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>> -
>>>>>>> -    err = dpdk_eth_dev_init(dev);
>>>>>>> -    if (err) {
>>>>>>> -        dpdk_mp_put(mp);
>>>>>>> -        dev->mtu = old_mtu;
>>>>>>> -        dev->dpdk_mp = old_mp;
>>>>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>> -        dpdk_eth_dev_init(dev);
>>>>>>> -        goto out;
>>>>>>> -    }
>>>>>>> -
>>>>>>> -    dpdk_mp_put(old_mp);
>>>>>>> -    netdev_change_seq_changed(netdev);
>>>>>>> -out:
>>>>>>> -    ovs_mutex_unlock(&dev->mutex);
>>>>>>> -    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>> -    return err;
>>>>>>> -}
>>>>>>> -
>>>>>>>  static int
>>>>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>>>>
>>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev
>>> *netdev_, struct smap
>>>>>>*args)
>>>>>>> smap_add_format(args, "numa_id", "%d",
>>> rte_eth_dev_socket_id(dev->port_id));
>>>>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>>>> smap_add_format(args, "min_rx_bufsize", "%u",
>>> dev_info.min_rx_bufsize);
>>>>>>> - smap_add_format(args, "max_rx_pktlen", "%u",
> dev_info.max_rx_pktlen);
>>>>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>>>> smap_add_format(args, "max_rx_queues", "%u",
> dev_info.max_rx_queues);
>>>>>>> smap_add_format(args, "max_tx_queues", "%u",
> dev_info.max_tx_queues);
>>>>>>> smap_add_format(args, "max_mac_addrs", "%u",
> dev_info.max_mac_addrs);
>>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>>>>>      return 0;
>>>>>>>  }
>>>>>>>
>>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>>>> +static int
>>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>> +{
>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>> +    int err = 0;
>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>> +    struct dpdk_mp *mp;
>>>>>>> +
>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>> +        err = 0;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>>>> +    if (!mp) {
>>>>>>> +        err = ENOMEM;
>>>>>>> +        goto out;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>> +    dev->mtu = mtu;
>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>> +
>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>> +out:
>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>> +    return err;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static int
>>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const
>>> struct smap *args)
>>>>>>> +{
>>>>>>> +    int mtu;
>>>>>>> +
>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>> +
>>>>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>>>> +}
>>>>>>> +
>>>>>>>  static void
>>>>>>>  dpdk_common_init(void)
>>>>>>>  {
>>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>>>>>      return err;
>>>>>>>  }
>>>>>>>
>>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,
> MULTIQ, SEND, \
>>>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \
>>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>>>>>>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>>>>>>> +        GET_STATUS, RXQ_RECV)                                          \
>>>>>>>  {                                                             \
>>>>>>>      NAME,                                                     \
>>>>>>>      INIT,                       /* init */                    \
>>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>>>>>      DESTRUCT,                                                 \
>>>>>>>      netdev_dpdk_dealloc,                                      \
>>>>>>>      netdev_dpdk_get_config,                                   \
>>>>>>> -    NULL,                       /* netdev_dpdk_set_config */  \
>>>>>>> +    SET_CONFIG,                                               \
>>>>>>>      NULL,                       /* get_tunnel_config */       \
>>>>>>>      NULL,                       /* build header */            \
>>>>>>>      NULL,                       /* push header */             \
>>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>>>>>      netdev_dpdk_set_etheraddr,                                \
>>>>>>>      netdev_dpdk_get_etheraddr,                                \
>>>>>>>      netdev_dpdk_get_mtu,                                      \
>>>>>>> -    netdev_dpdk_set_mtu,                                      \
>>>>>>> +    SET_MTU,                                                  \
>>>>>>>      netdev_dpdk_get_ifindex,                                  \
>>>>>>>      GET_CARRIER,                                              \
>>>>>>>      netdev_dpdk_get_carrier_resets,                           \
>>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>>>>>          NULL,
>>>>>>>          netdev_dpdk_construct,
>>>>>>>          netdev_dpdk_destruct,
>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>          netdev_dpdk_eth_send,
>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>          netdev_dpdk_get_features,
>>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>>>>>>>          NULL,
>>>>>>>          netdev_dpdk_ring_construct,
>>>>>>>          netdev_dpdk_destruct,
>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>          netdev_dpdk_ring_send,
>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>          netdev_dpdk_get_features,
>>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class
>>> OVS_UNUSED dpdk_vhost_cuse_class
>>>>=
>>>>>>>          dpdk_vhost_cuse_class_init,
>>>>>>>          netdev_dpdk_vhost_cuse_construct,
>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>> +        NULL,
>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>> +        NULL,
>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>          NULL,
>>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class
>>> OVS_UNUSED dpdk_vhost_user_class
>>>>=
>>>>>>>          dpdk_vhost_user_class_init,
>>>>>>>          netdev_dpdk_vhost_user_construct,
>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>> +        netdev_dpdk_vhost_set_config,
>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>> +        netdev_dpdk_vhost_set_mtu,
>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>          NULL,
>>>>>_______________________________________________
>>>>>dev mailing list
>>>>>dev@openvswitch.org
>>>>>http://openvswitch.org/mailman/listinfo/dev
>>>>_______________________________________________
>>>>dev mailing list
>>>>dev@openvswitch.org
>>>>http://openvswitch.org/mailman/listinfo/dev
>><#secure method=pgpmime mode=encrypt>
Mark Kavanagh Jan. 29, 2016, 4:20 p.m. UTC | #8
>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>
>>>
>>>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>>>> Further clarifications below Aaron - please let me know if you have
>>>> any additional comments.
>>>>
>>>> Thanks,
>>>> Mark
>>>
>>>Thanks again for this work, Mark!
>>>
>>>>>
>>>>>One addendum below, Aaron.
>>>>>
>>>>>>
>>>>>>>Hi Mark,
>>>>>>>
>>>>>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>>>>>my first-blush comments are inline.
>>>>>>
>>>>>>Thanks for the review Aaron!
>>>>>>
>>>>>>Any additional comments are welcome - my initial responses are
>> inline, below.
>>>>>>
>>>>>>Thanks,
>>>>>>Mark
>>>>>>
>>>>>>>
>>>>>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>>>>>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>>>>>> using single-segment-mbufs.
>>>>>>>>
>>>>>>>> Using this approach, the amount of memory allocated for each mbuf
>>>>>>>> to store frame data is increased to a value greater than 1518B
>>>>>>>> (typical Ethernet maximum frame length). The increased space
>>>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>>>>> segments.
>>>>>>>>
>>>>>>>> The amount of space allocated to each mbuf to hold frame data is
>>>>>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>>>>>
>>>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>>>>> ---
>>>>>>>
>>>>>>>This is a new feature which has user-visible impact, so should be
>>>>>>>announced in NEWS (plus, it's great news).
>>>>>>
>>>>>>No problem - I'll add an entry.
>>>>>>
>>>>>>>
>>>>>>>>  INSTALL.DPDK.md   |  59 ++++++++-
>>>>>>>> lib/netdev-dpdk.c | 356
>>>> +++++++++++++++++++++++++++++++++++++++++-------------
>>>>>>>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>>>>> index 96b686c..2f23e27 100644
>>>>>>>> --- a/INSTALL.DPDK.md
>>>>>>>> +++ b/INSTALL.DPDK.md
>>>>>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>>>>>> to <interface> sections of all network devices used by
>> DPDK. Parameter 'N'
>>>>>>>>  determines how many queues can be used by the guest.
>>>>>>>>
>>>>>>>> +
>>>>>>>> +Jumbo Frames
>>>>>>>> +------------
>>>>>>>> +
>>>>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>>>>>>>> +
>>>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to
>>>> the ovs-vsctl
>>>>>>>> +'add-port' command-line, along with the required MTU for the port.
>>>>>>>> +e.g.
>>>>>>>
>>>>>>>This text and the example do not match. I think '--dpdk-mtu' is not valid.
>>>>>>
>>>>>>Good catch, thanks.
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +     ```
>>>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>>> options:dpdk-mtu=9000
>>>>>>>> +     ```
>>>>>>>> +
>>>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
>> segments are
>>>>>>>> +increased, such that a full Jumbo Frame may be accommodated
>>>> inside a single
>>>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>>>>>
>>>>>>>Why no support? DPDK supports changing the mtu.
>>>>>>
>>>>>>I guess my rationale here is that an MTU change can't be triggered
>>>> via OVS command-line, nor
>>>>>>can it be triggered programmatically via DPDK (apart from an
>> explicit call to
>>>>>>rte_eth_dev_set_mtu).
>>>>>>So, while technically it's possibly, from a user's point of view,
>>>> there's no way to
>>>>>configure
>>>>>>it, outside of modifying the code directly. If I've missed something
>>>> here, please let me
>>>>>>know.
>>>>>>
>>>>>>>
>>>>>>>> +Jumbo frame support has been validated against 13312B frames, using the
>>>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>>>>>> +theoretically be supported. Supported port types excludes
>>>> vHost-Cuse ports, as
>>>>>>>> +this feature is pending deprecation.
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +vHost Ports and Jumbo Frames
>>>>>>>> +----------------------------
>>>>>>>> +Jumbo frame support is available for DPDK vHost-User ports
>>>> only. Some additional
>>>>>>>> +configuration is needed to take advantage of this feature:
>>>>>>>> +
>>>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as
>>>> demonstrated in
>>>>>>>> +      the QEMU command line snippet below:
>>>>>>>> +
>>>>>>>> +      ```
>>>>>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>>>>> + '-device
>>>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>>>>> +      ```
>>>>>>>> +
>>>>>>>> + 2. Where virtio devices are bound to the Linux kernel driver
>> in a guest
>>>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK
>>>> driver), the
>>>>>>>> + MTU of those logical network interfaces must also be
>> increased. This
>>>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that
>>>> 'MTU' refers
>>>>>>>> + to the length of the IP packet only, and not that of the entire
>>>> frame.
>>>>>>>> +
>>>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame,
>>>> subtract the L2
>>>>>>>> + header and CRC lengths (i.e. 18B) from the max supported
>> frame size.
>>>>>>>> +     So, to set the MTU for a 13312B Jumbo Frame:
>>>>>>>> +
>>>>>>>> +      ```
>>>>>>>> +      ifconfig eth1 mtu 13294
>>>>>>>> +      ```
>>>>>>>> +
>>>>>>>> +
>>>>>>>>  Restrictions:
>>>>>>>>  -------------
>>>>>>>>
>>>>>>>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>>>>>>>    - Currently DPDK port does not make use any offload functionality.
>>>>>>>>    - DPDK-vHost support works with 1G huge pages.
>>>>>>>>
>>>>>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>>>>>> the next release of DPDK (which includes the above patch) is
>>>> available and
>>>>>>>>      integrated into OVS.
>>>>>>>>
>>>>>>>> +  Jumbo Frames:
>>>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit
>>>> gracefully. The source of
>>>>>>>> +     this issue is currently being investigated.
>>>>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost
>> Cuse ports.
>>>>>>>> +
>>>>>>>> +
>>>>>>>>  Bug Reporting:
>>>>>>>>  --------------
>>>>>>>>
>>>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>>>>> index de7e488..5f23e28 100644
>>>>>>>> --- a/lib/netdev-dpdk.c
>>>>>>>> +++ b/lib/netdev-dpdk.c
>>>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>>>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>>>>>
>>>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>>>>> +
>>>>>>>>  /*
>>>>>>>>   * need to reserve tons of extra space in the mbufs so we can align the
>>>>>>>>   * DMA addresses to 4KB.
>>>>>>>> * The minimum mbuf size is limited to avoid scatter behaviour
>> and drop in
>>>>>>>>   * performance for standard Ethernet MTU.
>>>>>>>>   */
>>>>>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>>>>> -                              + sizeof(struct dp_packet) \
>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>> -#define MBUF_SIZE_DRIVER     (2048                       \
>>>>>>>> -                              + sizeof (struct rte_mbuf) \
>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
>>>> ETHER_CRC_LEN)
>>>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN
>>>> - ETHER_CRC_LEN)
>>>>>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>>>>> +                                    + sizeof(struct dp_packet)   \
>>>>>>>> +                                    + RTE_PKTMBUF_HEADROOM)
>>>>>>>> +
>>>>>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's
>>>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>>>>> + */
>>>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>>>>>
>>>>>>>> /* Max and min number of packets in the mempool.  OVS tries to
>> allocate a
>>>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system
>>>> doesn't have
>>>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>  #define MIN_NB_MBUF          (4096 * 4)
>>>>>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>>>>>
>>>>>>>> +#define DPDK_VLAN_TAG_LEN    4
>>>>>>>> +
>>>>>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>>>>>
>>>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>>>>>          .header_split   = 0, /* Header Split disabled */
>>>>>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>>>>>          .hw_strip_crc   = 0,
>>>>>>>>      },
>>>>>>>>      .rx_adv_conf = {
>>>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>>>>>>      return class->construct == netdev_dpdk_construct;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K
>>>> for igb_uio).
>>>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient
>>>> buffers are
>>>>>>>> + * allocated to accomodate the packet in its
>>>> entirety. Furthermore, the igb_uio
>>>>>>>> + * driver needs to ensure that there is also sufficient space in
>>>> the Rx buffer
>>>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX
>> buffer is too
>>>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>>>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>>>>>> + */
>>>>>>>> +static uint32_t
>>>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>>>>> +{
>>>>>>>> +    struct rte_eth_dev_info info;
>>>>>>>> +    uint32_t buf_size;
>>>>>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>>>>> +
>>>>>>>> +    if(netdev->type == DPDK_DEV_ETH) {
>>>>>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>>>>> +        buf_size = (info.min_rx_bufsize == 0) ?
>>>>>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>>>>> +                   info.min_rx_bufsize;
>>>>>>>
>>>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>>>>>
>>>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather
>>>> the buffer size used
>>>>>>when creating the rte_mempool, which must be a multiple of the
>>>> driver's min_rx_bufsize.
>>>>>>
>>>>>>>
>>>>>>>> +    } else {
>>>>>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if(len % buf_size) {
>>>>>>>> +        len = buf_size * ((len/buf_size) + 1);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return len;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>>>>>>>   * for all other segments data, bss and text. */
>>>>>>>>
>>>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  static void
>>>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>> -                   void *opaque_arg OVS_UNUSED,
>>>>>>>> -                   void *_m,
>>>>>>>> -                   unsigned i OVS_UNUSED)
>>>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>>>>>>  {
>>>>>>>> -    struct rte_mbuf *m = _m;
>>>>>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>>>>> +    uint16_t roomsz;
>>>>>>>>
>>>>>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>>>>>
>>>>>>>> -    memset(m, 0, mp->elt_size);
>>>>>>>> +    /* if no structure is provided, assume no mbuf private area */
>>>>>>>> +
>>>>>>>> +    user_mbp_priv = opaque_arg;
>>>>>>>> +    if (user_mbp_priv == NULL) {
>>>>>>>> +        default_mbp_priv.mbuf_priv_size = 0;
>>>>>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>> +        } else {
>>>>>>>> +            roomsz = 0;
>>>>>>>> +        }
>>>>>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>>>>> +        user_mbp_priv = &default_mbp_priv;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>>>>> +        user_mbp_priv->mbuf_data_room_size +
>>>>>>>> +        user_mbp_priv->mbuf_priv_size);
>>>>>>>> +
>>>>>>>> +    mbp_priv = rte_mempool_get_priv(mp);
>>>>>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/* Initialise some fields in the mbuf structure that are not
>>>> modified by the
>>>>>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>>>>>> +static void
>>>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>> +                       void *opaque_arg OVS_UNUSED,
>>>>>>>> +                       void *_m,
>>>>>>>> +                       unsigned i OVS_UNUSED)
>>>>>>>> +{
>>>>>>>> +	struct rte_mbuf *m = _m;
>>>>>>>> +	uint32_t buf_size, buf_len, priv_size;
>>>>>>>> +
>>>>>>>> +	priv_size = rte_pktmbuf_priv_size(mp);
>>>>>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>>>>>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>>>>>>>
>>>>>>>> -    /* start of buffer is just after mbuf structure */
>>>>>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>>>>> -                    sizeof(struct dp_packet);
>>>>>>>> -    m->buf_len = (uint16_t)buf_len;
>>>>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>> priv_size);
>>>>>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>>>>>
>>>>>>>> -    /* keep some headroom between start of buffer and data */
>>>>>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>>>>> +	memset(m, 0, mp->elt_size);
>>>>>>>>
>>>>>>>> -    /* init some constant fields */
>>>>>>>> -    m->pool = mp;
>>>>>>>> -    m->nb_segs = 1;
>>>>>>>> -    m->port = 0xff;
>>>>>>>> +	/* start of buffer is after dp_packet structure and priv data */
>>>>>>>> +	m->priv_size = priv_size;
>>>>>>>> +	m->buf_addr = (char *)m + buf_size;
>>>>>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>>>>> +	m->buf_len = (uint16_t)buf_len;
>>>>>>>> +
>>>>>>>> +	/* keep some headroom between start of buffer and data */
>>>>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
>> (uint16_t)m->buf_len);
>>>>>>>> +
>>>>>>>> +	/* init some constant fields */
>>>>>>>> +	m->pool = mp;
>>>>>>>> +	m->nb_segs = 1;
>>>>>>>> +	m->port = 0xff;
>>>>>>>
>>>>>>>Please don't mix tabs and spaces in the file.
>>>>>>
>>>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>>>>>
>>>>>>>
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  static void
>>>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>  {
>>>>>>>>      struct rte_mbuf *m = _m;
>>>>>>>>
>>>>>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>>>>>
>>>>>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>>>>>>  }
>>>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>      struct dpdk_mp *dmp = NULL;
>>>>>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>>>>>      unsigned mp_size;
>>>>>>>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>>>>>>>
>>>>>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>      dmp->socket_id = socket_id;
>>>>>>>>      dmp->mtu = mtu;
>>>>>>>>      dmp->refcount = 1;
>>>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>>> RTE_PKTMBUF_HEADROOM;
>>>>>>>> +    mbp_priv.mbuf_priv_size = 0;
>>>>>>>>
>>>>>>>>      mp_size = MAX_NB_MBUF;
>>>>>>>>      do {
>>>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>              return NULL;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size,
>>>> MBUF_SEGMENT_SIZE(mtu),
>>>>>>>>                                       MP_CACHE_SZ,
>>>>>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>>>>>> -                                     rte_pktmbuf_pool_init, NULL,
>>>>>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>>>>>>                                       ovs_rte_pktmbuf_init, NULL,
>>>>>>>>                                       socket_id, 0);
>>>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>> MIN_NB_MBUF);
>>>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>> *dev, int n_rxq, int
>>>>>>n_txq)
>>>>>>>>  {
>>>>>>>>      int diag = 0;
>>>>>>>>      int i;
>>>>>>>> +    struct rte_eth_conf conf = port_conf;
>>>>>>>>
>>>>>>>>      /* A device may report more queues than it makes available (this has
>>>>>>>>       * been observed for Intel xl710, which reserves some of them for
>>>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>> *dev, int n_rxq, int
>>>>>>>n_txq)
>>>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>>>>>>          }
>>>>>>>>
>>>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>> &port_conf);
>>>>>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>>>>>>>          if (diag) {
>>>>>>>>              break;
>>>>>>>>          }
>>>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>> unsigned int port_no,
>>>>>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>>>>>      int sid;
>>>>>>>>      int err = 0;
>>>>>>>> +    uint32_t buf_size;
>>>>>>>>
>>>>>>>>      ovs_mutex_init(&netdev->mutex);
>>>>>>>>      ovs_mutex_lock(&netdev->mutex);
>>>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>> unsigned int port_no,
>>>>>>>>      netdev->port_id = port_no;
>>>>>>>>      netdev->type = type;
>>>>>>>>      netdev->flags = 0;
>>>>>>>> +
>>>>>>>> + /* Initialize port's MTU and frame len to the default
>> Ethernet values.
>>>>>>>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>>>>>> +     * netdev_dpdk_set_config.
>>>>>>>> +     */
>>>>>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>>>>>      netdev->mtu = ETHER_MTU;
>>>>>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>>>>>
>>>>>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>>> FRAME_LEN_TO_MTU(buf_size));
>>>>>>>>      if (!netdev->dpdk_mp) {
>>>>>>>>          err = ENOMEM;
>>>>>>>>          goto unlock;
>>>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[],
>>>> const char prefix[],
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static void
>>>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>>>>> +{
>>>>>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>>>>> +    int local_mtu;
>>>>>>>> +
>>>>>>>> +    if(mtu_str) {
>>>>>>>> +        local_mtu = atoi(mtu_str);
>>>>>>>
>>>>>>>Please use strtol or strtoul here, and detect errors in the string by
>>>>>>>checking endptr. That way, we can be sure that random garbage on the
>>>>>>>stack won't accidentally become the mtu.
>>>>>>
>>>>>>Will do, thanks.
>>>>>>
>>>>>>>
>>>>>>>> +    }
>>>>>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>>>>>> +        local_mtu = ETHER_MTU;
>>>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>>> to %d.\n",
>>>>>>>> +                   local_mtu);
>>>>>>>> +    }
>>>>>>>> +    *mtu = local_mtu;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static int
>>>>>>>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>>>>>>  {
>>>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev
>>>> *netdev_, struct smap
>>>>>>>*args)
>>>>>>>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>>>>>>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>>>>>> smap_add_format(args, "configured_tx_queues", "%d",
>> dev->real_n_txq);
>>>>>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>>>>>      ovs_mutex_unlock(&dev->mutex);
>>>>>>>>
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>> +{
>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>> +    int old_mtu, err;
>>>>>>>> +    uint32_t buf_size;
>>>>>>>> +    int dpdk_mtu;
>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>> +
>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>> +        err = 0;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>>>>> +
>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>>>>> +    if (!mp) {
>>>>>>>> +        err = ENOMEM;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    rte_eth_dev_stop(dev->port_id);
>>>>>>>> +
>>>>>>>> +    old_mtu = dev->mtu;
>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>> +    dev->mtu = mtu;
>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +
>>>>>>>> +    err = dpdk_eth_dev_init(dev);
>>>>>>>
>>>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>>>>>rte_eth_dev_start() instead?
>>>>
>>>> The rx queues for the device need to be assigned the mempool that was
>>>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the
>>>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it
>>>> here.
>>>>
>>>>>>
>>>>>>This is legacy code, but I can refactor if needs be.
>>>>>>
>>>>>>>
>>>>>>>> +    if (err) {
>>>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>>> known "
>>>>>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>>>>> +        dpdk_mp_put(mp);
>>>>>>>> +        dev->mtu = old_mtu;
>>>>>>>> +        dev->dpdk_mp = old_mp;
>>>>>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +        err = dpdk_eth_dev_init(dev);
>>>>>>>
>>>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>>>>>will cause a queue reconfiguration action, which is not needed, and can
>>>>>>>error.
>>>>
>>>> This call is needed, in case the eth_dev's queue configuration was
>>>> changed by the previous call.
>>>> The error handling for this call is arguably outside the scope of this
>>>> patch, and could be submitted as a separate patch - what do you think?
>>>
>>>That's probably a good idea.
>>>
>>>>>>>
>>>>>>
>>>>>>As above.
>>>>>>
>>>>>>>If you are intending to set MTU, I would have expected calls to
>>>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>>>>>recognize that much of this is legacy, but since you are touching it :)
>>>>>>
>>>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on
>>>> in the patch; I can take
>>>>>>a second look now that the code is more stable though.
>>>>>
>>>>>I remembered why I didn't use this.
>>>>>
>>>>>If you look at the IXGBE driver's mtu_set function, it refuses frame
>>>> lengths which are larger
>>>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>>>>>This might be avoided by configuring the device for scatter RX
>>>> behavior, but this incurs a
>>>>>performance impact.
>>>
>>>It seems strange to me that we need to ignore the library API for this
>>>one case, right? What if the user is only doing vhostuser PMDs mixed
>>>with a bnx2x card?
>>>
>>
>> I can't say that I've familiar enough with this setup to form an
>> answer on that one Aaron.
>>
>>>It would be better to get it fixed in the library, I think. After all,
>>>they gave a generic setting to change MTU - why don't they make it
>>>behave well?
>>>
>>
>> Let me follow up on that with some local DPDK folks here.
>>
>> Would you consider this a showstopper for upstreaming of this feature?
>> If we need to wait for a change to DPDK, Jumbo Frames will miss the
>> 2.5 window.
>
>I don't think it would be a show stopper as long as other devices wouldn't
>be negatively impacted, but it's a place where a /*XXX: This is a
>workaround for dpdk*/ comment should be included so that we can fix it
>when/if it is updated in the library. What do you think?
>

Sounds good - I'll add a comment in the code and post V3 of the patch, incorporating this and the rework from your earlier comments.
I'll follow up with the DPDK team offline to see what the course of action to address this issue is.

Thanks for your input!

>>>>><snippet>
>>>>>ixgbe_dev_mtu_set(...) {
>>>>>
>>>>>...
>>>>>
>>>>>    /* refuse mtu that requires the support of scattered packets when this
>>>>>     * feature has not been enabled before. */
>>>>>    if (!dev->data->scattered_rx &&
>>>>>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
>>>>>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
>>>>>        return -EINVAL;
>>>>>
>>>>>...
>>>>>}
>>>>></snippet>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>> +out:
>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>> +    return err;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>>>>>> +{
>>>>>>>> +    int mtu;
>>>>>>>> +
>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>> +
>>>>>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static int
>>>>>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>>>>>>  {
>>>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev
>>>> *netdev, int *mtup)
>>>>>>>>
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>> -
>>>>>>>> -static int
>>>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>> -{
>>>>>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>> -    int old_mtu, err;
>>>>>>>> -    struct dpdk_mp *old_mp;
>>>>>>>> -    struct dpdk_mp *mp;
>>>>>>>> -
>>>>>>>> -    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>> -    ovs_mutex_lock(&dev->mutex);
>>>>>>>> -    if (dev->mtu == mtu) {
>>>>>>>> -        err = 0;
>>>>>>>> -        goto out;
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>>>>> -    if (!mp) {
>>>>>>>> -        err = ENOMEM;
>>>>>>>> -        goto out;
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>> -    rte_eth_dev_stop(dev->port_id);
>>>>>>>> -
>>>>>>>> -    old_mtu = dev->mtu;
>>>>>>>> -    old_mp = dev->dpdk_mp;
>>>>>>>> -    dev->dpdk_mp = mp;
>>>>>>>> -    dev->mtu = mtu;
>>>>>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>> -
>>>>>>>> -    err = dpdk_eth_dev_init(dev);
>>>>>>>> -    if (err) {
>>>>>>>> -        dpdk_mp_put(mp);
>>>>>>>> -        dev->mtu = old_mtu;
>>>>>>>> -        dev->dpdk_mp = old_mp;
>>>>>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>> -        dpdk_eth_dev_init(dev);
>>>>>>>> -        goto out;
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>> -    dpdk_mp_put(old_mp);
>>>>>>>> -    netdev_change_seq_changed(netdev);
>>>>>>>> -out:
>>>>>>>> -    ovs_mutex_unlock(&dev->mutex);
>>>>>>>> -    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>> -    return err;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>>  static int
>>>>>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>>>>>
>>>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev
>>>> *netdev_, struct smap
>>>>>>>*args)
>>>>>>>> smap_add_format(args, "numa_id", "%d",
>>>> rte_eth_dev_socket_id(dev->port_id));
>>>>>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>>>>> smap_add_format(args, "min_rx_bufsize", "%u",
>>>> dev_info.min_rx_bufsize);
>>>>>>>> - smap_add_format(args, "max_rx_pktlen", "%u",
>> dev_info.max_rx_pktlen);
>>>>>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>>>>> smap_add_format(args, "max_rx_queues", "%u",
>> dev_info.max_rx_queues);
>>>>>>>> smap_add_format(args, "max_tx_queues", "%u",
>> dev_info.max_tx_queues);
>>>>>>>> smap_add_format(args, "max_mac_addrs", "%u",
>> dev_info.max_mac_addrs);
>>>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>> +{
>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>> +    int err = 0;
>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>> +
>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>> +        err = 0;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>>>>> +    if (!mp) {
>>>>>>>> +        err = ENOMEM;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>> +    dev->mtu = mtu;
>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +
>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>> +out:
>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>> +    return err;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const
>>>> struct smap *args)
>>>>>>>> +{
>>>>>>>> +    int mtu;
>>>>>>>> +
>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>> +
>>>>>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static void
>>>>>>>>  dpdk_common_init(void)
>>>>>>>>  {
>>>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>>>>>>      return err;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,
>> MULTIQ, SEND, \
>>>>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \
>>>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>>>>>>>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>>>>>>>> +        GET_STATUS, RXQ_RECV)                                          \
>>>>>>>>  {                                                             \
>>>>>>>>      NAME,                                                     \
>>>>>>>>      INIT,                       /* init */                    \
>>>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>>>>>>      DESTRUCT,                                                 \
>>>>>>>>      netdev_dpdk_dealloc,                                      \
>>>>>>>>      netdev_dpdk_get_config,                                   \
>>>>>>>> -    NULL,                       /* netdev_dpdk_set_config */  \
>>>>>>>> +    SET_CONFIG,                                               \
>>>>>>>>      NULL,                       /* get_tunnel_config */       \
>>>>>>>>      NULL,                       /* build header */            \
>>>>>>>>      NULL,                       /* push header */             \
>>>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>>>>>>      netdev_dpdk_set_etheraddr,                                \
>>>>>>>>      netdev_dpdk_get_etheraddr,                                \
>>>>>>>>      netdev_dpdk_get_mtu,                                      \
>>>>>>>> -    netdev_dpdk_set_mtu,                                      \
>>>>>>>> +    SET_MTU,                                                  \
>>>>>>>>      netdev_dpdk_get_ifindex,                                  \
>>>>>>>>      GET_CARRIER,                                              \
>>>>>>>>      netdev_dpdk_get_carrier_resets,                           \
>>>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>>>>>>          NULL,
>>>>>>>>          netdev_dpdk_construct,
>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>          netdev_dpdk_eth_send,
>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>>>>>>>>          NULL,
>>>>>>>>          netdev_dpdk_ring_construct,
>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>          netdev_dpdk_ring_send,
>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class
>>>> OVS_UNUSED dpdk_vhost_cuse_class
>>>>>=
>>>>>>>>          dpdk_vhost_cuse_class_init,
>>>>>>>>          netdev_dpdk_vhost_cuse_construct,
>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>> +        NULL,
>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>> +        NULL,
>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>          NULL,
>>>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class
>>>> OVS_UNUSED dpdk_vhost_user_class
>>>>>=
>>>>>>>>          dpdk_vhost_user_class_init,
>>>>>>>>          netdev_dpdk_vhost_user_construct,
>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>> +        netdev_dpdk_vhost_set_config,
>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>> +        netdev_dpdk_vhost_set_mtu,
>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>          NULL,
>>>>>>_______________________________________________
>>>>>>dev mailing list
>>>>>>dev@openvswitch.org
>>>>>>http://openvswitch.org/mailman/listinfo/dev
>>>>>_______________________________________________
>>>>>dev mailing list
>>>>>dev@openvswitch.org
>>>>>http://openvswitch.org/mailman/listinfo/dev
>>><#secure method=pgpmime mode=encrypt>
Mark Kavanagh Feb. 16, 2016, 11:43 a.m. UTC | #9
>
>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>
>>>
>>>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>>>> Further clarifications below Aaron - please let me know if you have
>>>> any additional comments.
>>>>
>>>> Thanks,
>>>> Mark
>>>
>>>Thanks again for this work, Mark!
>>>
>>>>>
>>>>>One addendum below, Aaron.
>>>>>
>>>>>>
>>>>>>>Hi Mark,
>>>>>>>
>>>>>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>>>>>my first-blush comments are inline.
>>>>>>
>>>>>>Thanks for the review Aaron!
>>>>>>
>>>>>>Any additional comments are welcome - my initial responses are
>> inline, below.
>>>>>>
>>>>>>Thanks,
>>>>>>Mark
>>>>>>
>>>>>>>
>>>>>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>>>>>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>>>>>> using single-segment-mbufs.
>>>>>>>>
>>>>>>>> Using this approach, the amount of memory allocated for each mbuf
>>>>>>>> to store frame data is increased to a value greater than 1518B
>>>>>>>> (typical Ethernet maximum frame length). The increased space
>>>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>>>>> segments.
>>>>>>>>
>>>>>>>> The amount of space allocated to each mbuf to hold frame data is
>>>>>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>>>>>
>>>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>>>>> ---
>>>>>>>
>>>>>>>This is a new feature which has user-visible impact, so should be
>>>>>>>announced in NEWS (plus, it's great news).
>>>>>>
>>>>>>No problem - I'll add an entry.
>>>>>>
>>>>>>>
>>>>>>>>  INSTALL.DPDK.md   |  59 ++++++++-
>>>>>>>> lib/netdev-dpdk.c | 356
>>>> +++++++++++++++++++++++++++++++++++++++++-------------
>>>>>>>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>>>>> index 96b686c..2f23e27 100644
>>>>>>>> --- a/INSTALL.DPDK.md
>>>>>>>> +++ b/INSTALL.DPDK.md
>>>>>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>>>>>> to <interface> sections of all network devices used by
>> DPDK. Parameter 'N'
>>>>>>>>  determines how many queues can be used by the guest.
>>>>>>>>
>>>>>>>> +
>>>>>>>> +Jumbo Frames
>>>>>>>> +------------
>>>>>>>> +
>>>>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
>>>>>>>> +
>>>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to
>>>> the ovs-vsctl
>>>>>>>> +'add-port' command-line, along with the required MTU for the port.
>>>>>>>> +e.g.
>>>>>>>
>>>>>>>This text and the example do not match. I think '--dpdk-mtu' is not valid.
>>>>>>
>>>>>>Good catch, thanks.
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +     ```
>>>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>>> options:dpdk-mtu=9000
>>>>>>>> +     ```
>>>>>>>> +
>>>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
>> segments are
>>>>>>>> +increased, such that a full Jumbo Frame may be accommodated
>>>> inside a single
>>>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>>>>>
>>>>>>>Why no support? DPDK supports changing the mtu.
>>>>>>
>>>>>>I guess my rationale here is that an MTU change can't be triggered
>>>> via OVS command-line, nor
>>>>>>can it be triggered programmatically via DPDK (apart from an
>> explicit call to
>>>>>>rte_eth_dev_set_mtu).
>>>>>>So, while technically it's possibly, from a user's point of view,
>>>> there's no way to
>>>>>configure
>>>>>>it, outside of modifying the code directly. If I've missed something
>>>> here, please let me
>>>>>>know.
>>>>>>
>>>>>>>
>>>>>>>> +Jumbo frame support has been validated against 13312B frames, using the
>>>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
>>>>>>>> +theoretically be supported. Supported port types excludes
>>>> vHost-Cuse ports, as
>>>>>>>> +this feature is pending deprecation.
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +vHost Ports and Jumbo Frames
>>>>>>>> +----------------------------
>>>>>>>> +Jumbo frame support is available for DPDK vHost-User ports
>>>> only. Some additional
>>>>>>>> +configuration is needed to take advantage of this feature:
>>>>>>>> +
>>>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as
>>>> demonstrated in
>>>>>>>> +      the QEMU command line snippet below:
>>>>>>>> +
>>>>>>>> +      ```
>>>>>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>>>>> + '-device
>>>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>>>>> +      ```
>>>>>>>> +
>>>>>>>> + 2. Where virtio devices are bound to the Linux kernel driver
>> in a guest
>>>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK
>>>> driver), the
>>>>>>>> + MTU of those logical network interfaces must also be
>> increased. This
>>>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that
>>>> 'MTU' refers
>>>>>>>> + to the length of the IP packet only, and not that of the entire
>>>> frame.
>>>>>>>> +
>>>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame,
>>>> subtract the L2
>>>>>>>> + header and CRC lengths (i.e. 18B) from the max supported
>> frame size.
>>>>>>>> +     So, to set the MTU for a 13312B Jumbo Frame:
>>>>>>>> +
>>>>>>>> +      ```
>>>>>>>> +      ifconfig eth1 mtu 13294
>>>>>>>> +      ```
>>>>>>>> +
>>>>>>>> +
>>>>>>>>  Restrictions:
>>>>>>>>  -------------
>>>>>>>>
>>>>>>>> -  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
>>>>>>>>    - Currently DPDK port does not make use any offload functionality.
>>>>>>>>    - DPDK-vHost support works with 1G huge pages.
>>>>>>>>
>>>>>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>>>>>> the next release of DPDK (which includes the above patch) is
>>>> available and
>>>>>>>>      integrated into OVS.
>>>>>>>>
>>>>>>>> +  Jumbo Frames:
>>>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit
>>>> gracefully. The source of
>>>>>>>> +     this issue is currently being investigated.
>>>>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost
>> Cuse ports.
>>>>>>>> +
>>>>>>>> +
>>>>>>>>  Bug Reporting:
>>>>>>>>  --------------
>>>>>>>>
>>>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>>>>> index de7e488..5f23e28 100644
>>>>>>>> --- a/lib/netdev-dpdk.c
>>>>>>>> +++ b/lib/netdev-dpdk.c
>>>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>>>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>>>>>
>>>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>>>>> +
>>>>>>>>  /*
>>>>>>>>   * need to reserve tons of extra space in the mbufs so we can align the
>>>>>>>>   * DMA addresses to 4KB.
>>>>>>>> * The minimum mbuf size is limited to avoid scatter behaviour
>> and drop in
>>>>>>>>   * performance for standard Ethernet MTU.
>>>>>>>>   */
>>>>>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>>>>> -                              + sizeof(struct dp_packet) \
>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>> -#define MBUF_SIZE_DRIVER     (2048                       \
>>>>>>>> -                              + sizeof (struct rte_mbuf) \
>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
>>>> ETHER_CRC_LEN)
>>>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN
>>>> - ETHER_CRC_LEN)
>>>>>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>>>>> +                                    + sizeof(struct dp_packet)   \
>>>>>>>> +                                    + RTE_PKTMBUF_HEADROOM)
>>>>>>>> +
>>>>>>>> +/* This value should be specified as a multiple of the DPDK NIC driver's
>>>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>>>>> + */
>>>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>>>>>
>>>>>>>> /* Max and min number of packets in the mempool.  OVS tries to
>> allocate a
>>>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system
>>>> doesn't have
>>>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>  #define MIN_NB_MBUF          (4096 * 4)
>>>>>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>>>>>
>>>>>>>> +#define DPDK_VLAN_TAG_LEN    4
>>>>>>>> +
>>>>>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>>>>>
>>>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>>>>>          .header_split   = 0, /* Header Split disabled */
>>>>>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>>>>>          .hw_strip_crc   = 0,
>>>>>>>>      },
>>>>>>>>      .rx_adv_conf = {
>>>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>>>>>>      return class->construct == netdev_dpdk_construct;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K
>>>> for igb_uio).
>>>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient
>>>> buffers are
>>>>>>>> + * allocated to accomodate the packet in its
>>>> entirety. Furthermore, the igb_uio
>>>>>>>> + * driver needs to ensure that there is also sufficient space in
>>>> the Rx buffer
>>>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX
>> buffer is too
>>>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>>>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>>>>>> + */
>>>>>>>> +static uint32_t
>>>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>>>>> +{
>>>>>>>> +    struct rte_eth_dev_info info;
>>>>>>>> +    uint32_t buf_size;
>>>>>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>>>>> +
>>>>>>>> +    if(netdev->type == DPDK_DEV_ETH) {
>>>>>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>>>>> +        buf_size = (info.min_rx_bufsize == 0) ?
>>>>>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>>>>> +                   info.min_rx_bufsize;
>>>>>>>
>>>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>>>>>
>>>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather
>>>> the buffer size used
>>>>>>when creating the rte_mempool, which must be a multiple of the
>>>> driver's min_rx_bufsize.
>>>>>>
>>>>>>>
>>>>>>>> +    } else {
>>>>>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    if(len % buf_size) {
>>>>>>>> +        len = buf_size * ((len/buf_size) + 1);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return len;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
>>>>>>>>   * for all other segments data, bss and text. */
>>>>>>>>
>>>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  static void
>>>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>> -                   void *opaque_arg OVS_UNUSED,
>>>>>>>> -                   void *_m,
>>>>>>>> -                   unsigned i OVS_UNUSED)
>>>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>>>>>>  {
>>>>>>>> -    struct rte_mbuf *m = _m;
>>>>>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>>>>> +    uint16_t roomsz;
>>>>>>>>
>>>>>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>>>>>
>>>>>>>> -    memset(m, 0, mp->elt_size);
>>>>>>>> +    /* if no structure is provided, assume no mbuf private area */
>>>>>>>> +
>>>>>>>> +    user_mbp_priv = opaque_arg;
>>>>>>>> +    if (user_mbp_priv == NULL) {
>>>>>>>> +        default_mbp_priv.mbuf_priv_size = 0;
>>>>>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>> +        } else {
>>>>>>>> +            roomsz = 0;
>>>>>>>> +        }
>>>>>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>>>>> +        user_mbp_priv = &default_mbp_priv;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>>>>> +        user_mbp_priv->mbuf_data_room_size +
>>>>>>>> +        user_mbp_priv->mbuf_priv_size);
>>>>>>>> +
>>>>>>>> +    mbp_priv = rte_mempool_get_priv(mp);
>>>>>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/* Initialise some fields in the mbuf structure that are not
>>>> modified by the
>>>>>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>>>>>> +static void
>>>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>> +                       void *opaque_arg OVS_UNUSED,
>>>>>>>> +                       void *_m,
>>>>>>>> +                       unsigned i OVS_UNUSED)
>>>>>>>> +{
>>>>>>>> +	struct rte_mbuf *m = _m;
>>>>>>>> +	uint32_t buf_size, buf_len, priv_size;
>>>>>>>> +
>>>>>>>> +	priv_size = rte_pktmbuf_priv_size(mp);
>>>>>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>>>>>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>>>>>>>
>>>>>>>> -    /* start of buffer is just after mbuf structure */
>>>>>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>>>>> -                    sizeof(struct dp_packet);
>>>>>>>> -    m->buf_len = (uint16_t)buf_len;
>>>>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>> priv_size);
>>>>>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>>>>>
>>>>>>>> -    /* keep some headroom between start of buffer and data */
>>>>>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>>>>> +	memset(m, 0, mp->elt_size);
>>>>>>>>
>>>>>>>> -    /* init some constant fields */
>>>>>>>> -    m->pool = mp;
>>>>>>>> -    m->nb_segs = 1;
>>>>>>>> -    m->port = 0xff;
>>>>>>>> +	/* start of buffer is after dp_packet structure and priv data */
>>>>>>>> +	m->priv_size = priv_size;
>>>>>>>> +	m->buf_addr = (char *)m + buf_size;
>>>>>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>>>>> +	m->buf_len = (uint16_t)buf_len;
>>>>>>>> +
>>>>>>>> +	/* keep some headroom between start of buffer and data */
>>>>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
>> (uint16_t)m->buf_len);
>>>>>>>> +
>>>>>>>> +	/* init some constant fields */
>>>>>>>> +	m->pool = mp;
>>>>>>>> +	m->nb_segs = 1;
>>>>>>>> +	m->port = 0xff;
>>>>>>>
>>>>>>>Please don't mix tabs and spaces in the file.
>>>>>>
>>>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>>>>>
>>>>>>>
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  static void
>>>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>  {
>>>>>>>>      struct rte_mbuf *m = _m;
>>>>>>>>
>>>>>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>>>>>
>>>>>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>>>>>>  }
>>>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>      struct dpdk_mp *dmp = NULL;
>>>>>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>>>>>      unsigned mp_size;
>>>>>>>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>>>>>>>
>>>>>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>      dmp->socket_id = socket_id;
>>>>>>>>      dmp->mtu = mtu;
>>>>>>>>      dmp->refcount = 1;
>>>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>>> RTE_PKTMBUF_HEADROOM;
>>>>>>>> +    mbp_priv.mbuf_priv_size = 0;
>>>>>>>>
>>>>>>>>      mp_size = MAX_NB_MBUF;
>>>>>>>>      do {
>>>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>              return NULL;
>>>>>>>>          }
>>>>>>>>
>>>>>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size,
>>>> MBUF_SEGMENT_SIZE(mtu),
>>>>>>>>                                       MP_CACHE_SZ,
>>>>>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>>>>>> -                                     rte_pktmbuf_pool_init, NULL,
>>>>>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>>>>>>                                       ovs_rte_pktmbuf_init, NULL,
>>>>>>>>                                       socket_id, 0);
>>>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>> MIN_NB_MBUF);
>>>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>> *dev, int n_rxq, int
>>>>>>n_txq)
>>>>>>>>  {
>>>>>>>>      int diag = 0;
>>>>>>>>      int i;
>>>>>>>> +    struct rte_eth_conf conf = port_conf;
>>>>>>>>
>>>>>>>>      /* A device may report more queues than it makes available (this has
>>>>>>>>       * been observed for Intel xl710, which reserves some of them for
>>>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>> *dev, int n_rxq, int
>>>>>>>n_txq)
>>>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>>>>>>          }
>>>>>>>>
>>>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>> &port_conf);
>>>>>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
>>>>>>>>          if (diag) {
>>>>>>>>              break;
>>>>>>>>          }
>>>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>> unsigned int port_no,
>>>>>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>>>>>      int sid;
>>>>>>>>      int err = 0;
>>>>>>>> +    uint32_t buf_size;
>>>>>>>>
>>>>>>>>      ovs_mutex_init(&netdev->mutex);
>>>>>>>>      ovs_mutex_lock(&netdev->mutex);
>>>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>> unsigned int port_no,
>>>>>>>>      netdev->port_id = port_no;
>>>>>>>>      netdev->type = type;
>>>>>>>>      netdev->flags = 0;
>>>>>>>> +
>>>>>>>> + /* Initialize port's MTU and frame len to the default
>> Ethernet values.
>>>>>>>> +     * Larger, user-specified (jumbo) frame buffers are accommodated in
>>>>>>>> +     * netdev_dpdk_set_config.
>>>>>>>> +     */
>>>>>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>>>>>      netdev->mtu = ETHER_MTU;
>>>>>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>>>>>
>>>>>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>>> FRAME_LEN_TO_MTU(buf_size));
>>>>>>>>      if (!netdev->dpdk_mp) {
>>>>>>>>          err = ENOMEM;
>>>>>>>>          goto unlock;
>>>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[],
>>>> const char prefix[],
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +static void
>>>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>>>>> +{
>>>>>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>>>>> +    int local_mtu;
>>>>>>>> +
>>>>>>>> +    if(mtu_str) {
>>>>>>>> +        local_mtu = atoi(mtu_str);
>>>>>>>
>>>>>>>Please use strtol or strtoul here, and detect errors in the string by
>>>>>>>checking endptr. That way, we can be sure that random garbage on the
>>>>>>>stack won't accidentally become the mtu.
>>>>>>
>>>>>>Will do, thanks.
>>>>>>
>>>>>>>
>>>>>>>> +    }
>>>>>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>>>>>> +        local_mtu = ETHER_MTU;
>>>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>>> to %d.\n",
>>>>>>>> +                   local_mtu);
>>>>>>>> +    }
>>>>>>>> +    *mtu = local_mtu;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static int
>>>>>>>>  vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
>>>>>>>>  {
>>>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev
>>>> *netdev_, struct smap
>>>>>>>*args)
>>>>>>>>      smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
>>>>>>>>      smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
>>>>>>>> smap_add_format(args, "configured_tx_queues", "%d",
>> dev->real_n_txq);
>>>>>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>>>>>      ovs_mutex_unlock(&dev->mutex);
>>>>>>>>
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>> +{
>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>> +    int old_mtu, err;
>>>>>>>> +    uint32_t buf_size;
>>>>>>>> +    int dpdk_mtu;
>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>> +
>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>> +        err = 0;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>>>>> +
>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>>>>> +    if (!mp) {
>>>>>>>> +        err = ENOMEM;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    rte_eth_dev_stop(dev->port_id);
>>>>>>>> +
>>>>>>>> +    old_mtu = dev->mtu;
>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>> +    dev->mtu = mtu;
>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +
>>>>>>>> +    err = dpdk_eth_dev_init(dev);
>>>>>>>
>>>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>>>>>rte_eth_dev_start() instead?
>>>>
>>>> The rx queues for the device need to be assigned the mempool that was
>>>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the
>>>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it
>>>> here.
>>>>
>>>>>>
>>>>>>This is legacy code, but I can refactor if needs be.
>>>>>>
>>>>>>>
>>>>>>>> +    if (err) {
>>>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>>> known "
>>>>>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>>>>> +        dpdk_mp_put(mp);
>>>>>>>> +        dev->mtu = old_mtu;
>>>>>>>> +        dev->dpdk_mp = old_mp;
>>>>>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +        err = dpdk_eth_dev_init(dev);
>>>>>>>
>>>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>>>>>will cause a queue reconfiguration action, which is not needed, and can
>>>>>>>error.
>>>>
>>>> This call is needed, in case the eth_dev's queue configuration was
>>>> changed by the previous call.
>>>> The error handling for this call is arguably outside the scope of this
>>>> patch, and could be submitted as a separate patch - what do you think?
>>>
>>>That's probably a good idea.
>>>
>>>>>>>
>>>>>>
>>>>>>As above.
>>>>>>
>>>>>>>If you are intending to set MTU, I would have expected calls to
>>>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>>>>>recognize that much of this is legacy, but since you are touching it :)
>>>>>>
>>>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on
>>>> in the patch; I can take
>>>>>>a second look now that the code is more stable though.
>>>>>
>>>>>I remembered why I didn't use this.
>>>>>
>>>>>If you look at the IXGBE driver's mtu_set function, it refuses frame
>>>> lengths which are larger
>>>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>>>>>This might be avoided by configuring the device for scatter RX
>>>> behavior, but this incurs a
>>>>>performance impact.
>>>
>>>It seems strange to me that we need to ignore the library API for this
>>>one case, right? What if the user is only doing vhostuser PMDs mixed
>>>with a bnx2x card?
>>>
>>
>> I can't say that I've familiar enough with this setup to form an
>> answer on that one Aaron.
>>
>>>It would be better to get it fixed in the library, I think. After all,
>>>they gave a generic setting to change MTU - why don't they make it
>>>behave well?
>>>
>>
>> Let me follow up on that with some local DPDK folks here.
>>

Just a quick follow up on this Aaron: I did implement a version of this patch that used rte_eth_dev_set_mtu; however, not all DPDK drivers yet implement that function (e.g. i40e), so I reverted to the previous version.
DPDK team did mention that the _set_mtu function for i40e should be implemented 'soon', but no dates were specified. See this thread on the DPDK ML for more details: http://dpdk.org/ml/archives/dev/2016-February/032910.html 

>> Would you consider this a showstopper for upstreaming of this feature?
>> If we need to wait for a change to DPDK, Jumbo Frames will miss the
>> 2.5 window.
>
>I don't think it would be a show stopper as long as other devices wouldn't
>be negatively impacted, but it's a place where a /*XXX: This is a
>workaround for dpdk*/ comment should be included so that we can fix it
>when/if it is updated in the library. What do you think?
>
>>>>><snippet>
>>>>>ixgbe_dev_mtu_set(...) {
>>>>>
>>>>>...
>>>>>
>>>>>    /* refuse mtu that requires the support of scattered packets when this
>>>>>     * feature has not been enabled before. */
>>>>>    if (!dev->data->scattered_rx &&
>>>>>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
>>>>>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
>>>>>        return -EINVAL;
>>>>>
>>>>>...
>>>>>}
>>>>></snippet>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>> +out:
>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>> +    return err;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
>>>>>>>> +{
>>>>>>>> +    int mtu;
>>>>>>>> +
>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>> +
>>>>>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static int
>>>>>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>>>>>>  {
>>>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev
>>>> *netdev, int *mtup)
>>>>>>>>
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>> -
>>>>>>>> -static int
>>>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>> -{
>>>>>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>> -    int old_mtu, err;
>>>>>>>> -    struct dpdk_mp *old_mp;
>>>>>>>> -    struct dpdk_mp *mp;
>>>>>>>> -
>>>>>>>> -    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>> -    ovs_mutex_lock(&dev->mutex);
>>>>>>>> -    if (dev->mtu == mtu) {
>>>>>>>> -        err = 0;
>>>>>>>> -        goto out;
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>>>>> -    if (!mp) {
>>>>>>>> -        err = ENOMEM;
>>>>>>>> -        goto out;
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>> -    rte_eth_dev_stop(dev->port_id);
>>>>>>>> -
>>>>>>>> -    old_mtu = dev->mtu;
>>>>>>>> -    old_mp = dev->dpdk_mp;
>>>>>>>> -    dev->dpdk_mp = mp;
>>>>>>>> -    dev->mtu = mtu;
>>>>>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>> -
>>>>>>>> -    err = dpdk_eth_dev_init(dev);
>>>>>>>> -    if (err) {
>>>>>>>> -        dpdk_mp_put(mp);
>>>>>>>> -        dev->mtu = old_mtu;
>>>>>>>> -        dev->dpdk_mp = old_mp;
>>>>>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>> -        dpdk_eth_dev_init(dev);
>>>>>>>> -        goto out;
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>> -    dpdk_mp_put(old_mp);
>>>>>>>> -    netdev_change_seq_changed(netdev);
>>>>>>>> -out:
>>>>>>>> -    ovs_mutex_unlock(&dev->mutex);
>>>>>>>> -    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>> -    return err;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>>  static int
>>>>>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>>>>>
>>>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev
>>>> *netdev_, struct smap
>>>>>>>*args)
>>>>>>>> smap_add_format(args, "numa_id", "%d",
>>>> rte_eth_dev_socket_id(dev->port_id));
>>>>>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>>>>> smap_add_format(args, "min_rx_bufsize", "%u",
>>>> dev_info.min_rx_bufsize);
>>>>>>>> - smap_add_format(args, "max_rx_pktlen", "%u",
>> dev_info.max_rx_pktlen);
>>>>>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>>>>> smap_add_format(args, "max_rx_queues", "%u",
>> dev_info.max_rx_queues);
>>>>>>>> smap_add_format(args, "max_tx_queues", "%u",
>> dev_info.max_tx_queues);
>>>>>>>> smap_add_format(args, "max_mac_addrs", "%u",
>> dev_info.max_mac_addrs);
>>>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>>>>>>      return 0;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>> +{
>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>> +    int err = 0;
>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>> +
>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>> +        err = 0;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>>>>> +    if (!mp) {
>>>>>>>> +        err = ENOMEM;
>>>>>>>> +        goto out;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>> +    dev->mtu = mtu;
>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>> +
>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>> +out:
>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>> +    return err;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static int
>>>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const
>>>> struct smap *args)
>>>>>>>> +{
>>>>>>>> +    int mtu;
>>>>>>>> +
>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>> +
>>>>>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>  static void
>>>>>>>>  dpdk_common_init(void)
>>>>>>>>  {
>>>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>>>>>>      return err;
>>>>>>>>  }
>>>>>>>>
>>>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,
>> MULTIQ, SEND, \
>>>>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \
>>>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
>>>>>>>> +        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
>>>>>>>> +        GET_STATUS, RXQ_RECV)                                          \
>>>>>>>>  {                                                             \
>>>>>>>>      NAME,                                                     \
>>>>>>>>      INIT,                       /* init */                    \
>>>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>>>>>>      DESTRUCT,                                                 \
>>>>>>>>      netdev_dpdk_dealloc,                                      \
>>>>>>>>      netdev_dpdk_get_config,                                   \
>>>>>>>> -    NULL,                       /* netdev_dpdk_set_config */  \
>>>>>>>> +    SET_CONFIG,                                               \
>>>>>>>>      NULL,                       /* get_tunnel_config */       \
>>>>>>>>      NULL,                       /* build header */            \
>>>>>>>>      NULL,                       /* push header */             \
>>>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>>>>>>      netdev_dpdk_set_etheraddr,                                \
>>>>>>>>      netdev_dpdk_get_etheraddr,                                \
>>>>>>>>      netdev_dpdk_get_mtu,                                      \
>>>>>>>> -    netdev_dpdk_set_mtu,                                      \
>>>>>>>> +    SET_MTU,                                                  \
>>>>>>>>      netdev_dpdk_get_ifindex,                                  \
>>>>>>>>      GET_CARRIER,                                              \
>>>>>>>>      netdev_dpdk_get_carrier_resets,                           \
>>>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>>>>>>          NULL,
>>>>>>>>          netdev_dpdk_construct,
>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>          netdev_dpdk_eth_send,
>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class dpdk_ring_class =
>>>>>>>>          NULL,
>>>>>>>>          netdev_dpdk_ring_construct,
>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>          netdev_dpdk_ring_send,
>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class
>>>> OVS_UNUSED dpdk_vhost_cuse_class
>>>>>=
>>>>>>>>          dpdk_vhost_cuse_class_init,
>>>>>>>>          netdev_dpdk_vhost_cuse_construct,
>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>> +        NULL,
>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>> +        NULL,
>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>          NULL,
>>>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class
>>>> OVS_UNUSED dpdk_vhost_user_class
>>>>>=
>>>>>>>>          dpdk_vhost_user_class_init,
>>>>>>>>          netdev_dpdk_vhost_user_construct,
>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>> +        netdev_dpdk_vhost_set_config,
>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>> +        netdev_dpdk_vhost_set_mtu,
>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>          NULL,
>>>>>>_______________________________________________
>>>>>>dev mailing list
>>>>>>dev@openvswitch.org
>>>>>>http://openvswitch.org/mailman/listinfo/dev
>>>>>_______________________________________________
>>>>>dev mailing list
>>>>>dev@openvswitch.org
>>>>>http://openvswitch.org/mailman/listinfo/dev
>>><#secure method=pgpmime mode=encrypt>
Aaron Conole Feb. 16, 2016, 2:11 p.m. UTC | #10
"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:

>>
>>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>>
>>>>
>>>>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>>>>> Further clarifications below Aaron - please let me know if you have
>>>>> any additional comments.
>>>>>
>>>>> Thanks,
>>>>> Mark
>>>>
>>>>Thanks again for this work, Mark!
>>>>
>>>>>>
>>>>>>One addendum below, Aaron.
>>>>>>
>>>>>>>
>>>>>>>>Hi Mark,
>>>>>>>>
>>>>>>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>>>>>>my first-blush comments are inline.
>>>>>>>
>>>>>>>Thanks for the review Aaron!
>>>>>>>
>>>>>>>Any additional comments are welcome - my initial responses are
>>> inline, below.
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Mark
>>>>>>>
>>>>>>>>
>>>>>>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>>>>>>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>>>>>>> using single-segment-mbufs.
>>>>>>>>>
>>>>>>>>> Using this approach, the amount of memory allocated for each mbuf
>>>>>>>>> to store frame data is increased to a value greater than 1518B
>>>>>>>>> (typical Ethernet maximum frame length). The increased space
>>>>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>>>>>> segments.
>>>>>>>>>
>>>>>>>>> The amount of space allocated to each mbuf to hold frame data is
>>>>>>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>>>>>>
>>>>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>>>>>> ---
>>>>>>>>
>>>>>>>>This is a new feature which has user-visible impact, so should be
>>>>>>>>announced in NEWS (plus, it's great news).
>>>>>>>
>>>>>>>No problem - I'll add an entry.
>>>>>>>
>>>>>>>>
>>>>>>>>>  INSTALL.DPDK.md   |  59 ++++++++-
>>>>>>>>> lib/netdev-dpdk.c | 356
>>>>> +++++++++++++++++++++++++++++++++++++++++-------------
>>>>>>>>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>>>>>> index 96b686c..2f23e27 100644
>>>>>>>>> --- a/INSTALL.DPDK.md
>>>>>>>>> +++ b/INSTALL.DPDK.md
>>>>>>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>>>>>>> to <interface> sections of all network devices used by
>>> DPDK. Parameter 'N'
>>>>>>>>>  determines how many queues can be used by the guest.
>>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +Jumbo Frames
>>>>>>>>> +------------
>>>>>>>>> +
>>>>>>>>> +Support for Jumbo Frames may be enabled at run-time for
> DPDK-type ports.
>>>>>>>>> +
>>>>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to
>>>>> the ovs-vsctl
>>>>>>>>> +'add-port' command-line, along with the required MTU for the port.
>>>>>>>>> +e.g.
>>>>>>>>
>>>>>>>>This text and the example do not match. I think '--dpdk-mtu' is
> not valid.
>>>>>>>
>>>>>>>Good catch, thanks.
>>>>>>>
>>>>>>>>
>>>>>>>>> +
>>>>>>>>> +     ```
>>>>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>>>> options:dpdk-mtu=9000
>>>>>>>>> +     ```
>>>>>>>>> +
>>>>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
>>> segments are
>>>>>>>>> +increased, such that a full Jumbo Frame may be accommodated
>>>>> inside a single
>>>>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>>>>>>
>>>>>>>>Why no support? DPDK supports changing the mtu.
>>>>>>>
>>>>>>>I guess my rationale here is that an MTU change can't be triggered
>>>>> via OVS command-line, nor
>>>>>>>can it be triggered programmatically via DPDK (apart from an
>>> explicit call to
>>>>>>>rte_eth_dev_set_mtu).
>>>>>>>So, while technically it's possibly, from a user's point of view,
>>>>> there's no way to
>>>>>>configure
>>>>>>>it, outside of modifying the code directly. If I've missed something
>>>>> here, please let me
>>>>>>>know.
>>>>>>>
>>>>>>>>
>>>>>>>>> +Jumbo frame support has been validated against 13312B
> frames, using the
>>>>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC
> drivers may
>>>>>>>>> +theoretically be supported. Supported port types excludes
>>>>> vHost-Cuse ports, as
>>>>>>>>> +this feature is pending deprecation.
>>>>>>>>> +
>>>>>>>>> +
>>>>>>>>> +vHost Ports and Jumbo Frames
>>>>>>>>> +----------------------------
>>>>>>>>> +Jumbo frame support is available for DPDK vHost-User ports
>>>>> only. Some additional
>>>>>>>>> +configuration is needed to take advantage of this feature:
>>>>>>>>> +
>>>>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as
>>>>> demonstrated in
>>>>>>>>> +      the QEMU command line snippet below:
>>>>>>>>> +
>>>>>>>>> +      ```
>>>>>>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>>>>>> + '-device
>>>>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>>>>>> +      ```
>>>>>>>>> +
>>>>>>>>> + 2. Where virtio devices are bound to the Linux kernel driver
>>> in a guest
>>>>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK
>>>>> driver), the
>>>>>>>>> + MTU of those logical network interfaces must also be
>>> increased. This
>>>>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that
>>>>> 'MTU' refers
>>>>>>>>> + to the length of the IP packet only, and not that of the entire
>>>>> frame.
>>>>>>>>> +
>>>>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame,
>>>>> subtract the L2
>>>>>>>>> + header and CRC lengths (i.e. 18B) from the max supported
>>> frame size.
>>>>>>>>> +     So, to set the MTU for a 13312B Jumbo Frame:
>>>>>>>>> +
>>>>>>>>> +      ```
>>>>>>>>> +      ifconfig eth1 mtu 13294
>>>>>>>>> +      ```
>>>>>>>>> +
>>>>>>>>> +
>>>>>>>>>  Restrictions:
>>>>>>>>>  -------------
>>>>>>>>>
>>>>>>>>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix
> this issue.
>>>>>>>>>    - Currently DPDK port does not make use any offload functionality.
>>>>>>>>>    - DPDK-vHost support works with 1G huge pages.
>>>>>>>>>
>>>>>>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>>>>>>> the next release of DPDK (which includes the above patch) is
>>>>> available and
>>>>>>>>>      integrated into OVS.
>>>>>>>>>
>>>>>>>>> +  Jumbo Frames:
>>>>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit
>>>>> gracefully. The source of
>>>>>>>>> +     this issue is currently being investigated.
>>>>>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost
>>> Cuse ports.
>>>>>>>>> +
>>>>>>>>> +
>>>>>>>>>  Bug Reporting:
>>>>>>>>>  --------------
>>>>>>>>>
>>>>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>>>>>> index de7e488..5f23e28 100644
>>>>>>>>> --- a/lib/netdev-dpdk.c
>>>>>>>>> +++ b/lib/netdev-dpdk.c
>>>>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>>>>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>>>>>>
>>>>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>>>>>> +
>>>>>>>>>  /*
>>>>>>>>> * need to reserve tons of extra space in the mbufs so we can
> align the
>>>>>>>>>   * DMA addresses to 4KB.
>>>>>>>>> * The minimum mbuf size is limited to avoid scatter behaviour
>>> and drop in
>>>>>>>>>   * performance for standard Ethernet MTU.
>>>>>>>>>   */
>>>>>>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>>>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>>>>>> -                              + sizeof(struct dp_packet) \
>>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>>> -#define MBUF_SIZE_DRIVER     (2048                       \
>>>>>>>>> -                              + sizeof (struct rte_mbuf) \
>>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
>>>>> ETHER_CRC_LEN)
>>>>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN
>>>>> - ETHER_CRC_LEN)
>>>>>>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>>>>>> +                                    + sizeof(struct dp_packet)   \
>>>>>>>>> +                                    + RTE_PKTMBUF_HEADROOM)
>>>>>>>>> +
>>>>>>>>> +/* This value should be specified as a multiple of the DPDK
> NIC driver's
>>>>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>>>>>> + */
>>>>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>>>>>>
>>>>>>>>> /* Max and min number of packets in the mempool.  OVS tries to
>>> allocate a
>>>>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system
>>>>> doesn't have
>>>>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>>  #define MIN_NB_MBUF          (4096 * 4)
>>>>>>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>>>>>>
>>>>>>>>> +#define DPDK_VLAN_TAG_LEN    4
>>>>>>>>> +
>>>>>>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>>>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>>>>>>
>>>>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>>>>>>          .header_split   = 0, /* Header Split disabled */
>>>>>>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>>>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>>>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>>>>>>          .hw_strip_crc   = 0,
>>>>>>>>>      },
>>>>>>>>>      .rx_adv_conf = {
>>>>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>>>>>>>      return class->construct == netdev_dpdk_construct;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K
>>>>> for igb_uio).
>>>>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient
>>>>> buffers are
>>>>>>>>> + * allocated to accomodate the packet in its
>>>>> entirety. Furthermore, the igb_uio
>>>>>>>>> + * driver needs to ensure that there is also sufficient space in
>>>>> the Rx buffer
>>>>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX
>>> buffer is too
>>>>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>>>>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>>>>>>> + */
>>>>>>>>> +static uint32_t
>>>>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>>>>>> +{
>>>>>>>>> +    struct rte_eth_dev_info info;
>>>>>>>>> +    uint32_t buf_size;
>>>>>>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>>>>>> +
>>>>>>>>> +    if(netdev->type == DPDK_DEV_ETH) {
>>>>>>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>>>>>> +        buf_size = (info.min_rx_bufsize == 0) ?
>>>>>>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>>>>>> +                   info.min_rx_bufsize;
>>>>>>>>
>>>>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>>>>>>
>>>>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather
>>>>> the buffer size used
>>>>>>>when creating the rte_mempool, which must be a multiple of the
>>>>> driver's min_rx_bufsize.
>>>>>>>
>>>>>>>>
>>>>>>>>> +    } else {
>>>>>>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    if(len % buf_size) {
>>>>>>>>> +        len = buf_size * ((len/buf_size) + 1);
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    return len;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page
> should be used
>>>>>>>>>   * for all other segments data, bss and text. */
>>>>>>>>>
>>>>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>>  static void
>>>>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>> -                   void *opaque_arg OVS_UNUSED,
>>>>>>>>> -                   void *_m,
>>>>>>>>> -                   unsigned i OVS_UNUSED)
>>>>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>>>>>>>  {
>>>>>>>>> -    struct rte_mbuf *m = _m;
>>>>>>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>>>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>>>>>> +    uint16_t roomsz;
>>>>>>>>>
>>>>>>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>>>>>>
>>>>>>>>> -    memset(m, 0, mp->elt_size);
>>>>>>>>> +    /* if no structure is provided, assume no mbuf private area */
>>>>>>>>> +
>>>>>>>>> +    user_mbp_priv = opaque_arg;
>>>>>>>>> +    if (user_mbp_priv == NULL) {
>>>>>>>>> +        default_mbp_priv.mbuf_priv_size = 0;
>>>>>>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>>>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>>> +        } else {
>>>>>>>>> +            roomsz = 0;
>>>>>>>>> +        }
>>>>>>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>>>>>> +        user_mbp_priv = &default_mbp_priv;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>>>>>> +        user_mbp_priv->mbuf_data_room_size +
>>>>>>>>> +        user_mbp_priv->mbuf_priv_size);
>>>>>>>>> +
>>>>>>>>> +    mbp_priv = rte_mempool_get_priv(mp);
>>>>>>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/* Initialise some fields in the mbuf structure that are not
>>>>> modified by the
>>>>>>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>>>>>>> +static void
>>>>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>> +                       void *opaque_arg OVS_UNUSED,
>>>>>>>>> +                       void *_m,
>>>>>>>>> +                       unsigned i OVS_UNUSED)
>>>>>>>>> +{
>>>>>>>>> +	struct rte_mbuf *m = _m;
>>>>>>>>> +	uint32_t buf_size, buf_len, priv_size;
>>>>>>>>> +
>>>>>>>>> +	priv_size = rte_pktmbuf_priv_size(mp);
>>>>>>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>>>>>>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>>>>>>>>
>>>>>>>>> -    /* start of buffer is just after mbuf structure */
>>>>>>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>>>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>>>>>> -                    sizeof(struct dp_packet);
>>>>>>>>> -    m->buf_len = (uint16_t)buf_len;
>>>>>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>>> priv_size);
>>>>>>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>>>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>>>>>>
>>>>>>>>> -    /* keep some headroom between start of buffer and data */
>>>>>>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>>>>>> +	memset(m, 0, mp->elt_size);
>>>>>>>>>
>>>>>>>>> -    /* init some constant fields */
>>>>>>>>> -    m->pool = mp;
>>>>>>>>> -    m->nb_segs = 1;
>>>>>>>>> -    m->port = 0xff;
>>>>>>>>> +	/* start of buffer is after dp_packet structure and priv data */
>>>>>>>>> +	m->priv_size = priv_size;
>>>>>>>>> +	m->buf_addr = (char *)m + buf_size;
>>>>>>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>>>>>> +	m->buf_len = (uint16_t)buf_len;
>>>>>>>>> +
>>>>>>>>> +	/* keep some headroom between start of buffer and data */
>>>>>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
>>> (uint16_t)m->buf_len);
>>>>>>>>> +
>>>>>>>>> +	/* init some constant fields */
>>>>>>>>> +	m->pool = mp;
>>>>>>>>> +	m->nb_segs = 1;
>>>>>>>>> +	m->port = 0xff;
>>>>>>>>
>>>>>>>>Please don't mix tabs and spaces in the file.
>>>>>>>
>>>>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>>>>>>
>>>>>>>>
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>>  static void
>>>>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>>  {
>>>>>>>>>      struct rte_mbuf *m = _m;
>>>>>>>>>
>>>>>>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>>>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>>>>>>
>>>>>>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>>>>>>>  }
>>>>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>      struct dpdk_mp *dmp = NULL;
>>>>>>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>>>>>>      unsigned mp_size;
>>>>>>>>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>>>>>>>>
>>>>>>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>>>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>      dmp->socket_id = socket_id;
>>>>>>>>>      dmp->mtu = mtu;
>>>>>>>>>      dmp->refcount = 1;
>>>>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>>>> RTE_PKTMBUF_HEADROOM;
>>>>>>>>> +    mbp_priv.mbuf_priv_size = 0;
>>>>>>>>>
>>>>>>>>>      mp_size = MAX_NB_MBUF;
>>>>>>>>>      do {
>>>>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>              return NULL;
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size,
>>>>> MBUF_SEGMENT_SIZE(mtu),
>>>>>>>>>                                       MP_CACHE_SZ,
>>>>>>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>>>>>>> -                                     rte_pktmbuf_pool_init, NULL,
>>>>>>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>>>>>>>                                       ovs_rte_pktmbuf_init, NULL,
>>>>>>>>>                                       socket_id, 0);
>>>>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>>> MIN_NB_MBUF);
>>>>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>>> *dev, int n_rxq, int
>>>>>>>n_txq)
>>>>>>>>>  {
>>>>>>>>>      int diag = 0;
>>>>>>>>>      int i;
>>>>>>>>> +    struct rte_eth_conf conf = port_conf;
>>>>>>>>>
>>>>>>>>> /* A device may report more queues than it makes available
> (this has
>>>>>>>>>       * been observed for Intel xl710, which reserves some of them for
>>>>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>>> *dev, int n_rxq, int
>>>>>>>>n_txq)
>>>>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>>>>>>>          }
>>>>>>>>>
>>>>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>>> &port_conf);
>>>>>>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>>>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>>>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>> +        }
>>>>>>>>> +
>>>>>>>>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
> &conf);
>>>>>>>>>          if (diag) {
>>>>>>>>>              break;
>>>>>>>>>          }
>>>>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>>> unsigned int port_no,
>>>>>>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>>>>>>      int sid;
>>>>>>>>>      int err = 0;
>>>>>>>>> +    uint32_t buf_size;
>>>>>>>>>
>>>>>>>>>      ovs_mutex_init(&netdev->mutex);
>>>>>>>>>      ovs_mutex_lock(&netdev->mutex);
>>>>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>>> unsigned int port_no,
>>>>>>>>>      netdev->port_id = port_no;
>>>>>>>>>      netdev->type = type;
>>>>>>>>>      netdev->flags = 0;
>>>>>>>>> +
>>>>>>>>> + /* Initialize port's MTU and frame len to the default
>>> Ethernet values.
>>>>>>>>> + * Larger, user-specified (jumbo) frame buffers are
> accommodated in
>>>>>>>>> +     * netdev_dpdk_set_config.
>>>>>>>>> +     */
>>>>>>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>>>>>>      netdev->mtu = ETHER_MTU;
>>>>>>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>>>>>>
>>>>>>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>>>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>>>> FRAME_LEN_TO_MTU(buf_size));
>>>>>>>>>      if (!netdev->dpdk_mp) {
>>>>>>>>>          err = ENOMEM;
>>>>>>>>>          goto unlock;
>>>>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[],
>>>>> const char prefix[],
>>>>>>>>>      return 0;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +static void
>>>>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>>>>>> +{
>>>>>>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>>>>>> +    int local_mtu;
>>>>>>>>> +
>>>>>>>>> +    if(mtu_str) {
>>>>>>>>> +        local_mtu = atoi(mtu_str);
>>>>>>>>
>>>>>>>>Please use strtol or strtoul here, and detect errors in the string by
>>>>>>>>checking endptr. That way, we can be sure that random garbage on the
>>>>>>>>stack won't accidentally become the mtu.
>>>>>>>
>>>>>>>Will do, thanks.
>>>>>>>
>>>>>>>>
>>>>>>>>> +    }
>>>>>>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>>>>>>> +        local_mtu = ETHER_MTU;
>>>>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>>>> to %d.\n",
>>>>>>>>> +                   local_mtu);
>>>>>>>>> +    }
>>>>>>>>> +    *mtu = local_mtu;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  static int
>>>>>>>>> vhost_construct_helper(struct netdev *netdev_)
> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>  {
>>>>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev
>>>>> *netdev_, struct smap
>>>>>>>>*args)
>>>>>>>>> smap_add_format(args, "configured_rx_queues", "%d",
> netdev_->n_rxq);
>>>>>>>>> smap_add_format(args, "requested_tx_queues", "%d",
> netdev_->n_txq);
>>>>>>>>> smap_add_format(args, "configured_tx_queues", "%d",
>>> dev->real_n_txq);
>>>>>>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>>>>>>      ovs_mutex_unlock(&dev->mutex);
>>>>>>>>>
>>>>>>>>>      return 0;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>>>>>>> +static int
>>>>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>>> +{
>>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>>> +    int old_mtu, err;
>>>>>>>>> +    uint32_t buf_size;
>>>>>>>>> +    int dpdk_mtu;
>>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>>> +
>>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>>> +        err = 0;
>>>>>>>>> +        goto out;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>>>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>>>>>> +
>>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>>>>>> +    if (!mp) {
>>>>>>>>> +        err = ENOMEM;
>>>>>>>>> +        goto out;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    rte_eth_dev_stop(dev->port_id);
>>>>>>>>> +
>>>>>>>>> +    old_mtu = dev->mtu;
>>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>>> +    dev->mtu = mtu;
>>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>> +
>>>>>>>>> +    err = dpdk_eth_dev_init(dev);
>>>>>>>>
>>>>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>>>>>>rte_eth_dev_start() instead?
>>>>>
>>>>> The rx queues for the device need to be assigned the mempool that was
>>>>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the
>>>>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it
>>>>> here.
>>>>>
>>>>>>>
>>>>>>>This is legacy code, but I can refactor if needs be.
>>>>>>>
>>>>>>>>
>>>>>>>>> +    if (err) {
>>>>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>>>> known "
>>>>>>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>>>>>> +        dpdk_mp_put(mp);
>>>>>>>>> +        dev->mtu = old_mtu;
>>>>>>>>> +        dev->dpdk_mp = old_mp;
>>>>>>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>> +        err = dpdk_eth_dev_init(dev);
>>>>>>>>
>>>>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>>>>>>will cause a queue reconfiguration action, which is not needed, and can
>>>>>>>>error.
>>>>>
>>>>> This call is needed, in case the eth_dev's queue configuration was
>>>>> changed by the previous call.
>>>>> The error handling for this call is arguably outside the scope of this
>>>>> patch, and could be submitted as a separate patch - what do you think?
>>>>
>>>>That's probably a good idea.
>>>>
>>>>>>>>
>>>>>>>
>>>>>>>As above.
>>>>>>>
>>>>>>>>If you are intending to set MTU, I would have expected calls to
>>>>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>>>>>>recognize that much of this is legacy, but since you are touching it :)
>>>>>>>
>>>>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on
>>>>> in the patch; I can take
>>>>>>>a second look now that the code is more stable though.
>>>>>>
>>>>>>I remembered why I didn't use this.
>>>>>>
>>>>>>If you look at the IXGBE driver's mtu_set function, it refuses frame
>>>>> lengths which are larger
>>>>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>>>>>>This might be avoided by configuring the device for scatter RX
>>>>> behavior, but this incurs a
>>>>>>performance impact.
>>>>
>>>>It seems strange to me that we need to ignore the library API for this
>>>>one case, right? What if the user is only doing vhostuser PMDs mixed
>>>>with a bnx2x card?
>>>>
>>>
>>> I can't say that I've familiar enough with this setup to form an
>>> answer on that one Aaron.
>>>
>>>>It would be better to get it fixed in the library, I think. After all,
>>>>they gave a generic setting to change MTU - why don't they make it
>>>>behave well?
>>>>
>>>
>>> Let me follow up on that with some local DPDK folks here.
>>>
>
> Just a quick follow up on this Aaron: I did implement a version of
> this patch that used rte_eth_dev_set_mtu; however, not all DPDK
> drivers yet implement that function (e.g. i40e), so I reverted to the
> previous version.
> DPDK team did mention that the _set_mtu function for i40e should be
> implemented 'soon', but no dates were specified. See this thread on
> the DPDK ML for more details:
> http://dpdk.org/ml/archives/dev/2016-February/032910.html

Thanks so much for that, it's good to know. I guess we need to wait for
them to fully implement their api before calling it, so I'm okay with
this change.

>>> Would you consider this a showstopper for upstreaming of this feature?
>>> If we need to wait for a change to DPDK, Jumbo Frames will miss the
>>> 2.5 window.
>>
>>I don't think it would be a show stopper as long as other devices wouldn't
>>be negatively impacted, but it's a place where a /*XXX: This is a
>>workaround for dpdk*/ comment should be included so that we can fix it
>>when/if it is updated in the library. What do you think?
>>
>>>>>><snippet>
>>>>>>ixgbe_dev_mtu_set(...) {
>>>>>>
>>>>>>...
>>>>>>
>>>>>>    /* refuse mtu that requires the support of scattered packets when this
>>>>>>     * feature has not been enabled before. */
>>>>>>    if (!dev->data->scattered_rx &&
>>>>>>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
>>>>>>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
>>>>>>        return -EINVAL;
>>>>>>
>>>>>>...
>>>>>>}
>>>>>></snippet>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> +        goto out;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>>> +out:
>>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>>> +    return err;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int
>>>>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct
> smap *args)
>>>>>>>>> +{
>>>>>>>>> +    int mtu;
>>>>>>>>> +
>>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>>> +
>>>>>>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  static int
>>>>>>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>>>>>>>  {
>>>>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev
>>>>> *netdev, int *mtup)
>>>>>>>>>
>>>>>>>>>      return 0;
>>>>>>>>>  }
>>>>>>>>> -
>>>>>>>>> -static int
>>>>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>>> -{
>>>>>>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>>> -    int old_mtu, err;
>>>>>>>>> -    struct dpdk_mp *old_mp;
>>>>>>>>> -    struct dpdk_mp *mp;
>>>>>>>>> -
>>>>>>>>> -    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>>> -    ovs_mutex_lock(&dev->mutex);
>>>>>>>>> -    if (dev->mtu == mtu) {
>>>>>>>>> -        err = 0;
>>>>>>>>> -        goto out;
>>>>>>>>> -    }
>>>>>>>>> -
>>>>>>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>>>>>> -    if (!mp) {
>>>>>>>>> -        err = ENOMEM;
>>>>>>>>> -        goto out;
>>>>>>>>> -    }
>>>>>>>>> -
>>>>>>>>> -    rte_eth_dev_stop(dev->port_id);
>>>>>>>>> -
>>>>>>>>> -    old_mtu = dev->mtu;
>>>>>>>>> -    old_mp = dev->dpdk_mp;
>>>>>>>>> -    dev->dpdk_mp = mp;
>>>>>>>>> -    dev->mtu = mtu;
>>>>>>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>>> -
>>>>>>>>> -    err = dpdk_eth_dev_init(dev);
>>>>>>>>> -    if (err) {
>>>>>>>>> -        dpdk_mp_put(mp);
>>>>>>>>> -        dev->mtu = old_mtu;
>>>>>>>>> -        dev->dpdk_mp = old_mp;
>>>>>>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>>> -        dpdk_eth_dev_init(dev);
>>>>>>>>> -        goto out;
>>>>>>>>> -    }
>>>>>>>>> -
>>>>>>>>> -    dpdk_mp_put(old_mp);
>>>>>>>>> -    netdev_change_seq_changed(netdev);
>>>>>>>>> -out:
>>>>>>>>> -    ovs_mutex_unlock(&dev->mutex);
>>>>>>>>> -    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>>> -    return err;
>>>>>>>>> -}
>>>>>>>>> -
>>>>>>>>>  static int
>>>>>>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>>>>>>
>>>>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev
>>>>> *netdev_, struct smap
>>>>>>>>*args)
>>>>>>>>> smap_add_format(args, "numa_id", "%d",
>>>>> rte_eth_dev_socket_id(dev->port_id));
>>>>>>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>>>>>> smap_add_format(args, "min_rx_bufsize", "%u",
>>>>> dev_info.min_rx_bufsize);
>>>>>>>>> - smap_add_format(args, "max_rx_pktlen", "%u",
>>> dev_info.max_rx_pktlen);
>>>>>>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>>>>>> smap_add_format(args, "max_rx_queues", "%u",
>>> dev_info.max_rx_queues);
>>>>>>>>> smap_add_format(args, "max_tx_queues", "%u",
>>> dev_info.max_tx_queues);
>>>>>>>>> smap_add_format(args, "max_mac_addrs", "%u",
>>> dev_info.max_mac_addrs);
>>>>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>>>>>>>      return 0;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>>>>>> +static int
>>>>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>>> +{
>>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>>> +    int err = 0;
>>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>>> +
>>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>>> +        err = 0;
>>>>>>>>> +        goto out;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>>>>>> +    if (!mp) {
>>>>>>>>> +        err = ENOMEM;
>>>>>>>>> +        goto out;
>>>>>>>>> +    }
>>>>>>>>> +
>>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>>> +    dev->mtu = mtu;
>>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>> +
>>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>>> +out:
>>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>>> +    return err;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static int
>>>>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const
>>>>> struct smap *args)
>>>>>>>>> +{
>>>>>>>>> +    int mtu;
>>>>>>>>> +
>>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>>> +
>>>>>>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>>  static void
>>>>>>>>>  dpdk_common_init(void)
>>>>>>>>>  {
>>>>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>>>>>>>      return err;
>>>>>>>>>  }
>>>>>>>>>
>>>>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,
>>> MULTIQ, SEND, \
>>>>>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \
>>>>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,
> SET_CONFIG, \
>>>>>>>>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS,
> GET_FEATURES, \
>>>>>>>>> + GET_STATUS, RXQ_RECV) \
>>>>>>>>>  {                                                             \
>>>>>>>>>      NAME,                                                     \
>>>>>>>>>      INIT,                       /* init */                    \
>>>>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>>>>>>>      DESTRUCT,                                                 \
>>>>>>>>>      netdev_dpdk_dealloc,                                      \
>>>>>>>>>      netdev_dpdk_get_config,                                   \
>>>>>>>>> -    NULL,                       /* netdev_dpdk_set_config */  \
>>>>>>>>> +    SET_CONFIG,                                               \
>>>>>>>>>      NULL,                       /* get_tunnel_config */       \
>>>>>>>>>      NULL,                       /* build header */            \
>>>>>>>>>      NULL,                       /* push header */             \
>>>>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>>>>>>>      netdev_dpdk_set_etheraddr,                                \
>>>>>>>>>      netdev_dpdk_get_etheraddr,                                \
>>>>>>>>>      netdev_dpdk_get_mtu,                                      \
>>>>>>>>> -    netdev_dpdk_set_mtu,                                      \
>>>>>>>>> +    SET_MTU,                                                  \
>>>>>>>>>      netdev_dpdk_get_ifindex,                                  \
>>>>>>>>>      GET_CARRIER,                                              \
>>>>>>>>>      netdev_dpdk_get_carrier_resets,                           \
>>>>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>>>>>>>          NULL,
>>>>>>>>>          netdev_dpdk_construct,
>>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>>          netdev_dpdk_eth_send,
>>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class
> dpdk_ring_class =
>>>>>>>>>          NULL,
>>>>>>>>>          netdev_dpdk_ring_construct,
>>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>>          netdev_dpdk_ring_send,
>>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class
>>>>> OVS_UNUSED dpdk_vhost_cuse_class
>>>>>>=
>>>>>>>>>          dpdk_vhost_cuse_class_init,
>>>>>>>>>          netdev_dpdk_vhost_cuse_construct,
>>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>>> +        NULL,
>>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>>> +        NULL,
>>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>>          NULL,
>>>>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class
>>>>> OVS_UNUSED dpdk_vhost_user_class
>>>>>>=
>>>>>>>>>          dpdk_vhost_user_class_init,
>>>>>>>>>          netdev_dpdk_vhost_user_construct,
>>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>>> +        netdev_dpdk_vhost_set_config,
>>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>>> +        netdev_dpdk_vhost_set_mtu,
>>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>>          NULL,
>>>>>>>_______________________________________________
>>>>>>>dev mailing list
>>>>>>>dev@openvswitch.org
>>>>>>>http://openvswitch.org/mailman/listinfo/dev
>>>>>>_______________________________________________
>>>>>>dev mailing list
>>>>>>dev@openvswitch.org
>>>>>>http://openvswitch.org/mailman/listinfo/dev
>>>><#secure method=pgpmime mode=encrypt>
Mark Kavanagh Feb. 16, 2016, 2:15 p.m. UTC | #11
>
>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>
>>>
>>>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>>>
>>>>>
>>>>>"Kavanagh, Mark B" <mark.b.kavanagh@intel.com> writes:
>>>>>> Further clarifications below Aaron - please let me know if you have
>>>>>> any additional comments.
>>>>>>
>>>>>> Thanks,
>>>>>> Mark
>>>>>
>>>>>Thanks again for this work, Mark!
>>>>>
>>>>>>>
>>>>>>>One addendum below, Aaron.
>>>>>>>
>>>>>>>>
>>>>>>>>>Hi Mark,
>>>>>>>>>
>>>>>>>>>Thanks for the patch! I've not done a very thorough review of this, but
>>>>>>>>>my first-blush comments are inline.
>>>>>>>>
>>>>>>>>Thanks for the review Aaron!
>>>>>>>>
>>>>>>>>Any additional comments are welcome - my initial responses are
>>>> inline, below.
>>>>>>>>
>>>>>>>>Thanks,
>>>>>>>>Mark
>>>>>>>>
>>>>>>>>>
>>>>>>>>>Mark Kavanagh <mark.b.kavanagh@intel.com> writes:
>>>>>>>>>> Add support for Jumbo Frames to DPDK-enabled port types,
>>>>>>>>>> using single-segment-mbufs.
>>>>>>>>>>
>>>>>>>>>> Using this approach, the amount of memory allocated for each mbuf
>>>>>>>>>> to store frame data is increased to a value greater than 1518B
>>>>>>>>>> (typical Ethernet maximum frame length). The increased space
>>>>>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried
>>>>>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf
>>>>>>>>>> segments.
>>>>>>>>>>
>>>>>>>>>> The amount of space allocated to each mbuf to hold frame data is
>>>>>>>>>> defined dynamically by the user when adding a DPDK port to a bridge.
>>>>>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid,
>>>>>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B).
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavanagh@intel.com>
>>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>>This is a new feature which has user-visible impact, so should be
>>>>>>>>>announced in NEWS (plus, it's great news).
>>>>>>>>
>>>>>>>>No problem - I'll add an entry.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>  INSTALL.DPDK.md   |  59 ++++++++-
>>>>>>>>>> lib/netdev-dpdk.c | 356
>>>>>> +++++++++++++++++++++++++++++++++++++++++-------------
>>>>>>>>>>  2 files changed, 328 insertions(+), 87 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
>>>>>>>>>> index 96b686c..2f23e27 100644
>>>>>>>>>> --- a/INSTALL.DPDK.md
>>>>>>>>>> +++ b/INSTALL.DPDK.md
>>>>>>>>>> @@ -859,10 +859,61 @@ by adding the following string:
>>>>>>>>>> to <interface> sections of all network devices used by
>>>> DPDK. Parameter 'N'
>>>>>>>>>>  determines how many queues can be used by the guest.
>>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +Jumbo Frames
>>>>>>>>>> +------------
>>>>>>>>>> +
>>>>>>>>>> +Support for Jumbo Frames may be enabled at run-time for
>> DPDK-type ports.
>>>>>>>>>> +
>>>>>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to
>>>>>> the ovs-vsctl
>>>>>>>>>> +'add-port' command-line, along with the required MTU for the port.
>>>>>>>>>> +e.g.
>>>>>>>>>
>>>>>>>>>This text and the example do not match. I think '--dpdk-mtu' is
>> not valid.
>>>>>>>>
>>>>>>>>Good catch, thanks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +
>>>>>>>>>> +     ```
>>>>>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk
>>>>>> options:dpdk-mtu=9000
>>>>>>>>>> +     ```
>>>>>>>>>> +
>>>>>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf
>>>> segments are
>>>>>>>>>> +increased, such that a full Jumbo Frame may be accommodated
>>>>>> inside a single
>>>>>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable.
>>>>>>>>>
>>>>>>>>>Why no support? DPDK supports changing the mtu.
>>>>>>>>
>>>>>>>>I guess my rationale here is that an MTU change can't be triggered
>>>>>> via OVS command-line, nor
>>>>>>>>can it be triggered programmatically via DPDK (apart from an
>>>> explicit call to
>>>>>>>>rte_eth_dev_set_mtu).
>>>>>>>>So, while technically it's possibly, from a user's point of view,
>>>>>> there's no way to
>>>>>>>configure
>>>>>>>>it, outside of modifying the code directly. If I've missed something
>>>>>> here, please let me
>>>>>>>>know.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +Jumbo frame support has been validated against 13312B
>> frames, using the
>>>>>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC
>> drivers may
>>>>>>>>>> +theoretically be supported. Supported port types excludes
>>>>>> vHost-Cuse ports, as
>>>>>>>>>> +this feature is pending deprecation.
>>>>>>>>>> +
>>>>>>>>>> +
>>>>>>>>>> +vHost Ports and Jumbo Frames
>>>>>>>>>> +----------------------------
>>>>>>>>>> +Jumbo frame support is available for DPDK vHost-User ports
>>>>>> only. Some additional
>>>>>>>>>> +configuration is needed to take advantage of this feature:
>>>>>>>>>> +
>>>>>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as
>>>>>> demonstrated in
>>>>>>>>>> +      the QEMU command line snippet below:
>>>>>>>>>> +
>>>>>>>>>> +      ```
>>>>>>>>>> +      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
>>>>>>>>>> + '-device
>>>>>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
>>>>>>>>>> +      ```
>>>>>>>>>> +
>>>>>>>>>> + 2. Where virtio devices are bound to the Linux kernel driver
>>>> in a guest
>>>>>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK
>>>>>> driver), the
>>>>>>>>>> + MTU of those logical network interfaces must also be
>>>> increased. This
>>>>>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that
>>>>>> 'MTU' refers
>>>>>>>>>> + to the length of the IP packet only, and not that of the entire
>>>>>> frame.
>>>>>>>>>> +
>>>>>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame,
>>>>>> subtract the L2
>>>>>>>>>> + header and CRC lengths (i.e. 18B) from the max supported
>>>> frame size.
>>>>>>>>>> +     So, to set the MTU for a 13312B Jumbo Frame:
>>>>>>>>>> +
>>>>>>>>>> +      ```
>>>>>>>>>> +      ifconfig eth1 mtu 13294
>>>>>>>>>> +      ```
>>>>>>>>>> +
>>>>>>>>>> +
>>>>>>>>>>  Restrictions:
>>>>>>>>>>  -------------
>>>>>>>>>>
>>>>>>>>>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix
>> this issue.
>>>>>>>>>>    - Currently DPDK port does not make use any offload functionality.
>>>>>>>>>>    - DPDK-vHost support works with 1G huge pages.
>>>>>>>>>>
>>>>>>>>>> @@ -903,6 +954,12 @@ Restrictions:
>>>>>>>>>> the next release of DPDK (which includes the above patch) is
>>>>>> available and
>>>>>>>>>>      integrated into OVS.
>>>>>>>>>>
>>>>>>>>>> +  Jumbo Frames:
>>>>>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit
>>>>>> gracefully. The source of
>>>>>>>>>> +     this issue is currently being investigated.
>>>>>>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost
>>>> Cuse ports.
>>>>>>>>>> +
>>>>>>>>>> +
>>>>>>>>>>  Bug Reporting:
>>>>>>>>>>  --------------
>>>>>>>>>>
>>>>>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
>>>>>>>>>> index de7e488..5f23e28 100644
>>>>>>>>>> --- a/lib/netdev-dpdk.c
>>>>>>>>>> +++ b/lib/netdev-dpdk.c
>>>>>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl =
>>>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>>>  #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
>>>>>>>>>>  #define OVS_VPORT_DPDK "ovs_dpdk"
>>>>>>>>>>
>>>>>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
>>>>>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
>>>>>>>>>> +
>>>>>>>>>>  /*
>>>>>>>>>> * need to reserve tons of extra space in the mbufs so we can
>> align the
>>>>>>>>>>   * DMA addresses to 4KB.
>>>>>>>>>> * The minimum mbuf size is limited to avoid scatter behaviour
>>>> and drop in
>>>>>>>>>>   * performance for standard Ethernet MTU.
>>>>>>>>>>   */
>>>>>>>>>> -#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
>>>>>>>>>> -#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
>>>>>>>>>> -                              + sizeof(struct dp_packet) \
>>>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>>>> -#define MBUF_SIZE_DRIVER     (2048                       \
>>>>>>>>>> -                              + sizeof (struct rte_mbuf) \
>>>>>>>>>> -                              + RTE_PKTMBUF_HEADROOM)
>>>>>>>>>> -#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
>>>>>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN +
>>>>>> ETHER_CRC_LEN)
>>>>>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN
>>>>>> - ETHER_CRC_LEN)
>>>>>>>>>> +#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
>>>>>>>>>> +                                    + sizeof(struct dp_packet)   \
>>>>>>>>>> +                                    + RTE_PKTMBUF_HEADROOM)
>>>>>>>>>> +
>>>>>>>>>> +/* This value should be specified as a multiple of the DPDK
>> NIC driver's
>>>>>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
>>>>>>>>>> + */
>>>>>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN    13312
>>>>>>>>>>
>>>>>>>>>> /* Max and min number of packets in the mempool.  OVS tries to
>>>> allocate a
>>>>>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system
>>>>>> doesn't have
>>>>>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl =
>>>>>> VLOG_RATE_LIMIT_INIT(5, 20);
>>>>>>>>>>  #define MIN_NB_MBUF          (4096 * 4)
>>>>>>>>>>  #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
>>>>>>>>>>
>>>>>>>>>> +#define DPDK_VLAN_TAG_LEN    4
>>>>>>>>>> +
>>>>>>>>>>  /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
>>>>>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF %
>>>>>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
>>>>>>>>>>
>>>>>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = {
>>>>>>>>>>          .header_split   = 0, /* Header Split disabled */
>>>>>>>>>>          .hw_ip_checksum = 0, /* IP checksum offload disabled */
>>>>>>>>>>          .hw_vlan_filter = 0, /* VLAN filtering disabled */
>>>>>>>>>> -        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
>>>>>>>>>>          .hw_strip_crc   = 0,
>>>>>>>>>>      },
>>>>>>>>>>      .rx_adv_conf = {
>>>>>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class)
>>>>>>>>>>      return class->construct == netdev_dpdk_construct;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity
>>>>>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K
>>>>>> for igb_uio).
>>>>>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient
>>>>>> buffers are
>>>>>>>>>> + * allocated to accomodate the packet in its
>>>>>> entirety. Furthermore, the igb_uio
>>>>>>>>>> + * driver needs to ensure that there is also sufficient space in
>>>>>> the Rx buffer
>>>>>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX
>>>> buffer is too
>>>>>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces
>>>>>>>>>> + * performance. To prevent this, use a buffer size that is closest to
>>>>>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria.
>>>>>>>>>> + */
>>>>>>>>>> +static uint32_t
>>>>>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
>>>>>>>>>> +{
>>>>>>>>>> +    struct rte_eth_dev_info info;
>>>>>>>>>> +    uint32_t buf_size;
>>>>>>>>>> +    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
>>>>>>>>>> +
>>>>>>>>>> +    if(netdev->type == DPDK_DEV_ETH) {
>>>>>>>>>> +        rte_eth_dev_info_get(netdev->port_id, &info);
>>>>>>>>>> +        buf_size = (info.min_rx_bufsize == 0) ?
>>>>>>>>>> +                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
>>>>>>>>>> +                   info.min_rx_bufsize;
>>>>>>>>>
>>>>>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu?
>>>>>>>>
>>>>>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather
>>>>>> the buffer size used
>>>>>>>>when creating the rte_mempool, which must be a multiple of the
>>>>>> driver's min_rx_bufsize.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +    } else {
>>>>>>>>>> +        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    if(len % buf_size) {
>>>>>>>>>> +        len = buf_size * ((len/buf_size) + 1);
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    return len;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page
>> should be used
>>>>>>>>>>   * for all other segments data, bss and text. */
>>>>>>>>>>
>>>>>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p)
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  static void
>>>>>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>>> -                   void *opaque_arg OVS_UNUSED,
>>>>>>>>>> -                   void *_m,
>>>>>>>>>> -                   unsigned i OVS_UNUSED)
>>>>>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
>>>>>>>>>>  {
>>>>>>>>>> -    struct rte_mbuf *m = _m;
>>>>>>>>>> -    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>>>> +    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
>>>>>>>>>> +    struct rte_pktmbuf_pool_private default_mbp_priv;
>>>>>>>>>> +    uint16_t roomsz;
>>>>>>>>>>
>>>>>>>>>>      RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
>>>>>>>>>>
>>>>>>>>>> -    memset(m, 0, mp->elt_size);
>>>>>>>>>> +    /* if no structure is provided, assume no mbuf private area */
>>>>>>>>>> +
>>>>>>>>>> +    user_mbp_priv = opaque_arg;
>>>>>>>>>> +    if (user_mbp_priv == NULL) {
>>>>>>>>>> +        default_mbp_priv.mbuf_priv_size = 0;
>>>>>>>>>> +        if (mp->elt_size > sizeof(struct dp_packet)) {
>>>>>>>>>> +            roomsz = mp->elt_size - sizeof(struct dp_packet);
>>>>>>>>>> +        } else {
>>>>>>>>>> +            roomsz = 0;
>>>>>>>>>> +        }
>>>>>>>>>> +        default_mbp_priv.mbuf_data_room_size = roomsz;
>>>>>>>>>> +        user_mbp_priv = &default_mbp_priv;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
>>>>>>>>>> +        user_mbp_priv->mbuf_data_room_size +
>>>>>>>>>> +        user_mbp_priv->mbuf_priv_size);
>>>>>>>>>> +
>>>>>>>>>> +    mbp_priv = rte_mempool_get_priv(mp);
>>>>>>>>>> +    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/* Initialise some fields in the mbuf structure that are not
>>>>>> modified by the
>>>>>>>>>> + * user once created (origin pool, buffer start address, etc.*/
>>>>>>>>>> +static void
>>>>>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>>> +                       void *opaque_arg OVS_UNUSED,
>>>>>>>>>> +                       void *_m,
>>>>>>>>>> +                       unsigned i OVS_UNUSED)
>>>>>>>>>> +{
>>>>>>>>>> +	struct rte_mbuf *m = _m;
>>>>>>>>>> +	uint32_t buf_size, buf_len, priv_size;
>>>>>>>>>> +
>>>>>>>>>> +	priv_size = rte_pktmbuf_priv_size(mp);
>>>>>>>>>> +	buf_size = sizeof(struct dp_packet) + priv_size;
>>>>>>>>>> +	buf_len = rte_pktmbuf_data_room_size(mp);
>>>>>>>>>>
>>>>>>>>>> -    /* start of buffer is just after mbuf structure */
>>>>>>>>>> -    m->buf_addr = (char *)m + sizeof(struct dp_packet);
>>>>>>>>>> -    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
>>>>>>>>>> -                    sizeof(struct dp_packet);
>>>>>>>>>> -    m->buf_len = (uint16_t)buf_len;
>>>>>>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) ==
>>>> priv_size);
>>>>>>>>>> +	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
>>>>>>>>>> +	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
>>>>>>>>>>
>>>>>>>>>> -    /* keep some headroom between start of buffer and data */
>>>>>>>>>> -    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
>>>>>>>>>> +	memset(m, 0, mp->elt_size);
>>>>>>>>>>
>>>>>>>>>> -    /* init some constant fields */
>>>>>>>>>> -    m->pool = mp;
>>>>>>>>>> -    m->nb_segs = 1;
>>>>>>>>>> -    m->port = 0xff;
>>>>>>>>>> +	/* start of buffer is after dp_packet structure and priv data */
>>>>>>>>>> +	m->priv_size = priv_size;
>>>>>>>>>> +	m->buf_addr = (char *)m + buf_size;
>>>>>>>>>> +	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
>>>>>>>>>> +	m->buf_len = (uint16_t)buf_len;
>>>>>>>>>> +
>>>>>>>>>> +	/* keep some headroom between start of buffer and data */
>>>>>>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM,
>>>> (uint16_t)m->buf_len);
>>>>>>>>>> +
>>>>>>>>>> +	/* init some constant fields */
>>>>>>>>>> +	m->pool = mp;
>>>>>>>>>> +	m->nb_segs = 1;
>>>>>>>>>> +	m->port = 0xff;
>>>>>>>>>
>>>>>>>>>Please don't mix tabs and spaces in the file.
>>>>>>>>
>>>>>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>>  static void
>>>>>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp,
>>>>>>>>>>  {
>>>>>>>>>>      struct rte_mbuf *m = _m;
>>>>>>>>>>
>>>>>>>>>> -    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
>>>>>>>>>> +    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
>>>>>>>>>>
>>>>>>>>>>      dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
>>>>>>>>>>  }
>>>>>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>>      struct dpdk_mp *dmp = NULL;
>>>>>>>>>>      char mp_name[RTE_MEMPOOL_NAMESIZE];
>>>>>>>>>>      unsigned mp_size;
>>>>>>>>>> +    struct rte_pktmbuf_pool_private mbp_priv;
>>>>>>>>>>
>>>>>>>>>>      LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
>>>>>>>>>>          if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
>>>>>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>>      dmp->socket_id = socket_id;
>>>>>>>>>>      dmp->mtu = mtu;
>>>>>>>>>>      dmp->refcount = 1;
>>>>>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) +
>>>>>> RTE_PKTMBUF_HEADROOM;
>>>>>>>>>> +    mbp_priv.mbuf_priv_size = 0;
>>>>>>>>>>
>>>>>>>>>>      mp_size = MAX_NB_MBUF;
>>>>>>>>>>      do {
>>>>>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu)
>>>>>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>>              return NULL;
>>>>>>>>>>          }
>>>>>>>>>>
>>>>>>>>>> -        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
>>>>>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size,
>>>>>> MBUF_SEGMENT_SIZE(mtu),
>>>>>>>>>>                                       MP_CACHE_SZ,
>>>>>>>>>> sizeof(struct rte_pktmbuf_pool_private),
>>>>>>>>>> -                                     rte_pktmbuf_pool_init, NULL,
>>>>>>>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv,
>>>>>>>>>>                                       ovs_rte_pktmbuf_init, NULL,
>>>>>>>>>>                                       socket_id, 0);
>>>>>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >=
>>>>>> MIN_NB_MBUF);
>>>>>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>>>> *dev, int n_rxq, int
>>>>>>>>n_txq)
>>>>>>>>>>  {
>>>>>>>>>>      int diag = 0;
>>>>>>>>>>      int i;
>>>>>>>>>> +    struct rte_eth_conf conf = port_conf;
>>>>>>>>>>
>>>>>>>>>> /* A device may report more queues than it makes available
>> (this has
>>>>>>>>>>       * been observed for Intel xl710, which reserves some of them for
>>>>>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk
>>>>>> *dev, int n_rxq, int
>>>>>>>>>n_txq)
>>>>>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
>>>>>>>>>>          }
>>>>>>>>>>
>>>>>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>>>>>> &port_conf);
>>>>>>>>>> +        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
>>>>>>>>>> +            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
>>>>>>>>>> +            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>>> +        }
>>>>>>>>>> +
>>>>>>>>>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq,
>> &conf);
>>>>>>>>>>          if (diag) {
>>>>>>>>>>              break;
>>>>>>>>>>          }
>>>>>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>>>> unsigned int port_no,
>>>>>>>>>>      struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
>>>>>>>>>>      int sid;
>>>>>>>>>>      int err = 0;
>>>>>>>>>> +    uint32_t buf_size;
>>>>>>>>>>
>>>>>>>>>>      ovs_mutex_init(&netdev->mutex);
>>>>>>>>>>      ovs_mutex_lock(&netdev->mutex);
>>>>>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_,
>>>>>> unsigned int port_no,
>>>>>>>>>>      netdev->port_id = port_no;
>>>>>>>>>>      netdev->type = type;
>>>>>>>>>>      netdev->flags = 0;
>>>>>>>>>> +
>>>>>>>>>> + /* Initialize port's MTU and frame len to the default
>>>> Ethernet values.
>>>>>>>>>> + * Larger, user-specified (jumbo) frame buffers are
>> accommodated in
>>>>>>>>>> +     * netdev_dpdk_set_config.
>>>>>>>>>> +     */
>>>>>>>>>> +    netdev->max_packet_len = ETHER_MAX_LEN;
>>>>>>>>>>      netdev->mtu = ETHER_MTU;
>>>>>>>>>> -    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
>>>>>>>>>>
>>>>>>>>>> -    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
>>>>>>>>>> +    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
>>>>>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id,
>>>>>> FRAME_LEN_TO_MTU(buf_size));
>>>>>>>>>>      if (!netdev->dpdk_mp) {
>>>>>>>>>>          err = ENOMEM;
>>>>>>>>>>          goto unlock;
>>>>>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[],
>>>>>> const char prefix[],
>>>>>>>>>>      return 0;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +static void
>>>>>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
>>>>>>>>>> +{
>>>>>>>>>> +    const char *mtu_str = smap_get(args, "dpdk-mtu");
>>>>>>>>>> +    int local_mtu;
>>>>>>>>>> +
>>>>>>>>>> +    if(mtu_str) {
>>>>>>>>>> +        local_mtu = atoi(mtu_str);
>>>>>>>>>
>>>>>>>>>Please use strtol or strtoul here, and detect errors in the string by
>>>>>>>>>checking endptr. That way, we can be sure that random garbage on the
>>>>>>>>>stack won't accidentally become the mtu.
>>>>>>>>
>>>>>>>>Will do, thanks.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +    }
>>>>>>>>>> +    if(!mtu_str || local_mtu < ETHER_MTU ||
>>>>>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
>>>>>>>>>> +        local_mtu = ETHER_MTU;
>>>>>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting
>>>>>> to %d.\n",
>>>>>>>>>> +                   local_mtu);
>>>>>>>>>> +    }
>>>>>>>>>> +    *mtu = local_mtu;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static int
>>>>>>>>>> vhost_construct_helper(struct netdev *netdev_)
>> OVS_REQUIRES(dpdk_mutex)
>>>>>>>>>>  {
>>>>>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev
>>>>>> *netdev_, struct smap
>>>>>>>>>*args)
>>>>>>>>>> smap_add_format(args, "configured_rx_queues", "%d",
>> netdev_->n_rxq);
>>>>>>>>>> smap_add_format(args, "requested_tx_queues", "%d",
>> netdev_->n_txq);
>>>>>>>>>> smap_add_format(args, "configured_tx_queues", "%d",
>>>> dev->real_n_txq);
>>>>>>>>>> +    smap_add_format(args, "mtu", "%d", dev->mtu);
>>>>>>>>>>      ovs_mutex_unlock(&dev->mutex);
>>>>>>>>>>
>>>>>>>>>>      return 0;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */
>>>>>>>>>> +static int
>>>>>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>>>> +{
>>>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>>>> +    int old_mtu, err;
>>>>>>>>>> +    uint32_t buf_size;
>>>>>>>>>> +    int dpdk_mtu;
>>>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>>>> +
>>>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>>>> +        err = 0;
>>>>>>>>>> +        goto out;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
>>>>>>>>>> +    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
>>>>>>>>>> +
>>>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
>>>>>>>>>> +    if (!mp) {
>>>>>>>>>> +        err = ENOMEM;
>>>>>>>>>> +        goto out;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    rte_eth_dev_stop(dev->port_id);
>>>>>>>>>> +
>>>>>>>>>> +    old_mtu = dev->mtu;
>>>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>>>> +    dev->mtu = mtu;
>>>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>>> +
>>>>>>>>>> +    err = dpdk_eth_dev_init(dev);
>>>>>>>>>
>>>>>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling
>>>>>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the
>>>>>>>>>rte_eth_dev_start() instead?
>>>>>>
>>>>>> The rx queues for the device need to be assigned the mempool that was
>>>>>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the
>>>>>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it
>>>>>> here.
>>>>>>
>>>>>>>>
>>>>>>>>This is legacy code, but I can refactor if needs be.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +    if (err) {
>>>>>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last
>>>>>> known "
>>>>>>>>>> +                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
>>>>>>>>>> +        dpdk_mp_put(mp);
>>>>>>>>>> +        dev->mtu = old_mtu;
>>>>>>>>>> +        dev->dpdk_mp = old_mp;
>>>>>>>>>> +        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>>> +        err = dpdk_eth_dev_init(dev);
>>>>>>>>>
>>>>>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you
>>>>>>>>>will cause a queue reconfiguration action, which is not needed, and can
>>>>>>>>>error.
>>>>>>
>>>>>> This call is needed, in case the eth_dev's queue configuration was
>>>>>> changed by the previous call.
>>>>>> The error handling for this call is arguably outside the scope of this
>>>>>> patch, and could be submitted as a separate patch - what do you think?
>>>>>
>>>>>That's probably a good idea.
>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>As above.
>>>>>>>>
>>>>>>>>>If you are intending to set MTU, I would have expected calls to
>>>>>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I
>>>>>>>>>recognize that much of this is legacy, but since you are touching it :)
>>>>>>>>
>>>>>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on
>>>>>> in the patch; I can take
>>>>>>>>a second look now that the code is more stable though.
>>>>>>>
>>>>>>>I remembered why I didn't use this.
>>>>>>>
>>>>>>>If you look at the IXGBE driver's mtu_set function, it refuses frame
>>>>>> lengths which are larger
>>>>>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM).
>>>>>>>This might be avoided by configuring the device for scatter RX
>>>>>> behavior, but this incurs a
>>>>>>>performance impact.
>>>>>
>>>>>It seems strange to me that we need to ignore the library API for this
>>>>>one case, right? What if the user is only doing vhostuser PMDs mixed
>>>>>with a bnx2x card?
>>>>>
>>>>
>>>> I can't say that I've familiar enough with this setup to form an
>>>> answer on that one Aaron.
>>>>
>>>>>It would be better to get it fixed in the library, I think. After all,
>>>>>they gave a generic setting to change MTU - why don't they make it
>>>>>behave well?
>>>>>
>>>>
>>>> Let me follow up on that with some local DPDK folks here.
>>>>
>>
>> Just a quick follow up on this Aaron: I did implement a version of
>> this patch that used rte_eth_dev_set_mtu; however, not all DPDK
>> drivers yet implement that function (e.g. i40e), so I reverted to the
>> previous version.
>> DPDK team did mention that the _set_mtu function for i40e should be
>> implemented 'soon', but no dates were specified. See this thread on
>> the DPDK ML for more details:
>> http://dpdk.org/ml/archives/dev/2016-February/032910.html
>
>Thanks so much for that, it's good to know. I guess we need to wait for
>them to fully implement their api before calling it, so I'm okay with
>this change.

Perfect - thanks again for your review and feedback on this patch!

>
>>>> Would you consider this a showstopper for upstreaming of this feature?
>>>> If we need to wait for a change to DPDK, Jumbo Frames will miss the
>>>> 2.5 window.
>>>
>>>I don't think it would be a show stopper as long as other devices wouldn't
>>>be negatively impacted, but it's a place where a /*XXX: This is a
>>>workaround for dpdk*/ comment should be included so that we can fix it
>>>when/if it is updated in the library. What do you think?
>>>
>>>>>>><snippet>
>>>>>>>ixgbe_dev_mtu_set(...) {
>>>>>>>
>>>>>>>...
>>>>>>>
>>>>>>>    /* refuse mtu that requires the support of scattered packets when this
>>>>>>>     * feature has not been enabled before. */
>>>>>>>    if (!dev->data->scattered_rx &&
>>>>>>>        (frame_size + 2 * IXGBE_VLAN_TAG_SIZE >
>>>>>>>         dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM))
>>>>>>>        return -EINVAL;
>>>>>>>
>>>>>>>...
>>>>>>>}
>>>>>>></snippet>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> +        goto out;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>>>> +out:
>>>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>>>> +    return err;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int
>>>>>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct
>> smap *args)
>>>>>>>>>> +{
>>>>>>>>>> +    int mtu;
>>>>>>>>>> +
>>>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>>>> +
>>>>>>>>>> +    return netdev_dpdk_set_mtu(netdev_, mtu);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static int
>>>>>>>>>>  netdev_dpdk_get_numa_id(const struct netdev *netdev_)
>>>>>>>>>>  {
>>>>>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev
>>>>>> *netdev, int *mtup)
>>>>>>>>>>
>>>>>>>>>>      return 0;
>>>>>>>>>>  }
>>>>>>>>>> -
>>>>>>>>>> -static int
>>>>>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>>>> -{
>>>>>>>>>> -    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>>>> -    int old_mtu, err;
>>>>>>>>>> -    struct dpdk_mp *old_mp;
>>>>>>>>>> -    struct dpdk_mp *mp;
>>>>>>>>>> -
>>>>>>>>>> -    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>>>> -    ovs_mutex_lock(&dev->mutex);
>>>>>>>>>> -    if (dev->mtu == mtu) {
>>>>>>>>>> -        err = 0;
>>>>>>>>>> -        goto out;
>>>>>>>>>> -    }
>>>>>>>>>> -
>>>>>>>>>> -    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
>>>>>>>>>> -    if (!mp) {
>>>>>>>>>> -        err = ENOMEM;
>>>>>>>>>> -        goto out;
>>>>>>>>>> -    }
>>>>>>>>>> -
>>>>>>>>>> -    rte_eth_dev_stop(dev->port_id);
>>>>>>>>>> -
>>>>>>>>>> -    old_mtu = dev->mtu;
>>>>>>>>>> -    old_mp = dev->dpdk_mp;
>>>>>>>>>> -    dev->dpdk_mp = mp;
>>>>>>>>>> -    dev->mtu = mtu;
>>>>>>>>>> -    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>>>> -
>>>>>>>>>> -    err = dpdk_eth_dev_init(dev);
>>>>>>>>>> -    if (err) {
>>>>>>>>>> -        dpdk_mp_put(mp);
>>>>>>>>>> -        dev->mtu = old_mtu;
>>>>>>>>>> -        dev->dpdk_mp = old_mp;
>>>>>>>>>> -        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
>>>>>>>>>> -        dpdk_eth_dev_init(dev);
>>>>>>>>>> -        goto out;
>>>>>>>>>> -    }
>>>>>>>>>> -
>>>>>>>>>> -    dpdk_mp_put(old_mp);
>>>>>>>>>> -    netdev_change_seq_changed(netdev);
>>>>>>>>>> -out:
>>>>>>>>>> -    ovs_mutex_unlock(&dev->mutex);
>>>>>>>>>> -    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>>>> -    return err;
>>>>>>>>>> -}
>>>>>>>>>> -
>>>>>>>>>>  static int
>>>>>>>>>>  netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
>>>>>>>>>>
>>>>>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev
>>>>>> *netdev_, struct smap
>>>>>>>>>*args)
>>>>>>>>>> smap_add_format(args, "numa_id", "%d",
>>>>>> rte_eth_dev_socket_id(dev->port_id));
>>>>>>>>>>      smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
>>>>>>>>>> smap_add_format(args, "min_rx_bufsize", "%u",
>>>>>> dev_info.min_rx_bufsize);
>>>>>>>>>> - smap_add_format(args, "max_rx_pktlen", "%u",
>>>> dev_info.max_rx_pktlen);
>>>>>>>>>> +    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
>>>>>>>>>> smap_add_format(args, "max_rx_queues", "%u",
>>>> dev_info.max_rx_queues);
>>>>>>>>>> smap_add_format(args, "max_tx_queues", "%u",
>>>> dev_info.max_tx_queues);
>>>>>>>>>> smap_add_format(args, "max_mac_addrs", "%u",
>>>> dev_info.max_mac_addrs);
>>>>>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void)
>>>>>>>>>>      return 0;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */
>>>>>>>>>> +static int
>>>>>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
>>>>>>>>>> +{
>>>>>>>>>> +    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
>>>>>>>>>> +    int err = 0;
>>>>>>>>>> +    struct dpdk_mp *old_mp;
>>>>>>>>>> +    struct dpdk_mp *mp;
>>>>>>>>>> +
>>>>>>>>>> +    ovs_mutex_lock(&dpdk_mutex);
>>>>>>>>>> +    ovs_mutex_lock(&dev->mutex);
>>>>>>>>>> +    if (dev->mtu == mtu) {
>>>>>>>>>> +        err = 0;
>>>>>>>>>> +        goto out;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    mp = dpdk_mp_get(dev->socket_id, mtu);
>>>>>>>>>> +    if (!mp) {
>>>>>>>>>> +        err = ENOMEM;
>>>>>>>>>> +        goto out;
>>>>>>>>>> +    }
>>>>>>>>>> +
>>>>>>>>>> +    old_mp = dev->dpdk_mp;
>>>>>>>>>> +    dev->dpdk_mp = mp;
>>>>>>>>>> +    dev->mtu = mtu;
>>>>>>>>>> +    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
>>>>>>>>>> +
>>>>>>>>>> +    dpdk_mp_put(old_mp);
>>>>>>>>>> +    netdev_change_seq_changed(netdev);
>>>>>>>>>> +out:
>>>>>>>>>> +    ovs_mutex_unlock(&dev->mutex);
>>>>>>>>>> +    ovs_mutex_unlock(&dpdk_mutex);
>>>>>>>>>> +    return err;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static int
>>>>>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const
>>>>>> struct smap *args)
>>>>>>>>>> +{
>>>>>>>>>> +    int mtu;
>>>>>>>>>> +
>>>>>>>>>> +    dpdk_dev_parse_mtu(args, &mtu);
>>>>>>>>>> +
>>>>>>>>>> +    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>>  static void
>>>>>>>>>>  dpdk_common_init(void)
>>>>>>>>>>  {
>>>>>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk:
>>>>>>>>>>      return err;
>>>>>>>>>>  }
>>>>>>>>>>
>>>>>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,
>>>> MULTIQ, SEND, \
>>>>>>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \
>>>>>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT,
>> SET_CONFIG, \
>>>>>>>>>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS,
>> GET_FEATURES, \
>>>>>>>>>> + GET_STATUS, RXQ_RECV) \
>>>>>>>>>>  {                                                             \
>>>>>>>>>>      NAME,                                                     \
>>>>>>>>>>      INIT,                       /* init */                    \
>>>>>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk:
>>>>>>>>>>      DESTRUCT,                                                 \
>>>>>>>>>>      netdev_dpdk_dealloc,                                      \
>>>>>>>>>>      netdev_dpdk_get_config,                                   \
>>>>>>>>>> -    NULL,                       /* netdev_dpdk_set_config */  \
>>>>>>>>>> +    SET_CONFIG,                                               \
>>>>>>>>>>      NULL,                       /* get_tunnel_config */       \
>>>>>>>>>>      NULL,                       /* build header */            \
>>>>>>>>>>      NULL,                       /* push header */             \
>>>>>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk:
>>>>>>>>>>      netdev_dpdk_set_etheraddr,                                \
>>>>>>>>>>      netdev_dpdk_get_etheraddr,                                \
>>>>>>>>>>      netdev_dpdk_get_mtu,                                      \
>>>>>>>>>> -    netdev_dpdk_set_mtu,                                      \
>>>>>>>>>> +    SET_MTU,                                                  \
>>>>>>>>>>      netdev_dpdk_get_ifindex,                                  \
>>>>>>>>>>      GET_CARRIER,                                              \
>>>>>>>>>>      netdev_dpdk_get_carrier_resets,                           \
>>>>>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class =
>>>>>>>>>>          NULL,
>>>>>>>>>>          netdev_dpdk_construct,
>>>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>>>          netdev_dpdk_eth_send,
>>>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class
>> dpdk_ring_class =
>>>>>>>>>>          NULL,
>>>>>>>>>>          netdev_dpdk_ring_construct,
>>>>>>>>>>          netdev_dpdk_destruct,
>>>>>>>>>> +        netdev_dpdk_set_config,
>>>>>>>>>>          netdev_dpdk_set_multiq,
>>>>>>>>>>          netdev_dpdk_ring_send,
>>>>>>>>>> +        netdev_dpdk_set_mtu,
>>>>>>>>>>          netdev_dpdk_get_carrier,
>>>>>>>>>>          netdev_dpdk_get_stats,
>>>>>>>>>>          netdev_dpdk_get_features,
>>>>>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class
>>>>>> OVS_UNUSED dpdk_vhost_cuse_class
>>>>>>>=
>>>>>>>>>>          dpdk_vhost_cuse_class_init,
>>>>>>>>>>          netdev_dpdk_vhost_cuse_construct,
>>>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>>>> +        NULL,
>>>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>>>> +        NULL,
>>>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>>>          NULL,
>>>>>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class
>>>>>> OVS_UNUSED dpdk_vhost_user_class
>>>>>>>=
>>>>>>>>>>          dpdk_vhost_user_class_init,
>>>>>>>>>>          netdev_dpdk_vhost_user_construct,
>>>>>>>>>>          netdev_dpdk_vhost_destruct,
>>>>>>>>>> +        netdev_dpdk_vhost_set_config,
>>>>>>>>>>          netdev_dpdk_vhost_set_multiq,
>>>>>>>>>>          netdev_dpdk_vhost_send,
>>>>>>>>>> +        netdev_dpdk_vhost_set_mtu,
>>>>>>>>>>          netdev_dpdk_vhost_get_carrier,
>>>>>>>>>>          netdev_dpdk_vhost_get_stats,
>>>>>>>>>>          NULL,
>>>>>>>>_______________________________________________
>>>>>>>>dev mailing list
>>>>>>>>dev@openvswitch.org
>>>>>>>>http://openvswitch.org/mailman/listinfo/dev
>>>>>>>_______________________________________________
>>>>>>>dev mailing list
>>>>>>>dev@openvswitch.org
>>>>>>>http://openvswitch.org/mailman/listinfo/dev
>>>>><#secure method=pgpmime mode=encrypt>
diff mbox

Patch

diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md
index 96b686c..2f23e27 100644
--- a/INSTALL.DPDK.md
+++ b/INSTALL.DPDK.md
@@ -859,10 +859,61 @@  by adding the following string:
 to <interface> sections of all network devices used by DPDK. Parameter 'N'
 determines how many queues can be used by the guest.
 
+
+Jumbo Frames
+------------
+
+Support for Jumbo Frames may be enabled at run-time for DPDK-type ports.
+
+To avail of Jumbo Frame support, add the '--dpdk-mtu' option to the ovs-vsctl
+'add-port' command-line, along with the required MTU for the port.
+e.g.
+
+     ```
+     ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk options:dpdk-mtu=9000
+     ```
+
+When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are
+increased, such that a full Jumbo Frame may be accommodated inside a single
+mbuf segment. Once set, the MTU for a DPDK port is immutable.
+
+Jumbo frame support has been validated against 13312B frames, using the
+DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers may
+theoretically be supported. Supported port types excludes vHost-Cuse ports, as
+this feature is pending deprecation.
+
+
+vHost Ports and Jumbo Frames
+----------------------------
+Jumbo frame support is available for DPDK vHost-User ports only. Some additional
+configuration is needed to take advantage of this feature:
+
+  1. `mergeable buffers` must be enabled for vHost ports, as demonstrated in
+      the QEMU command line snippet below:
+
+      ```
+      '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \'
+      '-device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on'
+      ```
+
+  2. Where virtio devices are bound to the Linux kernel driver in a guest
+     environment (i.e. interfaces are not bound to an in-guest DPDK driver), the
+     MTU of those logical network interfaces must also be increased. This
+     avoids segmentation of Jumbo Frames in the guest. Note that 'MTU' refers
+     to the length of the IP packet only, and not that of the entire frame.
+
+     e.g. To calculate the exact MTU of a standard IPv4 frame, subtract the L2
+     header and CRC lengths (i.e. 18B) from the max supported frame size.
+     So, to set the MTU for a 13312B Jumbo Frame:
+
+      ```
+      ifconfig eth1 mtu 13294
+      ```
+
+
 Restrictions:
 -------------
 
-  - Work with 1500 MTU, needs few changes in DPDK lib to fix this issue.
   - Currently DPDK port does not make use any offload functionality.
   - DPDK-vHost support works with 1G huge pages.
 
@@ -903,6 +954,12 @@  Restrictions:
     the next release of DPDK (which includes the above patch) is available and
     integrated into OVS.
 
+  Jumbo Frames:
+  - `virtio-pmd`: DPDK apps in the guest do not exit gracefully. The source of
+     this issue is currently being investigated.
+  - vHost-Cuse: Jumbo Frame support is not available for vHost Cuse ports.
+
+
 Bug Reporting:
 --------------
 
diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
index de7e488..5f23e28 100644
--- a/lib/netdev-dpdk.c
+++ b/lib/netdev-dpdk.c
@@ -62,20 +62,25 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE
 #define OVS_VPORT_DPDK "ovs_dpdk"
 
+#define NETDEV_DPDK_JUMBO_FRAME_ENABLED     1
+#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE      1024
+
 /*
  * need to reserve tons of extra space in the mbufs so we can align the
  * DMA addresses to 4KB.
  * The minimum mbuf size is limited to avoid scatter behaviour and drop in
  * performance for standard Ethernet MTU.
  */
-#define MTU_TO_MAX_LEN(mtu)  ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
-#define MBUF_SIZE_MTU(mtu)   (MTU_TO_MAX_LEN(mtu)        \
-                              + sizeof(struct dp_packet) \
-                              + RTE_PKTMBUF_HEADROOM)
-#define MBUF_SIZE_DRIVER     (2048                       \
-                              + sizeof (struct rte_mbuf) \
-                              + RTE_PKTMBUF_HEADROOM)
-#define MBUF_SIZE(mtu)       MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER)
+#define MTU_TO_FRAME_LEN(mtu)       ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN)
+#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN - ETHER_CRC_LEN)
+#define MBUF_SEGMENT_SIZE(mtu)      ( MTU_TO_FRAME_LEN(mtu)      \
+                                    + sizeof(struct dp_packet)   \
+                                    + RTE_PKTMBUF_HEADROOM)
+
+/* This value should be specified as a multiple of the DPDK NIC driver's
+ * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio').
+ */
+#define NETDEV_DPDK_MAX_FRAME_LEN    13312
 
 /* Max and min number of packets in the mempool.  OVS tries to allocate a
  * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have
@@ -86,6 +91,8 @@  static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20);
 #define MIN_NB_MBUF          (4096 * 4)
 #define MP_CACHE_SZ          RTE_MEMPOOL_CACHE_MAX_SIZE
 
+#define DPDK_VLAN_TAG_LEN    4
+
 /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */
 BUILD_ASSERT_DECL(MAX_NB_MBUF % ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0);
 
@@ -114,7 +121,6 @@  static const struct rte_eth_conf port_conf = {
         .header_split   = 0, /* Header Split disabled */
         .hw_ip_checksum = 0, /* IP checksum offload disabled */
         .hw_vlan_filter = 0, /* VLAN filtering disabled */
-        .jumbo_frame    = 0, /* Jumbo Frame Support disabled */
         .hw_strip_crc   = 0,
     },
     .rx_adv_conf = {
@@ -254,6 +260,39 @@  is_dpdk_class(const struct netdev_class *class)
     return class->construct == netdev_dpdk_construct;
 }
 
+/* DPDK NIC drivers allocate RX buffers at a particular granularity
+ * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K for igb_uio).
+ * If 'frame_len' is not a multiple of this value, insufficient buffers are
+ * allocated to accomodate the packet in its entirety. Furthermore, the igb_uio
+ * driver needs to ensure that there is also sufficient space in the Rx buffer
+ * to accommodate two VLAN tags (for QinQ frames). If the RX buffer is too
+ * small, then the driver enables scatter RX behaviour, which reduces
+ * performance. To prevent this, use a buffer size that is closest to
+ * 'frame_len', but which satisfies the aforementioned criteria.
+ */
+static uint32_t
+dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len)
+{
+    struct rte_eth_dev_info info;
+    uint32_t buf_size;
+    uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN);
+
+    if(netdev->type == DPDK_DEV_ETH) {
+        rte_eth_dev_info_get(netdev->port_id, &info);
+        buf_size = (info.min_rx_bufsize == 0) ?
+                   NETDEV_DPDK_DEFAULT_RX_BUFSIZE :
+                   info.min_rx_bufsize;
+    } else {
+        buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE;
+    }
+
+    if(len % buf_size) {
+        len = buf_size * ((len/buf_size) + 1);
+    }
+
+    return len;
+}
+
 /* XXX: use dpdk malloc for entire OVS. in fact huge page should be used
  * for all other segments data, bss and text. */
 
@@ -280,31 +319,70 @@  free_dpdk_buf(struct dp_packet *p)
 }
 
 static void
-__rte_pktmbuf_init(struct rte_mempool *mp,
-                   void *opaque_arg OVS_UNUSED,
-                   void *_m,
-                   unsigned i OVS_UNUSED)
+ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg)
 {
-    struct rte_mbuf *m = _m;
-    uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet);
+    struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv;
+    struct rte_pktmbuf_pool_private default_mbp_priv;
+    uint16_t roomsz;
 
     RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet));
 
-    memset(m, 0, mp->elt_size);
+    /* if no structure is provided, assume no mbuf private area */
+
+    user_mbp_priv = opaque_arg;
+    if (user_mbp_priv == NULL) {
+        default_mbp_priv.mbuf_priv_size = 0;
+        if (mp->elt_size > sizeof(struct dp_packet)) {
+            roomsz = mp->elt_size - sizeof(struct dp_packet);
+        } else {
+            roomsz = 0;
+        }
+        default_mbp_priv.mbuf_data_room_size = roomsz;
+        user_mbp_priv = &default_mbp_priv;
+    }
+
+    RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) +
+        user_mbp_priv->mbuf_data_room_size +
+        user_mbp_priv->mbuf_priv_size);
+
+    mbp_priv = rte_mempool_get_priv(mp);
+    memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv));
+}
+
+/* Initialise some fields in the mbuf structure that are not modified by the
+ * user once created (origin pool, buffer start address, etc.*/
+static void
+__ovs_rte_pktmbuf_init(struct rte_mempool *mp,
+                       void *opaque_arg OVS_UNUSED,
+                       void *_m,
+                       unsigned i OVS_UNUSED)
+{
+	struct rte_mbuf *m = _m;
+	uint32_t buf_size, buf_len, priv_size;
+
+	priv_size = rte_pktmbuf_priv_size(mp);
+	buf_size = sizeof(struct dp_packet) + priv_size;
+	buf_len = rte_pktmbuf_data_room_size(mp);
 
-    /* start of buffer is just after mbuf structure */
-    m->buf_addr = (char *)m + sizeof(struct dp_packet);
-    m->buf_physaddr = rte_mempool_virt2phy(mp, m) +
-                    sizeof(struct dp_packet);
-    m->buf_len = (uint16_t)buf_len;
+	RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == priv_size);
+	RTE_MBUF_ASSERT(mp->elt_size >= buf_size);
+	RTE_MBUF_ASSERT(buf_len <= UINT16_MAX);
 
-    /* keep some headroom between start of buffer and data */
-    m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len);
+	memset(m, 0, mp->elt_size);
 
-    /* init some constant fields */
-    m->pool = mp;
-    m->nb_segs = 1;
-    m->port = 0xff;
+	/* start of buffer is after dp_packet structure and priv data */
+	m->priv_size = priv_size;
+	m->buf_addr = (char *)m + buf_size;
+	m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size;
+	m->buf_len = (uint16_t)buf_len;
+
+	/* keep some headroom between start of buffer and data */
+	m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, (uint16_t)m->buf_len);
+
+	/* init some constant fields */
+	m->pool = mp;
+	m->nb_segs = 1;
+	m->port = 0xff;
 }
 
 static void
@@ -315,7 +393,7 @@  ovs_rte_pktmbuf_init(struct rte_mempool *mp,
 {
     struct rte_mbuf *m = _m;
 
-    __rte_pktmbuf_init(mp, opaque_arg, _m, i);
+    __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i);
 
     dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len);
 }
@@ -326,6 +404,7 @@  dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
     struct dpdk_mp *dmp = NULL;
     char mp_name[RTE_MEMPOOL_NAMESIZE];
     unsigned mp_size;
+    struct rte_pktmbuf_pool_private mbp_priv;
 
     LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) {
         if (dmp->socket_id == socket_id && dmp->mtu == mtu) {
@@ -338,6 +417,8 @@  dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
     dmp->socket_id = socket_id;
     dmp->mtu = mtu;
     dmp->refcount = 1;
+    mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + RTE_PKTMBUF_HEADROOM;
+    mbp_priv.mbuf_priv_size = 0;
 
     mp_size = MAX_NB_MBUF;
     do {
@@ -346,10 +427,10 @@  dpdk_mp_get(int socket_id, int mtu) OVS_REQUIRES(dpdk_mutex)
             return NULL;
         }
 
-        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu),
+        dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SEGMENT_SIZE(mtu),
                                      MP_CACHE_SZ,
                                      sizeof(struct rte_pktmbuf_pool_private),
-                                     rte_pktmbuf_pool_init, NULL,
+                                     ovs_rte_pktmbuf_pool_init, &mbp_priv,
                                      ovs_rte_pktmbuf_init, NULL,
                                      socket_id, 0);
     } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= MIN_NB_MBUF);
@@ -433,6 +514,7 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
 {
     int diag = 0;
     int i;
+    struct rte_eth_conf conf = port_conf;
 
     /* A device may report more queues than it makes available (this has
      * been observed for Intel xl710, which reserves some of them for
@@ -444,7 +526,12 @@  dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int n_rxq, int n_txq)
             VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq);
         }
 
-        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &port_conf);
+        if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) {
+            conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED;
+            conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu);
+        }
+
+        diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf);
         if (diag) {
             break;
         }
@@ -586,6 +673,7 @@  netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
     struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_);
     int sid;
     int err = 0;
+    uint32_t buf_size;
 
     ovs_mutex_init(&netdev->mutex);
     ovs_mutex_lock(&netdev->mutex);
@@ -605,10 +693,16 @@  netdev_dpdk_init(struct netdev *netdev_, unsigned int port_no,
     netdev->port_id = port_no;
     netdev->type = type;
     netdev->flags = 0;
+
+    /* Initialize port's MTU and frame len to the default Ethernet values.
+     * Larger, user-specified (jumbo) frame buffers are accommodated in
+     * netdev_dpdk_set_config.
+     */
+    netdev->max_packet_len = ETHER_MAX_LEN;
     netdev->mtu = ETHER_MTU;
-    netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu);
 
-    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu);
+    buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN);
+    netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, FRAME_LEN_TO_MTU(buf_size));
     if (!netdev->dpdk_mp) {
         err = ENOMEM;
         goto unlock;
@@ -651,6 +745,24 @@  dpdk_dev_parse_name(const char dev_name[], const char prefix[],
     return 0;
 }
 
+static void
+dpdk_dev_parse_mtu(const struct smap *args, int *mtu)
+{
+    const char *mtu_str = smap_get(args, "dpdk-mtu");
+    int local_mtu;
+
+    if(mtu_str) {
+        local_mtu = atoi(mtu_str);
+    }
+    if(!mtu_str || local_mtu < ETHER_MTU ||
+                   local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) {
+        local_mtu = ETHER_MTU;
+        VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting to %d.\n",
+                   local_mtu);
+    }
+    *mtu = local_mtu;
+}
+
 static int
 vhost_construct_helper(struct netdev *netdev_) OVS_REQUIRES(dpdk_mutex)
 {
@@ -777,11 +889,77 @@  netdev_dpdk_get_config(const struct netdev *netdev_, struct smap *args)
     smap_add_format(args, "configured_rx_queues", "%d", netdev_->n_rxq);
     smap_add_format(args, "requested_tx_queues", "%d", netdev_->n_txq);
     smap_add_format(args, "configured_tx_queues", "%d", dev->real_n_txq);
+    smap_add_format(args, "mtu", "%d", dev->mtu);
     ovs_mutex_unlock(&dev->mutex);
 
     return 0;
 }
 
+/* Set the mtu of DPDK_DEV_ETH ports */
+static int
+netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int old_mtu, err;
+    uint32_t buf_size;
+    int dpdk_mtu;
+    struct dpdk_mp *old_mp;
+    struct dpdk_mp *mp;
+
+    ovs_mutex_lock(&dpdk_mutex);
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu == mtu) {
+        err = 0;
+        goto out;
+    }
+
+    buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu));
+    dpdk_mtu = FRAME_LEN_TO_MTU(buf_size);
+
+    mp = dpdk_mp_get(dev->socket_id, dpdk_mtu);
+    if (!mp) {
+        err = ENOMEM;
+        goto out;
+    }
+
+    rte_eth_dev_stop(dev->port_id);
+
+    old_mtu = dev->mtu;
+    old_mp = dev->dpdk_mp;
+    dev->dpdk_mp = mp;
+    dev->mtu = mtu;
+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
+
+    err = dpdk_eth_dev_init(dev);
+    if (err) {
+        VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last known "
+                  "good value '%d'\n", mtu, dev->up.name, old_mtu);
+        dpdk_mp_put(mp);
+        dev->mtu = old_mtu;
+        dev->dpdk_mp = old_mp;
+        dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
+        err = dpdk_eth_dev_init(dev);
+        goto out;
+    }
+
+    dpdk_mp_put(old_mp);
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static int
+netdev_dpdk_set_config(struct netdev *netdev_, const struct smap *args)
+{
+    int mtu;
+
+    dpdk_dev_parse_mtu(args, &mtu);
+
+    return netdev_dpdk_set_mtu(netdev_, mtu);
+}
+
 static int
 netdev_dpdk_get_numa_id(const struct netdev *netdev_)
 {
@@ -1358,54 +1536,6 @@  netdev_dpdk_get_mtu(const struct netdev *netdev, int *mtup)
 
     return 0;
 }
-
-static int
-netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu)
-{
-    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
-    int old_mtu, err;
-    struct dpdk_mp *old_mp;
-    struct dpdk_mp *mp;
-
-    ovs_mutex_lock(&dpdk_mutex);
-    ovs_mutex_lock(&dev->mutex);
-    if (dev->mtu == mtu) {
-        err = 0;
-        goto out;
-    }
-
-    mp = dpdk_mp_get(dev->socket_id, dev->mtu);
-    if (!mp) {
-        err = ENOMEM;
-        goto out;
-    }
-
-    rte_eth_dev_stop(dev->port_id);
-
-    old_mtu = dev->mtu;
-    old_mp = dev->dpdk_mp;
-    dev->dpdk_mp = mp;
-    dev->mtu = mtu;
-    dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
-
-    err = dpdk_eth_dev_init(dev);
-    if (err) {
-        dpdk_mp_put(mp);
-        dev->mtu = old_mtu;
-        dev->dpdk_mp = old_mp;
-        dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu);
-        dpdk_eth_dev_init(dev);
-        goto out;
-    }
-
-    dpdk_mp_put(old_mp);
-    netdev_change_seq_changed(netdev);
-out:
-    ovs_mutex_unlock(&dev->mutex);
-    ovs_mutex_unlock(&dpdk_mutex);
-    return err;
-}
-
 static int
 netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier);
 
@@ -1682,7 +1812,7 @@  netdev_dpdk_get_status(const struct netdev *netdev_, struct smap *args)
     smap_add_format(args, "numa_id", "%d", rte_eth_dev_socket_id(dev->port_id));
     smap_add_format(args, "driver_name", "%s", dev_info.driver_name);
     smap_add_format(args, "min_rx_bufsize", "%u", dev_info.min_rx_bufsize);
-    smap_add_format(args, "max_rx_pktlen", "%u", dev_info.max_rx_pktlen);
+    smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len);
     smap_add_format(args, "max_rx_queues", "%u", dev_info.max_rx_queues);
     smap_add_format(args, "max_tx_queues", "%u", dev_info.max_tx_queues);
     smap_add_format(args, "max_mac_addrs", "%u", dev_info.max_mac_addrs);
@@ -1904,6 +2034,51 @@  dpdk_vhost_user_class_init(void)
     return 0;
 }
 
+/* Set the mtu of DPDK_DEV_VHOST ports */
+static int
+netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu)
+{
+    struct netdev_dpdk *dev = netdev_dpdk_cast(netdev);
+    int err = 0;
+    struct dpdk_mp *old_mp;
+    struct dpdk_mp *mp;
+
+    ovs_mutex_lock(&dpdk_mutex);
+    ovs_mutex_lock(&dev->mutex);
+    if (dev->mtu == mtu) {
+        err = 0;
+        goto out;
+    }
+
+    mp = dpdk_mp_get(dev->socket_id, mtu);
+    if (!mp) {
+        err = ENOMEM;
+        goto out;
+    }
+
+    old_mp = dev->dpdk_mp;
+    dev->dpdk_mp = mp;
+    dev->mtu = mtu;
+    dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu);
+
+    dpdk_mp_put(old_mp);
+    netdev_change_seq_changed(netdev);
+out:
+    ovs_mutex_unlock(&dev->mutex);
+    ovs_mutex_unlock(&dpdk_mutex);
+    return err;
+}
+
+static int
+netdev_dpdk_vhost_set_config(struct netdev *netdev_, const struct smap *args)
+{
+    int mtu;
+
+    dpdk_dev_parse_mtu(args, &mtu);
+
+    return netdev_dpdk_vhost_set_mtu(netdev_, mtu);
+}
+
 static void
 dpdk_common_init(void)
 {
@@ -2040,8 +2215,9 @@  unlock_dpdk:
     return err;
 }
 
-#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, MULTIQ, SEND, \
-    GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV)          \
+#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, SET_CONFIG, \
+        MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES,   \
+        GET_STATUS, RXQ_RECV)                                          \
 {                                                             \
     NAME,                                                     \
     INIT,                       /* init */                    \
@@ -2053,7 +2229,7 @@  unlock_dpdk:
     DESTRUCT,                                                 \
     netdev_dpdk_dealloc,                                      \
     netdev_dpdk_get_config,                                   \
-    NULL,                       /* netdev_dpdk_set_config */  \
+    SET_CONFIG,                                               \
     NULL,                       /* get_tunnel_config */       \
     NULL,                       /* build header */            \
     NULL,                       /* push header */             \
@@ -2067,7 +2243,7 @@  unlock_dpdk:
     netdev_dpdk_set_etheraddr,                                \
     netdev_dpdk_get_etheraddr,                                \
     netdev_dpdk_get_mtu,                                      \
-    netdev_dpdk_set_mtu,                                      \
+    SET_MTU,                                                  \
     netdev_dpdk_get_ifindex,                                  \
     GET_CARRIER,                                              \
     netdev_dpdk_get_carrier_resets,                           \
@@ -2213,8 +2389,10 @@  static const struct netdev_class dpdk_class =
         NULL,
         netdev_dpdk_construct,
         netdev_dpdk_destruct,
+        netdev_dpdk_set_config,
         netdev_dpdk_set_multiq,
         netdev_dpdk_eth_send,
+        netdev_dpdk_set_mtu,
         netdev_dpdk_get_carrier,
         netdev_dpdk_get_stats,
         netdev_dpdk_get_features,
@@ -2227,8 +2405,10 @@  static const struct netdev_class dpdk_ring_class =
         NULL,
         netdev_dpdk_ring_construct,
         netdev_dpdk_destruct,
+        netdev_dpdk_set_config,
         netdev_dpdk_set_multiq,
         netdev_dpdk_ring_send,
+        netdev_dpdk_set_mtu,
         netdev_dpdk_get_carrier,
         netdev_dpdk_get_stats,
         netdev_dpdk_get_features,
@@ -2241,8 +2421,10 @@  static const struct netdev_class OVS_UNUSED dpdk_vhost_cuse_class =
         dpdk_vhost_cuse_class_init,
         netdev_dpdk_vhost_cuse_construct,
         netdev_dpdk_vhost_destruct,
+        NULL,
         netdev_dpdk_vhost_set_multiq,
         netdev_dpdk_vhost_send,
+        NULL,
         netdev_dpdk_vhost_get_carrier,
         netdev_dpdk_vhost_get_stats,
         NULL,
@@ -2255,8 +2437,10 @@  static const struct netdev_class OVS_UNUSED dpdk_vhost_user_class =
         dpdk_vhost_user_class_init,
         netdev_dpdk_vhost_user_construct,
         netdev_dpdk_vhost_destruct,
+        netdev_dpdk_vhost_set_config,
         netdev_dpdk_vhost_set_multiq,
         netdev_dpdk_vhost_send,
+        netdev_dpdk_vhost_set_mtu,
         netdev_dpdk_vhost_get_carrier,
         netdev_dpdk_vhost_get_stats,
         NULL,