Message ID | eeeb45b07990f708c4802d64be51a8b83cdc0774.1500558504.git.joseph.salisbury@canonical.com |
---|---|
State | New |
On 07/30/17 16:25, Joseph Salisbury wrote:
> From: Sivakumar Krishnasamy <ksiva@linux.vnet.ibm.com>
>
> BugLink: http://bugs.launchpad.net/bugs/1692538
>
> Current largesend and checksum offload feature in ibmveth driver,
>  - Source VM sends the TCP packets with ip_summed field set as
>    CHECKSUM_PARTIAL and TCP pseudo header checksum is placed in
>    checksum field
>  - CHECKSUM_PARTIAL flag in SKB will enable ibmveth driver to mark
>    "no checksum" and "checksum good" bits in transmit buffer descriptor
>    before the packet is delivered to pseries PowerVM Hypervisor
>  - If ibmveth has largesend capability enabled, transmit buffer descriptors
>    are marked accordingly before packet is delivered to Hypervisor
>    (along with mss value for packets with length > MSS)
>  - Destination VM's ibmveth driver receives the packet with "checksum good"
>    bit set and so, SKB's ip_summed field is set with CHECKSUM_UNNECESSARY
>  - If "largesend" bit was on, mss value is copied from receive descriptor
>    into SKB's gso_size and other flags are appropriately set for
>    packets > MSS size
>  - The packet is now successfully delivered up the stack in destination VM
>
> The offloads described above work fine for TCP communication among VMs in
> the same pseries server ( VM A <=> PowerVM Hypervisor <=> VM B )
>
> We are now enabling support for OVS in pseries PowerVM environment. One of
> our requirements is to have ibmveth driver configured in "Trunk" mode, when
> they are used with OVS. This is because PowerVM Hypervisor will no longer
> bridge the packets between VMs; instead the packets are delivered to
> IO Server which hosts OVS to bridge them between VMs or to external
> networks (flow shown below),
>   VM A <=> PowerVM Hypervisor <=> IO Server(OVS) <=> PowerVM Hypervisor
>                                                                 <=> VM B
> In "IO server" the packet is received by inbound Trunk ibmveth and then
> delivered to OVS, which is then bridged to outbound Trunk ibmveth (shown
> below),
>   Inbound Trunk ibmveth <=> OVS <=> Outbound Trunk ibmveth
>
> In this model, we hit the following issues which impacted the VM
> communication performance,
>
>  - Issue 1: ibmveth doesn't support largesend and checksum offload features
>    when configured as "Trunk". Driver has explicit checks to prevent
>    enabling these offloads.
>
>  - Issue 2: SYN packet drops seen at destination VM. When the packet
>    originates, it has CHECKSUM_PARTIAL flag set and as it gets delivered to
>    IO server's inbound Trunk ibmveth, on validating "checksum good" bits
>    in ibmveth receive routine, SKB's ip_summed field is set with
>    CHECKSUM_UNNECESSARY flag. This packet is then bridged by OVS (or Linux
>    Bridge) and delivered to outbound Trunk ibmveth. At this point the
>    outbound ibmveth transmit routine will not set "no checksum" and
>    "checksum good" bits in transmit buffer descriptor, as it does so only
>    when the ip_summed field is CHECKSUM_PARTIAL. When this packet gets
>    delivered to destination VM, TCP layer receives the packet with checksum
>    value of 0 and with no checksum related flags in ip_summed field. This
>    leads to packet drops. So, TCP connections never go through.
>
>  - Issue 3: First packet of a TCP connection will be dropped, if there is
>    no OVS flow cached in datapath. OVS while trying to identify the flow,
>    computes the checksum. The computed checksum will be invalid at the
>    receiving end, as ibmveth transmit routine zeroes out the pseudo
>    checksum value in the packet. This leads to packet drop.
>
>  - Issue 4: ibmveth driver doesn't have support for SKB's with frag_list.
>    When Physical NIC has GRO enabled and when OVS bridges these packets,
>    OVS vport send code will end up calling dev_queue_xmit, which in turn
>    calls validate_xmit_skb.
>    In validate_xmit_skb routine, the larger packets will get segmented into
>    MSS sized segments, if SKB has a frag_list and if the driver to which
>    they are delivered doesn't support NETIF_F_FRAGLIST feature.
>
> This patch addresses the above four issues, thereby enabling end to end
> largesend and checksum offload support for better performance.
>
>  - Fix for Issue 1 : Remove checks which prevent enabling TCP largesend and
>    checksum offloads.
>  - Fix for Issue 2 : When ibmveth receives a packet with "checksum good"
>    bit set and if it is configured in Trunk mode, set appropriate SKB fields
>    using skb_partial_csum_set (ip_summed field is set with
>    CHECKSUM_PARTIAL)
>  - Fix for Issue 3: Recompute the pseudo header checksum before sending the
>    SKB up the stack.
>  - Fix for Issue 4: Linearize the SKBs with frag_list. Though we end up
>    allocating buffers and copying data, this fix gives
>    up to 4X throughput increase.
>
> Note: All these fixes need to be applied together as fixing just one of
> them will lead to other issues immediately (especially for Issues 1,2 & 3).
>
> Signed-off-by: Sivakumar Krishnasamy <ksiva@linux.vnet.ibm.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> (cherry picked from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
> Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>

Clean cherry-pick, good test results and limited to a single driver.
Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>

> ---
>  drivers/net/ethernet/ibm/ibmveth.c | 107 ++++++++++++++++++++++++++++++-------
>  drivers/net/ethernet/ibm/ibmveth.h |   1 +
>  2 files changed, 90 insertions(+), 18 deletions(-)
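For readers following Issue 3 and its fix: the value the receive helper writes back into the TCP checksum field is the one's-complement sum of the pseudo header, which a later stage then completes over the whole segment. The sketch below is a userspace illustration only, not driver code. It assumes IPv4, passes addresses as host-order integers for readability, and uses hypothetical function names (`csum_fold16`, `csum_add_bytes`, `tcp4_pseudo_seed`, `tcp_complete_csum`, `tcp4_csum_ok`) that do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)sum;
}

/* Add a byte buffer to the accumulator, read as big-endian 16-bit words. */
static uint32_t csum_add_bytes(uint32_t sum, const uint8_t *p, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	return sum;
}

/* Uncomplemented sum of the IPv4 pseudo header: the seed that sits in the
 * TCP checksum field while the checksum is still "partial".
 */
static uint16_t tcp4_pseudo_seed(uint32_t saddr, uint32_t daddr,
				 uint16_t tcp_len)
{
	uint32_t sum = (saddr >> 16) + (saddr & 0xffffu) +
		       (daddr >> 16) + (daddr & 0xffffu) +
		       6u /* TCP protocol number */ + tcp_len;

	return csum_fold16(sum);
}

/* Complete a partial checksum: sum the segment (whose check field holds the
 * seed) and store the complement at offset 16 of the TCP header.
 */
static void tcp_complete_csum(uint8_t *seg, size_t len)
{
	uint16_t csum;

	assert(len >= 20);
	csum = (uint16_t)~csum_fold16(csum_add_bytes(0, seg, len));
	seg[16] = (uint8_t)(csum >> 8);
	seg[17] = (uint8_t)(csum & 0xffu);
}

/* Receiver-side check: pseudo header plus segment must sum to 0xffff. */
static int tcp4_csum_ok(uint32_t saddr, uint32_t daddr,
			const uint8_t *seg, size_t len)
{
	uint32_t sum = tcp4_pseudo_seed(saddr, daddr, (uint16_t)len);

	return csum_fold16(csum_add_bytes(sum, seg, len)) == 0xffffu;
}
```

If the check field holds the seed when `tcp_complete_csum` runs, `tcp4_csum_ok` accepts the segment. If the field was zeroed first, as the transmit path does, completion yields a checksum the receiver rejects, which matches the drop described under Issue 3.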
On 30.07.2017 16:25, Joseph Salisbury wrote:
> From: Sivakumar Krishnasamy <ksiva@linux.vnet.ibm.com>
>
> BugLink: http://bugs.launchpad.net/bugs/1692538
>
> [...]
>
> Signed-off-by: Sivakumar Krishnasamy <ksiva@linux.vnet.ibm.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> (cherry picked from commit 66aa0678efc29abd2ab02a09b23f9a8bc9f12a6c)
> Signed-off-by: Joseph Salisbury <joseph.salisbury@canonical.com>

Acked-by: Stefan Bader <stefan.bader@canonical.com>

> ---

Limited to specific hw and good test results.
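The ibmveth_start_xmit hunks of the patch move the large-send decision inside the CHECKSUM_PARTIAL branch, so a bridged packet that is not CHECKSUM_PARTIAL no longer gets the large-send bit. A small userspace model of that decision is sketched below. The bit values are placeholders chosen for illustration (the real ones are defined in ibmveth.h), and `struct tx_pkt`, `enum csum_state` and `tx_desc_flags` are hypothetical names. Setting the "no checksum" and "checksum good" bits for CHECKSUM_PARTIAL follows the commit message, not a line visible in the diff.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration; not the values from ibmveth.h. */
#define DESC_VALID	(1u << 0)
#define DESC_LRG_SND	(1u << 1)
#define DESC_NO_CSUM	(1u << 2)
#define DESC_CSUM_GOOD	(1u << 3)

/* Simplified stand-in for skb->ip_summed. */
enum csum_state { CSUM_NONE, CSUM_UNNECESSARY, CSUM_PARTIAL };

/* Simplified stand-in for the skb fields the decision looks at. */
struct tx_pkt {
	enum csum_state ip_summed;
	bool is_gso;
};

/* Descriptor flags as the patched transmit path would choose them. */
static uint32_t tx_desc_flags(const struct tx_pkt *pkt, bool fw_large_send)
{
	uint32_t flags = DESC_VALID;

	if (pkt->ip_summed == CSUM_PARTIAL) {
		flags |= DESC_NO_CSUM | DESC_CSUM_GOOD;

		/* After the patch, large send is requested only here. */
		if (pkt->is_gso && fw_large_send)
			flags |= DESC_LRG_SND;
	}

	return flags;
}
```

Before the patch, the large-send test sat outside the branch, so a GSO packet could be flagged for large send without the checksum bits.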
diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index 72ab7b6..9a74c4e 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -46,6 +46,8 @@
 #include <asm/vio.h>
 #include <asm/iommu.h>
 #include <asm/firmware.h>
+#include <net/tcp.h>
+#include <net/ip6_checksum.h>
 
 #include "ibmveth.h"
 
@@ -808,8 +810,7 @@ static int ibmveth_set_csum_offload(struct net_device *dev, u32 data)
 
 	ret = h_illan_attributes(adapter->vdev->unit_address, 0, 0, &ret_attr);
 
-	if (ret == H_SUCCESS && !(ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK) &&
-	    !(ret_attr & IBMVETH_ILLAN_TRUNK_PRI_MASK) &&
+	if (ret == H_SUCCESS &&
 	    (ret_attr & IBMVETH_ILLAN_PADDED_PKT_CSUM)) {
 		ret4 = h_illan_attributes(adapter->vdev->unit_address, clr_attr,
 					 set_attr, &ret_attr);
@@ -1040,6 +1041,15 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 	dma_addr_t dma_addr;
 	unsigned long mss = 0;
 
+	/* veth doesn't handle frag_list, so linearize the skb.
+	 * When GRO is enabled SKB's can have frag_list.
+	 */
+	if (adapter->is_active_trunk &&
+	    skb_has_frag_list(skb) && __skb_linearize(skb)) {
+		netdev->stats.tx_dropped++;
+		goto out;
+	}
+
 	/*
 	 * veth handles a maximum of 6 segments including the header, so
 	 * we have to linearize the skb if there are more than this.
@@ -1064,9 +1074,6 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 
 	desc_flags = IBMVETH_BUF_VALID;
 
-	if (skb_is_gso(skb) && adapter->fw_large_send_support)
-		desc_flags |= IBMVETH_BUF_LRG_SND;
-
 	if (skb->ip_summed == CHECKSUM_PARTIAL) {
 		unsigned char *buf = skb_transport_header(skb) +
 						skb->csum_offset;
@@ -1076,6 +1083,9 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 		/* Need to zero out the checksum */
 		buf[0] = 0;
 		buf[1] = 0;
+
+		if (skb_is_gso(skb) && adapter->fw_large_send_support)
+			desc_flags |= IBMVETH_BUF_LRG_SND;
 	}
 
 retry_bounce:
@@ -1128,7 +1138,7 @@ static netdev_tx_t ibmveth_start_xmit(struct sk_buff *skb,
 		descs[i+1].fields.address = dma_addr;
 	}
 
-	if (skb_is_gso(skb)) {
+	if (skb->ip_summed == CHECKSUM_PARTIAL && skb_is_gso(skb)) {
 		if (adapter->fw_large_send_support) {
 			mss = (unsigned long)skb_shinfo(skb)->gso_size;
 			adapter->tx_large_packets++;
@@ -1232,6 +1242,71 @@ static void ibmveth_rx_mss_helper(struct sk_buff *skb, u16 mss, int lrg_pkt)
 	}
 }
 
+static void ibmveth_rx_csum_helper(struct sk_buff *skb,
+				   struct ibmveth_adapter *adapter)
+{
+	struct iphdr *iph = NULL;
+	struct ipv6hdr *iph6 = NULL;
+	__be16 skb_proto = 0;
+	u16 iphlen = 0;
+	u16 iph_proto = 0;
+	u16 tcphdrlen = 0;
+
+	skb_proto = be16_to_cpu(skb->protocol);
+
+	if (skb_proto == ETH_P_IP) {
+		iph = (struct iphdr *)skb->data;
+
+		/* If the IP checksum is not offloaded and if the packet
+		 * is large send, the checksum must be rebuilt.
+		 */
+		if (iph->check == 0xffff) {
+			iph->check = 0;
+			iph->check = ip_fast_csum((unsigned char *)iph,
+						  iph->ihl);
+		}
+
+		iphlen = iph->ihl * 4;
+		iph_proto = iph->protocol;
+	} else if (skb_proto == ETH_P_IPV6) {
+		iph6 = (struct ipv6hdr *)skb->data;
+		iphlen = sizeof(struct ipv6hdr);
+		iph_proto = iph6->nexthdr;
+	}
+
+	/* In OVS environment, when a flow is not cached, specifically for a
+	 * new TCP connection, the first packet information is passed up
+	 * the user space for finding a flow. During this process, OVS computes
+	 * checksum on the first packet when CHECKSUM_PARTIAL flag is set.
+	 *
+	 * Given that we zeroed out TCP checksum field in transmit path
+	 * (refer ibmveth_start_xmit routine) as we set "no checksum bit",
+	 * OVS computed checksum will be incorrect w/o TCP pseudo checksum
+	 * in the packet. This leads to OVS dropping the packet and hence
+	 * TCP retransmissions are seen.
+	 *
+	 * So, re-compute TCP pseudo header checksum.
+	 */
+	if (iph_proto == IPPROTO_TCP && adapter->is_active_trunk) {
+		struct tcphdr *tcph = (struct tcphdr *)(skb->data + iphlen);
+
+		tcphdrlen = skb->len - iphlen;
+
+		/* Recompute TCP pseudo header checksum */
+		if (skb_proto == ETH_P_IP)
+			tcph->check = ~csum_tcpudp_magic(iph->saddr,
+					iph->daddr, tcphdrlen, iph_proto, 0);
+		else if (skb_proto == ETH_P_IPV6)
+			tcph->check = ~csum_ipv6_magic(&iph6->saddr,
+					&iph6->daddr, tcphdrlen, iph_proto, 0);
+
+		/* Setup SKB fields for checksum offload */
+		skb_partial_csum_set(skb, iphlen,
+				     offsetof(struct tcphdr, check));
+		skb_reset_network_header(skb);
+	}
+}
+
 static int ibmveth_poll(struct napi_struct *napi, int budget)
 {
 	struct ibmveth_adapter *adapter =
@@ -1239,7 +1314,6 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 	struct net_device *netdev = adapter->netdev;
 	int frames_processed = 0;
 	unsigned long lpar_rc;
-	struct iphdr *iph;
 	u16 mss = 0;
 
 restart_poll:
@@ -1297,17 +1371,7 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
 
 			if (csum_good) {
 				skb->ip_summed = CHECKSUM_UNNECESSARY;
-				if (be16_to_cpu(skb->protocol) == ETH_P_IP) {
-					iph = (struct iphdr *)skb->data;
-
-					/* If the IP checksum is not offloaded and if the packet
-					 * is large send, the checksum must be rebuilt.
-					 */
-					if (iph->check == 0xffff) {
-						iph->check = 0;
-						iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
-					}
-				}
+				ibmveth_rx_csum_helper(skb, adapter);
 			}
 
 			if (length > netdev->mtu + ETH_HLEN) {
@@ -1626,6 +1690,13 @@ static int ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 		netdev->hw_features |= NETIF_F_TSO;
 	}
 
+	adapter->is_active_trunk = false;
+	if (ret == H_SUCCESS && (ret_attr & IBMVETH_ILLAN_ACTIVE_TRUNK)) {
+		adapter->is_active_trunk = true;
+		netdev->hw_features |= NETIF_F_FRAGLIST;
+		netdev->features |= NETIF_F_FRAGLIST;
+	}
+
 	netdev->min_mtu = IBMVETH_MIN_MTU;
 	netdev->max_mtu = ETH_MAX_MTU;
 
diff --git a/drivers/net/ethernet/ibm/ibmveth.h b/drivers/net/ethernet/ibm/ibmveth.h
index 7acda04..de6e381 100644
--- a/drivers/net/ethernet/ibm/ibmveth.h
+++ b/drivers/net/ethernet/ibm/ibmveth.h
@@ -157,6 +157,7 @@ struct ibmveth_adapter {
     int pool_config;
     int rx_csum;
     int large_send;
+    bool is_active_trunk;
     void *bounce_buffer;
     dma_addr_t bounce_buffer_dma;
 
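The receive helper in the patch keeps the existing rule that an IPv4 header checksum arriving as 0xffff is zeroed and rebuilt with ip_fast_csum. As a rough userspace illustration of that rebuild, the sketch below computes the standard one's-complement header checksum over a raw byte buffer. It is not the kernel routine, and `ipv4_hdr_csum` and `ipv4_rebuild_if_marked` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* One's-complement checksum over an IPv4 header of ihl 32-bit words.
 * The caller is expected to have zeroed bytes 10 and 11 (the check field).
 */
static uint16_t ipv4_hdr_csum(const uint8_t *hdr, unsigned int ihl)
{
	uint32_t sum = 0;
	unsigned int i;

	assert(ihl >= 5 && ihl <= 15);
	for (i = 0; i < ihl * 4; i += 2)
		sum += ((uint32_t)hdr[i] << 8) | hdr[i + 1];
	while (sum >> 16)
		sum = (sum & 0xffffu) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Mirror of the 0xffff branch: rebuild only when the field carries the
 * marker value. Returns 1 if the checksum was rebuilt, 0 otherwise.
 */
static int ipv4_rebuild_if_marked(uint8_t *hdr)
{
	uint16_t csum;

	if (hdr[10] != 0xff || hdr[11] != 0xff)
		return 0;

	hdr[10] = 0;
	hdr[11] = 0;
	csum = ipv4_hdr_csum(hdr, hdr[0] & 0x0fu);
	hdr[10] = (uint8_t)(csum >> 8);
	hdr[11] = (uint8_t)(csum & 0xffu);
	return 1;
}
```

A header that already carries a real checksum is left untouched, which is the same guard the driver applies before calling ip_fast_csum.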