diff mbox

net-gre-gro: Fix a bug that breaks the forwarding path

Message ID 1393536377-32243-1-git-send-email-hkchu@google.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Jerry Chu Feb. 27, 2014, 9:26 p.m. UTC
From: Jerry Chu <hkchu@google.com>

I stumbled across a bug that was introduced by my own GRE-GRO
patch (bf5a755f5e9186406bbf50f4087100af5bd68e40
net-gre-gro: Add GRE support to the GRO stack) submitted a while
back that breaks the forwarding path because various GSO related
fields were not set. This will cause on the egress path either
the GSO code to fail, or a GRE-TSO capable (NETIF_F_GSO_GRE)
NICs to choke. The following fix has been tested for both cases.

Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
---
 net/core/dev.c           | 2 ++
 net/ipv4/af_inet.c       | 3 +++
 net/ipv4/gre_offload.c   | 3 +++
 net/ipv4/tcp_offload.c   | 2 +-
 net/ipv6/tcpv6_offload.c | 2 +-
 5 files changed, 10 insertions(+), 2 deletions(-)

Comments

Or Gerlitz Feb. 27, 2014, 10:25 p.m. UTC | #1
On Thu, Feb 27, 2014 at 11:26 PM, H.K. Jerry Chu <hkchu@google.com> wrote:
> From: Jerry Chu <hkchu@google.com>
>
> I stumbled across a bug that was introduced by my own GRE-GRO
> patch (bf5a755f5e9186406bbf50f4087100af5bd68e40
> net-gre-gro: Add GRE support to the GRO stack) submitted a while
> back that breaks the forwarding path because various GSO related
> fields were not set. This will cause on the egress path either
> the GSO code to fail, or a GRE-TSO capable (NETIF_F_GSO_GRE)
> NICs to choke. The following fix has been tested for both cases.
>
> Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
> ---
>  net/core/dev.c           | 2 ++
>  net/ipv4/af_inet.c       | 3 +++
>  net/ipv4/gre_offload.c   | 3 +++
>  net/ipv4/tcp_offload.c   | 2 +-
>  net/ipv6/tcpv6_offload.c | 2 +-
>  5 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b1b0c8d..75517d8 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4045,6 +4045,8 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
>         skb->vlan_tci = 0;
>         skb->dev = napi->dev;
>         skb->skb_iif = 0;
> +       skb->encapsulation = 0;
> +       skb_shinfo(skb)->gso_type = 0;
>
>         napi->skb = skb;
>  }
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index ecd2c3f..68229ca 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1431,6 +1431,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>         int proto = iph->protocol;
>         int err = -ENOSYS;
>
> +       if (skb->encapsulation)
> +               skb_set_inner_network_header(skb, nhoff);
> +

Note that packets (e.g those who are UDP encapsulated) will pass here
twice, once for the external IP header and once for the internal, so
if the driver marks the encapsulation sign the inner header setting
will happen twice, is that what we want?


>         csum_replace2(&iph->check, iph->tot_len, newlen);
>         iph->tot_len = newlen;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jerry Chu Feb. 27, 2014, 11:39 p.m. UTC | #2
On Thu, Feb 27, 2014 at 2:25 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>
> On Thu, Feb 27, 2014 at 11:26 PM, H.K. Jerry Chu <hkchu@google.com> wrote:
> > From: Jerry Chu <hkchu@google.com>
> >
> > I stumbled across a bug that was introduced by my own GRE-GRO
> > patch (bf5a755f5e9186406bbf50f4087100af5bd68e40
> > net-gre-gro: Add GRE support to the GRO stack) submitted a while
> > back that breaks the forwarding path because various GSO related
> > fields were not set. This will cause on the egress path either
> > the GSO code to fail, or a GRE-TSO capable (NETIF_F_GSO_GRE)
> > NICs to choke. The following fix has been tested for both cases.
> >
> > Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
> > ---
> >  net/core/dev.c           | 2 ++
> >  net/ipv4/af_inet.c       | 3 +++
> >  net/ipv4/gre_offload.c   | 3 +++
> >  net/ipv4/tcp_offload.c   | 2 +-
> >  net/ipv6/tcpv6_offload.c | 2 +-
> >  5 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b1b0c8d..75517d8 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4045,6 +4045,8 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
> >         skb->vlan_tci = 0;
> >         skb->dev = napi->dev;
> >         skb->skb_iif = 0;
> > +       skb->encapsulation = 0;
> > +       skb_shinfo(skb)->gso_type = 0;
> >
> >         napi->skb = skb;
> >  }
> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> > index ecd2c3f..68229ca 100644
> > --- a/net/ipv4/af_inet.c
> > +++ b/net/ipv4/af_inet.c
> > @@ -1431,6 +1431,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
> >         int proto = iph->protocol;
> >         int err = -ENOSYS;
> >
> > +       if (skb->encapsulation)
> > +               skb_set_inner_network_header(skb, nhoff);
> > +
>
> Note that packets (e.g those who are UDP encapsulated) will pass here
> twice, once for the external IP header and once for the internal, so
> if the driver marks the encapsulation sign the inner header setting
> will happen twice, is that what we want?

The inner most hdr will be the last to set. But I don't know
all the rules regarding skb->encapsulation. E.g., in this patch
napi_reuse_skb() will reset skb->encapsulation. If that's problematic
then I could call skb_set_inner_network_header() inside gre_gro_receive()
(but that will require checking against "type" in order to compute the right
offset).

Jerry

>
>
> >         csum_replace2(&iph->check, iph->tot_len, newlen);
> >         iph->tot_len = newlen;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 28, 2014, 9:56 p.m. UTC | #3
From: Jerry Chu <hkchu@google.com>
Date: Thu, 27 Feb 2014 15:39:47 -0800

> On Thu, Feb 27, 2014 at 2:25 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>>
>> On Thu, Feb 27, 2014 at 11:26 PM, H.K. Jerry Chu <hkchu@google.com> wrote:
>> > From: Jerry Chu <hkchu@google.com>
>> >
>> > I stumbled across a bug that was introduced by my own GRE-GRO
>> > patch (bf5a755f5e9186406bbf50f4087100af5bd68e40
>> > net-gre-gro: Add GRE support to the GRO stack) submitted a while
>> > back that breaks the forwarding path because various GSO related
>> > fields were not set. This will cause on the egress path either
>> > the GSO code to fail, or a GRE-TSO capable (NETIF_F_GSO_GRE)
>> > NICs to choke. The following fix has been tested for both cases.
>> >
>> > Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
>> > ---
>> >  net/core/dev.c           | 2 ++
>> >  net/ipv4/af_inet.c       | 3 +++
>> >  net/ipv4/gre_offload.c   | 3 +++
>> >  net/ipv4/tcp_offload.c   | 2 +-
>> >  net/ipv6/tcpv6_offload.c | 2 +-
>> >  5 files changed, 10 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index b1b0c8d..75517d8 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -4045,6 +4045,8 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
>> >         skb->vlan_tci = 0;
>> >         skb->dev = napi->dev;
>> >         skb->skb_iif = 0;
>> > +       skb->encapsulation = 0;
>> > +       skb_shinfo(skb)->gso_type = 0;
>> >
>> >         napi->skb = skb;
>> >  }
>> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>> > index ecd2c3f..68229ca 100644
>> > --- a/net/ipv4/af_inet.c
>> > +++ b/net/ipv4/af_inet.c
>> > @@ -1431,6 +1431,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>> >         int proto = iph->protocol;
>> >         int err = -ENOSYS;
>> >
>> > +       if (skb->encapsulation)
>> > +               skb_set_inner_network_header(skb, nhoff);
>> > +
>>
>> Note that packets (e.g those who are UDP encapsulated) will pass here
>> twice, once for the external IP header and once for the internal, so
>> if the driver marks the encapsulation sign the inner header setting
>> will happen twice, is that what we want?
> 
> The inner most hdr will be the last to set. But I don't know
> all the rules regarding skb->encapsulation. E.g., in this patch
> napi_reuse_skb() will reset skb->encapsulation. If that's problematic
> then I could call skb_set_inner_network_header() inside gre_gro_receive()
> (but that will require checking against "type" in order to compute the right
> offset).

The topic of the skb->encapsulation semantics has come up several times in the
past few weeks.

We cannot move forward on any changes in this area until the semantics
are well defined, and documented.

Can someone work on a patch which documents skb->encapsulation properly,
and then we can come back to fixing this bug?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz March 3, 2014, 9:30 a.m. UTC | #4
On Fri, Feb 28, 2014 at 11:56 PM, David Miller <davem@davemloft.net> wrote:
> From: Jerry Chu <hkchu@google.com>
> Date: Thu, 27 Feb 2014 15:39:47 -0800
>
>> On Thu, Feb 27, 2014 at 2:25 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>>>
>>> On Thu, Feb 27, 2014 at 11:26 PM, H.K. Jerry Chu <hkchu@google.com> wrote:
>>> > From: Jerry Chu <hkchu@google.com>
>>> >
>>> > I stumbled across a bug that was introduced by my own GRE-GRO
>>> > patch (bf5a755f5e9186406bbf50f4087100af5bd68e40
>>> > net-gre-gro: Add GRE support to the GRO stack) submitted a while
>>> > back that breaks the forwarding path because various GSO related
>>> > fields were not set. This will cause on the egress path either
>>> > the GSO code to fail, or a GRE-TSO capable (NETIF_F_GSO_GRE)
>>> > NICs to choke. The following fix has been tested for both cases.
>>> >
>>> > Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
>>> > ---
>>> >  net/core/dev.c           | 2 ++
>>> >  net/ipv4/af_inet.c       | 3 +++
>>> >  net/ipv4/gre_offload.c   | 3 +++
>>> >  net/ipv4/tcp_offload.c   | 2 +-
>>> >  net/ipv6/tcpv6_offload.c | 2 +-
>>> >  5 files changed, 10 insertions(+), 2 deletions(-)
>>> >
>>> > diff --git a/net/core/dev.c b/net/core/dev.c
>>> > index b1b0c8d..75517d8 100644
>>> > --- a/net/core/dev.c
>>> > +++ b/net/core/dev.c
>>> > @@ -4045,6 +4045,8 @@ static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
>>> >         skb->vlan_tci = 0;
>>> >         skb->dev = napi->dev;
>>> >         skb->skb_iif = 0;
>>> > +       skb->encapsulation = 0;
>>> > +       skb_shinfo(skb)->gso_type = 0;
>>> >
>>> >         napi->skb = skb;
>>> >  }
>>> > diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
>>> > index ecd2c3f..68229ca 100644
>>> > --- a/net/ipv4/af_inet.c
>>> > +++ b/net/ipv4/af_inet.c
>>> > @@ -1431,6 +1431,9 @@ static int inet_gro_complete(struct sk_buff *skb, int nhoff)
>>> >         int proto = iph->protocol;
>>> >         int err = -ENOSYS;
>>> >
>>> > +       if (skb->encapsulation)
>>> > +               skb_set_inner_network_header(skb, nhoff);
>>> > +
>>>
>>> Note that packets (e.g those who are UDP encapsulated) will pass here
>>> twice, once for the external IP header and once for the internal, so
>>> if the driver marks the encapsulation sign the inner header setting
>>> will happen twice, is that what we want?
>>
>> The inner most hdr will be the last to set. But I don't know
>> all the rules regarding skb->encapsulation. E.g., in this patch
>> napi_reuse_skb() will reset skb->encapsulation. If that's problematic
>> then I could call skb_set_inner_network_header() inside gre_gro_receive()
>> (but that will require checking against "type" in order to compute the right
>> offset).
>
> The topic of the skb->encapsulation semantics has come up several times in the
> past few weeks.
>
> We cannot move forward on any changes in this area until the semantics
> are well defined, and documented.
>
> Can someone work on a patch which documents skb->encapsulation properly,
> and then we can come back to fixing this bug?


Yep, seems we need to be a bit more formal/precise here, we started to discuss
that couple of weeks ago on this thread
http://marc.info/?l=linux-netdev&m=139233048521647&w=2
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz March 4, 2014, 4:13 p.m. UTC | #5
On 28/02/2014 23:56, David Miller wrote:
> The topic of the skb->encapsulation semantics has come up several 
> times in the past few weeks. We cannot move forward on any changes in 
> this area until the semantics are well defined, and documented. Can 
> someone work on a patch which documents skb->encapsulation properly, 
> and then we can come back to fixing this bug? Thanks. 

Lets try... the skb->encapsulation flag was introduced and used in 3.8 
by the
sequence of these three commits

0afb166 vxlan: Add capability of Rx checksum offload for inner packet
d6727fe vxlan: capture inner headers during encapsulation
6a674e9 net: Add support for hardware-offloaded encapsulation

When discussed earlier on the list in the context of the skb->ip_summed 
field,
Tom Herbert came with the following interpretation for the semantics 
which Joseph confirmed

"when skb->encapsulation is set the ip_summed is valid for both the inner and outer header
(e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If skb->encapsulation is not set then ip_summed is only valid for outer header"

As for the TX side of things, the change-log of commit 6a674e9 states

"For Tx encapsulation offload, the driver will need to set the right bits in netdev->hw_enc_features. The protocol driver will have to set the skb->encapsulation bit and populate the inner headers, so the NIC driver will use those inner headers to calculate the csum in hardware."

So in higher level, it seems that the role of the skb->encapsulation field is to mark the skb to carry encapsulated packet for the code path between the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip) to the time driver xmit is called. Or from the time driver rx code runs till the the time the packet is decapsulated.

Further, my personal interpretation was that on the rx path, skb should carry the encapsulation flag **only** if the HW was able to offload the inner checksum.

Joseph, what's your thinking here?

Or.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jerry Chu March 4, 2014, 10:13 p.m. UTC | #6
Hi Or,

Thanks for writing this up.

On Tue, Mar 4, 2014 at 8:13 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> On 28/02/2014 23:56, David Miller wrote:
>>
>> The topic of the skb->encapsulation semantics has come up several times in
>> the past few weeks. We cannot move forward on any changes in this area until
>> the semantics are well defined, and documented. Can someone work on a patch
>> which documents skb->encapsulation properly, and then we can come back to
>> fixing this bug? Thanks.
>
>
> Lets try... the skb->encapsulation flag was introduced and used in 3.8 by
> the
> sequence of these three commits
>
> 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
> d6727fe vxlan: capture inner headers during encapsulation
> 6a674e9 net: Add support for hardware-offloaded encapsulation
>
> When discussed earlier on the list in the context of the skb->ip_summed
> field,
> Tom Herbert came with the following interpretation for the semantics which
> Joseph confirmed
>
> "when skb->encapsulation is set the ip_summed is valid for both the inner
> and outer header
> (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
> skb->encapsulation is not set then ip_summed is only valid for outer header"

For GRE encapped pkts is the following interpretation correct?

1) ip_summed == CHECKSUM_COMPLETE
    => csum covers IP payload csum of the outer IP datagram

2) ip_summed == CHECKSUM_UNNECESSARY
2.1. skb->encapsulation is on => both GRE csum (if one is present) and TCP/UDP
csum have been validated (assuming inner is a TCP or UDP pkt)

2.2. skb->encapsulation is off => only GRE csum (if one is present) is
validated.

>
> As for the TX side of things, the change-log of commit 6a674e9 states
>
> "For Tx encapsulation offload, the driver will need to set the right bits in
> netdev->hw_enc_features. The protocol driver will have to set the
> skb->encapsulation bit and populate the inner headers, so the NIC driver
> will use those inner headers to calculate the csum in hardware."

So we only support/care about csum offload for the inner pkts, which
makes sense.

>
> So in higher level, it seems that the role of the skb->encapsulation field
> is to mark the skb to carry encapsulated packet for the code path between
> the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip)
> to the time driver xmit is called. Or from the time driver rx code runs till
> the the time the packet is decapsulated.
>
> Further, my personal interpretation was that on the rx path, skb should
> carry the encapsulation flag **only** if the HW was able to offload the
> inner checksum.

SGTM.

Jerry

>
> Joseph, what's your thinking here?
>
> Or.
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joseph Gasparakis March 4, 2014, 10:36 p.m. UTC | #7
On Tue, 4 Mar 2014, Or Gerlitz wrote:

> On 28/02/2014 23:56, David Miller wrote:
> > The topic of the skb->encapsulation semantics has come up several times in
> > the past few weeks. We cannot move forward on any changes in this area until
> > the semantics are well defined, and documented. Can someone work on a patch
> > which documents skb->encapsulation properly, and then we can come back to
> > fixing this bug? Thanks. 
> 
> Lets try... the skb->encapsulation flag was introduced and used in 3.8 by the
> sequence of these three commits
> 
> 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
> d6727fe vxlan: capture inner headers during encapsulation
> 6a674e9 net: Add support for hardware-offloaded encapsulation
> 
> When discussed earlier on the list in the context of the skb->ip_summed field,
> Tom Herbert came with the following interpretation for the semantics which
> Joseph confirmed
> 
> "when skb->encapsulation is set the ip_summed is valid for both the inner and
> outer header
> (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
> skb->encapsulation is not set then ip_summed is only valid for outer header"
> 

Agree. This should be valid for both Rx and Tx. 

> As for the TX side of things, the change-log of commit 6a674e9 states
> 
> "For Tx encapsulation offload, the driver will need to set the right bits in
> netdev->hw_enc_features. The protocol driver will have to set the
> skb->encapsulation bit and populate the inner headers, so the NIC driver will
> use those inner headers to calculate the csum in hardware."
> 
> So in higher level, it seems that the role of the skb->encapsulation field is
> to mark the skb to carry encapsulated packet for the code path between the
> time the packet is encapsulated by the protocol driver (e.g vxlan/ipip) to the
> time driver xmit is called. Or from the time driver rx code runs till the the
> time the packet is decapsulated.
Correct. Here is a little bit more explanation about the though behind 
these statements:

When the packet gets decapsulated skb->encapsulation should be reset to 0 as
all that is left is the (previously to decap) inner packet. For the same reason
the inner headers also will not be valid any more: there are no inner headers as such.
Personaly in Rx I assume that when the skb leaves the driver, and the 
hardware has detected encapsulation and hence the csums have been verified 
(or not), the skb->encapsulation is on and skb->ip_summed is set accordingly for both 
layers, but the inner headers are not set and even if they are they are not valid.

Also for Tx, skb->encapsulation should be the indication to the 
driver that it can use the inner headers (i.e. they are valid) in the skb 
in order to offload the inner csum.

> 
> Further, my personal interpretation was that on the rx path, skb should carry
> the encapsulation flag **only** if the HW was able to offload the inner
> checksum.
> 
> Joseph, what's your thinking here?

Yes, I agree. If the hardware cannot offload the inner checksum most 
probably it couldn't even detect the encapsulation.

> 
> Or.
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joseph Gasparakis March 4, 2014, 10:53 p.m. UTC | #8
On Tue, 4 Mar 2014, Jerry Chu wrote:

> Hi Or,
> 
> Thanks for writing this up.
> 
> On Tue, Mar 4, 2014 at 8:13 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> > On 28/02/2014 23:56, David Miller wrote:
> >>
> >> The topic of the skb->encapsulation semantics has come up several times in
> >> the past few weeks. We cannot move forward on any changes in this area until
> >> the semantics are well defined, and documented. Can someone work on a patch
> >> which documents skb->encapsulation properly, and then we can come back to
> >> fixing this bug? Thanks.
> >
> >
> > Lets try... the skb->encapsulation flag was introduced and used in 3.8 by
> > the
> > sequence of these three commits
> >
> > 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
> > d6727fe vxlan: capture inner headers during encapsulation
> > 6a674e9 net: Add support for hardware-offloaded encapsulation
> >
> > When discussed earlier on the list in the context of the skb->ip_summed
> > field,
> > Tom Herbert came with the following interpretation for the semantics which
> > Joseph confirmed
> >
> > "when skb->encapsulation is set the ip_summed is valid for both the inner
> > and outer header
> > (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
> > skb->encapsulation is not set then ip_summed is only valid for outer header"
> 
> For GRE encapped pkts is the following interpretation correct?
> 
> 1) ip_summed == CHECKSUM_COMPLETE
>     => csum covers IP payload csum of the outer IP datagram
> 
> 2) ip_summed == CHECKSUM_UNNECESSARY
> 2.1. skb->encapsulation is on => both GRE csum (if one is present) and TCP/UDP
> csum have been validated (assuming inner is a TCP or UDP pkt)

i40e also supports SCTP csumming for the inner packet, too.

> 
> 2.2. skb->encapsulation is off => only GRE csum (if one is present) is
> validated.
> 

Apart for the SCTP request for inclusion, it looks reasonable to me.

> >
> > As for the TX side of things, the change-log of commit 6a674e9 states
> >
> > "For Tx encapsulation offload, the driver will need to set the right bits in
> > netdev->hw_enc_features. The protocol driver will have to set the
> > skb->encapsulation bit and populate the inner headers, so the NIC driver
> > will use those inner headers to calculate the csum in hardware."
> 
> So we only support/care about csum offload for the inner pkts, which
> makes sense.

Well, we care about outer too. It is just that the inner headers are for 
the inner csum, the outer headers are for the outer csum. The outer 
headers will be always there anyway, so the patch introduced the inner 
ones. But I guess this is what you meant, right?

> 
> >
> > So in higher level, it seems that the role of the skb->encapsulation field
> > is to mark the skb to carry encapsulated packet for the code path between
> > the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip)
> > to the time driver xmit is called. Or from the time driver rx code runs till
> > the the time the packet is decapsulated.
> >
> > Further, my personal interpretation was that on the rx path, skb should
> > carry the encapsulation flag **only** if the HW was able to offload the
> > inner checksum.
> 
> SGTM.
> 
> Jerry
> 
> >
> > Joseph, what's your thinking here?
> >
> > Or.
> >
> >
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jerry Chu March 4, 2014, 11:11 p.m. UTC | #9
On Tue, Mar 4, 2014 at 2:53 PM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:
>
>
> On Tue, 4 Mar 2014, Jerry Chu wrote:
>
>> Hi Or,
>>
>> Thanks for writing this up.
>>
>> On Tue, Mar 4, 2014 at 8:13 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> > On 28/02/2014 23:56, David Miller wrote:
>> >>
>> >> The topic of the skb->encapsulation semantics has come up several times in
>> >> the past few weeks. We cannot move forward on any changes in this area until
>> >> the semantics are well defined, and documented. Can someone work on a patch
>> >> which documents skb->encapsulation properly, and then we can come back to
>> >> fixing this bug? Thanks.
>> >
>> >
>> > Lets try... the skb->encapsulation flag was introduced and used in 3.8 by
>> > the
>> > sequence of these three commits
>> >
>> > 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
>> > d6727fe vxlan: capture inner headers during encapsulation
>> > 6a674e9 net: Add support for hardware-offloaded encapsulation
>> >
>> > When discussed earlier on the list in the context of the skb->ip_summed
>> > field,
>> > Tom Herbert came with the following interpretation for the semantics which
>> > Joseph confirmed
>> >
>> > "when skb->encapsulation is set the ip_summed is valid for both the inner
>> > and outer header
>> > (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
>> > skb->encapsulation is not set then ip_summed is only valid for outer header"
>>
>> For GRE encapped pkts is the following interpretation correct?
>>
>> 1) ip_summed == CHECKSUM_COMPLETE
>>     => csum covers IP payload csum of the outer IP datagram
>>
>> 2) ip_summed == CHECKSUM_UNNECESSARY
>> 2.1. skb->encapsulation is on => both GRE csum (if one is present) and TCP/UDP
>> csum have been validated (assuming inner is a TCP or UDP pkt)
>
> i40e also supports SCTP csumming for the inner packet, too.
>
>>
>> 2.2. skb->encapsulation is off => only GRE csum (if one is present) is
>> validated.
>>
>
> Apart for the SCTP request for inclusion, it looks reasonable to me.
>
>> >
>> > As for the TX side of things, the change-log of commit 6a674e9 states
>> >
>> > "For Tx encapsulation offload, the driver will need to set the right bits in
>> > netdev->hw_enc_features. The protocol driver will have to set the
>> > skb->encapsulation bit and populate the inner headers, so the NIC driver
>> > will use those inner headers to calculate the csum in hardware."
>>
>> So we only support/care about csum offload for the inner pkts, which
>> makes sense.
>
> Well, we care about outer too. It is just that the inner headers are for
> the inner csum, the outer headers are for the outer csum. The outer
> headers will be always there anyway, so the patch introduced the inner
> ones. But I guess this is what you meant, right?

Actually i was wondering that since there is only one set csum_start/csum_offset
fields per skb, which would you support CHECKSUM_PARTIAL for both inner
and outer? You can only support one, right? (I haven't checked the UDP encap
code to see how this works though.)

Jerry

>
>>
>> >
>> > So in higher level, it seems that the role of the skb->encapsulation field
>> > is to mark the skb to carry encapsulated packet for the code path between
>> > the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip)
>> > to the time driver xmit is called. Or from the time driver rx code runs till
>> > the the time the packet is decapsulated.
>> >
>> > Further, my personal interpretation was that on the rx path, skb should
>> > carry the encapsulation flag **only** if the HW was able to offload the
>> > inner checksum.
>>
>> SGTM.
>>
>> Jerry
>>
>> >
>> > Joseph, what's your thinking here?
>> >
>> > Or.
>> >
>> >
>>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert March 5, 2014, 12:50 a.m. UTC | #10
On Tue, Mar 4, 2014 at 2:36 PM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:
>
>
> On Tue, 4 Mar 2014, Or Gerlitz wrote:
>
>> On 28/02/2014 23:56, David Miller wrote:
>> > The topic of the skb->encapsulation semantics has come up several times in
>> > the past few weeks. We cannot move forward on any changes in this area until
>> > the semantics are well defined, and documented. Can someone work on a patch
>> > which documents skb->encapsulation properly, and then we can come back to
>> > fixing this bug? Thanks.
>>
>> Lets try... the skb->encapsulation flag was introduced and used in 3.8 by the
>> sequence of these three commits
>>
>> 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
>> d6727fe vxlan: capture inner headers during encapsulation
>> 6a674e9 net: Add support for hardware-offloaded encapsulation
>>
>> When discussed earlier on the list in the context of the skb->ip_summed field,
>> Tom Herbert came with the following interpretation for the semantics which
>> Joseph confirmed
>>
>> "when skb->encapsulation is set the ip_summed is valid for both the inner and
>> outer header
>> (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
>> skb->encapsulation is not set then ip_summed is only valid for outer header"
>>
>
> Agree. This should be valid for both Rx and Tx.
>
>> As for the TX side of things, the change-log of commit 6a674e9 states
>>
>> "For Tx encapsulation offload, the driver will need to set the right bits in
>> netdev->hw_enc_features. The protocol driver will have to set the
>> skb->encapsulation bit and populate the inner headers, so the NIC driver will
>> use those inner headers to calculate the csum in hardware."
>>
>> So in higher level, it seems that the role of the skb->encapsulation field is
>> to mark the skb to carry encapsulated packet for the code path between the
>> time the packet is encapsulated by the protocol driver (e.g vxlan/ipip) to the
>> time driver xmit is called. Or from the time driver rx code runs till the the
>> time the packet is decapsulated.
> Correct. Here is a little bit more explanation about the though behind
> these statements:
>
> When the packet gets decapsulated skb->encapsulation should be reset to 0 as
> all that is left is the (previously to decap) inner packet. For the same reason
> the inner headers also will not be valid any more: there are no inner headers as such.
> Personaly in Rx I assume that when the skb leaves the driver, and the
> hardware has detected encapsulation and hence the csums have been verified
> (or not), the skb->encapsulation is on and skb->ip_summed is set accordingly for both
> layers, but the inner headers are not set and even if they are they are not valid.
>
I am concerned that we are overloading the skb->encpasulation, for
instance if this is set in TX path meaning inner headers are valid,
this should also be true of RX. It might be better if this field
indicated characteristics of the packet independent of being in RX or
TX path. For checksum, maybe we should have a separate encap_checksum
field. Also, to be future proof may this should be two bits for the
devices that can verify checksums in multiple levels of encapsulations
(yes, I know this sounds absurd, but it's no more absurd than devices
vendors taking it upon themselves to parse to some restricted set of
encapsulations instead of just giving us the full packet checksum! :-)
).

I hope to have patches soon on fixing CHECKSUM_COMPLETE, I think it
entails some more state in skb indicating offset and extent of
checksum value in the packet. An like a said, we need to get out of
the habit of stashing pseudo csums there (unless CHECKSUM_PARTIAL is
set).

> Also for Tx, skb->encapsulation should be the indication to the
> driver that it can use the inner headers (i.e. they are valid) in the skb
> in order to offload the inner csum.
>
>>
>> Further, my personal interpretation was that on the rx path, skb should carry
>> the encapsulation flag **only** if the HW was able to offload the inner
>> checksum.
>>
>> Joseph, what's your thinking here?
>
> Yes, I agree. If the hardware cannot offload the inner checksum most
> probably it couldn't even detect the encapsulation.
>
>>
>> Or.
>>
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jerry Chu March 5, 2014, 12:54 a.m. UTC | #11
On Tue, Mar 4, 2014 at 5:01 PM, Joseph Gasparakis
<joseph.gasparakis@intel.com> wrote:
>
>
> On Tue, 4 Mar 2014, Jerry Chu wrote:
>
>> On Tue, Mar 4, 2014 at 2:53 PM, Joseph Gasparakis
>> <joseph.gasparakis@intel.com> wrote:
>> >
>> >
>> > On Tue, 4 Mar 2014, Jerry Chu wrote:
>> >
>> >> Hi Or,
>> >>
>> >> Thanks for writing this up.
>> >>
>> >> On Tue, Mar 4, 2014 at 8:13 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> >> > On 28/02/2014 23:56, David Miller wrote:
>> >> >>
>> >> >> The topic of the skb->encapsulation semantics has come up several times in
>> >> >> the past few weeks. We cannot move forward on any changes in this area until
>> >> >> the semantics are well defined, and documented. Can someone work on a patch
>> >> >> which documents skb->encapsulation properly, and then we can come back to
>> >> >> fixing this bug? Thanks.
>> >> >
>> >> >
>> >> > Lets try... the skb->encapsulation flag was introduced and used in 3.8 by
>> >> > the
>> >> > sequence of these three commits
>> >> >
>> >> > 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
>> >> > d6727fe vxlan: capture inner headers during encapsulation
>> >> > 6a674e9 net: Add support for hardware-offloaded encapsulation
>> >> >
>> >> > When discussed earlier on the list in the context of the skb->ip_summed
>> >> > field,
>> >> > Tom Herbert came with the following interpretation for the semantics which
>> >> > Joseph confirmed
>> >> >
>> >> > "when skb->encapsulation is set the ip_summed is valid for both the inner
>> >> > and outer header
>> >> > (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
>> >> > skb->encapsulation is not set then ip_summed is only valid for outer header"
>> >>
>> >> For GRE encapped pkts is the following interpretation correct?
>> >>
>> >> 1) ip_summed == CHECKSUM_COMPLETE
>> >>     => csum covers IP payload csum of the outer IP datagram
>> >>
>> >> 2) ip_summed == CHECKSUM_UNNECESSARY
>> >> 2.1. skb->encapsulation is on => both GRE csum (if one is present) and TCP/UDP
>> >> csum have been validated (assuming inner is a TCP or UDP pkt)
>> >
>> > i40e also supports SCTP csumming for the inner packet, too.
>> >
>> >>
>> >> 2.2. skb->encapsulation is off => only GRE csum (if one is present) is
>> >> validated.
>> >>
>> >
>> > Apart for the SCTP request for inclusion, it looks reasonable to me.
>> >
>> >> >
>> >> > As for the TX side of things, the change-log of commit 6a674e9 states
>> >> >
>> >> > "For Tx encapsulation offload, the driver will need to set the right bits in
>> >> > netdev->hw_enc_features. The protocol driver will have to set the
>> >> > skb->encapsulation bit and populate the inner headers, so the NIC driver
>> >> > will use those inner headers to calculate the csum in hardware."
>> >>
>> >> So we only support/care about csum offload for the inner pkts, which
>> >> makes sense.
>> >
>> > Well, we care about outer too. It is just that the inner headers are for
>> > the inner csum, the outer headers are for the outer csum. The outer
>> > headers will be always there anyway, so the patch introduced the inner
>> > ones. But I guess this is what you meant, right?
>>
>> Actually i was wondering that since there is only one set csum_start/csum_offset
>> fields per skb, which would you support CHECKSUM_PARTIAL for both inner
>> and outer? You can only support one, right? (I haven't checked the UDP encap
>> code to see how this works though.)
>>
>> Jerry
>
> I am not sure I understand the question, Jerry... Are you asking in Tx
> path if the ip_summed==CHECKSUM_PARTIAL, which headers will the
> csum_start/csum_offset refer to?

Yes (if the question makes sense since i haven't read any UDP encap
code - will the outer (UDP) csum be enabled?).

Thanks,

Jerry

>
>> >
>> >>
>> >> >
>> >> > So in higher level, it seems that the role of the skb->encapsulation field
>> >> > is to mark the skb to carry encapsulated packet for the code path between
>> >> > the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip)
>> >> > to the time driver xmit is called. Or from the time driver rx code runs till
>> >> > the the time the packet is decapsulated.
>> >> >
>> >> > Further, my personal interpretation was that on the rx path, skb should
>> >> > carry the encapsulation flag **only** if the HW was able to offload the
>> >> > inner checksum.
>> >>
>> >> SGTM.
>> >>
>> >> Jerry
>> >>
>> >> >
>> >> > Joseph, what's your thinking here?
>> >> >
>> >> > Or.
>> >> >
>> >> >
>> >>
>>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joseph Gasparakis March 5, 2014, 1:01 a.m. UTC | #12
On Tue, 4 Mar 2014, Jerry Chu wrote:

> On Tue, Mar 4, 2014 at 2:53 PM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> >
> >
> > On Tue, 4 Mar 2014, Jerry Chu wrote:
> >
> >> Hi Or,
> >>
> >> Thanks for writing this up.
> >>
> >> On Tue, Mar 4, 2014 at 8:13 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> >> > On 28/02/2014 23:56, David Miller wrote:
> >> >>
> >> >> The topic of the skb->encapsulation semantics has come up several times in
> >> >> the past few weeks. We cannot move forward on any changes in this area until
> >> >> the semantics are well defined, and documented. Can someone work on a patch
> >> >> which documents skb->encapsulation properly, and then we can come back to
> >> >> fixing this bug? Thanks.
> >> >
> >> >
> >> > Lets try... the skb->encapsulation flag was introduced and used in 3.8 by
> >> > the
> >> > sequence of these three commits
> >> >
> >> > 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
> >> > d6727fe vxlan: capture inner headers during encapsulation
> >> > 6a674e9 net: Add support for hardware-offloaded encapsulation
> >> >
> >> > When discussed earlier on the list in the context of the skb->ip_summed
> >> > field,
> >> > Tom Herbert came with the following interpretation for the semantics which
> >> > Joseph confirmed
> >> >
> >> > "when skb->encapsulation is set the ip_summed is valid for both the inner
> >> > and outer header
> >> > (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
> >> > skb->encapsulation is not set then ip_summed is only valid for outer header"
> >>
> >> For GRE encapped pkts is the following interpretation correct?
> >>
> >> 1) ip_summed == CHECKSUM_COMPLETE
> >>     => csum covers IP payload csum of the outer IP datagram
> >>
> >> 2) ip_summed == CHECKSUM_UNNECESSARY
> >> 2.1. skb->encapsulation is on => both GRE csum (if one is present) and TCP/UDP
> >> csum have been validated (assuming inner is a TCP or UDP pkt)
> >
> > i40e also supports SCTP csumming for the inner packet, too.
> >
> >>
> >> 2.2. skb->encapsulation is off => only GRE csum (if one is present) is
> >> validated.
> >>
> >
> > Apart for the SCTP request for inclusion, it looks reasonable to me.
> >
> >> >
> >> > As for the TX side of things, the change-log of commit 6a674e9 states
> >> >
> >> > "For Tx encapsulation offload, the driver will need to set the right bits in
> >> > netdev->hw_enc_features. The protocol driver will have to set the
> >> > skb->encapsulation bit and populate the inner headers, so the NIC driver
> >> > will use those inner headers to calculate the csum in hardware."
> >>
> >> So we only support/care about csum offload for the inner pkts, which
> >> makes sense.
> >
> > Well, we care about outer too. It is just that the inner headers are for
> > the inner csum, the outer headers are for the outer csum. The outer
> > headers will be always there anyway, so the patch introduced the inner
> > ones. But I guess this is what you meant, right?
> 
> Actually i was wondering that since there is only one set csum_start/csum_offset
> fields per skb, which would you support CHECKSUM_PARTIAL for both inner
> and outer? You can only support one, right? (I haven't checked the UDP encap
> code to see how this works though.)
> 
> Jerry

I am not sure I understand the question, Jerry... Are you asking in Tx 
path if the ip_summed==CHECKSUM_PARTIAL, which headers will the 
csum_start/csum_offset refer to?

> >
> >>
> >> >
> >> > So in higher level, it seems that the role of the skb->encapsulation field
> >> > is to mark the skb to carry encapsulated packet for the code path between
> >> > the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip)
> >> > to the time driver xmit is called. Or from the time driver rx code runs till
> >> > the the time the packet is decapsulated.
> >> >
> >> > Further, my personal interpretation was that on the rx path, skb should
> >> > carry the encapsulation flag **only** if the HW was able to offload the
> >> > inner checksum.
> >>
> >> SGTM.
> >>
> >> Jerry
> >>
> >> >
> >> > Joseph, what's your thinking here?
> >> >
> >> > Or.
> >> >
> >> >
> >>
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexei Starovoitov March 5, 2014, 1:14 a.m. UTC | #13
On Tue, Mar 4, 2014 at 2:13 PM, Jerry Chu <hkchu@google.com> wrote:
> Hi Or,
>
> Thanks for writing this up.
>
> On Tue, Mar 4, 2014 at 8:13 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
>> On 28/02/2014 23:56, David Miller wrote:
>>>
>>> The topic of the skb->encapsulation semantics has come up several times in
>>> the past few weeks. We cannot move forward on any changes in this area until
>>> the semantics are well defined, and documented. Can someone work on a patch
>>> which documents skb->encapsulation properly, and then we can come back to
>>> fixing this bug? Thanks.
>>
>>
>> Lets try... the skb->encapsulation flag was introduced and used in 3.8 by
>> the
>> sequence of these three commits
>>
>> 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
>> d6727fe vxlan: capture inner headers during encapsulation
>> 6a674e9 net: Add support for hardware-offloaded encapsulation
>>
>> When discussed earlier on the list in the context of the skb->ip_summed
>> field,
>> Tom Herbert came with the following interpretation for the semantics which
>> Joseph confirmed
>>
>> "when skb->encapsulation is set the ip_summed is valid for both the inner
>> and outer header
>> (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
>> skb->encapsulation is not set then ip_summed is only valid for outer header"
>
> For GRE encapped pkts is the following interpretation correct?
>
> 1) ip_summed == CHECKSUM_COMPLETE
>     => csum covers IP payload csum of the outer IP datagram

my understanding here is that if ip_summed == CHECKSUM_COMPLETE
the hw computed skb->csum covers the whole packet except mac header.
vlan_untag() will adjust csum
then ip_local_deliver_finish() will pop iphdr without touching skb->csum,
since for the valid iphdr, it's a nop operation.
then gre_rcv()-> gre_cisco_rcv()->parse_gre_header()->iptunnel_pull_header()
will adjust skb->csum after pulling gre header.
and so on.

xoring csum at every pull is not cheap, so when HW sets
CHECKSUM_UNNECESSARY and verifies inner tcp/udp checksum it's the best.
Ideally HW would have a fallback to checksum_complete if it cannot parse encap.

> 2) ip_summed == CHECKSUM_UNNECESSARY
> 2.1. skb->encapsulation is on => both GRE csum (if one is present) and TCP/UDP
> csum have been validated (assuming inner is a TCP or UDP pkt)
>
> 2.2. skb->encapsulation is off => only GRE csum (if one is present) is
> validated.
>
>>
>> As for the TX side of things, the change-log of commit 6a674e9 states
>>
>> "For Tx encapsulation offload, the driver will need to set the right bits in
>> netdev->hw_enc_features. The protocol driver will have to set the
>> skb->encapsulation bit and populate the inner headers, so the NIC driver
>> will use those inner headers to calculate the csum in hardware."
>
> So we only support/care about csum offload for the inner pkts, which
> makes sense.
>
>>
>> So in higher level, it seems that the role of the skb->encapsulation field
>> is to mark the skb to carry encapsulated packet for the code path between
>> the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip)
>> to the time driver xmit is called. Or from the time driver rx code runs till
>> the the time the packet is decapsulated.
>>
>> Further, my personal interpretation was that on the rx path, skb should
>> carry the encapsulation flag **only** if the HW was able to offload the
>> inner checksum.
>
> SGTM.
>
> Jerry
>
>>
>> Joseph, what's your thinking here?
>>
>> Or.
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joseph Gasparakis March 5, 2014, 1:27 a.m. UTC | #14
On Tue, 4 Mar 2014, Jerry Chu wrote:

> On Tue, Mar 4, 2014 at 5:01 PM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> >
> >
> > On Tue, 4 Mar 2014, Jerry Chu wrote:
> >
> >> On Tue, Mar 4, 2014 at 2:53 PM, Joseph Gasparakis
> >> <joseph.gasparakis@intel.com> wrote:
> >> >
> >> >
> >> > On Tue, 4 Mar 2014, Jerry Chu wrote:
> >> >
> >> >> Hi Or,
> >> >>
> >> >> Thanks for writing this up.
> >> >>
> >> >> On Tue, Mar 4, 2014 at 8:13 AM, Or Gerlitz <ogerlitz@mellanox.com> wrote:
> >> >> > On 28/02/2014 23:56, David Miller wrote:
> >> >> >>
> >> >> >> The topic of the skb->encapsulation semantics has come up several times in
> >> >> >> the past few weeks. We cannot move forward on any changes in this area until
> >> >> >> the semantics are well defined, and documented. Can someone work on a patch
> >> >> >> which documents skb->encapsulation properly, and then we can come back to
> >> >> >> fixing this bug? Thanks.
> >> >> >
> >> >> >
> >> >> > Lets try... the skb->encapsulation flag was introduced and used in 3.8 by
> >> >> > the
> >> >> > sequence of these three commits
> >> >> >
> >> >> > 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
> >> >> > d6727fe vxlan: capture inner headers during encapsulation
> >> >> > 6a674e9 net: Add support for hardware-offloaded encapsulation
> >> >> >
> >> >> > When discussed earlier on the list in the context of the skb->ip_summed
> >> >> > field,
> >> >> > Tom Herbert came with the following interpretation for the semantics which
> >> >> > Joseph confirmed
> >> >> >
> >> >> > "when skb->encapsulation is set the ip_summed is valid for both the inner
> >> >> > and outer header
> >> >> > (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
> >> >> > skb->encapsulation is not set then ip_summed is only valid for outer header"
> >> >>
> >> >> For GRE encapped pkts is the following interpretation correct?
> >> >>
> >> >> 1) ip_summed == CHECKSUM_COMPLETE
> >> >>     => csum covers IP payload csum of the outer IP datagram
> >> >>
> >> >> 2) ip_summed == CHECKSUM_UNNECESSARY
> >> >> 2.1. skb->encapsulation is on => both GRE csum (if one is present) and TCP/UDP
> >> >> csum have been validated (assuming inner is a TCP or UDP pkt)
> >> >
> >> > i40e also supports SCTP csumming for the inner packet, too.
> >> >
> >> >>
> >> >> 2.2. skb->encapsulation is off => only GRE csum (if one is present) is
> >> >> validated.
> >> >>
> >> >
> >> > Apart for the SCTP request for inclusion, it looks reasonable to me.
> >> >
> >> >> >
> >> >> > As for the TX side of things, the change-log of commit 6a674e9 states
> >> >> >
> >> >> > "For Tx encapsulation offload, the driver will need to set the right bits in
> >> >> > netdev->hw_enc_features. The protocol driver will have to set the
> >> >> > skb->encapsulation bit and populate the inner headers, so the NIC driver
> >> >> > will use those inner headers to calculate the csum in hardware."
> >> >>
> >> >> So we only support/care about csum offload for the inner pkts, which
> >> >> makes sense.
> >> >
> >> > Well, we care about outer too. It is just that the inner headers are for
> >> > the inner csum, the outer headers are for the outer csum. The outer
> >> > headers will be always there anyway, so the patch introduced the inner
> >> > ones. But I guess this is what you meant, right?
> >>
> >> Actually i was wondering that since there is only one set csum_start/csum_offset
> >> fields per skb, which would you support CHECKSUM_PARTIAL for both inner
> >> and outer? You can only support one, right? (I haven't checked the UDP encap
> >> code to see how this works though.)
> >>
> >> Jerry
> >
> > I am not sure I understand the question, Jerry... Are you asking in Tx
> > path if the ip_summed==CHECKSUM_PARTIAL, which headers will the
> > csum_start/csum_offset refer to?
> 
> Yes (if the question makes sense since i haven't read any UDP encap
> code - will the outer (UDP) csum be enabled?).
> 
> Thanks,
> 
> Jerry

I believe they will refer to the inner headers as csum_start and 
csum_offset do not get changed during encapsulation and they refer to the 
original (inner) packet, but please verify this.

> 
> >
> >> >
> >> >>
> >> >> >
> >> >> > So in higher level, it seems that the role of the skb->encapsulation field
> >> >> > is to mark the skb to carry encapsulated packet for the code path between
> >> >> > the time the packet is encapsulated by the protocol driver (e.g vxlan/ipip)
> >> >> > to the time driver xmit is called. Or from the time driver rx code runs till
> >> >> > the the time the packet is decapsulated.
> >> >> >
> >> >> > Further, my personal interpretation was that on the rx path, skb should
> >> >> > carry the encapsulation flag **only** if the HW was able to offload the
> >> >> > inner checksum.
> >> >>
> >> >> SGTM.
> >> >>
> >> >> Jerry
> >> >>
> >> >> >
> >> >> > Joseph, what's your thinking here?
> >> >> >
> >> >> > Or.
> >> >> >
> >> >> >
> >> >>
> >>
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joseph Gasparakis March 5, 2014, 1:46 a.m. UTC | #15
On Tue, 4 Mar 2014, Tom Herbert wrote:

> On Tue, Mar 4, 2014 at 2:36 PM, Joseph Gasparakis
> <joseph.gasparakis@intel.com> wrote:
> >
> >
> > On Tue, 4 Mar 2014, Or Gerlitz wrote:
> >
> >> On 28/02/2014 23:56, David Miller wrote:
> >> > The topic of the skb->encapsulation semantics has come up several times in
> >> > the past few weeks. We cannot move forward on any changes in this area until
> >> > the semantics are well defined, and documented. Can someone work on a patch
> >> > which documents skb->encapsulation properly, and then we can come back to
> >> > fixing this bug? Thanks.
> >>
> >> Lets try... the skb->encapsulation flag was introduced and used in 3.8 by the
> >> sequence of these three commits
> >>
> >> 0afb166 vxlan: Add capability of Rx checksum offload for inner packet
> >> d6727fe vxlan: capture inner headers during encapsulation
> >> 6a674e9 net: Add support for hardware-offloaded encapsulation
> >>
> >> When discussed earlier on the list in the context of the skb->ip_summed field,
> >> Tom Herbert came with the following interpretation for the semantics which
> >> Joseph confirmed
> >>
> >> "when skb->encapsulation is set the ip_summed is valid for both the inner and
> >> outer header
> >> (e.g. CHECKSUM_COMPLETE is always assumed okay for both layers). If
> >> skb->encapsulation is not set then ip_summed is only valid for outer header"
> >>
> >
> > Agree. This should be valid for both Rx and Tx.
> >
> >> As for the TX side of things, the change-log of commit 6a674e9 states
> >>
> >> "For Tx encapsulation offload, the driver will need to set the right bits in
> >> netdev->hw_enc_features. The protocol driver will have to set the
> >> skb->encapsulation bit and populate the inner headers, so the NIC driver will
> >> use those inner headers to calculate the csum in hardware."
> >>
> >> So in higher level, it seems that the role of the skb->encapsulation field is
> >> to mark the skb to carry encapsulated packet for the code path between the
> >> time the packet is encapsulated by the protocol driver (e.g vxlan/ipip) to the
> >> time driver xmit is called. Or from the time driver rx code runs till the the
> >> time the packet is decapsulated.
> > Correct. Here is a little bit more explanation about the though behind
> > these statements:
> >
> > When the packet gets decapsulated skb->encapsulation should be reset to 0 as
> > all that is left is the (previously to decap) inner packet. For the same reason
> > the inner headers also will not be valid any more: there are no inner headers as such.
> > Personaly in Rx I assume that when the skb leaves the driver, and the
> > hardware has detected encapsulation and hence the csums have been verified
> > (or not), the skb->encapsulation is on and skb->ip_summed is set accordingly for both
> > layers, but the inner headers are not set and even if they are they are not valid.
> >
> I am concerned that we are overloading the skb->encpasulation, for
> instance if this is set in TX path meaning inner headers are valid,
> this should also be true of RX. It might be better if this field
> indicated characteristics of the packet independent of being in RX or
> TX path.

Well, the fact that we cannot have valid inner headers in Rx is simply 
because hardware does not provide them to the driver, so it doesn't make 
sense for the driver to parse the pkt and populate them. Do you have 
in mind a place in Rx path before the protocol (decapsulation) driver that 
we would do something with inner headers and we need them valid?

> For checksum, maybe we should have a separate encap_checksum
> field. Also, to be future proof may this should be two bits for the
> devices that can verify checksums in multiple levels of encapsulations
> (yes, I know this sounds absurd, but it's no more absurd than devices
> vendors taking it upon themselves to parse to some restricted set of
> encapsulations instead of just giving us the full packet checksum! :-)
> ).
> 
> I hope to have patches soon on fixing CHECKSUM_COMPLETE, I think it
> entails some more state in skb indicating offset and extent of
> checksum value in the packet. An like a said, we need to get out of
> the habit of stashing pseudo csums there (unless CHECKSUM_PARTIAL is
> set).
> 
> > Also for Tx, skb->encapsulation should be the indication to the
> > driver that it can use the inner headers (i.e. they are valid) in the skb
> > in order to offload the inner csum.
> >
> >>
> >> Further, my personal interpretation was that on the rx path, skb should carry
> >> the encapsulation flag **only** if the HW was able to offload the inner
> >> checksum.
> >>
> >> Joseph, what's your thinking here?
> >
> > Yes, I agree. If the hardware cannot offload the inner checksum most
> > probably it couldn't even detect the encapsulation.
> >
> >>
> >> Or.
> >>
> >>
> >>
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Or Gerlitz March 5, 2014, 8:47 p.m. UTC | #16
On Wed, Mar 5, 2014 at 2:50 AM, Tom Herbert <therbert@google.com> wrote:
> I am concerned that we are overloading the skb->encpasulation

Guys, so to sum up yesterday's messages on this thread -- do folks
think we can come up with better/proper definition for the
skb->encapsulation bit, or we have to make this multiple bits, or we
have (hopefully not too non linear) dependancy on proper
definitions/usage of the skb->ip_summed field?


Or.

> for instance if this is set in TX path meaning inner headers are valid,
> this should also be true of RX. It might be better if this field
> indicated characteristics of the packet independent of being in RX or
> TX path. For checksum, maybe we should have a separate encap_checksum
> field. Also, to be future proof may this should be two bits for the
> devices that can verify checksums in multiple levels of encapsulations
> (yes, I know this sounds absurd, but it's no more absurd than devices
> vendors taking it upon themselves to parse to some restricted set of
> encapsulations instead of just giving us the full packet checksum! :-)).

> I hope to have patches soon on fixing CHECKSUM_COMPLETE, I think it
> entails some more state in skb indicating offset and extent of
> checksum value in the packet. An like a said, we need to get out of
> the habit of stashing pseudo csums there (unless CHECKSUM_PARTIAL is
> set).
>
>> Also for Tx, skb->encapsulation should be the indication to the
>> driver that it can use the inner headers (i.e. they are valid) in the skb
>> in order to offload the inner csum.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa March 6, 2014, 4:30 p.m. UTC | #17
On Thu, Mar 06, 2014 at 08:42:32AM -0800, Joseph Gasparakis wrote:
> 
> 
> On Wed, 5 Mar 2014, Or Gerlitz wrote:
> 
> > On Wed, Mar 5, 2014 at 2:50 AM, Tom Herbert <therbert@google.com> wrote:
> > > I am concerned that we are overloading the skb->encpasulation
> > 
> > Guys, so to sum up yesterday's messages on this thread -- do folks
> > think we can come up with better/proper definition for the
> > skb->encapsulation bit, or we have to make this multiple bits, or we
> > have (hopefully not too non linear) dependancy on proper
> > definitions/usage of the skb->ip_summed field?
> > 
> > 
> > Or.
> >
> 
> From my perspective I do not see a gap with the mentioned fields, but I am 
> open if anyone has something to suggest.

I am still unsure if we want to deal with segmentation in nested tunnels. In
that case I fear we would need additional markers when to actual segment the
UDP packet and update the fragments.

This came up in commit 91a48a2e85a3b7 ("ipv4: ipv6: better estimate tunnel
header cut for correct ufo handling").

Greetings,

  Hannes

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joseph Gasparakis March 6, 2014, 4:42 p.m. UTC | #18
On Wed, 5 Mar 2014, Or Gerlitz wrote:

> On Wed, Mar 5, 2014 at 2:50 AM, Tom Herbert <therbert@google.com> wrote:
> > I am concerned that we are overloading the skb->encpasulation
> 
> Guys, so to sum up yesterday's messages on this thread -- do folks
> think we can come up with better/proper definition for the
> skb->encapsulation bit, or we have to make this multiple bits, or we
> have (hopefully not too non linear) dependancy on proper
> definitions/usage of the skb->ip_summed field?
> 
> 
> Or.
>

From my perspective I do not see a gap with the mentioned fields, but I am 
open if anyone has something to suggest.

Joseph
 
> > for instance if this is set in TX path meaning inner headers are valid,
> > this should also be true of RX. It might be better if this field
> > indicated characteristics of the packet independent of being in RX or
> > TX path. For checksum, maybe we should have a separate encap_checksum
> > field. Also, to be future proof may this should be two bits for the
> > devices that can verify checksums in multiple levels of encapsulations
> > (yes, I know this sounds absurd, but it's no more absurd than devices
> > vendors taking it upon themselves to parse to some restricted set of
> > encapsulations instead of just giving us the full packet checksum! :-)).
> 
> > I hope to have patches soon on fixing CHECKSUM_COMPLETE, I think it
> > entails some more state in skb indicating offset and extent of
> > checksum value in the packet. An like a said, we need to get out of
> > the habit of stashing pseudo csums there (unless CHECKSUM_PARTIAL is
> > set).
> >
> >> Also for Tx, skb->encapsulation should be the indication to the
> >> driver that it can use the inner headers (i.e. they are valid) in the skb
> >> in order to offload the inner csum.
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index b1b0c8d..75517d8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4045,6 +4045,8 @@  static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb)
 	skb->vlan_tci = 0;
 	skb->dev = napi->dev;
 	skb->skb_iif = 0;
+	skb->encapsulation = 0;
+	skb_shinfo(skb)->gso_type = 0;
 
 	napi->skb = skb;
 }
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index ecd2c3f..68229ca 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1431,6 +1431,9 @@  static int inet_gro_complete(struct sk_buff *skb, int nhoff)
 	int proto = iph->protocol;
 	int err = -ENOSYS;
 
+	if (skb->encapsulation)
+		skb_set_inner_network_header(skb, nhoff);
+
 	csum_replace2(&iph->check, iph->tot_len, newlen);
 	iph->tot_len = newlen;
 
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index f1d3228..2d24f29 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -255,6 +255,9 @@  static int gre_gro_complete(struct sk_buff *skb, int nhoff)
 	int err = -ENOENT;
 	__be16 type;
 
+	skb->encapsulation = 1;
+	skb_shinfo(skb)->gso_type = SKB_GSO_GRE;
+
 	type = greh->protocol;
 	if (greh->flags & GRE_KEY)
 		grehlen += GRE_HEADER_SECTION;
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index b92b817..c25953a 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -310,7 +310,7 @@  static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 
 	th->check = ~tcp_v4_check(skb->len - thoff, iph->saddr,
 				  iph->daddr, 0);
-	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
+	skb_shinfo(skb)->gso_type |= SKB_GSO_TCPV4;
 
 	return tcp_gro_complete(skb);
 }
diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c
index 0d78132..3e8458a 100644
--- a/net/ipv6/tcpv6_offload.c
+++ b/net/ipv6/tcpv6_offload.c
@@ -73,7 +73,7 @@  static int tcp6_gro_complete(struct sk_buff *skb, int thoff)
 
 	th->check = ~tcp_v6_check(skb->len - thoff, &iph->saddr,
 				  &iph->daddr, 0);
-	skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
+	skb_shinfo(skb)->gso_type |= SKB_GSO_TCPV6;
 
 	return tcp_gro_complete(skb);
 }