[net] gso: do GSO for local skb with size bigger than MTU

Message ID	54AA2912.6090903@gmail.com
State	RFC, archived
Delegated to:	David Miller
Headers	show Return-Path: <netdev-owner@vger.kernel.org> Message-ID: <54AA2912.6090903@gmail.com> Date: Mon, 05 Jan 2015 14:02:58 +0800 From: Fan Du <fengyuleidian0615@gmail.com> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130106 Thunderbird/17.0.2 MIME-Version: 1.0 To: "Du, Fan" <fan.du@intel.com>, Thomas Graf <tgraf@suug.ch>, "davem@davemloft.net" <davem@davemloft.net>, "jesse@nicira.com" <jesse@nicira.com> CC: "Michael S. Tsirkin" <mst@redhat.com>, 'Jason Wang' <jasowang@redhat.com>, "netdev@vger.kernel.org" <netdev@vger.kernel.org>, "fw@strlen.de" <fw@strlen.de>, "dev@openvswitch.org" <dev@openvswitch.org>, "pshelar@nicira.com" <pshelar@nicira.com> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU References: <1417156385-18276-1-git-send-email-fan.du@intel.com> <1417158128.3268.2@smtp.corp.redhat.com> <5A90DA2E42F8AE43BC4A093BF0678848DED92B@SHSMSX104.ccr.corp.intel.com> <20141201135225.GA16814@casper.infradead.org> <20141202154839.GB5344@t520.home> <20141202170927.GA9457@casper.infradead.org> <20141202173401.GB4126@redhat.com> <20141202174158.GB9457@casper.infradead.org> <5A90DA2E42F8AE43BC4A093BF0678848DEDFDB@SHSMSX104.ccr.corp.intel.com> In-Reply-To: <5A90DA2E42F8AE43BC4A093BF0678848DEDFDB@SHSMSX104.ccr.corp.intel.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Sender: netdev-owner@vger.kernel.org Precedence: bulk

FengYu LeiDian Jan. 5, 2015, 6:02 a.m. UTC

于 2014年12月03日 10:31, Du, Fan 写道:
>
>
>> -----Original Message-----
>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>> Sent: Wednesday, December 3, 2014 1:42 AM
>> To: Michael S. Tsirkin
>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>>
>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>> What about containers or any other virtualization environment that
>>>>> doesn't use Virtio?
>>>>
>>>> The host can dictate the MTU in that case for both veth or OVS
>>>> internal which would be primary container plumbing techniques.
>>>
>>> It typically can't do this easily for VMs with emulated devices:
>>> real ethernet uses a fixed MTU.
>>>
>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>> unrelated optimization.
>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>
>> PMTU discovery only resolves the issue if an actual IP stack is running inside the
>> VM. This may not be the case at all.
>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Some thoughts here:
>
> Think otherwise, this is indeed what host stack should forge a ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
> message with _inner_ skb network and transport header, do whatever type of encapsulation,
> and thereafter push such packet upward to Guest/Container, which make them feel, the intermediate node
> or the peer send such message. PMTU should be expected to work correct.
> And such behavior should be shared by all other encapsulation tech if they are also suffered.

Hi David, Jesse and Thomas

As discussed in here: https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4 and
quotes from Jesse:
My proposal would be something like this:
  * For L2, reduce the VM MTU to the lowest common denominator on the segment.
  * For L3, use path MTU discovery or fragment inner packet (i.e.
normal routing behavior).
  * As a last resort (such as if using an old version of virtio in the
guest), fragment the tunnel packet.

For L2, it's a administrative action
For L3, PMTU approach looks better, because once the sender is alerted the reduced MTU,
packet size after encapsulation will not exceed physical MTU, so no additional fragments
efforts needed.
For "As a last resort... fragment the tunnel packet", the original patch:
https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but seems it's
not welcomed.

Below raw patch adopts PMTU approach, please review! Any kind of comments/suggestions
is welcomed.

Jesse Gross Jan. 5, 2015, 5:58 p.m. UTC | #1

On Mon, Jan 5, 2015 at 1:02 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2014年12月03日 10:31, Du, Fan 写道:
>
>>
>>
>>> -----Original Message-----
>>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>> To: Michael S. Tsirkin
>>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than
>>> MTU
>>>
>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>
>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>
>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>
>>>>>> What about containers or any other virtualization environment that
>>>>>> doesn't use Virtio?
>>>>>
>>>>>
>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>> internal which would be primary container plumbing techniques.
>>>>
>>>>
>>>> It typically can't do this easily for VMs with emulated devices:
>>>> real ethernet uses a fixed MTU.
>>>>
>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>> unrelated optimization.
>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>
>>>
>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>> inside the
>>> VM. This may not be the case at all.
>>
>>   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> Some thoughts here:
>>
>> Think otherwise, this is indeed what host stack should forge a
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
>> message with _inner_ skb network and transport header, do whatever type of
>> encapsulation,
>> and thereafter push such packet upward to Guest/Container, which make them
>> feel, the intermediate node
>> or the peer send such message. PMTU should be expected to work correct.
>> And such behavior should be shared by all other encapsulation tech if they
>> are also suffered.
>
>
> Hi David, Jesse and Thomas
>
> As discussed in here:
> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4 and
> quotes from Jesse:
> My proposal would be something like this:
>  * For L2, reduce the VM MTU to the lowest common denominator on the
> segment.
>  * For L3, use path MTU discovery or fragment inner packet (i.e.
> normal routing behavior).
>  * As a last resort (such as if using an old version of virtio in the
> guest), fragment the tunnel packet.
>
>
> For L2, it's a administrative action
> For L3, PMTU approach looks better, because once the sender is alerted the
> reduced MTU,
> packet size after encapsulation will not exceed physical MTU, so no
> additional fragments
> efforts needed.
> For "As a last resort... fragment the tunnel packet", the original patch:
> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but
> seems it's
> not welcomed.

This needs to be properly integrated into IP processing if it is to
work correctly. One of the reasons for only doing path MTU discovery
for L3 is that it operates seamlessly as part of normal operation -
there is no need to forge addresses or potentially generate ICMP when
on an L2 network. However, this ignores the IP handling that is going
on (note that in OVS it is possible for L3 to be implemented as a set
of flows coming from a controller).

It also should not be VXLAN specific or duplicate VXLAN encapsulation
code. As this is happening before encapsulation, the generated ICMP
does not need to be encapsulated either if it is created in the right
location.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

FengYu LeiDian Jan. 6, 2015, 9:34 a.m. UTC | #2

On 2015/1/6 1:58, Jesse Gross wrote:
> On Mon, Jan 5, 2015 at 1:02 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
>> 于 2014年12月03日 10:31, Du, Fan 写道:
>>
>>>
>>>> -----Original Message-----
>>>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>>> To: Michael S. Tsirkin
>>>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>>>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than
>>>> MTU
>>>>
>>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>> What about containers or any other virtualization environment that
>>>>>>> doesn't use Virtio?
>>>>>>
>>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>>> internal which would be primary container plumbing techniques.
>>>>>
>>>>> It typically can't do this easily for VMs with emulated devices:
>>>>> real ethernet uses a fixed MTU.
>>>>>
>>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>>> unrelated optimization.
>>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>>
>>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>>> inside the
>>>> VM. This may not be the case at all.
>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>
>>> Some thoughts here:
>>>
>>> Think otherwise, this is indeed what host stack should forge a
>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
>>> message with _inner_ skb network and transport header, do whatever type of
>>> encapsulation,
>>> and thereafter push such packet upward to Guest/Container, which make them
>>> feel, the intermediate node
>>> or the peer send such message. PMTU should be expected to work correct.
>>> And such behavior should be shared by all other encapsulation tech if they
>>> are also suffered.
>>
>> Hi David, Jesse and Thomas
>>
>> As discussed in here:
>> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4 and
>> quotes from Jesse:
>> My proposal would be something like this:
>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>> segment.
>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>> normal routing behavior).
>>   * As a last resort (such as if using an old version of virtio in the
>> guest), fragment the tunnel packet.
>>
>>
>> For L2, it's a administrative action
>> For L3, PMTU approach looks better, because once the sender is alerted the
>> reduced MTU,
>> packet size after encapsulation will not exceed physical MTU, so no
>> additional fragments
>> efforts needed.
>> For "As a last resort... fragment the tunnel packet", the original patch:
>> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job, but
>> seems it's
>> not welcomed.
> This needs to be properly integrated into IP processing if it is to
> work correctly.
Do you mean the original patch in this thread? yes, it works correctly
in my cloud env. If you has any other concerns, please let me know. :)
> One of the reasons for only doing path MTU discovery
> for L3 is that it operates seamlessly as part of normal operation -
> there is no need to forge addresses or potentially generate ICMP when
> on an L2 network. However, this ignores the IP handling that is going
> on (note that in OVS it is possible for L3 to be implemented as a set
> of flows coming from a controller).
>
> It also should not be VXLAN specific or duplicate VXLAN encapsulation
> code. As this is happening before encapsulation, the generated ICMP
> does not need to be encapsulated either if it is created in the right
> location.
Yes, I agree. GRE share the same issue from the code flow.
Pushing back ICMP msg back without encapsulation without circulating down
to physical device is possible. The "right location" as far as I know
could only be in ovs_vport_send. In addition this probably requires wrapper
route looking up operation for GRE/VXLAN, after get the under layer 
device MTU
from the routing information, then calculate reduced MTU becomes feasible.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jesse Gross Jan. 6, 2015, 7:11 p.m. UTC | #3

On Tue, Jan 6, 2015 at 4:34 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
>
> On 2015/1/6 1:58, Jesse Gross wrote:
>>
>> On Mon, Jan 5, 2015 at 1:02 AM, Fan Du <fengyuleidian0615@gmail.com>
>> wrote:
>>>
>>> 于 2014年12月03日 10:31, Du, Fan 写道:
>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>>>>> Sent: Wednesday, December 3, 2014 1:42 AM
>>>>> To: Michael S. Tsirkin
>>>>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>>>>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>>>>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger
>>>>> than
>>>>> MTU
>>>>>
>>>>> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>>>>>
>>>>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>>>>>
>>>>>>>> What about containers or any other virtualization environment that
>>>>>>>> doesn't use Virtio?
>>>>>>>
>>>>>>>
>>>>>>> The host can dictate the MTU in that case for both veth or OVS
>>>>>>> internal which would be primary container plumbing techniques.
>>>>>>
>>>>>>
>>>>>> It typically can't do this easily for VMs with emulated devices:
>>>>>> real ethernet uses a fixed MTU.
>>>>>>
>>>>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>>>>> unrelated optimization.
>>>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>>>>
>>>>>
>>>>> PMTU discovery only resolves the issue if an actual IP stack is running
>>>>> inside the
>>>>> VM. This may not be the case at all.
>>>>
>>>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>
>>>> Some thoughts here:
>>>>
>>>> Think otherwise, this is indeed what host stack should forge a
>>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED
>>>> message with _inner_ skb network and transport header, do whatever type
>>>> of
>>>> encapsulation,
>>>> and thereafter push such packet upward to Guest/Container, which make
>>>> them
>>>> feel, the intermediate node
>>>> or the peer send such message. PMTU should be expected to work correct.
>>>> And such behavior should be shared by all other encapsulation tech if
>>>> they
>>>> are also suffered.
>>>
>>>
>>> Hi David, Jesse and Thomas
>>>
>>> As discussed in here:
>>> https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4 and
>>> quotes from Jesse:
>>> My proposal would be something like this:
>>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>>> segment.
>>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>>> normal routing behavior).
>>>   * As a last resort (such as if using an old version of virtio in the
>>> guest), fragment the tunnel packet.
>>>
>>>
>>> For L2, it's a administrative action
>>> For L3, PMTU approach looks better, because once the sender is alerted
>>> the
>>> reduced MTU,
>>> packet size after encapsulation will not exceed physical MTU, so no
>>> additional fragments
>>> efforts needed.
>>> For "As a last resort... fragment the tunnel packet", the original patch:
>>> https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4 did the job,
>>> but
>>> seems it's
>>> not welcomed.
>>
>> This needs to be properly integrated into IP processing if it is to
>> work correctly.
>
> Do you mean the original patch in this thread? yes, it works correctly
> in my cloud env. If you has any other concerns, please let me know. :)

Ok...but that doesn't actually address the points that I made.

>> One of the reasons for only doing path MTU discovery
>> for L3 is that it operates seamlessly as part of normal operation -
>> there is no need to forge addresses or potentially generate ICMP when
>> on an L2 network. However, this ignores the IP handling that is going
>> on (note that in OVS it is possible for L3 to be implemented as a set
>> of flows coming from a controller).
>>
>> It also should not be VXLAN specific or duplicate VXLAN encapsulation
>> code. As this is happening before encapsulation, the generated ICMP
>> does not need to be encapsulated either if it is created in the right
>> location.
>
> Yes, I agree. GRE share the same issue from the code flow.
> Pushing back ICMP msg back without encapsulation without circulating down
> to physical device is possible. The "right location" as far as I know
> could only be in ovs_vport_send. In addition this probably requires wrapper
> route looking up operation for GRE/VXLAN, after get the under layer device
> MTU
> from the routing information, then calculate reduced MTU becomes feasible.

As I said, it needs to be integrated into L3 processing. In OVS this
would mean adding some primitives to the kernel and then exposing the
functionality upwards into userspace/controller.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

FengYu LeiDian Jan. 7, 2015, 5:58 a.m. UTC | #4

于 2015年01月07日 03:11, Jesse Gross 写道:
>>> One of the reasons for only doing path MTU discovery
>>> >>for L3 is that it operates seamlessly as part of normal operation -
>>> >>there is no need to forge addresses or potentially generate ICMP when
>>> >>on an L2 network. However, this ignores the IP handling that is going
>>> >>on (note that in OVS it is possible for L3 to be implemented as a set
>>> >>of flows coming from a controller).
>>> >>
>>> >>It also should not be VXLAN specific or duplicate VXLAN encapsulation
>>> >>code. As this is happening before encapsulation, the generated ICMP
>>> >>does not need to be encapsulated either if it is created in the right
>>> >>location.
>> >
>> >Yes, I agree. GRE share the same issue from the code flow.
>> >Pushing back ICMP msg back without encapsulation without circulating down
>> >to physical device is possible. The "right location" as far as I know
>> >could only be in ovs_vport_send. In addition this probably requires wrapper
>> >route looking up operation for GRE/VXLAN, after get the under layer device
>> >MTU
>> >from the routing information, then calculate reduced MTU becomes feasible.
> As I said, it needs to be integrated into L3 processing. In OVS this
> would mean adding some primitives to the kernel and then exposing the
> functionality upwards into userspace/controller.

I'm bit of confused with "L3 processing" you mentioned here... SORRY
Apparently I'm not seeing the whole picture as you pointed out. Could you please
elaborate "L3 processing" a bit more? docs/codes/or other useful links. Appreciated.

My understanding is:
controller sets the forwarding rules into kernel datapath, any flow not matching
with the rules are threw to controller by upcall. Once the rule decision is made
by controller, then, this flow packet is pushed down to datapath to be forwarded
again according to the new rule.

So I'm not sure whether pushing the over-MTU-sized packet or pushing the forged ICMP
without encapsulation to controller is required by current ovs implementation. By doing
so, such over-MTU-sized packet is treated as a event for the controller to be take
care of.

Jesse Gross Jan. 7, 2015, 8:52 p.m. UTC | #5

On Tue, Jan 6, 2015 at 9:58 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2015年01月07日 03:11, Jesse Gross 写道:
>>>>
>>>> One of the reasons for only doing path MTU discovery
>>>> >>for L3 is that it operates seamlessly as part of normal operation -
>>>> >>there is no need to forge addresses or potentially generate ICMP when
>>>> >>on an L2 network. However, this ignores the IP handling that is going
>>>> >>on (note that in OVS it is possible for L3 to be implemented as a set
>>>> >>of flows coming from a controller).
>>>> >>
>>>> >>It also should not be VXLAN specific or duplicate VXLAN encapsulation
>>>> >>code. As this is happening before encapsulation, the generated ICMP
>>>> >>does not need to be encapsulated either if it is created in the right
>>>> >>location.
>>>
>>> >
>>> >Yes, I agree. GRE share the same issue from the code flow.
>>> >Pushing back ICMP msg back without encapsulation without circulating
>>> > down
>>> >to physical device is possible. The "right location" as far as I know
>>> >could only be in ovs_vport_send. In addition this probably requires
>>> > wrapper
>>> >route looking up operation for GRE/VXLAN, after get the under layer
>>> > device
>>> >MTU
>>> >from the routing information, then calculate reduced MTU becomes
>>> > feasible.
>>
>> As I said, it needs to be integrated into L3 processing. In OVS this
>> would mean adding some primitives to the kernel and then exposing the
>> functionality upwards into userspace/controller.
>
>
> I'm bit of confused with "L3 processing" you mentioned here... SORRY
> Apparently I'm not seeing the whole picture as you pointed out. Could you
> please
> elaborate "L3 processing" a bit more? docs/codes/or other useful links.
> Appreciated.

L3 processing is anywhere that routing takes place - i.e. where you
would decrement the TTL and change the MAC addresses. Part of routing
is dealing with differing MTUs, so that needs to be integrated into
the same logic.

> My understanding is:
> controller sets the forwarding rules into kernel datapath, any flow not
> matching
> with the rules are threw to controller by upcall. Once the rule decision is
> made
> by controller, then, this flow packet is pushed down to datapath to be
> forwarded
> again according to the new rule.
>
> So I'm not sure whether pushing the over-MTU-sized packet or pushing the
> forged ICMP
> without encapsulation to controller is required by current ovs
> implementation. By doing
> so, such over-MTU-sized packet is treated as a event for the controller to
> be take
> care of.

If flows are implementing routing (again, they are doing things like
decrementing the TTL) then it is necessary for them to also handle
this situation using some potentially new primitives (like a size
check). Otherwise you end up with issues like the ones that I
mentioned above like needing to forge addresses because you don't know
what the correct ones are. If the flows aren't doing things to
implement routing, then you really have a flat L2 network and you
shouldn't be doing this type of behavior at all as I described in the
original plan.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

FengYu LeiDian Jan. 8, 2015, 9:39 a.m. UTC | #6

于 2015年01月08日 04:52, Jesse Gross 写道:
>> My understanding is:
>> >controller sets the forwarding rules into kernel datapath, any flow not
>> >matching
>> >with the rules are threw to controller by upcall. Once the rule decision is
>> >made
>> >by controller, then, this flow packet is pushed down to datapath to be
>> >forwarded
>> >again according to the new rule.
>> >
>> >So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>> >forged ICMP
>> >without encapsulation to controller is required by current ovs
>> >implementation. By doing
>> >so, such over-MTU-sized packet is treated as a event for the controller to
>> >be take
>> >care of.
> If flows are implementing routing (again, they are doing things like
> decrementing the TTL) then it is necessary for them to also handle
> this situation using some potentially new primitives (like a size
> check). Otherwise you end up with issues like the ones that I
> mentioned above like needing to forge addresses because you don't know
> what the correct ones are.

Thanks for explaining, Jesse!

btw, I don't get it about "to forge addresses", building ICMP message
with Guest packet doesn't require to forge address when not encapsulating
ICMP message with outer headers.

If the flows aren't doing things to
> implement routing, then you really have a flat L2 network and you
> shouldn't be doing this type of behavior at all as I described in the
> original plan.

For flows implementing routing scenario:
First of all, over-MTU-sized packet could only be detected once the flow
as been consulted(each port could implement a 'check' hook to do this),
and just before send to the actual port.

Then pushing the over-MTU-sized packet back to controller, it's the controller
who will will decide whether to build ICMP message, or whatever routing behaviour
it may take. And sent it back with the port information. This ICMP message will
travel back to Guest.

Why does the flow has to use primitive like a "check size"? "check size"
will only take effect after do_output. I'm not very clear with this approach.

And not all scenario involving flow with routing behaviour, just set up a
vxlan tunnel, and attach KVM guest or Docker onto it for playing or developing.
This wouldn't necessarily require user to set additional specific flows to make
over-MTU-sized packet pass through the tunnel correctly. In such scenario, I
think the original patch in this thread to fragment tunnel packet is still needed
OR workout a generic component to build ICMP for all type tunnel in L2 level.
Both of those will act as a backup plan as there is no such specific flow as
default.

If I missed something obviously, please let me know.

Jesse Gross Jan. 8, 2015, 7:55 p.m. UTC | #7

On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2015年01月08日 04:52, Jesse Gross 写道:
>>>
>>> My understanding is:
>>> >controller sets the forwarding rules into kernel datapath, any flow not
>>> >matching
>>> >with the rules are threw to controller by upcall. Once the rule decision
>>> > is
>>> >made
>>> >by controller, then, this flow packet is pushed down to datapath to be
>>> >forwarded
>>> >again according to the new rule.
>>> >
>>> >So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>>> >forged ICMP
>>> >without encapsulation to controller is required by current ovs
>>> >implementation. By doing
>>> >so, such over-MTU-sized packet is treated as a event for the controller
>>> > to
>>> >be take
>>> >care of.
>>
>> If flows are implementing routing (again, they are doing things like
>> decrementing the TTL) then it is necessary for them to also handle
>> this situation using some potentially new primitives (like a size
>> check). Otherwise you end up with issues like the ones that I
>> mentioned above like needing to forge addresses because you don't know
>> what the correct ones are.
>
>
> Thanks for explaining, Jesse!
>
> btw, I don't get it about "to forge addresses", building ICMP message
> with Guest packet doesn't require to forge address when not encapsulating
> ICMP message with outer headers.

Your patch has things like this (for the inner IP header):

+                               new_ip->saddr = orig_ip->daddr;
+                               new_ip->daddr = orig_ip->saddr;

These addresses are owned by the endpoints, not the host generating
generating the ICMP message, so I would consider that to be forging
addresses.

> If the flows aren't doing things to
>>
>> implement routing, then you really have a flat L2 network and you
>> shouldn't be doing this type of behavior at all as I described in the
>> original plan.
>
>
> For flows implementing routing scenario:
> First of all, over-MTU-sized packet could only be detected once the flow
> as been consulted(each port could implement a 'check' hook to do this),
> and just before send to the actual port.
>
> Then pushing the over-MTU-sized packet back to controller, it's the
> controller
> who will will decide whether to build ICMP message, or whatever routing
> behaviour
> it may take. And sent it back with the port information. This ICMP message
> will
> travel back to Guest.
>
> Why does the flow has to use primitive like a "check size"? "check size"
> will only take effect after do_output. I'm not very clear with this
> approach.

Checking the size obviously needs to be an action that would take
place before outputting in order for it to have any effect. Attaching
a check to a port does not fit in very well with the other primitives
of OVS, so I think an action is the obvious place to put it.

> And not all scenario involving flow with routing behaviour, just set up a
> vxlan tunnel, and attach KVM guest or Docker onto it for playing or
> developing.
> This wouldn't necessarily require user to set additional specific flows to
> make
> over-MTU-sized packet pass through the tunnel correctly. In such scenario, I
> think the original patch in this thread to fragment tunnel packet is still
> needed
> OR workout a generic component to build ICMP for all type tunnel in L2
> level.
> Both of those will act as a backup plan as there is no such specific flow as
> default.

In these cases, we should find a way to adjust the MTU, preferably
automatically using virtio.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

FengYu LeiDian Jan. 9, 2015, 5:42 a.m. UTC | #8

于 2015年01月09日 03:55, Jesse Gross 写道:
> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com> wrote:
>> 于 2015年01月08日 04:52, Jesse Gross 写道:
>>>>
>>>> My understanding is:
>>>>> controller sets the forwarding rules into kernel datapath, any flow not
>>>>> matching
>>>>> with the rules are threw to controller by upcall. Once the rule decision
>>>>> is
>>>>> made
>>>>> by controller, then, this flow packet is pushed down to datapath to be
>>>>> forwarded
>>>>> again according to the new rule.
>>>>>
>>>>> So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>>>>> forged ICMP
>>>>> without encapsulation to controller is required by current ovs
>>>>> implementation. By doing
>>>>> so, such over-MTU-sized packet is treated as a event for the controller
>>>>> to
>>>>> be take
>>>>> care of.
>>>
>>> If flows are implementing routing (again, they are doing things like
>>> decrementing the TTL) then it is necessary for them to also handle
>>> this situation using some potentially new primitives (like a size
>>> check). Otherwise you end up with issues like the ones that I
>>> mentioned above like needing to forge addresses because you don't know
>>> what the correct ones are.
>>
>>
>> Thanks for explaining, Jesse!
>>
>> btw, I don't get it about "to forge addresses", building ICMP message
>> with Guest packet doesn't require to forge address when not encapsulating
>> ICMP message with outer headers.
>
> Your patch has things like this (for the inner IP header):
>
> +                               new_ip->saddr = orig_ip->daddr;
> +                               new_ip->daddr = orig_ip->saddr;
>
> These addresses are owned by the endpoints, not the host generating
> generating the ICMP message, so I would consider that to be forging
> addresses.
>
>> If the flows aren't doing things to
>>>
>>> implement routing, then you really have a flat L2 network and you
>>> shouldn't be doing this type of behavior at all as I described in the
>>> original plan.
>>
>>
>> For flows implementing routing scenario:
>> First of all, over-MTU-sized packet could only be detected once the flow
>> as been consulted(each port could implement a 'check' hook to do this),
>> and just before send to the actual port.
>>
>> Then pushing the over-MTU-sized packet back to controller, it's the
>> controller
>> who will will decide whether to build ICMP message, or whatever routing
>> behaviour
>> it may take. And sent it back with the port information. This ICMP message
>> will
>> travel back to Guest.
>>
>> Why does the flow has to use primitive like a "check size"? "check size"
>> will only take effect after do_output. I'm not very clear with this
>> approach.
>
> Checking the size obviously needs to be an action that would take
> place before outputting in order for it to have any effect. Attaching
> a check to a port does not fit in very well with the other primitives
> of OVS, so I think an action is the obvious place to put it.

If flow is defined as:

	CHECK_SIZE -> OUTPUT

Then traversing actions at CHECK_SIZE needs to find the exactly OUTPUT port,
thus get its underlay encapsulation method as well as valid route for physical
NIC MTU, with those information can calculation whether GSOed packets
exceeds physical MTU. This is feasible anyway at the first look. After this,
it's the controller responsibility to handle such event.

If the CHECK_SIZE returns positive(over-MTU-sized packets show up), then call
output_userspace to push this packet upper wards.

I'm not sure this vague idea is the expected behaviour as required by "L3 processing".

>> And not all scenario involving flow with routing behaviour, just set up a
>> vxlan tunnel, and attach KVM guest or Docker onto it for playing or
>> developing.
>> This wouldn't necessarily require user to set additional specific flows to
>> make
>> over-MTU-sized packet pass through the tunnel correctly. In such scenario, I
>> think the original patch in this thread to fragment tunnel packet is still
>> needed
>> OR workout a generic component to build ICMP for all type tunnel in L2
>> level.
>> Both of those will act as a backup plan as there is no such specific flow as
>> default.
>
> In these cases, we should find a way to adjust the MTU, preferably
> automatically using virtio.
>

FengYu LeiDian Jan. 9, 2015, 5:48 a.m. UTC | #9

于 2015年01月09日 03:55, Jesse Gross 写道:
> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du<fengyuleidian0615@gmail.com>  wrote:
>> >于 2015年01月08日 04:52, Jesse Gross 写道:
>>>> >>>
>>>> >>>My understanding is:
>>>>> >>> >controller sets the forwarding rules into kernel datapath, any flow not
>>>>> >>> >matching
>>>>> >>> >with the rules are threw to controller by upcall. Once the rule decision
>>>>> >>> >is
>>>>> >>> >made
>>>>> >>> >by controller, then, this flow packet is pushed down to datapath to be
>>>>> >>> >forwarded
>>>>> >>> >again according to the new rule.
>>>>> >>> >
>>>>> >>> >So I'm not sure whether pushing the over-MTU-sized packet or pushing the
>>>>> >>> >forged ICMP
>>>>> >>> >without encapsulation to controller is required by current ovs
>>>>> >>> >implementation. By doing
>>>>> >>> >so, such over-MTU-sized packet is treated as a event for the controller
>>>>> >>> >to
>>>>> >>> >be take
>>>>> >>> >care of.
>>> >>
>>> >>If flows are implementing routing (again, they are doing things like
>>> >>decrementing the TTL) then it is necessary for them to also handle
>>> >>this situation using some potentially new primitives (like a size
>>> >>check). Otherwise you end up with issues like the ones that I
>>> >>mentioned above like needing to forge addresses because you don't know
>>> >>what the correct ones are.
>> >
>> >
>> >Thanks for explaining, Jesse!
>> >
>> >btw, I don't get it about "to forge addresses", building ICMP message
>> >with Guest packet doesn't require to forge address when not encapsulating
>> >ICMP message with outer headers.
> Your patch has things like this (for the inner IP header):
>
> +                               new_ip->saddr = orig_ip->daddr;
> +                               new_ip->daddr = orig_ip->saddr;
>
> These addresses are owned by the endpoints, not the host generating
> generating the ICMP message, so I would consider that to be forging
> addresses.
>
>> >If the flows aren't doing things to
>>> >>
>>> >>implement routing, then you really have a flat L2 network and you
>>> >>shouldn't be doing this type of behavior at all as I described in the
>>> >>original plan.
>> >
>> >
>> >For flows implementing routing scenario:
>> >First of all, over-MTU-sized packet could only be detected once the flow
>> >as been consulted(each port could implement a 'check' hook to do this),
>> >and just before send to the actual port.
>> >
>> >Then pushing the over-MTU-sized packet back to controller, it's the
>> >controller
>> >who will will decide whether to build ICMP message, or whatever routing
>> >behaviour
>> >it may take. And sent it back with the port information. This ICMP message
>> >will
>> >travel back to Guest.
>> >
>> >Why does the flow has to use primitive like a "check size"? "check size"
>> >will only take effect after do_output. I'm not very clear with this
>> >approach.
> Checking the size obviously needs to be an action that would take
> place before outputting in order for it to have any effect. Attaching
> a check to a port does not fit in very well with the other primitives
> of OVS, so I think an action is the obvious place to put it.
>
>> >And not all scenario involving flow with routing behaviour, just set up a
>> >vxlan tunnel, and attach KVM guest or Docker onto it for playing or
>> >developing.
>> >This wouldn't necessarily require user to set additional specific flows to
>> >make
>> >over-MTU-sized packet pass through the tunnel correctly. In such scenario, I
>> >think the original patch in this thread to fragment tunnel packet is still
>> >needed
>> >OR workout a generic component to build ICMP for all type tunnel in L2
>> >level.
>> >Both of those will act as a backup plan as there is no such specific flow as
>> >default.
> In these cases, we should find a way to adjust the MTU, preferably
> automatically using virtio.

I'm gonna to argue this a bit more here.

virtio_net pose no limit at its simulated net device, actually it can fall into
anywhere between 68 and 65535. Most importantly, virtio_net just simulates NIC,
it just can’t assume/presume there is an encapsulating port at its downstream.
How should virtio automatically adjust its upper guest MTU?

Jesse Gross Jan. 12, 2015, 6:48 p.m. UTC | #10

On Thu, Jan 8, 2015 at 9:42 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2015年01月09日 03:55, Jesse Gross 写道:
>
>> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du <fengyuleidian0615@gmail.com>
>> wrote:
>>>
>>> 于 2015年01月08日 04:52, Jesse Gross 写道:
>>>>>
>>>>>
>>>>> My understanding is:
>>>>>>
>>>>>> controller sets the forwarding rules into kernel datapath, any flow
>>>>>> not
>>>>>> matching
>>>>>> with the rules are threw to controller by upcall. Once the rule
>>>>>> decision
>>>>>> is
>>>>>> made
>>>>>> by controller, then, this flow packet is pushed down to datapath to be
>>>>>> forwarded
>>>>>> again according to the new rule.
>>>>>>
>>>>>> So I'm not sure whether pushing the over-MTU-sized packet or pushing
>>>>>> the
>>>>>> forged ICMP
>>>>>> without encapsulation to controller is required by current ovs
>>>>>> implementation. By doing
>>>>>> so, such over-MTU-sized packet is treated as a event for the
>>>>>> controller
>>>>>> to
>>>>>> be take
>>>>>> care of.
>>>>
>>>>
>>>> If flows are implementing routing (again, they are doing things like
>>>> decrementing the TTL) then it is necessary for them to also handle
>>>> this situation using some potentially new primitives (like a size
>>>> check). Otherwise you end up with issues like the ones that I
>>>> mentioned above like needing to forge addresses because you don't know
>>>> what the correct ones are.
>>>
>>>
>>>
>>> Thanks for explaining, Jesse!
>>>
>>> btw, I don't get it about "to forge addresses", building ICMP message
>>> with Guest packet doesn't require to forge address when not encapsulating
>>> ICMP message with outer headers.
>>
>>
>> Your patch has things like this (for the inner IP header):
>>
>> +                               new_ip->saddr = orig_ip->daddr;
>> +                               new_ip->daddr = orig_ip->saddr;
>>
>> These addresses are owned by the endpoints, not the host generating
>> generating the ICMP message, so I would consider that to be forging
>> addresses.
>>
>>> If the flows aren't doing things to
>>>>
>>>>
>>>> implement routing, then you really have a flat L2 network and you
>>>> shouldn't be doing this type of behavior at all as I described in the
>>>> original plan.
>>>
>>>
>>>
>>> For flows implementing routing scenario:
>>> First of all, over-MTU-sized packet could only be detected once the flow
>>> as been consulted(each port could implement a 'check' hook to do this),
>>> and just before send to the actual port.
>>>
>>> Then pushing the over-MTU-sized packet back to controller, it's the
>>> controller
>>> who will will decide whether to build ICMP message, or whatever routing
>>> behaviour
>>> it may take. And sent it back with the port information. This ICMP
>>> message
>>> will
>>> travel back to Guest.
>>>
>>> Why does the flow has to use primitive like a "check size"? "check size"
>>> will only take effect after do_output. I'm not very clear with this
>>> approach.
>>
>>
>> Checking the size obviously needs to be an action that would take
>> place before outputting in order for it to have any effect. Attaching
>> a check to a port does not fit in very well with the other primitives
>> of OVS, so I think an action is the obvious place to put it.
>
>
> If flow is defined as:
>
>         CHECK_SIZE -> OUTPUT
>
> Then traversing actions at CHECK_SIZE needs to find the exactly OUTPUT port,
> thus get its underlay encapsulation method as well as valid route for
> physical
> NIC MTU, with those information can calculation whether GSOed packets
> exceeds physical MTU. This is feasible anyway at the first look. After this,
> it's the controller responsibility to handle such event.
>
> If the CHECK_SIZE returns positive(over-MTU-sized packets show up), then
> call
> output_userspace to push this packet upper wards.
>
> I'm not sure this vague idea is the expected behaviour as required by "L3
> processing".

Yes, I think that's the right path.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Jesse Gross Jan. 12, 2015, 6:55 p.m. UTC | #11

On Thu, Jan 8, 2015 at 9:48 PM, Fan Du <fengyuleidian0615@gmail.com> wrote:
> 于 2015年01月09日 03:55, Jesse Gross 写道:
>>
>> On Thu, Jan 8, 2015 at 1:39 AM, Fan Du<fengyuleidian0615@gmail.com>
>> wrote:
>>
>>> >于 2015年01月08日 04:52, Jesse Gross 写道:
>>>>>
>>>>> >>>
>>>>> >>>My understanding is:
>>>>>>
>>>>>> >>> >controller sets the forwarding rules into kernel datapath, any
>>>>>> >>> > flow not
>>>>>> >>> >matching
>>>>>> >>> >with the rules are threw to controller by upcall. Once the rule
>>>>>> >>> > decision
>>>>>> >>> >is
>>>>>> >>> >made
>>>>>> >>> >by controller, then, this flow packet is pushed down to datapath
>>>>>> >>> > to be
>>>>>> >>> >forwarded
>>>>>> >>> >again according to the new rule.
>>>>>> >>> >
>>>>>> >>> >So I'm not sure whether pushing the over-MTU-sized packet or
>>>>>> >>> > pushing the
>>>>>> >>> >forged ICMP
>>>>>> >>> >without encapsulation to controller is required by current ovs
>>>>>> >>> >implementation. By doing
>>>>>> >>> >so, such over-MTU-sized packet is treated as a event for the
>>>>>> >>> > controller
>>>>>> >>> >to
>>>>>> >>> >be take
>>>>>> >>> >care of.
>>>>
>>>> >>
>>>> >>If flows are implementing routing (again, they are doing things like
>>>> >>decrementing the TTL) then it is necessary for them to also handle
>>>> >>this situation using some potentially new primitives (like a size
>>>> >>check). Otherwise you end up with issues like the ones that I
>>>> >>mentioned above like needing to forge addresses because you don't know
>>>> >>what the correct ones are.
>>>
>>> >
>>> >
>>> >Thanks for explaining, Jesse!
>>> >
>>> >btw, I don't get it about "to forge addresses", building ICMP message
>>> >with Guest packet doesn't require to forge address when not
>>> > encapsulating
>>> >ICMP message with outer headers.
>>
>> Your patch has things like this (for the inner IP header):
>>
>> +                               new_ip->saddr = orig_ip->daddr;
>> +                               new_ip->daddr = orig_ip->saddr;
>>
>> These addresses are owned by the endpoints, not the host generating
>> generating the ICMP message, so I would consider that to be forging
>> addresses.
>>
>>> >If the flows aren't doing things to
>>>>
>>>> >>
>>>> >>implement routing, then you really have a flat L2 network and you
>>>> >>shouldn't be doing this type of behavior at all as I described in the
>>>> >>original plan.
>>>
>>> >
>>> >
>>> >For flows implementing routing scenario:
>>> >First of all, over-MTU-sized packet could only be detected once the flow
>>> >as been consulted(each port could implement a 'check' hook to do this),
>>> >and just before send to the actual port.
>>> >
>>> >Then pushing the over-MTU-sized packet back to controller, it's the
>>> >controller
>>> >who will will decide whether to build ICMP message, or whatever routing
>>> >behaviour
>>> >it may take. And sent it back with the port information. This ICMP
>>> > message
>>> >will
>>> >travel back to Guest.
>>> >
>>> >Why does the flow has to use primitive like a "check size"? "check size"
>>> >will only take effect after do_output. I'm not very clear with this
>>> >approach.
>>
>> Checking the size obviously needs to be an action that would take
>> place before outputting in order for it to have any effect. Attaching
>> a check to a port does not fit in very well with the other primitives
>> of OVS, so I think an action is the obvious place to put it.
>>
>>> >And not all scenario involving flow with routing behaviour, just set up
>>> > a
>>> >vxlan tunnel, and attach KVM guest or Docker onto it for playing or
>>> >developing.
>>> >This wouldn't necessarily require user to set additional specific flows
>>> > to
>>> >make
>>> >over-MTU-sized packet pass through the tunnel correctly. In such
>>> > scenario, I
>>> >think the original patch in this thread to fragment tunnel packet is
>>> > still
>>> >needed
>>> >OR workout a generic component to build ICMP for all type tunnel in L2
>>> >level.
>>> >Both of those will act as a backup plan as there is no such specific
>>> > flow as
>>> >default.
>>
>> In these cases, we should find a way to adjust the MTU, preferably
>> automatically using virtio.
>
>
> I'm gonna to argue this a bit more here.
>
> virtio_net pose no limit at its simulated net device, actually it can fall
> into
> anywhere between 68 and 65535. Most importantly, virtio_net just simulates
> NIC,
> it just can’t assume/presume there is an encapsulating port at its
> downstream.
> How should virtio automatically adjust its upper guest MTU?

There are at least two parts to this:
 * Calculating the right MTU for the guest device.
 * Transferring the MTU from the host to the guest.

The first would presumably involve exposing some kind of API that the
component that does know the right value could program. In this case,
that component could be OVS using the same type of information that
you just described in the earlier post about L3. The API could simply
to just set the MTU of the device in the host and this gets mirrored
to the guest.

The second part I guess is probably a fairly straightforward extension
to virtio but I don't know the details.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thomas Graf Jan. 13, 2015, 4:58 p.m. UTC | #12

On 01/12/15 at 10:55am, Jesse Gross wrote:
> There are at least two parts to this:
>  * Calculating the right MTU for the guest device.
>  * Transferring the MTU from the host to the guest.
> 
> The first would presumably involve exposing some kind of API that the
> component that does know the right value could program. In this case,
> that component could be OVS using the same type of information that
> you just described in the earlier post about L3. The API could simply
> to just set the MTU of the device in the host and this gets mirrored
> to the guest.
> 
> The second part I guess is probably a fairly straightforward extension
> to virtio but I don't know the details.

Francesco Fusco wrote code to do exactly this. Maybe he still has
it somewhere.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[net] gso: do GSO for local skb with size bigger than MTU

Commit Message

Comments

Patch