[ovs-dev,v9,7/7] userspace: add non-tap (l3) support to GRE vports

Message ID 1453270506-10643-8-git-send-email-simon.horman@netronome.com
State Deferred

Commit Message

Simon Horman Jan. 20, 2016, 6:15 a.m. UTC
Add support for layer 3 GRE vports (non-tap aka non-VTEP).

This makes use of a separate vport type for GRE, rather than a new mode for
the existing (tap/VTEP) GRE vports, as this fits more naturally with the
kernel implementation of GRE and thus with the implementation of this
feature there.

In order to differentiate packets for two different types of GRE vports a
new flow key attribute, OVS_KEY_ATTR_NEXT_BASE_LAYER, is used.  It is
intended that this attribute is only used in userspace as there appears to
be no need for it to be used in the kernel datapath.

It is envisaged that this attribute may be used for other non-UDP
encapsulation protocols that support both layer3 and layer2 inner-packets.
For UDP encapsulation, the UDP port can be used for differentiation
without the need for this new attribute.

One alternative approach to this new attribute, which I have not
investigated in detail, would be to use a second classifier in tnl-ports.c
for non-UDP layer3 tunnels; leaving the existing classifier for all other
tunnels.

Signed-off-by: Simon Horman <simon.horman@netronome.com>

---
v9
* New patch
---
 datapath-windows/ovsext/Vport.c                   |  4 ++
 datapath/linux/compat/include/linux/openvswitch.h |  4 +-
 lib/dpif-netlink.c                                |  5 ++
 lib/flow.c                                        | 18 +++++
 lib/flow.h                                        |  4 ++
 lib/match.c                                       |  4 ++
 lib/netdev-linux.c                                |  3 +-
 lib/netdev-vport.c                                | 88 +++++++++++++++++++----
 lib/netdev-vport.h                                |  1 +
 lib/odp-execute.c                                 |  2 +
 lib/odp-util.c                                    | 21 ++++++
 lib/tnl-ports.c                                   | 74 +++++++++++++------
 lib/tnl-ports.h                                   |  4 +-
 ofproto/ofproto-dpif-ipfix.c                      |  2 +-
 ofproto/ofproto-dpif-sflow.c                      |  3 +-
 ofproto/tunnel.c                                  |  7 +-
 tests/tunnel-push-pop-ipv6.at                     | 20 +++---
 tests/tunnel-push-pop.at                          | 33 +++++++--
 18 files changed, 240 insertions(+), 57 deletions(-)

Comments

Jesse Gross Jan. 27, 2016, 6:09 a.m. UTC | #1
On Tue, Jan 19, 2016 at 10:15 PM, Simon Horman
<simon.horman@netronome.com> wrote:
> Add support for layer 3 GRE vports (non-tap aka non-VTEP).
>
> This makes use of a separate vport type for GRE, rather than a new mode for
> the existing (tap/VTEP) GRE vports as this fits more naturally with the
> kernel where implementation of GRE and thus implementation of this feature
> there.
>
> In order to differentiate packets for two different types of GRE vports a
> new flow key attribute, OVS_KEY_ATTR_NEXT_BASE_LAYER, is used.  It is
> intended that this attribute is only used in userspace as there appears to
> be no need for it to be used in the kernel datapath.
>
> It is envisaged that this attribute may be used for other non-UDP
> encapsulation protocols that support both layer3 and layer2 inner-packets.
> While for UDP encapsulation the UDP port can be used for differentiation
> without the need for this new attribute.
>
> One alternative approach to this new attribute, which I have not
> investigated in detail, would be to use a second classifier in tnl-ports.c
> for non-UDP layer3 tunnels; leaving the existing classifier for all other
> tunnels.
>
> Signed-off-by: Simon Horman <simon.horman@netronome.com>

I think it's likely that we'll have a variety of tunnels in the near
future that can natively support multiple inner frame formats without
changing the encapsulation type. This is somewhat different from what
we have now where although different tunnels support various frame
formats, there is primarily a 1:1 mapping of encapsulation:inner frame
(such as with VXLAN, LISP, and STT). Protocols that fall into this
category include GRE (Ethernet/IPv4/IPv6/MPLS), Geneve, and VXLAN-GPE
(NSH).

It seems like it would be ideal if we can avoid creating a new port
type for each of these possible combinations and somehow just make the
flow keys look right. After all, in the case of GRE, we already
support all of the inner protocols that I mentioned above through the
main flow lookup so it would be cool if after decapsulation the
appropriate packet came out with the addition of GRE metadata.

One other comment - in the case of UDP based protocols I don't think
that it is likely that different UDP ports will be used to indicate
different inner protocols. This shouldn't be necessary for protocols
that have a next protocol field and I don't think people would be
excited about changing the port in that case.
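
For GRE, the next protocol field referred to above is the 16-bit protocol
type in the base header. As a minimal sketch (not the code from this patch),
the Ethernet versus non-Ethernet split could be read from that field as
follows. The names base_layer, GRE_PROTO_TEB and gre_next_base_layer are
hypothetical and chosen for illustration only; 0x6558 is the Transparent
Ethernet Bridging ethertype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the identifiers used by OVS. */
enum base_layer { BASE_LAYER_2 = 0, BASE_LAYER_3 = 1 };

#define GRE_PROTO_TEB 0x6558    /* Transparent Ethernet Bridging. */
#define GRE_BASE_HDR_LEN 4      /* 2 bytes flags/version + 2 bytes protocol. */

/* Returns BASE_LAYER_2 if the GRE payload is an Ethernet frame,
 * BASE_LAYER_3 for any other protocol type, or -1 if 'len' is too short
 * to contain a GRE base header. */
static int
gre_next_base_layer(const uint8_t *hdr, size_t len)
{
    uint16_t protocol;

    assert(hdr != NULL);
    if (len < GRE_BASE_HDR_LEN) {
        return -1;
    }

    /* The protocol type is carried in network byte order. */
    protocol = (uint16_t) ((hdr[2] << 8) | hdr[3]);
    return protocol == GRE_PROTO_TEB ? BASE_LAYER_2 : BASE_LAYER_3;
}
```

Anything that is not Transparent Ethernet Bridging is treated as "layer 3"
here purely for brevity; as discussed in this thread, payloads such as MPLS
are not strictly L3.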
Simon Horman Feb. 1, 2016, 4:05 p.m. UTC | #2
On Tue, Jan 26, 2016 at 10:09:28PM -0800, Jesse Gross wrote:
> On Tue, Jan 19, 2016 at 10:15 PM, Simon Horman
> <simon.horman@netronome.com> wrote:
> > Add support for layer 3 GRE vports (non-tap aka non-VTEP).
> >
> > This makes use of a separate vport type for GRE, rather than a new mode for
> > the existing (tap/VTEP) GRE vports as this fits more naturally with the
> > kernel where implementation of GRE and thus implementation of this feature
> > there.
> >
> > In order to differentiate packets for two different types of GRE vports a
> > new flow key attribute, OVS_KEY_ATTR_NEXT_BASE_LAYER, is used.  It is
> > intended that this attribute is only used in userspace as there appears to
> > be no need for it to be used in the kernel datapath.
> >
> > It is envisaged that this attribute may be used for other non-UDP
> > encapsulation protocols that support both layer3 and layer2 inner-packets.
> > While for UDP encapsulation the UDP port can be used for differentiation
> > without the need for this new attribute.
> >
> > One alternative approach to this new attribute, which I have not
> > investigated in detail, would be to use a second classifier in tnl-ports.c
> > for non-UDP layer3 tunnels; leaving the existing classifier for all other
> > tunnels.
> >
> > Signed-off-by: Simon Horman <simon.horman@netronome.com>
> 
> I think it's likely that we'll have a variety of tunnels in the near
> future that can natively support multiple inner frame formats without
> changing the encapsulation type. This is somewhat different from what
> we have now where although different tunnels support various frame
> formats, there is primarily a 1:1 mapping of encapsulation:inner frame
> (such as with VXLAN, LISP, and STT). Protocols that fall into this
> category include GRE (Ethernet/IPv4/IPv6/MPLS), Geneve, and VXLAN-GPE
> (NSH).
> 
> It seems like it would be ideal if we can avoid creating a new port
> type for each of these possible combinations and somehow just make the
> flow keys look right. After all, in the case of GRE, we already
> support all of the inner protocols that I mentioned about through the
> main flow lookup so it would be cool if after decapsulation the
> appropriate packet came out with the addition of GRE metadata.

I think that sounds reasonable. But I am wondering if you have any thoughts
on how this might be implemented.

The one approach that I have considered so far would be to rework the
kernel's GRE code to allow it to provide a netdev that can handle both tap
and non-tap (l2 and l3 payloads). I can investigate this more closely if
you think it is an approach worth pursuing.

> One other comment - in the case of UDP based protocols I don't think
> that it is likely that different UDP ports will be used to indicated
> different inner protocols. This shouldn't be necessary for protocols
> that have a next protocol field and I don't think the people would be
> excited about changing the port in that case.

Understood.
Jesse Gross Feb. 3, 2016, 1:15 a.m. UTC | #3
On Mon, Feb 1, 2016 at 8:05 AM, Simon Horman <simon.horman@netronome.com> wrote:
> On Tue, Jan 26, 2016 at 10:09:28PM -0800, Jesse Gross wrote:
>> On Tue, Jan 19, 2016 at 10:15 PM, Simon Horman
>> <simon.horman@netronome.com> wrote:
>> > Add support for layer 3 GRE vports (non-tap aka non-VTEP).
>> >
>> > This makes use of a separate vport type for GRE, rather than a new mode for
>> > the existing (tap/VTEP) GRE vports as this fits more naturally with the
>> > kernel where implementation of GRE and thus implementation of this feature
>> > there.
>> >
>> > In order to differentiate packets for two different types of GRE vports a
>> > new flow key attribute, OVS_KEY_ATTR_NEXT_BASE_LAYER, is used.  It is
>> > intended that this attribute is only used in userspace as there appears to
>> > be no need for it to be used in the kernel datapath.
>> >
>> > It is envisaged that this attribute may be used for other non-UDP
>> > encapsulation protocols that support both layer3 and layer2 inner-packets.
>> > While for UDP encapsulation the UDP port can be used for differentiation
>> > without the need for this new attribute.
>> >
>> > One alternative approach to this new attribute, which I have not
>> > investigated in detail, would be to use a second classifier in tnl-ports.c
>> > for non-UDP layer3 tunnels; leaving the existing classifier for all other
>> > tunnels.
>> >
>> > Signed-off-by: Simon Horman <simon.horman@netronome.com>
>>
>> I think it's likely that we'll have a variety of tunnels in the near
>> future that can natively support multiple inner frame formats without
>> changing the encapsulation type. This is somewhat different from what
>> we have now where although different tunnels support various frame
>> formats, there is primarily a 1:1 mapping of encapsulation:inner frame
>> (such as with VXLAN, LISP, and STT). Protocols that fall into this
>> category include GRE (Ethernet/IPv4/IPv6/MPLS), Geneve, and VXLAN-GPE
>> (NSH).
>>
>> It seems like it would be ideal if we can avoid creating a new port
>> type for each of these possible combinations and somehow just make the
>> flow keys look right. After all, in the case of GRE, we already
>> support all of the inner protocols that I mentioned about through the
>> main flow lookup so it would be cool if after decapsulation the
>> appropriate packet came out with the addition of GRE metadata.
>
> I think that sounds reasonable. But I am wondering if you have any thoughts
> on how this might be implemented.
>
> The once approach that I have considered so far would be to rework of
> kernel's GRE code to allow it to provide a netdev that can handle both tap
> and non-tap (l2 and l3 payloads). I can investigate this more closely if
> you think it is an approach worth pursuing.

I think this sounds like a good idea if we can find a way to do it
cleanly. From OVS's perspective, the main thing that we need is a way
to indicate the first header that we expect to see. We used to have
this in struct tnl_ptk_info proto but that is no longer exposed to
OVS. We also want to make sure that a device that is configured in
this mode behaves in a logical way when not connected to OVS - i.e. it
knows to emit ARP for L2 ports but not for L3. I suppose now that
lightweight tunneling is here both interfaces are common and therefore
the problem is the same in each case, which is a good thing.

Jiri (cc'ed) is working on GPE and NSH support to VXLAN at the moment.
I think this is very closely related and complementary as it also
depends on sending non-Ethernet frames to OVS. He might have some
ideas on how to handle this.
Thomas Morin Feb. 3, 2016, 4:41 p.m. UTC | #4
Hi Jesse, Simon,

It is great to see this work moving forward!

There is one cosmetic thing that we may want to address, which is how to
name these ports.
- I believe that it is better to avoid the l3 port naming, since these ports
will apply to all protocols that can be designated with an ethertype,
and some of these are not L3 (MPLS is a typical example).
- "non-tap" is a possible name, but not everyone knows that tap means
Ethernet
- "non-ethernet" would actually describe pretty accurately what the port does

Anyhow, "non-tap" is better than l3 and we can live with this name.

2016-01-27 07:09, Jesse Gross:
> It seems like it would be ideal if we can avoid creating a new port
> type for each of these possible combinations and somehow just make the
> flow keys look right. After all, in the case of GRE, we already
> support all of the inner protocols that I mentioned about through the
> main flow lookup so it would be cool if after decapsulation the
> appropriate packet came out with the addition of GRE metadata.

Choosing between the two behaviors with an option rather than with
different types is also the choice I made in the patch that I tried to
cook some time ago, based on Lorand Jakab's l3port patch. I agree
with the rationale above that it will help avoid code duplication to
support the same thing for Geneve, GPE etc.

One thing though: the piece of code mapping incoming GRE traffic to
tunnel ports (when more than one is defined) will need to know whether the
tunnel ports are tap or non-tap to do the right thing. I don't know if that
will be equally easy if this information is stored as an option or using
additional tunnel types.

Best,

-Thomas
Jiri Benc Feb. 16, 2016, 10:15 p.m. UTC | #5
Sorry for the late answer, was busy with a conference and internal
meetings in the past two weeks.

On Tue, 2 Feb 2016 17:15:15 -0800, Jesse Gross wrote:
> I think this sounds like a good idea if we can find a way to do it
> cleanly. From OVS's perspective, the main thing that we need is a way
> to indicate the first header that we expect to see. We used to have
> this in struct tnl_ptk_info proto but that is no longer exposed to
> OVS. We also want to make sure that a device that is configured in
> this mode behaves in a logical way when not connected to OVS - i.e. it
> knows whether to emit ARP for L2 ports but not L3. I suppose now that
> lightweight tunneling is here both interfaces are common and therefore
> the problem is the same in each case, which is a good thing.

There's only one way to solve this cleanly in the kernel. The L2 vs. L3
mode has to be selected while creating the tunnel interface and cannot
be changed afterwards (only by deleting and recreating the interface).
The reason is that the L3 interface needs to be of ARPHRD_NONE type
instead of ARPHRD_ETHER. With additional flags set by the kernel
(IFF_NOARP in particular), this works as expected out of ovs, e.g. for
route based encapsulation.

It's not possible to mix those two in a single interface. E.g. for
VXLAN-GPE, the Ethernet header is either encapsulated or not for a
given interface (and thus for a given vport), and never both. If we did
that, such an interface wouldn't work standalone, outside of ovs.

I don't think it's a problem. The information whether Ethernet header
info is provided in the flow key or not can be directly deduced from the
net_device type. It's quite generic this way: if it's ARPHRD_ETHER,
there's Ethernet header, if it's ARPHRD_NONE, no L2 header is
available. In the future, it's easy to add different L2 transports if
desired in the same way.
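
A minimal sketch of that deduction follows. The DEV_TYPE_* constants
mirror the values of Linux's ARPHRD_ETHER and ARPHRD_NONE but are redefined
under hypothetical names so that the example is self-contained; the enum
and both function names are also assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Same values as ARPHRD_ETHER and ARPHRD_NONE in <linux/if_arp.h>,
 * redefined under hypothetical names to keep the sketch standalone. */
#define DEV_TYPE_ETHER 1
#define DEV_TYPE_NONE  0xFFFE

enum l2_presence { L2_ABSENT, L2_ETHERNET, L2_UNSUPPORTED };

/* Deduces from the device type whether packets on the port carry an
 * Ethernet header. */
static enum l2_presence
l2_from_dev_type(unsigned int dev_type)
{
    switch (dev_type) {
    case DEV_TYPE_ETHER:
        return L2_ETHERNET;
    case DEV_TYPE_NONE:
        return L2_ABSENT;
    default:
        return L2_UNSUPPORTED;
    }
}

/* Returns true if an Ethernet header has to be pushed when forwarding
 * from a port of kind 'in' to a port of kind 'out'.  Callers are expected
 * to have rejected unsupported device types already. */
static bool
needs_push_eth(enum l2_presence in, enum l2_presence out)
{
    assert(in != L2_UNSUPPORTED && out != L2_UNSUPPORTED);
    return in == L2_ABSENT && out == L2_ETHERNET;
}
```

Adding another L2 transport later would amount to one more case in the
switch, which is the sense in which the scheme is generic.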

The user has to request the L2 or L3 mode when creating the VXLAN-GPE
interface. This will be the same for L3 Geneve, and likely for GRE, too
(I'll have to check the current implementation of that one). So yes,
we'll need a way to distinguish this when creating the vport. Either a
new vport type, or an L3 flag. It makes sense, actually: the vports are
very different, e.g. different flow rules are needed for L2 and L3
tunnels (for L3, at least push_eth will need to be configured when
switching to Ethernet).

Of course, currently, the kernel datapath allows only ARPHRD_ETHER
interfaces to be added to the ovs bridge and this will need to be
changed.

> Jiri (cc'ed) is working on GPE and NSH support to VXLAN at the moment.
> I think this is very closely related and complementary as it also
> depends on sending non-Ethernet frames to OVS. He might have some
> ideas on how to handle this.

 Jiri
Simon Horman Feb. 19, 2016, 8:09 a.m. UTC | #6
On Tue, Feb 16, 2016 at 11:15:20PM +0100, Jiri Benc wrote:
> Sorry for the late answer, was busy with a conference and internal
> meetings in the past two weeks.
> 
> On Tue, 2 Feb 2016 17:15:15 -0800, Jesse Gross wrote:
> > I think this sounds like a good idea if we can find a way to do it
> > cleanly. From OVS's perspective, the main thing that we need is a way
> > to indicate the first header that we expect to see. We used to have
> > this in struct tnl_ptk_info proto but that is no longer exposed to
> > OVS. We also want to make sure that a device that is configured in
> > this mode behaves in a logical way when not connected to OVS - i.e. it
> > knows whether to emit ARP for L2 ports but not L3. I suppose now that
> > lightweight tunneling is here both interfaces are common and therefore
> > the problem is the same in each case, which is a good thing.
> 
> There's only one way to solve this cleanly in the kernel. The L2 vs. L3
> mode has to be selected while creating the tunnel interface and cannot
> be changed afterwards (only by deleting and recreating the interface).
> The reason is that the L3 interface needs to be of ARHRD_NONE type
> instead of ARPHRD_ETHER. With additional flags set by the kernel
> (IFF_NOARP in particular), this works as expected out of ovs, e.g. for
> route based encapsulation.
> 
> It's not possible to mix those two in a single interface. E.g. for
> VXLAN-GPE, it's either Ethernet header is encapsulated or not for a
> given interface (and thus for a given vport), and never both. If we did
> that, such interface wouldn't work standalone, outside of ovs.
> 
> I don't think it's a problem. The information whether Ethernet header
> info is provided in the flow key or not can be directly deduced from the
> net_device type. It's quite generic this way: if it's ARPHRD_ETHER,
> there's Ethernet header, if it's ARPHRD_NONE, no L2 header is
> available. In the future, it's easy to add different L2 transports if
> desired in the same way.
> 
> The user has to request the L2 or L3 mode when creating the VXLAN-GPE
> interface. This will be the same for L3 Geneve, and likely for GRE, too
> (I'll have to check the current implementation of that one). So yes,
> we'll need a way to distinguish this when creating the vport. Either a
> new vport type, or an L3 flag. It makes sense, actually: the vports are
> very different, e.g. different flow rules are needed for L2 and L3
> tunnels (for L3, push_eth when switching to Ethernet will be needed to
> be configured at least).
> 
> Of course, currently, the kernel datapath allows only ARPHRD_ETHER
> interfaces to be added to the ovs bridge and this will need to be
> changed.

I was hoping that my idea would work; thanks for saving me the effort of
implementing it only to find out that it has the problem you describe.

I think that a mode switch is possible, and perhaps it is the best way
forwards. But I was hoping to arrange things so that L3 and L2 GRE
vports could be used simultaneously, which is why I went for different
vport types: unfortunately this complicated the user-space code.

At this point I see three options. Jesse, do you have a preference?

1. Use a vport mode for L3 GRE as Jiri suggests.
   This seems like it may lead to the cleanest implementation.
   We could later move away from this approach if there is a need
   for L3 and L2 GRE to co-exist.

2. Add a vport type for L3 GRE as this patch does (or otherwise).
   This seems a bit more complex than 1, with the caveat that I'm
   yet to implement 1. But it's also more flexible as it allows
   L3 and L2 GRE to co-exist.

3. Ok, I have no 3rd option other than "something else".

> > Jiri (cc'ed) is working on GPE and NSH support to VXLAN at the moment.
> > I think this is very closely related and complementary as it also
> > depends on sending non-Ethernet frames to OVS. He might have some
> > ideas on how to handle this.
> 
>  Jiri
> 
> -- 
> Jiri Benc
Jesse Gross Feb. 26, 2016, 1:32 a.m. UTC | #7
On Tue, Feb 16, 2016 at 2:15 PM, Jiri Benc <jbenc@redhat.com> wrote:
> Sorry for the late answer, was busy with a conference and internal
> meetings in the past two weeks.
>
> On Tue, 2 Feb 2016 17:15:15 -0800, Jesse Gross wrote:
>> I think this sounds like a good idea if we can find a way to do it
>> cleanly. From OVS's perspective, the main thing that we need is a way
>> to indicate the first header that we expect to see. We used to have
>> this in struct tnl_ptk_info proto but that is no longer exposed to
>> OVS. We also want to make sure that a device that is configured in
>> this mode behaves in a logical way when not connected to OVS - i.e. it
>> knows whether to emit ARP for L2 ports but not L3. I suppose now that
>> lightweight tunneling is here both interfaces are common and therefore
>> the problem is the same in each case, which is a good thing.
>
> There's only one way to solve this cleanly in the kernel. The L2 vs. L3
> mode has to be selected while creating the tunnel interface and cannot
> be changed afterwards (only by deleting and recreating the interface).
> The reason is that the L3 interface needs to be of ARHRD_NONE type
> instead of ARPHRD_ETHER. With additional flags set by the kernel
> (IFF_NOARP in particular), this works as expected out of ovs, e.g. for
> route based encapsulation.
>
> It's not possible to mix those two in a single interface. E.g. for
> VXLAN-GPE, it's either Ethernet header is encapsulated or not for a
> given interface (and thus for a given vport), and never both. If we did
> that, such interface wouldn't work standalone, outside of ovs.
>
> I don't think it's a problem. The information whether Ethernet header
> info is provided in the flow key or not can be directly deduced from the
> net_device type. It's quite generic this way: if it's ARPHRD_ETHER,
> there's Ethernet header, if it's ARPHRD_NONE, no L2 header is
> available. In the future, it's easy to add different L2 transports if
> desired in the same way.
>
> The user has to request the L2 or L3 mode when creating the VXLAN-GPE
> interface. This will be the same for L3 Geneve, and likely for GRE, too
> (I'll have to check the current implementation of that one). So yes,
> we'll need a way to distinguish this when creating the vport. Either a
> new vport type, or an L3 flag. It makes sense, actually: the vports are
> very different, e.g. different flow rules are needed for L2 and L3
> tunnels (for L3, push_eth when switching to Ethernet will be needed to
> be configured at least).
>
> Of course, currently, the kernel datapath allows only ARPHRD_ETHER
> interfaces to be added to the ovs bridge and this will need to be
> changed.

The thing that bothers me about this is that it's not really a binary
split between Ethernet and IP. L2 "obviously" means Ethernet (although
theoretically there could be other things), but L3 means IPv4, IPv6,
MPLS, NSH, etc.? That is weird.

From a non-OVS kernel perspective, setting L2 basically means that you
want to do ARP on the interface and L3 is no ARP. However, if the
tunnel is sending and receiving some additional tag then it's not
really going to work directly in L3 mode.

In the OVS case, you will definitely need to have different flows
depending on the inner packet type. However, the most flexible way to
handle this is by exposing it through the flow itself - it's not
necessarily best to do this through the port, especially if it is not
fully specific.

Finally, I agree that it would be nice to be able to handle multiple
types of encapsulated data at the same time. It seems difficult, or at
least more complex, to achieve this with multiple ports. Each would
try to register to the same handler and this would fail unless there
is some kind of additional registration mechanism.
Jiri Benc Feb. 26, 2016, 6:13 a.m. UTC | #8
On Thu, 25 Feb 2016 17:32:28 -0800, Jesse Gross wrote:
> The thing that bothers me about this is that it's not really a binary
> split between Ethernet and IP. L2 "obviously" means Ethernet (although
> theoretically there could be other things), but L3 means IPv4, IPv6,
> MPLS, NSH, etc.? That is weird.

Depends on the point of view. It's a packet without (any) L2 header.
You're right that MPLS or NSH are not understood as L3, however they're
not L2, either. So let's call this mode "L2-less" or so?

> From a non-OVS kernel perspective, setting L2 basically means that you
> want to do ARP on the interface and L3 is no ARP. However, if the
> tunnel is sending and receiving some additional tag then it's not
> really going to work directly in L3 mode.

It's still L2-less. Meaning no neighbor resolution. There's no
difference between MPLS, NSH and IP wrt. how the packet is sent.

> In the OVS case, you will definitely need to have different flows
> depending on the inner packet type. However, the most flexible way to
> handle this is by exposing it through the flow itself - it's not
> necessarily best to do this through the port, especially if it is not
> fully specific.

I'm not sure I follow. But although you don't get an Ethernet header
from such a port, you do get the ethertype of the packet (in the kernel,
it's skb->protocol). It should be quite straightforward to construct
a flow from it.
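
As a rough sketch of what that could look like, assuming a much simplified
flow key: struct sketch_flow_key and flow_key_from_l2less are hypothetical
names, not existing OVS or kernel structures, and the ethertype is taken in
host byte order for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified flow key for illustration. */
struct sketch_flow_key {
    bool has_eth;           /* False for packets from an L2-less port. */
    uint8_t eth_src[6];     /* Left zeroed when has_eth is false. */
    uint8_t eth_dst[6];     /* Left zeroed when has_eth is false. */
    uint16_t eth_type;      /* Ethertype, host byte order. */
};

/* Fills 'key' for a packet received without any L2 header.  'ethertype'
 * plays the role of skb->protocol, converted to host byte order. */
static void
flow_key_from_l2less(uint16_t ethertype, struct sketch_flow_key *key)
{
    assert(key != NULL);
    memset(key, 0, sizeof *key);
    key->has_eth = false;
    key->eth_type = ethertype;
}
```

The point of the sketch is only that the ethertype is enough to tell which
header follows (0x0800 for IPv4, 0x86DD for IPv6, 0x8847 for MPLS unicast),
so the rest of flow extraction can continue from there without Ethernet
addresses.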

> Finally, I agree that it would be nice to be able to handle multiple
> types of encapsulated data at the same time. It seems difficult, or at
> least more complex, to achieve this with multiple ports. Each would
> try to register to the same handler and this would fail unless there
> is some kind of additional registration mechanism.

At least in the kernel, I don't see a way to do it cleanly. The
net_device has to be usable outside of ovs, this includes things like
tcpdump working correctly.

 Jiri
Jesse Gross Feb. 26, 2016, 5:03 p.m. UTC | #9
On Thu, Feb 25, 2016 at 10:13 PM, Jiri Benc <jbenc@redhat.com> wrote:
> On Thu, 25 Feb 2016 17:32:28 -0800, Jesse Gross wrote:
>> The thing that bothers me about this is that it's not really a binary
>> split between Ethernet and IP. L2 "obviously" means Ethernet (although
>> theoretically there could be other things), but L3 means IPv4, IPv6,
>> MPLS, NSH, etc.? That is weird.
>
> Depends on the point of view. It's a packet without (any) L2 header.
> You're right that MPLS or NSH are not understood as L3, however they're
> not L2, either. So let's call this mode "L2-less" or so?
>
>> From a non-OVS kernel perspective, setting L2 basically means that you
>> want to do ARP on the interface and L3 is no ARP. However, if the
>> tunnel is sending and receiving some additional tag then it's not
>> really going to work directly in L3 mode.
>
> It's still L2-less. Meaning no neighbor resolution. There's no
> difference between MPLS, NSH and IP wrt. how the packet is sent.
>
>> In the OVS case, you will definitely need to have different flows
>> depending on the inner packet type. However, the most flexible way to
>> handle this is by exposing it through the flow itself - it's not
>> necessarily best to do this through the port, especially if it is not
>> fully specific.
>
> I'm not sure I follow. But although you don't get an Ethernet header
> from such port, you do get an ethertype of the packet (in the kernel,
> it's skb->protocol). It should be quite straightforward to construct
> flow from it.
>
>> Finally, I agree that it would be nice to be able to handle multiple
>> types of encapsulated data at the same time. It seems difficult, or at
>> least more complex, to achieve this with multiple ports. Each would
>> try to register to the same handler and this would fail unless there
>> is some kind of additional registration mechanism.
>
> At least in the kernel, I don't see a way to do it cleanly. The
> net_device has to be usable outside of ovs, this includes things like
> tcpdump working correctly.

I think what we can do is rename L3 mode to "no_arp" or something like
that since I believe that's the only difference between L2 and L3 mode
as currently defined. In this mode we could still pass fully formed
Ethernet frames to the netdevice input/output. To me this is exactly
analogous to passing MPLS or NSH to the stack - it won't inherently
know what to do with it unless there is some kind of external
processing but the data is there in the packet and could be decoded by
normal tools.

Since OVS would be explicitly manipulating the Ethernet headers
through push/pop actions, it doesn't need any external resolving.
Therefore, devices attached to OVS could always be put in this
"no_arp" mode and we would be able to handle simultaneous traffic of
different types.
Patch

diff --git a/datapath-windows/ovsext/Vport.c b/datapath-windows/ovsext/Vport.c
index 7b0103d6b523..fc2299f1c051 100644
--- a/datapath-windows/ovsext/Vport.c
+++ b/datapath-windows/ovsext/Vport.c
@@ -1005,6 +1005,8 @@  OvsInitTunnelVport(PVOID userContext,
     case OVS_VPORT_TYPE_GRE:
         status = OvsInitGreTunnel(vport);
         break;
+    case OVS_VPORT_TYPE_GRE_L3:
+        break;
     case OVS_VPORT_TYPE_VXLAN:
     {
         POVS_TUNFLT_INIT_CONTEXT tunnelContext = NULL;
@@ -1266,6 +1268,8 @@  OvsRemoveAndDeleteVport(PVOID usrParamsContext,
     case OVS_VPORT_TYPE_GRE:
         OvsCleanupGreTunnel(vport);
         break;
+    case OVS_VPORT_TYPE_GRE_L3:
+        break;
     case OVS_VPORT_TYPE_NETDEV:
         if (vport->isExternal) {
             if (vport->nicIndex == 0) {
diff --git a/datapath/linux/compat/include/linux/openvswitch.h b/datapath/linux/compat/include/linux/openvswitch.h
index 502e8f1aca66..f68697f578f3 100644
--- a/datapath/linux/compat/include/linux/openvswitch.h
+++ b/datapath/linux/compat/include/linux/openvswitch.h
@@ -230,9 +230,10 @@  enum ovs_vport_type {
 	OVS_VPORT_TYPE_UNSPEC,
 	OVS_VPORT_TYPE_NETDEV,   /* network device */
 	OVS_VPORT_TYPE_INTERNAL, /* network device implemented by datapath */
-	OVS_VPORT_TYPE_GRE,      /* GRE tunnel. */
+	OVS_VPORT_TYPE_GRE,      /* GRE Tap tunnel (L2 in GRE). */
 	OVS_VPORT_TYPE_VXLAN,	 /* VXLAN tunnel. */
 	OVS_VPORT_TYPE_GENEVE,	 /* Geneve tunnel. */
+	OVS_VPORT_TYPE_GRE_L3,   /* GRE tunnel (L3 in GRE). */
 	OVS_VPORT_TYPE_LISP = 105,  /* LISP tunnel */
 	OVS_VPORT_TYPE_STT = 106, /* STT tunnel */
 	__OVS_VPORT_TYPE_MAX
@@ -354,6 +355,7 @@  enum ovs_key_attr {
 	OVS_KEY_ATTR_CT_LABELS,	/* 16-octet connection tracking labels */
 	OVS_KEY_ATTR_PACKET_ETHERTYPE, /* be16 Ethernet type for packet
 					* execution. */
+	OVS_KEY_ATTR_NEXT_BASE_LAYER, /* base layer of encapsulated packet */
 
 #ifdef __KERNEL__
 	/* Only used within kernel data path. */
diff --git a/lib/dpif-netlink.c b/lib/dpif-netlink.c
index bab2297541ac..3cd273d13384 100644
--- a/lib/dpif-netlink.c
+++ b/lib/dpif-netlink.c
@@ -760,6 +760,9 @@  get_vport_type(const struct dpif_netlink_vport *vport)
     case OVS_VPORT_TYPE_GRE:
         return "gre";
 
+    case OVS_VPORT_TYPE_GRE_L3:
+        return "l3gre";
+
     case OVS_VPORT_TYPE_VXLAN:
         return "vxlan";
 
@@ -792,6 +795,8 @@  netdev_to_ovs_vport_type(const struct netdev *netdev)
         return OVS_VPORT_TYPE_STT;
     } else if (!strcmp(type, "geneve")) {
         return OVS_VPORT_TYPE_GENEVE;
+    } else if (!strcmp(type, "l3gre")) { /* Must be before search for "gre" */
+        return OVS_VPORT_TYPE_GRE_L3;
     } else if (strstr(type, "gre")) {
         return OVS_VPORT_TYPE_GRE;
     } else if (!strcmp(type, "vxlan")) {
diff --git a/lib/flow.c b/lib/flow.c
index 4cd7ebedb9c9..09544bbef904 100644
--- a/lib/flow.c
+++ b/lib/flow.c
@@ -821,6 +821,20 @@  miniflow_extract(struct dp_packet *packet, struct miniflow *dst)
                 miniflow_push_be16(mf, tp_dst, htons(icmp->icmp6_code));
                 miniflow_pad_to_64(mf, tp_dst);
             }
+        } else if (OVS_LIKELY(nw_proto == IPPROTO_GRE)) {
+            if (OVS_LIKELY(size >= sizeof(struct gre_base_hdr))) {
+                const struct gre_base_hdr *gre = data_pull(&data, &size,
+                                                           sizeof *gre);
+                if (gre->protocol == htons(ETH_TYPE_TEB)) {
+                    /* No need to store a zero value for next_base_layer
+                     * in the miniflow which would cost an extra word of
+                     * storage. */
+                    BUILD_ASSERT(LAYER_2 == 0);
+                } else {
+                    miniflow_push_uint8(mf, next_base_layer, LAYER_3);
+                    miniflow_pad_to_64(mf, next_base_layer);
+                }
+            }
         }
     }
  out:
@@ -1435,6 +1449,8 @@  flow_wc_map(const struct flow *flow, struct flowmap *map)
 
         if (OVS_UNLIKELY(flow->nw_proto == IPPROTO_IGMP)) {
             FLOWMAP_SET(map, igmp_group_ip4);
+        } else if (OVS_UNLIKELY(flow->nw_proto == IPPROTO_GRE)) {
+            FLOWMAP_SET(map, next_base_layer);
         } else {
             FLOWMAP_SET(map, tcp_flags);
             FLOWMAP_SET(map, tp_src);
@@ -1453,6 +1469,8 @@  flow_wc_map(const struct flow *flow, struct flowmap *map)
             FLOWMAP_SET(map, nd_target);
             FLOWMAP_SET(map, arp_sha);
             FLOWMAP_SET(map, arp_tha);
+        } else if (OVS_UNLIKELY(flow->nw_proto == IPPROTO_GRE)) {
+            FLOWMAP_SET(map, next_base_layer);
         } else {
             FLOWMAP_SET(map, tcp_flags);
             FLOWMAP_SET(map, tp_src);
diff --git a/lib/flow.h b/lib/flow.h
index 7e5f50e0ad4f..ccbe522cd968 100644
--- a/lib/flow.h
+++ b/lib/flow.h
@@ -149,6 +149,10 @@  struct flow {
     ovs_be16 tp_dst;            /* TCP/UDP/SCTP destination port/ICMP code. */
     ovs_be32 igmp_group_ip4;    /* IGMP group IPv4 address.
                                  * Keep last for BUILD_ASSERT_DECL below. */
+
+    uint8_t next_base_layer;    /* Fields of the encapsulated packet, if any,
+                                 * start at this layer. */
+    uint8_t pad4[7];
 };
 BUILD_ASSERT_DECL(sizeof(struct flow) % sizeof(uint64_t) == 0);
 BUILD_ASSERT_DECL(sizeof(struct flow_tnl) % sizeof(uint64_t) == 0);
diff --git a/lib/match.c b/lib/match.c
index 6440d260495d..e09ae63430b7 100644
--- a/lib/match.c
+++ b/lib/match.c
@@ -1309,6 +1309,10 @@  match_format(const struct match *match, struct ds *s, int priority)
                             TCP_FLAGS(OVS_BE16_MAX));
     }
 
+    if (wc->masks.next_base_layer) {
+        ds_put_format(s, "next_base_layer=%"PRIu8",", f->next_base_layer);
+    }
+
     if (s->length > start_len) {
         ds_chomp(s, ',');
     }
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index b47ba0f8d430..aee5ad6cd2d0 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -5573,7 +5573,8 @@  get_etheraddr(const char *netdev_name, struct eth_addr *ea)
         return error;
     }
     hwaddr_family = ifr.ifr_hwaddr.sa_family;
-    if (hwaddr_family != AF_UNSPEC && hwaddr_family != ARPHRD_ETHER) {
+    if (hwaddr_family != AF_UNSPEC && hwaddr_family != ARPHRD_ETHER &&
+        hwaddr_family != ARPHRD_IPGRE) {
         VLOG_INFO("%s device has unknown hardware address family %d",
                   netdev_name, hwaddr_family);
         return EINVAL;
diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
index 88f5022f4bd5..78c8e9720de9 100644
--- a/lib/netdev-vport.c
+++ b/lib/netdev-vport.c
@@ -145,7 +145,7 @@  netdev_vport_is_layer3(const struct netdev *dev)
 {
     const char *type = netdev_get_type(dev);
 
-    return (!strcmp("lisp", type));
+    return (!strcmp("lisp", type) || !strcmp("l3gre", type));
 }
 
 static bool
@@ -943,12 +943,17 @@  ip_extract_tnl_md(struct dp_packet *packet, struct flow_tnl *tnl,
     return l4;
 }
 
+static ovs_be16
+header_eth_type(const void *header)
+{
+    const struct eth_header *eth = header;
+    return eth->eth_type;
+}
+
 static bool
 is_header_ipv6(const void *header)
 {
-    const struct eth_header *eth;
-    eth = header;
-    return eth->eth_type == htons(ETH_TYPE_IPV6);
+    return header_eth_type(header) == htons(ETH_TYPE_IPV6);
 }
 
 /* Pushes the 'size' bytes of 'header' into the headroom of 'packet',
@@ -973,6 +978,9 @@  push_ip_header(struct dp_packet *packet,
 
     memcpy(eth, header, size);
 
+    dp_packet_reset_offsets(packet);
+    packet->l3_ofs = sizeof (struct eth_header);
+
     if (is_header_ipv6(header)) {
         ip6 = ipv6_hdr(eth);
         *ip_tot_size -= IPV6_HEADER_LEN;
@@ -1120,7 +1128,7 @@  gre_header_len(ovs_be16 flags)
 
 static int
 parse_gre_header(struct dp_packet *packet,
-                 struct flow_tnl *tnl)
+                 struct flow_tnl *tnl, bool tap)
 {
     const struct gre_base_hdr *greh;
     ovs_16aligned_be32 *options;
@@ -1136,7 +1144,8 @@  parse_gre_header(struct dp_packet *packet,
         return -EINVAL;
     }
 
-    if (greh->protocol != htons(ETH_TYPE_TEB)) {
+    if ((tap && greh->protocol != htons(ETH_TYPE_TEB)) ||
+        (!tap && greh->protocol == htons(ETH_TYPE_TEB))) {
         return -EINVAL;
     }
 
@@ -1169,6 +1178,10 @@  parse_gre_header(struct dp_packet *packet,
         options++;
     }
 
+    if (!tap) {
+        packet->md.packet_ethertype = greh->protocol;
+    }
+
     return hlen;
 }
 
@@ -1182,7 +1195,7 @@  pkt_metadata_init_tnl(struct pkt_metadata *md)
 }
 
 static int
-netdev_gre_pop_header(struct dp_packet *packet)
+netdev_gre_pop_header__(struct dp_packet *packet, bool tap)
 {
     struct pkt_metadata *md = &packet->md;
     struct flow_tnl *tnl = &md->tunnel;
@@ -1196,7 +1209,7 @@  netdev_gre_pop_header(struct dp_packet *packet)
         return EINVAL;
     }
 
-    hlen = parse_gre_header(packet, tnl);
+    hlen = parse_gre_header(packet, tnl, tap);
     if (hlen < 0) {
         return -hlen;
     }
@@ -1206,6 +1219,31 @@  netdev_gre_pop_header(struct dp_packet *packet)
     return 0;
 }
 
+static int
+netdev_gretap_pop_header(struct dp_packet *packet)
+{
+    return netdev_gre_pop_header__(packet, true);
+}
+
+static int
+netdev_gre_pop_header(struct dp_packet *packet)
+{
+    int err;
+
+    err = netdev_gre_pop_header__(packet, false);
+    if (err) {
+        return err;
+    }
+
+    if (eth_type_mpls(packet->md.packet_ethertype)) {
+        packet->l2_5_ofs = 0;
+    } else {
+        packet->l3_ofs = 0;
+    }
+
+    return 0;
+}
+
 static void
 netdev_gre_push_header(struct dp_packet *packet,
                        const struct ovs_action_push_tnl *data)
@@ -1219,12 +1257,13 @@  netdev_gre_push_header(struct dp_packet *packet,
         ovs_be16 *csum_opt = (ovs_be16 *) (greh + 1);
         *csum_opt = csum(greh, ip_tot_size);
     }
+    packet->md.packet_ethertype = header_eth_type(data->header);
 }
 
 static int
-netdev_gre_build_header(const struct netdev *netdev,
-                        struct ovs_action_push_tnl *data,
-                        const struct flow *tnl_flow)
+netdev_gre_build_header__(const struct netdev *netdev,
+                          struct ovs_action_push_tnl *data,
+                          const struct flow *tnl_flow, ovs_be16 proto)
 {
     struct netdev_vport *dev = netdev_vport_cast(netdev);
     struct netdev_tunnel_config *tnl_cfg;
@@ -1251,7 +1290,7 @@  netdev_gre_build_header(const struct netdev *netdev,
         greh = (struct gre_base_hdr *) (ip + 1);
     }
 
-    greh->protocol = htons(ETH_TYPE_TEB);
+    greh->protocol = proto;
     greh->flags = 0;
 
     options = (ovs_16aligned_be32 *) (greh + 1);
@@ -1279,6 +1318,24 @@  netdev_gre_build_header(const struct netdev *netdev,
 }
 
 static int
+netdev_gretap_build_header(const struct netdev *netdev,
+                           struct ovs_action_push_tnl *data,
+                           const struct flow *tnl_flow)
+{
+    return netdev_gre_build_header__(netdev, data, tnl_flow,
+                                     htons(ETH_TYPE_TEB));
+}
+
+static int
+netdev_gre_build_header(const struct netdev *netdev,
+                        struct ovs_action_push_tnl *data,
+                        const struct flow *tnl_flow)
+{
+    return netdev_gre_build_header__(netdev, data, tnl_flow,
+                                     tnl_flow->dl_type);
+}
+
+static int
 netdev_vxlan_pop_header(struct dp_packet *packet)
 {
     struct pkt_metadata *md = &packet->md;
@@ -1555,9 +1612,12 @@  netdev_vport_tunnel_register(void)
         TUNNEL_CLASS("geneve", "genev_sys", netdev_geneve_build_header,
                                             push_udp_header,
                                             netdev_geneve_pop_header),
-        TUNNEL_CLASS("gre", "gre_sys", netdev_gre_build_header,
+        TUNNEL_CLASS("gre", "gre_sys", netdev_gretap_build_header,
                                        netdev_gre_push_header,
-                                       netdev_gre_pop_header),
+                                       netdev_gretap_pop_header),
+        TUNNEL_CLASS("l3gre", "l3gre_sys", netdev_gre_build_header,
+                                           netdev_gre_push_header,
+                                           netdev_gre_pop_header),
         TUNNEL_CLASS("ipsec_gre", "gre_sys", NULL, NULL, NULL),
         TUNNEL_CLASS("vxlan", "vxlan_sys", netdev_vxlan_build_header,
                                            push_udp_header,
diff --git a/lib/netdev-vport.h b/lib/netdev-vport.h
index be02cb569d96..6bd30adee321 100644
--- a/lib/netdev-vport.h
+++ b/lib/netdev-vport.h
@@ -20,6 +20,7 @@ 
 #include <stdbool.h>
 #include <stddef.h>
 #include "compiler.h"
+#include "openvswitch/types.h"
 
 struct dpif_netlink_vport;
 struct dpif_flow_stats;
diff --git a/lib/odp-execute.c b/lib/odp-execute.c
index 39fe1cec4018..248c7c7b4c65 100644
--- a/lib/odp-execute.c
+++ b/lib/odp-execute.c
@@ -342,6 +342,7 @@  odp_execute_set_action(struct dp_packet *packet, const struct nlattr *a)
     case OVS_KEY_ATTR_CT_ZONE:
     case OVS_KEY_ATTR_CT_MARK:
     case OVS_KEY_ATTR_CT_LABELS:
+    case OVS_KEY_ATTR_NEXT_BASE_LAYER:
     case __OVS_KEY_ATTR_MAX:
     default:
         OVS_NOT_REACHED();
@@ -446,6 +447,7 @@  odp_execute_masked_set_action(struct dp_packet *packet,
     case OVS_KEY_ATTR_ICMP:
     case OVS_KEY_ATTR_ICMPV6:
     case OVS_KEY_ATTR_TCP_FLAGS:
+    case OVS_KEY_ATTR_NEXT_BASE_LAYER:
     case __OVS_KEY_ATTR_MAX:
     default:
         OVS_NOT_REACHED();
diff --git a/lib/odp-util.c b/lib/odp-util.c
index 0d24327cc627..e9fca07a279d 100644
--- a/lib/odp-util.c
+++ b/lib/odp-util.c
@@ -166,6 +166,7 @@  ovs_key_attr_to_string(enum ovs_key_attr attr, char *namebuf, size_t bufsize)
     case OVS_KEY_ATTR_DP_HASH: return "dp_hash";
     case OVS_KEY_ATTR_RECIRC_ID: return "recirc_id";
     case OVS_KEY_ATTR_PACKET_ETHERTYPE: return "pkt_eth";
+    case OVS_KEY_ATTR_NEXT_BASE_LAYER: return "next_base_layer";
 
     case __OVS_KEY_ATTR_MAX:
     default:
@@ -1829,6 +1830,7 @@  static const struct attr_len_tbl ovs_flow_key_attr_lens[OVS_KEY_ATTR_MAX + 1] =
     [OVS_KEY_ATTR_CT_MARK]   = { .len = 4 },
     [OVS_KEY_ATTR_CT_LABELS] = { .len = sizeof(struct ovs_key_ct_labels) },
     [OVS_KEY_ATTR_PACKET_ETHERTYPE] = { .len = 2 },
+    [OVS_KEY_ATTR_NEXT_BASE_LAYER] = { .len = 1 },
 };
 
 /* Returns the correct length of the payload for a flow key attribute of the
@@ -2959,6 +2961,13 @@  format_odp_key_attr(const struct nlattr *a, const struct nlattr *ma,
         ds_chomp(ds, ',');
         break;
     }
+
+    case OVS_KEY_ATTR_NEXT_BASE_LAYER: {
+        const uint8_t *mask = ma ? nl_attr_get(ma) : NULL;
+        format_u8u(ds, "type", nl_attr_get_u8(a), mask, verbose);
+        break;
+    }
+
     case OVS_KEY_ATTR_UNSPEC:
     case __OVS_KEY_ATTR_MAX:
     default:
@@ -4389,6 +4398,11 @@  odp_flow_key_from_flow__(const struct odp_flow_key_parms *parms,
             sctp_key = nl_msg_put_unspec_uninit(buf, OVS_KEY_ATTR_SCTP,
                                                sizeof *sctp_key);
             get_tp_key(data, sctp_key);
+        } else if (flow->nw_proto == IPPROTO_GRE) {
+            if (!export_mask || data->next_base_layer == 0xff) {
+                nl_msg_put_u8(buf, OVS_KEY_ATTR_NEXT_BASE_LAYER,
+                              data->next_base_layer);
+            }
         } else if (flow->dl_type == htons(ETH_TYPE_IP)
                 && flow->nw_proto == IPPROTO_ICMP) {
             struct ovs_key_icmp *icmp_key;
@@ -4965,6 +4979,13 @@  parse_l2_5_onward(const struct nlattr *attrs[OVS_KEY_ATTR_MAX + 1],
             put_tp_key(sctp_key, flow);
             expected_bit = OVS_KEY_ATTR_SCTP;
         }
+    } else if (src_flow->nw_proto == IPPROTO_GRE
+               && (src_flow->dl_type == htons(ETH_TYPE_IP) ||
+                   src_flow->dl_type == htons(ETH_TYPE_IPV6))
+               && !(src_flow->nw_frag & FLOW_NW_FRAG_LATER)) {
+        if (present_attrs & (UINT64_C(1) << OVS_KEY_ATTR_NEXT_BASE_LAYER)) {
+            flow->next_base_layer = nl_attr_get_u8(attrs[OVS_KEY_ATTR_NEXT_BASE_LAYER]);
+        }
     } else if (src_flow->nw_proto == IPPROTO_ICMP
                && src_flow->dl_type == htons(ETH_TYPE_IP)
                && !(src_flow->nw_frag & FLOW_NW_FRAG_LATER)) {
diff --git a/lib/tnl-ports.c b/lib/tnl-ports.c
index e7f2066ab5c5..53adcb7a9a1a 100644
--- a/lib/tnl-ports.c
+++ b/lib/tnl-ports.c
@@ -27,6 +27,7 @@ 
 #include "hash.h"
 #include "list.h"
 #include "netdev.h"
+#include "netdev-vport.h"
 #include "ofpbuf.h"
 #include "ovs-thread.h"
 #include "odp-util.h"
@@ -52,6 +53,7 @@  static struct ovs_list addr_list;
 struct tnl_port {
     odp_port_t port;
     ovs_be16 udp_port;
+    bool is_layer3;
     char dev_name[IFNAMSIZ];
     struct ovs_list node;
 };
@@ -61,6 +63,7 @@  static struct ovs_list port_list;
 struct tnl_port_in {
     struct cls_rule cr;
     odp_port_t portno;
+    bool match_base_layer;
     struct ovs_refcount ref_cnt;
     char dev_name[IFNAMSIZ];
 };
@@ -82,7 +85,7 @@  tnl_port_free(struct tnl_port_in *p)
 
 static void
 tnl_port_init_flow(struct flow *flow, struct eth_addr mac,
-                   struct in6_addr *addr, ovs_be16 udp_port)
+                   struct in6_addr *addr, ovs_be16 udp_port, bool is_layer3)
 {
     memset(flow, 0, sizeof *flow);
 
@@ -99,20 +102,21 @@  tnl_port_init_flow(struct flow *flow, struct eth_addr mac,
         flow->nw_proto = IPPROTO_UDP;
     } else {
         flow->nw_proto = IPPROTO_GRE;
+        flow->next_base_layer = is_layer3 ? LAYER_3 : LAYER_2;
     }
     flow->tp_dst = udp_port;
 }
 
 static void
 map_insert(odp_port_t port, struct eth_addr mac, struct in6_addr *addr,
-           ovs_be16 udp_port, const char dev_name[])
+           ovs_be16 udp_port, const char dev_name[], bool is_layer3)
 {
     const struct cls_rule *cr;
     struct tnl_port_in *p;
     struct match match;
 
     memset(&match, 0, sizeof match);
-    tnl_port_init_flow(&match.flow, mac, addr, udp_port);
+    tnl_port_init_flow(&match.flow, mac, addr, udp_port, is_layer3);
 
     do {
         cr = classifier_lookup(&cls, CLS_MAX_VERSION, &match.flow, NULL);
@@ -133,6 +137,13 @@  map_insert(odp_port_t port, struct eth_addr mac, struct in6_addr *addr,
          * doesn't make sense to match on UDP port numbers. */
         if (udp_port) {
             match.wc.masks.tp_dst = OVS_BE16_MAX;
+        } else {
+            /* Match base layer for non-UDP tunnels as it may
+             * be used to differentiate them. For UDP tunnels the
+             * port number provides differentiation.
+             */
+            match.wc.masks.next_base_layer = UINT8_MAX;
+            p->match_base_layer = true;
         }
         if (IN6_IS_ADDR_V4MAPPED(addr)) {
             match.wc.masks.nw_dst = OVS_BE32_MAX;
@@ -151,15 +162,15 @@  map_insert(odp_port_t port, struct eth_addr mac, struct in6_addr *addr,
 }
 
 void
-tnl_port_map_insert(odp_port_t port,
-                    ovs_be16 udp_port, const char dev_name[])
+tnl_port_map_insert(odp_port_t port, ovs_be16 udp_port,
+                    const char dev_name[], bool is_layer3)
 {
     struct tnl_port *p;
     struct ip_device *ip_dev;
 
     ovs_mutex_lock(&mutex);
     LIST_FOR_EACH(p, node, &port_list) {
-        if (udp_port == p->udp_port) {
+        if (udp_port == p->udp_port && udp_port) {
              goto out;
         }
     }
@@ -167,6 +178,7 @@  tnl_port_map_insert(odp_port_t port,
     p = xzalloc(sizeof *p);
     p->port = port;
     p->udp_port = udp_port;
+    p->is_layer3 = is_layer3;
     ovs_strlcpy(p->dev_name, dev_name, sizeof p->dev_name);
     list_insert(&port_list, &p->node);
 
@@ -174,11 +186,11 @@  tnl_port_map_insert(odp_port_t port,
         if (ip_dev->addr4 != INADDR_ANY) {
             struct in6_addr addr4 = in6_addr_mapped_ipv4(ip_dev->addr4);
             map_insert(p->port, ip_dev->mac, &addr4,
-                       p->udp_port, p->dev_name);
+                       p->udp_port, p->dev_name, is_layer3);
         }
         if (ipv6_addr_is_set(&ip_dev->addr6)) {
             map_insert(p->port, ip_dev->mac, &ip_dev->addr6,
-                       p->udp_port, p->dev_name);
+                       p->udp_port, p->dev_name, is_layer3);
         }
     }
 
@@ -199,19 +211,20 @@  tnl_port_unref(const struct cls_rule *cr)
 }
 
 static void
-map_delete(struct eth_addr mac, struct in6_addr *addr, ovs_be16 udp_port)
+map_delete(struct eth_addr mac, struct in6_addr *addr, ovs_be16 udp_port,
+           bool is_layer3)
 {
     const struct cls_rule *cr;
     struct flow flow;
 
-    tnl_port_init_flow(&flow, mac, addr, udp_port);
+    tnl_port_init_flow(&flow, mac, addr, udp_port, is_layer3);
 
     cr = classifier_lookup(&cls, CLS_MAX_VERSION, &flow, NULL);
     tnl_port_unref(cr);
 }
 
 void
-tnl_port_map_delete(ovs_be16 udp_port)
+tnl_port_map_delete(ovs_be16 udp_port, bool is_layer3)
 {
     struct tnl_port *p, *next;
     struct ip_device *ip_dev;
@@ -232,10 +245,10 @@  tnl_port_map_delete(ovs_be16 udp_port)
     LIST_FOR_EACH(ip_dev, node, &addr_list) {
         if (ip_dev->addr4 != INADDR_ANY) {
             struct in6_addr addr4 = in6_addr_mapped_ipv4(ip_dev->addr4);
-            map_delete(ip_dev->mac, &addr4, udp_port);
+            map_delete(ip_dev->mac, &addr4, udp_port, is_layer3);
         }
         if (ipv6_addr_is_set(&ip_dev->addr6)) {
-            map_delete(ip_dev->mac, &ip_dev->addr6, udp_port);
+            map_delete(ip_dev->mac, &ip_dev->addr6, udp_port, is_layer3);
         }
     }
 
@@ -244,15 +257,35 @@  out:
     ovs_mutex_unlock(&mutex);
 }
 
-/* 'flow' is non-const to allow for temporary modifications during the lookup.
- * Any changes are restored before returning. */
+/* 'flow' is non-const to allow for:
+ * - Temporary modifications during the lookup;
+ *   these are reverted before returning.
+ * - Setting matching on next_base_layer as required by the port looked up. */
 odp_port_t
 tnl_port_map_lookup(struct flow *flow, struct flow_wildcards *wc)
 {
     const struct cls_rule *cr = classifier_lookup(&cls, CLS_MAX_VERSION, flow,
                                                   wc);
+    enum base_layer next_base_layer_mask;
+    struct tnl_port_in *p;
+    odp_port_t portno;
+
+    /* next_base_layer should be matched when looking up a tunnel port. */
+    next_base_layer_mask = wc->masks.next_base_layer;
+    wc->masks.next_base_layer = UINT8_MAX;
+
+    if (!cr) {
+        portno = ODPP_NONE;
+    } else {
+        p = tnl_port_cast(cr);
+        portno = p->portno;
+    }
+
+    if (!cr || !p->match_base_layer) {
+        wc->masks.next_base_layer = next_base_layer_mask;
+    }
 
-    return (cr) ? tnl_port_cast(cr)->portno : ODPP_NONE;
+    return portno;
 }
 
 static void
@@ -334,11 +367,11 @@  map_insert_ipdev(struct ip_device *ip_dev)
         if (ip_dev->addr4 != INADDR_ANY) {
             struct in6_addr addr4 = in6_addr_mapped_ipv4(ip_dev->addr4);
             map_insert(p->port, ip_dev->mac, &addr4,
-                       p->udp_port, p->dev_name);
+                       p->udp_port, p->dev_name, p->is_layer3);
         }
         if (ipv6_addr_is_set(&ip_dev->addr6)) {
             map_insert(p->port, ip_dev->mac, &ip_dev->addr6,
-                       p->udp_port, p->dev_name);
+                       p->udp_port, p->dev_name, p->is_layer3);
         }
     }
 }
@@ -386,15 +419,16 @@  insert_ipdev(const char dev_name[])
 static void
 delete_ipdev(struct ip_device *ip_dev)
 {
+    bool is_layer3 = netdev_vport_is_layer3(ip_dev->dev);
     struct tnl_port *p;
 
     LIST_FOR_EACH(p, node, &port_list) {
         if (ip_dev->addr4 != INADDR_ANY) {
             struct in6_addr addr4 = in6_addr_mapped_ipv4(ip_dev->addr4);
-            map_delete(ip_dev->mac, &addr4, p->udp_port);
+            map_delete(ip_dev->mac, &addr4, p->udp_port, is_layer3);
         }
         if (ipv6_addr_is_set(&ip_dev->addr6)) {
-            map_delete(ip_dev->mac, &ip_dev->addr6, p->udp_port);
+            map_delete(ip_dev->mac, &ip_dev->addr6, p->udp_port, is_layer3);
         }
     }
 
diff --git a/lib/tnl-ports.h b/lib/tnl-ports.h
index 4195e6a82e0f..f9958cec60af 100644
--- a/lib/tnl-ports.h
+++ b/lib/tnl-ports.h
@@ -27,9 +27,9 @@ 
 odp_port_t tnl_port_map_lookup(struct flow *flow, struct flow_wildcards *wc);
 
 void tnl_port_map_insert(odp_port_t port, ovs_be16 udp_port,
-                         const char dev_name[]);
+                         const char dev_name[], bool is_layer3);
 
-void tnl_port_map_delete(ovs_be16 udp_port);
+void tnl_port_map_delete(ovs_be16 udp_port, bool is_layer3);
 void tnl_port_map_insert_ipdev(const char dev[]);
 void tnl_port_map_delete_ipdev(const char dev[]);
 void tnl_port_map_run(void);
diff --git a/ofproto/ofproto-dpif-ipfix.c b/ofproto/ofproto-dpif-ipfix.c
index a610c536204f..6509f21ce0ec 100644
--- a/ofproto/ofproto-dpif-ipfix.c
+++ b/ofproto/ofproto-dpif-ipfix.c
@@ -588,7 +588,7 @@  dpif_ipfix_add_tunnel_port(struct dpif_ipfix *di, struct ofport *ofport,
     dip = xmalloc(sizeof *dip);
     dip->ofport = ofport;
     dip->odp_port = odp_port;
-    if (strcmp(type, "gre") == 0) {
+    if (strcmp(type, "gre") == 0 || strcmp(type, "l3gre") == 0) {
         /* 32-bit key gre */
         dip->tunnel_type = DPIF_IPFIX_TUNNEL_GRE;
         dip->tunnel_key_length = 4;
diff --git a/ofproto/ofproto-dpif-sflow.c b/ofproto/ofproto-dpif-sflow.c
index 33d8bec7495d..7c93f6ab18bd 100644
--- a/ofproto/ofproto-dpif-sflow.c
+++ b/ofproto/ofproto-dpif-sflow.c
@@ -584,7 +584,7 @@  static enum dpif_sflow_tunnel_type
 dpif_sflow_tunnel_type(struct ofport *ofport) {
     const char *type = netdev_get_type(ofport->netdev);
     if (type) {
-	if (strcmp(type, "gre") == 0) {
+	if (strcmp(type, "gre") == 0 || strcmp(type, "l3gre") == 0) {
 	    return DPIF_SFLOW_TUNNEL_GRE;
 	} else if (strcmp(type, "ipsec_gre") == 0) {
 	    return DPIF_SFLOW_TUNNEL_IPSEC_GRE;
@@ -1035,6 +1035,7 @@  sflow_read_set_action(const struct nlattr *attr,
     case OVS_KEY_ATTR_CT_MARK:
     case OVS_KEY_ATTR_CT_LABELS:
     case OVS_KEY_ATTR_UNSPEC:
+    case OVS_KEY_ATTR_NEXT_BASE_LAYER:
     case __OVS_KEY_ATTR_MAX:
     default:
         break;
diff --git a/ofproto/tunnel.c b/ofproto/tunnel.c
index 24b717a3ce86..322b14f44ea7 100644
--- a/ofproto/tunnel.c
+++ b/ofproto/tunnel.c
@@ -26,6 +26,7 @@ 
 #include "hash.h"
 #include "hmap.h"
 #include "netdev.h"
+#include "netdev-vport.h"
 #include "odp-util.h"
 #include "ofpbuf.h"
 #include "packets.h"
@@ -194,7 +195,8 @@  tnl_port_add__(const struct ofport_dpif *ofport, const struct netdev *netdev,
     tnl_port_mod_log(tnl_port, "adding");
 
     if (native_tnl) {
-        tnl_port_map_insert(odp_port, cfg->dst_port, name);
+        tnl_port_map_insert(odp_port, cfg->dst_port, name,
+                            netdev_vport_is_layer3(netdev));
     }
     return true;
 }
@@ -261,7 +263,8 @@  tnl_port_del__(const struct ofport_dpif *ofport) OVS_REQ_WRLOCK(rwlock)
             netdev_get_tunnel_config(tnl_port->netdev);
         struct hmap **map;
 
-        tnl_port_map_delete(cfg->dst_port);
+        tnl_port_map_delete(cfg->dst_port,
+                            netdev_vport_is_layer3(tnl_port->netdev));
         tnl_port_mod_log(tnl_port, "removing");
         map = tnl_match_map(&tnl_port->match);
         hmap_remove(*map, &tnl_port->match_node);
diff --git a/tests/tunnel-push-pop-ipv6.at b/tests/tunnel-push-pop-ipv6.at
index 8f6506a716f9..139e29714dfd 100644
--- a/tests/tunnel-push-pop-ipv6.at
+++ b/tests/tunnel-push-pop-ipv6.at
@@ -12,6 +12,8 @@  AT_CHECK([ovs-vsctl add-port int-br t2 -- set Interface t2 type=vxlan \
                        options:remote_ip=2001:cafe::93 options:out_key=flow options:csum=true ofport_request=4\
                     -- add-port int-br t4 -- set Interface t4 type=geneve \
                        options:remote_ip=flow options:key=123 ofport_request=5\
+                    -- add-port int-br t5 -- set Interface t5 type=l3gre \
+                       options:remote_ip=2001:cafe::92 options:key=455 ofport_request=6\
                        ], [0])
 
 AT_CHECK([ovs-appctl dpif/show], [0], [dnl
@@ -21,10 +23,11 @@  dummy@ovs-dummy: hit:0 missed:0
 		p0 1/1: (dummy)
 	int-br:
 		int-br 65534/2: (dummy)
-		t1 3/3: (gre: key=456, remote_ip=2001:cafe::92)
+		t1 3/4: (gre: key=456, remote_ip=2001:cafe::92)
 		t2 2/4789: (vxlan: key=123, remote_ip=2001:cafe::92)
 		t3 4/4789: (vxlan: csum=true, out_key=flow, remote_ip=2001:cafe::93)
 		t4 5/6081: (geneve: key=123, remote_ip=flow)
+		t5 6/3: (l3gre: key=455, remote_ip=2001:cafe::92)
 ])
 
 dnl First setup dummy interface IP address, then add the route
@@ -52,7 +55,8 @@  IP                                            MAC                 Bridge
 AT_CHECK([ovs-appctl tnl/ports/show |sort], [0], [dnl
 Listening ports:
 genev_sys_6081 (6081)
-gre_sys (3)
+gre_sys (4)
+l3gre_sys (3)
 vxlan_sys_4789 (4789)
 ])
 
@@ -65,7 +69,7 @@  AT_CHECK([tail -1 stdout], [0],
 dnl Check GRE tunnel pop
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(1),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x86dd),ipv6(src=2001:cafe::92,dst=2001:cafe::88,label=0,proto=47,tclass=0x0,hlimit=64)'], [0], [stdout])
 AT_CHECK([tail -1 stdout], [0],
-  [Datapath actions: tnl_pop(3)
+  [Datapath actions: tnl_pop(4)
 ])
 
 dnl Check Geneve tunnel pop
@@ -92,7 +96,7 @@  dnl Check GRE tunnel push
 AT_CHECK([ovs-ofctl add-flow int-br action=3])
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'], [0], [stdout])
 AT_CHECK([tail -1 stdout], [0],
-  [Datapath actions: tnl_push(tnl_port(3),header(size=62,type=3,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::92,label=0,proto=47,tclass=0x0,hlimit=64),gre((flags=0x2000,proto=0x6558),key=0x1c8)),out_port(100))
+  [Datapath actions: tnl_push(tnl_port(4),header(size=62,type=3,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x86dd),ipv6(src=2001:cafe::88,dst=2001:cafe::92,label=0,proto=47,tclass=0x0,hlimit=64),gre((flags=0x2000,proto=0x6558),key=0x1c8)),out_port(100))
 ])
 
 dnl Check Geneve tunnel push
@@ -118,12 +122,12 @@  AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  3'], [0], [dnl
   port  3: rx pkts=1, bytes=98, drop=0, errs=0, frame=0, over=0, crc=0
 ])
 
-dnl Check GRE only accepts encapsulated Ethernet frames
-AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa550000001b213cab6486dd60000000006a2f402001cafe0000000000000000000000922001cafe00000000000000000000008820000800000001c8fe71d883724fbeb6f4e1494a080045000054ba200000400184861e0000011e00000200004227e75400030af3195500000000f265010000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+dnl Check decapsulation of L3GRE packet
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa550000001b213cab6486dd60000000005a2f402001cafe0000000000000000000000922001cafe00000000000000000000008820000800000001c745000054ba200000400184861e0000011e00000200004227e75400030af3195500000000f265010000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
 ovs-appctl time/warp 1000
 
-AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  3'], [0], [dnl
-  port  3: rx pkts=1, bytes=98, drop=0, errs=0, frame=0, over=0, crc=0
+AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  6'], [0], [dnl
+  port  6: rx pkts=1, bytes=84, drop=0, errs=0, frame=0, over=0, crc=0
 ])
 
 dnl Check decapsulation of Geneve packet with options
diff --git a/tests/tunnel-push-pop.at b/tests/tunnel-push-pop.at
index 242ffaf1bc46..952b5a9f8583 100644
--- a/tests/tunnel-push-pop.at
+++ b/tests/tunnel-push-pop.at
@@ -12,6 +12,8 @@  AT_CHECK([ovs-vsctl add-port int-br t2 -- set Interface t2 type=vxlan \
                        options:remote_ip=1.1.2.93 options:out_key=flow options:csum=true ofport_request=4\
                     -- add-port int-br t4 -- set Interface t4 type=geneve \
                        options:remote_ip=flow options:key=123 ofport_request=5\
+                    -- add-port int-br t5 -- set Interface t5 type=l3gre \
+                       options:remote_ip=1.1.2.92 options:key=455 ofport_request=6\
                        ], [0])
 
 AT_CHECK([ovs-appctl dpif/show], [0], [dnl
@@ -21,10 +23,11 @@  dummy@ovs-dummy: hit:0 missed:0
 		p0 1/1: (dummy)
 	int-br:
 		int-br 65534/2: (dummy)
-		t1 3/3: (gre: key=456, remote_ip=1.1.2.92)
+		t1 3/4: (gre: key=456, remote_ip=1.1.2.92)
 		t2 2/4789: (vxlan: key=123, remote_ip=1.1.2.92)
 		t3 4/4789: (vxlan: csum=true, out_key=flow, remote_ip=1.1.2.93)
 		t4 5/6081: (geneve: key=123, remote_ip=flow)
+		t5 6/3: (l3gre: key=455, remote_ip=1.1.2.92)
 ])
 
 dnl First setup dummy interface IP address, then add the route
@@ -50,7 +53,8 @@  IP                                            MAC                 Bridge
 AT_CHECK([ovs-appctl tnl/ports/show |sort], [0], [dnl
 Listening ports:
 genev_sys_6081 (6081)
-gre_sys (3)
+gre_sys (4)
+l3gre_sys (3)
 vxlan_sys_4789 (4789)
 ])
 
@@ -63,7 +67,7 @@  AT_CHECK([tail -1 stdout], [0],
 dnl Check GRE tunnel pop
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(1),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x0800),ipv4(src=1.1.2.92,dst=1.1.2.88,proto=47,tos=0,ttl=64,frag=no)'], [0], [stdout])
 AT_CHECK([tail -1 stdout], [0],
-  [Datapath actions: tnl_pop(3)
+  [Datapath actions: tnl_pop(4)
 ])
 
 dnl Check Geneve tunnel pop
@@ -90,7 +94,14 @@  dnl Check GRE tunnel push
 AT_CHECK([ovs-ofctl add-flow int-br action=3])
 AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'], [0], [stdout])
 AT_CHECK([tail -1 stdout], [0],
-  [Datapath actions: tnl_push(tnl_port(3),header(size=42,type=3,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=47,tos=0,ttl=64,frag=0x40),gre((flags=0x2000,proto=0x6558),key=0x1c8)),out_port(100))
+  [Datapath actions: tnl_push(tnl_port(4),header(size=42,type=3,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=47,tos=0,ttl=64,frag=0x40),gre((flags=0x2000,proto=0x6558),key=0x1c8)),out_port(100))
+])
+
+dnl Check L3GRE tunnel push
+AT_CHECK([ovs-ofctl add-flow int-br action=6])
+AT_CHECK([ovs-appctl ofproto/trace ovs-dummy 'in_port(2),eth(src=f8:bc:12:44:34:b6,dst=aa:55:aa:55:00:00),eth_type(0x0800),ipv4(src=1.1.3.88,dst=1.1.3.112,proto=47,tos=0,ttl=64,frag=no)'], [0], [stdout])
+AT_CHECK([tail -1 stdout], [0],
+  [Datapath actions: pop_eth,tnl_push(tnl_port(3),header(size=42,type=3,eth(dst=f8:bc:12:44:34:b6,src=aa:55:aa:55:00:00,dl_type=0x0800),ipv4(src=1.1.2.88,dst=1.1.2.92,proto=47,tos=0,ttl=64,frag=0x40),gre((flags=0x2000,proto=0x800),key=0x1c7)),out_port(100))
 ])
 
 dnl Check Geneve tunnel push
@@ -116,12 +127,20 @@  AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  3'], [0], [dnl
   port  3: rx pkts=1, bytes=98, drop=0, errs=0, frame=0, over=0, crc=0
 ])
 
-dnl Check GRE only accepts encapsulated Ethernet frames
-AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa550000001b213cab6408004500007e79464000402fba550101025c0101025820000800000001c8fe71d883724fbeb6f4e1494a080045000054ba200000400184861e0000011e00000200004227e75400030af3195500000000f265010000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+dnl Check decapsulation of L3GRE packet
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa550000001b213cab6408004500007079464000402fba630101025c0101025820000800000001c745000054ba200000400184861e0000011e00000200004227e75400030af3195500000000f265010000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
 ovs-appctl time/warp 1000
 
-AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  3'], [0], [dnl
+AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  6'], [0], [dnl
+  port  6: rx pkts=1, bytes=84, drop=0, errs=0, frame=0, over=0, crc=0
+])
+
+dnl Check L3GRE only accepts non-fragmented packets?
+AT_CHECK([ovs-appctl netdev-dummy/receive p0 'aa55aa550000001b213cab6408004500007e79464000402fba550101025c0101025820000800000001c7fe71d883724fbeb6f4e1494a080045000054ba200000400184861e0000011e00000200004227e75400030af3195500000000f265010000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637'])
+
+AT_CHECK([ovs-ofctl dump-ports int-br | grep 'port  [[36]]' | sort], [0], [dnl
   port  3: rx pkts=1, bytes=98, drop=0, errs=0, frame=0, over=0, crc=0
+  port  6: rx pkts=1, bytes=84, drop=0, errs=0, frame=0, over=0, crc=0
 ])
 
 dnl Check decapsulation of Geneve packet with options