[PATCHv11,bpf-next,2/5] xdp: add a new helper for dev map multicast support

Message ID 20200907082724.1721685-3-liuhangbin@gmail.com
State Changes Requested
Delegated to: BPF Maintainers

Commit Message

Hangbin Liu Sept. 7, 2020, 8:27 a.m. UTC
This patch adds XDP multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because a
physical interface can be part of a logical grouping, such as a bond
device.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I implement a new helper, bpf_redirect_map_multi(),
that accepts two maps: the forwarding map and the exclude map. The forwarding
map can be DEVMAP or DEVMAP_HASH, but the exclude map *must* be
DEVMAP_HASH to get better performance. If users don't want to use an exclude
map and simply want to stop redirecting back to the ingress device, they
can use the flag BPF_F_EXCLUDE_INGRESS.
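
As a rough illustration of the semantics only, the set-difference selection can be sketched in plain userspace C. This is not the kernel implementation: struct port_group, select_egress_ports() and SKETCH_F_EXCLUDE_INGRESS are hypothetical names made up for the sketch, and flat arrays stand in for the device maps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag for this sketch; it mirrors the idea of
 * BPF_F_EXCLUDE_INGRESS but is not the kernel definition. */
#define SKETCH_F_EXCLUDE_INGRESS (1ULL << 0)

/* Hypothetical stand-in for a device map: a flat list of ifindexes. */
struct port_group {
	const int *ifindex;
	size_t len;
};

/* Returns true if ifindex is a member of g; a NULL group is empty. */
static bool group_contains(const struct port_group *g, int ifindex)
{
	if (!g)
		return false;
	for (size_t i = 0; i < g->len; i++)
		if (g->ifindex[i] == ifindex)
			return true;
	return false;
}

/* Write into out[] every port in fwd that is not in excl and, when
 * SKETCH_F_EXCLUDE_INGRESS is set, is not the ingress port either.
 * excl may be NULL. Returns the number of ports written (<= out_cap). */
static size_t select_egress_ports(const struct port_group *fwd,
				  const struct port_group *excl,
				  int ingress_ifindex, uint64_t flags,
				  int *out, size_t out_cap)
{
	/* ifindex 0 is never a valid device, so it means "skip nothing". */
	int skip = (flags & SKETCH_F_EXCLUDE_INGRESS) ? ingress_ifindex : 0;
	size_t n = 0;

	assert(fwd && out);

	for (size_t i = 0; i < fwd->len && n < out_cap; i++) {
		int idx = fwd->ifindex[i];

		if (idx == skip || group_contains(excl, idx))
			continue;
		out[n++] = idx;
	}
	return n;
}
```

With a forwarding group {2, 3, 4, 5}, an exclude group {4} and ingress port 2, setting the flag leaves ports 3 and 5; without the flag and without an exclude group, all four ports are selected.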

As both bpf_xdp_redirect_map() and this new helper use struct
bpf_redirect_info, I add a new ex_map field and set tgt_value to NULL in the
new helper to distinguish it from bpf_xdp_redirect_map().
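
The NULL-target convention can be pictured with a small standalone sketch. The struct below is a hypothetical, trimmed-down analogue of struct bpf_redirect_info, and classify_redirect() is an illustrative name, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down analogue of struct bpf_redirect_info. */
struct redirect_info_sketch {
	void *tgt_value;	/* unicast target, or NULL for multicast */
	const void *map;	/* forwarding map */
	const void *ex_map;	/* optional exclude map, may be NULL */
};

enum redirect_kind { REDIRECT_UNICAST, REDIRECT_MULTICAST };

/* A non-NULL target means "send to exactly this entry"; a NULL target
 * means "walk the forwarding map and send to every eligible entry". */
static enum redirect_kind
classify_redirect(const struct redirect_info_sketch *ri)
{
	assert(ri && ri->map);
	return ri->tgt_value ? REDIRECT_UNICAST : REDIRECT_MULTICAST;
}
```

The attraction of this convention is that both helpers can share one per-CPU structure without an extra mode field; the cost is that a NULL target can no longer be treated as an error on its own.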

Also, I keep the general data path in net/core/filter.c and the native data
path in kernel/bpf/devmap.c so we can use direct calls to get better
performance.

[0] https://xdp-project.net/#Handling-multicast

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---

v11:
Fix bpf_redirect_map_multi() helper description typo.
Add loop limit for devmap_get_next_obj() and dev_map_redirect_multi().

v10:
Update helper bpf_xdp_redirect_map_multi()
- No need to check map pointer as we will do the check in verifier.

v9:
Update helper bpf_xdp_redirect_map_multi()
- Use ARG_CONST_MAP_PTR_OR_NULL for helper arg2

v8:
Update function dev_in_exclude_map():
- remove duplicate ex_map map_type check in
- lookup the element in dev map by obj dev index directly instead
  of looping all the map

v7:
a) Fix helper flag check
b) Limit the *ex_map* to use DEVMAP_HASH only and update function
   dev_in_exclude_map() to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find the next one in dev_map_enqueue_multi() if the interface is not
   able to forward, instead of aborting the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new helper bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

---
 include/linux/bpf.h            |  20 +++++
 include/linux/filter.h         |   1 +
 include/net/xdp.h              |   1 +
 include/uapi/linux/bpf.h       |  27 +++++++
 kernel/bpf/devmap.c            | 132 +++++++++++++++++++++++++++++++++
 kernel/bpf/verifier.c          |   6 ++
 net/core/filter.c              | 118 +++++++++++++++++++++++++++--
 net/core/xdp.c                 |  29 ++++++++
 tools/include/uapi/linux/bpf.h |  27 +++++++
 9 files changed, 356 insertions(+), 5 deletions(-)

Comments

Alexei Starovoitov Sept. 9, 2020, 9:52 p.m. UTC | #1
On Mon, Sep 07, 2020 at 04:27:21PM +0800, Hangbin Liu wrote:
> This patch is for xdp multicast support. which has been discussed
> before[0], The goal is to be able to implement an OVS-like data plane in
> XDP, i.e., a software switch that can forward XDP frames to multiple ports.
> 
> To achieve this, an application needs to specify a group of interfaces
> to forward a packet to. It is also common to want to exclude one or more
> physical interfaces from the forwarding operation - e.g., to forward a
> packet to all interfaces in the multicast group except the interface it
> arrived on. While this could be done simply by adding more groups, this
> quickly leads to a combinatorial explosion in the number of groups an
> application has to maintain.
> 
> To avoid the combinatorial explosion, we propose to include the ability
> to specify an "exclude group" as part of the forwarding operation. This
> needs to be a group (instead of just a single port index), because a
> physical interface can be part of a logical grouping, such as a bond
> device.
> 
> Thus, the logical forwarding operation becomes a "set difference"
> operation, i.e. "forward to all ports in group A that are not also in
> group B". This series implements such an operation using device maps to
> represent the groups. This means that the XDP program specifies two
> device maps, one containing the list of netdevs to redirect to, and the
> other containing the exclude list.

"set difference" and BPF_F_EXCLUDE_INGRESS makes sense to me as high level api,
but I don't see how program or helper is going to modify the packet
before multicasting it.
Even to implement a basic switch the program would need to modify destination
mac addresses before xmiting it on the device.
In case of XDP_TX the bpf program is doing it manually.
With this api the program is out of the loop.
It can prepare a packet for one target netdev, but sending the same
packet as-is to other netdevs isn't going to work correctly.
Veth-s and tap-s don't care about mac and the stack will silently accept
packets even with wrong mac.
The same thing may happen with physical netdevs. The driver won't care
that dst mac is wrong. It will xmit it out, but the other side of the wire
will likely drop that packet unless it's promisc.
Properly implemented bridge shouldn't be doing it, but
I really don't see how this api can work in practice to implement real bridge.
What am I missing?
Hangbin Liu Sept. 10, 2020, 2:35 a.m. UTC | #2
Hi Alexei,

On Wed, Sep 09, 2020 at 02:52:06PM -0700, Alexei Starovoitov wrote:
> On Mon, Sep 07, 2020 at 04:27:21PM +0800, Hangbin Liu wrote:
> > This patch is for xdp multicast support. which has been discussed
> > before[0], The goal is to be able to implement an OVS-like data plane in
> > XDP, i.e., a software switch that can forward XDP frames to multiple ports.
> > 
> > To achieve this, an application needs to specify a group of interfaces
> > to forward a packet to. It is also common to want to exclude one or more
> > physical interfaces from the forwarding operation - e.g., to forward a
> > packet to all interfaces in the multicast group except the interface it
> > arrived on. While this could be done simply by adding more groups, this
> > quickly leads to a combinatorial explosion in the number of groups an
> > application has to maintain.
> > 
> > To avoid the combinatorial explosion, we propose to include the ability
> > to specify an "exclude group" as part of the forwarding operation. This
> > needs to be a group (instead of just a single port index), because a
> > physical interface can be part of a logical grouping, such as a bond
> > device.
> > 
> > Thus, the logical forwarding operation becomes a "set difference"
> > operation, i.e. "forward to all ports in group A that are not also in
> > group B". This series implements such an operation using device maps to
> > represent the groups. This means that the XDP program specifies two
> > device maps, one containing the list of netdevs to redirect to, and the
> > other containing the exclude list.
> 
> "set difference" and BPF_F_EXCLUDE_INGRESS makes sense to me as high level api,
> but I don't see how program or helper is going to modify the packet
> before multicasting it.
> Even to implement a basic switch the program would need to modify destination
> mac addresses before xmiting it on the device.
> In case of XDP_TX the bpf program is doing it manually.
> With this api the program is out of the loop.
> It can prepare a packet for one target netdev, but sending the same
> packet as-is to other netdevs isn't going to work correctly.

Yes, we can't modify the packets on ingress, as there are multiple egress ports
and each one may have different requirements. So this helper only forwards
the packets to the devices of the other group (which looks like a multicast group).

I think packet modification (editing the dst mac, adding a vlan tag, etc.) should be
done on egress, which relies on David's XDP egress support.

> Veth-s and tap-s don't care about mac and the stack will silently accept
> packets even with wrong mac.
> The same thing may happen with physical netdevs. The driver won't care
> that dst mac is wrong. It will xmit it out, but the other side of the wire
> will likely drop that packet unless it's promisc.
> Properly implemented bridge shouldn't be doing it, but
> I really don't see how this api can work in practice to implement real bridge.
> What am I missing?

Not sure if I missed something. Does the current Linux bridge do dst mac
modification? I thought it only forwards packets (although it uses an fdb
instead of flooding the packet to all ports).

In patch 4/5 there is an example of forwarding packets. It still needs
to get the remote's mac address via arp/nd.

Thanks
Hangbin
David Ahern Sept. 10, 2020, 3:30 a.m. UTC | #3
On 9/9/20 8:35 PM, Hangbin Liu wrote:
> Hi Alexei,
> 
> On Wed, Sep 09, 2020 at 02:52:06PM -0700, Alexei Starovoitov wrote:
>> On Mon, Sep 07, 2020 at 04:27:21PM +0800, Hangbin Liu wrote:
>>> This patch is for xdp multicast support. which has been discussed
>>> before[0], The goal is to be able to implement an OVS-like data plane in
>>> XDP, i.e., a software switch that can forward XDP frames to multiple ports.
>>>
>>> To achieve this, an application needs to specify a group of interfaces
>>> to forward a packet to. It is also common to want to exclude one or more
>>> physical interfaces from the forwarding operation - e.g., to forward a
>>> packet to all interfaces in the multicast group except the interface it
>>> arrived on. While this could be done simply by adding more groups, this
>>> quickly leads to a combinatorial explosion in the number of groups an
>>> application has to maintain.
>>>
>>> To avoid the combinatorial explosion, we propose to include the ability
>>> to specify an "exclude group" as part of the forwarding operation. This
>>> needs to be a group (instead of just a single port index), because a
>>> physical interface can be part of a logical grouping, such as a bond
>>> device.
>>>
>>> Thus, the logical forwarding operation becomes a "set difference"
>>> operation, i.e. "forward to all ports in group A that are not also in
>>> group B". This series implements such an operation using device maps to
>>> represent the groups. This means that the XDP program specifies two
>>> device maps, one containing the list of netdevs to redirect to, and the
>>> other containing the exclude list.
>>
>> "set difference" and BPF_F_EXCLUDE_INGRESS makes sense to me as high level api,
>> but I don't see how program or helper is going to modify the packet
>> before multicasting it.
>> Even to implement a basic switch the program would need to modify destination
>> mac addresses before xmiting it on the device.
>> In case of XDP_TX the bpf program is doing it manually.
>> With this api the program is out of the loop.
>> It can prepare a packet for one target netdev, but sending the same
>> packet as-is to other netdevs isn't going to work correctly.
> 
> Yes, we can't modify the packets on ingress as there are multi egress ports
> and each one may has different requirements. So this helper will only forward
> the packets to other group(looks like a multicast group) devices.
> 
> I think the packets modification (edit dst mac, add vlan tag, etc) should be
> done on egress, which rely on David's XDP egress support.

agreed. The DEVMAP used for redirect can have programs attached that
update the packet headers - assuming you want to update them.

This is tagged as "multicast" support but it really is redirecting a
packet to multiple devices. One use case I see that evolves from this
set is the ability to both forward packets (e.g., host ingress to VM)
and grab a copy tcpdump style by redirecting packets to a virtual device
(similar to a patch set for dropwatch), i.e., no need for a perf-events
style copy to push to userspace.
Alexei Starovoitov Sept. 10, 2020, 5:35 a.m. UTC | #4
On Wed, Sep 9, 2020 at 8:30 PM David Ahern <dsahern@gmail.com> wrote:
> >
> > I think the packets modification (edit dst mac, add vlan tag, etc) should be
> > done on egress, which rely on David's XDP egress support.
>
> agreed. The DEVMAP used for redirect can have programs attached that
> update the packet headers - assuming you want to update them.

Then you folks have to submit them as one set.
As-is the programmer cannot achieve correct behavior.
Toke Høiland-Jørgensen Sept. 10, 2020, 9:44 a.m. UTC | #5
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Sep 9, 2020 at 8:30 PM David Ahern <dsahern@gmail.com> wrote:
>> >
>> > I think the packets modification (edit dst mac, add vlan tag, etc) should be
>> > done on egress, which rely on David's XDP egress support.
>>
>> agreed. The DEVMAP used for redirect can have programs attached that
>> update the packet headers - assuming you want to update them.
>
> Then you folks have to submit them as one set.
> As-is the programmer cannot achieve correct behavior.

The ability to attach a program to devmaps is already there. See:

fbee97feed9b ("bpf: Add support to attach bpf program to a devmap entry")

But now that you mention it, it does appear that this series is skipping
the hook that will actually run such a program. Didn't realise that was
in the caller of bq_enqueue() and not inside bq_enqueue() itself...

Hangbin, you'll need to add the hook for dev_map_run_prog() before
bq_enqueue(); see the existing dev_map_enqueue() function.

-Toke
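
A minimal userspace sketch of the ordering discussed above: run a per-entry program on each destination's own copy of the frame before queuing it, cloning for every destination except the last. All names here (struct frame, struct egress_port, enqueue_multi(), set_peer_mac()) are hypothetical stand-ins and do not correspond to dev_map_run_prog(), bq_enqueue() or any other kernel code; a one-slot pointer stands in for the bulk queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical stand-ins; none of these are kernel types. */
struct frame {
	uint8_t dst_mac[MAC_LEN];
	uint16_t len;
};

struct egress_port;

/* Per-port program: may edit the frame; returns false to drop it. */
typedef bool (*egress_prog_t)(const struct egress_port *port,
			      struct frame *f);

struct egress_port {
	int ifindex;
	uint8_t peer_mac[MAC_LEN];
	egress_prog_t prog;	/* NULL means "send unmodified" */
	struct frame *queued;	/* one-slot queue, enough for the sketch */
};

static struct frame *frame_clone(const struct frame *f)
{
	struct frame *c = malloc(sizeof(*c));

	if (c)
		*c = *f;
	return c;
}

/* Run the port's program first; only queue the frame if it survives. */
static bool run_prog_and_enqueue(struct egress_port *port, struct frame *f)
{
	if (port->prog && !port->prog(port, f)) {
		free(f);
		return false;
	}
	free(port->queued);	/* drop whatever was queued before */
	port->queued = f;
	return true;
}

/* Takes ownership of f. Every port but the last gets a clone; the last
 * port consumes the original. Returns the number of ports that queued
 * the frame, or -1 if a clone could not be allocated. */
static int enqueue_multi(struct egress_port *ports, size_t n,
			 struct frame *f)
{
	int sent = 0;

	assert(f && (ports || n == 0));
	if (n == 0) {
		free(f);
		return 0;
	}
	for (size_t i = 0; i < n; i++) {
		struct frame *copy = (i == n - 1) ? f : frame_clone(f);

		if (!copy) {
			free(f);	/* original not handed out yet */
			return -1;
		}
		if (run_prog_and_enqueue(&ports[i], copy))
			sent++;
	}
	return sent;
}

/* Example per-port program: rewrite the destination MAC for this port. */
static bool set_peer_mac(const struct egress_port *port, struct frame *f)
{
	memcpy(f->dst_mac, port->peer_mac, MAC_LEN);
	return true;
}
```

Because each port's program only ever sees that port's own copy, a header rewrite for one destination cannot leak into the frame sent to another, which is the property the thread is asking for.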
Alexei Starovoitov Sept. 10, 2020, 3:39 p.m. UTC | #6
On Thu, Sep 10, 2020 at 2:44 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Wed, Sep 9, 2020 at 8:30 PM David Ahern <dsahern@gmail.com> wrote:
> >> >
> >> > I think the packets modification (edit dst mac, add vlan tag, etc) should be
> >> > done on egress, which rely on David's XDP egress support.
> >>
> >> agreed. The DEVMAP used for redirect can have programs attached that
> >> update the packet headers - assuming you want to update them.
> >
> > Then you folks have to submit them as one set.
> > As-is the programmer cannot achieve correct behavior.
>
> The ability to attach a program to devmaps is already there. See:
>
> fbee97feed9b ("bpf: Add support to attach bpf program to a devmap entry")

ahh. you meant that one.

> But now that you mention it, it does appear that this series is skipping
> the hook that will actually run such a program. Didn't realise that was
> in the caller of bq_enqueue() and not inside bq_enqueue() itself...
>
> Hangbin, you'll need to add the hook for dev_map_run_prog() before
> bq_enqueue(); see the existing dev_map_enqueue() function.

If that's the expected usage it should have been described in the commit log
and thoroughly exercised in the tests.
Jesper Dangaard Brouer Sept. 10, 2020, 5:50 p.m. UTC | #7
On Thu, 10 Sep 2020 11:44:50 +0200
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Wed, Sep 9, 2020 at 8:30 PM David Ahern <dsahern@gmail.com> wrote:  
> >> >
> >> > I think the packets modification (edit dst mac, add vlan tag, etc) should be
> >> > done on egress, which rely on David's XDP egress support.  
> >>
> >> agreed. The DEVMAP used for redirect can have programs attached that
> >> update the packet headers - assuming you want to update them.  
> >
> > Then you folks have to submit them as one set.
> > As-is the programmer cannot achieve correct behavior.  
> 
> The ability to attach a program to devmaps is already there. See:
> 
> fbee97feed9b ("bpf: Add support to attach bpf program to a devmap entry")
> 
> But now that you mention it, it does appear that this series is skipping
> the hook that will actually run such a program. Didn't realise that was
> in the caller of bq_enqueue() and not inside bq_enqueue() itself...

In the first revisions of Ahern's patchset (before it was fully integrated in
devmap), this was the case, but it changed in some of the last
revisions. (This also lost the sort-n-bulk effect in the process, which
optimizes the I-cache.)  In these earlier revisions it operated on
xdp_frame's.  It would have been a lot easier for Hangbin's patch if
the devmap-prog operated on these xdp_frame's.

Maybe we should change the devmap-prog approach, and run this on the
xdp_frame's (in bq_xmit_all() to be precise).  Hangbin's patchset
clearly shows that we need this "layer" between running the xdp_prog and
the devmap-prog.
David Ahern Sept. 10, 2020, 6:35 p.m. UTC | #8
On 9/10/20 11:50 AM, Jesper Dangaard Brouer wrote:
> Maybe we should change the devmap-prog approach, and run this on the
> xdp_frame's (in bq_xmit_all() to be precise) .  Hangbin's patchset
> clearly shows that we need this "layer" between running the xdp_prog and
> the devmap-prog. 

I would prefer to leave it in dev_map_enqueue.

The main premise at the moment is that the program attached to the
DEVMAP entry is an ACL specific to that dev. If the program is going to
drop the packet, then no sense queueing it.

I also expect a follow on feature will be useful to allow the DEVMAP
program to do another REDIRECT (e.g., potentially after modifying). It
is not handled at the moment as it needs thought - e.g., limiting the
number of iterative redirects. If such a feature does happen, then no
sense queueing it to the current device.
Jesper Dangaard Brouer Sept. 11, 2020, 7:58 a.m. UTC | #9
On Thu, 10 Sep 2020 12:35:33 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 9/10/20 11:50 AM, Jesper Dangaard Brouer wrote:
> > Maybe we should change the devmap-prog approach, and run this on the
> > xdp_frame's (in bq_xmit_all() to be precise) .  Hangbin's patchset
> > clearly shows that we need this "layer" between running the xdp_prog and
> > the devmap-prog.   
> 
> I would prefer to leave it in dev_map_enqueue.
> 
> The main premise at the moment is that the program attached to the
> DEVMAP entry is an ACL specific to that dev. If the program is going to
> drop the packet, then no sense queueing it.
> 
> I also expect a follow on feature will be useful to allow the DEVMAP
> program to do another REDIRECT (e.g., potentially after modifying). It
> is not handled at the moment as it needs thought - e.g., limiting the
> number of iterative redirects. If such a feature does happen, then no
> sense queueing it to the current device.

It makes a lot of sense to do queuing before redirecting again.  The
(hidden) bulking we do at XDP redirect is the primary reason for the
performance boost. We all remember the performance difference between
the map and non-map versions of redirect (which Toke fixed by always having the
bulking available in net_device->xdp_bulkq).

In a simple micro-benchmark I bet it will look better running the
devmap-prog right after the xdp_prog (which is what we have today). But
I claim this is the wrong approach, as soon as (1) traffic is more
intermixed, and (2) devmap-prog gets bigger and becomes more specific
to the egress-device (e.g. BPF update constants per egress-device).
When this happens performance suffers, as I-cache and data-access to
each egress-device gets pushed out of cache. (Hint VPP/fd.io approach)

Queuing xdp_frames up for your devmap-prog makes sense, as these share
common properties.  With intermixed traffic the first xdp_prog will sort
packets into egress-devices, and then the devmap-prog can operate on
these.  The best illustration[1] of this sorting that I have seen is in a Netflix
blogpost[2] about FreeBSD, section "RSS Assisted LRO" (not directly
related, but the illustration is good).


[1] https://miro.medium.com/max/700/1%2alTGL1_D6hTMEMa7EDV8yZA.png
[2] https://netflixtechblog.com/serving-100-gbps-from-an-open-connect-appliance-cdb51dda3b99
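
The sort-then-bulk argument can be made concrete with a toy model. The sketch below counts how often the "active" per-device program changes, as a crude proxy for I-cache locality; it measures nothing real. MAX_DEVS, BULK_SIZE, struct bulk_queue and both functions are assumptions made up for the sketch, and devices are numbered 0..MAX_DEVS-1 rather than by ifindex.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEVS  4	/* hypothetical number of egress devices */
#define BULK_SIZE 16	/* hypothetical per-device batch capacity */

struct bulk_queue {
	int pkt_id[BULK_SIZE];
	size_t count;
};

/* Per-device program runs right after classifying each packet, in
 * arrival order: count every change of the active program. */
static size_t switches_interleaved(const int *egress_dev, size_t n)
{
	size_t switches = 0;

	for (size_t i = 0; i < n; i++)
		if (i == 0 || egress_dev[i] != egress_dev[i - 1])
			switches++;
	return switches;
}

/* Stage 1 sorts packets into per-device queues; stage 2 runs each
 * device's program once per batch. A full queue is flushed early,
 * which counts as one more batch. */
static size_t switches_bulked(const int *egress_dev, size_t n)
{
	struct bulk_queue q[MAX_DEVS] = {0};
	size_t switches = 0;

	for (size_t i = 0; i < n; i++) {
		struct bulk_queue *bq;

		assert(egress_dev[i] >= 0 && egress_dev[i] < MAX_DEVS);
		bq = &q[egress_dev[i]];
		if (bq->count == BULK_SIZE) {
			switches++;	/* early flush of a full batch */
			bq->count = 0;
		}
		bq->pkt_id[bq->count++] = (int)i;
	}
	for (size_t d = 0; d < MAX_DEVS; d++)
		if (q[d].count)
			switches++;	/* final flush of this device */
	return switches;
}
```

For the arrival order 0, 1, 0, 1, 0, 1, 2, 2 the interleaved model switches programs seven times, while the bulked model runs three batches. Whether that difference shows up in real throughput is exactly what testing would need to establish.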
David Ahern Sept. 15, 2020, 4:12 p.m. UTC | #10
On 9/11/20 1:58 AM, Jesper Dangaard Brouer wrote:
> On Thu, 10 Sep 2020 12:35:33 -0600
> David Ahern <dsahern@gmail.com> wrote:
> 
>> On 9/10/20 11:50 AM, Jesper Dangaard Brouer wrote:
>>> Maybe we should change the devmap-prog approach, and run this on the
>>> xdp_frame's (in bq_xmit_all() to be precise) .  Hangbin's patchset
>>> clearly shows that we need this "layer" between running the xdp_prog and
>>> the devmap-prog.   
>>
>> I would prefer to leave it in dev_map_enqueue.
>>
>> The main premise at the moment is that the program attached to the
>> DEVMAP entry is an ACL specific to that dev. If the program is going to
>> drop the packet, then no sense queueing it.
>>
>> I also expect a follow on feature will be useful to allow the DEVMAP
>> program to do another REDIRECT (e.g., potentially after modifying). It
>> is not handled at the moment as it needs thought - e.g., limiting the
>> number of iterative redirects. If such a feature does happen, then no
>> sense queueing it to the current device.
> 
> It makes a lot of sense to do queuing before redirecting again.  The
> (hidden) bulking we do at XDP redirect is the primary reason for the
> performance boost. We all remember performance difference between
> non-map version of redirect (which Toke fixed via always having the
> bulking available in net_device->xdp_bulkq).
> 
> In a simple micro-benchmark I bet it will look better running the
> devmap-prog right after the xdp_prog (which is what we have today). But
> I claim this is the wrong approach, as soon as (1) traffic is more
> intermixed, and (2) devmap-prog gets bigger and becomes more specific
> to the egress-device (e.g. BPF update constants per egress-device).
> When this happens performance suffers, as I-cache and data-access to
> each egress-device gets pushed out of cache. (Hint VPP/fd.io approach)
> 
> Queuing xdp_frames up for your devmap-prog makes sense, as these share
> common properties.  With intermix traffic the first xdp_prog will sort
> packets into egress-devices, and then the devmap-prog can operate on
> these.  The best illustration[1] of this sorting I saw in a Netflix
> blogpost[2] about FreeBSD, section "RSS Assisted LRO" (not directly
> related, but illustration was good).
> 
> 
> [1] https://miro.medium.com/max/700/1%2alTGL1_D6hTMEMa7EDV8yZA.png
> [2] https://netflixtechblog.com/serving-100-gbps-from-an-open-connect-appliance-cdb51dda3b99
> 

I understand the theory and testing will need to bear that out. There is
a bit of distance (code wise) between where the program is run now and
where you want to put it - the conversion from xdp_buff
to xdp_frame, the enqueue, and what it means to do a redirect to another
device in bq_xmit_all.

More important for a redirect, though, is the current xdp_ok_fwd_dev
check in __xdp_enqueue, which for a redirect could be doing the wrong
checks for the wrong device.

Patch

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 884392297874..01c8d82ff2e4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1342,6 +1342,11 @@  int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
 int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 		    struct net_device *dev_rx);
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex);
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags);
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog);
 bool dev_map_can_have_prog(struct bpf_map *map);
@@ -1517,6 +1522,21 @@  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return 0;
 }
 
+static inline
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	return false;
+}
+
+static inline
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags)
+{
+	return 0;
+}
+
 struct sk_buff;
 
 static inline int dev_map_generic_redirect(struct bpf_dtab_netdev *dst,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 995625950cc1..583dbd4c8dce 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -612,6 +612,7 @@  struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	struct bpf_map *ex_map;
 	u32 kern_flags;
 };
 
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 3814fb631d52..8453d477bb22 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -132,6 +132,7 @@  void xdp_warn(const char *msg, const char *func, const int line);
 #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf);
 
 static inline
 void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8dda13880957..60785cf1989c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3576,6 +3576,27 @@  union bpf_attr {
  * 		the data in *dst*. This is a wrapper of copy_from_user().
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		This is a multicast implementation for XDP redirect. It will
+ * 		redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map*.
+ *
+ * 		The forwarding *map* could be either BPF_MAP_TYPE_DEVMAP or
+ * 		BPF_MAP_TYPE_DEVMAP_HASH. But the *ex_map* must be
+ * 		BPF_MAP_TYPE_DEVMAP_HASH to get better performance.
+ *
+ * 		Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map() as a unicast implementation,
+ * 		which supports redirecting packet to a specific ifindex
+ * 		in the map. As both helpers use struct bpf_redirect_info
+ * 		to store the redirect info, we will use a NULL tgt_value
+ * 		to distinguish multicast and unicast redirecting.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3727,6 +3748,7 @@  union bpf_attr {
 	FN(inode_storage_delete),	\
 	FN(d_path),			\
 	FN(copy_from_user),		\
+	FN(redirect_map_multi),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -3898,6 +3920,11 @@  enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 2b5ca93c17de..f9a4b663c713 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -511,6 +511,138 @@  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,
 	return __xdp_enqueue(dev, xdp, dev_rx);
 }
 
+/* Use direct call in fast path instead of map->ops->map_get_next_key() */
+static int devmap_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		return dev_map_get_next_key(map, key, next_key);
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		return dev_map_hash_get_next_key(map, key, next_key);
+	default:
+		break;
+	}
+
+	return -ENOENT;
+}
+
+bool dev_in_exclude_map(struct bpf_dtab_netdev *obj, struct bpf_map *map,
+			int exclude_ifindex)
+{
+	if (obj->dev->ifindex == exclude_ifindex)
+		return true;
+
+	if (!map)
+		return false;
+
+	return __dev_map_hash_lookup_elem(map, obj->dev->ifindex) != NULL;
+}
+
+static struct bpf_dtab_netdev *devmap_get_next_obj(struct xdp_buff *xdp, struct bpf_map *map,
+						   struct bpf_map *ex_map, u32 *key,
+						   u32 *next_key, int ex_ifindex)
+{
+	struct bpf_dtab_netdev *obj;
+	struct net_device *dev;
+	u32 *tmp_key = key;
+	u32 index;
+	int err;
+
+	err = devmap_get_next_key(map, tmp_key, next_key);
+	if (err)
+		return NULL;
+
+	/* When using dev map hash, we could restart the hashtab traversal
+	 * in case the key has been updated/removed in the mean time.
+	 * So we may end up potentially looping due to traversal restarts
+	 * from first elem.
+	 *
+	 * Let's use map's max_entries to limit the loop number.
+	 */
+	for (index = 0; index < map->max_entries; index++) {
+		switch (map->map_type) {
+		case BPF_MAP_TYPE_DEVMAP:
+			obj = __dev_map_lookup_elem(map, *next_key);
+			break;
+		case BPF_MAP_TYPE_DEVMAP_HASH:
+			obj = __dev_map_hash_lookup_elem(map, *next_key);
+			break;
+		default:
+			break;
+		}
+
+		if (!obj || dev_in_exclude_map(obj, ex_map, ex_ifindex))
+			goto find_next;
+
+		dev = obj->dev;
+
+		if (!dev->netdev_ops->ndo_xdp_xmit)
+			goto find_next;
+
+		err = xdp_ok_fwd_dev(dev, xdp->data_end - xdp->data);
+		if (unlikely(err))
+			goto find_next;
+
+		return obj;
+
+find_next:
+		tmp_key = next_key;
+		err = devmap_get_next_key(map, tmp_key, next_key);
+		if (err)
+			break;
+	}
+
+	return NULL;
+}
+
+int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx,
+			  struct bpf_map *map, struct bpf_map *ex_map,
+			  u32 flags)
+{
+	struct bpf_dtab_netdev *obj = NULL, *next_obj = NULL;
+	struct xdp_frame *xdpf, *nxdpf;
+	bool last_one = false;
+	int ex_ifindex;
+	u32 key, next_key;
+
+	ex_ifindex = flags & BPF_F_EXCLUDE_INGRESS ? dev_rx->ifindex : 0;
+
+	/* Find first available obj */
+	obj = devmap_get_next_obj(xdp, map, ex_map, NULL, &key, ex_ifindex);
+	if (!obj)
+		return 0;
+
+	xdpf = xdp_convert_buff_to_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	for (;;) {
+		/* Check if we still have one more available obj */
+		next_obj = devmap_get_next_obj(xdp, map, ex_map, &key,
+					       &next_key, ex_ifindex);
+		if (!next_obj)
+			last_one = true;
+
+		if (last_one) {
+			bq_enqueue(obj->dev, xdpf, dev_rx);
+			return 0;
+		}
+
+		nxdpf = xdpf_clone(xdpf);
+		if (unlikely(!nxdpf)) {
+			xdp_return_frame_rx_napi(xdpf);
+			return -ENOMEM;
+		}
+
+		bq_enqueue(obj->dev, nxdpf, dev_rx);
+
+		/* Deal with next obj */
+		obj = next_obj;
+		key = next_key;
+	}
+}
+
 int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,
 			     struct bpf_prog *xdp_prog)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 95444022f74c..d79068df2b10 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4273,6 +4273,7 @@  static int check_map_func_compatibility(struct bpf_verifier_env *env,
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
 		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_redirect_map_multi &&
 		    func_id != BPF_FUNC_map_lookup_elem)
 			goto error;
 		break;
@@ -4372,6 +4373,11 @@  static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
+	case BPF_FUNC_redirect_map_multi:
+		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
+		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH)
+			goto error;
+		break;
 	case BPF_FUNC_sk_redirect_map:
 	case BPF_FUNC_msg_redirect_map:
 	case BPF_FUNC_sock_map_update:
diff --git a/net/core/filter.c b/net/core/filter.c
index 47eef9a0be6a..a2999ea8178b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3539,12 +3539,19 @@  static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 };
 
 static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
+			    struct bpf_map *map, struct xdp_buff *xdp,
+			    struct bpf_map *ex_map, u32 flags)
 {
 	switch (map->map_type) {
 	case BPF_MAP_TYPE_DEVMAP:
 	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
+		/* We use a NULL fwd value to distinguish multicast
+		 * and unicast forwarding
+		 */
+		if (fwd)
+			return dev_map_enqueue(fwd, xdp, dev_rx);
+		else
+			return dev_map_enqueue_multi(xdp, dev_rx, map, ex_map, flags);
 	case BPF_MAP_TYPE_CPUMAP:
 		return cpu_map_enqueue(fwd, xdp, dev_rx);
 	case BPF_MAP_TYPE_XSKMAP:
@@ -3601,12 +3608,14 @@  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = READ_ONCE(ri->map);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (unlikely(!map)) {
@@ -3618,7 +3627,7 @@  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 
 		err = dev_xdp_enqueue(fwd, xdp, dev);
 	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		err = __bpf_tx_xdp_map(dev, fwd, map, xdp, ex_map, ri->flags);
 	}
 
 	if (unlikely(err))
@@ -3632,6 +3641,62 @@  int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
+static int dev_map_redirect_multi(struct net_device *dev, struct sk_buff *skb,
+				  struct bpf_prog *xdp_prog,
+				  struct bpf_map *map, struct bpf_map *ex_map,
+				  u32 flags)
+
+{
+	struct bpf_dtab_netdev *dst;
+	struct sk_buff *nskb;
+	bool exclude_ingress;
+	u32 key, next_key, index;
+	void *fwd;
+	int err;
+
+	/* Get first key from forward map */
+	err = map->ops->map_get_next_key(map, NULL, &key);
+	if (err)
+		return err;
+
+	exclude_ingress = !!(flags & BPF_F_EXCLUDE_INGRESS);
+
+	/* When using a devmap hash, the hashtab traversal may restart
+	 * if the key has been updated/removed in the meantime, so we
+	 * could end up looping due to traversal restarts from the
+	 * first element.
+	 *
+	 * Use the map's max_entries to bound the number of iterations.
+	 */
+	for (index = 0; index < map->max_entries; index++) {
+		fwd = __xdp_map_lookup_elem(map, key);
+		if (fwd) {
+			dst = (struct bpf_dtab_netdev *)fwd;
+			if (dev_in_exclude_map(dst, ex_map,
+					       exclude_ingress ? dev->ifindex : 0))
+				goto find_next;
+
+			nskb = skb_clone(skb, GFP_ATOMIC);
+			if (!nskb)
+				return -ENOMEM;
+
+			/* Try to forward to the next one whether or not
+			 * the current forward succeeded */
+			dev_map_generic_redirect(dst, nskb, xdp_prog);
+		}
+
+find_next:
+		err = map->ops->map_get_next_key(map, &key, &next_key);
+		if (err)
+			break;
+
+		key = next_key;
+	}
+
+	consume_skb(skb);
+	return 0;
+}
+
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
@@ -3639,19 +3704,30 @@  static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	struct bpf_map *ex_map = ri->ex_map;
 	u32 index = ri->tgt_index;
 	void *fwd = ri->tgt_value;
 	int err = 0;
 
 	ri->tgt_index = 0;
 	ri->tgt_value = NULL;
+	ri->ex_map = NULL;
 	WRITE_ONCE(ri->map, NULL);
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
 	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
-		struct bpf_dtab_netdev *dst = fwd;
+		/* We use a NULL fwd value to distinguish multicast
+		 * and unicast forwarding
+		 */
+		if (fwd) {
+			struct bpf_dtab_netdev *dst = fwd;
+
+			err = dev_map_generic_redirect(dst, skb, xdp_prog);
+		} else {
+			err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+						     ex_map, ri->flags);
+		}
 
-		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
 	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
@@ -3765,6 +3841,36 @@  static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_xdp_redirect_map_multi, struct bpf_map *, map,
+	   struct bpf_map *, ex_map, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	/* Limit ex_map type to DEVMAP_HASH to get better performance */
+	if (unlikely((ex_map && ex_map->map_type != BPF_MAP_TYPE_DEVMAP_HASH) ||
+		     flags & ~BPF_F_EXCLUDE_INGRESS))
+		return XDP_ABORTED;
+
+	ri->tgt_index = 0;
+	/* Set the tgt_value to NULL to distinguish from bpf_xdp_redirect_map */
+	ri->tgt_value = NULL;
+	ri->flags = flags;
+	ri->ex_map = ex_map;
+
+	WRITE_ONCE(ri->map, map);
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_multi_proto = {
+	.func           = bpf_xdp_redirect_map_multi,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_CONST_MAP_PTR,
+	.arg2_type      = ARG_CONST_MAP_PTR_OR_NULL,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -6833,6 +6939,8 @@  xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_proto;
 	case BPF_FUNC_redirect_map:
 		return &bpf_xdp_redirect_map_proto;
+	case BPF_FUNC_redirect_map_multi:
+		return &bpf_xdp_redirect_map_multi_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_fib_lookup:
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 48aba933a5a8..9fd3e89768c4 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -467,3 +467,32 @@  void xdp_warn(const char *msg, const char *func, const int line)
 	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
+
+struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
+{
+	unsigned int headroom, totalsize;
+	struct xdp_frame *nxdpf;
+	struct page *page;
+	void *addr;
+
+	headroom = xdpf->headroom + sizeof(*xdpf);
+	totalsize = headroom + xdpf->len;
+
+	if (unlikely(totalsize > PAGE_SIZE))
+		return NULL;
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+	addr = page_to_virt(page);
+
+	memcpy(addr, xdpf, totalsize);
+
+	nxdpf = addr;
+	nxdpf->data = addr + headroom;
+	nxdpf->frame_sz = PAGE_SIZE;
+	nxdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+	nxdpf->mem.id = 0;
+
+	return nxdpf;
+}
+EXPORT_SYMBOL_GPL(xdpf_clone);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 8dda13880957..60785cf1989c 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3576,6 +3576,27 @@  union bpf_attr {
  * 		the data in *dst*. This is a wrapper of copy_from_user().
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * long bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, u64 flags)
+ * 	Description
+ * 		This is a multicast implementation for XDP redirect. It will
+ * 		redirect the packet to ALL the interfaces in *map*, but
+ * 		exclude the interfaces in *ex_map*.
+ *
+ * 		The forwarding *map* can be either BPF_MAP_TYPE_DEVMAP or
+ * 		BPF_MAP_TYPE_DEVMAP_HASH, but the *ex_map* must be
+ * 		BPF_MAP_TYPE_DEVMAP_HASH to get better performance.
+ *
+ * 		Currently *flags* only supports *BPF_F_EXCLUDE_INGRESS*,
+ * 		which additionally excludes the current ingress device.
+ *
+ * 		See also bpf_redirect_map(), the unicast implementation,
+ * 		which redirects a packet to a specific ifindex in the
+ * 		map. As both helpers use struct bpf_redirect_info to
+ * 		store the redirect info, a NULL tgt_value is used to
+ * 		distinguish multicast from unicast redirecting.
+ * 	Return
+ * 		**XDP_REDIRECT** on success, or **XDP_ABORTED** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3727,6 +3748,7 @@  union bpf_attr {
 	FN(inode_storage_delete),	\
 	FN(d_path),			\
 	FN(copy_from_user),		\
+	FN(redirect_map_multi),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -3898,6 +3920,11 @@  enum bpf_lwt_encap_mode {
 	BPF_LWT_ENCAP_IP,
 };
 
+/* BPF_FUNC_redirect_map_multi flags. */
+enum {
+	BPF_F_EXCLUDE_INGRESS		= (1ULL << 0),
+};
+
 #define __bpf_md_ptr(type, name)	\
 union {					\
 	type name;			\