igmp: Staggered igmp report intervals for unsolicited igmp reports

Message ID	alpine.DEB.2.00.1009221400010.32661@router.home
State	Changes Requested, archived
Delegated to:	David Miller
Headers
    Return-Path: <netdev-owner@vger.kernel.org>
    Date: Wed, 22 Sep 2010 14:01:28 -0500 (CDT)
    From: Christoph Lameter <cl@linux.com>
    To: linux-rdma@vger.kernel.org, netdev@vger.kernel.org
    Cc: Bob Arendt <rda@rincon.com>, "David S. Miller" <davem@davemloft.net>, David L Stevens <dlstevens@us.ibm.com>
    Subject: igmp: Staggered igmp report intervals for unsolicited igmp reports
    In-Reply-To: <alpine.DEB.2.00.1009221354410.32661@router.home>
    Message-ID: <alpine.DEB.2.00.1009221400010.32661@router.home>
    References: <alpine.DEB.2.00.1009221354410.32661@router.home>
    User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
    MIME-Version: 1.0
    Content-Type: TEXT/PLAIN; charset=US-ASCII
    Sender: netdev-owner@vger.kernel.org
    Precedence: bulk

Christoph Lameter (Ampere) Sept. 22, 2010, 7:01 p.m. UTC

The earlier patch added an initial minimum latency and got us up to
~80ms. However, there are large networks that take longer to configure
multicast paths.

This patch changes the behavior for unsolicited igmp reports to ensure
that even sporadic loss of the initial IGMP reports will result in a
reasonably fast subscription.

The RFC states that the first igmp report should be sent immediately and
then mentions that a couple more should be sent, but it does not specify
exactly how the igmp reports should be repeated. The RFC suggests that
the behavior in response to an IGMP query (randomized response in the
range 0 to max response time) could be followed, but we have seen issues
with this suggestion since the intervals can be very short. There is also
no reason to randomize since the unsolicited reports are not a response to
an igmp query but the result of a join request in the code.

The patch here establishes more fixed delays for sending unsolicited
igmp reports after a join. There is still a fuzz factor, but the igmp
reports follow a set of intervals more tightly, and up to 7 igmp reports
are sent.

IGMP Report   Time delay
------------------------------------------------------------
0             3 ticks      "immediate" according to RFC.
1             40ms
2             200ms
3             1sec
4             5sec
5             10sec
6             60sec

So unsolicited reports are sent for an interval of at least a minute
(reports are aborted if igmp reports or other info is seen).
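
As a rough illustration of the schedule in the table, here is a minimal
Python sketch (not the kernel code). It assumes that each table entry is a
delay relative to the previous report, that HZ is 1000 so that 3 ticks is
3ms, and that the fuzz adds up to 10% to each delay. None of these three
assumptions is stated in this message; the names are made up for the
example.

```python
import random

# Base delays from the table, in seconds. "3 ticks" depends on the kernel's
# HZ setting; HZ = 1000 is an assumption made only for this illustration.
HZ = 1000
BASE_DELAYS = [3 / HZ, 0.040, 0.200, 1.0, 5.0, 10.0, 60.0]

# Assumed fuzz: up to 10% on top of each base delay. The actual size of the
# fuzz factor is not given in the message.
FUZZ_FRACTION = 0.10


def staggered_schedule(rng):
    """Return send times (seconds after the join) for the 7 reports.

    Assumes each table entry is a delay relative to the previous report.
    """
    now = 0.0
    times = []
    for base in BASE_DELAYS:
        now += base + rng.uniform(0.0, base * FUZZ_FRACTION)
        times.append(now)
    return times


schedule = staggered_schedule(random.Random(1))
for index, when in enumerate(schedule):
    print(f"report {index}: t = {when:8.3f} s")
```

Under these assumptions the last report goes out a little over 76 seconds
after the join, which is consistent with "at least a minute".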

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 net/ipv4/igmp.c |   38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

David Stevens Sept. 22, 2010, 7:30 p.m. UTC | #1

Christoph Lameter <cl@linux.com> wrote on 09/22/2010 12:01:28 PM:

> The earlier patch added an initial minimum latency and got us up to
> ~80ms. However, there are large networks that take longer to configure
> multicast paths.

        I feel your pain, but the protocol allows this to be 0 and all
of the unsolicited reports can be lost. I don't think adding a minimum
latency solves a general problem. Perhaps the device should queue some
packets if it isn't ready quickly? A querier is what makes these
reliable, but for the start-up in particular, I think it'd be better
to not initiate the send on devices that have this problem until the
device is actually ready to send-- why not put the delay in the device
driver on initialization?
 
> with this suggestion since the intervals can be very short. There is also
> no reason to randomize since the unsolicited reports are not a response to
> an igmp query but the result of a join request in the code.

        These are also staggered to prevent a storm by mass reboots, e.g.,
from a power outage, and the default groups are joined on interface
bring-up.


> The patch here establishes more fixed delays for sending unsolicited
> igmp reports after join. There is still a fuzz factor associated but the
> sending of the igmp reports follows more tightly a set of intervals and sends
> up to 7 igmp reports.
>
> IGMP Report   Time delay
> ------------------------------------------------------------
> 0      3 ticks      "immediate" according to RFC.
> 1      40ms
> 2      200ms
> 3      1sec
> 4      5sec
> 5      10sec
> 6      60sec
>
> So unsolicited reports are sent for an interval of at least a minute
> (reports are aborted if igmp reports or other info is seen).

        This is outside the protocol spec, and the intervals are neither
random nor scaled based on any network performance metric.

1) I'm not sure there's a problem here to solve, other than for your
        particular hardware.
2) I think this would better be solved in the driver-- don't do the
        upper initialization and group joins until the sends can actually
        succeed.
3) I don't think it's a good idea to make up intervals, and especially
        non-randomized ones. The probability of getting all minimum
        intervals is very low (which goes back to #1) and sending fixed
        intervals may introduce a problem (packet storms) that isn't there
        per RFC. These fixed intervals can also be either way too long or
        way too short, depending on link characteristics they don't account
        for. Leaving the intervals randomized based on querier-supplied data
        seems much more appropriate to me.
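
One way to put a number on "very low" in point 3 is the small Python
calculation below. It is only a sketch under stated assumptions: the repeat
intervals are independent uniform draws from (0, interval], the interval is
10 seconds, there are 2 repeats, and "minimum" means shorter than 80ms. All
four values are picked for illustration, not taken from the code.

```python
def prob_all_intervals_below(threshold_s, interval_s, repeats):
    """Probability that every one of `repeats` independent draws from a
    uniform distribution on (0, interval_s] is at most threshold_s."""
    if threshold_s >= interval_s:
        return 1.0
    return (threshold_s / interval_s) ** repeats


# Illustrative numbers only: 80ms threshold, 10s interval, 2 repeats.
p = prob_all_intervals_below(threshold_s=0.080, interval_s=10.0, repeats=2)
print(f"{p:.1e}")  # prints 6.4e-05
```

With a 1 second interval and the same threshold the figure rises to
6.4e-03, so the estimate depends heavily on the interval in use.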


                                                                +-DLS

 

Christoph Lameter (Ampere) Sept. 22, 2010, 7:58 p.m. UTC | #2

On Wed, 22 Sep 2010, David Stevens wrote:

>         I feel your pain, but the protocol allows this to be 0 and all
> of the unsolicited reports can be lost. I don't think adding a minimum
> latency solves a general problem. Perhaps the device should queue some

The protocol does not specify the intervals during unsolicited igmp
sends. It only specifies the intervals as a result of an igmp query.

> packets if it isn't ready quickly? A querier is what makes these
> reliable, but for the start-up in particular, I think it'd be better
> to not initiate the send on devices that have this problem until the
> device is actually ready to send-- why not put the delay in the device
> driver on initialization?

The device is ready. It's just the multicast group that has not been
established yet.

> > an igmp query but the result of a join request in the code.
>
>         These are also staggered to prevent a storm by mass reboots, e.g.,
> from a power outage, and the default groups are joined on interface
> bring-up.

There is still some staggering left (see IGMP_Unsolicited_Fuzz). I can
increase that if necessary.

There also cannot be any storm since any unsolicited igmp report by any
system will stop the unsolicited igmp reports by any other system.

> > So unsolicited reports are send for an interval of at least a minute
> > (reports are aborted if igmp reports or other info is seen).
>
>         This is outside the protocol spec, and the intervals are neither
> random nor scaled based on any network performance metric.

Where does it say that in the spec? Again this is an *unsolicited* igmp
report.

> 2) I think this would better be solved in the driver-- don't do the
>         upper initialization and group joins until the sends can actually
>         succeed.

The driver is fine. It's just the multicast path in the network that takes
time to establish.

> 3) I don't think it's a good idea to make up intervals, and especially
>         non-randomized ones. The probability of getting all minimum
>         intervals is very low (which goes back to #1) and sending fixed
>         intervals may introduce a problem (packet storms) that isn't there
>         per RFC. These fixed intervals can also be either way too long or
>         way too short, depending on link characteristics they don't account
>         for. Leaving the intervals randomized based on querier-supplied data
>         seems much more appropriate to me.

These are *unsolicited* igmp reports. There is *no* querier supplied data
yet. The first querier supplied data (or any other unsolicited igmp
report) will immediately stop the unsolicited reports, and from then on
the host will respond in randomized intervals based on the data that the
querier has supplied.


David Stevens Sept. 22, 2010, 8:36 p.m. UTC | #3

Christoph Lameter <cl@linux.com> wrote:
> 
> On Wed, 22 Sep 2010, David Stevens wrote:
> 
> >         I feel your pain, but the protocol allows this to be 0 and all
> > of the unsolicited reports can be lost. I don't think adding a minimum
> > latency solves a general problem. Perhaps the device should queue some
> 
> The protocol does not specify the intervals during unsolicited igmp
> sends. It only specifies the intervals as a result of an igmp query.

RFC 3376:
"  To cover the possibility of the State-Change Report being missed by
   one or more multicast routers, it is retransmitted [Robustness
   Variable] - 1 more times, at intervals chosen at random from the
   range (0, [Unsolicited Report Interval])."
and
"8.11. Unsolicited Report Interval

   The Unsolicited Report Interval is the time between repetitions of a
   host's initial report of membership in a group.  Default: 1 second."

> The device is ready. Its just the multicast group that has not been
> established yet.
        Well, if you know that's going to happen with your device, then
again, why not queue them on start up until you have indication that
the group has been established, or delay in the driver.
        You're changing IGMP for all device types to fix a problem in
only one.
 
> There also cannot be any storm since any unsolicited igmp report by any
> system will stop the unsolicited igmp reports by any other system.

        Not if they are simultaneous, which is exactly when it is a 
problem. :-)
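
The disagreement about suppression can be made concrete with a small
Python sketch. It models IGMPv2-style report suppression for a single
group: a host cancels its pending report once it has heard another host's
report. The function name, the single shared propagation delay and the
example numbers are assumptions for illustration, not anything taken from
the kernel.

```python
import random


def reports_sent(timer_expiries, propagation_delay):
    """Count the hosts that actually transmit a report for one group.

    A host is suppressed only if the earliest report has already reached
    it when its own timer expires. Hosts whose timers expire before that
    moment all transmit.
    """
    first = min(timer_expiries)
    heard_at = first + propagation_delay
    return sum(1 for t in timer_expiries if t == first or t < heard_at)


rng = random.Random(0)
spread_out = [rng.uniform(0.0, 10.0) for _ in range(100)]  # randomized timers
in_lockstep = [0.040] * 100                                # identical timers

print(reports_sent(spread_out, propagation_delay=0.001))
print(reports_sent(in_lockstep, propagation_delay=0.001))  # prints 100
```

With widely spread timers almost every host is suppressed, while hosts
that fire at the same instant all transmit. How much fuzz is needed to
avoid the second case depends on the propagation delay and on how many
hosts join at once.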
> 
> > > So unsolicited reports are send for an interval of at least a minute
> > > (reports are aborted if igmp reports or other info is seen).
> >
> >         This is outside the protocol spec, and the intervals are neither
> > random nor scaled based on any network performance metric.
> 
> Where does it say that in the spec? Again this is an *unsolicited* igmp
> report.

        See quotes above.
 
> These are *unsolicited* igmp reports. There is *no* querier supplied data
> yet. The first querier supplied data (or any other unsolicited igmp
> report) will immediately stop the unsolicited reports and then will
> continue to respond in randomized intervals based on the data that the
> querier has supplied.

        There are initial values, which are currently constants, but it'd
be (more) reasonable to turn those into per-interface tunables or
per-interface initial values with IB interfaces using larger values.

IGMP_Unsolicited_Report_Count (default 2)
IGMP_Unsolicited_Report_Interval (default 10 secs, which is 10x larger, as
        you want, than the RFC suggests).

                                                                +-DLS


Bob Arendt Sept. 22, 2010, 8:56 p.m. UTC | #4

On 09/22/2010 12:58 PM, Christoph Lameter wrote:
> On Wed, 22 Sep 2010, David Stevens wrote:
>> 3) I don't think it's a good idea to make up intervals, and especially
>>          non-randomized ones. The probability of getting all minimum
>>          intervals is very low (which goes back to #1) and sending fixed
>>          intervals may introduce a problem (packet storms) that isn't there
>>          per RFC. These fixed intervals can also be either way too long or
>>          way too short, depending on link characteristics they don't account
>>          for. Leaving the intervals randomized based on querier-supplied data
>>          seems much more appropriate to me.
>
> These are *unsolicited* igmp reports. There is *no* querier supplied data
> yet. The first querier supplied data (or any other unsolicited igmp
> report) will immediately stop the unsolicited reports and then will
> continue to respond in randomized intervals based on the data that the
> querier has supplied.
>
>

There certainly seems to be some backing for part of Christoph's concept in
the IETF RFCs.  I've posted the relevant sections below.  IGMPv2 doesn't specify
a limit on retransmissions of an unsolicited Join, only that they stop once
multicast traffic is received. While IGMPv2 defines an "Unsolicited Report
Interval" default of 10 seconds, it appears that this is a significant enough
issue that the later IGMPv3 document calls out a default of 1 second, and
goes on to define a "Robustness Variable" and talks about the same case that
Christoph is trying to mitigate.

However, both RFCs *do* specify that random timers should be used, based
on a value called the "unsolicited report interval".

Perhaps implementing the IGMPv3 capability with kernel parameters for an
"unsolicited report interval" and "robustness variable" would satisfy
Christoph's issue?

-Bob Arendt

rfc2236 IGMPv2  =============================
Section 3 .... page 4 para 2
    When a host joins a multicast group, it should immediately transmit
    an unsolicited Version 2 Membership Report for that group, in case it
    is the first member of that group on the network.  To cover the
    possibility of the initial Membership Report being lost or damaged,
    it is recommended that it be repeated once or twice after short
    delays [Unsolicited Report Interval].

Section 6 ...  page 8 para 4
- "start timer" for the group on the interface, using a delay value
      chosen uniformly from the interval (0, Max Response Time], where
      Max Response time is specified in the Query.  If this is an
      unsolicited Report, the timer is set to a delay value chosen
      uniformly from the interval (0, [Unsolicited Report Interval] ].

8.10.  Unsolicited Report Interval  (page 18)
    The Unsolicited Report Interval is the time between repetitions of a
    host's initial report of membership in a group.  Default: 10 seconds.

rfc3376 IGMPv3  ============================
Section 5.1 page 19, near end
    (note - unsolicited Join is a type of State-Change report)
    To cover the possibility of the State-Change Report being missed by
    one or more multicast routers, it is retransmitted [Robustness
    Variable] - 1 more times, at intervals chosen at random from the
    range (0, [Unsolicited Report Interval]).

8.11. Unsolicited Report Interval  (page 41)
    The Unsolicited Report Interval is the time between repetitions of a
    host's initial report of membership in a group.  Default: 1 second.

8.1. Robustness Variable (page 39)
    The Robustness Variable allows tuning for the expected packet loss on
    a network.  If a network is expected to be lossy, the Robustness
    Variable may be increased.  IGMP is robust to (Robustness Variable -
    1) packet losses.  The Robustness Variable MUST NOT be zero, and
    SHOULD NOT be one.  Default: 2

8.14.1. Robustness Variable  (page 41/42)
    The Robustness Variable tunes IGMP to expected losses on a link.
    IGMPv3 is robust to (Robustness Variable - 1) packet losses, e.g., if
    the Robustness Variable is set to the default value of 2, IGMPv3 is
    robust to a single packet loss but may operate imperfectly if more
    losses occur.  On lossy subnetworks, the Robustness Variable should
    be increased to allow for the expected level of packet loss. However,
    increasing the Robustness Variable increases the leave latency of the
    subnetwork.  (The leave latency is the time between when the last
    member stops listening to a source or group and when the traffic
    stops flowing.)
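
To make the quoted rfc3376 text concrete, here is a minimal Python sketch
of the retransmission timing it describes. The function and parameter
names are made up for the example, and random.uniform only approximates
the open range (0, [Unsolicited Report Interval]) since it may return an
endpoint.

```python
import random


def retransmit_times(robustness=2, unsolicited_report_interval=1.0, rng=None):
    """Times, in seconds after the initial report, of the retransmissions.

    The report is repeated (robustness - 1) more times, each after an
    interval drawn at random from 0..unsolicited_report_interval.
    """
    rng = rng or random.Random()
    now = 0.0
    times = []
    for _ in range(robustness - 1):
        now += rng.uniform(0.0, unsolicited_report_interval)
        times.append(now)
    return times


# rfc3376 defaults: one retransmission, at most 1 second after the first.
print(retransmit_times(rng=random.Random(1)))

# A lossier or slower network could raise both values, for example:
print(retransmit_times(robustness=4, unsolicited_report_interval=10.0,
                       rng=random.Random(1)))
```

Exposing the two parameters as tunables would let a slow fabric use more
repeats over a longer window without changing the defaults for everyone
else.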

Christoph Lameter (Ampere) Sept. 22, 2010, 9:26 p.m. UTC | #5

On Wed, 22 Sep 2010, David Stevens wrote:

> > The protocol does not specify the intervals during unsolicited igmp
> > sends. It only specifies the intervals as a result of an igmp query.
>
> RFC 3376:
> "  To cover the possibility of the State-Change Report being missed by
>    one or more multicast routers, it is retransmitted [Robustness
>    Variable] - 1 more times, at intervals chosen at random from the
>    range (0, [Unsolicited Report Interval])."
> and
> "8.11. Unsolicited Report Interval
>
>    The Unsolicited Report Interval is the time between repetitions of a
>    host's initial report of membership in a group.  Default: 1 second."

Hmmm, looks like I looked at the earlier RFC 2236, section 3 (I was not
really interested in IGMPv3; IGMPv2 is what is run here).

   When a host joins a multicast group, it should immediately transmit
   an unsolicited Version 2 Membership Report for that group, in case it
   is the first member of that group on the network.  To cover the
   possibility of the initial Membership Report being lost or damaged,
   it is recommended that it be repeated once or twice after short
   delays [Unsolicited Report Interval].  (A simple way to accomplish
   this is to send the initial Version 2 Membership Report and then act
   as if a Group-Specific Query was received for that group, and set a
   timer appropriately).

The new Unsolicited Report Interval is promising. Do we need to support that?

> > The device is ready. Its just the multicast group that has not been
> > established yet.
>         Well, if you know that's going to happen with your device, then
> again, why not queue them on start up until you have indication that
> the group has been established, or delay in the driver.
>         You're changing IGMP for all device types to fix a problem in
> only one.
>
> > There also cannot be any storm since any unsolicited igmp report by any
> > system will stop the unsolicited igmp reports by any other system.
>
>         Not if they are simultaneous, which is exactly when it is a
> problem. :-)

But then they are not simultaneous since there is a fuzz factor.

> > These are *unsolicited* igmp reports. There is *no* querier supplied data
> > yet. The first querier supplied data (or any other unsolicited igmp
> > report) will immediately stop the unsolicited reports and then will
> > continue to respond in randomized intervals based on the data that the
> > querier has supplied.
>
>         There are initial values, which are currently constants, but it'd
> be (more) reasonable to turn those into per-interface tunables or
> per-interface initial values with IB interfaces using larger values.
>
> IGMP_Unsolicited_Report_Count (default 2)
> IGMP_Unsolicited_Report_Interval (default 10secs which is 10x larger,as
>         you want, than the RFC suggests).

Ahhh... Interesting..... 1 second now? That is much better and would avoid
long drawn out joins due to the long delays.


Jason Gunthorpe Sept. 22, 2010, 9:50 p.m. UTC | #6

On Wed, Sep 22, 2010 at 02:58:15PM -0500, Christoph Lameter wrote:
> > packets if it isn't ready quickly? A querier is what makes these
> > reliable, but for the start-up in particular, I think it'd be better
> > to not initiate the send on devices that have this problem until the
> > device is actually ready to send-- why not put the delay in the device
> > driver on initialization?
> 
> The device is ready. Its just the multicast group that has not been
> established yet.

In IB, when the SA replies to a group join, the group should be ready.
Prior to that the device can't send into the group because it has no
MLID for the group. If you have a MLID then the group is working.

Is the issue that you are dropping IGMP packets because the 224.0.0.2 join
hasn't finished? Ideally you'd wait for the SA to reply before sending
an IGMP, but a simpler solution might just be to use the broadcast MLID
for packets addressed to a MGID that has not yet got a MLID. This
would be similar to the Ethernet behaviour of flooding.
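
A toy Python sketch of that fallback, purely to show the lookup logic. The
names, the MGID strings and the MLID numbers are placeholders invented for
the example and do not correspond to the IPoIB driver's data structures.

```python
# Placeholder value, for illustration only.
BROADCAST_MLID = 0xC000


def mlid_for(mgid, resolved_mlids):
    """Pick the MLID to transmit on for a multicast GID.

    Uses the group's own MLID once the SA has supplied one, and falls
    back to the broadcast MLID until then.
    """
    return resolved_mlids.get(mgid, BROADCAST_MLID)


resolved = {"mgid-for-239.1.1.1": 0xC005}    # hypothetical resolved group
print(hex(mlid_for("mgid-for-239.1.1.1", resolved)))  # prints 0xc005
print(hex(mlid_for("mgid-for-239.2.2.2", resolved)))  # prints 0xc000
```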

Jason

Christoph Lameter (Ampere) Sept. 23, 2010, 3:32 p.m. UTC | #7

On Wed, 22 Sep 2010, Jason Gunthorpe wrote:

> > The device is ready. Its just the multicast group that has not been
> > established yet.
>
> In IB when the SA replies to a group join the group should be ready,
> prior to that the device can't send into the group because it has no
> MLID for the group.. If you have a MLID then the group is working.

When the SA replies it has created the MLID but not reconfigured the
fabric yet. So the initial IGMP messages get lost.

> Is the issue you are dropping IGMP packets because the 224.0.0.2 join
> hasn't finished? Ideally you'd wait for the SA to reply before sending
> a IGMP, but a simpler solution might just be to use the broadcast MLID
> for packets addressed to a MGID that has not yet got a MLID. This
> would be similar to the Ethernet behaviour of flooding.

IGMP reports are sent on the multicast group not on 224.0.0.2. 224.0.0.2
is only used when leaving a multicast group.

I also thought about solutions along the same lines. We could modify the
IB layer to send to 224.0.0.2 until the SA has confirmed the
creation of the MC group. For that to work we first would need to modify
the SA logic to ensure that it only sends confirmation *after* the fabric
has been reconfigured. Then we need to switch the MLIDs of the MC group
when the notification is received.

If the IB layer has not joined 224.0.0.2 yet (and it will take a while)
then we could even fall back to broadcast until it's ready.


Jason Gunthorpe Sept. 23, 2010, 5:26 p.m. UTC | #8

On Thu, Sep 23, 2010 at 10:32:17AM -0500, Christoph Lameter wrote:

> > Is the issue you are dropping IGMP packets because the 224.0.0.2 join
> > hasn't finished? Ideally you'd wait for the SA to reply before sending
> > a IGMP, but a simpler solution might just be to use the broadcast MLID
> > for packets addressed to a MGID that has not yet got a MLID. This
> > would be similar to the Ethernet behaviour of flooding.
> 
> IGMP reports are sent on the multicast group not on 224.0.0.2. 224.0.0.2
> is only used when leaving a multicast group.

Hm, that is quite different than in IGMPv3. How does this work at all
in IB? A message to the multicast group isn't going to make it to any
routers unless the routers use some other means to join the IB MGID.

Jason

Christoph Lameter (Ampere) Sept. 23, 2010, 5:37 p.m. UTC | #9

On Thu, 23 Sep 2010, Jason Gunthorpe wrote:

> On Thu, Sep 23, 2010 at 10:32:17AM -0500, Christoph Lameter wrote:
>
> > > Is the issue you are dropping IGMP packets because the 224.0.0.2 join
> > > hasn't finished? Ideally you'd wait for the SA to reply before sending
> > > a IGMP, but a simpler solution might just be to use the broadcast MLID
> > > for packets addressed to a MGID that has not yet got a MLID. This
> > > would be similar to the Ethernet behaviour of flooding.
> >
> > IGMP reports are sent on the multicast group not on 224.0.0.2. 224.0.0.2
> > is only used when leaving a multicast group.
>
> Hm, that is quite different than in IGMPv3.. How does this work at all
> in IB? A message to the multicast group isn't going to make it to any
> routers unless the routers use some other means to join the IB MGID.

IPoIB creates an InfiniBand multicast group via the IB calls for an IP
multicast group. Then IGMP comes into play and the kernel sends the IP
based igmp report. This igmp report must be received by an outside router
(on an IP network) in order for traffic to get forwarded into the IB
fabric. You can end up with an IB multicast configuration that is all fine
but with loss of the unsolicited packets due to fabric reconfiguration not
being complete yet. The larger the fabric the worse the situation.

If all unsolicited igmp reports are lost then the router will
only start forwarding the mc group after the reporting intervals
(which could be in the range of minutes) when it triggers igmp reports
through a general igmp query. Until that time the MC group looks dead. And
people and software may conclude that the **** network is broken.

This is a general issue for any network where configuration for MC
forwarding is needed and where initial igmp reports may get lost. A
staggering of time intervals would be a general solution to that issue.
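
The effect described above can be sketched in a few lines of Python. The
model is deliberately crude: every report sent before the fabric is ready
is lost, every later one gets through, and if all of them are lost the
join only completes after the next general query. The 125 second query
interval and 10 second response interval are the RFC 2236 defaults; the
two schedules and the 0.5 second fabric delay are made-up examples.

```python
def join_latency(report_times, fabric_ready_at,
                 query_interval=125.0, query_response_interval=10.0):
    """Seconds from the join until a report reaches the router.

    Reports sent before fabric_ready_at are treated as lost. If none
    survives, return the worst case of waiting for the next general
    query and then the full response interval.
    """
    for t in sorted(report_times):
        if t >= fabric_ready_at:
            return t
    return query_interval + query_response_interval


# Cumulative send times, assuming 3 ticks = 3ms and no fuzz.
STAGGERED = [0.003, 0.043, 0.243, 1.243, 6.243, 16.243, 76.243]
# Hypothetical schedule where every report goes out within 80ms.
EARLY_ONLY = [0.0, 0.080]

print(join_latency(STAGGERED, fabric_ready_at=0.5))   # prints 1.243
print(join_latency(EARLY_ONLY, fabric_ready_at=0.5))  # prints 135.0
```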


Jason Gunthorpe Sept. 23, 2010, 5:46 p.m. UTC | #10

On Thu, Sep 23, 2010 at 12:37:28PM -0500, Christoph Lameter wrote:
> On Thu, 23 Sep 2010, Jason Gunthorpe wrote:
> 
> > On Thu, Sep 23, 2010 at 10:32:17AM -0500, Christoph Lameter wrote:
> >
> > > > Is the issue you are dropping IGMP packets because the 224.0.0.2 join
> > > > hasn't finished? Ideally you'd wait for the SA to reply before sending
> > > > a IGMP, but a simpler solution might just be to use the broadcast MLID
> > > > for packets addressed to a MGID that has not yet got a MLID. This
> > > > would be similar to the Ethernet behaviour of flooding.
> > >
> > > IGMP reports are sent on the multicast group not on 224.0.0.2. 224.0.0.2
> > > is only used when leaving a multicast group.
> >
> > Hm, that is quite different than in IGMPv3.. How does this work at all
> > in IB? A message to the multicast group isn't going to make it to any
> > routers unless the routers use some other means to join the IB MGID.
> 
> IPoIB creates a infiniband multicast group via the IB calls for a IP
> multicast group. Then IGMP comes into play and the kernel sends the IP
> based igmp report. This igmp report must be received by an outside router
> (on an IP network) in order to for traffic to get forwarded into the IB
> fabric. You can end up with a IB multicast configuration that is all fine
> but with loss of the unsolicited packets due to fabric reconfiguration not
> being complete yet. The larger the fabric the worse the situation.

But my point is that IB has very limited multicast; if I create an IB
group and then send IGMP into that group *it will not reach a router*.

I have to send something to the all routers group or the all IGMPv3
group to get it to reach a router with any reliability.

The only way this kind of scheme could work is if an IGMPv2 IPoIB
router listens for IB MGID Create notices from the SA and
automatically joins all groups that are created, so it can get IGMPv2
membership reports. Which obviously adds more delay, lag, and risk.

I'm *guessing* that the change in IGMPv3 to send reports to 224.0.0.22
(all IGMPv3 multicast address) is related to this sort of problem, and
it seems like on IB IGMPv2 is not a good fit and should not be used if
v3 is available.

Jason

Christoph Lameter (Ampere) Sept. 23, 2010, 5:56 p.m. UTC | #11

On Thu, 23 Sep 2010, Jason Gunthorpe wrote:

> > IPoIB creates a infiniband multicast group via the IB calls for a IP
> > multicast group. Then IGMP comes into play and the kernel sends the IP
> > based igmp report. This igmp report must be received by an outside router
> > (on an IP network) in order to for traffic to get forwarded into the IB
> > fabric. You can end up with a IB multicast configuration that is all fine
> > but with loss of the unsolicited packets due to fabric reconfiguration not
> > being complete yet. The larger the fabric the worse the situation.
>
> But my point is that IB has very limited multicast, if I create a IB
> group and then send IGMP into that group *it will not reach a router*.

The IPoIB routers automatically join all IP MC groups created.

> The only way this kind of scheme could work is if an IGMPv2 IPoIB
> router listens for IB MGID Create notices from the SA and
> automatically joins all groups that are created, so it can get IGMPv2
> membership reports. Which obviously adds more delay, lag, and risk.

Right that is how it works now.

> I'm *guessing* that the change in IGMPv3 to send reports to 224.0.0.22
> (all IGMPv3 multicast address) is related to this sort of problem, and
> it seems like on IB IGMPv2 is not a good fit and should not be used if
> v3 is available..

Existing routers do not support IGMPv3.


David Stevens Sept. 27, 2010, 7:32 p.m. UTC | #12

Christoph Lameter <cl@linux.com> wrote on 09/23/2010 10:37:28 AM:

> 
> If all unsolicited igmp reports are lost then the router will
> only start forwarding the mc group after the reporting intervals
> (which could be in the range of minutes) when it triggers igmp reports
> through a general igmp query. Until that time the MC group looks dead. And
> people and software may conclude that the **** network is broken.

        You can, of course, add a querier (or configure it, assuming
an attached switch supports it) and set the query interval and robustness
count as appropriate for that network.

> This is a general issue for any network where configurations for MC
> forwarding is needed and where initial igmp reports may get lost.

Meaning "IB-only", right? :-) Maybe other NBMA networks too, but
certainly not a typical problem for typical networks (i.e., Ethernet).

> A staggering of time intervals would be a general solution to that issue.

As would be having those networks queue packets for hardware addresses they
know require a delay before a transmit can complete. But that approach can't
adversely affect already-working solutions for typical networks, or
depart unnecessarily from established standard protocols.

 +-DLS

