Message ID | 1224863858-7933-1-git-send-email-galak@kernel.crashing.org (mailing list archive)
---|---
State | Superseded, archived
Delegated to | Benjamin Herrenschmidt
From: Kumar Gala <galak@kernel.crashing.org>
Date: Fri, 24 Oct 2008 10:57:38 -0500

> Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
> on a subset of SMP based PPC systems whose interrupt controller only
> allows setting an irq to a single processor. The previous behavior
> was that only CPU0 was initially set up to get interrupts. Revert
> to that behavior.
>
> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>

I really don't remember getting all of my interrupts only on cpu 0
on sparc64 before any of these changes. I therefore find all of
this quite mysterious. :-)
On Fri, 2008-10-24 at 16:18 -0700, David Miller wrote:
> From: Kumar Gala <galak@kernel.crashing.org>
> Date: Fri, 24 Oct 2008 10:57:38 -0500
>
> > Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
> > on a subset of SMP based PPC systems whose interrupt controller only
> > allows setting an irq to a single processor. The previous behavior
> > was that only CPU0 was initially set up to get interrupts. Revert
> > to that behavior.
> >
> > Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
>
> I really don't remember getting all of my interrupts only on cpu 0
> on sparc64 before any of these changes. I therefore find all of
> this quite mysterious. :-)

Well, I don't know how you do it, but on powerpc we explicitly fill the
affinity masks at boot time when we can spread interrupts... Maybe we
should change it the other way around and limit the mask when we can't?
It's hard to tell for sure at this stage.

Ben.
Benjamin Herrenschmidt wrote:
> On Fri, 2008-10-24 at 16:18 -0700, David Miller wrote:
>
>> From: Kumar Gala <galak@kernel.crashing.org>
>> Date: Fri, 24 Oct 2008 10:57:38 -0500
>>
>>> Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
>>> on a subset of SMP based PPC systems whose interrupt controller only
>>> allows setting an irq to a single processor. The previous behavior
>>> was that only CPU0 was initially set up to get interrupts. Revert
>>> to that behavior.
>>>
>>> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
>>
>> I really don't remember getting all of my interrupts only on cpu 0
>> on sparc64 before any of these changes. I therefore find all of
>> this quite mysterious. :-)
>
> Well, I don't know how you do it, but on powerpc we explicitly fill the
> affinity masks at boot time when we can spread interrupts... Maybe we
> should change it the other way around and limit the mask when we can't?
> It's hard to tell for sure at this stage.
>
> Ben.

What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this thing
supposed to be able to spread irq between its cpus?

kevin
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sun, 26 Oct 2008 08:33:09 +1100

> Well, I don't know how you do it, but on powerpc we explicitly fill the
> affinity masks at boot time when we can spread interrupts... Maybe we
> should change it the other way around and limit the mask when we can't?
> It's hard to tell for sure at this stage.

On sparc64 we look at the cpu mask configured for the interrupt and do
one of two things:

1) If all bits are set, we round robin assign a cpu at IRQ enable time.

2) Else we pick the first bit set in the mask.

One modification I want to make is to make case #1 NUMA aware.

But back to my original wonder, since I've always tipped off of this
generic IRQ layer cpu mask, when was it ever defaulting to zero and
causing the behavior your powerpc guys actually want? :-)
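[Editor's note: a minimal sketch of the two-case selection policy David describes above. This is not the actual sparc64 implementation; the helper name, the unlocked round-robin cursor, and the omission of NUMA/locking details are assumptions, and only the two cases themselves come from the mail.]

	/*
	 * Sketch of the policy described above, NOT the real sparc64 code:
	 *  1) an "all CPUs" affinity mask means round-robin at enable time,
	 *  2) anything narrower means the first CPU set in the mask.
	 * The helper name and the static cursor are illustrative only.
	 */
	static unsigned int sketch_choose_irq_cpu(cpumask_t mask)
	{
		static unsigned int rover;	/* round-robin cursor, unlocked here */

		if (cpus_equal(mask, CPU_MASK_ALL)) {
			/* Case 1: spread by walking the online CPUs in turn */
			do {
				rover = (rover + 1) % NR_CPUS;
			} while (!cpu_online(rover));
			return rover;
		}

		/* Case 2: honour the configured mask; take its first CPU */
		return first_cpu(mask);
	}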
From: Kevin Diggs <kevdig@hypersurf.com>
Date: Sat, 25 Oct 2008 15:53:46 -0700

> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> thing supposed to be able to spread irq between its cpus?

Networking interrupts should lock onto a single CPU, unconditionally.
That's the optimal way to handle networking interrupts, especially
with multiqueue chips.

This is what the userland IRQ balance daemon does.
On Sat, 2008-10-25 at 21:04 -0700, David Miller wrote:
> But back to my original wonder, since I've always tipped off of this
> generic IRQ layer cpu mask, when was it ever defaulting to zero
> and causing the behavior your powerpc guys actually want? :-)

Well, I'm not sure what Kumar wants. Most powerpc SMP setups actually
want to spread interrupts to all CPUs, and those who can't tend to just
not implement set_affinity... So Kumar must have a special case of MPIC
usage here on FSL platforms.

In any case, the platform limitations should be dealt with there or the
user could break it by manipulating affinity via /proc anyway.

But yeah, I do expect default affinity to be all CPUs and in fact, I even
have an -OLD- comment in the code that says

/* let the mpic know we want intrs. default affinity is
0xffffffff ...

Now, I've tried to track that down but it's hard because the generic
code seems to have changed in many ways around affinity handling...

So it looks like nowadays, the generic setup_irq() will call
irq_select_affinity() when an interrupt is first requested. Unless you
set CONFIG_AUTO_IRQ_AFFINITY and implement your own
irq_select_affinity(), you will get the default one, which copies the
content of this global irq_default_affinity to the interrupt. However,
it does that _after_ your IRQ startup() has been called (yes, this is
very fishy), and so after you did your irq_choose_cpu()...

This is all very messy, along with hooks for balancing and other
confusing stuff that I suspect keeps changing. I'll have to spend more
time next week to sort out what exactly is happening on powerpc and
whether we get our interrupts spread or not...

That's the downside of having more generic irq code I suppose: now
people keep rewriting half of the generic code with x86 exclusively in
mind and we have to be extra careful :-)

Cheers,
Ben.
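[Editor's note: as a rough illustration of the generic fallback Ben describes, the default autoselector of that era amounted to something like the sketch below. This is a paraphrase reconstructed from the description above rather than a verbatim copy of kernel/irq/manage.c; the exact field names and the absence of locking are assumptions.]

	#ifndef CONFIG_AUTO_IRQ_AFFINITY
	/*
	 * Paraphrase of the generic affinity autoselector Ben describes:
	 * copy the global irq_default_affinity (trimmed to online CPUs)
	 * into the interrupt and push it down to the chip.  Field names
	 * are from memory and may not match the real source exactly.
	 */
	int irq_select_affinity(unsigned int irq)
	{
		cpumask_t mask;

		if (!irq_can_set_affinity(irq))
			return 0;

		/* default mask restricted to CPUs that are actually online */
		cpus_and(mask, cpu_online_map, irq_default_affinity);

		irq_desc[irq].affinity = mask;
		irq_desc[irq].chip->set_affinity(irq, mask);

		return 0;
	}
	#endif

The "fishy" ordering Ben points out is that setup_irq() runs the chip's startup() hook, where an arch like sparc64 has already done its irq_choose_cpu(), before this copy happens, so a platform that picks a CPU at startup never sees the default mask.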
> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> thing supposed to be able to spread irq between its cpus?

Depends on the interrupt controller. I don't know that machine
but for example the Apple Dual G5s use an MPIC that can spread
based on an internal HW round robin scheme. This isn't always
the best idea though for cache reasons... depends if and at what level
your caches are shared between CPUs.

Ben.
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sun, 26 Oct 2008 17:48:43 +1100

> > What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> > thing supposed to be able to spread irq between its cpus?
>
> Depends on the interrupt controller. I don't know that machine
> but for example the Apple Dual G5s use an MPIC that can spread
> based on an internal HW round robin scheme. This isn't always
> the best idea though for cache reasons... depends if and at what level
> your caches are shared between CPUs.

It's always going to be the wrong thing to do for networking cards,
especially once we start doing RX flow separation in software.
On Sun, 2008-10-26 at 00:16 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Sun, 26 Oct 2008 17:48:43 +1100
>
> > > What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
> > > thing supposed to be able to spread irq between its cpus?
> >
> > Depends on the interrupt controller. I don't know that machine
> > but for example the Apple Dual G5s use an MPIC that can spread
> > based on an internal HW round robin scheme. This isn't always
> > the best idea though for cache reasons... depends if and at what level
> > your caches are shared between CPUs.
>
> It's always going to be the wrong thing to do for networking cards,
> especially once we start doing RX flow separation in software.

True, though I don't have policy in the kernel for that, ie, it's pretty
much irqbalanced's job to do that. At this stage, the kernel always tries
to spread when it can... at least on powerpc.

Ben.
Benjamin Herrenschmidt wrote:
>> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
>> thing supposed to be able to spread irq between its cpus?
>
> Depends on the interrupt controller. I don't know that machine
> but for example the Apple Dual G5s use an MPIC that can spread
> based on an internal HW round robin scheme. This isn't always
> the best idea though for cache reasons... depends if and at what level
> your caches are shared between CPUs.
>
> Ben.

Sorry. I thought GigE was a common name for the machine. It is a dual
450 MHz G4 powermac with a gigabit ethernet and AGP. It now has a
PowerLogix dual 1.1 GHz 7455 in it. I think the L3 caches are separate?
Not sure about the original cpu card. Can the OS tell?

The reason I asked is that I seem to remember a config option that
would restrict the irqs to cpu 0? Help suggested it was needed for
certain PowerMacs. Didn't provide any help as to which ones. My GigE
currently spreads them between the two. I have not noticed any
additional holes in the space-time continuum.

kevin
On Sun, 2008-10-26 at 18:30 -0800, Kevin Diggs wrote:
> The reason I asked is that I seem to remember a config option that
> would restrict the irqs to cpu 0? Help suggested it was needed for
> certain PowerMacs. Didn't provide any help as to which ones. My GigE
> currently spreads them between the two. I have not noticed any
> additional holes in the space-time continuum.

Yeah, a long time ago we had unexplained lockups when spreading
interrupts, hence the config option. I think it's all been fixed since
then.

Ben.
On Oct 26, 2008, at 1:33 AM, Benjamin Herrenschmidt wrote:

> On Sat, 2008-10-25 at 21:04 -0700, David Miller wrote:
>> But back to my original wonder, since I've always tipped off of this
>> generic IRQ layer cpu mask, when was it ever defaulting to zero
>> and causing the behavior your powerpc guys actually want? :-)
>
> Well, I'm not sure what Kumar wants. Most powerpc SMP setups actually
> want to spread interrupts to all CPUs, and those who can't tend to just
> not implement set_affinity... So Kumar must have a special case of MPIC
> usage here on FSL platforms.
>
> In any case, the platform limitations should be dealt with there or the
> user could break it by manipulating affinity via /proc anyway.
>
> But yeah, I do expect default affinity to be all CPUs and in fact, I even
> have an -OLD- comment in the code that says
>
> /* let the mpic know we want intrs. default affinity is
> 0xffffffff ...

While we have the comment, the code appears not to really follow it.
We appear to write 1 << hard_smp_processor_id().

- k
David Miller wrote:
> From: Kevin Diggs <kevdig@hypersurf.com>
> Date: Sat, 25 Oct 2008 15:53:46 -0700
>
>> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
>> thing supposed to be able to spread irq between its cpus?
>
> Networking interrupts should lock onto a single CPU, unconditionally.
> That's the optimal way to handle networking interrupts, especially
> with multiqueue chips.

What about something like the Cavium Octeon, where we have 16 cores but
a single core isn't powerful enough to keep up with a gigE device?

Chris
From: "Chris Friesen" <cfriesen@nortel.com> Date: Mon, 27 Oct 2008 11:36:21 -0600 > David Miller wrote: > > From: Kevin Diggs <kevdig@hypersurf.com> > > Date: Sat, 25 Oct 2008 15:53:46 -0700 > > > >> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this > >> thing supposed to be able to spread irq between its cpus? > > Networking interrupts should lock onto a single CPU, unconditionally. > > That's the optimal way to handle networking interrupts, especially > > with multiqueue chips. > > What about something like the Cavium Octeon, where we have 16 cores but a single core isn't powerful enough to keep up with a gigE device? Hello, we either have hardware that does flow seperation and has multiple RX queues going to multiple MSI-X interrupts or we do flow seperation in software (work in progress patches were posted for that about a month ago, maybe something final will land in 2.6.29) Just moving the interrupt around when not doing flow seperation is as suboptimal as you can possibly get. You'll get out of order packet processing within the same flow, TCP will retransmit when the reordering gets deep enough, and then you're totally screwed performance wise.
David Miller wrote:
> From: "Chris Friesen" <cfriesen@nortel.com>
>
>> What about something like the Cavium Octeon, where we have 16 cores but a
>> single core isn't powerful enough to keep up with a gigE device?
>
> Hello, we either have hardware that does flow separation and has multiple
> RX queues going to multiple MSI-X interrupts or we do flow separation in
> software (work in progress patches were posted for that about a month ago,
> maybe something final will land in 2.6.29).

Are there any plans for a mechanism to allow the kernel to figure out
(or be told) what packets cpu-affined tasks are interested in and route
the interrupts appropriately?

> Just moving the interrupt around when not doing flow separation is as
> suboptimal as you can possibly get. You'll get out of order packet
> processing within the same flow, TCP will retransmit when the reordering
> gets deep enough, and then you're totally screwed performance-wise.

Ideally I agree with you. In this particular case, however, the hardware
is capable of doing flow separation, but the vendor driver doesn't
support it (and isn't in mainline). Packet rates are high enough that a
single core cannot keep up, but are low enough that they can be handled
by multiple cores without reordering if interrupt mitigation is not used.

It's not an ideal situation, but we're sort of stuck unless we do custom
driver work.

Chris
From: "Chris Friesen" <cfriesen@nortel.com> Date: Mon, 27 Oct 2008 13:10:55 -0600 > David Miller wrote: > > From: "Chris Friesen" <cfriesen@nortel.com> > > > Hello, we either have hardware that does flow seperation and has multiple > > RX queues going to multiple MSI-X interrupts or we do flow seperation in > > software (work in progress patches were posted for that about a month ago, > > maybe something final will land in 2.6.29) > > Are there any plans for a mechanism to allow the kernel to figure > out (or be told) what packets cpu-affined tasks are interested in > and route the interrupts appropriately? No, not at all. Now there are plans to allow the user to add classification rules into the chip for specific flows, on hardware that supports this, via ethtool. > Ideally I agree with you. In this particular case however the > hardware is capable of doing flow separation, but the vendor driver > doesn't support it (and isn't in mainline). Packet rates are high > enough that a single core cannot keep up, but are low enough that > they can be handled by multiple cores without reordering if > interrupt mitigation is not used. Your driver is weak and doesn't support the hardware correctly, and you want to put the onus on everyone else with sane hardware and drivers? > It's not an ideal situation, but we're sort of stuck unless we do > custom driver work. Wouldn't want you to get your hands dirty or anything like that now, would we? :-)))
On Oct 27, 2008, at 1:28 PM, David Miller wrote:

> From: "Chris Friesen" <cfriesen@nortel.com>
> Date: Mon, 27 Oct 2008 11:36:21 -0600
>
>> David Miller wrote:
>>> From: Kevin Diggs <kevdig@hypersurf.com>
>>> Date: Sat, 25 Oct 2008 15:53:46 -0700
>>>
>>>> What does this all mean to my GigE (dual 1.1 GHz 7455s)? Is this
>>>> thing supposed to be able to spread irq between its cpus?
>>>
>>> Networking interrupts should lock onto a single CPU, unconditionally.
>>> That's the optimal way to handle networking interrupts, especially
>>> with multiqueue chips.
>>
>> What about something like the Cavium Octeon, where we have 16 cores
>> but a single core isn't powerful enough to keep up with a gigE
>> device?
>
> Hello, we either have hardware that does flow separation and has
> multiple RX queues going to multiple MSI-X interrupts or we do flow
> separation in software (work in progress patches were posted for that
> about a month ago, maybe something final will land in 2.6.29).
>
> Just moving the interrupt around when not doing flow separation is as
> suboptimal as you can possibly get. You'll get out of order packet
> processing within the same flow, TCP will retransmit when the
> reordering gets deep enough, and then you're totally screwed
> performance-wise.

I haven't been following the netdev patches, but what about HW that
does flow separation w/o multiple interrupts?

We (Freescale) are working on such a device:

http://www.freescale.com/webapp/sps/site/prod_summary.jsp?fastpreview=1&code=P4080

- k
From: Kumar Gala <galak@kernel.crashing.org>
Date: Mon, 27 Oct 2008 14:43:29 -0500

> I haven't been following the netdev patches, but what about HW that
> does flow separation w/o multiple interrupts?
>
> We (Freescale) are working on such a device:
>
> http://www.freescale.com/webapp/sps/site/prod_summary.jsp?fastpreview=1&code=P4080

It could probably tie into the software-based flow separation support.
On Mon, 2008-10-27 at 08:43 -0500, Kumar Gala wrote:
>
> While we have the comment, the code appears not to really follow it.
> We appear to write 1 << hard_smp_processor_id().

That code is called by each CPU that gets onlined and ORs its bit into
the mask.

Ben.
On Oct 27, 2008, at 3:27 PM, Benjamin Herrenschmidt wrote:

> On Mon, 2008-10-27 at 08:43 -0500, Kumar Gala wrote:
>>
>> While we have the comment, the code appears not to really follow it.
>> We appear to write 1 << hard_smp_processor_id().
>
> That code is called by each CPU that gets onlined and ORs its bit into
> the mask.

ahh, I see now.

- k
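[Editor's note: a hedged sketch of the per-CPU step Ben and Kumar are discussing. This is not the actual arch/powerpc/sysdev/mpic.c code; the register accessors, the num_sources field, and the missing locking are assumptions, and only the "each online CPU ORs 1 << hard_smp_processor_id() into the destination mask" behaviour comes from the thread.]

	/*
	 * Sketch only, NOT the real MPIC setup code: each CPU that comes
	 * online ORs its own bit into every interrupt source's destination
	 * register, so once all CPUs have booted the effective default mask
	 * covers them all.  Accessor names and fields are assumptions.
	 */
	static void mpic_cpu_online_sketch(struct mpic *mpic)
	{
		u32 msk = 1 << hard_smp_processor_id();	/* this CPU's bit */
		unsigned int i;

		for (i = 0; i < mpic->num_sources; i++) {
			/* read-modify-write the per-source destination mask */
			u32 dest = mpic_irq_read(i, MPIC_INFO(IRQ_DESTINATION));

			mpic_irq_write(i, MPIC_INFO(IRQ_DESTINATION), dest | msk);
		}
	}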
On Oct 27, 2008, at 2:49 PM, David Miller wrote:

> From: Kumar Gala <galak@kernel.crashing.org>
> Date: Mon, 27 Oct 2008 14:43:29 -0500
>
>> I haven't been following the netdev patches, but what about HW that
>> does flow separation w/o multiple interrupts?
>>
>> We (Freescale) are working on such a device:
>>
>> http://www.freescale.com/webapp/sps/site/prod_summary.jsp?fastpreview=1&code=P4080
>
> It could probably tie into the software-based flow separation support.

Will have to look at the code... we might be able to fit in the HW irq
scheme. We effectively have a way of getting a per-cpu interrupt for the
flow.

- k
David Miller wrote:
> From: "Chris Friesen" <cfriesen@nortel.com>
>
>> Are there any plans for a mechanism to allow the kernel to figure
>> out (or be told) what packets cpu-affined tasks are interested in
>> and route the interrupts appropriately?
>
> No, not at all.
>
> Now there are plans to allow the user to add classification rules into
> the chip for specific flows, on hardware that supports this, via ethtool.

Okay, that sounds reasonable. Good to know where you're planning on going.

> Your driver is weak and doesn't support the hardware correctly, and you
> want to put the onus on everyone else with sane hardware and drivers?

I'm not expecting any action... I was just objecting somewhat to
"Networking interrupts should lock onto a single CPU, unconditionally."
Add "for a particular flow" into that and I wouldn't have said anything.

>> It's not an ideal situation, but we're sort of stuck unless we do
>> custom driver work.
>
> Wouldn't want you to get your hands dirty or anything like that now,
> would we? :-)))

I'd love to. But other things take time too, so we live with it for now.

Chris
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index c498a1b..728d36a 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -17,7 +17,7 @@
 
 #ifdef CONFIG_SMP
 
-cpumask_t irq_default_affinity = CPU_MASK_ALL;
+cpumask_t irq_default_affinity = CPU_MASK_CPU0;
 
 /**
  * synchronize_irq - wait for pending IRQ handlers (on other CPUs)
Commit 18404756765c713a0be4eb1082920c04822ce588 introduced a regression
on a subset of SMP based PPC systems whose interrupt controller only
allows setting an irq to a single processor. The previous behavior
was that only CPU0 was initially set up to get interrupts. Revert
to that behavior.

Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
---
 kernel/irq/manage.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)