Message ID | 20091012.032657.24181291.davem@davemloft.net |
---|---|
State | Accepted |
Delegated to: | David Miller |
[ Please retain CC: in all replies, thanks. ]
Hey, I want to investigate this further because something about
these traces still perplexes me.
Could you get me some information?
1) Set up the failing case (but with one of the fixes in the kernel
so you can run commands), grab the contents of /proc/interrupts,
and post that output here.
2) What firmware and hypervisor are you running on this machine?
(you can get this via 'showhost' at the "sc>" prompt)
I'm running Sun System Firmware 7.1.7.h on my machine.
The reason I ask #2 is that there is a hypervisor bug with LDC
connections wherein the interrupt can be sent twice erroneously
and this can cause loops in the per-cpu interrupt INO list.
There is a partial workaround already in the tree:
commit 5a606b72a4309a656cd1a19ad137dc5557c4b8ea
Author: David S. Miller <davem@sunset.davemloft.net>
Date: Mon Jul 9 22:40:36 2007 -0700
[SPARC64]: Do not ACK an INO if it is disabled or inprogress.
This is also a partial workaround for a bug in the LDOM firmware which
double-transmits RX inos during high load. Without this, such an
event causes the kernel to loop forever in the interrupt call chain
ACK'ing but never actually running the IRQ handler (and thus clearing
the interrupt condition in the device).
There is still a bad potential effect when double INOs occur,
not covered by this changeset. Namely, if the INO is already on
the per-cpu INO vector list, we still blindly re-insert it and
thus we can end up losing interrupts already linked in after
it.
We could deal with that by traversing the list before insertion,
but that's too expensive for this edge case.
Signed-off-by: David S. Miller <davem@davemloft.net>
But, as stated, it cannot deal with all possibilities that result
from this firmware bug. Best is to have the most up-to-date firmware
with the fix.
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
1. cat /proc/interrupts (interval 2s-5s)

root@sun_netraT5220_turgo-1_ldom-3:/root> cat /proc/interrupts
       CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
  0:  10803  10834  10832  10832  10831  10831  10860  10830  <NULL> timer
 17:     34      0      0      0      0      0      0      0  sun4v  hvcons
 18:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 19:  27547      0      0      0      0      0      0      0  vsun4v eth0 RX
 20:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 21:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 22:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 23:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 24:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 25:     31      0      0      0      0      0      0      0  vsun4v eth1 RX
 26:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 27:      7      0      0      0      0      0      0      0  vsun4v eth1 RX
 28:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 29:      6      0      0      0      0      0      0      0  vsun4v eth1 RX
 30:      0      0      0      0      0      0      0      0  vsun4v vdiska TX
 31:     10      0      0      0      0      0      0      0  vsun4v vdiska RX
 32:      0      0      0      0      0      0      0      0  vsun4v DS TX
 33:     10      0      0      0      0      0      0      0  vsun4v DS RX
root@sun_netraT5220_turgo-1_ldom-3:/root> cat /proc/interrupts
       CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
  0:  13930  13961  13959  13959  13958  13958  13987  13957  <NULL> timer
 17:     37      0      0      0      0      0      0      0  sun4v  hvcons
 18:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 19:  27558      0      0      0      0      0      0      0  vsun4v eth0 RX
 20:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 21:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 22:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 23:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 24:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 25:     34      0      0      0      0      0      0      0  vsun4v eth1 RX
 26:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 27:      7      0      0      0      0      0      0      0  vsun4v eth1 RX
 28:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 29:      6      0      0      0      0      0      0      0  vsun4v eth1 RX
 30:      0      0      0      0      0      0      0      0  vsun4v vdiska TX
 31:     10      0      0      0      0      0      0      0  vsun4v vdiska RX
 32:      0      0      0      0      0      0      0      0  vsun4v DS TX
 33:     10      0      0      0      0      0      0      0  vsun4v DS RX
root@sun_netraT5220_turgo-1_ldom-3:/root> cat /proc/interrupts
       CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
  0:  16314  16345  16343  16343  16342  16342  16371  16341  <NULL> timer
 17:     40      0      0      0      0      0      0      0  sun4v  hvcons
 18:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 19:  27576      0      0      0      0      0      0      0  vsun4v eth0 RX
 20:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 21:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 22:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 23:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 24:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 25:     40      0      0      0      0      0      0      0  vsun4v eth1 RX
 26:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 27:      7      0      0      0      0      0      0      0  vsun4v eth1 RX
 28:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 29:      6      0      0      0      0      0      0      0  vsun4v eth1 RX
 30:      0      0      0      0      0      0      0      0  vsun4v vdiska TX
 31:     10      0      0      0      0      0      0      0  vsun4v vdiska RX
 32:      0      0      0      0      0      0      0      0  vsun4v DS TX
 33:     10      0      0      0      0      0      0      0  vsun4v DS RX
root@sun_netraT5220_turgo-1_ldom-3:/root> cat /proc/interrupts
       CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
  0:  17078  17109  17107  17107  17106  17106  17135  17105  <NULL> timer
 17:     43      0      0      0      0      0      0      0  sun4v  hvcons
 18:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 19:  27582      0      0      0      0      0      0      0  vsun4v eth0 RX
 20:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 21:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 22:      0      0      0      0      0      0      0      0  vsun4v eth0 TX
 23:      7      0      0      0      0      0      0      0  vsun4v eth0 RX
 24:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 25:     40      0      0      0      0      0      0      0  vsun4v eth1 RX
 26:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 27:      7      0      0      0      0      0      0      0  vsun4v eth1 RX
 28:      0      0      0      0      0      0      0      0  vsun4v eth1 TX
 29:      6      0      0      0      0      0      0      0  vsun4v eth1 RX
 30:      0      0      0      0      0      0      0      0  vsun4v vdiska TX
 31:     10      0      0      0      0      0      0      0  vsun4v vdiska RX
 32:      0      0      0      0      0      0      0      0  vsun4v DS TX
 33:     10      0      0      0      0      0      0      0  vsun4v DS RX
root@sun_netraT5220_turgo-1_ldom-3:/root>

2. where is sc>? i run uname -a in sunos

uname -a
SunOS sun_netraT5220_turgo-1 5.10 Generic_127111-05 sun4v sparc SUNW,Netra-T5220

F.Y.I, sorry for delay.

Yongli He
From: hyl <heyongli@gmail.com>
Date: Fri, 16 Oct 2009 14:20:00 +0800

> 2. where is sc>? i run uname -a in sunos

That's the system console prompt on the ALOM.
diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index adf5f27..cb3c72c 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -1242,13 +1242,13 @@ int ldc_bind(struct ldc_channel *lp, const char *name)
 	snprintf(lp->tx_irq_name, LDC_IRQ_NAME_MAX, "%s TX", name);
 
 	err = request_irq(lp->cfg.rx_irq, ldc_rx,
-			  IRQF_SAMPLE_RANDOM | IRQF_SHARED,
+			  IRQF_SAMPLE_RANDOM | IRQF_DISABLED | IRQF_SHARED,
 			  lp->rx_irq_name, lp);
 	if (err)
 		return err;
 
 	err = request_irq(lp->cfg.tx_irq, ldc_tx,
-			  IRQF_SAMPLE_RANDOM | IRQF_SHARED,
+			  IRQF_SAMPLE_RANDOM | IRQF_DISABLED | IRQF_SHARED,
 			  lp->tx_irq_name, lp);
 	if (err) {
 		free_irq(lp->cfg.rx_irq, lp);