Hi tredegar!
Quote:
Originally Posted by tredegar
Looks like it might be an IRQ sharing problem.
You could try passing (one, two, or all - sorry: you'll have to experiment, only 7 combinations to try!) of these kernel options:
noapic
nolapic
acpi=off
Well, that wasn't so bad ... I was afraid you'd want me to do all the permutations of them in different order, too! :-)
I've been spending all day on this $$"!&^%$$!! problem and I've narrowed things down a little.
First of all, the sshd errors were a red herring; more searching the web showed that it was a problem with the version of sshd that I was running. I upgraded and *those* particular error messages went away.
The core problem, though, remains. Any decent network activity and I get heaps of these messages logged:
kernel: hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.
last message repeated 3 times
kernel: ide: failed opcode was: unknown
kernel: hda: drive not ready for command
kernel: hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
And in fact the KDE "someone has just put in a music CD; what do you want to do with it?" popup window springs up every time, it's that confused!
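As an aside, the status=0x58 value in the last log line can be checked by hand. The sketch below is a rough Python illustration, not the kernel's own code: it assumes the standard ATA status register bit layout, and only the three names that appear in the log (DriveReady, SeekComplete, DataRequest) come from the messages above. The other names and the helper decode_status are assumptions for illustration.

```python
# Assumed ATA status register bits. Only DriveReady, SeekComplete and
# DataRequest are confirmed by the log above; the other names are assumptions.
STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of the bits that are set in an IDE status byte."""
    return [name for mask, name in STATUS_BITS if status & mask]


print(decode_status(0x58))  # ['DriveReady', 'SeekComplete', 'DataRequest']
```

So 0x58 is 0x40 + 0x10 + 0x08: the drive says it is ready and has data waiting, with no Error bit set, which would fit a drive that is being poked by interrupts it did not raise.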
It seems to me that the IDE driver is using the same IRQ as the embedded Realtek r8169 driver, and with any significant network activity the IDE driver is reading and acting on some of the network controller's interrupts.
When my machine boots I can see that the "Native IDE controller" and the "Network Controller" on the PCI bus have the same IRQ - 15. And the problem is that, when Linux boots, it also puts them both on the same IRQ.
With a normal boot I have both 'ide0' and 'eth0' sharing IRQ 17 in /proc/interrupts:
17: 58077 4 98999 129160 IO-APIC-fasteoi ide0, eth0
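For anyone wanting to check their own machine for the same thing, here is a minimal Python sketch that picks out shared IRQs from /proc/interrupts text. It assumes the older layout shown in the line above (per-CPU counts, then one controller-type token, then a comma-separated device list); newer kernels add extra columns and would need adjusting. The function name shared_irqs and the SAMPLE text are made up for illustration; only the IRQ 17 row is taken from this post.

```python
# SAMPLE is illustrative: the IRQ 17 row is from this post, the other
# rows are invented to show the surrounding format.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
  0:        125          0          0          0   IO-APIC-edge      timer
 17:      58077          4      98999     129160   IO-APIC-fasteoi   ide0, eth0
NMI:          0          0          0          0
"""


def shared_irqs(text):
    """Return {irq: [devices]} for rows that list more than one device."""
    shared = {}
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        irq = irq.strip()
        if not sep or not irq.isdigit():
            continue  # header row, or named rows such as NMI
        fields = rest.split()
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1  # skip the per-CPU interrupt counts
        # fields[i] is assumed to be the controller type; the rest are devices
        names = " ".join(fields[i + 1:]).split(",")
        devices = [n.strip() for n in names if n.strip()]
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


print(shared_irqs(SAMPLE))  # {17: ['ide0', 'eth0']}
```

On a live system the same function could be fed the contents of /proc/interrupts instead of SAMPLE.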
With 'noapic' they both get shifted to IRQ 15; all the IRQs are moved 'down' to lower numbers, and instead of 'IO-APIC-fasteoi' and such the entries are all instead 'XT-PIC-XT' (I wish I knew what the difference was). But I still have the same problem with any network activity.
With 'acpi=off' I get the normal behaviour (shared on IRQ 17); with 'nolapic' the kernel hangs right after loading the ide driver, reporting 'hda: lost interrupt' messages. The other combinations of those three boot options all have the same results - either shared 'IO-APIC-fasteoi' IRQ 17 or shared 'XT-PIC-XT' IRQ 15. And the same problem.
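The "7 combinations" are simply every non-empty subset of the three options (2^3 - 1 = 7). A short Python sketch that lists them, so none gets missed while rebooting; the helper name boot_option_sets is hypothetical, and the option strings are the three quoted at the top of this post.

```python
from itertools import combinations

# The three kernel boot options suggested in the quoted post.
OPTIONS = ["noapic", "nolapic", "acpi=off"]


def boot_option_sets(options):
    """Return every non-empty subset of the options, smallest sets first."""
    return [
        " ".join(combo)
        for size in range(1, len(options) + 1)
        for combo in combinations(options, size)
    ]


for option_line in boot_option_sets(OPTIONS):
    print(option_line)
```

Order within a set does not matter here, which is why these are combinations and not permutations.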
When I use the BIOS to 'reserve' IRQ 15, the BIOS - and then Linux - use a different IRQ ... but for BOTH drivers, again having them share the same interrupt.
I compiled a new kernel with no IDE driver whatsoever - just as a test, since I'd like to actually be able to use my DVD-ROM while surfing the net :-) - and Linux then had 'eth0' share an interrupt with the 'libata' driver instead (IRQ 17 again, I think). 'libata' is the SATA driver, is that correct? Anyway, the problem DISAPPEARED in that scenario.
Searching the internet for the message:
kernel: hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.
showed that people were scratching their heads over it back in 2005; there were a few messages in the linux-kernel mailing list about it, pointing at 'IRQ routing' as being the problem. But I couldn't find any solution.
So it seems to me that:
- the problem is due to the IDE and R8169 drivers sharing the same interrupt;
- maybe it's the IDE driver which is badly behaved, as the libata driver presumably was exposed to the same rush of shared interrupts from the network activity, but didn't have - or log - any problems;
- there seems to be no way I can get the embedded Realtek network controller to use a different IRQ - there's nothing in the BIOS that will let me do it (I can only block off IRQs from being used, but Linux then just finds another IRQ for IDE and eth0 to share). I've downloaded several Realtek utilities but they will only *display* the IRQ, not change it.
It's really embarrassing how little I know of modern PC hardware these days. A decade ago ... well, maybe a few years more ... I was happily solving interrupt problems with conflicting ISA cards and the like. As a modern hardware naif it seems to me that Linux can quite happily re-route IRQs merrily as it boots ... so surely there's a way to tell it 'excuse me, please put the Realtek driver on its own interrupt'? Or the IDE driver?
It seems all I can do is either:
A. Find a way to get the Realtek hardware to change its IRQ; but I think I've exhausted that possibility. The BIOS won't let me do it, no utility I can find will, and the motherboard manufacturer doesn't mention it.
B. Try and find a way that the Linux kernel allows one to meddle with the 'IRQ routing', to stop those two drivers from sharing an interrupt.
C. See if there are options to toughen up the IDE driver? There was a kernel configuration option, IDEPCI_SHARE_IRQ, which seemed to be EXACTLY what I wanted, so I set it to 'N', but the problem remained.
Thanks sincerely for your advice, this is all very frustrating; I appreciate your time!
Brad