LinuxQuestions.org


madbrad 02-09-2008 08:46 AM

Realtek RTL8111/8168B IRQ clash? Hardware errors with high activity
 
Hi. I've got a brand-new system with a Gigabyte P35-DS4 motherboard, which has an embedded Realtek RTL8111/8168B gigabit network controller. I'm running Linux 2.6.23.14, freshly fetched from kernel.org a couple of weeks ago.

The system was running perfectly ... until I decided to start using the network. With either the Linux kernel's r8169 module or the r8168 driver from realtek.com.tw (loaded separately, one at a time) I have the same problem: the driver loads properly, the eth0 interface configures properly, and all the networking functions operate correctly ... but when I receive packets at the full 100Mbit/s rate from another machine (both my eth0 and the other machine auto-negotiated to 100Mbit/s full duplex) various errors suddenly pop up in the syslog:

sshd[4685]: error: channel 0: chan_read_failed for istate 1
sshd[4685]: error: channel 0: chan_read_failed for istate 3
last message repeated 20 times
kernel: hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.
last message repeated 3 times
kernel: ide: failed opcode was: unknown
kernel: hda: drive not ready for command
kernel: hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }

And so forth.

When either of the r8169/r8168 modules is loaded it reports as follows in the log (this example is from the regular kernel.org r8169 module):

kernel: 8169 Gigabit Ethernet driver 2.2LK loaded
kernel: ACPI: PCI Interrupt 0000:04:00.0[A] -> GSI 17 (level, low) -> IRQ 17
kernel: eth0: RTL8168b/8111b at 0xf8d1c000, 00:1a:4d:58:a3:54, XID 38000000 IRQ 17

A look at the IRQ 17 line in /proc/interrupts shows that the IDE driver and the Realtek driver are both sharing IRQ 17:

# fgrep eth /proc/interrupts
17: 58077 4 98999 129160 IO-APIC-fasteoi ide0, eth0
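
For anyone wanting to spot this kind of sharing without eyeballing the whole file, here is a minimal Python sketch that lists every IRQ claimed by more than one device. It assumes the 2.6-era /proc/interrupts layout shown above, where the controller type is a single word such as IO-APIC-fasteoi; newer kernels split that column and would need a different parse. The function name shared_irqs is made up for illustration.

```python
import re

# Sample line copied from the post above; the four numbers are per-CPU counts.
SAMPLE = "17: 58077 4 98999 129160 IO-APIC-fasteoi ide0, eth0"


def shared_irqs(text):
    """Return {irq: [devices]} for each numbered IRQ that lists more than one device."""
    shared = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\d+):\s+(.*)", line)
        if not match:
            continue  # CPU header line, or named rows such as NMI/LOC/ERR
        irq, rest = match.groups()
        fields = rest.split()
        # Skip the per-CPU counters (the leading all-digit fields).
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1
        # Assumption: fields[i] is the one-word controller type; the rest is the device list.
        devices = [d.strip() for d in " ".join(fields[i + 1:]).split(",") if d.strip()]
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


for irq, devices in sorted(shared_irqs(SAMPLE).items()):
    print(f"IRQ {irq} is shared by: {', '.join(devices)}")
```

To run it against a live system, pass in open("/proc/interrupts").read() instead of SAMPLE.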

Given the kernel messages about 'hda' - which is the sole IDE device on the system, the DVD-ROM drive (all my hard disk drives are SATA/AHCI) - it seems to me that the Realtek driver is losing interrupts, or the IDE driver is picking up the interrupts destined for the ethernet device. But it's been a loooong time since I had to play with PC hardware and interrupts ... I don't have a clue how IRQs are (automatically?) assigned on a PCI bus these days, nor how to change things.

Has anyone had this problem with the embedded Realtek RTL8168/8111 driver and hardware interrupt confusion with moderate to high network activity?

How can I 'move' the Realtek device to another interrupt? Is there a general 'what to do with messy interrupt conflicts on PCI busses' HOWTO out there for a hardware novice?

Many thanks for any help ... I'm rather desperate - I thought this new system was working fine until I started to use it for real over the network! :-(

Regards,


Brad

tredegar 02-10-2008 04:48 AM

Looks like it might be an IRQ sharing problem.
You could try passing (one, two, or all - sorry: you'll have to experiment, only 7 combinations to try!) of these kernel options:
noapic
nolapic
acpi=off

to the kernel at boot time (just add them to the end of the "kernel" line in /boot/grub/menu.lst and reboot)

Then check your bootlogs to see what is happening.

To be on the safe side, I'd recommend creating a new boot entry in menu.lst to play with these options (just copy your current entry, but change the title to something like Testing), just in case one of these options prevents the kernel from booting at all - then you still have your original to fall back on.
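
To keep track of which of the 7 combinations have been tried, a small Python sketch like the one below can list them. It only prints the strings to append to the kernel line of the test entry; it does not touch menu.lst, and the helper name option_sets is just for illustration.

```python
from itertools import combinations

# The three boot options suggested above.
OPTIONS = ["noapic", "nolapic", "acpi=off"]


def option_sets(options):
    """Every non-empty subset of the options, smallest subsets first."""
    return [
        " ".join(combo)
        for size in range(1, len(options) + 1)
        for combo in combinations(options, size)
    ]


for number, extra in enumerate(option_sets(OPTIONS), start=1):
    print(f"{number}: append to the kernel line: {extra}")
```

Three options give 2^3 - 1 = 7 non-empty subsets, which is where the "only 7 combinations" figure comes from.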

Let us know how you get on.

madbrad 02-10-2008 06:05 AM

Hi tredegar!

Quote:

Originally Posted by tredegar (Post 3052233)
Looks like it might be an IRQ sharing problem.
You could try passing (one, two, or all - sorry: you'll have to experiment, only 7 combinations to try!) of these kernel options:
noapic
nolapic
acpi=off

Well, that wasn't so bad ... I was afraid you'd want me to do all the permutations of them in different order, too! :-)

I've been spending all day on this $$"!&^%$$!! problem and I've narrowed things down a little.

First of all, the sshd errors were a red herring; more searching the web showed that it was a problem with the version of sshd that I was running. I upgraded and *those* particular error messages went away.

The core problem, though, remains. Any decent network activity and I get heaps of these messages logged:

kernel: hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.
last message repeated 3 times
kernel: ide: failed opcode was: unknown
kernel: hda: drive not ready for command
kernel: hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }

And in fact the KDE "someone has just put in a music CD; what do you want to do with it?" popup window springs up every time, it's that confused!
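
To put a number on "heaps of these messages" when comparing boot options, a rough Python sketch like this can count them, taking syslog's "last message repeated N times" folding into account. The sample text is the excerpt above; count_messages is a hypothetical helper name, and the sketch assumes a repeat line always refers to the line directly before it.

```python
import re

# Sample lines copied from the log excerpt above.
SAMPLE_LOG = """\
kernel: hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.
last message repeated 3 times
kernel: ide: failed opcode was: unknown
kernel: hda: drive not ready for command
kernel: hda: status error: status=0x58 { DriveReady SeekComplete DataRequest }
"""

REPEAT = re.compile(r"last message repeated (\d+) times?")


def count_messages(log_text, needle):
    """Count lines containing `needle`, including copies folded into repeat lines."""
    total = 0
    previous_matched = False
    for line in log_text.splitlines():
        repeat = REPEAT.search(line)
        if repeat:
            # A repeat line stands for N more copies of the previous message.
            if previous_matched:
                total += int(repeat.group(1))
            continue
        previous_matched = needle in line
        if previous_matched:
            total += 1
    return total


print(count_messages(SAMPLE_LOG, "cdrom_pc_intr"))
```

Running the same count after a fixed amount of network traffic under each boot option gives a slightly more objective comparison than watching for the KDE popup.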

It seems to me that the IDE driver is using the same interrupt IRQ as the embedded Realtek r8169 driver, and with any significant network activity the IDE driver is reading and acting on some of the interrupts.

When my machine boots I can see that the 'Native IDE controller' and the 'Network Controller' on the PCI bus have the same IRQ - 15. And the problem is that, when Linux boots, it puts them both on the same IRQ too.

With a normal boot I have both 'ide0' and 'eth0' sharing IRQ 17 in /proc/interrupts:

17: 58077 4 98999 129160 IO-APIC-fasteoi ide0, eth0

With 'noapic' they both get shifted to IRQ 15; all the IRQs are moved 'down' to lower numbers, and instead of 'IO-APIC-fasteoi' and such the entries are all instead 'XT-PIC-XT' (I wish I knew what the difference was). But I still have the same problem with any network activity.

With 'acpi=off' I get the normal behaviour (shared on IRQ 17); with 'nolapic' the kernel hangs right after loading the ide driver, reporting 'hda: lost interrupt' messages. The other combinations of those three boot options all have the same results - either shared 'IO-APIC-fasteoi' IRQ 17 or shared 'XT-PIC-XT' IRQ 15. And the same problem.

When I use the BIOS to 'reserve' IRQ 15, the BIOS - and then Linux - uses a different IRQ ... but for BOTH drivers, again having them share the same interrupt.

I compiled a new kernel with no IDE driver whatsoever - just as a test, I'd like to actually be able to use my DVD-ROM while surfing the net :-) - and Linux then shared the 'eth0' driver with the 'libata' driver, both of them using the same IRQ (17 again, I think). 'libata' is the SATA driver, is that correct? Anyway, the problem DISAPPEARED in that scenario.

Searching the internet for the message:

kernel: hda: cdrom_pc_intr: The drive appears confused (ireason = 0x01). Trying to recover by ending request.

showed that people were scratching their heads over it back in 2005; there were a few messages in the linux-kernel mailing list about it, pointing at 'IRQ routing' as being the problem. But I couldn't find any solution.

So it seems to me that:

- the problem is due to the IDE and R8169 drivers sharing the same interrupt;

- maybe it's the IDE driver which is badly behaved, as the libata driver presumably was exposed to the same rush of shared interrupts from the network activity, but didn't have - or log - any problems;

- there seems to be no way I can get the embedded Realtek network controller to use a different IRQ - there's nothing in the BIOS that will let me do it (I can only block off IRQs from being used, but Linux then just finds another IRQ to have both IDE and eth0 share). I've downloaded several Realtek utilities but they will only *display* the IRQ, not change it.

It's really embarrassing how little I know of modern PC hardware these days. A decade ago ... well, maybe a few years more ... I was happily solving interrupt problems with conflicting ISA cards and the like. As a modern hardware naif it seems to me that Linux can quite happily re-route IRQs merrily as it boots ... so surely there's a way to tell it 'excuse me, please put the Realtek driver on its own interrupt'? Or the IDE driver?

It seems all I can do is either:

A. Find a way to get the Realtek hardware to change its IRQ; but I think I've exhausted that possibility. The BIOS won't let me do it, no utility I can find will, the motherboard manufacturer doesn't mention it.

B. Try and find a way that the Linux kernel allows one to meddle with the 'IRQ routing', to stop those two drivers from sharing an interrupt.

C. See if there are options to toughen up the IDE driver? There was a kernel directive IDEPCI_SHARE_IRQ which seemed to be EXACTLY what I wanted, so I set it to 'N', but the problem remained.

Thanks sincerely for your advice, this is all very frustrating; I appreciate your time!


Brad

tredegar 02-10-2008 07:28 AM

Thanks for your lucid post.
Quote:

It seems all I can do is either:
A: I can't find a way to do this either :(
modinfo r8169 didn't help much, but there is a module option for Debug verbosity level


B: That would be a good idea, & I thought those kernel options might help.
There's more info on kernel options and interrupts here:
http://www.kernel.org/pub/linux/kern...n_pdf/ch09.pdf
You might make more sense of it than I do!
Maybe try acpi=noirq ?

C: I don't know :(

Searching shows me there seem to be a lot of problems with your chipset & linux.
The wimp's way out may be to try disabling your Realtek RTL8111/8168B in your BIOS and trying a different network card.

One other thought: Is there anything in your BIOS that you might be able to change that could alter the way interrupts are being handled? [Eg set PnP BIOS=NO ]

jay73 02-10-2008 07:54 AM

Have you tried using the irqpoll boot argument? It has helped me before although the last time I was having issues like yours, it didn't do anything. I guess it's worth a try.

Loosewheel 02-10-2008 12:17 PM

'ifconfig' gives an option: irq addr
Set the interrupt line used by this device. Not all devices can
dynamically change their IRQ setting

madbrad 02-10-2008 04:12 PM

Quote:

Originally Posted by tredegar (Post 3052296)
modinfo r8169 didn't help much, but there is a module option for Debug verbosity level

Yes, I tried that; it didn't tell me anything useful. It's *really* frustrating how I can't seem to find any way to tell the embedded Realtek controller to just use another jolly interrupt! An 'irq=XXX' module option would have been perfect :-(

Quote:

There's more info on kernel options and interrupts here:
http://www.kernel.org/pub/linux/kern...n_pdf/ch09.pdf
You might make more sense of it than I do!
Maybe try acpi=noirq?
I think I've tried most of the possibilities listed under 'Interrupt Options'! 'acpi=noirq' didn't make any difference.

Quote:

The wimp's way out may be to try disabling your Realtek RTL8111/8168B in your BIOS and trying a different network card.
Ugh. I know people have had problems with the chipset, but can't find much about this specific one (lots of problems with the JMicron controller when it first came out, I think). And the motherboards that have these chipsets are so prevalent, how are they getting around this? Are they all running Windows? :-(

Quote:

One other thought: Is there anything in your BIOS that you might be able to change that could alter the way interrupts are being handled? [Eg set PnP BIOS=NO ]
I wish there was, but no. Nothing that allows me to change the IRQ of the network controller (or any other device), nothing at all about PnP other than the one bios page/menu which only allows me to reserve or block various IRQs ... but when I do that Linux just moves BOTH the IDE and Network drivers to share another interrupt, together. And the IDE driver just doesn't like that :-(

Quote:

Originally Posted by jay73
Have you tried using the irqpoll boot argument? It has helped me before although the last time I was having issues like yours, it didn't do anything. I guess it's worth a try.

I tried it; it seemed to make the IDE driver much less 'sensitive'; it took a full minute for KDE to think that a ghost had inserted a music CD in the drive. But I got the same error messages, just a bit more slowly. I think the nature of the 'irqpoll' option, from what it says in the documentation, may just slow things down in general.

Quote:

Originally Posted by Loosewheel
'ifconfig' gives an option: irq addr
Set the interrupt line used by this device. Not all devices can
dynamically change their IRQ setting.

Loosewheel, that option would have been BRILLIANT if it worked! I had no idea that ifconfig could do that, but there it is sitting in the output of an 'ifconfig -a'. Plus I've noted that the 'r8169' Realtek driver only seems to 'grab' its interrupt - the 'eth0' driver only appears in /proc/interrupts - after I've actually assigned an address to a plumbed eth0 device and up'ed it. That should have told me that ifconfig itself was doing some sort of interrupt configuration/activation magic.

Anyway, I tried that - no luck:

irq: SIOCSIFMAP: Operation not supported

Would have been perfect if it had worked. :-(
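
For reading (not changing) the IRQ the kernel has assigned, sysfs is an alternative to ifconfig. The Python sketch below assumes the interface is backed by a PCI device, so that /sys/class/net/<interface>/device/irq exists; the function name interface_irq and its sysfs_root parameter are made up for illustration.

```python
import os


def interface_irq(ifname, sysfs_root="/sys"):
    """Return the IRQ sysfs reports for a network interface, or None if unavailable."""
    path = os.path.join(sysfs_root, "class", "net", ifname, "device", "irq")
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None  # no such interface, no sysfs, or no irq attribute


print(interface_irq("eth0"))
```

Like the Realtek utilities, this only displays the IRQ; it cannot move the device to another one.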

Thanks for the help fellows. I don't get it; the Intel P35 chipset is pretty modern and popular, I thought, and lots of motherboards - I believe - have both it and the embedded Realtek network controllers. I wonder how they're getting around this?


Brad

jay73 02-10-2008 05:28 PM

Quote:

I wonder how they're getting around this?
Essentially, not.
I have repeatedly been hit by that thing over the last year (similar motherboard), once on Fedora, once on FreeBSD and the other time on Ubuntu. I wasted lots of time looking for a solution and eventually ended up switching to a different distro until the issue was solved by a kernel update. Unless you write your own kernel patches, there isn't much you can do.

madbrad 02-10-2008 08:30 PM

Quote:

Originally Posted by jay73 (Post 3052858)
I have repeatedly been hit by that thing over the last year (similar motherboard), once on Fedora, once on FreeBSD and the other time on Ubuntu. I wasted lots of time looking for a solution and eventually ended up switching to a different distro until the issue was solved by a kernel update. Unless you write your own kernel patches, there isn't much you can do.

Did you discover a distribution that got rid of the problem, then?

And - I'm probably reading your post wrong - was the problem finally solved by a kernel update, or are you still on the good distribution waiting?

I had a friend e-mail me just half an hour ago that Ubuntu would fix all my problems ... and you've just said here that Ubuntu didn't work for you. :-( I'm keen to know what distribution you found that worked!


Brad

jay73 02-10-2008 10:18 PM

Well, Ubuntu Gutsy works fine. The previous one (what was it called again?) worked fine until my optical drives became useless after a kernel update. Same thing with Fedora 7 but now Fedora 8 is OK again. I guess this is nothing distro-specific, it's probably just the kernel devs solving a problem, then causing a regression with the next update, then solving it once more. Your best bet would be to try different distros. If you can afford the space, install two. If one goes down, you have a quick alternative while you're waiting for things to get straightened out.

madbrad 02-23-2008 10:35 PM

Just posting a summation of what I've found out to solve my problem, in case anyone else ever does a search.

First up, from a few recent posts in the linux kernel mailing list, it looks like the interrupt handler for the ide-cd module has been, or is in the process of being, rewritten. The message I saw (dated 14/2/08) said that the release candidate 2.6.25-rc1 kernel should have the fix. However, I loaded up 2.6.25-rc2 today and the bug was still there. Still, the mention of the relevant change in the code - changing 'cdrom_pc_intr' to 'cdrom_newpc_intr' - suggests that my problem with conflicting interrupts between the Realtek and the CD-ROM will hopefully be fixed soon.

In the meantime I've found a workaround, the same one used by Ubuntu I think, which works out of the box, as noted here by jay73. I've disabled IDE entirely on my machine, and enabled the sr_mod module (CONFIG_BLK_DEV_SR). The sr_mod module apparently sits above the 'cdrom' driver and presents a scsi device - /dev/sr0 or /dev/scd0 - to the system. I think this was the only way to use a CD-ROM back a few years before the ide-cd driver came out.

Anyway, with IDE totally turned off in my kernel and sr_mod loaded I can use the DVD even though the 'libata' driver (rather than the 'ide0' driver) is still sharing the same IRQ as the Realtek device. Luckily the only IDE device in my system is the DVD drive so this workaround is sufficient until hopefully a new kernel fixes the bug.
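
For anyone repeating this workaround, it helps to confirm what a kernel .config actually contains before rebuilding. The Python sketch below parses .config-style text. Only CONFIG_BLK_DEV_SR is named in this post; CONFIG_IDE (the old IDE subsystem) and CONFIG_ATA (libata) are my assumption about the matching symbols, and the sample fragment is illustrative rather than the real config from this machine.

```python
def config_value(config_text, option):
    """Return the value set for `option` ('y', 'm', ...), or 'n' if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return "n"
    return "n"


# Illustrative fragment only, not the actual .config from this thread.
SAMPLE_CONFIG = """\
# CONFIG_IDE is not set
CONFIG_ATA=y
CONFIG_BLK_DEV_SR=m
"""

for option in ("CONFIG_IDE", "CONFIG_ATA", "CONFIG_BLK_DEV_SR"):
    print(option, "=", config_value(SAMPLE_CONFIG, option))
```

On a real build tree, pass in the contents of the .config file at the top of the kernel source directory.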

Out of all this I've realised I have several questions about how the kernel works ... for example, what does the Ubuntu kernel do if there are IDE devices (other than the DVD/CD) in the system? Is there another workaround? I tried various kernel boot parameters to try and keep the IDE driver enabled while telling it to 'ignore' the DVD drive - 'hda=scsi', 'hda=ide-scsi' - but nothing worked. Why is it that, when IDE is compiled in, a listing of /proc/interrupts shows that 'ide0' is using IRQ 17 with the network driver ... but when IDE is disabled 'libata' appears in its place? How does the kernel juggle the IDE and libata drivers around?

And, finally, what with all the various boot parameters to turn ACPI off, on and sideways, or otherwise meddle with the boot-time 'IRQ balancing' and such ... surely there would be a way to tell the kernel to move things around so that the IDE driver had its own unique IRQ? I thought IDEPCI_SHARE_IRQ would do it, but no.

Thanks for the help, Brad.


All times are GMT -5.