I'm running Fedora Core 1 as a firewall (Shorewall) with three attached networks (net, loc, dmz). The motherboard is an ASUS A7N266-VM/AA (Athlon XP 1800+). The server chassis is a 2U rackmount with a PCI riser card that provides three PCI slots, all of which are filled by NICs (the onboard LAN is disabled in the BIOS).
NICs installed:
1x 3C905B
1x 3C905C-TX-M
1x Realtek RTL-8139
For a while, my firewall was crashing randomly with no relevant messages anywhere in the log files. These crashes froze the machine completely, forcing a hard reset. After scratching my head for a while, I added "apm=off nohlt" to my kernel options, rebooted, and waited for it to crash again.
When I got to work this morning, the Internet was down, so although I was disturbed, I was also excited to see whether the kernel options had helped. Sure enough, the firewall responded to keyboard and mouse input, but no traffic was being routed between the three interfaces. Searching the logs, I found TONS of these messages:
fw kernel: eth0: PCI bus error, bus status 80000020
fw kernel: eth0: Host error, FIFO diagnostic register 0000.
fw kernel: eth0: Too much work in interrupt, status e003.
I restarted the network services, and the interfaces went down and came back up, but the same messages kept appearing in the logs and network traffic was still down. I did a soft reset, Linux rebooted, and everything has been fine ever since. The errors seem to occur when a lot of traffic is flowing through this box.
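To get a quick count of how many of each message was logged, and to check whether any interface other than eth0 is affected, something like the following Python sketch could be run over the log. The function name tally_nic_messages is just an illustration, and the sample lines are the three messages quoted above; on a real system the lines would come from the syslog file instead.

```python
import re
from collections import Counter

# Matches lines such as "fw kernel: eth0: PCI bus error, bus status 80000020"
# and captures the interface name plus the message text up to the first
# comma or period.
LINE_RE = re.compile(r"kernel: (eth\d+): ([^,.]+)")


def tally_nic_messages(log_lines):
    """Count kernel messages per (interface, message) pair."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


# Sample input: the three messages quoted in this post.
SAMPLE_LOG = """\
fw kernel: eth0: PCI bus error, bus status 80000020
fw kernel: eth0: Host error, FIFO diagnostic register 0000.
fw kernel: eth0: Too much work in interrupt, status e003.
"""

if __name__ == "__main__":
    for (iface, message), count in sorted(tally_nic_messages(SAMPLE_LOG.splitlines()).items()):
        print(f"{iface}: {message}: {count}")
```

Note that this counts every kernel message logged for an ethN interface, not only errors, so the output needs to be read with that in mind.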
I happened upon a thread with Donald Becker (who wrote the 3c59x drivers, from what I understand). Since this is my first post I cannot post a URL, but you can find the thread at tux.org by searching for its title: [vortex] 3c59x LK1.1.16 Linux-2.4 PCI bus error/Host error. Sorry.
From reading that thread, I understand more about the actual problem, but I still have no idea how to fix it.
Here is the output of lspci -vvx for my NICs:
------------------------------------------------------------
01:06.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
    Subsystem: 3Com Corporation 3C905B Fast Etherlink XL 10/100
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
    Latency: 64 (2500ns min, 2500ns max), cache line size 08
    Interrupt: pin A routed to IRQ 5
    Region 0: I/O ports at d800 [size=128]
    Region 1: Memory at e6000000 (32-bit, non-prefetchable) [size=128]
    Expansion ROM at <unassigned> [disabled] [size=128K]
    Capabilities: [dc] Power Management version 1
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: b7 10 55 90 17 00 10 02 30 00 00 02 08 40 00 00
10: 01 d8 00 00 00 00 00 e6 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 55 90
30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a

01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
    Subsystem: AOPEN Inc. ALN-325C
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap+ 66Mhz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR+
    Latency: 64 (8000ns min, 16000ns max)
    Interrupt: pin A routed to IRQ 5
    Region 0: I/O ports at d400 [size=256]
    Region 1: Memory at e5800000 (32-bit, non-prefetchable) [size=256]
    Capabilities: [50] Power Management version 2
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0-,D1+,D2+,D3hot+,D3cold+)
        Status: D0 PME-Enable- DSel=0 DScale=0 PME-
00: ec 10 39 81 07 00 90 82 10 00 00 02 00 40 00 00
10: 01 d4 00 00 00 00 80 e5 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 a0 a0 07 00
30: 00 00 00 00 50 00 00 00 00 00 00 00 05 01 20 40

01:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
    Subsystem: 3Com Corporation 3C905C-TX Fast Etherlink for PC Management NIC
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR- FastB2B-
    Status: Cap+ 66Mhz- UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
    Latency: 64 (2500ns min, 2500ns max), cache line size 08
    Interrupt: pin A routed to IRQ 6
    Region 0: I/O ports at d000 [size=128]
    Region 1: Memory at e5000000 (32-bit, non-prefetchable) [size=128]
    Expansion ROM at <unassigned> [disabled] [size=128K]
    Capabilities: [dc] Power Management version 2
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
        Status: D0 PME-Enable- DSel=0 DScale=2 PME-
00: b7 10 00 92 17 00 10 02 78 00 00 02 08 40 00 00
10: 01 d0 00 00 00 00 00 e5 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10
30: 00 00 00 00 dc 00 00 00 00 00 00 00 06 01 0a 0a
As you can see, the Realtek and the 3c905B are sharing IRQ 5, and all three NICs have the "BusMaster" flag set. It seems to me that the solution would be to make only one NIC the bus master and/or force all three NICs onto different IRQs.
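For anyone who would rather not read the whole dump to check the IRQ assignments, here is a minimal Python sketch that groups devices by IRQ from lspci -v style text and reports any IRQ used by more than one device. The function name shared_irqs is my own, and the sample text is trimmed down from the output above to just the device and Interrupt lines.

```python
import re
from collections import defaultdict

# Device header lines look like "01:06.0 Ethernet controller: ..."
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.*)$", re.IGNORECASE)
# IRQ lines look like "Interrupt: pin A routed to IRQ 5"
IRQ_RE = re.compile(r"Interrupt: pin \w routed to IRQ (\d+)")


def shared_irqs(lspci_text):
    """Return {irq: [slot, ...]} for every IRQ used by more than one device."""
    by_irq = defaultdict(list)
    current_slot = None
    for raw_line in lspci_text.splitlines():
        line = raw_line.strip()  # works with or without lspci's indentation
        match = DEVICE_RE.match(line)
        if match:
            current_slot = match.group(1)
            continue
        match = IRQ_RE.search(line)
        if match and current_slot is not None:
            by_irq[int(match.group(1))].append(current_slot)
    return {irq: slots for irq, slots in by_irq.items() if len(slots) > 1}


# Sample input: device and Interrupt lines taken from the lspci output above.
SAMPLE = """\
01:06.0 Ethernet controller: 3Com Corporation 3c905B 100BaseTX [Cyclone] (rev 30)
Interrupt: pin A routed to IRQ 5
01:07.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ (rev 10)
Interrupt: pin A routed to IRQ 5
01:08.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
Interrupt: pin A routed to IRQ 6
"""

if __name__ == "__main__":
    for irq, slots in shared_irqs(SAMPLE).items():
        print(f"IRQ {irq} is shared by: {', '.join(slots)}")
```

This only reports what lspci already shows; it does not change any IRQ assignment.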
When this last crash happened, my kernel options were "apm=off nohlt acpi=off" (I thought that disabling ACPI in Linux would make it assign different IRQs to the three NICs, but apparently it didn't).
So I guess my question is this: would disabling PCI bus mastering on two of the NICs solve the problem? If so, how do I do that in Linux? And/or do I have to assign different IRQs to the three NICs? If so, how do I do that as well?
Any feedback would be appreciated VERY much. This is a production firewall, and I'm basically having to work myself to death until this problem gets solved.
Thanks in advance,
Greg