Old 03-06-2008, 11:23 AM   #1
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Rep: Reputation: 0
NFS errors for the past few days


I have a pair of servers at home which have been acting up over the last few days.

Server 1 is running apache2 under Debian, using NFS to mount all of its file space.
Server 2 is simply hosting NFS shares under Debian.

The apache2 server has an uptime of almost a year now, but recently it's been having trouble connecting to the NFS server. The following is a sample of what I'm seeing in the logs on a regular basis now. Previously, these entries would appear maybe once every month or two, which is normal and can be caused by any number of random things like network congestion or too many requests right at that time.

Mar 6 04:53:53 localhost kernel: nfs: server 192.168.0.50 OK
Mar 6 04:54:24 localhost kernel: nfs: server 192.168.0.50 not responding, still trying
Mar 6 04:54:24 localhost kernel: nfs: server 192.168.0.50 OK
Mar 6 04:55:19 localhost kernel: nfs: server 192.168.0.50 not responding, still trying
Mar 6 04:55:23 localhost kernel: nfs: server 192.168.0.50 not responding, still trying
Mar 6 04:55:23 localhost kernel: nfs: server 192.168.0.50 OK
Mar 6 04:55:47 localhost kernel: nfs: server 192.168.0.50 OK
Mar 6 04:56:48 localhost kernel: nfs: server 192.168.0.50 not responding, still trying
Mar 6 04:56:48 localhost kernel: nfs: server 192.168.0.50 OK

But this is getting ridiculous, and it's almost bringing the entire apache2 system to a halt.

- I have restarted the NFS machine, thinking nfsd was hooped, but that did not solve the issue.
- I have restarted the apache2 daemon, but that has had no effect.
- I have unmounted and run fsck -f on a couple of hard drives on the file server, and those come back clean.
- I can ssh reliably into both machines and work inside each machine. CPU usage is next to nil, and network traffic is minimal.
- There is no sign of any other issues in any logs on either machine, outside of these kernel nfs messages.

Anyone have any leads on this?
It is extremely frustrating as I would expect to see something, *anything*, in the logs. Either a dying hard drive (DMA errors and resets), or network card issues (link down/up messages), or something to indicate the source of these nfs timeouts.
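
A rough way to put a number on "a regular basis" is to count the "not responding" messages per hour. The sketch below is only an illustration: the function name is made up, and the log path in the usage comment is an assumption about where this machine's syslog writes kernel messages.

```shell
# Count NFS "not responding" kernel messages per hour.
# Reads syslog-style lines on stdin and prints "Mon DD HH count".
# nfs_timeouts_per_hour is a hypothetical helper name, not an existing tool.
nfs_timeouts_per_hour() {
    grep 'nfs: server .* not responding' |
        awk '{
                 split($3, t, ":")              # $3 is HH:MM:SS, keep the hour
                 n[$1 " " $2 " " t[1]]++        # key: month, day, hour
             }
             END { for (k in n) print k, n[k] }' |
        sort                                    # lexical sort, fine within one month
}

# Usage (path is an assumption; it may be /var/log/messages or /var/log/syslog):
#   nfs_timeouts_per_hour < /var/log/kern.log
```

A jump from a handful of hits per month to several per hour would show up immediately in that output, and the hours could then be compared against cron jobs or traffic peaks.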
 
Old 03-06-2008, 11:57 AM   #2
slackhack
Senior Member
 
Registered: Jun 2004
Distribution: Arch, Debian, Slack
Posts: 1,016

Rep: Reputation: 46
What kind of filesystems do you have on the servers?

ifconfig shows no errors or collisions or anything like that on the network interfaces?
 
Old 03-06-2008, 12:09 PM   #3
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
I've not looked into ifconfig. I completely forgot that I could look there for errors. I always assumed something would show up in /var/log/

Unfortunately, our corporate network is so paranoid about IP that I have no way to connect to the server right now to check. Plus, all forms of encryption inside the network are banned. Huzzah.

As for the filesystems, since I set the machines up a year ago, I can't remember exactly, but I'm pretty sure I just used the default the debian installer chooses, ext3.

Wouldn't I also have trouble ssh-ing into the machine, though, if the problem were the network card or the network driver?

It would be sad to reboot just shy of a 1 year uptime, but I'm tempted to just see if a reboot resets something that has gone awry deep in the core of the system. Is there a way to reset the network card and driver without rebooting the machine?
 
Old 03-06-2008, 03:23 PM   #4
slackhack
Senior Member
 
Registered: Jun 2004
Distribution: Arch, Debian, Slack
Posts: 1,016

Rep: Reputation: 46
When you're able to ssh in, you could just do /etc/init.d/networking restart to restart the network. That probably won't do too much, but who knows. While you're in, be sure to check ifconfig for errors.

You might also want to mess around with NFS to see if you can fix it or at least get any more info. I would pick a share and unmount it and then try remounting it again to see if it generates any error messages. Maybe there's some misconfiguration -- I think debian changed the default behavior of something fairly recently (3-6 mos?), so maybe that's screwing something up. It might also help to restart nfs itself (I think it's nfs-common on debian -- or shut it down, re-export the export table, start it again, etc.)
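
The unmount/remount and re-export steps above can be sketched roughly as follows. This is a dry-run sketch, not a tested procedure: /mnt/data is a hypothetical mount point assumed to have an /etc/fstab entry, and the helper names are made up. By default the commands are only printed; setting DRY_RUN=0 would actually run them (as root).

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# On the NFS client: cycle one share and watch for error messages.
# Assumes the mount point is listed in /etc/fstab so "mount <dir>" is enough.
remount_nfs_share() {
    run umount "$1" && run mount -v "$1"
}

# On the NFS server: re-read /etc/exports, then list what is exported.
reexport_nfs() {
    run exportfs -ra && run exportfs -v
}

# Usage:
#   remount_nfs_share /mnt/data      # client side, /mnt/data is hypothetical
#   reexport_nfs                     # server side
```

On Debian the server side is normally handled by the nfs-kernel-server init script, with nfs-common covering the client-side helpers, so a full restart of the server would go through /etc/init.d/nfs-kernel-server rather than nfs-common.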

Lastly, I hear you about the uptime, but in the end it's not worth the headache avoiding a possible solution just to preserve the uptime. If you can't get it resolved any other way, it might be worth it to just reboot and see if that clears up anything. Then you get to start all over again.
 
Old 03-06-2008, 11:14 PM   #5
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
eth0 Link encap:Ethernet HWaddr 00:0D:87:89:60:28
inet addr:192.168.0.51 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:40098981 errors:0 dropped:0 overruns:0 frame:269263
TX packets:40854082 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:259462197 (247.4 MiB) TX bytes:92252480 (87.9 MiB)
Interrupt:11 Base address:0xdc00

completely clean. Which I guess is a good thing. I'm almost hoping to find the issue, but at the same time, I don't want it to be whatever I'm looking at right then.

I have done a /etc/init.d/networking restart, and while that hasn't gotten rid of the errors, it does seem to have at least restored the responsiveness of the server even when it does error. It doesn't completely lock up for ages and ages. I'm not sure quite what to make of this, but I will be watching it closely.
 
Old 03-07-2008, 06:27 AM   #6
slackhack
Senior Member
 
Registered: Jun 2004
Distribution: Arch, Debian, Slack
Posts: 1,016

Rep: Reputation: 46
Completely clean, except for 269263 frame errors! Combined with a non-zero count in the errors field, I think that would certainly point to a hardware problem (cables, NIC, etc.). On its own, I'm not sure whether it's just reflecting the problem that you already know exists. netstat -es might give more info. If there are no errors there, it might just be telling us that there are problems sending the data because of the data itself (e.g., the packet size is wrong?). I'm not 100 percent sure of the significance, but it's a small clue worth looking into further.
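
Since the frame counter is easy to miss when "errors:0" sits right next to it, a small filter can list every non-zero counter. This is a sketch that assumes the older net-tools ifconfig layout pasted in this thread ("name:value" pairs on the RX/TX packets lines); newer ifconfig builds print a different layout and would need a different pattern. The function name is made up.

```shell
# Print every non-zero error counter from old-style ifconfig output on stdin.
# Output looks like "RX frame=269263". nonzero_if_errors is a hypothetical name.
nonzero_if_errors() {
    awk '/(RX|TX) packets:/ {
             dir = $1                           # "RX" or "TX"
             for (i = 3; i <= NF; i++) {        # skip the packets:N field itself
                 split($i, kv, ":")
                 if (kv[2] + 0 > 0) print dir, kv[1] "=" kv[2]
             }
         }
         /collisions:/ {
             split($1, kv, ":")
             if (kv[2] + 0 > 0) print "collisions=" kv[2]
         }'
}

# Usage:
#   ifconfig eth0 | nonzero_if_errors
```

On systems without net-tools, ip -s -s link show eth0 reports a similar breakdown of receive errors, including a frame column.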
 
Old 03-07-2008, 12:56 PM   #7
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
Hrmm, I didn't know that "frame" = "frame errors". Time to go read an ifconfig man page :\ I just assumed "errors:0" meant I was in the clear. That's why I copy pasted the whole thing in, just in case I had missed something :P

Now, since the responsiveness seemed to return temporarily after the networking restart last night, I didn't go onsite to take a look at the servers. But the server is still showing signs of issues, so I will have to go onsite tonight. I will check all the cables, watch it for a bit, open up the case side and check to see if anything has worked loose...

As for any changes, the machine has basically been untouched for the last year. If it ain't broke, don't fix it kind of thing... I haven't done any apt-upgrades or changed any settings since it was set up last year. The only things that change are the data being sent by apache2 and processed by perl, plus all that data is stored on the NFS server.

Anyways, I appreciate the continued support. I will provide more updates as they come.
 
Old 03-07-2008, 01:00 PM   #8
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
the ifconfig man page doesn't contain the word "frame" :\
 
Old 03-07-2008, 03:56 PM   #9
slackhack
Senior Member
 
Registered: Jun 2004
Distribution: Arch, Debian, Slack
Posts: 1,016

Rep: Reputation: 46
Looks like it could be a driver issue:

Quote:
>inet addr:192.168.1.22 Bcast:192.168.1.255 Mask:255.255.255.0
>inet6 addr: fe80::255:7bff:feb5:7df7/64 Scope:Link
>UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
>RX packets:1 errors:0 dropped:0 overruns:0 frame:21668
>TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>collisions:0 txqueuelen:1000
>RX bytes:60 (60.0 b) TX bytes:0 (0.0 b)
>Interrupt:169

>why the frame value(21668) so large?

You can't talk to the NIC - you may be using the wrong driver. The
actual meaning of a Frame error is that the number of BITS is not
exactly divisible by 8 (bytes) which is wrong because an Ethernet
frame is always an exact number of bytes in length. This usually
indicates a hardware failure, but can also be caused by using the
wrong driver.

http://groups.google.com/group/comp....ec00d71bec72c2
what nic and driver are you using?

also saw this:

Quote:
Saw somewhere on the net in cabled situations that frame errors are related to the CRC check on packets not giving the right result and then retransmission requests taking place. Also saw some hints on full-duplex vs. half-duplex conflicts.

http://forum.openwrt.org/viewtopic.php?id=1653
What does mii-tool -v say about the duplex setting? Is there perhaps a mismatch between it and the device it's connecting to?
 
Old 03-08-2008, 12:12 AM   #10
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
I can tell that there is some stuff in here that isn't so hot, but unfortunately, I don't know what some of this is. Is there anything in this that raises any red flags?

d3100:~# netstat -es
Ip:
37857474 total packets received
0 forwarded
0 incoming packets discarded
29334854 incoming packets delivered
31867300 requests sent out
13 dropped because of missing route
29709 fragments dropped after timeout
10006441 reassemblies required
1577825 packets reassembled ok
128110 packet reassembles failed
1736391 fragments received ok
Icmp:
549 ICMP messages received
1 input ICMP message failed.
ICMP input histogram:
destination unreachable: 472
timeout in transit: 53
echo requests: 23
echo replies: 1
21108 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 101
time exceeded: 20984
echo replies: 23
Tcp:
69250 active connections openings
192748 passive connection openings
1 failed connection attempts
17702 connection resets received
3 connections established
20863496 segments received
23666056 segments send out
164475 segments retransmited
23 bad segments received.
69112 resets sent
Udp:
8029922 packets received
31 packets to unknown port received.
0 packet receive errors
8182086 packets sent
TcpExt:
1064 resets received for embryonic SYN_RECV sockets
10 ICMP packets dropped because they were out-of-window
2 ICMP packets dropped because socket was locked
197014 TCP sockets finished time wait in fast timer
56 packets rejects in established connections because of timestamp
91957 delayed acks sent
32 delayed acks further delayed because of locked socket
Quick ack mode was activated 2781 times
343577 packets directly queued to recvmsg prequeue.
319737 of bytes directly received from prequeue
8741772 packet headers predicted
21 packets header predicted and directly queued to user
3363649 acknowledgments not containing data received
6759359 predicted acknowledgments
1665 times recovered from packet loss due to fast retransmit
47654 times recovered from packet loss due to SACK data
9 bad SACKs received
Detected reordering 31 times using FACK
Detected reordering 3 times using SACK
Detected reordering 413 times using reno fast retransmit
Detected reordering 31 times using time stamp
39 congestion windows fully recovered
119 congestion windows partially recovered using Hoe heuristic
TCPDSACKUndo: 368
2877 congestion windows recovered after partial ack
38301 TCP data loss events
TCPLostRetransmit: 131
112 timeouts after reno fast retransmit
5255 timeouts after SACK recovery
1856 timeouts in loss state
109123 fast retransmits
2171 forward retransmits
21379 retransmits in slow start
12924 other TCP timeouts
TCPRenoRecoveryFail: 621
4066 sack retransmits failed
4 times receiver scheduled too late for direct processing
32065 DSACKs sent for old packets
19 DSACKs sent for out of order packets
2714 DSACKs received
27 connections reset due to unexpected data
18 connections reset due to early user close
409 connections aborted due to timeout
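
Two numbers in that output stand out: 128110 failed reassemblies against 1577825 successful ones (about 7.5%), and 164475 retransmitted TCP segments out of 23666056 sent (about 0.7%). If the NFS mounts run over UDP with large read/write sizes (an assumption, the mount options aren't shown in this thread), every request is split into several IP fragments, and losing any one fragment loses the whole request, which would fit the timeouts. A sketch for pulling those two ratios out of netstat -s text, with a made-up function name:

```shell
# Summarise two loss indicators from "netstat -s" style text on stdin.
# loss_summary is a hypothetical helper name. The patterns accept both the
# older spellings shown in this thread ("send out", "retransmited") and the
# corrected ones used by later net-tools releases.
loss_summary() {
    awk '/packets reassembled ok/     { ok = $1 }
         /packet reassembles failed/  { bad = $1 }
         /segments sen[dt] out/       { sent = $1 }
         /segments retransmit+ed/     { rexmit = $1 }
         END {
             if (ok + bad > 0)
                 printf "reassembly failures: %.2f%%\n", 100 * bad / (ok + bad)
             if (sent > 0)
                 printf "tcp retransmits: %.2f%%\n", 100 * rexmit / sent
         }'
}

# Usage:
#   netstat -s | loss_summary
```

These counters are cumulative since boot, so on a machine with almost a year of uptime the ratios average over the whole year. Running it twice a few minutes apart and comparing the raw counts gives a better picture of what is happening right now.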
 
Old 03-08-2008, 12:34 AM   #11
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
The network cards appear to be up and running in the same modes.

NFS server: NIC is a weird nic, not onboard. There's no identification on any of the PCB, but it has a chipset with an I on it, which matches the Intel product info collected.
p3-450:~# mii-tool -v
eth0: negotiated 100baseTx-FD flow-control, link ok
product info: Intel 82555 rev 4
basic mode: autonegotiation enabled
basic status: autonegotiation complete, link ok
capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
advertising: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

Webserver: NIC is onboard. Board is an ECS L7VMM3, with soldered on CPU. NIC is a VIA VT823x chipset. lsmod shows a variety of via drivers, including "via-rhine" which is what I believe the NIC driver is. (The driver automatically detected and installed in the debian installer)
d3100:~# mii-tool -v
eth0: negotiated 100baseTx-FD flow-control, link ok
product info: vendor 00:40:63, model 50 rev 8
basic mode: autonegotiation enabled
basic status: autonegotiation complete, link ok
capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
advertising: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control

There looks to be nothing out of the ordinary here.
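
For reference, a duplex mismatch usually happens between a NIC and whatever it is plugged into (switch port or the other NIC), typically when one end is forced and the other autonegotiates and falls back to half duplex. So the thing to look for is an HD result on either box. A small sketch for pulling just the duplex out of the mii-tool -v output format shown above; the function name is made up, and it only handles the "negotiated ..." line, not forced-mode output:

```shell
# Print FD or HD from the "negotiated" line of mii-tool -v output on stdin.
# Prints nothing if no such line is present (for example, autonegotiation off).
# negotiated_duplex is a hypothetical helper name.
negotiated_duplex() {
    sed -n 's/^[^:]*: negotiated [^ ]*-\([FH]D\).*/\1/p'
}

# Usage:
#   mii-tool -v eth0 | negotiated_duplex
```

Both machines here would print FD, which agrees with the conclusion that the duplex settings are not the problem. On cards that mii-tool doesn't understand, ethtool eth0 reports the same information on its Duplex line.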
 
Old 03-08-2008, 12:51 AM   #12
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
I am going to swap in a different ethernet cable tonight, and see if that makes any difference. Unless there are any more leads or suggestions, my next steps are as follows:
1) Restart the machine. This might fix something that may be corrupted in memory, such as the driver.
2) Disable the onboard NIC and replace it with a known-working NIC. (I know I should have somewhere intel ether express pro's, dlink DFE538-tx's, and 3com 3c509b-tx's.) I just hope the kernel already has modules for these NICs compiled in, because I've never actually compiled and installed a NIC driver outside of a complete kernel compile before.

I am still slightly concerned about doing (1). If that fixes it, I will come out of this not knowing what the heck was wrong, and I won't have really learned how to properly resolve the issue. Plus, it may only be a temporary fix.

Lastly, I have done a visual inspection of both machines. Nothing seems out of whack. All internal and external connections appear solid. Update in the morning. Time to swap eth cables.
 
Old 03-09-2008, 12:15 AM   #13
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
So it's been a bit longer than this morning, but I didn't want to jinx it :P
There have been no signs of any errors since I replaced the network cable last night. I hope that was the issue. An easy issue to fix, but a PITA to track down.

Thanks again for all your help. I will be here first if the issue begins to replicate again. (I sure hope I haven't jinxed it by claiming this "officially")
 
Old 03-09-2008, 08:42 AM   #14
slackhack
Senior Member
 
Registered: Jun 2004
Distribution: Arch, Debian, Slack
Posts: 1,016

Rep: Reputation: 46
Bad cables can really wreak havoc on your network. I suspected that or the nic when I saw all those frame errors, but wasn't sure because the error field was 0. That adds to the knowledge base, for me anyway: high frame errors can mean bad cable. Glad you tracked it down, keep us posted if it crops up again.

 
Old 03-12-2008, 01:06 PM   #15
dougtheslug
LQ Newbie
 
Registered: May 2005
Distribution: Debian
Posts: 14

Original Poster
Rep: Reputation: 0
I believe I figured out what the issue was. I got a big box of ethernet cables a number of years back, and some of them were still being used. Now, I just completed a move, but some servers were left behind. The cables for the towers left behind became sort of tangled, but nothing was actually unplugged during the process. The cable management was removed and the cables just sort of laid in a pile.

It appears as if these really old cables became entangled with power cables and such. The kicker is that after replacing one of these cables, I cut one open. The 8 individual wires weren't in twisted pairs.

They were laid straight inside this flattish ethernet cable. Hooray interference. That will teach me for using free stuff. I'm truly surprised now that I see this that these cables ever worked.
 
  

