LinuxQuestions.org


TiredOfThis 04-20-2013 05:34 PM

I/O timeout using NFS fileserver
 
I have a fileserver exporting a raid5 over NFSv4.

Recently I've started running into problems where I/O will time out. I'll do a copy (either through the GUI or the CLI) and it will just hang for a while. dmesg output will show server timeouts (not responding).

If I run htop on the server it shows the memory filling up with buffers until it hits max, then the drives start churning like mad (meanwhile the I/O stalls on the client). No more progress gets made (and I/O times out) until the retry limit is hit on the client and the operation fails.
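The buildup described above can be watched more precisely than in htop by sampling the Dirty and Writeback counters from /proc/meminfo on the server while a copy is running. What follows is only a minimal sketch: the names `parse_meminfo` and `writeback_summary` are made up for illustration, and the code assumes nothing beyond the usual `Name:   value kB` layout of /proc/meminfo.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def writeback_summary(text):
    """Pick out the counters relevant to page-cache buildup and flushing."""
    info = parse_meminfo(text)
    wanted = ("Dirty", "Writeback", "Buffers", "Cached")
    # Missing fields are reported as 0 rather than raising.
    return {key: info.get(key, 0) for key in wanted}
```

Calling `writeback_summary(open("/proc/meminfo").read())` about once a second during a copy should show whether Dirty keeps growing until the stall begins and whether Writeback jumps when the drives start churning.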

This is definitely an NFS issue. I can do a local copy to the raid volume or copy the same files via FTP and everything goes fine. dmesg and the system journal on the fileserver are clean.

I realize I can mitigate the problem by adding more RAM to the server, but I'd rather figure out what's broken.

I've done some Googling and tried tweaking things like the number of NFS server threads (not helpful) and the export and mount options (also not helpful, as far as I can tell). The volume is exported 'sync', so I would expect all I/O to be flushed (making it weird that I see this "memory fills up and then begins to flush" behavior).

Any ideas?

lleb 04-20-2013 08:53 PM

1. what OS?
2. what does your exports look like?
3. what type of LAN are you dealing with?
4. have you checked swappiness settings?
5. you can also clear your cache and force it to not fill so often; how you do that differs depending on the OS.
6. what is the client OS?

TiredOfThis 04-20-2013 09:01 PM

1. Client and server are running Arch Linux
2. /mnt/raid 10.10.10.1/24(rw,root_squash,insecure,no_subtree_check,sync,nohide)
3. Client -> Gigabit switch -> server
4. Swappiness is at 60. Do you think I should tune it one way or the other?
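
The export line in item 2 can be split into its path, client, and option list to confirm which options are really present, for instance that `sync` is there and `async` is not. A rough sketch follows. `parse_export_line` is a hypothetical helper, not part of nfs-utils, and it deliberately ignores the quoted paths and backslash line continuations that exports(5) also allows.

```python
import re


def parse_export_line(line):
    """Split one /etc/exports line into (path, [(client, [options...]), ...]).

    Returns None for blank or comment-only lines.
    Simplification: quoted paths and line continuations are not handled.
    """
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    path, *specs = line.split()
    clients = []
    for spec in specs:
        # A client spec looks like host(opt1,opt2) or just host.
        match = re.fullmatch(r"([^()]*)(?:\(([^()]*)\))?", spec)
        if match is None:
            continue  # malformed spec; skipped in this sketch
        host, opts = match.group(1), match.group(2)
        options = [o for o in (opts or "").split(",") if o]
        clients.append((host, options))
    return path, clients
```

For the line above this gives the path `/mnt/raid` and a single client `10.10.10.1/24` whose option list includes `sync`.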

lleb 04-21-2013 12:24 AM

Quote:

Originally Posted by TiredOfThis (Post 4935572)
1. Client and server are running Arch Linux
2. /mnt/raid 10.10.10.1/24(rw,root_squash,insecure,no_subtree_check,sync,nohide)
3. Client -> Gigabit switch -> server
4. Swappiness is at 60. Do you think I should tune it one way or the other?

While I'm no guru, especially with Arch since I've never run it, there are a few things to check as far as tuning goes.

for NFSv4 you should have fsid=0 in there as well as crossmnt as options. Example:

Code:

#        NFS4
/exports *(rw,insecure,subtree_check,crossmnt,fsid=0)

I have to use the insecure flag as I have OS X on my LAN. Also from what I have been told you should place your exports for NFSv4 in /exports and mount everything from there.

The LAN should be more than fast enough as long as both ends also run 10/100/1000 NICs and you are cabled with Cat5e or better cable.

Yes, I'd tune swappiness down to 30 or even 10. In a modern system, there really is no reason to have it swapping that often any longer. 60 is still the legacy default that most OSes will set on install.
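
The current value can be checked before changing it. This is a minimal sketch, assuming the usual Linux location /proc/sys/vm/swappiness. `read_swappiness` is an illustrative name, and the `path` parameter exists only so the function can be pointed at a different file when testing.

```python
def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness setting as an integer."""
    with open(path) as f:
        return int(f.read().strip())
```

Writing a new number to the same file as root changes the setting until the next reboot.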

Not sure that will help, but could not hurt.

syg00 04-21-2013 01:53 AM

Quote:

Originally Posted by TiredOfThis (Post 4935512)
Recently I've started running into problems where I/O will time out. I'll do a copy (either through the GUI or the CLI) and it will just hang for a while. dmesg output will show server timeouts (not responding).

What changed?

TiredOfThis 04-21-2013 07:56 AM

Thanks for your suggestions!

Quote:

Originally Posted by lleb (Post 4935628)
for NFSv4 you should have fsid=0 in there

Shouldn't matter since I'm exporting only one mount, and I'm not grouping them together (see below).

Quote:

Originally Posted by lleb (Post 4935628)
as well as crossmnt

I don't export any filesystems onto the raid, and I don't group filesystems, so I shouldn't need crossmnt either.

Quote:

Originally Posted by lleb (Post 4935628)
Also from what I have been told you should place your exports for NFSv4 in /exports and mount everything from there.

This is the usual way to do it. However, I'm only exporting a single mount, and I don't care about security because this is an internal LAN, so I don't do it that way. It shouldn't affect performance at all.

Quote:

Originally Posted by lleb (Post 4935628)
Yes, I'd tune swappiness down to 30 or even 10

I don't think this will help (in fact, if my problem is memory exhaustion it could hurt), but I set it to 30. The problem still persists.

TiredOfThis 04-21-2013 08:07 AM

Quote:

Originally Posted by syg00 (Post 4935650)
What changed?

Nothing, as far as I can see. Although let me tell you more of the story so maybe you can tell me what could have changed :)

The fileserver used to be running an ancient version of OpenSUSE (I think it was 10.3) and exported using NFSv3. I never ran an update, because I never rebooted the system. I know that no updates would run because I actually disabled the update system on that machine.

One day I lost power, and the fileserver rebooted. I took the opportunity to update the system software on the client (Arch Linux) machine, but I didn't do anything to the server.

For some reason, after the client came back up, it couldn't connect to the server at all. So I ssh'd into the server and restarted the portmapper and idmapd daemons, and voilà, it worked.

Except that now I was getting these timeouts. It only happened when copying a lot of data (usually over a GB), which I don't do very frequently. But it was annoying, and sometimes my backup cron job would die, putting my data at risk.

I figured it was due to the ancient version of nfsd and Linux I was running on the fileserver, so I bought a new hard drive, installed Arch Linux on it, and brought the whole thing up to date. It's now on Linux 3.8.6 and has the most recent (as of about two weeks ago) versions of nfsd, etc. I still don't run updates and don't reboot, but at least now the baseline is more current.

However, even after all this the problem still persists.

I've tried replacing the switch and re-cabling the network, but that doesn't seem to have helped. Also, the problem only shows up with NFS writes (not reads) to the fileserver. I can read data all day long, I can sftp data to the fileserver with no problems, etc. The problem also shows up if I mount the volume on other computers on the network, so it's not just localized to my main desktop PC.
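
Since only large writes over NFS are affected, one way to pin down when the stall begins is to write a test file to the mounted volume in fixed-size chunks and time each chunk. The sketch below is an assumption-laden probe, not a reproduction of what cp does: `timed_write` and `stalls` are made-up names, the 5-second threshold is arbitrary, and the code calls fsync after every chunk, which a normal copy does not do.

```python
import os
import time


def timed_write(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes of zeros to path; return seconds taken per chunk."""
    chunk = b"\0" * chunk_bytes
    durations = []
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            start = time.monotonic()
            f.write(chunk[:n])
            f.flush()
            os.fsync(f.fileno())  # force the data out before timing stops
            durations.append(time.monotonic() - start)
            written += n
    return durations


def stalls(durations, threshold=5.0):
    """Return (chunk_index, seconds) for chunks at or above the threshold."""
    return [(i, d) for i, d in enumerate(durations) if d >= threshold]
```

Running `stalls(timed_write("/path/on/nfs/mount/probe.bin", 2 * 1024**3))` from a client (the path is a placeholder) would list which chunks, and therefore roughly how many MiB into the transfer, the delays show up. Running the same call against a local path on the server gives a baseline for comparison.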

So although I say "nothing changed", as you can see after the problem started happening I changed almost everything, and the issue still persists.

lleb 04-21-2013 01:28 PM

Sadly, I'm out of ideas then. That is about as much as I know about NFS. Good luck, and sorry I could not be of more assistance.

TiredOfThis 04-21-2013 08:59 PM

Thanks for trying! This has got me stumped as well.


All times are GMT -5.