Old 06-27-2009, 09:33 PM   #1
deck-
LQ Newbie
 
Registered: Mar 2009
Posts: 5

Rep: Reputation: 0
NFS Large File Copies Fail - Error writing to file: Input/output error


Greetings,

I recently upgraded my file/media server to Fedora 11. After doing so, I can no longer copy large files to the server. The files begin to transfer, but typically after about 1 GB has been transferred, the transfer stalls and ultimately fails with the message:

"Error writing to file: Input/output error"

I've run out of ideas as to what could cause this problem. I have tried the following:

1. Tried different NFS versions: NFSv3 and NFSv4.
2. Tried copying the files to different physical drives on the server.
3. Tried copying the files from different physical drives on the client.
4. Tried different rsize and wsize block sizes when mounting the NFS share.
5. Tried copying the files via a different protocol (SSH in this case); those transfers always succeed.


Regardless of what I do, the result is the same: the file transfers always fail after approximately 1 GB.
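
For illustration, the kind of mount invocations I have been trying look roughly like this (the server name, export path, mount point, and block sizes here are placeholders, not my exact values):

Code:
# NFSv4 mount with explicit rsize/wsize (values are examples only)
mount -t nfs4 -o rw,rsize=32768,wsize=32768 server:/ /mnt/media

# NFSv3 equivalent
mount -t nfs -o vers=3,rw,rsize=32768,wsize=32768 server:/export/raid5 /mnt/media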

One other note: both the client and the server are running Fedora 11, kernel 2.6.29.5-191.fc11.x86_64.



I am out of ideas. Has anyone else experienced something similar?
 
Old 06-28-2009, 03:33 PM   #2
irishbitte
Senior Member
 
Registered: Oct 2007
Location: Brighton, UK
Distribution: Ubuntu Hardy, Ubuntu Jaunty, Eeebuntu, Debian, SME-Server
Posts: 1,213
Blog Entries: 1

Rep: Reputation: 88
Can you post the contents of
Code:
/etc/security/limits.conf
here?
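
In case it helps while gathering that, the file and the effective limits can be dumped with something like this (a generic sketch, not specific to your setup):

Code:
# show the PAM limits configuration
cat /etc/security/limits.conf

# show the effective per-process limits in the current shell
ulimit -a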
 
Old 07-01-2009, 08:56 AM   #3
deck-
LQ Newbie
 
Registered: Mar 2009
Posts: 5

Original Poster
Rep: Reputation: 0
Thanks for the reply,

My /etc/security/limits.conf looks as follows:

Code:
# /etc/security/limits.conf
#
#Each line describes a limit for a user in the form:
#
#<domain> <type> <item> <value>
#
#Where:
#<domain> can be:
# - an user name
# - a group name, with @group syntax
# - the wildcard *, for default entry
# - the wildcard %, can be also used with %group syntax,
# for maxlogin limit
#
#<type> can have the two values:
# - "soft" for enforcing the soft limits
# - "hard" for enforcing hard limits
#
#<item> can be one of the following:
# - core - limits the core file size (KB)
# - data - max data size (KB)
# - fsize - maximum filesize (KB)
# - memlock - max locked-in-memory address space (KB)
# - nofile - max number of open files
# - rss - max resident set size (KB)
# - stack - max stack size (KB)
# - cpu - max CPU time (MIN)
# - nproc - max number of processes
# - as - address space limit (KB)
# - maxlogins - max number of logins for this user
# - maxsyslogins - max number of logins on the system
# - priority - the priority to run user process with
# - locks - max number of file locks the user can hold
# - sigpending - max number of pending signals
# - msgqueue - max memory used by POSIX message queues (bytes)
# - nice - max nice priority allowed to raise to values: [-20, 19]
# - rtprio - max realtime priority
#
#<domain> <type> <item> <value>
#

#* soft core 0
#* hard rss 10000
#@student hard nproc 20
#@faculty soft nproc 20
#@faculty hard nproc 50
#ftp hard nproc 0
#@student - maxlogins 4

# End of file




I don't believe there is anything there that could cause this problem; everything is commented out.


I have done some more work on this, but still no resolution. I have set up another server, and I am seeing a recurring problem:

Both servers have completely different hardware, but both are logging kernel errors during these large file transfers:



Code:
Oct 22 11:11:57 tical kernel: BUG: unable to handle kernel NULL pointer dereference at 00000004
Oct 22 11:11:57 tical kernel: IP: [<c053b184>] inode_has_perm+0x25/0x6a
Oct 22 11:11:57 tical kernel: *pdpt = 0000000012408001 *pde = 000000001240e067 *pte = 0000000000000000
Oct 22 11:11:57 tical kernel: Oops: 0000 [#7] SMP
Oct 22 11:11:57 tical kernel: last sysfs file: /sys/module/lockd/initstate
Oct 22 11:11:57 tical kernel: Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 raid456 async_xor async_memcpy async_tx xor ppdev via686a hwmon 8139too i2c_viapro i2c_voodoo3 pcspkr serio_raw i2c_algo_bit 8139cp sata_promise pata_pdc2027x mii i2c_core parport_pc parport ata_generic pata_acpi pata_via [last unloaded: scsi_wait_scan]
Oct 22 11:11:57 tical kernel:
Oct 22 11:11:57 tical kernel: Pid: 1674, comm: nfsd Tainted: G D (2.6.29.4-167.fc11.i686.PAE #1)
Oct 22 11:11:57 tical kernel: EIP: 0060:[<c053b184>] EFLAGS: 00010246 CPU: 0
Oct 22 11:11:57 tical kernel: EIP is at inode_has_perm+0x25/0x6a
Oct 22 11:11:57 tical kernel: EAX: 00000000 EBX: 00000000 ECX: 00100004 EDX: cc662858
Oct 22 11:11:57 tical kernel: ESI: 00100004 EDI: d41d7800 EBP: d71b5e98 ESP: d71b5e4c
Oct 22 11:11:57 tical kernel: DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Oct 22 11:11:57 tical kernel: Process nfsd (pid: 1674, ti=d71b4000 task=d6f56500 task.ti=d71b4000)
Oct 22 11:11:57 tical kernel: Stack:
Oct 22 11:11:57 tical kernel: c0971e00 0000012a c0971e00 d6f51940 00000000 000001e8 c130be00 d6f56500
Oct 22 11:11:57 tical kernel: c096e894 c0971c00 00000000 c0971c00 00000246 00000001 d71b5ea4 00000246
Oct 22 11:11:57 tical kernel: d41d7f00 cc662858 d41d7800 d71b5eb4 c053d41d 00000000 d3854a50 d41d7f00
Oct 22 11:11:57 tical kernel: Call Trace:
Oct 22 11:11:57 tical kernel: [<c053d41d>] ? selinux_dentry_open+0xda/0xe2
Oct 22 11:11:57 tical kernel: [<c05366f6>] ? security_dentry_open+0x14/0x16
Oct 22 11:11:57 tical kernel: [<c04a73be>] ? __dentry_open+0xf1/0x1f9
Oct 22 11:11:57 tical kernel: [<c04a7532>] ? dentry_open+0x6c/0x76
Oct 22 11:11:57 tical kernel: [<d97db620>] ? nfsd_open+0x107/0x12e [nfsd]
Oct 22 11:11:57 tical kernel: [<d97db7fd>] ? nfsd_commit+0x3a/0x82 [nfsd]
Oct 22 11:11:57 tical kernel: [<d97e4625>] ? nfsd4_commit+0x0/0x3d [nfsd]
Oct 22 11:11:57 tical kernel: [<d97e464d>] ? nfsd4_commit+0x28/0x3d [nfsd]
Oct 22 11:11:57 tical kernel: [<d97e3c20>] ? nfsd4_proc_compound+0x19f/0x2bd [nfsd]
Oct 22 11:11:57 tical kernel: [<d97d7218>] ? nfsd_dispatch+0xd6/0x1a2 [nfsd]
Oct 22 11:11:57 tical kernel: [<d9381b8c>] ? svc_process+0x391/0x596 [sunrpc]
Oct 22 11:11:57 tical kernel: [<d97d7720>] ? nfsd+0xf7/0x147 [nfsd]
Oct 22 11:11:57 tical kernel: [<d97d7629>] ? nfsd+0x0/0x147 [nfsd]
Oct 22 11:11:57 tical kernel: [<c0446fc8>] ? kthread+0x41/0x65
Oct 22 11:11:57 tical kernel: [<c0446f87>] ? kthread+0x0/0x65
Oct 22 11:11:57 tical kernel: [<c0409dbf>] ? kernel_thread_helper+0x7/0x10
Oct 22 11:11:57 tical kernel: Code: e0 ea 5b 5e 5d c3 55 89 e5 57 56 53 83 ec 40 0f 1f 44 00 00 8b 5d 08 89 c7 31 c0 f6 82 45 01 00 00 02 89 ce 75 42 8b 47 58 85 db <8b> 40 04 89 45 b4 8b 82 4c 01 00 00 89 45 b8 75 16 b9 0e 00 00
Oct 22 11:11:57 tical kernel: EIP: [<c053b184>] inode_has_perm+0x25/0x6a SS:ESP 0068:d71b5e4c
Oct 22 11:11:57 tical kernel: ---[ end trace 06a1359b8e9aca60 ]---




Could this be a legitimate bug?
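
Since the call trace runs through SELinux hooks (inode_has_perm, selinux_dentry_open), one quick test on the server, purely as a rough sketch and assuming auditd is running, would be:

Code:
# check the current SELinux mode and kernel version
getenforce
uname -r

# temporarily switch to permissive mode (does not survive a reboot)
setenforce 0

# look for recent AVC denials
grep AVC /var/log/audit/audit.log | tail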
 
Old 07-06-2009, 07:07 PM   #4
irishbitte
Senior Member
 
Registered: Oct 2007
Location: Brighton, UK
Distribution: Ubuntu Hardy, Ubuntu Jaunty, Eeebuntu, Debian, SME-Server
Posts: 1,213
Blog Entries: 1

Rep: Reputation: 88
Hmm, I really doubt it. Can you post
Code:
/etc/exports
 
Old 07-16-2009, 07:41 PM   #5
deck-
LQ Newbie
 
Registered: Mar 2009
Posts: 5

Original Poster
Rep: Reputation: 0
Code:
[root@rza ~]# cat /etc/exports
/nfs4exports 192.168.0.101(rw,sync,insecure,root_squash,no_subtree_check,fsid=0)
/nfs4exports/raid5 192.168.0.101(rw,nohide,sync,insecure,root_squash,no_subtree_check)
 
Old 07-16-2009, 08:29 PM   #6
irishbitte
Senior Member
 
Registered: Oct 2007
Location: Brighton, UK
Distribution: Ubuntu Hardy, Ubuntu Jaunty, Eeebuntu, Debian, SME-Server
Posts: 1,213
Blog Entries: 1

Rep: Reputation: 88
OK, maybe I can see a problem here: you seem to have two exports, but the second one is a subdirectory of the first? Why have you done this? The conditions of the first will supersede those of the second, and the NFS daemon may not function correctly. Also, can you try changing the 'sync' option to 'async'? It is slightly less dependable, but you may be suffering from a timeout issue with the 'sync' option.
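
For reference, that change would look roughly like this on the server (these are the export lines posted earlier, with sync swapped for async, then re-exported):

Code:
# /etc/exports with async instead of sync (sketch of the suggested change)
/nfs4exports        192.168.0.101(rw,async,insecure,root_squash,no_subtree_check,fsid=0)
/nfs4exports/raid5  192.168.0.101(rw,nohide,async,insecure,root_squash,no_subtree_check)

# re-read /etc/exports without restarting the NFS service
exportfs -ra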
 
Old 07-16-2009, 11:42 PM   #7
deck-
LQ Newbie
 
Registered: Mar 2009
Posts: 5

Original Poster
Rep: Reputation: 0
That is the correct way to export NFSv4 directories.

The /nfs4exports directory is the parent (pseudo-root, fsid=0) for all NFSv4 exports. You have to define a directory like this to export via NFSv4.

The real data directory is bind mounted at /nfs4exports/raid5.
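
For readers unfamiliar with this layout, the bind mount is typically set up along these lines, where /data/raid5 is a hypothetical path standing in for the real data directory:

Code:
# one-off bind mount of the real data directory into the NFSv4 pseudo-root
mount --bind /data/raid5 /nfs4exports/raid5

# persistent equivalent in /etc/fstab
# /data/raid5   /nfs4exports/raid5   none   bind   0 0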



I had the exact same problem with NFSv3, with the bind-mounted directory exported directly in the usual NFSv3 way. The fact that this is reproducible on different machines with different versions of NFS suggests it is not a problem with NFS at all.

Likewise, the fact that I can run this NFS server without problems using OpenSUSE 11.3 on the exact same hardware with the exact same setup suggests it is a problem somewhere in Fedora 11. I have to believe there is some security constraint that I have overlooked, even though I have tried disabling SELinux and the firewalls on both machines.

I will try changing to the async option.
 
Old 07-27-2013, 11:14 AM   #8
pdallas
LQ Newbie
 
Registered: Dec 2007
Posts: 8

Rep: Reputation: 0
NFS transfer failing for large files

...erm... did anyone ever solve this? I am still (in 2013) using OpenSuse 11.4 (Evergreen), and I still have this problem with transferring large files over NFS. Sync or async does not seem to make much difference; anything over a few megabytes gives an Input/output error.
I would update to a newer release of OpenSuse if it weren't for the bugs in its LXDE Bluetooth implementation (which is very important for us).
Any ideas?
 
Old 10-07-2021, 07:40 PM   #9
Ryno-mite
LQ Newbie
 
Registered: Oct 2021
Posts: 1

Rep: Reputation: Disabled
SOLVED for me

Years late to the party, I came across this thread, and the following fixed it for me.

irishbitte's suggestion to change the NFS server's export-level setting from
Code:
sync
to
Code:
async
did the trick. In my case, I was copying a large 140 GB file that kept failing at 70 GB. No matter what copy tool I used (the built-in Proxmox backup tool, pv, or even dd), it failed at exactly the same place!

Of course, this is a risky change: in the case of a power outage or an unexpected network or server interruption, async can result in corrupted data. In my case it's just redundant backup files; for mission-critical data the solution would be to create another NFS share on the same server with sync enabled.
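
As a sketch of that split (the paths and client network below are placeholders), the server could export, for example:

Code:
# hypothetical /etc/exports: async for disposable backups, sync for anything critical
/exports/backups    192.168.1.0/24(rw,async,no_subtree_check)
/exports/critical   192.168.1.0/24(rw,sync,no_subtree_check)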

Thanks!

Last edited by Ryno-mite; 10-07-2021 at 07:43 PM. Reason: fix typo and formatting
 
Old 10-09-2021, 05:29 AM   #10
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 2,806

Rep: Reputation: 1206
Could be a problem in TCP/IP: after some time it does an ARP broadcast that must be answered before the transfer can continue.
Try mounting with the option proto=udp to rule that out.
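
A mount along these lines would test that (server and paths are placeholders; note that the Linux NFSv4 client only supports TCP, so forcing UDP implies an NFSv3 mount):

Code:
# force the UDP transport to rule out a TCP/ARP stall (NFSv3 only)
mount -t nfs -o vers=3,proto=udp server:/export /mnt/test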

(Now seeing this is an old post, but maybe there are other forum readers interested.)

Last edited by MadeInGermany; 10-09-2021 at 05:33 AM.
 
Old 08-01-2022, 02:30 AM   #11
gbbijl
LQ Newbie
 
Registered: Aug 2022
Location: Heerewaarden, GLD, NL
Distribution: Red Hat Enterprise Linux
Posts: 1

Rep: Reputation: 0

I stumbled upon this thread but saw nothing that helped me. I also found https://access.redhat.com/solutions/282443; for those without a Red Hat account (which is free, by the way), here is a summary:

Remove any timeo or retrans mount options, or set them high enough (see the mount sketch below). The RPC calls are timing out, causing I/O errors. Note that timeo values are in deciseconds (tenths of a second), so a value of, for example, 10 means 1 second. By default, timeo is 600 (one minute) with 2 retries.

I also found out that this issue does not arise when the share is 'hard' mounted, so that is the other option here.
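
A rough sketch combining both of those options (server, path, and the exact values are placeholders):

Code:
# hard mount with a generous RPC timeout: timeo is in deciseconds,
# so timeo=1200 means 120 seconds per attempt, retried up to 5 times
mount -t nfs -o hard,timeo=1200,retrans=5 server:/export /mnt/data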

A third option is to use rsync --bwlimit=<a value not too high> <source> <destination>, as that throttles the speed of the transfer.
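
For example (file and destination are placeholders; on classic rsync the --bwlimit value is in KiB per second):

Code:
# throttle the transfer to roughly 20 MB/s
rsync --progress --bwlimit=20000 bigfile.img server:/backup/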

--Edit--
I seem to have found the golden solution: add 'sync' or 'noac' to the client mount options. There is also a 'sync'/'async' option on the server side, but that is only for server-side caching and has nothing to do with the client.
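
A sketch of that client-side change (server and mount point are placeholders); both options trade performance for safety, since sync forces synchronous writes and noac disables attribute caching:

Code:
# synchronous writes from the client
mount -t nfs -o sync server:/export /mnt/data

# or: disable client-side attribute caching entirely
mount -t nfs -o noac server:/export /mnt/data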

Last edited by gbbijl; 08-01-2022 at 03:33 AM. Reason: Added sync and noac
 
  

