I am intermittently (about 20% of the time) running into a problem when copying a large number of files (over 87000) from one disk to another. This is an older kernel (2.6.32-279.14.1.el6) in the latest CentOS, running as a Xen domU on AWS. The source and target filesystems are both ext4. The error message I get from rsync looks like this:
Code:
rsync: stat "/mnt/xvdj/tmp/.CharName.pm.OV0XJM" failed: No such file or directory (2)
rsync: rename "/mnt/xvdj/tmp/.CharName.pm.OV0XJM" -> "root/.cpan/build/Unicode-String-2.09-YxjsfF/lib/Unicode/CharName.pm": No such file or directory (2)
It happens on both 32-bit and 64-bit at almost the same frequency, though it seems slightly more frequent on 64-bit. It happens with different files, but I have seen it happen on the same file a couple of times (twice each for two files ... and the latest of those happened once in 32-bit and once in 64-bit).
The breakdown of what I think is happening is this: The rsync program receives a new file from the source (on disk, not network). The --temp-dir= option specifies a "tmp" directory in the target filesystem (which allows rsync to rename() the file to its ultimate place). It writes the file in that temporary directory and closes it. Then it stats the file and finds it is not there. Then it tries to rename it to its ultimate place (not sure why, after the stat fails), and that fails too. It then proceeds to the next file. Never more than one file is affected. And the AWOL file is not actually present in the "tmp" directory at the end (plausibly a bug in rsync meant it was never created, but I'd expect that to be more systemic and fail the same way each time).
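To check whether the filesystem itself loses the file, the sequence described above can be repeated without rsync. Below is a minimal sketch in Python, assuming the steps are write, close, stat, rename as I described them; I have not confirmed rsync's exact system calls. The function names, the "out" directory layout, and the file sizes are all made up for illustration.

```python
import os
import tempfile


def write_stat_rename(tmp_dir, final_path, data):
    """Write data to a temp file in tmp_dir, close it, stat it, then rename it.

    Returns True on success, False if the temp file went missing between
    close() and stat()/rename() (the failure seen in the rsync output).
    """
    # Create the destination directory first so that a FileNotFoundError
    # below can only come from the temp file itself.
    os.makedirs(os.path.dirname(final_path), exist_ok=True)

    # Hidden temp name in the temp directory, similar in spirit to rsync's.
    fd, tmp_path = tempfile.mkstemp(prefix=".", dir=tmp_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # The file is closed at this point.

    try:
        os.stat(tmp_path)                # step that failed first in the log
        os.rename(tmp_path, final_path)  # step that failed second in the log
    except FileNotFoundError:
        return False
    return True


def stress(target_root, count=1000, size=4096):
    """Repeat the sequence `count` times under target_root.

    Returns the list of destination paths for which the temp file vanished.
    """
    tmp_dir = os.path.join(target_root, "tmp")
    os.makedirs(tmp_dir, exist_ok=True)
    failures = []
    for i in range(count):
        # Spread files over 100 hypothetical subdirectories.
        final = os.path.join(target_root, "out", "d%03d" % (i % 100), "f%06d" % i)
        if not write_stat_rename(tmp_dir, final, b"x" * size):
            failures.append(final)
    return failures
```

Calling something like stress("/mnt/xvdj", count=87000) on the affected filesystem and checking for a non-empty result would show whether the problem reproduces outside rsync. An empty result would not prove much, since the timing differs from rsync's.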
When I run rsync under strace it never has this problem. But I've only done that half a dozen times since it slows things down quite a bit (which is probably why it succeeds). I have also mounted the filesystem with "-o sync" and that does not stop the problem ... it happened both times I used "-o sync".
But here is my question:
I had read a few years back that ext4 was not yet "prime time ready" for general use. I believe this was around the time of the 2.6.32 kernel. As I recall, the concern was ordering issues in ext4 that could cause files to be lost or invisible for a while (in the tiny time windows between system calls).
Is this what those earlier readings might have been talking about?
Should I switch back to ext3 or ext2? (Changing the kernel isn't a good option for this.)