ext3 fs goes ro after a day or three; nfs sharing issues
I have built three SAN partitions on my EMC DMX800 and attached them to a Linux server (Dell PowerEdge 2650, Linux 2.6.9-42.ELsmp #1 SMP) with the intent to share them out via NFS.
One of them is a home directory file system, auto-mounting to other Unix systems. The other two are just NFS-shared file systems. I am continually running into issues with these file systems where they go read-only or (in the case of the home directory) corrupt files. I have fsck'ed these and got them back to usability, only to have them get corrupted or go read-only again. I have disabled the home directory system so I can concentrate on one of the others, which is a critical file system for our network.
Errors I keep seeing in /var/log/messages:
Jul 28 19:30:01 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107529
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107530
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107531
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107533
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_journal_start_sb: Detected aborted journal
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_orphan_del: Journal has aborted
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_truncate: Journal has aborted
Jul 29 12:59:48 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 12:59:48 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:00:04 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:00:04 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:00:55 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:00:55 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:01:01 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:01:01 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:01:39 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:01:39 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:03:08 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:03:08 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:13:39 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:13:51 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:21:29 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:38:36 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:43:53 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:44:11 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:44:21 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:45:32 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:48:28 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 14:00:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:00:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:00:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:02:44 kcnfsp01 kernel: EXT3-fs: recovery complete.
Jul 29 14:02:44 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:08:31 kcnfsp01 kernel: EXT3-fs: journal inode is deleted.
Jul 29 14:29:53 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 22:09:54 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 22:09:54 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_journal_start_sb: Detected aborted journal
Jul 29 22:32:34 kcnfsp01 kernel: EXT3-fs: journal inode is deleted.
Jul 30 00:32:26 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 08:58:16 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 08:59:39 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 50985247
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_truncate: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_orphan_del: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_delete_inode: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_journal_start_sb: Detected aborted journal
Jul 30 09:03:14 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:05:02 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:21:39 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:23:45 kcnfsp01 kernel: EXT3-fs: recovery complete.
Jul 30 09:23:45 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_new_block: Allocating block in system zone - block = 16547840
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_journal_start_sb: Detected aborted journal
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_ordered_commit_write: Journal has aborted
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Marking fs in need of filesystem check.
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs: recovery complete.
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_new_block: Allocating block in system zone - block = 16547841
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_reserve_inode_write: Journal has aborted
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_journal_start_sb: Detected aborted journal
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_ordered_commit_write: Journal has aborted
Aug 2 21:00:08 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_readdir: bad entry in directory #25116673: rec_len % 4 != 0 - offset=0, inode=93754411, rec_len=21073, name_len=237
Aug 2 21:00:08 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_readdir: bad entry in directory #8142849: rec_len % 4 != 0 - offset=0, inode=3395865643, rec_len=15878, name_len=180
Aug 3 21:00:05 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Marking fs in need of filesystem check.
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs: recovery complete.
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
/dev/sdc is the home directory file system (big-time errors); sdd and sde are the others. /dev/sde is the one I am working on now, hoping its resolution will be applicable to the other two as well. When this happens, I unmount the file system from all hosts, remove it from /etc/exports, run 'exportfs -r' and then unmount it. When I re-mount it (no fsck) it's fine for another day or so, then goes ro again.
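To see at a glance which devices and which ext3 functions account for the errors, the /var/log/messages entries can be tallied per device. The following is a minimal Python sketch, assuming the syslog line format shown in the excerpts; the function name summarize_ext3_errors and the regular expression are illustrative and not taken from any existing tool.

```python
import re
from collections import Counter

# Matches both forms seen in the log excerpts (assumed format):
#   EXT3-fs error (device sdc1): ext3_free_blocks_sb: ...
#   EXT3-fs error (device sdc1) in ext3_truncate: ...
# Lines without a device, e.g. "EXT3-fs: recovery complete.", are skipped.
EXT3_EVENT = re.compile(
    r"EXT3-fs (?:error|warning) \(device (?P<dev>\w+)\):?\s+(?:in\s+)?(?P<func>\w+):"
)


def summarize_ext3_errors(lines):
    """Count EXT3-fs errors and warnings per (device, kernel function)."""
    counts = Counter()
    for line in lines:
        match = EXT3_EVENT.search(line)
        if match:
            counts[(match.group("dev"), match.group("func"))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the excerpts; in practice, pass
    # open("/var/log/messages") instead of this list.
    sample = [
        "Jul 28 19:30:01 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal",
        "Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_truncate: Journal has aborted",
        "Jul 29 12:59:48 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !",
        "Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Marking fs in need of filesystem check.",
    ]
    for (dev, func), n in sorted(summarize_ext3_errors(sample).items()):
        print(f"{dev:6} {func:28} {n}")
```

A per-device tally like this makes it easier to check whether all three devices really fail in the same way, which is the basis for suspecting the server rather than the SAN.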
The first time I ran fsck on the home directory file system, it reported journaling errors. It fixed about a zillion errors, but it also removed journaling and made the fs ext2, and when I remounted it, it was empty. I restored from tape but have since left it offline.

I have also checked with EMC, and there are no disk errors on any of the devices in this SAN system. It's used for a few dozen other servers, has been in place for years, and has no other issues. As all three file systems I have put on this server are showing the same or similar problems, I assume the issue is with the server and not the SAN.

I do also see this on boot, which I am not sure is related:
kernel: nfs warning: mount version older than kernel
amd[2614]: mount_nfs_fh: NFS version 3

In addition, NFS services are failing to start on normal reboot, although I have placed the scripts after all other network service scripts in /etc/rc2.d. When I log in after a reboot and start NFS, it starts fine.

I'm about at wits' end. Any suggestions? |
From my experience, when a filesystem jumps to read-only, it was hardware related. Either the drive is going bad or the controller is going bad or needs firmware updates.
|
Yeah, exploring that possibility now. It doesn't work on either fiber path, and the paths are connected to different fiber switches, so I'm looking inside the server itself. Nothing else seems to be a problem, though...
Thanks for your reply. |
You should run smartctl on those drives to see what the error rate is like. You also could have corruption on the system hard drive, such that the file system handlers and libraries are corrupted.
You have a hardware problem someplace. |