LinuxQuestions.org

Othyisar 08-04-2008 03:37 PM

ext3 fs goes ro after a day or three; nfs sharing issues
 
I have built three SAN partitions on my EMC DMX800 and attached them to a Linux server (Dell PowerEdge 2650, Linux 2.6.9-42.ELsmp #1 SMP) with the intent to share them out via NFS.

One of them is a home directory file system, auto-mounting to other unix systems. The other two are just NFS-shared file systems.

I keep running into problems with these file systems: they go read-only or, in the case of the home directory, end up with corrupted files. I have fsck'ed them and gotten them back to a usable state, only to have them get corrupted or go read-only again.

I have disabled the home directory system so I can concentrate on one of the others which is a critical file system for our network.

Errors I keep seeing in /var/log/messages:

Jul 28 19:30:01 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107529
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107530
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107531
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107533
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_journal_start_sb: Detected aborted journal
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_orphan_del: Journal has aborted
Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_truncate: Journal has aborted
Jul 29 12:59:48 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 12:59:48 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:00:04 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:00:04 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:00:55 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:00:55 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:01:01 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:01:01 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:01:39 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:01:39 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:03:08 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_check_descriptors: Block bitmap for group 16 not in group (block 33554432)!
Jul 29 13:03:08 kcnfsp01 kernel: EXT3-fs: group descriptors corrupted !
Jul 29 13:13:39 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:13:51 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:21:29 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:38:36 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:43:53 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:44:11 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:44:21 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:45:32 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 13:48:28 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 14:00:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:00:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:00:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:02:44 kcnfsp01 kernel: EXT3-fs: recovery complete.
Jul 29 14:02:44 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 14:08:31 kcnfsp01 kernel: EXT3-fs: journal inode is deleted.
Jul 29 14:29:53 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 29 22:09:54 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 29 22:09:54 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_journal_start_sb: Detected aborted journal
Jul 29 22:32:34 kcnfsp01 kernel: EXT3-fs: journal inode is deleted.
Jul 30 00:32:26 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 08:58:16 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 08:59:39 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 50985247
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_truncate: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_orphan_del: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_delete_inode: Journal has aborted
Jul 30 09:00:58 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_journal_start_sb: Detected aborted journal
Jul 30 09:03:14 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:05:02 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:21:39 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Jul 30 09:23:45 kcnfsp01 kernel: EXT3-fs: recovery complete.
Jul 30 09:23:45 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_new_block: Allocating block in system zone - block = 16547840
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_reserve_inode_write: Journal has aborted
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_journal_start_sb: Detected aborted journal
Jul 30 11:13:34 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_ordered_commit_write: Journal has aborted
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Marking fs in need of filesystem check.
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs: recovery complete.
Jul 30 12:54:57 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_new_block: Allocating block in system zone - block = 16547841
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_reserve_inode_write: Journal has aborted
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1): ext3_journal_start_sb: Detected aborted journal
Aug 1 12:55:01 kcnfsp01 kernel: EXT3-fs error (device sde1) in ext3_ordered_commit_write: Journal has aborted
Aug 2 21:00:08 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_readdir: bad entry in directory #25116673: rec_len % 4 != 0 - offset=0, inode=93754411, rec_len=21073, name_len=237
Aug 2 21:00:08 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_readdir: bad entry in directory #8142849: rec_len % 4 != 0 - offset=0, inode=3395865643, rec_len=15878, name_len=180
Aug 3 21:00:05 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_readdir: bad entry in directory #2: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs warning (device sde1): ext3_clear_journal_err: Marking fs in need of filesystem check.
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs: recovery complete.
Aug 4 10:23:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.

/dev/sdc is the home directory file system (big-time errors); sdd and sde are the others. /dev/sde is the one I am working on now, hoping its resolution will be applicable to the other two as well.
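To see at a glance which device is producing which kind of error, the kernel messages above can be tallied per device and per reporting function. The following is only a rough sketch: the regular expression is an assumption based on the two line shapes visible in the log excerpt ("(device X): func:" and "(device X) in func:"), and the function name tally_ext3_errors is made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern, derived from the two shapes seen in the log excerpt:
#   EXT3-fs error (device sdc1): ext3_readdir: bad entry ...
#   EXT3-fs error (device sdc1) in ext3_truncate: Journal has aborted
EXT3_ERROR = re.compile(
    r"EXT3-fs error \(device (?P<dev>\w+)\)(?::| in) (?P<func>\w+):"
)

def tally_ext3_errors(lines):
    """Count EXT3-fs error lines by (device, kernel function)."""
    counts = Counter()
    for line in lines:
        match = EXT3_ERROR.search(line)
        if match:
            counts[(match.group("dev"), match.group("func"))] += 1
    return counts

# A few lines copied from the excerpt above, used as sample input.
sample = [
    "Jul 28 19:30:01 kcnfsp01 kernel: EXT3-fs error (device sdd1): ext3_journal_start_sb: Detected aborted journal",
    "Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1): ext3_free_blocks_sb: bit already cleared for block 15107529",
    "Jul 29 10:25:25 kcnfsp01 kernel: EXT3-fs error (device sdc1) in ext3_reserve_inode_write: Journal has aborted",
    "Jul 29 14:00:05 kcnfsp01 kernel: EXT3-fs: mounted filesystem with ordered data mode.",
]

for (dev, func), n in sorted(tally_ext3_errors(sample).items()):
    print(dev, func, n)
```

Any iterable of lines works as input, so an open handle on /var/log/messages could be passed in place of the sample list.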

When this happens, I unmount the file system from all client hosts, remove it from /etc/exports, run 'exportfs -r' and then unmount it on the server. When I re-mount it (no fsck) it's fine for another day or so, then goes ro again.
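One way to catch the switch to read-only sooner is to look at the mount options the kernel itself reports in /proc/mounts, since /etc/mtab may still say rw after the kernel has remounted a file system read-only. This is a minimal sketch, not a tested monitoring tool: read_only_mounts is a hypothetical helper, and the mount point /export/data in the sample is invented for illustration.

```python
def read_only_mounts(mounts_text, fstypes=("ext3",)):
    """Return (device, mountpoint) pairs that are mounted read-only.

    mounts_text is text in /proc/mounts format:
    device mountpoint fstype options dump pass
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            found.append((device, mountpoint))
    return found

# Hypothetical sample; on a live system read open("/proc/mounts").read() instead.
sample_mounts = """\
/dev/sda1 / ext3 rw 0 0
/dev/sde1 /export/data ext3 ro,data=ordered 0 0
"""

print(read_only_mounts(sample_mounts))
```

Run from cron, a check like this could send a warning as soon as one of the SAN file systems flips to ro, rather than waiting for NFS clients to notice.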

I ran fsck on the home directory file system the first time it reported journaling errors. While it fixed about a zillion errors, it also removed journaling and made the fs ext2, and when I remounted it, it was empty. I restored from tape but have since left it offline.

I have also checked with EMC, and there are no disk errors on any of the devices in this SAN system. It's used by a few dozen other servers, has been in place for years, and has no other issues.

As all three file systems I have put on this server are showing the same or similar problems, I assume the issue is with the server and not the SAN.

I do also see this issue on boot, which I am not sure is related:

kernel: nfs warning: mount version older than kernel
amd[2614]: mount_nfs_fh: NFS version 3

In addition, NFS services are failing to start on normal reboot, although I have placed the scripts after all other network service scripts in /etc/rc2.d. When I log in after a reboot and start NFS, it starts fine.

I'm about at wits' end. Any suggestions?

trickykid 08-06-2008 02:13 PM

From my experience, when a filesystem jumps to read-only, it was hardware related. Either the drive is going bad or the controller is going bad or needs firmware updates.

Othyisar 08-08-2008 03:02 PM

Yeah, exploring that possibility now. It doesn't work on either fiber path, and the paths are connected to different fiber switches, so I'm looking internally. Nothing else seems to be a problem, though...

Thanks for your reply.

jiml8 08-08-2008 10:04 PM

You should run smartctl on those drives to see what the error rate is like. You could also have corruption on the system hard drive, such that the file system handlers and libraries are corrupted.
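A small wrapper around the smartctl suggestion might look like the sketch below. It assumes smartmontools is installed and that the script runs as root. The helper names smart_health_ok and check_device are made up for illustration. LUNs presented by a SAN array may not pass SMART data through at all, in which case the parser returns None instead of a verdict.

```python
import subprocess

def smart_health_ok(output):
    """Interpret `smartctl -H` text: True, False, or None if no verdict is found."""
    for line in output.splitlines():
        # Wording printed for ATA devices
        if "overall-health self-assessment test result" in line:
            return line.strip().endswith("PASSED")
        # Wording printed for SCSI devices
        if line.startswith("SMART Health Status:"):
            return line.split(":", 1)[1].strip() == "OK"
    return None

def check_device(device):
    """Run `smartctl -H` on one device and interpret its output."""
    result = subprocess.run(
        ["smartctl", "-H", device], capture_output=True, text=True
    )
    return smart_health_ok(result.stdout)
```

For example, check_device("/dev/sdc") would report on the home directory device named earlier in the thread. A None result only means smartctl gave no health line, not that the disk is healthy.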

You have a hardware problem someplace.


All times are GMT -5.