LinuxQuestions.org - Troubleshooting large partition "attempt to access beyond end of device"

I apologize if this seems long, but I wanted to show exactly what happened and what I have done. I may have unwittingly made the problem worse along the way, so please let me know if any of my steps did more harm than good.

In short, I have a 7.5TB partition with data I cannot access, with the infamous "attempt to access beyond end of device" kernel message.

Here's what happened:

There is a server with 24 drives. Twelve drives were used to make up a 7.5TB RAID-5 array, and the other twelve drives were not raided yet and held valuable data. I created a partition on the RAID-5 array using the parted that shipped with Ubuntu 8.04 Hardy, put ext3 on it using mkfs.ext3, and copied the data over.

So /dev/sdb1 was the single 7.5TB partition with a copy of the data. (I could have put a filesystem directly on /dev/sdb, but I wanted to stay consistent with the setups on many other servers.) After copying the data to this partition, I checked that it was all there and identical to the original (using diff). I then wiped the non-raided drives and raided them. (The problems I am having are with the first raided drive, not the second, so I believe we can forget about the latter.)

After restarting the server, I mounted the first 7.5TB drive, which is the one having problems. Here's the entry from df -h:

--------------------------START------------------------------
...
/dev/sdb1 7.4T 2.1T 5.0T 29% /media/sdb1
---------------------------END-------------------------------

It mounts very quickly and looks good. But here's what happens when I ls the directory:

--------------------------START------------------------------
ls: cannot access sdb1/dir3: Input/output error
ls: cannot access sdb1/dir1: Input/output error
ls: cannot access sdb1/dir2: Input/output error
lost+found dir1 dir2 dir3
---------------------------END-------------------------------

And output from dmesg | tail:

--------------------------START------------------------------
SELinux: initialized (dev sdb1, type ext3), uses xattr
attempt to access beyond end of device
sdb1: rw=32, want=4195876888, limit=3228132399
EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=262242305, block=524484610
attempt to access beyond end of device
sdb1: rw=32, want=4218945560, limit=3228132399
EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=263684097, block=527368194
attempt to access beyond end of device
sdb1: rw=32, want=12908757016, limit=3228132399
EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=806797313, block=1613594626
---------------------------END-------------------------------
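
To put the want/limit figures in perspective, here is a minimal Python sketch of the arithmetic. It assumes that want and limit are counted in 512-byte sectors and that the ext3 block size is 4096 bytes (the value tune2fs reports further down); the variable names are made up for illustration.

```python
# Values copied from the dmesg output above.
SECTOR_BYTES = 512   # assumption: want/limit are in 512-byte sectors
BLOCK_BYTES = 4096   # ext3 block size, taken from the tune2fs output
SECTORS_PER_BLOCK = BLOCK_BYTES // SECTOR_BYTES

limit_sectors = 3228132399
errors = [  # (want in sectors, ext3 block number from the EXT3-fs error)
    (4195876888, 524484610),
    (4218945560, 527368194),
    (12908757016, 1613594626),
]

limit_bytes = limit_sectors * SECTOR_BYTES
print(f"device limit: {limit_bytes / 1e12:.2f} TB")

for want, block in errors:
    # Sector at which a one-block read of this ext3 block would end.
    end_sector = (block + 1) * SECTORS_PER_BLOCK
    print(f"block {block}: read ends at {want * SECTOR_BYTES / 1e12:.2f} TB, "
          f"matches want: {want == end_sector}, "
          f"beyond limit: {want > limit_sectors}")
```

Under those assumptions the device ends at roughly 1.65 TB, while each of the three failed reads ends well past that point, and each want value is exactly the end of the 4096-byte block named in the matching EXT3-fs error.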

I can cd into the mounted partition and access lost+found and its contents directly, but not the other three directories.

From here on, I will show the three steps I took, in chronological order. (I repeated some steps, as I was muddling my way towards a better understanding.)
(1) Run fsck
(2) Try to resize partition using parted
(3) Try to temporarily remove some filesystem features so I can use parted

(STEP 1) I decided to run fsck /dev/sdb1:

--------------------------START------------------------------
SELinux: initialized (dev sdb1, type ext3), uses xattr
attempt to access beyond end of device
sdb1: rw=32, want=4195876888, limit=3228132399
EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=262242305, block=524484610
attempt to access beyond end of device
sdb1: rw=32, want=4218945560, limit=3228132399
EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=263684097, block=527368194
attempt to access beyond end of device
sdb1: rw=32, want=12908757016, limit=3228132399
EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=806797313, block=1613594626
[root@localhost media]# umount /dev/sdb1
[root@localhost media]# fsck /dev/sdb1
fsck 1.40.8 (13-Mar-2008)
e2fsck 1.40.8 (13-Mar-2008)
The filesystem size (according to the superblock) is 2014129285 blocks
The physical size of the device is 403516549 blocks
Either the superblock or the partition table is likely to be corrupt!
Abort<y>? no

/dev/sdb1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error reading block 403537922 (Invalid argument) while getting next inode from scan. Ignore error<y>? yes

Force rewrite<y>? no

Error reading block 403537923 (Invalid argument) while getting next inode from scan. Ignore error<y>? no

Error while scanning inodes (201768960): Can't read next inode
e2fsck: aborted

---------------------------END-------------------------------
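
The two sizes e2fsck complains about are easier to compare in bytes. The following Python sketch is only a cross-check of the figures above; it assumes e2fsck counts in the 4096-byte blocks that tune2fs reports further down, and the helper name to_tb is hypothetical.

```python
BLOCK_BYTES = 4096  # assumption: e2fsck counts in 4096-byte filesystem blocks

superblock_blocks = 2014129285  # filesystem size according to the superblock
device_blocks = 403516549       # physical size of the device per e2fsck
first_bad_block = 403537922     # first block e2fsck failed to read


def to_tb(blocks):
    """Convert a count of filesystem blocks to decimal terabytes."""
    return blocks * BLOCK_BYTES / 1e12


print(f"superblock says: {to_tb(superblock_blocks):.2f} TB")
print(f"device offers:   {to_tb(device_blocks):.2f} TB")
print(f"missing:         {to_tb(superblock_blocks - device_blocks):.2f} TB")
print("first unreadable block lies past the device end:",
      first_bad_block >= device_blocks)
```

With these inputs the superblock describes about 8.25 TB and the device about 1.65 TB, so every read error in Pass 1 concerns a block that the kernel considers outside the partition.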

The above continues for every consecutive block number following 403537922, i.e., for a very long time. I cannot use the -y option (and I wonder whether I should anyway), because the very first prompt is "Abort<y>?".

(In the past, I answered 'y' to Ignore/Force rewrite, but it was taking entirely too long.)

I also tried repairing with several alternate superblocks, but that didn't change anything.

I got ahead of myself and reasoned that if I could expand the partition, the data might (hopefully) still reside on the disk.

(STEP 2) So my parted session:

--------------------------START------------------------------
> parted /dev/sdb
GNU Parted 1.8.8
Using /dev/sdb
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: AMCC 9550SX-12M DISK (scsi)
Disk /dev/sdb: 8250GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number Start End Size Type File system Flags
1 32.3kB 1653GB 1653GB primary ext3

(parted) resize 1 32.3kB 8250GB
Error: The file system is bigger than its volume!
Ignore/Cancel? Ignore
Warning: File system has errors! You should run e2fsck.
Ignore/Cancel? Ignore
Error: File system has an incompatible feature enabled.
(parted)
---------------------------END-------------------------------

Note that parted puts the partition size at ~1.6TB, not the 7.4T reported by df.
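
The gap between the two sizes can be examined with a short Python sketch. The idea being tested is only a hypothesis, not something parted or e2fsck reports: that a sector count was truncated to 32 bits somewhere, which is conceivable because an msdos partition table stores partition sizes as 32-bit sector counts. The 512-byte sector size comes from the parted output above, the 4096-byte block size from tune2fs, and the variable names are illustrative.

```python
SECTOR_BYTES = 512   # logical sector size reported by parted
BLOCK_BYTES = 4096   # ext3 block size reported by tune2fs
SECTORS_PER_BLOCK = BLOCK_BYTES // SECTOR_BYTES

superblock_blocks = 2014129285  # filesystem size per superblock (e2fsck)
device_blocks = 403516549       # physical device size per e2fsck
limit_sectors = 3228132399      # "limit" from the kernel messages

# The kernel's limit and e2fsck's device size should describe the same device.
print("limit matches device size:",
      limit_sectors // SECTORS_PER_BLOCK == device_blocks)

# How many sectors does the filesystem expect beyond the end of the device?
missing_sectors = (superblock_blocks - device_blocks) * SECTORS_PER_BLOCK
wraps, remainder = divmod(missing_sectors, 2**32)
print(f"missing sectors = {wraps} * 2**32 + {remainder}")
print(f"missing size    = {missing_sectors * SECTOR_BYTES / 2**40:.2f} TiB")
```

Under these assumptions the gap comes out to exactly three multiples of 2**32 sectors (6 TiB), which would be consistent with, but does not prove, a 32-bit wraparound of the partition size.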

(STEP 3) I wanted to try to figure out what features were incompatible. I searched around online and then I ran tune2fs -l /dev/sdb1:

--------------------------START------------------------------
tune2fs 1.40.8 (13-Mar-2008)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: eaf4f430-1380-4125-9498-65bbc94a4d07
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal resize_inode dir_index filetype sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 1007075328
Block count: 2014129285
Reserved block count: 100706464
Free blocks: 1441093558
Free inodes: 1007009890
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 543
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16384
Inode blocks per group: 512
Filesystem created: Thu Sep 18 18:41:54 2008
Last mount time: Wed Oct 15 17:04:44 2008
Last write time: Wed Oct 15 17:10:06 2008
Mount count: 15
Maximum mount count: 26
Last checked: Thu Sep 18 18:41:54 2008
Check interval: 15552000 (6 months)
Next check after: Tue Mar 17 18:41:54 2009
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: c4650f98-ce1d-4305-ae22-36cbb05a095a
Journal backup: inode blocks
---------------------------END-------------------------------
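
As a sanity check on the superblock itself, the tune2fs fields can be tested against each other. This Python sketch assumes the usual ext2/ext3 layout in which every block group, including a partial last one, carries the full inodes-per-group count; the variable names are illustrative.

```python
# Fields copied from the tune2fs -l output above.
block_count = 2014129285
block_size = 4096
blocks_per_group = 32768
inodes_per_group = 16384
inode_count = 1007075328

# Number of block groups, rounding up for a partial last group.
groups = (block_count + blocks_per_group - 1) // blocks_per_group

# Assumption: every group holds the same number of inodes.
expected_inodes = groups * inodes_per_group

size_tib = block_count * block_size / 2**40

print(f"block groups:    {groups}")
print(f"expected inodes: {expected_inodes} (superblock says {inode_count})")
print(f"filesystem size: {size_tib:.2f} TiB")
```

Here the inode count matches the group count exactly and the size comes to about 7.5 TiB, so the superblock appears internally consistent with a filesystem created on the full array. That suggests, without proving it, that the superblock is not the corrupt piece. The 7.4T shown by df is roughly in line with that figure once filesystem overhead is subtracted.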

So I tried to temporarily turn off some features that I had read about on various forums as possible culprits:

--------------------------START------------------------------
> tune2fs -O ^resize_inode /dev/sdb1
tune2fs 1.40.8 (13-Mar-2008)

Please run e2fsck on the filesystem.

> tune2fs -O ^dir_index /dev/sdb1
tune2fs 1.40.8 (13-Mar-2008)
> tune2fs -l /dev/sdb1

(Same output as before, i.e., the features were not disabled.)
---------------------------END-------------------------------

This is where I am stuck: I cannot repair the filesystem using fsck/e2fsck, and every attempt to resize the partition fails. (The resize rests on my assumption that it might help, which could be wrong.)

My questions:

1. Am I missing anything obvious? I have been searching through forums for solutions, and this post reports everything I have learned in the process.

2. If I cannot solve this myself, should I consider a data recovery service? It would only be a recommendation, as the server is not mine (I'm helping to figure out the problem as part of a collaboration). I am part of a publicly funded group, so you can imagine that in our economy money is tight and highly regulated.

I really appreciate any wisdom or insights. This is far from my primary work responsibility, but there is no one with the clearly-defined role of caring for this particular server.

I really want to step up and resolve the issue, and learn a little more about what happened in the process. I cannot thank you enough for any feedback.