How to recover from a RAID “group descriptors corrupted” failure?
I have a dedicated RAID box (ReadyNAS Ultra 4) that lost power suddenly. Now it will not mount due to a variety of errors, and I'm seeking guidance from anyone who may be able to offer advice, as I'm a bit out of my depth.
The setup is a four-disk RAID 5 array with three 2 TB drives and an older 250 GB drive. It has been stable for years, which let my backup habits degrade, so I cannot afford to lose the data. The smaller drive has been throwing errors, but I do not believe it has failed; I believe the sudden power loss is the cause.
I have been doing quite a bit of data gathering the past few days, but have done nothing to endanger the data (I hope).
From the syslog, this is the first error.
Code:
Feb 7 12:29:42 kernel: EXT4-fs (dm-0): ext4_check_descriptors: Block bitmap for group 17216 not in group (block 623566614080)!
Feb 7 12:29:42 kernel: EXT4-fs (dm-0): group descriptors corrupted!
And here is some basic information about the array:
Running e2fsck -n /dev/c/c produces a large amount of output, starting as shown below, and it would like to relocate ~25,000 blocks.
Code:
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
e2fsck: Group descriptors look bad... trying backup blocks...
Block bitmap for group 5504 is not in group. (block 18375699958922960920)
Relocate? no
Inode bitmap for group 5504 is not in group. (block 408137612634012487)
Relocate? no
Inode table for group 5504 is not in group. (block 9710113478488063446)
WARNING: SEVERE DATA LOSS POSSIBLE.
Relocate? no
...
Mounting gives this error:
Code:
# cat /etc/fstab | grep /c
/dev/c/c /c ext4 defaults,acl,user_xattr,usrjquota=aquota.user,grpjquota=aquota.group,jqfmt=vfsv1,noatime,nodiratime 0 2
# mount /dev/c/c /c
mount: wrong fs type, bad option, bad superblock on /dev/c/c,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
I had thought I had a corrupt superblock and tried mounting with the backup superblocks I found via dumpe2fs, with no success. I'd like to back up the data I have before making any rescue attempts, but I assume I would need to buy four 2 TB disks to dd the data over.
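In case it helps anyone following along, the backup-superblock attempt looked roughly like this - 32768 and the 4 KiB block size are just typical values, and the real numbers come out of dumpe2fs:
Code:
# list the primary and backup superblock locations
dumpe2fs /dev/c/c | grep -i superblock
# read-only check against the first backup superblock (makes no changes)
e2fsck -n -b 32768 -B 4096 /dev/c/c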
My current line of thinking is that I need to run e2fsck -f -y, but I'm continuing to research to make sure I've covered all my options before modifying the disks. Also, I think I may be missing something obvious, or some RAID-level step I haven't tried, as I am not familiar with the mdadm commands. For example, perhaps I could break and rebuild the array with just the three 2 TB drives?
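The only read-only status checks I've found so far are these (the md device name is a guess for my box; on a ReadyNAS it may be /dev/md2 or similar, and the member partition will vary too):
Code:
# the kernel's view of all md arrays and their member disks
cat /proc/mdstat
# detailed state of one array (substitute the correct md device)
mdadm --detail /dev/md0
# per-member RAID metadata, run against each component partition
mdadm --examine /dev/sda3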
Anyway, I appreciate it if you've read this far, and I'm happy to hear any ideas you may have.
Not at all - perhaps you would like to explain further, @smallpond?
To me it looks like the array itself is OK, but the filesystem is hosed - so apparently I am in agreement with the OP.
Quote:
Originally Posted by jaypifer
I'd like to back up the data I have before making any rescue attempts, but I assume I would need to buy four 2 TB disks to dd the data over.
No "like" about it - you must get a backup of all the data before you start playing around. Work on the copy only. It's usually preferable to have two copies so you can simply re-copy (from the second copy) to try different scenarios, but in your case the hardware looks OK (that 250 GB drive might be a worry), so you can re-copy from the array if needed.
Quote:
My current line of thinking is that I need to run e2fsck -f -y, but I'm continuing to research to make sure I've covered all my options before modifying the disks. Also, I think I may be missing something obvious, or some RAID-level step I haven't tried, as I am not familiar with the mdadm commands. For example, perhaps I could break and rebuild the array with just the three 2 TB drives?
fsck is for fixing the filesystem so it is consistent. Your data may be compromised, as you have been warned. You may get all the fragments saved in lost+found. Or you may not.
And even if you do, you may not be able to ascertain which files were affected. Or how.
You can't break the RAID up as you suggested - you would have to re-create the array, and lose all your data.
RAID is not a substitute for good backups. There is no substitute.
Once you have an image to work on, maybe try something like photorec - it does more than just photos. It's a scraper, and will take forever, but it may do what you need. There are other similar forensic tools, but they will all add significantly to the time, and may not add much benefit.
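As a rough sketch, assuming your clone ends up as an image file at /mnt/backup/array.img and you have somewhere with enough space for the recovered files, the invocation is along these lines:
Code:
# scrape recoverable files from the image into /mnt/recovered
# (photorec then walks you through its menus)
photorec /d /mnt/recovered/ /mnt/backup/array.img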
Okay, it sounds like I didn't miss something obvious and there is no simple fix. I'll order the four drives and roll up my sleeves in a few days after they arrive. It's a good idea to have the copies around for several failures until success is achieved. I'll update here on progress.
I understand that the data may have issues, but I hope that all is well. Nothing has been written to the drives since the failure, nor do I believe any files were in use at the time.
Quote:
Originally Posted by jaypifer
It's a good idea to have the copies around for several failures until success is achieved.
Definitely! Running "fsck -y" always has the possibility of unrecoverable data loss. Note the "WARNING: SEVERE DATA LOSS POSSIBLE" in one of your "fsck -n" runs. It's the job of fsck to make the filesystem consistent, and sometimes that comes at the expense of user data. There are programs like testdisk that can recover files from damaged filesystems, but that can become much harder once that damage has been "fixed".
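A harmless read-only starting point with testdisk, again assuming a clone at /mnt/backup/array.img, would be:
Code:
# list the partition structures testdisk can find, without writing anything
testdisk /list /mnt/backup/array.img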
As an update, I got the four drives and ended up taking the drives out of the array one by one and cloning them using my desktop. A bit of research told me to use ddrescue rather than dd. Carefully checking each drive as I swapped them out, I cloned each one in turn.
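The invocation was along these lines - the device names here are examples, so check yours carefully before running, and the map file lets ddrescue resume an interrupted copy and retry bad sectors:
Code:
# clone the source member (/dev/sdb) onto the new disk (/dev/sdc);
# the map file records progress so the copy can be stopped and resumed
ddrescue -f /dev/sdb /dev/sdc /root/sdb.map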
Each 2 TB drive took 8 hours to copy. Now I've kicked off:
Code:
e2fsck -y -f -v /dev/c/c
In hindsight, I could and should have used the -C flag to know what's going on. It has been running for three days now with the NAS CPU pegged at 99.7%. I tried killall -USR1 e2fsck to no avail. I've checked /sys/block/sd*/stat and reads and writes seem to be happening, so I guess I'll just wait a few weeks to see if it finishes.
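For anyone else in this spot, these are the progress options I wish I'd used from the start (the device path is from my setup; yours will differ):
Code:
# run the check with a completion bar from the start
e2fsck -C 0 -f -y -v /dev/c/c
# per the e2fsck man page, SIGUSR1 asks a running e2fsck to display a completion bar
kill -USR1 $(pidof e2fsck)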
Just wanted to close this out for anyone who may have the same issue. After five days, I did end up killing the process and restarting it. Some analysis showed it had been working the whole time, just slowly. I restarted with -C and enjoyed about three hours of additional progress reporting before the CPU pegged at 99.8% again.
I waited another six days and the process finished. After more analysis, I rebooted and mounted. From what I can tell, almost everything is safe and sound. I do have 991 items in lost+found/, but they appear to be directory listings or possibly things that were previously in the recycle bin. I'll go through those slowly at a later time.
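When I do go through them, something like this should help identify what the numbered entries actually are (the path assumes the filesystem is mounted at /c):
Code:
# lost+found entries are named after inode numbers; 'file' guesses their contents
file /c/lost+found/* | sort | less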