LinuxQuestions.org


itjstagame 11-22-2010 12:25 PM

Software Raid, lots of issues after a power outage, please help me keep data
 
Ok, I've had a 4x 400GB RAID 5 running for exactly 3 years now. There have been plenty of power outages in that time and lots of resyncing afterwards, but everything has been fine each time with no issues.

Last week after a power outage and reboot I went to check the status of the resync and it only listed 3 hdds.

Looking at dmesg:

md: bind<sdb>
md: bind<sdc>
md: bind<sdd>
md: bind<sda>
md: kicking non-fresh sdd from array!
md: unbind<sdd>
md: export_rdev(sdd)
md: md0: raid array is not clean -- starting background reconstruction
raid5: device sda operational as raid disk 0
raid5: device sdc operational as raid disk 2
raid5: device sdb operational as raid disk 1
raid5: cannot start dirty degraded array for md0
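To make the log above easier to read at a glance, here is a small Python sketch that pulls out which member was kicked, which slots came up, and whether md refused to start the array. The function name `summarize_md_log` and the regular expressions are my own illustration, written only against the lines quoted in this post; other kernel versions may word these messages differently.

```python
import re

# The dmesg excerpt quoted above, verbatim.
DMESG = """\
md: bind<sdb>
md: bind<sdc>
md: bind<sdd>
md: bind<sda>
md: kicking non-fresh sdd from array!
md: unbind<sdd>
md: export_rdev(sdd)
md: md0: raid array is not clean -- starting background reconstruction
raid5: device sda operational as raid disk 0
raid5: device sdc operational as raid disk 2
raid5: device sdb operational as raid disk 1
raid5: cannot start dirty degraded array for md0
"""

def summarize_md_log(text):
    """Hypothetical helper: summarize the md/raid5 lines in a dmesg excerpt."""
    kicked = re.findall(r"kicking non-fresh (\w+) from array", text)
    operational = {
        int(slot): dev
        for dev, slot in re.findall(
            r"device (\w+) operational as raid disk (\d+)", text)
    }
    return {
        "kicked": kicked,
        "operational": operational,
        "dirty": "is not clean" in text,
        "refused": "cannot start dirty degraded array" in text,
    }

summary = summarize_md_log(DMESG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Read this way, the array had two separate problems at boot: it was degraded (sdd kicked, slots 0 to 2 up) and it was dirty (not shut down cleanly), and the last line shows md declining to start it in that combined state.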

Following information I found online, I ran mdadm with --fail and --remove, which stated sdd was already removed (which made sense since only 3 drives were listed).
Then I ran mdadm -R /dev/md0

It came online and started resyncing, so I thought all was well and left it.

After the resync finished, I tried to access some recently downloaded files and could neither write to the disk nor read the files I wanted.

Checking syslog, I saw thousands of these:
Nov 22 07:49:02 Byznotchnyai kernel: attempt to access beyond end of device
Nov 22 07:49:02 Byznotchnyai kernel: md0: rw=0, want=14963797120, limit=2344267776
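For a sense of scale, the two numbers in that message can be converted to bytes. This is only a sketch: it assumes `want` and `limit` are counted in 512-byte sectors, which is how the kernel block layer normally reports them, and the helper name is made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: want/limit are 512-byte sector counts

def sectors_to_bytes(sectors):
    """Convert a sector count from the kernel message to bytes."""
    return sectors * SECTOR_BYTES

want = 14963797120   # from the syslog line
limit = 2344267776   # from the syslog line

limit_bytes = sectors_to_bytes(limit)
want_bytes = sectors_to_bytes(want)

print(f"md0 size      : {limit_bytes / 1e9:.1f} GB")
print(f"requested end : {want_bytes / 1e9:.1f} GB")
print(f"want / limit  : {want / limit:.2f}")
```

Under that assumption md0 is about 1.2 TB, which is what three data disks' worth of a 4x 400GB RAID 5 should be, while the read is aimed at roughly 7.7 TB. So the device size itself looks right, and something above md is asking for a block that cannot exist on it.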

cat /proc/mdstat seemed fine; md thought the array was sound. Everything I read online says it must just be errors in the ext3 filesystem itself and to run fsck, but I worry that if the problem is md-related, 'fixing' it in ext3 will just delete almost all of my files.

At first the 'bad' files seemed to be files that were in the process of downloading when the power went out, so I thought, ok, that's fine. But then I noticed stuff that had finished downloading weeks ago wasn't working either. So I started copying off things that are irreplaceable (like 5 years of pictures), and even some of those are throwing I/O errors.

It seems a good 1/4-1/3 of all of my files are 'bad' and I know if I let fsck do its thing it'll just delete them all.

The hdd itself seems fine; all the drives report the same info in smartctl and none throw any errors, so I don't know why just that one would be non-fresh or why a resync would trash my data.

I've heard of backup superblocks but I'm not sure how to find them. Does anyone have suggestions on how to either reassemble the md array (which I've seen mentioned a few times but which also worries me) or check whether it's really ext3 that is at fault? Or how to see which hdd in the array is throwing the I/O errors? Maybe it really is just a bad drive somehow.
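On where backup superblocks usually live: with the `sparse_super` feature, ext2/ext3 keeps copies in block group 1 and in groups that are powers of 3, 5 and 7. The Python sketch below lists the expected block numbers under several assumptions that are not confirmed anywhere in this thread: a 4 KiB block size, the default of 8 x block size blocks per group, `sparse_super` enabled, and a filesystem that fills md0 (sized from the `limit=` value, read as 512-byte sectors). The function name is hypothetical; the output of `dumpe2fs` on the real filesystem is the authoritative source.

```python
def backup_superblock_blocks(block_size, total_blocks):
    """Expected backup superblock locations for a sparse_super ext2/ext3
    filesystem created with default geometry (an assumption, see above)."""
    blocks_per_group = 8 * block_size              # one bitmap block covers a group
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division: number of block groups in the filesystem.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)

    groups = {1}
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base

    return [first_data_block + g * blocks_per_group
            for g in sorted(groups) if g < group_count]

# md0 size taken from the syslog "limit=" value, assuming 512-byte sectors.
total_blocks = 2344267776 * 512 // 4096
candidates = backup_superblock_blocks(4096, total_blocks)
print(candidates[:10])
```

If the filesystem was made with a different block size or without `sparse_super`, the real locations will differ from this list, so treat the numbers as candidates to try rather than as facts about this array.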

I'm really at a loss and I'm very annoyed because I put all of my important info on my Raid 5 thinking it'd be 'safer' than another method. Thanks so much.

jefro 11-22-2010 04:06 PM

In my opinion the problem was the software raid; I have never liked them. A true hardware raid may also have left you in the ditch, though.

My only guess is the partitions have overlapped but that is a wild guess.

Another issue is what type of journaling is in use on the ext3.


You might boot to a live CD and then see how it tries to access the array, and what tools it has to recover. http://planet.admon.org/howto/using-...to-check-ext3/

itjstagame 11-22-2010 06:18 PM

It's purely a data partition, so I can test and try to fix it from inside my system.

It's ext3, I didn't know there were different journaling options.

I mean, my array is reporting all good on the md front; it's just ext3 that's telling me there are errors and that a check should be forced. That would be fine normally, but it is flagging a significant number of my files as errors.

It makes me wonder if the drives are in the wrong order or the parity is messed up somehow. I did just try failing sdd again, to see if maybe the array could run off just the other 3 hdds using parity in case the data on sdd was wrong, but the filesystem and I/O issues are exactly the same.
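To spell out what "running off 3 hdds using parity" relies on, here is a toy Python model of one RAID 5 stripe. It is a simplified sketch, not how md lays data out on disk: it ignores chunk size and the rotation of parity between members, and the block contents are made up.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a 4-disk RAID 5: three data blocks plus one parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# If the disk holding d2 is missing, XOR of the survivors gives it back.
rebuilt = xor_blocks(d0, d1, parity)
print(rebuilt == d2)

# If one survivor holds stale data (written before the others were updated),
# the same calculation silently produces garbage instead of d2.
stale_d1 = b"bbbb"
garbage = xor_blocks(stale_d1, d0, parity)
print(garbage == d2)
```

The second case is why stale members or inconsistent parity matter so much: the reconstruction still "succeeds" from md's point of view, but the bytes handed to the filesystem are wrong.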

On that note, I just realized it's strange the system didn't mount with just 3 drives in the first place; one non-fresh disk shouldn't be an issue with raid 5. Also, while copying data off, it seems like anything from Oct or Nov is having an issue, while older stuff seems mostly fine. I'm at a loss.

I am trying to copy off what I can for now, but I'm not getting a lot. I guess if I have to rebuild I'll go with RAID 1 and LVM or maybe 0+1; at least with mirroring I will know how to sanely get at my data.

jefro 11-22-2010 07:44 PM

Well, there are plenty of posts on the backup superblocks.

You can certainly try fsck.ext3 (http://linux.die.net/man/8/fsck.ext3) with a backup superblock. It will tell you if you have the wrong format.

I'd still be tempted to do that from a live cd just to be sure you have complete control over it.
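A cautious way to try this is a read-only pass first. The Python sketch below only builds and prints the command line, it does not run anything. The flags are standard e2fsck options (`-n` opens the filesystem read-only and answers "no" to every question, `-b` selects an alternate superblock, `-B` gives the block size), but the block number 32768, the 4096-byte block size, and the helper name are assumptions for illustration, not values taken from this array.

```python
import shlex

def readonly_fsck_command(device, superblock=None, block_size=None):
    """Hypothetical helper: argv for a read-only e2fsck pass."""
    argv = ["e2fsck", "-n"]          # -n: read-only, answer "no" to all prompts
    if superblock is not None:
        argv += ["-b", str(superblock)]   # use a backup superblock
    if block_size is not None:
        argv += ["-B", str(block_size)]   # block size used to locate it
    argv.append(device)
    return argv

cmd = readonly_fsck_command("/dev/md0", superblock=32768, block_size=4096)
print(shlex.join(cmd))
```

Because of `-n`, a run like this changes nothing on disk; it only reports what fsck would want to fix, which gives an idea of the damage before committing to a real repair. The filesystem should be unmounted while it runs.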


All times are GMT -5.