Software RAID, lots of issues after a power outage, please help me keep my data
OK, I've had a 4x 400 GB RAID 5 array running for exactly three years now. There have been plenty of power outages in that time, and lots of resyncing afterwards, but it has always come back with no issues.
Last week, after a power outage and reboot, I went to check the status of the resync and it only listed 3 HDDs.
Looking at dmesg:
md: bind<sdb>
md: bind<sdc>
md: bind<sdd>
md: bind<sda>
md: kicking non-fresh sdd from array!
md: unbind<sdd>
md: export_rdev(sdd)
md: md0: raid array is not clean -- starting background reconstruction
raid5: device sda operational as raid disk 0
raid5: device sdc operational as raid disk 2
raid5: device sdb operational as raid disk 1
raid5: cannot start dirty degraded array for md0
Following information I found online, I ran mdadm with --fail and --remove, which reported that sdd was already removed (which made sense, since only 3 drives were listed).
Then I ran mdadm -R /dev/md0.
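Spelled out in full, the commands I ran were along these lines (device names taken from the dmesg output above; the exact option spelling is from memory, so treat this as approximate):

```shell
# Mark the kicked member as failed and remove it from the array
# (mdadm replied that sdd was already removed)
mdadm /dev/md0 --fail /dev/sdd --remove /dev/sdd

# Force-start the degraded, dirty array (-R is short for --run)
mdadm -R /dev/md0
```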
It came online and started resyncing, so I thought all was well and left it alone.
After the resync finished, I tried to access some recently downloaded files and found I could neither write to the disk nor read the files I wanted.
Checking syslog, I saw thousands of lines like this:
Nov 22 07:49:02 Byznotchnyai kernel: attempt to access beyond end of device
Nov 22 07:49:02 Byznotchnyai kernel: md0: rw=0, want=14963797120, limit=2344267776
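Doing the arithmetic on that error (assuming the kernel is counting 512-byte sectors), the read is being requested far beyond the end of the device:

```shell
# limit = device size md reports; want = sector the read asked for
limit=2344267776    # sectors -> ~1117 GiB, consistent with a 4x 400 GB RAID 5
want=14963797120    # sectors -> ~7135 GiB, way past the end of the array
echo "device size: $((limit * 512 / 1024 / 1024 / 1024)) GiB"
echo "requested:   $((want * 512 / 1024 / 1024 / 1024)) GiB"
```

So the filesystem is asking for blocks at offsets several times larger than the whole array, which makes me suspect corrupted fs metadata rather than the disks themselves.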
cat /proc/mdstat seemed fine; it thought the array was sound. Everything I've read online says these must just be errors in the ext3 partition itself and that I should run fsck, but I worry that if the problem is actually md-related, 'fixing' things at the ext3 level will just delete almost all of my files.
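Would a read-only pass like this be safe to run first, just to see the extent of the damage without letting fsck change anything? As I understand it, -n makes e2fsck open the filesystem read-only and answer "no" to every repair prompt, but please correct me if that's wrong:

```shell
# Read-only check of the ext3 filesystem on the array; nothing is written
e2fsck -n /dev/md0
```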
At first the 'bad' files seemed to be ones that were in the process of downloading when the power went out, so I thought, OK, that's fine. But then I noticed that files finished weeks ago weren't working either. So I started copying off the irreplaceably important things (like 5 years of pictures), and even some of those are throwing I/O errors.
It seems a good 1/4 to 1/3 of all my files are 'bad', and I know that if I let fsck do its thing, it'll just delete them all.
The HDDs themselves seem fine: they all report the same info in smartctl and don't throw any errors, so I don't know why just that one would be non-fresh, or why a resync would trash my data.
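For what it's worth, this is roughly how I checked the drives; all four look the same (if my grep pattern is missing an important attribute, let me know):

```shell
# Health summary and key error counters for each array member (run as root)
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo "== $d =="
    smartctl -H -A "$d" | grep -Ei 'overall-health|Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done
```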
I've heard of backup superblocks, but I'm not sure how to find them. Does anyone have suggestions on how to reassemble the md array (which I've seen mentioned a few times, but which also worries me), or on how to confirm whether the problem really is in ext3? Or how to see which HDD in the array is throwing the I/O errors? Maybe it really is just a bad drive somehow.
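In case it helps anyone diagnose this, I believe these are the read-only commands for inspecting the md superblocks (drive order, event counts) and for locating the ext3 backup superblocks, though I'd appreciate confirmation before relying on them:

```shell
# Show each member's md superblock: its role in the array, event count, state
mdadm --examine /dev/sda /dev/sdb /dev/sdc /dev/sdd

# List primary and backup superblock locations for the ext3 filesystem
# (dumpe2fs only reads; it never modifies the filesystem)
dumpe2fs /dev/md0 | grep -i superblock
```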
I'm really at a loss, and I'm very annoyed because I put all of my important info on my RAID 5 thinking it'd be 'safer' than other setups. Thanks so much.
It's purely a data partition, so I can test and try fixes from inside my running system.
It's ext3; I didn't know there were different journaling options.
I mean, my array is reporting all good on the md front; it's just ext3 that's telling me there are errors and that a check should be forced. Normally that would be fine, but it's flagging a significant number of my files as errors.
It makes me wonder if the drives are in the wrong order or the parity is messed up somehow. I did just try failing sdd again, to see whether the array could run off just the 3 remaining HDDs using parity in case the data on sdd was wrong, but the filesystem and I/O issues are exactly the same.
On that note, I just realized it's strange that the array didn't start with just 3 drives in the first place; one non-fresh disk shouldn't be a problem for RAID 5. Also, while copying data off, it seems like anything from October or November is having issues, while older stuff seems mostly fine. I'm at a loss.
I am trying to copy off what I can for now, but I'm not getting much. I guess if I have to rebuild I'll go with RAID 1 and LVM, or maybe RAID 0+1; at least with mirroring I'll know how to sanely get at my data.
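For the copying itself I'm using something like this, so one unreadable file doesn't abort the whole run (the paths here are just examples from my setup):

```shell
# Copy whatever still reads; rsync logs I/O errors and moves on to the next file
rsync -av /mnt/raid/pictures/ /mnt/backup/pictures/ 2> rsync-errors.log
```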