LinuxQuestions.org


pepsimachine15 02-14-2009 11:38 AM

mdadm raid5 problem... oh please save my data
 
i have 6 drives in a raid5 array using mdadm. this setup had been working fine for a very long time. well, something happened within the past couple of days where my array disappeared, literally. it was so long ago that i set this up that i can't even remember what i did or how it was configured, but i do know that the array automagically came up when the machine booted - it no longer does this. i seem to have no mdadm.conf file anywhere on my system that i can find. i could swear at some point in time i had set one up. anyway here's the breakdown:

/dev/hd[abcdgh]1 are the 6 partitions in the array. i tried doing: "mdadm --assemble /dev/md0" and got: "mdadm: /dev/md0 not identified in the config file"

so after that, i tried re-creating the array with: "mdadm --create /dev/md0 --verbose --level=5 --raid-devices=6 /dev/hda1 /dev/hdb1 /dev/hdc1 /dev/hdd1 /dev/hdg1 /dev/hdh1" which listed all those drives as being part of a raid array, and i hit "y" to continue

at that point it **seems** that everything is fine. i do "mount /dev/md0 /mnt/raid5" and "ls /mnt/raid5" shows me all of my directories in the partition. however if i type "ls /mnt/raid5/directory" i receive "/bin/ls: /mnt/raid5/directory: input/output error" - so even though i can see the directories, they are not accessible

i unmounted, then stopped the array with "mdadm -S /dev/md0". i then tried to force the array to rebuild itself by marking a drive bad: "mdadm --manage /dev/md0 --fail /dev/hdg1" and then re-assembling the array with "mdadm --assemble --force /dev/md0" - this did mark the failed drive as clean again and started re-assembling the array. i did "watch cat /proc/mdstat" and followed the progress, it completed 100%.

i remounted and still got input/output errors, so i unmounted and ran fsck. when running fsck it told me something about the superblock on the ext3 file system and asked if i should clear it... i said yes (probably a big mistake). after this, it would come up with "inode xxxxx has compression flag set on filesystem without compression, fix?" and i typed yes (again probably a mistake). there were more fsck errors about inode i_size, and "inode i_blocks is xxxxxx and should be 0, fix?", and again i typed yes. after typing yes about 20 times i decided something was seriously wrong and cancelled fsck.

now i stored all of my "important" data on this raid5 array thinking it would be safer on there than it would be on a single drive in my system. so i would like to recover these files before i go kill myself for losing all of this data
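
Before re-creating an array, it may be worth recording what the existing md superblocks say, because "mdadm --create" writes new ones over them. The following is only a sketch, assuming the six member partitions named above. The run and examine_members helpers and the DRY_RUN switch are hypothetical names for illustration, and by default the script only prints the commands instead of executing them:

```shell
#!/bin/sh
# Member partitions as listed in the post above.
MEMBERS="/dev/hda1 /dev/hdb1 /dev/hdc1 /dev/hdd1 /dev/hdg1 /dev/hdh1"

# Hypothetical helper: print the command, and only execute it when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Dump each member's md superblock (level, chunk size, layout, device role)
# so the original geometry is on record before anything is rewritten.
examine_members() {
    for dev in $MEMBERS; do
        run mdadm --examine "$dev"
    done
    # Assemble by naming the members explicitly; this does not need an
    # mdadm.conf entry for /dev/md0 and does not rewrite the superblocks.
    # $MEMBERS is left unquoted on purpose so it splits into six arguments.
    run mdadm --assemble /dev/md0 $MEMBERS
}

examine_members
```

Saving the --examine output to a file on another disk keeps the chunk size and device order available even if the array is later re-created.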

stress_junkie 02-14-2009 04:33 PM

I haven't created any arrays under Linux but when my computer is starting it executes
Code:

dmraid -ay
You could try that.

Also, it seems that the mdadm utility should be useful.
Quote:

/sbin/mdadm --help
mdadm is used for building, managing, and monitoring
Linux md devices (aka RAID arrays)
Usage: mdadm --create device options...
            Create a new array from unused devices.
       mdadm --assemble device options...
            Assemble a previously created array.
       mdadm --build device options...
            Create or assemble an array without metadata.
       mdadm --manage device options...
            make changes to an existing array.
       mdadm --misc options... devices
            report on or modify various md related devices.
       mdadm --grow options device
            resize/reshape an active array
       mdadm --incremental device
            add a device to an array as appropriate
       mdadm --monitor options...
            Monitor one or more array for significant changes.
       mdadm device options...
            Shorthand for --manage.
Any parameter that does not start with '-' is treated as a device name
or, for --examine-bitmap, a file name.
The first such name is often the name of an md device.  Subsequent
names are often names of component devices.

 For detailed help on the above major modes use --help after the mode
 e.g.
         mdadm --assemble --help
 For general help on options use
         mdadm --help-options
The mdadm utility appears to have lots of functionality in the area of analysing, monitoring, and recreating arrays.

This is a good example of how RAID does not replace backups. Enough said.

I believe that your best course of action is to use the dmraid -ay and see if that assembles your RAID array. If not then use the method that you already described. That will allow you to use mdadm to monitor the RAID array and see what characteristics are causing the problem.

If a disk has failed then you have to remove the disk from the virtual set, shut down the computer and replace the disk, then start the computer, reassemble the partial RAID virtual set, add the new disk into the virtual set, and rebuild the RAID array. I know this because I have worked with RAID a lot at work, but not with Linux.

pepsimachine15 02-14-2009 05:13 PM

dmraid is used in hardware (fakeraid) systems, such as a sil0680 ide raid card. it has nothing to do with mdadm which is complete software raid

mdadm is what i need to be using, but it is giving me no clue of what is wrong. all my drives report good in smart tests in the bios, so it doesn't look like a drive problem. and if a drive did fail, mdadm notifies you of this when you start it up, however it is reporting 5 drives in use with 1 spare, which is all normal.

stress_junkie 02-14-2009 06:22 PM

Your original post has everything that I would have done. I think you were right when you speculated that clearing the superblock with fsck may have created more problems. Happily I think you can get around that by specifying a backup superblock. I just looked this up.

Use mke2fs -n (mkfs -t ext3 -n) to do a fake create file system on the virtual partition. This will tell you where the backup superblocks are located.

Use fsck -b <superblock> to specify the location of the backup superblock.

If that works then we still have the original problem with the RAID array. Did you apply updates lately? Maybe a new module replaced one that was working and the new module doesn't work. /var/log/messages or dmesg should have information about failing modules.
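
The backup superblock locations that mke2fs -n reports can also be worked out by hand. The sketch below assumes the filesystem was made with the sparse_super feature, where backups live in block group 1 and in groups whose numbers are powers of 3, 5 and 7, and that the first data block is 0 (the case for 4096-byte blocks). The backup_superblocks helper is a hypothetical name, and the geometry passed to it (32768 blocks per group, 1491 block groups) is taken from the mke2fs -n output posted later in this thread:

```shell
#!/bin/sh
# Hypothetical helper: list backup superblock locations, one per line.
# $1 = blocks per group, $2 = number of block groups
# Assumes sparse_super and "First data block=0".
backup_superblocks() {
    bpg=$1
    groups=$2
    {
        # Group 1 always carries a backup if it exists.
        [ "$groups" -gt 1 ] && echo 1
        # Then every group whose number is a power of 3, 5 or 7.
        for base in 3 5 7; do
            g=$base
            while [ "$g" -lt "$groups" ]; do
                echo "$g"
                g=$(( g * base ))
            done
        done
    } | sort -n | while read -r g; do
        # A group's backup superblock sits in the first block of that group.
        echo $(( g * bpg ))
    done
}

backup_superblocks 32768 1491
```

With those two numbers the script prints 14 locations, from 32768 up to 23887872. Any of them can be passed to fsck -b.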

pepsimachine15 02-14-2009 07:46 PM

Code:

root:# mke2fs -n /dev/md0
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
24428544 inodes, 48827440 blocks
2441372 blocks (5.00%) reserved for the super user
First data block=0
1491 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872


root:# fsck -b 819200 /dev/md0
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/sbin/e2fsck: Bad magic number in super-block while trying to open /dev/md0

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

fileserver:/~
root:# fsck -b 32768 /dev/md0
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
/sbin/e2fsck: Bad magic number in super-block while trying to open /dev/md0

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>


no new updates applied. there was a power outage 2 nights ago, and all my machines were off when i woke up. i don't think i had even looked for any data on the array since then, so it could have happened then. 1 night ago i edited my xorg.conf to change the display driver to a generic vesa driver, using the xorgconfig program.

what i don't understand is that there was no evidence at all of any mdadm config. i don't see how this would have been up and running for over a year and then all of a sudden my config file disappears and nothing works. i DID have #mdadm --assemble /dev/md0 listed in my rc.local file, but it was commented out, like maybe there was some other file loading mdadm. as i said, no clue, i set this up over a year ago and it just quit working all of a sudden.

backups are nice - i had one. keyword, had. never ever buy dynex brand dvds. i tried accessing the data on them - they are all corrupted coasters.

pepsimachine15 02-14-2009 08:16 PM

jogging my memory, i think it is possible that when i originally created this array, i may have specified the --chunk option to change the chunk size to something other than the default. maybe this is why when i recreated the array (using the default chunk size) i get input/output errors? this is only a possibility - i really don't remember if i used this option, nor do i remember what i would have set it to.

the system is running through grep -r mdadm /* right now to see if i can find any information on mdadm or where the old config went to... gonna take a while though...
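
A small calculation may help show why a wrong chunk size would produce exactly this kind of damage. The sketch assumes the md left-symmetric raid5 layout (the mdadm default), data starting at offset 0 on every member, and the members listed in their original order. The locate_chunk helper is a hypothetical name. It maps a byte offset in /dev/md0 to a member index (0 = first device) and a byte offset on that member:

```shell
#!/bin/sh
# Hypothetical helper: where does a logical byte offset land in a raid5 set?
# $1 = logical byte offset, $2 = chunk size in KiB, $3 = number of devices
# Assumes left-symmetric layout and a data offset of 0 on each member.
locate_chunk() {
    offset=$1
    chunk_bytes=$(( $2 * 1024 ))
    ndev=$3
    data_disks=$(( ndev - 1 ))
    chunk_no=$(( offset / chunk_bytes ))      # which data chunk overall
    stripe=$(( chunk_no / data_disks ))       # which stripe (row) it is in
    idx=$(( chunk_no % data_disks ))          # position within that stripe
    parity=$(( data_disks - stripe % ndev ))  # parity rotates backwards
    dev=$(( (parity + 1 + idx) % ndev ))      # data starts right after parity
    dev_off=$(( stripe * chunk_bytes + offset % chunk_bytes ))
    echo "$dev $dev_off"
}

# Compare the same logical offset (64 KiB) under two chunk sizes.
for chunk in 64 128; do
    echo "chunk=${chunk}K: $(locate_chunk 65536 "$chunk" 6)"
done
```

Under these assumptions the first 64 KiB of the array maps to the same place with either chunk size, which is consistent with the filesystem mounting at all. From 64 KiB onward the two geometries disagree: a 64K chunk reads from member 1 at offset 0, while a 128K chunk reads from member 0 at offset 65536. Most of what the filesystem reads then comes from the wrong place.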

pepsimachine15 02-14-2009 08:32 PM

yes - i am an idiot. i stopped /dev/md0 and re-created using a different chunk size... --chunk=128 <-i figured i would have done increments of 64 so i just started trying them. 128 was it - now when i mount the array i can browse into my directories, and access files.
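
The trial-and-error above can be made a little less risky. This is a sketch under assumptions, not a tested recovery procedure. It assumes the same six members in their original order, uses --assume-clean so that mdadm does not start a resync onto one of the members, and mounts read-only with the ext3 noload option so the journal is not replayed. The run and try_chunk helpers and the DRY_RUN switch are hypothetical names, and by default the commands are only printed:

```shell
#!/bin/sh
MEMBERS="/dev/hda1 /dev/hdb1 /dev/hdc1 /dev/hdd1 /dev/hdg1 /dev/hdh1"

# Hypothetical helper: print the command, and only execute it when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Re-create the array metadata with one candidate chunk size (in KiB) and
# mount the result read-only for inspection.
try_chunk() {
    chunk=$1
    run mdadm --stop /dev/md0
    # --assume-clean: write new superblocks but do not rebuild any member.
    # $MEMBERS is left unquoted on purpose so it splits into six arguments.
    run mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=6 \
        --chunk="$chunk" $MEMBERS
    # ro,noload: no writes and no journal replay while testing a guess.
    run mount -t ext3 -o ro,noload /dev/md0 /mnt/raid5
}

# Show what would be run for the chunk size that turned out to be right.
try_chunk 128
```

When run for real, mdadm --create will still ask for confirmation because the members already carry superblocks. Listing a subdirectory and opening a few files after each mount, then unmounting before the next guess, is what tells a right guess from a wrong one.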

now - i still have the problem where i used fsck and cleared the superblock before.... now that my raid array is set up correctly, how should i go about fixing the superblock and any other inode sizes and blocks i may have messed up? since i can currently access data as it is, i am copying data off right now onto another drive before i start messing with fsck, but when i do, should i type yes to the following??? --

Code:

root:# fsck -b 11239424 /dev/md0
fsck 1.38 (30-Jun-2005)
e2fsck 1.38 (30-Jun-2005)
Backing up journal inode block information.

/dev/md0 was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Journal is not regular file.  Fix<y>?

and should i choose yes for the rest of them, since i will have restored the superblock from a backup and the journal is supposedly fixed? or should i post back on any other errors during fsck and ask if they should be fixed or not?
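
One way to see what fsck intends to do before letting it change anything is a read-only pass. The sketch assumes e2fsck's -n option (open read-only and answer "no" to every question) together with the backup superblock and the 4096-byte block size shown above. The run and preview_fsck helpers and the DRY_RUN switch are hypothetical names, and by default the command is only printed:

```shell
#!/bin/sh
# Hypothetical helper: print the command, and only execute it when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Read-only check against a backup superblock.
# $1 = device, $2 = backup superblock location, $3 = filesystem block size
preview_fsck() {
    # -n: make no changes; -b: alternate superblock; -B: its block size
    run e2fsck -n -b "$2" -B "$3" "$1"
}

preview_fsck /dev/md0 11239424 4096
```

The filesystem should be unmounted for this. The list of problems it reports gives a rough idea of how much a real repair would rewrite, which can be weighed against simply re-creating the filesystem once the data has been copied off.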

stress_junkie 02-14-2009 10:12 PM

If you can get all of the data off then you might do as well to create a new file system.

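If that would take too long to be appealing then I think that passing the -y parameter to fsck, which answers yes to all inquiries from fsck, would be fairly safe. (The -a parameter is not quite the same: it only makes the repairs that e2fsck considers safe to do without asking.)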

I'm delighted that you sorted out the problem.


All times are GMT -5.