raid5 issues - super non-persistent

spargonaut · 11-29-2008, 03:44 PM

howdy all!

I'm having serious issues and will appreciate any and all help.
( sorry for the long post, i'm trying to be as informative as possible )

I have a file server running debian with the 2.6.18-6-486 kernel.
Its running samba for my windows boxes to connect to, and nfs for my linux desktop to connect to.

its got 8 drives in it that make up several raid and lvm partitions, which are laid out as thus:
( output from: cat /proc/mdstat plus some notes from me )

md2 : inactive sdc1[2] sdd1[3] sdb1[1] sda1[0]
1953535744 blocks super non-persistent

md1 : active raid5 hda1[0] hdd1[3] hdc1[2] hdb1[1]
351558144 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

md0 : active raid1 hda2[0] hdc2[1]
78172160 blocks [2/2] [UU]

md0 is made up from two 80GB partitions on two 200GB IDE HDD's
md1 is made up from the remaining 120GB partitions as well as the entire disk of two other 120GB IDE HDDs
( all of the above are using onboard IDE controllers on the motherboard. )

md2 ( the problem child ) is made up from four 500GB sata HDD's
( these drives are connected via two pci sata controller cards: one rosewill and one SIIG ( originally ) )
( they are now connected through two SIIG cards, i was hoping that replacing the rosewill card would save me, but no such luck, dangit. )

on top of these raid arrays i have ( had ) LVM2 running.
luckily, i put the root and swap partition LVs in their own VG on md0, so at least the machine can boot.

also, i recently installed an inexpensive 10/100/1000 NIC card. I'll get to the relevance of this in a bit.

now onto the problems:
here i was, enjoying a nice, quiet friday night at home writing programs, playing with some ICs and general tom foolery
when i decided to move some rather large files from my windows box to my file server ( hostname: burro ).
in the middle of transferring one of the files, windows informed me that 'the resource is unavailable.'

wondering WTH? i ran to my closet ( read: server room ) to take a closer look. I popped on the monitor only to be greeted by a whole bunch of this:
( taken from /var/log/messages )
*******************************************************************************
Nov 28 19:18:47 burro kernel: Tx Queue <0>
Nov 28 19:18:47 burro kernel: TDH <3a>
Nov 28 19:18:47 burro kernel: TDT <3b>
Nov 28 19:18:47 burro kernel: next_to_use <3b>
Nov 28 19:18:47 burro kernel: next_to_clean <3a>
Nov 28 19:18:47 burro kernel: buffer_info[next_to_clean]
Nov 28 19:18:47 burro kernel: time_stamp <300e5202>
Nov 28 19:18:47 burro kernel: next_to_watch <3a>
Nov 28 19:18:47 burro kernel: jiffies <300e5399>
Nov 28 19:18:47 burro kernel: next_to_watch.status <0>
Nov 28 19:18:49 burro kernel: Tx Queue <0>
Nov 28 19:18:49 burro kernel: TDH <3a>
Nov 28 19:18:49 burro kernel: TDT <3b>
Nov 28 19:18:49 burro kernel: next_to_use <3b>
Nov 28 19:18:49 burro kernel: next_to_clean <3a>
Nov 28 19:18:49 burro kernel: buffer_info[next_to_clean]
Nov 28 19:18:49 burro kernel: time_stamp <300e5202>
Nov 28 19:18:49 burro kernel: next_to_watch <3a>
Nov 28 19:18:49 burro kernel: jiffies <300e558d>
Nov 28 19:18:49 burro kernel: next_to_watch.status <0>
Nov 28 19:18:51 burro kernel: NETDEV WATCHDOG: eth1: transmit timed out
Nov 28 19:18:54 burro kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov 28 19:19:16 burro kernel: ata2: soft resetting port
Nov 28 19:19:16 burro kernel: ata1: soft resetting port
Nov 28 19:19:16 burro kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 28 19:19:16 burro kernel: ata1.00: configured for UDMA/100
Nov 28 19:19:16 burro kernel: ata1: EH complete
Nov 28 19:19:16 burro kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 28 19:19:16 burro kernel: ata2.00: configured for UDMA/100
Nov 28 19:19:16 burro kernel: ata2: EH complete
Nov 28 19:19:46 burro kernel: ata1: soft resetting port
Nov 28 19:19:46 burro kernel: ata2: soft resetting port
Nov 28 19:19:53 burro kernel: ata1: port is slow to respond, please be patient
Nov 28 19:19:53 burro kernel: ata2: port is slow to respond, please be patient
Nov 28 19:19:58 burro kernel: Tx Queue <0>
Nov 28 19:19:58 burro kernel: TDH <0>
Nov 28 19:19:58 burro kernel: TDT <4>
Nov 28 19:19:58 burro kernel: next_to_use <4>
Nov 28 19:19:58 burro kernel: next_to_clean <0>
Nov 28 19:19:58 burro kernel: buffer_info[next_to_clean]
Nov 28 19:19:58 burro kernel: time_stamp <300e973c>
Nov 28 19:19:58 burro kernel: next_to_watch <1>
Nov 28 19:19:58 burro kernel: jiffies <300e98f2>
Nov 28 19:19:58 burro kernel: next_to_watch.status <0>
Nov 28 19:20:00 burro kernel: Tx Queue <0>
Nov 28 19:20:00 burro kernel: TDH <0>
Nov 28 19:20:00 burro kernel: TDT <4>
Nov 28 19:20:00 burro kernel: next_to_use <4>
Nov 28 19:20:00 burro kernel: next_to_clean <0>
Nov 28 19:20:00 burro kernel: buffer_info[next_to_clean]
Nov 28 19:20:00 burro kernel: time_stamp <300e973c>
Nov 28 19:20:00 burro kernel: next_to_watch <1>
Nov 28 19:20:00 burro kernel: jiffies <300e9ae6>
Nov 28 19:20:00 burro kernel: next_to_watch.status <0>
Nov 28 19:20:02 burro kernel: Tx Queue <0>
Nov 28 19:20:02 burro kernel: TDH <0>
Nov 28 19:20:02 burro kernel: TDT <4>
Nov 28 19:20:02 burro kernel: next_to_use <4>
Nov 28 19:20:02 burro kernel: next_to_clean <0>
Nov 28 19:20:02 burro kernel: buffer_info[next_to_clean]
Nov 28 19:20:02 burro kernel: time_stamp <300e973c>
Nov 28 19:20:02 burro kernel: next_to_watch <1>
Nov 28 19:20:02 burro kernel: jiffies <300e9cda>
Nov 28 19:20:02 burro kernel: next_to_watch.status <0>
Nov 28 19:20:04 burro kernel: NETDEV WATCHDOG: eth1: transmit timed out
Nov 28 19:20:07 burro kernel: e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Nov 28 19:20:16 burro kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 28 19:20:16 burro kernel: ATA: abnormal status 0xD8 on port 0xE0814087
Nov 28 19:20:16 burro last message repeated 4 times

*******************************************************************************

now, all of the chatter above told me that my inexpensive 10/100/1000 NIC card was freaking out.
I decided that it needed to come out, so i went to my client machines that had mounts from burro to unmount everything
kitten ( the linux desktop ) was unable to unmount, and the windows machine had already said "no" to being able to connect to burro.
( i ignored the wifes macbook, no need to worry her at this point ).
at this point, i halted the machine ( which i am beginning to regret ), removed the NIC card, and rebooted.

when it came back up, it was unable to bring up md2.
( i have since rebooted several times, or i would have included the original dmesg )

it stopped at the screen telling me that i needed to either enter the root password for maintenance or hit ctrl-D to continue.

I logged in with the root password, and did some research on mdadm to try to bring the array back up.

the command i used was:
mdadm --run /dev/md2

with these results:
*******************************************************************************
md: kicking non-fresh sdb1 from array!
md: unbind<sdb1>
md: export_rdev(sdb1)
md: kicking non-fresh sda1 from array!
md: unbind<sda1>
md: export_rdev(sda1)
raid5: device sdc1 operational as raid disk 2
raid5: device sdd1 operational as raid disk 3
raid5: not enough operational devices for md2 (2/4 failed)
RAID5 conf printout:
--- rd:4 wd:2 fd:2
disk 2, o:1 dev:sdc1
disk 3, o:1 dev:sdd1
raid5: failed to run raid set md2
md: pers->run() failed ...
mdadm: failed to run array /dev/md2: Input/output
*******************************************************************************

for some reason, i can't remember what, but i think its a problem with the superblocks.

In anycase, while i'm not exactly a linux n00b, i am somewhat of a n00b when it comes to raid recovery.
( as in, i've set up a raid array a handful of times, and replaces _a_ failed drive, but i've never lost the whole array )
that being said, I am grateful for any and all input.

please let me know if you would like / need any more information / log files.
i refrained from posting dmesg and/or all of /var/log/messages because this posts already long length.

thank you so very much.
spargonaut

spargonaut · 11-29-2008, 06:37 PM

after doing some messing around....
( and finding this thread: http://www.linuxforums.org/forum/ser...id6-array.html )

i was able to get the array readable by running the following command:
mdadm --assemble --force /dev/md2 /dev/sd[a-d]1

which, for some reason, left out /dev/sda1
( I'm still not sure why b/c when i ran 'mdadm --examine /dev/sd[a-d]1' it said that it was /dev/sdb1 that was faulty )

I just thought i would update this with some new information.
I'm currently doing a backup of the data while i still have it.

for the rest of the evening, i'll be attempting to get the fourth drive back into the array.
( only to be followed by writing a backup script that runs nightly.
(( eGADS! i never want that to happen again! that scared the sh** out of me! ) )

I'll post back here if i make progress ( and can remember ).

I hope someone finds this information helpful.