[SOLVED] Hard drive crash in VMware?

GeneralDark · 06-21-2009, 02:49 PM

Hello.
Feel free to move this topic to a more appropriate section. I don't really know where it belongs.

First of all I'm gonna describe "the whole picture" (sorry for my English).
I have an ESXi host which has 4 hard drives: 1 for the ESXi OS and 3 for storage.
The 3 storage drives have 1 .vmdk file each (all equally large) on them, not a single file more.
The fileserver is running from drive 1 (the same drive ESXi is installed on). That means the Debian system is running on that drive (root drive).
The Debian system is then configured to use the 3 storage drives as a RAID 5 (software RAID created in the Debian installer), and that partition is encrypted with LUKS (the root partition is encrypted as well, if that is of any concern).

The encrypted storage is ext3 and has several folders exported using SMB.

Earlier today I ran a successful check on both drives (the OS drive and the RAID drive); I was forced to since I rebooted, and all was fine.

About an hour ago I was browsing the storage as usual when I noticed that a text file I was working in made the application (MS Word) hang. I reconnected the drive and everything was fine, except that the directory I had been working in was empty. I checked several other directories but all of them were intact.

Well, I rebooted the system, thinking a fresh start might help.
Rebooting the system goes fine, I enter the password for the root drive and everything is fine until it's about to mount the RAID drive.
I get the following error message:
The superblock could not be read or does not describe a correct ext2 filesystem.

Checking the /var/log/fsck/checkfs log states:
fsck.ext3: No such file or directory while trying to open /dev/mapper/raid-crypt

ls /dev/mapper does indeed not show this device.
fdisk -l shows that all the drives (root and storage) are OK, but the /dev/dm-* devices don't contain a valid partition table.

I would like to point out that the LUKS passwords are intact and won't be an issue if the problem can be fixed.

I have searched the forums, but I'm not sure if those other topics would help me since this is a bit more complex.
Could anyone explain what has happened, and why? And what I should do to save it would be nice as well.

Help is really appreciated since I have everything on those drives.

Did I miss some important info? Just ask.

unSpawn · 06-23-2009, 05:26 AM

Quote:

Originally Posted by GeneralDark

Well, I rebooted the system, thinking a fresh start might help. Rebooting the system goes fine, I enter the password for the root drive and everything is fine until it's about to mount the RAID drive. I get the following error message: The superblock could not be read or does not describe a correct ext2 filesystem. Checking the /var/log/fsck/checkfs log states: fsck.ext3: No such file or directory while trying to open /dev/mapper/raid-crypt. ls /dev/mapper does indeed not show this device. fdisk -l shows that all the drives (root and storage) are OK, but the /dev/dm-* devices don't contain a valid partition table. I would like to point out that the LUKS passwords are intact and won't be an issue if the problem can be fixed.

If you boot your VM guest into any runlevel that allows you manual control over what gets mounted how, are there any system messages that could indicate problems at the "hardware" level? And if you read back the logs? Can you query, scan, examine all RAID components verbosely with mdadm?
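
A minimal sketch of what that read-only information gathering could look like, written in Python so the output can be logged and attached. The array name /dev/md0 and the member names /dev/sdb1, /dev/sdc1 and /dev/sdd1 are assumptions taken from later posts in this thread, and the function names are hypothetical. Only the query modes `mdadm --detail` and `mdadm --examine` are used, which should not write to the devices.

```python
import subprocess

def recon_commands(array, members):
    """Build read-only mdadm query commands for one array and its members."""
    cmds = [["mdadm", "--detail", array]]                  # array-level view
    cmds += [["mdadm", "--examine", m] for m in members]   # per-member superblocks
    cmds.append(["mdadm", "--examine", "--scan", "--verbose"])
    return cmds

def run_and_log(cmds, log_path):
    """Run each command and append its output to a log file for posting."""
    with open(log_path, "w") as log:
        for cmd in cmds:
            log.write("$ " + " ".join(cmd) + "\n")
            result = subprocess.run(cmd, capture_output=True, text=True)
            log.write(result.stdout + result.stderr + "\n")

# Assumed device names; adjust to what the guest actually shows.
commands = recon_commands("/dev/md0", ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"])
for cmd in commands:
    print(" ".join(cmd))
```

Calling `run_and_log(commands, "mdadm-recon.log")` as root inside the guest would then collect everything in one file.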

GeneralDark · 06-29-2009, 12:46 PM

Thx for the advice. I booted up with a Gentoo live CD and saw the following in dmesg:

Code:

scsi2 : ioc0: LSI53C1030 B0, FwRev=00000000h, Ports=1, MaxQ=128, IRQ=16
scsi 2:0:0:0: Direct-Access     VMware   Virtual disk     1.0  PQ: 0 ANSI: 2
 target2:0:0: Beginning Domain Validation
 target2:0:0: Domain Validation skipping write tests
 target2:0:0: Ending Domain Validation
 target2:0:0: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
scsi 2:0:1:0: Direct-Access     VMware   Virtual disk     1.0  PQ: 0 ANSI: 2
 target2:0:1: Beginning Domain Validation
 target2:0:1: Domain Validation skipping write tests
 target2:0:1: Ending Domain Validation
 target2:0:1: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
scsi 2:0:2:0: Direct-Access     VMware   Virtual disk     1.0  PQ: 0 ANSI: 2
 target2:0:2: Beginning Domain Validation
 target2:0:2: Domain Validation skipping write tests
 target2:0:2: Ending Domain Validation
 target2:0:2: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
scsi 2:0:3:0: Direct-Access     VMware   Virtual disk     1.0  PQ: 0 ANSI: 2
 target2:0:3: Beginning Domain Validation
 target2:0:3: Domain Validation skipping write tests
 target2:0:3: Ending Domain Validation
 target2:0:3: FAST-40 WIDE SCSI 80.0 MB/s ST (25 ns, offset 127)
sd 2:0:0:0: [sda] 16777216 512-byte hardware sectors (8590 MB)
sd 2:0:0:0: [sda] Test WP failed, assume Write Enabled
sd 2:0:0:0: [sda] Cache data unavailable
sd 2:0:0:0: [sda] Assuming drive cache: write through
sd 2:0:0:0: [sda] 16777216 512-byte hardware sectors (8590 MB)
sd 2:0:0:0: [sda] Test WP failed, assume Write Enabled
sd 2:0:0:0: [sda] Cache data unavailable
sd 2:0:0:0: [sda] Assuming drive cache: write through
 sda: sda1 sda2 < sda5 >
sd 2:0:0:0: [sda] Attached SCSI disk
sd 2:0:0:0: Attached scsi generic sg0 type 0
sd 2:0:1:0: [sdb] 1951756452 512-byte hardware sectors (999299 MB)
sd 2:0:1:0: [sdb] Test WP failed, assume Write Enabled
sd 2:0:1:0: [sdb] Cache data unavailable
sd 2:0:1:0: [sdb] Assuming drive cache: write through
sd 2:0:1:0: [sdb] 1951756452 512-byte hardware sectors (999299 MB)
sd 2:0:1:0: [sdb] Test WP failed, assume Write Enabled
sd 2:0:1:0: [sdb] Cache data unavailable
sd 2:0:1:0: [sdb] Assuming drive cache: write through
 sdb: sdb1
sd 2:0:1:0: [sdb] Attached SCSI disk
sd 2:0:1:0: Attached scsi generic sg1 type 0
sd 2:0:2:0: [sdc] 1951756452 512-byte hardware sectors (999299 MB)
sd 2:0:2:0: [sdc] Test WP failed, assume Write Enabled
sd 2:0:2:0: [sdc] Cache data unavailable
sd 2:0:2:0: [sdc] Assuming drive cache: write through
sd 2:0:2:0: [sdc] 1951756452 512-byte hardware sectors (999299 MB)
sd 2:0:2:0: [sdc] Test WP failed, assume Write Enabled
sd 2:0:2:0: [sdc] Cache data unavailable
sd 2:0:2:0: [sdc] Assuming drive cache: write through
 sdc: sdc1
sd 2:0:2:0: [sdc] Attached SCSI disk
sd 2:0:2:0: Attached scsi generic sg2 type 0
sd 2:0:3:0: [sdd] 1951756452 512-byte hardware sectors (999299 MB)
sd 2:0:3:0: [sdd] Test WP failed, assume Write Enabled
sd 2:0:3:0: [sdd] Cache data unavailable
sd 2:0:3:0: [sdd] Assuming drive cache: write through
sd 2:0:3:0: [sdd] 1951756452 512-byte hardware sectors (999299 MB)
sd 2:0:3:0: [sdd] Test WP failed, assume Write Enabled
sd 2:0:3:0: [sdd] Cache data unavailable
sd 2:0:3:0: [sdd] Assuming drive cache: write through
 sdd: sdd1
sd 2:0:3:0: [sdd] Attached SCSI disk
sd 2:0:3:0: Attached scsi generic sg3 type 0

I guess it looks normal?

So, I ran an ext3 check:

Code:

livecd ~ # fsck.ext3 /dev/sdb1
e2fsck 1.40.8 (13-Mar-2008)
fsck.ext3: Group descriptors look bad... trying backup blocks...
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdb1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

livecd ~ # fsck.ext3 /dev/sdc1
e2fsck 1.40.8 (13-Mar-2008)
fsck.ext3: Superblock invalid, trying backup blocks...
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdc1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>

livecd ~ # fsck.ext3 /dev/sdd1
e2fsck 1.40.8 (13-Mar-2008)
fsck.ext3: Superblock invalid, trying backup blocks...
fsck.ext3: Bad magic number in super-block while trying to open /dev/sdd1

The superblock could not be read or does not describe a correct ext2
filesystem.  If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
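
Going by the setup described in the first post, the ext3 filesystem lives inside LUKS on top of the md array, so sdb1, sdc1 and sdd1 would be expected to carry md metadata rather than an ext superblock, and a "bad magic number" from fsck on them is not by itself a sign of damage. The sketch below illustrates the idea on a synthetic in-memory image. It assumes a version-0.90 md superblock (stored in the last 64 KiB-aligned 64 KiB block of the device) on a little-endian host; the thread does not state the metadata version, and the function names are hypothetical.

```python
import struct

EXT_MAGIC_POS = 1024 + 56       # s_magic field of the primary ext2/3 superblock
EXT_MAGIC = 0xEF53
LUKS_MAGIC = b"LUKS\xba\xbe"    # LUKS1 header magic at byte 0
MD_MAGIC = 0xA92B4EFC           # Linux md superblock magic
MD_RESERVED = 64 * 1024         # 0.90 superblocks sit in a 64 KiB reserved block

def md090_sb_offset(device_bytes):
    """Byte offset where a version-0.90 md superblock would start."""
    return (device_bytes // MD_RESERVED) * MD_RESERVED - MD_RESERVED

def identify(image):
    """Return the signatures found in `image`, the raw bytes of a device."""
    found = []
    if image[:6] == LUKS_MAGIC:
        found.append("luks")
    if len(image) >= EXT_MAGIC_POS + 2:
        if struct.unpack_from("<H", image, EXT_MAGIC_POS)[0] == EXT_MAGIC:
            found.append("ext")
    md_pos = md090_sb_offset(len(image))
    if md_pos >= 0 and struct.unpack_from("<I", image, md_pos)[0] == MD_MAGIC:
        found.append("md-0.90")
    return found

# Synthetic 256 KiB "RAID member": only an md magic, no ext superblock.
member = bytearray(256 * 1024)
struct.pack_into("<I", member, md090_sb_offset(len(member)), MD_MAGIC)
print(identify(bytes(member)))
```

On a real device one would seek to these offsets instead of loading the whole disk into memory.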

mdadm --detail /dev/md0 shows the following when I run it in the normal Debian system (sorry for the picture):
http://img33.imageshack.us/img33/8157/mdadm.jpg

/proc/mdstat tells me:

Code:

Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc1[1]
      975876352 blocks
unused devices: <none>
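
For reference, a small sketch of how this /proc/mdstat output can be read programmatically. It is a simplified parser that only looks at the array lines and is not a complete grammar for the file; the function name is hypothetical. The sample text is copied from the output above.

```python
import re

ARRAY_LINE = re.compile(r"^(md\d+) : (\S+)(.*)$")
MEMBER = re.compile(r"(\w+)\[(\d+)\]")   # e.g. "sdc1[1]" -> device sdc1, slot 1

def parse_mdstat(text):
    """Map each md array to its state and the member devices the kernel sees."""
    arrays = {}
    for line in text.splitlines():
        match = ARRAY_LINE.match(line)
        if match:
            name, state, rest = match.groups()
            members = {dev: int(slot) for dev, slot in MEMBER.findall(rest)}
            arrays[name] = {"state": state, "members": members}
    return arrays

SAMPLE = """Personalities : [raid6] [raid5] [raid4]
md0 : inactive sdc1[1]
      975876352 blocks
unused devices: <none>
"""

print(parse_mdstat(SAMPLE))
# {'md0': {'state': 'inactive', 'members': {'sdc1': 1}}}
```

The result shows md0 inactive with only sdc1 (slot 1) attached, which is too few members for a three-disk RAID 5 to start.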

Also, I ran badblocks on all drives without errors.

Does this mean the first drive (sdb) is faulty? Is sdc the only working drive? Why don't the other two drives show up in mdadm? Did I miss something?

Grateful for any answers.

unSpawn · 06-29-2009, 04:35 PM

Sorry, I was away for a bit. Your mdadm.jpg says your RAID is active, degraded and not started. Mdadm doesn't seem to be able to access 2 out of the 3 devices that make up your array. RAID5 can survive losing 1 out of 3 but losing 2 out of 3 is fatal (AFAIK, I'm no expert).

If there's anything to test I'd start the host OS only and start testing at the lowest hardware level. Where applicable use "dry run" mode to do recon and write logs or use "tee" to gather information to post/attach. If that doesn't show failures you might want to boot the VM guest OS in controlled mode (single, runlevel 1 or whatever equivalent disaster mode) and try mdadm with as much detail as possible in the scan, detail and examine modes to gather information to post/attach. If *that* doesn't show errors (and I'll guess it does, but anyway) you could try activating your degraded array with 'mdadm --assemble -f /dev/md0 missing /dev/sdc1 missing' but I doubt that'll work.

OTOH, if you have even the faintest hint, gut feeling or whatever else kind of omen, common sense tells you to make backups. Sure that hurts, and I don't even know if it will help in any way, but if the data is of value you would agree you'd better be safe than sorry, right?

GeneralDark · 06-30-2009, 10:46 AM

Thanks for the advice. I managed to force-build the RAID array with 2 drives and then recalculate the third drive without a problem. Everything is back to normal, even the superblock is correct again.
Thanks a lot
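
The thread does not record the exact commands used, so the following is only an illustration of why a third RAID 5 member can be recalculated from the other two. It is a toy sketch of XOR parity on one stripe with made-up data, not what mdadm runs internally, and the function name is hypothetical.

```python
def xor_blocks(*blocks):
    """XOR equally sized byte blocks together, as RAID 5 parity does per stripe."""
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(size)
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)

# One stripe across a three-member array: two data chunks plus parity.
data0 = b"hello wo"
data1 = b"rld!1234"
parity = xor_blocks(data0, data1)

# Pretend the member holding data1 is gone: rebuild it from the survivors.
rebuilt = xor_blocks(data0, parity)
print(rebuilt == data1)
# True
```

The same identity works for whichever single member is missing, which is why two surviving members out of three are enough but one is not.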

unSpawn · 06-30-2009, 11:34 AM

Well done!