Hi Guys,
This isn't so much a question about how to resolve an issue as an attempt to understand LVM better.
I understand what LVM is, does, and generally how it works and how to use it.
I had a server with a Volume Group (LVG_STORE) containing three physical volumes and a single Logical Volume (LV_OPT). Recently, one of these three disks failed, which led to a number of issues.
Some more background information:
- Server had a total of 4 disks
- 1 disk -- sda -- two partitions -- one for root and one for swap
- 3 disks -- sdb, sdc, and sdd (143 GB, 3 Gb/s SAS disks) in the Logical Volume
- sdb failed
- The RAID disk controller reported all disks OK
- Uncommenting the LV_OPT entry in /etc/fstab allowed me to mount it manually and read certain parts, but a specific directory (probably sitting on the bad block) would always cause the disk to throw an error, after which the volume could no longer be mounted
After I uncommented it in fstab, I was able to boot, do some manual work on it, and copy some data, and then I got this error:
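The layout described in the list above can be modelled with a minimal sketch. It assumes LV_OPT is a plain linear (concatenated) volume that fills sdb, then sdc, then sdd in that order, with no striping or mirroring, and that each disk contributes roughly 143 GB. The device order, the sizes, and the names `PV_LAYOUT`, `locate`, and `readable` are illustrative assumptions, not values taken from the server.

```python
# Hypothetical model of a linear LV spanning three PVs.
# ASSUMPTIONS: order sdb -> sdc -> sdd, ~143 GB usable each, no redundancy.
GB = 10**9
PV_LAYOUT = [("sdb", 143 * GB), ("sdc", 143 * GB), ("sdd", 143 * GB)]


def locate(lv_offset):
    """Return (pv_name, offset_within_pv) for a byte offset into the LV."""
    if lv_offset < 0:
        raise ValueError("offset must be non-negative")
    start = 0
    for name, size in PV_LAYOUT:
        if lv_offset < start + size:
            return name, lv_offset - start
        start += size
    raise ValueError("offset is beyond the end of the LV")


def readable(lv_offset, failed_pvs):
    """True if the byte at lv_offset lives on a PV that has not failed."""
    pv, _ = locate(lv_offset)
    return pv not in failed_pvs


if __name__ == "__main__":
    # Walk a few sample offsets and show which disk backs each one.
    for offset in (0, 100 * GB, 200 * GB, 400 * GB):
        pv, inner = locate(offset)
        state = "ok" if readable(offset, {"sdb"}) else "I/O error"
        print(f"LV offset {offset:>13} -> {pv} @ {inner:>13}  [{state}]")
```

In this model every logical offset lives on exactly one disk, so anything stored in the first 143 GB of the LV, including whatever the filesystem keeps at the very start, depends entirely on sdb.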
Code:
Feb 28 10:43:18 demo1 kernel: Aborting journal on device dm-0.
Feb 28 10:43:18 demo1 kernel: ------------[ cut here ]------------
Feb 28 10:43:18 demo1 kernel: WARNING: at fs/buffer.c:1197 mark_buffer_dirty+0x23/0x72()
Feb 28 10:43:18 demo1 kernel: Modules linked in: joydev st ide_disk ide_cd_mod vboxnetadp(N) vboxnetflt(N) vboxdrv(N) ipv6 bonding binfmt_misc fuse loop dm_mod bnx2 rtc_cmos i2c_piix4 rtc_core serio_raw pcspkr rtc_lib shpchp button i2c_core sr_mod ses pci_hotplug sg cdrom enclosure usbhid hid ff_memless ohci_hcd sd_mod crc_t10dif ehci_hcd usbcore edd ext3 mbcache jbd fan ide_pci_generic serverworks ide_core ata_generic sata_svw pata_serverworks libata dock thermal processor thermal_sys hwmon aacraid scsi_mod
Feb 28 10:43:18 demo1 kernel: Supported: No
Feb 28 10:43:18 demo1 kernel: Pid: 5014, comm: umount Tainted: G 2.6.27.19-5-default #1
Feb 28 10:43:18 demo1 kernel:
Feb 28 10:43:18 demo1 kernel: Call Trace:
Feb 28 10:43:18 demo1 kernel: [<ffffffff8020da29>] show_trace_log_lvl+0x41/0x58
Feb 28 10:43:18 demo1 kernel: [<ffffffff8049a3da>] dump_stack+0x69/0x6f
Feb 28 10:43:18 demo1 kernel: [<ffffffff8023d562>] warn_on_slowpath+0x51/0x77
Feb 28 10:43:18 demo1 kernel: [<ffffffff802d30c4>] mark_buffer_dirty+0x23/0x72
Feb 28 10:43:18 demo1 kernel: [<ffffffffa00e1b55>] ext3_put_super+0x54/0x1ce [ext3]
Feb 28 10:43:18 demo1 kernel: [<ffffffff802b33bf>] generic_shutdown_super+0x60/0xee
Feb 28 10:43:18 demo1 kernel: [<ffffffff802b345a>] kill_block_super+0xd/0x1e
Feb 28 10:43:18 demo1 kernel: [<ffffffff802b3518>] deactivate_super+0x60/0x79
Feb 28 10:43:18 demo1 kernel: [<ffffffff802c7ebb>] sys_umount+0x87/0x91
Feb 28 10:43:18 demo1 kernel: [<ffffffff8020bfbb>] system_call_fastpath+0x16/0x1b
Feb 28 10:43:18 demo1 kernel: [<00007f1d146df1c7>] 0x7f1d146df1c7
Feb 28 10:43:18 demo1 kernel:
Feb 28 10:43:18 demo1 kernel: ---[ end trace d044caa59498ad32 ]---
My dmesg and assorted logs are littered with:
Code:
Feb 28 10:43:18 demo1 kernel: sd 0:0:1:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE,SUGGEST_OK
Feb 28 10:43:18 demo1 kernel: sd 0:0:1:0: [sdb] Sense Key : Hardware Error [current]
Feb 28 10:43:18 demo1 kernel: sd 0:0:1:0: [sdb] Add. Sense: Internal target failure
Feb 28 10:43:18 demo1 kernel: end_request: I/O error, dev sdb, sector 384
Feb 28 10:43:18 demo1 kernel: Buffer I/O error on device dm-0, logical block 0
Feb 28 10:43:18 demo1 kernel: lost page write due to I/O error on dm-0
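The sdb sector number and the dm-0 block number in these lines can be related with a little arithmetic. The sketch below is only valid under several assumptions that the logs do not confirm: sdb is the first PV of the LV, the LV starts at the first physical extent and is mapped linearly, the PV data area begins 384 sectors (192 KiB) into the disk, and the filesystem uses 4096-byte blocks. The constant and function names are hypothetical; the real data-area offset would be shown by `pvs -o +pe_start`.

```python
SECTOR_SIZE = 512        # bytes per sector, as used in the kernel messages
PE_START_SECTORS = 384   # ASSUMED start of the PV data area (192 KiB)
FS_BLOCK_SIZE = 4096     # ASSUMED ext3 block size


def pv_sector_to_fs_block(pv_sector):
    """Map a sector on the first PV to a filesystem block on the LV.

    Only meaningful under the assumptions above: the LV begins at the
    first physical extent of this PV and is laid out linearly.
    """
    if pv_sector < PE_START_SECTORS:
        raise ValueError("sector lies in the LVM metadata area, not in the LV")
    lv_byte_offset = (pv_sector - PE_START_SECTORS) * SECTOR_SIZE
    return lv_byte_offset // FS_BLOCK_SIZE


if __name__ == "__main__":
    # Sector reported by end_request for sdb in the log above.
    print(pv_sector_to_fs_block(384))
```

If those assumptions hold, sector 384 on sdb is logical block 0 of dm-0, which agrees with the two log lines. On ext2/ext3 the primary superblock sits at byte offset 1024, so with 4096-byte blocks it falls inside block 0.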
Now, I know it's pretty apparent that my drive died, but here is my question.
It was clear that sdb was the issue, but for some reason the entire Logical Volume could not be mounted once the disk threw an error. I tried everything from finding the backup superblocks and running e2fsck -b on /dev/mapper/LVG_STORE-LV_OPT to backing up as much data as I could before I hit "X" number of reads, or before I hit that specific block on the disk and the error was thrown, leaving me unable to do anything else.
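For reference, the block numbers that e2fsck -b accepts can be estimated with a short sketch. It assumes the filesystem was created with the sparse_super feature, where superblock copies live in block groups 0 and 1 and in groups that are powers of 3, 5, and 7, and that each group holds 8 x block_size blocks. The function names are hypothetical, and the real locations are best read from `dumpe2fs` output rather than computed.

```python
def backup_superblock_groups(group_count):
    """Block groups holding a backup superblock under sparse_super.

    Group 0 holds the primary superblock and is left out.
    """
    groups = set()
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        power = base
        while power < group_count:
            groups.add(power)
            power *= base
    return sorted(groups)


def backup_superblock_blocks(group_count, block_size=4096):
    """Candidate block numbers to pass to e2fsck -b (ASSUMED layout)."""
    blocks_per_group = 8 * block_size  # one bitmap block describes one group
    # With 1 KiB blocks the first data block is 1 instead of 0.
    first_data_block = 1 if block_size == 1024 else 0
    return [
        g * blocks_per_group + first_data_block
        for g in backup_superblock_groups(group_count)
    ]


if __name__ == "__main__":
    # Show the first few candidates for a filesystem with 30 block groups.
    print(backup_superblock_blocks(30))
```

With 4096-byte blocks the first backup would sit at block 32768, about 128 MiB into the volume. In a linear layout that starts on sdb, that copy and the next several would still be on the failing disk.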
Why does a single failing disk cause an entire LVM volume to fail? This seems a little backwards, unless I am missing something.
Any information would help. Thank you.