LinuxQuestions.org
07-03-2012, 03:26 PM   #1
transient
LQ Newbie
 
Registered: Aug 2011
Posts: 17

Why would a failed drive in RAID 1 cause entire system to halt?


I'm a little stumped. I have a Dell PowerEdge R200, running Ubuntu 10.04 (ext4), and an LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08) HBA. The server only has 2 drives, and the RAID controller can only do RAID 0 or RAID 1; we use RAID 1.

A couple of days ago the server's filesystem went read-only. After a reboot and an fsck from Knoppix, it came back up. Three days later, the same thing happened. This time I knew enough to check dmesg, and I found a number of these errors:
Quote:
Jul 2 10:09:46 amjxt2 kernel: [242948.040068] mptscsih: ioc0: attempting task abort! (sc=ffff8800bf803f00)
Jul 2 10:09:46 amjxt2 kernel: [242948.049480] sd 2:1:0:0: [sda] CDB: Write(10): 2a 00 08 bb a8 d8 00 00 08 00
Jul 2 10:09:47 amjxt2 kernel: [242948.190614] mptscsih: ioc0: task abort: FAILED (sc=ffff8800bf803f00)
Jul 2 10:09:47 amjxt2 kernel: [242948.200146] mptscsih: ioc0: attempting task abort! (sc=ffff880037fb5600)
Jul 2 10:09:47 amjxt2 kernel: [242948.209697] sd 2:1:0:0: [sda] CDB: Write(10): 2a 00 08 bb a8 f8 00 00 08 00
and

Quote:
Jul 2 10:10:16 amjxt2 kernel: [242977.664275] mptbase: ioc0: Initiating recovery
Jul 2 10:10:39 amjxt2 kernel: [243001.010038] mptscsih: ioc0: host reset: SUCCESS (sc=ffff8800bf803f00)
Jul 2 10:11:51 amjxt2 kernel: [243072.692286] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
Jul 2 10:11:51 amjxt2 kernel: [243072.713450] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
Jul 2 10:11:51 amjxt2 kernel: [243072.734841] mptbase: ioc0: LogInfo(0x31110d00): Originator={PL}, Code={Reset}, SubCode(0x0d00)
Jul 2 10:11:59 amjxt2 kernel: [243080.196520] mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=8
Jul 2 10:11:59 amjxt2 kernel: [243080.207432] mptbase: ioc0: PhysDisk is now missing
Jul 2 10:11:59 amjxt2 kernel: [243080.218163] mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=8
Jul 2 10:11:59 amjxt2 kernel: [243080.228950] mptbase: ioc0: PhysDisk is now missing, out of sync
Jul 2 10:11:59 amjxt2 kernel: [243080.239516] mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jul 2 10:11:59 amjxt2 kernel: [243080.249851] mptbase: ioc0: volume membership of PhysDisk 255 has changed
Jul 2 10:11:59 amjxt2 kernel: [243080.260359] mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jul 2 10:11:59 amjxt2 kernel: [243080.260361] mptbase: ioc0: volume membership of PhysDisk 255 has changed
Jul 2 10:11:59 amjxt2 kernel: [243080.260364] mptbase: ioc0: RAID STATUS CHANGE for VolumeID 0
Jul 2 10:11:59 amjxt2 kernel: [243080.260366] mptbase: ioc0: volume is now degraded, enabled
Jul 2 10:11:59 amjxt2 kernel: [243080.264242] end_device-2:0: mptsas: ioc0: removing sata device: fw_channel 0, fw_id 8, phy 0,sas_addr 0x1221000000000000
Jul 2 10:11:59 amjxt2 kernel: [243080.264246] phy-2:0: mptsas: ioc0: delete phy 0, phy-obj (0xffff880037b77800)
Jul 2 10:11:59 amjxt2 kernel: [243080.264259] port-2:0: mptsas: ioc0: delete port 0, sas_addr (0x1221000000000000)
Jul 2 10:11:59 amjxt2 kernel: [243080.264449] scsi target2:0:0: mptsas: ioc0: delete device: fw_channel 0, fw_id 8, phy 0, sas_addr 0x1221000000000000
Jul 2 10:12:36 amjxt2 kernel: [243117.330140] mptbase: ioc0: WARNING - IOC is in FAULT state (8064h)!!!
Jul 2 10:12:36 amjxt2 kernel: [243117.341301] mptbase: ioc0: WARNING - Issuing HardReset from mpt_fault_reset_work!!
Jul 2 10:12:36 amjxt2 kernel: [243117.362947] mptbase: ioc0: Initiating recovery
Jul 2 10:12:36 amjxt2 kernel: [243117.373835] mptbase: ioc0: WARNING - IOC is in FAULT state!!!
Jul 2 10:12:36 amjxt2 kernel: [243117.384672] mptbase: ioc0: WARNING - FAULT code = 8064h
Jul 2 10:12:39 amjxt2 kernel: [243120.800057] mptbase: ioc0: Recovered from IOC FAULT
Jul 2 10:13:05 amjxt2 kernel: [243146.710053] mptbase: ioc0: WARNING - Issuing Reset from mpt_config!!
Jul 2 10:13:05 amjxt2 kernel: [243146.720824] mptbase: ioc0: Attempting Retry Config request type 0x13, page 0x0, action 0
Jul 2 10:13:37 amjxt2 kernel: [243178.330073] mptbase: ioc0: WARNING - mpt_fault_reset_work: HardReset: success
I installed lsiutil, checked the RAID controller, and confirmed one RAID volume and two physical drives, one of them gone. The missing one was PhysDisk 1, which is the disk that reported the RAID status change in the output above.
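
For reference, the two CDB lines in the first excerpt are SCSI Write(10) commands, and their bytes can be decoded by hand. Below is a minimal Python sketch of that decoding. The function name `decode_write10` is made up for illustration, and the 512-byte block size mentioned in the comments is an assumption that the logs do not confirm.

```python
import struct


def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("a WRITE(10) CDB is exactly 10 bytes long")
    # Layout: opcode, flags, 4-byte big-endian LBA, group number,
    # 2-byte big-endian transfer length (in logical blocks), control.
    opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    if opcode != 0x2A:
        raise ValueError("opcode 0x%02x is not WRITE(10)" % opcode)
    return {"lba": lba, "blocks": blocks}


# The two CDBs quoted in the kernel log above.
first = decode_write10("2a 00 08 bb a8 d8 00 00 08 00")
second = decode_write10("2a 00 08 bb a8 f8 00 00 08 00")

# Assumption: 512-byte logical blocks, so 8 blocks would be a 4 KiB write.
print(hex(first["lba"]), first["blocks"])
print(hex(second["lba"]), second["blocks"])
print(second["lba"] - first["lba"])
```

Both quoted commands are 8-block writes, and their starting addresses are 0x20 (32) blocks apart. The addresses are for `sda`, which here is presumably the RAID volume as presented by the controller rather than an individual physical disk.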

My confusion is this: it's my understanding that RAID 1 is supposed to protect against this kind of all-out failure. PhysDisk 1 fails, PhysDisk 2 has all the same info and takes over, allowing you to replace the failed disk and rebuild with no downtime, with maybe a little slowdown in performance at worst. Why would a failed or failing disk cause the system to go read-only, and then go read-only again some time after a reboot? I can only guess that either both drives happened to be going bad at the same time, or that the controller card itself was the issue. Is there any utility that reports the health of the controller itself? Would lsiutil have reported an issue with the controller?
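
On the disk-versus-controller question, the kernel log already separates the two kinds of message: the `PhysDisk` and `volume` lines describe the array members, while the `IOC is in FAULT state` lines refer to the controller (IOC) itself. Here is a rough Python sketch that tallies them. The category names are made up for illustration, the patterns are copied from the messages quoted above, and the sample is only a subset of those lines. It sorts messages into buckets and does not measure controller health.

```python
import re
from collections import Counter

# A few of the lines quoted above, used as sample input.
SAMPLE_LOG = """\
Jul 2 10:09:47 amjxt2 kernel: [242948.190614] mptscsih: ioc0: task abort: FAILED (sc=ffff8800bf803f00)
Jul 2 10:11:59 amjxt2 kernel: [243080.196520] mptbase: ioc0: RAID STATUS CHANGE for PhysDisk 1 id=8
Jul 2 10:11:59 amjxt2 kernel: [243080.207432] mptbase: ioc0: PhysDisk is now missing
Jul 2 10:11:59 amjxt2 kernel: [243080.260366] mptbase: ioc0: volume is now degraded, enabled
Jul 2 10:12:36 amjxt2 kernel: [243117.330140] mptbase: ioc0: WARNING - IOC is in FAULT state (8064h)!!!
Jul 2 10:12:39 amjxt2 kernel: [243120.800057] mptbase: ioc0: Recovered from IOC FAULT
"""

# Hypothetical category names; the patterns come from the quoted messages.
PATTERNS = {
    "command_abort_failed": re.compile(r"task abort: FAILED"),
    "disk_state_change": re.compile(r"RAID STATUS CHANGE for PhysDisk (\d+)"),
    "volume_degraded": re.compile(r"volume is now degraded"),
    "controller_fault": re.compile(r"IOC is in FAULT state"),
    "controller_recovered": re.compile(r"Recovered from IOC FAULT"),
}


def summarize(log_text):
    """Count matching lines per category and collect affected PhysDisk numbers."""
    counts = Counter()
    disks = set()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "disk_state_change":
                    disks.add(int(match.group(1)))
    return counts, sorted(disks)


counts, disks = summarize(SAMPLE_LOG)
for name in PATTERNS:
    print(name, counts[name])
print("PhysDisks with status changes:", disks)
```

Run against the full syslog, a tally like this would show whether controller fault messages also appear at times when no disk was dropping out, which is one hint (not proof) about where the problem lies.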

Thanks!
 
07-03-2012, 03:37 PM   #2
frieza
Senior Member
 
Registered: Feb 2002
Location: harvard, il
Distribution: Ubuntu 11.4,DD-WRT micro plus ssh,lfs-6.6,Fedora 15,Fedora 16
Posts: 3,233

It depends, I suppose, on how the RAID is set up to begin with.

Perhaps the system is dropping into read-only to prevent the drives from getting too far out of sync?

However, I would back up your data, hold your breath, replace the failing drive, and hope the working drive is successfully mirrored onto the new one.

Remember, mirrored RAIDs are designed to provide redundancy and fault tolerance, not a substitute for regular backups.
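
To see whether the filesystem itself has been remounted read-only, the mount options in `/proc/mounts` can be checked. ext4 supports an `errors=remount-ro` mount option that makes the kernel switch the filesystem to read-only after an error; whether that is what happened on this server is an assumption, not something the thread confirms. A minimal Python sketch follows, where the function name and the sample text (including the `/dev/sda1` device) are hypothetical:

```python
def read_only_mounts(mounts_text):
    """Return (mount_point, fstype) pairs for read-only entries.

    Expects text in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        # Options are comma-separated; "ro" must match as a whole option,
        # so "errors=remount-ro" alone does not count as read-only.
        if "ro" in options.split(","):
            result.append((mount_point, fstype))
    return result


# Hypothetical sample, not taken from the server in this thread.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0
/dev/sda2 /home ext4 rw,relatime,errors=remount-ro 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
"""

print(read_only_mounts(SAMPLE_MOUNTS))
```

On a live system the same function could be fed the real file, for example `read_only_mounts(open("/proc/mounts").read())`.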
 
  

