LinuxQuestions.org
Old 03-22-2012, 12:20 AM   #1
saran_redhat
Member
 
Registered: May 2009
Location: chennai
Posts: 247

Rep: Reputation: 16
Raid6 problem


Hi Guys,

A couple of weeks ago, two drives failed in our RAID 6 array, which houses /mnt/Raid3 (/dev/sdd1).

I have replaced both drives, but the RAID is not syncing to finish the rebuild. It appears that a third drive is reporting ECC errors and the rebuild is stopping.

The device will not mount

[root@media ~]# mount /dev/sdd1 /mnt/Raid3
mount: /dev/sdd1: can't read superblock

I know this generally means running xfs_repair; however, when I try that I get the following:

[root@media ~]# xfs_repair /dev/sdd1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
Can anyone suggest a solution?
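A minimal sketch of the read-only checks that are possible before resorting to -L, assuming the device and mount point from the post above. The function name inspect_xfs is hypothetical. xfs_repair -n is the no-modify mode, and the ro,norecovery mount options mount XFS read-only without replaying the log. Neither writes to the device, but either may still fail while the controller is returning read errors.

```shell
#!/bin/sh
# Hypothetical helper: read-only inspection of an XFS filesystem that will not mount.
# Nothing here writes to the device; xfs_repair -L is deliberately not used.
inspect_xfs() {
    dev=$1    # e.g. /dev/sdd1 (from the post)
    mnt=$2    # e.g. /mnt/Raid3 (from the post)

    # -n: no-modify mode; only reports what a repair would change.
    xfs_repair -n "$dev"

    # ro,norecovery: mount read-only and skip log replay, so files can be
    # copied off. The log is not replayed, so recent changes may be missing.
    mount -t xfs -o ro,norecovery "$dev" "$mnt"
}

# Usage (as root): inspect_xfs /dev/sdd1 /mnt/Raid3
```

If the read-only mount works, copying the data off first keeps the -L option as a last resort rather than a first step.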
 
Old 03-22-2012, 08:13 AM   #2
MensaWater
LQ Guru
 
Registered: May 2005
Location: Atlanta Georgia USA
Distribution: Redhat (RHEL), CentOS, Fedora, CoreOS, Debian, FreeBSD, HP-UX, Solaris, SCO
Posts: 6,870
Blog Entries: 14

Rep: Reputation: 1112
Is this hardware RAID? It is unusual for 2 disks to fail at the same time but RAID6 should survive that if it occurs. The fact you're still having issues with the mount makes it seem you may have an issue with a 3rd disk which makes me wonder if the issue isn't the RAID controller card rather than the disks themselves. Have you checked that out?
 
Old 03-27-2012, 04:31 AM   #3
saran_redhat
Member
 
Registered: May 2009
Location: chennai
Posts: 247

Original Poster
Rep: Reputation: 16
Quote:
Originally Posted by MensaWater
Is this hardware RAID? It is unusual for 2 disks to fail at the same time but RAID6 should survive that if it occurs. The fact you're still having issues with the mount makes it seem you may have an issue with a 3rd disk which makes me wonder if the issue isn't the RAID controller card rather than the disks themselves. Have you checked that out?
Hi,

Thanks for the reply.

Actually this is hardware RAID using 3ware 3DM2, configured as RAID 6. If we upgrade the firmware, can you confirm whether the data will be lost or will be safe?
We have a large amount of data stored on this array, and only one drive shows ECC errors. Can you give some more information, or any other solutions that avoid losing data?

The following is the error log:


E=0202 T=00:50:52 : Data ECC error (int)
task file written out : cd dh ch cl sn sc ft
: 60 66 98 BB 80 80 80
E=0202 T=00:50:52 P=Bh: Soft reset drive
task file read back : st dh ch cl sn sc er
: 50 A0 00 00 01 01 01
E=0202 T=00:50:52 P=Bh: Repair LBA 0x2698bba0...OK
E=0202 T=00:50:52 P=Bh: Repair LBA 0x2698bba1...OK
E=0202 T=00:50:52 P=Bh: Repair LBA 0x2698bba2...OK
Send AEN (code, time): 0x23, 03/27/2012 00:50:52
Sector repair completed
(EC:0x23, SK=0x01, ASC=0x11, ASCQ=0x00, SEV=02, Type=0x71)
port=11, LBA=0x2698BBA1
E=0202 T=00:50:52 P=Bh: Complete IPRs in error

H-RebuilderRaid5/6 ERROR, time: 37397fa (ErrorCode): 202
Start Stripe #: 004d3177
Send AEN (code, time): 0x26, 03/27/2012 00:51:07
Drive ECC error reported
(EC:0x26, SK=0x03, ASC=0x11, ASCQ=0x00, SEV=01, Type=0x71)
port=11, unit=2
Send AEN (code, time): 0x2d, 03/27/2012 00:51:07
Source drive error occurred
(EC:0x2d, SK=0x04, ASC=0x6b, ASCQ=0x02, SEV=01, Type=0x71)
port=11, unit=2
Send AEN (code, time): 0x4, 03/27/2012 00:51:07
Rebuild failed
(EC:0x04, SK=0x04, ASC=0x6b, ASCQ=0x02, SEV=01, Type=0x71)
unit=2
DcbMgr::UpdateStatus: UNIT 2 (time 00:51:07)
Updating cache settings for unit: 0
Updating cache settings for unit: 1
Override cache enable; not dirty
Updating cache settings for unit: 2
Override cache enable; not dirty
DcbMgr::WriteSegment(map=0x45229C, segID=0x8, events=12, error=0x0)
DcbMgr::WriteSegment(map=0x45229C, segID=0x1, events=12, error=0x0)
DcbMgr::UpdateStatus: (finish 00:51:07)
Send AEN (code, time): 0x3b, 03/27/2012 00:51:07
Rebuild paused
(EC:0x3b, SK=0x00, ASC=0x00, ASCQ=0x00, SEV=03, Type=0x71)
unit=2

Unit 2 H-RebuilderRaid5/6 STOPPED



Thanks

Last edited by saran_redhat; 03-27-2012 at 04:36 AM.
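A log like the one above is easier to follow as a timeline of AEN events. The following is a rough sketch, assuming the log has been saved to a text file in the same format as pasted here; summarize_aen and the file name controller.log are hypothetical names, not part of any 3ware tool.

```shell
#!/bin/sh
# Hypothetical helper: print each AEN code with the message line that follows it.
# Reads controller log text on stdin, in the format pasted in the post.
summarize_aen() {
    awk '
        /^Send AEN/ {
            split($0, parts, ": ")        # parts[2] = "0x23, 03/27/2012 00:50:52"
            split(parts[2], f, ", ")      # f[1] = code, f[2] = timestamp
            code = f[1]; when = f[2]
            pending = 1
            next
        }
        pending && NF {                   # first non-blank line after the AEN header
            printf "%s\t%s\t%s\n", when, code, $0
            pending = 0
        }
    '
}

# Usage: summarize_aen < controller.log   (controller.log is a placeholder name)
```

For the log in this post, that reduces the sequence to: sector repair completed, drive ECC error reported, source drive error occurred, rebuild failed, rebuild paused.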
 
Old 03-27-2012, 09:06 AM   #4
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 18,896

Rep: Reputation: 4261
Quote:
Originally Posted by saran_redhat
Hi,
Thanks for the reply.

Actually this is hardware RAID using 3ware 3DM2, configured as RAID 6. If we upgrade the firmware, can you confirm whether the data will be lost or will be safe?
No one here can give you ANY guarantees about your data, and it would be foolish to accept anyone's guarantee that you will be 100% safe. You're using RHEL 6; since you're in a corporate environment you should be paying for RHEL, so call them for support. Also, since your root question involves a 3ware controller and updating its firmware, checking with 3ware would be a good thing too.
Quote:
We have a large amount of data stored on this array, and only one drive shows ECC errors. Can you give some more information, or any other solutions that avoid losing data?
Yes. Back up your data, which you should be doing anyway. RAID is not a substitute for good, verified backups, and nothing really is. If you're concerned, make a copy of the data to a safe location before doing ANYTHING to it. And I'd agree with MensaWater about your hardware going bad...it's highly unlikely that so many drives would fail in such a short span. If you're going to upgrade the controller anyway, I'd suggest getting a whole new controller, moving the RAID array to it, and starting fresh. Also, what kind of disk backplane do you have the drives plugged into? That can be a problem as well.
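One way to make a copy "verified" is to compare checksums of the source and the copy. This is only a sketch, assuming GNU findutils and coreutils (find -print0, sort -z, xargs -0 -r, sha256sum); the function names and the /mnt/backup path are hypothetical, and it presumes the source can be read at all.

```shell
#!/bin/sh
# Hypothetical helper: list a SHA-256 checksum for every file under a directory,
# with paths relative to that directory and in a stable order.
checksum_tree() {
    ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# Succeeds only if both directories exist and hold the same files
# with the same contents.
verify_copy() {
    [ -d "$1" ] && [ -d "$2" ] &&
        [ "$(checksum_tree "$1")" = "$(checksum_tree "$2")" ]
}

# Usage: verify_copy /mnt/Raid3 /mnt/backup && echo "copy verified"
# (/mnt/backup is a placeholder for wherever the copy lives)
```

Reading every file on a degraded array puts load on the remaining drives, so it may be preferable to checksum while copying rather than in a second pass.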
 
Old 03-27-2012, 09:49 AM   #5
suicidaleggroll
LQ Guru
 
Registered: Nov 2010
Location: Colorado
Distribution: OpenSUSE, CentOS
Posts: 5,492

Rep: Reputation: 2092
I doubt RHEL will be any help; this looks like a 3ware issue to me, independent of the OS. The OS won't mount the drive because the RAID card isn't allowing it to be mounted due to the ECC errors. I've had similar issues when an A/C system went out: the computer's temperature spiked (but not high enough for the motherboard to shut it down), the RAID drives stopped responding, and the RAID card booted them out of the array. When enough drives had been kicked out of the array, the RAID card unmounted the filesystem from the OS, breaking all of the processes that were running. Luckily, in my case, once I shut everything down, let it cool off, and fired it back up, it came back like nothing had happened.

Definitely contact LSI/3ware about this. What model is the card? I have a few different 3ware cards running in my machines here, and while I do have intermittent problems here and there (not to mention god-awful write speeds), I've never seen this before.

What is 3DM2 showing for the status of the RAID? Degraded or Error?

Last edited by suicidaleggroll; 03-27-2012 at 09:50 AM.
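If 3DM2 isn't handy, the same status can usually be read from the command line with tw_cli, the 3ware CLI. A minimal sketch follows; raid_status is a hypothetical wrapper, the controller ID c0 is an assumption, and u2 and p11 are the unit and port named in the OP's log.

```shell
#!/bin/sh
# Hypothetical helper: query a 3ware controller with tw_cli.
# c0 is an assumed controller ID; u2 and p11 are the unit and port from the log.
raid_status() {
    ctl=${1:-c0}
    unit=${2:-u2}
    port=${3:-p11}

    tw_cli "/$ctl" show            # controller summary: units and ports
    tw_cli "/$ctl/$unit" show      # the unit whose rebuild stopped
    tw_cli "/$ctl/$port" show      # the drive reporting ECC errors
}

# Usage (as root): raid_status            # defaults: c0 u2 p11
#                  raid_status c1 u0 p3   # other controller/unit/port
```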
 
Old 03-27-2012, 10:43 AM   #6
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 18,896

Rep: Reputation: 4261
Quote:
Originally Posted by suicidaleggroll
I doubt RHEL will be any help; this looks like a 3ware issue to me, independent of the OS.
I agree totally...but was thinking of them more from an OS/data backup/verify standpoint.
Quote:
The OS won't mount the drive because the RAID card isn't allowing it to be mounted due to the ECC errors. I've had similar issues when an A/C system went out: the computer's temperature spiked (but not high enough for the motherboard to shut it down), the RAID drives stopped responding, and the RAID card booted them out of the array. When enough drives had been kicked out of the array, the RAID card unmounted the filesystem from the OS, breaking all of the processes that were running. Luckily, in my case, once I shut everything down, let it cool off, and fired it back up, it came back like nothing had happened.

Definitely contact LSI/3ware about this. What model is the card? I have a few different 3ware cards running in my machines here, and while I do have intermittent problems here and there (not to mention god-awful write speeds), I've never seen this before.

What is 3DM2 showing for the status of the RAID? Degraded or Error?
Yeah, it's tough to diagnose hardware RAID errors...lots of moving parts (controller, backplane, drives, cables, firmware...). I agree with you about it being a hardware error, though. 3 drives in a short span??
 
Old 03-27-2012, 10:55 AM   #7
suicidaleggroll
LQ Guru
 
Registered: Nov 2010
Location: Colorado
Distribution: OpenSUSE, CentOS
Posts: 5,492

Rep: Reputation: 2092
Quote:
Originally Posted by TB0ne
I agree with you about it being a hardware error, though. 3 drives in a short span??
A while back I had a 3ware card (9650SE-8LPML) which suddenly decided that it didn't like the drives I was using in my 8-drive RAID 6 array (I had been using them 24/7 for the previous year without a problem). With no warning, it started kicking a drive (or sometimes two) out of the array, claiming that it wasn't responding, and it was a different drive every time. Within 1 minute of kicking the drive out of the array, it would see the drive again, recognize it as good, and rebuild the array automatically. This happened every 1-2 days for about 6 months straight, and during that time I was CERTAIN that at some point it was going to kick out 3 drives simultaneously and lose the array, so I pulled everything important off of that machine and moved it to the others while I tried to diagnose the problem. That never happened, though; it was always 1-2 drives, and the array always rebuilt without a problem. After a few months LSI released a firmware update for the card, and after updating, the problem completely vanished.

Maybe the OP's problem is related?

Since then I've stopped buying 3ware cards. Never realized how slow they were until I started using Adaptec...
 