LinuxQuestions.org
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Hardware
06-20-2007, 10:41 PM   #1
ImLagging
LQ Newbie
 
Registered: Dec 2006
Posts: 5

Rep: Reputation: 0
Problems with RAID


I've been experiencing an annoying problem with a fairly new server running RAID 1 that I set up at the beginning of April. About a month after the server was set up, sda started reporting that it was failing (any attempt to access anything on that drive caused either an immediate error or a delay of several minutes followed by an error). After a cold reboot everything seemed fine, but I replaced the drive with a new one just in case, even though the original was only about a month old. About a week later the same thing happened, again with the replacement sda, which was a brand-new, never-used drive. I could maybe see one hard drive going bad due to shipping, but two different hard drives from two different shipments failing so soon doesn't sound right.

At this point, I moved /tmp to sdb. My thinking was that if this was going to keep happening, at least the server could keep running until I could reboot it, since a reboot seems to be the only fix. The biggest issue was that MySQL would stop running because it couldn't use /tmp on sda. After this, everything seemed fine for about a month. Then sdb showed the same symptoms, and again a week later. I don't know if this is related, but it seems that whichever drive holds /tmp is the one that "fails". I've run smartctl on both drives multiple times, and each time the short and long tests come out fine.
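For reference, one way to run the same SMART checks on both drives and then read back the results is sketched below. It assumes smartmontools is installed and that, as was common for SATA disks behind libata on 2.6.9-era kernels, smartctl needs `-d ata` to reach the drive. The `run` wrapper, the `DRY_RUN` switch and the two function names are hypothetical helpers; by default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Start a long (extended) self-test on each drive. The test runs inside the
# drive and can take hours on a 400 GB disk, so check back later.
smart_start_long() {
    local dev
    for dev in "$@"; do
        run smartctl -d ata -t long "$dev"
    done
}

# Read back the self-test log, the drive's own ATA error log and the attributes.
smart_report() {
    local dev
    for dev in "$@"; do
        run smartctl -d ata -l selftest "$dev"
        run smartctl -d ata -l error "$dev"
        run smartctl -d ata -A "$dev"
    done
}

smart_start_long /dev/sda /dev/sdb
# ...once the tests have finished:
smart_report /dev/sda /dev/sdb
```

Self-tests only exercise the drive's own media and electronics, not the link to the controller. If the drive's ATA error log stays empty while the kernel keeps logging timeouts, or if the UDMA_CRC_Error_Count attribute (where the drive reports it) keeps climbing, the cable, backplane or controller is arguably a more likely suspect than the disks themselves.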

The kicker is that about a week after the last time sdb "failed", both drives failed. I've pretty much had it with this thing, and I haven't been able to find anything on the net that helps. The only solutions I can think of are to mirror /tmp or to get a hardware RAID card, but I'd rather not convert to hardware RAID if I don't need to. Currently, I have / on md0 and /home on md1. /tmp is a separate partition (not mirrored) so that I can mount it noexec.
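If mirroring /tmp is the route taken, one possible sketch follows. It assumes /tmp has been unmounted first, that the current contents of both ~1 GB partitions (sda5 and sdb5) can be thrown away, that /dev/md2 is unused, and optionally that both partitions' type has been changed to fd (Linux raid autodetect) with fdisk. The device name and the `run`/`DRY_RUN` helper are assumptions; by default the script only prints what it would do.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print commands by default; set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

MD=/dev/md2                   # assumed to be the next unused md device
PARTS=(/dev/sda5 /dev/sdb5)   # the two partitions used for /tmp so far

mirror_tmp() {
    # WARNING: destroys whatever is on both partitions.
    run mdadm --create "$MD" --level=1 --raid-devices=2 "${PARTS[@]}"
    run mkfs.ext3 "$MD"
    # Record the array so it is assembled at boot; this appends ALL arrays,
    # so check /etc/mdadm.conf afterwards for duplicate md0/md1 ARRAY lines.
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ mdadm --detail --scan >> /etc/mdadm.conf"
    else
        mdadm --detail --scan >> /etc/mdadm.conf
    fi
    # /etc/fstab entry that keeps /tmp noexec (nosuid,nodev are a common extra).
    echo "$MD  /tmp  ext3  defaults,noexec,nosuid,nodev  1 2"
}

mirror_tmp
```

The trade-off: a mirrored /tmp keeps MySQL's temporary files available when one disk drops out, at the cost of writing everything twice, but it treats the symptom rather than whatever is making the drives disappear.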

Whenever one of the drives fails, /var/log/messages is loaded with this error:

Code:
Jun 11 09:08:19 myserver kernel: ata2: command 0xc8 timeout, stat 0xd0 host_stat 0x61
Jun 11 09:08:19 myserver kernel: ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Jun 11 09:08:19 myserver kernel: ata2: status=0xd0 { Busy }
Jun 11 09:08:19 myserver kernel: SCSI error : <1 0 0 0> return code = 0x8000002
Jun 11 09:08:19 myserver kernel: Info fld=0x531e898, Current sdb: sense key Aborted Command
Jun 11 09:08:19 myserver kernel: Additional sense: Scsi parity error
Jun 11 09:08:19 myserver kernel: end_request: I/O error, dev sdb, sector 87156888
Jun 11 09:08:19 myserver kernel: EXT3-fs error (device sdb5): ext3_find_entry: reading directory #2 offset 0
It just repeats continuously. dmesg shows the following:
Code:
ata2: command 0xc8 timeout, stat 0xd0 host_stat 0x61
ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata2: status=0xd0 { Busy }
SCSI error : <1 0 0 0> return code = 0x8000002
Info fld=0x531e898, Current sdb: sense key Aborted Command
Additional sense: Scsi parity error
end_request: I/O error, dev sdb, sector 87156888
EXT3-fs error (device sdb5): ext3_find_entry: reading directory #2 offset 0
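Decoding the numbers in these messages may help narrow things down. Command 0xc8 is the ATA READ DMA command, so the drive timed out on a plain read. The sketch below assumes the standard ATA status-register bit layout (BSY 0x80, DRDY 0x40, DF 0x20, DSC 0x10, DRQ 0x08, CORR 0x04, IDX 0x02, ERR 0x01); `decode_ata_status` is a hypothetical helper. If I read libata's status-to-sense translation table correctly, status 0xd0 is mapped to sense key Aborted Command with ASC 0x47 as a generic placeholder, so the "Scsi parity error" text probably should not be taken literally.

```shell
#!/usr/bin/env bash
# Print the names of the ATA status-register bits set in a status byte (e.g. 0xd0).
decode_ata_status() {
    local st=$(( $1 ))
    local names=(BSY DRDY DF DSC DRQ CORR IDX ERR)
    local masks=(0x80 0x40 0x20 0x10 0x08 0x04 0x02 0x01)
    local out=() i
    for i in "${!masks[@]}"; do
        if (( st & masks[i] )); then
            out+=("${names[i]}")
        fi
    done
    echo "${out[*]}"
}

decode_ata_status 0xd0   # "stat 0xd0" from the log; prints: BSY DRDY DSC
# "Info fld" is the failing LBA in hex; converting it shows it is the same
# sector that end_request reports.
echo $(( 0x531e898 ))    # prints: 87156888
```

BSY still set after the timeout means the drive (or the link to it) never completed the read. That is consistent with a drive fault, but it does not by itself rule out a controller or cable problem.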
Here's some other information.

Server Specs:
Supermicro Motherboard H8SSL-i
AMD Opteron 165
2GB RAM
2 Western Digital WD4000KS SATA drives
Software RAID 1
CentOS 4.5 64bit

Code:
uname -a
Linux myserver.example.com 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 09:40:21 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
Code:
mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.01
  Creation Time : Thu Apr  5 00:29:22 2007
     Raid Level : raid1
     Array Size : 346080128 (330.05 GiB 354.39 GB)
    Device Size : 346080128 (330.05 GiB 354.39 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Mon Jun 11 09:09:10 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : e39dde77:f0bc1dda:cbe1ab76:6e9ed6ad
         Events : 0.1761398

    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       0        0        -      removed

       2       8       22        -      faulty   /dev/sdb6
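The `[UU]` member map in /proc/mdstat shows the same degraded state in a form that is easy to check from cron. The function below is a hypothetical helper that prints the name of every array whose map contains an underscore. The re-add commands in the comments are the usual mdadm way to put sdb6 back once the disk responds again (for example after a reboot), and they assume the disk is otherwise healthy.

```shell
#!/usr/bin/env bash
# Print each md array in /proc/mdstat-style input whose member map shows a missing disk.
degraded_arrays() {
    local line current=""
    local hdr_re='^(md[0-9]+)[[:space:]]*:'
    local map_re='\[[U_]+\]'
    while IFS= read -r line; do
        if [[ $line =~ $hdr_re ]]; then
            current=${BASH_REMATCH[1]}
        elif [[ -n $current && $line =~ $map_re ]]; then
            # "_" marks a failed or missing member, e.g. [U_].
            if [[ ${BASH_REMATCH[0]} == *_* ]]; then
                echo "$current"
            fi
            current=""
        fi
    done
}

# Typical use:
#   degraded_arrays < /proc/mdstat
#
# Re-adding the failed member once the disk is reachable again:
#   mdadm /dev/md1 --remove /dev/sdb6
#   mdadm /dev/md1 --add /dev/sdb6
#   cat /proc/mdstat    # watch the resync progress
```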
Code:
lsmod
Module                  Size  Used by
ipt_owner               5441  3
ipt_REJECT              8897  1
iptable_filter          4673  1
ip_tables              21825  3 ipt_owner,ipt_REJECT,iptable_filter
md5                     5953  1
ipv6                  284193  16
parport_pc             29569  0
lp                     15345  0
parport                44493  2 parport_pc,lp
autofs4                24393  0
sunrpc                176441  1
sr_mod                 20965  0
usb_storage            71561  0
dm_mod                 68609  0
button                  9313  0
battery                11465  0
ac                      6985  0
joydev                 12097  0
ohci_hcd               24529  0
ehci_hcd               33989  0
tg3                   109509  0
floppy                 66065  0
ext3                  138193  4
jbd                    69105  1 ext3
raid1                  19137  2
sata_svw               10053  7
libata                 78345  1 sata_svw
sd_mod                 19393  9
scsi_mod              141457  4 sr_mod,usb_storage,libata,sd_mod
Code:
lspci
00:01.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge
00:02.0 Host bridge: Broadcom BCM5785 [HT1000] Legacy South Bridge
00:02.1 IDE interface: Broadcom BCM5785 [HT1000] IDE
00:02.2 ISA bridge: Broadcom BCM5785 [HT1000] LPC
00:03.0 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.1 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.2 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:05.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:0d.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge (rev b2)
01:0e.0 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
01:0e.1 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
02:03.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:03.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
Code:
fdisk -l

Disk /dev/sda: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          65      522081   83  Linux
/dev/sda2   *          66        5164    40957717+  fd  Linux raid autodetect
/dev/sda3            5165        5425     2096482+  82  Linux swap
/dev/sda4            5426       48641   347132520    5  Extended
/dev/sda5            5426        5556     1052226   83  Linux
/dev/sda6            5557       48641   346080231   fd  Linux raid autodetect

Disk /dev/sdb: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          65      522081   83  Linux
/dev/sdb2              66        5164    40957717+  fd  Linux raid autodetect
/dev/sdb3            5165        5425     2096482+  82  Linux swap
/dev/sdb4            5426       48641   347132520    5  Extended
/dev/sdb5            5426        5556     1052226   83  Linux
/dev/sdb6            5557       48641   346080231   fd  Linux raid autodetect

Disk /dev/md0: 41.9 GB, 41940615168 bytes
2 heads, 4 sectors/track, 10239408 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md0 doesn't contain a valid partition table

Disk /dev/md1: 354.3 GB, 354386051072 bytes
2 heads, 4 sectors/track, 86520032 cylinders
Units = cylinders of 8 * 512 = 4096 bytes

Disk /dev/md1 doesn't contain a valid partition table
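The sector in the error log can be tied to this partition table. Assuming the usual layout of that era (255 heads × 63 sectors = 16065 sectors per cylinder, with each logical partition starting one 63-sector track into its first cylinder), sdb5 runs from sector 87152688 to 89257139, which matches the 1052226 1 KB blocks fdisk reports for it. Sector 87156888 therefore falls inside sdb5, apparently the unmirrored /tmp partition, 4200 sectors (about 2.1 MB) from its start. The helper names are hypothetical; `fdisk -lu /dev/sdb` prints start and end sectors directly and avoids the cylinder arithmetic.

```shell
#!/usr/bin/env bash
# Geometry from the fdisk header: 255 heads * 63 sectors/track.
SECTORS_PER_CYL=$(( 255 * 63 ))   # 16065

# First sector of a logical partition starting at (1-based) cylinder $1.
# Assumes the partition begins one 63-sector track into that cylinder.
logical_start() {
    echo $(( ($1 - 1) * SECTORS_PER_CYL + 63 ))
}

# Last sector of a partition ending at (1-based) cylinder $1.
cyl_end() {
    echo $(( $1 * SECTORS_PER_CYL - 1 ))
}

# Read "name first last" lines on stdin; print the name whose range contains sector $1.
find_partition() {
    local sector=$1 name first last
    while read -r name first last; do
        if (( sector >= first && sector <= last )); then
            echo "$name"
            return 0
        fi
    done
    return 1
}

bad=87156888   # from "end_request: I/O error, dev sdb, sector 87156888"
find_partition "$bad" <<EOF
sdb5 $(logical_start 5426) $(cyl_end 5556)
sdb6 $(logical_start 5557) $(cyl_end 48641)
EOF
# prints: sdb5
echo "offset into sdb5: $(( bad - $(logical_start 5426) )) sectors"
# prints: offset into sdb5: 4200 sectors
```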
Any suggestions would be appreciated. If you need any other information, please ask. Sorry for such a long post, but I felt it necessary to provide as much info as possible. I have several other servers with a very similar setup (same hardware/software), but they use a 3ware RAID card for hardware RAID. This is the only server I'm using software RAID on. It's also the only server that's causing any kind of problems. The only other difference is that I'm using WD's RAID Edition drives for the other servers.
 
06-20-2007, 11:48 PM   #2
Electro
LQ Guru
 
Registered: Jan 2002
Posts: 6,042

Rep: Reputation: Disabled
Not all controllers can handle 24/7, 365-days-a-year uptime. It is probably best to get a hardware RAID controller. You may also want to use ECC memory if you have not done so already; ECC memory does minimize problems that may come up. The hard drive cache might be giving the array inconsistent data, so try disabling the cache on each hard drive in the RAID array. If you can, try spacing out the hard drives so that vibration does not cause problems. You may have to use rubber grommets to isolate the hard drive bay from the chassis.
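To try turning off the drives' write cache, hdparm's `-W0` flag does that, and `hdparm -I` lists "Write cache" among the drive's features with a `*` in the Enabled column when it is on. Whether hdparm can reach SATA disks through libata on this 2.6.9 kernel is not certain, and the setting usually does not survive a power cycle, so it would need to be reapplied at boot (for example from /etc/rc.local). The parser below is a hypothetical helper that assumes the `hdparm -I` feature-table layout shown in its comment.

```shell
#!/usr/bin/env bash
# Report whether the drive's write cache is enabled, reading `hdparm -I` output on stdin.
# Assumed layout of the relevant feature-table lines:
#         *    Write cache        <- enabled
#              Write cache        <- supported but disabled
write_cache_state() {
    local line
    while IFS= read -r line; do
        if [[ $line == *"Write cache"* ]]; then
            if [[ $line == *"*"* ]]; then
                echo enabled
            else
                echo disabled
            fi
            return 0
        fi
    done
    echo unknown
    return 1
}

# Typical use (as root):
#   hdparm -W0 /dev/sda                       # disable the write cache
#   hdparm -I /dev/sda | write_cache_state    # confirm it took effect
```

Disabling the write cache costs write performance, so it is worth confirming it actually changes the failure pattern before leaving it off.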
 
06-21-2007, 12:03 AM   #3
ImLagging
LQ Newbie
 
Registered: Dec 2006
Posts: 5

Original Poster
Rep: Reputation: 0
ECC is not supported by this motherboard, according to Supermicro. I actually bought some for an earlier server and it refused to POST until I bought non-ECC memory. As for drive placement, this is a 1U server and it's designed to have the drives right next to each other.

Supermicro makes mostly server hardware (and has a good reputation for doing so), so I don't see why they'd make something that can't handle being up 24/7/365. Here's a link to the server I bought: http://www.supermicro.com/Aplus/syst...AS-1010S-T.cfm.
 
06-21-2007, 03:01 PM   #4
Electro
LQ Guru
 
Registered: Jan 2002
Posts: 6,042

Rep: Reputation: Disabled
The motherboard does support ECC memory. AMD processors from Socket 940 through AM2 support ECC memory. That motherboard requires unbuffered memory only.

Crucial says it supports ECC memory. Have a look at http://www.crucial.com/store/listpar...odel=H8SSL%2Di. You may have to use single-sided memory instead of double-sided memory.
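Once ECC modules are installed, one way to see whether the BIOS reports ECC as active is the "Error Correction Type" field of the Physical Memory Array entry in `dmidecode` output. BIOS-supplied DMI data is not always accurate, so treat this as a hint rather than proof. The helper name is hypothetical, and it reads plain `dmidecode` output on stdin because older dmidecode versions may not support the `-t` type filter.

```shell
#!/usr/bin/env bash
# Print the "Error Correction Type" value(s) found in dmidecode output on stdin,
# e.g. "Single-bit ECC", "Multi-bit ECC" or "None". Returns 1 if none is found.
ecc_type() {
    local line found=1
    local re='Error Correction Type:[[:space:]]*(.*)$'
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            echo "${BASH_REMATCH[1]}"
            found=0
        fi
    done
    return "$found"
}

# Typical use (as root):
#   dmidecode | ecc_type
```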
 
06-21-2007, 03:36 PM   #5
ImLagging
LQ Newbie
 
Registered: Dec 2006
Posts: 5

Original Poster
Rep: Reputation: 0
You're right, it does support ECC. I must have been looking at the wrong info. I did buy ECC RAM (from Kingston - KVR400X72C3A). But this is Socket 939, not 940. Not sure if that will make a difference.

Last edited by ImLagging; 06-21-2007 at 03:41 PM.
 
06-22-2007, 01:19 AM   #6
Electro
LQ Guru
 
Registered: Jan 2002
Posts: 6,042

Rep: Reputation: Disabled
Socket 939 just has fewer HyperTransport ports than Socket 940 (I think). Like Socket 940, Socket 939 does support ECC memory, even though many motherboard manufacturers do not include ECC as an option in the BIOS. ECC memory should minimize problems. You may also want to get a power line conditioner, because a 260-watt power supply seems underpowered.

If nothing fixes the problem, you may want to get a 4U or larger chassis so you can use a 3ware card. That said, it is hard to pinpoint whether this is a controller problem or a hard drive problem.
 
  


All times are GMT -5.