I've been experiencing an annoying problem with a fairly new server running software RAID 1 that I set up at the beginning of April. About a month after the server was set up, sda started reporting that it was failing (any attempt to access anything on that drive caused either an error or a several-minute delay followed by an error). After a cold reboot everything seemed fine, but I replaced the drive with a new one just in case, even though it was only about a month old. About a week later the same thing happened again, this time with the replacement sda, which had never been used before. I could maybe see one hard drive going bad due to shipping, but two different hard drives from two different shipments failing so soon doesn't sound right.
At this point, I switched /tmp to sdb. My thinking was that if this is going to keep happening, at least the server can keep running until I can reboot it, since a reboot seems to be the only fix. The biggest issue was that mysql would stop running because it couldn't use /tmp on sda. After this, everything seemed fine for about a month. Then sdb showed the same symptoms, and then again a week later. I don't know if this has anything to do with it, but it seems that whichever drive holds /tmp is the one to "fail". I've run smartctl on both drives multiple times, and each time the short and long self-tests come out just fine.
The kicker is that about a week after the last time sdb "failed", both drives failed. I've pretty much had it with this thing. I've been unable to find anything on the net that could help me fix the problem. The only solutions I can think of are to mirror /tmp or to get a hardware RAID card, but I don't feel like converting to hardware RAID if I don't need to. Currently, I have / in md0 and /home in md1. I have /tmp as a separate partition, not mirrored, so that I can have it mounted noexec.
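If I went the mirrored /tmp route, I think it would look roughly like the following. This is only a dry-run sketch that prints the commands rather than running them, since creating the array wipes the member partitions. The names are assumptions on my part: /dev/md2 is a hypothetical unused array name, and I'm assuming sda5 and sdb5 are the 1 GB /tmp partitions. The partition type would presumably also need changing to fd for autodetect.

```shell
#!/bin/sh
# Dry-run sketch: prints the steps for mirroring /tmp, does NOT execute them.
# Assumptions (hypothetical, adjust before use):
#   $1 = new md device (e.g. /dev/md2, assumed unused)
#   $2, $3 = the two /tmp partitions (assumed /dev/sda5 and /dev/sdb5)
plan_tmp_mirror() {
    md=$1
    part_a=$2
    part_b=$3
    echo "umount /tmp"
    # Build a two-disk RAID 1 array out of the two partitions.
    echo "mdadm --create $md --level=1 --raid-devices=2 $part_a $part_b"
    echo "mkfs.ext3 $md"
    # fstab line that keeps the noexec restriction on the mirrored /tmp.
    echo "$md /tmp ext3 defaults,noexec 1 2"
}

plan_tmp_mirror /dev/md2 /dev/sda5 /dev/sdb5
```

The last printed line is meant for /etc/fstab; the others would be run by hand, in order, once the output has been checked.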
Whenever one of the drives fails, /var/log/messages is loaded with this error:
Code:
Jun 11 09:08:19 myserver kernel: ata2: command 0xc8 timeout, stat 0xd0 host_stat 0x61
Jun 11 09:08:19 myserver kernel: ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
Jun 11 09:08:19 myserver kernel: ata2: status=0xd0 { Busy }
Jun 11 09:08:19 myserver kernel: SCSI error : <1 0 0 0> return code = 0x8000002
Jun 11 09:08:19 myserver kernel: Info fld=0x531e898, Current sdb: sense key Aborted Command
Jun 11 09:08:19 myserver kernel: Additional sense: Scsi parity error
Jun 11 09:08:19 myserver kernel: end_request: I/O error, dev sdb, sector 87156888
Jun 11 09:08:19 myserver kernel: EXT3-fs error (device sdb5): ext3_find_entry: reading directory #2 offset 0
The same block of errors just repeats continually. dmesg gives the following:
Code:
ata2: command 0xc8 timeout, stat 0xd0 host_stat 0x61
ata2: translated ATA stat/err 0xd0/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata2: status=0xd0 { Busy }
SCSI error : <1 0 0 0> return code = 0x8000002
Info fld=0x531e898, Current sdb: sense key Aborted Command
Additional sense: Scsi parity error
end_request: I/O error, dev sdb, sector 87156888
EXT3-fs error (device sdb5): ext3_find_entry: reading directory #2 offset 0
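For anyone who wants to pick the numbers apart, here is a minimal sketch for decoding the status byte and the "Info fld" value. It assumes the classic ATA status register bit layout (BSY in bit 7 down to ERR in bit 0); that layout is my assumption, not something the log itself states. As far as I know, command 0xc8 is READ DMA.

```shell
#!/bin/sh
# Decode an ATA status byte into flag names.
# Assumption: classic ATA status register layout (bit 7 = BSY ... bit 0 = ERR).
decode_ata_status() {
    status=$(( $1 ))
    flags=""
    for pair in 128:BSY 64:DRDY 32:DF 16:DSC 8:DRQ 4:CORR 2:IDX 1:ERR; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( status & mask )) -ne 0 ]; then
            flags="$flags $name"
        fi
    done
    # Strip the leading space left by the first append.
    echo "${flags# }"
}

# Status byte from the log above.
decode_ata_status 0xd0

# "Info fld" from the log, converted from hex to decimal.
echo $(( 0x531e898 ))
```

For 0xd0 this prints BSY DRDY DSC, and the second command prints 87156888, which matches the sector in the end_request line. The kernel only reports "Busy", presumably because the other bits aren't meaningful while BSY is set, so the drive simply never answered the read before the timeout.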
Here's some other information.
Server Specs:
Supermicro Motherboard H8SSL-i
AMD Opteron 165
2GB RAM
2 Western Digital WD4000KS SATA drives
Software RAID 1
CentOS 4.5 64bit
Code:
uname -a
Linux myserver.example.com 2.6.9-42.0.10.ELsmp #1 SMP Tue Feb 27 09:40:21 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
Code:
mdadm --detail /dev/md1
/dev/md1:
Version : 00.90.01
Creation Time : Thu Apr 5 00:29:22 2007
Raid Level : raid1
Array Size : 346080128 (330.05 GiB 354.39 GB)
Device Size : 346080128 (330.05 GiB 354.39 GB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 1
Persistence : Superblock is persistent
Update Time : Mon Jun 11 09:09:10 2007
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
UUID : e39dde77:f0bc1dda:cbe1ab76:6e9ed6ad
Events : 0.1761398
    Number   Major   Minor   RaidDevice   State
       0       8        6        0        active sync   /dev/sda6
       1       0        0        -        removed
       2       8       22        -        faulty   /dev/sdb6
Code:
lsmod
Module Size Used by
ipt_owner 5441 3
ipt_REJECT 8897 1
iptable_filter 4673 1
ip_tables 21825 3 ipt_owner,ipt_REJECT,iptable_filter
md5 5953 1
ipv6 284193 16
parport_pc 29569 0
lp 15345 0
parport 44493 2 parport_pc,lp
autofs4 24393 0
sunrpc 176441 1
sr_mod 20965 0
usb_storage 71561 0
dm_mod 68609 0
button 9313 0
battery 11465 0
ac 6985 0
joydev 12097 0
ohci_hcd 24529 0
ehci_hcd 33989 0
tg3 109509 0
floppy 66065 0
ext3 138193 4
jbd 69105 1 ext3
raid1 19137 2
sata_svw 10053 7
libata 78345 1 sata_svw
sd_mod 19393 9
scsi_mod 141457 4 sr_mod,usb_storage,libata,sd_mod
Code:
lspci
00:01.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge
00:02.0 Host bridge: Broadcom BCM5785 [HT1000] Legacy South Bridge
00:02.1 IDE interface: Broadcom BCM5785 [HT1000] IDE
00:02.2 ISA bridge: Broadcom BCM5785 [HT1000] LPC
00:03.0 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.1 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:03.2 USB Controller: Broadcom BCM5785 [HT1000] USB (rev 01)
00:05.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:0d.0 PCI bridge: Broadcom BCM5785 [HT1000] PCI/PCI-X Bridge (rev b2)
01:0e.0 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
01:0e.1 IDE interface: Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
02:03.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
02:03.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
Code:
fdisk -l
Disk /dev/sda: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1          65      522081   83  Linux
/dev/sda2   *          66        5164    40957717+  fd  Linux raid autodetect
/dev/sda3            5165        5425     2096482+  82  Linux swap
/dev/sda4            5426       48641   347132520    5  Extended
/dev/sda5            5426        5556     1052226   83  Linux
/dev/sda6            5557       48641   346080231   fd  Linux raid autodetect
Disk /dev/sdb: 400.0 GB, 400088457216 bytes
255 heads, 63 sectors/track, 48641 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          65      522081   83  Linux
/dev/sdb2              66        5164    40957717+  fd  Linux raid autodetect
/dev/sdb3            5165        5425     2096482+  82  Linux swap
/dev/sdb4            5426       48641   347132520    5  Extended
/dev/sdb5            5426        5556     1052226   83  Linux
/dev/sdb6            5557       48641   346080231   fd  Linux raid autodetect
Disk /dev/md0: 41.9 GB, 41940615168 bytes
2 heads, 4 sectors/track, 10239408 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md0 doesn't contain a valid partition table
Disk /dev/md1: 354.3 GB, 354386051072 bytes
2 heads, 4 sectors/track, 86520032 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md1 doesn't contain a valid partition table
Any suggestions would be appreciated. If you need any other information, please ask. Sorry for such a long post, but I felt it necessary to provide as much info as possible. I have several other servers with a very similar setup (same hardware/software), but they use a 3ware RAID card for hardware RAID. This is the only server I'm using software RAID on. It's also the only server that's causing any kind of problems. The only other difference is that I'm using WD's RAID Edition drives for the other servers.