LinuxQuestions.org


LandRover 08-30-2008 10:40 AM

HDD errors, DriveStatusError, CentOS 5, DL360
 
Hey,

I have a DL360 G5 server, about 1 year old.
The server has 4x SAS 15k 72GB drives in RAID 1+0, which gives me 1 logical drive of ~144GB. It runs CentOS 5 Final, kernel 2.6.18-92.1.10.el5xen.

Up until now everything went smoothly and the server had 200+ days of uptime, but a few hours ago it hung. After the reboot, odd messages started to pop up that had never appeared before, not even once:

Code:

Aug 30 17:16:28 fun-files kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }
Aug 30 17:16:28 fun-files kernel: hda: task_in_intr: error=0x04 { DriveStatusError }
Aug 30 17:16:28 fun-files kernel: ide: failed opcode was: unknown
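
For readers trying to interpret these lines: the hex values are bit fields, and the names in braces are simply the set bits spelled out. Below is a minimal sketch of that decoding. The bit tables are an assumption based on the names the legacy Linux IDE driver prints and may be incomplete; `decode` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
# Bit names as the legacy Linux IDE driver appears to print them.
# ASSUMPTION: tables reconstructed for illustration; bits not listed
# here are reported as raw hex rather than guessed.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

ERROR_BITS = {
    0x80: "BadSector",
    0x40: "UncorrectableError",
    0x10: "SectorIdNotFound",
    0x04: "DriveStatusError",
    0x02: "TrackZeroNotFound",
    0x01: "AddrMarkNotFound",
}


def decode(value, table):
    """Return the names of the bits set in `value`, highest bit first."""
    names = []
    for bit in (0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01):
        if value & bit:
            # Fall back to the raw bit value when the table has no name.
            names.append(table.get(bit, hex(bit)))
    return names


print(decode(0x51, STATUS_BITS))  # status byte from the log above
print(decode(0x04, ERROR_BITS))   # error byte from the log above
```

Read this way, status=0x51 says the drive reports itself ready but has flagged an error, and error=0x04 says the command was aborted. Neither value identifies which physical component is at fault.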

After the reboot the server seems to work, but these messages keep popping up in many different forms (log attached).

The RAID controller shows all disks are fine and all warning lights on the server are off, which makes it all the weirder.
I work with Linux servers a lot and have never come across something like this.

I searched for it and found reports that it has something to do with a kernel bug, but it never happened before, so I think I can safely rule that out. I also doubt it's only a warning, as it never happened before with the same kernel version and everything else unchanged.

When I run, for example:
fdisk -l
it totally floods the log until I break it with CTRL+C.
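
When the log floods like this, it can help to collapse the repeats into counts per device and message type before reading further. The following is a rough sketch under stated assumptions: the regular expression is modelled only on the three log lines quoted above, `summarize` is a hypothetical helper, and the sample lines are copied from this post rather than from the full log.

```python
import re
from collections import Counter

# ASSUMPTION: pattern modelled on the lines quoted in this post, e.g.
# "... kernel: hda: task_in_intr: status=0x51 { ... }"
LINE_RE = re.compile(
    r"kernel: (?P<dev>\w+): (?P<func>\w+): "
    r"(?P<field>status|error)=(?P<value>0x[0-9a-fA-F]+)"
)


def summarize(lines):
    """Count occurrences of each (device, function, field, value) tuple."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (
                match.group("dev"),
                match.group("func"),
                match.group("field"),
                match.group("value").lower(),
            )
            counts[key] += 1
    return counts


# Sample input: the three lines quoted earlier in this post.
sample = [
    "Aug 30 17:16:28 fun-files kernel: hda: task_in_intr: status=0x51 { DriveReady SeekComplete Error }",
    "Aug 30 17:16:28 fun-files kernel: hda: task_in_intr: error=0x04 { DriveStatusError }",
    "Aug 30 17:16:28 fun-files kernel: ide: failed opcode was: unknown",
]

for key, count in sorted(summarize(sample).items()):
    print(count, *key)
```

To run it against a real log, pass an open file object to `summarize` instead of `sample`. The device name in the first column shows which device node the kernel is complaining about, which is worth checking against the device the array is actually exposed as.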

LOGS:
Regards,
Oleg G.

amani 08-30-2008 09:28 PM

Your drives are not OK and/or the controller is not OK.

You really need to work out a check-up routine, starting with changing the BIOS settings.

jiml8 08-30-2008 09:47 PM

You are experiencing a drive or a controller failure. Back up immediately and commence diagnosis.

LandRover 08-31-2008 03:39 AM

The system is backed up by an automatic service, so I'm not worried about that.

Any bright ideas on how I can flush out the faulty drive in a 4-drive array?
I also have other servers; usually when a drive is dead its light just turns red, I replace it, and that's it.

Thanks!

jiml8 08-31-2008 12:01 PM

If you have a hardware RAID controller, it should provide the tools to do that. I'm betting on the controller itself being bad, though, based on the symptoms presented.

LandRover 08-31-2008 03:45 PM

The server is still covered by warranty; I just need to find the faulty part and minimize downtime on a production server.

I managed to get a few spare SAS drives. I'll try swapping all 4 current HDDs one by one, allowing the controller to rebuild each of them, and see if the problem goes away. I hope you're wrong and it's not the controller; a controller would require downtime and a warranty call, and I can't allow that at this point, hrmf.

HP's tool seems to run only under Windows; I can't find a Linux version of it to get more info.


All times are GMT -5.