Raid Issues

jim_cliff11 · 11-07-2016, 01:01 PM

Hi All,

The OpenSuse server we are using is giving me a few issues regarding the RAID configuration.

See attached picture for information.

The distro boots up and runs fine but the orange lights on two of the drives are flashing.

Is there a way within OpenSuse to run diagnostics and fix any faults within the drives?

lsblk gives me the following:

Code:

cciss!c0d0   disk  1.8T LOGICAL VOLUME
cciss!c0d0p1 part    2G
cciss!c0d0p2 part   40G
cciss!c0d0p3 part  1.8T
cciss!c0d1   disk  1.4T LOGICAL VOLUME
cciss!c0d1p1 part  1.4T
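
For anyone following along: most tools want a /dev path, not the name lsblk prints. The kernel substitutes `!` for `/` in block-device names, so `cciss!c0d0` should correspond to `/dev/cciss/c0d0`. A minimal sketch of that mapping, using the output above as sample input (the function names are made up for illustration):

```python
def kernel_name_to_dev_path(kernel_name: str) -> str:
    """Map a kernel block-device name as printed by lsblk to its /dev path.

    The kernel replaces '/' with '!' in names such as 'cciss!c0d0',
    so the matching device node is /dev/cciss/c0d0.
    """
    return "/dev/" + kernel_name.replace("!", "/")


# Sample copied from the lsblk output above.
LSBLK_SAMPLE = """\
cciss!c0d0   disk  1.8T LOGICAL VOLUME
cciss!c0d0p1 part    2G
cciss!c0d0p2 part   40G
cciss!c0d0p3 part  1.8T
cciss!c0d1   disk  1.4T LOGICAL VOLUME
cciss!c0d1p1 part  1.4T
"""


def logical_drives(lsblk_text: str) -> list[str]:
    """Return /dev paths of the whole-disk rows (second column is 'disk')."""
    paths = []
    for line in lsblk_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1] == "disk":
            paths.append(kernel_name_to_dev_path(fields[0]))
    return paths


print(logical_drives(LSBLK_SAMPLE))
```

Note that these are the logical volumes the controller presents, not the physical disks behind them.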

I have a RAID 5 array configured as two logical drives.

Thanks,
Jim

szboardstretcher · 11-07-2016, 01:04 PM

"orange lights on two of the drives are flashing"

In the Dell world, a flashing amber light means that the drive has physically failed. No command can fix that. You should pull the drives and replace them before you lose another drive and, with it, your array and your DATA. Get a backup.

http://www.dell.com/support/article/us/en/04/SLN292269

syg00 · 11-07-2016, 03:55 PM

I would have thought the messages were self-explanatory.
Change the battery.

smallpond · 11-07-2016, 04:01 PM

For hardware RAID you need to follow the RAID controller procedures for finding out what's wrong and replacing or reinitializing the drives. Anything you try to do in software directly to the drives would be likely to conflict with the RAID controller.

jefro · 11-07-2016, 09:53 PM

https://www.smartmontools.org/wiki/S...ID-Controllers lists which RAID controllers the smartmontools tools can work with.
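
As a rough sketch of what that looks like with the cciss driver: smartctl takes `-d cciss,N` to address physical disk N behind the controller, and its man page describes `sat+cciss,N` for SATA disks. The helper name, the disk count of three and the choice of `/dev/cciss/c0d0` below are assumptions for illustration, so check them against your own setup. The script only prints the commands unless you flip the flag.

```python
import shutil
import subprocess


def smartctl_command(logical_dev: str, disk_index: int, sata: bool = False) -> list[str]:
    """Build a smartctl call for one physical disk behind a cciss controller."""
    dev_type = f"sat+cciss,{disk_index}" if sata else f"cciss,{disk_index}"
    return ["smartctl", "-a", "-d", dev_type, logical_dev]


# Assumptions: three physical disks (indexes 0-2) behind /dev/cciss/c0d0.
commands = [smartctl_command("/dev/cciss/c0d0", i, sata=True) for i in range(3)]
for cmd in commands:
    print(" ".join(cmd))

# Only run the commands when explicitly enabled, as root, with smartctl installed.
RUN_FOR_REAL = False
if RUN_FOR_REAL and shutil.which("smartctl"):
    for cmd in commands:
        subprocess.run(cmd, check=False)
```

`smartctl -a` only reads and reports SMART data, so it should not interfere with the controller the way writing to the drives would.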

The battery ought to be replaced before you play with the controller too much. I mean after you make a full backup.

Some RAID BIOSes do have ways to test drives. Boot into the RAID BIOS with a key combo during startup, after the normal BIOS and before the OS boots. Ctrl-A or some similar key combo.

If you have a test stand you can test drives one at a time or just read the SMART numbers. The old SCSI drives could be low-level formatted. We used to do that every year and kept them working for decades.

jim_cliff11 · 11-09-2016, 12:55 PM

Thanks for the feedback.

I've ordered a replacement battery to sort that issue out.

As for the 'imminent failure of the hard drives' warning, does this basically mean the HD is about to go FUBAR? I am running RAID 5 with 3 physical disks but in all honesty I don't know how to go about resolving this. The drives I am running are ATA GB0750C8047 units at 750GB apiece. A few questions below:

1. In order to stabilise my system do I need to replace these with identical drives, e.g. make, model, size, etc.?
2. Do I need to power off the server, or can I simply pull one drive at a time, replace the HD and push it back in? Then I'm guessing the RAID will do its thing: restore data on the new drive. Once this is done, then follow the same procedure with the second drive? Or is this completely wrong? Is there anything I need to do or initialise to begin the restoration procedure?

Sorry for my lack of knowledge.
Any help greatly appreciated.

Jim

szboardstretcher · 11-09-2016, 12:58 PM

Quote:

For hardware RAID you need to follow the RAID controller procedures for finding out what's wrong and replacing or reinitializing the drives.

You'll have to look up your raid setup and look through the manual for precise answers for your hardware.

But:

"Generally" the blinking amber light means that the drive will fail soon. It needs to be replaced.
"Generally" you do not have to replace them with the same drive, just the same size or bigger.
"Generally" with hardware RAID, you do not have to shut down the system. You can replace a drive and wait for the rebuild to happen.

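To illustrate why the rebuild works at all: RAID 5 stores a parity block per stripe, computed as the XOR of the data blocks, so any single missing block can be recomputed from the survivors. A toy sketch of that idea, with made-up 8-byte blocks on a hypothetical 3-disk array (a real controller does this per stripe in firmware):

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 3-disk RAID 5: two data blocks plus parity.
data_a = b"LINUX123"
data_b = b"raid5!!!"
parity = xor_blocks([data_a, data_b])

# The disk holding data_b is pulled; rebuild its block from the other two.
rebuilt_b = xor_blocks([data_a, parity])
print(rebuilt_b == data_b)
```

With two blocks of a stripe missing there is nothing left to XOR against, which is why the drives should be swapped one at a time, letting each rebuild finish, and why a backup comes first.
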
ember1205 · 11-09-2016, 11:30 PM

S.M.A.R.T. (Self-Monitoring Analysis and Reporting Technology) is a set of tools designed to inter-operate with like-built hard drives to constantly monitor a variety of different aspects of the drive's performance. Various areas are monitored and certain programmed thresholds are used as the "standard of measure" for the different areas. When those thresholds are crossed, the software is able to alert to an underperforming drive that may be on the edge of catastrophic failure. Generally, when you get "imminent failure" messages, you should be doing everything in your power to ensure you have a solid backup and then swapping the drives out for good ones and getting the array rebuilt.
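
A small sketch of the threshold idea described above: each attribute has a normalized value and a vendor-set threshold, and the attribute counts as failed once the value drops to the threshold or below. The attribute names are common ones, but the numbers here are invented sample readings, not taken from these drives.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    value: int      # normalized current value (higher is healthier)
    threshold: int  # vendor-set failure threshold

    def has_failed(self) -> bool:
        """An attribute has failed when its value is at or below the threshold."""
        return self.value <= self.threshold


# Hypothetical readings for one drive.
attributes = [
    SmartAttribute("Reallocated_Sector_Ct", value=30, threshold=36),
    SmartAttribute("Seek_Error_Rate", value=60, threshold=30),
    SmartAttribute("Spin_Retry_Count", value=100, threshold=97),
]

failing = [attr.name for attr in attributes if attr.has_failed()]
print(failing)
```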

As has already been said multiple times, the warning messages should have been fairly self-explanatory.

jim_cliff11 · 11-10-2016, 03:50 AM

Quote:

Originally Posted by ember1205

As has already been said multiple times, the warning messages should have been fairly self-explanatory.

Yes, and I acknowledge this. I'm now trying to gather as much information as I can in order to replace the drives.

Can anyone elaborate on whether the P400 controller needs to have identical replacement drives? I need to be sure before I purchase new drives that they're going to do the job. I'm only able to find direct replacement drives in the USA and am struggling in the UK. So would a different manufacturer, model and larger capacity suffice?

Thanks,
Jim

jim_cliff11 · 11-10-2016, 07:50 AM

Just spoke to a local IT technician who told me the GB0750C8047 Seagate Barracuda drive will be loaded with firmware specific to HP. So I can't just throw any old 750GB Barracuda drive in.

On the actual drive itself, it does say firmware: HPG1.

Does anyone else have any experience with this?

ember1205 · 11-10-2016, 08:10 AM

Have you considered contacting HP or an authorized HP shop for guidance? "Any" drive will work. What you're concerned with is the SMART communications between the drives and the controller and the drive being "as capable as possible" of inter-operating with the controller. Find out if you can upgrade the firmware on the drive yourself and whether you can get the actual firmware from HP's web site (you'd be surprised at the stuff you can download from them).