Linux - Hardware: This forum is for Hardware issues.
Having trouble installing a piece of hardware? Want to know if that peripheral is compatible with Linux?
Alright meetscott, I'll take your advice. Hopefully this will fix these errors. Sorry for all the confusion; we just need to get this up and running again, and I don't want to make a stupid mistake and blow away a load of very important mail data.
Whoah!!!! This is the first time you've mentioned that you were trying to recover data! *Which* disk has the data you are trying to recover??? I've done this before, but I need to know what you are trying to preserve, as this could change the strategy a little bit. You *should* be able to change a partition's type in the partition table, for example from "fd" to "83" in your case, without harming the data in the partition. You cannot, of course, move the partition boundaries around and expect to recover all your data. You cannot run mkfs.ext3 or whatever on the partition and expect to recover data on it when you're done. I say *should* because this is risky stuff and not guaranteed to work. There are all sorts of things you can inadvertently do wrong to hose things up. I know, I've done most of them ;-)
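For what it's worth, here is roughly how changing just the type byte looks with sfdisk, without touching the data inside the partition. The device name /dev/hdg is illustrative, and you'd want a backup of the partition table before anything else:

```shell
# Save the current partition table first; it can be restored later with
#   sfdisk /dev/hdg < hdg-table.bak
sfdisk -d /dev/hdg > hdg-table.bak

# Change only the type of partition 2 from "fd" (raid autodetect) to "83" (Linux).
# Older sfdisk versions use --change-id; newer util-linux calls it --part-type.
sfdisk --change-id /dev/hdg 2 83

# Verify the change -- the data blocks themselves are untouched
fdisk -l /dev/hdg
```

Again, this only edits the one-byte type field in the partition table; moving start/end boundaries is a completely different (and destructive) operation.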
*Which* disk are you trying to recover? And what else are you trying to accomplish with all this? I did not realize you had bigger objectives than just getting a system working and behaving properly. If you have valuable data, are you looking to recover it and start over? Are you looking to recover it and then create a raid system for your "valuable" mail? It's important to know what you're looking to end up with here.
By the way, I hate that I was difficult toward others who were doing a great, well-intentioned job of trying to help you. I may have been harsh; I was becoming frustrated because I saw obvious issues not being addressed and time wasted on issues that don't seem to apply.
This system is not dead; it's still working. I didn't build the machine, I took it over from another administrator. My company got a call one day that the messages posted above were appearing on the login screen of a Red Hat mail server at one of our clients' sites. There is about 90GB of mail on there that is very important and needs to be backed up. Currently the scripts that were developed by the other administrator are not working properly, and no backups are being performed. I gather the script was supposed to back up the 90GB of data and send it to a raid drive; then the raid drive would mirror it to another drive using Raid 1. I am told that there are 4 drives in the system, but I have not confirmed this because I have not visited the site to look at the physical machine. I'll probably go sometime this week to see what's inside, since I am very confused about what the hell is going on and why we have such a weird partitioning scheme. There is no documentation on how the system was partitioned, so I am scratching my head about how to make things right. So there is the complete story. We are planning on backing the 90GB off before I do anything to the system, so the system still functions, just not properly.
A lot of posters are confused by the information the thread starter has posted. The hdg2 device is indeed on a hard drive -- well, it's that drive's second primary partition. Also, hde is a hard drive. These two drives are on the primary and secondary channels of the Promise controller (PDC202XX). This controller had to go into reset mode to get itself working again. Promise controllers are known to have this problem only in Linux, though people insist on using this brand with Linux. ASUS is the main vendor shipping Promise controllers onboard, so I assume it is an ASUS motherboard. I recommend that people do not use Promise controllers in Linux, because this problem will always come up eventually. Promise controllers are OK in Windows, but not OK in Linux.
The hde drive and hdg1 are probably part of the RAID-1 array. Perhaps hdf is a hot spare, but I do not know. I suggest checking /etc/mdadm.conf.
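To see which devices the array is actually built from, it's worth comparing the kernel's live view against the config file. A sketch, using the device names from this thread (adjust to whatever is really there):

```shell
# Which arrays does the kernel currently see, and which members are active?
cat /proc/mdstat

# What was the array supposed to be built from, per the config file?
cat /etc/mdadm.conf

# Detailed live state of the array (works even if mdadm.conf is missing)
mdadm --detail /dev/md0

# Examine an individual member's superblock to see which array it belongs to
mdadm --examine /dev/hdf1
```

If /proc/mdstat shows the array running "degraded" with only one member, that would confirm the suspicion that a mirror drive dropped out at some point.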
Backups should be sent somewhere else instead of being kept on the same system.
SMART is useless if you do not know how to read the information it prints. As meetscott has posted, it is best to use your senses to figure out the condition of the hard drive. From what has been given, there are just corrupted sectors that can hopefully be corrected.
I recommend using the utility from the hard drive manufacturer to find the truth of the problem. It can fix most problems. The corrupted sectors should be correctable with minimal data loss using the manufacturer's utility, though the corrupted sectors could also be caused by the Promise controller.
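If smartmontools happens to be installed, the SMART data Electro mentions can be pulled out like this (the device name is just an example):

```shell
# Overall health verdict (PASSED/FAILED) plus the raw attribute table
smartctl -H /dev/hde
smartctl -A /dev/hde

# The attributes worth watching are Reallocated_Sector_Ct,
# Current_Pending_Sector, and Offline_Uncorrectable -- rising raw
# values on those mean the drive really is dying.

# Kick off the drive's own long surface scan, then read the result back
smartctl -t long /dev/hde
smartctl -l selftest /dev/hde
```

The manufacturer's bootable diagnostic does much the same thing plus sector remapping, so this is just a quick first look you can do over SSH without a site visit.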
Oh boy! This sorta thing is tough. This sounds like a system that definitely needs raid 1. No question about it. From the output you had earlier, I still stand by what I said about only having 2 physical drives. You also need a backup scheme that's well documented and trustworthy. This is the sort of thing where you might consider making backups by generations. We used to keep 10 generations of backups at a major hospital I used to work at. In addition to that we kept a fully production capable system "hot" all the time for fail over. It was fed real time database updates so there was virtually no data loss if there was a failure.
What I'm getting at here is this system and its expectations need to be carefully evaluated in terms of business need (or life and death in the hospital case as medications were distributed and patient charting was electronic). You may have already done this. I'm probably preaching to the choir here so to speak. Sorry if I am.
You do need to see the physical system and understand what's there. It could be that other drives are present and not functioning. You need to know for sure. Looking back at the other posts, it looks like the system may have been configured to use raid at one time. I'm suspecting that one or more of the raid drives failed and the system has continued to run on the other functioning drive. This whole thing is way more *screwed* up than I could have imagined. I did *not* realize you were taking over an established system. This has got to be one of the worst stories I've ever heard. Twilight zone stuff for admins.
From what you've said and looking through the previous posts, I think only /dev/hda has anything that needs to be saved because I don't see any other drives functioning (if there are any besides /dev/hdf). I would think some users would have squawked by now about not being able to get their mail. Well maybe they have but you just haven't heard about it yet because you just took over!
If you are being charged with getting data off the other drives and they are broken, you might consider a data recovery service. Just a thought on that one. I don't know what your circumstances are.
Last thing. It would be ideal if you did have another system available; something exactly the same would be great so you could kick the tires and learn about it without fear of making a mistake. I work for a bank now and we keep 3 environments going all the time: Production (the live environment), UAT (a test environment just like production), and a development environment (used for installing and testing new code). As a developer, it takes a month, minimum, to see my code go into production after I've finished coding because of all the steps in the process of testing and verifying. It can take much more time too. Depends.
When you finally do find out what you've *really* got, what you're up against, and what *really* needs to be recovered, we can come up with a better plan to get this solved. Sounds like you may be buying more hardware, like replacement drives or another test system.
I hope this helps. Let us know if there is anything more we can do for you. I'm kind of personally invested in this now, so I'd really like to know how it all turns out.
Electro slipped his post in there while I was making mine. Thanks for the information, Electro. That's good stuff. I've heard similar things about Promise controllers, although I have no first-hand experience with them.
This is for Electro: What raid card would you recommend for keysorsoze? Can he simply hook his 4 drives into his Primary IDE controllers and use software raid? I've personally had excellent luck for years with software raid. Would this represent too much of a performance hit? How would the entries get so hosed in fstab? There is nothing going on there that makes me believe this was ever set up right in the first place.
Electro, since you have some experience with these controllers and configurations, maybe you can better suggest the set up path keysorsoze should take? I would like to know your thoughts on that too.
Hey, thanks for all the help guys, I never knew this would turn into such a big thing. I do know that the raid controller is integrated into the motherboard and is not a separate PCI card. It is also using some type of software raid to perform the mirroring. I'll have to go to the site either today or tomorrow, but I'll be back with some type of news. Thanks for all the replies; they have given me a lot of help and insight into the situation.
For a cheap server, I suggest two Highpoint Rocket133 controllers. Put one drive on each controller, then set them up as RAID-1 using Linux software RAID. This setup provides redundancy for both the controller and the hard drive.
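The Linux software RAID-1 part of that suggestion would look roughly like this with mdadm. Device and partition names here are examples, not keysorsoze's actual layout:

```shell
# Build a two-disk mirror from partitions already marked type "fd"
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/hde1 /dev/hdg1

# Optionally add a hot spare that takes over automatically on a failure
mdadm --add /dev/md0 /dev/hdf1

# Put a filesystem on the array and record the layout for boot time
mkfs.ext3 /dev/md0
mdadm --detail --scan >> /etc/mdadm.conf

# Watch the initial resync progress
cat /proc/mdstat
```

With one drive per controller as Electro describes, either a controller or a drive can die and the mirror keeps running on the surviving half.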
Since it is a mail server, I suggest using hardware RAID and setting up RAID-5 for just the mail, or /var/spool/mail. The controller I suggest is a 3ware 8-port or 12-port SATA controller. RAID-5 can spread multiple writes across the drives. For the OS drive, I suggest RAID-1. With the RAID-5 controller, provide hot spares, so the server can keep going even when an IT admin cannot make the trip to fix the problem.
For a file server, I suggest two RAID-5 or RAID-6 arrays, mirrored together. This creates RAID-51 or RAID-61. It is a costly setup, but very, very redundant and fast: two read streams can be serviced at the same time while multiple writes proceed.
Mail and file servers do not need a fast processor unless the software they run provides additional features like SSL. All servers should use at least ECC memory and have redundant power supplies.
When setting up RAID with IDE hard drives, put only one hard drive on each connector on the card. If two IDE hard drives share one cable and the cable or channel fails, you lose access to both hard drives instead of one.
It does help to turn off the write cache on any hard drive that will be used in a RAID array. If the cache is not disabled, data is not actually stored the instant it is sent to the hard drive. The caching that Linux provides for storage media, or the hardware RAID controller's own cache, should be used instead.
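Disabling the on-drive write cache can be done with hdparm; note the setting does not survive a reboot unless it is repeated from a startup script. Device names again illustrative:

```shell
# Turn off the drive's write-back cache so completed writes are
# actually on the platter, not just in the drive's RAM
hdparm -W0 /dev/hde
hdparm -W0 /dev/hdg

# Query the current write-caching setting to confirm
hdparm -W /dev/hde
```

The trade-off is slower writes, which is why the RAID controller's battery-backed cache (when present) is the better place to buffer.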
Backups should be stored on another computer instead of the same computer that is serving users. Backups can be stored cheaply by using separate hard drives for incremental, differential, and full backups. I suggest keeping a copy of a full backup in a deposit box at a bank, or somewhere else in a bolted-down safe.
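One cheap way to get full plus incremental generations with nothing but GNU tar is its listed-incremental mode; the snapshot file is what lets tar work out what changed. The paths below are placeholders -- in real use DEST would be a mount from another machine:

```shell
#!/bin/sh
# Weekly full + daily incremental backup using GNU tar's snapshot file.
# SRC and DEST are placeholders for the mail spool and a remote-mounted
# backup drive; adjust for the real system.
SRC=/var/spool/mail
DEST=/backup
SNAP=$DEST/mail.snar

day=$(date +%u)   # 1 = Monday ... 7 = Sunday

if [ "$day" = "7" ]; then
    # Weekly full: removing the snapshot file makes tar archive everything
    rm -f "$SNAP"
    tar --listed-incremental="$SNAP" -czf "$DEST/mail-full-$(date +%Y%m%d).tar.gz" "$SRC"
else
    # Daily incremental: only files changed since the last run are saved
    tar --listed-incremental="$SNAP" -czf "$DEST/mail-incr-$(date +%Y%m%d).tar.gz" "$SRC"
fi
```

Keep several weeks of these generations on rotating drives, and the off-site full copy Electro mentions covers the fire-and-theft case.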
As of kernel version 2.6.14, all 3ware cards are supported. This includes their PCI Express versions too.
BTW, I have not yet set up RAID myself, although I do know what the computer needs for many server types.
Update on system status. It turns out that after all the errors were occurring, the mail finally crapped out and people were no longer receiving email. I was able to log in via SSH, so I simply rebooted the system and, like magic, everything fixed itself. The defective drive even corrected itself. Here is the output of fdisk -l; I guess editing /etc/fstab worked somewhat. The reboot also cleared up all the errors, and I don't know why. I am going to keep monitoring the system for errors and find a solution, probably by replacing the raid controller. We are also going to use Backup Exec to back up the data off the system before we make changes. Thanks for all the help guys.
[root@mail ~]# fdisk -l
Disk /dev/hda: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hda1 * 1 13 104391 83 Linux
/dev/hda2 14 395 3068415 82 Linux swap
/dev/hda3 396 4865 35905275 83 Linux
Disk /dev/hdf: 250.0 GB, 250059350016 bytes
255 heads, 63 sectors/track, 30401 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hdf1 * 1 30401 244196001 fd Linux raid autodetect
Disk /dev/hdg: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/hdg1 1 128 1028159+ 82 Linux swap
/dev/hdg2 129 19457 155260192+ 83 Linux
Disk /dev/md0: 250.0 GB, 250056605696 bytes
2 heads, 4 sectors/track, 61048976 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Disk /dev/md0 doesn't contain a valid partition table
I'm glad you were able to get things up and running... at least for now. I think what you're going to do sounds like a really good plan.
There's definitely more to do. You had said before there were 4 physical disks; I only see 3 listed by fdisk. Again, it looks like there is only *one* disk participating in the raid configuration, that is /dev/hdf. There should be another disk mirroring hdf, I think. At least now you have a little time to figure things out before the system decomposes again.
Overall, things still seem a bit odd to me. The raid partition on hdf is marked bootable, but I think you're actually booting on /dev/hda. For a good raid configuration, you would ideally want both drives participating in the raid to have a bootable partition. The BIOS would be set to the first disk, and then the second disk in case one failed. Lilo or whatever you're using will look for both as well. Only one is necessary to get the system running though.
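Once both drives really are in the mirror, the usual trick is to install the boot loader on each member so either disk can boot the box alone. With GRUB legacy (what a Red Hat system of this vintage would likely have; disk names are GRUB's BIOS-order names, not the poster's confirmed layout) it looks like:

```shell
# Install GRUB to the MBR of both raid member disks.
# (hd0) and (hd1) are GRUB's names for the first and second BIOS disks;
# (hdX,0) is the first partition on each, assumed to hold /boot.
grub --batch <<EOF
root (hd0,0)
setup (hd0)
root (hd1,0)
setup (hd1)
quit
EOF
```

After that, if the primary disk dies, pointing the BIOS at the second disk is enough to bring the system back up.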
Also, the swap space isn't evenly distributed. Ah well. One thing at a time. Get the system stable first and then move from there to get the rest squared away.
Keep us posted if there are any other significant events.