Read Only FS, randomly, 2 reboots cured, latest one doesn't!

fRAiLtY- · 05-04-2012, 10:41 AM

Hi guys,

I have a Hetzner server located in their DC in Germany, I'm in the UK. We've had the same server for nearly 3 years now and it's never missed a beat, not once. It has a 3ware controller and in total 1500GB of space on it.

I ran fdisk -l and I see this:

/dev/sda1
/dev/sda2
/dev/sda3

On May 2nd at 10pm the root filesystem (/dev/root) went into read-only mode and no sites on the server were accessible. I contacted our DC, who said they would reboot the server, which they did. The server came up and so did the sites. On May 3rd at 13:00 it went down again under exactly the same circumstances; again Hetzner rebooted the server and all was well. I began scanning the logs and found some errors involving ext3 and "journal", which I'm not familiar with.

Today, at exactly the same time as yesterday, the same thing happened. However, a reboot has not fixed it this time and the sites remain down. On their recommendation I have asked Hetzner to do a "deep scan" of the server, as many people I have spoken to and several threads across the internet point to potential drive failure. This should take around 8hrs apparently. In the meantime I have around 40 websites down (all e-commerce) and many unhappy clients. I have backups of course, but am trying to just get everything up and running ASAP.

Can anyone give me some advice on what to do should the hardware tests come back OK, which I'm dreading? What commands should be run, and how can I check and repair the filesystem?

Many thanks in advance.

Tom.

Kustom42 · 05-04-2012, 03:28 PM

Quote:

Originally Posted by fRAiLtY-

Hi guys,
We've had the same server for nearly 3 years now

Hardware has a lifespan of 3-5 years. To be pro-active you should refresh the hardware every 3 years. Depending on the manufacturer of the server, there are different diagnostics that can be run. However, since this is a rented server, it is the responsibility of the hardware owner (in this case Hetzner, not yourself) to diagnose and replace it if necessary. I would contact your server provider and have them do the necessary hardware checks. My strongest suggestion would be to get a new server and cut over to the new one with fresh hardware. Keep yourself on a 3 year hardware life cycle; all of the major companies have a life cycle for hardware.

fRAiLtY- · 05-04-2012, 03:34 PM

Hi Kustom42,

Hetzner are running what's apparently called a "deep scan" on the system now, due to finish within the next few hours. Should this scan yield nothing, which I suspect (just my luck), what's my play? They're claiming it's likely software, yet everyone I've spoken to seems to suggest it's hardware.

This would kinda be backed up by the nature of the occurrences, out of the blue. What puzzles me is what causes it to go into read-only; presumably it reboots the server to do this? One minute we're on the websites, the next we get 500 internal errors and the filesystem is read-only. Has the server rebooted into read-only mode in that time, or do the drives unmount, or... just curious what happens?

At the minute I'm assuming Hetzner will say their hardware is fine and basically tell me to go away. For speed I need to get the clients' sites up and running ASAP, so what's the best option? I've heard fsck mentioned?

Cheers.

Kustom42 · 05-04-2012, 03:40 PM

Don't run an fsck. It is very common for failing drives to fall back into read-only mode. This is due to I/O errors that the kernel receives from the drive. I personally have worked for one of the biggest server providers and I can tell you that is the only answer you are going to get. You will have to take the initiative and purchase a new server. VERIFY THEY DO NOT REUSE HARDWARE. The company I worked for (which shall remain nameless for legal ramification purposes) did, and I would see at least 20 servers crash a week that were less than a week from purchase date.
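To confirm which filesystems the kernel has actually flipped to read-only, without rebooting or touching the disk, one option is to read /proc/mounts. The following is only a minimal Python sketch; the function name read_only_mounts is made up for illustration, and it assumes the usual whitespace-separated /proc/mounts layout (device, mount point, type, options).

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options include 'ro'.

    mounts_text is the content of /proc/mounts (or any text in that layout).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # Options are comma-separated; match the exact token 'ro' so that
        # something like 'errors=remount-ro' is not counted by mistake.
        if "ro" in options.split(","):
            found.append((device, mount_point))
    return found
```

On the affected server you would feed it open("/proc/mounts").read(). Some pseudo-filesystems are normally mounted read-only, so the entry to look for is the one for / itself.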

Do some digging and read some reviews. I'm not sure about providers in your neck of the woods, so it would be hard for me to give you any recommendations.

I would imagine in about an hour you will get your incident resolved with Hetzner, basically telling you that they can't find a problem and to figure it out yourself because it's your problem. This is a nice way of saying go f yourself, we don't care. If you do get that, find a new provider ASAP, leave the drive mounted read-only, and work as quickly as you can to cut over to a new server with a new company.

syg00 · 05-05-2012, 01:20 AM

I don't understand the advice not to run fsck. If the filesystem is broken, whether because of hardware or software failure, you need to run fsck. It may actually run automatically upon reboot after the f/s goes read-only.

Look at /etc/fstab - and see what it has as the "errors" option; probably "remount-ro".
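As a rough illustration of that check, here is a small Python sketch that pulls the errors= value for a given mount point out of fstab-style text. The function name fstab_error_policy is hypothetical, and it assumes the standard six-column fstab layout. If no errors= option is listed, the behaviour comes from the default stored in the filesystem itself, which this sketch does not inspect.

```python
def fstab_error_policy(fstab_text, mount_point="/"):
    """Return the value of the errors= mount option for mount_point, or None.

    None means either the mount point has no fstab entry or the entry
    does not set errors= explicitly.
    """
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split()
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        # Fourth column holds the comma-separated mount options.
        for option in fields[3].split(","):
            if option.startswith("errors="):
                return option.split("=", 1)[1]
        return None  # entry found, but no explicit errors= option
    return None
```

Call it as fstab_error_policy(open("/etc/fstab").read()). A result of "remount-ro" would match what Tom is seeing: on a filesystem error the kernel remounts the filesystem read-only rather than rebooting.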

I've never had a (stable) filesystem "go bad" - there's always been dodgy hardware involved. Doesn't necessarily mean the hard disk BTW. And by "stable" I mean one that is "enterprise ready". New filesystems (like btrfs a couple of years ago) don't qualify.

Kustom42 · 05-07-2012, 10:55 AM

Running an fsck on a faulty hard drive has a big potential to cause data loss. Since the best solution here is to move to new hardware, it would be best to preserve the integrity of the current data for copy-over. If he runs an fsck and it removes a /var/www/html/website/ folder, an entire website's worth of data could be lost.

Your signature is the best first step to take here, make a backup! If drives are beginning to give you errors get a good copy of your data before you do anything else.

chrism01 · 05-07-2012, 08:28 PM

I agree with backup + new HW.
If you have sufficient access, you could run the smartctl software (e.g. see http://www.linuxjournal.com/magazine...rd-disks-smart) to do your own checks.
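A hedged sketch of what that could look like from Python, assuming smartmontools is installed and you have root. The helper names are invented for illustration. Because the disks sit behind a 3ware controller, smartctl normally has to be told the controller port with -d 3ware,N and pointed at the controller device rather than /dev/sda. The device node used in the usage note (/dev/twa0) and the port number are assumptions that depend on the controller model.

```python
import subprocess


def smartctl_command(device, threeware_port=None):
    """Build a smartctl overall-health command as an argument list."""
    cmd = ["smartctl", "-H"]
    if threeware_port is not None:
        # Disks behind a 3ware controller are addressed by port number.
        cmd += ["-d", f"3ware,{threeware_port}"]
    cmd.append(device)
    return cmd


def run_health_check(device, threeware_port=None):
    """Run smartctl and return its text output (needs root and smartmontools)."""
    result = subprocess.run(
        smartctl_command(device, threeware_port),
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes to flag disk problems
    )
    return result.stdout
```

For example, print(run_health_check("/dev/twa0", 0)) would query the first port. Repeat for each port that has a disk attached, and swap -H for -a in the command if you want the full attribute table.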