Filesystem gone Read Only

Complicated Disaster · 01-13-2008, 07:12 AM

Dear Experts,

My server seems to have developed a problem where the data filesystem goes nuts. Files disappear or randomly move from one folder to another. When this happens and I try to recreate the files from backup, I get the error:

cp: cannot create regular file `mytest.cnf': Read-only file system

I cannot umount the partition as I get the error:

umount: /web: device is busy

The only way I have found to bring it back is to reboot, but the problem seems to reoccur.

I am running Mandriva Linux 2006. /web is ext3 on /dev/md0. mdadm reports the RAID array as clean.

Any ideas as to what is causing this, or what I can try to fix it?

Thanks in advance.

CD

PhenuxRizing · 01-13-2008, 09:39 AM

Assuming that it is not a bug with that particular release of Mandriva, it sounds like you might be having hard drive problems.

In a shell, logged in as root, type "smartctl -H /dev/hda" (substituting each of your drives for /dev/hda). This gives you an overall health status of the hard drive using its built-in monitoring technology (SMART). Start there and see if that's the issue. If not, it could also be a software problem, or even a misconfiguration.
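
A minimal sketch of scripting that check, in case there are several drives to go through. smart_verdict is a made-up helper name, not part of smartmontools. It assumes the usual wording of the health line: "SMART overall-health self-assessment test result: PASSED" for ATA drives, or "SMART Health Status: OK" for SCSI drives.

```shell
#!/bin/sh
# smart_verdict (hypothetical helper): reads `smartctl -H` output on stdin
# and reduces it to one word: PASSED, FAILED or UNKNOWN.
smart_verdict() {
    awk '
        # ATA drives: the verdict is the last word on this line
        /SMART overall-health self-assessment test result:/ { v = $NF }
        # SCSI drives report "OK" instead of "PASSED"
        /SMART Health Status:/ { v = ($NF == "OK") ? "PASSED" : "FAILED" }
        END {
            if (v == "PASSED") print "PASSED"
            else if (v == "")  print "UNKNOWN"   # no health line found
            else               print "FAILED"
        }
    '
}
```

It would be used as root, for example "smartctl -H /dev/hda | smart_verdict". Bear in mind that PASSED only means the drive has not crossed its own failure thresholds; "smartctl -a" on the same device also shows the attribute table and the drive's error log.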

Complicated Disaster · 01-13-2008, 11:09 AM

Thanks. Useful-looking utility. However, all disks report health status as PASSED.

I don't think it's an actual bug with the O/S as the system has been running fine for over a year now. I'm baffled. Not that this is an unusual state for me when it comes to Linux!

CD

tredegar · 01-13-2008, 11:45 AM

How much disk space do you have free?
df -h
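
If this needs checking regularly, something like the following could flag nearly full filesystems. It is only a sketch: full_filesystems is a made-up name, the 90% default is an arbitrary assumption, and it expects "df -P" style output (one line per filesystem, mount points without spaces).

```shell
#!/bin/sh
# full_filesystems (hypothetical helper): reads `df -P` output on stdin and
# prints "mountpoint use%" for every filesystem at or above the given limit.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {     # NR > 1 skips the header line
        pct = $5
        sub(/%$/, "", pct)                # "91%" -> "91"
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}
```

Usage would be "df -P | full_filesystems 90". A filesystem can also run out of inodes while still showing free space, which "df -i" reports.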

Complicated Disaster · 01-13-2008, 12:00 PM

Filesystem Size Used Avail Use% Mounted on
/dev/hda1 5.8G 4.5G 1.1G 82% /
/dev/hda6 140G 127G 14G 91% /home
/dev/md0 147G 69G 71G 50% /web

CD

tredegar · 01-13-2008, 12:03 PM

Quote:

Filesystem Size Used Avail Use% Mounted on

Looks OK
Have you run fsck? The easiest way is to force a check on the next reboot:
shutdown -rF now
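
One plausible explanation for the "Read-only file system" error is that ext3 is often mounted with errors=remount-ro, so the kernel switches the filesystem to read-only when it detects an inconsistency. A rough way to confirm that before rebooting is to look for "ro" in the mount options. ro_mounts below is a made-up helper name, and it assumes the usual /proc/mounts layout (device, mount point, type, options).

```shell
#!/bin/sh
# ro_mounts (hypothetical helper): prints "mountpoint device" for every
# filesystem currently mounted read-only. Reads /proc/mounts by default;
# another file in the same format can be passed as the first argument.
ro_mounts() {
    awk '{
        n = split($4, opt, ",")           # field 4 holds the mount options
        for (i = 1; i <= n; i++)
            if (opt[i] == "ro") { print $2, $1; break }
    }' "${1:-/proc/mounts}"
}
```

If /web shows up in the output, "dmesg | grep -i ext3" should show what the kernel complained about at the time.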

Complicated Disaster · 01-13-2008, 01:17 PM

That certainly had some effect. I got a few EXT3 errors on shutting down. And on reboot the scan produced loads of errors. I guess I have to check my data for consistency. Is it really that easy? I feel a bit stupid now!

Cheers

CD

sundialsvcs · 01-13-2008, 06:59 PM

I would say that you definitely have a drive that's about to "go."

If you have a smartctl command anywhere, say in /sbin/... read about it ("man smartctl") then run it. This will give you the drive's own error-logs and diagnostics.

Nevertheless, assume that the drive is about to conk out and replace it immediately. (USB/FireWire external drives are very handy here because you can take the drive out of the enclosure, put it into service, and put your existing drive in the external case.)

tredegar · 01-14-2008, 02:56 AM

Quote:

I would say that you definitely have a drive that's about to "go."

He has already run smartctl. His disks passed.
His filesystem was messed up for some reason. Maybe fsck has fixed it, maybe not.

CD, you should take a look in lost+found at the root of the affected filesystem (/web/lost+found in your case), which is where fsck puts files, or fragments of files, that it does not know what to do with.
If lost+found is empty, you are probably OK. Otherwise you should save your data, and probably reinstall from scratch or restore from your backup.
What could have caused this? Maybe a power glitch, brownout, power failure, incorrect shutdown, failing drive, or a loose cable or connector.
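
A quick sketch for seeing how much fsck left behind. Each ext3 filesystem has its own lost+found at its root, so for the /web mount in this thread that would be /web/lost+found (used here as the assumed default). lost_found_count is a made-up helper name.

```shell
#!/bin/sh
# lost_found_count (hypothetical helper): prints the number of entries
# directly inside a lost+found directory. Needs root on a real system,
# since lost+found is normally readable by root only.
lost_found_count() {
    dir=${1:-/web/lost+found}             # assumed default for this thread
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}
```

The recovered entries are named after their inode number (for example "#12345"), so the original names are gone. Running "file" on them gives a guess at what each one used to be.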

Complicated Disaster · 01-15-2008, 03:45 AM

Hi,

I've looked in lost+found and there's loads of stuff there. However, I think I've recovered most of the missing bits from backup. Note that the error only affected the *data* drive, not the O/S drive. They are different physical drives. So I think I'm OK, for now at least. I don't want to take a backup until I'm sure that the whole data set is valid, as I only have enough disk space for one backup at a time!

I'm still baffled as to how it could have happened, as I run on a UPS. I'm thinking that maybe one of the drives in the RAID array *is* on the way out, but I have a cold spare so I think I'll leave it and see what happens.

Thanks for your help.

CD
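
For keeping an eye on the array while waiting to see what happens, a minimal sketch that flags degraded md arrays. degraded_arrays is a made-up helper name, and it assumes the usual /proc/mdstat layout, where each array's status line ends in something like [UU], with an underscore marking a missing member.

```shell
#!/bin/sh
# degraded_arrays (hypothetical helper): prints the name of every md array
# whose member map contains "_" (a missing or failed member).
# Reads /proc/mdstat by default; a file in the same format can be passed.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/        { name = $1 }                   # e.g. "md0 : active raid1 ..."
        /\[[U_]*_[U_]*\]/    { if (name != "") print name }  # e.g. "[U_]"
    ' "${1:-/proc/mdstat}"
}
```

No output means every array has all its members. "mdadm --detail /dev/md0" gives the fuller picture for a single array. Note that a clean array only says the members agree with each other, not that the filesystem on top is consistent.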