LinuxQuestions.org


Prostetnic_Jeltz 08-18-2006 10:40 AM

Filesystem Corruption Hell
 
Hi all -

slack install had been working fine, with upgrades, for over a year. occasional fs corruption problems (always on the / partition, ext3), but nothing fsck didn't fix.

yesterday in response to a simple command (no unusual activity or installs recently), the terminal spit out an error with something like "libblkid.so.1 not found" - I checked /lib and sure enough several libraries had borked permissions (like Br-?r-S?-x) and really large filesize numbers or commas in the filesize column.

ok, tried shutdown -F, but fsck wouldn't run, saying there were missing libraries. ok, boot from install disk and try fsck on the / partition, but fsck came back saying the filesystem was clean! tried everything I could think of with fsck, no luck.

I then tried directly copying over the corrupted .so's from the install disk, but I can't remove the corrupted ones ("operation not permitted"). I tried recreating a new /lib, but the system didn't like that: init couldn't run, and eventually I got kernel panics over the version of libc and whatever it is trying to read (the installed system has been upgraded beyond the install disks, which are 10.1)

I'm lost now -- any suggestions as to how to fix this, get fsck to work, or how to recreate a /lib on a non-bootable system!?

thanks for any help at all :scratch:
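
A side note on the symptoms described above: `ls -l` prints a "major, minor" pair in the size column for device nodes, so commas there suggest the type bits of those inodes were scrambled. Below is a minimal sketch (not from the thread) that lists entries in a directory which are not regular files, directories, or symlinks, or which carry setuid/setgid bits. The function names and the choice of what counts as suspicious are assumptions made for illustration only.

```python
import os
import stat


def suspicious_reason(mode):
    """Return a short reason if a st_mode value looks wrong for a library file, else None."""
    if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode) or stat.S_ISLNK(mode)):
        return "unexpected file type"
    if stat.S_ISREG(mode) and mode & (stat.S_ISUID | stat.S_ISGID):
        return "setuid/setgid bit set"
    return None


def scan(directory):
    """Yield (path, mode string, reason) for suspicious entries directly inside directory."""
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            # lstat so that symlinks are inspected themselves, not followed
            mode = os.lstat(path).st_mode
        except OSError as exc:
            yield path, "??????????", "lstat failed: %s" % exc
            continue
        reason = suspicious_reason(mode)
        if reason:
            yield path, stat.filemode(mode), reason
```

Run from a rescue or live CD with the damaged root mounted somewhere (the mount point `/mnt` is hypothetical), it could be used as `for row in scan("/mnt/lib"): print(*row)`. It only reads metadata and changes nothing on disk.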

Franklin 08-18-2006 11:15 AM

Quote:

Originally Posted by Prostetnic_Jeltz
Hi all -

slack install had been working fine, with upgrades, for over a year. occasional fs corruption problems (always on the / partition, ext3), but nothing fsck didn't fix.

That does not seem like a good situation - a warning of things to come maybe?

Quote:

Originally Posted by Prostetnic_Jeltz
yesterday in response to a simple command (no unusual activity or installs recently), the terminal spit out an error with something like "libblkid.so.1 not found" - I checked /lib and sure enough several libraries had borked permissions (like Br-?r-S?-x) and really large filesize numbers or commas in the filesize column.

ok, tried shutdown -F, but fsck wouldn't run, saying there were missing libraries. ok, boot from install disk and try fsck on the / partition, but fsck came back saying the filesystem was clean! tried everything I could think of with fsck, no luck.

[snipped the rest]

I'm lost now -- any suggestions as to how to fix this, get fsck to work, or how to recreate a /lib on a non-bootable system!?

thanks for any help at all :scratch:

Well, I'm no expert, but I would suspect some long-developing hardware problem. If you used a sane partition scheme (i.e. /home, /usr/local, other partitions with important data, on separate partitions), I would reinstall from scratch, without formatting the partitions you need to save. Then, when you have a bootable system, save your data and then try to ID the problem - HD, Mem, motherboard etc.

If everything is installed to one, large / partition, I don't have a good suggestion other than booting a live CD (slackware disk 2 should work) and mounting the old root - again with the aim to save your data rather than save the install.

Perhaps others can suggest something else.

Prostetnic_Jeltz 08-18-2006 12:56 PM

thanks for your reply, Franklin. I agree with your suggestion on hardware - but on the other hand, I have several partitions on 2 drives, and it only ever occurred on one partition (the only one which is ext3, btw), so I'm holding out hope it could be non-hardware related. in any event, there's some underlying problem there for sure.

on the data, I have important stuff backed up, and I have a second drive, so I could transfer anything important using a live cd (I think - haven't tried it, but it should work). but it would take a lot of time to get everything all set up again -

and of course, doing a reinstall instead of solving it (if there's a solution) irks me as a geek, on principle :D

Franklin 08-18-2006 01:38 PM

Quote:

Originally Posted by Prostetnic_Jeltz
thanks for your reply, Franklin. I agree with your suggestion on hardware - but on the other hand, I have several partitions on 2 drives, and it only ever occurred on one partition (the only one which is ext3, btw), so I'm holding out hope it could be non-hardware related. in any event, there's some underlying problem there for sure.

I didn't mention this before because I can't verify what actually was the cause, but I had a similar issue with a drive that was on my server. The server ran slackware 10.2 with 2.4.32. There was one drive partitioned as swap, /, /home, and /data. /home and / were formatted with reiser. /data was formatted ext3. One day, I could not access one of my directories on the /data partition (ext3) due to corruption and an interesting error that I unfortunately can't remember now. Long story short, I was able to recover my data but I had to rebuild the journal to do it and it was messy - lost all file names but saved everything.

Anyway, I tested the drive over and over with several utilities and I couldn't find anything wrong with it. I switched to all reiserfs and things have been fine since. Never had an issue with ext3 before and I don't know what caused this.

I still have the drive running, but I don't really trust it 100%. I did find it interesting that after pulling everything out of the server I could not start it again - dead power supply.

gnashley 08-19-2006 04:05 AM

reiserfs is probably a better choice.

davidsrsb 08-19-2006 08:27 AM

I have only ever had file system corruption on machines with bad motherboard/ram.
I have had both ext3 and reiserfs3 fail irreversibly when this has happened.
Bad hardware is fatal to a Linux box.

salmaklak 08-19-2006 10:32 AM

1. smartmontools(dot)sourceforge(dot)net.
2. 18 inches maximum for IDE leads.
Good luck. ;)
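
For point 1, a minimal sketch of calling smartmontools from a script, assuming `smartctl` is installed and on the PATH. The `-H` (overall health) and `-a` (full report) options are standard smartctl flags; the helper names and the device path in the usage note are assumptions for illustration.

```python
import shutil
import subprocess


def smartctl_command(device, full_report=False):
    """Build the smartctl argument list: -H asks for overall health, -a for the full report."""
    return ["smartctl", "-a" if full_report else "-H", device]


def run_smartctl(device, full_report=False):
    """Run smartctl and return its text output, or None if it is not installed."""
    if shutil.which("smartctl") is None:
        return None
    result = subprocess.run(
        smartctl_command(device, full_report),
        capture_output=True,
        text=True,
        check=False,  # smartctl reports status as bit flags in its exit code
    )
    return result.stdout
```

Something like `print(run_smartctl("/dev/hda", full_report=True))` would then dump the report; `/dev/hda` is only a guess at the device name, and smartctl normally needs root to talk to the drive.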

ledow 08-19-2006 12:22 PM

I have to agree with davidsrsb here - there isn't a filesystem in existence that can adequately compensate for faulty bits being written to disk. It's just not supposed to happen, in the same way that if you get a bit-error in a vital part of RAM, your computer will not handle it (unless you use ECC RAM and even then it's not necessarily guaranteed - it does a better job at noticing corruption but it can't always correct it).

In terms of filesystems, reiser and ext3 are just as susceptible to faulty bits as any other - the only advantage they have is their journalling, which means that a write interrupted by a crash can be replayed or discarded cleanly on the next mount. That does not mean that they will recover from bits that "change" on the disk afterwards, e.g. bad sectors, faulty RAM etc.

Personally, I've never seen either reiser or ext3 "recover" better than the other when it comes to random filesystem corruption. It all depends on the luck of the draw as to where the changed bits are, what part of the filesystem that hits, how easy it is to detect that a bit has gone wrong (internal checksums, copies of the FAT etc.), how easy it is to "guess" what was meant by the corrupted part (e.g. recreating from checksums, using a second copy of the index, etc.). In practical terms, reiser and ext3 and most other common filesystems have next-to-no checks that anything that's not in a journal is "intact". If they do, they very, very rarely have any information which would aid recovery of that filesystem by an automated system (though a human could probably have a good stab).
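
To make the detect-versus-recover distinction concrete, here is a toy sketch (not how ext3 or reiserfs work internally, just an illustration with hypothetical helper names). A CRC over a block shows that a bit has changed, but rebuilding the block is only possible if an intact second copy exists.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted, simulating silent corruption."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def is_intact(block, stored_crc):
    """A checksum can say that a block changed, but not what it used to be."""
    return zlib.crc32(block) == stored_crc


def recover(copies, stored_crc):
    """Return the first copy whose checksum still matches, or None if all are damaged."""
    for copy in copies:
        if is_intact(copy, stored_crc):
            return copy
    return None


original = b"inode table block"
crc = zlib.crc32(original)
damaged = flip_bit(original, 5)

print(is_intact(damaged, crc))                        # False: corruption detected
print(recover([damaged], crc))                        # None: nothing to rebuild from
print(recover([damaged, original], crc) == original)  # True: the spare copy is usable
```

With only the checksum, the damage is noticed but the data is gone; with a redundant copy, the checksum tells you which copy to trust.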

Prostetnic_Jeltz 08-19-2006 03:23 PM

many thanks for the replies, everybody.

once I get an installation up and running again, I'll start investigating for hardware problems.... a friend mentioned that he would suspect the power supply, and I'll try smartmontools. also I found this app which seems to be perfect for testing memory: http://www.memtest86.com/ . I didn't know there was a cable length limit for IDE - I think they're ok, but I'll check that too, as soon as I find a tape measure :D


All times are GMT -5.