Severe filestructure corruption! Please help!

sausagejohnson · 06-26-2004, 06:53 AM

I did something stupid. I did a CTRL-Alt-Del while the system was starting up which forced the system to shut down. I think this screwed up my system. The filesystem is partially shot. Check out the following partial sample of my / directory with ls -la:

?rwsr-x--T 2585 2925707467 2299930527 3240038924 Apr 13 1994 root
drwxr-xr-x 2 root root 8192 Jan 2 01:08 sbin
?r--rwxr-T 32767 109827237 388427479 3587299184 Jun 5 1951 usr

You can see my /root and /usr folders are no longer folders but have turned into very large corrupted files. I've shown the /sbin directory as a comparison.

What on earth is going on? How do I fix this? I tried a linux rescue and then fsck against the partition, which gave me the message "Journal restored". But nothing changed.

Can anyone please help me out?

avarus · 06-26-2004, 07:06 AM

Hi,

Well first off Ctrl+Alt+Del should not cause this kind of corruption. Looks like you were unlucky, but maybe it is a sign your drive is unwell.

Anyway, I think maybe you needed to pass the '-f' option to fsck, so try the recovery disk again.

TIM

sausagejohnson · 06-26-2004, 07:15 AM

Thanks, Tim. I've run fsck -f /dev/hda2 (also with the yes option - can't remember the flag) and there are literally thousands of pages of corrupted inodes being processed. I'll post the result as soon as it stops.

The drive is a 4 week old 160GIG WD drive.

Update: Still going after more than an hour. I'll stop it and run it all day tomorrow. Crazy amount of errors.

avarus · 06-26-2004, 08:25 AM

Drive is fubar my friend - return to shop!

sausagejohnson · 06-27-2004, 07:37 PM

I'm not prepared to say that the drive is kaput just yet, although fsck has now been running for several hours with hundreds of thousands of inodes being constantly repaired. Probably around 600,000 inode repairs so far.

The fact is that I have been working in the windows partition at the end of the drive for several days now on a DV editing project. As you might know with DV editing, software crashes occasionally, and so once or twice the software locked up and a hard reboot was needed.

Of course, I would go to the GRUB screen and select Windows 98. Then, the DOS SCANDISK screen came up and I let it do its thing. Now I am beginning to wonder if this program somehow went cross-partition and caused the damage. This seems unthinkable except that it is unusual for windows to live at the end of a drive as it does in my setup.

Anyhow, there is no data corruption in the 30 GIG windows partition, but seemingly massive corruption in the 120 GIG linux partition. I'll continue to run fsck tonight (it picks up from where it left off) and see if the damage repair completes itself.

How many inodes are there in 120GIG of space? Does anyone know how long this will take?

bwyer · 06-27-2004, 09:21 PM

You mentioned that this is a 4-week-old 160GB drive. You're above the threshold for LBA48 there, so there is a possibility that you've just been bit by coincidence.

What version of the Linux kernel are you running, and how is your drive connected to your machine?

sausagejohnson · 06-27-2004, 09:28 PM

I'm running Redhat 9. It's always been a faithful distribution. I'm connecting via IDE using one of those newfangled IDE cables. The board is a Gigabyte 7n400Pro. Not using a SATA drive.

bwyer · 06-27-2004, 10:00 PM

Cool. I looked up your board and it's based on the nForce chipset. That did ring a bell on something that I came across today trying to debug similar problems on my system. This was mentioned on a posting I saw:

Quote:

The ATA100 support for nforce2 boards is mature in kernel versions 2.4.24 and 2.6.3. Just be sure to enable the kernel's nforce2 IDE driver.

The article I'm referring to is here.

I'm afraid I can't help you much beyond that, though.

sausagejohnson · 06-27-2004, 10:17 PM

Thanks for the document reference. I will go through that. I guess tonight, I'll finish up the fsck and see what's left of the system. My /work directory was untouched and I backed it up so no worries about loss.

Thanks.

bwyer · 06-28-2004, 06:22 PM

I was thinking, too, that this might be Windows 98 not handling the large drive correctly. Did you notice the corrupt files before or after windows did its scandisk? I just can't help but go back to the fact that you have an LBA48 drive which may or may not be handled correctly by both of these operating systems.

sausagejohnson · 06-28-2004, 06:36 PM

I can't remember if it was before or after. Sorry, I know that doesn't help. The thing that makes me doubt that the OSes couldn't handle the drive is the fact that everything was perfect for about a month. I have installed games to windows, no problems, and have been working mostly in linux for over a month on various bits and pieces.

For the last week, I have been capturing large DV files and doing huge edits in windows. I would imagine that if windows wasn't able to handle it, my DV files would have become corrupted. Instead, I created perfect edits and exported them back to DV tape without any glitches.

Anyhow, I stopped fsck after many hours last night and checked the file structure with the boot CD. linux rescue can no longer see any linux files at all (at the beginning of this thread I had a file structure and only /usr and /root were damaged), so basically all those fsck inode repairs were simply destroying what little filestructure was left.

Tonight I will reinstall linux, and then go into windows and deliberately reboot without shutting down, and perform a DOS scandisk to see once and for all if that caused the problem.

bwyer · 06-28-2004, 07:09 PM

Good point on the Windows bit. A couple of things I thought of last night that I'd like to see:

1) How big does your BIOS say your drive is (in sectors)?
2) What does hdparm -I say about your drive?
3) What does fdisk -lu say about your drive?

Once again, going back to the LBA48 support, your BIOS should say something around 312,000,000 sectors if you have a 160GB drive. If it's showing less, your BIOS doesn't support LBA48 and you'll want to upgrade. This may or may not be the source of the problem, but is certainly something to look at.
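
As a rough cross-check on that figure, the expected sector count can be worked out from the advertised capacity. This is only a sketch: it assumes 512-byte sectors and decimal gigabytes, and the function names and the 2% tolerance are made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: traditional 512-byte sectors


def expected_sectors(capacity_gb: float) -> int:
    """Approximate sector count for a drive advertised in decimal GB."""
    return int(capacity_gb * 1_000_000_000) // SECTOR_BYTES


def looks_truncated(reported: int, capacity_gb: float, tolerance: float = 0.02) -> bool:
    """True if a reported sector count falls well short of the advertised size."""
    return reported < expected_sectors(capacity_gb) * (1 - tolerance)


print(expected_sectors(160))  # 312500000
```

A BIOS that shows a sector count far below this for a 160GB drive would be worth a closer look.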

The question about hdparm is based on what I've seen with my Maxtor 200GB drive. Here's what it says:

Code:

[root@webserver /]# hdparm -I /dev/hdb

/dev/hdb:

ATA device, with non-removable media
        Model Number:       Maxtor 6Y200P0
        Serial Number:      Y617WD4E
        Firmware Revision:  YAR41BW0
Standards:
        Supported: 7 6 5 4
        Likely used: 7
Configuration:
        Logical         max     current
        cylinders       16383   65535
        heads           16      1
        sectors/track   63      63
        --
        CHS current addressable sectors:    4128705
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  398297088
        device size with M = 1024*1024:      194481 MBytes
        device size with M = 1000*1000:      203928 MBytes (203 GB)
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 1
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 16  Current = 16
        Advanced power management level: unknown setting (0x0000)
        Recommended acoustic management value: 192, current value: 254
        DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    NOP cmd
           *    READ BUFFER cmd
           *    WRITE BUFFER cmd
           *    Host Protected Area feature set
           *    Look-ahead
           *    Write cache
           *    Power Management feature set
                Security Mode feature set
           *    SMART feature set
           *    FLUSH CACHE EXT command
           *    Mandatory FLUSH CACHE command
           *    Device Configuration Overlay feature set
           *    48-bit Address feature set
           *    Automatic Acoustic Management feature set
                SET MAX security extension
                Advanced Power Management feature set
           *    DOWNLOAD MICROCODE cmd
           *    SMART self-test
           *    SMART error logging
Security:
        Master password revision code = 65534
                supported
        not     enabled
        not     locked
        not     frozen
        not     expired: security count
        not     supported: enhanced erase
HW reset results:
        CBLID- above Vih
        Device num = 1 determined by CSEL
Checksum: correct

Take a look at the Configuration section. Note that under straight LBA, I could only address about 137GB (in hard drive marketing gigs of 1,000,000,000 bytes); however, in LBA48 mode the entire drive is addressable.
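
The 137GB figure can be reproduced from the two sector counts in the hdparm output above. A minimal sketch, assuming 512-byte sectors; the helper name is just for illustration.

```python
SECTOR_BYTES = 512  # assumption: 512-byte sectors

# Sector counts copied from the hdparm -I output for the Maxtor drive.
lba28 = 268_435_455  # "LBA    user addressable sectors"
lba48 = 398_297_088  # "LBA48  user addressable sectors"


def addressable_gb(sectors: int) -> float:
    """Capacity in decimal (marketing) gigabytes."""
    return sectors * SECTOR_BYTES / 1_000_000_000


print(lba28 == 2**28 - 1)              # True: the 28-bit LBA ceiling
print(f"{addressable_gb(lba28):.1f}")  # 137.4
print(f"{addressable_gb(lba48):.1f}")  # 203.9
```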

Finally, fdisk -lu gives you a dump of the partition table. Here's what my 200GB drive looks like:

Code:

[root@webserver /]# fdisk -lu /dev/hdb

Disk /dev/hdb: 255 heads, 63 sectors, 24792 cylinders
Units = sectors of 1 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/hdb1   *        63 398283479 199141708+  83  Linux

Notice that my Linux partition starts on sector 63 and extends to sector 398,283,479, which, if you look at the LBA48 user addressable sectors, you'll see is slightly less than the maximum of 398,297,088.
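
The same check can be done with a few lines of arithmetic. This sketch uses only the numbers from the two outputs above; the function names are illustrative. It relies on fdisk's Blocks column counting 1 KiB blocks, with a trailing "+" when the partition holds an odd number of 512-byte sectors.

```python
LBA48_SECTORS = 398_297_088  # "LBA48 user addressable sectors" from hdparm


def blocks_column(start: int, end: int) -> str:
    """Recreate fdisk's Blocks value for a partition given in sectors."""
    sectors = end - start + 1
    blocks, odd = divmod(sectors, 2)  # two 512-byte sectors per 1 KiB block
    return f"{blocks}{'+' if odd else ''}"


def fits_on_drive(end: int, total_sectors: int) -> bool:
    """The last sector index must be below the addressable sector count."""
    return end < total_sectors


print(blocks_column(63, 398_283_479))              # 199141708+
print(fits_on_drive(398_283_479, LBA48_SECTORS))   # True
```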

I think that, if you check these things, it might reveal a clue to what's going on. Aside from that, and checking the drivers, I can't think of anything else that might contribute to the issue.

Let me know how your rebuild goes and please post the results of those commands.

sausagejohnson · 06-28-2004, 07:26 PM

Thanks, bwyer. I'll run these tests and post the results after the rebuild. I don't remember ever seeing an LBA48 entry in my BIOS, but I would imagine being a new nforce2 board it would have that. Still, we'll see. Thanks for your advice. I'll get the results up tomorrow.

sausagejohnson · 06-29-2004, 06:28 PM

Hi bwyer,
I was very busy last night and I did all of the following:

1) Started the RH9 install and preserved my /boot on /dev/hda1
2) Selected format for /dev/hda2 and then did a bad blocks check
3) Bad blocks were located and RH9 recommended that I NOT use this drive. Hmmm... not good.
4) I downloaded the Western Digital diagnostic tool and did the fast test and it came back saying the drive was clean. I did not run the extended test because it recommended that I back up the drive and I didn't want it to destroy the windows side.
5) Decided to push ahead with the install, and it went through without any problems. Linux runs beautifully again, and so does my existing windows partition at the end of the drive.
6) Checked BIOS, and it says the following about my drive:

WDC WD1600JB-00FUA0
BIOS [LBA]
Capacity 160GB
Cylinders 16643
Head 255
Precomp 0
LandingZone 65534
Sector 63

7) I ran hdparm -I /dev/hda:

Code:

/dev/hda:

ATA device, with non-removable media
	Model Number:       WDC WD1600JB-00FUA0                     
	Serial Number:      WD-WMAER1147969
	Firmware Revision:  15.05R15
Standards:
	Supported: 6 5 4 3 
	Likely used: 6
Configuration:
	Logical		max	current
	cylinders	16383	65535
	heads		16	1
	sectors/track	63	63
	--
	CHS current addressable sectors:    4128705
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  312579695
	device size with M = 1024*1024:      152626 MBytes
	device size with M = 1000*1000:      160040 MBytes (160 GB)
Capabilities:
	LBA, IORDY(can be disabled)
	bytes avail on r/w long: 74	Queue depth: 1
	Standby timer values: spec'd by Standard, with device specific minimum
	R/W multiple sector transfer: Max = 16	Current = 16
	Recommended acoustic management value: 128, current value: 254
	DMA: mdma0 mdma1 mdma2 udma0 udma1 *udma2 udma3 udma4 udma5 
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4 
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	READ BUFFER cmd
	   *	WRITE BUFFER cmd
	   *	Host Protected Area feature set
	   *	Look-ahead
	   *	Write cache
	   *	Power Management feature set
		Security Mode feature set
		SMART feature set
	   *	FLUSH CACHE EXT command
	   *	Mandatory FLUSH CACHE command 
	   *	Device Configuration Overlay feature set 
	   *	48-bit Address feature set 
		Automatic Acoustic Management feature set 
		SET MAX security extension
	   *	DOWNLOAD MICROCODE cmd
	   *	SMART self-test 
	   *	SMART error logging 
Security: 
		supported
	not	enabled
	not	locked
	not	frozen
	not	expired: security count
	not	supported: enhanced erase
HW reset results:
	CBLID- above Vih
	Device num = 0 determined by CSEL
Checksum: correct

8) I ran fdisk -lu /dev/hda and got:

Code:

Disk /dev/hda: 160.0 GB, 160040803840 bytes
255 heads, 63 sectors/track, 19457 cylinders, total 312579695 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/hda1            63    160649     80293+  83  Linux
/dev/hda2        160650 249039629 124439490   83  Linux
/dev/hda3   * 249039630 310472189  30716280    c  Win95 FAT32 (LBA)
/dev/hda4     310472190 312576704   1052257+  82  Linux swap

9) Then I went to windows (dos mode) and did a scandisk c: /CHECKONLY. It came up with no errors, and it didn't affect linux. So I can rule that out.
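
For anyone wanting to double-check a table like the one in step 8, here is a small sketch that looks for gaps or overlaps and lists which partitions reach past the 28-bit LBA limit. The numbers are copied from the fdisk and hdparm output above; the class and function names are made up for illustration.

```python
from typing import NamedTuple


class Partition(NamedTuple):
    name: str
    start: int  # first sector
    end: int    # last sector (inclusive)


TOTAL_SECTORS = 312_579_695  # "total ... sectors" from fdisk
LBA28_SECTORS = 268_435_455  # "LBA user addressable sectors" from hdparm

TABLE = [
    Partition("hda1", 63, 160_649),
    Partition("hda2", 160_650, 249_039_629),
    Partition("hda3", 249_039_630, 310_472_189),
    Partition("hda4", 310_472_190, 312_576_704),
]


def layout_problems(parts, total):
    """Report overlaps, gaps, or a partition running off the end of the disk."""
    problems = []
    for prev, cur in zip(parts, parts[1:]):
        if cur.start <= prev.end:
            problems.append(f"{prev.name} overlaps {cur.name}")
        elif cur.start > prev.end + 1:
            problems.append(f"gap between {prev.name} and {cur.name}")
    if parts and parts[-1].end >= total:
        problems.append(f"{parts[-1].name} runs past the end of the disk")
    return problems


def reaches_past_lba28(parts, lba28_sectors):
    """Names of partitions whose last sector is beyond the 28-bit LBA range."""
    last_lba28 = lba28_sectors - 1
    return [p.name for p in parts if p.end > last_lba28]


print(layout_problems(TABLE, TOTAL_SECTORS))      # []
print(reaches_past_lba28(TABLE, LBA28_SECTORS))   # ['hda3', 'hda4']
```

On these numbers the two Linux partitions sit entirely below the 28-bit limit, the windows partition (hda3) straddles it, and swap lies wholly above it. That is arithmetic only; it doesn't show whether either OS actually mishandled those addresses.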

So all looks pretty good and close to what you said. I should run the extended test anyway I guess, but I think this was just a freak thing. I guess I will continue to back up on a regular basis. A shame I can't bring any closure to this for others, but Redhat's indication that there may be bad blocks on the drive is still a concern. Perhaps I'll unmount my /dev/hda2 and run a read-only deep check using fsck.

bwyer, thank you for all your assistance with this. I appreciate the time you took.

bwyer · 06-29-2004, 07:43 PM

Hello!

Looks like you've got all of your bases covered. The main concerns I had don't look like they were an issue, considering the fact that your BIOS did recognize your drive correctly, as does Linux.

So, in summary, you've determined the following (lemme see if I have the facts straight):

* Your BIOS recognizes your drive correctly, so it has to support LBA48.
* Windows 98 works fine with your system (it probably wouldn't have if the above weren't true), and SCANDISK didn't corrupt anything.
* Western Digital's diagnostics did not identify any issues with your drive.

Given these facts, it appears that there can't be anything wrong with your drive, at least mechanically and from a DOS perspective. Now, taking the Linux side of things:

* The RH9 installer received errors on a Bad Block Scan (which obviously weren't really bad blocks because the drive passed diags).
* Linux reports the correct geometry that happens to match the BIOS (good).
* We know that there is a custom IDE driver for Linux for your particular chipset.

I'd say that you need to double-check the driver. The fact that a custom driver was written for this chipset implies that there's some deficiency in the base IDE driver that makes it incompatible with this chipset. It is possible that the deficiency is causing corruption.

I did just come across the changelist for the 2.4.21 kernel in searching Google that mentions that the first direct support for the Nforce2 IDE controller was 2.4.21-pre4. I also found that the Nforce2 IDE is on the HCL. I'm guessing that the custom drivers were required prior to 2.4.21, or there may have been some bugfixes later.

I also found some fixes for some issues with the Nforce2 chipset in 2.4.26.

The bottom line: Probably your best bet would be to make sure you're on the latest kernel you can get for RH9. I think RedHat Network still works enough to get you to the latest release. Either that, or build yourself a custom kernel.

In any case, good luck and sorry I couldn't find you a quick fix.

Brett