NFORCE2 IDE controller with Athlon XP 2400 problem - data corruption
This problem is really strange, but it has been happening consistently for several weeks now.
It occurs on an Athlon XP 2400+ with 2x256 MB DDR 166 modules and an A7N8X Deluxe board
with 2 HDs (40 GB and 120 GB) and one DVD-RW (LG-GSA 4040b). The A7N8X Deluxe uses
the nForce2 PCI bridge and IDE controller.
I had kernel 2.4.22 with the XFS patch. I upgraded to 2.6.5 yesterday to see whether that
would fix the problem, but it behaves the same way.
The problem is: when I read a large file, say, larger than 200 MB, some bytes are read
incorrectly. It seems to happen when I write as well. It only happens when I
read the file quickly, as in a cp or md5sum operation. If I download it from the
internet, it looks like it is ok.
If I run md5sum 4 times on the same file, the md5sum is different every time.
Like this:
$ md5sum 1GB_file.rar
bd17afc743b1d69d7458553cc5971145 1GB_file.rar
$ md5sum 1GB_file.rar
2b17bc4e5d7609b5fddbf67b5c84b869 1GB_file.rar
$ md5sum 1GB_file.rar
3ed1c36bed43f355f53df2ac763b7ea2 1GB_file.rar
$ md5sum 1GB_file.rar
0f6e73894382bca04326f4e4abbca4d3 1GB_file.rar
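The repeated-hash check above can be automated. This is only a minimal sketch: the function name count_distinct_md5 is made up for illustration, and it assumes GNU md5sum and bash. Note that repeated reads only exercise the disk path if the file is larger than RAM (as a 1 GB file on a 512 MB machine would be); otherwise later passes may be served from the page cache.

```bash
#!/bin/bash
# Hypothetical helper: hash the same file several times and print how many
# distinct digests were seen. A healthy read path gives exactly 1.
count_distinct_md5() {
    local file=$1 runs=${2:-4} i
    for (( i = 0; i < runs; i++ )); do
        # Keep only the digest column of md5sum's output.
        md5sum "$file" | cut -d' ' -f1
    done | sort -u | wc -l | tr -d ' '
}

# Example call (file name taken from the post):
# count_distinct_md5 1GB_file.rar 4
```

Any result greater than 1 means at least one pass read different bytes than another.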
I ran an experiment to try to find the pattern. I wrote a shell script that builds a roughly 1 GB file
with 16777216 lines of 64 'a' characters each:
x=0
while [ $x -lt 16777216 ]
do
let x+=1
echo aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >> a.txt
done
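Appending one line at a time is slow for 16 million lines. A possible faster variant, as a sketch only (the function name make_pattern_file is invented here, and it assumes the usual yes, head and tr utilities), writes the file in a single pass:

```bash
#!/bin/bash
# Hypothetical helper: write LINES copies of a 64-character line made of CHAR.
make_pattern_file() {
    local char=$1 lines=$2 out=$3 line
    # 64 spaces, each replaced by the chosen character.
    line=$(printf '%64s' '' | tr ' ' "$char")
    # yes repeats the line forever; head stops it after the requested count.
    yes "$line" | head -n "$lines" > "$out"
}

# Example call matching the loop above:
# make_pattern_file a 16777216 a.txt
```

The same function with 'c' instead of 'a' produces a second pattern file for comparison.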
Well, it turns out that a 'grep -nv aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
on the file shows random line numbers, like this:
Some of these lines are wrong because they were corrupted when I wrote the file - they always show up
when I run the same grep again. Anyway, it's always the same bit that is switched - an 'a' turns into a 'c'.
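To see which bit that is: in ASCII 'a' is 0x61 and 'c' is 0x63, so the two differ only in bit 1 (value 0x02). A small sketch, with bit_diff as an illustrative name, that prints the XOR of two characters:

```bash
#!/bin/bash
# Hypothetical helper: print the XOR of the ASCII codes of two characters in hex.
bit_diff() {
    local x y
    # A leading quote makes printf return the character's numeric code.
    x=$(printf '%d' "'$1")
    y=$(printf '%d' "'$2")
    printf '0x%02x\n' $(( x ^ y ))
}

bit_diff a c   # prints 0x02
```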
Oh, and this problem occurs in Linux for every partition and drive I test: the VFAT partition
in drive 1, the XFS partition in drive 1 and the ReiserFS partition in drive 2.
Anyway, it looks like a hardware problem, but then I booted into Windows (which uses the
VFAT partition on drive 1) and ran the same tests using md5sum for Windows and md5summer.
Well, this time, after more than a dozen tries with the same large files, the checksums were
identical every time! No flaws, no changed bits. All disk optimizations were on.
I even tried, in Linux, disabling DMA, readahead, multcount, 32-bit I/O support and unmaskirq, with
no good results. (With everything disabled, in single-user mode, the file sometimes returned the
right md5sum, *but* when I repeated the operation it no longer returned the correct one.
Maybe it came out right the first time because the operation was slower than with every feature
enabled.)
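For reference, the five features listed above map onto hdparm's -d, -a, -m, -c and -u options. The post does not show the exact command used, so the following is only a guess at its shape; the helper name is invented, and it prints the command rather than running it:

```bash
#!/bin/bash
# Hypothetical helper: print (do not run) an hdparm call that switches off
# the features named in the post for one drive.
hdparm_safe_cmd() {
    # -d0 DMA off, -a0 readahead off, -m0 multiple-sector count off,
    # -c0 32-bit I/O off, -u0 interrupt unmasking off
    printf 'hdparm -d0 -a0 -m0 -c0 -u0 %s\n' "$1"
}

hdparm_safe_cmd /dev/hda
hdparm_safe_cmd /dev/hdb
```

Running the printed commands requires root and changes live drive settings.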
My main suspect here is the nForce2 PCI bridge or IDE controller. dmesg tells me
something about cable bits being set incorrectly; could this be related?
Here is the configuration of my machine: lspci -vv, hdparm's for hda and hdb.
ATA device, with non-removable media
Model Number: ST3120023A
Serial Number: 3KA1YADZ
Firmware Revision: 3.33
Standards:
Used: ATA/ATAPI-6 T13 1410D revision 2
Supported: 6 5 4 3
Configuration:
Logical         max     current
cylinders       16383   4047
heads           16      16
sectors/track   63      255
--
CHS current addressable sectors: 16511760
LBA user addressable sectors: 234441648
device size with M = 1024*1024: 114473 MBytes
device size with M = 1000*1000: 120034 MBytes (120 GB)
Capabilities:
LBA, IORDY(can be disabled)
bytes avail on r/w long: 4 Queue depth: 1
Standby timer values: spec'd by Standard
R/W multiple sector transfer: Max = 16 Current = ?
Recommended acoustic management value: 128, current value: 128
DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 *udma5
Cycle time: min=120ns recommended=120ns
PIO: pio0 pio1 pio2 pio3 pio4
Cycle time: no flow control=240ns IORDY flow control=120ns
Commands/features:
Enabled Supported:
* READ BUFFER cmd
* WRITE BUFFER cmd
* Host Protected Area feature set
* Look-ahead
* Write cache
* Power Management feature set
Security Mode feature set
* SMART feature set
* Mandatory FLUSH CACHE command
* Device Configuration Overlay feature set
* Automatic Acoustic Management feature set
SET MAX security extension
* DOWNLOAD MICROCODE cmd
* SMART self-test
* SMART error logging
Security:
supported
not enabled
not locked
not frozen
not expired: security count
not supported: enhanced erase
HW reset results:
CBLID- above Vih
Device num = 1 determined by the jumper
Checksum: correct
------------------
And just for the record...
PLEASE someone help me! =( It's not good being unable to rely on your computer!
Much less on your beloved free operating system...
I don't think I can be of much help, but I would ask this one question and make one comment: Are you overclocking your Athlon? If so, that could be a contributing factor, and I'd recommend you turn it off. Additionally, if the problem only afflicts partitions using Linux-specific file types, then it sounds like you need to check the integrity of your file system by running fsck. Have you tried that, and if so, did it reveal anything? -- J.W.
But no, I don't overclock my Athlon. The timings are pretty standard.
Also, it does not afflict only partitions using Linux-specific filesystems - it also affects the VFAT partition when used under linux. This very same partition, used under Windows, works ok.
I've done fscks, yes, but I wouldn't trust them anyway because they are a massive read - and I am having problems with massive reads. =(
I think I have found the culprit. It's the amd74xx module - the one that prints this on init:
NFORCE2: IDE controller at PCI slot 0000:00:09.0
NFORCE2: chipset revision 162
NFORCE2: not 100% native mode: will probe irqs later
NFORCE2: BIOS didn't set cable bits correctly. Enabling workaround.
NFORCE2: 0000:00:09.0 (rev a2) UDMA133 controller
I rebooted in single-user mode, ran rmmod -f amd74xx and hdparm -d 0 /dev/hda /dev/hdb,
and after that I could md5sum all files correctly. All file operations succeeded.
If I compile the kernel without this module, though, it doesn't recognize /dev/hda or /dev/hdb at all. =(
It looks like the DMA handling code for this chipset is buggy and leads to corruption and lockups. It is full of & and | operations and I don't have a clue how nForce2 DMA works, so I can't really understand it for now.
But if, for the time being, I could find a way to get /dev/hda, hdb and hdc recognized without this module (and without it compiled into the kernel), I think that would be a good workaround, as I would have DMA - only not optimized for nForce2. For now I am using my computer without DMA. =(
From what I've seen in the documentation, IDEBUS=xx only applies to PIO modes... but I'll try it later anyway, though I don't think this is it.
I've run memtest86 to the end with no errors at all. I don't think it is a memory problem; it doesn't look like one.
I am using the nvidia graphics driver. For other subsystems - sound and ethernet - I am using forcedeth (open-source nvidia ethernet driver) and i810_audio from alsa. NVidia provides no other drivers - only ethernet, graphics and sound... So I don't think they are to blame, because I've run tests with /bin/bash as the init parameter from grub, which doesn't load any modules, and the problem was still there.
Anyway, again, thanks for the responses. Something is better than nothing, and as soon as I can I'll do the test with idebus=33.
I had (or have - dunno, have to try some more) a problem right there, too. Specs are about the same, and the controller wouldn't find the HDDs at all anymore after a few reboots. When it did find the HDDs, it wouldn't find the burner on the other channel. I changed all the cables and went from kernel 2.4 to 2.6, and so far things are ok. Brazil seems to be a bad location for nforce2 boards :-)
Additional information...
Running the same scripts with 'c' instead of 'a', I noticed
that it doesn't flip bits: it only sets them, so a file with many 'c's
stays uncorrupted.
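This matches a bit that is forced to 1 rather than toggled: OR-ing with 0x02 turns 'a' (0x61) into 'c' (0x63) but leaves 'c' unchanged, because that bit is already set in 'c'. A minimal sketch of this, with stuck_bit as an illustrative name and the 0x02 mask assumed from the 'a' to 'c' observation:

```bash
#!/bin/bash
# Hypothetical helper: apply a "stuck at 1" fault (bitwise OR with MASK)
# to one character and print the resulting character.
stuck_bit() {
    local code
    code=$(printf '%d' "'$1")
    # OR can only set bits, never clear them. The result is emitted
    # through an octal escape so printf turns the code back into a character.
    printf "\\$(printf '%03o' $(( code | $2 )))\n"
}

stuck_bit a 0x02   # prints c
stuck_bit c 0x02   # prints c
```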
So you're not using the SATA controller? The Sil3112 controller on the Abit NF7-S (which is a similar nForce2-based board) caused data corruption in early versions of the BIOS. A BIOS upgrade was required to take care of that. Not that this has to be your problem, but nForce2-based boards have their more than fair share of issues, and using a recent BIOS is always a good idea.
Also check if you have the possibility to increase the IDE delay a couple of microseconds in your BIOS setup, that is known to help in similar situations (NF7-S again).
Ok, I just tested it in Windows for the fourth time, and this time the error
happened in that stupid operating system too. So it's a hardware problem.
Sorry for wasting anybody's time. Goodbye and thanks for all the fish.
I have several NF2 boards, including the Asus offering. I've run the kernels listed above, as well as others. All of my systems have been MS free for a while now, and never once have I had corruption. On NTFS, my IDE had a major screw-up under XP, which prompted my immediate switch. But I don't think it's fair to say the code in your kernel is buggy. It must be something else, because even on my server, which manages TBs of data, with some files as large as 8-16 GB, never once have I had a read/write error like this. This is definitely isolated.