Linux - Newbie
This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
On a Slackware 13.37 32-bit system, I have four 2 TB SATA drives and a PATA boot/root drive.
Two of the 2 TB SATA drives keep dropping into some mode that makes them read-only and produces input/output errors when they are accessed over NFS. I am not clear why this is happening. Looking at dmesg, I see an unhandled error code and what may be a lost page write.
If I pull the drives and put them in another system, they appear to work fine.
SMART tests are fine.
The power supply shows good voltages under load.
A reboot brings the drives back online; over time, or with use, they fall back into this reduced mode.
Pointers?
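For anyone hitting the same symptoms, the kernel messages described above (an "unhandled error code" followed by a "lost page write") are the usual signature of the kernel remounting a filesystem read-only after an I/O error. A quick way to confirm is to scan dmesg output for those strings. Here is a sketch using a made-up sample of such messages (the device names and log lines are hypothetical, not taken from this system):

```shell
# Hypothetical excerpt of dmesg output resembling the errors described;
# on a live system you would pipe `dmesg` itself into the grep.
log='sd 2:0:0:0: [sdb] Unhandled error code
Buffer I/O error on device sdb1, logical block 1024
lost page write due to I/O error on sdb1'

# Count the tell-tale signatures of an I/O-error-induced read-only remount
echo "$log" | grep -c -i -e 'unhandled error' -e 'lost page write'
```

On the live box, `dmesg | grep -i -e 'unhandled error' -e 'lost page write'` shows the offending device, and checking `/proc/mounts` for an `ro` flag on that filesystem confirms the remount.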
With big hard disks, you must understand that the days of the MBR scheme are almost over; it has been overtaken by the sheer size of today's disk storage. There is a GNU/Linux way of overcoming this limitation (GPT partition tables). See for yourself; there is plenty of reading about it.
Good luck.
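For reference, the MBR limit in question comes from straightforward arithmetic: an MBR partition entry stores the start and length as 32-bit sector counts, so with 512-byte sectors the addressable maximum is 2^32 * 512 bytes, i.e. 2 TiB. A "2 TB" drive (2,000,000,000,000 decimal bytes) sits just under that limit, which is why MBR still works on the drives in this thread:

```shell
# MBR uses 32-bit LBA sector counts; with 512-byte sectors the ceiling is:
mbr_limit=$(( (1 << 32) * 512 ))    # 2^32 sectors * 512 bytes
echo "$mbr_limit"                    # bytes: 2199023255552
echo $(( mbr_limit / 1024 / 1024 / 1024 ))   # GiB: 2048

# A marketed "2 TB" drive is smaller than the MBR ceiling:
echo $(( 2000000000000 < mbr_limit ))        # 1 (true)
```

Anything larger than 2 TiB needs GPT (e.g. `parted /dev/sdX mklabel gpt`).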
All four 2 TB disks are MBR, and the smaller PATA disk is also MBR and under 1 TB.
However, this system worked flawlessly for two years, and I am trying to figure out how and why it has now decided to send two of the five drives into a toes-up mode.
I've done several passes of smartctl-initiated long tests, and there is no sign of any problem. Replacing the SATA and power cables only seemed to extend the time to failure.
Placing one of the 2 TB drives on a USB adapter and exercising it for several hours on another computer yielded no observed errors.
The drives seem to take longer to fail when not accessed, or when accessed locally rather than through NFS. However, the data is not totally definitive, and the 'studies' are not exactly controlled.
There are no temperature problems, nor apparent power problems, nor dirt accumulation on the MB/SATA controller.
Any other ideas, anyone? Perhaps I should have asked on the hardware forum, but I do not consider this strictly a hardware problem yet.
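For anyone following along: the long self-tests mentioned above are started with `smartctl -t long /dev/sdX` and their results read back with `smartctl -l selftest /dev/sdX`; the attribute table comes from `smartctl -A /dev/sdX`. The two attributes most directly tied to failing media are Reallocated_Sector_Ct and Current_Pending_Sector. Here is a sketch of filtering those out of the attribute table, using a hypothetical two-line sample of `smartctl -A` output (a real table has many more rows):

```shell
# Hypothetical excerpt of `smartctl -A /dev/sdb` output; attribute name
# is field 2, and the raw value is the last field on the line.
smart='  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0'

# Pull the raw values of the two attributes most tied to media health;
# non-zero raw values here would point at the drive rather than the MB.
echo "$smart" | awk '$2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" { print $2, $NF }'
```

Clean zeros on both, as reported in this thread, are consistent with the fault being outside the drives.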
Quote: "Placing one of the 2 TB drives on a USB adapter and exercising it for several hours on another computer yielded no observed errors."
That suggests a problem with the SATA hardware on the motherboard. You could try the inverse test, replacing one of the problem HDDs with a known good HDD from another system.
I ordered two 3TB drives, and when they arrive Monday, I will start that process. I have other machines which I can shakedown and burn in the new drives on.
I did find that if I mount just one of the four large drives, things last longer before the degradation.
Shuffled drives around, and put two new drives on the system. The conclusion I have is that there is a motherboard problem. Both new drives fail after a period of 30 seconds to an hour after boot. All drives pass SMART long test, without any problem (as read upon reboot).
It's a socket 775 processor, so I may be SOL finding another MB that meets my requirements.
This is the final report, I promise. I found that the new drives were failing like the old ones; it was only a matter of time before a drive failed talking to the MB. I suspected that the SATA hardware on the MB was crapping out on me. I tried heating and cooling that area of the MB to see if I could provoke the failure more quickly.
Then I swapped out all the power splitters, followed by some better SATA cables with locks on the ends. Then things started getting better. I checked power and found that when things were getting flaky, the power draw for the box was below 290 W, with a 600 W PS. I checked the DC voltages at various points.
After I swapped out all the data cables, things worked better still, so I did some contact cleaning, etc.
Then I got the system to the point where it would run for an hour without any data problems on a SATA drive. Then three hours, and then I fsck'd the file systems. I added the 3 TB drives, and will be copying things to them tonight.
My conclusion is that the likely culprit was the SATA cables, which were gen-2 cables lacking locks. One, even though it worked better than the others, looked ugly in the connector: there was deformation of the socket that the connector tab on the hard drive, or the plug on the MB, fits into.
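For the check-then-copy step described above, the usual sequence is to fsck each filesystem while unmounted and then copy with something that preserves permissions, links, and extended attributes. The device and mount point names below are hypothetical placeholders, not taken from this system; the runnable part at the bottom is a miniature of the same copy using temporary directories:

```shell
# On the real system (names hypothetical, run as root, filesystem unmounted):
#   fsck -f /dev/sdb1                 # force a full filesystem check
#   rsync -aHAX /mnt/old/ /mnt/new/   # archive mode + hardlinks, ACLs, xattrs

# Runnable miniature of the copy step, using temp directories:
src=$(mktemp -d)
dst=$(mktemp -d)
echo data > "$src/file"
cp -a "$src/." "$dst/"    # -a preserves modes, times, ownership where possible
ls "$dst"
```

The trailing `/.` on the source makes `cp -a` copy the directory's contents rather than nesting the directory itself inside the destination.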
Maybe you are right that "the SATA hardware on the MB was crapping out". Maybe it is weak at reading and writing signals, and the cable replacement and contact cleaning improved the signal transmission enough to move out of the failure region into the mostly-success region. If that's right, a minor degradation of the connections -- which the specification is designed to tolerate -- will bring the failures back in the not-distant-enough future.
After chasing possible drive handling of NCQ and other esoteric issues, I decided to swap out the MB. I had a spare 775 MB and put it in. Unfortunately it needed DDR3 memory, and I had DDR2 memory in that system already. So I borrowed a DDR3 stick from another system, got it up and running, and the SATA behavior is now rock solid.
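For anyone who wants to try the NCQ angle mentioned above before swapping hardware: NCQ can be effectively disabled per drive by forcing the queue depth to 1 through sysfs. A sketch, with `sdb` as a hypothetical suspect drive:

```shell
# List current queue depths for any block devices that expose one;
# the glob simply matches nothing on systems without such devices.
for f in /sys/block/*/device/queue_depth; do
    if [ -f "$f" ]; then
        printf '%s %s\n' "$f" "$(cat "$f")"
    fi
done

# To disable NCQ on a suspect drive (hypothetical sdb, run as root):
#   echo 1 > /sys/block/sdb/device/queue_depth
```

A queue depth of 1 serializes commands to the drive, which rules NCQ in or out as a factor without any reboot or cabling change.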
Hardware problem resolved. No signs of dirt, damage, or anything else wrong with the MB; just an intermittent internal failure of something. Flexing the board a little didn't trigger a failure, so the probability of it being something like a circuit-board feed-through is not real high.
Now I need a project for a 775 MB sans SATA. (grin)
I'll order some DDR3 for this system and find another home for the DDR2 memory I pulled out. I realize I have tons of obsolete memory lying around. I wish it could be melted down and made into bars.