RAID seek_error_rate on both disks suddenly growing very fast by smartctl
My mail server started slowing down a lot today, sometimes timing out, so I SSH-ed in and checked. What I found is:
The RAID processes were taking up about 10-15% of CPU time. (My server isn't very busy, just mail for a dozen users and some low-traffic websites; it's typically 98-99% idle, but that had dipped into the 80s and stayed there when this started.)
From smartctl:
seek_error_rate has started growing amazingly fast on both disks:
sda increases by about 3000 per MINUTE on average
sdb increases by about 2000 per MINUTE on average
I watched this over 30 minutes, and it was fairly steady, totaling roughly 90,000 (sda) and 60,000 (sdb) by the end of the 30-minute viewing period. That works out to approximately 33 to 50 seek errors per SECOND per drive on average. The running totals for seek_error_rate were 133,457,113 and 559,913,401 for the two drives.
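As a quick sanity check on those numbers (the per-minute rates are my own rough observations, not smartctl output):

```python
# Rough sanity check of the observed growth rates
sda_per_min = 3000  # observed growth of sda's seek_error_rate per minute
sdb_per_min = 2000  # observed growth of sdb's seek_error_rate per minute

print(sda_per_min / 60)   # 50.0 per second
print(sdb_per_min / 60)   # ~33.3 per second
print(sda_per_min * 30)   # 90000 over the 30-minute window
print(sdb_per_min * 30)   # 60000 over the 30-minute window
```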
Yet there were no changes in Raw Read Error Rate during this time. It is zero for sdb; sda is stuck at 60,121,870, which is remarkable, but again it did not grow over the 30 minutes in which I was observing the constant seek errors.
No changes in Reallocated Sector Count.
No spin retries.
Temperature around 40C (104F) for both, so not getting hot.
I'm hesitant to try rebooting without better understanding what might be the problem, and perhaps doing more diagnostics while it's assuredly still alive (plus doing a fire drill to double check backups while it's still accessible).
It's running CentOS 5.11, which is about to reach end of life, so I plan to install a new OS starting in about a month. That means I need a proper analysis now, so that whatever needs fixing can be fixed in advance of the upgrade.
After all these years, it seems very odd that this would start to happen with both drives at the same time. It's at a big server rack facility / data center.
Anybody have experience with something like this, or any insights or ideas?
It looks like it actually may not be a worry. The drives are Seagate Barracuda 7200.10s, and Seagate encodes these values in an unusual way, according to the author quoted below:
"... the author explains that all the values are actually 48 bits, and due to the way they are encoded it follows that those values are large. More specifically, raw value of the Seek error rate attribute should be converted to hexadecimal and then upper 16 bits are number of errors, while lower 32 bits are total number of seeks.
"In this concrete case the raw value for Seek error rate is 17262017054, or 0x000404E57A1E. The first 16 bits is 0x0004 and the last 32 bits are 0x04E57A1E. What this means is that there were 4 seek errors (meaning the head wasn't positioned correctly after being moved to some track) but there were 82147870 seeks in total. So, this is very very small fraction of errors."
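The decoding the quoted author describes can be sketched in a few lines of Python (the function name is mine, not from any tool):

```python
def decode_seek_error_rate(raw):
    """Split a Seagate 48-bit SEEK_ERROR_RATE raw value into
    (seek_errors, total_seeks), per the encoding described above:
    upper 16 bits = number of seek errors, lower 32 bits = total seeks."""
    errors = (raw >> 32) & 0xFFFF   # upper 16 bits of the 48-bit value
    seeks = raw & 0xFFFFFFFF        # lower 32 bits
    return errors, seeks

# The example value from the quote: 4 errors out of 82,147,870 seeks
print(decode_seek_error_rate(17262017054))   # (4, 82147870)

# My two drives' raw totals: both fit in 32 bits, so zero errors
print(decode_seek_error_rate(133457113))     # (0, 133457113)
print(decode_seek_error_rate(559913401))     # (0, 559913401)
```

Since both of my raw values are below 2^32, the upper 16 bits are zero in each case.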
Applying that to the values in my original post here on LinuxQuestions.org: since my raw values do not go beyond 8 hex digits, I gather there have been zero seek errors. The attribute is mainly counting the number of seeks, not the number of errors, which is why it has been growing at such a fast rate.
(One of my drives is much older than the other, due to the fairly recent replacement of a drive that failed, which would explain the difference in totals: comparing the two numbers, they are nearly proportional to each drive's hours in service.)