RAID seek_error_rate on both disks suddenly growing very fast by smartctl
My mail server started slowing down a lot today, sometimes timing out, so I SSH-ed in and checked. What I found is:
The RAID processes were taking up about 10-15% of CPU time. (My server isn't very busy, just mail for a dozen users and some low-traffic websites; it's typically 98-99% idle, but that had dipped into the 80s and stayed there when this started.)
From smartctl:
seek_error_rate has started growing amazingly fast on both disks:
sda increases by about 3000 per MINUTE on average
sdb increases by about 2000 per MINUTE on average
I watched this over 30 minutes, and it was fairly steady, totaling roughly 90,000 (sda) and 60,000 (sdb) by the end of the 30-minute viewing period. That works out to approximately 33 to 50 seek errors per SECOND per drive on average. The running totals for seek_error_rate were 133,457,113 and 559,913,401 for the two drives.
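As a quick sanity check on those numbers (the per-minute rates are my own rough observations, not smartctl output):

```python
# Rough sanity check of the observed growth rates
sda_per_min = 3000  # observed growth of sda's seek_error_rate per minute
sdb_per_min = 2000  # observed growth of sdb's seek_error_rate per minute

print(sda_per_min / 60)   # 50.0 per second
print(sdb_per_min / 60)   # ~33.3 per second
print(sda_per_min * 30)   # 90000 over the 30-minute window
print(sdb_per_min * 30)   # 60000 over the 30-minute window
```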
Yet there were no changes in Raw Read Error Rate during this time. It is zero for sdb; sda is stuck at 60,121,870, which is remarkable, but again it did not grow over the 30 minutes in which I was observing the constant seek errors.
No changes in Reallocated Sector Count.
No spin retries.
Temperature around 40C (104F) for both, so not getting hot.
I'm hesitant to try rebooting without better understanding what might be the problem, and perhaps doing more diagnostics while it's assuredly still alive (plus doing a fire drill to double check backups while it's still accessible).
It's running CentOS 5.11, which is about to reach end of life, so I plan to install a new OS starting in about a month. That means I need a proper analysis now, so that whatever needs fixing can be fixed in advance of the upgrade.
After all these years, it seems very odd that this would start to happen with both drives at the same time. It's at a big server rack facility / data center.
Anybody have experience with something like this, or any insights or ideas?
It looks like it actually may not be a worry. The drives are Seagate Barracuda 7200.10s, and Seagate encodes these values in an unusual way, according to the author quoted below:
"... the author explains that all the values are actually 48 bits, and due to the way they are encoded it follows that those values are large. More specifically, raw value of the Seek error rate attribute should be converted to hexadecimal and then upper 16 bits are number of errors, while lower 32 bits are total number of seeks.
"In this concrete case the raw value for Seek error rate is 17262017054, or 0x000404E57A1E. The first 16 bits is 0x0004 and the last 32 bits are 0x04E57A1E. What this means is that there were 4 seek errors (meaning the head wasn't positioned correctly after being moved to some track) but there were 82147870 seeks in total. So, this is very very small fraction of errors."
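The decoding the quoted author describes can be sketched in a few lines of Python (the function name is mine, not from any tool):

```python
def decode_seek_error_rate(raw):
    """Split a Seagate 48-bit SEEK_ERROR_RATE raw value into
    (seek_errors, total_seeks), per the encoding described above:
    upper 16 bits = number of seek errors, lower 32 bits = total seeks."""
    errors = (raw >> 32) & 0xFFFF   # upper 16 bits of the 48-bit value
    seeks = raw & 0xFFFFFFFF        # lower 32 bits
    return errors, seeks

# The example value from the quote: 4 errors out of 82,147,870 seeks
print(decode_seek_error_rate(17262017054))   # (4, 82147870)

# My two drives' raw totals: both fit in 32 bits, so zero errors
print(decode_seek_error_rate(133457113))     # (0, 133457113)
print(decode_seek_error_rate(559913401))     # (0, 559913401)
```

Since both of my raw values are below 2^32, the upper 16 bits are zero in each case.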
Applying that to the values in my original post here on LinuxQuestions.org: since my raw values do not go beyond 8 hex digits, I gather there have been zero seek errors. The attribute is mainly counting the number of seeks, not the number of errors, which is why it has been growing at such a fast rate.
(One of my drives is much older than the other, due to the fairly recent replacement of a drive that failed, which would explain the difference in totals: comparing the two numbers, they are nearly proportional to each drive's hours in service.)