Hi all. First time poster here, with a strange problem to report with ps output.
I have a script running on all our servers which monitors processes for high CPU or memory usage. It runs every minute and keeps 10 copies of the output from ps. It then uses the numbers from the first and last files to calculate the averages over the last 10 minutes. However, last night we had a crazy alert, telling us that the mysqld process on one of our servers was taking up 383924791.62% of the CPU time!
The alerts stopped after 10 iterations but I managed to capture the output while it was happening. It turns out that the mysqld process had gone from 191 days' CPU time to 213483 in the space of one minute. Here's a record of the point it happened (which is the output from 'ps -eo pid,user,time,etime,vsize,args'):
Code:
1543 mysql 191-22:23:04 834-09:30:26 9410820 /usr/libexec/mysqld [*snip*] --port=3306
1543 mysql 213483-11:26:51 834-09:31:26 9410820 /usr/libexec/mysqld [*snip*] --port=3306
Since this happened, the CPU time for the process has continued counting from the higher figure, so there've been no further alerts. But can anyone shed any light on why this massive jump in reported CPU time might have happened?
Admittedly, we're running an old version of Fedora here (FC12), and the server has been up for over two years. It's not a major problem for us, but I'm very curious how on Earth this could have happened.
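For reference, the kind of calculation described above can be sketched in Python. This is a minimal sketch and not the actual monitoring script: the function names and the `ncpus` plausibility guard are assumptions made for illustration. It parses the `[DD-]HH:MM:SS` values that ps prints for `time` and `etime`, and turns two samples of the same PID into a CPU percentage.

```python
def ps_time_to_seconds(value: str) -> int:
    """Convert a ps time/etime field ([[DD-]HH:]MM:SS) to whole seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in value.split(":")]
    while len(parts) < 3:          # etime may omit the hours field
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_percent(first, last) -> float:
    """first/last are (cputime, etime) string pairs from two ps samples."""
    cpu_delta = ps_time_to_seconds(last[0]) - ps_time_to_seconds(first[0])
    wall_delta = ps_time_to_seconds(last[1]) - ps_time_to_seconds(first[1])
    if wall_delta <= 0:
        raise ValueError("samples must be in chronological order")
    return 100.0 * cpu_delta / wall_delta


def is_plausible(percent: float, ncpus: int) -> bool:
    """Hypothetical guard: a process cannot use more than 100% per CPU."""
    return 0.0 <= percent <= 100.0 * ncpus


# The two samples quoted in the post, taken one minute apart.
before = ("191-22:23:04", "834-09:30:26")
after = ("213483-11:26:51", "834-09:31:26")

pct = cpu_percent(before, after)
print(f"{pct:.2f}% plausible={is_plausible(pct, ncpus=64)}")
```

Fed the two samples above, the result is far beyond what any real machine could deliver, so a guard like `is_plausible` would let a monitor report the sample as a counter glitch instead of a load alert. The `ncpus=64` value is only a placeholder.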
Quote:
Originally Posted by DuffPaddy
we're running an old version of Fedora here (FC12),
Fedora (the "Core" part was dropped from F7 on) is at 18 right now, with F19 scheduled for later this year. So you're either 6 releases behind or you're running the wrong Linux distribution. Running an unsupported, obsolete release means no bug fixes, no improvements and no security fixes.
Quote:
Originally Posted by DuffPaddy
I have a script running on all our servers which monitors processes for high CPU or memory usage.
Unless you have a "re-invent the wheel" fetish (also called "NIH Syndrome"), you should know there are standard tools for that, commonly referred to as "SAR", like Atop (or Atsar), Dstat, Collectl or plain "sar".
Quote:
Originally Posted by DuffPaddy
can anyone shed any light on why this massive jump in reported CPU time might have happened?
I doubt that. There are several angles you could start your investigation from, like finding out what procps derives the raw value from and how it calculates "cputime", checking for bugs in the procps package or ones related to kernel 2.6.31 (timers?), and checking for hardware-related issues. Priority-wise it doesn't stack up against you running an obsolete release.
Yes, I'm well aware of sar, and use it to monitor overall system usage (in conjunction with SysUsage). My script checks on a process by process basis, to check whether any individual process is exceeding any predefined CPU or VM thresholds. Some of those other tools look useful though, and may even do what I want. EDIT: now that I delve a little further, the pidstat module within the sysstat package might well do what I want, with no need to install anything else. I might try that on the system in question to see if its numbers tally with ps's impossible times.
And I'm also in touch with the Fedora release schedule: my own systems are far more up-to-date than those where I work. It's still referred to as FCxx though (e.g. in uname and RPMs), despite the "core" having been dropped ages ago.
But like I say, not really a problem, more a case of: "This is very strange. Anyone seen it before?"
With all due respect, even if anyone has seen this before, common symptoms do not automagically share the same cause (HW-specific bugs, bugs caused by kernel - HW combos, unstable TSCs, whatever's up with (non-)HPET clock sources, anything near or depending on kernel/time/timekeeping.c, libprocps, ps, use of virtualization, etc, etc). Even worse, with this kind of problem you're supposed to have instrumentation in place before it happens (don't ask me how anyone would anticipate that, let alone use production servers as a playground ;-p). Did any counters for other processes show unexpected values at that time? Any kernel, NTP or other "interesting" messages logged around that time?
No, this was the only process that experienced the jump. However, it's a dedicated database server, so that process is always the busiest by far.
If I add the values of utime and stime together from /proc/1543/stat, it comes to about 213483.8 days. So ps seems to be correctly reporting what's in the kernel.
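A minimal Python sketch of that check, for anyone wanting to repeat it. The function names are hypothetical. It assumes the usual /proc/[pid]/stat layout, where utime and stime are fields 14 and 15 and are counted in clock ticks, with the tick rate taken from `sysconf(SC_CLK_TCK)`.

```python
import os

SECONDS_PER_DAY = 86400


def cpu_seconds_from_stat(stat_line: str, ticks_per_sec: int) -> float:
    """Return utime + stime in seconds from one /proc/[pid]/stat line."""
    # The comm field (2) is wrapped in parentheses and may contain spaces,
    # so split on the last ')' and count fields from there.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    utime_ticks = int(rest[11])
    stime_ticks = int(rest[12])
    return (utime_ticks + stime_ticks) / ticks_per_sec


def cpu_days_for_pid(pid: int) -> float:
    """Read /proc/<pid>/stat on a live Linux system (not called below)."""
    with open(f"/proc/{pid}/stat") as handle:
        line = handle.read()
    ticks = os.sysconf("SC_CLK_TCK")
    return cpu_seconds_from_stat(line, ticks) / SECONDS_PER_DAY


# Synthetic stat line for illustration only, not taken from the server.
sample = "1543 (my sqld) S 1 1543 1543 0 -1 4202752 100 0 0 0 5000 2500 0 0 20 0"
print(cpu_seconds_from_stat(sample, ticks_per_sec=100))
```

On the affected box, `cpu_days_for_pid(1543)` would be the value to compare against the ps output.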
As far as "interesting" messages go, these are appearing in dmesg:
Code:
CE: hpet increasing min_delta_ns to 15000 nsec
CE: hpet increasing min_delta_ns to 22500 nsec
CE: hpet increasing min_delta_ns to 33750 nsec
However, I'm not sure if they're relevant, and they may have happened some time ago.
I don't think it's worth pursuing this one any further. I've seen some bug reports relating to utime overflow, but the figures don't tally with this one. We'll finally be upgrading these servers later this year, so we'll see what happens then.
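For anyone who wants to check figures like these against wraparound values, here is a rough arithmetic sketch in Python. The candidate units (nanoseconds, microseconds, ticks at 100 Hz) are assumptions chosen for illustration, not a statement of how kernel 2.6.31 accounts CPU time. It only shows how far the reported figure sits from each candidate and does not establish a cause.

```python
SECONDS_PER_DAY = 86400


def days(seconds: float) -> float:
    return seconds / SECONDS_PER_DAY


# CPU time ps reported after the jump: 213483-11:26:51
observed_seconds = 213483 * SECONDS_PER_DAY + 11 * 3600 + 26 * 60 + 51

# Hypothetical counter widths and units, converted to seconds.
candidates = {
    "2**64 nanoseconds": 2**64 / 1e9,
    "2**64 microseconds": 2**64 / 1e6,
    "2**32 ticks at 100 Hz": 2**32 / 100,
    "2**64 ticks at 100 Hz": 2**64 / 100,
}

for label, wrap_seconds in candidates.items():
    gap = wrap_seconds - observed_seconds
    print(f"{label}: wraps at {days(wrap_seconds):.2f} days, "
          f"gap to observed {days(gap):.2f} days")
```

A small gap against one of the candidates would only be a hint about where to look. It says nothing on its own about whether the kernel, procps or the hardware is responsible.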