Bizarre problem with ps

DuffPaddy · 02-01-2013, 04:11 AM

Hi all. First time poster here, with a strange problem to report with ps output.

I have a script running on all our servers which monitors processes for high CPU or memory usage. It runs every minute and keeps 10 copies of the output from ps. It then uses the numbers from the first and last files to calculate the averages over the last 10 minutes. However, last night we had a crazy alert, telling us that the mysqld process on one of our servers was taking up 383924791.62% of the CPU time!

The alerts stopped after 10 iterations, but I managed to capture the output while it was happening. It turns out that the mysqld process had gone from 191 days of CPU time to 213483 days in the space of one minute. Here's a record of the point at which it happened (this is the output from 'ps -eo pid,user,time,etime,vsize,args'):

Code:

1543 mysql    191-22:23:04 834-09:30:26 9410820 /usr/libexec/mysqld [*snip*] --port=3306

1543 mysql    213483-11:26:51 834-09:31:26 9410820 /usr/libexec/mysqld [*snip*] --port=3306
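The calculation described above (CPU time delta divided by elapsed time delta between two ps samples) can be sketched as follows. This is a minimal illustration, not the original monitoring script, which was not posted: the function names and the `ncpus` parameter are hypothetical, and the plausibility check is an added assumption about how such an alert could be filtered. The result will not match the 383924791.62% from the alert, since the script's actual averaging window and any per-core normalisation are not shown in the thread.

```python
def ps_time_to_seconds(text):
    """Convert a ps TIME/ELAPSED value of the form [[DD-]HH:]MM:SS to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:          # pad missing hours with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def cpu_percent(first, last):
    """first and last are (cputime, etime) string pairs from two ps samples."""
    cpu_delta = ps_time_to_seconds(last[0]) - ps_time_to_seconds(first[0])
    wall_delta = ps_time_to_seconds(last[1]) - ps_time_to_seconds(first[1])
    if wall_delta <= 0:
        raise ValueError("samples must be taken at increasing elapsed times")
    return 100.0 * cpu_delta / wall_delta


def is_plausible(percent, ncpus):
    """A process cannot use more than 100% of each CPU (ncpus is an assumption)."""
    return 0.0 <= percent <= 100.0 * ncpus


# The two samples from the ps output above.
before = ("191-22:23:04", "834-09:30:26")
after = ("213483-11:26:51", "834-09:31:26")

percent = cpu_percent(before, after)
print(f"{percent:.2f}% plausible={is_plausible(percent, ncpus=8)}")
```

Rejecting any sample where `is_plausible` returns False would suppress an alert like this one while still leaving the raw ps output on disk for later inspection.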

Since this happened, the CPU time for the process has continued counting from the higher figure, so there've been no further alerts. But can anyone shed any light on why this massive jump in reported CPU time might have happened?

Admittedly, we're running an old version of Fedora here (FC12), and the server has been up for over two years. It's not a major problem for us, but I'm very curious how on Earth this could have happened.

unSpawn · 02-01-2013, 07:19 AM

Quote:

Originally Posted by DuffPaddy

Hi all. First time poster here

Welcome to LQ, hope you like it here.

In order of importance:

Quote:

Originally Posted by DuffPaddy

we're running an old version of Fedora here (FC12),

Fedora (the "Core" part was dropped from F7 on) is at 18 right now, with F19 scheduled for later this year. So you're either 6 releases behind or you're running the wrong Linux distribution. Running an unsupported, obsolete release means no bug fixes, no improvements and no security fixes.

Quote:

Originally Posted by DuffPaddy

I have a script running on all our servers which monitors processes for high CPU or memory usage.

Unless you have a "re-invent the wheel" fetish (also called "NIH Syndrome"), you should know there are standard tools for that, commonly referred to as "SAR", like Atop (or Atsar), Dstat, Collectl or plain "sar".

Quote:

Originally Posted by DuffPaddy

can anyone shed any light on why this massive jump in reported CPU time might have happened?

I doubt that. There are several angles from which you could start your investigation: finding out what procps derives the raw value from and how it calculates "cputime", checking for bugs in the procps package and for ones related to kernel 2.6.31 (timers?), and checking for hardware-related issues. Priority-wise it doesn't stack up against you running an obsolete release.

DuffPaddy · 02-01-2013, 09:46 AM

Thanks for the welcome and help, unSpawn.

Yes, I'm well aware of sar, and use it to monitor overall system usage (in conjunction with SysUsage). My script works on a process-by-process basis, checking whether any individual process is exceeding predefined CPU or VM thresholds. Some of those other tools look useful though, and may even do what I want. EDIT: now that I delve a little further, the pidstat tool within the sysstat package might well do what I want, with no need to install anything else. I might try that on the system in question to see if its numbers tally with ps's impossible times.

And I'm also in touch with the Fedora release schedule: my own systems are far more up-to-date than those where I work. It's still referred to as FCxx though (e.g. in uname and RPMs), despite the "core" having been dropped ages ago.

But like I say, not really a problem, more a case of: "This is very strange. Anyone seen it before?"

unSpawn · 02-01-2013, 11:06 AM

With all due respect, even if anyone has seen this before, it does not automagically mean that common symptoms share the same cause (HW-specific bugs, bugs caused by kernel-HW combos, unstable TSCs, whatever's up with (non-)HPET clock sources, anything near or depending on kernel/time/timekeeping.c, libprocps, ps, use of virtualization, etc.). Even worse, with this kind of problem you're supposed to have instrumentation in place before it happens (don't ask me how anyone would anticipate that, let alone use production servers as a playground ;-p). Did any counters for other processes show unexpected values at that time? Any kernel, NTP or other "interesting" messages logged around that time?

DuffPaddy · 02-02-2013, 08:25 AM

No, this was the only process that experienced the jump. However, it's a dedicated database server, so that process is always the busiest by far.

If I add the values of utime and stime together from /proc/1543/stat, it comes to about 213483.8 days. So ps seems to be correctly reporting what's in the kernel.
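That check can be sketched in a few lines. This is only an illustration of the approach, assuming the field layout documented in proc(5), where utime and stime are fields 14 and 15 and are measured in clock ticks. The function name is hypothetical, and the sample line is synthetic, not taken from the server in question.

```python
def cpu_days_from_stat(stat_line, ticks_per_second):
    """Return utime + stime from a /proc/<pid>/stat line, expressed in days."""
    # Field 2 (comm) may contain spaces or parentheses, so split after
    # the last ')' instead of splitting the whole line on whitespace.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 14 is rest[11] and field 15 is rest[12].
    utime = int(rest[11])
    stime = int(rest[12])
    return (utime + stime) / ticks_per_second / 86400


# Synthetic example line (made-up values): 2 days of user time plus
# 1 day of system time at 100 ticks per second.
sample = ("1543 (my (odd) proc) S 1 1543 1543 0 -1 4202752 100 0 0 0 "
          "17280000 8640000 0 0 20 0 1 0 12345 9410820 0")

print(cpu_days_from_stat(sample, ticks_per_second=100))
```

On a live system the line would come from reading `/proc/1543/stat`, and the tick rate from `os.sysconf("SC_CLK_TCK")` rather than a hard-coded 100.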

As for "interesting" messages, these are appearing in dmesg:

Code:

CE: hpet increasing min_delta_ns to 15000 nsec
CE: hpet increasing min_delta_ns to 22500 nsec
CE: hpet increasing min_delta_ns to 33750 nsec

However, I'm not sure if they're relevant, and they may have been logged some time ago.

I don't think it's worth pursuing this one any further. I've seen some bug reports relating to utime overflow, but the figures don't tally with this one. We'll finally be upgrading these servers later this year, so we'll see what happens then.
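For anyone wanting to repeat that kind of comparison, the arithmetic can be sketched as below. The two candidate counters (a 32-bit counter at 100 ticks per second and a 64-bit counter in nanoseconds) are assumptions chosen for illustration and are not taken from the bug reports mentioned above. The output is not a diagnosis: whether either boundary is relevant to kernel 2.6.31 is not established here.

```python
SECONDS_PER_DAY = 86400


def days_at_wrap(bits, units_per_second):
    """Days of CPU time at which an unsigned counter of this width wraps around."""
    return 2 ** bits / units_per_second / SECONDS_PER_DAY


# The post-jump ps value 213483-11:26:51, converted to days.
observed_days = 213483 + (11 * 3600 + 26 * 60 + 51) / SECONDS_PER_DAY

# Hypothetical candidates, for illustration only.
candidates = {
    "32-bit counter, 100 ticks/s": days_at_wrap(32, 100),
    "64-bit counter, nanoseconds": days_at_wrap(64, 10 ** 9),
}

for label, limit in candidates.items():
    diff = observed_days - limit
    print(f"{label}: wraps at {limit:.2f} days, observed differs by {diff:+.2f} days")
```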

syg00 · 02-02-2013, 04:11 PM

Have a read of this - note particularly "Additional Information".

Passing around arbitrary values will cause random failures. So reboot, then upgrade.