LinuxQuestions.org
02-01-2013, 04:11 AM   #1
DuffPaddy
LQ Newbie
 
Registered: Feb 2013
Location: Tregaron, Wales
Distribution: Fedora, Mint
Posts: 3

Rep: Reputation: Disabled
Bizarre problem with ps


Hi all. First time poster here, with a strange problem to report with ps output.

I have a script running on all our servers which monitors processes for high CPU or memory usage. It runs every minute and keeps 10 copies of the output from ps. It then uses the numbers from the first and last files to calculate the averages over the last 10 minutes. However, last night we had a crazy alert, telling us that the mysqld process on one of our servers was taking up 383924791.62% of the CPU time!
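The first-and-last-sample calculation described above can be sketched roughly as follows. This is only an illustration, not the actual monitoring script: the class name `CpuWindow` and its methods are hypothetical, and it assumes each sample has already been reduced to a wall-clock timestamp and a cumulative CPU time, both in seconds.

```python
from collections import deque

WINDOW = 10  # number of one-minute samples kept, as described in the post


class CpuWindow:
    """Hypothetical helper: keeps the last WINDOW samples for one process."""

    def __init__(self, size=WINDOW):
        # deque(maxlen=...) silently drops the oldest sample when full
        self.samples = deque(maxlen=size)

    def add(self, wall_seconds, cpu_seconds):
        """Record one sample: wall-clock time and cumulative CPU time."""
        self.samples.append((wall_seconds, cpu_seconds))

    def average_percent(self):
        """CPU use between the oldest and newest sample, as % of one CPU."""
        if len(self.samples) < 2:
            return None  # not enough data yet
        (t0, c0), (t1, c1) = self.samples[0], self.samples[-1]
        if t1 <= t0:
            return None  # clock did not advance; avoid dividing by zero
        return 100.0 * (c1 - c0) / (t1 - t0)
```

Because only the two end points are compared, a single bad cumulative value stays in the result until it has been both the newest and the oldest sample, which would fit with the alerts stopping after 10 iterations.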

The alerts stopped after 10 iterations but I managed to capture the output while it was happening. It turns out that the mysqld process had gone from 191 days of CPU time to 213483 days in the space of one minute. Here's a record of the point where it happened (this is the output from 'ps -eo pid,user,time,etime,vsize,args'):

Code:
1543 mysql    191-22:23:04 834-09:30:26 9410820 /usr/libexec/mysqld [*snip*] --port=3306

1543 mysql    213483-11:26:51 834-09:31:26 9410820 /usr/libexec/mysqld [*snip*] --port=3306
Since this happened, the CPU time for the process has continued counting from the higher figure, so there've been no further alerts. But can anyone shed any light on why this massive jump in reported CPU time might have happened?
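For anyone wanting to guard a similar script against this kind of jump, here is a minimal sketch of parsing the two ps time columns and rejecting impossible deltas. The function names `parse_ps_time` and `is_plausible` are made up for illustration, and the `ncpus=64` ceiling is a deliberately generous assumption, not the real core count of the server.

```python
SECONDS_PER_DAY = 86400


def parse_ps_time(field: str) -> int:
    """Convert a ps TIME or ELAPSED field ([[DD-]HH:]MM:SS) to seconds."""
    days, _, clock = field.strip().rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    parts = [0] * (3 - len(parts)) + parts  # pad a missing hours field
    hours, minutes, seconds = parts
    return int(days or 0) * SECONDS_PER_DAY + hours * 3600 + minutes * 60 + seconds


def is_plausible(cpu_delta: int, wall_delta: int, ncpus: int = 1) -> bool:
    """A process cannot use more CPU time than wall time multiplied by CPUs."""
    return 0 <= cpu_delta <= wall_delta * ncpus


# The two samples quoted above: (time, etime) before and after the jump.
before = ("191-22:23:04", "834-09:30:26")
after = ("213483-11:26:51", "834-09:31:26")

cpu_delta = parse_ps_time(after[0]) - parse_ps_time(before[0])
wall_delta = parse_ps_time(after[1]) - parse_ps_time(before[1])

if not is_plausible(cpu_delta, wall_delta, ncpus=64):  # 64 is an assumed upper bound
    print(f"implausible: {cpu_delta} s of CPU in {wall_delta} s of wall time")
```

Skipping a sample that fails the check, rather than alerting on it, would suppress the false alarm without hiding a genuinely busy process.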

Admittedly, we're running an old version of Fedora here (FC12), and the server has been up for over two years. It's not a major problem for us, but I'm very curious how on Earth this could have happened.
 
02-01-2013, 07:19 AM   #2
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
Quote:
Originally Posted by DuffPaddy
Hi all. First time poster here

Welcome to LQ, hope you like it here.


In order of importance:
Quote:
Originally Posted by DuffPaddy
we're running an old version of Fedora here (FC12),

Fedora (the "Core" part was dropped from F7 on) is at 18 right now, with F19 scheduled for later this year. So you're either 6 releases behind or you're running the wrong Linux distribution. Running an unsupported, obsolete release means no bug fixes, no improvements and no security fixes.


Quote:
Originally Posted by DuffPaddy
I have a script running on all our servers which monitors processes for high CPU or memory usage.

Unless you have a "re-invent the wheel" fetish (also called "NIH Syndrome"), you should know there are standard tools for that, commonly referred to as "SAR", like Atop (or Atsar), Dstat, Collectl or plain "sar".


Quote:
Originally Posted by DuffPaddy
can anyone shed any light on why this massive jump in reported CPU time might have happened?

I doubt that. There are several angles you could start your investigation from: finding out what procps derives the raw value from and how it calculates "cputime", checking for bugs in the procps package or ones related to kernel 2.6.31 (timers?), and checking for hardware-related issues. Priority-wise, none of that stacks up against you running an obsolete release.
 
02-01-2013, 09:46 AM   #3
DuffPaddy
LQ Newbie
 
Registered: Feb 2013
Location: Tregaron, Wales
Distribution: Fedora, Mint
Posts: 3

Original Poster
Rep: Reputation: Disabled
Thanks for the welcome and help, unSpawn.

Yes, I'm well aware of sar, and use it to monitor overall system usage (in conjunction with SysUsage). My script works on a process-by-process basis, checking whether any individual process is exceeding predefined CPU or VM thresholds. Some of those other tools look useful though, and may even do what I want. EDIT: now that I delve a little further, the pidstat tool within the sysstat package might well do what I want, with no need to install anything else. I might try that on the system in question to see if its numbers tally with ps's impossible times.

And I'm also in touch with the Fedora release schedule: my own systems are far more up-to-date than those where I work. It's still referred to as FCxx though (e.g. in uname and RPMs), despite the "core" having been dropped ages ago.

But like I say, not really a problem, more a case of: "This is very strange. Anyone seen it before?"
 
02-01-2013, 11:06 AM   #4
unSpawn
Moderator
 
Registered: May 2001
Posts: 29,415
Blog Entries: 55

Rep: Reputation: 3600
With all due respect, even if anyone has seen this before, it does not automagically mean that common symptoms share the same cause (HW-specific bugs, bugs caused by kernel/HW combos, unstable TSCs, whatever's up with (non-)HPET clock sources, anything near or depending on kernel/time/timekeeping.c, libprocps, ps, use of virtualization, etc.). Even worse, with this kind of problem you're supposed to have instrumentation in place before it happens (don't ask me how anyone would anticipate that, let alone use production servers as a playground ;-p). Did any counters for other processes show unexpected values at that time? Were any kernel, NTP or other "interesting" messages logged around that time?
 
02-02-2013, 08:25 AM   #5
DuffPaddy
LQ Newbie
 
Registered: Feb 2013
Location: Tregaron, Wales
Distribution: Fedora, Mint
Posts: 3

Original Poster
Rep: Reputation: Disabled
No, this was the only process that experienced the jump. However, it's a dedicated database server, so that process is always the busiest by far.

If I add the values of utime and stime together from /proc/1543/stat, it comes to about 213483.8 days. So ps seems to be correctly reporting what's in the kernel.
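The check above can be reproduced with a short script. This is a sketch under a few assumptions: the function names are made up for illustration, utime and stime are taken as fields 14 and 15 of /proc/[pid]/stat as documented in proc(5), and the tick rate comes from `SC_CLK_TCK` (commonly 100).

```python
import os


def cpu_seconds_from_stat(stat_line: str, ticks_per_second: int) -> float:
    """Return utime + stime in seconds from the text of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')' characters, so split only what follows the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    utime_ticks = int(rest[11])  # field 14: user-mode time in clock ticks
    stime_ticks = int(rest[12])  # field 15: kernel-mode time in clock ticks
    return (utime_ticks + stime_ticks) / ticks_per_second


def cpu_days_for_pid(pid: int) -> float:
    """Read /proc/<pid>/stat (Linux only) and return total CPU time in days."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as stat_file:
        return cpu_seconds_from_stat(stat_file.read(), ticks_per_second) / 86400
```

Calling `cpu_days_for_pid(1543)` on the affected server would give the figure to compare against the days column in the ps output.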

As for "interesting" messages, these appear in dmesg:

Code:
CE: hpet increasing min_delta_ns to 15000 nsec
CE: hpet increasing min_delta_ns to 22500 nsec
CE: hpet increasing min_delta_ns to 33750 nsec
However, I'm not sure if they're relevant, and they may have happened some time ago.

I don't think it's worth pursuing this one any further. I've seen some bug reports relating to utime overflow, but the figures don't tally with this one. We'll finally be upgrading these servers later this year, so we'll see what happens then.
 
02-02-2013, 04:11 PM   #6
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4123
Have a read of this - note particularly "Additional Information".

Passing around arbitrary values will cause random failures. So reboot, then upgrade.
 
  


All times are GMT -5.
