Script to monitor CPU usage, run command at threshold: input?

jamtat · 05-25-2019, 01:16 PM

I'm wanting to put together a script that checks for high CPU usage and, at a certain threshold, runs a command. The reason for the script is that lately my system will unpredictably get high CPU usage, making the GUI difficult or impossible to use.

It seems Xorg is the culprit process soaking up cycles. The fix is rather simple: I restart my window manager (JWM) and CPU usage goes back down to normal levels. As an interim solution, I can issue jwm -restart from a terminal or use a key combination I set up that does the same. So I want my CPU-monitoring script to run that command so as to automate things--something that would be a real help when I'm not physically present to restart the WM.

I've found a script on the internet that seems like it could be easily adapted to my scenario and I'd like to ask here for some input on it (and on the task in general). The script, with my modifications added, would look as follows:

Code:

#!/bin/bash

CPU_LOAD=$(uptime | cut -d"," -f6 | sed -e "s/\.//g") #this selects uptime's 15 min. field
CPU_THRESHOLD=060
#based on recent monitoring 060 seems like probably a good number for the 15 min. field threshold for this system

if [ $CPU_LOAD -gt $CPU_THRESHOLD ] ; then
  /usr/bin/jwm -restart #do I need to specify display? like :0.0?
fi

exit 0

So I would run this as a cron job, say, at 10 minute intervals. My main question has to do with issuing the jwm -restart command. Past experience has shown that I cannot issue that command from another tty and have it be effective (for example, if I log into the system remotely in an ssh session and try to run it from there). I'm guessing that may be because I need to specify the display to which the command needs to be sent. Does that make sense? Also, might it be better to use the 5-minute field for monitoring threshold?

Any input on the task, means for accomplishing it, or improvements to the script, will be appreciated.
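
A minimal sketch of the threshold check, offered as one possible variant rather than the script from the post. It assumes a Linux system where /proc/loadavg is readable, and reads the 15-minute figure from there because the comma-separated field position in uptime output can shift with how long the machine has been up. The names LOADAVG_FILE, THRESHOLD, to_hundredths, load15 and over_threshold are illustrative and do not come from the original script.

```shell
#!/bin/bash
# Sketch: compare the 15-minute load average against a threshold.
# LOADAVG_FILE and THRESHOLD are illustrative names (assumptions).

LOADAVG_FILE="${LOADAVG_FILE:-/proc/loadavg}"
THRESHOLD="${THRESHOLD:-60}"   # in hundredths: 60 means a load average of 0.60

# Convert a load figure such as "0.60" or "12.05" to integer hundredths.
# The 10# prefix forces base 10 so values like "008" are not read as octal.
to_hundredths() {
    local value=$1
    echo $(( 10#${value/./} ))
}

# Print the 15-minute load average (third field of file $1) in hundredths.
load15() {
    local one five fifteen rest
    read -r one five fifteen rest < "$1"
    to_hundredths "$fifteen"
}

# Succeed (exit status 0) when the load in file $1 exceeds threshold $2.
over_threshold() {
    [ "$(load15 "$1")" -gt "$2" ]
}

if [ -r "$LOADAVG_FILE" ] && over_threshold "$LOADAVG_FILE" "$THRESHOLD"; then
    # The restart command (for example /usr/bin/jwm -restart) would go here.
    echo "15-minute load is above threshold"
fi
```

Keeping the conversion in its own function makes the comparison testable against a sample file without touching the live system.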

WideOpenSkies · 05-25-2019, 03:11 PM

Can you run the cron job as is and see if that works first? If it doesn't work without a specific display set, we can debug from there.

As for this:

Quote:

Originally Posted by jamtat

Also, might it be better to use the 5-minute field for monitoring threshold?

I do think every couple of minutes would be good. Ten minutes seems too long. I don't use jwm, but for my wm -- dwm -- I have a script checking CPU usage every half second. Maybe you'll want to do the same thing.

berndbausch · 05-25-2019, 06:36 PM

Quote:

Originally Posted by jamtat

might it be better to use the 5-minute field for monitoring threshold?

The 5-minute field gives you the load average over five minutes. This means it takes very roughly three times as long to detect a high load condition when you use the 15-minute field. Also, the longer the averaging window, the more it smooths out your load, which may make it harder for the system to detect peaks.

It really depends on factors like:

How long does your high load condition last? If it’s forever, a longer average will eventually lead to a restart.
How much disruption is caused by the restart? If it’s not much, you can restart more often.
How painful is it to work under load? If it’s very painful, restart more often.

One more consideration: What you are doing is an interesting exercise, but the real problem is obviously elsewhere. Where does the load come from - misconfiguration? Not enough RAM? A process that runs under jwm causes problems? This is where you should do your troubleshooting.

jamtat · 05-27-2019, 09:58 AM

Thanks for the input so far. The 15-minute load average seemed to me like the better metric to monitor since there is a greater chance that some legitimate process might meet the designated threshold for 5 minutes. I think about the only thing I ever did that demanded that many CPU cycles over a 5-minute period was video transcoding and I don't do much of that anymore. But in any case the likelihood of a process that demands that kind of cycles extending over a 15-minute period being rogue is greater than the likelihood of one extending over a 5-minute period. So I'll probably go with the 15-minute field in my preliminary testing.

And yes, I do need to determine what exactly is causing this. I did a bit of troubleshooting a few weeks ago and all I was able to determine at that time is that Xorg is using the cycles. The machine has 8 GB RAM and it is not being completely used up so I doubt it's that. So I'll be continuing to try to determine what graphical process/program might lie behind that.

DISPLAY=:0.0 jwm -restart is what seems to work to restart the WM from within an ssh session, btw. CORRECTION: no, that works from a tty. Trying to restart from an ssh session is a different matter and involves its own set of issues. Since I'm focusing in this post on automating this from the host machine (as a cron job), I'm going to set aside the issue of possibly restarting the WM from within an ssh session.
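
Since cron jobs typically start with a sparse environment and no DISPLAY set, a small wrapper can supply the variable explicitly. The following is only a sketch under assumptions: the display is taken to be :0, and the X authority cookie is taken to live in ~/.Xauthority, which some setups require and others do not. X_DISPLAY, X_AUTH_FILE and run_on_display are hypothetical names.

```shell
#!/bin/bash
# Sketch: run a command with the X environment a local login would provide.
# X_DISPLAY, X_AUTH_FILE and run_on_display are hypothetical names.

X_DISPLAY="${X_DISPLAY:-:0}"                      # assumed display
X_AUTH_FILE="${X_AUTH_FILE:-$HOME/.Xauthority}"   # assumed cookie location

# Run the given command with DISPLAY and XAUTHORITY set for that command only.
run_on_display() {
    DISPLAY="$X_DISPLAY" XAUTHORITY="$X_AUTH_FILE" "$@"
}

# Intended use from a cron-launched script:
#   run_on_display /usr/bin/jwm -restart
# Harmless demonstration: show what a child process would see.
run_on_display sh -c 'echo "DISPLAY=$DISPLAY"'
```

Setting the variables per command, rather than exporting them for the whole script, keeps the rest of the script's environment unchanged.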

ondoho · 05-29-2019, 11:19 PM

I appreciate the effort you put into this, and the expertise required to accomplish it.

However, I question the usefulness of the chosen "solution".

You say:

Quote:

Originally Posted by jamtat

It seems Xorg is the culprit process soaking up cycles. The fix is rather simple: I simply restart my window manager (JWM) and CPU usage goes back down to normal levels.

I think you should really find out what is happening there, and try to fix that.

troubleshooting steps:

try a different window manager and see if that fixes it
is there always a certain program open when the freeze happens? a full-blown web browser would be a common culprit. what are you doing with it? are you allowing all javascript? using media a lot?
is your graphics unit fully supported by its driver, i.e. is hardware acceleration available?

PS: if you want my help, you need to provide more code output.

jamtat · 06-03-2019, 04:41 PM

Yeah, I know some further troubleshooting of the high CPU usage issue is needed, ondoho. I did a little of that a few weeks ago but didn't get too far. I've done a little more now and have a potential candidate process other than Xorg. But while I continue my efforts I'm hoping to ensure that the machine, while unattended, doesn't get into a state where it's difficult or impossible to use upon my return--thus the rationale for the script I've tried to create. Besides, troubleshooting the issue, if it comes down to soliciting help here, really belongs in its own thread (look for one later, should my current troubleshooting attempts meet with failure).

Meantime, my script for automating the restart of the WM, for reasons that are not yet clear to me, is so far not working. So I've come up with another related script that should be helpful to my troubleshooting efforts, as follows:

Code:

#!/bin/bash
CPULEVEL15=$(awk '{print $3}' /proc/loadavg | sed -e "s/\.//g") #poll 15 min. CPU load avg., remove decimal point (0.60 becomes 060)
CPUHIST=$(tail -n 40 /home/user/cpu_usage.txt) #file containing record of system's CPU usage grabbed at 5 min. intervals
CPU_THRESHOLD=050 #set 15 min. load avg. threshold above which notification should be sent

#the 10# prefix forces base 10; without it [[ ]] reads numbers with a leading zero as octal
if [[ "10#$CPULEVEL15" -gt "10#$CPU_THRESHOLD" ]] ; then
  #echo "comparison succeeded" # <----test whether comparison is working
  echo -e "Current 15-minute CPU load average is: $CPULEVEL15%\n$CPUHIST" | mail -s "My-host high CPU load alert" me@my-mail.com
fi

exit 0

As may be clear, the script relies on a program like mailx being installed, along with a valid SMTP configuration and an installed SMTP utility (I personally use msmtp). It also relies on another script I created which polls CPU load averages every 5 minutes and saves them to a file (named cpu_usage.txt, located in the user's home directory). It compares the 15-minute average CPU load with a threshold limit set by the user and, if that load is higher than the stipulated threshold, triggers an e-mail notification. I've now set that up as a cron job that runs at 15-minute intervals; testing indicates it should work as intended.
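
The polling script mentioned above is not shown in the thread. As a rough idea of what such a logger could look like, here is a sketch that appends a timestamp and the three load averages to cpu_usage.txt in the home directory. The line format, the LOG_FILE variable and the log_load function are assumptions made for illustration, not the poster's actual script.

```shell
#!/bin/bash
# Sketch of a load logger intended to be run from cron every 5 minutes.
# LOG_FILE and log_load are hypothetical names; the line format is an assumption.

LOG_FILE="${LOG_FILE:-$HOME/cpu_usage.txt}"

# Append "timestamp load1 load5 load15", read from file $1, to file $2.
log_load() {
    local one five fifteen rest
    read -r one five fifteen rest < "$1"
    printf '%s %s %s %s\n' "$(date '+%Y-%m-%d %H:%M')" "$one" "$five" "$fifteen" >> "$2"
}

if [ -r /proc/loadavg ]; then
    log_load /proc/loadavg "$LOG_FILE" || echo "cannot write $LOG_FILE" >&2
fi
```

With one line per sample, the tail -n 40 call in the notification script would then cover a little over three hours of history at 5-minute intervals.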

Likely more to come later.

ondoho · 06-04-2019, 12:57 AM

i prefer to troubleshoot the simplest things first, even if they're less likely to be the cause of the problem.
try another window manager.

jamtat · 06-05-2019, 10:57 PM

The script I created to trigger e-mail notifications when a certain 15-minute CPU load threshold is reached seems to be working great so far. As to troubleshooting strategy and starting with simpler things, a lengthy engagement with computer problems and determining their causes has definitely led me to appreciate that approach and it is one I typically use. In this case it is less applicable since I'm running a WM custom configured to be usable by my wife, and if I switch to some other it may be a barrier to her using the computer. So I'm trying to avoid switching WMs.

When the high load average occurred today, I managed once again to bring loads back down to normal by killing a particular process that runs under Xorg but is neither Xorg itself nor the WM. So it seems I am zeroing in on the true culprit. So perhaps I will wind up modifying my script so that it will kill and then restart that application when high average CPU loads occur, rather than sending me an e-mail notification. Or perhaps both.

MadeInGermany · 06-07-2019, 05:21 AM

If you look at the 15-minute load curve, you'll see that there is a quick rise and a slow fall.
Better to take the minimum of the 5-minute and the 15-minute values; the resulting curve becomes more symmetric, i.e. it takes longer to trigger an alert and shorter to cancel it.
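
The suggestion above can be sketched as follows, assuming a Linux /proc/loadavg and working in integer hundredths so that plain shell arithmetic suffices. The function names to_hundredths and min_load_5_15 are made up for this example.

```shell
#!/bin/bash
# Sketch: use the smaller of the 5- and 15-minute load averages as the alert value.
# to_hundredths and min_load_5_15 are hypothetical names.

# Convert "0.60" to 60; the 10# prefix avoids octal interpretation of "060".
to_hundredths() { echo $(( 10#${1/./} )); }

# Print min(5-minute, 15-minute) load from file $1, in hundredths.
min_load_5_15() {
    local one five fifteen rest
    read -r one five fifteen rest < "$1"
    five=$(to_hundredths "$five")
    fifteen=$(to_hundredths "$fifteen")
    echo $(( five < fifteen ? five : fifteen ))
}

if [ -r /proc/loadavg ]; then
    echo "min(5 min, 15 min) load x100: $(min_load_5_15 /proc/loadavg)"
fi
```

The value printed here would take the place of the 15-minute figure in the threshold comparison: an alert fires only once both averages are above the threshold, and clears as soon as either drops below it.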

ondoho · 06-12-2019, 03:58 AM

Quote:

Originally Posted by jamtat

In this case it is less applicable since I'm running a WM custom configured to be usable by my wife, and if I switch to some other it may be a barrier to her using the computer. So I'm trying to avoid switching WMs.

"Troubleshooting" is not meant to become a solution, just help you find what's going on.

Quote:

Originally Posted by jamtat

When the high load average occurred today, I managed once again to bring loads back down to normal by killing a particular process that runs under Xorg but is neither Xorg itself nor the WM.

I wonder why you aren't telling us what that process is.
Could help to propose real solutions.

Mike_Walsh · 06-13-2019, 07:32 AM

Puppy Linux uses JWM as its default WM. I occasionally get this same problem; re-starting 'X' always seems to 'cure' it, but for me the problem is invariably the same one.

It's not Xorg, or the WM. I'm a long-term Chrome user, and recent versions don't always kill the --nacl-helper process at Chrome startup, after it's done its part of the startup process. Killing the process in mate-system-monitor always brings it back under control. Your problem, however, sounds a bit different to mine; I just wanted to point out that your assertion that JWM isn't responsible is like as not correct.

Mike.

ondoho · 06-15-2019, 01:04 AM

Quote:

Originally Posted by Mike_Walsh

Chrome

Settings => Advanced => Uncheck "keep background processes running when chrome is closed" or some such.

also: closed source, big G, grumble grumble.