Extremely high load average around 03:30 (AM) each night

pan64 · 11-03-2023, 05:15 AM

You ran pidstat a lot. What did it show?

elgholm · 11-07-2023, 04:39 AM

Quote:

Originally Posted by pan64

You ran pidstat a lot. What did it show?

They tell me nothing of interest... I temporarily removed the logging to try other things (and didn't want it affecting the loadavg), but I'll put it back now and send you a list tomorrow!

This was uptime from tonight:
...
03:33:08 up 205 days, 7:20, 0 users, load average: 0.05, 0.08, 0.09
03:33:09 up 205 days, 7:20, 0 users, load average: 0.05, 0.08, 0.09
03:33:10 up 205 days, 7:20, 0 users, load average: 0.05, 0.08, 0.09
03:33:11 up 205 days, 7:20, 0 users, load average: 0.05, 0.08, 0.09
03:33:12 up 205 days, 7:20, 0 users, load average: 93.58, 19.47, 6.36
03:33:13 up 205 days, 7:20, 0 users, load average: 93.58, 19.47, 6.36
03:33:14 up 205 days, 7:20, 0 users, load average: 93.58, 19.47, 6.36
03:33:15 up 205 days, 7:20, 0 users, load average: 93.58, 19.47, 6.36
03:33:16 up 205 days, 7:20, 0 users, load average: 93.58, 19.47, 6.36
...
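
A log like the one above can be produced with a one-second loop and then scanned for the first jump. What follows is only a minimal sketch: the function name first_spike, the log path and the threshold of 10 are assumptions for illustration, not necessarily what was used here.

```shell
#!/bin/sh
# Print the first line of an uptime log whose 1-minute load average
# exceeds a threshold. Reads the log from stdin.
# Usage: first_spike THRESHOLD < uptime.log
first_spike() {
    awk -v limit="$1" '
        # uptime prints "... load average: 1min, 5min, 15min"
        match($0, /load average: [0-9.]+/) {
            # skip the 14-character "load average: " prefix
            load = substr($0, RSTART + 14, RLENGTH - 14)
            if (load + 0 > limit + 0) { print; exit }
        }
    '
}

# Collecting the log (runs until interrupted):
#   while sleep 1; do uptime; done >> /tmp/uptime.log
# Scanning it afterwards:
#   first_spike 10 < /tmp/uptime.log
```

The sketch assumes the load values use a dot as decimal separator, as in the log above.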

pan64 · 11-07-2023, 04:51 AM

you can run a ps too to list processes and compare them before/after
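
A before/after comparison could look roughly like this. It is a sketch only: the function names and the /tmp paths are made up for illustration, and it only catches processes that are still alive when the second snapshot is taken.

```shell
#!/bin/sh
# Write a sorted list of "pid command-line" lines to the given file.
ps_snapshot() {
    ps -e -o pid= -o args= | sort > "$1"
}

# Show lines present only in the second snapshot (new processes).
# Both files must be sorted, which ps_snapshot takes care of.
new_since() {
    comm -13 "$1" "$2"
}

# Usage:
#   ps_snapshot /tmp/ps.before
#   ... wait for the spike ...
#   ps_snapshot /tmp/ps.after
#   new_since /tmp/ps.before /tmp/ps.after
```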

chrism01 · 11-07-2023, 08:26 PM

Did you ever try process accounting?
Given that you're still chasing this one down, it'd be worth a try.
https://www.networkworld.com/article...-on-linux.html

elgholm · 11-08-2023, 01:14 AM

Quote:

Originally Posted by pan64

you can run a ps too to list processes and compare them before/after

I've done so, many times. I even have a script that starts automatically as soon as I get a high loadavg and spits out a bunch of ps, top, iotop, iostat stuff... Nothing looks out of the ordinary. It seems to be an extreme spike for a very short period of time. But, yeah, it _didn't_ happen for the last two days... Don't ask me why... =)
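
For readers who want something similar, a load-triggered capture can be sketched as below. This is not the actual script from this thread: the function names, the threshold, the output directory and the chosen ps columns are all assumptions, and the --sort option assumes the procps version of ps.

```shell
#!/bin/sh
# Return success if the 1-minute load average in FILE exceeds THRESHOLD.
# FILE defaults to /proc/loadavg, whose first field is the 1-minute value.
load_exceeds() {
    awk -v limit="$1" '
        NR == 1 { over = ($1 + 0 > limit + 0) }
        END     { exit !over }
    ' "${2:-/proc/loadavg}"
}

# Write uptime and a per-thread ps listing into a timestamped file
# inside the directory given as $1 (hypothetical file naming).
capture() {
    out="$1/spike-$(date +%Y%m%d-%H%M%S).log"
    {
        uptime
        ps -eLo pid,lwp,stat,pcpu,comm --sort=-pcpu | head -n 50
    } > "$out" 2>&1
}

# Usage: poll once per second, capture while the load is above 10.
#   while sleep 1; do
#       load_exceeds 10 && capture /var/tmp
#   done
```

The ps listing uses -L so that individual threads show up, since a short burst of threads would otherwise be hidden behind a single process line.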

elgholm · 11-08-2023, 01:18 AM

Quote:

Originally Posted by chrism01

Did you ever try process accounting?
Given that you're still chasing this one down, it'd be worth a try.
https://www.networkworld.com/article...-on-linux.html

No. There already is some standard process accounting installed - default - so I can run a lot of these commands and get output. But I only see accumulated information, like loadavg and such - not which processes are involved. This I think is a little bit weird, since I can see in my top output that Linux spawns a bunch of process-gathering processes (python scripts and various other stuff) when it believes loadavg is too high. This is done by default, I haven't installed or set this up. But for the life of me I can't figure out which command to run, or where to look, to actually see the process information detail gathered by these utilities - but I think it should be there somewhere. I don't really see why the accounting processes should spawn otherwise.

I'll try and look through your link, and see if that pushes me in the right direction. This is a production server, so I'd rather not install a bunch of new stuff on it - or restart the machine (heavens no!)...

PS. Fun fact, the link doesn't work in Chrome - can't scroll the page, JavaScript errors (probably because of errors in the cookie dialogue) - but works in Firefox.

MadeInGermany · 11-08-2023, 03:52 AM

Then check your Chrome settings. It works with mine.

Display everything in one pidstat:

Code:

pidstat -urdwhl 1 400

-h combine to one line per process
-l long command (args)

If this is a container and you don't see relevant things then run it on the container's host.
If on the host you still do not see any high per-process values then include the kernel tasks:

Code:

pidstat -p ALL -urdwhl 2 400

pan64 · 11-08-2023, 05:59 AM

hm. I don't know if load average counts the threads (LWPs) or processes. Also I don't know how pidstat works with threads (try -t -v). Probably it is just a single multithreaded application.

boughtonp · 11-08-2023, 07:41 AM

Quote:

Originally Posted by elgholm

This is a production server, so I'd rather not install a bunch of new stuff on it - or restart the machine (heavens no!)...

I don't see where restarting was suggested, but a server which cannot go down for maintenance is a disaster waiting to happen.

Anything important enough that it must stay online is important enough to have sufficient redundancy such that any single server can be swapped out of a pool for a while without causing issues.

elgholm · 11-09-2023, 12:53 AM

Quote:

Originally Posted by MadeInGermany

Then check your Chrome settings. It works with mine.

Display everything in one pidstat:

Code:

pidstat -urdwhl 1 400

-h combine to one line per process
-l long command (args)

If this is a container and you don't see relevant things then run it on the container's host.
If on the host you still do not see any high per-process values then include the kernel tasks:

Code:

pidstat -p ALL -urdwhl 2 400

Thanks a bunch! Will try these tonight!

metaed · 11-10-2023, 12:42 PM

Quote:

Originally Posted by elgholm

There already is some standard process accounting installed - default - so I can run a lot of these commands and get output. But I only see accumulated information, like loadavg and such - not which processes are involved. This I think is a little bit weird, since I can see in my top output that Linux spawns a bunch of process-gathering processes (python scripts and various other stuff) when it believes loadavg is too high. This is done by default, I haven't installed or set this up. But for the life of me I can't figure out which command to run, or where to look, to actually see the process information detail gathered by these utilities - but I think it should be there somewhere. I don't really see why the accounting processes should spawn otherwise.

Okay this is my fault. You said earlier you were already getting summary statistics from process accounting, and therefore I assumed you had turned it on. Now I'm pretty sure you are unfamiliar with process accounting, so never turned it on. It is very common for a Linux distro to come with process accounting support but not actually start it at boot time. The distro I run, Slackware, checks for the existence of the log file at boot time, and when it exists, starts process accounting. But if the log file doesn't exist, it doesn't bother.

You need the accton command. This tells the kernel to start (or stop) writing every process termination to a file. For usage, check man 8 accton.

Before running accton, /var/log/pacct should be created if it doesn't exist. For security reasons, /var/log/pacct should not be world readable. You can touch /var/log/pacct and then chmod 640 /var/log/pacct.

The startup command is typically: accton /var/log/pacct. And when you're done collecting data, use accton off to stop process accounting so you don't fill up your drive later.
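
Put together, the steps above can be sketched as a few small functions. The function names and the PACCT variable are made up for illustration; the lastcomm call assumes the GNU acct tools (often packaged as psacct or acct) are installed.

```shell
#!/bin/sh
# Assumed log location, as used in the text above.
PACCT=${PACCT:-/var/log/pacct}

# Create the log file if missing and make it non-world-readable.
prepare_pacct() {
    touch "$1" && chmod 640 "$1"
}

# Start process accounting into the given file (needs root).
start_pacct() {
    prepare_pacct "$1" && accton "$1"
}

# Stop accounting so the log stops growing.
stop_pacct() {
    accton off
}

# Usage (as root):
#   start_pacct "$PACCT"
#   ... wait for the spike ...
#   lastcomm -f "$PACCT" | head -n 100    # list recorded commands
#   stop_pacct
```

Because the kernel writes a record only when a process terminates, anything still running at the time of the spike will not appear until it exits.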

elgholm · 11-11-2023, 04:18 AM

Quote:

Originally Posted by metaed

Okay this is my fault. You said earlier you were already getting summary statistics from process accounting, and therefore I assumed you had turned it on. Now I'm pretty sure you are unfamiliar with process accounting, so never turned it on. It is very common for a Linux distro to come with process accounting support but not actually start it at boot time. The distro I run, Slackware, checks for the existence of the log file at boot time, and when it exists, starts process accounting. But if the log file doesn't exist, it doesn't bother.

You need the accton command. This tells the kernel to start (or stop) writing every process termination to a file. For usage, check man 8 accton.

Before running accton, /var/log/pacct should be created if it doesn't exist. For security reasons, /var/log/pacct should not be world readable. You can touch /var/log/pacct and then chmod 640 /var/log/pacct.

The startup command is typically: accton /var/log/pacct. And when you're done collecting data, use accton off to stop process accounting so you don't fill up your drive later.

Perfect! Thank you so much! Will look into this on Monday!
Sorry if I confused you. What _is_ installed is some kind of monitoring program, since I see it waking up when shit happens. But it's most probably not the process accounting stuff you write about above; instead it's some sort of process information gathering utilities. They've been enabled by the default installation (RHEL clone).

elgholm · 11-14-2023, 06:29 AM

There's some process that creates an extreme number of threads.
So, as I already suspected, there's not really high CPU usage. Instead there are around 3k threads starting up, probably trampling on each other's toes, which spikes the load average. I'm now going to try running a ps command that shows me the number of threads per process.
So far I've been trying to find high CPU usage and/or high I/O (blocking), dead processes/zombies, but haven't found anything yet.
Too bad accton only logs processes, not thread creation.
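
A per-process thread count can be had from ps along these lines. This is only a sketch: the function name is made up, and it assumes the procps ps, where -L prints one line per thread.

```shell
#!/bin/sh
# Count threads per process from "PID COMMAND" lines (one line per
# thread) and print "COUNT PID COMMAND", largest counts first.
# Only the first word of the command name is kept.
threads_per_process() {
    awk '{ pid = $1; count[pid]++; name[pid] = $2 }
         END { for (p in count) print count[p], p, name[p] }' |
        sort -k1,1nr -k2,2n
}

# Usage:
#   ps -eL -o pid= -o comm= | threads_per_process | head
# procps can also report the thread count directly:
#   ps -eo nlwp,pid,comm --sort=-nlwp | head
```

Either form only shows threads that exist at the instant ps runs, so for a very short spike it has to be run repeatedly or triggered by the load itself.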

pan64 · 11-14-2023, 06:38 AM

see post #38, pidstat has a -t flag

MadeInGermany · 11-14-2023, 09:41 AM

Ah yes, -t breaks it down to process threads (LWPs).

Anyway, the main pid should have the sum of the threads.