Hello!
This is my first time around these forums, so I hope I'm in the right place. I usually manage to solve my Linux-related problems just by googling, but this time I'd like to clarify some doubts I'm having.
We are developing a device that runs a version of embedded Linux with a simple lighttpd web server for control and configuration.
The device is used for various CPU-intensive tasks. I'm trying to determine how many of these tasks we can run simultaneously and use that to improve performance where possible.
To start off somewhere, I decided to monitor the CPU usage and system load average over time. I'm fairly new to all of this, and the load average in particular is something I'm still not very familiar with. I've been reading everything I can about it and I think I have a rough understanding of what it measures and how to interpret it, but I'd like to clarify a couple of doubts that I still have:
1) The load average shows how many processes are using or waiting for system resources. Does this mean that, as long as the load doesn't increase over time, all of them are being processed successfully (leaving aside whatever delay that entails)?
2) Conversely, would a slowly increasing load average mean that processes are piling up because there aren't enough system resources?
3) A high load average means an increase in response time. So, if latency is not an issue, then a high load average (say 20 or so on a single-core processor) wouldn't be a problem either, as long as it doesn't keep increasing over time?
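For context, this is roughly how I've been sampling those numbers; just a one-shot sketch (the /proc/loadavg field layout is from proc(5)) that I run periodically to watch the trend:

```shell
#!/bin/sh
# One-shot sample of the load averages next to the CPU count; I run
# this from a loop or cron job to log the trend over time.
# Assumes /proc/cpuinfo lists one "processor" line per core, which may
# vary on some embedded kernels.
CPUS=$(grep -c '^processor' /proc/cpuinfo)
# /proc/loadavg fields: 1min 5min 15min running/total last_pid
read LOAD1 LOAD5 LOAD15 REST < /proc/loadavg
echo "cpus=$CPUS load1=$LOAD1 load5=$LOAD5 load15=$LOAD15"
```

My reading of question 3 in these terms: on a single core, a sustained 1-minute load of 20 would mean roughly 19 tasks queued on average while one runs.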
This all started with the device I mentioned above. I increased the number of tasks until I reached a CPU usage (single core) of about 80%; however, I noticed the web server started slowing down and pages took longer to load. Looking at top I see something like this (I've trimmed the bottom; everything there showed 0 CPU and 0 MEM):
Code:
top - 19:17:26 up 18 min, 1 user, load average: 4.59, 4.86, 2.84
Tasks: 69 total, 1 running, 68 sleeping, 0 stopped, 0 zombie
Cpu(s): 78.1%us, 20.3%sy, 0.0%ni, 0.6%id, 0.0%wa, 0.0%hi, 1.0%si, 0.0%st
Mem: 221936k total, 44588k used, 177348k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 14892k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1308 root 20 0 494m 10m 2788 S 78.1 4.8 12:01.21 main_app
22 root 20 0 0 0 0 S 1.0 0.0 0:02.48 kworker/0:1
1359 root 20 0 2644 1204 632 S 0.3 0.5 0:03.64 lighttpd
4289 root 20 0 2428 1128 876 R 0.3 0.5 0:05.34 top
26722 root 20 0 2860 760 592 S 0.3 0.3 0:00.01 sh
26735 root 20 0 2724 372 312 S 0.3 0.2 0:00.01 sleep
All of the processing threads are spawned inside the main_app process (I can see them by pressing Shift+H in top). Most of them are waiting in sleep or mutex routines.
I'm not sure how to narrow down what is causing the increase in load average. I can clearly see that there's basically no idle time left; does this mean the CPU is the bottleneck at this point?
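In case it's useful, this is the kind of per-thread check I had in mind for narrowing it down (a rough sketch; it defaults to the shell's own PID just so it runs anywhere, but on the device I'd pass main_app's PID, 1308 in the top output above):

```shell
#!/bin/sh
# Sketch: list cumulative per-thread CPU time (utime/stime, in clock
# ticks) for one process, to spot which thread is burning the CPU.
# Sampling this twice and diffing would give current usage.
PID=${1:-$$}
for t in /proc/"$PID"/task/*/stat; do
    [ -r "$t" ] || continue
    # stat fields: 1=tid, 2=(comm), 14=utime, 15=stime
    # (assumes the thread name contains no spaces)
    awk '{printf "tid=%s comm=%s utime=%s stime=%s\n", $1, $2, $14, $15}' "$t"
done
```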
I hope someone can help me out with this, at least with the questions about the load average; the rest is probably something I'll have to figure out on my own, but I wanted to provide some background on where these questions are coming from.
Thank you!
-- Sycc