Help Me!! High CPU Utilization suddenly

sakusri16 · 07-23-2018, 05:36 AM

Hi

We have Linux CentOS servers in a production environment behind load balancers. Suddenly all instances are using more than 95% CPU. When we check with glances, it shows some kernel processes using the most CPU. When we kill those processes, the load starts again under another kernel process name. Below are the processes that most often show high usage. I have also tried rebooting the server; after about an hour the same thing starts again.

• Watchdog_1
• Bioset
• Kdevtmpfs
• Ib_mcact
• kthrotid

TB0ne · 07-23-2018, 07:35 AM

Quote:

Originally Posted by sakusri16

Hi
We have Linux CentOS servers in a production environment behind load balancers. Suddenly all instances are using more than 95% CPU. When we check with glances, it shows some kernel processes using the most CPU. When we kill those processes, the load starts again under another kernel process name. Below are the processes that most often show high usage. I have also tried rebooting the server; after about an hour the same thing starts again.

• Watchdog_1
• Bioset
• Kdevtmpfs
• Ib_mcact
• kthrotid

Without any sort of details, we can tell you nothing. You don't say what version of CentOS you're using, what load balancers, and (most importantly) what has *CHANGED*. Things just don't happen 'suddenly'... **SOMETHING** changed on your system(s), because if everything was working fine, it wouldn't just start doing something new for no reason. What happened? The processes you mentioned should NOT be killed... they are all system processes. Please look up what each of them does. And what does the top command tell you? Those processes shouldn't be using an appreciable amount of CPU.

Without knowing what's running on those systems, load, what changed, etc., there's no way we can even guess. Have you tried just waiting to see if the process completes, and load drops? What, if any, actual PROBLEMS are you seeing on the system(s) when this happens?

sakusri16 · 07-23-2018, 08:10 AM

Thanks for the update.

We are using CentOS 7, and the servers sit behind an AWS ALB. We have not changed anything on the server side. The processes mentioned above are running as the nginx user. Is that correct? I ask because on some of our other servers the same processes run as root.

For now we have limited those PIDs to 30% CPU to reduce utilization.

Glances output :

CPU   18.1%   user: 14.2%   system: 2.9%   idle: 82.5%   nice: 0.0%   irq: 0.0%   iowait: 0.0%   steal: 0.1%
MEM   22.9%   total: 7.15G   used: 1.64G   free: 5.51G   active: 1.62G   inactive: 991M   buffers: 2.04M   cached: 1.50G
SWAP  0.0%    total: 2.00G   used: 0   free: 2.00G
LOAD  4-core  1 min: 1.58   5 min: 1.35   15 min: 1.75

NETWORK   Rx/s     Tx/s
eth0      45.9Mb   2.81Mb
lo        921Kb    921Kb

DISK I/O  R/s  W/s
xvda1     0    0

FILE SYS   Used   Total
/ (xvda1)  50.3G  75.0G

TASKS 168 (212 thr), 3 run, 165 slp, 0 oth sorted automatically by cpu_percent, flat view

CPU%  MEM%  VIRT   RES    PID    USER   NI  S  TIME+    IOR/s  IOW/s  Command
47.0  1.5   681M   111M   15873  nginx  0   R  7:54.82  0      0      php-fpm: pool www
44.8  1.3   661M   98.5M  15840  nginx  0   S  8:10.40  0      0      php-fpm: pool www
41.2  1.3   668M   93.2M  28266  nginx  0   S  6:48.21  0      0      php-fpm: pool www
32.2  0.1   75.7M  10.9M  30185  nginx  0   S  1h19:18  0      0      rcu_bh
3.8   0.2   225M   15.7M  31542  root   0   R  0:55.51  0      0      /usr/bin/python /bin/glances
3.8   1.2   658M   88.8M  28082  nginx  0   R  7:16.75  0      0      php-fpm: pool www
3.2   1.0   649M   74.1M  21200  nginx  0   S  8:19.25  0      0      php-fpm: pool www
2.9   0.9   650M   69.2M  30675  nginx  0   S  2:55.67  0      0      php-fpm: pool www
Thanks
Srini

TB0ne · 07-23-2018, 09:14 AM

Quote:

Originally Posted by sakusri16

Thanks for the update.
We are using CentOS 7, and the servers sit behind an AWS ALB. We have not changed anything on the server side.

Since you're using AWS, you need to contact Amazon support, and ask if any updates have been applied.

Quote:

The processes mentioned above are running as the nginx user. Is that correct? I ask because on some of our other servers the same processes run as root.

No idea, because you don't say how your server was set up/configured. May be correct; may NOT be.
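One rough way to check is to look at the process itself rather than at its name. On Linux, a genuine kernel thread has an empty /proc/<pid>/cmdline and is a child of kthreadd (PID 2), so a process called something like rcu_bh that fails that test deserves a closer look. A minimal Python sketch, assuming a standard procfs layout; is_kernel_thread and its proc_root parameter are names made up for illustration:

```python
from pathlib import Path


def is_kernel_thread(pid: int, proc_root: str = "/proc") -> bool:
    """Heuristic check: empty cmdline and parent is kthreadd (PID 2)."""
    base = Path(proc_root) / str(pid)
    # Kernel threads have no user-space command line, so this file is empty.
    cmdline = (base / "cmdline").read_bytes()
    ppid = None
    for line in (base / "status").read_text().splitlines():
        if line.startswith("PPid:"):
            ppid = int(line.split()[1])
            break
    # kthreadd itself is PID 2; every other kernel thread is its child.
    return cmdline == b"" and (pid == 2 or ppid == 2)


if __name__ == "__main__":
    import sys

    # Usage: pass one or more PIDs taken from top or glances.
    for arg in sys.argv[1:]:
        print(arg, is_kernel_thread(int(arg)))
```

Saved under any name (check_kthread.py is only a placeholder), it could be run as "python3 check_kthread.py 30185" with the PID shown in glances. Another quick tell in top: real kernel threads show 0 for VIRT, RES and SHR.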

Quote:

For now we have limited those PIDs to 30% CPU to reduce utilization.

Again, **WHAT PROBLEMS** are you seeing???? Are users complaining? Is your system responding very slowly??? What is running on these servers, and what are they doing??? Again, have you looked at the top output, instead of glances?? ALL of these things factor in to system load. And nothing you've posted seems particularly high/bad.
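For a sense of scale, load average is usually read relative to the number of cores. A rough sketch of that arithmetic, using the 4-core figure and the load averages from the glances output posted earlier in the thread; load_per_core is a hypothetical helper name:

```python
def load_per_core(load_avg: float, cores: int) -> float:
    """Return the load average divided by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


if __name__ == "__main__":
    # Figures copied from the glances output: 4 cores, 1/5/15 min load.
    for label, load in (("1 min", 1.58), ("5 min", 1.35), ("15 min", 1.75)):
        print(f"{label}: {load_per_core(load, 4):.2f} per core")
```

As a rule of thumb, values staying below 1.0 per core suggest the run queue is not saturated. On a live box the inputs are available from os.getloadavg() and os.cpu_count() in the Python standard library.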

sakusri16 · 07-23-2018, 09:25 AM

Sorry for the inconvenience

We are using Nginx as the web server, hosting Drupal websites. Customers see some slowness when they access the websites. We have also configured Auto Scaling in AWS: when CPU rises above a certain percentage, the ASG launches new instances. That is the big issue, because we have to pay for those extra servers as a result of this utilization.

TOP Output:

top - 19:53:56 up 6:04, 2 users, load average: 3.14, 2.28, 2.01
Tasks: 175 total, 6 running, 169 sleeping, 0 stopped, 0 zombie
%Cpu(s): 28.9 us, 11.1 sy, 0.0 ni, 59.1 id, 0.0 wa, 0.0 hi, 0.7 si, 0.2 st
KiB Mem : 7492252 total, 3892524 free, 1535144 used, 2064584 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 5490016 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25194 nginx 20 0 713804 135228 51444 R 77.7 1.8 12:29.60 php-fpm
32288 nginx 20 0 77508 10924 4 S 29.0 0.1 93:40.58 md1_raid1
32413 nginx 20 0 679820 88896 41080 R 22.7 1.2 6:58.62 php-fpm
32424 nginx 20 0 674672 86980 44316 R 20.0 1.2 7:01.21 php-fpm
32339 nginx 20 0 672256 88256 45988 R 7.3 1.2 6:38.91 php-fpm
31081 nginx 20 0 666244 82620 46424 S 4.7 1.1 8:31.45 php-fpm
18108 nginx 20 0 660164 73124 45024 S 3.3 1.0 11:54.24 php-fpm
32377 nginx 20 0 674228 81388 39232 S 2.7 1.1 7:30.02 php-fpm
9 root 20 0 0 0 0 R 0.3 0.0 0:28.72 rcu_sched
1709 root 20 0 417856 11604 3052 S 0.3 0.2 0:00.83 fail2ban-server
4614 nginx 20 0 384360 249140 2884 S 0.3 3.3 0:03.67 nginx
5468 root 20 0 161984 2332 1584 R 0.3 0.0 0:00.47 top
32286 nginx 20 0 2448 748 24 S 0.3 0.0 0:00.88 6979c9306630463
1 root 20 0 191140 4104 2608 S 0.0 0.1 0:04.19 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:16.91 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root rt 0 0 0 0 S 0.0 0.0 0:00.09 migration/0
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
10 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 lru-add-drain
11 root rt 0 0 0 0 S 0.0 0.0 0:00.09 watchdog/0
12 root rt 0 0 0 0 S 0.0 0.0 0:00.06 watchdog/1
13 root rt 0 0 0 0 S 0.0 0.0 0:00.04 migration/1
14 root 20 0 0 0 0 S 0.0 0.0 0:00.20 ksoftirqd/1
16 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
17 root rt 0 0 0 0 S 0.0 0.0 0:00.07 watchdog/2
18 root rt 0 0 0 0 S 0.0 0.0 0:00.04 migration/2
19 root 20 0 0 0 0 S 0.0 0.0 0:00.15 ksoftirqd/2
21 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
22 root rt 0 0 0 0 S 0.0 0.0 0:00.06 watchdog/3
23 root rt 0 0 0 0 S 0.0 0.0 0:00.04 migration/3
24 root 20 0 0 0 0 S 0.0 0.0 0:00.15 ksoftirqd/3
26 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
28 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
29 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 netns
30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xenwatch
31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 xenbus

TB0ne · 07-23-2018, 09:39 AM

Quote:

Originally Posted by sakusri16

Sorry for the inconvenience
We are using Nginx as the web server, hosting Drupal websites. Customers see some slowness when they access the websites. We have also configured Auto Scaling in AWS: when CPU rises above a certain percentage, the ASG launches new instances. That is the big issue, because we have to pay for those extra servers as a result of this utilization.

Right. So back to most of my original questions..please answer them.

"Some slowness" is pretty vague as is just mentioning nginx. If you're serving up a ton of pages that are database-driven, with a huge database, and thousands of concurrent users, your system WILL be slow, and need to be scaled up. Your iostat output says you're getting io-bound/waiting. Either add more disk/storage, and move your database to a different drive to lessen IO problems, or re-write some of your web pages. Again, there are FAR too many details to guess at.
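Whether a box is actually waiting on disk shows up in the wa (I/O wait) field of top's %Cpu(s) summary line, which is quicker to check than to guess at. A minimal sketch, assuming the procps-style summary format seen in the top output posted earlier; parse_cpu_summary is a hypothetical helper name:

```python
import re


def parse_cpu_summary(line: str) -> dict:
    """Parse top's '%Cpu(s):' summary line into {field: percent}."""
    # Drop the '%Cpu(s)' label, keep the 'value name' pairs after the colon.
    _, _, rest = line.partition(":")
    fields = {}
    for value, name in re.findall(r"([\d.]+)\s+([a-z]{2})", rest):
        fields[name] = float(value)
    return fields


if __name__ == "__main__":
    # Summary line copied from the top output posted in this thread.
    sample = "%Cpu(s): 28.9 us, 11.1 sy, 0.0 ni, 59.1 id, 0.0 wa, 0.0 hi, 0.7 si, 0.2 st"
    cpu = parse_cpu_summary(sample)
    print("iowait %:", cpu["wa"], "| user %:", cpu["us"], "| system %:", cpu["sy"])
```

A wa value that stays high across several samples is what would point to an I/O bottleneck; iostat -x from the sysstat package gives the per-device detail.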

And as said, CONTACT AMAZON SUPPORT...you are paying for their services.