[SOLVED] How to troubleshoot why my server is slow?
I have a three-server setup for an OpenStack deployment. All three nodes are (almost) identical in which OpenStack services are installed on them; the node in question runs a few more control-plane API services related to networking. They all have 2 CPUs and 48 GB of RAM and run CentOS 7.5 Minimal.
Now my problem is that this machine is extremely slow. It takes several minutes to SSH into it, and running a command (# docker stats) takes about 10 minutes to produce output. The problem started a couple of days ago. I thought it might be the RAM, but a test of the physical RAM came back fine. I'm not sure how to troubleshoot this, since my other two nodes (with the same software) aren't slow like this one. After a restart the machine works just as it should for an hour or two; then it starts slowing down, and top shows the memory usage and buff/cache climbing constantly. For reference, the server has been running all night and free -m shows this:
Code:
[root@node-1 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          48123       37710         291           8       10121        1518
Swap:         13939        6918        7021
My other node (this one runs just fine) has this output for free -m:
Code:
[root@node-2 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:          48000       23239       23263           4        1497       24074
Swap:         13939        7082        6857
If anyone can help point me in the right direction so I can learn to fix this issue I would appreciate it. Thanks.
EDIT: The architecture is very specific to OpenStack. I am running mariadb, nova, heat, cinder/ceph, ironic, glance, keystone, and rabbitmq on all nodes. Node-1 also has the core neutron APIs running on it; that's the only difference between the software installed on each node. The cloud is not 'in use' right now, as I am still trying to get everything finalized. Two VMs are running at the moment for testing the cloud.
Here is the output of top on the slow node. This was about 3 hours after I restarted it; the free RAM keeps going down until it hits 300 MB, and at that point it takes several minutes to SSH into the machine.
Is your cloud in production? Is it running any workloads? A few words about the architecture might help.
Use vmstat to confirm that there is paging activity.
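To confirm paging without watching vmstat interactively, you can also sample the kernel's swap counters directly; vmstat's si/so columns come from the same /proc/vmstat file. A rough sketch (the 5-second window is arbitrary):

```shell
#!/bin/sh
# Sample swap-in/swap-out page counters twice and compare.
# If pswpin/pswpout grow between samples, the box is actively paging.
read_swaps() {
    awk '/^pswpin|^pswpout/ {print $1, $2}' /proc/vmstat
}
before=$(read_swaps)
sleep 5
after=$(read_swaps)
echo "before: $before"
echo "after:  $after"
```

If the numbers climb steadily, the slowdown is at least partly the machine thrashing in swap, which would match the free -m output above.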
The first node uses 10GB of buffer cache, the other 1.4GB. Find out where the difference comes from - it has something to do with file IO. How do these “identical” nodes differ? Perhaps you run cinder-volume on the first node, with a file backend, and your instances use volumes a lot?
Run top to get an idea which processes use the most CPU and memory. The CPU users might also be the heaviest file I/O users.
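For a one-shot view instead of interactive top, ps can sort by resident memory or CPU (procps syntax, as shipped with CentOS 7); a sketch:

```shell
# Top 10 processes by resident memory (RSS, in kB)
ps -eo pid,rss,pcpu,comm --sort=-rss | head -n 11

# Top 10 by CPU share
ps -eo pid,rss,pcpu,comm --sort=-pcpu | head -n 11
```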
Disk problems can also cause extreme slowness. I assume you already checked the logs for disk drive errors?
Also, the machine is swapping, and that will slow things down.
How is the network load? https://linux.die.net/man/1/nload
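If nload isn't installed, per-interface byte counters can be read straight from /proc/net/dev; a rough sketch that prints the transfer rate over a 5-second window:

```shell
#!/bin/sh
# Snapshot rx/tx byte counters per interface, wait, snapshot again,
# then print the per-second deltas.
snap() {
    awk -F'[: ]+' 'NR>2 {print $2, $3, $11}' /proc/net/dev | sort
}
snap > /tmp/net.before
sleep 5
snap > /tmp/net.after
join /tmp/net.before /tmp/net.after | \
    awk '{printf "%-10s rx %10d B/s  tx %10d B/s\n", $1, ($4-$2)/5, ($5-$3)/5}'
```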
The cloud is deployed and working; there are a few networks that were made with neutron (OpenStack's networking SDN), but I am only running two VMs at the moment. One of them runs on the slow node and the other on a node that works just fine. This issue started only a few days ago, and I can't think of anything that changed between when the machine was working normally and now, when it's super slow.

Using top only shows me the total RAM usage, and the programs themselves don't show any significant RAM usage either. Everything runs in docker containers, and with 'docker stats' the highest usage by any container is only around 1gb. There are 50 containers running and most of them use around 100mb of RAM; the containers with high usage are around 500mb, but that's only 1 or 2 containers.

Over time the RAM usage just keeps going up, which leads me to believe the system is letting the disk cache take over unused RAM. However, the system gets slower and slower while that happens, which makes me think the disk cache isn't releasing RAM when something else needs it, but I don't know how to check or confirm this.
P.S. This setup is not in use yet. I am an intern who was given the project of setting up this cloud, and I'd never worked with Linux before this internship either, so I am a super newbie. I appreciate all the help, thanks!
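On the question of whether the disk cache is releasing RAM: the kernel exposes an estimate of that in /proc/meminfo. MemAvailable is roughly how much memory can be used without swapping; if it stays low while buff/cache is huge, the cache is not freely reclaimable (dirty pages, or tmpfs/shmem counted as cache). A quick sketch:

```shell
# Print the memory fields that distinguish reclaimable cache from
# memory that only looks like cache. Values in /proc/meminfo are in kB.
awk '/^MemFree:|^MemAvailable:|^Buffers:|^Cached:|^Dirty:|^Shmem:/ \
     {printf "%-14s %8d MB\n", $1, $2/1024}' /proc/meminfo
```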
No surprise - the load average is way too high for 2 CPUs, the wa% is way too high, and there are a bunch of important workloads in state "D". All related.
Given the swap usage as well, I'd bet the disk response time is crap and is tying up the whole system.
Just a guess tho' ...
Get something like sysstat to see the disk response time.
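With sysstat installed (yum install sysstat on CentOS 7), "iostat -x 5" shows await (average request latency in ms) and %util per device. As a no-install alternative, the raw counters in /proc/diskstats can be sampled by hand; a rough sketch (the device-name pattern is an assumption, adjust for your disks):

```shell
#!/bin/sh
# Sample per-disk I/O counters twice, then compute the average request
# latency and utilization over the window -- roughly what iostat -x shows.
snap() {
    # name, requests completed, ms spent on requests, ms the disk was busy
    awk '$3 ~ /^(sd|vd|nvme)/ {print $3, $4+$8, $7+$11, $13}' /proc/diskstats | sort
}
snap > /tmp/disk.before
sleep 5
snap > /tmp/disk.after
join /tmp/disk.before /tmp/disk.after | awk '{
    ios  = $5 - $2                      # requests completed in the window
    ms   = $6 - $3                      # ms spent waiting on them
    util = ($7 - $4) / 50               # busy ms / 5000 ms window * 100
    printf "%-8s await %6.1f ms  util %5.1f%%\n", $1, ios ? ms/ios : 0, util
}'
```

Sustained await in the tens or hundreds of milliseconds with utilization near 100% points at a saturated or failing disk.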
It looks like the high load average is caused by processes waiting to do disk I/O - hence the large buffer cache. These are processes with a state of D (uninterruptible sleep); your top output shows the DB, nova-conductor and a large number of Neutron API servers.
Why is the DB process so large? Can you check what's in the DB? And why are so many API requests made to Neutron (at least that's my guess, seeing the Neutron servers waiting for disk I/O)? Check their logs. Perhaps you use DEBUG logging; switching this off would then improve the situation, but in the end you need to understand where all the Neutron activity comes from.
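To check for DEBUG logging quickly, a sketch - the config paths here are a guess based on a typical CentOS layout, so adjust for your deployment (containerized services may keep their configs elsewhere):

```shell
# List any service config that still has debug = True set.
grep -ri '^debug *= *true' /etc/neutron /etc/nova /etc/cinder 2>/dev/null \
    || echo "no DEBUG flags found in those paths"
```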
Perhaps there are other processes in the D state. top should be able to filter for them. Figure out what they are, there might be more clues.
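A simple ps one-liner (a sketch) lists everything currently in uninterruptible sleep, and the wchan column hints at which kernel routine each process is blocked in:

```shell
# Processes in state D are blocked in the kernel, almost always on I/O.
# Run this a few times; processes that *stay* in D are the ones to chase.
ps -eo state,pid,wchan:30,comm | awk 'NR==1 || $1 ~ /^D/'
```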
Also compare the top and vmstat output of this server with the two others. Are you deploying Neutron on this one controller only?
Finally, use ask.openstack.org and the OpenStack mailing list for other opinions.