Maintaining the correct amount of load on a cluster of Linux servers
Hi All,
This is my first time posting a thread here, so please excuse me if I make a mistake.
At the moment I am busy monitoring a cluster of nodes. The problem is that the load sometimes becomes quite high and we have to close a box and wait for the load to drop. It came to a head about a week ago while running month end, when every box in the cluster was closed, so no one could connect to them. As I'm sure you know, once a box is closed, all the users already connected to it stay where they are and no new users are able to connect.
I personally believe this is not the most economical way to reduce the load. We do kill user sessions on the workspace servers to reduce load, but not on this grid; all we do is close a node and then wait for the load to come down. But because that one is closed, new users go to the other nodes, which then spike as well, and if users can't work it reflects badly on us. It is basically just one big horrible cycle.
So basically my question is: is there a better way of going about this? We obviously don't want a situation where none of the users can connect at all. I've tried some research but have found nothing that would actually solve the problem.
Let me say thank you in advance, and if you have any questions at all I'll be more than willing to give more information.
Nobody can give you any meaningful advice because you don't mention what you're using to distribute the load.
So some generic advice.
Check your load balancer and see what kind of algorithm it's using to distribute the sessions. For example, if it's routing based on source IP and all your clients are coming in from behind some form of NAT, presenting the same IP, then routing may be non-optimal: everyone lands on the same node. If it's pure round-robin, check whether your application requires any form of session management; in that case you'll want to ensure the same client always hits the same end-point.
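To make the difference concrete, here is a small Python sketch, purely illustrative (the node names and client IP are made up), showing how source-IP hashing behind NAT collapses everything onto one node while round-robin spreads sessions evenly:
Code:
import itertools

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes

def pick_by_source_ip(client_ip: str) -> str:
    """Source-IP hashing: the same IP always maps to the same node."""
    return NODES[hash(client_ip) % len(NODES)]

rr = itertools.cycle(NODES)

def pick_round_robin() -> str:
    """Pure round-robin: each new session goes to the next node in turn."""
    return next(rr)

# Behind NAT, every client presents the same public IP...
nat_clients = ["203.0.113.10"] * 8  # documentation address from RFC 5737
print([pick_by_source_ip(ip) for ip in nat_clients])  # all on one node

# ...whereas round-robin distributes them regardless of source IP.
print([pick_round_robin() for _ in nat_clients])      # evenly spread

Real balancers (HAProxy, nginx, LVS and the like) implement these policies natively; the point is only that the choice of algorithm interacts with how your clients present themselves.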
Quote: The problem is that the load sometimes becomes quite high and we have to close a box and wait for the load to drop.
What worries me here is: what are you basing this on? Are you referring to CPU% "load" or loadavg?
And are you simply waiting for some arbitrary number to appear, then taking action? Where did this magic number come from?
Is the performance of the cluster (or particular nodes) actually being impacted before you commence the shutdown(s)?
We need some detail. What is that number, how many cores/execution threads are involved, how many tasks are in uninterruptible sleep, and is there any resource contention (CPU/disk/network)?
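For reference, everything asked about above can be read straight out of /proc. A quick sketch, standard library only:
Code:
import os

# 1-, 5- and 15-minute load averages, as reported by /proc/loadavg
with open("/proc/loadavg") as f:
    load1, load5, load15 = map(float, f.read().split()[:3])

cores = os.cpu_count()  # execution threads available to schedule onto

# Tasks in uninterruptible sleep (state "D") usually mean I/O waits;
# they inflate loadavg without using any CPU at all.
d_state = 0
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # stat format is "pid (comm) state ..."; comm may contain
            # spaces, so split on the closing parenthesis first.
            if f.read().rsplit(")", 1)[1].split()[0] == "D":
                d_state += 1
    except OSError:
        pass  # process exited while we were scanning

print(f"loadavg {load1}/{load5}/{load15} on {cores} cores, "
      f"{d_state} task(s) in uninterruptible sleep")

A loadavg well above the core count together with many D-state tasks points at disk or network storage rather than CPU, and the right fix is very different in each case.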
We have one node that each user connects to, and its sole purpose is to decide which node the user is distributed to. I'm not too sure how it decides to distribute; it basically finds out which one can accept. From what I understand it is scripted to give the job to the node with the lowest load, but it doesn't seem to be doing that very well. I also understand that one job can be huge and cause a node to spike really high, while the next one won't even use a fraction of the load, so it's not an exact science. But maybe it is not fixable and this is just how it needs to be. Thank you for your advice, I will check all of it now.
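For what it's worth, dispatchers scripted to "give the job to the node with the lowest load" often misbehave in exactly this way when they look at instantaneous load: a huge job hasn't shown up in loadavg yet by the time the next few users are routed, so they pile onto the same node. A toy sketch of one common fix; the node names, load figures and the pending-session bookkeeping are all invented here, not your actual script (a real dispatcher would pull the numbers from monitoring):
Code:
# Hypothetical least-loaded dispatcher. "reported" would come from your
# monitoring agents; it is hard-coded to keep the sketch runnable.
reported = {
    "node1": {"load1": 6.2, "cores": 8},
    "node2": {"load1": 1.4, "cores": 8},
    "node3": {"load1": 3.0, "cores": 4},
}

# Sessions dispatched since the last load sample. Counting these is the
# usual fix for the spike problem: a freshly placed job does not show up
# in loadavg for a while, so raw least-load keeps choosing the same node
# until the average catches up.
pending = {node: 0 for node in reported}

def pick_node() -> str:
    def cost(node: str) -> float:
        info = reported[node]
        # Normalise by core count and penalise not-yet-visible jobs.
        return (info["load1"] + pending[node]) / info["cores"]
    best = min(reported, key=cost)
    pending[best] += 1
    return best

for _ in range(6):
    print(pick_node())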
Quote: Are you referring to CPU% "load" or loadavg? ... Where did this magic number come from? ... We need some detail.
It is the CPU load. And no, we monitor it the whole time; once it reaches a certain level we have to close, and we open again according to what the client has asked for. The dashboard is a live load view and it refreshes every second. It is a particular node in the cluster.
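Reading between the lines, the close/open decision is a bare threshold on a once-a-second reading. One common refinement is two thresholds with hysteresis, so a node is drained at a high-water mark and only reopened once load has genuinely fallen, instead of flapping. A sketch with made-up thresholds; "close" and "open" here are just stand-ins for whatever your actual drain/undrain commands are:
Code:
# Hypothetical drain/undrain controller with hysteresis. Thresholds and
# samples are invented for illustration; a real controller should also
# refuse to drain the last open node in the cluster.
CLOSE_ABOVE = 0.9   # drain when load per core exceeds this...
OPEN_BELOW = 0.6    # ...and only reopen once it falls below this

def update(node: str, load_per_core: float, closed: set) -> None:
    if node not in closed and load_per_core > CLOSE_ABOVE:
        closed.add(node)          # stand-in for your real "close" action
        print(f"draining {node} at {load_per_core:.2f}")
    elif node in closed and load_per_core < OPEN_BELOW:
        closed.remove(node)       # stand-in for your real "open" action
        print(f"reopening {node} at {load_per_core:.2f}")

closed: set = set()
for sample in [0.5, 0.95, 0.8, 0.7, 0.55]:  # fake once-a-second readings
    update("node1", sample, closed)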
So, when the load is high, have you looked at the running processes or reviewed the logs? You need to determine the cause of the high load before a solution can be implemented.
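For example, capturing the top CPU consumers the moment the dashboard spikes gives you something concrete to correlate with user sessions. This uses the procps ps command via Python's standard library; the equivalent ps one-liner or "top -b -n 1" works just as well:
Code:
import subprocess

# Snapshot the busiest processes by CPU. Run this whenever the dashboard
# spikes and keep the output alongside the timestamp.
snapshot = subprocess.run(
    ["ps", "-eo", "pid,user,%cpu,%mem,comm", "--sort=-%cpu"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

print("\n".join(snapshot[:11]))  # header line plus top ten processes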