Maintaining the correct amount of load on a cluster of Linux servers
Hi All,
This is my first time posting a thread here, so please excuse me if I make a mistake.
At the moment I am busy monitoring a cluster of nodes. The problem is that the load sometimes becomes quite high and we have to close a box and wait for the load to drop. It came to a head about a week ago while running month end, when every box in the cluster was closed, so no one could connect to them. As I'm sure you know, once a box is closed, all the users already connected to it stay where they are and no new users are able to connect.
I personally believe this is not the most economical way to reduce the load. We do kill user sessions on the workspace servers to reduce load, but not on this grid; all we do is close a node and then wait for the load to come down. But because that one is closed, new users go to the other nodes, which then spike as well, and if users can't work it reflects badly on us. It is basically just one big horrible cycle.
So basically my question is: is there a better way of going about this? We obviously don't want a situation where none of the users can connect at all. I've tried some research but have found nothing that would actually solve the problem.
Let me say thank you in advance, and if you have any questions at all I'll be more than willing to give more information.
Nobody can give you any meaningful advice because you don't mention what you're using to distribute the load.
So some generic advice.
Check your load balancer and see what kind of algorithm it's using to distribute the sessions. For example, if it's routing based on source IP and all your clients are coming in from behind some form of NAT, presenting the same IP, then routing may be non-optimal: everyone lands on the same node. If it's pure round-robin, check whether your application requires any form of session management; in that case you'll want to ensure the same client always hits the same end-point.
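To make the difference concrete, here is a small Python sketch, purely illustrative (the node names and client IP are made up), showing how source-IP hashing behind NAT collapses everything onto one node while round-robin spreads sessions evenly:
Code:
import itertools

NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster nodes

def pick_by_source_ip(client_ip: str) -> str:
    """Source-IP hashing: the same IP always maps to the same node."""
    return NODES[hash(client_ip) % len(NODES)]

rr = itertools.cycle(NODES)

def pick_round_robin() -> str:
    """Pure round-robin: each new session goes to the next node in turn."""
    return next(rr)

# Behind NAT, every client presents the same public IP...
nat_clients = ["203.0.113.10"] * 8  # documentation address from RFC 5737
print([pick_by_source_ip(ip) for ip in nat_clients])  # all on one node

# ...whereas round-robin distributes them regardless of source IP.
print([pick_round_robin() for _ in nat_clients])      # evenly spread

Real balancers (HAProxy, nginx, LVS and the like) implement these policies natively; the point is only that the choice of algorithm interacts with how your clients present themselves.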
Quote: The problem is that the load sometimes becomes quite high and we have to close a box and wait for the load to drop.
What worries me here is: what are you basing this on? Are you referring to CPU% "load" or loadavg?
And are you simply waiting for some arbitrary number to appear, then taking action? Where did this magic number come from?
Is the performance of the cluster (or particular nodes) actually being impacted before you commence the shutdown(s)?
We need some detail. What is that number, how many cores/execution threads are involved, how many tasks are in uninterruptible sleep, and is there any resource contention (CPU/disk/network)?
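For reference, everything asked about above can be read straight out of /proc. A quick sketch, standard library only:
Code:
import os

# 1-, 5- and 15-minute load averages, as reported by /proc/loadavg
with open("/proc/loadavg") as f:
    load1, load5, load15 = map(float, f.read().split()[:3])

cores = os.cpu_count()  # execution threads available to schedule onto

# Tasks in uninterruptible sleep (state "D") usually mean I/O waits;
# they inflate loadavg without using any CPU at all.
d_state = 0
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/stat") as f:
            # stat format is "pid (comm) state ..."; comm may contain
            # spaces, so split on the closing parenthesis first.
            if f.read().rsplit(")", 1)[1].split()[0] == "D":
                d_state += 1
    except OSError:
        pass  # process exited while we were scanning

print(f"loadavg {load1}/{load5}/{load15} on {cores} cores, "
      f"{d_state} task(s) in uninterruptible sleep")

A loadavg well above the core count together with many D-state tasks points at disk or network storage rather than CPU, and the right fix is very different in each case.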
We have one node that each user connects to, and its sole purpose is to decide which node the user is distributed to. I'm not too sure how it decides to distribute; it basically finds out which one can accept. From what I understand it is scripted to give the job to the node with the lowest load, but it doesn't seem to be doing that very well. I also understand that one job can be huge and cause a node to spike really high, while the next one won't even use a fraction of the load, so it's not an exact science. But maybe it is not fixable and this is just how it needs to be. Thank you for your advice, I will check all of it now.
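For what it's worth, dispatchers scripted to "give the job to the node with the lowest load" often misbehave in exactly this way when they look at instantaneous load: a huge job hasn't shown up in loadavg yet by the time the next few users are routed, so they pile onto the same node. A toy sketch of one common fix; the node names, load figures and the pending-session bookkeeping are all invented here, not your actual script (a real dispatcher would pull the numbers from monitoring):
Code:
# Hypothetical least-loaded dispatcher. "reported" would come from your
# monitoring agents; it is hard-coded to keep the sketch runnable.
reported = {
    "node1": {"load1": 6.2, "cores": 8},
    "node2": {"load1": 1.4, "cores": 8},
    "node3": {"load1": 3.0, "cores": 4},
}

# Sessions dispatched since the last load sample. Counting these is the
# usual fix for the spike problem: a freshly placed job does not show up
# in loadavg for a while, so raw least-load keeps choosing the same node
# until the average catches up.
pending = {node: 0 for node in reported}

def pick_node() -> str:
    def cost(node: str) -> float:
        info = reported[node]
        # Normalise by core count and penalise not-yet-visible jobs.
        return (info["load1"] + pending[node]) / info["cores"]
    best = min(reported, key=cost)
    pending[best] += 1
    return best

for _ in range(6):
    print(pick_node())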
Quote: Are you referring to CPU% "load" or loadavg? ... Where did this magic number come from? ... We need some detail.
It is the CPU load. And no, we monitor it the whole time; once it reaches a certain level we have to close, and we open again according to what the client has asked for. The dashboard is a live load view and it refreshes every second. It is a particular node in the cluster.
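Reading between the lines, the close/open decision is a bare threshold on a once-a-second reading. One common refinement is two thresholds with hysteresis, so a node is drained at a high-water mark and only reopened once load has genuinely fallen, instead of flapping. A sketch with made-up thresholds; "close" and "open" here are just stand-ins for whatever your actual drain/undrain commands are:
Code:
# Hypothetical drain/undrain controller with hysteresis. Thresholds and
# samples are invented for illustration; a real controller should also
# refuse to drain the last open node in the cluster.
CLOSE_ABOVE = 0.9   # drain when load per core exceeds this...
OPEN_BELOW = 0.6    # ...and only reopen once it falls below this

def update(node: str, load_per_core: float, closed: set) -> None:
    if node not in closed and load_per_core > CLOSE_ABOVE:
        closed.add(node)          # stand-in for your real "close" action
        print(f"draining {node} at {load_per_core:.2f}")
    elif node in closed and load_per_core < OPEN_BELOW:
        closed.remove(node)       # stand-in for your real "open" action
        print(f"reopening {node} at {load_per_core:.2f}")

closed: set = set()
for sample in [0.5, 0.95, 0.8, 0.7, 0.55]:  # fake once-a-second readings
    update("node1", sample, closed)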
So, when the load is high, have you looked at the running processes or reviewed the logs? You need to determine the cause of the high load before a solution can be implemented.
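For example, capturing the top CPU consumers the moment the dashboard spikes gives you something concrete to correlate with user sessions. This uses the procps ps command via Python's standard library; the equivalent ps one-liner or "top -b -n 1" works just as well:
Code:
import subprocess

# Snapshot the busiest processes by CPU. Run this whenever the dashboard
# spikes and keep the output alongside the timestamp.
snapshot = subprocess.run(
    ["ps", "-eo", "pid,user,%cpu,%mem,comm", "--sort=-%cpu"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

print("\n".join(snapshot[:11]))  # header line plus top ten processes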