LinuxQuestions.org
Old 12-10-2018, 03:41 AM   #1
Fluffyyyzzz
LQ Newbie
 
Registered: Dec 2018
Posts: 3

Rep: Reputation: Disabled
Maintaining the correct amount of load on a cluster of Linux servers.


Hi All,

This is my first time posting a thread here, so please excuse me if I make a mistake.

At the moment I am busy monitoring a cluster of nodes. The problem is that the load sometimes becomes quite high, and we have to close a box and wait for the load to drop. About a week ago, while running month end, every box in the cluster ended up closed, so no one could connect to them. As I'm sure you know, once a box is closed, the users already connected to it stay where they are and no new users are able to connect.

I personally believe that this might not be the most economical way to reduce the load. We do kill user sessions on the workspace servers to reduce load, but not on this grid. Here, all we do is close a box and wait for the load to drop. Because that box is closed, new users go to the other nodes, which then spike in turn. If users can't work, this reflects badly on us. It is basically just one big horrible cycle.

So basically my question is: is there a better way of going about this? We obviously don't want another situation where none of the users can connect. I've done some research but have found nothing that would actually solve the problem.

Let me say thank you in advance. If you have any questions at all, I'll be more than willing to give more information.

Thank you very much,

Have a good day.
 
Old 12-10-2018, 04:29 AM   #2
TenTenths
Senior Member
 
Registered: Aug 2011
Location: Dublin
Distribution: Centos 5 / 6 / 7
Posts: 3,475

Rep: Reputation: 1553
Nobody can give you any meaningful advice because you don't mention what you're using to distribute the load.

So some generic advice.

Check your load balancer and see what kind of algorithm it's using to distribute the sessions. For example, if it's routing based on source IP and all your clients are coming in behind some form of NAT and presenting from the same IP, then routing may be non-optimal. If it's pure round-robin, then check whether your application requires any form of session management; if it does, you'll want to ensure the same client hits the same end-point.
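
To illustrate the NAT point, here is a minimal sketch rather than any particular load balancer's implementation. The node names, the hash choice and the client address are all assumptions made up for the example. It shows how source-IP routing sends every session from one NATed address to the same node, while round-robin spreads them evenly (at the cost of not keeping a client on one end-point).

```python
import hashlib
from collections import Counter
from itertools import cycle

NODES = ["node1", "node2", "node3"]  # hypothetical backend names


def pick_by_source_ip(client_ip, nodes):
    """Source-IP hashing: the same IP always maps to the same node."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute_by_source_ip(client_ips, nodes):
    """Count how many sessions each node receives under source-IP hashing."""
    return Counter(pick_by_source_ip(ip, nodes) for ip in client_ips)


def distribute_round_robin(client_ips, nodes):
    """Count how many sessions each node receives under plain round-robin."""
    rotation = cycle(nodes)
    return Counter(next(rotation) for _ in client_ips)


# 300 sessions that all present from one NAT address (documentation IP range).
natted = ["203.0.113.7"] * 300

by_ip = distribute_by_source_ip(natted, NODES)
by_rr = distribute_round_robin(natted, NODES)
print("source-ip hash:", dict(by_ip))  # a single node holds all 300 sessions
print("round-robin:   ", dict(by_rr))  # 100 sessions on each node
```

If the real balancer turns out to hash on source IP and most users sit behind one gateway, that alone could explain a single node spiking while others idle.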
 
Old 12-10-2018, 05:48 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
Quote:
Originally Posted by Fluffyyyzzz
The problem is sometimes the load becomes quite high and we have to close a box and wait for the load to drop.
What worries me here is "what are you basing this on?" Are you referring to CPU% "load" or loadavg?
And are you simply waiting for some arbitrary number to appear and then taking action? Where did this magic number come from?
Is the performance of the cluster (or particular nodes) being impacted before you commence shutdown(s)?

We need some detail. What is that number, how many cores/execution threads are involved, how many tasks are in uninterruptible sleep, and is there any resource contention (CPU/disk/network)?
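
As a rough sketch of how those numbers could be collected on one node (assuming a Linux box with /proc mounted; the function names are made up for this example), the following reads the load average, divides it by the core count, and counts tasks in state D. On Linux the load average includes tasks in uninterruptible sleep as well as runnable ones, so a high loadavg with idle CPUs usually points at disk or network waits rather than CPU.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into a dict."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable": int(runnable),
        "total_tasks": int(total),
    }


def state_from_stat(stat_text):
    """Return the one-letter task state from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # ')' characters, so split after the LAST closing parenthesis.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_uninterruptible(proc_root="/proc"):
    """Count processes currently in uninterruptible sleep (state D)."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as fh:
                if state_from_stat(fh.read()) == "D":
                    count += 1
        except (OSError, IndexError):
            continue  # process exited (or stat was unreadable) mid-scan
    return count


def load_per_core(load1, cores):
    """Normalise a load average by the number of logical CPUs."""
    return load1 / cores


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as fh:
        avg = parse_loadavg(fh.read())
    cores = os.cpu_count() or 1
    print(f"loadavg(1m)={avg['load1']} cores={cores} "
          f"per-core={load_per_core(avg['load1'], cores):.2f}")
    print(f"tasks in uninterruptible sleep (D): {count_uninterruptible()}")
```

A per-core figure is easier to compare across nodes of different sizes than the raw number on a dashboard.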
 
Old 12-10-2018, 05:56 AM   #4
Fluffyyyzzz
LQ Newbie
 
Registered: Dec 2018
Posts: 3

Original Poster
Rep: Reputation: Disabled
We have one node that every user connects to, and its sole purpose is to decide which node the user is distributed to. I'm not too sure how it decides. It basically finds out which node can accept the user. From what I understand, it is scripted to give the job to the node with the lowest load, but it doesn't seem to be doing that very well. I also understand that one job can be huge and cause a node to spike really high, while the next one won't even use a fraction of that load, so it's not an exact science. Maybe it is not fixable and this is just how it needs to be. Thank you for your advice; I will check all of it now.
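
For reference, a "give it to the least loaded node" dispatcher could look roughly like the sketch below. This is not the actual script on the grid, which isn't shown in this thread; the `Node` fields and the `pending` counter are assumptions for illustration. The `pending` counter treats each session handed out since the last load reading as one extra runnable task, so that several users arriving together are not all sent to the node that merely looked idle a moment ago.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    load1: float          # most recent 1-minute load average reported by the node
    cores: int
    is_open: bool = True  # False once the box has been closed to new users
    pending: int = 0      # sessions assigned since load1 was last refreshed


def effective_load(node):
    """Load per core, counting each just-assigned session as one extra task."""
    return (node.load1 + node.pending) / node.cores


def pick_node(nodes):
    """Return the open node with the lowest effective load, or None if all are closed."""
    candidates = [n for n in nodes if n.is_open]
    if not candidates:
        return None
    chosen = min(candidates, key=effective_load)
    chosen.pending += 1
    return chosen


# Hypothetical grid: node "c" is the least loaded but has been closed.
grid = [Node("a", 6.0, 8), Node("b", 3.0, 8), Node("c", 1.0, 4, is_open=False)]
print(pick_node(grid).name)  # "b": lowest load per core among the open nodes
```

Because job sizes vary so much, no dispatcher of this kind can be exact; the counter only smooths out the worst of the herding between load readings.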
 
Old 12-10-2018, 06:02 AM   #5
Fluffyyyzzz
LQ Newbie
 
Registered: Dec 2018
Posts: 3

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by syg00
What worries me here is "what are you basing this on?" Are you referring to CPU% "load" or loadavg?
And are you simply waiting for some arbitrary number to appear and then taking action? Where did this magic number come from?
Is the performance of the cluster (or particular nodes) being impacted before you commence shutdown(s)?

We need some detail. What is that number, how many cores/execution threads are involved, how many tasks are in uninterruptible sleep, and is there any resource contention (CPU/disk/network)?
It is the CPU load. No, we monitor it the whole time; once it reaches a certain load, we have to close the node and reopen it according to what the client has asked. The dashboard shows the live load and refreshes every second. It is a particular node in a cluster.
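
One way to express that close/reopen rule in code, purely as a sketch: the thresholds, the function name and the `min_open` guard are assumptions, not what the client has specified. A node closes at one threshold and only reopens at a lower one, so it does not flap, and the hottest nodes are closed first while at least `min_open` nodes always stay open, which would avoid the month-end situation where every box was closed at once.

```python
def update_open_set(loads, open_nodes, close_at, reopen_at, min_open=1):
    """Return the new set of nodes accepting users.

    loads:      dict of node name -> load per core (must cover every node)
    open_nodes: set of node names currently accepting new users
    close_at:   close a node when its load reaches this value
    reopen_at:  reopen a closed node only once its load falls to this value
    """
    now_open = set(open_nodes)
    # Reopen first, so recovered nodes count toward min_open.
    for name, load in loads.items():
        if name not in now_open and load <= reopen_at:
            now_open.add(name)
    # Close the hottest nodes first, but never drop below min_open open nodes.
    for name in sorted(now_open, key=lambda n: loads[n], reverse=True):
        if loads[name] >= close_at and len(now_open) > min_open:
            now_open.discard(name)
    return now_open


# Hypothetical month-end snapshot: every node is over the close threshold.
loads = {"a": 2.0, "b": 1.8, "c": 1.6}
print(update_open_set(loads, {"a", "b", "c"}, close_at=1.5, reopen_at=0.8))
# {'c'}: the least loaded node stays open so users can still connect
```

Whether keeping one overloaded node open is acceptable is a policy question for the client; the point is only that the rule can be made explicit instead of being applied by hand from a dashboard.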
 
Old 12-11-2018, 12:14 PM   #6
dc.901
Senior Member
 
Registered: Aug 2018
Location: Atlanta, GA - USA
Distribution: CentOS/RHEL, openSuSE/SLES, Ubuntu
Posts: 1,005

Rep: Reputation: 370
Quote:
Originally Posted by Fluffyyyzzz
It is the CPU load. No, we monitor it the whole time; once it reaches a certain load, we have to close the node and reopen it according to what the client has asked. The dashboard shows the live load and refreshes every second. It is a particular node in a cluster.
So, when the load is high, have you looked at the running processes or reviewed the logs? You need to determine the cause of the high load before a solution can be implemented.
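
As a starting point for looking at the running processes, here is a small sketch that parses `ps -eo pid,pcpu,stat,comm` output and lists the heaviest CPU users. The helper names and the sample process names are made up for the example. Note that the %CPU reported by `ps` is averaged over each process's lifetime, so `top` or similar may be better for catching a short spike.

```python
import subprocess


def parse_ps(output):
    """Parse `ps -eo pid,pcpu,stat,comm` output into a list of dicts."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore malformed or truncated lines
        pid, pcpu, stat, comm = parts
        rows.append({"pid": int(pid), "pcpu": float(pcpu),
                     "stat": stat, "comm": comm.strip()})
    return rows


def top_cpu(rows, n=5):
    """Return the n rows with the highest %CPU, highest first."""
    return sorted(rows, key=lambda r: r["pcpu"], reverse=True)[:n]


def snapshot():
    """Run ps on the local machine and return the parsed process list."""
    out = subprocess.run(["ps", "-eo", "pid,pcpu,stat,comm"],
                         capture_output=True, text=True, check=True).stdout
    return parse_ps(out)


# Hypothetical output captured while a node was spiking.
SAMPLE = """\
  PID %CPU STAT COMMAND
    1  0.0 Ss   systemd
 4321 97.5 R    month_end_job
 4400 12.0 D    rsync
"""

for row in top_cpu(parse_ps(SAMPLE), n=2):
    print(row["pid"], row["pcpu"], row["stat"], row["comm"])
```

Running `top_cpu(snapshot())` from cron or the monitoring system whenever the load crosses the threshold would give a record of what was actually running at the time of each spike.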
 
All times are GMT -5.