Linux - Networking
This forum is for any issue related to networks or networking. Routing, network cards, OSI, etc. Anything is fair game.
I am running into an issue with a cluster I set up and am wondering if anyone has any thoughts. Here is the situation:
I have four dual-Xeon 2.4 GHz machines, each with 2 GB of RAM and gigabit NICs. I put an additional NIC in the machine that serves as the master for the cluster. The machines are connected through a gigabit switch (and the master is also connected to the outside world via its second NIC). I was able to get the cluster up and running and can run jobs from the master spread across the slave machines.

However, here is the thing that is causing me problems. I am getting better performance when running jobs (specifically, the WRF-ARW weather model) on just one CPU per machine (4 total CPUs) than when I run on all available CPUs (2 per machine, 8 total). If I view CPU usage while the tasks are running, I notice that when just one CPU per machine is being used it is generally running around 95% utilization, but when I run on 2 CPUs per machine, the CPUs are each only running at about 45-50% utilization. I obviously do not expect a doubling in performance, since there is some scaling overhead, but I expect some boost rather than near-neutral or decreased performance. I am not coming close to using up available RAM, so that is not an issue. When I run on just one machine and go from one CPU to two CPUs (on that one machine) I get the expected boost in performance; I only get the slowdown/lack of increase when I am running across the cluster and go from one to two CPUs per machine.

I have tried a variety of things thus far to figure out why I am not getting a performance boost from the additional CPUs. I thought I might have found the problem in NFS (since the slaves simply NFS-mount the job directory from the master) and increased the number of NFS daemons running on the master (it was at the default of 8, so I first increased it to 16, then tried 32, but didn't get much, if any, improvement). I also changed the NFS export option from sync to async, since I had read that that could speed things up, but no luck there either.
Any thoughts/comments would be greatly appreciated. Thank you.
"I notice that when just one CPU per machine is being used it is generally running around 95% utilization, but when I run on 2 CPUs per machine, the CPUs are each only running at about 45-50% utilization."
Does your application multi-thread? If your application is single-threaded, it cannot use 2 CPUs simultaneously.
Yes, it is multi-threaded (it is the WRF-ARW weather model). When I go from 1 to 2 CPUs on a single machine I get the expected performance boost. It is when I jump from 1 to 2 on the slaves (essentially going from 4 total to 8 total CPUs) that I get the lack of improvement (and actually a very slight decrease).
What, on average, is the ratio of the time it takes the master to schedule an application on a slave to the time it takes the application to run on the slave? The inverse of this ratio gives the number of slaves that you can keep 100% active. For example, if the ratio is 0.10 then the master can keep a maximum of 10 slaves busy. If the ratio is 0.25 then the master can keep a maximum of 4 slaves busy.
So in your case, if the ratio were about 0.25, you could only keep 4 slaves busy. Increasing the number of slaves to 6 would mean that the slaves would spend about a third of their time idle while waiting for the master to schedule their next application task. 8 slaves would spend about 50% of their time waiting to be rescheduled by the master, which could be the problem that you are experiencing.
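The back-of-the-envelope calculation above can be sketched as follows (the 0.10 and 0.25 ratios are the illustrative figures from this post, not measured values from the cluster in question):

```python
def max_busy_slaves(schedule_time, run_time):
    """Maximum number of slaves a master can keep fully busy.

    If dispatching one task costs the master schedule_time, and the task
    then runs for run_time on a slave, the master can dispatch at most
    run_time / schedule_time tasks before the first slave goes idle,
    i.e. the inverse of the schedule/run ratio.
    """
    ratio = schedule_time / run_time
    return 1 / ratio

# Examples from the post: a ratio of 0.10 keeps 10 slaves busy,
# a ratio of 0.25 keeps only 4 busy.
print(max_busy_slaves(1, 10))  # -> 10.0
print(max_busy_slaves(1, 4))   # -> 4.0
```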
Having 4 dual-processor slaves instead of 8 single-processor slaves might introduce some consideration other than the master/slave processing ratio. I think that you have already proven that this is not the case when you said:
"When I just run on one machine and go from one CPU to two CPUs (on that one machine) I get the expected boost in performance. "
So, right now I am leaning to the theory that your master/slave processing ratio is limiting you to being able to maximally utilize 4 slaves at most.
I don't believe that is the issue. It only occurs when additional processes are added to the slaves. For example, if I run 4 processes, 2 on the master and 2 on the first slave (4 total processes) I see the same behavior/slowdown. But, if I span the 4 processes across all four machines (1 on each) things work fine. And, as I mentioned, if I run 2 processes on just the master things work fine (in terms of performance) with both CPUs on the master being utilized around 95%.
It would be interesting to know how many uniprocessor slaves can be utilized. Do you have the equipment where you can run more than 4 uniprocessor slaves?
No, I don't. Is there something which would cause the slaves/clients to run into an NFS bottleneck and have to share a single NFS connection? From what I've read/seen it sounds like they should be okay (originally the number of NFS daemons on the master was the default of 8, but I increased that to 32, which should be more than sufficient given the number of clients and the need to only run two jobs on each).
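For what it's worth, you can verify how many nfsd threads are actually running on the master by checking the `th` line in `/proc/net/rpc/nfsd`; if I recall the format correctly, the first number after `th` is the thread count. A small sketch (the sample line below is illustrative, not output from this particular cluster):

```python
def nfsd_thread_count(proc_nfsd_text):
    """Extract the nfsd thread count from /proc/net/rpc/nfsd contents.

    The 'th' line begins with the number of server threads, followed by
    a counter of how often all threads were busy and a usage histogram.
    Returns None if no 'th' line is found.
    """
    for line in proc_nfsd_text.splitlines():
        if line.startswith("th "):
            return int(line.split()[1])
    return None

# Illustrative sample; on the real master you would read the file:
#   with open("/proc/net/rpc/nfsd") as f: text = f.read()
sample = "th 32 0 1.150 0.210 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000\n"
print(nfsd_thread_count(sample))  # -> 32
```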
Sounds like scheduler algorithm at the master.
If it is only aware of "nodes" rather than (actual) processors, it will not be able to scale (at the nodes) at all.
Sounds like work is being sent out, and handled at the "node" - but there isn't enough to keep a multi-processor "node" fully busy.
"since the slaves simply NFS mount to the master for the job being run"
"Sounds like scheduler algorithm at the master. If it is only aware of "nodes" rather then (actual) processors it will not be able to scale (at the nodes) at all. Sounds like work is being sent out, and handled at the "node" - but there isn't enough to keep a multi-processor "node" fully busy."
You might try giving each slave two NFS ports and then have them mount both ports to the master. Then the master would be scheduling 8 slaves instead of 4.
This is not an uncommon phenomenon. I had long discussions with a system architect who claimed 95% utilization, yet if you compared computation time to the area of the problem being simulated, CPU utilization was only 50%. The system architect was arguing about communications overhead, while I was arguing about how much time it actually took to run the simulation versus the number of processors.
Historically, parallel and distributed computing usually does not achieve high utilization in the actual program run. If you look at the time spent actually running the program versus the number of CPUs, most software does not scale linearly. In your case, two CPUs at 45% utilization each deliver only 2 x 0.45 = 0.9 of one CPU's worth of work, i.e. roughly a 1/0.9 = 1.11x increase in computation time rather than a speedup. Code has to be specifically modified to improve this situation; it is not an OS issue, as most people would like to claim. The code itself needs to be properly set up to multi-thread. The BIOS handles this to some degree for the cores on a processor, but on dual-processor boards the software also has to take the "hyper-threading" and memory sharing of the processors into account.
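To put numbers on that utilization comparison: with per-CPU utilization u on n CPUs, effective throughput relative to a single fully-busy CPU is roughly n x u. A quick sketch using the 95% and 45% figures reported in this thread:

```python
def effective_speedup(n_cpus, utilization, baseline_util=0.95):
    """Rough throughput of n CPUs at the given utilization, relative to
    a single CPU running at baseline_util (95% here, as reported)."""
    return (n_cpus * utilization) / baseline_util

# One CPU per node at ~95% vs two CPUs per node at ~45%:
print(round(effective_speedup(1, 0.95), 2))  # -> 1.0
print(round(effective_speedup(2, 0.45), 2))  # -> 0.95
```

Two CPUs at 45% actually do slightly *less* useful work than one CPU at 95%, which matches the slight slowdown observed.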
Compilation of the code can improve overall performance up to a point. But closed-source code like MATLAB does not allow you to modify the source or the compilation techniques. I would go back to your software's supplier and ask how many processors or cores it is set up to multi-thread across. You may be limited to one master and 4 slaves.
Basically, what munichtexan says. It sounds to me like running 1 process per node exhausts more than half the available system or network bandwidth. Thus when you run two processes per node the network bandwidth is totally exhausted and your processes have to wait around for messages. In extreme cases, such as yours, this causes negative scalability.
A couple of quick questions that may help you (in no particular order). These assume that you've verified all hardware and software in the system is working well with no faults. Note that I work with computational chemistry software, not atmospheric modeling, but I've had to deal with similar challenges in the past.
1. Does your code use MPI? If so, can you use an MPI profiler to figure out communication efficiency and check for "hot spots"? In particular, you want to know if your code is bandwidth- or latency-sensitive. Most codes are some of both, but it's helpful to know which you have to work on improving. It may also be helpful to run tcpdump or something similar to figure out whether most of your traffic is MPI parallel communications traffic or NFS traffic.
2. Node bandwidth: what type of gigabit Ethernet card are you using (PCI, PCI-X, or PCIe)? Is there sufficient bandwidth over the relevant bus to handle the communications needs of the application?
3. Switch bandwidth and latency: is your switch non-blocking (i.e., does it actually have enough bandwidth across the backplane to support gigabit connections on all of its ports)? Some cheaper switches aren't, so caveat emptor! Likewise, what's the internal switch latency? Usually this isn't a big problem, but it's helpful for calculating point-to-point latencies.
4. If you determine latency is the problem, do your Ethernet cards support some sort of RDMA or other non-TCP/IP transport? The TCP/IP stack adds a significant chunk of latency that you may be able to remove if you have the right hardware.
5. If bandwidth is the problem, can you add a second Ethernet card to the nodes and do trunking? This would have to be supported by the switch, and may not provide much benefit depending on the communication pattern.
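As a rough way to judge whether latency or bandwidth dominates, the time to move a message can be estimated as latency plus size divided by bandwidth. A sketch, assuming ballpark gigabit Ethernet figures (~50 microseconds of TCP/IP latency, ~125 MB/s of usable bandwidth; these are typical round numbers, not measurements of any particular hardware):

```python
def message_time(size_bytes, latency_s=50e-6, bandwidth_bps=125e6):
    """Estimated one-way time for a message: latency + size / bandwidth.

    Default latency and bandwidth are rough gigabit-Ethernet figures.
    """
    return latency_s + size_bytes / bandwidth_bps

# Small messages are latency-bound, large ones bandwidth-bound:
print(f"{message_time(1_000) * 1e6:.1f} us")   # ~58 us for 1 KB (mostly latency)
print(f"{message_time(10_000_000):.3f} s")     # ~0.080 s for 10 MB (mostly transfer)
```

If the model sends many small halo-exchange messages, the latency term will dominate and RDMA-style transports help; if it ships large arrays, the bandwidth term dominates and trunking or a faster interconnect helps.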
There's a good article on ClusterMonkey about how cheap 10 Gbps InfiniBand is getting. InfiniBand has much higher bandwidth and lower latency than gigabit Ethernet. If it's within your budget, maybe you should consider upgrading? You probably want to get a vendor to loan you some hardware to see if it will actually help before sinking the money in, though.