Old 06-19-2014, 03:22 PM   #1
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Rep: Reputation: 1191
How to diagnose a performance problem


I have a multi-threaded application that I am testing on two systems, and the performance on the newer system is pretty terrible (generally worse than on the older system). I want to know what might be wrong with the new system that could explain the bad performance, and how to test for it.

Old system: 2 CPU chips, each with 6 cores, Xeon X5675 @ 3.07GHz

Newer system: 4 CPU chips, each with 8 cores, Xeon E5-4640 @ 2.4GHz

Each system has far more RAM than the problem needs and there is no significant I/O, so the issue is either CPU throughput or CPU-to-RAM throughput.

Most single-threaded operations I have tested run slower on the newer system. I would have expected the architectural advantages of the newer CPU to offset the lower clock speed and give more similar single-threaded performance, but apparently not.

But the big problem is multi-threaded performance: how much longer does it take to run N copies of an algorithm on N different cores vs. running one copy on one core with the other cores idle? A small and simple enough algorithm would not contend for shared caches or for the memory bus, so we would expect no difference in elapsed time between N copies on N cores and one copy on one core with the others idle.

But I'm not running small or simple, so I expect, and see, the results of contention for shared caches and buses.

Old (12 core) system:
2 threads 1.38 times longer than 1.
4 threads 1.14 times longer than 2.
8 threads 1.30 times longer than 4.

New (32 core) system:
2 threads 1.73 times longer than 1.
4 threads 1.40 times longer than 2.
8 threads 1.37 times longer than 4.
16 threads 1.52 times longer than 8.
32 threads 23.76 (yes really) times longer than 16 ! !

As long as that multiplier is less than 2 for doing twice as many steps in parallel, using more cores gives better total throughput. But since the contention is so much worse on the newer system, 16 cores on the newer system give much less throughput than 8 cores on the old one.
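
(For concreteness, each N-copy data point is essentially "time N pinned copies running at once". A rough sketch of how such a point can be collected is below; taskset pinning is just an illustration, and ./kernel_test is a stand-in for the real benchmark, which I can't share.)

N=8
time ( for c in $(seq 0 $((N-1))); do taskset -c $c ./kernel_test & done; wait )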

The newer system has far more total cache, which ought to be the significant factor in reducing contention. So why does it have so much more actual contention?

If the unexpected performance difference were more subtle, I would conclude it was something odd about the algorithms I'm testing, which I can't share here, so the whole question would be unfair to ask.

But since it is so extreme, there must be something fundamentally wrong with the new hardware: the memory bus is far too slow, so that even a smaller number of cache misses takes more total time; or something is wrong with the caches, so that the larger caches somehow miss more; or there is some other kind of contention that I haven't thought of.

I thought of thermal throttling (running multiple cores might cause an automatic clock decrease because of heating) and I checked for it the simple way: I think /proc/cpuinfo shows the actual clock rate, and during a long run of the test the reported clock rate stayed normal.
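
(The simple check was basically watching the reported frequency while the test ran, along the lines of the sketch below; this only reflects what the kernel reports in /proc/cpuinfo, of course.)

watch -n1 "grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c"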

Last edited by johnsfine; 06-19-2014 at 03:39 PM.
 
Old 06-19-2014, 09:39 PM   #2
cyent
Member
 
Registered: Aug 2001
Location: ChristChurch New Zealand
Distribution: Ubuntu
Posts: 343

Rep: Reputation: 76
Some ideas....

Disk cache and cache effects in general can dominate.

So care needs to be taken to see you are not just measuring how long it takes to lift things off disk.

Using this rune can help you level the playing field somewhat.

sudo bash -c 'echo 1 > /proc/sys/vm/drop_caches'

Watching top or gkrellm can tell you whether all the cores are running full tilt... or suffering lock contention somewhere.
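
If you prefer a terminal view, mpstat from the sysstat package gives per-core utilisation (plus %iowait and %steal) at a glance:

mpstat -P ALL 1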

What can happen is that if a small number of threads fits entirely into cache, life is good. But a larger number of threads may have a RAM footprint large enough to force continuous eviction out to RAM, which would result in exactly the slowdown you are seeing.

valgrind --tool=cachegrind may help identify this. (Assuming correct settings for cache sizes.)
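
For example (the program name and cache parameters here are placeholders -- plug in the real size,associativity,line-size for each machine, which you can read from /sys/devices/system/cpu/cpu0/cache/):

valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=12582912,16,64 ./your_app
# then summarise the miss counts with: cg_annotate cachegrind.out.<pid>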
 
Old 06-20-2014, 04:57 AM   #3
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Original Poster
Rep: Reputation: 1191
Quote:
Originally Posted by cyent View Post
Disk cache and cache effects in general can dominate.
I am quite sure Disk/File I/O is not a factor, so Disk/File cache is not a factor.

I assume CPU cache is a factor, but don't know how to diagnose that, since the system with more and larger CPU caches is slower.

Quote:
So care needs to be taken to see you are not just measuring how long it takes to lift things off disk.
I am quite sure that is neither significant nor included. The key multi-threading performance test part of what I have done includes zero file I/O.

Quote:
Watching top or gkrellm can tell you whether all the cores are running full tilt... or suffering lock contention somewhere.
Far more cores are running full tilt on the (slower) 32-core system than on the (faster) 12-core system. Locking is not significant. Slower very clearly means fewer instructions executed per second in each core. That means more stalls for something, most likely cache misses, but I don't know how to confirm it is cache misses, nor how to explain why the system with more and larger caches has more cache misses.
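
(Perhaps hardware performance counters would settle it. If perf is available on these CentOS boxes, I believe something like the following would show instructions per cycle and last-level-cache miss counts directly -- ./the_test stands in for my benchmark, and corrections on the event names are welcome:)

perf stat -e cycles,instructions,cache-references,cache-misses,LLC-loads,LLC-load-misses ./the_test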
 
Old 06-20-2014, 10:55 AM   #4
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 491
Two things to consider:

1) The CPU cache is definitely a factor, and that's why I always buy processors with the largest available CPU cache. In your code, experiment with I/O chunk/buffer sizes. For example, if you process a chunk of input at a time with a certain buffer size, adjust the buffer size and test the speed. This is how I test performance for my optimized programs. Some multiple of the cache size usually works best, but even 1 byte over that size can massively decrease speed.

2) Make sure that in your code you do NOT keep creating and joining threads; that often results in code slower than the single-threaded version. Instead, start the threads once and keep them running through the whole section, synchronizing them with a barrier if needed. Also, whenever possible make sure each thread has its own variables and data; in C this would mean you pass a different struct to each thread, and when they finish you can add the data together. The number of threads is also critical, so test with different numbers of threads, including odd numbers like 3, 5, etc. If the CPU is hyperthreaded, consider testing with hyperthreading off (a quick way to check the topology is sketched below).
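
For example, to check the topology and whether hyperthreading is currently on (lscpu ships with util-linux; cpuNN below is a placeholder for an actual HT sibling):

lscpu | egrep 'Socket|Core|Thread'                        # "Thread(s) per core: 2" means HT is on
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# echo 0 > /sys/devices/system/cpu/cpuNN/online           # as root, offlines that sibling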
 
Old 06-20-2014, 12:08 PM   #5
johnsfine
LQ Guru
 
Registered: Dec 2007
Distribution: Centos
Posts: 5,286

Original Poster
Rep: Reputation: 1191
Quote:
Originally Posted by metaschima View Post
In your code mess around with I/O chunk/buffer sizes.
No I/O buffer is relevant. This code processes for hours with no I/O.

Anything you might call a "chunk size" in the processing is determined by the problem and not adjustable by the algorithm.

Quote:
Make sure that in your code you do NOT keep creating and joining threads.
Yes. I had that correct already.

Quote:
whenever possible make sure each thread has its own variables and data, in C this would mean you pass a different struct to each thread,
Most memory accesses are to private data used only by one thread, and I was careful to have the same thread that uses the data also malloc it, fairly close to the point of use. So if the OS gets those NUMA details right, there should be a further performance boost.

Some data access is to shared data and cannot be avoided. No locking overhead is involved, but there are obvious cache-coherency overheads on multi-CPU-package systems.
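
(To sanity-check whether the OS really is keeping those allocations node-local, I can watch the NUMA counters before and after a run; sketch below, assuming numactl and numastat are installed:)

numactl --hardware     # CPU-to-node mapping, node memory sizes, inter-node distances
numastat               # numa_hit / numa_miss / other_node counters per node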

Quote:
then when they finish you can add the data together
None of the relationships are that simple (sum of results of separate threads).

Quote:
The number of threads is also critical, so test using different number of threads including odd numbers like 3, 5, etc.
A key algorithm splits reasonably only on a power of two. Other parts could be split any way I like, but I can't imagine odd could be best (the systems have two or four CPU packages in addition to an even number of cores per CPU package).

Quote:
If the CPU is hyperthreaded, consider testing with hyperthreading off.
I'm more used to telling that to others than being told. I'm well aware this type of problem always runs worse with hyperthreading enabled and I don't even waste the time starting a performance test before making sure hyperthreading is disabled.

Last edited by johnsfine; 06-20-2014 at 12:10 PM.
 
Old 06-20-2014, 12:45 PM   #6
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 491
Quote:
Originally Posted by johnsfine View Post
No I/O buffer is relevant. This code processes for hours with no I/O.
Well, in that case maybe the individual CPU cores are just slower in the newer system; the older system has a higher GHz rating. I/O to and from RAM still counts, though, and buffer sizes would matter there.

I've had better performance in some cases with 3 threads instead of 4 (max for my system). Not sure why, but it's worth a try.
 
Old 06-22-2014, 04:45 PM   #7
cyent
Member
 
Registered: Aug 2001
Location: ChristChurch New Zealand
Distribution: Ubuntu
Posts: 343

Rep: Reputation: 76
Quote:
Originally Posted by johnsfine View Post
Far more cores are running full tilt on the (slower) 32-core system than on the (faster) 12-core system. Locking is not significant. Slower very clearly means fewer instructions executed per second in each core. That means more stalls for something, most likely cache misses, but I don't know how to confirm it is cache misses, nor how to explain why the system with more and larger caches has more cache misses.
Ok. There is a clue in here somewhere..... "Far more cores are running full tilt on the (slower) 32-core system than on the (faster) 12-core system".

I think we'd need to see the code to work this one out.... but I bet that observation will be the core to solving it.
 
Old 06-22-2014, 05:41 PM   #8
metaschima
Senior Member
 
Registered: Dec 2013
Distribution: Slackware
Posts: 1,982

Rep: Reputation: 491
Quote:
Originally Posted by cyent View Post
Ok. There is a clue in here somewhere..... "Far more cores are running full tilt on the (slower) 32-core system than on the (faster) 12-core system".

I think we'd need to see the code to work this one out.... but I bet that observation will be the core to solving it.
I agree, in general all cores should be close to 100%.

Other things to check are the kernel config, especially for a NUMA system.
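
For example, to confirm NUMA support is compiled in and that the kernel actually detected the nodes (a sketch; exact paths vary by distro):

grep -i numa /boot/config-$(uname -r)     # look for CONFIG_NUMA=y
dmesg | grep -i numa                      # were the nodes detected at boot?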

Also, I've had a fair amount of experience with threading (pthreads and OpenMP), and I can say with a good amount of certainty that more threads does NOT in any way guarantee more performance. In fact, many times performance decreases. Try to look through the code, experiment with some buffer sizes, and check that everything works on a conceptual level. Threaded programming requires a lot more consideration than non-threaded programming, and a lot more testing too.
 
Old 06-22-2014, 05:56 PM   #9
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,720

Rep: Reputation: 1282
You need to keep at least one core (two is better) free for miscellaneous system activity.

This alone can play havoc with throughput, as one or two of the threads will get suspended just so the system can keep track of its resources; that can start causing CPU thrashing as threads bounce from core to core (which also increases the cache problems as threads move).

You don't indicate what distribution you are using; some will use cgroups to provide such separation. In that case, once you exceed the cgroup allocation you start thrashing your threads against each other. The advantage of cgroups is that it is easier to know exactly how much is available.
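
Even without cgroups you can approximate that separation by pinning the application away from a couple of cores (a sketch; ./your_app and the core numbers are placeholders, and the right numbers depend on your topology):

taskset -c 2-31 ./your_app     # leave cores 0 and 1 for the OS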

In a 32-core system, I would expect the best throughput to top out around 26 to 28 threads, then start decreasing.

You can find this out by using a linear increase in threads rather than powers of 2; starting at 16 is reasonable, so a series of 16, 18, 20, 22, 24, 26, 28, 30, 32 will tell you quite accurately the maximum useful number of threads. You could even use a binary search for the key value: since 16 looked good and 32 looked bad, try 24; if that is good, try 28; if that is good, try 30. My guess would be that 28 is good, but 30 starts to show a decrease.
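
Something like this can automate the sweep (purely for illustration it assumes the program takes a --threads argument; substitute however you actually set the thread count):

for n in 16 18 20 22 24 26 28 30 32; do
    /usr/bin/time -f "$n threads: %e seconds elapsed" ./your_app --threads=$n
done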

It can happen much earlier (potential memory limits, paging activity, and such). It really depends on the specific application (and the data).
 
  

