Multithreaded number crunching / IO not producing the performance gain I expected
Hello guys,
I have written a program which converts Microsoft Bitmap files (24-bit BGR)
into raw 16-bit RGB.
For fun, I decided to make it multithreaded and see what kind of performance gain I could extract.
**** The Original Algorithm (SINGLE threaded) ****
This runs through a few hundred megabytes in about 30 seconds.
For 100 bitmaps...
1) Load 24-bit BGR into memory
2) Convert to 16-bit RGB
3) Save to disk
4) Repeat.
**** The Second Algorithm ****
This is just the above algorithm running in 3 threads, with a little thread synchronisation to read from an array of source images.
This runs a little faster, as I would expect, since some threads are bound to be doing the number crunching while other threads are waiting for disk I/O.
This runs in around 20 seconds.
**** The 3rd Algorithm ****
The second algorithm doesn't seem to be ideal, as there will be times when all threads are waiting on I/O. This 3rd algorithm is designed to be optimal, with I/O and number crunching in separate threads. But for reasons I cannot understand, it is much, much slower than the single-threaded version: it runs in 40 seconds!
This algorithm seems perfect on paper:
the I/O thread is always doing I/O, never waiting for number crunching to finish,
and the number crunching is never waiting for I/O to finish.
There is a little more overhead from thread synchronisation in the 3rd algorithm, but not much. Results are the same across single-core / dual-core chips, and across Windows / Linux.
Are you monitoring swap space? It could be that you're loading too much at once. The "write" thread might need to write to memory pages that have to be put in swap space while the "read" thread is running, and vice versa.
ta0kira
I set swappiness to zero.
If all bitmaps were loaded at the same time (worst-case scenario) the program should use approximately 130 MB of RAM.
My development system has over a gig free; I don't think swap is the problem.
Algorithm 2 only uses mutexes to read / increment an array index,
whereas algorithm 3 needs to do a little more:
Code:
do {
    lock_mutex();
    work = poll_for_work();
    unlock_mutex();
    if (work)
        break;
    else
        yield();
} while (1);
do_work();
for thread synchronisation.
Do you think that all the context switching involved in the yield() loop could be slowing me down? I have experimented with short sleeps instead of yield, but this has little impact.
I may try a spinlock and see how that affects performance.
Take a look at I/O performance. Are you loading from one area of the disk and saving to another? The difference in performance between the various runs might be due to differences in how far apart the load and save files are on the same disk. If you can put them on separate disks performance might improve.
Another thing to look at is your cache size. A large cache can greatly improve disk I/O by reducing the number of times that the disk arm is moved to and from the load and save files. As you increase the amount of application multi-threading, are you decreasing the amount of memory available for cache?
I would have thought something like the following would be a better way of implementing this:
load the image and give it to worker threads
worker thread one takes one strip of the image and converts it
worker thread two takes another strip and converts it
etc., each working on only one strip of the image
save the image.
I'd take a look at pthread_cond_broadcast. You can force the "work" function to block using pthread_cond_wait if there's no work available and have the reading thread do a pthread_cond_broadcast every time it imports a bitmap. If you have a separate write thread then the work thread can do the same for it.
ta0kira
If you've only got 1 I/O thread, the number-crunching threads are bottlenecking on waiting for I/O to read or write.
One method is to have a master thread that simply supplies the list of files to be processed to a master queue, then have multiple threads that each run their own loop: read_fname_from_q / read_file / calc / write_file / yield / repeat.
This is the boss+workers model. Very popular.
The question becomes: what takes the most time ... converting the bitmap once you've loaded it, or reading and writing the files?
I would frankly expect that I/O takes the most time (it usually does), so threading probably won't buy you anything useful.
The CPU intensive task of converting the bitmap won't benefit from being divided into multiple threads unless you have multiple (real...) CPUs available, or a high-speed network or cluster of some kind, since multitasking divides a CPU's time; it does not multiply it.
But a bitmap can be converted while the disk-drive is moving data in and out. So if you have enough real-RAM available, you could potentially have one thread moving data in and out while the other thread did the CPU-intensive work. (The CPU-thread should have a slightly lower priority.)
So... do some timing. Frankly, 30 seconds ain't bad.
KAudioCreator (CD ripper) somehow puts the I/O thread (or fork?) on one CPU while performing compression on the other for SMP systems (inferred using a CPU load monitor.) Not sure how it does it, but that might be a good program to hack.
ta0kira