LinuxQuestions.org
Review your favorite Linux distribution.
Home Forums Tutorials Articles Register
Go Back   LinuxQuestions.org > Forums > Non-*NIX Forums > Programming
User Name
Password
Programming This forum is for all programming questions.
The question does not have to be directly related to Linux and any language is fair game.

Notices


Reply
  Search this Thread
Old 02-26-2008, 07:25 AM   #1
qwijibow
LQ Guru
 
Registered: Apr 2003
Location: nottingham england
Distribution: Gentoo
Posts: 2,672

Rep: Reputation: 47
Multithreaded Numbercrunching /IO not producing the performance gain i expected


Hello Guys.

I have written a program which convert Microsoft Bitmap files (24bit BGR)
into raw 16bit RGB.

For fun, i decided to make it multithreaded, and see what kind of performance gain i could extract.

**** The Origonal Algorithm (SINGLE threaded) ****
This runs through a few hundred megabytes in about 30 seconds.
For 100 Bitmas....
1) Load 24bit BGR into Memory
2) Convert to 16bit RGB
3) Save To Disk
4) Repeat.

/*** The Second Algorithm ***
This Is Just the Above Algorithm Running in 3 threads, with a little thread synchronisation to read an array of source images..
This runs a little faster, as i would expect, as some threads are bound to be doing the number crunching while other threads are waiting for disk I/O.
This runs in around 20 Seconds.

/*** The 3rd Algorithm ***/
The second algorithm doesnt seem to be idea.. as there will be times when all threads are waiting on I/O.. this 3rd algorithm is designed to be optimal... But for reasons I cannot understand... it is much much slower than the single threaded version, it runs in 40 seconds !!!

I/O Thread ( performs I/O ONLY )
while(there_are_more_bitmaps_to_load)
{
load_next_bitmap()
put_bitmap_on_conversion_queue()
}

while(there_are_more_bitmaps_to_save)
{
take_converted_bitmap_from_save_queue()
save_bitmap()
}


Number Crunching Threads (Numbercrunching only)(Number of Logical CPU's)

while(there_are_more_bitmans_on_conversion_queue()
{
get_bitmap_from_conversion_queue()
convert_bitmap()
put_bitmap_on_save_queue()
}


This Algorithm seems perfect on paper..
the IO is always running IO, never waiting for numbercrunching to finish.
and the numbercrunching is never waiting for IO to finish.

There is a little more overhead with thread syncronisation in the 3rd algorithm.. but not much.. Results are the same across single core / dual core chips, and across windows / linux.

Any ideas what i might be doing wrong ?
 
Old 02-26-2008, 07:40 AM   #2
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
Are you monitoring swap space? It could be that you're loading too much at once. The "write" thread might need to write to memory pages that have to be put in swap space while the "read" thread is running, and vice versa.
ta0kira
 
Old 02-26-2008, 01:53 PM   #3
qwijibow
LQ Guru
 
Registered: Apr 2003
Location: nottingham england
Distribution: Gentoo
Posts: 2,672

Original Poster
Rep: Reputation: 47
I set swappyness to zero.
If all bitmaps were loaded at the same time (worst case scenario) the program should use approxamatly 130 megs of ram.
My development sytem has over a gig free, i dont think swap is the problem.

Algorithm 2 Only uses Mutex's to read / increment an array index,
whereas algorithm 3 needs to do a little..

Code:
do {
  lock_mutex()
  work = poll_for_work()
  unlock_mutex()
  if(work)
    break;
  else
    yield()
}loop

do_work();
for thread synchronisation.

do you think that all the context switching involved in the yield() loop could be slowing me down? I have eperimented with short sleeps instead of yield, but this has little impact.

i may try a spinlock, and see how that effects performance.
 
Old 02-26-2008, 04:32 PM   #4
paulsm4
LQ Guru
 
Registered: Mar 2004
Distribution: SusE 8.2
Posts: 5,863
Blog Entries: 1

Rep: Reputation: Disabled
Hi -

At the very least, why don't you run Linux "time" on the versions with/without your threading, and post the results?

IMHO .. PSM
 
Old 02-26-2008, 04:48 PM   #5
jailbait
LQ Guru
 
Registered: Feb 2003
Location: Virginia, USA
Distribution: Debian 12
Posts: 8,337

Rep: Reputation: 548Reputation: 548Reputation: 548Reputation: 548Reputation: 548Reputation: 548
Quote:
Originally Posted by qwijibow View Post

/*** The 3rd Algorithm ***/
The second algorithm doesnt seem to be idea.. as there will be times when all threads are waiting on I/O.. this 3rd algorithm is designed to be optimal... But for reasons I cannot understand... it is much much slower than the single threaded version, it runs in 40 seconds !!!

I/O Thread ( performs I/O ONLY )
while(there_are_more_bitmaps_to_load)
{
load_next_bitmap()
put_bitmap_on_conversion_queue()
}

while(there_are_more_bitmaps_to_save)
{
take_converted_bitmap_from_save_queue()
save_bitmap()
}

Take a look at I/O performance. Are you loading from one area of the disk and saving to another? The difference in performance between the various runs might be due to differences in how far apart the load and save files are on the same disk. If you can put them on separate disks performance might improve.

Another thing to look it is your cache size. A large cache can greatly improve disk I/O by reducing the number of times that the disk arm is moved to and from the load and save files. As you increase the amount of application multi-threading are you decreasing the amount of memory available for cache?

-------------------
Steve Stites
 
Old 02-26-2008, 04:53 PM   #6
dmail
Member
 
Registered: Oct 2005
Posts: 970

Rep: Reputation: Disabled
I would have thought something like the following would be a better way of implementing this.
load image and give to worker threads
worker thread one take one width of the image and convert
worker thread two take one width of the image and convert
etc each only working on one width of the image
save image.
 
Old 02-26-2008, 05:09 PM   #7
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
I'd take a look at pthread_cond_broadcast. You can force the "work" function to block using pthread_cond_wait if there's no work available and have the reading thread do a pthread_cond_broadcast every time it imports a bitmap. If you have a separate write thread then the work thread can do the same for it.
ta0kira
 
Old 02-26-2008, 05:18 PM   #8
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Rocky 9.2
Posts: 18,359

Rep: Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751Reputation: 2751
If you've only got 1 I/O thr, the number crunching thrs are bottlenecking on waiting for I/O to read or write.
One method is to have a master thr that simply supplies the list of files to be processed to a master queue, then have multiple thrs that do their own loop: read_fname_from_q/read_file/calc/write_write_file/yield/repeat.
This is the boss+workers model. Very popular.
 
Old 02-27-2008, 10:47 AM   #9
qwijibow
LQ Guru
 
Registered: Apr 2003
Location: nottingham england
Distribution: Gentoo
Posts: 2,672

Original Poster
Rep: Reputation: 47
to chrism01: Thanks, But i have alreayd implemented that as algorithm 2, see my first post.


to dmail:
The IO is the botttle neck. not the conversion.


to ta0kira:
Good idea, i'll look into it...
 
Old 02-27-2008, 08:22 PM   #10
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941Reputation: 3941
The question becomes: what takes the most time ... converting the bitmap once you've loaded it, or reading and writing the files?

I would frankly expect that I/O takes the most time (it usually does), so threading probably won't buy you anything useful.

The CPU intensive task of converting the bitmap won't benefit from being divided into multiple threads unless you have multiple (real...) CPUs available, or a high-speed network or cluster of some kind, since multitasking divides a CPU's time; it does not multiply it.

But a bitmap can be converted while the disk-drive is moving data in and out. So if you have enough real-RAM available, you could potentially have one thread moving data in and out while the other thread did the CPU-intensive work. (The CPU-thread should have a slightly lower priority.)

So... do some timing. Frankly, 30 seconds ain't bad.
 
Old 02-27-2008, 08:41 PM   #11
ta0kira
Senior Member
 
Registered: Sep 2004
Distribution: FreeBSD 9.1, Kubuntu 12.10
Posts: 3,078

Rep: Reputation: Disabled
KAudioCreator (CD ripper) somehow puts the I/O thread (or fork?) on one CPU while performing compression on the other for SMP systems (inferred using a CPU load monitor.) Not sure how it does it, but that might be a good program to hack.
ta0kira
 
  


Reply



Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is Off
HTML code is Off



Similar Threads
Thread Thread Starter Forum Replies Last Post
Tools for Measuring Multithreaded Performance. Sunilsbjoshi Linux - Server 1 01-26-2007 04:50 AM
performance gain when dsl modem off jwgorman Linux - Newbie 0 07-06-2005 08:44 AM
Athlon 64 performance gain for custom FPU-intensive app? Entropius Linux - Hardware 2 08-15-2004 05:30 AM
Athlon 64 floating-point performance gain? Entropius Linux - Hardware 1 07-19-2004 05:37 AM
LFS performance gain? vonPulex Linux From Scratch 2 07-08-2003 01:11 PM

LinuxQuestions.org > Forums > Non-*NIX Forums > Programming

All times are GMT -5. The time now is 05:09 PM.

Main Menu
Advertisement
My LQ
Write for LQ
LinuxQuestions.org is looking for people interested in writing Editorials, Articles, Reviews, and more. If you'd like to contribute content, let us know.
Main Menu
Syndicate
RSS1  Latest Threads
RSS1  LQ News
Twitter: @linuxquestions
Open Source Consulting | Domain Registration