Multithreaded number crunching / IO not producing the performance gain I expected
Hello guys,
I have written a program which converts Microsoft Bitmap files (24-bit BGR)
into raw 16-bit RGB.
For fun, I decided to make it multithreaded and see what kind of performance gain I could extract.
**** The Original Algorithm (SINGLE threaded) ****
This runs through a few hundred megabytes in about 30 seconds.
For 100 bitmaps...
1) Load 24-bit BGR into memory
2) Convert to 16-bit RGB
3) Save to disk
4) Repeat.
**** The Second Algorithm ****
This is just the above algorithm running in 3 threads, with a little thread synchronisation to read from an array of source images.
This runs a little faster, as I would expect, since some threads are bound to be doing the number crunching while other threads are waiting for disk I/O.
This runs in around 20 seconds.
**** The 3rd Algorithm ****
The second algorithm doesn't seem to be ideal, as there will be times when all threads are waiting on I/O. This 3rd algorithm is designed to be optimal, with I/O and number crunching in separate threads. But for reasons I cannot understand, it is much, much slower than the single-threaded version: it runs in 40 seconds!
This algorithm seems perfect on paper:
the I/O thread is always doing I/O, never waiting for number crunching to finish,
and the number crunching is never waiting for I/O to finish.
There is a little more overhead from thread synchronisation in the 3rd algorithm, but not much. Results are the same across single-core / dual-core chips, and across Windows / Linux.
Are you monitoring swap space? It could be that you're loading too much at once. The "write" thread might need to write to memory pages that have to be put in swap space while the "read" thread is running, and vice versa.
ta0kira
I set swappiness to zero.
If all bitmaps were loaded at the same time (worst-case scenario) the program should use approximately 130 MB of RAM.
My development system has over a gig free; I don't think swap is the problem.
Algorithm 2 only uses mutexes to read / increment an array index,
whereas algorithm 3 needs to do a little more:
Code:
do {
    lock_mutex();
    work = poll_for_work();
    unlock_mutex();
    if (work)
        break;
    else
        yield();
} while (1);
do_work();
for thread synchronisation.
Do you think that all the context switching involved in the yield() loop could be slowing me down? I have experimented with short sleeps instead of yield, but this has little impact.
I may try a spinlock and see how that affects performance.
Take a look at I/O performance. Are you loading from one area of the disk and saving to another? The difference in performance between the various runs might be due to differences in how far apart the load and save files are on the same disk. If you can put them on separate disks performance might improve.
Another thing to look at is your cache size. A large cache can greatly improve disk I/O by reducing the number of times that the disk arm is moved to and from the load and save files. As you increase the amount of application multi-threading, are you decreasing the amount of memory available for cache?
I would have thought something like the following would be a better way of implementing this:
load the image and give it to worker threads
worker thread one takes one strip of the image and converts it
worker thread two takes another strip and converts it
etc., each working on only one strip of the image
save the image.
I'd take a look at pthread_cond_broadcast. You can force the "work" function to block using pthread_cond_wait if there's no work available and have the reading thread do a pthread_cond_broadcast every time it imports a bitmap. If you have a separate write thread then the work thread can do the same for it.
ta0kira
If you've only got 1 I/O thread, the number-crunching threads are bottlenecking on waiting for I/O to read or write.
One method is to have a master thread that simply supplies the list of files to be processed to a master queue, then have multiple threads that each run their own loop: read_fname_from_q / read_file / calc / write_file / yield / repeat.
This is the boss+workers model. Very popular.
The question becomes: what takes the most time ... converting the bitmap once you've loaded it, or reading and writing the files?
I would frankly expect that I/O takes the most time (it usually does), so threading probably won't buy you anything useful.
The CPU intensive task of converting the bitmap won't benefit from being divided into multiple threads unless you have multiple (real...) CPUs available, or a high-speed network or cluster of some kind, since multitasking divides a CPU's time; it does not multiply it.
But a bitmap can be converted while the disk-drive is moving data in and out. So if you have enough real-RAM available, you could potentially have one thread moving data in and out while the other thread did the CPU-intensive work. (The CPU-thread should have a slightly lower priority.)
So... do some timing. Frankly, 30 seconds ain't bad.
KAudioCreator (CD ripper) somehow puts the I/O thread (or fork?) on one CPU while performing compression on the other for SMP systems (inferred using a CPU load monitor.) Not sure how it does it, but that might be a good program to hack.
ta0kira