Old 10-17-2005, 02:06 PM   #1
Thinking
Member
 
How does the kernel decide which thread gets the CPU?


hiho@ll

I have 2 threads.

One (T1) reads with read() into a 500 KB buffer in 64 KB chunks
(I mean read() reads 64 KB at a time and stores it in a rotating buffer whose total size is 500 KB).

The second thread (T2) is supposed to write the data out with write() to the other side of the connection.

The problem:

Everything works fine, but then I ran some performance tests.

The test was that I wrote data continuously into T1 (from a little program we call PROC1).

The problem I noticed is that T1 gets much more CPU than T2.

That means T1 writes into the 500 KB buffer until it is nearly full, and only then does T2 write the whole backlog out;
then T1 gets the CPU again and nearly fills the buffer (sometimes it fills it completely) before T2 gets the CPU back.

I think the problem is that the kernel sees that T1 is waiting for data while another process (PROC1) is writing that data to T1,
so those two processes (T1, PROC1) get more CPU;
T2 then has to write out the whole backlog at once, so it needs the CPU too.

So does anybody have an idea how I can get well-balanced scheduling for the threads T1 and T2, so that none of them (PROC1, T1, T2) gets so much CPU that the others get stuck?

EDIT: btw, the problem described above is the biggest bottleneck I have.
Every process involved in the "read from one descriptor and write it to another" pipeline eats the CPU of the others.
Does anybody have an idea other than a rotating buffer to speed up the app?

thx@ll

Last edited by Thinking; 10-17-2005 at 02:44 PM.
 
Old 10-17-2005, 04:44 PM   #2
jtshaw
Senior Member
 
It depends a bit on which version of the kernel you are using (particularly 2.6 vs. 2.4), but I'll try to give you a general overview of how things work.

The scheduler picks the thread/process with the highest effective priority to execute. Two things can stop that process from executing: one, a blocking system call (read, sleep, etc.); two, the process/thread runs out of quanta. Running out of quanta is a pretty simple concept. Some processes are I/O bound (they make a lot of blocking system calls and spend much of their time waiting on I/O devices) and some are CPU bound (they need lots and lots of cycles to finish). To keep a CPU-bound process from taking over your computer and never letting anything else get processor time, every process gets a maximum amount of execution time each time it is scheduled; when that runs out, another process gets switched in and the one that was running has to wait its turn to be executed again.
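For illustration, a minimal sketch (hypothetical, not from this reply) of two user-visible knobs related to the above, assuming the default SCHED_OTHER policy: lowering a process's priority with nice() and voluntarily giving up the rest of the current time slice with sched_yield():

#include <errno.h>
#include <sched.h>     /* sched_yield() */
#include <stdio.h>
#include <unistd.h>    /* nice() */

int main(void)
{
    /* Lower this process's priority (higher nice value = lower priority)
     * so CPU-bound work interferes less with other runnable processes. */
    errno = 0;
    if (nice(5) == -1 && errno != 0)
        perror("nice");

    for (int i = 0; i < 1000; i++) {
        /* ... a small unit of CPU-bound work would go here ... */

        /* Voluntarily give up the rest of the current time slice so
         * other runnable threads/processes get the CPU sooner. */
        sched_yield();
    }
    return 0;
}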
 
Old 10-17-2005, 06:56 PM   #3
AlonsoTheBonzo
LQ Newbie
 
I'm not sure that I completely understand what you are trying to do, but it sounds like, to balance the reads and writes across threads, you want to do something a little more active than just counting on the OS to balance the load.

For example, since these are threads in the same process, it would be relatively easy for them to share a variable (protected with a mutex) that holds the total bytes written to the buffer. The reader could block until the buffer is 1/2 full or something, while the writer increments the count as data comes in. You might even be able to use a condition variable to help with this.

On this page it looks like someone is doing just what you are trying to do with a condition variable. (Scroll down to "Producer and Consumer communicating through a Protected Buffer".)
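A minimal sketch of that pattern (names and sizes invented here, not the code from the linked page): one thread fills a shared buffer, the other drains it, and a mutex plus two condition variables replace any busy-waiting:

#include <pthread.h>
#include <string.h>

#define BUF_SIZE (500 * 1024)

static char   buf[BUF_SIZE];
static size_t used;                 /* bytes currently in buf */
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

/* Producer side: called by the thread that read() data from one socket. */
void put(const char *data, size_t len)
{
    pthread_mutex_lock(&lock);
    while (used + len > BUF_SIZE)           /* buffer full: wait, don't spin */
        pthread_cond_wait(&not_full, &lock);
    memcpy(buf + used, data, len);
    used += len;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Consumer side: called by the thread that write()s to the other socket. */
size_t get(char *out, size_t max)
{
    size_t n;
    pthread_mutex_lock(&lock);
    while (used == 0)                        /* buffer empty: wait, don't spin */
        pthread_cond_wait(&not_empty, &lock);
    n = used < max ? used : max;
    memcpy(out, buf, n);
    memmove(buf, buf + n, used - n);         /* keep it simple: shift the remainder */
    used -= n;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return n;
}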
 
Old 10-18-2005, 07:19 AM   #4
Thinking
Member
 
@AlonsoTheBonzo
That's exactly what I want to do,
but I don't want to decrease performance, and thread locks do that.

What I already tried is this:
I have a read() function and a write() function.

I think I can get the maximum read/write throughput if they work simultaneously.

With scheduling they can't run at the same time,
so I can't reduce the time that read() or write() itself takes.

So I tried to reduce the time between read() and write() using some buffering techniques.

I tried 2 things:
a ring buffer and a slot buffer (don't know if that is the correct term).

The ring buffer worked like this:
read() reads in a loop and writes the data into a buffer;
it reads 64 KB chunks and writes them into a 500 KB buffer.

The problem I got: read() fills the whole buffer, and then write() writes out the whole buffer.
read() was one thread and write() was another thread,
and because they didn't run at the same time I had the same problem:
read() reads data, and write() only writes it out once the buffer is full.

Then I tried a slot buffer.
That means I have a structure with buffer chunks of 64 KB each,
so I had 500 chunks of 64 KB.

read() reads the data and copies it into a free chunk (I think this is the fastest way of reading, because right after the read I can do a bcopy() and then call read() again).

Because the buffer simply rotates (so it's like a FIFO), I have to check a variable to see whether something is in the current chunk,
so I had a simple loop which checks:
is something here
no
is something here
no
is something here
yes --> write();
is something here
...

So I got the problem that the threads stole CPU time from each other.
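That "is something here? / no" loop is a busy-wait: the checking thread spends its whole time slice spinning, which is exactly how the threads steal CPU from each other. A simplified, hypothetical version of such a slot loop (names and sizes invented), with the one-line change of at least yielding while spinning:

#include <sched.h>      /* sched_yield() */
#include <stddef.h>
#include <unistd.h>     /* write() */

struct slot {
    volatile int ready;           /* set by the read() thread once the chunk is filled */
    size_t       len;
    char         data[64 * 1024];
};

/* Consumer loop over a rotating array of slots. */
void writer_loop(struct slot *slots, int nslots, int out_fd)
{
    int i = 0;
    for (;;) {
        while (!slots[i].ready)
            sched_yield();        /* without this, the loop burns its whole quantum spinning */

        write(out_fd, slots[i].data, slots[i].len);
        slots[i].ready = 0;       /* hand the slot back to the reader */
        i = (i + 1) % nslots;
    }
}

Yielding only limits the damage; a mutex plus condition variable, as sketched earlier in the thread, removes the spinning entirely (and on current compilers the ready flag would want proper atomics rather than volatile).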

@jtshaw

I already read about this stuff.
The only solution I see (which might work) is to "hack" the kernel and set the time slice each process gets to a few microseconds,
so the threads can execute nearly at the same time (which might be a solution for my rotating buffer).

EDIT: I just found the sendfile() function, but it doesn't work with a socket on the reader side.
Does anybody know of a sendfile implementation for sockets only?
On FreeBSD it seems there is something like a zero_copy function.
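For reference, and going beyond what existed when this thread was written: sendfile() on Linux required the source descriptor to support mmap, so socket-to-socket does not work. Kernels from 2.6.17 on added splice(), which moves data between a socket and a pipe without copying through a user-space buffer; socket-to-socket forwarding then goes through an intermediate pipe. A rough, hypothetical sketch with minimal error handling:

#define _GNU_SOURCE
#include <fcntl.h>      /* splice(), SPLICE_F_MOVE, SPLICE_F_MORE */
#include <unistd.h>

/* Forward up to 'len' bytes from one socket to another through a pipe.
 * splice() needs a pipe on one side of every call, so the pipe serves
 * as the in-kernel staging area instead of a user-space buffer. */
ssize_t forward(int in_sock, int out_sock, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) == -1)
        return -1;

    while (len > 0) {
        ssize_t n = splice(in_sock, NULL, p[1], NULL, len,
                           SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0)
            break;                               /* EOF or error */

        /* Drain exactly what was just put into the pipe. */
        for (ssize_t left = n; left > 0; ) {
            ssize_t m = splice(p[0], NULL, out_sock, NULL, (size_t)left,
                               SPLICE_F_MOVE | SPLICE_F_MORE);
            if (m <= 0)
                goto out;
            left -= m;
        }
        len   -= (size_t)n;
        total += n;
    }
out:
    close(p[0]);
    close(p[1]);
    return total;
}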

Last edited by Thinking; 10-18-2005 at 08:26 AM.
 
Old 10-18-2005, 12:51 PM   #5
AlonsoTheBonzo
LQ Newbie
 
Assuming this is a single-processor box, no matter how your code is written only one thing is really happening at once. I therefore think that the very small additional overhead of a mutex and a block would be inconsequential to the overall performance of the application. It seems to me that the kernel-level process scheduling and whatnot costs far more (and is unavoidable) than adding a mutex strategy would.

Are pipes too slow?
 
Old 10-18-2005, 03:16 PM   #6
sundialsvcs
LQ Guru
 
I agree with Alonso. The progress of your threads is determined by their ability to perform I/O. The "overhead" of doing synchronization is actually non-existent by comparison to the delays inherent in even a single I/O call.

It may well be that the "reader" is able to out-run the "writer" because the "reader" is consuming a much larger buffer (in so-called "shared" memory) that has been allocated for it by the filesystem. There is no throttling mechanism in place to keep the buffer from filling up.

Furthermore, while you assume that both the reader and the writer really do have "something productive to do," perhaps in the eyes of the operating system they don't. File operations take place through a buffering mechanism and the behavior of that buffer must be taken into account. The system wants to achieve good throughput and efficient utilization of the physical device. If the reader is reading from the same device that the writer is writing to, then the kernel properly views the two processes as competing... and the filesystem's buffering behavior, not your process design, makes the most difference.

Your well-intentioned "500k buffer" might be making things worse, because after all that buffer is in virtual memory. And the filesystem is already doing buffering of its own, automatically. If your large buffer is being paged in and out, it's really defeating your purpose.

It seems to me, and I think that Alonso will agree, that the core of the problem is that you want the reader's and the writer's performance to be more balanced. You really don't want that buffer to fill up. You imply that you want to delay reads, if necessary, to keep that from happening. Consequently, you need to implement some kind of handshaking between the two threads in order to achieve that.

Or, how about a completely different approach? What if you used asynchronous I/O? What if you simply ... copied the file, and let the operating system and the filesystem do all the buffering-magic? What if you memory-mapped the two files and did a move()?
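If the data really did live in two regular files, the memory-mapping idea could look roughly like this (a hypothetical sketch; the helper name and the plain memcpy() standing in for "move()" are my reading of the suggestion):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src_path to dst_path by memory-mapping both files. */
int mmap_copy(const char *src_path, const char *dst_path)
{
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    struct stat st;
    void *in, *out;

    if (src < 0 || dst < 0 || fstat(src, &st) < 0 || st.st_size == 0)
        return -1;
    if (ftruncate(dst, st.st_size) < 0)         /* size the destination first */
        return -1;

    in  = mmap(NULL, st.st_size, PROT_READ,  MAP_PRIVATE, src, 0);
    out = mmap(NULL, st.st_size, PROT_WRITE, MAP_SHARED,  dst, 0);
    if (in == MAP_FAILED || out == MAP_FAILED)
        return -1;

    memcpy(out, in, st.st_size);                /* the whole "copy loop" */
    msync(out, st.st_size, MS_SYNC);

    munmap(in,  st.st_size);
    munmap(out, st.st_size);
    close(src);
    close(dst);
    return 0;
}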
 
Old 10-19-2005, 03:55 AM   #7
Thinking
Member
 
@AlonsoTheBonzo
Well, I'm not sure if I'm correct, but mutex synchronization is the slowest synchronization method;
semaphores are much faster.
I tried it with a mutex a few days ago and it decreased the speed by about 30%.
I'm not sure pipes are that useful, because I have to do a read/write for them too, and as far as I know they are not much faster than plain (local!) socket I/O,
so I don't think pipes reduce the overhead.
But maybe the threads can be balanced correctly with pipes!
If that's true, you may be right.

@sundialsvcs
The only things in use here for now are RAM and CPU, my app and the kernel.
No hard disk.
No NIC.

So if I do the synchronization wrong, it is very likely that I get a big synchronization overhead!

Your 500k argument about paging makes sense to me,
and your "let the kernel do the work" argument is also good,
but I can't do that,
because I have 2 sockets (I don't really use the filesystem; if I did, I could use sendfile(), but I can't),
and (maybe I just don't know how) the kernel can't do the job on its own,
so I have to do it myself.

I'm not sure what you mean by asynchronous I/O,
because I think I already do that.
For 2 sockets I have 4 threads:
2 reading,
2 writing.
So I'm doing it asynchronously, no?

But I think I know how I can achieve the maximum performance here:
I have to shorten the path the data takes from one socket to the other,
and I think the only way is a zero-copy mechanism for 2 sockets.
Sounds like this:
http://lwn.net/Articles/152424/

btw: I don't want a solution, I want THE solution ;-)
I have had a (well) working solution for a week now, but I want it faster,
as fast as possible under the sun.
 
Old 10-19-2005, 12:21 PM   #8
AlonsoTheBonzo
LQ Newbie
 
"i tried it with mutex a few days ago and it decreased the speed about 30%"

Can I see this code, or the relevant portions, including the parts where you are timing the threads?
 
Old 10-19-2005, 12:22 PM   #9
jtshaw
Senior Member
 
Quote:
Originally posted by Thinking
@AlonsoTheBonzo
Well, I'm not sure if I'm correct, but mutex synchronization is the slowest synchronization method;
semaphores are much faster.
That doesn't make any sense to me. A mutex is typically implemented as a semaphore with the initial value set to 1 instead of x (where x >= 1). I'm fairly certain that is the way it is implemented in Linux, as it wouldn't make much sense to have two different implementations of essentially the same idea. Feel free to point out if I'm wrong here (happens pretty often).
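To illustrate the point with the POSIX user-space API (this says nothing about the kernel's internal implementation): a counting semaphore initialized to 1 behaves exactly like a mutex:

#include <semaphore.h>

static sem_t binary_sem;

void init(void)
{
    /* Initial value 1 => at most one thread inside the critical
     * section at a time, i.e. a mutex. A value N > 1 would allow
     * N threads in at once. */
    sem_init(&binary_sem, 0, 1);
}

void critical_section(void)
{
    sem_wait(&binary_sem);     /* "lock"   */
    /* ... touch the shared buffer ... */
    sem_post(&binary_sem);     /* "unlock" */
}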

I agree with Alonso: the way you're doing I/O, I'd say you're far more bound by the speed of the disk than by any locking mechanism you might use to provide good thread synchronization.

How are the threads that do the writing getting the information they write?

Typically, in a situation with lots of writing and lots of reading going on, a reader-writer lock is the way to go (since, as long as nobody is writing, we usually care little about how many threads are reading), but that doesn't seem to fit what you want to do either. I guess I'm still confused about how slowing down the readers would ever be of benefit to you.
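For reference, a minimal pthreads reader-writer lock sketch (the shared state is invented for the example, not tied to the poster's code): many readers may hold the lock at once, while a writer gets exclusive access:

#include <pthread.h>
#include <stddef.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static size_t bytes_buffered;      /* example shared state */

size_t peek_level(void)            /* many threads may do this concurrently */
{
    size_t n;
    pthread_rwlock_rdlock(&rw);
    n = bytes_buffered;
    pthread_rwlock_unlock(&rw);
    return n;
}

void add_bytes(size_t n)           /* only one writer at a time */
{
    pthread_rwlock_wrlock(&rw);
    bytes_buffered += n;
    pthread_rwlock_unlock(&rw);
}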
 
  

