LinuxQuestions.org
Old 01-17-2008, 05:25 PM   #1
jamesj629
LQ Newbie
 
Registered: Jan 2008
Posts: 14

Rep: Reputation: 0
Q: Are pipelines processed in parallel or serially?


Do pipelines start multiple threads, so that data potentially flows from the start to the final output all at once (given enough CPUs)?

For instance...

cat data | awk 'some conditional operations' | sed 's/xyz/def/' | tr '@' ':' | awk 'some more' | sort

Does each of these stages get its own process?
Is cat assured to finish before handing over control to awk?

Many thanks for clearing this up :P

~James
 
Old 01-17-2008, 05:49 PM   #2
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
hi,

And welcome to LQ!

Just a thought: I'd guess this depends on the receiving process, not on
the pipe. E.g. it may make sense for awk as the receiver (right-hand side
of a pipe) to do its thing to each line as it dashes past, but it certainly
doesn't make sense to start sorting before you've seen all the data.


Pure speculation, I never really thought about it in the past ;}
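The guess about sort is easy to check from the shell (a minimal sketch; any POSIX sort behaves this way):

```shell
# sort cannot emit its first output line until it has read its entire
# input -- the line that must come first may be the very last one in:
printf 'banana\napple\ncherry\n' | sort
```

By contrast, a stage like tr or grep can emit each line as soon as it arrives.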


Cheers,
Tink
 
Old 01-17-2008, 06:22 PM   #3
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Hi.

The pipes are set up in parallel, but as Tink suggests, it's up to the utilities whether they pass the data through 'live' or not.

Try ping into awk:
$ ping localhost | awk '{print $(NF-1)}'
and it works in parallel: a line appears with every ping reply.

Then stick sed on the end:
$ ping localhost | awk '{print $(NF-1)}' | sed '1d'
and the output stalls: nothing appears until a stdio buffer fills or the input stream ends, because the middle stage's stdout is now block-buffered (it is writing to a pipe rather than a terminal).
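One way to see where the stall comes from (a sketch assuming GNU sed, whose -u flag makes its output unbuffered; BSD sed has -l for line-buffered instead):

```shell
# With sed's output unbuffered, lines flow through as the slow producer
# emits them, instead of arriving in one block at the end:
(echo one; sleep 1; echo two; sleep 1; echo three) | sed -u '1d'
```

GNU coreutils' `stdbuf -oL` can force line-buffering on stages (like awk above) that have no buffering flag of their own.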

Dave
 
Old 01-17-2008, 06:26 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,126

Rep: Reputation: 4120
"man 7 pipe"
 
Old 01-17-2008, 06:43 PM   #5
jamesj629
LQ Newbie
 
Registered: Jan 2008
Posts: 14

Original Poster
Rep: Reputation: 0
thanks all
 
Old 01-17-2008, 07:17 PM   #6
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,659
Blog Entries: 4

Rep: Reputation: 3940
A "pipe" is an inter-process communication (IPC) channel. (It's one of several.)

Basically, a "pipe" presents itself as a file, which a particular process can either read from or write to. The "pipe," then, becomes a buffered communications-mechanism between its "reader" and its "writer," always appearing to both of them as "just a file." But here's the magic...
  1. If "you" are reading from the pipe, but the writer (still exists and...) has not yet written anything (more) to the pipe, "you" will be put to sleep until the writer does write something. Then, you'll be able to read what the writer has just written.
  2. If "you" are writing, and the pipe becomes "full," you will be put to sleep until the pipe is no longer full.
So, with all that having been said, the two processes, the "reader" and the "writer," are free to run on whatever CPUs they can find, as best they can manage, at the sole discretion of the system scheduler.
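Those blocking rules can be watched directly with a named pipe (FIFO), which is the same mechanism given a filename. A minimal sketch, with the temp path chosen via mktemp purely for illustration:

```shell
# "A pipe is just a file": the writer sleeps until a reader opens the
# other end, and the reader sleeps until data has been written.
fifo=$(mktemp -u)           # an unused temp path for the FIFO
mkfifo "$fifo"
echo "hello" > "$fifo" &    # writer: blocks until cat opens the FIFO
cat "$fifo"                 # reader: prints "hello", then both exit
rm "$fifo"
```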

When you, in the shell, type something like ls | grep foo, you actually cause two processes to be launched: one is ls, which writes its output to its STDOUT, and the other is grep, which reads its input from its STDIN. And... (magic time!) the STDOUT from the one is the STDIN of the other! It's a pipe.
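You can see the two processes from the shell itself (a minimal sketch; $$ expands to each sh process's own PID):

```shell
# Writer and reader are distinct processes joined by a pipe: each
# subshell reports a different PID.
sh -c 'echo $$' | sh -c 'read writer; echo "writer=$writer reader=$$"'
```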

(Please step outside the room while your brain explodes. We've all been there... we don't mind. Now, when you come back into the room, you ought to be saying either "Sweet!" or else, "That is so way k-e-w-e-l!")

Yeah, those dudes at Bell Labs way back in the 1970s {I was almost-there, but nevermind!} had some pretty mind-blowing ideas...
 
Old 01-18-2008, 06:18 AM   #7
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
There are two major products that come out of Berkeley: LSD and UNIX. We don't believe this to be a coincidence.

- Jeremy S. Anderson
 
Old 01-18-2008, 07:58 AM   #8
jamesj629
LQ Newbie
 
Registered: Jan 2008
Posts: 14

Original Poster
Rep: Reputation: 0
There are times when I think everything would make a lot more sense if I were on LSD. Then I remember I prefer the purple pills the doctor gives me. *munch* *munch*
 
Old 01-18-2008, 06:55 PM   #9
osor
HCL Maintainer
 
Registered: Jan 2006
Distribution: (H)LFS, Gentoo
Posts: 2,450

Rep: Reputation: 78
In addition to the great explanation by sundialsvcs, there is another thing about unix pipes which is useful but requires care in some circumstances. (There are actually a few other such things, but I will only talk about one that relates somewhat to your question.) When a process writes to a pipe whose “read” end file descriptor has been closed, the process is sent a SIGPIPE signal. The default action on receiving this signal (i.e., the action taken unless the program explicitly handles or ignores it) is to terminate the program.

As you might imagine, this behavior presents great potential for use. Here is a prototypical example of the kind of situation for which it was intended: suppose you have a really big gzip file, of which you want to read the first few lines to see what is inside. You could do something like this:
Code:
zcat reallybigfile.gz | head
Under normal circumstances, the zcat might take a minute or more to execute, but since only the first ten lines are desired, the execution of zcat is “short-circuited”. What happens is this: a pipe is opened with the “write” end replacing stdout of zcat and the “read” end replacing stdin of head. The zcat process does its job and starts decompressing the file chunk-by-chunk and writing it to stdout. The head utility does its job, reads 10 lines from its stdin, and writes them to its stdout. Afterward, it exits (and part of that exiting involves closing the “read” end of the pipe). When zcat tries to write additional data to the “broken pipe”, it receives a SIGPIPE signal and terminates without decompressing the entire file. This saves you a lot of time/CPU cycles, and satisfies “do what I mean”.
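The same short-circuit is easy to reproduce without a big file (a minimal sketch using yes, which writes "y" lines forever until something stops it):

```shell
# head exits after 3 lines and closes the read end of the pipe; the
# next write by yes then raises SIGPIPE, which terminates it --
# otherwise this pipeline would never finish.
yes | head -n 3
```

In bash, `echo "${PIPESTATUS[0]}"` right after the pipeline reports 141 on Linux, i.e. 128 plus SIGPIPE's signal number 13.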

As you might also imagine, you can also abuse this functionality, and it might even get you into trouble if you aren’t careful.
 
Old 01-19-2008, 02:33 AM   #10
jamesj629
LQ Newbie
 
Registered: Jan 2008
Posts: 14

Original Poster
Rep: Reputation: 0
This is probably one of the more interesting linux topics I've covered so far. Thanks a lot guys
 
  

