bash 'read' built-in with multiple processes reading same descriptor
I've written a script that forks a specified number of times (with &); each of the resulting processes reads and executes lines from standard input until no more lines are available. All of the processes read from the same descriptor, so I'm wondering whether read is implemented to consume the descriptor only up to the next newline, or whether it will, e.g., read a page from the file and then parse it for a newline. So far the latter does not seem to be the case, because the script is working well; still, I'm not sure whether I should worry about a line occasionally being split in two.
Here is the script in case you're curious:
Code:
#!/bin/bash

if [ $# -lt 1 ]; then
    echo "$0 [number] (file(s)...)" 1>&2
    exit 1
fi

max_count="$1"
if ! [ "$max_count" -gt 0 ] 2> /dev/null; then
    echo "$0: invalid count '$1'" 1>&2
    exit 1
fi
shift

function clean_up()
{
    kill 0
}

trap clean_up SIGINT SIGTERM

cat $* | for I in `seq 1 "$max_count"`; do
    cat | while read line; do
        echo "[$0: fork $I: '$line']" 1>&2
        ( eval $line; )
    done &
done
Basically, one creates a file where each line is a command, or pipes the lines into the script. The script forks $1 times, and the forks take turns reading the piped input.
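A self-contained sketch of the pattern, shrunk to two forks and four harmless placeholder commands (the echo lines stand in for real work):

```shell
#!/bin/bash
# Two background forks share one pipe, as in the script above; each
# fork's cat/read loop evaluates whole lines (placeholder commands).
printf '%s\n' 'echo a' 'echo b' 'echo c' 'echo d' | {
    for i in 1 2; do
        cat | while read -r line; do
            eval "$line"
        done &
    done
    wait
}
```

Which fork gets which line is up to the scheduler, but every line arrives whole at exactly one fork.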
Without making an in-depth study, intuitively I would say that read reads up to the next newline and no further.
Look at it this way: you are reading from stdin. An interactive program reads from stdin as well. As soon as you hit Enter, read sees the newline and returns; it does not wait for more keyboard input. And stdin is stdin, regardless of whether you pipe a file into the program or hammer on a keyboard.
If I am not mistaken, if I enter:
Code:
printf "ls \
-l " | sh<enter>
this produces a long dir listing, whereas if I do:
Code:
printf "ls \n -l " | sh<enter>
this produces a short listing and a "command not found."
The first example shows that a command continued across a line break does not cause read (I mean the read in sh) to return early, while the second shows that sh processes a command as soon as a \n is seen.
Quote:
Originally Posted by jlinkels
Without making an in-depth study, intuitively I would say that read reads up to the next newline and no further.
I think we're talking about a few different things here. The terminal itself is in canonical input mode by default, so typed text isn't delivered to the program until [Enter] is pressed. As for the read system call, it returns whatever is available if the descriptor is non-blocking, or blocks until data arrives. In the case of sh, fread probably fills a fixed-size buffer and then parses it for a newline. For that not to happen, the program would almost have to read (the system call) one character at a time and stop at the newline. That seems counterintuitive, which is why I asked. It appears that bash either coordinates read (the built-in) between subshells or implements it by reading a character at a time. Because those are the only possibilities I can think of, I'm wondering if it's actually mere coincidence that it works. Thanks.
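One quick way to see that bash's read built-in takes exactly one line from a pipe and no more (a demonstration of the observable behaviour, not of the implementation):

```shell
#!/bin/bash
# read consumes only the first line; the cat that follows in the same
# subshell sees the remaining input intact.
printf 'first\nsecond\nthird\n' | {
    read -r line
    echo "read got: $line"
    echo "left over:"
    cat
}
```

If read had slurped a whole buffer, the trailing cat would print nothing.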
Kevin Barry
I might actually be better off writing this in C so I can make sure lines won't get split. One problem I've also run into with the script is that each fork reads ahead several lines, so some forks are stuck with a batch of long-running processes while others burn through a batch of short ones and exit.
Kevin Barry
What do you mean by "a line might be split in two"? Do you mean that the input is a sequence of bash commands -- in which case they could be split over many lines, as in a do-done compound or a "here document"?
Why do you have the second cat? AIUI, it is simply reading from stdin and writing to stdout which becomes the stdin for the read.
Why are you backgrounding the do-done? Why not simply background the eval subshell?
What is your objective? To input a list commands and process them in parallel, one per process? If that's the case, can you specify the format of those commands, especially regards being spread over more than one line. A sample input would be helpful.
This is a very interesting question about how the shell manages to behave sanely in this situation, given that the fork system call duplicates the open file descriptors, which then share a file offset. Why are some lines not read by more than one sub-shell?
Your musings about which system calls are being used are probably close to the mark -- the shell may be doing a fread() and the shell read command may be doing a blocking character-at-a-time read until it gets a line end.
It's late and I may not be grasping this clearly enough. Good night!
Quote:
Originally Posted by catkin
What do you mean by "a line might be split in two"? Do you mean that the input is a sequence of bash commands -- in which case they could be split over many lines, as in a do-done compound or a "here document"?
Most read system calls (man 2 read, as opposed to help read) read a fixed-size block, which almost certainly won't end exactly at a newline. When X processes read from the same pipe, intuition says each process will read, e.g., 4096 bytes and then parse it for newlines. Unless each of the X processes reads one byte at a time, I find it very unlikely that they could all read from the same pipe without chopping up the lines. Because that chopping-up isn't happening, I wonder if bash is in fact doing the byte-at-a-time read.
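The worry can be illustrated with dd, which performs a fixed-size read(2): a 7-byte read stops mid-line, and whatever reads the pipe next gets the tail of the split line (the sizes here are chosen purely for illustration):

```shell
#!/bin/bash
# dd does one 7-byte read(2); it gets "one\ntwo" and leaves the
# trailing "\nthree\n" for the next reader -- the second line is split.
printf 'one\ntwo\nthree\n' | {
    dd bs=7 count=1 2>/dev/null
    echo '<-- split here'
    cat
}
```

This is exactly what would happen if each background fork read in blocks instead of a byte at a time.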
Quote:
Originally Posted by catkin
Why do you have the second cat? AIUI, it is simply reading from stdin and writing to stdout which becomes the stdin for the read.
Why are you backgrounding the do-done? Why not simply background the eval subshell?
What is your objective?
The point is that the for loop forks X times, so that number of background processes is always running. Each background process has a while loop fed by a cat reading the parent shell's stdin. This results in X background processes evaluating a line at a time. The objective is to feed Y lines to X background processes so that X of those lines are being evaluated at any given time. Think of X as the number of cores; e.g., I have a dual quad-core hyperthreaded server at work, and today I fed ~180 lines to this script with an X of 16. The lines were evaluated 16 at a time until all were done. For the most part the lines are a single-line command iterated over a large number of files.
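For this stated objective, GNU xargs can also do the scheduling: with -P it hands a worker its next command only when the previous one exits, which avoids the read-ahead problem entirely. This swaps in a different tool rather than fixing the script itself, and the command file here is a made-up example:

```shell
#!/bin/bash
# Run one command per input line, at most 4 in flight at a time
# (GNU xargs: -a reads from a file, -d '\n' makes each line one item).
printf '%s\n' 'echo a' 'echo b' 'echo c' 'echo d' > cmds.txt
xargs -a cmds.txt -d '\n' -P 4 -I{} sh -c '{}'
rm -f cmds.txt
```

Because xargs dispatches on completion, long and short commands balance out across the workers automatically.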
Kevin Barry
I explored this with a modified version of your script, intended to be functionally identical as far as the "read" commands are concerned. If I've accidentally changed the behaviour, my exploration is invalid. Here's the script:
Code:
#!/bin/bash

function clean_up()
{
    kill 0 # kill all processes in the current process group with SIGTERM
}

trap clean_up SIGINT SIGTERM

for I in `seq 1 5`
do
    echo "[$0: fork $I: '$line']" 1>&2
    if [[ "$1" = 'cat' ]]; then
        cat | while read line
        do
            sleep $(( $RANDOM / 4096 + 1 ))
            echo "[$0: fork $I: '$line']" 1>&2
        done &
    else
        while read line
        do
            sleep $(( $RANDOM / 4096 + 1 ))
            echo "[$0: fork $I: '$line']" 1>&2
        done &
    fi
done < input.txt
Using the above script, I found that with the cat, only the first background process got the lines. That makes sense: the first background process's cat slurps the whole input file and feeds all of it to that process, whose "read" then consumes it a line at a time (until EOF). Without the cat, the five background processes' "read" commands shared the lines more or less evenly. As per our hypothesising, this suggests that bash's "read" must read a character at a time to be sure it doesn't go past a line-end character. Useful, but not high performance.
Here's the terminal transcript
Code:
c@CW8:~/d/bin/try$ ./essay.sh cat
[./essay.sh: fork 1: '']
[./essay.sh: fork 2: '']
[./essay.sh: fork 3: '']
[./essay.sh: fork 4: '']
[./essay.sh: fork 5: '']
c@CW8:~/d/bin/try$ [./essay.sh: fork 1: 'The progressive self-manifestation of Nature in man, termed']
[./essay.sh: fork 1: 'in modern language his evolution, must necessarily depend']
[./essay.sh: fork 1: 'upon three successive elements. There is that which is already']
[./essay.sh: fork 1: 'evolved; there is that which, still imperfect, still partly fluid,']
[./essay.sh: fork 1: 'is persistently in the stage of conscious evolution; and there']
[./essay.sh: fork 1: 'is that which is to be evolved and may perhaps be already']
[./essay.sh: fork 1: 'displayed, if not constantly, then occasionally or with some']
[./essay.sh: fork 1: 'regularity of recurrence, in primary formations or in others']
[./essay.sh: fork 1: 'more developed and, it may well be, even in some, however']
[./essay.sh: fork 1: 'rare, that are near to the highest possible realisation of our']
[./essay.sh: fork 1: 'present humanity. For the march of Nature is not drilled to a']
[./essay.sh: fork 1: 'regular and mechanical forward stepping. She reaches constantly']
[./essay.sh: fork 1: 'beyond herself even at the cost of subsequent deplorable retreats.']
[./essay.sh: fork 1: 'She has rushes; she has splendid and mighty outbursts; she']
[./essay.sh: fork 1: 'has immense realisations. She storms sometimes passionately']
[./essay.sh: fork 1: 'forward hoping to take the kingdom of heaven by violence.']
[./essay.sh: fork 1: 'And these self-exceedings are the revelation of that in her']
[./essay.sh: fork 1: 'which is most divine or else most diabolical, but in either case']
[./essay.sh: fork 1: 'the most puissant to bring her rapidly forward towards her']
[./essay.sh: fork 1: 'goal.']
c@CW8:~/d/bin/try$ ./essay.sh cat
[./essay.sh: fork 1: '']
[./essay.sh: fork 2: '']
[./essay.sh: fork 3: '']
[./essay.sh: fork 4: '']
[./essay.sh: fork 5: '']
c@CW8:~/d/bin/try$ [./essay.sh: fork 1: 'The progressive self-manifestation of Nature in man, termed']
[./essay.sh: fork 2: 'in modern language his evolution, must necessarily depend']
[./essay.sh: fork 3: 'upon three successive elements. There is that which is already']
[./essay.sh: fork 4: 'evolved; there is that which, still imperfect, still partly fluid,']
[./essay.sh: fork 5: 'is persistently in the stage of conscious evolution; and there']
[./essay.sh: fork 1: 'is that which is to be evolved and may perhaps be already']
[./essay.sh: fork 2: 'displayed, if not constantly, then occasionally or with some']
[./essay.sh: fork 3: 'regularity of recurrence, in primary formations or in others']
[./essay.sh: fork 4: 'more developed and, it may well be, even in some, however']
[./essay.sh: fork 5: 'rare, that are near to the highest possible realisation of our']
[./essay.sh: fork 1: 'present humanity. For the march of Nature is not drilled to a']
[./essay.sh: fork 2: 'regular and mechanical forward stepping. She reaches constantly']
[./essay.sh: fork 3: 'beyond herself even at the cost of subsequent deplorable retreats.']
[./essay.sh: fork 4: 'She has rushes; she has splendid and mighty outbursts; she']
[./essay.sh: fork 5: 'has immense realisations. She storms sometimes passionately']
[./essay.sh: fork 1: 'forward hoping to take the kingdom of heaven by violence.']
[./essay.sh: fork 5: 'goal.']
[./essay.sh: fork 4: 'the most puissant to bring her rapidly forward towards her']
[./essay.sh: fork 3: 'which is most divine or else most diabolical, but in either case']
[./essay.sh: fork 2: 'And these self-exceedings are the revelation of that in her']
Removing the cat might solve your problem: "One problem I've also run into with the script is that each fork reads ahead several lines, so some forks are stuck with a bunch of long processes, whereas some burn through a bunch of short processes and exit." The expression "several lines" suggests that cat is buffering (which I expected, though not with such a small buffer as your report implies) and that it buffers to a line end rather than to a fixed number of bytes (which I didn't expect).
Hopefully bash's "read" opens the file descriptor in exclusive mode, thus guarding against a race condition between your background processes.
Now that you mention it, cat must be reading a character at a time. The actual problem is that I don't want the input buffered; I don't want a fork to read another line until it's done with the one it has. Having the lines divided more or less evenly at the beginning causes the number of running forks to trail off at the end. If fewer than X forks are running, that should mean there are no lines left to run when those complete. Unfortunately, removing the cat from the fork prevents read from actually reading anything, because the fork is in the background. I'm not sure why this is, and I haven't found a way around it. I guess the better option is to write it in C, where I can have two-way communication between the parent and the forks (e.g. a fork requests another line when it's done with one).
Kevin Barry
What is different between our systems such that removing the cat works for me (as illustrated above) while cat seems to read ahead several lines on your system? Or did I change the script in some functional way during the rewrite? Have you tried my script?
Code:
c@CW8:~$ cat /etc/issue
Ubuntu 8.04.3 LTS \n \l
c@CW8:~$ bash --version
GNU bash, version 3.2.39(1)-release (i486-pc-linux-gnu)
Copyright (C) 2007 Free Software Foundation, Inc.
c@CW8:~$ cat --version
cat (GNU coreutils) 6.10
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjorn Granlund and Richard M. Stallman.
I wasn't aware of that feature, although it had crossed my mind as something that might work were it possible. The problem I see with it, though, is that there isn't a (man 2) select built-in in bash, so it would be difficult for the parent to monitor pipes from multiple coprocesses at once. Even distribution of lines isn't what's desirable; even distribution of processing time is. In fact, I'm quite convinced it would be easier to write this in C than to wrestle bash into doing it.
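For completeness: modern bash (4.3 and later, newer than the versions in this thread) has wait -n, which lets a parent dispatch a new line as soon as any worker finishes, with no select built-in needed. A sketch of that dispatcher, reading commands from stdin:

```shell
#!/bin/bash
# Keep at most $max background jobs; start the next command as soon as
# any running one exits (requires bash >= 4.3 for 'wait -n').
# Commands that themselves read stdin should be given < /dev/null.
max=3
while read -r line; do
    while (( $(jobs -rp | wc -l) >= max )); do
        wait -n   # block until some background job finishes
    done
    eval "$line" &
done
wait   # let the last jobs finish
```

Because dispatch happens on completion, this evens out processing time rather than line count.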
Kevin Barry
But my script, running without the cat did exactly what you want (as I understand it). Each of the sub-shells came back for another line when it had finished its work; that is as good a distribution of processing time as is possible, given that the individual lines/commands cannot be shared. It set up a bunch of sub-shells that did one task and came back for the next.
No problem if you want to code it in C, but it seems a pity when we have an apparently working solution in bash.
On my computer, the number 945 is shown. This means the first cat read 944 lines that didn't actually go anywhere; sleep doesn't read standard input. Therefore: cat buffers data before any of it is actually read. What you're seeing is read consuming data that cat has already buffered, and that data is apparently divided up front. This becomes more apparent with hundreds of long lines where the ones up front take several times longer to execute (e.g. 20 minutes for those, 2 minutes for the latter 2/3).
Kevin Barry
I think we have agreed that cat buffers. That is why I am suggesting that your script will work (in the sense that each sub-shell reads and processes a line at a time) if you remove the sub-shell cat; the shell "read" commands are well behaved, taking only a line at a time. Have you tried simply removing the cat in the sub-shell, like this? I think it will do what you want:
Code:
cat $* | for I in `seq 1 "$max_count"`; do
    while read line; do
        echo "[$0: fork $I: '$line']" 1>&2
        ( eval $line; )
    done &
done
Quote:
Originally Posted by catkin
Have you tried simply removing the cat in the sub-shell like this? I think it will do what you want
This is what I meant above when I said read doesn't actually read anything once I remove cat; this doesn't work for me. I'm not exactly sure why, but it probably has to do with the shell thinking standard input is a terminal. For example, this doesn't do anything at all:
Code:
while true; do
    echo "sleep `expr $RANDOM % 5`"
done | ( for I in `seq 1 4`; do
    while read line; do
        echo "[$0: fork $I: '$line']" 1>&2
        ( eval $line; )
    done &
done; sleep 10; kill 0; )
Add the cat back in and it works, though.
Kevin Barry