We have been experiencing odd failures related to FIFO pipes in our shell scripts.
The technique we commonly use is to background processes that write to a mkfifo pipe, which lets our scripts process large amounts of data in parallel. However, sometimes the command that receives the data from the pipe (via cat) behaves as though the input were a zero-byte file, yet when you subsequently cat the contents of the pipe, all of the data is returned.
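For context, here is a minimal sketch of that pattern, stripped of our application logic (the temp directory and file names are hypothetical, and seq stands in for our real producer): a background process writes into a named pipe while the foreground consumer reads from it.

```shell
#!/bin/ksh
# Minimal sketch of the background-writer FIFO pattern described above.
tmpdir=$(mktemp -d) || exit 1
mkfifo "$tmpdir/pipe"

# Producer: backgrounded; its open() of the FIFO blocks until a reader opens it.
seq 1 5 > "$tmpdir/pipe" &

# Consumer: reads everything the producer writes, in parallel with it.
cat "$tmpdir/pipe" > "$tmpdir/out"
wait

result=$(cat "$tmpdir/out")
echo "$result"
rm -rf "$tmpdir"
```

In the failure mode we are reporting, the consumer's cat occasionally sees end-of-file immediately, as if the producer had written nothing.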
To help us collect more data toward finding the cause of (and a solution for) this bug, please run the script below and post its output, along with the OS version and file system type of the directory where you run it, e.g.:
Code:
uname -a
mount -l | grep $(df -h . | tail -1 | awk '{print $1}')
Here is the script that effectively reproduces the bug:
Code:
#!/bin/ksh
failcount=0
nreps=10000
awk 'BEGIN{for(i=0;i<1000;++i){print i}}' >datafile
docat="cat datafile"
for i in {1..100} ; do
    docat="$docat && cat datafile"
done
for reps in {1..$nreps} ; do
    mkfifo mypipe
    eval "$docat" | awk '{print}' > mypipe &
    cat mypipe >myfile
    [[ -z $(head myfile) ]] && failcount=$(( $failcount + 1 ))
    echo "failrate ($failcount/$reps)" >status
    rm -f mypipe myfile
done
cat status
rm -f status datafile
This behavior may originate in the kernel or at the file system level, but we do not believe it is normal. The following are results from some recent tests we have run on multiple file systems:
Code:
FAILSPER10K  FSTYPE  LOCALOS  REMOTEOS
1            ext4    RHEL6    ()
1            tmpfs   RHEL5    ()
3            ext3    RHEL5    ()
326          nfs     RHEL5    RHEL5
373          nfs4    RHEL5    RHEL6
Version info from one of the hosts we used to run these tests:
Code:
$ uname -a
Linux myhost.mydomain.com 2.6.18-274.3.1.el5 #1 SMP Fri Aug 26 18:49:02 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
$ /bin/ksh --version
version sh (AT&T Research) 93t+ 2010-02-02
$ /bin/awk --version
GNU Awk 3.1.5
$ /usr/bin/mkfifo --version
mkfifo (GNU coreutils) 5.97
$ yum --version nfs
3.2.22
If you have any other insights or suggestions, please post a reply.
Please note that this is a relatively rare phenomenon, so the supplied script does some odd things in order to make it happen frequently enough to measure. There are ways to make it occur less often, but we still see it happen, with negative consequences for our data-processing systems.