We have been experiencing odd failures related to FIFO pipes in our shell scripts.
The technique we commonly use is to background processes that write to a mkfifo pipe, which lets our scripts process large amounts of data in parallel. However, sometimes the command that receives the data from the pipe (via cat) behaves as though the input were a zero-byte file, yet when you subsequently cat the contents of the pipe, all of the data is returned.
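For context, here is a minimal sketch of that pattern, stripped of our application logic (the temp directory and file names are hypothetical, and seq stands in for our real producer): a background process writes into a named pipe while the foreground consumer reads from it.

```shell
#!/bin/ksh
# Minimal sketch of the background-writer FIFO pattern described above.
tmpdir=$(mktemp -d) || exit 1
mkfifo "$tmpdir/pipe"

# Producer: backgrounded; its open() of the FIFO blocks until a reader opens it.
seq 1 5 > "$tmpdir/pipe" &

# Consumer: reads everything the producer writes, in parallel with it.
cat "$tmpdir/pipe" > "$tmpdir/out"
wait

result=$(cat "$tmpdir/out")
echo "$result"
rm -rf "$tmpdir"
```

In the failure mode we are reporting, the consumer's cat occasionally sees end-of-file immediately, as if the producer had written nothing.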
To help us collect more data toward finding the cause of (and a solution for) this bug, please run the script below and post its output, along with the OS version and file system type of the directory where you run it, e.g.:
Code:
uname -a
mount -l | grep $(df -h . | tail -1 | awk '{print $1}')
Here is the script that effectively reproduces the bug:
Code:
#!/bin/ksh
failcount=0
nreps=10000
awk 'BEGIN{for(i=0;i<1000;++i){print i}}' >datafile
docat="cat datafile"
for i in {1..100} ; do
    docat="$docat && cat datafile"
done
for reps in {1..$nreps} ; do
    mkfifo mypipe
    eval "$docat" | awk '{print}' > mypipe &
    cat mypipe >myfile
    [[ -z $(head myfile) ]] && failcount=$(( $failcount + 1 ))
    echo "failrate ($failcount/$reps)" >status
    rm -f mypipe myfile
done
cat status
rm -f status datafile
This behavior may originate in the kernel or at the file system level, but we do not believe it is normal. The following are results from some recent tests we have run on multiple file systems:
Code:
FAILSPER10K  FSTYPE  LOCALOS  REMOTEOS
1            ext4    RHEL6    ()
1            tmpfs   RHEL5    ()
3            ext3    RHEL5    ()
326          nfs     RHEL5    RHEL5
373          nfs4    RHEL5    RHEL6
Version info from one of the hosts we used to run these tests:
Code:
$ uname -a
Linux myhost.mydomain.com 2.6.18-274.3.1.el5 #1 SMP Fri Aug 26 18:49:02 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
$ /bin/ksh --version
version sh (AT&T Research) 93t+ 2010-02-02
$ /bin/awk --version
GNU Awk 3.1.5
$ /usr/bin/mkfifo --version
mkfifo (GNU coreutils) 5.97
$ yum --version nfs
3.2.22
If you have any other insights or suggestions, please post a reply.
Please note that this is a relatively rare phenomenon, so the supplied script does some odd things in order to make it happen frequently enough to measure. There are ways to make it occur less often, but we still see it happen, with negative consequences for our data-processing systems.