Bash: when an empty IFS does not work like a default IFS (info)
I cannot find the cited thread, but it is quite obviously the opposite (that is, an empty IFS does not act like the default). Another simple example:
Code:
$ while read one two three; do echo "$two<"; done < <(echo one two three)
two<
$ while IFS= read one two three; do echo "$two<"; done < <(echo one two three)
<
$ while IFS= read one two three; do echo "$one<"; done < <(echo one two three)
one two three<
With an empty IFS there are no separator characters at all, so the input is not split and the first variable receives the entire line. Using bash 4.1.7 here.
Catkin, the -d delimiter option for the read Bash built-in has nothing to do with IFS if you only read into one variable. The delimiter defaults to newline, not IFS, and an empty delimiter refers to ASCII NUL, the zero byte.
IFS only comes into play when you read into multiple variables or an array. The delimiter still bounds the entire input, but that input is then split using IFS between the variables/array. If multiple variables are used, the final variable always receives the rest of the input.
It is easier to understand if you think of read as obtaining one full record, terminated by the delimiter, which defaults to a newline. The record is split into fields according to IFS, except that the final variable receives all the remaining fields. If only one variable is specified, it receives the entire record, regardless of what IFS is.
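A quick illustration of the two roles may help; the strings are just made-up test data, and the output is what I would expect from a reasonably recent bash:
Code:
$ echo 'alpha:beta:gamma delta' | { IFS=: read -r first rest; echo "first=>$first< rest=>$rest<"; }
first=>alpha< rest=>beta:gamma delta<
$ printf 'one two;ignored' | { IFS=' ' read -rd ';' a b; echo "a=>$a< b=>$b<"; }
a=>one< b=>two<
In the second command, -d ';' bounds the record (so "ignored" is never read), while IFS decides how that record is split between the two variables.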
Here is the relevant snippet of the bash-builtins manpage, edited for brevity:
Code:
read [-ers] [-a aname] [-d delim] options... [name ...]
    One line is read from the standard input, or from the file
    descriptor fd supplied as an argument to the -u option, and the
    first word is assigned to the first name, the second word to the
    second name, and so on, with leftover words and their intervening
    separators assigned to the last name.  If there are fewer words
    read from the input stream than names, the remaining names are
    assigned empty values.  The characters in IFS are used to split
    the line into words.  The backslash character (\) may be used to
    remove any special meaning for the next character read and for
    line continuation.  Options, if supplied, have the following
    meanings:
    -a aname
            The words are assigned to sequential indices of the array
            variable aname, starting at 0.  aname is unset before any
            new values are assigned.  Other name arguments are ignored.
    -d delim
            The first character of delim is used to terminate the
            input line, rather than newline.
It gets even weirder. If you don't have a trailing record separator, the entire last record is silently discarded:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\t '; printf '%s\n' "$input" | while read -r record ; do echo ">$record<"; done
>no_space<
>space_before<
>space_after<
>space_both_sides<
na@farm:~$ IFS=$'\t '; printf '%s' "$input" | while read -r record ; do echo ">$record<"; done
>no_space<
>space_before<
>space_after<
(no space_both_sides in output at all)
It happens even if IFS is something else, and we use NUL separators:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS='Z'; printf '%s\0' "$input" | while read -rd '' record ; do echo ">$record<"; done
>no_space
space_before
space_after
space_both_sides <
na@farm:~$ IFS='Z'; printf '%s' "$input" | while read -rd '' record ; do echo ">$record<"; done
(outputs nothing!?!)
If we use IFS explicitly as the record separator, it still happens if we do not have a trailing separator:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\n'; printf '%s\n' "$input" | while read -rd $'\n' record ; do echo ">$record<"; done
>no_space<
> space_before<
>space_after <
> space_both_sides <
na@farm:~$ IFS=$'\n'; printf '%s' "$input" | while read -rd $'\n' record ; do echo ">$record<"; done
>no_space<
> space_before<
>space_after <
(no space_both_sides here either)
I believe this is a bug in the Bash read builtin. It should not discard the record if there is no trailing separator. Nor should it consume the trailing IFS whitespace for a variable that receives the rest of the record, as it does here:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\t '; printf '%s\0' "$input" | while read -rd '' record ; do echo ">$record<"; done
>no_space<
>space_before<
>space_after<
>space_both_sides< (should have a space before <)
It does consume the trailing IFS whitespace even when multiple variables or an array are used, and this happens whatever record separator is used, so at least it is consistent:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\t '; printf '%sZ' "$input" | while read -rd 'Z' one two ; do echo ">$one<|>$two<"; done
>no_space
<|>space_before
space_after
space_both_sides< (should have a space before <)
na@farm:~$ IFS=$'\t '; printf '%sZ' "$input" | while read -rd 'Z' -a any ; do printf '>%s<\n' "${any[@]}" ; done
>no_space
<
>space_before
space_after<
>
<
>space_both_sides<
This has big implications for safe file name handling in Bash. In particular, to avoid truncating file names with trailing characters that might match IFS, one has to set IFS to an empty string:
Code:
na@farm:~$ touch $'test-file1' $'test-file2 '
na@farm:~$ unset IFS
na@farm:~$ find . -maxdepth 1 -type f -name 'test-file*' -print0 |
while read -rd "" FILE ; do
[ -f "$FILE" ] || printf '%s: No such file.\n' "$FILE" >&2
done
./test-file2: No such file.
na@farm:~$ find . -maxdepth 1 -type f -name 'test-file*' -print0 |
while IFS="" read -rd "" FILE ; do
[ -f "$FILE" ] || printf '%s: No such file.\n' "$FILE" >&2
done
(no output; both file names handled correctly)
Fortunately, this does not mess up IFS elsewhere, since the assignment applies to the read built-in only, temporarily. To wit:
Code:
na@farm:~$ touch $'test-file1' $'test-file2 '
na@farm:~$ IFS=$'\t '
na@farm:~$ while IFS="" read -rd "" FILE ; do
[ -f "$FILE" ] || printf '%s: No such file.\n' "$FILE" >&2
done < <( find . -maxdepth 1 -type f -name 'test-file*' -print0 )
(no output; both file names handled correctly)
na@farm:~$ printf '>%s< (%d chars)\n' "$IFS" ${#IFS}
> < (2 chars)
na@farm:~$ rm -f $'test-file1' $'test-file2 '
It gets even weirder. If you don't have a trailing record separator, the entire last record is silently discarded:
That's a problem with the while loop, not read. At least, not exactly. If there's no trailing delimiter, read doesn't return true. So the input gets read, but the loop's sub-commands don't get executed. You need to process the final variable values outside the loop if you want to safely handle all situations.
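For example, run outside a loop, read still assigns the partial data even though its exit status is false (made-up input string; this is what I get on my bash here):
Code:
$ printf 'partial record' | { read -r line; echo "exit=$? line=>$line<"; }
exit=1 line=>partial record<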
That's a problem with the while loop, not read. At least, not exactly. If there's no trailing delimiter, read doesn't return true.
Now whose brilliant idea was that?
Now we need to do e.g.
Code:
test -v LANG && OLD_LANG="$LANG" || unset OLD_LANG ; LANG=C
test -v LC_ALL && OLD_LC_ALL="$LC_ALL" || unset OLD_LC_ALL ; LC_ALL=C
while [ 1 ]; do
FILE=""
IFS="" read -rd '' FILE || [ -n "$FILE" ] || break
#
# do something with file
#
done
test -v OLD_LANG && LANG="$OLD_LANG" || unset LANG
test -v OLD_LC_ALL && LC_ALL="$OLD_LC_ALL" || unset LC_ALL
to handle e.g. NUL-delimited file lists correctly, just in case there is no final NUL at the end. The locale override is necessary to prevent non-UTF-8 byte sequences from aborting the script (if a UTF-8 locale is in use).
Thanks for the info, though, David the H.
I was a bit pressed for time when I posted yesterday, so I couldn't go through the thread carefully. I could only post & run re NA's last post.
To first respond to what's come since then:
Pulling up "help read" gives us this:
Code:
Exit Status:
The return code is zero, unless end-of-file is encountered, read times out,
or an invalid file descriptor is supplied as the argument to -u.
Remember, one invocation of read grabs only a single delimited section of input data. It doesn't seem completely unreasonable to me to have it differentiate between hitting a delimiter and EOF. It's only when used in a loop, which invokes it multiple times, that it becomes a "gotcha".
I suppose it would be nice to be able to have read return true on an EOF as well, perhaps as an option flag. Other than that, except in trivial cases, it would probably be best to set up a function for the sub-commands, to keep from having to duplicate the entire code section again.
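Something along these lines, as a rough sketch (process_record and inputfile are just placeholder names):
Code:
process_record() {
    # placeholder for the real sub-commands
    printf 'Got: >%s<\n' "$1"
}
while IFS= read -r record; do
    process_record "$record"
done < inputfile
# read returns false on a final unterminated record, but still assigns it,
# so handle that leftover here, outside the loop.
[ -n "$record" ] && process_record "$record"
Note the redirection form (done < inputfile) rather than a pipe, so that $record is still visible after the loop.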
Another option could be to use mapfile or similar to capture the input lines into an array first, and process those.
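A minimal sketch of that idea; as far as I can tell, mapfile keeps the final line even when the trailing newline is missing (the input is just test data):
Code:
$ mapfile -t lines < <(printf 'first\nsecond')   # note: no trailing newline
$ printf '>%s<\n' "${lines[@]}"
>first<
>second<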
Now, to respond to catkin: whitespace characters in IFS are treated slightly differently from other characters.
When whitespace characters in IFS match whitespace at the front or end of a line, all of that whitespace is removed, and the first non-IFS character starts the first field.
Non-whitespace characters in IFS, OTOH, always match individually and are always treated as actual delimiters. That means that if one is encountered at the very start, the "empty" value in front of it counts as the first field.
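A quick demonstration of the difference (arbitrary test strings; this is the output I get here):
Code:
$ echo '  a  b  ' | { IFS=' ' read -r x y; echo ">$x|$y<"; }
>a|b<
$ echo ':a:b' | { IFS=: read -r x y; echo ">$x|$y<"; }
>|a:b<
The leading and trailing whitespace is simply skipped, while the leading colon produces an empty first field.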
Remember, one invocation of read grabs only a single delimited section of input data. It doesn't seem completely unreasonable to me to have it differentiate between hitting a delimiter and EOF.
No, it is not unreasonable.
What is unreasonable is that you cannot differentiate between "data read, but no delimiter", "no more input", and "read error". The exit status is the same (false, 1) in all three cases, at least in Bash-4.2.10(1)-release (x86_64-pc-linux-gnu).
In most programming environments one encounters EOF only in the end-of-input sense, i.e. only when trying to read past the end of the input. Indeed, on POSIX systems, this is the only way the kernel tells userspace that the file pointer is at the end of the input, or that no further data will be available. A short read is always possible, and does not indicate anything about whether there is further data or not.
To me, personally, getting an EOF error (nonzero exit status) while also having input is counterintuitive.
That said, it is historical behaviour, and therefore will not change.
Fortunately, the workaround can apparently be written pretty concisely. For NUL-separated records:
Code:
DATA=""
while IFS="" read -rd "" DATA || [ -n "$DATA" ]; do
# Do something with DATA
DATA=""
done
While read does clear DATA in normal situations, it does not clear it if a read error occurs; for example, if the input is accidentally closed via exec <&- in the loop body. Without explicitly clearing DATA, the loop would never exit in a true read error case.
I guess I was just a bit frustrated with the behaviour. Thanks again for your efforts, David the H.
Fortunately, the workaround can apparently be written pretty concisely. For NUL-separated records:
Code:
DATA=""
while IFS="" read -rd "" DATA || [ -n "$DATA" ]; do
# Do something with DATA
DATA=""
done
Neat workaround for read's counter-intuitive behaviour, Nominal Animal.
A couple of points of style: DATA= is functionally identical to DATA="", and it may be preferable to use single quotes where double quotes are not necessary, as in the -d option.
You also stated more eloquently what I was trying to express with my suggestion of an option flag. We'd need to preserve historical compatibility while providing some ability to detect both states.