LinuxQuestions.org
Old 04-15-2012, 04:58 AM   #1
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578
Blog Entries: 31

Rep: Reputation: 1208
Bash: when an empty IFS does not work like a default IFS (info)


A few years ago, here on LQ, some of us essayed a theory that, for bash, an empty IFS is functionally identical to the default IFS.

There is a situation in which that is not true. It is when used in conjunction with read. Here's a demonstration.
Code:
c@CW8:/tmp/tmp$ rm *
c@CW8:/tmp/tmp$ touch no_space_after 'space_after '
c@CW8:/tmp/tmp$ echo -n "$IFS" | od -a
0000000  sp  ht  nl
0000003
c@CW8:/tmp/tmp$ while read -r -d '' file; do echo "$file<"; done < <(find $dir -type f -print0)
./no_space_after<
./space_after<
c@CW8:/tmp/tmp$ while IFS= read -r -d '' file; do echo "$file<"; done < <(find $dir -type f -print0)
./no_space_after<
./space_after <
 
Old 04-15-2012, 06:12 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1986
I cannot find the cited thread, but the opposite is quite obvious: an empty IFS does not act like the default. Another simple example:
Code:
$ while read one two three; do echo "$two<"; done < <(echo one two three)
two<
$ while IFS= read one two three; do echo "$two<"; done < <(echo one two three)
<
$ while IFS= read one two three; do echo "$one<"; done < <(echo one two three)
one two three<
Since IFS is null (empty), the input line is not split at all, so the whole line lands in the first variable. Using bash 4.1.7 here.

Last edited by colucix; 04-15-2012 at 06:14 AM.
 
Old 04-15-2012, 09:29 AM   #3
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,831

Rep: Reputation: 2112
Well the documented behaviour is that unset IFS is the same as default, not empty IFS:
Quote:
3.5.7 Word Splitting
...
If IFS is unset, or its value is exactly <space><tab><newline>, the default...

If the value of IFS is null, no word splitting occurs.
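
A minimal sketch of the three cases in that quote, using ordinary word splitting of an unquoted expansion rather than read. The names count_words and sample are only illustrative and do not come from the manual.

```bash
#!/usr/bin/env bash
# Count the words produced by an unquoted expansion under three IFS states.
count_words() { echo "$#"; }

sample='one two three'

unset IFS                       # unset: behaves like the default
n_unset=$(count_words $sample)

IFS=$' \t\n'                    # the default value, set explicitly
n_default=$(count_words $sample)

IFS=                            # null (empty): no word splitting at all
n_empty=$(count_words $sample)

unset IFS                       # back to default behaviour
echo "unset=$n_unset default=$n_default empty=$n_empty"
# prints: unset=3 default=3 empty=1
```

So an unset IFS and the default value split identically, while an empty IFS leaves the expansion as a single word.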
 
2 members found this post helpful.
Old 04-15-2012, 10:02 AM   #4
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Catkin, the -d delimiter option for the read Bash built-in has nothing to do with IFS if you only read into one variable. The delimiter defaults to newline, not IFS, and an empty delimiter refers to ASCII NUL, zero byte.

Only when you read into multiple variables or an array, IFS comes into play. The delimiter still marks the end of the entire input record, but that record is then split using IFS between the variables or array elements. If multiple variables are used, the final variable always receives the rest of the input.

It is easier to understand if you think of read as obtaining one full record, terminated by the delimiter, which defaults to a newline. The record will be split into fields according to IFS, except that the final variable will receive all the rest of the fields. If only one variable is specified, it will receive the entire record, regardless of what IFS is.

Here is the relevant snippet of the bash-builtins man page, edited for brevity:
Code:
read [-ers] [-a aname] [-d delim] options... [name ...]
    One  line  is  read  from  the  standard input, or from the file
    descriptor fd supplied as an argument to the -u option, and  the
    first word is assigned to the first name, the second word to the
    second name, and so on, with leftover words and their  interven‐
    ing  separators  assigned  to the last name.  If there are fewer
    words read from the input stream than names, the remaining names
    are  assigned  empty  values.  The characters in IFS are used to
    split the line into words.  The backslash character (\)  may  be
    used  to  remove any special meaning for the next character read
    and for line continuation.  Options, if supplied, have the  fol‐
    lowing meanings:
     -a aname
        The words are assigned to sequential indices of the array
        variable aname, starting at 0.  aname is unset before any
        new  values  are  assigned.   Other  name  arguments  are
        ignored.
     -d delim
        The first character of delim is  used  to  terminate  the
        input line, rather than newline.
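
As a rough illustration of the record/field distinction described in the excerpt, the sketch below ends the record at ';' with -d and splits it into fields on ':' with IFS. The input string and variable names are made up for the example.

```bash
#!/usr/bin/env bash
# One record, terminated by ';' (-d ';'), split into fields on ':' (IFS).
# Everything after the first ';' belongs to later records and is not read.
IFS=: read -r -d ';' first second rest < <(printf 'a:b:c:d;ignored;')
echo "first=$first second=$second rest=$rest"
# prints: first=a second=b rest=c:d

# The same record read into an array with -a.
IFS=: read -r -d ';' -a fields < <(printf 'a:b:c:d;')
echo "${#fields[@]} fields, last=${fields[3]}"
# prints: 4 fields, last=d
```

The last named variable receives the leftover fields together with their separators, as the man page says.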
 
Old 04-15-2012, 09:56 PM   #5
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578

Original Poster
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by ntubski
Well the documented behaviour is that unset IFS is the same as default, not empty IFS:
Thanks for the correction ntubski

I was mis-remembering

Sorry for the confusion.
 
Old 04-15-2012, 10:26 PM   #6
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578

Original Poster
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by Nominal Animal
Only when you read into multiple variables or an array, IFS comes into play.
Hello Nominal Animal

The OP shows read being used with a single value and IFS having an effect (the trailing space is removed). Here's a more comprehensive demonstration:
Code:
c@CW8:~$ input='no_space
>  space_before
> space_after 
>  space_both_sides '
c@CW8:~$ echo "$input" | while read -r record; do echo ">$record<"; done
>no_space<
>space_before<
>space_after<
>space_both_sides<
c@CW8:~$ echo "$input" | while IFS= read -r record; do echo ">$record<"; done
>no_space<
> space_before<
>space_after <
> space_both_sides <
I interpreted that to mean that any characters in IFS are stripped from the left and right sides of the record, but it is not so:
Code:
c@CW8:~$ input='no_space
Xspace_before
space_afterX
Xspace_both_sidesX'
c@CW8:~$ echo "$input" | while read record; do echo ">$record<"; done
>no_space<
>Xspace_before<
>space_afterX<
>Xspace_both_sidesX<
c@CW8:~$ echo "$input" | while IFS=X read record; do echo ">$record<"; done
>no_space<
>Xspace_before<
>space_after<
>Xspace_both_sidesX<
I do not understand why the trailing X was stripped from the third record but not the fourth. It is not because it is the last record:
Code:
c@CW8:~$ input='no_space
> Xspace_before
> space_afterX
> Xspace_both_sidesX
> another record'
c@CW8:~$ echo "$input" | while IFS=X read record; do echo ">$record<"; done
>no_space<
>Xspace_before<
>space_after<
>Xspace_both_sidesX<
>another record<
 
Old 04-16-2012, 12:43 AM   #7
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
It gets even weirder. If you don't have a trailing record separator, the entire last record is silently discarded:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\t '; printf '%s\n' "$input" | while read -r record ; do echo ">$record<"; done
>no_space<
>space_before<
>space_after<
>space_both_sides<
na@farm:~$ IFS=$'\t '; printf '%s' "$input" | while read -r record ; do echo ">$record<"; done
>no_space<
>space_before<
>space_after<
(no space_both_sides in output at all)
It happens even if IFS is something else, and we use NUL separators:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS='Z'; printf '%s\0' "$input" | while read -rd '' record ; do echo ">$record<"; done
>no_space
 space_before
space_after 
 space_both_sides <
na@farm:~$ IFS='Z'; printf '%s' "$input" | while read -rd '' record ; do echo ">$record<"; done
(outputs nothing!?!)
If we use IFS explicitly as the record separator, it still happens if we do not have a trailing separator:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\n'; printf '%s\n' "$input" | while read -rd $'\n' record ; do echo ">$record<"; done
>no_space<
> space_before<
>space_after <
> space_both_sides <
na@farm:~$ IFS=$'\n'; printf '%s' "$input" | while read -rd $'\n' record ; do echo ">$record<"; done
>no_space<
> space_before<
>space_after <
(no space_both_sides here either)
I believe this is a bug in the Bash read builtin. It should not ignore the record if there is no trailing separator. Nor should it consume the trailing separator for a variable that receives the rest of the record, as it does here:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\t '; printf '%s\0' "$input" | while read -rd '' record ; do echo ">$record<"; done
>no_space
 space_before
space_after 
 space_both_sides<   (should have a space before <)
It consumes the trailing separator even when multiple variables or an array are used, and it happens whatever record separator is used, so at least it is consistent:
Code:
na@farm:~$ input=$'no_space\n space_before\nspace_after \n space_both_sides '
na@farm:~$ IFS=$'\t '; printf '%sZ' "$input" | while read -rd 'Z' one two ; do echo ">$one<|>$two<"; done
>no_space
<|>space_before
space_after 
 space_both_sides<   (should have a space before <)
na@farm:~$ IFS=$'\t '; printf '%sZ' "$input" | while read -rd 'Z' -a any ; do printf '>%s<\n' "${any[@]}" ; done
>no_space
<
>space_before
space_after<
>
<
>space_both_sides<
This has big implications for safe file name handling in Bash. In particular, to avoid truncating file names with trailing characters that might match IFS, one has to set IFS to an empty string:
Code:
na@farm:~$ touch $'test-file1' $'test-file2 '
na@farm:~$ unset IFS
na@farm:~$ find . -maxdepth 1 -type f -name 'test-file*' -print0 | 
           while read -rd "" FILE ; do
               [ -f "$FILE" ] || printf '%s: No such file.\n' "$FILE" >&2
           done
./test-file2: No such file.
na@farm:~$ find . -maxdepth 1 -type f -name 'test-file*' -print0 | 
           while IFS="" read -rd "" FILE ; do
               [ -f "$FILE" ] || printf '%s: No such file.\n' "$FILE" >&2
           done
(no output; both file names handled correctly)
Fortunately, it does not mess up the IFS, since it only sets it for the read built-in, temporarily. To wit:
Code:
na@farm:~$ touch $'test-file1' $'test-file2 '
na@farm:~$ IFS=$'\t '
na@farm:~$ while IFS="" read -rd "" FILE ; do
               [ -f "$FILE" ] || printf '%s: No such file.\n' "$FILE" >&2
           done < <( find . -maxdepth 1 -type f -name 'test-file*' -print0 )
(no output; both file names handled correctly)
na@farm:~$ printf '>%s< (%d chars)\n' "$IFS" ${#IFS}
>	 < (2 chars)
na@farm:~$ rm -f $'test-file1' $'test-file2 '
I'm off to fix my blog post about this.
 
2 members found this post helpful.
Old 04-16-2012, 01:19 AM   #8
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Quote:
Originally Posted by Nominal Animal
It gets even weirder. If you don't have a trailing record separator, the entire last record is silently discarded:
That's a problem with the while loop, not read. At least not exactly. If there's no trailing delimiter, read doesn't return as true. So the input gets read, but the loop's sub-commands don't get executed. You need to process the final variable values outside the loop if you want to safely handle all situations.

See here: http://mywiki.wooledge.org/BashFAQ/001
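
A small sketch of that behaviour, assuming input whose last line has no trailing newline. The array name seen is only for illustration.

```bash
#!/usr/bin/env bash
# Input with no trailing newline: the final read fails, yet still fills
# the variable, so the loop body never runs for the last record.
seen=()
while IFS= read -r line; do
    seen+=("$line")
done < <(printf 'first\nsecond\nlast')

echo "inside loop: ${#seen[@]} records; left over: '$line'"
# prints: inside loop: 2 records; left over: 'last'

# Handle the leftover value after the loop.
if [ -n "$line" ]; then
    seen+=("$line")
fi
echo "total: ${#seen[@]} records"
# prints: total: 3 records
```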

Last edited by David the H.; 04-16-2012 at 01:21 AM.
 
1 member found this post helpful.
Old 04-16-2012, 01:59 AM   #9
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578

Original Poster
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by Nominal Animal
Code:
na@farm:~$ IFS=$'\t '
na@farm:~$ printf '>%s< (%d chars)\n' "$IFS" ${#IFS}
>	 < (2 chars)
na@farm:~$ rm -f $'test-file1' $'test-file2 '
Thanks for coming on-board with this

So it looks like that very rare phenomenon, a bash bug. I plan to report it after waiting a few days for understanding to clarify.

Incidentally, printf's %q format specifier is helpful to show the actual value of IFS in the above:
Code:
c@CW8:~$ IFS=$'\t '
c@CW8:~$ printf '>%q< (%d chars)\n' "$IFS" ${#IFS}
>$'\t '< (2 chars)


---------- Post added 16th Apr 2012 at 12:30 ----------

Quote:
Originally Posted by David the H.
That's a problem with the while loop, not read.
Thanks David
 
Old 04-16-2012, 02:01 AM   #10
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by David the H.
That's a problem with the while loop, not read. At least not exactly. If there's no trailing delimiter, read doesn't return as true.
Now whose brilliant idea was that?

Now we need to do e.g.
Code:
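# Save the current locale settings (test -v needs bash 4.2 or later),
# then switch to the C locale for the duration of the loop.
test -v LANG   && OLD_LANG="$LANG"     || unset OLD_LANG   ; LANG=C
test -v LC_ALL && OLD_LC_ALL="$LC_ALL" || unset OLD_LC_ALL ; LC_ALL=C

while true; do

    FILE=""
    # Leave the loop only when read fails AND nothing was read.
    IFS="" read -rd '' FILE || [ -n "$FILE" ] || break

    #
    # do something with file
    #

done

# Restore the saved values, or unset the variables if they were unset before.
test -v OLD_LANG   && LANG="$OLD_LANG"     || unset LANG
test -v OLD_LC_ALL && LC_ALL="$OLD_LC_ALL" || unset LC_ALL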
to handle e.g. NUL-delimited file lists correctly, just in case there is no final NUL at the end. The locale override is necessary to prevent non-UTF-8 sequences from aborting the script (if a UTF-8 locale is used).

Thanks for the info, though, David the H.

Last edited by Nominal Animal; 04-16-2012 at 02:11 AM.
 
Old 04-17-2012, 06:19 AM   #11
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
I was a bit pressed for time when I posted yesterday, so I couldn't go through the thread carefully. I could only post & run re NA's last post.

To first respond to what's come since then:

Pulling up "help read" gives us this:

Code:
    Exit Status:
    The return code is zero, unless end-of-file is encountered, read times out,
    or an invalid file descriptor is supplied as the argument to -u.
Remember, one invocation of read grabs only a single delimited section of input data. It doesn't seem completely unreasonable to me to have it differentiate between hitting a delimiter and EOF. It's only when used in a loop, which invokes it multiple times, that it becomes a "gotcha".

I suppose it would be nice to be able to have read return true on an EOF as well, perhaps as an option flag. Other than that, except in trivial cases, it would probably be best to set up a function for the sub-commands, to keep from having to duplicate the entire code section again.

Another option could be to use mapfile or similar to capture the input lines into an array first, and process those.
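
A sketch of the mapfile approach for newline-separated input, combined with a function for the per-record work. It assumes bash 4.0 or later for mapfile; process_line is a hypothetical handler name, not something from the thread.

```bash
#!/usr/bin/env bash
# Capture every line first, then process the array.
# mapfile -t strips the trailing newline from each line and keeps a final
# line even when it has no newline. It does no IFS stripping.
mapfile -t lines < <(printf 'first\n second \nlast')

process_line() {            # hypothetical per-record handler
    printf '>%s<\n' "$1"
}

for l in "${lines[@]}"; do
    process_line "$l"
done
# prints:
# >first<
# > second <
# >last<
```

Because the whole input is held in memory, this suits modest inputs better than very large streams.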


Now, to respond to catkin, whitespace is treated slightly differently by IFS than other characters.

When a whitespace character set in IFS matches a whitespace string at the front or end of a line, then all of that whitespace is removed, and the first non-IFS-set character starts the first field.

Non-whitespace characters in IFS, OTOH, always match individually, and are always considered actual delimiters. That means that, if encountered initially, the "empty" value in front of it is considered the first field.
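
The difference can be seen with a single-variable read. This is only a sketch: show is a hypothetical helper, and the expected results mirror the transcripts earlier in the thread.

```bash
#!/usr/bin/env bash
# Read one line into a single variable under the given IFS and print it
# between markers.
show() {
    local ifs=$1 input=$2 record
    IFS=$ifs read -r record <<< "$input"
    printf '>%s<\n' "$record"
}

show ' ' '  both  '   # >both<    IFS whitespace is stripped from both ends
show 'X' 'Xbefore'    # >Xbefore< leading X delimits an empty first field; kept
show 'X' 'afterX'     # >after<   one field plus one trailing delimiter; dropped
show 'X' 'XbothX'     # >XbothX<  two fields, so separators are kept verbatim
```

In the last case the line holds an empty field and a second field, so the single variable receives both fields with their separators intact.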


http://mywiki.wooledge.org/IFS
 
Old 04-17-2012, 02:14 PM   #12
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Quote:
Originally Posted by David the H.
Remember, one invocation of read grabs only a single delimited section of input data. It doesn't seem completely unreasonable to me to have it differentiate between hitting a delimiter and EOF.
No, it is not unreasonable.

What is unreasonable is that you cannot differentiate between "data read, but no delimiter", "no more input", and "read error". The exit status is the same (false, 1) in all three cases, at least in Bash-4.2.10(1)-release (x86_64-pc-linux-gnu).
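
A quick sketch of the first two cases (the read error case is left out, since its diagnostics may vary between versions). The variable names are illustrative.

```bash
#!/usr/bin/env bash
# Case 1: data present, but no trailing delimiter before end of input.
IFS= read -r partial < <(printf 'no newline')
status_partial=$?

# Case 2: no input at all.
IFS= read -r nothing < /dev/null
status_empty=$?

echo "partial: status=$status_partial value='$partial'"
echo "empty:   status=$status_empty value='$nothing'"
# Both statuses are 1; only the variable's contents tell the cases apart.
```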

In most programming environments one encounters EOF only in the end of file error sense, i.e. only when trying to read past the end of input. Indeed, in POSIX systems, this is the only way kernels tell the userspace that the file pointer is at the end of input, or that there will be no further data available. A short read is always possible, and does not indicate anything about whether there is further data or not.

To me, personally, getting an EOF error (nonzero exit status) while also having input is counterintuitive.

That said, it is historical behaviour, and therefore will not change.

Fortunately, the workaround can apparently be written pretty concisely. For NUL-separated records:
Code:
DATA=""
while IFS="" read -rd "" DATA || [ -n "$DATA" ]; do
    # Do something with DATA
    DATA=""
done
While read does clear DATA in normal situations, it does not clear it if a read error occurs, for example if you accidentally close the input in the loop body via exec <&-. Without explicitly clearing DATA, the loop would never exit in the true read error case.
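
For illustration, here is the same loop run against NUL-separated input whose final record lacks the terminating NUL. The records array and the sample input are assumptions made for this sketch.

```bash
#!/usr/bin/env bash
# Collect NUL-separated records into an array; the last record has no NUL.
records=()
DATA=""
while IFS="" read -rd "" DATA || [ -n "$DATA" ]; do
    records+=("$DATA")
    DATA=""
done < <(printf 'one\0 two \0three')

printf '>%s<\n' "${records[@]}"
# prints:
# >one<
# > two <
# >three<
```

The empty IFS keeps the surrounding spaces of the second record, and the || test lets the unterminated third record through.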

I guess I was just a bit frustrated with the behaviour. Thanks again for your efforts, David the H.
 
2 members found this post helpful.
Old 04-18-2012, 09:43 AM   #13
catkin
LQ 5k Club
 
Registered: Dec 2008
Location: Tamil Nadu, India
Distribution: Debian
Posts: 8,578

Original Poster
Blog Entries: 31

Rep: Reputation: 1208
Quote:
Originally Posted by Nominal Animal
Fortunately, the workaround can apparently be written pretty concisely. For NUL-separated records:
Code:
DATA=""
while IFS="" read -rd "" DATA || [ -n "$DATA" ]; do
    # Do something with DATA
    DATA=""
done
Neat workaround for read's counter-intuitive behaviour Nominal Animal

A couple of points of style: DATA= is functionally identical to DATA="", and it may be preferable to use single quotes where double quotes are not necessary, as in the -d option.

Last edited by catkin; 04-18-2012 at 09:44 AM. Reason: Fix missing bolding
 
Old 04-19-2012, 09:40 AM   #14
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Agreed. It's an exquisite workaround. So simple.

You also stated more eloquently what I was trying to express with my suggestion of an option flag. We'd need to preserve historical compatibility while providing some ability to detect both states.

Thanks for the discussion and the hint.
 
  


All times are GMT -5.
