LinuxQuestions.org

danielbmartin 01-12-2012 09:20 PM

BASH comments
 
Sometimes code readability is improved by breaking a pipe into individual commands, one per line.

Contrived example:
Code:

cat $InFile  \
|rev          \
|cut -d, -f7- \
|rev          \
> $Work3

It would be even better to comment every line...
Code:

cat $InFile  \  # Read input file
|rev          \  # Reverse (flip each line end-for-end)
|cut -d, -f7- \  # Toss fields 1-6
|rev          \  # Reverse, again
> $Work3        # Write interim result to a work file

... but that doesn't work.

Is there a way to do this?

Daniel B. Martin

David the H. 01-12-2012 10:17 PM

No, I don't believe that's possible. The backslash is there to escape the non-printing newline character that normally terminates the line. Nothing else can follow it.

To the shell, the multiple lines appear to be a single line, so the comments can only come after the last one.
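
A minimal sketch of that point, not taken from the posts themselves: the backslash continues the line only when the newline comes immediately after it. In the commented version the backslash escapes a space instead, so the command ends at the newline and the next line starts with a bare pipe. The variable names (joined, broken, status) are illustrative assumptions.

```shell
#!/bin/sh
# Backslash directly before the newline: the two lines are read as one command.
joined=$(echo one \
two)
echo "$joined"

# Backslash followed by spaces and a comment: the backslash escapes a space,
# the command ends at the newline, and the next line begins with "|".
broken='echo hello \  # say hello
| cat'
if sh -c "$broken" >/dev/null 2>&1; then
    status=accepted
else
    status=rejected
fi
echo "$status"
```

The first echo prints the joined line; the second reports that the shell rejected the commented form.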

Telengard 01-12-2012 11:57 PM

If you put the pipe at the end of the line Bash will continue with the next line. No need for the backslash.

Code:

$ cat a
#! /bin/bash

echo "one two three" | #comment
cut -d ' ' -f 2
$ ./a
two
$

The comment following the pipe doesn't seem to interfere with the continuation.
I failed to find any support for this in the GNU Bash Reference Manual though, so caveat emptor.

Nominal Animal 01-13-2012 03:14 AM

Quote:

Originally Posted by Telengard (Post 4573333)
If you put the pipe at the end of the line Bash will continue with the next line. No need for the backslash.

The same applies to the && and || operators. It works in the dash and tcsh shells too, so I believe this behaviour is common to all shells.

Actually, it seems to be obliquely defined: all these operators require both sides to exist -- an empty command on either side makes absolutely no sense for these, and a comment or newline or empty line(s) do not produce any statement. The situation is complementary to 'do' or 'then' statements -- they need a preceding semicolon or a new line in Bash, Bourne shells and derivatives and POSIX shells -- so this behaviour is quite intuitive and useful.

I just wish it was explicitly documented somewhere.
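
A small sketch of the behaviour described above, under the assumption of a POSIX-style shell such as Bash or dash. As far as I can tell, the grammar in the POSIX Shell Command Language allows a line break after |, && and ||, which would make this more than an accident of the parser, but treat that reading as an assumption and check it against your own shell. The variable name result is illustrative.

```shell
#!/bin/sh
# A line that ends in |, && or || is incomplete, so the shell keeps reading.
# A comment after the operator is dropped before the next line is parsed.
result=$(
    printf 'one two three\n' |   # produce a line of text
    cut -d ' ' -f 2 &&           # keep the second word, and if that worked
    echo "ok" ||                 # report success, otherwise
    echo "failed"                # report failure
)
echo "$result"
```

The pipeline succeeds, so the && branch runs and the || branch is skipped.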

danielbmartin 01-13-2012 09:25 AM

Quote:

Originally Posted by Telengard (Post 4573333)
If you put the pipe at the end of the line Bash will continue with the next line. No need for the backslash.

I like your idea but was unable to apply it. Please see if you can modify this contrived code snippet and make it work.
Code:

cat $InFile  \  # Read input file
|rev          \  # Reverse (flip each line end-for-end)
|rev          \  # Reverse, again
> $Work3        # Write interim result to a work file

Daniel B. Martin

Telengard 01-13-2012 10:23 AM

Quote:

Originally Posted by danielbmartin (Post 4573647)
I like your idea but was unable to apply it. Please see if you can modify this contrived code snippet and make it work.
Code:

cat $InFile  \  # Read input file
|rev          \  # Reverse (flip each line end-for-end)
|rev          \  # Reverse, again
> $Work3        # Write interim result to a work file

Daniel B. Martin

What did you try? What happened when you tried it? What error message, if any?

Code:

test$ ls -gGh                                               
total 8.0K
-rwxr--r-- 1 243 2012-01-13 11:02 contrived
-rw-r--r-- 1  24 2012-01-13 10:45 file.in
test$ cat contrived
#! /bin/bash

InFile="file.in"
Work3="file.out"

{
    cat $InFile  |  # Read input file
    rev          |  # Reverse (flip each line end-for-end)
    rev            # Reverse, again
} > $Work3          # Write interim result to a work file
test$ cat file.in
one
two
three
four
five
test$ ./contrived
test$ ls -gGh
total 12K
-rwxr--r-- 1 243 2012-01-13 11:02 contrived
-rw-r--r-- 1  24 2012-01-13 10:45 file.in
-rw-r--r-- 1  24 2012-01-13 11:04 file.out
test$ cat file.out
one
two
three
four
five
test$

If this style of code is tricky to remember and write then it would be a mistake to adopt it.
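
For reference, here is a sketch of the same trailing-pipe style applied to the original four-stage example, with the cut step restored. The temporary files created with mktemp and the sample input line are stand-ins chosen for illustration, not anything from the thread.

```shell
#!/bin/sh
InFile=$(mktemp)    # stand-in for the input file in the thread
Work3=$(mktemp)     # stand-in for the work file in the thread
printf 'a,b,c,d,e,f,g,h,i\n' > "$InFile"

cat "$InFile" |     # Read input file
rev           |     # Reverse (flip each line end-for-end)
cut -d, -f7-  |     # Toss fields 1-6 of the reversed line
rev > "$Work3"      # Reverse again and write the interim result

trimmed=$(cat "$Work3")
echo "$trimmed"
rm -f "$InFile" "$Work3"
```

The redirection stays on the same line as the last command, so no backslash is needed anywhere.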

Telengard 01-13-2012 10:46 AM

Quote:

Originally Posted by Nominal Animal (Post 4573433)
I just wish it was explicitly documented somewhere.

The fact that I found no documentation of the behavior makes it difficult to recommend.

H_TeXMeX_H 01-13-2012 11:04 AM

I don't see the point of so many comments. Just make a comment on what the pipe accomplishes. For example:

Code:

# cut off last 6 fields
cat $InFile | rev | cut -d, -f7- | rev > $Work3


danielbmartin 01-13-2012 03:02 PM

Quote:

Originally Posted by H_TeXMeX_H (Post 4573718)
I don't see the point of so many comments. Just make a comment on what the pipe accomplishes. For example:

Code:

# cut off last 6 fields
cat $InFile | rev | cut -d, -f7- | rev > $Work3


Let me remind you that this is a contrived example. My real pipes may be much longer. As a matter of personal style I like to comment every line. You might have a different style, and I respect that.

When writing a complicated piece of code (regardless of language) I like to express in words the logic I want to implement. Then, one piece at a time, I fill in the code. When done, my code is fully commented because I started with all comments.

Daniel B. Martin

H_TeXMeX_H 01-13-2012 03:18 PM

Well, I do write a lot of scripts, and usually if a command becomes too complicated I split it up, I don't do it all at once.

This helps not only because you can comment each part, but because it is easier to understand.

Oftentimes there are different and more readable ways to do things that don't involve huge commands or long pipes.

Either way Telengard posted what you want.
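
One possible reading of "split it up", sketched here as an assumption rather than as H_TeXMeX_H's actual practice: move the pipe into a small named function, so the name and a single comment carry the explanation. The function name drop_last_six_fields and the sample input are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): keep everything except the last
# six comma-separated fields of each line read from standard input.
drop_last_six_fields() {
    rev | cut -d, -f7- | rev
}

# The call site then needs a single comment, or none at all.
shortened=$(printf 'a,b,c,d,e,f,g,h,i\n' | drop_last_six_fields)
echo "$shortened"
```

In a script like the original, the call would read something like drop_last_six_fields < "$InFile" > "$Work3".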

Nominal Animal 01-13-2012 11:27 PM

I myself like to use subshells when working with long pipes, i.e.
Code:

(
    # First part of the pipe ...
) | (
    # Second part of the pipe ...
) | (
    # and so on ...
)

For me, a very typical one is something like
Code:

( echo 'plot "datafile" u 1:2 t "data" w points, \'
  echo '    cos(x)*exp(x/2.5) t "expected" w lines'

  echo "press Enter to close the Gnuplot window" >&2
  while read LINE ; do
      [ -z "$LINE" ] && break
      echo "$LINE"
  done
) | gnuplot

which uses Gnuplot to plot some data, but stays interactive. An empty line will close Gnuplot and exit the compound command, but if I happen to think of an additional Gnuplot command -- say set logscale x; replot -- all I need is to type it and hit enter.

On embedded machines with a limited memory subsystem subshells may not be the best option, but for standard Intel/PowerPC et al. architectures, the subshell is forked from the parent using copy-on-write, using physically the same RAM for code and initial data structures, and therefore uses very little actual system resources. (Just a per-process kernel structure for each subshell, I believe.) This means that on a typical workstation or a server, there is no practical difference in resource use between plain pipe commands and piped subshells.

Telengard 01-14-2012 12:07 AM

Quote:

Originally Posted by Nominal Animal (Post 4574136)
On embedded machines with a limited memory subsystem subshells may not be the best option, but for standard Intel/PowerPC et al. architectures, the subshell is forked from the parent using copy-on-write, using physically the same RAM for code and initial data structures, and therefore uses very little actual system resources. (Just a per-process kernel structure for each subshell, I believe.) This means that on a typical workstation or a server, there is no practical difference in resource use between plain pipe commands and piped subshells.

Neat idea, and well reasoned, but I don't see the advantage over { command-list; }.

Nominal Animal 01-14-2012 02:41 AM

Quote:

Originally Posted by Telengard (Post 4574149)
Neat idea, and well reasoned, but I don't see the advantage over { command-list; }.

You're absolutely right.

I've just never really bothered to find out about the side effects when using command lists in a pipe (specifically, does shell state propagate or not, or if it is just inherited from the parent shell like subshells) -- and to be honest, I tend to always forget the required semicolon from the end of the command list. I've gravitated to using subshells, because I've felt them to be more intuitive.

To those that are unaware of the semicolon detail with command lists, the equivalent command list variant of piped subshells,
Code:

( echo foo ) | ( cat ; echo bar )
is
Code:

{ echo foo ; } | { cat ; echo bar ; }
Using command lists, the shell does not create unnecessary extra processes. Note the semicolons. If you try
Code:

{ echo foo } | { cat ; echo bar }
the shell does not recognize the braced expressions as command lists, and at least Bash 4.2.10 complains about a syntax error. It's very difficult to realize that the only problem is missing semicolons before closing braces. (Well, unless you remember that the command list syntax is, like Telengard stated above, {command(s)...;} and not just {command(s)...} .)

Telengard 01-14-2012 10:57 AM

Quote:

Originally Posted by Nominal Animal (Post 4574202)
I've gravitated to using subshells, because I've felt them to be more intuitive.

That makes sense as an advantage. Parenthesized command groups have simpler syntax and may be easier to type. Curly braces need spaces and the list must end with a control operator.
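
A short sketch of the brace syntax rules mentioned above. A newline also terminates the list, so when each group is spread over several lines the final semicolon can be dropped; the whitespace after the opening brace is still required. The variable name output is illustrative.

```shell
#!/bin/sh
# Each group is closed by a newline instead of a semicolon; the braces still
# need surrounding whitespace.
output=$(
    {
        echo foo        # left-hand side of the pipe
    } | {
        cat             # copy the output of the left-hand side
        echo bar        # then append a line
    }
)
echo "$output"
```

This prints foo and bar on separate lines, the same as the one-line form with semicolons.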

David the H. 01-14-2012 11:14 AM

I have no problem with command grouping. Command groups are basically just anonymous functions. And it only takes getting caught by the final semicolon thing a few times before you learn to watch out for it.

Nominal Animal 01-14-2012 11:51 AM

With subshells, each subshell inherits its state from the parent; changes never propagate back. With command lists, the state is shared with the parent but only if the command list is not part of a pipe.
Code:

x=5 ; { x=6 ;} ; echo $x
outputs 6, but
Code:

x=5 ; { x=6 ;} | { x=7 ;} ; echo $x
outputs 5.

I'm not sure if this is properly documented anywhere. I believe a future version of Bash might well output 7 in the latter case: running the last command list of a pipe in the original shell state might be a worthwhile optimization.

(Just to be clear: x=5;(x=6);echo $x will always output 5, as will x=5;(x=6)|(x=7);echo $x .)
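
The three cases can be put side by side in one script. This is only a sketch with illustrative variable names; the pipeline result is the shell-dependent one (5 in Bash with default settings, but a shell that runs the last pipeline element in the current environment could give 7).

```shell
#!/bin/sh
x=5 ; { x=6 ; }             ; after_group=$x     # group runs in the current shell
x=5 ; ( x=6 )               ; after_subshell=$x  # subshell works on a copy
x=5 ; { x=6 ; } | { x=7 ; } ; after_pipeline=$x  # 5 in Bash by default; shell dependent
echo "$after_group $after_subshell $after_pipeline"
```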

Telengard 01-14-2012 01:59 PM

Quote:

Originally Posted by Nominal Animal (Post 4574484)
Code:

x=5 ; { x=6 ;} | { x=7 ;} ; echo $x
outputs 5.

I'm not sure if this is properly documented anywhere.

It is in the authoritative documentation, but the behavior is sometimes unexpected. Frankly I don't quite get the rationale of creating separate subshells for both the LHS and RHS expressions.

Pipelines - Bash Reference Manual

Quote:

Originally Posted by Bashref
Each command in a pipeline is executed in its own subshell (see Command Execution Environment).

Command Execution Environment - Bash Reference Manual

Quote:

Originally Posted by Bashref
Command substitution, commands grouped with parentheses, and asynchronous commands are invoked in a subshell environment that is a duplicate of the shell environment, except that traps caught by the shell are reset to the values that the shell inherited from its parent at invocation. Builtin commands that are invoked as part of a pipeline are also executed in a subshell environment. Changes made to the subshell environment cannot affect the shell's execution environment.

Code:

$ unset a; a="parent_shell"
$ { declare -p a > /dev/stderr; a="pipe_LHS"; } |
> { declare -p a > /dev/stderr; a="pipe_RHS"; }
declare -- a="parent_shell"
declare -- a="parent_shell"
$ declare -p a
declare -- a="parent_shell"
$

  • The value of a is set to parent_shell in the parent shell.
  • A subshell environment is created on the LHS of the pipeline.
  • LHS of the pipeline inherits the value of a from the parent shell.
  • LHS of the pipeline sets a new value of a in its own environment.
  • LHS of the pipeline ends and its environment is destroyed.
  • A subshell environment is created on the RHS of the pipeline.
  • RHS of the pipeline inherits the value of a from the parent shell.
  • RHS of the pipeline sets a new value of a in its own environment.
  • RHS of the pipeline ends and its environment is destroyed.

As each command group is enclosed in {;} (curly braces), one might expect the entire command line to share the same value of a. Instead, three separate shell environments each contain their own unique a. It is the pipeline which creates the subshells and propagates a into them.

That's how I think it works, but as I said it seems tricky to me.
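
Another way to observe the same thing, sketched under the assumption that bash is installed: the Bash variable BASH_SUBSHELL reports the subshell nesting level, so it shows directly that a brace group inside a pipeline no longer runs at the parent's level. The labels and the variable name levels are illustrative.

```shell
#!/bin/sh
# BASH_SUBSHELL is a Bash variable that counts subshell nesting depth.
levels=$(bash -c '
    echo "parent: $BASH_SUBSHELL"
    ( echo "parentheses: $BASH_SUBSHELL" )
    { echo "braces in a pipeline: $BASH_SUBSHELL" ; } | cat
')
echo "$levels"
```

The parent reports 0, while both the parenthesized group and the brace group in the pipeline report 1.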

David the H. 01-15-2012 12:12 AM

Hmm, I was always under the impression that the part before the first pipe ran in the current environment. But it looks like I was mistaken.

In any case, this should also mean that any time you use a (..) subshell in a pipeline you end up spawning two sub-shells for it, correct?

Nominal Animal 01-15-2012 12:38 AM

Quote:

Originally Posted by Telengard (Post 4574558)
Frankly I don't quite get the rationale of creating separate subshells for both the LHS and RHS expressions.

Exactly. I kind of expect the Bash developers to eventually drop the subshell for the rightmost expression, because it is such an obvious optimization, and I believe some other shells do it already. Unfortunately, this means the final expression in a pipe sequence might, in the future, affect the parent shell state, if using command lists.

Quote:

Originally Posted by Telengard (Post 4574558)
It is the pipeline which creates the subshells

Yes, exactly.

I decided to do some tests, and the results are a bit startling.
Code:

strace -qf bash -c '  date    |  cat    |  cat  ' 2>&1 | grep -ce 'clone('
strace -qf bash -c '( date  ) | ( cat  ) | ( cat  )' 2>&1 | grep -ce 'clone('
strace -qf bash -c '{ date ;} | { cat ;} | { cat ;}' 2>&1 | grep -ce 'clone('

These output the number of child processes created by Bash. In the first two cases it is 3 as one would expect; both date and cat are external commands, not Bash built-ins. However, Bash creates 6 child processes for the command list case!

(I believe this is related to the way Bash creates the implicit subshells. Normally, if there is only one command to run in a subshell, Bash exec's it, avoiding the unnecessary fork()/clone().)

Timing tests,
Code:

time bash -c 'for ((i=0; i<1000; i++)); do  date    |  cat    |  cat    ; done' 2>&1 >/dev/null
time bash -c 'for ((i=0; i<1000; i++)); do ( date  ) | ( cat  ) | ( cat  ) ; done' 2>&1 >/dev/null
time bash -c 'for ((i=0; i<1000; i++)); do { date ;} | { cat ;} | { cat ;} ; done' 2>&1 >/dev/null

have similar results. On my workstation, plain commands and explicit subshells consistently produce the same real time results, 1.45s to 1.51s, while command lists are definitely slower, about 2.30s real time.

Using more complex pipelines there is no difference between subshells and command lists:
Code:

strace -qf bash -c '( date ; date  ) | ( date ; cat  ) | ( date ; cat  )' 2>&1 | grep -ce 'clone('
strace -qf bash -c '{ date ; date ;} | { date ; cat ;} | { date ; cat ;}' 2>&1 | grep -ce 'clone('
time bash -c 'for ((i=0; i<1000; i++)); do ( date ; date  ) | ( date ; cat  ) | ( date ; cat  ) ; done' 2>&1 >/dev/null
time bash -c 'for ((i=0; i<1000; i++)); do { date ; date ;} | { date ; cat ;} | { date ; cat ;} ; done' 2>&1 >/dev/null

Bash does create an extra process (subshell) for each pipe segment, forking a total of 9 child processes in both of the above cases. I could not measure any real difference in the timings, either.

These tests show that at least on my workstation, using explicit subshells in Bash pipelines is definitely a good idea. They use no extra resources compared to the alternatives, have no extra syntax requirements compared to normal shell syntax, and the semantics are clear.

Quote:

Originally Posted by Telengard (Post 4574558)
That's how I think it works, but as I said it seems tricky to me.

I have exactly the same understanding.

You know, up to now I have avoided using command lists in Bash. Where one might use a command list, I've used a Bash function (or, in a pipeline, a subshell) instead. Without your posts in this thread, Telengard, I would still be relying on a hazy personal preference, instead of actual knowledge. I for one have learned something new, something that I probably would not have found out on my own; thank you!

David the H. 01-15-2012 02:12 AM

Quote:

Originally Posted by Nominal Animal (Post 4574856)
Exactly. I kind of expect the Bash developers to eventually drop the subshell for the rightmost expression, because it is such an obvious optimization, and I believe some other shells do it already.

ksh runs the final expression in the current environment, and bash 4.2 has partially implemented the same behavior. The new lastpipe shell option enables it, but it only works when job control is disabled, so it's kind of inconvenient to use in interactive shells.
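
A minimal sketch of the lastpipe option, assuming Bash 4.2 or later is installed as bash. Running the test through bash -c keeps job control off, which is the condition mentioned above. The variable names without and with are illustrative.

```shell
#!/bin/sh
# Default behaviour: read runs in a subshell, so the parent x is untouched.
without=$(bash -c '
    x=5
    echo 7 | read x
    echo "$x"
')

# With lastpipe and job control off (as in bash -c or a script), the last
# pipeline element runs in the current shell, so read changes the parent x.
with=$(bash -c '
    shopt -s lastpipe
    x=5
    echo 7 | read x
    echo "$x"
')

echo "without lastpipe: $without"
echo "with lastpipe:    $with"
```

Without the option the parent keeps 5; with it, the value read from the pipe (7) survives.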

ta0kira 01-15-2012 02:47 AM

Quote:

Originally Posted by Telengard (Post 4574558)
Frankly I don't quite get the rationale of creating separate subshells for both the LHS and RHS expressions.

The shell needs to retain its file descriptors while the commands execute, so it's intuitive from an implementation perspective. The LHS has fd 1 replaced and the RHS has fd 0 replaced.
Quote:

Originally Posted by Nominal Animal (Post 4574856)
Exactly. I kind of expect the Bash developers to eventually drop the subshell for the rightmost expression, because it is such an obvious optimization, and I believe some other shells do it already. Unfortunately, this means the final expression in a pipe sequence might, in the future, affect the parent shell state, if using command lists.

I don't quite understand how this would be implemented, unless the shell temporarily copied fd 0, executed the pipeline, then copied it back. This would obviously compromise background processing and Ctrl+Z of a foreground pipeline. The other alternative would be to do what you suggest only when the last grouping consists only of built-ins, which would be horribly inconsistent. Take these two lines, for example:
Code:

while true; do echo $((val++)); sleep 1; done
while true; do echo $((val++)); sleep 1; done | while true; do head -n1; done

If you Ctrl+Z the first line it will suspend the sleep process (Ctrl+Z sends SIGTSTP), but when you fg it will no longer be in the loop. Because of the subshells in the second line the whole process group can be suspended and, if necessary, put in the background. Without subshells you couldn't do this because fg/bg is based on process groups. There are certainly advantages to be had in non-interactive mode (scripts) and when the shell isn't a session leader, but it can be a headache when something works on the command line and not in a script.

All of these things make a lot of sense if you look at how a shell is written in C, but the syntax of bash makes it appear as though this behavior is idiosyncratic. In my opinion, things like this irritate people because you don't need to understand the internal limitations of bash in order to use it. Unless bash starts using the "system" idiom to call external programs or it starts routing all IPC itself, it will never get away from extensive use of subshells.
Kevin Barry

Nominal Animal 01-15-2012 04:09 AM

Quote:

Originally Posted by ta0kira (Post 4574909)
I don't quite understand how this would be implemented, unless [..snip..]

Based on your nice analysis, it does seem like my fears about that kind of optimization screwing things up later on are unfounded.

Quote:

Originally Posted by ta0kira (Post 4574909)
Unless bash starts using the "system" idiom to call external programs or it starts routing all IPC itself, it will never get away from extensive use of subshells.

I seriously hope it does not!

We're getting terribly off-topic here, but the C system() function is a major source of security problems (related to quoting and escaping), and adding yet another "framework" for IPC would severely restrict the usability of Bash. I'm sorely tempted to rant about applying modularity instead of the framework paradigm, but that would be completely off-topic, and serve no purpose here really.

I thought my tests above showed that the cost of subshells in pipelines is negligible: zero for all single-command pipe segments, and only one process per pipe segment for multi-command ones. In particular,
Code:

date | # First command in the pipe,
cat  | # second command,
cat    # third command.

and
Code:

( # First command in the pipe
  date
) | (
  # Second command in the pipe
  cat
) | (
  # Third command in the pipe
  cat
)

use the same (minimum!) number of processes, CPU time, and wall clock time. The equivalent code snippet using command lists uses three extra child processes on Bash-4.2.10.

The comment style for the first code example does work in Bash (and many other shells like tcsh, too), but I have not found it explicitly documented as working anywhere. I believe it is implicit, perhaps a side effect of the way commands are parsed, rather than anything intentional.

The second code snippet, the one using subshells, is explicitly documented. (In particular, the semantics are exactly the same at least in Bash, POSIX shells, and tcsh: the state is inherited from the parent process, and changes do not propagate outside the subshell.) There are no extra syntax quirks, unlike command lists in Bash (which require the final semicolon and are whitespace sensitive).

Let me put this in other words:

I claim that using explicit subshells in Bash pipelines, i.e. (command(s)...)|(command(s)...)|...|(command(s)...) when comments or long commands are used, makes the code easier to write and to understand, and has no extra computing cost (run time or processes). Therefore, for complex Bash pipelines, I recommend the style used in my second code example in this post.

danielbmartin 01-15-2012 11:26 AM

This thread expanded into a more thorough exploration of the subject than anticipated.

Some languages (APL and REXX, for example) make it easy to comment in the desired fashion. Now I know it's not so easy in BASH. Okay, I can live with that. Thanks, and let's mark this one SOLVED!

Daniel B. Martin

Telengard 01-15-2012 11:44 AM

Quote:

Originally Posted by David the H. (Post 4574845)
Hmm, I was always under the impression that the part before the first pipe ran in the current environment. But it looks like I was mistaken.

As I was saying, tricky eh? To me, it would seem more natural if all components of a pipeline shared a single subshell environment.

Quote:

Originally Posted by Nominal Animal (Post 4574856)
These output the number of child processes created by Bash. In the first two cases it is 3 as one would expect; both date and cat are external commands, not Bash built-ins. However, Bash creates 6 child processes for the command list case!
...
On my workstation, plain commands and explicit subshells consistently produce the same real time results, 1.45s to 1.51s, while command lists are definitely slower, about 2.30s real time.
...
These tests show that at least on my workstation, using explicit subshells in Bash pipelines is definitely a good idea.

Code:

$ echo $BASH_VERSION
3.2.39(1)-release
$ strace -qf bash -c '  date    |  cat    |  cat  ' 2>&1 | grep -ce 'clone('
3
$ strace -qf bash -c '( date  ) | ( cat  ) | ( cat  )' 2>&1 | grep -ce 'clone('
3
$ strace -qf bash -c '{ date ;} | { cat ;} | { cat ;}' 2>&1 | grep -ce 'clone('
6
$ time bash -c 'for ((i=0; i<1000; i++)); do  date    |  cat    |  cat    ; done' 2>&1 >/dev/null

real    0m18.341s
user    0m5.344s
sys    0m4.772s
$ time bash -c 'for ((i=0; i<1000; i++)); do ( date  ) | ( cat  ) | ( cat  ) ; done' 2>&1 >/dev/null

real    0m16.456s
user    0m5.820s
sys    0m4.920s
$ time bash -c 'for ((i=0; i<1000; i++)); do { date ;} | { cat ;} | { cat ;} ; done' 2>&1 >/dev/null

real    0m20.988s
user    0m5.460s
sys    0m6.148s
$

Astounding! :eek: Your test suggests that (at least in pipelines) { list; } is slower. I'm at a loss to explain why it spawns twice as many child processes. :confused:

Quote:

Originally Posted by Nominal Animal (Post 4574938)
I claim that using explicit subshells in Bash pipelines, i.e. (command(s)...)|(command(s)...)|...|(command(s)...) when comments or long commands are used, makes the code easier to write and to understand, and has no extra computing cost (run time or processes). Therefore, for complex Bash pipelines, I recommend the style used in my second code example in this post.

Barring source code analysis and stringent benchmarks, I must concede that explicit subshells win on efficiency. Congrats, Nom. It would seem you've fully justified your practice. (Not that I doubted you, but I just can't explain it.) :hattip:

Quote:

We're getting terribly off-topic here
I believe danielbmartin already got what s?he wanted from this thread, so IMHO no harm in exploring these tangential topics.

Quote:

The comment style for the first code example does work in Bash (and many other shells like tcsh, too), but I have not found it explicitly documented as working anywhere.
I don't know if it is documented anywhere, but it seems to be accepted practice in more places than just the shell.

Code:

$ awk 'BEGIN {print "one", #comment
> "two"}'
one two
$

Quote:

Originally Posted by ta0kira (Post 4574909)
The LHS has fd 1 replaced and the RHS has fd 0 replaced.
...
All of these things make a lot of sense if you look at how a shell is written in C, but the syntax of bash makes it appear as though this behavior is idiosyncratic. In my opinion, things like this irritate people because you don't need to understand the internal limitations of bash in order to use it.

That's a fine explanation, but doesn't make the behavior more intuitive. Still, I'd rather not see Bash's default behavior stray too far from the traditional Bourne shell. While I do want a modern shell with standards, I see value in preserving compatibility with the past. If the day comes that Bash no longer meets my needs then I can choose a more advanced modern shell.

ta0kira 01-15-2012 12:36 PM

Quote:

Originally Posted by danielbmartin (Post 4575189)
Some languages (APL and REXX, for example) make it easy to comment in the desired fashion. Now I know it's not so easy in BASH.

I just started learning Python and I was appalled to find out that I couldn't have blank lines within control structures. Each language has its own style, I suppose...
Kevin Barry

danielbmartin 01-16-2012 07:14 AM

Quote:

Originally Posted by Telengard (Post 4575202)
I believe danielbmartin already got what s?he wanted from this thread...

Yes, my question was answered. I am a he, always have been, have no intention of changing that. :D

Quote:

Originally Posted by Telengard (Post 4575202)
...so IMHO no harm in exploring these tangential topics.

No harm at all, but please don't do so on my behalf.

Daniel B. Martin


All times are GMT -5.