Bash Scripting: Question About tr

tvynr · 06-23-2006, 09:13 PM

Hello, all. I am writing a bash script to simplify a common operation I am performing on my machine. The script executes multiple programs using a given set of CL parameters and produces a neat and tidy output. This is a fairly common exercise, I understand.

The subprocesses, however, dump large quantities of output to the standard output stream. In an effort to make the output more usable and readable in terms of a status report, I am reprocessing that output. All of the subprocesses use the technique of writing a carriage return instead of a newline and simply rewriting the last line. For example, the string "some garbage 0.00%\rsome more garbage 0.01%\rsome extra junk 0.02%" might be a common substring of the output of one of the subprocesses.

Since there is more than just the state of progress being written, I am piping the output through egrep -o to retrieve only the part I want (in the above example, the expression "[0-9.]+%" would be used to sift out the relevant portion). However, since the line breaks are carriage returns and not newlines, grepping the output doesn't work; grep keeps reading until it finds a newline, which doesn't appear until the subprocess has completed.

So, to address this problem, I tossed a tr between the subprocess and the grep to translate all carriage returns into newlines. This seems to work passably well but the granularity of the progress indicator is rougher than it was in the subprocess. I added --line-buffered to grep and that fixed some things, but it's still quite jerky.

My assumption is that tr is buffering more than I'd like and I only get output when its buffer fills up; this would explain the jerkiness, since the output will be processed in bursts in that case. So, the question is: is there any way I can control or affect the size of the buffer that tr uses?

Thanks for reading! I'm pretty sure I'm not going to get this to work without either rewriting tr or processing it character by character in the interpreted script (inefficiency... ick), but I thought I'd ask.

Cheers!

acid_kewpie · 06-25-2006, 02:52 PM

Whilst i've not been able to see anything concrete, tr clearly processes on a per line basis by default. It wants to obtain an entire field to execute on. I just had a play and as i'd have thought, it's the occurrence of whatever $IFS contains that defines when it processes its data. for example, if we have a named pipe called test, which we listen to and pipe through to tr:

Code:

tail -f test | tr -d x

and then run a little doodad to enter data into it:

Code:

for i in $(seq 1 10); do echo -n $i > test; sleep 1; done

this shows nothing from tr at all as no line feeds enter it. echoing a normal line to it including a carriage return as normal, and all the contents dump out. if you then run it again, but with a different IFS value:

Code:

IFS=5; for i in $(seq 1 10); do echo -n $i > test; sleep 1; done

then you see 1 to 4 appear at once, then 6 to 10 appear after you echo anything else to the pipe. so if you can find something preferable to use over a new line you should have a clearer buffer, but i'm half thinking i'm going off in a direction that means nothing to what you're really asking...

bigrigdriver · 06-25-2006, 05:34 PM

I don't know enough about shell scripting, and less about calls to C modules, but I wonder if it would be possible to make calls to C flush functions to flush the buffer after each newline or carriage return.

To find out which you have installed on your system, from a console, do 'apropos flush'. Then start reading. There might be something you can use to flush the buffer at each newline.

I may be way off base. If I am, I apologize for my illiteracy.

acid_kewpie · 06-26-2006, 01:22 AM

well it looks like each newline will flush the buffer. i'd suggest trying your own experiments with real output to check this.

tvynr · 06-26-2006, 06:09 AM

First, thanks to both of you for your replies. bigrigdriver: I'm afraid that I do not understand what you mean at all.

I suppose I should first find out what apropos is. Is this at all like KDE's dcop?

acid_kewpie: I tried the bash snippets you posted above and found them quite interesting. What precisely is IFS doing here? Do I take it that echo is using IFS to determine when to flush the buffer? Or is it something lower level than echo? Will I have to hope that the underlying processes will respect the contents of the IFS environment variable?

It seems possible that the problem will be resolved simply by adding $'\r' to the end of the IFS environment variable. I'm gonna go play with it and see if I can make it dance. :-D

Thanks again!

jschiwal · 06-26-2006, 06:36 AM

I tried a little experiment. First I wrote a oneliner that simulates a progress indicator that uses "\r" to reprint on the same line:
> for (( progress=0; progress<101; progress++ )); do sleep 2; echo -ne "progress: ${progress}\r"; done
progress: 3

Next, I changed the value of IFS so that the return character would be used to separate fields. This was piped to your "tr '\r' '\n'" filter. It would then print out on individual lines at each iteration of the loop.
> for (( progress=0; progress<101; progress++ )); do sleep 2; IFS=\005; echo -ne "progress: ${progress}\r" | tr '\r' '\n'; done
progress: 0
progress: 1
progress: 2
progress: 3

tvynr · 06-26-2006, 06:41 AM

I've corrected my analysis of the problem, actually. I ran the following snippet:

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 1; done | (tr $'\r' $'\n')

which behaved as I wanted; each entry was translated immediately and written to the standard output stream with a newline instead of the original carriage return. Then, I tried

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 1; done | (tr $'\r' $'\n' | egrep --line-buffered -o '[0-9]')

which dumped the entire ten lines all at once. I must conclude, therefore, that the problem is grep and not tr, as I had originally thought.

This confuses me. I am using --line-buffered, which I thought would fix any buffering issues created by grep. I'll keep digging.

jschiwal: I'm not sure I fully understand your test. What does it illustrate? What is character 5?

Thanks again for your help, all!

tvynr · 06-26-2006, 06:56 AM

I've performed a couple more tests:

Code:

n=0; while [ "$n" -lt "10000" ]; do echo -en "$n\r"; sleep 0.0001; n=$(($n+1)); done | (tr $'\r' $'\n' | egrep --line-buffered -o '[0-9]+')
n=0; while [ "$n" -lt "10000" ]; do echo -en "$n\r"; sleep 0.0001; n=$(($n+1)); done | (tr $'\r' $'\n' | egrep -o '[0-9]+')

The purpose of the above two one-liners is to attempt to determine the number of lines being buffered. As you can see, the only difference in the two commands is that the execution of egrep in one uses the "--line-buffered" flag whilst the other does not.

I executed each command several times. During each execution, I observed when the display changed and made a note of the bottom number. In both cases, the display updates came in bursts... the same bursts. Both commands had bulk output which ended at 1040, 1859, 2678, 3497, 4317, and so on. The numbers seem to be roughly eight hundred to one thousand lines apart but are consistent and repeatable. The presence of the "--line-buffered" flag did not seem to have any effect on this behavior.
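A possible reading of these numbers, offered as an assumption rather than something established in the thread: if tr hands its output on in fixed 4096-byte blocks, each burst should end at the last line that fits completely inside a block. The sketch below adds up the bytes of each line (its digits plus one newline) and reports the last complete line before every 4096-byte boundary. The function name burst_ends and the 4096 figure are both assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the writer flushes each time 4096 bytes have accumulated.
# burst_ends is a hypothetical helper, not part of any tool in this thread.
burst_ends() {
    local size=$1 limit=$2
    local total=0 boundary=$1 n
    for (( n = 0; n < limit; n++ )); do
        # Line n occupies its digits plus one newline.
        if (( total + ${#n} + 1 > boundary )); then
            # Line n straddles the block edge, so n-1 is the last complete line.
            echo $(( n - 1 ))
            boundary=$(( boundary + size ))
        fi
        total=$(( total + ${#n} + 1 ))
    done
}

burst_ends 4096 4400
```

Under those assumptions it prints 1040, 1859, 2678, 3497 and 4317, the same figures as the bursts observed above, which hints at a 4096-byte buffer somewhere in the chain.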

I hope I have described this test sufficiently. Did it make any sense? Does it seem like it's producing valuable data? It suggests to me that the line buffering flag on grep either does not behave the way I think it does or does not work at all.

I'd appreciate any and all suggestions. At the moment, it would seem that I might have to use a different line parser. I imagine sed and awk could both approach this task as well...
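For what it's worth, awk can take over both the tr and the egrep steps by treating the carriage return as its record separator. This is a minimal sketch, not something tested against the real subprocesses: it assumes an awk that supports fflush() (gawk and mawk do), and mawk may additionally need -W interactive so that it does not buffer its input. The name progress_only is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the percentage from each CR-terminated record.
progress_only() {
    awk 'BEGIN { RS = "\r" }   # records end at carriage returns, not newlines
         match($0, /[0-9.]+%/) {
             print substr($0, RSTART, RLENGTH)
             fflush()          # push each result out immediately
         }'
}

printf 'some garbage 0.00%%\rsome more garbage 0.01%%\r' | progress_only
```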

Just in case anyone is wondering, the egrep -o is being used to separate a match for a regexp pattern from the rest of the line it is in. For example, the line

Code:

Current progress:  0.57%   Estimated time remaining: way too long.  Blah blah.

would be filtered with the expression "[0-9.]+%" (as I mentioned in the OP) to produce "0.57%". I emphasize this since, in light of the above example, the presence of the call to egrep looks kind of useless.
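As a quick check of that extraction on its own, here is the example line run through the same expression. grep -E is used in place of egrep, which should behave identically; the variable name line is just for illustration.

```shell
#!/usr/bin/env bash
# The sample line from above, newline-terminated so grep sees it at once.
line='Current progress:  0.57%   Estimated time remaining: way too long.  Blah blah.'

# -o prints only the matching part; -E enables the + quantifier.
printf '%s\n' "$line" | grep -Eo '[0-9.]+%'
```

This prints 0.57% and nothing else, since the other periods in the line are not followed by a percent sign.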

Cheers!

jschiwal · 06-26-2006, 08:32 AM

Change the \005 to \r.
for (( progress=0; progress<101; progress++ )); do sleep 2; IFS='\r'; echo -ne "progress: ${progress}\r" | tr '\r' '\n'; done

Changing IFS allows tr to translate from a CR to a NL without having to wait for a final NL.

Keep in mind that the console is displaying the output of stderr rather than stdout. Or the message is sent to /dev/tty. This allows messages to be displayed while operating on stdin and outputting to stdout.

There is a gotcha in changing IFS. You need to change it back before it causes problems elsewhere. For example, try the oneliner. Then type "ls". Surprise! Now try "/bin/ls". That worked. The reason is that the first version is aliased to something like: alias ls='/bin/ls $LS_OPTIONS'. The space no longer separates command line arguments.
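One way to sidestep that gotcha is to set IFS only for the single command that needs it. In bash, an assignment placed in front of the read builtin applies to that one invocation and is gone afterwards. A small sketch, with the array name parts chosen only for illustration:

```shell
#!/usr/bin/env bash
# Prefix assignment: IFS is a carriage return for this one read only.
IFS=$'\r' read -r -a parts <<< $'10%\r20%\r30%'

echo "${#parts[@]}"     # number of fields found
printf '%q\n' "$IFS"    # the shell's own IFS is unchanged
```

Because the shell's IFS never changes, aliases such as the ls one above keep splitting their arguments on spaces as usual.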

I'm not certain why you want to be doing this. Is it because you have several programs each sending progress indicators to stderr and you want to combine them into a loggable form? There are programs that have an option to use a loggable output. Others have a quiet option.

Be careful handling stderr. You don't want to do something that will insert it into the data stream. ( That sounds like something from TRON! )

spirit receiver · 06-26-2006, 08:48 AM

I think the following loop should work as well:

Code:

(echo -ne "first line\r"; sleep 1; echo -ne "second line\r")| while read -d $'\r'; do echo "$REPLY"; done | grep -o "line"

jschiwal · 06-26-2006, 09:54 AM

Correction.

I looked at my original test line again, substituting a longer message. tr worked without having to change IFS. I guess I didn't read through message #7 carefully.
Grep would be another story however. Also, substituting one pattern for another is more up sed's alley. But sed is also line oriented.

I wrote a short program to output the indicator. Then I used a oneliner to shorten up the indicator. The output is almost funny. It runs in 50 to 100 spurts. So I think I have a better idea on what you are trying to do.
At first I thought you wanted to change it so that there was a line printed for each change.

jschiwal@hpamd64:~/Documents> cat progtest
#! /bin/bash
for (( progress=0; progress<10000; progress++ )); do
sleep 0.01
echo -ne "Current Progress ${progress} $(date -R)\r"
done
jschiwal@hpamd64:~/Documents> IFS='\r'; ./progtest | tr '\r' '\n' | sed -u 's/^\(Current Progress [0-9][0-9]*\).*/\1/' | tr '\n' '\r' ; echo
Current Progress 199

I think that the chunkiness is caused by buffering in the pipe.
echo "$(ulimit -p)*1024" | bc
8192

tvynr · 06-26-2006, 10:35 PM

jschiwal: Excellent deduction!

After you mentioned that and in light of the performance of

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 1; done | (tr $'\r' $'\n' | egrep --line-buffered -o '[0-9]')

(which was choppy) and

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 0.25; done | (tr $'\r' $'\n')

(which was good), I executed

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 0.25; done | (tr $'\r' $'\n' | cat)

which turned out to be just as choppy as the one with grep. It looks like the pipe is what's causing the trouble after all.

That leads me to a fascinating little question... how do I change the pipe? Do I have to create the pipe myself using mkfifo and specify some special parameters? Is there any way I can change how much the pipe is buffering?
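On the question of changing the buffering: a likely explanation (an assumption here, though it matches how C stdio usually behaves) is that the buffer belongs to the writing program rather than to the pipe. A program using stdio typically line-buffers its output when stdout is a terminal and block-buffers it when stdout is a pipe, which would fit tr being smooth on its own and choppy once something is piped after it. Where the stdbuf utility from GNU coreutils is installed, it can ask the program for line buffering instead. A sketch, with cr_to_nl as a made-up wrapper name:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: translate CR to NL, line-buffered where possible.
cr_to_nl() {
    if command -v stdbuf >/dev/null 2>&1; then
        stdbuf -oL tr '\r' '\n'   # -oL: request line-buffered stdout
    else
        tr '\r' '\n'              # fallback: same output, but may arrive in bursts
    fi
}

for n in $(seq 0 3); do printf '%s\r' "$n"; sleep 0.25; done | cr_to_nl | cat
```

stdbuf only helps with programs that rely on the default stdio buffering, so this is worth testing against the real subprocesses before relying on it.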

To answer your question about my rationale: I have multiple programs all of which display their progress indication in a different way. All of them display quite a lot of header information when they are first executed (at least twenty lines) and none of them allow me to suppress that behavior without suppressing progress indication as well. Finally, one of them seems to be writing newlines to standard error as it writes its output, causing the stderr of the progress indicator to contain several thousand newlines by the time the program is finished running, spreading out its display quite a bit. All of this together is rather inconvenient. My intention is to gather the output into a form which is more reportable to a user viewing the execution of my script.

Of course, I'm being a nice scriptwriter and allowing a command line parameter to suppress the output processing behavior if that's necessary. I do this especially in light of the fact that I am reprocessing the subprocesses' standard error streams. I realize this is problematic if something goes wrong with a subprocess; eventually, I hope to both be able to provide intelligent reporting based upon the subprocesses' exit codes as well as direct all of this through tee to produce a copy of each subprocess's output and error streams (again as directed by the command line parameters). However, for most executions of the script, this should not be necessary.

Additionally, it's helping me develop some bash skills I don't usually have the need to expand.

Thanks muchly for your help.

Cheers!

archtoad6 · 06-27-2006, 05:56 AM

Would read w/ the "-d" option help?

See either [c|k]onsole:

Code:

help read

or search the bash man page for "[-t timeout]".

spirit receiver · 06-27-2006, 06:36 AM

Quote:

Originally Posted by tvynr

It looks like the pipe is what's causing the trouble after all.

But it's not the pipe alone, have a look at

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 0.25; done | (cat | cat)

As for that "read" command: I guess it would help, see my example above

tvynr · 06-27-2006, 11:50 AM

That's quite an interesting snippet... I followed up with

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 0.25; done | (cat | egrep -o --line-buffered '[0-9]+')

and

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 0.25; done | egrep -o --line-buffered '[0-9]+'

which both perform in the jerky fashion.

Upon reading archtoad6's message and rereading spirit_receiver's earlier post containing the read example, I now understand its intention: replace tr with the while read loop, yes? So I tried the snippet

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 0.25; done | (while read -d $'\r' line; do echo "$line"; done) | egrep -o --line-buffered '[0-9]+'

which worked quite nicely. Of course, when I try

Code:

for n in $(seq 0 9); do echo -en "$n\r"; sleep 0.25; done | (tr $'\r' $'\n') | egrep -o --line-buffered '[0-9]+'

I get the unpleasant behavior again.

I have inserted the replacement for tr into my script and everything runs most pleasantly. :-D

In summary, I guess the solution is to replace

Code:

tr "$a" "$b"

with

Code:

while read -d "$a" line; do echo -n "$line$b"; done

whenever this problem crops up (where $a is the character to replace and $b is the character with which to replace it). In my case, $b happens to be $'\n', so I can simplify the echo command.
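A slightly more defensive variant of that loop, as a sketch only: IFS= and -r keep read from trimming whitespace or eating backslashes, printf avoids echo's option handling, and a final check emits any trailing text that has no closing delimiter. The function name translate_stream is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace delimiter $1 with $2, writing each piece at once.
translate_stream() {
    local from=$1 to=$2 chunk
    while IFS= read -r -d "$from" chunk; do
        printf '%s%s' "$chunk" "$to"
    done
    # read fails at end of input but still leaves an unterminated tail in chunk.
    if [ -n "$chunk" ]; then
        printf '%s' "$chunk"
    fi
}

for n in $(seq 0 3); do printf '%s\r' "$n"; sleep 0.25; done | translate_stream $'\r' $'\n'
```

The loop reads its input a byte at a time, so it is slower than tr on large streams, but for a progress display that cost is unlikely to matter.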

Many thanks to all of you for your explanations and assistance in tracking this down. I'm still quite perplexed by the buffering that the pipe creates, especially considering that it only seems to happen some of the time. However, the display on my script is much smoother and I'm quite satisfied.

Again, thanks for all your help!

Cheers!
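On the lingering puzzle of why the buffering only shows up some of the time: one plausible cause, stated as an assumption since nobody in the thread confirmed it, is that programs built on C stdio choose their buffering by checking whether stdout is a terminal. The same command can therefore be smooth at the end of a pipeline and choppy in the middle of one. The shell can make the same check with test -t; stdout_kind is a made-up name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report what file descriptor 1 is attached to.
stdout_kind() {
    if [ -t 1 ]; then
        echo "terminal"
    else
        echo "not a terminal"
    fi
}

stdout_kind          # typically "terminal" when run from an interactive shell
stdout_kind | cat    # stdout is a pipe here, so "not a terminal"
```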