Another awk question: un-word-wrap lines

David the H. · 12-06-2007, 11:05 AM

My last text-editing query here gave me some very useful answers, so here's another one.

The text file consists of paragraphs separated by blank lines. Each paragraph has been word-wrapped to about 80 columns. Something like this:

<blank line>
80 column line of text
80 column line of text
80 column line of text
<blank line>
80 column line of text
80 column line of text
<blank line>

Now I want to re-wrap the text to a different value. Note that if I simply run a wrap command on the file as-is, I get a lot of lines broken up in odd places, so I have to remove the current wrapping first. I imagine I need to go through an intermediate format like this:

<blank line>
One long, unwrapped line
<blank line>
One long, unwrapped line
<blank line>

Then I want to re-wrap the text to the new value (45 columns, in this case). Do note that this is *word wrapping* I want, so the actual break points must be at word boundaries.

Of course, if there's a better way to go about it, I'd like to know.

I don't doubt awk can do this, I just need the syntax. Any help here? And since I like to know exactly what the scripts are doing, I'd appreciate it if you could also walk me through the commands.

Thanks in advance.

ilikejam · 12-06-2007, 02:34 PM

Hi.

This should do it:

Code:

awk 'BEGIN {RS="\n\n"; FS="\n"} {for (i=1;i<=NF;i++) printf $i; printf "\n\n"}' /path/to/file

BEGIN {RS="\n\n"; FS="\n"}
Defines the record separator to be two newlines (this is one newline by default, but we want to grab records that are 'anything between two newlines'), and defines the field separator to be a single newline (this is usually any run of one or more whitespace characters, but we want a field to be a whole line).

for (i=1;i<=NF;i++) printf $i
Runs through each field (a line in our case) in each record (the bits between the blank lines), and prints the field with no newline after it, joining the split lines back up.

printf "\n\n"
Prints a double newline to get your blank line after each unwrapped paragraph in the output.
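
For what it's worth, here is a minimal sketch of the same idea, assuming awk's standard paragraph mode (RS="" treats runs of blank lines as the record separator, which should also work in awks that don't accept a multi-character RS). The function name unwrap and the sample text are made up for illustration. It joins the lines with single spaces and prints each one through an explicit "%s" format.

```shell
# unwrap: join each blank-line-separated paragraph into one long line.
# (Hypothetical helper name; RS="" is awk's paragraph mode.)
unwrap() {
    awk 'BEGIN { RS = ""; FS = "\n" }
         {
             # Join the lines of this paragraph with single spaces.
             for (i = 1; i <= NF; i++)
                 printf "%s%s", $i, (i < NF ? " " : "\n")
             # Blank line after each paragraph, as in the original file.
             print ""
         }' "$@"
}

# Two short paragraphs become two long lines separated by a blank line.
printf 'one\ntwo\n\nthree\nfour\n' | unwrap
```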

Dave

matthewg42 · 12-06-2007, 03:55 PM

A slightly shorter version in Perl (somewhat less readable though):

Code:

perl -ane 'chop;print "\n\n" if(/^\s*$/); map{print "$_ ";}@F;' /path/to/file

David the H. · 12-07-2007, 12:49 PM

Thanks. ilikejam, that mostly did the job. There was only one problem. The last word of one line was getting concatenated with the first word of the next one with no space in between. But I solved that by modifying 'printf $i' to 'printf $i" "', which appears to insert a blank space at the end of each field as it's printed out. At least the document comes out the way I want it to.

Now, could you kindly let me know the best way to re-wrap it at a different column value?

@matthewg42. Thanks for your code too. But I don't think I'm ready to tackle perl yet. I'm currently having enough trouble just getting my mind around awk, sed, and regex.

I'll be saving it for future reference though.

matthewg42 · 12-07-2007, 01:16 PM

Perl is awesome. awk is a good tool, and it is what got me into Perl in a way.

I once wrote some nice reports using Awk. However, they were long-running, and while I was searching for ways to speed them up I found a2p, a program which automatically turns simple Awk code into Perl code.

I ran this on the report scripts, and the result ran in half the time! Perl is utterly amazing, and not nearly as hard as some people make out. Don't let those $ characters put you off; they make life easier by telling you what type a variable is.

ilikejam · 12-07-2007, 04:13 PM

Quote:

Originally Posted by David the H.

Now, could you kindly let me know the best way to re-wrap it at a different column value?

I can, but not in awk (I'm sure it's possible in awk, but I can't think of a way that doesn't involve nested loops and other such irritations). In perl, though:

Code:

perl -pe 's/(.{0,45})\s/$1\n/g'

Change 45 to be whatever the column wrap should be.

It matches the longest series of single characters (the .) between 0 and 45 characters long (the {0,45} bit) that ends with a whitespace character (the \s), and then sticks a newline onto the end of the first ($1) regex group match (a regex group being a regex part surrounded by () ).

Basically it finds the longest string that's less than 46 chars long and has a whitespace character at the end, then replaces that string with itself with an appended newline.

I love the way perl scripts are incredibly concise, but explaining them requires ridiculous verbosity.

Dave

David the H. · 12-09-2007, 05:18 AM

Well, that's easier than I thought it would be. Just a simple regex. I've studied enough to know the basics of that. It even works in sed without much modification.

So what would make it more difficult in awk? Couldn't you do something like in the unwrapping script where you define the field as a single line, then simply apply the regex substitution to that line? Just where are "nested loops and other irritations" needed?

Sorry for all the questions. I hope some day soon to be able to sit down and really learn what I'm doing (including perl). But right now I'm just trying to get some jobs done, and learn a bit as I go.

ilikejam · 12-09-2007, 10:15 AM

awk can't do backreferences, so it's difficult to replace something with a modified version of itself. Well, gawk can, but why bother when perl does it so nicely.
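
One qualification, sketched below: standard awk has no numbered groups like \1, but in sub() and gsub() an & in the replacement stands for the whole matched text, so the simple "replace it with a modified copy of itself" cases still work. The sample input is made up.

```shell
# Wrap every run of digits in angle brackets; & is the matched text.
echo 'a 12 b 345' | awk '{ gsub(/[0-9]+/, "<&>"); print }'
# prints: a <12> b <345>
```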

Dave

colucix · 12-09-2007, 12:26 PM

I totally agree: the perl solution is one of the shortest and most elegant. Anyway, it's not so painful in awk, e.g.

Code:

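BEGIN { maxlen = 45 ; stringa = "" }
{
  if ( NF ) {
     # Non-blank line: append its words to the pending output line
     for ( i = 1; i <= NF ; i++ ) {
         if ( stringa == "" )
            stringa = $i
         else if ( length(stringa) + length($i) + 1 <= maxlen )
            stringa = ( stringa " " $i )
         else {
            # Next word does not fit: flush the line and start a new one
            print stringa
            stringa = $i
         }
     }
  }
  else {
     # Blank line: flush the pending text (if any), then keep the blank line
     if ( stringa != "" )
        print stringa
     stringa = ""
     print
  }
}
END { if ( length(stringa) > 0 ) print stringa }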

This takes empty lines into account and wraps the text to maxlen characters per line. For example:

Code:

$ cat testfile
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

$ gawk -f wrap.awk testfile
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

David the H. · 12-11-2007, 10:19 AM

I see. Thanks for the explanation. So what you're saying is that the problem is with awk and its lack of backreferencing. Ok. But since awk on many Linux distributions is really just a symlink to gawk, the average user such as me shouldn't have to worry. I just have to be aware that it's not backwards-portable.

One reason I might want to do it in (g)awk instead of perl though is that I may want to combine all of these functions into a single script. In fact, I've been thinking of doing just that, creating a single script that would do all the reformatting I want in one run. But then again, I suppose a single bash script that runs multiple tools would do also.

ilikejam · 12-11-2007, 02:05 PM

Cool. Whatever Works For You.

I'm no master of perl, but if I wanted a one-language script for this sort of thing, I'd be more likely to do it in perl than in awk.

On the other hand, I think using awk to join the lines then perl to split them again is probably the most readable combination, so that's probably what I'd do. There's a lot to be said for maintainable code.
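
If a single-tool version is wanted anyway, the two steps can be folded into one awk pass. This is only a sketch under some assumptions: paragraphs are separated by blank lines, words are split on whitespace, and a word longer than the width is left on a line of its own. The function name rewrap and the variable width are made up for illustration.

```shell
# rewrap WIDTH [FILE...]: unwrap and re-wrap in one awk pass.
# (Hypothetical helper; the names are not from any existing tool.)
rewrap() {
    width=$1; shift
    awk -v width="$width" '
        BEGIN { RS = "" }            # paragraph mode: one record per paragraph
        {
            line = $1                # fields are words (default FS)
            for (i = 2; i <= NF; i++) {
                if (length(line) + 1 + length($i) <= width)
                    line = line " " $i      # word still fits on this line
                else {
                    print line              # line is full, start a new one
                    line = $i
                }
            }
            print line
            print ""                 # blank line between paragraphs
        }' "$@"
}

# Two paragraphs wrapped at 80-ish columns, re-wrapped here at 11.
printf 'aaa bbb\nccc ddd\n\neee fff ggg\n' | rewrap 11
```

For the file in this thread that would be something like rewrap 45 input.txt > output.txt.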

Dave

David the H. · 12-13-2007, 01:49 PM

Ok, I'm getting some odd behavior here. I'm trying to unwordwrap a new file, and it's choking on some percentage signs. I'm now running it from a script file, and when I ran it on the new text, I'm getting this error:

Code:

year-olds in many world regions, the US ranked dead last in algebra. On identical tests, the US kids averaged 43% and their Japanese awk: ../awk_scripts/unwordwrap.awk:13: (FILENAME=- FNR=1929) fatal: not enough arguments to satisfy format string
        ` counterparts 78%. In my book, 78% is pretty good - it corresponds to a C+, '
                                            ^ ran out for this one

If I edit the original file to remove the offending % sign, it just moves on and chokes on the next one. The strangest thing though, is that it handles several % signs perfectly before finally choking on about the 5th one in the file, which occurs on line 12356 (it's a 400+ page document). There doesn't seem to be anything special about the lines other than the percents in them, and I've changed nothing in the script except the extra space as I mentioned above. What could be causing this?

Perhaps I'll have to go with the perl command after all.

ilikejam · 12-13-2007, 01:55 PM

Funky.

Could you post the contents of the script, and give us the exact syntax used to execute it?

Dave

David the H. · 12-13-2007, 04:37 PM

As I said, the script is pretty much unaltered from what I was given above. I just pasted the commands into a text file, added the gawk initialization line, and ran it with awk -f unwordwrap.awk < input.txt

Here's the file (except for some comment lines, removed for clarity).

Code:

#!/usr/bin/gawk -f

BEGIN {RS="\n\n"; FS="\n"}
{for (i=1;i<=NF;i++) printf $i" ";  printf "\n\n" }

In the end I ran the perl command on the file instead and it worked perfectly, but I am curious as to what's going on.

colucix · 12-13-2007, 05:11 PM

Code:

BEGIN {RS="\n\n"; FS="\n"}
{for (i=1;i<=NF;i++) printf $i" ";  printf "\n\n" }

The correct syntax for the printf statements is

printf format, item1, item2, ...

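In the code above, the first printf lacks a format specification, or rather, it interprets the whole string $i" " as the format. This leads to problems when the string contains the % symbol, which introduces a format specifier. In particular, the problem arises when a % happens to be followed by characters that form a valid conversion: in your error message the caret points at "78% is", where "% i" is read as an integer conversion with no argument to fill it. A % followed by something that is not a valid conversion appears to be printed as-is, which would explain why several of them got through. You can avoid the problem by specifying both the format and the item to print, as in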

Code:

BEGIN {RS="\n\n"; FS="\n"}
{for (i=1;i<=NF;i++) printf "%s ", $i ; printf "\n\n" }

The first printf now prints the item $i as a string (%s) followed by a blank space.
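
A quick way to convince yourself, as a sketch with a made-up sample line: feed awk some text containing "% i" and print it through an explicit "%s" format. The text is then treated purely as data, so the % signs come out untouched.

```shell
# The sample line contains "% i", which printf would read as a conversion
# if the text itself were used as the format string. With an explicit
# "%s" format the text is only ever an argument, never a format.
echo '78% is pretty good' | awk '{ printf "%s\n", $0 }'
# prints: 78% is pretty good
```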

Just out of curiosity, have you tried the code I suggested in post #9? It skips the unwrapping pass and wraps directly to the desired length. I haven't tested it much, though, and I wonder whether it works on a text as long as yours. Cheers!