LinuxQuestions.org
Old 12-06-2007, 11:05 AM   #1
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Another awk question: un-word-wrap lines


My last text-editing query here gave me some very useful answers, so here's another one.

The text file consists of paragraphs separated by blank lines. Each paragraph has been word-wrapped to about 80 columns. Something like this:

<blank line>
80 column line of text
80 column line of text
80 column line of text
<blank line>
80 column line of text
80 column line of text
<blank line>

Now I want to re-wrap the text to a different value. Note that if I simply run a wrap command on the file as-is, I get a lot of lines broken up in odd places, so I have to remove the current wrapping first. I imagine I need to go through an intermediate format like this:

<blank line>
One long, unwrapped line
<blank line>
One long, unwrapped line
<blank line>

Then I want to re-wrap the text to the new value (45 columns, in this case). Do note that this is *word wrapping* I want, so the actual break points must be at word boundaries.

Of course, if there's a better way to go about it, I'd like to know.

I don't doubt awk can do this, I just need the syntax. Any help here? And since I like to know exactly what the scripts are doing, I'd appreciate it if you could also walk me through the commands.

Thanks in advance.
 
Old 12-06-2007, 02:34 PM   #2
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Hi.

This should do it:
Code:
awk 'BEGIN {RS="\n\n"; FS="\n"} {for (i=1;i<=NF;i++) printf $i; printf "\n\n"}' /path/to/file
BEGIN {RS="\n\n"; FS="\n"}
Defines the record separator to be two newlines (this is one newline by default, but we want to grab records that are 'anything between two newlines'), and defines the field separator to be a single newline (this is usually any run of one or more whitespace characters, but we want a field to be a whole line).

for (i=1;i<=NF;i++) printf $i
Runs through each field (a line in our case) in each record (the bits between the blank lines), and prints the field with no newline after it, joining the split lines back up.

printf "\n\n"
Prints a double newline to get your blank line after each line in the output.
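A variation on the same idea, offered as a sketch rather than a replacement: POSIX leaves a multi-character RS such as "\n\n" unspecified (gawk and mawk accept it), whereas RS="" is awk's standard "paragraph mode", in which records are separated by one or more blank lines. The helper name unwrap is made up for this example, and the explicit "%s" format with a space between joined lines is an assumption about the desired output.

```shell
# Sketch: join each paragraph into one long line (hypothetical helper name).
# RS="" selects awk's paragraph mode: records are separated by blank lines.
unwrap() {
  awk 'BEGIN { RS = ""; FS = "\n" }
       {
         # Print every line of the paragraph; a space goes between lines,
         # and a blank line goes after the last one.
         for (i = 1; i <= NF; i++)
           printf "%s%s", $i, (i < NF ? " " : "\n\n")
       }' "$@"
}

# Example: two wrapped paragraphs on stdin.
printf 'first line\nsecond line\n\nnext paragraph\ncontinues here\n' | unwrap
```

Because the text is passed as an argument to "%s" instead of being used as the format itself, characters such as % in the input are printed as-is.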

Dave

Last edited by ilikejam; 12-06-2007 at 02:43 PM.
 
Old 12-06-2007, 03:55 PM   #3
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
A slightly shorter version in Perl (somewhat less readable though):
Code:
perl -ane 'chop;print "\n\n" if(/^\s*$/); map{print "$_ ";}@F;' /path/to/file
 
Old 12-07-2007, 12:49 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Original Poster
Rep: Reputation: 2037
Thanks. ilikejam, that mostly did the job. There was only one problem. The last word of one line was getting concatenated with the first word of the next one with no space in between. But I solved that by modifying 'printf $i' to 'printf $i" "', which appears to insert a blank space at the end of each field as it's printed out. At least the document comes out the way I want it to.

Now, could you kindly let me know the best way to re-wrap it at a different column value?

@matthewg42. Thanks for your code too. But I don't think I'm ready to tackle perl yet. I'm currently having enough trouble just getting my mind around awk, sed, and regex. I'll be saving it for future reference though.
 
Old 12-07-2007, 01:16 PM   #5
matthewg42
Senior Member
 
Registered: Oct 2003
Location: UK
Distribution: Kubuntu 12.10 (using awesome wm though)
Posts: 3,530

Rep: Reputation: 65
Perl is awesome. awk is a good tool, and it is what got me into Perl in a way.

I once wrote some nice reports using Awk. However they were long-running, and while I was searching for ways to speed them up I found a2p - a program which automatically turns simple Awk code into Perl code.

I ran this on the report scripts, and the result ran in half the time! Perl is utterly amazing, and not nearly as hard as some people make out. Don't let those $ characters put you off, they make life easier by telling you what type a variable is.
 
Old 12-07-2007, 04:13 PM   #6
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Quote:
Originally Posted by David the H.
Now, could you kindly let me know the best way to re-wrap it at a different column value?
I can, but not in awk (I'm sure it's possible in awk, but I can't think of a way that doesn't involve nested loops and other such irritations). In perl, though:
Code:
perl -pe 's/(.{0,45})\s/$1\n/g'
Change 45 to be whatever the column wrap should be.

It matches the longest series of single characters (the .) between 0 and 45 characters long (the {0,45} bit) that ends with a whitespace character (the \s), and then sticks a newline onto the end of the first ($1) regex group match (a regex group being a regex part surrounded by () ).

Basically it finds the longest string that's less than 46 chars long and has a whitespace character at the end, then replaces that string with itself with an appended newline.

I love the way perl scripts are incredibly concise, but explaining them requires ridiculous verbosity.

Dave

Last edited by ilikejam; 12-07-2007 at 04:20 PM.
 
Old 12-09-2007, 05:18 AM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Original Poster
Rep: Reputation: 2037
Well, that's easier than I thought it would be. Just a simple regex. I've studied enough to know the basics of that. It even works in sed without much modification.

So what would make it more difficult in awk? Couldn't you do something like in the unwrapping script where you define the field as a single line, then simply apply the regex substitution to that line? Just where are "nested loops and other irritations" needed?

Sorry for all the questions. I hope some day soon to be able to sit down and really learn what I'm doing (including perl). But right now I'm just trying to get some jobs done, and learn a bit as I go.
 
Old 12-09-2007, 10:15 AM   #8
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
awk can't do backreferences, so it's difficult to replace something with a modified version of itself. Well, gawk can, but why bother when perl does it so nicely.
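To make that distinction concrete, here is a small sketch. In standard awk, sub() and gsub() can refer to the whole match with & in the replacement text, but not to individual groups; gawk adds gensub(), which does accept \1-style references. The helper name bracket_words is made up for this example, and the gawk line is left as a comment because it needs gawk specifically.

```shell
# Sketch: "&" in the replacement text of sub()/gsub() stands for the matched text.
# "bracket_words" is a hypothetical helper name used only for this example.
bracket_words() {
  awk '{ gsub(/[a-z]+/, "<&>"); print }' "$@"
}

echo 'foo bar' | bracket_words

# gawk only (not run here): gensub() accepts \1-style group references, e.g.
#   gawk '{ print gensub(/(.+) (.+)/, "\\2 \\1", 1) }'
```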

Dave
 
Old 12-09-2007, 12:26 PM   #9
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
I totally agree: the perl solution is one of the shortest and most elegant. Still, it's not so painful in awk, e.g.
Code:
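# wrap.awk: re-wrap paragraphs (separated by blank lines) to maxlen columns
BEGIN { maxlen = 45 ; stringa = "" }
{ if ( NF ) {
     # non-blank line: append its words to the pending output line
     for ( i = 1; i <= NF ; i++ ) {
         if ( stringa == "" )
            stringa = $i
         else if ( length(stringa) + length($i) + 1 <= maxlen )
            stringa = ( stringa " " $i )
         else {
            # the word does not fit: flush the pending line, start a new one
            print stringa
            stringa = $i
         }
     }
  }
  else {
     # blank line: flush the last line of the paragraph (if any),
     # then print the blank line itself as the separator
     if ( stringa != "" ) print stringa
     stringa = ""
     print
  }
}
END { if ( length(stringa) > 0 ) print stringa }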
This takes empty lines into account and wraps the text to at most maxlen characters per line. For example:
Code:
$ cat testfile
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

$ gawk -f wrap.awk testfile
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.

All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
 
Old 12-11-2007, 10:19 AM   #10
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Original Poster
Rep: Reputation: 2037
I see. Thanks for the explanation. So what you're saying is that the problem is with awk and its lack of backreferencing. Ok. But since awk on many Linux distributions is really just a symlink to gawk, the average user such as me shouldn't have to worry. I just have to be aware that it's not backwards-portable.

One reason I might want to do it in (g)awk instead of perl though is that I may want to combine all of these functions into a single script. In fact, I've been thinking of doing just that, creating a single script that would do all the reformatting I want in one run. But then again, I suppose a single bash script that runs multiple tools would do also.
 
Old 12-11-2007, 02:05 PM   #11
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Cool. Whatever Works For You.

I'm no master of perl, but if I wanted a one-language script for this sort of thing, I'd be more likely to use perl than awk.

On the other hand, I think using awk to join the lines then perl to split them again is probably the most readable combination, so that's probably what I'd do. There's a lot to be said for maintainable code.

Dave
 
Old 12-13-2007, 01:49 PM   #12
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Original Poster
Rep: Reputation: 2037
Ok, I'm getting some odd behavior here. I'm trying to unwordwrap a new file, and it's choking on some percent signs. I'm now running the command from a script file, and when I run it on the new text, I get this error:

Code:
year-olds in many world regions, the US ranked dead last in algebra. On identical tests, the US kids averaged 43% and their Japanese awk: ../awk_scripts/unwordwrap.awk:13: (FILENAME=- FNR=1929) fatal: not enough arguments to satisfy format string
        ` counterparts 78%. In my book, 78% is pretty good - it corresponds to a C+, '
                                            ^ ran out for this one
If I edit the original file to remove the offending % sign, it just moves on and chokes on the next one. The strangest thing though, is that it handles several % signs perfectly before finally choking on about the 5th one in the file, which occurs on line 12356 (it's a 400+ page document). There doesn't seem to be anything special about the lines other than the percents in them, and I've changed nothing in the script except the extra space as I mentioned above. What could be causing this?

Perhaps I'll have to go with the perl command after all.
 
Old 12-13-2007, 01:55 PM   #13
ilikejam
Senior Member
 
Registered: Aug 2003
Location: Glasgow
Distribution: Fedora / Solaris
Posts: 3,109

Rep: Reputation: 97
Funky.

Could you post the contents of the script, and give us the exact syntax used to execute it?

Dave
 
Old 12-13-2007, 04:37 PM   #14
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Original Poster
Rep: Reputation: 2037
As I said, the script is pretty much unaltered from what I was given above. I just pasted the commands into a text file, added the gawk initialization line, and ran it with awk -f unwordwrap.awk < input.txt

Here's the file (except for some comment lines, removed for clarity).
Code:
#!/usr/bin/gawk -f

BEGIN {RS="\n\n"; FS="\n"}
{for (i=1;i<=NF;i++) printf $i" ";  printf "\n\n" }
In the end I ran the perl command on the file instead and it worked perfectly, but I am curious as to what's going on.
 
Old 12-13-2007, 05:11 PM   #15
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Code:
BEGIN {RS="\n\n"; FS="\n"}
{for (i=1;i<=NF;i++) printf $i" ";  printf "\n\n" }
The correct syntax for the printf statements is

printf format, item1, item2, ...

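In the code above the first printf lacks a format specification; more precisely, it treats the whole string $i" " as the format. This causes problems when the string contains the % symbol, which introduces a format specifier. In particular, the error arises when a % is followed by characters that awk can parse as a conversion: in "78% is", the "% i" is most likely read as an integer conversion with no argument to satisfy it, while a % followed by other characters can pass through unharmed, which would explain why the first few % signs caused no trouble. Without going into more detail, you can avoid this problem by explicitly specifying the format and the item to print, as in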
Code:
BEGIN {RS="\n\n"; FS="\n"}
{for (i=1;i<=NF;i++) printf "%s ", $i ; printf "\n\n" }
The printf "%s ", $i statement prints the item $i as a string (%s) followed by a blank space.
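A quick way to check this on a line containing % signs is sketched below. The helper name echo_fields is made up for this example; the sample text is taken from the error message in post #12. The unsafe form (printf $i" ") is deliberately not run, since its behavior depends on what follows each %.

```shell
# Sketch: data is passed as an argument, never as the format string.
# "echo_fields" is a hypothetical helper name used only for this example.
echo_fields() {
  awk '{ for (i = 1; i <= NF; i++) printf "%s ", $i; printf "\n" }' "$@"
}

printf '%s\n' 'counterparts 78%. In my book, 78% is pretty good' | echo_fields
```

The % signs come through unchanged because awk only interprets the first argument of printf as a format.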

Just out of curiosity, have you tried the code I suggested in post #9? It skips the unwrapping step and wraps directly to the desired length. I have not tested it much, though, and I wonder whether it works on a text as long as yours. Cheers!
 
  


All times are GMT -5.
