My last text-editing query here gave me some very useful answers, so here's another one.
The text file consists of paragraphs separated by blank lines. Each paragraph has been word-wrapped to about 80 columns. Something like this:
<blank line>
80 column line of text
80 column line of text
80 column line of text
<blank line>
80 column line of text
80 column line of text
<blank line>
Now I want to re-wrap the text to a different value. Note that if I simply run a wrap command on the file as-is, I get a lot of lines broken up in odd places, so I have to remove the current wrapping first. I imagine I need to go through an intermediate format like this:
<blank line>
One long, unwrapped line
<blank line>
One long, unwrapped line
<blank line>
Then I want to re-wrap the text to the new value (45 columns, in this case). Do note that this is *word wrapping* I want, so the actual break points must be at word boundaries.
Of course, if there's a better way to go about it, I'd like to know.
I don't doubt awk can do this, I just need the syntax. Any help here? And since I like to know exactly what the scripts are doing, I'd appreciate it if you could also walk me through the commands.
BEGIN {RS="\n\n"; FS="\n"}
Sets the record separator to two consecutive newlines, i.e. a blank line (the default is a single newline, but we want each record to be a whole paragraph, 'anything between two blank lines'), and sets the field separator to a single newline (the default is any run of one or more whitespace characters, but we want a field to be a whole line).
for (i=1;i<=NF;i++) printf $i
Runs through each field (a line in our case) in each record (the bits between the blank lines), and prints the field with no newline after it, joining the split lines back up.
printf "\n\n"
Prints a double newline to get your blank line after each line in the output.
Thanks, ilikejam, that mostly did the job. There was only one problem: the last word of one line was getting concatenated with the first word of the next one with no space in between. But I solved that by modifying 'printf $i' to 'printf $i" "', which appears to insert a blank space at the end of each field as it's printed out. At least the document comes out the way I want it to.
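For reference, here is a minimal sketch of the unwrapping step as a self-contained shell function. It is a variant rather than the exact script from this thread: it assumes awk's paragraph mode (RS set to the empty string) instead of RS="\n\n", since POSIX leaves the behaviour of a multi-character RS unspecified, and it joins lines with a single space so that no trailing space is left over. The function name unwrap_paragraphs is made up for illustration.

```shell
# unwrap_paragraphs (hypothetical name): read text on stdin and print each
# blank-line-separated paragraph as one long line followed by a blank line.
unwrap_paragraphs() {
    awk '
    BEGIN { RS = ""; FS = "\n" }     # paragraph mode: record = paragraph, field = line
    {
        line = $1
        for (i = 2; i <= NF; i++)    # glue the remaining lines on with one space each
            line = line " " $i
        print line                   # the unwrapped paragraph
        print ""                     # blank line between paragraphs
    }'
}

# Example: two short paragraphs, each wrapped over several lines.
printf 'one two\nthree\n\nfour\nfive six\n' | unwrap_paragraphs
```

Because the text is only ever concatenated and handed to print, characters such as % in the input are passed through untouched.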
Now, could you kindly let me know the best way to re-wrap it at a different column value?
@matthewg42. Thanks for your code too. But I don't think I'm ready to tackle perl yet. I'm currently having enough trouble just getting my mind around awk, sed, and regex. I'll be saving it for future reference though.
Perl is awesome. awk is a good tool, and it is what got me into Perl in a way.
I once wrote some nice reports using Awk. However, they were long-running, and while I was searching for ways to speed them up I found a2p, a program which automatically turns simple Awk code into Perl code.
I ran this on the report scripts, and the result ran in half the time! Perl is utterly amazing, and not nearly as hard as some people make out. Don't let those $ characters put you off; they make life easier by telling you what type a variable is.
Now, could you kindly let me know the best way to re-wrap it at a different column value?
I can, but not in awk (I'm sure it's possible in awk, but I can't think of a way that doesn't involve nested loops and other such irritations). In perl, though:
Code:
perl -pe 's/(.{0,45})\s/$1\n/g'
Change 45 to be whatever the column wrap should be.
It matches the longest series of single characters (the .) between 0 and 45 characters long (the {0,45} bit) that ends with a whitespace character (the \s), and then sticks a newline onto the end of the first ($1) regex group match (a regex group being a regex part surrounded by () ).
Basically it finds the longest string that's less than 46 chars long and has a whitespace character at the end, then replaces that string with itself with an appended newline.
I love the way perl scripts are incredibly concise, but explaining them requires ridiculous verbosity.
Well, that's easier than I thought it would be. Just a simple regex. I've studied enough to know the basics of that. It even works in sed without much modification.
So what would make it more difficult in awk? Couldn't you do something like in the unwrapping script where you define the field as a single line, then simply apply the regex substitution to that line? Just where are "nested loops and other irritations" needed?
Sorry for all the questions. I hope some day soon to be able to sit down and really learn what I'm doing (including perl). But right now I'm just trying to get some jobs done, and learn a bit as I go.
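Standard awk can't do backreferences to parenthesised groups in sub() and gsub() (it only has & for the whole match), so it's difficult to replace something with a modified version of part of itself. Well, gawk can, with gensub(), but why bother when perl does it so nicely.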
I totally agree: the perl solution is one of the shortest and most elegant. Anyway, it's not so painful in awk, e.g.
Code:
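# wrap.awk: re-wrap paragraphs (separated by blank lines) to maxlen columns
BEGIN { maxlen = 45; stringa = "" }
{
    if (NF) {
        # Non-blank line: add its words to the pending output line
        for (i = 1; i <= NF; i++) {
            if (stringa == "")
                stringa = $i
            else if (length(stringa) + length($i) + 1 <= maxlen)
                stringa = stringa " " $i
            else {
                # The next word would not fit: flush and start a new line
                print stringa
                stringa = $i
            }
        }
    }
    else {
        # Blank line: flush the end of the paragraph, then keep the blank line
        if (stringa != "") print stringa
        stringa = ""
        print
    }
}
END { if (length(stringa) > 0) print stringa }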
This takes empty lines into account and wraps the text to at most maxlen characters per line. For example:
Code:
$ cat testfile
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy. All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
$ gawk -f wrap.awk testfile
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
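A quick way to check a result like this is to print the length of the longest line, which should not exceed maxlen. The helper below is a sketch with a made-up name (longest_line); it counts with awk's length(), which may differ from display columns for tabs or multibyte text.

```shell
# longest_line (hypothetical name): print the length of the longest input line.
longest_line() {
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

# Example: the sample sentence is 43 characters long, so it fits in 45 columns.
printf 'All work and no play makes Jack a dull boy.\n' | longest_line
```

Piping the output of the wrapping script into this function gives a single number to compare against the chosen width.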
I see. Thanks for the explanation. So what you're saying is that the problem is with awk and its lack of backreferencing. Ok. But since awk on many Linux systems is really just a symlink to gawk, the average user such as me shouldn't have to worry. I just have to be aware that it's not backwards-portable.
One reason I might want to do it in (g)awk instead of perl though is that I may want to combine all of these functions into a single script. In fact, I've been thinking of doing just that, creating a single script that would do all the reformatting I want in one run. But then again, I suppose a single bash script that runs multiple tools would do also.
I'm no master of perl, but if I wanted a one-language script for this sort of thing, I'd be more likely to use perl than awk.
On the other hand, I think using awk to join the lines then perl to split them again is probably the most readable combination, so that's probably what I'd do. There's a lot to be said for maintainable code.
Ok, I'm getting some odd behavior here. I'm trying to unwordwrap a new file, and it's choking on some percentage signs. I'm now running it from a script file, and when I ran it on the new text, I'm getting this error:
Code:
year-olds in many world regions, the US ranked dead last in algebra. On identical tests, the US kids averaged 43% and their Japanese awk: ../awk_scripts/unwordwrap.awk:13: (FILENAME=- FNR=1929) fatal: not enough arguments to satisfy format string
` counterparts 78%. In my book, 78% is pretty good - it corresponds to a C+, '
^ ran out for this one
If I edit the original file to remove the offending % sign, it just moves on and chokes on the next one. The strangest thing though, is that it handles several % signs perfectly before finally choking on about the 5th one in the file, which occurs on line 12356 (it's a 400+ page document). There doesn't seem to be anything special about the lines other than the percents in them, and I've changed nothing in the script except the extra space as I mentioned above. What could be causing this?
Perhaps I'll have to go with the perl command after all.
As I said, the script is pretty much unaltered from what I was given above. I just pasted the commands into a text file, added the gawk initialization line, and ran it with awk -f unwordwrap.awk < input.txt
Here's the file (except for some comment lines, removed for clarity).
In the code above the first printf lacks a format specification, or rather, it interprets the whole string $i as the format. This leads to problems when the string contains the % symbol, which introduces a format specifier. In particular, the problem arises when a % happens to be followed by characters that can be read as a conversion, which may be why several % signs passed unharmed before one failed: in "78% is", for example, "% i" looks like the %i conversion with a space flag, and there is no argument to satisfy it. Without going into more detail, you can avoid this problem by correctly specifying both the format and the item to print, as in printf "%s ", $i
Here the format prints the item $i as a string (%s) followed by a blank space.
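The safe form is easy to try on the two lines from the error message. The sketch below only runs that safe form; the function name join_lines_safely is invented for this example, and the note on the unsafe form describes likely rather than guaranteed behaviour, since it depends on the awk implementation.

```shell
# join_lines_safely (hypothetical name): join stdin lines with spaces.
# The format string is the constant "%s ", so any % in the text is plain data.
join_lines_safely() {
    awk '{ printf "%s ", $0 } END { printf "\n" }'
}

# Unsafe form, for comparison only (not run here): awk '{ printf $0 " " }'
# may treat a sequence such as "% i" in "78% is" as a conversion
# specification and look for an argument that was never supplied.
printf '%s\n' 'averaged 43% and their Japanese' \
              'counterparts 78%. In my book, 78% is pretty good' |
    join_lines_safely
```

The two input lines come out joined on one line with every % sign intact, plus the trailing space that the "%s " format adds after the last field.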
Just out of curiosity, have you tried the code I suggested in post #9? It skips the unwrap step and wraps directly to the desired length. I have not tested it much, though, and I wonder whether it works on a very long text such as yours. Cheers!