LinuxQuestions.org
11-23-2007, 06:31 PM   #1
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Inserting blank lines in text file


I just finished using pdftotext to convert a book file into plain text, but there are some problems with the formatting I want to correct. Specifically, there are two things I want to change.

1) The conversion process was supposed to insert a line break at the end of each page, but for some reason it fails to work correctly. Instead of a line break, I get a line control character that shows up as a "^L" at the beginning of the first line of the page in nano, and as an unreadable character in kwrite. I'd like to search the file for all instances of this and insert, not one, but two or three, actual line breaks.

In addition, it would be nice if I could add actual page numbers at each break, counting +1 for each match and inserting the count into the replacement text. This isn't strictly necessary, but the book includes an index and it would be nice to be able to use it.

2) I'd like to also insert a single blank line between each paragraph. I can easily match the beginning of each paragraph by its five-space indent. I just need to insert a blank line in front of it.

I'm not even sure what program is best to use, awk? sed? I tried using kwrite's replace function, and the matching part is easy, but AFAICT, it's impossible to insert line breaks and such in the replacement text.

So, what's the best way to go about this? Here's a sample of the text to match. Note the page break before the word "excel":

Code:
...questions written out on pieces of paper, which
they surreptitiously examine, waiting their turn and
oblivious of whatever discussion their peers are at
this moment engaged in.
    Something has happened between first and twelfth
grade, and it's not just puberty. I'd guess that it's
partly peer pressure not to
^L    excel (except in sports); partly that the
society teaches short-term gratification; partly the
impression that science or mathematics won't buy you
a sports car; partly that so little is expected of
students; and partly that there are few rewards or
role models for intelligent discussion of science and
technology - or even for learning for its own sake.
Those few who remain interested are vilified as
`nerds' or `geeks' or `grinds'.
    But there's something else: I find many adults
are put off when young children pose scientific
questions. Why is the Moon round? the children ask.
Why is grass green? What is a dream? How deep can you
dig a hole? When...
 
11-23-2007, 07:25 PM   #2
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 12,357

If you want that page counter I guess you're needing awk (perl whatever).
I'd reckon you could set the regex to look for lines starting with any (single) char followed by 5 (?) blanks. Then you don't have to worry what the hex value really is.
You would of course be relying on the input being set-up correctly for page length.
 
11-23-2007, 07:39 PM   #3
mRgOBLIN
Slackware Contributor
 
Registered: Jun 2002
Location: New Zealand
Distribution: Slackware
Posts: 999

Try this.

Code:
#!/usr/bin/gawk -f


BEGIN { count = 0 }

/\012/ {count++}

{ gsub(/\012/,"\n\n"count"\n\n"); gsub(/^(\040){5}/,"\n&") ; print $0 }
Save it as somefile.awk and use it like this:

Code:
awk --re-interval -f somefile.awk < myfile.txt > mynewfile.txt
 
11-24-2007, 01:19 PM   #4
David the H.
Original Poster
Thank you mRgOBLIN. Your code solved half of my problem. It successfully breaks up the regular paragraphs. I did have to modify it a little though, as it turns out that some of the indentations are only 4 spaces wide. But I know enough about regex to change {5} to {4,5} and that solved that.
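The indentation change can be tried on its own before running the full script. The following is only a sketch: the function name `indent_breaks` is made up for illustration, and instead of `{4,5}` it uses the equivalent pattern "four spaces plus an optional fifth", so that it also runs on awk versions without interval-expression support.

```shell
# Hypothetical helper: put a blank line in front of every line that
# starts with a 4- or 5-space indent (same effect as /^(\040){4,5}/).
indent_breaks() {
    awk '{
        # "&" in the replacement stands for the matched indent,
        # so the spaces themselves are kept.
        sub(/^    [ ]?/, "\n&")
        print
    }' "$@"
}
```

It reads from the named files or from standard input, for example `indent_breaks < myfile.txt > mynewfile.txt`.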

Unfortunately though, it fails to recognize the page breaks. I guess it fails to match whatever character the page-break symbol actually is. Could it be because I specified the file encoding as UTF-8 or something?

Here's a direct cut & paste from kwrite, as opposed to the one from nano I gave you earlier, which apparently converted the character to ASCII or something. I don't know for sure if it will show up correctly here, but in my browser, the page break shows up as a Unicode glyph box with the numbers 000C in it.

Code:
heard. Nor did he know, even
     vaguely, about quantum indeterminacy, and he recognized DNA only as
three frequently linked capital letters.
Also, would you mind stepping me through what all that awk code actually does? I'm slowly learning my way through regex, and I'd like to learn a bit about awk as well. It would help me the next time I want to do something like this.
 
11-24-2007, 01:53 PM   #5
Tinkster
Moderator
 
Registered: Apr 2002
Location: in a fallen world
Distribution: slackware by choice, others too :} ... android.
Posts: 22,988
Blog Entries: 11

\012 needs to be \014 ... 12 is a line-feed, not a form-feed (which is
what ^L is).
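The two octal codes are easy to check from a shell. This is a small sketch using the standard `printf` and `od` tools; the helper name `octal_of` is made up for the example.

```shell
# Hypothetical helper: print the octal value of a single printf escape.
# od -An -to1 dumps each byte as octal without the address column;
# tr strips the padding spaces and the trailing newline.
octal_of() {
    printf "$1" | od -An -to1 | tr -d ' \n'
}

octal_of '\n'; echo    # line feed: prints 012
octal_of '\f'; echo    # form feed: prints 014
```

In the same way, `od -c myfile.txt` shows the raw bytes of the text file, with each form feed displayed as `\f`.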

As for the explanation:
Code:
BEGIN { count = 0 }
The BEGIN{} tells awk to do some preparation before any work begins.
He sets the counter to 0 there.

Code:
/\012/ {count++}
This ties a pattern to an action - the appearance of a new-line
character increments the counter by one.


Code:
{ gsub(/\012/,"\n\n"count"\n\n"); gsub(/^(\040){5}/,"\n&") ; print $0 }
gsub operates on the input, and modifies it in-place.
The first gsub searches the input (in this case a whole line) for a
new-line, and replaces each occurrence with 2 newlines, the content
of the variable count, followed by two more newlines.
The second gsub matches 5 spaces (octal 040) at the start of the line and
replaces them with a new-line followed by the same spaces (the & in the
replacement stands for the matched text).
And then the line gets printed.

And if my ASCII-fu hasn't completely vanished, changing both occurrences
of \012 to \014 should solve your problem. And if you re-run it you probably
want to get rid of the "gsub(/^(\040){5}/,"\n&")" bit, or you'll get
extra new-lines between paragraphs.
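Put together, the script with \014 in both places might look like the sketch below. It is wrapped in a shell function (the name `fix_pages` is made up here), and the five spaces are written out literally instead of `(\040){5}`, so that `--re-interval` is not needed; that substitution is an assumption of this sketch, not part of the original script.

```shell
# Hypothetical wrapper around the corrected awk script.
fix_pages() {
    awk '
        BEGIN { count = 0 }

        # Count every line that contains a form feed (octal 014).
        /\014/ { count++ }

        {
            # Replace each form feed with two newlines, the page
            # counter, and two more newlines.
            gsub(/\014/, "\n\n" count "\n\n")
            # Put a blank line in front of a five-space indent.
            gsub(/^     /, "\n&")
            print
        }
    ' "$@"
}
```

It can be used like the original, for example `fix_pages < myfile.txt > mynewfile.txt`.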


Cheers,
Tink
 
11-24-2007, 03:34 PM   #6
David the H.
Original Poster
Thank you. That makes it much clearer. And yes, now it works perfectly. It was just a matter of finding the right character code, which is sometimes not easy. I had actually figured out most of the syntax before you answered, but I couldn't figure out which numbers to use. I finally ended up using kwrite's replace function to change the form-feed to a simple string of unique characters, then modified the awk script to replace that instead. It took me a lot of trial & error to get it right though.

I still have to do some hand-editing to fix everything, but the text looks much better now. And now I understand at least the basics of how to use awk to replace text. Thanks again.
 
  

