
David the H. 11-23-2007 06:31 PM

Inserting blank lines in text file
I just finished using pdftotext to convert a book file into plain text, but there are some problems with the formatting I want to correct. Specifically, there are two things I want to change.

1) The conversion process was supposed to insert a line break at the end of each page, but for some reason it fails to work correctly. Instead of a line break, I get a line control character that shows up as a "^L" at the beginning of the first line of the page in nano, and as an unreadable character in kwrite. I'd like to search the file for all instances of this and insert, not one, but two or three, actual line breaks.

In addition, it would be nice if I could add actual page numbers to this break, counting +1 for each match and inserting it into the replacement text. This isn't strictly necessary, but the book includes an index and it would be nice to be able to use it.

2) I'd like to also insert a single blank line between each paragraph. I can easily match the beginning of each paragraph by its five-space indent. I just need to insert a blank line in front of it.

I'm not even sure what program is best to use, awk? sed? I tried using kwrite's replace function, and the matching part is easy, but AFAICT, it's impossible to insert line breaks and such in the replacement text.

So, what's the best way to go about this? Here's a sample of the text to match. Note the page break before the word "excel":


...questions written out on pieces of paper, which
they surreptitiously examine, waiting their turn and
oblivious of whatever discussion their peers are at
this moment engaged in.
    Something has happened between first and twelfth
grade, and it's not just puberty. I'd guess that it's
partly peer pressure not to
^L    excel (except in sports); partly that the
society teaches short-term gratification; partly the
impression that science or mathematics won't buy you
a sports car; partly that so little is expected of
students; and partly that there are few rewards or
role models for intelligent discussion of science and
technology - or even for learning for its own sake.
Those few who remain interested are vilified as
`nerds' or `geeks' or `grinds'.
    But there's something else: I find many adults
are put off when young children pose scientific
questions. Why is the Moon round? the children ask.
Why is grass green? What is a dream? How deep can you
dig a hole? When...

syg00 11-23-2007 07:25 PM

If you want that page counter I guess you'll need awk (or perl, whatever).
I'd reckon you could set the regex to look for lines starting with any (single) char followed by 5 (?) blanks. Then you don't have to worry what the hex value really is.
You would of course be relying on the input being set-up correctly for page length.
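
A rough sketch of that idea, for illustration only. It assumes the page-break character is always the first character of a line and is followed by four blanks, as in the sample above, and it excludes a blank as the first character so that ordinary five-space paragraph indents are not counted as pages. The variable names and the sample text are made up.

```shell
# Two sample lines; the second starts with a form feed and a four-space indent.
result=$(printf 'partly peer pressure not to\n\f    excel (except in sports)\n' |
awk '
# A line whose first character is not a blank and is followed by four blanks
# is treated as the start of a new page (an assumption based on the sample).
/^[^ ]\040\040\040\040/ {
    page++
    # drop the unknown first character and put the page counter in its place
    $0 = "\n\n" page "\n\n" substr($0, 2)
}
{ print }
')
printf '%s\n' "$result"
```

The pattern never names the character itself, so it works without knowing its code, but it will also fire on any other line that happens to have that shape.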

mRgOBLIN 11-23-2007 07:39 PM

Try this.



BEGIN { count = 0 }

/\012/ {count++}

{ gsub(/\012/,"\n\n"count"\n\n"); gsub(/^(\040){5}/,"\n&") ; print $0 }

save it as somefile.awk and use it like this


awk --re-interval -f somefile.awk < myfile.txt > mynewfile.txt

David the H. 11-24-2007 01:19 PM

Thank you mRgOBLIN. Your code solved half of my problem. It successfully breaks up the regular paragraphs. I did have to modify it a little though, as it turns out that some of the indentations are only 4 spaces wide. But I know enough about regex to change {5} to {4,5} and that solved that.

Unfortunately though, it fails to recognize the page breaks. I guess it fails to match whatever character the page-break symbol actually is. Could it be because I specified the file encoding as UTF-8 or something?

Here's a direct cut&paste from kwrite, as opposed to the c&p from nano I gave you earlier, which apparently converted the character to ASCII or something. I don't know for sure if it will show up correctly here, but in my browser, the page-break shows up as a unicode glyph box with the numbers 000C in it.


heard. Nor did he know, even
    vaguely, about quantum indeterminacy, and he recognized DNA only as
three frequently linked capital letters.

Also, would you mind stepping me through what all that awk code actually does? I'm slowly learning my way through regex, and I'd like to learn a bit about awk as well. It would help me the next time I want to do something like this.

Tinkster 11-24-2007 01:53 PM

\012 needs to be \014 ... octal 012 is a line-feed, not a form-feed (which is
what ^L is).

As for the explanation:

BEGIN { count = 0 }
The BEGIN{} tells awk to do some preparation before any work begins.
He sets the counter to 0 there.


/\012/ {count++}
This ties a pattern to an action - the appearance of a new-line
character increments the counter by one.


{ gsub(/\012/,"\n\n"count"\n\n"); gsub(/^(\040){5}/,"\n&") ; print $0 }
gsub operates on the input, and modifies it in-place.
The first gsub searches the input (in this case a whole line) for a
new-line, and replaces each occurrence with 2 newlines, the content
of the variable count, followed by two more newlines.
The second gsub replaces your 5 spaces (octal 040) at the start of the
line with a new-line followed by those same spaces (& stands for the
matched text).
And then the line gets printed.

And if my ASCII-foo didn't completely vanish, changing both occurrences
of \012 to \014 should solve your problem. And if you re-run it you probably
want to get rid of the "gsub(/^(\040){5}/,"\n&")" bit, or you'll get
extra new-lines between paragraphs.
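
For reference, here is a sketch of the script with both \012 changed to \014. It is not a tested replacement for the original: the sample input is made up, the program is passed inline instead of through somefile.awk, and the five spaces are spelled out as five \040 escapes so the example does not depend on interval-expression support ({5}) in whichever awk is installed.

```shell
# Made-up sample: end of one page, a form feed starting the next, a new paragraph.
result=$(printf 'end of page one\n\fstart of page two\n     New paragraph\n' |
awk '
BEGIN { count = 0 }
# \014 is the form feed (^L): count one page for each line that contains it
/\014/ { count++ }
{
    # replace each form feed with blank lines around the running page count
    gsub(/\014/, "\n\n" count "\n\n")
    # put a blank line in front of a five-space paragraph indent
    gsub(/^\040\040\040\040\040/, "\n&")
    print
}')
printf '%s\n' "$result"
```

With this input the counter "1" should end up on a line of its own between the two pages, and the indented paragraph should get one blank line in front of it.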


David the H. 11-24-2007 03:34 PM

Thank you. That makes it much clearer. And yes, now it works perfectly. It was just a matter of finding the right character code, which is sometimes not easy. I had actually figured out most of the syntax before you answered, but I couldn't figure out which numbers to use. I finally ended up using kwrite's replace function to change the form-feed to a simple string of unique characters, then modified the awk script to replace that instead. It took me a lot of trial & error to get it right though.
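
One way to look up the character code without trial and error, assuming the standard od utility is available, is to dump the first byte of an affected line in octal. The sample line here is made up; on the real file you would feed od a line that starts with the odd character.

```shell
# A sample line that starts with a form feed, like the first line of a page.
# od -An drops the offset column, -to1 prints bytes in octal, -N1 reads one byte.
code=$(printf '\fexcel (except in sports)\n' | od -An -to1 -N1 | tr -d ' \n')
echo "first byte in octal: $code"
```

The three-digit octal value is the same notation that awk's backslash escapes such as \012 and \014 use.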

I still have to do some hand-editing to fix everything, but the text looks much better now. And now I understand at least the basics of how to use awk to replace text. Thanks again.

All times are GMT -5.