Linux - General
This Linux forum is for general Linux questions and discussion.
If it is Linux Related and doesn't seem to fit in any other forum then this is the place.
Welcome to LinuxQuestions.org, a friendly and active Linux Community.
I just finished using pdftotext to convert a book file into plain text, but there are some problems with the formatting I want to correct. Specifically, there are two things I want to change.
1) The conversion process was supposed to insert a line break at the end of each page, but for some reason it fails to work correctly. Instead of a line break, I get a line control character that shows up as a "^L" at the beginning of the first line of the page in nano, and as an unreadable character in kwrite. I'd like to search the file for all instances of this and insert, not one, but two or three, actual line breaks.
In addition, it would be nice if I could add actual page numbers at each break, counting +1 for each match and inserting the count into the replacement text. This isn't strictly necessary, but the book includes an index and it would be nice to be able to use it.
2) I'd like to also insert a single blank line between each paragraph. I can easily match the beginning of each paragraph by its five-space indent. I just need to insert a blank line in front of it.
I'm not even sure what program is best to use, awk? sed? I tried using kwrite's replace function, and the matching part is easy, but AFAICT, it's impossible to insert line breaks and such in the replacement text.
So, what's the best way to go about this? Here's a sample of the text to match. Note the page break before the word "excel":
Code:
...questions written out on pieces of paper, which
they surreptitiously examine, waiting their turn and
oblivious of whatever discussion their peers are at
this moment engaged in.
Something has happened between first and twelfth
grade, and it's not just puberty. I'd guess that it's
partly peer pressure not to
^L excel (except in sports); partly that the
society teaches short-term gratification; partly the
impression that science or mathematics won't buy you
a sports car; partly that so little is expected of
students; and partly that there are few rewards or
role models for intelligent discussion of science and
technology - or even for learning for its own sake.
Those few who remain interested are vilified as
`nerds' or `geeks' or `grinds'.
But there's something else: I find many adults
are put off when young children pose scientific
questions. Why is the Moon round? the children ask.
Why is grass green? What is a dream? How deep can you
dig a hole? When...
If you want that page counter, I guess you'll need awk (or perl, whatever).
I'd reckon you could set the regex to look for lines starting with any (single) char followed by 5 (?) blanks. Then you don't have to worry what the hex value really is.
You would of course be relying on the input being set-up correctly for page length.
Thank you mRgOBLIN. Your code solved half of my problem. It successfully breaks up the regular paragraphs. I did have to modify it a little though, as it turns out that some of the indentations are only 4 spaces wide. But I know enough about regex to change {5} to {4,5} and that solved that.
Unfortunately though, it fails to recognize the page breaks. I guess it fails to match whatever character the page break symbol actually is. Could it be because I specified the file encoding as UTF-8 or something?
Here's a direct cut&paste from kwrite, as opposed to the c&p from nano I gave you earlier, which apparently converted the character to ascii or something. I don't know for sure if it will show up correctly here, but in my browser, the page-break shows up as a unicode glyph box with the numbers 000C in it.
Code:
heard. Nor did he know, even
vaguely, about quantum indeterminacy, and he recognized DNA only as
three frequently linked capital letters.
Also, would you mind stepping me through what all that awk code actually does? I'm slowly learning my way through regex, and I'd like to learn a bit about awk as well. It would help me the next time I want to do something like this.
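For what it's worth, the 000C in that glyph box is hexadecimal for decimal 12, or octal 014, which is the ASCII form feed. One way to check which byte a file really contains is to dump it with od. This is only a minimal sketch, assuming a POSIX shell; the helper name first_byte_octal is made up for illustration.

```shell
# Hypothetical helper: print the octal code of the first byte read from stdin.
first_byte_octal() {
    # -An drops the offset column, -b prints bytes in octal, -N 1 reads one byte.
    od -An -b -N 1 | tr -d ' \n'
}

# Demo: printf turns \f into a form feed, standing in for the page-break line.
printf '\fexcel (except in sports)\n' | first_byte_octal
echo
```

The demo prints 014. On a real file, piping the output of od -c through a pager shows the same character as \f.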
gsub operates on the input, and modifies it in-place.
The first gsub searches the input (in this case a whole line) for a
new-line, and replaces each occurrence with 2 newlines, the content
of the variable count, followed by two more newlines.
The second gsub replaces your 5 leading spaces (octal 040) with a new-line
followed by the same spaces; the & in the replacement stands for the
matched text.
And then the line gets printed.
And if my ASCII-foo didn't completely vanish, changing both occurrences
of \012 (newline) to \014 (form feed) should solve your problem. And if
you re-run it you probably want to get rid of the
"gsub(/^(\040){5}/,"\n&")" bit, or you'll get extra new-lines between
paragraphs.
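The script being described isn't quoted in this thread, so here is a rough reconstruction from the explanation above, not the original code. It assumes a POSIX awk, that each page break is a form feed (octal 014) at the start of a line, and that paragraphs are indented by at least four spaces. The function name fix_book and the [break N] label are invented for illustration.

```shell
# Hypothetical helper: reads the converted text on stdin and writes the
# reformatted text to stdout.
fix_book() {
    awk '
    {
        # sub() edits the current line in place and returns 1 on a match.
        # A leading form feed (octal 014) becomes a numbered separator
        # with two new-lines on each side.
        if (sub(/^\014/, "\n\n[break " (count + 1) "]\n\n")) count++

        # A line that still starts with four spaces begins a paragraph.
        # "&" stands for the matched indent, so this puts a new-line
        # in front of it and keeps the indent.
        sub(/^    /, "\n&")

        print
    }'
}

# Demo on a tiny sample; printf turns \f into a form feed.
printf 'not to\n\f     excel\n     Next paragraph.\nmore text\n' | fix_book
```

On a real file you'd run something like fix_book < book.txt > book-fixed.txt (file names assumed). The counter just numbers the breaks in order, so it only lines up with the printed index if the PDF's pages and the book's page numbers start together.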
Thank you. That makes it much clearer. And yes, now it works perfectly. It was just a matter of finding the right character code, which is sometimes not easy. I had actually figured out most of the syntax before you answered, but I couldn't figure out which numbers to use. I finally ended up using kwrite's replace function to change the form-feed to a simple string of unique characters, then modified the awk script to replace that instead. It took me a lot of trial & error to get it right though.
I still have to do some hand-editing to fix everything, but the text looks much better now. And now I understand at least the basics of how to use awk to replace text. Thanks again.