LinuxQuestions.org


Stefek 02-06-2016 01:01 PM

Text manipulation with Regex after copying from PDF file
 
I am an avid ereader user. Since most ereaders do not handle PDFs well and the content is unreadable most of the time, I prefer copying the text to a simple txt file rather than opening the original PDF on the device. This normally loses all of the text formatting, of course. That's OK with me, but one thing is really annoying in the process. Usually each line of text in the original PDF ends with a paragraph break rather than a space (I don't actually know why, but that's how it seems to be), so after copying the text to a simple txt file, every line of the former PDF text ends with a new paragraph, which normally produces something akin to this:
https://i.imgur.com/seIobB2.png
My question is: can I use regex to eliminate the redundant paragraph breaks? I can already delete all of them by searching for \n (via the find-and-replace command in most text editors), but this results in one continuous, long block of text, which is equally unbearable. I am thinking about finding and eliminating only the breaks that are followed by a lowercase letter, since this would usually mean the break falls in the middle of a sentence. Am I thinking right? If so, what would be the best regex for that?
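
The idea in the question (remove a line break only when the next line starts with a lowercase letter) can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive answer: the function name `join_broken_lines` and the sample text are made up for illustration, and it assumes plain ASCII lowercase letters are a good enough signal.

```python
import re

def join_broken_lines(text: str) -> str:
    """Join lines that were split in the middle of a sentence.

    Assumption: a line break followed by a lowercase ASCII letter is a
    mid-sentence break, so it is replaced with a single space.
    """
    # [ \t]*  : any spaces or tabs around the break are swallowed too
    # (?=[a-z]): lookahead, so the lowercase letter itself is kept
    return re.sub(r"[ \t]*\n[ \t]*(?=[a-z])", " ", text)

# Hypothetical sample resembling text copied from a PDF
sample = (
    "The quick brown fox\n"
    "jumps over the lazy dog.\n"
    "A new paragraph starts here\n"
    "and continues."
)
print(join_broken_lines(sample))
```

In a text editor whose find-and-replace supports capture groups, the equivalent would be something like searching for `\n([a-z])` and replacing with a space followed by the captured letter; the exact replacement syntax depends on the editor. Note that this leaves a break in place when the next line happens to start with a capital mid-sentence (a name, or "I"), so some manual cleanup is still likely.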

grail 02-07-2016 02:30 AM

From the picture I see 2 things:

1. The editor you have open has word wrap on, which is why the lines of the original text run over multiple lines

2. In addition to the above, the copy to text has placed a newline at the end of each line

Your original solution to remove the '\n' (newline) is the correct idea, but unfortunately there does not seem to be any defining feature to tell it when to stop doing this.
I noticed the original text has an indent at the start of a new paragraph, but these appear to have been removed as well :(
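
One possible stand-in for the missing "defining feature", offered only as a sketch: treat a line break as a real paragraph end only when the line before it finishes with sentence-ending punctuation. This heuristic is an assumption, not something taken from the screenshot, and the function name `join_unfinished_lines` is hypothetical.

```python
import re

def join_unfinished_lines(text: str) -> str:
    """Join a line to the next one unless it ends a sentence.

    Assumption: a line ending in . ! ? : or a closing double quote is the
    end of a paragraph; any other line break is treated as a mid-sentence
    break. Blank lines are left alone.
    """
    # Strip trailing spaces/tabs so the punctuation check sees the last
    # visible character of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # (?<![...\n]) : previous char is not end punctuation or a newline
    # (?=[^\n])    : next line is not empty
    return re.sub(r'(?<![.!?:"”\n])\n(?=[^\n])', " ", text)

# Hypothetical sample text
sample = "This is a line\nthat continues.\nNew Paragraph here\nwith More.\n\nAfter blank"
print(join_unfinished_lines(sample))
```

This will wrongly keep a break when a copied line happens to end exactly at a full stop in the middle of a paragraph, so it trades one kind of error for another.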

Maybe if you used a converter instead of copying, you might be left with something that can be altered more easily?

syg00 02-07-2016 03:15 AM

pdftotext or somesuch? Then fix what's left.

I deliberately bought a Kobo because (amongst other things) it will handle PDFs - it has a problem with graphics, but straight text is fine, if not flowable.

pan64 02-07-2016 03:40 AM

Calibre can probably be used to convert those files.

syg00 02-07-2016 05:20 AM

Calibre is good, but has similar problems with "graphics" - electrical diagrams, for example. Very hard to re-scale while still matching the text.


All times are GMT -5.