LinuxQuestions.org
Old 02-06-2016, 02:01 PM   #1
Stefek
LQ Newbie
 
Registered: Feb 2016
Posts: 1

Rep: Reputation: Disabled
Text manipulation with Regex after copying from PDF file


I am an avid e-reader user. Since most e-readers do not handle PDFs well and the content is unreadable most of the time, I prefer copying the text to a plain txt file rather than opening the original PDF on the device. This normally loses all of the text formatting, of course. That's OK with me, but one thing in the process is really annoying. The text in the original PDF is usually separated by paragraph breaks, not spaces, at the end of each line (I don't actually know why, but that's how it seems to be), so after copying the text to a plain txt file, each line of the former PDF text ends with a new paragraph, which normally produces something akin to this:
https://i.imgur.com/seIobB2.png
My question is: can I use a regex to eliminate the redundant paragraph breaks? I can already delete all of them by searching for \n (via the find-and-replace command in most text editors), but this results in one long, continuous block of text, which is equally unbearable. I am thinking about finding and eliminating only the ones that are followed by a lowercase letter, since that would usually indicate a break placed in the middle of a sentence. Am I thinking along the right lines? If so, what would be the best regex for that?
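A minimal sketch of that lowercase idea, written in Python rather than as an editor's find-and-replace. The function name `join_broken_lines` and the sample text are made up for illustration. It assumes that a newline followed by an ASCII lowercase letter is always a mid-sentence break, which will not hold for every text.

```python
import re

def join_broken_lines(text):
    """Replace a newline with a space when the next line starts with a lowercase letter."""
    # (?=[a-z]) is a lookahead: it checks the next character without consuming it,
    # so the lowercase letter itself is left in place.
    # [ \t]* also swallows stray spaces or tabs around the line break.
    return re.sub(r"[ \t]*\n[ \t]*(?=[a-z])", " ", text)

# Hypothetical sample, standing in for text copied out of a PDF.
sample = "The quick brown fox\njumps over the lazy\ndog.\nA new paragraph starts\nhere."
print(join_broken_lines(sample))
```

In an editor whose regex search supports lookahead, the equivalent would be to search for `\n(?=[a-z])` and replace it with a single space. Breaks that fall just before a capitalised word (a name, or "I") are left alone by this pattern, so some lines will still need fixing by hand.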

Last edited by Stefek; 02-06-2016 at 04:57 PM.
 
Old 02-07-2016, 03:30 AM   #2
grail
LQ Guru
 
Registered: Sep 2009
Location: Perth
Distribution: Manjaro
Posts: 9,243

Rep: Reputation: 2684
From the picture I see 2 things:

1. The editor you have open has word wrap turned on, which is why each line of the original text runs over multiple lines on screen.

2. In addition to the above, copying to text has placed a newline at the end of each line.

Your original solution of removing the '\n' (newline) is the right idea, but unfortunately there does not seem to be any defining feature to tell it when to stop.
I noticed that the original text has an indent at the start of each new paragraph, but these appear to have been removed as well.
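
With no reliable marker left in the text, any automatic fix has to guess. The sketch below is one possible guess and is not something proposed in this thread: it treats a line as the end of a paragraph only when it ends with sentence punctuation and is also clearly shorter than the longest line. The function name `reflow` and the `short_ratio` threshold are hypothetical, and the heuristic will misfire on some texts (for example, a last line that happens to fill the full width).

```python
def reflow(text, short_ratio=0.8):
    """Join copied lines into paragraphs using a length-plus-punctuation guess."""
    lines = [line.strip() for line in text.splitlines()]
    # Use the longest line as an estimate of the original column width.
    width = max((len(line) for line in lines), default=0)
    paragraphs, current = [], []
    for line in lines:
        if not line:
            # A blank line is taken as an explicit paragraph break.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        ends_sentence = line.endswith((".", "!", "?", '."', '!"', '?"'))
        is_short = len(line) < short_ratio * width
        # Assumption: a short line that finishes a sentence closes a paragraph.
        if ends_sentence and is_short:
            paragraphs.append(" ".join(current))
            current = []
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Paragraphs come back separated by a blank line, so the result stays readable even when the guess is wrong in a few places.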

Maybe if you used a converter instead of copying, you would be left with something that is easier to fix up?
 
Old 02-07-2016, 04:15 AM   #3
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,832

Rep: Reputation: 1820
pdftotext or somesuch? Then fix what's left.
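
The usual command-line converter of this kind is pdftotext, which is typically packaged as poppler-utils on Debian-style systems. A rough sketch of calling it from Python follows. The helper names `pdftotext_command` and `convert` and the file names in the usage comment are hypothetical. Whether the optional `-layout` flag (keep the physical page layout) gives better or worse results for a given book is something to try out rather than assume.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the pdftotext argument list without running anything."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the original physical layout.
        cmd.append("-layout")
    cmd += [pdf_path, txt_path]
    return cmd

def convert(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext, failing with a clear message if it is not installed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout), check=True)

# Usage, with hypothetical file names:
# convert("book.pdf", "book.txt")
```

The resulting text file will probably still have a newline at the end of every line, so the "fix what's left" step still applies.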

I deliberately bought a Kobo because (amongst other things) it will handle PDFs. It has a problem with graphics, but straight text is fine, even if it is not reflowable.
 
Old 02-07-2016, 04:40 AM   #4
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,104

Rep: Reputation: 2267
Calibre can probably be used to convert those files.
 
Old 02-07-2016, 06:20 AM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 14,832

Rep: Reputation: 1820
Calibre is good, but it has similar problems with "graphics" (electrical diagrams, for example). It is very hard to re-scale them and still have them match the text.
 
  


All times are GMT -5. The time now is 03:19 AM.
