Just notes on little how-tos, so that I can find out how to do something I've already done when I need to do it again and no longer remember, which is not unlikely. Hopefully they can be useful to others, but I can't guarantee that they will work, or that they won't make things worse.
Pure-text de-truncator script, a work in progress

Posted 09-30-2012 at 01:28 PM by the dsc
Updated 09-30-2012 at 01:32 PM by the dsc
Tags epub, pdf, sed, truncated

Sometimes I want to quote a PDF as plain text. With luck it contains actual text rather than being a non-OCRed image, but often there's still the problem that the text in the PDF is wrapped in a fake/dumb way, with actual newlines at the end of every visual line. That may not make the text completely unreadable when pasted into a text editor, but it's reasonably annoying, and may take quite some time to fix manually.

I'm trying to create a script or one-liner with a few pipes that would try to fix those texts. I think it basically has to read the text line by line and check if the line is empty (in which case it does nothing, I guess), if it ends in a period (again, do nothing), or if it's too short (do nothing, just in case it's a header). If none of these apply, the line likely ends in an unnecessary newline, which is then stripped. The tests don't necessarily have to run in this particular sequence; perhaps some other order is more logical for the flow.
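The checks described above could be sketched roughly as follows. This is only a guess at how the finished thing might look, not a tested tool: the function name detruncate and the MIN_LEN threshold of 40 characters are assumptions made up for illustration, and joined lines are glued together with a single space.

```shell
# Hypothetical filter: reads text on stdin, writes the re-joined text on stdout.
# MIN_LEN is an assumed cut-off; lines shorter than this are treated as
# possible headers (or the short last line of a paragraph) and end the join.
MIN_LEN=${MIN_LEN:-40}

detruncate() {
    buf=""    # text collected so far from lines that are being joined
    # The "|| [ -n ... ]" part also picks up a last line with no final newline.
    while read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            # Empty line: flush whatever is pending, then keep the blank line.
            if [ -n "$buf" ]; then
                printf '%s\n' "$buf"
                buf=""
            fi
            printf '\n'
            continue
        fi
        # Append the line to the pending text, separated by one space.
        buf="${buf:+$buf }$line"
        # A trailing period or a short line is taken as a real line end.
        if [ "${line%.}" != "$line" ] || [ "${#line}" -lt "$MIN_LEN" ]; then
            printf '%s\n' "$buf"
            buf=""
        fi
    done
    # Flush text left over when the input ends in the middle of a join.
    if [ -n "$buf" ]; then
        printf '%s\n' "$buf"
    fi
}
```

It would be used as a filter, for instance "xclip -o | detruncate". Note that a short header that comes right after a long, unterminated line gets glued onto that line, so the heuristic is far from perfect.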


So far what I have is:

Code:
xclip -o | while read -r a ; do lchr="${a#"${a%?}"}" ; if [ "$lchr" != "." ] ; then echo "$a NODOT " | tr -d '\n' ; else echo "$a PERIOD" ; fi ; done
Very preliminary. The "xclip -o" means that the text comes from the clipboard; eventually I'd replace it with some "cat $file/$*" phrasing, I guess. The all-caps words in the echos are obviously just a "debug mode", to make it utterly clear what's going on.
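One possible shape for the "cat" phrasing is sketched below. The function name join_nodot is made up for illustration. It relies on the fact that cat with no file arguments reads standard input, so the same function works on named files and at the end of a pipe. It only does the period check, without the debug words.

```shell
# Hypothetical wrapper: takes file names as arguments, or reads stdin if
# there are none (cat falls back to standard input in that case).
join_nodot() {
    cat -- "$@" | while read -r a; do
        case "$a" in
            *.) printf '%s\n' "$a" ;;   # ends in a period: keep the line break
            *)  printf '%s ' "$a" ;;    # anything else: join with the next line
        esac
    done
}
```

Both "join_nodot some.txt other.txt" and "xclip -o | join_nodot" should then work. If the very last line does not end in a period, the output ends without a final newline.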

For now it only checks whether the line ends in a period, and strips the newline if it does not. I still have to find out how to test for empty lines, and for some arbitrary but reasonably good number of characters to catch possible headers.
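For the two missing tests, the shell's own tools may be enough: [ -z "$a" ] is true for an empty string, and ${#a} expands to the length of $a. A small debug-style sketch, in the same all-caps spirit as above, could look like this. The name classify and the limit of 40 characters are arbitrary assumptions, and depending on the shell and locale the length may be counted in bytes rather than characters.

```shell
# Hypothetical debug helper: labels each input line instead of joining it.
# MIN_LEN is an assumed cut-off for "header-like" lines.
MIN_LEN=${MIN_LEN:-40}

classify() {
    while read -r a; do
        if [ -z "$a" ]; then
            # -z is true when the string has zero length
            echo "EMPTY"
        elif [ "${#a}" -lt "$MIN_LEN" ]; then
            # ${#a} is the length of the line
            echo "$a SHORT"
        else
            echo "$a LONG"
        fi
    done
}
```

Since read strips leading and trailing whitespace here, a line with only spaces also counts as EMPTY.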

Any help is appreciated!



References:
http://www.unix.com/shell-programmin...er-string.html
http://www.linuxquestions.org/questi...ng-sed-191121/