Just annotations of little "how to's", so I know I can find how to do something I've already done when I need to do it again, in case I don't remember anymore, which is not unlikely. Hopefully they can be useful to others, but I can't guarantee that it will work, or that it won't even make things worse.

Pure-text de-truncator script, a work in progress

Posted 09-30-2012 at 02:28 PM by the dsc
Updated 09-30-2012 at 02:32 PM by the dsc
Tags epub, pdf, sed, truncated

Sometimes I want to quote a PDF as plain text. With luck it contains actual text rather than a non-OCRed image, but often there's still the problem that the text in the PDF is truncated (hard-wrapped) in a fake/dumb way, with actual newlines at the end of every line of the page layout. That may not make the text completely unreadable when pasted into a text editor, but it's reasonably annoying and can take quite some time to fix manually.

I'm trying to create a script, or a one-liner with a few pipes, that would try to fix those texts. I think it basically has to read the text line by line and check whether the line is empty (in which case it does nothing, I guess), whether it ends in a period (again, do nothing), or whether it's too short (do nothing, just in case it's a header). If none of these apply, the line likely ends in an unnecessary newline, which is then stripped. The tests don't have to run in this particular order; perhaps some other sequence is more logical for the flow.

So far what I have is:

xclip -o | while IFS= read -r a ; do lchr="${a#"${a%?}"}" ; if [ "$lchr" != "." ] ; then printf '%s NODOT ' "$a" ; else printf '%s PERIOD\n' "$a" ; fi ; done
This is very preliminary. The "xclip -o" means that the text is coming from the X selection (by default the primary selection; "xclip -selection clipboard -o" reads the clipboard proper). Eventually I'd change it for some "cat $file/$*" phrasing, I guess. The all-caps markers in the output are obviously just a "debug mode", to make utterly clear what's going on.
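
A rough sketch of how that "cat $file/$*" phrasing could look, keeping the clipboard as a fallback. The function name read_input is hypothetical, made up just for this note. It uses "$@" rather than $* so that file names containing spaces survive intact.

```shell
# Hypothetical helper (the name "read_input" is an assumption, not an existing tool):
# print the files given as arguments, or fall back to the X selection if none are given.
read_input() {
    if [ "$#" -gt 0 ]; then
        # "--" stops option parsing, so file names starting with "-" are safe.
        cat -- "$@"
    else
        xclip -o
    fi
}
```

The loop would then start with read_input "$@" | while IFS= read -r a ; do ... instead of calling xclip directly.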

For now it only checks whether the line ends in a period, and strips the newline if it does not. I still have to find out how to test for empty lines, and for some arbitrary but reasonably good number of characters to catch possible headers.
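
One way the two missing tests could look, as a sketch under assumptions rather than a finished tool: [ -z "$line" ] is true for an empty line, and ${#line} expands to the length of the line. The function name detruncate and the default threshold of 40 characters are made up for illustration; the threshold in particular would need tuning per document.

```shell
# Sketch only: "detruncate" and the default threshold of 40 characters
# are illustrative assumptions, not tested against real PDF text.
detruncate() {
    local min_len="${1:-40}"   # lines shorter than this keep their newline
    local line last sep=""     # sep: what goes before the next printed line
    while IFS= read -r line || [ -n "$line" ]; do
        last="${line#"${line%?}"}"   # last character ("" for an empty line)
        if [ -z "$line" ]; then
            # Empty line: close any half-joined line, then keep the blank.
            if [ -n "$sep" ]; then printf '\n'; fi
            printf '\n'
            sep=""
        elif [ "$last" = "." ] || [ "${#line}" -lt "$min_len" ]; then
            # Ends in a period, or short enough to be a header: keep the newline.
            printf '%s%s\n' "$sep" "$line"
            sep=""
        else
            # Probably a wrapped line: drop the newline and join with a space.
            printf '%s%s' "$sep" "$line"
            sep=" "
        fi
    done
    # Terminate the output if the last line was left open.
    if [ -n "$sep" ]; then printf '\n'; fi
}
```

It reads standard input, so it could be used as xclip -o | detruncate, or detruncate 60 < some-file.txt for a different threshold. Each line only decides about its own newline, so a short header that directly follows a wrapped line still gets glued to the end of that line.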

Any help is appreciated!

Posted in Uncategorized