Just notes on little how-tos, so I can find out how to do something I've already done when I need to do it again and no longer remember, which is not unlikely. Hopefully they can be useful to others, but I can't guarantee that they will work, or even that they won't make things worse.

Pure-text de-truncator script, a work in progress

Posted 09-30-2012 at 01:28 PM by the dsc
Updated 09-30-2012 at 01:32 PM by the dsc

Tags epub, pdf, sed, truncated

Sometimes I want to quote a PDF as plain text. With luck it contains actual text rather than a non-OCRed image, but often the text is still hard-wrapped in a dumb way, with a real "new line" at the end of every visual line on the page. That may not make the text completely unreadable when pasted into a text editor, but it's reasonably annoying and can take quite some time to fix manually.

I'm trying to create a script, or a one-liner with a few pipes, that would try to fix those texts. I think it basically has to read the text line by line and check whether the line is empty (in which case it does nothing, I guess), whether it ends in a period (again, do nothing), or whether it's too short (do nothing, just in case it's a header). If none of these apply, the line likely ends in an unnecessary "new line", which is then stripped. The tests don't have to run in this particular order; perhaps some other sequence is more logical for the flow.

So far what I have is:

Code:

xclip -o | while IFS= read -r a ; do lchr="${a#"${a%?}"}" ; if [ "$lchr" != "." ] ; then echo "$a NODOT " | tr -d '\n' ; else echo "$a PERIOD" ; fi ; done

Very preliminary. The "xclip" means that the text comes from the clipboard; eventually I'd change it to some "cat $file/$*" phrasing, I guess. The all-caps words in the echos are obviously just a "debug mode", to make it utterly clear what's going on.
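
The switch from the clipboard to files could be as small as swapping "xclip -o" for cat -- "$@": with no arguments cat reads standard input, so piping from xclip keeps working. A minimal sketch, where join_nodot is a hypothetical name made up for this example, and the debug markers are left out:

```shell
#!/bin/sh
# join_nodot is a hypothetical name, not something from the original one-liner.
# With file arguments, cat concatenates them; with none, it reads standard input.
join_nodot() {
    cat -- "$@" | while IFS= read -r a; do
        # Last character of the line (empty string for an empty line).
        lchr="${a#"${a%?}"}"
        if [ "$lchr" != "." ]; then
            # No period at the end: print without the newline, add a space instead.
            printf '%s ' "$a"
        else
            # Ends in a period: keep the newline.
            printf '%s\n' "$a"
        fi
    done
}
```

It could then be called either as join_nodot some_file.txt or as xclip -o | join_nodot (the file name is only a placeholder).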

For now it only checks whether the line ends in a period, and strips the "new line" if it does not. I still have to find out how to test for empty lines, and for some arbitrary but reasonably good number of characters to catch likely headers.
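
One way the two missing tests might look, as a rough sketch rather than a finished tool: an empty line can be detected with [ -z "$line" ], and the length of a line is available as ${#line}. The function name detruncate and the 40-character threshold MIN_LEN are assumptions made up for this example; the threshold is arbitrary and would probably need tuning for each document.

```shell
#!/bin/sh
# Sketch: join lines that look like hard-wrapped text.
# detruncate and MIN_LEN are hypothetical names; 40 is an arbitrary guess.
MIN_LEN=${MIN_LEN:-40}

detruncate() {
    pending=0   # 1 while a joined line is still waiting for its continuation
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -z "$line" ]; then
            # Empty line: close any unfinished paragraph, then keep the blank line.
            [ "$pending" -eq 1 ] && printf '\n'
            printf '\n'
            pending=0
            continue
        fi
        # Last character of the line.
        last="${line#"${line%?}"}"
        if [ "$last" = "." ] || [ "${#line}" -lt "$MIN_LEN" ]; then
            # Ends a sentence, or is short enough to be a header: keep the newline.
            printf '%s\n' "$line"
            pending=0
        else
            # Probably wrapped mid-sentence: replace the newline with a space.
            printf '%s ' "$line"
            pending=1
        fi
    done
    # Make sure the output ends with a newline if the last line was joined.
    [ "$pending" -eq 1 ] && printf '\n'
    return 0
}
```

Usage would be something like xclip -o | detruncate. Note that a short last line of a paragraph that doesn't end in a period is treated the same as a header, and that a sentence ending in "?" or "!" right at a line break still gets joined to the next line, so this is only a starting point.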

Any help is appreciated!

References:
http://www.unix.com/shell-programmin...er-string.html
http://www.linuxquestions.org/questi...ng-sed-191121/

