Well, this kind of thing usually turns out to be a manual editing job (as in replacing all the double spaces with one, stripping out trailing blanks and then deleting empty lines). Then the fun begins.
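For what it's worth, that first cleanup pass can probably be scripted instead of done by hand. Here's a rough sketch using sed; the function name clean_text is just something made up for illustration, and it assumes plain spaces are what need collapsing (it leaves a single leading space alone if a line starts with blanks):

```shell
# Hypothetical helper (the name is illustrative, not from any package).
# Reads text on stdin and writes the cleaned text to stdout.
clean_text() {
    # 1. strip trailing whitespace from every line
    # 2. collapse each run of spaces into a single space
    # 3. delete lines that are now empty
    sed -e 's/[[:space:]]*$//' -e 's/  */ /g' -e '/^$/d'
}
```

You'd run it as something like: clean_text < filename > /tmp/filename.clean (the output name is only an example).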
You can use one of the editors to replace every dot followed by a space with a dot and a newline; I have no idea how to do that in kwrite, but I do know how to do it with sed. What you have to do is globally replace the dot followed by a space with a dot followed by \^V^J (that's a backslash, then control-V, then control-J: control-V lets you type the control character that follows literally, and a newline is control-J). So,
sed 's/\. /\.\^V^J/g' filename > /tmp/filename
That will separate your text into one sentence per line (or maybe two lines). If you go through the output file, you can simply join adjacent lines wherever a word or sentence got split (real easy in vi: put the cursor on the first of the two lines and type J).
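If typing control characters at the prompt is a nuisance, the same sed substitution shown above can be kept in a small script, where the newline is just a real line break after the backslash. This is only a sketch; split_sentences is a made-up name, and it has the same limitation as the one-liner (it splits on every dot-space, abbreviations included):

```shell
# Hypothetical helper (the name is illustrative).
# Reads text on stdin and breaks the line after every ". " it finds.
split_sentences() {
    # The backslash at the end of the first line tells sed that the
    # replacement contains a literal newline. The "/g'" must start in
    # column one, or the indentation would end up in the output.
    sed 's/\. /.\
/g'
}
```

Usage would be along the lines of: split_sentences < filename > /tmp/filename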
It's almost always painful to get a PDF document into plain text no matter what you do and the above is how I've done it -- plus, of course, a lot of grumbling and gnashing of teeth. Sorry. There might be a utility out there somewhere that will do a clean PDF-to-text but I'm not aware of one.
Hope this helps some.