remove blank line if not followed by caps
Hello all,
I had to convert a long text from PDF to txt. Unfortunately, that left me with a lot of broken lines: many lines are cut in the middle by one or more empty lines. I am looking for a way to remove those empty lines when they are not followed by a capital letter. I use multiple Linux distros in dual boot and can install any software needed for the job. Any help appreciated. reup dereut |
Many of those lines may not be truly empty lines but rather lines holding a lot of white space (blanks).
Try this (it collapses every run of multiple spaces to one space, then strips trailing blanks): Code:
sed 's/  */ /g;s/ *$//' your_file > /tmp/whatever
Then, again using sed, try this (it deletes empty lines): Code:
sed '/^$/d' /tmp/whatever > /tmp/otherever
Might help, probably won't hurt. Hope this helps some. |
tronayne, the first did nothing, but that was to be expected, as I had already replaced all double spaces with single spaces (using replace in kwrite).
The second just removed all empty lines, but I am trying to remove only the empty lines that are not followed by a capital letter OR not preceded by a . (dot). One solution might be to remove all empty lines, then replace every word ending with a . (dot) by the same word followed by a \n (or something similar). I have tried to do this with a regular-expression replace in kwrite, but if I replace [a-z]\. with [a-z]\.\n, I end up with words ending in the literal text [a-z].\n (file. becomes fil[a-z].\n). I need to dig into this a bit more. reup |
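For reference, the rule described above (keep an empty line only when the text before it ends with a dot and the text after it starts with a capital letter) can be sketched with awk. This is only a sketch under a few assumptions that are not from the thread: the function name drop_midsentence_blanks is made up, capitals are limited to ASCII A-Z, lines holding only spaces or tabs count as empty, and each kept run of empty lines is collapsed to a single one.

```shell
# Hypothetical helper (name invented for this sketch).
# Drops runs of empty lines unless the previous text line ends with "."
# and the next text line starts with a capital letter.
drop_midsentence_blanks() {
    awk '
        # Empty or whitespace-only line: remember it and decide later.
        /^[ \t]*$/ { pending++; next }
        {
            # Emit one empty line only at a likely sentence boundary.
            if (pending && prev ~ /\.[ \t]*$/ && $0 ~ /^[ \t]*[A-Z]/)
                print ""
            pending = 0
            prev = $0
            print
        }
    ' "$@"
}
```

Usage would be something like drop_midsentence_blanks your_file > /tmp/cleaned. It only removes the empty lines; it does not rejoin the two halves of a cut line.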
Well, this kind of thing usually turns out to be a manual editing job (as in replacing all the double spaces with one, stripping out trailing blanks, and then deleting empty lines). Then the fun begins.
You can use one of the editors to replace every dot followed by a space with a dot-newline; I have no idea how to do that in kwrite, but I do know how to do it in sed. What you have to do is globally replace the dot followed by a space with a dot followed by a backslash and a newline, typed as \ then Ctrl-V Ctrl-J (Ctrl-V lets you type the control character that follows, and a newline is Ctrl-J). So, with ^V^J standing for those two keystrokes rather than for literal characters: Code:
sed 's/\. /\.\^V^J/g' filename > /tmp/filename
It's almost always painful to get a PDF document into plain text no matter what you do, and the above is how I've done it, plus, of course, a lot of grumbling and gnashing of teeth. Sorry. There might be a utility out there somewhere that will do a clean PDF-to-text, but I'm not aware of one. Hope this helps some. |
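The dot-then-newline replacement can also be written without typing control characters, by putting a backslash followed by a real line break inside the sed script. A small sketch follows. The function name break_after_sentences is hypothetical, and matching only a lowercase ASCII letter before the dot is an assumption made here to show a capture group: the \( \) pair remembers the letter and \1 puts it back, which is the piece missing from a replacement that repeats [a-z] literally.

```shell
# Hypothetical helper (name invented for this sketch).
# Replaces "<lowercase letter>. " with "<same letter>." plus a line break.
# The backslash at the end of the first script line escapes a real newline,
# which is the portable way to put a newline in a sed replacement.
break_after_sentences() {
    sed 's/\([a-z]\)\. /\1.\
/g' "$@"
}
```

For example, printf 'One file. Two lines. Done.\n' | break_after_sentences prints the three sentences on separate lines. GNU sed also accepts \n in the replacement, but the backslash-newline form is the portable one.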
I believe something is wrong in the command you gave me above: it replaces every . with .^V^J
But I get the direction I have to go in. I will mark this thread as solved since, like you, I believe that only manual editing will give me the right result. Thanks for your help and time. reup |