LinuxQuestions.org

dereut 11-24-2011 09:50 AM

remove blank line if not followed by caps
 
hello all,

I had to convert a long text from PDF to txt. Unfortunately, that left me with a lot of broken lines.

Many lines are cut in the middle by one or more empty lines.

I am looking for a way to remove those empty lines if they are not followed by a capital letter.

I use multiple Linux distros in dual boot and can install any software necessary for the job.

any help appreciated

reup dereut
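
A minimal sketch of that rule, assuming awk is available (it ships with practically every distro). The function name remove_blank_unless_caps is made up for illustration, and "capital letter" is assumed to mean ASCII A-Z. A run of blank lines is dropped unless the next non-blank line starts with a capital, in which case a single blank line is kept:

```sh
# Hypothetical helper: drop blank lines unless the next line starts with A-Z.
remove_blank_unless_caps() {
    LC_ALL=C awk '
        /^[ \t]*$/ { pending = 1; next }   # blank or whitespace-only: hold it back
        {
            # Re-emit a single blank line only when a capital letter follows.
            if (pending && $0 ~ /^[A-Z]/) print ""
            pending = 0
            print
        }
    ' "$@"
}
```

Usage would be something like remove_blank_unless_caps your_file > /tmp/whatever. A sentence that happens to continue with a capitalized name after the break will still be misjudged, so some manual cleanup is to be expected.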

tronayne 11-24-2011 10:44 AM

Many of those lines may not be actual lines but rather a lot of white space (blanks).

Try this (it strips all multiple spaces to one space then strips trailing blanks):
Code:

sed 's/  */ /g;s/  *$//g' your_file > /tmp/whatever
and have a look at the result (note, it's s/space space*$/).

Then, again using sed, try this (it deletes blank lines):
Code:

sed '/^$/d' /tmp/whatever > /tmp/otherever
and take a look at that.
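
If some of the "empty" lines turn out to contain tabs as well as spaces, the two steps can be folded into one pass. This is only a sketch: squeeze_and_drop_blank is an invented name, and it relies on the POSIX [[:blank:]] class (space or tab), which GNU sed supports.

```sh
# Hypothetical helper: squeeze blanks, strip trailing blanks, drop empty lines.
squeeze_and_drop_blank() {
    # 1) turn every run of spaces/tabs into one space
    # 2) remove the single space that may be left at the end of a line
    # 3) delete lines that are now empty
    sed -e 's/[[:blank:]][[:blank:]]*/ /g' \
        -e 's/ $//' \
        -e '/^$/d' "$@"
}
```

Like the two commands above, this deletes every blank line, so it is a first cleanup step rather than the final answer.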

Might help, probably won't hurt.

Hope this helps some.

dereut 11-24-2011 11:16 AM

tronayne, the first did nothing, but that was to be expected, as I had already replaced all double spaces with a single space (using replace in kwrite).

The second just removed all empty lines, but I am trying to remove only the empty lines that are not followed by a capital letter OR not preceded by a . (dot).

One solution could be to remove all empty lines, then replace every word ending in a . (dot) with the same word followed by a \n (or something).

I have tried to use regular expressions in kwrite to do a replace, but if I replace [a-z]\. with [a-z]\.\n, I end up with words ending in the literal text [a-z].\n (file. becomes fil[a-z].\n).

This I need to dig into a bit more.
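
A rough sketch of that two-step idea in sed rather than kwrite, under the assumption that a line ending in a lowercase letter plus a dot really is the end of a paragraph. The name reflow_at_dots is hypothetical. The G command appends an empty line after the matching line, so no \n has to be written in a replacement at all:

```sh
# Hypothetical helper: delete all blank lines, then put one blank line
# back after every line that ends in "lowercase letter + dot".
reflow_at_dots() {
    sed -e '/^[[:space:]]*$/d' \
        -e '/[a-z]\.[[:space:]]*$/G' "$@"
}
```

As for the kwrite attempt, the replacement field is taken literally, which is why [a-z] shows up as text. The usual fix is a capture group, i.e. search for ([a-z])\. and replace with \1.\n, though whether kwrite accepts exactly that syntax is an assumption worth checking in its documentation.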

reup

tronayne 11-24-2011 12:07 PM

Well, this kind of thing usually turns out to be a manual editing job (as in replacing all the double spaces with one, stripping out trailing blanks and then deleting empty lines). Then the fun begins.

You can use one of the editors to replace all the dots followed by a space with a dot-newline; I have no idea how to use kwrite but I do know how to do it in sed. What you have to do is globally replace the dot followed by a space with a dot followed by \^V^J (control-V lets you type the control character that follows, and a newline is control-J; these are keystrokes, not characters to type literally). So,
Code:

sed 's/\. /\.\^V^J/g' filename > /tmp/filename
That will separate your text into one sentence per line (or maybe two lines). If you go through the output file, you can simply join adjacent lines where there's a split word or sentence (real easy in vi or vim: just type J).
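
For anyone copying the command rather than typing it: the ^V^J above stands for keystrokes, so pasting it gives literal characters. A sketch of the same substitution written so that it survives copy and paste (split_sentences is an invented name) puts a backslash followed by a real line break inside the replacement:

```sh
# Hypothetical helper: break the text after every "dot space".
split_sentences() {
    # The backslash at the end of the next line escapes the newline,
    # so the replacement is a dot followed by a line break.
    sed 's/\. /.\
/g' "$@"
}
```

With GNU sed, the shorter sed 's/\. /.\n/g' filename should do the same thing.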

It's almost always painful to get a PDF document into plain text no matter what you do and the above is how I've done it -- plus, of course, a lot of grumbling and gnashing of teeth. Sorry. There might be a utility out there somewhere that will do a clean PDF-to-text but I'm not aware of one.

Hope this helps some.

dereut 11-24-2011 12:41 PM

I believe something is wrong in the command you gave me above: it replaces every . with the literal text .^V^J

But I get the direction I have to go. I will mark this thread as solved, as I believe, like you, that only manual editing could give me the right result.

thanks for your help and time

reup


All times are GMT -5.