LinuxQuestions.org
11-24-2011, 09:50 AM   #1
dereut
LQ Newbie
 
Registered: Apr 2005
Posts: 10

Rep: Reputation: 0
remove blank line if not followed by caps


hello all,

I had to convert a long text from PDF to txt; unfortunately, that left me with a lot of cut lines.

I have a lot of lines that are cut in the middle by one or more empty lines.

I am looking for a way to remove those empty lines if they are not followed by a capital letter.

I use multiple Linux distros in dual boot and can install any software necessary for the job.

any help appreciated

reup dereut
 
11-24-2011, 10:44 AM   #2
tronayne
Senior Member
 
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 3,036

Rep: Reputation: 755
Many of those lines may not be truly empty; they may just contain a lot of white space (blanks).

Try this (it squeezes every run of spaces down to one space, then strips trailing blanks):
Code:
sed 's/  */ /g;s/  *$//g' your_file > /tmp/whatever
and have a look at the result (note: the patterns are s/space space*/ and s/space space*$/).

Then, again using sed, try this (it deletes blank lines):
Code:
sed '/^$/d' /tmp/whatever > /tmp/otherever
and take a look at that.
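
If the converted file also contains tabs or DOS-style carriage returns, the two commands above will not catch them, because they only match plain spaces. The same idea can be sketched in one pass. This is only a sketch: the function name squeeze_and_strip is made up for illustration, and it assumes a sed that understands the POSIX class [[:space:]].

```shell
# squeeze_and_strip: hypothetical helper, not part of the original commands.
# Reads text on stdin, writes the cleaned text to stdout.
squeeze_and_strip() {
    tr -d '\r' |   # drop carriage returns (assumes DOS-style line ends may be present)
    sed 's/[[:space:]][[:space:]]*/ /g; s/ $//; /^$/d'
    # 1st command: collapse each run of blanks/tabs to a single space
    # 2nd command: strip the one trailing space that may remain
    # 3rd command: delete lines that are now empty
}
```

It would be used as a filter, for example squeeze_and_strip < your_file > /tmp/whatever.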

Might help, probably won't hurt.

Hope this helps some.
 
11-24-2011, 11:16 AM   #3
dereut
LQ Newbie
 
Registered: Apr 2005
Posts: 10

Original Poster
Rep: Reputation: 0
tronayne, the first did nothing, but that was to be expected, as I had already replaced all double spaces with a single space (using replace in kwrite).

The second just removed all empty lines, but I am trying to remove empty lines that are not followed by a capital letter OR not preceded by a . (dot).

One possible solution would be to remove all empty lines, then replace every word ending in a . (dot) with the same word followed by a \n (or something).

I have tried to use regular expressions in kwrite to do a replace, but if I search for [a-z]\. and replace it with [a-z]\.\n I end up with words ending in [a-z].\n (file. becomes fil[a-z].\n).
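
The stray [a-z] shows up because a replacement string is taken literally; a character class only has meaning in the search pattern. One way to carry the matched letter over is a capture group with a backreference. A minimal sketch using sed rather than kwrite follows. The helper name break_after_dot is made up, and it assumes the dot is followed by a space that can be dropped.

```shell
# break_after_dot: hypothetical helper name, for illustration only.
# Reads text on stdin and puts a line break after "<lowercase letter>. ".
break_after_dot() {
    # \( \) captures the lowercase letter and \1 puts it back.
    # A backslash followed by a real line break is the portable way to
    # insert a newline in a sed replacement, so the next line must start
    # in the first column.
    sed 's/\([a-z]\)\. /\1.\
/g'
}
```

For example, printf 'Save the file. Then close it.\n' | break_after_dot prints the two sentences on separate lines.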

I need to dig into this a bit more.

reup
 
11-24-2011, 12:07 PM   #4
tronayne
Senior Member
 
Registered: Oct 2003
Location: Northeastern Michigan, where Carhartt is a Designer Label
Distribution: Slackware 32- & 64-bit Stable
Posts: 3,036

Rep: Reputation: 755
Well, this kind of thing usually turns out to be a manual editing job (as in replacing all the double spaces with one, stripping out trailing blanks and then deleting empty lines). Then the fun begins.

You can use one of the editors to replace every dot followed by a space with a dot and a newline; I have no idea how to use kwrite, but I do know how to do it in sed. What you have to do is globally replace the dot followed by a space with a dot followed by a backslash and a real newline. At the shell prompt you type the backslash and then Ctrl-V Ctrl-J (Ctrl-V lets you enter the control character that follows, and a newline is Ctrl-J); do not type the characters ^V^J themselves. On screen it ends up looking like this:
Code:
sed 's/\. /.\
/g' filename > /tmp/filename
That will separate your text into a sentence per line (or maybe two lines). If you go through the output file you can simply join adjacent lines where there's a split word or sentence (real easy with vi or vim to do that simply typing J).
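
Part of that joining can be scripted. Here is a rough sketch meant for the original converted file rather than the output above. It follows the rule asked for earlier in the thread: a run of blank lines counts as a real break only when the text before it ends in a dot and the text after it starts with a capital letter. Otherwise the two halves are joined with a space. Lines that are not separated by a blank line are left alone. The name join_cut_lines is hypothetical, capitals are assumed to be plain ASCII A-Z, and headings or abbreviations will still need fixing by hand.

```shell
# join_cut_lines: hypothetical helper name, for illustration only.
# Reads text on stdin, writes the repaired text to stdout.
join_cut_lines() {
    awk '
        # A blank (or whitespace-only) line: remember the gap, print nothing.
        /^[ \t]*$/ { gap = 1; next }
        {
            if (!seen)                                          sep = ""      # first text line
            else if (!gap)                                      sep = "\n"    # ordinary line break
            else if (prev ~ /\.[ \t]*$/ && $0 ~ /^[ \t]*[A-Z]/) sep = "\n\n"  # real break: keep one blank line
            else                                                sep = " "     # cut line: join the halves
            printf "%s%s", sep, $0
            prev = $0; gap = 0; seen = 1
        }
        END { if (seen) print "" }   # finish the last line
    '
}
```

It would be run as join_cut_lines < filename > /tmp/filename, and the result still deserves a read-through in vi.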

It's almost always painful to get a PDF document into plain text no matter what you do and the above is how I've done it -- plus, of course, a lot of grumbling and gnashing of teeth. Sorry. There might be a utility out there somewhere that will do a clean PDF-to-text but I'm not aware of one.

Hope this helps some.

Last edited by tronayne; 11-24-2011 at 12:08 PM.
 
11-24-2011, 12:41 PM   #5
dereut
LQ Newbie
 
Registered: Apr 2005
Posts: 10

Original Poster
Rep: Reputation: 0
I believe something is wrong in the command you gave me above: it replaces every . with .^V^J

But I get the direction I have to go in. I will mark this thread as solved since, like you, I believe that only manual editing can give me the right result.

thanks for your help and time

reup
 
  


All times are GMT -5.