Old 11-21-2007, 08:32 AM   #1
mangoo
LQ Newbie
 
Registered: Aug 2005
Posts: 8

Rep: Reputation: 0
awk - remove lines between AAAA and BBBB


I have a long file, which looks similar to the one I pasted below.
I would like to remove all lines between the "-----" and "_____" lines (the parts that read "remove this text" in the sample).

In other words, I need the shortest match: cut everything between "-----" and "______" (their lengths can vary). It's OK if the marker lines get removed, too.

Does anyone have awk ideas for that?



File to be edited:

normal text
don't touch

------------
Remove
this
text
____________________

another normal text
normal text
don't touch


------------------
Remove me
please
__________________________

yet another normal text
normal text
don't touch
 
Old 11-21-2007, 09:07 AM   #2
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 38
Code:
awk '/^___/{f=0}f{next}/^---/{f=1}1'
or:

Code:
awk '/^(---|___)/{print}/^---/,/^___/{next}1'
If you want to remove --- ___:

Code:
awk '/^___/{f=0;next}f{next}/^---/{f=1;next}1'
Or just

Code:
awk '/^---/,/^___/{next}1'
Or with sed:

Code:
sed '/^---/,/^___/d'

Last edited by radoulov; 11-21-2007 at 09:14 AM.
 
Old 11-21-2007, 09:19 AM   #3
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
radoulov, I would have written
Code:
awk '/^----/,/^____/{next}{print}'
but the numeric value for "true" is a stroke of genius!
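For readers wondering about that trailing 1: in awk, a pattern without an action prints the current line, and 1 is a pattern that is always true. Here is a small check that the two spellings behave the same. The sample input is made up for this illustration.

```shell
# Sample input (made up for this illustration).
sample='keep 1
----
drop
____
keep 2'

# Trailing 1: an always-true pattern with the default action (print).
short=$(printf '%s\n' "$sample" | awk '/^----/,/^____/{next}1')

# Explicit {print}: the same thing spelled out.
long=$(printf '%s\n' "$sample" | awk '/^----/,/^____/{next}{print}')

printf '%s\n' "$short"
[ "$short" = "$long" ] && echo "both forms give the same output"
```

Both commands print only the two "keep" lines.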
 
Old 11-21-2007, 09:20 AM   #4
b0uncer
LQ Guru
 
Registered: Aug 2003
Distribution: CentOS, OS X
Posts: 5,131

Rep: Reputation: Disabled
Quote:
In other words, I have to use the shortest match,--
Sorry for suspecting, (ignore me if I'm wrong) but... have to ? I hope it wasn't about a schoolwork.

I would have picked up sed myself, for a start anyway.
 
Old 11-21-2007, 09:27 AM   #5
pixellany
LQ Veteran
 
Registered: Nov 2005
Location: Annapolis, MD
Distribution: Mint
Posts: 17,809

Rep: Reputation: 743
Why AWK when you can SED ???

sed '/--/,/__/ d' filename > newfilename

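Deletes every line from the first one containing "--" through the next one containing "__", inclusive. The patterns are not anchored, so they match anywhere in a line. I arbitrarily use two of each character.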

If this WAS homework, then shame on me for doing it for you.....
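One thing to watch with the sed command above: because the patterns are unanchored, any line that merely contains "--" opens a range. Here is a small comparison against a version anchored with ^. The sample text is made up for this illustration.

```shell
# Sample input (made up): the first line contains "--" but is not a marker.
sample='text with a--dash inside
more text
-----
ad
_____
tail'

# Unanchored: the first line already contains "--", so deletion starts there.
loose=$(printf '%s\n' "$sample" | sed '/--/,/__/d')

# Anchored to the start of the line: only the marker lines open and close a range.
strict=$(printf '%s\n' "$sample" | sed '/^--/,/^__/d')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The unanchored form leaves only "tail", while the anchored form also keeps the first two lines.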
 
Old 11-21-2007, 10:43 AM   #6
mangoo
LQ Newbie
 
Registered: Aug 2005
Posts: 8

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by b0uncer View Post
Sorry for suspecting, (ignore me if I'm wrong) but... have to ? I hope it wasn't about a schoolwork.

I would have picked up sed myself, for a start anyway.
Yes, have to (actually, I was wondering whether "have to" was the right expression before I started this thread). And no, it's not schoolwork.

And yes, all the answers are unfortunately *wrong* (perhaps I didn't specify the "test case" clearly enough).
What I want to do is remove all advertisements from an mbox file of a mailing list.

These advertisements are placed between ----- and _____ lines; so far, everything is clear.

The problem is that it is an mbox file (emails one after another), so there are sometimes nice drawings etc.

That is why I was looking for a way to remove the shortest match (the shortest match is not the longest match, nor is it any match longer than the shortest).

That being said, take a look at this "improved" test case:

1 normal text 1
1 don't touch 1

------------
Remove
this
text
____________________



2 normal text 2
a nice diagram:
--------------------------
| This will be gone, too |
| but should stay |
--------------------------
2 normal text 2
2 don't touch 2


------------------
Remove me
please
__________________________

3 yet another normal text 3
3 normal text 3
3 don't touch 3



With all the solutions suggested in this thread, the "normal text 2" section would not come out the way we want: the cut would be neither the longest match nor the shortest one between a ----- line and a _______ line.
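To make the problem concrete, here is a shortened version of the improved test case run through the range-based sed command from earlier in the thread. The sample is trimmed, and "don't" is spelled "do not" only to keep the shell quoting simple.

```shell
# Shortened copy of the improved test case.
sample='2 normal text 2
a nice diagram:
--------------------------
| This will be gone, too |
| but should stay |
--------------------------
2 normal text 2
2 do not touch 2
------------------
Remove me
please
__________________________
3 normal text 3'

# The range opens at the first dash line (the top of the diagram)
# and only closes at the underscore line of the advertisement.
result=$(printf '%s\n' "$sample" | sed '/^---/,/^___/d')
printf '%s\n' "$result"
```

Only "2 normal text 2", "a nice diagram:" and "3 normal text 3" survive. The diagram and the text after it are deleted along with the advertisement.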
 
Old 11-21-2007, 11:25 AM   #7
PAix
Member
 
Registered: Jul 2007
Location: United Kingdom, W Mids
Distribution: SUSE 11.0 as of Nov 2008
Posts: 195

Rep: Reputation: 40
Sorry, but I have to interject. The answers were not wrong; they were correct, but in retrospect the original question appears to have been wrong. Welcome to the world of scope creep. The original question had a sort of elegance that made it easy meat.

So how exactly will the bit below be differentiated from the normal start and end patterns of a delete candidate?
Quote:
-------------------------
| This will be gone, too |
| but should stay |
--------------------------
I can't see anything that would make it anything other than potentially dead meat at the moment.

PAix
 
Old 11-21-2007, 12:47 PM   #8
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 38
Quote:
Originally Posted by mangoo View Post
[...]
With all the solutions suggested in this thread, the "normal text 2" section would not come out the way we want: the cut would be neither the longest match nor the shortest one between a ----- line and a _______ line.
Could you have more than one occurrence of --- something ___ in the same file?
I mean:

Code:
---
a
b
c
___

something else

---
a
b
___
Where it's the second block (the shortest) which is supposed to be removed.
 
Old 11-21-2007, 12:58 PM   #9
mangoo
LQ Newbie
 
Registered: Aug 2005
Posts: 8

Original Poster
Rep: Reputation: 0
Quote:
Originally Posted by radoulov View Post
Could you have more than one occurrence of --- something ___ in the same file?
I mean:

Code:
---
a
b
c
___

something else

---
a
b
___
Where it's the second block (the shortest) which is supposed to be removed.

It's an mbox file (a file containing many emails), so yes, I have several thousand occurrences.

I was thinking of possibly easier solutions:

1) cut 4 or 5 lines above ^_________
2) reverse the order of the lines in the file, cut from ^_______ to the next ^------- (I don't think I have any tables or drawings that use _______), and then reverse the lines back

But anyway, finding the *really* shortest match seems more interesting and useful (think of HTML / XML tags).
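The first idea above (cutting a fixed number of lines above each underscore line) could be sketched roughly like this. The helper name drop_above and the sample input are made up for this illustration, and the sketch assumes every advertisement has the same number of lines, which may not hold for a real mbox file.

```shell
# Hypothetical helper (not from the thread): reads stdin and drops every
# line starting with ___ together with the n lines directly above it.
drop_above() {
    tac | awk -v n="$1" '
        /^___/   { skip = n; next }   # marker line: drop it and arm the counter
        skip > 0 { skip--; next }     # drop the n lines that follow in reversed order
        { print }                     # everything else passes through
    ' | tac
}

# Sample input (made up for this illustration).
sample='keep
-----
ad 1
ad 2
_____
tail'

# 3 = the dash line plus the two ad lines above the underscore line.
result=$(printf '%s\n' "$sample" | drop_above 3)
printf '%s\n' "$result"
```

Reversing the input with tac turns "the n lines above" into "the n lines after", which awk can count as it reads.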
 
Old 11-21-2007, 01:41 PM   #10
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 38
You mean something like this:
Code:
tac filename|awk '/^___/,/^---/{next}1'|tac
or:

Code:
tac <(awk '/^___/,/^---/{next}1' <(tac filename))
Code:
tac <(sed '/^___/,/^---/d' <(tac filename))
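Here is a quick check of the reversal approach against a shortened version of the improved test case. In the reversed file the range opens at the underscore line and closes at the nearest dash line above it, so the diagram is left alone. As noted earlier in the thread, this assumes no diagram uses underscore lines.

```shell
# Shortened copy of the improved test case.
sample='a nice diagram:
--------------------------
| but should stay |
--------------------------
2 normal text 2
------------------
Remove me
__________________________
3 normal text 3'

# Reverse, cut from the underscore line to the next dash line, reverse back.
result=$(printf '%s\n' "$sample" | tac | awk '/^___/,/^---/{next}1' | tac)
printf '%s\n' "$result"
```

The diagram and both "normal text" lines are printed, and only the advertisement block is gone.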

Last edited by radoulov; 11-21-2007 at 02:59 PM. Reason: Corrected ...
 
Old 11-21-2007, 03:18 PM   #11
mangoo
LQ Newbie
 
Registered: Aug 2005
Posts: 8

Original Poster
Rep: Reputation: 0
Thanks a lot for all your answers.

Here are also some ideas from comp.lang.awk group:

http://groups.google.com/group/comp....a81536e6734e7e
 
Old 11-21-2007, 04:24 PM   #12
radoulov
Member
 
Registered: Apr 2007
Location: Milano, Italia/Варна, България
Distribution: Ubuntu, Open SUSE
Posts: 212

Rep: Reputation: 38
Another possible solution:

Code:
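awk '
# Pass 1 (NR == FNR): remember the most recent all-dash line.
NR == FNR && /^-+$/ { f = FNR }
# Pass 1: an all-underscore line closes the block opened at line f;
# mark lines f..FNR for deletion, then forget f.
NR == FNR && /^_+$/ && f {
	for (i = f; i <= FNR; i++)
		x[i]
	f = 0
}
# Pass 2 (NR > FNR): print only the unmarked lines.
NR > FNR && !(FNR in x)
' filename filename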
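The same idea can probably also be done in a single pass, which helps when the input comes from a pipe and cannot be read twice. This is a rough sketch, not a tested replacement: lines are buffered from the most recent all-dash line onward, and the buffer is either discarded (an all-underscore line arrives) or flushed (another all-dash line or the end of input arrives). The function name strip_blocks and the sample input are made up for this illustration.

```shell
# Hypothetical single-pass filter (not from the thread): reads stdin.
strip_blocks() {
    awk '
        # An all-dash line: flush whatever was buffered, then start a new buffer.
        /^-+$/ { printf "%s", buf; buf = $0 ORS; held = 1; next }
        # An all-underscore line while buffering: discard the buffered block.
        /^_+$/ && held { buf = ""; held = 0; next }
        # Inside a candidate block: keep buffering.
        held { buf = buf $0 ORS; next }
        # Outside any candidate block: print right away.
        { print }
        # End of input: an unclosed block is ordinary text, so print it.
        END { printf "%s", buf }
    '
}

# Shortened copy of the improved test case.
sample='a nice diagram:
--------------------------
| but should stay |
--------------------------
2 normal text 2
------------------
Remove me
__________________________
3 normal text 3'

result=$(printf '%s\n' "$sample" | strip_blocks)
printf '%s\n' "$result"
```

The trade-off is memory: the buffer holds everything since the last dash line, which should be small for mail text but is unbounded in principle.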
 
Old 07-07-2011, 01:32 AM   #13
philip.patlur
LQ Newbie
 
Registered: Jul 2011
Posts: 1

Rep: Reputation: Disabled
I have a slightly different problem.

I need to strip out anything that's between =+=+=+= and =+=+=+= in a file.
 
Old 07-07-2011, 01:55 AM   #14
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Code:
sed '/=+=+=+=/,/=+=+=+=/d' file
Is this what you're looking for? If not, please show an example of input and the desired output.
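Here is a small check of that command on made-up input. In sed's basic regular expressions a bare + is a literal character, so the pattern matches the marker text as written. The second command is only a guess at a possible follow-up requirement (keeping the marker lines and removing only what is between them). It assumes the markers start at the beginning of the line and uses index() so that + needs no escaping.

```shell
# Sample input (made up for this illustration).
sample='before
=+=+=+=
strip me
=+=+=+=
after'

# Range from one marker line to the next, markers included.
removed=$(printf '%s\n' "$sample" | sed '/=+=+=+=/,/=+=+=+=/d')

# Variant that keeps the marker lines: toggle a flag on each marker.
kept=$(printf '%s\n' "$sample" |
    awk 'index($0, "=+=+=+=") == 1 { inside = !inside; print; next } !inside')

printf 'markers removed:\n%s\n\nmarkers kept:\n%s\n' "$removed" "$kept"
```

The first form leaves "before" and "after"; the second leaves those two lines plus both marker lines.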
 
  

