awk

mangoo · 11-21-2007, 08:32 AM

I have a long file, which looks similar to the one I pasted below.
I would like to remove all lines between the "-----" and "_____" lines; in the sample I marked them with "remove this text".

In other words, I have to use the shortest match, and cut everything out between "-----" and "______" (their length can vary). It's OK if these marking lines get removed, too.

Does anyone have an awk idea for that?

File to be edited:

normal text
don't touch

------------
Remove
this
text
____________________

another normal text
normal text
don't touch

------------------
Remove me
please
__________________________

yet another normal text
normal text
don't touch

radoulov · 11-21-2007, 09:07 AM

Code:

awk '/^___/{f=0}f{next}/^---/{f=1}1'

or:

Code:

awk '/^(---|___)/{print}/^---/,/^___/{next}1'

If you want to remove --- ___:

Code:

awk '/^___/{f=0;next}f{next}/^---/{f=1;next}1'

Or just

Code:

awk '/^---/,/^___/{next}1'

Or with sed:

Code:

sed '/^---/,/^___/d'

colucix · 11-21-2007, 09:19 AM

radoulov, I would have written

Code:

awk '/^----/,/^____/{next}{print}'

but the numeric value for "true" is a stroke of genius!
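
A note on why the bare `1` works: an awk rule has the form `pattern { action }`, and when the action is left out the default action prints the current line. Any non-zero pattern counts as true, so `1` prints every line that an earlier `next` did not skip. A small sketch, using a made-up five-line sample rather than the original poster's file:

```shell
# Hypothetical sample input, not taken from the thread.
sample='keep
---
drop
___
also keep'

# Explicit form: skip the range, then print whatever is left.
explicit=$(printf '%s\n' "$sample" | awk '/^---/,/^___/{next}{print}')

# Short form: the always-true pattern 1 triggers the default print action.
short=$(printf '%s\n' "$sample" | awk '/^---/,/^___/{next}1')

printf '%s\n' "$short"
```

Both forms should produce the same two lines, `keep` and `also keep`.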

b0uncer · 11-21-2007, 09:20 AM

Quote:

In other words, I have to use the shortest match,--

Sorry for being suspicious (ignore me if I'm wrong), but... "have to"? I hope this isn't schoolwork.

I would have picked up sed myself, for a start anyway.

pixellany · 11-21-2007, 09:27 AM

Why AWK when you can SED ???

sed '/--/,/__/ d' filename > newfilename

Deletes everything starting with "--" up to and including "__". I arbitrarily use two of each character.

If this WAS homework, then shame on me for doing it for you.....
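
One caveat with the unanchored patterns, sketched here on a made-up input: `/--/` matches two dashes anywhere in a line, so ordinary text that happens to contain `--` also opens the range. Anchoring with `^`, as in the earlier answers, restricts the match to lines that begin with the marker.

```shell
# Hypothetical input: the first line contains "--" in running text.
sample='see the --help option
normal text
------------
advert
____________
more text'

# Unanchored: the range opens on the first line, because "--" occurs in it.
loose=$(printf '%s\n' "$sample" | sed '/--/,/__/d')

# Anchored: only lines that begin with the marker open or close the range.
strict=$(printf '%s\n' "$sample" | sed '/^--/,/^__/d')

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$loose" "$strict"
```

With this input the unanchored version keeps only `more text`, while the anchored version also keeps the first two lines.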

mangoo · 11-21-2007, 10:43 AM

Quote:

Originally Posted by b0uncer

Sorry for being suspicious (ignore me if I'm wrong), but... "have to"? I hope this isn't schoolwork.

I would have picked up sed myself, for a start anyway.

Yes, have to (actually, I was wondering whether "have to" was the right expression before I started this thread). And no, it's not schoolwork.

And yes, all the answers are unfortunately *wrong*

(perhaps I didn't specify the "test case" clearly enough).
What I want to do is remove all advertisements from an mbox file of some mailing list.

These advertisements are placed between ----- and _____ - so far, everything clear.

The problem is that it is an mbox file (that is, emails one after another), so the messages sometimes contain nice drawings etc.

And hence I was looking for a way to remove the shortest match (the shortest match is not the longest match; it is also not a match longer than the shortest).

That being said, take a look at this "improved" test case:

1 normal text 1
1 don't touch 1

------------
Remove
this
text
____________________

2 normal text 2
a nice diagram:
--------------------------
| This will be gone, too |
| but should stay |
--------------------------
2 normal text 2
2 don't touch 2

------------------
Remove me
please
__________________________

3 yet another normal text 3
3 normal text 3
3 don't touch 3

With every solution suggested in this thread so far, the "2 normal text 2" part would not come out the way we want: the cut would be neither the longest match nor the shortest match between a ----- and a _______.
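
To make the failure concrete, here is a sketch that runs one of the earlier forward-range commands on the middle part of the improved test case. The range opens at the top border of the diagram, which is the first line starting with `---`, and stays open until the advertisement's `___` line.

```shell
# Middle part of the improved test case.
sample='2 normal text 2
a nice diagram:
--------------------------
| This will be gone, too |
| but should stay        |
--------------------------
2 normal text 2

------------------
Remove me
please
__________________________

3 normal text 3'

# Forward range: first "---" line through the next "___" line.
forward=$(printf '%s\n' "$sample" | sed '/^---/,/^___/d')

printf '%s\n' "$forward"
```

The diagram and the second `2 normal text 2` line are deleted together with the advertisement.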

PAix · 11-21-2007, 11:25 AM

Sorry, but I have to interject. The answers were not wrong; they were correct, but in retrospect the original question appears to have been wrong. Welcome to the world of scope creep. The original question had a sort of elegance that made it easy meat.

So how exactly will the bit below be told apart from the normal start and end patterns of a deletion candidate?

Quote:

-------------------------
| This will be gone, too |
| but should stay |
--------------------------

I can't see anything that would make it anything other than potentially dead meat at the moment.

PAix

radoulov · 11-21-2007, 12:47 PM

Quote:

Originally Posted by mangoo

[...]
With every solution suggested in this thread so far, the "2 normal text 2" part would not come out the way we want: the cut would be neither the longest match nor the shortest match between a ----- and a _______.

Could you have more than one occurrence of --- something ___ in the same file?
I mean:

Code:

---
a
b
c
___

something else

---
a
b
___

Where it's the second block (the shortest) which is supposed to be removed.

mangoo · 11-21-2007, 12:58 PM

Quote:

Originally Posted by radoulov

Could you have more than one occurrence of --- something ___ in the same file?
I mean:

Code:

---
a
b
c
___

something else

---
a
b
___

Where it's the second block (the shortest) which is supposed to be removed.

It's an mbox file (a file containing many emails), so yes, I have several thousand occurrences.

I was thinking of possible easier solutions:

1) cut the 4 or 5 lines above ^_________
2) reverse the order of the lines in the file, cut from _______ to the first -------, and then reverse the lines back (I don't think I have any tables or drawings which use _______)

But anyway, finding the *really* shortest match seems more interesting and useful (think of HTML / XML tags).
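
Idea 1 could be sketched roughly as follows. This assumes every advertisement has the same known number of lines above its `___` line, counting the `---` line, which the thread does not guarantee. `drop_above` is a hypothetical helper name, and `tac` is assumed to be available (GNU coreutils).

```shell
# Hypothetical helper: $1 = number of lines above each "___" marker to drop.
drop_above() {
    tac | awk -v n="$1" '
        /^___/   { skip = n; next }   # marker line: drop it and arm the counter
        skip > 0 { skip--; next }     # drop the n lines that sit above the marker
        { print }
    ' | tac
}

# Made-up input: the advertisement is "------" plus two lines, so n = 3.
cleaned=$(printf '%s\n' 'text' '------' 'ad line 1' 'ad line 2' '______' '' 'more' |
    drop_above 3)

printf '%s\n' "$cleaned"
```

If the advertisements vary in length, a fixed count either leaves part of an advertisement behind or eats normal text, which is why matching back to the nearest `---` is the safer approach.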

radoulov · 11-21-2007, 01:41 PM

You mean something like this:

Code:

tac filename|awk '/^___/,/^---/{next}1'|tac

or:

Code:

tac <(awk '/^___/,/^---/{next}1' <(tac filename))

Code:

tac <(sed '/^___/,/^---/d' <(tac filename))
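
Why the reversal helps: read from the bottom up, the range opens at `___` and closes at the first `---` it meets, which is the nearest `---` above the `___` in the original file. A sketch on a shortened version of the improved test case, assuming `tac` is available and that no drawing contains a line starting with `___`:

```shell
# Shortened version of the improved test case.
sample='a nice diagram:
--------------------------
| but should stay |
--------------------------
2 normal text 2

------------------
Remove me
__________________________

3 normal text 3'

# Reverse, delete from "___" to the first "---", reverse back.
reversed=$(printf '%s\n' "$sample" | tac | sed '/^___/,/^---/d' | tac)

printf '%s\n' "$reversed"
```

Here the diagram survives and only the three advertisement lines are removed.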

mangoo · 11-21-2007, 03:18 PM

Thanks a lot for all your answers.

Here are also some ideas from comp.lang.awk group:

http://groups.google.com/group/comp....a81536e6734e7e

radoulov · 11-21-2007, 04:24 PM

Another possible solution:

Code:

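awk '
# First pass: remember the most recent line made only of dashes.
NR == FNR && /^-+$/ {
	f = FNR
}
# First pass: on a line of underscores, mark everything from that
# dash line through here for removal, then forget the dash line.
NR == FNR && /^_+$/ && f {
	for (i = f; i <= FNR; i++)
		x[i]
	f = 0
}
# Second pass: print only the lines that were not marked.
NR > FNR && !(FNR in x)
' filename filename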
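
A single-pass variant of the same idea is also possible. This is a different technique from the two-pass script above, sketched here as an alternative: it holds lines back from the most recent `---` line onward and either discards them when a `___` line arrives or releases them when another `---` line shows up first. `strip_blocks` and the variable names are made up for illustration.

```shell
# Hypothetical helper: reads stdin, writes the cleaned text to stdout.
strip_blocks() {
    awk '
        function release(   i) {       # print and empty the held lines
            for (i = 1; i <= n; i++) print held[i]
            n = 0
        }
        /^-+$/            { release(); holding = 1 }       # new candidate start
        holding && /^_+$/ { n = 0; holding = 0; next }     # block closed: discard it
        holding           { held[++n] = $0; next }         # hold until the fate is known
        { print }
        END { release() }                                  # unclosed block is normal text
    '
}

sample='a nice diagram:
--------------------------
| but should stay |
--------------------------
2 normal text 2

------------------
Remove me
__________________________

3 normal text 3'

cleaned=$(printf '%s\n' "$sample" | strip_blocks)
printf '%s\n' "$cleaned"
```

Because it reads its input only once, this version also works in a pipeline, at the cost of buffering the lines since the last `---`.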

philip.patlur · 07-07-2011, 01:32 AM

I have a slightly different problem.

I need to strip out anything that's between =+=+=+= and =+=+=+= in a file.

colucix · 07-07-2011, 01:55 AM

Code:

sed '/=+=+=+=/,/=+=+=+=/d' file

Is this what you're looking for? If not, please show an example of input and the desired output.
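
For an awk version of this, two details differ from sed. In awk's extended regular expressions `+` is a quantifier, so the marker has to be matched literally, and an awk range whose start and end patterns match the same line ends on that very line. A toggle flag avoids both problems. A minimal sketch on a made-up sample, assuming the markers should be removed as well:

```shell
# Hypothetical sample input.
sample='keep 1
=+=+=+=
drop
=+=+=+=
keep 2'

toggled=$(printf '%s\n' "$sample" | awk '
    index($0, "=+=+=+=") { inside = !inside; next }   # marker line: flip state, drop it
    !inside                                           # print only lines outside a block
')

printf '%s\n' "$toggled"
```

`index()` does a plain substring search, so no regex escaping is needed.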