Deleting a line matching two or more regexps in bash, sed maybe?

grail · 05-19-2010, 09:13 PM

Well, I think that for the two-pattern scenario ntubski's is the winner, but I'm happy to throw my hat in for the "or more" part:

Code:

awk 'BEGIN{patterns="aunque tengo";split(patterns,array)}{for(x in array)if($0 ~ array[x])i++;if(i < 2)print;i=0}' input_file > output_file

patolfo · 05-20-2010, 12:31 PM

Quote:

Originally Posted by grail

Well, I think that for the two-pattern scenario ntubski's is the winner, but I'm happy to throw my hat in for the "or more" part:

Code:

awk 'BEGIN{patterns="aunque tengo";split(patterns,array)}{for(x in array)if($0 ~ array[x])i++;if(i < 3)print;i=0}' input_file > output_file

I think the only problem with your code is that the number of regexes to look for is hard-coded into the "if(i < 3)" expression.

I am thinking of adding a variable that stores the array length and using it in the conditional.

But anyway, these are the commands that do just the right thing:

Code:

#!/bin/bash
sed '/aunque/{/me/{/daño/d}}' "$1" > output_file
sed '/aunque.*me.*daño/d' "$1" > output_file2
awk 'BEGIN{patterns="aunque me daño";split(patterns,array)}{for(x in array)if($0 ~ array[x])i++;if(i < 3)print;i=0}' "$1" > output_file3

max_matches=1               #max number of pattern matches allowed
patterns=('aunque' 'me' 'daño') #the patterns to match (you can use as many as you want)
file="$1"
counts="$( eval echo -n {1..$(($max_matches+1))} | tr ' ' '|' )"
{ for pattern in "${patterns[@]}"; do
  egrep -n "$pattern" "$file"
done; grep -n '' "$file"; } | sort -n | uniq -c | egrep "^ *($counts) " | sed -r 's/^[^:]+://'
exit
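
The two sed commands above are not quite equivalent: the nested form deletes a line that contains all the words in any order, while the single regex only deletes it when the words appear in that exact order. A minimal sketch of the difference, assuming made-up sample lines and using 'tengo' in place of the third word (semicolons are added before the closing braces, which some non-GNU seds require):

```shell
#!/bin/bash
# Hypothetical sample: line 1 has the words in order, line 2 in reverse
# order, line 3 has only one of them.
sample=$'aunque no me hace tengo\ntengo que me digan aunque\nsolo aunque'

# Nested addresses: delete when every pattern matches, in any order.
nested() { sed '/aunque/{/me/{/tengo/d;};}'; }

# Single regex: delete only when the words appear in this exact order.
ordered() { sed '/aunque.*me.*tengo/d'; }

printf '%s\n' "$sample" | nested    # keeps only "solo aunque"
echo '---'
printf '%s\n' "$sample" | ordered   # also keeps the reversed line
```

Which one is "right" depends on whether the order of the words matters for the lines being deleted.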

Can awk edit in place, like the -i option in sed?

P.S. I just looked up the sed -r option on Google and got this:
-r, --regexp-extended
use extended regular expressions in the script.

And what the heck are those, expanded regexps?
Can somebody shed some light on which regexps are normal and which are advanced?

patolfo · 05-20-2010, 12:35 PM

But I cannot get it to work. Well, it runs all right, but nothing appears in the console or in the file...
Besides, can you explain the regexp inside the last sed, "sed -r 's/^[^:]+://'"? I think that is where the problem is.

Code:

#!/bin/bash

max_matches=1               #max number of pattern matches allowed
patterns=('aunque' 'tengo') #the patterns to match (you can use as many as you want)

file="$1"

counts="$( eval echo -n {1..$(($max_matches+1))} | tr ' ' '|' )"

{ for pattern in "${patterns[@]}"; do
  egrep -n "$pattern" "$file"
done; grep -n '' "$file"; } | sort -n | uniq -c | egrep "^ *($counts) " | sed -r 's/^[^:]+://'

ntubski · 05-20-2010, 04:04 PM

Quote:

And what the heck are those, expanded regexps?
Can somebody shed some light on which regexps are normal and which are advanced?

Extended-regexps.
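
To make the difference concrete: in basic regexps (sed's default) grouping and repetition counts are written with backslashes, while in extended regexps (sed -r) the same operators are written bare. A small sketch with a made-up input string; the function names are only for illustration:

```shell
#!/bin/bash
# Basic regexp (the default): \( \) groups and \{ \} counts repetitions.
basic() { sed 's/\(ab\)\{2\}/X/'; }

# Extended regexp (-r): the same operators without the backslashes.
extended() { sed -r 's/(ab){2}/X/'; }

echo 'ababc' | basic      # prints Xc
echo 'ababc' | extended   # prints Xc
```

Both commands match the same text; only the spelling of the operators changes.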

The problem with ta0kira's script is the last character in the egrep pattern needs to be a tab:

Code:

| egrep "^ *($counts)<TAB>" |

The script can be simplified a bit more:

Code:

#!/bin/bash

max_matches=1               #max number of pattern matches allowed
patterns=('aunque' 'tengo') #the patterns to match (you can use as many as you want)

file="$1"

counts="$( seq --separator '|' $((max_matches+1)))"

{ for pattern in "${patterns[@]}"; do
  egrep -n "$pattern" "$file"
done; grep -n '' "$file"; } | sort -n | uniq -c | sed -nr "/^ *($counts)[\t]/{s/^[^:]+://;p}"
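
To see why the count works, it can help to stop the pipeline right after uniq -c. Every line shows up once from grep -n '' plus once for each pattern it matches, so the count is the number of matching patterns plus one. A sketch on a made-up three-line file (count_lines is a hypothetical helper, not part of the script):

```shell
#!/bin/bash
# count_lines FILE: run the first half of the pipeline and stop after uniq -c.
count_lines() {
  local file="$1"
  { grep -n -E 'aunque' "$file"
    grep -n -E 'tengo' "$file"
    grep -n '' "$file"; } | sort -n | uniq -c
}

demo=$(mktemp)
printf '%s\n' 'aunque tengo' 'solo aunque' 'nada' > "$demo"
count_lines "$demo"   # counts are 3, 2 and 1 for lines 1, 2 and 3
rm -f "$demo"
```

With max_matches=1 the final filter keeps counts 1 and 2, so only the first line (count 3) is dropped.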

Quote:

ntubski (if it is a name, where does it come from?)

A combination of my first initial and a corruption of my last name.

grail · 05-20-2010, 07:18 PM

Quote:

I think the only problem with your code is that the number of regexes to look for is hard-coded into the "if(i < 3)" expression.

It is < 2 because, if a line matches two or more of the required regexes (the name of the thread), we don't want it printed to the new file (hence deleted).
To avoid hard-coding the number, you can use length(array) in {g}awk instead.
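
A sketch of the "delete only when every pattern matches" variant. As an alternative to length(array), it takes the count from the return value of split(), so it does not depend on length(array) being supported; delete_if_all_match and hits are made-up names:

```shell
#!/bin/bash
# delete_if_all_match 'PATTERN PATTERN ...'
# Reads stdin and drops lines that match every pattern in the list.
delete_if_all_match() {
  awk -v patterns="$1" '
    BEGIN { n = split(patterns, array) }   # split() returns the element count
    {
      hits = 0
      for (x in array) if ($0 ~ array[x]) hits++
      if (hits < n) print                  # keep lines that miss a pattern
    }'
}

printf '%s\n' 'aunque tengo' 'solo aunque' 'nada' |
  delete_if_all_match 'aunque tengo'       # drops only "aunque tengo"
```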

And as far as I know, awk has no -i option similar to sed's. Remember also that awk doesn't necessarily have to update a file; it is more for using the file/input
to generate information as required.
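
The usual workaround is to write to a temporary file and copy the result back. A minimal sketch, assuming the two-pattern filter from this thread; awk_in_place is a made-up name:

```shell
#!/bin/bash
# awk_in_place FILE: emulate "in place" editing with a temporary file.
awk_in_place() {
  local file="$1" tmp
  tmp=$(mktemp) || return 1
  # Only overwrite the original if awk finished without error.
  if awk '!(/aunque/ && /tengo/)' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # cat keeps the original file's permissions
  fi
  rm -f "$tmp"
}

demo=$(mktemp)
printf '%s\n' 'aunque tengo' 'solo aunque' > "$demo"
awk_in_place "$demo"
cat "$demo"    # prints: solo aunque
rm -f "$demo"
```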

ta0kira · 05-20-2010, 07:55 PM

Quote:

Originally Posted by ntubski

The problem with ta0kira's script is the last character in the egrep pattern needs to be a tab:

Code:

| egrep "^ *($counts)<TAB>" |

I guess it depends on the implementation of uniq. Mine uses a space (FreeBSD.) To be safe, maybe just use '^ *($counts)[ \t]+'.
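
One caveat: inside a bracket expression POSIX treats the backslash literally, so [ \t] may not match a tab in every grep. A sketch that uses the [[:blank:]] class (space or tab) instead, with the counts written out as 1|2 and made-up uniq -c style input:

```shell
#!/bin/bash
# Keep lines whose leading count is 1 or 2, whether the count is followed
# by a space or by a tab.
keep_low_counts() { grep -E '^ *(1|2)[[:blank:]]+'; }

# Line 1 uses a space after the count, lines 2 and 3 use a tab.
printf '   2 1:foo\n   3\t2:bar\n   1\t3:baz\n' | keep_low_counts
```

Only the lines with counts 2 and 1 are printed; the line with count 3 is filtered out.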

sed -r 's/^[^:]+://' matches from the beginning of the line up until the first ":", then deletes all of it. This gets rid of the duplication count given by uniq -c and the line numbering given by grep -n.
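
For example, on a made-up line as it would come out of uniq -c (strip_prefix is just a name for this illustration):

```shell
#!/bin/bash
# Remove everything up to and including the first ":", i.e. the uniq -c
# count and the grep -n line number. Later colons are left alone.
strip_prefix() { sed -r 's/^[^:]+://'; }

printf '      2 17:aunque: tengo\n' | strip_prefix   # prints "aunque: tengo"
```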

grail's solution has at least two advantages over mine:

It can be used with piped input; it only reads the data once.
It's only one line, although the comment regarding the hard-coding of the patterns can be solved any number of very simple ways using a script.
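
One such way, as a sketch only: wrap the awk in a shell function that takes the limit and the patterns as arguments and reads from stdin. delete_multi and its argument order are made up for this example:

```shell
#!/bin/bash
# delete_multi MAX PATTERN...
# Reads stdin and drops lines that match more than MAX of the patterns.
delete_multi() {
  local max="$1"; shift
  awk -v max="$max" '
    BEGIN {
      # Copy the patterns out of ARGV, then set ARGC to 1 so awk does not
      # treat them as file names and reads stdin instead.
      for (k = 1; k < ARGC; k++) pat[k] = ARGV[k]
      n = ARGC - 1
      ARGC = 1
    }
    {
      hits = 0
      for (k = 1; k <= n; k++) if ($0 ~ pat[k]) hits++
      if (hits <= max) print
    }' "$@"
}

printf '%s\n' 'aunque tengo' 'solo aunque' 'nada' | delete_multi 1 aunque tengo
```

Here the line matching both patterns is dropped and the other two are printed.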

Kevin Barry

patolfo · 05-21-2010, 12:30 PM

Well, I am using SUSE.