sed: delete lines from file one if their regexps are listed in file two

fucinheira · 09-17-2009, 04:26 AM

Hello there,

I am trying to delete lines of a file if they contain text that is present in another file. For example:

> cat one.txt
a
b
c
d
e
f
g

> cat two.txt
c
d
e

If I run the following script
> cat test.sh
#!/bin/bash

while read LINE
do
sed -e "/$LINE/d" $1
done < $2

I get the following output:
> ./test.sh one.txt two.txt
a
b
d
e
f
g
a
b
c
e
f
g
a
b
c
d
f
g

instead of the "expected":
a
b
f
g

Obviously I am doing something wrong. I would appreciate any help.

Thanks, Jose

druuna · 09-17-2009, 04:50 AM

Hi,

The (unexpected) output is correct.

The while read loop takes the first line from two.txt, and sed prints one.txt without the lines that match it (one.txt itself is not modified).
Result: a b d e f g (c is removed)

Then the while read loop takes the second line from two.txt, and sed again prints one.txt without the lines that match it.
Result: a b c e f g (d is removed)

Same for the third line in two.txt.

one.txt is used three times (once for every line in two.txt).

You need to read all the entries in two.txt and give them to sed in one go.

Something like this (a one-liner from the command line) will do what you want:

sed -e '/c/d' -e '/d/d' -e '/e/d' one.txt

I'm not sure if you know how, but I'll let you play with this first.

Anyway, hope this helps.
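
One possible way to hand all the entries to sed in one go is to let a first sed turn two.txt into a list of delete commands and pass that list to a second sed. This is only a minimal sketch: the scratch directory and the variable names are made up for the illustration, and it assumes that no line in two.txt is empty or contains a "/".

```shell
#!/bin/bash
# Recreate the sample files from the thread in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' a b c d e f g > "$workdir/one.txt"
printf '%s\n' c d e > "$workdir/two.txt"

# Turn every line of two.txt into a sed delete command: "c" becomes "/c/d".
# Assumption: no line is empty and no line contains a "/".
script=$(sed 's|.*|/&/d|' "$workdir/two.txt")

# Apply all delete commands in a single pass over one.txt.
result=$(sed -e "$script" "$workdir/one.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Because one.txt is read only once, the output is a single filtered copy instead of one copy per line of two.txt.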

jschiwal · 09-17-2009, 05:00 AM

Look at the comm command. The input files need to be sorted. You can output lines unique to the second file:

comm -13 <(sort file1) <(sort file2)

See "man comm" for full details on this command.
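
As a rough illustration with the sample files from the first post: comm compares whole lines, and the lines that appear only in the first file come from -23 rather than -13. The sketch below writes the sorted copies to scratch files (hypothetical names) instead of using process substitution, so it also runs in a plain sh.

```shell
#!/bin/bash
# Recreate the sample files from the thread in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' a b c d e f g > "$workdir/one.txt"
printf '%s\n' c d e > "$workdir/two.txt"

# comm expects sorted input, so sort both files first.
sort "$workdir/one.txt" > "$workdir/one.sorted"
sort "$workdir/two.txt" > "$workdir/two.sorted"

# -2 hides lines unique to the second file, -3 hides lines common to both,
# which leaves the lines found only in the first file.
only_in_one=$(comm -23 "$workdir/one.sorted" "$workdir/two.sorted")

# -1 hides lines unique to the first file, leaving those only in the second.
only_in_two=$(comm -13 "$workdir/one.sorted" "$workdir/two.sorted")

printf 'only in one.txt:\n%s\n' "$only_in_one"
printf 'only in two.txt:\n%s\n' "$only_in_two"

rm -rf "$workdir"
```

Here every line of two.txt also occurs in one.txt, so the -13 result is empty.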

fucinheira · 09-17-2009, 06:22 AM

Thanks for both answers; they provide useful hints. Actually, the situation is more complicated. Sorry if my previous message was a bit misleading. Imagine that one.txt contains several hundred lines of text with several words on each line, while two.txt contains a list of a couple of hundred words. What I would like is to delete every line in one.txt that contains at least one word listed in two.txt.

Thanks again, Jose

druuna · 09-17-2009, 06:40 AM

Hi,

Do take a look at what jschiwal mentioned.

I do believe that the comm command can do what you want (and sorting the files is crucial!).

jschiwal · 09-17-2009, 07:01 AM

Also look at the options for grep. You can use a file as the source of the patterns (-f). You can also use an option that returns only the lines not matching the patterns (-v). Combined, these have the effect of deleting the lines in one file that contain words from a list.
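
A small sketch of how those two options might be combined for lines that hold several words. The file names and sample sentences are invented for the illustration; -F (treat the patterns as fixed strings) and -w (match whole words only) are extra options that were not mentioned above and may or may not be wanted.

```shell
#!/bin/bash
# Hypothetical sample data: several words per line, one word per pattern.
workdir=$(mktemp -d)
printf '%s\n' 'the quick brown fox' 'a lazy dog sleeps' \
              'cats chase mice' 'foxes are clever' > "$workdir/lines.txt"
printf '%s\n' fox dog > "$workdir/words.txt"

# -f reads the patterns from a file, -v keeps only non-matching lines,
# -F treats each pattern as a literal string, -w requires a whole-word match.
whole_words=$(grep -v -w -F -f "$workdir/words.txt" "$workdir/lines.txt")

# Without -w, "fox" also matches inside "foxes", so that line is dropped too.
substrings=$(grep -v -F -f "$workdir/words.txt" "$workdir/lines.txt")

printf 'with -w:\n%s\n' "$whole_words"
printf 'without -w:\n%s\n' "$substrings"

rm -rf "$workdir"
```

Whether substring or whole-word matching is the right choice depends on what counts as "containing a word" in the real data.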

fucinheira · 09-17-2009, 08:28 AM

Quote:

Originally Posted by jschiwal

Also look at the options for grep. You can use a file as the source of the patterns (-f). You can also use an option that returns only the lines not matching the patterns (-v). Combined, these have the effect of deleting the lines in one file that contain words from a list.

Thanks again to both! It works, provided that I remove empty lines first.

> sed -i '/^$/d' one.txt
> sed -i '/^$/d' two.txt
> grep -v -f two.txt one.txt > output.txt
> cat one.txt
a
b
c
d
e
f
g
> cat two.txt
c
d
e
> cat output.txt
a
b
f
g

Great! Just what I need!

Jose
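
For anyone wondering why the empty lines matter: an empty line in the pattern file is an empty pattern, which matches every line, so grep -v then has nothing left to print. The sketch below shows this with the sample data plus one deliberately empty line in two.txt; the scratch directory and the two.clean file name are only illustrative.

```shell
#!/bin/bash
# Sample data from the thread, with an empty line added to two.txt.
workdir=$(mktemp -d)
printf '%s\n' a b c d e f g > "$workdir/one.txt"
printf '%s\n' c '' d e > "$workdir/two.txt"

# The empty pattern matches every line, so -v filters everything out.
# grep exits non-zero when it prints nothing, hence the "|| true".
unfiltered=$(grep -v -f "$workdir/two.txt" "$workdir/one.txt" || true)

# Drop the empty lines from the pattern list, then filter again.
grep -v '^$' "$workdir/two.txt" > "$workdir/two.clean"
filtered=$(grep -v -f "$workdir/two.clean" "$workdir/one.txt")

printf 'with the empty pattern: [%s]\n' "$unfiltered"
printf 'after cleaning:\n%s\n' "$filtered"

rm -rf "$workdir"
```

Only the pattern file needs this cleanup; empty lines in one.txt do not affect which lines are removed.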