search two files for specific words remove the line from one file

CyberIT · 11-18-2021, 01:14 PM

Hello

I have two files: file1 and file2. File2 is large.

I'm trying to search file2, line by line, for specific words listed in file1 (one per line). If a word from file1 matches a line in file2, that line should be removed from file2.

I could use some help to start a script for it. bash? python?

Thank you much!

Turbocapitalist · 11-18-2021, 01:37 PM

Maybe grep with the -v, -f, and -F options? The output can be saved into another file using a redirection.
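As a minimal sketch of that suggestion, assuming file1 holds one word per line: -F treats each pattern as a fixed string, -f reads the patterns from a file, and -v prints only the lines that match none of them. The sample contents and the output name file2.new are made up for illustration.

```shell
# Hypothetical sample data in a scratch directory, only for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'blah.example.com' 'help.example.com' > "$tmp/file1"
printf '%s\n' 'blah.example.com 10.0.0.1' \
              'keep.example.com 10.0.0.2' \
              'help.example.com 10.0.0.3' > "$tmp/file2"

# -F: fixed strings, -f file1: read the strings from file1,
# -v: keep only the lines of file2 that contain none of them.
grep -F -v -f "$tmp/file1" "$tmp/file2" > "$tmp/file2.new"

result=$(cat "$tmp/file2.new")
printf '%s\n' "$result"
```

To replace file2 itself, write to a separate file first and then move it over the original; redirecting straight into file2 would truncate it before grep gets to read it.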

shruggy · 11-18-2021, 02:52 PM

If files are sorted, join may be an interesting alternative. See 8.3.6 Union, Intersection and Difference of files in the GNU Coreutils Manual, particularly, the difference.

For unsorted files, combine from package moreutils is also an option.

CyberIT · 11-18-2021, 03:35 PM

This is above my head, but I'd like to figure it out... What I'm trying to do is the following:

example
LINE 1:

Code:

cat file1 | awk '{print $2}'    # gives an output of blah.example.com

LINE 2:

Code:

cat file1 | awk '{print $2}'    # gives an output of help.example.com

With that info I want to find {print $2} of each line within file1 then look for same output within file2 and remove the entire line within file2.

I'm not sure if I can use bash to do that, or would python be better? Any help or examples would be great. Thank you much!

shruggy · 11-18-2021, 03:50 PM

Have you tried what was suggested by Turbocapitalist?

Code:

grep -Fvf <(awk '$0=$2' file1) file2

CyberIT · 11-18-2021, 06:06 PM

Quote:

Originally Posted by Turbocapitalist

Maybe grep with the -v, -f, and -F options? The output can be saved into another file using a redirection.

Thank you for your response.

I tried this but the outcome wasn't what I expected. I don't think I used it properly.

CyberIT · 11-18-2021, 06:07 PM

Quote:

Originally Posted by shruggy

Have you tried what was suggested by Turbocapitalist?

Code:

grep -Fvf <(awk '$0=$2' file1) file2

Thank you for your reply!

Yep, I tried what was suggested earlier but the outcome was not what I wanted. It seemed to just copy the file as it was, nothing more, so I assume I didn't use the proper format. However, the example you posted is not what I used, so I will try that out too.

chrism01 · 11-18-2021, 11:31 PM

You can supply a file as the list of words to be matched/deleted etc. to sed to compare against a 2nd file: https://stackoverflow.com/questions/...another-file-a . Look for the text "grep -Fvxf <lines-to-remove> <all-lines>" on that page.
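A small sketch of what the extra -x in that command changes, using made-up file names (remove, all) and contents: without -x a line is dropped if it merely contains one of the strings, with -x only if the whole line equals one of them.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'example.com' > "$tmp/remove"
printf '%s\n' 'example.com' 'blah.example.com' 'other.org' > "$tmp/all"

# Without -x: every line containing the string is dropped.
substring_result=$(grep -Fvf "$tmp/remove" "$tmp/all")

# With -x: only lines exactly equal to the string are dropped.
wholeline_result=$(grep -Fvxf "$tmp/remove" "$tmp/all")

printf 'substring match:\n%s\n\nwhole-line match:\n%s\n' \
    "$substring_result" "$wholeline_result"
```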

shruggy · 11-19-2021, 05:24 AM

Quote:

Originally Posted by CyberIT

Yep, I tried what was suggested earlier but the outcome was not what I wanted. It seemed to just copy the file as it was, nothing more, so I assume I didn't use the proper format.

Then you should describe the format of both files in more detail.

First, try getting the grep solution to work. It is not the fastest solution, but probably one of the easier to understand. You can optimize it further if needed. An awk solution is more flexible and probably faster as well, especially, if you explicitly specify mawk rather than gawk as awk interpreter (the former tends to be faster than the latter)

Code:

awk 'NR==FNR{_[$2];next}!($0 in _)' file1 file2

but adjusting it to your needs requires some understanding of how awk works.
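For readers new to awk, here is the same logic spelled out with a named array and comments. This is only a sketch: the sample contents are invented, and it assumes the word of interest is the second field of file1, as in the one-liner.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'a blah.example.com' 'b help.example.com' > "$tmp/file1"
printf '%s\n' 'blah.example.com' 'keep.example.com' 'help.example.com x' > "$tmp/file2"

awk_result=$(awk '
    # While reading the first file (NR equals FNR), store field 2 as a key.
    NR == FNR { seen[$2]; next }
    # For the second file, print lines that are not identical to a stored key.
    !($0 in seen)
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$awk_result"
```

Note that the comparison is against the whole line ($0), so "help.example.com x" survives in this sample. Testing `$1 in seen` instead would compare only the first field of each file2 line.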

A shell solution may be THE easiest to understand, but probably the slowest one as well

Code:

#!/bin/sh
while IFS= read -r line
do grep -qw "$line" file1 || printf %s\\n "$line"
done <file2

Again, the grep command, the printf command and even the read command may require some adjustments depending on what exactly you are trying to read, to match, and to output.

And as said, if both files are sorted, there are more efficient ways to do this. E.g.

Code:

join -12 -21 -v2 <(sort -ubk2,2 file1) <(sort file2)

Of course, this doesn't make sense if you have to sort both files on the fly as I did above. But if the files are already sorted (or even if only the large one is) then join may beat awk performance wise.

syg00 · 11-19-2021, 06:07 AM

Who cares about performance?
I spent an entire career optimising system performance - I had a good (well paying) life. No-one cares anymore (yes, no one will employ me now).
Nickel-and-diming in a home environment is pointless - just find a solution you like and run with it.

MadeInGermany · 11-19-2021, 12:04 PM

I haven't seen a requirement for having words in column 2 in file1?
The following expects 1 word per line.

Code:

fgrep -vf file1 file2

CyberIT · 11-19-2021, 03:31 PM

WOW! Thank you all for your comments.

Basically, I think I have to put the contents of file1 into memory while searching file2. If a word from file1 is found in file2, then remove the line containing it.

I will definitely review what everyone has posted and try them out and see what I can do. Thanks!

CyberIT · 11-19-2021, 04:11 PM

Quote:

Originally Posted by MadeInGermany

I haven't seen a requirement for having words in column 2 in file1?
The following expects 1 word per line.

Code:

fgrep -vf file1 file2

fgrep seems to have done the trick. Thank you!

However I noticed that it actually removed more lines than what I needed.

Is there a way to only remove lines in file2 that start with the word in file1?

shruggy · 11-19-2021, 05:27 PM

Quote:

Originally Posted by MadeInGermany

I haven't seen a requirement for having words in column 2 in file1?

I supposed the OP really meant what they posted.

Quote:

Originally Posted by CyberIT

With that info I want to find {print $2} of each line within file1 then look for same output within file2 and remove the entire line within file2.

Looks like I was wrong though.

Quote:

Originally Posted by CyberIT

fgrep seems to have done the trick. Thank you!

Turbocapitalist · 11-19-2021, 09:22 PM

Quote:

Originally Posted by CyberIT

Is there a way to only remove lines in file2 that start with the word in file1?

The fgrep name is just a shortcut for grep -F which was shown above in post #2. But that is for fixed strings not patterns. Now that you want to anchor the string, you have to make a pattern.

Code:

grep -vf <(sed 's/^/^/' file1) file2

How many patterns are in file1? If there are many then you may want a different approach.
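One possible different approach, sketched here with awk: keep the words from file1 in an array and compare them literally against the first field of each line in file2. This assumes that "starts with the word" means the first whitespace-separated field equals the word, which is a guess about the data; the sample contents are invented. A literal comparison also sidesteps the fact that a dot in a grep pattern matches any character.

```shell
# Hypothetical sample data in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'blah.example.com' 'help.example.com' > "$tmp/file1"
printf '%s\n' 'blah.example.com A 10.0.0.1' \
              'www.blah.example.com A 10.0.0.2' \
              'blahXexample.com A 10.0.0.3' \
              'help.example.com A 10.0.0.4' > "$tmp/file2"

prefix_result=$(awk '
    # First file: remember each word (field 1) as an array key.
    NR == FNR { words[$1]; next }
    # Second file: keep lines whose first field is not one of the words.
    !($1 in words)
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$prefix_result"
```

In this sample the www.blah.example.com line is kept because the word is not at the start, and the blahXexample.com line is kept because the dots are compared as plain characters rather than as regex wildcards.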