Removing lines from a file that are in another file

jrmorrill · 01-06-2011, 07:46 PM

I have a large text file containing over 180k lines and another text file containing about 1k. I would like to remove lines in the 180k-line file that exist in the 1k-line file. I thought there was a simple way to do this but I haven't come across it yet. Any advice? Thanks

David the H. · 01-06-2011, 10:42 PM

Can we get some examples of the text involved? If the lines to be matched are identical, then it should be easy. But there may be more work involved if you need to make partial matches, or if there's a chance of oddball characters or the like.

Here's a grep one-liner that should work in the simplest case. You'll have to redirect the output into a new file, as grep can't edit a file in place.

Code:

grep -v -x -F -f "smallfile.txt" largefile.txt >newfile.txt

-v invert match (print non-matching lines)
-x match whole lines only (otherwise "bbb" would also remove a line like "bbbx")
-F fixed strings (for efficiency and to avoid regex problems)
-f read the match strings from the named file
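
To see what whole-line matching (-x) changes compared with plain -F substring matching, here is a small sketch. The scratch directory and the sample words are made up for illustration; they are not the poster's data.

```shell
# Hypothetical sample data in a scratch directory (names are illustrative only).
workdir=$(mktemp -d)
printf '%s\n' aaa bbb bbbx ccc ddd > "$workdir/largefile.txt"
printf '%s\n' bbb ccc > "$workdir/smallfile.txt"

# Without -x, "bbb" also matches inside "bbbx", so that line is dropped too.
substring_result=$(grep -v -F -f "$workdir/smallfile.txt" "$workdir/largefile.txt")

# With -x, only lines identical to one of the strings are removed.
wholeline_result=$(grep -v -x -F -f "$workdir/smallfile.txt" "$workdir/largefile.txt")

printf 'substring match:\n%s\n' "$substring_result"
printf 'whole-line match:\n%s\n' "$wholeline_result"
rm -r "$workdir"
```

With these sample files the substring version keeps aaa and ddd, while the whole-line version also keeps bbbx.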

jschiwal · 01-06-2011, 10:49 PM

There is a command "comm" that can do this for two sorted lists.

comm -13 <(sort file1) <(sort file2)
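would return the lines that are unique to file2, so file1 would be the list of lines to remove and file2 the large file. The results are of course sorted, so the original line order is lost.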
If the files are already sorted you could use:
comm -13 file1 file2

I use this to compare two directory listings and produce a script to delete files from some devices.
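
As a rough sketch of how this could be applied to the original question without relying on bash process substitution, the two files can be sorted into scratch copies first. The names large.txt and small.txt and the sample words are assumptions made for this example. Note that the argument order here is the reverse of the command above, so the option becomes -23 (keep only lines unique to the first file).

```shell
# Hypothetical sample data; large.txt and small.txt are made-up names.
workdir=$(mktemp -d)
printf '%s\n' ddd aaa ccc bbb > "$workdir/large.txt"   # file to clean, unsorted
printf '%s\n' ccc bbb > "$workdir/small.txt"           # lines to remove

# comm needs sorted input, so sort each file into a scratch copy first.
sort "$workdir/large.txt" > "$workdir/large.sorted"
sort "$workdir/small.txt" > "$workdir/small.sorted"

# -2 drops lines unique to the second file, -3 drops lines common to both,
# leaving only the lines unique to the first (large) file.
remaining=$(comm -23 "$workdir/large.sorted" "$workdir/small.sorted")
printf '%s\n' "$remaining"
rm -r "$workdir"
```

For this sample the remaining lines are aaa and ddd, in sorted order.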

jrmorrill · 01-07-2011, 02:32 PM

Quote:

Can we get some examples of the text involved?

Sure, the lines have only one word and are all lower case characters, nothing special.

file_1.txt:

aaa
bbb
ccc
ddd

file_2.txt:

bbb
ccc

output.txt:

aaa
ddd

Thanks for the one-liners! I'll give them both a shot.

trickykid · 01-07-2011, 04:56 PM

Behold, the power of sed. There's no reason to create a third file with the output from grep, or to worry about whether the files are already sorted.

Well, first back up file_1.txt (the larger file, the one you want to delete lines from if they exist in file_2.txt), just in case.

Code:

while IFS= read -r PATTERN; do sed -i "/^$PATTERN\$/d" file_1.txt; done < file_2.txt
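
The loop above runs sed once per line of file_2.txt, which means roughly a thousand passes over the large file. A possible single-pass variant is to generate a sed script with one anchored delete command per word and apply it once. This is only a sketch: it assumes the words contain no slashes or regex metacharacters (true for plain lower-case words), and delete.sed and the scratch directory are names invented here.

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' aaa bbb bbbx ccc ddd > "$workdir/file_1.txt"
printf '%s\n' bbb ccc > "$workdir/file_2.txt"

# Turn each word into an anchored delete command, e.g. "bbb" -> "/^bbb$/d".
# The anchors make sure only whole-line matches are deleted.
sed 's/.*/\/^&$\/d/' "$workdir/file_2.txt" > "$workdir/delete.sed"

# Apply all delete commands in a single pass over the large file.
remaining=$(sed -f "$workdir/delete.sed" "$workdir/file_1.txt")
printf '%s\n' "$remaining"
rm -r "$workdir"
```

With GNU sed, adding -i to the second sed call would edit file_1.txt in place instead of printing the result.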

colucix · 01-07-2011, 06:04 PM

An awk solution:

Code:

awk '! _[$0]++ && FNR < NR' file_2.txt file_1.txt

but this assumes that no line is repeated in file_1.txt; otherwise the duplicates will be omitted. The order of the two conditions, and the order of the two file arguments, is mandatory.

Edit: to take repetitions into account, a simple modification is required:

Code:

awk '! _[$0]++ && FNR < NR {print; delete _[$0]}' file_2.txt file_1.txt

or

Code:

awk '! _[$0]++ && FNR < NR {print; _[$0]--}' file_2.txt file_1.txt

if the delete statement is not available. Deleting a single array element is standard in POSIX awk rather than a GNU awk extension, but some very old awk implementations lack it.
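
For comparison, a commonly used awk idiom for the same job is sketched below. It keeps repeated lines and the original order without any counter bookkeeping. The array name skip and the sample data are made up for this example, and the sketch assumes file_2.txt is not empty (with an empty first file, NR == FNR would also be true for the second file and nothing would be printed).

```shell
# Hypothetical sample data; "aaa" is repeated on purpose.
workdir=$(mktemp -d)
printf '%s\n' aaa bbb aaa ccc ddd > "$workdir/file_1.txt"
printf '%s\n' bbb ccc > "$workdir/file_2.txt"

# First file (NR == FNR): remember each line as an array key, print nothing.
# Second file: print only the lines that are not keys in the array.
remaining=$(awk 'NR == FNR { skip[$0]; next } !($0 in skip)' \
    "$workdir/file_2.txt" "$workdir/file_1.txt")
printf '%s\n' "$remaining"
rm -r "$workdir"
```

For this sample the output is aaa, aaa, ddd: both copies of the repeated line survive.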