LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Go Back   LinuxQuestions.org > Forums > Linux Forums > Linux - Newbie
Linux - Newbie This Linux forum is for members that are new to Linux.
Just starting out and have a question? If it is not in the man pages or the how-to's this is the place!

Old 02-12-2013, 10:35 AM   #1
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Rep: Reputation: Disabled
using awk to find item listed in one file in another file - runs very long


Hi,

This works fine on my little test files below; my problem is that when I apply it to much larger files it takes far too long to run. What am I missing?

prompt> cat fruits.txt
apple
cherry
grapes

prompt> cat mydata.txt
A fruit is apple
A carrot is a veggie
An orange is a fruit
Some grapes are good
potatoes are good too

prompt> /bin/gawk 'NR==FNR{a[$1];next} {for (item in a) if ($0 ~ item) print $0}' fruits.txt mydata.txt > result.txt

prompt> cat result.txt
A fruit is apple
Some grapes are good

Thanks for your help,
bop-a-nator
 
Old 02-12-2013, 10:59 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Maybe grep is optimized for this kind of task. You can give it a try:
Code:
grep -f fruits.txt mydata.txt
Thinking about awk now....
 
Old 02-12-2013, 11:08 AM   #3
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Original Poster
Rep: Reputation: Disabled
I have used that basic grep -f for pattern matching, but I have run into it "skipping" data for whatever reason on large files. I searched all over the web and could not find a reason for it, so I have been reluctant to trust it: I cannot figure out where the "breaking" point is at which it starts missing matches in the middle of a large file. So I am hoping awk might be more reliable.
 
Old 02-12-2013, 12:14 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,784

Rep: Reputation: 2083
For grep you should use the -F option, which says the patterns are Fixed Strings rather than regular expressions; this allows grep to use a much faster algorithm for matching:
Code:
grep -Ff fruits.txt mydata.txt
Not sure why you would get "skipping", maybe you have some strange characters in your files?

awk doesn't have a way to use a faster algorithm here, so it's going to be slow for large files.

EDIT: jpollard's suggestion works for awk as well; if you are searching for whole words, then you can get good performance with awk. I would still recommend grep -F because it will be fast either way.
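To sketch what "searching for whole words" looks like in awk (reusing the file names from post #1; treating each whitespace-separated field as a word is my assumption about the data):
Code:
```shell
# Recreate the sample files from post #1
printf '%s\n' apple cherry grapes > fruits.txt
printf '%s\n' 'A fruit is apple' 'A carrot is a veggie' \
    'An orange is a fruit' 'Some grapes are good' \
    'potatoes are good too' > mydata.txt

# Build a set of fruit names, then test each field of each line for
# membership in the set -- a hash lookup per word instead of running
# every pattern as a regex over the whole line.
awk 'NR==FNR { words[$1]; next }
     { for (i = 1; i <= NF; i++) if ($i in words) { print; next } }' \
    fruits.txt mydata.txt > result.txt
```
The inner loop is O(number of fields) per line regardless of how many search words there are, which is why this stays fast where the `$0 ~ item` loop does not.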

Last edited by ntubski; 02-13-2013 at 06:18 PM. Reason: awk can be fast for word replacement
 
1 member found this post helpful.
Old 02-12-2013, 01:50 PM   #5
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Original Poster
Rep: Reputation: Disabled
OK, thank you both for your prompt feedback. I will give grep -Ff fruits.txt mydata.txt a try with my large files.

Thanks again,
bop-a-nator
 
Old 02-12-2013, 07:37 PM   #6
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,912

Rep: Reputation: 1513
You might consider using perl - it has faster pattern matching, and (even better) it compiles the program before it starts to execute it. It also has the capability to optimize matching.

For instance, in general pattern matching you have to scan the entire string. Using perl you can optimize away most of the pattern matching by simply splitting the line up into an array of tokens.

If the words you are looking for are in a hash table (which is what the awk script's "in" uses), then the lookup can be quite fast (hashing is much faster than pattern matching). You eliminate the need for a pattern match at all - the key either exists or it doesn't. If it exists, you print the input line.
 
  


