LinuxQuestions.org


bop-a-nator 02-12-2013 10:35 AM

using awk to find item listed in one file in another file - runs very long
 
Hi,

This works fine on my little test files below. My problem is that when I apply it to much larger files, it takes far too long to run. What am I missing?

prompt> cat fruits.txt
apple
cherry
grapes

prompt> cat mydata.txt
A fruit is apple
A carrot is a veggie
An orange is a fruit
Some grapes are good
potatoes are good too

prompt> /bin/gawk 'NR==FNR{a[$1];next} {for (item in a) if ($0 ~ item) print $0}' fruits.txt mydata.txt > result.txt

prompt> cat result.txt
A fruit is apple
Some grapes are good

Thanks for your help,
bop-a-nator

colucix 02-12-2013 10:59 AM

Maybe grep is optimized for this kind of task. You can give it a try:
Code:

grep -f fruits.txt mydata.txt
Thinking about awk now...

bop-a-nator 02-12-2013 11:08 AM

I have used the basic grep -f for pattern matching, but I have run into it "skipping" data on large files for no reason I can identify. I searched all over the web and could not find an explanation, so I have been reluctant to trust it: I cannot figure out the "breaking point" at which it starts missing matches for items in the middle of a large file. So I am hoping awk might be more reliable.

ntubski 02-12-2013 12:14 PM

For grep you should use the -F option, which says the patterns are fixed strings and not regular expressions. This allows grep to use a much faster algorithm for matching:
Code:

grep -Ff fruits.txt mydata.txt
Not sure why you would get "skipping", maybe you have some strange characters in your files?
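
One kind of "strange character" that is easy to check for is a carriage return left behind by DOS-style (CRLF) line endings. The sketch below only illustrates that one possibility and is not a diagnosis of the files in question; the file names fruits_crlf.txt and fruits_clean.txt and the sample contents are made up for the demonstration.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical reproduction: a pattern file saved with CRLF line endings,
# searched against ordinary Unix-style data.
printf 'apple\r\ncherry\r\ngrapes\r\n' > fruits_crlf.txt
printf 'A fruit is apple\nA carrot is a veggie\nSome grapes are good\n' > mydata.txt

# Count the pattern lines that contain a carriage return.
cr_lines=$(grep -c "$(printf '\r')" fruits_crlf.txt || true)
echo "pattern lines containing CR: $cr_lines"

# Each pattern now ends in an invisible CR, so it no longer matches the data.
before=$(grep -cFf fruits_crlf.txt mydata.txt || true)

# Strip the carriage returns and search again.
tr -d '\r' < fruits_crlf.txt > fruits_clean.txt
after=$(grep -cFf fruits_clean.txt mydata.txt || true)

echo "matching lines before cleanup: $before, after cleanup: $after"
```

If the count of lines containing a carriage return is above zero for either file, cleaning the file with tr before running grep -Ff is worth trying.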

awk doesn't have a way to use a faster algorithm for this kind of pattern matching, so it's going to be slow for large files.

EDIT: jpollard's suggestion works for awk as well; if you are searching for whole words, then you can get good performance with awk. I would still recommend grep -F because it will be fast either way.
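
If whole-word matching is what is wanted, grep can do that too by adding the -w option to -F. The following is a small sketch with made-up sample data (the "pineapple juice" line is an assumption added to show the difference); it is not taken from the original files.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\ncherry\ngrapes\n' > fruits.txt
printf 'A fruit is apple\npineapple juice\nSome grapes are good\n' > mydata.txt

# Fixed-string substring match: "apple" also hits "pineapple".
substring_hits=$(grep -cFf fruits.txt mydata.txt || true)

# Fixed-string whole-word match: "pineapple" is no longer reported.
grep -Fwf fruits.txt mydata.txt > whole_words.txt || true
word_hits=$(wc -l < whole_words.txt)

echo "substring matches: $substring_hits, whole-word matches: $word_hits"
cat whole_words.txt
```

Whether substring or whole-word behaviour is correct depends on the data, so it is worth comparing both counts on a small sample first.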

bop-a-nator 02-12-2013 01:50 PM

OK, thank you both for your prompt feedback. I will give grep -Ff fruits.txt mydata.txt a try with my large files.

Thanks again,
bop-a-nator

jpollard 02-12-2013 07:37 PM

You might consider using perl: it has faster pattern matching and, even better, it compiles the program before it starts executing it. It also lets you optimize the matching itself.

For instance, general pattern matching has to scan the entire string. In perl you can avoid most of the pattern matching by simply splitting the line into an array of tokens.

If the words you are looking for are in a hash table (which is what the awk script uses for "in"), then the lookup can be quite fast, since hashing is much faster than pattern matching. You eliminate the need for a pattern match altogether: each token either exists as a key in the hash or it does not. If it exists, you print the input line.
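
The same token-and-hash idea can be sketched in awk rather than perl, reusing the sample files from the first post. This is one possible reading of the suggestion, not a tested replacement for the original script: it assumes the search terms are whole whitespace-separated words, and the array name wanted is made up for the example.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'apple\ncherry\ngrapes\n' > fruits.txt
printf 'A fruit is apple\nA carrot is a veggie\nAn orange is a fruit\nSome grapes are good\npotatoes are good too\n' > mydata.txt

awk '
    # First file: store each word as a key in the hash.
    NR == FNR { wanted[$1]; next }

    # Second file: look up each token in the hash instead of
    # running one regex match per stored word.
    {
        for (i = 1; i <= NF; i++)
            if ($i in wanted) { print; next }   # print the line once, then move on
    }
' fruits.txt mydata.txt > result.txt

cat result.txt
```

Each line now costs one hash lookup per token instead of one regex match per search term, which matters most when the list of terms is long. The trade-off is that matches become exact: "pineapple" or "apple," (with trailing punctuation) would no longer match "apple" as they did with the regex version.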


All times are GMT -5.