Using awk to find items listed in one file in another file - runs very long
Hi,
This works fine on my little test files below. My problem is that when I apply it to much larger files, it takes far too long to run. What am I missing?
prompt> cat fruits.txt
apple
cherry
grapes
prompt> cat mydata.txt
A fruit is apple
A carrot is a veggie
An orange is a fruit
Some grapes are good
potatoes are good too
prompt> /bin/gawk 'NR==FNR{a[$1];next} {for (item in a) if ($0 ~ item) print $0}' fruits.txt mydata.txt > result.txt
prompt> cat result.txt
A fruit is apple
Some grapes are good
Thanks for your help,
bop-a-nator
Maybe grep is optimized for this kind of task. You can give it a try:
Code:
grep -f fruits.txt mydata.txt
I have used the basic grep -f for pattern matching, but on large files I have run into it "skipping" data for no reason I can identify. I searched all over the web and could not find an explanation, so I am reluctant to trust it: I cannot figure out the "breaking point" at which it starts missing matches for items in the middle of a large file. So I am hoping awk might be more reliable.
For grep you should use the -F option, which says the patterns are fixed strings and not regular expressions. This allows grep to use a much faster algorithm for matching:
Code:
grep -Ff fruits.txt mydata.txt
awk doesn't have a way to use a faster algorithm, so it's going to be slow for large files.
EDIT: jpollard's suggestion works for awk as well; if you are searching for whole words, then you can get good performance with awk. I would still recommend grep -F because it will be fast either way.
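Besides speed, -F also changes what counts as a match, which is worth knowing before trusting it on real data. Here is a small sketch of the difference. The files patterns.txt and lines.txt are made up for illustration, and the last command just reruns the test files from the first post.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# Hypothetical pattern and data files (not from the thread).
printf '%s\n' 'v1.5' > patterns.txt
printf '%s\n' 'release v1.5 is out' 'release v105 is out' > lines.txt

# Without -F the dot is a regex wildcard, so both lines match.
grep -f patterns.txt lines.txt > regex_hits.txt

# With -F the pattern is a literal string, so only the first line matches.
grep -Ff patterns.txt lines.txt > fixed_hits.txt

# The test files from the first post, searched with fixed strings.
printf '%s\n' apple cherry grapes > fruits.txt
printf '%s\n' 'A fruit is apple' 'A carrot is a veggie' 'An orange is a fruit' \
    'Some grapes are good' 'potatoes are good too' > mydata.txt
grep -Ff fruits.txt mydata.txt > thread_hits.txt

cat regex_hits.txt fixed_hits.txt thread_hits.txt
```

If the items in the list contain characters such as . or *, the -F form is probably the one you want regardless of speed.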
OK, thank you both for your prompt feedback. I will give grep -Ff fruits.txt mydata.txt a try with my large files.
Thanks again, bop-a-nator
You might consider using perl: it has faster pattern matching, and (even better) it compiles the program before it starts to execute it. It also gives you a way to optimize the matching itself.
For instance, in general pattern matching you have to scan the entire string. Using perl you can optimize away most of the pattern matching by simply splitting the line up into an array of tokens. If the words you are looking for are in a hash table (what the awk script uses for "in") then the speed can be quite fast (hashing is much faster than pattern matching). You eliminate the need for a pattern match at all: the token is either a key in the hash or it is not. If it is, you print the input line.
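The same token-and-hash idea can be sketched in awk, since awk arrays are hash tables too. This is only a sketch under one big assumption: the items you are looking for appear in the data as whole whitespace-separated words. Unlike the original script it will not find "apple" inside "pineapple" or next to punctuation such as "apple,". The output file name result_words.txt is made up for the example.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)" || exit 1

# Recreate the small test files from the first post.
printf '%s\n' apple cherry grapes > fruits.txt
printf '%s\n' 'A fruit is apple' 'A carrot is a veggie' 'An orange is a fruit' \
    'Some grapes are good' 'potatoes are good too' > mydata.txt

# First file: store each word as a key in the hash.
# Second file: check every field with a hash lookup instead of a regex,
# and print the line once, at the first field that is a key.
awk 'NR==FNR { words[$1]; next }
     { for (i = 1; i <= NF; i++) if ($i in words) { print; next } }' \
    fruits.txt mydata.txt > result_words.txt

cat result_words.txt
```

The work per line now depends on the number of words in that line, not on the number of items in fruits.txt, which is where the original loop over every item spends its time.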