LinuxQuestions.org
Old 02-12-2013, 11:35 AM   #1
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Rep: Reputation: Disabled
using awk to find item listed in one file in another file - runs very long


Hi,

This works fine on my little test files below. My problem is that when I apply it to much larger files, it takes far too long to run. What am I missing?

prompt> cat fruits.txt
apple
cherry
grapes

prompt> cat mydata.txt
A fruit is apple
A carrot is a veggie
An orange is a fruit
Some grapes are good
potatoes are good too

prompt> /bin/gawk 'NR==FNR{a[$1];next} {for (item in a) if ($0 ~ item) print $0}' fruits.txt mydata.txt > result.txt

prompt> cat result.txt
A fruit is apple
Some grapes are good

Thanks for your help,
bop-a-nator
 
Old 02-12-2013, 11:59 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1978
Maybe grep is optimized for this kind of task. You can give it a try:
Code:
grep -f fruits.txt mydata.txt
Thinking about awk now....
 
Old 02-12-2013, 12:08 PM   #3
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Original Poster
Rep: Reputation: Disabled
I have used the basic grep -f for pattern matching, but on large files I have run into it "skipping" data for no reason I can find. I searched all over the web and could not find an explanation, so I have been reluctant to trust it: I cannot figure out where the "breaking" point is at which it starts missing matches for items in the middle of a large file. So I am hoping awk might be more reliable.
 
Old 02-12-2013, 01:14 PM   #4
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,251

Rep: Reputation: 1420
For grep you should use the -F option, which says the patterns are fixed strings and not regular expressions. This allows grep to use a much faster algorithm for matching:
Code:
grep -Ff fruits.txt mydata.txt
Not sure why you would get "skipping", maybe you have some strange characters in your files?
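One kind of "strange character" that can cause this is a carriage return left over from DOS/Windows line endings in the pattern file. Whether that is what happened here is only a guess. The sketch below uses made-up file names (fruits_dos.txt, mydata_demo.txt, fruits_clean.txt) to show how such a file makes grep miss every line, how to count the affected pattern lines, and how to strip the carriage returns with tr:

```shell
# Hypothetical demo files: a pattern file saved with CRLF line endings
# and a data file with plain LF line endings.
printf 'apple\r\ncherry\r\ngrapes\r\n' > fruits_dos.txt
printf 'A fruit is apple\nSome grapes are good\n' > mydata_demo.txt

# Every pattern ends in a hidden \r, so no data line matches.
# grep exits non-zero when nothing matches, hence the "|| true".
missed=$(grep -cFf fruits_dos.txt mydata_demo.txt || true)

# Count the pattern lines that contain a carriage return.
cr_lines=$(grep -c "$(printf '\r')" fruits_dos.txt)

# Strip the carriage returns and search again.
tr -d '\r' < fruits_dos.txt > fruits_clean.txt
found=$(grep -cFf fruits_clean.txt mydata_demo.txt)

echo "matches before: $missed, pattern lines with CR: $cr_lines, matches after: $found"
```

If the count of pattern lines with a carriage return is zero on the real files, this is not the cause and something else is going on.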

awk doesn't have a way to use a faster algorithm here, so it's going to be slow for large files.

EDIT: jpollard's suggestion works for awk as well; if you are searching for whole words, then you can get good performance with awk. I would still recommend grep -F because it will be fast either way.
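If whole words are what you are after, grep also has a -w option that can be combined with -F and -f. This is a small sketch on a made-up data file (words_demo.txt), assuming "whole word" means the match is not surrounded by letters, digits, or underscores:

```shell
# Pattern file from the first post, plus a hypothetical data file in which
# "pineapple" contains "apple" as a substring.
printf 'apple\ncherry\ngrapes\n' > fruits.txt
printf 'A fruit is apple\nI like pineapple juice\nSome grapes are good\n' > words_demo.txt

# -F: fixed strings, -f: read patterns from a file, -c: count matching lines.
substring_hits=$(grep -cFf fruits.txt words_demo.txt)

# Adding -w keeps only lines where a pattern matches as a whole word.
word_hits=$(grep -Fwf fruits.txt words_demo.txt)

echo "lines matched as substrings: $substring_hits"
echo "$word_hits"
```

Without -w the "pineapple" line is counted as a match; with -w it is dropped.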

Last edited by ntubski; 02-13-2013 at 07:18 PM. Reason: awk can be fast for word replacement
 
Old 02-12-2013, 02:50 PM   #5
bop-a-nator
LQ Newbie
 
Registered: Sep 2012
Location: North East USA
Distribution: at work: Red Hat Enterprise Linux Server release 5.8 (Tikanga); at home: what do you recommend?
Posts: 24

Original Poster
Rep: Reputation: Disabled
OK, thank you both for your prompt feedback. I will give grep -Ff fruits.txt mydata.txt a try with my large files.

Thanks again,
bop-a-nator
 
Old 02-12-2013, 08:37 PM   #6
jpollard
Senior Member
 
Registered: Dec 2012
Location: Washington DC area
Distribution: Fedora, CentOS, Slackware
Posts: 4,714

Rep: Reputation: 1280
You might consider using perl. It has faster pattern matching, and (even better) it will compile the program before it starts to execute it. It also gives you ways to optimize the matching.

For instance, in general pattern matching you have to scan the entire string. Using perl you can optimize away most of the pattern matching by simply splitting the line up into an array of tokens.

If the words you are looking for are in a hash table (what the awk script uses for "in") then the speed can be quite fast (hashing is much faster than pattern matching). You eliminate the need for a pattern match at all - the hash either exists or not. If it exists then you print the input line.
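The same idea can be sketched in awk rather than perl, using the sample files from the first post. This assumes the items are whole whitespace-separated words; unlike the original one-liner it will not find an item that appears only as part of a longer word or next to punctuation:

```shell
# Sample files copied from the first post.
printf 'apple\ncherry\ngrapes\n' > fruits.txt
printf 'A fruit is apple\nA carrot is a veggie\nAn orange is a fruit\nSome grapes are good\npotatoes are good too\n' > mydata.txt

# First file: store each item as a key of the array a.
# Second file: test each field with "in" (a hash lookup, no regex),
# print the line on the first hit and stop so it is printed only once.
result=$(awk 'NR==FNR { a[$1]; next }
              { for (i = 1; i <= NF; i++) if ($i in a) { print; break } }' fruits.txt mydata.txt)

echo "$result"
```

The work per line now depends on the number of words in that line instead of the number of items in fruits.txt, which is where the original loop over every item spends its time.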
 
  


All times are GMT -5. The time now is 12:57 AM.