[bash] 25,000 greps - better solution?
I am going through a multi-step process to produce output files, which involves 25,000 greps at one stage. While I do achieve the desired result, I am wondering whether the process could be improved (sped up and/or decluttered).
Current process code:
Code:
sort -u -o idsdone ids??????
I'm sure there is a bash way to solve this problem :)
But why don't you use Python or something for that? It's dirty, but I hope it will help:
Code:
#!/usr/bin/python
Which do you have more of:
1. unique IDs?
2. ids<yyyy><mm> files?
The answer to that should influence the solution you pursue. A typical programming trade-off is "more memory = more speed", and it applies here as well. My first thought was to create a lookup table. Build a hash table/dictionary in memory with the unique IDs serving as the indices/keys. Construct a list of all the ids<yyyy><mm> files on the system. For each unique ID, store all the ids<yyyy><mm> filenames that contain that ID in the appropriate table location. Then "diff" each unique ID's list against the list of all available ids<yyyy><mm> files. That approach may be easier in a different language (like Python) as opposed to a shell script. EDIT: which looks to be what goldenbarb is doing... but my Python skills are not so great.
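The lookup-table idea above can be sketched with awk's associative arrays. This is only an illustration: the file names and IDs below are invented, not the poster's real data.

```shell
# Invented toy data: two ids<yyyy><mm>-style files, one ID per line.
printf 'X1\nX2\n' > ids201001
printf 'X2\nX3\n' > ids201002

# One pass over all files: for each ID, append the name of every file
# it appears in, then dump the table (ID -> list of files).
awk '{seen[$1] = seen[$1] " " FILENAME}
     END {for (id in seen) print id ":" seen[id]}' ids?????? | sort > id_to_files
cat id_to_files
```

With real data the table could then be compared ("diffed") against the full file list, as the post suggests.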
The IDs in ids<yyyy><mm> are unique across all files: 280,000 IDs across 420 files.
I am sure Python would be much more suited to the task. However, I have loads of shell scripts dealing with these ID files already, including all the necessary command-line parameter testing. I wouldn't want to replicate the whole logic in Python even if it runs longer as a shell script.
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
Code:
grep -F -f idsmissing idsmore > idsmissing_dated
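To see why the single grep -F -f call beats 25,000 individual greps, here is a toy run. The file contents are made up; they only mimic the one-ID-per-line and ID-plus-date layouts described in the thread.

```shell
# Made-up sample data in the layouts the thread describes.
printf '12345\n67890\n' > idsmissing
printf '12345 2010 01\n99999 2010 02\n67890 2010 03\n' > idsmore

# -F: patterns are fixed strings (no regex engine involved)
# -f: read all patterns from idsmissing, then scan idsmore once
grep -F -f idsmissing idsmore > idsmissing_dated
cat idsmissing_dated
```

One scan of idsmore replaces one grep invocation per missing ID, which is where the speedup comes from.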
ntubski may have given you what you need, but in case not...
I interpreted the first statement (from your first post) to mean that each ids<yyyy><mm> file had 280,000 lines, with no mention of whether the individual IDs could repeat in the same file or other files. I interpreted the second statement to mean that there are 280,000 lines, cumulative, among all ids<yyyy><mm> files and that each ID is unique to both the ids<yyyy><mm> file it is found in and all other ids<yyyy><mm> files. In other words, a given ID appears in one and only one file. Is either correct?
If that is equivalent, it would appear to be the obvious (yes I know..) choice!:
Code:
sasha@reactor: time for r in $(cat 25000); do egrep "^$r" 25000-2; done
Code:
12345
Code:
^12345
Code:
912345
Anyhow, I tested using `xargs` above too, instead of a loop, and found that it chopped about 1:30 off the time; not a shabby improvement, although not of the same calibre as the grep -Ff ;)
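The 12345 / ^12345 / 912345 snippets above are about anchoring. A toy file (invented data) makes the point concrete:

```shell
# Invented sample: three lines whose first field contains "12345".
printf '12345 a\n912345 b\n123456 c\n' > sample

grep -c '12345' sample     # unanchored: matches all three lines
grep -c '^12345' sample    # anchored: still matches 12345 and 123456
grep -c '^12345 ' sample   # trailing space: only the exact ID matches
```

This is why the loop pattern "^$r " carries both the caret and the trailing space.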
I think this may be done with the 'comm' command, but I can't be sure, because I don't understand what needs to be done. Try to simplify it if you can, then I may be able to help. 'grep -f' for file input may also work as GrapefruiTgirl says.
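For reference, comm compares two sorted files line by line. A minimal sketch (invented file names and data) of finding IDs present in one list but not the other:

```shell
# comm requires both inputs to be sorted.
printf '1\n2\n3\n' > done_ids
printf '2\n3\n4\n' > all_ids

# -1 suppresses lines only in done_ids, -3 suppresses common lines,
# leaving the lines that appear only in all_ids.
comm -13 done_ids all_ids > only_new
cat only_new
```

Whether this fits depends on the IDs being sorted the same way in both files.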
If I correctly interpreted your requirements, here is an awk one-liner that should do the job in less than one second (awk is known to be very fast):
Code:
awk 'FILENAME != "idsmore"{_[$0] = ""} FILENAME == "idsmore"{if (! ( $1 in _ )) print $1 >> ( "ids" $2 $3 )}' ids?????? idsmore
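A toy run (invented file contents) shows what the one-liner does: the first pattern loads every existing ID into the array _, and the second prints each idsmore ID that is absent, appending it to the ids<yyyy><mm> file named by its date fields.

```shell
# Invented sample data: ids201001 holds known IDs, idsmore holds
# "ID yyyy mm" lines.
printf 'A1\nB2\n' > ids201001
printf 'A1 2010 01\nC3 2010 02\nB2 2010 01\n' > idsmore

awk 'FILENAME != "idsmore"{_[$0] = ""}
     FILENAME == "idsmore"{if (! ($1 in _)) print $1 >> ("ids" $2 $3)}' \
    ids?????? idsmore

cat ids201002    # only C3 was unknown, so it lands in ids201002
```

Everything happens in one awk process and two passes over the data, which is why it is so much faster than per-ID greps.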
Interesting stuff; a hundred different ways to skin a cat!
I like ntubski's ingenious lightning-fast grep. GrapefruiTgirl's xargs solution, once a space has been added to the search expression, produces the correct result but on my data is as slow as my loop (2m5s):
Code:
time cat idsmissing | xargs -I{} egrep "^{} " idsmore > idsmissing_dated
colucix' awk, which does the whole job in one go, runs in only 0.74s but produces a different result. I'll need to look at this.
I'm still working on another awk solution here (for the fun of it) but it is not producing the results I had expected (in fact, not producing any results :p)
Meanwhile, let us know if you become satisfied with one of the given solutions (like the grep -F -f -w) while I keep fiddling with this. :)
Am I understanding this correctly, that grep -F speeds up the operation because it does not try to interpret the patterns as regular expressions? I get the same result without -F but it crawls. fgrep is a fraction of a second faster than grep -F.
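For reference, that understanding is right: -F makes grep treat the pattern as a literal string instead of compiling it as a regular expression. A tiny made-up example:

```shell
# Made-up file: with a regex, "." matches any character; with -F it
# only matches a literal dot.
printf 'a.c\nabc\n' > dots

grep -c 'a.c' dots      # regex: matches both "a.c" and "abc"
grep -F -c 'a.c' dots   # fixed string: matches only "a.c"
```

fgrep is historically just another name for grep -F, so near-identical timings are expected.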
colucix' all-in-one awk is correct. I must have been working on an older data set. It takes 0.75s for the whole operation whereas my 4-command solution with ntubski's grep -F -w takes 3.4s.