[bash] 25,000 greps - better solution?
I am going through a multi-step process to produce output files, which involves 25,000 greps at one stage. While I do achieve the desired result, I am wondering whether the process could be improved (sped up and/or decluttered).
Current Process Code:
sort -u -o idsdone ids?????? |
I'm sure there is a bash way to solve this problem :)
But why don't you use Python or something for that? Though it's dirty, hope it will help: Code:
#!/usr/bin/python |
Which do you have more of:
1. unique IDs?
2. ids<yyyy><mm> files?
The answer to that should influence the solution you pursue. A typical programming trade-off is "more memory = more speed" and it applies here as well. My first thought was to create a lookup table. Build a hash table/dictionary in memory with the unique IDs serving as the indices/keys. Construct a list of all the ids<yyyy><mm> files on the system. For each unique ID, store all the ids<yyyy><mm> filenames that contain that ID in the appropriate table location. Then "diff" each unique ID's list against the list of all available ids<yyyy><mm> files. That approach may be easier in a different language (like Python) as opposed to a shell script. EDIT: which looks to be what goldenbarb is doing... but my Python skills are not so great. |
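A minimal sketch of that lookup-table idea, using awk from the shell rather than Python (an editorial illustration, not a post from the thread; it assumes each ids?????? file holds one id per line, and simply records which file(s) each id was seen in): Code:
# For every line of every ids?????? file, append the current file name
# to the list kept for that id, then dump the whole table at the end.
awk '{ files[$1] = files[$1] FILENAME " " }
     END { for (id in files) print id, files[id] }' ids??????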
The ids in ids<yyyy><mm> are unique across all files. 280,000 ids across 420 files.
I am sure Python would be much more suited to the task. However, I have loads of shell scripts dealing with these id files already, including all the necessary command-line parameter testing. I wouldn't want to replicate the whole logic in Python even if it runs longer as a shell script. |
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
Code:
grep -F -f idsmissing idsmore > idsmissing_dated |
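If the ids could also occur as substrings of longer ids in idsmore, the whole-word variant discussed later in the thread (grep -F -f -w) might be the safer form; a sketch, assuming idsmissing holds one bare id per line: Code:
# -F: treat each pattern as a fixed string (no regex interpretation)
# -w: match whole words only, so id 12345 does not hit 912345 or 123456
# -f: read the list of patterns from the file idsmissing
grep -F -w -f idsmissing idsmore > idsmissing_dated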
ntubski may have given you what you need, but in case not...
Quote:
Quote:
Quote:
I interpreted the first statement (from your first post) to mean that each ids<yyyy><mm> file had 280,000 lines--no mention of whether the individual ids could repeat in the same file or other files. I interpreted the second statement to mean that there are 280,000 lines, cumulative, among all ids<yyyy><mm> files and that each id is unique to the ids<yyyy><mm> file it is found in as well as to all other ids<yyyy><mm> files. In other words, a given ID appears in one and only one file. Is either correct? |
Quote:
if that is equivalent, it would appear to be the obvious (yes I know..) choice!: Code:
sasha@reactor: time for r in $(cat 25000); do egrep "^$r" 25000-2; done
Code:
12345
Code:
^12345
Code:
912345
Anyhow, I tested using `xargs` above too, instead of a loop, and found that it chopped about 1:30 off the time; not a shabby improvement, although not of the same calibre as the grep -Ff ;)
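A small demonstration of the pitfall those three snippets hint at (hypothetical file contents, not from the thread): an id pattern without an anchor and a trailing delimiter also hits longer ids, while anchoring plus a trailing space, or -w, pins it down: Code:
$ printf '12345 2001 04\n123456 2001 05\n912345 2001 06\n' > demo
$ grep '12345' demo            # unanchored: matches all three lines
$ grep '^12345' demo           # anchored, no trailing space: still matches 123456
$ grep '^12345 ' demo          # anchored with trailing space: only the exact id
$ grep -Fw 12345 demo          # fixed string + whole word: only the exact id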
I think this may be done with the 'comm' command, but I can't be sure, because I don't understand what needs to be done. Try to simplify it if you can, then I may be able to help. 'grep -f' for file input may also work as GrapefruiTgirl says.
|
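For the record, a sketch of how comm could be used for a "which ids are missing" kind of step; comm compares whole lines and needs sorted input, so any date columns would have to be stripped first (list_a and list_b are placeholder names, not files from the thread): Code:
# ids present in list_a but absent from list_b (one id per line in each)
sort -u list_a > a.sorted
sort -u list_b > b.sorted
comm -23 a.sorted b.sorted > only_in_a   # -23 drops lines unique to b and lines common to both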
If I correctly interpreted your requirements, here is an awk one-liner that should do the job in less than one second (awk is known to be very fast):
Code:
awk 'FILENAME != "idsmore"{_[$0] = ""} FILENAME == "idsmore"{if (! ( $1 in _ )) print $1 >> ( "ids" $2 $3 )}' ids?????? idsmore |
Interesting stuff; a hundred different ways to skin a cat!
Quote:
Quote:
I like ntubski's ingenious lightning-fast grep: Quote:
I give you example files:
GrapefruiTgirl's xargs solution, once a space has been added to the search expression, produces the correct result but on my data is as slow as my loop (2m5s): Code:
time cat idsmissing | xargs -I{} egrep "^{} " idsmore >idsmissing_dated
colucix' awk, which does the whole job in a oner, runs in only 0.74s but produces a different result. I'll need to look at this. |
I'm still working on another awk solution here (for the fun of it) but it is not producing the results I had expected (in fact, not producing any results :p)
Meanwhile, let us know if you become satisfied with one of the given solutions (like the grep -F -f -w) while I keep fiddling with this. :) |
Am I understanding this correctly, that grep -F speeds up the operation because it does not try to interpret the patterns as regular expressions? I get the same result without -F but it crawls. fgrep is a fraction of a second faster than grep -F.
colucix' all-in-one awk is correct. I must have been working on an older data set. It takes 0.75s for the whole operation whereas my 4-command solution with ntubski's grep -F -w takes 3.4s. |
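To make the speed gap visible, the two approaches can simply be timed side by side (a sketch using the thread's file names; numbers will of course differ per machine): Code:
# one grep process, one pass over idsmore, all patterns as fixed strings
time grep -F -w -f idsmissing idsmore > idsmissing_dated

# versus one grep process per id: thousands of process start-ups and
# thousands of full passes over idsmore
time for r in $(cat idsmissing); do grep "^$r " idsmore; done > idsmissing_dated2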
colucix, sorry about the confusion, your result is spot-on - see #13.
I am not interested in the interim files. They were just a kludge for my 4-step solution. The only thing I need is appending the new ids to 2 files (as per my 1st post). I need to sit down and work through your awk. I'd like to understand how it does it so I know better next time. |
Quote:
Quote:
If something is still not clear, feel free to ask. FYI, my one and only awk reference is the GNU official guide: http://www.gnu.org/software/gawk/manual/. |
Quote:
Code:
FILENAME != "idsmore" { |
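The expanded listing above is cut off; spelled out from the one-liner in the earlier post, with comments added, it would look roughly like this (assuming idsmore holds id, year and month as three whitespace-separated fields, which is what the >> ("ids" $2 $3) redirection implies): Code:
# While reading the ids?????? files (any input file whose name is not
# "idsmore"), remember every id seen by storing it as a key of array _.
FILENAME != "idsmore" {
    _[$0] = ""
}

# While reading idsmore (id in $1, year in $2, month in $3), any id that
# was NOT seen above is new: append it to the matching ids<yyyy><mm> file.
FILENAME == "idsmore" {
    if (! ($1 in _))
        print $1 >> ("ids" $2 $3)
}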
I was asking about the test files. I cannot write a solution without test files. You said you somehow made 5 files containing 56000 different ids each (total 280000) and a file "idsmore" containing 300000 different ids - how? I don't really need that many. Or, if the problem is solved, just forget about it.
|
colucix - got it. Ingenious!
|
@H_TeXMeX_H
Oops, sorry... I totally misunderstood your post. I generated 280000 numbers between 1 and 1000000 using the following awk code: Code:
BEGIN { |
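The BEGIN block above is cut off; a sketch of one way such a generator might look (any duplicates among the 280000 random draws would still need weeding out afterwards, e.g. with sort -u): Code:
BEGIN {
    srand()                                # seed the random number generator
    for (i = 1; i <= 280000; i++)
        print int(rand() * 1000000) + 1    # random integer in 1..1000000
}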
ok, thanks.
|
ntubski, thanks for the info on grep. Changed my scripts accordingly.
|