Old 11-08-2010, 02:32 PM   #1
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Rep: Reputation: 17
[bash] 25,000 greps - better solution?


I am going through a multi-step process to produce output files, which involves 25,000 greps at one stage. While I do achieve the desired result, I am wondering whether the process could be improved (sped up and/or decluttered).

input 1
set of dated files called ids<yyyy><mm> containing numeric ids, one per line, 280,000 lines in total:
Code:
123456
999996
51234
...
input 2
one file called idsmore containing one numeric id plus a date per line, 300,000 lines in total:
Code:
123456 2009 12
21001 2008 04
21003 2008 02
...
output
  • identify ids in idsmore that do not exist in any ids<yyyy><mm>
  • append each of those ids to the ids<yyyy><mm> file named by its date in idsmore (e.g. the idsmore line "21001 2008 04" means 21001, if missing, gets appended to ids200804)

Current Process

Code:
# merge all ids<yyyy><mm> files into one sorted, de-duplicated list
sort -u -o idsdone ids??????

# keep only the ids that appear in idsmore but not in idsdone
diff --new-line-format='%L' --old-line-format='' --unchanged-line-format='' \
    idsdone <(cut -d ' ' -f 1 idsmore | sort -u) >idsmissing

# look up the dated idsmore line for each missing id (25,000 grep calls)
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated

# append each missing id to its dated file and to a matching .log file
while read r y m; do
    echo $r >>ids$y$m
    echo $r >>ids$y$m.log
done < idsmissing_dated
Everything is lightning-fast except the for loop with its 25,000 grep invocations, which takes about 2 minutes.
 
Old 11-08-2010, 04:55 PM   #2
goldenbarb
Member
 
Registered: Aug 2010
Distribution: Fedora, Centos, Debian
Posts: 49

Rep: Reputation: 7
I'm sure there is a bash way to solve this problem, but why not use Python or something for that?

Though it's dirty, hope it will help:

Code:
#!/usr/bin/python

l1 = []   # plain list of ids (e.g. the idsmissing list)
l2 = {}   # id -> "year month" lookup table

# /tmp/1: one id per line
for i in open("/tmp/1").readlines():
    l1.append(i.rstrip())

# /tmp/2: "id year month" per line (e.g. idsmore)
for k in open("/tmp/2").readlines():
    line = k.rstrip().split(' ')
    f = line[0]
    l2[f] = line[1] + " " + line[2]

# print the dated line for every id present in both files
for i in l1:
    if i in l2:
        print i, l2[i]
 
Old 11-08-2010, 05:19 PM   #3
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
Which do you have more of:

1. unique IDs?
2. ids<yyyy><mm> files?

The answer to that should influence the solution you pursue.

A typical programming trade-off is "more memory = more speed", and it applies here as well. My first thought was to create a lookup table: build a hash table/dictionary in memory with the unique IDs serving as the indices/keys. Construct a list of all the ids<yyyy><mm> files on the system. For each unique ID, store all the ids<yyyy><mm> filenames that contain that ID in the appropriate table location. Then "diff" each unique ID's filename list against the list of all available ids<yyyy><mm> files.

That approach may be easier in a different language (like Python) as opposed to a shell script.
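For the record, the lookup table can be built in plain bash too. A minimal sketch, assuming bash 4+ for associative arrays and the file names from the original post (a bash loop over 280,000 lines will not be quick; this only illustrates the data structure):

Code:
#!/bin/bash

# lookup table: key = id, value = 1 if the id appears
# in any ids<yyyy><mm> file
declare -A seen
while read -r id; do
    seen[$id]=1
done < <(cat ids??????)

# ids from idsmore that are absent from the table get appended
# to the file named by their date
while read -r id y m; do
    [[ -z ${seen[$id]} ]] && echo "$id" >> "ids$y$m"
done < idsmore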

EDIT: which looks to be what goldenbarb is doing... but my Python skills are not so great.

Last edited by Dark_Helmet; 11-08-2010 at 05:23 PM.
 
Old 11-08-2010, 05:28 PM   #4
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
The ids in ids<yyyy><mm> are unique across all files. 280,000 ids across 420 files.


I am sure Python would be much more suited to the task. However, I have loads of shell scripts dealing with these id files already, including all the necessary command-line parameter testing. I wouldn't want to replicate the whole logic in Python, even if it runs longer as a shell script.
 
Old 11-08-2010, 05:35 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
I believe this is equivalent to
Code:
grep -F -f idsmissing idsmore > idsmissing_dated
 
1 member found this post helpful.
Old 11-08-2010, 06:03 PM   #6
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
ntubski may have given you what you need, but in case not...

Quote:
Originally Posted by hashbang#!
I am sure Python would be much more suited to the task. However, I have loads of shell scripts dealing with these id files already, including all the necessary command-line parameter testing. I wouldn't want to replicate the whole logic in Python, even if it runs longer as a shell script.
I understand the reluctance. I just want to point out that you don't have to convert everything to Python. Your shell script could invoke a Python script and simply process the Python script's output--just like any other command. If a problem comes along that's more suited to the abilities of a given language, you can use a sub-script written in the more-apt language to accomplish the task. Just food for thought.
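A trivial sketch of that pattern (lookup.py is a hypothetical helper script, not something from this thread; the shell consumes its output like that of any other command):

Code:
#!/bin/bash

# the (hypothetical) Python helper prints "id year month" lines;
# the shell post-processes them as usual
./lookup.py idsmissing idsmore | while read -r id y m; do
    echo "$id" >> "ids$y$m"
done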

Quote:
set of dated files called ids<yyyy><mm> containing numeric ids, one per line, 280,000 lines in total
Quote:
The ids in ids<yyyy><mm> are unique across all files. 280,000 ids across 420 files.
Ok, I know I will sound like a moron here, but just to clarify...

I interpreted the first statement (from your first post) to mean that each ids<yyyy><mm> file had 280,000 lines--no mention of whether the individual ids could repeat in the same file or other files.

I interpreted the second statement to mean that there are 280,000 lines, cumulative, among all ids<yyyy><mm> files and that each id is unique to both the ids<yyyy><mm> it is found in and all other ids<yyyy><mm> files. In other words, a given ID appears in one and only one file.

Is either correct?

Last edited by Dark_Helmet; 11-08-2010 at 06:04 PM.
 
Old 11-08-2010, 06:13 PM   #7
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
Quote:
Originally Posted by ntubski View Post
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
I believe this is equivalent to
Code:
grep -F -f idsmissing idsmore > idsmissing_dated
ntubski,

if that is equivalent, it would appear to be the obvious (yes I know..) choice!:
Code:
sasha@reactor:  time for r in $(cat 25000); do egrep "^$r" 25000-2; done
WORD HERE  hello

real    2m17.704s
user    0m16.040s
sys     1m47.828s
sasha@reactor:  time cat 25000 | xargs -I{} egrep "^{}" 25000-2      
WORD HERE  hello

real    0m48.177s
user    0m14.386s
sys     0m26.204s
sasha@reactor: time grep -F -f 25000 25000-2
WORD HERE  hello

real    0m0.007s
user    0m0.003s
sys     0m0.004s
however, let's say on a particular iteration, $r is:
Code:
12345
so OP's regex would equate to searching file2 for:
Code:
^12345
but using the `grep -F -f` method, we get a match if file2 contains a line like:
Code:
912345
so it fails to match explicitly at "beginning of line", does it not? Maybe I'm not seeing something.

Anyhow, I tested using `xargs` above too, instead of a loop, and found that it chopped about 1:30 off the time; not a shabby improvement, although not of the same calibre as the grep -Ff.
 
1 member found this post helpful.
Old 11-09-2010, 03:33 AM   #8
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
I think this may be done with the 'comm' command, but I can't be sure, because I don't understand what needs to be done. Try to simplify it if you can, then I may be able to help. 'grep -f' for file input may also work as GrapefruiTgirl says.
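For what it's worth, a sketch of the comm variant, assuming idsdone is the sorted, de-duplicated union built in the first post (comm requires both inputs to be sorted; -13 suppresses the lines unique to the first input and the common lines, leaving only the lines unique to the second):

Code:
# ids present in idsmore but absent from every ids<yyyy><mm> file
comm -13 idsdone <(cut -d ' ' -f 1 idsmore | sort -u) > idsmissing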
 
Old 11-09-2010, 03:52 AM   #9
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by GrapefruiTgirl View Post
so it fails to match explicitly at "beginning of line", does it not? Maybe I'm not seeing something.
Yes, that's correct. The -w option should be useful in this case.
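That is, something along these lines (-w restricts each fixed-string pattern to whole-word matches, so 1111 no longer matches inside 11111):

Code:
grep -F -w -f idsmissing idsmore > idsmissing_dated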
 
2 members found this post helpful.
Old 11-09-2010, 04:24 AM   #10
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
If I correctly interpreted your requirements, here is an awk one-liner that should do the job in less than one second (awk is known to be very fast):
Code:
awk 'FILENAME != "idsmore"{_[$0] = ""} FILENAME == "idsmore"{if (! ( $1 in _ )) print $1 >> ( "ids" $2 $3 )}' ids?????? idsmore
Please note that the file name "idsmore" must be the last argument on the command line.
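The same one-liner spread out with comments, purely for readability (the logic is unchanged):

Code:
awk '
    # pass 1 (all ids<yyyy><mm> files): remember every existing id
    FILENAME != "idsmore" { _[$0] = "" }

    # pass 2 (idsmore): $1 = id, $2 = year, $3 = month;
    # unknown ids are appended to the matching dated file
    FILENAME == "idsmore" {
        if (! ($1 in _)) print $1 >> ("ids" $2 $3)
    }
' ids?????? idsmore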
 
1 member found this post helpful.
Old 11-09-2010, 05:18 AM   #11
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Interesting stuff; a hundred different ways to skin a cat!


Quote:
Originally Posted by Dark_Helmet View Post
I just want to point out that you don't have to convert everything to python. Your shell script could invoke a python script [...]
Point taken. Actually, I have written a few Python scripts that modify HTML files and that I call from a bash script. I haven't looked into this yet, but my gut feeling in this particular scenario was that 1,000 invocations of Python are slooowww; or maybe it's just the HTML parsing that is slow. Another topic for another thread.


Quote:
Originally Posted by Dark_Helmet View Post
I interpreted the second statement to mean that there are 280,000 lines, cumulative, among all ids<yyyy><mm> files and that each id is unique to both the ids<yyyy><mm> it is found in and all other ids<yyyy><mm> files. in other words, a given ID appears in one and only one file.
Correct.

I like ntubski's ingenious lightning-fast grep:
Quote:
Originally Posted by ntubski View Post
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
I believe this is equivalent to
Code:
grep -F -f idsmissing idsmore > idsmissing_dated
Unfortunately, as GrapefruiTgirl noticed, it finds too many ids because it picks up ids that are substrings of other ids (in the example below, the pattern 1111 wrongly matches the 11111 and 21111 lines).
I give you example files:
idsmissing
Code:
1111
2222
3333
idsmore
Code:
0000 2008 04
1111 2008 07
11111 2006 12
21111 2005 11
1212 2008 07
2222 2008 01
2323 2008 12
3333 2008 06
GrapefruiTgirl's xargs solution, once a space has been added to the search expression, produces the correct result, but on my data it is as slow as my loop (2m5s):

Code:
time cat idsmissing | xargs -I{} egrep "^{} " idsmore >idsmissing_dated
colucix's -w addition fixes ntubski's grep, and this runs in an amazing 0.15s!!!!

colucix's awk, which does the whole job in a oner, runs in only 0.74s but produces a different result. I'll need to look at this.
 
Old 11-09-2010, 05:49 AM   #12
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
I'm still working on another awk solution here (for the fun of it), but it is not producing the results I had expected (in fact, not producing any results).

Meanwhile, let us know if you become satisfied with one of the given solutions (like the grep -F -f -w) while I keep fiddling with this.
 
Old 11-09-2010, 06:16 AM   #13
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Am I understanding this correctly, that grep -F speeds up the operation because it does not try to interpret the patterns as regular expressions? I get the same result without -F, but it crawls. fgrep is a fraction of a second faster than grep -F.
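For illustration, the semantic side of -F on a hypothetical throwaway pattern (with -F the dot is a literal character, not a regex metacharacter; the speed gain presumably comes from grep being able to use a dedicated fixed-string search instead of a regex engine):

Code:
$ echo 1a23 | grep '1.23'       # regex: the dot matches any character
1a23
$ echo 1a23 | grep -F '1.23'    # fixed string: no match
$ echo 1.23 | grep -F '1.23'    # fixed string: literal match
1.23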

colucix's all-in-one awk is correct. I must have been working on an older data set. It takes 0.75s for the whole operation, whereas my four-command solution with ntubski's grep -F -w takes 3.4s.

Last edited by hashbang#!; 11-09-2010 at 06:32 AM.
 
Old 11-09-2010, 06:31 AM   #14
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by hashbang#! View Post
I am still looking at colucix's awk.
He he... actually I tested it on 5 files containing 56,000 different ids each (280,000 in total) and a file "idsmore" containing 300,000 different ids. Then I applied both your script and the awk code, and I obtained exactly the same result (except for the .log and the idsmissing files). If this is the only difference, a little add-on to the code should do the trick.
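For example, a guess at the kind of add-on meant here (it simply mirrors each appended id into the matching .log file, as the original script does; the idsmissing file would still need its own step):

Code:
awk 'FILENAME != "idsmore" {_[$0] = ""}
     FILENAME == "idsmore" {
         if (! ($1 in _)) {
             print $1 >> ("ids" $2 $3)
             print $1 >> ("ids" $2 $3 ".log")
         }
     }' ids?????? idsmore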
 
Old 11-09-2010, 06:36 AM   #15
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Quote:
Originally Posted by colucix View Post
He he... actually I tested it on 5 files containing 56,000 different ids each (280,000 in total) and a file "idsmore" containing 300,000 different ids. Then I applied both your script and the awk code, and I obtained exactly the same result (except for the .log and the idsmissing files). If this is the only difference, a little add-on to the code should do the trick.
Can you post the code to generate those? I can't picture in my head what needs to be done.
 
  

