[bash] 25,000 greps - better solution?
I am going through a multi-step process to produce output files, which involves 25,000 greps at one stage. While I do achieve the desired result, I am wondering whether the process could be improved (sped up and/or decluttered).
Current Process Code:
sort -u -o idsdone ids?????? |
I'm sure there is a bash way to solve this problem :)
But why don't you use Python or something for that? Though it's dirty, hope it will help: Code:
#!/usr/bin/python |
Which do you have more of:
1. unique IDs?
2. ids<yyyy><mm> files?
The answer to that should influence the solution you pursue. A typical programming trade-off is "more memory = more speed" and it applies here as well. My first thought was to create a lookup table. Build a hash table/dictionary in memory with the unique IDs serving as the indices/keys. Construct a list of all the ids<yyyy><mm> files on the system. For each unique ID, store all the ids<yyyy><mm> filenames that contain that ID in the appropriate table location. Then "diff" each unique ID's list against the list of all available ids<yyyy><mm> files. That approach may be easier in a different language (like Python) as opposed to a shell script. EDIT: which looks to be what goldenbarb is doing... but my Python skills are not so great. |
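A minimal sketch of that lookup-table idea, using awk from the shell rather than Python (an editorial illustration, not a post from the thread; it assumes each ids?????? file holds one id per line, and simply records which file(s) each id was seen in): Code:
# For every line of every ids?????? file, append the current file name
# to the list kept for that id, then dump the whole table at the end.
awk '{ files[$1] = files[$1] FILENAME " " }
     END { for (id in files) print id, files[id] }' ids??????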
The ids in ids<yyyy><mm> are unique across all files. 280,000 ids across 420 files.
I am sure Python would be much more suited to the task. However, I have loads of shell scripts dealing with these id files already, including all the necessary command-line parameter testing. I wouldn't want to replicate the whole logic in Python even if it runs longer as a shell script. |
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
Code:
grep -F -f idsmissing idsmore > idsmissing_dated |
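If the ids could also occur as substrings of longer ids in idsmore, the whole-word variant discussed later in the thread (grep -F -f -w) might be the safer form; a sketch, assuming idsmissing holds one bare id per line: Code:
# -F: treat each pattern as a fixed string (no regex interpretation)
# -w: match whole words only, so id 12345 does not hit 912345 or 123456
# -f: read the list of patterns from the file idsmissing
grep -F -w -f idsmissing idsmore > idsmissing_dated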
ntubski may have given you what you need, but in case not...
Quote:
Quote:
Quote:
I interpreted the first statement (from your first post) to mean that each ids<yyyy><mm> file had 280,000 lines--no mention of whether the individual ids could repeat in the same file or other files. I interpreted the second statement to mean that there are 280,000 lines, cumulative, among all ids<yyyy><mm> files and that each id is unique to the ids<yyyy><mm> file it is found in as well as to all other ids<yyyy><mm> files. In other words, a given ID appears in one and only one file. Is either correct? |
Quote:
if that is equivalent, it would appear to be the obvious (yes I know..) choice!: Code:
sasha@reactor: time for r in $(cat 25000); do egrep "^$r" 25000-2; done
Code:
12345
Code:
^12345
Code:
912345
Anyhow, I tested using `xargs` above too, instead of a loop, and found that it chopped about 1:30 off the time; not a shabby improvement, although not of the same calibre as the grep -Ff ;)
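A small demonstration of the pitfall those three snippets hint at (hypothetical file contents, not from the thread): an id pattern without an anchor and a trailing delimiter also hits longer ids, while anchoring plus a trailing space, or -w, pins it down: Code:
$ printf '12345 2001 04\n123456 2001 05\n912345 2001 06\n' > demo
$ grep '12345' demo            # unanchored: matches all three lines
$ grep '^12345' demo           # anchored, no trailing space: still matches 123456
$ grep '^12345 ' demo          # anchored with trailing space: only the exact id
$ grep -Fw 12345 demo          # fixed string + whole word: only the exact id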
I think this may be done with the 'comm' command, but I can't be sure, because I don't understand what needs to be done. Try to simplify it if you can, then I may be able to help. 'grep -f' for file input may also work as GrapefruiTgirl says.
|
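For the record, a sketch of how comm could be used for a "which ids are missing" kind of step; comm compares whole lines and needs sorted input, so any date columns would have to be stripped first (list_a and list_b are placeholder names, not files from the thread): Code:
# ids present in list_a but absent from list_b (one id per line in each)
sort -u list_a > a.sorted
sort -u list_b > b.sorted
comm -23 a.sorted b.sorted > only_in_a   # -23 drops lines unique to b and lines common to both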
If I correctly interpreted your requirements, here is an awk one-liner that should do the job in less than one second (awk is known to be very fast):
Code:
awk 'FILENAME != "idsmore"{_[$0] = ""} FILENAME == "idsmore"{if (! ( $1 in _ )) print $1 >> ( "ids" $2 $3 )}' ids?????? idsmore |
Interesting stuff; a hundred different ways to skin a cat!
Quote:
Quote:
I like ntubski's ingenious lightning-fast grep: Quote:
I give you example files:
GrapefruiTgirl's xargs solution, once a space has been added to the search expression, produces the correct result but on my data is as slow as my loop (2m5s): Code:
time cat idsmissing | xargs -I{} egrep "^{} " idsmore >idsmissing_dated
colucix' awk, which does the whole job in a oner, runs in only 0.74s but produces a different result. I'll need to look at this. |
I'm still working on another awk solution here (for the fun of it) but it is not producing the results I had expected (in fact, not producing any results :p)
Meanwhile, let us know if you become satisfied with one of the given solutions (like the grep -F -f -w) while I keep fiddling with this. :) |
Am I understanding this correctly, that grep -F speeds up the operation because it does not try to interpret the patterns as regular expressions? I get the same result without -F but it crawls. fgrep is a fraction of a second faster than grep -F.
colucix' all-in-one awk is correct. I must have been working on an older data set. It takes 0.75s for the whole operation whereas my 4-command solution with ntubski's grep -F -w takes 3.4s. |
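To make the speed gap visible, the two approaches can simply be timed side by side (a sketch using the thread's file names; numbers will of course differ per machine): Code:
# one grep process, one pass over idsmore, all patterns as fixed strings
time grep -F -w -f idsmissing idsmore > idsmissing_dated

# versus one grep process per id: thousands of process start-ups and
# thousands of full passes over idsmore
time for r in $(cat idsmissing); do grep "^$r " idsmore; done > idsmissing_dated2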
colucix, sorry about the confusion, your result is spot-on - see #13.
I am not interested in the interim files. They were just a kludge for my 4-step solution. The only thing I need is appending the new ids to 2 files (as per my 1st post). I need to sit down and work through your awk. I'd like to understand how it does it so I know better next time. |
Quote:
Quote:
If something is still not clear, feel free to ask. FYI, my one and only awk reference is the GNU official guide: http://www.gnu.org/software/gawk/manual/. |
Quote:
Code:
FILENAME != "idsmore" { |
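The expanded listing above is cut off; spelled out from the one-liner in the earlier post, with comments added, it would look roughly like this (assuming idsmore holds id, year and month as three whitespace-separated fields, which is what the >> ("ids" $2 $3) redirection implies): Code:
# While reading the ids?????? files (any input file whose name is not
# "idsmore"), remember every id seen by storing it as a key of array _.
FILENAME != "idsmore" {
    _[$0] = ""
}

# While reading idsmore (id in $1, year in $2, month in $3), any id that
# was NOT seen above is new: append it to the matching ids<yyyy><mm> file.
FILENAME == "idsmore" {
    if (! ($1 in _))
        print $1 >> ("ids" $2 $3)
}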
I was asking about the test files. I cannot write a solution without test files. You said you somehow made 5 files containing 56000 different ids each (total 280000) and a file "idsmore" containing 300000 different ids - how? I don't really need that many. Or, if the problem is solved, just forget about it.
|
colucix - got it. Ingenious!
|
@H_TeXMeX_H
Oops, sorry... I totally misunderstood your post. I generated 280000 numbers between 1 and 1000000 using the following awk code: Code:
BEGIN { |
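The BEGIN block above is cut off; a sketch of one way such a generator might look (any duplicates among the 280000 random draws would still need weeding out afterwards, e.g. with sort -u): Code:
BEGIN {
    srand()                                # seed the random number generator
    for (i = 1; i <= 280000; i++)
        print int(rand() * 1000000) + 1    # random integer in 1..1000000
}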
ok, thanks.
|
ntubski, thanks for the info on grep. Changed my scripts accordingly.
|