Old 11-08-2010, 02:32 PM   #1
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Rep: Reputation: 17
[bash] 25,000 greps - better solution?


I am going through a multi-step process to produce output files, which involves 25,000 greps at one stage. While I do achieve the desired result, I am wondering whether the process could be improved (sped up and/or decluttered).

input 1
set of dated files called ids<yyyy><mm> containing numeric ids, one per line, 280,000 lines in total:
Code:
123456
999996
51234
...
input 2
one file called idsmore containing one numeric id plus a date per line, 300,000 lines in total:
Code:
123456 2009 12
21001 2008 04
21003 2008 02
...
output
  • identify ids in idsmore that do not exist in any ids<yyyy><mm>
  • append each of those ids to the ids<yyyy><mm> file named by its date in idsmore (e.g. the idsmore line "21001 2008 04" means 21001, if missing, gets appended to ids200804)

Current Process

Code:
# merge all ids<yyyy><mm> files into one sorted, de-duplicated list
sort -u -o idsdone ids??????

# keep only the ids that appear in idsmore but not in idsdone
diff --new-line-format='%L' --old-line-format='' --unchanged-line-format='' \
    idsdone <(cut -d ' ' -f 1 idsmore | sort -u) >idsmissing

# look up the dated idsmore line for each missing id (25,000 grep calls)
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated

# append each missing id to its dated file and to a matching .log file
while read r y m; do
    echo $r >>ids$y$m
    echo $r >>ids$y$m.log
done < idsmissing_dated
Everything is lightning-fast except the for loop with its 25,000 grep invocations, which takes about 2 minutes.
 
Old 11-08-2010, 04:55 PM   #2
goldenbarb
Member
 
Registered: Aug 2010
Distribution: Fedora, Centos, Debian
Posts: 49

Rep: Reputation: 7
I'm sure there is a bash way to solve this problem, but why not use Python or something for that?

Though it's dirty, hope it will help:

Code:
#!/usr/bin/python

l1 = []   # plain list of ids (e.g. the idsmissing list)
l2 = {}   # id -> "year month" lookup table

# /tmp/1: one id per line
for i in open("/tmp/1").readlines():
    l1.append(i.rstrip())

# /tmp/2: "id year month" per line (e.g. idsmore)
for k in open("/tmp/2").readlines():
    line = k.rstrip().split(' ')
    f = line[0]
    l2[f] = line[1] + " " + line[2]

# print the dated line for every id present in both files
for i in l1:
    if i in l2:
        print i, l2[i]
 
Old 11-08-2010, 05:19 PM   #3
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
Which do you have more of:

1. unique IDs?
2. ids<yyyy><mm> files?

The answer to that should influence the solution you pursue.

A typical programming trade-off is "more memory = more speed", and it applies here as well. My first thought was to create a lookup table: build a hash table/dictionary in memory with the unique IDs serving as the indices/keys. Construct a list of all the ids<yyyy><mm> files on the system. For each unique ID, store all the ids<yyyy><mm> filenames that contain that ID in the appropriate table location. Then "diff" each unique ID's filename list against the list of all available ids<yyyy><mm> files.

That approach may be easier in a different language (like Python) as opposed to a shell script.
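For the record, the lookup table can be built in plain bash too. A minimal sketch, assuming bash 4+ for associative arrays and the file names from the original post (a bash loop over 280,000 lines will not be quick; this only illustrates the data structure):

Code:
#!/bin/bash

# lookup table: key = id, value = 1 if the id appears
# in any ids<yyyy><mm> file
declare -A seen
while read -r id; do
    seen[$id]=1
done < <(cat ids??????)

# ids from idsmore that are absent from the table get appended
# to the file named by their date
while read -r id y m; do
    [[ -z ${seen[$id]} ]] && echo "$id" >> "ids$y$m"
done < idsmore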

EDIT: which looks to be what goldenbarb is doing... but my Python skills are not so great.

Last edited by Dark_Helmet; 11-08-2010 at 05:23 PM.
 
Old 11-08-2010, 05:28 PM   #4
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
The ids in ids<yyyy><mm> are unique across all files. 280,000 ids across 420 files.


I am sure Python would be much more suited to the task. However, I have loads of shell scripts dealing with these id files already, including all the necessary command-line parameter testing. I wouldn't want to replicate the whole logic in Python, even if it runs longer as a shell script.
 
Old 11-08-2010, 05:35 PM   #5
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,782

Rep: Reputation: 2083
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
I believe this is equivalent to
Code:
grep -F -f idsmissing idsmore > idsmissing_dated
 
1 member found this post helpful.
Old 11-08-2010, 06:03 PM   #6
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 374
ntubski may have given you what you need, but in case not...

Quote:
Originally Posted by hashbang#!
I am sure Python would be much more suited to the task. However, I have loads of shell scripts dealing with these id files already, including all the necessary command-line parameter testing. I wouldn't want to replicate the whole logic in Python, even if it runs longer as a shell script.
I understand the reluctance. I just want to point out that you don't have to convert everything to Python. Your shell script could invoke a Python script and simply process the Python script's output--just like any other command. If a problem comes along that's more suited to the abilities of a given language, you can use a sub-script written in the more-apt language to accomplish the task. Just food for thought.
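A trivial sketch of that pattern (lookup.py is a hypothetical helper script, not something from this thread; the shell consumes its output like that of any other command):

Code:
#!/bin/bash

# the (hypothetical) Python helper prints "id year month" lines;
# the shell post-processes them as usual
./lookup.py idsmissing idsmore | while read -r id y m; do
    echo "$id" >> "ids$y$m"
done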

Quote:
set of dated files called ids<yyyy><mm> containing numeric ids, one per line, 280,000 lines in total
Quote:
The ids in ids<yyyy><mm> are unique across all files. 280,000 ids across 420 files.
Ok, I know I will sound like a moron here, but just to clarify...

I interpreted the first statement (from your first post) to mean that each ids<yyyy><mm> file had 280,000 lines--no mention of whether the individual ids could repeat in the same file or other files.

I interpreted the second statement to mean that there are 280,000 lines, cumulative, among all ids<yyyy><mm> files and that each id is unique to both the ids<yyyy><mm> it is found in and all other ids<yyyy><mm> files. In other words, a given ID appears in one and only one file.

Is either correct?

Last edited by Dark_Helmet; 11-08-2010 at 06:04 PM.
 
Old 11-08-2010, 06:13 PM   #7
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
Quote:
Originally Posted by ntubski View Post
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
I believe this is equivalent to
Code:
grep -F -f idsmissing idsmore > idsmissing_dated
ntubski,

if that is equivalent, it would appear to be the obvious (yes I know..) choice!:
Code:
sasha@reactor:  time for r in $(cat 25000); do egrep "^$r" 25000-2; done
WORD HERE  hello

real    2m17.704s
user    0m16.040s
sys     1m47.828s
sasha@reactor:  time cat 25000 | xargs -I{} egrep "^{}" 25000-2      
WORD HERE  hello

real    0m48.177s
user    0m14.386s
sys     0m26.204s
sasha@reactor: time grep -F -f 25000 25000-2
WORD HERE  hello

real    0m0.007s
user    0m0.003s
sys     0m0.004s
however, let's say on a particular iteration, $r is:
Code:
12345
so OP's regex would equate to searching file2 for:
Code:
^12345
but using the `grep -F -f` method, we get a match if file2 contains a line like:
Code:
912345
so it fails to match explicitly at "beginning of line", does it not? Maybe I'm not seeing something.

Anyhow, I tested using `xargs` above too, instead of a loop, and found that it chopped about 1:30 off the time; not a shabby improvement, although not of the same calibre as the grep -Ff.
 
1 member found this post helpful.
Old 11-09-2010, 03:33 AM   #8
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
I think this may be done with the 'comm' command, but I can't be sure, because I don't understand what needs to be done. Try to simplify it if you can, then I may be able to help. 'grep -f' for file input may also work as GrapefruiTgirl says.
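For what it's worth, a sketch of the comm variant, assuming idsdone is the sorted, de-duplicated union built in the first post (comm requires both inputs to be sorted; -13 suppresses the lines unique to the first input and the common lines, leaving only the lines unique to the second):

Code:
# ids present in idsmore but absent from every ids<yyyy><mm> file
comm -13 idsdone <(cut -d ' ' -f 1 idsmore | sort -u) > idsmissing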
 
Old 11-09-2010, 03:52 AM   #9
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by GrapefruiTgirl View Post
so it fails to match explicitly at "beginning of line", does it not? Maybe I'm not seeing something.
Yes, that's correct. The -w option should be useful in this case.
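That is, something along these lines (-w restricts each fixed-string pattern to whole-word matches, so 1111 no longer matches inside 11111):

Code:
grep -F -w -f idsmissing idsmore > idsmissing_dated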
 
2 members found this post helpful.
Old 11-09-2010, 04:24 AM   #10
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
If I correctly interpreted your requirements, here is an awk one-liner that should do the job in less than one second (awk is known to be very fast):
Code:
awk 'FILENAME != "idsmore"{_[$0] = ""} FILENAME == "idsmore"{if (! ( $1 in _ )) print $1 >> ( "ids" $2 $3 )}' ids?????? idsmore
Please note that the file name "idsmore" must be the last argument on the command line.
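The same one-liner spread out with comments, purely for readability (the logic is unchanged):

Code:
awk '
    # pass 1 (all ids<yyyy><mm> files): remember every existing id
    FILENAME != "idsmore" { _[$0] = "" }

    # pass 2 (idsmore): $1 = id, $2 = year, $3 = month;
    # unknown ids are appended to the matching dated file
    FILENAME == "idsmore" {
        if (! ($1 in _)) print $1 >> ("ids" $2 $3)
    }
' ids?????? idsmore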
 
1 member found this post helpful.
Old 11-09-2010, 05:18 AM   #11
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Interesting stuff; a hundred different ways to skin a cat!


Quote:
Originally Posted by Dark_Helmet View Post
I just want to point out that you don't have to convert everything to python. Your shell script could invoke a python script [...]
Point taken. Actually, I have written a few Python scripts that modify HTML files and that I call from a bash script. I haven't looked into this yet, but my gut feeling in this particular scenario was that 1,000 invocations of Python are slooowww; or maybe it's just the HTML parsing that is slow. Another topic for another thread.


Quote:
Originally Posted by Dark_Helmet View Post
I interpreted the second statement to mean that there are 280,000 lines, cumulative, among all ids<yyyy><mm> files and that each id is unique to both the ids<yyyy><mm> it is found in and all other ids<yyyy><mm> files. in other words, a given ID appears in one and only one file.
Correct.

I like ntubski's ingenious lightning-fast grep:
Quote:
Originally Posted by ntubski View Post
Code:
for r in $(cat idsmissing); do egrep "^$r " idsmore; done > idsmissing_dated
I believe this is equivalent to
Code:
grep -F -f idsmissing idsmore > idsmissing_dated
Unfortunately, as GrapefruiTgirl noticed, it finds too many ids because it picks up ids that are substrings of other ids (in the example below, the pattern 1111 wrongly matches the 11111 and 21111 lines).
I give you example files:
idsmissing
Code:
1111
2222
3333
idsmore
Code:
0000 2008 04
1111 2008 07
11111 2006 12
21111 2005 11
1212 2008 07
2222 2008 01
2323 2008 12
3333 2008 06
GrapefruiTgirl's xargs solution, once a space has been added to the search expression, produces the correct result, but on my data it is as slow as my loop (2m5s):

Code:
time cat idsmissing | xargs -I{} egrep "^{} " idsmore >idsmissing_dated
colucix's -w addition fixes ntubski's grep, and this runs in an amazing 0.15s!!!!

colucix's awk, which does the whole job in a oner, runs in only 0.74s but produces a different result. I'll need to look at this.
 
Old 11-09-2010, 05:49 AM   #12
GrapefruiTgirl
LQ Guru
 
Registered: Dec 2006
Location: underground
Distribution: Slackware64
Posts: 7,594

Rep: Reputation: 556
I'm still working on another awk solution here (for the fun of it), but it is not producing the results I had expected (in fact, not producing any results).

Meanwhile, let us know if you become satisfied with one of the given solutions (like the grep -F -f -w) while I keep fiddling with this.
 
Old 11-09-2010, 06:16 AM   #13
hashbang#!
Member
 
Registered: Aug 2009
Location: soon to be independent Scotland
Distribution: Debian
Posts: 120

Original Poster
Rep: Reputation: 17
Am I understanding this correctly, that grep -F speeds up the operation because it does not try to interpret the patterns as regular expressions? I get the same result without -F, but it crawls. fgrep is a fraction of a second faster than grep -F.
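For illustration, the semantic side of -F on a hypothetical throwaway pattern (with -F the dot is a literal character, not a regex metacharacter; the speed gain presumably comes from grep being able to use a dedicated fixed-string search instead of a regex engine):

Code:
$ echo 1a23 | grep '1.23'       # regex: the dot matches any character
1a23
$ echo 1a23 | grep -F '1.23'    # fixed string: no match
$ echo 1.23 | grep -F '1.23'    # fixed string: literal match
1.23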

colucix's all-in-one awk is correct. I must have been working on an older data set. It takes 0.75s for the whole operation, whereas my four-command solution with ntubski's grep -F -w takes 3.4s.

Last edited by hashbang#!; 11-09-2010 at 06:32 AM.
 
Old 11-09-2010, 06:31 AM   #14
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Quote:
Originally Posted by hashbang#! View Post
I am still looking at colucix's awk.
He he... actually I tested it on 5 files containing 56,000 different ids each (280,000 in total) and a file "idsmore" containing 300,000 different ids. Then I applied both your script and the awk code, and I obtained exactly the same result (except for the .log and the idsmissing files). If this is the only difference, a little add-on to the code should do the trick.
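For example, a guess at the kind of add-on meant here (it simply mirrors each appended id into the matching .log file, as the original script does; the idsmissing file would still need its own step):

Code:
awk 'FILENAME != "idsmore" {_[$0] = ""}
     FILENAME == "idsmore" {
         if (! ($1 in _)) {
             print $1 >> ("ids" $2 $3)
             print $1 >> ("ids" $2 $3 ".log")
         }
     }' ids?????? idsmore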
 
Old 11-09-2010, 06:36 AM   #15
H_TeXMeX_H
LQ Guru
 
Registered: Oct 2005
Location: $RANDOM
Distribution: slackware64
Posts: 12,928
Blog Entries: 2

Rep: Reputation: 1301
Quote:
Originally Posted by colucix View Post
He he... actually I tested it on 5 files containing 56,000 different ids each (280,000 in total) and a file "idsmore" containing 300,000 different ids. Then I applied both your script and the awk code, and I obtained exactly the same result (except for the .log and the idsmissing files). If this is the only difference, a little add-on to the code should do the trick.
Can you post the code to generate those? I can't picture in my head what needs to be done.
 
  

