Old 08-13-2010, 11:22 AM   #1
twaddlac
Member
 
grep for HUGE files?


Hello,

I am trying to compare a list of patterns from one file and grep them against another file and print out only the unique patterns. Unfortunately these files are so large that the command has yet to run to completion.

Here's the command that I used:

Code:
grep -L -f file_one.txt file_two.txt > output.output
Here's some example data:

Code:
>FQ4HLCS01BMR4N
>FQ4HLCS01BZNV6
>FQ4HLCS01B40PB
>FQ4HLCS01BT43K
>FQ4HLCS01CB736
>FQ4HLCS01BU3UM
>FQ4HLCS01BBIFQ
Any suggestions on how to increase efficiency or use another command?

Thank you very much!
 
Old 08-13-2010, 11:43 AM   #2
Sergei Steshenko
Senior Member
 
Quote:
Originally Posted by twaddlac
I am trying to compare a list of patterns from one file and grep them against another file and print out only the unique patterns. [...] Any suggestions on how to increase efficiency or use another command?
Consider using 'egrep' or 'grep -P' (the latter is for Perl-like regular expressions) and consider forming a regular expression.
 
Old 08-13-2010, 11:50 AM   #3
ghostdog74
Senior Member
 
Quote:
Originally Posted by Sergei Steshenko View Post
Consider using 'egrep' or 'grep -P' (the latter is ofr Perl-like regular expressions) and consider forming a regular expression.
egrep -> grep -E.
Grepping for patterns from one file against another file is exactly what the -f option does, so I don't know why you'd suggest forming a regex.
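
For reference, a minimal sketch of what -f does (toy file name, not the OP's data):

Code:
# patterns.txt holds one pattern per line, e.g. two of the IDs from post #1
printf '>FQ4HLCS01BMR4N\n>FQ4HLCS01BZNV6\n' > patterns.txt

# print every line of file_two.txt that matches any of those patterns;
# -F treats the patterns as fixed strings rather than regexes
grep -F -f patterns.txt file_two.txt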

 
Old 08-13-2010, 01:40 PM   #4
twaddlac
Member
 
Quote:
Originally Posted by ghostdog74 View Post
egrep -> grep -E.
I've tried these and they still won't let the comparison run to completion. One file is ~12MB and the other ~14MB and I'm a little stumped as to why this isn't completing. I'm running a CentOS guest on a Windows 7 host and I dedicated 4GB of RAM to the guest machine. You would think it should run with no problem, but I guess there's something going on under the hood that I don't know about.

Let me know if you have any ideas about how to speed this up! I looked into glimpse, but I had complications installing it and gave up on that; if it's worth it for this particular task, I may try to install it again.

Thank you very much again!
 
Old 08-13-2010, 01:48 PM   #5
GrapefruiTgirl
LQ Guru
 
You say that 'they won't (allow) run to completion' - what exactly *do* they do? Does the process hang? Or does it just go sooooo frikking slooow that you think it'll never end.. Or what - please explain. Do you get any output at all going into your output file?

How much RAM total is in the physical computer?

As far as `grep` is concerned, those file sizes are not huge. Even on a 32-bit platform you should be able to grep files larger than 2 GiB. Yes, it would seem there's something more going on 'under the hood'.
 
Old 08-13-2010, 01:59 PM   #6
johnsfine
LQ Guru
 
Quote:
Originally Posted by twaddlac View Post
trying to compare a list of patterns from one file and grep them against another file and print out only the unique patterns.
What does that mean?

I read the description of -L in man grep to try to work out what your command does, as a "reverse engineering" approach to figuring out what you want. But as far as I understand the info in man grep, your command does nothing useful (even if it had time to finish).

Have you tested your grep command on smaller sample files?

Regarding your description of what you want to do, I think you might mean:

Print out each line of the first file whose contents do not appear anywhere in the second file.

In case I've totally misunderstood you, can you describe what you want to do more precisely?

Quote:
Unfortunately these files are so large that the command has yet to run to completion.
Quote:
Originally Posted by twaddlac View Post
One file is ~12MB and the other ~14MB and I'm a little stumped as to why this isn't completing.
IIUC, each pattern is a line in the first file. Are the potential matches in the second file also lines (or at least at the start of lines), or might they appear mid-line?

You are asking for a much bigger basic task than you seem to realize. Take roughly 750K patterns and match each against up to 14 million potential positions in the second file: with no special optimization, that is on the order of ten trillion comparisons (750,000 × 14,000,000 ≈ 10^13). I assume grep is at least a bit smarter than that, but I doubt it is particularly sophisticated about the special requirements of working with a very large number of patterns, so I wouldn't bet it is smart enough to do this job in reasonable time.

A decent algorithm must first sort the set of patterns and while searching must take advantage of the fact that the list is sorted. I don't know whether grep does that when working with a list of patterns.

If the potential match positions are highly restricted (such as if they were all at the start of lines) a better algorithm would sort the second file. That is why I asked where the potential matches are.

If the potential matches are restricted enough and you sort both files (sorting the second on the boundaries of potential matches, increasing the total size if such boundaries overlap), then you could easily use some faster tool to find the points of correspondence (or lack of correspondence) between the files.
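
If the matches do turn out to be whole lines, here is a rough sketch of that sorted-file approach, using comm from coreutils as one such faster tool (just a sketch, not tested on data of your size):

Code:
sort file_one.txt > file_one.sorted
sort file_two.txt > file_two.sorted
# comm -23 drops lines that appear only in the second file (-2) and lines
# that appear in both (-3), leaving the lines that appear only in the first
comm -23 file_one.sorted file_two.sorted > output.output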

 
Old 08-13-2010, 02:17 PM   #7
twaddlac
Member
 
Quote:
Originally Posted by johnsfine View Post
Regarding your description of what you want to do, I think you might mean:

Print out each line of the first file whose contents do not appear anywhere in the second file.
This is precisely what I want to do.

Quote:
Originally Posted by johnsfine View Post
Are the potential matches in the second file also lines (or at least at the start of lines), or might they appear mid-line?
They are exactly like the data in the first file: no mid-line matches, special characters, etc. In regard to the -L option, I thought it suppressed the common patterns and would only print out the unique patterns, but apparently I was mistaken. Let me know if you need any other info and if you have any optimization techniques!
 
Old 08-13-2010, 09:07 PM   #8
ghostdog74
Senior Member
 
Show samples of your file1 and file2 input files, and show the desired output.
 
Old 08-14-2010, 12:48 AM   #9
konsolebox
Senior Member
 
Do you still need the repeated lines in file1 if file1 has duplicates? Does it matter if the results are not in the same order as in file1?

 
Old 08-14-2010, 02:11 AM   #10
syg00
LQ Veteran
 
I too am unable to understand what you expect/hope to get as output.
I was prepared to credit grep with a bit more smarts than johnsfine did. Small test files somewhat supported that: it does 4k reads (not a line at a time) on the input file containing the patterns, and 32k reads on the "target" file. I merely did some strace-ing, not profiling of the userland code. A larger pattern file (62k), however, drove memory remapping nuts - before the target file is even opened. In a virtual guest this might be an even bigger problem - who knows.

Update: just tried a similar test on a ArchLinux (non-X, CLI only) VirtualBox guest on a Win7 64-bit host. Similar results, although timings were not noticeably worse. Host does not page.

 
Old 08-14-2010, 10:03 AM   #11
twaddlac
Member
 
Quote:
Originally Posted by GrapefruiTgirl View Post
You say that 'they won't (allow) run to completion' - what exactly *do* they do? Does the process hang? Or does it just go sooooo frikking slooow that you think it'll never end.. Or what - please explain. Do you get any output at all going into your output file?
The process just hangs. The VM ends up freezing and aborting the process, and there is no output in the file I am printing to.

Quote:
Originally Posted by GrapefruiTgirl View Post
How much RAM total is in the physical computer?
There is 4GB of RAM, and I try to boost performance with 3.1GB of ReadyBoost. I thought that this would make a significant difference, allowing my VM to use ~3GB of memory, but it hasn't proven too advantageous.

Quote:
Originally Posted by ghostdog74
Show samples of your file1 and file2 input files, and show the desired output.
The sample input data are in the original post. Likewise, the other input file is exactly the same kind of data: >FQ4HLCS0*. The desired output, in this context, would be something like this:

File 1:
a
b
c
d

File 2:
b
c
a
f

Output:
d

Omitting all the matched data and only printing out the unmatched data.

Quote:
Originally Posted by konsolebox
Do you still need the repeated lines in file1 if file1 has duplicates? Does it matter if the results are not in the same order as in file1?
No, I do not need the duplicated lines, and there aren't any duplicate data within one file. I don't care whether the results are in order or not, as long as I get the data.

Let me know if you have any more questions and thanks for your help so far!! :-)
 
Old 08-14-2010, 10:23 AM   #12
GrapefruiTgirl
LQ Guru
 
OK, so one problem is apparent immediately:

Using the sample data you gave in the above post, I created two files and used the grep command from post #1, expecting to get the "d" output. But I got nothing instead.

The below, however, does appear to work:
Code:
sasha@reactor: cat file1
a
b
c
d

sasha@reactor: cat file2
b
c
a
f

sasha@reactor: while read LINE; do [ ! "$(grep "${LINE}" file2)" ] && echo "${LINE}"; done < file1
d
sasha@reactor:
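
A slightly leaner variant of the same loop (just a sketch, not tested on the real data) uses -q so grep stops at the first hit, -F for fixed strings and -x for whole-line matches. It still forks one grep per line of file1, though, so it will crawl on ~750K lines:

Code:
# print each line of file1 that does not occur as a whole line in file2
while read LINE; do
    grep -q -x -F "$LINE" file2 || echo "$LINE"
done < file1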
 
Old 08-14-2010, 10:55 AM   #13
konsolebox
Senior Member
 
I expect things to go faster in awk. I made two scripts for that.
Code:
#!/usr/bin/gawk -f

BEGIN {
	file1 = ARGV[1]
	file2 = ARGV[2]

	delete hash

	while ((getline < file1) > 0) {
		hash[$0] = 1
	}

	while ((getline < file2) > 0) {
		if ($0 in hash) {
			delete hash[$0]
		}
	}

	for (a in hash) {
		print a
	}

	exit (0)
}
Code:
#!/usr/bin/gawk -f

BEGIN {
	file1 = ARGV[1]
	file2 = ARGV[2]

	loopspersession = 100000

	if (!(loopspersession > 0)) {
		exit(1)
	}

	i = 0

	while (!i && (getline < file1) > 0) {
		delete hash

		hash[$0] = 1

		i = loopspersession - 1

		while (i && (getline < file1) > 0) {
			hash[$0] = 1
			i--
		}

		while ((getline < file2) > 0) {
			if ($0 in hash) {
				delete hash[$0]
			}
		}

		close (file2)

		for (a in hash) {
			print a
		}
	}

	exit(0)
}
Both scripts can be run by
Code:
gawk -f <script> -- <file1> <file2> > <outputfile>
The first script is simpler and can work if you have enough memory. With the second script, you can divide the work in sessions. Just change the value of loopspersession.

I'm not sure if caching file2 will also make things faster.
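
As a quick sanity check with the toy data from post #11 (script1.awk is just an assumed name for the first script), it should behave like this:

Code:
printf 'a\nb\nc\nd\n' > file1
printf 'b\nc\na\nf\n' > file2
gawk -f script1.awk -- file1 file2
# expected output: d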

 
Old 08-14-2010, 11:55 AM   #14
wroom
Member
 
It happens to me once in a while that grep fills up all eight gigs of RAM and the 24 gigs of swap even without such a difficult query. My guess is that it is not as efficient as it should be: grep does a lot of reentrant execution, filling up the heap with intermediate data. Don't know why it is built this way, really.

One way to speed things up, if you run a multicore processor, is to split up the file containing the match patterns into several files and run as many greps in parallel as you have CPU cores.
I assume grep will take considerably less memory per process with fewer patterns to match?
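
A rough sketch of that splitting idea (untested; it assumes the whole-line data described in post #7 and standard coreutils):

Code:
# split the pattern file into ~200K-line chunks named chunk_aa, chunk_ab, ...
split -l 200000 file_one.txt chunk_

# one grep per chunk, run in the background; -x -F matches whole fixed lines,
# so each grep prints exactly the patterns from its chunk found in file_two.txt
for c in chunk_*; do
    grep -x -F -f "$c" file_two.txt > "$c.matched" &
done
wait

# subtract the matched patterns from the pattern file, leaving the unmatched ones
sort -u chunk_*.matched > matched.sorted
sort file_one.txt | comm -23 - matched.sorted > output.output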

And just to make you curious about some alternatives, I did ls -lF /bin/*grep /usr/bin/*grep on my SUSE machine and got:
Code:
-rwxr-xr-x 1 root root  86928 Jun  6  2008 /bin/egrep*
-rwxr-xr-x 1 root root  53872 Jun  6  2008 /bin/fgrep*
-rwxr-xr-x 1 root root 215088 Jun  6  2008 /bin/grep*
-rwxr-xr-x 1 root root   1677 Jun  6  2008 /usr/bin/bzgrep*
-rwxr-xr-x 1 root root  19736 Jun  7  2008 /usr/bin/deepgrep*
lrwxrwxrwx 1 root root     10 Apr 26 03:27 /usr/bin/egrep -> /bin/egrep*
lrwxrwxrwx 1 root root     10 Apr 26 03:27 /usr/bin/fgrep -> /bin/fgrep*
lrwxrwxrwx 1 root root      9 Apr 26 03:27 /usr/bin/grep -> /bin/grep*
-rwxr-xr-x 1 root root   1332 Jun  7  2008 /usr/bin/mgrep*
-rwxr-xr-x 1 root root  28880 Jun 23  2008 /usr/bin/pcregrep*
-r-xr-xr-x 2 root root  40800 Jun  6  2008 /usr/bin/pgrep*
-rwxr-xr-x 1 root root     66 Jan 20  2010 /usr/bin/zegrep*
-rwxr-xr-x 1 root root     66 Jan 20  2010 /usr/bin/zfgrep*
-rwxr-xr-x 1 root root   5002 Jan 20  2010 /usr/bin/zgrep*
-rwxr-xr-x 1 root root   1188 Jun  6  2008 /usr/bin/zipgrep*
So I guess "there is more than one way to skin a cat".
 
Old 08-14-2010, 02:43 PM   #15
ntubski
Senior Member
 
The correct way to use grep would be:
Code:
grep -v -F -f file2 file1
Note the order of file arguments.
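
Since the IDs are only ever whole lines (post #7), it may also be worth adding -x so that an ID cannot be dropped merely because it is a substring of a longer line. With the OP's file names that would presumably be:

Code:
# -F: fixed strings   -f file_two.txt: read the patterns from file_two.txt
# -x: a pattern must match a whole line   -v: print lines that match no pattern
grep -v -x -F -f file_two.txt file_one.txt > output.output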
 