Old 08-14-2010, 03:10 PM   #16
konsolebox
Senior Member
 

The problem with grep is probably that every time it processes a pattern it effectively re-reads the whole file, even if it holds the files in memory. Or it could be the other way around: it simply allocates too much data in memory.
 
Old 08-14-2010, 03:15 PM   #17
konsolebox
Senior Member
 
The approach I took in the awk scripts I presented could also be optimized further in C or C++ with the help of a hash table library like this one. At least in C you're free to load the data from file1 and file2 into a large buffer, which can't really be done cleanly in awk, and things are more flexible there in general. Optimization might also be enhanced by gcc.

It's possible that the code awk generates internally is already well optimized, but let's go with the more likely case.
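
For the curious, a rough sketch of that idea in C. This is only an outline, not code from this thread: it uses the POSIX hcreate()/hsearch() hash table from <search.h> rather than the linked library, takes the two file names on the command line in the same order as ntubski's grep command (patterns first), and does only minimal error handling.
Code:
#define _GNU_SOURCE
#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file2 file1\n", argv[0]);
        return 1;
    }

    FILE *patterns = fopen(argv[1], "r");   /* lines to filter out (file2) */
    FILE *data     = fopen(argv[2], "r");   /* lines to test (file1) */
    if (!patterns || !data) { perror("fopen"); return 1; }

    /* Fixed-size POSIX hash table; sized here for roughly a million lines. */
    if (hcreate(2000000) == 0) { perror("hcreate"); return 1; }

    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    /* Load every line of the pattern file into the table. */
    while ((len = getline(&line, &cap, patterns)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';
        ENTRY e = { strdup(line), NULL };
        hsearch(e, ENTER);
    }

    /* Print each line of the data file that is not in the table. */
    while ((len = getline(&line, &cap, data)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';
        ENTRY e = { line, NULL };
        if (hsearch(e, FIND) == NULL)
            puts(line);
    }

    free(line);
    fclose(patterns);
    fclose(data);
    hdestroy();
    return 0;
}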

Last edited by konsolebox; 08-14-2010 at 03:17 PM.
 
Old 08-14-2010, 04:45 PM   #18
johnsfine
LQ Guru
 
Quote:
Originally Posted by ntubski View Post
The correct way to use grep would be:
Code:
grep -v -F -f file2 file1
Note the order of file arguments.
I think that is all twaddlac needs.

I was too pessimistic about the algorithms in grep. When presented with a massive number of patterns, it must sort them first and then exploit the fact that they're sorted (otherwise it couldn't achieve the performance I just measured). As I said earlier, this task is better done by sorting file2, which you have accomplished by telling grep to use file2 as the patterns.

I made two files of N 15-character items each, where N-10 items were shared between the two files (in a different order) and 10 items scattered around each file were unique to that file. I then verified that your command was fast and correct at finding the 10 unique items.

N=10,000      0.25 seconds
N=100,000     1.9 seconds
N=1,000,000   21 seconds

If grep weren't doing a good job of preprocessing the patterns, then N=1,000,000 would run 100 times slower than N=100,000 rather than just over 10 times slower.

With N=1,000,000 my test is roughly the same size task as twaddlac wants, and it only took 21 seconds.
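
For reference, a rough sketch in C of how test files like that could be generated. This is my reconstruction of the setup described above, not the actual test harness; the file names and constants are made up.
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N    1000000          /* items per file */
#define LEN  15               /* characters per item */

static char shared[N - 10][LEN + 1];   /* items common to both files (~16 MB) */

static void random_item(char *buf)
{
    for (int i = 0; i < LEN; i++)
        buf[i] = 'a' + rand() % 26;
    buf[LEN] = '\0';
}

static void shuffle(void)
{
    for (int i = N - 11; i > 0; i--) {
        int j = rand() % (i + 1);
        char tmp[LEN + 1];
        memcpy(tmp, shared[i], LEN + 1);
        memcpy(shared[i], shared[j], LEN + 1);
        memcpy(shared[j], tmp, LEN + 1);
    }
}

static void write_file(const char *name)
{
    FILE *f = fopen(name, "w");
    if (!f) { perror(name); exit(1); }

    shuffle();                         /* different order in each file */
    int s = 0;
    for (int i = 0; i < N; i++) {
        if (i % (N / 10) == 0) {
            /* 10 positions get a fresh random item,
             * unique with overwhelming probability */
            char unique[LEN + 1];
            random_item(unique);
            fprintf(f, "%s\n", unique);
        } else {
            fprintf(f, "%s\n", shared[s++]);
        }
    }
    fclose(f);
}

int main(void)
{
    srand((unsigned)time(NULL));
    for (int i = 0; i < N - 10; i++)
        random_item(shared[i]);
    write_file("file1");
    write_file("file2");
    return 0;
}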
 
2 members found this post helpful.
Old 08-14-2010, 05:53 PM   #19
wroom
Member
 
Quote:
Originally Posted by konsolebox View Post
...further optimized in C or C++...At least in C you're free to allocate data from file1 and file2 to a large buffer...
And instead of reading files into buffers, one can use mmap() to map all or part of a file into memory. mmap() can be called with different options: use the file itself as backing store for the mapped memory instead of the swap file, have blocks of the file automatically paged in and out of memory depending on access, or, if one wants, keep the complete file locked in memory.

With mmap one can map a file much larger than physical memory.
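
A minimal sketch of that in C, assuming a read-only private mapping of an existing non-empty file and with most error handling omitted:
Code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[1]); return 1; }

    /* Map the whole file; pages are faulted in from the file on demand,
     * with no read() calls and no separate user-space buffer. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Example use: count lines by scanning the mapping directly. */
    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;
    printf("%ld lines\n", lines);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}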
 
1 member found this post helpful.
Old 08-14-2010, 06:14 PM   #20
konsolebox
Senior Member
 
Quote:
Originally Posted by wroom View Post
And instead of reading in files in buffers, one can use mmap() to map whole or part of a file in memory. mmap() can be called with different options, like to use the file as backing store for the memory used instead of using the swapfile, having the blocks of the file automagically swapped in and out of memory depending on access or, if one wants that, have the complete file locked in memory.

With mmap one can map virtually a file much larger than physical memory size.
Thanks for the info. That should be helpful in many cases, esp. with very large files. Is there no performance loss with that compared to stream-like buffering, or is it just best for random access? It does, after all, amount to dedicating a large area of memory to caching the file. I'm now wondering how access to the data through the virtual memory manager compares to ordinary buffers, and which of the two is quicker.
 
Old 08-14-2010, 07:46 PM   #21
wroom
Member
 
Paging is rather quick, and mmap() uses the same mechanism to read in (and, when that option is specified, also write out) blocks from/to the mapped file.
mmap can be instructed to read ahead for a sequential pass over a file, or, on the other hand, told to limit readahead for best performance with random access.

Linux is very good at reading and writing blocks that are a multiple of _SC_PAGE_SIZE (typically 4096 bytes), and mmap() uses the same mechanism as paging memory. Instead of reading a file into a stream buffer and then copying it to the buffer the program allocated, while keeping track of how full the stream buffer is and re-blocking to whatever read size the program requested on each access, mmap reads blocks directly into the mapped address space. And if the mmap() region is mapped read-only, or if there are no updates to write out to a R/W mapped region, then the page simply "vanishes" from physical memory when the OS needs to evict it.

The performance of mmap()ing a large chunk of a disk into memory and having, say, four or six parallel threads search for redundancy matches (or pattern matches) in the same shared region of memory is extreme.

And if the application can discern which parts of the mmapped region are the most important, it can request that those blocks be read in and lock them in memory.

One can also mmap the same file twice, keeping one linear mapping of the file while at the same time working with a relocated view of it through remap_file_pages().
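
A sketch of those two knobs in C, using madvise() for the readahead hint and mlock() to pin the important part. The file name and the 16 MiB "hot" size are just placeholders for the example:
Code:
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile", O_RDONLY);           /* hypothetical file */
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("bigfile"); return 1; }

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Sequential scan: ask the kernel for aggressive readahead ...   */
    madvise(map, st.st_size, MADV_SEQUENTIAL);
    /* ... or, for random access, ask it not to read ahead at all:    */
    /* madvise(map, st.st_size, MADV_RANDOM);                         */

    /* Lock the "important" part (here: the first 16 MiB) in memory so
     * it cannot be paged out; needs enough RLIMIT_MEMLOCK headroom.  */
    size_t hot = st.st_size;
    if (hot > (size_t)16 << 20)
        hot = (size_t)16 << 20;
    if (mlock(map, hot) != 0)
        perror("mlock");

    /* ... scan or search the mapping here ... */

    munlock(map, hot);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}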
 
1 member found this post helpful.
Old 08-15-2010, 06:15 AM   #22
johnsfine
LQ Guru
 
Quote:
Originally Posted by twaddlac View Post
Code:
grep -L -f file_one.txt file_two.txt > output.output
Anyone have a guess why this command uses so much memory and time?

From the grep man page, I can see this command is basically useless. It certainly doesn't do anything like what twaddlac wants done. But I can't understand why it takes a lot of memory and time. Like twaddlac, I tried the command on large files and didn't wait for it to finish.

The correct command (provided by ntubski) runs fairly quickly
Code:
grep -v -F -f file2 file1
The incorrect command ought to preprocess the pattern list in a similar way and with similar performance. Then it ought to find a match almost immediately and stop there (because it is only supposed to report files with zero matches). So it should be faster than the correct command.

BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files. The grep command doesn't tell grep that the matches occur only on line boundaries. Lacking that information makes the problem take about 15 times more work (on the test data I used) than is required if you use that information.

But since grep is fast enough for the specified file size, there is no need to write a script to save 17 of the 21 seconds of execution time.

Even more so, there is no need to code the program in a faster language to save a tinier fraction of the execution time.

Quote:
Originally Posted by wroom View Post
And instead of reading in files in buffers, one can use mmap() to map whole or part of a file in memory.
That is going very far into advanced performance optimization that would save a very tiny fraction of the total execution time. The CPU time saved by not copying from a buffer is tiny compared either to the processing that must be done on the patterns after reading them, or to the disk access time of the actual read.

We're talking about 26MB of data split across two files. That is a trivial amount as long as your algorithm doesn't introduce N squared processing (see post #12 for the simple N squared approach that would take far too long on files this size).

Without the N squared problem, grep is fast enough. gawk is a lot faster, but grep is fast enough that gawk isn't worth the bother. C++ would be a tiny bit faster than gawk, but that really isn't worth the bother. Using mmap instead of ordinary file I/O would make the C++ version an additional very tiny amount faster. Really, really not worth the bother.

Last edited by johnsfine; 08-15-2010 at 06:39 AM.
 
3 members found this post helpful.
Old 08-15-2010, 08:33 AM   #23
makyo
Member
 
Hi.
Quote:
Originally Posted by johnsfine View Post
Anyone have a guess why this command uses so much memory and time? ...
Briefly, I think the regular expression engine eats up the time, possibly also the memory.

In working on a similar question with less posted information, http://www.unix.com/shell-programmin...et-output.html , I compared diff, grep (plain), grep -F, awk, and comm for a 1000-line randomly-generated collection of sentences. The characteristics of this file were:
Code:
 1154 lines read
 8031 words read
67975 chars read

    1154 newline
    7859 space
      65 !
       1 &
      31 '
      69 ,
     778 .
     159 ?
( the remainder being alphabetic characters )
In many of the comparisons, I used a line-by-line reverse of the file as the second file, so that the first line of the first file becomes the last line of the second file, and so on.

Using plain grep -- interpreting the lines as regular expressions -- never finished, although a subset of about 100 lines finished in about 4 seconds. Using grep -F allowed the process to complete on the entire 1000-line files in a very short time, 0.08 seconds. I think the difference is the ".", "?", etc. special characters causing the regex engine to take a lot of extra time.

In my opinion, if nothing is known definitively beforehand about the files, and if the order of the results is immaterial (as the OP said), then
Code:
comm -3 sorted-file1 sorted-file2
is the best start for a comparison. It does no regular-expression interpretation, consumes very little memory, and produces the lines unique to each file. Additional analysis can then be done on those results.

cheers, makyo
 
3 members found this post helpful.
Old 08-15-2010, 10:07 AM   #24
konsolebox
Senior Member
 
Quote:
Originally Posted by johnsfine View Post
... gawk is a lot faster, but grep is fast enough that gawp isn't worth the bother. C++ would be a tiny bit faster than gawk, but that really isn't worth the bother. Using mmap instead of ordinary file I/O would make the C++ version an additional very tiny amount faster. Really really not worth the bother.
For me it was worth the fun, and it serves the purpose.

--- edit ----

Quote:
Originally Posted by johnsfine View Post
BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files...
I'm really glad to hear that. Thanks for testing the script.

Last edited by konsolebox; 08-15-2010 at 10:16 AM.
 
Old 08-15-2010, 12:38 PM   #25
twaddlac
Member
 

Original Poster

Thank you for all of your help! This was very insightful and I think that I have learned a few new tricks for future reference! :-)

Quote:
Originally Posted by ntubski View Post
The correct way to use grep would be:
Code:
grep -v -F -f file2 file1
Note the order of file arguments.
This seemed to do the trick in a mere 5 seconds! I intend to try the other suggestions as well, just to get more experience! Thanks again everyone!! :-)
 
Old 09-26-2012, 09:18 AM   #26
ntubski
Senior Member
 
Apologies for the necropost, but this thread was recently linked to and it seemed to me there were a few loose ends.

Quote:
Originally Posted by johnsfine View Post
Code:
grep -L -f file_one.txt file_two.txt > output.output
Anyone have a guess why this command uses so much memory and time?

From the grep man page, I can see this command is basically useless. It certainly doesn't do anything like what twaddlac wants done. But I can't understand why it takes a lot of memory and time. Like twaddlac, I tried the command on large files and didn't wait for it to finish.

The correct command (provided by ntubski) runs fairly quickly
Code:
grep -v -F -f file2 file1
The incorrect command ought to preprocess the pattern list in a similar way and with similar performance. Then it ought to find a match almost immediately and stop there (because it is only supposed to report files with zero matches). So it should be faster than the correct command.
Without -F, grep assumes that the file contains regexp patterns. There is no useful way to sort regexp patterns in general, so it has to try each one in turn against every line of the other file. Obviously, grep could notice that these specific patterns use only the particular subset of regexps that are plain strings, but it doesn't bother with that optimization.

Quote:
BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files. The grep command doesn't tell grep that the matches occur only on line boundaries. Lacking that information makes the problem take about 15 times more work (on the test data I used) than is required if you use that information.
grep also has the -x or --line-regexp option, which could help there; at the time it didn't occur to me to use it. GNU grep also has -w or --word-regexp, which can be useful in general, although in this case the patterns cover the whole line anyway.
 
1 member found this post helpful.
  

