LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   grep for HUGE files? (https://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/)

konsolebox 08-14-2010 03:10 PM

The problem with grep is probably that every time it fetches a pattern, it re-reads the whole file (or does something equivalent even if it holds the whole file in memory). Or it could be the other way around: it allocates too much data in memory.

konsolebox 08-14-2010 03:15 PM

The concept I have with respect to the awk scripts I presented can also be further optimized in C or C++ with the use of hash table libraries like this one. At least in C you're free to allocate data from file1 and file2 to a large buffer, which can't really be done cleanly in awk. Things are also more flexible there. Optimization might also be enhanced by gcc.

There's a chance that the code awk generates is already well optimized, but let's consider the more likely case.
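
As a rough sketch of the C idea (my own illustration, not the awk script from earlier in the thread; it assumes plain newline-terminated records and uses the POSIX hcreate()/hsearch() table instead of an external hash library):

Code:

#include <search.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Print the lines of file1 that do not appear in file2.
   Usage: ./notin file2 file1   (the program name is just for the example) */
int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s file2 file1\n", argv[0]);
        return 1;
    }

    FILE *f2 = fopen(argv[1], "r");
    FILE *f1 = fopen(argv[2], "r");
    if (!f2 || !f1) {
        perror("fopen");
        return 1;
    }

    hcreate(2000000);               /* room for roughly a million keys */

    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    /* Load every line of file2 into the hash table. */
    while ((len = getline(&line, &cap, f2)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';
        ENTRY e = { strdup(line), NULL };
        hsearch(e, ENTER);
    }

    /* Print each line of file1 that is not in the table. */
    while ((len = getline(&line, &cap, f1)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';
        ENTRY e = { line, NULL };
        if (hsearch(e, FIND) == NULL)
            puts(line);
    }

    free(line);
    hdestroy();
    fclose(f1);
    fclose(f2);
    return 0;
}

Built with gcc -O2 it is the same hash-lookup idea as the awk version, only with the buffering and table size under your own control.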

johnsfine 08-14-2010 04:45 PM

Quote:

Originally Posted by ntubski (Post 4066393)
The correct way to use grep would be:
Code:

grep -v -F -f file2 file1
Note the order of file arguments.

I think that is all twaddlac needs.

I was too pessimistic about the algorithms in grep. When presented with a massive number of patterns, it must sort them first and then use the fact that they're sorted (or it couldn't achieve the performance I just tested). As I said earlier, this task is better done by sorting file2, which you have accomplished by telling grep to use file2 as the patterns.

I made two files of N 15-character items each, where N-10 items were common to both files (in a different order) and 10 items scattered around each file were unique to that file. I then tested that your command was fast and correct at finding the 10 unique items.

N=10,000: 0.25 seconds
N=100,000: 1.9 seconds
N=1,000,000: 21 seconds

If grep weren't doing a good job of preprocessing the patterns, then N=1,000,000 would run 100 times slower than N=100,000 rather than just over 10 times slower.

With N=1,000,000 my test is roughly the same size task as twaddlac wants, and it only took 21 seconds.

wroom 08-14-2010 05:53 PM

Quote:

Originally Posted by konsolebox (Post 4066420)
...further optimized in C or C++...At least in C you're free to allocate data from file1 and file2 to a large buffer...

And instead of reading in files in buffers, one can use mmap() to map whole or part of a file in memory. mmap() can be called with different options: the file can serve as backing store for the mapped memory instead of the swap file, blocks of the file are automagically swapped in and out of memory depending on access, or, if one wants that, the complete file can be locked in memory.

With mmap one can map a file much larger than physical memory.
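
A minimal sketch of the basic idea (my own example, not code posted elsewhere in this thread: it just maps a file read-only and counts newlines, with error handling kept short):

Code:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }

    /* The kernel pages the file in on demand; nothing is copied up front. */
    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    long lines = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == '\n')
            lines++;
    printf("%ld lines\n", lines);

    munmap(data, st.st_size);
    close(fd);
    return 0;
}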

konsolebox 08-14-2010 06:14 PM

Quote:

Originally Posted by wroom (Post 4066505)
And instead of reading in files in buffers, one can use mmap() to map whole or part of a file in memory. mmap() can be called with different options: the file can serve as backing store for the mapped memory instead of the swap file, blocks of the file are automagically swapped in and out of memory depending on access, or, if one wants that, the complete file can be locked in memory.

With mmap one can map a file much larger than physical memory.

Thanks for the info. That should be helpful in many cases, esp. with very large files. With that, is there no performance loss compared to stream-like buffering, or is it just best with random access - even though it means a large amount of memory is used for caching the file? I'm now wondering how data access through the virtual memory manager compares to ordinary buffers, i.e. which of them is quicker.

wroom 08-14-2010 07:46 PM

Paging is rather quick. And mmap() uses the same mechanism to read blocks in from the mapped file (and also to write them out when that option is specified).
mmap can be instructed to read ahead for a sequential read of a file, or, on the other hand, to limit the read-ahead for best performance with random access.

Linux is very good at reading and writing blocks that are a multiple of _SC_PAGE_SIZE (4096 bytes), and mmap() uses the same mechanism as paging memory. Instead of reading a file into a stream buffer and then copying it into the user's allocated buffer, while keeping track of the stream buffer's fill level and re-blocking to the read size the program requested on each access, mmap reads blocks directly into the mapped address space. And if the mmap() region is mapped read-only, or if there are no updates to write out to a R/W mapped region, then the page simply "vanishes" from physical memory if the OS needs to swap it out.

The performance of being able to mmap() a large chunk of a disk in memory and have, say, four or six parallel threads search for redundancy matches (or pattern matches) in the same shared region of memory is extreme.

And if the application can discern which parts of the mmapped region are the most important, it can ask for those blocks to be read in and locked in memory.

One can also mmap the same file twice: keep one linear mapping of the file and, at the same time, work with a rearranged view of the same file using remap_file_pages().
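
The read-ahead and locking hints mentioned above correspond to posix_madvise() and mlock(). A rough sketch, again my own and building on a plain read-only mapping:

Code:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(argv[1]);
        return 1;
    }

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Sequential scan: let the kernel read ahead aggressively.
       POSIX_MADV_RANDOM would instead limit read-ahead. */
    posix_madvise(data, st.st_size, POSIX_MADV_SEQUENTIAL);

    /* Optionally pin the mapping in RAM so it is never paged out
       (needs CAP_IPC_LOCK or a large enough RLIMIT_MEMLOCK). */
    if (mlock(data, st.st_size) != 0)
        perror("mlock");

    /* ... several threads could now scan data[0 .. st.st_size) in parallel ... */

    munmap(data, st.st_size);
    close(fd);
    return 0;
}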

johnsfine 08-15-2010 06:15 AM

Quote:

Originally Posted by twaddlac (Post 4065350)
Code:

grep -L -f file_one.txt file_two.txt > output.output

Anyone have a guess why this command uses so much memory and time?

From the grep man page, I can see this command is basically useless. It certainly doesn't do anything like what twaddlac wants done. But I can't understand why it takes a lot of memory and time. Like twaddlac, I tried the command on large files and didn't wait for it to finish.

The correct command (provided by ntubski) runs fairly quickly
Code:

grep -v -F -f file2 file1
The incorrect command ought to preprocess the pattern list in a similar way and with similar performance. Then it ought to find a match almost immediately and stop there (because it is only supposed to report files with zero matches). So it should be faster than the correct command.

BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files. The grep command doesn't tell grep that the matches occur only on line boundaries. Lacking that information makes the problem take about 15 times more work (on the test data I used) than is required if you use that information.

But since grep is fast enough for the specified file size, there is no need to write a script to save 17 of the 21 seconds of execution time.

Even more so, there is no need to code the program in a faster language to save a tinier fraction of the execution time.

Quote:

Originally Posted by wroom (Post 4066505)
And instead of reading in files in buffers, one can use mmap() to map whole or part of a file in memory.

That is going very far into an advanced performance optimization that would save a very tiny fraction of the total execution time. The CPU time saved by not copying from the buffer is tiny compared either to the processing that must be done to the patterns after reading them, or to the disk access time of the actual read.

We're talking about 26MB of data split across two files. That is a trivial amount as long as your algorithm doesn't introduce N squared processing (see post #12 for the simple N squared approach that would take far too long on files this size).

Without the N squared problem, grep is fast enough. gawk is a lot faster, but grep is fast enough that gawk isn't worth the bother. C++ would be a tiny bit faster than gawk, but that really isn't worth the bother. Using mmap instead of ordinary file I/O would make the C++ version an additional very tiny amount faster. Really, really not worth the bother.

makyo 08-15-2010 08:33 AM

Hi.
Quote:

Originally Posted by johnsfine (Post 4066770)
Anyone have a guess why this command uses so much memory and time? ...

Briefly, I think the regular expression engine eats up the time, possibly also the memory.

In working on a similar question with less posted information, http://www.unix.com/shell-programmin...et-output.html , I compared diff, grep (plain), grep -F, awk, and comm for a 1000-line randomly-generated collection of sentences. The characteristics of this file were:
Code:

1154 lines read
 8031 words read
67975 chars read

    1154 newline
    7859 space
      65 !
      1 &
      31 '
      69 ,
    778 .
    159 ?
( the remainder being alphabetic characters )

In many of the comparisons, I used a line-by-line reverse of the file as the second file, so that the first line of the first file becomes the last line of the second file, and so on.

Using plain grep -- interpreting the lines as regular expressions -- never finished, although a subset of about 100 lines finished in about 4 seconds. Using grep -F allowed the process to complete the entire 1000-line files in a very short time, 0.08 seconds. I think the difference is the ".", "?", etc. special characters causing the engine to take a lot of extra time.

In my opinion, if nothing is known definitively beforehand about the files, and if the order of the results is immaterial (as the OP said), then
Code:

comm -3 sorted-file1 sorted-file2
is the best start for a comparison. That does no interpretation as regular expressions, consumes very little memory, and produces a list of the lines unique to each file. With the results from that, additional analysis might be done.

cheers, makyo

konsolebox 08-15-2010 10:07 AM

Quote:

Originally Posted by johnsfine (Post 4066770)
... gawk is a lot faster, but grep is fast enough that gawk isn't worth the bother. C++ would be a tiny bit faster than gawk, but that really isn't worth the bother. Using mmap instead of ordinary file I/O would make the C++ version an additional very tiny amount faster. Really, really not worth the bother.

For me it was worth it just for the fun, and it serves the purpose.

--- edit ----

Quote:

Originally Posted by johnsfine (Post 4066770)
BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files...

I'm really glad to hear that. Thanks for testing the script. :)

twaddlac 08-15-2010 12:38 PM

Thank you for all of your help! This was very insightful and I think that I have learned a few new tricks for future reference! :-)

Quote:

Originally Posted by ntubski (Post 4066393)
The correct way to use grep would be:
Code:

grep -v -F -f file2 file1
Note the order of file arguments.

This seemed to do the trick in a mere 5 seconds! I intend on trying the other suggestions as well just to get more experience! Thanks again everyone!! :-)

ntubski 09-26-2012 09:18 AM

Apologies for the necropost, but this thread was recently linked to and it seemed to me there were a few loose ends.

Quote:

Originally Posted by johnsfine (Post 4066770)
Code:

grep -L -f file_one.txt file_two.txt > output.output
Anyone have a guess why this command uses so much memory and time?

From the grep man page, I can see this command is basically useless. It certainly doesn't do anything like what twaddlac wants done. But I can't understand why it takes a lot of memory and time. Like twaddlac, I tried the command on large files and didn't wait for it to finish.

The correct command (provided by ntubski) runs fairly quickly
Code:

grep -v -F -f file2 file1
The incorrect command ought to preprocess the pattern list in a similar way and with similar performance. Then it ought to find a match almost immediately and stop there (because it is only supposed to report files with zero matches). So it should be faster than the correct command.

Without -F, grep assumes that the file contains regexp patterns. There is no useful way to sort regexp patterns in general, so it has to try each one in turn for every line of the other file. Obviously, grep could notice that these specific patterns use the particular subset of regexps that is just plain strings, but it doesn't bother doing that optimization.

Quote:

BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files. The grep command doesn't tell grep that the matches occur only on line boundaries. Lacking that information makes the problem take about 15 times more work (on the test data I used) than is required if you use that information.

grep also has the -x or --line-regexp option, which could help there. At the time it didn't occur to me to use it. GNU grep also has -w or --word-regexp, which can be useful in general, although in this case the patterns cover the whole line anyway.

