The problem with grep is probably that every time it fetches a pattern it re-reads the whole file, or does something similarly expensive even when it holds the whole file in memory. Or it could be the other way around: it allocates too much data in memory.
The approach behind the awk scripts I presented could be optimized further in C or C++ using hash table libraries like this one. In C, at least, you are free to load the data from file1 and file2 into one large buffer, which can't really be done cleanly in awk, and things are generally more flexible there. gcc's optimizer might also help.
It's possible that the code awk generates is already well optimized, but let's assume the more likely case that it isn't.
Quote:
I was too pessimistic about the algorithms in grep. When presented with a massive number of patterns, it must sort them first and then exploit the fact that they're sorted (otherwise it couldn't achieve the performance I just measured). As I said earlier, this task is better done by sorting file2, which you have effectively accomplished by telling grep to use file2 as the patterns.

I made two files of N 15-character items each, where N-10 items were shared between the two files (in a different order) and 10 items scattered through each file were unique to that file. I then verified that your command was fast and correct at finding the 10 unique items:

N=10,000: 0.25 seconds
N=100,000: 1.9 seconds
N=1,000,000: 21 seconds

If grep weren't doing a good job of preprocessing the patterns, then N=1,000,000 would run 100 times slower than N=100,000 rather than just over 10 times slower. With N=1,000,000 my test is roughly the same size task as twaddlac wants, and it only took 21 seconds.
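A test along these lines can be reproduced in the shell. The sketch below is illustrative, not the original test harness; the filenames and N are made up, and it assumes GNU coreutils (shuf, fold):

```shell
# Build two files of N random 15-character items that share all but 10
# lines, then time grep finding the lines unique to file1.
N=10000
LC_ALL=C tr -dc 'a-z0-9' < /dev/urandom | fold -w 15 | head -n "$N" > base.txt
shuf base.txt > file1                        # all N items, shuffled
shuf base.txt | head -n $((N - 10)) > file2  # drop 10 items at random
time grep -v -F -f file2 file1               # prints the 10 items unique to file1
```

Scaling N up by 10 and watching whether the runtime grows by roughly 10x (linear-ish) or 100x (quadratic) is the same check described above.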
Quote:
With mmap one can map a file much larger than physical memory into virtual address space.
Paging is rather quick, and mmap() uses the same mechanism to read blocks in from the mapped file (and to write them back out, when that option is specified).
mmap can be instructed to read ahead for sequential access to a file, or told to limit the readahead for best performance with random access. Linux is very good at reading and writing blocks that are a multiple of _SC_PAGE_SIZE (4096 bytes), and mmap() uses the same mechanism as paging memory. Instead of reading a file into a stream buffer and then copying it to the user's allocated buffer, while keeping track of the stream buffer's fill level and reblocking to the read size the program requested for each access, mmap reads blocks directly into the mapped address space. And if the mmap() region is mapped read-only, or if there are no updates to write out from a read/write mapped region, then the page block simply vanishes from physical memory if the OS needs to reclaim it.

The performance of being able to mmap() a large chunk of a disk into memory and have, say, four or six parallel threads search for redundancy matches (or match patterns) in the same shared region of memory is extreme. And if the application can discern which parts of the mmapped region are the most important, it can request that those blocks be read in and locked in memory. One can also mmap the same file twice, and have one linear mapping of the file while at the same time working with a relocated view of the same file using remap_file_pages().
Quote:
From the grep man page, I can see this command is basically useless. It certainly doesn't do anything like what twaddlac wants done, but I can't understand why it takes so much memory and time. Like twaddlac, I tried the command on large files and didn't wait for it to finish. The correct command (provided by ntubski) runs fairly quickly: Code:
grep -v -F -f file2 file1

BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files. The grep command doesn't tell grep that the matches occur only on line boundaries. Lacking that information makes the problem take about 15 times more work (on the test data I used) than is required if you use that information. But since grep is fast enough for the specified file size, there is no need to write a script to save 17 of the 21 seconds of execution time. Even more so, there is no need to code the program in a faster language to save a tinier fraction of the execution time. Quote:
We're talking about 26MB of data split across two files. That is a trivial amount as long as your algorithm doesn't introduce N-squared processing (see post #12 for the simple N-squared approach that would take far too long on files this size). Without the N-squared problem, grep is fast enough. gawk is a lot faster, but grep is fast enough that gawk isn't worth the bother. C++ would be a tiny bit faster than gawk, but that really isn't worth the bother. Using mmap instead of ordinary file I/O would make the C++ version an additional very tiny amount faster. Really, really not worth the bother.
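For reference, the hash-based awk approach discussed in this thread is typically a one-liner along these lines (a sketch with made-up sample data; konsolebox's exact script is not reproduced here):

```shell
# Hash join: load every line of file2 into an awk array, then print the
# lines of file1 that are absent from it. One pass over each file, so
# the work is linear in the input size rather than N-squared.
printf '%s\n' apple cherry date > file2            # illustrative data
printf '%s\n' apple banana cherry fig > file1
awk 'NR==FNR { seen[$0]; next } !($0 in seen)' file2 file1
# prints: banana
#         fig
```

Each line of file2 becomes a key in the seen array, so checking a line of file1 is a constant-time hash probe rather than a scan. Separately, grep's -x option is one way to tell grep that matches occur only on line boundaries, the information noted as missing above.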
Hi.
Quote:
In working on a similar question with less posted information, http://www.unix.com/shell-programmin...et-output.html , I compared diff, grep (plain), grep -F, awk, and comm for a 1000-line randomly-generated collection of sentences. The characteristics of this file were: Code:
1154 lines read

Using plain grep -- interpreting the lines as regular expressions -- never finished, although a subset of about 100 lines finished in about 4 seconds. Using grep -F allowed the process to complete the entire 1000-line files in a very short time, 0.08 seconds. I think the difference is the ".", "?", etc. special characters causing the regex engine to take a lot of extra time. In my opinion, if nothing is known definitively beforehand about the files, and if the order of the results is immaterial (as the OP said), then Code:
comm -3 sorted-file1 sorted-file2

cheers, makyo
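Since comm requires its inputs to be sorted, the sorting can be folded into the same command line with bash process substitution. A minimal sketch with illustrative data:

```shell
# comm -3 suppresses the lines common to both files, leaving only the
# differences: column 1 holds lines unique to file1, column 2
# (tab-indented) holds lines unique to file2. Requires bash for <( ).
printf '%s\n' b a c > file1   # illustrative data
printf '%s\n' c a d > file2
comm -3 <(sort file1) <(sort file2)
```

Unlike the grep and awk approaches, this one discards the original line order, which is fine here since the OP said order is immaterial.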
Thank you for all of your help! This was very insightful and I think that I have learned a few new tricks for future reference! :-)
Apologies for the necropost, but this thread was recently linked to and it seemed to me there were a few loose ends.