08-14-2010, 03:10 PM
#16
Senior Member
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
My guess is that the problem with grep is that every time it fetches a pattern it re-reads the whole file, or does something equally expensive even if it keeps the whole file in memory. Or it could be the other way around: it simply allocates too much data in memory.
08-14-2010, 03:15 PM
#17
Senior Member
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
The approach I used in the awk scripts I posted could also be optimized further in C or C++ by using a hash table library like this one. At least in C you are free to read file1 and file2 into a large buffer, which can't really be done cleanly in awk, and things are more flexible there in general. gcc's optimizer might help as well.
It's possible that the code awk runs internally is already well optimized, but the more likely outcome is that a dedicated C/C++ program would still be faster.
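Just to illustrate the idea (an untested sketch; the file names are placeholders, and it uses the standard C++ std::unordered_set rather than an external hash table library): load every line of file2 into a hash set, then print the lines of file1 that are not in the set.
Code:
// Sketch: print the lines of file1 that do not appear anywhere in file2.
// Assumes both files are plain line-oriented text; names are placeholders.
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_set>

int main(int argc, char **argv)
{
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " file1 file2\n";
        return 1;
    }

    std::unordered_set<std::string> patterns;
    std::string line;

    std::ifstream f2(argv[2]);
    while (std::getline(f2, line))
        patterns.insert(line);              // one hash entry per line of file2

    std::ifstream f1(argv[1]);
    while (std::getline(f1, line))
        if (patterns.find(line) == patterns.end())
            std::cout << line << '\n';      // not found in file2
    return 0;
}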
Last edited by konsolebox; 08-14-2010 at 03:17 PM.
08-14-2010, 04:45 PM
#18
LQ Guru
Registered: Dec 2007
Distribution: Centos
Posts: 5,286
Quote:
Originally Posted by ntubski
The correct way to use grep would be:
Code:
grep -v -F -f file2 file1
Note the order of file arguments.
|
I think that is all twaddlac needs.
I was too pessimistic about the algorithms in grep. When presented with a massive number of patterns, it must be sorting or otherwise indexing them first and then exploiting that structure (or it couldn't achieve the performance I just measured). As I said earlier, this task is better done by sorting file2, which you have effectively accomplished by telling grep to use file2 as the patterns.
I made two files of N 15-character items each, where N-10 items were shared between the two files (in a different order) and 10 items scattered around each file were unique to that file. I then verified that your command was fast and correct at finding the 10 unique items.
N=10,000: 0.25 seconds
N=100,000: 1.9 seconds
N=1,000,000: 21 seconds
If grep weren't doing a good job of preprocessing the patterns, then N=1,000,000 would run 100 times slower than N=100,000 rather than just over 10 times slower.
With N=1,000,000 my test is roughly the same size task as twaddlac wants, and it only took 21 seconds.
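For reference, that kind of test data only takes a few lines of code to generate; this is an untested sketch of what I described (N, the seed, the alphabet, and the file names are arbitrary placeholders):
Code:
// Sketch: build file1 and file2 with N random 15-character lines each,
// N-10 of them shared (shuffled into a different order) and 10 unique per file.
// Collisions between random items are astronomically unlikely and are ignored.
#include <algorithm>
#include <fstream>
#include <random>
#include <string>
#include <vector>

static std::string random_item(std::mt19937 &rng)
{
    static const char alphabet[] = "abcdefghijklmnopqrstuvwxyz0123456789";
    std::uniform_int_distribution<int> pick(0, sizeof(alphabet) - 2);
    std::string s;
    for (int i = 0; i < 15; ++i)
        s += alphabet[pick(rng)];
    return s;
}

int main()
{
    const int N = 1000000;
    std::mt19937 rng(42);

    std::vector<std::string> shared;
    for (int i = 0; i < N - 10; ++i)
        shared.push_back(random_item(rng));

    for (const char *name : {"file1", "file2"}) {
        std::vector<std::string> lines = shared;
        for (int i = 0; i < 10; ++i)
            lines.push_back(random_item(rng));   // items unique to this file
        std::shuffle(lines.begin(), lines.end(), rng);
        std::ofstream out(name);
        for (const std::string &s : lines)
            out << s << '\n';
    }
    return 0;
}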
2 members found this post helpful.
08-14-2010, 05:53 PM
#19
Member
Registered: Dec 2009
Location: Sweden
Posts: 159
Quote:
Originally Posted by konsolebox
...optimized further in C or C++...At least in C you are free to read file1 and file2 into a large buffer...
|
And instead of reading the files into buffers, one can use mmap() to map the whole file, or part of it, into memory. mmap() can be called with different options: the file itself can serve as the backing store for that memory instead of the swap file, with blocks of the file automatically paged in and out of memory depending on access, or, if one wants, the complete file can be locked in memory.
With mmap one can map a file much larger than physical memory.
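To make that concrete, here is a minimal, untested sketch of the read-only case (the file name is a placeholder): the whole file is mapped in one call and the kernel pages it in on demand as the loop touches it.
Code:
// Sketch: map a file read-only and count its lines.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <iostream>

int main()
{
    int fd = open("file1", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
    if (st.st_size == 0) return 0;          // mmap() rejects zero-length mappings

    void *map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    const char *data = static_cast<const char *>(map);

    long lines = 0;
    for (off_t i = 0; i < st.st_size; ++i)  // pages are faulted in as we touch them
        if (data[i] == '\n')
            ++lines;
    std::cout << lines << " lines\n";

    munmap(map, st.st_size);
    close(fd);
    return 0;
}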
1 members found this post helpful.
08-14-2010, 06:14 PM
#20
Senior Member
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Quote:
Originally Posted by wroom
And instead of reading the files into buffers, one can use mmap() to map the whole file, or part of it, into memory. mmap() can be called with different options: the file itself can serve as the backing store for that memory instead of the swap file, with blocks of the file automatically paged in and out of memory depending on access, or, if one wants, the complete file can be locked in memory.
With mmap one can map a file much larger than physical memory.
|
Thanks for the info. That should be helpful in many cases, especially with very large files. With that, is there no performance loss compared to stream-like buffered reads, or is it mainly a win for random access, even though it means a large chunk of memory ends up being used to cache the file? I'm now wondering how access to the data is handled by the virtual memory manager compared to ordinary buffers, and which of the two is quicker to access.
08-14-2010, 07:46 PM
#21
Member
Registered: Dec 2009
Location: Sweden
Posts: 159
Paging is rather quick, and mmap() uses the same mechanism to read in (and, when that option is specified, write out) blocks from/to the mapped file.
mmap can be instructed to read ahead for sequential access to a file, and on the other hand can be told to limit readahead for best performance with random access.
Linux is very good at reading and writing blocks that are a multiple of _SC_PAGE_SIZE (4096 bytes), and mmap() uses the same mechanism as normal paging. Instead of reading a file into a stream buffer and then copying it into the buffer the program allocated, keeping track of how full the stream buffer is and re-blocking to whatever read size the program requested on each access, mmap reads blocks directly into the mapped address space. And if the mmap() region is mapped read-only, or there are no updates to write out from a read/write mapping, the page simply "vanishes" from physical memory when the OS needs to evict it.
The performance of being able to mmap() a large chunk of a disk into memory and have, say, four or six parallel threads searching for redundancy matches (or match patterns) in the same shared region of memory is extreme.
And if the application can discern which parts of the mmapped region are the most important, it can request that those blocks be read in and locked in memory.
One can also mmap the same file twice, and have one linear mapping of the file while at the same time working with a relocated view of the same file using remap_file_pages().
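Roughly, those hints look like this in code (an untested sketch; `data` and `length` are assumed to come from an earlier mmap() call such as the one in the example above):
Code:
// Sketch: advise the kernel about the expected access pattern of a mapped
// region, and lock the most important part of it in memory.
#include <sys/mman.h>
#include <cstddef>

void tune_mapping(void *data, std::size_t length, void *hot_start, std::size_t hot_len)
{
    // Expect a sequential scan: ask for aggressive readahead.
    madvise(data, length, MADV_SEQUENTIAL);

    // For scattered lookups one would instead limit readahead:
    // madvise(data, length, MADV_RANDOM);

    // Keep the part we care about most resident in physical memory.
    // (mlock may require privileges or a raised RLIMIT_MEMLOCK.)
    mlock(hot_start, hot_len);
}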
1 members found this post helpful.
08-15-2010, 06:15 AM
#22
LQ Guru
Registered: Dec 2007
Distribution: Centos
Posts: 5,286
Quote:
Originally Posted by twaddlac
Code:
grep -L -f file_one.txt file_two.txt > output.output
|
Anyone have a guess why this command uses so much memory and time?
From the grep man page, I can see this command is basically useless. It certainly doesn't do anything like what twaddlac wants done. But I can't understand why it takes a lot of memory and time. Like twaddlac, I tried the command on large files and didn't wait for it to finish.
The correct command (provided by ntubski) runs fairly quickly
Code:
grep -v -F -f file2 file1
The incorrect command ought to preprocess the pattern list in a similar way and with similar performance. Then it ought to find a match almost immediately and stop there (because it is only supposed to report files with zero matches). So it should be faster than the correct command.
BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files. The grep command doesn't tell grep that the matches occur only on line boundaries. Lacking that information makes the problem take about 15 times more work (on the test data I used) than is required if you use that information.
But since grep is fast enough for the specified file size, there is no need to write a script to save 17 of the 21 seconds of execution time.
Even more so, there is no need to code the program in a faster language to save a tinier fraction of the execution time.
Quote:
Originally Posted by wroom
And instead of reading the files into buffers, one can use mmap() to map the whole file, or part of it, into memory.
|
That is going very far into an advanced performance optimization that would save a very tiny fraction of the total execution time. The saved CPU time of copying from the buffer is tiny either compared to the processing time of what must be done to the patterns after reading them, or compared to the disk access time of the actual read.
We're talking about 26MB of data split across two files. That is a trivial amount as long as your algorithm doesn't introduce N-squared processing (see post #12 for the simple N-squared approach that would take far too long on files this size).
Without the N-squared problem, grep is fast enough. gawk is a lot faster, but grep is fast enough that gawk isn't worth the bother. C++ would be a tiny bit faster than gawk, but that really isn't worth the bother. Using mmap instead of ordinary file I/O would make the C++ version an additional very tiny amount faster. Really, really not worth the bother.
Last edited by johnsfine; 08-15-2010 at 06:39 AM.
3 members found this post helpful.
08-15-2010, 08:33 AM
#23
Member
Registered: Aug 2006
Location: Saint Paul, MN, USA
Distribution: {Free,Open}BSD, CentOS, Debian, Fedora, Solaris, SuSE
Posts: 735
Hi.
Quote:
Originally Posted by johnsfine
Anyone have a guess why this command uses so much memory and time? ...
|
Briefly, I think the regular expression engine eats up the time, possibly also the memory.
In working on a similar question with less posted information, http://www.unix.com/shell-programmin...et-output.html , I compared diff, grep (plain), grep -F, awk, and comm for a 1000-line randomly-generated collection of sentences. The characteristics of this file were:
Code:
1154 lines read
8031 words read
67975 chars read
1154 newline
7859 space
65 !
1 &
31 '
69 ,
778 .
159 ?
( the remainder being alphabetic characters )
In many of the comparisons, I used a line-by-line reverse of the file as the second file, so that the first line of the first file becomes the last line of the second file, and so on.
Using plain grep -- interpreting the lines as regular expressions -- never finished, although a subset of about 100 lines finished in about 4 seconds. Using grep -F allowed the process to complete the entire 1000-line files in a very short time, 0.08 seconds. I think the difference is the ".", "?", etc. special characters causing the engine to take a lot of extra time.
In my opinion, if nothing is known definitively beforehand about the files, and if the order of the results is immaterial (as the OP said), then
Code:
comm -3 sorted-file1 sorted-file2
is the best start for a comparison. That does no interpretation as regular expressions, will consume very little memory, and produces a list of unique items in both files. With the results from that, then additional analysis might be done.
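For what it's worth, the reason comm needs so little memory is that it only ever holds the current line from each sorted file while walking the two in lockstep. A rough, untested C++ equivalent of comm -3 (placeholder file names; inputs assumed already sorted in plain byte order):
Code:
// Sketch: print lines unique to sorted-file1 (column 1) and lines unique to
// sorted-file2 (column 2, tab-indented), suppressing the common lines.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream a("sorted-file1"), b("sorted-file2");
    std::string la, lb;
    bool hasA = static_cast<bool>(std::getline(a, la));
    bool hasB = static_cast<bool>(std::getline(b, lb));

    while (hasA && hasB) {
        if (la == lb) {                         // common line: skip it
            hasA = static_cast<bool>(std::getline(a, la));
            hasB = static_cast<bool>(std::getline(b, lb));
        } else if (la < lb) {                   // unique to file 1
            std::cout << la << '\n';
            hasA = static_cast<bool>(std::getline(a, la));
        } else {                                // unique to file 2
            std::cout << '\t' << lb << '\n';
            hasB = static_cast<bool>(std::getline(b, lb));
        }
    }
    while (hasA) { std::cout << la << '\n'; hasA = static_cast<bool>(std::getline(a, la)); }
    while (hasB) { std::cout << '\t' << lb << '\n'; hasB = static_cast<bool>(std::getline(b, lb)); }
    return 0;
}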
cheers, makyo
3 members found this post helpful.
08-15-2010, 10:07 AM
#24
Senior Member
Registered: Oct 2005
Distribution: Gentoo, Slackware, LFS
Posts: 2,248
Quote:
Originally Posted by johnsfine
... gawk is a lot faster, but grep is fast enough that gawk isn't worth the bother. C++ would be a tiny bit faster than gawk, but that really isn't worth the bother. Using mmap instead of ordinary file I/O would make the C++ version an additional very tiny amount faster. Really, really not worth the bother.
|
For me it was just worth the fun and serves the purpose.
--- edit ----
Quote:
Originally Posted by johnsfine
BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files...
|
I'm really glad to hear that. Thanks for testing the script.
Last edited by konsolebox; 08-15-2010 at 10:16 AM.
08-15-2010, 12:38 PM
#25
Member
Registered: Jun 2010
Posts: 34
Original Poster
Thank you for all of your help! This was very insightful and I think that I have learned a few new tricks for future reference! :-)
Quote:
Originally Posted by ntubski
The correct way to use grep would be:
Code:
grep -v -F -f file2 file1
Note the order of file arguments.
|
This seemed to do the trick in a mere 5 seconds! I intend to try the other suggestions as well just to get more experience! Thanks again everyone!! :-)
09-26-2012, 09:18 AM
#26
Senior Member
Registered: Nov 2005
Distribution: Debian, Arch
Posts: 3,794
Apologies for the necropost, but this thread was recently linked to and it seemed to me there were a few loose ends.
Quote:
Originally Posted by johnsfine
Code:
grep -L -f file_one.txt file_two.txt > output.output
Anyone have a guess why this command uses so much memory and time?
From the grep man page, I can see this command is basically useless. It certainly doesn't do anything like what twaddlac wants done. But I can't understand why it takes a lot of memory and time. Like twaddlac, I tried the command on large files and didn't wait for it to finish.
The correct command (provided by ntubski) runs fairly quickly
Code:
grep -v -F -f file2 file1
The incorrect command ought to preprocess the pattern list in a similar way and with similar performance. Then it ought to find a match almost immediately and stop there (because it is only supposed to report files with zero matches). So it should be faster than the correct command.
|
Without -F, grep assumes that the file contains regexp patterns. There is no useful way to sort regexp patterns in general, so it has to try each one in turn for every line of the other file. Obviously, grep could notice that these specific patterns use the particular subset of regexps that is just plain strings, but it doesn't bother with that optimization.
Quote:
BTW, the simple script posted by konsolebox runs almost ten times faster than the grep command on the large files. The grep command doesn't tell grep that the matches occur only on line boundaries. Lacking that information makes the problem take about 15 times more work (on the test data I used) than is required if you use that information.
|
grep also has the -x or --line-regexp option, which could help there; at the time it didn't occur to me to use it. GNU grep also has -w or --word-regexp, which can be useful in general, although in this case the patterns cover the whole line anyway.
1 members found this post helpful.