LinuxQuestions.org (/questions/)
-   Programming (https://www.linuxquestions.org/questions/programming-9/)
-   -   grep for HUGE files? (https://www.linuxquestions.org/questions/programming-9/grep-for-huge-files-826030/)

twaddlac 08-13-2010 11:22 AM

grep for HUGE files?
 
Hello,

I am trying to take a list of patterns from one file, grep them against another file, and print out only the unique patterns. Unfortunately, these files are so large that the command has yet to run to completion.

Here's the command that I used:

Code:

grep -L -f file_one.txt file_two.txt > output.output
Here's some example data:

Code:

>FQ4HLCS01BMR4N
>FQ4HLCS01BZNV6
>FQ4HLCS01B40PB
>FQ4HLCS01BT43K
>FQ4HLCS01CB736
>FQ4HLCS01BU3UM
>FQ4HLCS01BBIFQ

Any suggestions on how to increase efficiency or use another command?

Thank you very much!

Sergei Steshenko 08-13-2010 11:43 AM

Quote:

Originally Posted by twaddlac (Post 4065350)
I am trying to take a list of patterns from one file, grep them against another file, and print out only the unique patterns. Unfortunately, these files are so large that the command has yet to run to completion.

Here's the command that I used:

Code:

grep -L -f file_one.txt file_two.txt > output.output

Any suggestions on how to increase efficiency or use another command?

Consider using 'egrep' or 'grep -P' (the latter is for Perl-like regular expressions) and consider forming a regular expression.

ghostdog74 08-13-2010 11:50 AM

Quote:

Originally Posted by Sergei Steshenko (Post 4065370)
Consider using 'egrep' or 'grep -P' (the latter is for Perl-like regular expressions) and consider forming a regular expression.

egrep -> grep -E.
Grepping for the patterns of one file in another file is exactly what the -f option does, so I don't know why the suggestion to form a regex.

twaddlac 08-13-2010 01:40 PM

Quote:

Originally Posted by ghostdog74 (Post 4065374)
egrep -> grep -E.

I've tried these and the comparison still won't run to completion. One file is ~12MB and the other ~14MB, and I'm a little stumped as to why this isn't completing. I'm running a CentOS guest on a Windows 7 host, and I dedicated 4GB of RAM to the guest machine. You would think it should run with no problem, but I guess there's something going on under the hood that I don't know about.

Let me know if you have any ideas about how to speed this up! I looked into glimpse, but I had complications installing it and gave up; if it's worth it for this particular task, though, I may try installing it again.

Thank you very much again!

GrapefruiTgirl 08-13-2010 01:48 PM

You say that 'they won't (allow) run to completion' - what exactly *do* they do? Does the process hang? Or does it just go sooooo frikking slooow that you think it'll never end.. Or what - please explain. Do you get any output at all going into your output file?

How much RAM total is in the physical computer?

As far as `grep` is concerned, those file sizes are not huge. Even on a 32-bit platform you should be able to grep files larger than 2 GiB. Yes, it would seem there's something more going on 'under the hood'.

johnsfine 08-13-2010 01:59 PM

Quote:

Originally Posted by twaddlac (Post 4065350)
trying to take a list of patterns from one file, grep them against another file, and print out only the unique patterns.

What does that mean?

I read the description of -L in man grep to try to figure out what your command does as a "reverse engineering" approach to figuring out what you want to do. But as far as I understand the info in man grep, your command does nothing useful (even if it had time to finish).

Have you tested your grep command on smaller sample files?

Regarding your description of what you want to do, I think you might mean:

Print out each line of the first file whose contents do not appear anywhere in the second file.

If I've totally misunderstood you, can you describe more precisely what you want to do?

Quote:

Unfortunately, these files are so large that the command has yet to run to completion.
Quote:

Originally Posted by twaddlac (Post 4065451)
One file is ~12MB and the other ~14MB and I'm a little stumped as to why this isn't completing.

IIUC, each pattern is a line in the first file. Are the potential matches in the second file also lines (or at least at the start of lines) or might they appear mid line?

The basic task you are asking for is much bigger than you seem to realize. Take roughly 750K patterns and match each against up to 14 million potential positions in the second file. With no special optimization, that would take around ten trillion comparisons. I assume grep is at least a bit smarter than that, but I doubt it is particularly sophisticated about the special requirements of working with a very large number of patterns, so I wouldn't bet it is smart enough to do this job in a reasonable time.

A decent algorithm must first sort the set of patterns and while searching must take advantage of the fact that the list is sorted. I don't know whether grep does that when working with a list of patterns.

If the potential match positions are highly restricted (such as if they were all at the start of lines) a better algorithm would sort the second file. That is why I asked where the potential matches are.

If the potential matches are restricted enough and you sort both files (sorting the second on the boundaries of potential matches, increasing total size if such boundaries overlap), then you could easily use some faster tool to find the points of correspondence (or lack of correspondence) between the files.
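
For example, if the entries really are one per line in both files, sorting both files and comparing them with comm should do it (a rough sketch, untested on your data; the output names are just examples):

Code:

sort file_one.txt > one.sorted
sort file_two.txt > two.sorted
# comm -23 prints only the lines present in the first (sorted) file and absent from the second
comm -23 one.sorted two.sorted > unique.txt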

twaddlac 08-13-2010 02:17 PM

Quote:

Originally Posted by johnsfine (Post 4065470)
Regarding your description of what you want to do, I think you might mean:

Print out each line of the first file whose contents do not appear anywhere in the second file.

This is precisely what I want to do.

Quote:

Originally Posted by johnsfine (Post 4065470)
Are the potential matches in the second file also lines (or at least at the start of lines) or might they appear mid line?

They are exactly like the data in the first file - no mid-line matches, special characters, etc. As for the -L option, I thought it suppressed the common patterns and would print out only the unique patterns, but apparently I was mistaken. Let me know if you need any other info and if you have any optimization techniques!

ghostdog74 08-13-2010 09:07 PM

Show samples of your file1 and file2 input files, and show the desired output.

konsolebox 08-14-2010 12:48 AM

Do you still need the repeated lines in file1 if file1 has duplicates? Does it matter if the results are not in the same order as file1?

syg00 08-14-2010 02:11 AM

I too am unable to understand what you expect/hope to get as output.
I was prepared to credit grep with a bit more smarts than johnsfine does. Small test files somewhat supported that: it does 4k reads (not line at a time) on the input file containing the patterns, and 32k reads on the "target" file. I merely did some strace-ing, not profiling of the userland code. A larger pattern file (62k), however, drove memory remapping nuts - before the target file is even opened. In a virtual guest this might be an even bigger problem - who knows.
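
(For reference, numbers like that can be gathered with something along these lines - grep's own output is discarded so only the syscall trace remains:)

Code:

strace -e trace=open,read,mmap,munmap,mremap \
    grep -f file_one.txt file_two.txt > /dev/null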

Update: I just tried a similar test on an ArchLinux (non-X, CLI-only) VirtualBox guest on a Win7 64-bit host. Similar results, although timings were not noticeably worse. The host does not page.

twaddlac 08-14-2010 10:03 AM

Quote:

Originally Posted by GrapefruiTgirl (Post 4065459)
You say that 'they won't (allow) run to completion' - what exactly *do* they do? Does the process hang? Or does it just go sooooo frikking slooow that you think it'll never end.. Or what - please explain. Do you get any output at all going into your output file?

The process just hangs. The VM ends up freezing and aborting the process, and there is no output in the file I am printing to.

Quote:

Originally Posted by GrapefruiTgirl (Post 4065459)
How much RAM total is in the physical computer?

There is 4GB of RAM, and I try to boost the performance with 3.1GB of ReadyBoost. I thought that this would make a significant difference, allowing my VM to use ~3GB of memory, but it hasn't proven too advantageous.

Quote:

Originally Posted by ghostdog74
show samples of your file1 and file2 input files. and show desired output

The sample input data are in the original post. Likewise, the other input file looks exactly the same: >FQ4HLCS0*. As for the desired output, in context it would be something like this:

File 1:
a
b
c
d

File 2:
b
c
a
f

Output:
d

Omitting all the matched data and only printing out the unmatched data.

Quote:

Originally Posted by konsolebox
Do you still need the repeated lines in file1 if file1 has duplicates? Does it matter if the results are not in order based from file1?

No, I do not need the duplicated lines, and there isn't any duplicate data within one file. I don't care whether the results are in order or not, as long as I get the data.

Let me know if you have any more questions and thanks for your help so far!! :-)

GrapefruiTgirl 08-14-2010 10:23 AM

OK, so one problem is apparent immediately:

Using the sample data you gave in the above post, I created the two files and used the grep command from post #1, expecting to get the "d" output. But I got nothing instead.

The below however, does appear to work:
Code:

sasha@reactor: cat file1
a
b
c
d

sasha@reactor: cat file2
b
c
a
f

sasha@reactor: while read LINE; do [ ! "$(grep "${LINE}" file2)" ] && echo "${LINE}"; done < file1
d
sasha@reactor:
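
A variant of the same idea with -q (stop at the first match) and -xF (fixed-string, whole-line matching) - it still launches one grep per pattern, so it's only meant to show the logic, not to be fast on ~750K lines:

Code:

while read LINE; do
    grep -qxF "$LINE" file2 || echo "$LINE"
done < file1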


konsolebox 08-14-2010 10:55 AM

I expect things to go faster in awk. I made two scripts for that.
Code:

#!/usr/bin/env gawk

BEGIN {
        file1 = ARGV[1]
        file2 = ARGV[2]

        delete hash

        # hash every line of file1
        while ((getline < file1) > 0) {
                hash[$0] = 1
        }

        # drop every hashed line that also appears in file2
        while ((getline < file2) > 0) {
                if ($0 in hash) {
                        delete hash[$0]
                }
        }

        # whatever is left is unique to file1
        for (a in hash) {
                print a
        }

        exit (0)
}

Code:

#!/usr/bin/env gawk

BEGIN {
        file1 = ARGV[1]
        file2 = ARGV[2]

        # number of file1 lines hashed per pass over file2
        loopspersession = 100000

        if (!(loopspersession > 0)) {
                exit(1)
        }

        i = 0

        while (!i && (getline < file1) > 0) {
                delete hash

                hash[$0] = 1

                i = loopspersession - 1

                # hash up to loopspersession lines of file1
                while (i && (getline < file1) > 0) {
                        hash[$0] = 1
                        i--
                }

                # drop every hashed line that also appears in file2,
                # then close file2 so the next pass re-reads it
                while ((getline < file2) > 0) {
                        if ($0 in hash) {
                                delete hash[$0]
                        }
                }

                close (file2)

                for (a in hash) {
                        print a
                }
        }

        exit(0)
}

Both scripts can be run by
Code:

gawk -f <script> -- <file1> <file2> > <outputfile>
The first script is simpler and can work if you have enough memory. With the second script, you can divide the work into sessions. Just change the value of loopspersession.

I'm not sure if caching file2 will also make things faster.
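
With the file names from post #1, and assuming the first script is saved as unique.awk (the name is just an example), that would be:

Code:

gawk -f unique.awk -- file_one.txt file_two.txt > output.output

That should print the lines of file_one.txt that never appear in file_two.txt.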

wroom 08-14-2010 11:55 AM

It happens to me once in a while that grep fills up all eight gigs of RAM and the 24 gigs of swap without the query being particularly difficult. My guess is that it is not as efficient as it should be. grep does much reentrant execution, filling up the heap with intermediate data. Don't know why it is built this way, really.

One way to speed things up, if you have a multicore processor, is to split up the file containing the match patterns into several files and run as many greps in parallel as you have CPU cores.
I assume grep will take far less memory per process with fewer patterns to match?
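
The catch for this particular job is that inverted (-v) results from separate pattern chunks can't simply be concatenated, so it's easier to split the file whose unique lines you want and give each grep the full pattern file. A rough, untested sketch with the file names from post #1:

Code:

# split the queried file into chunks of 200000 lines (chunk_aa, chunk_ab, ...)
split -l 200000 file_one.txt chunk_

# one grep per chunk, run in the background; -F = fixed strings, -v = invert match
for f in chunk_*; do
    grep -v -F -f file_two.txt "$f" > "out_$f" &
done
wait

# collect the lines of file_one.txt that never matched anything in file_two.txt
cat out_chunk_* > output.txt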

And just to make you curious about some alternatives, I did ls -lF /bin/*grep /usr/bin/*grep on my SUSE machine and got:
Code:

-rwxr-xr-x 1 root root  86928 Jun  6  2008 /bin/egrep*
-rwxr-xr-x 1 root root  53872 Jun  6  2008 /bin/fgrep*
-rwxr-xr-x 1 root root 215088 Jun  6  2008 /bin/grep*
-rwxr-xr-x 1 root root   1677 Jun  6  2008 /usr/bin/bzgrep*
-rwxr-xr-x 1 root root  19736 Jun  7  2008 /usr/bin/deepgrep*
lrwxrwxrwx 1 root root     10 Apr 26 03:27 /usr/bin/egrep -> /bin/egrep*
lrwxrwxrwx 1 root root     10 Apr 26 03:27 /usr/bin/fgrep -> /bin/fgrep*
lrwxrwxrwx 1 root root      9 Apr 26 03:27 /usr/bin/grep -> /bin/grep*
-rwxr-xr-x 1 root root   1332 Jun  7  2008 /usr/bin/mgrep*
-rwxr-xr-x 1 root root  28880 Jun 23  2008 /usr/bin/pcregrep*
-r-xr-xr-x 2 root root  40800 Jun  6  2008 /usr/bin/pgrep*
-rwxr-xr-x 1 root root     66 Jan 20  2010 /usr/bin/zegrep*
-rwxr-xr-x 1 root root     66 Jan 20  2010 /usr/bin/zfgrep*
-rwxr-xr-x 1 root root   5002 Jan 20  2010 /usr/bin/zgrep*
-rwxr-xr-x 1 root root   1188 Jun  6  2008 /usr/bin/zipgrep*

So I guess "there is more than one way to skin a cat". ;)

ntubski 08-14-2010 02:43 PM

The correct way to use grep would be:
Code:

grep -v -F -f file2 file1
Note the order of file arguments.
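
On the a/b/c/d sample above, that prints just d. If an entry could ever match as a substring of a longer line, adding -x (whole-line match) alongside -F (fixed strings) avoids false hits:

Code:

grep -v -x -F -f file2 file1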

