Linux - Newbie: This Linux forum is for members that are new to Linux.
Just starting out and have a question?
If it is not in the man pages or the how-to's this is the place!
Simple, but not very fast, so if the files are large, I'd try awk or something.
When you give grep a list of regexps it checks each one for every line, so the runtime is O(Pn) (P is the number of patterns, n is number of lines to search in). This will be much faster with -F because then grep knows it has just plain strings and uses a much faster algorithm which is O(P+n). However, since we want to find occurrences only at the beginning of lines we can't use that in this case.
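To illustrate the trade-off, here is a small sketch (the file names and contents are invented for the example): with -F grep matches the fixed strings anywhere in a line using its fast algorithm, while anchoring the patterns forces the slower regexp engine.

```shell
printf 'abs\nbat\nball\ncar\n' > file1            # the keywords
printf 'abs\nbata\ncricket-bat\nballz\n' > file2  # the lines to search
# -F: fast fixed-string matching, but it hits keywords anywhere in a line
grep -cF -f file1 file2                 # prints 4: every line contains a keyword
# anchoring restricts matches to line starts, but loses the -F fast path
sed 's/^/^/' file1 | grep -c -f - file2 # prints 3: cricket-bat no longer matches
```

Note how the anchored count differs: "cricket-bat" contains "bat" but does not start with it.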
Here is an awk program which combines all the keywords into a single regexp, so that the search should be O(P+n). (It doesn't work, as millgates points out.)
Code:
#!/usr/bin/awk -f
NR == FNR {
    for (i = 1; i <= length($0); i++) {
        char = substr($0, i, 1);
        if (!index(charsets[i], char))
            charsets[i] = charsets[i] char;
    }
}

function regexp_range(charset,    i, c, reg_range) {
    for (i = 1; i <= length(charset); i++) {
        c = substr(charset, i, 1);
        if (index("\\]-^", c))
            reg_range = reg_range "\\" c;
        else
            reg_range = reg_range c;
    }
    return "[" reg_range "]";
}

NR != FNR && !kw_regexp {
    kw_regexp = "^";
    for (i = 1; i in charsets; i++) {
        kw_regexp = kw_regexp regexp_range(charsets[i])
    }
    # print kw_regexp ; exit
}

NR != FNR && match($0, kw_regexp) {
    kw[substr($0, RSTART, RLENGTH)]++;
}

END {
    for (w in kw) { print w, kw[w]; }
}
Last edited by ntubski; 04-18-2013 at 10:55 AM.
Reason: note my script doesn't work
I ran the above awk script and got this error:
awk: linux.awk:19: (FILENAME=zonecrap FNR=1) fatal: Invalid range end: /^[01#23456789abcdefghijklmnopqrstuvwxyz][291s0cft6rdpxqabeghijklmnouvwz-y][id5679ct01oalwsupmhzrenqvgfbjykx�][ra0684hbfikmv573osculeytdngwqjpxz-129][enlfviaywhkmbpustordxcgjzq12356ı4897/
Actually, we do not know what your problem is; probably you do not have enough memory. Splitting file2 may help with that, but if you want a really efficient solution you would need to sort file1 and file2. You can simply execute:
sort file1 > file1.sorted
sort file2 > file2.sorted
to check how much time they need.
After that sort there can be a very quick and efficient solution... (without splitting)
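To sketch what that quick solution could look like (assuming, additionally, that no keyword is itself a prefix of another keyword): with both files sorted, a single merge-style pass works, because once a file2 line sorts past every possible extension of a keyword, that keyword can never match a later line. The file names and contents here are made up for the example.

```shell
printf 'abs\nball\nbat\ncar\n' > file1.sorted           # sorted keywords
printf 'abs\nballz\nbata\ncricket-bat\n' > file2.sorted # sorted lines
awk '
BEGIN { j = 1 }
NR == FNR { key[++m] = $0; cnt[m] = 0; next }   # load sorted keywords
{
    # drop keywords that no later (sorted) line can start with
    while (j <= m && key[j] < $0 && substr($0, 1, length(key[j])) != key[j])
        j++
    if (j <= m && substr($0, 1, length(key[j])) == key[j])
        cnt[j]++
}
END { for (i = 1; i <= m; i++) print key[i], cnt[i] }
' file1.sorted file2.sorted
```

This reads each file exactly once, so it is O(P+n) in lines touched; the caveat is that nested keywords (say, "ba" and "ball") would break the single-match assumption.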
Also, it would be nice to know:
Do keywords occur at the beginning of lines (in file2), only once in a line, or randomly?
Does every keyword from file1 exist in file2, or are a few of them missing?
I think we can help you to solve it, just give us more info (instead of paying).
(Your example is not a sorted list.)
To answer your questions:
In file2, keywords occur one on each line.
Files 1 and 2 are sorted and contain only unique keywords.
Not every keyword in file1 is in file2.
example file
abc
bal
cat
dog
etc
Please allow me to explain my requirements once more in detail.
file2 has 100 million keywords.
file1 has 20 million.
example
file1
abs
bat
ball
car
file2
abs
bata
cricket-bat
ballz
I want the script to take every keyword from file1 and count how many keywords in file2 begin with that keyword.
result is
abs 1
bat 1
ball 1
car 0
I hope you people can help me crack this.
I've been working on this project for weeks.
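For what it's worth, here is one way to do exactly that, as a sketch (untested at this scale) using the small example files above: load the file1 keywords into an awk array and test every prefix of each file2 line against it. It needs enough memory to hold all 20 million keywords (likely several GB), but it does not require sorted input.

```shell
printf 'abs\nbat\nball\ncar\n' > file1            # example keywords
printf 'abs\nbata\ncricket-bat\nballz\n' > file2  # example lines
awk '
NR == FNR {                 # first file: store keywords, track the longest
    kw[$0] = 0
    if (length($0) > maxlen) maxlen = length($0)
    next
}
{                           # second file: test each prefix of the line
    n = length($0) < maxlen ? length($0) : maxlen
    for (i = 1; i <= n; i++)
        if (substr($0, 1, i) in kw)
            kw[substr($0, 1, i)]++
}
END { for (w in kw) print w, kw[w] }
' file1 file2 | sort
```

On the example data this prints abs 1, ball 1, bat 1, car 0, matching the expected result; each file2 line costs at most one hash lookup per character up to the longest keyword.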
Quote:
i ran the above awk script and got this error
awk: linux.awk:19: (FILENAME=zonecrap FNR=1) fatal: Invalid range end: /^[01#23456789abcdefghijklmnopqrstuvwxyz][291s0cft6rdpxqabeghijklmnouvwz-y][id5679ct01oalwsupmhzrenqvgfbjykx�][ra0684hbfikmv573osculeytdngwqjpxz-129][enlfviaywhkmbpustordxcgjzq12356ı4897/
Oh, I was a bit sloppy: my script didn't escape special characters. I'll edit my post with a correction, but probably the scripts that take advantage of the fact you have sorted files will be faster and not wrong.
Quote:
let's see how much time it takes to compare the big files.
You might try running things on a bit less than the whole big file, say just the first 100MB to get an idea of how fast it will be for the entire thing.
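For example (a sketch: seq just manufactures stand-in data, and the wc call stands in for whatever script is actually being timed):

```shell
# Build a stand-in for the real big file, then slice off its first part.
seq 1 100000 > file2
head -c 100000000 file2 > file2.sample   # at most the first ~100 MB
time wc -l < file2.sample                # substitute the real awk run here
```

head -c may cut the final line in half, which is harmless for a timing estimate; multiply the measured time by the size ratio to project the full run.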
Last edited by ntubski; 04-18-2013 at 10:53 AM.
Reason: note my script is wrong