LinuxQuestions.org
Linux - Newbie This Linux forum is for members that are new to Linux.
Old 04-17-2013, 10:42 AM   #31
castor0troy
LQ Newbie
 
Registered: Apr 2013
Posts: 27

Original Poster
Rep: Reputation: Disabled

Hello guys,
I'm back with this problem of using the awk command on very big files.

millgates: I tried your awk script and have been running it for 2 days, but there are no results yet.
It works for small files.

file1 is 20 MB and file2 is 2 GB.

Can we split file2 alphabetically and scan the pieces faster?
 
Old 04-17-2013, 10:50 AM   #32
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,129

Rep: Reputation: 2271
To speed it up you can try sorting the files (then much better scripts can be used)... but I do not know if that will work.
 
Old 04-17-2013, 01:55 PM   #33
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Arch
Posts: 3,013

Rep: Reputation: 1225
Quote:
Originally Posted by millgates
maybe something like

Code:
grep -of input.txt file.txt|sort|uniq -c >output.txt
simple, but not very fast, so if the files are large, I'd try awk or something.
When you give grep a list of regexps, it checks each one against every line, so the runtime is O(Pn) (P is the number of patterns, n is the number of lines to search). This would be much faster with -F, because then grep knows it has just plain strings and uses a much faster algorithm which is O(P+n). However, since we want to find occurrences only at the beginning of lines, we can't use that in this case.

Here is an awk program which combines all the keywords into a single regexp so that the search should be O(P+n):
Edit: this doesn't work, as millgates points out.

Code:
#!/usr/bin/awk -f

# First file (keywords): collect the characters seen at each position.
NR == FNR {
    for (i = 1; i <= length($0); i++) {
        char = substr($0, i, 1);
        if (!index(charsets[i], char))
            charsets[i] = charsets[i] char;
    }
}

# Turn one character set into a bracket expression, escaping \ ] - and ^.
function regexp_range(charset,    i, c, reg_range) {
    for (i = 1; i <= length(charset); i++) {
        c = substr(charset, i, 1);
        if (index("\\]-^", c))
            reg_range = reg_range "\\" c;
        else
            reg_range = reg_range c;
    }
    return "[" reg_range "]";
}

# On the first line of the second file, build the combined regexp once.
NR != FNR && !kw_regexp {
    kw_regexp = "^";
    for (i = 1; i in charsets; i++) {
        kw_regexp = kw_regexp regexp_range(charsets[i])
    }
    # print kw_regexp ; exit
}

# Second file: count each matched prefix.
NR != FNR && match($0, kw_regexp) {
    kw[substr($0, RSTART, RLENGTH)]++;
}

# Print every matched prefix with its count.
END {
    for(w in kw) {print w, kw[w];}
}

Last edited by ntubski; 04-18-2013 at 11:55 AM. Reason: note my script doesn't work
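
A note on one way this kind of regexp goes wrong (a minimal sketch in Python rather than awk, and not necessarily the exact problem millgates pointed out): one character set per position accepts any mix of those characters, not just the original keywords. The function name positional_regexp and the sample keywords are made up for illustration.

```python
import re

def positional_regexp(keywords):
    """Build ^[set1][set2]... from the characters seen at each position,
    mirroring the idea of the awk script above."""
    charsets = []
    for word in keywords:
        for i, ch in enumerate(word):
            if i == len(charsets):
                charsets.append("")
            if ch not in charsets[i]:
                charsets[i] += ch
    # re.escape protects characters such as - ] ^ inside the brackets.
    return "^" + "".join("[" + re.escape(s) + "]" for s in charsets)

pattern = positional_regexp(["car", "bat"])
print(pattern)
# "cat" is not a keyword, yet the combined pattern accepts it.
print(bool(re.match(pattern, "cat")))
```

With the keywords car and bat the pattern also accepts cat and bar, so counts would be credited to strings that were never in file1.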
 
Old 04-18-2013, 02:16 AM   #34
castor0troy
LQ Newbie
 
Registered: Apr 2013
Posts: 27

Original Poster
Rep: Reputation: Disabled
Thanks for this new awk script.

file1 has 10 million keywords.
file2 has 100 million keywords.

Will this new awk script work for big files?

I was thinking we could split file2 alphabetically and then match each keyword in file1 against the alphabetically arranged file2.

Won't that be faster and more efficient?
 
Old 04-18-2013, 02:23 AM   #35
castor0troy
LQ Newbie
 
Registered: Apr 2013
Posts: 27

Original Poster
Rep: Reputation: Disabled
I ran the above awk script and got this error:
awk: linux.awk:19: (FILENAME=zonecrap FNR=1) fatal: Invalid range end: /^[01#23456789abcdefghijklmnopqrstuvwxyz][291s0cft6rdpxqabeghijklmnouvwz-y][id5679ct01oalwsupmhzrenqvgfbjykx�][ra0684hbfikmv573osculeytdngwqjpxz-129][enlfviaywhkmbpustordxcgjzq12356ı4897/
 
Old 04-18-2013, 02:26 AM   #36
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,129

Rep: Reputation: 2271
Actually, we do not know what your problem is; probably you do not have enough memory. Splitting file2 may help with this, but if you want a really efficient solution you would need to sort file1 and file2. You can simply execute:
sort file1 > file1.sorted
sort file2 > file2.sorted
to check how much time they need.
After that sort there can be a very quick and efficient solution (without splitting).
It would also be nice to know:
do keywords occur at the beginning of lines, only once per line, or at random positions?
 
Old 04-18-2013, 02:46 AM   #37
castor0troy
LQ Newbie
 
Registered: Apr 2013
Posts: 27

Original Poster
Rep: Reputation: Disabled
Actually, file1 and file2 are already sorted using sort and uniq.


A file would look like:
car
loans
auto
etc

One keyword on each line.



I've tried all the grep and awk scripts given here, but they don't work.
I am also willing to pay if someone can get this job done for me.


Thanks, people.
 
Old 04-18-2013, 02:56 AM   #38
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,129

Rep: Reputation: 2271
I think we can help you solve it; just give us more info (instead of paying).
Do keywords occur at the beginning of lines (in file2), only once per line, or at random positions?
Does every keyword from file1 exist in file2, or are a few of them missing?

(Your example is not a sorted list.)
 
Old 04-18-2013, 03:07 AM   #39
castor0troy
LQ Newbie
 
Registered: Apr 2013
Posts: 27

Original Poster
Rep: Reputation: Disabled
Quote:
Originally Posted by pan64
I think we can help you solve it; just give us more info (instead of paying).
Do keywords occur at the beginning of lines (in file2), only once per line, or at random positions?
Does every keyword from file1 exist in file2, or are a few of them missing?

(Your example is not a sorted list.)

To answer your questions:
In file2, keywords occur one per line.
file1 and file2 are both sorted, and all keywords are unique.
Not every keyword in file1 is in file2.


Example file:
abc
bal
cat
dog
etc

Please allow me to explain my requirements once more in detail.


file2 has 100 million keywords.
file1 has 20 million.


Example:
file1
abs
bat
ball
car

file2
abs
bata
cricket-bat
ballz

I want the script to take every keyword from file1 and count how many keywords in file2 begin with that keyword.


The result is:
abs 1
bat 1
ball 1
car 0

I hope you people can help me crack this.
I've been working on this project for weeks.
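
To pin down the requirement stated above, here is a minimal brute-force sketch in Python. It is far too slow for the real files (it compares every keyword with every line), but it can serve as a reference when checking faster scripts on small samples. The function name count_prefixes_bruteforce is made up for illustration; the data is the example from this post.

```python
def count_prefixes_bruteforce(keys, lines):
    """Return {key: number of lines that start with key}.

    Cost is len(keys) * len(lines) comparisons, so use it only on
    small samples.
    """
    return {key: sum(1 for line in lines if line.startswith(key))
            for key in keys}

file1 = ["abs", "bat", "ball", "car"]
file2 = ["abs", "bata", "cricket-bat", "ballz"]

for key, count in count_prefixes_bruteforce(file1, file2).items():
    print(key, count)
# prints:
# abs 1
# bat 1
# ball 1
# car 0
```

Note that cricket-bat is not counted for bat, because only matches at the beginning of the line count.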
 
Old 04-18-2013, 03:43 AM   #40
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,129

Rep: Reputation: 2271
So here is the script:
Code:
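#!/usr/bin/awk -f

# file1 is the (smaller) sorted keyword file; edit the path as needed.
BEGIN { file1 = "./file1" }
{
    i1 = getline key < file1;   # first keyword from the key file
    i2 = 1;
    line = $0;                  # first line of file2 (the main input)

    counter = 0;
    # Walk both sorted files in parallel.
    while ( i1 > 0 && i2 > 0 ) {
        # line sorts before key: skip ahead in file2
        if ( key > line ) { i2 = getline line; continue }
        # line starts with key: count it and read the next line
        if ( index(line, key) == 1 ) { counter++; i2 = getline line; continue }
        # line is past key: report the count and move to the next keyword
        print key ": " counter;
        counter = 0;
        i1 = getline key < file1;
    }
    print key ": " counter;

}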
You invoke it with:
<script> <file2>

In your example the lines were not sorted. This script works only if both files are sorted.
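
For comparison, a minimal Python sketch of another way to exploit sorted input: binary search instead of a parallel walk. As far as I can tell, a single pass that consumes each line once has one caveat: if one keyword is itself a prefix of another (say ba and ball), lines counted for the shorter keyword are no longer available for the longer one. Looking up each keyword independently avoids that. Assumptions, none of them from the thread: the function name count_prefixes_sorted is made up, file2 fits in memory as a list, keywords are non-empty, and the list is sorted by code point (for example with LC_ALL=C sort).

```python
from bisect import bisect_left

def count_prefixes_sorted(keys, sorted_lines):
    """For each key, count the entries of sorted_lines that start with it.

    sorted_lines must be sorted by code point; keys must be non-empty
    and may come in any order.
    """
    counts = {}
    for key in keys:
        # Every string that starts with key sorts at or after key itself ...
        lo = bisect_left(sorted_lines, key)
        # ... and before key with its last character bumped by one.
        upper = key[:-1] + chr(ord(key[-1]) + 1)
        hi = bisect_left(sorted_lines, upper)
        counts[key] = hi - lo
    return counts

file1 = ["abs", "ball", "bat", "car"]
file2 = sorted(["abs", "bata", "cricket-bat", "ballz"])

for key, count in count_prefixes_sorted(file1, file2).items():
    print(key + ": " + str(count))
```

Each lookup costs two binary searches, so the total work grows with the number of keywords times the logarithm of the number of lines. Whether that beats the streaming awk script on the real files would have to be measured.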
 
1 members found this post helpful.
Old 04-18-2013, 03:46 AM   #41
castor0troy
LQ Newbie
 
Registered: Apr 2013
Posts: 27

Original Poster
Rep: Reputation: Disabled
Thank you.
Both files are sorted.
I have saved the code as linux.awk

awk -f linux.awk file1 file2


I get this on the screen:

: 0
: 0
: 0
: 0
: 0
: 0
: 0
: 0
: 0
: 0
: 0
: 0
 
Old 04-18-2013, 03:53 AM   #42
pan64
LQ Guru
 
Registered: Mar 2012
Location: Hungary
Distribution: debian i686 (solaris)
Posts: 8,129

Rep: Reputation: 2271
Please read it again; the invocation is:
awk -f linux.awk file2
file1 (the smaller key file) has to be set in this line; please edit it as you need:
Code:
BEGIN { file1 = "./file1" }
I got the following result (with input similar to what you gave):
abs: 1
ball: 2
bat: 1
car: 0
mouse: 2
 
Old 04-18-2013, 03:55 AM   #43
castor0troy
LQ Newbie
 
Registered: Apr 2013
Posts: 27

Original Poster
Rep: Reputation: Disabled
OK, got it.
I'm running it now.
Let's see how much time it takes to compare the big files.
 
Old 04-18-2013, 09:07 AM   #44
chrism01
LQ Guru
 
Registered: Aug 2004
Location: Sydney
Distribution: Centos 6.8, Centos 5.10
Posts: 17,241

Rep: Reputation: 2325
Here's a crack at it in Perl.
Assumptions:
both files are left-adjusted, keys/matches start at the left, and both are sorted and unique.
Code:
#!/usr/bin/perl -w
use strict;

my ( $krec, $kcnt, $mrec, $mlek,  $lk, $ks, $ms, $eom  );

# LQ match/count for very large files
open(KFILE, "<", "kfile.txt" ) or
    die "Can't open kfile: $!\n";

open(MFILE, "<", "mfile.txt" ) or
    die "Can't open mfile: $!\n";


$kcnt = 0;
$mlek = 1;
$eom = 0;

# Read kfile
while ( defined($krec = <KFILE>) )
{
    chomp($krec);
    $kcnt = 0;
    $lk = length($krec);
    $ks = substr($krec, 0,1);

    while ( 1 )
    {
        if( $mlek == 1 )
        {
            if( defined($mrec = <MFILE>) )
            {
                chomp($mrec);
                $ms = substr($mrec,0,1);
            }
            else
            {
                $eom = 1;
            }
        }

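        # Consume the mfile record while it sorts before the key or
        # starts with it. Comparing whole records (not just first
        # characters) keeps keys that share a first letter apart.
        if( $eom == 0 && ( $mrec lt $krec || substr($mrec, 0, $lk) eq $krec ) )
        {
            $mlek=1;
            if( substr($mrec, 0, $lk) eq  $krec )
            {
                $kcnt++;
            }
        }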
        else
        {
            print "$krec $kcnt\n";
            $mlek = 0;
            last;
        }
    }
}

close(KFILE) or die "Can't close kfile: $!\n";
close(MFILE) or die "Can't close mfile: $!\n";
I apologise for the short var names; I kept changing the design and got fed up typing.

I've only tested on very short files, but it seems to do the job.

PS: If you get more than one working solution, I'd love to see the timings.

Last edited by chrism01; 04-18-2013 at 09:09 AM.
 
Old 04-18-2013, 10:31 AM   #45
ntubski
Senior Member
 
Registered: Nov 2005
Distribution: Arch
Posts: 3,013

Rep: Reputation: 1225
Quote:
Originally Posted by castor0troy
I ran the above awk script and got this error:
awk: linux.awk:19: (FILENAME=zonecrap FNR=1) fatal: Invalid range end: /^[01#23456789abcdefghijklmnopqrstuvwxyz][291s0cft6rdpxqabeghijklmnouvwz-y][id5679ct01oalwsupmhzrenqvgfbjykx�][ra0684hbfikmv573osculeytdngwqjpxz-129][enlfviaywhkmbpustordxcgjzq12356ı4897/
Oh, I was a bit sloppy: my script didn't escape special characters. I'll edit my post with a correction, but the scripts that take advantage of your sorted files will probably be faster, and not wrong.


Quote:
Let's see how much time it takes to compare the big files.
You might try running things on a bit less than the whole big file, say just the first 100 MB, to get an idea of how fast it will be for the entire thing.

Last edited by ntubski; 04-18-2013 at 11:53 AM. Reason: note my script is wrong
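
A minimal sketch of the sampling idea above, in Python: copy roughly the first 100 MB of the big file, cut at a line boundary so no keyword is chopped in half. The function name write_sample and the file names in the usage note are made up for illustration.

```python
def write_sample(src_path, dst_path, max_bytes=100 * 1024 * 1024):
    """Copy about the first max_bytes of src_path to dst_path.

    If the cut lands in the middle of a line, the partial line is
    dropped. Returns the number of bytes written.
    """
    with open(src_path, "rb") as src:
        chunk = src.read(max_bytes)
        reached_end = src.read(1) == b""
    if not reached_end:
        # Keep only complete lines; an empty sample results if the
        # chunk contains no newline at all.
        chunk = chunk[: chunk.rfind(b"\n") + 1]
    with open(dst_path, "wb") as dst:
        dst.write(chunk)
    return len(chunk)
```

For example, write_sample("file2", "file2.sample") followed by timing a script on file2.sample gives a rough figure. Because a prefix of a sorted file is still sorted, the sample also works with the scripts that rely on sorted input.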
 
  


All times are GMT -5.
