LinuxQuestions.org
06-07-2012, 06:50 PM   #1
reach.sree@gmail.com
LQ Newbie
 
Registered: Jun 2012
Posts: 6

Rep: Reputation: Disabled
Awk to Count Multiple patterns in a huge file


Hi,


I have a file that is 430K lines long. It has records like the ones below:

|site1|MAP
|site2|MAP
|site1|MODAL
|site2|MAP
|site2|MODAL
|site2|LINK
|site1|LINK

My task is to count the number of times MAP, MODAL, and LINK occur for each site and write new records like those below to a new file:

SiteName MAP MODAL LINK
--------------------------
site1 | 1 | 1 | 1
site2 | 2 | 1 | 1

I have accomplished this using grep by doing

countmap=`grep $SITEID $FILENAME | grep MAP | wc -l`
countmodal=`grep $SITEID $FILENAME | grep MODAL | wc -l`
countlink=`grep $SITEID $FILENAME | grep LINK | wc -l`
echo $SITEID\|$countmap\|$countmodal\|$countlink\|

However, with a 430K-line file it took more than an hour to accomplish this. My knowledge of awk is rudimentary at best. Kindly help me with this.
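A side note on the grep commands above, separate from their speed: `grep $SITEID` matches substrings, so a search for `site1` would also count lines belonging to `site10` or `site12`. Below is a minimal sketch of a stricter per-site count, assuming every record is exactly `|site|TYPE` with nothing after the type. The helper name `count_for_site` is hypothetical. It still rescans the file for every site, so it does nothing for the run time.

```shell
#!/bin/sh
# count_for_site FILE SITE: print "SITE|MAP|MODAL|LINK|" counts for one site.
# Assumes each record is exactly "|site|TYPE"; the helper name is illustrative.
count_for_site() {
    file=$1
    site=$2
    # -F: fixed string, -x: the whole line must match, -c: print the count.
    # grep exits non-zero when the count is 0, hence the "|| true".
    map=$(grep -c -F -x "|${site}|MAP" "$file" || true)
    modal=$(grep -c -F -x "|${site}|MODAL" "$file" || true)
    link=$(grep -c -F -x "|${site}|LINK" "$file" || true)
    printf '%s|%s|%s|%s|\n' "$site" "$map" "$modal" "$link"
}
```

Called as `count_for_site "$FILENAME" "$SITEID"`, it prints one row in the same `site|map|modal|link|` shape as the echo line above.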
 
06-07-2012, 07:29 PM   #2
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,586

Rep: Reputation: 481
Quote:
Originally Posted by reach.sree@gmail.com
Code:
countmap=`grep $SITEID $FILENAME | grep MAP | wc -l`
countmodal=`grep $SITEID $FILENAME | grep MODAL | wc -l`
countlink=`grep $SITEID $FILENAME | grep LINK | wc -l`
echo $SITEID\|$countmap\|$countmodal\|$countlink\|
How many different sites are found in your input file? Your example shows only two, but are there actually 20? If 20, are you making 60 passes through the file? (20 to count MAPs, 20 to count MODALs, and 20 to count LINKs?)

Daniel B. Martin
 
06-07-2012, 07:32 PM   #3
reach.sree@gmail.com
LQ Newbie
 
Registered: Jun 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
I have 10,000 sites in my file. And yes, grep is a very inefficient way of doing this.
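To put rough numbers on that inefficiency: each of the three pipelines in post #1 starts with a `grep` that reads the whole file, and all three run once per site. The sketch below just multiplies the approximate figures quoted in this thread (10,000 sites, three patterns, about 430,000 lines). Those figures are assumptions taken from the posts, not measurements.

```shell
#!/bin/sh
# Rough cost estimate for the grep-per-site approach.
# All figures are the approximate ones quoted in the thread.
sites=10000
patterns=3      # MAP, MODAL, LINK
lines=430000    # "430K lines"

scans=$((sites * patterns))      # full-file scans made by the first grep of each pipeline
lines_read=$((scans * lines))    # total lines read across all scans

echo "full-file scans: $scans"
echo "lines read:      $lines_read"
```

A single-pass approach reads each of the roughly 430,000 lines once.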
 
06-07-2012, 08:13 PM   #4
bigearsbilly
Senior Member
 
Registered: Mar 2004
Location: england
Distribution: Debian, Mint, Puppy, Raspbian
Posts: 3,421

Rep: Reputation: 200
Code:
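#!/usr/bin/perl
use strict;
use warnings;

# Count MAP/MODAL/LINK occurrences per site in a single pass over the input.
my %sites;
while (<>) {
	chomp;
	# Records look like "|site1|MAP"; the field before the first "|" is empty.
	my (undef, $site, $thing) = split /\|/;
	next unless defined $thing;    # skip blank or malformed lines
	$sites{$site}{$thing}++;
}
$\ = "\n";    # append a newline to every print

print "SiteName MAP MODAL LINK";
print "-" x 40;
for my $k (sort keys %sites) {
	my $v = $sites{$k};
	# A type that never occurred for a site is reported as 0, not as an empty field.
	print join '|', $k, map { $v->{$_} // 0 } qw(MAP MODAL LINK);
}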
 
06-07-2012, 09:15 PM   #5
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,586

Rep: Reputation: 481
Quote:
Originally Posted by reach.sree@gmail.com
I have 10,000 sites in my file. And yes, grep is a very inefficient way of doing this.
Code:
Input file ...
|site01|MAP
|site02|MAP
|site01|MODAL
|site02|MAP
|site02|MODAL
|site02|LINK
|site01|LINK
|site11|MAP
|site12|MAP
|site11|MODAL
|site12|MAP
|site12|MODAL
|site12|LINK
|site11|LINK
|site04|MAP
|site05|MAP
|site06|MODAL
|site07|MAP
|site08|MODAL
|site08|LINK
|site08|LINK
|site12|MODAL
|site12|LINK
|site12|MODAL
|site12|LINK
|site12|MODAL
|site12|LINK
Code:
Run this ...
sort "$InFile" | uniq --count
Code:
Get this ...
      1 |site01|LINK
      1 |site01|MAP
      1 |site01|MODAL
      1 |site02|LINK
      2 |site02|MAP
      1 |site02|MODAL
      1 |site04|MAP
      1 |site05|MAP
      1 |site06|MODAL
      1 |site07|MAP
      2 |site08|LINK
      1 |site08|MODAL
      1 |site11|LINK
      1 |site11|MAP
      1 |site11|MODAL
      4 |site12|LINK
      2 |site12|MAP
      4 |site12|MODAL
Try this code on your big real-world input file. Let us know how long it took to execute.
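The `sort | uniq --count` output has one line per site/type pair rather than the one-row-per-site layout requested in post #1. One possible way to pivot it is sketched below with awk. The helper name `pivot_counts` is hypothetical, and the sketch assumes the `|site|TYPE` record layout and that only MAP, MODAL, and LINK need reporting. `LC_ALL=C` is used so that all lines for one site are guaranteed to be adjacent after sorting.

```shell
#!/bin/sh
# pivot_counts FILE: print one "site|MAP|MODAL|LINK|" row per site.
pivot_counts() {
    LC_ALL=C sort "$1" | uniq -c | awk -F'|' '
        # uniq -c lines look like "      2 |site02|MAP":
        # $1 is the padded count, $2 the site, $3 the type.
        function emit() {
            if (site != "") {
                print site FS (0 + n["MAP"]) FS (0 + n["MODAL"]) FS (0 + n["LINK"]) FS
            }
        }
        $2 != site { emit(); split("", n); site = $2 }   # new site: print the previous row, reset counts
        { n[$3] = $1 + 0 }                               # store the count for this type
        END { emit() }                                   # print the last site
    '
}
```

Because the input is already sorted, the rows come out grouped by site without a second sort.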

Daniel B. Martin
 
06-08-2012, 12:38 PM   #6
reach.sree@gmail.com
LQ Newbie
 
Registered: Jun 2012
Posts: 6

Original Poster
Rep: Reputation: Disabled
Thanks a lot guys. I went with

awk -F"|" '{c[$2,$3]++;b[$2]=$2 FS 0+c[$2,"MAP"] FS 0+c[$2,"MODAL"] FS 0+c[$2,"LINK"] FS} END {for ( i in b) { print b[i]}}' filename

and then did a sort filename

It took about 40 seconds for the 430K-line file.

Thanks once again for your input.
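For readers unpacking the one-liner above: it keeps a counter per (site, type) pair in `c` and rebuilds that site's output row in `b` on every input line. Below is a minimal sketch of an equivalent variant that only counts while reading and builds each row once in the END block. The helper name `count_by_site` is hypothetical. The order of `for (s in ...)` is unspecified in awk, hence the trailing sort.

```shell
#!/bin/sh
# count_by_site FILE: print "site|MAP|MODAL|LINK|" for every site in FILE.
count_by_site() {
    awk -F'|' '
        # Records look like "|site1|MAP": $1 is empty, $2 the site, $3 the type.
        {
            seen[$2] = 1      # remember every site
            c[$2, $3]++       # one counter per (site, type) pair
        }
        END {
            for (s in seen) {
                # 0 + x turns a missing counter into 0 instead of an empty string
                print s FS (0 + c[s, "MAP"]) FS (0 + c[s, "MODAL"]) FS (0 + c[s, "LINK"]) FS
            }
        }
    ' "$1" | sort
}
```

Usage would be along the lines of `count_by_site filename > counts.txt`.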

Last edited by reach.sree@gmail.com; 06-08-2012 at 12:40 PM.
 
06-08-2012, 02:04 PM   #7
danielbmartin
Senior Member
 
Registered: Apr 2010
Location: Apex, NC, USA
Distribution: Mint 17.3
Posts: 1,586

Rep: Reputation: 481
Quote:
Originally Posted by reach.sree@gmail.com
It took me about 40 sec for the 430K file.
I have no objection to your choosing the awk solution instead of the one I proposed. You benefit by having code provided promptly and at no cost. You may repay the LQ community by publishing feedback.

Tell us what worked and what didn't. Tell us which code ran fast and which ran even faster. Tell us which solution you chose and why.

Daniel B. Martin
 
  

