LinuxQuestions.org
Old 05-17-2012, 09:50 AM   #1
hanae
Member
 
Registered: May 2012
Posts: 33

Rep: Reputation: Disabled
Count based on the first word


I want to count the number of occurrences of a word in a file, but the problem is that I have other parameters that are attached to each line.

I have tried this perl script:
Quote:
#!/usr/bin/perl -w

$DEBUG = 0;

while (<>) {
# loop over lines
chomp($_);
if ($DEBUG) { print "$_\n"; }
++$counts{$_};

}

# end of input, print %count
for $c (sort keys %counts) {
print "$c\t$counts{$c}\n";
}
This works when I have a list of words (one word per line), but it doesn't work with my input file, where each line looks like this:
word1 filename token_position

I just want to count the occurrences of each word.

How can I modify the script, please?

Thank you
 
Old 05-17-2012, 10:24 AM   #2
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,901

Rep: Reputation: 7318
So then you can add another loop over the words of each line:
Code:
#!/usr/bin/perl -w

$DEBUG = 0;

# loop over lines
while (<>) {

# loop over words
   foreach $word (split) {
      chomp($word);
      if ($DEBUG) { print "$word\n"; }
      ++$counts{$word};
   }
}

...
 
Old 05-17-2012, 10:34 AM   #3
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
unfortunately this script didn't work

Quote:
#!/usr/bin/perl -w

$DEBUG = 0;

# loop over lines
while (<>) {

# loop over words
foreach $word (split) {
chomp($word);
if ($DEBUG) { print "$word\n"; }
++$counts{$word};
}
}



# end of input, print %count

for $c (sort keys %counts) {

print "$c\t$counts{$c}\n";

}
I got the following output:
Quote:
0 1
1000 1
1003 1
1009 1
1012 1
1019 1
1021 1
1028 1
1037 1
1039 1
Note that 1000... are the token numbers.
The output should normally be in the following format:

word filename token_number count_number

Can you please help?

Thank you,
 
Old 05-17-2012, 10:44 AM   #4
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,901

Rep: Reputation: 7318
I do not really understand. The output should be:
word1 count_number1
word2 count_number2
...
and nothing else. What kind of token and filename are you talking about? Do you have a test file to try this script on?
 
Old 05-17-2012, 10:51 AM   #5
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
I am sorry for the confusion.
As I mentioned in my first post, my file is in the following format:
word1 filename token_position
Quote:
example:
sky losn_revue-1981-2 40234
fly kivre1-0-0 1236
fly kivre1-0-0 1240
I want to add the count_number to each line when I run the script, so that the output becomes:
Quote:
sky losn_revue-1981-2 40234 1
fly kivre1-0-0 1236,1240 2
I hope this is clear now? can you please help?

Thank you,
 
Old 05-17-2012, 12:38 PM   #6
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Can anyone please help me with the script?

Thank you,
 
Old 05-17-2012, 03:23 PM   #7
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Using awk:
Code:
awk '(NF > 2) { k = $1 " " $2 ; p[k] = p[k] "," $3 ; n[k]++ }
 END { for (k in p) { gsub(/,,+/, ",", p[k]); sub(/^,/, "", p[k]) }
       for (k in p) printf("%s %s %d\n", k, p[k], n[k])
     }' input-file > output-file
The output order is unspecified, though, because awk does not define the traversal order of a for (k in p) loop. If you want to keep the original order, and/or handle any newline convention, use
Code:
awk '#
    BEGIN {
        RS = "(\r\n|\n\r|\r|\n)"
        FS = "[\t\v\f ]+"

        ORS = "\n"      # Newline in output
        OFS = " "       # Field separator in output
    }

    (NF > 2) {
        k = $1 OFS $2
        if (!(k in count))
            key[++keys] = k
        list[k] = list[k] "," $3
        count[k]++
    }

    END {
        for (i = 1; i <= keys; i++) {
            k = key[i]
            n = count[k]
            s = list[k]
            gsub(/,,+/, ",", s)
            sub(/^,/, "", s)
            printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
        }
    }' input-file > output-file
If you are interested, I can explain how that works line-by-line.
 
Old 05-17-2012, 04:02 PM   #8
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Yes please, if it is possible to explain it, I will be grateful.

Thank you,
 
Old 05-17-2012, 07:36 PM   #9
Nominal Animal
Senior Member
 
Registered: Dec 2010
Location: Finland
Distribution: Xubuntu, CentOS, LFS
Posts: 1,723
Blog Entries: 3

Rep: Reputation: 948
Awk is based on the concept of records and fields. By default, each line is a record, and each word on that line is a separate field. You write actions or rules, snippets of code that are applied to each record.

There are three types of rules:
  • { body }
    These rules are run for each record.
  • condition { body }
    These rules are run for each record, for which condition evaluates to true.
  • condition
    If condition evaluates to true, then the record is output.
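
As a rough sketch of the three rule types in one program (the function name demo_rules and the sample data are made up for illustration):

```shell
demo_rules() {
    awk '
        { total += $2 }                 # no condition: runs for every record
        ($2 > 5) { big++ }              # condition and body: runs only when true
        ($2 > 10)                       # condition only: matching records are printed
        END { print "total=" total, "big=" big }
    '
}

printf 'apple 3\nbanana 12\ncherry 7\n' | demo_rules
```

This should print banana 12 (from the condition-only rule), followed by total=22 big=2.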

There are special conditions named BEGIN and END . The former is run before the first input record, and the latter after the last input record. The statement next makes awk skip the rest of the current rule, and skip to the next record, without applying any other rules to the current record. (I don't use next in this one, though.)
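
Since next is not used in the command at hand, here is a separate small sketch of BEGIN, END and next together. The name skip_comments and the input lines are assumptions for illustration only:

```shell
skip_comments() {
    awk '
        BEGIN { print "start" }          # runs before the first record
        /^#/  { next }                   # skip comment lines; later rules never see them
              { n++; print n ": " $0 }   # only reached for non-comment lines
        END   { print "end, " n " records kept" }
    '
}

printf '# header\nsky 1\n# note\nfly 2\n' | skip_comments
```

The two lines starting with # never reach the counting rule, so the output is start, 1: sky 1, 2: fly 2, and finally end, 2 records kept.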

As you can see, awk is quite straightforward. I use the GNU Awk User's Manual exclusively as my awk reference. Although it does have extensions and quirks other awk variants do not support, the differences are quite well marked. The main advantage of gawk over other awk variants is that it has the asort() and asorti() sorting functions, and can use ASCII NUL (\0) as a record or field separator.

To the awk command at hand:

I start with a BEGIN rule. RS is the record separator; gawk and mawk treat a multi-character RS as a regular expression (POSIX leaves that case unspecified), and I set it to match any newline convention. FS is the regular expression for field separators; I set it to match any linear whitespace. Some awk variants don't like it when the first line is empty, so I start with an empty comment line (#):
Code:
awk '#
    BEGIN {
        RS = "(\r\n|\n\r|\r|\n)"
        FS = "[\t\v\f ]+"
Next, I explicitly set the output record separator and field separator too.
Code:
        ORS = "\n"      # Newline in output
        OFS = " "       # Field separator in output
    }
At this point, we have set the input and output newline conventions and field separators: each input line will be a separate record, and each word a separate field. In output, each record will end with just LF (Unix/Linux newline convention), and each field will be separated by a single space. (There will be no CR or tabs in the output, no matter what the input.) There are different, more concise ways to achieve the above, but I like this way.
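
One of the more concise alternatives, as a sketch only: keep the default RS and strip a trailing CR from each record. This assumes the input uses only LF or CRLF newlines (a lone CR is not handled), and strip_cr is an illustrative name:

```shell
strip_cr() {
    awk '
        { sub(/\r$/, "") }    # drop a trailing CR, if any; awk re-splits the fields
        { print $1, $NF }     # first and last field, separated by OFS
    '
}

printf 'sky a 1\r\nfly b 2\n' | strip_cr
```

Both the CRLF line and the LF line come out clean: sky 1 and fly 2.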

Next, I define the rule to apply to each record. Since this will only work right with records with more than two fields, let's limit to such records:
Code:
    (NF > 2) {
Note that NF holds the number of fields in the current record. $0 contains the entire record, and $1 to $NF the fields.
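
A quick way to see NF, $0 and $NF at work (show_fields is a made-up name; the second input line is deliberately one field short):

```shell
show_fields() {
    awk '{ print NF ": first=" $1 ", last=" $NF ", record=[" $0 "]" }'
}

printf 'sky losn_revue-1981-2 40234\nfly kivre1-0-0\n' | show_fields
```

The first line reports 3 fields and the second only 2, which is why the (NF > 2) condition above would skip the second one.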

I will use associative arrays keyed on the first two fields. To save typing, I save the key in variable k. Note that in awk, strings are concatenated by just writing them one after another. (Awk does NOT add implicit separators or whitespace in between.)
Code:
        k = $1 OFS $2
In awk, (somekey in somearray) is true if associative array somearray contains key somekey. All arrays in awk are associative, and all array keys are strings.
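
Keep in mind that in tests the indices of an array, never its values. A tiny sketch (the array names seen and order are illustrative):

```shell
in_demo() {
    awk 'BEGIN {
        seen["fly"] = 1          # index is the string "fly"
        order[1] = "fly"         # index is "1", value is "fly"
        print ("fly" in seen)    # 1: "fly" is an index of seen
        print ("fly" in order)   # 0: "fly" is only a value in order
        print (1 in order)       # 1: the number 1 is converted to the index "1"
    }'
}

in_demo
```

This prints 1, 0 and 1, one per line.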

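To keep the input records in order, I save each new key k into array key, which is indexed by consecutive numbers starting from one; keys holds the number of unique keys seen so far. Because (k in key) would test the numeric indices of key rather than its values, I check the count array, which is indexed by k itself, to see whether k is new. Also note that in awk, you don't need to initialize any variables, they will default to empty (strings) or zero (numbers).
Code:
        if (!(k in count))
            key[++keys] = k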
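
The same first-seen bookkeeping in isolation, as a minimal sketch (first_seen, seen and order are illustrative names, not part of the script above):

```shell
first_seen() {
    awk '
        !($1 in seen) { order[++n] = $1 }   # remember each new word once, in input order
        { seen[$1]++ }                      # count every occurrence
        END { for (i = 1; i <= n; i++) print order[i], seen[order[i]] }
    '
}

printf 'sky\nfly\nsky\nfly\nbird\n' | first_seen
```

The words come out in the order they first appeared: sky 2, fly 2, bird 1.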
To keep a list (actually, a comma-delimited string) of the third fields, I simply append a comma and the third field:
Code:
        list[k] = list[k] "," $3
I also keep a count of the number of occurrences of this key:
Code:
        count[k]++
    }
After all the input records have been processed, the variable keys holds the number of unique keys in the list and count arrays. The former contains the list of third fields as a comma-separated string (with a leading comma), and the latter the number of occurrences of each key. The key array contains the keys in the order they were first seen, indexed 1..keys.

I could have used just a simple list traversal loop, for (k in list) , but the k would be in undefined order then. There are ways you can control the array traversal, but nothing that works in all awk variants. This is why I kept track of the keys separately.
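
If sorted output is good enough and the original order does not matter, one portable option is to let for (k in ...) produce the lines in whatever order and pipe them through sort. A sketch, with sorted_counts as an illustrative name:

```shell
sorted_counts() {
    awk '
        { count[$1]++ }
        END { for (w in count) print w, count[w] }   # traversal order is unspecified
    ' | LC_ALL=C sort                                # so sort the lines afterwards
}

printf 'sky\nfly\nsky\nbird\n' | sorted_counts
```

After sorting, the output is bird 1, fly 1, sky 2 regardless of the awk variant.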

We obviously need to loop over all unique keys, because each unique key will produce one line of output. The ith key will be k=key[i] , with n occurrences of that key:
Code:
    END {
        for (i = 1; i <= keys; i++) {
            k = key[i]
            n = count[k]
Since the comma-separated list of third fields for the current key (list[k]) has an extra leading comma, we need to remove it. I like to be extra careful, and first replace any multiple successive commas with a single comma:
Code:
            s = list[k]
            gsub(/,,+/, ",", s)
            sub(/^,/, "", s)
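
To see what those two substitutions do on their own, here is a small sketch that applies them to whole input lines (clean_list is a made-up name and the input strings are invented):

```shell
clean_list() {
    awk '{
        s = $0
        gsub(/,,+/, ",", s)    # collapse runs of commas into one
        sub(/^,/, "", s)       # then drop a single leading comma
        print s
    }'
}

printf ',1236,,1240\n,40234\n' | clean_list
```

The result is 1236,1240 and 40234. A trailing comma would be left alone, but the script never produces one.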
All that is left is to print the key k, the list of third fields s, and the number of occurrences n:
Code:
            printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
Note that print k, s, n would produce exactly the same output, because print puts OFS between its arguments and ORS at the end. I thought setting the OFS and ORS would have been confusing otherwise, so I wrote the separators explicitly.
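
A short check of that equivalence, using fixed values instead of real input (print_vs_printf is an illustrative name):

```shell
print_vs_printf() {
    awk 'BEGIN {
        OFS = " "; ORS = "\n"
        k = "fly kivre1-0-0"; s = "1236,1240"; n = 2
        printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
        print k, s, n          # print inserts OFS between items and appends ORS
    }'
}

print_vs_printf
```

Both statements print the same line, fly kivre1-0-0 1236,1240 2.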

That's it. Closing the loop and the END rule completes the script.
Code:
        }
    }' input-file > output-file
I have the habit of listing one input file, and redirecting the output to a file, but that is just for illustration. You can read either standard input (no file name arguments), or from multiple files (in which case they are processed in the order they are listed in).
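
A sketch of both ways of feeding input, using throwaway files in a temporary directory (first_words and the file names are assumptions for illustration):

```shell
first_words() {
    # "$@" passes any file names through; with none, awk reads standard input.
    awk '{ print NR ": " $1 }' "$@"
}

tmp=$(mktemp -d)                               # scratch directory
printf 'sky a 1\n' > "$tmp/one.txt"
printf 'fly b 2\n' > "$tmp/two.txt"

first_words "$tmp/one.txt" "$tmp/two.txt"      # two files, in the order listed
first_words < "$tmp/two.txt"                   # no file arguments: standard input

rm -r "$tmp"
```

The first call prints 1: sky and 2: fly (NR keeps counting across files); the second prints 1: fly.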

If you want to use this as a script file, remove everything after the final } and replace the first line (awk '#) with
Code:
#!/usr/bin/awk -f
and you are done.

Note that according to my tests, the mawk awk variant is significantly faster than GNU awk (gawk). If you have it installed, I recommend using mawk explicitly.

Any questions? Any details you'd like me to clarify?

Hope you find this useful,

Last edited by Nominal Animal; 05-17-2012 at 08:02 PM. Reason: (somekey in somearray) is less confusing in the explanation; thanks, danielbmartin.
 
3 members found this post helpful.
Old 05-18-2012, 03:54 AM   #10
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
Thank you very much, this is indeed very helpful.

I have one more question: how can I compile and run the program? Should I save it as .awk or .sh?

Thank you again,
 
Old 05-18-2012, 04:00 AM   #11
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,901

Rep: Reputation: 7318
awk scripts do not need to be compiled. A script is a plain text file and, as you have already seen, it is read and executed by the program named awk.
 
Old 05-18-2012, 04:01 AM   #12
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
oh, thank you.

so how can I execute the above script?
 
Old 05-18-2012, 04:09 AM   #13
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,901

Rep: Reputation: 7318
You can put everything between the first ' and the last ' into a file, let's say test.awk, and execute:
awk -f test.awk inputfile > outputfile
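
For example, with a much shorter stand-in script (the temporary directory and the two-line input are only there to make the sketch self-contained):

```shell
tmp=$(mktemp -d)

# The quoted heredoc delimiter keeps the shell from expanding $1 inside the script.
cat > "$tmp/test.awk" <<'EOF'
# Only awk code goes in the file: no leading "awk" and no surrounding quotes.
{ count[$1]++ }
END { for (w in count) print w, count[w] }
EOF

printf 'fly a 1\nfly a 2\n' > "$tmp/inputfile"
awk -f "$tmp/test.awk" "$tmp/inputfile" > "$tmp/outputfile"

result=$(cat "$tmp/outputfile")
echo "$result"
rm -r "$tmp"
```

This prints fly 2.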
 
Old 05-18-2012, 04:16 AM   #14
hanae
Member
 
Registered: May 2012
Posts: 33

Original Poster
Rep: Reputation: Disabled
this is exactly what I did:

count.awk:
Quote:
==awk '#
BEGIN {
RS = "(\r\n|\n\r|\r|\n)"
FS = "[\t\v\f ]+"

ORS = "\n" # Newline in output
OFS = " " # Field separator in output
}

(NF > 2) {
k = $1 OFS $2
if (!(k in count))
key[++keys] = k
list[k] = list[k] "," $3
count[k]++
}

END {
for (i = 1; i <= keys; i++) {
k = key[i]
n = count[k]
s = list[k]
gsub(/,,+/, ",", s)
sub(/^,/, "", s)
printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
}
}'
I used the command:
awk -f count.awk TrFile.txt > output.txt

I got the following errors:
Quote:
awk: 2: unexpected character '''
awk: 28: unexpected character ''
Thank you
 
Old 05-18-2012, 04:20 AM   #15
pan64
LQ Addict
 
Registered: Mar 2012
Location: Hungary
Distribution: debian/ubuntu/suse ...
Posts: 21,901

Rep: Reputation: 7318
I told you: between the first ' and the last '. So remove ==awk ' from the beginning, and also remove the last '.
 
  

