Awk is based on the concept of records and fields. By default, each line is a record, and each whitespace-separated word on that line is a separate field. You write rules: snippets of code (actions) that are applied to each record.
There are three types of rules:
- { body }
These rules are run for every record.
- condition { body }
These rules are run only for records for which condition evaluates to true.
- condition
If condition evaluates to true, the record is printed.
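To make the three rule types concrete, here is a tiny example (the sample data is mine) that uses all three at once:

```shell
# Three rule types in one program: unconditional, conditional, bare condition.
printf '1\n2\n3\n' | awk '
{ sum += $1 }        # run for every record
$1 > 1 { big++ }     # run only when the condition holds
$1 % 2 == 1          # bare condition: matching records are printed
END { print sum, big }'
```

This prints the odd records (1 and 3), then 6 2 from the END rule.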
There are two special conditions, BEGIN and END. The former is run before the first input record, and the latter after the last input record. The statement next makes awk skip the rest of the current rule and all remaining rules, moving straight on to the next record. (I don't use next in this one, though.)
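Although next is not used here, a quick illustration of its effect (sample data mine):

```shell
# Records matching the first rule are skipped entirely by next;
# the second rule never sees them.
printf 'skip 1\nkeep 2\nkeep 3\n' | awk '$1 == "skip" { next } { print $2 }'
```

Only 2 and 3 are printed; the first record never reaches the second rule.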
As you can see, awk is quite straightforward. I use the GNU Awk User's Manual exclusively as my awk reference. Although gawk does have extensions and quirks that other awk variants do not support, the differences are quite well marked in the manual. The main advantages of gawk over other awk variants are its asort() and asorti() sorting functions, and its ability to use the ASCII NUL character (\0) as a record or field separator.
To the awk command at hand:
I start with a BEGIN rule.
RS is the regular expression for record separators; I set it to match on any newline convention.
FS is the regular expression for field separators; I set it to match on any linear whitespace. Some awk variants don't like it when the first line is empty, so I start with an empty comment line (
#):
Code:
awk '#
BEGIN {
RS = "(\r\n|\n\r|\r|\n)"
FS = "[\t\v\f ]+"
Next, I explicitly set the
output record separator and field separator too.
Code:
ORS = "\n" # Newline in output
OFS = " " # Field separator in output
}
At this point, we have set the input and output newline conventions and field separators: each input line will be a separate record, and each word a separate field. In output, each newline will end with just LF (Unix/Linux newline convention), and each field will be separated by a space only. (There will be no CR or tabs in the output, no matter what the input.) There are different, more concise ways to achieve the above, but I like this way.
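One way to see the normalization in action (this demonstration is my addition): plain print would output $0 unchanged, so the assignment $1 = $1 is used to force awk to rebuild the record using OFS.

```shell
# CRLF newlines and tab/multi-space separators in,
# LF newlines and single spaces out.
printf 'a\tb\r\nc  d\r\n' | awk 'BEGIN {
    RS = "(\r\n|\n\r|\r|\n)"; FS = "[\t\v\f ]+"
    ORS = "\n"; OFS = " "
} { $1 = $1; print }'
```

Note that treating RS as a regular expression requires gawk or mawk; a strictly POSIX awk treats RS as a single character.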
Next, I define the rule to apply to each record. Since this will only work right with records that have more than two fields, let's limit the rule to such records. Note that NF holds the number of fields in the current record, $0 contains the entire record, and $1 to $NF contain the individual fields.
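The rule would then open with a condition such as NF > 2 (the exact form is my assumption, since only the description is quoted here). Run standalone:

```shell
# Only records with more than two fields trigger the rule body.
printf 'a b\na b c\na b c d\n' | awk 'NF > 2 { print NF ": " $0 }'
```

The two-field record is silently skipped.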
I will use associative arrays keyed on the first two fields. To save typing, I save the key in the variable k. Note that in awk, strings are concatenated by simply writing them one after another. (Awk does NOT add any implicit separator or whitespace in between.)
In awk, (somekey in somearray) is true if the associative array somearray contains the key somekey. All arrays in awk are associative, and all array keys are strings.
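Putting the two together, the key would be built by plain concatenation, e.g. k = $1 OFS $2 (the choice of separator is my assumption), and membership tested with in:

```shell
# Concatenate the first two fields into a key and test membership.
printf 'x y 1\nx y 2\np q 3\n' | awk '{
    k = $1 OFS $2           # concatenation: awk adds nothing in between
    print ((k in seen) ? "old" : "new"), k
    seen[k]                 # merely referencing an element creates it
}'
```

The second record reports "old", because the key x y was already created by the first.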
To keep the input records in order, I save each new key k into the array key, with keys counting the number of unique keys seen so far, starting from one. Also note that in awk you don't need to initialize variables; they default to the empty string or to zero, depending on how they are used.
Code:
if (!(k in key))
key[++keys] = k
To keep a list (actually, a comma-delimited string) of the third fields, I simply append a comma and the third field:
Code:
list[k] = list[k] "," $3
I also keep a count of the number of occurrences of this key:
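The counting statement is not quoted above; presumably it is a plain increment, count[k]++ (my reconstruction). Since unset awk variables default to zero, no initialization is needed:

```shell
# Count occurrences per key; unset elements start at zero.
printf 'x y 1\nx y 2\np q 3\n' | awk '
{ count[$1 " " $2]++ }
END { print count["x y"], count["p q"] }'
```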
After all the input records have been processed, there are keys unique keys in the list and count arrays. The former contains the third fields as a comma-separated string (with a leading comma), and the latter the number of occurrences of each key. The key array contains the keys in the order they were first seen, indexed 1 to keys.
I could have used a simple array traversal loop, for (k in list), but then the keys would be visited in an undefined order. There are ways to control the traversal order, but none that work in all awk variants; that is why I kept track of the keys separately.
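The difference in a nutshell: for (k in arr) visits keys in whatever order the implementation chooses, while the auxiliary key array preserves first-seen order (sample data mine):

```shell
# First-seen order preserved via an indexed key array.
printf 'b\na\nc\na\n' | awk '{
    if (!($1 in seen)) { seen[$1]; key[++keys] = $1 }
} END {
    s = ""
    for (i = 1; i <= keys; i++)
        s = s (i > 1 ? " " : "") key[i]
    print s
}'
```

The keys come out as b a c, exactly as first encountered, regardless of the awk variant used.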
We obviously need to loop over all the unique keys, because each unique key will produce one line of output. The ith key will be k = key[i], with n occurrences of that key:
Code:
END {
for (i = 1; i <= keys; i++) {
k = key[i]
n = count[k]
Since the comma-separated list of third fields for the current key (list[k]) has an extra leading comma, we need to remove it. To be extra careful, I first replace any run of successive commas with a single comma:
Code:
s = list[k]
gsub(/,,+/, ",", s)
sub(/^,/, "", s)
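These two substitutions can be tried on their own; gsub() replaces all matches, while sub() replaces only the first:

```shell
# Collapse runs of commas, then strip the leading one.
echo ',alpha,,beta,gamma' | awk '{ gsub(/,,+/, ","); sub(/^,/, ""); print }'
```

The result is a clean alpha,beta,gamma.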
All that is left is to print the key k, the list of third fields s, and the number of occurrences n:
Code:
printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
Note that print k, s, n would produce the exact same output. I wrote the separators out explicitly with printf so that the roles of OFS and ORS stay visible; otherwise, setting them might have seemed confusing.
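You can verify the equivalence directly; with OFS and ORS set, the two statements below print identical lines (the values are made up):

```shell
# printf with explicit OFS/ORS versus plain print: identical output.
awk 'BEGIN {
    OFS = " "; ORS = "\n"
    k = "a b"; s = "1,2"; n = 2
    printf("%s%s%s%s%d%s", k, OFS, s, OFS, n, ORS)
    print k, s, n
}'
```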
That's it. Closing the loop and the END rule completes the script.
Code:
}
}' input-file > output-file
I have the habit of listing one input file and redirecting the output to a file, but that is just for illustration. You can read standard input (no file name arguments), or read multiple files (in which case they are processed in the order they are listed).
If you want to use this as a script, remove everything after the final }, and write the first line as a shebang,
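A typical first line would be the following, though the path to awk varies between systems (so verify it on yours, or use /usr/bin/env awk -f instead):

```shell
#!/usr/bin/awk -f
```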
and you are done.
Note that, according to my tests, the mawk awk variant is significantly faster than GNU awk (gawk). If you have it installed, I recommend invoking mawk explicitly.
Any questions? Any details you'd like me to clarify?
Hope you find this useful,