Thank you, now it is working.
But the program doesn't give the correct results. It attaches a 1 to every entry, even though there are repeated entries that should get their actual frequency count. It seems to me that it doesn't take the fields into account; the whole line is treated as one entry, especially since there are some words that appear in different files:
examples
Does it really matter when the input is a large file? I tested with a 3-entry sample input and it works, but with my input of over 1000 entries it doesn't!
I thought the first two fields were supposed to be the key. #16 shows that only the first field is the key, and that both the second and third fields should be gathered in lists.
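In other words, using a small made-up sample (hypothetical data, just to illustrate the expected transformation):
Code:
# input (hypothetical)
fly kivre1-0-0 1240
fly kivre1-0-0 1236
sky losn_revue-1981-2 40234

# desired output: key, list of 2nd fields, list of 3rd fields, count
fly kivre1-0-0,kivre1-0-0 1240,1236 2
sky losn_revue-1981-2 40234 1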
Here is the modified awk script, with comments:
Code:
#!/usr/bin/awk -f
BEGIN {
    # Each line (using any newline convention) is a separate record.
    RS = "(\r\n|\n\r|\r|\n)"
    # Fields are separated by any amount of whitespace.
    FS = "[\t\v\f ]+"
    # For output, use explicitly the Linux newline convention.
    ORS = "\n"
    # For output, use a single space between fields.
    OFS = " "
}
# Consider only records with three or more fields.
(NF >= 3) {
    # First field is the key.
    k = $1
    # Keep track of each unique key:
    # if count has no key k, then k is a new key.
    if (!(k in count))
        key[++keys] = k
    # Add to the number of times this key has been seen.
    count[k]++
    # Add second field to list1, comma-separated.
    list1[k] = list1[k] "," $2
    # Add third field to list2, comma-separated.
    list2[k] = list2[k] "," $3
}
END {
    # Loop over each unique key k.
    for (i = 1; i <= keys; i++) {
        k = key[i]
        # The number of times this key has been seen.
        n = count[k]
        # The comma-separated lists for this key.
        s1 = list1[k]
        s2 = list2[k]
        # Replace consecutive runs of commas with a single comma.
        # Note: this really only happens if the second or third
        # fields start or end with a comma.
        gsub(/,,+/, ",", s1)
        gsub(/,,+/, ",", s2)
        # Because we add a comma before each entry, there will always be
        # a leading comma. Remove it by skipping the first character.
        s1 = substr(s1, 2)
        s2 = substr(s2, 2)
        # Output the line.
        print k, s1, s2, n
    }
}
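For reference, here is how one might run it. The sample file below is reconstructed from the output shown later in this thread, so treat its exact contents as an assumption, and the script name merge.awk is made up (save the script under that name and make it executable with chmod +x merge.awk):
Code:
$ cat data
sky losn_revue-1981-2 40234
fly kivre1-0-0 1240
fly kivre1-0-0 1236
$ ./merge.awk data
sky losn_revue-1981-2 40234 1
fly kivre1-0-0,kivre1-0-0 1240,1236 2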
I was intrigued by this problem, and thought of another AWK program, using gawk's new true multidimensional arrays (arrays of arrays):
Code:
#!/bin/gawk -f
# Print the values as a comma-separated string and the dimensions
# as a colon-separated string.
#
# Based on the walk_array function found in /usr/share/awk/walkarray.awk.
# By awk convention, i and comma are local variables, declared as
# extra function parameters.
function print_array(arr, name,    i, comma)
{
    comma = ""
    for (i in arr) {
        if (isarray(arr[i])) {
            if (i) printf(":")
            print_array(arr[i], name "[" i "]")
        } else {
            if (i) {
                printf("%s%s", comma, i)
                comma = ", "
            }
        }
    }
}
# Read the input file, storing the information in a 3-dimensional array,
# with the number of occurrences of each first word in words["word"][""][""]
# and the number of occurrences of each additional field in
# words["word"][field#][text].
{
    words[$1][""][""]++
    for (i = 2; i <= NF; ++i)
        words[$1][i][$i]++
}
# Print the summary information, with the count at the end enclosed
# in parentheses.
END {
    for (i in words) {
        printf("%s", i)
        print_array(words[i], "words[" i "]")
        printf(" (%d)\n", words[i][""][""])
    }
}
Using the two sample data sets, this produces the following for the first one:
Code:
$ ./count_by_first_word data
sky:losn_revue-1981-2:40234 (1)
fly:kivre1-0-0:1240, 1236 (2)
Edit: Note that the code makes no assumption that there are only three fields in the input file. It also finds the unique values in each of the input fields, and prints only those unique values. Thus, for example, concatenating the second data set with itself produces the same unique values, just with the counts doubled.
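As a quick sanity check (using the first data set, since the second one isn't reproduced here), running the script on the file given twice should presumably double every count while leaving the unique values unchanged; note that the line order from gawk's for (i in words) loop is unspecified:
Code:
$ ./count_by_first_word data data
sky:losn_revue-1981-2:40234 (2)
fly:kivre1-0-0:1240, 1236 (4)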