Thank you, now it is working.
But the program doesn't give the correct results. It attaches a 1 to every entry, even though there are repeated entries that should get their actual frequency count. It seems to me that it doesn't take the fields into account; the whole line is treated as one entry, especially since there are some words that appear in different files:
examples
Does it really matter when the input is a large file? I tested with a 3-entry sample input and it works, but with my input of over 1000 entries it doesn't!
I thought the first two fields were supposed to be the key. #16 shows that only the first field is the key, and that both the second and third fields should be gathered in lists.
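In other words, using a small made-up sample (hypothetical data, just to illustrate the expected transformation):
Code:
# input (hypothetical)
fly kivre1-0-0 1240
fly kivre1-0-0 1236
sky losn_revue-1981-2 40234

# desired output: key, list of 2nd fields, list of 3rd fields, count
fly kivre1-0-0,kivre1-0-0 1240,1236 2
sky losn_revue-1981-2 40234 1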
Here is the modified awk script, with comments:
Code:
#!/usr/bin/awk -f
BEGIN {
    # Each line (using any newline convention) is a separate record.
    RS = "(\r\n|\n\r|\r|\n)"
    # Fields are separated by any amount of whitespace.
    FS = "[\t\v\f ]+"
    # For output, use explicitly the Linux newline convention.
    ORS = "\n"
    # For output, use a single space between fields.
    OFS = " "
}
# Consider only records with three or more fields.
(NF >= 3) {
    # First field is the key.
    k = $1
    # Keep track of each unique key:
    # if count has no key k, then k is a new key.
    if (!(k in count))
        key[++keys] = k
    # Add to the number of times this key has been seen.
    count[k]++
    # Add second field to list1, comma-separated.
    list1[k] = list1[k] "," $2
    # Add third field to list2, comma-separated.
    list2[k] = list2[k] "," $3
}
END {
    # Loop over each unique key k.
    for (i = 1; i <= keys; i++) {
        k = key[i]
        # The number of times this key has been seen.
        n = count[k]
        # The comma-separated lists for this key.
        s1 = list1[k]
        s2 = list2[k]
        # Replace consecutive runs of commas with a single comma.
        # Note: this really only happens if the second or third
        # fields start or end with a comma.
        gsub(/,,+/, ",", s1)
        gsub(/,,+/, ",", s2)
        # Because we add a comma before each entry, there will always be
        # a leading comma. Remove it by skipping the first character.
        s1 = substr(s1, 2)
        s2 = substr(s2, 2)
        # Output the line.
        print k, s1, s2, n
    }
}
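For reference, here is how one might run it. The sample file below is reconstructed from the output shown later in this thread, so treat its exact contents as an assumption, and the script name merge.awk is made up (save the script under that name and make it executable with chmod +x merge.awk):
Code:
$ cat data
sky losn_revue-1981-2 40234
fly kivre1-0-0 1240
fly kivre1-0-0 1236
$ ./merge.awk data
sky losn_revue-1981-2 40234 1
fly kivre1-0-0,kivre1-0-0 1240,1236 2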
I was intrigued by this problem, and thought of another AWK program, using gawk's new true multidimensional arrays (arrays of arrays):
Code:
#!/bin/gawk -f
# Print the values as a comma-separated string and the dimensions
# as a colon-separated string.
#
# Based on the walk_array function found in /usr/share/awk/walkarray.awk.
# By awk convention, i and comma are local variables, declared as
# extra function parameters.
function print_array(arr, name,    i, comma)
{
    comma = ""
    for (i in arr) {
        if (isarray(arr[i])) {
            if (i) printf(":")
            print_array(arr[i], name "[" i "]")
        } else {
            if (i) {
                printf("%s%s", comma, i)
                comma = ", "
            }
        }
    }
}
# Read the input file, storing the information in a 3-dimensional array,
# with the number of occurrences of each first word in words["word"][""][""]
# and the number of occurrences of each additional field in
# words["word"][field#][text].
{
    words[$1][""][""]++
    for (i = 2; i <= NF; ++i)
        words[$1][i][$i]++
}
# Print the summary information, with the count at the end enclosed
# in parentheses.
END {
    for (i in words) {
        printf("%s", i)
        print_array(words[i], "words[" i "]")
        printf(" (%d)\n", words[i][""][""])
    }
}
Using the two sample data sets, this produces the following for the first one:
Code:
$ ./count_by_first_word data
sky:losn_revue-1981-2:40234 (1)
fly:kivre1-0-0:1240, 1236 (2)
Edit: Note that the code makes no assumption that there are only three fields in the input file. It also finds the unique values in each of the input fields, and prints only those unique values. Thus, for example, concatenating the second data set with itself produces the same unique values, just with the counts doubled.
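As a quick sanity check (using the first data set, since the second one isn't reproduced here), running the script on the file given twice should presumably double every count while leaving the unique values unchanged; note that the line order from gawk's for (i in words) loop is unspecified:
Code:
$ ./count_by_first_word data data
sky:losn_revue-1981-2:40234 (2)
fly:kivre1-0-0:1240, 1236 (4)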