mawk

JurajPsycho · 01-18-2005, 11:25 AM

Hi, stupid question...
How can I get a list of the words in a text file using awk or something similar? I need to put them into a database and run some operations on them. Not lines, but words. Thx.
J.

ksun · 01-18-2005, 11:51 AM

grep might be more appropriate. Try looking up a grep howto.

Tinkster · 01-18-2005, 01:00 PM

strings <file> | uniq

Cheers,
Tink

JurajPsycho · 01-19-2005, 05:58 AM

this is what I looked for:

Code:

cat file_name | awk 'BEGIN { FS="[SEPARATORS]" } { for(i = 1 ; i <= NF ; i++)  print $i }'
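For anyone reading later, here is a minimal sketch of the same loop with a concrete separator set filled in. The bracket expression `[ ,.;]+` and the sample sentence are only assumptions for illustration; put in whatever separators your text needs. A separator at the start or end of a line produces an empty field, so the loop skips those.

```shell
# Sample input; the sentence is made up for this example.
text='one, two; three.'

# FS is a regex: one or more spaces, commas, dots or semicolons.
# The trailing "." yields an empty last field, so empty fields are skipped.
words=$(printf '%s\n' "$text" |
  awk 'BEGIN { FS = "[ ,.;]+" }
       { for (i = 1; i <= NF; i++) if ($i != "") print $i }')

printf '%s\n' "$words"
```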

J.

jschiwal · 01-19-2005, 06:05 AM

You could split up the lines into words using 'tr' or 'sed'.
sed 's/ /\n/g' sourcefile | sort -bf | uniq -i >wordlist

tr ' ' '\n' <sourcefile | sort -bf | uniq >wordlist

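Then you might want to add a filter that removes entries containing numbers or special characters from the word list. The delete has to run in a second sed, after the split, otherwise it drops the whole input line rather than the single word:
sed 's/ /\n/g' sourcefile | sed '/[0-9<>+_]/d' | sort | uniq >wordlist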

When I tested my first attempt, some lines weren't unique. Looking in the man page, I found that uniq only collapses successive identical lines, hence I added the sort filter to ensure all identical words end up adjacent.
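A small sketch of that uniq behaviour, using three made-up input lines where the duplicates are not next to each other:

```shell
# Without sort, the two "awk" lines are not adjacent, so uniq keeps both.
unsorted=$(printf 'awk\nsed\nawk\n' | uniq)

# With sort first, the duplicates become adjacent and uniq collapses them.
sorted=$(printf 'awk\nsed\nawk\n' | sort | uniq)

printf '%s\n' "$unsorted"
echo ---
printf '%s\n' "$sorted"
```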

You may also want to use a sed script instead, in order to handle special cases as they occur.

One thing to consider is capitalization. Do you wish to reduce all words to lowercase? If you did that, proper nouns would come out wrong. Also, words split with a hyphen could be joined by the sed script, but some words should stay hyphenated, like 'file-system'.
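
If you do decide to fold everything to lowercase, something like the following might work. It is only a sketch: it assumes plain ASCII text, keeps hyphens inside words so 'file-system' survives, and the sample sentence is made up.

```shell
# Sample input; "File-System" and "file-system" should count as one word.
text='The file-system and the File-System'

# 1. fold uppercase to lowercase (ASCII only)
# 2. turn every run of characters that is not a-z or "-" into one newline
# 3. sort so duplicates are adjacent, then drop them
words=$(printf '%s\n' "$text" |
  tr 'A-Z' 'a-z' |
  tr -cs 'a-z-' '\n' |
  sort | uniq)

printf '%s\n' "$words"
```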

JurajPsycho · 01-20-2005, 09:42 AM

Hmm, my first question was not correct. I didn't need unique words, but the words and their counts in a single file. This is what I was thinking about:
use awk to create a file containing a single word on every line, then use C code to put the words into the database (they have to go into the database anyway), then run a select with a "group by" clause. It works, but any other ideas are welcome.
J.

Tinkster · 01-20-2005, 12:48 PM

Did you see the example in the awk manual?

Code:

#!/usr/bin/awk -f
# Print list of word frequencies
     {
         for (i = 1; i <= NF; i++)
             freq[$i]++
     }
     
     END {
         for (word in freq)
             printf "%s\t%d\n", word, freq[word]
     }
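
For a quick try without saving the script, the same loop can be run inline. This is just a sketch on a made-up sentence; the final sort (highest count first, then alphabetical) is my addition and not part of the manual's example, since awk's `for (word in freq)` returns the words in no particular order.

```shell
# Sample input, made up for this example.
text='to be or not to be'

# Count every whitespace-separated field, print "word<TAB>count",
# then sort by count (numeric, descending) and by word for ties.
counts=$(printf '%s\n' "$text" |
  awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
       END { for (word in freq) printf "%s\t%d\n", word, freq[word] }' |
  sort -k2,2nr -k1,1)

printf '%s\n' "$counts"
```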

Cheers,
Tink

JurajPsycho · 01-21-2005, 05:03 AM

bingo!
I should get better glasses :-)
thx
J.