Linux - Newbie
This Linux forum is for members that are new to Linux. Just starting out and have a question? If it is not in the man pages or the how-to's, this is the place!
Hi, stupid question...
how can I get a list of the words contained in a text file using awk or something? I need to put them into a database and run some operations on them... Not lines, but words. Thx.
J.
Then you might want to include a filter on the word list to remove lines with numbers and special characters:
sed -e 's/ /\n/g' -e '/[0-9<>+_]/d' | sort | uniq >wordlist
When I tested my first attempt, some lines weren't unique. Looking in the man page, I found that uniq only collapses successive identical lines, hence I added the sort filter to ensure all identical words would be adjacent.
You may also want to use a sed script instead, in order to handle special cases as they occur.
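A standalone sed script along those lines might look like the following sketch. The filename words.sed, the whitespace-splitting class, and the extra empty-line deletion are my additions; the digit/special-character class is the same one used in the pipeline above, and like that pipeline it drops a whole input line if any token on it matches.

```shell
#!/bin/sh
# Write a sed script (hypothetical name: words.sed) that splits each
# input line into one word per line, drops lines containing digits or
# the special characters < > + _ , and removes empty lines.
cat > words.sed <<'EOF'
s/[[:space:]]\{1,\}/\n/g
/[0-9<>+_]/d
/^$/d
EOF
# Usage: sed -f words.sed textfile | sort | uniq > wordlist
```

Keeping the rules in a script file means new special cases can be handled by adding a line, without rebuilding the whole pipeline.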
One thing to consider is capitalization. Do you wish to reduce all words to lowercase? But if you did that, words that are normally capitalized would come out incorrect. Also, words split with a hyphen could be joined by the sed script, but some words should stay hyphenated, like 'file-system'.
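If you do decide to fold case, a tr stage in front of the pipeline would do it. A minimal sketch, where textfile and wordlist are placeholder names:

```shell
#!/bin/sh
# Sample input (stand-in for the real text file)
printf 'The cat saw the Cat\n' > textfile
# Fold everything to lowercase so "The" and "the" collapse into one
# entry (losing capitalization, as noted above), then split into
# words, sort, and deduplicate.
tr '[:upper:]' '[:lower:]' < textfile \
    | tr -s '[:space:]' '\n' | sort | uniq > wordlist
```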
hmm, my first question was not correct. I didn't need unique words, but words and their counts in a single file. This is what I was thinking: use awk to create a file containing a single word on every line, use C code to put them into the database (they have to go into the database anyway), then run a select with a "group by" clause. It works, but any other ideas are welcome.
J.
#!/usr/bin/awk -f
# Print a list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
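To get the list ordered by count rather than in awk's arbitrary array order, the script's output can be piped through sort. A sketch using an inlined copy of the same awk program, with textfile as a placeholder input name:

```shell
#!/bin/sh
printf 'to be or not to be\n' > textfile   # stand-in input
# Same frequency count as the awk script above, inlined, with the
# output sorted numerically on the count field, highest first.
awk '{ for (i = 1; i <= NF; i++) freq[$i]++ }
     END { for (w in freq) printf "%s\t%d\n", w, freq[w] }' textfile \
    | sort -k2,2nr > counts
```

This gets you the word counts without the database round-trip, though it obviously doesn't help if the words need to end up in the database anyway.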