LinuxQuestions.org
Old 01-18-2005, 11:25 AM   #1
JurajPsycho
Member
 
Registered: Sep 2004
Distribution: Debian, kernel 2.6.10
Posts: 50

Rep: Reputation: 15
mawk


Hi, stupid question...
How can I get a list of the words in a text file using awk or something similar? I need to put them into a database and do some operations on them. Not lines, but words. Thx.
J.
 
Old 01-18-2005, 11:51 AM   #2
ksun
Member
 
Registered: Sep 2003
Posts: 52

Rep: Reputation: 15
grep might be more appropriate. Try looking up a grep howto.
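
A minimal sketch of how grep could do this: the `-o` option prints each match on its own line, so matching runs of letters gives one word per line. The function name `list_words` and the sample text are made up for illustration, and `-o` is an extension found in GNU and BSD grep rather than something POSIX guarantees.

```shell
# list_words: read text on stdin, print every run of letters on its own line.
# Assumes a grep that supports -o (GNU and BSD grep do).
list_words() {
    grep -oE '[[:alpha:]]+'
}

# Made-up sample input
printf 'the quick fox, the lazy dog\n' | list_words
```

Digits and punctuation are dropped by this pattern; widen the bracket expression if they should count as part of a word.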
 
Old 01-18-2005, 01:00 PM   #3
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
strings <file> | uniq


Cheers,
Tink
 
Old 01-19-2005, 05:58 AM   #4
JurajPsycho
Member
 
Registered: Sep 2004
Distribution: Debian, kernel 2.6.10
Posts: 50

Original Poster
Rep: Reputation: 15
This is what I was looking for:
Code:
cat file_name | awk 'BEGIN { FS="[SEPARATORS]" } { for(i = 1 ; i <= NF ; i++)  print $i }'
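
[SEPARATORS] in the command above is a placeholder. As a sketch of one possible choice (an assumption, not necessarily the set that was actually used), the bracket expression below splits on spaces, tabs, and some common punctuation. The extra test skips the empty fields that appear when a line starts or ends with a separator. The function name `split_words` is hypothetical.

```shell
# split_words: read text on stdin, print one word per line.
# The separator set is an assumption; adjust it to the input.
split_words() {
    awk 'BEGIN { FS = "[ \t.,;:!?]+" }
         { for (i = 1; i <= NF; i++) if ($i != "") print $i }'
}

# Made-up sample input
printf 'Hello, world! Hello again.\n' | split_words
```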
J.
 
Old 01-19-2005, 06:05 AM   #5
jschiwal
LQ Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 682
You could split the lines into words using 'tr' or 'sed'.
sed 's/ /\n/g' sourcefile | sort -bf | uniq -i >wordlist

tr ' ' '\n' <sourcefile | sort -bf | uniq >wordlist

Then you might want to add a filter that removes entries containing numbers or special characters from the word list:
sed 's/ /\n/g' sourcefile | sed '/[0-9<>+_]/d' | sort | uniq >wordlist

When I tested my first attempt, some lines weren't unique. Looking in the man page, I found that uniq only collapses successive identical lines, hence I added the sort filter to ensure that all identical words end up adjacent.
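
The behaviour is easy to reproduce with a few made-up lines. A small sketch:

```shell
# uniq only collapses identical lines that are adjacent.
printf 'b\na\nb\nb\n' | uniq          # prints b, a, b (the first b survives)
printf 'b\na\nb\nb\n' | sort | uniq   # prints a, b
# sort -u does the same job as sort | uniq in one step
printf 'b\na\nb\nb\n' | sort -u       # prints a, b
```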

You may also want to use a sed script instead, in order to handle special cases as they occur.

One thing to consider is capitalization. Do you wish to reduce all words to lowercase? If you did that, proper nouns would come out wrong. Also, words split with a hyphen could be joined by the sed script, but some words should stay hyphenated, like 'file-system'.
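
If folding everything to lowercase is acceptable, one way to do it is with tr. This is only a sketch under that assumption: it treats everything except letters and hyphens as a separator, so 'file-system' stays one word, and it makes no attempt to preserve proper nouns. The function name `normalize_words` is hypothetical.

```shell
# normalize_words: stdin -> unique lowercase words, one per line.
# Everything except letters and hyphens is treated as a separator.
normalize_words() {
    tr '[:upper:]' '[:lower:]' |
        tr -cs '[:alpha:]-' '\n' |
        sort -u
}

# Made-up sample input
printf 'The file-system and the File-System\n' | normalize_words
```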

Last edited by jschiwal; 01-19-2005 at 06:07 AM.
 
Old 01-20-2005, 09:42 AM   #6
JurajPsycho
Member
 
Registered: Sep 2004
Distribution: Debian, kernel 2.6.10
Posts: 50

Original Poster
Rep: Reputation: 15
hmm

Hmm, my first question was not correct. I didn't need unique words, but words and their counts in a single file. This is what I was thinking of:
use awk to create a file containing a single word on each line, use C code to put the words into a database (they have to go into the database anyway), then run a SELECT with a "group by" clause. It works, but any other ideas are welcome.
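
For counts without a database, one shell-only alternative (a sketch, not the approach described above) is `sort | uniq -c`, which prefixes each distinct line with the number of times it occurs, much like GROUP BY with COUNT(*). The function name `count_words` is hypothetical, and the input is assumed to already contain one word per line.

```shell
# count_words: stdin has one word per line; print "count word",
# most frequent first.
count_words() {
    sort | uniq -c | sort -rn
}

# Made-up sample input: b occurs 3 times, a twice, c once
printf 'b\na\nb\nc\nb\na\n' | count_words
```

The counts are left-padded with spaces by uniq, so strip the padding (for example with awk) before loading the result anywhere that cares about exact columns.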
J.
 
Old 01-20-2005, 12:48 PM   #7
Tinkster
Moderator
 
Registered: Apr 2002
Location: earth
Distribution: slackware by choice, others too :} ... android.
Posts: 23,067
Blog Entries: 11

Rep: Reputation: 928
Did you see the example in the awk manual?

Code:
#!/usr/bin/awk -f
# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}

Cheers,
Tink
 
Old 01-21-2005, 05:03 AM   #8
JurajPsycho
Member
 
Registered: Sep 2004
Distribution: Debian, kernel 2.6.10
Posts: 50

Original Poster
Rep: Reputation: 15
bingo!
I should get better glasses :-)
thx
J.
 
  


All times are GMT -5.