LinuxQuestions.org
Old 01-14-2007, 02:59 PM   #1
beeblequix
Member
 
Registered: Oct 2005
Location: Tierra Firma, Earth
Distribution: Debian of course...
Posts: 196

Rep: Reputation: 30
How to count occurrences of unique words in a file


hi folks.

overall goal: list number of occurrences for all words in a spurious olde-english-sounding file. I'd like the output to be something like
words instances
and 17555
it 17530
came 17530
to 17530
pass 17523
some-word 4588
behooveth 677
yea 675
behold 666
sucketh 555
...

So far I've
1) downloaded text file to my linux system
2) ran this command to parse each word into its own line:
awk '{for(i=1;i<=NF;i++) print $i}' book_of_xxxmon.txt > outfile1.txt
3) sorted the data:
sort -d outfile1.txt > outfile2.txt
4) tried using sed to pull out punctuation (,.;) but ended up using OpenOffice Writer to do that manually, then saved the result as outfile2.txt
5) pulled out the unique words:
uniq outfile2.txt > uniq.bom.txt

I *know* there has to be a cleaner and easier way to do all that but that's all I could do.

Now I'd like to use my new "uniq.bom.txt" file to compare it to the original file to count how many occurrences of each of these words are found in the original. I'd rather not have to manually go through my unique listing, run a command such as this to produce the listing --
echo 'pass'; grep 'pass' book_of_xxxmon.txt|wc -l >> final.list.txt

Any ideas (preferably better ones than mine...)?
 
Old 01-14-2007, 03:06 PM   #2
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: Ubuntu, FreeBSD, NetBSD
Posts: 1,449

Rep: Reputation: 47
Yay homework
 
Old 01-14-2007, 03:20 PM   #3
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: Ubuntu, FreeBSD, NetBSD
Posts: 1,449

Rep: Reputation: 47
Note: you can do this all in one step. Which is one of the things it's trying to teach you.

Code:
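#!/bin/sh

# Put any punctuation we want to remove here
PUNCT=";:,."

if [ -z "$1" ]; then
        echo "You need to give this a filename." >&2
        exit 1
fi

# One word per line, delete the punctuation, then count identical words.
# The loop tests x<=NF; testing $x alone would stop early at a word like "0".
awk '{for(x=1;x<=NF;++x)print $x}' "${1}" | tr -d "${PUNCT}" | sort | uniq -c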
Now... you probably want to pipe that through another sort... or through awk '{print $2 " " $1}' to get it in your preferred form.
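
Putting those two hints together, a minimal sketch of the whole pipeline might look like the following. The function name word_freq is made up for illustration, and sorting by count (highest first) is an assumption based on the sample output in the first post.

```shell
#!/bin/sh
# word_freq FILE: print "word count" pairs, most frequent first.
# (Hypothetical helper name; the steps are the ones discussed in this thread.)
word_freq() {
    tr -d ';:,.' < "$1" |                                 # drop punctuation first
        awk '{ for (i = 1; i <= NF; i++) print $i }' |    # one word per line
        sort |                                            # make identical words adjacent
        uniq -c |                                         # prefix each word with its count
        sort -k1,1nr -k2 |                                # highest count first, ties by word
        awk '{ print $2 " " $1 }'                         # swap to "word count"
}
```

With the function loaded, something like `word_freq book_of_xxxmon.txt > final.list.txt` would write the list to a file. Case is not folded here, so "And" and "and" are counted separately.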

I'm going to give you the benefit of the doubt here... and that's only because (on review) of the creative file name... as I doubt a professor would assign something regarding that book unless you're in some bible school... which would make me think you wouldn't have *nix courses.

Edit: Yes, I've given you 99% of the answer you want. And a hint on how to do the rest. Since I believe it may not be homework and even if it was you did do much of the work (but the hard way).

Read the man page for uniq and see how the -c flag helps you here.
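
As a quick illustration of what -c does (the sample words here are made up): uniq only collapses duplicate lines that are adjacent, which is why the sort has to come before it.

```shell
# Without sort, the two "yea" lines are not adjacent, so each is counted on its own.
printf 'yea\nbehold\nyea\n' | uniq -c

# With sort, identical lines become neighbours and are merged into one counted line.
printf 'yea\nbehold\nyea\n' | sort | uniq -c
```

The first command prints three lines and the second prints two.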

Last edited by frob23; 01-14-2007 at 03:25 PM.
 
Old 01-14-2007, 03:47 PM   #4
dv502
Member
 
Registered: Sep 2006
Location: USA - NYC
Distribution: Whatever icon you see!
Posts: 642

Rep: Reputation: 57
Here is a simpler way to count occurrences in a text file

cat filename | xargs -n1 | sort | uniq -c > newfilename

cat will read from file
xargs -n1 will put one word on each line (that's the digit 1, not the letter l)
sort will sort the output
uniq -c will count occurrences
> newfilename will record the results in newfilename
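
One caveat with xargs: by default it treats quotes and backslashes specially, so a text containing an apostrophe (as in "don't") can make it stop with an unmatched-quote error. A sketch of an alternative that splits on whitespace with tr instead; the function name split_count is hypothetical:

```shell
#!/bin/sh
# split_count FILE: count each whitespace-separated word in FILE.
# (Hypothetical helper name, shown only as an alternative to xargs -n1.)
split_count() {
    tr -s '[:space:]' '\n' < "$1" |  # turn every run of whitespace into one newline
        grep -v '^$' |               # drop the empty line left by leading whitespace
        sort |                       # make identical words adjacent
        uniq -c                      # count them
}
```

It would be used the same way, for example `split_count filename > newfilename`.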

Last edited by dv502; 01-15-2007 at 01:48 PM.
 
  


All times are GMT -5. The time now is 08:29 AM.
