LinuxQuestions.org
Welcome to the most active Linux Forum on the web.
Old 01-14-2007, 03:59 PM   #1
beeblequix
Member
 
Registered: Oct 2005
Location: Tierra Firma, Earth
Distribution: Debian of course...
Posts: 198

Rep: Reputation: 30
How to count occurrences of unique words in a file


hi folks.

overall goal: list number of occurrences for all words in a spurious olde-english-sounding file. I'd like the output to be something like
words instances
and 17555
it 17530
came 17530
to 17530
pass 17523
some-word 4588
behooveth 677
yea 675
behold 666
sucketh 555
...

So far I've
1) downloaded text file to my linux system
2) run this command to put each word on its own line:
awk '{for(i=1;i<=NF;i++) print $i}' book_of_xxxmon.txt > outfile1.txt
3) sorted the data:
sort -d outfile1.txt > outfile2.txt
4) tried using sed to pull out the punctuation (, . ;) but ended up using OpenOffice Writer to do that manually, then saved the result as outfile2.txt
5) pulled out the unique words:
uniq outfile2.txt > uniq.bom.txt

I *know* there has to be a cleaner and easier way to do all that but that's all I could do.

Now I'd like to compare my new "uniq.bom.txt" file against the original file and count how many occurrences of each of these words appear there. I'd rather not go through my unique listing by hand, running a command like this for every word to produce the listing --
echo 'pass'; grep 'pass' book_of_xxxmon.txt|wc -l >> final.list.txt

Any ideas (preferably better ones than mine...)?
 
Old 01-14-2007, 04:06 PM   #2
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
Yay homework
 
Old 01-14-2007, 04:20 PM   #3
frob23
Senior Member
 
Registered: Jan 2004
Location: Roughly 29.467N / 81.206W
Distribution: OpenBSD, Debian, FreeBSD
Posts: 1,450

Rep: Reputation: 48
Note: you can do this all in one step. Which is one of the things it's trying to teach you.

Code:
#!/bin/sh

# Put any punctuation we want to remove here
PUNCT=";:,."

if [ x"$1" = "x" ]; then
        echo "You need to give this a filename."
        exit 1
fi

awk '{for(x=1;x<=NF;++x)print $x}' "${1}" | tr "${PUNCT}" "@" | sed 's/@//g' | sort | uniq -c
Now... you probably want to pipe that through another sort... or through awk '{print $2 " " $1}' to get it in your preferred form.

I'm going to give you the benefit of the doubt here... and that's only because (on review) of the creative file name... as I doubt a professor would assign something regarding that book unless you're in some bible school... which would make me think you wouldn't have *nix courses.

Edit: Yes, I've given you 99% of the answer you want. And a hint on how to do the rest. Since I believe it may not be homework and even if it was you did do much of the work (but the hard way).

Read the man page for uniq and see how the -c flag helps you here.
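Putting the hints in this post together, here's a minimal sketch of the whole pipeline; the sample text and file names are made up for illustration:

```shell
# Sample input standing in for the original book file (contents made up here)
printf 'and it came to pass, and behold;\nand it came again.\n' > sample1.txt

# one word per line -> strip punctuation -> count -> most frequent first
# -> swap columns into the "word count" form asked for
tr -s ' \n' '\n' < sample1.txt \
    | tr -d ';:,.' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{print $2, $1}' > counts.txt
cat counts.txt
```

sort -rn orders the uniq -c output by count, descending, and the final awk swaps the columns.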

Last edited by frob23; 01-14-2007 at 04:25 PM.
 
Old 01-14-2007, 04:47 PM   #4
dv502
Member
 
Registered: Sep 2006
Location: USA - NYC
Distribution: Whatever icon you see!
Posts: 642

Rep: Reputation: 57
Here is a simpler way to count occurrences of words in a text file

cat filename | xargs -n1 | sort | uniq -c > newfilename

cat will read from file
xargs -n1 will put one word on each line, that's a number 1
sort will sort the output
uniq -c will count the occurrences
> newfilename will record the results in newfilename
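If you also want the most frequent words first, the same pipeline can feed a numeric reverse sort; the input file here is made up for illustration:

```shell
# Made-up input file for illustration
printf 'yea behold yea sucketh yea behold\n' > sample2.txt

# one word per line -> sort -> count -> numeric sort, most frequent first
xargs -n1 < sample2.txt | sort | uniq -c | sort -rn > counted.txt
cat counted.txt
```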

Last edited by dv502; 01-15-2007 at 02:48 PM.
 
Old 06-04-2018, 10:54 AM   #5
asurinsaka
LQ Newbie
 
Registered: Jun 2018
Posts: 1

Rep: Reputation: Disabled
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -r | awk '{ print $2, $1 }'
 
Old 06-04-2018, 12:12 PM   #6
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,629

Rep: Reputation: 736Reputation: 736Reputation: 736Reputation: 736Reputation: 736Reputation: 736Reputation: 736
Quote:
Originally Posted by frob23 View Post
Note: you can do this all in one step. Which is one of the things it's trying to teach you.
Code:
...
... tr "${PUNCT}" "@" | sed 's/@//g' ...
is a complex and slow way of writing
Code:
... tr -d "$PUNCT" ...
If the goal is to also delete @, then make @ part of $PUNCT:
Code:
PUNCT=";:,.@"
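For example, tr -d removes every character of the set in a single pass (sample text made up here):

```shell
PUNCT=";:,.@"
printf 'and it came to pass, yea; behold.\n' | tr -d "$PUNCT" > clean.txt
cat clean.txt
```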

Last edited by MadeInGermany; 06-04-2018 at 12:15 PM.
 
Old 06-04-2018, 12:41 PM   #7
MadeInGermany
Senior Member
 
Registered: Dec 2011
Location: Simplicity
Posts: 1,629

Rep: Reputation: 736Reputation: 736Reputation: 736Reputation: 736Reputation: 736Reputation: 736Reputation: 736
Your last question:
grep can count!
Code:
{ printf "%s " 'pass'; grep -wc 'pass' book_of_xxxmon.txt; } >> final.list.txt
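A sketch of looping that over the whole unique word list, with file contents that are stand-ins for the originals. One caveat: grep -c counts matching *lines*, not matches, so for a book with many words per line this sketch uses grep -o (one match per output line) with -w (whole words only) and counts the lines instead:

```shell
# Stand-ins for the files from the original post (contents made up here)
printf 'and it came to pass and it came\n' > book.txt
printf 'and\ncame\nit\npass\nto\n' > uniq.bom.txt

# One "word count" line per unique word in the list
while read -r word; do
    printf '%s %s\n' "$word" "$(grep -ow "$word" book.txt | wc -l)"
done < uniq.bom.txt > final.list.txt
cat final.list.txt
```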

Last edited by MadeInGermany; 06-04-2018 at 12:46 PM.
 
Old 06-04-2018, 02:02 PM   #8
teckk
Senior Member
 
Registered: Oct 2004
Distribution: FreeBSD Arch
Posts: 3,358

Rep: Reputation: 1017Reputation: 1017Reputation: 1017Reputation: 1017Reputation: 1017Reputation: 1017Reputation: 1017Reputation: 1017
Quote:
to count how many occurrences of each of these words are found in the original.
Instead of making multiple files you could use variables

Examples:

A bunch of text in a $variable
Code:
list="and or but nor with have get jump catch frog cat dog bear 
mouse and or house car truck and with get frog fish jump catch fox 
have get have and or but nor love hate monday tuesday montag dienstag 
lunes martes truck car monday fish car house boat car bicycle house 
car and boat with without in out on above below car house cat dog 
fish cow pig hog horse chicken goat frog car and house linux bash"
Word count in $variable
Code:
wc -w <<< "$list"
or
echo "$list" | wc -w
List of unique words in $variable
Code:
uniq_words=$(echo -e "${list// /\\n}" | sort -u)
echo "$uniq_words"
wc -w <<< "$uniq_words"
Count occurrences of each word in $variable
Code:
for i in ${uniq_words}; do 
    echo -e "\n$i"
    grep -ow "$i" <<< "${list}" | wc -l
done
Edit: And I now notice that we responded to an 11-year-old thread. Oh well, maybe someone can benefit; the idea is still the same, parsing a file/$var to get info.

Last edited by teckk; 06-04-2018 at 02:21 PM.
 
Old 06-05-2018, 11:16 AM   #9
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware-current
Posts: 5,517

Rep: Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116Reputation: 2116
Code:
bash-4.4$ time cat Gettysburg_Address | tr -s ' ' '\n' | sort | uniq -c | sort -r | awk '{ print $2, $1 }'
real	0m0.015s
user	0m0.029s
sys	0m0.015s
bash-4.4$ time gawk '{a[$0]++} END{for (k in a) print k,a[k]}' RS='[[:space:][:punct:]]+' Gettysburg_Address | sort
real	0m0.013s
user	0m0.012s
sys	0m0.007s
The gawk command is slightly amended from here to also remove punctuation characters.
 
  

