LinuxQuestions.org

LinuxQuestions.org (/questions/)
-   Programming (http://www.linuxquestions.org/questions/programming-9/)
-   -   bash scripting and loops (http://www.linuxquestions.org/questions/programming-9/bash-scripting-and-loops-280990/)

phoeniks 01-22-2005 05:47 PM

bash scripting - problems with loops and grep
 
[CODE]
#!/bin/bash

for CURDOC in `grep -hr "From: Doctor" /home/jsb46/cs265/output/DrList`
do

echo \"$CURDOC\"
echo -----------------
grep -hr "From: Doctor $CURDOC" *|wc -l
echo

done
[/CODE]


What's supposed to happen: The text file DrList is a list of doctors pulled from message board backup text files, one file per post. It was built by another script that looked for "From: Doctor" and threw the entire line into DrList. So DrList looks like this:

From: Doctor Doctor's_Name

Now I want to find out how many messages each doctor has posted, based on the same principle.

The problem: the searching goes fine, but I can't get the entire line of text from DrList into the variable, only one word at a time, so it searches for "From:", "Doctor", and the doctor's name separately. It also searches for every occurrence of a word each time it appears in DrList.

Bottom line: please help me grep for a string like "From: Doctor Achilles" rather than for all three text chunks separately. Any help is appreciated, so thanks in advance!
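For reference, the one-word-at-a-time behaviour comes from shell word splitting: the output of the backticks is split on every space, tab, and newline in $IFS, not on newlines alone. A minimal sketch (with made-up doctor names) that reproduces it:

```shell
# Minimal reproduction: command substitution output is split on $IFS
# (space, tab, newline by default), so each word becomes its own item.
LINES="From: Doctor Achilles
From: Doctor Bob"
for CURDOC in $LINES
do
    echo "<$CURDOC>"
done
# Each angle-bracketed item is a single word, not a whole line.
```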

J_Szucs 01-22-2005 08:16 PM

Well, if you had just one file (say... data.txt) containing some lines, then this single command will count the occurrences of each line in data.txt and print each line followed by its count:
awk '{count[$0]=count[$0]+1} END {for (data in count) print data, count[data]}' data.txt

Alternatively, you could cat the datafile to the standard input of awk:
cat data.txt | awk '{count[$0]=count[$0]+1} END {for (data in count) print data, count[data]}'

You could also use wildcards, if you have several files:
awk '{count[$0]=count[$0]+1} END {for (data in count) print data, count[data]}' *.txt

You could filter the lines by pattern, if you do not want to count all lines:
awk '/pattern/ {count[$0]=count[$0]+1} END {for (data in count) print data, count[data]}' *.txt

You could also print the count first, and use the sort command to sort the output of awk in descending order of count:
awk '{count[$0]=count[$0]+1} END {for (data in count) print count[data], data}' *.txt | sort -nr

So you do not really need a script for your task; just a single awk command.

A note: you could also consider directly feeding the mails to awk and using the pattern filter, instead of generating the intermediate DrList file.
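A sketch of that direct, one-pass approach (hypothetical paths; it assumes every doctor line starts with "From: Doctor", and note that xargs may split a very large file list into several awk runs, which would break the combined counts):

```shell
# Count posts per doctor in one pass, with no intermediate DrList file.
find /path/to/base -type f -print0 \
  | xargs -0 awk '/^From: Doctor/ {count[$0]++}
                  END {for (d in count) print count[d], d}' \
  | sort -nr
```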

P.S.:
Your script might also work (though much less efficiently) if you inserted these lines before the for loop:
IFS="
"

phoeniks 01-23-2005 11:52 AM

The thing is, though, that the script is being run over a lot of files in different directories. It's going to be run in a specified directory containing about 65 numbered subdirectories. Inside these subdirectories are the message files I'm searching; the script has to be run from that base directory, which is why I was using the recursive grep option. So, when run, it has to go into each subdirectory, search each file for occurrences of X (X being each line of DrList respectively), and report the number of instances of X over those 25000 files. My experience with awk is only about a week old, so if I can combine that inside the bash script or make awk search subdirectories recursively, I'm not really sure how to do it. Thanks again.


EDIT: maybe this is a better option: I also tried using awk to grab the third field from DrList, which is the last name, then pumping that into a variable one at a time and running grep -rh "From: Doctor $DOCTOR". I've tried a bunch of things and all come very close to working; it's just that I can't properly search for the entire string "From: Doctor $DOCTOR". Instead I get three separate searches, one for each field in those quotes.
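For the record, that third-field approach does work once the grep pattern is double-quoted as a single string. A sketch using the path from the original script (it assumes each last name is a single whitespace-free token, as in "Doctor's_Name"):

```shell
#!/bin/bash
DRLIST=/home/jsb46/cs265/output/DrList

# Pull field 3 (the name) from each DrList line, then count matches per name.
awk '{print $3}' "$DRLIST" | while read DOCTOR
do
    # The double quotes keep all three words together as ONE grep pattern
    printf '%s: ' "$DOCTOR"
    grep -hr "From: Doctor $DOCTOR" * | wc -l
done
```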

bigearsbilly 01-24-2005 04:39 AM

have you looked at using
sort & uniq -c?

e.g.

data:
Code:

billym.primadtpdev>cat ~/1

From: Doctor Jim
From: Doctor Bob
From: Doctor Ringo
From: Doctor Ringo
From: Doctor billy
From: Doctor billy
From: Doctor James
From: Doctor billy
From: Doctor Who

sort & uniq:
Code:

billym.primadtpdev>sort ~/1 | uniq -c

  3 From: Doctor billy
  1 From: Doctor Bob
  1 From: Doctor James
  1 From: Doctor Jim
  2 From: Doctor Ringo
  1 From: Doctor Who


/bin/bash 01-24-2005 11:12 AM

Code:

while read CURDOC
do
  echo \"$CURDOC\"
  echo -----------------
  # $CURDOC already holds the whole "From: Doctor ..." line, so search for it as-is
  grep -hr "$CURDOC" * | wc -l
  echo
done </home/jsb46/cs265/output/DrList


J_Szucs 01-24-2005 04:00 PM

Well, just now I realized what your task actually is. So, you have doctor names in file /path/to/drlist...

Though others have posted complete solutions that work, here is an alternative that should also work and has one advantage over the other solutions: it runs grep only once (and with the -F option), so it should be much faster, especially if you have a lot of files to search:

grep -R -F -h "`cat /path/to/drlist | sed 's/^.*$/From: Doctor &/'`" /path/to/files/* | sort | uniq -c

(The -h flag suppresses the file name prefixes; without it, uniq -c would see each "file:match" combination as unique and never aggregate counts across files.)

If drlist already contained lines formatted like "From: Doctor doctorname", then the counting command would be simpler:

grep -R -h -f /path/to/drlist /path/to/files/* | sort | uniq -c

(The latter command makes use of the fact that grep can search for several regexps in one pass, and that those regexps can come from a file, the drlist file in this case.)
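A self-contained sketch of that -f variant, using made-up names and temporary files just to show the shape of the output (the -h flag keeps grep from prefixing filenames, so uniq -c can aggregate matches across files):

```shell
tmp=$(mktemp -d)
printf 'From: Doctor Bob\nFrom: Doctor Ann\n' > "$tmp/drlist"
printf 'From: Doctor Bob\nsome body text\n'   > "$tmp/post1"
printf 'From: Doctor Bob\n'                   > "$tmp/post2"
printf 'From: Doctor Ann\n'                   > "$tmp/post3"

# One grep invocation reads every pattern from drlist at once
grep -R -h -f "$tmp/drlist" "$tmp"/post* | sort | uniq -c

rm -rf "$tmp"
```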

However, I must admit that with the simple "uniq -c" command (I did not know about that option of uniq), there is no need for awk for this task.

Finally, I am really interested in whether you find the above commands actually faster. Please post your findings.


All times are GMT -5. The time now is 03:19 AM.