Script to get the word count of a paragraph from a long message
Hi all,
I am trying to develop a script that would let me count the number of words under a particular title in a long message.

My exact requirement is to number the lines in the entire file (which will have fields like Subject, From, To, Date and Message ID, in any order, depending on the mail client the user sends the mail with), then get the line number of the Subject field and of the next field (which may be From, To or Date), print the lines between them, and run wc -m on the result.

Note: Simply doing a cat and then a grep for Subject doesn't always work in my case. When the subject contains a line break, cat | grep reads only the first line.

What I have now: I can number the lines, get the line number of the Subject field, and find the lowest value among the other fields that is still higher than the Subject line's value. I then want to pass these values to sed to print the lines. However, I am unable to pass the values to sed as variables in a script.

Can any of you help me overcome this, or suggest a simpler but still working logic to achieve it?

Many thanks in advance.
Hi TB0ne,
Many thanks for your response. I am posting what I've written to carry out the task I described. This is an extract from a much longer script. I've added comments to give you an idea of what I'm trying to do. I hope it makes sense.

for I in `cat -s /tmp/24hr_queue.txt`
do
echo $I
#I am numbering each line and then assigning the line number value of field of interest to variables
a=`less $I | grep -n -m1 ^"Subject: " |cut -d ":" -f1`
b=`less $I | grep -n -m1 ^"From: " |cut -d ":" -f1`
c=`less $I | grep -n -m1 ^"To: " |cut -d ":" -f1`
d=`less $I | grep -n -m1 ^"Message-ID: " |cut -d ":" -f1`
e=`less $I | grep -n -m1 ^"Date: " |cut -d ":" -f1`
f=`less $I | grep -n -m1 ^"Mime-Version: " |cut -d ":" -f1`
g=`less $I | grep -n -m1 ^"Content-Type: " |cut -d ":" -f1`
h=`less $I | grep -n -m1 ^"In-Reply-To: " |cut -d ":" -f1`
echo $a,$b,$c,$d,$e,$f,$g,$h
#here I am trying to find the value immediately higher than "$a"
if [ $b -gt $a ]
then
echo $b > /tmp/array.txt
fi
if [ $c -gt $a ]
then
echo $c >> /tmp/array.txt
fi
if [ $d -gt $a ]
then
echo $d >> /tmp/array.txt
fi
if [ $e -gt $a ]
then
echo $e >> /tmp/array.txt
fi
if [ $f -gt $a ]
then
echo $f >> /tmp/array.txt
fi
if [ $g -gt $a ]
then
echo $g >> /tmp/array.txt
fi
if [ $h -gt $a ]
then
echo $h >> /tmp/array.txt
fi
j= cat /tmp/array.txt | xargs -n1 | sort | tail -1
echo $a,$j
j=$j-l
# Here I am assigning the value immediately higher to $a to j
k=less $I | sed '$a,$j!d' | wc -m
(this is the part that is not working for me)
if k -gt 300
then
echo "$I : Subject too long, will timeout"
fi
done
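The failing line in the script above wraps the sed script in single quotes, so the shell never expands $a and $j and sed receives them as literal text. A minimal sketch of passing the two line numbers through double quotes instead, and capturing the result with $(..), might look like this. The sample file and the hard-coded line numbers are assumptions for illustration only:

```shell
#!/bin/bash
# Hypothetical sample file standing in for one queued message.
msg=$(mktemp)
printf '%s\n' 'Date: today' 'Subject: line1' ' line2' 'To: abcd@yahoo.com' > "$msg"

a=2   # assumed line number of the Subject field
j=3   # assumed last line that still belongs to the subject

# Double quotes let the shell substitute $a and $j before sed sees the script.
# The braces keep the variable names separate from the trailing "p".
k=$(sed -n "${a},${j}p" "$msg" | wc -m)
echo "$k"

rm -f "$msg"
```

The same idea applies to the final test: a comparison such as [ "$k" -gt 300 ] needs the $ and the brackets to work.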
1) Please use [code][/code] tags around your code, to preserve formatting and to improve readability. Edit your last post to correct this, if you don't mind.
2) Post some sample input, as grail asked.
3) Is the script to be for bash, POSIX-compliant, or something else?
4) Don't read lines with for: http://mywiki.wooledge.org/DontReadLinesWithFor
5) $(..) is highly recommended over `..`
6) Always quote your variables, particularly inside the [..] test command.
http://mywiki.wooledge.org/Arguments
http://mywiki.wooledge.org/WordSplitting
http://mywiki.wooledge.org/Quotes
7) Useless use of less? I can't tell for certain from the above what "$I" is supposed to hold... a filename? If so, then just pass it as an argument directly to grep or sed, instead of using a pipe. There are likely better ways to do the job anyway.
8) At the very least, all of your individual if tests can be compacted into a single loop. (Also, xargs??) And again, there are often better ways to do such things:
Code:
$ arr1=( 5 7 3 9 4 2 10 )

Edit:
9) Your cat | grep problem sounds like it may be due to DOS vs. Unix line endings.
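For point 8, here is a minimal sketch of how the chain of if tests could collapse into one loop over a bash array, keeping the smallest line number that is still greater than the Subject line. The values and the variable names (others, next) are illustrative assumptions, not taken from the original script:

```shell
#!/bin/bash
a=5                       # hypothetical line number of the Subject field
others=( 9 3 "" 12 7 )    # hypothetical line numbers of the other headers; "" = header absent

next=""
for n in "${others[@]}" ; do
    [[ -n $n ]] || continue                 # skip headers that were not found
    # Keep n if it lies after the Subject line and is lower than the best so far.
    if (( n > a )) && { [[ -z $next ]] || (( n < next )) ; } ; then
        next=$n
    fi
done

echo "$next"
```

Comparing numerically inside the shell also avoids the plain sort | tail -1 in the original, which sorts as text and picks the last entry rather than the lowest one.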
David,
Thanks for your response and suggestions. This is the first time I am writing such a long script. The script is to be for bash. I know my scripting knowledge is really insufficient for this task, but I thought I would try it and sharpen my scripting skills.

I am part of the team maintaining some 400 Linux servers hosting 12 million mailboxes. Every day we need to find out how many messages are stuck in the MTA queues (both inbound and outbound). Doing this manually takes more than 6 hours, and I am trying to make this script do it. The most common cause is a subject exceeding 300 characters.

About the use of less: $I is a file whose contents are the original email along with the message headers, so cat would read out the entire file, which can run to pages, whereas less handles that fine.

I need access to the production environment to get you sample input, which I don't have now as I'm on vacation. But I'll try to reconstruct it myself. The contents of $I would be:

$I, type 1:
19 Date:
20 From:abcd@gmail.com
21 Subject: line1
22 line 2
23 line 3
. . .
28 line x
29 To:abcd@yahoo.com
30 In-Reply-To:

$I, type 2:
18 To: abcd@yahoo.com
19 Message-ID:
20 Subject: line1
21 line 2
22 line 3
. . .
35 line x
36 From: abcd@gmail.com
37 Date

$I, type 3:
37 Content-Type:
38 Subject: line1
39 line 2
40 line 3
. . .
41 line x
42 Mime-Version:
43 Message-ID:

In each of the above cases I want the Subject alone (line 1 ... line x) to be read, so that I can pipe it to wc -m to get the number of characters.

Thank you once again for your valuable suggestions.
Don't worry about being a beginner. We all have to start somewhere. This looks like a good project to learn from.
As I asked in 1), please use code tags around your script and text.

After reading the OP more carefully, I think I have an easier solution for you using sed, but first let's use your existing script as a learning exercise.

To begin with, I'm afraid I don't see your point about less. When used with a pipe, less and cat do pretty much the same thing. It's the -m option in grep that limits the output either way.

In any case, the first thing you should look at is reducing the number of calls to external tools like grep. Assuming that there's only one email (and Subject, To, From, etc.) per file, how about we just grab all the important lines at once, and process them later inside the shell?

Code:
searchlist='Subject|From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

Code:
for line in "${array[@]}" ; do

Now you have start and end variables with the two matching line numbers. We just need to shift the endpoint by one and extract the final text for counting.

Code:
(( end-- ))

I hope you also realize that this counts the "Subject: " header part too. If that matters, you can adjust the sed command to remove it.

Code:
count="$(sed -n "$start,$end { s/Subject:[ ]*// ; p }" "$I" )"

And see here for more on doing string manipulations in bash: parameter expansion, string manipulation.

Now, as I mentioned earlier, there's a better way. sed can extract the block you want directly, if you use the proper address forms.

Code:
#Don't include "Subject" in the list.

See? No need to extract line numbers.

Here are a few useful sed references.
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt
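To make the two approaches described in this post concrete, here is a hedged sketch of both, assuming GNU sed and bash 4 or later (for mapfile). The function names, the sample message and the colon-anchored patterns are assumptions for illustration, and -E is used as the portable spelling of -r. This is not the code originally posted:

```shell
#!/bin/bash
searchlist='Subject|From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

# Approach 1: grab all header line numbers with one grep, then cut by number.
subject_by_lines() {
    local file=$1 line start="" end=""
    local -a array
    mapfile -t array < <(grep -nE "^($searchlist):" "$file")
    for line in "${array[@]}" ; do          # each entry looks like "21:Subject: line1"
        if [[ -z $start ]] ; then
            if [[ $line =~ ^[0-9]+:Subject: ]] ; then start=${line%%:*} ; fi
        else
            end=${line%%:*}                 # first header after the Subject line
            break
        fi
    done
    [[ -n $start ]] || return 1             # no Subject header found
    # Shift the endpoint back by one; fall back to end of file if no header follows.
    if [[ -n $end ]] ; then (( end-- )) ; else end='$' ; fi
    sed -n "${start},${end} { s/^Subject:[ ]*// ; p ; }" "$file"
}

# Approach 2: let sed find the block with a regex range, no line numbers needed.
subject_by_range() {
    local file=$1
    local others=${searchlist#Subject|}     # same list without "Subject"
    sed -En "/^Subject:/,/^($others):/ { /^($others):/d ; s/^Subject:[ ]*// ; p ; }" "$file"
}

# Hypothetical message modelled on the "type 1" sample.
msg=$(mktemp)
printf '%s\n' 'Date: Mon' 'From:abcd@gmail.com' 'Subject: line1' ' line 2' ' line 3' \
    'To:abcd@yahoo.com' 'In-Reply-To: x' > "$msg"

by_lines=$(subject_by_lines "$msg" | wc -m)
by_range=$(subject_by_range "$msg" | wc -m)
echo "$by_lines $by_range"

rm -f "$msg"
```

Both functions print the subject text with the "Subject: " prefix removed, so the two character counts should agree for a file that holds a single message.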
Well, you've inspired me a lot. In my case a file may contain more than one email when it is an email thread. However, I'll try to handle that myself and get back to you if I need help. I'll also keep you posted on how I progress and how successful my first project turns out. You've been of great help to me. Thanks again.
Good luck then. But you'll probably only need to add a q command to sed, to tell it to quit once the first match is complete. Note the double quotes, which let the shell expand $searchlist.
Code:
sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/q ; s/Subject:[ ]*// ; p }" "$I"
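A minimal sketch of stopping after the first Subject block in a file that holds a thread of several emails, assuming GNU sed (with -E as the portable spelling of -r). In this sketch q sits on the terminating header line, so every line of the first subject is printed before sed quits. The sample thread and the example.com addresses are made up for illustration:

```shell
#!/bin/bash
# Header list without "Subject", as suggested earlier in the thread.
others='From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

# Hypothetical thread file containing two messages.
thread=$(mktemp)
printf '%s\n' \
    'From: a@example.com' 'Subject: first' ' subject' 'To: b@example.com' \
    'body' \
    'From: b@example.com' 'Subject: second subject' 'Date: today' > "$thread"

# Print the first Subject block only; quit as soon as the next header is reached.
first=$(sed -En "/^Subject:/,/^($others):/ { /^($others):/q ; s/^Subject:[ ]*// ; p ; }" "$thread")
printf '%s\n' "$first"

rm -f "$thread"
```

The sed script is in double quotes so that the shell expands $others before sed runs.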