LinuxQuestions.org

lakshminarayanan 11-06-2011 11:39 AM

Script to get the word count of a paragraph from a long message
 
Hi all,

I am trying to develop a script that would enable me to count the number of words under a particular title from a long message.
My exact requirement is

First, number the lines in the entire file (which will have fields like Subject, From, To, Date and Message-ID in any order, depending on the mail client the user sends the mail with).
Then get the line numbers of the Subject field and of the next field (which may be From, To or Date), print the lines between them, and run wc -m on the result.

Note:- Simply doing a cat and then grep for Subject doesn't always work in my case. In some cases when there is a line break cat |grep reads only the 1st line.

What I have now is

I am able to number the lines, get the line number of the Subject field, and then find the lowest value among the other fields (but higher than the Subject line value), and I want to pass these values to "sed" to print the lines. However, I am unable to pass the values as variables to sed in a script.
Can any of you help me overcome this, or suggest a simpler but still working logic to achieve this?

Many thanks in advance

TB0ne 11-06-2011 08:39 PM

Sure...post what you've written, so we can see what's up, and tell us what exact error(s) you're getting, and post some sample input. Without details, we can't help.

lakshminarayanan 11-07-2011 11:16 AM

Hi TB0ne,

Many thanks for your response. I am posting what I've written to carry out the task that I stated. This is actually an extract from a much longer script that I wrote. I've added illustrations in an attempt to give you an idea of what I'm trying to do. Hope it makes sense.

for I in `cat -s /tmp/24hr_queue.txt`
do
echo $I
#I am numbering each line and then assigning the line number value of field of interest to variables
a=`less $I | grep -n -m1 ^"Subject: " |cut -d ":" -f1`
b=`less $I | grep -n -m1 ^"From: " |cut -d ":" -f1`
c=`less $I | grep -n -m1 ^"To: " |cut -d ":" -f1`
d=`less $I | grep -n -m1 ^"Message-ID: " |cut -d ":" -f1`
e=`less $I | grep -n -m1 ^"Date: " |cut -d ":" -f1`
f=`less $I | grep -n -m1 ^"Mime-Version: " |cut -d ":" -f1`
g=`less $I | grep -n -m1 ^"Content-Type: " |cut -d ":" -f1`
h=`less $I | grep -n -m1 ^"In-Reply-To: " |cut -d ":" -f1`
echo $a,$b,$c,$d,$e,$f,$g,$h
#here I am trying to find the value immediatley higher than "$a"
if [ $b -gt $a ]
then echo $b > /tmp/array.txt
fi
if [ $c -gt $a ]
then echo $c >> /tmp/array.txt
fi
if [ $d -gt $a ]
then echo $d >> /tmp/array.txt
fi
if [ $e -gt $a ]
then echo $e >> /tmp/array.txt
fi
if [ $f -gt $a ]
then echo $f >> /tmp/array.txt
fi
if [ $g -gt $a ]
then echo $g >> /tmp/array.txt
fi
if [ $h -gt $a ]
then echo $h >> /tmp/array.txt
fi
j= cat /tmp/array.txt | xargs -n1 | sort | tail -1
echo $a,$j
j=$j-l
# Here I am assigning the value immediatley higher to $a to j
k=less $I | sed '$a,$j!d' | wc -m (this is the part that is not working for me)
if k -gt 300
then echo "$I : Subject too long, will timeout"
fi
done

David the H. 11-07-2011 03:52 PM

1) Please use [code][/code] tags around your code, to preserve formatting and to improve readability. Edit your last post to correct this, if you don't mind.

2) Post some sample input, as TB0ne asked.

3) Is the script to be for bash, posix-compliant, or something else?

4) http://mywiki.wooledge.org/DontReadLinesWithFor

5) $(..) is highly recommended over `..`

6) Always quote your variables, particularly inside the [..] test command.

http://mywiki.wooledge.org/Arguments
http://mywiki.wooledge.org/WordSplitting
http://mywiki.wooledge.org/Quotes

7) Useless use of less? I can't tell for certain from the above what "$I" is supposed to hold...a filename? If so, then just pass it as an argument directly to grep or sed, instead of using a pipe.

There are likely better ways to do the job anyway.

8) At the very least, all of your individual if tests can be compacted into a single loop. (Also, xargs??)

And again, there are often better ways to do such things:

Code:

$ arr1=( 5 7 3 9 4 2 10 )
$ a=5
$ for i in "${arr1[@]}"; do (( i > a )) && arr2[i]="$i" ; done

$ echo "${arr2[@]:1:1}"  #prints the lowest existing array entry
7

( The above assuming bash or another shell with support for arrays, of course ).

Edit:
9) Your cat | grep problem sounds like it may be due to dos vs. unix line endings.
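
Points 5 to 7 and 9 can be sketched together. This is only an illustration, not a drop-in replacement: the function names header_line and has_dos_endings and the sample message are made up here, and it assumes a grep that supports -m (GNU or BSD).

```shell
# Hypothetical helper: print the line number of the first line in a file
# that starts with the given header name (e.g. "Subject").
header_line() {
    local header=$1 file=$2
    # The file is passed straight to grep (no less or cat), and every
    # variable is quoted.
    grep -n -m1 "^${header}:" "$file" | cut -d: -f1
}

# Hypothetical helper: succeed if the file contains any carriage return,
# which is the usual sign of DOS (CRLF) line endings.
has_dos_endings() {
    local cr
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Made-up sample message with DOS line endings.
sample=$(mktemp)
printf 'Date: today\r\nFrom: a@example.com\r\nSubject: hello\r\nTo: b@example.com\r\n' > "$sample"

a=$(header_line "Subject" "$sample")    # $(..) instead of backticks
echo "Subject is on line $a"

if has_dos_endings "$sample"; then
    echo "file has DOS line endings"
fi
rm -f "$sample"
```

If the check reports DOS line endings, running the file through tr -d '\r' first is one common way to normalize it.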

lakshminarayanan 11-08-2011 08:35 PM

David,
Thanks for your response and suggestions. This is the first time I am writing such a long script. The script is to be for bash.
I know the amount of scripting knowledge I have is really insufficient to carry out this task, but I thought I would try it and sharpen my scripting skills.
I am part of the team maintaining some 400 Linux servers on which 12 million mailboxes are hosted. Every day we need to find out how many messages are stuck in the MTA queues (both inward and outward). Doing this manually takes more than 6 hours, and I am trying to make this script do it.
The common reason would be a subject exceeding 300 characters.
About the use of less: $i is a file whose contents are the original email along with the message headers, so plain cat would read out the entire file, which could run to pages, whereas less handles that fine.

I need access to the production environment to get you sample input, which I don't have now as I'm on vacation.
But I'll try to reconstruct it myself. The contents of $i would be

$i - type 1
19 Date:
20 From:abcd@gmail.com
21 Subject: line1
22 line 2
23 line 3
.
.
.
28 line x
29 To:abcd@yahoo.com
30 In-Reply-To:

$i - type 2
18 To: abcd@yahoo.com
19 Message-ID:
20 Subject: line1
21 line 2
22 line 3
.
.
.
35 line x
36 From: abcd@gmail.com
37 Date:

$i - type 3
37 Content-Type:
38 Subject: line1
39 line 2
40 line 3
.
.
.
41 line x
42 Mime-Version:
43 Message-ID:


In any of the above cases I want the Subject (line 1 ... line x) alone to be read, so that I can pipe it to wc -m to get the number of characters. Thank you once again for your valuable suggestions.

David the H. 11-09-2011 11:21 AM

Don't worry about being a beginner. We all have to start somewhere. This looks like a good project to learn from.

As I asked in 1), please use code tags around your script and text.


After reading the OP more carefully, I think I have an easier solution for you using sed, but first let's use your existing script as a learning exercise.

To begin with, I'm afraid I don't see your point about less. When used with a pipe, less and cat do pretty much the same thing. It's the -m option in grep that limits the output either way.

In any case, the first thing you should look at is reducing the number of calls to external tools like grep. Assuming that there's only one email (and Subject, To, From, etc.) per file, how about we just grab all the important lines at once, and process them later inside the shell?

Code:

searchlist='Subject|From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

IFS=$'\n'        #forces wordbreaking on newlines only, necessary for setting the array
array=( $( grep -En "^($searchlist)" "$I" ) )

# or more succinctly, if using bash 4+.  Doesn't require changing IFS.
mapfile -t array < <( grep -En "^($searchlist)" "$I" )
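
To see what ends up in the array, here is a small sketch run against a made-up message (the temporary file and its contents are assumptions for illustration; the real files come from your queue). It needs bash 4+ for mapfile.

```shell
#!/bin/bash
# Made-up sample message with a folded (multi-line) Subject.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Date: Mon, 7 Nov 2011 10:00:00 +0530
From: abcd@gmail.com
Subject: line1
 line 2
 line 3
To: abcd@yahoo.com
In-Reply-To: <id@example.com>

Body text here.
EOF

searchlist='Subject|From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'
mapfile -t array < <( grep -En "^($searchlist)" "$sample" )

# One array element per matching header line, prefixed with its line number.
printf '%s\n' "${array[@]}"
rm -f "$sample"
```

With this sample the array has five elements, and the third one is "3:Subject: line1". The continuation lines of the Subject are not in the array, which is why the line numbers are needed afterwards.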

Now we have an array holding all the lines you want to search, prepended by their line number. Next we just have to find the entry containing the "Subject" and the one following it.

Code:

x=0        #reset the flag, in case this runs once per file inside a bigger loop
for line in "${array[@]}" ; do

    if (( x == 1 )) ; then
          end="${line%%:*}"
          break
    fi

    if [[ $line == *Subject:* ]] ; then
          start="${line%%:*}"
          x=1
    fi

done

The first if statement is ignored until Subject is found. The Subject line sets the start value, as well as the variable "x". Then on the next iteration, the first if sets the end value and breaks the loop. ${line%%:*} strips off everything from the first colon onward, leaving only the line number, so it replaces cut.
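
A minimal sketch of that expansion on one made-up array entry (the variable names num and rest are only for illustration):

```shell
line='21:Subject: line1'

num=${line%%:*}     # drop the longest match of ":*" from the end, leaving the line number
rest=${line#*:}     # drop the shortest match of "*:" from the start, leaving the header text

echo "$num"
echo "$rest"
```

This prints 21 and then "Subject: line1". The doubled %% matters here: a single % would only remove the shortest match from the end and leave "21:Subject".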

Now you have start and end variables with the two matching line numbers. We just need to shift the endpoint by one and extract the final text for counting.

Code:

(( end-- ))
count="$(sed -n "$start,$end p" "$I" )"

echo "${#count}"

We can even dispense with wc, as the shell can count the output, though there may be a minor difference in number as trailing newlines may be removed by the shell.
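
A quick sketch of that difference, using a made-up string that ends in two newlines (the variable names are only for illustration):

```shell
# Command substitution drops all trailing newlines before the shell stores the text.
text=$(printf 'Subject: line1\n line 2\n\n')
shell_count=${#text}

# wc -m sees the raw stream, trailing newlines included.
wc_count=$(printf 'Subject: line1\n line 2\n\n' | wc -m)

echo "shell: $shell_count"
echo "wc -m: $wc_count"
```

Here the shell reports 22 characters and wc -m reports 24, so with a 300-character limit the two methods can disagree by a character or two right at the threshold.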

I hope also you realize that this counts the "Subject: " header part too. If it's important you can adjust the sed command to remove it.

Code:

count="$(sed -n "$start,$end { s/Subject:[ ]*// ; p }" "$I" )"

You can figure the rest out yourself, I'm sure.
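
One note on the original problem of passing variables to sed: inside single quotes the shell expands nothing, so sed receives the variable names as literal text. Inside double quotes the shell substitutes the numbers first. A minimal sketch, with a made-up four-line file:

```shell
sample=$(mktemp)
printf 'one\ntwo\nthree\nfour\n' > "$sample"
start=2
end=3

# Double quotes: the shell turns the script into "2,3p" before sed sees it.
picked=$(sed -n "${start},${end}p" "$sample")
echo "$picked"

# With single quotes sed would get the literal text ${start},${end}p,
# and to sed a leading $ means "the last line", not a shell variable.
rm -f "$sample"
```

This prints the lines "two" and "three".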

And see here for more on doing string manipulations in bash:

parameter expansion
string manipulation


Now, as I mentioned earlier, there's a better way. sed can extract the block you want directly, if you use the proper address forms.

Code:

#Don't include "Subject" in the list.
searchlist="From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To"

#double quotes, so that the shell expands $searchlist before sed sees it
count=$( sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/d ; p }" "$I" )

#or with the "Subject: " prefix stripped
count=$( sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/d ; s/^Subject:[ ]*// ; p }" "$I" )

echo "${#count}"

It matches from the "Subject:" line to the next line that starts with one of the headers in "$searchlist". Then the braced block deletes that closing header line before printing the rest.

See? No need to extract line numbers.
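
Here is a self-contained sketch of the range approach that you can run without touching the production files. The sample message is made up, the patterns are anchored to the start of the line, and it assumes GNU sed (for -r).

```shell
# Made-up sample message with a folded Subject.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Message-ID: <1@example.com>
Subject: line1
 line 2
 line 3
From: abcd@gmail.com
Date: Mon, 7 Nov 2011 10:00:00 +0530
EOF

searchlist='From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

# Range from the Subject line to the next known header; that closing header
# is deleted, the "Subject: " prefix is stripped, and the rest is printed.
subject=$( sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/d ; s/^Subject:[ ]*// ; p }" "$sample" )

echo "$subject"
echo "${#subject}"
rm -f "$sample"
```

For this sample it prints the three subject lines followed by 21, the character count of the folded subject without its "Subject: " prefix.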

Here are a few useful sed references.
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt

lakshminarayanan 11-10-2011 12:38 AM

Well, you've inspired me a lot. In my case a file may contain more than one email when it is an email thread. However, I'll try to handle that myself and get back to you if I need help. I'll also keep you posted on how I progress and how my first project turns out. You've been a great help to me. Thanks again.

David the H. 11-10-2011 01:07 AM

Good luck then. But you'll probably only need a q command in sed in place of the d, to tell it to quit once it reaches the header that ends the first match.

Code:

sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/q ; s/^Subject:[ ]*// ; p }" "$I"
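
A sketch of stopping after the first Subject block in a file that holds a thread. The sample file is made up, and it assumes GNU sed. Here q sits on the closing header line, so every line of a folded subject is printed before sed exits; because of -n, the closing header itself is not printed.

```shell
# Made-up file holding two messages, the first with a folded Subject.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Subject: first subject
 continued
From: a@example.com

Body of first mail.
Subject: second subject
To: b@example.com
EOF

searchlist='From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

# Quit at the header that closes the first Subject block; later messages
# in the same file are never read.
first=$( sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/q ; s/^Subject:[ ]*// ; p }" "$sample" )

echo "$first"
rm -f "$sample"
```

This prints "first subject" and " continued" only; the second message's subject is never reached.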

