LinuxQuestions.org
Old 11-06-2011, 12:39 PM   #1
lakshminarayanan
LQ Newbie
 
Registered: Jan 2010
Posts: 9

Rep: Reputation: 2
Script to get the word count of a paragraph from a long message


Hi all,

I am trying to develop a script that would enable me to count the number of words under a particular title from a long message.
My exact requirement is

Number the lines in the entire file (which will have fields like Subject, From, To, Date and Message-ID, in any order, depending on the mail client used to send the mail).
Then get the line numbers of the Subject field and of the next field (which may be From, To or Date), print the lines between them, and run wc -m on the result.

Note:- Simply doing a cat and then grep for Subject doesn't always work in my case. In some cases when there is a line break cat |grep reads only the 1st line.

What I have now is

I am able to number the lines, get the line number of the Subject field, and then find the lowest value among the other fields that is still higher than the Subject line's value, so that I can pass these values to "sed" to print the lines. However, I am unable to pass the values as variables to sed in a script.
Can any of you help me overcome this, or suggest a simpler logic that still works?

Many thanks in advance
 
Old 11-06-2011, 09:39 PM   #2
TB0ne
LQ Guru
 
Registered: Jul 2003
Location: Birmingham, Alabama
Distribution: SuSE, RedHat, Slack,CentOS
Posts: 17,933

Rep: Reputation: 3692
Sure...post what you've written, so we can see what's up, and tell us what exact error(s) you're getting, and post some sample input. Without details, we can't help.
 
Old 11-07-2011, 12:16 PM   #3
lakshminarayanan
LQ Newbie
 
Registered: Jan 2010
Posts: 9

Original Poster
Rep: Reputation: 2
Hi TB0ne,

Many thanks for your response. I am posting what I've written to carry out the task that I stated. This is actually an extract from a much longer script that I wrote. I've added illustrations in an attempt to give you an idea of what I'm trying to do. Hope it makes sense.

for I in `cat -s /tmp/24hr_queue.txt`
do
echo $I
#I am numbering each line and then assigning the line number value of field of interest to variables
a=`less $I | grep -n -m1 ^"Subject: " |cut -d ":" -f1`
b=`less $I | grep -n -m1 ^"From: " |cut -d ":" -f1`
c=`less $I | grep -n -m1 ^"To: " |cut -d ":" -f1`
d=`less $I | grep -n -m1 ^"Message-ID: " |cut -d ":" -f1`
e=`less $I | grep -n -m1 ^"Date: " |cut -d ":" -f1`
f=`less $I | grep -n -m1 ^"Mime-Version: " |cut -d ":" -f1`
g=`less $I | grep -n -m1 ^"Content-Type: " |cut -d ":" -f1`
h=`less $I | grep -n -m1 ^"In-Reply-To: " |cut -d ":" -f1`
echo $a,$b,$c,$d,$e,$f,$g,$h
#here I am trying to find the value immediatley higher than "$a"
if [ $b -gt $a ]
then echo $b > /tmp/array.txt
fi
if [ $c -gt $a ]
then echo $c >> /tmp/array.txt
fi
if [ $d -gt $a ]
then echo $d >> /tmp/array.txt
fi
if [ $e -gt $a ]
then echo $e >> /tmp/array.txt
fi
if [ $f -gt $a ]
then echo $f >> /tmp/array.txt
fi
if [ $g -gt $a ]
then echo $g >> /tmp/array.txt
fi
if [ $h -gt $a ]
then echo $h >> /tmp/array.txt
fi
j= cat /tmp/array.txt | xargs -n1 | sort | tail -1
echo $a,$j
j=$j-l
# Here I am assigning the value immediatley higher to $a to j
k=less $I | sed '$a,$j!d' | wc -m (this is the part that is not working for me)
if k -gt 300
then echo "$I : Subject too long, will timeout"
fi
done
 
Old 11-07-2011, 04:52 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
1) Please use [code][/code] tags around your code, to preserve formatting and to improve readability. Edit your last post to correct this, if you don't mind.

2) Post some sample input, as TB0ne asked.

3) Is the script to be for bash, posix-compliant, or something else?

4) http://mywiki.wooledge.org/DontReadLinesWithFor

5) $(..) is highly recommended over `..`

6) Always quote your variables, particularly inside the [..] test command.

http://mywiki.wooledge.org/Arguments
http://mywiki.wooledge.org/WordSplitting
http://mywiki.wooledge.org/Quotes

7) Useless use of less? I can't tell for certain from the above what "$I" is supposed to hold...a filename? If so, then just pass it as an argument directly to grep or sed, instead of using a pipe.

There are likely better ways to do the job anyway.

8) At the very least, all of your individual if tests can be compacted into a single loop. (Also, xargs?? )

And again, there are often better ways to do such things:

Code:
$ arr1=( 5 7 3 9 4 2 10 )
$ a=5
$ for i in "${arr1[@]}"; do (( i > a )) && arr2[i]="$i" ; done

$ echo "${arr2[@]:1:1}"  #prints the lowest existing array entry
7
( The above assuming bash or another shell with support for arrays, of course ).
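As a rough sketch of point 8 (assuming bash, and assuming the line numbers are already stored the way the original script stores them in `a` to `h`; the sample values below are made up), the chain of if tests could collapse into one loop that tracks the smallest number above `a`. Note that `sort | tail -1` in the original compares the numbers as text and picks the largest value rather than the nearest one.

```shell
#!/bin/bash
# Hypothetical line numbers: a is the Subject line, the rest are other headers.
a=21
others=( 20 29 30 19 "" 45 )   # an empty entry stands for a header grep did not find

next=""
for n in "${others[@]}"; do
    [[ -n $n ]] || continue                 # skip headers that were not found
    if (( n > a )) && { [[ -z $next ]] || (( n < next )); }; then
        next=$n                             # smallest value seen so far that is above a
    fi
done

echo "$a,$next"
```

With these sample values, `next` ends up as 29, the header line closest after the Subject line.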

Edit:
9) Your cat | grep problem sounds like it may be due to dos vs. unix line endings.
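
One way to check for that, as a sketch only (the demo file is created here just for illustration): count the lines that end in a carriage return, then strip the carriage returns with tr before grepping.

```shell
#!/bin/bash
# Build a small demo file with DOS (CRLF) line endings.
demo=$(mktemp)
printf 'Subject: line1\r\n line 2\r\nTo: abcd@yahoo.com\r\n' > "$demo"

# Count the lines that end in a carriage return.
crlf_lines=$(grep -c $'\r$' "$demo")

# Strip the carriage returns, then search as usual.
subject=$(tr -d '\r' < "$demo" | grep -m1 '^Subject:')

echo "$crlf_lines lines end in CR; first match: $subject"
rm -f "$demo"
```

A count of zero would suggest the line endings are not the problem.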

Last edited by David the H.; 11-07-2011 at 04:58 PM. Reason: stated
 
Old 11-08-2011, 09:35 PM   #5
lakshminarayanan
LQ Newbie
 
Registered: Jan 2010
Posts: 9

Original Poster
Rep: Reputation: 2
David,
Thanks for your response and suggestions. This is the first time I am writing such a long script. The script is for bash.
I know my scripting knowledge is really insufficient for this task, but I thought I would try and sharpen my scripting skills.
I am part of the team maintaining some 400 Linux servers on which 12 million mailboxes are hosted. Every day we need to find out how many messages are stuck in the MTA queues (both inward and outward). Doing this manually takes more than 6 hours, and I am trying to make this script do it.
The common reason would be a subject exceeding 300 characters.
About the use of less: $i is a file whose contents are the original email along with the message headers, so plain cat would read the entire file, which could run to pages, but less handles that fine.

I need access to the production environment to get you sample input, which I don't have now as I'm on vacation.
But I'll try to give it on my own. The contents of $i would be

$i- type1
19 Date:
20 From:abcd@gmail.com
21 Subject: line1
22 line 2
23 line 3
.
.
.
28 line x
29 To:abcd@yahoo.com
30 In-Reply-To:

$i type2
18 To: abcd@yahoo.com
19 Message-ID:
20 Subject: line1
21 line 2
22 line 3
.
.
.
35 line x
36 From: abcd@gmail.com
37 Date

$i type 3
37 Content-Type:
38 Subject: line1
39 line 2
40 line 3
.
.
.

41 line x
42 Mime-Version:
43 Message-ID:


In each of the above cases I want the Subject (line 1 ... line x) alone to be read, so that I can pipe it to wc -m to get the number of characters. Thank you once again for your valuable suggestions.

Last edited by lakshminarayanan; 11-08-2011 at 09:37 PM.
 
Old 11-09-2011, 12:21 PM   #6
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
Don't worry about being a beginner. We all have to start somewhere. This looks like a good project to learn from.

As I asked in 1), please use code tags around your script and text.


After reading the OP more carefully, I think I have an easier solution for you using sed, but first let's use your existing script as a learning exercise.

To begin with, I'm afraid I don't see your point about less. When used with a pipe, less and cat do pretty much the same thing. It's the -m option in grep that limits the output either way.

In any case, the first thing you should look at is reducing the number of calls to external tools like grep. Assuming that there's only one email (and Subject, To, From, etc.) per file, how about we just grab all the important lines at once, and process them later inside the shell?

Code:
searchlist='Subject|From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

IFS=$'\n'	#forces wordbreaking on newlines only, necessary for setting the array
array=( $( grep -En "^($searchlist)" "$I" ) )

# or more succinctly, if using bash 4+.  Doesn't require changing IFS.
mapfile -t array < <( grep -En "^($searchlist)" "$I" )
Now we have an array holding all the lines you want to search, prepended by their line number. Next we just have to find the entry containing the "Subject" and the one following it.

Code:
x=0	#reset the flag, in case this runs once per file inside an outer loop
for line in "${array[@]}" ; do

     if (( x == 1 )) ; then
          end="${line%%:*}"
          break
     fi

     if [[ $line == *Subject:* ]] ; then
          start="${line%%:*}"
          x=1
     fi

done
The first if statement is ignored until Subject is found. The line that matches Subject sets the start value, as well as the variable "x". Then on the next iteration, the first if statement sets the end value and breaks the loop. ${line%%:*} strips off everything from the first colon onward, leaving only the line number, which replaces cut.
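
To illustrate the expansion on its own, here is a tiny sketch using a made-up line in the form grep -n prints (line number, colon, matched text):

```shell
#!/bin/bash
# A sample line as grep -n would print it; the content is invented for the demo.
line='21:Subject: Re: status'

num="${line%%:*}"    # remove the longest suffix that starts with a colon -> line number
rest="${line#*:}"    # remove the shortest prefix that ends with a colon  -> original text

echo "$num"
echo "$rest"
```

The double %% matters here: with a single %, only the part after the last colon would be removed.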

Now you have start and end variables with the two matching line numbers. We just need to shift the endpoint by one and extract the final text for counting.

Code:
(( end-- ))
count="$(sed -n "$start,$end p" "$I" )"

echo "${#count}"
We can even dispense with wc, as the shell can count the output, though there may be a minor difference in number as trailing newlines may be removed by the shell.
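
A quick sketch of that difference, using a two-line sample string rather than a real message: command substitution drops the final newline, while wc -m counts it.

```shell
#!/bin/bash
# $(...) strips trailing newlines before the text is stored.
text=$(printf 'Subject: line1\n line 2\n')
shell_count=${#text}

# wc -m sees the final newline as well; the arithmetic expansion
# just strips any padding some wc versions add around the number.
wc_count=$(( $(printf 'Subject: line1\n line 2\n' | wc -m) ))

echo "$shell_count $wc_count"
```

For this sample the shell reports 22 characters and wc reports 23, so a limit like 300 is off by at most the trailing newline.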

I hope also you realize that this counts the "Subject: " header part too. If it's important you can adjust the sed command to remove it.

Code:
count="$(sed -n "$start,$end { s/Subject:[ ]*// ; p }" "$I" )"
You can figure the rest out yourself, I'm sure.
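
For reference, one possible way the pieces might fit together, as a sketch under assumptions rather than a finished script: the function name subject_chars and the demo files are invented, headers are assumed to start at the beginning of a line and be followed by a colon, and the queue file is assumed to hold one path per line (as /tmp/24hr_queue.txt appears to).

```shell
#!/bin/bash
# Sketch only: the header list and the 300-character limit come from the thread;
# subject_chars and the demo files are made-up names for illustration.
searchlist='Subject|From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

subject_chars() {
    local file=$1 line start="" end="" text
    while IFS= read -r line; do
        if [[ -n $start ]]; then                 # first header after Subject
            end=${line%%:*}
            break
        fi
        if [[ $line =~ ^[0-9]+:Subject: ]]; then
            start=${line%%:*}
        fi
    done < <(grep -En "^($searchlist):" "$file")

    [[ -n $start ]] || { echo 0; return; }       # no Subject header at all
    [[ -n $end ]] || end=$(( start + 1 ))        # no later header: count one line
    text=$(sed -n "${start},$(( end - 1 ))p" "$file")
    echo "${#text}"
}

# Demo: one message file, listed in a queue file with one path per line.
msg=$(mktemp); queue=$(mktemp)
printf '%s\n' 'Date: x' 'From:abcd@gmail.com' 'Subject: line1' ' line 2' \
              'To:abcd@yahoo.com' '' 'body' > "$msg"
printf '%s\n' "$msg" > "$queue"

while IFS= read -r file; do                      # read paths line by line, not with for
    n=$(subject_chars "$file")
    echo "$file: $n characters"
    if (( n > 300 )); then
        echo "$file: Subject too long, will timeout"
    fi
done < "$queue"
rm -f "$msg" "$queue"
```

Like the snippets above, this counts the "Subject: " prefix as part of the length.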

And see here for more on doing string manipulations in bash:

parameter expansion
string manipulation


Now, as I mentioned earlier, there's a better way. sed can extract the block you want directly, if you use the proper address forms.

Code:
#Don't include "Subject" in the list.
searchlist="From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To"

count=$( sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/d ; p}" "$I" )

#or without "Subject"
count=$( sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/d ; s/Subject:[ ]*// ; p}" "$I" )

echo "${#count}"
It matches from the "Subject" line to the next line that begins with something in "$searchlist". Then the sub-bracket removes the "$searchlist" line before printing. The sed script has to be in double quotes, otherwise the shell won't expand $searchlist.
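
Here is a self-contained sketch of the range approach on a made-up message, assuming GNU sed. Anchoring the patterns with ^ and a trailing colon is an added assumption, so that a word like "Today" inside the subject cannot end the range early. The double quotes around the sed script are what let the shell substitute $searchlist; a single-quoted script such as '$a,$j!d' is passed to sed literally, which is why the variables in the original attempt never reached it.

```shell
#!/bin/bash
# Header names that may follow Subject (list taken from the thread).
searchlist='From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

# Demo message; the file is created here only for illustration.
msg=$(mktemp)
printf '%s\n' 'Message-ID: <1@example>' 'Subject: line1' ' line 2' ' line 3' \
              'To: abcd@yahoo.com' '' 'body text' > "$msg"

# Double quotes let the shell expand $searchlist before sed sees the script.
subject=$(sed -En "/^Subject:/,/^($searchlist):/ { /^($searchlist):/d ; s/^Subject:[ ]*// ; p ; }" "$msg")

echo "${#subject}"
rm -f "$msg"
```

For this sample the three subject lines add up to 21 characters, with the "Subject: " prefix removed.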

See? No need to extract line numbers.

Here are a few useful sed references.
http://www.grymoire.com/Unix/Sed.html
http://sed.sourceforge.net/grabbag/
http://sed.sourceforge.net/sedfaq.html
http://sed.sourceforge.net/sed1line.txt
 
Old 11-10-2011, 01:38 AM   #7
lakshminarayanan
LQ Newbie
 
Registered: Jan 2010
Posts: 9

Original Poster
Rep: Reputation: 2
Well, you've inspired me a lot. In my case a file may contain more than one email when it is an email thread. However, I'll try to handle that myself and get back to you if I need help. I'll also keep you posted on my progress and on how successful my first project turns out. You've been of great help to me. Thanks again.

Last edited by lakshminarayanan; 11-10-2011 at 01:46 AM.
 
Old 11-10-2011, 02:07 AM   #8
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Debian sid + kde 3.5 & 4.4
Posts: 6,823

Rep: Reputation: 1957
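Good luck then. But you'll probably only need sed's q command, to tell it to quit after the first match. A q placed right after the p would quit as soon as the first line of the subject has been printed, so put it on the closing header line instead, in place of the d. The script is in double quotes so that the shell expands $searchlist.

Code:
sed -rn "/^Subject:/,/^($searchlist)/ { /^($searchlist)/q ; s/Subject:[ ]*// ; p }" "$I"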
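
A small sketch of stopping after the first Subject block, on a made-up file holding two messages (GNU sed assumed). Here q is attached to the closing header line rather than placed after p, on the assumption that the whole multi-line subject should be printed before sed exits.

```shell
#!/bin/bash
searchlist='From|To|Message-ID|Date|Mime-Version|Content-Type|In-Reply-To'

# Demo thread with two Subject headers; created only for illustration.
thread=$(mktemp)
printf '%s\n' 'Subject: first' ' subject' 'From: a@example.com' '' \
              'quoted text' 'Subject: second subject' 'To: b@example.com' > "$thread"

# Quit at the header that closes the first Subject block, so the second
# Subject further down is never reached.
first=$(sed -En "/^Subject:/,/^($searchlist):/ { /^($searchlist):/q ; s/^Subject:[ ]*// ; p ; }" "$thread")

printf '%s\n' "$first"
rm -f "$thread"
```

Only the two lines of the first subject are printed; the second Subject header is ignored.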
 