LinuxQuestions.org
Old 06-08-2004, 01:37 PM   #1
fnd
LQ Newbie
 
Registered: Jun 2004
Posts: 5

Rep: Reputation: 0
parse HTML file and find keywords ?


hi, I'm trying to implement a script for our IT dept. to retrieve the status of the main servers in different departments and notify people (email/pager, etc.) if there's trouble.

I use wget to retrieve the file since it's an HTML file off of a web server. The parsing needs to be done in a Linux Bash shell script. This is where I'm getting puzzled, since it needs to look for a combination of keywords and pass the results on to another subroutine for processing (send email/pager, etc.). I was wondering if anyone has an idea of how to handle these requirements:

The HTML part looks like :

<tr>
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">accounting</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">normal</a></td>
</tr>
<tr>
<td bgcolor="red"><a class="n1" bgcolor="red">sales</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">normal</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">shipping</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">alert</a></td>
</tr>

Each row represents one dept. and the status of the main server. So the table row means :

<tr>
<td bgcolor="red"><a class="n1" bgcolor="red"> {Department} </a></td>
<td bgcolor="red"><a class="n1" bgcolor="red"> {When} </a></td>
<td bgcolor="red"><a class="n1" bgcolor="red"> {Status} </a></td>
</tr>

The keyword I need to look for is the word "alert". Once the script finds it then it needs to select out the {Department} and then email/page people. I know I can use a combo of sed and awk to get the keyword "alert" but how do I then traverse and pull out the {Department} which is 2 lines above the line for the {Status} ?

The problem is that when it finds the word "alert", it needs to "go back up" two lines, skipping the {When} line (we don't care when it happened), to get to the line where the {Department} is. I'm not sure how I should do this.

Also, each <TR> table row's data cells have alternating BGCOLOR values as you can see, so the bgcolor="red" attribute appears in every other row of data. That adds to the complexity of my parsing. Any idea what the right way to do this is?

thanks
 
Old 06-08-2004, 01:52 PM   #2
david_ross
Moderator
 
Registered: Mar 2003
Location: Scotland
Distribution: Slackware, RedHat, Debian
Posts: 12,047

Rep: Reputation: 64
Welcome to LQ.

It would be a lot easier if you can use lynx:
lynx -dump http://www.yourhost.com/stats.html | grep alert

You may also want to consider looking at nagios to monitor your systems:
http://www.nagios.org
 
Old 06-08-2004, 02:59 PM   #3
fnd
LQ Newbie
 
Registered: Jun 2004
Posts: 5

Original Poster
Rep: Reputation: 0
thanks for the advice. Unfortunately I have no control over how I get the input file. The monitoring
is done by a partner so I can only grab whatever output the web server sends out. So I still
need to figure out how to parse from the keyword "alert" back to the {Department}.....
 
Old 06-08-2004, 03:28 PM   #4
david_ross
Moderator
 
Registered: Mar 2003
Location: Scotland
Distribution: Slackware, RedHat, Debian
Posts: 12,047

Rep: Reputation: 64
Is there anything wrong with using lynx like I suggested above?

All you need to do is pipe it into awk if you just want the first field:
lynx -dump http://www.yourhost.com/stats.html | grep alert | awk '{print $1}'
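
A minimal sketch of the same idea wrapped in a function, assuming lynx renders each table row as one whitespace-separated line (department, when, status). The sample text is hand-written to stand in for real `lynx -dump` output, and the function name `alert_departments_from_dump` is hypothetical. It compares only the last column with "alert", so a department whose name happens to contain that word is not reported by mistake.

```shell
#!/bin/sh
# Hypothetical helper: read lynx-style text on stdin and print the
# department of every row whose last column is exactly "alert".
alert_departments_from_dump() {
    awk '$NF == "alert" { print $1 }'
}

# Hand-written stand-in for `lynx -dump` output (an assumption about its layout).
sample_dump='   marketing today alert
   accounting today normal
   sales today normal
   shipping today alert'

printf '%s\n' "$sample_dump" | alert_departments_from_dump
```

Against the real page this would be used as `lynx -dump http://www.yourhost.com/stats.html | alert_departments_from_dump`.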
 
Old 06-08-2004, 04:17 PM   #5
fnd
LQ Newbie
 
Registered: Jun 2004
Posts: 5

Original Poster
Rep: Reputation: 0
actually you misunderstood my question.

Take the "marketing" dept. for example,
I can grep the word "alert" just fine from the line :
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
(Let's call this LINE 3)


However, now that I know that a department is in trouble, how do I
know which department it is ? so now I must retrieve the word
"marketing" from the line :
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
(Let's call this LINE 1)

that's not a problem either. I can use a combination of SED and AWK
to cut out the word "marketing".


However, since this is the sequence of the lines from the HTML file :
LINE 1 : <td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
LINE 2 : <td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
LINE 3 : <td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>

the processing puts the Bash shell script at LINE 3 when it finds the word
"alert", so now there's no way for the script to look back at LINE 1 and
tell me that it's department "marketing" that's in trouble !!
That's what I'm trying to figure out.......

I hope my question is clearer now ?
any idea ?
 
Old 06-08-2004, 07:16 PM   #6
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 367
I'm not familiar with lynx, but I assume what david is suggesting is this:
1. You receive the markup code from wherever
2. You run lynx by pointing it to the local copy of the html you received
3. Lynx processes the document, and spits out text on the console as an interpretation of the file
4. Now that the html is presented as a processed web page, you would use grep and awk on the visual output of lynx to find your info since the table would be spat out on a single row/line of input.

I may be interpreting that incorrectly, but thought I would mention it.

Second, another option would be to use grep. Look at the "-B" option for grep. It will give you X lines of context before the matched text. So you could do something like:
$ grep -B 2 "alert" your_html_file

The output would look something like this:
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
--
<td bgcolor="blue"><a class="n1">shipping</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">alert</a></td>

You could then do any number of other things: use sed/awk on their own to pick out every fourth line (the department lines), or pipe the data into wc to count the lines of output and then use combinations of head and tail to isolate one line at a time, or whatever works.
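
The "every fourth line" idea can be sketched like this. It assumes the layout from the first post (one cell per line, rows separated by `</tr>` and `<tr>` lines), so that grep always prints a `--` separator between groups. The variable names `html_file` and `alerts` are made up for the example, and matching `>alert<` instead of plain `alert` is my own tightening, not something from the thread.

```shell
#!/bin/sh
# Sample rows in the layout shown in the first post, written to a temp file.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<tr>
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">accounting</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">normal</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">shipping</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">alert</a></td>
</tr>
EOF

# grep prints 3 lines per match plus a "--" separator between groups,
# so the department lines are output lines 1, 5, 9, ... (NR % 4 == 1).
# sed then deletes every <...> tag, leaving only the cell text.
alerts=$(grep -B 2 '>alert<' "$html_file" | awk 'NR % 4 == 1' | sed 's/<[^>]*>//g')
printf '%s\n' "$alerts"
rm -f "$html_file"
```

Because it counts lines, this breaks if the partner ever puts a whole row on one line or adds a fourth cell.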

Last edited by Dark_Helmet; 06-08-2004 at 07:18 PM.
 
Old 06-08-2004, 11:07 PM   #7
Dark_Helmet
Senior Member
 
Registered: Jan 2003
Posts: 2,786

Rep: Reputation: 367
Actually, after thinking about it some, you could do this:

grep -B 2 "alert" your_html_file | grep "marketing\|accounting\|sales\|shipping" | cut -f 3 -d ">" | cut -f 1 -d "<"

That would give you exactly what you want. A list of each department that received an alert, one on each line.

The -B option to the first grep gives you two lines of previous context from each alert you have in the file.

The second grep takes that output, and filters out every line that does not include the name of a recognized department. This command will get lengthy if you have many, many different departments. This will reduce the output to the html lines that contain departments that received an alert.

The first cut removes the HTML markup from the start of the line up to the beginning of the department name.

The second cut removes the rest of the markup following the department's name.

---

Of course, you could substitute your own sed/awk stuff instead of the two cuts. Using cut is simple, but it does not easily lend itself to changing formats of line input. For instance, if anyone decided to make the department name bold, change the font size, or whatever, the cut commands would return useless data (the name of the new tag inserted).
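
As one possible sed/awk-style substitute, here is a sketch that tracks table rows instead of counting lines or fields. It assumes lowercase tags and one `<td>` cell per line, as in the sample from the first post. The function name `alert_departments` is hypothetical. Since it strips every tag from a cell, an extra `<b>` around the department name does not change the result.

```shell
#!/bin/sh
# Hypothetical helper: print the first cell of every table row whose
# third cell is "alert", regardless of extra tags inside the cells.
alert_departments() {
    awk '
        /<tr/   { n = 0 }                      # new row: reset the cell counter
        /<td/   {
            cell = $0
            gsub(/<[^>]*>/, "", cell)          # drop every tag on the line
            gsub(/^[ \t]+|[ \t]+$/, "", cell)  # trim surrounding whitespace
            cells[++n] = cell
        }
        /<\/tr/ { if (n >= 3 && cells[3] == "alert") print cells[1] }
    ' "$@"
}

# Example input with a bold department name; reads stdin when no file is given.
printf '%s\n' \
    '<tr>' \
    '<td bgcolor="red"><a class="n1"><b>marketing</b></a></td>' \
    '<td bgcolor="red"><a class="n1">today</a></td>' \
    '<td bgcolor="red"><a class="n1">alert</a></td>' \
    '</tr>' \
    '<tr>' \
    '<td bgcolor="blue"><a class="n1">accounting</a></td>' \
    '<td bgcolor="blue"><a class="n1">today</a></td>' \
    '<td bgcolor="blue"><a class="n1">normal</a></td>' \
    '</tr>' | alert_departments
```

It can also be given a file name, e.g. `alert_departments your_html_file`. Unlike the second grep above, it does not need a list of known department names.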
 
Old 06-09-2004, 10:11 AM   #8
fnd
LQ Newbie
 
Registered: Jun 2004
Posts: 5

Original Poster
Rep: Reputation: 0
oh now I see what David might have meant.
in that case I might try lynx since it takes away the problem of
analyzing the ever-changing HTML tags that could come out
of the web status server.

but thanks for the detailed idea on the -B option and the cut command.
I haven't used cut all that much. I will try your way and see.
thanks a lot !!
 
Old 06-09-2004, 12:35 PM   #9
david_ross
Moderator
 
Registered: Mar 2003
Location: Scotland
Distribution: Slackware, RedHat, Debian
Posts: 12,047

Rep: Reputation: 64
Quote:
Originally posted by fnd
oh now I see what David might have meant.
in that case I might try lynx since it takes away the problem of
analyzing the ever-changing HTML tags that could come out
of the web status server.
That's right, although it doesn't just strip out the tags: it reads them and formats the document accordingly, so there will be one line for each department (since they all appear in one table row in the HTML code).
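
Whichever way the list of departments is produced, the hand-off to the notification step asked about in the first post could look roughly like this. It is only a sketch: `notify` and `notify_all` are hypothetical names, and `notify` just prints a message here because the thread does not say how the email or pager is actually sent.

```shell
#!/bin/sh
# Hypothetical notifier: here it only prints a message. A real script
# might call a mail command or a pager gateway instead (not shown in the thread).
notify() {
    printf 'ALERT: main server for %s needs attention\n' "$1"
}

# Read one department per line on stdin and notify for each non-empty line.
notify_all() {
    while IFS= read -r dept; do
        [ -n "$dept" ] && notify "$dept"
    done
    return 0
}

printf 'marketing\nshipping\n' | notify_all
```

Any of the pipelines in this thread that print one department per line can be piped into `notify_all`.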
 
  


All times are GMT -5.
