LinuxQuestions.org - parse HTML file and find keywords ?


Hi, I'm trying to implement a script for our IT dept. to retrieve the status of the main servers in different departments and notify people (email, pager, etc.) if there's trouble.

I use wget to retrieve the file, since it's an HTML file served by a web server. The parsing needs to be done in a Linux Bash shell script. This is where I'm puzzled: the script needs to look for a combination of keywords and pass the results on to another subroutine for processing (send email, page someone, etc.). I was wondering if anyone has an idea of how to properly parse this:

The HTML part looks like :

<tr>
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">accounting</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">normal</a></td>
</tr>
<tr>
<td bgcolor="red"><a class="n1" bgcolor="red">sales</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">normal</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">shipping</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">alert</a></td>
</tr>

Each row represents one dept. and the status of the main server. So the table row means :

<tr>
<td bgcolor="red"><a class="n1" bgcolor="red"> {Department} </a></td>
<td bgcolor="red"><a class="n1" bgcolor="red"> {When} </a></td>
<td bgcolor="red"><a class="n1" bgcolor="red"> {Status} </a></td>
</tr>

The keyword I need to look for is the word "alert". Once the script finds it, it needs to pull out the {Department} and then email/page people. I know I can use a combination of sed and awk to find the keyword "alert", but how do I then move back and pull out the {Department}, which is 2 lines above the {Status} line?

The problem is that when it finds the word "alert", it needs to "go back up" two lines, skipping the {When} line (we don't care when it happened), to get to the line holding the {Department}. I'm not sure how to do this.

Also, the data cells of each <TR> table row alternate in BGCOLOR, as you can see, so the bgcolor="red" attribute appears only in every other row of data. That adds to the complexity of my parsing. Any idea what the right way to do this is?

thanks

Welcome to LQ.

It would be a lot easier if you can use lynx:
lynx -dump http://www.yourhost.com/stats.html | grep alert

You may also want to consider looking at nagios to monitor your systems:
http://www.nagios.org

thanks for the advice. Unfortunately I have no control over how I get the input file. The monitoring
is done by a partner so I can only grab whatever output the web server sends out. So I still
need to figure out how to parse from the keyword "alert" back to the {Department}.....

Is there anything wrong with using lynx like I suggested above?

All you need to do is pipe it into awk if you just want the first field:
lynx -dump http://www.yourhost.com/stats.html | grep alert | awk '{print $1}'

actually you misunderstood my question.

Take the "marketing" dept. for example,
I can grep the word "alert" just fine from the line :
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
(Let's call this LINE 3)

However, now that I know that a department is in trouble, how do I
know which department it is ? so now I must retrieve the word
"marketing" from the line :
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
(Let's call this LINE 1)

that's not a problem either. I can use a combination of SED and AWK
to cut out the word "marketing".

However, since this is the sequence of the lines from the HTML file :
LINE 1 : <td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
LINE 2 : <td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
LINE 3 : <td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>

the processing puts the Bash shell script at LINE 3 when it finds the word
"alert", so now there's no way for the script to look back at LINE 1 and
tell me that it's department "marketing" that's in trouble !!
That's what I'm trying to figure out.......

I hope my question is clearer now ?
any idea ?
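
The "look back" problem described above can be sidestepped by remembering earlier lines while reading forward. The following is only a minimal sketch, assuming the status cell always sits exactly two lines below the department cell as in the sample; the function name lines_two_above_alert is made up for illustration:

```shell
#!/usr/bin/env bash
# For every line containing ">alert<", print the line that appeared two lines
# earlier. awk keeps a two-line history in prev1 (one back) and prev2 (two back).
lines_two_above_alert() {
    awk '
        />alert</ { print prev2 }          # prev2 is the {Department} line
        { prev2 = prev1; prev1 = $0 }      # shift the history after each line
    '
}

# Demo with the three "marketing" lines from the post.
printf '%s\n' \
    '<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>' \
    '<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>' \
    '<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>' |
    lines_two_above_alert
```

This prints the raw LINE 1 markup; the department name still has to be cut out of it with sed, awk or cut.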

I'm not familiar with lynx, but I assume what david is suggesting is this:
1. You receive the markup code from wherever
2. You run lynx by pointing it to the local copy of the html you received
3. Lynx processes the document, and spits out text on the console as an interpretation of the file
4. Now that the html is presented as a processed web page, you would use grep and awk on the visual output of lynx to find your info since the table would be spat out on a single row/line of input.

I may be interpreting that incorrectly, but thought I would mention it.
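
If that interpretation is right, step 4 might look roughly like the sketch below. It assumes lynx renders each table row as one whitespace-separated line with the department first and the status last; that layout is an assumption, so check the real dump before relying on it. The function name alert_departments and the sample text are made up for illustration:

```shell
#!/usr/bin/env bash
# Print the first column of every row whose last column is exactly "alert".
# Assumes one rendered table row per line: {Department} {When} {Status}.
alert_departments() {
    awk '$NF == "alert" { print $1 }'
}

# Stand-in for what "lynx -dump" might print for the sample table (assumed).
printf '%s\n' \
    '   marketing today alert' \
    '   accounting today normal' \
    '   sales today normal' \
    '   shipping today alert' |
    alert_departments
```

Comparing the last field instead of grepping the whole line avoids false hits if the word "alert" ever shows up in another column.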

Second, another option would be to use grep. Look at the "-B" option for grep. It will give you X lines of context before the matched text. So you could do something like:
$ grep -B 2 "alert" your_html_file

The output would look something like this:
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
--
<td bgcolor="blue"><a class="n1">shipping</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">alert</a></td>

You could then do any number of other things. You could try to use sed/awk by themselves to parse every fourth line, you could pipe the data into wc (to count the lines of output), then use combinations of the head and tail commands to mask off everything but one line to process at a time, or whatever works.
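
The "every fourth line" idea could be sketched with awk as below. Rather than counting in fours, this version restarts its counter at each "--" separator that grep prints between context groups, which should behave the same on this input. The function name and the temporary sample file are made up for illustration:

```shell
#!/usr/bin/env bash
# Keep only the first line of each context group produced by "grep -B 2".
# A line consisting of "--" marks the boundary between two groups.
first_line_of_each_group() {
    awk '/^--$/ { n = 0; next } { n++ } n == 1'
}

# Build a small sample file shaped like the HTML in the original post.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<tr>
<td bgcolor="red"><a class="n1" bgcolor="red">marketing</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">today</a></td>
<td bgcolor="red"><a class="n1" bgcolor="red">alert</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">accounting</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">normal</a></td>
</tr>
<tr>
<td bgcolor="blue"><a class="n1">shipping</a></td>
<td bgcolor="blue"><a class="n1">today</a></td>
<td bgcolor="blue"><a class="n1">alert</a></td>
</tr>
EOF

grep -B 2 'alert' "$sample" | first_line_of_each_group
rm -f "$sample"
```

With a grep that prints the "--" separators, this should leave just the marketing and shipping lines, still wrapped in their markup.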

Actually, after thinking about it some, you could do this:

grep -B 2 "alert" your_html_file | grep "marketing\|accounting\|sales\|shipping" | cut -f 3 -d ">" | cut -f 1 -d "<"

That would give you exactly what you want: a list of each department that received an alert, one per line.

The -B option to the first grep gives you two lines of previous context from each alert you have in the file.

The second grep takes that output, and filters out every line that does not include the name of a recognized department. This command will get lengthy if you have many, many different departments. This will reduce the output to the html lines that contain departments that received an alert.

The first cut removes the HTML markup from the start of the line up to the beginning of the department name.

The second cut removes the rest of the markup following the department's name.
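
Once the pipeline yields one department per line, handing each name to the notification step could look something like this sketch. Both function names are hypothetical, and the echo is only a placeholder for whatever mail or pager command is actually in use:

```shell
#!/usr/bin/env bash
# Hypothetical notification hook: swap the echo for a real mail/pager command.
notify_department() {
    local dept=$1
    echo "ALERT: main server for department '$dept' needs attention"
}

# Read department names from stdin, one per line, and notify for each.
notify_all() {
    local dept
    while IFS= read -r dept; do
        if [ -n "$dept" ]; then      # skip blank lines
            notify_department "$dept"
        fi
    done
}

# Demo input standing in for the output of the grep/cut pipeline.
printf '%s\n' marketing shipping | notify_all
```

Reading line by line keeps the notification logic separate from the parsing, so either half can be replaced without touching the other.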

---

Of course, you could substitute your own sed/awk stuff instead of the two cuts. Using cut is simple, but it does not easily lend itself to changing formats of line input. For instance, if anyone decided to make the department name bold, change the font size, or whatever, the cut commands would return useless data (the name of the new tag inserted).
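
As one possible sed substitute for the two cuts, the sketch below deletes every tag on the line instead of counting ">" fields, so an extra tag around the name should not change the result. It assumes no tag spans more than one line and no attribute value contains a ">" character; the function name strip_tags is made up for illustration:

```shell
#!/usr/bin/env bash
# Remove every <...> tag on each line, then trim leading and trailing whitespace.
strip_tags() {
    sed -e 's/<[^>]*>//g' \
        -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//'
}

# Demo: the department name survives even with an added <b> tag.
printf '%s\n' \
    '<td bgcolor="red"><a class="n1" bgcolor="red"><b>marketing</b></a></td>' |
    strip_tags
```

This is not a real HTML parser, only a line-oriented filter that happens to fit the table shown in this thread.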

oh now I see what David might have meant.
in that case I might try lynx since it takes away the problem of
analyzing the ever-changing HTML tags that could come out
of the web status server.

but thanks for the detailed idea on the -B option and the cut command.
I haven't used cut all that much. I will try your way and see.
thanks a lot !!

Quote:

Originally posted by fnd
oh now I see what David might have meant.
in that case I might try lynx since it takes away the problem of
analyzing the ever-changing HTML tags that could come out
of the web status server.

That's right, although it doesn't just strip out the tags: it reads them and formats the document accordingly, so there will be one line for each department (since each department's cells all sit in one table row in the HTML code).