How do I parse this HTML using only bash?

Godboss · 08-25-2017, 01:02 PM

Hey guys,

I am running a Raspberry Pi farm, and one of the little buggers acts as my media server.
Let's call it raspidlna, shall we?
On occasion, the software I use for this (minidlna) has a breakdown, due to one of many reasons related to the infrastructure in place.

As a big fan of self-healing approaches, I have Nagios monitoring the status of the HTTP service provided by minidlna, but I would like to take it one step further and actually monitor the number of video files being served.

This can be obtained through an http call to http://raspidlna:8200 which in turn responds with the following html:

Code:

<HTML>
  <HEAD><TITLE>MiniDLNA 1.1.2</TITLE></HEAD>
  <BODY>
    <div style="text-align: center">
      <h2>MiniDLNA status</h2>
    </div><h3>Media library</h3>
    <table border=1 cellpadding=10><tr><td>Audio files</td><td>0</td></tr>
      <tr><td>Video files</td><td>472</td></tr>
      <tr><td>Image files</td><td>0</td></tr>
    </table>
    <h3>Connected clients</h3>
    <table border=1 cellpadding=10>
      <tr><td>ID</td><td>Type</td><td>IP Address</td><td>HW Address</td></tr>
      <tr><td>0</td><td>Generic UPnP 1.0</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>1</td><td>Generic UPnP 1.0</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>2</td><td>Generic UPnP 1.0</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>3</td><td>Generic UPnP 1.0</td><td>XXXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>4</td><td>Samsung Series C/D/E</td><td>XXXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>5</td><td>Unknown</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
    </table>
  </BODY>
</HTML>

The row I care about is the one for Video files. What I want to read is the number in its second cell, the number of files (472 in the example above).

Now, I know we can perform a lot of operations to parse strings using sed, awk, grep and even string substitutions with bash, but I must admit that whenever I try looking at all of it, I get quite a headache. So that brings me to you experts. Hopefully one of you will have a nice and clean idea on how to work some magic with this...

All I have written so far, script-wise, is:

Code:

#!/bin/bash
content=$(wget http://raspidlna:8200/ -q -O -)
declare -i count=????
echo "Total Videos $count"

if (("$count" <  1)) ; then
   echo would now run ./startMinidlna.sh
fi

jlinkels · 08-25-2017, 03:16 PM

This might well be beyond the capabilities of Bash. However, utilities exist for parsing HTML. It seems hxselect from the html-xml-utils package is an option. Or pup. Both have references on Google.

I don't have experience with either one, but I do have a lot of experience in Bash, and I would not do this in Bash...

jlinkels

wpeckham · 08-25-2017, 03:22 PM

This can be done, but it will not be quite straightforward. There are some things in your favor.

#1 the string "Video files" is unique and specific to the line needed, so a simple grep will isolate the correct line.
#2 the format of the line is fixed, so the correct string should always start at the same character position or offset.
#3 once we isolate that string, the first non-numeric character (<) will mark the end of the numeric string we need.

Consider grep pattern matching and BASH string subset addressing and extraction and you have all of the tools needed to pull out the number correctly. Do some reading, then come back here with what you come up with and any questions. You CAN do it.

PS: After some thought about #2 and #3, I realized that the only numeric characters on the line are exactly those you need. If we can filter out all non-numeric characters we have the answer even faster. Grep has an option for that.
Something like

Code:

content=$(wget http://raspidlna:8200/ -q -O -)
COUNT=$(echo "$content" | grep 'Video files' | grep -o '[0-9]*')

MIGHT work.
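For comparison, here is a minimal sketch of points #1 to #3 using grep plus Bash string trimming instead of the second grep. It assumes the row keeps the exact shape shown in the first post; extract_video_count is a made-up helper name, and the sample row stands in for the wget output.

```bash
#!/bin/bash
# Sample row copied from the status page in the first post; in the real
# script this would be the $content fetched with wget.
content='      <tr><td>Video files</td><td>472</td></tr>'

# extract_video_count is a hypothetical helper name, not part of minidlna.
extract_video_count() {
    local line rest
    # 1: keep only the line that mentions "Video files"
    line=$(printf '%s\n' "$1" | grep 'Video files') || return 1
    # 2: drop everything up to and including the label cell
    rest=${line#*"Video files</td><td>"}
    # 3: drop everything from the first "<" onwards
    printf '%s\n' "${rest%%<*}"
}

count=$(extract_video_count "$content")
echo "Total Videos $count"
```

With the sample row this prints "Total Videos 472". If the label is missing, the helper returns non-zero and count stays empty, which is worth checking before comparing it against 1.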

rtmistler · 08-25-2017, 03:36 PM

I agree that you can find the line using bash. From there, using awk or tr, you can break it down into fields by selecting < or > as the delimiters, and then use the cut command to get to just the number value.

What I'm not sure about is multiline files being stored as a variable. But this is due to my inexperience with that, not anything in particular I know about that part of the topic.

EDIT: You can probably just use sed to delete the following, presumably fixed, terms:
<tr><td>Video files</td><td>
</td></tr>

The minor snag might be an indeterminate amount of whitespace left before the number. However, once again, awk or cut, and maybe tr, will "see" column-delimited data, and you can find the string value of that number from there.
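A rough sketch of both ideas, assuming the row looks exactly like the one in the first post (leading spaces included). The variable names are only for illustration.

```bash
#!/bin/bash
# Sample row copied from the first post, including its leading spaces.
row='      <tr><td>Video files</td><td>472</td></tr>'

# Option A: treat "<" and ">" as field delimiters and take the field
# four positions after the "Video files" label, wherever that label sits.
count_awk=$(printf '%s\n' "$row" |
    awk -F'[<>]' '{
        for (i = 1; i <= NF; i++)
            if ($i == "Video files") { print $(i + 4); exit }
    }')

# Option B: delete the two fixed terms with sed, then strip the
# leftover whitespace with tr.
count_sed=$(printf '%s\n' "$row" |
    sed -n '/Video files/ { s/<tr><td>Video files<\/td><td>//; s/<\/td><\/tr>//; p; }' |
    tr -d '[:space:]')

echo "awk: $count_awk, sed+tr: $count_sed"
```

Searching for the label field in awk, rather than hard-coding a field number, means the leading whitespace (or another tag in front of the row) does not shift the result.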

dugan · 08-25-2017, 05:57 PM

Google got me this:

https://github.com/ericchiang/pup

sundialsvcs · 08-25-2017, 08:04 PM

Why on earth would you try to do it in Bash?

Just add a "shebang" line as the first line of your script, such as:

Code:

#!/usr/bin/ruby

Now, you can write the entire remainder of your script in Ruby, and no one will ever know, nor care, that you did so. You now have at your disposal everything that Ruby brings to the table, which certainly includes an HTML parser.

And of course, you have your pick of languages: Perl, PHP, Python, Ruby, Haskell, Java (ick...) . . . .

"The bash scripting-language is highly overrated." You are not in any way confined to it.

syg00 · 08-25-2017, 08:13 PM

The OP is only interested in the number - not parsing the tags. The grep offered above will do it; personally I would use a single call to sed. If there were multiple stanzas (that entailed summing), awk would be my first instinct, but appears unnecessary here.
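Something along these lines, as a sketch only: video_count is a made-up helper name, and the three sample rows from the first post stand in for the live page.

```bash
#!/bin/bash
# video_count is a hypothetical helper: it reads the status page on stdin
# and prints the number from the "Video files" row, if there is one.
video_count() {
    sed -n 's/.*Video files<\/td><td>\([0-9][0-9]*\)<.*/\1/p'
}

# Three rows copied from the first post stand in for the live page here.
sample='    <table border=1 cellpadding=10><tr><td>Audio files</td><td>0</td></tr>
      <tr><td>Video files</td><td>472</td></tr>
      <tr><td>Image files</td><td>0</td></tr>'

count=$(printf '%s\n' "$sample" | video_count)
echo "Total Videos ${count:-unknown}"
```

In the OP's script the sample would be replaced by the fetch itself, i.e. count=$(wget http://raspidlna:8200/ -q -O - | video_count). With -n and the p flag, sed prints nothing when the row is absent, so an empty count can be treated as "page not reachable or changed".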

allend · 08-25-2017, 11:05 PM

Just for fun, a pure bash solution for getting the count. Probably won't do much for the headache though.

Code:

count=$(while read a; do [[ $a =~ "Video files</td><td>"([0-9]+) ]] && echo "${BASH_REMATCH[1]}"; done < <(wget http://raspidlna:8200/ -q -O -))

syg00 · 08-25-2017, 11:08 PM

How does wget qualify?

allend · 08-25-2017, 11:17 PM

Ahh true.

Should I amend to "almost pure"?

Turbocapitalist · 08-26-2017, 01:11 AM

If your version of grep supports perl-compatible regular expressions you could use the -P option with some zero-width assertions. See "man perlre" for the details.

Code:

grep -o -P '(?<=<tr><td>Video files</td><td>).*?(?=</td>)'

However, that is quite fragile and if the spacing, especially line breaking, changes then it will need to be adjusted. Same goes for other non-parsing solutions.
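To illustrate both the match and the fragility, here is a small sketch that assumes a grep built with -P support. The split_row variant is invented for the demonstration; minidlna is not known to emit it.

```bash
#!/bin/bash
# The row as served in the first post, and a hypothetical reflowed variant
# in which the two cells land on separate lines.
row='      <tr><td>Video files</td><td>472</td></tr>'
split_row='      <tr><td>Video files</td>
        <td>472</td></tr>'

# Same lookbehind/lookahead pattern as above.
lookaround='(?<=<tr><td>Video files</td><td>).*?(?=</td>)'

# grep exits non-zero when nothing matches, hence the "|| true".
same_line=$(printf '%s\n' "$row" | grep -o -P "$lookaround" || true)
across_lines=$(printf '%s\n' "$split_row" | grep -o -P "$lookaround" || true)

echo "one line: '${same_line}', split: '${across_lines}'"
```

The first result is 472; the second is empty, because grep works line by line and the lookbehind never sees both cells together.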

For robustness you might instead use a proper XHTML processor, such as one of the XPath tools. Either of the following XPaths should work:

Code:

//tr[td="Video files"]/td/following-sibling::td[1]

//tr[td="Video files"]/td[2]

XPath and regex are easy enough to do in Perl with the help of the HTML::TreeBuilder::XPath module from CPAN. Ruby was mentioned above; it too has XPath modules, and I expect so do many of the other scripting languages.

ondoho · 08-26-2017, 05:31 AM

so the HTML is provided by nagios?
i'm not sure how the number of video files served relates to minidlna crashing or not, but anyhow:
wouldn't it be cleverer to explore the possibilities of the software that is providing this information? nagios?

what you are doing there makes for some nice code golfing, but really it's just a hacky duct tape approach.

sundialsvcs · 08-28-2017, 08:53 AM

It's simply easier to use a language system which is readily available and which already includes such niceties as an HTML parser, regular-expression support, and so on.

Quote:

Actum Ne Agas: Do Not Do A Thing Already Done.

Thanks to "#!shebang," Bash makes it equally easy for you to write a "Bash script" in whatever is the most-appropriate scripting language. Only Dr. Korn's shell endeavored to build a "serious" programming language as its built-in scripting engine, and "#!shebang" is frankly a more-elegant way to do it.

As one author put it, "It's kind of like building a particularly-elegant archway over the front door of a supermarket. You might look upon it and even be proud of it, but you might not want to admit to having done it."

justmy2cents · 09-01-2017, 01:02 AM

Not sure if this will help, but "JSON is a minimal readable format for structuring data. It's used primarily to transmit data between a server and web apps as an alternative to XML." Python has a built-in JSON library that can "pretty print" JSON output, which makes it easier to find a specific entry. Just use python -m json.tool to indent and organize the JSON output, e.g. cat test.json | python -m json.tool. For more advanced JSON parsing you can install jq, which has options to extract specific values from JSON input. In that case, just pipe the output to jq instead...