LinuxQuestions.org
Old 08-25-2017, 01:02 PM   #1
Godboss
LQ Newbie
 
Registered: Aug 2017
Posts: 1

Rep: Reputation: Disabled
Question How do I parse this HTML using only bash?


Hey guys,

I am running a raspberry pi farm and one of the little buggers acts as my media server.
Let's call it raspidlna, shall we?
On occasion, the software I use for this (minidlna) has a breakdown, due to one of many reasons related to the infrastructure in place.

As a big fan of self-healing approaches I have nagios monitoring the status of the http service provided by minidlna, but I would like to take it one step further, and actually monitor the number of video files being served.

This can be obtained through an HTTP request to http://raspidlna:8200, which responds with the following HTML:

Code:
<HTML>
  <HEAD><TITLE>MiniDLNA 1.1.2</TITLE></HEAD>
  <BODY>
    <div style="text-align: center">
      <h2>MiniDLNA status</h2>
    </div><h3>Media library</h3>
    <table border=1 cellpadding=10><tr><td>Audio files</td><td>0</td></tr>
      <tr><td>Video files</td><td>472</td></tr>
      <tr><td>Image files</td><td>0</td></tr>
    </table>
    <h3>Connected clients</h3>
    <table border=1 cellpadding=10>
      <tr><td>ID</td><td>Type</td><td>IP Address</td><td>HW Address</td></tr>
      <tr><td>0</td><td>Generic UPnP 1.0</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>1</td><td>Generic UPnP 1.0</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>2</td><td>Generic UPnP 1.0</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>3</td><td>Generic UPnP 1.0</td><td>XXXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>4</td><td>Samsung Series C/D/E</td><td>XXXXXXXXXX</td><td>XXXXXXXXX</td></tr>
      <tr><td>5</td><td>Unknown</td><td>XXXXXXXXX</td><td>XXXXXXXXX</td></tr>
    </table>
  </BODY>
</HTML>
The part I care about is the "Video files" row. What I want to read is the number in its second cell, i.e. the number of files (472 in this example).

Now I know we can perform a lot of operations to parse strings using sed, awk, grep and even string substitutions with bash, but I must admit that whenever I try looking at all of it, I get quite a headache, so... That brings me to you experts... Hopefully one of you will have a nice and clean idea on how to work some magic with this...

All I have written so far, script-wise, is:

Code:
#!/bin/bash
content=$(wget http://raspidlna:8200/ -q -O -)
declare -i count=????
echo "Total Videos $count"

if (("$count" <  1)) ; then
   echo would now run ./startMinidlna.sh
fi
 
Old 08-25-2017, 03:16 PM   #2
jlinkels
LQ Guru
 
Registered: Oct 2003
Location: Bonaire, Leeuwarden
Distribution: Debian /Jessie/Stretch/Sid, Linux Mint DE
Posts: 5,195

Rep: Reputation: 1044
This might well be beyond the capabilities of Bash. However, utilities exist for parsing HTML. It seems hxselect from the html-xml-utils package is an option. Or pup. Both have references on Google.

I don't have experience with either one, but I do have a lot of experience in Bash, and I would not do this in Bash...

jlinkels
 
Old 08-25-2017, 03:22 PM   #3
wpeckham
LQ Guru
 
Registered: Apr 2010
Location: Continental USA
Distribution: Debian, Ubuntu, RedHat, DSL, Puppy, CentOS, Knoppix, Mint-DE, Sparky, VSIDO, tinycore, Q4OS, Manjaro
Posts: 5,847

Rep: Reputation: 2800
This can be done, but it will not be quite straightforward. There are some things in your favor.

#1 the string "Video files" is unique and specific to the line needed, so a simple grep will isolate the correct line.
#2 the format of the line is fixed, so the correct string should always start at the same character position or offset.
#3 once we isolate that string, the first non-numeric character (<) will mark the end of the numeric string we need.

Consider grep pattern matching and BASH string subset addressing and extraction and you have all of the tools needed to pull out the number correctly. Do some reading, then come back here with what you come up with and any questions. You CAN do it.

PS: After some thought about #2 and #3, I realized that the only numeric characters on the line are exactly those you need. If we can filter out all non-numeric characters we have the answer even faster. Grep has an option for that.
Something like
Code:
content=$(wget http://raspidlna:8200/ -q -O -)
COUNT=`echo "$content"|grep 'Video files'|grep -o '[0-9]*' `
MIGHT work.
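To make that a bit more concrete, here is a minimal sketch of the whole check built on the two greps above. It assumes the status page keeps each table row on its own line, as in the HTML posted in #1. The function name video_count and the inline sample row are made up for illustration, and the wget call is left as a comment so the snippet runs without the server.

```shell
#!/bin/bash
# Sketch only: read MiniDLNA status HTML on stdin, print the "Video files" count.
video_count() {
    # First grep keeps the line mentioning "Video files";
    # second grep keeps only the runs of digits on that line.
    grep 'Video files' | grep -o '[0-9][0-9]*' | head -n 1
}

# Stand-in for the real page (one row copied from the first post).
sample='      <tr><td>Video files</td><td>472</td></tr>'
count=$(printf '%s\n' "$sample" | video_count)
echo "Total Videos ${count:-0}"   # prints: Total Videos 472

# Assumed real-world usage, following the original script:
#   count=$(wget -q -O - http://raspidlna:8200/ | video_count)
#   if (( ${count:-0} < 1 )); then ./startMinidlna.sh; fi
```

If the server ever put the whole table on one line, the second grep would see other numbers too (border=1, cellpadding=10), so treat this as a starting point rather than a finished monitor.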

Last edited by wpeckham; 08-25-2017 at 03:33 PM.
 
Old 08-25-2017, 03:36 PM   #4
rtmistler
Moderator
 
Registered: Mar 2011
Location: USA
Distribution: MINT Debian, Angstrom, SUSE, Ubuntu, Debian
Posts: 9,890
Blog Entries: 13

Rep: Reputation: 4934
I agree that you can find the line using bash. Beyond that, with awk or tr you can break the line into fields by treating < and > as delimiters, and then use the cut command to get just the number.

What I'm not sure about is multiline files being stored as a variable. But this is due to my inexperience with that, not anything in particular I know about that part of the topic.

EDIT: You can probably just use sed to delete the following, and presumably, fixed terms:
<tr><td>Video files</td><td>
</td></tr>

The minor snag might be some indeterminate number of white spaces left before that, however once again, awk, or cut, and maybe tr will "see" column delimited data and you can find the string value of that number from there.
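As a rough illustration of the delimiter idea, awk can split each line on < and > and then look for the field that is exactly "Video files". With the markup from the first post, the number sits four fields after that label, so leading white space does not matter. The function name and the sample line are assumptions for the example.

```shell
#!/bin/bash
# Sketch only: split on '<' and '>' and print the cell that follows "Video files".
video_count_awk() {
    awk -F'[<>]' '
        {
            for (i = 1; i <= NF; i++) {
                # Fields after the label are: "/td", "", "td", then the number.
                if ($i == "Video files") { print $(i + 4); exit }
            }
        }'
}

# Sample row copied from the first post, including the leading spaces.
printf '%s\n' '      <tr><td>Video files</td><td>472</td></tr>' | video_count_awk   # prints: 472
```

Because it searches for the label rather than counting from the start of the line, this also copes with several rows sharing one line, but it still depends on the cells being written exactly as `</td><td>` with nothing in between.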

Last edited by rtmistler; 08-25-2017 at 03:39 PM.
 
Old 08-25-2017, 05:57 PM   #5
dugan
LQ Guru
 
Registered: Nov 2003
Location: Canada
Distribution: distro hopper
Posts: 11,278

Rep: Reputation: 5346
Google got me this:

https://github.com/ericchiang/pup
 
Old 08-25-2017, 08:04 PM   #6
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,753
Blog Entries: 4

Rep: Reputation: 3965
Why on earth would you try to do it in Bash?

Just add a "shebang" line as the first line of your script, such as:
Code:
#!/usr/bin/ruby
Now, you can write the entire remainder of your script in Ruby, and no one will ever know, nor care, that you did so. You now have at your disposal everything that Ruby brings to the table, which certainly includes an HTML parser.

And of course, you have your pick of languages: Perl, PHP, Python, Ruby, Haskell, Java (ick...) . . . .

"The bash scripting-language is highly overrated." You are not in any way confined to it.
 
Old 08-25-2017, 08:13 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,188

Rep: Reputation: 4131
The OP is only interested in the number - not parsing the tags. The grep offered above will do it; personally I would use a single call to sed. If there were multiple stanzas (that entailed summing), awk would be my first instinct, but appears unnecessary here.
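For what it is worth, a single sed call along those lines might look like the sketch below. It assumes the cell markup is exactly as posted in #1, and the function name is only for illustration.

```shell
#!/bin/bash
# Sketch only: one sed call that prints the number from the "Video files" row.
video_count_sed() {
    # -n suppresses normal output; the trailing p prints only lines where the
    # substitution succeeded, replaced by the captured digits.
    sed -n 's:.*<td>Video files</td><td>\([0-9][0-9]*\)</td>.*:\1:p'
}

# Sample row copied from the first post.
printf '%s\n' '      <tr><td>Video files</td><td>472</td></tr>' | video_count_sed
```

The `:` delimiter is used instead of `/` so the slashes in the closing tags do not need escaping.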
 
Old 08-25-2017, 11:05 PM   #8
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,409

Rep: Reputation: 2775
Just for fun, a pure bash solution for getting the count. Probably won't do much for the headache though.
Code:
count=$(while read a; do [[ $a =~ "Video files</td><td>"([0-9]+) ]] && echo "${BASH_REMATCH[1]}"; done < <(wget http://raspidlna:8200/ -q -O -))
 
Old 08-25-2017, 11:08 PM   #9
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,188

Rep: Reputation: 4131
How does wget qualify?
 
Old 08-25-2017, 11:17 PM   #10
allend
LQ 5k Club
 
Registered: Oct 2003
Location: Melbourne
Distribution: Slackware64-15.0
Posts: 6,409

Rep: Reputation: 2775
Ahh true. Should I amend to "almost pure"?
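If one really wanted to drop wget as well, bash can open TCP connections itself through its /dev/tcp pseudo-device, provided the bash build has network redirections enabled. The sketch below is only that, a sketch: both function names are invented, the request is a bare HTTP/1.0 GET with no error handling, and the fetch function is defined but not called so the snippet runs without the server.

```shell
#!/bin/bash
# Pure-bash parse: read HTML on stdin, print the "Video files" count.
parse_video_count() {
    local line re='Video files</td><td>([0-9]+)'
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line =~ $re ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
            return 0
        fi
    done
    return 1   # no matching row seen
}

# Pure-bash fetch using bash's /dev/tcp redirection (no wget, no curl).
# Prints the raw response, HTTP headers included; the parser ignores them.
fetch_status_page() {
    local host=$1 port=$2 line
    {
        printf 'GET / HTTP/1.0\r\nHost: %s\r\n\r\n' "$host" >&3
        while IFS= read -r line <&3 || [[ -n $line ]]; do
            printf '%s\n' "$line"
        done
    } 3<>"/dev/tcp/$host/$port"
}

# Demo on a sample row copied from the first post.
printf '%s\n' '      <tr><td>Video files</td><td>472</td></tr>' | parse_video_count

# Assumed real-world usage:
#   count=$(fetch_status_page raspidlna 8200 | parse_video_count)
```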
 
Old 08-26-2017, 01:11 AM   #11
Turbocapitalist
LQ Guru
 
Registered: Apr 2005
Distribution: Linux Mint, Devuan, OpenBSD
Posts: 7,423
Blog Entries: 3

Rep: Reputation: 3788
If your version of grep supports perl-compatible regular expressions you could use the -P option with some zero-width assertions. See "man perlre" for the details.

Code:
grep -o -P '(?<=<tr><td>Video files</td><td>).*?(?=</td>)'
However, that is quite fragile and if the spacing, especially line breaking, changes then it will need to be adjusted. Same goes for other non-parsing solutions.

For robustness you might use a proper XHTML processor instead, such as one of the XPath tools. Either of the following XPaths should work:

Code:
//tr[td="Video files"]/td/following-sibling::td[1]

//tr[td="Video files"]/td[2]
XPath and regex are easy enough to do in perl with the help of the HTML::TreeBuilder::XPath module from CPAN. Ruby was mentioned above, it too has XPath modules, and I expect so do many of the other scripting languages.
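If you would rather stay in a shell script, one possible way to evaluate the second XPath is xmllint from libxml2. That tool is not mentioned above and may not be installed on your system, so take this as a sketch under that assumption. Its --html option selects the HTML parser, --xpath evaluates the expression, and - reads from standard input. The function name and the sample document are made up for the example.

```shell
#!/bin/bash
# Sketch only: evaluate the XPath with xmllint (libxml2); assumes xmllint is installed.
video_count_xpath() {
    # string(...) returns the text of the first matching node instead of its markup.
    # Parser warnings about the loose HTML are discarded.
    xmllint --html --xpath 'string(//tr[td="Video files"]/td[2])' - 2>/dev/null
}

# Cut-down sample based on the HTML in the first post.
sample='<html><body><table>
<tr><td>Audio files</td><td>0</td></tr>
<tr><td>Video files</td><td>472</td></tr>
</table></body></html>'

if command -v xmllint >/dev/null 2>&1; then
    printf '%s\n' "$sample" | video_count_xpath
else
    echo "xmllint not installed, skipping demo" >&2
fi
```

Unlike the regex approaches, this does not care how the rows are split across lines or how much white space surrounds the tags.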
 
Old 08-26-2017, 05:31 AM   #12
ondoho
LQ Addict
 
Registered: Dec 2013
Posts: 19,872
Blog Entries: 12

Rep: Reputation: 6053
so the HTML is provided by nagios?
i'm not sure how the number of video files served relates to minidlna crashing or not, but anyhow:
wouldn't it be cleverer to explore the possibilities of the software that is providing this information? nagios?

what you are doing there makes for some nice code golfing, but really it's just a hacky duct tape approach.
 
Old 08-28-2017, 08:53 AM   #13
sundialsvcs
LQ Guru
 
Registered: Feb 2004
Location: SE Tennessee, USA
Distribution: Gentoo, LFS
Posts: 10,753
Blog Entries: 4

Rep: Reputation: 3965
It's simply easier to use a language system which is readily available and which already includes such niceties as an HTML parser, regular-expression support, and so on.
Quote:
Actum Ne Agas: Do Not Do A Thing Already Done.
Thanks to "#!shebang," Bash makes it equally easy for you to write a "Bash script" in whatever is the most-appropriate scripting language. Only Dr. Korn's shell endeavored to build a "serious" programming language as its built-in scripting engine, and "#!shebang" is frankly a more-elegant way to do it.

As one author put it, "It's kind of like building a particularly-elegant archway over the front door of a supermarket. You might look upon it and even be proud of it, but you might not want to admit to having done it."
 
Old 09-01-2017, 01:02 AM   #14
justmy2cents
Member
 
Registered: May 2017
Location: U.S.
Distribution: Un*x
Posts: 237
Blog Entries: 2

Rep: Reputation: Disabled
Not sure if this will help, but "JSON is a minimal readable format for structuring data. It's used primarily to transmit data between a server and web apps as an alternative to XML." Python has a built-in JSON library that can "pretty print" JSON output, which makes it easier to find a specific entry. Just use python -m json.tool to indent and organize the JSON output, e.g. cat test.json | python -m json.tool. For more advanced JSON parsing you can install jq, which has options to extract specific values from JSON input. In that case just pipe the output to jq instead...

Last edited by justmy2cents; 09-01-2017 at 01:07 AM.
 
  

