LinuxQuestions.org
Old 10-22-2017, 02:59 PM   #1
oulevon
Member
 
Registered: Feb 2001
Location: Boston, USA
Distribution: Slackware
Posts: 438

Rep: Reputation: 30
Regex to extract data between html tag


Hi,

I need to extract the highlighted value between the span tags in the block of HTML below. The value 49.1 will change, and I want to monitor it. Does anyone have any pointers, or could you suggest something to look at to prime me for this? Thanks.

<span class="wx-data" data-station="IADELAID19" data-variable="temperature">
<span class="wx-value">49.1</span>
<span class="wx-unit">°F</span>
</span>
 
Old 10-22-2017, 03:20 PM   #2
!!!
Member
 
Registered: Jan 2017
Location: Fremont, CA, USA
Distribution: Trying any&ALL on old/minimal
Posts: 997

Rep: Reputation: 382
Try these web-search keywords: awk|sed extract value in|between html tags
Let us know what you find and try. Best wishes. Slack
 
Old 10-22-2017, 03:28 PM   #3
business_kid
LQ Guru
 
Registered: Jan 2006
Location: Ireland
Distribution: Slackware, Slarm64 & Android
Posts: 16,366

Rep: Reputation: 2335
What's your programming language? In a terminal, 'grep -b' or 'grep -u' would get you a byte offset, which you could pass to 'tail -c' to lose the stuff before wx-value">. Next comes your number. What you do from there depends on how big or small that number gets.
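
A minimal sketch of the byte-offset idea, assuming GNU grep (for -o combined with -b) and that the marker wx-value"> is present in the file. The function name value_after_marker is made up for illustration. It uses tail -c +N to skip everything up to and including the marker, then cuts the line at the next "<".

```shell
# value_after_marker FILE: print the text that follows the first
# 'wx-value">' marker, up to the next '<'. Assumes GNU grep (-o with -b).
value_after_marker() {
    marker='wx-value">'
    # With -o, -b prints "OFFSET:MATCH", where OFFSET is the 0-based byte
    # position of the match itself; keep only the first match.
    offset=$(grep -ob -- "$marker" "$1" | head -n 1 | cut -d: -f1)
    [ -n "$offset" ] || return 1
    # tail -c +N counts bytes from 1, so skip the offset plus the marker.
    start=$((offset + ${#marker} + 1))
    tail -c "+$start" "$1" | head -n 1 | sed 's/<.*//'
}
```

Called as value_after_marker page.html (page.html being a placeholder name), this prints whatever sits between the marker and the next tag, so it only makes sense when the first wx-value span is the one you want.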
 
Old 10-22-2017, 06:26 PM   #4
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4122
Depends on the data: is that all the input, or only a snippet? If the former, a simple sed matching digits and dots following a ">" will suffice. But the data must always look like that; otherwise you'll need to include the full tag to ensure you get the correct line. If there is more than one match, you'll get multi-line output.
grep could do it with PCRE, but that makes the regex even more complex.
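
A rough sketch of the sed variant that includes the full tag, assuming the opening tag, the number, and the closing tag always sit on one line as in the sample. The function name extract_wx_value is only illustrative.

```shell
# extract_wx_value FILE: for each line containing
# <span class="wx-value">NUMBER</span>, print NUMBER (digits and dots only).
extract_wx_value() {
    sed -n 's/.*<span class="wx-value">\([0-9.][0-9.]*\)<\/span>.*/\1/p' "$1"
}
```

If several lines match, each value is printed on its own line, so a caller that expects a single reading would still need to pick one.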
 
Old 10-22-2017, 08:12 PM   #5
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,140

Rep: Reputation: 4122
Some messing around after searching - try this
Code:
xmllint  --xpath '//span[@class = "wx-value"]/text()' input.file
Assign it to a bash variable and do your comparison.
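
One possible shape for the "assign and compare" step, as a sketch only: the function name report_if_changed and the state file path are assumptions, not something from the thread. It remembers the last value in a file and reports only when a new reading differs.

```shell
# report_if_changed NEW_VALUE STATE_FILE
# Print a message and update STATE_FILE only when NEW_VALUE differs from
# the value saved by the previous run.
report_if_changed() {
    new=$1
    state=$2
    old=
    if [ -f "$state" ]; then
        old=$(cat "$state")
    fi
    if [ "$new" != "$old" ]; then
        printf 'changed: %s -> %s\n' "${old:-none}" "$new"
        printf '%s\n' "$new" > "$state"
    fi
}

# Hypothetical usage with the xmllint command above:
#   value=$(xmllint --xpath '//span[@class = "wx-value"]/text()' input.file) &&
#       report_if_changed "$value" "$HOME/.wx-value.last"
```

The comparison is a plain string test, so "49.1" and "49.10" would count as different values.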
 
Old 10-23-2017, 04:13 PM   #6
KenJackson
Member
 
Registered: Jul 2006
Location: Maryland, USA
Distribution: Fedora and others
Posts: 757

Rep: Reputation: 145
Is that the only wx-value span on the page? You'll need more code if there's more than one. Didn't test it, but this should work:
Code:
cat file.html | awk '/="wx-value">/{sub(/.*wx-value">/,"");sub(/<.span>.*/,"");print}'
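
If the page does carry more than one wx-value span, one way to pick the right one is to key on the data-variable attribute of the enclosing span. This is an untested-in-the-wild sketch that assumes the layout shown in the first post (the wx-data span opens before its wx-value line); the function name extract_variable is hypothetical.

```shell
# extract_variable NAME FILE: print the wx-value that follows the span
# whose data-variable attribute equals NAME.
extract_variable() {
    awk -v var="$1" '
        # Remember that we have entered the block for the wanted variable.
        index($0, "data-variable=\"" var "\"") { want = 1 }
        # The first wx-value line after that point holds the reading.
        want && /class="wx-value">/ {
            sub(/.*class="wx-value">/, "")
            sub(/<\/span>.*/, "")
            print
            want = 0
        }
    ' "$2"
}
```

For example, extract_variable temperature file.html would print only the temperature reading even when other variables are present on the same page.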
 
Old 11-05-2017, 04:42 AM   #7
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Just as a warning, regex is not particularly well-suited to xml/html input. They have a nested hierarchical format, while regex operates linearly. A tool specifically designed for xml, like xmllint or xmlstarlet, is thus recommended for complex tasks.

However, if your task is simple and the code you're working on is dependably regular, then a regex solution isn't particularly out of order. Just be aware that it can get really messy if you're trying to target tags within tags within tags.

One simple tool that I really like is hxpipe (part of the html-xml-utils package). It converts xml-style input into a format that is more safely parseable by line-based tools. Using the above input, I came up with this:

Code:
hxpipe inputfile.txt | sed -rn ' /wx-value/,/[)]span/ { /^-/ s/-//p }'
 
  


All times are GMT -5.
