Regex to extract data between html tag

oulevon · 10-22-2017, 02:59 PM

Hi,

I need to extract the highlighted value between the span tags in the block of HTML below. The value 49.1 will be changing and I want to monitor it. Does anyone have any pointers or could suggest something to look at to prime me for this? Thanks.


49.1
°F

!!! · 10-22-2017, 03:20 PM

Try these web-search keywords: awk|sed extract value in|between html tags
Let us know what you find and try. Best wishes. Slack

business_kid · 10-22-2017, 03:28 PM

What's your programming language? In a terminal, 'grep -b' (optionally with '-u') would get you a byte offset, which you could pass to 'tail -c' to lose the stuff before wx-value">. Next comes your number. What you do from there depends on how big or small that number goes.
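
A rough sketch of the byte-offset idea, in case it helps. The sample markup is an assumption (the HTML from the first post did not survive), and sample.html is a hypothetical file name. `grep -ob` prints the offset of the match itself, and `tail -c +N` does the skipping (`head -c N` would instead keep the first N bytes).

```shell
# Hypothetical sample file; the real page markup is assumed, not known.
printf '%s\n' '<span class="wx-value">49.1</span>' > sample.html

marker='wx-value">'

# -o -b prints "offset:match" for each match; keep the first offset only.
offset=$(grep -ob -F "$marker" sample.html | head -n 1 | cut -d: -f1)

# The offset is 0-based and tail -c +N is 1-based, so add the marker
# length plus one to land on the first byte of the number.
start=$(( offset + ${#marker} + 1 ))

# Skip to the number, then cut everything from the next "<" onwards.
value=$(tail -c "+$start" sample.html | head -n 1 | cut -d'<' -f1)
echo "$value"
```

This only picks up the first wx-value on the page and assumes the marker is plain ASCII.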

syg00 · 10-22-2017, 06:26 PM

Depends on the data - is that all the input, or only a snippet? If the former, a simple sed of digits and dots following a ">" will suffice. But the data must always look like that - else you'll need to include the full tag to ensure you get the correct line. If there is more than one, you'll get multi-line output.
grep could do it with PCRE, but that makes the regex even more complex.
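
For illustration, the two sed variants described above might look something like this. The sample line is an assumption, since the original markup is not visible in the thread; only the wx-value class name comes from the discussion.

```shell
# Hypothetical one-line sample; the real page will differ.
sample='<div><span class="wx-value">49.1</span><span>&deg;F</span></div>'

# Loose version: any run of digits and dots sitting between ">" and "<".
loose=$(printf '%s\n' "$sample" |
    sed -n 's/.*>\([0-9.][0-9.]*\)<.*/\1/p')

# Stricter version: include the full tag so other numbers are ignored.
strict=$(printf '%s\n' "$sample" |
    sed -n 's/.*<span class="wx-value">\([0-9.][0-9.]*\)<\/span>.*/\1/p')

echo "loose=$loose strict=$strict"
```

Both print one value per matching input line, so several matching lines give multi-line output. Neither pattern allows for a leading minus sign.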

syg00 · 10-22-2017, 08:12 PM

Some messing around after searching - try this

Code:

xmllint  --xpath '//span[@class = "wx-value"]/text()' input.file

Assign it to a bash variable and do your comparison.
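
A minimal sketch of the "do your comparison" part, assuming you want to notice when the value differs from the previous run. The function name report_change and the state file are made up for the example; they are not part of xmllint or any other tool.

```shell
# report_change NEW_VALUE STATE_FILE   (hypothetical helper)
# Returns 0 and prints a message when NEW_VALUE differs from the value
# saved on the previous run; returns 1 when nothing changed.
report_change() {
    new=$1
    state_file=$2
    old=""
    if [ -f "$state_file" ]; then
        old=$(cat "$state_file")
    fi
    if [ "$new" != "$old" ]; then
        # Remember the new reading for the next run.
        printf '%s\n' "$new" > "$state_file"
        printf 'changed: %s -> %s\n' "${old:-none}" "$new"
        return 0
    fi
    return 1
}
```

You would feed it the xmllint result, e.g. `current=$(xmllint --xpath '//span[@class = "wx-value"]/text()' input.file)` followed by `report_change "$current" /tmp/wx-value.last`, and run that from cron or a loop.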

KenJackson · 10-23-2017, 04:13 PM

Is that the only wx-value span on the page? You'll need more code if there's more than one. Didn't test it, but this should work:

Code:

awk '/="wx-value">/{sub(/.*wx-value">/,"");sub(/<\/span>.*/,"");print}' file.html
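
If there does turn out to be more than one wx-value span, something along these lines could print them all, one per line, even when several share a line. It is only a sketch: extract_all is a made-up name, and it assumes the attribute is written exactly as class="wx-value" with the text directly inside the span.

```shell
# extract_all [FILE...]   (hypothetical helper; reads stdin if no file)
# Prints the text of every class="wx-value" span, one value per line.
extract_all() {
    awk -v prefix='class="wx-value">' '{
        line = $0
        # match() sets RSTART and RLENGTH for the leftmost match.
        while (match(line, prefix "[^<]*")) {
            # Print only the part after the prefix, up to the next "<".
            print substr(line, RSTART + length(prefix), RLENGTH - length(prefix))
            # Continue scanning after this match.
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

Call it as `extract_all file.html`, then pick the line you want with head, tail or sed -n.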

David the H. · 11-05-2017, 04:42 AM

Just as a warning, regex is not particularly well-suited to xml/html input. They have a nested hierarchy format, while regex operates linearly. A tool specifically designed for xml, like xmllint or xmlstarlet, is thus recommended for complex tasks.

However, if your task is simple and the code you're working on is dependably regular, then a regex solution isn't particularly out of order. Just be aware that it can get really messy if you're trying to target tags within tags within tags.

One simple tool that I really like is hxpipe (part of the html-xml-utils package). It converts xml-style input into a format that is more safely parseable by line-based tools. Using the above input, I came up with this:

Code:

hxpipe inputfile.txt | sed -rn ' /wx-value/,/[)]span/ { /^-/ s/-//p }'