[SOLVED] curl (or something) that just gets text

ezekieldas · 05-22-2012, 04:57 PM

Maybe curl cannot do this. I could've sworn there was something at one point that did (lynx?). In any case, I must use curl. I'm scraping data from a site. If I run 'curl http://example.com/status' I get the following:

Code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<title>status</title>
<head>
<meta name="description" content="status" />
<meta name="keywords" content="status" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<h2>status</h2>
<strong>A</strong>: <i>OK</i><br />
<strong>B</strong>: <i>NOT OK</i><br />
<strong>C</strong>: <i>OK</i><br />
<body>

But all I'd really like to get is plain old text with the important stuff:

Code:

A: OK
B: NOT OK
C: OK

I'm pretty sure I must use curl. But there are numerous other things that could follow such as sed, awk, perl...

Babertje · 05-22-2012, 05:49 PM

There is a utility called "html2text" that strips the HTML markup.
This should work:

Code:

#!/bin/bash
curl -s http://www.example.org/status.html > /tmp/statusdata
html2text /tmp/statusdata |tail -n3

ezekieldas · 05-23-2012, 11:41 AM

Thanks Babertje. After wrestling with trying to avoid using an additional script, I finally turned to html2text.

For the most part this worked, but some of the data we're trying to grab contained < and/or >, and we wanted to keep those characters.

Code:

sed -e 's/<[^>]*>//g'
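
For illustration only, here is that sed expression applied to the three status lines from the first post. The function name strip_tags is made up for this sketch. Note that the pattern also deletes anything sitting between a literal < and a later > in the data itself, so it is not safe when those characters are wanted.

```shell
#!/bin/bash
# strip_tags is a hypothetical wrapper around the sed expression above.
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

# Sample lines copied from the page source in the first post.
strip_tags <<'EOF'
<strong>A</strong>: <i>OK</i><br />
<strong>B</strong>: <i>NOT OK</i><br />
<strong>C</strong>: <i>OK</i><br />
EOF
# prints:
# A: OK
# B: NOT OK
# C: OK
```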

What we finally arrived at was this:

Code:

curl -s -X GET http://example.com/status | html2text | strings | sed -e 's/ $//; /^$/ d;' | sed -e 's/[\t\v\f\r ]\+/ /g;' \
 | awk '{printf "%s", $0; getline; print $0}' | sed -e 's/Pass\./0/g;' | awk -F"[ :]" '{print $1 ":" $NF}'

David the H. · 05-24-2012, 04:59 PM

Ouch. A chain of seven pipes is not very efficient. A single well-written awk script could certainly replace all of your separate awk and sed commands. And strings? What do you need that for?
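
As a rough sketch of what that single awk script might look like: the function name tidy_status is made up, and I'm only guessing at what html2text produces for your page (a label line followed by a value line), so treat it as a starting point rather than a drop-in replacement. It collapses whitespace, skips blank lines, joins each pair of lines, maps "Pass." to 0 and prints the first and last fields.

```shell
#!/bin/bash
# tidy_status is a hypothetical helper: one awk program standing in for
# the four sed/awk stages of the pipeline.
tidy_status() {
  awk -F'[ :]' '
    {
      gsub(/[ \t\r\f\v]+/, " ")    # collapse each whitespace run to one space
      sub(/ $/, "")                # drop a trailing space
    }
    $0 == "" { next }              # skip blank lines
    !have { first = $0; have = 1; next }   # hold the first line of a pair
    {
      $0 = first $0                # join the pair; awk re-splits the fields
      have = 0
      gsub(/Pass\./, "0")          # map "Pass." to 0
      print $1 ":" $NF             # first field, a colon, last field
    }
  '
}

# Assumed usage (not run here):
#   curl -s http://example.com/status | html2text | tidy_status
```

Fed the made-up input lines "A:   ", "Pass.", an empty line, "B:" and "Fail", this prints "A:0" and "B:Fail".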

It might help your parsing to run the file through htmltidy first, to clean up any formatting problems before extracting the text.

Another option, depending on your exact needs, may be to use xmlstarlet (or another tool purposely designed for parsing xml/html) instead. One option it has is for converting the input into "pyx" format, which is easier for line-based tools like sed and awk to parse. Again, you should run the html source through tidy first to convert it to proper xhtml.

Code:

curl .. | tidy -n -asxml 2>/dev/null | xmlstarlet pyx

This should give you pyx output. It's up to you to decide whether parsing that is useful to you or not.
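
To give an idea of how line-based parsing of pyx could go, here is a minimal sketch. In pyx each line starts with a type character: "(" opens an element, ")" closes it, "A" is an attribute and "-" is text, with newlines inside text written as a literal \n. The function name pyx_to_text is made up, and the assumption that every status entry starts at <strong> and ends at <br /> comes only from the sample page above. I have not run it against the real site.

```shell
#!/bin/bash
# pyx_to_text is a hypothetical helper that rebuilds "A: OK" style lines
# from pyx input, assuming each entry runs from <strong> to <br />.
pyx_to_text() {
  awk '
    $0 == "(strong" { line = ""; next }   # a new entry starts here
    /^-/ {
      text = substr($0, 2)                # strip the leading "-"
      gsub(/\\n/, "", text)               # drop the literal \n escapes
      line = line text                    # append to the current entry
      next
    }
    $0 == ")br" {                         # end of <br />: emit the entry
      if (line != "") print line
      line = ""
    }
  '
}

# Assumed usage (not run here):
#   curl .. | tidy -n -asxml 2>/dev/null | xmlstarlet pyx | pyx_to_text
```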