05-22-2012, 04:57 PM   #1
curl (or something) that just gets text

Maybe curl cannot do this. I could've sworn there was something at one point that did (lynx?). In any case, I must use curl. I'm scraping data from a site. When I run curl against the page, I get the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<meta name="description" content="status" />
<meta name="keywords" content="status" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<strong>A</strong>: <i>OK</i><br />
<strong>B</strong>: <i>NOT OK</i><br />
<strong>C</strong>: <i>OK</i><br />
But all I'd really like to get is plain old text with the important stuff:

A: OK
B: NOT OK
C: OK

I'm pretty sure I must use curl, but its output could be piped through any number of other tools, such as sed, awk, or perl.
05-22-2012, 05:49 PM   #2
There is a utility called "html2text" that strips the HTML markup.
This should work:
curl -s > /tmp/statusdata
html2text /tmp/statusdata |tail -n3

Last edited by Babertje; 05-22-2012 at 06:42 PM. Reason: Added example shellscript
05-23-2012, 11:41 AM   #3
Original Poster
Thanks, Babertje. After wrestling with trying to avoid using an additional script, I finally turned to html2text.

For the most part this worked, but some of the data we're trying to grab contained literal < and/or > characters, and we wanted to keep them.

sed -e 's/<[^>]*>//g'
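
To illustrate how a tag-stripping expression like this one can lose data that contains a literal < or >, here is a small sketch. The function name strip_tags and both sample strings are made up for illustration; only the sed expression itself comes from the thread.

```shell
#!/bin/sh
# strip_tags (hypothetical name): delete everything that looks like <...>
strip_tags() {
    sed -e 's/<[^>]*>//g'
}

# Markup only: the tags go away and the text survives.
printf '%s\n' '<strong>A</strong>: <i>OK</i><br />' | strip_tags

# Invented sample with a literal "<" in the data: the expression treats
# "< 5</i>" as one long tag, so the "< 5" is removed along with the markup.
printf '%s\n' '<i>load < 5</i>' | strip_tags
```

The first call prints "A: OK". The second prints only "load " because the comparison was swallowed, which is the kind of loss that makes a real HTML-aware tool the safer choice.
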
What we finally arrived at was this:

curl -s -X GET | html2text | strings | sed -e 's/ $//; /^$/ d;' | sed -e 's/[\t\v\f\r ]\+/ /g;' \
 | awk '{printf "%s", $0; getline; print $0}' | sed -e 's/Pass\./0/g;' | awk -F"[ :]" '{print $1 ":" $NF}'
05-24-2012, 04:59 PM   #4
David the H.
Ouch. Seven levels of nested pipes is not very efficient. A single well-written awk script could certainly replace all of your separate awk and sed commands. And strings? What do you need that for?
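
As a rough sketch of what such a single awk script might look like, the function below folds the three sed stages and two awk stages from the previous post into one awk process. The name clean_status is made up, and the input shape (a name line followed by a value line) is only inferred from the pipeline, since the real data is not shown in the thread. Unlike the original, a leftover unpaired line is printed on its own.

```shell
#!/bin/sh
# clean_status (hypothetical name): reads html2text output on stdin and does
# the trailing-space trim, blank-line removal, whitespace squeeze, pair join,
# "Pass." -> 0 substitution and name:value print in one awk process.
clean_status() {
    awk '
    {
        sub(/ $/, "")                              # drop one trailing space
        if ($0 == "") next                         # skip empty lines
        gsub(/[ \t\v\f\r]+/, " ")                  # squeeze whitespace runs
        if (!have) { held = $0; have = 1; next }   # first line of a pair
        emit(held $0); have = 0                    # second line: join and print
    }
    END { if (have) emit(held) }                   # unpaired last line
    function emit(line,   f, n) {
        gsub(/Pass\./, "0", line)                  # "Pass." becomes 0
        n = split(line, f, /[ :]/)                 # split on space or colon
        print f[1] ":" f[n]                        # first field : last field
    }
    '
}

# Invented sample input, two records of two lines each.
printf '%s\n' 'Disk: ' 'Pass.' '' 'Net:   ' 'FAIL' | clean_status
```

With the invented sample this prints "Disk:0" and "Net:FAIL". In the real pipeline the function would sit directly after html2text, in place of everything from the first sed onward.
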

It might help your parsing to run the file through HTML Tidy (the tidy command) first, to clean up any formatting problems before extracting the text.

Another option, depending on your exact needs, may be to use xmlstarlet (or another tool designed specifically for parsing XML/HTML) instead. One of its options converts the input into "pyx" format, which is easier for line-based tools like sed and awk to parse. Again, you should run the HTML source through tidy first to convert it to proper XHTML.

curl .. | tidy -n -asxml 2>/dev/null | xmlstarlet pyx
This should give you pyx output. It's up to you to decide if parsing that is useful to you or not.
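
For a sense of what parsing pyx might involve, here is a minimal sketch. In pyx each line starts with a type character: "(" opens an element, ")" closes one, "A" is an attribute and "-" is character data. The function name pyx_records is made up, and the sample input is hand-written to resemble what the status page might produce, so real xmlstarlet output may contain extra lines.

```shell
#!/bin/sh
# pyx_records (hypothetical name): reads pyx on stdin and prints the text
# found between each <strong> and the following <br>, one record per line.
pyx_records() {
    awk '
    /^\(strong$/ { collecting = 1; rec = ""; next }             # record begins
    /^\(br$/     { if (collecting) print rec; collecting = 0; next }
    /^-/         { if (collecting) rec = rec substr($0, 2) }    # character data
    '
}

# Hand-written pyx for: <strong>A</strong>: <i>OK</i><br />
printf '%s\n' '(strong' '-A' ')strong' '-: ' '(i' '-OK' ')i' '(br' ')br' \
    | pyx_records
```

The sample prints "A: OK". Because every piece of text arrives on its own "-" line, a literal < or > inside the data cannot be mistaken for markup.
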


All times are GMT -5.