LinuxQuestions.org
Old 05-22-2012, 04:57 PM   #1
ezekieldas
Member
 
Registered: Mar 2010
Posts: 122

Rep: Reputation: 16
curl (or something) that just gets text


Maybe curl cannot do this. I could have sworn some tool did at one point (lynx?). In any case, I must use curl. I'm scraping data from a site. If I run 'curl http://example.com/status' I get the following:

Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<title>status</title>
<head>
<meta name="description" content="status" />
<meta name="keywords" content="status" />
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<h2>status</h2>
<strong>A</strong>: <i>OK</i><br />
<strong>B</strong>: <i>NOT OK</i><br />
<strong>C</strong>: <i>OK</i><br />
<body>
But all I'd really like to get is plain old text with the important stuff:
Code:
A: OK
B: NOT OK
C: OK

I'm pretty sure I must use curl for the fetch, but its output could be piped into any number of other tools, such as sed, awk, perl...
 
Old 05-22-2012, 05:49 PM   #2
Babertje
Member
 
Registered: Jun 2009
Location: Haarlem, The Netherlands
Distribution: Archlinux
Posts: 125

Rep: Reputation: 20
There is a utility called "html2text" that strips the HTML markup.
This should work:
Code:
#!/bin/bash
curl -s http://www.example.org/status.html > /tmp/statusdata
html2text /tmp/statusdata |tail -n3

Last edited by Babertje; 05-22-2012 at 06:42 PM. Reason: Added example shellscript
 
Old 05-23-2012, 11:41 AM   #3
ezekieldas
Member
 
Registered: Mar 2010
Posts: 122

Original Poster
Rep: Reputation: 16
Thanks, Babertje. After wrestling with trying to avoid using an additional script, I finally turned to html2text.

For the most part this worked, but some of the data we're trying to grab contained < and/or >, and we wanted to keep those characters.

Code:
sed -e 's/<[^>]*>//g'
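A small sketch of why that expression is a problem when the data itself contains < or >. The function name strip_tags and the second sample line are made up for illustration; only the sed expression comes from the post.

```shell
#!/bin/bash
# strip_tags: the tag-stripping sed expression from the post, wrapped in a
# function. The function name is illustrative, not from the thread.
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

# Markup-only line: behaves as hoped.
printf '%s\n' '<strong>A</strong>: <i>OK</i><br />' | strip_tags

# Hypothetical line whose data contains a literal < and >:
# everything from "<" through the next ">" is removed along with the tags.
printf '%s\n' '<strong>B</strong>: <i>x < 5 and y > 3</i><br />' | strip_tags
```

The first line comes out as "A: OK". The second loses the comparison entirely, because the expression cannot tell a tag from a stray < ... > pair in the data.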
What we finally arrived at was this:

Code:
curl -s -X GET http://example.com/status | html2text | strings | sed -e 's/ $//; /^$/ d;' | sed -e 's/[\t\v\f\r ]\+/ /g;' \
 | awk '{printf "%s", $0; getline; print $0}' | sed -e 's/Pass\./0/g;' | awk -F"[ :]" '{print $1 ":" $NF}'
 
Old 05-24-2012, 04:59 PM   #4
David the H.
Bash Guru
 
Registered: Jun 2004
Location: Osaka, Japan
Distribution: Arch + Xfce
Posts: 6,852

Rep: Reputation: 2037
Ouch. Seven levels of nested pipes is not very efficient. A single well-written awk script could certainly replace all of your separate awk and sed commands. And strings? What do you need that for?
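As a rough sketch of the single-awk idea, assuming the input looks like the sample HTML from post #1 (one <strong>name</strong>: <i>value</i><br /> entry per line). The function name status_text and the list of tags to drop are assumptions, not something from the thread. It removes only those known tags, so a literal < or > in the data survives.

```shell
#!/bin/bash
# status_text (hypothetical name): one awk pass that strips the known tags,
# squeezes whitespace and trims, instead of several sed/awk stages.
status_text() {
  awk '
    /<strong>/ {
      gsub(/<\/?(strong|i|br)[^>]*>/, "")   # drop only the known tags
      gsub(/[ \t\r]+/, " ")                  # squeeze runs of whitespace
      sub(/^ /, ""); sub(/ $/, "")           # trim both ends
      print
    }'
}

# Sample input copied from the first post, fed in without touching the network.
status_text <<'EOF'
<h2>status</h2>
<strong>A</strong>: <i>OK</i><br />
<strong>B</strong>: <i>NOT OK</i><br />
<strong>C</strong>: <i>OK</i><br />
EOF
```

In real use this would sit directly after curl, e.g. curl -s http://example.com/status | status_text. Further substitutions such as the Pass. replacement could be added as another gsub in the same block.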

It might help your parsing to run the file through HTML Tidy (the tidy command) first, to clean up any formatting problems before extracting the text.

Another option, depending on your exact needs, may be to use xmlstarlet (or another tool purposely designed for parsing xml/html) instead. One option it has is for converting the input into "pyx" format, which is easier for line-based tools like sed and awk to parse. Again, you should run the html source through tidy first to convert it to proper xhtml.

Code:
curl .. | tidy -n -asxml 2>/dev/null | xmlstarlet pyx
This should give you pyx output. It's up to you to decide whether parsing that is useful to you or not.
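To give an idea of what line-based parsing of pyx could look like, here is a minimal sketch. In PYX each line is one event: "(" opens an element, ")" closes it, and "-" carries text. The function name pyx_status is made up, and the sample records are hand-written to mirror the status page, not captured from xmlstarlet, so the real output may differ in detail.

```shell
#!/bin/bash
# pyx_status (hypothetical name): rebuild "name: value" lines from PYX records.
pyx_status() {
  awk '
    $0 == "(strong" { line = ""; collecting = 1; next }  # an entry starts
    collecting && /^-/ {
      text = substr($0, 2)          # drop the leading "-"
      gsub(/\\n/, "", text)         # PYX writes newlines as a literal \n
      line = line text
      next
    }
    collecting && $0 == ")br" {     # the closing br ends the entry
      print line
      collecting = 0
    }'
}

# Hand-written PYX for illustration only, not real xmlstarlet output.
printf '%s\n' '(h2' '-status' ')h2' '-\n' \
  '(strong' '-A' ')strong' '-: ' '(i' '-OK' ')i' '(br' ')br' '-\n' \
  '(strong' '-B' ')strong' '-: ' '(i' '-NOT OK' ')i' '(br' ')br' |
  pyx_status
```

Since the text arrives on its own "-" lines, a < or > inside the data is never confused with markup. In practice the function would go on the end of the tidy and xmlstarlet pipeline above.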
 
All times are GMT -5. The time now is 04:04 PM.