LinuxQuestions.org
Old 01-30-2007, 05:23 AM   #1
sonicthehedgehog
LQ Newbie
 
Registered: Oct 2006
Distribution: Mandriva
Posts: 17

Rep: Reputation: 0
script to grab html content from between specific tags


Hi there,

(I'm a bit of a newb to shell scripting, but at least I'm trying.)

I've just set up a small web server and am having a fiddle with grabbing content from other pages and displaying it on my page, updating every few seconds. (I'm grabbing track names from a 'now playing' page on a streaming server.)

Here is an example of the page:

Code:
<HTML> loads of stuff always the same length

<table>*stuff i want that changes size*</table> 

loads of stuff always the same length
</HTML>
The problem is that (due to the way this page must be generated) all the HTML is on one long line with no newlines, so I don't know how to use, say, awk to grab the bits I want.

At the moment I am grabbing the whole page with wget and using something like

Code:
head -c500 file.html | tail -c300 > output.html
to grab a few hundred bytes starting at the first <table> tag on the page, as this is always the same number of bytes from the start of the page.

I'm not using PHP or anything, just trying to do this with a shell script that grabs the bits I want and tags them onto an HTML file every few seconds, which the server then serves up.

I'm looping the script every few seconds and it's working fine at the moment, but the problem is that the 'head | tail' always grabs the same number of characters. As the size of the content in the table varies, I end up grabbing extra bits or missing a few characters that I want.

To sum up:

+ I have one long line of HTML in a file.
+ I need to grab all the stuff between <table> and </table> (which varies in length) on the page and ditch the rest.

If anyone fancies it, does anybody know a solution to this problem? Maybe some pattern-matching tool that I can read up on and use?

Thanks
 
Old 01-30-2007, 05:34 AM   #2
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Probably gawk is what you're looking for. If there is always one and only one <table> ... </table> entry, this can work:

Code:
gawk -F\<table\> '{print $2}' file.html | gawk -F\<\/table\> '{print $1}'
The -F option tells gawk to use the specified string as the field separator. See man gawk for details.
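To see what the two calls do, here is a minimal sketch on a made-up one-line page. The sample string and the variable names are assumptions for illustration, not the real 'now playing' page. The first call splits the line at <table> and keeps what follows it; the second splits that at </table> and keeps what comes before it. Plain awk is used here so it runs where gawk is not installed; gawk should behave the same way for these separators.

```shell
#!/bin/sh
# Hypothetical one-line page with a single table (made up for illustration).
page='<HTML>header stuff<table>Artist - Track</table>footer stuff</HTML>'

# Field 2 of the first split is everything after <table>;
# field 1 of the second split is everything before </table>.
content=$(printf '%s\n' "$page" | awk -F'<table>' '{print $2}' | awk -F'</table>' '{print $1}')

echo "$content"
```

Quoting the separators with single quotes has the same effect as the backslash escapes in the command above; it only keeps the shell from treating < and > as redirections.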
 
Old 01-30-2007, 05:42 AM   #3
sonicthehedgehog
LQ Newbie
 
Registered: Oct 2006
Distribution: Mandriva
Posts: 17

Original Poster
Rep: Reputation: 0
Thanks,

That's exactly the kind of thing I'm looking for. The only problem is that there are more tables, although the one I want is always the second table on the page, so it might be possible to get round this somehow?
 
Old 01-30-2007, 05:50 AM   #4
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
Code:
gawk -F\<\/table\> '{print $2}' file.html | gawk -F\<table\> '{print $2}'
Just a little change: the first call to gawk grabs the stuff between the first </table> and the second one; the second call grabs the stuff after the remaining <table>. The condition "the one i want is always the second table" must be true!
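As a rough check of the reasoning above, here is a sketch on a made-up one-line page with three tables. The sample string, the variable names, and the one-call variant with split() are assumptions added for illustration. Both versions assume the tags are written exactly as <table> and </table>, with no attributes and in lowercase. Plain awk is used; gawk should give the same result.

```shell
#!/bin/sh
# Hypothetical one-line page with three tables (made up for illustration).
page='<HTML>top<table>menu</table>middle<table>Artist - Track</table>more<table>links</table>end</HTML>'

# Two-step version, same idea as the command above.
two_step=$(printf '%s\n' "$page" | awk -F'</table>' '{print $2}' | awk -F'<table>' '{print $2}')

# One-call variant: with </table> as the separator, field n holds the n-th
# table preceded by the text in front of it; split() drops that leading text.
one_call=$(printf '%s\n' "$page" | awk -F'</table>' -v n=2 '{ split($n, part, "<table>"); print part[2] }')

echo "$two_step"
echo "$one_call"
```

In the one-call variant, changing n picks a different table, so the same line would still work if the wanted table moved to another position.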
 
Old 01-30-2007, 05:55 AM   #5
sonicthehedgehog
LQ Newbie
 
Registered: Oct 2006
Distribution: Mandriva
Posts: 17

Original Poster
Rep: Reputation: 0
That's brilliant, thanks,

and a lot more elegant than my 'head | tail' combo.

I'll give it a go when I get home.

Love these forums.
 
Old 01-30-2007, 05:57 AM   #6
colucix
LQ Guru
 
Registered: Sep 2003
Location: Bologna
Distribution: CentOS 6.5 OpenSuSE 12.3
Posts: 10,509

Rep: Reputation: 1983
You're welcome!
 
Old 01-30-2007, 01:14 PM   #7
sonicthehedgehog
LQ Newbie
 
Registered: Oct 2006
Distribution: Mandriva
Posts: 17

Original Poster
Rep: Reputation: 0
Just got home and tried it out,

works exactly as I wanted.
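For anyone finding this thread later, the whole fetch, extract, and publish loop described in the first post could be sketched roughly as below. This is only a sketch under assumptions: the URL, the file names, the function names, and the 5-second interval are placeholders that do not come from the thread, and plain awk stands in for gawk.

```shell
#!/bin/sh
# Sketch only: the URL and file names below are placeholders, not from the thread.
URL='http://example.com/nowplaying.html'   # hypothetical address of the streaming server page
OUT='output.html'                          # file the web server serves (assumed name)

# Read a one-line HTML page on stdin and print the contents of the second table.
second_table() {
    awk -F'</table>' '{print $2}' | awk -F'<table>' '{print $2}'
}

# Fetch, extract, and publish once. The download goes to its own file so a
# failed fetch leaves the served file untouched; the result is written to a
# temporary file and renamed so the server does not see a half-written file.
update_once() {
    wget -q -O "$OUT.page" "$URL" || return 1
    second_table < "$OUT.page" > "$OUT.tmp" && mv "$OUT.tmp" "$OUT"
}

# Uncomment to refresh every 5 seconds (the interval is an assumption):
# while true; do update_once; sleep 5; done
```

The loop is left commented out so the functions can be tried on a saved copy of the page first, for example with second_table < file.html.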

 
  

