Script to grab HTML content from between specific tags

sonicthehedgehog · 01-30-2007, 05:23 AM

Hi there,

(I'm a bit of a newbie to shell scripting, but at least I'm trying.)

I've just set up a small web server and am experimenting with grabbing content from other pages and displaying it on my page, updating every few seconds. (I'm grabbing track names from a 'now playing' page on a streaming server.)

Here is an example of the page:

Code:

<HTML> loads of stuff always the same length

<table>*stuff i want that changes size*</table> 

loads of stuff always the same length
</HTML>

The problem is that, due to the way this page is generated, all the HTML is on one long line with no newlines, so I don't know how to use, say, awk to grab the bits I want.

At the moment I am grabbing the whole page with wget and using something like

Code:

head -c500 file.html | tail -c300 > output.html

to grab a few hundred bytes starting at the first <table> tag on the page, as this is always the same number of bytes from the start of the page.

I'm not using PHP or anything, just trying to do this with a shell script that grabs the bits I want and tacks them onto an HTML file every few seconds, which the server then serves up.

I'm looping the script every few seconds and it's working fine at the moment, but the problem is that 'head | tail' always grabs the same number of characters. As the size of the content in the table varies, I end up grabbing extra bits or missing a few characters that I want.
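To make the fixed-window problem concrete, here is a small sketch. The two one-line pages and the byte offsets are made up for illustration (they are not the real page); the window is sized so that it neither fits the short content nor the long one.

```shell
# Hypothetical pages: the <table> tag starts at byte 13 in both,
# but the table content differs in length.
short='<HTML>header<table>ab</table>footer</HTML>'
long='<HTML>header<table>abcdef</table>footer</HTML>'

# head -c23 keeps bytes 1-23; tail -c11 keeps the last 11 of those (bytes 13-23).
cut_short=$(printf '%s' "$short" | head -c23 | tail -c11)
cut_long=$(printf '%s' "$long" | head -c23 | tail -c11)

echo "$cut_short"   # spills into the closing tag
echo "$cut_long"    # cuts the content off early
```

The same window grabs part of the closing tag from the short page and loses the end of the content on the long one, which is why a tag-based split is a better fit here than a byte count.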

To sum up:

+ I have one long line of HTML in a file
+ I need to grab all the stuff between <table></table> (which varies in length) on the page and ditch the rest

Does anybody know a solution to this problem? Maybe some pattern-matching tool that I can read up on and use?

Thanks

colucix · 01-30-2007, 05:34 AM

Probably gawk is what you're looking for. If there is always one and only one <table> ... </table> entry, this can work:

Code:

gawk -F\<table\> '{print $2}' file.html | gawk -F\<\/table\> '{print $1}'

The -F option tells gawk to use the specified string as the field separator. See man gawk for details.
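A quick way to see what each stage of that pipeline does is to run it on a made-up one-line page. The sample string and variable names below are assumptions for illustration, and plain awk is used so the sketch also runs where gawk is not installed; gawk should behave the same way for this.

```shell
# Hypothetical one-line page standing in for file.html
sample='<HTML>header<table>Artist - Track</table>footer</HTML>'

# Stage 1: split on "<table>"; $2 is everything after the opening tag
after_open=$(printf '%s\n' "$sample" | awk -F'<table>' '{print $2}')

# Stage 2: split that remainder on "</table>"; $1 is the table content
content=$(printf '%s\n' "$after_open" | awk -F'</table>' '{print $1}')

echo "$after_open"
echo "$content"
```

This relies on the opening tag being exactly <table> with no attributes; a tag such as <table border="1"> would not match the separator.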

sonicthehedgehog · 01-30-2007, 05:42 AM

Thanks,

that's exactly the kind of thing I'm looking for. The only problem is that there are more tables, although the one I want is always the second table on the page, so it might be possible to get round this somehow?

colucix · 01-30-2007, 05:50 AM

Code:

gawk -F\<\/table\> '{print $2}' file.html | gawk -F\<table\> '{print $2}'

Just a little change: the first call to gawk grabs the stuff between the first </table> and the second one; the second call grabs the stuff after the remaining <table>. The condition "the one i want is always the second table" must be true!
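The same idea can be tried on a made-up page with two tables. The sample string and the nth_table helper are hypothetical, added only to show how the field number picks the table; plain awk is used here in place of gawk as an assumption about what is installed.

```shell
# Hypothetical one-line page with two tables
sample='<HTML>top<table>first</table>middle<table>Artist - Track</table>bottom</HTML>'

# Splitting on "</table>" makes field N end with the content of table N;
# splitting that field on "<table>" then leaves the content in $2.
nth_table() {
    # $1 = table number (1-based); the page is read from stdin
    awk -F'</table>' -v n="$1" '{print $n}' | awk -F'<table>' '{print $2}'
}

first=$(printf '%s\n' "$sample" | nth_table 1)
second=$(printf '%s\n' "$sample" | nth_table 2)

echo "$first"
echo "$second"
```

As a sketch it assumes the tables are not nested and that each opening tag is exactly <table>; if either assumption fails, the fields no longer line up with the tables.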

sonicthehedgehog · 01-30-2007, 05:55 AM

That's brilliant, thanks,

and a lot more elegant than my 'head | tail' combo.

I'll give it a go when I get home.

Love these forums.

colucix · 01-30-2007, 05:57 AM

You're welcome!

sonicthehedgehog · 01-30-2007, 01:14 PM

Just got home and tried it out,

works exactly as I wanted.