You can download web pages with wget, and then parse the plaintext files with tools like grep, sed, and awk. We could probably give a bit more specific direction if you explained exactly what you wanted to do.
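For instance, a quick sketch of that approach (example.com and page.html are just placeholders here, and the grep pattern is a naive one that assumes double-quoted href attributes, so it's a starting point rather than a real HTML parser):

  wget -O page.html http://example.com/
  grep -o 'href="[^"]*"' page.html | sed 's/^href="//; s/"$//'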
--wait=20 pauses for 20 seconds between downloads; --limit-rate, well, limits the download rate; -c lets you resume the download if it's interrupted; -p gets everything needed to display the page (images, etc.); -r makes the download recursive.
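Put together, a command using those options might look something like this (the URL and the 20k rate cap are just example values):

  wget --wait=20 --limit-rate=20k -c -p -r http://example.com/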
The requirement is:
I already have a script written in PHP that can download a web page and parse through it, but it runs sluggishly.
I want a fast script, maybe a shell script, for extracting the links from a web page.
What languages do you know? I hear the HTML::Parse module of Perl is quite good and easy to use, but I switched from Perl to Python (and stopped doing HTML parsing regularly) a couple of years ago. If you're trying to extract some sort of data from HTML pages, it's possible that the author of those pages already provides the data in some more easily accessible form (e.g. XML) or offers an API to access it. You might check and see if that is the case.
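As a rough sketch of the Perl route, run from the shell: HTML::LinkExtor (from the HTML-Parser distribution, which this assumes is installed) can pull out links without regex hacks; page.html is a placeholder filename:

  perl -MHTML::LinkExtor -e '
      HTML::LinkExtor->new(sub {
          my ($tag, %attrs) = @_;
          print "$attrs{href}\n" if $tag eq "a" && $attrs{href};
      })->parse_file(shift);
  ' page.html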
I know Perl, PHP, basic UNIX, etc.
No, the author of the webpages doesn't provide any API. I need to work with the raw HTML only.
Using Perl would be much the same as using PHP (which I am already doing).