Old 06-22-2006, 12:25 AM   #1
wwnexc
Member
 
Registered: Sep 2005
Location: California
Distribution: Slackware & Debian
Posts: 264

Rep: Reputation: 30
Crawling Websites


Hi,

I am faced with a little challenge: I have to index a whole site and make it searchable by keyword. Every link has to be followed, and every part of the website should be covered. The website is PHP- and CGI-based, so simply downloading the main directories does not work.

Where would you start with a project like this? Are there some good web spiders which follow every single link on a website and then download the entire site (something similar to what Google does, maybe)?

Basically, what I am trying to do is play Google for a single website, but index ALL of it, not just the small percentage Google typically covers. The search interface can be anything, but I'd prefer a web-based one over all others.

Is there a better way to do this?

Any other suggestions?

THANK YOU so much!


PS: It is only one (medium-sized) site, so bandwidth and storage will not be a problem.

Last edited by wwnexc; 06-22-2006 at 02:14 AM.
 
Old 06-22-2006, 04:54 PM   #2
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 66
I would probably try wget to fetch all the pages, and lynx to produce raw text that can then be parsed any way you like.
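
Something along these lines, as a minimal sketch (GNU wget and lynx assumed; example.com and the paths are placeholders):

Code:
# Mirror the whole site, following links but staying below the start URL
wget --mirror --convert-links --html-extension --no-parent \
     --directory-prefix=./mirror http://example.com/

# Turn every fetched HTML page into plain text for parsing/indexing
find ./mirror -name '*.html' | while read -r f; do
    lynx -dump -nolist "$f" > "${f%.html}.txt"
done

The resulting text files can then be grepped or fed into whatever indexer you like.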

Yves
 
Old 06-22-2006, 05:20 PM   #3
wwnexc
Member
 
Registered: Sep 2005
Location: California
Distribution: Slackware & Debian
Posts: 264

Original Poster
Rep: Reputation: 30
How would lynx help with that?

This may sound like a dumb question, but I honestly have never used lynx before...
 
Old 06-23-2006, 07:06 AM   #4
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 66
lynx can read a web page (in this case, a local file fetched by wget) and output formatted raw text to standard output in "batch mode", instead of displaying the page interactively.
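
For example (file name and URL are just placeholders):

Code:
# Dump the rendered page as plain text, without the trailing link list
lynx -dump -nolist page.html > page.txt

# Works against a live URL as well
lynx -dump -nolist http://example.com/somepage.php > somepage.txt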

Yves.
 
Old 06-23-2006, 09:42 AM   #5
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
It sounds like you could use a Web indexing tool such as ht://Dig.
There are other similar tools.
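
If you go that route, the usual setup is roughly this (a sketch based on a stock ht://Dig install; directive names and paths may differ by version or distribution, and example.com is a placeholder):

Code:
# /etc/htdig/htdig.conf (excerpt)
start_url:      http://example.com/
limit_urls_to:  ${start_url}
database_dir:   /var/lib/htdig/db

# Build or refresh the index (rundig wraps htdig and htmerge)
rundig

The bundled htsearch CGI then gives you a web-based search form over the index, which sounds like what you are after.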

--- rod.
 
  

