Old 06-22-2006, 12:25 AM   #1
wwnexc
Member
 
Registered: Sep 2005
Location: California
Distribution: Slackware & Debian
Posts: 264

Rep: Reputation: 30
Crawling Websites


Hi,

I am faced with a little challenge: I have to index a whole site and make it searchable by keyword. Every link has to be followed, and every part of the website should be covered. The website is PHP- and CGI-based, so simply downloading the main directories does not work.

Where would you start with a project like this? Are there some good web spiders which follow every single link on a website and then download the entire site (something similar to what Google does, maybe)?

Basically, what I am trying to do is play Google for a single website, but index ALL of it, not just the small percentage Google typically covers. The search interface can be anything, but I'd prefer a web-based one over all others.

Is there a better way to do this?

Any other suggestions?

THANK YOU so much!


PS: It is only one (medium-sized) site, so bandwidth and storage will not be a problem.

Last edited by wwnexc; 06-22-2006 at 02:14 AM.
 
Old 06-22-2006, 04:54 PM   #2
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 66
I would probably try wget to fetch all the pages, and lynx to produce raw text that can then be parsed any way you like.
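
Something along these lines, as a minimal sketch (GNU wget and lynx assumed; example.com and the paths are placeholders):

Code:
# Mirror the whole site, following links but staying below the start URL
wget --mirror --convert-links --html-extension --no-parent \
     --directory-prefix=./mirror http://example.com/

# Turn every fetched HTML page into plain text for parsing/indexing
find ./mirror -name '*.html' | while read -r f; do
    lynx -dump -nolist "$f" > "${f%.html}.txt"
done

The resulting text files can then be grepped or fed into whatever indexer you like.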

Yves
 
Old 06-22-2006, 05:20 PM   #3
wwnexc
Member
 
Registered: Sep 2005
Location: California
Distribution: Slackware & Debian
Posts: 264

Original Poster
Rep: Reputation: 30
How would lynx help with that?

This may sound like a dumb question, but I honestly have never used lynx before...
 
Old 06-23-2006, 07:06 AM   #4
theYinYeti
Senior Member
 
Registered: Jul 2004
Location: France
Distribution: Arch Linux
Posts: 1,897

Rep: Reputation: 66
lynx can read a web page (in this case, a local file fetched by wget) and output formatted raw text to standard output in "batch mode", instead of displaying the page interactively.
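
For example (file name and URL are just placeholders):

Code:
# Dump the rendered page as plain text, without the trailing link list
lynx -dump -nolist page.html > page.txt

# Works against a live URL as well
lynx -dump -nolist http://example.com/somepage.php > somepage.txt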

Yves.
 
Old 06-23-2006, 09:42 AM   #5
theNbomr
LQ 5k Club
 
Registered: Aug 2005
Distribution: OpenSuse, Fedora, Redhat, Debian
Posts: 5,399
Blog Entries: 2

Rep: Reputation: 908
It sounds like you could use a Web indexing tool such as ht://Dig.
There are other similar tools.
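
If you go that route, the usual setup is roughly this (a sketch based on a stock ht://Dig install; directive names and paths may differ by version or distribution, and example.com is a placeholder):

Code:
# /etc/htdig/htdig.conf (excerpt)
start_url:      http://example.com/
limit_urls_to:  ${start_url}
database_dir:   /var/lib/htdig/db

# Build or refresh the index (rundig wraps htdig and htmerge)
rundig

The bundled htsearch CGI then gives you a web-based search form over the index, which sounds like what you are after.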

--- rod.
 
  

