LinuxQuestions.org
Old 05-03-2006, 06:05 PM   #1
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445
Blog Entries: 9

Rep: Reputation: 48
web crawler/viewer


Is there anything that would let you see the directory structure of a website? Obviously FTP would, but that only works over FTP's own port, not over HTTP. What I'm after is the directory structure of the site as served on the web: something that would go out, crawl every IP address, and then allow for later viewing. It wouldn't download the site at all, just let you browse a site's layout should you choose to. Suggestions? How difficult would this be to code, if nothing like it already exists?
 
Old 05-06-2006, 02:36 PM   #2
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
No one has any ideas?
 
Old 05-06-2006, 02:42 PM   #3
gilead
Senior Member
 
Registered: Dec 2005
Location: Brisbane, Australia
Distribution: Slackware64 14.0
Posts: 4,141

Rep: Reputation: 168
I've never done that before, but there's a no-frills tutorial on writing a web crawler in Java at http://java.sun.com/developer/techni...ty/WebCrawler/ which might provide some useful pointers.
 
Old 05-06-2006, 03:58 PM   #4
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
I don't know Java at all. I was going to write it in C++, if it needs writing. I was thinking of something along the lines of a "directory tree", where it checks the directories directly and doesn't download anything as it runs unless someone actually wants to look at something. Other thoughts?
 
Old 05-06-2006, 04:05 PM   #5
gilead
Senior Member
 
Registered: Dec 2005
Location: Brisbane, Australia
Distribution: Slackware64 14.0
Posts: 4,141

Rep: Reputation: 168
Sorry about that; I know nothing about C++.
 
Old 05-06-2006, 04:07 PM   #6
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
You don't know of any existing projects like this, by any chance? Any ideas as to where to start?
 
Old 05-06-2006, 04:21 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
Google will turn up plenty.
All indexed by their own ... well, crawlers ...
 
Old 05-06-2006, 04:34 PM   #8
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
Right, I understand that. But I was thinking of a program that does this, and I kind of want to learn how to write one myself.
 
Old 05-06-2006, 09:06 PM   #9
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 78
Why use C++? I would use Python myself, but Ruby and Perl are two reasonable alternatives. It seems to me that the biggest bottleneck for this sort of app will be network speed, so I don't think C++ will give any sort of speed increase. Plus, the major scripting languages already have built-in modules for fetching and parsing URLs, which makes your task much easier to implement.

As for the app itself, there is no portable, sure-fire way to 'ls' a directory on a web server, as most Apache admins disable directory listings these days. What this means is that you basically have to write a web spider that parses the links in the pages themselves (so you _will_ have to download them) to find all the available pages.
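
To give you an idea of what that boils down to, here is a rough, untested sketch in Python using only the standard library. The start URL, the page limit and the text/html filter are assumptions you would tune yourself; all it keeps is the list of URLs it has seen, not the page contents.

Code:
# Minimal same-site spider: each page is downloaded only long enough to
# pull the links out of it, then the URL is recorded and the body discarded.
import urllib.request
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    host = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                # Only parse HTML; skip images, archives, etc.
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the same host so the crawl doesn't wander off-site.
            if urlparse(absolute).netloc == host:
                queue.append(absolute)
    return sorted(seen)

if __name__ == "__main__":
    for page in crawl("http://www.example.com/"):
        print(page)

From that list of URLs you can build the "directory tree" view you were describing without keeping any of the page bodies around.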
 
Old 05-07-2006, 01:11 AM   #10
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
I know a small amount of Perl, but that's about the extent of it. Does that mean I'd have to download everything it comes across? That's going to take a lot of space. Perhaps this isn't such a good idea... Any idea how much disk space it would take?
 
Old 05-07-2006, 02:31 AM   #11
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 78
Quote:
Does that mean I'd have to download everything it comes across?
It depends on how much work you want to put in. As I said, there is no way to just list all the files on a web server (if the site admin doesn't want you to), so you will need to download at least the index page and follow the links from there. I just can't think of any way to find a website's available pages without downloading them.
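
If you want to check whether a given directory does have listings enabled before falling back to spidering, a quick probe like this works. It's another rough, untested Python sketch: the URL is just a placeholder, and the "Index of /" check only catches Apache-style auto-indexes.

Code:
import urllib.request

def has_directory_listing(url):
    # Fetch the first few KB and look for the "Index of /" title that
    # Apache's mod_autoindex puts on auto-generated listings.
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read(4096).decode("utf-8", errors="replace")
    return "Index of /" in body

print(has_directory_listing("http://www.example.com/files/"))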

Google has an API to access their search results programmatically, which can return all pages from a certain domain. Perhaps that is something to look into.

Quote:
That's going to take a lot of space. Perhaps this isn't such a good idea...Any idea as to how much disk space it'd take?
This would of course depend on the site in question... Google's homepage is about 3 KB, while Groklaw's is about 64 KB (not including images). Many are much larger.
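
If you just want a rough size estimate before committing to a crawl, one option is to send HEAD requests and read the Content-Length header instead of downloading the bodies. A small sketch, assuming the servers actually report that header (many don't, and the URLs below are only placeholders):

Code:
import urllib.request

def page_size(url):
    # HEAD returns only the response headers, not the page body.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None

for url in ("http://www.example.com/", "http://www.example.org/"):
    print(url, page_size(url), "bytes")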

As for other ideas, perhaps there is an open-source web spider whose source you could have a look at. If you understand C, you could look at wget's source code; it has an option to download and archive a website to an arbitrary depth.
 
  


