LinuxQuestions.org
Old 05-03-2006, 06:05 PM   #1
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445
Blog Entries: 9

Rep: Reputation: 48
web crawler/viewer


Is there anything that would let you see the directory structure of a website? Obviously FTP would, but that only works over FTP's own port, not over HTTP. What I'm after is the directory structure of the site as served on the web: something that would go out, crawl every IP address, and then allow for later viewing. It wouldn't download the site at all, just let you browse a site's layout should you choose to. Suggestions? How difficult would this be to code, if nothing like it already exists?
 
Old 05-06-2006, 02:36 PM   #2
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
No one has any ideas?
 
Old 05-06-2006, 02:42 PM   #3
gilead
Senior Member
 
Registered: Dec 2005
Location: Brisbane, Australia
Distribution: Slackware64 14.0
Posts: 4,141

Rep: Reputation: 168
I've never done that before, but there's a no-frills tutorial on writing a web crawler in Java at http://java.sun.com/developer/techni...ty/WebCrawler/ which might provide some useful pointers.
 
Old 05-06-2006, 03:58 PM   #4
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
I don't know Java at all. I was going to write it in C++, if it needs writing. I was thinking of something along the lines of a "directory tree", where it checks the directories directly and doesn't download anything as it runs unless someone actually wants to look at something. Other thoughts?
 
Old 05-06-2006, 04:05 PM   #5
gilead
Senior Member
 
Registered: Dec 2005
Location: Brisbane, Australia
Distribution: Slackware64 14.0
Posts: 4,141

Rep: Reputation: 168
Sorry about that; I know nothing about C++.
 
Old 05-06-2006, 04:07 PM   #6
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
You don't know of any existing projects like this, by any chance? Any ideas as to where to start?
 
Old 05-06-2006, 04:21 PM   #7
syg00
LQ Veteran
 
Registered: Aug 2003
Location: Australia
Distribution: Lots ...
Posts: 21,119

Rep: Reputation: 4120
Google will turn up plenty.
All indexed by their own ... well, crawlers ...
 
Old 05-06-2006, 04:34 PM   #8
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
Right, I understand that. But I was thinking of a program that does this, and I kind of want to learn how to write one myself.
 
Old 05-06-2006, 09:06 PM   #9
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 78
Why use C++? I would use Python myself, but Ruby and Perl are two reasonable alternatives. It seems to me that the biggest bottleneck for this sort of app will be network speed, so I don't think C++ will give any sort of speed increase. Plus, the major scripting languages already have built-in modules for fetching and parsing URLs, which makes your task much easier to implement.

As for the app itself, there is no portable, sure-fire way to 'ls' a directory on a web server, as most Apache admins disable directory listings these days. What this means is that you basically have to write a web spider that parses the links in the pages themselves (so you _will_ have to download them) to find all the available pages.
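
To give you an idea of what that boils down to, here is a rough, untested sketch in Python using only the standard library. The start URL, the page limit and the text/html filter are assumptions you would tune yourself; all it keeps is the list of URLs it has seen, not the page contents.

Code:
# Minimal same-site spider: each page is downloaded only long enough to
# pull the links out of it, then the URL is recorded and the body discarded.
import urllib.request
from urllib.parse import urljoin, urlparse
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=50):
    host = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                # Only parse HTML; skip images, archives, etc.
                if "text/html" not in resp.headers.get("Content-Type", ""):
                    continue
                page = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue
        parser = LinkExtractor()
        parser.feed(page)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Stay on the same host so the crawl doesn't wander off-site.
            if urlparse(absolute).netloc == host:
                queue.append(absolute)
    return sorted(seen)

if __name__ == "__main__":
    for page in crawl("http://www.example.com/"):
        print(page)

From that list of URLs you can build the "directory tree" view you were describing without keeping any of the page bodies around.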
 
Old 05-07-2006, 01:11 AM   #10
microsoft/linux
Senior Member
 
Registered: May 2004
Location: Sebec, ME, USA
Distribution: Debian Etch, Windows XP Home, FreeBSD
Posts: 1,445

Original Poster
Blog Entries: 9

Rep: Reputation: 48
I know a small amount of Perl, but that's about the extent of it. Does that mean I'd have to download everything it comes across? That's going to take a lot of space. Perhaps this isn't such a good idea... Any idea how much disk space it would take?
 
Old 05-07-2006, 02:31 AM   #11
bulliver
Senior Member
 
Registered: Nov 2002
Location: Edmonton AB, Canada
Distribution: Gentoo x86_64; Gentoo PPC; FreeBSD; OS X 10.9.4
Posts: 3,760
Blog Entries: 4

Rep: Reputation: 78
Quote:
Does that mean I'd have to download everything it comes across?
It depends on how much work you want to put in. As I said, there is no way to just list all the files on a web server (if the site admin doesn't want you to), so you will need to download at least the index page and follow the links from there. I just can't think of any way to find a website's available pages without downloading them.
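
If you want to check whether a given directory does have listings enabled before falling back to spidering, a quick probe like this works. It's another rough, untested Python sketch: the URL is just a placeholder, and the "Index of /" check only catches Apache-style auto-indexes.

Code:
import urllib.request

def has_directory_listing(url):
    # Fetch the first few KB and look for the "Index of /" title that
    # Apache's mod_autoindex puts on auto-generated listings.
    with urllib.request.urlopen(url, timeout=10) as resp:
        body = resp.read(4096).decode("utf-8", errors="replace")
    return "Index of /" in body

print(has_directory_listing("http://www.example.com/files/"))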

Google has an API to access their search results programmatically, which can return all pages from a certain domain. Perhaps that is something to look into.

Quote:
That's going to take a lot of space. Perhaps this isn't such a good idea...Any idea as to how much disk space it'd take?
This would of course depend on the site in question... Google's homepage is about 3 KB, while Groklaw's is about 64 KB (not including images). Many are much larger.
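
If you just want a rough size estimate before committing to a crawl, one option is to send HEAD requests and read the Content-Length header instead of downloading the bodies. A small sketch, assuming the servers actually report that header (many don't, and the URLs below are only placeholders):

Code:
import urllib.request

def page_size(url):
    # HEAD returns only the response headers, not the page body.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        length = resp.headers.get("Content-Length")
    return int(length) if length is not None else None

for url in ("http://www.example.com/", "http://www.example.org/"):
    print(url, page_size(url), "bytes")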

As for other ideas, perhaps there is an open-source web spider whose source you could have a look at. If you understand C, you could look at wget's source code; it has an option to download and archive a website to an arbitrary depth.
 
  


