Linux - Software: This forum is for Software issues. Having a problem installing a new program? Want to know which application is best for the job? Post your question in this forum.
I have a website (php,mysql) to which I made some changes. There are some "dead" pages which are not linked to, but could be accessed by typing in the url directly. I want to get a list of these dead pages (for instance a list of their urls). Is there a tool which can show me all existing pages or even better only the dead pages? I have server access, so the tool could run on the server if necessary.
I know of "wget", but as far as I understand it follows hyperlinks, so would never find the dead pages.
A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or (especially in the FOAF community) Web scutters.
This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
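The crawl loop described above can be sketched in a few lines of Python. This is a hypothetical, minimal example (a real crawler also needs robots.txt handling, politeness delays, and error recovery); note that it only ever discovers pages reachable by links from the seed page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return absolute URLs of all <a href> links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(seed, fetch, limit=100):
    """Breadth-first crawl: fetch(url) must return the page's HTML.

    Only pages linked from already-visited pages are ever discovered,
    which is exactly why unlinked 'dead' pages stay invisible to a crawler.
    """
    seen, queue = {seed}, [seed]
    while queue and len(seen) < limit:
        url = queue.pop(0)
        for link in extract_links(url, fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

The `fetch` callback is injected here so the sketch stays network-free; in practice you would pass something like a `urllib.request.urlopen` wrapper.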
Thanks for the explanation of what a crawler is. But do crawlers find pages that are not linked to from other (known) pages? If yes, how do they do that?
Do you have a specific recommendation for a tool/software product to use in my case?
...This process is called Web crawling or spidering...
But the OP explicitly stated that the pages in question are not referenced by any links in the site.
Perhaps you intended to suggest that he "crawl" the site to discover the set of referenced pages, then remove the intersection of that set and the set of all defined pages from the latter set. The elements of the remaining set would be the pages never referenced in the current site.
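As a tiny illustration of that set arithmetic (the URLs here are made up):

```python
# All pages the site defines, versus the pages a crawl can actually reach.
# The set difference is the "dead" (never-referenced) pages.
defined = {'user/register', 'user/login', 'admin/old_report'}  # all defined pages
referenced = {'user/register', 'user/login'}                   # found by crawling
dead = defined - referenced
print(dead)
```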
If the site is using mysql to serve pages, a few simple queries might be all that needs to be done to solve the problem.
Do I understand correctly that there is no general tool to do this task for me and it depends on the individual structure of the site?
I don't think mysql queries would help in my case. The pages/urls correspond to functions/methods in php files. mysql is only used to store some data which appears within the pages.
So for instance a "function register()" in a class "User" would correspond to the url:
"baseurl/user/register".
I could look through all the function names and check whether the corresponding urls exist, but I'd rather have a less manual, external method. Also, some urls are already disabled, e.g. by requiring admin login, etc.
For crawling, ripping and mirroring websites, httrack is an excellent tool. And it will happily parse links from html, css, js files etc. But it doesn't do guesswork - if the pages you seek aren't linked to from anywhere, then it simply can't fetch 'em. Sorry. :-/
OP, if your URLs are generated from PHP functions as you say... then you can avoid the hassle of a web crawler and use standard Unix tools (and maybe a little bit of scripting) to figure out the information you need. This assumes you have shell access to your web server; if not, download the PHP files to a local machine first.
For instance...
Code:
find . -type f -name '*.php' -print0 | xargs -0 ./myscript_for_urls.py | while read -r url; do
    # -sf: silent, and fail with a non-zero exit status on HTTP errors (404 etc.)
    curl -sf "http://anywebsiteyoulike/$url" > /dev/null 2>&1
    if [ "$?" -ne 0 ]; then
        echo "$url is dead"
    fi
done
And myscript_for_urls.py *could* contain something like this (it likely won't run correctly right away)
Code:
#!/usr/bin/env python3
import sys
import os
import re

# Read the PHP source file passed as the first argument
with open(sys.argv[1]) as f:
    contents = f.read()

# Match PHP function definitions that take no arguments
regex = re.compile(r'\s*function ([a-zA-Z_]+)\(\)')
results = regex.findall(contents)

# Derive the URL prefix from the file name, e.g. User.php -> "User"
partial_url = os.path.basename(sys.argv[1])
partial_url = partial_url.split('.')[0]

for name in results:
    print("%s/%s" % (partial_url, name))
The python script is a really rough prototype but ultimately you could use any language you're comfortable with. Just an idea.
Thanks. Seems to work, but apparently it only processes the first php file, i.e. the first that "find" finds.
Another problem is that it drops subdirectory information. Say I have a PHP file "First.php" in directory "DIR" and a PHP file "Second.php" in a subdirectory of DIR called "SUBDIR"; these would in my case correspond to the urls "dir/first/" and "dir/subdir/second" respectively. As far as I can see the script ignores this subdirectory structure.
You are correct in your observations. I said it was a rough prototype which you can improve upon for your needs. I didn't know the structure of your website, so I made some assumptions of my own and kept the design simple. The intent was to give you a starting point from which you could automate testing for dead links on your website.
After testing with multiple files here's a better find command.
Code:
#!/bin/bash
find . -type f -name '*.php' -exec ./myscript_for_urls.py {} \; | while read -r url; do
    # -sf: silent, and fail with a non-zero exit status on HTTP errors (404 etc.)
    curl -sf "http://anywebsiteyoulike/$url" &> /dev/null
    if [ "$?" -ne 0 ]; then
        echo "$url is dead"
    fi
done
Now the ./myscript_for_urls.py works on more than just the first file. The problem you're experiencing with the folder structure is with os.path.basename in my original script.
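One possible fix is to keep the file's path relative to the scan root instead of just its base name. This is only a sketch, and it assumes your URL paths simply mirror the directory layout in lowercase (as in your First.php/Second.php examples); it doesn't handle the disabled/admin-only urls you mentioned.

```python
import os
import re

def urls_for_file(path, root='.'):
    """Map a PHP file to candidate URLs, keeping subdirectories.

    Assumed mapping (a guess based on the examples in this thread):
    DIR/SUBDIR/Second.php containing function foo() -> dir/subdir/second/foo
    """
    # Path relative to the scan root, minus the .php extension, lowercased
    rel = os.path.relpath(path, root)
    prefix = os.path.splitext(rel)[0].lower()
    with open(path) as f:
        contents = f.read()
    # Same function-definition pattern as the original prototype
    names = re.findall(r'\s*function ([a-zA-Z_]+)\(\)', contents)
    return ["%s/%s" % (prefix, name) for name in names]

if __name__ == '__main__':
    import sys
    if len(sys.argv) > 1:
        for url in urls_for_file(sys.argv[1]):
            print(url)
```

Run from the directory you pass to find, so the relative paths line up with your URL structure.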
As I said, I didn't give you a solution for all situations. I gave you a rough prototype from which you could start to derive a solution that works for you. Happy hacking!