Old 06-19-2012, 12:15 PM   #1
lilleskut
LQ Newbie
 
Registered: Feb 2012
Posts: 19

Rep: Reputation: Disabled
Crawl local website


I have a website (PHP, MySQL) to which I made some changes. There are some "dead" pages which are not linked to, but could still be accessed by typing in the URL directly. I want to get a list of these dead pages (for instance a list of their URLs). Is there a tool which can show me all existing pages, or even better only the dead pages? I have server access, so the tool could run on the server if necessary.

I know of "wget", but as far as I understand it follows hyperlinks, so it would never find the dead pages.

Last edited by lilleskut; 06-19-2012 at 12:35 PM.
 
Old 06-19-2012, 01:39 PM   #2
Milkwitzjs
LQ Newbie
 
Registered: Jun 2012
Posts: 9

Rep: Reputation: 1
Crawl local website

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner or in an orderly fashion. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).
 
Old 06-19-2012, 03:19 PM   #3
lilleskut
LQ Newbie
 
Registered: Feb 2012
Posts: 19

Original Poster
Rep: Reputation: Disabled
Thanks for the explanation of what a crawler is. But do crawlers find pages that are not linked to from other (known) pages? If yes, how do they do that?

Do you have a specific recommendation for a tool/software product to use in my case?

 
Old 06-19-2012, 03:32 PM   #4
PTrenholme
Senior Member
 
Registered: Dec 2004
Location: Olympia, WA, USA
Distribution: Fedora, (K)Ubuntu
Posts: 4,187

Rep: Reputation: 354
Quote:
Originally Posted by Milkwitzjs
...This process is called Web crawling or spidering...
But the OP explicitly stated that the pages in question are not referenced by any links in the site.

Perhaps you intended to suggest that he "crawl" the site to discover the set of referenced pages, and then remove the intersection of that set and the set of all defined pages from the latter set. The elements of the remaining set would be the pages never referenced in the current site.

If the site is using mysql to serve pages, a few simple queries might be all that needs to be done to solve the problem.
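Otherwise, in shell terms the crawl-and-compare approach might look something like this (untested sketch; http://localhost/ stands in for the site's base URL, and all_pages.txt is assumed to be a dump of all defined page URLs, one per line):

Code:
# 1. crawl the site and record every URL that is actually linked to
#    (--spider only checks the pages, it doesn't save them)
wget --spider -r -l inf -o crawl.log http://localhost/
grep '^--' crawl.log | awk '{print $3}' | sort -u > linked_pages.txt

# 2. anything defined but never linked to is a candidate "dead" page
comm -23 <(sort -u all_pages.txt) linked_pages.txt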
 
Old 06-19-2012, 05:30 PM   #5
lilleskut
LQ Newbie
 
Registered: Feb 2012
Posts: 19

Original Poster
Rep: Reputation: Disabled
Do I understand correctly that there is no general tool to do this task for me and it depends on the individual structure of the site?

I don't think MySQL queries would help in my case. The pages/URLs correspond to functions/methods in PHP files; MySQL is only used to store some data that appears within the pages.

So for instance a "function register()" in a class "User" would correspond to the url:
"baseurl/user/register".

I could look through all the function names and see if the corresponding URL exists, but I'd rather have a less manual, external method of checking. Also, some URLs are already disabled, e.g. by requiring admin login.
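For what it's worth, listing the function names themselves is easy enough with something like this (GNU grep assumed); it's the "does the corresponding URL exist" part that I'd rather not do by hand:

Code:
# list every "function name(" occurrence in the php sources
grep -RnoE 'function [A-Za-z_]+\(' --include='*.php' .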
 
Old 06-19-2012, 08:41 PM   #6
dru8274
Member
 
Registered: Oct 2011
Location: New Zealand
Distribution: Debian
Posts: 105

Rep: Reputation: 37
For crawling, ripping and mirroring websites, httrack is an excellent tool. And it will happily parse links from html, css, js files etc. But it doesn't do guesswork - if the pages you seek aren't linked to from anywhere, then it simply can't fetch 'em. Sorry. :-/
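A basic run looks something like this (just as an illustration; http://localhost/ and ./mirror are placeholders to adjust):

Code:
# mirror everything reachable from the start page into ./mirror
httrack "http://localhost/" -O ./mirror
# list the pages httrack actually reached
find ./mirror -type f -name '*.html'
Anything in your own list of pages that never shows up in that mirror would then be a candidate dead page.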
 
Old 06-19-2012, 11:11 PM   #7
sag47
Senior Member
 
Registered: Sep 2009
Location: Raleigh, NC
Distribution: Ubuntu, PopOS, Raspbian
Posts: 1,899
Blog Entries: 36

Rep: Reputation: 477
OP, if your URLs are generated from PHP functions as you say... then you can avoid the hassle of a web crawler and use standard Unix tools (and maybe a little bit of scripting) to figure out the information you need. This assumes you have shell access to your web server; if not, download the PHP files and work on them locally.

For instance...

Code:
# feed the php files to the helper script, then test each URL it prints
find . -type f -name '*.php' -print0 | xargs -0 ./myscript_for_urls.py | while read url; do
  # -f makes curl exit non-zero on HTTP errors (404 etc.), not just connection failures
  curl -f "http://anywebsiteyoulike/$url" > /dev/null 2>&1
  if [ "$?" -ne 0 ]; then
    echo "$url is dead"
  fi
done
And the myscript_for_urls.py *could* contain something like this (likely doesn't run right away)
Code:
#!/usr/bin/env python
# Python 2: print a "classfile/functionname" partial URL for every function
# found in the PHP file passed as the first (and only) argument.
import sys, os, re

f = open(sys.argv[1], 'r')
contents = f.read()
f.close()

# match e.g. "function register()" and capture the function name
regex = re.compile(r'\s*function ([a-zA-Z_]+)\(\)')
results = re.findall(regex, contents)

for name in results:
  partial_url = os.path.basename(sys.argv[1])   # DIR/User.php -> User.php
  partial_url = partial_url.split('.')[0]       # User.php -> User
  print "%s/%s" % (partial_url, name)           # e.g. User/register
The python script is a really rough prototype but ultimately you could use any language you're comfortable with. Just an idea.

Last edited by sag47; 06-19-2012 at 11:53 PM.
 
Old 06-20-2012, 02:26 AM   #8
lilleskut
LQ Newbie
 
Registered: Feb 2012
Posts: 19

Original Poster
Rep: Reputation: Disabled
Thanks. Seems to work, but apparently it only processes the first php file, i.e. the first that "find" finds.

Another problem is that it drops subdirectory information. Say I have a php file "First.php" in directory "DIR" and a php file "Second.php" in a subdirectory of DIR called "SUBDIR"; in my case these would correspond to the URLs "dir/first/" and "dir/subdir/second" respectively. As far as I can see the script ignores this subdirectory structure.
 
Old 06-20-2012, 10:01 AM   #9
sag47
Senior Member
 
Registered: Sep 2009
Location: Raleigh, NC
Distribution: Ubuntu, PopOS, Raspbian
Posts: 1,899
Blog Entries: 36

Rep: Reputation: 477
Quote:
Originally Posted by lilleskut
Thanks. Seems to work, but apparently it only processes the first php file, i.e. the first that "find" finds.

Another problem is that it drops subdirectory information. Say I have a php file "First.php" in directory "DIR" and a php file "Second.php" in a subdirectory of DIR called "SUBDIR"; in my case these would correspond to the URLs "dir/first/" and "dir/subdir/second" respectively. As far as I can see the script ignores this subdirectory structure.
You are correct in your observations. I said it was a rough prototype which you can improve upon for your needs. I didn't know the structure of your website, so I made some assumptions of my own and kept the design simple. The intent was to give you a starting point from which you could automate testing for dead links on your website.

After testing with multiple files, here's a better find command.

Code:
#!/bin/bash
# -exec runs the helper once per file, so every .php file gets processed
find . -type f -name '*.php' -exec ./myscript_for_urls.py {} \; | while read url; do
  # -f makes curl exit non-zero on HTTP errors (404 etc.)
  curl -f "http://anywebsiteyoulike/$url" &> /dev/null
  if [ "$?" -ne 0 ]; then
    echo "$url is dead"
  fi
done
Now the ./myscript_for_urls.py works on more than just the first file. The problem you're experiencing with the folder structure is with os.path.basename in my original script.
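If you want to keep the directory structure, one possible approach (untested sketch, assuming the URL path is just the lowercased file path without the .php extension) is to drop the Python helper and build the URLs entirely in the shell:

Code:
# build "dir/subdir/classfile/function" style URLs straight from the file paths
find . -type f -name '*.php' | while read -r file; do
  path="${file#./}"                                  # ./DIR/SUBDIR/Second.php -> DIR/SUBDIR/Second.php
  path="${path%.php}"                                # drop the .php extension
  path=$(echo "$path" | tr '[:upper:]' '[:lower:]')  # DIR/SUBDIR/Second -> dir/subdir/second
  grep -oE 'function [a-zA-Z_]+' "$file" | awk -v p="$path" '{print p "/" $2}'
done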

As I said, I didn't give you a solution for all situations. I gave you a rough prototype from which you can start to derive a solution that works for you. Happy hacking!

SAM

Last edited by sag47; 06-20-2012 at 12:48 PM.
 
1 member found this post helpful.
Old 06-24-2012, 06:46 PM   #10
lilleskut
LQ Newbie
 
Registered: Feb 2012
Posts: 19

Original Poster
Rep: Reputation: Disabled
thanks a lot!
 
  

