LinuxQuestions.org
Old 05-18-2006, 07:23 PM   #1
disorderly
Member
 
Registered: Sep 2003
Location: NJ
Distribution: RHEL5
Posts: 154

Rep: Reputation: 30
script to fix broken links in website


hello! after converting thousands of pages from *.html to *.php, i have the leftover task of fixing tens of thousands of broken links. ugh. there must be a better way.

i'm thinking this could be done with something like this:
1) run a script that will recursively scan the new PHP files directory
2) check whether the current_file is *.php
3) if so, use sed with a regular expression like: href.*\.html to put the string in a variable, $url
5) determine whether the current_file is within the PHP directory somehow
6) determine if the link is pointing to a file in or out of the PHP files directory (because the files outside the PHP directory still all have their HTML extension, and external links should not be changed either)
6a) could check if the target_file exists, but that might be a bonus later
7) if necessary, then use sed to correct the new link within the current_file in $url
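
A minimal sketch of step 3, assuming GNU or BSD grep (the -o option is not in POSIX) and links written with double quotes. The function name extract_html_links is made up for illustration. Note that href.*\.html is greedy and can swallow several links on one line, so the pattern below stops at the closing quote instead.

```shell
# extract_html_links FILE (hypothetical helper):
# print every double-quoted href value ending in .html, one per line.
extract_html_links() {
    # [^"]* cannot cross the closing quote, so each link is matched on its own
    grep -oE 'href="[^"]*\.html"' "$1" |
        sed -e 's/^href="//' -e 's/"$//'   # strip the href=" prefix and trailing quote
}

# Possible use (someFile.php is a placeholder):
# extract_html_links someFile.php | while IFS= read -r url; do
#     printf 'found: %s\n' "$url"
# done
```

Reading the values in a while loop gives you each link in $url, which is where the checks from steps 5 and 6 would go.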

i just don't know how to insert conditional logic when using sed. i mean, once sed finds what you tell it to find, how can you tell it to change the text only IF some other conditions are TRUE?

thanks,
disorderly
 
Old 05-18-2006, 08:26 PM   #2
jschiwal
Guru
 
Registered: Aug 2001
Location: Fargo, ND
Distribution: SuSE AMD64
Posts: 15,733

Rep: Reputation: 655
Quote:
1) run a script that will recursively scan the new PHP files directory
2) check whether the current_file is *.php
3) if so, use sed with a regular expression like: href.*\.html to put the string in a variable, $url
You can use the find command to locate all .php files in your directory and its subdirectories, such as:
find <php directory> -name "*.php" -exec sed -f html2php.sed '{}' \;

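However, given the large number of files, you will probably want to use find -print0 | xargs -0, which runs sed on batches of files instead of once per file:
find <php directory> -name "*.php" -print0 | xargs -0 --max-args=1000 sed -f html2php.sed

The --max-args option to xargs caps how many file names are handed to each sed invocation. xargs already keeps each command line under the system's size limit, so this mainly controls the batch size.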
I just used 1000 as an example.
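
One thing worth spelling out: without the -i option, sed only prints the edited text and leaves the files alone, so the commands above behave as a dry run. A rough sketch of both modes follows, assuming GNU find, xargs and sed. The function names preview_links and apply_links are invented for this example; html2php.sed is the script file named above.

```shell
# preview_links DIR SCRIPT: print what sed would produce, without touching any file.
preview_links() {
    find "$1" -name '*.php' -print0 | xargs -0 sed -f "$2"
}

# apply_links DIR SCRIPT: edit the files in place, keeping each original as FILE.bak.
apply_links() {
    find "$1" -name '*.php' -print0 | xargs -0 sed -i.bak -f "$2"
}

# Possible use:
# preview_links /path/to/test-copy html2php.sed | less
# apply_links   /path/to/test-copy html2php.sed
```

The .bak copies are a second safety net, not a replacement for a real backup of the directory.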
Quote:
6) determine if the link is pointing to a file in or out of the PHP files directory (because the files outside the PHP directory still all have their HTML extension, and external links should not be changed either)
6a) could check if the target_file exists, but that might be a bonus later
Write your sed script with commands of the form '/pattern/s/old-pattern/new-pattern/'. The substitution runs only on lines that match /pattern/, so write the pattern to match links that point into your PHP files directory. Lines that don't match the pattern are left unchanged.
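
As an illustration of that form, here is a small sketch. It assumes the links into the PHP directory are absolute and start with /newPHPDirectory/ (a made-up layout for the example), and fix_links is a hypothetical helper name.

```shell
# fix_links: read HTML on stdin, write the result to stdout.
fix_links() {
    # The address /href="\/newPHPDirectory\// selects lines;
    # the s command after it runs only on those lines.
    sed '/href="\/newPHPDirectory\//s/\.html"/.php"/g'
}

# The first line is rewritten, the second is left alone.
printf '%s\n' \
    '<a href="/newPHPDirectory/a.html">in</a>' \
    '<a href="/directory1/b.html">out</a>' | fix_links
```

One caveat: the address picks whole lines, so a line holding both kinds of link would have all of its .html" endings changed. Relative links are not covered by this pattern at all.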

---

I would recommend that you create a temporary directory that contains the same subdirectory structure and a subset of the files. Use this to test and develop your sed program. You don't want to accidentally destroy all of the links in the actual files. When you believe you have a workable script, make a backup of your PHP directory so you can restore the files in case of an error. "To err is human, to back up is divine".
---
One gotcha is a URL that spans more than one line. Sed programs that handle this are more complicated, because you need to use looping to build a multiline pattern space before performing the substitution.
Using a lint program could avoid this problem. You could use sed or grep to determine whether this case exists.
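
A rough way to check for that case with grep, assuming GNU or BSD grep (-r and --include are extensions) and double-quoted href values. The name list_split_links is invented here. It reports lines where an <a tag or an href value is opened but not closed before the end of the line.

```shell
# list_split_links DIR: print file:line:text for every suspicious line under DIR.
# Exits non-zero when nothing is found, like grep itself.
list_split_links() {
    # First alternative: "<a" (optionally followed by attributes) with no ">" before end of line.
    # Second alternative: href= whose quoted value does not close on the same line.
    grep -rnE --include='*.php' '<a([[:space:]][^>]*)?$|href=("[^"]*)?$' "$1"
}

# Possible use:
# list_split_links /path/to/test-copy
```

If this prints nothing, a line-by-line sed script should be enough. If it prints only a few lines, fixing those by hand may be easier than writing the multiline version.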

Last edited by jschiwal; 05-18-2006 at 08:42 PM.
 
Old 05-19-2006, 10:06 AM   #3
disorderly
Member
 
Registered: Sep 2003
Location: NJ
Distribution: RHEL5
Posts: 154

Original Poster
Rep: Reputation: 30
hi jschiwal, thanks for your reply! no worries, i'm working in a temporary directory with all this stuff (i've accidentally blown away enough files in my time to have learned by now).
Quote:
You can use the find command to locate all .php files in your directory and its subdirectories, such as:
find <php directory> -name "*.php" -exec sed -f html2php.sed '{}' \;
- a much more elegant method than i was playing with, thanks

i think the critical pieces, #5 & #6, are the problem because of relative links. let's say i have a file structure like:
/ # html files
/directory1 # html files
/directory2 # html files
/newPHPDirectory # php files

if i find links in:
/newPHPDirectory/level1/level2/someFile.php
such as:
<a href="../../file.html">foo</a>
<a href="file.html">bar</a>
<a href="../../../directory/file.html">blee</a>

and i find this link in
/newPHPDirectory/level1/sublevel1/sublevel2/anotherFile.php
such as:
<a href="../../file.html">foo</a>

uh oh, now i have to determine what the links are targeting before i can replace them. they all have a *.html extension, but some targets are within the new /newPHPDirectory directory, and others are in the old directories. see what i mean?

seems to me like the script would need to determine
a) in what directory it's operating
b) whether the target file is in the old directories or in /newPHPDirectory
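
Both checks can be sketched in shell by resolving each link against the directory of the file that contains it. This assumes GNU coreutils realpath (the -m option accepts paths that do not exist). The function name classify_link, its argument order and its three output words are all invented for this example.

```shell
# classify_link SITE_ROOT PHP_ROOT CURRENT_FILE HREF
# Prints "php" if HREF resolves inside PHP_ROOT, "external" for URLs with a
# scheme, mailto: links and #fragments, and "other" for everything else.
# The body runs in a subshell, so its variables do not leak to the caller.
classify_link() (
    site_root=$(realpath -m -- "$1")
    php_root=$(realpath -m -- "$2")
    case $4 in
        *://*|mailto:*|'#'*) echo external; exit 0 ;;
        # Site-absolute link: resolve against the web root.
        /*) target=$(realpath -m -- "$site_root$4") ;;
        # Relative link: resolve against the directory of the current file.
        *)  target=$(realpath -m -- "$(dirname -- "$3")/$4") ;;
    esac
    case $target in
        "$php_root"/*) echo php ;;
        *)             echo other ;;
    esac
)

# Possible use, with /var/www as a placeholder web root:
# classify_link /var/www /var/www/newPHPDirectory \
#     /var/www/newPHPDirectory/level1/level2/someFile.php '../../file.html'
```

With the layout above, ../../file.html and file.html found in someFile.php both resolve inside /newPHPDirectory, while ../../../directory/file.html resolves outside it. Only links classified as php would get the .html to .php change. Query strings and #anchors appended to a link are not handled in this sketch.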
 
  

