LinuxQuestions.org


disorderly 05-18-2006 06:23 PM

script to fix broken links in website
 
hello! after converting thousands of pages from *.html to *.php, i have the leftover task of fixing tens of thousands of broken links. ugh, there must be a better way.

i'm thinking this could be done by something like this:
1) run a script that will recursively scan the new PHP files directory
2) check whether the current_file is *.php
3) if so, use sed with a regular expression like: href.*\.html to put the string in a variable, $url
5) determine whether the current_file is within the PHP directory somehow
6) determine if the link is pointing to a file in or out of the PHP files directory (because the files outside the PHP directory still all have their HTML extension, and external links should not be changed either)
6a) could check if the target_file exists, but that might be a bonus later
7) if necessary, then use sed to correct the new link within the current_file in $url

i just don't know how to insert conditional logic when using sed. i mean once sed finds what you tell it to find, how can you tell it to only change the text IF some other conditions are TRUE??

thanks,
disorderly

jschiwal 05-18-2006 07:26 PM

Quote:

1) run a script that will recursively scan the new PHP files directory
2) check whether the current_file is *.php
3) if so, use sed with a regular expression like: href.*\.html to put the string in a variable, $url
You can use the find command to locate all .php files in your directory, and subdirectories.
Such as "find <php directory> -name "*.php" -exec sed -f html2php.sed '{}' \;"

However, given the large numbers of files, you probably will need to use find -print0 | xargs -0, such as in:
find <php directory> -name "*.php" -print0 | xargs -0 --max-args=1000 sed -f html2php.sed

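The --max-args option to xargs limits how many file names are passed to each sed invocation, which keeps the command line under the system's argument-length limit (the "Argument list too long" error).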
I just used 1000 as an example.
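
Note that sed writes its result to standard output unless it is told to edit in place. Here is a minimal sketch of the find/xargs pipeline with in-place editing, assuming GNU find, xargs and sed. The scratch directory, the sample files and the one-line html2php.sed are hypothetical stand-ins, not the real site.

```shell
#!/bin/sh
# Sketch only: everything happens in a throwaway directory from mktemp,
# so no real files are touched.
work=$(mktemp -d)
mkdir -p "$work/php/sub"
printf '<a href="page.html">x</a>\n' > "$work/php/a.php"
printf '<a href="other.html">y</a>\n' > "$work/php/sub/b.php"

# Hypothetical stand-in for html2php.sed: turn .html" into .php" in hrefs.
printf '%s\n' 's/\(href="[^"]*\)\.html"/\1.php"/g' > "$work/html2php.sed"

# -print0 / -0 keep file names containing spaces intact; -i.bak edits each
# file in place and leaves the original beside it as *.bak.
find "$work/php" -name '*.php' -print0 |
    xargs -0 --max-args=1000 sed -i.bak -f "$work/html2php.sed"

cat "$work/php/a.php" "$work/php/sub/b.php"
```

The *.bak copies make it easy to diff or roll back a run; remove the scratch directory with rm -rf "$work" when done.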
Quote:

6) determine if the link is pointing to a file in or out of the PHP files directory (because the files outside the PHP directory still all have their HTML extension, and external links should not be changed either)
6a) could check if the target_file exists, but that might be a bonus later
Write your sed script with commands of the form: '/pattern/s/old-pattern/new-pattern/' so that if the link points to your PHP files directory, it matches the pattern. If it doesn't match the pattern, then it won't perform the substitution.
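
As a rough illustration of that form, the sketch below only touches lines that contain a link into a hypothetical /phpdir/ directory. The directory name and the fix_links function are made up for the example. The address in front of the s command selects the lines, and the substitution pattern then limits which link on that line is rewritten.

```shell
#!/bin/sh
# Sketch: "/pattern/s/old/new/" with # as the delimiter so the slashes in
# the path need no escaping. /phpdir/ is a hypothetical directory name.
fix_links() {
    sed '\#href="/phpdir/#s#\(href="/phpdir/[^"]*\)\.html"#\1.php"#g'
}

# Only the first link points into /phpdir/, so only it is changed.
printf '%s\n' \
    '<a href="/phpdir/a.html">in</a>' \
    '<a href="/old/b.html">out</a>' \
    '<a href="http://example.com/c.html">ext</a>' | fix_links
```

An address can also be negated with !, for example '/pattern/!s/old/new/', to substitute only on lines that do not match.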

---

I would recommend that you create a temporary directory that contains the same subdirectory structure and a subset of the files. Use this to test and develop your sed program. You don't want to accidentally destroy all of the links in the actual files. When you believe you have a workable script, make a backup of your PHP directory so you can restore the files in case of an error. "To err is human, to back up is divine".
---
One gotcha is a URL that spans more than one line. A sed program that handles these is more complicated, because you need to loop and build up a multiline pattern space before performing the substitution.
Using a lint program could avoid this problem. You could use sed or grep to check whether this case occurs at all.
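
A quick check along those lines might look like the sketch below. It assumes a grep that supports -r and --include (GNU grep does), and it only catches the case where an href value opens with a double quote and the closing quote is not on the same line. The function name and the scratch file are hypothetical.

```shell
#!/bin/sh
# Sketch: list lines where href=" is opened but not closed on that line.
find_split_hrefs() {
    # $1: directory to scan
    grep -rnE --include='*.php' 'href="[^"]*$' "$1"
}

# Small demonstration in a throwaway directory.
demo=$(mktemp -d)
printf '<a href="one.html">ok</a>\n<a href="two\n.html">split</a>\n' > "$demo/t.php"
find_split_hrefs "$demo"
```

If this prints nothing for the real tree, the simpler single-line sed substitutions should be enough.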

disorderly 05-19-2006 09:06 AM

hi jschiwal, thanks for your reply! no worries, i'm working on a temporary directory with all this stuff (i've accidentally blown away enough files in my time to have learned by now ;))
Quote:

You can use the find command to locate all .php files in your directory, and subdirectories.
Such as "find <php directory> -name "*.php" -exec sed -f html2php.sed '{}' \;"
- a much more elegant method than i was playing with, thanks

i think the critical pieces, #5 & #6, are the problem because of relative links. let's say i have a file structure like:
/ # html files
/directory1 # html files
/directory2 # html files
/newPHPDirectory # php files

if i find links in:
/newPHPDirectory/level1/level2/someFile.php
such as:
<a href="../../file.html">foo</a>
<a href="file.html">bar</a>
<a href="../../../directory/file.html">blee</a>

and i find this link in
/newPHPDirectory/level1/sublevel1/sublevel2/anotherFile.php
such as:
<a href="../../file.html">foo</a>

uh oh, now i have to determine what the links are targeting before i can replace them. they all have *.html extension, but some are within the new /newPHPDirectory directory, and others are in the old directories - see what i mean?

seems to me like the script would need to determine
a) in what directory it's operating
b) whether the target file is in the old directories, or in newPHPDirectory
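
One possible way to answer (a) and (b) is to resolve each relative link against the directory of the file that contains it and then test whether the result falls under the PHP directory. The sketch below assumes GNU coreutils realpath, whose -m option normalizes paths that need not exist. The function name link_in_php_dir and its arguments are hypothetical. External links and links starting with / are simply treated as "not inside" here.

```shell
#!/bin/sh
# Usage: link_in_php_dir CURRENT_FILE HREF PHP_ROOT
# Exit status 0 when HREF, resolved relative to the directory holding
# CURRENT_FILE, lands inside PHP_ROOT; 1 otherwise.
link_in_php_dir() {
    file=$1 href=$2 root=$3
    case $href in
        *://*|/*) return 1 ;;   # external or site-absolute: not handled here
    esac
    # (a) the directory we are operating in is dirname of the current file;
    # (b) realpath -m collapses the ../ parts without requiring the target
    #     to exist.
    target=$(realpath -m -- "$(dirname -- "$file")/$href")
    root=$(realpath -m -- "$root")
    case $target in
        "$root"/*) return 0 ;;
        *)         return 1 ;;
    esac
}

# The three links from someFile.php above.
for href in ../../file.html file.html ../../../directory/file.html; do
    if link_in_php_dir /newPHPDirectory/level1/level2/someFile.php "$href" /newPHPDirectory
    then echo "$href -> inside /newPHPDirectory"
    else echo "$href -> outside"
    fi
done
```

With the paths from the example, the first two links resolve to locations under /newPHPDirectory and the third resolves to /directory/file.html, so only the first two would be candidates for the .html to .php rewrite. A driver script could call this for each href it extracts and run the sed substitution only when the function succeeds.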

