script to fix broken links in website
hello! after converting thousands of pages from *.html to *.php, i have the leftover task of fixing 10,000's of broken links..ug.. there must be a better way.
i'm thinkin this could be done by something like this:
1) run a script that will recursively scan the new PHP files directory
2) check whether the current_file is *.php
3) if so, use sed with a regular expression like: href.*\.html to put the string in a variable, $url
5) determine whether the current_file is within the PHP directory somehow
6) determine if the link is pointing to a file in or out of the PHP files directory (because the files outside the PHP directory still all have their HTML extension, and external links should not be changed either)
6a) could check if the target_file exists, but that might be a bonus later
7) if necessary, then use sed to correct the new link within the current_file in $url

i just don't know how to insert conditional logic when using sed. i mean once sed finds what you tell it to find, how can you tell it to only change the text IF some other conditions are TRUE??

thanks,
disorderly
Quote:
Such as:

    find <php directory> -name "*.php" -exec sed -f html2php.sed '{}' \;

However, given the large number of files, you will probably need to use find -print0 | xargs -0, as in:

    find <php directory> -name "*.php" -print0 | xargs -0 --max-args=1000 sed -f html2php.sed

The --max-args option to xargs caps how many file names are passed to each sed invocation, so no single command line grows too large. I just used 1000 as an example.
Quote:
---
I would recommend that you create a temporary directory that contains the same subdirectory structure and a subset of the files. Use this to test and develop your sed program. You don't want to accidentally destroy all of your links in the actual files. When you believe you have a workable script, make a backup of your PHP directory so you can restore the files in case of an error. "To err is human, to backup is divine".
---
One gotcha is a URL that spans more than one line. Sed programs that handle this case are more complicated, because you need to loop and build up a multiline pattern space before performing the substitution. Using a lint program could avoid this problem. You could use sed or grep to determine whether this case exists.
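A minimal sketch of the grep check suggested above, assuming GNU grep and double-quoted href values. The function name find_split_hrefs is made up for illustration. It lists the *.php files that contain an href whose opening quote is not closed on the same line, which is the case a line-by-line sed substitution would miss.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name is an assumption, not from the thread).
# Prints the names of *.php files under directory $1 that contain an
# href="... value still open at the end of a line.
# Assumes GNU grep (-r, --include) and double-quoted attribute values;
# single-quoted or unquoted hrefs are not detected.
find_split_hrefs() {
    grep -rlE --include='*.php' 'href="[^"]*$' "$1"
}
```

If this prints nothing for the test directory, plain one-line substitutions are probably enough. Any files it does list may be easier to fix by hand or with a tidy/lint pass first.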
hi jschiwal, thanks for your reply! no worries, i'm working on a temporary directory with all this stuff (i've accidentally blown away enough files in my time to have learned by now ;))
Quote:
i think the critical pieces, #5 & #6, are the problem because of relative links. let's say i have a file structure like:

/                   # html files
/directory1         # html files
/directory2         # html files
/newPHPDirectory    # php files

if i find links in /newPHPDirectory/level1/level2/someFile.php such as:

<a href="../../file.html">foo</a>
<a href="file.html">bar</a>
<a href="../../../directory/file.html">blee</a>

and i find this link in /newPHPDirectory/level1/sublevel1/sublevel2/anotherFile.php:

<a href="../../file.html">foo</a>

uh oh, now i have to determine what the links are targeting before i can replace them. they all have the *.html extension, but some targets are within the new /newPHPDirectory directory and others are in the old directories - see what i mean? seems to me like the script would need to determine
a) in what directory it's operating
b) whether the target file is in the old directories or in /newPHPDirectory
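One way to get at (a) and (b) is to make the decision in the shell and use sed only for the final substitution. The sketch below covers the decision part only. It assumes GNU coreutils (realpath -m resolves a path whose components need not exist), absolute paths with no trailing slash on the PHP root, and the made-up function name rewrite_href. Site-root links starting with "/" and links with a #fragment or ?query after .html are simply left untouched here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and argument order are assumptions).
#   $1  absolute path of the converted PHP tree, no trailing slash
#   $2  absolute path of the .php file that contains the link
#   $3  the href value exactly as written in that file
# Prints the href to use: .html becomes .php only when the link
# resolves to a location inside the PHP tree.
rewrite_href() {
    local php_root=$1 src_file=$2 href=$3 target

    # External, mail, site-root and fragment-only links: leave alone.
    case $href in
        *://*|mailto:*|/*|'#'*) printf '%s\n' "$href"; return ;;
    esac

    # Only links that end in .html are candidates.
    case $href in
        *.html) ;;
        *) printf '%s\n' "$href"; return ;;
    esac

    # Resolve the relative link against the directory of the source file.
    target=$(realpath -m -- "$(dirname -- "$src_file")/$href")

    # Inside the PHP tree: swap the extension. Outside: keep as is.
    case $target in
        "$php_root"/*) printf '%s\n' "${href%.html}.php" ;;
        *)             printf '%s\n' "$href" ;;
    esac
}
```

With the layout from the post, calling rewrite_href /newPHPDirectory /newPHPDirectory/level1/level2/someFile.php "../../file.html" should give ../../file.php, while "../../../directory/file.html" should come back unchanged because it resolves outside the PHP tree. A driver loop could pull each href out of a file with grep -o, call this function, and run a sed s command for that one link only when the returned value differs from the original.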