This is a tip, not a question. It's incredibly hard to find how to do this with Google, since apparently Google's search engine can't tell...
Anyway, this bash script takes a URL as an argument, downloads it, extracts all of the hyperlinks from it, and then uses wget in spider mode to check whether each hyperlink is still good.
Very useful for checking "links" pages and RSS feeds.
The code is heavily commented.
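For example, assuming you save the script as checklinks.sh (any name works) and make it executable:

./checklinks.sh http://www.example.com/links.html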
Code:
#!/bin/bash
# get the basename of the URL
RSSFILE=${1##*/}
# make sure it doesn't already exist (delete it if it does)
rm -f "$RSSFILE"
# download the URL
wget "$1"
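# (by default wget saves the download under the URL's basename, which is
# the same value $RSSFILE holds, so the next step can read it from there)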
# Find the URLs in the downloaded file.
# In BEGIN, FS still has its default value (a single space), so RS=FS sets
# the record separator to a space: every whitespace-separated token in the
# file becomes its own record. The FS='"' assignment on the command line
# only takes effect after BEGIN, so each record is then split into fields
# on double quotes.
# For a record like href="http://example.com/" that means:
# $1 contains the part to the left of the first double quote, href=
# $2 contains the URL, and
# $3 contains whatever follows the second double quote (often nothing).
# The /^href/ pattern keeps only the records that start with href.
URL=$(awk 'BEGIN{RS=FS}/^href/{print $2}' FS='"' "$RSSFILE")
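# As an illustration (made-up input, not from a real page), this one-liner
# shows what the extraction does:
#   echo '<a href="http://example.com/">' | awk 'BEGIN{RS=FS}/^href/{print $2}' FS='"'
# prints:
#   http://example.com/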
# If href (HTML) turns up nothing, try again with url (used in RSS)
if [ -z "$URL" ]
then
URL=$(awk 'BEGIN{RS=FS}/^url/{print $2}' FS='"' "$RSSFILE")
fi
for LINE in $URL
do
# if a URL doesn't come back with "200" (OK), print wget's error output
wget -nv --spider "$LINE" 2>&1 | grep -v "200"
done
rm -f "$RSSFILE"
exit 0
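One thing to note: the grep -v "200" filter would also hide a bad link whose error line happens to contain the string 200. If you'd rather not depend on the text of wget's output, a variation is to test wget's exit status instead (wget --spider exits non-zero when a link can't be retrieved). Here's a sketch of a drop-in replacement for the for loop above; the BROKEN message is my own wording, not wget output:

for LINE in $URL
do
# wget exits non-zero if the spider check fails, so report the link
if ! wget -nv --spider "$LINE" > /dev/null 2>&1
then
echo "BROKEN: $LINE"
fi
done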