Below is an implementation of wget written in pure bash. The fetch-page function does what you want: it skips over the HTTP headers, leaving the rest of the page ready to be read from the file descriptor you pass it.
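To see the header-skipping idea in isolation, here is a minimal sketch that works on a canned response from a pipe rather than a socket. The name skip_headers and the sample response are made up for illustration and are not part of the script; treat it as a rough demonstration, not as the script's own method.

```bash
#!/bin/bash
# Hypothetical helper (not part of BashTrix wget): print the body of an
# HTTP response read from stdin, discarding the status line and headers.
skip_headers() {
    local line
    # Headers end at the first empty line; each line is terminated by \r\n,
    # so strip the trailing \r before testing for emptiness.
    while IFS= read -r line && [[ -n ${line%$'\r'} ]]; do
        :
    done
    # Whatever remains on stdin is the body (assumed to be text here).
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "$line"
    done
}

# Canned response for illustration only; no network access is needed.
printf 'HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\n\r\nhello\nworld\n' | skip_headers
```

The second loop is only good for text bodies; the script itself copies the body byte by byte so that binary files survive.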
Code:
#!/bin/bash
# Copyright 2008 GilbertAshley <amigo@ibiblio.org>
# BashTrix wget is a minimal implementation of wget
# written in pure BASH, with only a few options.
# The original idea and basic code for this are Copyright 2006 Ainsley Pereira.
# The idea for verify_url is from code which is Copyright 2007 Piete Sartain
# But the above code fragments both still used 'cat'.
# Copyright 2008 Noam Postavsky worked out how to
# get rid of 'cat' and provided other improvements
VERSION=0.2
# Minimum number of arguments needed by this program
MINARGS=1
show_usage() {
    echo "Usage: ${0##*/} [OPTIONS] URL"
    echo "${0##*/} [-hiOqV] URL"
    echo ""
    echo "  -i FILE  --input-file=FILE       read URLs from FILE"
    echo "  -O FILE  --output-document=FILE  concatenate output to FILE"
    echo "  -q       --quiet                 Turn off wget's output"
    echo "  -h       --help                  Show this help page"
    echo "  -V       --version               Show BashTrix wget version"
    echo
    exit
}
show_version() {
    echo "BashTrix: wget $VERSION"
    echo "BashTrix wget is a minimal implementation of wget"
    echo "written in pure BASH, with only a few options."
    exit
}
# show usage if '-h' or '--help' is the first argument or no argument is given
case $1 in
    ""|"-h"|"--help") show_usage ;;
    "-V"|"--version") show_version ;;
esac
# get the number of command-line arguments given
ARGC=${#}
# check to make sure enough arguments were given or exit
if [[ $ARGC -lt $MINARGS ]] ; then
    echo "Too few arguments given (Minimum:$MINARGS)"
    echo
    show_usage
fi
# process command-line arguments
# note: options must come before the URL, because the 'shift' calls
# below assume the current option is still the first positional parameter
for WORD in "$@" ; do
    case $WORD in
        -*) true ;
            case $WORD in
                --debug) [[ $DEBUG ]] && echo "Long Option"
                    DEBUG=1
                    shift ;;
                --input-file=*) [[ $DEBUG ]] && echo "Long FIELD Option using '='"
                    INPUT_FILE=${WORD:13}
                    shift ;;
                -i) [[ $DEBUG ]] && echo "Short split FIELD Option"
                    if [[ ${2:0:1} != "-" ]] ; then
                        INPUT_FILE=$2
                        shift 2
                    else
                        echo "Missing argument"
                        show_usage
                    fi ;;
                -i*) [[ $DEBUG ]] && echo "Short FIELD Option range -Bad syntax"
                    echo "Bad syntax. Did you mean this?:"
                    echo "-i ${WORD:2}"
                    show_usage
                    shift ;;
                --output-document=*) [[ $DEBUG ]] && echo "Long FIELD Option using '='"
                    DEST=${WORD:18}
                    shift ;;
                -O) [[ $DEBUG ]] && echo "Short split FIELD Option"
                    if [[ ${2:0:1} != "-" ]] ; then
                        DEST=$2
                        shift 2
                    else
                        echo "Missing argument"
                        show_usage
                    fi ;;
                -O*) [[ $DEBUG ]] && echo "Short FIELD Option range -Bad syntax"
                    echo "Bad syntax. Did you mean this?:"
                    echo "-O ${WORD:2}"
                    show_usage
                    shift ;;
                -q|--quiet) BE_QUIET=1
                    shift ;;
            esac
            ;;
    esac
done
# Starts reading from ${HOST}${URL}. Throws away HTTP headers so
# page contents can be read from file descriptor "$1"
fetch-page()
{
    local line
    # eval is necessary so that bash parses the expansion of $1<> as a single token
    eval "exec $1<>/dev/tcp/${HOST}/80"
    # send an HTTP/1.0 request with a Host header (needed by name-based
    # virtual hosts); printf is a builtin, so this is still pure bash
    printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$URL" "$HOST" >&"$1"
    # read and throw away HTTP headers; the end of the headers is
    # indicated by an empty line (all lines are terminated by \r\n)
    while IFS= read -r -u "$1" line && [[ -n ${line%$'\r'} ]]; do : ; done
}
# puts contents of ${HOST}${URL} into ${DEST}
get_it()
{
    # make sure $DEST starts empty
    : > "$DEST"
    # open two connections to the same page, on descriptors 3 and 4
    fetch-page 3
    fetch-page 4
    # clear IFS, otherwise the bytes in it would read as empty
    OLD_IFS="$IFS"
    IFS=
    # we read a single byte at a time from 3 with delimiter 'A',
    # and from 4 with delimiter 'B'.
    while read -r -n1 -dA -u3 A && read -r -n1 -dB -u4 B ; do
        # Now $A is the empty string if the true byte is 'A' or NULL, and
        # $B is the empty string if the true byte is 'B' or NULL.
        # Therefore if either $A or $B is not empty, it holds the true byte
        if [ -n "$B" ] ; then
            echo -n "$B" >> "$DEST"
        elif [ -n "$A" ] ; then
            echo -n "$A" >> "$DEST"
        else
            # both are empty so the true byte is NULL
            echo -en '\0' >> "$DEST"
        fi
    done
    # restore IFS
    IFS="$OLD_IFS"
    # close both connections
    exec 3<&- 4<&-
}
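# prints 1 if the server answers "200 OK" for ${HOST}${URL}, otherwise 0
verify_url() {
    exec 3<>"/dev/tcp/${HOST}/80"
    # HTTP/1.0 request with a Host header, sent with the printf builtin
    printf 'GET %s HTTP/1.0\r\nHost: %s\r\n\r\n' "$URL" "$HOST" >&3
    # only the status line is needed
    read -r -u3 i
    if [[ $i =~ "200 OK" ]]; then
        echo 1
    else
        echo 0
    fi
}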
strip_url() {
    # remove the http:// or ftp:// from the RAW_URL
    RAW_URL=$1
    if [[ ${RAW_URL:0:7} = "http://" ]] ; then
        URL=${RAW_URL:7}
    elif [[ ${RAW_URL:0:6} = "ftp://" ]] ; then
        URL=${RAW_URL:6}
    else
        URL=${RAW_URL}
    fi
}
show_error_404() {
    if ! [[ $BE_QUIET ]] ; then
        # URL already starts with '/', so no extra separator is needed
        echo "${HOST}${URL}:"
        echo "ERROR 404: Not Found."
    fi
}
if [[ $INPUT_FILE ]] ; then
    # remember any -O/--output-document value; without one, every URL
    # gets its own default file name instead of reusing the first one
    USER_DEST=$DEST
    # $(<file) reads the file without calling 'cat'
    for RAW_URL in $(<"$INPUT_FILE") ; do
        # remove the http:// or ftp:// from the RAW_URL
        strip_url "$RAW_URL"
        # the HOST is the base name of the website
        HOST=${URL%%/*}
        # the URL is the remaining path to the file (plus the leading '/')
        URL=/${URL#*/}
        # if --output-document is not being used, then the DEST is $(basename $URL)
        DEST=${USER_DEST:-${URL##*/}}
        # make sure the URL exists
        if [[ "$(verify_url)" = 1 ]] ; then
            [[ $DEBUG ]] && echo "${HOST}${URL} - found."
            get_it
        else
            show_error_404
        fi
    done
else
    # this is the same as above, but for a single URL
    RAW_URL=$1
    strip_url "$RAW_URL"
    HOST=${URL%%/*}
    URL=/${URL#*/}
    if [[ $DEST = "" ]] ; then
        DEST=${URL##*/}
    fi
    if [[ "$(verify_url)" = "1" ]] ; then
        get_it
    else
        show_error_404
    fi
fi
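One caveat about the byte-copy loop in get_it: it depends on read handing back an empty string for a NUL byte. I believe newer bash releases discard NUL bytes inside read instead, so that assumption is worth checking on your version. As an alternative, here is a sketch of a different technique (not the two-connection method used in the script): use NUL itself as the read delimiter and write it back after each chunk, which needs only one stream. The name copy_bytes is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper, not part of BashTrix wget: copy stdin to stdout
# byte for byte, including NUL bytes, using NUL as the read delimiter.
copy_bytes() {
    local chunk
    # Each successful read returns the bytes up to (not including) a NUL,
    # so put the NUL back when writing the chunk out.
    while IFS= read -r -d '' chunk; do
        printf '%s\0' "$chunk"
    done
    # read fails at end of input but still leaves the final,
    # unterminated chunk in the variable.
    printf '%s' "$chunk"
}
```

Assuming a descriptor opened the way fetch-page opens it, the call would look something like copy_bytes <&3 > "$DEST". It also avoids reopening the output file for every byte.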